Navigating differential structures in complex networks
Leonardo L. Portes∗
Complex Systems Group, Department of Mathematics and Statistics, University of Western Australia, Nedlands, Perth, WA 6009, Australia
Michael Small
Complex Systems Group, Department of Mathematics and Statistics, University of Western Australia, Nedlands, Perth, WA 6009, Australia and Mineral Resources, CSIRO, Kensington, Perth, WA 6151, Australia
(Dated: August 24, 2020)

Abstract

Structural changes in a network representation of a system, due to different experimental conditions or to its time evolution, can provide insight into its organization, its function, and how it responds to external perturbations. The deeper understanding of how gene networks cope with diseases and treatments is perhaps the most incisive demonstration of the gains obtained through this differential network analysis point of view, which led to an explosion of new numerical techniques in the last decade. However, where to focus one's attention, or how to navigate through the differential structures in the context of large networks, can be overwhelming even for a few experimental conditions. In this paper, we propose a theory and a methodological implementation for the characterization of shared "structural roles" of nodes simultaneously within and between networks, whose outcome is a highly interpretable map. The main features and accuracy are investigated with numerical benchmarks generated by a stochastic block model. Results show that the method can provide nuanced and interpretable information in scenarios with very different (i) community sizes and (ii) total numbers of communities, and (iii) even for a large number of 100 networks being compared (e.g., for 100 different experimental conditions). Then, we show evidence that the strength of the method is its "story-telling"-like characterization of the information encoded in a set of networks, which can be used to pinpoint unexpected differential structures, leading to further investigations and providing new insights.
We provide an illustrative, exploratory analysis of four gene co-expression networks from two cell types × two treatments (interferon-β stimulated or control). The method proposed here allowed us to elaborate a set of very specific hypotheses related to unique and subtle nuances of the structural differences between these networks, which were then tested and confirmed in the original dataset. Finally, the method is flexible enough to address different research-field-specific questions, by not restricting which scientifically meaningful characteristic (or relevant feature) of a node shall be used.

∗ [email protected]

I. INTRODUCTION

Modeling a complex system through a network representation, and examining the community structures that change (or, on the other hand, remain coherent) across different experimental conditions, can bring relevant information about the structure and function of the system's interacting parts. This differential network analysis is a very recent topic, which has been mainly developed by the bioinformatics community in the context of gene expression analysis (see [1] for a review). That approach has contributed to several discoveries related to how genes communicate to support the emergence of physiological responses of an organism in different disease states [2–5]. However, despite the existence of several numerical techniques to accomplish the task, where to focus one's attention, or how to navigate through the differential structures in the context of large networks, can be overwhelming even with few experimental conditions. In this paper we provide a method to unfold that information, allowing one to pinpoint how and which community structures change (or remain the same) across different conditions. Our approach is inspired by recent advances in the field of chaotic phase synchronization (PS), regarding its characterization and detection via multivariate singular spectrum analysis (M-SSA).
It has been shown that orthogonal rotations of the eigenvectors, obtained from concatenated trajectory matrices, provide a clear and almost automatic identification of the oscillatory modes that are shared by the coupled chaotic oscillators [6–9]. Specifically, the final outcome in that version of M-SSA is a set of rotated eigenvectors that clearly encode the shared oscillatory components of those oscillators, which can be used for further detection of PS and characterization of phase-synchronized clusters. Here, we make a parallel between shared oscillatory modes in PS analysis and the shared "structural role" of nodes in differential network analysis. In short, our method takes advantage of (i) a varimax rotation to simplify the structure of the eigenvectors obtained from (ii) the eigendecomposition of (iii) a concatenated adjacency matrix that represents the network in different conditions. For the sake of discussion, we will call it the concatenate-decompose-rotate (CDR) approach. In the new context of network analysis, we show that the outcome is a highly interpretable map of the nuances of the network in those different conditions. It is worth noting that some combination of those three steps (concatenate, decompose, rotate) has already been applied in other works and by different scientific communities [10, 11]. However, both their main goals and their outcomes are different from the CDR as articulated here. Indeed, we expect that the fundamental underlying ideas of the CDR could be applied over and above the other methods that have been developed within different scientific communities, or even be used as a bridge between them to nurture a deeper theoretical understanding (for example, as in the recently demonstrated equivalence of modularity maximization and the method of maximum likelihood [12]). Nevertheless, there has been some criticism regarding the methodological aspects of community detection in network science [13].
In particular, we share the pertinent view that a "method based on a mere hunch that something might work is inherently less trustworthy than one based on a provable result or fundamental mathematical insight" [14]. The CDR approach can be described in a very simple and direct way, just by showing that a varimax rotation of eigenvectors obtained from a "generic" spectral algorithm works in some specific scenarios. However, we start this paper, and devote a large part of it, to providing a simple theory for how and why the CDR works. Aiming at a broader audience from different fields, and trying to use the simplest and most transparent concepts possible, this is done from a factor analysis point of view illustrated by a minimalistic toy scenario. Both the theoretical aspects and the scenarios explored here do not represent a complete work, but an exploration of the feasibility of a CDR method to "navigate" through differential structures in complex networks.

This paper is organized as follows. The mathematical theory, and an illustrative motivational toy scenario, for differential network analysis with the CDR are presented in Sec. II. Then, we explore the method with larger networks and a larger number of conditions in Sec. III. Firstly, CDR is applied to synthetic networks with N = 2000 nodes in H = 100 different conditions, with a random number of communities (between 10 and 20) of random sizes (10 to 100 nodes) distributed across those conditions. The scope here is on networks of non-overlapping clusters (or communities), generated from the stochastic block model. The clusters can be present or absent in different conditions. The method's accuracy is investigated through Monte Carlo simulations, where we manipulated the inner probabilities (how strongly the nodes within a community are connected) and the outer probability (how strongly nodes from different communities, and the "noisy background", are connected).
Then, we illustrate the strength of the CDR's exploratory nature in revealing subtle differential structures of a gene expression dataset in four different conditions (two cell types × two treatments). Concluding remarks are made in Sec. IV.

II. METHODS
In this section, we introduce the simple theory and basic methodology from a factor analysis (FA) point of view. The material is introduced by blending a brief review of some FA results [15, 16] with the pursuit of a theory for differential network analysis and community structure characterization. This means that we will start by being as abstract as necessary, and that references to a "toy scenario" (Sec. II A) will be used as motivation and to illustrate the main aspects of the proposed framework. Some small conceptual differences from actual factor analysis will be discussed when necessary.
A. Motivation: toy scenario
Consider the scenario of an undirected and unweighted network with N = 200 nodes, in H = 4 different conditions with respect to its connections, as shown in Fig. 1. Assume that nodes neither disappear nor are created; only connections may change. For each condition h = 1, ..., H, the network is represented by its respective adjacency matrix A_h of size N × N, with elements a_ij = 1 if nodes i and j are connected, and a_ij = 0 otherwise. Hence, the set {A_h}_{h=1}^H = {A_1, A_2, A_3, A_4} of adjacency matrices represents the structure of the network across the H different conditions. There are R = 3 highly connected and non-overlapping communities of size N_r = 60 nodes each, as seen in Fig. 1, labeled as r = 1, 2, 3.

FIG. 1: The toy-scenario network in H = 4 different conditions. It has N = 200 nodes, which could be members of R = 3 communities. This scenario can be thought of as a system in H different experimental conditions, or an evolving network at H different time points.

The set {A_h}_{h=1}^H was generated by a stochastic block model [17] with the package NetworkX [18]. The connection probability between nodes within the same community (the inner probability) is p, and it is q otherwise (i.e., the outer probability). Setting these probabilities as 0 ≤ q < p ≤ 1 yields R highly connected communities, which can be visualized as blocks in Fig. 1. For the rest of the network structure, we considered a "ghost" community of size N − Σ_r N_r = 20 nodes, with both the within and between probabilities equal to q. Let the adjacency (block) matrices A^(1), A^(2) and A^(3) represent each of these communities. Accordingly, the scenario depicted here can be seen (conceptually) as R Erdős–Rényi [19] random networks G_{N_r,p} "planted" on a G_{N,q}. Because of this, we will often refer to G_{N,q} as the "random background".

By design, the toy scenario illustrates two aspects of the structural changes in the network. The first one refers to the presence or absence of a community in a given condition.
Specifically, communities r = 1 and r = 3 are present in all H conditions, while r = 2 is absent in conditions h = 2 and h = 4. In this aspect, the structure of the network is the same in (i) conditions 1 and 3, as well as in (ii) conditions 2 and 4, but (iii) different otherwise. That is the main structural change that we want to identify. However, the internal structure of conditions (and the background noise) may differ, as follows. The second aspect refers to how strong a community is, as compared to the connections of its members to the other nodes (i.e., nodes from the other communities, as well as nodes from the random background). This is done by using different combinations of the probabilities p and q. The values for conditions 1 and 2 are (p, q) = (1, 0.02). In conditions 3 and 4 the inner probability decreases to p = 0.6, while the background noise becomes stronger by a factor of 10 at condition 4, with q = 0.2. So, the mixing between communities and background noise is larger at condition 3, and much larger at condition 4. Later we will investigate how the mixing level interferes with the CDR method.
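As a sketch of how such a set {A_h} can be generated (assuming the NetworkX package cited above; this script is our illustration, not the authors' code, and the probability values merely follow the pattern described for the four conditions):

```python
import numpy as np
import networkx as nx

def toy_condition(p, q, present=(True, True, True), seed=None):
    """One condition of the toy scenario: three plantable 60-node communities
    plus a 20-node "ghost" block wired with the outer probability q only."""
    sizes = [60, 60, 60, 20]
    # Block probability matrix: p within a present community, q everywhere else.
    P = np.full((4, 4), q)
    for r, on in enumerate(present):
        if on:
            P[r, r] = p
    G = nx.stochastic_block_model(sizes, P.tolist(), seed=seed)
    return nx.to_numpy_array(G)

# Conditions 1-4: community r = 2 is absent at conditions 2 and 4.
A = [toy_condition(1.0, 0.02, (True, True, True), seed=0),
     toy_condition(1.0, 0.02, (True, False, True), seed=1),
     toy_condition(0.6, 0.02, (True, True, True), seed=2),
     toy_condition(0.6, 0.2,  (True, False, True), seed=3)]
```

Each element of A is a 200 × 200 symmetric binary adjacency matrix; with p = 1 the first community is a complete block, while an absent community is indistinguishable from the q-background.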
Remark. We call attention to one consequence of the SBM that could otherwise pass unnoticed. For example, the internal structure of the community r = 1 can be completely different between conditions 3 and 4. They just share the same inner probability p = 0.6, but the actual connections of their nodes are set at random by the model. The same occurs for the random background in conditions 1 and 3: the inner and outer probabilities are the same (p = q = 0.02), but the actual connections are again drawn at random.

B. Theoretical framework

The initial assumptions of the method are as follows. There exists an abstract property y_i, i = 1, ..., HN, for the N nodes in the H conditions. For example, this means that for a given node j ∈ [1, N], the properties y_{j+(h1−1)N} and y_{j+(h2−1)N} will refer to the same node j in two different conditions h1 ≠ h2 ∈ [1, H]. Conceptually, we will assume that a causal relationship exists between the set {y_i}_{i=1}^{HN} and the sets of unknown and abstract processes {ξ_i}_{i=1}^{v} and {e_i}_{i=1}^{HN} (i.e., the factors or latent variables). For now, we just assume that y, ξ and e can be represented as vectors in an abstract vector space, mainly because we will need the concept of inner products ⟨•|•⟩ to represent the extent to which they are (or are not) close. The set {e_i} contains the factors specific to each y_i (i.e., the unique factors). The set {ξ_i} contains factors that can influence any and several y_i (i.e., the common factors). We assume that the common and unique factors are independent, ⟨ξ_i|e_j⟩ = 0, and that the unique factors are orthogonal, ⟨e_i|e_j⟩ = 0 if i ≠ j (orthogonality of the common factors will not be assumed yet). By defining the column matrices Y = [y_1 ... y_HN]^⊤, X = [ξ_1 ... ξ_v]^⊤ and E = [e_1 ... e_HN]^⊤, one can write the linear model

Y = ΛX + ΨE,   (1)

known as the fundamental equation of factor analysis [16]. The matrices Λ (size HN × v) and Ψ (size HN × HN) provide the common and unique factor loadings (or weights), respectively. Matrix Ψ is diagonal (i.e., its off-diagonal elements are equal to zero), because the factors e_i are unique.
Without loss of generality, we assume the factors' norms are equal to one, because any scaling can be absorbed by the factor weight matrices Λ and Ψ. So, ⟨e_i|e_j⟩ = δ_{i,j} and ⟨ξ_i|ξ_i⟩ = 1 (δ is the Kronecker delta function).

There is a subtle conceptual difference between model (1) and an actual factor-analytic model. In the latter, y refers to measured data. But in this paper, we will consider y just as an abstract concept parameterised in a vector space, which will provide us with flexibility later. Accordingly, no assumptions will be made regarding the particular statistics of y (e.g., a random variable with zero mean and unit variance). As well, we will use the concept of a Gramian matrix, instead of the covariance or correlation matrices, for the matrices related to the inner products. Actual measured data will be inserted into the framework in the next section.

Finally, we assume that the goal of modeling a given phenomenon (with data from a set of observations) through network theory is to investigate the differential clustering of nodes: which structures remain the same, and which change, along the H different conditions. Now we explore two consequences of model (1) under those assumptions.
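The algebra above can be checked numerically. The sketch below (our illustration, not part of the original method) realizes the abstract vectors as rows in R^m, builds orthonormal common and unique factors via a QR decomposition, and verifies that, under the stated orthogonality assumptions, the Gram matrix YX^⊤ between the properties and the common factors recovers the loading matrix Λ exactly, the consequence explored next:

```python
import numpy as np

rng = np.random.default_rng(0)
HN, v, m = 12, 3, 50  # number of properties, common factors, ambient dimension

# Orthonormal rows: the first v are the common factors X, the next HN the unique factors E.
Q, _ = np.linalg.qr(rng.standard_normal((m, v + HN)))
X, E = Q[:, :v].T, Q[:, v:].T  # shapes (v, m) and (HN, m)

Lam = rng.standard_normal((HN, v))        # common factor loadings (Lambda)
Psi = np.diag(rng.uniform(0.1, 0.5, HN))  # diagonal unique loadings (Psi)

Y = Lam @ X + Psi @ E  # the fundamental equation of factor analysis, Eq. (1)

# With orthonormal common factors and <xi_i|e_j> = 0, Y X^T collapses to Lambda.
R_YX = Y @ X.T
assert np.allclose(R_YX, Lam)
```

The assertion holds because Y X^⊤ = Λ(XX^⊤) + Ψ(EX^⊤) = ΛI + 0.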
1. Feasibility for differential network analysis and clustering
The characterization of the clustering of nodes along conditions, due to the sharing of latent variables ξ_i, could be achieved by inspecting the structure of the matrix Λ. To see this, consider the product between y_i and ξ_j: R_YX := YX^⊤, with elements [R_YX]_{i,j} = ⟨ξ_j|y_i⟩. Because ⟨ξ_i|e_j⟩ = 0, we have

R_YX = [ ⟨ξ_1|y_1⟩   · · ·   ⟨ξ_v|y_1⟩
            ⋮         ⋱         ⋮
         ⟨ξ_1|y_HN⟩  · · ·  ⟨ξ_v|y_HN⟩ ]  ≡  ΛR_XX,   (2)

where R_XX := XX^⊤. Expression (2) can be simplified even further if we are allowed to assume orthogonality between the common factors, ⟨ξ_i|ξ_j⟩ = δ_{i,j}. Under that new assumption, to which henceforth we restrict the scope of this paper, (2) becomes

R_YX = Λ.   (3)

To gather insights into the implications of (2) and (3), we use (1) to frame the problem of differential network analysis and community characterization of the toy scenario shown in Fig. 1. In that context, and because we now know the real community structures across conditions (i.e., the ground truth), a reasonable hypothesis is that the "true" underlying community structure to be captured by (1) is given by the R known planted communities only, and not by the background noise from G_{N,q}. Then, by design, the underlying theoretical assumptions for the FA model (1) would be: (i) ξ_1 is the "cause" of the community structure of nodes 1 to 60 (i.e., r = 1) at all H = 4 conditions; (ii) ξ_2 for nodes 61 to 120, but only at conditions h = 1, 2; (iii) ξ_3 for nodes 121 to 180 at all H = 4 conditions. Note that the labels 1, 2 and 3 were used here for the sake of illustration (e.g., community r = 1 could be related to ξ_2, and so on). Because of that, expressions (2) and (3) tell us that these causal relationships should be reflected in the structure of the common factor loading matrix Λ as high loadings related to those three factors ξ for the properties y_i within these ranges. That is schematically shown in Fig. 2, where Λ_{•k} = [⟨ξ_k|y_1⟩ ... ⟨ξ_k|y_HN⟩]^⊤.

FIG. 2: Schematic picture of the intuitively expected structure of matrix Λ in the context of the toy scenario of Fig. 1. The leading three columns Λ_{•k} = [⟨ξ_k|y_1⟩ ... ⟨ξ_k|y_HN⟩]^⊤, with k = 1, 2, 3, are shown.

The main result of this paper is based upon finding (or extracting) that special structure from measured data, because it clearly reports on the structural changes of the network. In a sense, this provides us with a map, or a book composed of H "chapters" within the H segments of length N, that allows one to navigate through the history of the network's differential structure. Therefore, our aim now is to obtain that idealized structure after fitting observed data to a model based on (1); then, both (i) community structure characterization and (ii) differential network analysis would be straightforward.

2. Extracting the structure of Λ

Consider the projection of y_i onto itself, R_YY := YY^⊤ (with matrix elements [R_YY]_{i,j} = ⟨y_i|y_j⟩).
Because the common and unique factors are orthogonal, one can write R_YY = ΛR_XXΛ^⊤ + Ψ², an expression that is often known as the fundamental theorem of factor analysis [16]. Rearranging for Λ, and because we previously assumed orthonormality between the common factors, we have

ΛΛ^⊤ = R_YY − Ψ².   (4)

We will simplify (4) even further by assuming that the term Ψ² is negligible, so

ΛΛ^⊤ = R_YY.   (5)

That is an assumption very often used in FA. If there exists a way to estimate the contributions from the unique factors, one can go back and simply update the diagonal elements (because Ψ² is a diagonal matrix). Finally, writing the eigendecomposition R_YY = VΣV^⊤, we have

Λ = VΣ^{1/2}.   (6)

The eigenvectors v_k correspond to the columns of matrix V, with respective eigenvalues σ_k, k = 1, 2, ..., n, in the main diagonal of matrix Σ. We assume they are in decreasing order, σ_1 > σ_2 > ... > σ_n.

That is one of several procedures for extracting the structure of the factor loading matrix. It is sometimes called principal component factor analysis [15], or referred to as extraction through principal component analysis (PCA) [16]. However, that is the case when R_YY comes from measured data (equivalently, when y_i is a random vector). So, now we need to address the aforementioned conceptual difference between model (1) and an actual factor-analytic model: the y_i are concepts, not random variables. What does it mean that two nodes belong to the same community?
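As an illustration (our sketch, not the authors' code), the extraction step of Eqs. (5)–(6) amounts to an eigendecomposition of the Gram matrix, with the eigenvectors scaled by the square roots of their eigenvalues:

```python
import numpy as np

def extract_loadings(R_YY, n_factors):
    """Principal-component-style extraction: Lambda = V Sigma^{1/2}, Eq. (6)."""
    # eigh returns eigenvalues in increasing order; flip to decreasing order.
    sig, V = np.linalg.eigh(R_YY)
    sig, V = sig[::-1], V[:, ::-1]
    sig = np.clip(sig, 0.0, None)  # guard against tiny negative round-off
    return V[:, :n_factors] * np.sqrt(sig[:n_factors])

# Sanity check: a Gram matrix is reproduced by its full set of loadings (Eq. (5)).
rng = np.random.default_rng(1)
Y = rng.standard_normal((8, 5))
R = Y @ Y.T
L = extract_loadings(R, 8)
assert np.allclose(L @ L.T, R)
```

Keeping only the leading n_factors columns gives the truncated loadings used in the remainder of the paper.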
The answer can (and should) depend on the actual research question and the field-dependent characteristics that one aims at by grouping the nodes and asking for their differential network structure. Here, by letting the y_i be concepts and not actual data, we aim at that flexibility for the CDR framework to address different field-specific points of view. This can be put more clearly through the following example, which will also be used to establish the remaining procedures. Consider again the toy scenario of Fig. 1, and assume that the measured data is the set {A_h}_{h=1}^H. For the task at hand (differential network analysis) and the characteristics of the community structures (highly connected within blocks, low connectivity between them), one reasonable choice is to use the similarity of the lists of neighbours between nodes and conditions to capture the relevant question one wants to address through a differential network analysis. Given two nodes i and j, their lists of neighbours in conditions h1 and h2 are the columns [A_{h1}]_{•i} and [A_{h2}]_{•j} of the respective adjacency matrices. Let X = [A_1 A_2 ... A_H] be the N × HN matrix formed by horizontally concatenating the H adjacency matrices (see Fig. 3, top panel). Accordingly, one defines the estimate ˆR_YY for R_YY as

R_YY := YY^⊤ ˆ= ˆR_YY ≜ X^⊤X,   (7)

where the symbol ˆ= stands for "estimated from". Here, we use the symbol ≜ to emphasize that this definition depends on the field-specific characteristics that are pertinent to the question one wants to answer. Henceforth in this paper, we will make use of (7) to estimate the common factor loadings ˆΛ through the eigendecomposition of X^⊤X.

The leading five columns of the estimated loadings, ˆΛ_{•k} for k = 1, ..., 5, are shown in Fig. 3 (middle row). Henceforth we will call them simply "loadings". They contrast deeply with the desired "simple" structure previously shown in Fig. 2. Specifically, one sees a mixed signature of communities, which is more entangled in the leading two loadings. Actually, that is indeed the expected intermediate result from FA. The reason is that we have applied PCA to extract the loadings, and PCA, by itself, is the solution of the maximization problem max tr cov(X) under the restriction of orthonormality of the principal directions, ⟨v_i|v_j⟩ = δ_{i,j}. So, PCA maximizes the variance explained by the leading k components, and a large amount of the information becomes entangled in the leading eigenvectors v_i and, consequently, in the leading ˆΛ_{•k}.

The usual follow-up procedure in FA is based on Thurstone's concept of simple structure [20]: the rotation of the factor loadings by a given criterion that maximizes the simplicity of ˆΛ, thus (hopefully) enhancing the interpretability of that matrix.

FIG. 3: Results of the CDR on the toy scenario of Fig. 1. The concatenated matrix X is shown on top. The structure of matrix Λ is shown before (middle) and after (bottom) the varimax rotation. Before rotation, the leading loadings ˆΛ_{•k} contain mixed information regarding the R = 3 communities. After the varimax rotation (bottom row), the rotated loadings ˆΛ*_{•k} are unmixed and bring more detailed information about the communities and on how they change between the H = 4 conditions, a structure that clearly mirrors the idealised one of Fig. 2.

3. Varimax rotation and the simple structure of Λ

Kaiser's varimax [21] is considered the most efficient (orthogonal) rotation in FA [22], and is the most often applied. Let the elements of the factor loading matrix be ˆΛ = [λ_{d,k}].
The varimax rotation aims at finding the orthogonal rotation ˆΛ* = ˆΛT that satisfies the varimax criterion (VC)

VC(ˆΛ) = Σ_{k=1}^{S} [ (1/D) Σ_{d=1}^{D} λ_{dk}^4 − ( (1/D) Σ_{d=1}^{D} λ_{dk}^2 )² ].   (8)

Specifically, (8) is the raw varimax criterion. It represents the maximization of the variance across the columns of the squared factor loading matrix. The summation is over the first S factors ˆΛ_{•k}, k = 1, ..., S.

The result of that rotation is shown in Fig. 3 (bottom), with S = 20. Each leading ˆΛ*_{•k} now carries a unique fingerprint of (i) each community in (ii) each condition, similar to the expected idealized structure in Fig. 2. The main result of this paper is that this approach recovers the full "story" of the network, where each "chapter" is encoded in each of the H = 4 segments of length N. Then, it becomes straightforward to read: communities 1 and 3 were present in all H = 4 conditions, while community 2 disappeared in conditions h = 2 and h = 4 but was present in h = 3. Further information is provided by the different magnitudes of the pulse-like bumps in different segments of the same ˆΛ*_{•k}. For example, consider the loading associated with community r = 1. The first two bumps have the same magnitude, which is larger than the magnitude of the last two segments. This means that, besides the community r = 1 being present in all H = 4 conditions, something in its structure is more similar within conditions h ∈ {1, 2} and h ∈ {3, 4} than between them. By design, we know that this should be a consequence of the different inner probabilities: p = 1 for h ∈ {1, 2}, and p = 0.6 for h ∈ {3, 4}. As a remark, note that because of the stochastic block model applied, the structure of a given community is not identical in different conditions (when p < 1, i.e., communities are not complete blocks, the actual links are random realizations with probability p). Even so, the proposed approach shows a clear representation of the differential community structure.

There is another feature in Fig. 3 worth commenting on. A constant trend (vertical shift) in the leading loadings (both before and after rotation) is clearly seen when the larger outer probability q = 0.2 is used. This is expected, because the larger the q, the larger the mean value of each column of X (i.e., more connections imply more "ones" instead of "zeros"). For the rotated loadings, it is clearly visible in the fourth (last) segment of length N of the leading 3 rotated loadings ˆΛ*_{•k}. For our goal in this paper, that trend is irrelevant.

Remark. The analysis here could be conducted from the point of view of the eigenvectors v_k (the columns of V). In that case, the correct way to obtain the varimax-rotated v*_k is to use the scaled σ_k^{1/2} v_k to find the rotation T, and then apply the rotation to the original (not scaled) eigenvectors, v*_k = v_k T. The pitfall here is that, because the scaled eigenvectors correspond to the common factor loadings, one could assume incorrectly that the rotated eigenvectors could be obtained simply by rescaling the already-rotated factor loadings. However, this process of scaling, rotating and rescaling yields an oblique rotation.

C. Statistical analysis
Numerically generated scenarios with much larger R, N and H will be used in the next section to investigate the application of the proposed CDR framework. For some of them, we will make use of large Monte Carlo runs, using the outer probability q as the mixing parameter. While the emphasis in this paper is on the interpretability provided by the visual inspection of the rotated ˆΛ*_{•k} as an exploratory tool (as in Fig. 3, bottom), those larger scenarios bring the need for auxiliary tools to validate the method and to automate some steps.

Firstly, in order to quantify the accuracy of the rotated ˆΛ*_{•k} in correctly identifying the communities r that share the same condition h, we use the score provided by the area under the receiver operating characteristic curve (AUC-ROC). The ROC curve is a common tool in machine learning, used to compare the performance of binary classifiers. It is a graphical representation where the true positive rate (also called sensitivity or recall) is plotted against the false positive rate for different thresholds used in the decision function. The score can take values between 0.5 and 1: a random classifier will have an AUC-ROC equal to 0.5, whereas a perfect classifier will have an AUC-ROC equal to 1.

Let ˆΛ*_{•k,h} denote each segment of length N of the k-th column of the rotated factor loading matrix, with elements ˆΛ*_{•k,h}(i), i = 1, 2, ..., N. Those segments are binary classifiers, which can be used to cluster the nodes into one of two groups given their respective weights ˆΛ*_{•k,h}(i) (by referring to these clusters as "groups" we aim to avoid any confusion with the planted communities r). For instance, consider a segment of a leading loading in Fig. 3 (bottom). Its weights clearly form two clusters: one with values near the "noise floor", and the other with values much larger than the noise floor variance. Figure 4 shows the distribution of weights for all segments in Fig. 3 (bottom).
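As a minimal sketch of the scoring step (our illustration on hypothetical data), the AUC-ROC of one length-N segment against a known membership vector can be computed with the rank-sum (Mann–Whitney U) identity, which for tie-free continuous scores is equivalent to standard library implementations such as scikit-learn's roc_auc_score:

```python
import numpy as np

def auc_roc(y_true, y_score):
    """AUC via the rank-sum (Mann-Whitney U) identity, valid for tie-free scores."""
    order = np.argsort(y_score)
    ranks = np.empty(len(y_score))
    ranks[order] = np.arange(1, len(y_score) + 1)  # rank 1 = smallest score
    n_pos = int(np.sum(y_true))
    n_neg = len(y_true) - n_pos
    u = ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)

rng = np.random.default_rng(2)
N = 200
s_true = np.zeros(N)
s_true[:60] = 1                      # hypothetical planted community (nodes 1-60)
segment = rng.normal(0.0, 0.01, N)   # noise-floor weights
segment[:60] += 0.1                  # community weights well above the floor
score = auc_roc(s_true, segment)     # close to 1 for a well-separated community
```

A flat segment of pure noise would instead score near 0.5, the random-classifier baseline.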
Let us denote those two groups by the labels 0 and 1. We define the index vector s^(k,h) of length N, with elements

s_i^(k,h) = 0, if node i belongs to group 0;
            1, otherwise.   (9)

On the other hand, note that a "flat" segment, such as the one that is the fingerprint of the absence of r = 2 in condition h = 2, will generate an s^(3,2) with all its elements equal to 0 (i.e., all nodes are associated with the noise floor, and so with group 0). The differential structure in Fig. 3 is so clearly discernible that the clustering could be done by visual inspection. However, for large numbers of conditions and communities, this would be prohibitive. An alternative solution is to apply a clustering algorithm of choice. Hence, secondly, in this paper we opt to use density-based spatial clustering of applications with noise (DBSCAN) [23, 24], with the parameters number of neighbours and distance equal to 5 and 1/100 of the amplitude of ˆΛ*_{•k,h}, respectively. By design, in the numerical experiments of the next section the ground truth is known, and hence it is used to construct the index vectors s^r_true for each planted community r = 1, 2, ..., R. Then, we compute the AUC-ROC score between a single s^(k,h) and all R vectors s^r_true. We pick the largest value and the value of r for which it occurred, and use them to represent the method's score in identifying the community with label r through the ˆΛ*_{•k,h} segment. Note that the knowledge about the condition h, where that community r was found, is already provided by the h index of the current ˆΛ*_{•k,h}.

Finally, the overall accuracy will be quantified by the fraction of reliably detected communities. Specifically, we count the number of detected communities with an AUC-ROC score above a given threshold and divide that value by the real (known) number of communities. The value of this threshold is 0.8.

FIG. 4: Estimated Gaussian kernel densities for each segment h of the leading 5 rotated factor loadings ˆΛ*_{•k}, corresponding to Fig. 3 (bottom). Darker shades refer to larger values of k.

III. RESULTS
The motivation in this section is two-fold. First, numerical experiments are conducted to test the accuracy of the CDR method in unveiling the history of structural changes in more challenging scenarios. This is done for a larger network of 2000 nodes in H = 100 conditions that could have between 10 and 20 communities with very different sizes. Whilst that is a much more complex scenario than the previous toy one, it is only a small representation of the myriad ways in which the network structure could change in real-world systems. For instance, in this paper we focus on communities that can only be present or absent. In reality, there can be superpositions, growing and shrinking, combinations of those processes, etc. Notwithstanding, as the motivation of this paper is to provide a first step for the CDR method, we do our best to explore this constrained scenario in depth.

The second motivation is to explore the potential use of the navigation map provided by the CDR approach in the context of gene co-expression data. The dataset used consists of expression data of 1458 genes, from 2 different cell types and in 2 different experimental conditions. This application is our principal motivation for the CDR. In our specific setting, this means a network with 1458 nodes in H = 4 conditions. Whilst the adjacency matrices in that context are weighted rather than binary, and there can be both positive and negative connections, we show that the CDR can find unique and subtle nuances of the structural differences. As a remark, this is done here for illustrative purposes, and not to unveil any biologically meaningful result for the specific dataset we used.

A. Synthetic networks
We start by mimicking the scenario of a large network with N = 2000 nodes in H = 100 conditions. In a similar fashion as in the toy scenario of Sec. 1, a set of 100 adjacency matrices {A_h}, h = 1, ..., 100, was generated using the SBM. However, the difference is that the number of planted communities in each condition, the community sizes and their specific labels were chosen from random uniform distributions. Figure 5 (top) shows the blocks of the concatenated matrix X = [A_1 A_2 ... A_100] associated with the first 7 and last 2 adjacency matrices. The inner and outer probabilities p and q are fixed across all conditions. The generation procedure is:

1. An array containing 20 tuples (r, N_r) was generated, with N_r ∼ U(10, 100), associating each community label r = 1, ..., 20 with its respective community size.

2. An array containing 100 tuples (h, R_h), with R_h ∼ U(10, 20), giving the number of communities R_h to be planted in a condition h.

3. Fixing h, the adjacency matrix A_h was generated with R_h communities as specified by step (2), but with community labels randomly sampled (with equal probability) from the array generated in step (1).

4. Step (3) was repeated for h = 1, ..., 100.

Note that a given community r, appearing in different conditions h, will have different specific internal (and external) connections; only its inner and outer probabilities are the same across conditions. In summary, each condition h = 1, ..., H can have 10 to 20 planted communities, each of them possibly with a different size. A given community can appear in several conditions, but its internal and external connectivity will (very likely) differ between them.

FIG. 5: Concatenated matrix X (top) for the large network (N = 2000) with R = 20 communities randomly distributed along H = 100 conditions. Community sizes 10 ≤ N_r ≤ 100 were sampled from a uniform distribution. Before rotation (middle), the loadings Λ̂_•k show mixed signatures of the communities. After rotation (bottom), each community in each "shared" condition is clearly represented within the same Λ̂*_•k.

The leading 20 loadings before and after rotation are shown in Fig. 5 (middle and bottom, respectively). The varimax rotation was performed with S = 2R = 40 vectors, assuming that twice the maximum allowed number of communities (in any condition) will provide a sufficient degree of freedom for the algorithm to achieve the desired simple structure. The expected mixing before rotation, and the clear representation of the differential network structure after rotation, can be seen. It is worth mentioning that the fingerprints of the small communities in the last loadings Λ̂*_•k (e.g., for k = 20) would not be visually discernible from the random "noise floor" in a real application: here, by design, successive node indexes were associated with the communities. In contrast, with real-world data, the indexes of nodes from the same community will (very likely) be spread along the integer segment [1, N]. Now we assess the fidelity of this representation provided by the rotated Λ̂*_•k.
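The fidelity assessment just mentioned (each rotated loading segment compared, via ROC-AUC, against the ground-truth indicator vector of every planted community, keeping the best match and counting detections above the 0.8 threshold) can be sketched in plain Python. This is an illustrative sketch only: the function names and the rank-based AUC implementation are our own choices, not part of the CDR implementation, and it assumes every indicator vector contains both member and non-member nodes.

```python
def roc_auc(scores, labels):
    """Rank-based ROC-AUC of `scores` against binary `labels` (1 = member).
    Assumes both classes are present in `labels`."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Average the (1-based) ranks over runs of tied scores.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, l in zip(ranks, labels) if l == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)

def best_match(segment, true_indicators, threshold=0.8):
    """Match one loading segment to the planted community whose ground-truth
    indicator it best separates; community index is None below threshold."""
    scores = [abs(x) for x in segment]  # only magnitudes matter for membership
    aucs = [roc_auc(scores, s_true) for s_true in true_indicators]
    best_r = max(range(len(aucs)), key=lambda r: aucs[r])
    if aucs[best_r] >= threshold:
        return best_r, aucs[best_r]
    return None, aucs[best_r]
```

The overall accuracy then follows by applying `best_match` to every segment and dividing the number of successful matches by the known number of planted communities.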
The AUC-ROC scores are shown in Fig. 6(A) (for the sake of clarity, only values above 0.6 are shown). Comparing with the community sizes N_r, given by plotting the array with 20 tuples (r, N_r) in Fig. 6(C), we see that the small communities, with size near 10, are more likely to be missed.

Figure 7 shows the distribution of ROC-AUC scores relative to the community sizes. Two features can be seen, which confirm the previous discussion. Firstly, communities with more than 60 nodes were detected with a score above 0.98, and they form the vast majority of identifications (see Fig. 7(B)). Secondly, the smaller communities can be harder to detect, and this is most prominent for communities with fewer than 20 nodes.

The previous results depend on the specific realization that generates the 20 tuples (r, N_r), Fig. 6(C), as well as on the other random features. Furthermore, we would like to explore how the mixing (the relative magnitudes of p and q) influences those results. So, we now employ a Monte Carlo strategy, for which the previous scenario can be considered one specific realization. The steps are:

a. We fix the inner and outer probabilities (p, q).

b. The concatenated matrix X = [A_1 A_2 ... A_H] is generated by using steps (1-4).

c. The fraction of communities detected and planted is calculated as before.

d. Steps (a-c) are repeated N_MC = 20 times.

FIG. 6: Communities planted and detected for the large network in H = 100 conditions (see Fig. 5). (A) ROC-AUC scores of the communities detected by the rotated loadings Λ̂*_•k. For clarity, only scores above 0.6 are shown. We considered a detection successful if the ROC-AUC score is above 0.8. (B) The ground truth: communities planted in each condition h. Colors indicate whether they were successfully detected (black) or not (red). (C) The community sizes N_r ∼ U(10, 100).

This was done for 20 increasing values of the outer probability q and three fixed values of the inner probability p. Figure 8(A) shows the resulting detection curves (mean ± standard deviation) for H = 100 conditions. It can be seen that the lower the mixing (i.e., the more p dominates q), the more slowly the detection curve decays and the larger the q value at which the decay starts. As mentioned before, if a given small community appears in several conditions, its "signal" would be stronger. That would counterbalance its small size, allowing it to be detected. Because of that, we repeat the experiment for a much lower number of conditions, H = 10. In this scenario, it is much less likely that any of the 20 possible communities will appear several times. As a consequence, Fig. 8(B), the detection curves start decaying at a lower value of q as compared to the scenario with H = 100. Regarding the speed of decay (slope), it is very similar to the previous case except for a small segment of the p = 1 curve.

B. Genetic data
Here we give an illustrative example of how the CDR could be used to explore the differential structure of real-world data, in the context of differential gene co-expression network analysis. This is done from an exploratory data analysis point of view, showing evidence of the extent to which the CDR approach can recover subtle and nuanced information that is otherwise spread across the original datasets.

We use a tutorial dataset for single-cell analysis [25]. It consists of gene expression data of peripheral blood mononuclear cells (PBMCs) divided into two groups based on treatment: one with and the other without interferon-β (IFNB) stimulation. Both the dataset and a tutorial (with R code) for its analysis can be found at https://satijalab.org/seurat/v3.1/immune_alignment.html, and within the Seurat R package [26]. The tutorial (and R package) contains the log-normalized expression data of 2000 genes for 12 different cell types. To illustrate the CDR, we arbitrarily selected two cell types: CD4 Naive T cells (henceforth referred to as T-cells) and CD14-mono cells, both labelled as STIM (IFNB stimulation) or CTRL (control) according to their respective treatment group.

FIG. 8: Fraction of detected communities with ROC-AUC score above 0.8. The curves show the mean (± standard deviation) over 20 Monte Carlo runs for an increasing outer probability q up to 0.9 and three fixed values of the inner probability p. The number of conditions is (A) H = 100 (see Figs. 5-6) and (B) H = 10, both for a network of size N = 2000. A larger H increases the chance of a given community appearing in multiple conditions, which increases its chance of being detected. This causes the slower decay of the curves in (A) as compared to (B).

Before proceeding with the CDR, we build H = 4 (co-expression) networks, which conceptually represent what we called conditions in the CDR method. Let the matrices E_h, with h = 1, ..., H, be the expression data with genes in rows and samples in columns. We will refer to these conditions as Tc_ctrl, Tc_stim, cMono_ctrl and cMono_stim.

Before continuing, three remarks are worth noting. Firstly, we emphasize that these two cell types were selected arbitrarily and exclusively to illustrate the use of the CDR in an exploratory analysis that highlights similar and dissimilar network structures across different conditions: in this section, the focus is not on searching for any biologically meaningful result from that analysis. Secondly, in a real-world exploratory analysis, the selection of cell types and experimental groups should be guided by the specific research questions that the subject-specialist researcher aims at. For example, the selection above could be justified if the interest is to search for the 2 × 2 (cell types × treatments) structural (dis)similarities between the co-expression networks of Tc and CD14-mono cells (i.e., between themselves and between treatments). So, specific research questions are the first thing to consider before selecting the datasets used to build the concatenated matrix X. Finally, the estimation of a gene co-expression adjacency from expression data is a very active research topic per se [27, 28]. It is not our aim to discuss, or provide any guidance on, which method would provide more reliable results.
The application of the CDR and the illustrative goal of this section are independent of possible direct biological inference. However, it is to be expected that meaningful biological results will depend on how accurately the estimated adjacency matrices represent the underlying reality. Therefore, this aspect is pertinent in a real-world application.

From the selected data, H = 4 weighted adjacency networks were built as follows. One of the most often used methods in gene network analysis is weighted gene correlation network analysis (WGCNA) [29], which can be used for finding clusters (modules) of highly correlated genes. Its initial step (after standard preprocessing, already done in the case of the Seurat dataset) is to build a weighted network from the expression data through a co-expression similarity matrix S = cor(E) (the correlation between the expression of genes across samples). Then, the weighted network adjacency is defined as W = S^β. The parameter β is chosen so as to provide W with a scale-free topology. The motivation is that real-world biological networks often show a scale-free topology, so an appropriate selection of β would provide a more accurate representation of reality. Given that definition through correlations, a standard step is to remove genes with near-zero variance across samples. From the 2000 genes in the Seurat data, we removed the ones with variance below a small threshold, yielding 1473 genes. From these, we selected the 1458 genes with the highest variance (corresponding to the 0.1 quantile). After that, each of the four expression data matrices E_h has 1458 rows (genes), with the number of columns (samples) depending on the condition.

Because we want to focus on the information that the application of the CDR method alone can provide, we will not make use of the enhanced representation of the adjacency matrix provided by WGCNA. We will set β = 1, or equivalently use the similarity matrix S itself.
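The WGCNA-style construction just described can be sketched with NumPy. This is a minimal sketch under our simplifying choice β = 1 (signed correlations kept; unsigned WGCNA variants would use |S|^β instead); the function and variable names are illustrative, and the random matrices merely stand in for the real expression data E_h.

```python
import numpy as np

def coexpression_adjacency(E, beta=1):
    """WGCNA-style weighted adjacency W = S**beta, with S = cor(E) computed
    across samples (genes in rows, samples in columns). With beta = 1 this
    is simply the signed gene-gene correlation matrix S itself."""
    S = np.corrcoef(E)  # shape: (genes, genes)
    return S ** beta

def concatenate_conditions(E_list, beta=1):
    """Stack the H adjacency matrices side by side: X = [W_1 W_2 ... W_H]."""
    return np.hstack([coexpression_adjacency(E, beta) for E in E_list])

# Toy example: 4 conditions, 50 genes, differing sample counts per condition.
rng = np.random.default_rng(0)
E_list = [rng.normal(size=(50, n)) for n in (30, 40, 35, 45)]
X = concatenate_conditions(E_list)  # shape (50, 4 * 50) = (50, 200)
```

In the paper's setting the same steps yield four 1458 × 1458 matrices W_h and a concatenated input X of size 1458 × 5832.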
Specifically, the four weighted network adjacency matrices are W_h = cor(E_h), with h = 1, ..., 4 (corresponding to Tc_ctrl, Tc_stim, cMono_ctrl and cMono_stim, respectively). Note that each of these four matrices has size 1458 × 1458. Entries of small magnitude were thresholded to zero, allowing more efficient use of algorithms for the decomposition of sparse matrices. Finally, the input for the CDR is the matrix X = [W_1 W_2 W_3 W_4] of size 1458 × 5832, the horizontal concatenation of the four W_h. The number of varimax-rotated vectors was 60. We experimented with different values (e.g., 40 and 100), with no impact on the following results.

The leading 8 rotated factor loading vectors Λ̂*_•k are shown in Fig. 9(A). We will sometimes refer to them simply by their indexes, k. A simple visual inspection of the magnitudes across conditions (i.e., the four segments of length 1458) allows one to interpret that for

1. k = 1, 2 and 5, segments 3 and 4 reflect a structure peculiar to cMono cells, which is stronger (in "a sense"; more on this later) in the control group than in the stimulated one.

2. k = 7, those segments 3 and 4 again reflect a structure peculiar to cMono cells, but one which is now equivalent in both stimulated and control groups.

3. k = 3, 4, 6 and 8, the first segment reflects a structure related only to Tc cells in the control group.

Therefore, we will use the list of statements above as a metaphorical "navigation map" to guide the creation of some hypotheses, which can be tested by using the original datasets E_h. But before that, we will explore some complementary information provided by the scatter plots between the leading k = 1, ..., 10 vectors, Fig. 9(B).

FIG. 9: Applying the CDR framework to gene co-expression networks. (A) Cell- and/or group-specific nuances are clearly suggested by the rotated loading vectors Λ̂*_•k. (B) Scatter plots Λ̂*_•k versus Λ̂*_•k′ (lower panel), with k = 1, ..., 10 and k′ ≠ k, show the enhanced interpretability (i.e., simple structure) provided by the varimax rotation as compared to the unrotated Λ̂_•k (top). (C) Correlations in the original expression data for the selected genes in the set L_1 show that they form a highly connected community specific to Tc cells in the control group. (D) As before, but for the genes from the set L_2. These genes are part of a more complex structure of highly connected nodes specific to cMono cells, "stronger" in the control group than in the stimulated one.

Note that, because consecutive nodes are not within the same (differential) community structure, the effects of the rotation cannot be seen by plotting the factor loading vectors before and after rotation, as done in the case of the synthetic network shown in Fig. 5. This makes the scatter plots a suitable alternative. Firstly, consider the effects of the varimax rotation in disentangling the information between the vectors. For instance, the rotation of a cross-shaped structure can be seen for (k_i, k_j) = (2, 3) (second row, third column of Fig. 9(B), top and bottom panels), meaning that the factor loading vectors Λ̂*_•2 and Λ̂*_•3 become uncorrelated after rotation; because of that, they should reflect cell-group-specific and distinct structures encoded in X, a feature that agrees with statements (1) and (3) above. Another effect is seen in the reduced number of circular clouds of points, which after rotation are replaced by linear structures parallel to the horizontal and/or vertical axes. Secondly, some scatter plots remain circular clouds even after rotation. For instance, the features captured by vectors k = 1 and 2, which are specific to cMono cells, are correlated; this suggests that those vectors could tell different nuances of the same differential structure that sets those cells apart from the Tc cells. So, the inspection of the scatter plots provides information complementary to the interpretation of the factor loading vectors.

Taken together, the information above provides guidance for the creation and exploration of a vast number of hypotheses. Next, we will develop and explore some possibilities, which we believe provide the most complete illustrative example of the power of an exploratory analysis through the CDR. Note that, in practice, this should be done side-by-side with an expert in the field related to the dataset.

Consider the k = 6 first segment, which should refer to a structure specific to Tc cells in the control group. Let L_1 be the set of the six genes associated with the nodes with the largest (absolute) factor loadings. That selection is highlighted with a rectangular (orange) box in Fig. 9(A), and the genes are L_1 = { ADAMDEC1, CCNA1, DEFB1, IL27, LILRA3, NRM }.
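Reading such a gene set off a loading vector amounts to ranking the genes within one condition's segment by absolute loading. A minimal sketch (with hypothetical names; the actual set was selected visually from Fig. 9(A)):

```python
def top_genes(loading_vector, gene_names, condition, n_genes, n_top=6):
    """Return the `n_top` genes with the largest absolute loadings inside the
    segment of `loading_vector` belonging to `condition` (0-based index).
    `n_genes` is the number of genes per condition segment."""
    start = condition * n_genes
    segment = loading_vector[start:start + n_genes]
    # Rank gene positions within the segment by loading magnitude.
    ranked = sorted(range(n_genes), key=lambda i: abs(segment[i]), reverse=True)
    return [gene_names[i] for i in ranked[:n_top]]
```

In the paper's setting, `loading_vector` would be one rotated Λ̂*_•k of length 4 × 1458, `n_genes = 1458`, and `condition = 0` for the k = 6 first segment.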
A larger marker size was used for all occurrences of those six genes, which allows one to verify that they appear (collectively) with a large loading only in the k = 6 first segment. However, the genes CCNA1 and IL27 also appear with a large loading in the k = 1 segment 3, a condition that should reflect a structure specific to cMono cells in the control group. Based on what we have discussed in the example of the synthetic data (Sec. III A), those observations suggest the following inter-connected hypotheses:

I. The L_1 genes form a highly connected community, which is specific to the Tc cells in the control group.

II. That community is present solely for the Tc cells in the control group.

III. There is another structure L_2 (see the purple rectangular selection in Fig. 9(A)), specific to cMono cells in the control group, of which genes CCNA1 and IL27 are members.

IV. The structural role of genes CCNA1 and IL27 within community L_1 is different from their structural roles in L_2.

Remark. The set L_2 contains eleven genes: IFIT3, IFIT1, RSAD2, CTSC, IL27, HSPA1A, HSPB1, HRASLS2, SCIN, CCNA1, MASTL.

Those hypotheses can be tested by estimating the strength of co-expression (correlations) of the gene sets L_1 and L_2 within each original expression dataset E_h. Figures 9(C)-(D) show the correlations for L_1 and L_2, respectively, which strongly agree with the hypotheses above. Specifically, the following features can be seen. (i) The six L_1 genes have a highly correlated co-expression solely for the Tc cells in the control group, hence forming a differential community. (ii) There is a slightly higher correlation between genes CCNA1 and IL27 for the cMono cells in the control group, Fig. 9(C), showing that these two genes can have a different structural role in another community (related specifically to cMono cells in the control group), which we will track now. To allow better visualization of the community structure, the four heat-maps in Fig. 9(D) were plotted after rearranging their columns and rows based on the hierarchical clustering dendrogram (HCD) of the cMono control group. In the third heat-map of Fig. 9(D), (iii) the eleven genes in L_2 show a somewhat more complex community structure (i.e., 3 or 4 small communities, without a crisp frontier, forming a larger community). As suggested by the ordering provided by the HCD, the co-expression between genes CCNA1 and IL27 forms a link between two different sub-communities (one of them formed by the last 4 genes).

Nevertheless, the last two heat-maps in Fig. 9(D) can be used to test another hypothesis. Previously, we mentioned that for k = 1 the segments 3 and 4 should reflect not only a structure more peculiar to the cMono cells but also that this structure should be stronger (in a sense) in the control group than in the stimulated one. Figure 9(D) shows that this is indeed the case, and that the structure is stronger in two senses: (a) there are links between the sub-communities in the control group, but they are broken in the stimulated group (for which the heat-map suggests the existence of two sub-communities, one with genes IFIT3, IFIT1, RSAD2, and the other with genes HSPA1A, HSPB1); (b) the third sub-community in the control group (with genes HRASLS2, SCIN, CCNA1, MASTL) disappears in the stimulated group.

Taken together, those results support the conclusion that the "story-telling"-like information provided by the CDR approach can guide one to recover subtle and nuanced information that is otherwise spread across the original datasets.

IV. CONCLUSION
In this paper, we have proposed a theory and method (CDR) for the characterization of shared "structural roles" of nodes simultaneously within and between networks, whose outcome is a highly interpretable map. Without loss of generality, for the sake of presentation we have assumed that each network represents a different experimental condition. They could equivalently be representations of a time-evolving network (i.e., at different time points), layers in a multiplex network, etc.

Supported by a simple and transparent theory, rooted in the factor analysis framework, the method provides the flexibility to address research-field-specific questions. This is accomplished by defining the scientifically meaningful characteristic (or relevant feature) of a node for the problem at hand, and then mapping it to an appropriate mathematical similarity construct to estimate the proximity from measured data. In the context of differential network analysis, with communities of highly connected nodes, the method was illustrated in this paper by taking as the relevant feature the similarity of the lists of neighbours between nodes, captured by the notion of proximity through the inner (vector) product.

The insights provided by the method and its accuracy have been explored in numerical benchmarks generated by a stochastic block model. In the scope of non-overlapping communities (which could be present or absent in a given condition), the results have shown the method's high accuracy despite very different (i) community sizes, (ii) total numbers of communities within a given condition and (iii) numbers of networks being compared (i.e., experimental conditions).

In particular, we have argued that the strength of the method is its "story-telling"-like characterization of the differential network structure encoded in a set of networks. This was illustrated with an exploratory analysis of gene expression datasets, a context called differential co-expression network analysis.
In particular, applying the CDR to the co-expression networks from two cell types × two treatments allowed us to elaborate hypotheses related to unique and subtle nuances of the structural differences between the networks. Those hypotheses were then tested and confirmed on the original dataset.

Our vision is of a virtuous cycle, where the CDR is used as a hypothesis generator by pinpointing unexpected differential structures within the data, leading to further investigations and providing new insights.

The analysis of genetic data was used, as well, to illustrate how the fundamental underlying ideas of the CDR could be applied over and above the other methods that have been developed within different scientific communities. For this, it was applied on top of (a simplified version of) weighted gene correlation network analysis. Notwithstanding, due to (i) the flexibility of the CDR and (ii) the several ways in which the theoretical basis of factor analysis allows one to extract the factor loadings (here we have chosen PCA), those ideas could even be used as a bridge between the several methods of network analysis (from different communities) to nurture a deeper theoretical understanding. These and other possibilities we leave for future work.

ACKNOWLEDGMENTS
LLP and MS acknowledge financial support from the Australian Research Council Discovery Grant (DP180100718). This work was partly supported by the Pawsey Supercomputing Centre with funding from the Australian Government and the Government of Western Australia.
[1] H. A. Chowdhury, D. K. Bhattacharyya, and J. K. Kalita, IEEE/ACM Transactions on Computational Biology and Bioinformatics, 1 (2019).
[2] A. R. Sonawane, S. T. Weiss, K. Glass, and A. Sharma, Frontiers in Genetics, 1 (2019), arXiv:1903.05449.
[3] S. van Dam, U. Võsa, A. van der Graaf, L. Franke, and J. P. de Magalhães, Briefings in Bioinformatics, bbw139 (2017).
[4] R. Anglani, T. M. Creanza, V. C. Liuzzi, A. Piepoli, A. Panza, A. Andriulli, and N. Ancona, PLoS ONE, e87075 (2014).
[5] D. Amar, H. Safer, and R. Shamir, PLoS Computational Biology, e1002955 (2013).
[6] L. L. Portes and M. Small, Physical Review E, 042218 (2019).
[7] L. L. Portes and L. A. Aguirre, Physical Review E, 052216 (2016).
[8] L. L. Portes and L. A. Aguirre, Chaos: An Interdisciplinary Journal of Nonlinear Science, 093112 (2016).
[9] A. Groth and M. Ghil, Physical Review E, 036206 (2011).
[10] X. Xiao, A. Moreno-Moral, M. Rotival, L. Bottolo, and E. Petretto, PLoS Genetics, e1004006 (2014).
[11] M. Vejmelka and M. Paluš, Chaos: An Interdisciplinary Journal of Nonlinear Science, 033103 (2010).
[12] M. E. J. Newman, Physical Review E, 052315 (2016).
[13] S. Fortunato and D. Hric, Physics Reports, 1 (2016), arXiv:1608.00163.
[14] B. Ball, B. Karrer, and M. E. J. Newman, Physical Review E, 036103 (2011), arXiv:1104.3590.
[15] I. Koch, Analysis of Multivariate and High-Dimensional Data, Cambridge Series in Statistical and Probabilistic Mathematics (Cambridge University Press, 2013).
[16] S. A. Mulaik, The Foundations of Factor Analysis, 2nd ed. (Taylor & Francis Group, New York, NY, 2010) p. 524.
[17] P. W. Holland, K. B. Laskey, and S. Leinhardt, Social Networks, 109 (1983).
[18] A. A. Hagberg, D. A. Schult, and P. J. Swart, in Proceedings of the 7th Python in Science Conference, edited by G. Varoquaux, T. Vaught, and J. Millman (Pasadena, CA, USA, 2008) pp. 11-15.
[19] P. Erdős and A. Rényi, Publ. Math. Inst. Hung. Acad. Sci., 17 (1960).
[20] L. L. Thurstone, Psychological Review, 406 (1931).
[21] H. F. Kaiser, Psychometrika, 31 (1974).
[22] M. B. Richman, Journal of Climatology, 293 (1986).
[23] E. Schubert, J. Sander, M. Ester, H. P. Kriegel, and X. Xu, ACM Transactions on Database Systems (TODS), 1 (2017).
[24] M. Ester, H.-P. Kriegel, J. Sander, X. Xu, et al., in KDD, Vol. 96 (1996) pp. 226-231.
[25] H. M. Kang, M. Subramaniam, S. Targ, M. Nguyen, L. Maliskova, E. McCarthy, E. Wan, S. Wong, L. Byrnes, C. M. Lanata, R. E. Gate, S. Mostafavi, A. Marson, N. Zaitlen, L. A. Criswell, and C. J. Ye, Nature Biotechnology (2017), doi:10.1038/nbt.4042.
[26] T. Stuart, A. Butler, P. Hoffman, C. Hafemeister, E. Papalexi, W. M. M. III, Y. Hao, M. Stoeckius, P. Smibert, and R. Satija, Cell, 1888 (2019).
[27] S. Choobdar, M. E. Ahsen, J. Crawford, M. Tomasoni, T. Fang, D. Lamparter, J. Lin, B. Hescott, X. Hu, J. Mercer, T. Natoli, R. Narayan, A. Subramanian, J. D. Zhang, G. Stolovitzky, Z. Kutalik, K. Lage, D. K. Slonim, J. Saez-Rodriguez, L. J. Cowen, S. Bergmann, and D. Marbach, Nature Methods, 843 (2019).
[28] F. Liesecke, J.-O. De Craene, S. Besseau, V. Courdavault, M. Clastre, V. Vergès, N. Papon, N. Giglioli-Guivarc'h, G. Glévarec, O. Pichon, and T. Dugé de Bernonville, Scientific Reports, 14431 (2019).
[29] P. Langfelder and S. Horvath, BMC Bioinformatics, 559 (2008).