Navigating differential structures in complex networks
Leonardo L. Portes∗
Complex Systems Group, Department of Mathematics and Statistics, University of Western Australia, Nedlands, Perth, WA 6009, Australia
Michael Small
Complex Systems Group, Department of Mathematics and Statistics, University of Western Australia, Nedlands, Perth, WA 6009, Australia and Mineral Resources, CSIRO, Kensington, Perth, WA 6151, Australia
(Dated: August 24, 2020)

Abstract

Structural changes in a network representation of a system, due to different experimental conditions or to its time evolution, can provide insight into its organization, its function, and how it responds to external perturbations. The deeper understanding of how gene networks cope with diseases and treatments is perhaps the most incisive demonstration of the gains obtained through this differential network analysis point of view, which led to an explosion of new numerical techniques in the last decade. However, where to focus one's attention, or how to navigate through the differential structures in the context of large networks, can be overwhelming even for a few experimental conditions. In this paper, we propose a theory and a methodological implementation for the characterization of shared "structural roles" of nodes simultaneously within and between networks, whose outcome is a highly interpretable map. The main features and accuracy are investigated with numerical benchmarks generated by a stochastic block model. Results show that the method can provide nuanced and interpretable information in scenarios with very different (i) community sizes and (ii) total numbers of communities, and (iii) even for a large number of 100 networks being compared (e.g., for 100 different experimental conditions). Then, we show evidence that the strength of the method is its "story-telling"-like characterization of the information encoded in a set of networks, which can be used to pinpoint unexpected differential structures, leading to further investigations and providing new insights.
We provide an illustrative, exploratory analysis of four gene co-expression networks from two cell types × two treatments (interferon-β stimulated or control). The method proposed here allowed us to elaborate a set of very specific hypotheses related to unique and subtle nuances of the structural differences between these networks, which were then tested and confirmed in the original dataset. Finally, the method is flexible enough to address different research-field-specific questions, by not restricting which scientifically meaningful characteristic (or relevant feature) of a node shall be used.

∗ [email protected]

I. INTRODUCTION

Modeling a complex system through a network representation, and examining the community structures that change (or, on the other hand, remain coherent) across different experimental conditions, can bring relevant information about the structure and function of the system's interacting parts. This differential network analysis is a very recent topic, which has been mainly developed by the bioinformatics community in the context of gene expression analysis (see [1] for a review). That approach has contributed to several discoveries related to how genes communicate to support the emergence of physiological responses of an organism in different disease states [2–5]. However, despite the existence of several numerical techniques to accomplish the task, where to focus one's attention, or how to navigate through the differential structures in the context of large networks, can be overwhelming even with few experimental conditions. In this paper we provide a method to unfold that information, allowing one to pinpoint how and which community structures change (or remain the same) across different conditions. Our approach is inspired by recent advances in the field of chaotic phase synchronization (PS), regarding its characterization and detection via multivariate singular spectrum analysis (M-SSA).
It has been shown that orthogonal rotations of the eigenvectors, obtained from concatenated trajectory matrices, provide a clear and almost automatic identification of the oscillatory modes that are shared by the coupled chaotic oscillators [6–9]. Specifically, the final outcome in that version of M-SSA is a set of rotated eigenvectors that clearly encode the shared oscillatory components of those oscillators, which can be used for further detection of PS and characterization of phase-synchronized clusters. Here, we make a parallel between shared oscillatory modes in PS analysis and the shared "structural role" of nodes in differential network analysis. In short, our method takes advantage of (i) a varimax rotation to simplify the structure of the eigenvectors obtained from (ii) the eigendecomposition of (iii) a concatenated adjacency matrix that represents the network in different conditions. For the sake of discussion, we will call it the concatenate-decompose-rotate (CDR) approach. In the new context of network analysis, we show that the outcome is a highly interpretable map of the nuances of the network in those different conditions. It is worth noting that some combination of those three steps (concatenate, decompose, rotate) has already been applied in other works and by different scientific communities [10, 11]. However, both their main goals and their outcomes are different from the CDR as articulated here. Indeed, we expect that the fundamental underlying ideas of the CDR could be applied over and above the other methods that have been developed within different scientific communities, or even be used as a bridge between them to nurture a deeper theoretical understanding (for example, as in the recently demonstrated equivalence of modularity maximization and the method of maximum likelihood [12]). Nevertheless, there has been some criticism regarding the methodological aspects of community detection in network science [13].
In particular, we share the pertinent view that a "method based on a mere hunch that something might work is inherently less trustworthy than one based on a provable result or fundamental mathematical insight" [14]. The CDR approach can be described in a very simple and direct way, just by showing that a varimax rotation of eigenvectors obtained from a "generic" spectral algorithm works in some specific scenarios. However, we start this paper, and devote a large part of it, to providing a simple theory for how and why the CDR works. Aiming at a broader audience from different fields, and trying to use the simplest and most transparent concepts possible, this is done from a factor analysis point of view illustrated by a minimalistic toy scenario. Both the theoretical aspects and the scenarios explored here do not represent a complete work, but an exploration of the feasibility of a CDR method to "navigate" through differential structures in complex networks.

This paper is organized as follows. The mathematical theory, and an illustrative motivational toy scenario, for differential network analysis with the CDR are presented in Sec. II. Then, we explore the method with larger networks and a larger number of conditions in Sec. III. Firstly, CDR is applied to synthetic networks with N = 2000 nodes in H = 100 different conditions, with a random number of communities (between 10 and 20) of random sizes (10 to 100 nodes) distributed across those conditions. The scope here is on networks of non-overlapping clusters (or communities), generated from the stochastic block model. The clusters can be present or absent in different conditions. The method's accuracy is investigated through Monte Carlo simulations, where we manipulated the inner probabilities (how strongly the nodes within a community are connected) and the outer probability (how strongly nodes from different communities, and the "noisy background", are connected).
Then, we illustrate the strength of the CDR's exploratory nature in revealing subtle differential structures of a gene expression dataset in four different conditions (two cell types × two treatments). Concluding remarks are made in Sec. IV.

II. METHODS
In this section, we introduce the simple theory and basic methodology from a factor analysis (FA) point of view. The material is introduced by blending a brief review of some FA results [15, 16] with the pursuit of a theory for differential network analysis and community structure characterization. This means that we will start by being as abstract as necessary, and that references to a "toy scenario" (Sec. II A) will be used as motivation and to illustrate the main aspects of the proposed framework. Some small conceptual differences from actual factor analysis will be discussed when necessary.
A. Motivation: toy scenario
Consider the scenario of an undirected and unweighted network with N = 200 nodes, in H = 4 different conditions with respect to its connections, as shown in Fig. 1. Assume that nodes neither disappear nor are created; only connections may change. For each condition h = 1, ..., H, the network is represented by its respective adjacency matrix A_h of size N × N, with elements a_ij = 1 if nodes i and j are connected, and a_ij = 0 otherwise. Hence, the set {A_h}_{h=1}^H = {A_1, A_2, A_3, A_4} of adjacency matrices represents the structure of the network across the H different conditions. There are R = 3 highly connected and non-overlapping communities of size N_r = 60 nodes each, as seen in Fig. 1, labeled as r = 1, 2, 3.

FIG. 1: The toy-scenario network in H = 4 different conditions. It has N = 200 nodes, which could be members of R = 3 communities. This scenario can be thought of as a system in H different experimental conditions, or an evolving network at H different time points.

The set {A_h}_{h=1}^H was generated by a stochastic block model [17] with the package NetworkX [18]. The connection probability between nodes within the same community (the inner probability) is p, and it is q otherwise (i.e., the outer probability). Setting these probabilities as 0 ≤ q < p ≤ 1 yields R highly connected communities, which can be visualized as blocks in Fig. 1. For the rest of the network structure, we considered a "ghost" community of size N − Σ_r N_r = 20 nodes, with both the within and between probabilities equal to q. Let the adjacency (block) matrices A^(1), A^(2) and A^(3) represent each of these communities. Accordingly, the scenario depicted here can be seen (conceptually) as R Erdős–Rényi [19] random networks G_{N_r,p} "planted" on a G_{N,q}. Because of this, we will often refer to G_{N,q} as the "random background".

By design, the toy scenario illustrates two aspects of the structural changes in the network. The first one refers to the presence or absence of a community in a given condition.
Specifically, communities r = 1 and r = 3 are present in all H conditions, while r = 2 is absent in conditions h = 2 and h = 4. In this aspect, the structure of the network is the same in (i) conditions 1 and 3, as well as in (ii) conditions 2 and 4, but (iii) different otherwise. That is the main structural change that we want to identify. However, the internal structure of conditions (and the background noise) may differ, as follows. The second aspect refers to how strong a community is, as compared to the connections of its members to the other nodes (i.e., nodes from the other communities, as well as nodes from the random background). This is done by using different combinations of the probabilities p and q. The values for conditions 1 and 2 are (p, q) = (1, 0.02). In conditions 3 and 4 the inner probability decreases to p = 0.6, while the background noise becomes stronger by a factor of 10 at condition 4, with q = 0.2. So, the mixing between communities and background noise is larger at condition 3, and much larger at condition 4. Later we will investigate how the mixing level interferes with the CDR method.
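As a sketch of how such a set {A_h} can be generated (assuming the NetworkX package cited above; this script is our illustration, not the authors' code, and the probability values merely follow the pattern described for the four conditions):

```python
import numpy as np
import networkx as nx

def toy_condition(p, q, present=(True, True, True), seed=None):
    """One condition of the toy scenario: three plantable 60-node communities
    plus a 20-node "ghost" block wired with the outer probability q only."""
    sizes = [60, 60, 60, 20]
    # Block probability matrix: p within a present community, q everywhere else.
    P = np.full((4, 4), q)
    for r, on in enumerate(present):
        if on:
            P[r, r] = p
    G = nx.stochastic_block_model(sizes, P.tolist(), seed=seed)
    return nx.to_numpy_array(G)

# Conditions 1-4: community r = 2 is absent at conditions 2 and 4.
A = [toy_condition(1.0, 0.02, (True, True, True), seed=0),
     toy_condition(1.0, 0.02, (True, False, True), seed=1),
     toy_condition(0.6, 0.02, (True, True, True), seed=2),
     toy_condition(0.6, 0.2,  (True, False, True), seed=3)]
```

Each element of A is a 200 × 200 symmetric binary adjacency matrix; with p = 1 the first community is a complete block, while an absent community is indistinguishable from the q-background.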
Remark. We call attention to one consequence of the SBM that could otherwise pass unnoticed. For example, the internal structure of the community r = 1 can be completely different between conditions 3 and 4. They just share the same inner probability p = 0.6, but the actual connections of their nodes are set at random by the model. The same occurs for the random background in conditions 1 and 3: the inner and outer probabilities are the same (p = q = 0.02), but the actual connections are again drawn at random.

B. Theoretical framework

The initial assumptions of the method are as follows. There exists an abstract property y_i, i = 1, ..., HN, for the N nodes in the H conditions. For example, this means that for a given node j ∈ [1, N], the properties y_{j+(h1−1)N} and y_{j+(h2−1)N} will refer to the same node j in two different conditions h1 ≠ h2 ∈ [1, H]. Conceptually, we will assume that a causal relationship exists between the set {y_i}_{i=1}^{HN} and the sets of unknown and abstract processes {ξ_i}_{i=1}^{v} and {e_i}_{i=1}^{HN} (i.e., the factors or latent variables). For now, we just assume that y, ξ and e can be represented as vectors in an abstract vector space, mainly because we will need the concept of inner products ⟨•|•⟩ to represent the extent to which they are (or are not) close. The set {e_i} contains the factors specific to each y_i (i.e., the unique factors). The set {ξ_i} contains factors that can influence any and several y_i (i.e., the common factors). We assume that the common and unique factors are independent, ⟨ξ_i|e_j⟩ = 0, and that the unique factors are orthogonal, ⟨e_i|e_j⟩ = 0 if i ≠ j (orthogonality of the common factors will not be assumed yet). By defining the column matrices Y = [y_1 ... y_HN]^⊤, X = [ξ_1 ... ξ_v]^⊤ and E = [e_1 ... e_HN]^⊤, one can write the linear model

Y = ΛX + ΨE,   (1)

known as the fundamental equation of factor analysis [16]. The matrices Λ (size HN × v) and Ψ (size HN × HN) provide the common and unique factor loadings (or weights), respectively. Matrix Ψ is diagonal (i.e., its off-diagonal elements are equal to zero), because the factors e_i are unique.
Without loss of generality, we assume the factors' norms are equal to one, because any scaling can be absorbed by the factor weight matrices Λ and Ψ. So, ⟨e_i|e_j⟩ = δ_{i,j} and ⟨ξ_i|ξ_i⟩ = 1 (δ is the Kronecker delta function).

There is a subtle conceptual difference between model (1) and an actual factor-analytic model. In the latter, y refers to measured data. But in this paper, we will consider y just as an abstract concept parameterised in a vector space, which will provide us with flexibility later. Accordingly, no assumptions will be made regarding the particular statistics of y (e.g., a random variable with zero mean and unit variance). As well, we will use the concept of a Gramian matrix, instead of the covariance or correlation matrices, for the matrices related to the inner products. Actual measured data will be inserted into the framework in the next section.

Finally, we assume that the goal of modeling a given phenomenon (with data from a set of observations) through network theory is to investigate the differential clustering of nodes: which structures remain the same, and which change, along the H different conditions. Now we explore two consequences of model (1) under those assumptions.
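The algebra above can be checked numerically. The sketch below (our illustration, not part of the original method) realizes the abstract vectors as rows in R^m, builds orthonormal common and unique factors via a QR decomposition, and verifies that, under the stated orthogonality assumptions, the Gram matrix YX^⊤ between the properties and the common factors recovers the loading matrix Λ exactly, the consequence explored next:

```python
import numpy as np

rng = np.random.default_rng(0)
HN, v, m = 12, 3, 50  # number of properties, common factors, ambient dimension

# Orthonormal rows: the first v are the common factors X, the next HN the unique factors E.
Q, _ = np.linalg.qr(rng.standard_normal((m, v + HN)))
X, E = Q[:, :v].T, Q[:, v:].T  # shapes (v, m) and (HN, m)

Lam = rng.standard_normal((HN, v))        # common factor loadings (Lambda)
Psi = np.diag(rng.uniform(0.1, 0.5, HN))  # diagonal unique loadings (Psi)

Y = Lam @ X + Psi @ E  # the fundamental equation of factor analysis, Eq. (1)

# With orthonormal common factors and <xi_i|e_j> = 0, Y X^T collapses to Lambda.
R_YX = Y @ X.T
assert np.allclose(R_YX, Lam)
```

The assertion holds because Y X^⊤ = Λ(XX^⊤) + Ψ(EX^⊤) = ΛI + 0.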
1. Feasibility for differential network analysis and clustering
The characterization of the clustering of nodes along conditions, due to the sharing of latent variables ξ_i, could be achieved by inspecting the structure of the matrix Λ. To see this, consider the product between y_i and ξ_j: R_YX := YX^⊤, with elements [R_YX]_{i,j} = ⟨ξ_j|y_i⟩. Because ⟨ξ_i|e_j⟩ = 0, we have

R_YX = [ ⟨ξ_1|y_1⟩   · · ·   ⟨ξ_v|y_1⟩
            ⋮         ⋱         ⋮
         ⟨ξ_1|y_HN⟩  · · ·  ⟨ξ_v|y_HN⟩ ]  ≡  ΛR_XX,   (2)

where R_XX := XX^⊤. Expression (2) can be simplified even further if we are allowed to assume orthogonality between the common factors, ⟨ξ_i|ξ_j⟩ = δ_{i,j}. Under that new assumption, to which henceforth we restrict the scope of this paper, (2) becomes

R_YX = Λ.   (3)

To gather insights into the implications of (2) and (3), we use (1) to frame the problem of differential network analysis and community characterization of the toy scenario shown in Fig. 1. In that context, and because we now know the real community structures across conditions (i.e., the ground truth), a reasonable hypothesis is that the "true" underlying community structure to be captured by (1) is given by the R known planted communities only, and not by the background noise from G_{N,q}. Then, by design, the underlying theoretical assumptions for the FA model (1) would be: (i) ξ_1 is the "cause" of the community structure of nodes 1 to 60 (i.e., r = 1) at all H = 4 conditions; (ii) ξ_2 for nodes 61 to 120, but only at conditions h = 1, 2; (iii) ξ_3 for nodes 121 to 180 at all H = 4 conditions. Note that the labels 1, 2 and 3 were used here for the sake of illustration (e.g., community r = 1 could be related to ξ_2, and so on). Because of that, expressions (2) and (3) tell us that these causal relationships should be reflected in the structure of the common factor loading matrix Λ as high loadings related to those three factors ξ for the properties y_i within these ranges. That is schematically shown in Fig. 2, where Λ_{•k} = [⟨ξ_k|y_1⟩ ... ⟨ξ_k|y_HN⟩]^⊤.

FIG. 2: Schematic picture of the intuitively expected structure of matrix Λ in the context of the toy scenario of Fig. 1. The leading three columns Λ_{•k} = [⟨ξ_k|y_1⟩ ... ⟨ξ_k|y_HN⟩]^⊤, with k = 1, 2, 3, are shown.

The main result of this paper is based upon finding (or extracting) that special structure from measured data, because it clearly reports on the structural changes of the network. In a sense, this provides us with a map, or a book composed of H "chapters" within the H segments of length N, that allows one to navigate through the history of the network's differential structure. Therefore, our aim now is to obtain that idealized structure after fitting observed data to a model based on (1); then, both (i) community structure characterization and (ii) differential network analysis would be straightforward.

2. Extracting the structure of Λ

Consider the projection of y_i onto itself, R_YY := YY^⊤ (with matrix elements [R_YY]_{i,j} = ⟨y_i|y_j⟩).
Because the common and unique factors are orthogonal, one can write R_YY = ΛR_XXΛ^⊤ + Ψ², an expression that is often known as the fundamental theorem of factor analysis [16]. Rearranging for Λ, and because we previously assumed orthonormality between the common factors, we have

ΛΛ^⊤ = R_YY − Ψ².   (4)

We will simplify (4) even further by assuming that the term Ψ² is negligible, so

ΛΛ^⊤ = R_YY.   (5)

That is an assumption very often used in FA. If there exists a way to estimate the contributions from the unique factors, one can go back and simply update the diagonal elements (because Ψ² is a diagonal matrix). Finally, writing the eigendecomposition R_YY = VΣV^⊤, we have

Λ = VΣ^{1/2}.   (6)

The eigenvectors v_k correspond to the columns of matrix V, with respective eigenvalues σ_k, k = 1, 2, ..., n, in the main diagonal of matrix Σ. We assume they are in decreasing order, σ_1 > σ_2 > ... > σ_n.

That is one of several procedures for extracting the structure of the factor loading matrix. It is sometimes called principal component factor analysis [15], or referred to as extraction through principal component analysis (PCA) [16]. However, that is the case when R_YY comes from measured data (equivalently, when y_i is a random vector). So, now we need to address the aforementioned conceptual difference between model (1) and an actual factor-analytic model: the y_i are concepts, not random variables. What does it mean that two nodes belong to the same community?
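As an illustration (our sketch, not the authors' code), the extraction step of Eqs. (5)–(6) amounts to an eigendecomposition of the Gram matrix, with the eigenvectors scaled by the square roots of their eigenvalues:

```python
import numpy as np

def extract_loadings(R_YY, n_factors):
    """Principal-component-style extraction: Lambda = V Sigma^{1/2}, Eq. (6)."""
    # eigh returns eigenvalues in increasing order; flip to decreasing order.
    sig, V = np.linalg.eigh(R_YY)
    sig, V = sig[::-1], V[:, ::-1]
    sig = np.clip(sig, 0.0, None)  # guard against tiny negative round-off
    return V[:, :n_factors] * np.sqrt(sig[:n_factors])

# Sanity check: a Gram matrix is reproduced by its full set of loadings (Eq. (5)).
rng = np.random.default_rng(1)
Y = rng.standard_normal((8, 5))
R = Y @ Y.T
L = extract_loadings(R, 8)
assert np.allclose(L @ L.T, R)
```

Keeping only the leading n_factors columns gives the truncated loadings used in the remainder of the paper.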
The answer can (and should) depend on the actual research question and the field-dependent characteristics that one aims at by grouping the nodes and asking for their differential network structure. Here, by letting the y_i be concepts and not actual data, we aim at that flexibility for the CDR framework to address different field-specific points of view. This can be put more clearly through the following example, which will also be used to establish the remaining procedures. Consider again the toy scenario of Fig. 1, and assume that the measured data is the set {A_h}_{h=1}^H. For the task at hand (differential network analysis) and the characteristics of the community structures (highly connected within blocks, low connectivity between them), one reasonable choice is to use the similarity of the lists of neighbours between nodes and conditions to capture the relevant question one wants to address through a differential network analysis. Given two nodes i and j, their lists of neighbours in conditions h1 and h2 are the columns [A_{h1}]_{•i} and [A_{h2}]_{•j} of the respective adjacency matrices. Let X = [A_1 A_2 ... A_H] be the N × HN matrix formed by horizontally concatenating the H adjacency matrices (see Fig. 3, top panel). Accordingly, one defines the estimate ˆR_YY for R_YY as

R_YY := YY^⊤ ˆ= ˆR_YY ≜ X^⊤X,   (7)

where the symbol ˆ= stands for "estimated from". Here, we use the symbol ≜ to emphasize that this definition depends on the field-specific characteristics that are pertinent to the question one wants to answer. Henceforth in this paper, we will make use of (7) to estimate the common factor loadings ˆΛ through the eigendecomposition of X^⊤X.

The leading five columns of the estimated loadings, ˆΛ_{•k} for k = 1, ..., 5, are shown in Fig. 3 (middle row). Henceforth we will call them simply "loadings". They contrast deeply with the desired "simple" structure previously shown in Fig. 2. Specifically, one sees a mixed signature of communities, which is more entangled in the leading two loadings. Actually, that is indeed the expected intermediate result from FA. The reason is that we have applied PCA to extract the loadings, and PCA, by itself, is the solution of the maximization problem max tr cov(X) under the restriction of orthonormality of the principal directions, ⟨v_i|v_j⟩ = δ_{i,j}. So, PCA maximizes the variance explained by the leading k components, and a large amount of the information becomes entangled in the leading eigenvectors v_i and, consequently, in the leading ˆΛ_{•k}.

The usual follow-up procedure in FA is based on Thurstone's concept of simple structure [20]: the rotation of the factor loadings by a given criterion that maximizes the simplicity of ˆΛ, thus (hopefully) enhancing the interpretability of that matrix.

FIG. 3: Results of the CDR on the toy scenario of Fig. 1. The concatenated matrix X is shown on top. The structure of matrix Λ is shown before (middle) and after (bottom) the varimax rotation. Before rotation, the leading loadings ˆΛ_{•k} contain mixed information regarding the R = 3 communities. After the varimax rotation (bottom row), the rotated loadings ˆΛ*_{•k} are unmixed and bring more detailed information about the communities and on how they change between the H = 4 conditions, a structure that clearly mirrors the idealised one of Fig. 2.

3. Varimax rotation and the simple structure of Λ

Kaiser's varimax [21] is considered the most efficient (orthogonal) rotation in FA [22], and is the most often applied. Let the elements of the factor loading matrix be ˆΛ = [λ_{d,k}].
The varimax rotation aims at finding the orthogonal rotation ˆΛ* = ˆΛT that satisfies the varimax criterion (VC)

VC(ˆΛ) = Σ_{k=1}^{S} [ (1/D) Σ_{d=1}^{D} λ_{dk}^4 − ( (1/D) Σ_{d=1}^{D} λ_{dk}^2 )² ].   (8)

Specifically, (8) is the raw varimax criterion. It represents the maximization of the variance across the columns of the squared factor loading matrix. The summation is over the first S factors ˆΛ_{•k}, k = 1, ..., S.

The result of that rotation is shown in Fig. 3 (bottom), with S = 20. Each leading ˆΛ*_{•k} now carries a unique fingerprint of (i) each community in (ii) each condition, similar to the expected idealized structure in Fig. 2. The main result of this paper is that this approach recovers the full "story" of the network, where each "chapter" is encoded in each of the H = 4 segments of length N. Then, it becomes straightforward to read: communities 1 and 3 were present in all H = 4 conditions, while community 2 disappeared in conditions h = 2 and h = 4 but was present in h = 3. Further information is provided by the different magnitudes of the pulse-like bumps in different segments of the same ˆΛ*_{•k}. For example, consider the loading associated with community r = 1. The first two bumps have the same magnitude, which is larger than the magnitude of the last two segments. This means that, besides the community r = 1 being present in all H = 4 conditions, something in its structure is more similar within conditions h ∈ {1, 2} and h ∈ {3, 4} than between them. By design, we know that this should be a consequence of the different inner probabilities: p = 1 for h ∈ {1, 2}, and p = 0.6 for h ∈ {3, 4}. As a remark, note that because of the stochastic block model applied, the structure of a given community is not identical in different conditions (when p < 1, i.e., communities are not complete blocks, the actual links are random realizations with probability p). Even so, the proposed approach shows a clear representation of the differential community structure.

There is another feature in Fig. 3 worth commenting on. A constant trend (vertical shift) in the leading loadings (both before and after rotation) is clearly seen when the larger outer probability q = 0.2 is used. This is expected, because the larger the q, the larger the mean value of each column of X (i.e., more connections imply more "ones" instead of "zeros"). For the rotated loadings, it is clearly visible in the fourth (last) segment of length N of the leading 3 rotated loadings ˆΛ*_{•k}. For our goal in this paper, that trend is irrelevant.

Remark. The analysis here could be conducted from the point of view of the eigenvectors v_k (the columns of V). In that case, the correct way to obtain the varimax-rotated v*_k is to use the scaled σ_k^{1/2} v_k to find the rotation T, and then apply the rotation to the original (not scaled) eigenvectors, v*_k = v_k T. The pitfall here is that, because the scaled eigenvectors correspond to the common factor loadings, one could assume incorrectly that the rotated eigenvectors could be obtained simply by rescaling the already-rotated factor loadings. However, this process of scaling, rotating and rescaling yields an oblique rotation.

C. Statistical analysis
Numerically generated scenarios with much larger R, N and H will be used in the next section to investigate the application of the proposed CDR framework. For some of them, we will make use of large Monte Carlo runs, using the outer probability q as the mixing parameter. While the emphasis in this paper is on the interpretability provided by the visual inspection of the rotated ˆΛ*_{•k} as an exploratory tool (as in Fig. 3, bottom), those larger scenarios bring the need for auxiliary tools to validate the method and to automate some steps.

Firstly, in order to quantify the accuracy of the rotated ˆΛ*_{•k} in correctly identifying the communities r that share the same condition h, we use the score provided by the area under the receiver operating characteristic curve (AUC-ROC). The ROC curve is a common tool in machine learning, used to compare the performance of binary classifiers. It is a graphical representation where the true positive rate (also called sensitivity or recall) is plotted against the false positive rate for different thresholds used in the decision function. The score can take values between 0.5 and 1: a random classifier will have an AUC-ROC equal to 0.5, whereas a perfect classifier will have an AUC-ROC equal to 1.

Let ˆΛ*_{•k,h} denote each segment of length N of the k-th column of the rotated factor loading matrix, with elements ˆΛ*_{•k,h}(i), i = 1, 2, ..., N. Those segments are binary classifiers, which can be used to cluster the nodes into one of two groups given their respective weights ˆΛ*_{•k,h}(i) (by referring to these clusters as "groups" we aim to avoid any confusion with the planted communities r). For instance, consider a segment of a leading loading in Fig. 3 (bottom). Its weights clearly form two clusters: one with values near the "noise floor", and the other with values much larger than the noise floor variance. Figure 4 shows the distribution of weights for all segments in Fig. 3 (bottom).
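As a minimal sketch of the scoring step (our illustration on hypothetical data), the AUC-ROC of one length-N segment against a known membership vector can be computed with the rank-sum (Mann–Whitney U) identity, which for tie-free continuous scores is equivalent to standard library implementations such as scikit-learn's roc_auc_score:

```python
import numpy as np

def auc_roc(y_true, y_score):
    """AUC via the rank-sum (Mann-Whitney U) identity, valid for tie-free scores."""
    order = np.argsort(y_score)
    ranks = np.empty(len(y_score))
    ranks[order] = np.arange(1, len(y_score) + 1)  # rank 1 = smallest score
    n_pos = int(np.sum(y_true))
    n_neg = len(y_true) - n_pos
    u = ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)

rng = np.random.default_rng(2)
N = 200
s_true = np.zeros(N)
s_true[:60] = 1                      # hypothetical planted community (nodes 1-60)
segment = rng.normal(0.0, 0.01, N)   # noise-floor weights
segment[:60] += 0.1                  # community weights well above the floor
score = auc_roc(s_true, segment)     # close to 1 for a well-separated community
```

A flat segment of pure noise would instead score near 0.5, the random-classifier baseline.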
Let us denote those two groups by the labels 0 and 1. We define the index vector s^(k,h) of length N, with elements

s_i^(k,h) = 0, if node i belongs to group 0;
            1, otherwise.   (9)

On the other hand, note that a "flat" segment, such as the one that is the fingerprint of the absence of r = 2 in condition h = 2, will generate an s^(3,2) with all its elements equal to 0 (i.e., all nodes are associated with the noise floor, and so with group 0). The differential structure in Fig. 3 is so clearly discernible that the clustering could be done by visual inspection. However, for large numbers of conditions and communities, this would be prohibitive. An alternative solution is to apply a clustering algorithm of choice. Hence, secondly, in this paper we opt to use density-based spatial clustering of applications with noise (DBSCAN) [23, 24], with the parameters number of neighbours and distance equal to 5 and 1/100 of the amplitude of ˆΛ*_{•k,h}, respectively. By design, in the numerical experiments of the next section the ground truth is known, and hence it is used to construct the index vectors s^r_true for each planted community r = 1, 2, ..., R. Then, we compute the AUC-ROC score between a single s^(k,h) and all R vectors s^r_true. We pick the largest value and the value of r for which it occurred, and use them to represent the method's score in identifying the community with label r through the ˆΛ*_{•k,h} segment. Note that the knowledge about the condition h, where that community r was found, is already provided by the h index of the current ˆΛ*_{•k,h}.

Finally, the overall accuracy will be quantified by the fraction of reliably detected communities. Specifically, we count the number of detected communities with an AUC-ROC score above a given threshold and divide that value by the real (known) number of communities. The value of this threshold is 0.8.

FIG. 4: Estimated Gaussian kernel densities for each segment h of the leading 5 rotated factor loadings ˆΛ*_{•k}, corresponding to Fig. 3 (bottom). Darker shades refer to larger values of k.

III. RESULTS
The motivation in this section is two-fold. First, numerical experiments are conducted to test the accuracy of the CDR method in unveiling the history of structural changes in more challenging scenarios. This is done for a larger network of 2000 nodes in H = 100 conditions that could have between 10 and 20 communities with very different sizes. Whilst that is a much more complex scenario than the previous toy one, it is only a small representation of the myriad ways in which the network structure could change in real-world systems. For instance, in this paper we focus on communities that can only be present or absent. In reality, there can be superpositions, growing and shrinking, combinations of those processes, etc. Notwithstanding, as the motivation of this paper is to provide a first step for the CDR method, we do our best to explore this constrained scenario in depth.

The second motivation is to explore the potential use of the navigation map provided by the CDR approach in the context of gene co-expression data. The dataset used consists of expression data of 1458 genes, from 2 different cell types and in 2 different experimental conditions. This application is our principal motivation for the CDR. In our specific setting, this means a network with 1458 nodes in H = 4 conditions. Whilst the adjacency matrices in that context are weighted rather than binary, and there can be both positive and negative connections, we show that the CDR can find unique and subtle nuances of the structural differences. As a remark, this is done here for illustrative purposes, and not to unveil any biologically meaningful result for the specific dataset we used.

A. Synthetic networks
We start by mimicking the scenario of a large network with N = 2000 nodes in H = 100 conditions. In a similar fashion as in the toy scenario of Sec. 1, a set of 100 adjacency matrices {A_h}, h = 1, ..., 100, was generated using the SBM. However, the difference is that the number of planted communities in each condition, the community sizes and their specific labels were chosen from random uniform distributions. Figure 5 (top) shows the blocks of the concatenated matrix X = [A_1 A_2 ... A_100] associated with the first 7 and last 2 adjacency matrices. The inner and outer probabilities p and q are fixed across all conditions. The generation procedure is:

1. An array containing 20 tuples (r, N_r) was generated, with N_r ∼ U(10, 100), associating each community label r = 1, ..., 20 with its respective community size.

2. An array containing 100 tuples (h, R_h), with R_h ∼ U(10, 20), giving the number of communities R_h to be planted in a condition h.

3. Fixing h, the adjacency matrix A_h was generated with R_h communities as specified by step (2), but with community labels randomly sampled (with equal probability) from the array generated in step (1).

4. Step (3) was repeated for h = 1, ..., 100.

Note that a given community r, appearing in different conditions h, will have different specific internal (and external) connections; only its inner and outer probabilities are the same across conditions. In summary, each condition h = 1, ..., H can have 10 to 20 planted communities, each of them possibly with a different size. A given community can appear in several conditions, but its internal and external connectivity will (very likely) differ between them.

FIG. 5: Concatenated matrix X (top) for the large network (N = 2000) with R = 20 communities randomly distributed along H = 100 conditions. Community sizes 10 ≤ N_r ≤ 100 were sampled from a uniform distribution. Before rotation (middle), the loadings Λ̂_•k show mixed signatures of the communities. After rotation (bottom), each community in each "shared" condition is clearly represented within the same Λ̂*_•k.

The leading 20 loadings before and after rotation are shown in Fig. 5 (middle and bottom, respectively). The varimax rotation was performed with S = 2R = 40 vectors, assuming that twice the maximum allowed number of communities (in any condition) will provide a sufficient degree of freedom for the algorithm to achieve the desired simple structure. The expected mixing before rotation, and the clear representation of the differential network structure after rotation, can be seen. It is worth mentioning that the fingerprints of the small communities in the last loadings Λ̂*_•k (e.g., for k = 20) would not be visually discernible from the random "noise floor" in a real application: here, by design, successive node indexes were associated with the communities. In contrast, with real-world data, the indexes of nodes from the same community will (very likely) be spread along the integer segment [1, N]. Now we assess the fidelity of this representation provided by the rotated Λ̂*_•k.
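The fidelity assessment just mentioned (each rotated loading segment compared, via ROC-AUC, against the ground-truth indicator vector of every planted community, keeping the best match and counting detections above the 0.8 threshold) can be sketched in plain Python. This is an illustrative sketch only: the function names and the rank-based AUC implementation are our own choices, not part of the CDR implementation, and it assumes every indicator vector contains both member and non-member nodes.

```python
def roc_auc(scores, labels):
    """Rank-based ROC-AUC of `scores` against binary `labels` (1 = member).
    Assumes both classes are present in `labels`."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Average the (1-based) ranks over runs of tied scores.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, l in zip(ranks, labels) if l == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)

def best_match(segment, true_indicators, threshold=0.8):
    """Match one loading segment to the planted community whose ground-truth
    indicator it best separates; community index is None below threshold."""
    scores = [abs(x) for x in segment]  # only magnitudes matter for membership
    aucs = [roc_auc(scores, s_true) for s_true in true_indicators]
    best_r = max(range(len(aucs)), key=lambda r: aucs[r])
    if aucs[best_r] >= threshold:
        return best_r, aucs[best_r]
    return None, aucs[best_r]
```

The overall accuracy then follows by applying `best_match` to every segment and dividing the number of successful matches by the known number of planted communities.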
The AUC-ROC scores are shown in Fig. 6(A) (for the sake of clarity, only values above 0.6 are shown). Comparing with the community sizes N_r, given by plotting the array with 20 tuples (r, N_r) in Fig. 6(C), we see that the small communities, with size near 10, are more likely to be missed.

Figure 7 shows the distribution of ROC-AUC scores relative to the community sizes. Two features can be seen, which confirm the previous discussion. Firstly, communities with more than 60 nodes were detected with a score above 0.98, and they form the vast majority of identifications (see Fig. 7(B)). Secondly, the smaller communities can be harder to detect, and this is most prominent for communities with fewer than 20 nodes.

The previous results depend on the specific realization that generates the 20 tuples (r, N_r), Fig. 6(C), as well as on the other random features. Furthermore, we would like to explore how the mixing (the relative magnitudes of p and q) influences those results. So, we now employ a Monte Carlo strategy, for which the previous scenario can be considered one specific realization. The steps are:

a. We fix the inner and outer probabilities (p, q).

b. The concatenated matrix X = [A_1 A_2 ... A_H] is generated by using steps (1-4).

c. The fraction of communities detected and planted is calculated as before.

d. Steps (a-c) are repeated N_MC = 20 times.

FIG. 6: Communities planted and detected for the large network in H = 100 conditions (see Fig. 5). (A) ROC-AUC scores of the communities detected by the rotated loadings Λ̂*_•k. For clarity, only scores above 0.6 are shown. We considered a detection successful if the ROC-AUC score is above 0.8. (B) The ground truth: communities planted in each condition h. Colors indicate whether they were successfully detected (black) or not (red). (C) The community sizes N_r ∼ U(10, 100).

This was done for 20 increasing values of the outer probability q and three fixed values of the inner probability p. Figure 8(A) shows the resulting detection curves (mean ± standard deviation) for H = 100 conditions. It can be seen that the lower the mixing (i.e., the more p dominates q), the more slowly the detection curve decays and the larger the q value at which the decay starts. As mentioned before, if a given small community appears in several conditions, its "signal" would be stronger. That would counterbalance its small size, allowing it to be detected. Because of that, we repeat the experiment for a much lower number of conditions, H = 10. In this scenario, it is much less likely that any of the 20 possible communities will appear several times. As a consequence, Fig. 8(B), the detection curves start decaying at a lower value of q as compared to the scenario with H = 100. Regarding the speed of decay (slope), it is very similar to the previous case except for a small segment of the p = 1 curve.

B. Genetic data
Here we give an illustrative example of how the CDR could be used to explore the differential structure of real-world data, in the context of differential gene co-expression network analysis. This is done from an exploratory data analysis point of view, showing evidence of the extent to which the CDR approach can recover subtle and nuanced information that is otherwise spread across the original datasets.

We use a tutorial dataset for single-cell analysis [25]. It consists of gene expression data of peripheral blood mononuclear cells (PBMCs) divided into two groups based on treatment: one with and the other without interferon-β (IFNB) stimulation. Both the dataset and a tutorial (with R code) for its analysis can be found at https://satijalab.org/seurat/v3.1/immune_alignment.html, and within the Seurat R package [26]. The tutorial (and R package) contains the log-normalized expression data of 2000 genes for 12 different cell types. To illustrate the CDR, we arbitrarily selected two cell types: CD4 Naive T cells (henceforth referred to as T-cells) and CD14-mono cells, both labelled as STIM (IFNB stimulation) or CTRL (control) according to their respective treatment group.

FIG. 8: Fraction of detected communities with ROC-AUC score above 0.8. The curves show the mean (± standard deviation) over 20 Monte Carlo runs for an increasing outer probability q up to 0.9 and three fixed values of the inner probability p. The number of conditions is (A) H = 100 (see Figs. 5-6) and (B) H = 10, both for a network of size N = 2000. A larger H increases the chance of a given community appearing in multiple conditions, which increases its chance of being detected. This causes the slower decay of the curves in (A) as compared to (B).

Before proceeding with the CDR, we build H = 4 (co-expression) networks, which conceptually represent what we called conditions in the CDR method. Let the matrices E_h, with h = 1, ..., H, be the expression data with genes in rows and samples in columns. We will refer to these conditions as Tc_ctrl, Tc_stim, cMono_ctrl and cMono_stim.

Before continuing, three remarks are worth noting. Firstly, we emphasize that these two cell types were selected arbitrarily and exclusively to illustrate the use of the CDR in an exploratory analysis that highlights similar and dissimilar network structures across different conditions: in this section, the focus is not on searching for any biologically meaningful result from that analysis. Secondly, in a real-world exploratory analysis, the selection of cell types and experimental groups should be guided by the specific research questions that the subject-specialist researcher aims at. For example, the selection above could be justified if the interest is to search for the 2 × 2 (cell types × treatments) structural (dis)similarities between the co-expression networks of Tc and CD14-mono cells (i.e., between themselves and between treatments). So, specific research questions are the first thing to consider before selecting the datasets used to build the concatenated matrix X. Finally, the estimation of a gene co-expression adjacency from expression data is a very active research topic per se [27, 28]. It is not our aim to discuss, or provide any guidance on, which method would provide more reliable results.
The application of the CDR and the illustrative goal of this section are independent of possible direct biological inference. However, it is to be expected that meaningful biological results will depend on how accurately the estimated adjacency matrices represent the underlying reality. Therefore, this aspect is pertinent in a real-world application.

From the selected data, H = 4 weighted adjacency networks were built as follows. One of the most often used methods in gene network analysis is weighted gene correlation network analysis (WGCNA) [29], which can be used for finding clusters (modules) of highly correlated genes. Its initial step (after standard preprocessing, already done in the case of the Seurat dataset) is to build a weighted network from the expression data through a co-expression similarity matrix S = cor(E) (the correlation between the expression of genes across samples). Then, the weighted network adjacency is defined as W = S^β. The parameter β is chosen so as to provide W with a scale-free topology. The motivation is that real-world biological networks often show a scale-free topology, so an appropriate selection of β would provide a more accurate representation of reality. Given that definition through correlations, a standard step is to remove genes with near-zero variance across samples. From the 2000 genes in the Seurat data, we removed the ones with variance below a small threshold, yielding 1473 genes. From these, we selected the 1458 genes with the highest variance (corresponding to the 0.1 quantile). After that, each of the four expression data matrices E_h has 1458 rows (genes), with the number of columns (samples) depending on the condition.

Because we want to focus on the information that the application of the CDR method alone can provide, we will not make use of the enhanced representation of the adjacency matrix provided by WGCNA. We will set β = 1, or equivalently use the similarity matrix S itself.
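The WGCNA-style construction just described can be sketched with NumPy. This is a minimal sketch under our simplifying choice β = 1 (signed correlations kept; unsigned WGCNA variants would use |S|^β instead); the function and variable names are illustrative, and the random matrices merely stand in for the real expression data E_h.

```python
import numpy as np

def coexpression_adjacency(E, beta=1):
    """WGCNA-style weighted adjacency W = S**beta, with S = cor(E) computed
    across samples (genes in rows, samples in columns). With beta = 1 this
    is simply the signed gene-gene correlation matrix S itself."""
    S = np.corrcoef(E)  # shape: (genes, genes)
    return S ** beta

def concatenate_conditions(E_list, beta=1):
    """Stack the H adjacency matrices side by side: X = [W_1 W_2 ... W_H]."""
    return np.hstack([coexpression_adjacency(E, beta) for E in E_list])

# Toy example: 4 conditions, 50 genes, differing sample counts per condition.
rng = np.random.default_rng(0)
E_list = [rng.normal(size=(50, n)) for n in (30, 40, 35, 45)]
X = concatenate_conditions(E_list)  # shape (50, 4 * 50) = (50, 200)
```

In the paper's setting the same steps yield four 1458 × 1458 matrices W_h and a concatenated input X of size 1458 × 5832.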
Specifically, the four weighted network adjacency matrices are W_h = cor(E_h), with h = 1, ..., 4 (corresponding to Tc_ctrl, Tc_stim, cMono_ctrl and cMono_stim, respectively). Note that each of these four matrices has size 1458 × 1458. Entries of small magnitude were thresholded to zero, allowing more efficient use of algorithms for the decomposition of sparse matrices. Finally, the input for the CDR is the matrix X = [W_1 W_2 W_3 W_4] of size 1458 × 5832, the horizontal concatenation of the four W_h. The number of varimax-rotated vectors was 60. We experimented with different values (e.g., 40 and 100), with no impact on the following results.

The leading 8 rotated factor loading vectors Λ̂*_•k are shown in Fig. 9(A). We will sometimes refer to them simply by their indexes, k. A simple visual inspection of the magnitudes across conditions (i.e., the four segments of length 1458) allows one to interpret that for

1. k = 1, 2 and 5, segments 3 and 4 reflect a structure peculiar to cMono cells, which is stronger (in "a sense"; more on this later) in the control group than in the stimulated one.

2. k = 7, those segments 3 and 4 again reflect a structure peculiar to cMono cells, but one which is now equivalent in both stimulated and control groups.

3. k = 3, 4, 6 and 8, the first segment reflects a structure related only to Tc cells in the control group.

Therefore, we will use the list of statements above as a metaphorical "navigation map" to guide the creation of some hypotheses, which can be tested by using the original datasets E_h. But before that, we will explore some complementary information provided by the scatter plots between the leading k = 1, ..., 10 vectors, Fig. 9(B).

FIG. 9: Applying the CDR framework to gene co-expression networks. (A) Cell- and/or group-specific nuances are clearly suggested by the rotated loading vectors Λ̂*_•k. (B) Scatter plots Λ̂*_•k versus Λ̂*_•k′ (lower panel), with k = 1, ..., 10 and k′ ≠ k, show the enhanced interpretability (i.e., simple structure) provided by the varimax rotation as compared to the unrotated Λ̂_•k (top). (C) Correlations in the original expression data for the selected genes in the set L_1 show that they form a highly connected community specific to Tc cells in the control group. (D) As before, but for the genes from the set L_2. These genes are part of a more complex structure of highly connected nodes specific to cMono cells, "stronger" in the control group than in the stimulated one.

Note that, because consecutive nodes are not within the same (differential) community structure, the effects of the rotation cannot be seen by plotting the factor loading vectors before and after rotation, as done in the case of the synthetic network shown in Fig. 5. This makes the scatter plots a suitable alternative. Firstly, consider the effects of the varimax rotation in disentangling the information between the vectors. For instance, the rotation of a cross-shaped structure can be seen for (k_i, k_j) = (2, 3) (second row, third column of Fig. 9(B), top and bottom panels), meaning that the factor loading vectors Λ̂*_•2 and Λ̂*_•3 become uncorrelated after rotation; because of that, they should reflect cell-group-specific and distinct structures encoded in X, a feature that agrees with statements (1) and (3) above. Another effect is seen in the reduced number of circular clouds of points, which after rotation are replaced by linear structures parallel to the horizontal and/or vertical axes. Secondly, some scatter plots remain circular clouds even after rotation. For instance, the features captured by vectors k = 1 and 2, which are specific to cMono cells, are correlated; this suggests that those vectors could tell different nuances of the same differential structure that sets those cells apart from the Tc cells. So, the inspection of the scatter plots provides information complementary to the interpretation of the factor loading vectors.

Taken together, the information above provides guidance for the creation and exploration of a vast number of hypotheses. Next, we will develop and explore some possibilities, which we believe provide the most complete illustrative example of the power of an exploratory analysis through the CDR. Note that, in practice, this should be done side-by-side with an expert in the field related to the dataset.

Consider the k = 6 first segment, which should refer to a structure specific to Tc cells in the control group. Let L_1 be the set of the six genes associated with the nodes with the largest (absolute) factor loadings. That selection is highlighted with a rectangular (orange) box in Fig. 9(A), and the genes are L_1 = { ADAMDEC1, CCNA1, DEFB1, IL27, LILRA3, NRM }.
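Reading such a gene set off a loading vector amounts to ranking the genes within one condition's segment by absolute loading. A minimal sketch (with hypothetical names; the actual set was selected visually from Fig. 9(A)):

```python
def top_genes(loading_vector, gene_names, condition, n_genes, n_top=6):
    """Return the `n_top` genes with the largest absolute loadings inside the
    segment of `loading_vector` belonging to `condition` (0-based index).
    `n_genes` is the number of genes per condition segment."""
    start = condition * n_genes
    segment = loading_vector[start:start + n_genes]
    # Rank gene positions within the segment by loading magnitude.
    ranked = sorted(range(n_genes), key=lambda i: abs(segment[i]), reverse=True)
    return [gene_names[i] for i in ranked[:n_top]]
```

In the paper's setting, `loading_vector` would be one rotated Λ̂*_•k of length 4 × 1458, `n_genes = 1458`, and `condition = 0` for the k = 6 first segment.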
A larger marker size was used for all occurrences of those six genes, which allows one to verify that they appear (collectively) with a large loading only in the k = 6 first segment. However, the genes CCNA1 and IL27 also appear with a large loading in the k = 1 segment 3, a condition that should reflect a structure specific to cMono cells in the control group. Based on what we have discussed in the example of the synthetic data (Sec. III A), those observations suggest the following inter-connected hypotheses:

I. The L_1 genes form a highly connected community, which is specific to the Tc cells in the control group.

II. That community is present solely for the Tc cells in the control group.

III. There is another structure L_2 (see the purple rectangular selection in Fig. 9(A)), specific to cMono cells in the control group, of which genes CCNA1 and IL27 are members.

IV. The structural role of genes CCNA1 and IL27 within community L_1 is different from their structural roles in L_2.

Remark. The set L_2 contains eleven genes: IFIT3, IFIT1, RSAD2, CTSC, IL27, HSPA1A, HSPB1, HRASLS2, SCIN, CCNA1, MASTL.

Those hypotheses can be tested by estimating the strength of co-expression (correlations) of the gene sets L_1 and L_2 within each original expression dataset E_h. Figures 9(C)-(D) show the correlations for L_1 and L_2, respectively, which strongly agree with the hypotheses above. Specifically, the following features can be seen. (i) The six L_1 genes have a highly correlated co-expression solely for the Tc cells in the control group, hence forming a differential community. (ii) There is a slightly higher correlation between genes CCNA1 and IL27 for the cMono cells in the control group, Fig. 9(C), showing that these two genes can have a different structural role in another community (related specifically to cMono cells in the control group), which we will track now. To allow better visualization of the community structure, the four heat-maps in Fig. 9(D) were plotted after rearranging their columns and rows based on the hierarchical clustering dendrogram (HCD) of the cMono control group. In the third heat-map of Fig. 9(D), (iii) the eleven genes in L_2 show a somewhat more complex community structure (i.e., 3 or 4 small communities, without a crisp frontier, forming a larger community). As suggested by the ordering provided by the HCD, the co-expression between genes CCNA1 and IL27 forms a link between two different sub-communities (one of them formed by the last 4 genes).

Nevertheless, the last two heat-maps in Fig. 9(D) can be used to test another hypothesis. Previously, we mentioned that for k = 1 the segments 3 and 4 should reflect not only a structure more peculiar to the cMono cells but also that this structure should be stronger (in a sense) in the control group than in the stimulated one. Figure 9(D) shows that this is indeed the case, and that the structure is stronger in two senses: (a) there are links between the sub-communities in the control group, but they are broken in the stimulated group (for which the heat-map suggests the existence of two sub-communities, one with genes IFIT3, IFIT1, RSAD2, and the other with genes HSPA1A, HSPB1); (b) the third sub-community in the control group (with genes HRASLS2, SCIN, CCNA1, MASTL) disappears in the stimulated group.

Taken together, those results support the conclusion that the "story-telling"-like information provided by the CDR approach can guide one to recover subtle and nuanced information that is otherwise spread across the original datasets.

IV. CONCLUSION
In this paper, we have proposed a theory and method (CDR) for the characterization of shared "structural roles" of nodes simultaneously within and between networks, whose outcome is a highly interpretable map. Without loss of generality, for the sake of presentation we have assumed that each network represents a different experimental condition. They could equivalently be representations of a time-evolving network (i.e., at different time points), layers in a multiplex network, etc.

Supported by a simple and transparent theory, rooted in the factor analysis framework, the method provides the flexibility to address research-field-specific questions. This is accomplished by defining the scientifically meaningful characteristic (or relevant feature) of a node for the problem at hand, and then mapping it to an appropriate mathematical similarity construct to estimate the proximity from measured data. In the context of differential network analysis, with communities of highly connected nodes, the method was illustrated in this paper by taking as the relevant feature the similarity of the lists of neighbours between nodes, captured by the notion of proximity through the inner (vector) product.

The insights provided by the method and its accuracy have been explored in numerical benchmarks generated by a stochastic block model. In the scope of non-overlapping communities (which could be present or absent in a given condition), the results have shown the method's high accuracy despite very different (i) community sizes, (ii) total numbers of communities within a given condition and (iii) numbers of networks being compared (i.e., experimental conditions).

In particular, we have argued that the strength of the method is its "story-telling"-like characterization of the differential network structure encoded in a set of networks. This was illustrated with an exploratory analysis of gene expression datasets, a context called differential co-expression network analysis.
In particular, applying the CDR to the co-expression networks from two cell types × two treatments allowed us to elaborate hypotheses related to unique and subtle nuances of the structural differences between the networks. Those hypotheses were then tested and confirmed on the original dataset.

Our vision is of a virtuous cycle, where the CDR is used as a hypothesis generator by pinpointing unexpected differential structures within the data, leading to further investigations and providing new insights.

The analysis of genetic data was used, as well, to illustrate how the fundamental underlying ideas of the CDR could be applied over and above the other methods that have been developed within different scientific communities. For this, it was applied on top of (a simplified version of) weighted gene correlation network analysis. Notwithstanding, due to (i) the flexibility of the CDR and (ii) the several ways in which the theoretical basis of factor analysis allows one to extract the factor loadings (here we have chosen PCA), those ideas could even be used as a bridge between the several methods of network analysis (from different communities) to nurture a deeper theoretical understanding. These and other possibilities we leave for future work.

ACKNOWLEDGMENTS
LLP and MS acknowledge financial support from the Australian Research Council Discovery Grant (DP180100718). This work was partly supported by the Pawsey Supercomputing Centre with funding from the Australian Government and the Government of Western Australia.
[1] H. A. Chowdhury, D. K. Bhattacharyya, and J. K. Kalita, IEEE/ACM Transactions on Computational Biology and Bioinformatics, 1 (2019).
[2] A. R. Sonawane, S. T. Weiss, K. Glass, and A. Sharma, Frontiers in Genetics, 1 (2019), arXiv:1903.05449.
[3] S. van Dam, U. Võsa, A. van der Graaf, L. Franke, and J. P. de Magalhães, Briefings in Bioinformatics, bbw139 (2017).
[4] R. Anglani, T. M. Creanza, V. C. Liuzzi, A. Piepoli, A. Panza, A. Andriulli, and N. Ancona, PLoS ONE, e87075 (2014).
[5] D. Amar, H. Safer, and R. Shamir, PLoS Computational Biology, e1002955 (2013).
[6] L. L. Portes and M. Small, Physical Review E, 042218 (2019).
[7] L. L. Portes and L. A. Aguirre, Physical Review E, 052216 (2016).
[8] L. L. Portes and L. A. Aguirre, Chaos: An Interdisciplinary Journal of Nonlinear Science, 093112 (2016).
[9] A. Groth and M. Ghil, Physical Review E, 036206 (2011).
[10] X. Xiao, A. Moreno-Moral, M. Rotival, L. Bottolo, and E. Petretto, PLoS Genetics, e1004006 (2014).
[11] M. Vejmelka and M. Paluš, Chaos: An Interdisciplinary Journal of Nonlinear Science, 033103 (2010).
[12] M. E. J. Newman, Physical Review E, 052315 (2016).
[13] S. Fortunato and D. Hric, Physics Reports, 1 (2016), arXiv:1608.00163.
[14] B. Ball, B. Karrer, and M. E. J. Newman, Physical Review E, 036103 (2011), arXiv:1104.3590.
[15] I. Koch, Analysis of Multivariate and High-Dimensional Data, Cambridge Series in Statistical and Probabilistic Mathematics (Cambridge University Press, 2013).
[16] S. A. Mulaik, The Foundations of Factor Analysis, 2nd ed. (Taylor & Francis Group, New York, NY, 2010) p. 524.
[17] P. W. Holland, K. B. Laskey, and S. Leinhardt, Social Networks, 109 (1983).
[18] A. A. Hagberg, D. A. Schult, and P. J. Swart, in Proceedings of the 7th Python in Science Conference, edited by G. Varoquaux, T. Vaught, and J. Millman (Pasadena, CA, USA, 2008) pp. 11-15.
[19] P. Erdős and A. Rényi, Publ. Math. Inst. Hung. Acad. Sci., 17 (1960).
[20] L. L. Thurstone, Psychological Review, 406 (1931).
[21] H. F. Kaiser, Psychometrika, 31 (1974).
[22] M. B. Richman, Journal of Climatology, 293 (1986).
[23] E. Schubert, J. Sander, M. Ester, H. P. Kriegel, and X. Xu, ACM Transactions on Database Systems (TODS), 1 (2017).
[24] M. Ester, H.-P. Kriegel, J. Sander, X. Xu, et al., in KDD, Vol. 96 (1996) pp. 226-231.
[25] H. M. Kang, M. Subramaniam, S. Targ, M. Nguyen, L. Maliskova, E. McCarthy, E. Wan, S. Wong, L. Byrnes, C. M. Lanata, R. E. Gate, S. Mostafavi, A. Marson, N. Zaitlen, L. A. Criswell, and C. J. Ye, Nature Biotechnology (2017), doi:10.1038/nbt.4042.
[26] T. Stuart, A. Butler, P. Hoffman, C. Hafemeister, E. Papalexi, W. M. M. III, Y. Hao, M. Stoeckius, P. Smibert, and R. Satija, Cell, 1888 (2019).
[27] S. Choobdar, M. E. Ahsen, J. Crawford, M. Tomasoni, T. Fang, D. Lamparter, J. Lin, B. Hescott, X. Hu, J. Mercer, T. Natoli, R. Narayan, A. Subramanian, J. D. Zhang, G. Stolovitzky, Z. Kutalik, K. Lage, D. K. Slonim, J. Saez-Rodriguez, L. J. Cowen, S. Bergmann, and D. Marbach, Nature Methods, 843 (2019).
[28] F. Liesecke, J.-O. De Craene, S. Besseau, V. Courdavault, M. Clastre, V. Vergès, N. Papon, N. Giglioli-Guivarc'h, G. Glévarec, O. Pichon, and T. Dugé de Bernonville, Scientific Reports, 14431 (2019).
[29] P. Langfelder and S. Horvath, BMC Bioinformatics, 559 (2008).