Abstract

Understanding the dependence and interplay between architecture and function in biological networks has great relevance to disease progression, biological fabrication and biological systems in general. We propose methods to assess the association of various microbe characteristics and phenotypes with the topology of their networks. We adopt an automated approach to characterize metabolic networks of 32 microbial species using 11 topological metrics from complex networks. Clustering allows us to extract the indispensable, independent and informative metrics. Using hierarchical linear modeling, we identify relevant subgroups of these metrics and establish that they associate with microbial phenotypes surprisingly well. This work can serve as a stepping stone to cataloging biologically relevant topological properties of networks and towards better modeling of phenotypes. The methods we use can also be applied to networks from other disciplines.

Full PDF

aa r X i v : . [ q - b i o . Q M ] O c t Strong associations between microbe phenotypes and their network architecture

Soumen Roy

1, 2 and Vladimir Filkov Dept of Medicine and Institute for Genomics and Systems Biology, The University of Chicago, Chicago, IL 60637 Dept of Mechanical and Aerospace Engineering, University of California, Davis CA 95616 ∗ Dept of Computer Science, University of California, Davis CA 95616 † Understanding the dependence and interplay between architecture and function in biological net-works has great relevance to disease progression, biological fabrication and biological systems ingeneral. We propose methods to assess the association of various microbe characteristics and phe-notypes with the topology of their networks. We adopt an automated approach to characterizemetabolic networks of 32 microbial species using 11 topological metrics from complex networks.Clustering allows us to extract the indispensable, independent and informative metrics. Using hi-erarchical linear modeling, we identify relevant subgroups of these metrics and establish that theyassociate with microbial phenotypes surprisingly well. This work can serve as a stepping stone tocataloging biologically relevant topological properties of networks and towards better modeling ofphenotypes. The methods we use can also be applied to networks from other disciplines.

PACS numbers: 87.18.Vf, 89.75.Hc, 07.05.Rm

A prime goal of systems biology is to discover emer-gent properties that may be unraveled when a systemicview is adopted to gain a comprehensive understanding ofmany processes that occur in biological systems. The re-ductionist approach which has held sway in biology overthe past several decades has successfully identiﬁed thekey components in living systems and many interactionsamong them. However, it almost never presents a holisticunderstanding of how the systemic properties emerge. Itis now becoming increasingly clear that the functioningof biological systems depends crucially on their complexunderlying structure [1]. This complexity is the conse-quence of numerous interconnected, dynamic and non-linear interactions among the plethora of elements, likegenes, proteins, and metabolites. But the importance ofbiological networks lies beyond their being the most vis-ible signatures of complexity. Understanding the depen-dence and interplay between architecture and function inbiological networks has great relevance to disease progres-sion, bio-fabrication and biological systems in general.The central issue, then, is to discover whether net-works encode systemic events and the precise manner inwhich they do so. Ideally, we would like to understandand modify the complex behavior of biological networks,which is contingent on the proper level of modeling oftheir molecular interactions. To model the systemic oremergent properties, one would have to involve critically,the interdependencies among interactions and other or-ganizational patterns, on a local level (e.g. network mo-tifs) as well as global level (e.g. modularity). Recentresearch in complex systems and networks has presentedopportunities to properly mine and thence exploit thearchitectural interdependence in networks [2, 3, 4].Multiple metrics exist in complex networks and vari- ∗ Electronic address: [email protected] † Electronic address: ﬁ[email protected] ous studies have utilized one or few of them at a time,to characterize biological networks. Signiﬁcant researchhas been done to examine various topological propertiesof diﬀerent networks using computational and analyti-cal methods. It has been found that many biologicalnetworks (just like other empirical networks) may havepower-law degree distributions [5], are modular [6] andhierarchical [7], and have speciﬁc distributions of topo-logical features which can be used to characterize them[8, 9, 10]. In addition, topological properties have beenused to predict missing edges in networks [11] and via-bility of mutant strains [12].In this rapid communication, we show that varioustopological metrics (which are the signature of complexnetwork architecture), associate with microbe character-istics and phenotypes to a surprisingly high degree. Weundertake an automated approach using various topolog-ical metrics from complex networks to characterize a col-lection of various kinds of biological networks and showhow these metrics associate strongly with microbe char-acteristics. Speciﬁcally, (i) using publicly available datawe collect and cross-reference metabolic networks for 32diﬀerent microbes via ten diﬀerent quantiﬁable charac-teristics and phenotypes, (ii) we use a suite of 11 com-plex network metrics, so as to comprehensively compareall 32 networks simultaneously, allowing for a much morein-depth evaluation of network models [13] than is pos-sible with the usually existing practice of comparing oneor two particular properties, most commonly the degreedistribution, (iii) we show that most of the network met-rics we use are independent and that multiple metricsare necessary to characterize the variability in networksmeaningfully, (iv) via a hierarchical linear modeling ap-proach, we identify subsets of network parameters whichassociate strongly with various microbe characteristicsand phenotypes. By presenting these strong associationsand exhibiting the necessity of multiple metrics to do so,this work is a step forward toward a systemic catalogingof the methods and properties of biological networks thatare relevant to the underlying biology, and towards bettermodeling of emergent biological properties.The microbe characteristics or phenotypes that are ex-plored in this work are: (1) microbe class (MC), (2)genome size (GS), (3) GC content (GC), (4) modular-ity ( Q ), (5) number of such modules ( N Q ), (6) motil-ity ( M O ), (7) competence( CO ) and whether these mi-crobes are (8) animal pathogens ( AP ), (9) strict anaer-obes ( AN ), or, (10) extremophiles ( EX ).Microbes are normally classiﬁed as archaea or bacteria[14]. Genome size alludes to the sum total of DNA con-tained within one copy of a genome. The usual measureof it is in terms of mass in picograms or the total num-ber of nucleotide base pairs (commonly as millions of basepairs, or megabases). Intriguingly, an organism’s genomesize is not directly proportional to its complexity and afew microbes have much more DNA compared to othermicrobes. In this context, it is interesting to point outthat the association between genome size and topologi-cal metrics of the networks are among the strongest of allphenotypes explored in this work. GC content is the per-centage of nitrogenous bases on a DNA molecule which iseither cytosine or guanine (and not thymine or adenine).Data for genome size and GC content was obtained fromthe NCBI Entrez genome project database [15]. Withregard to biological networks, modularity is deﬁned asthe fraction of edges within modules less the expectedfraction of such edges. We use a recent algorithm [17] indetermining the community-structure in networks whichincorporates edge directionality. Until recently, the mostcommon approach to modularity in complex networksliterature has been to simply ignore edge direction andapply methods developed for community discovery inundirected networks. However, this discards potentiallyuseful information contained in edge directions, whichis most commonly a very biologically relevant criterion.It should be noted that modularity is intimately con-nected to function in biology as the modules typicallycorrespond to genetic circuits or pathways [6, 16]. There-fore, we include it here as a phenotypic property ratherthan as a variable. In scenarios where modularity lacksapparent connection to function, it is more appropriateto treat Q and N Q as input variables.Motility allows microbes to move toward desirable en-vironments and away from undesirable ones. Compe-tence denotes the ability of a cell to take up extracel-lular DNA from its environment. Anaerobic organismsare those that do not require oxygen for growth and mayeven die in its presence. Extremophiles are organismswhich thrive in or require extreme physical or geochemi-cal conditions, in which majority of life on Earth cannotsurvive. Data for phenotypes (6) to (10) have been com-piled from Ref. [18]. While GS, GC, Q, N Q can take onany value, the rest of the microbe characteristics or phe-notypes are “binary” (e.g., a microbe is either an archaeaor a bacteria; either aerobic or anerobic, etc.).We used metabolic networks of 32 diﬀerent microbesbased on data deposited in the WIT database [19]. This FIG. 1: The heatmap over network metrics database contains metabolic pathways that were pre-dicted using the sequenced genomes of several organisms.The nodes in these networks are enzymes, substrates andintermediate complexes, while edges represent sequencesof reactions in the organism’s cells. (We had to excludethe following three microbial species:

A. actinomyc., R.caps. and

M. thermoautot. , from the original collectionbecause many of the microbe characteristics or pheno-typic data do not seem to be publicly available for them.)The network sizes vary from 595 nodes and 1354 edgesto 2982 nodes and 7300 edges.We calculated a suite of 11 important complex networkattributes across all 32 networks. These are: the numberof nodes , N , and edges in the network and the ﬁrst threestandardized moments (mean, standard deviation, andskewness) of the distributions of geodesic [20], between-ness coeﬃcient [21], and, degree, of the network; respec-tively denoted as geo , geo , geo , betw , betw , betw , deg , deg , deg . The importance of studying the highermoments of distributions is well known in physics [22].The Geodesic was calculated by using the Dijkstra Al-gorithm [20]. For normalization, we subtract the meanvalue of a metric (over all species) and then divide by thestandard deviation of the metric (over all species), for allnetworks. Some of our metrics are robust to measure-ment errors. Observing the system (i.e. network) frommultiple angles, provides a measure of robustness againstnoise (false positives and false negatives).The degree of overlap, or dependence, between the at-tributes when characterizing networks can be accuratelyassessed by a symmetric heatmap, showing the pairwisecorrelations of the network metrics over all the networks.Fig. 1 shows the heatmap over network attributes. Westart with a 32 dimensional vector (which is the numberof microbes studied) for each of the 11 metrics. Thus, wehave 11 points in the 32 dimensional vector space. Wethen calculate the correlation between all pairs of these11 points, and color-code the distance. White indicatesperfect correlation while black indicates anti-correlation.

Range (min,max) ρ best h ρ random i p-value Best Model VariablesMC Binary 0.113 0.507 < × − N, edges, geo , geo , geo , betw , betw , betw , deg GS (0.58, 6.3) 0.476 1.302 < − N, edges, betw , betw , betw , deg , deg GC (28.2, 66.6) 0.763 1.158 < . × − N, edges, geo , geo , geo , betw Q (0.59, 0.69) 0.005 0.033 < − N, edges, geo , geo , betw , betw , deg , deg N Q (14, 35) 2.102 6.413 < − N, edges, geo , geo , geo , betw , deg , deg MO Binary 0.315 0.577 < . × − N, edges, betw , deg , deg , deg CO Binary 0.158 0.683 < × − N, edges, geo , geo , geo , betw , betw , deg , deg , deg AP Binary 0.325 0.567 < − geo , geo , betw , deg , deg AN Binary 0.359 0.495 < . × − edges, geo , geo , betw , betw , betw , deg EX Binary 0.284 0.540 < − geo , geo , betw , deg , deg , deg TABLE I: Exploring the association of microbe characteristics and phenotypes with network metrics: Microbe class (MC);Genome size (GS); GC content (GC); Modularity ( Q ); Number of modules( N Q ); Motility (MO); Competence (CO); andwhether the microbes are Animal Pathogens (AP), Strict Anaerobes (AN) or Extremophiles (EX). The rows (and by symmetry the columns) are arrangedautomatically so that the rows most similar are placednext to each other, as determined by the hierarchicalclustering algorithm implemented in the heatmap pack-age of the R system [23] (as any other clustering scheme,this one too has its limitations, e.g. in the placementof the edges and nodes columns, which could arguablybe swapped). Thus, the map allows us to identify clus-ters of “similar” network attributes by looking for blocksof light-colored squares along the diagonal of the ﬁgure.Since there is only a small amount of clustering alongthe diagonal, it follows that the network attributes wehave chosen are relatively independent, and thus, theyall provide information to our analysis.To ﬁnd how well the organism phenotype associateswith the underlying network architecture, we considerour 11 network metrics (which can be regarded as char-acteristics of the architecture) and model each phenotypeas a linear combination of these metrics. It should be es-pecially noted that the basis of linear modeling is notto imply that the dependent variables are the cause andthe explanatory variables are the eﬀect, but that thereis a signiﬁcant association between these variables. An-ticipating that not all metrics will be pertinent to eachphenotype, and, in general, to avoid over-ﬁtting we use hierarchical linear regression methods to model the phe-notypes as linear combinations of subsets of the networkmetrics. To identify the best model we start by assum-ing a linear dependence on all 11 variables, because wedo not know initially which ones associate better thanothers. We then iteratively proceed to exclude variableswhose absence improves or does not signiﬁcantly alter thequality of the resulting model (we used a speciﬁc imple-mentation of this iterative procedure through the step() function of the R system [23]). The model selection isguided by minimizing the well-known Akaike Informa-tion Criterion [24] denoted here as α , a standard measurein statistics allowing for selection among various nestedmodels. α scores a model based on its goodness-of-ﬁt tothe data and penalizes models having many parameters. If k is the number of parameters in the statistical model,and L is the maximum log-likelihood for the estimatedmodel, α is deﬁned as: α = 2 k − L ) (1)Thus, we identify the smallest number of independentand indispensable network metrics that can be associ-ated with the microbe characteristic or phenotype. Theresults for the best model for each phenotype are given inTable 1. We use the Root Mean Square Error, ρ , whichis a measure of the goodness-of-ﬁt of our model associ-ations and the experimental data. ρ of an estimator ˆ X with respect to the estimated parameter X is deﬁned asthe square root of the mean squared error, ρ ( ˆ X ) = q E (( ˆ X − X ) ) . (2)We also report the signiﬁcance of the best model, whichwe obtain by the linear hierarchical modeling procedurediscussed above by bootstrapping with respect to thesame model and using a random permutation of the ob-served data. We measure the ρ of these random mod-els, ρ random , and how many times (or whether at all) ρ random < ρ best , where ρ best is obviously the ρ of the bestmodel. The number of times this happens is reﬂected inthe normalized signiﬁcance. We observe 10 such ran-dom permutations, for each microbe phenotype. We alsoperformed an analysis of variance (ANOVA) of the diﬀer-ence of our model with fewest dependent variables versusthe model with all 11 variables, and the diﬀerence wasnot signiﬁcant.For half of the microbe phenotypes in this study (GS, Q , N Q , AP and EX), we do not come across a singleinstance where the ρ random is less than ρ best for thatphenotype. For each of these ﬁve phenotypes and alsofor the rest of the ones considered in this study, ρ best isalways less than h ρ random i , with very low p-values. This,thus indicates a strong association of organism pheno-types with the relevant network metrics, in general.There are some other facts which are observable fromTable 1: (i) there is no supremely important single metricassociated with each and every phenotype studied here,and, (ii) in the present study, this of course rules outone or more set of metrics associated with more thanone phenotype(s). Albeit, the latter occurrence does notautomatically follow from the former if one or more met-rics is consistently observed to be associated with allphenotypes. These facts, however, attest to the indis-pensability of the simultaneous study of multiple networkmetrics. It is notable that the association patterns arenon-trivial, even when the microbe phenotype or char-acteristic is simply binary, as opposed to the case, whenit possesses a range of values. The dependence of theprediction quality on the number of metrics is also notreadily ascertainable. For example in AP, ﬁve of the 11metrics seem to be needed for suﬃcient representation,while eight are required for Q and N Q . However, with sixmetrics for GC and MO, or nine for MC the predictionquality is apparently not enhanced.Interestingly, the orthogonality of the geodesic and be-tweenness metrics which we established before is reﬂectedby their consistent appearance in the association results.It is entirely possible that the association of other net-work metrics, which are not a part of this study, withthese or other phenotypes or organism characteristicscould be particularly strong. Exhaustive studies with theinclusion of such metrics should bear out this fact. Herewe focused on metrics that have been shown to be biolog-ically pertinent. The approaches adopted here are scal-able and can easily accommodate other important met-rics, which could be unraveled in future as a result of thecontinuous ongoing research in network theory.The importance of this study is in justifying that suit-ably identiﬁed groups of network metrics, can and shouldbe used to meaningfully model and study organism char-acteristics. Most immediately, the results can be usedto build more sophisticated and even predictive modelsof organism phenotypes, based on their network archi- tecture. These results are also a good starting point forclassiﬁcation or cataloging of biologically relevant topo-logical features, that can eventually yield vocabularieswhich cross-reference topology with biological function.While still far away, we expect such tabulated and well-described architectural features to be akin to biologicalmarkers in other empirical data. In this sense, our workis a modest step toward understanding the precise na-ture of interdependence between function and topologyin biological networks. Followup modeling and simula-tions could give valuable insight into a wide range of far-reaching issues like the eﬀect of topology on the designand evolution of networks. The comprehensive “look-upscheme”, elucidated with the present set of biological net-works, could also be helpful for other real-world complexnetworks in general. Of course, the measures need not bethe same as those above and will depend on the natureand topology of the network.It is well known that various centrality measures playan important role in networks and in some cases (e.g.,in the global airline network [25]), few nodes which haverelatively low degree but high betweenness could be veryspecial. Nodes with high betweenness can act as bot-tlenecks for information passage and the role of between-ness is well known in epidemiology, information and wire-less or sensor networks. The role of betweenness in bio-logical networks is being thoroughly exploited in recenttimes [26]. However, to our knowledge, there is almostno in-depth work in literature, investigating the role ofhigher moments of the betweenness distribution in bio-logical networks. The present work underlines the im-portance of such studies.We are grateful to Z. Saul for help in data gatheringfor some of our networks. We thank E. Leicht and M.Newman for providing us the codes on modularity struc-ture in directed networks. This work was funded in partby the NSF under Grant No. IIS-0613949. [1] P Oliveri, Q Tu, EH Davidson, Proc Natl Acad Sci USA, , 5955 (2008)[2] R Sharan and T Ideker, Nat. Biotech., , 427 (2006)[3] RB Russel and P Aloy, Nat. Chem. Biol., , 666 (2008)[4] T Koide, WL Pang and NS Baliga, Nat. Rev. Microbiol-ogy, , 297 (2009)[5] H Jeong et al. , Nature, , 650 (2000).[6] N Kashtan and U Alon, Proc Natl Acad Sci U S A. ,13773 (2005)[7] M Sales-Pardo, R. Guimer`a, AA Moreira, and LAN Ama-ral Proc Natl Acad Sci U S A., et al. , Science, , 1538 (2004)[9] R Milo et al. , Science, , 824 (2002)[10] ZM Saul and V Filkov, Bioinformatics, , 2604 (2007)[11] A Clauset, C Moore and MEJ Newman, Nature, , 98(2008)[12] Z Wunderlich, and LA Mirny, Biophys J., , 2304 (2006)[13] V Filkov, ZM Saul, S Roy, RM D’Souza and PT Devanbu,Europhysics Letters, , 28003 (2009)[14] CR Woese and GE Fox, PNAS, , 5088 (1977) [15] [16] G Schlosser and G Wagner, Modularity in Developmentand Evolution. The Univ. Chicago Press, Chicago (2004)[17] EA Leicht and MEJ Newman, Phys. Rev. Lett, ,118703 (2008)[18] AH Singh, DM Woolf, P Wang and AP Arkin, Proc. Natl.Acad. Sci. USA, , 7501 (2008)[19] R Overbeek et al. , Nucleic Acids Res. , 123 (2000)[20] EW Dijkstra, Numerische Mathematik, (1959), S 269.[21] LC Freeman, Sociometry, , 35 (1977)[22] JP Bouchaud and A Georges, Physics Reports [24] H Akaike, IEEE Trans. on Automatic Control, , 716(1974)[25] R Guimer`a, S Mossa, A Turtschi, LAN Amaral, Proc.Natl. Acad. Sci. USA, , 7794 (2005)[26] J Liu et al. , Science,323