A comprehensive statistical study of metabolic and protein-protein interaction network properties
aa r X i v : . [ q - b i o . M N ] D ec A comprehensive statistical study of metabolicand protein-protein interaction networkproper ties
D. Gamermann , J. Triana , and R. Jaime Department of Physics, Universidade Federal do Rio Grande do Sul (UFRGS) - Instituto de F´ısica , Av.Bento Gonc¸alves 9500 - Caixa Postal 15051 - CEP 91501-970 - Porto Alegre, RS, Brasil. Universidad Polit´ecnica Salesiana - sede de Guayaquil, Ecuador Universidade Hermanos Sa´ız Montes de Oca - Pinar Del Rio, Cuba ✦ Abstract —Understanding the mathematical properties of graphs under-ling biological systems could give hints on the evolutionary mechanismsbehind these structures. In this article we perform a complete statisticalanalysis over thousands of graphs representing metabolic and protein-protein interaction (PPI) networks. First, we investigate the quality of fitsobtained for the nodes degree distributions to power-law functions. Thisanalysis suggests that a power-law distribution poorly describes the dataexcept for the far right tail in the case of PPI networks. Next we obtaindescriptive statistics for the main graph parameters and try to identifythe properties that deviate from the expected values had the networksbeen built by randomly linking nodes with the same degree distribution.This survey identifies the properties of biological networks which arenot solely the result of their degree distribution, but emerge from yetunidentified mechanisms other than those that drive these distributions.The findings suggest that, while PPI networks have properties thatdiffer from their expected values in their randomized versions with greatstatistical significance, the differences for metabolic networks have asmaller statistical significance, though it is possible to identify some drift.
Keywords : graphs, biological networks, degree distribu-tion, PPI networks, metabolic networks, scale-free networks
NTRODUCTION
Networks are an intuitive way to pictorially represent ele-ments and their interactions in many complex systems. Ontop of its visual appeal, graph theory is a well establishedmathematical field which allows these structures to be quan-titatively analyzed and have their properties objectivelyevaluated. As soon as the abstract graph theory began to beapplied in order to describe real world networks, it becameclear that graphs representing real systems differed vastlyfrom what would be expected from the naive ¨Erdos-R´ennyrandom model for networks studied in the fifties [1].Graphs representing many different real world systemssuch as for example, author citations relations [2], biologicalnetworks [3], [4] or flight connections [5] present many simi-lar topological properties that significantly distinguish them [email protected] from ¨Erdos-R´enny random graphs. Some of these commoncharacteristics, often claimed to be ubiquitous in real worldnetworks, are the presence of hubs (a few highly connectednodes), the so called small world property [6], scale-freenessor self-similarity as a consequence of the network’s nodesdegree distribution often be similar to a power-law function( p ( k ) ∝ k − γ ) [7], [8], [9], high clusterization and hierarchicalorganization [10].Biological systems are the result of evolution and na-tural selection. Therefore, the characteristics one observesin biological networks are indications of the evolutionarypressures under which these systems developed. The densetails of the degree distributions in these networks, for ex-ample, have been suggested to give rise to the robustness ofthese systems against random node deletions [11], [12], [13],which would reflect the fact that live organisms are resilientto random mutations or to deprivation situations. In fact,many works study evolutionary models that attempt togenerate graphs with similar degree distributions than theones observed in real systems. These models define simplerules for the growth of a network and study the assymptoticbehaviour of an evolving graph. In the preferential attach-ment model [14], [15], for example, a new node is morelikely to connect to already highly connected nodes in thenetwork, while in the duplication and divergence model [16]an existing node is first copied with the same connectionsand then may have its connections altered. These worksusually focus in obtaining a set of rules that will generaterandom graphs with degree distributions similar to theones observed in real systems. Therefore, it is important todetermine the characteristics of these distributions on onehand and whether the degree distribution alone is enoughto replicate other common characteristics of these networkssuch as high clusterization.Though it is a very common claim that biological net-works are scale-free (meaning that their node’s degree distri-bution follows a power-law function), there are some studiesthat dispute this conclusion [17], [18]. Most works that fit a power-law to the degree distribution of a given network,overlook the quality of the fit. However, in [19], an objectivestatistical study of several different real world networks ismade in order tho tackle this issue. In the study, the authorsuse robust statistical tools not only to fit the distributions butalso to evaluate the quality of the fits (p-value), sheddinglight on which networks representing real systems mightand might not be called scale-free. In order to do that, theauthors in [19] chose, for each real world system, a singlerepresentative network and analyze the fit obtained for itsdegree distribution. Though a single metabolic and a singleprotein-protein interaction (PPI) network are analyzed inthe paper, these data sets composed of single elements arenot enough to extract strong general conclusions about thesebiological systems. In [20] a version of the duplication anddivergence model is analytically studied and it is observedthat although it is possible to find parameter regions wherethe degree distribution will have a denser right tail, thisdistribution has a peak for low degrees indicating that itcan not be fitted to a monotonically decreasing power-lawfunction if one considers its whole range.The present study has performed two systematical anal-ysis over two huge data sets of biological networks: morethan 3000 metabolic and over 1000 PPI networks. Theobjective of the first analysis is to determine whether thedegree distributions of the networks as a whole or theirright tails may be well described as scale-free (power-law)functions. For this, we use the same tools as in [19] to fitthe degree distribution of each network to a power-lawand standard statistical tools in order to assess the qualityof the obtained fits. In a second analysis, the main graphattributes of each network are evaluated, which allowsus to draw a general picture about the characteristics ofthese graphs for a wide range of organisms. In order toidentify which properties are simply expected given thedegree distributions of the networks and which ones arethe result of other possible evolutionary pressures overthe underlying biological systems, the same properties arecomputed over randomized versions of each network thatpreserve their degree distributions. With this study, it ispossible to evaluate the differences between the property inthe real networks with respect to their randomized versions,to assess the statistical significance on the existence of suchdifferences and, therefore, to identify attributes that are notonly a consequence of the network’s degree distribution.The work is organized as follows: in the next sectionwe define the graphs we study and from where the datain order to build them was retrieved. The section after thatis dedicated to describe all the analysis done. Finally, wepresent the results and discussion and in the last sectiona brief overview and our conclusions. We also include anappendix on the distribution of a network characteristic inits randomized versions. ATA AND G RAPHS
In this section we define the graphs we analyze and describethe procedures followed in order to obtain the data fromwhich we build the networks.
1. Not only driving the degree distribution.
For an organism, we define its metabolic network as theundirected and unweighted graph resulting from con-necting the molecules or metabolites appearing in itsmetabolism based on the biochemical reactions that keepits cells (or cell) alive. Two metabolites (nodes in the graph)are connected if they appear as a substrate-product pair inany chemical reaction in its metabolism.Therefore, the data needed to build the metabolic net-work for one organism is the list of all biochemical re-actions that can be found in its metabolism. This datawas obtained from the Kyoto Encyclopedia of Genes andGenomes (KEGG) database [21], [22]. First, a list of all genesin an organism is obtained, from this list of genes, thoseannotated as coding for enzymes are identified, as wellas the corresponding chemical reactions catalyzed by eachenzyme. In a complementary step, the pathways identifiedin each organism are obtained and the corresponding
KGML files (KEGG Markup Language) are retrieved. These filesallow one to identify the non-enzymatic reactions associatedwith known pathways. After the retrieval process, one has alist of chemical reactions from which one builds the networkby listing all single metabolites appearing in the reactionsand stipulating an undirected link between two metaboliteswhenever they appear in opposite sides of a reaction (assubstrate-product).An automated python script which connects to thedatabase rest API was written, in order to obtain KEGG’s listof organisms and run through it retrieving the data neededto build the networks [23]. The networks for 3481 organismswere successfully build by the procedure.
In a protein-protein interaction network, every proteinfound in an organism’s proteome represents a graph nodeand two nodes are linked if the proteins have some kind ofinteraction between them.Data for the production of PPI networks was down-loaded from the STRING database [24]. From this database,for hundreds of organisms, one obtains lists of pairs ofproteins present in the organisms and several scorings foreach pair representing the confidence on the existence ofsome interaction between them (different scores are asso-ciated to different sources of evidence for the existence ofthe interaction). For all organisms downloaded, we builtthe network for each organism by setting the threshold onthe minimal confidence level that an interaction must havein order to define an undirected link in the network. Thethreshold considered was 0.90 (in a range between 0 and 1)for the combined score.With this procedure we built 1073 PPI networks.
TATISTICAL A NALYSIS
In this section we explain the graph parameters and char-acteristics that were analyzed in each network and thestatistical tools employed in the analysis.The theory on measurements related to graphs and thestudy of network characteristics and parameters can befound in several books and reviews. See, for example, [15],[25].
It is often claimed that biological networks have scale-free(power-law) node degree distribution. The discrete power-law distribution has the form: p ( x/γ, x ) = ( x − γ ζ ( γ,x ) x ≥ x x < x (1) ζ ( γ, x ) = ∞ X x = x x − γ (2)where ζ ( γ, x ) is the Riemann zeta function (modified suchthat the sum starts at a minimum value x ). This distributionhas γ and x as parameters. The parameter x is an integerindicating the smallest number in the distribution.According to the above, we attempt to fit a power-law(scale-free) distribution to the nodes degree distribution ofthe studied networks. In order to do that, given the set of N numbers { k i , i =
1, 2, ..., N } representing the degree ofevery node in a network, we find the value of the parameter γ that maximize the likelihood , for a given value of x : ln L = N ′ X i =1 ln p ( k i /γ, x ) (3)where the sum is made over the degrees of every node inthe network bigger than x . To find the parameter γ thatmaximizes the likelihood, one must solve the equation: ddγ ln L = 0 (4) N ′ ddγ ζ ( γ, x ) ζ ( γ, x ) − N ′ X i =1 ln k i = 0 (5)where N ′ is the number of nodes in the network with degreebigger or equal to x .Once γ is established, for many possible values x ,we evaluate the goodness of fit through a χ test: the χ statistic is calculated and the right-cumulative distributionof the Pearson’s χ distribution at this point is obtained.The result is the p-value i.e. the probability of obtaining astatistical fluctuation bigger than the observed one if the k i ’sdistribution does come from a power-law with parameters γ and x . Therefore, big values of the p-value indicate a goodfit. For each network, we follow the same procedure: hav-ing its degree distribution, for every possible value of x between 1 and some x max we solve Eq. (5) and findthe value of γ that maximize the likelihood for the given x . We also evaluate the p-value and count the amount(fraction) of nodes in the network with degree smaller than x (these nodes did not participate in the fit procedure).We also evaluate an upper and lower uncertainty for the γ parameter by finding the two points around the maximumlikelihood where it decreases half point (0.5) [26].
2. Since the logarithm is a monotonically increasing function, itsmaximum is at the same point as the maximum value of its argument.3. As will be clear in the Results section, the p-values increase untilsome given value of x and decrease afterwards. For every single graph produced (3481 metabolic networksand 1073 PPI networks), first the values for basic networkparameters are obtained. Most of the obtained graphs con-tained small disconnected components, so we also count, foreach graph, the number of disconnected components, thesize of the biggest component and the average size of thesmaller components. The most straight forward propertiesthat are obtained from the graphs are its number of nodes N and its number of links N .Also, for every network it is straightforward to obtain itsnode’s degree distribution i.e. n i , the number of nodes ineach network that have i links, for every possible integer i .This is evaluated by first obtaining the degree for each node i , k i which is the number of links node i has: k i = N X j =1 M ij (6)where M ij is the adjacency matrix of the graph (a squaresymmetrical N × N matrix where each element M ij is 1 ifnode i is connected to node j and 0 otherwise).Another local property of each node is its clusteringcoefficient C i : C i = 2 E i k i ( k i − (7)where E i is the number of links between the neighbors ofnode i . This coefficient is the ratio between the number oftriangles node i actually forms with its neighbors and thetotal number of all possible ones given its degree k i .The local parameters (properties associated with eachnode in a network) can be averaged over all nodes in thenetwork in order to establish an average network parameter.In the case of the two above mentioned parameters, one hasthe network’s average degree ¯ k and average local clusteringcoefficient ¯ C . It is also possible to define a global parameterrepresenting the clustering of a network as the ratio betweenall triangles (size 3 clicks ) the network (as a whole) actuallyhas and the number of possible triangles it could have,based on the number of connected triples (length 2 paths): C = 3 | C || P | (8)where | C | is the number of triangles (tree nodes connectedin a cycle) and | P | the number of 2-paths (connectedtriples).We also study two parameters related to node correla-tions, namely nodes distances and network assortativity.First we evaluate the symmetric distance matrix, a matrixwhere every element d ij is shortest path length betweennode i to node j , via the Dijkstra’s algorithm [27]. Since weconsider the unweighted network (every link has weight 1),the size of the path is set as the distance between the twonodes. The average of all elements in the distance matrix is
4. A click is a complete subgraph of the network.5. Since some networks have disconnected components, the distancebetween nodes in different components is infinity (one is not reachablefrom the other). These distances are left out from the sum evaluatingthe average. the network’s average distance ¯ d : ¯ d = 2 N ( N − N X i =1 X j>i d ij (9)The network’s assortativity, A , is a correlation coefficientbetween the node’s excess degree and its expected value inan ¨Erdos-R´enny random network: A = 1 σ q X k i ,k j k i k j ( e ( k i , k j ) − q ( k i ) q ( k j )) (10) q ( k i ) = ( k i + 1) p ( k i + 1)¯ k (11)where p ( k i ) = n ki N is the probability that a node has degree k i , e ( k i , k j ) is the fraction of links in the network connectingnodes of degree k i with k j , q ( k i ) is called the excess degreedistribution and σ q is its standard deviation.Positive coefficient A means an assortative network i.e.high degree nodes tend to be connected to other high degreenodes; while a negative coefficient A means a dissortativenetwork that is, a network where high degree nodes tend toconnect with low degree nodes. We want to distinguish properties of the network that comesolely as a natural consequence of its degree distributionand those that require some underlying mechanism to beachieved. To accomplish that, after having evaluated the net-work parameters, we compare those to averages of the sameparameters evaluated over sets of randomized networksi.e. networks with the same degree distribution for theirnodes, but where the links have been randomly exchanged.In the appendix we discuss the distribution of the networkproperties over the population of networks generated bythis randomization process.The randomization process we implemented is the fol-lowing: first two links of the network are randomly selected.The links are broken and the nodes participating in oneof the original links are connected to the nodes participat-ing in the other original link. Since we work with simpleundirected networks, sometimes this process fails, becausegiven the randomly selected links, the relinking of the net-work would either generate a node connected with itself ortwo nodes sharing multiple connections. Repeating multipletimes this process, one can also estimate (by bootstrap) theamount of times the process failed estimating, in this way,the probability of success in the process, which will be aproperty solely of the degree distribution of the network. Wecall this parameter ξ . Note that this randomization processdoes not change the degree distribution of the network i.e.every node keeps its k i constant during the process.One is left to decide how many times to repeat thisrandomization process in order to obtain a truly randomnetwork . We adopt the paradigm that each link shouldhave a 99.9% probability of having been touched at least
6. Note that this randomization process does not generate a ¨Erdos-R´enny random network, but a network with the exact same degreedistribution as the real original graph. once by the process. The idea behind this process is to gen-erate random networks with the same degree distributionas the original network therefore, assessing the propertiesof these random networks, one obtains the characteristics ofthe network that emerge only as a consequence of the degreedistribution of the graphs. This set of random networksmimics what would be expected as result from an evolu-tionary model constructed in order to generate networksadjusting the degree distribution observed in real worldgraphs and not caring with any other aspect of the resultinggraphs.Since in each step of the randomization process two linksare selected, the probability p that any given link is touchedin a given step is p = N . Therefore, the probability thata link is not touched is ¯ p = 1 − p . If the randomizationprocess is repeated n times, the probability that a given linkis never touched is ¯ p n . So the number of times we mustrepeat the randomization process in order that there is a99.9% probability that any given link was touched by theprocess (probability of not being touched) is: n = − ln(1000)ln (cid:0) − N (cid:1) (12)to give an idea of his number, in a typical metabolic networkwith 2400 links, this value is n = 8286 , while for a typicalPPI network with 28000 links, this number is n = 96705 .For each network, we obtain 10 randomized versions of it, evaluate each network parameter in each randomizednetwork and estimate an expected value and its uncertaintyby evaluating the average and standard deviation of theparameter over the ten randomized network samples. Inthis way, we are able to evaluate, for each network, ameasure of the parameter deviation in the real networkfrom the expected value in the randomized versions of it,by computing the statistical t P for each parameter P : t P = ¯ P random − P realS √ (13)where P real is the value obtained for the parameter P in thereal network, ¯ P random is the average value of the parameterover the ten randomized versions of the network and S thestandard deviation of the parameter in the random samples.The statistical t P can be used to assess the statisticalsignificance on the existence of a difference between theparameter value in the real network and its expected valuein randomized versions of it, by evaluating the cumulativestudent’s t-distribution with 9 degrees of freedom at point | t P | . In the appendix we provide a normality test for thedistribution of the network characteristics over its random-ized versions, justifying thus the use of the student’s t-test. Two times the value of this cumulative distributionis interpreted as the p-value for the null hypothesis thatthe observed network has the P parameter equal to itsexpected value. High values of this p-value would indicatethat the difference is not significant (high probability of
7. The calculations become computationally intensive for big net-works (some PPI networks have over 10000 nodes) and therefore,choosing a bigger number of random samples, would result in unrea-sonable running time for the calculations over the entire data set. obtaining a fluctuation equal or bigger than the observedone in the population). Parameters for which the p-valueis small would indicate a significant difference between theobserved and expected value, hinting that evolution favors(selects) networks in which the parameter value is bigger(if t P is negative) or lower (if t P is positive) than what isexpected in a random version of the network. Therefore,a good dynamical evolutionary model for these biologicalsystems would have to incorporate mechanisms that resultin networks with such characteristics. ESULTS AND D ISCUSSION
For each data set of networks studied (metabolic and PPI),we present first the results for fitting procedure for thenode’s degree distributions followed by the descriptivestatistics for the network parameters and then the resultsfrom the comparison between the real and the randomizednetwork versions.
In table 1 we show the results for the fitting proceduredone over the node’s degree distributions, averaged overthe 3481 metabolic networks. For each network, for eachvalue of x between 1 and 15, we evaluate the value for theparameter γ that maximizes the likelihood for the observeddegrees in the network (solution of eq. (5)), we obtain itsuncertainties, and for the fitted value of γ the statistical χ is computed along with its correspondent p-value. Note thatthe fit procedure is done for every x in the table for eachone of the 3481 graphs in the data set. Therefore, the valuesof γ and its uncertainties presented in the table are the resultof averaging the obtained values of γ and its uncertainty ineach network weightening the values by the p-value of thefits (giving more importance to the better fitted values). Nextto the average p-value we also show the fraction of the fitsfor which the p-value was bigger than 1%. Note also that,for each value of x bigger than 1, some nodes (the ones forwhich k i < x ) are left out of the fit. The table presents alsothe average fraction of discarded nodes in the fits, for eachvalue of x .For x = 1 , the p-value is lower than − , indicatingthat only the tail of the distribution might be well adjustedto a power-law function and the highest p-values obtainedare for x = 2 . In this case, the value of γ is a little above2.1 and around 4% of the nodes in the network have degree1 and do not participate in the fit. But even in this case, theaverage p-value is around 0.01 indicating that the deviationfrom a power-law is important and there is small probabilityof observing such fluctuation if the degrees came from thehypothesized (scale-free) distribution.Now, for each one of the 3481 metabolic networks in ourdata set, we evaluated the graph properties and character-istics described in subsection 3.2. Histograms depicting thedistributions of the main parameters over our data set areshown in figure 1. The distributions show the bulk of thedata distributed around a central value, but all of them alsopresent a significant number of outliers.In table 2 we present the descriptive statistic for the pa-rameters. Since some distributions have a sizable skewness Table 1Fitted parameter and uncertainties averaged using p-value as weight,for metabolic networks. In parenthesis, next to the average p-value, thefraction of fits for which the individual p-value was greater than 0.01. x γ Discarded average p-value (fraction > . +0 . − . . ± . (0.000000 %)2 . +0 . − . . ± . (58.527493 %)3 . +0 . − . . ± . (12.395154 %)4 . +0 . − . . ± . (6.337372 %)5 . +0 . − . . ± . (7.362535 %)6 . +0 . − . . ± . (6.150979 %)7 . +0 . − . . ± . (7.362535 %)8 . +0 . − . . ± . (5.498602 %)9 . +0 . − . . ± . (4.473439 %)10 . +0 . − . . ± . (4.287046 %)11 . +0 . − . . ± . (4.380242 %)12 . +0 . − . . ± . (3.914259 %)13 . +0 . − . . ± . (3.075489 %)14 . +0 . − . . ± . (2.423113 %)15 . +0 . − . . ± . (2.329916 %) (asymmetry), besides evaluating the standard deviation ofthe distribution, we also evaluated the standard deviationfor all values bigger and smaller than the average, sepa-rately. These are shown in the table as uncertainties aroundthe average value of each parameter.In a real metabolic network, one would not expect dis-connected components. The small components into whichthe networks fragment themselves are possibly a problemof wrong annotations in the databases or misidentificationof some chemical reactions or metabolites within them inthe automated process of reconstructing the networks. Inany case, as can be seen from the difference between theaverage size of the main components and the average totalnumber of nodes, the disconnected components amount toa negligible number of nodes.All metabolic networks are dissortative ( A < ). It is alsopossible to observe that, while the networks show a highaverage local clustering, their global clustering parametertend to be small, close to zero. The networks tend to clusterlocally, but not globally. The lack of correlation between thelocal and global clustering coefficients in real networks hasalready been observed in other systems [25], [28].Table 3 shows the same parameters as in table 2 (exceptfor those that are not affected by the randomization process,like N , N or P ) evaluated for the average values of therandomized samples of each network. The last row in thistable presents the parameter ξ , whose average value isaround 0.87, indicating that, on average, in 13% of therandomization steps, the links broken could not have beenproperly relinked (the random step failed).The statistical significance of the differences between theparameters in tables 2 and 3 can be appreciated in table 4 C o u n t s Value N C o u n t s Value N . . . . C o u n t s Value ¯ k .
06 0 .
08 0 . .
12 0 .
14 0 .
16 0 .
18 0 . .
22 0 .
24 0 . C o u n t s Value ¯ C . . . . . . . . . . . C o u n t s Value ¯ d − . − . − . − . − . − .
05 0 C o u n t s Value A Figure 1. Histograms for the parameter distributions in metabolic networks. Top left, right: number of nodes and number of links. Center left, right:average degree and average local clustering. Bottom left, right: Average shortest path and assortativity. where it is shown the average value for the statistical t P foreach parameter. For each network (3481 in total) the valueof t P and its correspondent p-value is evaluated. The tableshows the statics of the distribution of t P over the wholedata set. Next to the average p-value, we also present thefraction of the networks for which this p-value was below0.05. Histograms for the distribution of t P for the differentparameters can be found in figure 2.The most significant difference observed is for the aver-age shortest path. Note, that the significance in this analysisis not on the amount of the difference, but on the confidencelevel for the existence of this difference. If one comparesthe difference between ¯ d and and ¯ d rand in tables 2 and3 it amounts to around 2.5%. Though this difference issmall, given any metabolic network, one can say with highconfidence that the average shortest path is bigger in thereal network than its expected value in randomized versionsof it. For the other parameters the situation is not as clear.One observes that a sizable fraction of the networks (around3/4 of them) do show significance on the existence ofsome difference (p-val < . ), but these differences are notubiquitous. The same analysis were done over the data set of 1073protein-protein interaction (PPI) networks. First, in table 5we present the results of the fit procedure for the degree distributions. Here, one can can see that it is not possibleto obtain a reasonable fit unless more than 50% of thenetwork’s nodes are left out of the fit. Only for the far righttail of the distribution (around x ≥ ) one begins to obtainreasonable quality fits. In particular, for x = 1 the fits wereso bad, that the p-values were all smaller than the machineprecision ( p < − ) and therefore it was not possible toevaluate the averages weighted by the p-values.Next we performed the analysis of the network proper-ties of the data set. In figure 3 we present the histograms forthe distributions of the main graph parameters and in table6, the computed descriptive statistics of the distributions.The PPI networks show some peculiar properties ifcompared with the metabolic ones. They present both, localand global clustering coefficients significantly different fromzero; These networks are mildly assortative and, thoughthey are bigger in size, they are more densely connected andpresent a similar shortest average path than the metabolicgraphs.Table 7 shows the descriptive statistics for the averagevalues of the parameters obtained from the randomizedsamples. These values show some sizable differences withrespect to the parameters in the real networks shown in table6. For the random samples, the global and local clusteringcoefficients are close to zero. The different behavior of theglobal clustering coefficient can be directly linked to thenumber of triangles in the networks: the real networks Table 2Descriptive Statistics for the distribution of metabolic network parameters. The first two columns show the parameter name (definition) and itssymbol as used in the present work. Around the average, shown as uncertainties, are the standard deviations calculated for values bigger andsmaller than the average, separately.
Parameter Math Symbol Average Standard Deviation SkewnessNumber of nodes N . +327 . − . N . +1017 . − . ¯ k . +0 . − . ¯ C . +0 . − . C . +0 . − . A − . +0 . − . ¯ d . +0 . − . P . +91543 . − . C . +916 . − . N comps . +5 . − . S main . +318 . − . S small . +0 . − . present almost twice more size 3 clicks than the randomsamples. Moreover, the networks present a mild dissortativedegree correlation and a slightly smaller average shortestpath. The statistical significance of the differences can beread in table 8, where we present the descriptive statics forthe t P statisticals and their associated averaged p-values. Infigure 4 are depicted the histograms for the distributions ofthese statisticals.In the case of PPI networks, the differences betweenreal and expected values for all parameters are statisticallysignificant. Real PPI networks present bigger average localand global clustering coefficients than their expected valuesin random networks with the same degree distribution,as well as bigger shortest average paths and assortativity.Realistic evolutionary models that try to mimic growth ofPPI networks should not only try to reproduce their degreedistribution (which is not well described by a power-lawfunction) but also try to incorporate underling mechanismsthat result in networks with such deviations from the ex-pected values in networks with the same degree distribu-tions. VERVIEW AND C ONCLUSION
We have analyzed a huge set of graphs representing bi-ological systems (3481 metabolic networks and 1073 PPInetworks). First, we study in detail the degree distributionsof the networks and test them against the hypothesis thatthey follow a power-law function. The results of this firstanalysis show that the degree distributions of these realworld graphs are not well described by this function, butonly the right tail containing a smaller fraction of the nodes in the case of PPI networks are reasonably adjusted tothis scale-free distribution. In a second analysis, a completedescriptive statistics of the networks properties is presentedand then we identify those parameters that deviate fromtheir expected values in randomized versions of the graphsthat preserve the network’s degree distribution. Biologicalnetworks are the result of evolution and natural selection.Therefore, these deviations are the result of evolutionarypressures these systems developed under. Realistic evolu-tionary models that describe these systems should incorpo-rate mechanisms that result in graphs with such deviationsand not only try to reproduce their degree distributions.Our analysis did not focus in any given specific branch ofthe tree of life, such that the networks vary a lot in size. Theaverage shortest path for both, metabolic and PPI networks,tend to be slightly smaller in the case of real graphs than theexpected values in randomized networks with high statisti-cal significance. For other parameters, while the metabolicnetworks also show some differences, these differences donot present the same confidence level significance overthe whole sample. But in the case of PPI networks, allparameters analyzed do show differences between real andexpected values with a high confidence level. PPI networksare more assortative, and have bigger local and globalclustering coefficients than would be expected by randomlylinking nodes with the same degree distributions.This study points to two important conclusions: on onehand, the degree distributions of graphs representing bio-logical systems (metabolic and PPI networks) is not wellfitted by a power-law function and on the other hand,network evolutionary models that focus solely in obtaining
Table 3Parameter values in randomized versions of the metabolic networks.
Parameter Math Symbol Average Standard Deviation SkewnessAverage local clustring ¯ C rand . +0 . − . C rand . +0 . − . A rand − . +0 . − . ¯ d rand . +0 . − . C rand . +875 . − . N compsrand . +0 . − . S mainrand . +327 . − . S smallrand . +0 . − . ξ . +0 . − . Table 4Deviation of real metabolic network parameters from the randomized expected values. The column p-value presents the average p-value for thestudent’s t-test evaluated as described in subsection 3.3. In the parenthesis next to it, it is presented the fraction of networks for which the p-valuewas below 0.05. t Average Standard Deviation Skewness p-val (fracion < . ) t ¯ C . +7 . − . t ¯ C − . +5 . − . t ¯ d − . +6 . − . t A . +6 . − . t C − . +5 . − . graphs with similar degree distributions as real world bio-logical networks might not be enough to adjust other graphparameters observed in these systems. A PPENDIX AP ARAMETER D ISTRIBUTION OVER THE R ANDOM - IZED N ETWORKS
In this work we compared the graph parameters of a realworld network with their expected values in the set of allgraphs that can be built with the same degree distribution.This analysis is inspired by the fact that many models thattry to describe the evolution of these systems usually focusin obtaining as a result of the model simulation graphs withsimilar degree distributions. We ask, therefore, if mimickingthis degree distribution is enough to reproduce the mainproperties of these structures. Given a degree distribution,the number of different graphs that can be actually builtfrom it is astronomically huge and a random model adjustedonly to reproduce a given degree distribution would, inprinciple, generate any of the possible networks that share this same distribution. So the question is: Do the charac-teristics of a real world network significantly differ fromthose of a randomly selected graph from the population ofall networks that share the same degree distribution?The process through which we obtain samples fromthis population is the randomization process described insection 3.3, where links are broken and rewired in orderto obtain a complete new network from the original butkeeping the degrees of the nodes untouched. Though thisrewiring process is computationally fast, the computationof all parameters of the resulting network, in particular P , C and d ij , is a computationally intense process andin order to perform all calculations in a relatively shortand reasonable time period (around a month) we chose toobtain a small random sample. Therefore, we had to use astatistical test designed to provide reliable results even forsmall samples: the student’s t-test. One of the suppositionsbehind the student’s t-test performed in the analysis is thatthe population behind the obtained sample has a normaldistribution.Here, for a few organisms, we obtained bigger ran- − − C o u n t s Value t A − − − − − − −
10 0 10 C o u n t s Value t ¯ d − −
10 0 10 20 30 40 C o u n t s Value t ¯ C − − −
10 0 10 20 30 C o u n t s Value t ¯ C Figure 2. Histograms for the distribution of the parameter t P in metabolic networks. Top left, right: assortativity and average shortest path. Bottomleft, right: Average local clustering and global clustering. dom samples and performed two normality tests to checkwhether these populations comply with the needed suppo-sition of normality in order to perform the t-test. In table 9we present the results for the PPI and metabolic networksof a few selected organisms. The table shows the averagevalue of the given parameters in a sample of randomizednetworks and for each parameter two p-values next to it thatare the result of the Shapiro-Wilk [29] and D’Agostino [30]normality tests. In any of the tests, a big p-value (bigger thanthe critical value) indicates that the sample seems to havea normal distribution. In the tables, for the PPI networkswe identify the organisms by their NCBI taxonomy IDand metablic networks by their KEGG code. Again, in thisevaluation we reduced the sample sizes for those networksin which the calculations took longer time. R EFERENCES [1] P. Erd˝os and A. R´enyi, “On random graphs i.,”
PublicationesMathematicae (Debrecen) , vol. 6, pp. 290–297, 1959 1959.[2] D. J. de Solla Price, “Networks of scientific papers,”
Science ,vol. 149, pp. 510–515, jul 1965.[3] N. Przulj, D. Wigle, and I. Jurisica, “Functional topology in anetwork of protein interactions,”
Bioinformatics , vol. 20, pp. 340–348, feb 2004.[4] M. M. Babu, N. M. Luscombe, L. Aravind, M. Gerstein, and S. A.Teichmann, “Structure and evolution of transcriptional regulatorynetworks,”
Current Opinion in Structural Biology , vol. 14, no. 3,pp. 283 – 291, 2004.[5] C. Li-Ping, W. Ru, S. Hang, X. Xin-Ping, Z. Jin-Song, L. Wei, andC. Xu, “Structural properties of us flight network,”
Chinese PhysicsLetters , vol. 20, no. 8, p. 1393, 2003.[6] L. A. N. Amaral, A. Scala, M. Barthelemy, and H. E. Stanley,“Classes of small-world networks,”
Proceedings of the NationalAcademy of Sciences , vol. 97, pp. 11149–11152, sep 2000.[7] A. L. Barabasi, “Scale-free networks: a decade and beyond,”
Sci-ence , vol. 325, pp. 412–413, Jul 2009.[8] A. L. Barabasi and E. Bonabeau, “Scale-free networks,”
Sci. Am. ,vol. 288, pp. 60–69, May 2003.[9] R. Albert, “Scale-free networks in cell biology,”
J. Cell. Sci. , vol. 118,pp. 4947–4957, Nov 2005.[10] E. Ravasz and A.-L. Barab´asi, “Hierarchical organization in com-plex networks,”
Physical Review E , vol. 67, feb 2003. [11] R. Albert, H. Jeong, and A.-L. Barab´asi, “Error and attack toleranceof complex networks,”
Nature , vol. 406, pp. 378–382, jul 2000.[12] D. S. Callaway, M. E. J. Newman, S. H. Strogatz, and D. J.Watts, “Network robustness and fragility: Percolation on randomgraphs,”
Phys. Rev. Lett. , vol. 85, pp. 5468–5471, Dec 2000.[13] R. Cohen, K. Erez, D. ben Avraham, and S. Havlin, “Resilienceof the internet to random breakdowns,”
Phys. Rev. Lett. , vol. 85,pp. 4626–4628, Nov 2000.[14] A.-L. Barab´asi and R. Albert, “Emergence of scaling in randomnetworks,”
Science , vol. 286, no. 5439, pp. 509–512, 1999.[15] R. Albert and A.-L. Barab´asi, “Statistical mechanics of complexnetworks,”
Rev. Mod. Phys. , vol. 74, pp. 47–97, Jan 2002.[16] A. V´azquez, “Growing network with local rules: Preferential at-tachment, clustering hierarchy, and degree correlations,”
Phys. Rev.E , vol. 67, p. 056104, May 2003.[17] R. Khanin and E. Wit, “How scale-free are biological networks,”
Journal of Computational Biology , vol. 13, pp. 810–818, Apr 2006.[18] G. Lima-Mendez and J. van Helden, “The powerful law of thepower law and other myths in network biology,”
Mol. BioSyst. ,vol. 5, pp. 1482–1493, 2009.[19] A. Clauset, C. R. Shalizi, and M. E. J. Newman, “Power-lawdistributions in empirical data,”
SIAM Rev. , vol. 51, pp. 661–703,Nov. 2009.[20] V. Sudbrack, L. G. Brunnet, R. M. de Almeida, R. M. Ferreira,and D. Gamermann, “Master equation for the degree distributionof a duplication and divergence network,”
Physica A: StatisticalMechanics and its Applications , vol. 509, pp. 588–598, nov 2018.[21] M. Kanehisa and S. Goto, “KEGG: kyoto encyclopedia of genesand genomes,”
Nucleic Acids Res. , vol. 28, pp. 27–30, Jan 2000.[22] M. Kanehisa, Y. Sato, M. Kawashima, M. Furumichi, and M. Tan-abe, “Kegg as a reference resource for gene and protein annota-tion,”
Nucleic Acids Research , vol. 44, no. D1, pp. D457–D462, 2016.[23] R. Reyes, D. Gamermann, A. Montagud, D. Fuente, J. Triana,J. F. Urchueguia, and P. F. de Cordoba, “Automation on thegeneration of genome-scale metabolic models,”
J. Comput. Biol. ,vol. 19, pp. 1295–1306, Dec 2012.[24] D. Szklarczyk, A. Franceschini, S. Wyder, K. Forslund, D. Heller,J. Huerta-Cepas, M. Simonovic, A. Roth, A. Santos, K. P. Tsafou,M. Kuhn, P. Bork, L. J. Jensen, and C. von Mering, “STRING v10:protein-protein interaction networks, integrated over the tree oflife,”
Nucleic Acids Res. , vol. 43, pp. D447–452, Jan 2015.[25] E. Estrada,
The Structure of Complex Networks: Theory and Applica-tions . Oxford University Press, 2011.[26] J. R. Fonseca, M. I. Friswell, J. E. Mottershead, and A. W. Lees,“Uncertainty identification by the maximum likelihood method,”
Journal of Sound and Vibration , vol. 288, no. 3, pp. 587 – 599, 2005.Uncertainty in structural dynamics. C o u n t s Value N C o u n t s Value N C o u n t s Value ¯ k C o u n t s Value ¯ C C o u n t s Value ¯ d C o u n t s Value A Figure 3. Histograms for the parameters in PPI networks. Top left, right: number of nodes and number of links. Center left, right: average degreeand average clustering. Bottom left, right: Average shortest path and assortativity. [27] E. W. Dijkstra, “A note on two problems in connexion withgraphs,”
Numerische Mathematik , vol. 1, no. 1, pp. 269–271, 1959.[28] M. E. J. Newman, “The structure and function of complex net-works,”
SIAM Review , vol. 45, pp. 167–256, jan 2003.[29] S. S. SHAPIRO and M. B. WILK, “An analysis of variance test fornormality (complete samples),”
Biometrika , vol. 52, pp. 591–611,dec 1965.[30] R. B. D. AGOSTINO, “Transformation to normality of the nulldistribution of g1,”
Biometrika , vol. 57, no. 3, pp. 679–681, 1970. C o u n t s Value t A C o u n t s Value t ¯ d C o u n t s Value t ¯ C C o u n t s Value t ¯ C Figure 4. Histograms for parameter deviations in PPI networks. Top left, right: assortativity and average shortest path. Bottom left, right: Averagelocal clustering and global clustering. Table 5Fitted parameter and uncertainties weight averaged by p-value for PPInetworks. In parenthesis, next to the average p-value, the fraction of fitsfor which the individual p-value was greater than 0.01. x γ Discarded average p-value (fraction > . . − . . ± . (0.000000 %)3 . . − . . ± . (0.000000 %)4 . . − . . ± . (0.000000 %)5 . . − . . ± . (0.000000 %)6 . . − . . ± . (0.093197 %)7 . . − . . ± . (0.093197 %)8 . . − . . ± . (0.186393 %)9 . . − . . ± . (0.279590 %)10 . . − . . ± . (1.025163 %)11 . . − . . ± . (1.863933 %)12 . . − . . ± . (3.727866 %)13 . . − . . ± . (6.150979 %)14 . . − . . ± . (11.649581 %)15 . . − . . ± . (18.732526 %)16 . . − . . ± . (27.772600 %)17 . . − . . ± . (32.898416 %)18 . . − . . ± . (38.956198 %)19 . . − . . ± . (45.013979 %)20 . . − . . ± . (49.953402 %)21 . . − . . ± . (54.426841 %)22 . . − . . ± . (58.434296 %)23 . . − . . ± . (62.068966 %)24 . . − . . ± . (64.958062 %)25 . . − . . ± . (67.567568 %)26 . . − . . ± . (69.897484 %)27 . . − . . ± . (72.320596 %)28 . . − . . ± . (74.650513 %)29 . . − . . ± . (75.955266 %)30 . . − . . ± . (76.514445 %)31 . . − . . ± . (77.446412 %)32 . . − . . ± . (78.285182 %)33 . . − . . ± . (78.378378 %)34 . . − . . ± . (78.657968 %)35 . . − . . ± . (78.191985 %)36 . . − . . ± . (78.098788 %)37 . . − . . ± . (77.912395 %)38 . . − . . ± . (76.421249 %)39 . . − . . ± . (75.582479 %)40 . . − . . ± . (73.904939 %) Table 6Descriptive Statistics for the distribution of PPI network parameters.
Parameter Math Symbol Average Standard Deviation SkewnessNumber of nodes N . +1674 . − . N . +63095 . − . ¯ k . +16 . − . ¯ C . +0 . − . C . +0 . − . A . +0 . − . ¯ d . +0 . − . P . +12364896 . − . C . +378962 . − . N comps . +2 . − . S main . +1672 . − . S small . +1 . − . Table 7Parameter values in randomized versions of the PPI networks.
Parameter Math Symbol Average Standard Deviation SkewnessAverage local clustring ¯ C rand . +0 . − . C rand . +0 . − . A rand − . +0 . − . ¯ d rand . +0 . − . C rand . +285093 . − . N compsrand . +0 . − . S mainrand . +1673 . − . S smallrand . +0 . − . ξ . +0 . − . Table 8Deviation of real PPI network parameters from randomized expected values. The column p-value presents the average p-value for the student’st-test evaluated as described in subsection 3.3. In the parenthesis next to it, it is presented the fraction of networks for which the p-value was lowerthan 0.05. t Average Standard Deviation Skewness p-val (fracion < . ) t ¯ C − . +95 . − . t ¯ C − . +181 . − . t ¯ d − . +80 . − . t A − . +23 . − . t C − . +184 . − . Table 9Normality test for a few selected organisms. The average value of the parameters over the samples are shown and the two values next to each arethe p-value of the Shapiro-Wilk test (pv1) and the p-value for the D’agostino test (pv2).