Evaluating Network Models: A Likelihood Analysis

Abstract

Many models have been put forward to mimic the evolution of real networked systems. A well-accepted way to judge their validity is to compare the modeling results with real networks with respect to several structural features. Even for a specific real network, we cannot fairly evaluate the goodness of different models, since there are too many structural features and no criterion for selecting them or assigning weights to them. Motivated by studies on link prediction algorithms, we propose a unified method to evaluate network models by comparing the likelihoods of the currently observed network under different models, with the assumption that the higher the likelihood is, the better the model is. We test our method on the real Internet at the Autonomous System (AS) level, and the results suggest that the Generalized Linear Preferential (GLP) model outperforms the Tel Aviv Network Generator (Tang), while both models are better than the Barabási-Albert (BA) and Erdős-Rényi (ER) models. Our method can be further applied to determine the optimal values of parameters, namely those that correspond to the maximal likelihood. An experiment indicates that the parameters obtained by our method capture the characteristics of newly added nodes and links in the AS-level Internet better than the original methods in the literature.


arXiv [physics.soc-ph], Dec

Wen-Qiang Wang, Qian-Ming Zhang, and Tao Zhou∗

Web Sciences Center, School of Computer Science and Technology, University of Electronic Science and Technology of China, 610054 Chengdu, People's Republic of China

Beijing Computational Science Research Center, Beijing 100089, People's Republic of China

(Dated: June 7, 2018)

PACS numbers: 89.75.Fb, 05.40.Fb, 89.75.Da

I. INTRODUCTION

Recent years have witnessed the rapid development of the study of complex networks [1–4]. A network is a set of items called vertices, with connections between them called edges. Many natural and man-made systems can be described as networks. Examples are countless: biological networks, including protein-protein interaction networks [5] and metabolic networks [6]; social networks, such as movie actor collaboration [7] and scientific collaboration networks [8]; and technological networks, like power grids [9], the WWW [10] and the Internet at the Autonomous System (AS) level [11–16]. A major endeavor in academia is to discover the common properties shared by many real networks and the specific features owned by a certain type of network. A great number of measurements have been applied to reveal the structural features of networks [17]. The degree distribution [18], one of the most important global measurements, has attracted increasing attention since the discovery of the scale-free property [19]. The clustering coefficient is a local measurement that characterizes the loop structure of order three. Another significant measurement is the average distance. A network is considered to be small-world if it has a large clustering coefficient but a short average distance [9]. Besides the properties mentioned above, there are many other measurements, such as degree-degree correlation [20], betweenness centrality [21] and so forth. Moreover, some statistical measurements borrowed from physics, such as entropy [22], and novel metrics such as modularity [23] also play important roles in characterizing networks.

Current research interest has focused not only on the statistical features but also on the dynamical evolution of networks. A host of models have been proposed to reveal the origins of the impressive statistical features of complex networks. There are also many evolving models developed for certain types of networks, such as the Internet at the AS level [11–16], social networks [24–29] and so forth. However, the proliferation of measurements sets a barrier to evaluating different evolving models. The traditional idea is this: if the network generated by a model resembles the target network in terms of some statistical features, usually selected by the authors themselves, the model is claimed to be a proper description of the real evolution. But this methodology is problematic. First, unselected statistical properties are entirely ignored, so no one knows whether the model describes them as well. Secondly, authors tend to select the metrics that support their models, so it is impossible to judge fairly which model is better. Thirdly, it is difficult to quantify the extent to which the models resemble the real evolving mechanisms.

Inspired by link prediction approaches and likelihood analysis, we propose a method that tries to evaluate different models fairly and objectively. Link prediction aims at estimating the likelihood of nonexisting edges in a network and tries to dig out the missing edges [30]. The evolution of networks involves two processes: one is the addition or deletion of nodes, and the other is the change of edges between nodes [28]. In principle, the rule by which a model adds edges can be considered a kind of link prediction algorithm, and here lies the bridge between link prediction and the mechanisms of evolving models.

The present paper is organized as follows. We give a general description of our method in Section II. Section III introduces the data and explains in detail how to use our method to evaluate evolving models, with the AS-level Internet as an example network. The results obtained by our method are shown in Section IV. We draw the conclusion and give some discussion in the last section.

∗ Electronic address: [email protected]

II. METHOD

In this section, we give a general description of our method for evaluating evolving models. An evolving model describes the evolving mechanism of a real network or a class of networks. Given two snapshots of one network at times $t_1$ and $t_2$ ($t_1 < t_2$), as well as an evolving model, we can in principle calculate the likelihood that the network starting from the configuration at time $t_1$ will evolve to the configuration at $t_2$ under the rules of the given model. We say a model is better than another if its likelihood is greater. However, how to calculate such a likelihood is still a big challenge. Inspired by link prediction algorithms, we can calculate the likelihood of the addition of an edge according to a given evolving model [30]. Over a short duration of time, the generation of each edge can be treated as independent of the others, and the order in which the edges are generated can be ignored. Thus the likelihood mentioned above is the product of the likelihoods of the newly generated edges.

Denote by $G$ the network and by $E_t$ the set of edges at time step $t$. The set of new edges generated at the current time step is $E^{\mathrm{new}} = E_{t+1} \setminus E_t$. The probability that node $i$ is selected as one end of a newly generated edge is

$$\Pi_i = f(G, \vec{a}), \qquad (1)$$

where $\vec{a}$ is the set of parameters applied by the model. Then the likelihood of a newly observed edge $(i,j)$ is

$$P_{(i,j)} = \Pi_i \times \Pi_j. \qquad (2)$$

Eq. (2) is applicable only when $i$ and $j$ are both old nodes. If $i$ or $j$ is newly generated, we set $\Pi_i = 1$ or $\Pi_j = 1$. In order to compare different models, $P_{(i,j)}$ is normalized by $1/\sum_{(a,b) \in E^N} P_{(a,b)}$, where $E^N$ is the set of nonexisting edges ($(i,j) \in E^N$). Given different parameters $\vec{a}$, the values of $P_{(i,j)}$ may differ, resulting in different likelihoods of the target network. The parameters corresponding to the maximum likelihood are intuitively considered to be the optimal set of parameters for the evaluated model. In short, a network's likelihood can be calculated if the evolution data and the corresponding model are given. If there are several candidate models, our method can judge them by comparing the corresponding likelihoods: the model giving the higher likelihood of the target network is favored.

TABLE I: The number of nodes and edges of the three data sets: two real data sets and one data set that is processed as we describe in the paper.
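The calculation described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the toy graph and the brute-force enumeration of nonexisting edges are all hypothetical, and the sketch covers only new edges between existing nodes (the case to which Eq. (2) applies directly).

```python
import math
from itertools import combinations


def degrees(edges):
    """Degree of every node that appears in the edge list."""
    deg = {}
    for a, b in edges:
        deg[a] = deg.get(a, 0) + 1
        deg[b] = deg.get(b, 0) + 1
    return deg


def log_likelihood(old_edges, new_edges, node_prob):
    """Log-likelihood of new edges among existing nodes, after Eqs. (1)-(2).

    node_prob maps each existing node i to Pi_i. Every nonexisting edge
    (i, j) gets the weight Pi_i * Pi_j, and the weights are normalized
    over the whole set of nonexisting edges.
    """
    existing = {frozenset(e) for e in old_edges}
    # Unnormalized likelihood of every nonexisting edge (brute force).
    weights = {
        frozenset((i, j)): node_prob[i] * node_prob[j]
        for i, j in combinations(sorted(node_prob), 2)
        if frozenset((i, j)) not in existing
    }
    total = sum(weights.values())
    # Independent edges: the product of likelihoods is a sum of logs.
    return sum(math.log(weights[frozenset(e)] / total) for e in new_edges)


# Hypothetical toy snapshot and one newly observed edge.
old_edges = [(0, 1), (0, 2), (0, 3), (1, 2), (3, 4)]
new_edges = [(1, 3)]

deg = degrees(old_edges)
k_sum = sum(deg.values())
preferential = {i: k / k_sum for i, k in deg.items()}  # Pi_i proportional to k_i
uniform = {i: 1 / len(deg) for i in deg}               # Pi_i = 1/N

ll_pref = log_likelihood(old_edges, new_edges, preferential)
ll_unif = log_likelihood(old_edges, new_edges, uniform)
```

On this toy graph the preferential rule gives the new edge (1, 3) a normalized likelihood of 4/15, against 1/5 under uniform selection, so the preferential rule would be favored. Enumerating all node pairs is quadratic in the number of nodes; for a network of realistic size the normalization would need a closed-form or sampled estimate.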

III. EXPERIMENTAL ANALYSIS

In this paper we focus on models of the AS-level Internet. Two popular models, the Generalized Linear Preferential model (GLP) [11] and the Tel Aviv Network Generator (Tang) [15], will be evaluated by our method. The well-known Barabási-Albert (BA) [19] and Erdős-Rényi (ER) [31, 32] models are also analyzed as two benchmarks.

The data sets we use here were collected by the Route-views Project [33]. We use the data of Jun. 2006 and Dec. 2006. Some nodes and edges in Jun. 2006 disappear from the record of Dec. 2006. Although an autonomous system might be canceled, this rarely happens during a short time span. Therefore we assume that the nodes and edges in Jun. 2006 do not disappear in Dec. 2006; that is, the network configuration in Jun. 2006 is a subgraph of that in Dec. 2006. We merge the network of Jun. 2006 into that of Dec. 2006 and then take the set subtraction between the two to obtain the newly generated edges and nodes. The basic information of the processed data set of Dec. 2006 and of the two original data sets is shown in Table I.

Now we describe how to calculate the likelihood of each newly generated edge in terms of the four models.

(i) GLP model - This model starts from a few nodes. At each time step, with probability $1-p$, one new node is added and $m$ edges are generated between the new node and $m$ old ones; with probability $p$, $m$ edges are generated among the existing nodes. The ends of new edges are selected following the rule of generalized linear preferential attachment,

$$\Pi_i = \frac{k_i - \beta}{\sum_j (k_j - \beta)}, \qquad (3)$$

in which $k_i$ is the degree of node $i$ and $\beta \in (-\infty, 1)$ is a tunable parameter. If the end $i$ of a new edge is selected among the existing nodes, then $\Pi_i$ is calculated by Eq. (3). Otherwise, if the end $i$ itself is a new node, $\Pi_i$ is 1. So the likelihood of a new edge connecting two existing nodes $a$ and $b$ is

$$P_{(a,b)} = \frac{k_a - \beta}{\sum_j (k_j - \beta)} \, \frac{k_b - \beta}{\sum_j (k_j - \beta)}. \qquad (4)$$

The likelihood of an edge generated between a new node $b$ and an existing node $a$ is

$$P_{(a,b)} = \frac{k_a - \beta}{\sum_j (k_j - \beta)}. \qquad (5)$$

When a new edge connects two new nodes $a$ and $b$, its likelihood is

$$P_{(a,b)} = 1. \qquad (6)$$

(ii) Tang model - This model applies a super-linear preferential mechanism, namely

$$\Pi_i = \frac{k_i^{\epsilon}}{\sum_j k_j^{\epsilon}}. \qquad (7)$$

This model also starts with a few nodes, and at each time step a new node is generated with one edge connecting to one of the existing nodes, selected with the probability described in Eq. (7). The remaining $m-1$ edges connect pairs of existing nodes, with one end selected preferentially by Eq. (7) and the other selected at random. The likelihood of an edge between two existing nodes $a$ and $b$ is thus

$$P_{(a,b)} = \frac{1}{N} \sqrt{\frac{k_a^{\epsilon}}{\sum_j k_j^{\epsilon}} \, \frac{k_b^{\epsilon}}{\sum_j k_j^{\epsilon}}}, \qquad (8)$$

where $N$ is the current size of the monitored network. Eq. (8) takes a geometric mean because either $a$ or $b$ could be the one selected randomly. The cases involving new nodes are handled in the same way as for the GLP model.

(iii) BA model - The BA model also starts from a small graph, and at each time step a new node with $m$ edges is added. The probability that the existing node $i$ is selected is

$$\Pi_i = \frac{k_i}{\sum_j k_j}. \qquad (9)$$

Note that the original BA model cannot deal with the situation where edges are generated between two existing nodes. We thus generalize the BA model as follows: if an edge is generated between two existing nodes, one node is selected preferentially following Eq. (9) and the other is selected randomly. Therefore the likelihood of an edge between two existing nodes $a$ and $b$ is calculated as

$$P_{(a,b)} = \frac{1}{N} \sqrt{\frac{k_a}{\sum_j k_j} \, \frac{k_b}{\sum_j k_j}}. \qquad (10)$$

The likelihood of an edge connecting a new node $b$ and an old one $a$ is

$$P_{(a,b)} = \frac{k_a}{\sum_j k_j}. \qquad (11)$$

The likelihood of a new edge generated between two new nodes is 1, as discussed above.

(iv) ER model - The mechanism of this model is that when an edge is generated, both its ends are selected in a random fashion. The likelihood of an edge $(a,b)$ between two old nodes is

$$P_{(a,b)} = \frac{1}{N^2}. \qquad (12)$$

The likelihoods of the other two types of edges are calculated as for GLP. Note that BA is a special case equivalent to the GLP model when $\beta = 0$. It is also obvious that the ER model is a special case of the Tang model when $\epsilon = 0$.

The likelihoods of the four evolving models with different parameters are shown in Figure 1. The maximum likelihoods and the corresponding parameters are listed in Table II. The maximum likelihoods of both specific Internet models (GLP and Tang) are greater than those of the BA model and the ER model. Notice that the BA and ER models are parameter-free and are thus represented by two straight lines in Figure 1. Our results suggest that, for mimicking the evolution of the AS-level Internet, the GLP model is better than the Tang model, the Tang model is better than the BA model, and the ER model performs the worst. A puzzling point is that the optimal parameters corresponding to the maximum likelihoods are far from the ones suggested in the original literature [11, 15]. We next devise an experiment to demonstrate that the parameters obtained by our method are more advantageous than the original ones.

FIG. 1: Likelihoods for different models and different parameters. (Curves: GLP, Tang, BA, ER; horizontal axis: model parameter, from −1 to 1; vertical axis: from −134,000 to −120,000; values annotated in the plot: −124442, −124449, −120497, −132356.)

TABLE II: Maximum likelihoods and the corresponding parameters for the four models.

Model | Maximum Likelihood | Optimum parameters
GLP | 3.… × 10^−… | …
Tang | … × 10^−… | …
ER | … × 10^−… | N/A
BA | 2.… × 10^−… | N/A

Traditionally, an evolving model starts from a small network with a few nodes. In this experiment, we use the GLP and Tang models, respectively, to drive the network evolution starting from the configuration of Jun. 2006 and ending at the same size as the configuration of Dec. 2006. According to Refs. [11, 15] and the data, $\beta = 0.\ldots$, $m = 1.\ldots$, $p = 0.\ldots$, and $\epsilon = 0.2$. Then we analyze some statistical features of the newly generated part, including the average degree, the density of interaction and the fraction of leaves. We find that the GLP model performs better than the Tang model with the same kind of parameters in all three cases, demonstrating that our evaluating method is reasonable. For both models, the statistical features obtained with the optimum parameters suggested by us resemble the real data better than those obtained with the original parameters. The comparisons are shown in Figure 2.

FIG. 2: (a) The average degree $\langle k \rangle$ of the newly generated nodes; (b) the density $\rho$ of interaction among the newly generated nodes; (c) the fraction $f$ of leaves among the newly generated nodes. Each panel compares GLP(0.230), GLP(0.616), Tang(0.025) and Tang(0.200). The dashed line in each plot represents the value for the real Internet: $\langle k \rangle = 1.59636$, $\rho = 1.91708 \times 10^{-5}$ and $f = 0.516099$. The structural features of the networks obtained with our suggested parameters are closer to reality. For each model with each parameter, we generate 100 networks and use the so-called box-and-whisker plot [34] to display the results, where the horizontal lines from top to bottom respectively stand for the maximum, the upper quartile, the median, the lower quartile and the minimum of a set of data.
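The node-selection rules of Eqs. (3) and (7), the corresponding likelihoods of an edge between two existing nodes, and the search for the parameter of maximal likelihood can be sketched as follows. This is a hedged illustration rather than the authors' implementation: all function names, the toy degree table and the parameter grids are assumptions, and for brevity the scan omits both the normalization over nonexisting edges and the cases involving new nodes.

```python
import math


def glp_prob(deg, beta):
    """Eq. (3): Pi_i = (k_i - beta) / sum_j (k_j - beta), with beta < 1."""
    total = sum(k - beta for k in deg.values())
    return {i: (k - beta) / total for i, k in deg.items()}


def tang_prob(deg, eps):
    """Eq. (7): Pi_i = k_i**eps / sum_j k_j**eps."""
    total = sum(k ** eps for k in deg.values())
    return {i: k ** eps / total for i, k in deg.items()}


def glp_edge(deg, a, b, beta):
    """Eq. (4): both ends chosen by generalized linear preference."""
    p = glp_prob(deg, beta)
    return p[a] * p[b]


def tang_edge(deg, a, b, eps):
    """Eq. (8): geometric mean of 'a preferential, b random' and the reverse."""
    p = tang_prob(deg, eps)
    return math.sqrt(p[a] * p[b]) / len(deg)


def log_likelihood(edge_fn, deg, new_edges, param):
    """Sum of log edge likelihoods (edges treated as independent)."""
    return sum(math.log(edge_fn(deg, a, b, param)) for a, b in new_edges)


def best_parameter(edge_fn, deg, new_edges, grid):
    """Grid value that maximizes the log-likelihood."""
    return max(grid, key=lambda x: log_likelihood(edge_fn, deg, new_edges, x))


# Hypothetical degrees of the existing nodes and two new edges among them.
deg = {0: 3, 1: 2, 2: 2, 3: 2, 4: 1}
new_edges = [(1, 3), (0, 4)]

beta_grid = [x / 10 for x in range(-10, 10)]  # -1.0 ... 0.9, all below 1
eps_grid = [x / 10 for x in range(0, 21)]     # 0.0 ... 2.0

best_beta = best_parameter(glp_edge, deg, new_edges, beta_grid)
best_eps = best_parameter(tang_edge, deg, new_edges, eps_grid)
```

Two limiting cases are visible in the sketch: `glp_prob(deg, 0.0)` coincides with the BA rule of Eq. (9), and `tang_edge` with `eps = 0.0` reduces to uniform selection of both ends. A grid scan is used only because it is the simplest thing to write; any one-dimensional optimizer could replace it.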

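The data processing of this section (merging the earlier snapshot into the later one, taking the set subtraction, and measuring the newly generated part) might look like the sketch below. The function name and the toy edge lists are hypothetical, and the definition of the density as the number of edges among new nodes divided by the number of new-node pairs is an assumption, since the text does not spell it out.

```python
def new_part_statistics(earlier_edges, later_edges):
    """Statistics of the nodes and edges that appear between two snapshots."""
    earlier = {frozenset(e) for e in earlier_edges}
    # Merge the earlier snapshot into the later one, so that the earlier
    # network is a subgraph of the later network by construction.
    merged = earlier | {frozenset(e) for e in later_edges}
    new_edges = merged - earlier                 # set subtraction
    old_nodes = set().union(*earlier)
    new_nodes = set().union(*merged) - old_nodes

    # Every edge touching a new node is itself a new edge.
    degree = {v: 0 for v in new_nodes}
    edges_among_new = 0
    for edge in new_edges:
        for v in edge:
            if v in degree:
                degree[v] += 1
        if edge <= new_nodes:                    # both ends are new nodes
            edges_among_new += 1

    n = len(new_nodes)
    pairs = n * (n - 1) / 2
    return {
        "average_degree": sum(degree.values()) / n if n else 0.0,
        "density": edges_among_new / pairs if pairs else 0.0,  # assumed definition
        "leaf_fraction": sum(1 for k in degree.values() if k == 1) / n if n else 0.0,
    }


# Hypothetical snapshots; edge (1, 2) is missing from the later record
# but survives the merge.
earlier_edges = [(0, 1), (1, 2)]
later_edges = [(0, 1), (2, 3), (3, 4), (0, 5)]
stats = new_part_statistics(earlier_edges, later_edges)
```

In this toy example the new nodes are 3, 4 and 5, with degrees 2, 1 and 1, and only the edge (3, 4) lies among them. The same three quantities could be computed for each model-generated network and compared with the value measured on the real data.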
IV. CONCLUSION AND DISCUSSION

Thousands of network models have been put forward in the last ten years. Some of them aim at uncovering the mechanisms that underlie general topological properties like the scale-free nature and the small-world phenomenon; others are proposed to reproduce the structural features of specific networks, such as the Internet, the World Wide Web, co-authorship networks, food webs, protein-protein interaction networks, metabolic networks, and so on. Despite this prosperity, we are concerned that there is no unified method to evaluate the performance of different models, even if the target network is given beforehand.

Instead of considering many structural metrics, this paper reports an evaluating method based on likelihood analysis, with the assumption that a better model will assign a higher likelihood to the observed structure. We have tested our method on the real Internet at the AS level, and the results suggest that the GLP model outperforms the Tang model, and both models are better than the BA and ER models. This method can be further applied to determine the optimal parameters of network models, and the experiment indicates that the parameters obtained by our method better capture the structural characteristics of newly added nodes and links.

The main contributions of this work are twofold. On the methodological side, we provide a starting point towards a unified way to evaluate network models. As for perspective, we believe that for the majority of real evolving networks, the driving factors and the parameters vary in time. For example, recent empirical analysis suggests that the Internet at the AS level grew with different mechanisms before and after the year 2004 [16]. Finding a single mechanism that drives a network from a little baby to a giant may be an infeasible task. In fact, in different stages a network could grow in different ways, or in a hybrid manner with a changing distribution of weights over several mechanisms. The research focus once shifted from analyzing static models to evolutionary models. In the near future, it may shift from the evolutionary models to the evolution of the evolutionary models themselves. In principle, the current method could capture the tracks of not only the network evolution, but also the mechanism evolution. Hopefully this work will provide some insights into studies on network modeling.

Acknowledgments

We acknowledge A. Wool for the codes of the Tang model. This work is supported by the National Natural Science Foundation of China under grant No. 11075031 and the Fundamental Research Funds for the Central Universities.

[1] R. Albert and A.-L. Barabási, Rev. Mod. Phys. , 47 (2002).
[2] S. N. Dorogovtsev and J. F. F. Mendes, Adv. Phys. , 1079 (2002).
[3] M. E. J. Newman, SIAM Rev. , 167 (2003).
[4] A.-L. Barabási, Science , 412 (2009).
[5] H. Jeong, S. P. Mason, A.-L. Barabási, and Z. N. Oltvai, Nature (London) , 41 (2001).
[6] H. Jeong, B. Tombor, R. Albert, Z. N. Oltvai, and A.-L. Barabási, Nature (London) , 651 (2001).
[7] P.-P. Zhang, K. Chen, Y. He, T. Zhou, B.-B. Su, Y. Jin, H. Chang, Y.-P. Zhou, L.-C. Sun, B.-H. Wang, and D.-R. He, Physica A , 599 (2006).
[8] M. E. J. Newman, Proc. Natl. Acad. Sci. U.S.A. , 404 (2001).
[9] D. J. Watts and S. H. Strogatz, Nature (London) , 440 (1998).
[10] R. Albert, H. Jeong, and A.-L. Barabási, Nature (London) , 130 (1999).
[11] T. Bu and D. Towsley, in Proc. of IEEE INFOCOM 2002, p. 638.
[12] S. Zhou and R. J. Mondragon, in Proc. of the 18th International Teletraffic Congress, p. 121.
[13] S. T. Park, D. M. Pennock, and C. L. Giles, in Proc. of IEEE INFOCOM 2002, p. 1616.
[14] S. Zhou and R. J. Mondragon, Phys. Rev. E , 066108 (2004).
[15] S. Bar, M. Gonen, and A. Wool, Lect. Notes Comput. Sci. , 53 (2004).
[16] G.-Q. Zhang, G.-Q. Zhang, Q.-F. Yang, S.-Q. Cheng, and T. Zhou, New J. Phys. , 123027 (2008).
[17] L. da F. Costa, F. A. Rodrigues, G. Travieso, and P. R. Villas Boas, Adv. Phys. , 167 (2007).
[18] A. Clauset, C. R. Shalizi, and M. E. J. Newman, SIAM Rev. , 661 (2009).
[19] A.-L. Barabási and R. Albert, Science , 509 (1999).
[20] D. S. Callaway, J. E. Hopcroft, J. M. Kleinberg, M. E. J. Newman, and S. H. Strogatz, Phys. Rev. E , 041902 (2001).
[21] L. C. Freeman, Sociometry , 35 (1977).
[22] C. E. Shannon, Bell Syst. Tech. J. , 379 (1948).
[23] M. E. J. Newman, Proc. Natl. Acad. Sci. U.S.A. , 8577 (2006).
[24] A.-L. Barabási, H. Jeong, Z. Néda, E. Ravasz, A. Schubert, and T. Vicsek, Physica A , 590 (2002).
[25] M. Boguñá, R. Pastor-Satorras, A. Díaz-Guilera, and A. Arenas, Phys. Rev. E , 056122 (2004).
[26] R. Kumar, J. Novak, and A. Tomkins, Link mining: models, algorithms and applications (Springer Press, New York, 2010).
[27] J. Huang, Z. Zhuang, J. Li, and C. L. Giles, in Proc. of the International Conference on Web Search and Web Data Mining 2008.
[28] R. Albert and A.-L. Barabási, Phys. Rev. Lett. , 5234 (2000).
[29] S. N. Dorogovtsev and J. F. F. Mendes, Phys. Rev. E , 1842 (2000).
[30] L. Lü and T. Zhou, Physica A , 1150 (2011).
[31] P. Erdős and A. Rényi, Publ. Math. Inst. Hung. Acad. Sci. , 17 (1960).
[32] B. Bollobás, Random Graphs