Generative Model Selection Using a Scalable and Size-Independent Complex Network Classifier
Sadegh Motallebi,a) Sadegh Aliakbary,b) and Jafar Habibi c)
Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
(Dated: 4 February 2014)
a) [email protected]
b) [email protected]
c) [email protected]
Real networks exhibit nontrivial topological features such as heavy-tailed degree distribution, high clustering, and small-worldness. Researchers have developed several generative models for synthesizing artificial networks that are structurally similar to real networks. An important research problem is to identify the generative model that best fits a target network. In this paper, we investigate this problem, and our goal is to select the model that is able to generate graphs similar to a given network instance. By generating synthetic networks with seven outstanding generative models, we have utilized machine learning methods to develop a decision tree for model selection. Our proposed method, named "Generative Model Selection for Complex Networks" (GMSCN), outperforms existing methods with respect to accuracy, scalability and size-independence.
Keywords: Complex Networks, Generative Models, Synthetic Networks, Model Selection, Network Structural Features, Social Networks, Decision Tree Learning
A realistic network generative model can generate artificial graphs similar to real networks. But there is no universally best generative model for all situations. In any application, among the different existing generative models, we should choose the best model for that specific application. Generative model selection is therefore a prerequisite for creating realistic artificial networks. In this paper, we consider the problem of model selection and propose a method for finding the model that best fits a given network. The selected generative model helps us infer the growth mechanisms of the given network. It can also generate artificial networks similar to the given network for tasks like simulation, prediction, extrapolation, and hypothesis testing. We propose utilizing a combination of different local and global network features for learning a model selection decision tree. We also devise a new network feature based on the quantification of the degree distribution. We show that our proposed method is robust, scalable, independent of the size of the given network, and more accurate than the baseline method.
I. INTRODUCTION
Complex networks appear in different categories such as social networks, citation networks, collaboration networks, and communication networks. In recent years, complex networks have been studied extensively, and much evidence indicates that they show some non-trivial structural properties. For example, power-law degree distribution, high clustering and small path lengths are some properties that distinguish complex networks from completely random graphs.

An active field of research is dedicated to the development of algorithms for generating complex networks. These algorithms, called "generative models", try to generate synthetic graphs that adhere to the structural properties of complex networks. Realistic generative models have many applications and benefits. Once a generative model is fitted to a given real network, we can replace the real network with artificial networks in tasks such as simulation, extrapolation (by generating similar graphs with larger sizes), sampling (the reverse of extrapolation), capturing the network structure, and network comparison.

Despite the advances in the field, there is no universal generative model suitable for all network types and features. The prerequisite of network generation is the stage of generative model selection. In fact, when we generate synthetic networks, we hope to reach graphs that are structurally similar to a target network. In the model selection stage, the properties of a given network (called the target network) are analyzed and the best model suitable for generating similar networks is selected. A model selection method tries to answer this question: "Among candidate generative models, which one is the most suitable for generating complex network instances similar to the given network?" In this paper, we investigate this problem and, by means of machine learning algorithms, we propose a new model selection method based on network structural properties. The proposed method is named "Generative Model Selection for Complex Networks" (GMSCN). The need for model selection is frequently indicated in the literature. More specifically, some works are based on counting subgraphs of small sizes (called graphlets or motifs), some others concentrate on structural features of complex networks, and some are based on manually selecting a model by inspecting a small set of network features. We will show that by using an appropriate combination of local and global network features, we can develop a more accurate model selection method. In our proposed method (GMSCN), we consider seven prominent generative models with which we have generated datasets of network instances. The datasets are used as training data for learning a decision tree for model selection. Our method also includes a special technique for the quantification of the degree distribution. In comparison to existing methods, we have considered a wider, newer and more significant set of generative models. Due to a better selection of network features, GMSCN is also more efficient and more scalable than similar methods.

The rest of this paper is organized as follows. Section II reviews the related work. Section III presents GMSCN. Section IV is dedicated to the evaluation of GMSCN. Section V describes a case study on some real network samples. The results and evaluations of this paper are discussed in Section VI. Finally, Section VII concludes the paper.

II. RELATED WORK

A. Network Generation Models
In this subsection, we briefly introduce the leading methods of network generation; a small generation sketch follows the list.

• Kronecker Graphs Model (KG). This model generates realistic synthetic networks by applying a matrix operation (the Kronecker product) on a small initiator matrix. This model is mathematically tractable and supports many network features such as small path lengths, heavy-tailed degree distribution, heavy tails for eigenvalues and eigenvectors, densification, and shrinking diameters over time.

• Forest Fire Model (FF). In this model, edges are added in a process similar to a fire-spreading process. This model is inspired by the Copying model and Community Guided Attachment, but supports the shrinking diameter property.

• Random Typing Generator Model (RTG). RTG uses a process of "random typing" for generating node identifiers. This model mimics real-world graphs and conforms to eleven important patterns (such as power-law degree distribution, densification power law, and small and shrinking diameter) observed in real networks.

• Preferential Attachment Model (PA). The classical preferential attachment model generates scale-free networks with power-law degree distribution. In this model, the nodes are added to the network incrementally, and the probability of the attachments depends on the degree of existing nodes.

• Small World Model (SW). This is another classical network generation model that synthesizes networks with small path lengths and high clustering. It starts with a regular lattice and then rewires some edges of the network randomly.

• Erdős–Rényi Model (ER). This model generates a completely random graph. The numbers of nodes and edges are configurable.

• Random Power Law Model (RP). The RP model generates synthetic networks by following a variation of the ER model that supports the power-law degree distribution property.

Other generative models are also available (we have not utilized them, but they are used in related model selection methods), such as the Copying Model (CM), Random Geometric Model (GEO), Spatial Preferential Attachment (SPA), Random Growing (RDG), Duplication-Mutation-Complementation (DMC), Duplication-Mutation using Random mutations (DMR), Aging Vertex (AGV), Ring Lattice (RL), Core-periphery (CP), and Cellular model (CL).
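As a rough illustration (not the authors' code), the three classical models above map directly onto standard networkx generators; the Kronecker Graphs, Forest Fire and RTG models need dedicated implementations such as the SNAP library mentioned in Appendix B. The sizes and parameter values below are arbitrary placeholders.

```python
# Minimal sketch: instantiating three of the classical candidate models.
import networkx as nx

n = 4096                                          # number of nodes (illustrative)
pa = nx.barabasi_albert_graph(n, m=5)             # Preferential Attachment (PA)
sw = nx.watts_strogatz_graph(n, k=10, p=0.1)      # Small World (SW)
er = nx.gnm_random_graph(n, pa.number_of_edges()) # Erdős–Rényi (ER), same edge count as PA

for name, g in [("PA", pa), ("SW", sw), ("ER", er)]:
    print(name, g.number_of_nodes(), g.number_of_edges())
```

Matching the ER edge count to the PA graph keeps the densities of the generated instances comparable, which anticipates the density-matching step of the methodology in Section III.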
B. Model Selection Methods

The aim of this paper, and of model selection methods in general, is to find the best generative model that fits a given network instance. Some model selection methods are based on graphlet counting. Graphlets are subgraphs of bounded sizes (e.g., all possible subgraphs with three or four nodes), and the frequency of graphlets in a network is considered as a way of capturing the network structure. Some works consider directed graphs and graphlets, and some others consider the network as a simple (undirected) graph. Janssen et al. have tested both graphlet features and structural features (degree distribution, assortativity and average path length) in the model selection problem. They conclude that counting graphlets of three and four nodes is sufficient for capturing the structure of the network, i.e., appending structural features to the feature vector of graphlet counts does not improve the accuracy of the model selector. In this paper, we critique this claim and show that using a better set of local (such as transitivity) and global (such as effective diameter) network structural features, along with an appropriate degree distribution quantification algorithm, actually improves the accuracy of the model selection. In fact, graphlet counts are limited local features and are not able to fully reflect the structural properties of a network instance. Janssen et al. implemented six generative models and generated a dataset of synthetic networks as the training data for decision tree learning. In this method, the candidate generative models are: PA, CM, GEO (GEO2D and GEO3D) and SPA (SPA2D and SPA3D).

A similar method is proposed by Middendorf et al. In this method, the feature vectors are the counts of graphlets with small sizes. Seven different generative models are considered, by which network instances are generated as the training data. The candidate generative models are: ER, PA, SW, RDG, DMC, DMR and AGV. The authors have used a generalized decision tree called the alternating decision tree (ADT) as the learning algorithm.

Airoldi et al. propose to form feature vectors according to structural network properties. They have considered some classical generative models and generated a dataset by which a naïve Bayes classifier is learned. The candidate generative models are: PA, ER, RL, CP and CL. This method is dependent on the size and average connectivity of the target network, and this dependency is one of its limitations.

Patro et al. propose a framework for implementing network generation models. The user of this framework can specify the important network features and the weight of each feature. In other words, each generative model is considered as a class of networks. More than being a specific method, this is a relatively open framework, and the user should determine the different parameters of the framework according to the target application.

III. THE PROPOSED METHOD
GMSCN is based on learning a classifier for model selection. The goal of a classifier is to accurately predict the target class for a given network instance, and in our method, generative models play the role of network classes. In GMSCN, the classifier suggests the best model that generates networks similar to a given network. The inputs of the classifier are the structural properties of the target network, and the output is the selected model among the candidate network generation models.
A. Methodology
Fig. 1 shows the high-level methodology of GMSCN. The methodology is configurable by several parameters and decision points, such as the set of considered network features, the chosen supervised learning algorithm, and the candidate generative models. The steps of constructing the network classifier, as illustrated in Fig. 1, are described in the following (a code sketch of the density matching of step 1 appears at the end of this subsection):

1. Many artificial network instances are synthesized using the candidate network generative models. These network instances will form the dataset (training and test data) for learning a network classifier. In this step, the parameters of the generative models are tuned in order to synthesize networks with densities similar to the density of the given target network.

2. After generating the network instances, the structural features (e.g., the degree distribution and the clustering coefficient) of each network instance are extracted. The result is a dataset of labeled structural features in which each record consists of the topological features of a synthesized network along with the label of its generative model.

3. The labeled dataset forms the training and test data for the supervised learning algorithm. The learning algorithm will return a network classifier which is able to predict the class (the best generative model) of a given network instance.

4. The structural features of the target network are also extracted. The same "Feature Extraction" block which is used in the second step is applied here. The structural features of the target network are used as input for the learned classifier.

5. The learned network classifier is a customized "model selector" for finding the model that fits the target network. It gets the structural features of the target network as input and returns the most compatible generative model.

FIG. 1. The methodology of learning a network classifier

In this methodology, the density of the target network is considered as an important property of the target network. Network density is defined as the ratio of the existing edges to potential edges and is regarded as an indicator of the sparseness of the graph. In the proposed methodology, generative models are configured to synthesize networks with densities similar to the density of the target network. This decision is due to the fact that it is hard to compare networks of completely different densities for predicting their growth mechanism and generation process. On the other hand, even with similar network densities, various generative models create different network structures. So, we try to keep the density of the generated networks similar to the density of the target network. In this manner, the network classifier can learn the differences among the structures of various generative models with similar network densities. It is also worth noting that it is not possible to generate networks with exactly equal densities with some of the existing generative models. This is because some generative models (such as Kronecker Graphs and RTG) are not configurable for finely tuning the exact density of synthesized networks. So, we generate the networks of the training data with densities similar, and not exactly equal, to the density of the given network.

Our proposed methodology, unlike existing methods, is not dependent on the size (number of nodes) of the target network. Size-independence is an important feature of our method. It enables the classifier to learn from a dataset of generated networks with sizes different from (perhaps smaller than) the size of the target network, but with a similar density. This facility decreases the time of network generation and feature extraction considerably. We will demonstrate the size-independence property of GMSCN in the evaluation section.

GMSCN is actually a realization of the described methodology. In the following subsections, we further illustrate the details of GMSCN by specifying the open parameters and decision points of the methodology.
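The density matching of step 1 can be made concrete with a small sketch. This is our own illustration, not the paper's code: the helper name `configure_generators` and the parameter heuristics are hypothetical, and only models with directly tunable edge counts are shown.

```python
import networkx as nx

def density(g: nx.Graph) -> float:
    """Ratio of existing edges to potential edges."""
    n = g.number_of_nodes()
    return 2.0 * g.number_of_edges() / (n * (n - 1))

def configure_generators(target: nx.Graph, n_train: int):
    """Return generator callables tuned so that training networks
    roughly match the target's density (their sizes may differ)."""
    d = density(target)
    m_total = round(d * n_train * (n_train - 1) / 2)    # ER edge budget
    m_attach = max(1, round(d * (n_train - 1) / 2))     # PA attachments per node
    k_ring = max(2, 2 * m_attach)                       # SW lattice degree
    return {
        "ER": lambda: nx.gnm_random_graph(n_train, m_total),
        "PA": lambda: nx.barabasi_albert_graph(n_train, m_attach),
        "SW": lambda: nx.watts_strogatz_graph(n_train, k_ring, 0.1),
    }
```

For a target density d, the ER generator simply receives the matching edge budget, while PA receives m ≈ d(n−1)/2 attachments per node, since a PA graph with attachment parameter m has roughly mn edges and hence density about 2m/(n−1).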
B. Network Features
The process of model selection, as described in Fig. 1, utilizes structural network features in the second and fourth steps. There are plenty of different network features, so we clarify the features considered in GMSCN here.

To capture the properties of a network, we should analyse a wide and diverse feature set of network connectivity patterns. We propose the utilization of a combination of local and global network structural features. The utilization of a limited set of local features (graphlet counts) in similar methods has resulted in a lower precision for the model selector. As explained later, we have utilized ten network features from four feature categories. While trying to find the best and minimal set of network features, we considered features that are not only effective for the classification accuracy, but also efficiently computable and size-independent. One may consider a longer list of network features, even from different feature categories (e.g., eigenvalues). In such an approach, automatic methods for feature selection, such as the methodology explained in Ref. , may be helpful. But supporting the specified diverse criteria (effectiveness, efficiency and size-independence) for the selected features is quite difficult in such automatic methodologies. The utilized features and measurements in GMSCN are listed below; a sketch of the degree distribution quantification follows the equations at the end of this subsection.

• Transitivity of relationships. In this category of network features, we consider the two measurements of "average clustering coefficient" and "transitivity".

• Degree correlation. The measure of assortativity is selected from this category of network features.

• Path lengths. There are different global features about the path lengths in a network, such as diameter, effective diameter and average path length. We selected the "effective diameter" measurement since it is more robust, and also because of its lower computation cost and lower sensitivity to small network changes. The effective diameter indicates the minimum number of edges within which 90 percent of all connected pairs can reach each other. The effective diameter is well defined for both connected and disconnected networks.

• Degree distribution. It is a common approach to fit a power law on the degree distribution and extract the power-law exponent as a representative quantity for the degree distribution. But a single number (the power-law exponent) is too limited for representing the whole degree distribution. On the other hand, some real networks do not conform to the power-law degree distribution. We propose an alternative method for the quantification of the degree distribution by computing its probability percentiles. The percentiles are calculated from some defined regions of the degree distribution according to its mean and standard deviation. We devise K intervals in the degree distribution and then calculate the probability of the degrees of each interval. K is always an even number greater than or equal to four. The size of all intervals, except the first and the last one, is equal to pσ, where σ is the standard deviation of the distribution and p is a tunable parameter. In any application, we can configure the values of K and p in a manner that makes the percentile values more distinctive. In our experiments we let K = 6 and p = 0.3, so we extract six quantities (the DegDistP1..DegDistP6 percentiles) from any degree distribution. If we increase the value of K, we should normally decrease the value of p so that most of the interval points stay in the range of the existing node degrees. Smaller values of p also necessitate larger values of K. Very large values (e.g., K = 100) and very small values of K will also decrease the distinction power of the extracted feature vector. The specified values of p and K were found through trial and error. Equation (1) shows the interval points of the degree distribution, and Equation (2) specifies the probability for a node degree to sit in the i-th interval. The set of six percentiles (DegDistP1..DegDistP6) is used as the network features representing the degree distribution. Let IP_i be the i-th interval point and D be the degree random variable:

IP_i = \begin{cases} \min(D), & i = 1 \\ \mu - \left(\frac{K}{2} - i + 1\right) p\sigma, & i = 2, \ldots, K \\ \max(D), & i = K + 1 \end{cases} \quad (1)

DegDistP_i = P(IP_i < D < IP_{i+1}), \quad i = 1, \ldots, K \quad (2)
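Below is a minimal sketch of this quantification, assuming the reading of Equation (1) in which the interior interval points are spaced pσ apart, symmetrically around the mean μ (so that for K = 6 they sit at μ ± 2pσ, μ ± pσ and μ).

```python
import numpy as np

def deg_dist_percentiles(degrees, K=6, p=0.3):
    """DegDistP1..DegDistPK features of Eqs. (1)-(2): the fraction of
    nodes whose degree falls into each of K intervals placed around
    the mean of the degree distribution."""
    d = np.asarray(degrees, dtype=float)
    mu, sigma = d.mean(), d.std()
    # Interval points IP_1..IP_{K+1}; interior points spaced p*sigma
    # apart, symmetric around mu (our reading of Eq. (1)).
    ips = [d.min()]
    for i in range(2, K + 1):
        ips.append(mu - (K / 2 - i + 1) * p * sigma)
    ips.append(d.max())
    # Strict inequalities as in Eq. (2); under a literal reading the
    # minimum- and maximum-degree nodes fall outside every interval.
    return [float(np.mean((ips[i] < d) & (d < ips[i + 1]))) for i in range(K)]
```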
C. Learning the Classifier

The third step of the proposed methodology is the utilization of a supervised machine learning algorithm. The learning algorithm constructs the network classifier based on the features of the generated network instances as the training data. Each record of the training data consists of the structural features (as described in the previous subsection) of a generated network, along with the label of its generative model. By means of supervised algorithms, we can learn from this training data a classifier which predicts the best generative model for a given network with the specified structural features. We examined several supervised learning algorithms, such as decision tree learning, Bayesian networks, support vector machines (SVM) and neural networks, among which the LADTree method showed the best results. A short description of the examined learning algorithms is presented in Appendix A. In our experiments, although some methods (such as Bayesian networks) resulted in a small improvement in the accuracy of the learned classifier, the decision tree learned by the LADTree algorithm was clearly more robust and less sensitive to noise than the other learning methods. The robustness-to-noise analysis is described in the evaluation section. To avoid over-fitting, we always used stratified 10-fold cross-validation. A sketch of this training step follows.
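LADTree itself lives in Weka; as a rough stand-in, the sketch below uses scikit-learn's gradient-boosted trees (also a boosting-based tree learner) together with the stratified 10-fold cross-validation described above. `X` and `y` are assumed to hold the feature vectors and generative-model labels produced in step 2.

```python
# Sketch of step 3, with sklearn's boosted trees standing in for LADTree.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

def train_and_validate(X, y):
    clf = GradientBoostingClassifier()
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    scores = cross_val_score(clf, X, y, cv=cv)   # stratified 10-fold CV
    clf.fit(X, y)                                # final model selector
    return clf, scores.mean()
```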
D. Network Models

Among the several existing network generative models, we have selected seven important models: the Kronecker Graphs Model, Forest Fire Model, Random Typing Generator Model, Preferential Attachment Model, Small World Model, Erdős–Rényi Model and Random Power Law Model. The selected models are the state-of-the-art methods of network generation. The existing model selection methods, such as Ref. and Ref. , have ignored some new and important generative models such as the Kronecker Graphs, Forest Fire and RTG models.

IV. EVALUATION
In this section, we evaluate our proposed method of model selection (GMSCN). We also compare GMSCN with the baseline method and show that it outperforms state-of-the-art methods with respect to different criteria. Unlike most of the existing methods, GMSCN has no dependency on the size of the given network. In other words, we ignore the number of nodes of the target network and only consider its density in generating the training data. Because the baseline method is dependent on the size of the target network, we evaluate the methods in two stages. In the first stage, we fix the size of the generated networks to prepare a fair condition for comparing GMSCN with the baseline method. Although size-dependence is a drawback of the baseline method, the evaluation shows that GMSCN outperforms the baseline method even in the fixed network size condition. In the second stage, we allow the generative models to synthesize networks of different sizes. In this stage, we show that the size diversity of the generated networks does not affect the accuracy of the learned decision tree. As described in Section III, GMSCN is based on learning a decision tree from a training set of generated networks. In each evaluation stage, we generated 100 networks from each network generative model; with seven candidate models, we gathered 700 generated networks. We used these network instances as the training and test data for learning the decision tree.

A. The Baseline Method
We have selected the graphlet-based method proposed by Janssen et al. as the baseline method. The baseline method has some similarities to GMSCN: it is based on considering some network generative models and then learning a decision tree for network classification with the aid of a set of generated networks. In the baseline method, eight graphlet counts are considered as the network features. All subgraphs with three nodes (two graphlets) and four nodes (six graphlets) are considered in the baseline method (Fig. 2). A similar approach is also proposed by Middendorf et al., with distinctions in the learning algorithm and the set of candidate generative models. The graphlet-based method is selected as the baseline because it is a new method, its evaluations show a high accuracy, and it has been proposed similarly in different research domains such as social networks and protein networks.

FIG. 2. The graphlets with three and four nodes

Despite the similarities, there exist some important differences between GMSCN and the baseline method. First, the baseline method is based on counting graphlets in networks, while GMSCN proposes a wider set of local and global features. Janssen et al. conclude that considering structural features does not improve the accuracy of the graphlet-based classifier, but we will show that by choosing a better set of local and global network features, and with the aid of our proposed degree distribution quantification method, structural features play an undeniable role in model selection. Second, the baseline method is size-dependent, i.e., it considers both the size and the density of the target network, and it generates network instances according to these two properties. On the other hand, GMSCN is size-independent, and we only consider the density of the target network in the network generation phase. Third, GMSCN employs newer and more important generative models such as the Kronecker Graphs model, the Forest Fire model and the RTG model. Fourth, we examined different learning algorithms and then selected LADTree as the best learning algorithm for this application. Our evaluation of GMSCN is more thorough, considering different evaluation criteria. We have also presented a new algorithm for quantifying the network degree distribution.

Graphlet counting is a very time-consuming task, and there is no efficient algorithm for computing the full counts of graphlets in large networks. To handle the algorithmic complexity, most of the graphlet-counting methods (e.g., Refs. 11, 17, 50, and 51) propose a sampling phase before counting the graphlets. But the sampling algorithm may affect the graphlet counts, and the resulting counts may be biased towards the features of the sampling algorithm. It is also possible to estimate the graphlet counts with approximate algorithms, but this approach may also introduce remarkable errors in the graphlet counts. To prepare a fair comparison, we have counted the exact number of graphlets in the original networks and have not employed any sampling or approximation algorithms. It is worth noting that the reported accuracy of the baseline method in this paper is different from the report of the original paper, mainly because the sets of generative models are not the same in the two papers.
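For the two 3-node graphlets of Fig. 2, exact counts reduce to simple degree and triangle statistics; a minimal networkx sketch is below. The six 4-node counts require a heavier enumeration (the source of the cost discussed above) and are omitted.

```python
import networkx as nx

def three_node_graphlets(g: nx.Graph):
    """Exact counts of the two connected 3-node graphlets
    (open wedge/path and triangle) in an undirected graph."""
    tri = sum(nx.triangles(g).values()) // 3         # each triangle is counted at 3 nodes
    wedges = sum(d * (d - 1) // 2 for _, d in g.degree())
    open_wedges = wedges - 3 * tri                   # paths of length 2 (non-triangles)
    return {"path3": open_wedges, "triangle": tri}
```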
B. Accuracy of the Model Classifier

We first set a fixed size for the generated networks of the dataset and generate networks with about 4096 nodes. Almost all the generated networks in our dataset contain 4096 nodes, but the networks generated by the RTG model have small variations in their size. The number of nodes in these networks is in the range of 4000 to 4200, because the exact number of nodes is not configurable in the RTG model. Since the Kronecker Graphs model generates networks with 2^x nodes in its original form, we chose 4096 (2^12) as the size of the networks. The average density of networks in this dataset is equal to 0.0024.

In addition to the overall accuracy, we evaluate the precision and recall of the learned decision tree for the different network models. "Precision" shows the percentage of correctly classified instances calculated for each category, e.g.,

Precision_{FF} = \frac{\text{number of correctly predicted FF instances}}{\text{number of instances predicted as FF}}.

"Recall" illustrates the ability of the method in finding the instances of a category, e.g.,

Recall_{FF} = \frac{\text{number of correctly predicted FF instances}}{\text{number of FF instances}}.

"Accuracy" is an indicator of the overall effectiveness of the classifier across the entire dataset, i.e.,

Accuracy = \frac{\text{number of correctly predicted instances}}{\text{total number of instances}}.

The overall accuracy of GMSCN is 97.14%, while the accuracy of the baseline method is 78.57%, which indicates an 18.57% improvement. Fig. 3 and Fig. 4 show the precision and recall of GMSCN and the baseline method, respectively, for the different network models. In addition to an apparent improvement in the precision and recall for most of the generative models, the figures show the stability (less undesired deviation) of GMSCN over the baseline method. The accuracy and precision of GMSCN show small deviations across the different generative models, while these measures for the baseline method vary in a wide range. Table I shows the details of the GMSCN results for the different network models. For example, the first row of this table indicates that among the 700 network instances, 104 networks are predicted to be generated by the ER model, but in fact 97 (out of 104) instances are ER, six instances are from the KG model and one is generated by the SW model. Because we have utilized cross-validation, all of the 700 network instances are included in the evaluation. Table II shows the corresponding results for the baseline method.

FIG. 3. Precision of GMSCN compared to the baseline method for different generative models

FIG. 4. Recall of GMSCN compared to the baseline method for different generative models

It is worth noting that considering both the graphlet counts and the structural features does not improve the accuracy of the classifier considerably. Since we want to prepare a size-independent and efficient method, we do not consider the graphlet counts in the feature vectors.
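Given a confusion matrix laid out as in Tables I and II (rows are predicted classes, columns are true classes), the three measures above follow in a few lines. This is a small illustrative sketch, not the authors' evaluation code.

```python
import numpy as np

def per_class_metrics(cm, labels):
    """Precision/recall per class and overall accuracy from a
    confusion matrix cm[i][j] = # instances of true class j
    predicted as class i (the row layout of Tables I and II)."""
    cm = np.asarray(cm, dtype=float)
    precision = {l: cm[i, i] / cm[i, :].sum() for i, l in enumerate(labels)}
    recall = {l: cm[i, i] / cm[:, i].sum() for i, l in enumerate(labels)}
    accuracy = np.trace(cm) / cm.sum()
    return precision, recall, accuracy
```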
C. Size Independence
GMSCN model selection is independent of the size of the target network. When we want to find the best model fitting a real network, we can discard the number of nodes in the network and generate the training data only according to its density. Size-independence is an important feature of GMSCN which is missing in the baseline method. This feature is especially important when we want to find the generative model for a relatively large network. In this condition, we can generate the training network instances with smaller sizes than the target network. This feature also increases the applicability, scalability and performance of GMSCN.

For evaluating the dependency of GMSCN on the size of the network, we generate a new dataset with networks of different sizes. Instead of fixing the number of nodes in each network instance (such as about 4096 nodes in the previous evaluation), we allow networks with different numbers of nodes in the dataset. In this test, with each of the generative models, we generated 100 networks of different sizes: 24 networks with 4,096 nodes, 24 networks with 32,768 nodes, 24 networks with 131,072 nodes, 24 networks with 524,288 nodes and four networks with 1,048,576 nodes. Again, the only exception is the RTG model, which generates networks with small variations from the specified sizes. The node counts are powers of two because the original version of the Kronecker Graphs model is able to generate networks with 2^n nodes. The average density of networks in this dataset is equal to 0.000885.

FIG. 5. Accuracy of GMSCN for different network sizes.

Table III shows the precision and recall of GMSCN for this dataset. In this evaluation, the overall accuracy of the classifier is 97.29%, which is very close to the accuracy of the system in the evaluation with fixed network sizes. This fact shows that GMSCN is not dependent on the size of the target network. The average density of networks in this dataset (0.000885) is different from the average density of networks in the fixed-size dataset (0.0024). So, the model selection is also performing well for different densities of the given network. We also extended this experiment to ensure that there is no meaningful lower bound for GMSCN in terms of network size. The new experiment is configured similarly to the previous trial, but it examines a wider range of network sizes. Fig. 5 plots the result of this experiment for each number of nodes. It indicates that GMSCN shows good performance for the varying network sizes. Obviously, the baseline method is size-dependent, because the graphlet counts completely depend on the size of the network. So, it is not necessary to show the precision and recall of the baseline method for the dataset of networks of different sizes. We omitted such an evaluation because the calculation of graphlet counts for large networks is very time-consuming.
D. Robustness to Noise
We also evaluate the robustness of GMSCN with respect to random changes in networks. For each test-case network, we randomly select a fraction of edges, rewire them to random nodes, and test the accuracy of the classifier on the resulting network (a sketch of this perturbation follows). We start from the pure network samples and, in each step, we change five percent of the edges, until all the edges (100 percent change) are randomly rewired. In other words, in addition to the pure networks, we generated 20 test sets with edge-change fractions from five to 100 percent, each of which contains 700 network samples from the seven generative models.

As discussed before, we have chosen LADTree as the supervised learning algorithm in GMSCN. Fig. 6 shows the average accuracy of GMSCN for different random change fractions. This figure shows the effect of choosing different learning algorithms for GMSCN. As the figure shows, LADTree results in a more robust classifier for this application, since it is less sensitive to noise. The accuracy of GMSCN decreases smoothly and nearly linearly with the random changes. There is no sudden drop in the chart of GMSCN (based on LADTree). With 100 percent random changes (the right end of the diagram), the accuracy of the classifier reaches the value of 14.43 percent, which is near 1/7 (i.e., one over the number of candidate models). This is due to the existence of seven network models and indicates that almost all the characteristics of the generative model are eliminated from a generated network with 100 percent edge rewiring.
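One plausible reading of this perturbation, as a sketch: each selected edge keeps one endpoint and is reattached to a uniformly random new node. The exact rewiring rule is not fully specified in the text, so treat the details below as assumptions.

```python
import random
import networkx as nx

def rewire_fraction(g: nx.Graph, fraction: float, seed: int = 0) -> nx.Graph:
    """Randomly rewire a fraction of edges: each selected edge keeps
    one endpoint and is reattached to a uniformly chosen new node."""
    rng = random.Random(seed)
    h = g.copy()
    edges = list(h.edges())
    nodes = list(h.nodes())
    for u, v in rng.sample(edges, int(fraction * len(edges))):
        h.remove_edge(u, v)
        w = rng.choice(nodes)
        while w == u or h.has_edge(u, w):   # avoid self-loops and multi-edges
            w = rng.choice(nodes)
        h.add_edge(u, w)
    return h
```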
TABLE I. Precision, Recall and Accuracy of GMSCN for different generative models

              True ER  True FF  True KG  True PA  True RP  True SW  True RTG  Class Precision
Pred. ER         97        0        6        0        0        1        0        93.27%
Pred. FF          0      100        0        0        2        0        0        98.04%
Pred. KG          2        0       93        2        0        0        0        95.88%
Pred. PA          1        0        0       98        0        0        0        98.99%
Pred. RP          0        0        1        0       94        0        1        97.92%
Pred. SW          0        0        0        0        0       99        0       100.00%
Pred. RTG         0        0        0        0        4        0       99        96.12%
Class Recall     97%     100%      93%      98%      94%      99%      99%

Accuracy: 97.14%
TABLE II. Precision, Recall and Accuracy of the baseline method for different generative models

              True ER  True FF  True KG  True PA  True RP  True SW  True RTG  Class Precision
Pred. ER         94        1       30        0       11        0        0        69.12%
Pred. FF          0       73        2        0        6        6        0        83.91%
Pred. KG          6        0       37        0       17        0        0        61.67%
Pred. PA          0        0       26      100        6        0        0        75.76%
Pred. RP          0       23        5        0       52        0        0        65.00%
Pred. SW          0        3        0        0        0       94        0        96.91%
Pred. RTG         0        0        0        0        8        0      100        92.59%
Class Recall     94%      73%      37%     100%      52%      94%     100%

Accuracy: 78.57%
TABLE III. Precision and Recall of GMSCN for networks of different sizes

              ER    FF    KG    PA    RP    SW    RTG
Precision
Recall

E. Scalability and Performance
The aim of GMSCN is to find the generative model that best fits a given real network. We define the scalability of such a method as its ability to handle networks of large sizes as input. Referring to the methodology of the proposed method (Fig. 1), the most time-consuming part of the model classification is the feature extraction task. For the feature extraction task, GMSCN is obviously more scalable than the baseline method. There is no efficient algorithm for counting the graphlets in large networks. The selected network features in GMSCN (effective diameter, clustering coefficient, transitivity, assortativity and degree distribution percentiles) are efficiently computable by existing algorithms. We have also discarded costly-to-extract features such as the "average path length" because their extraction requires more computationally complex algorithms.

FIG. 6. Robustness of the different classification methods with respect to random edge rewiring.

Most of the graphlet-based methods, such as Ref. and Ref. , try to increase their scalability by incorporating a pre-stage of network sampling with very small rates, such as 0.01% (one out of 10,000) in Ref. . But such sampling rates decrease the accuracy of the graphlet counts, and the chosen sampling algorithm will also bias the counts. On the other hand, if sampling or approximation algorithms are accepted for the baseline method, these techniques would improve the performance of GMSCN too. In other words, the utilization of sampling and approximation algorithms increases the scalability of both the baseline method and GMSCN similarly. Some notes about the implementation and evaluation of GMSCN are presented in Appendix B. A sketch of an efficient effective-diameter estimate follows.
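For example, the effective diameter can be estimated cheaply by sampling BFS sources instead of computing all pairwise distances. The sketch below works under that sampling assumption; the paper itself uses the SNAP implementation (see Appendix B).

```python
import random
import networkx as nx
import numpy as np

def effective_diameter(g: nx.Graph, q: float = 0.9,
                       samples: int = 100, seed: int = 0) -> float:
    """Approximate effective diameter: the distance within which a
    fraction q of all connected node pairs can reach each other,
    estimated from BFS trees of a random sample of source nodes."""
    rng = random.Random(seed)
    sources = rng.sample(list(g.nodes()), min(samples, g.number_of_nodes()))
    dists = []
    for s in sources:
        lengths = nx.single_source_shortest_path_length(g, s)
        dists.extend(d for d in lengths.values() if d > 0)  # connected pairs only
    return float(np.quantile(dists, q))
```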
F. Effectiveness of the Degree Distribution Quantification Method

As described in Section III, we have proposed a new method for the quantification of the degree distribution based on its mean and standard deviation. In this subsection, we test the effectiveness of this quantification method. We show that without the proposed degree distribution features, the accuracy of the network classifier diminishes. Table IV shows the results of GMSCN after eliminating the six features related to the degree distribution (the DegDistP1..DegDistP6 percentiles). With this change, the overall accuracy of the method decreases by about eight percent (from 97.14% to 89.29%). This can be seen by comparing the values in Table IV with those of Table I, which reflects the results of GMSCN when employing all the features. Precision and recall are improved for almost all the models by incorporating the features related to the degree distribution. This fact shows the effectiveness of our proposed quantification method for the degree distribution.
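The ablation amounts to retraining the classifier on a feature matrix with the six DegDistP columns removed and comparing the cross-validated accuracies. A sketch reusing the hypothetical `train_and_validate` helper from Section III C; the assumption that the DegDistP features occupy the last six columns is ours.

```python
import numpy as np

def ablation_accuracy(X, y, train_and_validate):
    # Full feature set vs. the same matrix without the DegDistP columns.
    _, acc_full = train_and_validate(X, y)
    _, acc_no_degdist = train_and_validate(np.asarray(X)[:, :-6], y)
    return acc_full, acc_no_degdist   # e.g., 0.9714 vs 0.8929 in the paper
```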
TABLE IV. The results of GMSCN after excluding the features of degree distribution

              ER    FF    KG    PA    RP    SW    RTG
Precision
Recall       90%   95%   81%   97%   78%   96%   88%
V. CASE STUDY
We applied GMSCN to some real networks. The real network instances and the results of applying GMSCN to these networks are illustrated here (an end-to-end usage sketch follows this list):

1. "dblp cite" (with 475,886 nodes and 2,284,694 edges) is a network which is extracted from the DBLP service. This network shows the citation network among scientific papers. GMSCN proposes Forest Fire as the best fitting generative model for this network. Leskovec et al. also propose the Forest Fire model for two similar graphs of arXiv and patent citation networks.

2. "dblp collab" (with 975,044 nodes and 3,489,572 edges) is a co-authorship network of papers indexed in the DBLP service. A node in this network represents an author, and an edge indicates at least one collaboration in writing papers between the two authors. GMSCN suggests Forest Fire for this network instance too.

3. "p2p-Gnutella08" (with 6,301 nodes and 20,777 edges) is a relatively small P2P network with about 6000 nodes. The best fitting model suggested by GMSCN for this network instance is Kronecker Graphs.

4. Slashdot, a technology-related news website, presented the Slashdot Zoo, which allowed users to tag each other as friends. "Slashdot0902" (with 82,168 nodes and 543,381 edges) is a network of friendship links between the users of Slashdot, obtained in February 2009. The output of GMSCN for this social network is the Random Power Law model.

5. In the "web-Google" (with 875,713 nodes and 4,322,051 edges) network, the nodes represent web pages and directed edges represent hyperlinks among them. We ignored the direction of the links and considered the network as a simple undirected graph. The Random Power Law model is also proposed for this network by GMSCN.

6. "Email-EuAll" (with 265,214 nodes and 365,025 edges) is a communication network of email contacts which is predicted to follow the RTG model.

7. Finally, for the small network of "Email-URV" (with 1,133 nodes and 5,451 edges), which is another communication network of emails, GMSCN suggests the Small World model.

As explained above, various real networks, which are selected from a wide range of sizes, densities, and domains, are categorized into different network models by the GMSCN classifier. This fact indicates that no single generative model dominates in GMSCN for real networks, and it suggests different models for different network structures. The case study also verifies that no single generative model is sufficient for synthesizing networks similar to all real networks, and we should find the best model fitting the target network in each application. As a result, it is worth noting that the task of generative model selection is an important stage before generating network instances.
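Pulling the earlier sketches together, classifying one of these case-study networks could look roughly like this; the file name, the feature order, and the trained classifier `clf` are all assumptions carried over from the previous hypothetical snippets.

```python
import networkx as nx

# Assumes an edge-list file and the effective_diameter / deg_dist_percentiles
# sketches defined earlier; clf is the classifier from train_and_validate.
g = nx.read_edgelist("p2p-Gnutella08.txt")   # treated as a simple undirected graph
features = [
    nx.transitivity(g),
    nx.average_clustering(g),
    nx.degree_assortativity_coefficient(g),
    effective_diameter(g),
    *deg_dist_percentiles([d for _, d in g.degree()]),
]
print(clf.predict([features]))   # e.g., "KG" for this network
```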
VI. DISCUSSION

We evaluated GMSCN from different perspectives. GMSCN proposes a size-independent methodology for building the network classifier based on a wide range of local and global network features as the inputs of a decision tree. It shows a high accuracy in predicting the generative model for a given network. It is tolerant of and insensitive to small network changes. In addition to its size-independence, GMSCN outperforms the baseline method, which only considers the local features of graphlet counts, with respect to accuracy and efficiency. A new structural feature is also proposed in GMSCN which quantifies the network degree distribution.

One may argue that the size of the training set (700 network instances) is relatively small for a machine learning task. But we have actually utilized many more network instances in the process of evaluating GMSCN. Our dataset for evaluating GMSCN includes 15,400 different network instances: 700 instances in the fixed-size evaluation, 700 instances in the size-independence test and 14,000 (20 × 700) instances in the robustness-to-noise evaluation.

VII. CONCLUSION
In this paper, we proposed a new method (GMSCN) for network model selection. This method, which is based on learning a decision tree, finds the best model for generating complex networks similar to a specified network instance. The structural features of the given network instance are utilized as the input of the decision tree, and the result is the best fitting model. GMSCN outperforms the existing methods with respect to different criteria. The accuracy of GMSCN shows a considerable improvement over the baseline method. In addition, the set of supported generative models in GMSCN contains wider, newer and more important generative models such as Kronecker Graphs, Forest Fire and RTG. Unlike most of the existing methods, GMSCN is independent of the size of the input network. GMSCN is a robust model, insensitive to small network changes and noise. It is also a scalable method, and its performance is clearly better than that of the baseline method. GMSCN also includes a new and effective algorithm for the quantification of the network degree distribution. We examined different learning algorithms and, as a result, the decision tree learned by the LADTree method was the most accurate and robust model. We showed that local structural features, such as graphlet counts, are insufficient for inferring the network mechanisms, and it is necessary to consider a wider range of local and global structural features to be able to predict the network growth mechanisms.

In future work, we will investigate the effect of network structural features and growth mechanisms on the dynamics and behavior of the network when it is faced with different processes. For example, we will evaluate the similarity of the information diffusion process in a network and in its counterparts synthesized by the selected network generation model.

ACKNOWLEDGMENTS
We wish to thank Masoud Asadpour, Mehdi Jalili and Abbas Heydarnoori for their great comments.
Appendix A: Brief Introduction to Classification Methods
Machine learning is a subfield of artificial intelligence in which the main goal is to learn knowledge through experience. Classification is a learning task of inferring a classification function from labeled training data. Here, we explain the classifiers that are used in this paper.
Support Vector Machines (SVM). SVM performs a classification by mapping the inputs into a high-dimensional feature space and constructing hyperplanes to categorize the data instances. The best hyperplanes are those that cause the largest margin among the classes. The parameters of such a maximum-margin hyperplane are derived by solving an optimization problem. Sequential Minimal Optimization (SMO) is a common method for solving the optimization problem.

Bayesian Network Learning. A Bayesian network model is a probabilistic graphical model that represents a set of random variables and their conditional dependencies by a directed acyclic graph. The nodes in this graph represent the random variables, and an edge shows a conditional dependency between two variables. Bayesian network learning aims to create a network that best describes the probability distribution over the training data. To find the best network among the set of possible Bayesian networks, heuristic search techniques have been frequently used in the literature.

Artificial Neural Networks. An ANN is inspired by the human brain's neural network. An ANN consists of neuron units, arranged in layers and connected with weighted links, which convert an input vector into some outputs. Usually, the networks are defined to be feed-forward, with no feedback to the previous layer. In the training phase, the weights of the links are tuned to adapt the ANN to the training data. The back-propagation algorithm is a common method for the training phase.

C4.5 Decision Tree Learning. A decision tree is a tree structure of decision rules which can be used as a classification function (leaf nodes show the returned classes). C4.5 constructs a decision tree based on labeled training data. C4.5 uses information entropy to evaluate the goodness of the branches in the tree.

LADTree. This classifier generates a multi-class alternating decision tree, and it uses the boosting strategy. Boosting is a well-established classification technique that combines some weak classifiers to form a single powerful classifier. A prediction node in a LADTree includes a score for each of the candidate classes. LADTree calculates confidences for the different classes according to the scores visited in the prediction nodes, and it returns the best class according to the confidences.
To implement the Kronecker Graphs, Forest Fire, Preferential Attachment, Small World, and Random Power Law models, we utilized the SNAP library (http://snap.stanford.edu/snap/). An implementation of the RTG model is available as a MATLAB library. We also developed our own implementation of the ER model. The features are extracted with the aid of different network analysis tools. The igraph package (http://igraph.sourceforge.net/) of the R project helped us calculate the assortativity and transitivity measures. We used the SNAP library for measuring the effective diameter, average clustering coefficient, density, and also the graphlet counts. Since we proposed a new method for quantifying the network degree distribution, we implemented this method ourselves. We utilized RapidMiner as an open-source tool for machine learning. The implementations of LADTree, Bayesian network learning and SVM are actually part of the Weka tool, which is embedded in RapidMiner. The amount of computation needed for this research, especially counting the exact number of graphlets, was enormous. We utilized three virtual machines on a supercomputer for this computation task, each of which simulated a computer with 16 processing cores of 2.8 GHz and 24 GB of memory. Most of the computation time was spent on counting the graphlets of the generated network instances.

M. E. Newman, "The structure and function of complex networks," SIAM Review, 167–256 (2003).
R. Albert and A.-L. Barabási, "Statistical mechanics of complex networks," Reviews of Modern Physics, 47 (2002).
L. d. F. Costa, F. A. Rodrigues, G. Travieso, and P. Villas Boas, "Characterization of complex networks: A survey of measurements," Advances in Physics, 167–242 (2007).
C. Christensen and R. Albert, "Using graph concepts to understand the organization of complex systems," International Journal of Bifurcation and Chaos, 2201–2214 (2007).
S. Boccaletti, V. Latora, Y. Moreno, M. Chavez, and D.-U. Hwang, "Complex networks: Structure and dynamics," Physics Reports, 175–308 (2006).
J. Hlinka, D. Hartman, and M. Paluš, "Small-world topology of functional connectivity in randomly connected dynamical systems," Chaos: An Interdisciplinary Journal of Nonlinear Science, 033107 (2012).
A. Yazdani and P. Jeffrey, "Complex network analysis of water distribution systems," Chaos: An Interdisciplinary Journal of Nonlinear Science, 016111 (2011).
P. Cano, O. Celma, M. Koppenberger, and J. M. Buldú, "Topology of music recommendation networks," Chaos: An Interdisciplinary Journal of Nonlinear Science, 013107 (2006).
J. Leskovec, D. Chakrabarti, J. Kleinberg, C. Faloutsos, and Z. Ghahramani, "Kronecker graphs: An approach to modeling networks," The Journal of Machine Learning Research, 985–1042 (2010).
L. Akoglu and C. Faloutsos, "RTG: a recursive realistic graph generator using random typing," Data Mining and Knowledge Discovery, 194–209 (2009).
J. Janssen, M. Hurshman, and N. Kalyaniwalla, "Model selection for social networks using graphlets," Internet Mathematics, 338–363 (2012).
E. M. Airoldi, X. Bai, and K. M. Carley, "Network sampling and classification: An investigation of network model representations," Decision Support Systems, 506–518 (2011).
M. Middendorf, E. Ziv, and C. H. Wiggins, "Inferring network mechanisms: the Drosophila melanogaster protein interaction network," Proceedings of the National Academy of Sciences of the United States of America, 3192–3197 (2005).
R. Milo, S. Itzkovitz, N. Kashtan, R. Levitt, S. Shen-Orr, I. Ayzenshtat, M. Sheffer, and U. Alon, "Superfamilies of evolved and designed networks," Science, 1538–1542 (2004).
R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon, "Network motifs: simple building blocks of complex networks," Science Signaling, 824 (2002).
I. Bordino, D. Donato, A. Gionis, and S. Leonardi, "Mining large networks with subgraph counting," in Data Mining, 2008. ICDM'08. Eighth IEEE International Conference on (IEEE, 2008) pp. 737–742.
J. A. Grochow and M. Kellis, "Network motif discovery using subgraph enumeration and symmetry-breaking," in Research in Computational Molecular Biology (Springer, 2007) pp. 92–106.
L. Cui, S. Kumara, and R. Albert, "Complex networks: An engineering view," Circuits and Systems Magazine, IEEE, 10–25 (2010).
A. Pomerance, E. Ott, M. Girvan, and W. Losert, "The effect of network topology on the stability of discrete state models of genetic control," Proceedings of the National Academy of Sciences, 8209–8214 (2009).
E. Lameu, C. Batista, A. Batista, K. Iarosz, R. Viana, S. Lopes, and J. Kurths, "Suppression of bursting synchronization in clustered scale-free (rich-club) neuronal networks," Chaos: An Interdisciplinary Journal of Nonlinear Science, 043149 (2012).
J. Leskovec, J. Kleinberg, and C. Faloutsos, "Graphs over time: densification laws, shrinking diameters and possible explanations," in Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining (ACM, 2005) pp. 177–187.
J. M. Kleinberg, R. Kumar, P. Raghavan, S. Rajagopalan, and A. S. Tomkins, "The web as a graph: Measurements, models, and methods," in Computing and Combinatorics (Springer, 1999) pp. 1–17.
A.-L. Barabási and R. Albert, "Emergence of scaling in random networks," Science, 509–512 (1999).
D. J. Watts and S. H. Strogatz, "Collective dynamics of 'small-world' networks," Nature, 440–442 (1998).
P. Erdős and A. Rényi, "On the central limit theorem for samples from a finite population," Publ. Math. Inst. Hungar. Acad. Sci., 49–61 (1959).
D. Volchenkov and P. Blanchard, "An algorithm generating random graphs with power law degree distributions," Physica A: Statistical Mechanics and its Applications, 677–690 (2002).
M. Penrose, Random Geometric Graphs, Vol. 5 (Oxford University Press, Oxford, 2003).
W. Aiello, A. Bonato, C. Cooper, J. Janssen, and P. Prałat, "A spatial web graph model with local influence regions," Internet Mathematics, 175–196 (2008).
D. S. Callaway, J. E. Hopcroft, J. M. Kleinberg, M. E. Newman, and S. H. Strogatz, "Are randomly grown graphs really random?" Physical Review E, 041902 (2001).
R. V. Solé, R. Pastor-Satorras, E. Smith, and T. B. Kepler, "A model of large-scale proteome evolution," Advances in Complex Systems, 43–54 (2002).
K. Klemm and V. M. Eguiluz, "Highly clustered scale-free networks," Physical Review E, 036123 (2002).
B. Bollobás, Random Graphs, Vol. 73 (Cambridge University Press, 2001).
S. P. Borgatti and M. G. Everett, "Models of core/periphery structures," Social Networks, 375–395 (2000).
T. L. Frantz and K. M. Carley, "A formal characterization of cellular networks," Tech. Rep. (DTIC Document, 2005).
J. Leskovec, J. Kleinberg, and C. Faloutsos, "Graph evolution: Densification and shrinking diameters," ACM Transactions on Knowledge Discovery from Data (TKDD), 2 (2007).
G. Holmes, B. Pfahringer, R. Kirkby, E. Frank, and M. Hall, "Multiclass alternating decision trees," in Machine Learning: ECML 2002 (Springer, 2002) pp. 161–172.
R. Patro, G. Duggal, E. Sefer, H. Wang, D. Filippova, and C. Kingsford, "The missing models: a data-driven approach for learning how networks grow," in Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining (ACM, 2012) pp. 42–50.
M. Zanin, P. Sousa, D. Papo, R. Bajo, J. García-Prieto, F. del Pozo, E. Menasalvas, and S. Boccaletti, "Optimizing functional network representation of multivariate time series," Scientific Reports (2012).
A. Barrat and M. Weigt, "On the properties of small-world network models," The European Physical Journal B - Condensed Matter and Complex Systems, 547–560 (2000).
B. Bollobás, "The diameter of random graphs," Transactions of the American Mathematical Society, 41–52 (1981).
P. V. Boas, F. Rodrigues, G. Travieso, and L. da F. Costa, "Sensitivity of complex networks measurements," Journal of Statistical Mechanics: Theory and Experiment, P03009 (2010).
S. L. Tauro, C. Palmer, G. Siganos, and M. Faloutsos, "A simple conceptual model for the internet topology," in Global Telecommunications Conference, 2001. GLOBECOM'01. IEEE, Vol. 3 (IEEE, 2001) pp. 1667–1671.
V. Gómez, A. Kaltenbrunner, and V. López, "Statistical analysis of the social network and discussion threads in slashdot," in Proceedings of the 17th international conference on World Wide Web (ACM, 2008) pp. 645–654.
N. Z. Gong, W. Xu, L. Huang, P. Mittal, E. Stefanov, V. Sekar, and D. Song, "Evolution of social-attribute networks: measurements, modeling, and implications using google+," in Proceedings of the 2012 ACM conference on Internet measurement conference (ACM, 2012) pp. 131–144.
H. Kwak, C. Lee, H. Park, and S. Moon, "What is twitter, a social network or a news media?" in Proceedings of the 19th international conference on World wide web (ACM, 2010) pp. 591–600.
J. R. Quinlan, C4.5: Programs for Machine Learning, Vol. 1 (Morgan Kaufmann, 1993).
N. Friedman, D. Geiger, and M. Goldszmidt, "Bayesian network classifiers," Machine Learning, 131–163 (1997).
J. C. Platt, "Fast training of support vector machines using sequential minimal optimization," (1999).
J. A. Freeman and D. M. Skapura, Neural Networks: Algorithms, Applications, and Programming Techniques (Computation and Neural Systems Series) (1991).
E. de Silva and M. P. Stumpf, "Complex networks and simple models in biology," Journal of the Royal Society Interface, 419–430 (2005).
T. Aittokallio and B. Schwikowski, "Graph-based methods for analysing networks in cell biology," Briefings in Bioinformatics, 243–255 (2006).
M. Rahman, M. Bhuiyan, and M. A. Hasan, "GRAFT: an approximate graphlet counting algorithm for large graph analysis," in Proceedings of the 21st ACM international conference on Information and knowledge management (ACM, 2012) pp. 1467–1471.
http://dblp.uni-trier.de/xml.
http://snap.stanford.edu.
http://konect.uni-koblenz.de.
http://deim.urv.cat/~aarenas.