Empirical Characterization of Graph Sampling Algorithms
Muhammad Irfan Yousuf, Izza Anwer, Raheel Anwar
Abstract
Graph sampling allows mining a small representative subgraph from a big graph. Sampling algorithms deploy different strategies to replicate the properties of a given graph in the sampled graph. In this study, we provide a comprehensive empirical characterization of five graph sampling algorithms on six properties of a graph: degree, clustering coefficient, path length, global clustering coefficient, assortativity, and modularity. We extract samples from fifteen graphs grouped into five categories: collaboration, social, citation, technological, and synthetic graphs. We provide both qualitative and quantitative results. We find that no single method extracts true samples from a given graph with respect to the properties tested in this work. Our results show that the sampling algorithm that aggressively explores the neighborhood of a sampled node performs better than the others.
Keywords
Graph Sampling · Graph Properties · Empirical Characterization
M. I. Yousuf
Department of Computer Science (New Campus), University of Engineering and Technology, Lahore, Pakistan
Tel: +92-42-37951901
E-mail: [email protected]

I. Anwer
Department of Transportation Engineering and Management, University of Engineering and Technology, Lahore, Pakistan

R. Anwar
Karl Franzens Universität, Graz, Austria

1 Introduction

In the last few years, the explosive growth of online networks has attracted millions of users from around the globe. This huge user base provides many opportunities to analyze user behavior (Benevenuto et al. 2009), social interaction (Wilson et al. 2009), and information propagation patterns (Kwak et al. 2010), to name a few. However, given a large real-world graph with millions of vertices and edges, it is very difficult to apply typical graph processing approaches to analyze the graph directly. As a result, various graph sampling techniques have been proposed for the analysis or mining of large complex networks. Graph sampling is a technique to extract a small subgraph from the original large graph such that the properties of the original graph are preserved in the sample. If this holds, analyzing the sampled graph should have approximately the same effect as analyzing the original graph.

Many properties have been defined to characterize a graph, and these properties are very important to understand the graph (Hu and Lau 2013). To our understanding, an ideal representative sample of a large graph should preserve all the properties of the graph as accurately as possible. However, previous studies (Gjoka et al. 2010; Lee et al. 2006) show that some sampling methods are biased towards high-degree nodes, i.e., such sampling methods tend to sample high-degree nodes more often than low-degree nodes. In other words, such sampling approaches cannot preserve degree-related properties, e.g., the node degree distribution, and at the same time compromise other properties too. Moreover, some properties, e.g., assortativity and modularity, have not been discussed by the research community in association with graph sampling.

In this paper, we aim at providing a comprehensive empirical characterization of several graph sampling algorithms on a bigger set of graph properties. We extract samples from fifteen graphs with five state-of-the-art sampling methods at different sampling fractions.
We characterize the sampling methods on six properties of a graph, including three local properties, i.e., degree, clustering coefficient, and path length, and three global properties, i.e., global clustering coefficient, assortativity, and modularity. We provide both qualitative and quantitative results for a better understanding. To the best of our knowledge, we are the first to present such a comprehensive study on graph sampling algorithms. We believe that this study will provide new insights into graph sampling research and will be helpful in designing better sampling methods in the future.

The rest of the paper is organized as follows. In section 2, we provide some definitions and discuss six properties of a graph. In section 3, we overview five state-of-the-art sampling algorithms. In section 4, we discuss the evaluation criteria and the datasets used in this study. In section 5, we present the experimental evaluation and results of our study. In section 6, we discuss related work, and we conclude the paper in section 7.

We consider a graph G = (V, E), where V = {v_1, v_2, v_3, ..., v_n} is the set of vertices (or nodes) and E = {e_1, e_2, e_3, ..., e_m} is the set of edges (or links). The total number of nodes and edges are represented as |V| = n and |E| = m respectively. A sampling algorithm extracts a sample graph G_s = (V_s, E_s) from G such that V_s ⊂ V and E_s ⊂ E. The resulting sample graph G_s has n_s vertices and m_s edges. Given a sampling fraction or sampling budget φ such that |V_s| / |V| = φ, the aim of sampling is to obtain a sample with a small value of φ while preserving the properties of G in G_s.

Degree:
The degree of a node is defined as the number of edges connected to the node in the graph, while the average degree is simply the average number of edges per node in the graph. We represent the degree of node v as d_v and calculate the average degree d_avg(G) of graph G as:

d_avg(G) = 2m / n    (1)

The degree distribution P(d) of a graph is defined as the fraction of nodes in the network with degree d. Thus, if there are n nodes in total in a network and n_d of them have degree d, we have

P(d) = n_d / n    (2)

We find both the average degree and degree distribution of graphs in this work.
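As a sanity check on Eqs. (1) and (2), both quantities are straightforward to compute. A minimal sketch using networkx on a small hypothetical toy graph (not one of the paper's datasets):

```python
import networkx as nx
from collections import Counter

# Hypothetical toy graph standing in for a dataset.
G = nx.Graph([(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)])

n, m = G.number_of_nodes(), G.number_of_edges()
d_avg = 2 * m / n  # Eq. (1): d_avg(G) = 2m / n

# Eq. (2): P(d) = n_d / n, the fraction of nodes with degree d.
degree_counts = Counter(d for _, d in G.degree())
P = {d: n_d / n for d, n_d in degree_counts.items()}

print(d_avg)  # 2.0 for this toy graph
print(P)
```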
Clustering Coefficient:
The clustering coefficient measures the average probability that two neighbors of a vertex are themselves neighbors. The local clustering coefficient c_v of node v of degree d_v is the proportion of the number of edges e_v between the neighbors of v relative to the total number of possible edges between the neighbors, given by

c(v) = 2 e_v / (d_v (d_v − 1))    (3)

The average clustering coefficient c_avg of a graph is calculated as:

c_avg = (1/n) Σ_{i=1}^{n} c_i    (4)

We also find the distribution of local clustering coefficients to show the fraction of nodes having a particular value of clustering coefficient.

Path length:
In a graph, path length is defined as the number of edges traversed while going from a source node to a destination node. The average path length is calculated over the shortest paths between all possible pairs of network nodes. Let d(v_i, v_j), where v_i, v_j ∈ V, denote the shortest distance between nodes v_i and v_j. Then, the average path length l_avg is:

l_avg = (1 / (n(n − 1))) Σ_{i ≠ j} d(v_i, v_j)    (5)

The frequency distribution of these path lengths defines the path length distribution.
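Eqs. (3)-(5) can be cross-checked against a library implementation. A minimal sketch with networkx on a hypothetical toy graph (Eq. (5) assumes a connected graph):

```python
import networkx as nx
from itertools import combinations

G = nx.Graph([(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)])  # hypothetical toy graph

# Eq. (3): c(v) = 2*e_v / (d_v * (d_v - 1)), computed by hand for node 2.
v = 2
nbrs = set(G[v])
e_v = sum(1 for a, b in combinations(nbrs, 2) if G.has_edge(a, b))
d_v = G.degree(v)
c_v = 2 * e_v / (d_v * (d_v - 1))
assert abs(c_v - nx.clustering(G, v)) < 1e-12  # agrees with the library

c_avg = nx.average_clustering(G)            # Eq. (4)
l_avg = nx.average_shortest_path_length(G)  # Eq. (5), connected graphs only
print(c_v, c_avg, l_avg)
```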
Global Clustering Coefficient:
The global clustering coefficient is based on triplets of nodes. A triplet consists of three nodes connected by either two (open triplet) or three (closed triplet) undirected edges. The global clustering coefficient c_g(G) of a graph G is defined as the ratio of the number of closed triplets to the total number of triplets (both open and closed):

c_g(G) = (number of closed triplets) / (total number of triplets)    (6)

Assortativity:
It measures the tendency of high-degree nodes to connect to other high-degree nodes (and vice versa) in a network using the Pearson correlation coefficient (Newman 2002). Positive values of this coefficient indicate a correlation between nodes of similar degrees, while negative values indicate relationships between nodes of different degrees. The assortativity coefficient is the Pearson correlation coefficient of degree between pairs of linked nodes. It is given by r as:

r = (1/σ_q²) Σ_{jk} jk (e_{jk} − q_j q_k)    (7)

where e_{jk} is the joint probability distribution of the remaining degrees of a node pair, q_k is the distribution of the remaining degree, and σ_q² is the variance of q. When r = 1 the network is said to be assortative, when r = 0 the network is non-assortative, and when r = −1 the network is disassortative.

Modularity:
Modularity is one of the measures of the structure of networks or graphs. It measures the strength of the division of a network into modules (clusters or communities). Networks with high modularity have dense connections between the nodes within modules but sparse connections between nodes in different modules. Modularity is the fraction of the edges that fall within the given modules minus the expected fraction if the edges were distributed at random (Newman 2006). Given the partition of a network into a set of communities C_i, the degree of modularity Q associated with this partition can be measured as follows:

Q = (1/2m) Σ_{ij} [A_ij − (k_i k_j / 2m)] δ(C_i, C_j)    (8)

where m is the total number of edges in the network, k_i is the degree of node i, A_ij are the elements of the adjacency matrix, C_i is the community the i-th node belongs to, and δ(C_i, C_j) = 1 if C_i = C_j and 0 otherwise.

Graph sampling is a technique to extract a small subgraph G_s from a big graph G such that the subgraph truly represents the original graph. In some scenarios, the whole graph is known in advance and sampling is used to obtain a smaller graph. In other scenarios, the whole graph is unknown and sampling is performed to explore the graph. Although sampled graphs are smaller in size, they are similar to the original graphs in some way. In the last decade or so, numerous graph sampling methods have been proposed. These methods can be categorized into Node Sampling, Edge Sampling and Traversal-based Sampling. Of these categories, Traversal-based Sampling has a long history and has some advantages over the other two (Hu and Lau 2013). Regardless of the sampling method, we are particularly interested in what graph properties are preserved given a sampling method.
If some properties are preserved, we can construct efficient estimators for them.

In this paper, we explore how existing sampling methods perform in preserving different important properties of the original graphs. We explore the following five sampling algorithms for this purpose.

Frontier Sampling (FS):
In FS (Ribeiro and Towsley 2010), we deploy m-dimensional dependent random walks to sample a graph. It requires a special estimator function to remove the bias introduced by random walks when estimating a metric. FS works in three steps. First, it randomly chooses a set of nodes, S, as seeds. Second, it selects a seed v from S with the probability P(v) given as:

P(v) = d_v / Σ_{u∈S} d_u    (9)

Third, an edge e(v, w) is selected uniformly from the edges incident to v. It then adds the edge e(v, w) to the set of sampled edges and replaces v with w in S. Similar to other methods (Gjoka et al. 2010), FS ignores isolated nodes in a graph. It should be noted that in order to study any metric under the FS method, we must construct a particular estimator, rather than just studying the sampled nodes and edges directly.

Expansion Sampling (XS):
The XS strategy (Maiya and Berger-Wolf 2010) is based on the concept of expansion from work on expander graphs. It was particularly designed to sample communities in networks. It extracts a stratified sample of the community structure, i.e., it tries to sample nodes from all the communities in a graph. The algorithm greedily constructs the sample by maximizing the expansion factor X(S) of sample S, given by:

X(S) = |N(S)| / |S|    (10)

where N(S) is the neighborhood of S. In our work, we implement Snowball Expansion Sampling as it performs slightly better than Markov Chain Monte Carlo Expansion Sampling (Maiya and Berger-Wolf 2010).

Rank Degree (RD):
RD (Voudigari et al. 2016) is a graph exploration method based on ranking nodes according to their degree values. The algorithm takes three parameters (s, ρ, x) as input, where s is the number of initial seeds, ρ with 0 < ρ ≤ 1 determines the fraction of top-ranked neighbors to keep, and x is the sample size. The algorithm starts with a set of seed nodes and picks a node at random from this set. It then ranks the neighbors of the selected seed node according to their degree values and selects the top-k nodes from the ranked list. The selected top-k nodes, along with their edges to the seed node, are added to the sample graph. Moreover, these top-k nodes form the set of seeds in the next iteration.

List Sampling (LS):
The work in (Yousuf and Kim 2018) claims that the previous methods do not explore the neighborhood of sampled nodes fairly and hence yield sub-optimal samples. It introduces a new approach in which we keep a list of candidate nodes that is populated with all the neighbors of the nodes sampled so far. By doing this, we balance the depth and breadth of graph exploration and produce better samples. The paper proposes three algorithms based on this idea that differ in how they select nodes from the candidate list. We implement the LS2 algorithm in this work as it performs better than the other two variations (Yousuf and Kim 2018).
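The candidate-list idea can be sketched in a few lines. The selection rule below (uniform choice from the list) is a placeholder, not the actual LS2 rule of Yousuf and Kim (2018), which is not specified here:

```python
import random
import networkx as nx

def list_sample(G, budget, seed_node=None, rng=random):
    """Candidate-list sampling sketch: every neighbor of a sampled node
    enters a candidate list, and the next node is drawn from that list.
    The uniform draw below is a hypothetical stand-in for LS2's rule."""
    v = seed_node if seed_node is not None else rng.choice(list(G))
    sampled = {v}
    candidates = [u for u in G[v] if u not in sampled]
    while len(sampled) < budget and candidates:
        u = rng.choice(candidates)  # hypothetical selection rule
        candidates = [w for w in candidates if w != u]
        sampled.add(u)
        # Newly discovered neighbors join the candidate list.
        candidates.extend(w for w in G[u]
                          if w not in sampled and w not in candidates)
    return G.subgraph(sampled).copy()

G = nx.path_graph(100)
S = list_sample(G, budget=10, seed_node=50)
print(S.number_of_nodes())  # 10
```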
Hybrid Jump (HJ):
HJ (Liu et al. 2019) introduces a hybrid jump strategy into the Metropolis-Hastings Random Walk during the sampling process. It uses a breadth-first search to obtain a set of unique nodes from a list of jump nodes. By applying Uniform Sampling, it estimates the average degree of the original network and determines the optimal value of the jump parameter.
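As a concrete illustration of the traversal-based samplers reviewed above, the three FS steps can be sketched as follows. The debiasing estimator that FS requires is omitted, and networkx's built-in karate club graph stands in for a dataset:

```python
import random
import networkx as nx

def frontier_sample(G, num_seeds, num_edges, rng=random):
    """Frontier Sampling sketch (Ribeiro and Towsley 2010): keep a set of
    seeds, pick a seed proportionally to its degree, follow a uniform
    incident edge, and replace the seed by the new endpoint."""
    seeds = rng.sample(list(G), num_seeds)  # Step 1: random seed set S
    sampled_edges = []
    for _ in range(num_edges):
        # Step 2: choose seed v with probability d_v / sum of seed degrees.
        weights = [G.degree(v) for v in seeds]
        v = rng.choices(seeds, weights=weights)[0]
        # Step 3: pick an incident edge uniformly, record it, replace v by w.
        w = rng.choice(list(G[v]))
        sampled_edges.append((v, w))
        seeds[seeds.index(v)] = w
    return nx.Graph(sampled_edges)

G = nx.karate_club_graph()
S = frontier_sample(G, num_seeds=5, num_edges=30)
print(S.number_of_edges())
```

Duplicate draws of the same edge collapse in the sample graph, so the sample holds at most `num_edges` distinct edges.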
We evaluate the above-mentioned five sampling methods both qualitatively and quantitatively against one another. We evaluate them on both real-world and synthetic graphs and see how well they preserve the above-mentioned six properties of a graph.

4.1 Datasets

We perform experiments on both real-world and synthetic datasets. We use 12 real-world and 3 synthetic datasets. The real datasets are drawn from a wide range of networks including collaboration networks, social networks, citation networks and technological networks. The size of these networks varies from a few thousand to a million nodes. All these datasets are publicly available (Leskovec and Krevl 2014; Rossi and Ahmed 2015; Konect 2015).

Moreover, we also extract samples from three synthetic networks because these networks have strong mathematical foundations and it is interesting to sample them. We select three generative models for synthesizing networks and run the sampling methods on the generated networks. The selection of these models is based on the fact that they generate graphs that follow many properties of real-world graphs. The parameter values of each generative model are tuned such that the generated network has nearly the same number of nodes and edges as the average of the twelve real-world datasets. The three generative models are: 1) Forest Fire (FF) (Leskovec et al. 2007), 2) Small World (SW) (Watts and Strogatz 1998), and 3) Mixed Model (MM) (Yousuf and Kim 2020a). We use sampling fractions φ ranging from 0.02 to 0.1 to extract samples from these graphs. Table 1 summarizes the characteristics of these datasets.

4.2 Evaluation Metrics

We evaluate the quality of samples by comparing the above-mentioned properties of the sample graphs with those of the original graphs. We also perform quantitative tests, namely Jensen-Shannon Distance (JSD) and Root Mean Square Error (RMSE), for quantitative evaluation of the sampling algorithms. All the results presented in this paper are averaged over ten readings.
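The two quantitative measures used throughout (RMSE for point statistics, Jensen-Shannon distance for distributions) are defined formally below; a minimal sketch follows. The base-2 logarithm is an assumption here, chosen so that the distance lies in [0, 1]:

```python
import math

def rmse(sampled, original):
    """Root Mean Square Error between sampled and original values."""
    return math.sqrt(sum((s - o) ** 2 for s, o in zip(sampled, original))
                     / len(sampled))

def js_distance(p, q):
    """Jensen-Shannon distance between two discrete distributions given
    as equal-length probability vectors (base-2 logs, result in [0, 1])."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return math.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))

print(rmse([1.0, 2.0], [1.0, 4.0]))         # sqrt(4/2) ≈ 1.414
print(js_distance([1.0, 0.0], [0.0, 1.0]))  # 1.0 for disjoint distributions
```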
We use the following metrics for evaluation.

Point Statistics:
A point statistic shows the value of a property at a single point. We vary the sampling fraction φ from 0.02 to 0.1 and plot the scaling ratio of a property Θ as the ratio of the value of that property in the sampled graph, Θ_S, to its value in the actual graph, Θ_A:

Scaling Ratio = Θ_S / Θ_A    (11)

For example, we measure the average degree of the sample and original graphs and find the scaling ratio of degree by dividing the average degree of the sample graph G_s at a sampling fraction φ by the average degree of the original graph G.

Table 1: Characteristics of the datasets used in the experiments

| Dataset | Nodes | Edges | Avg. Degree | Avg. Clust. Coeff. | Avg. Path Length | Assortativity | Global Clust. Coeff. | Modularity | Network Type |
|---|---|---|---|---|---|---|---|---|---|
| CiteSeer | 227,320 | 814,134 | 7.16 | 0.76 | 7.82 | 0.07 | 0.45 | 0.89 | Collaboration |
| DBLP | 317,080 | 1,049,866 | 6.62 | 0.73 | 6.79 | 0.26 | 0.31 | 0.81 | Collaboration |
| Actors | 382,219 | 15,038,084 | 78.68 | 0.78 | 3.56 | 0.22 | 0.16 | 0.68 | Collaboration |
| Gowalla | 196,591 | 950,327 | 9.66 | 0.31 | 4.62 | -0.03 | 0.02 | 0.69 | Social |
| Digg | 770,799 | 5,907,132 | 15.32 | 0.16 | 4.49 | -0.09 | 0.05 | 0.53 | Social |
| Hyves | 1,402,673 | 2,777,419 | 3.96 | 0.11 | 5.67 | -0.02 | 0.01 | 0.76 | Social |
| Cora | 23,166 | 89,157 | 7.69 | 0.31 | 5.74 | -0.05 | 0.12 | 0.78 | Citation |
| HepTh | 27,769 | 352,285 | 25.37 | 0.32 | 4.27 | -0.03 | 0.11 | 0.64 | Citation |
| HepPh | 34,546 | 420,877 | 24.36 | 0.29 | 4.41 | -0.01 | 0.14 | 0.72 | Citation |
| Topology | 34,761 | 107,720 | 6.19 | 0.42 | 3.78 | -0.21 | 0.05 | 0.61 | Technological |
| Gnutella | 62,586 | 147,892 | 4.72 | 0.01 | 5.96 | -0.09 | 0.01 | 0.49 | Technological |
| Caida | 190,914 | 607,610 | 6.36 | 0.21 | 6.98 | 0.02 | 0.06 | 0.83 | Technological |
| FF | 300,000 | 2,446,862 | 16.31 | 0.16 | 3.34 | -0.06 | 0.01 | 0.27 | Synthetic |
| SW | 300,000 | 2,400,000 | 16.00 | 0.37 | 5.89 | 0.00 | 0.36 | 0.78 | Synthetic |
| MM | 300,000 | 2,322,676 | 15.48 | 0.39 | 3.44 | -0.17 | 0.07 | 0.15 | Synthetic |

Distributions:
A distribution is a multivalued statistic and shows the distribution of a property in a graph. For example, the degree distribution shows the fraction of nodes that have a degree greater than or less than a particular value. We find and plot the Empirical Cumulative Distribution Function (ECDF) of the degree, clustering coefficient and path lengths of sample graphs at φ = 0.02.

Root Mean Square Error:
Given the original graph G and sampled graph G_s, we want to measure how far G_s is from G. For scalar quantities such as the average degree, we measure the quality of estimation by the Root Mean Square Error (RMSE), given as

RMSE = √( (1/n) Σ (Θ_S − Θ_A)² )    (12)

where Θ_S and Θ_A are the sampled and original values respectively.

Jensen-Shannon Distance:
For distributions of the properties, we measure the Jensen-Shannon Distance. In probability theory, the Jensen-Shannon Divergence measures the similarity between two probability distributions, calculated as

D_JS(P || Q) = (1/2) D_KL(P || M) + (1/2) D_KL(Q || M)    (13)

where D_JS and D_KL are the Jensen-Shannon and Kullback-Leibler divergences respectively, while P and Q are two probability distribution functions (PDFs) and
M = ½(P + Q). The square root of the Jensen-Shannon Divergence is a true metric, often referred to as the Jensen-Shannon Distance (JSD).

In this section, we evaluate the FS, XS, RD, LS and HJ sampling algorithms in terms of the six graph properties defined above. We analyze the performance of each algorithm and take the properties of the original graphs as ground-truth values.

5.1 Degree

In the first experiment, we vary the sampling fraction from φ = 0.02 to φ = 0.1 and plot the scaling ratio of the average degree in Figure 1. We also find the degree distributions at φ = 0.02 and present the results in Figure 2 in the form of the Empirical Cumulative Distribution Function (ECDF). In general, all the algorithms extract better samples in the collaboration, social and technological graphs than in the citation networks. The overall trend is that these algorithms tend to pick low-degree nodes more often than high-degree nodes. We present the Jensen-Shannon distance in Table 3. The results show that the XS and LS samples produce less deviation from the original distributions. On average, LS outperforms the other methods in this metric. We observe that these algorithms perform better in sampling the collaboration and social graphs, whereas they perform worse in the citation networks. Possibly, the high average degree of the citation networks cannot be matched with a small sampling budget. In the case of the synthetic networks, the algorithms extract better samples from the FF and MM graphs.

Fig. 1: Point statistics of average degree of all the networks with 95% confidence intervals.

Table 2: RMSE values and standard deviations of point statistics of average degree.
Boldface values are the best results. (Columns: Datasets, FS, XS, RD, LS, HJ; the numeric entries are not recoverable in this copy.)

Fig. 2: Degree distributions of all the networks at φ = 0.02. (Best viewed in color.)

Table 3: Jensen-Shannon distance and standard deviations for degree distributions.
Boldface values are the best results. (Columns: Datasets, FS, XS, RD, LS, HJ; the numeric entries are not recoverable in this copy.)
02 to φ = 0 . φ = 0 .
02 by the sampling algorithms. We see that, generally, thesamples extracted by LS in the collaboration, social, and citation networks bettermatch with their original counterparts. In the technological networks, LS, XS, andRD extract good samples, however, all the algorithms fail in sampling the small-world network. Besides LS, other methods also extract reasonable samples in fewdatasets, e.g., XS in the citation networks and RD in the social networks. The JSdistance is summarized in Table 5. On average, LS samples give minimum errorand outperform in nine datasets whereas RD stands second to it. It seems thatthe sampling algorithms find it hard to sample the synthetic networks while theyperform relatively better in the other networks.5.3 Path lengthThe point statistics of path length in all the networks are shown in Figure 5.The figure shows that HJ and FS perform poorly and most of their statistics fallout of the plotting area. We fixed the plotting area otherwise it would have beendifficult to see the results visually because of the high statistics of HJ and FS.Other sampling algorithms extract good samples in most of the networks with afew exceptions, e.g., the small-world network could not be sampled well by anyalgorithm. We give the RMSE values with standard deviations in Table 6. We findthat HJ and FS samples produce big error values. Overall, the RMSE values are onthe higher side when compared with that of the degree and clustering coefficientstatistics. The possible reason is that the path length of a network is a complexmetric and it seems that sampling algorithms need a more thoughtful design tosample path length.We show the distribution of path length in all the networks in Figure 6. AgainHJ and FS give poor results in almost all the datasets. Other methods performwell in the collaboration networks and also extract reasonable samples in the social and citation networks. In Table 7, we give JS distance values with standarddeviations. 
We find that, on average, the LS samples give the least error among all the sampling methods and outperform the others in ten networks. This experiment shows that the sampling algorithms fail to follow the path length distribution of a network and do not estimate the average path length very well.
Fig. 3: Point statistics of average clustering coefficient of all the networks with 95% confidence intervals.

Table 4: RMSE values and standard deviations of point statistics of average clustering coefficient. Boldface values are the best results. (Columns: Datasets, FS, XS, RD, LS, HJ; the numeric entries are not recoverable in this copy.)

Fig. 4: Clustering coefficient distributions of all the networks at φ = 0.02. (Best viewed in color.)

Table 5: Jensen-Shannon distance and standard deviations for clustering coefficient distributions.
Boldface values are the best results. (Columns: Datasets, FS, XS, RD, LS, HJ; the numeric entries are not recoverable in this copy.)

Fig. 5: Point statistics of average path length of all the networks with 95% confidence intervals.

Table 6: RMSE values and standard deviations of point statistics of average path length.
Boldface values are the best results. (Columns: Datasets, FS, XS, RD, LS, HJ; the numeric entries are not recoverable in this copy.)

Fig. 6: Path length distributions of all the networks at φ = 0.02. (Best viewed in color.)

Table 7: Jensen-Shannon distance and standard deviations for path length distributions.
Boldface values are the best results. (Columns: Datasets, FS, XS, RD, LS, HJ; the numeric entries are not recoverable in this copy.)

FS and HJ perform poorly in the synthetic network MM. We give the RMSE values and standard deviations in Table 10. The table shows that XS and RD give the smallest error in five and four networks respectively, while FS and LS each outperform the others in three networks. On average, the LS samples give the minimum error in this metric.

Fig. 7: Point statistics of global clustering coefficient of all the networks with 95% confidence intervals.

Table 8: RMSE values and standard deviations of point statistics of global clustering coefficient.
Boldface values are the best results. (Columns: Datasets, FS, XS, RD, LS, HJ; the numeric entries are not recoverable in this copy.)

Fig. 8: Point statistics of assortativity of all the networks with 95% confidence intervals.

Table 9: RMSE values and standard deviations of point statistics of assortativity.
Boldface values are the best results. (Columns: Datasets, FS, XS, RD, LS, HJ; the numeric entries are not recoverable in this copy.)

Fig. 9: Point statistics of modularity of all the networks with 95% confidence intervals.

Table 10: RMSE values and standard deviations of point statistics of modularity.
Boldface values are the best results. (Columns: Datasets, FS, XS, RD, LS, HJ; the numeric entries are not recoverable in this copy.)

Table 11: Summary of the results: the average values of RMSE and JSD for all the sampling algorithms.
Boldface values are the best results. (Columns: Metric, Graph Property, FS, XS, RD, LS, HJ; of the numeric entries, only part of the first row survives in this copy: RMSE of degree is 13.06 for FS and 20.25 for XS.)
Graph sampling is used to obtain a representative subgraph from a large graph. A very naive approach is to select nodes uniformly at random from the original graph and then induce the sample graph over this node set. A similar approach is to sample edges uniformly at random and then add the nodes at the ends of those edges to the node set. A fundamental problem with these random sampling methods is that real-world large graphs are not available as a whole, and it becomes nearly infeasible to select nodes or edges uniformly at random. In traversal-based sampling, we traverse a small portion of a graph, e.g., Breadth First Sampling (Becchetti et al. 2006), Random First Sampling (Doerr and Blenn 2013), Snowball Sampling (Lee et al. 2006) and different variations of Random Walk (Ribeiro and Towsley 2010; Bar-Yossef and Gurevich 2008; Gkantsidis et al. 2006; Stutzbach et al. 2009; Rasti et al. 2009). More recently, the authors of (Yousuf and Kim 2020b) apply a traversal-based sampling method that uses the local information of nodes, combined with estimated values of a set of properties, to guide the sampling process and extract tiny samples that preserve the properties of the graph. All these approaches have their own pros and cons. For example, Breadth First Sampling (BFS) is known to overestimate the degree because BFS samples are biased towards high-degree nodes. The closest to our work is the study presented in (Wang et al. 2011), which provides a good understanding of how sampling works in big graphs. The authors analyze several graph sampling algorithms and evaluate their performance on some widely recognized graph properties of directed graphs using large-scale social network datasets. However, that work considers a much smaller set of graph properties than ours.
In this work, we conduct a comprehensive empirical study to characterize several graph sampling algorithms. We test the ability of these sampling methods to maintain the properties of the original graphs. We characterize these methods on five types of graphs and test their ability to match six properties of the original graphs. We find that the algorithms that explore the neighborhood of a sampled node on a priority basis perform better than the other methods in maintaining the structure of the original graph in their samples. We also find that preferring high-degree nodes during the sampling process benefits the sampling methods. However, we realize that extracting the structure and maintaining the properties of a graph in a tiny sample is a difficult task and needs a very thoughtful sampling process.
References
Bar-Yossef Z, Gurevich M (2008) Random sampling from a search engine's index. J ACM 55(5):24:1-24:74
Becchetti L, Castillo C, Donato D, Fazzone A (2006) A comparison of sampling techniques for web graph characterization. In: LinkKDD
Benevenuto F, Rodrigues T, Cha M, Almeida V (2009) Characterizing user behavior in online social networks. In: Proceedings of the 9th ACM SIGCOMM Conference on Internet Measurement, pp 49-62
Ribeiro B, Towsley D (2010) Estimating and sampling graphs with multidimensional random walks. In: ACM Internet Measurement Conference
Doerr C, Blenn N (2013) Metric convergence in social network sampling. In: ACM HotPlanet
Gjoka M, Kurant M, Butts C, Markopoulou A (2010) Walking in Facebook: A case study of unbiased sampling of OSNs. In: INFOCOM
Gkantsidis C, Mihail M, Saberi A (2006) Random walks in peer-to-peer networks: Algorithms and evaluation. Perform Eval 63(3):241-263
Hu P, Lau WC (2013) A survey and taxonomy of graph sampling. CoRR abs/1308.5865, URL http://arxiv.org/abs/1308.5865