Graph Sampling Approach for Reducing Computational Complexity of Large-Scale Social Network
Andry Alamsyah, Yahya Peranginangin, Intan Muchtadi-Alamsyah, Budi Rahardjo, Kuspriyanto
http://dx.doi.org/10.12988/jite.2016.6828
School of Economics and Business, Telkom University, Indonesia
School of Electrical Engineering, Bandung Institute of Technology, Indonesia
Faculty of Mathematics and Natural Science, Bandung Institute of Technology, Indonesia
Abstract
Online social network services provide platforms for human social interaction. Nowadays, many kinds of online interaction generate large-scale social network data. Network analysis helps to mine knowledge and patterns from the relationships between actors inside a network. This approach is important for supporting predictions and decision-making processes in many real-world applications. The social network analysis methodology, which borrows its approach from graph theory, provides several metrics that enable us to measure specific properties of networks. Some metric calculations were built with no scalability in mind and are therefore computationally expensive. In this paper, we propose a graph sampling approach to reduce social network size and thus reduce computation operations. A performance comparison between natural graph sampling strategies using edge random sampling, node random sampling, and random walks is presented for each selected graph property. We found that the performance of a graph sampling strategy depends on the graph property being measured.
Keywords: Graph Sampling, Graph Theory, Large-Scale Social Network, Computational Complexity, Social Network Analysis

1 Introduction
Our daily online interactions have contributed to the rise of data production. This leads to massive data being available, which in turn offers the opportunity to model phenomena and make predictions that support decision-making processes in many real-world applications. The legacy methodologies in the social sciences are based on working directly with the population, and few take the route of gaining insight from the massive data available online. We use the
Social Network Analysis (SNA) [1][2] methodology to analyze the network data. The SNA foundation is based on graph theory [3]. SNA provides many metrics that support network topology measurement, but some of them, for example Betweenness Centrality, were built for small networks. As the network becomes bigger, the computational complexity of finding the shortest paths between all pairs of nodes in the network increases.
Betweenness Centrality has computational complexity reaching O(|N|³) [4], where |N| is the number of nodes in the network. In our previous attempt to reduce graph size and build a representative sample, we proposed a Graph Summary [5] based on the
Minimum Description Length [6] principle. Our effort reduced the graph size by 50%, but the time needed to construct the graph summary was high because of the complexity of the node-merging operation. For that reason, we try another, simpler alternative: graph sampling. Graph sampling [7] is a technique to pick a subset of nodes and/or edges from the original graph. Although smaller in size, the sampled graph may be similar to the original graph in some respects (see Figure 1). In this paper, we are interested in how properties behave after graph sampling is applied. We generate graph samples of different sizes in order to see how the properties evolve with network size. We investigate which properties are preserved by a given sampling technique and which are not. If a property of the sampled graph is preserved, then we can estimate the property value of the original network. We define the original graph
G(N_G, E_G), where N_G is the set of nodes and E_G is the set of edges. The graph sample S(N_S, E_S) is a subset of G(N_G, E_G), where N_S ⊆ N_G and E_S ⊆ E_G. We denote by |N| the number of nodes and by |E| the number of edges.

Figure 1. (a) left: Original network with 4039 nodes and 88234 edges. (b) right: Sampling of the original network using the RW strategy, with 100 nodes and 2013 edges.

2 Social Network Properties
There are many social network metrics available to describe certain social network properties. We use the following properties [2] for comparison at each graph sample size:

1. Average Degree is the average number of edges per node. For an undirected network, Average Degree = 2|E|/|N|.
2. Density is the number of edges in the network compared to the maximum possible number of edges between nodes. For an undirected network, Density = 2|E|/(|N|(|N|-1)).
3. Modularity measures the fraction of the edges that fall within given groups minus the expected value of that fraction if edges were distributed at random. A bigger modularity value means the boundaries between groups in the network are more distinct.
4. Average Clustering Coefficient measures the degree to which nodes tend to cluster together, averaged over all nodes in the network.
5. Diameter measures the longest shortest path between any pair of nodes in the network.
6. Average Path Length is calculated by finding the shortest paths between all pairs of nodes, adding them up, and dividing by the total number of pairs.
7. Connected Component counts the network components in which any two nodes are connected to each other by paths.

3 Graph Sampling Strategies
We propose three graph sampling strategies: Edge Random Sampling (ERS), Node Random Sampling (NRS), and Random Walk Sampling (RW). We explain each strategy and its algorithm construction as follows.

3.1 Edge Random Sampling

The Edge Random Sampling (ERS) strategy is the simplest of the three. We pick random edges uniformly from the network. The selected edges are used to construct a new, smaller network. The
ERS algorithm is as follows:
1: input network G(N_G, E_G)
2: choose a set of random edges E_S from G
3: construct network sample S(N_S, E_S), where E_S are the edges chosen from E_G and N_S are the nodes connected by the edges in E_S

3.2 Node Random Sampling

The
Node Random Sampling (NRS) strategy works as follows. We pick random nodes uniformly from the network, then construct a new, smaller network from the chosen nodes. The NRS algorithm is as follows:
1: input network G(N_G, E_G)
2: choose a set of random nodes N_C from G
3: create permutation list P = P(N_C, 2)
4: create intersection list I = P ∩ E_G, where I is the list of node pairs from P that are actual edges in G
5: construct network sample S(N_S, E_S) from list I

3.3 Random Walk Sampling

The Random Walk (RW) strategy is based on the random walk idea: given a network and a starting node, we select a neighbor of the current node at random and move to it. The random sequence of selected nodes and edges is kept as the result of the RW strategy. We run each RW process for a certain number of iterations, chosen carefully depending on what we consider a sufficient sample of the network size. The x most visited nodes from the iterations become our candidates for the network sample. From this point, the process is similar to the NRS process; the difference is that the nodes are chosen not by uniform random sampling but by the random walk mechanism. The RW algorithm is as follows:
1: input network G(N_G, E_G)
2: while r < number of runs:
3:   while i < number of iterations: randomly choose N_G(i+1), where N_G(i+1) is a neighbor of N_G(i)
4:   list L_r = (N_G(i), N_G(i+1), N_G(i+2), ..., N_G(number of iterations))
5: list L = L_1 + L_2 + ... + L_(number of runs)
6: list M = list L sorted in descending order by the number of times each node was visited in the RW strategy
7: list M_x = the x most visited nodes
8: create permutation list P = P(M_x, 2)
9: create intersection list I = P ∩ E_G, where I is the list of node pairs from P that are actual edges in G
10: construct network sample S(N_S, E_S) from list I

4 Experiments
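As a concrete illustration, the three sampling strategies can be sketched in code. The sketch below is a minimal, non-authoritative version assuming Python with the networkx library; the paper does not specify an implementation, and all function names are ours.

```python
# Minimal sketches of the ERS, NRS, and RW strategies.
# Assumption: Python with the networkx library (not specified in the paper).
import random
from collections import Counter
from itertools import combinations

import networkx as nx

def induced_sample(G, chosen):
    """Keep the chosen nodes plus every original edge between them
    (the permutation-and-intersection step of NRS and RW)."""
    S = nx.Graph()
    S.add_nodes_from(chosen)
    S.add_edges_from(p for p in combinations(chosen, 2) if G.has_edge(*p))
    return S

def ers_sample(G, n_edges):
    """Edge Random Sampling: a uniform random subset of edges."""
    edges = random.sample(list(G.edges()), n_edges)
    return nx.Graph(edges)  # nodes are those incident to the chosen edges

def nrs_sample(G, n_nodes):
    """Node Random Sampling: a uniform random subset of nodes."""
    return induced_sample(G, random.sample(list(G.nodes()), n_nodes))

def rw_sample(G, x, runs=10, iterations=10000):
    """Random Walk Sampling: keep the x most visited nodes over
    several random walks, then take the induced edges.
    Assumes every visited node has at least one neighbor."""
    visits = Counter()
    for _ in range(runs):
        node = random.choice(list(G.nodes()))
        for _ in range(iterations):
            visits[node] += 1
            node = random.choice(list(G.neighbors(node)))
    top = [n for n, _ in visits.most_common(x)]
    return induced_sample(G, top)
```

For large graphs, G.subgraph(chosen) builds the same induced subgraph more efficiently than checking every candidate pair explicitly.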
We use the ego-Facebook dataset from the Stanford Network Analysis Project (SNAP) repository for our experiment, constructing graph samples using the three strategies in section 3. This dataset network consists of 4039 nodes and 88234 edges. The dataset network properties are Density 0.011, Average Degree 43.691, Modularity 0.835, Average Clustering Coefficient 0.617, Diameter 8, Average Path Length 3.693, and Connected Component 1.
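For reference, these global properties can be recomputed directly. The sketch below assumes Python with the networkx library and an assumed file name for the SNAP edge list; neither is specified by the paper.

```python
import networkx as nx

def network_properties(G):
    """Compute the global properties used in this paper for an undirected graph."""
    n, e = G.number_of_nodes(), G.number_of_edges()
    props = {
        "avg_degree": 2 * e / n,                      # undirected convention
        "density": nx.density(G),
        "avg_clustering": nx.average_clustering(G),
        "connected_components": nx.number_connected_components(G),
    }
    if nx.is_connected(G):
        # Both calls need all-pairs shortest paths and are expensive on large graphs.
        props["diameter"] = nx.diameter(G)
        props["avg_path_length"] = nx.average_shortest_path_length(G)
    return props

# Usage on the ego-Facebook edge list (file name assumed):
# G = nx.read_edgelist("facebook_combined.txt")
# network_properties(G)  # density ≈ 0.011, diameter 8, ...
```

Modularity is omitted from the sketch because it requires a community partition first; networkx provides greedy_modularity_communities and modularity in its community module for that purpose.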
The ERS algorithm experiment is run by picking a set of random edges, stepping up the number of edges chosen on each round. The random sampling nature of this algorithm produces fluctuating values. To overcome the discrepancy between results, on each round we run the algorithm several times. The average network property values from all runs become the final result of the ERS strategy.

The NRS algorithm experiment is run by picking uniformly random nodes, starting from a small set of nodes and then increasing the set size. This set of nodes N_C becomes the candidate for our graph sample. To check whether there is a connection between a pair of selected nodes, we construct a permutation mechanism P(N_C, 2), in which we list all possible pairs of selected nodes. In the last step, we check whether there is an actual connection for each candidate pair from P by building the intersection list I between the candidate pairs and the edge list of G. The list I is converted to an edge list and becomes the network sample S(N_S, E_S).

The RW algorithm experiment is run as follows. For one random walk process, we set the number of iterations i as the number of nodes collected in each random walk process. We choose the value i = 10000, as we consider this number sufficient to get a representative number of nodes. We run 10 random walk processes. In each random walk process, we pick the x most visited nodes; throughout the experimentation we vary the value of x. The results of all 10 random walk processes are summed, and we pick the top x most visited nodes. The network induced by these most visited nodes becomes our graph sample S(N_S, E_S).

Figure 2. Experiment result charts: (a) Graph Sample Size, (b) Average Degree, (c) Density, (d) Modularity, (e) Average Clustering Coefficient, (f) Diameter, (g) Average Path Length, (h) Connected Component.

5 Discussions and Conclusions
Figure 2 shows the performance charts of the three sampling methods. For Graph Sample Size, RW creates sample graphs with a smaller number of nodes but a much larger number of edges than the other two; this signifies that RW samples are faster and more effective at constructing a representative graph sample. A method performs better on a graph property if 1) it converges to the value of the original graph property predictably, whether linearly or exponentially, and 2) the gap between the predicted value and the original value is small. In this regard, the RW method performs better on the properties Average Clustering Coefficient, Diameter, Average Path Length, and Connected Component. Both NRS and ERS perform better on the properties Average Degree, Modularity, and Density.

As a sample of our original problem, consider reducing the computational complexity of the Betweenness Centrality metric. We pick a property related to the problem, in this case finding the shortest paths between all pairs of nodes. The problem is related to the Average Path Length and Diameter properties; thus we choose the RW method.

Our conclusions are: 1) We can use graph sampling based on the three strategies to reduce graph size while still retaining some properties, or some information that we can use as an estimator of the original graph properties. 2) Different graph properties need different graph sampling methodologies. 3) For future work, we need to compare the computational complexity of each sampling method in order to reach conclusive results.

References

[1]
J. Scott, Social Network Analysis: A Handbook, Sage Publications, 2000.

[2] M.E.J. Newman, Networks: An Introduction, Oxford University Press, 2010. http://dx.doi.org/10.1093/acprof:oso/9780199206650.001.0001

[3] R. Diestel, Graph Theory, Electronic Edition 2005, Springer-Verlag, Heidelberg, New York, 1997, 2000, 2005.

[4] U. Brandes, A Faster Algorithm for Betweenness Centrality, Journal of Mathematical Sociology, 25 (2001), no. 2, 163-177. http://dx.doi.org/10.1080/0022250x.2001.9990249

[5] A. Alamsyah, Y. Peranginangin, B. Rahardjo, I. Muchtadi-Alamsyah, Kuspriyanto, Reducing Computational Complexity of Network Analysis using Graph Compression Method for Brand Awareness Effort, The 3rd International Conference on Computational Science and Technology, (2015), 1-6. http://dx.doi.org/10.2991/iccst-15.2015.26

[6] P.D. Grunwald, The Minimum Description Length Principle, The MIT Press, 2007.

[7] P. Hu, W.C. Lau, A Survey and Taxonomy of Graph Sampling, arXiv:1308.5865 [cs.SI], 2013.