Finding Community Structure Based on Subgraph Similarity

Abstract

Community identification is a long-standing challenge in modern network science, especially for very large scale networks containing millions of nodes. In this paper, we propose a new metric to quantify the structural similarity between subgraphs, based on which an algorithm for community identification is designed. Extensive empirical results on several real networks from disparate fields demonstrate that the present algorithm provides the same level of reliability, measured by modularity, while taking much less time than the well-known fast algorithm proposed by Clauset, Newman and Moore (CNM). We further propose a hybrid algorithm that can simultaneously enhance modularity and save computational time compared with the CNM algorithm.


Biao Xiang, En-Hong Chen, and Tao Zhou


Biao Xiang, En-Hong Chen: Department of Computer Science, University of Science and Technology of China, Hefei Anhui 230009, P. R. China. e-mail: [email protected]

Tao Zhou: Department of Modern Physics, University of Science and Technology of China, Hefei Anhui 230026, P. R. China, and Department of Physics, University of Fribourg, Chemin du Musée 3, Fribourg 1700, Switzerland. e-mail: [email protected]

The study of complex networks has become a common focus of many branches of science [1]. An open problem that attracts increasing attention is the identification and analysis of communities [2]. Communities can be loosely defined as distinct subsets of nodes that are densely connected internally and only sparsely connected to one another [3]. Knowledge of community structure is significant for understanding network evolution [4] and the dynamics taking place on networks, such as epidemic spreading [5, 6] and synchronization [7, 8]. In addition, reasonable identification of communities helps enhance the accuracy of information filtering and recommendation [9].

Many algorithms for community identification have been proposed. These include the agglomerative method based on node similarity [10], the divisive method via iterative removal of the edge with the highest betweenness [3, 11], the divisive method based on the dissimilarity index between nearest-neighboring nodes [12], a local algorithm based on the edge-clustering coefficient [13], the Potts model for fuzzy community detection [14], simulated annealing [15], extremal optimization [16], a spectrum-based algorithm [17], an iterative algorithm based on message passing [18], and so on.

Finding the optimal division of communities, measured by modularity [11], is very hard [19], and in most cases we can only obtain a near-optimal division. Generally speaking, without any prior knowledge, such as the maximal community size or the number of communities, an algorithm that gives higher modularity is more time consuming [20]. As a consequence, providing an accurate division of communities for a very large scale network in reasonable time is a big challenge in modern network science. To address this issue, Newman proposed a fast greedy algorithm with time complexity $O(n^2)$ for sparse networks [21], where $n$ denotes the number of nodes. Furthermore, Clauset, Newman, and Moore (CNM) designed an improved algorithm giving identical results but with lower computational complexity [22], namely $O(n \log^2 n)$. In this paper, based on a newly proposed metric of similarity between subgraphs, we design an agglomerative algorithm for community identification, which gives the same level of reliability but is typically hundreds of times faster than the CNM algorithm.
We further propose a hybrid method that can simultaneously enhance modularity and save computational time compared with the CNM algorithm.

The rest of this paper is organized as follows. In Section 2, we introduce the present method, including the new metric of subgraph similarity and the corresponding algorithm, as well as the hybrid algorithm. In Section 3, we give a brief description of the empirical data used in this paper. The performance of our proposed algorithms, in terms of both accuracy and computational time, is presented in Section 4. Finally, we sum up this paper in Section 5.

Consider an undirected simple network $G(V, E)$, where $V$ is the set of nodes and $E$ is the set of edges. Multiple edges and self-connections are not allowed. Denote by $\mathcal{G} = \{V_1, V_2, \cdots, V_h\}$ a division of $G$, that is, $V_i \cap V_j = \emptyset$ for $1 \le i \ne j \le h$ and $V_1 \cup V_2 \cup \cdots \cup V_h = V$. We here propose a new metric of similarity between two subgraphs, $V_i$ and $V_j$, as:

$$ s_{ij} = \frac{e_{ij} + \sum_{k=1}^{h} \frac{\sqrt{e_{ik} e_{kj}}}{|V_k|}}{\sqrt{d_i d_j}}, \qquad (1) $$

where $e_{ij}$ is the number of edges whose two endpoints belong to $V_i$ and $V_j$ respectively ($e_{ij}$ is defined to be zero if $i = j$), $|V_k|$ is the number of nodes in subgraph $V_k$, and $d_i = \sum_{x \in V_i} k_x$ is the sum of the degrees of the nodes in $V_i$, where the degree of node $x$, namely $k_x$, is defined as the number of edges adjacent to $x$ in $G(V, E)$. The similarity here can be considered as a measure of proximity between subgraphs: two subgraphs that share more connections, or that are both closely connected to some other subgraphs, are supposed to have higher proximity to each other. $d_i$ can be considered as the mass of a subgraph, and the denominator, $\sqrt{d_i d_j}$, is introduced to reduce the bias induced by the inequality of subgraph sizes.
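To make Eq. (1) concrete, the following is a minimal Python sketch of the similarity computation for a given division. It is an illustration under stated assumptions, not the authors' implementation: the function name `subgraph_similarity`, the edge-list input, and the dense $h \times h$ table of inter-subgraph edge counts are all hypothetical choices made for readability, and pairs involving a subgraph of zero total degree are assigned similarity 0 by convention.

```python
from itertools import combinations
from math import sqrt


def subgraph_similarity(edges, division):
    """Return {(i, j): s_ij} for all i < j, following Eq. (1).

    edges:    iterable of undirected node pairs (no self-loops, no duplicates)
    division: list of disjoint, non-empty node sets covering every node
    """
    block = {x: i for i, nodes in enumerate(division) for x in nodes}
    h = len(division)
    e = [[0] * h for _ in range(h)]  # e[i][j]: edges between V_i and V_j
    d = [0] * h                      # d[i]: sum of node degrees in V_i
    for x, y in edges:
        i, j = block[x], block[y]
        d[i] += 1
        d[j] += 1
        if i != j:                   # e_ii is defined to be zero
            e[i][j] += 1
            e[j][i] += 1
    sim = {}
    for i, j in combinations(range(h), 2):
        if d[i] == 0 or d[j] == 0:   # assumption: isolated subgraphs get 0
            sim[(i, j)] = 0.0
            continue
        # Second term of the numerator: shared proximity to third subgraphs.
        indirect = sum(sqrt(e[i][k] * e[k][j]) / len(division[k])
                       for k in range(h))
        sim[(i, j)] = (e[i][j] + indirect) / sqrt(d[i] * d[j])
    return sim


# Path 0 - 1 - 2 with every node as its own subgraph.
path = [(0, 1), (1, 2)]
sim = subgraph_similarity(path, [{0}, {1}, {2}])
print(sim[(0, 2)])  # 1.0: no direct edge, one common neighbor, degrees 1 and 1
```

The dense table costs $O(h^2)$ memory and is only meant for small examples; a sparse adjacency structure between subgraphs would be the natural choice for large networks.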
Note that, if each subgraph contains only a single node, $V_i = \{v_i\}$, the similarity between two subgraphs $V_i$ and $V_j$ degenerates to the well-known Salton index (also called cosine similarity in the literature) [23] between $v_i$ and $v_j$ if they are not directly connected.

Our algorithm starts from an $n$-division $\mathcal{G} = \{V_1, V_2, \cdots, V_n\}$ with $V_i = \{v_i\}$ for $1 \le i \le n$. The procedure is as follows. (i) For each subgraph $V_i$, let it connect to the most similar subgraphs, namely $\{V_j \mid s_{ij} = \max_k \{s_{ik}\}\}$. (ii) Merge each connected component in the network of subgraphs generated by step (i) into one subgraph, which defines the next division. (iii) Repeat from step (i) until the number of subgraphs equals one. During this procedure, we calculate the modularity of each division and record the division with the maximal modularity. To make our algorithm clear to readers, we show a small-scale example consisting of six subgraphs with similarity matrix:

$$ S = (s_{ij})_{6 \times 6}. \qquad (2) $$

[Numerical entries of the 6 × 6 matrix are not recoverable.]

After step (i), as shown in Figure 1, we get a network where each node represents a subgraph. We use the directed network representation, in which a directed arc from $V_i$ to $V_j$ means that $V_j$ is one of the most similar subgraphs to $V_i$. In the algorithmic implementation, those directed arcs can be treated as undirected (symmetric) edges. The network shown in Figure 1 is determined by the similarity matrix $S$, and after step (ii), the updated division contains only two subgraphs, one the union of four of the original subgraphs and the other the union of the remaining two, corresponding to the two connected components. Note that the algorithmic procedure is deterministic, so the result does not depend on where it starts.

The CNM algorithm is relatively rough in the early stage: it strongly tends to merge lower-degree nodes together (see Eq. (2) in Ref. [21]; the first term is not distinguishable in the early stage, while the enhancement of the second term favors lower-degree nodes).
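Steps (i) to (iii) of the agglomerative procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: all function names are hypothetical, modularity is computed with the standard Newman-Girvan formula of Ref. [11], the statistics are recomputed from scratch with dense tables in every round (so the sketch does not reflect the running times reported in this paper), and a subgraph whose best similarity is zero is left unmerged, a case the text does not specify.

```python
from math import sqrt


def block_stats(edges, division):
    """Inter-block edge counts, internal edge counts and degree sums."""
    block = {x: i for i, nodes in enumerate(division) for x in nodes}
    h = len(division)
    e = [[0] * h for _ in range(h)]  # e[i][j], i != j; e[i][i] stays 0
    inner = [0] * h                  # edges with both endpoints in V_i
    d = [0] * h                      # sum of node degrees in V_i
    for x, y in edges:
        i, j = block[x], block[y]
        d[i] += 1
        d[j] += 1
        if i == j:
            inner[i] += 1
        else:
            e[i][j] += 1
            e[j][i] += 1
    return e, inner, d


def modularity(inner, d, m):
    """Newman-Girvan modularity of a division of a graph with m edges."""
    return sum(inner[i] / m - (d[i] / (2 * m)) ** 2 for i in range(len(d)))


def similarity(e, d, sizes, i, j):
    """Subgraph similarity s_ij of Eq. (1); 0 if either degree sum is 0."""
    if d[i] == 0 or d[j] == 0:
        return 0.0
    indirect = sum(sqrt(e[i][k] * e[k][j]) / sizes[k] for k in range(len(d)))
    return (e[i][j] + indirect) / sqrt(d[i] * d[j])


def xcz_sketch(nodes, edges):
    """Return (best modularity, best division) over all merging rounds."""
    division = [{x} for x in nodes]      # start from the n-division
    m = len(edges)
    best_q, best_division = float("-inf"), division
    while True:
        e, inner, d = block_stats(edges, division)
        q = modularity(inner, d, m)
        if q > best_q:
            best_q, best_division = q, division
        h = len(division)
        if h == 1:
            break
        sizes = [len(b) for b in division]
        parent = list(range(h))          # union-find over subgraphs

        def find(a):
            while parent[a] != a:
                parent[a] = parent[parent[a]]
                a = parent[a]
            return a

        # Step (i): link each subgraph to all of its most similar subgraphs.
        for i in range(h):
            s = [similarity(e, d, sizes, i, j) if j != i else -1.0
                 for j in range(h)]
            top = max(s)
            if top <= 0:                 # assumption: nothing to merge with
                continue
            for j in range(h):
                if s[j] == top:
                    parent[find(i)] = find(j)
        # Step (ii): each connected component becomes one subgraph.
        merged = {}
        for i in range(h):
            merged.setdefault(find(i), set()).update(division[i])
        if len(merged) == h:             # no merge happened; stop
            break
        division = list(merged.values())
    return best_q, best_division


# Two triangles joined by a single edge.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
q, division = xcz_sketch(range(6), edges)
print(sorted(sorted(b) for b in division))  # [[0, 1, 2], [3, 4, 5]]
```

Treating the directed "most similar" arcs as undirected edges, as the text allows, is what makes a union-find structure sufficient for step (ii). Because every subgraph with a positive best similarity links to at least one other subgraph, the number of subgraphs shrinks by at least half in such rounds.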
This tendency of the CNM algorithm to merge lower-degree nodes first usually causes mistakes in the very early stage that cannot be corrected afterwards. We therefore propose a hybrid algorithm, which starts from an $n$-division $\mathcal{G} = \{V_1, V_2, \cdots, V_n\}$ and carries out the procedure described above for one round (i.e., step (i) and step (ii)). In this round, the subgraph similarity degenerates to the similarity between two nodes:

$$ s_{xy} = \frac{a_{xy} + n_{xy}}{\sqrt{k_x k_y}}, \qquad (3) $$

where $n_{xy}$ denotes the number of common neighbors of $x$ and $y$, and $a_{xy}$ is 1 if $x$ and $y$ are directly connected and 0 otherwise. After this round, each subgraph has at least two nodes. Then, we apply the CNM algorithm until all nodes are merged together.

Fig. 1 Illustration of the algorithm procedure, where each node represents a subgraph. The similarities between subgraph pairs are shown in Eq. (2).

In this paper, we consider five real networks drawn from disparate fields: (i) Football: a network of American football games between Division IA colleges during regular season Fall 2000, where nodes denote football teams and edges represent regular season games [3]. (ii) Yeast PPI: a protein-protein interaction network where each node represents a protein [24, 25]. (iii) Cond-Mat: a network of coauthorships between scientists posting preprints on the Condensed Matter E-Print Archive from Jan 1995 to March 2005 [26]. (iv) WWW: a sampled network of the World Wide Web [27]. (v) IMDB: actor networks from the Internet Movie Database [28]. We summarize the basic information of these networks in Table 1.

Fig. 2 Comparison of the algorithmic outputs corresponding to the best identifications subject to modularity. The three panels are (upper panel) the real grouping in regular season Fall 2000, (middle panel) the communities resulting from the CNM algorithm, and (lower panel) the communities resulting from the XCZ+CNM algorithm. Each node denotes a football team, and different colors represent different groups/communities.

Table 1 Basic information of the networks for testing.

Networks     Number of Nodes, |V|    Number of Edges, |E|    References
Football     115                     613                     [3]
Yeast PPI    2631                    7182                    [24, 25]
Cond-Mat     40421                   175693                  [26]
WWW          325729                  1090107                 [27]
IMDB         1324748                 3782463                 [28]

Table 2 Maximal modularity.

Algorithms    Football    Yeast PPI    Cond-Mat    WWW      IMDB
CNM           0.577       0.565        0.645       0.927    N/A
XCZ           0.538       0.566        0.682       0.882    0.691
XCZ+CNM       0.605       0.590        0.716       0.932    0.786
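For reference, the modularity reported in Table 2 is assumed here to be the standard Newman-Girvan quantity of Ref. [11], since the text does not restate it. In the notation of Eq. (1), with $m = |E|$, $d_i$ the sum of degrees in $V_i$, and $l_i$ (a symbol introduced only for this note) the number of edges with both endpoints in $V_i$, it reads:

```latex
Q = \sum_{i=1}^{h} \left[ \frac{l_i}{m} - \left( \frac{d_i}{2m} \right)^{2} \right]
% Worked example: two triangles joined by one edge (m = 7), divided into
% the two triangles, so l_1 = l_2 = 3 and d_1 = d_2 = 7:
% Q = 2 \left[ \frac{3}{7} - \left( \frac{7}{14} \right)^{2} \right]
%   = \frac{6}{7} - \frac{1}{2} = \frac{5}{14} \approx 0.357
```

Merging everything into a single subgraph gives $Q = 0$, so the values in Table 2 measure how much denser the communities are internally than expected for a random graph with the same degrees.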

Table 3 CPU time, measured at millisecond (ms) resolution.

Algorithms    Football    Yeast PPI    Cond-Mat    WWW         IMDB
CNM           172         5132         559781      12304152    N/A
XCZ           0           47           2022        17734       257875
XCZ+CNM       0           62           36422       443907      47714093
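The speedups implied by Table 3 can be checked with a few lines of Python. The numbers below are copied from the table; the dictionary layout and the function name `speedup` are illustrative choices. Football is left out because the XCZ time is below the timer resolution (0 ms), and IMDB is left out because no CNM time is available.

```python
# CPU times in ms, copied from Table 3.
cpu_ms = {
    "CNM":     {"Yeast PPI": 5132, "Cond-Mat": 559781, "WWW": 12304152},
    "XCZ":     {"Yeast PPI": 47,   "Cond-Mat": 2022,   "WWW": 17734},
    "XCZ+CNM": {"Yeast PPI": 62,   "Cond-Mat": 36422,  "WWW": 443907},
}


def speedup(algorithm, network):
    """How many times faster `algorithm` is than CNM on `network`."""
    return cpu_ms["CNM"][network] / cpu_ms[algorithm][network]


for algorithm in ("XCZ", "XCZ+CNM"):
    for network in cpu_ms[algorithm]:
        print(f"{algorithm:8s} {network:10s} {speedup(algorithm, network):7.1f}x")
```

For XCZ the ratios come out at roughly 109, 277 and 694 on Yeast PPI, Cond-Mat and WWW respectively, and for the hybrid algorithm at roughly 83, 15 and 28, so the advantage of XCZ over CNM grows with network size on these three data sets.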

In Table 2 and Table 3, we respectively report the maximal modularities and the CPU times for the CNM algorithm, our proposed algorithm (referred to as the XCZ algorithm, where XCZ abbreviates the authors' names), and the hybrid algorithm (referred to as XCZ+CNM). All computations were carried out on a desktop computer with a single Intel Core E2160 processor (1.8 GHz) and 2 GB EMS memory. The program code for the CNM algorithm was downloaded directly from the personal homepage of Clauset. The IMDB network seems too large for the CNM algorithm, and we could not obtain a result in reasonable time.

From Table 2, one can see that the XCZ algorithm provides a division of communities that is competitive in accuracy with that of the CNM algorithm. A significant feature of the XCZ algorithm is that it is very fast, in general more than 100 times faster than the CNM algorithm. With just a desktop computer, one can find the community structure of a network containing $10^6$ nodes within minutes. In comparison, the hybrid algorithm is remarkably more accurate (measured by the maximal modularity) than both the CNM and XCZ algorithms. In Figure 2, we compare the resulting community structures of the Football network, from which one can clearly see that the hybrid algorithm gives a result closer to the real grouping than the CNM algorithm does. We think the hybrid algorithm is fast enough for many real applications. Taking IMDB as an example, although it contains more than $1.3 \times 10^6$ nodes, the hybrid algorithm takes less than one day. Indeed, the hybrid algorithm outperforms the CNM algorithm in both accuracy and speed.

Thanks to the rapid development of computing power and database technology, many very large scale networks, consisting of millions of nodes or more, are now available to the scientific community. Analysis of such networks calls for highly efficient algorithms, and the problem of community identification has attracted more and more attention for its hardness and practical significance.

The agglomerative method based on node similarity [10] is of lower accuracy compared with the divisive algorithms based on edge betweenness [3] and the edge-clustering coefficient [13]. In this paper, we extended the similarity measuring the structural equivalence of a pair of nodes to the so-called subgraph similarity, which can quantify the proximity of two subsets of nodes.
Accordingly, we designed an ultrafast algorithm, which provides a competitively accurate division of communities while typically running hundreds of times faster than the well-known CNM algorithm. Using our algorithm, with just a desktop computer, one can deal with a network of millions of nodes in minutes. For example, it takes less than five minutes to get the community structure of IMDB, which consists of more than $1.3 \times 10^6$ nodes.

Furthermore, we integrated the CNM algorithm and our proposed algorithm into a hybrid method. Numerical results on representative real networks showed that this hybrid algorithm is remarkably more accurate than the CNM algorithm, and can manage a network of about one million nodes in a few hours.

Modularity has been widely accepted as a standard metric for evaluating community identification, and it has found other applications as well, such as assisting in extracting the hierarchical organization of complex systems [29]. Although modularity is indeed the most popular metric for community identification, and the result corresponding to the maximal modularity looks very reasonable (see, for example, Figure 2), it has an intrinsic resolution limit that makes small communities hard to detect [30, 31]. An alternative, named normalized mutual information [20], is a good candidate for future investigation. In addition, an extension of modularity for weighted networks, namely weighted modularity [32], has been adopted to deal with the community identification problem in weighted networks [33, 34]. We hope the subgraph similarity proposed in this paper can also be properly extended to a weighted version to help extract weighted communities.

Acknowledgements

This work benefited from the Pajek Datasets and the Internet Movie Database, as well as the network data collected by Mark Newman, Albert-László Barabási and their colleagues. E.-H.C. acknowledges the National Natural Science Foundation of China under grant numbers 60573077 and 60775037. T.Z. acknowledges the National Natural Science Foundation of China under grant number 10635040.

References

1. M. E. J. Newman, A.-L. Barabási, D. J. Watts, The Structure and Dynamics of Networks (Princeton University Press, Princeton, 2006).
2. M. E. J. Newman, Modularity and community structure in networks, Proc. Natl. Acad. Sci. U.S.A., 8577 (2006).
3. M. Girvan, M. E. J. Newman, Community structure in social and biological networks, Proc. Natl. Acad. Sci. U.S.A., 7821 (2002).
4. G. Palla, A.-L. Barabási, T. Vicsek, Quantifying social group evolution, Nature, 664 (2007).
5. Z. Liu, B. Hu, Epidemic spreading in community networks, Europhys. Lett., 315 (2005).
6. G. Yan, Z.-Q. Fu, J. Ren, W.-X. Wang, Collective synchronization induced by epidemic dynamics on complex networks with communities, Phys. Rev. E, 016108 (2007).
7. A. Arenas, A. Díaz-Guilera, C. J. Pérez-Vicente, Synchronization Reveals Topological Scales in Complex Networks, Phys. Rev. Lett., 114102 (2006).
8. T. Zhou, M. Zhao, G.-R. Chen, G. Yan, B.-H. Wang, Phase synchronization on scale-free networks with community structure, Phys. Lett. A, 431 (2007).
9. G.-R. Xue, C. Lin, Q. Yang, W.-S. Xi, H.-J. Zeng, Y. Yu, Z. Chen, Scalable collaborative filtering using cluster-based smoothing, in Proceedings of the 28th Annual International ACM SIGIR conference on Research and Development in Information Retrieval (ACM Press, pp. 114-121, 2005).
10. R. L. Breiger, S. A. Boorman, P. Arabie, An algorithm for clustering relational data with applications to social network analysis and comparison with multidimensional scaling, J. Math. Psychol., 328 (1975).
11. M. E. J. Newman, M. Girvan, Finding and evaluating community structure in networks, Phys. Rev. E, 026113 (2004).
12. H. Zhou, Distance, dissimilarity index, and network community structure, Phys. Rev. E, 061901 (2003).
13. F. Radicchi, C. Castellano, F. Cecconi, V. Loreto, D. Parisi, Defining and identifying communities in networks, Proc. Natl. Acad. Sci. U.S.A., 2658 (2004).
14. J. Reichardt, S. Bornholdt, Detecting Fuzzy Community Structures in Complex Networks with a Potts Model, Phys. Rev. Lett., 218701 (2004).
15. R. Guimerà, M. Sales, L. A. N. Amaral, Modularity from fluctuations in random graphs and complex networks, Phys. Rev. E, 025101 (2004).
16. J. Duch, A. Arenas, Community detection in complex networks using extremal optimization, Phys. Rev. E, 027104 (2005).
17. M. E. J. Newman, Finding community structure in networks using the eigenvectors of matrices, Phys. Rev. E, 036104 (2006).
18. B. J. Frey, D. Dueck, Clustering by Passing Messages Between Data Points, Science, 972 (2007).
19. U. Brandes, D. Delling, M. Gaertler, R. Görke, M. Hoefer, Z. Nikoloski, D. Wagner, On Finding Graph Clusterings with Maximum Modularity, Lect. Notes Comput. Sci., 121 (2007).
20. L. Danon, A. Díaz-Guilera, J. Duch, A. Arenas, Comparing community structure identification, J. Stat. Mech. P09008 (2005).
21. M. E. J. Newman, Fast algorithm for detecting community structure in networks, Phys. Rev. E, 066133 (2004).
22. A. Clauset, M. E. J. Newman, C. Moore, Finding community structure in very large networks, Phys. Rev. E, 066111 (2004).
23. G. Salton, M. J. McGill, Introduction to Modern Information Retrieval (McGraw-Hill, Auckland, 1983).
24. C. von Mering, R. Krause, B. Snel, M. Cornell, S. G. Oliver, S. Fields, P. Bork, Comparative assessment of large-scale data sets of protein-protein interactions, Nature, 399 (2002).
25. D. Bu, Y. Zhao, L. Cai, H. Xue, X. Zhu, H. Lu, J. Zhang, S. Sun, L. Ling, N. Zhang, G. Li, R. Chen, Topological structure analysis of the protein-protein interaction network in budding yeast, Nucleic Acids Research, 2443 (2003).
26. M. E. J. Newman, The structure of scientific collaboration networks, Proc. Natl. Acad. Sci. U.S.A., 404 (2001).
27. R. Albert, H. Jeong, A.-L. Barabási, Diameter of the World Wide Web, Nature, 130 (1999).
28. A. Ahmed, V. Batagelj, X. Fu, S.-H. Hong, D. Merrick, A. Mrvar, Visualisation and Analysis of the Internet Movie Database, in Proceedings of the 2007 Asia-Pacific Symposium on Visualization (IEEE Press, pp. 17-24, 2007).
29. M. Sales-Pardo, R. Guimerà, A. A. Moreira, L. A. N. Amaral, Extracting the hierarchical organization of complex systems, Proc. Natl. Acad. Sci. U.S.A., 15224 (2007).
30. S. Fortunato, M. Barthélemy, Resolution limit in community detection, Proc. Natl. Acad. Sci. U.S.A., 36 (2007).
31. A. Lancichinetti, S. Fortunato, F. Radicchi, Benchmark graphs for testing community detection algorithms, Phys. Rev. E, 046110 (2008).
32. M. E. J. Newman, Analysis of weighted networks, Phys. Rev. E, 056131 (2004).
33. Y. Fan, M. Li, P. Zhang, J. Wu, Z. Di, The effect of weight on community structure of networks, Physica A, 583 (2007).
34. M. Mitrović, B. Tadić, Search of Weighted Subgraphs on Complex Networks with Maximum Likelihood Methods, Lect. Notes Comput. Sci. 5102