Overlapping Community Discovery Methods: A Survey
arXiv [cs.SI], Nov

Alessia Amelio and Clara Pizzuti
National Research Council of Italy (CNR)
Institute for High Performance Computing and Networking (ICAR),
Via Pietro Bucci, 41C
87036 Rende (CS), Italy
email: {amelio,pizzuti}@icar.cnr.it

Abstract
The detection of overlapping communities is a challenging problem that has gained increasing interest in recent years because of the natural tendency of individuals, observed in real-world networks, to participate in multiple groups at the same time. This review describes the main proposals in the field. Besides the methods designed for static networks, some new approaches that detect overlapping communities in networks that change over time are described. Methods are classified with respect to the underlying principles that guide them in dividing a network into groups sharing part of their nodes. For each method we also report, when available, the computational complexity and the web site from which the software implementing the method can be downloaded.
Complex networks constitute an effective formalism for representing the relationships among the objects that compose many real-world systems. Collaboration networks, the Internet, the world-wide-web, biological networks, communication and transport networks, and social networks are just some examples. Networks are modeled as graphs, where nodes represent objects and edges represent interactions among these objects. One of the main problems in the study of complex networks is the detection of community structure, i.e. the division of a network into groups (clusters or modules) of nodes having dense intra-connections and sparse inter-connections. In the last few years many different approaches have been proposed to uncover community structure in networks.

Much of the effort in defining efficient and effective methods for community detection has been directed at finding disjoint communities. However, communities may overlap, i.e. some nodes may belong to more than one group. The membership of an entity in many groups is very common in real-world networks. For example, in a social network a person may participate in several interest groups, in collaboration networks a researcher may collaborate with many groups, in citation networks a paper could involve more than one topic, and in biological networks proteins play different roles in the cell by taking part in several processes.

In this review a description of the most recent proposals for overlapping community detection is given, and a classification into different categories is provided. Algorithms have been classified by taking into account the underlying principles that guide the methods in dividing a network into groups sharing part of their nodes.

Many recent reviews describing community detection algorithms have been published [12, 20, 11, 7, 38]. However, our review differs from those of [12, 20, 11, 7] since we focus only on overlapping approaches, and from [38] because dynamic approaches are also described.

The paper is organized as follows. The next section gives some preliminary definitions necessary for the description of the approaches. Section 3 introduces the method categorization proposed in this review. Section 3.1 describes node seeds and local expansion methods. Section 3.2 considers clique expansion methods. Section 3.3 describes link clustering algorithms. Section 3.4 presents label propagation approaches. Methods for which a categorization in one of the defined classes was not possible are reported in Section 3.5. Section 3.6 considers the more recent proposals for dynamic networks. Section 4 gives some information regarding benchmarks that can be used to test algorithm performance. Finally, Section 5 gives some final considerations on the described methods and concludes the paper.

In this section some basic definitions, necessary for a clear understanding of the concepts described in the survey, are given.

A network $\mathcal{N}$ can be modeled as a graph $G = (V, E)$ where $V$ is a set of $n = |V|$ objects, called nodes or vertices, and $E$ is a set of $m = |E|$ links, called edges, each connecting two elements of $V$. A community in a network is a group of vertices (i.e. a subgraph) having a high density of edges among them, and a lower density of edges between groups. In [11] it is observed that a formal definition of community does not exist because this definition often depends on the application domain. Nevertheless, an impressive number of methods has been proposed to detect communities in complex networks.
Before starting with the description of overlapping approaches, it is necessary to introduce the concept of modularity because of its popularity and wide use among researchers.

The concept of modularity was originally defined by Girvan and Newman [26] as a quality function to evaluate the goodness of a partition. Over the years, however, it has been accepted as one of the most meaningful measures for partitioning a network, and as the one that most closely agrees with the intuitive concept of community on a wide range of real-world networks.

The idea underlying modularity is that a random graph does not have a cluster structure, thus the edge density of a cluster should be higher than the expected density of a subgraph whose nodes are connected at random. This expected edge density depends on a chosen null model. Modularity can be written in the following way:

$$Q = \frac{1}{2m} \sum_{ij} (A_{ij} - P_{ij}) \, \delta(C_i, C_j) \tag{1}$$

where $A$ is the adjacency matrix of the graph $G$, $m$ is the number of edges of $G$, and $P_{ij}$ is the expected number of edges between nodes $i$ and $j$ in the null model. $\delta$ is the Kronecker function and yields one if $i$ and $j$ are in the same community, zero otherwise. When it is assumed that the random graph has the same degree distribution as the original graph, $P_{ij} = \frac{k_i k_j}{2m}$, where $k_i$ and $k_j$ are the degrees of nodes $i$ and $j$ respectively. Thus the modularity expression becomes:

$$Q = \frac{1}{2m} \sum_{ij} \left( A_{ij} - \frac{k_i k_j}{2m} \right) \delta(C_i, C_j) \tag{2}$$

Since only the pairs of vertices belonging to the same cluster contribute to the sum, modularity can be rewritten as

$$Q = \sum_{s=1}^{k} \left[ \frac{l_s}{m} - \left( \frac{d_s}{2m} \right)^2 \right] \tag{3}$$

where $k$ is the number of modules found inside a network, $l_s$ is the total number of edges joining vertices inside the module $s$, and $d_s$ is the sum of the degrees of the nodes of $s$.
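As a small illustration of Eq. (3), the following Python sketch computes the modularity of a disjoint partition directly from the per-module quantities $l_s$ and $d_s$. The function name `modularity`, the edge-list input format, and the toy graph `EDGES` (two triangles joined by a single edge) are assumptions made for the example and do not come from any of the surveyed papers.

```python
from collections import defaultdict

# Toy graph (an assumption for illustration): two triangles joined by edge (2, 3).
EDGES = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
PARTITION = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}


def modularity(edges, membership):
    """Modularity of a disjoint partition, following Eq. (3).

    edges: iterable of undirected edges (u, v), no duplicates or self-loops.
    membership: dict mapping each node to its module label.
    """
    m = 0
    internal = defaultdict(int)    # l_s: edges with both endpoints in module s
    degree_sum = defaultdict(int)  # d_s: sum of the degrees of the nodes of s
    for u, v in edges:
        m += 1
        degree_sum[membership[u]] += 1
        degree_sum[membership[v]] += 1
        if membership[u] == membership[v]:
            internal[membership[u]] += 1
    return sum(internal[s] / m - (degree_sum[s] / (2 * m)) ** 2
               for s in degree_sum)


if __name__ == "__main__":
    print(modularity(EDGES, PARTITION))
```

On this toy graph each triangle has $l_s = 3$ and $d_s = 7$ with $m = 7$, so the sum evaluates to $2\,(3/7 - 1/4) = 5/14$, while placing all six nodes in one module gives $Q = 0$.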
Thus the first term of each summand is the fraction of edges inside a community, and the second one is the expected value of the fraction of edges that would be in the network if edges fell at random without regard to the community structure. Values approaching 1 indicate strong community structure.

In this section, methods for the detection of overlapping communities are described. They have been classified in six different categories on the basis of the methodology employed to identify communities. The categories are the following:

• Node seeds and local expansion
• Clique expansion
• Link clustering
• Label propagation
• Other approaches
• Dynamic networks

For each category, a short description of the main common features among algorithms of that class is provided.
The idea underlying node seeds and local expansion approaches is that, starting from a node or a small set of nodes, a community can be obtained by adding neighboring nodes that improve a quality function. The quality function characterizes the structure of the obtained clustering.

Baumes et al. [4] introduced the concept of density function and defined a community as a subgraph that is locally optimal with respect to this density function. The internal $p_{in}$ and external $p_{ex}$ edge intensities of a community $C$ are defined as

$$p_{in}(C) = \frac{E(C)}{|C|(|C| - 1)} \qquad p_{ex}(C) = \frac{E_{out}(C)}{|C|(n - |C|)} \tag{4}$$

where $E(C)$ is the number of internal edges of $C$ and $E_{out}(C)$ is the number of edges from nodes of $C$ towards nodes not belonging to $C$. Three weight or metric functions are defined to assign a weight to a graph. The internal edge probability

$$W_p = p_{in}(C) \tag{5}$$

that coincides with the internal edge density, the edge ratio

$$W_e = \frac{E(C)}{E(C) + E_{out}(C)} \tag{6}$$

and the intensity ratio

$$W_i = \frac{p_{in}(C)}{p_{in}(C) + p_{ex}(C)} \tag{7}$$

These metrics measure the intensity of communication within the clusters, and can be efficiently updated when a node is added to or removed from the cluster. Finally, to measure the difference between two clusters, the Hamming, or edit, distance and the percentage of non-overlap are defined.

In order to find overlapped communities, the authors proposed two methods. The first algorithm, Iterative Scan (IS), starts by choosing an edge at random, called seed, and adds or removes one vertex at a time as long as the chosen density metric improves. When no more improvement can be obtained, the algorithm stops and is restarted with a new seed. Overlapping is possible because the restart process can reassign nodes already present in a cluster to a newly forming community.

The second algorithm, Rank Removal (RaRe), assumes that there are some high-ranking nodes which, when removed from the graph, disconnect the graph into smaller connected components, called cores. The deleted nodes are then added to one or more cores. This means that overlap between two clusters is possible only through these vertices.

To validate the methods, the Hamming distance between each cluster of the true clustering and the obtained clustering is computed, then the average of all these distances is considered.

The choice of a random edge can negatively influence the result of IS. Thus the same authors modified their Iterative Scan method and proposed a more efficient algorithm for finding overlapping communities [3], named IS², that combines IS and RaRe. IS² also relies on a new strategy for initializing seed clusters that computes the ranking of each node only once. The strategy is based on the observation that the only nodes capable of increasing cluster density are either the members of the cluster itself or members of the immediate neighborhood clusters, where neighboring clusters are those containing nodes adjacent to a node inside the cluster. Thus, rather than visiting each node at each iteration, nodes not belonging to one of these two groups can be skipped.

Another variation of the IS method, which ensures connectivity of the detected communities, is described in [15].

Lancichinetti et al. [21] proposed a method to detect overlapping and hierarchical community structure, referred to in the following as LFM, based on the concept of fitness of nodes belonging to a community $S$. Let $k_i^{in}(S)$ and $k_i^{out}(S)$ be the internal and external degrees of the nodes belonging to a community $S$. The fitness of $S$ is then defined as

$$F_S = \sum_{i \in S} \frac{k_i^{in}(S)}{\left( k_i^{in}(S) + k_i^{out}(S) \right)^{\alpha}} \tag{8}$$

where $\alpha$, called resolution parameter, is a positive real-valued parameter controlling the size of the communities. When $k_i^{out}(S) = 0 \;\; \forall i$, $F_S$ reaches its maximum value for a fixed $\alpha$.
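To make the fitness of Eq. (8) concrete, the sketch below evaluates it as a per-node sum over an adjacency-set representation of an undirected graph. The function name `community_fitness`, the input format, and the toy graph `ADJ` are assumptions introduced here for illustration; the code is not the LFM authors' implementation.

```python
# Toy graph (an assumption for illustration): two triangles joined by edge (2, 3).
ADJ = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}


def community_fitness(adj, community, alpha=1.0):
    """Fitness F_S of a community, written as the per-node sum of Eq. (8).

    adj: dict mapping each node to the set of its neighbours.
    community: set of nodes S.
    alpha: resolution parameter (positive real number).
    """
    community = set(community)
    total = 0.0
    for i in community:
        k_in = len(adj[i] & community)   # neighbours of i inside S
        k_out = len(adj[i] - community)  # neighbours of i outside S
        if k_in + k_out > 0:             # isolated nodes contribute nothing
            total += k_in / (k_in + k_out) ** alpha
    return total


if __name__ == "__main__":
    print(community_fitness(ADJ, {0, 1}))     # an open pair of nodes
    print(community_fitness(ADJ, {0, 1, 2}))  # the full left triangle
```

With $\alpha = 1$ the pair {0, 1} scores 1, while the full triangle {0, 1, 2} scores 8/3, so absorbing node 2 raises the fitness, which is the kind of gain a local expansion step looks for.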
The community fitness has been used in [21] to find communities one at a time. Node fitness with respect to a community $S$ is defined as the variation of the community fitness of $S$ with and without the node $i$, i.e.

$$F_S^i = F_{S \cup \{i\}} - F_{S - \{i\}} \tag{9}$$

The method starts by picking a node at random and considering it as a community $S$. Then a loop over all the neighbor nodes of $S$ not included in $S$ is performed in order to choose the neighbor node to be added to $S$. The choice is made by computing the node fitness for each node, and augmenting $S$ with the node having the highest fitness value. At this point the fitness of each node is recomputed, and if a node turns out to have a negative fitness value it is removed from $S$. The process stops when all the not yet included neighboring nodes of the nodes in $S$ have a negative fitness. Once a community has been obtained, a new node is picked and the process restarts until all the nodes have been assigned to at least one group. The authors studied the divisions obtained as the resolution parameter $\alpha$ varies, a division being regarded as stable when it is recovered over a range of values of $\alpha$. The length of this range determines the most stable division, which is deemed the best result.

A different approach is applied by Lancichinetti et al. in [23] to choose the neighbor of a node to add to a cluster $S$. Their method, called OSLOM, uses a statistical test to grow a community $S$ by evaluating the statistical significance of a node. The intuition behind the approach is that, if a vertex $v$ shares many more edges with the nodes of $S$ than expected in the null model, then the relationship between $v$ and $S$ is unexpectedly strong, thus $v$ can be included in $S$.

DOCS (Detecting Overlapping Community Structures) [36] is a method based on an approach of global division followed by local expansion.
DOCS uses classical spectral graph partitioning and random walk techniques. It first applies a spectral bisection method with multi-level recursion to generate seed groups, and then employs a process of local expansion to add a vertex to the current cluster. The local optimization process randomly walks over the network from the seeds. In particular, the locally-optimal expansion process is based on the concept of modularity $Q$. At each time step, the scanned vertices are sorted in descending order of their degree-normalized probabilities. If a vertex brings a better change in $Q$ to the community candidates, it may be absorbed as a new member of the community structure. Furthermore, besides trying to achieve a good $Q$ value, the algorithm also takes into account the overlapping rate of the discovered communities. The user must fix a threshold $t$, and the expansion is continued until the overlapping rate exceeds $t$.

Moses [25] is a greedy algorithm that optimizes a global objective function based on a statistical network model. In this model a graph is represented by a random symmetric adjacency matrix. The objective function computes the maximum likelihood estimators from the observed likelihood.
Moses selects an edge $(u, v)$ at random, and the community $C = \{u, v\}$ constituted by the two corresponding nodes is built by expanding it with new nodes taken from the set of neighboring nodes not yet added to $C$. Nodes are selected such that the objective function is maximized, and expansion continues until the highest value of the objective is obtained. Since edges are chosen at random with replacement, the same edge could be selected many times and expanded into different communities. The algorithm periodically checks, for all the communities found so far, whether the removal of one community improves the objective function. Furthermore, a tuning phase is performed at the end of the expansion. In this phase each node is removed from all the communities to which it has been assigned, and added to the communities to which it is connected by an edge. The change is accepted only if the objective function increases.

Moses has been compared with the methods LFM of Lancichinetti et al. [20], COPRA of Gregory [18], Iterative Scan (IS) of Baumes et al. [4], GCE of Lee et al. [24], and CFinder of Palla et al. [29]. The authors performed experiments by varying the overlap from one to ten communities per node, i.e. a node can participate in from one to ten communities. They found that when the average overlap increases, Moses is able to recover community structure better than the other methods.
Clique expansion methods are similar to the approaches described in the previous section. However, they consider as seed cores highly connected sets of nodes constituted by cliques, and then generate overlapping communities by merging these cliques, applying different criteria.
CFinder [1] is a system to identify and visualize overlapped, densely connected groups of nodes in undirected graphs. It also allows the user to navigate the original graph and the communities found. The search algorithm of CFinder uses the Clique Percolation Method [29] to find k-clique percolation clusters. A k-clique is a complete subgraph constituted by $k$ nodes. Two cliques are said to be adjacent if they share exactly $k - 1$ nodes. The value of $k$ has to be provided as input. The higher the value of $k$, the smaller the size of the highly dense groups. The authors suggest that a value between 4 and 6 gives the richest group structure. CFinder works as follows. First of all, the community finding algorithm extracts all the maximal complete subgraphs, i.e. the cliques. Then a clique-clique overlap matrix is prepared. In this matrix each row (and column) corresponds to a clique, the diagonal contains the size of the clique, while the value contained in the position $(i, j)$ is the number of common nodes between the cliques $i$ and $j$. K-clique communities are obtained by deleting every element on the diagonal smaller than $k$ and every element off the diagonal smaller than $k - 1$, replacing the remaining elements by one, and finally finding the connected components in the modified matrix.

Shen et al. [35] proposed a hierarchical agglomerative algorithm named EAGLE (agglomerativE HierarchicAL clusterinG based on maximaL cliquE) to uncover hierarchical and overlapping community structure in networks. The method is based on the concepts of maximal cliques, i.e. cliques that are not a subset of another clique, and subordinate maximal cliques, i.e. maximal cliques whose vertices are contained in some other larger maximal clique. Given a threshold $k$, all the subordinate maximal cliques with size smaller than $k$ are discarded. The deletion of these cliques implies that some vertices do not belong to any maximal clique. These vertices are called subordinate vertices. The algorithm starts by considering as initial communities the maximal cliques and the subordinate vertices. Then, for each pair of communities, the similarity between them is computed and the two groups with maximum similarity are merged. This process is repeated until only one community remains. The similarity $M$ between two communities is defined by specializing the concept of modularity, introduced by Newman, for two communities $C_1$ and $C_2$:

$$M = \frac{1}{2m} \sum_{v \in C_1, \, w \in C_2, \, v \neq w} \left[ A_{vw} - \frac{k_v k_w}{2m} \right] \tag{10}$$

where $m$ is the total number of edges in the network, $A$ is the adjacency matrix of the network, and $k_v$ is the degree of node $v$. The evaluation of the results obtained is performed by computing an extended modularity value $EQ$ that takes into account the number of communities a node belongs to. This value is computed as follows:

$$EQ = \frac{1}{2m} \sum_{i} \sum_{v \in C_i, \, w \in C_i} \frac{1}{O_v O_w} \left[ A_{vw} - \frac{k_v k_w}{2m} \right] \tag{11}$$

where $O_v$ is the number of communities to which $v$ belongs. The value of $EQ$ is used by the authors to cut the generated dendrogram and to select the community structure having the maximum extended modularity value. Experiments on two real-life networks show that the results obtained are meaningful.

Greedy Clique Expansion (GCE) is a community detection algorithm proposed by Lee et al. [24] that assigns nodes to multiple groups by expanding cliques of small size.
A clique is considered as the seed or core of a community $C$, and it is used as the starting point to obtain $C$ by greedily adding nodes that maximize a fitness function. The fitness function adopted is that proposed by Lancichinetti et al. [20]. The algorithm first finds the maximal cliques with at least $k$ nodes contained in the graph $G$ representing the network. Then it creates a candidate community $C'$ by selecting the largest unexpanded seed and adding the nodes contained in the frontier that maximize the fitness function. This expansion is continued until the inclusion of any node would lower the fitness. At this point GCE checks whether the community $C'$ is a near-duplicate, i.e. $C'$ is compared with all the already obtained communities, and it is accepted only if it is sufficiently different from these communities. All the steps are repeated until no seeds remain to be considered. In order to decide whether a community is a near-duplicate, a distance measure between communities, based on the percentage of uncommon nodes, is introduced. This measure is defined as follows. Given two communities $S$ and $S'$,

$$\delta_E(S, S') = 1 - \frac{|S \cap S'|}{\min(|S|, |S'|)} \tag{12}$$

This measure can be interpreted as the proportion of nodes belonging to the smaller community that are not members of the larger community. Given a parameter $\epsilon$ as the minimum community distance, and a set of communities $W$, the near-duplicates of $S$ are all the communities in $W$ that are within a distance $\epsilon$ from $S$.

GCE has been compared with other state-of-the-art overlapping community detection methods and, analogously to Moses, it performs better when the number of communities that a node can belong to increases from 1 to 5.
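The near-duplicate test of GCE based on Eq. (12) can be sketched in a few lines of Python. The function names `community_distance` and `is_near_duplicate` are hypothetical, communities are assumed to be non-empty sets of node identifiers, and "within a distance $\epsilon$" is read here as "less than or equal to $\epsilon$", which is an assumption of this sketch rather than a detail stated above.

```python
def community_distance(s, s_prime):
    """Distance of Eq. (12): 1 - |S ∩ S'| / min(|S|, |S'|).

    Both arguments are assumed to be non-empty collections of node ids.
    """
    s, s_prime = set(s), set(s_prime)
    return 1.0 - len(s & s_prime) / min(len(s), len(s_prime))


def is_near_duplicate(candidate, accepted, epsilon):
    """True if some accepted community lies within distance epsilon.

    'Within' is interpreted as <= epsilon (an assumption of this sketch).
    """
    return any(community_distance(candidate, c) <= epsilon for c in accepted)


if __name__ == "__main__":
    found = [{1, 2, 3, 4}]
    print(community_distance({1, 2, 3}, {2, 3, 4, 5}))      # two of three shared
    print(is_near_duplicate({1, 2, 3, 4, 5}, found, 0.25))  # contains a found one
    print(is_near_duplicate({7, 8, 9}, found, 0.25))        # disjoint candidate
```

Because the denominator is the size of the smaller community, a candidate that fully contains an already accepted community has distance 0 from it and is rejected for any non-negative $\epsilon$.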
Link clustering methods propose to detect overlapping communities by partitioning the set of links rather than the set of nodes. To this end, the line graph is used. The line graph $L(G)$ of an undirected graph $G$ is another graph such that 1) each vertex of $L(G)$ represents an edge of $G$, and 2) two vertices of $L(G)$ are adjacent if and only if their corresponding edges share a common endpoint in $G$. Thus a line graph represents the adjacency between edges of $G$. The main advantage of applying clustering to the line graph is that it produces an overlapping graph division of the original interaction graph, thus allowing nodes to be present in multiple communities.

Pereira et al. [30] were the first to use the line graph to find overlapping modules in protein-protein interaction networks. To this end they applied a well-known method for protein interaction networks (MCL [8]) to the line graph.

Evans and Lambiotte [10] argued that any algorithm that partitions a network can be applied to partition links for discovering overlapping community structure. They first reviewed a definition of modularity which uses statistical properties of dynamical processes taking place on the edges of a graph, and then proposed link partitioning by applying the new modularity concept. In particular, the traditional modularity $Q$ is defined in terms of a random walker moving on the links of the network. Such a walker is therefore located on the links instead of the nodes at each time $t$, and its movements are between adjacent edges, i.e. links having one node in common. Three quality functions for partitioning the links of a network $G$ have been proposed. Each formalizes a different dynamical process and explores the structure of the original graph $G$ in a different way.

In the first dynamical process, Link-Link random walk, the walker jumps to any of the adjacent edges with equal probability. In the second process, Link-Node-Link random walk, the walker moves first to a neighboring node with equal probability, and then jumps to a new link, chosen with equal probability from the edges incident at that node. In the last process, the dynamics are driven by the original random walk but are projected on the links of the network. The stabilities of the three processes have been defined by generalizing the concept of modularity to paths of arbitrary length, in order to tune the resolution of the optimal partitions. The optimal partitions of these quality functions can be discovered by applying standard modularity optimization algorithms to the corresponding line graphs. An extension to deal with weighted line graphs is reported in [9].

GA-NET+ is a method proposed by Pizzuti [31] that employs Genetic Algorithms [14]. A Genetic Algorithm (GA) evolves a constant-size population of elements (called chromosomes) by using the genetic operators of reproduction, crossover and mutation. Each chromosome represents a candidate solution to a given problem and is associated with a fitness value that reflects how good it is with respect to the other solutions in the population. The method uses the concept of community score to measure the quality of the division of a network into communities. Community score is defined as follows.

Let $\mu_i$ denote the fraction of edges connecting node $i$ to the other nodes in a community $S$. More formally

$$\mu_i = \frac{1}{|S|} k_i^{in}(S)$$

where $|S|$ is the cardinality of $S$, and $k_i^{in}(S) = \sum_{j \in S} A_{ij}$ is the number of edges connecting $i$ to the other nodes in $S$.

The power mean of $S$ of order $r$, denoted as $M(S)$, is defined as

$$M(S) = \frac{\sum_{i \in S} (\mu_i)^r}{|S|} \tag{13}$$

The volume $v_S$ of a community $S$ is defined as the number of edges connecting vertices inside $S$, i.e. the number of 1 entries in the adjacency sub-matrix of $A$ corresponding to $S$, $v_S = \sum_{i,j \in S} A_{ij}$.

The score of $S$ is defined as $score(S) = M(S) \times v_S$. The community score of a clustering $\{S_1, \ldots, S_k\}$ of a network is defined as

$$CS = \sum_{i}^{k} score(S_i) \tag{14}$$

Community score gives a global measure of the network division in communities by summing up the local score of each module found. The problem of community identification can then be formulated as the problem of maximizing $CS$.

The algorithm tries to maximize $CS$ by running the genetic algorithm on the line graph $L(G)$ of the graph $G$ modeling the network. As already pointed out, a main advantage of using the line graph is that the partitioning of $L(G)$ obtained by GA-NET+ corresponds to an overlapping graph division of $G$. The method uses the locus-based adjacency representation. In this representation an individual of the population consists of $N$ genes $g_1, \ldots, g_N$ and each gene can assume allele values $j$ in the range $\{1, \ldots, N\}$. Genes and alleles represent nodes of the graph $G = (V, E)$ modeling a network $\mathcal{N}$, and a value $j$ assigned to the $i$th gene is interpreted as a link between the nodes $i$ and $j$ of $V$; thus, in the clustering solution found, $i$ and $j$ will be in the same cluster. GA-NET+ starts by generating a population initialized at random with individuals representing a partition into subgraphs of the line graph $L(G)$. After that, the fitness of the individuals from the original graph must be evaluated, and a new population of individuals is created by applying uniform crossover and mutation. The dense communities present in the network structure are obtained at the end of the algorithm, without the need to know in advance the exact number of groups. This number is automatically determined by the optimal value of the community score.

Ahn et al. [2] proposed a hierarchical agglomerative link clustering method to group links into topologically related clusters. The algorithm applies a hierarchical method to the line graph by defining two concepts: link similarity and partition density. Link similarity is used during the single-linkage hierarchical method to find the pair of links with the largest similarity in order to merge their respective communities. This similarity measure is defined as follows. Given a node $i$, the inclusive neighbors of $i$ are

$$n_+(i) = \{ x \mid d(i, x) \leq 1 \} \tag{15}$$

where $d(i, x)$ is the length of the shortest path between nodes $i$ and $x$.
Thus $n_+(i)$ contains the node itself and its neighbors. Then, the similarity $S$ between two links $e_{ik}$ and $e_{jk}$ can be defined by using the Jaccard index:

$$S(e_{ik}, e_{jk}) = \frac{|n_+(i) \cap n_+(j)|}{|n_+(i) \cup n_+(j)|} \tag{16}$$

The similarity between links is also extended to networks with weighted, directed, or signed links (without self-loops). The agglomerative process is repeated until all links belong to a single cluster. To find a meaningful community structure, it is necessary to decide where the built dendrogram must be cut. To this end, the authors introduce a new quantity, the partition density $D$, that measures the quality of a link partitioning.

Partition density is defined as follows. Let $m$ be the number of links of a given network, and $\{P_1, \ldots, P_C\}$ the partition of the links into $C$ subsets. Each subset $P_c$ has $m_c = |P_c|$ links and $n_c = |\cup_{e_{ij} \in P_c} \{i, j\}|$ nodes. Then

$$D_c = \frac{m_c - (n_c - 1)}{n_c (n_c - 1)/2 - (n_c - 1)} = \frac{2 \left( m_c - (n_c - 1) \right)}{(n_c - 2)(n_c - 1)} \tag{17}$$

is the normalization of the number of links $m_c$ by the minimum and maximum number of possible links between $n_c$ connected nodes. It is assumed that $D_c = 0$ if $n_c = 2$. The partition density $D$ is the average of the $D_c$, weighted by the fraction of present links:

$$D = \frac{2}{m} \sum_{c} m_c \frac{m_c - (n_c - 1)}{(n_c - 2)(n_c - 1)} \tag{18}$$

In order to compare their approach with other state-of-the-art methods, the authors introduce four measures. Two measures, community quality and overlap quality, are based on metadata available for some networks studied in the literature. These metadata consist of a small set of annotations or tags attached to each node. The other two measures, community coverage and overlap coverage, consider the amount of information extracted from the network. Community coverage counts the fraction of nodes that belong to at least one community of three or more nodes, called nontrivial communities. The authors state that this measure provides a sense of how much of the network is analyzed. Overlap coverage counts the average number of memberships per node in nontrivial communities. When the communities are not overlapping, the two coverage measures give the same information.

In label propagation approaches a community is considered a set of nodes which are grouped together by the propagation of the same property, action or information in the network.

Gregory in [19] proposed the algorithm COPRA (Community Overlap PRopagation Algorithm) as an extension of the label propagation technique of Raghavan et al. [32]. The main modification consists in assigning multiple community identifiers to each vertex. The method thus associates with each vertex $x$ a set of pairs $(c, b)$, where $c$ is a community identifier and $b$ is a belonging coefficient expressing the strength of $x$ as a member of community $c$. COPRA starts by giving each vertex a single label with belonging coefficient set to 1. Then, repeatedly, each vertex $x$ updates its labels by summing and normalizing the belonging coefficients of its neighboring nodes. The new set of labels of $x$ is constituted by the union of its neighbors' labels. However, in order to limit the number of communities in which a vertex can participate, a parameter $v$ must be given as input. In particular, the labels whose belonging coefficient is less than $1/v$ are deleted. COPRA has a nondeterministic behavior when all the belonging coefficients corresponding to the labels associated with a vertex are the same, but below the threshold. In such a case a randomly selected label is retained, while the remaining ones are discarded. Finally, communities totally contained in others are removed, and disconnected communities that may have been generated are split into connected ones.

Wu et al. in [37] pointed out that the input parameter $v$ makes COPRA unstable since it is a global, vertex-independent parameter that does not take into account that, often, most nodes are non-overlapping, while few nodes participate in many communities. Thus an appropriate choice of $v$ is difficult, and it induces the nondeterminism described above. To overcome this shortcoming, Wu et al. proposed BMLPA (Balanced Multi-Label Propagation Algorithm), a method based on a new label update strategy that computes balanced belonging coefficients and does not limit the number of communities a node can belong to.
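The COPRA label update described above can be sketched for a single vertex as follows. This is a minimal illustration, not Gregory's implementation: the function name `copra_update`, the dictionary-based data layout, the toy graph, and the optional `rng` argument are assumptions. When no label reaches the threshold $1/v$, the sketch keeps one of the strongest labels, breaking ties at random, which covers the all-equal case mentioned in the text.

```python
import random
from collections import defaultdict


def copra_update(adj, labels, x, v, rng=random):
    """One COPRA-style label update for vertex x.

    adj: dict mapping each node to the set of its neighbours.
    labels: dict mapping each node to {community id: belonging coefficient}.
    v: maximum number of communities per vertex (the threshold is 1 / v).
    rng: object with a choice() method, used only to break ties.
    """
    neighbours = adj[x]
    if not neighbours:
        return dict(labels[x])
    # Sum the neighbours' coefficients per community, then normalize.
    summed = defaultdict(float)
    for y in neighbours:
        for c, b in labels[y].items():
            summed[c] += b
    total = sum(summed.values())
    coeffs = {c: b / total for c, b in summed.items()}
    # Delete the labels whose belonging coefficient is below 1 / v.
    kept = {c: b for c, b in coeffs.items() if b >= 1.0 / v}
    if not kept:
        # Every label is below the threshold: retain a single strongest
        # label, chosen at random among ties (assumption of this sketch).
        best = max(coeffs.values())
        candidates = sorted(c for c, b in coeffs.items() if b == best)
        kept = {rng.choice(candidates): 1.0}
    # Renormalize so that the coefficients of x sum to one.
    norm = sum(kept.values())
    return {c: b / norm for c, b in kept.items()}


# Toy star graph (an assumption): x has three neighbours with single labels.
ADJ = {"x": {"a", "b", "c"}, "a": {"x"}, "b": {"x"}, "c": {"x"}}
LABELS = {"x": {0: 1.0}, "a": {1: 1.0}, "b": {1: 1.0}, "c": {2: 1.0}}

if __name__ == "__main__":
    print(copra_update(ADJ, LABELS, "x", v=2))  # only community 1 survives
    print(copra_update(ADJ, LABELS, "x", v=3))  # communities 1 and 2 survive
```

In the toy example the neighbours of x carry community 1 twice and community 2 once, so the normalized coefficients are 2/3 and 1/3. With $v = 2$ only community 1 reaches the threshold 1/2, while with $v = 3$ both labels are retained and x becomes an overlapping vertex.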
A balanced belonging coefficient is computed by normalizing each coefficient by the maximum value a vertex has, and retaining it only if its normalized value is above a fixed threshold $p$. Another characteristic introduced by BMLPA is the initialization of vertex labels based on the extraction of overlapping rough cores. Such cores allow BMLPA to assign labels to each node efficiently and to update labels effectively. Without them the new update strategy would not work well, independently of the value chosen for the threshold $p$, because each node would retain all the labels of its neighbors.

SLPA [40, 39] is another extension of the label propagation technique of Raghavan et al. [32] that adopts a speaker-listener based information propagation process. Each node is endowed with a memory to store the labels received. It can play both the role of listener and that of speaker. In the former case it takes labels from the neighbors and accepts only one, following a listening rule such as selecting the most popular label observed at the current step. If it is a speaker, it sends a label to the neighboring listener node, choosing the label according to a certain speaker rule, such as singling out a label with probability proportional to its frequency in the memory. The algorithm stops when a fixed number $t$ of iterations has been reached.

In this section methods which could not be categorized in one of the above classes are reported.

Zhang et al. [41] developed an algorithm for detecting overlapping community structure by combining the modularity concept, spectral relaxation and fuzzy c-means. In particular, a new modularity function extending Newman's modularity concept is introduced to take into account soft assignments of nodes to communities. The problem of maximizing the modularity function is reformulated as an eigenvector problem. Given an upper bound $k$ on the number of communities, the top $k - 1$ eigenvectors are computed, and a mapping of the network nodes into a $d$-dimensional Euclidean space is performed, where $d \leq k - 1$. After that, fuzzy c-means clustering is applied to group nodes by maximizing the modified modularity function.

In [16] Gregory presented a hierarchical, divisive approach, based on Girvan and Newman's algorithm (GN) [13], but extended with a novel method of splitting vertices.
CONGA (Cluster-Overlap Newman Girvan Algorithm) adds to the GN algorithm the possibility of splitting vertices between communities, based on the concept of split betweenness. This concept makes it possible to choose whether to split a vertex or remove an edge. The edge betweenness of an edge e is the number of shortest paths, between all pairs of vertices, that pass along e. A high betweenness indicates that the edge acts as a bottleneck between a large number of vertex pairs and suggests that it is an inter-cluster edge. The split betweenness of a vertex v is the number of shortest paths that would pass between the two parts of v if it were split. A vertex can be split in many ways; the best split is the one that maximizes the split betweenness. An approximate, efficient algorithm has been presented for computing split betweenness and edge betweenness at the same time. In CONGA, a network is initially considered as a single community, assuming it is connected. After one or more iterations, the network is subdivided into two components (communities). Communities are repeatedly split into two until only singleton communities remain. Since the binary splits can be represented as a dendrogram, the network can be divided into any desired number of communities.
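Edge betweenness, on which both GN and CONGA rely, can be computed with one breadth-first search per vertex. The sketch below is illustrative only. It follows the common convention of dividing the contribution of a vertex pair equally among its shortest paths when there are several, and it does not implement split betweenness. The function name `edge_betweenness` and the adjacency-dictionary input are assumptions of this example, not part of [13] or [16].

```python
from collections import defaultdict, deque


def edge_betweenness(adj):
    """Edge betweenness of an undirected, unweighted graph (sketch).

    adj: dict mapping each vertex to the set of its neighbours.
    Returns a dict mapping frozenset({u, w}) to the betweenness of edge (u, w).
    """
    score = defaultdict(float)
    for s in adj:
        # Breadth-first search from s: count shortest paths (sigma)
        # and remember the predecessors of each vertex on those paths.
        dist = {s: 0}
        sigma = defaultdict(float)
        sigma[s] = 1.0
        preds = defaultdict(list)
        order = []
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
                if dist[w] == dist[u] + 1:
                    sigma[w] += sigma[u]
                    preds[w].append(u)
        # Walk back from the farthest vertices, pushing each vertex's
        # share of shortest paths onto the edges towards its predecessors.
        delta = defaultdict(float)
        for w in reversed(order):
            for u in preds[w]:
                share = sigma[u] / sigma[w] * (1.0 + delta[w])
                score[frozenset((u, w))] += share
                delta[u] += share
    # Every unordered pair was visited once from each endpoint.
    return {edge: value / 2.0 for edge, value in score.items()}
```

On two triangles joined by a single bridge edge, the bridge receives the highest score, since all nine shortest paths between the two triangles pass along it; this is the edge a GN-style divisive step would remove first.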
CONGA has a complexity of $O(m^3)$, where m is the number of edges, thus it is rather inefficient. A faster implementation of CONGA, named CONGO, which uses local betweenness and runs in $O(m \log m)$, is proposed by the same author in [17].

Chen et al. [6] proposed an approach based on visual data mining to find communities with overlaps and outlier nodes. They consider a community as a network partition whose entities share some common features, and adopt a relationship metric to evaluate the proximity of the entities to each other. This metric is based on the notion of random connections, which is useful to identify communities since these are considered non-random structures. Furthermore, it takes into account the neighborhood around any two nodes in order to evaluate their relationship. An algorithm that generates an ordering of the network nodes according to their relation scores is presented. From this ordered list of nodes, communities are obtained by considering consecutive groups of nodes with high relation scores. A 2D visualization of these scores shows peaks and valleys, where a sharp drop of relation scores after a peak is interpreted as the end of a community, while the valleys between two peaks represent a set of hubs which belong to several communities. Using this visualization, the user is asked to fix a community threshold and an outlier threshold that allow the algorithm to decide whether a node should be considered an outlier or be added to the current community. The authors state that the main advantage of this visual mining approach is that a user can easily provide input parameters that allow the method to find communities, hub nodes, and outliers.

Rees and Gallagher [33] proposed an approach to discover communities based on the collective viewpoint of individuals. The basic concept is that each node in the network knows, by way of its egonet, the members of its friendship group. An egonet is an induced subgraph composed of a central node, its neighbors, and all the edges of the main graph among these nodes. Therefore, by merging the views each individual has of its friendship groups, communities can be discovered. The friendship groups are small clusters, extracted from egonets, composed of the central node and connected neighbors. Several friendship groups can be combined to create a community. The algorithm consists of two steps: the first is the detection of friendship groups, while the second comprises the merging of friendship groups into communities. In step one, the algorithm iterates through every node in the graph, centering on the selected vertex and computing its egonet. After that, friendship groups are extracted from that egonet: the central node is eliminated, since it is known to exist in multiple friendship groups, and consequently the graph breaks into multiple connected components. The central vertex is then added back to each connected component to create the friendship groups. The output of the first phase is a set of friendship groups, from an egocentric point of view. The second step consists in merging the groups into communities. This is done by first merging all exact matches, i.e. groups that are complete or proper subsets of other groups. Then, groups that are relatively close, i.e. groups that match all but one element from the smaller group, are merged. The process is repeated until no more merges can be performed.

The same authors, in [34], presented a swarm intelligence approach for overlapping community detection. A network is considered as a set of agents corresponding to nodes characterized by neighbors. Each agent interacts with its social groups. The agent knows the set of its friends and also which of its friends are friends of each other. Friends agree on a common community ID inside the different social groups or friendship groups. The algorithm consists of the following steps. First of all, each agent is assigned an identifier ID. It determines a complete map of its neighborhood and builds an egonet.
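The extraction of friendship groups from an egonet, as described for [33], amounts to removing the central node, finding the connected components among its neighbors, and adding the central node back to each component. A minimal sketch follows, assuming the graph is given as an adjacency dictionary. The function name `friendship_groups` is hypothetical, and union-find is used here for the connected components as one reasonable choice; the sketch is not taken from the authors' code.

```python
def friendship_groups(adj, ego):
    """Friendship groups of `ego` (illustrative sketch).

    adj: dict mapping each vertex to the set of its neighbours.
    Returns the connected components of the egonet of `ego` once `ego`
    is removed, each with `ego` added back, as a sorted list of sets.
    """
    neighbours = set(adj[ego])
    parent = {u: u for u in neighbours}

    def find(u):
        # Follow parent links to the representative, halving the path.
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    # Join two neighbours of the ego whenever an edge links them;
    # edges leaving the egonet are ignored.
    for u in neighbours:
        for w in adj[u]:
            if w in neighbours:
                parent[find(u)] = find(w)

    # Collect the components and put the ego back into each of them.
    groups = {}
    for u in neighbours:
        groups.setdefault(find(u), {ego}).add(u)
    return sorted(groups.values(), key=sorted)
```

For a node 0 whose neighbors 1, 2, 3, 4 are linked only by the edges 1-2 and 3-4, the sketch returns the two friendship groups {0, 1, 2} and {0, 3, 4}.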
Starting from the egonet, the friendship groups can be extracted by a union-find algorithm. Each friendship group is identified by a unique ID, composed of the base agent ID and a unique incrementing decimal value. Secondly, within each friendship group, the agent asks its neighbors for their views of the friendship group (which can vary from different perspectives) in order to identify the non-propagating nodes (nodes whose views differ from the view of the agent and whose information is consequently not propagated further). Finally, the assigned friendship group IDs are propagated. In particular, if the ID value of one of the friendship groups has been modified, the new ID value is spread to the nodes inside the friendship group. The process is repeated until the propagation converges. At the end of the process, each agent has a list of assigned communities. Communities can be easily detected because they are groups of agents that share a common ID value.

The methods described so far do not take into account an important aspect characterizing networks, i.e. the evolution they go through over time. The representation of many complex systems through a static graph, even when the temporal dimension describing the varying interconnections among nodes is available, does not allow the study of the network dynamics and of the changes the network incurs over time.
Dynamic networks, instead, capture the modifications of interconnections over time, making it possible to trace the changes of network structure at different time steps. The analysis of networks and their evolution has recently been receiving increasing interest from researchers. However, there have been few proposals for the detection of overlapping communities in dynamic social networks. In the following, the more recent methods aiming to seek out dynamic communities are described.

Palla et al. [28] were among the first researchers to introduce an approach that allows the analysis of the time dependence of overlapping communities on a large scale and, as such, uncovers basic relationships characterizing community evolution. They argued that at each time step communities can be extracted by using the Clique Percolation Method (CPM) [29]. The events that characterize the lifetime of a community are growth and contraction; at each time step a new community can appear, while others can disappear. Furthermore, groups can merge or split. In order to identify community evolution over time, the authors proposed to merge the networks of two consecutive time steps t and t + 1, and then apply CPM to extract the community structure of the joint network. Since the joint graph contains the union of the links of the two graphs, any community from time step t or t + 1 is contained in exactly one community of the joint graph, and this is used to match the communities of t and t + 1. If a community in the joint graph contains a single community from t and a single community from t + 1, then they are matched. If the joint group contains more than one community from either time step, the communities are matched in descending order of their relative node overlap. Overlap is computed for every pair of communities from the two time steps as the ratio of the number of common nodes to the sum of the numbers of nodes of both communities. Experiments on two real-life networks showed that large groups remain alive if they undergo dynamic changes. On the contrary, small groups survive if they are stable.

Cazabet et al. [5] introduced the concepts of intrinsic community and longitudinal detection, and proposed an algorithm named iLCD (intrinsic Longitudinal Community Detection) to discover highly overlapping groups of nodes. An intrinsic community is one that owns a characteristic deemed meaningful, such as being a 4-clique. Longitudinal detection means that, starting from an intrinsic community, new members join gradually, like a snowball effect. iLCD considers the list of edges ordered with respect to the time they appeared and, for each edge (u, v) of the set of edges $E_t$ created at time t, it performs three steps. First, for each community C which u (resp. v) belongs to, it tries to add v (resp. u), as explained below. Then, if u and v do not already belong to any community, it tries to create a new one; finally, similar communities are merged. The first step, which updates existing communities by the addition of new nodes, is realized by estimating, for each community, the mean number of second neighbors EMSN, i.e. nodes that can be reached with a path of length 2 or less, and the mean number of robust second neighbors EMRSN, i.e. nodes that can be reached with at least two paths of length 2 or less. A new node is accepted in the community if the number of its neighbors at rank 2 is greater than EMSN, or the number of its robust neighbors at rank 2 is greater than EMRSN. The second step, which creates a new community, checks whether the pair of nodes (u, v) constitutes a minimal predefined pattern, like a 4-clique. This intrinsic property must be predetermined. Finally, merging is executed by fixing an overlap threshold: when two communities have an overlap above the threshold, the smaller is deleted and the greater is retained. This choice, as the authors point out, limits the use of uncertain heuristics. Comparison with other methods shows that this approach outperforms CPM only if the network is highly dense.

Another recent proposal to detect overlapping communities in dynamic networks is described in [27] by Nguyen et al. The method, named
AFOCS (Adaptive Finding Overlapping Community Structure), consists of two phases. In the first phase local communities are obtained by searching for all the groups of nodes C whose internal density

$$\Psi(C) = \frac{|C^{in}|}{|C| (|C| - 1)/2},$$

where $|C^{in}|$ is the number of internal connections of C, i.e. the number of links having both endpoints in C, is higher than a threshold $\tau(C)$ defined as

$$\tau(C) = \frac{\sigma(C)}{|C| (|C| - 1)/2}, \qquad \sigma(C) = \left( \frac{|C| (|C| - 1)}{2} \right)^{1 - \frac{1}{|C| (|C| - 1)/2}} \qquad (21)$$

The local communities are then merged provided that their overlapping score is higher than a value given as input parameter. The second phase adaptively updates the communities obtained in the first step, by considering how the network evolves over time. The authors identify four major changes a network can incur: a new node and its adjacent edges are added to the network; a node and its adjacent edges are removed from the network; a new edge connecting two existing nodes is added; or an existing edge is removed. The algorithm obtains the new community structure by adopting the most appropriate strategy to determine whether a community will split or two communities will merge. A comparison with existing approaches showed that the performance of AFOCS is competitive with that of other methods, mainly as regards running time.

Table 1: [...] h is the number of pairs of maximal cliques which are neighbors, and s is the number of maximal cliques; for GCE, h is the number of cliques; for Ahn and AFOCS, d_max is the maximum node degree; for GA-NET+, t is the number of generations and p the population size; for SLPA, t is the number of iterations performed by the algorithm.

APPROACH                        METHOD             REFERENCE                       COMPLEXITY
Node seeds and local expansion  IS, RaRE, IS, CIS  Baumes et al. [3],[4],[15]      O(mk + n)
                                LFM, OSLOM         Lancichinetti et al. [21],[23]  O(n)
                                DOCS               Wei et al. [36]                 -
                                MOSES              McDaid and Hurley [25]          O(n)
Clique expansion                CFinder, CPM       Palla et al. [1],[29]           O(m ln m)
                                EAGLE              Shen et al. [35]                O(n + (n + h)s)
                                GCE                Lee et al. [24]                 O(mh)
Line graph                      Evans              Evans and Lambiotte [10],[9]    O(mk log n)
                                GA-NET+            Pizzuti [31]                    O(t p (m + m log m))
                                Ahn                Ahn et al. [2]                  O(n d_max)
Label propagation               COPRA              Gregory [19]                    O(vm log(vm/n))
                                BMLPA              Wu et al. [37]                  O(n log n)
                                SLPA               Xie et al. [40, 39]             O(tm)
Dynamic methods                 CPM                Palla et al. [29]
                                iLCD               Cazabet et al. [5]              O(nk)
                                AFOCS              Nguyen et al. [27]              O(d_max m) + O(n)
Other methods                   FCM                Zhang et al. [41]               O(nk)
                                CONGA              Gregory [16]                    O(m^3)
                                CONGO              Gregory [17]                    O(m log m)
                                ONDOCS             Chen et al. [6]                 -
                                EGONET             Rees and Gallagher [33]         O(n (log n) + n log n)
                                SWARM EGONET       Rees and Gallagher [34]         O(n log n)

Table 2: Methods for which it is possible to download the software.

REFERENCE                       SOFTWARE
                                ∼steve/networks/congopaper/

Benchmarks for testing algorithms
The capability of an algorithm to detect community structure is usually validated by testing the method on artificial or real-world networks for which the division into communities is known. Since ground-truth community structure is rarely available for large real networks, synthetic benchmarks, built by specifying parameters that characterize the network structure, are preferred.

One of the best-known benchmarks for non-overlapping networks was proposed by Girvan and Newman in [13]. The network consists of 128 nodes divided into four communities of 32 nodes each. Edges are placed between vertex pairs at random but such that $z_{in} + z_{out} = 16$, where $z_{in}$ and $z_{out}$ are the internal and external degree of a node with respect to its community. If $z_{in} > z_{out}$, a node has more neighbors inside its group than in the other three groups, thus a good algorithm should discover the groups. This benchmark, as observed in [20], however, is rather simple since it is characterized by non-overlapping communities all having the same size and by nodes having the same expected degree. Thus, Lancichinetti et al. [22] proposed a new class of benchmarks that extend Girvan and Newman's benchmark by introducing power law degree distributions, different community sizes, and a percentage of overlap between communities.

The benchmark is also characterized by the mixing parameter

$$\mu = \frac{z_{out}}{z_{in} + z_{out}}$$

that gives the ratio between the external degree of a node and the total degree of the node. When $\mu <$ [...]

The paper reviewed state-of-the-art approaches for the detection of overlapped communities. Methods have been classified in different categories. Node seed and local expansion methods, together with clique expansion approaches, are characterized by the same idea of starting with a node (in the former case) or a group of nodes (in the latter case) and then expanding the current cluster as long as the adopted quality function increases. Link clustering methods detect overlapping communities by partitioning the set of links rather than the set of nodes, using the line graph corresponding to the network under consideration. Since the number of edges is generally much higher than the number of nodes, these methods are computationally expensive. Label propagation approaches are among the most efficient, since they start from a node and visit neighboring nodes to propagate the class label. Approaches reported in Section 3.5 to find overlapping communities adopt strategies that substantially differ from the others. Thus Zhang et al. [41] use modularity, spectral relaxation and fuzzy c-means, Gregory [17] relies on the concept of split betweenness to duplicate a node, Chen et al. [6] propose an interactive approach based on visual data mining, and Rees and Gallagher [33], [34] use the concept of egonet and apply swarm intelligence. Dynamic approaches try to deal with the problem of network evolution and constitute a valid help in understanding the changes a network might undergo over time.

A summarization of the described methods is reported in Table 1. For each method, when known, the computational complexity is reported. Furthermore, in Table 2, a link to the web site from which it is possible to download the software implementing the algorithm is given.

Though the number of approaches present in the literature is notable, the results obtained by each of them are substantially different, thus there is no universal method that is competitive with respect to all the others for networks having different characteristics such as sparsity, degree distribution, overlap percentage among communities, and so on. As pointed out in [38], there are two questions that researchers should focus on: "when to apply overlapping methods and how significant the overlapping is". Investigation on these issues and extensions to weighted networks constitute open problems for future research.

References

[1] B. Adamcsek, G. Palla, I. J. Farkas, I. Derényi, and T. Vicsek. Cfinder: Locating cliques and overlapping modules in biological networks.
Bioinformatics, 22:1021–1023, 2006.
[2] Yong-Yeol Ahn, James P. Bagrow, and Sune Lehmann. Link communities reveal multiscale complexity in networks. Nature, 466:761–764, 2010.
[3] J. Baumes, M. Goldberg, and M. Magdon-Ismail. Efficient identification of overlapping communities. In Proceedings of the 2005 IEEE International Conference on Intelligence and Security Informatics, ISI'05, pages 27–36, Berlin, Heidelberg, 2005. Springer-Verlag.
[4] J. Baumes, M. K. Goldberg, M. S. Krishnamoorthy, M. Magdon-Ismail, and N. Preston. Finding communities by clustering a graph into overlapping subgraphs. In IADIS AC, pages 97–104. IADIS, 2005.
[5] Rémy Cazabet, Frédéric Amblard, and Chihab Hanachi. Detection of overlapping communities in dynamic social networks. In IEEE International Conference on Social Computing/IEEE Conference on Privacy, Security, Risk and Trust, pages 309–314, 2010.
[6] J. Chen, O.R. Zaiane, J. Sander, and R. Goebel. Ondocs: Ordering nodes to detect overlapping community structure. In N. Memon, J. J. Xu, D. L. Hicks, and H. Chen, editors, Data Mining for Social Network Data, volume 12 of Annals of Information Systems, pages 125–148. Springer US, 2010.
[7] Michele Coscia, Fosca Giannotti, and Dino Pedreschi. A classification for community discovery methods in complex networks. Statistical Analysis and Data Mining, 5(4):512–546, 2011.
[8] A.J. Enright, S.V. Dongen, and C.A. Ouzounis. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Research, 30(7):1575–84, 2002.
[9] T. S. Evans and R. Lambiotte. Line graphs of weighted networks for overlapping communities. The European Physical Journal B, 77(2):265–272, 2010.
[10] T.S. Evans and R. Lambiotte. Line graphs, link partitions, and overlapping communities. Physical Review E, 80(1):016105:1–016105:8, 2009.
[11] Santo Fortunato. Community detection in graphs. Physics Reports, 486:75–174, 2010.
[12] Santo Fortunato and Claudio Castellano. Community structure in graphs. arXiv:0712.2716v1 [physics.soc-ph], 2007.
[13] M. Girvan and M. E. J. Newman. Community structure in social and biological networks. Proceedings of the National Academy of Sciences USA, 99:7821–7826, 2002.
[14] D.E. Goldberg. Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, 1989.
[15] M. Goldberg, S. Kelley, M. Magdon-Ismail, K. Mertsalov, and A. Wallace. Finding overlapping communities in social networks. In Proceedings of the 2010 IEEE Second International Conference on Social Computing (SOCIALCOM '10), pages 104–113, 2010.
[16] S. Gregory. An algorithm to find overlapping community structure in networks. In Proceedings of the 11th European Conference on Principles and Practice of Knowledge Discovery in Databases, PKDD 2007, pages 91–102, Berlin, Heidelberg, 2007. Springer-Verlag.
[17] S. Gregory. A fast algorithm to find overlapping communities in networks. In Proceedings of the 12th European Conference on Principles and Practice of Knowledge Discovery in Databases, PKDD 2008, pages 408–423, Berlin, Heidelberg, 2008. Springer-Verlag.
[18] S. Gregory. Finding overlapping communities using disjoint community detection algorithms. In Santo Fortunato, Giuseppe Mangioni, Ronaldo Menezes, and Vincenzo Nicosia, editors, Complex Networks, volume 207 of Studies in Computational Intelligence, pages 47–61. Springer, 2009.
[19] S. Gregory. Finding overlapping communities in networks by label propagation. New Journal of Physics, 12(10):103018, 2010.
[20] A. Lancichinetti and S. Fortunato. Community detection algorithms: a comparative analysis. Physical Review E, 80(056117), 2009.
[21] A. Lancichinetti, S. Fortunato, and J. Kertész. Detecting the overlapping and hierarchical community structure of complex networks. New Journal of Physics, 11:033015, 2009.
[22] Andrea Lancichinetti, Santo Fortunato, and Filippo Radicchi. Benchmark graphs for testing community detection algorithms. Physical Review E, 78(046110), 2008.
[23] A. Lancichinetti, Filippo Radicchi, J. J. Ramasco, and S. Fortunato. Finding statistically significant communities in networks. PLoS ONE, 6:e18961, 2011.
[24] C. Lee, F. Reid, A. McDaid, and N. Hurley. Detecting highly overlapping community structure by greedy clique expansion. In Workshop on Social Network Mining and Analysis, 2010.
[25] A. McDaid and N. Hurley. Detecting highly overlapping communities with model-based overlapping seed expansion. In Proceedings of the 2010 International Conference on Advances in Social Networks Analysis and Mining, ASONAM '10, pages 112–119, 2010.
[26] M. E. J. Newman and M. Girvan. Finding and evaluating community structure in networks. Physical Review E, 69:026113, 2004.
[27] Nam P. Nguyen, Thang N. Dinh, Sindhura Tokala, and My T. Thai. Overlapping communities in dynamic networks: their detection and mobile applications. In Proceedings of the 17th Annual International Conference on Mobile Computing and Networking (MOBICOM 2011), pages 85–96, 2011.
[28] G. Palla, A. Barabasi, and T. Vicsek. Quantifying social group evolution. Nature, 446:664–667, April 2007.
[29] G. Palla, I. J. Farkas, I. Derényi, and T. Vicsek. Uncovering the overlapping community structure of complex networks in nature and society. Nature, 435:814–818, 2005.
[30] J. B. Pereira, A.J. Enright, and C.A. Ouzounis. Detection of functional modules from protein interaction networks. Proteins: Structure, Function, and Bioinformatics, 54(1):49–57, January 2004.
[31] C. Pizzuti. Overlapped community detection in complex networks. In Proceedings of the 11th Annual Conference on Genetic and Evolutionary Computation, GECCO '09, pages 859–866, 2009.
[32] U.N. Raghavan, R. Albert, and S. Kumara. Near linear time algorithm to detect community structures in large-scale networks. Physical Review E, 76(036106), 2007.
[33] B. S. Rees and K. B. Gallagher. Overlapping community detection by collective friendship group inference. In Proceedings of the 2010 International Conference on Advances in Social Networks Analysis and Mining, ASONAM '10, pages 375–379, 2010.
[34] B.S. Rees and K.B. Gallagher. Overlapping community detection using a community optimized graph swarm. Social Network Analysis and Mining, 2(4):405–417, 2012.
[35] Huawei Shen, Xueqi Cheng, Kai Cai, and Mao-Bin Hu. Detect overlapping and hierarchical community structure in networks. Physica A: Statistical Mechanics and its Applications, 388(8):1706–1712, 2009.
[36] F. Wei, W. Qian, C. Wang, and A. Zhou. Detecting overlapping community structures in networks. World Wide Web, 12(2):235–261, June 2009.
[37] Zhi-Hao Wu, You-Fang Lin, Steve Gregory, Huai-Yu Wan, and Sheng-Feng Tian. Balanced multi-label propagation for overlapping community detection in social networks. Journal of Computer Science and Technology, 27(3):468–479, 2012.
[38] Jierui Xie, Stephen Kelley, and Boleslaw K. Szymanski. Overlapping community detection in networks: the state of the art and comparative study. ACM Computing Surveys, 45(4), 2013.
[39] Jierui Xie and Boleslaw K. Szymanski. Towards linear time overlapping community detection in social networks. In Advances in Knowledge Discovery and Data Mining - 16th Pacific-Asia Conference, PAKDD 2012, pages 25–36, 2012.
[40] Jierui Xie, Boleslaw K. Szymanski, and Xiaoming Liu. Slpa: Uncovering overlapping communities in social networks via a speaker-listener interaction dynamic process. In Proceedings of ICDM Workshops on Data Mining Technologies for Computational Collective Intelligence, pages 344–349, 2011.
[41] Shihua Zhang, Rui-Sheng Wang, and Xiang-Sun Zhang. Identification of overlapping community structure in complex networks using fuzzy c-means clustering. Physica A: Statistical Mechanics and its Applications, 374(1):483–490, 2007.