Stochastic fluctuations and the detectability limit of network communities
Lucio Floretta, Jonas Liechti, Alessandro Flammini, Paolo De Los Rios
SStochastic fluctuations and the detectability limit of network communities.
Lucio Floretta, Jonas Liechti, Alessandro Flammini, and Paolo De Los Rios Laboratoire de Biophysique Statistique, Ecole Polytechnique F´ed´erale de Lausanne (EPFL), CH-1015 Lausanne, Switzerland School of Informatics and Computing, Indiana University, Bloomington, Indiana 47406, USA (Dated: November 9, 2018)We have analyzed the detectability limits of network communities in the framework of the pop-ular Girvan and Newman benchmark. By carefully taking into account the inevitable stochasticfluctuations that affect the construction of each and every instance of the benchmark, we come tothe conclusions that the native, putative partition of the network is completely lost even before thein-degree/out-degree ratio becomes equal to the one of a structure-less Erd¨os-R´enyi network. Wedevelop a simple iterative scheme, analytically well described by an infinite branching-process, toprovide an estimate of the true detectability limit. Using various algorithms based on modularityoptimization, we show that all of them behave (semi-quantitatively) in the same way, with the samefunctional form of the detectability threshold as a function of the network parameters. Because thesame behavior has also been found by further modularity-optimization methods and for methodsbased on different heuristics implementations, we conclude that indeed a correct definition of thedetectability limit must take into account the stochastic fluctuations of the network construction.
PACS numbers: 89.75.Hc,64.60.aq,89.20.-a
In the process of introducing the notion of modularity [1], Girvan and Newman (GN) also proposed a methodto compare the performance of community detection algo-rithms [2, 3], which is still at the basis of modern assays[4–6]. In the GN benchmark, different algorithms aretested on a set of planted l - partition models [7]. In a nut-shell, a planted l -partition model is a network composedof l groups of n vertices that are stochastically connectedwith each other: an edge between any two vertices withinthe same group is present with probability p in , whereasan edge between any two vertices belonging to differentgroups is present with probability p out . Accordingly, theaverage internal degree is (cid:104) k in (cid:105) = ( n − p in , whereasthe average external degree is (cid:104) k out (cid:105) = ( l − np out . Tomeasure the extent of the community structure presentin the planted l -partition model, it is customary to in-troduce the mixing parameter µ that is defined by therelations (cid:104) k out (cid:105) = µ (cid:104) k t (cid:105) and (cid:104) k in (cid:105) = (1 − µ ) (cid:104) k t (cid:105) , where (cid:104) k t (cid:105) = (cid:104) k in (cid:105) + (cid:104) k out (cid:105) is the average total degree. Indeed,at µ = 0 the network has l disconnected components; as µ increases, the average internal degree decreases while theaverage external degree toward one specific other clus-ter, (cid:104) k out (cid:105) / ( l − µ = ( l − /l ≡ µ ERc , when the planted l -partition modelis practically an Erd¨os-R´enyi graph [8]. The performanceof different algorithms is then ranked according to thevalue of the mixing parameter beyond which they canno longer recover the native l clusters. Thus, the GNbenchmark tries to encode the simple heuristics that com-munities correspond to groups of vertices that are moreconnected with each other than with the rest of the net-work, although of course a rigorous univocal definition of“more connected” is presently lacking.Naively, one would expect any clustering algorithm tobe successful at most up to µ ERc . However, this is not the case because the construction of a planted l -partitionmodel is a stochastic process. Indeed, it might happeneven at µ < µ ERc that a vertex declared as belongingto a group has fewer connections toward its putativecommunity than toward a different one. According tothe fundamental heuristic definition of community, out-lined above, this fluctuation should be interpreted as achange of membership (a relabeling ) of the vertex. Thus,we can expect that the community structure will bebadly degraded when the difference between (cid:104) k in (cid:105) and (cid:104) k out (cid:105) / ( l −
1) is comparable with its own statistical fluc-tuation, an argument that translates into the condition (cid:104) k in (cid:105) − (cid:104) k out (cid:105) / ( l −
1) = α l (cid:112) (cid:104) k t (cid:105) , where α l is a positivereal constant that may depend on l . Thus, we can arguethat the native community structure disappears when µ > µ ERc (cid:32) − α l (cid:112) (cid:104) k t (cid:105) (cid:33) (1)and any algorithm trying to recover the native partitionbased on criteria that adhere to the heuristics should con-sequently fail to do so because, in essence, there is nonative structure to be recognized anymore.In order to gauge the extent to which stochastic fluc-tuations skew network realizations away from the na-tive one, we have analyzed the planted l -partition mod-els of the original GN benchmark ( l = 4, n = 32, and (cid:104) k t (cid:105) = 16). Nodes with more connections to an outergroup than to their putative one were relabeled, and theprocedure was iteratively repeated until no more nodeshad to change their membership. For each realization,we measured the similarity between the relabeled parti-tion R and the native partition N using the normalizedmutual information ¯ I ( R , N ) [9]. ¯ I ( R , N ) is defined as¯ I ( R , N ) ≡ I ( R , N ) H ( R ) + H ( N ) , a r X i v : . [ phy s i c s . s o c - ph ] J un where I ( R , N ) = (cid:88) ρ ∈R (cid:88) ν ∈N f ( ρ, ν ) log (cid:18) f ( ρ, ν ) f ( ρ ) f ( ν ) (cid:19) , is the mutual information of R and N , H ( R ) = − (cid:80) ρ ∈R f ( ρ ) log f ( ρ ) is the entropy of R , and H ( N ) = − (cid:80) ν ∈N f ( ν ) log f ( ν ) is the entropy of N . f ( ρ ) is thefraction of nodes that belong to the community ρ within R , f ( ν ) is the fraction of nodes that belong to the com-munity ν within N , and f ( ρ, ν ) is the fraction of nodesthat are simultaneously assigned to the community ρ within R and to the community ν within N .The outcome is shown in Fig. 1: because ¯ I ( R , N ) de-creases as µ increases and since it goes to zero well be-fore µ ERc = 3 /
4, a labeling closer to what a zero orderheuristics would suggest is already different enough fromthe native partition that the latter could not be detectedanymore. As modularity was conceived with the samesimple heuristics in mind, we have applied three differentclustering algorithms based on its maximization to thenetworks of Fig. 1: simulated annealing [10], that looksfor the absolute maximum, and two greedy algorithmswith different implementations, fastgreedy [11] and Lou-vain method [12]. Interestingly, the modularity of therelabeled partition, Q ( R ), can be larger than the mod-ularity of the native one, Q ( N ), in a significant regionof µ values. Even more intriguingly, there the modular-ity found by simulated annealing roughly coincides withthe modularity of the relabeled partition, Q ( R ), and thesimilarity between the retrieved partition and the rela-beled partition is larger than the similarity between theretrieved partition and the putative partition, signalingthat modularity optimization and relabeling agree on thedetected deviations from the native partition. Greedymethods do not follow the same trend, likely because ofthe roughness of the modularity landscape, that hampersthese algorithms for small values of the total degree [10].We then moved to larger networks, for various values ofthe total average degrees and for different number of pu-tative communities ( l = 2 , , µ ERc was approached.The similarity threshold, defined as the value µ R c of themixing parameter where ¯ I ( R , N ) falls below 0.01, wasvery well described by Eq. (1), up to corrections of order1 / (cid:104) k t (cid:105) . Furthermore, α l approaches 1 from above as l in-creases (Fig. 2, (cid:3) ), therefore recovering the detectabilitythreshold ˆ µ c ≡ µ ERc (cid:32) − (cid:112) (cid:104) k t (cid:105) (cid:33) (2)recently derived with statistical inference [13, 14] andmodularity [15] based methods. These results confirm FIG. 1. Similarity (upper panel) along with the modularitydifference (lower panel) between the relabeled partition ( (cid:3) ),the partitions retrieved by simulated annealing ( ◦ in red),by fastgreedy ( ♦ in green), and by the Louvain method ( (cid:52) inblue) with the native partition for the original GN benchmark( n = 32, l = 4, (cid:104) k t (cid:105) = 16). Each point is averaged over500 realizations, the error is smaller than the marker size.Further, five networks at different µ are depicted to visualizehow relabeled vertices slowly invade other putative clusters. thus that the stochastic fluctuations affecting the net-work construction are so strong to make the native struc-ture disappear before µ ERc , and consistently with Eq. (1).The functional form of Eq. (1) can be formally de-rived from the topology of the planted l -partition modeland from the aforementioned definition of membershipby requiring that each relabeled vertex leaves in its wakefurther vertices to be relabeled, thus triggering an in-finite avalanche. We defined P l ( µ, (cid:104) k t (cid:105) ) as the proba-bility that a node has an internal degree that is largeror equal than its number of connections with nodes ofanother given group. Assuming that the network is tree-like in the same mean-field spirit of [13–15], in the limit (cid:104) k t (cid:105) → ∞ an avalanche becomes infinite at the value ˜ µ R c of the mixing parameter satisfying (cid:0)(cid:0) − ˜ µ R c (cid:1) (cid:104) k t (cid:105) − (cid:1) (cid:0) − P l (cid:0) ˜ µ R c , (cid:104) k t (cid:105) (cid:1)(cid:1) = 1 , (3)as in this limit the probability that a node with a re-labeled neighbor maintains its membership converges to P l ( µ, (cid:104) k t (cid:105) ) (see SI). For an infinite planted l -partition .
00 0 .
02 0 .
04 0 .
06 0 .
08 0 .
10 0 .
12 0 . / p h k t i . . . . . µ c / µ E R c , l = .
00 0 .
02 0 .
04 0 .
06 0 .
08 0 . / p h k t i . . . . . . . µ c / µ E R c , l = .
00 0 .
01 0 .
02 0 .
03 0 .
04 0 . / p h k t i . . . . . µ c / µ E R c , l = FIG. 2. Comparison between the similarity thresholds ˆ µ c (Eq. 2, dashed line), ˜ µ R c (Eq. 3, × ), µ R c ( ¯ I ( A , R ) ≈ .
01 , (cid:3) ), µ fc ( ¯ I ( A , N ) ≈ .
01 for fastgreedy, ♦ ), and µ Lc ( ¯ I ( A , N ) ≈ .
01 for the Louvain method, (cid:52) ) for l = 2 (left panel), 4 (central panel),and 8 (right panel). In the last case, the results for fastgreedy are not available because the network sizes required to exit theglassy phase are out of its reach. The solid lines are fits of the form µ c = µ ERc (cid:16) − α/ (cid:112) (cid:104) k t (cid:105) + β/ (cid:104) k t (cid:105) (cid:17) . Errors are smallerthan markers and to improve the visualization everything is divided by µ ERc . model P l ( µ, (cid:104) k t (cid:105) ) = e −(cid:104) k t (cid:105) ∞ (cid:88) i =0 i ! ((1 − µ ) (cid:104) k t (cid:105) ) i i (cid:88) j =0 j ! (cid:18) µ (cid:104) k t (cid:105) ( l − (cid:19) j l − ; (4)further, it is possible, although cumbersome, to prove an-alytically that P ( µ, (cid:104) k t (cid:105) ) depends only on the rescaledmixing parameter ( µ − µ ERc ) (cid:112) (cid:104) k t (cid:105) up to correctionsof the form ( a + b/ (cid:104) k t (cid:105) ) when (cid:104) k t (cid:105) → ∞ (see SI).Therefore, we made the scaling ansatz P l ( µ, (cid:104) k t (cid:105) ) = F l (cid:16) ( µ − µ ERc ) (cid:112) (cid:104) k t (cid:105) (cid:17) which is also numerically con-firmed for l > l = 4).Taking these results together, the (numerical) solutionof Eq. (3) yields ˜ µ R c as a function of (cid:104) k t (cid:105) and it indeedshows that Eq. (1) is valid up to corrections of order1 / (cid:104) k t (cid:105) (Fig. 2, × ). We can again confirm that it is ageneral feature of the relabeled community structure tobe appreciably different from the “native” one at some µ < µ ERc (and scaling as in Eq. (1)).Armed with this new view of the resilience to fluctua-tions of the GN benchmark, we also reanalyzed the per-formance of modularity optimization methods fastgreedyand Louvain (we could not test simulated annealing forthe sizes under scrutiny here) [6]. In spite of their differ-ences, their behavior is qualitatively similar at low (cid:104) k t (cid:105) ,when the modularity landscape is rough, giving rise to a glassy phase [10]: at low values of µ the two algorithms,that just look for local modularity maxima, are not ableto find partitions with a modularity as high as the nativeone, which on average follows the expected mean-fieldvalue Q MF ( µ ) = µ ERc − µ (Fig. (4), upper panel; we only − . − . − . − . − . . . . . . ( µ − µ ERc ) √ k t . . . . . . . . . . . P l ( µ , h k t i ) .
65 0 .
70 0 .
75 0 .
80 0 . µ . . . . . . FIG. 3. Magnification of the data collapse of P l ( µ, (cid:104) k t (cid:105) ) for l = 4 and for different values of (cid:104) k t (cid:105) (different colors). Theinset displays the original data. show the results for the Louvain method, which allowsfor better statistics on larger networks). In that region,though, the native partition still fairly well correspondsto the heuristics, because there the relabeling procedureinvolves just a very limited number of nodes, if any (seeSI). Also the similarity index ¯ I ( A , N ) between the re-trieved partition A and the native partition N drops tozero (in a rather erratic way) (Fig. (4), lower panel), con-firming the significant difference between the partitionretrieved by the algorithms and the native one. Thus,low values of (cid:104) k t (cid:105) (sparse networks) do indeed cause de-tectability problems much more serious than a simpleshift of the threshold value of µ , because of the presenceof multiple competing modularity maxima. In order tocorroborate this interpretation, we initialized the Lou-vain method with the native planted l -partition, finding . . . . . . . . . . . µ . . . . . . . . . . . ¯ I ( A , N ) . . . . . . . . . Q ( A ) FIG. 4. Modularity (upper panel) of the partition A retrievedby the Louvain method and its similarity index ¯ I ( A , N ) withthe native partition N (middle panel) for the planted 4-partition models with n = 32768 and various (cid:104) k t (cid:105) . Each datapoint is averaged over 100 realizations; only some errors areshown for clarity. .
40 0 .
42 0 .
44 0 .
46 0 .
48 0 . µ .
40 0 .
42 0 .
44 0 .
46 0 .
48 0 . µ . . . . . . . . . . . ¯ I FIG. 5. Comparison close to the phase transition pointamong ¯ I ( A , R ) (solid red line), ¯ I ( A , N ) (dotted blue line),and ¯ I ( R , N ) (dashed green line) for fastgreedy (left panel)and for the Louvain method (right panel) on a 2-planted par-tition with n = 4096, (cid:104) k t (cid:105) = 2048. For clarity, only the largeststandard deviation is shown. that the modularity of the retrieved partition never fellbelow its expected mean-field value (see SI).The behavior of the two algorithms differs instead out-side of the glassy phase (Fig. 5). The community struc-ture detected by fastgreedy departs from the native par-tition earlier than the relabeled one, implying that the al-gorithms does not recover the native partition even whenno nodes need to be relabeled. The Louvain methodfinds clusters that are instead almost identical to the onesfound upon relabeling, indicating that the performanceof the Louvain method is optimal.Nevertheless, in the narrow interval defined by the dra-matic drop of all similarities, the one between the re-labeled partitions and the native partition, ¯ I ( R , N ) de-creases more rapidly and reaches zero earlier than the ones of both greedy algorithms (Fig. 5), and we quan-tified this difference by investigating in more detail thesimilarity thresholds µ fc and µ Lc (now defined as the valueof the mixing parameter at which ¯ I ( A , N ) drops below0.01). We found that both µ fc and µ Lc are also well fittedby Eq. (1) with corrections of the order 1 / (cid:104) k t (cid:105) (Fig. 2, ♦ and (cid:52) , respectively). Moreover µ R c < µ fc < ˆ µ c ≤ µ Lc andthey approach each other as l increases . The differencesare likely due to the different ability of each procedure totake into account higher order correlations, beyond thesimple one taken into account by the relabeling proce-dure.Taking together these observations, we conclude thatdeciding whether community detection algorithms reallyfail is trickier than previously believed. Indeed, in orderto quantify the degree of success of a clustering method,some sort of ground truth should be known about thenetwork under scrutiny. Since this is in general not avail-able, various benchmarks have been developed, the sim-plest one being the GN, with the naive expectation that,at least for such artificial cases, the exact solution shouldbe known. Unfortunately, their construction is itself af-fected by random fluctuations that blur their structure,making it different from the putative one. Carefully con-sidering such statistical variations, we have shown thatit is possible to change the definition of detectable re-gion , recognizing that fluctuations disrupt the putativepartition for values of µ < µ ERc and well described byEq. (1), so that one should expect all community de-tection algorithms to be affected, qualitatively, in thesame way, with differences likely to be imputed to differ-ent ways to implement mathematically the fundamentalheuristics behind the GN benchmark. Indeed, the find-ing that algorithms based on completely different prin-ciples (modularity-based vs . statistical inference) showthe same threshold beyond which a network structurebecomes undetectable, must be reinterpreted in the lightof our results: beyond such threshold the real communitystructure simply bears no resemblance anymore with theputative one.The present results also call for a complete reconsider-ation of the detectability thresholds recently derived forbenchmarks richer in structure [13, 14, 16–18] and of theperformance of community detection algorithms testedon them [4–6]. On a more philosophical note, we mighteven relax the stringency of benchmarking protocols, andresign to the fact that clustering is, at the end of the day,an ill posed problem [19].This work was financially supported by the Swiss Na-tional Science Foundation (Grant No. 200021 125277/1). [1] M. Girvan and M. E. J. Newman, P Natl Acad Sci Usa , 7821 (2002). [2] S. Fortunato, Phys Rep , 75 (2010).[3] M. E. J. Newman, Nat Phys , 25 (2012).[4] A. Lancichinetti, S. Fortunato, and F. Radicchi, PhysRev E , 046110 (2008).[5] A. Lancichinetti and S. Fortunato, Phys Rev E ,016118 (2009).[6] A. Lancichinetti and S. Fortunato, Phys Rev E ,056117 (2009).[7] A. Condon and R. M. Karp, Random Struct Algor ,116 (2001).[8] P. Erd¨os and A. R´enyi, Publ. Math. Debrecen , 290(1959).[9] L. Danon, A. D´ıaz-Guilera, J. Duch, and A. Arenas, JStat Mech-Theory E , P09008 (2005).[10] B. H. Good, Y.-A. de Montjoye, and A. Clauset, PhysRev E , 046106 (2010). [11] A. Clauset, M. E. J. Newman, and C. Moore, Phys RevE , 066111 (2004).[12] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, andE. Lefebvre, J Stat Mech-Theory E , P10008 (2008).[13] A. Decelle, F. Krzakala, C. Moore, and L. Zdeborov´a,Phys Rev Lett , 065701 (2011).[14] A. Decelle, F. Krzakala, C. Moore, and L. Zdeborov´a,Phys Rev E , 066106 (2011).[15] R. R. Nadakuditi and M. E. J. Newman, Phys Rev Lett , 188701 EP (2012).[16] R. R. Nadakuditi and M. E. J. Newman, Phys Rev E ,012803 EP (2013).[17] T. P. Peixoto, Phys Rev Lett , 148701 EP (2013).[18] F. Radicchi, arXiv physics.soc-ph (2013), 1306.1102v1.[19] E. Domany, Physica A263