[PDF] Stochastic fluctuations and the detectability limit of network communities

Abstract

We have analyzed the detectability limits of network communities in the framework of the popular Girvan and Newman benchmark. By carefully taking into account the inevitable stochastic fluctuations that affect the construction of each and every instance of the benchmark, we come to the conclusions that the native, putative partition of the network is completely lost even before the in-degree/out-degree ratio becomes equal to the one of a structure-less Erdös-Rényi network. We develop a simple iterative scheme, analytically well described by an infinite branching-process, to provide an estimate of the true detectability limit. Using various algorithms based on modularity optimization, we show that all of them behave (semi-quantitatively) in the same way, with the same functional form of the detectability threshold as a function of the network parameters. Because the same behavior has also been found by further modularity-optimization methods and for methods based on different heuristics implementations, we conclude that indeed a correct definition of the detectability limit must take into account the stochastic fluctuations of the network construction.

Full PDF

SStochastic ﬂuctuations and the detectability limit of network communities.

Lucio Floretta, Jonas Liechti, Alessandro Flammini, and Paolo De Los Rios Laboratoire de Biophysique Statistique, Ecole Polytechnique F´ed´erale de Lausanne (EPFL), CH-1015 Lausanne, Switzerland School of Informatics and Computing, Indiana University, Bloomington, Indiana 47406, USA (Dated: November 9, 2018)We have analyzed the detectability limits of network communities in the framework of the pop-ular Girvan and Newman benchmark. By carefully taking into account the inevitable stochasticﬂuctuations that aﬀect the construction of each and every instance of the benchmark, we come tothe conclusions that the native, putative partition of the network is completely lost even before thein-degree/out-degree ratio becomes equal to the one of a structure-less Erd¨os-R´enyi network. Wedevelop a simple iterative scheme, analytically well described by an inﬁnite branching-process, toprovide an estimate of the true detectability limit. Using various algorithms based on modularityoptimization, we show that all of them behave (semi-quantitatively) in the same way, with the samefunctional form of the detectability threshold as a function of the network parameters. Because thesame behavior has also been found by further modularity-optimization methods and for methodsbased on diﬀerent heuristics implementations, we conclude that indeed a correct deﬁnition of thedetectability limit must take into account the stochastic ﬂuctuations of the network construction.

PACS numbers: 89.75.Hc,64.60.aq,89.20.-a

In the process of introducing the notion of modularity [1], Girvan and Newman (GN) also proposed a methodto compare the performance of community detection algo-rithms [2, 3], which is still at the basis of modern assays[4–6]. In the GN benchmark, diﬀerent algorithms aretested on a set of planted l - partition models [7]. In a nut-shell, a planted l -partition model is a network composedof l groups of n vertices that are stochastically connectedwith each other: an edge between any two vertices withinthe same group is present with probability p in , whereasan edge between any two vertices belonging to diﬀerentgroups is present with probability p out . Accordingly, theaverage internal degree is (cid:104) k in (cid:105) = ( n − p in , whereasthe average external degree is (cid:104) k out (cid:105) = ( l − np out . Tomeasure the extent of the community structure presentin the planted l -partition model, it is customary to in-troduce the mixing parameter µ that is deﬁned by therelations (cid:104) k out (cid:105) = µ (cid:104) k t (cid:105) and (cid:104) k in (cid:105) = (1 − µ ) (cid:104) k t (cid:105) , where (cid:104) k t (cid:105) = (cid:104) k in (cid:105) + (cid:104) k out (cid:105) is the average total degree. Indeed,at µ = 0 the network has l disconnected components; as µ increases, the average internal degree decreases while theaverage external degree toward one speciﬁc other clus-ter, (cid:104) k out (cid:105) / ( l − µ = ( l − /l ≡ µ ERc , when the planted l -partition modelis practically an Erd¨os-R´enyi graph [8]. The performanceof diﬀerent algorithms is then ranked according to thevalue of the mixing parameter beyond which they canno longer recover the native l clusters. Thus, the GNbenchmark tries to encode the simple heuristics that com-munities correspond to groups of vertices that are moreconnected with each other than with the rest of the net-work, although of course a rigorous univocal deﬁnition of“more connected” is presently lacking.Naively, one would expect any clustering algorithm tobe successful at most up to µ ERc . However, this is not the case because the construction of a planted l -partitionmodel is a stochastic process. Indeed, it might happeneven at µ < µ ERc that a vertex declared as belongingto a group has fewer connections toward its putativecommunity than toward a diﬀerent one. According tothe fundamental heuristic deﬁnition of community, out-lined above, this ﬂuctuation should be interpreted as achange of membership (a relabeling ) of the vertex. Thus,we can expect that the community structure will bebadly degraded when the diﬀerence between (cid:104) k in (cid:105) and (cid:104) k out (cid:105) / ( l −

1) is comparable with its own statistical ﬂuc-tuation, an argument that translates into the condition (cid:104) k in (cid:105) − (cid:104) k out (cid:105) / ( l −

1) = α l (cid:112) (cid:104) k t (cid:105) , where α l is a positivereal constant that may depend on l . Thus, we can arguethat the native community structure disappears when µ > µ ERc (cid:32) − α l (cid:112) (cid:104) k t (cid:105) (cid:33) (1)and any algorithm trying to recover the native partitionbased on criteria that adhere to the heuristics should con-sequently fail to do so because, in essence, there is nonative structure to be recognized anymore.In order to gauge the extent to which stochastic ﬂuc-tuations skew network realizations away from the na-tive one, we have analyzed the planted l -partition mod-els of the original GN benchmark ( l = 4, n = 32, and (cid:104) k t (cid:105) = 16). Nodes with more connections to an outergroup than to their putative one were relabeled, and theprocedure was iteratively repeated until no more nodeshad to change their membership. For each realization,we measured the similarity between the relabeled parti-tion R and the native partition N using the normalizedmutual information ¯ I ( R , N ) [9]. ¯ I ( R , N ) is deﬁned as¯ I ( R , N ) ≡ I ( R , N ) H ( R ) + H ( N ) , a r X i v : . [ phy s i c s . s o c - ph ] J un where I ( R , N ) = (cid:88) ρ ∈R (cid:88) ν ∈N f ( ρ, ν ) log (cid:18) f ( ρ, ν ) f ( ρ ) f ( ν ) (cid:19) , is the mutual information of R and N , H ( R ) = − (cid:80) ρ ∈R f ( ρ ) log f ( ρ ) is the entropy of R , and H ( N ) = − (cid:80) ν ∈N f ( ν ) log f ( ν ) is the entropy of N . f ( ρ ) is thefraction of nodes that belong to the community ρ within R , f ( ν ) is the fraction of nodes that belong to the com-munity ν within N , and f ( ρ, ν ) is the fraction of nodesthat are simultaneously assigned to the community ρ within R and to the community ν within N .The outcome is shown in Fig. 1: because ¯ I ( R , N ) de-creases as µ increases and since it goes to zero well be-fore µ ERc = 3 /

4, a labeling closer to what a zero orderheuristics would suggest is already diﬀerent enough fromthe native partition that the latter could not be detectedanymore. As modularity was conceived with the samesimple heuristics in mind, we have applied three diﬀerentclustering algorithms based on its maximization to thenetworks of Fig. 1: simulated annealing [10], that looksfor the absolute maximum, and two greedy algorithmswith diﬀerent implementations, fastgreedy [11] and Lou-vain method [12]. Interestingly, the modularity of therelabeled partition, Q ( R ), can be larger than the mod-ularity of the native one, Q ( N ), in a signiﬁcant regionof µ values. Even more intriguingly, there the modular-ity found by simulated annealing roughly coincides withthe modularity of the relabeled partition, Q ( R ), and thesimilarity between the retrieved partition and the rela-beled partition is larger than the similarity between theretrieved partition and the putative partition, signalingthat modularity optimization and relabeling agree on thedetected deviations from the native partition. Greedymethods do not follow the same trend, likely because ofthe roughness of the modularity landscape, that hampersthese algorithms for small values of the total degree [10].We then moved to larger networks, for various values ofthe total average degrees and for diﬀerent number of pu-tative communities ( l = 2 , , µ ERc was approached.The similarity threshold, deﬁned as the value µ R c of themixing parameter where ¯ I ( R , N ) falls below 0.01, wasvery well described by Eq. (1), up to corrections of order1 / (cid:104) k t (cid:105) . Furthermore, α l approaches 1 from above as l in-creases (Fig. 2, (cid:3) ), therefore recovering the detectabilitythreshold ˆ µ c ≡ µ ERc (cid:32) − (cid:112) (cid:104) k t (cid:105) (cid:33) (2)recently derived with statistical inference [13, 14] andmodularity [15] based methods. These results conﬁrm FIG. 1. Similarity (upper panel) along with the modularitydiﬀerence (lower panel) between the relabeled partition ( (cid:3) ),the partitions retrieved by simulated annealing ( ◦ in red),by fastgreedy ( ♦ in green), and by the Louvain method ( (cid:52) inblue) with the native partition for the original GN benchmark( n = 32, l = 4, (cid:104) k t (cid:105) = 16). Each point is averaged over500 realizations, the error is smaller than the marker size.Further, ﬁve networks at diﬀerent µ are depicted to visualizehow relabeled vertices slowly invade other putative clusters. thus that the stochastic ﬂuctuations aﬀecting the net-work construction are so strong to make the native struc-ture disappear before µ ERc , and consistently with Eq. (1).The functional form of Eq. (1) can be formally de-rived from the topology of the planted l -partition modeland from the aforementioned deﬁnition of membershipby requiring that each relabeled vertex leaves in its wakefurther vertices to be relabeled, thus triggering an in-ﬁnite avalanche. We deﬁned P l ( µ, (cid:104) k t (cid:105) ) as the proba-bility that a node has an internal degree that is largeror equal than its number of connections with nodes ofanother given group. Assuming that the network is tree-like in the same mean-ﬁeld spirit of [13–15], in the limit (cid:104) k t (cid:105) → ∞ an avalanche becomes inﬁnite at the value ˜ µ R c of the mixing parameter satisfying (cid:0)(cid:0) − ˜ µ R c (cid:1) (cid:104) k t (cid:105) − (cid:1) (cid:0) − P l (cid:0) ˜ µ R c , (cid:104) k t (cid:105) (cid:1)(cid:1) = 1 , (3)as in this limit the probability that a node with a re-labeled neighbor maintains its membership converges to P l ( µ, (cid:104) k t (cid:105) ) (see SI). For an inﬁnite planted l -partition .

00 0 .

02 0 .

04 0 .

06 0 .

08 0 .

10 0 .

12 0 . / p h k t i . . . . . µ c / µ E R c , l = .

00 0 .

02 0 .

04 0 .

06 0 .

08 0 . / p h k t i . . . . . . . µ c / µ E R c , l = .

00 0 .

01 0 .

02 0 .

03 0 .

04 0 . / p h k t i . . . . . µ c / µ E R c , l = FIG. 2. Comparison between the similarity thresholds ˆ µ c (Eq. 2, dashed line), ˜ µ R c (Eq. 3, × ), µ R c ( ¯ I ( A , R ) ≈ .

01 , (cid:3) ), µ fc ( ¯ I ( A , N ) ≈ .

01 for fastgreedy, ♦ ), and µ Lc ( ¯ I ( A , N ) ≈ .

01 for the Louvain method, (cid:52) ) for l = 2 (left panel), 4 (central panel),and 8 (right panel). In the last case, the results for fastgreedy are not available because the network sizes required to exit theglassy phase are out of its reach. The solid lines are ﬁts of the form µ c = µ ERc (cid:16) − α/ (cid:112) (cid:104) k t (cid:105) + β/ (cid:104) k t (cid:105) (cid:17) . Errors are smallerthan markers and to improve the visualization everything is divided by µ ERc . model P l ( µ, (cid:104) k t (cid:105) ) = e −(cid:104) k t (cid:105) ∞ (cid:88) i =0 i ! ((1 − µ ) (cid:104) k t (cid:105) ) i  i (cid:88) j =0 j ! (cid:18) µ (cid:104) k t (cid:105) ( l − (cid:19) j  l − ; (4)further, it is possible, although cumbersome, to prove an-alytically that P ( µ, (cid:104) k t (cid:105) ) depends only on the rescaledmixing parameter ( µ − µ ERc ) (cid:112) (cid:104) k t (cid:105) up to correctionsof the form ( a + b/ (cid:104) k t (cid:105) ) when (cid:104) k t (cid:105) → ∞ (see SI).Therefore, we made the scaling ansatz P l ( µ, (cid:104) k t (cid:105) ) = F l (cid:16) ( µ − µ ERc ) (cid:112) (cid:104) k t (cid:105) (cid:17) which is also numerically con-ﬁrmed for l > l = 4).Taking these results together, the (numerical) solutionof Eq. (3) yields ˜ µ R c as a function of (cid:104) k t (cid:105) and it indeedshows that Eq. (1) is valid up to corrections of order1 / (cid:104) k t (cid:105) (Fig. 2, × ). We can again conﬁrm that it is ageneral feature of the relabeled community structure tobe appreciably diﬀerent from the “native” one at some µ < µ ERc (and scaling as in Eq. (1)).Armed with this new view of the resilience to ﬂuctua-tions of the GN benchmark, we also reanalyzed the per-formance of modularity optimization methods fastgreedyand Louvain (we could not test simulated annealing forthe sizes under scrutiny here) [6]. In spite of their diﬀer-ences, their behavior is qualitatively similar at low (cid:104) k t (cid:105) ,when the modularity landscape is rough, giving rise to a glassy phase [10]: at low values of µ the two algorithms,that just look for local modularity maxima, are not ableto ﬁnd partitions with a modularity as high as the nativeone, which on average follows the expected mean-ﬁeldvalue Q MF ( µ ) = µ ERc − µ (Fig. (4), upper panel; we only − . − . − . − . − . . . . . . ( µ − µ ERc ) √ k t . . . . . . . . . . . P l ( µ , h k t i ) .

65 0 .

70 0 .

75 0 .

80 0 . µ . . . . . . FIG. 3. Magniﬁcation of the data collapse of P l ( µ, (cid:104) k t (cid:105) ) for l = 4 and for diﬀerent values of (cid:104) k t (cid:105) (diﬀerent colors). Theinset displays the original data. show the results for the Louvain method, which allowsfor better statistics on larger networks). In that region,though, the native partition still fairly well correspondsto the heuristics, because there the relabeling procedureinvolves just a very limited number of nodes, if any (seeSI). Also the similarity index ¯ I ( A , N ) between the re-trieved partition A and the native partition N drops tozero (in a rather erratic way) (Fig. (4), lower panel), con-ﬁrming the signiﬁcant diﬀerence between the partitionretrieved by the algorithms and the native one. Thus,low values of (cid:104) k t (cid:105) (sparse networks) do indeed cause de-tectability problems much more serious than a simpleshift of the threshold value of µ , because of the presenceof multiple competing modularity maxima. In order tocorroborate this interpretation, we initialized the Lou-vain method with the native planted l -partition, ﬁnding . . . . . . . . . . . µ . . . . . . . . . . . ¯ I ( A , N ) . . . . . . . . . Q ( A ) FIG. 4. Modularity (upper panel) of the partition A retrievedby the Louvain method and its similarity index ¯ I ( A , N ) withthe native partition N (middle panel) for the planted 4-partition models with n = 32768 and various (cid:104) k t (cid:105) . Each datapoint is averaged over 100 realizations; only some errors areshown for clarity. .

40 0 .

42 0 .

44 0 .

46 0 .

48 0 . µ .

40 0 .

42 0 .

44 0 .

46 0 .

48 0 . µ . . . . . . . . . . . ¯ I FIG. 5. Comparison close to the phase transition pointamong ¯ I ( A , R ) (solid red line), ¯ I ( A , N ) (dotted blue line),and ¯ I ( R , N ) (dashed green line) for fastgreedy (left panel)and for the Louvain method (right panel) on a 2-planted par-tition with n = 4096, (cid:104) k t (cid:105) = 2048. For clarity, only the largeststandard deviation is shown. that the modularity of the retrieved partition never fellbelow its expected mean-ﬁeld value (see SI).The behavior of the two algorithms diﬀers instead out-side of the glassy phase (Fig. 5). The community struc-ture detected by fastgreedy departs from the native par-tition earlier than the relabeled one, implying that the al-gorithms does not recover the native partition even whenno nodes need to be relabeled. The Louvain methodﬁnds clusters that are instead almost identical to the onesfound upon relabeling, indicating that the performanceof the Louvain method is optimal.Nevertheless, in the narrow interval deﬁned by the dra-matic drop of all similarities, the one between the re-labeled partitions and the native partition, ¯ I ( R , N ) de-creases more rapidly and reaches zero earlier than the ones of both greedy algorithms (Fig. 5), and we quan-tiﬁed this diﬀerence by investigating in more detail thesimilarity thresholds µ fc and µ Lc (now deﬁned as the valueof the mixing parameter at which ¯ I ( A , N ) drops below0.01). We found that both µ fc and µ Lc are also well ﬁttedby Eq. (1) with corrections of the order 1 / (cid:104) k t (cid:105) (Fig. 2, ♦ and (cid:52) , respectively). Moreover µ R c < µ fc < ˆ µ c ≤ µ Lc andthey approach each other as l increases . The diﬀerencesare likely due to the diﬀerent ability of each procedure totake into account higher order correlations, beyond thesimple one taken into account by the relabeling proce-dure.Taking together these observations, we conclude thatdeciding whether community detection algorithms reallyfail is trickier than previously believed. Indeed, in orderto quantify the degree of success of a clustering method,some sort of ground truth should be known about thenetwork under scrutiny. Since this is in general not avail-able, various benchmarks have been developed, the sim-plest one being the GN, with the naive expectation that,at least for such artiﬁcial cases, the exact solution shouldbe known. Unfortunately, their construction is itself af-fected by random ﬂuctuations that blur their structure,making it diﬀerent from the putative one. Carefully con-sidering such statistical variations, we have shown thatit is possible to change the deﬁnition of detectable re-gion , recognizing that ﬂuctuations disrupt the putativepartition for values of µ < µ ERc and well described byEq. (1), so that one should expect all community de-tection algorithms to be aﬀected, qualitatively, in thesame way, with diﬀerences likely to be imputed to diﬀer-ent ways to implement mathematically the fundamentalheuristics behind the GN benchmark. Indeed, the ﬁnd-ing that algorithms based on completely diﬀerent prin-ciples (modularity-based vs . statistical inference) showthe same threshold beyond which a network structurebecomes undetectable, must be reinterpreted in the lightof our results: beyond such threshold the real communitystructure simply bears no resemblance anymore with theputative one.The present results also call for a complete reconsider-ation of the detectability thresholds recently derived forbenchmarks richer in structure [13, 14, 16–18] and of theperformance of community detection algorithms testedon them [4–6]. On a more philosophical note, we mighteven relax the stringency of benchmarking protocols, andresign to the fact that clustering is, at the end of the day,an ill posed problem [19].This work was ﬁnancially supported by the Swiss Na-tional Science Foundation (Grant No. 200021 125277/1). [1] M. Girvan and M. E. J. Newman, P Natl Acad Sci Usa , 7821 (2002). [2] S. Fortunato, Phys Rep , 75 (2010).[3] M. E. J. Newman, Nat Phys , 25 (2012).[4] A. Lancichinetti, S. Fortunato, and F. Radicchi, PhysRev E , 046110 (2008).[5] A. Lancichinetti and S. Fortunato, Phys Rev E ,016118 (2009).[6] A. Lancichinetti and S. Fortunato, Phys Rev E ,056117 (2009).[7] A. Condon and R. M. Karp, Random Struct Algor ,116 (2001).[8] P. Erd¨os and A. R´enyi, Publ. Math. Debrecen , 290(1959).[9] L. Danon, A. D´ıaz-Guilera, J. Duch, and A. Arenas, JStat Mech-Theory E , P09008 (2005).[10] B. H. Good, Y.-A. de Montjoye, and A. Clauset, PhysRev E , 046106 (2010). [11] A. Clauset, M. E. J. Newman, and C. Moore, Phys RevE , 066111 (2004).[12] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, andE. Lefebvre, J Stat Mech-Theory E , P10008 (2008).[13] A. Decelle, F. Krzakala, C. Moore, and L. Zdeborov´a,Phys Rev Lett , 065701 (2011).[14] A. Decelle, F. Krzakala, C. Moore, and L. Zdeborov´a,Phys Rev E , 066106 (2011).[15] R. R. Nadakuditi and M. E. J. Newman, Phys Rev Lett , 188701 EP (2012).[16] R. R. Nadakuditi and M. E. J. Newman, Phys Rev E ,012803 EP (2013).[17] T. P. Peixoto, Phys Rev Lett , 148701 EP (2013).[18] F. Radicchi, arXiv physics.soc-ph (2013), 1306.1102v1.[19] E. Domany, Physica A263