Disentangling homophily, community structure and triadic closure in networks
Tiago P. Peixoto
Department of Network and Data Science, Central European University, 1100 Vienna, Austria, and Department of Mathematical Sciences, University of Bath, Claverton Down, Bath BA2 7AY, United Kingdom
Network homophily, the tendency of similar nodes to be connected, and transitivity, the tendency of two nodes being connected if they share a common neighbor, are conflated properties in network analysis, since one mechanism can drive the other. Here we present a generative model and corresponding inference procedure that is capable of distinguishing between both mechanisms. Our approach is based on a variation of the stochastic block model (SBM) with the addition of triadic closure edges, and its inference can identify the most plausible mechanism responsible for the existence of every edge in the network, in addition to the underlying community structure itself. We show how the method can evade the detection of spurious communities caused solely by the formation of triangles in the network, and how it can improve the performance of link prediction when compared to the pure version of the SBM without triadic closure.
I. INTRODUCTION
One of the most typical properties of social networks is the presence of homophily [1–4], i.e. the increased tendency of an edge to exist between two nodes if they share the same underlying characteristic, such as race, gender, class and a variety of other social parameters. More broadly, when the underlying similarity parameter is not specified a priori, the same homophily pattern is known as community structure [5]. Another pervasive pattern encountered in the same kinds of network is transitivity [6–8], i.e. the increased probability of observing an edge between two nodes if they have a neighbor in common. Although these patterns are indicative of two distinct mechanisms of network formation, namely choice or constraint homophily [9] and triadic closure [10], respectively, they are generically conflated in non-longitudinal data. This is because both processes can result in the same kinds of observation: (1) the preferred connection between nodes of the same kind can induce the presence of triangles involving similar nodes, and (2) the tendency of triangles to be formed can induce the formation of groups of nodes with a higher density of connections between them, when compared to the rest of the network [11, 12]. This conflation means we cannot reliably interpret the underlying mechanisms of network formation merely from the abundance of triangles or observed community structure in network data.

In this work we present a solution to this problem, consisting in a principled method to disentangle homophily and community structure from triadic closure in network data. This is achieved by formulating a generative model that includes community structure in a first instance, and an iterated process of triadic closure in a second.
Based on this model, we develop a Bayesian inference algorithm that is capable of identifying which edges are more likely to be due to community structure or triadic closure, in addition to the underlying community structure itself.

Several authors have demonstrated that triadic closure can induce community structure and homophily in networks. Foster et al. [11, 12] have shown that maximum entropy network ensembles conditioned on prescribed abundances of triangles tend to possess high modularity. A more recent analysis of this kind of ensemble by López et al. [13] showed that it is marked by a spontaneous size-dependent formation of “triangle clusters.” Bianconi et al. [14] have investigated a network growth model, where nodes are progressively added to the network and connected in such a way as to increase the number of triangles, and shown that it is capable of producing networks with emergent community structure. The effect of triangle formation on apparent community structure has been further studied by Wharrie et al. [15], who showed that those patterns can even mislead methods specifically designed to avoid the detection of spurious communities in random networks. More recently, Asikainen et al. [16] have shown, via a simple macroscopic model, that iterated triadic closure can exacerbate homophily present in the original network.

The approach presented in this work differs from the aforementioned ones primarily in that it runs in the reverse direction: instead of only defining a conceptual network model that demonstrates the interlink between triadic closure and homophily given prescribed parameters, the proposed method operates on empirical network data, and reconstructs the underlying generative process, decomposing it into distinct community structure and triadic closure components. As we show, this reconstruction yields a detailed interpretation of the underlying mechanisms of network formation, allowing us to identify macro-scale structures that emerge spontaneously from micro-scale higher-order interactions [17, 18], and in this way we can separate them from inherently macro-scale structures.

Our method is based on the nonparametric Bayesian inference of a modified version of the stochastic block model (SBM) [19, 20] with the addition of triadic closure edges, and therefore leverages the statistical evidence available in the data, without overfitting. Importantly, our method is capable of determining when the observed structure can be attributed to an actual preference of connection between nodes, as described by the SBM, rather than an iterated triadic closure process occurring on top of a substrate network. As a result, we can distinguish between “true” and apparent community structure caused by increased transitivity. As we also demonstrate, this decomposition yields an edge prediction method that tends to perform better in many instances than the SBM used in isolation.

Figure 1. Schematic representation of the generative process considered (top; panels: seminal edges, triadic closure, observed network) and the associated inference procedure (bottom; panels: observed network, posterior distribution, marginal probabilities). The generative process consists in the placement of seminal edges according to a SBM, and the addition of triadic closure edges conditioned on the seminal edges (shown in red). The inference procedure runs in the reverse direction: given an observed graph, it produces a posterior distribution of possible divisions of seminal and triadic closure edges, with which marginal probabilities on the edge identities can be obtained.

Our manuscript is organized as follows. In Sec. II we describe our model and its inference procedure. In Sec. III we demonstrate how it can be used to disambiguate triadic closure from community structure in artificially generated networks. In Sec. IV we perform an analysis of empirical networks, in view of our method. In Sec. V we show how our model can improve edge prediction. We end in Sec. VI with a conclusion.
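To make the two-step generative process of Fig. 1 concrete, the following is a minimal sketch rather than the model used in this work. It assumes that seminal edges come from an Erdős-Rényi graph (a single-group stand-in for the SBM, with no degree correction), that only one generation of closure is applied, and that each pair of seminal neighbours of a node u is connected with probability p_u. The function names and parameter values are hypothetical.

```python
import random
from itertools import combinations


def sample_seminal_edges(n, p, rng):
    """Seminal edges: every node pair (i < j) is connected independently
    with probability p. This is an assumed stand-in for the SBM."""
    return {(i, j) for i, j in combinations(range(n), 2) if rng.random() < p}


def triadic_closure(n, seminal, p_u, rng):
    """One generation of closure edges: for every node u, each pair of its
    seminal neighbours that is not already connected is linked with
    probability p_u (assumed to be the same for all nodes)."""
    nbrs = {u: set() for u in range(n)}
    for i, j in seminal:
        nbrs[i].add(j)
        nbrs[j].add(i)
    closure = set()
    for u in range(n):
        for i, j in combinations(sorted(nbrs[u]), 2):
            if (i, j) not in seminal and rng.random() < p_u:
                closure.add((i, j))
    return closure


rng = random.Random(42)
n = 100
seminal = sample_seminal_edges(n, 0.02, rng)
closure = triadic_closure(n, seminal, 0.5, rng)
# The observed network "erases" the edge identities.
observed = seminal | closure
```

The inference problem discussed in the text is the reverse of this sketch: only `observed` is available, and the split into `seminal` and `closure` has to be reconstructed.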
II. STOCHASTIC BLOCK MODEL WITH TRIADIC CLOSURE (SBM/TC)
Community structure and triadic closure are generally interpreted as different processes of network formation. With the objective of allowing their identification a posteriori from network data, our approach consists in defining a generative network model that encodes both processes explicitly. More specifically, our generative model consists of two steps: the first is the generation of a substrate network containing “seminal” edges, placed according to an arbitrary mixing pattern between nodes, and the second is an additional layer containing triadic closure edges, potentially connecting two nodes if they share a common neighbor in the substrate network (see Fig. 1). The final network is obtained by “erasing” the identity of the edges, i.e. whether they are seminal or due to the closure of a triangle. Conversely, the inference procedure consists in moving in the opposite direction: given a simple graph, with no annotations on the edges, we consider the posterior distribution of all possible divisions into seminal and triadic closure edges, weighted according to their plausibility.

We will denote the seminal edges with an adjacency matrix $A$, and for its generation we will use the degree-corrected stochastic block model (DC-SBM) [21], conditioned on a partition $b$ of the nodes into $B$ groups, where $b_i \in [1, B]$ is the group membership of node $i$, which has a marginal distribution $P(A \mid b)$ given in Ref. [22].

[...]

Triadic closures increase the number of edges in the network, and in this way can introduce opportunities for new triadic closures, involving both older and newer edges. This leads naturally to a dynamical model, where generations of triadic closures are progressively introduced to the network. We can incorporate this in our model via “layers” of ego graphs $g^{(l)}$ representing edges introduced in generation $l \in [1, \dots, L]$. For our formulation, it will be useful to define the cumulative network at generation $l$, defined recursively by

$$A^{(l)}_{ij} = \begin{cases} 1, & \text{if } A^{(l-1)}_{ij} + \sum_u g^{(l)}_{ij}(u) > 0, \\ 0, & \text{otherwise,} \end{cases} \tag{11}$$

with boundary conditions $A^{(0)} = A$, and $g^{(0)}(u)$ being empty graphs for all $u$, and we will denote the final generation as $A^{(L)} = G$. The formation of new triadic closure layers is done according to the probability $P(g^{(l)}(u) \mid A^{(l-1)}, g^{(l-1)}, p^{(l)}_u)$.

[...]

The posterior distribution of Eq. 18 can be written exactly, up to a normalization constant. However, this fact alone does not allow us to directly sample from this distribution, which can only be done in very special cases. Instead, we rely here on Markov chain Monte Carlo (MCMC), implemented as follows. We begin with an arbitrary choice of $\{g^{(l)}\}$, $A$ and $b$ that is compatible with our observed graph $G$. We then consider modifications of these quantities, and accept or reject them according to the Metropolis-Hastings criterion [24, 25]. More specifically, we consider moves of the kind $P(\{g'^{(l)}\}, A' \mid \{g^{(l)}\}, A)$, and accept them according to the probability

$$\min\left(1, \frac{P(\{g'^{(l)}\}, A', b \mid G)\, P(\{g^{(l)}\}, A \mid \{g'^{(l)}\}, A')}{P(\{g^{(l)}\}, A, b \mid G)\, P(\{g'^{(l)}\}, A' \mid \{g^{(l)}\}, A)}\right), \tag{19}$$

which, as we mentioned before, does not require the computation of the intractable marginal probability $P(G)$. We also consider moves that change the community structure, according to a proposal $P(b' \mid b)$, and accept them with probability

$$\min\left(1, \frac{P(A \mid b')\, P(b')\, P(b \mid b')}{P(A \mid b)\, P(b)\, P(b' \mid b)}\right). \tag{20}$$

For the latter we use the merge-split moves described in Ref. [26].
Iterating the moves described above eventually produces samples from the target posterior distribution. In Appendix C we specify the details of the particular move proposals we use.

Footnote: If we wanted to treat $L$ as an unknown, we should introduce a prior for $L$, $P(L)$, and include that in the posterior as well. However, with the parametrization in Appendix B, generations which are unpopulated with edges have no contribution to the marginal likelihood. Therefore we can simply set $L$ to be a sufficiently large value, for example $L = \binom{N}{2}$, since for later generations it is impossible to add new edges.

Figure 3. Panels: (a) random seminal edges; (b) triadic closure edges and spurious communities found with the SBM ($\Sigma_{\text{SBM}} = 801.$ nats); (c) inference of the SBM/TC model ($\Sigma_{\text{SBM/TC}} = 590.$ nats). (a) Example artificial network generated as a fully random graph with a geometric degree distribution, $N = 100$ nodes and $E = 94$ edges, and (b) a process of triadic closure based on network (a) with parameter $p_u = 0.$ for every node, with closure edges shown in red. Also shown are the partition found by fitting the SBM to the resulting network, and the description length obtained. (c) The result of inferring the SBM/TC model, which uncovers a single partition (no community structure) and the closure edges shown in red (the thickness of the edges corresponds to the marginal probabilities $\pi_{ij}$ and $1 - \pi_{ij}$ for the seminal and closure edges, respectively). Also shown is the description length of the SBM/TC fit.

Given samples from the posterior distribution, we can use them to summarize it in a variety of ways. A useful quantity is the marginal probability $\pi_{ij}$ of an edge $(i, j)$ being seminal, which is given by

$$\pi_{ij} = \sum_{\{g^{(l)}\}, A, b} A_{ij}\, P(\{g^{(l)}\}, A, b \mid G). \tag{21}$$
Conversely, the complementary quantity,

$$1 - \pi_{ij}, \tag{22}$$

corresponds to the probability that edge $(i, j)$ is due to triadic closure, occurring in any generation or ego graph. Therefore, the quantity $\pi$ gives us a concise summary of the posterior decomposition of a network, and we will use it throughout our analysis. (It is easy to devise and compute other summaries, such as the marginal probability of an edge belonging to a given triadic generation, or a particular ego graph, but we will not have use for those in our analysis.)

III. DISTINGUISHING COMMUNITY STRUCTURE FROM TRIADIC CLOSURE

Here we illustrate how triadic closure can be mistaken for community structure, and how our inference method is capable of uncovering it. We begin by considering an artificial example, where we first sample a fully random network with a geometric degree distribution, $N = 100$ nodes and $E = 94$ edges, as shown in Fig. 3a. This network does not possess any community structure, since the probability of observing an edge is just proportional to the product of the degrees of the endpoint nodes; indeed, if we fit a DC-SBM to it, we uncover, correctly, only a single group. Conditioned on this network, Fig. 3b shows sampled triadic closure edges, according to the model described previously, where each node has the same probability $p_u = 0.$ of having neighbours connected in their ego graphs. In the same figure we show the result of fitting the DC-SBM on the network obtained by ignoring the edge types. That approach finds five assortative communities, corresponding to regions of higher densities of edges induced by the random introduction of transitive edges.
One should not, however, interpret the presence of these regions as a special affinity between the respective groups of nodes, since they are a result of a random process that has no relation to that particular division of the network; indeed, if we run the whole process again from the beginning, the nodes will most likely end up clustered in completely different “communities.” If we now perform the inference of our SBM with triadic closure (SBM/TC), we obtain the result shown in Fig. 3c. Not only are we capable of distinguishing the seminal from the triadic closure edges (AUC ROC = .), but we also correctly identify the presence of a single group of nodes, which is in full accordance with the completely random nature in which the network has been generated. In other words, with the SBM/TC we are not misled by the density heterogeneity introduced by triadic closures into thinking that the network possesses real community structure, and we realize instead that it can be better explained by a different process.

In the artificial example considered above, the result obtained with the SBM/TC model is more appealing, since it more closely matches the known generative process that was used. However, in more realistic situations, we will need to decide if it provides a better description of the data without such privileged information. In view of this, we can make our model selection argument more formal in the following way. Suppose we are considering a partition $b^{(1)}$ found by inferring the SBM on a given network, as well as another partition $b^{(2)}$ and ego graphs $\{g^{(l)}\}$ found with the SBM/TC model. We can decide which one provides a better description of a network via the posterior odds ratio,

$$\Lambda = \frac{P(b^{(2)}, \{g^{(l)}\}, \mathcal{H}_{\text{SBM/TC}} \mid G)}{P(b^{(1)}, \mathcal{H}_{\text{SBM}} \mid G)} \tag{23}$$

$$\phantom{\Lambda} = \frac{P(G, \{g^{(l)}\}, A, b^{(2)})}{P(G, b^{(1)})} \times \frac{P(\mathcal{H}_{\text{SBM/TC}})}{P(\mathcal{H}_{\text{SBM}})}, \tag{24}$$

where $P(\mathcal{H}_{\text{SBM/TC}})$ and $P(\mathcal{H}_{\text{SBM}})$ are the prior probabilities for either model.
In case these are the same, we have

$$\Lambda = e^{-(\Sigma_{\text{SBM/TC}} - \Sigma_{\text{SBM}})}, \tag{25}$$

where $\Sigma_{\text{SBM/TC}}$ and $\Sigma_{\text{SBM}}$ are the description lengths of both hypotheses, given by

$$\Sigma_{\text{SBM/TC}} = -\ln P(G, \{g^{(l)}\}, A, b^{(2)}), \tag{26}$$

$$\Sigma_{\text{SBM}} = -\ln P(G, b^{(1)}). \tag{27}$$

The description length [27] measures the amount of information necessary to encode both the data and the model parameters, and hence accounts both for the quality of fit and the model complexity. The above means that the model that is most likely a posteriori is the one that most compresses the data under its parametrization, and thus the criterion amounts to an implementation of Occam’s razor, since it points to the best balance between model complexity and quality of fit.

Before we employ the above criterion to select between both models considered, it is important to emphasize that the pure SBM is “nested” inside the SBM/TC, since the former amounts to the special case of the latter when there are zero triadic closure edges. In particular, if we use the more general parametrization described in Appendix B, in the situation with zero triadic edges, i.e. all $\{g^{(l)}\}$ are empty graphs $g_{\text{empty}}$ and $A = G$, we have

$$P(G, \{g^{(l)} = g_{\text{empty}}\}, A = G, b) \ge \frac{P(G, b)}{N + 1}. \tag{28}$$

Therefore, in general, we must have

$$\max_{\{g^{(l)}\}, A, b} \ln P(G, \{g^{(l)}\}, A, b) \ge \max_{b} \ln P(G, b) - \ln(N + 1). \tag{29}$$

Since the last logarithm term becomes negligible for large networks, typically the use of the SBM/TC can only reduce the description length of the data. Therefore, in situations where there is no evidence for triadic closure, both models should yield approximately the same description length value.

Figure 4. Recovery of community structure for artificial networks generated from the PP model with added triadic closure, as described in the text, for networks with $N = 10$ nodes, average degree $\langle k \rangle = 5$, $B = 10$ planted groups, and uniform triadic closure probability $p_u = p$ shown in the legend. Panels (a) and (b) correspond to inferences done using the SBM, and (c) and (d) to inferences done using the SBM/TC model. All results were averaged over 10 network realizations. The vertical dashed line marks the detectability transition value $c^*_+$, described in the text.

In Fig. 3 we show the description lengths for both models for the particular example discussed previously, where we can see that the SBM/TC provides a substantially better compression of the data, therefore yielding a more parsimonious and hence more probable account of how the data was generated, which also happens to be the true one in this controlled setting.

We proceed with a more systematic analysis of how triadic closure can interfere in community detection with artificial networks generated by the SBM, more specifically the special case known as the planted partition model (PP), where the $B$ groups have equal size, and the number of edges between groups is given by

$$e_{rs} = 2E \left[ \frac{c}{B}\, \delta_{rs} + \frac{1 - c}{B(B - 1)}\, (1 - \delta_{rs}) \right], \tag{30}$$

where $c \in [0, 1]$ determines the affinity between the (dis)assortative groups. For this model, we know that there are critical values

$$c^*_{\pm} = \frac{1}{B} \pm \frac{B - 1}{B \sqrt{\langle k \rangle}}, \tag{31}$$

such that if $c \in [c^*_-, c^*_+]$ then no algorithm can infer a partition that is correlated to the true one from a single network realization, as the network becomes infinitely large, $N \to \infty$ [28]. Starting from a network generated with the PP model, we include triadic closure edges via the global probability $p_u = p$ for every node in the network. Based on the resulting network, we attempt to recover the original communities, using the SBM and the SBM/TC model. A result of this analysis is shown in Fig.
4, where we compute the maximum overlap [29] $q \in [0, 1]$ between the inferred partition $\hat{b}$ and the true partition $b$, defined as

$$q = \max_{\mu} \frac{1}{N} \sum_i \delta_{\mu(\hat{b}_i), b_i}, \tag{32}$$

where $\mu(r)$ is a bijection between the group labels in $\hat{b}$ and $b$, as well as the effective number of inferred groups $B_e = e^S$, where $S$ is the group label entropy

$$S = -\sum_r \frac{n_r}{N} \ln \frac{n_r}{N}, \tag{33}$$

with $n_r$ being the number of nodes in group $r$. As can be seen in Fig. 4a, the presence of triadic closure edges can have a severe negative effect on the recovery of the original partitions when using the SBM. In Fig. 4b we see that the number of groups uncovered can be orders of magnitude larger than in the original partition, especially when the latter is not even detectable. This shows that the apparent communities that arise out of the formation of triangles substantially overshadow the underlying true community structure. The situation changes considerably when we use the SBM/TC instead, as shown in Fig. 4c. In this case, the presence of triadic closure has no noticeable effect on the detectability of the true community structure, and we obtain a recovery performance indistinguishable from the SBM in the case with no additional edges. As seen in Fig. 4d, the same is true for the number of groups inferred. These results seem to point to a robust capacity of the SBM/TC model to reliably distinguish between actual community structure and the density fluctuations that result from triadic closures.

IV. EMPIRICAL NETWORKS

We investigate the use of our method with a variety of empirical networks. We begin with a network of cooperation among students while doing their homework for a course at Ben-Gurion University [30]. In Fig. 5a we show the network and a fit of the DC-SBM, which finds 9 assortative communities.
Based on this result (and knowing that the partitions found by inferring the SBM as we do here point to statistically significant results that cannot be attributed to mere random fluctuations [20]) we would be tempted to posit that these divisions are uncovering latent social parameters that could explain the observed cooperation between these groups of students. However, if we employ instead the SBM/TC, we obtain the result shown in Fig. 5b, which uncovers instead only a single group, and an abundance of triadic closure edges. This is not unlike the artificial example considered in Fig. 3, and points to a very different interpretation, namely that there is no measurable a priori predisposition for students to work with each other in groups, and the resulting network stems instead from students choosing to work together if they already share a mutual partner. Indeed, if we inspect the description lengths obtained with each model, we immediately recognize the SBM/TC as the most plausible explanation, and therefore we deem the community structure found by the SBM as an unlikely one by comparison.

Figure 5. Network of cooperation between students [30]. (a) Fit of the SBM, yielding $B = 9$ communities ($\Sigma_{\text{SBM}} = 1145.$ nats). (b) Fit of the SBM/TC, uncovering a single community, and triadic closure edges shown in red ($\Sigma_{\text{SBM/TC}} = 935.$ nats). The thickness of the edges corresponds to the marginal probabilities $\pi_{ij}$ and $1 - \pi_{ij}$ for the seminal and closure edges, respectively.

We move now to another social network, this time of friendships between high school students [31]. We show the results of our analysis in Fig. 6. Using the SBM we find $B = 26$ groups, shown in Fig. 6a, which at first seems like a reasonable explanation for this network. But instead, with the SBM/TC we find only $B = 9$ groups and a substantial number of triadic closure edges, as seen in Fig. 6b.
Differently from the previous example, the SBM/TC still finds enough evidence for a substantial amount of community structure, although with fewer groups than the pure SBM. The groups found with the SBM/TC have a strong correlation with the student grades, as shown in Fig. 6b, except for the 11th and 12th grades, which seem to intermingle more, and for which the model finds evidence of more detailed internal social structures. This indicates that most of the subdivisions of the grades found by the pure SBM are in fact better explained by triadic closure edges, and that the a priori friendship preferences within these grades are far more homogeneous than the SBM fit would lead us to conclude. One particularly striking feature of this analysis is that it imputes some seemingly clear communities entirely to triadic closure. A good example is the group highlighted with an arrow in Fig. 6a, formed by students in the 8th grade.

Figure 6. Network of friendships between high school students, Adolescent health (comm26) [31]. (a) Fit of the SBM, yielding $B = 26$ communities ($\Sigma_{\text{SBM}} = 8757.$ nats). (b) Fit of the SBM/TC, uncovering $B = 9$ communities ($\Sigma_{\text{SBM/TC}} = 8456.$ nats), with seminal (black) and triadic closure (red) edges shown separately in the left and right figures. The thickness of the edges corresponds to the marginal probabilities $\pi_{ij}$ and $1 - \pi_{ij}$ for the seminal and closure edges, respectively. The red dashed lines delineate the known divisions into grades, as numbered.

Figure 7. Network of collaborations between network scientists [32]. (a) Fit of the SBM, yielding $B = 27$ communities ($\Sigma_{\text{SBM}} = 3816.$ nats). (b) Fit of the SBM/TC, uncovering only $B = 3$ groups, and triadic closure edges shown in red ($\Sigma_{\text{SBM/TC}} = 3009.$ nats). The thickness of the edges corresponds to the marginal probabilities $\pi_{ij}$ and $1 - \pi_{ij}$ for the seminal and closure edges, respectively.
According to the SBM/TC, this group has arisen due to the formation of triangles between an initially poorly connected subset of students, formed by all friends of a single student, rather than an initial affinity between them. The SBM/TC explanation is again more plausible, due to its smaller description length.

We move now to an additional example, this time of collaborations between researchers in network science [32], shown in Fig. 7. For this network, the SBM finds $B = 27$ communities. The interpretation here is the same as in previous analyses of the same network, namely that these communities are groups that tend to work together, with the occasional collaboration across groups. On the other hand, when we employ the SBM/TC, the difference this time is quite striking. Most of the community structure found with the pure SBM vanishes and is replaced by a substrate network with a substantial “core-periphery” mixing pattern formed of two main groups, where the “core” (blue nodes) is composed of perceived initiators of the collaborations with the “periphery” (yellow nodes), which end up being connected in the final network simply by virtue of the all-to-all nature of multi-way collaborations, captured here by triadic closure edges.

Figure 8. Posterior predictive distributions of the clustering coefficient, as described in the text, for the SBM and SBM/TC as indicated in the legend, for different datasets: (a) cooperation between students; (b) Adolescent health (comm26); (c) scientific collaborations in network science; (d) NCAA college football 2000. The vertical line shows the empirical value $C(G)$.
The core-periphery pattern is not perfect, as we observe seminal edges between nodes of every type, but most commonly these exist between core and periphery nodes, and between the core nodes themselves, who therefore seem to have a predisposition to wider collaborations. The difference between the description lengths of both models is substantial, indicating that the SBM/TC interpretation is indeed far more plausible.

Lastly, we consider the network of American football games between colleges during the fall of 2000 [33], shown in Fig. 9. For this network we observe an interesting result, namely that the SBM and SBM/TC yield the exact same inference, which means that the SBM/TC gives a very low probability of triadic closure edges. Although we might expect this to occur for a network that has very few or no triangles, and therefore substantial evidence against triadic closure, this is not the case for the particular network in question, which has in fact an abundance of triangles, in addition to clear assortative communities. The reason for this is that, in this particular case, the SBM is fully capable of accounting for the triangles observed, which therefore can be characterized as being a “side-effect” of the homophily between nodes of the same group, instead of an excess that needs additional explanation. We will revisit this particular case in the following, from a different angle.

Figure 9. Network of games between American college football teams (NCAA college football 2000) [33]. The node colors show the fit of the SBM and SBM/TC, both yielding the same $B = 11$ communities. The SBM yields a description length of $\Sigma_{\text{SBM}} = 1761.$ nats and the SBM/TC, $\Sigma_{\text{SBM/TC}} = 1767.$ nats.
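The model comparisons made for each network in this section all reduce to Eq. 25: the log posterior odds in favour of the SBM/TC equal the difference in description lengths, plus the log ratio of the model priors. A minimal sketch of that arithmetic follows; the function name and the numerical values are hypothetical and are not taken from the figures.

```python
import math


def log_posterior_odds(sigma_sbm_tc, sigma_sbm, log_prior_ratio=0.0):
    """Natural log of the posterior odds ratio in favour of the SBM/TC.

    sigma_*: description lengths in nats.
    log_prior_ratio: ln of the prior ratio of SBM/TC to SBM, which is
    zero when both models are equally likely a priori.
    """
    return -(sigma_sbm_tc - sigma_sbm) + log_prior_ratio


# Hypothetical description lengths, in nats.
log_odds = log_posterior_odds(sigma_sbm_tc=900.0, sigma_sbm=1000.0)
# Express the odds as a power of ten to avoid overflowing exp().
log10_odds = log_odds / math.log(10)
```

Because the odds are exponential in the difference, even a gap of a few nats amounts to a strong preference, and a slightly larger description length for the SBM/TC, as in Fig. 9, indicates that the data give no support for the additional closure mechanism.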
Z a c h a r y K a r a t e C l u b ( ) D N C e m a il s L i tt l e R o c k L a k e f oo d w e b N C AA c o ll e g e f oo t b a ll L e s M i s é r a b l e s c o a pp e a r a n c e s D o l p h i n s o c i a l n e t w o r k - t e rr o r i s t n e t w o r k P o li t i c a l b oo k s n e t w o r k M a d r i d t r a i n b o m b i n g t e rr o r i s t s G a m e o f T h r o n e s c o a pp e a r a n c e s M u l t il a y e r p h y s i c i s t c o ll a b o r a t i o n s ( p i e rr e A u g e r ) P h y s i c i a n t r u s t n e t w o r k H i g h s c h oo l t e m p o r a l c o n t a c t s ( s u r v e y ) H i g h s c h oo l t e m p o r a l c o n t a c t s ( d i a r i e s ) M a i e r F a c e b oo k f r i e n d s W i t h i n - o r g a n i z a t i o n F a c e b oo k f r i e n d s h i p s ( S ) C . e l e g a n s n e u r o n s H i g h s c h oo l t e m p o r a l c o n t a c t s ( f a c e b oo k ) M e t a b o li c n e t w o r k E g o n e t w o r k s i n s o c i a l m e d i a ( f a c e b oo k - ) C o lli n s y e a s t i n t e r a c t o m e S c i e n t i f i c c o ll a b o r a t i o n s i n n e t w o r k s c i e n c e W i t h i n - o r g a n i z a t i o n F a c e b oo k f r i e n d s h i p s ( S ) J a zz c o ll a b o r a t i o n n e t w o r k A d o l e s c e n t h e a l t h ( c o mm ) M a l a r i a v a r D B L a H V R n e t w o r k s ( H V R - ) S t u d e n t c oo p e r a t i o n E m a il n e t w o r k G oo g l e + S c i e n t i f i c c o ll a b o r a t i o n s i n p h y s i c s ( h e p - t h - ) S c i e n t i f i c c o ll a b o r a t i o n s i n p h y s i c s ( c o n d - m a t - ) Figure 10. Values of the z-score for the posterior predictivedistributions of the clustering coefficient, as described in thetext, for the SBM and SBM/TC as indicated in the legend, fordifferent datasets. The solid horizontal lines mark the values − and . 
Figure 11. Values of the clustering coefficient (Eq. 34) computed for the original network, $C(G)$, and for the inferred seminal network, $C_S(G)$, averaged over the posterior distribution according to Eq. 37, as shown in the legend, for different datasets.

Figure 12. Values of the effective number of inferred groups, as given by Eq. 33, for the SBM and SBM/TC as indicated in the legend, for different datasets.

One natural criticism of the SBM as a useful hypothesis for real networks, however stylized as it clearly is, is that it assumes conditional independence for the placement of every edge. One consequence of this is that the probability of observing a spontaneous triadic closure edge will scale with $O(B/N)$, for a network with $N$ nodes and $B$
groups, assuming the group affinities are uniform for all groups. Therefore if $B \ll N$, we should not expect any abundance of triangles, which is at odds with what we observe in many empirical datasets. One problem with this logic is that we do not know a priori the precise relationship between $B$ and $N$ for finite empirical networks, and therefore we cannot rule out the SBM hypothesis based simply on an observed abundance of triangles. Auspiciously, with the SBM/TC at hand, we are in the perfect position to evaluate the SBM in that regard, and understand how many of the observed triangles can be attributed to an incidental link placement due to community structure, or if they are instead better explained by explicit triadic closure edges.

A common way of quantifying the amount of triangles in a network $G$ is via its clustering coefficient $C(G) \in [0, 1]$, which determines the fraction of triads in the network which are closed in a triangle, and is given by

$$C(G) = \frac{\sum_{ijk} G_{ij} G_{jk} G_{ki}}{\sum_i k_i (k_i - 1)}. \tag{34}$$

A meaningful way to evaluate whether a given model $P(G|\theta)$ with parameters $\theta$ can capture what is seen in the data is to compute the posterior predictive distribution,

$$P(C|G) = \sum_{G'} \delta(C - C(G')) \sum_{\theta} P(G'|\theta) P(\theta|G). \tag{35}$$

This involves sampling parameters $\theta$ from the posterior $P(\theta|G)$, generating new networks $G'$ from the model $P(G'|\theta)$, and obtaining the resulting population of $C(G')$ values, which can then be compared to the observed value $C(G)$. In this way we can determine if the model used is capable of capturing this aspect of the data. In Fig. 8 we show the results of this comparison for the SBM and SBM/TC (in Appendix D we give more details about how $\theta$ should be chosen in each case) using four datasets.
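As a concrete illustration of Eq. 34, the following minimal Python sketch computes the clustering coefficient of a simple undirected graph. It assumes that the graph is stored as a dictionary mapping each node to the set of its neighbours (a representation chosen here for illustration only), and that the denominator of Eq. 34 sums $k_i(k_i - 1)$ over all nodes. The function name is hypothetical.

```python
def clustering_coefficient(adj):
    """Global clustering coefficient in the spirit of Eq. 34.

    adj: dict mapping node -> set of neighbours (simple, undirected graph;
    this data layout is an assumption made for the example).
    """
    closed = 0  # sum over ordered triples (i, j, k) of G_ij * G_jk * G_ki
    triads = 0  # sum over nodes i of k_i * (k_i - 1)
    for i, neighbours in adj.items():
        k = len(neighbours)
        triads += k * (k - 1)
        for j in neighbours:
            # every common neighbour k of i and j closes the triple (i, j, k)
            closed += len(neighbours & adj[j])
    # convention chosen here: a graph without any triad has coefficient 0
    return closed / triads if triads else 0.0


# A triangle (0, 1, 2) with one pendant node attached to node 2.
example = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(clustering_coefficient(example))
```

Applying the same function to each network $G'$ sampled from a fitted model would give the population of $C(G')$ values that Eq. 35 refers to; the sampling step itself depends on the model and is not shown.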
For three of the four networks we observe what one might expect: although the SBM is capable of accounting for a substantial amount of triangles (far more than one would expect by naively assuming $B \ll N$), it falls short of explaining what is actually seen in the data. The SBM/TC, on the other hand, accounts for a realm of possibilities that comfortably includes what is observed in the data, with a sufficiently high probability. For the remaining network in Fig. 8c, NCAA college football 2000, as before, we observe a different picture. Namely, both models produce predictive posterior distributions that are essentially identical, and fully compatible with what is seen in the data. Therefore we can say with a fair amount of confidence that the fairly high clustering coefficient observed for this network can in principle be attributed to community structure alone, rather than triadic closure, contradicting the intuition obtained from the asymptotic case where $B \ll N$, which is not applicable to this network.

We extend the previous analysis to a larger set of empirical networks, as shown in Fig. 10, by summarizing the compatibility of the posterior predictive distribution via the z-score,

$$z = \frac{C(G) - \langle C \rangle}{\sigma_C}, \tag{36}$$

where $\langle C \rangle$ and $\sigma_C$ are the mean and standard deviation of the posterior predictive distribution. As we can see, there are a number of networks for which the z-score values lie within the plausible interval for both models, but there is a much larger fraction of the data for which the values for the SBM point to a decisive incompatibility, whereas the SBM/TC yields credible values more systematically.

We can further exploit the decomposition that the SBM/TC provides by quantifying precisely, for any given network, how much of the observed clustering can be attributed to triadic closure directly, or to community structure indirectly.
We can do so by computing the mean clustering coefficient of the substrate seminal network from the posterior distribution,

$$C_S(G) = \sum_{A, g, b} C(A) P(A, g, b | G). \tag{37}$$

We can then compare this value with the coefficient for the observed network $C(G)$, as we show in Fig. 11. We identify a variety of scenarios, including situations where the seminal network (and hence the community structure) accounts for the majority of the observed clustering, but most commonly we observe that a substantial fraction can be attributed to more direct triadic closure. Nevertheless, in many cases the values of $C_S(G)$ do not drop to negligible values, showing that the presence of triangles cannot be wholly attributed to either mechanism in these cases. Indeed, this variability seems to indicate that the mere presence of a high or low density of triangles, as captured by the clustering coefficient, cannot be used by itself to evaluate whether triadic closure or community structure is the leading underlying mechanism of network formation.

Another aspect of the suitability of triadic closure as a more plausible network model is that it tends to come together with a less pronounced inferred community structure, since part of the density heterogeneity found is attributed to the former mechanism, rather than the latter. In Fig. 12 we characterize this difference by the effective number of groups found with both models.
We see that the discrepancy between them is once again quite varied: in some cases it can be quite small, indicating that triadic closure plays a minor role, while in other cases it can be quite extreme, indicating the dominant role that triadic closure has in the respective networks.

Overall, what we seem to extract from these empirical networks is that, in the majority of cases (though not all), the observed structure seems to be better explained by a heterogeneous combination of underlying mixing patterns with a further distortion by an additional tendency of forming triangles. The precise balance between these two components varies considerably in general, and needs to be assessed for each individual network.

V. EDGE PREDICTION

Figure 13. Distributions of Precision and Recall values, according to the SBM and SBM/TC model, for four empirical networks (panels: Student cooperation; Scientific collaborations in Network Science; Adolescent health (comm26); NCAA college football 2000), and a fraction f = 0.[…] of omitted edges and corresponding number of omitted non-edges. The results were obtained for […] different realizations of missing edges and non-edges.

As with every kind of empirical assessment, network data is subject to measurement errors or omissions. A common use of network models is to predict such erroneous and missing information from what is more precisely known [34, 35]. The SBM has been successfully used as such a model [35, 36], since the latent group assignments and the affinities between them can be learned from partial network information, which in turn can be used to infer what has been distorted or left unobserved. Another common approach to edge prediction consists of attributing a higher probability to a potential edge if it happens to form a triangle [37].
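The triangle-based heuristic just mentioned can be sketched in a few lines. The example below is a simplified stand-in: it scores every unobserved node pair by its number of common neighbours, i.e. by the number of triangles the edge would close. It is neither the weighted score of Ref. [37] nor the SBM/TC approach, and the dictionary-of-sets graph representation and the function name are assumptions made here for illustration.

```python
from itertools import combinations


def common_neighbour_scores(adj):
    """Score each non-adjacent pair (i, j) by the triangles an edge would close.

    adj: dict mapping node -> set of neighbours (simple, undirected graph;
    illustrative representation, not taken from the text).
    """
    scores = {}
    for i, j in combinations(sorted(adj), 2):
        if j not in adj[i]:
            # each common neighbour u would yield a new triangle (i, j, u)
            scores[(i, j)] = len(adj[i] & adj[j])
    return scores


# Path graph 0 - 1 - 2 - 3: pairs (0, 2) and (1, 3) would each close one triad.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
ranked = sorted(common_neighbour_scores(path).items(), key=lambda kv: -kv[1])
print(ranked)
```

A predictor of this kind ignores group affinities entirely, which is the limitation that motivates combining both mechanisms in a single model.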
As we have been discussing in this work, these two properties (group affinity and triadic closure) point to related but distinct processes of edge formation, and approaches to edge prediction that rely exclusively on either one will be maximally efficient only if it happens to be the dominant underlying mechanism, which, as we have seen in the last section, is typically not the case. However, with the SBM/TC model we have introduced, it should be possible to accommodate both mechanisms at the same time, and in this way improve edge prediction in more realistic settings. In the following, we show how this can be done, and demonstrate it with a few examples.

The scenario we consider is the general one presented in Ref. [36], where we make $n_{ij}$ measurements of node pair $(i, j)$ and record the number of times $x_{ij}$ that an edge has been observed. Based on this data, we infer the underlying network $G$ according to the posterior distribution

$$P(G | n, x) = \frac{P(x | G, n) P(G)}{P(x | n)}. \tag{38}$$

The measurement model corresponds to a situation where the probabilities of observing missing and spurious edges are unknown, which amounts to the marginal probability [36]

$$P(x | G, n) = \binom{E}{T}^{-1} \frac{1}{E + 1} \binom{M - E}{X - T}^{-1} \frac{1}{M - E + 1}, \tag{39}$$

where we have $M = \sum_{i}$ […]

VI. CONCLUSION

We have presented a generative model and corresponding inference scheme that is capable of differentiating community structure from triadic closure in empirical networks. We have shown that although these features are generically conflated in traditional network analysis, our method can pick them apart, allowing us to tell whether an observed abundance of triangles is a byproduct of an underlying homophily between nodes, or whether they arise out of a local property of transitivity.
Likewise, we have also shown how our method can evade the detection of spurious communities, which are not due to homophily, but arise instead simply out of a random formation of triangles.

Our approach shows how local and global (or mesoscale) generative processes can be combined into a single model. Since it contains a mixture of both mechanisms, our method is able to decompose them for a given observed network according to their inferred contributions. By employing our method on several empirical networks, we were able to demonstrate a wide variety of scenarios, ranging from a large number of triangles caused predominantly by triadic closures, to a mixture of community structure and triadic closures, to community structure alone. These findings seem to indicate that local and global network properties tend to mix in nontrivial ways, and we should refrain from automatically concluding that an observed local property (e.g. large number of triangles) cannot have a global cause (e.g. group homophily), and likewise that an observed global property (e.g. community structure) cannot have a purely local cause (e.g. triadic closure).

Several authors have shown before that triadic closure can induce the formation of community structure in networks [11–16]. This introduces a problem of interpretation for community detection methods that do not account for this, which, to the best of our knowledge, happens to be the vast majority of them. This is true also for inference methods based on the SBM, which, although not susceptible to finding spurious communities formed by a fully random placement of edges [41] (unlike non-inferential methods, which tend to overfit [42]), cannot evade those arising from triadic closure [15].
Our approach provides a solution to this interpretation problem, allowing us to reliably rule out triadic closure when identifying communities in networks.

We have also shown how incorporating triadic closure together with community structure can improve edge prediction, without degrading the performance in situations where it is not present. This further demonstrates the usefulness of approaches that model networks in multiple scales, and points to a general way of systematically improving our understanding of network data.

[1] J. Miller McPherson and Lynn Smith-Lovin, “Homophily in Voluntary Organizations: Status Distance and the Composition of Face-to-Face Groups,” American Sociological Review, 370–379 (1987).
[2] Wesley Shrum, Neil H. Cheek, and Saundra MacD. Hunter, “Friendship in School: Gender and Racial Homophily,” Sociology of Education, 227–239 (1988).
[3] James Moody, “Race, School Integration, and Friendship Segregation in America,” American Journal of Sociology, 679–716 (2001).
[4] Miller McPherson, Lynn Smith-Lovin, and James M Cook, “Birds of a Feather: Homophily in Social Networks,” Annual Review of Sociology, 415–444 (2001).
[5] Santo Fortunato, “Community detection in graphs,” Physics Reports, 75–174 (2010).
[6] Anatol Rapoport, “Spread of information through a population with socio-structural bias: I. Assumption of transitivity,” The Bulletin of Mathematical Biophysics, 523–533 (1953).
[7] Paul W. Holland and Samuel Leinhardt, “Transitivity in Structural Models of Small Groups,” Comparative Group Studies, 107–124 (1971).
[8] P. W. Holland and S. Leinhardt, “Local structure in social networks,” in D. Heise (ed.), Sociological Methodology (San Francisco: Jossey-Bass, 1975).
[9] Gueorgi Kossinets and Duncan J. Watts, “Origins of Homophily in an Evolving Social Network,” American Journal of Sociology, 405–450 (2009).
[10] Mark S. Granovetter, “The Strength of Weak Ties,” American Journal of Sociology, 1360–1380 (1973).
[11] David V. Foster, Jacob G.
Foster, Peter Grassberger, and Maya Paczuski, “Clustering drives assortativity and community structure in ensembles of networks,” Physical Review E, 066117 (2011).
[12] David Foster, Jacob Foster, Maya Paczuski, and Peter Grassberger, “Communities, clustering phase transitions, and hysteresis: Pitfalls in constructing network ensembles,” Physical Review E, 046115 (2010).
[13] Fabian Aguirre Lopez and Anthony CC Coolen, “Transitions in loopy random graphs with fixed degrees and arbitrary degree distributions,” arXiv:2008.11002 [cond-mat] (2020).
[14] Ginestra Bianconi, Richard K. Darst, Jacopo Iacovacci, and Santo Fortunato, “Triadic closure as a basic generating mechanism of communities in complex networks,” Physical Review E, 042806 (2014).
[15] Sophie Wharrie, Lamiae Azizi, and Eduardo G. Altmann, “Micro-, meso-, macroscales: The effect of triangles on communities in networks,” Physical Review E, 022315 (2019).
[16] Aili Asikainen, Gerardo Iñiguez, Javier Ureña-Carrión, Kimmo Kaski, and Mikko Kivelä, “Cumulative effects of triadic closure and homophily in social networks,” Science Advances, eaax7310 (2020).
[17] Federico Battiston, Giulia Cencetti, Iacopo Iacopini, Vito Latora, Maxime Lucas, Alice Patania, Jean-Gabriel Young, and Giovanni Petri, “Networks beyond pairwise interactions: Structure and dynamics,” Physics Reports, 1–92 (2020).
[18] Austin R. Benson, Rediet Abebe, Michael T. Schaub, Ali Jadbabaie, and Jon Kleinberg, “Simplicial closure and higher-order link prediction,” Proceedings of the National Academy of Sciences, E11221–E11230 (2018).
[19] Paul W. Holland, Kathryn Blackmond Laskey, and Samuel Leinhardt, “Stochastic blockmodels: First steps,” Social Networks, 109–137 (1983).
[20] Tiago P. Peixoto, “Bayesian Stochastic Blockmodeling,” in Advances in Network Clustering and Blockmodeling (John Wiley & Sons, Ltd, 2019) pp. 289–332.
[21] Brian Karrer and M. E. J.
Newman, “Stochastic blockmodels and community structure in networks,” Physical Review E, 016107 (2011).
[22] Tiago P. Peixoto, “Nonparametric Bayesian inference of the microcanonical stochastic block model,” Physical Review E, 012317 (2017).
[23] Tiago P. Peixoto, “Latent Poisson models for networks with heterogeneous density,” Physical Review E, 012309 (2020).
[24] Nicholas Metropolis, Arianna W. Rosenbluth, Marshall N. Rosenbluth, Augusta H. Teller, and Edward Teller, “Equation of State Calculations by Fast Computing Machines,” The Journal of Chemical Physics, 1087 (1953).
[25] W. K. Hastings, “Monte Carlo sampling methods using Markov chains and their applications,” Biometrika, 97–109 (1970).
[26] Tiago P. Peixoto, “Merge-split Markov chain Monte Carlo for community detection,” Physical Review E, 012305 (2020).
[27] Peter D. Grünwald, The Minimum Description Length Principle (The MIT Press, 2007).
[28] Aurelien Decelle, Florent Krzakala, Cristopher Moore, and Lenka Zdeborová, “Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications,” Physical Review E, 066106 (2011).
[29] Tiago P. Peixoto, “Revealing consensus and dissensus between network partitions,” arXiv:2005.13977 [physics, stat] (2020).
[30] Michael Fire, Gilad Katz, Yuval Elovici, Bracha Shapira, and Lior Rokach, “Predicting student exam’s scores by analyzing social network data,” in Active Media Technology (Springer Berlin Heidelberg, 2012) pp. 584–595.
[31] James Moody, “Peer influence groups: identifying dense clusters in large networks,” Social Networks, 261–283 (2001).
[32] M. E. J. Newman, “Finding community structure in networks using the eigenvectors of matrices,” Physical Review E (2006), 10.1103/physreve.74.036104.
[33] M. Girvan and M. E. J. Newman, “Community structure in social and biological networks,” Proceedings of the National Academy of Sciences, 7821–7826 (2002).
[34] Aaron Clauset, Cristopher Moore, and M. E. J.
Newman, “Hierarchical structure and the prediction of missing links in networks,” Nature, 98–101 (2008).
[35] Roger Guimerà and Marta Sales-Pardo, “Missing and spurious interactions and the reconstruction of complex networks,” Proceedings of the National Academy of Sciences, 22073–22078 (2009).
[36] Tiago P. Peixoto, “Reconstructing Networks with Unknown and Heterogeneous Errors,” Physical Review X, 041011 (2018).
[37] Lada A Adamic and Eytan Adar, “Friends and neighbors on the Web,” Social Networks, 211–230 (2003).
[38] Amir Ghasemian, Homa Hosseinmardi, Aram Galstyan, Edoardo M. Airoldi, and Aaron Clauset, “Stacking models for nearly optimal link prediction in complex networks,” Proceedings of the National Academy of Sciences, 23393–23400 (2020).
[39] Tiago P. Peixoto, “Hierarchical Block Structures and High-Resolution Model Selection in Large Networks,” Physical Review X, 011047 (2014).
[40] Toni Vallès-Català, Tiago P. Peixoto, Marta Sales-Pardo, and Roger Guimerà, “Consistencies and inconsistencies between model selection and link prediction in networks,” Physical Review E, 062316 (2018).
[41] Roger Guimerà, Marta Sales-Pardo, and Luís A. Nunes Amaral, “Modularity from fluctuations in random graphs and complex networks,” Physical Review E, 025101 (2004).
[42] Amir Ghasemian, Homa Hosseinmardi, and Aaron Clauset, “Evaluating Overfit and Underfit in Models of Network Community Structure,” IEEE Transactions on Knowledge and Data Engineering, 1–1 (2019).
[43] Tiago P. Peixoto, “The graph-tool python library,” figshare (2014), 10.6084/m9.figshare.1164194, available at https://graph-tool.skewed.de.
[44] T. P. Peixoto, “The Netzschleuder network catalogue and repository,” (2020), accessible at https://networks.skewed.de.
[45] Aaron Clauset, Ellen Tucker, and Matthias Sainz, “The Colorado Index of Complex Networks,” (2016), accessible at https://icon.colorado.edu.
[46] M. E. J.
Newman, “The structure of scientific collaboration networks,” Proceedings of the National Academy of Sciences, 404–409 (2001).
[47] Jordi Duch and Alex Arenas, “Community detection in complex networks using extremal optimization,” Physical Review E (2005), 10.1103/physreve.72.027104.
[48] “The structure of the nervous system of the nematode caenorhabditis elegans,” Philosophical Transactions of the Royal Society of London. B, Biological Sciences, 1–340 (1986).
[49] Duncan J. Watts and Steven H. Strogatz, “Collective dynamics of ‘small-world’ networks,” Nature, 440–442 (1998).
[50] Sean R. Collins, Patrick Kemmeren, Xue-Chu Zhao, Jack F. Greenblatt, Forrest Spencer, Frank C. P. Holstege, Jonathan S. Weissman, and Nevan J. Krogan, “Toward a comprehensive atlas of the physical interactome of Saccharomyces cerevisiae,” Molecular & Cellular Proteomics, 439–450 (2007).
[51] Jérôme Kunegis, “KONECT,” in Proceedings of the 22nd International Conference on World Wide Web - WWW '13 Companion (ACM Press, 2013).
[52] David Lusseau, Karsten Schneider, Oliver J. Boisseau, Patti Haase, Elisabeth Slooten, and Steve M. Dawson, “The bottlenose dolphin community of doubtful sound features a large proportion of long-lasting associations,” Behavioral Ecology and Sociobiology, 396–405 (2003).
[53] Julian McAuley and Jure Leskovec, “Discovering social circles in ego networks,” (2012), arXiv:1210.8182v3 [cs.SI].
[54] Benjamin F. Maier and Dirk Brockmann, “Cover time for random walks on arbitrary complex networks,” (2017), 10.1103/PhysRevE.96.042307.
[55] Michael Fire, Rami Puzis, and Yuval Elovici, “Organization mining using online social networks,” (2013), arXiv:1303.3741v2 [cs.SI].
[56] Neo D. Martinez, “Artifacts or attributes?
effects of resolution on the little rock lake food web,” Ecological Monographs, 367–392 (1991).
[57] Andrew Beveridge and Jie Shan, “Network of thrones,” Math Horizons, 18–22 (2016).
[58] Michael Fire, Lena Tenenboim-Chekina, Rami Puzis, Ofrit Lesser, Lior Rokach, and Yuval Elovici, “Computationally efficient link prediction in a variety of social networks,” ACM Transactions on Intelligent Systems and Technology, 1–25 (2013).
[59] Pablo M. Gleiser and Leon Danon, “Community Structure In Jazz,” Advances in Complex Systems, 565–573 (2003).
[60] Wayne W. Zachary, “An information flow model for conflict and fission in small groups,” Journal of Anthropological Research, 452–473 (1977).
[61] Donald Ervin Knuth, The Stanford GraphBase: a platform for combinatorial computing (ACM Press, New York, 1993).
[62] Daniel B. Larremore, Aaron Clauset, and Caroline O. Buckee, “A network approach to analyzing highly recombinant malaria parasite genes,” PLoS Computational Biology, e1003268 (2013).
[63] James Coleman, Elihu Katz, and Herbert Menzel, “The diffusion of an innovation among physicians,” Sociometry, 253 (1957).
[64] Manlio De Domenico, Andrea Lancichinetti, Alex Arenas, and Martin Rosvall, “Identifying modular flows on multilayer networks reveals highly overlapping organization in social systems,” (2014), 10.1103/PhysRevX.5.011027.
[65] Boris Pasternak and Ivor Ivask, “Four unpublished letters,” Books Abroad, 196 (1970).
[66] Rossana Mastrandrea, Julie Fournet, and Alain Barrat, “Contact patterns in a high school: A comparison between data collected using wearable sensors, contact diaries and friendship surveys,” PLOS ONE, e0136497 (2015).
[67] Valdis Krebs, “Uncloaking terrorist networks,” First Monday (2002), 10.5210/fm.v7i4.941.
[68] Brian Hayes, “Connecting the dots,” American Scientist, 400 (2006).
[69] R. Guimerà, L. Danon, A. Díaz-Guilera, F. Giralt, and A.
Arenas, “Self-similar community structure in a network of human interactions,” Physical Review E (2003), 10.1103/physreve.68.065103.

Appendix A: Latent multigraph SBM

The marginal likelihood of Eq. 1 is in fact obtained for a multigraph model [22], where the adjacency entries can take any natural value, $A_{ij} \in \mathbb{N}$. Although we could in principle ignore this discrepancy, since this kind of model generates simple graphs as a special case, this comes at the expense of a reduced expressiveness of the model [23], since this kind of multigraph model cannot describe the placement of single edges with high probability, or account for the emergent degree-degree correlations that must be present in simple graphs. Instead, here we take the approach proposed in [23, 36], and consider a latent multigraph $A'$, with $A'_{ij} \in \mathbb{N}$, which is then converted into a simple graph $A(A')$ simply by ignoring the edge multiplicities, i.e.

$$A_{ij} = \begin{cases} 1 & \text{if } A'_{ij} > 0, \\ 0 & \text{otherwise.} \end{cases} \tag{A1}$$

The latent multigraph $A'$ is generated according to Eq. 1, which means the simple graph $A$ is generated according to

$$P(A | b) = \sum_{A'} \mathbb{1}_{\{A(A') = A\}} P(A' | b). \tag{A2}$$

Instead of working with this marginal probability directly (which is intractable), we infer the latent edge multiplicities as well, from a joint posterior distribution

$$P(g, A', b | G) = \frac{P(G, g | A(A')) P(A' | b) P(b)}{P(G)}, \tag{A3}$$

where the simple graph $A(A')$ is used for the triadic closure likelihood $P(G, g | A)$. In this way, the inference procedure is the same as the one described in the main text, with the only modification that we need to infer the integer values of $A'$ rather than its binary values.

Appendix B: Expected density of transitivity

As mentioned in the main text, the choice of priors used for Eq.
10 makes the calculation very simple, but it implies that we expect the observed graphs to always have a large fraction of triadic closures. An outcome of this is that the probability of observing a final graph without any triadic closure, i.e. $\sum_{uij} g'_{ij}(u) = 0$, is given by

$$P(g' | A) = \prod_u \, […]$$

1. Iterated triadic closure

For the generalized model with iterated triadic closures, the marginal likelihood is also analogous to Eq. B8,

$$P(g^{(l)} | A^{(l-1)}, g^{(l-1)}) = \prod_u \, […]$$

The MCMC algorithm described in the main text is implemented with the following moves. The first kind is to attempt to move an edge $(i, j)$ in ego graph $g^{(l)}(u)$ at its current generation $l \in [0, L]$ to another ego graph $g^{(l')}(v)$ for $v \neq u$ at generation $l' \neq l$. We do so by selecting first an edge $(i, j)$ in $G$ as well as a generation $l$, both uniformly at random, and an ego node $u$ that is relevant to edge $(i, j)$ at generation $l$, also uniformly at random. The number of ego graphs that are relevant for this edge is given by

$$n^{(l)}_{ij} = \sum_u A^{(l-1)}_{ui} \left(1 - A^{(l-1)}_{uj}\right), \tag{C1}$$

which is independent of the value of $g^{(l')}_{ij}(u)$ for any $l'$. We then sample another generation $l' \neq l$ and proceed in the same way to sample a relevant ego node $v$. In either case, if $l = 0$ is selected, then the choice of an ego graph is not made, since we are selecting simply an entry $(i, j)$ in $A$ with probability one. The final probability of selecting the move $(i, j, u, l) \to (i, j, v, l')$, assuming $l > 0$ and $l' > 0$, is given by

$$P(i, j, u, v, l, l' | \{g^{(l)}\}, A) = \frac{1}{E_G \, n^{(l)}_{ij} \, n^{(l')}_{ij} \, L (L + 1)}, \tag{C2}$$

where $E_G$ is the number of edges in $G$.
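The factors in Eq. C2 can be read off from the sampling steps described above. As a sketch, assuming the edge, the two generations and the two ego nodes are each drawn uniformly and independently from their respective sets, with $l, l' > 0$:

```latex
\begin{align*}
P(i,j,u,v,l,l' \mid \{g^{(l)}\}, A)
  &= \underbrace{\frac{1}{E_G}}_{\text{edge } (i,j) \text{ in } G}
     \;\underbrace{\frac{1}{L+1}}_{\text{generation } l \in [0,L]}
     \;\underbrace{\frac{1}{n^{(l)}_{ij}}}_{\text{ego node } u}
     \;\underbrace{\frac{1}{L}}_{\text{generation } l' \neq l}
     \;\underbrace{\frac{1}{n^{(l')}_{ij}}}_{\text{ego node } v} \\
  &= \frac{1}{E_G \, n^{(l)}_{ij} \, n^{(l')}_{ij} \, L (L+1)}.
\end{align*}
```

The product is symmetric under the exchange of $(u, l)$ with $(v, l')$, so the reverse move is proposed with the same probability as long as the counts $n^{(l)}_{ij}$ are unchanged by the move.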
Given this selection we then make the change $g'^{(l)}_{ij}(u) = g^{(l)}_{ij}(u) - 1$ and $g'^{(l')}_{ij}(v) = g^{(l')}_{ij}(v) + 1$, and accept it with probability

$$\min\left(1, \frac{P(\{g'^{(l)}\}, A', b | G) \, P(i, j, u, v, l, l' | \{g^{(l)}\}, A')}{P(\{g^{(l)}\}, A, b | G) \, P(i, j, v, u, l', l | \{g'^{(l)}\}, A)}\right) = \min\left(1, \frac{P(\{g'^{(l)}\}, A', b | G)}{P(\{g^{(l)}\}, A, b | G)}\right), \tag{C3}$$

which is independent of the actual move probabilities, since they remain the same after and before the move. Note that invalid moves that result in $g^{(l)}_{ij} < 0$ or $A_{ij} < 0$ are always rejected in this way.

In addition, we also make a second kind of move by selecting again an edge $(i, j)$ in $G$ as well as a generation $l$, both uniformly at random, and an ego node $u$ that is relevant to edge $(i, j)$ at generation $l$, with the same probability as before. We then make the move $g'^{(l)}_{ij} = g^{(l)}_{ij} \pm 1$ with probability $1/2$, and accept again according to

$$\min\left(1, \frac{P(\{g'^{(l)}\}, A, b | G)}{P(\{g^{(l)}\}, A, b | G)}\right). \tag{C4}$$

In case $l = 0$ is selected, the move is different, due to the multigraph nature of $A$. We make instead the proposal $A_{ij} \to A'_{ij}$ according to a geometric distribution with mean $A_{ij} + 1$,

$$P(A'_{ij} | A_{ij}) = \left(\frac{A_{ij} + 1}{A_{ij} + 2}\right)^{A'_{ij}} \frac{1}{A_{ij} + 2}. \tag{C5}$$

In this case, the acceptance probability changes to

$$\min\left(1, \frac{P(\{g^{(l)}\}, A', b | G) \, P(A_{ij} | A'_{ij})}{P(\{g^{(l)}\}, A, b | G) \, P(A'_{ij} | A_{ij})}\right). \tag{C6}$$

Finally, the last kind of move involves a change in partition $b \to b'$ from the proposal $P(b' | b)$, which is accepted with probability

$$\min\left(1, \frac{P(A | b') P(b') P(b | b')}{P(A | b) P(b) P(b' | b)}\right).$$
(C7)

For the latter we use the merge-split moves, combined with single-node moves, described in Ref. [26].

The moves above fulfill detailed balance, and when combined, they also preserve ergodicity, since they allow every latent multigraph, decomposition into ego graphs, and node partition to be sampled. Due to this, with sufficiently many iterations the algorithm must eventually produce samples from the desired posterior distribution.

1. Algorithmic complexity

We can break down the time complexity of the above algorithm as follows. At any given time, we keep track of all relevant ego graphs for each edge $(i, j)$ in $G$, those that have edge $(i, j)$ in them, as well as the number of edges $E^{(l)}_u = \sum_{i}$ […]

The predictive posterior distribution considered in the main text is

$$P(C | G) = \sum_{G'} \delta(C - C(G')) \sum_{\theta} P(G' | \theta) P(\theta | G), \tag{D1}$$

where $\theta$ are the parameters of model $P(G | \theta)$. Here we specify more precisely how these parameters are chosen and sampled for the SBM/TC model. The marginal likelihood for the SBM given by Eq. 1 can be written equivalently as [22]

$$P(A | b) = P(A | k, e, b) P(k | e, b) P(e | b), \tag{D2}$$

where the likelihood of the microcanonical DC-SBM is given by

$$P(A | k, e, b) = \prod_{r} \, […]$$

Below are descriptions of the network datasets used in this work. The codenames in parentheses correspond to the respective entries in the Netzschleuder repository [44] where the networks can be downloaded. Some of the descriptions were obtained from the Colorado Index of Complex Networks [45].

Adolescent health (add_health) [31]: A directed network of friendships obtained through a social survey of high school students in 1994. The ADD HEALTH data are constructed from the in-school questionnaire; 90,118 students representing 84 communities took this survey in 1994-95. Some communities had only one school; others had two.
Where there are two schools in a community, students from one school were allowed to name friends in the other, the “sister school”. For this analysis, a symmetrized version of the original directed network has been used, considering only its largest connected component. The particular network named comm26 has been used. This network has N = 551 nodes and E = 2624 edges.

Scientific collaborations in physics (arxiv_collab) [46]: Collaboration graphs for scientists, extracted from the Los Alamos e-Print arXiv (physics), for 1995-1999 for three categories, and additionally for 1995-2003 and 1995-2005 for one category. For copyright reasons, the MEDLINE (biomedical research) and NCSTRL (computer science) collaboration graphs from this paper are not publicly available. For this analysis, only the largest connected component of each network was considered. The particular networks named cond-mat-1999, hep-th-1999 have been used, with number of nodes and edges, (N, E), given by (13861, 44619), (5835, 13815), respectively.

Metabolic network (celegans_metabolic) [47]: List of edges comprising the metabolic network of the nematode C. elegans. This network has N = 453 nodes and E = 4596 edges.

C. elegans neurons (celegansneural) [48, 49]: A network representing the neural connections of the Caenorhabditis elegans nematode. For this analysis, a symmetrized version of the original directed network has been used. This network has N = 297 nodes and E = 2359 edges.

Collins yeast interactome (collins_yeast) [50]: Network of protein-protein interactions in Saccharomyces cerevisiae (budding yeast), measured by co-complex associations identified by high-throughput affinity purification and mass spectrometry (AP/MS). For this analysis, only the largest connected component of the network was considered. This network has N = 1004 nodes and E = 8319 edges.
DNC emails (dnc) [51]: A network representing the exchange of emails among members of the Democratic National Committee, in the email data leak released by WikiLeaks in 2016. For this analysis, only the largest connected component of the network was considered. This network has N = 849 nodes and E = 12038 edges.

Dolphin social network (dolphins) [52]: An undirected social network of frequent associations observed among 62 dolphins (Tursiops) in a community living off Doubtful Sound, New Zealand, from 1994 to 2001. This network has N = 62 nodes and E = 159 edges.

Ego networks in social media (ego_social) [53]: Ego networks associated with a set of accounts of three social media platforms (Facebook, Google+, and Twitter). Datasets include node features (profile metadata), circles, and ego networks, and were crawled from public sources in 2012. For this analysis, only the largest connected component of the network was considered. The particular network named facebook_0 has been used. This network has N = 324 nodes and E = 2514 edges.

Maier Facebook friends (facebook_friends) [54]: A small anonymized Facebook ego network, from April 2014. Nodes are Facebook profiles, and an edge exists if the two profiles are “friends” on Facebook. Metadata gives the social context for the relationship between ego and alter. For this analysis, only the largest connected component of the network was considered. This network has N = 329 nodes and E = 1954 edges.

Within-organization Facebook friendships (facebook_organizations) [55]: Six networks of friendships among users on Facebook who indicated employment at one of the target corporations. Companies range in size from small to large. Only edges between employees at the same company are included in a given snapshot. Node metadata gives the job type listed on the user’s page. The particular networks named S1 and S2 have been used, with numbers of nodes and edges, (N, E), given by (320, 2369) and (165, 726), respectively.
Little Rock Lake food web (foodweb_little_rock) [56]: A food web among the species found in Little Rock Lake in Wisconsin. Nodes are taxa (like species), either autotrophs, herbivores, carnivores or decomposers. Edges represent feeding (nutrient transfer) of one taxon on another. For this analysis, a symmetrized version of the original directed network has been used. This network has N = 183 nodes and E = 2494 edges.

NCAA college football 2000 (football) [33]: A network of American football games between Division IA colleges during the regular season of Fall 2000. This network has N = 115 nodes and E = 613 edges.

Game of Thrones coappearances (game_thrones) [57]: Network of coappearances of characters in the Game of Thrones series, by George R. R. Martin, and in particular coappearances in the book “A Storm of Swords.” Nodes are unique characters, and edges are weighted by the number of times the two characters’ names appeared within 15 words of each other in the text. This network has N = 107 nodes and E = 352 edges.

Google+ (google_plus) [58]: Snapshot of connections among users of Google+, collected in 2012. Nodes are users, and a directed edge (i, j) indicates that user i added user j to i’s circle. For this analysis, a symmetrized version of the original directed network has been used, considering only its largest connected component. This network has N = 201949 nodes and E = 1496936 edges.

Jazz collaboration network (jazz_collab) [59]: The network of collaborations among jazz musicians, and among jazz bands, extracted from The Red Hot Jazz Archive digital database, covering bands that performed between 1912 and 1940. This network has N = 198 nodes and E = 2742 edges.

Zachary Karate Club (karate) [60]: Network of friendships among members of a university karate club. Includes metadata for faction membership after a social partition. Note: there are two versions of this network, one with 77 edges and one with 78, due to an ambiguous typo in the original study.
(The most commonly used is the one with 78 edges.) The particular network named […] has been used. This network has N = 34 nodes and E = 78 edges.

Les Misérables coappearances (lesmis) [61]: The network of scene coappearances of characters in Victor Hugo’s novel “Les Misérables.” Edge weights denote the number of such occurrences. This network has N = 77 nodes and E = 254 edges.

Malaria var DBLa HVR networks (malaria_genes) [62]: Networks of recombinant antigen genes from the human malaria parasite P. falciparum. Each of the 9 networks shares the same set of vertices but has different edges, corresponding to the 9 highly variable regions (HVRs) in the DBLa domain of the var protein. Nodes are var genes, and two genes are connected if they share a substring whose length is statistically significant. Metadata includes two types of node labels, both based on sequence structure around HVR6. For this analysis, only the largest connected component of the network was considered. The particular network named HVR_9 has been used. This network has N = 297 nodes and E = 7562 edges.

Scientific collaborations in network science (netscience) [32]: A coauthorship network among scientists working on network science, from 2006. This network is a one-mode projection from the bipartite graph of authors and their scientific publications. For this analysis, only the largest connected component of the network was considered. This network has N = 379 nodes and E = 914 edges.

Physician trust network (physician_trust) [63]: A network of trust relationships among physicians in four midwestern (USA) cities in 1966. Edge direction indicates that node i trusts or asks for advice from node j. Each of the four components represents the network within a given city. For this analysis, a symmetrized version of the original directed network has been used, considering only its largest connected component. This network has N = 117 nodes and E = 542 edges.
Multilayer physicist collaborations (physics_collab) [64]: Two multiplex networks of coauthorships among the Pierre Auger Collaboration of physicists (2010-2012) and among researchers who have posted preprints on arXiv.org (all papers up to May 2014). Layers represent different categories of publication, and an edge’s weight indicates the number of reports written by the authors. These layers are one-mode projections from the underlying author-paper bipartite network. For this analysis, only the largest connected component of the network was considered. The particular network named pierreAuger has been used. This network has N = 475 nodes and E = 7090 edges.

Political books network (polbooks) [65]: A network of books about U.S. politics published close to the 2004 U.S. presidential election, and sold by Amazon.com. Edges between books represent frequent copurchasing of those books by the same buyers. The network was compiled by V. Krebs and is unpublished. This network has N = 105 nodes and E = 441 edges.

High school temporal contacts (sp_high_school) [66]: These data sets correspond to the contacts and friendship relations between students in a high school in Marseilles, France, in December 2013, as measured through several techniques. For this analysis, symmetrized versions of the original directed networks have been used, considering only their largest connected component. The particular networks named diaries, survey and facebook have been used, with numbers of nodes and edges, (N, E), given by (120, 502), (128, 658) and (156, 1437), respectively.

Student cooperation (student_cooperation) [30]: Network of cooperation among students in the “Computer and Network Security” course at Ben-Gurion University, in 2012. Nodes are students, and edges denote cooperation between students while doing their homework. The graph contains three types of links: Time, Computer, Partners. For this analysis, only the largest connected component of the network was considered.
This network has N = 141 nodes and E = 297 edges.

[…] (terrorists_911) [67]: Network of individuals and their known social associations, centered around the hijackers that carried out the September 11th, 2001 terrorist attacks. Associations extracted after-the-fact from public data. Metadata labels say which plane a person was on, if any, on 9/11. This network has N = 62 nodes and E = 152 edges.

Madrid train bombing terrorists (train_terrorists) [68]: A network of associations among the terrorists involved in the 2004 Madrid train bombing, as reconstructed from press stories after-the-fact. Edge weights encode four levels of connection strength: friendships, ties to Al Qaeda and Osama Bin Laden, co-participants in wars, and co-participants in previous terrorist attacks. This network has N = 64 nodes and E = 243 edges.

Email network (uni_email) [69]: A network representing the exchange of emails among members of the Rovira i Virgili University in Spain, in 2003. For this analysis, a symmetrized version of the original directed network has been used. This network has N = 1133 nodes and E = 10903 edges.

[…] otherwise. (5)

The joint probability of the above process is then given by

$$P(\boldsymbol{G}, \boldsymbol{g}, \boldsymbol{A} \mid \boldsymbol{p}, \boldsymbol{b}) = \mathbb{1}_{\{\boldsymbol{G} = \boldsymbol{G}(\boldsymbol{A}, \boldsymbol{g})\}}\, P(\boldsymbol{A} \mid \boldsymbol{b}) \prod_u P(\boldsymbol{g}(u) \mid \boldsymbol{A}, p_u), \tag{6}$$

where $\mathbb{1}_{\{x\}}$ is the indicator function. Unfortunately, the marginal probability of the final graph

$$P(\boldsymbol{G}) = \sum_{\boldsymbol{g}, \boldsymbol{A}, \boldsymbol{b}} \int P(\boldsymbol{G}, \boldsymbol{g}, \boldsymbol{A} \mid \boldsymbol{p}, \boldsymbol{b})\, P(\boldsymbol{p})\, P(\boldsymbol{b})\, \mathrm{d}\boldsymbol{p}, \tag{7}$$

with $P(\boldsymbol{p})$ and $P(\boldsymbol{b})$ being prior probabilities, does not lend itself to a tractable computation. Luckily, however, this will not be needed for our inference procedure. Instead, we are interested in the posterior distribution

$$P(\boldsymbol{g}, \boldsymbol{A}, \boldsymbol{b} \mid \boldsymbol{G}) = \frac{P(\boldsymbol{G}, \boldsymbol{g}, \boldsymbol{A} \mid \boldsymbol{b})\, P(\boldsymbol{b})}{P(\boldsymbol{G})}, \tag{8}$$

which describes the probability of a decomposition of an observed simple graph $\boldsymbol{G}$ into its seminal graph $\boldsymbol{A}$, the underlying community structure $\boldsymbol{b}$, and the triadic closures represented by the ego graphs $\boldsymbol{g}$.
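The posterior above is known only up to the constant P(G), and this is enough for Markov chain Monte Carlo, since acceptance probabilities depend only on ratios of the joint probability. The following is a minimal Python sketch of this point and not the sampler used in this work: it assumes a toy state space of three hypothetical decompositions with made-up unnormalized log-joint values, and it uses a plain Metropolis rule with a uniform (symmetric) proposal in place of the specialized moves required by the actual model. The names `acceptance_probability`, `sample` and `log_joint` are illustrative only.

```python
import math
import random


def acceptance_probability(log_joint_new, log_joint_old):
    """Metropolis acceptance probability min(1, ratio) for a symmetric proposal.

    Only the difference of log-joint values enters, so any additive constant
    such as -log P(G) cancels.
    """
    return math.exp(min(0.0, log_joint_new - log_joint_old))


def sample(log_joint, n_steps, seed=0):
    """Run a Metropolis chain over states 0..len(log_joint)-1; return visit frequencies."""
    rng = random.Random(seed)
    n_states = len(log_joint)
    state = 0
    counts = [0] * n_states
    for _ in range(n_steps):
        proposal = rng.randrange(n_states)  # uniform proposal, hence symmetric
        if rng.random() < acceptance_probability(log_joint[proposal], log_joint[state]):
            state = proposal
        counts[state] += 1
    return [c / n_steps for c in counts]


# Hypothetical unnormalized log-joint values for three candidate decompositions.
log_joint = [math.log(1.0), math.log(2.0), math.log(3.0)]
# The same values shifted by an arbitrary constant, standing in for -log P(G).
shifted = [x - 7.5 for x in log_joint]

freqs = sample(log_joint, 50_000)
freqs_shifted = sample(shifted, 50_000)
print(freqs, freqs_shifted)
```

Shifting every log-joint value by the same constant, which is what dividing by P(G) amounts to, leaves the target of the chain unchanged: in both runs the visit frequencies should approach the normalized weights 1/6, 2/6 and 3/6.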
(Although the marginal distribution $P(\boldsymbol{G})$ appears in the denominator of the above equation, we will see later on that it is just a normalization constant that does not in fact need to be computed.) The marginal likelihood

$$P(\boldsymbol{G}, \boldsymbol{g}, \boldsymbol{A} \mid \boldsymbol{b}) = P(\boldsymbol{G} \mid \boldsymbol{A}, \boldsymbol{g})\, P(\boldsymbol{g} \mid \boldsymbol{A})\, P(\boldsymbol{A} \mid \boldsymbol{b}) \tag{9}$$

can be computed easily via

$$P(\boldsymbol{g} \mid \boldsymbol{A}) = \prod_u \int P(\boldsymbol{g}(u) \mid \boldsymbol{A}, p)\, P(p)\, \mathrm{d}p = \prod_u$$ […]