Weighted Community Detection and Data Clustering Using Message Passing
Cheng Shi, Yanchen Liu, and Pan Zhang∗

CompleX Lab, Web Sciences Center, University of Electronic Science and Technology of China, Chengdu 611731, People's Republic of China
Network Science Institute, Northeastern University, 177 Huntington Avenue, Boston, MA 02115, United States of America
CAS Key Laboratory of Theoretical Physics, Institute of Theoretical Physics, Chinese Academy of Sciences, Beijing 100190, China

(Dated: January 31, 2018)

Grouping objects into clusters based on similarities or weights between them is one of the most important problems in science and engineering. In this work, by extending message passing algorithms and spectral algorithms proposed for the unweighted community detection problem, we develop a non-parametric method based on statistical physics: we map the problem to the Potts model at the critical temperature of the spin glass transition and apply belief propagation to solve for the marginals corresponding to the Boltzmann distribution. Our algorithm is robust to over-fitting and gives a principled way to determine whether there are significant clusters in the data and how many clusters there are. We apply our method to different clustering tasks. For the community detection problem in weighted and directed networks, we show that our algorithm significantly outperforms existing algorithms. For the clustering problem where the data is generated by mixture models in the sparse regime, we show that our method works all the way down to the theoretical limit of detectability and gives accuracy very close to that of optimal Bayesian inference. For the semi-supervised clustering problem, our method needs only several labels to work perfectly on classic datasets. Finally, we further develop Thouless-Anderson-Palmer equations which heavily reduce the computational complexity in dense networks while giving almost the same performance as belief propagation.

∗ To whom correspondence should be addressed. Email: [email protected]

I. INTRODUCTION
Clustering is one of the core problems in unsupervised learning and plays an important role in science and engineering [1]. It aims to group nodes into clusters in such a way that similar items are put into the same group, while different items are put into different groups. A common approach is to encode the similarity relationships into a similarity graph, with nodes representing items and weighted edges carrying the similarities. The problem is thus closely related to clustering in the similarity graph, i.e. community detection in weighted networks.

With a long history, many different clustering methods have been proposed [2–4]. These include algorithms that optimize an objective function (a famous example is the K-means algorithm [5], which optimizes the distances between data points and their putative centers using a scalable expectation-maximization method); generative modeling [6], which assumes that the data points are generated by an underlying distribution and then turns the clustering problem into a statistical inference problem; and spectral clustering methods [7–10], which use spectral properties of a linear operator associated with the data.

Despite the large variety of methods, so far no single algorithm works perfectly in all applications. The K-means algorithm is fast and guaranteed to converge, but it does not work if the clusters are non-spherical. The expectation-maximization algorithm used by K-means typically has many possible fixed points, making its performance depend on the initial condition. Methods based on optimizing a goodness-of-fit objective are prone to overfitting, finding clusters even when there is no significant structure [11]. Generative modeling is widely used in unsupervised learning tasks; however, it is not easy to choose a model that has good expressive capability while remaining easy to perform statistical inference on, and one has to be very careful about over-fitting when the model has many parameters.

Moreover, a typical drawback of most clustering algorithms is that they do not provide a principled way to choose the number of clusters, as the objective function is usually a monotone function of the number of groups. The reason is that for clustering algorithms which optimize an objective function, the objective function can usually be converted to a log-likelihood (or negative log-likelihood) of a generative model, so optimizing the objective function amounts to maximum-likelihood inference of a generative model; we refer to [11, 12] for examples. The likelihood function describes how well the data is fit by the model. It is well known that, without proper regularization or a model selection method such as AIC or BIC, the maximum likelihood is usually an increasing function of the number of parameters: the more complex the model, the better it can fit the data. Increasing the number of groups increases the number of parameters of the model, and hence increases the maximum (or minimum) possible value of the objective function. Thus one needs to specify the number of clusters based on hypothesis testing or adopt an alternative model selection procedure.

Another challenge for standard methods is data sparsity: with $n$ items we typically do not have access to all $n(n-1)/2$ pairwise similarities, but only to a sparse subset of them.

II. METHOD
We consider clustering of $n$ nodes into $q$ groups. An item $i$ takes a discrete group label $t_i \in \{1, \ldots, q\}$, interacting with another item $j$ through the pairwise similarity measurement $\omega_{ij}$. In this work we consider the case where some pairwise similarities are unknown, so the similarity graph is sparse; we note that our method obviously works as well when all $n(n-1)/2$ similarities are present. We can therefore treat the system as a sparse Potts model, with each group assignment $\{t\}$ being a Potts configuration. We treat the objective function $Q(\{t\})$ that we want to optimize as the negative Hamiltonian, $-E(\{t\}) = Q(\{t\})$, of the Potts model, and assign to partitions a Boltzmann distribution at a finite inverse temperature $\beta$:

$$P(\{t\}) = \frac{1}{Z} e^{\beta Q(\{t\})}, \qquad (1)$$

with the partition function written as $Z = \sum_{\{s\}} e^{\beta Q(\{s\})}$. The function of the introduced temperature is to tune the energy level so that significant clustering can emerge, as we will demonstrate in Sec. II B. It is easy to see that with $\beta = 0$, i.e. at infinite temperature, every partition has the same probability $q^{-n}$, whereas at zero temperature, $\beta \to \infty$, only partitions that optimize $Q(\{t\})$ have finite measure.

Here we treat similarities as random variables. Notice that for real-world data, where we usually observe similarities between pairs of variables without access to the real process that generated them, random variables are the most unbiased assumption. Of course, if we have prior knowledge of the similarities, we are able to use it. In fact, our method does not strictly require the similarities to be random variables: for example, on the stochastic block model with uniform similarities on a (structured) random graph, our algorithm reduces to known algorithms which work as well as optimal algorithms.

Once the set of similarities $\{\omega_{ij}\}$ between pairs of nodes is given as weights, we use the internal weight as the objective function measuring the quality of a partition:

$$Q(\{t\}) = \frac{1}{m}\left( \sum_{\langle ij\rangle \in E} \omega_{ij}\, \delta_{t_i t_j} - \sum_{\langle ij\rangle} \bar\omega\, \delta_{t_i t_j} \right), \qquad (2)$$

where $E$ represents the set of all edges in the similarity graph, $m$ is the number of edges, the second sum runs over all pairs of nodes, and $\bar\omega = 2\sum_{\langle ij\rangle\in E}\omega_{ij}/n^2$ is the mean weight per pair. One can check that this definition guarantees that for a random partition $\{\tilde t\}$ the expected internal weight vanishes,

$$m\, \mathbb{E}\big(Q(\{\tilde t\})\big) = \mathbb{E}\sum_{\langle ij\rangle\in E}\omega_{ij}\,\delta_{\tilde t_i \tilde t_j} - \mathbb{E}\sum_{\langle ij\rangle}\bar\omega\,\delta_{\tilde t_i \tilde t_j} = 0.$$

Therefore, a larger value of $Q(\{t\})$ represents a larger deviation of $\{t\}$ from a random partition.
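To make the objective concrete, the following minimal sketch evaluates Eq. (2) for a given partition. It is an illustration only, not the authors' C++ implementation; the function and variable names are our own.

```python
import numpy as np

def internal_weight(edges, weights, labels, n):
    """Internal weight Q({t}) of Eq. (2): in-group similarity minus its
    expectation under the mean weight per pair, normalized by m."""
    m = len(edges)
    w_bar = 2.0 * np.sum(weights) / n**2           # mean weight per pair
    same = np.array([labels[i] == labels[j] for (i, j) in edges])
    q_in = np.sum(weights[same])                   # first sum: observed edges only
    # the second sum in Eq. (2) runs over all pairs; count in-group
    # pairs directly from the group sizes instead of looping over pairs
    sizes = np.bincount(labels)
    pairs_in = np.sum(sizes * (sizes - 1)) / 2
    return (q_in - w_bar * pairs_in) / m
```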
A. Belief propagation algorithm

The marginals of the Boltzmann distribution Eq. (1) are in general difficult to compute exactly, so we aim to compute the marginals $\{\psi^i_{t_i}\}$ approximately, then assign to each node its most likely group according to its marginal. As we consider the case where the similarity graph is sparse, the best approximation we can take is the Bethe approximation, which approximates the Boltzmann distribution Eq. (1) as

$$P(\{t\}) \approx \hat P(\{t\}) = \frac{\prod_{\langle ij\rangle\in E} \psi^{ij}_{t_i t_j}}{\prod_i \big(\psi^i_{t_i}\big)^{d_i - 1}}. \qquad (3)$$

Here $\psi^{ij}_{t_i t_j}$ denotes the two-point marginal of variables $i$ and $j$, and $d_i$ is the degree (number of neighbors) of node $i$ in the similarity graph. The Bethe approximation is exact when the similarity graph is a tree, and a good approximation when the similarity graph is sparse. Based on Eq. (3), one can write the Bethe free energy as a good approximation to the true free energy $-\frac{1}{\beta}\log Z$, and derive a message passing algorithm, so-called belief propagation (BP), that minimizes the Bethe free energy [20, 21]. Using the conditional marginals (or messages) $\psi^{i\to j}_{t_i}$, the BP equation is written as

$$\psi^{i\to k}_{t_i} = \frac{1}{Z^{i\to k}} \prod_{j\in\partial i\setminus k} \sum_{t_j=1}^{q} e^{\beta\omega_{ij}\delta_{t_i t_j}}\, \psi^{j\to i}_{t_j} \prod_{l\neq i} \sum_{t_l=1}^{q} e^{-\beta\bar\omega\,\delta_{t_i t_l}}\, \psi^{l\to i}_{t_l}, \qquad (4)$$

where $\partial i \setminus k$ denotes the neighbors of node $i$ except node $k$, and $Z^{i\to k}$ is a normalization constant. In the language of the cavity method [20], the message $\psi^{i\to j}_{t_i}$ is the marginal probability that node $i$ belongs to group $t_i$ in the absence of node $j$. The first term in Eq. (4) represents the interactions between node $i$ and its neighbors, while the second term represents the weak interactions between node $i$ and all other nodes. Since there is an interaction between every pair of nodes, the computational complexity of one update of all messages is $O(n^2)$. In the case of sparse similarity graphs, on which we mainly focus in this work, we can greatly reduce the computational complexity by applying a mean-field approximation to the overall weak interactions between every pair of nodes (detailed derivations can be found in Appendix A):

$$\psi^{i\to k}_{t_i} \approx \frac{e^{h(t_i)}}{Z^{i\to k}} \prod_{j\in\partial i\setminus k} \Big(1 + \psi^{j\to i}_{t_i}\big(e^{\beta\omega_{ij}} - 1\big)\Big), \qquad (5)$$

where $h(t_i) = -\beta\bar\omega\sum_l \psi^l_{t_i}$ is the field representing the effect of all other nodes on node $i$. With this approximation the computational complexity is reduced to $O(m)$, i.e. linear in the number of edges. After iterating the messages until they converge or exceed a specified maximum number of iterations, we compute the marginal of each node using

$$\psi^i_{t_i} = \frac{e^{h(t_i)}}{Z^i} \prod_{j\in\partial i} \Big(1 + \psi^{j\to i}_{t_i}\big(e^{\beta\omega_{ij}} - 1\big)\Big), \qquad (6)$$

and obtain the partition $\{\hat t\}$ by assigning each node to the group for which its marginal is largest. We call the internal weight of the obtained partition, $Q(\{\hat t\})$, the retrieval weight. From Eq. (5) we observe that the BP equation always has the fixed point

$$\psi^{i\to j}_{t_i} = \psi^i_{t_i} = \frac{1}{q}, \qquad (7)$$

which we call the factorized or paramagnetic fixed point. This fixed point comes from the permutation symmetry of the system and only carries the information that each node is equally likely to be in every group. One can show that the paramagnetic fixed point is the only fixed point when the temperature is high, i.e. $\beta$ is close to 0. At large $\beta$ the paramagnetic fixed point can become unstable; then there may exist another fixed point that dominates the Gibbs measure, or too many fixed points between which BP jumps back and forth, hence failing to converge. A minimal sketch of one sweep of Eqs. (5)–(6) is given below.
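The sketch below is our own illustration of the simplified update, assuming messages are stored per directed edge; the data layout and names are not those of the reference implementation.

```python
import numpy as np

def bp_sweep(neighbors, w, msgs, marginals, beta, w_bar):
    """One sweep of Eqs. (5)-(6). msgs[(i, k)] is the cavity vector
    psi^{i->k} (length q); marginals[i] is psi^i; w[(i, j)] = w[(j, i)]
    holds the similarity omega_ij; neighbors[i] lists neighbors of i."""
    # field h(t) = -beta * w_bar * sum_l psi^l_t, shared by all nodes
    h = -beta * w_bar * sum(marginals.values())
    new_msgs = {}
    for (i, k) in msgs:
        log_m = h.copy()
        for j in neighbors[i]:
            if j != k:   # product over the neighborhood of i excluding k
                log_m += np.log1p(msgs[(j, i)] * (np.exp(beta * w[(i, j)]) - 1.0))
        p = np.exp(log_m - log_m.max())
        new_msgs[(i, k)] = p / p.sum()
    for i in neighbors:  # full marginals, Eq. (6): product over all neighbors
        log_m = h.copy()
        for j in neighbors[i]:
            log_m += np.log1p(new_msgs[(j, i)] * (np.exp(beta * w[(i, j)]) - 1.0))
        p = np.exp(log_m - log_m.max())
        marginals[i] = p / p.sum()
    return new_msgs, marginals
```

Working in log space and subtracting the maximum before exponentiating avoids numerical overflow for large $\beta$ or large weights.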
Depending on the convergence properties of BP and the value of the retrieval weight $Q(\{\hat t\})$, we can divide the phase diagram into three phases:

• Paramagnetic phase, where BP converges to the paramagnetic solution; every partition has the same weight, hence by definition the retrieval weight $Q(\{\hat t\}) = 0$.

• Non-convergence phase, where BP does not converge. If the underlying graph is sparse, the non-convergence of BP reflects that the spin-glass susceptibility diverges and replica symmetry is broken [22]. In this phase $Q(\{\hat t\})$ fluctuates and depends on the initial conditions.

• Retrieval phase, where BP converges to a non-factorized fixed point with a non-vanishing retrieval weight $Q(\{\hat t\})$.

These phases represent different structures in the data: the paramagnetic phase reflects the permutation symmetry of the system, the spin glass phase represents the random structure of noise, and the retrieval phase represents a significant clustering structure in the data. The typical phase diagram is that at low $\beta$ the system is in the paramagnetic phase. As $\beta$ increases, the paramagnetic fixed point becomes unstable and the system enters either the retrieval phase or the spin glass phase, depending on whether a statistically significant clustering structure exists and how strong it is relative to the random structure. It is known that in some models with a clustering structure, e.g. the stochastic block model [6], if the planted partition exists but is not strong enough, the model is not distinguishable from a random graph by any test; in other words, it is contiguous to random graphs [23].

B. Existence of the statistically significant clustering structure
The goal of our algorithm is to find the statistically significant clustering structure, which is represented by the retrieval phase. The first step is to determine whether a retrieval phase exists. Naively, we could scan the whole range of $\beta$, trying to find a retrieval phase where BP converges and $Q(\{\hat t\}) > 0$; this is actually fast because one can use a binary search. However, we can be smarter by making use of the fact that when the paramagnetic fixed point becomes unstable, the system jumps to either the retrieval phase or the spin glass phase. That is, we only need to check the inverse temperature at which the system would enter the spin glass phase from the paramagnetic phase, the spin-glass transition $\beta^*$: if BP converges at $\beta^*$ and $Q(\{\hat t\}) > 0$, we conclude that the system has a retrieval phase; otherwise it has no retrieval phase.

The spin-glass transition can be calculated analytically by analyzing the stability of the paramagnetic fixed point under random perturbations. Suppose we apply a perturbation with zero mean and unit variance to the paramagnetic fixed point,

$$\psi^{k\to i}_{t_k} = \frac{1}{q} + \epsilon^{k\to i}_{t_k}. \qquad (8)$$

After one iteration of the BP equation Eq. (5), the perturbation propagates to the neighbors of each node as

$$\epsilon^{i\to j}_{t_i} = \sum_{k\in\partial i\setminus j}\sum_{t_k} T^{i\to j,\,k\to i}_{t_i,\,t_k}\, \epsilon^{k\to i}_{t_k}, \qquad (9)$$

where the $q\times q$ matrix $T$ encodes the derivatives of the messages at the paramagnetic fixed point,

$$T^{i\to j,\,k\to i}_{t_i,\,t_k} = \left.\frac{\partial \psi^{i\to j}_{t_i}}{\partial \psi^{k\to i}_{t_k}}\right|_{\psi = \frac{1}{q}}. \qquad (10)$$

The derivatives can be calculated using Eq. (5); writing $\eta_{ij} = \frac{e^{\beta\omega_{ij}}-1}{e^{\beta\omega_{ij}}+q-1}$, the matrix takes the form

$$T = \eta_{ij}\left(I_{q\times q} - \frac{1}{q}J_{q\times q}\right) = \begin{pmatrix} \eta_{ij}\frac{q-1}{q} & -\frac{\eta_{ij}}{q} & \cdots & -\frac{\eta_{ij}}{q} \\ -\frac{\eta_{ij}}{q} & \eta_{ij}\frac{q-1}{q} & \cdots & -\frac{\eta_{ij}}{q} \\ \vdots & \vdots & \ddots & \vdots \\ -\frac{\eta_{ij}}{q} & -\frac{\eta_{ij}}{q} & \cdots & \eta_{ij}\frac{q-1}{q} \end{pmatrix}, \qquad (11)$$

with the largest eigenvalue being

$$\eta_{ij} = \frac{e^{\beta\omega_{ij}} - 1}{e^{\beta\omega_{ij}} + q - 1}. \qquad (12)$$

Since we consider a sparse similarity graph, the graph is assumed to be locally tree-like. Then after $\tau$ iterations of the BP equation Eq. (5), the perturbation propagates to (on average) $\hat c^{\,\tau}$ neighbors through $\hat c^{\,\tau}$ distinct paths. For each path, the perturbation arriving at the end is $\epsilon_i\, \eta_1\eta_2\cdots\eta_\tau$. The sum of the perturbations at the ends of all paths is a random variable with mean 0 and variance

$$V_\tau = \big\langle \eta_{ij}^2 \big\rangle_{\omega_{ij}}^{\tau}\, \hat c^{\,\tau}. \qquad (13)$$

Here $\langle\cdot\rangle_{\omega_{ij}}$ denotes averaging over $\omega_{ij}$, and $\hat c$ denotes the average excess degree given the degree distribution $p(d)$:

$$\hat c = \sum_{d=1}^{\infty} \frac{d\, p(d)}{c}\,(d-1) = \frac{\langle d^2\rangle}{c} - 1. \qquad (14)$$

If $V_\tau > 1$, random noise will propagate through the system and the paramagnetic fixed point is not stable. Thus the spin-glass transition temperature can be obtained by solving

$$\left\langle \left(\frac{e^{\beta^*\omega_{ij}} - 1}{e^{\beta^*\omega_{ij}} + q - 1}\right)^{2} \right\rangle_{\omega_{ij}} \hat c = 1. \qquad (15)$$
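Eq. (15) can be solved numerically for $\beta^*$, e.g. by bisection on the empirical weight sample. The sketch below is ours and assumes the stability function crosses 1 exactly once in the search interval.

```python
import numpy as np

def spin_glass_beta(weights, c_hat, q, lo=1e-4, hi=20.0, iters=60):
    """Solve Eq. (15) for beta*: c_hat * <eta_ij^2> = 1, where
    eta_ij = (e^{beta w} - 1) / (e^{beta w} + q - 1).
    `weights` is the empirical sample of edge weights omega_ij;
    assumes a transition exists between `lo` and `hi`."""
    def stability(beta):
        e = np.exp(beta * weights)
        eta = (e - 1.0) / (e + q - 1.0)
        return c_hat * np.mean(eta**2) - 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if stability(mid) < 0.0:
            lo = mid      # paramagnetic point still stable: increase beta
        else:
            hi = mid
    return 0.5 * (lo + hi)
```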
C. The non-backtracking operator

As investigated in [18, 24], when a system has permutation symmetry, linearizing the BP equation at the paramagnetic fixed point results in an efficient spectral algorithm using the so-called non-backtracking operator. Again we analyze the stability of the factorized solution by linearizing the cavity marginals:

$$\psi^{i\to j}_{t_i} = \frac{1}{q} + \Delta^{i\to j}_{t_i}. \qquad (16)$$

Notice that the difference between the last equation and Eq. (8) is that $\Delta^{i\to j}_{t_i}$ does not represent zero-mean noise, but a perturbation in the direction of the clustering structure. The iteration of linearized belief propagation can be written as

$$\Delta^{i\to j}_{t_i} := \sum_{k\in\partial i\setminus j}\sum_{t_k} T^{i\to j,\,k\to i}_{t_i,\,t_k}\, \Delta^{k\to i}_{t_k}, \qquad (17)$$

where $:=$ denotes equality up to a constant. The matrix $T$ is given by Eq. (11) and can be rewritten as

$$T = B \otimes I_{q\times q} - \frac{1}{q}\, B \otimes J_{q\times q}, \qquad (18)$$

where $\otimes$ denotes the tensor product, and $I_{q\times q}$ and $J_{q\times q}$ are the $q\times q$ identity matrix and all-ones matrix respectively. Here we have introduced the $2m\times 2m$ non-backtracking matrix $B$,

$$B^{i\to j,\,k\to l} = \delta_{il}\,(1-\delta_{kj})\, \frac{e^{\beta\omega_{ij}} - 1}{e^{\beta\omega_{ij}} + q - 1}. \qquad (19)$$

Therefore, if we represent the vector of deviations (whose length is $2m\times q$) by $\Delta = \{\Delta^{i\to j}_t \mid t\in\{1,2,\ldots,q\},\, (i,j)\in E\}$, its update equation is written as

$$\Delta := (B\otimes I_{q\times q})\,\Delta - \frac{1}{q}(B\otimes J_{q\times q})\,\Delta. \qquad (20)$$

Making use of the normalization condition $\sum_{t}\psi^{i\to j}_{t} = 1$ and Eq. (16), we have $\sum_{t_i}\Delta^{i\to j}_{t_i} = 0$. The update equation then simplifies to

$$\Delta_t := B\,\Delta_t. \qquad (21)$$

Here $\Delta_t = \{\Delta^{i\to j}_t \mid (i,j)\in E\}$ is the vector of deviations for group $t$, which has size $2m$.

From the last equation we see that linearized BP becomes an eigenvector problem of the non-backtracking matrix $B$. Since $B$ is asymmetric, its eigenvalues and eigenvectors can be complex. Analogous to the non-backtracking operator of a graph [18], on the complex plane its spectrum is mainly composed of two parts: a circle inside which most of the eigenvalues are confined, and possibly real eigenvalues outside the circle. The radius of the circle, the so-called edge of the bulk, together with the leading eigenvalue $\lambda_1$, completely describes the stability of the paramagnetic fixed point of BP as well as the phase diagram of the system:

• If the absolute value of the leading eigenvalue satisfies $|\lambda_1| < 1$, the system is in the paramagnetic state.

• If $\lambda_1$ is complex with $|\lambda_1| > 1$, the system is in the spin-glass state.

• If $\lambda_1$ is real and greater than 1, the system is in the retrieval state. The associated real eigenvectors can be used to detect the clustering structure.

Using the technique of [18] we can compute the edge of the bulk as follows. For any matrix $B$, the eigenvalues satisfy the inequality

$$\sum_{i=1}^{2m} |\lambda_i|^{2r} \leq \mathrm{Tr}\, B^r (B^T)^r, \qquad (22)$$

where $\{\lambda_i\}$ are the $2m$ eigenvalues of $B$. Note that $[B^r(B^T)^r]_{i\to j,\,i\to j}$ goes from node $i$ through an $r$-length path to node $j$ and then comes back to $i$ through the same path,

$$[B^r (B^T)^r]_{i\to j,\,i\to j} = \sum_{P}\prod_{(x\to y)\in P} \eta_{xy}^2, \qquad (23)$$

where $P$ denotes one of these paths, and $(x\to y)\in P$ runs over all edges in the path $P$. Under the assumptions of sparsity and weak correlations between edges, we can write the right hand side of Eq. (23) as

$$\sum_{P}\prod_{(x\to y)\in P}\eta_{xy}^2 = \sum_P \big\langle \eta_{xy}^2\big\rangle_{\omega_{ij}}^{r}. \qquad (24)$$

Because there are $\hat c^{\,r}$ such paths, taking the expectation of Eq. (22) results in

$$\mathbb{E}\,|\lambda_i|^{2r} \leq \hat c^{\,r}\, \big\langle \eta_{ij}^2\big\rangle_{\omega_{ij}}^{r}. \qquad (25)$$

Therefore, the edge of the bulk of the eigenvalues of $B$ is

$$\kappa = \sqrt{\hat c\, \big\langle \eta_{ij}^2\big\rangle_{\omega_{ij}}}. \qquad (26)$$

Since the eigenvalues of the non-backtracking matrix are closely related to the stability of the paramagnetic phase of belief propagation, they characterize the phases of the system. In more detail, when the leading eigenvalue is real, the point at which it becomes larger than 1 indicates the transition from the paramagnetic to the retrieval phase. If the leading eigenvalue is complex, there is no stable BP fixed point, so the point where its absolute value becomes larger than 1 indicates the paramagnetic to spin glass transition. The transition point where the edge of the bulk equals 1 is identical to the condition for the spin-glass transition, Eq. (15). This relation connects the edge of the bulk in the spectrum of the non-backtracking operator to the spin glass transition of the Potts system, and gives another explanation of why it is sufficient to check only $\beta^*$ when looking for the retrieval state: if the system has a retrieval phase, it is easily identifiable at $\beta^*$, as the real eigenvalue representing it has to be larger than 1.
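For illustration, the sketch below builds the operator of Eq. (19) as a dense matrix and returns its leading eigenvalues; a practical implementation would use sparse matrices and an iterative eigensolver. The helper names are ours.

```python
import numpy as np

def nb_spectrum(edges, w, beta, q, k=6):
    """Build the 2m x 2m non-backtracking matrix of Eq. (19) and return
    its k eigenvalues of largest modulus; real eigenvalues outside the
    bulk radius kappa of Eq. (26) signal the retrieval phase."""
    directed = [(i, j) for (i, j) in edges] + [(j, i) for (i, j) in edges]
    idx = {e: a for a, e in enumerate(directed)}
    B = np.zeros((len(directed), len(directed)))
    for (i, j) in directed:
        e = np.exp(beta * w[(i, j)])
        eta = (e - 1.0) / (e + q - 1.0)
        for (kk, l) in directed:
            # entry nonzero iff edge kk->l feeds into i->j without backtracking
            if l == i and kk != j:
                B[idx[(i, j)], idx[(kk, l)]] = eta
    lam = np.linalg.eigvals(B)
    return lam[np.argsort(-np.abs(lam))][:k]
```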
D. Determine the number of groups

Choosing the number of groups is a classic model selection problem. A common approach is to set $q$ by optimizing the objective function; however, this is known to be prone to overfitting. For example, the objective function is always optimizable even on a random graph and is an increasing function of $q$, whereas the correct group number for random graphs is $q = 1$. Other approaches use a penalty term such as entropy or description length [25] to regularize the model. However, entropy penalization, which effectively chooses $q$ by minimizing the free energy, usually requires that the related generative model is very close to the true model that generated the data [26]. Penalty terms adopting the minimum description length are good at controlling the complexity of the model, but usually over-constrain it and predict a smaller $q$ [25].

Here we propose to choose the number of groups, $q^*$, as the value at which the retrieval weight $Q(\{\hat t\})$ achieves its maximum. For $q \leq q^*$, the retrieval weight increases with $q$, while for $q > q^*$ either the retrieval phase disappears or $Q(\{\hat t\})$ stays the same, indicating that increasing the number of groups no longer helps to obtain a better objective function. A minimal sketch of this selection rule is given below.
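The following sketch expresses the rule as a simple scan; `run_bp_at_beta_star` is a hypothetical helper, assumed to run BP at the $\beta^*$ of Eq. (15) for a given $q$ and return its convergence status and retrieval weight.

```python
def choose_q(run_bp_at_beta_star, q_max=10, tol=1e-8):
    """Pick the q that maximizes the retrieval weight Q({t}) found by BP
    at the spin-glass temperature; returns 1 if no retrieval phase exists."""
    best_q, best_Q = 1, 0.0
    for q in range(2, q_max + 1):
        converged, Q = run_bp_at_beta_star(q)
        if converged and Q > best_Q + tol:   # strictly better objective
            best_q, best_Q = q, Q
    return best_q
```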
III. APPLICATIONS

In this section we apply our method to several different clustering problems: the mixture model in the sparse regime, where exact statistical inference results exist; clustering of two-dimensional geometric datasets, where the similarity function is hard to define; and community detection in weighted and directed graphs.
A. Mixture model in sparse regime
We use the model proposed in [27], which can be seen as a generalization of the labeled stochastic block model [28]. In the model, data is constructed with $n$ items in $q$ preassigned clusters of equal size. Each item has a group label $t^*_i$; thus $\{t^*_i\}$ is the ground truth, or planted partition, the partition we want to recover. Pairwise measurements are chosen uniformly at random on an Erdős–Rényi random graph with average degree $c$. The pairwise similarity $\omega_{ij}$ between items $i$ and $j$ is a random variable sampled from a probability distribution $P_{ij}(\omega_{ij})$ depending on $t^*_i$ and $t^*_j$. Following [27], we use the simplest setting,

$$P_{ij}(\omega_{ij}) = \begin{cases} P_{\mathrm{in}}(\omega_{ij}), & \text{if } t^*_i = t^*_j, \\ P_{\mathrm{out}}(\omega_{ij}), & \text{otherwise,}\end{cases} \qquad (27)$$

and the two probability distributions are chosen as Gaussians with different means, as plotted in the insets of Fig. 1. It was conjectured in [27] that there is a detectability threshold on the average number of measurements per item,

$$c^* = q\left(\int d\omega\, \frac{\big(P_{\mathrm{in}}(\omega) - P_{\mathrm{out}}(\omega)\big)^2}{P_{\mathrm{in}}(\omega) + (q-1)\,P_{\mathrm{out}}(\omega)}\right)^{-1}. \qquad (28)$$

Below the threshold no algorithm can detect the planted partition with success better than random guessing, while for $c > c^*$ the authors showed that the Bayes-optimal estimate, approximated by a belief propagation algorithm using the correct parameters of the model, achieves this threshold and provides asymptotically optimal accuracy.

On this model, we first examine whether our method can find the retrieval state in both the detectable and undetectable phases. Figure 1 shows the retrieval weight $Q(\{\hat t\})$ and the convergence time of belief propagation as functions of $\beta$ on two instances (one in the undetectable phase, the other in the detectable phase) generated by the model of [27]. In the top panel, the data is in the undetectable phase, meaning no algorithm can recover information about the planted partition better than random guessing. We can see from the figure that our algorithm finds a phase diagram composed of the paramagnetic phase and the non-convergence (NC) phase (which is actually the spin glass phase in this random sparse graph), separated by the transition $\beta^* \approx 0.7$, in very good agreement with the predicted spin-glass transition Eq. (15). So our algorithm decides that there is no retrieval phase in the whole range of $\beta$, and hence reports that there is no significant clustering structure in the data. In the bottom panel of Fig. 1 the data is in the detectable phase, and we see that our algorithm indeed finds the retrieval phase in between the paramagnetic phase and the non-convergence phase, which gives information about the planted partition. Fig. 1 also shows that $\beta^*$ indeed lies inside the retrieval phase, hence our algorithm running at $\beta^*$ will find the retrieval phase as well as correct information about the planted partition. This confirms the theory of Sec. II B that in practice we only need to run our algorithm once, at the $\beta^*$ of Eq. (15), rather than scanning the full range of $\beta$.

FIG. 1. Retrieval weights (blue ×, left y axis) and BP convergence time (red +, right y axis) on datasets generated by the Gaussian mixture model [27] in the undetectable regime (a) and in the detectable regime (b). Both datasets have n = 10000 items and average number of measurements per item c = 4. The Gaussian weight distributions p_in and p_out have unit variance, with zero mean for (a) (the two Gaussian distributions coincide, as illustrated in the inset) and distinct means for (b). In panel (b), two dashed lines denote the experimental transitions: paramagnetic to retrieval at β_R ≈ 0.15 and retrieval to spin glass at β_SG.

To evaluate the accuracy of our algorithm in detecting the planted partition, we define the overlap between the inferred partition $\{\hat t\}$ and the true partition $\{t^*\}$,

$$O(\hat t, t^*) = \max_\pi \frac{\frac{1}{n}\sum_{i=1}^n \delta\big(t^*_i, \pi(\hat t_i)\big) - \frac{1}{q}}{1 - \frac{1}{q}}, \qquad (29)$$

as a measure of accuracy. Here $\pi(\hat t)$ is one permutation (among the $q!$ possible permutations) of the groups of the partition $\hat t$. Clearly $0 \leq O(\hat t, t^*) \leq 1$; $O(\hat t, t^*) = 0$ means the inferred partition is uncorrelated with the planted one, while $O(\hat t, t^*) = 1$ indicates that the planted partition is fully recovered.

In Fig. 2 we plot the overlap between the planted partition and the partitions obtained using BP and using spectral clustering with the leading eigenvector of the non-backtracking operator Eq. (19), both at the spin-glass transition temperature $\beta^*$. The performances are compared with the optimal Bayesian inference results of [27]. We can see from the figure that BP and the non-backtracking operator both work all the way down to the detectability transition Eq. (28) at $c^* \approx 2.63$. While the non-backtracking operator performs worse in the regime away from the transition, BP works very close to the optimal accuracy, particularly in the regime close to the phase transition. Remarkably, while the optimal Bayesian inference algorithm [27] requires knowledge of the correct parameters of the model, our methods, both BP and the non-backtracking operator, do not need to know the parameters.
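A minimal sketch generating the sparse mixture of Eq. (27) and evaluating the threshold of Eq. (28) by numerical quadrature for unit-variance Gaussian $P_{\mathrm{in}}$, $P_{\mathrm{out}}$; the function names and the quadrature grid are our own choices.

```python
import numpy as np

def sample_mixture(n, q, c, mu_in, mu_out, seed=0):
    """Sparse Gaussian mixture of [27]: c*n/2 random pairs are measured;
    weights are N(mu_in, 1) inside groups and N(mu_out, 1) between them."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(q, size=n)
    pairs = set()
    while len(pairs) < int(c * n / 2):
        i, j = rng.integers(n, size=2)
        if i != j:
            pairs.add((min(i, j), max(i, j)))
    edges = sorted(pairs)
    mu = np.where([labels[i] == labels[j] for (i, j) in edges], mu_in, mu_out)
    return edges, mu + rng.standard_normal(len(edges)), labels

def c_star(q, mu_in, mu_out):
    """Detectability threshold of Eq. (28) for unit-variance Gaussians."""
    x = np.linspace(-12.0, 12.0, 4001)
    p_in = np.exp(-(x - mu_in)**2 / 2) / np.sqrt(2 * np.pi)
    p_out = np.exp(-(x - mu_out)**2 / 2) / np.sqrt(2 * np.pi)
    integrand = (p_in - p_out)**2 / (p_in + (q - 1) * p_out)
    return q / (integrand.sum() * (x[1] - x[0]))   # simple Riemann sum
```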
FIG. 2. Comparison of accuracy (evaluated using the overlap, Eq. (29)) of belief propagation (BP), spectral clustering using the non-backtracking operator (B), and the Bayes-optimal results (Optimal) of [27]. Each point is averaged over 10 realizations of size n = 100,000; ω_in and ω_out are Gaussian with means 0.75 and −0.75 respectively and unit variance. The theoretical transition Eq. (28) is at critical average degree c* ≈ 2.63.

In Fig. 3 we plot the retrieval weight $Q$ for different values of the group number $q$ as a function of $\beta$, for two datasets generated by the Gaussian mixture model with 3 and 4 groups respectively. We can see that in both cases the retrieval weight at the correct number of groups achieves its maximum value, and the retrieval phase has the widest range. Therefore, on these two examples, our method of determining the number of groups (as in Sec. II D) gives the correct group number.

FIG. 3. Retrieval weights for different values of q as a function of β on two datasets generated by Gaussian mixtures [27]. Both datasets are in the detectable regime with n = 10000 items; the average number of measurements per item is c = 6 with correct number of groups q* = 3 for the top panel, and c = 12 with q* = 4 for the bottom panel.

In Fig. 4 we draw the spectrum of the non-backtracking operator $B$ (Eq. (19)) at the spin-glass transition $\beta^*$ for four datasets generated using Gaussian mixtures with $q = 1, 2, 3,$ and 4 groups. Since $B$ is asymmetric, the spectrum is defined on the complex plane. The black circle in each figure is the theoretical prediction of the edge of the bulk given by Eq. (26), and we can see that it fits the numerical results very well in all four figures. The figures also illustrate that the number of real eigenvalues outside the bulk gives the number of groups in the dataset. We have checked that this is nicely consistent with the results given by maximizing the retrieval weight as proposed in Sec. II D.

For similarity problems there is a very popular and classic method, namely Affinity Propagation (AP) [3], which is also based on message passing, so we would like to compare the performance of our algorithm against it. Since affinity propagation is designed for fully-connected similarity graphs, to make a fair comparison we use Gaussian mixtures with a full similarity matrix. The results are shown in Fig. 5, where we can see that even on fully-connected similarity graphs our algorithm significantly outperforms affinity propagation. We note that over the past several years many variants of the AP algorithm have been developed; what we compare against here is the original version.

B. Community detection in weighted networks and directed networks
As described in Sec. I, the data clustering problem in the sparse regime is closely related to community detection in weighted networks when we are interested in assortative structures, where weights on edges can be understood as similarities between nodes. In the field of community detection many algorithms have been proposed, and there exist benchmark networks with ground-truth communities for evaluating the performance of algorithms. In this work we consider the widely used LFR benchmark networks [30].

The LFR model generates networks with a power-law degree distribution with exponent $\gamma$. A topological mixing parameter $\mu_t$ is introduced to tune the ratio between the internal degree (number of edges connecting to nodes in the same group) $k^{\mathrm{in}}_i$ and the total degree $k_i$ of a node $i$: $k^{\mathrm{in}}_i = (1-\mu_t)\,k_i$. The strength of node $i$ is assigned as $s_i = \sum_j \omega_{ij} = k_i^\alpha$, with $\alpha$ a parameter controlling the strength of the weights. A third parameter $\mu_\omega$ is then used to assign the internal strength, $s^{\mathrm{in}}_i = (1-\mu_\omega)\,s_i$. In our numerical experiments we fix $\gamma = 2$, $\alpha = 1$, $\mu_t = \mu_\omega$, and vary $\mu_t$, the number of nodes $n$, and the average degree $c$, to show the performance of our algorithm in different situations and compare it with existing algorithms. For other values of $\gamma$, $\alpha$, and $\mu_t$ we see qualitatively similar behavior.

We conduct experiments on weighted networks generated by the LFR model with $n = 10^4$ nodes, average degree $c = 10$, and $q^* = 2$ groups, and compare the performance of our algorithm with three other algorithms: Infomap [31], Oslom [32], and Louvain [33]. In all experiments the group number $q^* = 2$ is not provided to the four algorithms, so every algorithm determines the number of groups by itself. Since different algorithms eventually find different values of $q$, which can be quite different from $q^*$, it is not possible to use the overlap (Eq. (29)) to characterize the accuracy of community detection.

FIG. 4. Spectrum of the non-backtracking matrix Eq. (19) (at the spin-glass transition β*) on the complex plane for datasets generated by Gaussian mixtures [27]. The datasets have 1000 items. For (a), (b), (c), and (d), the numbers of groups are 1, 2, 3, and 4 respectively (average measurements per item: 4, 4, 7, …). The black circles are the theoretical predictions Eq. (26) of the edge of the bulk.

We therefore adopt the relative normalized mutual information (rNMI) [29] between the ground-truth partition and the detected partition to characterize the accuracy. The reason we use rNMI rather than the normalized mutual information (NMI) [34] is that Infomap usually gives a large number of groups, and NMI has a systematic bias towards large group numbers [29]. The rNMI is the NMI with this bias subtracted, and is hence more suitable for this task. A sketch of the measure is given after this paragraph block.

The results are shown in Fig. 6. From panel (a) we can see that our algorithm always gives a larger rNMI value than Infomap, Oslom and Louvain. In the parameter regime with $\mu_t > 0.15$, Infomap, Oslom and Louvain do not work at all, finding partitions that have almost zero rNMI, while our algorithm still gives a large rNMI at considerably larger $\mu_t$. Panel (b) of Fig. 6 shows the number of groups found by the algorithms on networks with $q^* = 2$ groups and varying $\mu_t$. We can see that the number of groups found by our algorithm is very close to the ground-truth value $q^* = 2$, while the other algorithms report significantly larger values. In particular, the Infomap algorithm reports a huge number of groups, which means it tends to break the large network into many small parts. Figure 7 gives a different view of the results, with the left panel comparing the performance of the algorithms on LFR networks with a varying number of nodes $n$ and the right panel comparing performance with varying average degree $c$. From panel (a) we can see that our algorithm is not sensitive to system size, giving accuracy close to 1 in all cases, while the accuracy of Infomap and Louvain falls quickly as the system size increases (with the average degree fixed). Panel (b) of Fig. 7 shows that Infomap and Louvain do not work in the whole regime, and Oslom requires an average degree larger than 10, while our algorithm performs well in the whole regime, giving almost perfect detection when the average degree is greater than 6.

We also compared our algorithm with the recently developed method of Peixoto [35], which is based on a series of his works [25, 36, 37] on hierarchical non-parametric Bayesian community detection and has become quite popular in recent years. We have performed an extensive comparison between our algorithm and the method of [35]; the typical result is that on dense networks the two algorithms perform similarly, while on sparse networks our algorithm performs considerably better. As an example, the comparison results on sparse Gaussian mixtures are shown in Fig. 8, where we can see from the left panel that when the average degree is low our algorithm works much better. One can also observe in the figure that Peixoto's method has a large variance. We think this is a quite general phenomenon when an algorithm that does not specifically target sparsity is applied to sparse graphs. Notice that in Fig. 10, where the Leicht–Newman algorithm is applied to sparse graphs, and in Fig. 12, where the Thouless-Anderson-Palmer equations (which are essentially a dense approximation to our BP method) are applied to sparse graphs, similar phenomena are observed; we have also seen a similar phenomenon in clustering sparse graphs using a spectral algorithm based on the adjacency matrix, see for example Fig. 4 of reference [18]. Moreover, the right panel of Fig. 8 illustrates that our method is much (almost 1000 times) faster than the method of [35], especially on large networks, due to the difference in efficiency between our message passing technique and the MCMC used in [35].
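The rNMI used in these comparisons subtracts from the NMI its expectation over random partitions with the same group sizes. A minimal sketch (our own implementation, not the one of [29]), assuming integer labels starting at 0:

```python
import numpy as np

def nmi(a, b):
    """Normalized mutual information 2*I(a;b) / (H(a) + H(b))."""
    n = len(a)
    joint = np.zeros((a.max() + 1, b.max() + 1))
    for x, y in zip(a, b):
        joint[x, y] += 1.0 / n                 # empirical joint distribution
    pa, pb = joint.sum(1), joint.sum(0)
    nz = joint > 0
    mi = np.sum(joint[nz] * np.log(joint[nz] / np.outer(pa, pb)[nz]))
    ha = -np.sum(pa[pa > 0] * np.log(pa[pa > 0]))
    hb = -np.sum(pb[pb > 0] * np.log(pb[pb > 0]))
    return 2.0 * mi / (ha + hb)

def rnmi(truth, found, samples=20, seed=0):
    """rNMI of [29]: NMI minus its mean over random partitions that keep
    the group sizes of the inferred partition (here: label permutations)."""
    rng = np.random.default_rng(seed)
    base = np.mean([nmi(truth, rng.permutation(found)) for _ in range(samples)])
    return nmi(truth, found) - base
```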
FIG. 5. Performance comparison between our belief propagation algorithm (BP) and affinity propagation [3] (AP), on synthetic similarity graphs generated using Gaussian mixtures (see main text) which are fully connected. Parameters: q = 4 groups; the in-group and out-group distributions are Gaussian with unit variance and means ρ_in = 5 − δ and ρ_out = −5 + δ respectively, where δ controls the hardness of the clustering problem and is varied along the x axis. The accuracy of clustering on the y axis is evaluated using the relative normalized mutual information between the obtained partition and the ground truth [29]. Each point in the figure is averaged over 10 instances; the implementation of affinity propagation was downloaded from the official website.
FIG. 6. Performance comparison of different algorithms for community detection in weighted networks. Networks are generated by the LFR benchmark (see text for details) with n = 10⁴ nodes, maximum degree 30, average degree c = 10, true number of groups q* = 2, and β = 1. Panel (a) shows the relative normalized mutual information (rNMI [29], the larger the better) between the true communities and those given by the algorithms. Panel (b) illustrates the numbers of groups detected by the four algorithms. Every point in the figures is averaged over 50 instances.
FIG. 7. Performance comparison of different algorithms for community detection in weighted networks generated by the LFR model, with varying size (a) and varying sparsity (b). In panel (a) the true number of groups is q* = 4, the average degree is fixed to c = 10, µ_t = µ_ω = 0.1, and the number of nodes varies. In panel (b) the number of nodes is n = 10⁴, the true number of groups is q* = 2, µ_t = µ_ω = 0.1, and the average degree varies. Every point in the figures is averaged over 50 instances.
FIG. 8. Performance comparison between our belief propagation algorithm (BP) and the hierarchical Bayesian method of Peixoto [35] on sparse Gaussian mixtures. The in-group and out-group Gaussian distributions have unit variance and means ρ_in = 5.75 and ρ_out = 4.25 respectively. In the left panel we plot the accuracy of clustering using the overlap, restricting both methods to q = 2 groups; the graph has n = 500 nodes and varying average degree. In the right panel, the computational times of the two algorithms are plotted for networks with average degree fixed to c = 15 and varying system size. Each point in the figure is averaged over 10 instances; the implementation of [35] was downloaded from Tiago Peixoto's website.

To test our algorithm on a real-world network, we ran it on the network of co-occurrences of major characters in Victor Hugo's novel 'Les Misérables'. In the network a node represents a character, and an edge between two nodes shows that the two characters appeared in the same chapter of the book; the weight of each link indicates how often such a co-appearance occurred. We tested our algorithm on both the weighted version and the unweighted version (which sets all weights to 1) of the network, and found that the results (plotted in Fig. 9) are quite different. In the results obtained on the unweighted version, Bishop Myriel, his sister, and their housemaid, together with seven other characters, including Myriel's father and the general of Myriel's father in war, are placed in one big community. In the results on the weighted version, Myriel, his sister, and their housemaid are in one group, while the other seven characters are in another group. We note that the three characters Myriel, his sister, and their housemaid interact frequently in the book, which corresponds to heavy weights on the edges between them, while the other seven characters communicate with them separately only once or twice, corresponding to very light weights. In the weighted version the frequency of interaction between characters is taken into account, so the three semi-important characters are grouped together and the other seven characters end up in another group; in the unweighted version, where all links are considered equal, the information about the strength of interactions is lost, and the ten characters are inappropriately assigned to one group. We conclude that on the 'Les Misérables' network our algorithm on the weighted network successfully captures the community information carried by the weights on the edges and gives more reasonable results than simply using the unweighted network.

FIG. 9. Clustering results given by our algorithm on the unweighted version (a) and weighted version (b) of the network of co-occurrences of major characters in Victor Hugo's novel 'Les Misérables'. A node represents a character, and an edge between two nodes shows that the two characters appeared in the same chapter of the book; the weight of each link indicates how often such a co-appearance occurred. The figures are drawn using Tiago Peixoto's software graph-tool (graph-tool.skewed.de).

The last application we examine is community detection in directed networks. It is well known that one way to detect communities in directed networks is to treat the network as a weighted one [38], giving different weights to unidirectional edges (having one direction) and bidirectional edges (having both directions). Here we simply give an edge with both directions weight 2, and an edge with only one direction weight 1. It is a relatively simple strategy, but we find that it already works well with our algorithm for the converted weighted networks we have tested, as the sketch below illustrates.
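The conversion can be sketched in a few lines; the function name is ours.

```python
def directed_to_weighted(directed_edges):
    """Convert a directed edge list to the weighted graph used in the text:
    reciprocated pairs get weight 2, single-direction pairs weight 1."""
    pairs = set(directed_edges)
    weighted = {}
    for (i, j) in pairs:
        key = (min(i, j), max(i, j))
        weighted[key] = 2 if (j, i) in pairs else 1
    return weighted
```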
We evaluate our method by comparing its performance with the well-known Leicht–Newman algorithm [39], a spectral algorithm using eigenvectors of the directed modularity matrix. The benchmark networks we use are directed LFR networks [30]; the results are plotted in Fig. 10. From the figure we find that our algorithm performs better than the Leicht–Newman algorithm in all regimes, for different average degrees and different noise parameters $\mu_t$.
FIG. 10. Comparison of performance (in overlap, Eq. (29)) between our algorithm (BP) and the Leicht–Newman algorithm [39] on benchmark networks generated by the directed LFR model [30], with µ_t denoting the mixing parameter. The networks have n = 10000 nodes, q* = 2 groups of equal size, and maximum degree 50. In panel (a), µ_t is fixed to 0.28; in panel (b) the average degree is fixed to c = 10. Each result is averaged over 50 instances.

FIG. 11. Clustering results given by belief propagation using the inverse geodesic distance [40] (on the 5-nearest-neighbor graph) as similarity. The similarity graphs in all three figures are sparse with average degree 10. Different colors represent different groups found by BP. (a) Moons, n = 2000 points. (b) Spiral, n = 312 points. (c) Two circles, n = 20000 points; 2 true labels are used for semi-supervision.

IV. CLUSTERING ON GEOMETRIC DATASETS AND SEMI-SUPERVISED CLUSTERING
In some real-world clustering problems, a good objective function or similarity measure is hard to choose. For example, in the classic datasets of moons, spirals, and two rings shown in Fig. 11, although the clusters are very clear to human beings, commonly used objective functions, such as the distance between items in the same group, do not work well, due to the non-spherical shape of the clusters.

In such cases the data does not come with well-defined similarities, and we need to construct $\omega_{ij}$ from the distance between a pair of nodes. In our experiments we tried several definitions of similarity, and we found that the inverse geodesic distance [40] works better than other choices such as Gaussian kernels.

In some hard instances such as the spirals, as shown in Fig. 11, BP still does not give good enough results; this means that the similarity function or kernel function is not chosen optimally. In these cases we can use a small fraction of labels (true group memberships) to guide the algorithm; this approach is called semi-supervised clustering [41]. In our method, a carefully designed objective function can be integrated smoothly into the belief propagation equations, since we only need to redefine the similarities. Semi-supervised clustering using true labels $t^*_i$ is also easy in our method, because in BP (as adopted in [42]) we can fix the marginal of item $i$ and the cavity marginals sent from item $i$ to its neighbors $j$ as

$$\psi^i_{t^*_i} = \psi^{i\to j}_{t^*_i} = 1, \qquad \psi^i_{t\neq t^*_i} = \psi^{i\to j}_{t\neq t^*_i} = 0. \qquad (30)$$

Fig. 11 shows results of semi-supervised clustering using our method on the classic toy datasets of moons, spirals, and two circles. Different colors represent different groups given by our method using the inverse geodesic distance [40] as the similarity measure. From the figures we can see that BP successfully detects the human-intuitive clusterings in panels (a) and (b). In panel (c), using the geodesic distance alone does not give a good result, so we further use 2 true labels, one in each circle, in the semi-supervised procedure; this leads to a perfect clustering into the two circles. A minimal sketch of the label-pinning step of Eq. (30) is given below.
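Pinning the labeled items as in Eq. (30) amounts to clamping their beliefs after every sweep; the sketch below assumes the hypothetical data structures of the BP sketch in Sec. II A.

```python
import numpy as np

def pin_labels(msgs, marginals, known, q):
    """Clamp BP beliefs for labeled items as in Eq. (30): node i with
    known label t*_i sends delta messages and keeps a delta marginal.
    `known` maps node -> true label; call after every BP sweep."""
    for i, t_star in known.items():
        delta = np.zeros(q)
        delta[t_star] = 1.0
        marginals[i] = delta
        for (a, b) in msgs:
            if a == i:                 # all outgoing cavity messages of i
                msgs[(a, b)] = delta.copy()
    return msgs, marginals
```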
V. CLUSTERING IN DENSE NETWORKS USING THE THOULESS-ANDERSON-PALMER EQUATIONS

In the above sections we have shown the performance of our method in several applications in the sparse regime, where the number of measurements, i.e. the connectivity of the weighted network, is small. However, when the weighted network is dense, the belief propagation algorithm, which scales linearly in the number of edges, would be too slow. In this section we address the scalability problem of BP in dense graphs by deriving the Thouless-Anderson-Palmer (TAP) equations [43], which give a good approximation to BP in dense graphs while being much faster.

TAP equations have been widely used for studying macroscopic properties of mean-field systems in the retrieval phase and in spin glass phases. There are at least two ways to derive the TAP equations. The first is via the Plefka expansion, where the Legendre transform of the free energy at high temperature is expanded and truncated at the second-order term. The second way, adopted here, first approximates the belief propagation equations at high temperature by relaxed belief propagation, then eliminates all the cavity messages using Onsager reaction terms. The reason we derive TAP is that it is much faster than BP while being only slightly less accurate on dense graphs, as illustrated in Fig. 12.

Notice that although the update equations are derived using high-temperature expansions, they are functions of $\beta$, which is not necessarily small. In other words, at high temperature (low $\beta$) the TAP equations are equivalent to the belief propagation equations, while at low temperature the TAP equations are only an approximation to belief propagation, as they are derived by expanding belief propagation at high temperature. In fact, the approximation is very good in dense graphs, as the paramagnetic-spin glass transition $\beta^*$ scales with the inverse of the average connectivity $c$, hence is very small for dense graphs, vanishing in the thermodynamic limit. This means a very small $\beta$ can still be enough for the system to stay in the retrieval phase.

Observe that when the average degree of the network is large, the marginal probability $\psi^i_{t_i}$ in Eq. (6) is close to the cavity probability in Eq. (5). One can then capture the difference approximately and eliminate the cavity probabilities, reducing the computational complexity. First we expand the cavity messages to second order and obtain the so-called relaxed belief propagation (RBP):

$$\psi^{i\to k}_{t_i} \approx \frac{e^{h(t_i)}}{Z^{i\to k}} \prod_{j\in\partial i\setminus k} e^{\psi^{j\to i}_{t_i}\beta\omega_{ij} + \frac{1}{2}\left[\psi^{j\to i}_{t_i} - (\psi^{j\to i}_{t_i})^2\right]\beta^2\omega_{ij}^2}, \qquad (31)$$

$$\psi^{i}_{t_i} = \frac{e^{h(t_i)}}{Z^{i}} \prod_{j\in\partial i} e^{\psi^{j\to i}_{t_i}\beta\omega_{ij} + \frac{1}{2}\left[\psi^{j}_{t_i} - (\psi^{j}_{t_i})^2\right]\beta^2\omega_{ij}^2}. \qquad (32)$$

Then, by assuming that $\beta$ is small (which is true in the retrieval phase when the average degree is large) and approximating $\psi^{i\to k}_{t_i}$ using the marginal probabilities, we arrive at the TAP equations:

$$\psi^{i}_{t_i} = \frac{e^{h(t_i)}}{Z^{i}} \prod_{j\in\partial i} e^{\psi^{j}_{t_i}\beta\omega_{ij} + \beta^2\omega_{ij}^2\left[\psi^{j}_{t_i}\left(\sum_s \psi^{j}_{s}\psi^{i}_{s} - \psi^{i}_{t_i}\right) + \frac{1}{2}\left(\psi^{j}_{t_i} - (\psi^{j}_{t_i})^2\right)\right]}. \qquad (33)$$

The detailed derivation from BP to TAP can be found in Appendix B. We can see from Eq. (33) that the cavity probabilities are completely eliminated, and no messages are passed along the edges of the graph. Thus the number of messages is reduced from $m$ in BP to $n$ in TAP; as a consequence TAP is much faster than BP in dense graphs. A minimal sketch of this update is given below; the remaining question is how TAP performs compared with BP.
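The sketch below illustrates one sweep of the TAP update Eq. (33), keeping only the $n$ marginal vectors; the names follow the hypothetical BP sketch above.

```python
import numpy as np

def tap_sweep(neighbors, w, psi, beta, w_bar):
    """One sweep of the TAP update Eq. (33): only the n marginal vectors
    psi[i] are kept, with no cavity messages along edges."""
    h = -beta * w_bar * sum(psi.values())        # same field as in BP
    new_psi = {}
    for i in neighbors:
        log_m = h.copy()
        for j in neighbors[i]:
            bw = beta * w[(i, j)]
            # first-order term plus Onsager reaction and variance terms
            log_m += psi[j] * bw + bw**2 * (
                psi[j] * (psi[j] @ psi[i] - psi[i])
                + 0.5 * (psi[j] - psi[j] ** 2))
        m = np.exp(log_m - log_m.max())
        new_psi[i] = m / m.sum()
    return new_psi
```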
In Fig. 12 we compare the accuracy and computational time of BP and TAP on the Gaussian mixture problem. In Fig. 12(a) the mean degree of the similarity graph varies; we can see that, surprisingly, for degrees larger than 6 there is already no difference in accuracy between BP and TAP. Even for very sparse similarity graphs with average degree ranging from 2 to 6, the difference in accuracy is very small. In Fig. 12(b) the computation times of BP and TAP are compared at varying average degree; with small average degree TAP is slower than BP. This is because in sparse graphs TAP is a worse approximation than BP, and hence requires more iterations to converge. The figure also illustrates that for average degree larger than 15, TAP becomes much faster than BP. Most importantly, the computational time of TAP stays almost constant as the average degree increases, while the computational time of BP increases linearly with the average degree. In Fig. 12(c) we compare the accuracy of BP and TAP with the average degree fixed to 5 and varying mean weights. We can see that even on such a sparse network, TAP works closer and closer to BP as the clustering problem becomes easier with larger mean weights.

We note that the original derivation of the TAP equations (e.g. for the Sherrington-Kirkpatrick model) was obtained by minimizing the TAP free energy, which is obtained by taking the Legendre transform of the free energy at high temperature and truncating at the second-order term; this gives equations identical to ours for our model. If one instead truncates the free energy at the first-order term, the naïve mean-field (NMF) equations are obtained, which are the TAP equations without the Onsager reaction term. In the Appendix we also derive the NMF equations and perform an overall comparison of BP, RBP, NMF, and TAP, to complete the picture of approximations related to belief propagation.
FIG. 12. Comparison of accuracy and time used by BP and TAP. Each point is averaged over 20 realizations of size n = 100,000. (a) Overlap on weighted SBM networks with Gaussian weight distributions ω_in, ω_out with means 0.75 and −0.75 respectively and unit variance. (b) Computational time used by BP and TAP under the conditions of (a). All points are at the spin-glass transition temperature β* and the maximum number of iterations is 1000. (c) Average degree fixed to 5; the x axis shows the means of the weights, ⟨ω_in⟩ = −⟨ω_out⟩.

VI. CONCLUSIONS AND DISCUSSIONS
We have presented a message-passing algorithm and a related spectral algorithm for the data clustering problem. Our method is based on the Potts spin-glass model of statistical physics, treating a selected objective function as the Hamiltonian and applying the cavity method. Instead of attempting to directly maximize the objective function, our method uses belief propagation to look for a retrieval phase, in the temperature range where statistically significant clustering emerges. This amounts to finding the consensus of many partitions with a good objective function, rather than looking for a single partition that maximizes the function. Finding a consensus gives us a robust way to overcome the over-fitting problem to which optimization procedures are usually prone. We further showed that this can be done efficiently by checking the convergence of the belief propagation algorithm at the spin glass transition point, which also gives a mathematically principled way to determine the number of groups.

We applied our method to several clustering problems, including mixtures of Gaussian distributions in the sparse regime, semi-supervised clustering, and community detection in weighted and directed networks. We validated the efficiency of our method on sparse Gaussian mixtures by showing that it works all the way down to the theoretical limit and performs as well as Bayes-optimal inference. For community detection in weighted and directed networks, we showed that our method significantly outperforms existing algorithms.

The belief propagation algorithm we use has computational complexity linear in the number of edges of the similarity graph. This could be too heavy if the similarity graph is dense and large. In fact, when the similarity graph is dense, weak similarities (or weights) are usually enough to guarantee that the clusterings are detectable. In this case we can approximate the belief propagation equations using the Thouless-Anderson-Palmer equations [43] (also known as approximate message passing when the variables are continuous), whose computational complexity is linear in the number of nodes rather than the number of edges, as we demonstrated in Sec. V.

A C++ implementation of the proposed algorithms can be found at http://lib.itp.ac.cn/html/panzhang/.
ACKNOWLEDGMENTS
C.S. would like to acknowledge Tao Zhou for support.

[1] J. A. Hartigan, Clustering Algorithms, Vol. 209 (Wiley, New York, 1975).
[2] A. K. Jain, M. N. Murty, and P. J. Flynn, ACM Computing Surveys (CSUR) 31, 264 (1999).
[3] B. J. Frey and D. Dueck, Science 315, 972 (2007).
[4] A. Rodriguez and A. Laio, Science 344, 1492 (2014).
[5] A. K. Jain, Pattern Recognition Letters 31, 651 (2010).
[6] P. W. Holland, K. B. Laskey, and S. Leinhardt, Social Networks 5, 109 (1983).
[7] U. von Luxburg, M. Belkin, and O. Bousquet, Stat. Comput. (2007).
[8] J. Shi and J. Malik, IEEE Transactions on Pattern Analysis and Machine Intelligence, 888 (1997).
[9] A. Y. Ng, M. I. Jordan, and Y. Weiss, in Advances in Neural Information Processing Systems (MIT Press, Cambridge, MA, 2001), p. 849.
[10] P. Zhang, in Advances in Neural Information Processing Systems 29, edited by D. D. Lee, U. V. Luxburg, I. Guyon, and R. Garnett (Curran Associates, Inc., 2016), pp. 541–549.
[11] P. Zhang and C. Moore, Proc. Natl. Acad. Sci. USA 111, 18144 (2014).
[14] M. Weigt, R. A. White, H. Szurmant, J. A. Hoch, and T. Hwa, Proc. Natl. Acad. Sci. USA 106, 67 (2009).
[15] M. Blatt, S. Wiseman, and E. Domany, Phys. Rev. Lett. 76, 3251 (1996).
[16] M. Blatt, S. Wiseman, and E. Domany, in Advances in Neural Information Processing Systems (1996), pp. 416–422.
[17] S. Wiseman, M. Blatt, and E. Domany, Phys. Rev. E 57, 3767 (1998).
[18] F. Krzakala, C. Moore, E. Mossel, J. Neeman, A. Sly, L. Zdeborová, and P. Zhang, Proc. Natl. Acad. Sci. USA 110, 20935 (2013).
[19] M. Girvan and M. E. J. Newman, Proc. Natl. Acad. Sci. USA 99, 7821 (2002).
[20] M. Mézard and A. Montanari, Information, Physics and Computation (Oxford University Press, 2009).
[21] J. Yedidia, W. Freeman, and Y. Weiss, in International Joint Conference on Artificial Intelligence (IJCAI) (2001).
[22] L. Zdeborová, Acta Phys. Slov. 59, 169 (2009).
[23] E. Mossel, J. Neeman, and A. Sly, Probability Theory and Related Fields 162, 431 (2015).
[24] P. Zhang, Phys. Rev. E 91, 042120 (2015).
[25] T. P. Peixoto, Phys. Rev. X 4, 011047 (2014).
[26] A. Decelle, F. Krzakala, C. Moore, and L. Zdeborová, Phys. Rev. E 84, 066106 (2011).
[27] A. Saade, M. Lelarge, F. Krzakala, and L. Zdeborová, in Information Theory (ISIT), 2016 IEEE International Symposium on (IEEE, 2016), pp. 780–784.
[28] S. Heimlicher, M. Lelarge, and L. Massoulié, arXiv:1209.2910 (2012).
[29] P. Zhang, Journal of Statistical Mechanics: Theory and Experiment, P11006 (2015).
[30] A. Lancichinetti, S. Fortunato, and F. Radicchi, Phys. Rev. E 78, 046110 (2008).
[31] M. Rosvall and C. T. Bergstrom, Proc. Natl. Acad. Sci. USA 105, 1118 (2008).
[32] A. Lancichinetti, F. Radicchi, J. J. Ramasco, and S. Fortunato, PLoS ONE 6, e18961 (2011).
[33] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre, J. Stat. Mech., P10008 (2008).
[34] L. Danon, A. Diaz-Guilera, J. Duch, and A. Arenas, Journal of Statistical Mechanics: Theory and Experiment, P09008 (2005).
[35] T. P. Peixoto, arXiv:1708.01432 (2017).
[36] T. P. Peixoto, Phys. Rev. E 89, 012804 (2014).
[37] T. P. Peixoto, Phys. Rev. Lett. 110, 148701 (2013).
[38] F. D. Malliaros and M. Vazirgiannis, Physics Reports 533, 95 (2013).
[39] E. A. Leicht and M. E. J. Newman, Phys. Rev. Lett. 100, 118703 (2008).
[40] J. B. Tenenbaum, V. de Silva, and J. C. Langford, Science 290, 2319 (2000).
[41] S. Basu, M. Bilenko, and R. J. Mooney, in Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM, 2004), pp. 59–68.
[42] P. Zhang, C. Moore, and L. Zdeborová, Phys. Rev. E 90, 052802 (2014).
[43] D. J. Thouless, P. W. Anderson, and R. G. Palmer, Philosophical Magazine 35, 593 (1977).

Appendix A: Deriving Eq. (5) from Eq. (4)
First notice that on sparse graphs $\bar\omega = 2\sum_{\langle ij\rangle\in E}\omega_{ij}/n^2$ is small, because the number of edges $|E|$ is much smaller than $n^2$. Thus Eq. (4) can be well approximated as

$$\begin{aligned}
\psi^{i\to k}_{t_i} &= \frac{1}{Z^{i\to k}} \prod_{j\in\partial i\setminus k} \sum_{t_j=1}^{q} e^{\beta\omega_{ij}\delta_{t_i t_j}}\,\psi^{j\to i}_{t_j} \prod_{l\neq i}\sum_{t_l=1}^{q} e^{-\beta\bar\omega\,\delta_{t_i t_l}}\,\psi^{l\to i}_{t_l} \\
&= \frac{1}{Z^{i\to k}} \prod_{j\in\partial i\setminus k} \Big( e^{\beta\omega_{ij}}\,\psi^{j\to i}_{t_i} + \sum_{t_j\neq t_i}\psi^{j\to i}_{t_j} \Big) \prod_{l\neq i}\sum_{t_l=1}^{q} e^{-\beta\bar\omega\,\delta_{t_i t_l}}\,\psi^{l\to i}_{t_l} \\
&\approx \frac{1}{Z^{i\to k}} \prod_{j\in\partial i\setminus k} \Big( 1 + \psi^{j\to i}_{t_i}\big(e^{\beta\omega_{ij}}-1\big) \Big) \prod_{l\neq i}\sum_{t_l=1}^{q} \big(1-\beta\bar\omega\,\delta_{t_i t_l}\big)\,\psi^{l\to i}_{t_l} \\
&= \frac{1}{Z^{i\to k}} \prod_{j\in\partial i\setminus k} \Big( 1 + \psi^{j\to i}_{t_i}\big(e^{\beta\omega_{ij}}-1\big) \Big) \prod_{l\neq i}\big(1-\beta\bar\omega\,\psi^{l\to i}_{t_i}\big).
\end{aligned} \qquad (A1)$$

Using the fact that $\bar\omega$ is small, the last product can be re-exponentiated, giving

$$\psi^{i\to k}_{t_i} \approx \frac{1}{Z^{i\to k}} \prod_{j\in\partial i\setminus k} \Big( 1 + \psi^{j\to i}_{t_i}\big(e^{\beta\omega_{ij}}-1\big) \Big) \prod_{l\neq i} e^{-\beta\bar\omega\,\psi^{l\to i}_{t_i}}, \qquad (A2)$$

which, after replacing the cavity marginals in the exponent by the full marginals, is equivalent to Eq. (5).

Appendix B: Deriving the Thouless-Anderson-Palmer equations
The belief propagation equations are written as

$$\psi^{i\to k}_{t_i} = \frac{e^{h(t_i)}}{Z^{i\to k}} \prod_{j\in\partial i\setminus k} \Big(1 + \psi^{j\to i}_{t_i}\big(e^{\beta\omega_{ij}}-1\big)\Big), \qquad
\psi^{i}_{t_i} = \frac{e^{h(t_i)}}{Z^{i}} \prod_{j\in\partial i} \Big(1 + \psi^{j\to i}_{t_i}\big(e^{\beta\omega_{ij}}-1\big)\Big).$$

We see that when $\beta$ is small, we can expand to second order and approximate the cavity probabilities by the marginals:

$$\psi^{i}_{t_i} \approx \frac{e^{h(t_i)}}{Z^{i}} \prod_{j\in\partial i} e^{\psi^{j\to i}_{t_i}\beta\omega_{ij} + \frac{1}{2}\left[\psi^{j\to i}_{t_i} - (\psi^{j\to i}_{t_i})^2\right]\beta^2\omega_{ij}^2}
\approx \frac{e^{h(t_i)}}{Z^{i}} \prod_{j\in\partial i} e^{\psi^{j\to i}_{t_i}\beta\omega_{ij} + \frac{1}{2}\left[\psi^{j}_{t_i} - (\psi^{j}_{t_i})^2\right]\beta^2\omega_{ij}^2}.$$

In dense graphs, where $\psi^i_{t_i}$ is close to $\psi^{i\to k}_{t_i}$, the cavity probability can be expressed as

$$\begin{aligned}
\psi^{i\to k}_{t_i} &= \psi^{i}_{t_i} - \frac{e^{H^i_{t_i}}}{Z^i} + \frac{e^{H^i_{t_i} - u^{k\to i}_{t_i}}}{\sum_s e^{H^i_s - u^{k\to i}_s}} \\
&\approx \psi^{i}_{t_i} + \frac{e^{H^i_{t_i}}\sum_s e^{H^i_s}\, u^{k\to i}_s - e^{H^i_{t_i}}\, u^{k\to i}_{t_i}\, Z^i}{(Z^i)^2} \\
&= \psi^{i}_{t_i} + \psi^{i}_{t_i}\Big(\sum_s \psi^i_s\, u^{k\to i}_s - u^{k\to i}_{t_i}\Big) \\
&= \psi^{i}_{t_i} + \psi^{i}_{t_i}\Big(\sum_s \psi^i_s \ln\big(1+\psi^{k\to i}_s(e^{\beta\omega_{ik}}-1)\big) - \ln\big(1+\psi^{k\to i}_{t_i}(e^{\beta\omega_{ik}}-1)\big)\Big) \\
&\approx \psi^{i}_{t_i} + \psi^{i}_{t_i}\Big(\sum_s \psi^i_s\, \psi^k_s - \psi^k_{t_i}\Big)\beta\omega_{ik},
\end{aligned} \qquad (B1)$$

where

$$H^{i}_{t_i} = h(t_i) + \sum_{j\in\partial i} u^{j\to i}_{t_i}, \qquad u^{j\to i}_{t_i} = \ln\Big(1 + \psi^{j\to i}_{t_i}\big(e^{\beta\omega_{ij}}-1\big)\Big).$$

Then, substituting this expression for the cavity probabilities into the BP equations, we finally arrive at the TAP equation

$$\psi^{i}_{t_i} = \frac{e^{h(t_i)}}{Z^{i}} \prod_{j\in\partial i} e^{\psi^{j}_{t_i}\beta\omega_{ij} + \beta^2\omega_{ij}^2\left[\psi^{j}_{t_i}\left(\sum_s \psi^{j}_{s}\psi^{i}_{s} - \psi^{i}_{t_i}\right) + \frac{1}{2}\left(\psi^{j}_{t_i} - (\psi^{j}_{t_i})^2\right)\right]}. \qquad (B2)$$

As noted in the main text, the naïve mean-field equations for this problem can be derived as

$$\psi^{i}_{t_i} = \frac{e^{h(t_i)}}{Z^{i}} \prod_{j\in\partial i} e^{\psi^{j}_{t_i}\beta\omega_{ij}}. \qquad (B3)$$

In Fig. 13 we present an overall comparison of BP, RBP, NMF and TAP on the Gaussian mixture problem. In panel (a) we vary the average degree, and in panel (b) we fix the average degree and vary the mean weight. We can see from the figure that BP almost always works best and RBP is very close to BP; although TAP does not perform far from BP, NMF, which has the same computational complexity as TAP, works much worse.
FIG. 13. Comparison of accuracy and time used by Belief Propagation (BP), Relaxed Belief Propagation (RBP), Naïve Mean Field (NMF) and Thouless-Anderson-Palmer (TAP). Each point is averaged over 20 realizations of networks with size n = 100,000 nodes. (a) The Gaussian mixture model with Gaussian weight distributions ω_in, ω_out, which have unit variance and means 0.75 and −0.75 respectively. (b) The same as (a), but with the average degree fixed to 5; the x axis shows the difference between the means of the two distributions, i.e. ⟨ω_in⟩ − ⟨ω_out⟩.