Belief propagation for networks with loops
Message passing for probabilistic models on networks with loops
Alec Kirkley, George T. Cantwell, and M. E. J. Newman
Department of Physics, University of Michigan, Ann Arbor, Michigan 48109, USA
Center for the Study of Complex Systems, University of Michigan, Ann Arbor, Michigan 48109, USA
In this paper, we extend a recently proposed framework for message passing on "loopy" networks to the solution of probabilistic models. We derive a self-consistent set of message passing equations that allow for fast computation of probability distributions in systems that contain short loops, potentially with high density, as well as expressions for the entropy and partition function of such systems, which are notoriously difficult quantities to compute. Using the Ising model as an example, we show that our solutions are asymptotically exact on certain classes of networks with short loops and offer a good approximation on more general networks, improving significantly on results derived from standard belief propagation. We also discuss potential applications of our method to a variety of other problems.
I. INTRODUCTION
Many complex phenomena can be modeled using networks, which provide powerful abstract representations of systems in terms of their components and interactions [1]. The phenomena are often modeled using probabilistic formulations that capture the probabilities of states of network nodes. Examples include the spread of epidemics through networks of social contacts [2], cascading failures in power grids [3], and the equilibrium behavior of spin models such as the Ising model [4]. Networks are also used to represent pairwise dependencies between variables in statistical models that do not otherwise have a network component, as a convenient tool for bookkeeping and visualization of model structure [5]. Such "graphical models," which allow us to represent the conditional dependencies between variables in a non-parametric manner, form the foundation for many modern machine learning techniques [6].

In this paper we consider probabilistic models on networks, models whose key features are that the state variables of the model live on the nodes of the network and the only explicit interactions between variables are pairwise dependencies between nodes directly connected by network edges. The Hamiltonian of a spin model might contain only pairwise interactions between spins and their neighbors, for example, while the individuals in an epidemic model can catch a disease only from those they have direct contact with.

In probabilistic models we are typically interested in computing expectation values or marginal probabilities over the ensemble of possible states of the system. For instance, in Ising spin systems the one-point marginal probabilities of spin states, meaning the average probability of a particular spin being up or down, can be used to compute the average magnetization. Unfortunately, there are few analytic results for expectations or marginal probabilities in such models [7] and those that do exist tend to have narrow applicability and can be complicated [8]. Even computational exploration of these models can be challenging, as state spaces grow exponentially with the size of the system.

In this paper we focus on message passing, also called the "cavity method" or "belief propagation," an efficient method for the solution of probabilistic models on networks that straddles the line between analytic and numerical approaches [9, 10]. Message passing methods involve deriving a set of self-consistent equations satisfied by the variables or probabilities in a model, then solving those equations by numerical iteration. The name "message passing" comes from the fact that the equations can be thought of as representing messages passed along network edges describing the probability that the variable on a given node takes a certain value, given the values of its neighbors.

Standard formulations of message passing, however, have a crucial weakness: they rely on the assumption that the states of the neighbors are uncorrelated with one another, or equivalently that the network of pairwise dependencies is a tree, i.e., a network that contains no loops. In practice, very few networks are perfect trees, though standard message passing methods are known to give good approximate results on networks that satisfy the weaker condition of being "locally tree-like," meaning that local regions of the network take the form of trees even though the network as a whole is not a tree. In effect, this means that the network can contain long loops, but not short ones [1].
Unfortunately, even this weaker condition is not true for many real-world networks, which often contain a high density of short loops [11], so standard message passing can give quite poor results in practical situations.

This limitation of traditional message passing has been widely noted and a number of attempts have been made in the past to remedy it. Message passing has been successfully extended to certain classes of random graphs that incorporate short loops in a limited fashion, such as Husimi graphs [12–14] and other tree-like agglomerations of small loopy subgraphs [15, 16], although these techniques are not generally applicable to real-world networks. Another approach incorporates the effect of loops using a perturbative expansion around the loopless case [17, 18], although this approach becomes progressively less accurate as the number of loops increases and is therefore best suited to networks with a low loop density, which rules out a large fraction of real networks whose loop density is often high [11, 19]. Perhaps the best known extension of belief propagation, and the one most similar to our own approach, is the method known as "generalized belief propagation" [20], which is based on the idea of passing messages not just between pairs of nodes but between larger groups. This method is quite complicated to implement in practice, however, and requires explicit construction of the groups, which involves nontrivial pre-processing steps [21]. The method we propose requires no such steps.

The foundational concepts necessary for applying message passing to loopy networks are described in [22], with specific applications to percolation models and spectral calculations. In this paper we extend those approaches to the solution of general probabilistic models. Our approach centers on the development of self-consistent message passing equations for the marginal probabilities of states of sets of nodes in a neighborhood around a given reference node, which allow for the fast computation of joint distributions within the neighborhood. We can then average over these joint distributions either exactly (for networks with small local neighborhoods) or approximately using Monte Carlo sampling (for larger neighborhoods) to calculate properties of interest such as single-site marginals.

To ground our discussion we use the Ising model as an example of our approach, showing how our improved message passing methods can produce better estimates for this model than regular message passing. We show that our methods are asymptotically exact on networks whose loop structure satisfies certain general conditions and give good approximations for networks that deviate from these conditions. We give example results for the Ising model on both real and artificial networks and also discuss applications of our method to a range of other problems.

II. MESSAGE PASSING ON NETWORKS WITH LOOPS
Our first step is to develop the general theory of message passing for probabilistic models on loopy networks. With an eye on the Ising model, our discussion will be in the language of spin models, although the methods we describe can be applied to any probabilistic model with pairwise dependencies between variables.
A. Model description
Consider a general undirected, unweighted network $G$ composed of a set $V$ of nodes or vertices and a set $E$ of pairwise edges. The network can be represented mathematically by its adjacency matrix $\mathbf{A}$ with elements $A_{ij} = 1$ when nodes $i$ and $j$ are connected by an edge and 0 otherwise. On each node of the network there is a variable or spin $s_i$, which is restricted to some discrete set of values $S$. In a compartmental model of disease propagation, for instance, $s_i \in S = \{\text{susceptible}, \text{infected}, \text{recovered}\}$ could be the infection state of a node [23, 24]. In a spatial model of segregation $s_i \in S = \{0, 1\}$ could represent land occupation [25].

Spins $s_i$ and $s_j$ interact if and only if there is an edge between nodes $i$ and $j$, a formulation sufficiently general to describe a large number of models in fields as diverse as statistical physics, machine learning, economics, psychology, and sociology [26–33]. Interactions are represented by an interaction energy $g_{ij}(s_i, s_j|\omega_{ij})$, which controls the preference for any particular pair of states $s_i$ and $s_j$ to occur together. The quantity $\omega_{ij}$ represents any external parameters, such as temperature in a classical spin system or infection rate in an epidemiological model, that control the nature of the interaction. We also allow for the inclusion of an external field $f_i(s_i|\theta_i)$ with parameters $\theta_i$, which controls the intrinsic propensity for $s_i$ to take a particular state. This could be used for instance to encode individual risk of catching a disease in an epidemic model.

Given these definitions, we write the probability $P(s|\omega,\theta)$ that the complete set of spins takes value $s$ in the Boltzmann form
$$P(s|\omega,\theta) = \frac{e^{-H(s|\omega,\theta)}}{Z(\omega,\theta)}, \qquad (1)$$
where the Hamiltonian
$$H(s|\omega,\theta) = -\sum_{(i,j) \in E} g_{ij}(s_i, s_j|\omega_{ij}) - \sum_{i \in V} f_i(s_i|\theta_i) \qquad (2)$$
is the log-probability of the state to within an arbitrary additive constant, and the partition function
$$Z(\omega,\theta) = \sum_s e^{-H(s|\omega,\theta)} \qquad (3)$$
is the appropriate normalizing constant, ensuring that $P(s|\omega,\theta)$ sums to unity.

In this paper we will primarily be concerned with computing the single-site (or one-point) marginal probabilities
$$q_i(s_i) = \sum_{s \setminus s_i} P(s|\omega,\theta), \qquad (4)$$
where $s \setminus s_i$ denotes all spins with the exception of $s_i$. For convenience we have dropped $\omega$ and $\theta$ from the notation on the left of the equation, but it should be clear contextually that $q_i$ depends on both of these variables.

The one-point marginals reveal useful information about physical systems, such as the magnetization of classical spin models or the position of a phase transition. They are important for statistical inference problems, where they give the posterior probability of a variable taking a given state after averaging over contributions from all other variables (e.g., the total probability of an individual being infected with a disease at a given time). Unfortunately, direct computation of one-point marginals is difficult because the number of terms in the sum in Eq. (4) grows exponentially with the number of spins. The message passing method gives us a way to get around this difficulty and compute $q_i$ accurately and rapidly.

Message passing can also be used to calculate other quantities. For instance, we will show how to compute the average energy (also called the internal energy), which is given by
$$U(\omega,\theta) = \sum_s H(s|\omega,\theta)\, P(s|\omega,\theta). \qquad (5)$$
The average energy is primarily of interest in thermodynamic calculations, although it may also be of interest for statistical inference, where it corresponds to the average log-likelihood.

We can also compute the two-point correlation function between spins
$$P(s_i = x, s_j = y) = P(s_j = y|s_i = x)\, q_i(s_i = x). \qquad (6)$$
This function can be computed by first calculating the one-point marginal $q_i(s_i = x)$, then fixing $s_i = x$ and repeating the calculation for $s_j$. The same approach can also be used to compute $n$-point correlation functions.
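To make these quantities concrete, the following minimal sketch (in Python; all names are our own illustrative choices, not the authors' reference code) computes the partition function of Eq. (3) and the one-point marginals of Eq. (4) by exhaustive enumeration for a small Ising-type system with $g_{ij} = \beta s_i s_j$ and no external field. This brute-force approach is exactly the exponential-cost calculation that message passing is designed to avoid, but it is useful as a ground-truth check on small networks.

```python
import itertools
import numpy as np

# A small example graph given as an edge list (a triangle plus a pendant node).
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
n = 4
beta = 0.5  # inverse temperature, folded into the Hamiltonian

def hamiltonian(s):
    """Dimensionless Hamiltonian of Eq. (2) with g_ij = beta*s_i*s_j, f_i = 0."""
    return -sum(beta * s[i] * s[j] for i, j in edges)

# Enumerate all 2^n states to get Z (Eq. (3)) and marginals q_i (Eq. (4)).
Z = 0.0
q = np.zeros(n)  # q[i] = marginal probability that s_i = +1
for s in itertools.product([-1, +1], repeat=n):
    w = np.exp(-hamiltonian(s))
    Z += w
    for i in range(n):
        if s[i] == +1:
            q[i] += w
q /= Z
print("Z =", Z, " one-point marginals:", q)
```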
B. Message passing equations

Our method operates by dividing a network into neighborhoods [22]. A neighborhood $N_i^{(r)}$ around node $i$ is defined as the node $i$ itself and all of its edges and neighboring nodes, plus all nodes and edges along paths of length $r$ or less between the neighbors of $i$. See Fig. 1 for examples. The key to our approach is to focus initially on networks in which there are no paths longer than $r$ between the neighbors of $i$, meaning that all paths are inside $N_i^{(r)}$. This means that all correlations between spins within $N_i^{(r)}$ are accounted for by edges that are also within $N_i^{(r)}$, which allows us to write exact message passing equations for these networks. Equivalently, we can define a primitive cycle of length $r$ starting at node $i$ to be a cycle (i.e., a self-avoiding loop) such that at least one edge in the cycle is not on any shorter cycle beginning and ending at $i$. Our methods are then exact on any network that contains no primitive cycles of length greater than $r + 2$.

We develop a series of such methods, where the $r$th member of the series is exact on networks that contain primitive cycles of length $r + 2$ and less only. The message passing equations become progressively more complex as $r$ gets larger: they are very tractable for smaller values but become impractical when $r$ is large. In many real-world networks the longest primitive loop will be relatively long, requiring an unwieldy set of equations for an exact solution. However, long loops introduce smaller correlations between variables than short ones, and moreover the density of long loops is in many cases lower: the network is "locally dense but globally sparse." In this situation, we find that the message passing equations for low values of $r$, while formally approximate, give excellent results. They account correctly for the effect of the short loops in the network, while making only a small approximation by omitting the long ones.

In practice, quite modest values of $r$ can give good results. The smallest possible choice of $r = 0$ corresponds to assuming there are no loops in the network at all, that the network is a tree. This is the assumption made by traditional message passing methods, and it gives poor results on many real-world networks. The next approximation after this, however, with $r = 1$, which correctly accounts for the effect of loops of length three in the network (i.e., triangles), produces substantially better results, and the $r = 2$ approximation (which includes loops of length four) is in many cases impressively accurate. In the following developments, we drop $r$ from our notation for convenience; the same equations apply for all values of $r$.

Having defined the initial neighborhood $N_i$ we now further define a neighborhood $N_{j \setminus i}$ to be node $j$ plus all edges in $N_j$ that are not contained in $N_i$ and the nodes at their ends. Our method involves writing the marginal probability distribution on the spin at node $i$ in terms of a set of messages received from nodes $j$ that are in $N_i$, including nodes that are not immediate neighbors of $i$. This contrasts with traditional message passing in which messages are received only from the immediate neighbors of $i$. These messages are then in turn calculated from further messages $j$ receives from nodes $k \in N_{j \setminus i}$, and so forth.
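The neighborhoods $N_i^{(r)}$ and $N_{j \setminus i}$ are purely structural objects and can be precomputed from the network alone. The sketch below (Python with networkx; the function names are ours) is one straightforward, unoptimized way to build them directly from the definition above, under the assumption that paths between neighbors of $i$ are not allowed to pass through $i$ itself.

```python
import itertools
import networkx as nx

def neighborhood(G, i, r):
    """Edge set of N_i^(r): node i's own edges plus all edges on simple
    paths of length <= r between neighbors of i (paths avoiding i)."""
    nbrs = set(G[i])
    edges = {frozenset((i, j)) for j in nbrs}
    H = G.subgraph(set(G) - {i})  # forbid paths through i itself
    for u, v in itertools.combinations(nbrs, 2):
        for path in nx.all_simple_paths(H, u, v, cutoff=r):
            edges |= {frozenset(e) for e in zip(path[:-1], path[1:])}
    return edges

def neighborhood_excluding(G, j, i, r):
    """Edge set of N_{j\\i}: edges of N_j^(r) not contained in N_i^(r)."""
    return neighborhood(G, j, r) - neighborhood(G, i, r)

G = nx.karate_club_graph()
print(len(neighborhood(G, 0, 1)), "edges in N_0^(1)")
```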
When written in this manner, the messages $i$ receives are independent of one another in any network with no primitive cycles longer than $r + 2$. Messages received from any two nodes $j_1$ and $j_2$ within $N_i$ are necessarily independent, since they are calculated from the corresponding neighborhoods $N_{j_1 \setminus i}$ and $N_{j_2 \setminus i}$, which are disconnected from one another: if they were connected by any path then that path would create a primitive cycle starting at $i$ but passing outside of $N_i$, of which there are none in the hypothesized network. By the same argument, we also know that $N_{j \setminus i}$ and $N_i$ only overlap at the single node $j$ for any $j \in N_i$.

With these observations in mind, we can start at a focal node $i$ and consider $N_i$ as comprising a central set of nodes and edges surrounding $i$. Then we can think of the set of neighborhoods $N_{j \setminus i}$ for all $j \in N_i$ as comprising the next "layer" in the network, the sets $N_{k \setminus j}$ for all $k \in N_{j \setminus i}$ as a third layer, and so forth until all nodes and edges in the network are accounted for. In a network with no primitive cycles longer than $r + 2$, this procedure counts all interactions exactly once, allowing us to rewrite our Hamiltonian as a sum of independent contributions from the various layers thus:
$$H(s) = H_{N_i}(s_{N_i}) + \sum_{j \in N_i} H_{N_{j \setminus i}}(s_{N_{j \setminus i}}) + \sum_{j \in N_i} \sum_{k \in N_{j \setminus i}} H_{N_{k \setminus j}}(s_{N_{k \setminus j}}) + \sum_{j \in N_i} \sum_{k \in N_{j \setminus i}} \sum_{l \in N_{k \setminus j}} H_{N_{l \setminus k}}(s_{N_{l \setminus k}}) + \ldots, \qquad (7)$$
where $s_{N_i}$ and $s_{N_{j \setminus i}}$ are the sets of spins for the nodes in the neighborhoods $N_i$ and $N_{j \setminus i}$ and we have defined the local Hamiltonians
$$H_{N_i}(s_{N_i}) = -\sum_{(j,k) \in N_i} g_{jk}(s_j, s_k|\omega_{jk}) - f_i(s_i|\theta_i), \qquad (8)$$
$$H_{N_{j \setminus i}}(s_{N_{j \setminus i}}) = -\sum_{(k,l) \in N_{j \setminus i}} g_{kl}(s_k, s_l|\omega_{kl}) - f_j(s_j|\theta_j). \qquad (9)$$
The decomposition of Eq. (7) is illustrated pictorially in Fig. 1.

FIG. 1: Diagram of the expansion used in the Hamiltonian decomposition of Eq. (7), with $r = 2$. The focal node is in red while the rest of its neighborhood $N_0$ is in green. Nodes and edges in purple represent the neighborhood $N_{1 \setminus 0}$ excluding node 1. We also label the corresponding spin and message variables used in Eqs. (11) and (12).

The essential feature of this decomposition is that it breaks sums over spins such as those in Eqs. (3) and (4) into a product of sums over the individual neighborhoods $\{N_{j \setminus i}\}_{j \in N_i}$. Because these neighborhoods are, as we have said, independent, this means that the partition function and related quantities factorize into products of sums over a few spins each, which can easily be performed numerically. For instance, the one-point marginal of Eq. (4) takes the form
$$q_i(s_i = x) \propto \sum_{s_{N_i}:\, s_i = x} e^{-H_{N_i}(s_{N_i})} \prod_{j \in N_i} \sum_{s_{N_{j \setminus i}} \setminus j} e^{-H_{N_{j \setminus i}}(s_{N_{j \setminus i}})} \prod_{k \in N_{j \setminus i}} \sum_{s_{N_{k \setminus j}} \setminus k} e^{-H_{N_{k \setminus j}}(s_{N_{k \setminus j}})} \cdots, \qquad (10)$$
which can be written recursively as
$$q_i(s_i = x) = \frac{1}{Z_i} \sum_{s_{N_i}:\, s_i = x} e^{-H_{N_i}(s_{N_i})} \prod_{j \in N_i} q_{i \leftarrow j}(s_j), \qquad (11)$$
with
$$q_{i \leftarrow j}(s_j = y) = \frac{1}{Z_{i \leftarrow j}} \sum_{s_{N_{j \setminus i}}:\, s_j = y} e^{-H_{N_{j \setminus i}}(s_{N_{j \setminus i}})} \prod_{k \in N_{j \setminus i} \setminus j} q_{j \leftarrow k}(s_k), \qquad (12)$$
where the normalization constants $Z_i$ and $Z_{i \leftarrow j}$ ensure that the marginals $q_i$ and messages $q_{i \leftarrow j}$ are normalized so that they sum to 1. (In practice, we simply normalize the messages by dividing by their sum.) The quantity $q_{i \leftarrow j}(s_j)$ is equal to the marginal probability of node $j$ having spin $s_j$ when all the edges in $N_i$ are removed.
Alternatively, one can think of it as a local external field on node $j$ that influences the probability distribution of $s_j$. To make this more explicit one could rewrite Eq. (11) as
$$q_i(s_i = x) = \frac{1}{Z_i} \sum_{s_{N_i}:\, s_i = x} e^{-H_{N_i}(s_{N_i}) + \sum_{j \in N_i} \log q_{i \leftarrow j}(s_j)}, \qquad (13)$$
where $\log q_{i \leftarrow j}(s_j)$ plays the role of the external field.

Equations (11) and (12) define our message passing algorithm and can be solved for the messages $q_{i \leftarrow j}$ by simple iteration, starting from any suitable set of starting values and applying the equations repeatedly until convergence is reached.

With only slight modification we can use the same approach to calculate the internal energy as well. The contribution to the internal energy from the interactions of a single node $i$ is $\frac{1}{2} \sum_{j: A_{ij} = 1} g(s_i, s_j|\omega_{ij}) + f(s_i|\theta_i)$, where the factor of $\frac{1}{2}$ compensates for double counting of interactions. Summing over all nodes $i$ and weighting by the appropriate Boltzmann probabilities, the total internal energy is
$$U = \sum_{i \in V} \frac{1}{Z_i} \sum_{s_{N_i}} \biggl[\, \frac{1}{2} \sum_{j: A_{ij} = 1} g(s_i, s_j|\omega_{ij}) + f(s_i|\theta_i) \biggr] e^{-H_{N_i}(s_{N_i})} \prod_{j \in N_i} q_{i \leftarrow j}(s_j). \qquad (14)$$
All of the quantities appearing here are known a priori, except for the messages $q_{i \leftarrow j}(s_j)$ and the normalizing constants $Z_i$, which are calculated in the message passing process. Performing the message passing and then using the final converged values in Eq. (14) then gives us our internal energy.
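The following sketch illustrates the iteration of Eqs. (11) and (12) for the two-state Ising case with exhaustive summation over each neighborhood (Python; a continuation of our hypothetical helpers above, not the authors' code). It assumes a dict `nbhd` holding, for each key `(i, j)` with $j \in N_i$, the edge set of $N_{j \setminus i}$, precomputed for instance with the `neighborhood_excluding` function sketched earlier, and that for every such key and every node $k \ne j$ of $N_{j \setminus i}$ the key `(j, k)` is also present.

```python
import itertools
import numpy as np

def update_message(i, j, nbhd, messages, beta):
    """One application of Eq. (12): recompute q_{i<-j} by summing over
    the spins of N_{j\\i}.  Zero external field is assumed (Eq. (9) with
    f_j = 0 and g = beta*s*s')."""
    edges = [tuple(e) for e in nbhd[(i, j)]]
    nodes = sorted(set(u for e in edges for u in e) | {j})
    q_new = np.zeros(2)                      # index 0: s_j = -1, 1: s_j = +1
    for spins in itertools.product([-1, +1], repeat=len(nodes)):
        s = dict(zip(nodes, spins))
        w = np.exp(beta * sum(s[u] * s[v] for u, v in edges))
        for k in nodes:
            if k != j:
                w *= messages[(j, k)][(s[k] + 1) // 2]   # incoming q_{j<-k}
        q_new[(s[j] + 1) // 2] += w
    return q_new / q_new.sum()

def solve_messages(nbhd, beta, tol=1e-8, max_iter=500):
    """Iterate Eq. (12) from uniform starting values until convergence."""
    messages = {key: np.array([0.5, 0.5]) for key in nbhd}
    for _ in range(max_iter):
        delta = 0.0
        for (i, j) in nbhd:
            new = update_message(i, j, nbhd, messages, beta)
            delta = max(delta, float(np.abs(new - messages[(i, j)]).max()))
            messages[(i, j)] = new
        if delta < tol:
            break
    return messages
```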
C. Implementation

For less dense networks, those with node degrees up to about 20, the message passing equations of Eqs. (11) and (12) can be implemented directly and work well. The method is also easily parallelizable, as we can update all messages asynchronously based on their values from the previous iteration, as well as compute the final marginals in parallel.

For networks with higher degrees the equations can become intractable, the huge reduction in complexity due to the factorization of the Hamiltonian notwithstanding. For a model with $t$ distinct spin states at every node, the sum over states in the neighborhood of $i$ has $t^{|N_i|}$ terms, which can quickly become computationally expensive to evaluate. Moreover, if just a single node has too large a neighborhood it can make the entire computation intractable, as that single neighborhood can consume more computational power than is available.

In such situations, therefore, we take a different approach. We note that Eq. (12) is effectively an expectation
$$q_{i \leftarrow j}(s_j = y) = \langle \delta_{s_j, y} \rangle_{N_{j \setminus i}}, \qquad (15)$$
where we use the shorthand
$$\langle A \rangle_{N_{j \setminus i}} = \frac{1}{Z_{i \leftarrow j}} \sum_{s_{N_{j \setminus i}}} A(s_{N_{j \setminus i}})\, e^{-H_{N_{j \setminus i}}(s_{N_{j \setminus i}})} \prod_{k \in N_{j \setminus i} \setminus j} q_{j \leftarrow k}(s_k). \qquad (16)$$
We approximate this average using Markov chain Monte Carlo importance sampling over spin states, and after convergence of the messages the final estimates of the marginals $q_i$ can also be obtained by Monte Carlo, this time on the spins in $N_i$. We describe the details in Section III.
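A minimal sketch of this idea, assuming the same hypothetical data structures as above: the exhaustive sum in `update_message` is replaced by a single-spin Metropolis sampler over $N_{j \setminus i}$, with the incoming messages entering the weight exactly as in Eq. (16). (The appendix describes a more efficient Wolff-based scheme; plain Metropolis is used here only for brevity, and burn-in is omitted.)

```python
import numpy as np

def update_message_mc(i, j, nbhd, messages, beta, n_samples=20000, rng=None):
    """Estimate q_{i<-j} via Eq. (15) by Metropolis sampling of N_{j\\i}."""
    rng = rng or np.random.default_rng()
    edges = [tuple(e) for e in nbhd[(i, j)]]
    nodes = sorted(set(u for e in edges for u in e) | {j})
    s = {u: rng.choice([-1, 1]) for u in nodes}

    def log_weight(u, val):
        """Log of the factors of Eq. (16) involving spin u when s_u = val."""
        lw = 0.0
        for a, b in edges:
            if a == u:
                lw += beta * val * s[b]
            elif b == u:
                lw += beta * val * s[a]
        if u != j:                     # message factor, absent for u = j
            lw += np.log(messages[(j, u)][(val + 1) // 2])
        return lw

    hits = 0
    for _ in range(n_samples):
        u = nodes[rng.integers(len(nodes))]
        if np.log(rng.random()) < log_weight(u, -s[u]) - log_weight(u, s[u]):
            s[u] = -s[u]               # accept the proposed spin flip
        hits += (s[j] == 1)
    p = hits / n_samples
    return np.array([1 - p, p])
```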
D. Calculating the partition function

The partition function $Z$ is perhaps the most fundamental quantity in equilibrium statistical mechanics. From a knowledge of the partition function one can calculate virtually any other thermodynamic variable of interest. Objects equivalent to $Z$ also play a central role in other areas, such as Bayesian statistics, where the quantity known as the model evidence, the marginal likelihood of observed data given a hypothesized model, is mathematically analogous to the partition function and plays an important role in model fitting and selection [34–36].

Unfortunately, the partition function is difficult to calculate in practice. The calculation can be done analytically in some special cases [37, 38], but numerical calculations are difficult due to the need to sum over an exponentially large number of states, and Monte Carlo calculations are challenging because of the difficulty of correctly normalizing the Boltzmann distribution.

Another concept central to statistical mechanics is the entropy
$$S = -\sum_s P(s) \ln P(s), \qquad (17)$$
which has broad applications not just in physics but across the sciences [39–41]. Like the partition function, entropy is difficult to calculate numerically, and for exactly the same reason, since the two are closely related. For the canonical distribution of Eq. (1) the entropy is given in terms of the free energy $\ln Z$ by
$$S = \ln Z + \beta U. \qquad (18)$$
Even if we know the internal energy $U$ therefore (which is relatively straightforward to calculate), the entropy is at least as hard to calculate as the partition function. Indeed the fundamental difficulty of normalizing the Boltzmann distribution is equivalent to establishing the zero of the entropy, a well known hard problem (unsolvable within classical thermodynamics, requiring the additional axiom of the third law).

As we now show, the entropy can be calculated using our message passing formalism by appropriately factorizing the probability distribution over spin states. Since we have already developed a prescription for computing $U$ (see Eq. (14)), this also allows us to calculate the partition function.

As shown in Section A 4 of the appendix, the state probability $P(s)$ in Eq. (1) can be rewritten in the factorized form
$$P(s) = \frac{\prod_{i \in G} P(s_{N_i})}{\prod_{((i,j)) \in G} P(s_{\cap ij})^{2/|\cap ij|}}, \qquad (19)$$
where $P(s_{N_i})$ is the joint marginal distribution of the variables in the neighborhood of node $i$, $P(s_{\cap ij})$ is the joint marginal distribution in the intersection $\cap ij = N_i \cap N_j$ of the neighborhoods $N_i$ and $N_j$, and $((i,j))$ denotes pairs of nodes that are contained in each other's neighborhoods.

By a series of manipulations, this form can be further expressed as the pure product
$$P(s) = \Biggl(\prod_{((i,j)) \in G} P(s_{\cap ij})^{\binom{|\cap ij|}{2}^{-1}}\Biggr) \Biggl(\prod_{(i,j) \in G} P(s_i, s_j)^{W_{ij}}\Biggr) \Biggl(\prod_{i \in G} P(s_i)^{C_i}\Biggr), \qquad (20)$$
where
$$W_{ij} = 1 - \sum_{((l,m)) \in G} \binom{|\cap lm|}{2}^{-1} \mathbb{1}\{(i,j) \in \cap lm\} \qquad (21)$$
with $\mathbb{1}\{\ldots\}$ being the indicator function, and
$$C_i = 1 - \Biggl(\sum_{j \in N_i} \bigl(|\cap ij| - 1\bigr)^{-1}\Biggr) - \Biggl(\sum_{j \in N_i^{(0)}} W_{ij}\Biggr), \qquad (22)$$
where $N_i^{(0)}$ denotes the $r = 0$ neighborhood of $i$, i.e., its immediate neighbors. Substituting (20) into Eq. (17), we get an expression for the entropy thus:
$$S = -\sum_{((i,j)) \in G} \binom{|\cap ij|}{2}^{-1} \sum_{s_{\cap ij}} P(s_{\cap ij}) \ln P(s_{\cap ij}) - \sum_{(i,j) \in G} W_{ij} \sum_{s_i, s_j} P(s_i, s_j) \ln P(s_i, s_j) - \sum_{i \in G} C_i \sum_{s_i} P(s_i) \ln P(s_i). \qquad (23)$$
Note that, like the well known Bethe approximation for the entropy [42], this expression has contributions from the one- and two-point marginals $P(s_i)$ and $P(s_i, s_j)$ of Eqs. (6) and (11), but also contains a term that depends on the joint marginal $P(s_{\cap ij})$ in the intersection $\cap ij$, which may be nontrivial if $r > 0$.
As shown in the appendix, we can calculate this joint marginal using the message passing equation
$$P(s_{\cap ij}) = \frac{1}{Z_{\cap ij}}\, e^{-\beta H(s_{\cap ij})}\, q_{i \leftarrow j}(s_j) \prod_{k \in \cap ij \setminus j} q_{j \leftarrow k}(s_k), \qquad (24)$$
where $H(s_{\cap ij})$ denotes the terms of the Hamiltonian of Eq. (2) that fall within $\cap ij$ and $Z_{\cap ij}$ is the corresponding normalizing constant. For $|\cap ij|$ sufficiently small, $Z_{\cap ij}$ can be computed exactly. In other cases we can calculate it using Monte Carlo methods similar to those we used previously for the marginals $P(s_i)$.
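The combinatorial coefficients of Eqs. (21) and (22) depend only on the network and the chosen neighborhoods, so they can be tabulated once before any message passing is done. A sketch of that bookkeeping, assuming the coefficient forms given above (Python; `node_sets` maps each node to the node set of its neighborhood $N_i$, obtainable for instance from the edge sets sketched earlier, and all names are ours):

```python
from math import comb

def entropy_coefficients(node_sets, edges):
    """Tabulate the coefficients of Eqs. (21) and (22).
    node_sets[i] = set of nodes in N_i; edges = iterable of (i, j) pairs."""
    nodes = list(node_sets)
    # Pairs ((i,j)): nodes contained in each other's neighborhoods.
    inter = {}
    for a in nodes:
        for b in nodes:
            if a < b and b in node_sets[a] and a in node_sets[b]:
                inter[(a, b)] = node_sets[a] & node_sets[b]
    # W_ij of Eq. (21) for each edge (i, j).
    W = {}
    for i, j in edges:
        key = tuple(sorted((i, j)))
        W[key] = 1 - sum(1 / comb(len(cap), 2) for cap in inter.values()
                         if i in cap and j in cap)
    # C_i of Eq. (22); N_i^(0) is the set of immediate neighbors of i.
    nbrs = {i: set() for i in nodes}
    for i, j in edges:
        nbrs[i].add(j); nbrs[j].add(i)
    C = {}
    for i in nodes:
        C[i] = (1
                - sum(1 / (len(node_sets[i] & node_sets[j]) - 1)
                      for j in node_sets[i] if j != i and i in node_sets[j])
                - sum(W[tuple(sorted((i, j)))] for j in nbrs[i]))
    return inter, W, C
```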
E. Comparison with generalized belief propagation

Message passing methods for probabilistic models on loopy networks have been proposed in the past, the best known being the generalized belief propagation method of Yedidia et al. [20]. Generalized belief propagation employs a region-based approximation [43], in which the free energy $\ln Z$ is approximated by a sum of independent local free energies of regions of edges and their accompanying nodes. Once the regions are defined it is straightforward to write down belief propagation equations, which can be used to calculate marginals and other quantities of interest, including approximations to the partition function and entropy. Perhaps the best known example of generalized belief propagation, at least within the statistical physics community, is the cluster variational method, in which the regions are defined so as to be closed under the intersection operation [26] and the resulting free energy is called the Kikuchi free energy [44].

The accuracy and complexity of generalized belief propagation is determined by the specific choice of regions, which has been described as being "more of an art than a science" [42]. Loops contained within regions are correctly accounted for in the belief propagation, while those that span two or more regions are not and introduce error. At the same time, the computational complexity of the belief propagation calculations increases exponentially with the size of the regions [42], so choosing the right regions is a balancing act between enclosing as many loops as possible while not making the regions too large. A number of heuristics have been proposed for choosing the regions [21, 45, 46] but real-world networks can pose substantial difficulties because they often contain both high degrees and many loops [1], which effectively forces us to compromise either by leaving loops out or by using very large regions. Our method can have a significant advantage in these systems because it can accommodate large, tightly connected neighborhoods through local Monte Carlo sampling. Our method also has the benefit that the neighborhoods are constructed automatically based on the network structure rather than being chosen by the user.
III. EXAMPLE APPLICATIONS
As an archetypal application of our methods we consider the Ising model on various example networks. The ferromagnetic Ising model in zero external field is equivalent in our notation to the choice
$$g_{ij}(s_i, s_j) = \beta A_{ij} s_i s_j, \qquad f_i(s_i) = 0, \qquad (25)$$
where $\beta = 1/T$ is the inverse temperature. Note that temperature in this notation is considered a part of the Hamiltonian. It is more conventional to write temperature separately, so that the Hamiltonian has dimensions of energy, rather than being dimensionless as here, but incorporating the temperature is notationally convenient in the present case. It effectively makes the temperature a parameter $\omega_{ij}$ in Eq. (2) (and all $\omega_{ij}$ are equal).

As example calculations, we will compute the average magnetization $M$, which is given by
$$M = \left| \left\langle \frac{1}{N} \sum_{i=1}^N s_i \right\rangle \right| = \frac{1}{N} \left| \sum_{i=1}^N \bigl[ 2 q_i(s_i = +1) - 1 \bigr] \right|, \qquad (26)$$
and the heat capacity $C$, given by
$$C = \frac{dU}{dT} = -\beta^2 \frac{dU}{d\beta}. \qquad (27)$$
As detailed in Section A 1 of the appendix, we employ an extension of the message passing equations to compute $C$ that avoids having to use a numerical derivative to evaluate Eq. (27). In brief, we consider the messages $q_{i \leftarrow j}$ to be a function of $\beta$ and then define their derivatives with respect to $\beta$ as their own set of messages
$$\eta^{i \leftarrow j}_y = \frac{dq^{i \leftarrow j}_y}{d\beta}, \qquad (28)$$
with their own associated message passing equations derived by differentiating Eq. (12). We then compute the heat capacity $C$ by differentiating Eq. (14), expressing the result in terms of the $\eta^{i \leftarrow j}_y$, and substituting it into Eq. (27).
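Putting the pieces together for the Ising case, a short sketch (Python, continuing the hypothetical helpers above) evaluates the marginals of Eq. (11) from converged messages and then the magnetization of Eq. (26). It assumes a dict `nbhd_full` mapping each node $i$ to the edge set of $N_i$.

```python
import itertools
import numpy as np

def marginal(i, nbhd_full, messages, beta):
    """Eq. (11): one-point marginal q_i from converged messages."""
    edges = [tuple(e) for e in nbhd_full[i]]
    nodes = sorted(set(u for e in edges for u in e) | {i})
    q = np.zeros(2)                          # index 0: s_i = -1, 1: s_i = +1
    for spins in itertools.product([-1, +1], repeat=len(nodes)):
        s = dict(zip(nodes, spins))
        w = np.exp(beta * sum(s[u] * s[v] for u, v in edges))
        for j in nodes:
            if j != i:
                w *= messages[(i, j)][(s[j] + 1) // 2]
        q[(s[i] + 1) // 2] += w
    return q / q.sum()

def magnetization(nbhd_full, messages, beta):
    """Eq. (26): M = (1/N) |sum_i (2 q_i(+1) - 1)|."""
    qs = [marginal(i, nbhd_full, messages, beta)[1] for i in nbhd_full]
    return abs(sum(2 * q - 1 for q in qs)) / len(qs)
```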
A. Phase transition

In many geometries, the ferromagnetic Ising model has a phase transition at a nonzero critical temperature between a symmetric state with zero average magnetization and a symmetry broken state with nonzero magnetization. Making the substitution (25) in Eqs. (11) and (12), we can show that the message passing equations for the Ising model always have a trivial solution $q_{i \leftarrow j}(s_j) = \frac{1}{2}$ for all $i, j$. This choice is a fixed point of the message passing iteration: when started at this point the iteration will remain there indefinitely. Looking at Eq. (26), we see that this fixed point corresponds to magnetization $M = 0$. If the message passing iteration converges to this trivial fixed point, therefore, it tells us that the magnetization is zero and we are above the critical temperature; if it settles elsewhere then the magnetization is nonzero and we are below the critical temperature. Thus the phase transition corresponds to the point at which the fixed point changes from being attracting to being repelling.

This behavior is well known in standard belief propagation, where it has been shown that on networks with long loops only there is a critical temperature $T_{\rm BP}$ below which the trivial fixed point becomes unstable and hence the system develops nonzero magnetization, and that this temperature corresponds precisely to the conventional zero-field continuous phase transition on these networks [47]. Extending the same idea to the present case, we expect the phase transition on a loopy network to fall at the corresponding transition point between stable and unstable in our message passing formulation.

Moreover, because the values of the messages at the trivial fixed point are known, we can compute an expression for the phase transition point without performing any message passing. We treat the message passing iteration as a dynamical system and perform a linear stability analysis of the trivial fixed point. Perturbing around $q = \frac{1}{2}$ (shorthand for setting all $q_{i \leftarrow j} = \frac{1}{2}$) and keeping terms to linear order, we find that the dynamics is governed by the Jacobian
$$J_{j \to i, \nu \to \mu} = \frac{\partial q_{i \leftarrow j}}{\partial q_{\mu \leftarrow \nu}} \bigg|_{q = 1/2} = \tilde{B}_{j \to i, \nu \to \mu}\, D_{j \to i, \nu \to \mu}, \qquad (29)$$
where $\tilde{B}$ is a generalization of the so-called non-backtracking matrix [48, 49] to our loopy message passing formulation:
$$\tilde{B}_{j \to i, \nu \to \mu} = \begin{cases} 1 & \text{if } j = \mu \text{ and } \nu \in N_{j \setminus i}, \\ 0 & \text{otherwise,} \end{cases} \qquad (30)$$
and $D_{j \to i, \nu \to \mu}$ is a correlation function between the spins $s_\mu$ and $s_\nu$ within the neighborhood $N_{j \setminus i}$; see Section A 3 of the appendix for details.

When the magnitude of the leading eigenvalue $\lambda_{\max}$ of this Jacobian is less than 1, the trivial fixed point is stable; when it is greater than 1 the fixed point is unstable. Hence we can locate the phase transition temperature by numerically evaluating the Jacobian and locating the point at which $|\lambda_{\max}|$ crosses 1, for instance by binary search.
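For the special case $r = 0$ this stability criterion reduces to a known result: for standard belief propagation on the zero-field ferromagnetic Ising model the Jacobian is $\tanh\beta$ times the ordinary non-backtracking matrix $B$, so the transition falls where $\tanh(\beta)\, \lambda_{\max}(B) = 1$ [47, 48]. That case makes a compact illustration of the eigenvalue criterion (Python; the general-$r$ version would replace $B$ by $\tilde{B}$ together with the correlation factors $D$ of Eq. (29)):

```python
import numpy as np
import networkx as nx

def nonbacktracking_leading_eig(G, n_iter=500):
    """Leading eigenvalue of the non-backtracking matrix B by power
    iteration on directed edges: B_{(u->v),(v->w)} = 1 if w != u."""
    dir_edges = [(u, v) for u, v in G.edges()] + [(v, u) for u, v in G.edges()]
    index = {e: k for k, e in enumerate(dir_edges)}
    # successors of edge (u->v): edges (v->w) with w != u
    succ = [[index[(v, w)] for w in G[v] if w != u] for u, v in dir_edges]
    x = np.ones(len(dir_edges))
    lam = 1.0
    for _ in range(n_iter):
        y = np.zeros_like(x)
        for k, s in enumerate(succ):
            y[s] += x[k]          # y = B^T x; B^T has the same spectrum as B
        lam = np.linalg.norm(y)
        x = y / lam
    return lam

G = nx.erdos_renyi_graph(1000, 4 / 999, seed=1)
lam = nonbacktracking_leading_eig(G)
T_bp = 1 / np.arctanh(1 / lam)    # solve tanh(1/T) * lam = 1
print("r = 0 estimate of the transition temperature:", T_bp)
```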
B. A model network

As a first example application, we examine the behavior of our method on a model network created precisely to have short loops only up to a specified maximum length. The network has short primitive cycles only of length $r + 2$ and less for a given choice of $r$, though it can also have long loops: it is "locally dense but globally sparse" in the sense discussed previously. Indeed this turns out to be a crucial point. It is well known that the Ising model does not have a normal phase transition on a true tree, because at any finite temperature there is always a nonzero density of defects in the spin state (meaning pairs of adjacent spins that are oppositely oriented), which on a tree divide the network into finite sized regions, imposing a finite correlation length and hence no critical behavior. Similarly, in the case of a network with only short loops and no long ones there is no true phase transition. The long loops are necessary to produce criticality, a point discussed in detail in [50].

To generate networks that have short primitive cycles only up to a certain length, we first generate a random bipartite network, a network with two types of nodes and connections only between unlike kinds, then "project" down onto one type of node, producing a network composed of a set of complete subgraphs or cliques. In detail, the procedure is as follows (a code sketch follows the list).

1. We first specify the degrees of all the nodes, of both types, in the bipartite network.

2. We represent these degrees by "stubs" of edges emerging, in the appropriate numbers, from each node, then we match stubs at random in pairs to create our random bipartite network.

3. We project this network onto the nodes of type 1, meaning that any two such nodes that are both connected to the same neighbor of type 2 are connected directly with an edge in the projection. After all such edges have been added, the type-2 nodes are discarded.

4. Finally, we remove a fraction $p$ of the edges in the projected network at random.

If $p = 0$ the network is composed of fully connected cliques, while for $p > 0$ the cliques are randomly diluted. If the cliques contain no more than $r + 2$ nodes then there will be no short loops of length longer than this.
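A compact sketch of this construction (Python with networkx; all parameter names are ours). For illustration it gives every type-2 node exactly $r + 2$ stubs, so that the projected cliques have at most $r + 2$ nodes and the guarantee above holds:

```python
import numpy as np
import networkx as nx

def loopy_model_network(n1, d1, n2, r, p, seed=0):
    """Random network with no short primitive cycles longer than r + 2:
    n1 type-1 nodes of degree d1, n2 type-2 nodes of degree r + 2, stubs
    matched at random, projected onto type 1, then diluted by removing
    a fraction p of the projected edges."""
    rng = np.random.default_rng(seed)
    stubs1 = np.repeat(np.arange(n1), d1)
    stubs2 = np.repeat(np.arange(n2), r + 2)
    rng.shuffle(stubs1); rng.shuffle(stubs2)
    m = min(len(stubs1), len(stubs2))
    stubs1, stubs2 = stubs1[:m], stubs2[:m]   # trim to equal length if needed
    G = nx.Graph()
    G.add_nodes_from(range(n1))
    for k in range(n2):                       # each type-2 node projects
        members = stubs1[stubs2 == k]         # to a clique of type-1 nodes
        for a in range(len(members)):
            for b in range(a + 1, len(members)):
                if members[a] != members[b]:
                    G.add_edge(int(members[a]), int(members[b]))
    G.remove_edges_from([e for e in G.edges() if rng.random() < p])
    return G

G = loopy_model_network(n1=10000, d1=3, n2=7500, r=2, p=0.2)
```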
Figure 2 shows the magnetization per spin, entropy, and heat capacity for the ferromagnetic Ising model on an example network of approximately 10 000 nodes generated using this procedure with $r = 2$. By also limiting the degrees of the type-1 nodes in the bipartite graph we ensure that no neighborhood in the projection is too large to prevent a complete summation over states and hence that Monte Carlo estimation of the sums in the message passing equations is unnecessary.

Results are shown for belief propagation calculations with $r = 0$, 1 and 2, the last of which should, in principle, be exact except for the weak correlations introduced by the presence of long loops in the network. We also show in the figure the magnitude of the leading eigenvalue of $J$ for each value of $r$. The points at which this eigenvalue equals 1, which give estimates of the critical temperature for each $r$, are indicated by the vertical lines. Also shown in the figure for comparison are results from a direct Monte Carlo simulation of the system, with the entropy calculated from values of the heat capacity computed from energy fluctuations and then numerically integrated using the identity
$$S = \int_0^T \frac{C(T')}{T'}\, dT'. \qquad (31)$$
The message passing simulations offer significantly faster results for this system: for $r = 2$ the message passing was about 100 times faster than the Monte Carlo simulations.

Looking at Fig. 2, we can see that as we increase $r$, the message passing results approach those from the direct Monte Carlo, except close to the phase transition, where the Monte Carlo calculations suffer from finite size effects that smear the phase transition, to which the message passing approach appears largely immune. While the results for $r = 0$ are quite far from the direct Monte Carlo results, most of the improvement in accuracy from our method is already present even at $r = 1$. Going to $r = 2$ offers only a small additional improvement in this case.

The apparent position of the phase transition aligns well with the predictions derived from the value of the Jacobian for each value of $r$. The transition is particularly clear in the gradient discontinuity of the magnetization. For $r = 1$ and 2 the heat capacity appears to exhibit a discontinuity at the transition, which differs from the divergence we expect on low-dimensional lattices but bears a resemblance to the behavior seen on Bethe lattices and other homogeneous tree-like networks [9, 52, 53].

FIG. 2: Results for the ferromagnetic Ising model in zero field as a function of temperature $T$ in a large network generated using the procedure described in Section III B. The top panel shows the average magnetization, while the bottom one shows the heat capacity and the entropy (the latter shifted up for visualization purposes). The magnitude of the leading eigenvalue of the Jacobian is also shown in the top panel for all three values of $r$, and we can see that the apparent positions of the phase transition, revealed by discontinuities in the physical quantities or their gradients, correspond closely to the temperatures at which the associated eigenvalues are equal to 1.
C. Real-world networks

For our next example we look at an application on a real-world network, where we do not expect the method to be exact though, as we will see, it nonetheless performs well. The network we examine has larger local neighborhoods than our synthetic example, which means we are not able to sum exhaustively over all configurations of the spins $s_{N_{j \setminus i}}$ in Eq. (12) (and similarly $s_{N_i}$ in Eq. (11)), so, as described in Section II C, we instead make use of Monte Carlo sampling to estimate the messages $q_{i \leftarrow j}$ and marginals $q_i$.

The summation over local spins in Eq. (12) is equivalent to computing the expectation in Eq. (15). To calculate $q_{i \leftarrow j}(s_j = y)$ we fix the values of its incoming messages $\{q_{j \leftarrow k}\}$ and perform Monte Carlo sampling over the states of the spins in the neighborhood $N_{j \setminus i}$ with the Hamiltonian of Eq. (9). Then we compute the average in Eq. (15) separately for the cases $y = 1$ and $-1$. The resulting estimates of $q_{i \leftarrow j}$ can then be used as incoming messages for calculating other messages in other neighborhoods. We perform the Monte Carlo using the Wolff cluster algorithm [51], which makes use of the Fortuin–Kasteleyn percolation representation of the Ising model to flip large clusters of spins simultaneously and can significantly reduce the time needed to obtain independent samples, particularly close to the critical point. Once the messages have converged to their final values we compute the marginals $q_i$ by performing a second Monte Carlo sampling, this time over the spins $s_{N_i}$ with the Hamiltonian of Eq. (8). More details on the procedure are given in Section A 2 of the appendix.

The Monte Carlo approach combines the best aspects of message passing and traditional Monte Carlo calculations. Message passing reduces the sums we need to perform to sets of spins much smaller than the entire network, while the Monte Carlo approach reduces dramatically the number of spin states that need to be evaluated. The approach has other advantages too. For instance, because of the small neighborhood sizes it shows improved performance in systems with substantial energy barriers that might otherwise impede ergodicity, such as antiferromagnetic systems. But perhaps its biggest advantage is that it effectively allows us to sample very large numbers of states of the network without taking very large samples of individual neighborhoods. If we sample $k$ configurations from one neighborhood and $k$ configurations from another, then in effect we are summing over $k^2$ possible combinations of states in the union of the two neighborhoods. Depending on the value of $r$, there are at least $2m$ neighborhoods $N_{j \setminus i}$ in a network, where $m$ is the number of edges, and hence we are effectively summing over at least $k^{2m}$ states overall, a number that increases exponentially with network size. Effective sample sizes far beyond anything possible with traditional Monte Carlo methods are thus easily reachable.

Figure 3 shows the results of applying these methods to a network from [54] representing the structure of an electric power grid, along with results from direct Monte Carlo simulations on the same network. As the figure shows, the magnetization is again poorly approximated by the traditional ($r = 0$) message passing algorithm, but improves as $r$ increases. In particular, the behavior in the region of the phase transition is quite poor for $r = 0$ and does not provide a good estimate of the position of the transition, but for $r = 1$ and 2 one can extract much better estimates, with the critical temperature falling somewhere in the region of $T = 1$.
FIG. 3: Results for the ferromagnetic Ising model in zero field as a function of temperature $T$ on a network representing the structure of an electric power grid. Again the message passing results approximate the real solution progressively better as $r$ grows larger.

There is also a much clearer phase transition in the message passing results than in the standard Monte Carlo, because of finite size effects in the latter. These results all suggest that for real systems our method can give substantial improvements over both ordinary belief propagation and direct Monte Carlo simulation, and in some cases show completely different behavior altogether.
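For reference, a minimal Wolff cluster update for an Ising subgraph in zero field, of the kind used for the within-neighborhood sampling described above (Python; the reweighting by messages, Eqs. (A22)–(A24) of the appendix, is layered on top of samples drawn this way):

```python
import numpy as np

def wolff_step(s, adj, beta, rng):
    """One Wolff cluster flip for a zero-field Ising model.
    s: dict node -> spin (+1/-1); adj: dict node -> list of neighbors."""
    p_add = 1.0 - np.exp(-2.0 * beta)         # bond activation probability
    seed = list(s)[rng.integers(len(s))]
    cluster, frontier = {seed}, [seed]
    while frontier:
        u = frontier.pop()
        for v in adj[u]:
            if v not in cluster and s[v] == s[u] and rng.random() < p_add:
                cluster.add(v)
                frontier.append(v)
    for u in cluster:                          # flip the whole cluster
        s[u] = -s[u]

# usage: call wolff_step repeatedly, recording the states s as samples
```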
IV. DISCUSSION

In this paper we have presented a new class of message passing algorithms for solving probabilistic models on networks that contain a high density of short loops. Taking the Ising model as an example, we have shown that our methods give substantially improved results in calculations of magnetization, heat capacity, entropy, marginal spin probabilities, and other quantities over standard message passing methods that do not account for the presence of loops. Our methods are exact on networks with short loops up to a fixed maximum length which we can choose, and can give good approximations on networks with loops of any length.

There are many ways in which the methods and results of this paper could be extended. We have studied only one application in detail, the Ising model, but the formalism we present is a general one that could be applied to many other models. In principle, any model with sparse pairwise interactions (i.e., interactions whose number scales sub-quadratically with the number of variables) could be studied using these methods. For example, there is a large class of generative models of networks in which each edge is a random variable, drawn independently from a distribution that depends on the properties of the adjacent nodes. Examples include the Chung–Lu model [55] and the stochastic block model and its variants [56, 57]. If we assume an observed network to be drawn from such a model then we can use statistical inference to estimate the values of hidden node attributes that influence edge probability, such as community membership. Our message passing methods could be applied to such inference calculations and could in principle give more accurate results in the common case where the observed network contains many short loops.

Another potential application in the realm of statistical inference is the inverse Ising model, the problem of inferring the parameters of an Ising or Ising-like model from an observed sequence of spin states, which has numerous applications including the reconstruction of neural pathways [58], the inference of protein structure [59], and correlations within financial markets [60]. It can be shown that the one- and two-point correlation functions of the observed spins are sufficient statistics to reliably estimate coupling and external field parameters [61] and our method could be used to compute these statistics on loopy networks to greater accuracy than with traditional message passing and faster than standard Monte Carlo simulation. Other potential applications, further afield from traditional statistical physics, include the solution of constraint satisfaction problems, coding theory, and combinatorial optimization.
Acknowledgments
This work was funded in part by the US Department of Defense NDSEG fellowship program (AK) and by the National Science Foundation under grants DMS–1710848 and DMS–2005899 (MEJN).

[1] M. E. J. Newman, Networks, 2nd ed., Oxford University Press, Oxford (2018).
[2] I. Z. Kiss, J. C. Miller, and P. L. Simon, Mathematics of Epidemics on Networks, Springer International Publishing, Berlin (2017).
[3] B. A. Carreras, V. E. Lynch, I. Dobson, and D. E. Newman, Dynamical and probabilistic approaches to the study of blackout vulnerability of the power transmission grid. Proceedings of the 37th Annual Hawaii International Conference on System Sciences, pp. 7–14, Institute of Electrical and Electronics Engineers, New York (2004).
[4] S. N. Dorogovtsev, A. V. Goltsev, and J. F. F. Mendes, Ising model on networks with an arbitrary distribution of connections. Phys. Rev. E, 016104 (2002).
[5] M. I. Jordan, Learning in Graphical Models, Kluwer Academic Publishers, Dordrecht (1998).
[6] D. Koller and N. Friedman, Probabilistic Graphical Models: Principles and Techniques, MIT Press, Cambridge, MA (2009).
[7] G. Mussardo, Statistical Field Theory: An Introduction to Exactly Solved Models in Statistical Physics, Oxford University Press, Oxford (2010).
[8] C. N. Yang, The spontaneous magnetization of a two-dimensional Ising model. Physical Review, 808 (1952).
[9] H. A. Bethe, Statistical theory of superlattices. Proc. R. Soc. London A, 552–575 (1935).
[10] M. Mezard and A. Montanari, Information, Physics, and Computation, Oxford University Press, Oxford (2009).
[11] D. J. Watts and S. H. Strogatz, Collective dynamics of small-world networks. Nature, 440 (1998).
[12] M. Eckstein, M. Kollar, K. Byczuk, and D. Vollhardt, Hopping on the Bethe lattice: Exact results for densities of states and dynamical mean-field theory. Phys. Rev. B, 235119 (2005).
[13] F. L. Metz, I. Neri, and D. Bollé, Spectra of sparse regular graphs with loops. Phys. Rev. E, 055101 (2011).
[14] D. Bollé, F. L. Metz, and I. Neri, On the spectra of large sparse graphs with cycles. Preprint arXiv:1206.1512 (2012).
[15] M. E. J. Newman, Random graphs with clustering. Phys. Rev. Lett., 058701 (2009).
[16] S. Yoon, A. V. Goltsev, S. N. Dorogovtsev, and J. Mendes, Belief-propagation algorithm and the Ising model on networks with arbitrary distributions of motifs. Physical Review E, 041144 (2011).
[17] A. Montanari and T. Rizzo, How to compute loop corrections to the Bethe approximation. Journal of Statistical Mechanics: Theory and Experiment, 10011 (2005).
[18] M. Chertkov and V. Y. Chernyak, Loop calculus in statistical physics and information science. Physical Review E, 065102 (2006).
[19] M. E. J. Newman and J. Park, Why social networks are different from other types of networks. Physical Review E, 036122 (2003).
[20] J. S. Yedidia, W. T. Freeman, and Y. Weiss, Generalized belief propagation. Proceedings of the 13th International Conference on Neural Information Processing Systems, pp. 668–674, MIT Press, Cambridge, MA (2000).
[21] M. Welling, On the choice of regions for generalized belief propagation. Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pp. 585–592, AUAI Press, Banff, CA (2004).
[22] G. T. Cantwell and M. E. J. Newman, Message passing on networks with loops. Proc. Natl. Acad. Sci. USA, 23398 (2019).
[23] W. O. Kermack and A. G. McKendrick, A contribution to the mathematical theory of epidemics. Proc. R. Soc. London A, 700–721 (1927).
[24] R. Durrett, Spatial Epidemic Models: Their Structure and Relation to Data, Cambridge University Press, Cambridge (1995).
[25] D. Stauffer and S. Solomon, Ising, Schelling and self-organising segregation. Eur. Phys. J. B, 473–479 (2007).
[26] A. Pelizzola, Cluster variation method in statistical physics and probabilistic graphical models. J. Phys. A, R309 (2005).
[27] F. Krzakala, F. Ricci-Tersenghi, L. Zdeborová, R. Zecchina, E. W. Tramel, and L. F. Cugliandolo, Statistical Physics, Optimization, Inference, and Message-passing Algorithms, Oxford University Press, Oxford (2016).
[28] S. Geman and C. Graffigne, Markov random field image models and their applications to computer vision. Proceedings of the International Congress of Mathematicians, p. 2, International Congress of Mathematicians, Berkeley, CA (1986).
[29] M. Yasuda and T. Horiguchi, Triangular approximation for Ising model and its application to Boltzmann machine. Physica A, 83–95 (2006).
[30] A. Decelle, F. Krzakala, C. Moore, and L. Zdeborová, Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications. Phys. Rev. E, 066106 (2011).
[31] W.-X. Zhou and D. Sornette, Self-organizing Ising model of financial markets. Eur. Phys. J. B, 175–181 (2007).
[32] S. Galam, Rational group decision making: A random field Ising model at T = 0. Physica A, 66–80 (1997).
[33] D. Stauffer, Social applications of two-dimensional Ising models. American Journal of Physics, 470 (2008).
[34] D. J. C. MacKay, Information Theory, Inference and Learning Algorithms, Cambridge University Press, Cambridge (2003).
[35] H. S. Migon, D. Gamerman, and F. Louzada, Statistical Inference: An Integrated Approach, CRC Press, Boca Raton, FL (2014).
[36] N. Friel and J. Wyse, Estimating the evidence—a review. Statistica Neerlandica, 288–308 (2012).
[37] S. Salinas, Introduction to Statistical Physics, Springer, New York (2001).
[38] R. J. Baxter, Exactly Solved Models in Statistical Mechanics, Academic Press, London (1982).
[39] C. E. Shannon, Prediction and entropy of printed English. Bell System Technical Journal, 50–64 (1951).
[40] W. Bialek, Biophysics: Searching for Principles, Princeton University Press, Princeton, NJ (2012).
[41] P. Cabral, G. Augusto, M. Tewolde, and Y. Araya, Entropy in urban systems. Entropy, 5223–5236 (2013).
[42] J. S. Yedidia, W. T. Freeman, and Y. Weiss, Understanding belief propagation and its generalizations. Exploring Artificial Intelligence in the New Millennium, 236–239 (2003).
[43] J. S. Yedidia, W. T. Freeman, and Y. Weiss, Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Transactions on Information Theory, 2282–2313 (2005).
[44] R. Kikuchi, A theory of cooperative phenomena. Physical Review, 988–1003 (1951).
[45] H. J. Kappen and W. Wiegerinck, Novel iteration schemes for the cluster variation method. Advances in Neural Information Processing Systems, pp. 415–422, MIT Press, Cambridge, MA (2002).
[46] P. Pakzad and V. Anantharam, Minimal graphical representation of Kikuchi regions. Proceedings of the Annual Allerton Conference on Communication, Control and Computing, pp. 1586–1595, University of Illinois, Champaign, IL (2002).
[47] J. Mooij and H. Kappen, On the properties of the Bethe approximation and loopy belief propagation on binary networks. J. Stat. Mech. P11012 (2005).
[48] F. Krzakala, C. Moore, E. Mossel, J. Neeman, A. Sly, L. Zdeborová, and P. Zhang, Spectral redemption in clustering sparse networks. Proc. Natl. Acad. Sci. USA, 20935–20940 (2013).
[49] B. Karrer, M. E. J. Newman, and L. Zdeborová, Percolation on sparse networks. Phys. Rev. Lett., 208702 (2014).
[50] A. Allard and L. Hébert-Dufresne, On the accuracy of message-passing approaches to percolation in complex networks. Preprint arXiv:1906.10377 (2019).
[51] U. Wolff, Collective Monte Carlo updating for spin systems. Phys. Rev. Lett., 361–364 (1989).
[52] F. Mancini and A. Naddeo, Equations-of-motion approach to the spin-1/2 Ising model on the Bethe lattice. Phys. Rev. E, 061108 (2006).
[53] S. N. Dorogovtsev, A. V. Goltsev, and J. F. Mendes, Critical phenomena in complex networks. Rev. Mod. Phys., 1275–1335 (2008).
[54] T. A. Davis and Y. Hu, The University of Florida sparse matrix collection. ACM Transactions on Mathematical Software, 1 (2011).
[55] F. Chung and L. Lu, The average distances in random graphs with given expected degrees. Proc. Natl. Acad. Sci. USA, 15879–15882 (2002).
[56] P. W. Holland, K. B. Laskey, and S. Leinhardt, Stochastic blockmodels: First steps. Social Networks, 109–137 (1983).
[57] B. Karrer and M. E. J. Newman, Stochastic blockmodels and community structure in networks. Phys. Rev. E, 016107 (2011).
[58] E. Schneidman, M. J. Berry II, R. Segev, and W. Bialek, Weak pairwise correlations imply strongly correlated network states in a neural population. Nature, 1007–1012 (2006).
[59] F. Morcos, A. Pagnani, B. Lunt, A. Bertolino, D. S. Marks, C. Sander, R. Zecchina, J. N. Onuchic, T. Hwa, and M. Weigt, Direct-coupling analysis of residue coevolution captures native contacts across many protein families. Proc. Natl. Acad. Sci. USA, 1293–1301 (2011).
[60] T. Bury, Market structure explained by pairwise interactions. Physica A, 1375–1385 (2013).
[61] H. C. Nguyen, R. Zecchina, and J. Berg, Inverse statistical problems: From the inverse Ising problem to data science. Advances in Physics, 197–261 (2017).

Appendix A: Supplementary information

1. Calculation of the heat capacity using message passing
The heat capacity, which is given by
$$C = \frac{dU}{dT} = -\beta^2 \frac{dU}{d\beta}, \qquad (A1)$$
can be calculated from the expression for the internal energy
$$U(\beta) = \frac{1}{2} \sum_{i \in V} \frac{1}{Z_i(\beta)} \sum_{s_{N_i}} H_{\partial i}(s_{\partial i})\, e^{-\beta H_{N_i}(s_{N_i})} \prod_{j \in N_i \setminus i} q^{i \leftarrow j}_{s_j}(\beta), \qquad (A2)$$
where instead of incorporating the $\beta$ dependence into the Hamiltonian as in the main paper, we now display it explicitly. In this expression, $N_i$ denotes the neighborhood of node $i$ as in the main text, $\partial i$ denotes the node $i$ and its immediately adjacent edges and nodes, and $H_{N_i}(s_{N_i})$ and $H_{\partial i}(s_{\partial i})$ represent the terms in the Hamiltonian for these subgraphs:
$$H_{N_i}(s_{N_i}) = -f_i(s_i|\theta_i) - \sum_{(j,k) \in N_i} g_{jk}(s_j, s_k|\omega_{jk}) \qquad (A3)$$
and
$$H_{\partial i}(s_{\partial i}) = -f_i(s_i|\theta_i) - \sum_{(i,j) \in \partial i} g_{ij}(s_i, s_j|\omega_{ij}), \qquad (A4)$$
with the $\beta$ dependence omitted from the definition of the functions. With the $\beta$ dependence written in this way the message passing equations take the form
$$q^i_x(\beta) = \frac{1}{Z_i(\beta)} \sum_{s_{N_i}} \delta_{s_i,x}\, e^{-\beta H_{N_i}(s_{N_i})} \prod_{j \in N_i \setminus i} q^{i \leftarrow j}_{s_j}(\beta) \qquad (A5)$$
and
$$q^{i \leftarrow j}_y(\beta) = \frac{1}{Z_{i \leftarrow j}(\beta)} \sum_{s_{N_{j \setminus i}}} \delta_{s_j,y}\, e^{-\beta H_{N_{j \setminus i}}(s_{N_{j \setminus i}})} \prod_{k \in N_{j \setminus i} \setminus j} q^{j \leftarrow k}_{s_k}(\beta), \qquad (A6)$$
with
$$Z_i(\beta) = \sum_{s_{N_i}} e^{-\beta H_{N_i}(s_{N_i})} \prod_{j \in N_i \setminus i} q^{i \leftarrow j}_{s_j}(\beta), \qquad (A7)$$
$$Z_{i \leftarrow j}(\beta) = \sum_{s_{N_{j \setminus i}}} e^{-\beta H_{N_{j \setminus i}}(s_{N_{j \setminus i}})} \prod_{k \in N_{j \setminus i} \setminus j} q^{j \leftarrow k}_{s_k}(\beta). \qquad (A8)$$
Differentiating (A6) with respect to $\beta$ and defining the quantity
$$\eta^{i \leftarrow j}_y = \frac{dq^{i \leftarrow j}_y}{d\beta}, \qquad (A9)$$
we get
$$\eta^{i \leftarrow j}_y = \frac{1}{Z_{i \leftarrow j}(\beta)} \sum_{s_{N_{j \setminus i}}} e^{-\beta H_{N_{j \setminus i}}(s_{N_{j \setminus i}})} \prod_{k \in N_{j \setminus i} \setminus j} q^{j \leftarrow k}_{s_k}(\beta) \times \biggl( \bigl[ q^{i \leftarrow j}_y(\beta) - \delta_{s_j,y} \bigr] H_{N_{j \setminus i}}(s_{N_{j \setminus i}}) + \bigl[ \delta_{s_j,y} - q^{i \leftarrow j}_y(\beta) \bigr] \sum_{k \in N_{j \setminus i} \setminus j} \frac{\eta^{j \leftarrow k}_{s_k}(\beta)}{q^{j \leftarrow k}_{s_k}(\beta)} \biggr), \qquad (A10)$$
which can be regarded as a new message passing equation for the derivative $\eta^{i \leftarrow j}_y$. To apply it, we first solve for the $q^{i \leftarrow j}_y(\beta)$ in the usual fashion, then fix their values and iterate (A10) from a suitable initial condition until convergence.

For large neighborhoods, where the sums over spin states cannot be performed exhaustively, the local Monte Carlo procedure described in the main text carries over naturally. We define
$$\langle A \rangle_{N_{j \setminus i}} = \frac{1}{Z_{i \leftarrow j}(\beta)} \sum_{s_{N_{j \setminus i}}} A(s_{N_{j \setminus i}})\, e^{-\beta H_{N_{j \setminus i}}(s_{N_{j \setminus i}})} \prod_{k \in N_{j \setminus i} \setminus j} q^{j \leftarrow k}_{s_k}(\beta) \qquad (A11)$$
and then rewrite Eq. (A10) as an average
$$\eta^{i \leftarrow j}_y = \biggl\langle \bigl[ q^{i \leftarrow j}_y(\beta) - \delta_{s_j,y} \bigr] H_{N_{j \setminus i}}(s_{N_{j \setminus i}}) + \bigl[ \delta_{s_j,y} - q^{i \leftarrow j}_y(\beta) \bigr] \sum_{k \in N_{j \setminus i} \setminus j} \frac{\eta^{j \leftarrow k}_{s_k}(\beta)}{q^{j \leftarrow k}_{s_k}(\beta)} \biggr\rangle_{N_{j \setminus i}}, \qquad (A12)$$
which can be evaluated using Monte Carlo sampling as previously.

We can also differentiate $Z_i(\beta)$, Eq. (A7), which yields
$$\frac{1}{Z_i(\beta)} \frac{dZ_i(\beta)}{d\beta} = \frac{1}{Z_i(\beta)} \sum_{s_{N_i}} e^{-\beta H_{N_i}(s_{N_i})} \prod_{j \in N_i \setminus i} q^{i \leftarrow j}_{s_j}(\beta) \times \biggl[ \sum_{j \in N_i \setminus i} \frac{1}{q^{i \leftarrow j}_{s_j}(\beta)} \frac{dq^{i \leftarrow j}_{s_j}(\beta)}{d\beta} - H_{N_i}(s_{N_i}) \biggr], \qquad (A13)$$
which can again be written as an average
$$\frac{1}{Z_i(\beta)} \frac{dZ_i(\beta)}{d\beta} = \biggl\langle \sum_{j \in N_i \setminus i} \frac{\eta^{i \leftarrow j}_{s_j}}{q^{i \leftarrow j}_{s_j}(\beta)} - H_{N_i}(s_{N_i}) \biggr\rangle_{N_i}, \qquad (A14)$$
where we have used a shorthand analogous to that of Eq. (A11):
$$\langle A \rangle_{N_i} = \frac{1}{Z_i(\beta)} \sum_{s_{N_i}} A(s_{N_i})\, e^{-\beta H_{N_i}(s_{N_i})} \prod_{j \in N_i \setminus i} q^{i \leftarrow j}_{s_j}(\beta). \qquad (A15)$$
(A10) and (A14), we now find after some manipulation that

  \frac{dU}{d\beta} = \frac{1}{2} \sum_{i \in V} \Bigl[ \bigl\langle H_{\partial i}(s_{\partial i}) \bigr\rangle_{N_i} \bigl\langle H_{N_i}(s_{N_i}) \bigr\rangle_{N_i} - \bigl\langle H_{\partial i}(s_{\partial i}) H_{N_i}(s_{N_i}) \bigr\rangle_{N_i} \Bigr] + \frac{1}{2} \sum_{i \in V} \Biggl[ \biggl\langle H_{\partial i}(s_{\partial i}) \sum_{j \in N_i \setminus i} \frac{\eta^{i \leftarrow j}_{s_j}}{q^{i \leftarrow j}_{s_j}} \biggr\rangle_{N_i} - \bigl\langle H_{\partial i}(s_{\partial i}) \bigr\rangle_{N_i} \biggl\langle \sum_{j \in N_i \setminus i} \frac{\eta^{i \leftarrow j}_{s_j}}{q^{i \leftarrow j}_{s_j}} \biggr\rangle_{N_i} \Biggr],  (A16)

which can be substituted into Eq. (A1) to calculate C.
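To make the procedure concrete, the following minimal Python sketch iterates Eq. (A10) for a small Ising model with local fields f_i(s_i) = -h_i s_i, taking r = 0 neighborhoods, for which the equations reduce to ordinary belief propagation and every neighborhood sum can be enumerated exhaustively. The example graph, parameter values, and helper names (H_local, sweep) are illustrative assumptions of ours, not quantities defined in the main text.

import itertools, math

# Small tree with fields: H = -J sum_{(i,j)} s_i s_j - sum_i h_i s_i
edges = [(0, 1), (1, 2), (1, 3)]
h = {0: 0.3, 1: -0.2, 2: 0.1, 3: 0.4}
J, beta = 1.0, 0.7
nbrs = {i: set() for i in h}
for a, b in edges:
    nbrs[a].add(b); nbrs[b].add(a)
msgs = [(i, j) for i in h for j in nbrs[i]]   # q[(i, j)] stands for q^{i<-j}

def H_local(j, i, s):
    # H_{N_j \ i} at r = 0: field term of j plus couplings to j's other neighbors
    return -h[j] * s[j] - sum(J * s[j] * s[k] for k in nbrs[j] - {i})

def sweep(q, eta=None):
    # One synchronous update of all messages: of q via (A6) if eta is None,
    # otherwise of eta via (A10), holding the converged q fixed.
    new = {}
    for i, j in msgs:
        ks = sorted(nbrs[j] - {i})
        Z = acc = 0.0
        for ss in itertools.product((1, -1), repeat=1 + len(ks)):
            s = {j: ss[0]}
            s.update(zip(ks, ss[1:]))
            w = math.exp(-beta * H_local(j, i, s))
            for k in ks:                      # message weights: q is Pr(spin = +1)
                w *= q[(j, k)] if s[k] == 1 else 1.0 - q[(j, k)]
            dj = 1.0 if s[j] == 1 else 0.0    # delta_{s_j,+1}
            Z += w
            if eta is None:
                acc += w * dj
            else:
                # eta^{j<-k}_{s_k} / q^{j<-k}_{s_k} is +eta/q for s_k = +1
                # and -eta/(1-q) for s_k = -1, since q_{-1} = 1 - q_{+1}
                grad = sum(eta[(j, k)] / q[(j, k)] if s[k] == 1
                           else -eta[(j, k)] / (1.0 - q[(j, k)]) for k in ks)
                acc += w * ((q[(i, j)] - dj) * H_local(j, i, s)
                            + (dj - q[(i, j)]) * grad)
        new[(i, j)] = acc / Z
    return new

q = {m: 0.5 for m in msgs}
for _ in range(100):
    q = sweep(q)                 # converge the messages q^{i<-j}
eta = {m: 0.0 for m in msgs}
for _ in range(100):
    eta = sweep(q, eta)          # then converge their derivatives via (A10)
for m in sorted(q):
    print(m, round(q[m], 6), round(eta[m], 6))

The converged values can be checked against the finite difference [q^{i←j}(β+ε) − q^{i←j}(β−ε)]/(2ε) and then substituted into Eqs. (A14)-(A16) to assemble dU/dβ, and hence C via Eq. (A1).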
2. Local Monte Carlo simulation for the Ising model
As discussed in the main text, when neighborhoods are too large to allow us to sum exhaustively over their states, we can approximate the message passing equations by Monte Carlo sampling. Taking again the example of the Ising model, the message passing equations are

  q_i = \frac{\sum_{s_{N_i}} \delta_{s_i,+1} \, e^{-\beta H_{N_i}(s_{N_i})} \prod_{j \in N_i \setminus i} q^{i \leftarrow j}_{s_j}}{\sum_{s_{N_i}} e^{-\beta H_{N_i}(s_{N_i})} \prod_{j \in N_i \setminus i} q^{i \leftarrow j}_{s_j}}, \qquad q^{i \leftarrow j} = \frac{\sum_{s_{N_j \setminus i}} \delta_{s_j,+1} \, e^{-\beta H_{N_j \setminus i}(s_{N_j \setminus i})} \prod_{k \in N_j \setminus i \setminus j} q^{j \leftarrow k}_{s_k}}{\sum_{s_{N_j \setminus i}} e^{-\beta H_{N_j \setminus i}(s_{N_j \setminus i})} \prod_{k \in N_j \setminus i \setminus j} q^{j \leftarrow k}_{s_k}},  (A17)

where the messages in this case represent the probability of the corresponding spin being +1. If we divide top and bottom by \sum_{s_{N_i}} e^{-\beta H_{N_i}(s_{N_i})} in the first equation and by \sum_{s_{N_j \setminus i}} e^{-\beta H_{N_j \setminus i}(s_{N_j \setminus i})} in the second, we get

  q_i = \frac{\sum_{s_{N_i}} e^{-\beta H_{N_i}(s_{N_i})} \bigl( \delta_{s_i,+1} \prod_{j \in N_i \setminus i} q^{i \leftarrow j}_{s_j} \bigr) \big/ \sum_{s_{N_i}} e^{-\beta H_{N_i}(s_{N_i})}}{\sum_{s_{N_i}} e^{-\beta H_{N_i}(s_{N_i})} \bigl( \prod_{j \in N_i \setminus i} q^{i \leftarrow j}_{s_j} \bigr) \big/ \sum_{s_{N_i}} e^{-\beta H_{N_i}(s_{N_i})}},  (A18)

  q^{i \leftarrow j} = \frac{\sum_{s_{N_j \setminus i}} e^{-\beta H_{N_j \setminus i}(s_{N_j \setminus i})} \bigl( \delta_{s_j,+1} \prod_{k \in N_j \setminus i \setminus j} q^{j \leftarrow k}_{s_k} \bigr) \big/ \sum_{s_{N_j \setminus i}} e^{-\beta H_{N_j \setminus i}(s_{N_j \setminus i})}}{\sum_{s_{N_j \setminus i}} e^{-\beta H_{N_j \setminus i}(s_{N_j \setminus i})} \bigl( \prod_{k \in N_j \setminus i \setminus j} q^{j \leftarrow k}_{s_k} \bigr) \big/ \sum_{s_{N_j \setminus i}} e^{-\beta H_{N_j \setminus i}(s_{N_j \setminus i})}}.  (A19)

Numerators and denominators now take the form of a Boltzmann average, but over the distributions defined by H_{N_i} and H_{N_j \setminus i} alone, which we can think of as a "zero-field" ensemble that omits the effect of the "external field" imposed by the messages. Defining the useful shorthand

  \langle A \rangle_{0,N_i} = \frac{\sum_{s_{N_i}} e^{-\beta H_{N_i}(s_{N_i})} A(s_{N_i})}{\sum_{s_{N_i}} e^{-\beta H_{N_i}(s_{N_i})}},  (A20)

  \langle A \rangle_{0,N_j \setminus i} = \frac{\sum_{s_{N_j \setminus i}} e^{-\beta H_{N_j \setminus i}(s_{N_j \setminus i})} A(s_{N_j \setminus i})}{\sum_{s_{N_j \setminus i}} e^{-\beta H_{N_j \setminus i}(s_{N_j \setminus i})}},  (A21)

we can then write the message passing equations in the form

  q_i = \frac{\bigl\langle \delta_{s_i,+1} \prod_{j \in N_i \setminus i} q^{i \leftarrow j}_{s_j} \bigr\rangle_{0,N_i}}{\bigl\langle \prod_{j \in N_i \setminus i} q^{i \leftarrow j}_{s_j} \bigr\rangle_{0,N_i}},  (A22)

  q^{i \leftarrow j} = \frac{\bigl\langle \delta_{s_j,+1} \prod_{k \in N_j \setminus i \setminus j} q^{j \leftarrow k}_{s_k} \bigr\rangle_{0,N_j \setminus i}}{\bigl\langle \prod_{k \in N_j \setminus i \setminus j} q^{j \leftarrow k}_{s_k} \bigr\rangle_{0,N_j \setminus i}},  (A23)

where the subscript "0" serves to remind us that the expectation is over the zero-field ensemble. Expressing the equations as zero-field expectations allows us to evaluate them using the Wolff algorithm, which is highly efficient in this context.

We can further speed up sampling by making use of the up-down symmetry of the zero-field ensemble, which effectively gives us two samples for every spin state. If we obtain a set of samples {s_N} by sampling from the zero-field ensemble, then because of symmetry {-s_N} are also correct samples that would have occurred with the same probability. Including these additional samples explicitly in the message passing equations gives

  q^{i \leftarrow j} = \frac{\bigl\langle \delta_{s_j,+1} \prod_{k \in N_j \setminus i \setminus j} q^{j \leftarrow k}_{s_k} + \delta_{-s_j,+1} \prod_{k \in N_j \setminus i \setminus j} \bigl(1 - q^{j \leftarrow k}_{s_k}\bigr) \bigr\rangle_{0,N_j \setminus i}}{\bigl\langle \prod_{k \in N_j \setminus i \setminus j} q^{j \leftarrow k}_{s_k} + \prod_{k \in N_j \setminus i \setminus j} \bigl(1 - q^{j \leftarrow k}_{s_k}\bigr) \bigr\rangle_{0,N_j \setminus i}},  (A24)

and corresponding expressions can be derived for any expectation.
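As an illustration of Eq. (A24), the following minimal sketch estimates a single message for a star-shaped neighborhood by sampling the zero-field ensemble. The main text recommends the Wolff algorithm; purely for brevity this sketch substitutes single-spin Metropolis updates, and the neighborhood size, incoming messages, and parameter values are illustrative assumptions.

import math, random

random.seed(1)
beta, J = 0.4, 1.0
ks = [0, 1, 2, 3]                          # the spins k in N_j \ i \ j
q_in = {0: 0.6, 1: 0.7, 2: 0.55, 3: 0.8}   # incoming messages q^{j<-k}

def energy(sj, s):
    # zero-field H_{N_j \ i} for a star neighborhood: -J s_j sum_k s_k
    return -J * sj * sum(s)

sj, s = 1, [1] * len(ks)
num = den = 0.0
for step in range(200000):
    # single-spin Metropolis move in the zero-field ensemble
    pick = random.randrange(len(ks) + 1)
    if pick == len(ks):
        new_sj, new_s = -sj, s
    else:
        new_sj, new_s = sj, s[:pick] + [-s[pick]] + s[pick + 1:]
    dE = energy(new_sj, new_s) - energy(sj, s)
    if dE <= 0 or random.random() < math.exp(-beta * dE):
        sj, s = new_sj, new_s
    if step < 1000:
        continue                           # discard burn-in samples
    # symmetrized estimator of Eq. (A24): each sample also counts as its
    # global flip, whose message weight swaps q <-> 1 - q
    up = math.prod(q_in[k] if s[k] == 1 else 1 - q_in[k] for k in ks)
    dn = math.prod(1 - q_in[k] if s[k] == 1 else q_in[k] for k in ks)
    num += (sj == 1) * up + (sj == -1) * dn
    den += up + dn
print("q^{i<-j} estimate:", num / den)

For a neighborhood this small the estimate can be checked against exhaustive enumeration of the 2^5 spin states. The symmetrized estimator typically converges faster than the naive one, since every sampled configuration contributes on both sides of the up-down symmetry.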
3. The Jacobian at the critical point
In the main text we used the leading eigenvalue of the Jacobian of the message passing iteration at the trivial fixed point to locate the position of the phase transition. Taking the Ising model as our example once again, the calculation is as follows.

The message passing equations can be rewritten as

  q^{i \leftarrow j} = \frac{1}{Z_{i \leftarrow j}} \sum_{s_{N_j \setminus i}} \tfrac12 (1 + s_j) \, e^{-\beta H_{N_j \setminus i}(s_{N_j \setminus i})} \prod_{k \in N_j \setminus i \setminus j} \bigl[ \tfrac12 (1 - s_k) + s_k q^{j \leftarrow k} \bigr],  (A25)

where

  Z_{i \leftarrow j} = \sum_{s_{N_j \setminus i}} e^{-\beta H_{N_j \setminus i}(s_{N_j \setminus i})} \prod_{k \in N_j \setminus i \setminus j} \bigl[ \tfrac12 (1 - s_k) + s_k q^{j \leftarrow k} \bigr].  (A26)

Considering the sum over spins as a local average again, the elements of the Jacobian are then given by

  \frac{\partial q^{i \leftarrow j}}{\partial q^{\mu \leftarrow \nu}} = \mathbb{1}\{\mu = j, \nu \in N_j \setminus i\} \Biggl[ \biggl\langle \frac{(1 + s_j) s_\nu}{1 - s_\nu + 2 s_\nu q^{\mu \leftarrow \nu}} \biggr\rangle_{N_j \setminus i} - \bigl\langle 1 + s_j \bigr\rangle_{N_j \setminus i} \biggl\langle \frac{s_\nu}{1 - s_\nu + 2 s_\nu q^{\mu \leftarrow \nu}} \biggr\rangle_{N_j \setminus i} \Biggr],  (A27)

where \mathbb{1}\{...\} is the indicator function and we have used the shorthand from Eq. (A11) again. Now evaluating this expression at the trivial fixed point q^{j \leftarrow k} = 1/2 for all j, k (which we write simply as q = 1/2), we get the Jacobian

  J_{j \to i, \nu \to \mu} = \frac{\partial q^{i \leftarrow j}}{\partial q^{\mu \leftarrow \nu}} \bigg|_{q = 1/2} = \tilde{B}_{j \to i, \nu \to \mu} \, D_{j \to i, \nu \to \mu},  (A28)

where \tilde{B} is a generalization of the non-backtracking matrix, given by

  \tilde{B}_{j \to i, \nu \to \mu} = \begin{cases} 1 & \text{if } \mu = j \text{ and } \nu \in N_j \setminus i, \\ 0 & \text{otherwise}, \end{cases}  (A29)

and

  D_{j \to i, \nu \to \mu} = \frac{\sum_{s_{N_j \setminus i}} s_\mu s_\nu \, e^{-\beta H_{N_j \setminus i}(s_{N_j \setminus i})}}{\sum_{s_{N_j \setminus i}} e^{-\beta H_{N_j \setminus i}(s_{N_j \setminus i})}} - \frac{\sum_{s_{N_j \setminus i}} s_\mu \, e^{-\beta H_{N_j \setminus i}(s_{N_j \setminus i})}}{\sum_{s_{N_j \setminus i}} e^{-\beta H_{N_j \setminus i}(s_{N_j \setminus i})}} \times \frac{\sum_{s_{N_j \setminus i}} s_\nu \, e^{-\beta H_{N_j \setminus i}(s_{N_j \setminus i})}}{\sum_{s_{N_j \setminus i}} e^{-\beta H_{N_j \setminus i}(s_{N_j \setminus i})}},  (A30)

which we note is temperature dependent. Using the shorthand of Eq. (A21), D can also be written in the simpler form

  D_{j \to i, \nu \to \mu} = \langle s_\mu s_\nu \rangle_{0,N_j \setminus i} - \langle s_\mu \rangle_{0,N_j \setminus i} \langle s_\nu \rangle_{0,N_j \setminus i}.  (A31)

At the temperature where the magnitude of the leading eigenvalue \lambda_{max} of J equals 1 at the trivial fixed point, the fixed point changes from stable to unstable, which corresponds to the phase transition as described in the main text. Thus we can locate the phase transition by evaluating the matrices \tilde{B} and D numerically and using them to compute |\lambda_{max}|. Note that the expectations in Eq. (A31) do not depend on the values of the messages, so we do not need to perform message passing to calculate them: evaluating the Jacobian and locating the phase transition requires us only to perform the sums over neighborhoods, or to approximate them using local Monte Carlo.
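The following sketch assembles the Jacobian of Eq. (A28) for a small example and locates the transition by bisection on |\lambda_{max}| = 1. For simplicity it uses r = 0 neighborhoods, for which every D entry reduces to the zero-field correlation tanh(βJ) and the condition reproduces the standard belief propagation estimate tanh(βJ) λ_max(B) = 1, with B the ordinary non-backtracking matrix; the choice of graph (a complete graph on four nodes, for which λ_max(B) = 2) and all parameter values are illustrative.

import itertools, math
import numpy as np

edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]   # complete graph K4
J = 1.0
nbrs = {}
for a, b in edges:
    nbrs.setdefault(a, set()).add(b); nbrs.setdefault(b, set()).add(a)
dir_edges = [(j, i) for i in nbrs for j in nbrs[i]]        # message j -> i
idx = {e: n for n, e in enumerate(dir_edges)}

def corr(j, i, nu, beta):
    # D entry <s_j s_nu>_0 - <s_j>_0 <s_nu>_0 over the zero-field star N_j \ i
    ks = sorted(nbrs[j] - {i})
    Z = sjsnu = sj = snu = 0.0
    for ss in itertools.product((1, -1), repeat=1 + len(ks)):
        s = {j: ss[0]}
        s.update(zip(ks, ss[1:]))
        w = math.exp(beta * J * s[j] * sum(s[k] for k in ks))
        Z += w; sj += w * s[j]; snu += w * s[nu]; sjsnu += w * s[j] * s[nu]
    return sjsnu / Z - (sj / Z) * (snu / Z)

def lam_max(beta):
    # Jacobian (A28): nonzero entries only where mu = j and nu in N_j \ i
    Jm = np.zeros((len(dir_edges), len(dir_edges)))
    for (j, i), row in idx.items():
        for nu in nbrs[j] - {i}:
            Jm[row, idx[(nu, j)]] = corr(j, i, nu, beta)
    return max(abs(np.linalg.eigvals(Jm)))

lo, hi = 0.01, 2.0                # bisect on |lambda_max(beta)| = 1
for _ in range(40):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if lam_max(mid) < 1.0 else (lo, mid)
print("estimated critical beta:", 0.5 * (lo + hi))
print("BP prediction atanh(1/2):", math.atanh(0.5))

Because the correlations in Eq. (A31) are computed here by enumerating each neighborhood directly, the same structure carries over to r > 0 by enlarging the neighborhoods, with local Monte Carlo replacing the enumeration when they grow too large.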
4. Proof of neighborhood-level factorization
In the calculation of the partition function and entropy in Section II D of the main text we make use of the factorized form

  P(s) = \frac{\prod_{i \in G} P(s_{N_i})}{\prod_{((i,j)) \in G} P(s_{\cap_{ij}})^{2/|\cap_{ij}|}},  (A32)

where \cap_{ij} = N_i \cap N_j and ((i,j)) denotes pairs of nodes that are contained in each other's neighborhoods, i.e., nodes i and j such that i \in N_j and j \in N_i. This form is derived as follows.

Consider Fig. 4, which illustrates the definitions of the sets of nodes we use and their intersections.

FIG. 4: Neighborhoods and various related quantities for a node i in an example network. In this example we assume that r = 2 is sufficient to capture all primitive cycles and thus that calculations at r = 2 are exact. (a) The neighborhood N_i = N_i^{(2)} contains the edges and nodes shown in solid black. (b) At node i there are two distinct intersections, \cap_{im} = N_i \cap N_m and \cap_{ij} = N_i \cap N_j. Note that the intersections for all pairs of nodes in \cap_{ij} are identical; for instance, in this example we have \cap_{ij} = \cap_{ik} = \cap_{il} = \cap_{jk} = \cap_{jl} = \cap_{lk}. (c) The subgraph G_{i \to j} is the connected component to which j belongs after all edges in N_i are removed, and similarly for G_{i \to m}.

As shown in panel (b) of the figure, many of the sets are equivalent to one another. Specifically, for any pair k, l \in \cap_{ij} we have \cap_{kl} = \cap_{ij}. This allows us to write

  P(s_{\cap_{ij}}) = \Bigl[ \prod_{(k,l) \in \cap_{ij}} P(s_{\cap_{kl}}) \Bigr]^{1/\binom{|\cap_{ij}|}{2}} = \prod_{(k,l) \in \cap_{ij}} P(s_{\cap_{kl}})^{1/\binom{|\cap_{kl}|}{2}},  (A33)

where the product is over all \binom{|\cap_{ij}|}{2} pairs \{k, l\} \subseteq \cap_{ij}.

A proof of Eq. (A32) can then be achieved by induction. Assume that the formula is correct for all networks with fewer than n nodes and no primitive cycles longer than r + 2. If G is a network with n nodes and no primitive cycles longer than r + 2, then

  P(s) = P(s_{N_i}) \prod_{j \in N_i} P(s_{N_j} | s_{N_i}) \, P(s_{G_{i \to j}} | s_{N_j}) = P(s_{N_i}) \prod_{j \in N_i} \frac{P(s_{N_j})}{P(s_{\cap_{ij}})} \, P(s_{G_{i \to j}} | s_{N_j \setminus N_i}),  (A34)

where G_{i \to j} denotes the connected subgraph to which j belongs after all edges in N_i have been removed (see Fig. 4).
Since by definition the G_{i \to j} have fewer than n nodes and no primitive cycles longer than r + 2, Eq. (A32) is by hypothesis true for these subgraphs, and using (A33) we have

  P(s) = P(s_{N_i}) \prod_{j \in N_i} \frac{\prod_{k \in G_{i \to j}} P(s_{N_k})}{\prod_{(k,l) \in \cap_{ij}} P(s_{\cap_{kl}})^{1/\binom{|\cap_{kl}|}{2}} \prod_{((k,l)) \in G_{i \to j}} P(s_{\cap_{kl}})^{2/|\cap_{kl}|}} = \frac{\prod_{i \in G} P(s_{N_i})}{\prod_{((i,j)) \in G} P(s_{\cap_{ij}})^{2/|\cap_{ij}|}}.  (A35)

The base case is a graph with a single node, for which (A32) is trivially true, and hence by induction (A32) is true for all networks that have no primitive cycles longer than r + 2.

For the purposes of the calculation presented in Section II D, Eq. (A32) can be further simplified by noting that

  P(s_{N_i}) = P(s_i) \prod_{j \in N_i} P(s_{\cap_{ij}} | s_i)^{1/(|\cap_{ij}| - 1)} = P(s_i) \prod_{j \in N_i} \biggl[ \frac{P(s_{\cap_{ij}})}{P(s_i)} \biggr]^{1/(|\cap_{ij}| - 1)}.  (A36)

Substituting this result into (A32) then yields

  P(s) = \prod_{((i,j)) \in G} P(s_{\cap_{ij}})^{1/\binom{|\cap_{ij}|}{2}} \prod_{(i,j) \in G} P(s_i, s_j)^{W_{ij}} \prod_{i \in G} P(s_i)^{C_i},  (A37)

where

  W_{ij} = 1 - \sum_{((l,m)) \in G} \binom{|\cap_{lm}|}{2}^{-1} \mathbb{1}\{(i,j) \in \cap_{lm}\}  (A38)

and

  C_i = 1 - \sum_{j \in N_i} \frac{1}{|\cap_{ij}| - 1} - \sum_{j \in N_i^{(0)}} W_{ij},  (A39)

with N_i^{(0)} denoting the r = 0 neighborhood of i, i.e., its immediately adjacent nodes. The one- and two-spin marginals P(s_i) and P(s_i, s_j) can be calculated using the message passing methods described in the text, while the intersection marginal P(s_{\cap_{ij}}) is given by

  P(s_{\cap_{ij}}) = \frac{1}{Z_{\cap_{ij}}} e^{-\beta H(s_{\cap_{ij}})} \, q^{i \leftarrow j}(s_j) \prod_{k \in \cap_{ij} \setminus j} q^{j \leftarrow k}(s_k),  (A40)

where H(s_{\cap_{ij}}) denotes the terms of the full Hamiltonian that fall in \cap_{ij} and Z_{\cap_{ij}} is the corresponding normalizing constant.

Equation (A37) is exact when the network contains no primitive cycles longer than r + 2, in which case W_{ij} = 0. When there are longer primitive cycles (and hence Eq. (A32) is not exact), the terms P(s_i, s_j)^{W_{ij}} no longer vanish and act as corrections to the approximation.
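As a concrete check of Eq. (A32), the following sketch verifies the factorization by exhaustive enumeration for an Ising model on a small tree with r = 0, where each neighborhood N_i is the star around node i, the pairs ((i,j)) are the edges, and every intersection \cap_{ij} consists of the two endpoints of an edge, so that the exponent 2/|\cap_{ij}| equals 1. The graph and parameter values are illustrative.

import itertools, math

edges = [(0, 1), (1, 2), (1, 3), (3, 4)]
nodes = range(5)
beta, J = 0.6, 1.0
nbrs = {i: set() for i in nodes}
for a, b in edges:
    nbrs[a].add(b); nbrs[b].add(a)

# Exact P(s) for H = -J sum_{(i,j)} s_i s_j, by exhaustive enumeration
P = {}
for s in itertools.product((1, -1), repeat=len(nodes)):
    P[s] = math.exp(beta * J * sum(s[a] * s[b] for a, b in edges))
Ztot = sum(P.values())
P = {s: w / Ztot for s, w in P.items()}

def marginal(subset):
    # marginal distribution of the spins in `subset` (a sorted tuple of nodes)
    m = {}
    for s, p in P.items():
        key = tuple(s[i] for i in subset)
        m[key] = m.get(key, 0.0) + p
    return m

stars = {i: tuple(sorted({i} | nbrs[i])) for i in nodes}   # N_i at r = 0
P_star = {i: marginal(stars[i]) for i in nodes}
P_edge = {e: marginal(e) for e in edges}    # intersections: |cap_ij| = 2 on a tree

for s in itertools.product((1, -1), repeat=len(nodes)):
    rhs = 1.0
    for i in nodes:                         # product of neighborhood marginals
        rhs *= P_star[i][tuple(s[k] for k in stars[i])]
    for e in edges:                         # divide by P(s_cap)^{2/|cap|} = P(s_i, s_j)
        rhs /= P_edge[e][tuple(s[k] for k in e)]
    assert abs(rhs - P[s]) < 1e-10
print("factorization (A32) holds exactly on this tree")

On a network with primitive cycles longer than r + 2 the assertion would fail, and the residual is what the correction terms P(s_i, s_j)^{W_{ij}} in Eq. (A37) are designed to absorb.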