A Probability Density Theory for Spin-Glass Systems

Abstract

Spin-glass systems are universal models for representing many-body phenomena in statistical physics and computer science. High quality solutions of NP-hard combinatorial optimization problems can be encoded into low energy states of spin-glass systems. In general, evaluating the relevant physical and computational properties of such models is difficult due to critical slowing down near a phase transition. Ideally, one could use recent advances in deep learning for characterizing the low-energy properties of these complex systems. Unfortunately, many of the most promising machine learning approaches are only valid for distributions over continuous variables and thus cannot be directly applied to discrete spin-glass models. To this end, we develop a continuous probability density theory for spin-glass systems with arbitrary dimensions, interactions, and local fields. We show how our formulation geometrically encodes key physical and computational properties of the spin-glass in an instance-wise fashion without the need for quenched disorder averaging. We show that our approach is beyond the mean-field theory and identify a transition from a convex to non-convex energy landscape as the temperature is lowered past a critical temperature. We apply our formalism to a number of spin-glass models including the Sherrington-Kirkpatrick (SK) model, spins on random Erdős-Rényi graphs, and random restricted Boltzmann machines.


Gavin S. Hartnett

The RAND Corporation

Masoud Mohseni

Google Research


Spin-glasses are a general class of models which can be used to study complexity in physics, chemistry, biology, computer science, and social sciences [1]. They also provide a theoretical and phenomenological framework to analyze hard real-world problems in discrete optimization and probabilistic inference over graphical models [2]. At the heart of such complex phenomena is the emergent behavior that can occur when disordered systems contain many particles, variables, or agents which exert two-body or higher order interactions. Today, there is a fundamental gap in our knowledge of how such non-trivial correlations emerge at low temperatures. For these systems there is a sudden increase in correlations, occurring simultaneously at various scales (up to the overall system size), as the temperature is reduced below a critical threshold. After half a century of intense study, it is not yet fully understood why, or under what conditions, a distribution of small to large clusters of variables can become rigid or frozen below a critical point in a hierarchical or multi-scale fashion in the absence of any obvious symmetries. Additionally, the relaxation time-scales grow exponentially large as a function of the correlation length-scales, which in the worst case prevents the system from achieving equilibrium in finite time. This phenomenon is at the heart of the hardness of combinatorial optimization problems, such as random K-SAT, near computational phase transitions [3].

Our main motivation is to explore the critical and low temperature properties of spin-glass systems that encode practical computational problems, which are typically at the intermediate scales with respect to the number of variables, range of physical interactions, and spatial dimensions. The spin-glass formulation of such problems often involves thousands or even millions of variables, which precludes any hope of successfully applying brute-force or ab initio methods. A given instance of these problems typically contains considerable structure, with an underlying graph that could have a power-law distribution over the degree of connectivity with a fat tail for variables with many long-range physical interactions. Such realistic spin-glasses are also often in an intermediate zone with respect to their fractal dimensions and their physical and computational properties, and may be thought of as lying between the two well-studied limiting cases of the short-range Edwards-Anderson model [2] and the infinite-range Sherrington-Kirkpatrick (SK) model [4]. Consequently, spin-glass representations of most interesting and relevant problems reside in an uncharted territory that is analytically and computationally intractable. Although the disorder for each instance can be considered fixed or "quenched" for the relevant time-scales, the self-averaging assumption in statistical physics nonetheless becomes inadequate. Mean-field techniques, which can otherwise be successfully applied to toy model problems such as random energy models [2] or p-spin models [5], become invalid as the fluctuations over the mean values are typically large. Moreover, Renormalization Group (RG) techniques [6] are ineffective as these approaches rely on strong symmetry assumptions, which are difficult to set up for a particular problem class, not well-defined in the presence of strong inhomogeneities, and usually involve crude and irreversible coarse-graining of the microscopic degrees of freedom.

Recent advances in deep learning open up the possibility that these non-linear and non-perturbative emergent properties of spin-glass systems could be machine-learned.
Unfortunately, many of the most promising machine learning approaches, such as gradient-based iterative optimization, are only valid for distributions over continuous variables, and thus they either cannot be directly applied to discrete spin-glass systems, or they can be applied only at the cost of ignoring the fact that the machine learning algorithm was developed specifically for distributions over continuous variables. Despite this, there has been some progress in using neural networks for discovering new phases of matter or accelerating Monte Carlo sampling [7, 8, 9, 10, 11]. Here, we are interested in eventually applying recent techniques in deep generative models, such as normalizing flows [12], to discrete spin-glass distributions described by the following family of Hamiltonians

$H = -\sum_i h_i s_i - \sum_{i}$ […] $\Delta_{\min}$, with $\Delta_{\min} := \max(0, -\lambda_1(J))$, (7)

where $\lambda_1(J) \le \dots \le \lambda_N(J)$ denote the ordered eigenvalues of the matrix $J$. In other words, if $J$ is already positive definite, then $\Delta$ can be set to zero. Otherwise, $\Delta$ must be at least as large as the magnitude of the smallest (most negative) eigenvalue of $J$. Throughout this work, we will set $\Delta = \max(0, \epsilon - \lambda_1(J))$, where $0 < \epsilon \ll 1$ is an arbitrarily small positive number meant to ensure positive definiteness, rather than positive semi-definiteness.

The central starting assumption used in the derivation of Eq. 3 is that the distribution of the continuous variable $x$, given a discrete Ising spin configuration $s$, be given by a multi-variate Gaussian centered around $s$ and with covariance matrix proportional to the inverse of the shifted coupling matrix, i.e.

$$ p(x|s) = \mathcal{N}(s, \Sigma), \tag{8} $$

$$ \mu = s, \qquad \Sigma = (\beta \tilde{J})^{-1}. \tag{9} $$

Therefore, the joint distribution $p(x, s) = p(x|s)\,p(s)$ is itself a multi-variate Gaussian with a prior given by the original discrete distribution. Moreover, the above choice of covariance matrix has the important property that the quadratic term in $s$ vanishes in the exponent of the joint distribution expression:

$$ p(x, s) = Z_x^{-1} \exp\left[ -\beta \left( \tfrac{1}{2} x^T \tilde{J} x - s^T \tilde{J} x - h^T s \right) \right], \tag{10} $$

where we have combined the normalization of the Gaussian with the discrete partition function in order to form $Z_x$, and where we have employed matrix notation for convenience. This allows $s$ to be trivially marginalized over, which leads to $p(x) = e^{-\beta H_\beta(x)}/Z_x$, with $H_\beta(x)$ given by Eq. 4 above. In general, temperature-dependent interactions can arise when degrees of freedom are integrated out, and this happens here as well - hence the subscript.

[Footnote] We call $H_\beta(x)$ a density in order to emphasize how it transforms under a change of variable. For a change of variable $x' = x'(x)$, the Hamiltonian density transforms as $H_\beta(x') = H_\beta(x) - \ln\det(\partial x'/\partial x)/\beta$, where $\partial x'/\partial x$ is the Jacobian matrix. The term density is not meant to indicate that $H_\beta(x)$ is a per-site quantity.

Similarly, the conditional distribution $p(s|x)$ factorizes over each site, $p(s|x) = \prod_{i=1}^{N} p(s_i|x)$, with

$$ p(s_i|x) = \frac{\exp\left(\beta \tilde{h}_i(x)\, s_i\right)}{2\cosh\left(\beta \tilde{h}_i(x)\right)}. \tag{11} $$

Just as the joint distribution may be interpreted as a mixture of Gaussians by writing $p(x, s) = p(x|s)\,p(s)$, the above expression allows for an additional interpretation where the joint distribution is given by a product of the sigmoidal-like per-site distributions with a prior given by the continuous probability density, i.e. $p(x, s) = \prod_i p(s_i|x)\, p(x)$.

Rather than starting with a Gaussian conditional distribution and then calculating the joint distribution, an alternative derivation would be to employ the Hubbard-Stratonovich transformation with $x$ now interpreted as the integration variable.
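
The probabilistic map defined by Eqs. 8, 9 and 11 can be illustrated with a minimal NumPy sketch. It is not the authors' code: the function names, the two-spin coupling and the values of `beta` and `eps` are illustrative choices, and the effective field is assumed to be $\tilde{h}_i(x) = h_i + (\tilde{J}x)_i$, which is the form implied by the exponent of Eq. 10.

```python
import numpy as np

def shifted_coupling(J, eps=1e-3):
    """Return (J_tilde, delta) with delta = max(0, eps - lambda_1(J))."""
    lam_min = np.linalg.eigvalsh(J)[0]  # eigvalsh returns ascending eigenvalues
    delta = max(0.0, eps - lam_min)
    return J + delta * np.eye(J.shape[0]), delta

def sample_x_given_s(s, J_tilde, beta, rng):
    """Draw x ~ N(s, Sigma) with Sigma = (beta * J_tilde)^{-1} (Eqs. 8-9)."""
    # Cholesky factor of the precision matrix: beta * J_tilde = L L^T.
    L = np.linalg.cholesky(beta * J_tilde)
    z = rng.standard_normal(len(s))
    # Solving L^T y = z yields y with covariance (L L^T)^{-1} = Sigma.
    return s + np.linalg.solve(L.T, z)

def prob_up_given_x(x, J_tilde, h, beta):
    """p(s_i = +1 | x) from Eq. 11, written as (1 + tanh(beta * h_eff_i)) / 2."""
    h_eff = h + J_tilde @ x  # ASSUMPTION: effective field h_i + (J_tilde x)_i
    return 0.5 * (1.0 + np.tanh(beta * h_eff))

def sample_s_given_x(x, J_tilde, h, beta, rng):
    """Draw each spin independently, since p(s | x) factorizes over sites."""
    p_up = prob_up_given_x(x, J_tilde, h, beta)
    return np.where(rng.random(len(x)) < p_up, 1.0, -1.0)

rng = np.random.default_rng(0)
J = np.array([[0.0, 0.5], [0.5, 0.0]])  # illustrative two-spin coupling
h = np.zeros(2)
beta = 2.0
J_tilde, delta = shifted_coupling(J, eps=0.1)

s = np.array([1.0, -1.0])
x = sample_x_given_s(s, J_tilde, beta, rng)          # discrete -> continuous
s_new = sample_s_given_x(x, J_tilde, h, beta, rng)   # continuous -> discrete
```

Sampling through the Cholesky factor of the precision matrix avoids forming $\Sigma$ explicitly, which may matter when $\epsilon$ is small and $\Sigma$ is poorly conditioned.
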
The joint distribution may then be defined as the product of the original distribution and the Gaussian integrand:

$$ p(s) \int d^N x \, \frac{\exp\left( -\tfrac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)}{\sqrt{(2\pi)^N \det \Sigma}} \triangleq \int d^N x \, p(x, s). \tag{12} $$

Typically, the Hubbard-Stratonovich transformation is associated with mean-field and/or the replica approach and is applied only after averaging over the disorder. Here, no disorder average has been performed, no replicas have been introduced, and no approximations have been made.

There is a probabilistic map between the continuous and discrete formulations. Importantly, both conditional distributions $p(x|s)$ and $p(s|x)$ are easy to sample from. Given a collection of discrete spin configurations $s$, perhaps obtained through Monte Carlo techniques, a corresponding collection of continuous configurations $x$ may be obtained using the conditional distribution $p(x|s)$ (and vice versa). The free energies of each formulation may be related to one another via

$$ \ln Z_x = \ln Z_s + \frac{N}{2} \ln(2\pi) - \frac{1}{2} \ln\det(\beta \tilde{J}) + \frac{N \beta \Delta}{2}. \tag{13} $$

This expression may be derived by equating the joint distribution $p(x, s)$ (Eq. 10) with $p(x|s)\,p(s)$. The second and third terms are simply due to the normalization of the multi-variate Gaussian in $p(x|s)$, and the last term is due to the fact that $s^T s = N$ for Ising spins. The fact that the continuous and discrete formulations are related in this way indicates that there is no "free lunch" here - one cannot use the continuous formulation to circumvent the problems associated with complex spin-glass distributions - for example, the hardness in sampling or the evaluation of the partition function.

The partition function is a generating function for the $n$-point correlation functions. Denoting the usual thermodynamic ensemble over discrete spins as $\langle \cdot \rangle_s := \sum_{\{s_i\}} \left( e^{-\beta H} \, \cdot \right)/Z_s$, and the analogous ensemble over continuous configurations as $\langle \cdot \rangle_x := \int d^N x \left( e^{-\beta H_\beta(x)} \, \cdot \right)/Z_x$, then by applying $\partial_{h_{i_1}} \cdots \partial_{h_{i_p}}$ to each side of Eq. 13, the connected correlation functions of the two ensembles may be related through:

$$ \langle s_{i_1} \cdots s_{i_p} \rangle_{s,C} = \left\langle \tanh(\beta \tilde{h}_{i_1}(x)) \cdots \tanh(\beta \tilde{h}_{i_p}(x)) \right\rangle_{x,C}, \tag{14} $$

where all indices are assumed distinct and the $C$ subscript denotes connected. In particular, the average local magnetization at site $i$ is related to the continuous variable via $\langle s_i \rangle_s = \langle \tanh(\beta \tilde{h}_i(x)) \rangle_x$. The marginal probability of the spin pointing up or down at site $i$ is

$$ p(s_i = \pm 1) = \frac{1}{2} \left( 1 \pm \left\langle \tanh(\beta \tilde{h}_i(x)) \right\rangle_x \right). \tag{15} $$

This expression allows for an interpretation of the effective field $\tilde{h}_i(x)$ as a global input signal that determines the local spin polarization after averaging over all possible $x$ configurations. In this expression, the hyperbolic tangent plays the role of an activation function, commonly used in artificial neural networks, that determines the polarization of the spin.

We can express the overlap distribution of the original discrete system in terms of the continuous variable using Eq. 14. If the thermodynamic (Gibbs) measure decomposes into a sum over pure states, each with weight $w_\alpha$, then the disorder-dependent overlap distribution is

$$ P_J(q) = \sum_{\alpha\beta} w_\alpha w_\beta \, \delta(q_{\alpha\beta} - q). \tag{16} $$

This distribution can be regarded as the order parameter of mean-field spin-glasses in our continuous formulation, and the moments of this distribution may be expressed in terms of spin correlation functions as [16]:

$$ q_J^{(p)} := \int dq \, P_J(q) \, q^p = \frac{1}{N^p} \sum_{i_1 \dots i_p} \langle s_{i_1} \cdots s_{i_p} \rangle_s^2. \tag{17} $$

By using Eq. 14, this may be equivalently written as:

$$ q_J^{(p)} = \frac{1}{N^p} \sum_{i_1 \dots i_p} \left\langle \tanh(\beta \tilde{h}_{i_1}(x)) \cdots \tanh(\beta \tilde{h}_{i_p}(x)) \right\rangle_x^2. \tag{18} $$

This relation shows how the spin-glass order parameter is encoded in the continuous formulation.

This concludes the derivation of our continuous formulation. One important aspect of using continuous variables is that they provide a geometric encoding of the problem. In particular, $p(s_i|x)$ encodes the likelihood that a given spin will point up or down for a given point in $\mathbb{R}^N$. This probability is in turn determined by the inverse temperature and the strength of the effective local field at that point, $\tilde{h}_i(x)$. The contours of constant $\tilde{h}_i(x)$ are given by shifted ellipsoids, with the shift given by the external local field $h_i$ and the shape and scale of the ellipsoid determined by $\beta \tilde{J}$.
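
As a sanity check of Eq. 13 and of the magnetization relation $\langle s_i \rangle_s = \langle \tanh(\beta \tilde{h}_i(x)) \rangle_x$, the following sketch compares exact enumeration over two Ising spins with a grid quadrature over $x \in \mathbb{R}^2$. It is an illustration under stated assumptions rather than the authors' procedure: the discrete Hamiltonian is taken to be $H = -h^T s - \tfrac{1}{2} s^T J s$ with zero diagonal, the unnormalized continuous weight is obtained by summing the exponent of Eq. 10 over $s$, and the couplings, fields, grid size and helper names are all arbitrary choices.

```python
import itertools
import numpy as np

def log2cosh(u):
    """Numerically stable log(2 cosh(u))."""
    a = np.abs(u)
    return a + np.log1p(np.exp(-2.0 * a))

def discrete_side(J, h, beta):
    """Exact ln Z_s and <s_i>_s by enumerating all 2^N spin configurations."""
    n = len(h)
    S = np.array(list(itertools.product([-1.0, 1.0], repeat=n)))
    # ASSUMPTION: H(s) = -h.s - (1/2) s^T J s, with zero diagonal in J.
    energy = -S @ h - 0.5 * np.einsum("ki,ij,kj->k", S, J, S)
    w = np.exp(-beta * energy)
    return np.log(w.sum()), (w[:, None] * S).sum(axis=0) / w.sum()

def continuous_side(J_tilde, h, beta, half_width=30.0, n_grid=601):
    """ln Z_x and <tanh(beta h_eff_i)>_x by a Riemann sum on a 2D grid (N = 2 only)."""
    g = np.linspace(-half_width, half_width, n_grid)
    dx = g[1] - g[0]
    X = np.stack(np.meshgrid(g, g, indexing="ij"), axis=-1)  # shape (n, n, 2)
    JX = X @ J_tilde              # (J_tilde x) at every grid point (J_tilde symmetric)
    h_eff = h + JX                # effective field at every grid point
    # log of sum_s exp[-beta (x^T J~ x / 2 - s^T J~ x - h^T s)]
    log_w = -0.5 * beta * np.sum(X * JX, axis=-1) + log2cosh(beta * h_eff).sum(axis=-1)
    shift = log_w.max()
    w = np.exp(log_w - shift)
    ln_Zx = shift + np.log(w.sum() * dx * dx)
    mag = (w[..., None] * np.tanh(beta * h_eff)).sum(axis=(0, 1)) / w.sum()
    return ln_Zx, mag

J = np.array([[0.0, 0.5], [0.5, 0.0]])   # illustrative values
h = np.array([0.3, -0.2])
beta, eps, N = 1.0, 0.1, 2
delta = max(0.0, eps - np.linalg.eigvalsh(J)[0])
J_tilde = J + delta * np.eye(N)

ln_Zs, mag_s = discrete_side(J, h, beta)
ln_Zx, mag_x = continuous_side(J_tilde, h, beta)

# Right-hand side of Eq. 13.
rhs = (ln_Zs + 0.5 * N * np.log(2.0 * np.pi)
       - 0.5 * np.log(np.linalg.det(beta * J_tilde)) + 0.5 * N * beta * delta)
print(ln_Zx - rhs, mag_x - mag_s)
```

The quadrature is only practical for very small $N$, consistent with the remark that the continuous formulation does not remove the cost of evaluating the partition function.
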
The conditional distribution $p(s_i|x)$ can be used to obtain the marginal probability distribution $p(s_i)$ by integrating over all of $\mathbb{R}^N$ and weighting each point according to its probability under the continuous Boltzmann distribution $p(x)$. The S-shaped activation function that appears in Eq. 15 implies that the spins will be frozen if the regions of large local effective field are assigned a low energy in the energy landscape given by $H_\beta(x)$. In subsequent sections, we will explore the geometric structure of this landscape further, both for general coupling matrices $\tilde{J}$, and for some well-known examples such as the SK model.

[Footnote] Since this relation holds for all $p$, it follows that Eq. 14 also holds for the unconnected correlation functions (i.e. the subscript $C$ may be dropped): $\langle s_{i_1} \cdots s_{i_p} \rangle_s = \left\langle \tanh(\beta \tilde{h}_{i_1}(x)) \cdots \tanh(\beta \tilde{h}_{i_p}(x)) \right\rangle_x$.

The probability density formulation affords several advantages over the original discrete formulation as well as some additional mathematical subtleties. Continuous variables allow alternative sampling methods such as Hamiltonian Monte Carlo [17] to be applicable, and indeed, this was one of the motivations for the continuous relaxation method [13]. Another benefit is that the continuous formulation provides a geometric encoding of the combinatorial optimization problems which may be represented in terms of spin-glass systems. In this section we will derive basic properties of the Hamiltonian density $H_\beta(x)$ for arbitrary values of the couplings and graph topology. Later, in Sections 5, 6, and 7 we will further explore our formulation of the SK model, random restricted Boltzmann machines, and spins on random Erdős-Rényi graphs as specific examples.

One of our main results is that the energy landscape defined by $H_\beta(x)$ is convex above a disorder-dependent critical temperature $T_{\rm convex}$, given in terms of the largest eigenvalue of the shifted coupling matrix:

$$ T_{\rm convex} := \lambda_N(J) + \Delta. \tag{19} $$

The proof is given in Appendix A. One of the most important mathematical properties of the energy landscape is whether it is convex or not. In particular, convexity of $H_\beta(x)$ implies that the probability density $p(x)$ is log-concave, and log-concave probability densities enjoy a number of useful properties, such as the fact that the cumulative distribution function (CDF) is also log-concave, as well as the fact that the marginal density over any subset of the $x_i$ variables will also be log-concave. Convexity of $H_\beta(x)$ also has practical consequences; for example, it means that certain algorithms such as adaptive rejection sampling may be used to efficiently sample $p(x)$ [18].

As the temperature is lowered past $T_{\rm convex}$, the Hamiltonian density becomes non-convex. In order to understand this transition, it will be useful to set the external magnetic field to zero, $h = 0$. We first note that the expression for $H_\beta(x)$ in Eq. 4 is the sum of two terms. The first is quadratic in $x$, and is guaranteed to be positive for any $x \neq 0$ since $\Delta$ was chosen to make $\tilde{J}$ positive-definite. Conversely, the second term is negative for any $x \neq 0$, and it scales linearly in $x$ at large radii, i.e. as $\|x\| \to \infty$. Thus, at large radii the first term dominates and the Hamiltonian density is:

$$ H_\beta(x) \sim \tfrac{1}{2} x^T \tilde{J} x, \tag{20} $$

which ensures that $p(x)$ is integrable. In contrast, near the origin $x = 0$ the expression simplifies to

$$ H_\beta(x) \sim \text{const} + \tfrac{1}{2} x^T \left( \tilde{J} - \beta \tilde{J}^2 \right) x. \tag{21} $$

The linear term in the expansion vanishes, and therefore the origin is a critical point of the Hamiltonian density. If $(\tilde{J} - \beta \tilde{J}^2)$ is also positive-definite, then $x = 0$ is a minimum. This condition is equivalent to $T > T_{\rm convex}$, and so in this case $x = 0$ is the unique global minimum.
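
The convexity threshold of Eq. 19 can be probed numerically. The sketch below assumes the form $H_\beta(x) = \tfrac{1}{2} x^T \tilde{J} x - \beta^{-1} \sum_i \ln\cosh(\beta \tilde{h}_i(x))$ up to an additive constant, with $\tilde{h}_i(x) = h_i + (\tilde{J}x)_i$, which is inferred here from Eqs. 10, 11 and 21 rather than quoted. Under that assumption the Hessian is $\tilde{J} - \beta \tilde{J} D \tilde{J}$ with $D = \mathrm{diag}(\mathrm{sech}^2(\beta \tilde{h}_i))$. The random coupling matrix, the probe points and the helper names are illustrative.

```python
import numpy as np

def hessian(x, J_tilde, h, beta):
    """Hessian of the assumed H_beta: J~ - beta * J~ diag(sech^2(beta h_eff)) J~."""
    h_eff = h + J_tilde @ x
    sech2 = 1.0 - np.tanh(beta * h_eff) ** 2
    return J_tilde - beta * J_tilde @ (sech2[:, None] * J_tilde)

rng = np.random.default_rng(1)
N = 6
A = rng.normal(size=(N, N)) / np.sqrt(N)
J = np.triu(A, 1)
J = J + J.T                                  # symmetric, zero diagonal
lam = np.linalg.eigvalsh(J)
delta = max(0.0, 1e-3 - lam[0])              # shift with eps = 1e-3
J_tilde = J + delta * np.eye(N)
T_convex = lam[-1] + delta                   # Eq. 19
h = np.zeros(N)

# Probe the smallest Hessian eigenvalue at the origin and at random points.
points = [np.zeros(N)] + list(rng.normal(scale=2.0, size=(50, N)))

def min_hessian_eig(T):
    return min(np.linalg.eigvalsh(hessian(x, J_tilde, h, 1.0 / T))[0] for x in points)

above = min_hessian_eig(1.05 * T_convex)     # expected positive at every probe point
below_origin = np.linalg.eigvalsh(
    hessian(np.zeros(N), J_tilde, h, 1.0 / (0.95 * T_convex)))[0]  # expected negative
print(T_convex, above, below_origin)
```

A finite set of probe points cannot prove convexity; the check only illustrates that the sign of the smallest Hessian eigenvalue at the origin flips at $T_{\rm convex}$.
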
As $T$ is lowered below $T_{\rm convex}$, the matrix $(\tilde{J} - \beta \tilde{J}^2)$ develops negative eigenvalues, and $x = 0$ becomes a saddle.

In addition to the origin becoming unstable, the convex/non-convex transition is also characterized by the appearance of a pair of additional critical points. The critical points of $H_\beta(x)$ solve

$$ x = \tanh(\beta \tilde{J} x). \tag{22} $$

As the temperature approaches $T_{\rm convex}$ from below, any critical points that exist will merge with the critical point $x = 0$, since $x = 0$ is the sole critical point for $T > T_{\rm convex}$. We may therefore linearize the critical point equation around $x = 0$. In this case, Eq. 22 simplifies to $x = \beta \tilde{J} x$. A non-trivial solution of this equation is just an eigenvector of $\beta \tilde{J}$ with eigenvalue 1, which corresponds to $T = T_{\rm convex}$. If $v^{(N)}$ is the eigenvector of $\beta \tilde{J}$ with the largest eigenvalue $\lambda^{(N)}$, then so is $c\, v^{(N)}$ for any non-zero $c$; in other words, the scale is not fixed in the linear treatment. Going beyond linear order will fix $c$ up to a $\mathbb{Z}_2$ reversal $c \to -c$, since $h = 0$. Thus, a pair of critical points will appear as $T_{\rm convex}$ is crossed from above.

The convex/non-convex transition experienced by the continuous distribution $p(x)$ has no counterpart in the original discrete distribution $p(s)$. For every example we study below, $T_{\rm convex}$ does not correspond to a phase transition in the discrete system. In fact, $T_{\rm convex}$ may be varied without changing the physical content of the theory by using a shift larger than the minimum, i.e. $\Delta > \Delta_{\min}$. However, there is some physical significance to the minimal value of $T_{\rm convex}$, which can be seen by noting that $\lambda_N(J)$ is the critical temperature predicted by the naive mean-field equation

$$ x = \tanh(\beta J x). \tag{23} $$

Defining $T_{\text{mean-field}} := \lambda_N(J)$, we may then write

$$ T_{\rm convex} = T_{\text{mean-field}} + \Delta. \tag{24} $$

Moreover, if the eigenvalues of $J$ lie in a symmetric interval, with $\lambda_N(J) = -\lambda_1(J)$, then $\Delta_{\min} = \lambda_N(J)$, and $T_{\rm convex} \ge T_{\text{mean-field}} + \Delta_{\min} = 2\, T_{\text{mean-field}}$.
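
The appearance of a $\pm$ pair of critical points below $T_{\rm convex}$ can be seen by iterating Eq. 22 as a fixed-point map. This is a minimal sketch and not a method taken from the paper: the iteration scheme, the two-spin coupling, the value of $\epsilon$, and the function name `critical_point` are all illustrative choices.

```python
import numpy as np

def critical_point(J_tilde, beta, x0, max_iter=10_000, tol=1e-12):
    """Iterate x <- tanh(beta * J_tilde @ x) until the update is smaller than tol."""
    x = x0.copy()
    for _ in range(max_iter):
        x_new = np.tanh(beta * J_tilde @ x)
        if np.max(np.abs(x_new - x)) < tol:
            return x_new
        x = x_new
    return x

J = np.array([[0.0, 0.5], [0.5, 0.0]])       # illustrative two-spin coupling
eps = 0.1
lam, vecs = np.linalg.eigh(J)                # ascending eigenvalues, orthonormal eigenvectors
delta = max(0.0, eps - lam[0])
J_tilde = J + delta * np.eye(2)
T_mean_field = lam[-1]                       # lambda_N(J)
T_convex = T_mean_field + delta              # Eq. 24

x0 = 1e-3 * vecs[:, -1]                      # small kick along the top eigenvector
beta_hot = 1.0 / (2.0 * T_convex)            # T above T_convex
beta_cold = 1.0 / (0.5 * T_convex)           # T below T_convex

x_hot = critical_point(J_tilde, beta_hot, x0)    # relaxes back to the origin
x_cold = critical_point(J_tilde, beta_cold, x0)  # settles on a non-trivial critical point
print(T_convex, x_hot, x_cold)
```

With $h = 0$ the map is odd in $x$, so `-x_cold` satisfies Eq. 22 whenever `x_cold` does, which is the $\mathbb{Z}_2$ pair described above. For this symmetric two-spin spectrum, $T_{\rm convex} = 2\,T_{\text{mean-field}} + \epsilon$.
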
With our choice of $\Delta = \max(0, \epsilon - \lambda_1(J))$ we have that $T_{\rm convex} = 2\, T_{\text{mean-field}} + \epsilon$, so that as $\epsilon \to 0$ the inequality is saturated.

To summarize the results of this section: as the temperature is lowered past $T_{\rm convex}$, the Hamiltonian density becomes non-convex, the critical point at $x = 0$ becomes unstable, and a pair of non-trivial critical points with $x \neq 0$ appears. It is difficult to go much further than this description and make more detailed statements about the geometry of the energy landscape without specifying the couplings $J$. This is to be expected, since our formalism applies to all spin systems of the form Eq. 1, which includes both spin-glasses and ferromagnetic systems like the 2D Ising model. Below, in Sections 5, 6 and 7, we will further analyze the landscape for the Sherrington-Kirkpatrick model, random restricted Boltzmann machines, and spin-glasses on random Erdős-Rényi graphs, respectively. The case of the 2D ferromagnetic Ising model is explored in Appendix C.

In this section we will discuss the low-temperature limit of our formalism. Our goal will be to provide some insight into the geometry of energy landscapes of systems which are deep in the spin-glass phase (when such a phase exists), and to show how the metastable spin-glass states are geometrically encoded in the Hamiltonian density, $H_\beta(x)$. We will leave the coupling matrices and disorder distribution unspecified, and as a result, our discussion will be somewhat general.

We begin by taking the low-temperature expansion of the Hamiltonian density:

$$ H_\beta(x) = H_\infty(x) + O(\beta^{-1}), \quad \text{where} \quad H_\infty(x) := \tfrac{1}{2} x^T \tilde{J} x - \sum_{i=1}^{N} |\tilde{J} x|_i. \tag{25} $$

The equation governing the zero-temperature critical points may be obtained from $H_\infty(x)$ directly or from the $\beta \to \infty$ limit of Eq. 22:

$$ x = \mathrm{sgn}(\tilde{J} x). \tag{26} $$

Additionally, the Hessian (matrix of second derivatives) of $H_\infty(x)$ is simply the shifted coupling matrix: $K_\infty := \tilde{J}$.
There is a subtlety here, which is that the Hessian is not defined for points which satisfy $(\tilde{J} x)_i = 0$ for any $i$, because the absolute value function is not differentiable at the origin. This is an important observation, since without it one would conclude that $H_\infty(x)$ is convex, which it certainly is not.

With these ingredients, the integral defining the partition function $Z_x$ may then be formally written as a sum over the critical points using Laplace's method:

$$ Z_x = \int d^N x \, e^{-\beta H_\beta(x)} \approx \sum_\alpha e^{-\beta H_\infty(x^{(\alpha)})} \int d^N x \, e^{-\frac{\beta}{2} (x - x^{(\alpha)})^T \tilde{J} (x - x^{(\alpha)})} = \sqrt{\frac{(2\pi)^N}{\det(\beta \tilde{J})}} \sum_\alpha e^{-\beta H_\infty(x^{(\alpha)})}, \tag{27} $$

where $x^{(\alpha)}$ are the critical points of the Hamiltonian density, and the prefactor is due to the Gaussian integration around each critical point. In writing the above expression we have assumed that all critical points are minima, so that the Gaussian integration converges. Without specifying the coupling matrix it is difficult to say much about the existence or non-existence of saddles, beyond the fact that $x = 0$ is always both a solution of the critical point equation and a point for which the Hessian is not defined. This and any other similar points will require some special treatment, for example by rotating the integration contours and including sub-leading corrections in $\beta^{-1}$. Ignoring such complications, general correlation functions may also be formally written as a sum over critical points as:

$$ \langle f(x) \rangle_x \approx \sum_\alpha \omega_\alpha f(x^{(\alpha)}), \tag{28} $$

where $\omega_\alpha := e^{-\beta H_\beta(x^{(\alpha)})}/Z_x$ is the Boltzmann weight of each critical point, and $f$ is an arbitrary function. Thus, the critical points can be seen to encode almost all of the physics of the problem in the low-temperature limit.

[Footnote] We have ignored the Gaussian prefactor here, since 1) it is subleading in $\beta^{-1}$, and 2) all critical points receive the same prefactor since the Hessian is just the constant matrix $\tilde{J}$.

Figure 1: Contour plot of the Hamiltonian density for a system of two spins with $J$[…] $= J$[…] $= 0.$[…], shift $\Delta = \max(0, \epsilon - \lambda_1(J))$ and $\epsilon = 0.$[…], for (a) $T = T_{\rm convex}$, (b) $T = T_{\rm convex}/$[…], (c) $T = T_{\rm convex}/$[…], (d) $T = T_{\rm convex}/$[…]. The density clouds due to spin configurations overlap above $T_{\rm convex}$; that is, the Gaussians are sufficiently broad that the resulting continuous distribution $p(x)$ is log-concave. At low temperatures, the distributions become fragmented into several distinct modes. Blue regions correspond to low energy configurations.

Applying the saddle-point method to the 1-point function (i.e. the $p = 1$ case of Eq. 14), the saddle point coordinates are related to the average per-site magnetizations via

$$ \langle s_i \rangle_s \approx \sum_\alpha \omega_\alpha x_i^{(\alpha)}. \tag{29} $$

Therefore, in our continuous formulation the critical points are closely analogous to the pure states of spin-glass theory. Pure states of spin-glasses are sub-regions in the state space which are separated by large energy barriers, and the system is ergodic within each such region even though global ergodicity is broken [19]. Indeed, if the sum over critical points is restricted to just a single critical point (or if there is only one such dominant critical point in the thermodynamic limit), then all connected correlation functions vanish, for example

$$ \langle x_{i_1} x_{i_2} \rangle_x = \langle x_{i_1} \rangle_x \langle x_{i_2} \rangle_x. \tag{30} $$

This property is also known as cluster decomposition.

The pure states have a simple geometric interpretation in our formalism. Recall that the continuous probability density may be written as a weighted sum of Gaussians, each centered around one of the $2^N$ spin configurations, $p(x) = \sum_{\{s\}} p(x|s)\, p(s)$. The covariance matrix of each Gaussian is $\Sigma = (\beta \tilde{J})^{-1}$, and so the level sets of $p(x|s)$ are $N$-dimensional ellipsoids whose shape is determined by the eigenvectors and eigenvalues of $\Sigma$.
In general, the density clouds due to each spin configuration will overlap - for example, above $T_{\rm convex}$ the Gaussians are so broad that the resulting continuous distribution $p(x)$ is log-concave. At low temperatures, the distribution will "fragment" into a number of distinct modes. An example of this is shown in Fig. 1 for the simple case of just two spins, $N = 2$.

The nature of this fragmentation depends on how the $\beta \to \infty$ limit is taken. Suppose that the original coupling matrix $J$ has both positive and negative eigenvalues, so that the eigenvalues of the shifted matrix $\tilde{J}$ satisfy $\lambda_i(\tilde{J}) \ge \epsilon$ (recall that the purpose of introducing the small positive constant $\epsilon$ was to guarantee positive-definiteness of $\tilde{J}$). If $\epsilon$ is chosen to be temperature-independent, then the $\beta \to \infty$ limit pushes all the eigenvalues of $(\beta \tilde{J})$ to infinity, and consequently all the eigenvalues of $\Sigma = (\beta \tilde{J})^{-1}$ approach zero. In this case $p(x)$ is composed of $2^N$ distinct delta functions with different weights, and the pure states are rather trivially just the $2^N$ spin configurations. However, if instead $\beta\epsilon$ is held fixed as $\beta \to \infty$, then the eigenvalue spectrum of $\Sigma$ will range from 0 to the finite value $1/(\beta\epsilon)$. Thus, the ellipsoid defining the level sets of the Gaussians will shrink to a point in some directions, and remain finite in others. In this case the fragmentation of $p(x)$ will be more interesting. Groups of spin configurations will merge to form pure states as determined by the geometry of the zero-temperature ellipsoids in relation to the $2^N$ vertices of the $[-1, 1]^N$ hypercube.

The pure states of non-disorder-averaged spin-glasses can be associated with solutions of a modified mean-field equation known as the Thouless, Anderson, and Palmer (TAP) equation, which was derived in order to correct the failure of naive mean-field theory to describe the spin-glass phase of the SK model. The naive mean-field equation is given in Eq. 23, whereas the TAP equation is given by [20]:

$$ x_i = \tanh\Big( \beta \sum_j J_{ij} x_j - \beta^2 x_i \sum_j J_{ij}^2 (1 - x_j^2) \Big). \tag{31} $$

We have argued that the critical points of the continuous probability density may be interpreted as pure states at low temperature, and thus there should be a connection between these and the solutions of the TAP equation. Here we will establish such a connection at zero temperature, for which the TAP equation simplifies considerably: $x = \mathrm{sgn}(Jx)$. Importantly, this is also the zero-temperature limit of the naive mean-field equation, Eq. 23. The zero-temperature limit of the TAP/naive mean-field equations may be compared with the equation governing the critical points of $H_\infty(x)$:

$$ x = \mathrm{sgn}(\tilde{J} x), \quad \text{critical points of } H_\infty(x). \tag{32a} $$

$$ x = \mathrm{sgn}(J x), \quad \text{naive mean-field/TAP}. \tag{32b} $$

Note that the TAP/naive mean-field equation depends on the original coupling matrix $J$, whereas the critical points of the Hamiltonian density depend on the shifted coupling matrix $\tilde{J} = J + \Delta\, \mathbb{1}_{N \times N}$. A key result is that solutions of the zero-temperature naive mean-field/TAP equation are also critical points of the zero-temperature Hamiltonian density:

Proposition 1. If $x$ is a solution of the zero-temperature TAP equation $x = \mathrm{sgn}(Jx)$, then $x$ is also a solution of the zero-temperature critical point equation $x = \mathrm{sgn}(\tilde{J} x)$.

Proof. Suppose $x = \mathrm{sgn}(Jx)$. The result holds for any $\Delta \ge \Delta_{\min} \ge 0$, so we will consider the cases $\Delta = 0$ and $\Delta > 0$ separately. If $\Delta = 0$, then clearly the mean-field and critical point equations are identical. If $\Delta > 0$, then $\mathrm{sgn}(\Delta x) = \mathrm{sgn}(x) = \mathrm{sgn}(\mathrm{sgn}(Jx)) = \mathrm{sgn}(Jx)$. Thus,

$$ \begin{aligned} \mathrm{sgn}(\tilde{J} x) &= \mathrm{sgn}(Jx + \Delta x) = \mathrm{sgn}\big( \mathrm{sgn}(Jx)\, |Jx| + \mathrm{sgn}(\Delta x)\, |\Delta x| \big) \\ &= \mathrm{sgn}\big( \mathrm{sgn}(Jx)\, ( |Jx| + |\Delta x| ) \big) = \mathrm{sgn}(Jx) = x. \end{aligned} \tag{33} $$
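
Proposition 1 can be checked by brute force on a small instance. The sketch below enumerates all Ising configurations for an illustrative random coupling matrix and compares the solution sets of Eqs. 32a and 32b restricted to $x \in \{-1, +1\}^N$. The helper `sign_fixed_points`, the system size and the value of $\epsilon$ are assumptions made for the example, and critical points with vanishing components such as $x = 0$ are not covered by the enumeration.

```python
import itertools
import numpy as np

def sign_fixed_points(M):
    """All s in {-1, +1}^N with s = sgn(M s), returned as a set of tuples."""
    n = M.shape[0]
    return {
        s for s in itertools.product([-1, 1], repeat=n)
        if np.array_equal(np.sign(M @ np.array(s)), np.array(s))
    }

rng = np.random.default_rng(0)
N = 8
A = rng.normal(size=(N, N)) / np.sqrt(N)
J = np.triu(A, 1)
J = J + J.T                                   # symmetric, zero diagonal
delta = max(0.0, 1e-3 - np.linalg.eigvalsh(J)[0])
J_tilde = J + delta * np.eye(N)

tap_solutions = sign_fixed_points(J)              # Eq. 32b
critical_points = sign_fixed_points(J_tilde)      # Eq. 32a, spin configurations only

print(len(tap_solutions), len(critical_points))
print(tap_solutions <= critical_points)           # subset relation of Proposition 1
```

Any configuration in `critical_points` but not in `tap_solutions` is an instance of the additional critical points that the shift $\Delta$ introduces.
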
This establishes that for $T = 0$ every solution of the TAP equation is also a critical point of $\mathcal{H}_\infty(x)$. The converse does not hold: there are critical points of the Hamiltonian density which are not solutions of the TAP equation. To understand the significance of these points, recall that the nature of the pure states depends on whether $\epsilon$ is held fixed as $\beta \to \infty$, or if instead $\beta\epsilon$ is held fixed. In the first case, the pure states of the continuous formulation are somewhat trivial, as any of the $2^N$ spin configurations will be a pure state according to the above discussion. There may additionally be critical points with $x_i = (\tilde{J}x)_i = 0$ for some $i$, which will not correspond to any Ising spin configuration. For example, the point $x = 0$ is always a critical point. Since the TAP solutions are a subset of all possible spin configurations, the zero-temperature critical points will include the TAP solutions as well as all other spin configurations and any saddle-like points such as $x = 0$. If the zero-temperature limit is instead taken while holding $\beta\epsilon$ fixed, then the critical points will include just a subset of all $2^N$ spin configurations. That subset will include the TAP states, and possibly other spin configurations and saddle-like points.

In order to build intuition for the probability density formulation of general spin-glass systems, in this section we consider as an example the Sherrington-Kirkpatrick (SK) model [4]. By specifying the coupling matrix $J$ (or rather, the disorder distribution from which $J$ is drawn), we may further explore the geometry of the energy landscape and the nature of both the spin-glass and convexity transitions in our formulation.

The SK model is defined by specifying that the couplings $J_{ij}$ be drawn from an iid Gaussian distribution [4]:

$$J_{ij} \sim \mathcal{N}\Big(0, \frac{\mathcal{J}^2}{N}\Big), \quad (i < j), \tag{34}$$

where the $i > j$ values are fixed by symmetry of $J$ to be the same as the $i < j$ values, and the diagonal entries are zero. The coupling parameter $\mathcal{J}$ controls the variance of the disorder. The eigenvalue distribution of $J$ in the large-$N$ limit is simply the Wigner semi-circle distribution, so that the probability density of the eigenvalues of $J$ is

$$p_J(\lambda) = \frac{2}{\pi R^2}\sqrt{R^2 - \lambda^2}\;\mathbb{1}_{[-R,R]}(\lambda), \tag{35}$$

where $\mathbb{1}_{[-R,R]}(\lambda)$ is the indicator function. The radius of the semi-circle is related to the coupling parameter via $R = 2\mathcal{J}$. Since the eigenvalues of $J$ are restricted to the interval $[-R, R]$, the eigenvalues of the shifted coupling matrix $\tilde{J}$ will lie within the interval $[\epsilon, 2R + \epsilon]$. The eigenvalues of both $J$ and $\tilde{J}$ are depicted in Fig. 2.

Figure 2: The eigenvalue distribution of the coupling matrix $J$ and the shifted coupling matrix $\tilde{J}$ for the SK model. Both distributions are described by the Wigner semi-circle distribution, shown in black for both $J$ and $\tilde{J}$. The size of the shift has been chosen so that the shifted distribution has support on the positive real numbers, $\lambda \in (0, \infty)$.

Using the radius of the Wigner semi-circle (and disregarding $\epsilon$ for now by setting it to zero), we have

$$T_{\text{mean-field}} = 2\mathcal{J}, \qquad T_{\text{convex}} = 2\,T_{\text{mean-field}}. \tag{36}$$

These may be contrasted with the critical temperature below which the system is in a spin-glass phase:

$$T_{\text{crit}} = \mathcal{J}. \tag{37}$$

Therefore, we have found that

$$T_{\text{crit}} < T_{\text{mean-field}} < T_{\text{convex}}. \tag{38}$$

This indicates that the Hamiltonian density becomes non-convex due to the appearance of multiple critical points well before any transition to an ordered phase occurs.

The fact that $T_{\text{convex}} \neq T_{\text{crit}}$ is intriguing. Naively one might have thought that the two temperatures would coincide, because the transition from a convex to a non-convex Hamiltonian density represents a real and significant change in the corresponding Boltzmann distribution. Moreover, the minimal value $T_{\text{convex}} = 4\mathcal{J}$ does not appear to have been previously identified as having any particular importance for the well-studied SK model.
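As a small numerical illustration of Eqs. 35-38, the sketch below integrates the semi-circle density and evaluates the three temperatures for $\epsilon = 0$. It assumes, as the SK values above suggest, that $T_{\text{mean-field}}$ and $T_{\text{convex}}$ are the largest eigenvalues of $J$ and $\tilde{J}$ respectively; the function names and the midpoint-rule integrator are hypothetical choices made only for this example.

```python
import math

def semicircle_density(lam, radius):
    # Wigner semi-circle density of Eq. 35; zero outside [-radius, radius].
    if abs(lam) >= radius:
        return 0.0
    return 2.0 / (math.pi * radius ** 2) * math.sqrt(radius ** 2 - lam ** 2)

def integrate(f, a, b, steps=20000):
    # Simple midpoint rule.
    h = (b - a) / steps
    return h * sum(f(a + (k + 0.5) * h) for k in range(steps))

coupling = 1.0            # coupling parameter (calligraphic J in the text)
radius = 2.0 * coupling   # R = 2 J
shift = radius            # large-N shift with epsilon = 0: spectrum moves to [0, 2R]

norm_original = integrate(lambda l: semicircle_density(l, radius), -radius, radius)
norm_shifted = integrate(lambda l: semicircle_density(l - shift, radius), 0.0, 2.0 * radius)

t_crit = coupling
t_mean_field = radius          # assumed: largest eigenvalue of J
t_convex = radius + shift      # assumed: largest eigenvalue of the shifted matrix
print(norm_original, norm_shifted, t_crit, t_mean_field, t_convex)
```

Both densities integrate to one, and the ordering $T_{\text{crit}} < T_{\text{mean-field}} < T_{\text{convex}}$ of Eq. 38 follows directly from $R = 2\mathcal{J}$.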
The mathematical transformation from the original discrete variables to the continuous variables was exact and involved no approximation; however, one still needs to verify that the spin-glass transition has not been shifted and still occurs at $T_{\text{crit}}$, not at $T_{\text{convex}}$ where convexity is no longer guaranteed. To this end, we carried out a high-temperature expansion in terms of the continuous variables and find exact agreement with the expansion in terms of the original discrete variables carried out by Thouless, Anderson, and Palmer in [20]. In both cases, the expansion breaks down at the spin-glass phase transition temperature $T = T_{\text{crit}}$ and not at the higher temperature $T_{\text{convex}}$. We provide an outline of the calculation here; a more detailed treatment can be found in Appendix B.

Footnote: It is worth noting that these results are strictly only valid for $N \to \infty$. For finite $N$ both the eigenvalues of $J$ and the critical temperature will exhibit fluctuations due to finite-size effects. The fluctuation of the critical temperature due to finite-size effects is investigated in [21].

Using Eq. 13, the partition function $Z_s$ may be written in terms of the continuous variables as

$$Z_s = 2^N e^{-N\beta\Delta/2}\,\Big\langle \prod_i \cosh\big(\beta^{1/2} (\tilde{J}x)_i\big) \Big\rangle. \tag{39}$$

The expectation value is taken over a properly normalized Gaussian distribution with zero mean and covariance matrix $\tilde{J}^{-1}$. A high-temperature expansion may then be performed by expanding around $\beta = 0$. At each order the Gaussian integrals may be performed by Wick contractions, which introduces an increasing number of terms as the order of the expansion increases. The calculation simplifies dramatically if the disorder is averaged over. Denoting the disorder average as $\langle \cdot \rangle_J$, the final result is

$$\langle \ln Z_s \rangle_J = N\Big(\ln 2 + \frac{1}{4}(\beta\mathcal{J})^2\Big) + \frac{1}{4}\ln\big(1 - \beta^2\mathcal{J}^2\big) + (\text{non-singular}) + O(N^{-1}). \tag{40}$$

For $T > T_{\text{crit}}$ the sub-extensive terms may be neglected, but as the temperature of the spin-glass transition is approached from above the logarithm becomes singular, indicating a breakdown of the perturbative expansion. Not only is the free energy analytic at the minimal convexity transition temperature $\min_\Delta T_{\text{convex}} = 4\mathcal{J}$, but any dependence on the shift $\Delta$ cancels out, since $\Delta$ was only introduced as part of our formulation.

The above result was first derived in [20] by considering the expansion of $Z_s$ in terms of the original discrete spin variables. In both cases, the expansion in terms of $s$ and the expansion in terms of $x$, the singular logarithm term is obtained by summing an infinite number of terms.
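The leading terms of the expansion around $\beta = 0$ can be checked instance by instance for small $N$. The sketch below is an illustration under stated assumptions rather than the calculation of Appendix B: it assumes the zero-field energy $E(s) = -\frac{1}{2} s^T J s$ (the convention implied by the Gaussian representation in Eq. 39), enumerates all $2^N$ configurations, and compares the exact $\ln Z_s$ with the second-order estimate $N \ln 2 + \frac{\beta^2}{4}\sum_{i \neq j} J_{ij}^2$. All function names are hypothetical.

```python
import itertools
import math
import random

def sk_couplings(n, scale=1.0, seed=0):
    # Symmetric Gaussian couplings with zero diagonal and variance scale^2 / n.
    rng = random.Random(seed)
    J = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            J[i][j] = J[j][i] = rng.gauss(0.0, scale / n ** 0.5)
    return J

def log_partition_exact(J, beta):
    # Brute-force ln Z_s, assuming E(s) = -(1/2) s^T J s = -sum_{i<j} J_ij s_i s_j.
    n = len(J)
    total = 0.0
    for s in itertools.product((-1, 1), repeat=n):
        energy = -sum(J[i][j] * s[i] * s[j]
                      for i in range(n) for j in range(i + 1, n))
        total += math.exp(-beta * energy)
    return math.log(total)

def log_partition_second_order(J, beta):
    # N ln 2 + (beta^2 / 4) * sum over ordered pairs i != j of J_ij^2.
    n = len(J)
    pair_sum = sum(J[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
    return n * math.log(2.0) + 0.25 * beta ** 2 * pair_sum

n, beta = 8, 0.1
J = sk_couplings(n)
exact = log_partition_exact(J, beta)
approx = log_partition_second_order(J, beta)
print(exact, approx, abs(exact - approx))
```

The residual is of order $\beta^3$ and comes from the triangle terms. Since $\sum_{i \neq j} \langle J_{ij}^2 \rangle_J = (N-1)\mathcal{J}^2$, the disorder average of the second-order term reproduces the extensive $\frac{N}{4}(\beta\mathcal{J})^2$ contribution in Eq. 40.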
In terms of Feynman diagrams, the terms that contribute to the singularity correspond to double-sided regular $n$-gons for $n \geq 3$:

[Diagram series: double-sided regular $n$-gons, $n = 3, 4, 5, \ldots$]

The fact that both expansions agree and yield no non-analyticity at $T_{\text{convex}}$ indicates that the convex/non-convex transition is not associated with a thermodynamic phase transition. It also provides a consistency check that the continuous formulation does not break down below $T_{\text{convex}}$.

Lastly, we investigated the zero-temperature limit of the SK model by studying the critical points. These are solutions of the equation $x = \mathrm{sgn}(\tilde{J}x)$. We generated a large number of such solutions by randomly initializing $x^{(0)} \in \{-1, 1\}^N$ and then applying the iterative update rule below until a solution was found (or the algorithm failed to converge after a set number of iterations):

$$x^{(t)} = \frac{1}{2}\Big(x^{(t-1)} + \mathrm{sgn}\big(\tilde{J}x^{(t-1)}\big)\Big), \quad \text{(critical point)}. \tag{41}$$

This update rule corresponds to performing gradient descent on $\mathcal{H}_\beta(x)$, using a learning rate of $1/2$ and $\tilde{J}^{-1}\nabla\mathcal{H}_\beta(x)$ in place of the usual gradient $\nabla\mathcal{H}_\beta(x)$. We also generated a large number of solutions of the zero-temperature mean-field/TAP equation $x = \mathrm{sgn}(Jx)$ using the same procedure, with the update rule given by:

$$x^{(t)} = \frac{1}{2}\Big(x^{(t-1)} + \mathrm{sgn}\big(Jx^{(t-1)}\big)\Big), \quad \text{(mean-field/TAP)}. \tag{42}$$

In agreement with Proposition 1 above, we found that every mean-field/TAP solution also solved the critical point equation. Interestingly, we also found that none of the critical point solutions generated this way solved the mean-field/TAP equation. This is consistent with our earlier observation (which held for large $\Delta$) that the saddle-like critical points exponentially outnumber the minima. We also found that the solutions produced by the iterative method applied to each equation had widely separated energies. Fig. 3 plots the distribution of energies of each set of solutions.

Footnote: We thank Dan Ish for pointing this out to us.
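The iterative procedure of Eqs. 41 and 42 can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the code used to produce Fig. 3: the system is much smaller than $N = 500$, the shift is set to the large-$N$ estimate $\Delta = 2\mathcal{J}$, and the convergence tolerance, iteration cap, and helper names (`iterate`, `collect_solutions`) are hypothetical.

```python
import random

def sgn(v):
    # Sign of a real number; returns 0 for an exact zero.
    return (v > 0) - (v < 0)

def matvec(M, x):
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]

def sk_couplings(n, scale=1.0, seed=0):
    # Symmetric Gaussian couplings with zero diagonal and variance scale^2 / n.
    rng = random.Random(seed)
    J = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            J[i][j] = J[j][i] = rng.gauss(0.0, scale / n ** 0.5)
    return J

def iterate(M, x0, max_iters=1000, tol=1e-9):
    # Apply x <- (x + sgn(M x)) / 2 until x agrees with sgn(M x) within tol.
    x = [float(v) for v in x0]
    for _ in range(max_iters):
        target = [sgn(y) for y in matvec(M, x)]
        if all(abs(xi - ti) < tol for xi, ti in zip(x, target)):
            return tuple(target)
        x = [0.5 * (xi + ti) for xi, ti in zip(x, target)]
    return None  # failed to converge within max_iters

def collect_solutions(M, n_starts, seed):
    # Run the iteration from random Ising configurations; keep unique solutions.
    rng = random.Random(seed)
    n = len(M)
    found = set()
    for _ in range(n_starts):
        x0 = [rng.choice((-1, 1)) for _ in range(n)]
        solution = iterate(M, x0)
        if solution is not None:
            found.add(solution)
    return found

n = 12
J = sk_couplings(n)
delta = 2.0  # assumption: large-N estimate of the minimal shift
J_shift = [[J[i][j] + (delta if i == j else 0.0) for j in range(n)]
           for i in range(n)]

critical = collect_solutions(J_shift, 50, seed=1)  # Eq. 41
tap = collect_solutions(J, 50, seed=1)             # Eq. 42
print(len(critical), len(tap))
```

Runs that do not converge are discarded, mirroring the "failed to converge after a set number of iterations" clause in the text. Nothing in this sketch guarantees that solutions are sampled uniformly.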
Footnote: It should be emphasized that the iterative update rule/gradient descent method we used almost certainly does not generate solutions uniformly. We generated 10,000 unique solutions using each approach, but for the SK model we expect an exponential (in $N$) number of solutions [22] and the solutions plotted in Fig. 3 may not be representative of the overall distribution. Rather, they are representative of the distribution obtained when solutions are generated using the iterative update rule/gradient descent.

Figure 3: Distribution of the energies $\mathcal{H}_\infty(x)/N$ obtained by an application of the iterative procedure discussed in the text. This procedure was applied to two equations, the mean-field/TAP equation $x = \mathrm{sgn}(Jx)$ and the critical point equation $x = \mathrm{sgn}(\tilde{J}x)$, and in both cases $N = 500$. Note that every solution of the first equation is also a solution of the second, although the converse is not true. Here we have set $\epsilon = 0$ in $\Delta_{\min}$. The plot shows that the typical critical point has a much higher energy than typical solutions of the mean-field equation (when both sets of solutions are obtained using the iterative procedure).

As a second example, we study the bipartite SK model, which is the natural extension of the SK model to bipartite complete graphs. This example also has significance in machine learning, as it represents a randomly initialized Restricted Boltzmann Machine (RBM) [23]. In particular, the bipartite SK model describes a random initialization of RBMs where the biases have been set to zero. The connection between the bipartite SK model and RBMs has been recently studied in [24, 25, 26]. In this case, the coupling matrix $J$ takes on the block form

$$J = \begin{pmatrix} 0 & W \\ W^T & 0 \end{pmatrix}, \tag{43}$$

with $W$ an $N_v \times N_h$ matrix, where $N_v$ is the number of visible spins and $N_h$ is the number of hidden spins. The total number of spins is $N = N_v + N_h$, and the spin vector may be written as $s^T = (v, h)$. As in the SK model, the weights $W_{ij}$ in the bipartite SK model will be iid normally distributed:

$$W_{ij} \sim \mathcal{N}\Big(0, \frac{\mathcal{J}^2}{\sqrt{N_v N_h}}\Big). \tag{44}$$

Using the relation

$$\det\begin{pmatrix} A & B \\ C & D \end{pmatrix} = \det\big(A - B D^{-1} C\big)\,\det(D) \tag{45}$$

for block matrices $A, B, C, D$, the characteristic equation for $J$, $\det(J - \lambda\,\mathbb{1}_{N \times N}) = 0$, is equivalent to the condition

$$\det\big(W W^T - \lambda^2\,\mathbb{1}\big) = 0, \tag{46}$$

provided that $\lambda \neq 0$. Thus, the non-zero eigenvalues of $J$ are related to the eigenvalues of $W W^T$ via

$$\lambda_i(J) = \pm\,\lambda_i\big(W W^T\big)^{1/2}. \tag{47}$$

Figure 4: (a) The eigenvalue distribution for $W W^T$ for $N_v = 1000$, $N_h = 3000$ and $\beta = \mathcal{J} = 1$. The dashed line corresponds to the large-$N$ analytic prediction given by the Marchenko-Pastur distribution. (b) The eigenvalue distribution for the coupling matrix $J$ constructed using the $W$ matrix used in (a). The dashed line corresponds to the large-$N$ analytic prediction, which may be obtained from the Marchenko-Pastur analytic prediction and Eq. 47.

The eigenvalue distribution of $W W^T$ in the large-$N$ limit defined by $N \to \infty$ with $\kappa = N_v/N_h$ held fixed is given by the Marchenko-Pastur distribution [27], which in our conventions is

$$p_{W W^T}(\lambda) = \frac{\sqrt{(R_+ - \lambda)(\lambda - R_-)}}{2\pi\mathcal{J}^2\kappa^{1/2}\lambda}\;\mathbb{1}_{[R_-, R_+]}(\lambda) + \max\big(0, 1 - \kappa^{-1}\big)\,\delta(\lambda), \tag{48}$$

where

$$R_\pm = \mathcal{J}^2\big(\kappa^{-1/4} \pm \kappa^{1/4}\big)^2. \tag{49}$$

In Fig. 4 we plot the eigenvalue distribution for both $W W^T$ and $J$. As a result of this analysis, we conclude that in the large-$N$ limit $\lambda_i(J) \in [-\sqrt{R_+}, \sqrt{R_+}]$, and also $\lambda_i(\tilde{J}) \in [0, 2\sqrt{R_+}]$ (again neglecting $\epsilon$). Thus, we have that

$$T_{\text{mean-field}} = \mathcal{J}\big(\kappa^{-1/4} + \kappa^{1/4}\big), \qquad T_{\text{convex}} = 2\,T_{\text{mean-field}}. \tag{50}$$

Moreover, as in the SK model, both of these temperatures are higher than the critical temperature of the spin-glass phase transition, which in our conventions is [26]:

$$T_{\text{crit}} = \mathcal{J}. \tag{51}$$

Similar to the case of the SK model, the convex/non-convex transition happens well before the phase transition occurs as the temperature is lowered.

As a final example, we examine another prototypical spin-glass model by placing spins on Erdős-Rényi random graphs. We will consider $J_{ij}$ to be a Bernoulli random variable, by which we mean that

$$J_{ij} = \begin{cases} \mathcal{J} & \text{with probability } p, \\ 0 & \text{with probability } 1 - p, \end{cases} \quad (i < j), \tag{52}$$

where the $i > j$ entries are fixed to be $J_{ji} = J_{ij}$, and the diagonal entries are zero. Thus, $J$ is proportional to the adjacency matrix for an Erdős-Rényi random graph [28].

In the limit where $1 \gg p \gg N^{-[\ldots]}$, the eigenvalues of $J$ have been shown to obey a semi-circle law [29]. The first $(N-1)$ eigenvalues form the bulk of the spectrum, and lie within the interval $[-R, R]$, with

$$R = \frac{2 q \mathcal{J}}{\gamma}, \tag{53}$$

where $q^2 := pN$ and $\gamma := (1 - q^2/N)^{-1/2}$. The largest eigenvalue is separated from the bulk eigenvalues due to the fact that the matrix has a non-zero mean, and the average value is

$$\mathbb{E}_J\,\lambda_N(J) = \mathcal{J}\big(\gamma^{-2} + q^2\big). \tag{54}$$

Figure 5: (a) The eigenvalue distribution for a random draw of the coupling matrix from an Erdős-Rényi distribution with $N = 4000$, $q = 1.[\ldots]\,N^{[\ldots]}$, and $\beta\mathcal{J} = 1$. The analytic prediction of [29] is depicted by a dashed line. There is a single eigenvalue separated from the bulk at the location of the vertical dashed line. (b) The largest eigenvalue of the coupling matrix $J$ for 2000 draws of the Erdős-Rényi random graph, with $N = 4000$, $q = 1.[\ldots]\,N^{[\ldots]}$, and $\beta\mathcal{J} = 1$. The dashed line represents the analytic prediction.

Thus, in this limit the mean-field and convex temperatures may be worked out to be:

$$T_{\text{mean-field}} = \big(\gamma^{-2} + q^2\big)\mathcal{J}, \qquad T_{\text{convex}} = \big(\gamma^{-1} + q\big)^2\mathcal{J}. \tag{55}$$

Note that in this example $T_{\text{convex}} \neq 2\,T_{\text{mean-field}}$, which is due to the fact that the eigenvalues of $J$ do not lie in a symmetric interval.
The first $(N-1)$ eigenvalues do, but the largest eigenvalue spoils the symmetry.

Footnote: Although $\gamma \to 1$ as $N \to \infty$, we include it here to bring the analytic prediction closer to the numerical results we find for finite $N$.

In this work, we have presented a probability density theory of general spin-glass systems. This formulation builds on the continuous relaxation method of [13], who introduced the method as a way to apply Hamiltonian Monte Carlo to energy-based models defined over discrete variables. Our original motivation for this work was to adopt a similar approach, and use their continuous relaxation method as a way to recast spin-glasses in terms of continuous variables so that normalizing flows could then be trained to model and better sample spin-glass physics. This is done in our companion paper [15], and we have instead devoted this work to the task of further developing the continuous relaxation method for spin-glass physics and characterizing the physical properties of spin-glasses in this picture.

One of the main advantages of our probability density formulation is the fact that $\mathcal{H}_\beta(x)$ furnishes a geometric encoding of complex spin-glass distributions. The critical points are of paramount importance for understanding this encoding. For example, the topology of the manifold defined by considering all points with energy equal to or below some reference value, i.e. $M_E := \{x : \mathcal{H}_\beta(x) \leq E\}$, can be determined using knowledge of the critical points and Morse theory [30]. As the energy is varied, the topology of $M_E$ is unchanged unless a critical point is crossed, in which case the topology changes by the addition of a $\gamma$-cell, where $\gamma$ is the index of the critical point. Unfortunately, in none of the examples considered here were we able to identify all the critical points. It may be possible to make progress on this for mean-field models like the SK model.

While we have not fully mapped out the geometry of the energy landscape, our results combine to form an interesting picture. For $T > T_{\text{convex}}$, the Hamiltonian density is convex and there is a single global minimum at the origin. As the temperature is lowered below $T_{\text{convex}}$, the energy landscape becomes non-convex (the Hessian at $x = 0$ acquires a negative eigenvalue). In all the examples we considered, $T_{\text{convex}}$ exceeds the critical temperature of any phase transition which may be present. We suspect that this statement is universally true unless $\Delta = 0$ and the phase transition is able to be captured by naive mean-field theory, since in that case $T_{\text{mean-field}} = T_{\text{convex}}$; otherwise $T_{\text{mean-field}} < T_{\text{convex}}$. Assuming that the model possesses a spin-glass phase transition, then as the temperature is lowered further, to $T \leq T_{\text{crit}} < T_{\text{convex}}$, a number of metastable states appear. For the SK model, these are given by solutions of the TAP equation, and they are clearly distinct from the critical points. However, in the limit $T \to 0$, the equation satisfied by the metastable states simplifies to $x = \mathrm{sgn}(Jx)$, and we showed that solutions of this equation are also critical points of the zero-temperature Hamiltonian density. Therefore, at zero temperature the energy landscape of the Hamiltonian density encodes the metastable states of the spin-glass.

Our probability density formulation is quite general and applies to any system of Ising spin variables whose Hamiltonian consists of just 1- and 2-spin interactions (i.e. the family represented by Eq. 1). It would be interesting to extend our approach to more general Hamiltonians consisting of higher-spin couplings. However, it is worth noting that even the more restrictive family of Hamiltonians considered here is already universal. In particular, it includes the energy function of RBMs, which were shown to be universal approximators capable of approximating any discrete distribution to arbitrary accuracy, provided there are enough hidden units [31].
Therefore, our formulation can be used to convert arbitrarily complex distributions over discrete variables into continuous probability densities. Moreover, a whole suite of algorithms and numerical methods designed for systems of continuous variables may now be applied to discrete problems. It will be interesting to understand what properties of a spin-glass system determine whether a continuous or discrete representation needs to be employed. We hope to report progress on this question in the future.

Acknowledgments

We would like to thank S. Isakov, D. Ish, K. Najafi, and E. Parker for useful discussions and comments on this manuscript. We would also like to thank the organizers of the workshop Theoretical Physics for Machine Learning, which took place at the Aspen Center for Physics in January 2019, for stimulating this collaboration and project.

A Convexity of the Hamiltonian density

In this appendix we prove that the Hamiltonian density is convex if and only if $T > T_{\text{convex}}$, with $T_{\text{convex}} = \lambda_N(\tilde{J})$. Using our conventions, in [13] it was proven that $p(x)$ for $\beta = 1$ is log-concave if and only if the eigenvalue spectrum of $\tilde{J}$ is sufficiently narrow, by which we mean

$$0 < \lambda_i(\tilde{J}) < 1, \quad \forall i \in \{1, \ldots, N\}. \tag{56}$$

Note that the left-hand inequality of Eq. 56 is true by construction, since the shift $\Delta$ was chosen so as to make $\tilde{J}$ positive definite. Thus, the spectrum will be narrow if we additionally have that $\lambda_N(\tilde{J}) < 1$. Here we repeat the proof of [13] for the case of Ising spin variables $s \in \{-1, 1\}^N$ and our parametrization. Also, rather than considering the log-concavity of $p(x)$, we shall instead consider the equivalent condition of the convexity of $\mathcal{H}_\beta(x)$.

In the proof we will make use of the relations $\lambda_1(-M) = -\lambda_N(M)$ and $\lambda_1(M^{-1}) = \lambda_N(M)^{-1}$, which hold for $M = \tilde{J}$ and for $M = S(x)$, where $\lambda_1$ and $\lambda_N$ denote the smallest and largest eigenvalues. We will also use the following eigenvalue inequalities for two matrices $A$, $B$:

$$\lambda_1(A) + \lambda_1(B) \leq \lambda_1(A + B) \leq \lambda_1(A) + \lambda_N(B). \tag{57}$$

We will also need the Hessian $K_{ij}(x) := \partial_i \partial_j \mathcal{H}_\beta(x)$, which is

$$K(x) = \tilde{J} - \beta\,\tilde{J} S(x) \tilde{J}, \tag{58}$$

where $S(x)$ is the diagonal matrix given by $S_{ij}(x) := \mathrm{sech}^2\big(\beta\tilde{h}_i(x)\big)\,\delta_{ij}$. We will find it useful to work with the matrix $\tilde{K}(x) := \beta^{-1}\tilde{J}^{-1} K(x) \tilde{J}^{-1}$, which is equal to

$$\tilde{K}(x) = (\beta\tilde{J})^{-1} - S(x). \tag{59}$$

Since $K$ and $\tilde{K}$ are congruent, they have the same numbers of positive, negative, and zero eigenvalues according to Sylvester's Law of Inertia [32].

Proposition 2. The Hamiltonian density $\mathcal{H}_\beta(x)$ is convex if and only if $\beta\tilde{J}$ has a narrow spectrum, by which we mean $\lambda_N(\beta\tilde{J}) < 1$. This is equivalent to $T_{\text{convex}} < T$.

Proof. For the forward direction, assume that $\lambda_N(\beta\tilde{J}) < 1$. Then

$$\lambda_1\big(\tilde{K}(x)\big) = \lambda_1\big((\beta\tilde{J})^{-1} - S(x)\big) \geq \lambda_1\big((\beta\tilde{J})^{-1}\big) + \lambda_1(-S(x)) \geq \lambda_N(\beta\tilde{J})^{-1} - 1 > 0, \tag{60}$$

where we have used $\lambda_1(-S(x)) \geq -1$ in the penultimate step, and the last inequality follows by assumption. Since the smallest eigenvalue of $\tilde{K}(x)$, and hence of the congruent Hessian $K(x)$, is everywhere positive, the Hamiltonian density is convex.

To prove the reverse direction, assume that $0 < \inf_x \lambda_1(\tilde{K}(x))$ and let $x^* = -\tilde{J}^{-1} h$. Then,

$$0 < \inf_x \lambda_1\big(\tilde{K}(x)\big) \leq \lambda_1\big(\tilde{K}(x^*)\big) \leq \lambda_1\big((\beta\tilde{J})^{-1}\big) + \lambda_N(-S(x^*)) = \lambda_N(\beta\tilde{J})^{-1} - 1. \tag{61}$$

Therefore, $\lambda_N(\beta\tilde{J}) < 1$.

B High-temperature expansion of the SK model

According to Eq. 13, the discrete partition function is related to the continuous partition function via:

$$Z_s = e^{-N\beta\Delta/2}\,\frac{Z_x}{\sqrt{(2\pi)^N \det\big((\beta\tilde{J})^{-1}\big)}}. \tag{62}$$

The denominator is just the normalization of a Gaussian distribution with zero mean and covariance matrix $(\beta\tilde{J})^{-1}$. Similarly, $Z_x$ may also be written in terms of an un-normalized Gaussian integral with the same mean and covariance:

$$Z_x = \int d^N x\; e^{-\beta\mathcal{H}_\beta(x)} = \int d^N x\; e^{-\frac{\beta}{2} x^T \tilde{J} x} \prod_i 2\cosh\big(\beta(\tilde{J}x)_i\big). \tag{63}$$

Thus, after rescaling $x \to \beta^{-1/2} x$ the partition function may be written as

$$Z_s = 2^N e^{-N\beta\Delta/2}\,\Big\langle \prod_i \cosh\big(\beta^{1/2}(\tilde{J}x)_i\big) \Big\rangle, \tag{64}$$

where $\langle \cdot \rangle$ denotes an average with respect to the Gaussian with covariance matrix $\Sigma = \tilde{J}^{-1}$. The expansion around $\beta = 0$ may now be carried out. Explicitly, to fourth order we have

$$Z_s = 2^N e^{-N\beta\Delta/2}\,\Big\langle \prod_i \Big(1 + \frac{\beta}{2}(\tilde{J}x)_i^2 + \frac{\beta^2}{24}(\tilde{J}x)_i^4 + \frac{\beta^3}{720}(\tilde{J}x)_i^6 + \frac{\beta^4}{40320}(\tilde{J}x)_i^8 + O(\beta^5)\Big) \Big\rangle, \tag{65}$$

and expanding out the product yields

$$\begin{aligned}
Z_s\, 2^{-N} e^{N\beta\Delta/2} = 1 &+ \frac{\beta}{2}\sum_i \langle (\tilde{J}x)_i^2 \rangle + \beta^2\Big(\frac{1}{24}\sum_i \langle (\tilde{J}x)_i^4 \rangle + \frac{1}{8}\sum_{[ij]} \langle (\tilde{J}x)_i^2 (\tilde{J}x)_j^2 \rangle\Big) \\
&+ \beta^3\Big(\frac{1}{720}\sum_i \langle (\tilde{J}x)_i^6 \rangle + \frac{1}{48}\sum_{[ij]} \langle (\tilde{J}x)_i^4 (\tilde{J}x)_j^2 \rangle + \frac{1}{48}\sum_{[ijk]} \langle (\tilde{J}x)_i^2 (\tilde{J}x)_j^2 (\tilde{J}x)_k^2 \rangle\Big) \\
&+ \beta^4\Big(\frac{1}{40320}\sum_i \langle (\tilde{J}x)_i^8 \rangle + \frac{1}{1440}\sum_{[ij]} \langle (\tilde{J}x)_i^6 (\tilde{J}x)_j^2 \rangle + \frac{1}{1152}\sum_{[ij]} \langle (\tilde{J}x)_i^4 (\tilde{J}x)_j^4 \rangle \\
&\qquad + \frac{1}{192}\sum_{[ijk]} \langle (\tilde{J}x)_i^4 (\tilde{J}x)_j^2 (\tilde{J}x)_k^2 \rangle + \frac{1}{384}\sum_{[ijkl]} \langle (\tilde{J}x)_i^2 (\tilde{J}x)_j^2 (\tilde{J}x)_k^2 (\tilde{J}x)_l^2 \rangle\Big) + O(\beta^5).
\end{aligned} \tag{66}$$

Here the notation $[i_1 i_2 \ldots i_n]$ means that all the indices are distinct.

The Gaussian integrals can be done via Wick contractions, each of which brings in a factor of $\tilde{J}^{-1}$. Working out the first few terms explicitly, one finds a proliferation of terms involving powers of $\Delta$. These terms may be summed to give $e^{N\beta\Delta/2}$, which cancels the corresponding factor on the left-hand side of Eq. 66. Of course, this had to be the case because $Z_s$ is independent of $\Delta$. Carrying out the expansion to fourth order, we find that

$$\ln Z_s = N\ln 2 + \frac{\beta^2}{4}\sum_{[ij]} J_{ij}^2 + \frac{\beta^3}{6}\sum_{[ijk]} J_{ij} J_{jk} J_{ki} + \beta^4\Big(\frac{1}{8}\sum_{[ijkl]} J_{ij} J_{jk} J_{kl} J_{li} - \frac{1}{24}\sum_{[ij]} J_{ij}^4\Big) + O(\beta^5). \tag{67}$$

Taking the disorder average yields:

$$\langle \ln Z_s \rangle_J = N\Big(\ln 2 + \frac{1}{4}(\beta\mathcal{J})^2\Big) - \frac{1}{4}\Big((\beta\mathcal{J})^2 + \frac{1}{2}(\beta\mathcal{J})^4 + O(\beta^6)\Big) + O(N^{-1}). \tag{68}$$

With enough effort, the expansion may be extended to arbitrary order. To facilitate comparison with the result obtained by TAP [20] by expanding in terms of the discrete variables, we note that TAP did not keep track of each term in the expansion. They worked out all terms which contribute at $O(N)$, and at sub-leading order $O(1)$ they restricted their attention to only those terms in the series which contribute to a singularity at the spin-glass phase transition. Concretely, they found:

$$\langle \ln Z_s \rangle_J = N\Big(\ln 2 + \frac{1}{4}(\beta\mathcal{J})^2\Big) + \frac{1}{4}\ln\big(1 - \beta^2\mathcal{J}^2\big) + (\text{non-singular}) + O(N^{-1}). \tag{69}$$

Our calculation has already reproduced the extensive term. Next we will show that the expansion in terms of the $x$ variables also matches the singular logarithm term. For this purpose we will restrict attention to just those terms which involve $n$ distinct copies of $(\tilde{J}x)^2$. Using the notation $A \supset B$ to indicate that the expansion for $A$ contains the expansion $B$, the series restricted to these terms is:

$$Z_s \supset 2^N \sum_{n=3}^{\infty} \frac{\beta^n}{2^n\, n!} \sum_{[i_1 \ldots i_n]} \big\langle (Jx)_{i_1}^2 (Jx)_{i_2}^2 \cdots (Jx)_{i_n}^2 \big\rangle. \tag{70}$$

There are $(2n-1)!!$ different Wick contractions to consider. Of these, there are $(2n-2)!!$ contractions that avoid pairing $x$'s connected through a coupling matrix. Thus the contribution of the connected cyclic terms at each order is given by:

$$Z_s \supset 2^N \sum_{n=3}^{\infty} \frac{\beta^n}{2n} \sum_{[i_1 \ldots i_n]} J_{i_1 i_2} J_{i_2 i_3} \cdots J_{i_n i_1}. \tag{71}$$

The series contains additional terms that result from other contractions, but these are either sub-leading in $1/N$ or do not contribute to the singularity. The contribution of the cyclic terms above may be represented diagrammatically as regular $n$-sided polygon diagrams, where each side represents a factor of the coupling matrix:

$$Z_s \supset \text{[diagram series: regular } n\text{-gons, } n = 3, 4, 5, \ldots\text{]} \tag{72}$$

The next step is to take the logarithm and perform the disorder average. The logarithm introduces additional terms at each order, although many of these vanish at leading order in $1/N$ after taking the disorder average. Among the terms which survive are the squares of the above polygon terms:

$$\langle \ln Z_s \rangle_J \supset -\frac{1}{2}\sum_{n=3}^{\infty} \frac{\beta^{2n}}{(2n)^2} \Big\langle \Big(\sum_{[i_1 \ldots i_n]} J_{i_1 i_2} J_{i_2 i_3} \cdots J_{i_n i_1}\Big)^2 \Big\rangle_J. \tag{73}$$

The disorder average may also be performed using Wick contractions. The only non-vanishing contractions are those where each distinct factor of the coupling $J_{ij}$ appears squared. Diagrammatically, this means that the contribution of these terms corresponds to the double-sided regular polygons:

$$\langle \ln Z_s \rangle_J \supset \text{[diagram series: double-sided regular } n\text{-gons, } n = 3, 4, 5, \ldots\text{]} \tag{74}$$

To count the number of each term, note that there are two cyclic groupings of matrices which must be contracted with one another: $J_{i_1 i_2} J_{i_2 i_3} \cdots J_{i_n i_1}$ and $J_{j_1 j_2} J_{j_2 j_3} \cdots J_{j_n j_1}$. Each $J$ from the first group will be contracted with a $J$ from the second. With no loss of generality, the ordering of the first cycle can be fixed. There are then $(n-1)!$ ways to order the second cycle. However, this over-counts by a factor of 2 because the direction of the cycle is irrelevant. So, the symmetry factor for the $n$-th diagram is $(n-1)!/2$. There is also a factor of $\binom{N}{n}$ corresponding to the number of ways of choosing $n$ distinct sites to form a cycle. The end result is that the double-sided polygons give a contribution of:

$$\langle \ln Z_s \rangle_J \supset -\sum_{n=3}^{\infty} \frac{(n-1)!}{4}\binom{N}{n}\big(\beta^2 \langle J_{ij}^2 \rangle_J\big)^n = -\sum_{n=3}^{\infty} \frac{(\beta\mathcal{J})^{2n}}{4n} + O(N^{-1}). \tag{75}$$

This series is just the expansion of $\frac{1}{4}\ln(1 - \beta^2\mathcal{J}^2) = -\sum_{n \geq 1} (\beta\mathcal{J})^{2n}/(4n)$ with the $n = 1$ and $n = 2$ terms removed. Adding this result to the previous result of Eq. 68 reproduces the expression TAP found in [20], which we have reproduced in Eq. 69.

Therefore, we have found that, regardless of which formulation is used, the disorder-averaged high-temperature expansion produces an extensive $O(N)$ result which is valid above the spin-glass phase transition temperature. At sub-leading order $O(1)$ the expansion also contains an infinite number of cyclic terms which diagrammatically correspond to regular polygons. These terms may be re-summed to find a contribution which becomes singular at the phase transition, indicating that the perturbative expansion has broken down.
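The re-summation of the polygon series can be verified numerically. The sketch below, with hypothetical function names, writes $x = (\beta\mathcal{J})^2$ and compares a long partial sum of $-\sum_{n \geq 3} x^n/(4n)$ with the closed form $\frac{1}{4}\ln(1 - x) + \frac{x}{4} + \frac{x^2}{8}$. The truncation order `n_max` is an arbitrary choice that is more than sufficient for the values of $x$ used here.

```python
import math

def polygon_series(x, n_max=2000):
    # Partial sum of -sum_{n>=3} x^n / (4 n), with x = (beta * J)^2.
    return -sum(x ** n / (4.0 * n) for n in range(3, n_max + 1))

def closed_form(x):
    # (1/4) ln(1 - x) with its n = 1 and n = 2 series terms removed.
    return 0.25 * math.log(1.0 - x) + x / 4.0 + x ** 2 / 8.0

# x = 1/16 corresponds to T = 4 J, the minimal convexity temperature for SK.
samples = (1.0 / 16.0, 0.1, 0.5, 0.9)
for x in samples:
    print(x, polygon_series(x), closed_form(x))
```

The two expressions agree for all $x < 1$, and the only singularity is at $x = 1$, i.e. at $T = \mathcal{J} = T_{\text{crit}}$. At $x = 1/16$, which corresponds to $T = 4\mathcal{J}$, the series converges rapidly and nothing special happens.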
We see no indication that $T_{\text{convex}}$ has any particular significance whatsoever in the partition function. Indeed, $T_{\text{convex}}$ depends on $\Delta$, a parameter introduced as part of the definition of the continuous formulation, whereas $Z_s$ does not. Lastly, we note that the two partition functions $Z_s$ and $Z_x$ are proportional, and that the constants of proportionality are completely well-behaved at $T = T_{\text{convex}}$. Thus, we can conclude that the convex/non-convex transition does not correspond to any sort of phase transition or non-analyticity in either partition function.

C 2d Ising model

As an additional example, we consider the phase transition of the well-studied ferromagnetic Ising model defined over the 2-dimensional square lattice with periodic boundary conditions. The probability density formulation of this model was used in [11] as the first step towards modeling the system with normalizing flows [12, 33] near the paramagnetic/ferromagnetic phase transition.

In this case the eigenvalues may be worked out analytically. For a $d$-dimensional hypercubic lattice with $L$ spins per dimension, the eigenvalues are given by

$$\lambda(J) = 2\mathcal{J}\sum_{\mu=1}^{d} \cos\Big(\frac{2\pi}{L} n_\mu\Big), \quad n_\mu \in \{0, 1, \ldots, L-1\}, \tag{76}$$

where $\mathcal{J}$ is the bond strength. Thus,

$$\lambda_N(J) = 2d\mathcal{J}, \qquad \lambda_1(J) = \begin{cases} -2d\mathcal{J} & L \text{ even}, \\ 2d\mathcal{J}\cos\big(\pi(1 - L^{-1})\big) & L \text{ odd}. \end{cases} \tag{77}$$

In the large-$L$ limit, the difference between even and odd $L$ vanishes, and the eigenvalues lie within the symmetric interval $[-R, R]$ with $R = 2d\mathcal{J}$. As a result, for $d = 2$,

$$T_{\text{mean-field}} = 4\mathcal{J}, \qquad T_{\text{convex}} = 2\,T_{\text{mean-field}}. \tag{78}$$

Both of these are greater than the critical temperature, which is well known to be

$$T_{\text{crit}} = \frac{2\mathcal{J}}{\ln\big(1 + \sqrt{2}\big)} \approx 2.269\,\mathcal{J}. \tag{79}$$

Thus, as the temperature is lowered the Hamiltonian density becomes non-convex well before the phase transition.

Footnote: See for example Sec. 2.2 of [34].

References

[1] D. L. Stein and C. M. Newman, Spin Glasses and Complexity. Princeton University Press, 2013.
[2] M. Mezard and A. Montanari, Information, Physics, and Computation. Oxford University Press, Inc., New York, NY, USA, 2009.
[3] C. Moore and S. Mertens, The Nature of Computation. Oxford University Press, Inc., 2011.
[4] D. Sherrington and S. Kirkpatrick, Solvable model of a spin-glass, Physical Review Letters (1975), no. 26 1792.
[5] T. Castellani and A. Cavagna, Spin-glass theory for pedestrians, Journal of Statistical Mechanics: Theory and Experiment (May, 2005) P05012.
[6] H. Nishimori and G. Ortiz, Phase transitions and critical phenomena, Elements of Phase Transitions and Critical Phenomena (Feb, 2010) 1-15.
[7] M. S. Albergo, G. Kanwar, and P. E. Shanahan, Flow-based generative models for Markov chain Monte Carlo in lattice field theory, Phys. Rev. D (Aug, 2019) 034515.
[8] L. Huang and L. Wang, Accelerated Monte Carlo simulations with restricted Boltzmann machines, Physical Review B (2017), no. 3 035105.
[9] J. Liu, Y. Qi, Z. Y. Meng, and L. Fu, Self-learning Monte Carlo method, Physical Review B (2017), no. 4 041101.
[10] H. Shen, J. Liu, and L. Fu, Self-learning Monte Carlo with deep neural networks, Physical Review B (2018), no. 20 205140.
[11] S.-H. Li and L. Wang, Neural network renormalization group, Physical Review Letters (2018), no. 26 260601.
[12] D. J. Rezende and S. Mohamed, Variational inference with normalizing flows, arXiv preprint arXiv:1505.05770 (2015).
[13] Y. Zhang, Z. Ghahramani, A. J. Storkey, and C. A. Sutton, Continuous relaxations for discrete Hamiltonian Monte Carlo, in Advances in Neural Information Processing Systems, pp. 3194-3202, 2012.
[14] F. Caravelli, On a "continuum" formulation of the Ising model partition function, arXiv preprint arXiv:1908.08065 (2019).
[15] G. S. Hartnett and M. Mohseni, Self-supervised learning of generative spin-glasses with normalizing flows, arXiv preprint arXiv:2001.00585 (2020).
[16] V. Dotsenko, Introduction to the replica theory of disordered statistical systems, vol. 4. Cambridge University Press, 2005.
[17] S. Duane, A. D. Kennedy, B. J. Pendleton, and D. Roweth, Hybrid Monte Carlo, Physics Letters B (1987), no. 2 216-222.
[18] W. R. Gilks and P. Wild, Adaptive rejection sampling for Gibbs sampling, Journal of the Royal Statistical Society: Series C (Applied Statistics) (1992), no. 2 337-348.
[19] M. Mézard, G. Parisi, and M. Virasoro, Spin glass theory and beyond: An Introduction to the Replica Method and Its Applications, vol. 9. World Scientific Publishing Company, 1987.
[20] D. J. Thouless, P. W. Anderson, and R. G. Palmer, Solution of 'Solvable model of a spin glass', Philosophical Magazine (1977), no. 3 593-601.
[21] M. Castellana and E. Zarinelli, Role of Tracy-Widom distribution in finite-size fluctuations of the critical temperature of the Sherrington-Kirkpatrick spin glass, Physical Review B (2011), no. 14 144417.
[22] A. Bray and M. A. Moore, Metastable states in spin glasses, Journal of Physics C: Solid State Physics (1980), no. 19 L469.
[23] P. Smolensky, Information processing in dynamical systems: Foundations of harmony theory, tech. rep., Colorado Univ at Boulder Dept of Computer Science, 1986.
[24] A. Decelle, G. Fissore, and C. Furtlehner, Spectral dynamics of learning restricted Boltzmann machines, arXiv preprint arXiv:1708.02917 (2017).
[25] A. Decelle, G. Fissore, and C. Furtlehner, Thermodynamics of restricted Boltzmann machines and related learning dynamics, Journal of Statistical Physics (2018), no. 6 1576-1608.
[26] G. S. Hartnett, E. Parker, and E. Geist, Replica symmetry breaking in bipartite spin glasses and neural networks, Physical Review E (2018), no. 2 022116.
[27] V. A. Marčenko and L. A. Pastur, Distribution of eigenvalues for some sets of random matrices, Mathematics of the USSR-Sbornik (1967), no. 4 457.
[28] P. Erdős and A. Rényi, On the evolution of random graphs, Publ. Math. Inst. Hung. Acad. Sci (1960), no. 1 17-60.
[29] L. Erdős, A. Knowles, H.-T. Yau, J. Yin, et al., Spectral statistics of Erdős-Rényi graphs I: local semicircle law, The Annals of Probability (2013), no. 3B 2279-2375.
[30] J. Milnor, Morse theory. (AM-51), vol. 51. Princeton University Press, 2016.
[31] N. Le Roux and Y. Bengio,