Algorithmic Thresholds in Mean Field Spin Glasses
Ahmed El Alaoui∗ and Andrea Montanari†

Abstract
Optimizing a high-dimensional non-convex function is, in general, computationally hard andmany problems of this type are hard to solve even approximately. Complexity theory charac-terizes the optimal approximation ratios achievable in polynomial time in the worst case. Onthe other hand, when the objective function is random, worst case approximation ratios areoverly pessimistic. Mean field spin glasses are canonical families of random energy functionsover the discrete hypercube {− , +1 } N . The near-optima of these energy landscapes are orga-nized according to an ultrametric tree-like structure, which enjoys a high degree of universality.Recently, a precise connection has begun to emerge between this ultrametric structure andthe optimal approximation ratio achievable in polynomial time in the typical case. A new ap-proximate message passing (AMP) algorithm has been proposed that leverages this connection.The asymptotic behavior of this algorithm has been analyzed, conditional on the nature of thesolution of a certain variational problem.In this paper we describe the first implementation of this algorithm and the first numericalsolution of the associated variational problem. We test our approach on two prototypical mean-field spin glasses: the Sherrington-Kirkpatrick (SK) model, and the 3-spin Ising spin glass. Weobserve that the algorithm works well already at moderate sizes ( N (cid:38) Mean field spin glasses were introduced nearly half a century ago as idealized mathematical modelsfor disordered magnetic materials [SK75]. Since then, they have emerged as canonical modelsfor random energy functions in high dimension [Tal10]. In this paper we will focus on the mostclassical of these models, the mixed p -spin Ising spin glass model. This can be defined as theenergy function on the discrete hypercube H N : {− , +1 } N → R such that ( H N ( σ )) σ ∈{− , +1 } N isa centered Gaussian vector with E { H N ( σ ) H N ( τ ) } = N ξ ( (cid:104) σ , τ (cid:105) /N ) for a fixed analytic function ∗ Simons Institute for the Theory of Computing, UC Berkeley. † Department of Electrical Engineering and Department of Statistics, Stanford University. a r X i v : . [ c ond - m a t . s t a t - m ec h ] S e p = q γα α q = 0 q q ∗ Figure 1: Cartoon of the tree of ancestor states: two examples. Tree levels are indexed by theoverlap value q , which corresponds to the time index in the algorithm evolution t , and to the normof magnetization vectors (cid:107) m γ (cid:107) /N = q (if γ is at level q ). On the right: a tree with large overlapgap between 0 and q . ξ : [ − , → R . Such a process can be constructed explicitly as H N ( σ ) = ∞ (cid:88) k =2 c k N ( k − / (cid:88) ≤ i , ··· ,i k ≤ N G ( k ) i , ··· ,i k σ i · · · σ i k , (1.1)where ( G ( k ) i ,...,i k ) k ≥ ,i ,...,i k ≥ is a collection of independent standard normal random variables G ( k ) i ,...,i k ∼ N (0 , ξ ( x ) = (cid:80) k ≥ c k x k . In the followingwe will also occasionally refer to the ‘spherical’ version of this model, in which the constraint σ ∈ {− , +1 } N is replaced by σ ∈ S N − ( √ N ) (the sphere of radius √ N in R N ).We will be concerned with the superlevel sets of this Hamiltonian, namely sets S N ( ε ) of con-figurations σ ∈ {− , +1 } N such that H N ( σ ) ≥ (1 − ε ) max σ (cid:48) ∈{± } N H N ( σ (cid:48) ). We will ask whetherconfigurations in S N ( ε ) can be found in polynomial time for any constant ε >
0. From a physicspoint of view, superlevel sets (with small ε ) determine the the low-temperature behavior of thesystem that is described by the Gibbs measure µ N,β ( σ ) := exp { βH N ( σ ) } /Z N,β . Within a decade from their introduction, physicists unveiled a beautiful probabilistic structure thatcaptures the organization of these near-optima [MPV87]. This structure is referred to as ‘replicasymmetry breaking’ (RSB) in connection to its first discovery via the so-called replica method. Itis useful to summarize the main elements of this picture, as first outlined in [MPS +
84, MV85].For ε a sufficiently small constant, the random set S N ( ε ) can be partitioned into M ‘pure states’ S αN , 1 ≤ α ≤ M , plus eventually a negligible subset of configurations N N : S N ( ε ) = ∪ mα =1 S αN ∪ N N .Each pure state S αN can be characterized by its barycenter m α (the ‘magnetization’ vector). All ofthese barycenters approximately lie on a sphere of radius (cid:112) N q ∗ ( ε ): (cid:107) m α (cid:107) = N q ∗ ( ε )(1 + o N (1)).The rescaled radius q ∗ is referred to as the Edwards-Anderson parameter. Within a pure state, thedistance from the barycenter concentrates, which implies (cid:107) σ − m α (cid:107) = N (1 − q ∗ ( ε ))(1 + o N (1)) foralmost all σ ∈ S αN (we can redefine N N to include a small number of configurations that violatethis condition). Pure states are organized in a tree whose levels can be indexed by q ∈ Q ⊆ [0 , q ∗ ]2s we describe next. Pure states occupy the leaves at level q ∗ . Each internal node γ in this treeis referred to as an ‘ancestor state’, and can be associated to a set of configurations S γN that isformed by the union of pure states that are its descendants: S Nγ := ∪ α ∈ D ( γ ) S αN (here D ( γ ) is theset of leaves that are descendants of γ ). By averaging over these states, we can associate to theancestor γ a barycenter m γ : the level q of node γ corresponds to the norm of this barycenter, (cid:107) m γ (cid:107) = N q (1 + o N (1)).Given a certain ancestor state γ at level q , the radius (cid:107) m α − m γ (cid:107) concentrates as α variesamong nodes at level q > q that are descendants of γ . Note that (cid:80) α w α (cid:104) m α − m γ , m γ (cid:105) = 0 byconstruction (where w α := |S αN | / |S γN | ). Hence we must have (cid:107) m α − m γ (cid:107) = N ( q − q + o N (1)) and (cid:104) m α − m γ , m γ (cid:105) = o N ( N ). A cartoon of the geometry of pure states and ancestor states is givenin Fig. 1. Each internal node γ corresponds to an ancestor state and its descendant lie close to thehyperplane orthogonal to m γ . As a special consequence of this geometric picture, given two purestates α , α , with closest common ancestor γ , we have (cid:104) m α , m α (cid:105) = (cid:107) m γ (cid:107) (1 + o N (1)). Thismeans that we can directly probe the tree structure of pure states by sampling two independentconfigurations σ , σ ∼ iid Unif ( S N ( ε )). The expected (over the realization of H N ) probabilitydistribution of the ‘overlap’ |(cid:104) σ , σ (cid:105)| /N converges to a limit measure ν ε with support Q ⊆ [0 , p -spin models, the ultrametric structure of pure states has been maderigorous in a remarkable sequence of mathematical works [Pan13a, Pan13b, CPS18, CPS19]. Amongthe wealth of results, we know that at sufficiently low temperatures (small ε or large β ) the numberof levels of the tree (i.e. the card( Q )) becomes arbitrarily large [ACZ20]. On the other hand, givena sequence of coefficients ( c k ) k ≥ (or, equivalently, their generating function ξ ( x )), we do not yetknow what is the qualitative structure of the set Q which indexes the levels of the tree, and thepossible values of the overlap between pure states. Does the universal RSB structure of random energy landscapes have algorithmic consequences?The present paper focuses on this question, and more precisely on the search problem . 
Namely, weare looking for a polynomial-time algorithm that takes as input a specification of the Hamiltonian H N ( · ), and returns as output a vector σ alg ∈ {− , +1 } N such thatlim N →∞ P (cid:16) H N ( σ alg ) ≥ (1 − ε ) max σ ∈{± } N H N ( σ ) (cid:17) = 1 . (1.2)In particular, we are interested in understanding which of these two scenarios holds:( i ) An arbitrarily good approximation of max σ ∈{± } N H N ( σ ) can be achieved in polynomial time,i.e. the above holds for any fixed ε > ii ) Only a constant factor approximation can be achieved, i.e. the above holds only for ε > ε ∗ for some strictly positive ε ∗ . In the latter case we want to determine ε ∗ .3he relevance of spin glass theory for optimization with random structures was noted early onin the history of the subject. Quoting from a seminal paper by M´ezard and Virasoro [MV85]:‘The ultrametric topology of the space of equilibrium states of a spin glass deservesspecial attention. Such an organization might exist in other systems with frustrationand disorder, and it should have consequences in such fields as optimization problemsor neural networks.’Indeed, the connection between statistical physics and optimization has been an object of interestat least since the introduction of simulated annealing [KGV83]. Beginning in the eighties, physi-cists used nonrigorous methods from physics, such as the replica method (or the equivalent cavitymethod), to study the typical properties of random combinatorial optimization problems. Notableexamples of this literature include the assignment problem [MP85], the random traveling salesmanproblem [MP86], random K -satisfiability ( K -SAT) [MZK +
99, MPZ02, KMRT + K -SAT [MPZ02].These authors proposed a message passing algorithm known as ‘survey propagation’ that amountsto iteratively solving the ‘one step RSB’ cavity equation for a given K -SAT formula. The solutionof these equations is then used to iteratively assign the values of variables. Empirical studiesdemonstrated that this approach is effective in finding solutions of large random 3-SAT formulae[BMZ05]. On the other hand, rigorous analysis was carried out for large K , suggesting limitedadvantage offered by survey propagation over simpler heuristics [GS17].It should be emphasized that, unlike the classical p -spin model of Eq. (1.1), the K -SAT modelcan be regarded as a spin glass on a sparse random graph, a setting which presents additionaltechnical challenges.Among negative results, Gamarnik and coauthors established —in several examples— a con-nection between the support Q of the overlap distribution and the ability of certain classes ofalgorithms to approximately solve the underlying optimization problem [GS14, CGPR19]. Namely,if Q is not an interval, then local algorithms fail: the class of local algorithms include certainimplementations of survey propagation and other message passing techniques. The condition of Q not being an interval is also referred to as ‘overlap gap condition’.The idea of exploiting the RSB picture of the energy landscape in an optimization algorithm wasrevived by a sequence of papers over the last two years. Addario-Berry and Maillard [ABM18] studyoptimization of an idealized model for the tree of ancestor states in spin glasses, also known as theGeneralized Random Energy Model (or GREM) [Der85]. They consider optimization algorithmsthat descend the tree with limited lookahead and prove that such algorithms can find a nearoptimum efficiently, if and only if the tree satisfies the no-overlap gap condition. Subag [Sub18]studied the general spherical p -spin model and presented an algorithm that —again— returns anear optimum under the no-overlap gap condition. More generally, [Sub18] gives an explicit formulafor the energy achieved by this algorithm. Unlike for the Ising spin glass, for the spherical modelit is explicitly known which generating functions ξ imply no overlap gap.Finally, in [Mon19, AMS20], the present authors and Mark Sellke developed a nearly linear timeapproximate message passing (AMP) algorithm for the Ising spin glass and characterized the valueit achieves. This characterization holds under the assumption that a certain variational problem4ver functions on the interval [0 ,
1] achieves its infimum. As a special case, whenever the generating function ξ is such that the resulting model has no overlap gap, this algorithm achieves (1.2) for any ε > 0. The main contributions of this paper are the following.

Variational principle.
As mentioned above, the value achieved asymptotically by the algorithm of [Mon19, AMS20] is described by a variational principle that can be viewed as a modification of the celebrated Parisi formula. Is the infimum of this variational principle achieved? Can it be evaluated numerically? We describe a procedure for solving the variational principle, and apply it to two examples, showing empirically that we obtain approximate solutions with the desired properties.
Implementation.
The analysis of [Mon19, AMS20] is asymptotic and establishes guarantees of the form (1.2) that hold as N → ∞. How large should N be for these asymptotics to kick in? Do these algorithms also work reasonably well for, say, N of the order of a few thousand? Finally, for any given N, several implementation choices need to be made, which become irrelevant as N → ∞. We propose a specific implementation of the algorithm, which appears to perform well already for N of the order of a few thousand, and study the leading finite-N effects.

Threshold energy.
We study two specific examples (two choices of the generating function ξ): the Sherrington-Kirkpatrick model and the 3-spin model. In the second case the algorithm only achieves a threshold energy that is smaller, by a constant factor, than the global optimum. We compare this threshold with the energy achieved by Glauber dynamics (at a fixed, very low temperature) and show that the new algorithm achieves a better approximation.

Connection with physics.
In what ways does the algorithm of [AMS20] explore the landscape of ancestor states? We elucidate this question in two different ways. First we show that the path followed by the algorithm during its execution is such that the configuration m t produced at time t ∈ [0,
1] has the same properties as an ancestor state at level t . In particular, it is anapproximate solution of the generalized TAP equations and has a nearly optimal value of thegeneralized TAP free energy [MV85, CPS18, CPS19].Second, we show that the algorithm can be modified as to produce not one, but multiple nearoptima ( σ α ) α ≥ , with (cid:104) σ α , σ γ (cid:105) /N ≈ α (cid:54) = γ . The algorithm of [Mon19, AMS20] is iterative. It is convenient to index iterations (time) by T δ := { , δ, δ, . . . , − δ, } , where δ is a step size that should be though of as vanishing when N → ∞ ,but sufficiently slowly. The algorithm generates a sequence m t ∈ R N , t ∈ T δ , where each vector m t is a function of ( m s ) s ∈ T δ ∩ [0 ,t − δ ] . The interpretation of these quantities is as follows: at time t ,the algorithm has effectively selected a subset S N,t ⊆ {− , +1 } N of near optima. The vector m t should be interpreted as the barycenter of S N,t (or magnetization vector), and t the correspondingoverlap. In particular, we should expect m t ∈ [ − , N , (cid:107) m t (cid:107) ≈ N t , and (cid:107) m t (cid:48) − m t (cid:107) ≈ N ( t (cid:48) − t )for all t (cid:48) > t . We will see that these conditions are indeed verified as N → ∞ .Before getting into a detailed description of the iteration that computes the sequence of vectors m t , it can be useful to give a heuristic discussion of some of its elements. At time t we want to5elect a subset S N,t ⊆ S
N,t − δ of near optima, given that we know the current barycenter m t − δ .It would seem reasonable to linearize the objective function around the barycenter and computethe current gradient: ˆ z t = ∇ H N ( m t − δ ). We could then try to compute the new barycenter as theexpectation of the ‘linearized’ Gibbs measure e β (cid:104) ˆ z t , σ (cid:105) /Z t . This would result into a second equation m t = tanh( β ˆ z t ).The actual algorithm is more complex, although it retains some of the structure suggested bythe above naive argument. It is a special example of approximate message passing (AMP) algorithm[DMM09, BM11], and we will refer to it as IAMP (incremental AMP). We introduce two auxiliarysequences ( x t ) t ∈ T δ , ( z t ) t ∈ T δ , and define d m t := m t + δ − m t (and d x t , d z t analogously). Startingfrom z = m = x = , and z δ ∼ N (0 , δ I N ), we compute recursivelyd z t = ∇ H N ( m t ) − ∇ H N ( m t − δ ) + ˆ d ( t )d m t − δ , (2.1)d x t = ξ (cid:48)(cid:48) ( t ) γ ∗ ( t ) ∂ x Φ γ ∗ ( t ; x t ) δ + d z t , (2.2) m t = ∂ x Φ γ ∗ ( t ; x t ) , (2.3)where ˆ d ( t ) = ξ (cid:48)(cid:48) ( t ) (cid:82) t γ ( s ) d s . A feasible configuration σ alg is obtained by rounding m , e.g. via σ alg := sign( m ), Here Φ γ ∗ : [0 , × R → R and γ ∗ : [0 , → R are functions which we will definenext via a variational principle (and ∂ x Φ γ ∗ ( t ; x t ) denotes the function ∂ x Φ γ ∗ ( t ; · ) applied entrywiseto the vector x t .)Given a function γ : [0 , → R ≥ , consider the following partial differential equation, which werefer to as the Parisi PDE: ∂ t Φ γ ( t, x ) + 12 ξ (cid:48)(cid:48) ( t ) (cid:16) ∂ x Φ γ ( t, x ) + γ ( t )( ∂ x Φ γ ( t, x )) (cid:17) = 0 , ( t, x ) ∈ [0 , × R , (2.4)with boundary condition Φ γ (1 , x ) = | x | . It is known that a weak solution to the above PDE existsand is unique if γ ∈ L [JT16, AMS20], where we define the following space of functions L := (cid:110) γ : [0 , → R ≥ : (cid:107) ξ (cid:48)(cid:48) γ (cid:107) TV [0 ,t ] < ∞ ∀ t ∈ [0 , , (cid:90) ξ (cid:48)(cid:48) γ ( t ) d t < ∞ (cid:111) . (2.5)Here ξ (cid:48)(cid:48) γ ( t ) := ξ (cid:48)(cid:48) ( t ) γ ( t ), and (cid:107) ξ (cid:48)(cid:48) γ (cid:107) TV [0 ,t ] is the total variation of this function over the interval[0 , t ]. In particular, the condition (cid:107) ξ (cid:48)(cid:48) γ (cid:107) TV [0 ,t ] < ∞ is satisfied if ξ (cid:48)(cid:48) γ has a finite number of jumpdiscontinuities over the interval [0 , t ] and has no singularities.We use the solution of this PDE to define the following variational problem P ( γ ) := Φ γ (0 , − (cid:90) tξ (cid:48)(cid:48) ( t ) γ ( t ) d t , (2.6) e alg := inf γ ∈ L P ( γ ) . (2.7)If the infimum above is achieved at γ ∗ ∈ L , using this function in the construction of the algorithmof Eqs. (2.1) to (2.3) yields H N ( σ alg ) /N → e alg [AMS20]. In other words, the variational principleof Eq. (2.7) characterizes the function value achieved by the IAMP algorithms . The algorithm analyzed in [AMS20] differs from the one of Eqs. (2.1) to (2.3) by terms that vanish as N → ∞ , δ →
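To make the iteration of Eqs. (2.1) to (2.3) concrete, the following is a minimal sketch of one possible implementation; it is not the authors' code. It assumes access to callables grad_H (the gradient of H_N), dPhi (the entrywise map ∂_x Φ_{γ∗}(t, ·)), gamma (γ∗), xi2 (ξ''), and d_hat (the memory coefficient d̂(t) defined after Eq. (2.3)). The small helper make_sk_gradient builds grad_H for a pure 2-spin (SK) instance following Eq. (1.1); all names and conventions here are illustrative.

```python
import numpy as np

def make_sk_gradient(N, seed=0):
    # Pure 2-spin (SK) instance of Eq. (1.1) with xi(x) = x^2:
    # H_N(sigma) = <sigma, G sigma> / sqrt(N), so grad H_N(sigma) = (G + G^T) sigma / sqrt(N).
    G = np.random.default_rng(seed).normal(size=(N, N))
    A = (G + G.T) / np.sqrt(N)
    return lambda m: A @ m

def iamp_sketch(grad_H, dPhi, gamma, xi2, d_hat, N, delta, seed=0):
    """One pass of the incremental AMP iteration of Eqs. (2.1)-(2.3).

    dPhi(t, x) applies partial_x Phi_{gamma*}(t, .) entrywise; d_hat(t) is the
    coefficient multiplying d m^{t - delta} in Eq. (2.1).
    """
    rng = np.random.default_rng(seed)
    m_prev = np.zeros(N)                                  # m^0 = 0
    x = rng.normal(scale=np.sqrt(delta), size=N)          # x^delta = z_delta ~ N(0, delta I_N)
    t, m = delta, dPhi(delta, x)                          # Eq. (2.3) at t = delta
    while t < 1.0 - delta / 2:
        dz = grad_H(m) - grad_H(m_prev) + d_hat(t) * (m - m_prev)   # Eq. (2.1)
        dx = xi2(t) * gamma(t) * dPhi(t, x) * delta + dz            # Eq. (2.2)
        x, m_prev = x + dx, m
        t += delta
        m = dPhi(t, x)                                              # Eq. (2.3)
    return m                                              # round via sigma_alg = np.sign(m)
```

In the experiments of Section 3 this basic loop is supplemented by the orthogonalization and normalization steps of Eqs. (3.6) to (3.8), which are omitted from this sketch.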
0. See below for further discussion. e opt := inf γ ∈ U P ( γ ) , U ≡ L ∩ { γ non-decreasing } (2.8)Then lim N →∞ max σ ∈{± } N H N ( σ ) /N = e opt [Tal06, Pan13b, AC17]. Notice that, since U ⊆ L ,we have e opt ≥ e alg as it should be. Further, γ (cid:55)→ P ( γ ) is known to be strictly convex [AC15, JT16].Hence, if the infimum in Eq. (2.8) is achieved on γ ∗ which is strictly increasing (no overlap gap), theconstraint { γ non-decreasing } is not active at γ ∗ . Hence γ ∗ is also a minimizer for the variationalprinciple (2.7) and therefore e alg = e opt . In other words, under no overlap gap, the above algorithmachieves a (1 − ε )-approximation of the optimum for any ε > z t isapproximately orthogonal to ( m s ) s ≤ t , ( z s ) s ≤ t . Hence, this update can be replaced by a similar onein which d˜ z t = Proj ⊥ t − δ [ ∇ H N ( ˜ m t ) − ∇ H N ( ˜ m t − δ )] , (2.9)where Proj ⊥ t − δ is the projector orthogonal to span( { ˜ m s } s ∈ T δ ∩ [0 ,t − δ ] ).The algorithm of Eqs. (2.1) to (2.3) is a special example of a broader class of IAMP algorithms,and is obtained by optimizing over this class. In order to introduce this class, given a vector x ∈ R N , we denote its average by E N ( x ) := (cid:80) Ni =1 x i /N . For two vectors x , y ∈ R N , the we alsodefine the normalized scalar product (cid:104) x , y (cid:105) N := (cid:80) Ni =1 x i y i /N . The basic iteration depends on twofunctions u, v : [0 , × R → R . Starting from z = m = x = , and z δ ∼ N (0 , δ I N ), we proceedwith the following updates z t + δ = ∇ H N ( m t ) − (cid:88) s ∈ T δ ∩ [ δ,t ] d t,s m s − δ , (2.10) d t,s := − ξ (cid:48)(cid:48) ( (cid:104) m t , m s − δ (cid:105) N (cid:1) E N (cid:16) u δ ( s ; x s ) s 7e conclude this section by observing that the IAMP algorithm described in here admits ageneralization at non-zero temperature [Mon19]. Instead of attempting to find a near optimumof the cost function H N ( σ ), this algorithm attempts to produce a magnetization vector m ∈ R N corresponding to a pure state (i.e. a leaf in the tree of ancestor states), under a no overlap gapcondition. The algorithm proceeds again using Eqs. (2.10) to (2.14) and only differs in the choiceof the functions u , v . These are chosen using Eq. (2.15), with Φ γ the solution of the Parisi PDE(2.4) with boundary condition Φ γ (1 , x ) = β − log(2 cosh βx ) (instead of Φ γ (1 , x ) = | x | ). Finally, γ = γ β is determined according to γ β = arg min (cid:8) P ( γ ) , subj. to γ ∈ U , γ (1) = β (cid:9) . For models with no overlap gap, the solution γ β to this problem is strictly increasing on [0 , q ∗ ], andconstant on ( q ∗ , t ≈ q ∗ (e.g. at t = δ (cid:98) q ∗ /δ (cid:99) ). In this section we present our implementation of the algorithm described above, and our empiricalresults. We focus on the pure 2-spin and 3-spin models: ξ ( x ) = x , and ξ ( x ) = x . These two models are representative of qualitatively different behaviors. The 2-spin model, alsoknown as Sherrington-Kirkpatrick (SK) model, was first introduced in [SK75] and its landcape isbelieved to be organized in a continuum tree with no overlap gap: γ ∗ is strictly increasing. The3-spin model was first introduced in [Der80] and subsequently studied as a toy model for structuralglasses [KT87]. The tree of states is believed to have a gap at 0: ancestor state exist only for q ∈ [ q , q ∗ ], q > 0, and are well separated. 
Given a typical pair γ , γ of ancestor states at level q , we have (cid:107) m γ (cid:107) , (cid:107) m γ (cid:107) = N q (1 + o N (1)) while (cid:104) m γ , m γ (cid:105) = o ( N ). Further, each of theseancestor states gives rise to a continuous tree of descendants. We compute approximate values of the variational problems Eq. (2.7) and Eq. (2.8) by consideringpiecewise constant functions γ ( t ) = (cid:80) n q i =1 r i t ∈ [ q i ,q i +1 ) with n q steps located on a uniform grid q i = i − n q . The Parisi PDE (2.4) can then be solved via the Cole-Hopf transform backwards in time:Φ γ ( q i , x ) = 1 r i log E (cid:104) exp (cid:16) r i Φ γ (cid:0) q i +1 , x + (cid:112) ξ (cid:48) ( q i +1 ) − ξ (cid:48) ( q i ) Z (cid:1)(cid:17)(cid:105) ∀ ≤ i ≤ n, (3.1)where Z ∼ N (0 , x to an interval [ − x max , x max ] which is alsodiscretized into a uniform grid with spacing 1 /n x . This discretization scheme depends on the setof parameters π := ( n q , n x , x max ). In our experiments we observe that n x = 500, and x max = 10 issufficient to produce negligible discretization errors, and study the dependence on n q .The Parisi functional becomes a convex function of the vector r = ( r i ) n q i =1 ∈ R n q ≥ , which wedenote by P π ( r ). The variational principles Eq. (2.7) and Eq. (2.8) are then approximated by the8nite-dimensional convex optimization problemsˆ e alg ( π ) := min r ∈ R n q ≥ P π ( r ) , ˆ e opt ( π ) := min r ∈ R n q ≥ (cid:8) P ( r ) subj. to r ≤ · · · ≤ r n q (cid:9) . (3.2)The above optimization problems are solved to near-optimality via projected gradient descent withbacktracking line search [PB14]. The gradient of Φ γ (0 , 0) w.r.t. r is computed recursively via theformula (3.1). (We refer to Appendix A for further details.)Figure 2 (left frame) shows the optimizers of the variational problem corresponding to ˆ e alg ( π )for the 2-spin and 3-spin models for a few values of n q . For the 2-spin model, we observe that theoptimizer satisfies r < r < · · · < r n q . Since r (cid:55)→ P π ( r ) is a convex function, we conclude thatˆ e alg ( π ) = ˆ e opt ( π ). By solving problem (3.2) for n q ∈ { , , , } , we obtain the estimates:ˆ e = ˆ e = 0 . ± . . (3.3)This compares well with the value published in [CR02] (0 . ± . . r ≤ r ≤ · · · ≤ r n q in the definition of ˆ e opt ( π ), see Eq. (3.2), isactive: the optimal vector of weights r opt is such that r opt = r opt = · · · = r opt n < r opt n +1 < · · · < r opt n q for some 1 < n < n q . On the other hand, the optimizer r alg in the definition of ˆ e alg ( π ) is suchthat r alg > r alg > · · · > r alg n (cid:48) < r alg n (cid:48) +1 < · · · < r alg n q . Consequently the two optimization problems inEq. (3.2) have different values. We also observe that the optimization problem (3.2) for ˆ e ( π ),while still convex, is harder than in the 2-spin case (convergence of projected gradient descent isslower). In order to accelerate convergence, we use a discrete ansatz for γ that takes explicitlyaccount of the overlap gap between 0 and q : γ ( t ) = r − [0 ,q ) ( t ) + n q − (cid:88) i =0 r i [ q i ,q i +1 ) ( t ) , (3.4)with q i = q + (1 − q ) i/n q and r − < r < r < · · · < r n q − . We run projected gradient descentw.r.t. the variables ( r − , r , · · · , r n q − ), with q fixed. If r = r − at some iteration of the algorithm,we make the update q ← q + (1 − q ) /n q and optimize over the vector ( r , · · · , r n q − ) ∈ R n q − ,and then set n q ← n q − 1. 
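As an illustration of the main computational primitive used in this section, the following is a minimal sketch of the backward Cole-Hopf recursion (3.1), assuming a piecewise-constant γ with strictly positive steps r_i, a Gauss-Hermite quadrature for the Gaussian expectation, and linear interpolation on the x grid; the function names and numerical choices are ours, not the authors'.

```python
import numpy as np

def parisi_phi(r, q, xi1, x_grid, n_herm=40):
    """Evaluate Phi_gamma(q_1, .) on x_grid by iterating Eq. (3.1) backwards from
    the boundary condition Phi_gamma(1, x) = |x|.

    r   -- steps (r_1, ..., r_{n_q}) of the piecewise-constant gamma (all > 0)
    q   -- grid q_1 < q_2 < ... < q_{n_q + 1} = 1
    xi1 -- callable returning xi'(t)
    """
    z, w = np.polynomial.hermite_e.hermegauss(n_herm)
    w = w / w.sum()                                   # quadrature weights for E[f(Z)], Z ~ N(0, 1)
    phi = np.abs(x_grid)                              # boundary condition at q = 1
    for i in range(len(r) - 1, -1, -1):
        s = np.sqrt(max(xi1(q[i + 1]) - xi1(q[i]), 0.0))
        vals = np.interp(x_grid[:, None] + s * z[None, :], x_grid, phi)
        vmax = vals.max(axis=1)                       # shift for numerical stability
        phi = vmax + np.log(np.exp(r[i] * (vals - vmax[:, None])) @ w) / r[i]
    return phi
```

The value Φ_γ(0, 0) entering Eq. (2.6) is then read off at the grid point x = 0, and the gradient with respect to r needed by projected gradient descent is obtained by differentiating the same recursion, as detailed in Appendix A.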
The algorithm is run until the value of q stabilizes and the gradient of P π ( r ) is smaller than 3 · − . We numerically findˆ e = 0 . ± . , ˆ e = 0 . ± . . (3.5) We consider the version of the algorithm defined by Eqs. (2.10) to (2.14), where u , v are givenin Eq. (2.15), with a slight modification. As mentioned in Section 2, the increments (d z t ) t ∈ T δ generated by the algorithm are asymptotically orthogonal. We partially enforce this at finite n through an explicit orthogonalization step. Further, the analysis of the algorithm of Section 2, see[AMS20], implies that (cid:107) d z t (cid:107) = N ( ξ (cid:48) ( t + δ ) − ξ (cid:48) ( t )) + o ( N ) and (cid:107) m t (cid:107) = N t + o ( N ). We alsoenforce these normalizations explicitly.Namely, we fix an integer k ≥ t , we perform the following additionalcomputations: 9 .0 0.2 0.4 0.6 0.8 1.0 t | n q | ( t ) t n q ( t ) 200 250 t | n q | ( t ) t n q ( t ) Figure 2: Numerical solutions γ ∗ of the extended variational principle (2.7) or the 2-spin (SK) model(left) and 3-spin model (right). We discretize the function γ using n q = 50 , , , , 250 steps(in physics language, this corresponds to n q -steps of RSB), and denote the resulting solution by γ n q .Top plots: difference between γ n q and γ , indicating the order of magnitude of the discretizationerror. Lower plots: solutions for the finer discretization n q = 200 , z t := z t + δ − z t with the previous k increments d z t − δ ,d z t − δ , · · · , d z t − kδ , and normalize it to have norm (cid:112) N ( ξ (cid:48) ( t + δ ) − ξ (cid:48) ( t )):d z t ←− Proj H ⊥ t,k (cid:0) d z t (cid:1) , (3.6)d z t ←− (cid:112) N ( ξ (cid:48) ( t + δ ) − ξ (cid:48) ( t )) d z t (cid:107) d z t (cid:107) , (3.7)where H t,k = span (cid:0) d z t − δ , d z t − δ , · · · , d z t − kδ (cid:1) .2. We normalize the vector m t , computed via iteration (2.12), to have norm √ tN : m t ←− √ N t m t (cid:107) m t (cid:107) . (3.8)Enforcing these constraints leads to better numerical stability of the algorithm. In our experimentswe use k = 5. We use the numerical solution of the variational principle (2.7) described in theprevious section to compute the nonlinearities u and v via Eq. (2.15). For the 2-spin model, weround the algorithm by taking σ alg = sign( m t =1 ). For the 3-spin model, we find it more effectiveto compute the energy H N ( σ t ), for σ t = sign( m t ) at each iteration, and return the vector σ t thatmaximizes it.We run this algorithms on random instances of the 2-spin and 3-spin models, for various valuesof N and the step size δ and record the achieved energy e G ( N, δ ) := H N ( σ alg ) /N (here G refers tothe randomness in the Hamiltonian H N ). We use the values N ∈ { , , , , } forthe 2-spin model and N ∈ { , , , } for the 3-spin model. For each value of N we use δ ∈ { /N, /N, /N, /N } for both models.We are interested in the behavior as N → ∞ , δ → 0. In order to extract these asymptotics, weevaluate the median ˆ e ( N, δ ) over n s = 100 independent realizations for the SK model, and n s = 2010 .000 0.005 0.010 0.015 N e ( N , ) e opt = e alg N = 2/ N = 3/ N = 4/ N N e ( N , ) e alg e opt N = 2/ N = 3/ N = 4/ N Figure 3: Energies achieved by the algorithm of Section 3.2 for several values of the system size N and the step size δ . Left: SK model. Right: 3-spin model. Data points corresponds to medians over n s independent realizations ( n s = 100 for SK, n s = 20 for 3-spin model), with different symbolscorresponding to different choices of δ (see legend). 
Lines corresponds to the result of the linearregression (3.9), with each line corresponding to a different choice of δ .realizations for the 3 spin model. We then perform least squares linear regression using the modelˆ e N,δ = e + β N − a + β δ . (3.9)We choose the exponent a on the basis of earlier statistical physics literature. For the 2-spin modelwe use a = 2 / 3, which is believed to capture the behavior of the optimum at finite N , namely e opt ( N ) = e opt + c N − / + o ( N − / ) [ABMM08]. The same behavior is follows from the Tracy-Widom law if we replace the constraint σ ∈ {− , +1 } N with σ ∈ S N − ( √ N ) [AGZ09]. For the3-spin model we use a = 1, which is expected to be the correct behavior for the optimum in modelswith 1-RSB [Der80]. Although the 3-spin model at zero temperature is expected to be FRSB, thestructure of the order parameter is very close to 1-RSB , and hence we expect N − corrections todominate, at least at moderate sizes. We find e , = 0 . , e , = 0 . ,β , = − . , β , = − . ,β , = 0 . , β , = − . . (3.10)The above numerics are in striking agreement with the theoretical predictions of Eqs. (3.3) and(3.5) which were obtained using the variational principle (2.7).In Figure 3 we plot the empirical medians ˆ e N,δ as a function of N − a , together with the curves e + β N − a + β ( k/N ), with k ∈ { , , , } corresponding to the regression (3.9). The model (3.9)seems in agreement with the data, a finding that deserves further investigation. Notice in particularthat it is a priori unclear that the exponent a = 2 / The optimal γ in Eq. (2.8) is expected to be constant on an interval [0 , q ) and strictly increasing over [ q , q , with q being rather close to 1: low precision numerics indicate q ∈ [0 . , . .3 Algorithmic threshold Our simulations confirm that the IAMP algorithm achieves the theshold energy e alg defined by thevariational principle (2.7). Further, in models with overlap gap the latter is a constant factor belowthe asymptotic value of the global optimum e opt . A very interesting question is whether there existsa polynomial-time algorithm that can achieve —with high probability— a better energy value than e alg .While this question is widely open, physicists have devoted considerable attention to the studyof one specific algorithm: Glauber dynamics (a.k.a. Gibbs sampling). This is a reversible Markovchain to sample from the Boltzmann distribution µ N,β ( σ ). We regard it as an optimization algo-rithm by initializing it uniformly at random, and running it at a fixed, large value of β independentof N . While no exact treatment of Glauber dynamics exists for Ising spin glasses, a conjecturewas put forward in [Riz13] based on analysis of the energy landscape. Denoting by e Glauber ( β, t, N )the average energy achieved by Glauber dynamics after N t updates on an instance of size N , weexpect lim β →∞ lim t →∞ lim N →∞ e Glauber ( β, t, N ) = ˆ e Glauber (this limit is not expected to change if t scales polynomially in N ). The analysis of [Riz13] suggests :ˆ e (cid:46) . . (3.11)This is smaller than ˆ e ≈ . e .These questions can be studied more easily in the context of spherical models (recall that inthis case the constraint σ ∈ {− , +1 } N is replaced by σ ∈ S N − ( √ N )), and considering Langevininstead of Glauber dynamics (in the zero-temperature limit, the former reduces to gradient flow).We carry out this analysis in Appendix B for two specific examples of the generating function ξ ( · ),both resulting in RSB with overlap gap. 
Our findings are consistent with the conclusions for theIsing case that we discussed above. Namely, the energy value e alg achieved by the algorithms of[Sub18, AMS20] is superior to the threshold e th of gradient flow. In a special case, ξ ( x ) = x + x ,we can also compare e alg with the threshold e cool of a different annealing algorithm studied in[FFRT20b]. We find, again, e alg > e cool > e th : the algorithms of [Sub18, AMS20] is superior also tothis modified approach. One striking prediction of spin glass theory is that, for any q ∈ Q = supp( γ opt ) and any ε > 0, withhigh probability, there exist σ , σ ∈ S N ( ε ) such that (cid:104) σ , σ (cid:105) /N ≈ q [ACZ20]. (Here γ opt is theoptimizer of Parisi formula (2.8), and supp( γ opt ) is the closure of the set of points at which γ opt isstrictly increasing.) The value quoted here is an extrapolation to zero temperature from the non-zero temperature results of [Riz13].While [Riz13] warns against such extrapolation, the result is also consistent with the zero temperature 2RSB cal-culation of [MRT03], and the difference between ˆ e and ˆ e is large enough that it might overcome theextrapolation error. N s t d ( Q ) = 0.0050= 0.0025 Q = m (1) , m (2) / N N = 1000 N = 8000 Figure 4: Correlation between the near optima produced by two executions of the IAMP algorithmon the same SK Hamiltonian, with independent initializations. Bottom plot: histograms of Q = (cid:104) m (1) , m (2) (cid:105) /N for δ = 1 / 200 and two values of N . Top plot: standard deviation of Q as afunction of N . Lines are least squares fits sing log std( Q ) = c − α log N . We obtain α ≈ . 37. Inboth cases we average over 100 independent initializations (hence (cid:0) (cid:1) pairs) and 2 realizations ofthe Hamiltonian. 13t is natural to wonder whether the same can be achieved by our IAMP algorithm. In particular,in cases —such as for the SK model— in which we can find a near optimum, can we also find pairsof near optima σ , σ at nearly all possible values of |(cid:104) σ , σ (cid:105)| /N ∈ [0 , t ∈ T δ , then at time t we bifurcate the trajectory into two parallelcopies indexed by a ∈ { , } : ( z ( a ) ,t , m ( a ) a,t ). The update for m ( a ) ,t is performed at random.Namely, z (1) ,t = z (2) ,t and m (1) ,t = m (2) ,t for all t ≤ t − δ . We draw g (1) , g (2) ∼ iid N (0 , I N ) andset m ( a ) ,t = (cid:40) m ( a ) ,t − δ + √ δ g ( a ) if t = t , m ( a ) ,t − δ + u δ ( t − δ ; x ( a ) ,t − δ ) (cid:12) ( z ( a ) ,t − z ( a ) ,t − δ ) if t ≥ t + δ . (3.12)We expect this to produce a pair of near optima m (1) := m (1) , , m (2) := m (2) , with (cid:104) m (1) , m (2) (cid:105) /N ≈ t (taking the limit δ → after N → ∞ ).We carry out this test for t = 0, i.e. running the algorithm from two independent initializations.To make the effect more visible, we choose the initialization with larger variance: z δ ∼ N (0 , δ I N ),where δ = 10 δ . The results of this experiment are summarized in Fig. 4. For each of N ∈{ , , , } , δ ∈ { / , / } , we generate n = 2 realizations of the SK Hamiltonian.For each realization, we run the algorithm from M = 100 independent initializations. We estimatethe distribution of the scalar product between algorithm outputs Q = (cid:104) m (1) , m (2) (cid:105) /N (for the sameHamiltonian) by taking the empirical distribution over the n (cid:0) M (cid:1) pairs corresponding to the sameHamiltonian. We observe that this distribution is unimodal and centered around 0. 
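Before turning to the finite-N scaling of this distribution, here is a minimal sketch of the t0 = 0 version of this experiment, i.e. two independent initializations on the same Hamiltonian, reusing the illustrative iamp_sketch routine from Section 2; as in the text one may inflate the initialization variance, but we keep the basic version, and all names are ours.

```python
import numpy as np

def overlap_two_runs(grad_H, dPhi, gamma, xi2, d_hat, N, delta, seeds=(1, 2)):
    # Run the IAMP sketch twice on the same Hamiltonian with independent z_delta
    # initializations (the t0 = 0 case of the branching of Eq. (3.12)) and record
    # the overlap Q = <m^(1), m^(2)> / N between the two outputs.
    m1 = iamp_sketch(grad_H, dPhi, gamma, xi2, d_hat, N, delta, seed=seeds[0])
    m2 = iamp_sketch(grad_H, dPhi, gamma, xi2, d_hat, N, delta, seed=seeds[1])
    return float(m1 @ m2) / N
```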
The width ofthis distribution shrinks with N (for any fixed δ ) and data are consistent to the width vanishing asstd( Q ) ∝ N − α , α = 0 . As described in Section 1, the construction of the algorithm is motivated by the picture of thestructure of ancestor states first developed in [MPS + 84, MV85]. We therefore expect that thevector m t generated at iteration t to have similar properties as the magnetization vector m γ ofa typical ancestor state with (cid:107) m γ (cid:107) /N ≈ t . Here we investigate two specific consequences ofthis picture: ( i ) The generalized TAP free energy should be approximately constant during thealgorithm execution, and equal to the final energy achieved; ( ii ) The generalized TAP equationsshould be approximately satisfied at each iteration.Each ancestor state γ can be assigned a free energy which is the log partition function restrictedto the set of configurations in that state. The generalized TAP free energy expresses this as afunction of the magnetization vector m γ . The TAP free energy associated to a magnetizationvector m ∈ R N with (cid:107) m (cid:107) /N = t reads [CPS18, CPS19]: F TAP ( m ) = H N ( m ) + N (cid:88) i =1 Λ γ ( t, m i ) − N (cid:90) t sξ (cid:48)(cid:48) ( s ) γ ( s ) d s , (4.1)Λ γ ( t, m ) := inf s ∈ R (cid:2) Φ γ ( t, x ) − mx (cid:3) . (4.2)In Figure 5 we report the results of numerical experiments in which we evaluate F TAP ( m t ) /N alongthe iterates of the algorithm. These results are consistent with the hypothesis that F TAP ( m t ) /N =14 .0 0.2 0.4 0.6 0.8 1.0 t F T A P t ( F T A P e a l g ) / = 1/100 = 1/200 t F T A P t ( F T A P e a l g ) / = 1/100 = 1/200 Figure 5: Evolution of the generalized TAP free energy during the execution of the algorithm,indexed by t ∈ [0 , N = 2000 and we study the dependence on the step size δ , for the2-spin model (left) and the 3-spin model (right). Top plots: trajectories of the free energy for 5independent realizations (colors correspond to the step size). Bottom plots, rescaled gap betweenthe TAP free energy and the algorithmic threshold e alg . The collapse of curves in the bottom plotssuggests that ∆ t ( δ, ∞ ) = lim N →∞ ( F TAP ( m t ) /N − e alg ) = Θ( δ ). e alg + ∆ t ( δ, N ), where ∆ t ( δ, N ) is an error that vanishes as N → ∞ , and δ → 0. In particular thedata in Fig. 5 suggest ∆ t ( δ, ∞ ) = Θ( δ ). For completeness we included in Appendix C a proof that F TAP ( m t ) /N is indeed constant and equal to e alg for all t ∈ [0 , N → ∞ followed by δ → γ = γ ∗ , the optimizer in the variational principle (2.7). For the SKmodel, this coincides with the optimizer of Parisi formula (2.8), and hence it is the right prescriptionfor the dominant TAP states [CPS18, CPS19]. On other hand, for the 3-spin model, the optimizer γ ∗ is non-monotone. This case is not covered by earlier theories in physics or mathematics.The magnetization vectors of ancestor states are approximate stationary points of the general-ized TAP free energy. The stationarity conditions are equivalent to the following TAP equations(for (cid:107) m (cid:107) /N = t ): z = ∇ H N ( m ) − m ξ (cid:48)(cid:48) ( t ) (cid:90) t γ ( s ) d s , (4.3) m = ∂ x Φ γ ( t ; z ) . (4.4)Writing these equations as m = F ( m ; t ), we plot in Fig. 6 the error ˜∆ t ( δ, N ) := (cid:107) m t − F ( m t ; t ) (cid:107) /N ,where m t is the vector produced by our algorithm. We observe that this value concentrates tightlyabout its average with respect to the realization. Further, the average decreases as δ → 0. 
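For reference, a residual of the TAP equations (4.3)-(4.4) can be evaluated with a sketch like the following. The callable correction(t) stands for the scalar coefficient multiplying m in Eq. (4.3) (it involves ξ'' and an integral of γ) and is passed in rather than hard-coded; we only reproduce the structure of the check, with one natural normalization, and all names are illustrative.

```python
import numpy as np

def tap_residual(m, t, grad_H, dPhi, correction):
    # Write the TAP equations (4.3)-(4.4) as a fixed point m = F(m; t) and return
    # a per-coordinate squared residual (one possible normalization of Delta-tilde_t).
    z = grad_H(m) - correction(t) * m       # Eq. (4.3)
    F = dPhi(t, z)                          # Eq. (4.4)
    return float(np.mean((m - F) ** 2))
```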
Therescaled plots suggest indeed ˜∆ t ( δ, ∞ ) = Θ( δ ).Notice that the basic step of IAMP, cf. Eqs.(2.1), (2.3) does not coincide with a simple iterationof the generalized TAP equations (4.3), (4.4). This is different from the algorithm of [Bol14], thatconstructs TAP solutions in the high temperature phase.15 .0 0.2 0.4 0.6 0.8 1.0 t T A P t T A P / = 0.004 = 0.0025 Figure 6: Evolution of the generalized TAP equations during the execution of the algorithm,indexed by t ∈ [0 , N = 4000 and we study the dependence on the step size δ , for theSK (2-spin) model. Top plots: trajectories of the error ˜∆ t ( δ, N TAP ( δ, N ) /δ averaged over the5 trajectories. Acknowledgements This work was partially supported by the NSF grants CCF-1714305, CCF-2006489 and by the ONRgrant N00014-18-1-2729. References [ABM18] Louigi Addario-Berry and Pascal Maillard, The algorithmic hardness threshold forcontinuous random energy models , arXiv:1810.05129 (2018).[ABMM08] Timo Aspelmeier, Alain Billoire, Enzo Marinari, and Michael A. Moore, Finite-sizecorrections in the Sherrington–Kirkpatrick model , Journal of Physics A: Mathematicaland Theoretical (2008), no. 32, 324008.[AC15] Antonio Auffinger and Wei-Kuo Chen, The Parisi formula has a unique minimizer ,Communications in Mathematical Physics (2015), no. 3, 1429–1444.[AC17] , Parisi formula for the ground state energy in the mixed p -spin model , TheAnnals of Probability (2017), no. 6b, 4617–4631.[ACZ20] Antonio Auffinger, Wei-Kuo Chen, and Qiang Zeng, The SK Model Is Infinite StepReplica Symmetry Breaking at Zero Temperature , Communications on Pure and Ap-plied Mathematics (2020), no. 5, 921–943.16AGZ09] Greg W. Anderson, Alice Guionnet, and Ofer Zeitouni, An introduction to randommatrices , Cambridge University Press, 2009.[AMS20] Ahmed El Alaoui, Andrea Montanari, and Mark Sellke, Optimization of mean-fieldspin glasses , arXiv:2001.00904 (2020).[BM11] Mohsen Bayati and Andrea Montanari, The dynamics of message passing on densegraphs, with applications to compressed sensing , IEEE Trans. on Inform. Theory (2011), 764–785.[BMZ05] Alfredo Braunstein, Marc M´ezard, and Riccardo Zecchina, Survey propagation: Analgorithm for satisfiability , Random Structures & Algorithms (2005), no. 2, 201–226.[Bol14] Erwin Bolthausen, An iterative construction of solutions of the TAP equations forthe Sherrington–Kirkpatrick model , Communications in Mathematical Physics (2014), no. 1, 333–366.[CGPR19] Wei-Kuo Chen, David Gamarnik, Dmitry Panchenko, and Mustazee Rahman, Subop-timality of local algorithms for a class of max-cut problems , The Annals of Probability (2019), no. 3, 1587–1618.[CL04] Andrea Crisanti and Luca Leuzzi, Spherical 2+ p spin-glass model: An exactly solvablemodel for glass to spin-glass transition , Physical review letters (2004), no. 21,217203.[CPS18] Wei-Kuo Chen, Dmitry Panchenko, and Eliran Subag, The generalized TAP free en-ergy , arXiv:1812.05066 (2018).[CPS19] , The generalized TAP free energy II , arXiv preprint arXiv:1903.01030 (2019).[CR02] Andrea Crisanti and Tommaso Rizzo, Analysis of the ∞ -replica symmetry breakingsolution of the Sherrington-Kirkpatrick model , Physical Review E (2002), no. 4,046137.[CS92] Andrea Crisanti and H-J Sommers, The sphericalp-spin interaction spin glass model:the statics , Zeitschrift f¨ur Physik B Condensed Matter (1992), no. 
3, 341–354.[CS17] Wei-Kuo Chen and Arnab Sen, Parisi formula, disorder chaos and fluctuation forthe ground state energy in the spherical mixed p-spin models , Communications inMathematical Physics (2017), no. 1, 129–173.[Der80] Bernard Derrida, Random-energy model: Limit of a family of disordered models , Phys-ical Review Letters (1980), no. 2, 79.[Der85] , A generalization of the random energy model which includes correlations be-tween energies , Journal de Physique Lettres (1985), no. 9, 401–407.[DMM09] David L. Donoho, Arian Maleki, and Andrea Montanari, Message Passing Algorithmsfor Compressed Sensing , Proceedings of the National Academy of Sciences (2009),18914–18919. 17EVdB01] Andreas Engel and Christian Van den Broeck, Statistical mechanics of learning , Cam-bridge University Press, 2001.[FFRT20a] Giampaolo Folena, Silvio Franz, and Federico Ricci-Tersenghi, Gradient descent dy-namics in the mixed p -spin spherical model: finite size simulation and comparisonwith mean-field integration , arXiv preprint arXiv:2007.07776 (2020).[FFRT20b] , Rethinking mean-field glassy dynamics and its relation with the energy land-scape: The surprising case of the spherical mixed p-spin model , Physical Review X (2020), no. 3, 031045.[GS14] David Gamarnik and Madhu Sudan, Limits of local algorithms over sparse randomgraphs , Proceedings of the 5th conference on Innovations in theoretical computerscience, ACM, 2014, pp. 369–376.[GS17] , Performance of sequential local algorithms for the random NAE-K-SAT prob-lem , SIAM Journal on Computing (2017), no. 2, 590–619.[JT16] Aukosh Jagannath and Ian Tobasco, A dynamic programming approach to the parisifunctional , Proceedings of the American Mathematical Society (2016), no. 7,3135–3150.[KGV83] Scott Kirkpatrick, C Daniel Gelatt, and Mario P Vecchi, Optimization by simulatedannealing , science (1983), no. 4598, 671–680.[KMRT + 07] Florent Krzakala, Andrea Montanari, Federico Ricci-Tersenghi, Guilhem Semerjian,and Lenka Zdeborov´a, Gibbs states and the set of solutions of random constraintsatisfaction problems , Proceedings of the National Academy of Sciences (2007),no. 25, 10318–10323.[KT87] Theodore R Kirkpatrick and Devarajan Thirumalai, p-spin-interaction spin-glass mod-els: Connections with the structural glass problem , Physical Review B (1987),no. 10, 5388.[MM09] Marc M´ezard and Andrea Montanari, Information, Physics and Computation , Oxford,2009.[Mon95] R´emi Monasson, Structural glass transition and the entropy of the metastable states ,Physical review letters (1995), no. 15, 2847.[Mon19] Andrea Montanari, Optimization of the Sherrington-Kirkpatrick Hamiltonian , IEEESymposium on the Foundations of Computer Science, FOCS, November 2019.[MP85] Marc M´ezard and Giorgio Parisi, Replicas and optimization , Journal de PhysiqueLettres (1985), no. 17, 771–778.[MP86] , A replica analysis of the travelling salesman problem , Journal de Physique (1986), no. 8, 1285–1296.[MPS + 84] Marc M´ezard, Giorgio Parisi, Nicolas Sourlas, G Toulouse, and Miguel Virasoro, Na-ture of the spin-glass phase , Physical review letters (1984), no. 13, 1156.18MPV87] Marc M´ezard, Giorgio Parisi, and Miguel A. Virasoro, Spin glass theory and beyond ,World Scientific, 1987.[MPWZ02] Roberto Mulet, Andrea Pagnani, Martin Weigt, and Riccardo Zecchina, Coloringrandom graphs , Physical review letters (2002), no. 26, 268701.[MPZ02] Marc M´ezard, Giorgio Parisi, and Riccardo Zecchina, Analytic and algorithmic solu-tion of random satisfiability problems , Science (2002), no. 
5582, 812–815.[MRT03] Andrea Montanari and Federico Ricci-Tersenghi, On the nature of the low-temperaturephase in discontinuous mean-field spin glasses , The European Physical Journal B-Condensed Matter and Complex Systems (2003), no. 3, 339–346.[MRT04] , Cooling-schedule dependence of the dynamics of mean-field glasses , PhysicalReview B (2004), no. 13, 134406.[MV85] Marc M´ezard and Miguel Angel Virasoro, The microstructure of ultrametricity , Jour-nal de Physique (1985), no. 8, 1293–1307.[MZK + 99] R´emi Monasson, Riccardo Zecchina, Scott Kirkpatrick, Bart Selman, and Lidror Troy-ansky, Determining computational complexity from characteristic ‘phase transitions’ ,Nature (1999), no. 6740, 133.[Nis01] Hidetoshi Nishimori, Statistical Physics of Spin Glasses and Information Processing:An Introduction , Oxford University Press, 2001.[Pan13a] Dmitry Panchenko, The Parisi ultrametricity conjecture , Annals of Mathematics(2013), 383–393.[Pan13b] , The Sherrington-Kirkpatrick model , Springer Science & Business Media, 2013.[PB14] Neal Parikh and Stephen Boyd, Proximal algorithms , Foundations and Trends inoptimization (2014), no. 3, 127–239.[Riz13] Tommaso Rizzo, Replica-symmetry-breaking transitions and off-equilibrium dynamics ,Physical Review E (2013), no. 3, 032135.[Sch08] Manuel J Schmidt, Replica symmetry breaking at low temperatures , Ph.F. Thesis, 2008.[SK75] David Sherrington and Scott Kirkpatrick, Solvable model of a spin-glass , Physicalreview letters (1975), no. 26, 1792.[Sub18] Eliran Subag, Following the ground-states of full-RSB spherical spin glasses , arXiv:1812.04588 (2018).[Tal06] Michel Talagrand, The Parisi formula , Annals of Mathematics (2006), 221–263.[Tal10] Michel Talagrand, Mean field models for spin glasses , Springer-Verlag, Berlin, 2010.19 Further details on the solution of the variational principle As mentioned in the main text, we solve the variational principle (2.7) numerically using theprojected gradient method. We give here the explicit expression of the gradient of Φ γ (0 , 0) withrespect to r .Recall from Section 3.1 that using the Cole-Hopf transform with γ ( t ) = (cid:80) n q i =1 r i t ∈ [ q i ,q i +1 ) wehave Φ γ ( q i , x ) = 1 r i log E (cid:104) exp (cid:16) r i Φ γ (cid:0) q i +1 , x + (cid:112) ξ (cid:48) ( q i +1 ) − ξ (cid:48) ( q i ) Z (cid:1)(cid:17)(cid:105) ∀ ≤ i ≤ n q , where Z ∼ N (0 , z i ( x ) = x + (cid:112) ξ (cid:48) ( q i +1 ) − ξ (cid:48) ( q i ) Z for convenience. The gradient with respectto the parameters r = ( r , · · · , r n q ) can be computed recursively as follows ∂∂r j Φ γ ( q i , x ) = 0 if j ≤ i − ,∂∂r j Φ γ ( q i , x ) = E (cid:104) Φ γ ( q i +1 , z i ( x )) e r i Φ γ ( q i +1 ,z i ( x )) (cid:105) E (cid:104) e r i Φ γ ( q i +1 ,z i ( x )) (cid:105) − r i log E (cid:104) e r i Φ γ ( q i +1 ,z i ( x )) (cid:105) if j = i,∂∂r j Φ γ ( q i , x ) = E (cid:104) dd r j Φ γ ( q i +1 , z i ( x )) e r i Φ γ ( q i +1 ,z i ( x )) (cid:105) E (cid:104) e r i Φ γ ( q i +1 ,z i ( x )) (cid:105) if j ≥ i + 1 . B The case of the spherical model Recall that the spherical model is defined by the Hamiltonian H N : S N − ( √ N ) → R being acentered Gaussian process on the N -dimensional sphere with radius √ N —denoted by S N − ( √ N )—with covariance E { H N ( σ ) H N ( σ ) } = N ξ ( (cid:104) σ , σ (cid:105) /N ). The spherical symmetry simplifies thetreatment.In this appendix we compare the behavior the algorithms of [Sub18, AMS20], with properties ofthe energy landscape, as derived within statistical physics. 
Our discussion will be mainly heuristic. B.1 Algorithm We can apply the algorithm of Eqs. (2.10) to (2.14). The optimal choice of functions u, v : [0 , → R → R is given by the following optimization problem:maximize E ( u, v ) := (cid:90) ξ (cid:48)(cid:48) ( t ) E (cid:2) u ( t, X t ) (cid:3) d t , (B.1)subj. to ξ (cid:48)(cid:48) ( t ) E (cid:2) u ( t, X t ) (cid:3) = 1 for all t ∈ [0 , , where it is understood that X t solves the SDE d X t = v ( t, X t ) d t + (cid:112) ξ (cid:48)(cid:48) ( t ) d B t . Unlike in theIsing case, we do not have any constraint on the terminal value M of the martingale M t = (cid:82) t (cid:112) ξ (cid:48)(cid:48) ( s ) u ( s, X s )d B s . This is due to the fact that the Ising constraint σ ∈ {− , +1 } N is replacedby an (cid:96) constraint (cid:107) σ (cid:107) = N , which correspond to the condition E (cid:2) M (cid:3) = 1. This in turn is a20onsequence of the constraint in the optimization problem (B.1). The value of the optimizationproblem (B.1) corresponds to the value achieved by the algorithm.Solving this optimization problem is straightforward. By Cauchy-Schwarz, the only optimizeris given by u ( t, X t ) = 1 / (cid:112) ξ (cid:48)(cid:48) ( t ). With these choices, the algorithm of Eqs. (2.10) to (2.14) reducesto z t + δ = ∇ H N ( m t ) − (cid:88) s ∈ T δ ∩ [ δ,t ] d t,s m s − δ , (B.2) d t,s := − ξ (cid:48)(cid:48) ( (cid:104) m t , m s − δ (cid:105) N (cid:1)(cid:16) ξ (cid:48)(cid:48) ( s ) − / s The (zero temperature) Parisi functional can be written as an explicit function of the pair γ, L where γ ∈ L , and L ≥ (cid:82) γ ( t )d t [CS92, CL04, CS17]: P ( γ, L ) = 12 (cid:90) (cid:18) ξ (cid:48)(cid:48) ( t )Γ( t ) + 1Γ( t ) (cid:19) d t , (B.6)Γ( t ) := L − (cid:90) t γ ( s )d s . (B.7)Equivalently, we can view P as a function of Γ : [0 , → R ≥ which is continuous and non-increasing.It is concave if and only if γ is non-decreasing.As before, we will consider two variational principles associated with P ( γ, L ): e alg := inf (cid:8) P ( γ, L ) : γ ∈ L , L ≥ (cid:90) γ ( t )d t (cid:9) , (B.8) e opt := inf (cid:8) P ( γ, L ) : γ ∈ L , γ non-decreasing L ≥ (cid:90) γ ( t )d t (cid:9) . (B.9)It is easy to see that the first one is the dual of the maximization problem (B.1). The minimum isachieved at Γ( t ) = 1 / (cid:112) ξ (cid:48)(cid:48) ( t ), matching the value of e alg as per Eq. (B.5). The second variationalprinciple —Eq. (B.9)— yields the optimum valuelim N →∞ N max σ ∈ S N − ( √ N ) H N ( σ ) = e opt . (B.10)We will consider two specific examples, i.e. two choices of the function ξ ( · ), which we believeare representative of classes of models that differ in the structure of the solution ( γ, L ) of thevariational problem (B.9): 21. One-step replica symmetry breaking (1RSB) . In this case γ ( t ) = µ for t ∈ [0 , 1) and L > µ .Equivalently, Γ( t ) = L − µt . As a prototype of this class, we study the ‘3 + 4’ model: ξ ( x ) = 12 x + 12 x . (B.11)This pattern of replica symmetry breaking is the simplest possible, and has been studied indetail recently [FFRT20b, FFRT20a]. Hence, this model is particularly useful for comparingthe value achieved by the algorithms studied here, and the features of the energy landscape.We will denote by µ opt the value of µ that corresponds to the minimizer of the variationalprinciple (B.9).Geometrically, this structure of γ corresponds to a one-level tree, whose leaves are well-separated pure states with (cid:107) m α (cid:107) /N ≈ 1, and (cid:104) m α (cid:107) /N ≈ { , } ).2. 
Full replica symmetry breaking with a single gap at (FRSB-1G). In this case there exists t ∈ (0 , 1) such that γ ( t ) = µ for t ∈ [0 , t ] and γ ( t ) strictly increasing and continuous for t ∈ [ t , t ) = a − µt for t ∈ [0 , t ], and Γ( t ) is strictly concave on [ t , ξ ( x ) = 120 x + 16 x + 1992 x . (B.12)This structure of RSB is interesting because it is believed to be quite generic, and in particularit is expected to be the same occurring for the pure p -spin Ising spin glass (i.e., the Ising spinglass with ξ ( x ) = x p , p ≥ µ opt the value of µ at the minimizer ofthe variational principle (B.9).From a geometric point of view, this corresponds to the ancestor states being well separatedsubsets of the sphere S N − ( √ N ). When restricted to such a set, the Gibbs measure presentsfull-replica symmetry breaking with no gaps.In order to explore the energy landscape, we use the statistical physics method of ‘clones’ firstintroduced in [Mon95]. This approach consists in finding a constrained optimum of the functional P ( γ, L ) minimize P ( γ, L ) , subject to γ ∈ L , γ non decreasing, (B.13) γ (0) = µ, L ≥ (cid:90) γ ( t )d t . We observe that, rather than attempting to describe a global minimizer of this variational problem,the correct behavior is obtained by finding a local minimizer γ = γ µ . Namely, we look for a localminimum of this constrained problem with γ µ strictly increasing on an interval [ t ( µ ) , γ in an interval [ t ( µ ) − ε, µ opt . We denote by Ψ( µ )the value of P ( γ, L ) at this local minimizer.Following [Mon95], the function Ψ( µ ) contains information about the exponential growth rateof the number of pure states in a system with 1RSB (scenario 1 above), and the number of ancestor22tates at level t in a system with FRSB-1G (scenario 2 above). We summarize this connectionnext.Given a pure state or an ancestor state α (and the associated set of configurations S α ), we let e α := min σ ∈S α H ( σ ) /N denote the corresponding minimum energy density. The number of states(or ancestor states) at energy density e is believed to grow as exp { N Σ( e ) + o ( N ) } . A parametricexpression for the ‘complexity function’ Σ is given byΣ( e ) = µ Ψ( µ ) − µe , e = ∂∂µ (cid:2) µ Φ( µ ) (cid:3) . (B.14)We have µ opt := arg min Ψ( µ ), and define µ th := arg min ∂ µ [ µ Ψ( µ )]: it is expected that µ (cid:55)→ Ψ( µ )is decreasing for µ < µ opt and increasing for µ > µ opt , and µ (cid:55)→ ∂ µ [ µ Ψ( µ )] is decreasing for µ < µ th and increasing for µ > µ th .We collect a few rigorously known facts about the variational problem (B.9), see [CS17]:1. Given Γ : [0 , → R > , define ¯ g, g : [0 , → R via¯ g ( t ) := ξ (cid:48) ( t ) − (cid:90) t d s Γ( s ) , g ( t ) = (cid:90) t ¯ g ( s ) d s . (B.15)We let S (Γ) := supp( γ ) be the closure of the set of points t at which γ is strictly increasing,with the convention that 0 ∈ supp( γ ) if γ (0 + ) > g (1) = 0, g ( t ) ≥ t ∈ [0 , S (Γ) ⊆ { t ∈ [0 , 1] : g ( t ) = 0 } .2. If Γ solves the variational principle (B.9) and ( a, b ) ⊆ S (Γ) for some 0 ≤ a < b ≤ 1, thenΓ( t ) = 1 / (cid:112) ξ (cid:48)(cid:48) ( t ) for t ∈ ( a, b ). B.3 1RSB: ξ ( x ) = x + x Substituting the 1RSB expression for γ in Eq. (B.6), and minimizing over L > µ , we get thefollowing well known expressionΨ( µ ) = 12 ξ (cid:48) (1) L − µ (cid:0) ξ (cid:48) (1) − ξ (1) (cid:1) + 12 µ log (cid:18) LL − µ (cid:19) , (B.16) L = µ (cid:115) µ ξ (cid:48) (1) . 
(B.17)It is easy to compute the derivative ∂ ( µ Ψ( µ )) = 12 ξ (cid:48) (1) L − µ (cid:0) ξ (cid:48) (1) − ξ (1) (cid:1) + 12( L − µ ) . (B.18)In Figure 7 we plot the curves Ψ( µ ), ∂ ( µ Ψ( µ )) for the ‘3 + 4’ model ξ ( x ) = x / x / 2, as well asthe resulting complexity curve. The stability criterion g ( t ) of Eq. (B.15) takes the form g ( t ) = ξ (1) − ξ ( t ) + 1 µL (1 − t ) + 1 µ log (cid:16) L − µL − µt (cid:17) , (B.19)23 .0 0.5 1.0 1.5 2.0 2.51.651.701.751.801.851.901.95 e opt e alg e th th opt ( )( ( )) e alg e e th e alg e opt ( e ) Figure 7: Left: The function Ψ( µ ) and derivative ∂ ( µ Ψ( µ )) for the spherical model with ξ ( x ) = x + x . The values e th and e alg correspond to to the energy of threshold states and maximum en-ergy. The dashed line to the value achieved by message passing algorithms. Right: The complexityfunction Σ( e ). t g ( t ) > opt = optth < < opt = th < th Figure 8: The stability criterion g ( t ) defined in Eq. (B.15) for ξ ( x ) = x + x . From top tobottom µ = 0 . , . , . , . , . L is determined by Eq. (B.17). In particular, expanding g around t = 1, we get: g ( t ) = 12 (cid:104) L − µ ) − ξ (cid:48)(cid:48) (1) (cid:105) (1 − t ) + O ((1 − t ) ) . (B.20)The maximum energy is obtained by minimizing Ψ over µ : the location of the minimum is µ opt ,and its value is e opt = Ψ( µ opt ). The threshold value of µ th is obtained by solving g (cid:48)(cid:48) (1) = 0, andthe corresponding energy is e th = ∂ ( µ Ψ( µ )) | µ = µ th . For our running example ξ ( x ) = x / x / e opt ≈ . , µ opt ≈ . , (B.21) e th ≈ . , µ th ≈ . . (B.22)Notice that the threshold values are close, but not identical to the one corresponding to the mini-mum of ∂ ( µ Ψ( µ )), namely e min ≈ . µ min ≈ . .0 1.2 1.4 1.60.7550.7600.7650.7700.7750.780 e opt e alg e th th opt ( )( ( )) e alg e e th e alg e opt ( e ) Figure 9: Left: The function Ψ( µ ) and corresponding derivative for the spherical model with ξ ( x ) = x + x + x . The two marked points correspond to the energy of threshold statesand maximum energy. The dashed line to the value achieved by message passing algorithms. Right:Complexity function Σ( e ).The above threshold energy matches the asymptotic value achieved by gradient flow, as derivedin [FFRT20b]. These values should be compared with the energy value achieved by the algorithmof [Sub18], or the one studied here. This is given by Eq. (B.5), which evaluates to e alg ≈ . . (B.23)In other words, the present algorithms overcome the ‘threshold’ energy.Finally, we can compare with an algorithm that samples an initial condition σ according tothe Gibbs measure at the ‘mode coupling’ or ‘dynamical phase transition’ temperature β d (this isconjectured to be possible in polynomial time), and then runs gradient flow with initialization σ .The authors of [FFRT20b] obtain the following asymptotic value for this procedure: e cool ≈ . . (B.24)The algorithm studied here surpasses this value by a small but clearly non-vanishing amount.Moving to more general models with 1RSB, the above procedure yields the following formulafor the threshold energy e th = ξ (cid:48) (1) − ξ (cid:48) (1) ξ (1) + ξ (cid:48)(cid:48) (1) ξ (1) ξ (cid:48) (1) (cid:112) ξ (cid:48)(cid:48) (1) µ th = ξ (cid:48)(cid:48) (1) − ξ (cid:48) (1) ξ (cid:48) (1) (cid:112) ξ (cid:48)(cid:48) (1) . (B.25)This again coincides with the asymptotic value achieved by gradient flow, derived in [FFRT20b]. 
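These 1RSB quantities can be reproduced numerically directly from the functional (B.6)-(B.7) with the ansatz Γ(t) = L − µt; the sketch below does this for the '3 + 4' model, together with the algorithmic value obtained at Γ(t) = 1/√(ξ''(t)) as in Eq. (B.5). The closed form for L comes from stationarity of (B.6) in L, and the integration grid and the grid search over µ are our own illustrative choices, not the paper's procedure.

```python
import numpy as np

# '3 + 4' model: xi(x) = x^3/2 + x^4/2
xi1 = lambda t: 1.5 * t**2 + 2.0 * t**3          # xi'(t)
xi2 = lambda t: 3.0 * t + 6.0 * t**2             # xi''(t)

ts = np.linspace(0.0, 1.0, 20001)
tm, dt = 0.5 * (ts[:-1] + ts[1:]), np.diff(ts)   # midpoint rule for integrals over [0, 1]

def P_of_Gamma(Gamma):                           # Eq. (B.6): P = (1/2) * int (xi'' Gamma + 1/Gamma) dt
    return 0.5 * float(np.sum((xi2(tm) * Gamma + 1.0 / Gamma) * dt))

def Psi(mu):                                     # 1RSB ansatz Gamma(t) = L - mu*t; stationarity in L
    L = 0.5 * (mu + np.sqrt(mu**2 + 4.0 / xi1(1.0)))   # gives L*(L - mu) = 1/xi'(1)
    return P_of_Gamma(L - mu * tm)

e_alg = float(np.sum(np.sqrt(xi2(tm)) * dt))     # value at Gamma(t) = 1/sqrt(xi''(t)), cf. Eq. (B.5)
mu_grid = np.linspace(0.2, 3.0, 1401)
psi = np.array([Psi(m) for m in mu_grid])
mu_opt, e_opt_1rsb = float(mu_grid[psi.argmin()]), float(psi.min())  # e_opt = Psi(mu_opt)
```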
B.4 FRSB-1G: ξ(x) = c₂ x² + c₃ x³ + c₄ x⁴

Notice that in this case the function t ↦ 1/√(ξ''(t)) is concave over an interval [t_s, 1], where t_s is obtained by solving the equation 2 ξ''(t_s) ξ''''(t_s) = 3 ξ'''(t_s)², and can be determined numerically in the present example. In this scenario Γ takes the form

Γ(t) = 1/√(ξ''(t_*)) − µ (t − t_*) for t ∈ [0, t_*),  (B.26)
Γ(t) = 1/√(ξ''(t)) for t ∈ (t_*, 1].  (B.27)

Concavity of Γ implies that t_* ∈ [t_s, 1] and

µ ≤ ξ'''(t_*) / ( 2 ξ''(t_*)^{3/2} ).  (B.28)

Notice that the stability criterion of Eq. (B.15) requires that ḡ(t) = 0 for t ∈ [t_*, 1], and ḡ(1) = 0 implies the relation

µ = µ(t_*) := √(ξ''(t_*)) / ξ'(t_*) − 1 / ( t_* √(ξ''(t_*)) ).  (B.29)

Comparing this with Eq. (B.28), we obtain that t_* ≥ t_th, where t_th is determined by the following equation:

ξ'''(t_th) / ( 2 ξ''(t_th)^{3/2} ) = √(ξ''(t_th)) / ξ'(t_th) − 1 / ( t_th √(ξ''(t_th)) ).  (B.30)

We define the threshold value of µ by µ_th = µ(t_th). In Figure 9 we plot the resulting function Ψ(µ) and its derivative, together with the complexity function Σ(e). In Figure 10 we plot the corresponding function γ : [0, 1] → ℝ achieving the infimum in the variational principle.

Figure 10: Local optimum γ_µ(t) of the constrained variational problem (B.13) for the FRSB-1G model. From top to bottom: µ = µ_opt, µ_th < µ < µ_opt, and µ = µ_th.

Figure 11: The function g(t) defined in Eq. (B.15) for the FRSB-1G model. Left: four decreasing values of µ, the largest near µ_opt and the smallest equal to µ_th. Right: detail near t = t_*; from top to bottom µ = µ_th + 0.03, µ_th + 0.02, µ_th + 0.01, and µ_th.

For our running example we obtain numerical values for the maximum energy (e_opt, µ_opt), for the threshold (e_th, µ_th), and for the algorithmic value e_alg, cf. Eqs. (B.31)-(B.33). Also in this case we observe that the algorithmic approach in this paper overcomes the threshold energy, e_alg > e_th.

Finally, in order to justify the definition of the threshold µ_th, we evaluate the stability criterion g(t) of Eq. (B.15). We obtain ḡ(t) = 0 for t_* ≤ t ≤ 1, and

ḡ(t) = ξ'(t) − ξ'(t_*) + ξ''(t_*)(t_* − t) / [ 1 + µ √(ξ''(t_*)) (t_* − t) ],  0 ≤ t ≤ t_*.  (B.34)

By our criterion for the threshold, we need to verify that g(t) ≥ 0 for t ∈ (t_* − ε(µ), 1] if and only if µ > µ_th. This is equivalent to ḡ(t) ≥ 0 for t ∈ (t_* − ε(µ), 1] if and only if µ > µ_th. Expanding ḡ(t) for t ↑ t_*, we obtain

ḡ(t) = ξ''(t_*)^{3/2} ( ξ'''(t_*) / (2 ξ''(t_*)^{3/2}) − µ ) (t − t_*)² + O((t − t_*)³).  (B.35)

Hence, by Eqs. (B.28), (B.29) and (B.30), our definition of µ_th matches the general stability criterion.
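Determining the threshold in the FRSB-1G scenario thus reduces to a one-dimensional root-finding problem: solve Eq. (B.30) for t_th and set µ_th = µ(t_th) via Eq. (B.29). A minimal sketch follows; the coefficients c2, c3, c4 below are illustrative placeholders, not the values used for the figures, and a genuine FRSB-1G mixture should additionally satisfy the concavity condition of Eq. (B.28) on [t_th, 1].

```python
import numpy as np
from scipy.optimize import brentq

# Mixture xi(x) = c2 x^2 + c3 x^3 + c4 x^4 (placeholder coefficients).
c2, c3, c4 = 0.5, 0.05, 0.12

dxi   = lambda t: 2*c2*t + 3*c3*t**2 + 4*c4*t**3   # xi'
ddxi  = lambda t: 2*c2 + 6*c3*t + 12*c4*t**2       # xi''
dddxi = lambda t: 6*c3 + 24*c4*t                   # xi'''

def mu_of_tstar(t):
    # Eq. (B.29): value of mu for which gbar vanishes on [t_*, 1].
    return np.sqrt(ddxi(t)) / dxi(t) - 1.0 / (t * np.sqrt(ddxi(t)))

def concavity_bound(t):
    # Right-hand side of Eq. (B.28): largest slope compatible with concavity.
    return dddxi(t) / (2.0 * ddxi(t)**1.5)

# Eq. (B.30): t_th solves concavity_bound(t) = mu_of_tstar(t).
h = lambda t: concavity_bound(t) - mu_of_tstar(t)
ts = np.linspace(0.05, 0.999, 2000)
crossings = np.where(np.diff(np.sign(h(ts))) != 0)[0]
if crossings.size:
    t_th = brentq(h, ts[crossings[0]], ts[crossings[0] + 1])
    print(f"t_th ~ {t_th:.4f}, mu_th = mu(t_th) ~ {mu_of_tstar(t_th):.4f}")
else:
    print("no solution of Eq. (B.30) in (0, 1) for these coefficients")
```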
C The TAP free energy is constant along the trajectory

In this Appendix we provide a short argument for the fact that the rescaled TAP free energy F_TAP(·)/N is asymptotically constant along the trajectory (m^t)_{t ∈ T_δ ∩ [0,1]} of the IAMP algorithm, as N → ∞ followed by δ → 0. To this end we recall from [AMS20] that

lim_{δ→0⁺} p-lim_{N→∞} H_N(m^t)/N = ∫₀^t ξ''(s) E[ ∂²_x Φ_{γ*}(s, X_s) ] ds,  (C.1)

where (X_t)_{t ∈ [0,1]} satisfies the SDE dX_t = ξ''(t) γ*(t) ∂_x Φ_{γ*}(t, X_t) dt + √(ξ''(t)) dB_t, with X_0 = 0, and (B_t)_{t ∈ [0,1]} is a standard Brownian motion. So it remains to find the limit of the entropy term. By state evolution, and convergence of the discrete-time approximation to the SDE limit as δ → 0,

lim_{δ→0⁺} p-lim_{N→∞} (1/N) Σ_{i=1}^N Λ_{γ*}(t, m_i^t) = E[ Λ_{γ*}(t, M_t) ],

where M_t = ∂_x Φ_{γ*}(t, X_t) = ∫₀^t √(ξ''(s)) ∂²_x Φ_{γ*}(s, X_s) dB_s. Using Itô's formula on Λ(t, M_t) we have

dΛ(t, M_t) = ( ∂_t Λ(t, M_t) + (ξ''(t)/2) ∂²_m Λ(t, M_t) ∂²_x Φ_{γ*}(t, X_t)² ) dt + ∂_m Λ(t, M_t) dM_t.

Exploiting the relations Λ_{γ*}(t, m) := inf_{x ∈ ℝ} [ Φ_{γ*}(t, x) − m x ] and M_t = ∂_x Φ_{γ*}(t, X_t), together with the strict convexity of Φ_{γ*}(t, ·), we have

∂_t Λ(t, M_t) = ∂_t Φ_{γ*}(t, X_t), and ∂²_m Λ(t, M_t) = − ∂²_x Φ_{γ*}(t, X_t)^{−1}.

Therefore

dΛ(t, M_t) = ( ∂_t Φ_{γ*}(t, X_t) − (ξ''(t)/2) ∂²_x Φ_{γ*}(t, X_t) ) dt + ∂_m Λ(t, M_t) dM_t
= − ξ''(t) ( (γ*(t)/2) ∂_x Φ_{γ*}(t, X_t)² + ∂²_x Φ_{γ*}(t, X_t) ) dt + ∂_m Λ(t, M_t) dM_t,

where we used the Parisi PDE to obtain the last line. Integrating between 0 and t and taking expectations (the martingale term has mean zero),

E[ Λ(t, M_t) ] = Φ_{γ*}(0, 0) − (1/2) ∫₀^t ξ''(s) γ*(s) E[ ∂_x Φ_{γ*}(s, X_s)² ] ds − ∫₀^t ξ''(s) E[ ∂²_x Φ_{γ*}(s, X_s) ] ds.  (C.2)

It was shown in [AMS20] that the optimality of γ* ∈ 𝓛 as a minimizer of the Parisi formula, assuming a minimizer indeed exists, implies

E[ ∂_x Φ_{γ*}(t, X_t)² ] = t for all t ∈ [0, 1].

Putting everything together we see that the last term in Eq. (C.2) cancels with the energy term Eq. (C.1), and we get, for all t ∈ [0, 1],

lim_{δ→0⁺} p-lim_{N→∞} (1/N) F_TAP(m^t) = lim_{δ→0⁺} p-lim_{N→∞} (1/N) { H_N(m^t) + Σ_{i=1}^N Λ_{γ*}(t, m_i^t) − (N/2) ∫_t^1 s ξ''(s) γ*(s) ds }
= Φ_{γ*}(0, 0) − (1/2) ∫₀^t s ξ''(s) γ*(s) ds − (1/2) ∫_t^1 s ξ''(s) γ*(s) ds
= Φ_{γ*}(0, 0) − (1/2) ∫₀¹ s ξ''(s) γ*(s) ds = P(γ*) = e_alg.
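As an illustration of how the objects above can be handled numerically, the following sketch discretizes the SDE for (X_t) with an Euler-Maruyama scheme and estimates E[∂_x Φ_{γ*}(t, X_t)²] by Monte Carlo. Both dPhi and gamma_star below are illustrative stand-ins (in an actual implementation they come from a numerical solution of the Parisi PDE and from the minimizing γ*), so the printed statistic will not satisfy the optimality identity E[∂_x Φ_{γ*}(t, X_t)²] = t; the point here is the scaffolding, not the numbers.

```python
import numpy as np

rng = np.random.default_rng(0)

# SK-like placeholder mixture: xi(x) = x^2, so xi''(t) = 2.
ddxi = lambda t: 2.0

# Placeholders: in a real implementation, dPhi(t, x) = d/dx Phi_{gamma*}(t, x)
# comes from a Parisi PDE solver, and gamma_star from the Parisi minimization.
dPhi = lambda t, x: np.tanh(x)
gamma_star = lambda t: 1.0

# Euler-Maruyama discretization of
#   dX_t = xi''(t) gamma*(t) dPhi(t, X_t) dt + sqrt(xi''(t)) dB_t,  X_0 = 0.
n_paths, n_steps = 100_000, 500
dt = 1.0 / n_steps
X = np.zeros(n_paths)
for k in range(n_steps):
    t = k * dt
    drift = ddxi(t) * gamma_star(t) * dPhi(t, X)
    X += drift * dt + np.sqrt(ddxi(t) * dt) * rng.standard_normal(n_paths)

# Monte Carlo estimate of E[dPhi(1, X_1)^2]; for the true gamma* this
# quantity would equal t at every time (here it will not, by construction).
print("E[(d_x Phi(1, X_1))^2] ~", np.mean(dPhi(1.0, X)**2))
```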