Low-Degree Hardness of Random Optimization Problems
David Gamarnik∗, Aukosh Jagannath‡, and Alexander S. Wein§

∗Sloan School of Management and Operations Research Center, MIT
‡Department of Statistics and Actuarial Science and Department of Applied Mathematics, University of Waterloo
§Department of Mathematics, Courant Institute of Mathematical Sciences, NYU
Abstract
We consider the problem of finding nearly optimal solutions of optimization problems with random objective functions. Such problems arise widely in the theory of random graphs, theoretical computer science, and statistical physics. Two concrete problems we consider are (a) optimizing the Hamiltonian of a spherical or Ising p-spin glass model, and (b) finding a large independent set in a sparse Erdős-Rényi graph. Two families of algorithms are considered: (a) low-degree polynomials of the input—a general framework that captures methods such as approximate message passing and local algorithms on sparse graphs, among others; and (b) the Langevin dynamics algorithm, a canonical Monte Carlo analogue of the gradient descent algorithm (applicable only for the spherical p-spin glass Hamiltonian).

We show that neither family of algorithms can produce nearly optimal solutions with high probability. Our proof uses the fact that both models are known to exhibit a variant of the overlap gap property (OGP) of near-optimal solutions. Specifically, for both models, every two solutions whose objectives are above a certain threshold are either close to or far from each other. The crux of our proof is the stability of both algorithms: a small perturbation of the input induces a small perturbation of the output. By an interpolation argument, such a stable algorithm cannot overcome the OGP barrier.

The stability of the Langevin dynamics is an immediate consequence of the well-posedness of stochastic differential equations. The stability of low-degree polynomials is established using concepts from Gaussian and Boolean Fourier analysis, including noise sensitivity, hypercontractivity, and total influence.

∗Email: [email protected]. Supported by ONR grant N00014-17-1-2790.
‡Email: [email protected].
§Email: [email protected]. Partially supported by NSF grant DMS-1712730 and by the Simons Collaboration on Algorithms and Geometry.

Introduction
In this paper, we study the problem of producing near-optimal solutions of random optimization problems by polynomials of low degree in the input data. Namely, we prove that no low-degree polynomial can succeed at achieving a certain objective value in two optimization problems: (a) optimizing the Hamiltonian of the (spherical or Ising) p-spin glass model, and (b) finding a large independent set in a sparse Erdős-Rényi graph, with high probability in the realization of the problem. We rule out polynomials of degree as large as cn for the p-spin glass models, and as large as cn/log n for the independent set problem, for a constant c, provided the algorithm is assumed to succeed with all but exponentially small (in n) probability, where n is the problem dimension. More generally, we provide a tradeoff between the degree of polynomials that we rule out and the success probability assumed. For the spherical p-spin model, we also give a lower bound against Langevin dynamics.

Our motivation for focusing on "low-degree" approximations is two-fold. Firstly, from an approximation theory perspective, producing near-optimal solutions by a polynomial in the input is very natural. Indeed, in many problems of interest the best known polynomial-time algorithms can be placed within the family of low-degree methods. For example, in the settings we consider here, the best known polynomial-time optimization results can be captured by the approximate message passing (AMP) framework [Mon19, EMS20] (for the p-spin) and by the class of local algorithms on sparse graphs [LW07] (for the independent set problem), respectively. Both of these families of algorithms are captured by constant-degree polynomials; see Appendix A for more details. For spherical p-spin glass models, earlier work of [Sub18] introduced an algorithm which performs as well as AMP; we expect this algorithm to also fall into the family of low-degree methods, but verifying this is less clear. Secondly, a recent line of work [BHK+19, HS17, HKP+17, Hop18] on the sum-of-squares hierarchy has produced compelling evidence that the power of low-degree polynomials is a good proxy for the intrinsic computational complexity of a broad class of hypothesis testing problems. Below, we briefly review this theory of low-degree polynomials in hypothesis testing.

The low-degree framework was initiated in [HS17, HKP+17, Hop18] to study computational hardness in hypothesis testing problems. Specifically, this line of work has focused on high-dimensional testing problems where the goal is to determine whether a given sample (e.g., an n-vertex graph) was drawn from the "null" distribution Q_n (e.g., the Erdős-Rényi model) or the "planted" distribution P_n (e.g., a random graph with planted structure such as a large clique or a small cut). Through an explicit and relatively straightforward calculation, one can determine whether there exists a (multivariate) polynomial f (in the entries of the observed sample) of a given degree D = D(n) that can distinguish P_n from Q_n (in a particular sense) [HS17, HKP+17, Hop18]. A conjecture of Hopkins [Hop18] (inspired by [BHK+19, HS17, HKP+17]) states that for "natural" testing problems of this type, if some polynomial-time algorithm can distinguish P_n, Q_n (with error probability o(1)) then there is also an O(log n)-degree polynomial that can distinguish P_n, Q_n. One justification for this conjecture is its deep connection with the sum-of-squares (SoS) hierarchy—a powerful class of meta-algorithms—and in particular the pseudo-calibration approach [BHK+19] (see [HKP+17, Hop18, RSS18] for details). Another justification for the conjecture is that O(log n)-degree polynomials can capture a very broad class of spectral methods (see [KWB19, Theorem 4.4] for specifics), which in turn capture the best known algorithms for many high-dimensional testing problems (e.g., [HSS15, HSSS16, HKP+17]); indeed, O(log n)-degree polynomials succeed (at testing) in the same parameter regime as the best known polynomial-time algorithms (e.g., [HS17, HKP+17, Hop18, BKW19, KWB19, DKWB19]). (Oftentimes, the hypothesis testing variants of these types of problems seem to be equally hard as the more standard task of recovering the planted signal.) Lower bounds against low-degree polynomials are one concrete form of evidence that the existing algorithms for these problems cannot be improved (at least without drastically new algorithmic techniques). For more details on the low-degree framework for hypothesis testing, we refer the reader to [Hop18, KWB19].

One goal of the current work is to extend the low-degree framework to the setting of random optimization problems. This includes defining what it means for a low-degree polynomial to succeed at an optimization task, and giving techniques by which one can prove lower bounds against all low-degree polynomials. One difference between the optimization and testing settings is that many existing optimization algorithms can be represented as constant-degree polynomials (see Appendix A), instead of the O(log n) degree required in the testing case. A substantial difficulty that we face in the optimization setting is that, in contrast to the testing setting, it does not seem possible to prove lower bounds against low-degree polynomials via a straightforward explicit calculation. To overcome this, our proofs take a more indirect route and leverage a certain structural property—the overlap gap property (OGP)—of the optimization landscape, combined with stability properties of low-degree polynomials. We also use similar techniques to give lower bounds against Langevin dynamics, a canonical Monte Carlo analogue of gradient descent; while this is not a low-degree polynomial (due to its continuous-time nature), it is similar in spirit and has similar stability properties.

While the OGP has been used to rule out various classes of other algorithms previously (see below), its usage in our current setting presents some substantial technical difficulties which we need to overcome.
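As a toy numerical illustration of the stability phenomenon just described (the degree-1 map f and the matrix W below are hypothetical stand-ins for this illustration, not objects from the paper), one can generate ρ-correlated copies of a Gaussian instance and watch the output of a low-degree polynomial move continuously with ρ:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
Y = rng.standard_normal(n)        # Gaussian instance (a degree-1 stand-in for a p-tensor)
Y_fresh = rng.standard_normal(n)  # independent instance used to decorrelate
W = rng.standard_normal((n, n)) / np.sqrt(n)

def f(y):
    # a toy vector-valued low-degree (here degree-1) polynomial of the input
    return W @ y

def output_drift(rho):
    # Y_rho has the same law as Y and correlation rho with Y, coordinate-wise
    Y_rho = rho * Y + np.sqrt(1.0 - rho**2) * Y_fresh
    return float(np.linalg.norm(f(Y_rho) - f(Y)))

# the drift vanishes at rho = 1 and grows continuously as rho decreases
assert output_drift(1.0) == 0.0
assert output_drift(0.9) < output_drift(0.5) < output_drift(0.0)
```

The interpolation argument exploits exactly this continuity: a stable algorithm's output cannot jump across the forbidden overlap interval as the instance is gradually decorrelated.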
Roughly speaking, the property states that for every pair of nearly-optimal solutions x₁ and x₂, their normalized overlap (normalized inner product) measured with respect to the ambient Hilbert space must lie in a disjoint union of intervals [0, ν₁] ∪ [ν₂, 1]. Furthermore, the same holds along an interpolation between two instances of the problem: for any two members of the interpolated family and any x₁, x₂ which are near optimizers for these two members, respectively, it is still the case that the overlap of x₁ and x₂ belongs to [0, ν₁] ∪ [ν₂, 1]. The crux of the argument is then as follows. Denoting by x(t) the result of the algorithm corresponding to the interpolation step t, the stability of the algorithm implies that the overlap between x(0) and x(t) changes "continuously" in t. At the same time we show separately that the starting solution x(0) and terminal solution x(1) have an overlap at most ν₁, and thus at some point the overlap between x(0) and x(t) belongs to (ν₁, ν₂), which is a contradiction.

Establishing stability for low-degree polynomials and Langevin dynamics is quite non-trivial and constitutes the key technical contribution of the paper. For the case of polynomials, these stability results harness results from Gaussian and Boolean Fourier analysis. We prove two separate variants of this stability result, depending on whether the random input is Gaussian- or Bernoulli-distributed. A key technical result in the Gaussian case is Theorem 3.1, which informally states that if we have two ρ-correlated random instances X and Y of a random tensor, and f is a vector-valued low-degree polynomial defined on such tensors, then the distance ‖f(X) − f(Y)‖ is unlikely to exceed a certain value which depends continuously on ρ. In particular, this distance is small when ρ ≈ 1. Proving this result relies on a well-known consequence of hypercontractivity for low-degree polynomials, and on basic properties of Hermite polynomials (the orthogonal polynomials of the Gaussian measure). In the case of Bernoulli-distributed inputs, we prove a related stability result (Theorem 4.2) which shows that when the input variables are resampled one at a time, the output of a vector-valued low-degree polynomial will never change significantly in one step, with nontrivial probability. The proof involves the notion of total influence from Boolean analysis, as well as a direct proof by induction on the dimension. The proof of stability for Langevin dynamics is based on the continuous dependence of stochastic differential equations on their coefficients.

The OGP emerged for the first time in the context of spin glass theory and random constraint satisfaction problems. It was first proven implicitly in [AC08], [ACR11], and [MMZ05]. These papers established that the set of satisfying assignments of a random K-SAT formula partitions into clusters above a certain clause-to-variables density. This was postulated as evidence of algorithmic hardness of finding satisfying assignments for such densities. Implicitly, the proof reveals that the overlaps of satisfying assignments exhibit the OGP, and clustering is inferred from this. It is worth noting that while OGP implies the existence of clusters, the converse is not necessarily the case, as one can easily construct a clustered space of solutions with overlaps spanning the entire interval [0, 1]. The OGP was first used to rule out a class of algorithms in the context of local algorithms—factors of i.i.d. (FIID)—designed to find large independent sets in sparse Erdős-Rényi graphs. The OGP was used to show that, asymptotically, these algorithms cannot find independent sets larger than a multiplicative factor 1/2 + 1/(2√2) ≈ 0.85 of optimal. The present paper recovers this result as a special case, since (as we discuss in Appendix A) local algorithms can be captured by constant-degree polynomials. The lower bound against local algorithms was improved by [RV17] to a multiplicative factor of 1/2; this is the best possible, since local algorithms are known to achieve the multiplicative factor 1/2. Variants of the OGP have since been used to rule out other classes of stable algorithms, including in the context of the p-spin model [GJ19a]. The current work draws inspiration from a key idea in [CGPR19, GJ19a], namely that a particular variant of OGP—the same variant that we use in the current work—implies failure of any sufficiently "stable" algorithm.

We emphasize that the class of algorithms ruled out by the lower bounds in this paper (namely, low-degree polynomials) not only captures existing methods such as AMP and local algorithms, but contains a strictly larger (in a substantial way) class of algorithms than prior work on random optimization problems. We now illustrate this claim in the setting of the p-spin optimization problem. The best known polynomial-time algorithms for optimizing the p-spin Hamiltonian are captured by the AMP framework [Mon19, EMS20]. Roughly speaking, AMP algorithms combine a linear update step (tensor power iteration) with entry-wise non-linear operations. For a fairly general class of p-spin optimization problems (including spherical and Ising mixed p-spin models), it is now known precisely what objective value can be reached by the best possible AMP algorithm [EMS20].

While this may seem like the end of the story, we point out that for the related tensor PCA problem—which is a variant of the p-spin model with a planted rank-1 signal—AMP is known to be substantially sub-optimal compared to other polynomial-time algorithms [RM14]. None of the best known polynomial-time algorithms [RM14, HSS15, HSSS16, WEM19, Has20, BCR20] use the tensor power iteration step as in AMP, and there is evidence that this is fundamental [BGJ18]; instead, the optimal algorithms include spectral methods derived from different tensor operations such as tensor unfolding [RM14, HSS15] (which can be interpreted as a higher-order "lifting" of AMP [WEM19]). These spectral methods are captured by O(log n)-degree polynomials.
With this in mind, we should a priori be concerned that AMP might also be sub-optimal for the (non-planted) p-spin optimization problem. This highlights the need for lower bounds that rule out not just AMP, but all low-degree polynomial algorithms. While the lower bounds in this paper do not achieve the precise optimal thresholds for objective value, they rule out quite a large class of algorithms compared to existing lower bounds for random optimization problems.

We refer the reader to Appendix A for a more detailed discussion of how various optimization algorithms can be approximated by low-degree polynomials.

Notation
We use ‖·‖ and ⟨·,·⟩ to denote the standard ℓ₂ norm and inner product of vectors. We also use the same notation to denote the Frobenius norm and inner product of tensors. We use the term polynomial both to refer to (multivariate) polynomials R^m → R in the usual sense, and to refer to vector-valued polynomials R^m → R^n defined as in (3). We abuse notation and use the term degree-D polynomial to mean a polynomial of degree at most D. A random polynomial has possibly-random coefficients, as defined in Section 2.1.1. We use A^c to denote the complement of an event A. Unless stated otherwise, asymptotic notation such as o(1) or Ω(n) refers to the limit n → ∞ with all other parameters held fixed. In other words, this notation may hide constant factors depending on other parameters such as the degree d in the independent set problem.

p-Spin Glass Hamiltonian

The first class of problems we consider here is optimization of the (pure) p-spin glass Hamiltonian, defined as follows. Fix an integer p ≥ 2 and let Y ∈ (R^n)^{⊗p} be a p-tensor with real coefficients. For x ∈ R^n, consider the objective function

    H_n(x; Y) = n^{-(p+1)/2} ⟨Y, x^{⊗p}⟩.    (1)

Note that all homogeneous polynomials of degree p (in the variables x) can be written in this form for some Y. We focus on the case of a random coefficient tensor Y. In this setting, the function H_n is sometimes called the Hamiltonian for a p-spin glass model in the statistical physics literature. More precisely, for various choices of a (compact) domain X_n ⊂ R^n, we are interested in approximately solving the optimization problem

    max_{x ∈ X_n} H_n(x; Y)    (2)

given a random realization of the coefficient tensor Y with i.i.d. N(0, 1) entries. Here and in the following we let P_Y denote the law of Y. (When it is clear from context we omit the subscript Y.)

We begin first with a simple norm constraint, namely, we will take as domain S_n = {x ∈ R^n : ‖x‖ = √n}, the sphere in R^n of radius √n. We then turn to understanding a binary constraint, namely where the domain is the discrete hypercube Σ_n = {+1, −1}^n. Following the statistical physics literature, in the former setting we call the objective the spherical p-spin glass Hamiltonian, and in the latter setting the Ising p-spin glass Hamiltonian.

In both settings, quite a lot is known about the maximum. It can be shown [GT02, JT17] that the maximum value of H_n has an almost sure limit (as n → ∞ with p fixed), called the ground state energy, which we will denote by E_p(S) for the spherical setting and E_p(Σ) for the Ising setting. Explicit variational formulas are known for E_p(S) [ABČ13, JT17, CS17] and E_p(Σ) [AC18, JS17]. Algorithmically, it is known how to find, in polynomial time, a solution of value E^∞_p(S) − ε or E^∞_p(Σ) − ε (respectively for the spherical and Ising settings) for any constant ε > 0, where E^∞_p denotes a certain algorithmic threshold satisfying E^∞_2 = E_2 and E^∞_p < E_p for p ≥ 3. In other words, it is known how to efficiently optimize arbitrarily close to the optimal value in the p = 2 case, but not when p ≥ 3.

Our goal here is to understand how well one can optimize (2) via the output of a vector-valued low-degree polynomial in the coefficients Y. To simplify notation we will often abuse notation and refer to the space of p-tensors on R^n by R^m ≅ (R^n)^{⊗p} where m = n^p. We say that a function f : R^m → R^n is a polynomial of degree (at most) D if it may be written in the form

    f(Y) = (f_1(Y), . . . , f_n(Y)),    (3)

where each f_i : R^m → R is a polynomial of degree at most D.

We will also consider the case where f is allowed to have random coefficients, provided that these coefficients are independent of Y. That is, we will assume that there is some probability space (Ω, P_ω) and that f : R^m × Ω → R^n is such that f(·, ω) is a polynomial of degree at most D for each ω ∈ Ω. We will abuse notation and refer to this as a random polynomial f : R^m → R^n.

Our precise notion of what it means for a polynomial to optimize H_n will depend somewhat on the domain X_n. This is because it is too much to ask for the polynomial's output to lie in X_n exactly, and so we fix a canonical rounding scheme that maps the polynomial's output to X_n. We begin by defining this notion for the sphere: X_n = S_n.

The spherical case.
We will round a polynomial's output to the sphere S_n by normalizing it in the standard way. To this end, for a random polynomial f : R^m → R^n we define the random function g_f : R^m → S_n ∪ {∞} by

    g_f(Y, ω) = √n · f(Y, ω) / ‖f(Y, ω)‖,

with the convention g_f(Y, ω) = ∞ if f(Y, ω) = 0.
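As an illustration (the contraction order and helper names below are our own choices for this sketch, not code from the paper), the objective (1) and the spherical rounding g_f can be evaluated numerically as follows:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 30, 4
Y = rng.standard_normal((n,) * p)      # coefficient p-tensor with i.i.d. N(0,1) entries

def H(x, Y, p=p, n=n):
    # H_n(x; Y) = n^{-(p+1)/2} <Y, x tensor p>, via repeated index contraction
    v = Y
    for _ in range(p):
        v = v @ x                      # contract the last index of v against x
    return float(v) / n ** ((p + 1) / 2)

def g(fx, n=n):
    # normalize a polynomial's output to the sphere of radius sqrt(n)
    norm = np.linalg.norm(fx)
    return np.sqrt(n) * fx / norm if norm > 0 else None  # None stands in for infinity

x = g(rng.standard_normal(n))          # a random point, rounded to the sphere
assert abs(np.linalg.norm(x) - np.sqrt(n)) < 1e-8
value = H(x, Y)                        # objective value of this random point
```

Note that H is linear in Y for fixed x, which is what makes the Gaussian Fourier-analytic tools applicable later.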
Definition 2.1.
For parameters µ ∈ R, δ ∈ [0, 1], γ ∈ [0, 1], and a random polynomial f : R^m → R^n, we say that f (µ, δ, γ)-optimizes the objective (1) on S_n if the following are satisfied when (Y, ω) ∼ P_Y ⊗ P_ω:

• E_{Y,ω} ‖f(Y, ω)‖² = n (normalization).

• With probability at least 1 − δ over Y and ω, we have both H_n(g_f(Y, ω); Y) ≥ µ and ‖f(Y, ω)‖ ≥ γ√n.

Implicit in this definition, the case f(Y, ω) = 0 must occur with probability at most δ. The meaning of the parameters (µ, δ, γ) is as follows: µ is the objective value attained after normalizing the polynomial's output to the sphere, and δ is the algorithm's failure probability. Finally, γ is involved in the norm bound ‖f(Y, ω)‖ ≥ γ√n that we need for technical reasons. Since the domain is S_n, f is "supposed to" output a vector of norm √n. While we do not require this to hold exactly (and have corrected for this by normalizing f's output), we do need to require that f usually does not output a vector of norm too much smaller than √n. This norm bound is important for our proofs because it ensures that a small change in f(Y, ω) can only induce a small change in g_f(Y, ω). We now state our main result on low-degree hardness of the spherical p-spin model, with the proof deferred to Section 3.2.

Theorem 2.2. For any even integer p ≥ 4 there exist constants µ < E_p(S), n* ∈ N, and δ* > 0 such that the following holds. For any n ≥ n*, any D ∈ N, any δ ≤ min{δ*, exp(−D)}, and any γ ≥ (2/3)^D, there is no random degree-D polynomial that (µ, δ, γ)-optimizes (1) on S_n.

A number of remarks are in order. First, this result exhibits a tradeoff between the degree D of polynomials that we can rule out and the failure probability δ that we need to assume.
In order to rule out polynomials of any constant degree, we need only the mild assumption δ = o(1). On the other hand, if we are willing to restrict to algorithms of failure probability δ = exp(−cn) (which we believe is reasonable to expect in this setting), we can rule out all polynomials of degree D ≤ c′n for a constant c′ = c′(c). It has been observed in various hypothesis testing problems that the class of degree-n^δ polynomials is at least as powerful as all known exp(n^{δ−o(1)})-time algorithms [Hop18, KWB19, DKWB19]. This suggests that optimizing arbitrarily close to the optimal value in the spherical p-spin (for even p ≥ 4) requires fully exponential time exp(n^{1−o(1)}).

The best known results for polynomial-time optimization of the spherical p-spin were first proved by [Sub18] but can also be recovered via the AMP framework of [EMS20]. As discussed in Appendix A, these AMP algorithms can be captured by constant-degree polynomials. Furthermore, the output of such an algorithm concentrates tightly around norm √n and thus easily satisfies the norm bound with γ = (2/3)^D required by our result. We also expect that these AMP algorithms have failure probability δ = exp(−Ω(n)); while this has not been established formally, a similar result on concentration of AMP-type algorithms has been shown by [GJ19a].

Our results are limited to the case where p ≥ 4 is even and µ is a constant slightly smaller than the optimal value E_p(S). These restrictions are in place because the OGP property used in our proof is only known to hold for these values of p and µ. If the OGP were proven for other values of p or for a lower threshold µ, our results would immediately extend to give low-degree hardness for these parameters (see Theorem 3.6). Note that we cannot hope for the result to hold when p = 2, because this is a simple eigenvector problem with no computational hardness: there is a constant-degree algorithm to optimize arbitrarily close to the maximum (see Appendix A).

The Ising case.
We now turn to low-degree hardness in the Ising setting, where the domain is the hypercube: X_n = Σ_n. In this case, we round a polynomial's output to the hypercube by applying the sign function. For x ∈ R, let

    sgn(x) = +1 if x ≥ 0,  −1 if x < 0,

and for a vector x ∈ R^n let sgn(x) denote entry-wise application of sgn(·). We now define our notion of near optimality for a low-degree polynomial.

Definition 2.3.
For parameters µ ∈ R, δ ∈ [0, 1], γ ∈ [0, 1], η ∈ [0, 1], and a random polynomial f : R^m → R^n, we say that f (µ, δ, γ, η)-optimizes the objective (1) on Σ_n if the following are satisfied.

• E_{Y,ω} ‖f(Y, ω)‖² = n (normalization).

• With probability at least 1 − δ over Y and ω, we have both H_n(sgn(f(Y, ω)); Y) ≥ µ and |{i ∈ [n] : |f_i(Y, ω)| ≥ γ}| ≥ (1 − η)n.

The interpretation of these parameters is similar to the spherical case, with the addition of η to take into account issues related to rounding. More precisely, as in the spherical case, µ is the objective value attained after rounding the polynomial's output to the hypercube, and δ is the failure probability. The parameters γ, η are involved in an additional technical condition, which requires f's output not to be too "small" in a particular sense. Specifically, all but an η-fraction of the coordinates of f's output must exceed γ in magnitude. The need for this condition in our proof arises in order to prevent a small change in f(Y, ω) from inducing a large change in sgn(f(Y, ω)). We have the following result on low-degree hardness in the Ising setting. The proof is deferred to Section 3.2.
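A minimal sketch of this rounding and of the (γ, η) condition check (the helper `round_to_cube` and its return convention are ours, not from the paper; the sign convention sgn(0) = +1 matches the definition above):

```python
import numpy as np

def sgn(x):
    # entry-wise sign with the convention sgn(0) = +1
    return np.where(np.asarray(x, dtype=float) >= 0, 1, -1)

def round_to_cube(fx, gamma, eta):
    """Round fx to {-1,+1}^n and check the technical condition: all but
    an eta-fraction of coordinates must have magnitude at least gamma."""
    fx = np.asarray(fx, dtype=float)
    n = fx.size
    condition_holds = np.sum(np.abs(fx) >= gamma) >= (1 - eta) * n
    return sgn(fx), bool(condition_holds)

x, ok = round_to_cube([0.8, -0.6, 0.0, 1.2], gamma=0.5, eta=0.25)
# sgn gives [1, -1, 1, 1]; exactly one coordinate (the 0.0) falls below
# gamma = 0.5, which the eta = 0.25 budget on n = 4 coordinates allows
```

The point of the condition is visible here: a coordinate near 0 can flip sign under a tiny perturbation of fx, so only an η-fraction of such fragile coordinates is tolerated.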
Theorem 2.4.
For any even integer p ≥ 4 there exist constants µ < E_p(Σ), n* ∈ N, δ* > 0, and η > 0 such that the following holds. For any n ≥ n*, any D ∈ N, any δ ≤ min{δ*, exp(−D)}, and any γ ≥ (2/3)^D, there is no random degree-D polynomial that (µ, δ, γ, η)-optimizes (1) on Σ_n.

This result is very similar to the spherical case, and the discussion following Theorem 2.2 also applies here. The best known algorithms for the Ising case also fall into the AMP framework [Mon19, EMS20] and are thus captured by constant-degree polynomials. These polynomials output a solution "close" to the hypercube in a way that satisfies our technical condition involving γ, η. As in the spherical case, the case p = 2 is computationally tractable; here it is not a simple eigenvector problem but can nonetheless be solved by the AMP algorithm of [Mon19, EMS20].

One natural motivation for understanding low-degree hardness is to investigate the performance of natural iterative schemes, such as power iteration or gradient descent. In the spherical p-spin model, the natural analogues of these algorithms (in continuous time) are Langevin dynamics and gradient flow. While these are not directly low-degree methods, the overlap gap property can still be seen to imply hardness for these methods in a fairly transparent manner.

To make this precise, let us introduce the following. Let B_t denote spherical Brownian motion. (For a textbook introduction to spherical Brownian motion see, e.g., [Hsu02].) For any variance σ ≥ 0, we define Langevin dynamics for H_n to be the strong solution to the stochastic differential equation

    dX_t = σ dB_t + ∇H_n(X_t; Y) dt,    X_0 = x_0,

where ∇ denotes the spherical gradient. Note that since H_n(x; Y) is a polynomial in x, H_n is (surely) smooth, and consequently the solution is well-defined in the strong sense [Hsu02]. The case σ = 0 is referred to as gradient flow on the sphere.

In this setting, it is natural to study the performance with random starts which are independent of Y, e.g., a uniform at random start. In this case, if the initial distribution is given by X_0 ∼ ν for some ν ∈ M(S_n), the space of probability measures on S_n, we will denote the law by Q_ν. In this setting we have the following result which is, again, a consequence of the overlap gap property.

Theorem 2.5.
Let p ≥ 4 be even. There exist µ < E_p(S) and c > 0 such that for any σ ≥ 0 and T ≥ 0 fixed, n sufficiently large, and ν ∈ M(S_n), if X_t denotes Langevin dynamics for H_n(·; Y) with variance σ and initial data ν, then

    P_Y ⊗ Q_ν ( H_n(X_T; Y) ≤ µ ) ≥ 1 − exp(−cn).

In particular, the result holds for ν_n = Unif(S_n), the uniform measure on S_n.

The proof can be found in Section 3.4. To our knowledge, this is the first proof that neither Langevin dynamics nor gradient descent reaches the ground state started from a uniform at random start. We note furthermore that the above applies even to T ≤ c′ log n for c′ > 0 sufficiently small.

There is a large literature on the dynamics of p-spin glass models; it is impossible here to provide a complete reference, though we point the reader to the surveys [BCKM98, Cug03, Gui07, Jag19]. To date, much of the analysis of the dynamics in the non-activated regime considered here (n → ∞ and then t → ∞) has concentrated on the Crisanti–Horner–Sommers–Cugliandolo–Kurchan (CHSCK) equations approach [CHS93, CK93]. This approach centers around the analysis of a system of integro-differential equations which are satisfied by the scaling limit of natural observables of the underlying system. While this property of the scaling limit has now been shown rigorously [BDG01, BDG06], there is limited rigorous understanding of the solutions of the CHSCK equations beyond the case p = 2. A far richer picture is expected here, related to the phenomenon of aging [Gui07, Ben02]. More recently, a differential inequality–based approach to understanding this regime was introduced in [BGJ20], which provides upper and lower bounds on the energy level reached for a given initial datum. That being said, this upper bound is nontrivial only for σ sufficiently large.

We end by noting that overlap gap–like properties, namely "free energy barriers," have been used to develop spectral gap estimates for Langevin dynamics which control the corresponding L² mixing time [GJ19b, BJ18].
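For intuition only, a crude projected Euler–Maruyama discretization of such dynamics might look as follows (this is not the exact construction of Theorem 2.5, which uses spherical Brownian motion and the spherical gradient; here a finite-difference Euclidean gradient plus a retraction to the sphere stands in, and all parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 20, 4
Y = rng.standard_normal((n,) * p)

def H(x):
    # H_n(x; Y) via repeated contraction, as in (1)
    v = Y
    for _ in range(p):
        v = v @ x
    return float(v) / n ** ((p + 1) / 2)

def grad_H(x, h=1e-5):
    # finite-difference Euclidean gradient (illustration-only shortcut)
    g = np.zeros(n)
    for i in range(n):
        e = np.zeros(n)
        e[i] = h
        g[i] = (H(x + e) - H(x - e)) / (2 * h)
    return g

def project(x):
    # retract back onto the sphere of radius sqrt(n)
    return np.sqrt(n) * x / np.linalg.norm(x)

x = project(rng.standard_normal(n))   # uniform-at-random start, independent of Y
sigma, dt, steps = 0.05, 0.01, 100
for _ in range(steps):
    drift = grad_H(x) * dt                              # ascent on the Hamiltonian
    noise = sigma * np.sqrt(dt) * rng.standard_normal(n)
    x = project(x + drift + noise)
assert abs(np.linalg.norm(x) - np.sqrt(n)) < 1e-8       # stays on the sphere
```

Theorem 2.5 asserts that, with probability 1 − exp(−cn), no amount of such running time T (even T growing like c′ log n) brings H_n(X_T; Y) above the threshold µ.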
In [BJ18], it was shown that exponentially small spectral gaps are connected to the existence of free energy barriers for the overlap, which at very low temperatures can be shown to be equivalent to a variant of the overlap gap property in this setting. To our knowledge, however, this work is the first to connect the overlap distribution to the behavior of Langevin dynamics in the non-activated regime (n → ∞ and then t → ∞). Finally, we note here that the overlap gap property has been connected to the spectral gap for local, reversible dynamics of Ising spin glass models in [BJ18], as well as to gradient descent and approximate message passing schemes in [GJ19a].

We now consider the problem of finding a large independent set in a sparse random graph. Here, we are given the adjacency matrix of an n-vertex graph, represented as Y ∈ {0, 1}^m where m = (n choose 2). We write Y ∼ G(n, d/n) to denote an Erdős-Rényi graph on n nodes with edge probability d/n, i.e., every possible edge occurs independently with probability d/n. We are interested in the regime where first n → ∞ (with d fixed) and then d → ∞. A subset of nodes S ⊆ [n] is an independent set if it spans no edges, i.e., for every i, j ∈ S, (i, j) is not an edge. Letting I(Y) denote the set of all independent sets of the graph Y, consider the optimization problem

    max_{S ∈ I(Y)} |S|    (4)

where Y ∼ G(n, d/n).

As n → ∞ with d fixed, the rescaled optimum value of (4) is known to converge to some limit with high probability:

    (1/n) max_{S ∈ I(Y)} |S| → α_d,

as shown in [BGT13]. The limit α_d is known to have the following asymptotic behavior as d → ∞:

    α_d = (1 + o_d(1)) · 2 log d / d,

as is known since the work of Frieze [Fri90]. The best known polynomial-time algorithm for this problem is a straightforward greedy algorithm, which constructs a half-optimal independent set, i.e., one of size (1 + o_d(1)) (log d / d) n, asymptotically as n → ∞ and then d → ∞.

We will study the ability of low-degree polynomials to find a large independent set.
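The greedy algorithm just mentioned admits a short sketch (illustrative parameter choices; not code from the paper):

```python
import random

def greedy_independent_set(n, edges):
    """Scan vertices in a fixed order and keep any vertex with no
    previously-kept neighbor. On G(n, d/n), this finds an independent set
    of size roughly (log d / d) n, i.e., about half of the optimum."""
    adj = [set() for _ in range(n)]
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    S = set()
    for v in range(n):
        if not (adj[v] & S):   # no neighbor of v was already kept
            S.add(v)
    return S

random.seed(0)
n, d = 1000, 20
edges = [(u, v) for u in range(n) for v in range(u + 1, n)
         if random.random() < d / n]
S = greedy_independent_set(n, edges)
# S is an independent set by construction
```

The lower bounds below say that, up to the (1 + 1/√2) threshold, no low-degree polynomial improves substantially on this simple baseline.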
It is too much to ask for a polynomial to exactly output the indicator vector of an independent set, so we fix the following rounding scheme that takes a polynomial's output and returns an independent set. Recall the terminology for random polynomials defined in Section 2.1.1.

Definition 2.6.
Let f : {0, 1}^m → R^n be a random polynomial. For Y ∈ {0, 1}^m and η > 0, let V_f^η(Y, ω) ∈ I(Y) be the independent set obtained by the following procedure. Let

    A = {i ∈ [n] : f_i(Y, ω) ≥ 1},
    Ã = {i ∈ A : i has no neighbors in A in the graph Y},
    B = {i ∈ [n] : f_i(Y, ω) ∈ (1/2, 1)}.

Let

    V_f^η(Y, ω) = Ã if |A \ Ã| + |B| ≤ ηn, and V_f^η(Y, ω) = ∅ otherwise.

In other words, f should output a value ≥ 1 to indicate a vertex in the independent set and a value ≤ 1/2 to indicate a vertex not in it, with at most ηn "errors", each of which can either be a vertex for which the output value lies in (1/2, 1), or a vertex of A with a neighbor in A; if there are more than ηn errors, the failure output ∅ is returned. For our proofs it is crucial that this definition of V_f^η ensures that a small change in f(Y, ω) cannot induce a large change in the resulting independent set V_f^η(Y, ω) (without encountering the failure event ∅). We now formally define what it means for a polynomial to find a large independent set.

Definition 2.7.
For parameters k ∈ N, δ ∈ [0, 1], γ ≥ 1, η > 0, and a random polynomial f : {0, 1}^m → R^n, we say that f (k, δ, γ, η)-optimizes (4) if the following are satisfied.

• E_{Y,ω} ‖f(Y, ω)‖² ≤ γk.

• With probability at least 1 − δ over Y and ω, we have |V_f^η(Y, ω)| ≥ k.

The parameter k denotes the objective value attained (after rounding), i.e., the size of the independent set. For us, k will be a fixed multiple of (log d / d) n, since this is the scale of the optimum. The parameter δ is the algorithm's failure probability. Note that if f were to "perfectly" output the {0, 1}-valued indicator vector of a size-k independent set, then we would have ‖f(Y, ω)‖² = k. The parameter γ controls the degree to which this can be violated. Finally, η is the fraction of "errors" tolerated by the rounding process V_f^η. We now state our main result of low-degree hardness of maximum independent set, with the proof deferred to Section 4.2.

Theorem 2.8.
For any $\alpha > 1 + 1/\sqrt{2}$ there exists $d^* > 0$ such that for any $d \ge d^*$ there exist $n^* > 0$, $\eta > 0$, and $C_1, C_2 > 0$ such that the following holds. Let $n \ge n^*$, $\gamma \ge 1$, and $D \le \frac{C_1 n}{\gamma\log n}$, and suppose $\delta \ge 0$ satisfies $\delta < \exp(-C_2\gamma D\log n)$. Then for $k = \alpha\frac{\log d}{d}n$, there is no random degree-$D$ polynomial that $(k,\delta,\gamma,\eta)$-optimizes (4).

Theorem 2.8 rules out finding independent sets larger than $(1+1/\sqrt{2})\frac{\log d}{d}n$, which is roughly 85% of the optimum. This is the threshold above which OGP can be shown using a first moment argument as in [GS17]. If $\gamma$ is a constant, Theorem 2.8 gives a similar tradeoff between $D$ and $\delta$ as our results for the $p$-spin model, although here there is an extra factor of $\log n$. If we are willing to restrict to algorithms of failure probability $\delta = \exp(-cn)$ then we can rule out all polynomials of degree $D \le c'n/\log n$ for a constant $c' = c'(c)$. As in the $p$-spin model, this suggests that exponential time $\exp(n^{1-o(1)})$ is needed in order to find an independent set larger than $(1+1/\sqrt{2})\frac{\log d}{d}n$. As discussed in the introduction, the best known polynomial-time algorithm can find an independent set of half the optimum size; this corresponds to $k = (1+o_d(1))\frac{\log d}{d}n$, $\gamma = O(1)$, $\delta = \exp(-\Omega(n))$, and any constant $\eta > 0$.

As discussed in the introduction, the preceding results follow from certain geometric properties of the super-level sets of the objectives. The main property is called the overlap gap property (OGP). Let us begin by defining this formally in a general setting.
Definition 2.9.
We say that a family of real-valued functions $\mathcal{F}$ with common domain $\mathcal{X} \subseteq \mathbb{R}^n$ satisfies the overlap gap property for an overlap $R : \mathcal{X}\times\mathcal{X} \to \mathbb{R}_{\ge 0}$ with parameters $\mu \in \mathbb{R}$ and $0 \le \nu_1 < \nu_2 \le 1$ if for every $f_1, f_2 \in \mathcal{F}$ and every $x_1, x_2 \in \mathcal{X}$ satisfying $f_k(x_k) \ge \mu$ for $k = 1,2$, we have that $R(x_1,x_2) \in [0,\nu_1]\cup[\nu_2,1]$. For ease of notation, when this holds, we simply say that $\mathcal{F}$ satisfies the $(\mu,\nu_1,\nu_2)$-OGP for $R$ on $\mathcal{X}$. Furthermore, as it is often clear from context, we omit the dependence of the above on $R$.

While the definition above might be satisfied for trivial reasons and thus not be informative, it will be used in this paper in the setting where $\|x\|^2 \le n$ for every $x \in \mathcal{X}$, $R(x_1,x_2) = |\langle x_1,x_2\rangle|/n$, and with parameters chosen so that with high probability $\mu < \sup_{x\in\mathcal{X}} H(x)$ for every $H \in \mathcal{F}$. Thus, in particular, $R(x_1,x_2) \le 1$ for all $x_1, x_2 \in \mathcal{X}$, and $\mu$ measures some proximity from optimal values for each objective function $H$. The definition says informally that for every two $\mu$-optimal solutions with respect to any two choices of objective functions, their normalized inner product is either at least $\nu_2$ or at most $\nu_1$.

In the following, we require one other property of functions, namely separation of their superlevel sets.

Definition 2.10.
We say that two real-valued functions $f, g$ with common domain $\mathcal{X}$ are $\nu$-separated above $\mu$ with respect to the overlap $R : \mathcal{X}\times\mathcal{X} \to \mathbb{R}_{\ge 0}$ if for any $x, y \in \mathcal{X}$ with $f(x) \ge \mu$ and $g(y) \ge \mu$, we have that $R(x,y) \le \nu$.

This property can be thought of as a strengthening of OGP for two distinct functions. In particular, the parameter $\nu$ will typically equal the parameter $\nu_1$ in the definition of OGP. Let us now turn to stating the precise results regarding these properties in the settings we consider here. It can be shown that the overlap gap property holds for $p$-spin glass Hamiltonians in both the spherical and Ising settings with respect to the overlap $R(x,y) = \frac{1}{n}|\langle x,y\rangle|$. More precisely, let $Y$ be a $p$-tensor with i.i.d. $\mathcal{N}(0,1)$ entries and let $Y'$ denote an independent copy of $Y$. Consider the corresponding family of real-valued functions
$$\mathcal{A}(Y,Y') = \{\cos(\tau)H_n(\cdot;Y) + \sin(\tau)H_n(\cdot;Y') : \tau \in [0,\pi/2]\}. \qquad (5)$$
We then have the following, which will follow by combining bounds from [CS17, AC18]. The second result is a restatement of [GJ19a, Theorem 3.4]. The proof can be found in Section 3.5.

Theorem 2.11.
Take as overlap $R(x,y) = \frac{1}{n}|\langle x,y\rangle|$ and let $Y$ and $Y'$ be independent $p$-tensors with i.i.d. $\mathcal{N}(0,1)$ entries. For every even $p \ge 4$ there exists an $\varepsilon > 0$ such that the following holds:

1. For the domain $S_n$, there are some $0 \le \nu_1 < \nu_2 \le 1$ and some $c > 0$ such that the following holds with probability at least $1-\exp(-cn)$:

• $\mathcal{A}(Y,Y')$ has the overlap gap property for $R$ with parameters $(E_p(S)-\varepsilon, \nu_1, \nu_2)$.

• $H_n(\cdot;Y)$ and $H_n(\cdot;Y')$ are $\nu_1$-separated above $E_p(S)-\varepsilon$ with respect to $R$.

2. For the domain $\Sigma_n$, there are some $0 \le \nu_1 < \nu_2 \le 1$ and some $c > 0$ such that the following holds with probability at least $1-\exp(-cn)$:

• $\mathcal{A}(Y,Y')$ has the overlap gap property for $R$ with parameters $(E_p(\Sigma)-\varepsilon, \nu_1, \nu_2)$.

• $H_n(\cdot;Y)$ and $H_n(\cdot;Y')$ are $\nu_1$-separated above $E_p(\Sigma)-\varepsilon$ with respect to $R$.

Let us now turn to the maximum independent set problem. Let us begin by first observing that we may place this family of optimization problems on a common domain. To this end, consider as domain the Boolean hypercube $\mathcal{B}_n = \{0,1\}^n$. Note that by viewing a vector $x$ as the indicator function of the set $S = S(x) := \{i : x_i = 1\}$, we have a correspondence between the points $x \in \mathcal{B}_n$ and subsets of the vertex set $[n]$. Let $m = \binom{n}{2}$, let $Y \in \{0,1\}^m$ denote the adjacency matrix of some graph on $[n]$ vertices, and consider the function $F(x;Y)$ given by
$$F(x;Y) = |S(x)|\cdot\mathbf{1}\{S(x) \in \mathcal{I}(Y)\}.$$
The maximum independent set problem for $Y$ can then be written in the form
$$\max_{x\in\mathcal{B}_n} F(x;Y).$$
Let us now construct the analogue of the family $\mathcal{A}(Y,Y')$ from (5) in this setting.

Definition 2.12.
For $Y, Y' \in \{0,1\}^m$, the path from $Y$ to $Y'$ is $Y = Z_0 \to Z_1 \to \cdots \to Z_m = Y'$ where $(Z_i)_j = Y_j$ for $j > i$ and $(Z_i)_j = Y'_j$ otherwise. The path is denoted by $Y \rightsquigarrow Y'$.

Here (and throughout) we have fixed an arbitrary order by which to index the edges of a graph (the coordinates of $Y$). Now let $Y, Y' \in \{0,1\}^m$ be (the adjacency matrices of) independent $G(n,d/n)$ random graphs. We can then consider the family of functions
$$\mathcal{F}(Y,Y') = \{F(\cdot;Z) : Z \text{ is on the path } Y \rightsquigarrow Y'\}. \qquad (6)$$
We can now state the relevant overlap gap property.

Theorem 2.13.
For any $\alpha > 1 + 1/\sqrt{2}$ there exist constants $0 \le \tilde\nu_1 < \tilde\nu_2 \le 1$ and $d^* > 0$ such that for any constant $d \ge d^*$, the following holds. If $Y, Y' \sim G(n,d/n)$ independently, then with probability at least $1-\exp(-\Omega(n))$:

• The family of functions $\mathcal{F}$ from (6) with domain $\mathcal{X} = \mathcal{B}_n$ satisfies the overlap gap property with overlap $R(x_1,x_2) = \frac{1}{n}|\langle x_1,x_2\rangle|$ and parameters $\mu = k := \alpha\frac{\log d}{d}n$, $\nu_1 = \tilde\nu_1\frac{k}{n}$, $\nu_2 = \tilde\nu_2\frac{k}{n}$.

• Furthermore, the functions $F(\cdot;Y)$ and $F(\cdot;Y')$ are $\nu_1$-separated above $\mu$.

Above (and throughout), $\Omega(n)$ pertains to the limit $n \to \infty$ with $\alpha, d$ fixed, i.e., it hides a constant factor depending on $\alpha, d$. Note that here the overlap is simply the (normalized) cardinality of the intersection of the two sets: $R(x_1,x_2) = \frac{1}{n}|S(x_1)\cap S(x_2)|$. The proof of Theorem 2.13—which is deferred to Section 4.3—is an adaptation of the first moment argument of [GS17]: we compute the expected number of pairs of independent sets whose overlap lies in the "forbidden" region, and show that this is exponentially small.

p-Spin Model

In this section we prove a noise stability–type result for polynomials of Gaussians, which will be a key ingredient in our proofs. Throughout this section, let $d \ge 1$ and let $Y \in \mathbb{R}^d$ be a vector with i.i.d. standard Gaussian entries. Denote the standard Gaussian measure on $\mathbb{R}^d$ by $\Gamma^d$. For two standard Gaussian random vectors defined on the same probability space, we write $X \sim_\rho Y$ if their covariance satisfies $\mathrm{Cov}(X,Y) = \rho I$ for some $\rho \in [0,1]$, where $I$ denotes the identity matrix. Throughout this section, all polynomials have non-random coefficients. The goal of this section is to prove the following stability result.

Theorem 3.1.
Let $0 \le \rho \le 1$. Let $X, Y$ be a pair of standard Gaussian random vectors on $\mathbb{R}^d$ such that $X \sim_\rho Y$. Let $P$ denote the joint law of $(X,Y)$. Let $f : \mathbb{R}^d \to \mathbb{R}^k$ be a (deterministic) polynomial of degree at most $D$ with $\mathbb{E}\|f(X)\|^2 = 1$. For any $t \ge (6e)^D$,
$$P\big(\|f(X)-f(Y)\|^2 \ge 2t(1-\rho^D)\big) \le \exp\Big(-\frac{D}{3e}\,t^{1/D}\Big).$$

We begin by recalling the following standard consequence of hypercontractivity; see Theorem 5.10 and Remark 5.11 of [Jan97] or [LT11, Sec. 3.2].
Proposition (Hypercontractivity for polynomials). If $f : \mathbb{R}^d \to \mathbb{R}$ is a degree-$D$ polynomial and $q \in [2,\infty)$ then
$$\mathbb{E}[|f(Y)|^q] \le (q-1)^{qD/2}\,\mathbb{E}[f(Y)^2]^{q/2}. \qquad (7)$$
Let us now note the following useful corollary of this result for vector-valued polynomials.

Lemma 3.2. If $f : \mathbb{R}^d \to \mathbb{R}^k$ is a degree-$D$ polynomial and $q \in [2,\infty)$ then
$$\mathbb{E}[\|f(Y)\|^{2q}] \le [3(q-1)]^{qD}\,\mathbb{E}[\|f(Y)\|^2]^q.$$

Proof.
Let us begin by observing that, by the Cauchy–Schwarz inequality and (7), in the case $q = 2$,
$$\mathbb{E}[\|f(Y)\|^4] \le \sum_i \mathbb{E}[f_i(Y)^4] + 2\sum_{i<j}\mathbb{E}[f_i(Y)^4]^{1/2}\,\mathbb{E}[f_j(Y)^4]^{1/2} = \Big(\sum_i \mathbb{E}[f_i(Y)^4]^{1/2}\Big)^2 \le 3^{2D}\,\mathbb{E}[\|f(Y)\|^2]^2.$$
More generally, expanding $\|f(Y)\|^{2q} = (\sum_i f_i(Y)^2)^q$ and applying Hölder's inequality to each term of the expansion,
$$\mathbb{E}[\|f(Y)\|^{2q}] \le \Big(\sum_i \mathbb{E}[f_i(Y)^{2q}]^{1/q}\Big)^q \le (2q-1)^{qD}\,\mathbb{E}[\|f(Y)\|^2]^q,$$
where the last step uses (7) with exponent $2q$. Since $2q-1 \le 3(q-1)$ for $q \ge 2$, the claim follows.

Proposition 3.3. If $f : \mathbb{R}^d \to \mathbb{R}^k$ is a degree-$D$ polynomial, then for any $t \ge (6e)^D$,
$$\Gamma^d\big(\|f(Y)\|^2 \ge t\,\mathbb{E}[\|f(Y)\|^2]\big) \le \exp\Big(-\frac{D}{3e}\,t^{1/D}\Big).$$

Proof. Using Lemma 3.2, for any $q \in [2,\infty)$,
$$\Gamma^d(\|f(Y)\|^2 \ge t) = \Gamma^d(\|f(Y)\|^{2q} \ge t^q) \le \mathbb{E}[\|f(Y)\|^{2q}]\,t^{-q} \le [3(q-1)]^{qD}\,\mathbb{E}[\|f(Y)\|^2]^q\,t^{-q} \le (3q)^{qD}\,\mathbb{E}[\|f(Y)\|^2]^q\,t^{-q}$$
and so, letting $q = t^{1/D}/(3e) \ge 2$,
$$\Gamma^d\big(\|f(Y)\|^2 \ge t\,\mathbb{E}[\|f(Y)\|^2]\big) \le \big[(3q)^D/t\big]^q = \exp(-Dq) = \exp\big(-D\,t^{1/D}/(3e)\big).$$

It will be helpful to recall the noise operator $T_\rho : L^2(\Gamma^d) \to L^2(\Gamma^d)$, defined for $\rho \in [0,1]$ by
$$T_\rho f(x) = \mathbb{E}\,f\big(\rho x + \sqrt{1-\rho^2}\,Y\big). \qquad (8)$$
For $t \ge 0$, $P_t := T_{e^{-t}}$ is the classical Ornstein–Uhlenbeck semigroup. In particular, if $(h_\ell)$ are the Hermite polynomials on $\mathbb{R}$ normalized to be an orthonormal basis for $L^2(\Gamma^1)$, then the eigenfunctions of $T_\rho$ are given by products of Hermite polynomials [LT11]. In particular, for any $\psi(x)$ of the form $\psi(x) = h_{\ell_1}(x_1)\cdots h_{\ell_d}(x_d)$, we have
$$T_\rho\,\psi(x) = \rho^D\,\psi(x) \qquad (9)$$
where $D = \sum \ell_j$. With this in hand we are now in position to prove the following inequality.

Lemma 3.4. If $f : \mathbb{R}^d \to \mathbb{R}^k$ is a degree-$D$ polynomial with $\mathbb{E}\|f(Y)\|^2 = 1$, then for any $\rho \in [0,1]$, if $X \sim_\rho Y$,
$$\mathbb{E}\|f(X)-f(Y)\|^2 \le 2(1-\rho^D).$$

Proof. Let $X_\rho$ be given by
$$X_\rho = \rho Y + \sqrt{1-\rho^2}\,Y',$$
where $Y'$ is an independent copy of $Y$. Observe that $(X_\rho, Y)$ is equal in law to $(X,Y)$. In this case, we see that
$$\mathbb{E}\|f(X)-f(Y)\|^2 = 2 - 2\,\mathbb{E}\langle f(X),f(Y)\rangle = 2 - 2\,\mathbb{E}\langle f(X_\rho),f(Y)\rangle = 2 - 2\,\mathbb{E}\langle T_\rho f(Y),f(Y)\rangle.$$
Consider the collection of products of real-valued Hermite polynomials of degree at most $D$,
$$H_D = \Big\{\psi : \mathbb{R}^d \to \mathbb{R} \;:\; \psi(x) = h_{\ell_1}(x_1)\cdots h_{\ell_d}(x_d) \text{ s.t. } \sum \ell_i \le D\Big\}.$$
Recall that $H_D$ is an orthonormal system in $L^2(\Gamma^d)$ and that the collection of real-valued polynomials $p : \mathbb{R}^d \to \mathbb{R}$ of degree at most $D$ is contained in its closed linear span. As such, since $\rho^D \le \rho^s$ for $0 \le s \le D$, we see that for any $1 \le i \le k$,
$$\rho^D\,\mathbb{E}f_i(Y)^2 \le \mathbb{E}\big[T_\rho f_i(Y)\,f_i(Y)\big] \le \mathbb{E}f_i(Y)^2$$
by (9). Summing in $i$ yields
$$\rho^D \le \mathbb{E}\langle T_\rho f(Y), f(Y)\rangle \le 1.$$
Combining this with the preceding bound yields the desired inequality.

We are now in position to prove the main theorem of this section.

Proof of Theorem 3.1. Let $Y'$ be an independent copy of $Y$. Then if we let $\tilde Y = (Y,Y')$, this is a standard Gaussian vector on $\mathbb{R}^{2d}$. Furthermore, if we let
$$h(\tilde Y) = f(Y) - f\big(\rho Y + \sqrt{1-\rho^2}\,Y'\big),$$
then $h$ is a polynomial of degree at most $D$ in $\tilde Y$ and, by Lemma 3.4,
$$\mathbb{E}\|h(\tilde Y)\|^2 = \mathbb{E}\|f(X)-f(Y)\|^2 \le 2(1-\rho^D).$$
The result now follows from Proposition 3.3.

In this section we prove our main results on low-degree hardness for the spherical and Ising $p$-spin models (Theorems 2.2 and 2.4). The main content of this section is to show that the OGP and separation properties imply failure of stable algorithms, following an interpolation argument similar to [GJ19a]. The main results then follow by combining this with the stability of low-degree polynomials (Theorem 3.1) and the fact that OGP and separation are known to hold (Theorem 2.11).

The spherical case. We begin by observing the following elementary fact: when two vectors of norm at least $\gamma$ are normalized onto the unit sphere, the distance between them can only increase by a factor of $\gamma^{-1}$.

Lemma 3.5. If $\|x\| = \|y\| = 1$ and $a \ge \gamma$, $b \ge \gamma$, then $\|x-y\| \le \gamma^{-1}\|ax - by\|$.

Proof. We have
$$\|ax-by\|^2 = a^2 + b^2 - 2ab\langle x,y\rangle = (a-b)^2 + ab\,\|x-y\|^2 \ge \gamma^2\|x-y\|^2.$$

Throughout the following, it will be convenient to define the following interpolated family of tensors. Consider $(Y_\tau)_{\tau\in[0,\pi/2]}$ defined by
$$Y_\tau = \cos(\tau)\,Y + \sin(\tau)\,Y'. \qquad (10)$$
Note that by linearity of inner products, we may equivalently write $\mathcal{A}(Y,Y')$ from (5) as
$$\mathcal{A}(Y,Y') = \{H_n(x;Y_\tau) : \tau \in [0,\pi/2]\}.$$
The following result shows that together, the OGP and separation properties imply failure of low-degree polynomials for the spherical $p$-spin.

Theorem 3.6. For any $0 \le \nu_1 < \nu_2 \le 1$, there exists a constant $\delta^* > 0$ such that the following holds.
Let $p, n, D \in \mathbb{N}$ and $\mu \in \mathbb{R}$. Suppose that $Y, Y'$ are independent $p$-tensors with i.i.d. standard Gaussian entries and let $\mathcal{A}(Y,Y')$ be as in (5). Suppose further that with probability at least $3/4$ over $Y, Y'$, we have that $\mathcal{A}(Y,Y')$ has the $(\mu,\nu_1,\nu_2)$-OGP on domain $S_n$ with overlap $R = |\langle\cdot,\cdot\rangle|/n$, and that $H_n(\cdot,Y)$ and $H_n(\cdot,Y')$ are $\nu_1$-separated above $\mu$. Then for any $\delta \le \min\{\delta^*, \frac{1}{3}\exp(-2D)\}$ and any $\gamma \ge (2/3)^D$, there is no random degree-$D$ polynomial that $(\mu,\delta,\gamma)$-optimizes (1) on $S_n$.

Proof. Let $Y, Y'$ be as in the statement of the theorem, and let $P = P_Y \otimes P_\omega$ denote the joint law of $(Y,\omega)$. Assume on the contrary that $f$ is a random degree-$D$ polynomial which $(\mu,\delta,\gamma)$-optimizes $H_n(\cdot,Y)$. We first reduce to the case where $f$ is deterministic. Let $A(Y,\omega)$ denote the "failure" event
$$A(Y,\omega) = \big\{H_n(g_f(Y,\omega);Y) < \mu \ \vee\ \|f(Y,\omega)\| < \gamma\sqrt n\big\}.$$
Since $\mathbb{E}\|f(Y,\omega)\|^2 = n$ and $P(A(Y,\omega)) \le \delta$, we have by Markov's inequality,
$$P_\omega\big\{\mathbb{E}_Y\|f(Y,\omega)\|^2 \ge 3n\big\} \le 1/3 \qquad\text{and}\qquad P_\omega\big(P_Y(A(Y,\omega)) \ge 3\delta\big) \le 1/3.$$
This means that there exists an $\omega^* \in \Omega$ such that $\mathbb{E}_Y\|f(Y,\omega^*)\|^2 \le 3n$ and $P_Y\{A(Y,\omega^*)\} \le 3\delta$. Fix this choice of $\omega = \omega^*$ so that $f(\cdot) = f(\cdot,\omega^*)$ becomes a deterministic function.

Let $Y, Y' \in (\mathbb{R}^n)^{\otimes p}$ be independently i.i.d. $\mathcal{N}(0,1)$, let $Y_\tau$ be as in (10), and $\mathcal{A}(Y,Y')$ as in (5). For some $L \in \mathbb{N}$ to be chosen later, divide the interval $[0,\pi/2]$ into $L$ equal sub-intervals: $0 = \tau_0 < \tau_1 < \cdots < \tau_L = \pi/2$, and let $x_\ell = g_f(Y_{\tau_\ell})$. We claim that with positive probability (over $Y, Y'$), all of the following events occur simultaneously and that this leads to a contradiction:

(i) The family $\mathcal{A}(Y,Y')$ has the $(\mu,\nu_1,\nu_2)$-OGP on $S_n$, and $H_n(\cdot,Y)$ and $H_n(\cdot,Y')$ are $\nu_1$-separated above $\mu$.

(ii) For all $\ell \in \{0,1,\ldots,L\}$, $f$ succeeds on input $Y_{\tau_\ell}$, i.e., the event $A(Y_{\tau_\ell},\omega^*)^c$ holds.

(iii) For all $\ell \in \{0,1,\ldots,L-1\}$, $\|f(Y_{\tau_\ell}) - f(Y_{\tau_{\ell+1}})\|^2 < \gamma^2 cn$, for a constant $c = c(\nu_1,\nu_2) > 0$ chosen below.

Suppose first that (i)–(iii) all hold. By (i) and (ii), $|\frac{1}{n}\langle x_0,x_\ell\rangle| \in [0,\nu_1]\cup[\nu_2,1]$ for all $\ell$, and, by separation, $|\frac{1}{n}\langle x_0,x_L\rangle| \in [0,\nu_1]$. Since we also have $|\frac{1}{n}\langle x_0,x_0\rangle| = 1$, there must exist an $\ell$ that crosses the OGP gap in the sense that
$$\nu_2-\nu_1 \le \frac{1}{n}\Big|\,|\langle x_0,x_\ell\rangle| - |\langle x_0,x_{\ell+1}\rangle|\,\Big| \le \frac{1}{n}\big|\langle x_0,x_\ell\rangle - \langle x_0,x_{\ell+1}\rangle\big| \le \frac{1}{\sqrt n}\,\|x_\ell - x_{\ell+1}\|.$$
Since $\|f(Y_{\tau_\ell})\|, \|f(Y_{\tau_{\ell+1}})\| \ge \gamma\sqrt n$ by (ii), Lemma 3.5 gives
$$\nu_2-\nu_1 \le \frac{1}{\sqrt n}\,\|x_\ell - x_{\ell+1}\| \le \frac{1}{\gamma\sqrt n}\,\|f(Y_{\tau_\ell}) - f(Y_{\tau_{\ell+1}})\|,$$
which contradicts (iii) provided we choose $c \le (\nu_2-\nu_1)^2$.

It remains to show that (i)–(iii) occur simultaneously with positive probability. By assumption, (i) fails with probability at most $1/4$, so it is sufficient to show that (ii) and (iii) each fail with probability at most $1/3$. By a union bound, (ii) fails with probability at most $3\delta(L+1)$, which is at most $1/3$ provided
$$L \le \frac{1}{9\delta} - 1. \qquad (11)$$
For (iii), we will apply Theorem 3.1 with some $\tilde D \ge D$ (since we are allowed to use any upper bound on the degree) and $t = (6e)^{\tilde D}$. For any $\ell$ we have $Y_{\tau_\ell} \sim_\rho Y_{\tau_{\ell+1}}$ with $\rho = \cos\big(\frac{\pi}{2L}\big)$. Using $\mathbb{E}_Y\|f(Y)\|^2 \le 3n$,
$$P\Big(\|f(Y_{\tau_\ell}) - f(Y_{\tau_{\ell+1}})\|^2 \ge 6n\,(6e)^{\tilde D}(1-\rho^{\tilde D})\Big) \le \exp(-2\tilde D). \qquad (12)$$
Since
$$1-\rho^{\tilde D} = 1 - \cos^{\tilde D}\Big(\frac{\pi}{2L}\Big) \le 1 - \Big(1 - \frac12\Big(\frac{\pi}{2L}\Big)^2\Big)^{\tilde D} \le \frac{\tilde D}{2}\Big(\frac{\pi}{2L}\Big)^2,$$
equation (12) implies $P(\|f(Y_{\tau_\ell}) - f(Y_{\tau_{\ell+1}})\|^2 \ge \gamma^2 cn) \le \exp(-2\tilde D)$ provided
$$L \ge \frac{\pi}{2\gamma}\sqrt{\frac{3\tilde D}{c}}\,(6e)^{\tilde D/2}. \qquad (13)$$
Thus, (iii) fails with probability at most $L\exp(-2\tilde D)$, which is at most $1/3$ provided
$$L \le \frac13\exp(2\tilde D). \qquad (14)$$
To complete the proof, we need to choose integers $\tilde D \ge D$ and $L$ satisfying (11), (13), (14), i.e.,
$$\frac{\pi}{2\gamma}\sqrt{\frac{3\tilde D}{c}}\,\big(\sqrt{6e}\big)^{\tilde D} \le L \le \min\Big\{\frac{1}{9\delta}-1,\ \frac13\exp(2\tilde D)\Big\}. \qquad (15)$$
Require $\delta \le \frac13\exp(-2\tilde D)$ so that the second term in the $\min\{\cdots\}$ is smaller (when $\tilde D$ is sufficiently large). Since $\gamma \ge (2/3)^D \ge (2/3)^{\tilde D}$ and $\sqrt{6e} < \frac23 e^2$, there now exists an $L \in \mathbb{N}$ satisfying (15) provided that $\tilde D$ exceeds some constant $D^* = D^*(c)$. Set $\tilde D = \max\{D, D^*\}$ and $\delta^* = \frac13\exp(-2D^*)$ to complete the proof.

Our main result on low-degree hardness of the spherical $p$-spin now follows by combining the above with the fact that OGP and separation hold in a neighborhood of the optimum.

Proof of Theorem 2.2. This result follows by combining Theorem 3.6 with Theorem 2.11.

The Ising case. We now turn to the corresponding result for the Ising $p$-spin model, which again shows that together, OGP and separation imply failure of low-degree polynomials.

Theorem 3.7. For any $0 \le \nu_1 < \nu_2 \le 1$ there exist constants $\delta^* > 0$ and $\eta > 0$ such that the following holds. Let $p, n, D \in \mathbb{N}$ and $\mu \in \mathbb{R}$. Suppose that $Y, Y'$ are independent $p$-tensors with i.i.d. standard Gaussian entries and let $\mathcal{A}(Y,Y')$ be as in (5). Suppose further that with probability at least $3/4$ over $Y, Y'$, we have that $\mathcal{A}(Y,Y')$ has the $(\mu,\nu_1,\nu_2)$-OGP on domain $\Sigma_n$ with overlap $R = |\langle\cdot,\cdot\rangle|/n$, and that $H_n(\cdot,Y)$ and $H_n(\cdot,Y')$ are $\nu_1$-separated above $\mu$. Then for any $\delta \le \min\{\delta^*, \frac13\exp(-2D)\}$ and any $\gamma \ge (2/3)^D$, there is no random degree-$D$ polynomial that $(\mu,\delta,\gamma,\eta)$-optimizes (1) on $\Sigma_n$.

Proof. The proof is nearly identical to that of Theorem 3.6 above, so we only explain the differences. We now define $A(Y,\omega)$ to be the failure event
$$A(Y,\omega) = \big\{H_n(\mathrm{sgn}(f(Y,\omega));Y) < \mu \ \vee\ |\{k\in[n] : |f_k(Y,\omega)| \ge \gamma\}| < (1-\eta)n\big\},$$
and define $x_\ell = \mathrm{sgn}(f(Y_{\tau_\ell}))$. The only part of the proof we need to modify is the proof that (i)–(iii) imply a contradiction, including the choice of $c$. As above, combining (i) and (ii) gives the existence of an $\ell$ for which $\nu_2-\nu_1 \le \frac{1}{\sqrt n}\|x_\ell - x_{\ell+1}\|$, i.e., $\|x_\ell - x_{\ell+1}\|^2 \ge (\nu_2-\nu_1)^2 n$, implying that $x_\ell$ and $x_{\ell+1}$ differ in at least $\Delta := (\nu_2-\nu_1)^2 n/4$ coordinates.
Let $\eta = \Delta/(4n) = (\nu_2-\nu_1)^2/16$, so that there must be at least $\Delta/2$ coordinates $i$ for which $|f_i(Y_{\tau_\ell}) - f_i(Y_{\tau_{\ell+1}})| \ge 2\gamma$. This implies
$$\|f(Y_{\tau_\ell}) - f(Y_{\tau_{\ell+1}})\|^2 \ge 4\gamma^2\cdot\frac{\Delta}{2} = \frac{\gamma^2(\nu_2-\nu_1)^2 n}{2},$$
which contradicts (iii) provided we choose $c \le (\nu_2-\nu_1)^2/2$.

Proof of Theorem 2.4. This result follows by combining Theorems 3.7 and 2.11.

Let $U \in C^\infty(S_n)$ be some smooth function, and for any $\sigma \ge 0$ define Langevin dynamics with potential $U$ and variance $\sigma$ to be the strong solution of the stochastic differential equation (in Itô form)
$$dX_t = \sigma\,dB_t - \nabla U\,dt, \qquad X_0 \sim \nu,$$
where $B_t$ is spherical Brownian motion, $\nabla$ is the spherical gradient, and $\nu \in \mathcal{M}_1(S_n)$ is some probability measure on the sphere called the initial data. Note that in the case $\sigma = 0$ this is simply gradient flow for $U$.

We recall here the following basic fact about the well-posedness of such equations, namely their continuous dependence on the function $U$. In the following, for a vector-valued function $F : S_n \to TS_n$, we let $\|F\|_\infty$ denote the essential supremum of the norm of $F$ induced by the canonical metric. (Here $TS_n$ denotes the tangent bundle to $S_n$.)

Lemma. Let $U, V \in C^\infty(S_n)$ and $\sigma \ge 0$. Fix $\nu \in \mathcal{M}_1(S_n)$. Let $X^U_t$ and $X^V_t$ denote the corresponding solutions to Langevin dynamics with potentials $U$ and $V$ respectively, with the same variance $\sigma$ and with respect to the same Brownian motion $B_t$. Suppose further that their initial data are the same. Then there is a universal $C > 0$ such that for any $t > 0$,
$$\sup_{s\le t}\|X^U_s - X^V_s\| \le C\,t\,e^{Ct(\|\nabla U\|_\infty \vee \|\nabla V\|_\infty)}\,\|\nabla U - \nabla V\|_\infty \quad a.s., \qquad (16)$$
where $\|\cdot\|$ denotes Euclidean distance in the canonical embedding of $S_n \subseteq \mathbb{R}^n$.

The proof of this result is a standard consequence of Gronwall's inequality and can be found, e.g., in [Var16, Tes12]. In this section, for a $p$-tensor $A$ we will write $A(x_1,\ldots,x_p)$ to denote the action of $A$ on $p$ vectors, i.e., $A(x_1,\ldots,x_p) = \langle A, x_1\otimes\cdots\otimes x_p\rangle$.
Viewing this as a multilinear operator, we denote the operator norm by
$$\|A\|_{op} = \sup_{\|x_1\|=\cdots=\|x_p\|=1} A(x_1,\ldots,x_p).$$
We then have the following.

Lemma 3.8. Let $\delta = n^{-\alpha}$ for some $\alpha > 0$, and let $\{\tau_i\}$ denote a partition of $[0,\pi/2]$ with $|\tau_{i+1}-\tau_i| \le \delta$ and with $\lceil\delta^{-1}\rceil + 1$ elements. Let $(X^{\tau_i})_i$ denote the family of strong solutions to Langevin dynamics with variance $\sigma \ge 0$, potentials $H_n(\cdot;Y_{\tau_i})$, and initial data $\nu \in \mathcal{M}_1(S_n)$. Then there is a $C > 0$ independent of $n$ such that for any $T > 0$,
$$\max_i\,\sup_{s\le T}\|X^{\tau_i}_s - X^{\tau_{i+1}}_s\| \le C\,T\,e^{CT}\,n^{-\alpha}$$
with probability at least $1 - e^{-\Omega(n)}$.

Proof. Evidently, the proof will follow by (16) upon controlling the gradients of $H_n(\cdot;Y)$. To this end, we see that
$$\nabla H_n(x;Y_\tau) = \frac{1}{n^{(p+1)/2}}\big(Y_\tau(\pi_x,x,\ldots,x) + \cdots + Y_\tau(x,\ldots,x,\pi_x)\big),$$
where $\pi_x$ denotes the projection onto $T_xS_n$. In particular, since $\|x\| = \sqrt n$ on $S_n$,
$$\|\nabla H_n(x;Y_\tau)\| \le p\,n^{(p-1)/2}\,n^{-(p+1)/2}\,\|Y_\tau\|_{op} = \frac{p}{n}\|Y_\tau\|_{op} \le \frac{p}{n}\big(\|Y\|_{op} + \|Y'\|_{op}\big).$$
By a standard epsilon-net argument (see, e.g., [BGJ20, Lemma 3.7]), we have that $\|Y\|_{op} \le C\sqrt n$ with probability $1-e^{-\Omega(n)}$ (while the lemma in [BGJ20] states the result for the expectation, one can either pass to the probability by Borell's inequality, or simply note that the penultimate step in that proof is the desired high-probability bound). Thus, after a union bound, with probability $1-\exp(-\Omega(n))$,
$$\sup_{0\le\tau\le\pi/2}\|\nabla H_n(\cdot;Y_\tau)\|_\infty \le C.$$
On the other hand, in law we have that $Y_{\tau_i} - Y_{\tau_{i+1}} = Z$ satisfies
$$Z \stackrel{(d)}{=} Y\,\sqrt{(\cos(\tau_i)-\cos(\tau_{i+1}))^2 + (\sin(\tau_i)-\sin(\tau_{i+1}))^2}.$$
Since both cosine and sine are 1-Lipschitz, we see that the entries of $Z$ are i.i.d. and have variance at most $2\delta^2$. Consequently, by the same epsilon-net argument, we have with probability $1 - O(n^\alpha e^{-cn})$,
$$\max_i\|\nabla H_n(\cdot,Y_{\tau_i}) - \nabla H_n(\cdot,Y_{\tau_{i+1}})\|_\infty \le C\delta,$$
as desired.

We begin by noting the following concentration result.

Lemma 3.9. Fix $T \ge 0$ and $\sigma \ge 0$.
Let $X_T$ denote the solution of Langevin dynamics with potential $H_n$, variance $\sigma$, and initial data $\nu \in \mathcal{M}_1(S_n)$, and let $Q_\nu$ denote its law conditionally on $Y$. Then there is some $c > 0$ such that for every $\varepsilon > 0$,
$$P_Y\otimes Q_\nu\big(|H_n(X_T;Y) - \mathbb{E}H_n(X_T;Y)| \ge \varepsilon\big) \le \exp(-c\varepsilon^2 n).$$

Proof. Note as before that for any two tensors $Y$ and $Y'$,
$$\|\nabla H_n(\cdot;Y) - \nabla H_n(\cdot;Y')\|_\infty \le \frac{p}{n}\|Y-Y'\|_{op} \le \frac{p}{n}\|Y-Y'\|_2,$$
where here, for a tensor $A$, $\|A\|_2$ denotes the square root of the sum of the squares of its entries. Consequently, by (16), the map $Y \mapsto H_n(X_T;Y)$ is uniformly $(C/\sqrt n)$-Lipschitz for some $C = C(T) > 0$ independent of $n$. The result then follows by Gaussian concentration of measure.

Proof of Theorem 2.5. In the following, we let $P = P_Y\otimes Q_\nu$. Recall the family $(Y_\tau)$ from (10) and $\mathcal{A}(Y,Y')$ from (5). Let $\delta = n^{-\alpha}$ for some $\alpha > 1/2$, and take the partition $(\tau_i)$ as in Lemma 3.8. Fix an $\varepsilon > 0$ sufficiently small and let $G$ denote the event that the overlap gap property holds for $\mathcal{A}(Y,Y')$ with parameters $(E_p(S)-3\varepsilon, \nu_1, \nu_2)$, as well as $\nu_1$-separation of $H_n(\cdot;Y)$ and $H_n(\cdot;Y')$ above level $E_p(S)-3\varepsilon$. By Theorem 2.11, this holds for every such $\varepsilon > 0$ with probability at least $1-\exp(-\Omega(n))$.

Let $X^{\tau_i}$ denote the solutions to Langevin dynamics corresponding to the potentials $H_n(\cdot;Y_{\tau_i})$. Let $\tilde B_n$ and $B_n$ denote the bad events
$$\tilde B_n = \big\{\exists i : H_n(X^{\tau_i}_T;Y_{\tau_i}) \ge E_p(S) - \varepsilon\big\}, \qquad B_n = \big\{H_n(X^{\tau_i}_T;Y_{\tau_i}) \ge E_p(S) - 3\varepsilon \ \ \forall i\big\}.$$
Let $E_i(\varepsilon)$ denote the complement of the event bounded in Lemma 3.9 applied to $X^{\tau_i}_T$, and let $E(\varepsilon) = \cap_i E_i(\varepsilon)$, which has probability at least $1-\exp(-\Omega(n))$. Note that on $\tilde B_n \cap E(\varepsilon)$, we have that $\mathbb{E}H_n(X^{\tau_i}_T;Y_{\tau_i}) \ge E_p(S) - 2\varepsilon$ for some $i$. As the expectation is non-random and independent of $i$, this holds for all $i$. Consequently, $\tilde B_n \cap E(\varepsilon) \subseteq B_n$. Thus we have $P(\tilde B_n) \le P(B_n) + \exp(-\Omega(n))$.

Suppose now that the events $B_n$ and $G$ have non-empty intersection. Let us work on this intersection.
By $\nu_1$-separation, recalling the overlap function $R(x,y) = \frac{1}{n}|\langle x,y\rangle|$, we have that $R(X^0_T, X^{\pi/2}_T) \le \nu_1$, whereas $R(X^0_T, X^0_T) = 1$. On the other hand, by Lemma 3.8, it follows that
$$\big|R(X^0_T, X^{\tau_i}_T) - R(X^0_T, X^{\tau_{i+1}}_T)\big| \le \frac{1}{n}\|X^0_T\|\,\|X^{\tau_i}_T - X^{\tau_{i+1}}_T\| \le \frac{1}{\sqrt n}\,CTe^{CT}\,n^{-\alpha}.$$
Thus, since $\alpha > 1/2$, we see that for $n$ sufficiently large there must be some (random) $j$ such that
$$\nu_1 < |R(X^0_T, X^{\tau_j}_T)| < \nu_2.$$
This contradicts the overlap gap property. Thus $B_n \subseteq G^c$. Consequently, we have that
$$P(\tilde B_n) \le P(B_n) + e^{-\Omega(n)} \le P(G^c) + e^{-\Omega(n)} = e^{-\Omega(n)}.$$
Observing that $\tilde B_n^c$ is contained in the event we are trying to bound yields the desired result by monotonicity of probabilities.

Proof of Theorem 2.11. We begin with the spherical setting. Let us view $H_n(x;Y)$ as a Gaussian process on $S_n$. It was shown in [CS17, Theorem 3] that for any $\varepsilon > 0$ and $0 < \tau \le \pi/2$ there exist $C, c, \tilde\mu > 0$ such that, with probability at least $1 - Ce^{-cn}$,
$$\max_{R(x,y)>\varepsilon}\ H_n(x;Y_\tau) + H_n(y;Y) < \max_{x\in S_n}H_n(x;Y_\tau) + \max_{x\in S_n}H_n(x;Y) - \tilde\mu,$$
so that if both $u, v$ satisfy
$$\max_{x\in S_n}H_n(x;Y_\tau) - \tilde\mu/2 \le H_n(u;Y_\tau), \qquad \max_{x\in S_n}H_n(x;Y) - \tilde\mu/2 \le H_n(v;Y),$$
then it must be that $|R(u,v)| \le \varepsilon$. (The result is stated there for $\tau < \pi/2$, but can be extended to the easier case of $\tau = \pi/2$. See Remark 3.10 below.) One can then replace the maximum on the right-hand side of the above upon recalling that, by Borell's inequality,
$$P\Big(\big|\max_{x\in S_n}H_n(x;Y) - \mathbb{E}\max_{x\in S_n}H_n(x;Y)\big| \ge \varepsilon\Big) \le C\exp(-cn\varepsilon^2)$$
for some $C, c > 0$. In particular, upon recalling that $\mathbb{E}\max_{S_n}H_n(x;Y) \to E_p(S)$ [JT17], for $n$ sufficiently large we obtain that the same conclusion holds for any $u, v$ satisfying
$$E_p(S) - \tilde\mu/4 \le H_n(u;Y_\tau), \qquad E_p(S) - \tilde\mu/4 \le H_n(v;Y). \qquad (17)$$
On the other hand, as shown in [AC18, Theorem 6], (17) holds with $\tau = 0$ as well, except now we have that the inner products of the near-maximal $u, v$ must satisfy $R(u,v) \in [0,\nu_1]\cup[\nu_2,1]$ for some $0 \le \nu_1 < \nu_2 \le 1$.
By combining these results we can obtain the overlap gap property with parameters $(E_p(S)-\tilde\mu/4, \nu_1, \nu_2)$ by applying the discretization argument from [GJ19a]. Note that (17) in the case $\tau = \pi/2$ yields $\varepsilon$-separation above level $E_p(S)-\tilde\mu/4$. As $\varepsilon$ was arbitrarily small, we can take $\varepsilon = \nu_1$. After recalling that $\mathbb{E}\max_{\Sigma_n}H_n(x;Y) \to E_p(\Sigma)$ [GT02], we see that the second result is a restatement of [GJ19a, Theorem 3.4] after applying Borell's inequality as in (17).

Remark 3.10. While the result of [CS17, Theorem 3] is only stated for $0 < \tau < \pi/2$, it easily extends to the case $\tau = \pi/2$ by differentiating in the Lagrange multiplier term $\lambda$ in the "RSB bound" from [CS17, Eq. 59]. For the reader's convenience, we sketch this change. We follow here the notation of [CS17]. By comparing to [CS17, Eq. 78], one sees that $E(0,u,\lambda)$ from [CS17, Eq. 61] satisfies $E(0,u,0) = 2E_p(S)\ (= 2\,GS)$. On the other hand, for $u > 0$ we have $\partial_\lambda E(0,u,0) = -u < 0$, from which it follows that $\min_\lambda E(0,u,\lambda) < 2E_p(S)$, as desired. The case $u < 0$ follows by symmetry.

In this section we prove a key structural property (Theorem 4.2) of low-degree polynomials on the Boolean hypercube. Roughly speaking, with nontrivial probability, a low-degree polynomial will not change its output significantly at any step when its input coordinates are resampled one at a time. Throughout this section, we work with the Boolean hypercube $\{0,1\}^m$ and let $Y = (Y_1,\ldots,Y_m)$ denote a Bernoulli random vector, $Y \in \{0,1\}^m$, with independent entries that satisfy $P(Y_i = 1) = p_i$ for some $0 < p_i < 1$. We view the hypercube as a graph where the vertex set is $V = \{0,1\}^m$ and the edge set consists of those pairs $(x,y)$ such that $x$ and $y$ differ in exactly one coordinate. We introduce the following local regularity property of (non-random) functions $\{0,1\}^m \to \mathbb{R}^n$.

Definition 4.1. Let $f : \{0,1\}^m \to \mathbb{R}^n$ and let $c > 0$. An edge $(x,y)$ in $\{0,1\}^m$ is said to be $c$-bad for $f$ if $\|f(x)-f(y)\|^2 \ge c\,\mathbb{E}\|f(Y)\|^2$.
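As a concrete illustration (ours, not part of the paper's argument), the coordinate-by-coordinate path of Definition 2.12 and the bad-edge check of Definition 4.1 can be sketched in a few lines of Python. The function `f`, the threshold `c`, and the second moment $\mathbb{E}\|f(Y)\|^2$ are supplied by the caller; the identity map and the parameter choices below are hypothetical.

```python
from itertools import product

def path(y, yp):
    # Definition 2.12: Z_i agrees with Y' on the first i coordinates and
    # with Y on the rest, so Y = Z_0 -> Z_1 -> ... -> Z_m = Y'.
    m = len(y)
    return [yp[:i] + y[i:] for i in range(m + 1)]

def has_bad_edge(f, y, yp, c, mean_sq_norm):
    # Definition 4.1: an edge (x, z) is c-bad for f if
    # ||f(x) - f(z)||^2 >= c * E||f(Y)||^2.
    zs = path(y, yp)
    return any(
        sum((a - b) ** 2 for a, b in zip(f(x), f(z))) >= c * mean_sq_norm
        for x, z in zip(zs, zs[1:])
    )

# Toy usage: f is the identity on {0,1}^3, so E||f(Y)||^2 = 1.5 when p_i = 1/2.
f = lambda x: x
cube = list(product((0, 1), repeat=3))
mean_sq = sum(sum(v ** 2 for v in x) for x in cube) / len(cube)  # = 1.5
print(has_bad_edge(f, (0, 0, 0), (1, 1, 1), c=1.0, mean_sq_norm=mean_sq))
```

Each step of the path flips a single coordinate, so for the identity map every edge changes $\|f(x)-f(z)\|^2$ by exactly 1; with $c = 1$ the threshold is $1.5$ and no edge on the path is bad.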
For $x, y \in \{0,1\}^m$, recall the definition of the path $x \rightsquigarrow y$ (Definition 2.12), which naturally corresponds to a walk on the edges of the hypercube graph. We now turn to the main result of this section, which shows that for a low-degree polynomial, a random path has no bad edges with nontrivial probability.

Theorem 4.2. Let $Y$ be a Bernoulli random vector with $P(Y_i=1) = p_i$, let $Y'$ be an independent copy of $Y$, and let $\lambda = \min_i(p_i \wedge (1-p_i))$. For any $c > 0$ and any (deterministic) degree-$D$ polynomial $f : \{0,1\}^m \to \mathbb{R}^n$ we have
$$P(Y \rightsquigarrow Y' \text{ has no } c\text{-bad edge for } f) \ge \lambda^{4D/c}.$$

The key steps in the proof of Theorem 4.2 are contained in the following two lemmas. Throughout the following, for a point $x \in \{0,1\}^m$, we let $x_{-i}$ denote all but the $i$th coordinate of $x$, and let $q(x) = P(x \rightsquigarrow Y' \text{ has no } c\text{-bad edge})$.

Lemma 4.3. Let $f : \{0,1\}^m \to \mathbb{R}^n$ be a polynomial of degree $D$ and let $Y$ be a Bernoulli random vector with $P(Y_i = 1) = p_i$. Let $B_i$ denote the event that the edge corresponding to flipping the $i$th coordinate of $Y$ is $c$-bad for $f$. Then
$$\frac{c}{2}\sum_{i=1}^m (p_i\wedge(1-p_i))\,P(B_i) \le D. \qquad (18)$$

Lemma 4.4. If $Y$ is a Bernoulli random vector with $P(Y_i = 1) = p_i$, then
$$-\mathbb{E}\log q(Y) \le \sum_{i=1}^m S(p_i)\,P(B_i) \qquad (19)$$
where $S$ denotes the binary entropy $S(p) = -p\log p - (1-p)\log(1-p)$.

Intuitively, (18) states that if $D$ is small, there cannot be too many bad edges. The proof will be based on the fact that low-degree polynomials have small total influence. Intuitively, (19) states that if most paths contain a bad edge then there must be many bad edges in total. The actual definition of "bad" will not be used in the proof of the latter lemma. We defer the proofs of these lemmas momentarily. We first show how to deduce Theorem 4.2 from the above lemmas, and then we prove the lemmas.

Proof of Theorem 4.2. If $p \le 1/2$ then $-p\log p \ge -(1-p)\log(1-p)$ and so $S(p) \le -2p\log p$. If instead $p > 1/2$ then $-p\log p \le -(1-p)\log(1-p)$ and so $S(p) \le -2(1-p)\log(1-p)$. Therefore,
Therefore,in either case we have S ( p i ) ≤ p i ∧ − p i ) log(1 /λ ) . (20)We now have − log E q ( Y ) ≤ − E log q ( Y ) ≤ m X i =1 S ( p i ) P ( B i ) ≤ λ ) m X i =1 ( p i ∧ − p i ) P ( B i ) ≤ λ ) · Dc where in the first inequality we used Jensen’s inequality, the second we used (19), the third we used(20), and the last we used (18). The result follows by re-arrangement.22efore turning to the proof of the above lemmas, let us pause and recall here some basicfacts from Fourier analysis on the Boolean cube. For more on this see [O’D14]. For i ∈ [ m ],let φ i ( Y i ) = Y i − p i √ p i (1 − p i ) , and for S ⊆ [ m ], let φ S ( Y ) = Q i ∈ S φ i ( Y i ). Recall that the functions { φ S } S ⊆ [ m ] form an orthonormal basis for L ( P ). For a function f we denote its fourier coefficientsby ˆ f ( S ) = E f ( Y ) φ S ( Y ). Observe that Parseval’s theorem in this setting reads: for a function f : { , } m → R , we have E [ f ( Y ) ] = X S ⊆ [ m ] ˆ f ( S ) . For a function f we denote the total influence by I ( f ) = X S ⊆ [ m ] | S | · ˆ f ( S ) . Finally, consider the Laplacian operator L i , defined by L i f = X S ∋ i ˆ f ( S ) φ S ( x ) , which can be thought of as “the part of f that depends on i ”. Proof of Lemma 4.3. Let us begin by first fixing an entry f j of f . Since f j is of degree at most D ,its spectrum is such that ˆ f j ( S ) = 0 for any S with | S | > D . As such, D E [ f j ( Y ) ] = D X | S |≤ D ˆ f j ( S ) ≥ I ( f j ) = X i E ( L i f j ( Y )) = X i E [(1 − p i )( L i f j ( Y − i [0])) + p i ( L i f j ( Y − i [1])) ]where Y − i [ ℓ ] ∈ { , } m is obtained from Y − i by setting the i th coordinate to ℓ . Using that for any a, b ∈ R and p ∈ [0 , p ∧ − p )( a − b ) ≤ − p ) a + pb ) , we see that the above display isbounded below by12 X i ( p i ∧ − p i ) E ( L i f j ( Y − i [1]) − L i f j ( Y − i [0])) = 12 X i ( p i ∧ − p i ) E ( f j ( Y − i [1]) − f j ( Y − i [0])) . 
Summing over $j$ and applying the definition of a $c$-bad edge, we obtain
$$D\,\mathbb{E}\|f(Y)\|^2 \ge \frac12\sum_i(p_i\wedge(1-p_i))\,\mathbb{E}\|f(Y_{-i}[1]) - f(Y_{-i}[0])\|^2 \ge \frac12\sum_i(p_i\wedge(1-p_i))\,P(B_i)\,c\,\mathbb{E}\|f(Y)\|^2.$$
Cancelling the common factor of $\mathbb{E}\|f(Y)\|^2$ yields the result.

Proof of Lemma 4.4. Proceed by induction on $m$. The base case $m = 1$ is straightforward: if the single edge is bad then both sides of (19) equal $S(p_1)$, and otherwise both sides equal 0. For the inductive step, let $q_0(Y_{-1})$ denote the probability over $Y'_{-1}$ that the path $Y_{-1}[0] \rightsquigarrow Y'_{-1}[0]$ has no $c$-bad edge. Similarly define $q_1(Y_{-1})$ for the path $Y_{-1}[1] \rightsquigarrow Y'_{-1}[1]$. (Note that these are probabilities on $\{0,1\}^{m-1}$.) Integrating in the first coordinate, we have by independence,
$$-\mathbb{E}\log q(Y) = -\mathbb{E}\big[(1-p_1)\log q(Y_{-1}[0]) + p_1\log q(Y_{-1}[1])\big] \qquad (21)$$
where
$$q(Y_{-1}[0]) = (1-p_1)q_0(Y_{-1}) + p_1 q_1(Y_{-1})\mathbf{1}_{B_1^c} \qquad\text{and}\qquad q(Y_{-1}[1]) = (1-p_1)q_0(Y_{-1})\mathbf{1}_{B_1^c} + p_1 q_1(Y_{-1}).$$
If $B_1^c$ holds (i.e., the edge corresponding to flipping the first coordinate is "good") then the expression inside the expectation in (21) is
$$(1-p_1)\log q(Y_{-1}[0]) + p_1\log q(Y_{-1}[1]) = \log\big[(1-p_1)q_0(Y_{-1}) + p_1 q_1(Y_{-1})\big] \ge (1-p_1)\log q_0(Y_{-1}) + p_1\log q_1(Y_{-1}),$$
using concavity of $t \mapsto \log t$. If instead $B_1$ holds (i.e., the edge is bad),
$$(1-p_1)\log q(Y_{-1}[0]) + p_1\log q(Y_{-1}[1]) = (1-p_1)\log\big[(1-p_1)q_0(Y_{-1})\big] + p_1\log\big[p_1 q_1(Y_{-1})\big] = -S(p_1) + (1-p_1)\log q_0(Y_{-1}) + p_1\log q_1(Y_{-1}).$$
Putting it all together,
$$-\mathbb{E}\log q(Y) \le -\mathbb{E}\big[-S(p_1)\mathbf{1}_{B_1} + (1-p_1)\log q_0(Y_{-1}) + p_1\log q_1(Y_{-1})\big] = S(p_1)P(B_1) - (1-p_1)\mathbb{E}\log q_0(Y_{-1}) - p_1\mathbb{E}\log q_1(Y_{-1}). \qquad (22)$$
By the induction hypothesis,
$$-\mathbb{E}\log q_0(Y_{-1}) \le \sum_{i>1}S(p_i)P(B_i \mid Y_1 = 0).$$
Similarly,
$$-\mathbb{E}\log q_1(Y_{-1}) \le \sum_{i>1}S(p_i)P(B_i \mid Y_1 = 1).$$
Combining these yields
$$-(1-p_1)\mathbb{E}\log q_0(Y_{-1}) - p_1\mathbb{E}\log q_1(Y_{-1}) \le \sum_{i>1}S(p_i)P(B_i).$$
Plugging this into (22) completes the proof.

This section is devoted to proving Theorem 2.8.
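Before proceeding, the guarantee of Theorem 4.2 can be checked by exhaustive enumeration on a tiny instance. The sketch below is illustrative and ours alone: the degree-1 polynomial, the dimension $m = 3$, and the choice $c = 0.2$ are hypothetical, and the comparison value `lam ** (4 * D / c)` follows the $\lambda^{4D/c}$ lower bound as stated in Theorem 4.2.

```python
from itertools import product

def f(x):
    # A hypothetical degree-1 polynomial {0,1}^3 -> R.
    return x[0] + x[1] + x[2]

def path(y, yp):
    # The resampling path of Definition 2.12.
    m = len(y)
    return [yp[:i] + y[i:] for i in range(m + 1)]

def no_bad_edge(y, yp, c, mean_sq):
    zs = path(y, yp)
    return all((f(a) - f(b)) ** 2 < c * mean_sq for a, b in zip(zs, zs[1:]))

m, D, c, lam = 3, 1, 0.2, 0.5  # p_i = 1/2, so lambda = 1/2
cube = list(product((0, 1), repeat=m))
mean_sq = sum(f(x) ** 2 for x in cube) / len(cube)  # E||f(Y)||^2 = 3
# Exact probability, over independent uniform Y and Y', that Y ~> Y' has no c-bad edge.
prob = sum(no_bad_edge(y, yp, c, mean_sq) for y in cube for yp in cube) / len(cube) ** 2
print(prob, lam ** (4 * D / c))  # 0.125 vs. the (much smaller) lower bound 0.5**20
```

Here every genuine coordinate flip changes $f$ by 1 and $c\,\mathbb{E}\|f(Y)\|^2 = 0.6$, so the path is bad-edge-free exactly when $Y = Y'$, giving probability $2^{-3} = 0.125$, comfortably above the theorem's (deliberately loose) lower bound.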
We start with the following result, which shows that, together, OGP and separation imply that low-degree polynomials fail to find large independent sets. Recall the family of functions $F(Y, Y')$ from (6).

Theorem 4.5. Suppose $d \le n/2$ and $\nu_2 n \le k$. Suppose that with probability at least $1 - \Delta$ when $Y, Y' \sim G(n, d/n)$ independently, $F(Y, Y')$ has the $(k, \nu_1, \nu_2)$-OGP, and $F(\cdot\,; Y)$ and $F(\cdot\,; Y')$ are $\nu_2$-separated above $k$. If
\[ \Delta + 3\delta(m+1) < \exp\left(-\frac{96\,\gamma D k \log(n/d)}{(\nu_1 - \nu_2)^2\, n}\right) \tag{23} \]
then for any $\eta \le (\nu_1 - \nu_2)^2/4$, there is no random degree-$D$ polynomial that $(k, \delta, \gamma, \eta)$-optimizes (4).

Proof. Assume on the contrary that $f$ $(k, \delta, \gamma, \eta)$-optimizes maximum independent set. Let $A(Y, \omega)$ denote the "failure" event
\[ A(Y, \omega) = \{\, |V^\eta_f(Y, \omega)| < k \,\}. \]
As in the proof of Theorem 3.6, we can reduce to the case where $f$ is deterministic: there exists $\omega^* \in \Omega$ such that the resulting deterministic function $f(\cdot) = f(\cdot, \omega^*)$ satisfies $\mathbb{E}_Y \|f(Y)\|^2 \le 3\gamma k$ and $\mathbb{P}_Y(A(Y, \omega^*)) \le 3\delta$.

Let $Y, Y' \sim G(n, d/n)$ independently, and let $Y = Z^0 \to Z^1 \to \cdots \to Z^m = Y'$ be the path $Y \rightsquigarrow Y'$. Let $S_j = V^\eta_f(Z^j)$. Consider the following events.

(i) The family $F(Y, Y')$ has the $(k, \nu_1, \nu_2)$-OGP on $\{0,1\}^m$, and the functions $F(\cdot\,; Y)$ and $F(\cdot\,; Y')$ are $\nu_2$-separated above $k$.

(ii) For all $j \in \{0, 1, \ldots, m\}$, $f$ succeeds on input $Z^j$, i.e., the event $A(Z^j, \omega^*)^c$ holds.

With probability at least $1 - \Delta - 3\delta(m+1)$, the events (i) and (ii) occur simultaneously. We will show that when this happens, the path $Y \rightsquigarrow Y'$ must contain a $c$-bad edge (for a particular choice of $c$). This will allow us to derive a contradiction with Theorem 4.2.

Toward this end, suppose (i) and (ii) both occur. Since $\nu_2 n \le k$, it follows that some $j$ must cross the OGP gap in the sense that $|S_0 \cap S_j| \ge \nu_1 n$ and $|S_0 \cap S_{j+1}| \le \nu_2 n$.
Thus, letting $\mathbb{1}_S \in \{0,1\}^n$ denote the indicator of $S$,
\[ (\nu_1 - \nu_2)\,n \le \big|\langle \mathbb{1}_{S_0},\, \mathbb{1}_{S_j} - \mathbb{1}_{S_{j+1}} \rangle\big| \le \|\mathbb{1}_{S_0}\| \cdot \|\mathbb{1}_{S_j} - \mathbb{1}_{S_{j+1}}\| = \sqrt{|S_0|}\cdot\sqrt{|S_j \,\triangle\, S_{j+1}|} \le \sqrt{n}\cdot\sqrt{|S_j \,\triangle\, S_{j+1}|}, \]
where $\triangle$ denotes symmetric difference. From the definition of $V^\eta_f$, there must be at least $|S_j \,\triangle\, S_{j+1}| - 2\eta n$ coordinates $i$ for which $|f_i(Z^j) - f_i(Z^{j+1})| \ge 1/2$. This means
\[ \|f(Z^j) - f(Z^{j+1})\|^2 \ge \frac{1}{4}\big(|S_j \,\triangle\, S_{j+1}| - 2\eta n\big) \ge \frac{1}{4}\big[(\nu_1 - \nu_2)^2 n - 2\eta n\big] \ge \frac{n}{8}(\nu_1 - \nu_2)^2, \]
provided $\eta \le (\nu_1 - \nu_2)^2/4$. Since $\mathbb{E}_Y\|f(Y)\|^2 \le 3\gamma k$, we now have that $(Z^j, Z^{j+1})$ is a $c$-bad edge for $c = (\nu_1 - \nu_2)^2 n/(24\gamma k)$.

Applying Theorem 4.2 yields
\[ \Delta + 3\delta(m+1) \ge (d/n)^{4D/c} = \exp\left(-\frac{96\,\gamma D k \log(n/d)}{(\nu_1-\nu_2)^2\, n}\right). \]
This contradicts (23), completing the proof.

Proof of Theorem 2.8. Let $\alpha > 1 + 1/\sqrt{2}$, and let $0 \le \tilde\nu_2 < \tilde\nu_1 \le 1$. If $d$ is sufficiently large, Theorem 2.13 allows us to apply Theorem 4.5 with parameters $k = \alpha \frac{\log d}{d} n$, $\nu_j = \tilde\nu_j k/n$ (for $j = 1, 2$), and $\Delta = \exp(-\Omega(n))$. This requires
\[ \eta \le (\nu_1 - \nu_2)^2/4 = \left(\frac{\alpha \log d}{2d}\right)^2 (\tilde\nu_1 - \tilde\nu_2)^2 \]
and gives the desired result provided
\[ \exp(-\Omega(n)) + 3\delta\left(\binom{n}{2} + 1\right) < \exp\left(-\frac{96\,\gamma D d \log(n/d)}{(\tilde\nu_1 - \tilde\nu_2)^2\, \alpha \log d}\right). \]
To satisfy this (for $n$ sufficiently large), it is sufficient to have
\[ \delta < \frac{1}{4n^2}\left[\exp(-\tilde C_1 \gamma D \log n) - \exp(-\tilde C_2 n)\right] \]
where $\tilde C_1, \tilde C_2 > 0$ are constants depending on $\alpha, d$. It is in turn sufficient to have $\tilde C_1 \gamma D \log n \le \tilde C_2 n$ and
\[ \delta < \frac{1}{4n^2}\exp(-\tilde C_1 \gamma D \log n) = \exp(-\tilde C_1 \gamma D \log n - \log 4 - 2\log n). \]
This completes the proof with $C_1 = \tilde C_2/\tilde C_1$ and $C_2 = \tilde C_1 + 3$.

4.3 Proof of Overlap Gap Property

Proof of Theorem 2.13. Fix integers $k_1, k_2, \ell$ satisfying $k_1 \ge k$, $k_2 \ge k$, and $1 \le \ell \le k_1 \wedge k_2$. Fix $j_1, j_2 \in \{0, 1, \ldots, m\}$. Let $T(k_1, k_2, \ell, j_1, j_2)$ denote the expected number of ordered pairs $(S_1, S_2)$ where $S_1$ is an independent set in $Z^{j_1}$ with $|S_1| = k_1$, $S_2$ is an independent set in $Z^{j_2}$ with $|S_2| = k_2$, and $|S_1 \cap S_2| = \ell$ (where the expectation is over $Y, Y'$). Define $\alpha_1, \alpha_2, \beta$ by the relations
\[ k_1 = \alpha_1 \frac{\log d}{d} n, \quad k_2 = \alpha_2 \frac{\log d}{d} n, \quad \text{and} \quad \ell = \beta \frac{\log d}{d} n. \]
Restrict to the case $\delta \le \beta \le 2\alpha - \delta$ for an arbitrary but fixed constant $\delta > 0$ (depending on $\alpha$ but not $d$); we will show an interval of forbidden overlaps within this range of $\beta$. Note that $\alpha_1 \ge \alpha$ and $\alpha_2 \ge \alpha$, and we can assume $\alpha_1 \le 2 + \delta$ and $\alpha_2 \le 2 + \delta$ since (for sufficiently large $d$) there are no independent sets of size exceeding $(2+\delta)\frac{\log d}{d} n$ with high probability. We have
\[ T(k_1, k_2, \ell, j_1, j_2) = \binom{n}{\ell}\binom{n-\ell}{k_1-\ell}\binom{n-k_1}{k_2-\ell}(1 - d/n)^{E} \le \binom{n}{\ell}\binom{n}{k_1-\ell}\binom{n}{k_2-\ell}(1 - d/n)^{E} \tag{24} \]
where $E \ge \binom{k_1}{2} + \binom{k_2}{2} - \binom{\ell}{2}$ (the worst case being $j_1 = j_2$). Using the standard bounds $\binom{n}{k} \le (ne/k)^k$ (for $1 \le k \le n$) and $\log(1+x) \le x$ (for $x > -1$),
\[ T(k_1, k_2, \ell, j_1, j_2) \le \exp\left(\ell \log\frac{ne}{\ell} + (k_1-\ell)\log\frac{ne}{k_1-\ell} + (k_2-\ell)\log\frac{ne}{k_2-\ell} - E\,\frac{d}{n}\right) \]
\[ \le \exp\left[\frac{\log^2 d}{d}\, n\left(\beta + (\alpha_1 - \beta) + (\alpha_2 - \beta) - \frac{1}{2}\big(\alpha_1^2 + \alpha_2^2 - \beta^2\big) + \varepsilon_d + o(1)\right)\right] \]
\[ = \exp\left[\frac{\log^2 d}{d}\, n\left(\alpha_1 - \frac{1}{2}\alpha_1^2 + \alpha_2 - \frac{1}{2}\alpha_2^2 - \beta + \frac{1}{2}\beta^2 + \varepsilon_d + o(1)\right)\right] \]
where $\varepsilon_d \to 0$ as $d \to \infty$. Since $\alpha_1, \alpha_2 \ge \alpha > 1 + 1/\sqrt{2}$, we have $\alpha_i - \frac{1}{2}\alpha_i^2 < \frac{1}{4}$ for $i = 1, 2$. Note that $\beta \mapsto \beta - \frac{1}{2}\beta^2$ has maximum value $\frac{1}{2}$, attained at $\beta = 1$. Thus if we choose $\delta > 0$ small enough (depending on $\alpha$ but not $d$), for any $\beta \in [1-\delta, 1+\delta]$ and any $\alpha_1 \ge \alpha$, $\alpha_2 \ge \alpha$, we have
\[ \alpha_1 - \frac{1}{2}\alpha_1^2 + \alpha_2 - \frac{1}{2}\alpha_2^2 - \beta + \frac{1}{2}\beta^2 \le -\delta, \]
implying $T(k_1, k_2, \ell, j_1, j_2) \le \exp\big(-\frac{\log^2 d}{d}\, n\,(\delta - \varepsilon_d - o(1))\big)$, which is $\exp(-\Omega(n))$ for sufficiently large $d$. Accordingly, let $\nu_1 = (1+\delta)/\alpha$ and $\nu_2 = (1-\delta)/\alpha$. We now have that OGP (with the desired parameters) holds with high probability, using Markov's inequality and a union bound over the at most $n^3(m+1)^2$ possible values for $(k_1, k_2, \ell, j_1, j_2)$.

It remains to show $\nu_2$-separation, which pertains to the case $j_1 = 0$, $j_2 = m$. In this case, (24) holds with the stronger statement $E = \binom{k_1}{2} + \binom{k_2}{2}$. As a result, the expression in (24) is non-increasing in $\beta$ (provided $\beta \ge \delta$ and $d$ is sufficiently large).
By the above argument, we can again conclude $T(k_1, k_2, \ell, 0, m) \le \exp(-\Omega(n))$, but now under the weaker condition $\beta \ge 1 - \delta$ in place of $\beta \in [1-\delta, 1+\delta]$. This completes the proof.

A More on Low-Degree Algorithms

Here we discuss some classes of algorithms that can be represented by, or approximated by, low-degree polynomials. We provide proof sketches of how to write these algorithms as polynomials and discuss what degree is required to do so. We consider polynomials $f \colon \mathbb{R}^m \to \mathbb{R}^n$ of the form $f(Y) = (f_1(Y), \ldots, f_n(Y))$ where each $f_j$ is a polynomial of degree (at most) $D = D(n)$. Here $m = m(n)$ is the dimension of the input. We allow $f$ to have random coefficients (although they must be independent of the input $Y$).

Spectral methods. Consider the following general class of spectral methods. For some $N = N(n)$, let $M = M(Y)$ be an $N \times N$ matrix whose entries are polynomials in $Y$ of degree at most $d = O(1)$. Then compute the leading eigenvector of $M$. The leading eigenvector can be approximated via the power iteration scheme: iterate $u^{t+1} \leftarrow M u^t$ starting from some initial vector $u^0$ (which may be random but does not depend on $Y$). After $t$ rounds of power iteration, the result $M^t u^0$ is a (random) polynomial in $Y$ of degree at most $td$.

One particularly simple random optimization problem is to maximize the quadratic form of a random matrix:
\[ \max_{\|x\|=1} x^\top Y x \tag{25} \]
where $Y$ is a GOE matrix, i.e., symmetric $n \times n$ with $Y_{ij} = Y_{ji} \sim \mathcal{N}(0, 1/n)$ for $i \ne j$ and $Y_{ii} \sim \mathcal{N}(0, 2/n)$. This is equivalent to the $p = 2$ case of the spherical $p$-spin model. In the limit $n \to \infty$, the eigenvalues of $Y$ follow the semicircle law on $[-2, 2]$, and a vector $x \in \mathbb{R}^n$ achieving $x^\top Y x / \|x\|^2 \ge 2 - \varepsilon$ for a constant $\varepsilon > 0$ can be found via power iteration on the shifted matrix $2I + Y$ (whose spectrum is asymptotically nonnegative, so the top edge dominates) with a constant $C = C(\varepsilon)$ number of rounds: $x = (2I + Y)^C u^0$ where $u^0 \sim \mathcal{N}(0, I)$. While this $x$ does not satisfy the constraint $\|x\| = 1$ exactly, $\|x\|^2$ concentrates tightly around its expectation and so by rescaling we can arrange to have $\|x\| = 1 \pm o(1)$ with high probability.
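As a numerical sketch of the power-iteration discussion (our own illustration, not taken from the text: the dimension $n$, the number of rounds $C$, and the explicit use of the shifted matrix $2I + Y$, whose spectrum is asymptotically contained in $[0,4]$, are choices made here), one can check that a modest number of rounds already lands near the semicircle edge at 2:

```python
import numpy as np

rng = np.random.default_rng(0)
n, C = 2000, 100  # dimension and number of power-iteration rounds (our choices)

# GOE normalization as in the text: off-diagonal N(0, 1/n), diagonal N(0, 2/n)
G = rng.standard_normal((n, n))
Y = (G + G.T) / np.sqrt(2 * n)

# Power iteration on the shifted matrix 2I + Y, so that eigenvalues near the
# top edge (~2) dominate those near the bottom edge (~-2).
x = rng.standard_normal(n)
for _ in range(C):
    x = 2 * x + Y @ x          # multiply by (2I + Y)
    x /= np.linalg.norm(x)     # numerical rescaling only; direction unchanged

objective = x @ Y @ x          # with ||x|| = 1, compare to lambda_max ~ 2
print(round(objective, 2))     # close to (but below) 2 for moderate C
```

Up to scaling, $x$ is still the degree-$C$ polynomial $(2I+Y)^C u^0$ applied to the input; increasing $C$ pushes the objective closer to the edge.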
(As we have done throughout this paper, we only ask that a low-degree polynomial output an approximately valid solution in the appropriate problem-specific sense. This solution can then be "rounded" to a valid solution via some canonical procedure, which in this case is simply normalization: $x \leftarrow x/\|x\|$.) Thus, a near-optimal solution to (25) can be obtained via a constant-degree polynomial.

Iterative methods and AMP. Suppose we start with a random initialization $u^0$ and carry out the iteration scheme $u^{t+1} \leftarrow F_t(u^0, \ldots, u^t; Y)$ where $F_t$ is a degree-$d$ polynomial in all of its inputs. Then $u^t$ can be written as a (random) polynomial in $Y$ of degree at most $d^t$. In contrast to the linear update step in power iteration, here the degree can grow exponentially in the number of iterations.

Approximate message passing (AMP) [DMM09] is a class of iterative methods that gives state-of-the-art performance for a variety of statistical tasks, both for estimation problems with a planted signal and for (un-planted) random optimization problems. AMP iterations typically involve certain non-linear transformations, applied entrywise to the current state vector. By replacing each non-linear function with a polynomial of large constant degree, one can approximate each AMP iteration arbitrarily well by a constant-degree polynomial. The existing AMP algorithms for spin glass optimization problems [Mon19, EMS20] only need to run for a constant $C(\varepsilon)$ number of iterations in order to reach objective value $\mu^* - \varepsilon$ (for some problem-specific threshold $\mu^*$). As in the previous discussion, AMP is guaranteed to output a nearly-valid solution in the appropriate sense. Thus, the AMP algorithms for spin glass optimization are captured by constant-degree polynomials: the objective value $\mu^* - \varepsilon$ can be obtained by a polynomial of degree $D(\varepsilon)$.

Local algorithms on sparse graphs.
Now consider the setting where the input is a random graph of constant average degree, e.g., a random $d$-regular graph or an Erdős-Rényi $G(n, d/n)$ graph, where $d$ is a constant. The algorithm's output is a vector in $\mathbb{R}^n$, which could be, for example, (an approximation of) the indicator vector for an independent set. We consider "local algorithms" in the sense of the factors of i.i.d. model [LW07], defined as follows. For the algorithm's internal use, we attach a random variable $z_i$ to each vertex $i$; these are i.i.d. from some distribution of the algorithm designer's choosing. For a constant $r$, an algorithm is called $r$-local if the output associated to each vertex depends only on the radius-$r$ neighborhood of that vertex in the graph, including both the graph topology and the labels $z_i$.

We can see that any $O(1)$-local algorithm can be approximated by a constant-degree random polynomial as follows. Suppose we have an $r$-local algorithm $L(Y)$, where $Y$ is the adjacency matrix of the graph. Let $\Delta$ be a constant. For each $i$ there is a random polynomial $f_i(Y)$ of constant degree $D(r, \Delta)$ such that $f_i(Y) = L_i(Y)$ whenever the radius-$r$ neighborhood of vertex $i$ contains at most $\Delta$ vertices. We can construct $f_i$ as follows. It is sufficient to consider fixed $\{z_i\}$, since these can be absorbed into the randomness of $f_i$. Thus, $L_i(Y)$ is determined by the radius-$r$ neighborhood of $i$, i.e., the subgraph consisting of edges reached within distance $r$ from $i$. Consider a fixed graph $N$ (on the same vertex set as $Y$) spanning at most $\Delta$ vertices. The $\{0,1\}$-indicator that $N$ is a subgraph of the radius-$r$ neighborhood of $i$ can be written as a constant-degree polynomial (where the degree is the number of edges in $N$). We can now form $f_i$ by summing over all possibilities for $N$ and using inclusion-exclusion. By choosing $\Delta$ large (compared to $d$), we can ensure that $f$ agrees with $L$ except on an arbitrarily small constant fraction of vertices.
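To make the factors-of-i.i.d. model concrete, here is a minimal 1-local rule for independent sets (our own illustration, not an algorithm analyzed in the text): a vertex joins the output set iff its i.i.d. label is a strict local minimum among its neighbors' labels. Two adjacent vertices can never both be local minima, so the output is always an independent set.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 3000, 5                 # illustrative sizes (our choice)

# Sample an Erdos-Renyi graph G(n, d/n) as a symmetric 0/1 adjacency matrix.
A = np.triu(rng.random((n, n)) < d / n, k=1)
A = A | A.T
adj = [np.flatnonzero(A[i]) for i in range(n)]

# i.i.d. labels z_i; the decision at vertex i depends only on its own label
# and its neighbors' labels, i.e., on its radius-1 neighborhood (1-local).
z = rng.random(n)
S = [i for i in range(n) if all(z[i] < z[j] for j in adj[i])]

# Sanity check: S is an independent set.
assert not any(A[i, j] for i in S for j in S)
print(len(S))  # roughly n * (1 - e^{-d}) / d vertices in expectation
```

This simple rule collects about $(1 - e^{-d})\frac{n}{d}$ vertices in expectation, a constant factor below the roughly $\frac{\log d}{d} n$ achieved by stronger local constructions; it is meant only to illustrate locality, not optimality.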
(Note that a small fraction of errors of this type is allowed by our lower bounds.) Thus, local algorithms of constant radius can be captured by constant-degree polynomials.

Exhaustive search. One (computationally inefficient) way to solve an optimization problem is by exhaustive search over all possible solutions. It will be instructive to investigate at what degree such an algorithm can be approximated by a polynomial. As an example, consider the Ising $p$-spin optimization problem
\[ \max_{x \in \Sigma_n} \langle Y, x^{\otimes p} \rangle \tag{26} \]
where $\Sigma_n = \{+1, -1\}^n$ and $Y$ is a $p$-tensor with i.i.d. $\mathcal{N}(0,1)$ entries. In time $\exp(O(n))$, one can enumerate all possible solutions and solve (26) exactly. To approximate this by a polynomial, consider
\[ f(Y) = \sum_{x \in \Sigma_n} x\, \langle Y, x^{\otimes p} \rangle^k \]
for some power $k = k(n) \in \mathbb{N}$. Note that if $k$ is large enough, the sum will be dominated by the term corresponding to the $x^*$ which maximizes (26), and so $f(Y)$ will be approximately equal to a multiple of $x^*$. For the spherical $p$-spin optimization problem, one can similarly replace the sum over $\Sigma_n$ by an integral over the sphere $\mathbb{S}^{n-1}$. A heuristic calculation based on the known behavior of near-maximal states [SZ17] indicates that $k$ should be chosen as $k = n^{1+o(1)}$ in order for $f(Y)$ to be close to the optimizer, in which case $f$ has degree $n^{1+o(1)}$ (we thank Eliran Subag and Jean-Christophe Mourrat for helpful discussions surrounding this point). This is consistent with a phenomenon that has been observed in various hypothesis testing settings (see [Hop18, KWB19, DKWB19]): the class of degree-$n^\delta$ polynomials is at least as powerful as all known $\exp(n^{\delta - o(1)})$-time algorithms.

Comparison to the hypothesis testing setting. Above, we have discussed how algorithms for random optimization problems can be represented as polynomials. The original motivation for this comes from the well-established theory of low-degree algorithms for hypothesis testing problems (see [Hop18, KWB19]).
In the hypothesis testing setting, it is typical for state-of-the-art polynomial-time algorithms to take the form of spectral methods, i.e., the algorithm decides whether the input $Y$ was drawn from the null or planted distribution by thresholding the leading eigenvalue of some matrix $M = M(Y)$ whose entries are constant-degree polynomials in $Y$. To approximate this by a polynomial, consider $f(Y) = \mathrm{Tr}(M^k) = \sum_i \lambda_i^k$ for some power $k = k(n) \in \mathbb{N}$, where $\{\lambda_i\}$ denote the eigenvalues of $M$. If $k$ is large enough, $f(Y)$ is dominated by the leading eigenvalue. Typically, under the planted distribution there is an outlier eigenvalue that exceeds, by a constant factor, the largest (in magnitude) eigenvalue that occurs under the null distribution. As a result, $k$ should be chosen as $k = \Theta(\log n)$ in order for $f$ to consistently distinguish the planted and null distributions in the appropriate formal sense (see [KWB19, Theorem 4.4] for details). For this reason, low-degree algorithms in the hypothesis testing setting typically require degree $O(\log n)$. In contrast, we have seen above that for random optimization problems with no planted signal, constant degree often seems to be sufficient.

Acknowledgments

For helpful discussions, A.S.W. is grateful to Léo Miolane, Eliran Subag, Jean-Christophe Mourrat, and Ahmed El Alaoui.

References

[ABČ13] Antonio Auffinger, Gérard Ben Arous, and Jiří Černý. Random matrices and complexity of spin glasses. Comm. Pure Appl. Math., 66(2):165–201, 2013.

[AC08] Dimitris Achlioptas and Amin Coja-Oghlan. Algorithmic barriers from phase transitions. In 49th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 793–802. IEEE, 2008.

[AC18] Antonio Auffinger and Wei-Kuo Chen. On the energy landscape of spherical spin glasses. Advances in Mathematics, 330:553–588, 2018.

[ACR11] Dimitris Achlioptas, Amin Coja-Oghlan, and Federico Ricci-Tersenghi. On the solution space geometry of random formulas.
Random Structures and Algorithms, 38:251–268, 2011.

[BCKM98] Jean-Philippe Bouchaud, Leticia F. Cugliandolo, Jorge Kurchan, and Marc Mézard. Out of equilibrium dynamics in spin-glasses and other glassy systems. Spin Glasses and Random Fields, pages 161–223, 1998.

[BCR20] Giulio Biroli, Chiara Cammarota, and Federico Ricci-Tersenghi. How to iron out rough landscapes and get optimal performances: Averaged gradient descent and its application to tensor PCA. Journal of Physics A: Mathematical and Theoretical, 2020.

[BDG01] Gérard Ben Arous, Amir Dembo, and Alice Guionnet. Aging of spherical spin glasses. Probab. Theory Related Fields, 120(1):1–67, 2001.

[BDG06] Gérard Ben Arous, Amir Dembo, and Alice Guionnet. Cugliandolo-Kurchan equations for dynamics of spin-glasses. Probab. Theory Related Fields, 136(4):619–660, 2006.

[Ben02] Gérard Ben Arous. Aging and spin-glass dynamics. In Proceedings of the International Congress of Mathematicians, Vol. III (Beijing, 2002), pages 3–14. Higher Ed. Press, Beijing, 2002.

[BGJ18] Gérard Ben Arous, Reza Gheissari, and Aukosh Jagannath. Algorithmic thresholds for tensor PCA. arXiv preprint arXiv:1808.00921, 2018.

[BGJ20] Gérard Ben Arous, Reza Gheissari, and Aukosh Jagannath. Bounding flows for spherical spin glass dynamics. Communications in Mathematical Physics, 373(3):1011–1048, 2020.

[BGT13] Mohsen Bayati, David Gamarnik, and Prasad Tetali. Combinatorial approach to the interpolation method and scaling limits in sparse random graphs. Annals of Probability (conference version in Proc. 42nd Ann. Symposium on the Theory of Computing (STOC) 2010), 41:4080–4115, 2013.

[BHK+19] Boaz Barak, Samuel Hopkins, Jonathan Kelner, Pravesh K. Kothari, Ankur Moitra, and Aaron Potechin. A nearly tight sum-of-squares lower bound for the planted clique problem. SIAM Journal on Computing, 48(2):687–735, 2019.

[BJ18] Gérard Ben Arous and Aukosh Jagannath. Spectral gap estimates in mean field spin glasses.
Communications in Mathematical Physics, 361(1):1–52, 2018.

[BKW19] Afonso S. Bandeira, Dmitriy Kunisky, and Alexander S. Wein. Computational hardness of certifying bounds on constrained PCA problems. arXiv preprint arXiv:1902.07324, 2019.

[CGPR19] Wei-Kuo Chen, David Gamarnik, Dmitry Panchenko, and Mustazee Rahman. Suboptimality of local algorithms for a class of max-cut problems. The Annals of Probability, 47(3):1587–1618, 2019.

[CHH17] Amin Coja-Oghlan, Amir Haqshenas, and Samuel Hetterich. Walksat stalls well below satisfiability. SIAM Journal on Discrete Mathematics, 31(2):1160–1173, 2017.

[CHS93] A. Crisanti, H. Horner, and H.-J. Sommers. The spherical p-spin interaction spin-glass model. Zeitschrift für Physik B Condensed Matter, 92(2):257–271, 1993.

[CK93] Leticia F. Cugliandolo and Jorge Kurchan. Analytical solution of the off-equilibrium dynamics of a long-range spin-glass model. Phys. Rev. Lett., 71:173–176, Jul 1993.

[CS17] Wei-Kuo Chen and Arnab Sen. Parisi formula, disorder chaos and fluctuation for the ground state energy in the spherical mixed p-spin models. Communications in Mathematical Physics, 350(1):129–173, 2017.

[Cug03] Leticia F. Cugliandolo. Course 7: Dynamics of glassy systems. In Slow Relaxations and Nonequilibrium Dynamics in Condensed Matter, pages 367–521. Springer, 2003.

[DKWB19] Yunzi Ding, Dmitriy Kunisky, Alexander S. Wein, and Afonso S. Bandeira. Subexponential-time algorithms for sparse PCA. arXiv preprint arXiv:1907.11635, 2019.

[DMM09] David L. Donoho, Arian Maleki, and Andrea Montanari. Message-passing algorithms for compressed sensing. Proceedings of the National Academy of Sciences, 106(45):18914–18919, 2009.

[EMS20] Ahmed El Alaoui, Andrea Montanari, and Mark Sellke. Optimization of mean-field spin glasses. arXiv preprint arXiv:2001.00904, 2020.

[Fri90] Alan Frieze. On the independence number of random graphs. Discrete Mathematics, 81:171–175, 1990.

[GJ19a] David Gamarnik and Aukosh Jagannath.
The overlap gap property and approximate message passing algorithms for p-spin models. arXiv preprint arXiv:1911.06943, 2019.

[GJ19b] Reza Gheissari and Aukosh Jagannath. On the spectral gap of spherical spin glass dynamics. Ann. Inst. H. Poincaré Probab. Statist., 55(2):756–776, 2019.

[GS17] David Gamarnik and Madhu Sudan. Limits of local algorithms over sparse random graphs. Annals of Probability, 45:2353–2376, 2017.

[GT02] Francesco Guerra and Fabio Lucio Toninelli. The thermodynamic limit in mean field spin glass models. Comm. Math. Phys., 230(1):71–79, 2002.

[Gui07] Alice Guionnet. Dynamics for spherical models of spin-glass and aging. In Spin Glasses, volume 1900 of Lecture Notes in Math., pages 117–144. Springer, Berlin, 2007.

[Has20] Matthew B. Hastings. Classical and quantum algorithms for tensor principal component analysis. Quantum, 4:237, 2020.

[HKP+17] Samuel B. Hopkins, Pravesh K. Kothari, Aaron Potechin, Prasad Raghavendra, Tselil Schramm, and David Steurer. The power of sum-of-squares for detecting hidden structures. In 58th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 720–731. IEEE, 2017.

[Hop18] Samuel Hopkins. Statistical Inference and the Sum of Squares Method. PhD thesis, Cornell University, 2018.

[HS17] Samuel B. Hopkins and David Steurer. Efficient Bayesian estimation from few samples: community detection and related problems. In 58th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 379–390. IEEE, 2017.

[HSS15] Samuel B. Hopkins, Jonathan Shi, and David Steurer. Tensor principal component analysis via sum-of-square proofs. In Conference on Learning Theory, pages 956–1006, 2015.

[HSSS16] Samuel B. Hopkins, Tselil Schramm, Jonathan Shi, and David Steurer. Fast spectral algorithms from sum-of-squares proofs: tensor decomposition and planted sparse vectors. In Proceedings of the Forty-Eighth Annual ACM Symposium on Theory of Computing, pages 178–191, 2016.

[Hsu02] Elton P. Hsu. Stochastic Analysis on Manifolds, volume 38 of Graduate Studies in Mathematics. American Mathematical Society, Providence, RI, 2002.

[Jag19] Aukosh Jagannath.
Dynamics of mean field spin glasses on short and long timescales. Journal of Mathematical Physics, 60(8):083305, 2019.

[Jan97] Svante Janson. Gaussian Hilbert Spaces, volume 129. Cambridge University Press, 1997.

[JS17] Aukosh Jagannath and Subhabrata Sen. On the unbalanced cut problem and the generalized Sherrington-Kirkpatrick model. arXiv preprint arXiv:1707.09042, 2017.

[JT17] Aukosh Jagannath and Ian Tobasco. Low temperature asymptotics of spherical mean field spin glasses. Comm. Math. Phys., 352(3):979–1017, 2017.

[KWB19] Dmitriy Kunisky, Alexander S. Wein, and Afonso S. Bandeira. Notes on computational hardness of hypothesis testing: Predictions using the low-degree likelihood ratio. arXiv preprint arXiv:1907.11636, 2019.

[LT11] Michel Ledoux and Michel Talagrand. Probability in Banach Spaces. Classics in Mathematics. Springer-Verlag, Berlin, 2011. Isoperimetry and processes, reprint of the 1991 edition.

[LW07] J. Lauer and N. C. Wormald. Large independent sets in regular graphs of large girth. Journal of Combinatorial Theory, Series B, 97:999–1009, 2007.

[MMZ05] M. Mézard, T. Mora, and R. Zecchina. Clustering of solutions in the random satisfiability problem. Physical Review Letters, 94(19):197205, 2005.

[Mon19] Andrea Montanari. Optimization of the Sherrington-Kirkpatrick Hamiltonian. In 60th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 1417–1433. IEEE, 2019.

[O'D14] Ryan O'Donnell. Analysis of Boolean Functions. Cambridge University Press, 2014.

[RM14] Emile Richard and Andrea Montanari. A statistical model for tensor PCA. In Advances in Neural Information Processing Systems, pages 2897–2905, 2014.

[RSS18] Prasad Raghavendra, Tselil Schramm, and David Steurer. High-dimensional estimation via sum-of-squares proofs. arXiv preprint arXiv:1807.11419, 2018.

[RV17] Mustazee Rahman and Balint Virag. Local algorithms for independent sets are half-optimal. The Annals of Probability, 45(3):1543–1577, 2017.

[Sub18] Eliran Subag. Following the ground-states of full-RSB spherical spin glasses.
arXiv preprint arXiv:1812.04588, 2018.

[SZ17] Eliran Subag and Ofer Zeitouni. The extremal process of critical points of the pure p-spin spherical spin glass model. Probability Theory and Related Fields, 168(3-4):773–820, 2017.

[Tes12] Gerald Teschl. Ordinary Differential Equations and Dynamical Systems, volume 140 of Graduate Studies in Mathematics. American Mathematical Society, Providence, RI, 2012.

[Var16] S. R. S. Varadhan. Large Deviations, volume 27 of Courant Lecture Notes in Mathematics. Courant Institute of Mathematical Sciences, New York; American Mathematical Society, Providence, RI, 2016.

[WEM19] Alexander S. Wein, Ahmed El Alaoui, and Cristopher Moore. The Kikuchi hierarchy and tensor PCA. In 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS)