Frozen 1-RSB structure of the symmetric Ising perceptron
Will Perkins ∗ Changji Xu † February 11, 2021
Abstract
We prove, under an assumption on the critical points of a real-valued function, that the symmetric Ising perceptron exhibits the 'frozen 1-RSB' structure conjectured by Krauth and Mézard in the physics literature; that is, typical solutions of the model lie in clusters of vanishing entropy density. Moreover, we prove this in a very strong form conjectured by Huang, Wong, and Kabashima: a typical solution of the model is isolated with high probability and the Hamming distance to all other solutions is linear in the dimension. The frozen 1-RSB scenario is part of a recent and intriguing explanation of the performance of learning algorithms by Baldassi, Ingrosso, Lucibello, Saglietti, and Zecchina. We prove this structural result by comparing the symmetric Ising perceptron model to a planted model and proving a comparison result between the two models. Our main technical tool towards this comparison is an inductive argument for the concentration of the logarithm of the number of solutions in the model.
The perceptron model is a simple model of a neural network storing random patterns. It has been studied in several fields including information theory [Cov65], statistical physics [Gar87, GD88, KM89, STS90], and probability theory [Tal10, KR98, Tal99]. There are several variants of the model, grouped into two main categories: spherical perceptrons, in which patterns are N-dimensional vectors on the unit sphere, and Ising perceptrons, in which patterns are vectors in {±1}^N. In each case, we want to understand how many random patterns can be 'stored' by a neural network formed by taking random synapses J_ij and applying a given activation function. Here we will study the Ising perceptron.

Let Σ_N = {±1}^N and let {X_i}_{i≥1} be a sequence of independent N-dimensional standard Gaussian vectors.¹ For a real-valued function φ, a real number κ, and X ∈ R^N, define

H_{φ,κ}(X) = { σ ∈ Σ_N : φ(⟨X, σ⟩/√N) ≤ κ }.

∗ Department of Mathematics, Statistics, and Computer Science, University of Illinois at Chicago, [email protected]. Supported in part by NSF grants DMS-1847451 and CCF-1934915.
† Center of Mathematical Sciences and Applications, Harvard University, [email protected].
¹ In another variant of the model the constraint vectors X_i are given by independent samples from Σ_N (Bernoulli disorder). While there are significant differences between the spherical and Ising perceptrons, the choice of Gaussian or Bernoulli disorder is insignificant for the properties discussed here.

The solution space of the Ising perceptron with Gaussian disorder, activation function φ, threshold κ, and m constraints is the random subset of Σ_N,

S = S_{φ,κ,N,m} = ⋂_{i=1}^{m} H_{φ,κ}(X_i).

Thus S is a random subset of the Hamming cube Σ_N. We call the vectors X_i constraint vectors. The constraints depend on the set of constraint vectors, the activation function φ, and the threshold κ.
The classic Ising perceptron corresponds to the choice φ(x) = x, where the most studied case is κ = 0 (e.g. [KM89, DS19]). We will be concerned with the typical structure of the solution space S as a function of φ, κ, and the constraint density α := m/N, as N → ∞.

The most basic structural question is whether S is empty or not. The capacity of the perceptron is defined as the random variable

M_{φ,κ}(N) = max{ m : S_{φ,κ,N,m} ≠ ∅ },

and the critical capacity density

α_{c,φ}(κ) = inf{ α : liminf_{N→∞} Pr[ S_{φ,κ,N,⌊αN⌋} = ∅ ] = 1 }

is the typical constraint density of the capacity. For densities below the critical capacity density, when the solution space is typically non-empty, we can ask about its structure, and how this structure affects the performance of learning algorithms: algorithms that find some solution in S given the instance defined by X = (X_1, ..., X_m).

Basic structural questions include whether solutions appear in connected clusters or are isolated, and what the typical distance is from a solution to the next nearest solution. For these structural properties we regard Σ_N as the Hamming cube endowed with Hamming distance: for σ, σ′ ∈ Σ_N,

dist(σ, σ′) = |{ i : σ_i ≠ σ′_i }| = (N − ⟨σ, σ′⟩)/2.

The perspective taken in recent work on Ising perceptrons in both statistical physics and computer science is to view the Ising perceptron as a random constraint satisfaction problem (CSP): Σ_N is the set of possible solutions, and each random vector X_i defines a constraint φ(⟨X_i, σ⟩/√N) ≤ κ on possible solutions σ ∈ Σ_N. The critical capacity density α_{c,φ} is then the satisfiability threshold of the model. Just as in the random k-SAT, random k-NAE-SAT, or random k-XOR-SAT problems, each constraint rules out a constant fraction of all solutions in Σ_N.
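For intuition, these definitions can be made concrete at small N by brute force. The sketch below is our own illustration (not from the paper; the parameter choices and helper names are arbitrary): it enumerates the solution space for both the classic activation φ(x) = x with κ = 0 and the symmetric activation φ(x) = |x|.

```python
import itertools
import math
import random

def solution_space(X, phi, kappa):
    """S = intersection of the H_{phi,kappa}(X_i): all sigma in {-1,+1}^N
    with phi(<X_i, sigma>/sqrt(N)) <= kappa for every constraint vector X_i."""
    N = len(X[0])
    S = []
    for sigma in itertools.product((-1, 1), repeat=N):
        if all(phi(sum(x * s for x, s in zip(Xi, sigma)) / math.sqrt(N)) <= kappa
               for Xi in X):
            S.append(sigma)
    return S

rng = random.Random(0)
N, m = 12, 6
# m i.i.d. standard Gaussian constraint vectors (Gaussian disorder)
X = [[rng.gauss(0, 1) for _ in range(N)] for _ in range(m)]
classic = solution_space(X, phi=lambda x: x, kappa=0.0)   # phi(x) = x, kappa = 0
symmetric = solution_space(X, phi=abs, kappa=1.0)          # phi(x) = |x|
print(len(classic), len(symmetric), "out of", 2 ** N)
```

Note that for the symmetric activation, σ is a solution if and only if −σ is, so the brute-force solution space always comes in antipodal pairs.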
Where the perceptron differs from these other models is that in the perceptron each constraint involves all of the N coordinates, while in the other models each constraint involves only k variables, where k is held constant as N → ∞.

Random CSP's have been studied extensively in computer science, statistical physics, probability, and combinatorics since Mitchell, Selman, and Levesque [MSL92] observed empirically that random k-SAT formulae at certain densities proved extremely challenging for widely used SAT solvers. Understanding this phenomenon is a major ongoing challenge that has led to the development of a new field of inquiry at the intersection of computer science, physics, and mathematics.

A key to the current understanding of random CSP's is understanding the typical structure of the solution space at different constraint densities. A beautifully detailed but non-rigorous picture was put forth by Krzakała, Montanari, Ricci-Tersenghi, Semerjian, and Zdeborová [KMRT+], describing, among other phenomena, clustering and freezing. Given a solution σ, a variable (or coordinate) is free if flipping that coordinate results in another solution σ′. Variables that are not free are frozen. The freezing threshold for a random CSP is the threshold at which typical solutions have a linear number of frozen variables [ZK07, Mol18].

These two properties, clustering and freezing, are conjectured to be the source of computational hardness in random CSP's.
Identifying thresholds for the onset of these and other structural phenomena, and rigorously connecting them to the performance of algorithms, have been a main focus of the field of random computational problems in the last decade.

For the Ising perceptron, the conjectured structural picture looks strikingly different. Krauth and Mézard [KM89] conjectured, by means of the replica method, that at all densities below the critical capacity, the solution space of the perceptron is dominated by clusters of vanishing entropy density (that is, of size e^{o(N)}), each cluster separated from the others by linear Hamming distance. Huang, Wong, and Kabashima [HWK13] and Huang and Kabashima [HK14] refined these conjectures and posited that in fact typical solutions in the Ising perceptron are completely frozen; that is, all coordinates are frozen and the solutions lie in clusters of size 1, separated from all other solutions by linear Hamming distance. Thus the Ising perceptron exhibits clustering and freezing in the strongest possible form throughout the entire satisfiability regime. Based on the current conjectural understanding of random CSP's (see e.g. [ZM08]), one might venture a guess that finding a solution in the Ising perceptron is computationally hard at all positive densities.

However, this theory is at odds with other work in physics on learning algorithms for the perceptron. Braunstein and Zecchina [BZ06] observed empirically that a simple message-passing algorithm is able to find solutions at positive densities in the Ising perceptron (further algorithms followed in [BBBZ07, Bal09]). Attempting to reconcile this apparent contradiction between the theory and empirical observations, Baldassi, Ingrosso, Lucibello, Saglietti, and Zecchina [BIL+
15] conjectured that these successful learning algorithms were in fact finding solutions belonging to rare clusters of positive entropy density. That is, although a 1 − o(1) fraction of solutions belong to isolated, frozen clusters, an exponentially small fraction of solutions belong to clusters that are exponentially large; strikingly, the authors observed that learning algorithms find solutions in these rare clusters. Specifically, the solutions that contribute the dominant portion of the partition function (number of solutions) and determine the equilibrium properties of the model are completely distinct from those solutions that efficient algorithms find. This work was followed by the proposal of several different algorithms to target these subdominant clusters [BIL+
16, BBC+]. The symmetric Ising perceptron, with activation function φ(x) = |x|, was studied as a model conjectured to exhibit the same structural and algorithmic properties, but more amenable to mathematical analysis. Baldassi, Della Vecchia, Lucibello, and Zecchina [BDVLZ20] confirmed that on the level of the physics predictions this model has the same qualitative behavior as the Ising perceptron with activation function φ(x) = x.

In summary, the Ising perceptron and its symmetric variants are conjectured to exhibit 'frozen 1-RSB behavior' at all positive densities below the critical capacity density. The current understanding of the link between clustering, freezing, and the performance of algorithms would suggest that finding a solution in these models is therefore intractable. This, however, is seemingly in contradiction with empirical observations, and one hypothesis suggests that learning algorithms find exponentially rare solutions with atypical structural properties.

Resolving these questions is a pressing problem since the hypothesis about subdominant clusters calls into question the link between the equilibrium properties of these models and algorithmic tractability. In this work, we take a first step in addressing this problem rigorously by establishing the frozen 1-RSB picture for the symmetric Ising perceptron (Theorem 1 below), under an assumption on the critical points of a real-valued function (Assumption 1).

There are few rigorous results on the Ising perceptron, and most are concerned with bounds on the critical capacity. For the classic Ising perceptron, Krauth and Mézard [KM89] predicted, using the replica method, that α_c(κ) = α_KM(κ) for a complicated but explicit function α_KM (with α_KM(0) ≈ 0.83). Ding and Sun [DS19] proved that α_c(κ) ≥ α_KM(κ) using a sophisticated form of the second-moment method guided by the Thouless–Anderson–Palmer (TAP) equations [TAP77].
Their result assumes a technical condition on a certain real-valued function, akin to Assumption 1 below.

Much of the technical difficulty of [DS19] comes from the asymmetry inherent in the activation function φ(x) = x; this necessitates a conditioning argument and the sophisticated second-moment calculation. On the other hand, Aubin, Perkins, and Zdeborová [APZ19] considered two symmetric activation functions: φ_r(x) = |x| and φ_u(x) = −|x|, which they called the rectangular and 'u' activation functions respectively. Studying symmetric constraints has a long history in the random CSP literature: the random k-NAE-SAT model is a symmetric variant of the random k-SAT model. While the qualitative properties of the two models are expected to be very similar, the symmetric model is often more amenable to rigorous analysis, and thus a clearer understanding can be obtained (see e.g. [DSS14, SSZ16, BSZ19] for recent work on the k-NAE-SAT model). Studying symmetric perceptrons allows us to prove stronger and more detailed results than are currently attainable for the classic perceptron, but the phenomena studied are expected to be universal.

Aubin, Perkins, and Zdeborová [APZ19] determine the critical capacity density for the symmetric perceptron with rectangular activation function:

α_{r,c}(κ) = −log 2 / log p(κ),

where p(κ) = Pr[|Z| ≤ κ] for a standard Gaussian random variable Z. This result, like that of [DS19], is contingent on an assumption about a certain real-valued function. This function will also prove useful in our work. Let H(β) = −β log β − (1−β) log(1−β) be the Shannon entropy function (all logarithms in this paper are base e) and let

q_κ(β) = Pr( |Z_1| ≤ κ, |Z_2| ≤ κ ),   (1)

where (Z_1, Z_2) is a jointly Gaussian vector with means 0, variances 1, and covariance 2β − 1.

Assumption 1 ([APZ19]). The function

F_α(β) = H(β) + α log q_κ(β)

has a single critical point for β ∈ (0, 1/2) whenever F″_α(1/2) < 0.
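The quantities above are straightforward to evaluate numerically. The following sketch is our own illustration (not part of the paper): p(κ) via the error function, the capacity formula α_{r,c}(κ) = −log 2 / log p(κ), and F_α(β) with q_κ(β) estimated by Monte Carlo.

```python
import math
import random

def p(kappa):
    """p(kappa) = Pr[|Z| <= kappa] for a standard Gaussian Z."""
    return math.erf(kappa / math.sqrt(2))

def q(kappa, beta, n=200000, seed=1):
    """Monte Carlo estimate of q_kappa(beta) = Pr(|Z1| <= kappa, |Z2| <= kappa),
    with (Z1, Z2) centered Gaussian, unit variances, covariance 2*beta - 1."""
    rho = 2 * beta - 1
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        z1 = rng.gauss(0, 1)
        z2 = rho * z1 + math.sqrt(1 - rho * rho) * rng.gauss(0, 1)
        if abs(z1) <= kappa and abs(z2) <= kappa:
            hits += 1
    return hits / n

def H(b):
    """Shannon entropy with natural logarithm."""
    return -b * math.log(b) - (1 - b) * math.log(1 - b)

def F(alpha, kappa, beta):
    """F_alpha(beta) = H(beta) + alpha * log q_kappa(beta)."""
    return H(beta) + alpha * math.log(q(kappa, beta))

kappa = 1.0
alpha_c = -math.log(2) / math.log(p(kappa))
print(round(p(kappa), 4), round(alpha_c, 3))  # p(1) ≈ 0.6827, alpha_c(1) ≈ 1.816
```

Two sanity checks built into the definitions: at β = 1/2 the pair (Z_1, Z_2) is independent, so q_κ(1/2) = p(κ)², while at β = 1 (covariance 1) q_κ(1) = p(κ).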
Remark 1.
We (and the authors of [APZ19]) have plotted F_α(β) for many choices of α and κ, and in all instances the function has the shape depicted in Figure 1, consistent with Assumption 1. We believe a proof of the assumption might be possible by adapting the methods of [AM02, Proof of Lemma 3].

Xu [Xu19] proved a general sharp threshold result (an analogue of Friedgut's sharp threshold result for random graphs and CSP's [Fri99]), which, combined with [APZ19], gives a sharp threshold for the existence of solutions: for any ε > 0,

Pr[ S_{φ_r,κ,N,(α_{r,c}+ε)N} = ∅ ] → 1 as N → ∞ and Pr[ S_{φ_r,κ,N,(α_{r,c}−ε)N} = ∅ ] → 0 as N → ∞.

Both statements hold also for the u-function in a range of κ values, and the results of [DS19, Xu19] prove that the second statement holds for the classic perceptron. Proving the matching upper bound on the critical capacity for the classic perceptron remains a challenging open problem.

Baldassi, Della Vecchia, Lucibello, and Zecchina [BDVLZ20] used the second-moment method to show the existence of pairs of solutions at arbitrary distances in the symmetric Ising perceptron.

1.3 Main results

We will study properties of typical solutions in the symmetric Ising perceptron (with the rectangular activation function φ(x) = |x|). We now specialize and simplify the notation from above. For X ∈ R^N and κ >
0, define

H_κ(X) := { σ ∈ Σ_N : |⟨X, σ⟩| ≤ κ√N }.

Let {X_i}_{i≥1} be a sequence of i.i.d. N-dimensional standard Gaussians, and define the solution space

S = S_α(N) = ⋂_{i=1}^{⌊αN⌋} H_κ(X_i).   (2)

The critical capacity density, determined in [APZ19], is α_c = α_c(κ) = −log 2 / log p(κ). The main result of this paper confirms the frozen 1-RSB scenario in the symmetric Ising perceptron: typical solutions are completely frozen with high probability for α < α_c. Let β_c = β_c(κ, α) be the unique

β ∈ (0, 1/2) such that F_α(β) − α log p(κ) = 0.   (3)

See Figure 1 for a depiction of β_c. We show in Lemma 5 that for α < α_c, β_c > 0.

Figure 1: F_α(β) − α log p(κ) plotted for three values of α at κ = 1. The dots mark β_c for the three values of α.

Theorem 1.
Let κ > 0 and α < α_c(κ), and let σ be uniformly sampled from S conditioned on the event S ≠ ∅. Under Assumption 1, for any δ ∈ (0, β_c),

{ σ′ ∈ S : dist(σ, σ′) ≤ (β_c − δ)N } = { σ }

with probability 1 − o(1) as N → ∞. In particular, σ is completely frozen with probability 1 − o(1).

Note that σ is selected according to two sources of randomness: the randomness of the perceptron instance X and the random choice of σ from S. The next result shows that the logarithm of the number of solutions in the rectangular Ising perceptron is tightly concentrated below the critical density.

Theorem 2.
Under Assumption 1, for α < α_c,

(1/N) log |S| = log 2 + α log p(κ) + O_P( log N / N )

as N → ∞, in probability. In particular, for α < α_c, S is non-empty with probability 1 − o(1).

The second statement of Theorem 2 proves that the symmetric Ising perceptron undergoes a sharp satisfiability phase transition at α_c, answering an open question from [APZ19] (where the complementary statement, that for α > α_c, S = ∅ with high probability, is proved). This could also be proved by adapting the sharp threshold result of [Xu19] to the symmetric perceptron.

We study the properties of a typical solution drawn from S by way of the planted model: the experiment of first selecting a uniformly random solution from Σ_N, then choosing a random configuration of constraints consistent with this solution. Planted models have been studied extensively in the random CSP literature and beyond. They are used as a toy model for statistical inference: e.g. the 'teacher–student model' [ZK16] or the stochastic block model [Abb17]. They are used to understand the condensation threshold in random CSP's [COZ12, BCOH+
16, COKPZ18, COEJ+18]. Fix N, m ∈ N and κ >
0, and following [ACO08] we define two probability distributions on pairs (σ*, X) ∈ Σ_N × (R^N)^m of solutions and configurations of m constraint vectors.

In the random model we:
1. Sample m i.i.d. N-dimensional standard Gaussian constraint vectors X = (X_1, ..., X_m), conditioned on the event that S(X) = ⋂_{i=1}^m H_κ(X_i) ≠ ∅.
2. Sample σ* uniformly at random from S.

We denote the law of the random model by P_r, E_r to distinguish it from both the unconditional perceptron model and the planted model below. The random model is simply the experiment of selecting a uniformly random solution from the symmetric Ising perceptron conditioned on satisfiability.

In the planted model we:
1. Sample σ* uniformly at random from Σ_N.
2. Sample a configuration of m i.i.d. constraint vectors X = (X_1, ..., X_m), with each X_i distributed as a standard N-dimensional Gaussian vector conditioned on the event that σ* ∈ H_κ(X_i).

We denote the law of the planted model by P_pl, E_pl.

The key to using the planted model to understand the original model is to show that at low enough constraint densities, the two distributions on (σ*, X) are close. Proving that the distributions are close, as we do below in Lemma 19, amounts to proving that the number of solutions, |S|, is typically not too far from its expectation, E|S|. The better concentration of |S| one can prove, the more one can deduce about the original model from the planted model. In [ACO08] it is shown (in the case of q-colorings of a random graph) that if log |S| = log E|S| + o(N), then events that occur with probability at most exp(−Θ(N)) in the planted model occur with probability o(1) as N → ∞ in the random model. This notion of closeness is 'quiet planting' [KZ09], and it suffices to prove some structural results on the solution space such as clustering [ACO08].
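The planted experiment is easy to simulate: each constraint vector, conditioned on σ* ∈ H_κ(X_i), can be drawn by rejection sampling (a minimal sketch, our own illustration with arbitrary parameters; the helper name is ours):

```python
import math
import random

def sample_planted(N, m, kappa, rng):
    """Planted model: sigma* uniform on {-1,+1}^N; each X_i is a standard
    N-dimensional Gaussian conditioned on sigma* in H_kappa(X_i), drawn
    here by simple rejection sampling."""
    sigma_star = [rng.choice((-1, 1)) for _ in range(N)]
    thr = kappa * math.sqrt(N)
    X = []
    while len(X) < m:
        Xi = [rng.gauss(0, 1) for _ in range(N)]
        if abs(sum(x * s for x, s in zip(Xi, sigma_star))) <= thr:
            X.append(Xi)  # accept: the planted sigma* satisfies this constraint
    return sigma_star, X

rng = random.Random(0)
N, m, kappa = 50, 30, 1.0
sigma_star, X = sample_planted(N, m, kappa, rng)
thr = kappa * math.sqrt(N)
print(all(abs(sum(x * s for x, s in zip(Xi, sigma_star))) <= thr for Xi in X))
```

Each proposal is accepted with probability p(κ), so the rejection step is cheap for moderate κ; by construction the planted vector satisfies every constraint.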
On the other hand, much stronger notions of closeness have been proved: 'silent planting' [BCOE17] implies the two distributions are mutually contiguous: any event with probability o(1) in the planted model has probability o(1) in the random model. This has been used to prove stronger structural results [Mol18]. Proving contiguity requires much stronger concentration of log |S|. A very general result on the contiguity of the planted and random models for symmetric CSP's [COEJ+
18] involves a rigorous implementation of the cavity method and the small subgraph conditioning method. In the setting of the perceptron, neither of these tools exists, and so we must prove concentration via another route.

We prove Theorem 1 in three steps. In Section 2, we prove that the planted solution is isolated and the next nearest solution is at linear Hamming distance with high probability in the planted model (Lemma 6). In Section 3 we prove Theorem 2, showing that for α < α_c, the logarithm of the number of solutions in the random model is concentrated around the logarithm of the expected number of solutions. To the best of our knowledge, our approach to proving concentration in this way is new, and we expect the technique to have further applications. In Section 4 we transfer our results about the planted model to the random model by showing that events that occur with probability at most N^{−ω(1)} in the planted model occur with probability o(1) in the standard model (Lemma 19). This relies on the concentration properties of the logarithm of the number of solutions.

Both Theorem 1 and Theorem 2 can be extended verbatim to the u-function Ising perceptron studied in [APZ19], for κ in the range for which the second-moment method works there. The main open problem in this area is to resolve the conceptual dilemma described in Section 1.1 and answer the questions raised in [BIL+15].

Question 3.
Is there a polynomial-time algorithm that, with probability 1 − o(1), finds a solution to the symmetric Ising perceptron for some density α ∈ (0, α_c)?

Alternatively, one could leverage the structural results we have proved here and rule out some class of learning algorithms. We leave the following additional open problems for future work.

1. Prove that the classic perceptron with activation function φ(x) = x exhibits the frozen 1-RSB property. It is not at all clear how to extend the method of this paper to this case. As discussed in [COEJ+
18, Sec. 2.4], for asymmetric random CSP's, like the random k-SAT model, the planted model, at least in its straightforward implementation, is not useful to compare to the random model (in particular it is not contiguous with the random model at any positive density). We expect the same for the classic perceptron, and so our strategy of arguing via the planted model will not work.

2. Prove full contiguity between the random and planted models for α < α_c. Our comparison result (Lemma 19) suffices for our purposes here, but it is natural to ask for more (as is the case for a large class of symmetric random CSP's [COEJ+18]).

Conjecture 4.
For α < α_c, the random and planted models of the symmetric Ising perceptron are mutually contiguous. That is, if P_pl(A) = o(1) then P_r(A) = o(1), and vice versa.

Consider the planted model with planted solution σ* and constraints X = (X_1, X_2, ..., X_m). Define S(X) = ⋂_{i=1}^{⌊αN⌋} H_κ(X_i) as in (2). Recall the definition of β_c from (3). We show that β_c exists and is unique for α ∈ (0, α_c).

Lemma 5.
Under Assumption 1, for α ∈ (0, α_c), there exists a unique β ∈ (0, 1/2) so that F_α(β) − α log p(κ) = 0. Also, for any δ ∈ (0, β_c/2),

sup_{δ < β < β_c − δ} F_α(β) − α log p(κ) < 0.   (4)

Proof. Let G(β) = F_α(β) − α log p(κ). Then since q_κ(0) = p(κ) and H(0) = 0, we have G(0) = 0. As observed in [APZ19], G′(0) = −∞, and so for some ε > 0 we have G(x) < 0 for x ∈ (0, ε). Moreover, since α < α_c, G(1/2) > 0, and so by continuity there exists β ∈ (0, 1/2) with G(β) = 0. By Assumption 1 this β is unique. In addition, Assumption 1 implies that G(β) is first strictly decreasing and then strictly increasing on β ∈ (0, 1/2), and (4) follows.

Lemma 6.
Under Assumption 1, for any α ∈ (0, α_c) and any δ ∈ (0, β_c), there exists a constant c_δ > 0 such that

P_pl( { σ ∈ S : dist(σ, σ*) ≤ (β_c − δ)N } ≠ { σ* } ) ≤ exp{ −c_δ √N }.

Let q(m) = P(σ, σ′ ∈ H_κ(X)), where σ, σ′ ∈ Σ_N are two arbitrary vectors with ⟨σ, σ′⟩ = m and X is a standard N-dimensional Gaussian vector. Then

q(m) = q_κ( 1/2 + m/(2N) ),   (5)

where q_κ is defined in (1). Lemma 6 will follow from the following results.

Lemma 7.
There exist constants ε, c > 0, with ε sufficiently small, such that for all 1 ≤ m ≤ εN,

log E_pl[ |{ σ ∈ S : |⟨σ, σ*⟩| = N − 2m }| ] ≤ −c √(mN).

Proof.
Let σ_0 be an arbitrary vector in Σ_N. Note that

E_pl[ |{ σ ∈ S : ⟨σ, σ*⟩ = N − 2m }| ] = Σ_{σ : ⟨σ, σ_0⟩ = N−2m} P( σ ∈ S | σ_0 ∈ S ) = (N choose m) ( q(N − 2m) / p(κ) )^{αN}.   (6)

(The contribution of overlap −(N − 2m) is identical, at the cost of a factor 2.) We claim that, uniformly over all m ≤ εN with ε sufficiently small,

q(N − 2m) ≤ p(κ) − c √(m/N).   (7)

Provided with (7), we have

(6) ≤ exp{ m log(2N/m) − cα√(mN)/p(κ) } ≤ exp{ −c′ √(mN) },

since m log(2N/m) = o(√(mN)) uniformly over m ≤ εN as ε → 0. Hence Lemma 7 follows. Now it remains to prove (7). To this end, let 1 be the all 1's vector of length N and 1_m be an N-dimensional vector with the first (N − m) coordinates +1 and the remaining m coordinates −
1. We write

p(κ) − q(N − 2m) = P( |⟨1, X⟩| ≤ κ√N, |⟨1_m, X⟩| > κ√N )
≥ P( Σ_{i=1}^{N−m} X_i ∈ (κ√N, κ√N + √m), Σ_{i=N−m+1}^{N} X_i ∈ (−2√m, −√m) )
= P( Z_1 ∈ ( κ√(N/(N−m)), κ√(N/(N−m)) + √(m/(N−m)) ), Z_2 ∈ (−2, −1) )
≥ c √(m/N),

where Z_1, Z_2 are two independent standard Gaussian random variables. This proves (7).

Lemma 8.
For any ε > 0, uniformly in |m| ≤ N − εN,

lim_{N→∞} | (1/N) log E_pl[ |{ σ ∈ S : ⟨σ, σ*⟩ = m }| ] − [ F_α( 1/2 + m/(2N) ) − α log p(κ) ] | = 0.   (8)

Proof. Note that

E_pl[ |{ σ ∈ S : ⟨σ, σ*⟩ = m }| ] = (N choose (N+m)/2) · ( q(m) / p(κ) )^{αN},

and that for any ε > 0, uniformly in εN ≤ k ≤ N − εN,

(1/N) log (N choose k) / H(k/N) → 1 as N → ∞.

Combining these with (5) yields (8).

Proof of Lemma 6.
Let δ ∈ (0, β_c). It suffices to show that

E_pl |{ σ ∈ S : 0 < dist(σ, σ*) ≤ (β_c − δ)N }| ≤ exp{ −c_δ √N }.

This bound follows from Lemmas 7 and 8 and (4): Lemma 7 controls the expected number of solutions at small Hamming distance from σ*, while Lemma 8 together with (4) shows that the expected number at any linear distance below (β_c − δ)N is exponentially small.
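The √(m/N) lower bound on the overlap-probability gap in (7) is easy to check numerically. The sketch below is our own illustration (parameters arbitrary); it uses that ⟨1, X⟩ = A + B and ⟨1_m, X⟩ = A − B, where A and B are the head and tail sums of X, so that A ~ N(0, N−m) and B ~ N(0, m) are independent.

```python
import math
import random

def gap_estimate(N, m, kappa, n=200000, seed=3):
    """Monte Carlo estimate of the gap p(kappa) - q = P(the all-ones vector
    satisfies the constraint while 1_m, which flips m coordinates, violates it).
    A, B are the head/tail sums of X: <1,X> = A + B and <1_m,X> = A - B."""
    rng = random.Random(seed)
    thr = kappa * math.sqrt(N)
    bad = 0
    for _ in range(n):
        a = rng.gauss(0, math.sqrt(N - m))  # A ~ N(0, N - m)
        b = rng.gauss(0, math.sqrt(m))      # B ~ N(0, m), independent of A
        if abs(a + b) <= thr and abs(a - b) > thr:
            bad += 1
    return bad / n

N, kappa = 400, 1.0
for m in (4, 16, 64):
    print(m, round(gap_estimate(N, m, kappa), 4), round(math.sqrt(m / N), 3))
```

In line with (7), the estimated gap grows with the number of flipped coordinates m, at a rate comparable to √(m/N).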
Fixing N, consider the symmetric perceptron as a discrete-time stochastic process with one constraint vector added at each time step. The solution space at time t ∈ N is defined as

S_t := ⋂_{i=1}^{t} H_κ(X_i),

which is the intersection of t random rectangles. The following strengthening of Theorem 2 is the main result of this section.

Theorem 9.
Under Assumption 1, for every ε > 0 there exists M = M(ε) such that for any α < α_c,

limsup_{N→∞} sup_{0 ≤ t ≤ αN} P( | log( |S_t| / E[|S_t|] ) | ≥ M log N ) ≤ ε.   (9)

Theorem 2 follows immediately, since (1/N) log E[|S_t|] = log 2 + (t/N) log p(κ). Theorem 9 says that the cardinality of the solution space deviates only slightly from its expectation after adding αN random constraints. To prove this theorem, we will look at the change in this deviation at each time a new constraint is added. Write

Q_t := log( |S_t| / E[|S_t|] ) = Σ_{i=1}^{t} log( (|S_i| / |S_{i−1}|) / (E[|S_i|] / E[|S_{i−1}|]) ).

Note that E[|S_i|] / E[|S_{i−1}|] = p(κ). Let

Y_t := (1/p(κ)) ( |S_t| / |S_{t−1}| − p(κ) ),
so that

Q_t = Σ_{i=1}^{t} log(1 + Y_i).   (10)

Since 0 ≤ |S_t| ≤ |S_{t−1}|, we have −1 ≤ Y_t ≤ (1 − p(κ))/p(κ); however, we expect the Y_t's to be very close to zero with high probability. Hence, by a Taylor expansion, as N → ∞,

Q_t = Σ_{i=1}^{t} Y_i − (1/2 + o(1)) Σ_{i=1}^{t} Y_i².

In fact, we will prove in Lemma 11 that Y_t is roughly of order N^{−1/2} provided that S_{t−1} is 'regular' (see Definition 10). Thus if the S_t's are all regular, we expect the second term Σ_{i=1}^{αN} Y_i² to be of order O(1). In addition, notice that (Σ_{i=0}^{t} Y_i)_{t≥0} is a martingale with respect to the filtration

F_t = σ(S_1, S_2, ..., S_t), t ≥ 1.

Hence if the S_t's are all regular, we also expect the first term Σ_{i=1}^{αN} Y_i to be of order O(1), and hence Q_t = O(1).

Definition 10.
For each t ≥ 0, we let (σ_i^{(t)})_{i≥1} be independent uniform random samples from S_t and denote P_t(·) := P(· | F_t), E_t[·] := E[· | F_t]. We say S_t is regular if

P_t( |⟨σ_1^{(t)}, σ_2^{(t)}⟩| ≤ C √N · √(|Q_t| + log N) ) ≥ 1 − N^{−1}.

Roughly speaking, S_t is regular if two random samples from S_t are almost orthogonal with high probability. Define the stopping time

τ_S := inf{ t ≥ 0 : S_t is not regular }.

The following lemma says that for regular S_t, Y_{t+1} is roughly of order N^{−1/2}.

Lemma 11.
There exists a constant C > 0 such that for all t ≥ 0,

1_{τ_S > t} · P_t( |Y_{t+1}| ≥ C √((|Q_t| + log N)/N) · x ) ≤ C exp(−x).

By (10), Lemma 11 provides an upper bound on Y_{t+1}, which can be used to control the increment of |Q_t|. This will be one of the key ingredients in proving Lemma 12, which gives an upper bound on |Q_t| for times t before a stopping time defined below. With C > 0 as in Lemma 11, define

τ_Y := inf{ t ≥ 1 : |Y_t| ≥ C √((|Q_{t−1}| + log N)/N) · log N },
τ_Q := inf{ t ≥ 1 : |Q_t| ≥ (log N)² },   (11)

and

τ = τ_S ∧ τ_Y ∧ τ_Q.   (12)

Lemma 12. There exists a constant
C > 0 such that for all t ≥ 1,

E[ |Q_{t∧τ−1}| ] ≤ exp(Ct/N) log N.   (13)

Lemma 13 says that the stopping time τ will occur later than time αN with high probability. Theorem 9 will be a direct consequence of Lemmas 12 and 13.

Lemma 13.
Under Assumption 1, for any α < α_c,

lim_{N→∞} P( τ < αN ) = 0.

Proof of Theorem 9.
By Lemma 12, E[|Q_{t∧τ−1}|] = O(log N) uniformly in t ≤ αN, so |Q_{t∧τ−1}| = O_P(log N) by Markov's inequality; by Lemma 13, τ > αN with high probability. Together these give the probability bound (9).
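The constraint-by-constraint process is easy to simulate by brute force for small N. The sketch below is our own illustration (arbitrary small parameters, not part of the proof); it tracks the deviation Q_t = log(|S_t| / E|S_t|) as constraints are added one at a time.

```python
import itertools
import math
import random

def run_process(N, kappa, T, seed):
    """Add constraints one at a time and record Q_t = log(|S_t| / E|S_t|),
    where E|S_t| = 2^N * p(kappa)^t. Brute force over {-1,+1}^N, so N must
    stay small; stops early if the solution space becomes empty."""
    rng = random.Random(seed)
    p = math.erf(kappa / math.sqrt(2))  # p(kappa) = Pr[|Z| <= kappa]
    thr = kappa * math.sqrt(N)
    S = list(itertools.product((-1, 1), repeat=N))
    Q = []
    for t in range(1, T + 1):
        X = [rng.gauss(0, 1) for _ in range(N)]
        S = [s for s in S if abs(sum(x * si for x, si in zip(X, s))) <= thr]
        if not S:
            break
        Q.append(math.log(len(S)) - (N * math.log(2) + t * math.log(p)))
    return Q

Q = run_process(N=14, kappa=1.0, T=10, seed=7)
print([round(q, 2) for q in Q])
```

At these tiny sizes the recorded deviations are deterministically bounded (1 ≤ |S_t| ≤ 2^N forces |Q_t| < N log 2 + T |log p(κ)|), and in typical runs they stay small, in the spirit of (9).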
In this section we consider an arbitrary subset A ⊂ Σ_N and define the probability measure P_A under which (σ_i)_{i≥1} are independent uniformly random vectors in A. Lemma 11 follows from the following lemma.

Lemma 14.
There exist constants C, c > 0, depending only on κ, such that for all N sufficiently large, all A ⊂ Σ_N, and φ_N ∈ R satisfying

δ_A := P_A( |⟨σ_1, σ_2⟩| > φ_N √N ) ≤ φ_N^{−1} N^{−1},   (14)

we have

P( | |H_κ(X) ∩ A| / |A| − p(κ) | > φ_N x / √N ) ≤ C exp(−cx),

where X is a standard N-dimensional Gaussian vector.

The rest of this section is devoted to proving Lemma 14.
Lemma 15.
For all N sufficiently large and every B ⊂ A ⊂ Σ_N, if we denote δ_A = P_A( |⟨σ_1, σ_2⟩| > φ_N √N ) and q = P_A( σ_1 ∉ B ), then for all 1 ≤ k ≤ q²/(4δ_A),

P_A( E_k(B) ) ≥ ( q − 2kδ_A/q )^k,

where

E_k(B) := { σ_i ∉ B, ∀ 1 ≤ i ≤ k } ∩ { |⟨σ_i, σ_j⟩| ≤ φ_N √N ∀ 1 ≤ i, j ≤ k, i ≠ j }.   (15)

Proof.
We will prove the following by induction:

P_A( E_{t+1} | E_t ) ≥ q − 2tδ_A/q for 1 ≤ t ≤ q²/(4δ_A);   (16)

the lemma then follows upon multiplying these bounds together with P_A(E_1) = q. First, since P_A(E_1) = P_A(σ_1 ∉ B) = q, we have P_A( |⟨σ_1, σ_2⟩| > φ_N √N | E_1 ) ≤ δ_A/q, and so (16) holds for t = 1. Next, we suppose (16) is true for t = k with k ≤ q²/(4δ_A) −
1. Since σ_{k+1} is independent of E_k, we have

P_A( E_{k+1} | E_k ) ≥ P_A( σ_{k+1} ∉ B | E_k ) − Σ_{i=1}^{k} P_A( |⟨σ_i, σ_{k+1}⟩| > φ_N √N | E_k ) = q − k · P_A( |⟨σ_k, σ_{k+1}⟩| > φ_N √N | E_k ).

Also, since (σ_{k+1}, σ_k) is independent of E_{k−1} and E_{k−1} ⊃ E_k,

P_A( |⟨σ_k, σ_{k+1}⟩| > φ_N √N | E_k ) ≤ P_A( { |⟨σ_k, σ_{k+1}⟩| > φ_N √N } ∩ E_{k−1} ) / P_A( E_k ) = P_A( |⟨σ_1, σ_{k+1}⟩| > φ_N √N ) · P_A( E_{k−1} ) / P_A( E_k ) = δ_A · P_A( E_{k−1} ) / P_A( E_k ).

By the induction hypothesis,

P_A( E_{k−1} ) / P_A( E_k ) = 1 / P_A( E_k | E_{k−1} ) ≤ 2/q,

where we used the assumption k ≤ q²/(4δ_A) −
1. Combining with the previous two inequalities, we get

P_A( E_{k+1} | E_k ) ≥ q − 2kδ_A/q.

Hence (16) holds.
Lemma 16.
For any κ > and m ∈ N , there exists constants c κ , C > such that if we let ( X i ) i ≥ be a Gaussian vector with E [ X i ] = 0 , E [ X i ] = 1 , | E [ X i X j ] | ≤ c κ m − , then P ( min ≤ i ≤ m | X i | > κ ) ≤ C P (cid:0) | X | > κ (cid:1) m , (17) P ( max ≤ i ≤ m | X i | ≤ κ ) ≤ C P (cid:0) | X | ≤ κ (cid:1) m . Proof.
We first prove (17). Let Σ = (Σ ij ) = ( E [ X i X j ]) ∈ R m × m , then P (cid:18) min ≤ i ≤ m | X i | > κ (cid:19) = (cid:115) π ) m | Σ | (cid:90) max ≤ i ≤ m | x i | >κ exp (cid:26) − x T (Σ − − I ) x + x T x (cid:27) d x = | Σ | − E (cid:20) exp (cid:26) − Z T (Σ − − I ) Z (cid:27) ; min ≤ i ≤ m | Z i | > κ (cid:21) where Z = ( Z , ..., Z m ) is a standard Gaussian vector. Since ( Z , ..., Z m ) and ( e Z , ..., e m Z m )has the same distribution for every e i ∈ {− , } , we have P (cid:18) min ≤ i ≤ m | X i | > κ (cid:19) = | Σ | − E (cid:88) ( e i ) mi =1 ∈{− , } m exp − (cid:88) ≤ i,j ≤ m g ij e i e j ; min ≤ i ≤ m | Z i | > κ g ij = δ ij Z i Z j and δ ij is the ( i, j )-th element of the matrix Σ − − I . Note that Σ ii = 1and | Σ ij | ≤ c κ m − for i (cid:54) = j impliesmax i,j | δ ij | ≤ Cc κ m − , and || Σ − − I || ≤ Cc κ . (18)Hence (cid:88) i g ii ≤ Cc κ m − | Z | , and (cid:88) i,j g ij ≤ Cc κ m − | Z | . (19)Now, we first note that by (18), for any q > E (cid:20) exp (cid:26) − Z T (Σ − − I ) Z (cid:27) ; | Z | ≥ qm (cid:21) ≤ E (cid:20) exp (cid:26) Cc κ | Z | (cid:27) ; | Z | ≥ qm (cid:21) = (1 − Cc κ ) m P ( | Z | ≥ (1 − Cc κ ) qm ) . Since the tail probability of a Chi-square random variable decays exponentially, we see thatfor c κ sufficiently small and q sufficiently large, E (cid:20) exp (cid:26) − Z T (Σ − − I ) Z (cid:27) ; | Z | ≥ qm (cid:21) ≤ P (cid:0) | Z | > κ (cid:1) m . (20)Next, by [HW71], we have that if | (cid:80) i g ii | and (cid:80) i,j g ij are sufficiently small, then (cid:88) ( e i ) mi =1 ∈{− , } m exp − (cid:88) ≤ i,j ≤ m g ij e i e j ≤ C .
But we see from (19) that if $c_\kappa$ is sufficiently small, then on the event $\{|Z|^2\le qm\}$ the condition that $|\sum_i g_{ii}|$ and $\sum_{i,j}g_{ij}^2$ be sufficiently small holds. Combining with (20), we see that
$$\mathbb E\Big[2^{-m}\sum_{(e_i)_{i=1}^m\in\{-1,1\}^m}\exp\Big\{-\frac12\sum_{1\le i,j\le m}g_{ij}e_ie_j\Big\};\ \min_{1\le i\le m}|Z_i|>\kappa\Big]\le C\,\mathbb P\big(|Z_1|>\kappa\big)^m.$$
Finally, [Ost38] gives $|\Sigma|\ge 1-Cc_\kappa$. This completes the proof of (17).

The second inequality can be proved verbatim by changing all occurrences of $\min_{1\le i\le m}|X_i|>\kappa$ and $\min_{1\le i\le m}|Z_i|>\kappa$ to $\max_{1\le i\le m}|X_i|\le\kappa$ and $\max_{1\le i\le m}|Z_i|\le\kappa$. In fact, it is even easier, since $\max_{1\le i\le m}|Z_i|\le\kappa$ implies $|Z|^2\le\kappa^2 m$, and thus there is no need to consider the case $\{|Z|^2\ge qm\}$.

Proof of Lemma 14.
We will prove that
$$\mathbb P\Big(\frac{|H_\kappa(X)\cap A|}{|A|}<p(\kappa)-\frac{\varphi_N x}{\sqrt N}\Big)\le C\exp(-cx). \qquad(21)$$
The other direction can be proved similarly. Recall $E_k(\cdot)$ as in (15) and denote
$$A_k^\perp:=\big\{(\sigma_i)_{i=1}^k\in A^k:\ |\langle\sigma_i,\sigma_j\rangle|\le\varphi_N\sqrt N\ \ \forall\,1\le i,j\le k,\ i\ne j\big\}.$$
Then
$$\mathbb E\big[\mathbb P_A(E_k(H_\kappa(X)))\big]=\sum_{(\sigma_i)_{i=1}^k\in A_k^\perp}|A|^{-k}\,\mathbb P\big(\sigma_i\notin H_\kappa(X)\ \forall\,1\le i\le k\big).$$
By Lemma 16, we have that for $k=\lfloor c_\kappa\varphi_N^{-1}\sqrt N\rfloor$ and all $(\sigma_i)_{i=1}^k\in A_k^\perp$,
$$\mathbb P\big(\sigma_i\notin H_\kappa(X)\ \forall\,1\le i\le k\big)\le C\,\mathbb P(|Z|>\kappa)^k.$$
Hence
$$\mathbb E\big[\mathbb P_A(E_k(H_\kappa(X)))\big]\le C\,\mathbb P(|Z|>\kappa)^k. \qquad(22)$$
On the other hand, we set $\Delta=\varphi_N x/\sqrt N$.
Then (14) implies $\Delta\ge k\delta_A q$. Therefore, it follows from Lemma 15 that on the event $\big\{\mathbb P_A\big(\sigma\notin H_\kappa(X)\mid H_\kappa(X)\big)\ge\mathbb P(|Z|>\kappa)+\Delta\big\}$, we have
$$\mathbb P_A\big(E_k[H_\kappa(X)]\mid H_\kappa(X)\big)\ge\Big(\mathbb P_A\big(\sigma\notin H_\kappa(X)\mid H_\kappa(X)\big)-\Delta/2\Big)^k\ge\mathbb P(|Z|>\kappa)^k\,e^{k\Delta/2}.$$
Combined with (22), this yields (21).
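Unpacking the last comparison may be helpful; the following sketch assumes the reconstructed choices $k=\lfloor c_\kappa\varphi_N^{-1}\sqrt N\rfloor$ and $\Delta=\varphi_N x/\sqrt N$ from above.

```latex
% Let B denote the event in (21), on which
% P_A(σ ∉ H_κ(X) | H_κ(X)) ≥ P(|Z|>κ) + Δ. Taking expectations of the
% lower bound above and comparing with the upper bound (22):
\mathbb{E}\big[\mathbb{P}_A(E_k(H_\kappa(X)))\big]
  \;\ge\; \mathbb{P}(B)\,\mathbb{P}(|Z|>\kappa)^k e^{k\Delta/2},
\qquad
\mathbb{E}\big[\mathbb{P}_A(E_k(H_\kappa(X)))\big]\le C\,\mathbb{P}(|Z|>\kappa)^k .
% Dividing, P(B) ≤ C e^{-kΔ/2}, and for large N,
% kΔ ≥ (c_κ φ_N^{-1}√N − 1)·φ_N x/√N ≥ c_κ x/2,
% which is the exponential bound C e^{-cx} claimed in (21).
```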
By the definitions (11) and (12), we see that for all $0\le i\le\tau-1$,
$$|Y_i|\le C\sqrt{\frac{|Q_{i-1}|+\log N}{N}}\le C\sqrt{\frac{(\log N)^2+\log N}{N}}.$$
Hence it follows from $|\log(1+x)-x|\le x^2$ for all $x\ge-1/2$ that
$$\Big|\sum_{i=1}^{t\wedge\tau-1}\log(1+Y_i)\Big|\le\Big|\sum_{i=1}^{t\wedge\tau-1}Y_i\Big|+\sum_{i=1}^{t\wedge\tau-1}Y_i^2\le\Big|\sum_{i=1}^{t\wedge\tau}Y_i\Big|+\sum_{i=1}^{t\wedge\tau}Y_i^2+|Y_{t\wedge\tau}|.$$
Since $\tau\le\tau_S$, Lemma 11 implies
$$\mathbb E\big[Y_i^2\,\mathbf 1_{\tau\ge i}\big]=\mathbb E\big[\mathbb E_{i-1}[Y_i^2]\,\mathbf 1_{\tau>i-1}\big]\le C'\,\frac{\mathbb E\big[|Q_{i-1}|\,\mathbf 1_{\tau>i-1}\big]+\log N}{N}.$$
Therefore, since $\big(\sum_{i=1}^{t\wedge\tau}Y_i\big)_{t\ge1}$ is a martingale and $t\wedge\tau\le\tau\le\tau_S$, we have
$$\mathbb E\Big|\sum_{i=1}^{t\wedge\tau-1}\log(1+Y_i)\Big|\le\sqrt{\mathbb E\sum_{i=1}^{t\wedge\tau}Y_i^2}+\mathbb E\sum_{i=1}^{t\wedge\tau}Y_i^2+\mathbb E|Y_{t\wedge\tau}|\le 2\,\mathbb E\sum_{i=1}^{t}Y_i^2\,\mathbf 1_{\tau\ge i}+1+\mathbb E|Y_{t\wedge\tau}|\le C'\cdot\frac{t\log N+\sum_{i=1}^{t}\mathbb E[|Q_{i\wedge\tau-1}|]}{N}+\frac{2}{p(\kappa)},$$
where $C'$ is a constant. Hence
$$\mathbb E\big[|Q_{t\wedge\tau-1}|\big]\le C'\Big(\frac{t\log N+\sum_{i=1}^{t-1}\mathbb E[|Q_{i\wedge\tau-1}|]}{N}+\frac{1}{p(\kappa)}\Big).$$
We claim that this yields (13). To prove this, we define $b_1:=\mathbb E[|Q_{1\wedge\tau-1}|]$ and, for $t\ge2$,
$$b_t:=3C'\Big(\frac{t\log N+\sum_{i=1}^{t-1}b_i}{N}+\frac{1}{p(\kappa)}\Big). \qquad(23)$$
Then a straightforward induction argument shows that
$$b_t\ge\mathbb E\big[|Q_{t\wedge\tau-1}|\big]\quad\text{for all }t\ge1. \qquad(24)$$
On the other hand, (23) implies
$$b_t=3C'\Big(\frac{\log N+b_{t-1}}{N}+\frac{(t-1)\log N+\sum_{i=1}^{t-2}b_i}{N}+\frac{1}{p(\kappa)}\Big)=3C'\,\frac{\log N+b_{t-1}}{N}+b_{t-1}.$$
This implies
$$b_t+\log N=\Big(1+\frac{3C'}{N}\Big)(b_{t-1}+\log N)=\Big(1+\frac{3C'}{N}\Big)^{t-1}(b_1+\log N). \qquad(25)$$
Note that $S_0=\{-1,1\}^N$, $Q_0=0$ and $\mathbb P(\tau\ge1)=1$, so $b_1=0$; in particular, for $t\le\alpha N$, (25) gives $b_t\le(e^{3C'\alpha}-1)\log N\le C\log N$. Hence (13) follows from (24) and (25). This completes the proof of Lemma 12.

Lemma 17.
Under Assumption 1, there exists a constant $C$ such that for every $1\le t\le\alpha N$,
$$\mathbb P\Big(\mathbb P_t\Big(\big|\langle\sigma_1^{(t)},\sigma_2^{(t)}\rangle\big|\ge C\sqrt N\sqrt{Q_t+\log N}\Big)\ge N^{-2}\Big)\le N^{-2}.$$
As a result,
$$\mathbb P(\tau_S\le\alpha N)\le N^{-1}. \qquad(26)$$

Proof.
Since there exist constants $c>0$ and $\lambda_0>0$ such that for all $\lambda\ge\lambda_0$ and $t\ge0$,
$$\mathbb E\big[\big|\{(\sigma_1,\sigma_2)\in S_t^2:\ |\langle\sigma_1,\sigma_2\rangle|\ge\lambda\sqrt N\}\big|\big]\le\exp(-c\lambda^2)\,\big(\mathbb E[|S_t|]\big)^2,$$
we have that
$$\mathbb E\Big[\mathbb P_t\big(|\langle\sigma_1^{(t)},\sigma_2^{(t)}\rangle|\ge\lambda\sqrt N\big)\cdot\frac{|S_t|^2}{(\mathbb E[|S_t|])^2}\Big]\le C\exp(-c\lambda^2),$$
where we used $\mathbb E[|S_t|^2]\le C(\mathbb E[|S_t|])^2$ (see [APZ19]). Therefore,
$$\mathbb E\Big[\exp\Big(\frac{c\,\langle\sigma_1^{(t)},\sigma_2^{(t)}\rangle^2}{2N}+2Q_t\Big)\Big]\le\mathbb E\Big[\int_0^\infty c\lambda e^{c\lambda^2/2}\,\mathbb P_t\Big(\frac{|\langle\sigma_1^{(t)},\sigma_2^{(t)}\rangle|}{\sqrt N}\ge\lambda\Big)\,\mathrm d\lambda\cdot\frac{|S_t|^2}{(\mathbb E[|S_t|])^2}\Big]+C\le C\int_0^\infty c\lambda e^{-c\lambda^2/2}\,\mathrm d\lambda+C\le C.$$
Hence, by Markov's inequality, there exists a constant $C$ such that
$$\mathbb P\Big(\mathbb P_t\Big(|\langle\sigma_1^{(t)},\sigma_2^{(t)}\rangle|\ge C\sqrt N\sqrt{Q_t+\log N}\Big)\ge N^{-2}\Big)\le N^{-2},$$
which yields the desired result.

Lemma 18.
There exists a constant $C>0$ such that
$$\mathbb P(\tau_Y\le\tau_S,\ \tau_Y\le\alpha N)\le N^{-1}\quad\text{and}\quad\mathbb P\big(|Q_{\tau_Q-1}|\le(\log N)^{3/2},\ \tau_Q\le\tau_S,\ \tau_Q\le\alpha N\big)\le N^{-1}. \qquad(27)$$

Proof. The first inequality follows from Lemma 11 and a union bound. For the second, Lemma 11 gives
$$\mathbb P\big(|Q_{\tau_Q-1}|\le(\log N)^{3/2},\ \tau_Q\le\tau_S,\ \tau_Q=t+1\big)=\mathbb E\big[\mathbb P_t\big(|Q_t+\log(1+Y_{t+1})|\ge(\log N)^2\big)\,\mathbf 1_{|Q_t|\le(\log N)^{3/2},\,\tau_S\ge t+1}\big]\le\mathbb E\big[\mathbb P_t\big(Y_{t+1}\le-1/2\big)\,\mathbf 1_{|Q_t|\le(\log N)^{3/2},\,\tau_S\ge t+1}\big]\le N^{-2}.$$
Summing over $t$ yields (27).
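The final summation is simple arithmetic; the following sketch assumes the reconstructed per-step bound $N^{-2}$ from the display above and $\alpha\le1$.

```latex
\mathbb P\big(|Q_{\tau_Q-1}|\le(\log N)^{3/2},\ \tau_Q\le\tau_S,\ \tau_Q\le\alpha N\big)
 \;\le\;\sum_{t=0}^{\lfloor\alpha N\rfloor-1}
   \mathbb P\big(|Q_{\tau_Q-1}|\le(\log N)^{3/2},\ \tau_Q\le\tau_S,\ \tau_Q=t+1\big)
 \;\le\;\alpha N\cdot N^{-2}\;\le\; N^{-1}.
```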
Proof of Lemma 13.
Write
$$\mathbb P(\tau\le t)\le\mathbb P(\tau_S\le\alpha N)+\mathbb P(\tau_Y\le\tau_S,\ \tau_Y\le\alpha N)+\mathbb P\big(\tau_Q\le t,\ \tau_Q=\tau,\ |Q_{\tau_Q-1}|\le(\log N)^{3/2}\big)+\mathbb P\big(\tau_Q\le t,\ \tau_Q=\tau,\ |Q_{\tau_Q-1}|\ge(\log N)^{3/2}\big).$$
Lemma 18 and (26) give upper bounds on the first three probabilities on the right-hand side. In addition, Lemma 12 and Markov's inequality yield
$$\mathbb P\big(\tau_Q\le t,\ \tau_Q=\tau,\ |Q_{\tau_Q-1}|\ge(\log N)^{3/2}\big)\le(\log N)^{-3/2}\,\mathbb E\big[|Q_{t\wedge\tau-1}|;\ \tau_Q\le t,\ \tau_Q=\tau,\ |Q_{\tau_Q-1}|\ge(\log N)^{3/2}\big]\le(\log N)^{-3/2}\,\mathbb E\big[|Q_{t\wedge\tau-1}|\big]\le C(\log N)^{-1/2}.$$
Combining these bounds yields $\mathbb P(\tau\le t)\le C(\log N)^{-1/2}$.

From the planted model to the random model
Theorem 1 will follow from Lemma 6 and the following lemma on the planted model.
Lemma 19.
Suppose $A\subseteq\{(\sigma,U):\sigma\in U\subseteq\Sigma_N\}$, and suppose $\alpha<\alpha_c$. If $\mathbb P_{\mathrm{pl}}\big((\sigma^*,S(X))\in A\big)\le N^{-2m_N}$ for some $m_N\to\infty$, then $\mathbb P_r\big((\sigma^*,S(X))\in A\big)\to0$ as $N\to\infty$.

We start by characterizing the planted distribution.
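The hypothesis is tuned to beat the change-of-measure cost of $\exp\{m_N\log N\}$ incurred in the proof below; with the reconstructed exponent (an assumption of this reconstruction), the relevant comparison is:

```latex
\exp\{m_N\log N\}\cdot N^{-2m_N}
  \;=\; N^{m_N}\cdot N^{-2m_N}
  \;=\; N^{-m_N}\;\longrightarrow\;0 ,
% while P(S(X) ≠ ∅) is bounded away from 0 for α < α_c,
% so any loss of N^{m_N} in the comparison is harmless.
```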
Lemma 20.
For any $N,m\ge1$ and $A\subseteq\{(\sigma,U):\sigma\in U\subseteq\Sigma_N\}$,
$$\mathbb P_{\mathrm{pl}}\big((\sigma^*,S(X))\in A\big)=\mathbb E_r\Big[\mathbf 1_{(\sigma^*,S(X))\in A}\cdot\frac{|S(X)|}{\mathbb E|S(X)|}\Big]\cdot\mathbb P\big(S(X)\ne\emptyset\big).$$

Proof.
It suffices to prove that for any $\sigma\in\Sigma_N$ and $U\subseteq\Sigma_N$ such that $\sigma\in U$, we have
$$\mathbb P_{\mathrm{pl}}\big(\sigma^*=\sigma,\ S(X)=U\big)=\mathbb P_r\big(\sigma^*=\sigma,\ S(X)=U\big)\cdot\frac{|U|}{\mathbb E|S(X)|}\cdot\mathbb P\big(S(X)\ne\emptyset\big). \qquad(28)$$
To this end, write
$$\mathbb P_{\mathrm{pl}}\big(\sigma^*=\sigma,\ S(X)=U\big)=|\Sigma_N|^{-1}\,\mathbb P\big(S(X)=U\mid\sigma\in S(X)\big)=\frac{\mathbb P\big(S(X)=U,\ \sigma\in S(X)\big)}{|\Sigma_N|\,\mathbb P(\sigma\in S(X))}=\frac{\mathbb P(S(X)=U)}{|\Sigma_N|\,\mathbb P(\sigma\in S(X))}.$$
Note that $\mathbb E|S(X)|=|\Sigma_N|\,\mathbb P(\sigma\in S(X))$ and that
$$\mathbb P_r\big(\sigma^*=\sigma,\ S(X)=U\big)\cdot\mathbb P\big(S(X)\ne\emptyset\big)=|U|^{-1}\,\mathbb P(S(X)=U).$$
We get (28), and thus complete the proof of Lemma 20.
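The three displays combine into a one-line verification of (28):

```latex
\mathbb{P}_{\mathrm{pl}}(\sigma^*=\sigma,\,S(X)=U)
 \;=\; \frac{\mathbb{P}(S(X)=U)}{|\Sigma_N|\,\mathbb{P}(\sigma\in S(X))}
 \;=\; \frac{\mathbb{P}(S(X)=U)}{\mathbb{E}|S(X)|}
 \;=\; \underbrace{|U|^{-1}\mathbb{P}(S(X)=U)}_{=\,\mathbb{P}_r(\sigma^*=\sigma,\,S(X)=U)\,\mathbb{P}(S(X)\neq\emptyset)}
   \cdot\frac{|U|}{\mathbb{E}|S(X)|}.
```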
Proof of Lemma 19.
Theorem 2 implies that the event
$$G=\Big\{\frac{|S|}{\mathbb E|S|}\ge\exp\{-m_N\log N\}\Big\}$$
has probability $1-o(1)$. Therefore, as $N\to\infty$,
$$\mathbb P_r\big((\sigma^*,S)\in A\big)\le\mathbb P_r\big(\{(\sigma^*,S)\in A\}\cap G\big)+\mathbb P(G^c)/\mathbb P(S\ne\emptyset)\le\exp\{m_N\log N\}\,\mathbb E\Big[\mathbf 1_{(\sigma^*,S)\in A}\cdot\frac{|S|}{\mathbb E|S|}\Big]+o(1)=\exp\{m_N\log N\}\,\mathbb P_{\mathrm{pl}}\big((\sigma^*,S)\in A\big)/\mathbb P(S\ne\emptyset)+o(1)=o(1),$$
where we used Lemma 20 and Theorem 2. This yields Lemma 19.

Finally we prove Theorem 1.

Proof of Theorem 1.
Let $A=\big\{(\tau,U):\{\sigma\in U:\langle\sigma,\tau\rangle\ge(d_c+\delta)N\}\ne\{\tau\}\big\}$. Then Lemma 6 implies that
$$\mathbb P_{\mathrm{pl}}\big((\sigma^*,S(X))\in A\big)\le\exp\big\{-c\sqrt N\big\}.$$
Since $\exp\{-c\sqrt N\}$ decays faster than any polynomial, Lemma 19 applies, and this yields
$$\mathbb P_r\big((\sigma^*,S(X))\in A\big)=o(1),$$
and thus Theorem 1.

Acknowledgements
We thank Benjamin Aubin and Lenka Zdeborová for inspiring discussions and introducing us to this problem.
References

[Abb17] Emmanuel Abbe. Community detection and stochastic block models: recent developments. The Journal of Machine Learning Research, 18(1):6446–6531, 2017.

[ACO08] Dimitris Achlioptas and Amin Coja-Oghlan. Algorithmic barriers from phase transitions. In 2008 49th Annual IEEE Symposium on Foundations of Computer Science, pages 793–802. IEEE, 2008.

[ACORT11] Dimitris Achlioptas, Amin Coja-Oghlan, and Federico Ricci-Tersenghi. On the solution-space geometry of random constraint satisfaction problems. Random Structures & Algorithms, 38(3):251–268, 2011.

[AM02] Dimitris Achlioptas and Cristopher Moore. The asymptotic order of the random k-SAT threshold. In The 43rd Annual IEEE Symposium on Foundations of Computer Science, 2002. Proceedings., pages 779–788. IEEE, 2002.

[APZ19] Benjamin Aubin, Will Perkins, and Lenka Zdeborová. Storage capacity in symmetric binary perceptrons. Journal of Physics A: Mathematical and Theoretical, 2019.

[Bal09] Carlo Baldassi. Generalization learning in a perceptron with binary synapses. Journal of Statistical Physics, 136(5):902–916, 2009.

[BBBZ07] Carlo Baldassi, Alfredo Braunstein, Nicolas Brunel, and Riccardo Zecchina. Efficient supervised learning in networks with binary synapses. Proceedings of the National Academy of Sciences, 104(26):11079–11084, 2007.

[BBC+16] Carlo Baldassi, Christian Borgs, Jennifer T Chayes, Alessandro Ingrosso, Carlo Lucibello, Luca Saglietti, and Riccardo Zecchina. Unreasonable effectiveness of learning neural networks: From accessible states and robust ensembles to basic algorithmic schemes. Proceedings of the National Academy of Sciences, 113(48):E7655–E7662, 2016.

[BCOE17] Victor Bapst, Amin Coja-Oghlan, and Charilaos Efthymiou. Planting colourings silently. Combinatorics, Probability and Computing, 26(3):338–366, 2017.

[BCOH+16] Victor Bapst, Amin Coja-Oghlan, Samuel Hetterich, Felicia Raßmann, and Dan Vilenchik. The condensation phase transition in random graph coloring. Communications in Mathematical Physics, 341(2):543–606, 2016.

[BDVLZ20] Carlo Baldassi, Riccardo Della Vecchia, Carlo Lucibello, and Riccardo Zecchina. Clustering of solutions in the symmetric binary perceptron. Journal of Statistical Mechanics: Theory and Experiment, 2020(7):073303, 2020.

[BIL+15] Carlo Baldassi, Alessandro Ingrosso, Carlo Lucibello, Luca Saglietti, and Riccardo Zecchina. Subdominant dense clusters allow for simple learning and high computational performance in neural networks with discrete synapses. Physical Review Letters, 115(12):128101, 2015.

[BIL+16] Carlo Baldassi, Alessandro Ingrosso, Carlo Lucibello, Luca Saglietti, and Riccardo Zecchina. Local entropy as a measure for sampling solutions in constraint satisfaction problems. Journal of Statistical Mechanics: Theory and Experiment, 2016(2):023301, 2016.

[BSZ19] Zsolt Bartha, Nike Sun, and Yumeng Zhang. Breaking of 1RSB in random regular MAX-NAE-SAT. In 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS), pages 1405–1416. IEEE, 2019.

[BZ06] Alfredo Braunstein and Riccardo Zecchina. Learning by message passing in networks of discrete synapses. Physical Review Letters, 96(3):030201, 2006.

[COEJ+18] Amin Coja-Oghlan, Charilaos Efthymiou, Nor Jaafari, Mihyun Kang, and Tobias Kapetanopoulos. Charting the replica symmetric phase. Communications in Mathematical Physics, 359(2):603–698, 2018.

[COKPZ18] Amin Coja-Oghlan, Florent Krzakala, Will Perkins, and Lenka Zdeborová. Information-theoretic thresholds from the cavity method. Advances in Mathematics, 333:694–795, 2018.

[Cov65] Thomas M Cover. Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Transactions on Electronic Computers, (3):326–334, 1965.

[COZ12] Amin Coja-Oghlan and Lenka Zdeborová. The condensation transition in random hypergraph 2-coloring. In Proceedings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms, pages 241–250. SIAM, 2012.

[DS19] Jian Ding and Nike Sun. Capacity lower bound for the Ising perceptron. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pages 816–827, 2019.

[DSS14] Jian Ding, Allan Sly, and Nike Sun. Satisfiability threshold for random regular NAE-SAT. In Proceedings of the Forty-Sixth Annual ACM Symposium on Theory of Computing, pages 814–822, 2014.

[Fri99] Ehud Friedgut. Sharp thresholds of graph properties, and the k-SAT problem. Journal of the American Mathematical Society, 12(4):1017–1054, 1999.

[Gar87] Elizabeth Gardner. Maximum storage capacity in neural networks. EPL (Europhysics Letters), 4(4):481, 1987.

[GD88] E Gardner and B Derrida. Optimal storage properties of neural network models. Journal of Physics A: Mathematical and General, 21(1):271, 1988.

[HK14] Haiping Huang and Yoshiyuki Kabashima. Origin of the computational hardness for learning with binary synapses. Physical Review E, 90(5):052813, 2014.

[HW71] D. L. Hanson and F. T. Wright. A bound on tail probabilities for quadratic forms in independent random variables. Ann. Math. Statist., 42:1079–1083, 1971.

[HWK13] Haiping Huang, KY Michael Wong, and Yoshiyuki Kabashima. Entropy landscape of solutions in the binary perceptron problem. Journal of Physics A: Mathematical and Theoretical, 46(37):375002, 2013.

[KM89] Werner Krauth and Marc Mézard. Storage capacity of memory networks with binary couplings. Journal de Physique, 50(20):3057–3066, 1989.

[KMRT+07] Florent Krzakała, Andrea Montanari, Federico Ricci-Tersenghi, Guilhem Semerjian, and Lenka Zdeborová. Gibbs states and the set of solutions of random constraint satisfaction problems. Proceedings of the National Academy of Sciences, 104(25):10318–10323, 2007.

[KR98] Jeong Han Kim and James R Roche. Covering cubes by random half cubes, with applications to binary neural networks. Journal of Computer and System Sciences, 56(2):223–252, 1998.

[KZ09] Florent Krzakala and Lenka Zdeborová. Hiding quiet solutions in random constraint satisfaction problems. Physical Review Letters, 102(23):238701, 2009.

[Mol18] Michael Molloy. The freezing threshold for k-colourings of a random graph. Journal of the ACM (JACM), 65(2):1–62, 2018.

[MRT11] Andrea Montanari, Ricardo Restrepo, and Prasad Tetali. Reconstruction and clustering in random constraint satisfaction problems. SIAM Journal on Discrete Mathematics, 25(2):771–808, 2011.

[MSL92] David Mitchell, Bart Selman, and Hector Levesque. Hard and easy distributions of SAT problems. In AAAI, volume 92, pages 459–465, 1992.

[Ost38] Alexander Markowitsch Ostrowski. Sur l'approximation du déterminant de Fredholm par les déterminants des systèmes d'équations linéaires. Ark. Math. Stockholm, 26A:1–15, 1938.

[SSZ16] Allan Sly, Nike Sun, and Yumeng Zhang. The number of solutions for random regular NAE-SAT. In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pages 724–731. IEEE, 2016.

[STS90] Haim Sompolinsky, Naftali Tishby, and H Sebastian Seung. Learning from examples in large neural networks. Physical Review Letters, 65(13):1683, 1990.

[Tal99] Michel Talagrand. Intersecting random half cubes. Random Structures & Algorithms, 15(3-4):436–449, 1999.

[Tal10] Michel Talagrand. Mean Field Models for Spin Glasses: Volume I: Basic Examples, volume 54. Springer Science & Business Media, 2010.

[TAP77] David J Thouless, Philip W Anderson, and Robert G Palmer. Solution of 'Solvable model of a spin glass'. Philosophical Magazine, 35(3):593–601, 1977.

[Xu19] Changji Xu. Sharp threshold for the Ising perceptron model. arXiv preprint arXiv:1905.05978, 2019.

[ZK07] Lenka Zdeborová and Florent Krzakała. Phase transitions in the coloring of random graphs. Physical Review E, 76(3):031131, 2007.

[ZK16] Lenka Zdeborová and Florent Krzakala. Statistical physics of inference: Thresholds and algorithms. Advances in Physics, 65(5):453–552, 2016.

[ZM08] Lenka Zdeborová and Marc Mézard. Constraint satisfaction problems with isolated solutions are hard.