[PDF] Efficiency of quantum versus classical annealing in non-convex learning problems

Abstract

Quantum annealers aim at solving non-convex optimization problems by exploiting cooperative tunneling effects to escape local minima. The underlying idea consists in designing a classical energy function whose ground states are the sought optimal solutions of the original optimization problem and add a controllable quantum transverse field to generate tunneling processes. A key challenge is to identify classes of non-convex optimization problems for which quantum annealing remains efficient while thermal annealing fails. We show that this happens for a wide class of problems which are central to machine learning. Their energy landscapes is dominated by local minima that cause exponential slow down of classical thermal annealers while simulated quantum annealing converges efficiently to rare dense regions of optimal solutions.

Full PDF

EEﬃciency of quantum versus classical annealing in non-convex learningproblems

Carlo Baldassi , and Riccardo Zecchina , Bocconi Institute for Data Science and Analytics, Bocconi University, Milano, Italy Istituto Nazionale di Fisica Nucleare, Sezione di Torino, Italy International Centre for Theoretical Physics, Trieste, Italy

Quantum annealers aim at solving non-convex optimization problems by exploiting cooperativetunneling eﬀects to escape local minima. The underlying idea consists in designing a classical energyfunction whose ground states are the sought optimal solutions of the original optimization problemand add a controllable quantum transverse ﬁeld to generate tunneling processes. A key challengeis to identify classes of non-convex optimization problems for which quantum annealing remainseﬃcient while thermal annealing fails. We show that this happens for a wide class of problemswhich are central to machine learning. Their energy landscapes is dominated by local minimathat cause exponential slow down of classical thermal annealers while simulated quantum annealingconverges eﬃciently to rare dense regions of optimal solutions.

CONTENTS

I. Introduction 3II. Energy functions 5III. Connection with the local entropy measure 6IV. Phase diagram: analytical and numerical results 7V. Conclusions 10Acknowledgments 11A. Theoretical analysis by the replica method 11a. Small Γ limit 171. Energy function with stability 17a. Small Γ limit 18B. Estimation of the local energy and entropy landscapes with the cavity method 19C. Numerical simulations details of the annealing protocols 201. Quantum annealing protocol 202. Classical simulated annealing protocol 20D. Additional numerical results on the annealing processes 211. Additional comparisons between theory and simulations 212. Experiments with two-layer networks 22E. Real-time Quantum Annealing on small samples 231. Numerical methods 232. Sample selection 24a. Randomized samples 243. Analysis 24 a r X i v : . [ qu a n t - ph ] O c t a. QA vs SQA 24b. Other measurements 25c. Local entropies 28d. Energy gaps 28References 28 I. INTRODUCTION abc E ({ σ j }) σ j Figure 1. Topology of the Suzuki-Trotter vs RobustEnsemble representations. a: the classical objectivefunction we wish to optimize which depends on N discrete variables { σ j } ( N = 5 in the picture). b: Suzuki-Trotter interaction topology: y replicas of theclassical system ( y = 7 in the picture) are coupled byperiodic 1 dimensional chains, one for each classicalspin. c: Robust Ensemble interaction topology: y replicas are coupled through a centroid conﬁguration.In the limit of large N and large y (quantum limit)and for strong interaction couplings all replicas areforced to be close, and the behavior of the two eﬀectivemodels is expected to be similar. Quantum tunneling and quantum correla-tions govern the behavior of very complex col-lective phenomena in quantum physics at lowtemperature. Since the discovery of the fac-toring quantum algorithms in the 90s [1], alot of eﬀorts have been devoted to the under-standing of how quantum ﬂuctuations couldbe exploited to ﬁnd low-energy conﬁgurationsof energy functions which encode the solu-tions of non-convex optimization problems intheir ground states. This has led to thenotion of controlled quantum adiabatic evo-lution, where a time dependent many-bodyquantum system is evolved towards its groundstates so as to escape local minima throughmultiple tunneling events [2–6]. When ﬁ-nite temperature eﬀects have to be taken intoaccount, the computational process is calledQuantum Annealing (QA). Classical Simu-lated Annealing (SA) uses thermal ﬂuctua-tions for the same computational purpose,and Markov Chains based on this principleare among the most widespread optimizationtechniques across science [7]. Quantum ﬂuc-tuations are qualitatively diﬀerent from ther-mal ﬂuctuations and in principle quantum an-nealing algorithms could lead to extremelypowerful alternative computational devices.In the quantum annealing approach, a timedependent quantum transverse ﬁeld is addedto the classical energy function leading to aninterpolating Hamiltonian that may take ad-vantage of correlated ﬂuctuations mediatedby tunneling. Starting with a high trans-verse ﬁeld, the quantum model system can beinitialized in its ground state, e.g. all spinsaligned in the direction of the ﬁeld. The adi-abatic theorem then ensures that by slowlyreducing the transverse ﬁeld the system re-mains in the ground state of the interpolating Hamiltonian. At the end of the process the transverseﬁeld vanishes and the systems ends up in the sought ground state of the classical energy function.The original optimization problem would then be solved if the overall process could take place ina time bounded by some low degree polynomial in the size of the problem. Unfortunately, theadiabatic process can become extremely slow. The adiabatic theorem requires the rate of changeof the Hamiltonian to be smaller than the square of the gap between the ground state and theﬁrst excited state [8–10]. For small gaps the process can thus become ineﬃcient. Exponentiallysmall gaps are not only possible in worst case scenarios but have also been found to exist in typicalrandom systems where comparative studies between quantum and classical annealing have so farfailed in displaying quantum exponential speed up, e.g. at ﬁrst order phase transition in quantumspin glasses [11, 12] or 2D spin glass systems [13–15]. More positive results have been found forad hoc energy functions in which global minima are planted in such a way that tunneling cascadescan become more eﬃcient than thermal ﬂuctuations [5, 16]. As far as the physical implementationsof quantum annealers is concerned, studies have been focused on discriminating the presence ofquantum eﬀects rather than on their computational eﬀectiveness [17–19].Consequently, a key open question is to identify classes of relevant optimization problems forwhich quantum annealing can be shown to be exponentially faster than its classical thermal coun-terpart.Here we give an answer to this question by providing analytic and simulation evidence of exponen-tial speed up of quantum versus classical simulated annealing for a representative class of randomnon-convex optimization problems of basic interest in machine learning. The simplest exampleof this class is the problem of training binary neural networks (described in detail below): veryschematically, the variables of the problem are the (binary) connection weights, while the energymeasures the training error over a given dataset.These problems have been very recently found to possess a rather distinctive geometrical structureof ground states [20–23]: the free energy landscape has been shown to be characterized by theexistence of an exponentially large number metastable states and isolated ground states, and a fewregions where the ground states are dense. These dense regions, which had previously escaped theequilibrium statistical physics analysis [24, 25], are exponentially rare, but still possess a very highlocal internal entropy: they are composed of ground states that are surrounded, at extensive butrelatively small distances, by exponentially many other ground states. Under these circumstances,classical SA (as any Markov Chain satisfying detailed balance) gets trapped in the metastable states,suﬀering ergodicity breaking and exponential slowing down toward the low energy conﬁgurations.These problems have been considered to be intractable for decades and display deep similaritieswith disordered spin glass models which are known to never reach equilibrium.The large deviation analysis that has unveiled the existence of the rare dense regions has ledto several novel algorithms, including a Monte Carlo scheme deﬁned over an appropriate objectivefunction [21] that bears close similarities with a Quantum Monte Carlo (QMC) technique based onthe Suzuki-Trotter transformation [6]. Motivated by this analytical mapping and by the geometricalstructure of the dense and degenerate ground states which is expected to favor zero temperaturekinetic processes [26, 27], we have conducted a full analytical and numerical statistical physics studyof the quantum annealing problem, reaching the conclusion that in the quantum limit the QMCprocess, i.e. Simulated Quantum Annealing (SQA), can equilibrate eﬃciently while the classical SAgets stuck in high energy metastable states. These results generalize to multi layered networks.While it is known that other quasi-optimal classical algorithms for the same problems exist[21, 28, 29], here we focus on the physical speed up that a quantum annealing approach couldprovide in ﬁnding rare regions of ground states. We provide physical arguments and numericalresults supporting the conjecture that the real time quantum annealing dynamics behaves similarlyto SQA.As far as machine learning is concerned, dense regions of low energy conﬁgurations (i.e. quasi-ﬂat minima over macroscopic length scales) are of fundamental interest, as they are particularlywell-suited for making predictions given the learned data: on the one hand, these regions are bydeﬁnition robust with respect to ﬂuctuations in a sizable fraction of the weight conﬁgurations andas such are less prone to ﬁt the noise. On the other hand, an optimal Bayesian estimate, resultingfrom a weighted consensus vote on all conﬁgurations, would receive a major contribution from oneof such regions, compared to a narrow minimum; the centroid of the region (computed accordingto any reasonable metric which correlates the distance between conﬁgurations with the networkoutcomes) would act as a representative of the region as a whole [30]. In this respect, it is worthmentioning that in deep learning [31] all the learning algorithms which lead to good predictionperformance always include eﬀects of a systematically injected noise in the learning phase, a factthat makes the equilibrium Gibbs measure not the stationary measure of the learning protocols anddrive the systems towards wide minima. We expect that these results can be generalized to manyother classes of non convex optimization problems where local entropy plays a role, ranging fromrobust optimization to physical disordered systems.Quantum gate based algorithms for machine learning exist, however the possibility of a physicalimplementation remains a critical issue [32].

II. ENERGY FUNCTIONS

As a working example, we ﬁrst consider the problem of learning random patterns in single layerneural network with binary weights, the so called binary perceptron problem [24]. This networkmaps vectors of N inputs ξ ∈ {− , +1 } N to binary outputs τ = ± through the non linear function τ = sgn ( σ · ξ ) , where σ ∈ {− , +1 } N is the vector of synaptic weights. Given αN input pat-terns { ξ µ } αNµ =1 with µ = 1 , ..., αN and their corresponding desired outputs { τ µ } αNµ =1 , the learningproblem consists in ﬁnding σ such that all input patterns are simultaneously classiﬁed correctly,i.e. sgn ( σ · ξ µ ) = τ µ for all µ . Both the components of the input vectors ξ µi and the outputs τ µ areindependent identically distributed unbiased random variables ( P ( x ) = δ ( x −

1) + δ ( x + 1) ). Inthe binary framework, the procedure for writing a spin Hamiltonian whose ground states are thesought optimal solutions of the original optimization problem is well known [33]. The energy E ofthe binary perceptron is proportional to the number of classiﬁcation errors and can be written as E ( { σ j } ) = αN (cid:88) µ =1 ∆ nµ Θ ( − ∆ µ ) , ∆ µ . = τ µ √ N N (cid:88) j =1 ξ µj σ j (1)where Θ ( x ) is the Heaviside step function: Θ ( x ) = 1 if x > , Θ ( x ) = 0 otherwise. When theargument of the Θ function is positive, the perceptron is implementing the wrong input-outputmapping. The exponent n ∈ { , } deﬁnes two diﬀerent forms of the energy functions which havethe same zero energy ground states and diﬀerent structures of local minima. The equilibriumanalysis of the binary perceptron problem shows that in the large size limit and for α < α c (cid:39) . [24], the energy landscape is dominated by an exponential number of local minima and of zeroenergy ground states that are typically geometrically isolated [34], i.e. they have extensive mutualHamming distances. For both choices of n the problem is computationally hard for SA processes[35]: in the large N limit, a detailed balanced stochastic search process gets stuck in metastablestates at energy levels of order O ( N ) above the ground states.Following the standard SQA approach, we identify the binary variables σ with one of the com-ponents of physical quantum spins, say σ z , and we introduce the Hamiltonian operator of a modelof N quantum spins with the perceptron term of Eq. (1) acting in the longitudinal direction z anda magnetic ﬁeld Γ acting in the transverse direction x . The interpolating Hamiltonian reads: ˆ H = E (cid:0)(cid:8) ˆ σ zj (cid:9)(cid:1) − Γ N (cid:88) j =1 ˆ σ xj (2)where ˆ σ zj and ˆ σ jx are the spin operators (Pauli matrices) in the z and x directions. For Γ = 0 onerecovers the classical optimization problem. The QA procedure consists in initializing the systemat large β and Γ , and slowly decreasing Γ to . To analyze the low temperature phase diagram ofthe model we need to study the average of the logarithm of the partition function Z = Tr (cid:16) e − β ˆ H (cid:17) . This can be done using the Suzuki-Trotter transformation which leads to the study of a classicaleﬀective Hamiltonian acting on a system of y interacting Trotter replicas of the original classicalsystem coupled in an extra dimension: H eﬀ (cid:16)(cid:8) σ aj (cid:9) j,a (cid:17) = 1 y y (cid:88) a =1 E (cid:16)(cid:8) σ aj (cid:9) j (cid:17) − γβ y (cid:88) a =1 N (cid:88) j =1 σ aj σ a +1 j − N Kβ (3)where the σ aj = ± are Ising spins, a ∈ { , . . . , y } is a replica index with periodic boundaryconditions σ y +1 j ≡ σ j , γ = log coth (cid:16) β Γ y (cid:17) and K = y log (cid:16) sinh (cid:16) β Γ y (cid:17)(cid:17) . The replicated system needs to be studied in the limit y → ∞ to recover the so called path integralcontinuous quantum limit and to make the connection with the behavior of quantum devices [15].The SQA dynamical process samples conﬁgurations from an equilibrium distribution and it is notnecessarily equivalent to the real time Schrödinger equation evolution of the system. A particularlydangerous situation occurs if the ground states of the system encounter ﬁrst order phase transitionswhich are associated to exponentially small gaps [11, 36, 37] at ﬁnite N. As discussed below, thisappears not to be the case for the class of models we are considering. III. CONNECTION WITH THE LOCAL ENTROPY MEASURE

The eﬀective Hamiltonian Eq. (3) can be interpreted as many replicas of the original systemscoupled through one dimensional periodic chains, one for each original spin, see Fig. 1b. Note thatthe interaction term γ diverges as the transverse ﬁeld Γ goes to . This geometrical structure isvery similar to that of the Robust Ensemble (RE) formalism [21], where a probability measurethat gives higher weight to rare dense regions of low energy states is introduced. There, the mainidea is to maximize Φ ( σ (cid:63) ) = log (cid:80) { σ } e − βE ( σ ) − λ (cid:80) Nj =1 σ j σ (cid:63)j , i.e. a “local free entropy” where λ is aLagrange parameter that controls the extensive size of the region around a reference conﬁguration σ (cid:63) . One can then build a new Gibbs distribution P ( σ (cid:63) ) ∝ e y Φ( σ (cid:63) ) , where − Φ has the role of anenergy and y of an inverse temperature: in the limit of large y , this distribution concentrates on themaxima of Φ . Upon restricting the values of y to be integer (and large), P ( σ (cid:63) ) takes a factorizedform yielding a replicated probability measure P RE (cid:0) σ (cid:63) , σ , . . . , σ y (cid:1) ∝ e − βH REeff ( σ (cid:63) , { σ aj } ) where theeﬀective energy is given by H REeﬀ (cid:16) σ (cid:63) , (cid:8) σ aj (cid:9) j,a (cid:17) = y (cid:88) a =1 E (cid:16)(cid:8) σ aj (cid:9) j (cid:17) − λβ y (cid:88) a =1 N (cid:88) j =1 σ aj σ (cid:63)j (4)As in the Suzuki-Trotter formalism, H REeﬀ (cid:16) σ (cid:63) , (cid:8) σ aj (cid:9) j,a (cid:17) corresponds to a system with an overallenergy given by the sum of y individual “real replica energies” plus a geometric coupling term; inthis case however the replicas interact with the “reference” conﬁgurations σ (cid:63) rather than amongthemselves, see Fig. 1c.The Suzuki-Trotter representation and the RE formalism diﬀer in the topology of the interactionsbetween replicas and in the scaling of the interactions, but for both cases there is a classical limit, Γ → and λ → ∞ respectively, in which the replicated systems are forced to correlate andeventually coalesce in identical conﬁgurations. For non convex problems, these will not in generalcorrespond to conﬁguration dominating the original classical Gibbs measure.For the sake of clarity we should remind that in the classical limit and for α < α c , our modelpresents an exponential number of far apart isolated ground states which dominate the Gibbsmeasure. At the same time, there exist rare clusters of ground states with a density close to itsmaximum possible value (high local entropy) for small but still macroscopic cluster sizes [20]. Thisfact has several consequences: no further subdivision of the clusters into states is possible, theground states are typically O (1) spin ﬂip connected [20] and a tradeoﬀ between tunneling eventsand exponential number of destination states within the cluster is possible. IV. PHASE DIAGRAM: ANALYTICAL AND NUMERICAL RESULTS

Thanks to the mean ﬁeld nature of the energetic part of the system, Eq. (3), we can resort tothe replica method for calculating analytically the phase diagram. As discussed in the AppendixSec. A, this can be done under the so called static approximation, which consists in using a singleparameter q to represent the overlaps along the Trotter dimension, q ab = (cid:68) N (cid:80) Nj =1 σ aj σ bj (cid:69) ≈ q .Although this approximation crudely neglects the dependency of q ab from | a − b | , the resultingpredictions show a remarkable agreement with numerical simulations.In the main panel of Fig. 2, we report the analytical predictions for the average classical compo-nent of the energy of the quantum model as a function of the transverse ﬁeld Γ . We compare theresults with the outcome of extensive simulations performed with the reduced-rejection-rate MonteCarlo method [38], in which Γ is initialized at . and gradually brought down to in regularsmall steps, at constant temperature, and ﬁxing the total simulation time to τ N y · (as to keepconstant the number of Monte Carlo sweeps when varying N and y ). The details are reported inthe Appendix Sec. C. The size of the systems, the number of samples and the number of Trotterreplicas are scaled up to large values so that both ﬁnite size eﬀects and the quantum limit are keptunder control. A key point is to observe that the results do not degrade with the number of Trot-ter replicas: the average ground state energy approaches a limiting value, close to the theoreticalprediction, in the large y quantum limit. The results appear to be rather insensitive to both N andthe simulation time scaling parameter τ . This indicates that Monte Carlo appears to be able toequilibrate eﬃciently, in a constant (or almost constant) number of sweeps, at each Γ . The analyt-ical prediction for the classical energy only appears to display a relatively small systematic oﬀset(due to the static approximation) at intermediate values of Γ , while it is very precise at both largeand small Γ ; the expectation of the total Hamiltonian on the other hand is in excellent agreementwith the simulations (see Appendix Sec. C).In the same plot we display the behavior of classical SA simulated with a standard Metropolis-Hastings scheme, under an annealing protocol in β that would follow the same theoretical curve asSQA if the system were able to equilibrate (see Appendix Sec. C): as expected [35], SA gets trappedat very high energies (increasing with problem size; in the thermodynamic limit it is expected thatSA would remain stuck at the initial value . N of the energy for times which scale exponentiallywith N ). Alternative annealing protocols yield analogous results; the exponential scaling with N ofSA on binary perceptron models had also been observed experimentally in previous results, e.g. inrefs. [22, 39].In the inset of Fig. 2 we report the analytical prediction for the transverse overlap parameter q ,which quite remarkably reproduces fairly well the average overlap as measured from simulations. c l a ss i c a l ene r g y den s i t y 𝐸 / N transverse field 𝛤 SQA N= N= N= N= ▵ ⋆ r ep li c ao v e r l ap s 𝑞 𝛤 increasing y= y= y=

256 average y= y= Figure 2. Classical energy density (i.e. longitudinal component of the energy, divided by N ) as a functionof the transverse ﬁeld Γ (single layer problems with α = 0 . and n = 0 , independent samples per curve).The QA simulations at β = 20 approach the theoretical prediction as y increases (cf. black arrow). Theresults do not change signiﬁcantly when varying N or the simulation time (the curves with N = 1001 or N = 2001 are indistinguishable from the ones displayed at this level of detail). All SA simulationsinstead got stuck and failed to equilibrate at low enough temperatures (small equivalent Γ ). The resultsare noticeably worse for larger N , and doubling or quadrupling the simulation time doesn’t help much(cf. purple arrows). Inset:

Trotter replicas overlaps q ab (same data as for the main ﬁgure). The theoreticalprediction is in remarkably good agreement with the average value measured from the simulations (the y = 128 curve is barely visible under the y = 256 one). The gray curves show the overlaps at varyingdistances along the Trotter dimension: the topmost one is the overlap between neighboring replicas q a ( a +1)1 ,then there is the overlap between second-neighbors q a ( a +2)1 and so on (cf. Fig. 1). The y = 128 curvesare essentially hidden under the y = 256 ones and can only be seen from their darker shade, following analternating pattern. ene r g y den s i t y d i ff. f r o m r e f e r en c e Δ E / N distance 𝑑 ▵ 𝑑 ⋆ l o c a l en t r op y f r o m r e f e r en c e distance 𝑑 ▵ a b c Figure 3. Panels a and b : energetic proﬁles (in terms of the classical energy E , Eq. (1)) around theconﬁgurations reached during the annealing process, comparing QA (orange lower curves) with SA (graytop curves). The proﬁles represent the most probable value of the energy density shift ∆ E/N with respectto the reference point when moving away from the reference at a given normalized Hamming distance d .The curves refer to the data shown in Fig. 2, using two diﬀerent times in the annealing process, markedwith the symbols (cid:77) and (cid:63) in both ﬁgures. For QA, we show the results for 15 instances with N = 4001 , y = 256 , τ = 4 , using the mode of the replicas σ (cid:63)j = sgn (cid:0)(cid:80) ya =1 σ aj (cid:1) as the reference point; for SA, we show15 samples for N = 4001 and τ = 16 . These results show a marked qualitative diﬀerence in the type oflandscape that is typically explored by the two algorithms: the local landscape of QA is generally muchwider, while SA is typically working inside narrow regions of the landscape which tend to trap the algorithmeventually. Panel c : local entropy, i.e. the logarithm of the number of solutions surrounding the referencepoint at a given distance d for the same conﬁgurations of panel a . The QA conﬁgurations (orange curves atthe top) are located in regions with exponentially many solutions surrounding them (although these regionsare not maximally dense, as can be seen from the comparison with the dashed curve representing the overallnumber of surrounding conﬁgurations at that distance). The SA conﬁgurations (gray curves at the bottom)are far away from these exponentially dense regions (the local entropy has a gap around d = 0 ). In Fig. 3 we provide the proﬁles of the the classical energy minima found for diﬀerent values of Γ in the case of SQA and diﬀerent temperatures for SA. These results are computed analyticallyby the cavity method (see Materials and Methods and SI for details) by evaluating which is themost probable energy found at a normalized Hamming distance d from a given conﬁguration. Asit turns out, throughout the annealing process, SQA follows a path corresponding to wide valleyswhile SA gets stuck in steep metastable states. The quantum ﬂuctuations reproduced by the SQAprocess drive the system to converge toward wide ﬂat regions, in spite of the fact that they areexponentially rare compared to the narrow minima.The physical interpretation of these results is that quantum ﬂuctuations lower the energy of acluster proportionally to its size or, in other words, that quantum ﬂuctuations allow the system tolower its kinetic energy by delocalizing, see Refs. [26, 27, 40] for related results. Along the processof reduction of the transverse ﬁeld we do not observe any phase transition which could induce acritical slowing down of the quantum annealing process and we expect SQA and QA to behavesimilarly [12, 37].This is in agreement with the results of a direct comparison between the real time quantum0dynamics and the SQA on small systems ( N = 21 ): as reported in the Appendix Sec. E, we haveperformed extensive numerical studies of properly selected small instances of the binary perceptronproblem, comparing the results of SQA and QA and analyzing the results of the QA process andthe properties of the Hamiltonian. To reproduce the conditions that are known to exist at largevalues of N , we have selected instances for which a fast annealing schedule SA gets trapped at somepositive fraction of violated constraints, and yet the problems display a suﬃciently high numberof solutions. We found that the agreement between SQA and QA on each sample is excellent.The measurements on the ﬁnal conﬁgurations reached by QA qualitatively conﬁrm the scenariodescribed above, that QA is attracted towards dense low-energy regions without getting stuckduring the annealing process. Finally, the analysis of the gap between the ground state of thesystem and the ﬁrst excited state as Γ decreases shows no signs of the kind of phenomena whichwould typically hamper the performance of QA in other models: there are no vanishingly smallgaps at ﬁnite Γ (cf. the discussion in the introduction). We benchmarked all these results with“randomized” versions of the same samples, in which we randomly permuted the classical energiesassociated to each spin conﬁguration, so as to keep the distribution of the classical energy levelswhile destroying the geometric structure of the states. Indeed, for these randomized samples, wefound that the gaps nearly close at ﬁnite Γ (cid:39) . , and that correspondingly the QA process fails totrack the ground state of the system, resulting in a much reduced probability of ﬁnding a solutionto the problem.To reproduce the conditions that are known to exist at large values of N we have selected instancesfor which a fast annealing schedule SA gets trapped at some positive fraction of violated constraintsand yet the problems display a suﬃciently high number of solutions. We have then compared thebehavior of SQA and the real time quantum dynamics studied by the Lanczos method as discussedin [41]. The agreement between SQA and QA is ... almost perfect.As concluding remarks we report that the models with n = 0 and n = 1 have phase diagramswhich are qualitatively very similar (for the sake of simplicity, here we reported the n = 0 caseonly). The former presents at very small positive values of Γ a collapse of the density matrix ontothe classical one whereas the latter ends up in the classical state only at Γ = 0 .For the sake of completeness, we have checked that the performance of SQA in the y → ∞ quantum limit extends to more complex architectures which include hidden layers; the details arereported in the Appendix Sec. D 2. V. CONCLUSIONS

We conclude by noticing that, at variance with other studies on spin glass models in which theevidence for QA outperforming classical annealing was limited to ﬁnite values of y , thereby justdeﬁning a diﬀerent type of classical SA algorithms, in our case the quantum limit coincides withthe optimal behavior of the algorithm itself. We believe that these results could play a role inmany optimization problems in which optimality of the cost function needs to also meet robustnessconditions (i.e. wide minima). As far as learning problems are concerned, it is worth mentioningthat for the best performing artiﬁcial neural networks, the so called deep networks [31], there isnumerical evidence for the existence of rare ﬂat minima [42], and that all the eﬀective algorithmsalways include eﬀects of systematic injected noise in the learning phase [43], which implies that theequilibrium Gibbs measure is not the stationary measure of the learning protocols. For the sake ofclarity we should remark that our results are aimed to suggest that QA can equilibrate eﬃcientlywhereas SA cannot, i.e. our notion of quantum speed up is relative to the same algorithmic schemethat runs on classical hardware. Other classical algorithms for the same class of problems, besides1the above-mentioned ones based on the RE and the SQA itself, have been discovered [28, 39, 44–46]; however, all of these algorithms are qualitatively diﬀerent from QA, which can provide a hugespeed up by manipulating single bits in parallel. Thus, the overall solving time in a physical QAimplementation (neglecting any other technological considerations) would have, at worst, only amild dependence on N .Our results provide further evidence that learning can be achieved through diﬀerent types ofcorrelated ﬂuctuations, among which quantum tunneling could be a relevant example for physicaldevices. ACKNOWLEDGMENTS

The authors thank G. Santoro, B. Kappen and F. Becca for discussions.

Appendix A: Theoretical analysis by the replica method

We present here the analytical calculations performed to derive all the theoretical results men-tioned in the main text. For completeness, we report all the relevant formulas and deﬁnitions here,even those that were already introduced in the main text.The Hamiltonian operator of a model of N quantum spins with an energy term acting in thelongitudinal direction z and a magnetic ﬁeld Γ acting in the transverse direction x is written as: ˆ H = E (cid:16)(cid:8) ˆ σ zj (cid:9) j (cid:17) − Γ N (cid:88) j =1 ˆ σ xj (A1)where ˆ σ zj and ˆ σ jx are the spin operators (Pauli matrices) in the z and x directions. We want tostudy the partition function: Z = Tr (cid:16) e − β ˆ H (cid:17) . (A2)By using the Suzuki-Trotter transformation, we end up with a classical eﬀective Hamiltonianacting on a system of y interacting Trotter replicas, to be studied in the limit y → ∞ : H eﬀ (cid:16)(cid:8) σ aj (cid:9) j,a (cid:17) = 1 y (cid:88) a E (cid:16)(cid:8) σ aj (cid:9) j (cid:17) − γβ (cid:88) aj σ aj σ a +1 j − N Kβ (A3)where the σ aj = ± are Ising spins, a ∈ { , . . . , y } is a replica index with periodic boundaryconditions σ y +1 j ≡ σ j , and we have deﬁned: γ = 12 log coth (cid:18) β Γ y (cid:19) , (A4) K = 12 y log (cid:18)

12 sinh (cid:18) β Γ y (cid:19)(cid:19) . (A5)In the following, we will just use σ a to denote the conﬁguration of one Trotter replica, (cid:8) σ aj (cid:9) j ; wewill always use the indices a or b for the Trotter replicas and assume that they range in , . . . , y ;we will also use j for the site index and assume that it ranges in , . . . , N .2The eﬀective partition function for a given y reads: Z eﬀ = (cid:88) { σ a } e − βy (cid:80) a E ( σ a )+ γ (cid:80) aj σ aj σ a +1 j + NK . (A6)Here, we ﬁrst study the binary perceptron case in which the longitudinal energy E is deﬁned interms of a set of αN patterns { ξ µ } µ with µ ∈ { , . . . , αN } , where each pattern is a binary vectorof length N , ξ µj = ± : E ( σ ) = αN (cid:88) µ =1 Θ  − √ N (cid:88) j ξ µj σ j  (A7)where Θ ( x ) is the Heaviside step function: Θ ( x ) = 1 if x > , Θ ( x ) = 0 otherwise. The energythus simply counts the number of classiﬁcation errors of the perceptron, assuming that the desiredoutput for each pattern in the set is (this choice can always be made without loss of generalityfor this model when the input patterns are random i.i.d. as described below). A diﬀerent form forthe energy function is treated in sec. A 1.We consider the case in which the patterns entries are extracted randomly and independentlyfrom an unbiased distribution, P (cid:0) ξ µj (cid:1) = δ (cid:0) ξ µj − (cid:1) + δ (cid:0) ξ µj + 1 (cid:1) , and we want to study the typicalproperties of this system by averaging over the quenched disorder introduced by the patterns. Weuse the replica method, which exploits the transformation: (cid:104) log Z (cid:105) ξ = lim n → (cid:104) Z n (cid:105) ξ − n = lim n → (cid:104) (cid:81) nc =1 Z (cid:105) ξ − n (A8)where (cid:104)·(cid:105) ξ denotes the average over the disorder. We thus need to replicate the whole system n times, and therefore we have two replica indices for each spin. We will use indices c, d = 1 , . . . , n for the “virtual” replicas introduced by the replica method, to distinguish them from the indices a and b used for the Trotter replicas. The average replicated partition function of eq. (A6) is thuswritten as: (cid:104) Z n eﬀ (cid:105) ξ = e nNK (cid:42)(cid:90) (cid:89) caj dµ (cid:0) σ caj (cid:1) (cid:89) caj e γσ cj σ c ( a +1) j (cid:89) µca  Θ  √ N (cid:88) j ξ µj σ caj  (cid:16) − e − βy (cid:17) + e − βy (cid:43) ξ (A9)where we changed the sum over all conﬁgurations into an ( n × y × N -dimensional) integral, using thecustomary notation dµ ( σ ) = δ ( σ −

1) + δ ( σ + 1) with δ ( · ) denoting the Dirac-delta distribution.Here and in the following, all integrals will be assumed to range over the whole R unless otherwisespeciﬁed.We introduce new auxiliary variables λ caµ = √ N (cid:80) j ξ µj σ caj via additional Dirac-deltas: Note that the parameter n has a diﬀerent meaning in main text, cf. sec. A 1. (cid:104) Z n eﬀ (cid:105) ξ = e nNK (cid:90) (cid:89) caj dµ (cid:0) σ caj (cid:1) (cid:89) caj e γσ caj σ c ( a +1) j (cid:90) (cid:89) µca dλ caµ (cid:89) µca (cid:16) Θ (cid:2) λ caµ (cid:3) (cid:16) − e − βy (cid:17) + e − βy (cid:17) ×× (cid:42)(cid:89) µca δ  λ caµ − √ N (cid:88) j ξ µj σ caj (cid:43) ξ (A10)We then use the integral representation of the delta δ ( x ) = (cid:82) d ˆ x π e ix ˆ x , and perform the averageover the disorder, to the leading order in N : (cid:42)(cid:89) µca δ  λ caµ − √ N (cid:88) j ξ µj σ caj (cid:43) ξ = (cid:90) (cid:89) µca d ˆ λ caµ π (cid:89) µca e i ˆ λ caµ λ caµ (cid:89) µ exp  − (cid:88) cdab ˆ λ caµ ˆ λ dbµ  N (cid:88) j σ caj σ dbj  (A11)Next, we introduce the overlaps q ca,db = N (cid:80) j σ caj σ dbj via Dirac-deltas (note that due to sym-metries and the fact that the self-overlaps are always we have ny ( ny − / overlaps overall),expand those deltas introducing conjugate parameters ˆ q ca,db (as usual for these parameters in thesemodels, we absorb away a factor i and integrate them along the imaginary axis, without explicitlynoting this), and ﬁnally factorize over the site and pattern indices: (cid:104) Z n eﬀ (cid:105) ξ = e nNK (cid:90) (cid:89) c,a>b dq ca,cb d ˆ q ca,cb N π (cid:89) c>d,ab dq ca,db d ˆ q ca,db N π ×× e − N (cid:80) c,a>b q ca,cb ˆ q ca,cb − N (cid:80) c>d,ab q ca,db ˆ q ca,db × G NS × G αNE (A12) G S . = (cid:90) (cid:89) ca dµ ( σ ca ) e (cid:80) c,a>b ˆ q ca,cb σ ca σ cb + (cid:80) c>d,ab ˆ q ca,db σ ca σ db + γ (cid:80) ca σ ca σ c ( a +1) (A13) G E . = (cid:90) (cid:89) ca dλ ca d ˆ λ ca π (cid:89) ca (cid:16) Θ [ λ ca ] (cid:16) − e − βy (cid:17) + e − βy (cid:17) × (A14) × e − (cid:80) ca ( ˆ λ ca ) + i (cid:80) ca λ ca ˆ λ ca − (cid:80) c,a>b ˆ λ ca ˆ λ cb q ca,cb − (cid:80) c>d,ab ˆ λ ca ˆ λ db q ca,db We now introduce the replica-symmetric (RS) ansatz for the overlaps: q ca,db = (cid:40) q if c = dq if c (cid:54) = d (A15)and analogous for the conjugate parameters ˆ q ca,db .Note that this is the so-called “static approximation” since we neglect the dependency of theoverlap from the distance along the Trotter dimension; however, we have kept the interactionterm γ (cid:80) ca σ ca σ c ( a +1) and inserted it in the G S term (rather than writing it in terms of the overlap q ca,c ( a +1) and inserting it in the G E term where it would have been rewritten as γq ). This diﬀerence,4despite its inconsistency, is the standard procedure when performing the static approximation, andis justiﬁed a posteriori from the comparison with the numerical simulation results. We obtain: (cid:104) Z n eﬀ (cid:105) ξ = e nNK (cid:90) (cid:89) c,a>b dq ca,cb d ˆ q ca,cb N π (cid:89) c>d,ab dq ca,db d ˆ q ca,db N π ×× e − Nn y ( y − q ˆ q − N n ( n − y q ˆ q × G NS × G αNE (A16) G S = (cid:90) (cid:89) ca dµ ( σ ca ) e ˆ q (cid:80) c,a>b σ ca σ cb +ˆ q (cid:80) c>d,ab σ ca σ db + γ (cid:80) ca σ ca σ c ( a +1) (A17) G E = (cid:90) (cid:89) ca dλ ca d ˆ λ ca π (cid:89) ca (cid:16) Θ [ λ ca ] (cid:16) − e − βy (cid:17) + e − βy (cid:17) × (A18) × e − (cid:80) ca ( ˆ λ ca ) + i (cid:80) ca λ ca ˆ λ ca − q (cid:80) c,a>b ˆ λ ca ˆ λ cb − q (cid:80) c>d,ab ˆ λ ca ˆ λ db The entropic term G S can be explicitly computed as G S = (cid:90) (cid:89) ca dµ ( σ ca ) e ˆ q (cid:80) c (cid:16) ( (cid:80) a σ ca ) − (cid:80) a ( σ ca ) (cid:17) + ˆ q (cid:16) ( (cid:80) ca σ ca ) − (cid:80) c ( (cid:80) a σ ca ) (cid:17) × e γ (cid:80) ca σ ca σ c ( a +1) = e − ˆ q ny (cid:90) (cid:89) ca dµ ( σ ca ) e (ˆ q − ˆ q ) (cid:80) c ( (cid:80) a σ ca ) + ˆ q ( (cid:80) ca σ ca ) + γ (cid:80) ca σ ca σ c ( a +1) = (cid:90) Dz e − ˆ q ny (cid:34)(cid:90) (cid:89) a dµ ( σ a ) e (ˆ q − ˆ q ) ( (cid:80) a σ a ) + z √ ˆ q ( (cid:80) a σ a ) + γ (cid:80) a σ a σ a +1 (cid:35) n = (cid:90) Dz e − ˆ q ny (cid:34)(cid:90) Dz (cid:90) (cid:89) a dµ ( σ a ) e ( z √ ˆ q − ˆ q + z √ ˆ q )( (cid:80) a σ a ) + γ (cid:80) a σ a σ a +1 (cid:35) n (A19)where the notation Dz = dz √ π e − x is a shorthand for a Gaussian integral, and we used twice theHubbard-Stratonovich transformation e b = (cid:82) Dz e z √ b . The expression between square bracketsin the last line is the partition function of a 1-dimensional Ising model of size y with uniforminteractions J = γ and uniform ﬁelds h = z √ ˆ q − ˆ q + z √ ˆ q and can be computed by the well-known transfer matrix method. Note however that while usually in the analysis of the 1D Ising spinmodel it is suﬃcient to keep the largest eigenvalue of the transfer matrix in the thermodynamiclimit y → ∞ , in this case instead we need to keep both eigenvalues, since the interaction term scaleswith the size of the system. The result is: G S = (cid:90) Dz e − ˆ q ny (cid:34)(cid:90) Dz e γy (cid:88) w = ± g ( z , z , w ) y (cid:35) n (A20) g ( z , z , w ) . = cosh ( h ( z , z )) + w (cid:113) sinh ( h ( z , z )) + e − γ (A21) h ( z , z ) . = z (cid:112) ˆ q − ˆ q + z (cid:112) ˆ q (A22)5In the limit of small n we obtain: G S . = 1 n log G S + 12 ˆ q y − γy = (cid:90) Dz log (cid:34)(cid:90) Dz (cid:88) w = ± (cid:18) cosh ( h ( z , z )) + w (cid:113) sinh ( h ( z , z )) + e − γ (cid:19) y (cid:35) (A23)Note that in the limit of large y the term γy tends to − K up to terms of order y − .The energetic term G E is computed similarly, by ﬁrst performing two Hubbard-Stratonovichtransformations which allow to factorize the indices c and a , and then explicitly performing theinner integrals: G E = (cid:90) (cid:89) ca dλ ca d ˆ λ ca π (cid:89) ca (cid:16) Θ [ λ ca ] (cid:16) − e − βy (cid:17) + e − βy (cid:17) ×× e − (cid:80) ca ( ˆ λ ca ) + i (cid:80) ca λ ca ˆ λ ca − q (cid:80) c (cid:16) ( (cid:80) a ˆ λ ca ) − (cid:80) a ( ˆ λ ca ) (cid:17) − q (cid:16) ( (cid:80) ca ˆ λ ca ) − (cid:80) c ( (cid:80) a ˆ λ ca ) (cid:17) = (cid:90) Dz (cid:34)(cid:90) Dz (cid:34)(cid:90) dλd ˆ λ π (cid:16) Θ [ λ ] (cid:16) − e − βy (cid:17) + e − βy (cid:17) e − − q ( ˆ λ ) + i ˆ λ ( λ − z √ q − q − z √ q ) (cid:35) y (cid:35) n = (cid:90) Dz (cid:20)(cid:90) Dz (cid:20) − (cid:16) − e − βy (cid:17) H (cid:18) z √ q − q + z √ q √ − q (cid:19)(cid:21) y (cid:21) n (A24)where H ( x ) = erfc (cid:16) x √ (cid:17) . In the limit of small n and of large y we ﬁnally obtain: G E . = 1 n log G E = (cid:90) Dz log (cid:90) Dz exp (cid:18) − βH (cid:18) z √ q − q + z √ q √ − q (cid:19)(cid:19) (A25)Using equations (A23) and (A25), we obtain the expression for the action: φ . = 1 N (cid:104) log Z eﬀ (cid:105) = extr q ,q , ˆ q , ˆ q (cid:26) y q ˆ q − y ( y − q ˆ q −

12 ˆ q y + G S + α G E (cid:27) (A26)In order to obtain a ﬁnite result in the limit of y → ∞ , we assume the following scalings for theconjugated order parameters: ˆ q = ˆ p y (A27) ˆ q = ˆ p y (A28)With these, we ﬁnd the following ﬁnal expressions: φ = extr q ,q , ˆ p , ˆ p (cid:26) q ˆ p − q ˆ p + G S + α G E (cid:27) (A29) G S = (cid:90) Dz log (cid:20)(cid:90) Dz (cid:18)(cid:113) ˆ k ( z , z ) + β Γ (cid:19)(cid:21) (A30) ˆ k ( z , z ) = z (cid:112) ˆ p − ˆ p + z (cid:112) ˆ p (A31) G E = (cid:90) Dz log (cid:90) Dz exp ( − βH ( k ( z , z ))) (A32) k ( z , z ) = z √ q − q + z √ q √ − q (A33)6The parameters q , q , ˆ p and ˆ p are found by solving the system of equations obtained by settingthe partial derivatives of φ with respect to those parameters to : ˆ p = αβ √ − q (cid:90) Dz (cid:82) Dz e − βH ( k ( z ,z )) G ( k ( z , z )) (cid:16) z √ q − q − z √ q (cid:17)(cid:82) Dz e − βH ( k ( z ,z )) (A34) ˆ p = αβ (cid:113) (1 − q ) ( q − q ) × (A35) × (cid:90) Dz (cid:82) Dz e − βH ( k ( z ,z )) G ( k ( z , z )) (cid:16) z (cid:112) q ( q − q ) + z (1 − q ) (cid:17)(cid:82) Dz e − βH ( k ( z ,z )) q = 1 (cid:113) ˆ k ( z , z ) + β Γ × (A36) × (cid:90) Dz (cid:82) Dz sinh (cid:18)(cid:113) ˆ k ( z , z ) + β Γ (cid:19) ˆ k ( z , z ) (cid:16) z √ ˆ p − ˆ p − z √ ˆ p (cid:17)(cid:82) Dz cosh (cid:18)(cid:113) ˆ k ( z , z ) + β Γ (cid:19) q = 1 (cid:113) ˆ k ( z , z ) + β Γ × (A37) × (cid:90) Dz (cid:82) Dz sinh (cid:18)(cid:113) ˆ k ( z , z ) + β Γ (cid:19) ˆ k ( z , z ) (cid:16) z √ ˆ p − ˆ p (cid:17)(cid:82) Dz cosh (cid:18)(cid:113) ˆ k ( z , z ) + β Γ (cid:19) Once these are found, we can use them to compute the action φ and the average values of thelongitudinal energy and the transverse ﬁelds, and ﬁnally of the Hamiltonian: (cid:68) ˆ H (cid:69) ξ = N (cid:0) ¯ E − Γ ¯ T (cid:1) (A38) ¯ E = 1 N (cid:104) E ( { ˆ σ z } ) (cid:105) ξ = − ∂φ∂β = α (cid:90) Dz (cid:82) Dz e − βH ( k ( z ,z )) H ( k ( z , z )) (cid:82) Dz e − βH ( k ( z ,z )) (A39) ¯ T = 1 N (cid:10) ˆ σ xj (cid:11) = ∂φ∂ ( β Γ) = (cid:90) Dz (cid:82) Dz β Γ sinh (cid:16) √ ˆ k ( z ,z ) + β Γ (cid:17) √ ˆ k ( z ,z ) + β Γ (cid:82) Dz cosh (cid:18)(cid:113) ˆ k ( z , z ) + β Γ (cid:19) (A40)where the notation (cid:104)·(cid:105) ξ denotes the fact that we performed both the average over the quencheddisorder and the thermal average.7 a. Small Γ limit It can be veriﬁed that in the limit Γ → the equations (A29)-(A33) reduce to the classical case,in the RS description. In this limit, q → (i.e., the Trotter replicas collapse), which leads to: G E = (cid:90) Dz log (cid:18)(cid:0) − e − β (cid:1) H (cid:18) z (cid:114) q − q (cid:19) + e − β (cid:19) . (A41)For Γ = 0 and q = 1 we also have the identity: −

12 ˆ p q + G S = −

12 ˆ p + (cid:90) Dz log 2 cosh (cid:16) z (cid:112) ˆ p (cid:17) . (A42)Putting these two expressions back in eq. (A29) we recover the classical expression where ˆ p assumes the role of the usual conjugate parameter ˆ q in the RS analysis of ref. [24].In order to study in detail how this classical limit is reached, however, we need to expand thesaddle point equations around this limit. To to this, we deﬁne (cid:15) = 1 − q (cid:28) . From equation (A35),expanding to the leading order, we obtain the scaling ˆ p = ˆ c √ (cid:15) , with ˆ c =  √ − q (cid:90) Dz G (cid:16) z (cid:113) q − q (cid:17) e − β + (1 − e − β ) H (cid:16) z (cid:113) q − q (cid:17)  (cid:20)(cid:90) Dz exp ( − βH ( z )) z (cid:21) . (A43)Then, we use this scaling in equation (A37) and we expand it, ﬁrst using β Γ (cid:28) and then (cid:15) (cid:28) .We obtain the approximate expression: (cid:15) = β Γ −√ ˆ c (cid:15) + √ c + √ (cid:15) ) (cid:15) / F (cid:16) √ (cid:113) ˆ c √ (cid:15) (cid:17) ˆ c / (A44)where F ( x ) = √ π e − x erﬁ ( x ) is the Dawson’s function. For a given β (from which we obtain ˆ c viaeq. (A43)), this equation can be solved numerically to obtain (cid:15) (and thus q and ˆ p ) as a functionof Γ . This expression has always the solution (cid:15) = 0 , which correspond to the purely classical case.There is a critical Γ below which (cid:15) = 0 is also the only solution; above that, two additional solutionsappear at (cid:15) > , of which the largest is the physical one. Therefore, the classical limit is not achievedcontinuously, but rather with a ﬁrst-order transition (although the step is tiny).

1. Energy function with stability

We can generalize the energy function eq. (A7) to take into account, for those patterns that aremisclassiﬁed, by how much the classiﬁcation is wrong: E ( σ ) = αN (cid:88) µ =1 Θ  − √ N (cid:88) j ξ µj σ j   − √ N (cid:88) j ξ µj σ j  r . (A45) This follows from (cid:82) Dz cosh ( a z + b z ) = e a cosh ( b z ) . r = 0 . Here, we study the case r = 1 . Note that thisparameter is called n in the main text: that notation was borrowed from ref. [35], but here wechange it in order to avoid confusion with the number of replicas. While the ground states in theSAT phase of the classical model are unaﬀected, the system can have diﬀerent properties for ﬁnite β . This change only aﬀects the G E term. Equation (A24) becomes (with the deﬁnition of eq. (A33)): G E = (cid:90) Dz (cid:20)(cid:90) Dz (cid:20) e βy √ − q ( k ( z ,z )+ βy √ − q ) H (cid:18) k ( z , z ) + βy (cid:112) − q (cid:19) ++ H ( − k ( z , z )) (cid:105) y (cid:105) n . (A46)In the limit of large y we have the modiﬁed version of eq. (A25): G E = 1 n log G E = (cid:90) Dz log (cid:90) Dz exp (cid:16) − β (cid:112) − q [ G ( k ( z , z )) − k ( z , z ) H ( k ( z , z ))] (cid:17) (A47)The saddle point equations (A34) and (A35) become: ˆ p = − αβ (cid:90) Dz (cid:82) Dz exp (cid:0) − β √ − q A ( z , z ) (cid:1) H ( k ( z , z )) (cid:16) z √ q − q − z √ q (cid:17)(cid:82) Dz exp (cid:0) β √ − q A ( z , z ) (cid:1) (A48) ˆ p = αβ (cid:90) Dz (cid:82) Dz exp (cid:0) − β √ − q A ( z , z ) (cid:1) H ( k ( z , z )) (cid:82) Dz exp (cid:0) β √ − q A ( z , z ) (cid:1) (A49)where A ( z , z ) = G ( k ( z , z )) − k ( z , z ) H ( k ( z , z )) . a. Small Γ limit As in the previous case, it can be checked that for Γ → , we have q → and eq. (A47) becomesthe expression for the classical model under the RS ansatz: G E = (cid:90) Dz log (cid:16) e β √ − q ( k ( z )+ β √ − q ) H (cid:16) k ( z ) + β (cid:112) − q (cid:17) + H ( − k ( z )) (cid:17) (A50)where k ( z ) = z (cid:113) q − q . Also, eq. (A42) still holds, and ˆ p takes the role of the usual parameter ˆ q in the classical RS analysis. In this case, however, we no longer have ˆ p → ∞ ; rather, it tends toa ﬁnite value: ˆ p = αβ (cid:90) Dz (cid:32) − H ( − k ( z )) e β √ − q ( k ( z )+ β √ − q ) H (cid:0) k ( z ) + β √ − q (cid:1) + H ( − k ( z )) (cid:33) (A51)Therefore, the scaling of (cid:15) = 1 − q is diﬀerent in this case. We ﬁnd (using the deﬁnition ofeq. (A31)): − q = β Γ (cid:90) Dz e − ˆ p − ˆ p cosh (cid:0) z √ ˆ p (cid:1) (cid:90) Dz k ( z , z )  cosh (cid:16) ˆ k ( z , z ) (cid:17) − sinh (cid:16) ˆ k ( z , z ) (cid:17) ˆ k ( z , z )  (A52)9Therefore, the convergence to the classical case is smooth. Appendix B: Estimation of the local energy and entropy landscapes with the cavity method

In order to compute the local landscapes of the energy and the entropy around a referenceconﬁguration (Fig. 3), we used the Belief Propagation (BP) algorithm, a cavity method message-passing algorithm that has been successfully employed numerous times for the study of disorderedsystems [47]. In the case of single-layer binary perceptrons trained on random unbiased i.i.d.patterns, it is believed that the results of this algorithm are exact in the limit of N → ∞ , at leastup to the critical value α c ≈ . [48].For a full explanation of the BP equations for binary perceptrons, we refer the interested readerto the Appendix of ref. [22]. Here, we provide only a summary. The BP equations involve twosets of quantities (called “messages”), representing cavity marginal probabilities associated witheach edge in a factor graph representation of the (classical) Boltzmann distribution induced by theenergy function (A7). To each edge in the graph linking the variable node i with the factor node µ , are associated two messages, m i → µ and ˆ m µ → i . These are determined by solving iteratively thefollowing system of equations: m i → µ = tanh (cid:88) ν (cid:54) = µ tanh − ( ˆ m ν → i )  (B1) ˆ m µ → i = ξ i g ( a µ → i , b µ → i ) (B2)where: g ( a, b ) = H (cid:0) a − b (cid:1) − H (cid:0) a +1 b (cid:1) H (cid:0) a − b (cid:1) + H (cid:0) a +1 b (cid:1) (B3) a µ → i = (cid:88) j (cid:54) = i ξ µj m j → µ (B4) b µ → i = (cid:115)(cid:88) j (cid:54) = i (cid:0) − m j → µ (cid:1) (B5)(as for the previous section, we used the deﬁnition H ( x ) = erfc (cid:16) x √ (cid:17) .)Once a self-consistent solution is found, these quantities can be used to compute, using standardformulas, all thermodynamic quantities of interest, in particular the typical (equilibrium) energy andthe entropy of the system. A numerically accurate implementation of these equations is availableat ref. [49].It is also possible to compute those same thermodynamic quantities in a neighborhood of somearbitrary reference conﬁguration w = { w i } i . This is achieved by adding an external ﬁeld in thedirection of that conﬁguration, which amounts at this simple modiﬁcation of eq. (B1): m i → µ = tanh (cid:88) ν (cid:54) = µ tanh − ( ˆ m ν → i ) + λw i  (B6)By varying the auxiliary parameter λ , we can control the size of the neighborhood under consider-ation (the larger λ , the narrower the neighborhood); the typical normalized Hamming distance from0the reference of the conﬁgurations that are considered by this modiﬁed measure can be obtainedfrom the ﬁxed-point BP messages for any given λ by this formula: d = 12 (cid:32) − N (cid:88) i m i w i (cid:33) (B7)where the m i are the total magnetizations: m i = tanh (cid:32)(cid:88) ν tanh − ( ˆ m ν → i ) + λw i (cid:33) (B8)In order to produce the energy landscape plots of Figs. 3a and 3b, we simply ran this algorithm atinﬁnite temperature, varying λ and plotting the energy density shift from the center as a functionof d . This gives us an estimate of the most probable energy density shift which would be obtainedby moving in a random point at distance d from the reference.The plot in Fig. 3c was similarly obtained by setting the temperature to and computing theentropy density instead, which in this context is then simply the natural logarithm of the numberof solutions in the given neighborhood, divided by N . Appendix C: Numerical simulations details of the annealing protocols1. Quantum annealing protocol

In this section we provide the details of the QA results presented in Fig. 2. The simulationswere performed using the RRR Monte Carlo method [38]. We ﬁxed the total number of spin ﬂipattempts at τ N y · and followed a linear protocol for the annealing of Γ , starting from Γ = 2 . and reaching down Γ = 0 . We actually divided the annealing in τ steps, where during each step Γ was kept constant and decreased by ∆Γ = Γ − Γ τ after each step. In the ﬁgure, we have shownthe results for N = 4001 and τ = 4 ; the results for N = 1001 , and for τ = 1 , were essentiallyindistinguishable at that level of detail.

2. Classical simulated annealing protocol

The results for SA presented in Fig. 2 used an annealing protocol in β designed to make a directcomparison to QA: we found analytically a curve β equiv (Γ) such that the classical equilibriumenergy would be equal to the longitudinal component of the quantum system energy, eq. (A39).The classical equilibrium energy was computed from the equations in ref. [24]. The result is shownin Fig. 4. The vertical jump to β = 20 is due to the transition mentioned in sec. A 0 a; as shown inFig. 2, the SA protocol in the regime we tested gets stuck well before this transition.The SA annealing protocol thus consisted in setting β = β equiv (Γ) and decreasing linearly Γ from . to , like for the QA case. We ﬁxed the total number of spin ﬂip attempts at τ N · and used τ = 4 , , ; as for the QA case, the annealing process was divided in τ steps.Other more standard annealing protocols (e.g. linear or exponential or logarithmic) yielded verysimilar qualitative results, as expected from the analysis of ref. [35].1 equ i v a l en t i n v e r s e t e m pe r a t u r e 𝛽 transverse field 𝛤 Figure 4. The curve β equiv (Γ) for α = 0 . corresponding to a quantum system at β = 20 . c l a ss i c a l ene r g y den s i t y 𝐸 / 𝑁 transverse field 𝛤 SQA 𝑦 = 𝑦 = 𝑦 = 𝑦 = 𝑦 = 𝑦 = 𝑦 → ∞ Figure 5. Comparison between theory and simulations for the average classical energy density at diﬀerentvalues of y . The three simulations curves (depicted in shades of blue) are the same shown in Fig. 2. Atlarge Γ , each of them is in good agreement with its corresponding analytical curve (depicted in shades ofred/orange). All the curves basically coalesce at small Γ . In the intermediate regime, the theory and thesimulations exhibit a discrepancy, due to the static approximation used in computing the analytical curves. Appendix D: Additional numerical results on the annealing processes1. Additional comparisons between theory and simulations

Fig. 2 compares the result of Monte Carlo simulations with the theoretical predictions for theclassical component of the energy, eq. (A39), and the transverse overlap, eq. (A37). Fig. 5 comparesthe same simulation results with the analytical curves at ﬁnite y instead. This shows a relativelysmall systematic oﬀset (due to the static approximation) at intermediate values of Γ , while theagreement is good at both large and small Γ .Fig. 6 shows the comparison with the y → ∞ curve for the expectation of the full quantumHamiltonian, eq. (A38), using the same data. The agreement is remarkable, and a close inspection2 -2.5-2-1.5-1-0.50 0 0.5 1 1.5 2 2.5 a v e r age ha m il t on i an den s i t y ⟨ 𝐻 ⟩ / 𝑁 transverse field 𝛤 𝑦 = 𝑦 = 𝑦 = Figure 6. Comparison between theory and simulations for the average of the Hamiltonian density, eq. (A38)divided by N . Same data as Fig. 2. The numerical curves are very close to the theoretical one at this levelof detail. A close inspection reveals that the agreement improves with increasing y . reveals that the curves from the simulation tend towards the theoretical one as y increases, i.e. inthe quantum limit.

2. Experiments with two-layer networks

We performed additional experiments using two-layer fully-connected binary networks, the so-called committee machines. Previous results obtained with the robust-ensemble measure [21] showedthat this case is quite similar to that of single layer networks. In particular, standard SimulatedAnnealing suﬀers from an exponential slow-down as the system size increases even moderately, whilealgorithms that are able to target the dense states do not suﬀer from the trapping in meta-stablestates. Indeed, we found the latter feature to be true in the quantum annealing scenario.The model in this case is deﬁned by a modiﬁed energy function (cf. eq. (A7)): E ( σ ) = αN (cid:88) µ =1 Θ  − K (cid:88) k =1 sgn N/K (cid:88) j =1 ξ µj σ kj  (D1)where now the N spin variables are divided in groups of K hidden units, and consequently the spinvariables σ kj have two indices, k = 1 , . . . , K for the hidden unit and j = 1 , . . . , N/K for the input.Notice that the input size is reduced K -fold with respect to the previous case. The output of thesemachines is simply decided by the majority of the outputs of the individual units, and the energystill counts the number of errors. The Suzuki-Trotter transformation proceeds in exactly the sameway as for the previous cases.Like for the single-layer case, we tested the case of α = 0 . at β = 20 , and we used K = 5 units. We tested diﬀerent values of N = 1005 , , with diﬀerent values of the Trotterreplicas y = 32 , , (only y = 32 for N = 4005 ) at a ﬁxed overall running time of yN τ · spin ﬂip attempts, with τ = 4 (cf. Fig. 2). The MC algorithm and the annealing protocols were3 c l a ss i c a l ene r g y den s i t y 𝐸 / 𝑁 transverse field 𝛤 𝑁 =1005 𝑦 =32 𝑁 =2005 𝑦 =32 𝑁 =4005 𝑦 =32 𝑁 =1005 𝑦 =64 𝑁 =2005 𝑦 =64 𝑁 =1005 𝑦 =128 𝑁 =2005 𝑦 =128 Figure 7. Energy density eq. (D1) as a function of the transverse ﬁeld Γ for the two-layer binary committeemachine model with K = 5 at α = 0 . and β = 20 , with diﬀerent values of N and y , and using τ = 4 inthe overall running time (number of spin ﬂip attempts) set as yNτ · . Each curve is averaged over samples. also unchanged. The results are shown in Fig. 7: all these tests produce curves which are almostindistinguishable at this level of detail for diﬀerent N , and that seemingly tend to converge to somelimit curve for increasing y (while being almost overlapping at small transverse ﬁeld Γ ), consistentlywith the single-layer scenario. Appendix E: Real-time Quantum Annealing on small samples1. Numerical methods

Computing the evolution of the system under Quantum Annealing amounts at solving the time-dependent Schrödinger equation for the system ∂∂t | ψ ( t ) (cid:105) = − i ˆ H ( t ) | ψ ( t ) (cid:105) (E1)where we set (cid:125) = 1 for simplicity. In our case, the time dependence of the Hamiltonian H comesin through the varying transverse magnetic ﬁeld Γ ( t ) . We assume that Γ varies linearly with timebetween some starting value Γ and , in a total time t max . Therefore, the ﬁnal Hamiltonian isreduced to the purely classical case, ˆ H ( t max ) = E .In the following, we will always work in the basis of the ﬁnal Hamiltonian, in which everyeigenvector | σ (cid:105) corresponds to a conﬁguration σ ∈ {− , +1 } N of the spins in the z direction.Therefore, we represent | ψ ( t ) (cid:105) with a complex-valued vector of length N with entries (cid:104) σ | ψ ( t ) (cid:105) ;similarly, the ˆ H ( t ) operator is represented by a matrix of size N × N , H ( σ, σ (cid:48) ) = (cid:68) σ (cid:12)(cid:12)(cid:12) ˆ H (cid:12)(cid:12)(cid:12) σ (cid:48) (cid:69) . Thestructure of this matrix is very sparse: the diagonal elements H ( σ, σ ) correspond to the classicalenergies E ( σ ) , while the only non-zero diagonal elements are those elements H ( σ, σ (cid:48) ) such that σ and σ (cid:48) are related by a single spin ﬂip, in which case the value is − Γ .4In our simulations, the initial state | ψ (0) (cid:105) was set to the ground state of the system at Γ → ∞ ,i.e. with all the spins aligned in the x direction; in our basis, this corresponds to a uniform vector, (cid:104) σ | ψ (0) (cid:105) = N − / for all σ . We simulated the evolution of the system by the short iterativeLanczos (SIL) method [41]: we compute the evolution at ﬁxed Γ for a short time interval ∆ t , thenlower Γ by a small ﬁxed amount ∆Γ , and iterate. The total evolution time is thus t max = Γ ∆Γ ∆ t .Numerical accuracy can be veriﬁed by scaling both these steps by a ﬁxed amount and observingno signiﬁcant diﬀerence in the outcome. The evolution is computed by the Lanczos algorithm withenough iterative steps to ensure suﬃcient accuracy, as determined by observing that increasing thenumber of steps does not change the outcomes signiﬁcantly. In the simulations presented here, weset Γ = 5 , ∆Γ = 10 − and ∆ t = 0 . , and we used steps in the Lanczos iterations.At the end of the annealing process, we could retrieve the ﬁnal probability distribution for eachconﬁguration of the spins as p ( σ ) = |(cid:104) σ | ψ ( t max ) (cid:105)| .

2. Sample selection

Given the exponential scaling with N of the SIL algorithm, simulations are necessarily restrictedto small values of N . We used N = 21 . At these system sizes, there is a very large sample-to-samplevariability. Furthermore, the energy barriers are generally small enough for the classical SimulatedAnnealing to perform well.In order to obtain small but challenging samples, in which we could also study the structure ofthe solutions, we proceeded as follows: we extracted at random samples with P = 17 patternseach (corresponding to α (cid:39) . , close to the critical value of . which is valid for large systems),and selected those which had at least a certain minimum number of solutions (note that in suchsmall systems we can easily enumerate all of the (cid:39) · conﬁgurations and check their energy).We arbitrarily chose solutions as the threshold. We then ran both Simulated Annealing witha fast schedule (with τ = 1) and Simulated Quantum Annealing with τ = 1 and a large numberof Trotter replicas ( y = 512 ), and selected those samples in which SA failed while SQA succeeded.This left us with samples, which we then analyzed in detail and over which we performed thereal-time QA simulations. a. Randomized samples For each of the selected samples, we generated a corresponding randomized version by permutingrandomly the values of the energy associated to each conﬁguration. This procedure maintainsunaltered the spectrum of the energies (so that for example the classical Boltzmann distribution atthermodynamic equilibrium remains unchanged), but completely destroys the geometric features ofthe energy landscape. We used these randomized samples as a benchmark against the measurementsperformed in our analysis.

3. Analysis a. QA vs SQA

We compared the results of real-time QA with the SQA Monte Carlo results, analyzing eachof the selected samples individually. In particular, we compared the values of the average5longitudinal energy as a function of Γ for the two algorithms. As shown in Fig. 8, the agreement isexcellent, and the system always gets very close to zero energy. In the same ﬁgure, we show thatthe same annealing protocol however gives substantially diﬀerent (and rather worse) results on therandomized samples, reﬂecting the fact that the geometrical features of the landscape are crucial(we veriﬁed on a few cases that the results on the randomized samples could be improved by slowingdown the annealing process, but we could not get to the same results as for the original systems evenwith a -fold increase in total time). Note that the sample-to-sample variability in these curvesappears to be fairly small due to our sample ﬁltering process; we veriﬁed in a preliminary analysisthat the agreement is generally excellent also without the ﬁltering conditions, e.g. on instances thathave no solutions at all. b. Other measurements We also performed a number of measurements on the ﬁnal conﬁguration reached by the QAalgorithm (both for the original samples and the randomized ones) and studied the properties ofthe ﬁnal probability distribution p ( σ ) . These are the quantities that we computed, reported intable I: • The average value of the energy (cid:104) E (cid:105) = (cid:80) σ E ( σ ) p ( σ ) . • The probability of ﬁnding a solution P SOL = (cid:80) σ : E ( σ )=0 p ( σ ) . • The probability and the energy of the most probable conﬁguration, p ( σ (cid:63) ) and E ( σ (cid:63) ) , where σ (cid:63) = arg max σ p ( σ (cid:63) ) . • The inverse participation ratio

IPR = (cid:80) σ p ( σ ) , to assess the concentration of the ﬁnaldistribution. (Qualitatively analogous results are obtained using the Shannon entropy.) Thismeasure however does not take into account the geometric structure of the distribution: forinstance, if p ( σ ) were non-zero on just to conﬁgurations σ and σ , the IPR would be veryhigh, but it would not be able to discriminate between the cases in which σ and σ are closeto each other or far apart in Hamming distance. • The mean distance between conﬁgurations, deﬁned as ¯ d = (cid:80) σ,σ (cid:48) p ( σ ) p ( σ (cid:48) ) d ( σ, σ (cid:48) ) , where d ( σ, σ (cid:48) ) is the normalized Hamming distance between conﬁgurations. This measure is usefulsince it reﬂects the geometric features of the ﬁnal measure: it can only be low if the mass ofthe probability is concentrated spatially (in particular, it is zero if and only if p ( σ ) is a deltafunction).As can be seen from the table, the results are generally in agreement with the qualitative picturedescribed in the main text, especially when compared to the randomized benchmark: the system isable to reach very low energies (cid:104) E (cid:105) , the probability of solving the problem P SOL is very high, themeasure is rather concentrated on a few good conﬁgurations and those conﬁgurations are close toeach other (high

IPR , low ¯ d ).For the original samples only, we also looked at the ﬁnal conﬁguration from the Monte Carlo SQAprocess, σ SQA , and computed its ranking according to p ( σ ) , which we denoted as r SQA . A rankingof implies σ SQA = σ (cid:63) . All rankings are very small, the largest ones generally corresponding tosamples with the largest number of solutions and the less concentrated distributions. This furtherattests to the good agreement between the QA and SQA processes.6 ene r g y den s i t y 𝐸 / 𝑁 QAQA smoothedSQAQA on rand. model00.10.20.30.4 ene r g y den s i t y 𝐸 / 𝑁 ene r g y den s i t y 𝐸 / 𝑁 ene r g y den s i t y 𝐸 / 𝑁 ene r g y den s i t y 𝐸 / 𝑁 transverse field 𝛤 𝛤 𝛤 𝛤 Figure 8. Comparison between real-time QA and Monte Carlo SQA for small samples with N = 21 . Theﬁgures show the mean value of the longitudinal energy density (cid:104) E (cid:105) /N as a function of the transverse ﬁeld Γ . The agreement between the two algorithms is quite remarkable. The QA curves (shown in gray) displaysome oscillatory behavior (not visible at this level of zoom because the oscillations are fast) which howeveralways tends to die out as Γ goes to . The dotted blue curves show a smoothing of these oscillations. Thered curves show the results of the same annealing process on a randomized version of the correspondingsample (see text for details). sample Original samples P SOL (cid:104) E (cid:105) /N E ( σ (cid:63) ) p ( σ (cid:63) ) IPR ¯ d r SQA mean 56.65 0.996578 0.009346 0.0 0.239781 0.123921 0.126595 - sample

Randomized samples P SOL (cid:104) E (cid:105) /N E ( σ (cid:63) ) p ( σ (cid:63) ) IPR ¯ d mean 56.65 0.094428 1.274798 0.5 0.003148 0.000726 0.499339 Table I. Results for the small samples (original an randomized) at the end of the QA process. Thesamples are the same as for Figg. 8 and 9, where they are arranged in row-major order. The second columnshows the number of solutions; the other columns are described in the text. c. Local entropies In order to assess whether the denser ground states were favored in the ﬁnal conﬁguration withrespect to more isolated solutions we compared the mean local entropy curves weighted accordingto p ( σ ) with those averaged over all the solutions. More precisely, we deﬁne C ( n ) as the set ofthe n conﬁgurations with highest probability, and n w as the number of conﬁgurations required toachieve a cumulative probability of t , i.e. the lowest n such that (cid:80) σ ∈ C ( n ) p ( σ ) ≥ w . We also deﬁne K ( σ, d ) as the number of solutions at normalized Hamming distance from σ lower or equal to d .Then the mean local entropy curve weighted with p is then deﬁned as: φ w ( d ) = 1 N (cid:80) σ ∈ C ( n w ) p ( σ ) log K ( σ, d ) (cid:80) σ ∈ C ( n w ) p ( σ ) . (E2)Denoting by S = { σ | E ( σ ) = 0 } the set of all the solutions, we also compute the ﬂat average ofthe local entropies over S : φ SOL ( d ) = 1 N |S| (cid:88) σ ∈S log K ( σ, d ) . If p concentrates on denser solutions, we expect that the φ w curves should be generally higherthan the φ SOL curves. Indeed, the results conﬁrm this scenario, as shown in Fig. 9, where we used w = 0 . . (This value ensured that C ( n w ) ⊆ S for all samples and thus that all the local entropiesare ﬁnite; apart from this, the results are quite insensitive to the choice of w .) Note that, in thelimit of large system sizes, the φ SOL curves would be dominated by isolated solutions and displaya gap around zero distances; the fact that this is not visible in Fig. 9 is purely a ﬁnite size eﬀect;the φ w curves on the other hand should be roughly comparable to those shown in Fig. 3. d. Energy gaps As mentioned in the introduction of the main text, it is well known that, according to theadiabatic theorem, the eﬀectiveness of the QA process depends on the relation between the rate ofchange of the Hamiltonian and the size of the gap between the ground state of the system H andthe ﬁrst excited state H : smaller gaps require a slower annealing process. Therefore, we performeda static analysis of the energy spectrum of each of the samples at varying Γ , and computed thegap H − H , comparing the results with those for the randomized versions of the samples. Theresults are shown in Fig. 10. For the original samples, the gap only vanishes in the limit of Γ → (which is expected since the ground state at Γ = 0 is degenerate). For the randomized samples, onthe other hand, the gap nearly closes at non-zero Γ , displaying the characteristics of an “avoidedcrossing” (see the ﬁgure upper inset), which is the type of phenomenon that is known to hamperthe performance of QA algorithms. Indeed, the values of Γ where these avoided crossings occur areprecisely those at which the mean value of H found by the QA algorithm deviates from the groundstate H , thereby getting stuck as shown in Fig. 8. [1] P. W. Shor, in Foundations of Computer Science, 1994 Proceedings., 35th Annual Symposium on (IEEE,1994) pp. 124–134. m ean l o c a l en t r op y weighted with 𝑝 averaged on solutions00.050.10.150.20.25 m ean l o c a l en t r op y m ean l o c a l en t r op y m ean l o c a l en t r op y m ean l o c a l en t r op y distance 0 0.1 0.2 0.3 0.4distance 0 0.1 0.2 0.3 0.4distance 0 0.1 0.2 0.3 0.4distance Figure 9. Comparison of the average local entropies weighted according to the probability distributionobtained at the end of the annealing process, with those obtained from a ﬂat average over all solutions, for diﬀerent small samples with N = 21 . The samples are the same as in Fig. 8. ene r g y gap 𝐻 ₁ - 𝐻 ₀ transverse field 𝛤 originalrandomized-1-0.9-0.8-0.7-0.6-0.50.37 0.39 0.41 ene r g y 𝐻 transverse field 𝛤 𝐻 ₁ 𝐻 ₀ Figure 10. Energy gap between the ground state H and the ﬁrst excited state H as a function of thetransverse ﬁeld Γ , for small samples with N = 21 (same as in Fig. 8). The semi-transparent solid curvesshow the results for each individual sample (blue: original; red: randomized), while the dashed lines areaverages. The behavior is qualitatively diﬀerent for the two cases: the randomized examples all displayavoided crossings at Γ (cid:39) . (the upper inset ﬁgure shows the two energy levels for one representativeexample, enlarged around the relevant region); the original examples on the other hand show no trace ofavoided crossings (with one possible exception) and are generally much higher. The lower inset shows thesame data as the main ﬁgure, but only the averages are plotted, and the range is enlarged up to Γ = 1 .[2] P. Ray, B. K. Chakrabarti, and A. Chakrabarti, Physical Review B , 11828 (1989).[3] A. Finnila, M. Gomez, C. Sebenik, C. Stenson, and J. Doll, Chemical physics letters , 343 (1994).[4] T. Kadowaki and H. Nishimori, Physical Review E , 5355 (1998).[5] E. Farhi, J. Goldstone, S. Gutmann, J. Lapan, A. Lundgren, and D. Preda, Science , 472 (2001).[6] A. Das and B. K. Chakrabarti, Reviews of Modern Physics , 1061 (2008).[7] C. Moore and S. Mertens, The nature of computation (Oxford University Press, 2011).[8] M. Born and V. Fock, Zeitschrift für Physik A Hadrons and Nuclei , 165 (1928).[9] L. Landau, Phys. Z. Sowjetunion , 1 (1932).[10] C. Zener, in Proceedings of the Royal Society of London A: Mathematical, Physical and EngineeringSciences , Vol. 137 (The Royal Society, 1932) pp. 696–702.[11] B. Altshuler, H. Krovi, and J. Roland, Proceedings of the National Academy of Sciences , 12446(2010).[12] V. Bapst, L. Foini, F. Krzakala, G. Semerjian, and F. Zamponi, Physics Reports , 127 (2013).[13] G. E. Santoro, R. Martoňák, E. Tosatti, and R. Car, Science , 2427 (2002).[14] R. Martoňák, G. E. Santoro, and E. Tosatti, Physical Review B , 094203 (2002).[15] B. Heim, T. F. Rønnow, S. V. Isakov, and M. Troyer, Science , 215 (2015). [16] T. F. Rønnow, Z. Wang, J. Job, S. Boixo, S. V. Isakov, D. Wecker, J. M. Martinis, D. A. Lidar, andM. Troyer, Science , 420 (2014).[17] M. W. Johnson, M. H. Amin, S. Gildert, T. Lanting, F. Hamze, N. Dickson, R. Harris, A. J. Berkley,J. Johansson, P. Bunyk, et al. , Nature , 194 (2011).[18] S. Boixo, T. F. Rønnow, S. V. Isakov, Z. Wang, D. Wecker, D. A. Lidar, J. M. Martinis, and M. Troyer,Nature Physics , 218 (2014).[19] W. Langbein, P. Borri, U. Woggon, V. Stavarache, D. Reuter, and A. Wieck, Physical Review B ,161301 (2004).[20] C. Baldassi, A. Ingrosso, C. Lucibello, L. Saglietti, and R. Zecchina, Physical Review Letters ,128101 (2015).[21] C. Baldassi, C. Borgs, J. T. Chayes, A. Ingrosso, C. Lucibello, L. Saglietti, and R. Zecchina, Proceed-ings of the National Academy of Sciences , E7655 (2016).[22] C. Baldassi, A. Ingrosso, C. Lucibello, L. Saglietti, and R. Zecchina, Journal of Statistical Mechanics:Theory and Experiment , P023301 (2016).[23] C. Baldassi, F. Gerace, C. Lucibello, L. Saglietti, and R. Zecchina, Physical Review E , 052313(2016).[24] W. Krauth and M. Mézard, J. Phys. France , 3057 (1989).[25] H. Sompolinsky, N. Tishby, and H. S. Seung, Physical Review Letters , 1683 (1990).[26] L. Foini, G. Semerjian, and F. Zamponi, Physical review letters , 167204 (2010).[27] G. Biroli and F. Zamponi, Journal of Low Temperature Physics , 101 (2012).[28] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, arXiv preprint arXiv:1609.07061(2016).[29] M. Courbariaux, Y. Bengio, and J.-P. David, in Advances in Neural Information Processing Systems (2015) pp. 3105–3113.[30] D. J. MacKay,

Information theory, inference and learning algorithms (Cambridge university press,2003).[31] Y. LeCun, Y. Bengio, and G. Hinton, Nature , 436 (2015).[32] S. Aaronson, Nature Physics , 291 (2015).[33] F. Barahona, Journal of Physics A: Mathematical and General , 3241 (1982).[34] H. Huang and Y. Kabashima, Physical Review E , 052813 (2014).[35] H. Horner, Zeitschrift für Physik B Condensed Matter , 291 (1992).[36] V. Bapst and G. Semerjian, Journal of Statistical Mechanics: Theory and Experiment , P06007(2012).[37] V. Bapst and G. Semerjian, in Journal of Physics: Conference Series , Vol. 473 (IOP Publishing, 2013)p. 012011.[38] C. Baldassi, Journal of Statistical Mechanics: Theory and Experiment , 033301 (2017).[39] C. Baldassi, A. Braunstein, N. Brunel, and R. Zecchina, Proceedings of the National Academy ofSciences , 11079 (2007).[40] T. E. Markland, J. A. Morrone, B. J. Berne, K. Miyazaki, E. Rabani, and D. R. Reichman, NaturePhysics .[41] B. I. Schneider, X. Guan, and K. Bartschat, Advances in Quantum Chemistry , 95 (2016).[42] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang, arXiv preprintarXiv:1609.04836 (2016).[43] L. Bottou, F. E. Curtis, and J. Nocedal, arXiv preprint arXiv:1606.04838 (2016).[44] A. Braunstein and R. Zecchina, Phys. Rev. Lett. , 030201 (2006).[45] C. Baldassi, J. Stat. Phys. , 902 (2009).[46] C. Baldassi and A. Braunstein, Journal of Statistical Mechanics: Theory and Experiment , P08008(2015).[47] M. Mézard and A. Montanari, Information, Physics, and Computation (Oxford University Press, 2009).[48] M. Mézard, Journal of Physics A: Mathematical and General , 2181 (1989).[49] “Belief Propagation code,” https://github.com/carlobaldassi/BinaryCommitteeMachineFBP.jlhttps://github.com/carlobaldassi/BinaryCommitteeMachineFBP.jl