[PDF] Non-equilibrium criticality and efficient exploration of glassy landscapes with memory dynamics

Abstract

Spin glasses are notoriously difficult to study both analytically and numerically due to the presence of frustration and metastability. Their highly non-convex landscapes require collective updates to explore efficiently. Currently, most state-of-the-art algorithms rely on stochastic spin clusters to perform non-local updates, but such "cluster algorithms" lack general efficiency. Here, we introduce a non-equilibrium approach for simulating spin glasses based on classical dynamics with memory. By simulating various classes of 3d spin glasses (Edwards-Anderson, partially-frustrated, and fully-frustrated models), we find that memory dynamically promotes critical spin clusters during time evolution, in a self-organizing manner. This facilitates an efficient exploration of the low-temperature phases of spin glasses.

Full PDF

NNon-equilibrium criticalityand eﬃcient exploration of glassy landscapes with memory dynamics

Yan Ru Pei ∗ and Massimiliano Di Ventra † Department of Physics, University of California, San DiegoLa Jolla, CA 92093

Spin glasses are notoriously diﬃcult to study both analytically and numerically due to the presenceof frustration and multiple metastable states. Their highly non-convex landscape requires collectiveupdates to explore eﬃciently. This is currently done through stochastic cluster algorithms, thoughthey lack general eﬃciency. Here, we introduce a non-equilibrium approach for simulating spinglasses based on classical dynamics with memory. We ﬁnd that memory dynamically promotes critical spin clusters during time evolution, in a self-organizing manner , thus facilitating an eﬃcientexploration of the glassy landscape. The proposed approach has broad applicability to other typesof non-convex optimization problems on general graph structures.

Introduction –

The study of spin glasses has con-tributed substantially to our understanding of a wide va-riety of phenomena , much beyond the complex mag-netic models for which they were ﬁrst introduced . Intheir most basic form, these systems are described bythe following simple Hamiltonian : H = − (cid:88) ij J ij s i s j , (1)where the spins, arranged on some d -dimensional lattice,acquire the values, s i = ±

1, and interact via couplingconstants, J ij = ±

1, with J a multivariate random vari-able taken from some distribution.Despite the deceptively simple form, the energy land-scape of the model Hamiltonian (1) is highly non-trivial for most conceivable distributions of J . Decadesof mathematical ingenuity have culminated in eﬃcient(namely, polynomial-time) algorithms for computingthe partition function of any realization of J in twodimensions , but an eﬃcient algorithm to simulateglasses in d > , with the task ofcomputing its partition function shown to be NP-hard .This hardness fundamentally limits the eﬃciency of anystochastic algorithm.Earlier approaches for simulating the model Hamil-tonian (1) were based on sequential Metropolisupdates , and modern extensions of this methodologyhave also been proposed . However, these algorithmsare generally plagued by a large dynamical critical expo-nent, making the simulation largely ineﬃcient .Later on, a method based on the synchronous updateof a large correlated cluster of spins was suggested ,which proved to be very eﬀective for the 2 d ferro-magnetic Ising model. Unfortunately, despite severalmodiﬁcations made to account for the non-locality of ∗ Email: [email protected] † Email: [email protected] frustration , these cluster algorithms still struggle forhigh-dimensional glasses. The main reason behind theirineﬃciency is the tendency for the cluster percolationprocess to be persistently hyper-critical, due to a mis-match of critical temperature and cluster percolationratio . This means that the largest cluster compo-nent generally covers the entire lattice.At the present stage, the best known method for tam-ing the above issues is based on the overlap distribu-tion of replicas (referred to as the “isoenergetic clustermove” (ICM) method) , which reverses the percola-tion ratio . Unfortunately, this replica approach reliesheavily on the dimension of the lattice, and still tends toover-percolate in the high-temperature regime. Anotherrecent trend is to use machine learning techniques to helpidentify eﬃcient clusters , by putting the hidden nodeson the edges (or plaquettes) of the lattice . In somecases, the eﬀorts towards this direction have been half-hearted attempts in under-employing the representativepower of Boltzmann machines in modeling the Boltz-mann distribution of the Ising glass, resulting in mathe-matically equivalent formulations of the traditional clus-ter algorithms .Here, instead, we propose a novel non-stochastic ap-proach to eﬃciently learn the critical clusters of the glassduring dynamics, without any algorithmic aid . In sharpcontrast to previous stochastic methods, which treat thespins and time as discrete variables, we ﬁrst linearlyrelax the spin variables, and then couple them to mem-ory variables. The coupled spin-memory system is thenevolved in continuous time.The memory variables learn from the evolution of theinteracting continuous spins, and their magnitudes corre-spond to the (non-uniform) percolation ratios that helpgenerate critical clusters. This, in turn, induces non-local updates of the spins, allowing them to easily tran- This approach is an application of the more general computingparadigm known as memcomputing , which has been successfulin the solution of a variety of problems ranging from constrainedoptimization to unsupervised learning . a r X i v : . [ c ond - m a t . d i s - nn ] F e b - ≤ σ ≤ + ≤ μ ≤ + FIG. 1: An instance of using memory to learn long-rangedynamics in a 2d ferromagnet, where both spins and memoryvariables are linearly relaxed. The memory variables “live”on the bonds (denoted by green squares), and they are cou-pled to the interaction between adjacent spin states (denotedby blue/orange circles), in such a way that positive interac-tions increase the memory magnitudes. A potential memorycluster is shaded in green, where the memory variables canbe interpreted as percolation ratios. sition between diﬀerent Gibbs states of the glass moreeﬃciently. The evolution of the spins and memory vari-ables occurs simultaneously , meaning that the memoryvariables do not “wait” for the spins to equilibrate be-fore updating themselves. This process induces a non-equilibrium criticality which persists throughout the en-tire evolution of the system, regardless of the underlyingtemperature. This means that the memory-induced clus-ters are persistently critical at all temperatures.

Memory dynamics –

To introduce the memory dynam-ics, we ﬁrst introduce the continuously relaxed spin glassHamiltonian , H = − (cid:88) ij (cid:0) J ij σ i σ j − µ ij ( σ i + σ j ) (cid:1) , σ i ∈ [ − , +1] , (2)where µ are non-uniform memory variables acting as dy-namic Lagrange multipliers for constraints on the spinmagnitude. If µ were ﬁxed in time, then the standarddynamics ∂ t σ i = −∇ σ i H = (cid:88) j (cid:0) J ij σ j − µ ij σ i (cid:1) , (3)would suﬀer from long auto-correlation times (critical slowing down), due to the presence of metastable states,even when the spins are continuously relaxed.Therefore, in order to eﬃciently escape these states,we can continuously deform the energy local minima,and transform them into saddle points by letting thememory variables to evolve as ∂ t µ ij = ( J ij σ i σ j − γ ) , (4)where γ is some constant restricting the growth of µ ij ,and µ ij ∈ [0 , µ ij , we see that the “gradient term” (cid:80) j J ij σ j forthe spins in Eq. (3) is compensated by the “cluster-like”update term − (cid:80) j µ ij σ i in the same equation (see dis-cussion below).By simulating the coupled Eqs. (3) and (4) until thesystem reaches a ﬁxed time-out, we can take s i = sgn( σ i )for a recorded state that minimizes the Ising energyin Eq. (1). Many numerical strategies can be usedto improve the stability and convergence properties ofthe simulation . They are also included in the codesof the repository PeaBrane/Ising-Simulation , whichcan be used to directly reproduce Figs. 2 and 3. The par-ticular numerical implementation we used in this work isgiven as ˙ σ i = α (cid:88) j J ij σ j − β (cid:88) j x ij σ i ˙ x ij = γC ij − y ij ˙ y ij = δx ij − ζ, (5)where C ij = ( J ij σ i σ j + 1) ∈ [0 , y is a secondary long-term memory ensuring stability , and α, β, γ, δ, ζ are time-scale parameters, ﬁxed for all system sizes. Weused the Euler method to integrate forward the aboveequations. More implementation details and the choiceof parameters are given in the Supplementary Material(SM) Section D. Dynamical critical clusters –

Before showing numericalresults, we provide an understanding of why such a mem-ory dynamics would be eﬃcient in simulating spin glasses.First of all, we note that the memory variables should not be considered as standard dual variables . Instead of be-ing coupled to the spin constraints (the second term ofEq. (2)), the memory evolution is explicitly coupled tothe state of interaction between spins, J ij σ i σ j , as writ-ten in Eq. (4), and this is crucial for simulating frustratedsystems . Furthermore, since we are simulating the sys-tem at non-equilibrium , we do not have to worry aboutusing acceptance schemes to tame numerical trunca-tion errors , which do not seem to play a major role inthe stability and eﬃciency of our dynamics . The memory variables may also be deﬁned on diﬀerent unitcells , such as on plaquettes for the fully frustrated Isingmodel . -3 -2 -1 -5 -4 -3 -2 -1 FIG. 2: Simulations performed on 1000 fully-frustrated 3 d Ising glasses (see SM E 2) sized 6 to 12 . (Left) The eﬀectivetemperature, T eff , of the memory dynamics is tracked in time (see SM E 3 for how we estimate this temperature) for the 6 lattice. The critical temperature, T c , of the fully frustrated glass is determined using the crossing of Binder’s cumulant (see SME 4). Note that the time it takes T eff to dive below T c is extremely short, less than 4 units of time, after which the memorydynamics remain persistently below T c . (Right) An arbitrary point in time t = 10 is chosen, and the cluster size distribution(CSD) is collected over the disorder realizations, and a ∆ t = 2 time window. Other than the tailing drop-oﬀ resulting fromﬁnite-size eﬀects, the CSD follows a power-law decay with the Fischer exponent being ˆ τ = 2 . ± .

01 for all simulated sizes.See Fig. 5 in the SM for further empirical evidence that the Fischer exponent is insensitive to the underlying temperature andlattice size.

If we bound the memory variables between 0 and 1 (seeSection D in the SM), we can interpret them as proba-bilities of forming open bonds in a weighted percolationprocess , from which critical clusters can be formed ,as drawn in Fig. 1.To see why the memory dynamics are critical , let usﬁrst assume that the memory variables are already atthe critical percolation threshold. If one memory vari-able µ ij is then perturbed, say, below the critical value,this eﬀect will propagate throughout the entire lattice .In turn, this will suppress the cluster-like update term − (cid:80) j µ ij σ i , making the gradient term (cid:80) j J ij σ j relativelydominant (see Eq. (3)). This will avalanche the Isingenergy to a lower value , resulting in the sudden ap-pearance of more satisﬁed interactions ( J ij s i s j > µ ij will increase until they organize to some new critical con-ﬁguration (see Eq. (4)).To provide additional evidence of criticality, we havenumerically extracted the cluster size distribution (CSD)for a fully-frustrated Ising model , as generated by thememory variables averaged over disorder at an arbitrarypoint in time (see Fig. 2). While most state-of-the-artalgorithms fail to generate critical clusters even at thecritical temperature T c , the memory clusters are per-sistently critical at all temperatures (with the estimationof non-equilibrium temperature outlined in SM E 3). Inother words, we see that the criticality is self-organizing(SOC ) through the entire simulation, and it avoids critical slowing down at all points in time . It shouldbe noted that SOC is not an intrinsic property of frus-trated short-ranged spin glasses , and this is evidentin Fig. 4 of the SM, which displays a persistently hy-percritical CSD for stochastic clusters generated by theSwensden-Wang (SW) and ICM rules. This also conﬁrmsthe ineﬃciency of traditional cluster algorithms for frus-trated systems . Finding a glassy ground state –

Finally, we show thatthe aforementioned non-equilibrium critical behavior al-lows us to ﬁnd the ground state of spin glasses eﬃ-ciently. First of all, we note that below the critical tem-perature T c , the energy landscape of a spin glass be-comes highly non-convex, and most algorithms fail toeﬃciently navigate it. For “artiﬁcial” glassy instanceswhere the ﬁnite residual entropy (ground state degen-eracy) is suppressed via the coupling of local interactionstates , the ineﬃciency of these simulations is exposedmost prominently when the temperature is lowered tothe T = 0 limit. In other words, to benchmark the ef- Note that, it is not necessary for us to algorithmically connectthe memory clusters and ﬂip them discretely (see SM A), becausesuch cluster-update features are implicitly present in the equa-tions of motion for the spin evolution (see Eq. (3)). However, forextremely frustrated and aging glasses (see SM E 2), theseoccasional algorithmic interventions do help slightly with the re-laxation time during simulations.

FIG. 3: Scalability of the median TTS for SA, ICM, andmemory dynamics on the fully frustrated 3 d Ising glass, mea-sured as the total number of sweeps (see SM B for justiﬁca-tion). Statistics are collected over 400 runs, with the shadedregion denoting the 40-th to 60-th percentile, and the ﬁttingand standard error estimation are done with generalized lin-ear regression (with the TTS logged). The estimated scal-ing constants for SA, ICM, and memory are 0 . ± . . ± .

03, and 4 . ± .

1, respectively. ﬁciency of an algorithm for glass simulations, one canrecord the relaxation time, or equivalently the time-to-solution (TTS) to the ground state over an ensemble ofhighly frustrated instances.With this goal in mind, we use a class of glass in-stances generated by fully-frustrated cubes in 3 d , withthe ground state energies known in advance (see SM E 2for an explanation of how these instances are generated).This way we can verify the correctness of the algorithm.In particular, we gather the TTS statistics for the mem-ory dynamics and see how it scales with system size, andcompare the results against simulated annealing (SA)(see SM C 1) and ICM (see SM C 3) .To perform the evaluation, we generate fully frustrated3 d glasses up to size 8 × ×

10 (beyond which the com-putational cost becomes exceedingly prohibitive), with400 randomly generated instances per size. For each size,we evaluate the eﬃciency of every algorithm by collect-ing its TTS statistics up to the 60-th percentile (see SME 1), and estimating the scaling behavior of the median TTS. The implementation used for the scalability test isdetailed in SM D 1. As shown in Fig. 3, while the scal-ing of standard stochastic algorithms are well-ﬁtted bysuper-polynomial functions (an exponential for SA and asub-exponential for ICM), the scaling of the memory dy-namics is well-ﬁtted by a polynomial up to the maximumsize we have tested. This polynomial scaling, combinedwith the critical cluster size distribution, suggests scale-invariance of the memory dynamics both spatially andtemporally . Conclusions –

In this work, we have introduced a newapproach to simulate spin glasses based on the coupling of(linearly relaxed) spins with memory variables. We haveshown numerically that the generated memory-inducedspin clusters are critical, and the memory dynamics areeﬃcient in ﬁnding the ground state of fully-frustratedIsing spin glasses in 3 d , even using the basic forward Eu-ler discretization scheme. As a future development, theintroduction of an appropriate discretization and accep-tance scheme may endow the memory dynamics withthe detailed-balance property , making it applicable tosimulating equilibrium dynamics of glasses even at ﬁnitetemperature. This would allow the algorithm to be inter-faced with modern stochastic algorithms, which may beuseful for generating critical clusters for bosonic quantumspin and gauge systems .Furthermore, a foreseeable generalization would be toapply this technique to simulating glasses on more gen-eral graph structures and interaction states . Finally,it would be interesting to study the fundamental mech-anism behind the criticality of memory, and its prop-erty of inducing nonlinear solitonic behavior in frustratedsystems which has been shown empirically in thepast . Acknowledgments –

This material is based upon worksupported by the National Science Foundation underGrant No. 2034558. Y.P. would like to thank FirasHamze for stimulating discussions on the entropic prop-erties of the tiling cubes, and Zheng Zhu for clarify-ing the implementation details of the ICM algorithm.All the numerical results presented in this study havebeen done on a single core of an AMD EPYC server.They can be reproduced using the codes in the reposi-tory

PeaBrane/Ising-Simulation . The repository isa complete suite for Ising Simulation in MATLAB thatis actively maintained and developed by Y.P. W. Barthel, A. K. Hartmann, M. Leone, F. Ricci-Tersenghi, M. Weigt, and R. Zecchina, Physical reviewletters , 188701 (2002). A. Fischer and C. Igel, in

Iberoamerican congress on pat-tern recognition (Springer, 2012) pp. 14–36. W. Bialek, A. Cavagna, I. Giardina, T. Mora, E. Silvestri,M. Viale, and A. M. Walczak, Proceedings of the NationalAcademy of Sciences , 4786 (2012). S. F. Edwards and P. W. Anderson, Journal of Physics F: Metal Physics , 965 (1975). E. Ising, Zeitschrift f¨ur Physik , 253 (1925). M. M´ezard, G. Parisi, and M. Virasoro,

Spin glass the-ory and beyond: An Introduction to the Replica Methodand Its Applications , Vol. 9 (World Scientiﬁc PublishingCompany, 1987). T. Castellani and A. Cavagna, Journal of Statistical Me-chanics: Theory and Experiment , P05012 (2005). L. Onsager, Physical Review , 117 (1944). P. W. Kasteleyn, Physica , 1209 (1961). J. Edmonds, Journal of Research of the national Bureauof Standards B , 233 (1967). F. Barahona, Journal of Physics A: Mathematical andGeneral , 3241 (1982). S. Istrail, in

Proceedings of the thirty-second annual ACMsymposium on Theory of computing (2000) pp. 87–96. L. A. Goldberg and M. Jerrum, Proceedings of the Na-tional Academy of Sciences , 13161 (2015). N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth,A. H. Teller, E. Teller, et al. , J. Chem. Phys , 1087(1953). R. J. Glauber, Journal of mathematical physics , 294(1963). S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi, science , 671 (1983). H. Suwa and S. Todo, Physical review letters , 120603(2010). Y. Iba, Transactions of the Japanese Society for ArtiﬁcialIntelligence , 279 (2001). J.-C. Walter and G. Barkema, Physica A: Statistical Me-chanics and its Applications , 78 (2015). C. M. Fortuin and P. W. Kasteleyn, Physica , 536(1972). R. H. Swendsen and J.-S. Wang, Physical review letters , 86 (1987). U. Wolﬀ, Physical Review Letters , 361 (1989). F. Niedermayer, Physical review letters , 2026 (1988). D. Kandel, R. Ben-Av, and E. Domany, Physical ReviewB , 4700 (1992). P. Coddington and L. Han, Physical Review B , 3058(1994). Y. R. Pei and M. Di Ventra, In preparation . J. Houdayer and A. K. Hartmann, Physical Review B ,014418 (2004). Z. Zhu, A. J. Ochoa, and H. G. Katzgraber, Physical re-view letters , 077201 (2015). A. Morningstar and R. G. Melko, The Journal of MachineLearning Research , 5975 (2017). L. Wang, Physical Review E , 051301 (2017). N. Le Roux and Y. Bengio, Neural computation , 1631(2008). M. Di Ventra and Y. V. Pershin, Nature Physics , 200(2013). M. Di Ventra and F. L. Traversa, J. Appl. Phys. ,180901 (2018). F. Sheldon, F. L. Traversa, and M. Di Ventra, PhysicalReview E , 053311 (2019). S. R. Bearden, Y. R. Pei, and M. Di Ventra, Scientiﬁcreports , 1 (2020). H. Manukian, Y. R. Pei, S. R. Bearden, and M. Di Ventra,Communications Physics , 1 (2020). M. X. Goemans and D. P. Williamson, Journal of theACM (JACM) , 1115 (1995). M. Kac and C. J. Thompson, Physica Norvegica , 163(1971). F. L. Traversa and M. Di Ventra, Chaos: An Interdisci-plinary Journal of Nonlinear Science , 023107 (2017). Y. Pei, PeaBrane (2020). S. Zhang and A. G. Constantinides, IEEE Transactionson Circuits and Systems II: Analog and Digital SignalProcessing , 441 (1992). V. Cataudella, G. Franzese, M. Nicodemi, A. Scala, andA. Coniglio, Physical Review E , 175 (1996). J. Villain, R. Bidaux, J.-P. Carton, and R. Conte, Journalde Physique , 1263 (1980). J. Besag, J. Roy. Statist. Soc. Ser. B , 591 (1994). M. Betancourt, arXiv preprint arXiv:1701.02434 (2017). J. S. R. Bulirsch,

Introduction to Numerical Analysis (Springer, 2010). Y.-H. Zhang and M. Di Ventra, arXiv preprintarXiv:2102.03547 (2021). M. Hassan and M. Rahman, Physical Review E ,040101 (2015). A. A. Saberi, Physics Reports , 1 (2015). P. Bak, C. Tang, and K. Wiesenfeld, Physical review let-ters , 381 (1987). J. Hesse and T. Gross, Frontiers in systems neuroscience , 166 (2014). E. Marinari, G. Parisi, and F. Ritort, Journal of PhysicsA: Mathematical and General , 327 (1995). F. Hamze, D. C. Jacob, A. J. Ochoa, D. Perera, W. Wang,and H. G. Katzgraber, Physical Review E , 043303(2018). J. C. Andresen, Z. Zhu, R. S. Andrist, H. G. Katzgraber,V. Dobrosavljevi´c, and G. T. Zimanyi, Physical reviewletters , 097203 (2013). Y. R. Pei, H. Manukian, and M. Di Ventra, Journal ofMachine Learning Research , 1 (2020). E. Marinari and G. Parisi, EPL (Europhysics Letters) ,451 (1992). M. Di Ventra, F. L. Traversa, and I. V. Ovchinnikov, Ann.Phys. (Berlin) , 1700123 (2017). S. Todo and K. Kato, Physical review letters , 047203(2001). E. Fradkin and L. Susskind, Physical Review D , 2637(1978). E. Fradkin, B. A. Huberman, and S. H. Shenker, PhysicalReview B , 4789 (1978). C. Pattison, F. Hamze, J. Raymond, and H. Katzgraber,APS , L42 (2019). D. J. Welsh and C. Merino, Journal of MathematicalPhysics , 1127 (2000). M. Toda,

Theory of nonlinear lattices , Vol. 20 (SpringerScience & Business Media, 2012). P. C. Hohenberg and B. I. Halperin, Reviews of ModernPhysics , 435 (1977). T. Tao,

Nonlinear dispersive equations: local and globalanalysis , 106 (American Mathematical Soc., 2006). W. A. Link and M. J. Eaton, Methods in ecology andevolution , 112 (2012). V. Cataudella, G. Franzese, M. Nicodemi, A. Scala, andA. Coniglio, Physical review letters , 1541 (1994). A. Grama, V. Kumar, A. Gupta, and G. Karypis,

Intro-duction to parallel computing (Pearson Education, 2003). F. Traversa and M. Di Ventra, IEEE Trans. Neural Netw.Learn. Syst. , 2702 (2015). J. W. Thomas,

Numerical partial diﬀerential equations: ﬁ-nite diﬀerence methods , Vol. 22 (Springer Science & Busi-ness Media, 2013). M. R. Garey and D. S. Johnson,

Computers and In-tractability; A Guide to the Theory of NP-Completeness (W. H. Freeman & Co., New York, NY, USA, 1990). F. Sheldon, P. Cicotti, F. L. Traversa, and M. Di Ven-tra, IEEE transactions on neural networks and learningsystems (2019). N. R. Draper and H. Smith,

Applied regression analysis ,Vol. 326 (John Wiley & Sons, 1998). A. Smith,

Sequential Monte Carlo methods in practice (Springer Science & Business Media, 2013). S. Murawski, G. Musia(cid:32)l, and G. Paw(cid:32)lowski, Computa-tional Methods in Science and Technology , 117 (2015). M. Mezard and A. Montanari,

Information, Physics, andComputation (Oxford University Press, 2009). J. Beardwood, J. H. Halton, and J. M. Hammersley, in

Mathematical Proceedings of the Cambridge PhilosophicalSociety , Vol. 55 (Cambridge University Press, 1959) pp.299–327. M. Mitchell,

An introduction to genetic algorithms (MITpress, 1998). B. Selman and H. Kautz, in

IJCAI , Vol. 93 (Citeseer,1993) pp. 290–295. G. Audemard and L. Simon, in

International Conferenceon Principles and Practice of Constraint Programming (Springer, 2012) pp. 118–126. D. R. Morrison, S. H. Jacobson, J. J. Sauppe, and E. C.Sewell, Discrete Optimization , 79 (2016). S. R. White, in

AIP Conference Proceedings , Vol. 122(American Institute of Physics, 1984) pp. 261–270. E. G. Rieﬀel, D. Venturelli, M. O Gorman, B. Do, E. M.Prystay, and V. Smelyanskiy, Quantum Information Pro-cessing , 1 (2015). C. Wang, J. D. Hyman, A. Percus, and R. Caﬂisch, Inter-national Journal of Modern Physics C , 539 (2009). G. Desjardins, A. Courville, Y. Bengio, P. Vincent, andO. Delalleau, in

Proceedings of the thirteenth internationalconference on artiﬁcial intelligence and statistics (MITPress Cambridge, MA, 2010) pp. 145–152. D. B. West et al. , Introduction to graph theory , Vol. 2(Prentice hall Upper Saddle River, NJ, 1996). B. Moln´ar, F. Moln´ar, M. Varga, Z. Toroczkai, andM. Ercsey-Ravasz, Nature communications , 1 (2018). V. Nair and G. E. Hinton, in

Icml (2010). C. Umrigar, M. Nightingale, and K. Runge, The Journalof chemical physics , 2865 (1993). A. Su´arez and R. Qu´er´e,

Stability analysis of nonlinearmicrowave circuits (Artech House, 2003). W. H. Press, S. A. Teukolsky, B. P. Flannery, and W. T.Vetterling,

Numerical recipes in Fortran 77: volume 1,volume 1 of Fortran numerical recipes: the art of scientiﬁccomputing (Cambridge university press, 1992). A. J. Bik,

Software Vectorization Handbook, The: Ap-plying Intel Multimedia Extensions for Maximum Perfor-mance (Intel Press, 2004). J. A. Nelder and R. Mead, The computer journal , 308(1965). H. Nishimori,

Statistical physics of spin glasses and infor-mation processing: an introduction , 111 (Clarendon Press,2001). Y. Ozaki, M. Yano, and M. Onishi, IPSJ Transactions onComputer Vision and Applications , 1 (2017). M. Stone, Journal of the Royal Statistical Society: SeriesB (Methodological) , 111 (1974). I. Goodfellow, Y. Bengio, A. Courville, and Y. Bengio,

Deep learning , Vol. 1 (MIT press Cambridge, 2016). E. D. Demaine and M. L. Demaine, Graphs and Combi-natorics , 195 (2007). V. Feldman, W. Perkins, and S. Vempala, SIAM Journalon Computing , 1294 (2018). D. Perera, F. Hamze, J. Raymond, M. Weigel, and H. G.Katzgraber, Physical Review E , 023316 (2020).

I. Hen, J. Job, T. Albash, T. F. Rønnow, M. Troyer, andD. A. Lidar, Physical Review A , 042325 (2015). G. Parisi, Proceedings of the National Academy of Sci-ences , 7948 (2006).

H. Jia, C. Moore, and D. Strain, Journal of Artiﬁcial In-telligence Research , 107 (2007). E. Marinari, G. Parisi, and F. Ritort, Journal of PhysicsA: Mathematical and General , 7647 (1994). A. Flaxman, in

Proceedings of the fourteenth annualACM-SIAM symposium on Discrete algorithms (Societyfor Industrial and Applied Mathematics, 2003) pp. 357–363.

F. Bacchus et al. , Department of Computer Science Re-port Series B (2019).

I. Dinur, Journal of the ACM (JACM) , 12 (2007). A. A. Bulatov and E. S. Skvortsov, in

International Sym-posium on Mathematical Foundations of Computer Sci-ence (Springer, 2015) pp. 175–186.

R. Monasson and R. Zecchina, Physical Review E ,1357 (1997). T. Zaslavsky, arXiv preprint arXiv:1303.2770 (2013).

B. Derrida, Y. Pomeau, G. Toulouse, and J. Vannimenus,Journal de Physique , 617 (1979). H. Hong, H. Park, and L.-H. Tang, arXiv preprint cond-mat/0611509 (2006).

F. Guerra, International Journal of Modern Physics B ,1675 (1996). A. Puglisi, A. Sarracino, and A. Vulpiani, Physics Reports , 1 (2017).

C. R. Rao, in

Breakthroughs in statistics (Springer, 1992)pp. 235–247.

K. Binder, Zeitschrift f¨ur Physik B Condensed Matter ,119 (1981). G. Parisi, Physical Review Letters , 1946 (1983). E. Marinari, G. Parisi, F. Ricci-Tersenghi, J. J. Ruiz-Lorenzo, and F. Zuliani, Journal of Statistical Physics ,973 (2000). Supplementary Material

SM A: Critical Percolation

Very brieﬂy, percolation is a random process on a graph where a bond is opened (or a site is “occupied”) witha given probability p , and this usually generates multiple clusters on the graph (for now assume the graph is alattice) connected by open bonds . For most graphs, there is a percolation threshold p c such that when p < p c ,all the clusters are ﬁnite, and when p > p c , there is a unique giant cluster spanning a constant fraction of thelattice. One of the most important characterizations of the percolation process is the ﬁnite cluster size distribution(CSD), n ( s ), which counts the number of clusters of a given size s (excluding the single inﬁnite cluster). In both thesubcritical and supercritical regime ( p (cid:54) = p c ), n ( s ) decays exponentially with respect to s , while at criticality ( p = p c ), n ( s ) ∼ s − τ decays as a power law with the critical exponent τ referred to as the Fischer exponent , which is one of themany scale-free properties of critical percolation . The majority of spin models can be translated into a modiﬁedpercolation process , which has been the inspiration of many cluster algorithms over the past few decades . Inanother work we have analyzed the eﬃciency of these cluster algorithms both analytically and empirically .Both the ICM algorithm (see Section C 3) and memory dynamics (see Section D) can be naturally studied froma percolation perspective. During the Houdayer cluster formation in the ICM algorithm , spin sites with negativeoverlap between a replica pair can be interpreted as occupied sites, and sites with positive overlap are unoccupiedsites. Randomness is introduced into the system in the form of thermalization generated by Metropolis sweeps andreplica exchanges. There are certain procedures ensuring that the largest cluster size does not span the entire lattice.First, the cluster move only occurs between replica pairs of suﬃciently low temperature, and second, whenever thenumber of negative sites exceeds half the spins, one of the replica is ﬂipped globally to suppress the percolationprocess. However, despite these restrictions, it is shown that the algorithm still fails to be eﬃcient in general, as it isheavily reliant on the underlying graph structure .Unlike the ICM algorithm, the percolation process deﬁned by the memory variables we have introduced in themain text is a bond percolation process. However, the more important distinction is that the memory variables arecontinuous, meaning that they naturally induce a weighted graph, where each edge weight denotes the percolationprobability on the bond. It has been suggested that a weighted lattice may display diﬀerent critical propertiesthan the unweighted counterparts . In this work, we update the distribution n ( s ) at each simulation time unit(instead of each adaptive time step to ensure eﬃcient and unbiased sampling ). We ﬁnd that the distribution n ( s )follows a power-law decay with the giant component being absent. This suggests that the memory induced percola-tion process is near criticality, so that the memory dynamics are eﬃcient in sampling the underlying glass near T c .To make practical use of the clusters generated by the memory variables, we can perform a Swensden-Wang (SW)update on these clusters as an intelligent restart method in a digital implementation of the memory dynamics.The SW update entails ﬂipping each cluster independently with probability . Note that this method satisﬁesdetailed balance even if the percolation ratios deﬁned by the memory variables are not uniform across the graph.Alternatively, one can also opt to perform a Wolﬀ update which involves ﬂipping one randomly chosen cluster,though the two methods are similar in eﬃciency when the CSD is critical. Note that this algorithmic step is notcentral to the eﬃciency of the memory dynamics, though it does help to increase the TTS by a small constant factor.For stochastic clusters generated using algorithmic bond-formation rules, such as Swensden-Wang or Houdayer clusters, the CSD tends to be hyper-critical for frustrated models , due to the mismatch between the critical tem-perature T c and the critical threshold for the eﬀective (bond- or site-, respectively) percolation ratio p c . In the SWcase, this can be expressed as T c (cid:28) (cid:0) / (1 − p c ) (cid:1) , meaning that the CSD is already critical far above the critical temperature. Similar expressions can be derived forthe Houdayer clusters as well. A study focusing on the ineﬃciency of stochastic clusters is detailed in another work ,and we here simply show that the empirical CSD for SW and Houdayer rules on the fully-frustrated 3 d Ising glass ispersistently hyper-critical, as shown in Fig. 4, meaning that cluster algorithms employing such non-local update rulescannot be generally eﬃcient. On the other hand, the clusters generated by the memory variables are critical at anytime during evolution (see Fig. 5).

FIG. 4: The cluster size distribution (CSD) for the Swensden-Wang (SW) and Houdayer percolation rules on a fully frustrated3 d Ising glass sized 10 × ×

10. The equilibrium statistics is collected over 100 realizations of disorder, simulated withparallel tempering (PT) over 2 sweeps with a waiting time of 2 . Note that both classes of clusters are hypercritical at thecritical temperature T c ≈ .

20 (see Section E 4). While the SW clusters become increasingly hypercritical as the temperatureis lowered, the ICM clusters appear to be hypercritical at all temperatures, even when the largest component is restricted tohalf the lattice size (see Section C 3).FIG. 5: The clusters generated by the memory variables appear to be critical at any time during evolution on the fully frustrated3 d Ising glass, with the polynomial decay power being insensitive to the system size and time. The equilibrium statistics iscollected over 400 realizations of disorder and a time window of ∆ t = 2 beginning at simulation time t = { , , } ,roughly geometrically spaced. Of particular importance is the leftmost ﬁgure for t ∈ [0 , ], during which the memory dynamicsis transient near T c (see Fig. 2 in main text), and we see that the CSD is still critical. This essentially means that the criticalityof the clusters exists regardless of the underlying eﬀective temperature (see Section E 3). SM B: Complexity

Practically speaking, the most direct measure of complexity is the CPU time required to run a given algorithmuntil the solution is reached. However, this measure of complexity suﬀers from the uncertainty due to a numberof irrelevant variables that are hard to measure or control, originating mainly from the speciﬁc algorithmic design,software implementation, and the hardware architecture . For instance, the hardware implementation of thememory dynamics can vary in whether it is implemented on an analog or digital computing architecture, andfor its simulation, whether an explicit or implicit integration scheme is used (and its order) . However, from atheoretical standpoint, the main focus is the time and space complexity of the algorithm, which can be roughlyinterpreted as the scaling law of the computational cost and memory requirement of the TTS with respect to the sizeof the problem . As the scalability is the main concern here, the actual time and memory (RAM) required to runthe algorithm for a speciﬁc problem type is not of major interest, meaning that any complexity measurement andalgorithmic implementation that diﬀer in the prefactor or additional terms of smaller powers should be treated asequivalent. For example, if the time required to run an algorithm is aN + bN , where N is the size of the problem,then the actual prefactor a and the entire second term bN are inconsequential to the time complexity, which is O ( N ). Finally, since it is clear that both the isoenergetic cluster move (ICM) and memory dynamics shouldscale linearly with respect to memory (RAM) requirements , we will not be focusing on that here.The only measure of interest should then be the time complexity of both algorithms running the same class ofinstances. For both algorithms, we estimate the time complexity numerically using the median statistics of TTS over400 tiling glass instances , with the 40-th to 60-th percentiles reported to verify the robustness of the algorithm. Thereason we chose to record the median instead of the mean is because the hardness of the instances follows roughlya log-normal distribution . One can always extrapolate the scaling of the mean TTS by doing regression on thereported median and the percentile statistics, by using a log-linear model . SM C: Implementation of Stochastic Algorithms

Regardless of how eﬃcient the employed cluster update routine is, all stochastic algorithms inevitably use anergodic routine referred to as a sweep , where all the spins in the lattice are updated sequentially in a Metropolis-typeacceptance scheme . To be more precise, whenever a single spin σ i in the lattice is ﬂipped (from σ i to − σ i ), thechange in the Ising energy associated with it is ∆ E i = 2 (cid:88) ij J ij σ i , and the Metropolis acceptance ratio of this update is then given as P ( σ i → − σ i ) = min (cid:16) , exp (cid:0) − β ∆ E i (cid:1)(cid:17) . A single sweep usually occurs at a constant inverse temperature β , making it generally interfaceable with a plethoraof cluster or replica algorithms.In most cases, it is important for the spins to be updated sequentially , as many attempts of trying to introducesynchronous update methods (such as stripe-wise updates ) generally lose more in ergodicity than gain in paralleleﬃciency . This means that in a computational implementation, a single sweep is best limited to a single core, andit usually constitutes the most computationally intensive routine when used in conjunction with cluster updates. Thecomplexity is simply O ( N ), where N is the number of spins, noting that the coordination number of the 3 d latticeis ﬁxed at z = 6. Several modern algorithms that eﬃciently utilize the sweep routine include simulated annealing(SA), parallel tempering (PT), and isoenergetic cluster moves (ICM), with their scalabilities of the median TTS forthe fully frustrated 3 d Ising glass shown in Fig. 3 of the main text (with the PT and ICM algorithms combined).

1. Simulated Annealing

The Simulated Annealing (SA) algorithm is inspired by the physical annealing process in metallurgy, wherethe metal is gradually cooled from a high temperature to a low one to improve its ductility. In the context ofoptimization, this means that we begin with a high temperature, and gradually lower the temperature to a very smallvalue, and this process somewhat aids the algorithm in navigating the non-convex cost function of the optimizationproblem to ﬁnd the global minimum . This algorithm saw great success in many industrial optimization problems,such as the traveling salesman problem (TSP) . Many state-of-the-art algorithms are based on the same underlyingconcept, with added entropic routines for intelligent exploration of the state space . Here, we are using thealgorithm in its original form, with the annealing parameters extensively optimized.In most cases, a single run of an SA routine is not suﬃcient to ﬁnd the ground state, and more often than not, arestart routine is generally required , to give the algorithms multiple chances at tackling the problem. Althoughthere have been studies on using informed restart methods to interface with the algorithm , these methods are not0general, and usually only provide a pre-factorial improvement over a random restart routine. Therefore, in this work,we will simply use the random restart routine to minimize complications.The three important parameters for SA are β min , β max , and t sweep , together referred to as the annealing routine . β min is the starting inverse temperature of the sweep, β max is the ending inverse temperature of the sweep, and t sweep is the number of sweeps going from β min to β max . Based on seminal work , and extensive optimization studies donepreviously , we ﬁnd that the best annealing routine of β is linear from β min = 0 . β max = log( N ), where N isthe number of spins. Furthermore, we ﬁnd that the optimal number of sweeps between restarts is t sweep = 2 , whichbalances between having a suﬃciently gradual annealing ratio and suﬃcient restart opportunities. Note that this isdiﬀerent from the cubic scaling t sweep ∝ N that we employed previously , as it appears that a scaling numberof sweeps is generally unconducive to simulating fully frustrated lattices. Therefore, the annealing routine can beexpressed as β = 0 . (cid:0) log( N ) − . (cid:1) t − t sweep , where the inverse temperature is increased from 0 . N ) over 2 sweeps. Note that the exchange update istrivial in computational cost, as it simply involves computing an acceptance ratio and exchanging the indices of tworeplicas.

2. Parallel Tempering

As mentioned in the previous section, the major problem with using SA is the lack of a generally intelligent restartmethod, so in practice, most practitioners simply use a random restart routine, where all the spins in the lattice areuniformly sampled from σ i = ±

1. As a substantial improvement, parallel tempering (PT) replicates the lattice intomultiple copies, and simulates the replicas under diﬀerent temperatures (usually by sweeping the lattice), and tworeplicas of neighboring temperatures are exchanged depending on an external rule to improve eﬃciency of exploringthe Gibbs measure. The acceptance ratio of the exchange is given by P ( σ a ↔ σ b ) = min (cid:16) , exp (cid:0) ( β a − β b )( E ( σ a ) − E ( σ b )) (cid:1)(cid:17) , which keeps the joint distribution of the entire replicated system stationary. Intuitively, the replica at the highesttemperature is essentially sampling from the uniform spin measure, and this “random restart” propagates down thereplica chain through the exchange interactions, where multiple replicas essentially “mediate” the restart routine from β min to β max . This is the reason why PT is often times considered as an algorithm with an implicit restart routinethat is “intelligent”. This is arguably the most general algorithm designed to work for many classes of optimizationproblems on diﬀerent underlying graph structures , including many industrial problems . Combined withintra- or inter-replica cluster algorithms , this results in incredibly eﬃcient stochastic algorithms, where non-local updates are complemented by an intelligent restart method. We will discuss one such algorithm in the nextsection. In this work, we use n r = 30 replicas spaced geometrically in inverse temperature from β min = 0 . β max = log( N ), or β i = β min (cid:16) β max β min (cid:17) ( i − / ( n r − , based partially on existing work and our own optimization attempts.

3. Isoenergetic Cluster Moves

The ICM algorithm is currently the state-of-the-art cluster algorithm for simulating spin glasses that combinesthe method of PT and Houdayer cluster updates . In the main text, this is then used as the most representative ofstochastic algorithms to compare against the memory dynamics in the TTS scaling behavior on the fully-frustrated 3 d Ising model . A comprehensive description and the pseudocode for the algorithm is already given in the literature ,and also implemented in MATLAB . Here, we provide a brief overview of the algorithm and list the parametersthat we used to perform the simulations. In addition, we pinpoint the most computationally intensive routine, whosescalability we use as the time complexity of the algorithm.1In this algorithm, the parallel tempering algorithm is coupled with Houdayer cluster moves . To begin, anumber of replica pairs are initialized, with the pairs spaced geometrically in temperature. After one sweep in everyreplica, an Houdayer update is performed for every replica pair. This cluster update ﬂips a non-trivial cluster ofspins with negative overlap between two replicas to increase the mixing rate. Finally, the parallel tempering routineattempts to exchange the temperature between two random replicas of neighboring temperatures, to improve thethermalizatoin process. This algorithm can be used both as a sampler and an optimizer. For the latter case, onesimply has to record and return the lowest energy that is sampled by the algorithm. We use the same parameters asgiven in Section C 2, meaning that the total number of replicas is 2 n r = 60.Note that in the modern implementation of the ICM algorithm , there is an extra algorithmic step that performsa global spin ﬂip on a replica in each pair, whenever the number of negative overlap sites exceeds half the latticesize, so that the largest cluster component never exceeds half of the lattice. Although the intention of this procedureis to suppress the percolation process through restricting the size of the giant component, similar to the intentionof plaquette-based bond-formation rules , it does not fundamentally address the issue of mismatching the criticaltemperature and percolation threshold (as shown in Fig. 4 in the SM), meaning that Houdayer clusters are stillhyper-critical despite algorithmic interventions. It is also important to note that the global spin-ﬂip routine breaksdetailed balance, because the reverse transition probability is zero, meaning that the global ﬂip should in theorynever be accepted. Further discussion of this phenomenon is given in another work .To analyze the computational complexity of the diﬀerent routines to inform a fair TTS measure, we note that thePT routine is trivial in cost (as noted in Section C 2), and in the worst-case, the Houdayer move performs either abreadth-ﬁrst or depth-ﬁrst search (BFS or DFS) to identify the spin clusters, whose time complexity is O ( N ) for thelattice graph. This is also the time complexity of one Metropolis sweep over a replica. Therefore, the complexity canbe measured as the total number of sweeps summed over all the replicas, with the Houdayer update cost “generously”ignored/absorbed into the total complexity as it is of the same complexity power. SM D: Integration of Memory Dynamics

Recall that the memory dynamics is formulated in continuous time, and originally conceived to be implementedon physical circuits . However, it was later discovered that a carefully designed memory system is robust againstnoise and perturbations , and can be readily simulated numerically on a digital computer using basic integrationschemes such as forward Euler . Such a basic implementation has been proven to perform exceptionally well formultiple problem structures , as long as the relative timescale of the memory variables (with respect to the spins)is appropriate. Of course, this leaves room for many improvements in numerical methods to speed up the rate ofconvergence of the dynamics when the goal is to implement the memory dynamics digitally as a practical solver. Here,we present a few improvements that we found relevant to our work. First of all, we rewrite Eq. (5) in the main textfor convenience of the reader, where { α, β, γ, δ, ζ } = { . , . , . , . , . } are constants chosen for the 3 d cubicgraph, and are ﬁxed for system sizes and the type of glass (as long as | J ij | = 1).˙ σ i = α (cid:88) j J ij σ j − β (cid:88) j x ij σ i ˙ x ij = γC ij − y ij ˙ y ij = δx ij − ζ, where C ij = 12 ( J ij σ i σ j + 1) ∈ [0 , . (D1)The function C ij ( σ ) is referred to as the clause function in the ﬁeld of constrained optimization , evaluating to+1 if the spin interaction is satisﬁed, and 0 otherwise. Note that we use C ij purely to notationally interface with theliterature, and the oﬀset from J ij σ i σ j can be easily accounted for by a constant shift in the initialization of y . Thisset of equations and parameters is used to perform the simulation of the CSD as presented in Fig. 2 in the main text. In the context of our study, robustness means that multiple trajectories emanating from any initial point in the phase space will go tothe optimum. Therefore, it is not necessary for us to accurately integrate any particular one of them. It is possible for us to end up inanother “desirable” trajectory after deviation from the original one, and still ﬁnd the optimum in the end. x , we would opt for an exponential growth rate given as ˙ x ij = ( γC ij − y ij ) x ij ,which in fact already outperforms stochastic algorithms in TTS. Nevertheless, we choose to make the growth of x linear for faster dynamics , and to guarantee that no “hidden exponential” is present during dynamics. However,this comes at the price of potential negativity of x and the introduction of unstable modes. To avoid this, we candynamically anneal the decay rate via the extra (long-term) memory variable y coupled bond-wise to x . Intuitively,the new memory variable y grows/decays along with x with some time lag to ensure that the relative change in themagnitude of x is never too large nor too small. Though not necessary in most cases, positivity of these memoryvariables can be simply enforced by introducing explicit bounding values at each time step as such, x ij,n +1 = min { max { x ij,n + dt ( γC ij − y ij ) , } , } y ij,n +1 = min { max { y ij,n + dt ( δx ij − ζ ) , } , } , In most cases, the variable y is only relevant to the initial transient dynamics in its purpose of suppressing the highlyoscillatory memory modes (when the spin glass is far from equilibrium, a rapid relaxation of the spins induceslarge ﬂuctuations in the magnitude of x ). When the system relaxes slightly, the variable y will decay quickly to itslower bound (if the parameter ζ is chosen appropriately), and will eﬀectively serve as a constant decay of − x .If the memory dynamics dive quickly below T c (see Fig. 2 in the main text), then y is usually not needed, but weinclude it here for the sake of generality. A formal analysis of this discretization/bounding in the context of stability,absence of periodic orbits and chaos is given in the supplementary material of our previous work . Note thatwe do not leverage any stochasticity in taming discretization errors , which is further empirical proof of robustness.Beside the memory variables, the step size itself can be also made adaptive to further improve stability and speedup convergence. We adapt the step size as such, dt n = min (cid:110) max (cid:110) i ( | ˙ σ i,n | ) , − (cid:111) , − (cid:111) , to regularize the maximum voltage change at every step . Although not necessary, an intelligent restart method can also be implemented to ensure that the energy landscape is being thoroughly explored by the memory dynamics.After initialization, the dynamics are integrated up to time t = 2 , and if the optimum is not found in the timeduration, then we perform a SW update on clusters generated by x (see Section A), and continue to run the dynamicsfor another iteration, until the ground state is discovered or the total timeout is reached.Note that the time complexity of performing an integration step is O ( N ), which is the same as a single sweepin stochastic algorithms (see Section C). However, when implemented in hardware, integrating a time-step isalways much faster than a sweep, because the spin updates in our integration scheme are synchronous (as withstandard explicit methods for simulating multivariate ODEs), meaning that the machine code can be vectorizedto interface with a single instruction, multiple data (SIMD) hardware structure , whereas it would be incrediblydiﬃcult to do so with a sweep, even if the spin data are stored bit-wise. Again, to be generous, we do notconsider such hardware overhead in TTS measures, and simply measure the TTS as the total simulation time.Note that since the adaptive time step is bounded below by a constant dt min = 2 − , there is no cost associatedwith the inverse scaling of step size with respect to N . In the end, the redundant scaling O ( N ) is factor-ized out for both the TTS measures for stochastic algorithms and the integration of memory dynamics, as it onlyaﬀects the polynomial order of the TTS scaling, but it does not aﬀect whether the scaling is polynomial or exponential.Furthermore, we note that the parameters { α, β, γ, δ, ζ } for the memory dynamics were tuned very minimally usingsimplex descent , with the cost function being the mean Ising energy returned for 40 runs of memory dynamicson the 6 × × . This ensures that we are not taking advantage of the speciﬁcstructure of the tiling instances by over-ﬁtting the parameters speciﬁcally for the fully frustrated 3 d Ising glass. Thisis similar to the technique used in machine learning for tuning hyperparameters such as the learning rates , whereto avoid over-ﬁtting, the parameters are tuned on some given neural network (NN) for a subset of tasks that it isdesigned for and then the NN is tested on another disjoint subset of tasks for cross validation of performance . Onthe other hand, the parameters for SA and ICM (annealing schedule and temperature spacing respectively) are tunedextensively based on previous works , which eﬀectively tilts the playing ﬁeld in favor of stochastic algorithms. Note that this time step adaptive schedule we choose to use is rather unconventional. The standard adaptive schedule is to geometricallytune the step size based on the local error estimate for the purpose of eﬃciently simulating the solution trajectory. However, in ourcase, we do not require such accuracy, and employing such procedure actually decreases the rate of convergence to the optimum. FIG. 6: The green line is the ﬁtted scalability for the median TTS for the bond-based long term memory (LTM) on thefully-frustrated 3 d glass, and the orange line is for the checkered LTM on the same instances. The scaling powers are 4 . ± . . ± .

1. Non-local Extensions

The main culprit behind the ineﬃciency of local cluster algorithms is that the bond-formation rules are deﬁnededge-wise, such as the SW rule, so they completely ignore the non-local eﬀects of frustration. Therefore, suchalgorithms are prone to over-percolate , and are generally ineﬃcient for frustrated systems. This has inspiredmultiple extensions of the original SW rule to more non-local unit cells. For instance, the fully frustrated Isingmodel (FFIM) in 2 d can be deconstructed into unit cells of checkered plaquettes, with each plaquette guaranteeingexactly one negative interaction at the ground state. This led to the realization that bond-formation decisions canbe made plaquette wise, resulting in the prototypical KBD plaquette rules . This in turn inspired a plethora ofother non-local cluster algorithms designed for other classical and quantum glasses. The KBD plaquette rulehas proven to be eﬃcient in reducing the auto-correlation time in simulating magnetization properties , though itseﬃciency is still largely restricted to the 2 d FFIM .The equations for the memory dynamics (see Eq. (D1)) can also be modiﬁed to interface with such non-local bond-formation rules. For instance, if we take the 2 d FFIM, and index the checkered plaquettes (or checkered cubes in 3D)as (cid:3) , we can restrict the long-term memory variables to these plaquettes, and couple it to the short-term memory x as, e.g., y (cid:3) = δ (cid:88) ( i,j ) ∈ (cid:3) x ij − ζ. Instead of simply regularizing x as discussed in Section D, the long-term memory now provides additional non-localinformation to the original edge coupled system of { σ, x } . To make the analogy with machine learning, we notethat the use of multiple layered network structures is common for pattern recognition in deep learning , andhere, y acts as an additional layer of nodes, learning to recognize plaquette frustration patterns. To take thisanalogy further, we could introduce more layers of memory to learn successively non-local cell patterns ,as long as stability is still guaranteed. For this implementation, we use a slightly diﬀerent set of parameters { α, β, γ, δ, ζ } = { . , . , . , . , . } , which again we ﬁnd by minimal tuning.For the scalability of TTS presented in Fig. 3 of the main text, we use an implementation that assigns (cid:3) to bethe checkered cubes. To ensure that we are not exploiting the tiling instances , we intentionally oﬀset the checkeredpattern for the long-term memory variables with the one used to generate the instances. Note that even if the checkeredpatterns for the tiling glasses and the long-term memory were the same, it would by no means represent an advantage.This is because the problem of ﬁnding a ground state for a tiling instance still appears to be NP-complete , even ifthe knowledge of the checkered pattern (or the planting pattern ) is given. In fact, even for 2 d glasses, there is not4a single cluster algorithm that is eﬃcient even when the checkered pattern is known . It is interesting to notethat the scalability of both the bond and plaquette memory dynamics show similar scaling powers, as shown in Fig. 6. SM E: Planted Ising Spin-glasses

Planted Ising spin-glasses is a class of Ising instances generated by assuming a speciﬁc ground state, without sacri-ﬁcing the glassy property of having a highly non-convex energy landscapes. This makes them ideal benchmarks forevaluating algorithms that simulate critical spin-glass dynamics, or ﬁnding the ground state of glass realizations. Inaddition, it is generally beneﬁcial to have control over the hardness of the planted instances to oﬀer diﬀerent degrees ofevaluation, and an important way of realizing this control is to design the class to oﬀer tunable frustration ratio .We would also like to address a common belief that the so-called “planted” glass models are not “real” spin glasses.The identiﬁcation of what constitutes a “real spin glass” is meaningless, as formally speaking, any spin-glass structureis essentially a distribution of couplings J on some underlying graph structure , and practically speaking, there isno reason to expect planted models to be less “physical” than traditional glass models. In fact, multiple “planted”structures are known to exhibit glass-like transitions , and even some highly frustrated deterministic models areglassy in nature . A full discussion of the computational hardness of planted problems is beyond the scope of thiswork, but we refer the reader to for a formal analysis on the computational hardness of planted problems.

1. Time-to-solution

Generally speaking, there are two major ways to evaluate the eﬃciency of an optimizer/solver in its ability todiscover a ground state of a glassy instance. The ﬁrst way is to allocate the solver a certain amount of time, andallow the solver to run until timeout. This evaluation method is generally done for incomplete solvers , where thegoal is to test the capability of the solver to reach the lowest energy possible within a given time. In most cases,this method of evaluation does not provide an accurate measure of the eﬃciency of the solver or the complexity ofthe problem class, and is generally biased towards greedy or local solvers. To see why, most complex systems (suchas the Ising spin glass) admit a rough energy landscape with an abundance of local minima (metastable states)which can be readily accessed by a greedy algorithm with random initial conditions convoluted by random noise .However, the transition from a metastable state to the global optimum (Ising ground state) requires exponential costin time , and, in addition, requires careful coordinated non-local updates .A greedy solver may reach a metastable state relatively quickly , but it may never reach the global optimum.On the other hand, a solver with collective dynamics may sacriﬁce some time to carefully establish long-rangeconnections , and eventually reach the global optimum after being allocated suﬃcient time. Therefore, a morefaithful measurement of the eﬃciency of a solver is to record the time it takes for the solver to return the globaloptimum (or reach a certain gap above the optimum). This evaluation is commonly known as the time-to-solution(TTS) evaluation . However, to actually perform this measurement in practice, one has to know in advance whatthe optimum is. A way to achieve this goal is to assume (or plant) a solution in advance, and generate instances suchthat the optimum can be easily extracted by the generator but exponentially hard for the solver to ﬁnd . These planted instances can then be used to evaluate the performance of the solver, and also check the correctness of thesolver by comparing its solution to the planted one.

2. The Tiling Glass

The tiling glass is a class of Ising spin glasses with a planted ground state energy. It is a class where the expectedlocal residual entropy of the checkered cubes governs exponentially the hardness of the instances . This typeof “planted” glass structure has rather rich equilibrium and non-equilibrium dynamics , making it ideal foranalyzing the eﬃciency of spin-glass simulation algorithms. Getting back on track with the discussion on the tilingglass, we ﬁrst note that the tiling glass allows for the generation of a “fully-frustrated” spin glass where every face ofthe 3 d lattice is frustrated, meaning that one cannot simultaneously satisfy all interactions in any give face .Glossing over some caveats with the problem of deﬁning full frustration in a 3 d lattice , we note that a fullyfrustrated lattice under the tiling construction also attains the maximal local ground state degeneracy cube-wise, andthis has been demonstrated numerically to generate an extremely rough energy landscape. It should be noted that5 FIG. 7: The estimated curves for the oﬀseted intensive internal energy u ( T ) as given in Eq. (E 3), for fully frustrated 3 d Isingglasses sized from N = 4 to N = 6 . The curves slightly separate below the critical temperature T c ≈ . sweepsand replicas spaced geometrically in temperature. the generation of a fully frustrated hypercube with extremal frustration is highly non-trivial in d ≥ .Fortunately in 3 d , all the extremal constructions of a cube can be easily enumerated , and they are all isomorphic upto the octahedral symmetry. Therefore, to assemble a fully frustrated 3 d lattice from these extremal cubes, the cubescan be rotated randomly and assembled in a checkered pattern to introduce disorder. Note that for state-of-the-artsolvers , a fully frustrated construction already becomes computationally prohibitive to simulate in the periodiclattice sized 8 × ×

3. Eﬀective Temperature Estimation

Even though the memory dynamics operate at non-equilibrium, in the sense that the eﬀective temperaturedecreases monotonously in time (see Fig. 2 in the main text), the instantaneous distribution of internal energies overdisorder and uniform initialization of the equations of motion does seem to converge to the Boltzmann distribution.This is reminiscent of the operation of simulated annealing (SA) , though there are major diﬀerences, particularlyin that SA relies on a carefully tuned annealing process approaching T c , while the transient memory dynamics divebelow T c right away without sacriﬁcing long-range order.To see this ‘static’ equilibrium property more clearly, in Fig. 8 we observe that, for the memory dynamics, therelative variation of internal energies over disorder decreases as the system size is increased, which is an expectedproperty of an equilibrated spin glass as the internal energy is proven to be self-averaging , orlim N →∞ U J ( T ) U ( T ) p → , (E1)meaning that we can extract the eﬀective temperature of the memory dynamics at any point of time from the samplemean of the recorded energies. This follows from the general procedure of maximum likelihood estimation (MLE) ,where the following likelihood function is to be optimized over β , (cid:88) σ ∈S P ( σ ) log (cid:0) e − βE ( σ ) Z ( β ) (cid:1) , (E2)6 FIG. 8: The change of internal energy in time when the fully frustrated 3 d Ising glass is evolved under memory dynamics.Statistics are gathered over 400 disorder realizations. The solid line represents the mean of the energy samples u , and theshaded region represents the standard deviation of the samples ˆ σ u . Of particular importance is the observation that the energyvariation decreases as the system size is increased, which is in line with the self-averaging property of the equilibrium internalenergy, as expressed in Eq. (E1). Informally, this means that the energy variation is in fact a thermal property, instead of beinginduced by the random initialization of memory dynamics, which implies that the eﬀective temperature curve shown in the leftpanel of Fig. 2 in the main text is in fact thermally robust. where S is the set of spin state samples. Note that we are ignoring the disorder distribution here for the aforementionedreasons. Setting the derivative of (E2) to zero gives us ∂ β (cid:16) (cid:88) σ ∈S P ( σ ) log (cid:0) − βE ( s ) Z ( β ) (cid:1)(cid:17) = (cid:88) σ ∈S P ( σ ) (cid:0) − βE ( σ ) − ∂ β log( Z ) (cid:1) = β (cid:88) σ ∈S P ( σ ) (cid:0) U ( β ) − E ( σ ) (cid:1) = 1 T ( U ( T ) − E ) = 0 , where E is the sample mean of the energies. The temperature estimator is then given byˆ T = U − ( E ) , where again we are making no distinction between U J and U . Note that ˆ T and ˆ β can be interchanged by virtue ofthe functional invariance of MLE estimators, and furthermore, the estimator ˆ T is eﬃcient .For non-trivial spin glasses , it is likely that the analytic form of U ( T ) is not available, so we resort to using MonteCarlo methods to estimate it. For the tiling cubes , we can deﬁne the intensive internal energy oﬀset by the plantedground state energy E as u ( T ) = U ( T ) − E N , where N is the number of spins.

4. Critical Temperature Estimation

For deterministic models (or suﬃciently structured glasses), one can generally look at the behavior of the spincorrelation or the aging proﬁle of the overlap autocorrelation time to determine the critical temperature T c .However, the local rotations of the tiling construction gives rise to large variations in these order parameters over7 FIG. 9: The curves show the Binder’s cumulant g ( T ) for spin overlap q with respect to temperature (see Eq. (E3)), for fullyfrustrated (FF) 3 d Ising glasses sized from N = 4 to N = 6 . The curves intersect at T c ≈ .

2, which we shall interpretas the critical temperature of the 3 d FF glass. Interestingly, g ( T ) at the intersection point appears to be near 0, seeming toimply that replica symmetry is somehow “restored”, and the glass is fully disordered. This is a very interesting and uncommonphenomenon, as generally g ( T ) is positive at the intersection point . We will explore this phenomenon further in anotherwork . Furthermore, we see that g ( T ) displays a pronounced dip below T c , at varying values of T depending on the system size,in a range roughly corresponding to the separation of energies as seen in Fig. 7. Again, this is possibly due to non-averagingﬁnite size eﬀects induced by metastability in the glass phase. We will make no attempt to investigate this phenomenon ,and simply show the empirical overlap distribution P ( q ) at the low temperature phase, noting that this distribution becomesincreasingly erratic as the temperature is lowered and the system size becomes smaller. The simulation is performed over 400glass instances with ICM for 2 sweeps with the replica pairs spaced geometrically in temperature. diﬀerent instance realizations (even in the absence of any gauge transformation ). Therefore, we will resort tousing the Binder’s cumulant as a much more robust order parameter, which is related to the kurtosis of the overlapdistribution over both the Boltzmann measure and the disorder of J , g = 12 (cid:16) − (cid:104) q (cid:105)(cid:104) q (cid:105) (cid:17) , (E3)8where (cid:104)·(cid:105) denotes the Boltzmann average, the overline denotes the quenched average over disorder, and q is the spinoverlap deﬁned as q = 1 N (cid:88) i σ αi σ βi , with α and β denoting two independent replicas. In most cases, the temperature at which the g ( T ) curves intersectfor diﬀerent system sizes can be taken numerically as the critical temperature T c118