Optimization of population annealing Monte Carlo for large-scale spin-glass simulations

Amin Barzegar, Christopher Pattison, Wenlong Wang, and Helmut G. Katzgraber

Department of Physics and Astronomy, Texas A&M University, College Station, Texas 77843-4242, USA
Microsoft Quantum, Microsoft, Redmond, WA 98052, USA
Santa Fe Institute, Santa Fe, New Mexico 87501, USA
Population annealing Monte Carlo is an efficient sequential algorithm for simulating k-local Boolean Hamiltonians. Because of its structure, the algorithm is inherently parallel and therefore well suited for large-scale simulations of computationally hard problems. Here we present various ways of optimizing population annealing Monte Carlo using 2-local spin-glass Hamiltonians as a case study. We demonstrate how the algorithm can be optimized from an implementation, algorithmic accelerator, as well as scalable parallelization point of view. This makes population annealing Monte Carlo perfectly suited to study other frustrated problems such as pyrochlore lattices, constraint-satisfaction problems, as well as higher-order Hamiltonians commonly found in, e.g., topological color codes.

I. INTRODUCTION
Monte Carlo algorithms are widely used in many areas of science, engineering, and mathematics. These approaches are of paramount importance for problems where no analytical solutions are possible. For example, the class of Ising-like Hamiltonians can be solved analytically only in a few exceptionally rare cases. The vanilla Ising model can only be solved analytically in one, two, as well as infinite space dimensions. A solution in three space dimensions remains to be found to date [1, 2]. Therefore, simulations are necessary to understand these systems in three space dimensions. The situation is far more dire when more complex interactions, such as k-local terms rather than the usual quadratic or 2-local terms, are used. Similarly, the inclusion of disorder allows for analytical solutions only in the mean-field regime [3-7]. These spin-glass problems, a subset of frustrated and glassy systems, represent the easiest 2-local Hamiltonians that are computationally extremely hard. A combination of diverging algorithmic timescales (with the size of the input) due to rough energy landscapes and the need for configurational (disorder) averages to compute thermodynamic quantities makes them the perfect benchmark problem for studying novel algorithms. Finally, computing ground states of spin glasses on nonplanar graphs is an NP-hard problem for which Monte Carlo methods have been known to be efficient heuristics [8-10] and for which only a few efficient exact methods exist for small system sizes. It is therefore of much importance to design or improve efficient algorithms, either to save computational effort or to obtain better-quality data with the same computational effort when studying these complex systems.
Two popular algorithms that are currently in use (for both thermal sampling, as well as optimization) are parallel tempering (PT) Monte Carlo [11, 12] and population annealing Monte Carlo (PAMC) [13-16]. Although both PT and PAMC are extended ensemble Monte Carlo methods, PAMC is a sequential Monte Carlo algorithm, in contrast to PT, which is a (replica-exchange) Markov-chain Monte Carlo method. PAMC is a population-based Monte Carlo method and thus well suited for implementations on multicore high-performance computing machines. PAMC is similar to simulated annealing [17], however, with an extra resampling step when the temperature is reduced to maintain thermal equilibrium. PT has been intensively optimized, has to date been the workhorse in statistical physics, and is equally efficient in simulating spin glasses when compared to PAMC [16]. PAMC, on the other hand, remains a relatively new simulation method. Although careful systematic studies of PAMC [16, 18] exist, and the method has been applied broadly [9, 19-21], little effort has been made to thoroughly optimize the algorithm. Here we focus on this problem and study various approaches to improve the efficiency of PAMC for large-scale simulations. While some approaches improve PAMC, others have little to no effect. Note that related optimization ideas are explored in Ref. [22].

* Electronic address: [email protected]
† Electronic address: [email protected]
‡ Electronic address: [email protected]
[Fig. 1 diagram: Population annealing, with three branches. Implementation optimization: dynamic population sizes, schedule optimization, spin selection methods, number of temperatures optimization. Algorithmic accelerators: Wolff cluster updates, Houdayer cluster updates. Parallel implementation: OpenMP, MPI.]
FIG. 1: Diagram outlining the different optimizations we have implemented for population annealing Monte Carlo. These range from optimizations in the implementation, such as efficient spin selection techniques, to algorithmic accelerators (e.g., the inclusion of cluster updates), as well as parallel implementations. See the main text for details.

Our strategy to optimize PAMC is three pronged, as illustrated in Fig. 1. First, we study different implementation optimizations. Here we discuss dynamic population sizes that vary with the temperature during the anneal, as well as the optimization of different annealing schedules. We also investigate different spin selection methods (the order of spin updates in the simulation), such as random, sequential, and checkerboard. While for disordered systems sequential updates are commonplace, random updates are needed for nonequilibrium studies. In the case of bipartite lattices, a checkerboard spin-update technique can be used, which is perfectly suited for parallelization. Furthermore, we discuss how to determine the optimum number of temperatures for a given simulation. Second, we analyze the effects of algorithmic accelerators by adding cluster updates to PAMC. We have studied Wolff cluster updates [23], as well as Houdayer cluster updates [24] and isoenergetic cluster moves [10]. Third, we discuss different parallel implementations using both OpenMP (ideal for shared-memory machines [16, 18, 25]) and MPI [26] with load balancing (ideal for scalable massively parallel implementations). Note that PAMC implemented on graphics processing units has been discussed extensively in Refs. [19, 27].

The paper is structured as follows. We first introduce in Sec. II some concepts needed in this study, such as the case-study Hamiltonian, and outline the PAMC algorithm. Implementation optimizations are presented in Sec. III, algorithmic accelerators via cluster updates in Sec. IV, and parallel implementations are discussed in Sec. V, followed by concluding remarks.
II. PRELIMINARIES
In this section we introduce some concepts needed for the PAMC optimization in the subsequent sections. In particular, we introduce the Ising spin-glass Hamiltonian (our case study), as well as PAMC and the different algorithmic accelerators.
A. Case study: Spin glasses
We study the zero-field two-dimensional (2D) and 3D Edwards-Anderson Ising spin glass [3] given by the Hamiltonian

H = − Σ_⟨ij⟩ J_ij S_i S_j ,    (1)

where S_i = ±1 are Ising spins and the sum is over nearest neighbors on a D-dimensional lattice of linear size L with N_spin = L^D spins. The random couplings J_ij are chosen from a Gaussian distribution with mean zero and unit variance. We refer to each disorder realization as an "instance." The model has no phase transition to a spin-glass phase at finite temperature in 2D [28], while in 3D there is a spin-glass phase transition at a critical temperature T_c ≈ 0.95 [29] for Gaussian disorder.

B. Outline of population annealing Monte Carlo
Population annealing Monte Carlo [13-16] is similar to simulated annealing (SA) [17] in many ways. For example, both methods are sequential. However, the most important differentiating aspect between PAMC and SA is the addition of a population of replicas that are resampled when the temperature is lowered in the annealing schedule.

PAMC [16] starts with a large population of R replicas at a high temperature, where thermalization is easy. In our simulations, we initialize the replicas randomly at the inverse temperature β = 1/T = 0. The population traverses an annealing schedule with N_T temperatures and maintains thermal equilibrium down to a low target temperature, T_min = 1/β_max. When the temperature is lowered from β to β′, the population is resampled. The mean number of copies of replica i is proportional to the appropriate reweighting factor, exp[−(β′ − β)E_i]. The constant of proportionality is chosen such that the expectation value of the population size at the new temperature is R(β′). Note that R(β′) is usually kept close to R; however, this is not a necessary condition. Indeed, in our dynamic population size implementation, we let R change as a function of β and seek better algorithmic efficiency in the number of spin updates. The resampling is followed by N_S = 10 Monte Carlo sweeps (one Monte Carlo sweep represents N_spin attempted spin updates) for each replica of the new population using the Metropolis algorithm. We keep N_S = 10 without loss of generality, because the performance of PAMC is mostly sensitive to the product N_S N_T near the optimum. For example, two PAMC simulations with {N_S = 10, N_T} and {N_S = 1, 10 N_T} are similar in efficiency, if N_T is reasonably large. The amount of work of a PAMC simulation in terms of sweeps is W = R̄ N_S N_T, where R̄ is the average population size.
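The resampling step just described can be sketched as follows. This is a minimal illustration, not the paper's production code: the function name is ours, and we use multinomial resampling, whereas nearest-integer or Poisson schemes are equally common in practice.

```python
import numpy as np

def resample(energies, beta, beta_new, R_new, rng):
    """One PAMC resampling step: replica i receives an expected
    number of copies proportional to the reweighting factor
    exp[-(beta_new - beta) * E_i].  Multinomial resampling is used
    here; other schemes differ only in how the expected counts are
    rounded to integers."""
    E = np.asarray(energies, dtype=float)
    logw = -(beta_new - beta) * E
    logw -= logw.max()                 # stabilize the exponentials
    w = np.exp(logw)
    counts = rng.multinomial(R_new, w / w.sum())
    # return the parent index of every member of the new population
    return np.repeat(np.arange(len(E)), counts)

rng = np.random.default_rng(42)
# lower-energy replicas dominate the resampled population
parents = resample([-12.0, -9.0, -1.0], beta=0.5, beta_new=1.0, R_new=1000, rng=rng)
```

After the resampling, each surviving copy would then be equilibrated further with N_S Metropolis sweeps at the new temperature.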
As shown in Ref. [16], the quality of thermalization of any thermodynamic observable is in direct correlation with the family entropy S_f and the entropic family size ρ_s. The systematic errors, on the other hand, are controlled by the equilibrium population size ρ_f. What we refer to here as "efficiency" or "speed-up" relates to reducing the statistical as well as the systematic errors while keeping the computational effort constant. Thus, it is reasonable to use these quantities as measures of optimality for various PAMC implementations. S_f, ρ_s, and ρ_f are defined as

S_f = − Σ_i ν_i ln ν_i ,    (2)

ρ_s = lim_{R→∞} R / e^{S_f} ,    (3)

ρ_f = lim_{R→∞} R × var(βF) ,    (4)

where ν_i is the fraction of the population that has descended from replica i in the initial population, and β and F are the inverse temperature and the free energy of the system, respectively. The free energy is measured using the free-energy perturbation method. Intuitively, exp(S_f) characterizes the number of surviving families and ρ_s the average surviving family size. For a given set of simulation parameters, the larger ρ_s and ρ_f, or the smaller S_f, the computationally harder the instance. Keep in mind that ρ_f is computationally more expensive to measure, because many independent runs are needed to measure the variance of the free energy. Note that S_f is "extensive" and asymptotically grows as log(R), while both ρ_s and ρ_f are "intensive" quantities, growing asymptotically independent of R when R is sufficiently large. In our simulations, these metrics are estimated using finite but large-enough values of R such that the systematic errors are negligible.

It can be shown [16, 22] that the systematic errors in any population annealing observable in the limit of large R are proportional to var(βF). Therefore, in order to ensure that the simulations are not affected by systematic errors, one needs to make certain that the quantity ρ_f/R is sufficiently small.
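The family metrics of Eqs. (2) and (3) can be estimated directly from the ancestry of the population; a minimal sketch (the function name is ours):

```python
import numpy as np

def family_metrics(family_ids):
    """Finite-population estimates of the family entropy
    S_f = -sum_i nu_i ln nu_i  [Eq. (2)] and the entropic family
    size rho_s = R / exp(S_f)  [Eq. (3)], where nu_i is the
    fraction of the current population descended from initial
    replica i (encoded by the ancestor label family_ids)."""
    ids = np.asarray(family_ids)
    _, counts = np.unique(ids, return_counts=True)
    nu = counts / len(ids)
    S_f = -np.sum(nu * np.log(nu))
    rho_s = len(ids) / np.exp(S_f)
    return S_f, rho_s

# R replicas descending from R distinct ancestors: S_f = ln R, rho_s = 1
S_f, rho_s = family_metrics(np.arange(1000))
```

In a healthy simulation many families survive (large S_f, small ρ_s); a collapse onto few families signals poor statistics.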
When well defined, ρ_s is strongly correlated with ρ_f [16], as is the case for the majority of the spin-glass instances that we study in this paper. Hence, we may alternatively minimize ρ_s/R or, equivalently, maximize S_f as a proxy for the quality of equilibration. In our simulations, we ensure that S_f is sufficiently large for all instances.

C. Outline of cluster updates used
Having outlined PAMC, we now briefly introduce the different cluster algorithms we have experimented with in order to speed up thermalization.
1. Wolff cluster algorithm
The Wolff algorithm [23] greatly speeds up simulations of Ising systems without frustration near the critical point. It is well known that the Wolff algorithm does not work well for spin glasses in 3D [30], because the cluster size grows too quickly with β. Nevertheless, we revisit this algorithm systematically in both 2D and 3D. The idea is that even if the cluster size grows quickly with β, the mean cluster size (normalized by the number of spins N_spin) is still a continuous function taking values in the range [0, 1] as β grows from β = 0 to ∞. Therefore, it is reasonable to ask whether there would be some speed-up when restricting the algorithm to the temperature range where the normalized mean cluster size is neither too large nor too small.

TABLE I: Simulation parameters for various experiments to optimize PAMC: Spin selection methods (SSM), annealing schedules (AS), number of temperatures tuning (NT), dynamic population size experiment (DPS), and cluster algorithms (CA). D is the space dimension, L is the linear system size, R is the population size, T_min = 1/β_max is the lowest temperature simulated, N_T is the number of temperatures, and M is the number of disorder realizations studied. The label "Schedule" refers to the annealing schedule used, such as the linear-in-β (LB) or the linear-in-β linear-in-T (LBLT) schedules. N_S = 10 sweeps are applied to each replica at each temperature. Note that in the case of dynamic population sizes (DPS), R is the mean population size. See the text for more details.

[Table I entries not recoverable from the source.]

In the ferromagnetic Ising model, where J_ij = J = 1, one adds a neighboring spin S_j when it is parallel to a spin S_i in the cluster with probability p_c = 1 − exp(−2βJ). In spin glasses, this is generalized as follows: One adds a neighboring spin S_j to S_i when the bond between the two spins is satisfied, with probability p_c = 1 − exp(−2β|J_ij|). This can be compactly written as p_c = max[0, 1 − exp(−2βJ_ij S_i S_j)] [30]. Note that from p_c there are two interesting limits for the mean cluster size. In the limit β → 0, the average cluster size is clearly 1, and in the limit β → ∞ the normalized cluster size tends to 1, because in the ground state each spin has at least one satisfied bond with its neighbors and all the spins would be added to the cluster. From the expression for p_c one can see that frustration actually makes the cluster size grow more slowly as a function of β. However, frustration also significantly reduces the transition temperature, which is the primary reason why the Wolff algorithm is less efficient for spin glasses. Finally, note that the Wolff algorithm is both ergodic and satisfies detailed balance.
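A single generalized Wolff update with p_c = max[0, 1 − exp(−2βJ_ij S_i S_j)] can be sketched as follows. For brevity we use a one-dimensional periodic chain, so the neighbor bookkeeping is only illustrative and not the 2D/3D geometry studied in the paper; the function name is ours.

```python
import numpy as np

def wolff_update(spins, J, beta, rng):
    """One generalized Wolff cluster update on a 1D periodic chain
    (J[i] couples sites i and i+1).  A neighbor j of a cluster site
    i is added with p_c = max[0, 1 - exp(-2*beta*J_ij*S_i*S_j)],
    i.e. only across satisfied bonds, and the cluster is flipped.
    Returns the cluster size."""
    N = len(spins)
    seed = rng.integers(N)
    in_cluster = np.zeros(N, dtype=bool)
    in_cluster[seed] = True
    stack = [seed]
    while stack:
        i = stack.pop()
        for j, Jij in (((i + 1) % N, J[i]), ((i - 1) % N, J[(i - 1) % N])):
            if not in_cluster[j]:
                p = max(0.0, 1.0 - np.exp(-2.0 * beta * Jij * spins[i] * spins[j]))
                if rng.random() < p:
                    in_cluster[j] = True
                    stack.append(j)
    spins[in_cluster] *= -1
    return int(in_cluster.sum())
```

The two limits discussed above are visible here: at β = 0 every p_c vanishes and the cluster is a single spin, while for a strongly coupled ferromagnetic ground state the cluster engulfs the whole system.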
2. Houdayer cluster algorithm
Designed for spin glasses, the Houdayer cluster algorithm [24] and its generalization, the isoenergetic cluster moves (ICM) [31], greatly improve the sampling for parallel tempering in 2D, while less so in 3D. ICM in 3D, like the Wolff algorithm, is restricted to a temperature window where the method is most efficient [31]. ICM works by updating two replicas at the same time. First, the overlap between the two replicas is constructed, which naturally forms positive and negative islands. One island is selected, and the spin configurations of the island in both replicas are flipped.

In its original implementation, the spin-down sector is always used to construct the cluster. In the implementation of Zhu et al., a full replica is flipped if the chosen island is in the positive sector to make it negative [31] and therefore reduce the size of the clusters. Here, we improve on this implementation by allowing the chosen island to be either positive or negative, and flipping the spins of the island in both replicas. Therefore, we never flip a full replica. This saves computational time and also has the advantage that it does not artificially make the spin overlap function symmetric. ICM satisfies detailed balance but is not ergodic. Therefore, the algorithm is usually combined with an ergodic method such as the Metropolis algorithm. ICM greatly improves the thermalization time, and also slightly improves the autocorrelation time, in parallel tempering. Because PAMC is a sequential method, there is no thermalization stage. We therefore focus on whether the algorithm reduces correlations, i.e., systematic and statistical errors. Our implementation of PAMC with ICM is as follows: First, after each resampling step, we perform regular Monte Carlo sweeps and ICM updates alternately. We first do N_S/2 lattice sweeps for each replica, followed by R ICM updates done by randomly pairing two replicas in the population, followed by another N_S/2 lattice sweeps.
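A single ICM update on a pair of replicas, as described above, can be sketched as follows (a minimal 2D illustration with periodic boundaries; the function name and details are ours, not the paper's production code):

```python
import numpy as np
from collections import deque

def icm_update(s1, s2, rng):
    """One isoenergetic cluster move on two replicas s1, s2
    (2D arrays of +/-1, periodic boundaries).  An island is grown
    in the overlap q = s1*s2 from a random site of the minority
    overlap sector, and its spins are flipped in BOTH replicas.
    The combined energy of the two replicas is unchanged, so the
    move is rejection free.  Returns the island size."""
    q = s1 * s2
    # pick the overlap sector (q = +1 or q = -1) with fewer sites
    sign = 1 if (q == 1).sum() < (q == -1).sum() else -1
    sites = np.argwhere(q == sign)
    if len(sites) == 0:
        return 0
    Lx, Ly = q.shape
    seed = tuple(sites[rng.integers(len(sites))])
    cluster = {seed}
    queue = deque([seed])
    while queue:            # grow the connected same-sign island
        x, y = queue.popleft()
        for nx, ny in ((x + 1) % Lx, y), ((x - 1) % Lx, y), (x, (y + 1) % Ly), (x, (y - 1) % Ly):
            if q[nx, ny] == sign and (nx, ny) not in cluster:
                cluster.add((nx, ny))
                queue.append((nx, ny))
    for x, y in cluster:    # flip the island in both replicas
        s1[x, y] *= -1
        s2[x, y] *= -1
    return len(cluster)
```

Energy conservation follows because every bond crossing the island boundary connects sites of opposite overlap sign, so the energy changes of the two replicas cancel exactly.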
Second, for each ICM update, we choose an island from the overlap sector with the smaller number of spins. Then the spin configurations of the island in both replicas are flipped. This effectively means that the spin configurations associated with the selected island are either exchanged or flipped, depending on the sign of the island being negative in the former case or positive in the latter. Note that the combined energy of the two replicas is conserved in both cases, therefore making the algorithm rejection free.

III. IMPLEMENTATION OPTIMIZATIONS
In this section, we present our implementation improvements to the population annealing algorithm. We first present spin selection methods, followed by experiments using different annealing schedules, numbers of temperatures, and the use of a dynamic population size. The simulation parameters are summarized in Table I.
A. Comparison of spin selection methods
We have studied three spin selection methods: sequential, random, and checkerboard. We have carried out a large-scale simulation in 3D to compare these methods for L = 4, 6, 8, and 10, with a large number of disorder instances for each system size. We first run the simulations using the parameters in Table I. To measure S_f or ρ_s reliably, we require S_f to be sufficiently large [16]. When this is not satisfied for a particular instance, we rerun it with a larger population size. We then compare ρ_s at the lowest temperature between the different spin selection methods.

FIG. 2: Comparison of the entropic population size ρ_s for different spin selection methods (random, sequential, and checkerboard updates) in three space dimensions. Sequential and checkerboard updates have similar efficiency (b), and both are more efficient than random updates (a).

Figure 2 shows scatter plots comparing ρ_s instance by instance for different system sizes and using different spin selection methods. Figure 2(a) compares random to sequential updates, whereas Fig. 2(b) compares checkerboard to sequential updates. Interestingly, sequential and checkerboard updates have similar efficiency (the data lie on the diagonal), whereas both sequential and checkerboard are more efficient than random updates. This is particularly visible for the larger system sizes, e.g., L = 10. The random selection method is therefore the least efficient update technique for disordered Boolean problems, keeping in mind that it requires the computation of an additional random number for each attempted spin update, thus slowing down the simulation. We surmise that a sequential updating of the spins accelerates the mobility of domain walls in most cases. However, in some pathological examples, such as the one-dimensional Ising chain, random updating is needed for Monte Carlo to be ergodic.

B. Optimizing annealing schedules
Most early population annealing simulations used a simple linear-in-β (LB) schedule, where the change in β in the annealing schedule is constant as a function of the temperature index. This, however, is not necessarily the most optimal schedule to use. We use two approaches to optimize the annealing schedules and the number of temperatures: One approach uses a mathematical model with free parameters to be optimized, and the other includes adaptive schedules based on a guiding function, e.g., the energy fluctuations or the specific heat. For the parametric schedules we introduce a linear-in-β linear-in-T (LBLT) schedule and a two-stage power-law (TSPL) schedule. For the LBLT schedule there is one parameter to tune, namely a tuning temperature T_N [32]. In this schedule, half of the temperatures above T_N are linear in β, while the other half below T_N are linear in T. For the TSPL schedule we define a rescaled annealing time τ = k_β/(N_T − 1) ∈ [0, 1], where k_β = 0, . . . , N_T − 1 is the annealing step (or temperature index). The TSPL schedule is modeled as

β(τ) = a τ^{α₁} θ(τ₀ − τ) + b τ^{α₂} θ(τ − τ₀) ,    (5)

where θ is the Heaviside step function. Here α₁ and α₂ are free parameters, a and b enforce continuity and the final annealing temperature, and τ₀ is selected to enforce a switch-over temperature β₀. We optimize the LBLT schedule with a simple scan of the parameter T_N. The optimum value of T_N (where ρ_s is minimal) is shown in Fig. 3(c) for 2D and in Fig. 3(d) for 3D, marked with a vertical shaded area in each panel. The TSPL schedule, however, has more parameters that have to be tuned. Therefore, we have used the Bayesian optimization package Spearmint [33, 34] rather than a full grid scan in the entire parameter space. We find numerically that values with α₁ < 1, α₂ > 1, and β₀ slightly above 1 work well. However, we note that there is no guarantee of global optimality.
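The LBLT schedule is straightforward to generate; a minimal sketch (the function name is ours, and we assume the anneal starts at β = 0, as in Sec. II B):

```python
import numpy as np

def lblt_schedule(n_T, T_N, T_min, beta_start=0.0):
    """Linear-in-beta linear-in-T (LBLT) schedule: the first half
    of the temperatures, from beta_start up to 1/T_N, is evenly
    spaced in beta; the second half, from T_N down to T_min, is
    evenly spaced in T.  Returns increasing inverse temperatures."""
    n_hi = n_T // 2
    n_lo = n_T - n_hi
    betas_hi = np.linspace(beta_start, 1.0 / T_N, n_hi, endpoint=False)
    temps_lo = np.linspace(T_N, T_min, n_lo)
    return np.concatenate([betas_hi, 1.0 / temps_lo])
```

By construction this schedule concentrates more annealing steps at high temperature (small β), which is the behavior found to be efficient below.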
For the adaptive schedules, we optimize using information provided by the energy fluctuations, because the energy is directly related to the resampling of the population. We therefore define a density of inverse temperatures, g(β), and study the following adaptive schemes:

var(E) schedule with g(β) ∼ var(E),
std(E) schedule with g(β) ∼ √var(E),
C_V schedule with g(β) ∼ C_V(β),
√C_V schedule with g(β) ∼ √C_V(β),

where C_V is the specific heat of the system. Note that the functions are disorder averaged, and the proportionality constant is determined by the number of temperatures. Because g(β) may become extremely small, we have replaced all function values that are less than a small fraction of max(g) by that fraction of max(g) to prevent large temperature leaps. With this small modification, we generate the N_T temperatures according to the above density functions. The shapes and β densities of all schedules are shown in Figs. 3(a) and 3(b), respectively. There are clear differences between the different schedules, especially in comparison to the traditionally used LB schedule. We compare the efficiency of these different schedules in Fig. 4 by analyzing the systematic errors in a number of paradigmatic observables. We have studied the internal energy (E), the free energy (F), and the spin-glass Binder cumulant [35] for the system size L = 10. To overcome the scale difference when showing the systematic errors for these observables in one plot, we have normalized the errors with respect to the schedule that has the greatest error. Therefore, all the errors are relative to that of the worst schedule. In Figs. 4(a) and 4(b) we show the normalized systematic errors for two randomly chosen and extremely hard instances. In Fig. 4(c) we show the disorder-averaged systematic errors calculated from 100 of the hardest instances.
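An adaptive schedule of this type can be generated by placing the β values at equal increments of the cumulative integral of the clipped guiding function g(β); a minimal sketch (the function name, grid resolution, and clipping fraction are illustrative choices, not the paper's):

```python
import numpy as np

def schedule_from_density(g, beta_max, n_T, floor=0.01, n_grid=10000):
    """Build an annealing schedule whose local density of inverse
    temperatures is proportional to g(beta), e.g. var(E) or C_V.
    Values of g below floor*max(g) are clipped up to that level to
    prevent large temperature leaps; the betas are then placed at
    equal increments of the cumulative integral of g."""
    grid = np.linspace(0.0, beta_max, n_grid)
    dens = np.asarray([g(b) for b in grid], dtype=float)
    dens = np.maximum(dens, floor * dens.max())   # clip small values
    cdf = np.cumsum(dens)
    cdf = (cdf - cdf[0]) / (cdf[-1] - cdf[0])     # normalize to [0, 1]
    targets = np.linspace(0.0, 1.0, n_T)
    return np.interp(targets, cdf, grid)
```

A sharply peaked g(β) concentrates the annealing steps near its peak while the clipping still guarantees a nonzero density everywhere in [0, β_max].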
It can be readily seen from the plots that the LBLT and TSPL schedules yield the best efficiencies among all the schedules we experimented with, with TSPL slightly more efficient. Both the LBLT and TSPL schedules place more temperatures at high temperature values (smaller β values), presumably because the Metropolis dynamics is more effective at high temperatures. Additionally, in Fig. 4(c) we show ρ_s for the various schedules. We observe a strong correlation between ρ_s and the systematic errors, which corroborates the use of ρ_s as a good measure of efficiency.

We stress that the optimum schedule depends on the choice of the number of sweeps at each annealing step, N_S, because N_T and N_S are exchangeable when N_T is large enough. In our approach, we have fixed N_S. It is therefore possible that other techniques may result in different optimal schedules. For instance, one may use the overlap of the energy distributions at two neighboring temperatures to define the optimum schedule [22, 27], which only depends on the thermodynamic properties of the system. As an example, in Fig. 5 we show the energy distributions of the LBLT schedule for L = 8 in 3D. The energy histograms overlap considerably up to several temperature steps. Within this framework, the optimization is transferred to the distribution of sweeps. However, the density of work (the product of the density of β and the density of sweeps) should be similar in the two approaches. In our implementation, as the number of sweeps is constant, the density of work is the same as the density of β.

C. Optimization of the number of temperatures
To optimize the number of temperatures and their range, we use the LBLT schedule, as it is easy to implement and very close to optimal. Our figure of merit is to maximize the number of independent measurements R/ρ_s for constant work W = R N_S N_T. We define the efficiency as γ = R/(ρ_s W) and tune N_T at constant W. Because N_S = 10 is fixed, we need to maximize 1/(ρ_s N_T) by tuning N_T. In the limit R → ∞, ρ_s and the efficiency γ are independent of the population size. This is expected, as γ is an intensive quantity. Therefore, to measure γ, we only need to make sure R is sufficiently large such that ρ_s has converged. It is not necessary to use the same W for different N_T.

FIG. 3: Panel (a) shows the β values as a function of the temperature index k_β for the different schedules experimented with, and panel (b) shows the resulting β densities g(β) (the data are cut off at β = 3 for clarity). Note that both the TSPL and LBLT schedules have more temperatures at high T. Panels (c) and (d) show ρ_s as a function of T_N for 2D and 3D simulations, respectively. The vertical shaded line marks the optimum. See the main text for details.

The results for both two- and three-dimensional systems are shown in Figs. 6(a) and 6(b), respectively. The solid curves show the disorder average, while the dashed envelopes are the instance-by-instance results. It is interesting to note that for the relatively smaller system sizes we observe a pronounced peak. The existence of an optimum number of temperatures can be intuitively understood in the following way: For a fixed amount of computational effort, if N_T is too small, then the annealing or resampling becomes too stochastic, which is inefficient. On the other hand, if the annealing is too slow (N_T is too large), this becomes unnecessary and keeping a larger population size is more efficient. Therefore, the optimum comes from a careful balance between N_T and R.
However, as the system size grows, the optimum peak starts to flatten out due to the onset of temperature chaos [36-42]. This can be seen in Fig. 6 as a discernible increase in the density of instances with irregular oscillatory behavior. We thus conclude that the optimization presented here, although capturing the bulk of the instances, might not be reliable in the case of extremely hard (chaotic) instances. Instead, one may consider performing more Metropolis sweeps rather than merely increasing the number of temperature steps or the population size. This is especially relevant if memory (which correlates with R) becomes a concern for the hardest instances.

D. Dynamic population sizes
FIG. 4: Comparison of the systematic errors for various annealing schedules. The studied observables are the energy (E), the free energy (F), and the spin-glass Binder cumulant (g_SG) for the system size L = 10. Panels (a) and (b) show the systematic errors for two randomly chosen hard instances, whereas panel (c) illustrates the systematic errors averaged over 100 of the hardest instances. Systematic errors of different observables often differ greatly in magnitude. For this reason, the errors in each observable have been normalized relative to the maximum error across all schedules. For instance, in the top panel the std(E) schedule, which has the greatest systematic error, is normalized to 1, while the rest of the schedules lie below 1. It is seen from the plots that the TSPL schedule is the most efficient. The LBLT schedule, although conveniently simple, competes well with the optimal schedule. Note that we also show ρ_s (on a dual y axis) in panel (c). We observe that ρ_s correlates strongly with the systematic errors, justifying its use as an effective optimization criterion.

FIG. 5: Energy density distributions for the LBLT annealing schedule for L = 8 in three space dimensions. Thinner curves show the histograms at all temperatures, whereas the thicker ones are drawn every few temperature steps. The histograms overlap considerably.

The reason the LBLT schedule is more efficient than a simple LB schedule is that the Metropolis dynamics is less effective at low temperatures, and therefore using more "hot" temperatures is more efficient. Here we investigate another technique, namely, a variable number of replicas that depends on the annealing temperature, thus having a similar effect to having more temperatures at higher values. Regular PAMC is designed to have an approximately uniform population size as a function of temperature. Here we allow the population size to change with β. Because most families are removed at a relatively early stage of the anneal, transferring some replicas from low temperatures to high temperatures may increase the diversity of the final population, even though the final population size would be smaller [43].

We study a simple clipped exponential population schedule where the population starts as a constant R₀ until β = β₀ and then decreases exponentially to R_f = R₀/r at β = β_max,

R(β) = R₀ for β ≤ β₀, and R(β) = a R₀ / [(r − 1)(e^{βS} − e^{β₀S}) + a] for β > β₀ ,    (6)

where a = exp(β_max S) − exp(β₀ S). The free parameters to tune are S, β₀, and r. S is chosen such that the function is continuous and naturally characterizes the slope of the curve. Once the parameters are optimized, we can scale the full function to have an average population size comparable to that of the uniform schedule. The optimization is again done using Bayesian statistics, and we obtain r ≈ 33, with correspondingly optimized values of β₀ and S.

It is noteworthy that there are two different measures of efficiency when the population size is allowed to change. For the same average population size, the dynamic population schedule is always better at high temperature.
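Equation (6) can be sketched directly; the function name and the parameter values used in the example are illustrative placeholders, not the optimized values quoted above:

```python
import numpy as np

def population_schedule(beta, R0, beta0, r, S, beta_max):
    """Clipped exponential population schedule of Eq. (6): the
    population is constant at R0 up to beta0 and then decreases
    smoothly to R0/r at beta_max."""
    a = np.exp(beta_max * S) - np.exp(beta0 * S)
    if beta <= beta0:
        return float(R0)
    return a * R0 / ((r - 1.0) * (np.exp(beta * S) - np.exp(beta0 * S)) + a)
```

One can check directly that the expression is continuous at β₀ (the denominator reduces to a, giving R₀) and that it reaches R₀/r at β_max (the denominator reduces to r·a), consistent with the limits stated above.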
However, at low temperature, a smaller ρ_s does not guarantee that the number of independent measurements is larger, because R is also smaller. It is thus reasonable to optimize the parameters using ρ_s, and then also to compare to R/ρ_s. Note that we use the local population size R at each temperature to compute ρ_s. The correlations and comparisons of ρ_s and ρ_f are also studied. With the optimum parameters, we compare the efficiency of the dynamic and uniform population sizes. The results are shown in Fig. 7. We see that ρ_s and ρ_f are well correlated for the dynamic population size. ρ_s is greatly reduced, suggesting that the simulation is much better at the level of averaging over all temperatures. We also see that, even using the worst-case measure, the dynamic population size is more efficient than the uniform one. Note, however, that the peak memory use of the dynamic population size is larger due to the nonuniformity of the number of replicas as a function of β.

FIG. 6: Optimization of the number of annealing steps N_T in two space dimensions [2D, panel (a)] and three space dimensions [3D, panel (b)]. To maximize sampling efficiency, one needs to optimize 1/(ρ_s N_T) with respect to N_T. In both panels the points and the solid curves show the disorder average, while the dashed envelopes display all studied instances. For smaller system sizes the peak (optimum) is sharp, whereas for systems with more than approximately … spins the peak is broadened, especially in two dimensions. This broadening can be understood by noticing the increase in the density of chaotic samples as the system grows in size (wiggly lines).

IV. ALGORITHMIC ACCELERATORS
We now turn our attention to algorithmic accelerators byincluding cluster updates in the simulation. The simulationparameters are summarized in Table I.
A. Isoenergetic cluster updates
Here we study PAMC with ICM updates. In 3D, similarly to the Wolff algorithm, there is an effective temperature range where ICM (see Ref. [31] for more details) is efficient. In ICM, two replicas are updated simultaneously. This process uses the detailed structure of the two replica configurations, and it is natural to ask whether the family of a replica is still well defined. For example, occasionally two replicas may merely exchange their configurations. This is equivalent to exchanging their family names, which potentially increases the diversity of the population at little cost. To investigate this issue, we have therefore also measured the (computationally more expensive) equilibration population size ρ_f, which, unlike ρ_s, does not depend on the definition of the families. Our results are shown in Figs. 8 and 9 for 2D and 3D, respectively. We find that ρ_s is indeed artificially reduced by the cluster updates. In both 2D and 3D, ρ_f has a wide distribution, while ρ_s is almost identical for all instances. Furthermore, ρ_s and ρ_f are strongly correlated for regular PAMC, but the correlation is poor when ICM is turned on. Therefore, we conclude that ρ_s is no longer a good equilibration metric for PAMC when combined with ICM. Using ρ_f, we find that, similarly to PT [31], there is a clear speed-up in 2D. In 3D, however, the speed-up is again marginal. This is in contrast to the discernible speed-up for PT with the inclusion of ICM in 3D. The results suggest that ICM is mostly efficient in 2D and likely quasi-2D lattices, reducing both thermalization times (PT) and correlations (PAMC and PT). In 3D, ICM merely reduces thermalization times, while only marginally influencing correlations.

B. Wolff cluster updates
Wolff cluster updates are not effective in spin-glass simulations. Nevertheless, we have revisited this type of cluster update in the context of PAMC for the sake of completeness. More details can be found in Appendix A.
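For reference, a minimal sketch of the kind of Wolff-type update tested here is given below. It uses the standard spin-glass generalization in which only satisfied bonds (J_ij s_i s_j > 0) may be activated, with probability p = 1 − exp(−2β J_ij s_i s_j); the grown cluster is then flipped. The lattice representation and parameters are illustrative, not tied to our implementation.

```python
import numpy as np
from collections import deque

def wolff_update(s, J, neighbors, beta, rng):
    """One Wolff cluster update for an Ising spin glass -- a sketch.

    s         : array of +/-1 spins
    J         : dict mapping a bond (i, j) with i < j to its coupling
    neighbors : list of neighbor-index lists defining the lattice graph
    Returns the cluster size, which can be used to monitor the mean
    cluster size per spin discussed in Appendix A.
    """
    n = len(s)
    start = rng.integers(n)
    in_cluster = np.zeros(n, dtype=bool)
    in_cluster[start] = True
    queue = deque([start])
    while queue:
        i = queue.popleft()
        for j in neighbors[i]:
            if in_cluster[j]:
                continue
            Jij = J[(min(i, j), max(i, j))]
            # Only satisfied bonds can be activated.
            if Jij * s[i] * s[j] > 0:
                if rng.random() < 1.0 - np.exp(-2.0 * beta * Jij * s[i] * s[j]):
                    in_cluster[j] = True
                    queue.append(j)
    s[in_cluster] *= -1  # flip the whole cluster
    return int(in_cluster.sum())
```

At β = 0 no bond is ever activated and the move reduces to a single-spin flip; at large β on a ferromagnet the cluster approaches the whole lattice, which is the percolation effect that limits the useful temperature window in 3D.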
V. PARALLEL IMPLEMENTATION
Population annealing is especially well suited for parallel computing because operations on the replicas can be carried out independently and communication is minimal. Since OpenMP is a shared-memory parallelization library, it is limited to the resources available on a single node of a high-performance computing system. Although modern compute nodes have many cores and large amounts of RAM, the resources of a single node are smaller than those of the full machine, often by several orders of magnitude. To benefit from machines with multiple compute nodes and therefore simulate larger problem sizes, we now present an MPI implementation of PAMC which can utilize resources up to the size of the cluster. While for typical problem sizes single-node OpenMP implementations might suffice for the bulk of the studied instances, hard-to-thermalize instances could then be simulated using a massively parallel MPI implementation with extremely large population sizes. Although the exact run time depends on many variables such as the simulation parameters, architecture, code optimality, compiler, etc., here we give an example of a typical simulation time with the parameters listed in Table I.

FIG. 7: Instance-by-instance comparison of PAMC simulations with fixed and dynamic population sizes. With a dynamic population size, ρ_s and ρ_f are well correlated, similarly to the case of a uniform population. ρ_s is greatly reduced, suggesting that the simulation is much better at the level of averaging over all temperatures. The dynamic population size is also more efficient than the uniform one using the worst-case measure. Here R_f is the final population size.
On a 20-core node with Intel Xeon E5-2670 v2 2.50 GHz processors, it takes approximately …, …, and … minutes to simulate an instance in 3D with N = 216, …, and … spins, respectively.

A. Massively parallel MPI implementation
The performance and scaling of our MPI implementation for 3D Edwards-Anderson spin glasses is shown in Fig. 10. Note that the wall time scales as ∼ 1/N, with N the number of cores, for up to about 1000 cores (see Fig. 10). In our implementation, the population is partitioned equally between MPI processes (ranks). Each rank is assigned an index k, with I/O operations occurring on the 0th rank. A rank has a local population on which the Monte Carlo sweeps and resampling are carried out. We also define a global index G, which is the index of a replica as if it were in a single contiguous array. In practice, the global index G of a replica with local index j on rank k is computed as the sum of the local population sizes r_i of the preceding ranks plus the local index j, i.e.,

    G = j + ∑_{i=0}^{k−1} r_i .     (7)

FIG. 8: Population annealing with ICM updates in 2D. Note that the replica family is not well defined when ICM updates are included. Therefore, we use ρ_f to characterize the speed-up. A significant speed-up is observed in 2D.

The global index of a particular replica varies as its position in the global population changes. Load balancing is carried out when a threshold percentage between the minimum and maximum local populations is exceeded. In our implementation, all members of a family must occupy a contiguous range of global indices to allow for efficient computation of the family entropy and the overlap function of the replicas. Therefore, load balancing must maintain adjacency.
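The index arithmetic of Eqs. (7) and (8), and the accumulation of Eq. (9), can be sketched in a few lines. These are plain-Python stand-ins: in the actual MPI code the sums over ranks would be collective reductions, and the names below are illustrative, not those of our implementation.

```python
def global_index(j, k, local_pops):
    """Global index of a replica, Eq. (7): local index j on rank k,
    offset by the local population sizes of all preceding ranks."""
    return j + sum(local_pops[:k])

def destination_rank(G, R, N):
    """Destination rank of a replica with global index G, Eq. (8):
    the global population of size R is split evenly over N ranks."""
    return int(G // (R / N))

def mean_observable(values_per_rank, R):
    """Accumulated mean of an observable, Eq. (9).  values_per_rank
    holds one list of per-replica values A_{j,k} for each rank k;
    in MPI the outer sum corresponds to a reduction across ranks."""
    return sum(sum(vals) for vals in values_per_rank) / R
```

For example, with local populations [5, 7, 6], the replica with local index 2 on rank 2 has global index 2 + 5 + 7 = 14.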
The destination rank k of a replica is determined by evenly partitioning the global population such that each rank holds approximately the same number of replicas, i.e.,

    k = ⌊ G / (R/N) ⌋ ,     (8)

where N is the number of ranks (cores). Measurement of most observables is typically an efficient accumulation operation, i.e.,

    ⟨A⟩ = (1/R) ∑_{k=1}^{N} ∑_{j=1}^{r_k} A_{j,k} .     (9)

On the other hand, measuring observables such as the spin-glass overlap is more difficult and is only done at select temperatures. Sets of replicas are randomly sampled from a rank's local population and copies are sent to the range of ranks [(k + N/4) mod N, (k + 3N/4) mod N] with periodic boundary conditions to ensure that the overlap is not computed between correlated replicas. The resulting histograms are merged in an accumulation operation similar to that for regular observables. Improving the scaling with process count will require a lower-overhead implementation of the spin-overlap measurements, a problem we intend to tackle in the near future.

VI. CONCLUSIONS AND FUTURE CHALLENGES
FIG. 9: Population annealing with ICM updates in 3D. Note that the replica family is not well defined when ICM updates are included. Therefore, we use ρ_f to characterize the speed-up. A modest speed-up is observed in 3D.

We have investigated various ways to optimize PAMC, ranging from optimizations in the implementation, to the addition of accelerators, as well as massively parallel implementations. Many of these optimizations lead to considerable speed-ups. We emphasize that these approaches, even the ones that showed only marginal performance improvements for spin glasses in 2D and 3D, might be applied to other approaches for simulating statistical physics problems, potentially generating sizable performance boosts. The reduction in thermal error studied in this work can most directly be applied to the study of spin glasses by providing more CPU time for disorder averaging. For the study of spin glasses, our results show that the best performance for PAMC is obtained by selecting the spins in a fixed order, i.e., sequentially or from a checkerboard pattern. Similarly, the LBLT and TSPL schedules yield the best performance, with LBLT having the fewest parameters to tune and thus being easier to implement. The number of temperatures needed for annealing is remarkably robust for large system sizes. Hence, in order to tackle hard instances, it is often convenient to increase the number of sweeps rather than merely using more temperatures. Dynamic population sizes are desirable, albeit at the cost of a larger memory footprint. However, this can easily be mitigated via massively parallel MPI implementations. In conjunction with Ref.
[22], and as far as we know, this study represents the first analysis of PAMC from an implementation point of view. Recently, we learned [44] that the equilibration population size ρ_f can be measured in a single run using a blocking method. It would be interesting to investigate and test this idea thoroughly in the future. With an optimized PAMC implementation, it would also be interesting to perform large-scale spin-glass simulations to answer some of the unresolved problems in the field, such as the nature of the spin-glass state in three and four dimensions. We plan to address these problems in the near future.

FIG. 10: Scaling of the total wall time as a function of the number of processors N for two system sizes, L = 8 and L = 12. Launching and initialization times are not included. Note that the efficiency becomes better for larger and harder problems. For L = 12, the scaling remains ∼ 1/N up to about 1000 processors. The efficiency then decreases when the time for collecting observables becomes dominant. Note that resampling still takes a relatively small amount of time.

Acknowledgments
We thank Jonathan Machta and Martin Weigel for helpful discussions and for sharing their unpublished manuscripts. H. G. K. would like to thank United Airlines for their hospitality during the last stages of this manuscript. We acknowledge support from the National Science Foundation, NSF Grant No. DMR-1151387. The research is based upon work supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via MIT Lincoln Laboratory Air Force Contract No. FA8721-05-C-0002. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. We thank Texas A&M University for access to their Ada and Curie HPC clusters. We also acknowledge the Texas Advanced Computing Center (TACC) at The University of Texas at Austin for providing HPC resources that have contributed to the research results reported within this paper.
Appendix A: Wolff cluster updates
For the Wolff algorithm, we first measure the mean cluster size per spin, as shown in Figs. 11(a) and 11(c) for the 2D and 3D cases, respectively. Note the smooth transition of the mean cluster size from … to …. We identify a temperature range where the mean cluster size lies in the window [0.…, 0.…] [45]. We perform Wolff updates in this temperature range, i.e., we perform … Wolff updates in addition to the regular Metropolis lattice sweeps for each replica. The comparison of ρ_s with regular PAMC is shown in Figs. 11(b) and 11(d). While the Wolff algorithm speeds up ferromagnetic Ising model simulations in 2D, the speed-up is marginal for 2D spin glasses because of the zero-temperature phase transition. In 3D, the Gaussian spin glass has a phase transition near T_c ≈ 0.…, but the temperature window where the Wolff algorithm is effective lies much higher than T_c. The speed-up is therefore almost entirely eliminated, presumably because the Metropolis algorithm is already sufficient at these high temperatures. The Wolff algorithm is more efficient in 2D than in 3D because clusters percolate faster in 3D, again pushing the effective temperature range higher. Therefore, Wolff updates constitute unnecessary overhead in the simulation of spin glasses in conjunction with PAMC. Even though PAMC with the Wolff algorithm does not appear to work very well for spin glasses, this does not mean they cannot be used together. For example, in two-dimensional spin glasses, adding the Wolff algorithm still has marginal benefits. The combination of PAMC and Wolff cluster updates can also be used for ferromagnetic Ising models for the purpose of parallel computing, because parallelizing the Wolff algorithm, while doable, is challenging. In population annealing, the parallelization occurs at the level of replicas, not within the Wolff algorithm itself.

[1] E. Ising, Beitrag zur Theorie des Ferromagnetismus, Z. Phys., 253 (1925).
[2] K.
Huang, Statistical Mechanics (Wiley, New York, 1987).
[3] S. F. Edwards and P. W. Anderson,
Theory of spin glasses, J. Phys. F: Met. Phys., 965 (1975).
[4] G. Parisi, Infinite number of order parameters for spin-glasses, Phys. Rev. Lett., 1754 (1979).
[5] D. Sherrington and S. Kirkpatrick, Solvable model of a spin glass, Phys. Rev. Lett., 1792 (1975).
[6] K. Binder and A. P. Young, Spin Glasses: Experimental Facts, Theoretical Concepts and Open Questions, Rev. Mod. Phys., 801 (1986).
[7] D. L. Stein and C. M. Newman, Spin Glasses and Complexity, Primers in Complex Systems (Princeton University Press, Princeton NJ, 2013).
[8] H. G. Katzgraber, M. Körner, F. Liers, M. Jünger, and A. K. Hartmann,
Universality-class dependence of energy distributions in spin glasses, Phys. Rev. B, 094421 (2005).
[9] W. Wang, J. Machta, and H. G. Katzgraber, Comparing Monte Carlo methods for finding ground states of Ising spin glasses: Population annealing, simulated annealing, and parallel tempering, Phys. Rev. E, 013303 (2015).
[10] Z. Zhu, A. J. Ochoa, and H. G. Katzgraber, Efficient Cluster Algorithm for Spin Glasses in Any Space Dimension, Phys. Rev. Lett., 077201 (2015).

FIG. 11: Mean normalized cluster size as a function of β for the Wolff algorithm [(a) and (c)], as well as the performance of the algorithm in both 2D and 3D. There is a marginal speed-up in 2D (b) and no discernible speed-up in 3D (d).

[11] C. Geyer, in , edited by E. M. Keramidas (Interface Foundation, Fairfax Station, VA, 1991), p. 156.
[12] K. Hukushima and K. Nemoto, Exchange Monte Carlo method and application to spin glass simulations, J. Phys. Soc. Jpn., 1604 (1996).
[13] K. Hukushima and Y. Iba, in The Monte Carlo method in the physical sciences: celebrating the 50th anniversary of the Metropolis algorithm, edited by J. E. Gubernatis (AIP, Los Alamos, New Mexico (USA), 2003), vol. 690, p. 200.
[14] E. Zhou and X. Chen, in
Proceedings of the 2010 Winter Simulation Conference (WSC) (Springer, New York, 2010), p. 1211.
[15] J. Machta,
Population annealing with weighted averages: A Monte Carlo method for rough free-energy landscapes, Phys. Rev. E, 026704 (2010).
[16] W. Wang, J. Machta, and H. G. Katzgraber, Population annealing: Theory and application in spin glasses, Phys. Rev. E, 063307 (2015).
[17] S. Kirkpatrick, C. D. Gelatt, Jr., and M. P. Vecchi, Optimization by simulated annealing, Science, 671 (1983).
[18] W. Wang, J. Machta, and H. G. Katzgraber,
Evidence against a mean-field description of short-range spin glasses revealed through thermal boundary conditions, Phys. Rev. B, 184412 (2014).
[19] M. Borovský, M. Weigel, L. Yu. Barash, and M. Žukovič, GPU-Accelerated Population Annealing Algorithm: Frustrated Ising Antiferromagnet on the Stacked Triangular Lattice, EPJ Web of Conferences, 02016 (2016).
[20] L. Yu. Barash, M. Weigel, L. N. Shchur, and W. Janke,
Exploring first-order phase transitions with population annealing, Eur. Phys. J. Special Topics, 595 (2017).
[21] J. Callaham and J. Machta,
Population annealing simulations of a binary hard-sphere mixture, Phys. Rev. E, 063315 (2017).
[22] C. Amey and J. Machta, Analysis and optimization of population annealing, Phys. Rev. E, 033301 (2018).
[23] U. Wolff, Collective Monte Carlo updating for spin systems, Phys. Rev. Lett., 361 (1989).
[24] J. Houdayer, A cluster Monte Carlo algorithm for 2-dimensional spin glasses, Eur. Phys. J. B.
[27] L. Y. Barash, M. Weigel, M. Borovský, W. Janke, and L. N. Shchur, GPU accelerated population annealing algorithm, Comp. Phys. Comm., 341 (2017).
[28] R. R. P. Singh and S. Chakravarty,
Critical behavior of an Ising spin-glass, Phys. Rev. Lett., 245 (1986).
[29] H. G. Katzgraber, M. Körner, and A. P. Young, Universality in three-dimensional Ising spin glasses: A Monte Carlo study, Phys. Rev. B, 224432 (2006).
[30] D. A. Kessler and M. Bretz, Unbridled growth of spin-glass clusters, Phys. Rev. B, 4778 (1990).
[31] Z. Zhu, A. J. Ochoa, and H. G. Katzgraber, Efficient Cluster Algorithm for Spin Glasses in Any Space Dimension (2015), (cond-mat/1501.05630).
[32] It is worth noting that optimizing T_min for various annealing schedules is not necessary because ρ_s is a monotonically increasing function of the temperature. Thus, a higher T_min with the same number of temperature steps trivially results in better thermalization.
[33] J. Snoek, H. Larochelle, and R. P. Adams, in Proceedings of the 25th International Conference on Neural Information Processing Systems (Curran Associates Inc., Lake Tahoe, Nevada, USA, 2012), NIPS'12, p. 2951.
[34] R. P. Adams, M. Gelbart, and J. Snoek,
Spearmint, Git repository, github.com/HIPS/Spearmint, commit ffbab66 (2016).
[35] K. Binder,
Critical properties from Monte Carlo coarse graining and renormalization, Phys. Rev. Lett., 693 (1981).
[36] M. Ney-Nifle and A. P. Young, Chaos in a two-dimensional Ising spin glass, J. Phys. A, 5311 (1997).
[37] T. Aspelmeier, A. J. Bray, and M. A. Moore, Why Temperature Chaos in Spin Glasses Is Hard to Observe, Phys. Rev. Lett., 197202 (2002).
[38] P. E. Jönsson, H. Yoshino, and P. Nordblad, Symmetrical Temperature-Chaos Effect with Positive and Negative Temperature Shifts in a Spin Glass, Phys. Rev. Lett., 097201 (2002).
[39] H. G. Katzgraber and F. Krzakala, Temperature and Disorder Chaos in Three-Dimensional Ising Spin Glasses, Phys. Rev. Lett., 017201 (2007).
[40] L. A. Fernandez, V. Martin-Mayor, G. Parisi, and B. Seoane, Temperature chaos in 3D Ising spin glasses is driven by rare events, Europhys. Lett., 67003 (2013).
[41] W. Wang, J. Machta, and H. G. Katzgraber,
Chaos in spin glasses revealed through thermal boundary conditions, Phys. Rev. B, 094410 (2015).
[42] Z. Zhu, A. J. Ochoa, F. Hamze, S. Schnabel, and H. G. Katzgraber, Best-case performance of quantum annealers on native spin-glass benchmarks: How chaos can affect success probabilities, Phys. Rev. A, 012317 (2016).
[43] Note that the uniform population size is a special case of this generalized population size schedule.
[44] Martin Weigel, private communication.
[45] We have also experimented with other ranges, such as [0.…, 0.…] or [0.…, 0.7].