How we are leading a 3-XORSAT challenge: from the energy landscape to the algorithm and its efficient implementation on GPUs
M. Bernaschi, M. Bisson, M. Fatica, E. Marinari, V. Martin-Mayor, G. Parisi and F. Ricci-Tersenghi

Institute of Applied Computing, CNR, I-00185 Rome, Italy
NVIDIA Corporation, Santa Clara, CA 95050, United States of America
Dipartimento di Fisica, Sapienza Università di Roma, I-00185 Rome, Italy
INFN, Sezione di Roma 1, I-00185 Rome, Italy
CNR-Nanotec, Rome unit, I-00185 Rome, Italy
Departamento de Física Teórica, Universidad Complutense, 28040 Madrid, Spain
Instituto de Biocomputación y Física de Sistemas Complejos (BIFI), 50018 Zaragoza, Spain
PACS – Computational techniques; simulations
PACS – Numerical optimization
Abstract – A recent 3-XORSAT challenge required minimizing a very complex and rough energy function, typical of glassy models with a random first-order transition and a golf-course-like energy landscape. We present the ideas behind the quasi-greedy algorithm and its very efficient implementation on GPUs that are allowing us to rank first in the competition. We suggest a better protocol to compare algorithmic performances, and we also provide analytical predictions about the exponential growth of the time to find the solution in terms of free-energy barriers.
Introduction. –
In October 2019, we were invited to join a computing challenge on 3-XORSAT problems launched by colleagues at the University of Southern California [1]. The idea behind the challenge was to compare the actual performance of the best available computing platforms, including quantum computers, in solving a particularly hard optimization problem. Quantum computing is becoming practical these days, and many different computing devices based on quantum technologies are becoming available (D-Wave, Google and IBM, just to cite the best known). So it is natural to ask whether any of the quantum devices available today can do better than classical (i.e. non-quantum) computing machines.
We decided to join this 3-XORSAT challenge with a proposal combining new algorithmic ideas and a highly optimized GPU implementation. We are not going to discuss in detail the results of the 3-XORSAT challenge, which will appear elsewhere [1]. We just remark that the performance of our algorithm running on commercial Nvidia GPUs is at least 2 orders of magnitude better than that of the other devices that entered the 3-XORSAT challenge: the D-Wave quantum annealing processor [2], the Memcomputing machine [3], the Fujitsu digital annealer [4] and Toshiba's simulated bifurcation machine [5].
This is clearly not the end of the story, as quantum technologies are evolving very fast and presumably will become competitive soon (eventually reaching what is called a quantum advantage). Nonetheless, we believe it is very important to clarify what is today the state of the art in the "classical vs. quantum computation challenge".
In the present manuscript we report the ideas and the technical details that make our solving algorithm rank first as a solver of the hard optimization problems presented at the 3-XORSAT challenge.
The manuscript is organized as follows. First we recap the known physical properties of these hard optimization problems (especially their energy landscape).
Then we describe the algorithm we decided to use, with a particular emphasis on the use of a large number of clones and how this can be implemented efficiently on classical computers. We also provide a description of the technical choices that made our GPU implementation extremely efficient, even if the problem, being defined on a random graph topology, would in principle not be ideal for a platform like a GPU. Finally, we discuss the numerical results about the time-to-solution (TTS), proposing an improved way of measuring the largest percentiles of the TTS. We finish with a few concluding remarks.

The model and its energy landscape. –
The optimization problem that has been presented to the contenders at the 3-XORSAT challenge is the search for the ground state of a model based on Ising variables. The model is well known in statistical physics under the name of diluted 3-spin ferromagnetic model [6]. In the computer science literature, it corresponds to a constraint satisfaction problem known under the name of 3-XORSAT [7]. In this paper, we use the statistical physics formulation of the model, but switching to its computer science formulation requires just a change of variables.
The model is defined by the Hamiltonian

  H[s] ≡ − Σ_{(i,j,k) ∈ E} s_i s_j s_k ,    (1)

where s_i = ±1 are N Ising spins. The sum over the set E of triplets (i,j,k) is what defines the interaction topology. The instances provided in the 3-XORSAT challenge were generated on a random regular graph of fixed degree 3. In other words, the set E is made of N triplets randomly chosen under the constraints that in each triplet the 3 indices are different and each index appears in exactly 3 triplets.
From the definition of the Hamiltonian H[s] in Eq. (1), it is clear that the ground state is the configuration s* with all s*_i = 1. However, our algorithm, publicly available at [8], searches for the ground states without computing the magnetization, having access only to the energy, so there is no need to "hide" the solution by a gauge transformation. The organizers of the competition are expected to check that the same is true for all the other contenders. One may argue that such a model should be easy to optimize, because all interactions are ferromagnetic. However, it is well known that such a model shows the same glassy physics as a disordered model [6, 9], because the 3-spin interaction can be satisfied in many ways and this generates frustration in the system during the optimization.
Actually, in the 3-XORSAT challenge, an equivalent formulation has been used, where the variables are twice in number (one η variable is added per constraint) and variables interact only pairwise [11]. This has been done to allow devices implementing only pairwise interactions to enter the competition. The resulting Hamiltonian H[s, η] is such that Σ_η H[s, η] = H[s]. We have performed such a marginalization on the instances provided, so our algorithm minimizes the cost function H[s] given in Eq. (1).

(Footnote: The careful reader may have noticed that the problem of satisfying all interactions in H[s], i.e. s_i s_j s_k = 1, is equivalent to the problem of solving linear equations modulo 2, (x_i + x_j + x_k) mod 2 = 0, where s_i = (−1)^{x_i}. This problem can be solved in polynomial time, e.g. by Gaussian elimination. However, as discussed in previous publications [10], the problem can be slightly modified preserving the same physical behavior, making the polynomial algorithm no longer useful. The competition was restricted to algorithms which are robust with respect to such a change: using Gaussian elimination, or algorithms derived from it, was forbidden.)

The 3-spin model on a 3-regular random graph (3S3R) offers a paradigmatic example of a golf-course energy landscape. The thermodynamics of the model has been exactly solved [12–14] and its dynamics has been accurately studied numerically [14, 15]. The picture that comes out of these studies is exactly the one that goes under the name of "random first order transition" in the physics of glasses [16–19]. There exists an exponential (in N) number of metastable states that dominate the Gibbs measure below the dynamical critical temperature T_d, such that for T < T_d ergodicity is broken and the timescale to reach equilibrium diverges with the system size N. The divergence of such a timescale depends on the dynamical algorithm, but for local algorithms we expect it to be related to the height of some free-energy barrier. We discuss this issue in more detail below.
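The change of variables mentioned in the footnote above can be checked exhaustively on a single triplet; a small Python sketch (our own notation):

```python
from itertools import product

def spin_to_bit(s):
    """s = (-1)^x, i.e. x = (1 - s)/2, mapping s = +1 -> x = 0 and s = -1 -> x = 1."""
    return (1 - s) // 2

def interaction_satisfied(si, sj, sk):
    """The spin interaction is satisfied when s_i s_j s_k = +1."""
    return si * sj * sk == 1

def xor_equation_holds(xi, xj, xk):
    """The corresponding linear equation modulo 2: x_i + x_j + x_k = 0 (mod 2)."""
    return (xi + xj + xk) % 2 == 0

# The two formulations agree on all 8 assignments of a triplet.
all_agree = all(
    interaction_satisfied(si, sj, sk)
    == xor_equation_holds(*(spin_to_bit(s) for s in (si, sj, sk)))
    for si, sj, sk in product((1, -1), repeat=3)
)
```

A spin product s_i s_j s_k equals +1 exactly when an even number of the three spins are −1, which is the parity condition the XOR equation expresses.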
For the moment we limit ourselves to observing that the 3S3R model is mean field in nature and thus free-energy barriers grow linearly with N, implying that the timescales to thermalize and to reach the ground state grow exponentially as exp(aN).
When a local algorithm is run for a time much shorter than exp(aN), the simulation typically gets stuck at a strictly positive threshold energy e_th, where the exponential number of metastable states prevents the relaxation dynamics from going deeper in the energy landscape. The value of e_th is not exactly known, but some close bounds are available. Certainly, e_th is below the maximum energy value where metastable states can be found, e_max, but we expect it to be also not above the energy value where most of the metastable states are marginal, e_marg [13]. Indeed, marginally stable states correspond to energy minima whose Hessian spectrum has a non-negative support with a lower band edge in zero: that is, they are minima with at least one flat direction. Analytic solutions to the out-of-equilibrium Langevin dynamics [20] predict that these are the states reached during the relaxation, and recent studies [21] found that the out-of-equilibrium dynamics may relax to different energies, but none above e_marg.
The picture to keep in mind is the one of an energy landscape that at the energy level e_th is made of a huge number of marginally stable minima, on top of which any local dynamics gets stuck in the search for the few deep wells leading to lower energies. As discussed at length in Ref. [22], the main algorithmic barrier is entropic in nature: the relaxation dynamics is not able to proceed further towards the ground state not because of the need to jump over energetic barriers, but because the search needs to find a golf-course hole while wandering at the threshold energy.

The quasi-greedy algorithm. –
In order to find a set of spins minimizing H[s], one could be tempted to use a greedy algorithm, that is, an algorithm accepting only moves that decrease the energy (being interested in fast and local algorithms, we only consider single-spin-flip moves). Unfortunately, greedy algorithms get stuck at much higher energies than e_th, where local minima of H[s] appear [22].
Usually, to overcome this problem, one switches on a temperature T and exploits thermal fluctuations to escape from local minima (this is how Simulated Annealing works [23]). However, the 3S3R model does not have a thermodynamic phase transition, and for any T > 0 it is in a paramagnetic phase. This means that in thermal equilibrium the dynamics is unlikely to reach the ferromagnetic ground state s*. Even if, by chance, s* is reached, the dynamics will soon leave that configuration to thermalize again in the paramagnetic state.
In Ref. [22], a new class of heuristic algorithms has been introduced with the aim of proving that entropic barriers are the main source of computational complexity in the optimization of the 3S3R model. These quasi-greedy (QG) algorithms perform with high probability, when this is possible, a step decreasing the energy, but when they reach a local minimum they keep flipping spins that enter at least one violated interaction. These algorithms have several advantages: (i) they converge fast to the interesting low-energy part of the configurational space; (ii) they keep the system evolving even in the presence of many local minima, but without increasing the energy too much (which would make the search ineffective, since it would be run in an uninteresting region); (iii) once the ground state s* is found, the algorithm stops and thus does not escape from the solution of the problem (without the need of checking it after every single spin flip).
Calling w_k the probability of flipping a spin entering k unsatisfied interactions, in Ref.
[22] the QG algorithm with w_0 = 0 and w_2 = w_3 = 1 was studied numerically. The probability of finding the solution s* was found to reach a maximum at a small positive value of w_1, with a median TTS growing approximately exponentially in N. Building on these very promising results, we have developed here a highly optimized version of the QG algorithm.
The QG algorithm can also be viewed as an imperfect Metropolis algorithm not satisfying detailed balance, since by setting w_0 = w_1^3 we would have an algorithm satisfying detailed balance for the 3S3R model. Setting w_0 = 0 breaks detailed balance, but brings two advantages: large energy jumps are forbidden (they are not strictly required in a search limited for entropic reasons), and once the ground state s* is found, the algorithm stops.
The latter property is extremely useful because any efficient implementation of the QG algorithm must perform a large number of steps before checking the energy (a check takes a time comparable to that of a sweep of the QG algorithm). The condition w_0 = 0 ensures that a ground state found during the dynamical evolution will not be lost between two successive measurements.

The search by rare events requires many clones. –
As discussed in [22], the search for the ground state is slowed down by entropic barriers, i.e. the search for the right well leading from the e_th manifold to the ground-state configuration s* is like "finding a needle in a haystack". For this reason, instead of having a single copy of the system evolving for a very long time, it turns out to be more appropriate to follow a large number of copies of the system (evolving independently) for a shorter time, starting each one from a different random initial condition: we call these copies clones.
The rationale behind this choice is that the evolution on the marginal manifold at e_th is not fast enough to allow a single clone to visit the entire manifold in a reasonable time. So if the clone starts from an unfavourable initial condition, its search is bound to fail even if it keeps evolving for a very long time.
We are facing a typical phenomenon ruled by rare events: in the large-N limit, for a typical initial condition, the QG algorithm gets stuck at e_th and fails to find s*, but there are rare initial conditions that allow the QG algorithm to find the solution s* in a short time. The probability of such rare initial conditions (which roughly coincide with the basin of attraction of s*) is exponentially small in N, as in any large-deviation process.
One has to make a choice between the following two extreme strategies: running a single clone for a time scaling exponentially in N, or running a number of clones scaling exponentially with N for a finite time. In principle, one should optimize over all choices in between these extremes, at a fixed total amount of computing time.
Our choice has been to run the largest possible number of clones. This turns out to be the best choice for several reasons. It reduces fluctuations and, if the number of clones is large enough, we can derive analytic predictions for setting the running time to an optimal value (see below).
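The QG rule and the multi-clone strategy can be condensed into a short Python sketch (a serial toy version under our own naming, not the optimized GPU code; the instance is a tiny hand-built one):

```python
import random

# A tiny instance: 4 spins, 4 triplets, every spin in exactly 3 triplets.
TRIPLETS = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]
N = 4
BY_SPIN = [[t for t in TRIPLETS if i in t] for i in range(N)]

def unsat_count(s, i):
    """Number of violated interactions (s_a s_b s_c = -1) that spin i enters."""
    return sum(1 for (a, b, c) in BY_SPIN[i] if s[a] * s[b] * s[c] == -1)

def qg_run(w1, max_sweeps, rng):
    """One clone: start at random, sweep with flip probabilities
    w_0 = 0, w_1 = w1, w_2 = w_3 = 1; stop when all interactions hold.
    Returns the number of sweeps used, or None on failure (timeout)."""
    w = (0.0, w1, 1.0, 1.0)
    s = [rng.choice((-1, 1)) for _ in range(N)]
    for sweep in range(1, max_sweeps + 1):
        if all(s[a] * s[b] * s[c] == 1 for (a, b, c) in TRIPLETS):
            return sweep  # ground state reached; w_0 = 0 keeps us here
        for i in range(N):
            if rng.random() < w[unsat_count(s, i)]:
                s[i] = -s[i]
    return None

def best_of_clones(n_clones, w1=0.1, max_sweeps=10000, seed=0):
    """The TTS of the whole search is the best clone's time."""
    rng = random.Random(seed)
    times = [qg_run(w1, max_sweeps, rng) for _ in range(n_clones)]
    return min(t for t in times if t is not None)

# Example: time-to-solution (in sweeps) of the best of 16 clones.
tts = best_of_clones(16)
```

Note that the only fixed point of this dynamics is the ground state: a configuration where no spin can move has w_k = 0 for every spin, hence k = 0 everywhere, i.e. all interactions satisfied.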
Moreover, it is very unlikely for a single clone (or a few clones) to find the solution, while when running a huge number of clones some of them do find the solution, thus allowing us to estimate the mean TTS. Finally, running a large number of clones is highly beneficial from the coding point of view, since the clones can evolve in parallel, leading to a drastic reduction in running times.

Basic information about the GPU implementation. –
We implemented the algorithm described above in a CUDA code that looks for the ground state of the given problem instance using thousands of clones concurrently. Since spins can only take two values, we use a multi-spin coding technique, packing values from different clones into 32-bit words [24]. This allows us to update 32 distinct clones in parallel by using Boolean operations on the spin words (we use the same random number for the 32 clones in the same word). Moreover, we use the natural thread parallelism, evolving different clones in each GPU core. Finally, multiple GPUs can be used simultaneously by executing the same code with different random seeds. So, in the end, we have three levels of parallelism: i) multi-spin coding; ii) thread level; iii) multiple independent executions on distinct GPUs. Although the code is able to fit the number of threads to the actual number of cores available on the GPU in use, most of the runs have been executed on V100 GPUs featuring 5120 cores. On those GPUs, the total number of clones was N_cl = 327680.
One more crucial aspect of our GPU implementation is the partitioning of the variables into independent sets. In this way, the spin update procedure can be performed in parallel inside each independent set.
Further details on the GPU implementation and on the optimizations and fine tuning of the code are reported in the Supplementary Material.

Fig. 1: Mean TTS (in seconds) as a function of w_1. Preliminary runs on an instance of size N = 256 allowed us to estimate the optimal value of w_1, the only free parameter of the QG algorithm.

Numerical results. –
We have been provided with 100 instances for each of several problem sizes. After some preliminary runs, we decided to focus our attention on sizes 256, 512 and 640, which, once transformed back to the form of a 3S3R model, correspond to N = 128, 256 and 320.
The QG algorithm depends on a single parameter, w_1, which is the probability of flipping a variable entering one unsatisfied interaction and two satisfied interactions. The other parameters are fixed: w_0 = 0 (the ground state is a fixed point of the QG algorithm) and w_2 = w_3 = 1 (the QG algorithm decreases the energy whenever possible). Our preliminary runs also served to optimize over w_1. In Fig. 1 we show the mean TTS in a given instance of size N = 256. A quadratic interpolation of the data estimates the optimal value of w_1, with a negligible dependence on the problem size (at least for the sizes studied); hereafter we fix w_1 to this optimal value. Although the QG algorithm does not satisfy detailed balance, we can associate a pseudo-temperature to the value of w_1 through the relation w_1 = exp(−2/T_run), the probability of flipping a spin entering one unsatisfied interaction in the Metropolis algorithm running at temperature T_run. It is worth noticing that for the optimal w_1 the corresponding T_run is slightly above the dynamical transition temperature T_d [14].
Since the QG algorithm is stochastic, the TTS is a random variable whose probability distribution is often measured via its percentiles TTS_p, defined by P[TTS < TTS_p] = p/100. The organizers of the 3-XORSAT competition asked the participants to estimate the 99-th percentile TTS_99 for each of the 100 instances of a given size and to report the median value (over the instances). This is the time required to solve, with 99% probability, an instance of median hardness. As we will discuss in detail, we are not confident that this is the best metric for evaluating the performance of the proposed algorithms.
Most of our simulations have been executed on Nvidia V100 GPUs running N_cl = 327680 clones in parallel. The TTS for a single run of our QG algorithm is given by the shortest among the N_cl times each clone requires to reach the solution. Under the assumption, which we have checked numerically with great accuracy, that the cumulative distribution of the single-clone time to reach a solution starts linearly at the origin, the TTS (the best of the N_cl clones) is exponentially distributed as (see SM)

  P[TTS > t] = exp(−t/τ) .    (2)

A check of the above equation is reported in Fig. 2, where we plot on a semilogarithmic scale the probability that the TTS is larger than a given time t (in seconds) for 20 instances of size N = 256. Data have been obtained by running the QG algorithm 1008 times and sorting the corresponding 1008 values of the TTS. We observe that the exponential distribution (which is linear on a semilogarithmic scale) describes the data very well down to small probabilities. A consequence of this observation is that the TTS of our QG algorithm with a very large number of clones can be perfectly described in terms of the single timescale τ, the mean TTS, which depends solely on the particular instance under study.

Fig. 2: Cumulative probability distribution of the TTS for 20 instances of size N = 256.

In Fig. 2 the crossing of the data with the horizontal dotted line determines the value of TTS_99. We believe that this is not the best estimate of the time such that the QG algorithm finds the solution with probability 99%. As a matter of fact, a better estimate, affected by much smaller fluctuations, is given by −log(0.01) τ. The latter estimate is much more robust than TTS_99 since it is obtained from all the measured TTS values.
Moreover, TTS_99 requires the execution of the algorithm at least 100 times, whereas τ can be safely estimated from a much smaller number of measurements.

Fig. 3: Different estimates of τ, the mean TTS, for the 100 instances of size N = 256: the mean TTS together with TTS_50/(−log 0.5), TTS_90/(−log 0.1) and TTS_99/(−log 0.01).

In general, the value of TTS_p can be better computed via −log(1 − p/100) τ, after the value of τ has been estimated. In Fig. 3, we plot for each of the 100 instances of size N = 256 the mean TTS τ and three equivalent estimates obtained from TTS_50 (the median), TTS_90 and TTS_99. For each instance, the four estimates are very close, whereas they vary a lot when changing the instance. A more careful inspection of the data in Fig. 3 highlights that the mean TTS τ is always in the middle of the group of four estimates, while the estimate based on TTS_99 is sometimes far from the others. The above observations suggest that using τ instead of TTS_99 would provide more reliable and stable results in the analysis of algorithm performance.

Running with a short timeout. –
What is the best possible way to estimate τ? Obviously, having unbounded computing power, one could simply execute the search N_run times and just take the average of all the TTS values. But for a process that requires a computing time growing exponentially with the problem size N, this naive approach soon becomes unfeasible.
Nonetheless, we deduce from the data shown in Fig. 2 and from the cumulative distribution in Eq. (2) that runs with a very short TTS always exist for any τ, although they become very rare for large values of τ.
So we can adopt a different search strategy. Instead of letting every run finish by reaching the solution s* (which sooner or later is found, since the dynamical process we simulate is ergodic for finite N), we can set a timeout t_max such that the QG algorithm reports a failure if the solution is not found in a time shorter than t_max.
This early-stop strategy has several advantages. The use of a timeout prevents very long runs: this is very useful, not only because it stops in advance those unfortunate runs that would take an atypically long time, but also because it makes all runs of a similar duration (which is very useful when planning a large group of parallel runs). More importantly, the algorithm with a timeout can also be run for very large sizes, where the algorithm without any timeout would take too long to finish. The data for problems of size N = 320 have been obtained with this strategy, and a sensible estimate would have been otherwise impossible to get.

Fig. 4: Estimates of the mean TTS τ for all the 100 instances of size N = 256 (ordered by growing hardness, 1008 runs per instance), obtained from runs with timeouts of 5, 10 and 20 seconds, compared with no-timeout runs.

If the timeout t_max is much shorter than the mean TTS, t_max ≪ τ, only a small fraction of runs will find the solution.
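Given Eq. (2), a run with timeout t_max succeeds with probability 1 − exp(−t_max/τ) ≈ t_max/τ, so the success count n among N_run runs concentrates around T_tot/τ with T_tot = N_run t_max, and τ can be read off as roughly T_tot/n. A quick Python check of this simple counting estimate (synthetic exponential data, not the challenge instances):

```python
import random

def simulate_timeout_runs(tau, t_max, n_runs, seed=0):
    """Draw n_runs exponential TTS values (Eq. (2)) and count how many
    finish within the timeout t_max."""
    rng = random.Random(seed)
    return sum(1 for _ in range(n_runs) if rng.expovariate(1.0 / tau) < t_max)

def estimate_tau(t_max, n_runs, n_success):
    """Counting estimate: n ~ N_run * (1 - exp(-t_max/tau)) ~ T_tot / tau."""
    t_tot = n_runs * t_max
    return t_tot / n_success

n = simulate_timeout_runs(tau=100.0, t_max=5.0, n_runs=20000, seed=7)
tau_hat = estimate_tau(5.0, 20000, n)
```

With these parameters only about 5% of the runs succeed, yet tau_hat lands close to the true τ = 100; the full posterior distribution on τ is derived next in the text.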
By running N_run runs with a timeout t_max, we can estimate τ from the number n of successful runs as follows. The posterior distribution on τ, given that we observe n successful runs among N_run, is proportional to

  P(τ|n) ∝ (1/τ) C(N_run, n) (1 − e^{−t_max/τ})^n (e^{−t_max/τ})^{N_run − n} ,

where the factor 1/τ in front of the binomial coefficient C(N_run, n) is the prior on τ; it is such that, before taking any measurement, the probability measure is uniform in the variable ln τ. Since t_max ≪ τ, we can simplify the posterior to the following normalized distribution

  P(τ|n) = T_tot^n e^{−T_tot/τ} / [(n − 1)! τ^{n+1}] ,

with T_tot = N_run t_max the total running time. Getting an estimate of τ from this posterior distribution is straightforward, and the results are shown in Fig. 4 for timeouts of 5, 10 and 20 seconds.
We see from the data in Fig. 4 that the estimates of τ from runs with a rather short timeout are very accurate for most of the samples: only for samples whose mean TTS is almost 2 orders of magnitude larger than the timeout did the estimate turn out to be larger, but still compatible within error bars. In particular, we notice that for the median instance the estimates of τ (and thus of the 99-th percentile TTS_99) obtained from runs with a very short timeout are perfectly fine and allow us to save a great amount of computing time.
We summarize in Table 1 our best estimates of TTS_99 for the median instance, and in Fig. 5 we plot the values reported in the table together with the best-fitting exponential growth, TTS_99 ∝ exp(aN).

Fig. 5: 99-th percentile TTS_99 for the median instances of several problem sizes, together with the best-fitting exponential growth.

Table 1: Values of the 99-th percentile TTS_99 for the median instances of several sizes. All times are expressed in seconds.

    N    from all runs (no timeout)    from runs with timeout    timeout value
   128   0.0275 ± …                    …                         …
   256   …                             700 ± 100                 4.5
   320   —                             130k ± 30k                450

Analytic prediction for the exponent a in ln τ ∼ aN. –
Although the QG algorithm is heuristic and does not satisfy the detailed balance condition, we can still obtain an approximate analytical estimate of the exponent a ruling the growth of τ with N, assuming the dynamics takes place in contact with a thermal bath at an effective temperature T_run = −2/log(w_1). In thermal equilibrium, we expect the time to visit the ground state s* to be related to the free-energy barrier between the paramagnetic state and the ordered state around s*. We need to compute the free energy as a function of the magnetization m = Σ_i s_i / N.
We consider a K-spin model on a K-regular random graph (the model we simulated has K = 3, but it is worth presenting the analytic computation for a generic K value). In order to set the magnetization to an arbitrary value, we add an external field b to the Hamiltonian: H[s] − b Σ_i s_i. Using the cavity method for sparse models [25–27] we can write the free energy at temperature T = 1/β in the following variational form

  −βf = log Z_i + log Z_a − K log Z_{ai} − βbm ,    (3)

  Z_i = 2 cosh(β(Ku + b)) / (2 cosh(βu))^K ,
  Z_a = cosh(β) [1 + tanh(β) tanh(βh)^K] ,
  Z_{ai} = cosh(β(u + h)) / [2 cosh(βu) cosh(βh)] ,

which needs to be extremized with respect to the external field b and the cavity fields u and h. The saddle-point equations read

  m = tanh(β(Ku + b)) ,
  h = (K − 1)u + b ,    (4)
  tanh(βu) = tanh(β) tanh(βh)^{K−1} .

Fig. 6: The free energy as a function of the magnetization at different temperatures below the spinodal one.
The paramagnetic solution to Eq. (4) has u = h = m = 0 and f = f_para ≡ −log(2 cosh(β))/β. The ferromagnetic solution has u, h, m > 0 and it exists only for temperatures T < T_sp, with the spinodal temperature given by

  T_sp^{−1} = min_x { atanh[ tanh(x) / tanh((K − 1)x)^{K−1} ] } .

For K = 3 this determines T_sp, and below this value we can compute the barrier separating the paramagnetic and the ferromagnetic states.
In Fig. 6, we plot β∆f = β(f − f_para) as a function of the magnetization m for several temperatures below the spinodal one. We notice that the ferromagnetic state always has a higher free energy than the paramagnetic state. Indeed, this model has no thermodynamic phase transition, but only a dynamical transition at T_d. Although the ferromagnetic state never dominates the Gibbs measure in the large-N limit, we can estimate the time for reaching it by thermal fluctuations via

  τ ≈ τ_0 exp[ N β max_m ∆f(m) ] ,

where τ_0 is a microscopic timescale that we expect to depend mainly on the temperature and to diverge in the limit T → T_d^+.
In Fig. 7, we show the free-energy barrier as a function of the temperature, together with the relevant temperatures discussed so far (T_d, T_run and T_sp). The horizontal line corresponds to the actual growth rate of the QG algorithm.
Several observations are in order. The analytic prediction based on the thermodynamic free-energy barrier β max_m ∆f evaluated at T_run is a very good approximation to the actual rate. According to the data shown in Fig. 7, in principle a purely thermal algorithm could be even faster, but the divergence of τ_0 approaching T_d implies the existence of an optimal running temperature slightly above T_d. We believe the T_run used in our simulations is close to such an optimal temperature.

Fig. 7: The free-energy barrier that determines the growth of the timescale to visit the ground state by thermal fluctuations.
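As a numerical illustration of the formulas above, the spinodal temperature can be obtained by a direct grid search; a Python sketch under our reconstruction of the spinodal condition (setting b = 0, h = (K − 1)u and x = βu in the last of Eqs. (4)):

```python
import math

def spinodal_temperature(K=3, x_lo=1e-3, x_hi=10.0, steps=100000):
    """A ferromagnetic solution of tanh(beta*u) = tanh(beta)*tanh(beta*h)^(K-1)
    with h = (K-1)*u first appears when tanh(beta) reaches
    min_x tanh(x)/tanh((K-1)*x)^(K-1), where x = beta*u;
    hence 1/T_sp is the atanh of that minimum."""
    g_min = min(
        math.tanh(x) / math.tanh((K - 1) * x) ** (K - 1)
        for x in (x_lo + k * (x_hi - x_lo) / steps for k in range(steps + 1))
    )
    return 1.0 / math.atanh(g_min)
```

For K = 3 the minimum of the ratio sits at x of order one and yields a finite T_sp, consistent with the finite spinodal temperature below which the barrier of Figs. 6 and 7 is computed.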
Conclusions. –
We have described the algorithm and the strategy that are allowing us to rank first, well above all the other contenders, in the 3-XORSAT competition, which asked to solve a very hard optimization problem. Although finding the ground state of the 3S3R model requires a time growing exponentially with the system size, we have discussed the strategy that allowed us to reduce the growth rate to a rather small value.
The detailed analysis of the running times that we have carried out suggests that the dogma of measuring TTS_99 for the median problem is not the best measure of algorithm performance, as it requires enormous computing resources without any clear advantage with respect to measuring the median TTS. The strategies discussed above (cloning and restarting) avoid the longest runs and make TTS_99 of limited practical relevance. We propose to rank algorithms according to the median TTS to solve the hardest instances, given that fluctuations among instances are far more severe.
The successful connection between the dynamical behavior of the QG algorithm and the thermodynamic barrier of the 3S3R model suggests that the statistical physics description of algorithms in terms of the energy landscape is a key tool towards their full comprehension and towards the proposal of possibly better optimization algorithms [28].

∗ ∗ ∗

We are grateful to Itay Hen for helpful discussions. This work was supported by MINECO (Spain) through Grant No. PGC2018-094684-B-C21 (partially funded by FEDER) and by the European Research Council under the European Union's Horizon 2020 research and innovation program (Grant No. 694925-Lotglassy). This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.
REFERENCES

[1] Kowalsky M., Albash T., Hen I. and Lidar D., in preparation (2021).
[2] Bunyk P. I., Hoskinson E. M., Johnson M. W., Tolkacheva E., Altomare F., Berkley A., Harris R., Hilton J. P., Lanting T., Przybysz A. and Whittaker J., IEEE Transactions on Applied Superconductivity, (Aug. 2014) 1.
[3] Pham V.-T., Traversa F. L., Cicotti P., Sheldon F. and Di Ventra M., Complexity, (2018) 7982851.
[4] Aramon M., Rosenberg G., Valiante E., Miyazawa T., Tamura H. and Katzgraber H. G., Frontiers in Physics, (2019) 48.
[5] Goto H., Tatsumura K. and Dixon A. R., Science Advances, (2019).
[6] Franz S., Mézard M., Ricci-Tersenghi F., Weigt M. and Zecchina R., Europhys. Lett., (2001) 465.
[7] Dubois O. and Mandler J., Comptes Rendus Mathematique, (2002) 963.
[8] https://gitlab.com/maurob/3xorsat
[9] Ricci-Tersenghi F., Science, (2010) 1639.
[10] Barthel W., Hartmann A. K., Leone M., Ricci-Tersenghi F., Weigt M. and Zecchina R., Physical Review Letters, (2002) 188701.
[11] Hen I., Physical Review Applied, (2019) 011003.
[12] Mézard M., Ricci-Tersenghi F. and Zecchina R., Journal of Statistical Physics, (2003) 505.
[13] Montanari A. and Ricci-Tersenghi F., The European Physical Journal B, (2003) 339.
[14] Krzakala F. and Zdeborová L., Europhys. Lett., (2010) 66002.
[15] Montanari A. and Ricci-Tersenghi F., Physical Review B, (2004) 134406.
[16] Kirkpatrick T. and Wolynes P., Physical Review A, (1987) 3072.
[17] Kirkpatrick T. R., Thirumalai D. and Wolynes P. G., Physical Review A, (1989) 1045.
[18] Castellani T. and Cavagna A., Journal of Statistical Mechanics: Theory and Experiment, (2005) P05012.
[19] Biroli G. and Bouchaud J.-P., Structural Glasses and Supercooled Liquids: Theory, Experiment, and Applications, (2012) 31.
[20] Cugliandolo L. F. and Kurchan J., Physical Review Letters, (1993) 173.
[21] Folena G., Franz S. and Ricci-Tersenghi F., Physical Review X, (2020) 031045.
[22] Bellitti M., Ricci-Tersenghi F. and Scardicchio A., arXiv preprint, (2021) arXiv:2102.00182.
[23] Kirkpatrick S., Gelatt C. D. and Vecchi M. P., Science, (1983) 671.
[24] Jacobs L. and Rebbi C., Journal of Computational Physics, (1981) 203.
[25] Mézard M. and Parisi G., The European Physical Journal B, (2001) 217.
[26] Mézard M. and Parisi G., Journal of Statistical Physics, (2003) 1.
[27] Mézard M. and Montanari A., Information, Physics, and Computation (Oxford University Press) 2009.
[28] Zhou H.-J., arXiv preprint, (2020) arXiv:2007.00303.

Supplementary Material for manuscript "How we are leading a 3-XORSAT challenge: from the energy landscape to the algorithm and its efficient implementation on GPUs"
Derivation of the exponential distribution for the TTS. –
We derive here a central-limit-theorem-like result that backs Eq. (2) in the main text.

Let $p(t)$ be the probability that a single clone has not found the ground state after running for a time $t$. Let us further assume that $p(t)$ can be expanded as
$$p(t) = 1 - \frac{t}{t_0} + a\,t^2 + \ldots\,. \qquad (5)$$
Let us now assume that we run $N_{\rm cl}$ clones in parallel in our device. The probability that none of them hits the ground state after running for a time $t$ is
$$p_{N_{\rm cl}}(t) = \left[p(t)\right]^{N_{\rm cl}}\,. \qquad (6)$$
We shall now take the limit of a large number of clones, keeping fixed $\hat t = t\,N_{\rm cl}$:
$$\log p_{N_{\rm cl}}(t = \hat t/N_{\rm cl}) = N_{\rm cl}\,\log p(\hat t/N_{\rm cl}) = -\hat t/t_0 + O(1/N_{\rm cl})\,, \qquad (7)$$
which tells us that the limiting law is, indeed, exponential. Furthermore, a comparison with Eq. (2) in the main text tells us that $\tau = t_0/N_{\rm cl}$.

Details on the GPU implementation. –
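The limit above is easy to check numerically. The following minimal Python sketch (purely illustrative, not part of our GPU code) draws, for concreteness, each clone's solving time uniformly on $[0, t_0]$, so that the single-clone survival probability is exactly $p(t) = 1 - t/t_0$, and verifies that the TTS of $N_{\rm cl}$ parallel clones (the time at which the first clone hits the ground state) approaches an exponential law of mean $\tau = t_0/N_{\rm cl}$:

```python
import random

def clone_tts(n_cl, t0, rng):
    """TTS of n_cl clones run in parallel: the first clone to hit the
    ground state stops the run. Each clone's solving time is assumed,
    for illustration, uniform on [0, t0], i.e. p(t) = 1 - t/t0."""
    return min(rng.uniform(0.0, t0) for _ in range(n_cl))

rng = random.Random(12345)
N_CL, T0, N_SAMPLES = 256, 1000.0, 10_000
samples = [clone_tts(N_CL, T0, rng) for _ in range(N_SAMPLES)]

tau = T0 / N_CL                  # predicted mean of the limiting exponential
mean_tts = sum(samples) / N_SAMPLES
survival_at_tau = sum(s > tau for s in samples) / N_SAMPLES  # ~ exp(-1)
print(f"tau = {tau:.3f}, mean TTS = {mean_tts:.3f}, P(T > tau) = {survival_at_tau:.3f}")
```

The measured mean agrees with $\tau$ up to sampling error, and the survival probability at $t = \tau$ is close to $e^{-1} \simeq 0.37$, as expected for an exponential law.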
The initial version of our code used a single CUDA thread to update the spins in each multi-spin coded replica (hereafter we refer to each group of 32 independent multi-spin coded clones as an "msc-replica"). In this version, each of the thousands of threads running on the GPU processed the graph one node at a time: it read from global memory the spin values of the node and of its neighbors, executed the spin update procedure and, if necessary, updated the spin word of the source node by writing it back to global memory. The information about the connectivity of the graph (which is the same for all the simulated replicas) was stored in the fast (cache-like) shared memory of the GPU. While this initial approach already delivered good performance, we eventually implemented a number of optimizations that sped up the final version of the code considerably. First of all, while the GPU can run large numbers of threads simultaneously, using one thread per msc-replica made the processing of each of them a serial procedure (if we ignore the parallelism due to the multi-spin coding). Our first optimization step consisted in exposing parallelism in the spin update of each msc-replica, so that multiple nodes are processed in parallel by multiple threads, speeding up the update process considerably. Moreover, using one thread per msc-replica made it necessary to store the msc-replica data in global memory, as the limited size of the faster, on-chip shared memory is insufficient to hold a distinct msc-replica for each thread in a block (a single block can have up to 1024 threads). This was one of the main performance bottlenecks of the code, since the arithmetic density of the kernel is not high enough to hide the latency of the global memory accesses. With multiple threads per msc-replica, instead, the number of replicas processed per block shrinks, making it possible to move all the data into shared memory.
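The multi-spin coding mentioned above can be illustrated with a minimal Python sketch (function names are ours, purely illustrative): bit $b$ of each spin word stores the spin of clone $b$, so a single bitwise XOR evaluates a 3-XORSAT clause for all 32 clones at once.

```python
N_CLONES = 32                    # clones packed into one 32-bit spin word
MASK = (1 << N_CLONES) - 1

def violated(spins, clause, parity):
    """Bit b of the result is 1 iff clone b violates the clause.
    A 3-XORSAT clause (i, j, k) with parity J is satisfied when
    s_i XOR s_j XOR s_k == J; the XORs act on all 32 clones at once."""
    i, j, k = clause
    v = spins[i] ^ spins[j] ^ spins[k]
    return v ^ (MASK if parity else 0)

def energies(spins, clauses, parities):
    """Per-clone energy = number of violated clauses (popcount per bit)."""
    e = [0] * N_CLONES
    for clause, parity in zip(clauses, parities):
        v = violated(spins, clause, parity)
        for b in range(N_CLONES):
            e[b] += (v >> b) & 1
    return e

# Three spins, one clause (0, 1, 2) with parity 0: clone 0 has spins 1,1,0
# (satisfied), clone 1 has spins 0,1,0 (violated).
spins = [0b01, 0b11, 0b00]
assert energies(spins, [(0, 1, 2)], [0])[:2] == [0, 1]
spins[1] ^= 1 << 1               # flip spin 1 in clone 1 only
assert energies(spins, [(0, 1, 2)], [0])[:2] == [0, 0]
```

A single-bit XOR flips one spin in one clone while leaving the other 31 clones untouched, which is why 32 independent clones cost essentially the same as one.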
Moreover, since to process each node a thread must read its whole adjacency list from memory, we employed a low-level optimization that allows threads to load an entire 6-element adjacency list with a single load instruction, achieving higher bandwidth when reading data from shared memory.

Our first optimization consisted in an algorithmic improvement aimed at exposing parallelism in the spin update procedure. Since nodes that are not directly connected can be processed in parallel, before starting the simulations we pre-process the interaction graph defining the instance of the problem by partitioning it into independent sets. In this way, the spin update procedure can be performed in parallel inside each set (each one represents an anti-clique, in graph jargon). We process each msc-replica (and we recall that they are multi-spin coded) with a warp (that is, a group of 32 CUDA threads). Each warp progresses in its search for the ground state by processing the graph one independent set at a time and, within each set, by updating the spin values in parallel. A warp synchronization is performed after the processing of each set in order to avoid race conditions.

In order to partition the graph into independent sets, we employ a simple greedy strategy in which the graph is scanned multiple times, each time trying to build a set of independent nodes of maximum size up to a fixed value, until all the nodes have been assigned to a set. Since we process each replica using a warp, we expected the optimal set size to be a multiple of 32 (the warp size), in order to keep the processing of the sets as balanced as possible among the warps' threads. Preliminary tests showed that the best performance was obtained by partitioning the graphs into sets of 32 nodes.

The choice of such a simple partitioning scheme is motivated not only by the limited size of the graph, which has at most a few hundred nodes, but also by the fact that the graph does not change during the computation.
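The greedy partitioning just described can be sketched as follows (a simplified Python version of the pre-processing step; the default set-size limit of 32 reflects the warp width):

```python
def partition_independent_sets(n_nodes, clauses, max_size=32):
    """Greedy partition into independent sets (anti-cliques): scan the
    unassigned nodes repeatedly, adding a node to the current set only
    if it is not adjacent to any node already in it, up to max_size
    nodes per set, until every node has been assigned."""
    # Build adjacency: the three nodes of a clause form a triangle.
    adj = [set() for _ in range(n_nodes)]
    for clause in clauses:
        for u in clause:
            for v in clause:
                if u != v:
                    adj[u].add(v)
    unassigned = set(range(n_nodes))
    sets = []
    while unassigned:
        current = []
        for u in sorted(unassigned):          # one scan of the graph
            if len(current) == max_size:
                break
            if all(u not in adj[v] for v in current):
                current.append(u)
        unassigned.difference_update(current)
        sets.append(current)
    return sets

# Four nodes, one clause: nodes 0, 1, 2 are mutually adjacent, node 3 is
# isolated, so the greedy scan yields [0, 3], then [1], then [2].
assert partition_independent_sets(4, [(0, 1, 2)]) == [[0, 3], [1], [2]]
```

All spins inside one returned set can then be updated concurrently by the threads of a warp, with a warp synchronization between consecutive sets.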
The partitioning is done just once, at the beginning of the computation, and takes a negligible fraction of the total execution time.

After the graph partitioning, the initial spin values for each replica are generated randomly using the Parisi-Rapuano pseudo-random number generator [Physics Letters B, (1985) 301]. Since each node belongs to exactly three triangles (each triangle represents a clause of the 3-XORSAT formula), the graph is represented by an N × 6 matrix, where N is the number of nodes. The graph partitioning is represented with two arrays: the first contains a permutation of the node indices such that all the nodes belonging to the same independent set are contiguous; the second, with one element per set, contains the size of each set (at most 32). The samples are stored as columns of an N × NumSamples matrix, where