Variational Neural Annealing
Mohamed Hibat-Allah,^{1,2,*} Estelle M. Inack,^{3,1} Roeland Wiersema,^{1,2} Roger G. Melko,^{2,3} and Juan Carrasquilla^{1,2}

^1 Vector Institute, MaRS Centre, Toronto, Ontario, M5G 1M1, Canada
^2 Department of Physics and Astronomy, University of Waterloo, Ontario, N2L 3G1, Canada
^3 Perimeter Institute for Theoretical Physics, Waterloo, ON N2L 2Y5, Canada

(Dated: January 26, 2021)

Many important challenges in science and technology can be cast as optimization problems. When viewed in a statistical physics framework, these can be tackled by simulated annealing, where a gradual cooling procedure helps search for ground-state solutions of a target Hamiltonian. While powerful, simulated annealing is known to have prohibitively slow sampling dynamics when the optimization landscape is rough or glassy. Here we show that, by generalizing the target distribution with a parameterized model, an analogous annealing framework based on the variational principle can be used to search for ground-state solutions. Modern autoregressive models such as recurrent neural networks provide ideal parameterizations, since they can be sampled exactly and without slow dynamics even when the model encodes a rough landscape. We implement this procedure in the classical and quantum settings on several prototypical spin glass Hamiltonians, and find that it significantly outperforms traditional simulated annealing in the asymptotic limit, illustrating the potential power of this as-yet-unexplored route to optimization.
I. INTRODUCTION
A wide array of complex combinatorial optimization problems can be reformulated as finding the lowest-energy configuration of an Ising Hamiltonian of the form [1]

H_target = − Σ_{i<j} J_{ij} σ_i σ_j − Σ_i h_i σ_i,  (1)

where the σ_i = ±1 are Ising variables, the J_{ij} are pairwise couplings, and the h_i are local fields.
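To fix conventions, here is a minimal sketch of evaluating Eq. (1) for a spin configuration; the instance below is random and purely illustrative, and all names are ours.

import numpy as np

def ising_energy(sigma, J, h):
    # Energy of Eq. (1) for sigma in {-1, +1}^N.  J is an (N, N)
    # symmetric coupling matrix (only the upper triangle is used, so
    # each pair i < j contributes once); h is a length-N field vector.
    upper = np.triu(J, k=1)
    return -sigma @ upper @ sigma - h @ sigma

# Tiny illustrative instance (not from the paper):
rng = np.random.default_rng(0)
N = 6
J = rng.normal(size=(N, N)); J = (J + J.T) / 2
h = np.zeros(N)
sigma = rng.choice([-1, 1], size=N)
print(ising_energy(sigma, J, h))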
Figure 1. Schematic illustration of the space of probability distributions visited during simulated annealing. An arbitrarily slow SA visits a series of Boltzmann distributions starting at high temperature (e.g., T = ∞) and ending at the T = 0 Boltzmann distribution (continuous yellow line), where a perfect solution to an optimization problem is reached. These solutions are found either at an edge or a corner (for non-degenerate problems) of the standard probability simplex (colored triangular plane). A practical, finite-time SA trajectory (red dotted line), as well as a variational classical annealing trajectory (green dashed line), deviate from the trajectory of exact Boltzmann distributions.

A prominent approach to this search is simulated annealing (SA), in which a gradual cooling procedure helps locate ground-state solutions of the target Hamiltonian. Simulated annealing has been so successful that it has inspired intense research into its quantum extension, which requires quantum hardware to anneal the tunneling amplitude, and which can be simulated in an analogous way to SA [11, 12].

The SA algorithm explores an optimization problem's energy landscape via a gradual decrease in thermal fluctuations generated by the Metropolis-Hastings algorithm. The procedure stops when all thermal kinetics are removed from the system, at which point the solution to the optimization problem is expected to be found. While an exact solution is always attained if the decrease in temperature is arbitrarily slow, a practical implementation of the algorithm must necessarily run on a finite time scale [13]. As a consequence, the annealing algorithm samples a series of effective, quasi-equilibrium distributions close but not exactly equal to the stationary Boltzmann distributions targeted during the annealing [14] (see Fig. 1 for a schematic illustration). This naturally leads to approximate solutions to the optimization problem, whose quality generally depends on the interplay between the problem complexity and the rate at which the temperature is decreased.

In this paper, we offer an alternative route to solving optimization problems of the form of Eq. (1), called variational neural annealing. Here, the conventional simulated annealing formulation is substituted with the annealing of a parameterized model. Namely, instead of annealing and approximately sampling the exact Boltzmann distribution, this approach anneals a quasi-equilibrium model, which must be sufficiently expressive and capable of tractable sampling. Fortunately, suitable models have recently been provided by machine learning technology [15–17]. In particular, neural autoregressive models combined with variational principles have been shown to accurately describe the equilibrium properties of classical and quantum systems [18–21]. Here, we implement variational neural annealing using autoregressive recurrent neural networks (RNNs), and show that they offer a powerful alternative to conventional SA and its analogous quantum extension, i.e., simulated quantum annealing (SQA) [11]. This powerful and unexplored route to optimization is schematically illustrated in Fig. 1, where a variational neural annealing trajectory (dashed green arrow) is shown to provide a more accurate approximation to the ideal trajectory (continuous yellow line) than a conventional SA run (dotted red line).

II. VARIATIONAL CLASSICAL AND QUANTUM ANNEALING
We first consider the variational approach to statistical mechanics [18, 22], where a distribution p_λ(σ) defined by a set of variational parameters λ is optimized to closely reproduce the equilibrium properties of a system at temperature T. Following the spirit of SA, we dub our first variational neural annealing algorithm variational classical annealing (VCA).

The VCA algorithm searches for the ground state of an optimization problem, encoded in a target Hamiltonian H_target, by slowly annealing the model's variational free energy

F_λ(t) = ⟨H_target⟩_λ − T(t) S_classical(p_λ),  (2)

from a high temperature to a low temperature. The quantity F_λ(t) provides an upper bound to the true instantaneous free energy and can be used at each annealing stage to update λ through gradient-descent techniques. The brackets ⟨...⟩_λ denote ensemble averages taken over the probability p_λ(σ). The Shannon entropy is given by

S_classical(p_λ) = − Σ_σ p_λ(σ) log(p_λ(σ)),  (3)

where the sum runs over all the elements of the state space {σ}. In our setting, the temperature is decreased from an initial value T_0 to 0 using a linear schedule function T(t) = T_0(1 − t), where t ∈ [0, 1]. We parameterize p_λ(σ) with recurrent neural networks (RNNs), which allow exact sampling of configurations σ drawn from p_λ(σ). Since RNNs are normalized by construction, these samples naturally allow the estimation of the entropy in Eq. (3). We provide a detailed description of the RNN in Methods Sec. V A.

The VCA algorithm, summarized in Fig. 2(a), performs a warm-up step which brings a randomly initialized distribution p_λ(σ) to an approximate equilibrium state with free energy F_λ(t = 0) via N_warmup gradient descent steps. At each step t, we reduce the temperature of the system from T(t) to T(t + δt) and apply N_train gradient descent steps to re-equilibrate the model. A critical ingredient to the success of VCA is that the variational parameters optimized at temperature T(t) are reused at temperature T(t + δt), to ensure that the model's distribution is always near its instantaneous equilibrium state. Repeating the last two steps N_annealing times, we reach temperature T(1) = 0, which is the end of the annealing protocol. Here the distribution p_λ(σ) is expected to assign high probability to configurations σ that solve the optimization problem. Likewise, the residual entropy of Eq. (3) at T(1) = 0 provides a heuristic approach to count the number of solutions to the problem Hamiltonian [18]. Further algorithmic details are provided in Methods Sec. V B.
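To illustrate the structure of the VCA loop, the following is a minimal NumPy sketch. It replaces the paper's RNN with a factorized (mean-field) Bernoulli distribution so that it stays self-contained, and it estimates the gradient of F_λ = E_{p_λ}[E(σ) + T log p_λ(σ)] with a score-function (REINFORCE) estimator with a baseline, one common way of realizing the gradient-descent steps described above. All names and hyperparameter values are illustrative, not the paper's settings.

import numpy as np

rng = np.random.default_rng(1)

def sample_and_logp(theta, n_samples):
    # Exact samples from a factorized (mean-field) distribution over
    # spins in {-1, +1}^N -- a stand-in for the paper's autoregressive RNN.
    p_up = 1.0 / (1.0 + np.exp(-theta))
    u = rng.random((n_samples, theta.size))
    sigma = np.where(u < p_up, 1.0, -1.0)
    s = (sigma + 1.0) / 2.0
    logp = (s * np.log(p_up) + (1.0 - s) * np.log(1.0 - p_up)).sum(axis=1)
    return sigma, logp

def vca(energy_fn, N, T0=2.0, n_warmup=1000, n_annealing=2000, n_train=5,
        n_samples=50, lr=0.1):
    # Minimize F_lambda(t) = <E> - T(t) S by gradient descent while
    # annealing T(t) = T0 * (1 - t) on a linear schedule.
    theta = np.zeros(N)                      # uniform distribution: maximum entropy
    for step in range(n_warmup + n_annealing * n_train):
        t = (max(0, step - n_warmup) // n_train) / n_annealing
        T = T0 * (1.0 - t)
        sigma, logp = sample_and_logp(theta, n_samples)
        f_loc = energy_fn(sigma) + T * logp  # "local" free energy per sample
        centered = f_loc - f_loc.mean()      # baseline reduces gradient variance
        p_up = 1.0 / (1.0 + np.exp(-theta))
        grad_logp = (sigma + 1.0) / 2.0 - p_up   # d log p / d theta per sample
        theta -= lr * (centered @ grad_logp) / n_samples
    return theta                             # P(sigma_i = +1) = sigmoid(theta_i)

Setting T0 = 0 removes the entropy term and recovers a purely energy-driven optimization, the CQO limit discussed in Sec. III B.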
Simulated annealing provides a powerful heuristic for the solution of hard optimization problems by harnessing thermal fluctuations. Inspired by the latter, the advent of commercially available quantum devices [24] has enabled the analogous concept of quantum annealing [25], where the search for the solution to an optimization problem is performed by harnessing quantum fluctuations. In quantum annealing, the search for the ground state of Eq. (1) is performed at T = 0 by supplementing the target Hamiltonian with a quantum mechanical kinetic (or "driving") term,

Ĥ(t) = Ĥ_target + f(t) Ĥ_D,  (4)

where H_target in Eq. (1) is promoted to a quantum mechanical Hamiltonian Ĥ_target.

Figure 2. Variational neural annealing protocols. (a) The variational classical annealing (VCA) algorithm steps. A warm-up step brings the initialized variational state (green dot) close to the minimum of the free energy (cyan dot) at a given value of the order parameter M. This step is followed by an annealing and a training step that brings the variational state back to the new free energy minimum. Repeating the last two steps until T(t = 1) = 0 (red dots) produces approximate solutions to H_target if the protocol is conducted slowly enough. This schematic illustration corresponds to annealing through a continuous phase transition with an order parameter M. (b) Variational quantum annealing (VQA). VQA includes a warm-up step, followed by an annealing and a training step, which brings the variational energy (green dot) closer to the new ground state energy (cyan dot). We loop over the previous two steps until reaching the target ground state of Ĥ_target (red dot) if annealing is performed slowly enough.

Quantum annealing algorithms typically start with a dominant driving term Ĥ_D ≫ Ĥ_target chosen so that the ground state of Ĥ(0) is easy to prepare. When the strength of the driving term is subsequently reduced (typically adiabatically) using a schedule function f(t), the system is annealed to the ground state of Ĥ_target. In analogy to its thermal counterpart, SQA emulates this process on classical computers using quantum Monte Carlo methods [11].

Here, we leverage the variational principle of quantum mechanics and devise a strategy that emulates quantum annealing variationally. We dub our second variational neural annealing algorithm variational quantum annealing (VQA). The latter is based on the variational Monte Carlo (VMC) algorithm, whose goal is to simulate the equilibrium properties of quantum systems at zero temperature (see Methods Sec. V C). In VMC, the ground state of a Hamiltonian Ĥ is modeled through an ansatz |Ψ_λ⟩ endowed with parameters λ. The variational principle guarantees that the energy ⟨Ψ_λ|Ĥ|Ψ_λ⟩ is an upper bound to the ground state energy of Ĥ, which we use to define a time-dependent objective function E(λ, t) ≡ ⟨Ĥ(t)⟩_λ = ⟨Ψ_λ|Ĥ(t)|Ψ_λ⟩ to optimize the parameters λ.

The VQA setup, graphically summarized in Fig. 2(b), applies N_warmup gradient descent steps to minimize E(λ, t = 0), which brings |Ψ_λ⟩ close to the ground state of Ĥ(0). Setting t = δt while keeping the parameters λ fixed results in a variational energy E(λ, t = δt), and a set of N_train gradient descent steps then brings the ansatz closer to the new instantaneous ground state. The variational parameters optimized at time step t are reused at time t + δt, which promotes the computational adiabaticity of the protocol (see Appendix A). We repeat the annealing and training steps N_annealing times on a linear schedule (f(t) = 1 − t with t ∈ [0, 1]) until reaching t = 1, at which point the system should solve the optimization problem (red dot in Fig. 2(b)). We note that in our simulations, no training steps are taken at t = 1. Finally, similarly to VCA, we choose normalized RNN wave functions [20, 21] as ansätze, giving the VQA algorithm access to exact Monte Carlo samples.
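The following toy sketch conveys the VQA protocol on an exactly representable problem: for a handful of spins, Ĥ(t) is built as a dense matrix, and a normalized positive ansatz ψ = exp(λ)/||exp(λ)|| (a stand-in for the RNN wave function, adequate here because the annealing Hamiltonian is stoquastic and its ground state can be chosen positive) is updated by exact gradient descent on E(λ, t). All names and settings are ours, not the paper's.

import numpy as np

def dense_hamiltonians(J_chain, Gamma):
    # Dense H_target (diagonal) and driver H_D = -Gamma * sum_i sigma^x_i
    # for a small random Ising chain.  Exact, so only feasible for small N.
    N = len(J_chain) + 1
    dim = 2 ** N
    spins = np.array([[1 - 2 * ((k >> i) & 1) for i in range(N)]
                      for k in range(dim)], dtype=float)
    Hz = np.zeros(dim)
    for i, J in enumerate(J_chain):
        Hz += -J * spins[:, i] * spins[:, i + 1]
    Hx = np.zeros((dim, dim))
    for k in range(dim):
        for i in range(N):
            Hx[k ^ (1 << i), k] -= Gamma   # sigma^x_i flips bit i
    return np.diag(Hz), Hx

def vqa(H_target, H_driver, n_warmup=500, n_annealing=500, n_train=5, lr=0.05):
    # Anneal H(t) = H_target + f(t) H_driver with f(t) = 1 - t, applying
    # n_train gradient steps per annealing step to the normalized positive
    # ansatz psi = exp(lam) / ||exp(lam)||.
    lam = np.zeros(H_target.shape[0])
    for step in range(n_warmup + n_annealing * n_train):
        t = (max(0, step - n_warmup) // n_train) / n_annealing
        H = H_target + (1.0 - t) * H_driver
        u = np.exp(lam - lam.max())
        psi = u / np.linalg.norm(u)
        Hpsi = H @ psi
        E = psi @ Hpsi
        lam -= lr * 2.0 * psi * (Hpsi - E * psi)   # exact gradient of <H>
    return psi, E

# Tiny usage example (N = 6 spins, dim = 64):
# Hz, Hx = dense_hamiltonians(np.random.default_rng(2).random(5), Gamma=2.0)
# psi, E = vqa(Hz, Hx)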
To gain theoretical insight into the principles behind a successful VQA simulation, we derive a variational version of the adiabatic theorem [26]. Starting from a set of assumptions, such as the convexity of the energy landscape in the warm-up phase and close to convergence during annealing, as well as the absence of noise in the energy gradients, we provide a bound on the total number of gradient descent steps N_steps that guarantees the adiabaticity of the VQA algorithm, as well as a success probability of solving the optimization problem P_success > 1 − ε. Here, ε is an upper bound on the overlap between the variational wave function and the excited states of the Hamiltonian Ĥ(t), i.e., |⟨Ψ_⊥(t)|Ψ_λ⟩| < ε. We show that N_steps can be bounded as (see Appendix B)

O( poly(N) / (ε min_{t_n} g(t_n)^2) ) ≤ N_steps ≤ O( poly(N) / (ε^2 min_{t_n} g(t_n)^4) ).  (5)

The function g(t) is the energy gap between the first excited state and the ground state of the instantaneous Hamiltonian Ĥ(t), N is the system size, and the set of times {t_n} is defined in Appendix B. As expected for hard optimization problems, the minimum gap typically decreases exponentially with system size N, which dominates the computational complexity of a VQA simulation; but in cases where the minimum gap scales as the inverse of a polynomial in N, the number of steps N_steps is also polynomial in N.

III. RESULTS

A. Annealing on random Ising chains
We now proceed to evaluate the power of VCA and VQA. As a first benchmark, we consider the task of solving for the ground state of the one-dimensional (1D) Ising Hamiltonian with random couplings J_{i,i+1},

H_target = − Σ_{i=1}^{N−1} J_{i,i+1} σ_i σ_{i+1}.  (6)

First, we examine J_{i,i+1} sampled from a uniform distribution in the interval [0, 1), for which the exact ground state energy is E_G = −Σ_{i=1}^{N−1} J_{i,i+1} [27]. We use a tensorized RNN ansatz without weight sharing for both VCA and VQA (see Methods Sec. V A). We consider system sizes N = 32, 64,
128 and N_train = 5, which suffices to achieve accurate solutions. For VQA, we use a one-body driving term Ĥ_D = −Γ Σ_{i=1}^{N} σ̂^x_i, where σ̂^{x,y,z}_i are Pauli matrices acting on site i. To quantify the performance of the algorithms, we use the residual energy [11],

ε_res = [ ⟨H_target⟩_av − E_G ]_dis,  (7)

where E_G is the exact ground state energy of H_target. We use the arithmetic mean for statistical averages ⟨...⟩_av over samples from the models. For VCA this means that ⟨H_target⟩_av ≈ ⟨H_target⟩_λ, while for VQA the target Hamiltonian is promoted to Ĥ_target = −Σ_{i=1}^{N−1} J_{i,i+1} σ̂^z_i σ̂^z_{i+1} and ⟨H_target⟩_av ≈ ⟨Ĥ_target⟩_λ. We consider the typical (geometric) mean for averaging over instances of the target Hamiltonian, i.e., [...]_dis = exp(⟨ln(...)⟩_av), where the average in the argument of the exponential is the arithmetic mean over different realizations of the couplings.

Figure 3. Variational neural annealing on a random Ising chain. Panels (a) and (b) show the residual energy per site ε_res/N vs the number of annealing steps N_annealing for both VQA and VCA; the legends quote the fitted power-law exponents. The system sizes are N = 32, 64, 128 and the couplings J_{i,i+1} ∈ [0,
1) (see text for more details). The error bars represent the one-s.d. statistical uncertainty calculated over different disorder realizations [28].
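To make the two averages in Eq. (7) concrete, here is a short helper (names are ours): sampled_energies holds one array of sampled energies per disorder realization, and ground_energies the corresponding exact values E_G.

import numpy as np

def residual_energy(sampled_energies, ground_energies):
    # Eq. (7): arithmetic mean over model samples within each disorder
    # realization, then typical (geometric) mean over realizations,
    # [ . ]_dis = exp(<ln( . )>_av).
    eps = [np.mean(E) - EG for E, EG in zip(sampled_energies, ground_energies)]
    return np.exp(np.mean(np.log(eps)))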
We take advantage of the autoregressive nature of the RNN and draw a large batch of exact samples at the end of the annealing, which allows us to accurately estimate the model's arithmetic mean. The typical mean is taken over 25 instances of H_target.

In Fig. 3 we report the residual energies per site against the number of annealing steps N_annealing. As expected, the residual energy is a decreasing function of N_annealing, which underlines the importance of adiabaticity and annealing in our setting. In our examples, the decrease of the residual energy of both VCA and VQA is consistent with a power-law decay for a large number of annealing steps, with VCA decaying markedly faster than VQA (the fitted exponents are reported in Fig. 3). Results for couplings J_{i,i+1} uniformly sampled from the discrete set {−1, +1} are provided in Appendix A.

B. Edwards-Anderson model
We now consider the two-dimensional (2D) Edwards-Anderson (EA) model, a prototypical spin glass arranged on a square lattice with nearest-neighbor random interactions. The problem of finding ground states of this model has been studied experimentally [12] and numerically [11] from the annealing perspective, as well as theoretically [2] from the computational complexity perspective. The EA model with open boundary conditions is given by

H_target = − Σ_{⟨i,j⟩} J_{ij} σ_i σ_j,  (8)

where ⟨i,j⟩ denotes nearest neighbors. The couplings J_{ij} are drawn from a uniform distribution in the interval [−1, 1). For every random realization of the couplings, we use the spin-glass server [30] to obtain the exact ground state energy, which makes the EA model an ideal benchmark for our method, particularly for large system sizes. We use a 2D tensorized RNN ansatz without weight sharing (see Methods Sec. V A), with VQA implemented using a one-body driving term Ĥ_D = −Γ Σ_{i=1}^{N} σ̂^x_i. Fig. 4(a) shows the annealing results obtained on a system of N = 10 ×
10 spins. VCA outperforms VQA and, in the adiabatic, long-time annealing regime, it produces solutions three orders of magnitude more accurate on average than VQA. In addition, we investigate the performance of VQA supplemented with a fictitious Shannon information entropy [21] term that mimics thermal relaxation effects observed in quantum annealing hardware [31] and induces a thermal-like exploration of the energy landscape during the quantum annealing emulation. This form of regularized VQA, here labelled RVQA, is described by a pseudo free energy cost function

F̃_λ(t) = ⟨Ĥ(t)⟩_λ − T(t) S_classical(|Ψ_λ|²).

As in VCA, the pseudo entropy term S_classical(|Ψ_λ|²) at f(1) = 0 provides a heuristic approach to count the number of solutions to H_target for VQA and RVQA. The results in Fig. 4(a) do show an amelioration of the VQA performance, including changing a saturating dynamics at large N_annealing to a power-law-like behavior. However, it appears to be insufficient to compete with the VCA scaling (see exponents in Fig. 4(a)). This observation suggests the superiority of a thermally driven variational emulation of annealing over a purely quantum one for this example.

To further scrutinize the relevance of the annealing effects in VCA, we also consider VCA with zero thermal fluctuations, i.e., setting T_0 = 0. Because of its intimate relation to the classical-quantum optimization (CQO) methods of Refs. [32–34], we refer to this setting as CQO. Fig. 4(a) shows that CQO quickly reaches accuracies nearing 1%, but the accuracy does not further improve upon additional training, which indicates that CQO is prone to getting stuck in local minima. In comparison, VCA and VQA offer solutions orders of magnitude more accurate on average for a large number of annealing steps, highlighting the importance of annealing in tackling optimization problems.
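For concreteness, here is a short sketch of how an EA instance of Eq. (8) with open boundaries can be generated and evaluated; the array layout and names are ours.

import numpy as np

def ea_instance(L, rng):
    # Nearest-neighbor couplings of the 2D EA model, Eq. (8), drawn
    # uniformly from [-1, 1), with open boundary conditions.
    Jh = rng.uniform(-1, 1, size=(L, L - 1))     # horizontal bonds
    Jv = rng.uniform(-1, 1, size=(L - 1, L))     # vertical bonds
    return Jh, Jv

def ea_energy(sigma, Jh, Jv):
    # H_EA for spin configurations sigma of shape (..., L, L).
    e_h = (Jh * sigma[..., :, :-1] * sigma[..., :, 1:]).sum(axis=(-1, -2))
    e_v = (Jv * sigma[..., :-1, :] * sigma[..., 1:, :]).sum(axis=(-1, -2))
    return -(e_h + e_v)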
These problems belong to a special class of hardproblem ensembles whose solutions are known to the con-structor, which, together with the tunability of the hard-ness, makes the WPE model an ideal tool to benchmark ab Figure 4. Benchmarking the two-dimensional Edwards-Anderson spin glass. (a) A comparison between VCA, VQA,RVQA, and CQO on a 10 ×
10 lattice by plotting the resid-ual energy per site vs N annealing . For CQO, we report theresidual energy per site vs the number of optimization steps N steps . (b) Comparison between SA, SQA with P = 20 trot-ter slices, and VCA using a 2D tensorized RNN ansatz on a40 ×
40 lattice. The annealing speed is the same for SA, SQAand VCA. curate on average for a large number of annealing steps,highlighting the importance of annealing in tackling op-timization problems.Since VCA displays the best performance in the pre-vious benchmarks, we use it to demonstrate its capa-bilities on a 40 ×
40 × 40 spin system. For comparison, we use SA as well as SQA. The SQA simulation uses the path-integral Monte Carlo method [11] with P = 20 Trotter slices, and we report averages over energies across all Trotter slices for each realization of randomness (see Methods Sec. V D). In addition, we average the energy obtained after 25 annealing runs on every instance of randomness for SA and SQA. To average over Hamiltonian instances, we use the typical mean over 25 different realizations for the three annealing methods. The results are shown in Fig. 4(b), where we present the residual energies per site against the number of annealing steps N_annealing, which is set so that the speed of annealing is the same for SA, SQA and VCA. We first note that our results confirm the qualitative behavior of SA and SQA in Refs. [11, 35]. While SA and SQA produce lower-residual-energy solutions than VCA for small N_annealing, we observe that VCA achieves residual energies about three orders of magnitude smaller than SQA and SA for a large number of annealing steps. Notably, the rate at which the residual energy improves with increasing N_annealing is significantly higher for VCA than for SQA and SA, even at a relatively small number of annealing steps. These observations highlight the advantages of solving hard optimization problems in a variational space.
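For reference, here is a minimal single-spin-flip Metropolis implementation of an SA baseline of the kind compared against above; the linear schedule mirrors the one used for VCA, but the sweep counts and all other settings are illustrative rather than the paper's exact configuration.

import numpy as np

def simulated_annealing(energy_fn, N, n_annealing, T0=2.0, rng=None):
    # Single-spin-flip Metropolis annealing with the linear schedule
    # T(t) = T0 * (1 - t); one sweep of N proposed flips per temperature.
    # (For speed, dE would normally be computed locally instead of by
    # re-evaluating the full energy.)
    if rng is None:
        rng = np.random.default_rng()
    sigma = rng.choice([-1, 1], size=N)
    E = energy_fn(sigma)
    for step in range(n_annealing):
        T = T0 * (1.0 - step / n_annealing)
        for _ in range(N):
            i = rng.integers(N)
            sigma[i] *= -1                       # propose a flip
            E_new = energy_fn(sigma)
            if E_new <= E or rng.random() < np.exp(-(E_new - E) / T):
                E = E_new                        # accept
            else:
                sigma[i] *= -1                   # reject: undo the flip
    return sigma, E

With the EA helpers above, one could pass energy_fn = lambda s: ea_energy(s.reshape(L, L), Jh, Jv).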
C. Fully-connected spin glasses

We now focus our attention on fully connected spin glasses [2, 36]. We first consider the Sherrington-Kirkpatrick (SK) model [37], which provides a conceptual framework for understanding the role of disorder and frustration in widely diverse systems ranging from materials to combinatorial optimization and machine learning. The combined effect of disorder and long-range interactions in the SK model results in an energy landscape characterized by a hierarchy of valleys, with a number of local minima growing exponentially with the system size [36]. Together with the fact that many combinatorial NP-hard problems can be cast as the task of finding a ground state of a densely connected spin glass, these properties make fully connected spin glasses a suitable benchmark for heuristic optimization methods [5]. The SK Hamiltonian is given by

H_target = − Σ_{i≠j} (J_{ij}/√N) σ_i σ_j,  (9)

where {J_{ij}} is a symmetric matrix whose elements J_{ij} are sampled from a Gaussian distribution with mean 0 and variance 1.

Since VCA performed best in our previous examples, we use it to find ground states of the SK model for N = 100 spins. Here, exact ground state energies of the SK model are calculated using the spin-glass server [30] on a total of 25 instances of disorder. To account for long-distance dependencies between spins in the SK model, we use a dilated RNN ansatz with ⌈log₂(N)⌉ = 7 layers (see Methods Sec. V A) and set the initial temperature T_0 = 2. We compare our results with SA and SQA. For SQA, we start with an initial magnetic field Γ = 2, while for SA we use T_0 = 2.

For an effective comparison, we first plot the residual energy per site as a function of N_annealing for VCA, SA and SQA (with P = 100 Trotter slices). Here, the SA and SQA residual energies are obtained by averaging the outcome of 50 independent annealing runs, while for VCA we average the outcome of a large number of exact samples from the annealed RNN. For all methods, we take the typical average over 25 disorder instances. The results are shown in Fig. 5(a). As observed for the EA model, SA and SQA produce lower-residual-energy solutions than VCA for small N_annealing, but we emphasize that VCA delivers a lower residual energy than SQA and SA as the total number of annealing steps increases. Likewise, we observe that the rate at which the residual energy improves with increasing N_annealing is significantly higher for VCA in comparison to SQA and SA.

A more detailed look at the statistical behaviour of the methods at large N_annealing can be obtained from the residual energy histograms separately produced by each method, as shown in Fig. 5(d). The histograms contain 1000 residual energies for each of the same 25 disorder realizations. For each instance, we plot results for 1000 SA runs, 1000 samples obtained from the RNN at the end of annealing for VCA, and 10 SQA runs including the contribution from each of the P = 100 Trotter slices. We observe that VCA is superior to SA and SQA, as it produces a higher density of low-energy configurations. This indicates that, even though VCA typically takes more annealing steps, it ultimately results in a higher chance of obtaining accurate solutions to optimization problems than SA and SQA. Note that for the SK model, the SQA histogram remains quantitatively the same for 200 runs, and we report data for 10 runs only for fairness of comparison with both SA and VCA.

We now focus on the Wishart planted ensemble (WPE), a class of zero-field Ising models with a first-order phase transition and tunable algorithmic hardness [38]. These problems belong to a special class of hard problem ensembles whose solutions are known a priori, which, together with the tunability of the hardness, makes the WPE model an ideal tool to benchmark heuristic algorithms for optimization problems. The Hamiltonian of the WPE model is defined as

H_target = − Σ_{i≠j} J^α_{ij} σ_i σ_j.  (10)
Here J^α is a symmetric matrix satisfying J^α = J̃^α − diag(J̃^α), with

J̃^α = −(1/N) W_α W_α^T.

The term W_α is an N × ⌊αN⌋ random matrix satisfying W_α^T t_ferro = 0, where t_ferro = (+1, +1, ..., +1)^T is the ferromagnetic state (see Ref. [38] for details about the generation of W_α). The ground state of the WPE model is known (i.e., it is planted) and corresponds to the ferromagnetic states ± t_ferro. Interestingly, α is a tunable parameter of hardness, with the hardest instances occurring for α < 1. We benchmark N = 32 and α ∈ {0.25, 0.5}.

We consider 25 instances of the couplings {J^α_{ij}} and attempt to solve the model with VCA implemented using a dilated RNN ansatz with ⌈log₂(N)⌉ = 5 layers and an initial temperature T_0 = 1. For SQA (P = 100 Trotter slices), we use an initial magnetic field Γ = 1, and for SA we start with T_0 = 1.

Figure 5. Benchmarking SA, SQA (P = 100 Trotter slices) and VCA on the Sherrington-Kirkpatrick (SK) model and the Wishart planted ensemble (WPE). Panels (a), (b), and (c) display the residual energy per site as a function of N_annealing: (a) the SK model with N = 100 spins; (b) WPE with N = 32 spins and α = 0.5; (c) WPE with N = 32 spins and α = 0.25. Some residual energies ε_res/N come out marginally below zero, which is within our numerical accuracy.
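As a simplified sketch of the WPE construction described above (the exact generation procedure of Ref. [38] draws the columns of W_α from a specific correlated ensemble, whereas here they are i.i.d. Gaussian vectors projected orthogonal to t_ferro, which already suffices to plant the ground state):

import numpy as np

def wishart_planted_instance(N, alpha, rng):
    # Couplings J^alpha of Eq. (10), simplified variant.  Since the
    # energy -sigma^T J sigma equals ||W^T sigma||^2 / N plus a constant,
    # +/- t_ferro are ground states by construction.
    t = np.ones(N)
    M = int(np.floor(alpha * N))
    W = rng.normal(size=(N, M))
    W -= np.outer(t, t @ W) / N                  # enforce W^T t_ferro = 0
    J_tilde = -W @ W.T / N
    return J_tilde - np.diag(np.diag(J_tilde))   # zero out the diagonal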
We first plot the scaling of the residual energies per site ε_res/N, as shown in Figs. 5(b) and (c). Here we note that VCA is superior to SA and SQA for α = 0.5 in Fig. 5(b). For α = 0.
25 in Fig. 5(c), VCA is competitive, achieving a performance similar to SA and SQA on average for a large number of annealing steps. We also represent the residual energies in histogram form. We observe that for α = 0.5, in Fig. 5(e), VCA reaches much lower residual energies per site than SA and SQA. For α = 0.
25, in Fig. 5(f), VCA leads to a non-negligible density at very low residual energies, as opposed to SA and SQA, whose solutions display residual energies orders of magnitude higher. Finally, our WPE simulations support the observation that VCA tends to improve the quality of solutions faster than SQA and SA for a large number of annealing steps.
IV. CONCLUSIONS AND OUTLOOK
In conclusion, we have introduced a strategy to combat the slow sampling dynamics encountered by simulated annealing when an optimization landscape is rough or glassy. Based on annealing the variational parameters of a generalized target distribution, our scheme, which we dub variational neural annealing, takes advantage of the power of modern autoregressive models, which can be sampled exactly, without slow dynamics, even when they encode a rough landscape. We implement variational neural annealing parameterized by a recurrent neural network and compare its performance to conventional simulated annealing on prototypical spin glass Hamiltonians known to have landscapes of varying roughness. We find that variational neural annealing produces accurate solutions to all of the optimization problems considered, typically reaching solutions orders of magnitude more accurate on average than conventional simulated annealing in the limit of a large number of annealing steps.

We emphasize that several hyperparameter, model, hardware, and variational objective function choices remain to be explored and may improve our methodologies. We have utilized a simple annealing schedule in our protocols and highlight that reinforcement learning could be used to improve it [39]. A critical insight gleaned from our experiments is that certain neural network architectures were more efficient on specific Hamiltonians. Thus, a natural direction is to study the intimate relation between the model architecture and the problem Hamiltonian, where we envision that symmetries and domain knowledge will guide the design of models and algorithms.

As we witness the unfolding of a new age for optimization powered by deep learning [40], we anticipate a rapid adoption of machine learning techniques in the space of combinatorial optimization, as well as domain-specific applications of our ideas in diverse technological and scientific areas related to physics, biology, health care, economics, transportation, manufacturing, supply chains, hardware design, computing, and information technology, among others.
V. METHODS

A. Recurrent Neural Network Ansätze
Recurrent neural networks model complex probability distributions p by taking advantage of the chain rule

p(σ) = p(σ_1) p(σ_2|σ_1) ··· p(σ_N|σ_{N−1}, ..., σ_2, σ_1),  (11)

so that specifying every conditional probability p(σ_i | σ_{i−1}, ..., σ_1) fully determines the joint distribution. In an RNN, each conditional is parameterized through a hidden state that summarizes the previously generated spins, and is normalized by construction, which enables exact autoregressive sampling.
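To make Eq. (11) concrete, here is a toy single-layer ("vanilla") RNN sampler; the paper's tensorized and dilated architectures refine this template, and all parameter shapes and names here are ours.

import numpy as np

def rnn_sample(params, N, rng):
    # One exact sample and its log-probability from a single-layer
    # RNN realization of Eq. (11).  params = (W, U, b, V, c) with
    # shapes (d, d), (d, 2), (d,), (2, d), (2,).
    W, U, b, V, c = params
    h = np.zeros(b.size)                 # initial hidden state
    x = np.zeros(2)                      # one-hot input for a fictitious sigma_0
    logp, sample = 0.0, []
    for n in range(N):
        h = np.tanh(W @ h + U @ x + b)   # recurrence: h_n encodes sigma_{<n}
        logits = V @ h + c
        p = np.exp(logits - logits.max())
        p /= p.sum()                     # softmax: a normalized conditional
        s = rng.choice(2, p=p)           # sample sigma_n given the past
        logp += np.log(p[s])
        x = np.eye(2)[s]
        sample.append(2 * s - 1)         # map {0, 1} -> {-1, +1}
    return np.array(sample), logp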
The hidden states of the 2D tensorized RNN are path-dependent and are given by the zigzag path illustrated by the black arrows in Fig. 6(b). Moreover, to sample configurations from the 2D tensorized RNNs, we use the same zigzag path, as illustrated by the red dashed arrows in Fig. 6(b).

For models such as the Sherrington-Kirkpatrick model and the Wishart planted ensemble, every spin interacts with every other spin. To account for the long-distance nature of the correlations induced by these interactions, we use dilated RNNs [43], which are known to alleviate the vanishing gradient problem [44]. Dilated RNNs are multi-layered RNNs that use dilated connections between spins to model long-term dependencies [45], as illustrated in Fig. 6(c). At each layer 1 ≤ l ≤ L, the hidden state is computed as

h^{(l)}_n = F( W^{(l)}_n [ h^{(l)}_{max(0, n − 2^{l−1})} ; h^{(l−1)}_n ] + b^{(l)}_n ).

Here h^{(0)}_n = σ_{n−1}, and the conditional probability p_λ(σ_n | σ_{n−1}, ..., σ_1) is obtained from the last layer's hidden state h^{(L)}_n through a softmax output layer.
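A sketch of the dilated recurrence above follows; for brevity the weights are shared within each layer, whereas the text's W^{(l)}_n are site-dependent, and the zero vector plays the role of the initial state.

import numpy as np

def dilated_hidden_states(inputs, Ws, bs, d):
    # Hidden states of an L-layer dilated RNN: h_n^(0) = sigma_{n-1}
    # (one-hot), and at layer l the state h_n^(l) combines the dilated
    # past state h_{n - 2^(l-1)}^(l) with h_n^(l-1) from the layer below.
    # Ws[0] must have shape (d, d + len(inputs[0])); deeper Ws, (d, 2 * d).
    h_below = list(inputs)
    for l in range(1, len(Ws) + 1):
        W, b = Ws[l - 1], bs[l - 1]
        h_layer = []
        for n in range(len(inputs)):
            m = n - 2 ** (l - 1)                            # dilated connection
            h_past = h_layer[m] if m >= 0 else np.zeros(d)  # zero initial state
            h_layer.append(np.tanh(W @ np.concatenate([h_past, h_below[n]]) + b))
        h_below = h_layer
    return h_below                       # top-layer states h_n^(L), one per site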