[PDF] Scaling advantage of nonrelaxational dynamics for high-performance combinatorial optimization

Abstract

The development of physical simulators, called Ising machines, that sample from low energy states of the Ising Hamiltonian has the potential to drastically transform our ability to understand and control complex systems. However, most of the physical implementations of such machines have been based on a similar concept that is closely related to relaxational dynamics such as in simulated, mean-field, chaotic, and quantum annealing. We show that nonrelaxational dynamics that is associated with broken detailed balance and positive entropy production rate can accelerate the sampling of low energy states compared to that of conventional methods. By implementing such dynamics on field programmable gate array, we show that the nonrelaxational dynamics that we propose, called chaotic amplitude control, exhibits a scaling with problem size of the time to finding optimal solutions and its variance that is significantly smaller than that of relaxational schemes recently implemented on Ising machines.

Full PDF

CChaotic Amplitude Control forNeuromorphic Ising Machine in Silico

Timothee Leleu , ∗ , Farad Khoyratee , ∗ , Timothee Levi , , ,Ryan Hamerly, , Takashi Kohno, , , Kazuyuki Aihara , Institute of Industrial Science, University of Tokyo, Japan International Research Center for Neurointelligence, University of Tokyo, Japan IMS, University of Bordeaux, France LIMMS/CNRS-IIS, the University of Tokyo, Japan Massachusetts Institute of Technology, Cambridge, MA, USA NTT Research, 1950 University Ave ∗ To whom correspondence should be addressed; E-mail: [email protected].

Ising machines are special-purpose hardware designed to reduce time, resources,and energy consumption needed for ﬁnding low energy states of the IsingHamiltonian. In recent years, most of the physical implementations of suchmachines have been based on a similar concept that is closely related to an-nealing such as in simulated, mean-ﬁeld, chaotic, and quantum annealing. Weargue that Ising machines beneﬁt from implementing a chaotic amplitude con-trol of mean ﬁeld dynamics that does not rely solely on annealing for escap-ing local minima trap. We show that this scheme outperforms recent state-of-the-art solvers for some reference benchmarks quantitatively and qualita-tively in terms of time to solution, energy efﬁciency, and scaling of its variancewith problem size when it is implemented in hardware such as in a ﬁeld pro-grammable gate array that is organized according to a neuromorphic designwhere memory is distributed with processing. a r X i v : . [ phy s i c s . c o m p - ph ] S e p ne Sentence Summary (150 characters): The performance of Ising machinescan be increased quantitatively and qualitatively by controlling the amplitudeof the mean ﬁeld dynamics especially when implemented on a neuromorphichardware in silico. Many complex systems such as spin glasses, interacting proteins, large scale hardware, and ﬁ-nancial portfolios, can be described as ensembles of disordered elements that have competingand frustrated interactions ( ). There has been a growing interest in using physical simulatorscalled “Ising machines” in order to reduce time and resources needed to identify conﬁgura-tions that minimize their total interaction energy, notably that of the Ising Hamiltonians H with H ( σ ) = − (cid:80) ij ω ij σ i σ j (with ω ij the symmetric Ising couplings, i.e., ω ij = ω ji , and σ i = ± )that is related to many NP-hard combinatorial optimization problems and various real-worldapplications ( ). Recently proposed implementations include memresistor networks ( ), micro-or nano-electromechanical systems ( ), micro-magnets (

5, 6 ), coherent optical systems ( ), hy-brid opto-electronic hardware ( ), integrated photonics ( ), ﬂux qubits ( ), and Bose-Einstein condensates ( ). In principle, these physical systems often possess unique properties,such as coherent superposition in ﬂux qubits ( ) and energy efﬁciency of memresistors (

3, 17 ),which could lead to a distinctive advantage compared to conventional computers (see Fig. 1 (a))for solving hard combinatorial optimization problems. In practice, the difﬁculty in constructingconnections between constituting elements of the hardware is often the main limiting factorto scalability and performance for these systems (

14, 18 ). Moreover, these devices often im-plement schemes that are directly related to the concept of annealing (either simulated (

19, 20 ),mean-ﬁeld ( ), chaotic ( ), and quantum ( )) in which the escape from the numer-ous local minima ( ) and saddle points ( ) of the free energy function can only be achievedunder very slow modulation of the control parameter (see Fig. 1 (b)).Interestingly, alternative dynamics that does not depend on the concept of annealing mayperform better for solving hard combinatorial optimization problems ( ). Various kinds ofdynamics have been proposed ( ), notably chaotic dynamics (

17, 35–38 ), but have either2ot been implemented onto specialized hardware (

35, 39 ) or use chaotic dynamics merely asa replacement to random ﬂuctuations (

17, 36 ). We have recently proposed that the control ofamplitude in mean-ﬁeld dynamics can signiﬁcantly improve the performance of Ising machinesby introducing error correction terms (see Fig. 1 (d)), effectively doubling the dimensionality ofthe system, whose role is to correct the amplitude heterogeneity ( ). Because of the similarityof such dynamics with that of a neural network, it can be implemented especially efﬁciently inelectronic neuromorphic hardware where memory is distributed with the processing ( ).In this paper, we implement a version of this scheme that we name chaotic amplitude control(CAC) on a ﬁeld programmable gate array (FPGA, see Fig. 1 (c)) as a paradigmatic exampleand show that the developed hardware outperforms state-of-the-art Ising problem solvers forsome reference benchmarks quantitatively and qualitatively in terms of time to solution, energyefﬁciency, and scaling of its variance with problem size. For the sake of simplicity, we consider the classical limit of Ising machines for which the statespace is often naturally described by analog variables (i.e., real numbers) noted x i in the follow-ing. The variables x i represent measured physical quantities such as voltage ( ) or optical ﬁeldamplitude ( ) and these systems can often be simpliﬁed to networks of interacting nonlinearelements whose time evolution can be written as follows: d x i d t = f i ( x i ) + β i ( t ) (cid:88) j ω ij g j ( x j ) + σ η i , ∀ i ∈ { , . . . , N } , (1)where f i and g i represent the nonlinear gain and interaction, respectively, and are assumed to bemonotonic, odd, and invertible “sigmoidal” functions for the sake of simplicity; η i , experimen-tal white noise of standard deviation σ with (cid:104) η i ( t ) η j ( t (cid:48) ) (cid:105) = δ ij δ ( t − t (cid:48) ) ; and N , the number ofspins. Ordinary differential equations similar to eq. (1) have been used in various computationalmodels that are applied to nondeterministic polynomial-hard (NP-hard) combinatorial optimiza-tion problems such as Hopﬁeld-Tank neural networks ( ), coherent Ising machines ( ), and δ ij ; δ ( t ) , the Kronecker delta symbol and Dirac delta function, respectively. ). Moreover, the steadystates of eq. (1) correspond to the solutions of the “naive” Thouless-Anderson-Palmer (nTAP)equations that arise from the mean-ﬁeld description of Sherrington-Kirkpatrick spin glasseswhen the Onsager reaction term has been discarded ( ). In the case of neural networks inparticular, the variables x i and constant parameters ω ij correspond to ﬁring rates of neurons andsynaptic coupling weights, respectively.It is well known that, when β i = β for all i and the noise is not taken into account ( σ = 0 ),the time evolution of this system is motion in the state space that seeks out minima of a potentialfunction ( ) (or Lyapunov function) V given as V = β H ( y ) + (cid:80) i V b ( y i ) where V b is a bistablepotential with V b ( y i ) = − (cid:82) y i f i ( g − i ( y )) d y and H ( y ) = − (cid:80) ij ω ij y i y j is the extension ofthe Ising Hamiltonian in the real space with y i = g i ( x i ) (see supplementary material A.1).The bifurcation parameter β , which can be interpreted as the inverse temperature of the naiveTAP equations ( ), the steepness of the neuronal transfer function in Hopﬁeld-Tank neuralnetworks ( ), or to the coupling strength in coherent Ising machines ( ), is usually decreasedgradually in order to improve the quality of solutions found. This procedure has been calledmean-ﬁeld annealing ( ), and can be interpreted as a quasi-static deformation of the potentialfunction V (see Fig. 1 (b)). There is, however, no guarantee that a sufﬁciently slow deformationof the landscape V will ensure convergence to the lowest energy state contrarily to the quantumadiabatic theorem ( ) or the convergence theorem of simulated annealing ( ) . Moreover,the statistical analysis of spin glasses suggests that the potential V is highly non-convex at lowtemperature and that simple gradient descent of the potential V very unlikely reaches the globalminimum of H ( σ ) because of the presence of exponentially numerous local minima ( ) andsaddle points ( ) as the size of the system increases. The slow relaxation time of Monte-Carlosimulations, such as simulated annealing, might also be explained in the case of spin glassesby similar trapping dynamics during the descent of the free energy landscape obtained fromthe TAP equations ( ). In the following, we consider in particular the soft spin descriptionobtained by taking f i ( x i ) = ( − p ) x i − x i and y i = g ( x i ) = x i , where p is the gain At ﬁxed β , global convergence to the minimum of the potential V can be assured if σ is gradually decreasewith σ ( t ) ∼ clog (2+ t ) and c sufﬁciently large ( ). The parameter σ is analogous to the temperature in simulatedannealing in this case. The global minimum of the potential V does not, however, generally correspond to that ofthe Ising Hamiltonians H at ﬁnite β . p . In this case, the potential function V b isgiven as V b ( x i ) = ( − p ) x i − x i and eq. (1) can be written as d x i d t = − ∂V∂x i , ∀ i .The proposed amplitude control of mean-ﬁeld dynamics consists in introducing error sig-nals, noted e i ∈ R , that modulate the strength of coupling β i to the i th nonlinear element suchthat β i ( t ) deﬁned in eq. (1) is expressed as β i ( t ) = βe i ( t ) with β > . The time evolution ofthe error signals e i are given as follows ( ): d e i d t = − ξ ( g ( x i ) − a ) e i , (2)where a and ξ are the target amplitude and the rate of change of error variables, respectively,with a > and ξ > . If the system settles to a steady state, the values y ∗ i = g ( x ∗ i ) becomeexactly binary with y ∗ i = ±√ a . When p < , the internal ﬁelds h i at the steady state, deﬁned as h i = (cid:80) j ω ij σ j with σ j = y ∗ j / | y ∗ j | , are such that h i σ i > , ∀ i ( ). Thus, each equilibrium pointof the analog system corresponds to that of a zero-temperature local minimum of the binaryspin system.The dynamics described by the coupled equations (1) and (2) is not derived from a potentialfunction and the computational principle is not related to a gradient descent. Rather, the additionof error variables results in additional dimensions in the phase space via which the dynamicscan escape local minima. The mechanism of this escape can be summarized as follows. It canbe shown (see the supplementary material A.2) that the dimension of the unstable manifold atequilibrium points corresponding to local minima σ of the Ising Hamiltonian depends on thenumber of eigenvalues µ ( σ ) with µ ( σ ) > F ( a ) where µ ( σ ) are the eigenvalues of the matrix { ω ij | h i | } ij (with internal ﬁeld h i ) and F a function given as F ( y ) = ψ (cid:48) ( y ) ψ ( y ) y and ψ ( y ) = f ( g − ( y ))( g − ) (cid:48) ( y ) .Thus, there exists a value of a such that all local minima (including the ground state) are un-stable and for which the system exhibits chaotic dynamics that explores successively candidateboolean conﬁgurations. The energy is evaluated at each step and the best conﬁguration visitedis kept as the solution of a run. Interestingly, this chaotic search is particularly efﬁcient forsampling conﬁgurations of the Ising Hamiltonian close to that of the ground state using a sin-gle run although the distribution of sampled states is not necessarily given by the Boltzmanndistribution. Note that the use of chaotic dynamics for solving Ising problems has been dis-5ussed previously (

17, 51 ), notably in the context of neural networks, and it has been arguedthat chaotic ﬂuctuations may possess better properties than Brownian noise for escaping fromlocal minima traps. In the case of the proposed scheme, the chaotic dynamics is not merely usedas a replacement to noise. Rather, the interaction between nonlinear gain and error-correctionresults in the destabilization of states associated with lower Ising Hamiltonian.The target amplitude a for which all local minima is unstable is not known a priori . There-fore, we propose instead to destabilize the local minima traps by dynamically modulating thetarget amplitude a as a function of the visited conﬁguration σ using the heuristic function givenas follows: a ( t ) = α − ρ tanh ( δ ∆ H ( t )) , (3)where ∆ H ( t ) = H opt − H ( t ) ; H ( t ) , the Ising Hamiltonian of the conﬁguration visited at time t ; and H opt , a given target energy. In practice, we set H opt to the lowest energy visited duringthe current run, i.e., H opt ( t ) = min t (cid:48) ≤ t H ( t (cid:48) ) . The function tanh is the tangent hyperbolic. ρ and δ are positive real constants. In this way, conﬁgurations that have much larger Ising energythan the lowest energy visited are destabilized more strongly due to smaller target amplitude a . Lastly, the parameter ξ (see eq. (2)) is modulated as follows: d ξ d t = γ when t − t r < ∆ t ,where t r is the last time for which either the best known energy H opt was updated or ξ was reset.Otherwise, ξ is reset to if t − t r ≥ ∆ t and t r is set to t .In order to verify that chaotic amplitude control is able to accelerate the search of mean-ﬁeld dynamics for ﬁnding the ground state of typical frustrated systems, we solve Sherrington-Kirkpatrick (SK) spin glass problems using the numerical simulation of eqs. (1) to (3) andtwo recently proposed implementation of the mean-ﬁeld annealing method: noisy mean-ﬁeldannealing (NMFA) ( ) and the simulation of the coherent Ising machine (simCIM) . Becausethe arithmetic complexity of calculating one step of these three schemes are dominated by the The modulation described in eq. (3) is better suited for a digital implementation on FPGA than the one pro-posed in ( ). Numerical simulations shown in this paper suggest that this modulation results in destabilization ofnon-trivial attractors (periodic, chaotic, etc.) for typical problem instances. See supplementary material E.1 and 2 ν , to ﬁnd the ground state energy of a given instance with successprobability, with ν ( K ) = K ln (1 − . ln (1 − p ( K )) and p ( K ) the probability of visiting a ground stateconﬁguration at least once after a number K of MVMs in a single run . In Fig. 2, NMFA (a) andthe CAC (b) are compared using the averaged success probability (cid:104) p (cid:105) of ﬁnding the groundstate for 100 randomly generated SK spin glass instances per problem size N . Note that thesuccess probability of the mean-ﬁeld annealing method does not seem to converge to 1 even forlarge annealing time (see Fig. 2 (a)). Because the success probability of NMFA and simCIMremains low at larger problem size, its correct estimation requires simulating a larger number7f runs which we achieved by implementing these methods using a GPU. On the other hand,the average success probability (cid:104) p (cid:105) of CAC is of order 1 when the maximal number of MVMis large enough, suggesting that the system rarely gets trapped in local minima of the IsingHamiltonian or non-trivial attractors. In Figure 2 (c) and (d) are shown the q th percentile (with q = 50 , i.e., the median) of the MVM to solution distribution, noted ν q ( K ; N ) , for variousduration of simulation K , where K is the number of MVMs of a single run. The minimumof these curves, noted ν ∗ q ( N ) with ν ∗ q ( N ) = min K ν q ( K ; N ) , represents the optimal scalingof MVM to solution vs. problem size N ( ). For large N , the median MVM to solution ν ∗ ( N ) is better ﬁtted by an exponential scaling with the square root problem size N , thatis ν ∗ ( N ) ∼ e γ √ N (see Fig. 2 (e), p-value: . × − , R-value: . ), than an exponentialscaling with N (p-value: . × − , R-value: . ) although there are not enough data pointsto reject either of the two hypotheses. Using the former hypothesis, CAC exhibits signiﬁcantlysmaller scaling exponent ( γ = 0 . ± . ) than the NMFA ( γ = 0 . ± . ) and simCIM( γ = 0 . ± . , see inset in Fig. 2 (e)). We have veriﬁed that this scaling advantage holds forvarious parameters of the mean-ﬁeld annealing (see supplementary material E.1).Although comparison of algebraic complexity indicates that CAC has a scaling advantageover mean-ﬁeld annealing, it is in practice necessary to compare its physical implementationagainst other state-of-the-art methods because the performance of hardware depends on otherfactors such as memory access and information propagation delays. To this end, CAC is imple-mented into a FPGA because its similarity with neural networks makes it well-ﬁtted for a designwhere memory is distributed with processing (see supplementary material B for the details ofthe FPGA implementation). The organization of the electronic circuit can be understood usingthe following analogy. Pairs of analog values x i and e i , which represent averaged activity oftwo types of neurons, are encoded within neighboring circuits. This micro-structure is repeated N times on the whole surface of the chip which resembles the columnar organization of thebrain. The nonlinear processes f i ( x i ) , which model the local-population activation functionsand are independent for i (cid:54) = j , are calculated in parallel. The coupling between elements i and j ∈ { , . . . , N } that is achieved by the dot product in eq. (1) is implemented by circuits thatare at the periphery of the chip and are organized in a summation tree reminiscent of dendriticbranching (see Fig. 1 (c)). 8igure 2: (a,b) Average success probability (cid:104) p (cid:105) of ﬁnding the ground state conﬁguration of 100Sherrington-Kirkpatrick spin glass instances and (c,d) th percentile of the MVM to solutiondistribution ν vs. problem size N . (a,c) NMFA. (b,d) CAC. (e) Number of matrix-vector mul-tiplication MVM to solution distribution. Lower, higher, and upper whisker of boxes show the th , th , and th percentiles of the distribution. The upper right inset shows the exponentialscaling factor γ of the th percentile with ν ∼ e γ √ N for CAC, NMFA, and simCIM.We compare the FPGA implementation of CAC against state-of-the-art methods imple-mented on various types of hardware: the single-core CPU implementation of break-out localsearch ( ) (BLS) that has been used to ﬁnd most of the best known maximum-cuts (equiva-lently, Ising energies) from the GSET benchmark set ( ), a well-optimized single-core CPUimplementation of parallel tempering (or random replica exchange Monte-Carlo Markov chainsampling) (

54, 55 ) (PT, courtesy of S. Mandra), simulated annealing (SA) ( ), and Toshiba’sbifurcation machine on GPU (TBM) ( ). Figure 3 (a) shows that the CAC on FPGA has thesmallest real time to solution τ ∗ q against the other Ising machines and state-of-the-art algorithmsdespite just 5W power-consumption where τ ∗ q ( N ) is the optimal q th percentile of time to so-lution with 99% success probability and is given as τ ∗ q ( N ) = min T τ q ( T ; N ) where τ ( T ) ofa given instance is τ ( T ) = T ln (1 − . ln (1 − p ( T )) and T is the duration in seconds of a run. Predic-tions of time to solution obtained when increasing the clock frequencies of the FPGA are also9hown in dashed lines. Using 20W of power consumption, CAC on FPGA implementation hassmaller time to solution than recently proposed on-chip implementation of simulated annealingcalled Digital Annealer (see results in ( )). Note that the power consumption of transistorsin the FPGA scales proportionally to its clock frequencies. In order to compare different Isingmachines despite the heterogeneity in their power consumption, the q th percentile of energy-to-solution E ∗ q , i.e., the energy E ∗ q required to solve SK instances with E ∗ q = P τ ∗ q and P thepower consumption , is plotted in Fig. 3 (b). CAC on FPGA is to times more energy ef-ﬁcient than state-of-the-art algorithms running on classical computers. Moreover, the relativelyslow increase of time to solution with respect to the number of spins N when solving larger SKproblems using CAC suggests that our FPGA implementation is faster than the Hopﬁeld neuralnetwork implemented using memresistors (mem-HNN) ( ) and the restricted Boltzmann ma-chine using a FPGA (FPGA-RBM) ( ) for N (cid:39) (see Tab. 1 and supplementary materialC). Ising machine N = 100 N = 700 N = 2000 experimental e γN ﬁt e γ √ N ﬁt e γN ﬁt e γ √ N ﬁtNTT CIM ( ) × − × × > > mem-HNN ( ) − × − > × FPGA-RBM ( ) × − × − > × experimental e γN ﬁt e γ √ N ﬁt5 W FPGA-CAC × − × − × × Table 1: Median time to solution in seconds and extrapolations for the NTT CIM (not paral-lelized) ( ), mem-HNN ( ), FPGA-RBM ( ), and FPGA-CAC (this work). Extrapolationsare based on the hypotheses of scaling in e γN and e γ √ N by ﬁtting the available experimentaldata for each Ising machine (see supplementary material C).We next consider the whole distribution of time to solution in order to compare the ability ofvarious methods to solve harder instances. As shown in Fig. 4 (a), the cumulative distributionfunction (CDF) P ( τ ; T ) of time to solution with success probability τ is not uniquelydeﬁned as it depends on the duration T of the runs. We can deﬁne an optimal CDF P ∗ ( τ ) thatis independent of the runtime T as follows: P ∗ ( τ ) = max T P ( τ ; T ) . Numerical simulations For the sake of simplicity, we assume a 20 and 120 watts power consumptions for the CPU and GPU. Thesenumbers represent typical orders to magnitude for contemporary digital systems. th , th , and th percentilesof the real time to solution distribution in seconds for the FPGA implementation of CAC witha maximum of 5W power consumption, CAC, SA, and PT algorithm running on a CPU (20W)and TBM on a GPU (120W). The dashed and dotted red lines show the predictions of the realtime to solution in the case of a 10W and 20W FPGA implementation, respectively. (b) Thesame as (a) for the energy-to-solution E ∗ .show that this optimal CDF is well described by lognormal distribution, that is P ∗ ( log ( τ )) ∼N ( µ, √ v ) where √ v is the standard deviation of log ( τ ) (see Figs. 4 (b), (c), and (d) for thecases of CAC, SA, and NMFA, respectively). In Fig. 4 (e), it is shown that the scaling of thestandard deviation √ v ( N ) with the problem size N is signiﬁcantly smaller for CAC, whichimplies that harder instances can be solved relatively more rapidly than using other methods.This result conﬁrms the advantageous scaling of higher percentiles for CAC that was observedin Figs. 2 and 3. In order to test that hard instances of other types than SK can be solved by our FPGA implemen-tation, we have benchmarked it against BLS on the GSET ( ) (see supplementary material D):most of the best known solutions were found approximately a hundred times faster than the BLSalgorithm ( ) using less than 5W of power consumption although the current implementation11oes not take advantage of the sparsity of GSET instances for reducing energy consumption.For some instances (14th and 15th of size N = 2000 from GSET), we have found solutions ofbetter quality than previously known in ( ) and ( ). The framework described in this papercan be extended to solve other types of constrained combinatorial optimization problems suchas the traveling salesman ( ), vehicle routing, and lead optimization problems. Moreover, itcan be easily adapted to a variety of recently proposed Ising machines ( ) which wouldbeneﬁt from implementing a scheme that does not rely solely on the descent of a potentialfunction. In particular, the performance of CIM (

8, 9 ), mem-HNN ( ), and chip-scale photonicIsing machine ( ), which have small time to solution for problem sizes N ≈ but with arelatively large scaling exponent that limit their scalability, could be signiﬁcantly improved byadapting the implementation we propose if these hardware can be shown to be able to supportlarger problem sizes experimentally. We target the implementation of a low-power system with maximum power supply of 5W usinga XCKU040 Kintex Ultrascale Xilinx FPGA integrated on an Avnet board. The implementedcircuit can process Ising problems of more than 2000 spins. Data are encoded into 18 bits ﬁxedpoint vectors with 1 sign, 5 integer and 12 decimal bits to optimize computation time and powerconsumption. An important feature our FPGA implementation of CAC is the use of severalclock frequencies to concentrate the electrical power on the circuits that are the bottleneck ofcomputation and require high speed clock. For the realization of the matrix-vector product, eachelement of the matrix is encoded with 2 bits precision ( w ij is − , 0 or 1). An approximationbased on the combination of logic equations describing the behavior of a multiplexer allowsto achieve multiplications within one clock cycle. The results of these multiplications aresummed using cascading DSP and CARRY8 connected in a tree structure. Using pipelining,a matrix-vector product for a squared matrix of size N is computed in log ( N − log (5) + ( Nu ) clock cycles (see supplementary material B.4) at a clock frequency of 50MHz with u = 100 which is determined by the limitation of the number of available electronic component of the12igure 4: (a) Cumulative distribution of the time to solution P ( τ ) for N = 400 SK problems.(b,c,d) Optimal cumulative distribution P ∗ ( τ ) with P ∗ ( log ( τ )) ∼ N ( µ ( N ) , √ v ( N )) for CAC(b), SA (c), and NMFA (d), respectively. (e) Standard deviation √ v of the logarithm of timeto solution distribution vs. problem size N . Shaded regions show the error in the scalingexponents.XCKU040 FPGA . The calculation of the nonlinearity f i and error terms is achieved at higherfrequency using DSP in N/u ) and N/u ) clock cycles, respectively. In order to minimizeenergy resources and maximize speed, the nonlinear and error terms are calculated multipletimes during the calculation of a single matrix-vector product (see supplementary material B). The block size u can be made at least 3 times larger using commercially available FPGAs, which implies thatthe number of clock cycles needed to compute a dot product can scale almost logarithmically for problems of size N < (see supplementary material B.4 for discussions of scalability) and that the calculation time can befurther signiﬁcantly decreased using a higher-end FPGA. eferences

1. G. Parisi, M. M´ezard, M. Virasoro,

World Scientiﬁc, Singapore , 202 (1987).2. G. Kochenberger, et al. , Journal of Combinatorial Optimization , 58 (2014).3. F. Cai, et al. , Nature Electronics pp. 1–10 (2020).4. I. Mahboob, H. Okamoto, H. Yamaguchi,

Science advances , e1600236 (2016).5. K. Y. Camsari, R. Faria, B. M. Sutton, S. Datta, Physical Review X , 031014 (2017).6. K. Y. Camsari, B. M. Sutton, S. Datta, Applied Physics Reviews , 011305 (2019).7. A. Marandi, Z. Wang, K. Takata, R. L. Byer, Y. Yamamoto, Nature Photonics , 937 (2014).8. P. L. McMahon, et al. , Science , 614 (2016).9. T. Inagaki, et al. , Nature Photonics , 415 (2016).10. D. Pierangeli, G. Marcucci, C. Conti, Physical Review Letters , 213902 (2019).11. C. Roques-Carmes, et al. , Nature communications , 1 (2020).12. M. Prabhu, et al. , arXiv preprint arXiv:1909.13877 (2019).13. Y. Okawachi, et al. , Nature Communications , 1 (2020).14. R. Hamerly, et al. , Science advances , eaau0823 (2019).15. K. P. Kalinin, N. G. Berloff, Scientiﬁc reports , 17791 (2018).16. M. W. Johnson, et al. , Nature , 194 (2011).17. S. Kumar, J. P. Strachan, R. S. Williams,

Nature , 318 (2017).18. B. Heim, T. F. Rønnow, S. V. Isakov, M. Troyer,

Science , 215 (2015).19. M. Aramon, et al. , Frontiers in Physics , 48 (2019).140. S. Kirkpatrick, C. D. Gelatt, M. P. Vecchi, Science , 671 (1983).21. A. D. King, W. Bernoudy, J. King, A. J. Berkley, T. Lanting, arXiv preprintarXiv:1806.08422 (2018).22. G. Bilbro, et al. , Advances in neural information processing systems (1989), pp. 91–98.23. L. Chen, K. Aihara,

Neural Netw. , 915 (1995).24. T. Kadowaki, H. Nishimori, Phys. Rev. E , 5355 (1998).25. F. Tanaka, S. Edwards, Journal of Physics F: Metal Physics , 2769 (1980).26. G. Biroli, Journal of Physics A: Mathematical and General , 8365 (1999).27. T. Leleu, Y. Yamamoto, P. L. McMahon, K. Aihara, Physical review letters , 040607(2019).28. M. Ercsey-Ravasz, Z. Toroczkai,

Nat. Phys. , 966 (2011).29. B. Moln´ar, F. Moln´ar, M. Varga, Z. Toroczkai, M. Ercsey-Ravasz, Nature communications , 4864 (2018).30. T. Aspelmeier, M. Moore, Physical Review E , 032127 (2019).31. S. Boettcher, A. G. Percus,

Phys. Rev. Lett. , 5211 (2001).32. G. Zarand, F. Pazmandi, K. Pal, G. Zimanyi, Physical review letters , 150201 (2002).33. T. Leleu, Y. Yamamoto, S. Utsunomiya, K. Aihara, Phys. Rev. E , 022118 (2017).34. S. Krishna Vadlamani, T. P. Xiao, E. Yablonovitch, arXiv e-prints pp. arXiv–2007 (2020).35. T. Aspelmeier, R. Blythe, A. J. Bray, M. A. Moore, Physical Review B , 184411 (2006).36. M. Hasegawa, T. Ikeguchi, K. Aihara, Phys. Rev. Lett. , 2344 (1997).37. Y. Horio, K. Aihara, Physica D: Nonlinear Phenomena , 1215 (2008).158. K. Aihara,

Proceedings of the IEEE , 919 (2002).39. A. Montanari, arXiv preprint arXiv:1812.10897 (2018).40. S. B. Furber, F. Galluppi, S. Temple, L. A. Plana, Proceedings of the IEEE , 652 (2014).41. M. Davies, et al. , IEEE Micro , 82 (2018).42. B. V. Benjamin, et al. , Proceedings of the IEEE , 699 (2014).43. J. J. Hopﬁeld, D. W. Tank,

Biol. Cybern. , 141 (1985).44. Z. Wang, A. Marandi, K. Wen, R. L. Byer, Y. Yamamoto, Physical Review A , 063853(2013).45. H. Sompolinsky, A. Zippelius, Physical Review B , 6860 (1982).46. D. J. Thouless, P. W. Anderson, R. G. Palmer, Philosophical Magazine , 593 (1977).47. J. J. Hopﬁeld, Proceedings of the national academy of sciences , 3088 (1984).48. E. Farhi, J. Goldstone, S. Gutmann, M. Sipser, arXiv preprint quant-ph/0001106 (2000).49. V. Granville, M. Kriv´anek, J.-P. Rasson, IEEE transactions on pattern analysis and machineintelligence , 652 (1994).50. S. Geman, C.-R. Hwang, SIAM Journal on Control and Optimization , 1031 (1986).51. H. Goto, Scientiﬁc reports , 21686 (2016).52. U. Benlic, J.-K. Hao, Eng. Appl. Artif. Intell. , 1162 (2013).53. https://web.stanford.edu/˜yyye/yyye/Gset/ .54. K. Hukushima, K. Nemoto, Journal of the Physical Society of Japan , 1604 (1996).55. S. Mandra, B. Villalonga, S. Boixo, H. Katzgraber, E. Rieffel, APS Meeting Abstracts (2019). 166. S. V. Isakov, I. N. Zintchenko, T. F. Rønnow, M. Troyer,

Computer Physics Communica-tions , 265 (2015).57. H. Goto, K. Tatsumura, A. R. Dixon,

Science advances , eaav2372 (2019).58. S. Patel, L. Chen, P. Canoza, S. Salahuddin, arXiv preprint arXiv:2008.04436 (2020).59. F. Cai, et al. , arXiv preprint arXiv:1903.11194 (2019).60. F. J. Pineda, Journal of Complexity , 216 (1988). Acknowledgments

This research is partially supported by the Brain-Morphic AI to Resolve Social Issues (BMAI)project (NEC corporation), AMED under Grant Number JP20dm0307009, and UTokyo Centerfor Integrative Science of Human Behavior(CiSHuB).17 upplementary material A: theoretical analysis

1. Derivation of a potential function

First, we analyze the system described by eq. (1) only, when σ = 0 and β i = β , ∀ i . In thiscase, the potential function V ( y ) = − β (cid:80) ij ω ij y i y j − (cid:80) i (cid:82) y i f i ( g − i ( y )) d y has the followingproperty ( ): d V d t = − (cid:88) i d y i d t ( β (cid:88) j ω ij y j + f i ( g − i ( y i ))) , (4) = − (cid:88) i d y i d t ( d x i d t ) , (5) = − (cid:88) i ( d y i d t ) ( g − ) (cid:48) ( y i ) . (6)Consequently, V is such that d V d t < because g is strictly monotonic and d V d t = 0 = ⇒ d y i d t = 0 , ∀ i . In other words, the dynamics of the system can be understood in this case as the gradientdescent on the potential function V and its stable steady states correspond to local minima of V .

2. Analysis of CAC

Next, we analyze the system described in eqs. (1) and (2) ( a is constant) by considering thechange of variable y i = g ( x i ) with g such that g − ( y i ) = x i . In this case, eqs. (1) and (2) canbe rewritten as follows (note that f and g are odd functions): φ (cid:48) ( y i ) d y i d t = f ( g − ( y i )) + βe i (cid:88) j ω ij y j , (7) d e i d t = ξ ( y i − a ) e i , (8)where φ ( y ) = g − ( y ) .We analyze the dimension of the unstable manifold of CAC by linearizing the dynamicsnear the steady states and calculating the real part of eigenvalues of the corresponding Jacobianmatrices. The steady state of the system in eqs. (7) and (8) can be written as follows:18 ∗ i = σ i √ a, (9) e ∗ i = − f ( g − ( σ i √ a )) β √ ah i = − f ( g − ( √ a )) β √ aσ i h i . (10)with h i = (cid:80) j ω ij σ j . The last equality is a consequence of the fact that g , and thus also g − , areodd functions.The Jacobian matrix J corresponding to the system of eqs. (1) to (2) at the steady statecan be written in the block representation J = [ J yy J ye ; J ey J ee ] with its components given asfollows: J yyij = ψ (cid:48) ( √ a ) δ ij − ψ ( √ a ) √ a ω ij h i σ i , (11) J yeij = β √ aφ (cid:48) ( √ a ) h i δ ij , (12) J eyij = − ξf ( g − ( √ a )) βh i δ ij , (13) J eeij = 0 . (14)with ψ ( y ) = f ( g − ( y )) φ (cid:48) ( y ) and φ ( y ) = g − ( y ) . Note that we deﬁne J eyii J yejj = b with b = − ξ √ aψ ( √ a ) .The eigenvalues of the Jacobian matrix J are solutions of the polynomial equation P ( λ ) = det [ J − λI ] = det [( J yy − λI ) λI − bI ] . The eigenvalues of J are thus solutions of the quadraticequation z ( z − λ i ) − b = 0 where λ i is the i th eigenvalue of the matrix J yy . Therefore, theeigenvalues of J can be described by pairs λ + i and λ − i given as follows: λ ± i = 12 ( λ i ± (cid:112) ∆ i ) , with (15) λ i = ψ (cid:48) ( √ a ) − ψ ( √ a ) √ a µ i , and (16) ∆ i = λ i + 4 b, (17)where µ i is the i th eigenvalue of the matrix D [( σh ) − ]Ω .19he eigenvalues λ + i and λ − i become complex conjugate when the ∆ i = 0 , i.e., [ ψ (cid:48) ( √ a ) − ψ ( √ a ) √ a µ i ] − ξ √ aψ ( √ a )) = 0 , which can be rewritten as follows: µ i = G ± ξ ( √ a ) , (18)with G ± ξ ( y ) = yψ ( y ) [ ψ (cid:48) ( y ) ± (cid:112) ξ | y | ψ ( | y | )] .Lastly, the real part of eigenvalues λ + i become equal to zero at the condition given as follows(note that Re [ λ + i ] ≥ Re [ λ − i ] and µ i are real at local minima for which, ∀ j , σ j h j > ): √ a = 0 or ψ ( √ a ) = 0 if ∆ i ≤ , (19) µ i = F ( √ a ) otherwise. (20)with F ( y ) = ψ (cid:48) ( y ) ψ ( y ) y . The dimension of the unstable manifold at a given ﬁxed point is then givenby the number of indices i for which Re [ λ ± i ] is positive. To illustrate the destabilization of theground state conﬁguration of an Ising problem, we show in Figs. 5 and 6 the dynamics of CACwhen solving a problem of size N = 15 spins when the state encoding for the ground stateconﬁguration is stable and unstable, respectively, for two different examples of functions f and g deﬁning eqs. (1) and (2). The ﬁrst set of functions f and g with f ( x ) = ( − p ) x − x and g ( x ) = x shown in Fig. 5 (a,b,c,d) corresponds to the soft spin model (or simulation of CIM)with chaotic amplitude control whereas the second one (e,f,g,h) with f ( x ) = − x + tanh [0 . x ] and g ( x ) = tanh [ x ] corresponds to an Hopﬁeld neural network with amplitude heterogeneityerror correction. These two ﬁgures show that the stability of a given local minima of the IsingHamiltonian depends on the value of the target amplitude a as predicted in eq. (20). Because the vector σ · h has positive components at local minima, the eigenvalues of D [( σ · h ) − ]Ω are thesame as the ones of D [( σ · h ) − ] Ω D [( σ · h ) − ] (Sylvester’s law of inertia), which is a symmetric real matrix.Thus, the eigenvalues µ j are always real. {√ a, µ = µ j ( σ ) } , ∀ j ∈ { , . . . , N } , at conﬁg-uration σ . The full red and green dotted lines correspond to the set for which Re [ λ ± i ] = 0 (whereﬁxed points corresponding to local minima become stable) and ∆ i = 0 (where oscillations startto occur around ﬁxed points). Red + and black × symbols correspond to the largest eigenvalues µ ( σ ) of the Jacobian at the ground state and excited states, respectively, of an Ising problem ofsize N = 15 and a = 0 . . In (b,f) and (c,g) are shown the dynamics of the variables x and e ,respectively. In (d,h) are shown the Ising Hamiltonian H ( t ) with horizontal lines showing thevalues of the Ising Hamiltonian at local minima. (a,b,c,d) f ( x ) = ( − p ) x − x and g ( x ) = x with p = 0 . , β = 0 . , ξ = 0 . . (e,f,g,h) f ( x ) = − x + tanh [0 . x ] and g ( x ) = tanh [ x ] with β = 0 . , ξ = 0 . . 21igure 6: Same as Fig. 5 with a = 0 . .22 upplementary material B: FPGA implementation

1. FPGA implementation of CAC

CAC has been implemented on a XCKU040 Xilinx FPGA integrated into the KU040 develop-ment board designed by Avnet. The circuit can process Ising problems of more than 2000 spins.The reconﬁgurable chip is a regular Ultrascale FPGA including 484,800 ﬂip-ﬂops, 242,400Look Up Table (LUT), 12,000 Digital Signal Processing (DSP) and 600 Block RAM (BRAM).Initial value for x i , e i and ω ij are sent through Universal Asynchronous Receiver Transmitter(UART) transmission. UART protocol has been chosen because it does not require a lot ofresources to be implemented and its simplicity.Data are encoded into 18 bits ﬁxed point with 1 sign bit, 5 integer bits which represents agood compromise between accuracy and power consumption. The power consumption of theFPGA is equal or lower to 5W depending on the problem size. The system is deﬁned by itstop-level entity that represents the highest level of organization of the reconﬁgurable chip.

2. Top level entity

The circuit is organized into four principal building blocks as shown on Fig. 7 (a). Severalclock frequencies have been generated to concentrate the electrical power on the circuits thatneed high speed clock when these circuits constitute a bottleneck in the current implementation.Finally, the organization of the main circuit is represented in Fig. 7 (b). The control block iscomposed of Finite State Machines (FSM) and is well tuned to pilot all computation core, theRandom-Access Memory (RAM), and data ﬂowing between computation cores and RAM. Allcircuits are synchronized although the system utilizes multiple clock frequencies. Clock Enable(CE) ports on the BUFGCE component are used to stop the power when a circuit is not used inorder to reduce further the global power dissipation.23igure 7: Organization of the FPGA circuit. (a) The top layer entity is divided into four mod-ules: The Phased-Locked Loop (PLL) that convert a differential clock of 250MHz into threeclocks f g =50MHz, f x =300Mhz and f e =100MHz respectively representing the global clock, theclock for x i and the clock for e i . Two UART modules are used to receives parameters and toreturn the results of the computation. (b) The Max-Cut circuit is composed of several processesaiming to control the exchange of data between the memory and the computation cores ( x i , e i , x i ω ij and Ising). The three generated clocks are controlled by the Clock Enable (CE) port ofthe BUFGCE component of the FPGA. 24 . Temporal organization of the computation The control circuit pilot the computation in order to calculate several steps of x i and e i per cal-culation of the Ising coupling. The dynamics of CAC (see eqs. (1), (2) and (3)) is implementedbased on the following pseudocode: Algorithm

Pseudocode from which the FPGA implementation is adapted for ν ∈ { , · · · , K } do (cid:46) Iterate on the number of MVMs x (cid:48) ← x (cid:46) Save state I ← (cid:15) e ˙(Ω x (cid:48) ) (cid:46) calculation of injection term σ ← x (cid:48) i | x (cid:48) i | H ← − σ ˙(Ω σ ) (cid:46) Ising energy calculation for i ∈ { , · · · , n x } do (cid:46) Update nonlinear terms ∆ x ← ( − p ) x − ( x ) + I x ← x + ∆ x dt x for i ∈ { , · · · , n y } do (cid:46) Update error terms ∆ e ← − β (( x (cid:48) ) − a ) e e ← e + ∆ e dt e β ← β + λdt (cid:46) Update β ∆ H ← H − H opt (cid:46) Update a a ← α + ρ tanh ( δ ∆ H ) if ν − ν c > τ /dt then (cid:46) Reset of β ν c ← ν β ← if H < H opt then (cid:46)

Update optimal H H opt ← H ν opt ← ν ν c ← ν Note that the order of operations described in the pseudocode are a simpliﬁcation of the onesoccurring on the FPGA.The temporal organization of the circuit represented in Fig. 8 shows how is controlled thereading and writing state on the RAM in the case of a problem size N = 500 spins. Fig. 8(a) shows the case of a single x i and e i calculation per dot product which is the classic wayto integrate the differential equations (1) and (2) using the Euler method. Fig. 8 (b) representsthe case of eight and four calculations of x i and e i , respectively, per dot product. In this case,the error correction term is computed with a normalized time step of − (half of the time step25sed in the computation of x i ). The total power consumption can be reduced by increasing thefrequency of circuits involved in the calculation of x i because these circuits are smaller thanthose involved in the dot product calculation (see Fig. 9 (d) and (e)). Increasing the problemsize will increase the power consumption by ﬁlling up the calculation pipelines but this canbe balanced out by reducing the frequency used for the calculation of x i for larger problemsize. The end of the dot product is followed by an update of all RAM and saving of σ i whichcorrespond to the Most Signiﬁcant Bit (MSB) of x i values. This step is not represented here.Figure 8: Temporal representation of the circuits where the red bars represent the computationcores and the blue bars the use of RAM. Here, (r) stand for read and (w) for write. (a) Onetime step representing the classical way of solving differential equation. In this conﬁguration,the dot product create a bottleneck to the computation speed. (b) 8 time steps for x i and 4 timestep for x i accelerating the computation time to ﬁnd the ground state energy and create a newbottleneck to the number of possible time step possible. Note that the time between every redbars are necessary update the x i RAM which is not represented in this ﬁgure for better clarity.The use of several x i and e i calculations per dot product allows to reduce the bottleneck26reated by the later whose calculation is in principle the most time consuming. Fig. 9 (a) to (c)show the reduction in time to solution vs. the number of x i calculations per dot product.Figure 9: (a) The number of iterations of the x variable to solution vs. the number of update,noted n x , of the x variable per matrix-vector product. (b) The number of matrix-vector multi-plications MVM to solution. (c) Real time to solution. (d) Maximum possible time step withand without clock modulation on x i . (e) Maximum possible time step with and without clockmodulation on e i .

4. Coupling

Overview of the coupling circuit

The dot product operation (see Fig. 10) is preceded by a multiplication by β so that the domainof x i is reduced and, in turn, the number of digits required to encode the integer part of x i . Theresult of the dot product is multiplied by the error correction term as shown in Fig. 10. Since thedot product is performed at 50MHz clock speed, the optional registers of the DSP are no longer27equired and have been removed for the two multiplications resulting in 1 clock cycle operationfor each operation.Figure 10: The circuit designed to compute the coupling is composed of three parts: the multi-plication between x i and β , the dot product, and multiplication with the error correction term. High speed and massively parallel multiplication

The multiplication of the elements of a vector x by the element of a matrix Ω can be performedusing an approximation based on a speciﬁc circuit combining logic equations describing thebehavior of a multiplexer and the optimized use of FPGA components. This circuit allows thedesign to achieve 10,000 multiplications in 1 clock cycle. In our case an element of x is ﬁxedpoint binary vector on k bits that are multiplied by an element of a matrix composed of twobits vectors where ω is an element of Ω and ω ∈ {− , , } . The behavior of a multiplexerthat implements the multiplication of x by ω by selecting R ∈ {− x, , x } is given as follows ( ¯ x represent − x ): R = x ¯ ω ¯ ω + ¯ xω ω (21)where x is an element of a vector, ω i a binary vector and an element of the matrix Ω with i theindex of the binary vector that is either 1, the most signiﬁcant bit (MSB) or 0, the less signiﬁcantbit (LSB). If we expand eq. (21) to each bits x i of the vector x and use Boolean operation, weobtain the equation eq. (22) given as follows: R = x ¯ ω ω + (¯ x ⊕ C ) ω ω (22)28here − x is represented by the two-complement operation ¯ x ⊕ C that consist of inverting thebits of x and adding C a constant corresponding to − d with d the decimal part size of the vector.Here, ¯ x now represents the bit-wise not operation. Applying De Morgans law on eq. (22) willlead to the following: R = x ¯ ω ω + ¯ xω ω ⊕ Cω ω , (23) R = ω ( x ⊕ ω ) ⊕ Cω ω , (24) R = ω ( x ⊕ ω ) , (25)In eq. (24), we consider C = 0 which introduces an absolute error of − − d when the MSB andthe LSB of ω are equal to 0. The aim of such approximation is that eq. (25) can be implementedby a single LUT3 component. Consequently, achieving 10,000 operations require k × LUT3where k represents the precision of x and is chosen to lower the required resources and error. First adder stage

The circuit shown in Fig. 11 represents the operations of multiplication used in the dot productbased on eq. (25). To accelerate the computation time, multiple access memory is utilized:Fig. 11 (a) and (b) show the implementation of multiple block RAM (BRAM) that output 100rows of 100 values at the same time. Eq. (25) is implemented in LUT3 as described in thecircuit of the Fig. 11 (c) which compute the addition of two elements of the vector { ω ij x i } i .Using LUT3 for the elements at index i , with i the index of the generated vectors beginning at0, and multiplexers for the elements at index i + 1 ensures the use of lower resources and theoptimal use of the conﬁgurable logic block (CLB). DSP tree

For block matrices and vectors of size u with u = 100 , the dot product requires to perform10,000 multiplications and 8,100 additions. The circuit of the Fig. 11 (c) computes two multi-plications and one addition. Thus, by reproducing this circuit 5,000 times, the FPGA computes15,000 operations in a clock cycle and generates 100 vectors of 50 values that need to be added.This operation is done using an adder tree represented in Fig. 12. A ﬁrst stage of adder, that is29ot represented between the circuit of the Fig. 11 (c) and Fig. 12, is realized using the CARRY8component that reduces the 100 vectors of 100 elements to 100 vectors of 25 elements each.The remaining elements are then added using a DSP tree in adder mode. The advantage of usingan adder tree of DSP is the low LUT number that is required, the optimal use of power con-sumption and the possibility to increase the frequency of the circuit. The Ultrascale architectureprovided by the FPGA KU040 possesses a high number of DSP that can be cascaded allowingto reduce the routing circuit and the computation time. The DSP tree of Fig. 12 is repeated 100times to compute at the same time all elements of the dot product resulting in the use of 1,200DSP. Scalability of such architecture for bigger FPGA

The number of clock cycles required to perform the dot product is given as follows: C t ( u, N ) = K + ( N/u ) (26)where N is the problem size, u the divider that partitions the matrix (in the current implemen-tation u=100) and K the number of clock cycles required for the multiplications and additions.The adder tree composed of CARRY8 and DSP previously proposed is constrained by the fol-lowing condition: N h C + 5 h D = 1 , (27)where h C represents the height of the CARRY8 tree and h D the height of the DSP tree. TheCARRY8 requires only one clock cycle when two cascaded DSP need 5 clock cycles. Theheight K of the adder tree (in clock cycle) is then: K = h C + 5 h D . (28)We can ﬁx h C as a constant according to the proposed FPGA circuit and ﬁnd h D as follow:30 = 2 h C + 5 h D , (29) N − h C = 5 h D , log ( N − h C ) = h D log (5) ,h D = log ( N − h C ) log (5) . (30)Then the height K is equal to: K = h C + 5 log ( N − h C ) log (5) . (31)The adder tree increases logarithmically if we assume that an inﬁnite amount of resources isavailable. Also, we show here that the design can be signiﬁcantly improved if u become largerwith more available resources.

5. Ising energy circuit

The Ising energy is computed at the same time as the main dot product of x i by ω ij because itshares the same output from the RAM that store the ω ij . As for the dot product, the Ising energyhas been computed using logic equations to ﬁt in a minimal number of LUT. The logic equationfor the multiplication of σ by the matrix Ω can be described as follows: S = ( ω ⊕ σ j ) ω , (32) S = w , (33)where σ i is the sign of x i and S is a 2 bits vector representing the multiplication of σ i at index i by ω ( representing ω ij ) which is also represented by 2 bits whose index is either 0 (LSB) or 1(MSB).The Fig. 13 shows a schematic of the circuit used to compute the Ising energy. Note thatthe Ising energy of a matrix M divided into matrices m ij , where i and j are the indexes of thepartitioned M, is equal to the sum of the energies of m ij . Thus, the output of the pipelined Isingenergy circuit is added with itself. 31 . Non-linear term To optimize the use of the electrical power, x i has been designed to use the highest frequency f x and the error correction use and intermediate frequency f e . Both are computed several timesduring the operation of the dot product to reduce the computation time. Fig. 14 shows the circuituse to compute one element of the vectors x i and e i that are reproduced 100 times. The circuitsuse pipelined DSP to compute additions, subtractions and multiplications. A shift register isused to multiply by the dt of Euler approximation. To reduce the number of DSP into thedesign, x i is shared between x i and e i through a true dual port RAM that can be used with twodifferent clocks to synchronize the two circuits. The RAM is controlled by an external FSM inthe control module of Fig. 7.

7. Modulation term

The circuits implementing the modulation of the target amplitude a deﬁned in eq. (3) are shownin Fig. 15. A saturation circuit has been designed to reduce the domain of the sigmoid function.Pre and post operations are available into the DSP for reducing the use of LUT.The tangent hyperbolic function is approximated by Lagrange interpolation using only 96LUT and 1 DSP. The Lagrange interpolation, deﬁned by L , is given as follows: L ( x ) = y + y − y x − x ( x − x ) (34)where, L ( x ) is the interpolation function, [ x , y ] the coordinate of the point situated before theresults and [ x , y ] the coordinate of the point after the result (see Fig. 16).

8. Power consumption

Energy consumption of the circuit is determined by the number of logic transitions (from 0 to 1or 1 to 0) and the frequency with P = (cid:104) s (cid:105) CV F where P is the power dissipated by a transistorbased circuit, C the switching capacitance, V the voltage and F the frequency, and (cid:104) s (cid:105) the aver-age number of switch per clock cycle. Voltage and capacitance are FPGA dependent since it isalready manufactured. The digital circuit are at the highest power consumption when the com-ponents are enabled and when the clock signal drives the Conﬁgurable Logic Block (CLB) or32he DSP. Thus, Enable and disable the block RAM is efﬁcient to reduce the consumption how-ever, clock gating shows better performances in this implementation. The methods have beenextended on the available FF of the Xilinx FPGA and DSP which possess clock enable gate.However, when the design becomes large, high number of signal propagating enable towardsCLB or DSP increase fanout and routing complexity. This can be solved by using BUFGCEcomponent that are available in the FPGA and able to enable or disable the clock. Thus, when agiven circuit is not needed, the clock can be disabled which will reduce signiﬁcantly the powerconsumption.The KU040 boards use Inﬁneon regulators IR38060 incorporated voltage and current sen-sors. These sensors can be access either from the FPGA or from external bus with the PMBUSprotocol which is based on i2c. We used here an Arduino to communicate with the regulatorsand record power data.The power has been measured for different problem sizes and does not exceed 5W. Exper-iments show that the most critical operation for energy efﬁciency and computation time is thedot product which dissipates most of the FPGA power and needs more clock cycles to operate.33igure 11: Representation of the high speed and multiple memory access apply to a circuit usedfor the dot product. (a) 100 RAM are instantiated and can be accessed at the same time. EachRAM corresponds to a row of the matrix Ω . Each element ω ij of Ω is encoded on a two bitsvector. (b) As in (a), 100 RAM can be accessed at the same time to compose the vector x. (c)The Ω i, j and x i values are injected into a LUT3 to compute the eq. (25). The x i +1 valuesare injected into a multiplexer (MUX) whose selector is controlled by the LSB of Ω i, j +1 . Theoutput of the LUT3 and the MUX are propagated through the CARRY8 component that add orsubtract according to the value of Ω i, j +1 MSB.34igure 12: Schematic representation of the DSP tree following the multiplication and the twoaddition stages shown in Fig. 11. This ﬁrst addition stage takes 5 clock cycles and produces 5values that can be added into a new DSP stage of 5 clock cycles.35igure 13: Ising energy circuit. The LUTs integrate eqs. (32) and (33) producing vectors of σ j ω ij . The results are added up using an adder tree composed of CARRY8 resulting in a vectorthat is multiplied by σ i using a multiplexer (MUX). The MUXs select between the result or itsnegation with σ i as a selector. Then, the output of MUXs are added and negated.36igure 14: Circuits of the non-linear terms x i and e i . The two terms use different clocks andshare their results through a dual port block RAM allowing to read and write at different speed.Both circuit use DSP for high speed computation and are generated one hundred times to per-form parallel computation. 37igure 15: Circuit of the target amplitude using pre-subtractor into one DSP and post-subtractorinto the other DSP. A saturation comparator has been added to reduce the range of value atthe input of the sigmoid function. The sigmoid function uses an interpolation of the tangenthyperbolic.Figure 16: (a) Schematic representation of the FPGA implementation of the Lagrange inter-polation function. Each points used for the interpolation is separated by 0.5. A shift registerallows to select the address of interpolated points into the Read Only Memory (ROM). The signof x is conserved so that only the positive part is interpolated. (b) Representation of the tanhfunction and interpolated points. 38

1. Parameter values used for solving SK spin glass instances

The parameter values used for solving SK spin glass instances are shown in Tab. 2.Symbol meaning value β coupling strength . α target amplitude baseline . p linear gain − ( N ) − ρ amplitude and gain variation 3 δ sensitivity to energy variations 4 γ rate of increase of ξ τ max. time w/o energy change 1200 n x number of iterations for nonlinear terms 6 n e number of iterations for error terms 4 dt x normalized time-step of nonlinear terms − dt e normalized time-step of error terms − Table 2: Parameters used for solving SK problem instances.39 upplementary material C: scaling vs. recently proposed Isingmachines

In order to assess the scalability of the chaotic amplitude control and its FPGA implementation,we compare its median time to solution in seconds and extrapolations against that of the NTTCIM (not parallelized) ( ), Hopﬁeld neural network implemented using memresistors (mem-HNN) ( ), restricted Boltzmann machine using a FPGA (FPGA-RBM) ( ). Extrapolationsare based on the hypotheses of scaling in e γN and e γ √ N by ﬁtting the available experimentaldata up to N = 100 for mem-HNN and FPGA-RBM, N = 150 for NTT CIM, and N = 700 for FPGA-CAC. Fig. 17 shows that mem-HNN, FPGA-RBM, and NTT CIM have similarscaling exponents, although FPGA-RBM tends to exhibit a scaling in e γN rather than e γ √ N for N ≈ ( ). In the hypothesis of scaling in e γ √ N for these three machines, extrapolationsshow that FPGA-CAC exhibits smaller time to solution for N (cid:39) even though we donot take into account the increase in scaling exponents that would occur when implementinglarger problem sizes for mem-HNN and FPGA-RBM. Fig. 17 shows moreover that the 20Wimplementation of FPGA-CAC has smaller time to solution than DA ( ) at N ≈ . It shouldbe noted that the experimental data points of Fig. 17 are directly taken from published resultsof each Ising machine (

9, 58, 59 ) which are not based on the same benchmark instances. Itcan be nonetheless expected that the algorithm implemented in mem-HNN, which is similar tomean-ﬁeld annealing, has the same scaling behavior as simCIM and NMFA (see Fig. 2).40igure 17: Median time to solution and extrapolations based on the hypotheses of scaling in e γN and e γ √ N by ﬁtting the available experimental data for each Ising machine (see text). Shadedregions show the error in the scaling exponents. For FPGA-CAC and DA, the lower,higher, and upper whisker of boxes show the th , th , and th percentiles of the real time tosolution distribution. 41 upplementary material D: benchmark on GSET Performance of FPGA-CAC in ﬁnding the maximum cuts known, i.e., lowest Ising Hamiltonianknown, of graphs in the GSET benchmark are shown in Tab. 3. id N C opt C ∗ C ∗ − C opt p < t BLS > (s) < t FPGA > (s) < t BLS > / < t

FPGA > Table 3:

Performance of FPGA-CAC implementation in ﬁnding the maximum cuts known, i.e., lowestIsing Hamiltonian known, of graphs in the GSET benchmark. id , C opt , and C ∗ are the name of instances,best maximum cuts known from (

27, 52 ) and the proposed method, respectively, after 20 runs. p is theprobability that FPGA-CAC ﬁnds the cut C ∗ in a single run. Moreover, < t BLS > and < t FPGA > arethe averaged time to solution using BLS written C++ and running on a Xeon E5440 2.83 GHz ( ) andthe proposed scheme simulated implemented on the KU040 FPGA, respectively. We believe a better tuning of the system parameters would allow to ﬁnd the best knowncut for all instances (except the 2nd instance of N = 2000 ). For instances 14 and 15 of size42 = 2000 , FPGA-CAC ﬁnds solutions of better quality than previously known from ( )and ( ). The solutions found are given as follows. • Solution of instance 14 of size N = 2000 with cut 7685:[1,-1, 1,-1,-1,-1,-1, 1, 1,-1,-1,-1,-1,-1, 1,-1,-1,-1, -1, 1, 1, 1,-1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,-1,-1,1, -1, 1, 1, 1,-1, 1,-1,-1,-1,-1, 1, 1, 1,-1,-1,-1,-1, 1, 1, 1,-1, 1, 1, 1,-1, 1, 1, 1,-1, 1,-1,-1,-1,-1,-1,-1,-1,-1,-1, 1, 1,-1, 1, 1, 1, 1, 1, 1,-1,-1, 1, 1,-1, 1, 1,-1,-1,-1,-1,-1, 1, 1, 1,-1, 1,-1,-1, 1,-1,-1,-1, 1,1,-1,-1, 1, 1,-1,-1,-1,-1, 1,-1, 1, 1,-1, 1, 1, 1,-1, 1,-1,-1, 1,-1, 1,-1, 1, 1,-1,-1,-1,-1, 1, 1, 1, 1, 1,1, 1, 1,-1, 1,-1, 1,-1,-1, 1,-1, 1, 1, 1, 1,-1, 1, 1, 1, 1, 1,-1, 1, 1, 1,-1, 1,-1, 1,-1,-1, 1, 1, 1,-1, 1,1, 1,-1,-1, 1,-1, 1, 1,-1,-1,-1, 1,-1, 1, 1,-1, 1, 1, 1, 1,-1, 1,-1,-1, 1, 1, 1, 1, 1, 1, 1,-1, 1, 1,-1,-1,-1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,-1,-1, 1, 1, 1,-1, 1,-1, 1, 1,-1, 1, 1, 1, 1, 1,-1, 1, 1,-1,-1,-1, 1, 1,1, 1, 1,-1,-1, 1, 1, 1, 1,-1, 1, 1, 1, 1,-1,-1, 1,-1, 1,-1,-1, 1,-1,-1, 1, 1,-1, 1, 1,-1,-1, 1,-1,-1,-1, 1,-1, 1, 1, 1, 1,-1, 1, 1, 1,-1,-1, 1, 1, 1, 1,-1,-1,-1, 1,-1, 1,-1,-1, 1,-1,-1, 1, 1, 1,-1, 1, 1,-1,-1, 1,-1,-1, 1,-1,-1, 1, 1,-1, 1, 1, 1, 1, 1, 1, 1, 1,-1,-1,-1, -1, 1, 1,-1, 1,-1, 1, 1, 1, 1, 1,-1, 1,-1, 1, 1,-1, 1,-1, 1,-1,-1, 1, 1,-1, 1,-1, 1,-1, 1, 1,-1,-1,-1, 1,-1, -1, 1,-1, 1, 1,-1, 1,-1,-1,-1,-1, 1,-1,-1,-1, 1, 1,-1,-1,-1,-1,-1, 1, 1, 1, 1, 1,-1,-1, 1,-1, 1,-1, 1,-1, 1, -1, 1, 1, 1,-1,-1,-1,-1,-1, 1,-1, 1,-1, 1,-1, 1, 1,-1,1,-1, 1, 1,-1,-1, 1, 1, 1,-1,-1, 1, 1,-1, 1, 1,-1,-1, -1, 1,-1, 1,-1, 1, 1,-1,-1, 1, 1, 1,-1,-1, 1,-1,-1, 1,1,-1, 1, 1, 1,-1, 1,-1, 1,-1,-1, 1, 1,-1,-1, 1, 1, 1, -1,-1,-1,-1, 1, 1, 1,-1,-1,-1, 1, 1,-1, 1, 1,-1, 1, 1,1,-1, 1,-1,-1,-1,-1,-1,-1,-1, 1, 1, 1, 1, 1,-1,-1, 1, 1,-1,-1, 1, 1,-1,-1,-1, 1, 1,-1,-1,-1,-1,-1, 1, 1,-1,-1, 1, 1, 1,-1, 1, 1,-1, 1, 1,-1,-1,-1,-1,-1, 1, 1,-1, 1,-1, 1,-1,-1,-1, 1,-1,-1,-1, 1,-1, 1,-1,-1, 1, 1,1, -1, 1, 1, 1,-1,-1, 1,-1,-1,-1, 1,-1, 1,-1,-1, 1, 1,-1, -1,-1,-1, 1,-1,-1,-1,-1,-1, 1,-1,-1,-1,-1,-1,-1,1,-1, 1,-1, 1, 1, 1, 1, 1,-1, 1, 1, 1,-1, 1,-1,-1, 1, 1, 1, -1,-1,-1, 1, 1, 1,-1, 1,-1,-1,-1,-1, 1,-1,-1,-1,-1,1, -1,-1, 1, 1,-1,-1, 1,-1,-1,-1,-1,-1,-1,-1, 1,-1,-1,-1, 1,-1,-1, 1,-1,-1, 1, 1,-1,-1,-1,-1,-1, 1, 1,-1,1, 1, -1,-1, 1, 1, 1,-1,-1,-1,-1, 1,-1, 1,-1,-1, 1,-1, 1,-1, -1,-1,-1,-1, 1, 1,-1,-1,-1, 1,-1,-1,-1,-1, 1,1,-1,-1, 1,-1, 1,-1, 1,-1, 1, 1, 1,-1,-1,-1,-1,-1, 1,-1,-1, 1, 1, 1,-1, 1, 1,-1,-1,-1,-1, 1, 1,-1,-1, 1,-1,1,-1,-1, -1, 1, 1,-1, 1,-1, 1,-1,-1, 1, 1,-1,-1, 1,-1,-1,-1,-1, -1,-1,-1, 1, 1, 1,-1, 1, 1, 1,-1, 1, 1, 1, 1,1,-1, 1, 1, 1, 1,-1, 1,-1, 1, 1, 1,-1, 1, 1, 1,-1, 1,-1, 1, 1, 1,-1,-1,-1, 1,-1,-1,-1, 1, 1,-1, 1, 1,-1, 1, 1,1,-1, 1,-1,-1, 1, 1, 1, 1,-1, 1, 1,-1,-1,-1,-1,-1,-1, 1,-1, 1, 1,-1,-1,-1, 1, 1,-1, 1,-1, 1, 1, 1,-1,-1,-1,1, 1, 1,-1,-1,-1, 1,-1, 1,-1,-1,-1,-1,-1,-1, 1,-1,-1,-1,-1, -1,-1, 1, 1,-1,-1, 1,-1,-1,-1, 1, 1, 1, 1, 1,-1,1, 1, -1,-1,-1, 1,-1, 1, 1, 1, 1, 1,-1,-1,-1, 1, 1, 1, 1, 1, 1,-1,-1,-1, 1, 1,-1,-1, 1, 1,-1, 1,-1, 1, 1,-1, 1,1, 1, 1, 1,-1, 1, 1,-1,-1, 1,-1, 1,-1, 1, 1,-1, 1, 1,-1, -1, 1,-1, 1,-1, 1, 1,-1,-1,-1, 1,-1, 1, 1, 1, 1,-1,-1,431,-1, 1,-1, 1,-1, 1, 1,-1,-1, 1, 1,-1, 1,-1, 1,-1, 1, 1,-1, 1, 1, 1, 1, 1,-1, 1,-1, 1, 1, 1,-1,-1,-1, 1, 1,-1, 1, 1,-1, 1,-1, 1, 1,-1,-1,-1,-1,-1, 1, 1,-1,-1, 1, 1,-1, 1,-1, 1,-1, 1, 1, 1,-1, 1,-1,-1, 1, 1, 1,-1, 1,1,-1,-1, 1, 1,-1,-1, 1, 1, 1, 1,-1, 1,-1,-1,-1,-1, 1, 1, 1,-1,-1, 1, 1,-1,-1, 1, 1, 1, 1,-1, 1,-1,-1, 1,-1,-1, 1, 1,-1, 1,-1,-1,-1,-1, 1, 1,-1,-1,-1, 1, 1, 1, 1, 1, 1,-1,-1,-1,-1,-1,-1, 1, 1,-1, 1, 1, 1, 1,-1,-1, 1,-1,-1,-1, 1,-1,-1,-1, 1, 1, 1, 1,-1, 1,-1, 1, 1, 1,-1, -1, 1, 1, 1,-1, 1,-1, 1, 1, 1,-1, 1, 1,-1,-1, 1,-1,-1,1, 1,-1,-1,-1,-1, 1,-1,-1,-1,-1, 1, 1,-1, 1, 1, 1, 1, -1, 1,-1, 1, 1,-1, 1,-1,-1,-1, 1,-1, 1,-1,-1,-1, 1,-1,-1, 1,-1, 1,-1,-1, 1, 1,-1, 1, 1, 1,-1, 1, 1, 1, 1,-1, -1,-1,-1, 1, 1, 1,-1, 1, 1,-1,-1, 1,-1,-1, 1,-1, 1, 1,-1,-1,-1, 1, 1,-1, 1, 1, 1,-1,-1,-1, 1, 1,-1, 1,-1,-1, 1,-1,-1, 1,-1,-1, 1,-1, 1, 1,-1, 1,-1,-1,-1,-1, 1, 1,1, 1,-1,-1,-1,-1, 1,-1, 1,-1, 1, 1, 1,-1,-1,-1,-1,-1, 1,-1,-1, 1, 1, 1, 1,-1,-1,-1,-1, 1, 1,-1,-1,-1, 1,-1,1, 1, 1, 1, 1,-1, 1, 1,-1,-1, 1, 1, 1,-1, 1,-1,-1, 1, 1, 1,-1,-1,-1,-1, 1,-1, 1,-1, 1, 1, 1,-1,-1, 1, 1,-1,-1, 1,-1,-1,-1,-1,-1,-1, 1, 1,-1,-1,-1,-1,-1,-1, 1,-1, 1, 1,-1,-1, 1,-1,-1,-1, 1, 1,-1, 1, 1,-1,-1, 1, 1, 1,-1,-1, 1,-1, 1,-1, 1,-1,-1, 1,-1, 1,-1,-1, 1, 1,-1, 1, 1,-1, 1,-1, 1,-1,-1,-1, 1, 1,-1, 1, 1,-1, 1, 1,-1, 1,-1,-1,-1,-1, 1,-1, 1, 1, 1,-1,-1,-1, 1, 1,-1, 1, 1,-1, 1,-1,-1,-1,-1, 1, 1, 1, 1,-1,-1,-1, 1, 1, 1,-1,-1, 1,1, 1,-1, 1,-1,-1,-1,-1,-1, 1,-1, 1,-1,-1,-1,-1, 1,-1, -1,-1,-1,-1,-1,-1, 1, 1,-1,-1, 1,-1,-1,-1,-1,-1,-1,-1,1,-1, 1,-1,-1,-1, 1, 1,-1, 1,-1, 1, 1,-1, 1,-1,-1,-1, 1,-1, 1, 1, 1,-1,-1,-1, 1,-1,-1, 1,-1,-1, 1,-1, 1,-1,-1,-1,-1,-1, 1, 1, 1, 1,-1, 1,-1,-1, 1,-1,-1,-1,-1,-1, 1,-1, 1, 1,-1, 1, 1,-1, 1,-1, 1,-1, 1,-1, 1, 1,-1, 1,-1,-1, 1,-1, 1, 1, 1, 1,-1, 1, 1, 1,-1,-1,-1, 1, 1,-1, 1,-1,-1, 1,-1, 1,-1, 1,-1,-1, 1, 1, 1, 1,-1,-1,-1, 1,-1, 1,-1,-1, 1,-1, 1, 1,-1, 1,-1, 1,-1,-1,-1, 1,-1, 1, -1, 1,-1,-1, 1,-1,-1,-1,-1,-1, 1, 1,-1,-1, 1, 1,-1, 1,1, 1, 1,-1,-1,-1, 1, 1,-1, 1, 1,-1,-1, 1, 1,-1,-1,-1, -1,-1, 1,-1,-1,-1,-1,-1, 1, 1, 1, 1, 1,-1, 1,-1, 1, 1,1, 1, 1,-1, 1,-1,-1, 1,-1,-1, 1, 1, 1,-1,-1, 1,-1, 1, -1,-1,-1, 1,-1,-1,-1, 1,-1,-1,-1, 1, 1, 1, 1, 1,-1,-1,1, 1,-1, 1, 1, 1,-1, 1, 1, 1, 1, 1,-1, 1,-1, 1, 1, 1, -1, 1, 1,-1, 1,-1, 1, 1, 1, 1,-1,-1,-1, 1, 1, 1, 1,-1,1,-1,-1,-1, 1, 1,-1,-1,-1,-1,-1, 1, 1,-1, 1, 1, 1,-1, 1, 1,-1, 1, 1, 1, 1,-1,-1,-1, 1,-1, 1,-1, 1, 1,-1,-1,1,-1,-1,-1,-1,-1, 1, 1,-1,-1,-1, 1, 1,-1, 1, 1, 1, 1, 1, 1, 1, 1,-1, 1,-1,-1,-1,-1,-1, 1, 1,-1, 1,-1, 1, 1,-1, 1,-1, 1, 1,-1,-1,-1, 1, 1,-1,-1, 1,-1,-1,-1, 1, 1, 1, 1,-1,-1,-1,-1, 1, 1,-1,-1, 1, 1,-1, 1, 1,-1,-1,-1,-1, 1,-1, 1, 1, 1,-1,-1,-1, 1,-1, 1, 1, 1, 1,-1,-1,-1, 1, 1,-1, 1, 1,-1, 1,-1,-1, 1, 1, 1,-1, 1,-1, 1,-1,-1,-1, 1,-1, 1,-1, 1, 1, 1,-1, 1, 1, 1,-1,-1,-1, 1,-1,-1, -1, 1,-1,-1,-1, 1,-1, 1, 1,-1, 1,-1, 1, 1,-1,-1, 1, 1,-1,-1,-1,-1, 1, 1,-1, 1,-1, 1, 1, 1,-1, 1,-1, 1, 1, 1, -1, 1,-1, 1,-1,-1, 1, 1,-1,-1,-1,-1, 1, 1,-1, 1,-1, 1,-1,-1, 1,-1, 1,-1, 1, 1, 1, 1, 1, 1, 1,-1,-1, 1,-1, 1, 1, 1,-1, 1, 1, 1,-1, 1, 1, 1,-1, 1, 1, 1,-1, 1, 1, 1, -1,1, 1,-1, 1, 1,-1,-1, 1, 1, 1,-1, 1, 1, 1,-1,-1, 1, 1, 1] • Solution of instance 15 of size N = 2000 with cut 7679:44-1, 1,-1,-1,-1,-1, 1,-1,-1,-1,-1,-1, 1,-1, 1, 1,-1,-1, -1,-1, 1,-1, 1,-1,-1, 1,-1, 1,-1, 1,-1,-1,-1, 1,1,-1, -1, 1,-1, 1, 1, 1, 1,-1, 1, 1,-1, 1,-1,-1, 1, 1, 1, 1, 1,-1,-1,-1, 1, 1,-1,-1, 1, 1, 1,-1,-1, 1, 1, 1,1,-1, 1, 1,-1,-1, 1,-1, 1, 1, 1,-1, 1,-1, 1, 1,-1,-1,-1,-1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,-1, 1,-1,-1,1, -1,-1,-1,-1,-1,-1, 1, 1, 1,-1,-1, 1,-1, 1,-1,-1, 1, 1, -1, 1,-1,-1,-1,-1, 1, 1,-1,-1, 1, 1,-1,-1, 1, 1,1,-1, 1, 1, 1, 1, 1,-1,-1, 1, 1, 1, 1, 1, 1,-1, 1, 1, 1,-1, 1,-1, 1,-1, 1, 1,-1,-1, 1, 1, 1,-1,-1, 1,-1, 1,-1,1, 1, 1, 1, 1, 1, 1,-1, 1, 1, 1, 1,-1, 1, 1, 1, 1, 1, 1, -1, 1,-1, 1,-1,-1, 1,-1,-1, 1,-1,-1, 1,-1, 1, 1,-1,-1,1, 1, 1, 1, 1,-1, 1,-1,-1,-1, 1,-1, 1,-1, 1,-1, 1, 1, -1,-1,-1, 1,-1, 1, 1, 1,-1,-1, 1, 1,-1,-1, 1, 1,-1,-1,1,-1, 1, 1,-1,-1, 1,-1,-1,-1, 1,-1,-1, 1,-1,-1,-1, 1, -1,-1, 1,-1, 1,-1, 1, 1, 1,-1, 1, 1,-1,-1,-1,-1, 1,-1,1,-1, 1, 1, 1, 1,-1,-1, 1, 1,-1, 1, 1,-1,-1, 1, 1, 1, 1, 1, 1, 1, 1,-1, 1,-1, 1, 1,-1, 1, 1,-1, 1, 1, 1, 1,-1,-1, 1, 1, 1,-1,-1,-1,-1,-1, 1, 1,-1, 1, 1,-1,-1, 1, 1, 1,-1, 1,-1, 1, 1,-1,-1, 1,-1, 1, 1, 1, 1, 1,-1,-1,1, 1, 1, 1, 1,-1,-1,-1,-1, 1,-1,-1, 1,-1, 1, 1, 1, 1, 1,-1,-1, 1, 1,-1, 1,-1,-1, 1,-1,-1, 1,-1, 1,-1,-1,-1,-1,-1,-1, 1, 1,-1, 1,-1,-1, 1, 1, 1,-1, 1,-1, 1,-1,-1, 1,-1, 1,-1,-1,-1, 1, 1, 1,-1, 1,-1,-1, 1,-1, 1,-1,-1,-1, 1, 1,-1, 1, 1,-1,-1,-1, 1,-1,-1, 1,-1,-1, 1, 1,-1, -1,-1,-1,-1, 1, 1,-1, 1,-1,-1, 1,-1, 1,-1,-1,-1, 1, 1,-1,-1,-1, 1,-1, 1, 1, 1,-1, 1,-1, 1,-1,-1, 1, 1,-1,-1, -1,-1,-1, 1, 1, 1,-1, 1,-1,-1,-1, 1, 1,-1, 1, 1,-1, 1,1,-1,-1,-1,-1,-1,-1, 1,-1, 1,-1, 1, 1, 1,-1,-1, 1,-1, -1, 1,-1, 1,-1, 1,-1, 1, 1, 1, 1, 1, 1,-1,-1, 1, 1, 1,1,-1, 1, 1, 1,-1, 1,-1, 1,-1,-1,-1, 1, 1, 1,-1, 1, 1, 1,-1, 1, 1, 1, 1, 1,-1, 1, 1,-1,-1, 1,-1,-1,-1, 1, 1,1,-1, 1,-1,-1, 1,-1,-1, 1,-1, 1, 1,-1, 1,-1, 1,-1, 1, -1,-1,-1, 1,-1,-1, 1, 1, 1, 1, 1, 1,-1,-1, 1,-1, 1, 1,1,-1,-1, 1,-1, 1, 1, 1, 1,-1, 1, 1,-1,-1,-1,-1,-1, 1, -1,-1, 1,-1, 1,-1, 1, 1, 1, 1,-1, 1, 1, 1,-1, 1,-1,-1,-1,-1, 1, 1, 1, 1, 1, 1,-1,-1,-1, 1,-1, 1, 1,-1, 1, 1, 1,-1, 1, 1,-1,-1, 1, 1, 1, 1,-1,-1, 1,-1, 1,-1, 1,-1,-1, 1, 1, 1,-1, 1, 1,-1,-1, 1, 1, 1,-1, 1, 1, 1, 1,-1, 1, 1, 1,-1,-1,-1, 1, 1, 1,-1,-1,-1, 1,-1,-1, 1,-1, 1,1, 1,-1, 1, 1, 1, 1, 1, 1, 1,-1, 1,-1, 1, 1,-1,-1, 1, 1,-1, 1,-1,-1, 1, 1,-1, 1, 1, 1, 1, 1, 1,-1,-1,-1, 1,1, 1, 1, 1,-1, 1,-1,-1, 1, 1, 1, 1, 1, 1,-1, 1, 1, 1, -1,-1,-1, 1, 1,-1, 1,-1, 1,-1, 1,-1,-1, 1, 1,-1,-1, 1,-1, 1, 1,-1, 1, 1,-1,-1, 1, 1,-1, 1,-1, 1,-1,-1,-1, 1, -1, 1, 1, 1, 1,-1, 1, 1,-1,-1, 1,-1, 1, 1, 1,-1, 1,-1,-1, 1, 1,-1, 1,-1, 1,-1, 1,-1,-1, 1,-1,-1,-1, 1,-1,-1, -1,-1,-1,-1, 1, 1,-1, 1, 1, 1,-1,-1,-1,-1,-1,-1,-1, 1,-1,-1, 1,-1,-1,-1, 1,-1, 1, 1,-1,-1,-1,-1,-1, 1,-1,-1, -1, 1,-1,-1,-1,-1,-1, 1, 1, 1, 1,-1, 1,-1, 1,-1, 1,-1,1,-1,-1, 1,-1,-1,-1, 1,-1,-1,-1,-1,-1, 1, 1,-1,-1,-1, -1, 1,-1, 1,-1,-1,-1, 1, 1, 1,-1,-1, 1, 1,-1, 1,-1, 1,-1, 1,-1,-1, 1,-1, 1, 1,-1,-1, 1,-1,-1, 1,-1,-1, 1,-1, 1,-1,-1,-1, 1,-1,-1, 1, 1,-1, 1,-1,-1, 1,-1,-1,-1, 1,1, 1,-1,-1,-1, 1,-1, 1, 1,-1, 1,-1,-1,-1, 1,-1, 1,-1, 1, 1,-1,-1,-1, 1,-1,-1, 1, 1, 1,-1,-1, 1,-1,-1, 1,-1,1, 1,-1,-1,-1, 1,-1,-1,-1, 1, 1, 1,-1,-1, 1, 1, 1,-1, -1,-1,-1, 1, 1, 1, 1,-1,-1, 1,-1, 1,-1, 1,-1, 1,-1, 1,-1, 1, 1, 1,-1, 1,-1,-1, 1,-1,-1, 1, 1, 1,-1, 1,-1,-1, -1, 1, 1, 1,-1,-1, 1,-1, 1, 1, 1, 1,-1, 1,-1,-1,-1, 1,45, 1,-1,-1, 1,-1,-1, 1, 1,-1,-1,-1,-1,-1,-1, 1, 1, 1, 1,-1, 1, 1,-1, 1, 1,-1,-1,-1, 1, 1, 1,-1, 1, 1,-1,-1,-1,-1, 1,-1,-1, 1,-1,-1, 1, 1,-1, 1,-1, 1,-1,-1, 1,-1, -1, 1,-1,-1, 1,-1,-1,-1,-1, 1,-1, 1, 1,-1,-1, 1, 1, 1,1,-1,-1, 1, 1, 1, 1,-1, 1, 1, 1,-1,-1, 1, 1, 1, 1, 1, -1,-1, 1,-1,-1,-1, 1,-1,-1,-1, 1,-1,-1, 1, 1, 1,-1, 1,-1,-1,-1, 1,-1,-1,-1,-1,-1, 1, 1,-1, 1, 1, 1,-1, 1, 1, 1,-1, 1, 1,-1, 1, 1, 1, 1,-1, 1, 1, 1,-1,-1,-1,-1,-1,1, 1,-1, 1,-1, 1,-1,-1,-1, 1,-1, 1,-1,-1,-1, 1, 1, 1, 1,-1, 1,-1,-1, 1,-1,-1,-1, 1,-1,-1, 1, 1,-1, 1,-1,-1,-1,-1,-1, 1, 1,-1, 1, 1, 1,-1,-1,-1, 1, 1, 1, 1,-1, 1, 1, 1, 1,-1, 1, 1,-1, 1,-1, 1, 1, 1, 1, 1, 1,-1,-1, 1,-1,-1,-1,-1,-1, 1,-1,-1,-1, 1,-1, 1, 1,-1,-1, 1, 1, 1, 1, 1,-1, 1,-1,-1,-1, 1, 1, 1, 1, 1,-1,-1, 1, 1,-1,-1,-1,-1,-1, 1, 1, 1,-1, 1,-1, 1, 1, 1,-1,-1,-1,-1, 1, 1, 1,-1, 1, 1,-1, 1, 1, 1,-1, 1, 1,-1, 1,-1, 1, 1,-1,-1,-1, 1,-1,-1, 1, 1, 1, 1,-1, 1, 1,-1,-1, 1, 1,-1, 1, 1, 1, 1,-1, 1, 1, 1, 1, 1,-1, 1,-1,-1,-1,-1, 1,-1,-1, 1,1,-1, 1, 1,-1, 1, 1,-1,-1, 1,-1, 1, 1, 1,-1, 1, 1,-1, -1,-1, 1, 1,-1,-1,-1, 1,-1, 1, 1, 1, 1, 1,-1, 1, 1, 1,1,-1,-1, 1,-1, 1, 1, 1,-1, 1,-1, 1, 1,-1,-1,-1, 1,-1, 1,-1, 1, 1,-1, 1,-1, 1,-1, 1, 1,-1,-1, 1,-1, 1, 1, 1,-1, 1,-1,-1, 1, 1,-1, 1,-1,-1,-1,-1,-1, 1, 1,-1,-1,-1, 1, 1, 1, 1,-1,-1,-1, 1,-1,-1, 1,-1, 1, 1, 1,-1, 1, 1,1, 1,-1,-1,-1, 1, 1,-1, 1, 1,-1,-1, 1, 1,-1,-1, 1, 1, -1,-1, 1,-1, 1, 1, 1,-1,-1,-1, 1, 1,-1, 1,-1, 1, 1,-1,-1, 1, 1, 1,-1, 1, 1,-1, 1,-1,-1,-1, 1, 1, 1, 1, 1,-1, -1,-1, 1,-1,-1,-1, 1, 1, 1,-1,-1, 1,-1, 1,-1,-1,-1,-1,-1, 1, 1,-1, 1, 1,-1, 1, 1,-1, 1, 1,-1,-1, 1,-1, 1,-1, 1,-1, 1,-1, 1, 1, 1, 1, 1,-1,-1, 1, 1, 1,-1,-1,-1,-1,-1, 1, 1,-1,-1,-1, 1,-1,-1,-1,-1, 1, 1,-1,-1,-1,-1, 1, 1,-1,-1,-1, 1, 1,-1, 1, 1,-1,-1,-1, 1, 1,-1,-1, 1,-1,1, 1, 1,-1,-1, 1, 1,-1,-1, 1, 1, 1, 1, 1,-1, 1,-1,-1, 1, 1,-1, 1,-1, 1,-1,-1, 1, 1,-1, 1, 1,-1, 1, 1,-1, 1,1, 1, 1, 1, 1,-1, 1, 1,-1,-1,-1,-1, 1, 1, 1, 1,-1,-1, -1,-1, 1, 1,-1,-1, 1, 1,-1, 1, 1, 1, 1,-1, 1, 1, 1, 1,-1, 1,-1,-1, 1, 1, 1,-1, 1,-1,-1,-1,-1, 1,-1,-1, 1,-1, -1,-1,-1,-1, 1, 1, 1, 1, 1,-1, 1,-1,-1,-1,-1,-1,-1, 1,1, 1, 1, 1,-1, 1,-1, 1,-1, 1,-1,-1,-1, 1,-1,-1,-1, 1, 1,-1, 1,-1, 1, 1, 1, 1,-1,-1,-1, 1,-1, 1,-1, 1, 1, 1,-1,-1,-1, 1,-1, 1,-1,-1, 1, 1, 1, 1,-1,-1,-1,-1,-1, 1, 1, 1,-1, 1, 1, 1, 1,-1,-1, 1, 1, 1, 1,-1, 1, 1, 1, 1,1,-1,-1, 1, 1,-1,-1, 1, 1,-1, 1, 1,-1, 1,-1, 1, 1, 1, 1,-1,-1, 1, 1, 1,-1,-1, 1,-1,-1,-1, 1,-1,-1, 1, 1,-1,1,-1,-1,-1,-1,-1, 1, 1,-1,-1,-1,-1, 1, 1, 1,-1,-1,-1, 1, 1, 1, 1,-1,-1, 1,-1,-1, 1,-1, 1,-1, 1, 1,-1, 1, 1,1, 1, 1, 1,-1, 1, 1, 1,-1, 1, 1,-1, 1, 1, 1, 1,-1,-1, -1,-1,-1,-1, 1,-1,-1, 1,-1, 1,-1,-1,-1,-1,-1,-1, 1,-1,-1, 1, 1,-1,-1,-1, 1,-1, 1,-1,-1, 1, 1, 1, 1,-1, 1, 1, 1, 1, 1, 1,-1, 1,-1, 1,-1, 1,-1, 1, 1, 1, 1,-1, 1, 1, 1,1,-1,-1,-1, 1, 1,-1,-1, 1,-1,-1,-1,-1, 1, 1, 1,-1, -1, 1]46 arameter values used for solving instances of the GSET are shown in Tab. 4. Symbol meaning value (cid:15) coupling strength d α target amplitude baseline . π linear gain baseline − d − . ρ amplitude and gain variation . δ sensitivity to energy variations . N γ rate of increase of β N τ max. time w/o energy change Nn x number of iterations for nonlinear terms 6 n e number of iterations for error terms 4 dt x normalized time-step of nonlinear terms − dt e normalized time-step of error terms − Table 4: Parameters used for solving GSET problem instances. where d is a function of the maximum degree given as d = max { d , } and d = mean i { (cid:80) j | ω ij |} . upplementary material E: other algorithms

1. Details of NMFA simulation