Physics Successfully Implements Lagrange Multiplier Optimization
Sri Krishna Vadlamani, Tianyao Patrick Xiao, Eli Yablonovitch
Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA, USA
(Tianyao Patrick Xiao is now at Sandia National Laboratories, Albuquerque, NM, USA.)

July 23, 2020
Abstract
Optimization is a major part of human effort. While being mathematical, optimization is also built into physics. For example, physics has the principle of Least Action, the principle of Minimum Entropy Generation, and the Variational Principle. Physics also has physical annealing which, of course, preceded computational Simulated Annealing. Physics has the Adiabatic Principle which in its quantum form is called Quantum Annealing. Thus, physical machines can solve the mathematical problem of optimization, including constraints.

Binary constraints can be built into the physical optimization. In that case, the machines are digital in the same sense that a flip-flop is digital. A wide variety of machines have had recent success at optimizing the Ising magnetic energy. We demonstrate in this paper that almost all those machines perform optimization according to the Principle of Minimum Entropy Generation as put forth by Onsager. Further, we show that this optimization is in fact equivalent to Lagrange multiplier optimization for constrained problems. We find that the physical gain coefficients which drive those systems actually play the role of the corresponding Lagrange Multipliers.
1 Introduction

Optimization is ubiquitous in today's world. Everyday applications of optimization range from the aerodynamic design of vehicles by fluid mechanics and the physical stress optimization of bridges in civil engineering, to the scheduling of airline crews and the routing of delivery trucks in operations research. Furthermore, optimization is also indispensable in machine learning, reinforcement learning, computer vision, and speech processing. Given the preponderance of massive datasets and computations today, there has been a surge of activity in the design of hardware accelerators for neural network training and inference [1].

We ask whether Physics can address optimization. There are a number of promising physical principles that drive dynamical systems toward an extremum. These are: the principle of Least Action, the principle of Minimum Entropy Generation, and the Variational Principle. Physics also has actual physical annealing, which preceded computational Simulated Annealing. Further, Physics has the Adiabatic Principle, which in its quantum form is called Quantum Annealing.

In due course, we may learn how to use each of these principles to perform optimization. Let us consider the principle of minimum entropy generation in dissipative physical systems, such as resistive electrical circuits. It was shown by Onsager [2] that the equations of linear systems, like resistor networks, can be re-expressed as the minimization principle of a function f(i_1, i_2, ..., i_n) for the currents in the various branches of the resistor network. For a resistor network, the function f contains the power dissipation, or entropy generation. By re-expressing a merit function in terms of power dissipation, the circuit itself will find the minimum of the merit function, or minimum power dissipation. Optimization is generally accompanied by constraints. For example, perhaps the constraint is that the final answers must be restricted to be ±1. Such a digitally constrained optimization produces answers compatible with any digital computer.

There has been a series of machines created in the physics and engineering community to devise physics-based engines for the Ising problem. The Ising challenge is to find the minimum energy configuration of a large set of magnets. It is a very hard problem even when the magnets are restricted to only two orientations, North pole up or down [3]. Our main insight in this paper is that most of these Ising solvers use hardware based on the Principle of Minimum Entropy Generation. The natural evolution of these machines is toward a good, low-power-dissipation final state. Further, almost all of them implement the well-known Lagrange Multipliers method for constrained optimization.

An early work was by Yamamoto et al. in [4], and this was followed by further work from their group [5], [6], [7], [8], [9], [10], [11] and other groups [12], [13], [14], [15], [16], [17], [18], [19], [20]. These entropy generating machines range from coupled optical parametric oscillators, to RLC electrical circuits, to coupled exciton-polaritons, and silicon photonic coupled arrays. These types of machines have the advantage that they solve digital problems in the analog domain, which can be orders-of-magnitude faster and more energy-efficient than conventional digital chips that are limited by latency and energy cost [11].

Within the framework of these dissipative machines, constraints can be readily included. In effect, these machines perform constrained optimization equivalent to the technique of Lagrange multipliers. We illustrate this connection by surveying 7 published, physically distinct machines and showing that each minimizes entropy generation in its own way, subject to constraints; in fact, they perform Lagrange multiplier optimization. We note here that the systems in [11], [17], and [18] follow their own dynamics and are not related to the method of Lagrange multipliers. The system in [11] will be discussed in Section 6 in the main text, while the work in [17] will be discussed in Appendix D.

In effect, physical machines perform local steepest descent in the entropy generation rate. They can become stuck in local optima. At the very least, they perform a rapid search for local optima, thus reducing the search space for the global optimum. These machines are also adaptable toward searching in an expanded phase space, and other techniques for approaching a global optimum.

The paper is organized as follows. In Section 2, we recognize that physics performs optimization through its variational and optimization principles. Then, we concentrate on the principle of minimum entropy generation, or minimum power dissipation. In Section 3, we give an overview of the minimum entropy generation optimization solvers in the literature and show how they incorporate constraints. Section 4 has a quick tutorial on the method of Lagrange Multipliers. Section 5 studies five published solvers in detail and shows that they all follow some form of Lagrange multiplier dynamics. In Section 6, we look at those published physics-based solvers which are less obviously connected to Lagrange multipliers. Section 7 presents the applications of physics-based solvers to perform linear regression in statistics. Finally, in Section 8, we conclude and discuss the consequences of this ability to implement physics-based Lagrange multiplier optimization for areas such as machine learning.
2 Optimization Principles in Physics

We survey the minimization principles of physics and the important optimization algorithms derived from them. These physical optimization machines are intended to converge to optima that are 'agnostic to initial conditions'. By 'agnostic to initial conditions', we mean systems that converge to the global optimum, or a good local optimum, irrespective of the initial point for the search.
2.1.1 The Principle of Least Action

The principle of Least Action is the most fundamental principle in physics. Newton's Laws of Mechanics, Maxwell's Equations of Electromagnetism, Schrödinger's equation in Quantum Mechanics, and Quantum Field Theory can all be interpreted as minimizing a quantity called action. For the special case of light propagation, this reduces to the principle of Least Time, as shown in Fig. 1.

A conservative system without friction or losses evolves according to the principle of Least Action. The fundamental equations of physics are reversible. A consequence of this reversibility is the Liouville Theorem, which states that volumes in phase space are left unchanged as the system evolves.

Contrariwise, in both a computer and an optimization solver, the goal is to have a specific solution with less uncertainty, or a smaller zone in phase space, than the initial state, an entropy cost first specified by Landauer and Bennett. Thus, some degree of irreversibility, or energy cost, is needed, specified by the number of digits in the answer in the Landauer/Bennett analysis. An algorithm has to be designed and programmed into the reversible system (either via software or via hardcoding of hardware) to effect the reduction in entropy needed to solve the optimization problem. Coming up with a fast algorithm for NP-hard problems is still an open problem in the field of reversible computing, which includes quantum computing.

This would require an energy cost, but not necessarily a requirement for continuous power dissipation. We look forward to computer science breakthroughs that would allow the principle of Least Action to address unsolved problems. An alternative approach to computing would involve physical systems that continuously dissipate power, aiding in the contraction of phase space toward a final solution. While exceeding the Landauer limit, such systems might have the advantage of speed and simplicity. This brings us to the principle of Least Power Dissipation.

Figure 1: The principle of Least Time, a subset of the principle of Least Action: The actual path that light takes to travel from point A to point B (through a fast medium of small refractive index and a slow medium) is the one that takes the least time to traverse. Recording the correct path entails a small energy cost consistent with the Landauer Limit.
2.1.2 The Principle of Least Entropy Production

If we consider systems that continuously dissipate power, we are led to a second optimization principle in physics, the principle of Least Entropy Production, or Least Power Dissipation. This principle states that any physical system will evolve into a steady-state configuration that minimizes the rate of entropy production given the constraints (such as fixed thermodynamic forces, voltage sources, or input power) that are imposed on the system. An early version of this statement is provided by Onsager in his celebrated papers on the reciprocal relations [2]. This was followed by further foundational work on this principle by Prigogine [21] and de Groot [22]. This principle is readily seen in action in electrical circuits and is illustrated in Fig. 2. We shall frequently use this principle, as formulated by Onsager, in the rest of the paper.
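As a concrete numerical illustration (ours, not from the original paper), consider the parallel-resistor circuit of Fig. 2: minimizing the total dissipated power over all current divisions that respect the fixed-input-current constraint recovers the familiar current-divider rule. A minimal sketch in Python, where the resistor values are arbitrary assumptions:

```python
import numpy as np

# Two parallel resistors carrying a fixed total input current I.
R1, R2, I = 1.0, 3.0, 1.0   # arbitrary illustrative values

def power(i1):
    # Branch 2 carries I - i1, which enforces the fixed-current constraint.
    return i1**2 * R1 + (I - i1)**2 * R2

# Brute-force search along the constraint line.
i1_grid = np.linspace(0.0, I, 100_001)
i1_min = i1_grid[np.argmin(power(i1_grid))]

# The minimum-dissipation division matches the current-divider rule.
print(i1_min)               # ~0.75
print(I * R2 / (R1 + R2))   # 0.75
```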
Figure 2: The principle of Least Power Dissipation: In a parallel connection, the current distributes itself in a manner that minimizes the power dissipation, subject to the constraint of fixed input current I.

2.1.3 Physical Annealing

This technique is widely used in materials science and metallurgy and involves the slow cooling of a system starting from a high temperature. As the cooling proceeds, the system tries to maintain thermodynamic equilibrium by reorganizing itself into the lowest energy minimum in its phase space. Energy fluctuations due to finite temperatures help the system escape from local optima. In theory, this procedure leads to global optima when the temperature reaches zero, but the temperature has to be lowered prohibitively slowly for this to happen.

Figure 3: Physical annealing involves the slow cooling down of a system. The system performs gradient descent in configuration space (Ising energy H versus the possible solutions), with occasional jumps activated by finite temperature. If the cooling is done slowly enough, the system ends up in the ground state of configuration space.

2.1.4 Adiabatic Method

The Adiabatic Method involves the slow transformation of a system from initial conditions that are easily constructed to final conditions that capture the difficult problem at hand.

More specifically, to solve the Ising problem using this algorithm, one initializes the system of spins in the ground state of a simple Hamiltonian and then slowly varies the parameters of the Hamiltonian to end up with the final Ising Hamiltonian of interest. If the parameters are varied slowly enough, the system ends up in the ground state of the final Hamiltonian and the problem gets solved. In a quantum mechanical system, this is sometimes called 'quantum annealing'. Several proposals and demonstrations, including the well-known D-Wave machine [23], utilize this algorithm.

The slow rate of variation of the Hamiltonian parameters that is required for this method to work is determined by the minimum energy spacing between the instantaneous ground state and the instantaneous first excited state that occurs as we move from the initial Hamiltonian to the final Hamiltonian. The smaller the gap is for a particular variation schedule, the slower the rate at which we need to perform the variation to successfully solve the problem. It has been shown that the gap can get exponentially small in the worst case, implying that this algorithm can take exponential time in the worst case for NP-hard problems.

Figure 4: A system initialized in the ground state of a simple Hamiltonian (p = 0) continues to stay in the ground state as long as the Hamiltonian is changed slowly enough toward the hard Hamiltonian (p = p_0).
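The role of the minimum gap can be made concrete with a small sketch (our toy construction, not from the paper): sweep a two-level Hamiltonian from a simple transverse-field form to a stand-in 'hard' Hamiltonian and track the instantaneous spectral gap; the smallest gap along the sweep dictates how slowly the parameter must be varied.

```python
import numpy as np

# Pauli matrices.
sx = np.array([[0.0, 1.0], [1.0, 0.0]])
sz = np.array([[1.0, 0.0], [0.0, -1.0]])

H_simple = -sx       # easy-to-prepare ground state, (|0> + |1>)/sqrt(2)
H_hard = -0.1 * sz   # stand-in for the hard final Hamiltonian

gaps = []
for p in np.linspace(0.0, 1.0, 201):
    H = (1.0 - p) * H_simple + p * H_hard
    e = np.linalg.eigvalsh(H)     # sorted instantaneous energies
    gaps.append(e[1] - e[0])

# The smaller this minimum gap, the slower the sweep must be for the
# system to remain in the instantaneous ground state.
print(min(gaps))
```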
Multi-Oscillator Arrays subject to Parametric Gain were introduced in [4] and [7] for solving Ising problems. This can be regarded as a subset of the Principle of Minimum Entropy Generation, which is always subject to a non-zero input power constraint. In this case, gain acts as a boundary condition or constraint for the principle of minimum entropy generation, and the oscillator array must arrange itself to dissipate the least power subject to that constraint. In the context of a multi-coupled-oscillator array with gain, a certain oscillator mode will have the least loss. That mode will grow in amplitude most rapidly. This least-loss mode can be individually selected by increasing the gain until it matches the intrinsic loss of that mode. Then further nonlinear evolution amongst all the modes will occur. If the oscillator array is bistable, as is the case for parametric gain, which drives oscillation along the ± real axis, this becomes the analog of magnetic bistability in an Ising problem. Then, we seek a solution for the lowest magnetic energy state with some oscillators locked at zero phase shift and others locked at π-phase shift. This mechanism will be the main point of Section 3.

3 Coupled Multi-Oscillator Array Ising Solvers

The motivation for 'Coupled Multi-Oscillator Array Ising Solvers' is best explained using concepts from laser physics. As a laser is slowly being turned on, spontaneous emission from the laser gain medium couples into the various cavity modes and begins to become amplified. The different optical modes in the cavity have different loss coefficients due to their differing spatial profiles. As the laser pump/gain increases, the cavity mode with the least loss grows faster than the other modes. Once the gain reaches the threshold gain, further nonlinear evolution amongst all the modes will occur.

The design of the Coupled Multi-Oscillator Array Ising machines tries to map the power losses of the optimization machine to the magnetic energies of various states of the Ising problem. If the mapping is correct, the lowest power configuration will match the energetic ground state of the Ising problem. This is illustrated in Fig. 5. In effect, this is an example of a system evolving toward a state of minimum entropy generation, or minimum power dissipation, subject to the constraint of gain being present.

The archetypal solver in this class consists of a network of interconnected oscillators, driven by phase-dependent parametric gain. Parametric gain amplifies only the cosine quadrature and causes the electric field to lie along the ± Real axis in the complex plane. The phase of the electric field (0 or π) can be used to represent ± spin in the Ising problem. The resistive interconnections between the oscillators are designed to favor ferromagnetic or anti-ferromagnetic 'spin-spin' interactions by the Principle of Minimum Entropy Generation, subject to the constraint of parametric (phase-dependent) gain as the power input. The parametric gain favors oscillation along the real axis of the complex plane, where the positive real axis would correspond to spin up, and the negative real axis would correspond to spin down.

The Voltage (or Current) input constraint is very important to the Principle of Minimum Entropy Generation. If there were no power input constraint, all the currents and voltages would be zero, and the minimum power dissipated would be zero.
Figure 5: A lossy multi-oscillator system is provided with gain: The x-axis is a list of all the available modes in the system (possible solutions = physical modes), whereas the y-axis plots the loss coefficient of each mode. Gain is provided to the system and is gradually increased. As in single-mode lasers, the lowest-loss mode, illustrated by the blue dot, grows exponentially, saturating the gain. Above threshold, we can expect further nonlinear evolution among the modes so as to minimize power dissipation.

In the case of the Coupled Multi-Oscillator circuit, the power input is produced through a gain mechanism, or a gain module. The constraint could be the voltage input to the gain module. But if the gain were too small, it might not exceed the corresponding circuit losses, and the current and voltage would remain near zero. Thus, there is usually a threshold gain requirement when applying the Principle of Minimum Entropy Generation to the Coupled Multi-Oscillator circuit.

If the gain is insufficient, the circuit will achieve minimum entropy generation at negligible currents and voltages. The intrinsic losses in the network would dominate, and the circuit currents and voltages would be near zero. If the pump gain is then gradually ramped up, the oscillatory mode requiring the least threshold gain begins oscillating. Upon reaching the threshold gain, a non-trivial current distribution built on the bistability of the Coupled Multi-Oscillator circuit will emerge. As the gain exceeds the required threshold, there will be further nonlinear evolution among the modes so as to minimize power dissipation. The final-state 'spin' configuration, dissipating the lowest power (or entropy generation), emerges as the sought-for optimum.

Ideally, the gain evolution will be controlled by the Lagrange function to find the local minimum power dissipation configuration, as will be discussed in Section 5.5. With Minimum Entropy Generation, as with most optimization schemes, it is difficult to guarantee a global optimum.

In optimization, each constraint contributes a Lagrange multiplier. We will show that the gains of the oscillators are the Lagrange multipliers of the constrained system. In the next section, we provide a brief tutorial on Lagrange Multiplier optimization.
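The least-loss-mode argument can be checked in a toy linear model (our construction, with arbitrary couplings): for dynamics dc/dt = (γ − α)c − Jc, each eigenmode of the symmetric coupling matrix J grows at rate γ − α − λ_k, so the mode with the smallest eigenvalue has the lowest threshold gain and is selected first as the pump is ramped.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8

# Random symmetric +-1 couplings standing in for the resistive interconnections.
J = rng.choice([-1.0, 1.0], size=(n, n))
J = np.triu(J, 1)
J = J + J.T

alpha = 1.0   # uniform intrinsic loss (assumed)

# For dc/dt = (gamma - alpha) c - J c, eigenmode k grows at rate
# gamma - alpha - lam_k, so the least-loss mode has the smallest eigenvalue.
lam, V = np.linalg.eigh(J)
print("threshold gain of least-loss mode:", alpha + lam[0])
print("its sign pattern (candidate spins):", np.sign(V[:, 0]))
```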
4 The Method of Lagrange Multipliers

The method of Lagrange multipliers is a very well-known procedure for solving constrained optimization problems. In constrained optimization, the optimal point x* ≡ (x, y) in multi-dimensional solution space locally optimizes the merit function f(x) subject to the constraint g(x) = 0. The optimal point has the property that the slope of the merit function is zero as infinitesimal steps are taken away from x*, as taught in calculus. But these deviations are restricted to the constraint curve, as shown in Fig. 6. The iso-contours of the function f(x) increase until they are limited by, and just touch, the constraint curve g(x) = 0 at the point x*.

Figure 6: Maximization of function f(x, y) subject to the constraint g(x, y) = 0 (contour lines of f(x, y) and the constraint function g(x, y) = 0 are shown). At the constrained local optimum, the gradients of f and g, namely ∇f(x, y) and ∇g(x, y), are parallel.

At the point of touching, x*, the gradient of f and the gradient of g are parallel to each other. This can be stated formally as:

\[ \nabla f(\mathbf{x}^*) = \lambda^* \nabla g(\mathbf{x}^*). \]

The proportionality constant λ* is called the Lagrange multiplier corresponding to the constraint g(x) = 0.

Although Fig. 6 shows only a 2-dimensional optimization space, in general let us take the optimization space x to be n-dimensional. When we have multiple constraints g_1, ..., g_p, we correspondingly expand the n-dimensional Lagrange Multiplier requirement:

\[ \nabla f(\mathbf{x}^*) = \sum_{i=1}^{p} \lambda_i^* \nabla g_i(\mathbf{x}^*), \]

where the gradient vector ∇ represents n equations, accompanied by the p constraint equations g_i(x) = 0, resulting in n + p equations. These equations solve for the n components of the vector x* and the p unknown Lagrange Multipliers λ*_i. That is n + p equations for n + p unknowns.

Motivated by the above condition, let us introduce a Lagrange function L(x, λ) defined as follows:

\[ L(\mathbf{x}, \boldsymbol{\lambda}) = f(\mathbf{x}) + \sum_{i=1}^{p} \lambda_i g_i(\mathbf{x}), \]

which can be optimized by gradient descent or other methods to solve for x* and λ*. The full theory of Lagrange multipliers, and the popular 'Augmented Lagrangian Method of Multipliers' algorithm used to solve for locally optimal (x*, λ*), are discussed in great detail in [24] and [25]. A gist of the main points is presented in Appendices A, B, C.
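As a numerical sketch of this machinery (our illustration, with an arbitrarily chosen merit function), consider minimizing f(x, y) = x² + y² subject to g(x, y) = x + y − 1 = 0. Gradient descent of L in the variables, combined with gradient ascent in λ (a recipe revisited in Section 5.5), converges to the constrained optimum (0.5, 0.5):

```python
import numpy as np

# Minimize f(x) = x1^2 + x2^2 subject to g(x) = x1 + x2 - 1 = 0,
# via the Lagrange function L(x, lam) = f(x) + lam * g(x).
x = np.zeros(2)
lam = 0.0
kappa, kappa_p = 0.05, 0.05   # step sizes (assumed)

for _ in range(5000):
    grad_x = 2.0 * x + lam            # gradient of L in the variables
    x -= kappa * grad_x               # descent in x
    lam += kappa_p * (x.sum() - 1.0)  # ascent in lam: dL/dlam = g(x)

print(x, lam)   # -> approximately [0.5, 0.5] and lam = -1.0
```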
For the specific case of the Ising problem, the objective function is given by f(µ) = Σ_{i,j} J_ij µ_i · µ_j, where f(µ) is the magnetic Ising Energy and µ_i is the i-th magnetic moment vector. For the optimization method presented in this paper, we need a circuit or other physical system whose power dissipation is also f(x) = Σ_{i,j} J_ij x_i x_j, but now f(x) is power dissipation, not energy; x_i is a variable that represents voltage, current, or electric field; and the J_ij are not magnetic energies but rather resistive coupling elements. The correspondence is between magnetic spins quantized along the z-axis, µ_zi = ±1, and the circuit variable x_i = ±1.

For n spins, there are also p = n constraints, one for each of the spins. A sufficient constraint is g_i(x) = 1 − x_i². More complicated nonlinear constraints can be envisioned, but (1 − x_i²) could represent the first two terms in a more complicated constraint Taylor expansion.

Therefore, a sufficient Lagrange function for the Ising problem, with digital constraints, is given by:

\[ L(\mathbf{x}, \boldsymbol{\lambda}) = \sum_{i=1}^{n} \sum_{j=1}^{n} J_{ij} x_i x_j + \sum_{i=1}^{n} \lambda_i \left(1 - x_i^2\right), \]

where λ_i is the Lagrange Multiplier associated with the corresponding constraint. We shall see in the next section that most analog algorithms that have been proposed for the Ising problem in the literature actually tend to optimize some version of the above Lagrange function.

5 Physical Ising Solvers in the Literature

In this section, we discuss each physical procedure proposed in the literature and show how each physical scheme implements the method of Lagrange multipliers. They all obtain good performance on the Gset benchmark problem set [26], and many of them demonstrate that they perform better than the heuristic algorithm, Breakout Local Search [27]. The main result of our work is the realization that the pump gains used in all the physical methods are in fact Lagrange multipliers.

The available physical solvers in the literature are as follows:

1. Optical Parametric Oscillators
2. Coupled Radio Oscillators on the Real Axis
3. Coupled laser cavities using multicore fibers
4. Coupled Radio Oscillators on the Unit Circle
5. Coupled polariton condensates

Then there are a number of schemes that also rely upon a variant of minimum entropy production or power dissipation:

6. Iterative Analog Matrix Multipliers
7. Leleu Mathematical Ising Solver

In Appendix D, there is a scheme that appears unconnected with minimum entropy production rate:

8. Adiabatic coupled radio oscillators (Toshiba)

We shall see that methods 1, 2, 4 in the literature use only one gain for all the oscillators, which is equivalent to imposing only one constraint. The other methods, 3, 5, 6, use different gains for each spin and correctly capture the full n constraints of the Ising problem.

5.1 Optical Parametric Oscillators

An early optical machine for solving the Ising problem was presented by Yamamoto et al. [4] and [8]. Their system consists of several pulses of light circulating in an optical fiber loop, with the phase of each light pulse representing an Ising spin. In parametric oscillators, gain occurs at half the pump frequency. If the gain overcomes the intrinsic losses of the fiber, the optical pulse builds up. Parametric amplification provides phase-dependent gain. It restricts the oscillatory phase to the Real Axis of the complex plane. This leads to bistability along the positive or negative real axis, allowing the optical pulses to mimic the bistability of magnets.

In the Ising problem, there is magnetic coupling between spins. The corresponding coupling between optical pulses is achieved by controlled optical interactions between the pulses. In Yamamoto's approach, one pulse i is first plucked out by an optical gate, amplitude modulated by the proper connection weight specified in the J_ij Ising Hamiltonian, and then reinjected and superposed onto the other optical pulse j, producing constructive or destructive interference, representing ferromagnetic or anti-ferromagnetic coupling.

By providing saturation to the pulse amplitudes, the optical pulses will finally settle down, each to one of the two bistable states.
We will find that the pulse amplitude configuration evolves exactly according to the Principle of Minimum Entropy Generation when the magnetic dipole solutions in the Ising problem are constrained to ±1.

As an example, Yamamoto et al. [7] analyze their parametric oscillator system using coupled wave equations for the slowly varying amplitudes of the circulating optical modes. We now show that the coupled wave equation approach reduces to an extremum of their system's 'Entropy Generation' or 'power dissipation'. The coupled-wave equation for parametric gain of the slowly varying amplitude c_i of the in-phase cosine component of the i-th optical pulse (representing a magnetic spin in an Ising system) is as follows:

\[ \frac{dc_i}{dt} = (-\alpha_i + \gamma_i)\, c_i - \sum_j J_{ij} c_j, \tag{1} \]

where the weights J_ij are the ferromagnetic or anti-ferromagnetic magnetic cross-couplings, γ_i represents the phase-dependent parametric gain given by the pump to the i-th circulating pulse, and α_i is the corresponding dissipative loss. The amplitudes c_i represent the optical cosine wave electric field amplitudes. In the case of circuits, the voltage amplitudes can be expressed as V_i = c_i + j s_i, where c_i and s_i represent the cosine and sine quadrature components of voltage, and j is the unit imaginary. For clarity of discussion, we dropped the cubic terms in (1) that Yamamoto et al. originally had. A discussion of these terms is given in Appendix C.

Owing to the nature of parametric amplification, the quadrature sine wave components s_i of the electric field amplitude die out rapidly. The rate of entropy generation, or net power dissipation h, including the negative dissipation associated with gain, can be written:

\[ h(c_1, \ldots, c_n) = \sum_{i,j} J_{ij} c_i c_j + \sum_i \alpha_i c_i^2 - \sum_i \gamma_i c_i^2. \tag{2} \]

If we minimize the entropy generation h(c) without invoking any constraints, that is, with γ_i = 0, the amplitudes c_i simply go to zero, which generates the minimum entropy.

If the gain γ_i is large enough, some of the amplitudes might go to infinity. To avoid this, we may employ the n constraint functions g_i(c_i) = (1 − c_i²) = 0, which enforce a digital outcome c_i = ±1. (The constraint g_i(c_i) = (1 − c_i²) = 0 is quite general in the sense that (1 − c_i²) can represent the first two terms of the Taylor Series of an arbitrary constraint.) Adding the constraint functions, times the respective Lagrange Multipliers, to the entropy generation yields the Lagrange function:

\[ L(\mathbf{c}, \boldsymbol{\gamma}) = \sum_{i,j} J_{ij} c_i c_j + \sum_i \alpha_i c_i^2 - \sum_i \gamma_i \left(c_i^2 - 1\right). \tag{3} \]

Comparing the unconstrained (2) to the constrained (3), they differ only in the final (−1) term, which effectively constrains the amplitudes and prevents them from diverging to ∞. Expression (3) is the Lagrange function given at the end of Section 4. Surprisingly, the gains γ_i emerge to play the role of Lagrange Multipliers. This means that each mode, represented by the subscripts in c_i, must adjust to a particular gain γ_i which minimizes the overall entropy generation, and the respective gains γ_i represent the Lagrange Multipliers.
Minimization of the Lagrange function (3) provides the final steady state of the system dynamics. If the circuit or optical system is designed to dissipate power, or equivalently generate entropy, in a mathematical form that matches the magnetic energy in the Ising problem, then the dissipative system will seek out a corresponding local optimum configuration of the magnetic Ising energy.

Such a physical system, constrained to c_i = ±1, is digital in the same sense as a flip-flop circuit, but unlike the von Neumann computer, the inputs are resistor weights for power dissipation. Nonetheless, a physical system can evolve in a direct manner, without the need for shuttling information back and forth as in a von Neumann computer, providing faster answers. Without the communications overhead but with the higher operation speed, the energy dissipated to arrive at the final answer will be less, in spite of the circuit being required to generate entropy during its evolution toward the final state.

To achieve minimum entropy production, the amplitudes c_i and the Lagrange Multipliers γ_i must all be simultaneously optimized. While a circuit will evolve toward optimal amplitudes c_i, the gains γ_i must arise from a separate active circuit. Ideally, the active circuit which controls the Lagrange Multiplier gains γ_i would have its entropy production included with the main circuit. A more common method is to provide gain that follows a heuristic rule using an external feedback circuit. For example, Yamamoto et al. follow the heuristic rule γ_i = a + bt. It is not yet clear whether the heuristic-based approach toward gain evolution will be equally effective as simply lumping together all main circuit and feedback components and minimizing the total power dissipation.

We conclude this subsection by noting that the Lagrangian, Eq. (3), corresponds to Lagrange multiplier optimization using the following merit function and constraints:

\[ f(\mathbf{c}) = \sum_{i,j} J_{ij} c_i c_j + \sum_i \alpha_i c_i^2, \qquad g_i(c_i) = (1 - c_i^2) = 0. \]
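A minimal simulation sketch of these dynamics (ours, not Yamamoto's implementation) integrates Eq (1) with the heuristic gain ramp γ = a + bt; the cubic saturation term that was dropped from Eq (1) for clarity is restored here so that the amplitudes settle to finite values, whose signs are read out as spins:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10

# Random symmetric +-1 Ising couplings, zero diagonal (illustrative instance).
J = rng.choice([-1.0, 1.0], size=(n, n))
J = np.triu(J, 1)
J = J + J.T

alpha = 1.0
c = 1e-3 * rng.standard_normal(n)   # small noise seed
dt = 1e-3

for step in range(200_000):
    gamma = 0.5 + 2e-5 * step       # heuristic gain ramp, gamma = a + b*t
    # Eq (1), plus a cubic saturation term to keep amplitudes bounded.
    dc = (-alpha + gamma) * c - J @ c - c**3
    c += dt * dc

spins = np.sign(c)
print(spins, "Ising energy:", spins @ J @ spins)
```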
To summarize:

1. Physical systems minimize the entropy production rate, or power dissipation, subject to input constraints of voltage, amplitude, gain, etc.
2. These systems actually perform Lagrange Multiplier optimization.
3. Indeed, it is the gain γ_i in each oscillator i that plays the role of the corresponding Lagrange Multiplier.
4. If the Lagrange function is split in such a way that only the Ising function, Σ_{i,j} J_ij c_i c_j, is treated as the merit function f(c), then the Lagrange multipliers corresponding to the constraints g_i are the net gains, γ_i − α_i.
5. Under the digital constraint, amplitudes c_i = ±1, entropy generation minimization schemes are actually binary, similar to a flip-flop.
5.2 Coupled Radio Oscillators on the Real Axis

A coupled LC oscillator system with parametric amplification was analyzed in the circuit simulator SPICE by Xiao et al. [12]. This is analogous to the optical Yamamoto system, but this system consists of a network of radio frequency LC oscillators coupled to one another through resistive connections. The LC oscillators contain linear inductors but nonlinear capacitors, which provide the parametric gain. The parallel or cross-connect resistive connections between the oscillators are designed to implement the ferromagnetic or anti-ferromagnetic couplings J_ij between magnetic dipole moments µ_i, as shown in Fig. 7. The corresponding phase of the voltage amplitude V_i, 0 or π, determines the sign of the magnetic dipole moment µ_i.

The nonlinear capacitors are pumped by voltage V(2ω) at frequency 2ω, where the LC oscillator natural frequency is ω. Second harmonic pumping leads to parametric amplification in the oscillators. As in the optical case, parametric amplification plays the dual role of generating/sustaining voltage oscillations as well as imposing phase bistability on the ac voltages in the oscillators. The bistability refers to gain along the ± Real-Axis defined by time synchronization with the 2ω-pump.

Figure 7: Coupled LC oscillator circuit for two coupled magnets (spin 1 and spin 2). The oscillation of the LC oscillators represents the magnetic moments, while the parallel or antiparallel cross-connections represent ferromagnetic (J = +1) or anti-ferromagnetic (J = −1) coupling, respectively. The nonlinear capacitors are pumped by V(2ω) at frequency 2ω, providing parametric gain at ω.

The 2ω-pump induces gain γ_i in the Real-Axis quadrature. As in the case of optical parametric amplifier machines, ideally an active circuit would control the Lagrange Multiplier gains γ_i, and the gain control circuit would have its entropy production included with the main circuit. A more common approach is to provide gain that follows a heuristic rule.

As in the optical parametric amplifier case, a mechanism is needed to prevent the parametric gain from producing infinite amplitude signals. Zener diodes can be inserted into the circuit to restrict the amplitudes to finite saturation values or to digitally defined values. With the diodes in place, the circuit settles into a voltage phase configuration, 0 or π, that minimizes net power dissipation for a given pump gain.

The amplitudes can be defined by the voltages across the LC oscillator capacitors and derived from Kirchhoff's voltage and current equations as in Xiao [12]. Performing the slowly-varying amplitude approximation on the cosine component of these voltages, c_i, Xiao obtains the following equation of motion:

\[ \frac{dc_i}{dt} = \frac{1}{4 R_c C} \sum_j J_{ij} c_j - \alpha_i c_i + \gamma_i c_i, \tag{4} \]

where the c_i are the peak voltage amplitudes in units of a reference voltage, R_c is the resistance of the coupling resistors, the cross-couplings J_ij are assigned binary values J_ij = ±1, C is the linear part of the capacitance in each oscillator, n is the number of oscillators, ω is the natural frequency of all the oscillators, and the parametric gain constant is γ = ω|∆C|/(4C), where |∆C| is the capacitance modulation at the second harmonic.
In the decay constant α = (n − 1)/(4 R_c C), there are (n − 1) resistors R_c in parallel, since it is assumed that each oscillator can give up energy to all the other (n − 1) oscillators, with the coupling resistors acting in parallel. In this simplified model, all decay constants α are taken as equal, and moreover each oscillator experiences exactly the same parametric gain γ; conditions that can be relaxed if needed.

The number 4 is present in the first denominator since, for two coupled LC circuits as shown in Fig. 7, the decay is controlled by the RC time of the capacitors in parallel and the R_c resistors in series. Likewise, the number 4 appears for loss and parametric gain, but for different reasons in each case.

We note that equation (4) above performs gradient descent on the net power dissipation function:

\[ h(\mathbf{c}, \gamma) = -\frac{1}{4 R_c} \sum_{i,j} J_{ij} c_i c_j + \sum_i \alpha C c_i^2 - \sum_i \gamma C c_i^2, \tag{5} \]

which is very similar to Section 5.1. On the right-hand side, the reason for the 4 in the first denominator is to compensate for double counting in i ≠ j. The first two terms on the right-hand side together represent the dissipative losses in the coupling resistors, while the third term is the negative of the gain provided to the system of oscillators.

Next, we obtain the following Lagrange function through the same replacement of c_i² by (c_i² − 1) that we performed in Section 5.1:

\[ L(\mathbf{c}, \gamma) = -\frac{1}{4 R_c} \sum_{i,j} J_{ij} c_i c_j + \sum_i \alpha C c_i^2 - \sum_i \gamma C \left(c_i^2 - 1\right). \tag{6} \]

The above Lagrangian corresponds to Lagrange multiplier optimization using the following merit function and constraint:

\[ f(\mathbf{c}) = -\frac{1}{4 R_c} \sum_{i,j} J_{ij} c_i c_j + \sum_i \alpha C c_i^2, \qquad g(\mathbf{c}) = \sum_i \left(1 - c_i^2\right) = 0. \]

Again, we see that the gain coefficient γ is the Lagrange multiplier of the constraint g = 0.

Although the extremum of Eq (6) represents the final evolved state of a physical system representing an optimization outcome, it would be interesting to examine the time evolution toward the optimal state. The optimization occurs by iterative steps, where each iteration can be regarded to take place in a time interval ∆t. At each successive iteration, the voltage amplitude c_i takes a step whose magnitude is proportional to the gradient of the Lagrange function:

\[ c_i(t + \Delta t) = c_i(t) - \kappa\, \Delta t\, \frac{\partial}{\partial c_i} L(\mathbf{c}, \gamma), \tag{7} \]

where the minus sign on the right-hand side drives the system toward minimum power dissipation. As the Lagrange function comes closer to its minimum, the gradient ∂L/∂c_i diminishes and the amplitude steps become smaller and smaller. The adjustable proportionality constant κ controls the size of each iterative step; it also calibrates the dimensional units between power dissipation and voltage amplitude. (Since c_i is voltage amplitude, κ has units of reciprocal capacitance.) Converting Eq (7) to continuous time,

\[ \frac{dc_i}{dt} = -\kappa\, \frac{\partial}{\partial c_i} L(\mathbf{c}, \gamma), \tag{8} \]

\[ \frac{dc_i}{dt} = -\kappa \left[ \frac{\partial}{\partial c_i} f(\mathbf{c}) + \sum_j \gamma_j \frac{\partial}{\partial c_i} g_j(c_j) \right], \tag{9} \]

where the γ_j play the role of Lagrange multipliers, and the g_j = 0 are the constraints. Taking L(c, γ) from Eq (6), the gradient in the voltage amplitudes becomes

\[ \frac{\partial}{\partial c_i} L(\mathbf{c}, \gamma) = -\frac{1}{2 R_c} \sum_j J_{ij} c_j + 2\alpha C c_i - 2\gamma C c_i. \]

Substituting into Eq (8), the time derivative of the voltage amplitudes becomes

\[ \frac{dc_i}{dt} = \frac{\kappa}{2 R_c} \sum_j J_{ij} c_j - 2\kappa\alpha C c_i + 2\kappa\gamma C c_i. \tag{10} \]

The constant κ can be absorbed into the units of time to reproduce the dynamical equation (4), the slowly varying amplitude approximation for the coupled radio oscillators. Thus, it is interesting that the slowly-varying time dynamics can be reproduced from iterative optimization steps on the Lagrange function.
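The claimed equivalence is easy to verify numerically (our sketch, with arbitrary parameter values): with κ = 1/(2C), the right-hand side of Eq (4) coincides with −κ times the gradient of the Lagrange function (6).

```python
import numpy as np

rng = np.random.default_rng(2)
n, Rc, C = 6, 1.0, 1.0                 # illustrative circuit values
alpha = (n - 1) / (4 * Rc * C)
gamma = 1.5

J = rng.choice([-1.0, 1.0], size=(n, n))
J = np.triu(J, 1)
J = J + J.T

def rhs_eq4(c):
    # Slowly-varying amplitude dynamics of Eq (4).
    return (J @ c) / (4 * Rc * C) - alpha * c + gamma * c

def grad_L(c):
    # Gradient of the Lagrange function of Eq (6).
    return -(J @ c) / (2 * Rc) + 2 * alpha * C * c - 2 * gamma * C * c

c = rng.standard_normal(n)
kappa = 1.0 / (2 * C)
# The two update rules coincide when kappa = 1/(2C).
print(np.allclose(rhs_eq4(c), -kappa * grad_L(c)))   # True
```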
To summarize:

1. As in other cases, the coupled LC oscillator system [12] minimizes the entropy production rate, or power dissipation, incorporating the power input from the pump gain.
2. The coupled LC oscillator system actually performs Lagrange Multiplier optimization whose Merit Function is the Lagrange Function.
3. The gain γ_i in each oscillator i plays the role of the corresponding Lagrange Multiplier.
4. Under the amplitude constraint on the voltage cosine component, c_i = ±1, the iterative optimization steps in c_i, surprisingly, yield the same time-dependent Slowly Varying amplitude differential equation (4).

5.3 Coupled Laser Cavities Using Multicore Fibers

The Ising solver designed by Babaeian et al. [13] makes use of coupled laser modes in a multicore optical fiber. Polarized light in each core of the optical fiber corresponds to each magnetic moment in the Ising problem. This means that the number of cores needs to be equal to the number of magnets in the given Ising instance. The right-hand circular polarization and left-hand circular polarization of the laser light in each core represent the two polarities (up and down) of the corresponding magnet. The mutual coherence of the various cores is maintained by injecting seed light from a master laser.

The coupling between the fiber cores is achieved through amplitude mixing of the laser modes by Spatial Light Modulators at one end of the multicore fiber [13]. These Spatial Light Modulators couple light amplitude from the i-th core to the j-th core according to the prescribed connection weight J_ij.

As in prior physical examples, the electric field amplitudes can be expressed in the slowly-varying polarization modes of the i-th core, E_iL and E_iR, where the two electric field amplitudes are temporally in phase and positive real, but have different polarization. They are:

\[ \frac{d}{dt} E_{iL} = -\alpha_i E_{iL} + \gamma_i E_{iL} + \frac{1}{2} \sum_j J_{ij} \left(E_{jR} - E_{jL}\right), \]
\[ \frac{d}{dt} E_{iR} = -\alpha_i E_{iR} + \gamma_i E_{iR} - \frac{1}{2} \sum_j J_{ij} \left(E_{jR} - E_{jL}\right), \]

where α_i is the decay rate in the optical core in 1/sec units, and γ_i is the gain supplied to the i-th core. The first term on the right in both equations represents optical fiber losses, while the second term represents the gain provided. The third term represents the coupling between the j-th and i-th cores that is provided by the Spatial Light Modulators.

We next define the degree of polarization as µ_i ≡ E_iL − E_iR. (This differs from the usual definition of degree of polarization of an optical beam, which contains intensity rather than electric field amplitude.) Subtracting the two equations above, we obtain the following evolution equation for µ_i:

\[ \frac{d\mu_i}{dt} = -\alpha_i \mu_i + \gamma_i \mu_i - \sum_j J_{ij} \mu_j. \]

The power dissipation, or entropy generation, is proportional to the two orthogonal components of the electric field, each squared: |E_iL|² + |E_iR|². But this is proportional to |E_iL − E_iR|² + |E_iL + E_iR|² = |µ_i|² + |E_iL + E_iR|². Since |E_iL + E_iR|² can be regarded as relatively constant as energy switches back and forth between right and left circular polarization, the changes in power dissipation, or entropy generation, h(µ), are most influenced by the quadratic terms in µ:

\[ h(\boldsymbol{\mu}, \boldsymbol{\gamma}) = \sum_i \alpha_i \mu_i^2 + \sum_{i,j} J_{ij} \mu_i \mu_j - \sum_i \gamma_i \mu_i^2. \]

As before, we add the n digital constraints g_i(µ_i) = 1 − µ_i² = 0, where µ_i = ±1, to obtain the Lagrange function:

\[ L(\boldsymbol{\mu}, \boldsymbol{\gamma}) = \sum_i \alpha_i \mu_i^2 + \sum_{i,j} J_{ij} \mu_i \mu_j - \sum_i \gamma_i \left(\mu_i^2 - 1\right). \]

Once again, the gains γ_i play the role of Lagrange multipliers. Thus, a minimization of the power dissipation, subject to the optical gain γ_i, solves the Ising problem defined by the same J_ij couplings. The power dissipation function and constraint functions in the Lagrange function above are:

\[ f(\boldsymbol{\mu}) = \sum_i \alpha_i \mu_i^2 + \sum_{i,j} J_{ij} \mu_i \mu_j, \qquad g_i(\mu_i) = (1 - \mu_i^2) = 0. \]
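A quick numerical consistency check (ours, with arbitrary parameters): integrating the two polarization equations and, in parallel, the subtracted equation for µ_i = E_iL − E_iR shows that the two trajectories agree, since µ_i obeys a closed linear equation of its own.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5
J = rng.choice([-1.0, 1.0], size=(n, n))
J = np.triu(J, 1)
J = J + J.T

alpha = np.full(n, 1.0)   # per-core losses (assumed values)
gamma = np.full(n, 1.2)   # per-core gains (assumed values)

EL, ER = rng.random(n), rng.random(n)
mu = EL - ER
dt = 1e-3

for _ in range(2000):
    mix = J @ (ER - EL)
    dEL = (-alpha + gamma) * EL + 0.5 * mix
    dER = (-alpha + gamma) * ER - 0.5 * mix
    dmu = (-alpha + gamma) * mu - J @ mu   # closed equation for mu
    EL += dt * dEL
    ER += dt * dER
    mu += dt * dmu

# The directly integrated mu matches EL - ER at all times.
print(np.allclose(mu, EL - ER))   # True
```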
To summarize:

1. The coupled multicore fiber system [13] minimizes the entropy production rate, or power dissipation, which includes the power input from the pump gain, as in Sections 5.1 and 5.2.
2. The coupled multicore fiber system actually performs Lagrange Multiplier optimization.
3. The gain γ_i in each fiber i plays the role of the corresponding Lagrange Multiplier.
4. Under the constraint of complete light polarization, µ_i = ±1, the entropy generation minimization procedure is actually binary, similar to a flip-flop.

5.4 Coupled Electrical Oscillators on the Unit Circle
In this section, we consider a network of interconnected, nonlinear, but amplitude-stable electrical oscillators designed by Roychowdhury et al. [14] to represent a conventional Ising system for which we seek a digital solution, with each magnetic dipole µ_iz = ±1 and the oscillator phase φ_i = 0 or π revealing the preferred magnetic dipole orientation. It is noteworthy that Roychowdhury goes beyond Ising machines and constructs general digital logic gates using these amplitude-stable oscillators in [28].

In the construction of their oscillators, Roychowdhury et al. [14] use nonlinear elements that behave like negative resistors at low voltage amplitudes but with saturating resistance at high voltage amplitudes. This produces amplitude-stable oscillators. In addition, Roychowdhury et al. [14] provide a second harmonic pump and use a form of parametric amplification (referred to as sub-harmonic injection locking in [14]) to obtain bistability with respect to phase.

The dynamics of the amplitude saturation is purposely very fast, and the oscillators are essentially clamped to an amplitude limit dictated by the nonlinear resistor in each oscillator. The phase dynamics induced by the 2nd harmonic pump are slower. It is the readout of these phase shifts, 0 or π, that provides the magnetic dipole orientation µ_iz = ±1. One key difference between this system and Yamamoto's Optical Parametric Oscillator (OPO) system is that the OPO system had fast phase dynamics and slow amplitude dynamics, while the injection locking system has the reverse.
Roychowdhury et al. [14] derived the dynamics of their amplitude-stable oscillator network using perturbation concepts developed in [29]. While a circuit diagram is not shown, [14] invokes the following dynamical equation for the phases of their electrical oscillators:

\[ \frac{d\phi_i}{dt} = -\frac{1}{R_c} \sum_j J_{ij} \sin\left(\phi_i(t) - \phi_j(t)\right) - \lambda_i \sin\left(2\phi_i(t)\right), \tag{11} \]

where R_c is a coupling resistance in their system, φ_i is the phase of the i-th electrical oscillator, and the λ_i are decay parameters that dictate how fast the phase angles settle toward their steady-state values.

We shall now show that Eq (11) can be reproduced by iteratively minimizing the power dissipation in their system. Power dissipation across a resistor is (V_1 − V_2)²/R_c, where (V_1 − V_2) is the voltage difference across resistor R_c. Since V_1 and V_2 are sinusoidal, the power dissipation consists of constant terms and a cross-term of the form:

\[ f(\phi_1, \phi_2) = \frac{|V|^2 \cos(\phi_1 - \phi_2)}{R_c}, \]

where f(φ_1, φ_2) is the phase-dependent part of the power dissipated in the resistors. Magnetic dipole orientation, parallel or anti-parallel, is represented by whether the electrical oscillator phase difference is φ_1 − φ_2 = 0 or φ_1 − φ_2 = π, respectively. We are permitted to choose an origin for angle space at φ = 0, which implies φ_i = 0 or φ_i = π. This can be implemented through a constraint of the form:

\[ g_i(\phi_i) = (\cos(2\phi_i) - 1) = 0. \]

Combining the power dissipated in the resistors with the constraint functions g_i(φ_i) = 0, we obtain a Lagrange function:

\[ L(\boldsymbol{\phi}, \boldsymbol{\lambda}) = \frac{1}{R_c} \sum_{i,j} J_{ij} \cos(\phi_i - \phi_j) + \sum_i \lambda_i \left(\cos(2\phi_i) - 1\right), \]

where λ_i is the Lagrange Multiplier corresponding to the phase angle constraint, and J_ij is a digital multiplier ±1 on the coupling conductance 1/R_c.

The Lagrange function above is isomorphic with the general form in Section 4. The effective merit function f and constraints g_i in this correspondence are:

\[ f(\boldsymbol{\phi}) = \frac{1}{R_c} \sum_{i,j} J_{ij} \cos(\phi_i - \phi_j), \qquad g_i(\phi_i) = (\cos(2\phi_i) - 1) = 0. \]
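A short sketch of the phase dynamics (ours, with arbitrary couplings and a fixed injection-locking strength) integrates Eq (11) directly; the second-harmonic locking term drives each phase toward 0 or π, from which spins are read out.

```python
import numpy as np

rng = np.random.default_rng(4)
n, Rc = 8, 1.0
J = rng.choice([-1.0, 1.0], size=(n, n))
J = np.triu(J, 1)
J = J + J.T

lam = np.full(n, 2.0)              # injection-locking (pump) strength, assumed
phi = 2 * np.pi * rng.random(n)    # random initial phases
dt = 1e-2

for _ in range(20_000):
    # Eq (11): resistive coupling term plus second-harmonic locking term.
    S = np.sin(phi[:, None] - phi[None, :])
    dphi = -(J * S).sum(axis=1) / Rc - lam * np.sin(2.0 * phi)
    phi += dt * dphi

# A small residual of sin(2*phi) indicates phases settled near 0 or pi.
print("binarization residual:", np.abs(np.sin(2 * phi)).max())
print("spins:", np.where(np.cos(phi) > 0.0, 1, -1))
```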
To summarize:

1. As in other cases, the amplitude-stable oscillator system [14] minimizes the entropy production rate, or power dissipation, subject to constraints of amplitude and phase reference.
2. The amplitude-stable oscillator system actually performs Lagrange Multiplier optimization.
3. The phase decay time constant λ_i in each oscillator i plays the role of the corresponding Lagrange Multiplier.
4. Under the phase reference constraint, φ_i = 0 or π, the entropy generation minimization scheme is actually binary, similar to a flip-flop.

5.5 Coupled Polariton Condensates

Kalinin and Berloff [15] proposed a system consisting of coupled polariton condensates to minimize the XY Hamiltonian. The XY Hamiltonian is a two-dimensional version of the Ising Hamiltonian and is given by:

\[ H(\boldsymbol{\mu}) = \sum_{i,j} J_{ij}\, \boldsymbol{\mu}_i \cdot \boldsymbol{\mu}_j, \]

where µ_i represents the magnetic moment vector of the i-th spin, restricted to the spin-space XY plane.

To represent the spin system, Kalinin and Berloff pump a grid of coupled semiconductor microcavities with laser beams and observe the formation of strongly coupled exciton-photon states called polaritons. For our purposes, the polaritonic nomenclature is irrelevant. For us, these are simply coupled electromagnetic cavities similar to the optical and LC resonators that we have already discussed. The electromagnetic system operates by the principle of minimum entropy generation, similar to the previous cases. The complex electromagnetic amplitude in the i-th microcavity can be written E_i = c_i + j s_i, where c_i and s_i represent the cosine and sine quadrature components of E_i, and j is the unit imaginary. In this case we identify Re(E_i) = c_i as representing the X-component of the magnetic dipole vector, and Im(E_i) = s_i as representing the Y-component of the magnetic dipole vector. The electromagnetic microcavity system settles into a state of minimum entropy generation as the laser pump and optical gain are ramped up to compensate for the intrinsic cavity losses. The phase angles in the complex plane of the final electromagnetic modes are then reported as the corresponding µ-magnetic moment angles in the XY plane.

An important point to note here is that the electromagnetic cavities experience normal phase-independent gain, and not parametric gain, which happens to be phase-dependent. As a consequence, this system does NOT seek phase bistability as appropriate for binary up/down spin orientations. This is because we are searching for the magnetic dipole vector angles in the XY plane which would minimize the corresponding XY magnetic energy.

Ref. [15] uses 'Ginzburg-Landau' equations to analyze their system, resulting in equations for the complex amplitudes Ψ_i of the polariton wavefunctions. But the Ψ_i are actually the complex electric field amplitudes E_i of the corresponding i-th cavity. The electric field amplitudes satisfy the following slowly-varying-amplitude equation:

\[ \frac{dE_i}{dt} = \left(\gamma_i - \beta |E_i|^2\right) E_i - iU |E_i|^2 E_i - \sum_j J_{ij} E_j, \tag{12} \]

where γ_i represents the optical gain, β represents nonlinear attenuation, U represents nonlinear phase shift, and J_ij is a dissipative cross-coupling term representing linear loss. We note from the above that both the amplitudes and phases of the electromagnetic modes are coupled to each other and evolve on comparable timescales. This is in contrast to ref. [14], where the main dynamics were embedded in phase (amplitude was fast and almost fixed), or conversely [12], embedded in amplitude (phase was fast and almost fixed).

We shall now show that the method of ref. [15] is essentially the method of Lagrange multipliers with an added 'rotation'. The entropy generation or power dissipation rate is:

\[ h(\mathbf{E}) = -\frac{1}{2}\frac{d}{dt} \sum_i E_i^* E_i = \frac{1}{2} \sum_{i,j} J_{ij} \left(E_i^* E_j + E_i E_j^*\right) + \sum_i \beta |E_i|^4 - \sum_i \gamma_i |E_i|^2. \]

If we add a saturation constraint, g_i(E_i) = (1 − |E_i|²) = 0, then by analogy to the previous sections, γ_i is reinterpreted as a Lagrange Multiplier:

\[ L(\mathbf{E}, \boldsymbol{\gamma}) = \frac{1}{2} \sum_{i,j} J_{ij} \left(E_i^* E_j + E_i E_j^*\right) + \sum_i \beta |E_i|^4 - \sum_i \gamma_i \left(|E_i|^2 - 1\right), \tag{13} \]

where L is the Lagrange function that represents power dissipation combined with the amplitude constraint |E_i|² = 1. Thus, the scheme of coupled polaritonic resonators operates to find the state of minimum entropy generation in steady state, similar to the parametric oscillator case of Section 5.1. That is, the steady-state solution of the dynamics (12) renders (13) stationary with respect to changes in E. The difference is that the coupled polaritonic system solves the XY Ising problem for a magnetic moment restricted to the magnetic XY plane, while the parametric oscillator system, in Section 5.1, solves the ±Z Ising problem. For any particular mathematical optimization that we would perform, we still retain the burden of specifying the dissipative system whose entropy generation matches the optimization that we are seeking.

The imaginary 'rotation' term, iU, could potentially be of use in developing more sophisticated algorithms than the method of Lagrange multipliers, and we discuss this prospect in some detail in Section 6.2, where a system with a similar, but more general, 'rotation' term is discussed.

Next, we discuss how Kalinin and Berloff [15] adjusted their Lagrange multipliers γ_i during the course of their optimization. In the method of Lagrange multipliers, the merit function, Eq (13), is used to optimize not only the electric field amplitudes E_i but also the Lagrange Multipliers γ_i. The authors in Sections 5.1 and 5.2 used simple heuristics to adjust their gains/decay constants, which we have proven to be the Lagrange multipliers. Kalinin and Berloff of this section employ the Lagrange function optimization itself to adjust the gains/losses, as in the complete Lagrange method discussed next.

To iteratively adjust the Lagrange multipliers, we shall briefly shift back to the tutorial notation of Section 4. Afterward, we shall translate it back to the notation of this section and show that Kalinin and Berloff indeed use this same iterative procedure to adjust their Lagrange multipliers.

The Lagrange function is given by L(x, λ) = f(x) + Σ_{i=1}^{p} λ_i g_i(x), where the x_i are the field variables and the λ_i are the Lagrange multipliers. Then, the procedure to find the optimal x* and λ* is to perform gradient descent of L in x and gradient ascent of L in λ. The reason for ascent in λ, rather than descent, is to more strictly penalize deviations from the constraint.
In the language of iterations, this leads to the expressions:

\[ x_i(t + \Delta t) = x_i(t) - \kappa\, \Delta t\, \frac{\partial}{\partial x_i} L(\mathbf{x}, \boldsymbol{\lambda}), \]
\[ \lambda_i(t + \Delta t) = \lambda_i(t) + \kappa'\, \Delta t\, \frac{\partial}{\partial \lambda_i} L(\mathbf{x}, \boldsymbol{\lambda}), \]

where κ and κ′ are suitably chosen step sizes.

With our identification that the Lagrange multipliers λ are actually the same as the gains γ, we plug the expression for the full Lagrange function (13) into the second iterative equation, projecting out the constraint function g_i(E_i) = (1 − |E_i|²) = 0, and take the limit ∆t → 0. We obtain the following dynamical equation for the gains γ_i:

\[ \frac{d\gamma_i}{dt} = \kappa' \left(1 - |E_i|^2\right). \tag{14} \]

The above equation for the iterative evolution of the Lagrange multipliers is indeed the very same evolution that Kalinin and Berloff employ in their coupled polariton system.

To Eq (14), we must add the iterative evolution of the field variables x_i:

\[ \frac{dx_i}{dt} = -\kappa\, \frac{\partial}{\partial x_i} L(\mathbf{x}, \boldsymbol{\lambda}). \tag{15} \]

Equations (14) and (15) represent the full iterative evolution, but in some of the earlier subsections, γ_i(t) was sometimes assigned a heuristic time dependence.

We conclude this subsection by splitting the Lagrange function into the effective merit function f and the constraint functions g_i. The extra 'phase rotation' U is not captured by this interpretation.

\[ f(E_1, \ldots, E_n) = \frac{1}{2} \sum_{i,j} J_{ij} \left(E_i^* E_j + E_i E_j^*\right) + \sum_i \beta |E_i|^4, \qquad g_i(E_i) = (1 - |E_i|^2) = 0. \]
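The complete procedure (descent in the fields, ascent in the gains) can be sketched numerically as follows (our toy version of Eqs (12), (14), and (15), with the iU rotation term omitted and arbitrary parameters); the integral-feedback gain update drives every |E_i| toward 1, though convergence is not guaranteed in general:

```python
import numpy as np

rng = np.random.default_rng(5)
n, beta, kappa_p = 6, 1.0, 0.5
J = rng.choice([-1.0, 1.0], size=(n, n))
J = np.triu(J, 1)
J = J + J.T

E = 0.1 * (rng.standard_normal(n) + 1j * rng.standard_normal(n))
gamma = np.zeros(n)
dt = 1e-3

for _ in range(100_000):
    # Field dynamics, Eq (12), with the iU rotation term omitted.
    dE = (gamma - beta * np.abs(E) ** 2) * E - J @ E
    # Gain (Lagrange multiplier) dynamics, Eq (14): ascent on the constraint.
    dgamma = kappa_p * (1.0 - np.abs(E) ** 2)
    E += dt * dE
    gamma += dt * dgamma

print(np.abs(E))     # -> close to 1: the constraints are enforced
print(np.angle(E))   # XY spin angles reported by the machine
```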
To summarize:

1. As in other cases, the coupled-cavity system [15] minimizes the steady-state entropy production rate, or power dissipation, which includes the power input from the pump gain.
2. The gain γ_i in each oscillator i plays the role of the corresponding Lagrange Multiplier.
3. There is a phase rotation term iU in the dynamics that is not captured by the Lagrange function framework.

6 Physics-Based Solvers Less Obviously Connected to Lagrange Multipliers

In this section, we look at other methods in the literature that do not explicitly implement the method of Lagrange Multipliers but nevertheless end up with dynamics that resemble it to varying extents. All these methods offer operation regimes where the dynamics are not analogous to Lagrange multiplier optimization, and we believe it is an interesting avenue of future work to study the capabilities of these regimes.
Soljacic et al. [16] developed an iterative procedure consisting of repeated matrixmultiplication to solve the Ising problem. Their algorithm was implementedon a photonic circuit that utilized on-chip optical matrix multiplication unitscomposed of Mach-Zehnder interferometers that were first introduced for matrixalgebra by Zeilinger et al. in [30]. Soljacic et al. showed that their algorithmperformed optimization on an effective merit function, such as total magneticIsing energy.Let us use our insights from the previous sections and see how one wouldimplement iterative optimization using an optical matrix multiplier. Let themultiple magnetic moment configuration of the Ising problem be represented asa vector of electric field amplitudes, E i , of the spatially-separated optical modes.Each mode field amplitude represents the value of each magnetic moment. Ineach iteration, the optical modes are fed into the optical circuit which performsmatrix multiplication, and the resulting output optical modes are then fed backto the optical circuit input for the next iteration. Optical gain or some othertype of gain sustains the successive iterations.In this section we design an iterative optimization scheme for the Ising prob-lem that involves only matrix multiplications in each iterative step and whosepower dissipation function matches the magnetic Ising energy. A simple blockdiagram of such a scheme is shown in Fig. 8.We wish to design the matrix multiplication unit such that it has the fol-lowing power dissipation function: h ( E ) = − (cid:88) i γ i | E i | + 12 (cid:88) i,j J ij (cid:0) E ∗ i E j + E i E ∗ j (cid:1) . The Lagrange function, including a binary constraint, | E i | = 1, is given by: L ( E , γ ) = − (cid:88) i γ i (cid:0) | E i | − (cid:1) + 12 (cid:88) i,j J ij (cid:0) E ∗ i E j + E i E ∗ j (cid:1) , (16)25 atrix multiplication unit composed of 2X2 optical splitters and gainOutput of current iteration fed back as input for the next iteration . . . . . . . . Figure 8:
Figure 8: An optical circuit performing iterative multiplications converges on a solution of the Ising problem. Optical pulses are fed as input from the left-hand side at the beginning of each iteration, pass through the matrix multiplication unit (composed of 2×2 optical splitters and gain), and are passed back from the outputs to the inputs for the next iteration. Distributed optical gain sustains the iterations.

We wish to design the matrix multiplication unit such that it has the following power dissipation function:

  h(E) = −Σ_i γ_i |E_i|² + (1/2) Σ_{i,j} J_ij (E_i* E_j + E_i E_j*).

The Lagrange function, including the binary constraint |E_i|² = 1, is given by:

  L(E, γ) = −Σ_i γ_i (|E_i|² − 1) + (1/2) Σ_{i,j} J_ij (E_i* E_j + E_i E_j*),   (16)

where J_ij represents the dissipative loss associated with electric field interference between optical modes in the Mach-Zehnder interferometers. The dissipation is compensated by optical gain γ_i in the system.

The iterative multiplicative procedure that evolves the electric fields toward the minimum of the Lagrange function Eq (16) is:

  E_i(t+1) − E_i(t) = −κΔt (∂/∂E_i) [ Σ_i γ_i (1 − |E_i(t)|²) + (1/2) Σ_{i,j} J_ij (E_i*(t) E_j(t) + E_i(t) E_j*(t)) ],

where we move from iteration t to iteration t+1 by taking steps in E_i proportional to the gradient ∂/∂E_i of the Lagrange function. (∂/∂E_i represents separate differentiation with respect to the two quadrature components.) With each iteration, we feed the output of the Mach-Zehnder interferometer matrix-multiplier array back to the input, compensating for the losses with gain. Calculating the gradient, we obtain:

  E_i(t+1) − E_i(t) = 2κΔt [ γ_i E_i(t) − Σ_j J_ij E_j(t) ],

where κ is a constant step size whose dimensionality should ensure consistency of units. Sending all the terms involving time step t to the right-hand side, we finally get:

  E_i(t+1) = Σ_j [ (1 + 2κΔt γ_i) δ_ij − 2κΔt J_ij ] E_j(t),   (17)

where δ_ij is the Kronecker delta, which is 1 only if i = j. The Mach-Zehnder interferometers should be tuned to the matrix [(1 + 2κΔt γ_i) δ_ij − 2κΔt J_ij]. Thus, we have an iterative matrix multiplier scheme that minimizes the Lagrange function and performs Lagrange multiplier optimization of the Ising problem. In effect, a lump of dissipative optical circuitry, compensated by optical gain, will, in a series of iterations, settle into a solution of the Ising problem.

The simple system above differs from that of Soljacic et al. [16] quite significantly, in that their method adds noise and nonlinear thresholding after each iteration. It is possible that their modifications lead to performance improvements, or the more standard approach described here may work equally well. A detailed description of the Soljacic approach is presented in Appendix E.
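To make Eq (17) concrete, here is a minimal sketch of the iterative multiplier with the gains γ_i held fixed; under that simplification the loop is ordinary power iteration on the matrix of Eq (17), with normalization standing in for gain saturation. In the full scheme the γ_i would also evolve to enforce |E_i| = 1. The problem instance and parameter values are illustrative assumptions.

```python
import numpy as np

# Sketch of Eq (17): each pass through the interferometer mesh applies
# M_ij = (1 + 2*kappa*dt*gamma_i) * delta_ij - 2*kappa*dt*J_ij.
rng = np.random.default_rng(1)
n = 6
J = rng.choice([-1.0, 1.0], size=(n, n)); J = (J + J.T) / 2; np.fill_diagonal(J, 0.0)

kdt = 0.02                       # kappa * Delta_t, the per-pass step size
gamma = 0.5 * np.ones(n)         # optical gains, playing the Lagrange multiplier role
M = np.diag(1 + 2 * kdt * gamma) - 2 * kdt * J

E = rng.standard_normal(n)
for _ in range(1000):
    E = M @ E                    # one round trip through the mesh
    E /= np.linalg.norm(E)       # gain saturation, idealized here as normalization

spins = np.sign(E).astype(int)
print("spins:", spins, " Ising energy:", spins @ J @ spins)
```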
Leleu et al. [11] proposed a modified version of Yamamoto's Ising machine [7]. Leleu's method closely resembles the Lagrange method while incorporating new features that might be important research directions in their own right. To understand the similarities and differences between Leleu's method and that of Lagrange multipliers, we first write down the Lagrange multiplier equations of motion.

We recall the Lagrange function for the Ising problem that we encountered in Section 5:

  L(x, γ) = Σ_{i,j} J_ij x_i x_j + Σ_i α_i x_i² + Σ_i γ_i (1 − x_i²).

In the above, the x_i are the optimization variables, J_ij is the interaction matrix, γ_i is the gain provided to the i-th variable, and α_i is the loss experienced by the i-th variable. To find a local optimum (x*, γ*) that satisfies the constraints, one has to perform gradient descent on the Lagrange function in the x variables and gradient ascent in the γ variables, as discussed in Section 5.5. That is, the iterations one would need to perform are:

  x_i(t + Δt) = x_i(t) − κΔt ∂L(x, γ)/∂x_i,
  γ_i(t + Δt) = γ_i(t) + κ′Δt ∂L(x, γ)/∂γ_i,

where κ and κ′ are suitably chosen step sizes. Substituting the expression for L into the above and taking the limit Δt →
0, we get the following equations of motion:

  dx_i/dt = 2κ [ (−α_i + γ_i) x_i − Σ_j J_ij x_j ],   (18)

  dγ_i/dt = κ′ (1 − x_i²).   (19)

We shall compare these equations with those designed by Leleu et al. in [11]. Their system is described by the following equations:

  dx_i/dt = (−α + γ) x_i + e_i Σ_j J_ij x_j,   (20)

  de_i/dt = β (1 − x_i²) e_i,   (21)

where the x_i represent the optimization variables as usual, α represents the loss experienced by each degree of freedom, γ represents a common gain supplied to every variable, β is a suitably chosen positive parameter, and the e_i represent error coefficients that capture how far each x_i is from its desired unity saturation amplitude. Leleu et al. also had cubic terms in x_i in [11], but we shall ignore them in the present discussion for the sake of simplicity; a discussion of these cubic terms is given in Appendix C. A side-by-side sketch of the two dynamical systems is given below.
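The following sketch integrates the Lagrange dynamics of Eqs (18)-(19) and the Leleu dynamics of Eqs (20)-(21) with forward-Euler steps. All parameter values are illustrative; a uniform loss α is used in place of per-variable α_i, and a hard clip stands in for the cubic saturation terms omitted from Eq (20).

```python
import numpy as np

def lagrange_dynamics(J, alpha=1.0, kappa=0.1, kappa_p=0.05, dt=0.05, steps=4000, seed=0):
    # Eqs (18)-(19): symmetric interaction, per-spin gain gamma_i as multiplier.
    rng = np.random.default_rng(seed)
    x = 0.1 * rng.standard_normal(J.shape[0])
    gamma = np.zeros_like(x)
    for _ in range(steps):
        dx = 2 * kappa * ((gamma - alpha) * x - J @ x)
        dgamma = kappa_p * (1 - x**2)
        x += dt * dx; gamma += dt * dgamma
    return np.sign(x)

def leleu_dynamics(J, alpha=1.0, gain=1.2, beta=0.05, dt=0.05, steps=4000, seed=0):
    # Eqs (20)-(21): common gain; the error coefficients e_i multiply the
    # coupling, making the effective interaction e_i * J_ij non-symmetric.
    rng = np.random.default_rng(seed)
    x = 0.1 * rng.standard_normal(J.shape[0])
    e = np.ones_like(x)
    for _ in range(steps):
        dx = (gain - alpha) * x + e * (J @ x)
        de = beta * (1 - x**2) * e
        x += dt * dx; e += dt * de
        x = np.clip(x, -1.5, 1.5)   # stand-in for the omitted cubic saturation
    return np.sign(x)

rng = np.random.default_rng(2)
J = rng.standard_normal((8, 8)); J = (J + J.T) / 2; np.fill_diagonal(J, 0.0)
for solver in (lagrange_dynamics, leleu_dynamics):
    s = solver(J)
    print(solver.__name__, "energy:", s @ J @ s)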
It is clear at once that there are significant similarities between Leleu's system and the Lagrange multiplier system. The optimization variables in both systems experience linear losses and gains, and have interaction terms that capture the Ising interaction. Both systems have auxiliary variables that are varied according to how far each degree of freedom is from its preferred saturation amplitude. However, the similarities end there.

A major difference in Leleu's system is that e_i multiplies the Ising interaction felt by the i-th variable: the coupling coefficient in the equation for x_i is e_i J_ij, while the complementary coefficient in the equation for x_j is e_j J_ij. Consequently, Leleu's equations implement a system with non-symmetric interactions, e_i J_ij ≠ e_j J_ij, between vector components x_i and x_j. The inclusion of non-symmetric matrix terms seems to be important, because Leleu's system achieves excellent performance on Ising problems in the Gset problem set, as demonstrated in [11].

Let us obtain some intuition about this system by splitting the non-symmetric interaction term e_i J_ij into the sum of a symmetric and an anti-symmetric part. This follows from the fact that any matrix A can be written as the sum of a symmetric matrix, (A + Aᵀ)/2, and an anti-symmetric matrix, (A − Aᵀ)/2. The symmetric part of Eq (20) leads to gradient descent dynamics of the form in Eq (18), similar to all the systems in Section 5. Conversely, the anti-symmetric part causes an energy-conserving 'rotary' motion in the vector space of the x_i. Leleu et al.'s non-symmetric approach contains the same number of auxiliary parameters, e_i versus γ_i, and so the secret of its improved performance seems to lie in this anti-symmetric part. We believe further analysis of this method might be a fruitful future research direction.

We recall Onsager's reciprocity theorem [2], which states that the thermodynamic response coefficient R_ij must equal the response coefficient R_ji if time-reversal symmetry holds in the equilibrium state. This would imply that one might have to construct a magnetic system of some sort, breaking time-reversal symmetry, in order to physically implement the asymmetry embedded in the dynamics designed by Leleu et al.

In conclusion, the symmetric part of Eq (20) takes iterative steps along the gradient of a dissipative merit function, while the anti-symmetric part produces energy-conserving rotations in the vector space of the optimization variables x_i, as illustrated below. The associated dynamical freedom might provide a fruitful future research direction in optimization and deserves further study to ascertain its power.
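The decomposition can be checked numerically: the sketch below splits A_ij = e_i J_ij into its symmetric and anti-symmetric parts and verifies that the anti-symmetric flow dx/dt = Wx conserves the norm of x. All values are illustrative.

```python
import numpy as np

# Split the non-symmetric coupling A_ij = e_i * J_ij into symmetric and
# anti-symmetric parts; the anti-symmetric part alone generates
# norm-conserving 'rotary' motion, since x^T W x = 0 for anti-symmetric W.
rng = np.random.default_rng(3)
n = 5
J = rng.standard_normal((n, n)); J = (J + J.T) / 2; np.fill_diagonal(J, 0.0)
e = rng.uniform(0.5, 1.5, n)

A = e[:, None] * J            # A_ij = e_i * J_ij, so A_ij != A_ji in general
S = (A + A.T) / 2             # symmetric, gradient-like (dissipative) part
W = (A - A.T) / 2             # anti-symmetric, energy-conserving part

x = rng.standard_normal(n)
norm0 = np.linalg.norm(x)
dt = 1e-3
for _ in range(5000):
    x = x + dt * (W @ x)      # dx/dt = W x gives d|x|^2/dt = 2 x^T W x = 0

print("initial norm:", round(norm0, 6), " final norm:", round(np.linalg.norm(x), 6))
```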
We have seen that minimum power dissipation solvers can address the Ising problem and various similar problems, such as the traveling salesman problem. In this section, we provide yet another application of minimum entropy generation solvers, to an optimization problem that appears frequently in statistics: curve fitting. In particular, we observe that the problem of linear least squares regression, that is, linear curve fitting with a quadratic merit function, resembles the Ising problem. In fact, the electrical circuit example we presented in Section 5.2 can be applied to linear regression, and we present such a circuit in this section. Our circuit provides a digital answer, but requires a series of binary-weighted resistance values, …, 2R, R, 0.5R, …, to represent arbitrary binary statistical input observations.

The objective of linear least squares regression is to fit a linear function to a given set of data {(x_1, y_1), (x_2, y_2), (x_3, y_3), …, (x_n, y_n)}. The x_i are input vectors of dimension d, while the y_i are the observed outputs that we want our regression to capture. The linear function being fit is of the form y(a) = Σ_{i=1}^d w_i a_i, where a is a feature vector of length d and w is a vector of unknown weights multiplying the feature vector a. The vector w is calculated by minimizing the sum of the squared errors it causes when used on an actual data set. This optimization problem may be represented as:

  w* = arg min_w Σ_{i=1}^n ( Σ_{j=1}^d w_j x_ij − y_i )²,

where x_ij is the j-th component of the vector x_i. The merit function, upon expansion, yields

  Σ_{i=1}^n y_i² − 2 Σ_{i=1}^d ( Σ_{j=1}^n x_ji y_j ) w_i + Σ_{i,j=1}^d ( Σ_{k=1}^n x_ki x_kj ) w_i w_j.

This functional form is identical to that of the Ising Hamiltonian, and we may construct an Ising circuit with J_ij = Σ_{k=1}^n x_ki x_kj, with the weights w acting like the unknown magnetic moments. There is an effective magnetic field in the problem, h_i = −2 Σ_{j=1}^n x_ji y_j, and a fixed number, Σ_{i=1}^n y_i², which doesn't play a role in the optimization.
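A minimal sketch of this mapping, with hypothetical data, is given below; it builds J_ij and h_i from (X, y) and checks that minimizing the resulting quadratic form reproduces the ordinary least-squares solution.

```python
import numpy as np

# Map a least-squares problem min_w ||X w - y||^2 onto the Ising form
# sum_ij J_ij w_i w_j + sum_i h_i w_i (plus a constant), as derived above.
rng = np.random.default_rng(4)
n, d = 50, 2                      # n observations, d features (assumed sizes)
X = rng.standard_normal((n, d))
w_true = np.array([0.75, -1.25])  # hypothetical ground-truth weights
y = X @ w_true + 0.05 * rng.standard_normal(n)

J = X.T @ X                       # J_ij = sum_k x_ki x_kj
h = -2 * X.T @ y                  # h_i = -2 sum_j x_ji y_j
const = y @ y                     # plays no role in the optimization

# Minimizing w^T J w + h^T w gives the normal equations 2 J w + h = 0:
w_ising = np.linalg.solve(2 * J, -h)
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_ising, w_lstsq)           # the two solutions agree
```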
A simple circuit that solves the linear least squares problem for d = 2 (the case where there are two features per instance and, consequently, two weights to be estimated) is provided in Fig. 9. This circuit provides weights up to 2-bit precision.

The two oscillators on the left-hand side of the figure represent the 2⁰ and 2¹ bits of the weight corresponding to the first feature, while the oscillators on the right-hand side are the corresponding bits of the second weight.
Figure 9: A two-bit linear regression circuit that finds the best two curve-fitting weights using the Principle of Minimum Entropy Generation.

The cross-resistance R that one would need to represent the J_ij connecting the i-th and j-th oscillators is calculated as:

  1/R = b_1/R_{−1} + b_0/R_0 + b_{−1}/R_1,

where R_m = 2^m R is a binary hierarchy of resistances …, 2R, R, 0.5R, … based on a reference resistor R, and the b_m are the binary digits of J_ij expressed as a binary number: J_ij = b_1×2¹ + b_0×2⁰ + b_{−1}×2⁻¹. This represents J_ij to 3-bit precision using resistors that span a dynamic range of 2² = 4. Further, the sign of the coupling is allotted according to whether the resistors R are parallel-connected or cross-connected. In operation, the resistors R would be externally programmed to the correct binary values, with many more bits than 3-bit precision, as given by the matrix product J_ij = Σ_{k=1}^n x_ki x_kj. A numerical sketch of this binary decomposition follows.
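The snippet below works through the 3-bit decomposition above; the reference resistance (called R in the text, R0 in the code) and the target coupling are assumed values chosen for illustration.

```python
# Sketch of the 3-bit binary resistor hierarchy described above: a coupling
# J_ij = b1*2 + b0*1 + bm1*0.5 is realized as parallel resistors R_m = 2^m * R0.
# The sign of the coupling would be set by parallel- vs cross-connection.
R0 = 1.0e3                        # reference resistor, ohms (assumed value)

def coupling_conductance(b1, b0, bm1):
    # 1/R = b1/R_{-1} + b0/R_0 + bm1/R_{+1}, with R_m = 2**m * R0
    return b1 / (0.5 * R0) + b0 / R0 + bm1 / (2.0 * R0)

# Target |J_ij| = 2.5, i.e. binary 10.1 -> b1 = 1, b0 = 0, bm1 = 1
G = coupling_conductance(1, 0, 1)
print("conductance:", G, "S")      # total parallel conductance
print("realized |J_ij|:", G * R0)  # -> 2.5, within the 2**2 = 4 dynamic range
```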
We have just solved the regression problem of the form Xw = y, where the matrix X and vector y were known measurements and the corresponding best weight vector w for fitting was the unknown. We conclude by noting that this same procedure can be adopted to solve linear systems of equations of the form Xw = y.

Physics obeys a number of maximization or minimization principles, such as the principle of Least Action, the principle of Minimum Entropy Generation, the Variational Principle, Physical Annealing, and the Adiabatic Principle (which in its quantum form is called Quantum Annealing).

Optimization is very significant in diverse fields, ranging from scheduling and routing in operations research to protein folding in biology, portfolio optimization in finance, and energy minimization in physics. In this article, we made the observation that physics has optimization principles at its heart, and that they can be exploited to design fast, low-power, digital solvers that avoid the limits of standard computational paradigms. Nature thus provides us with a means to solve optimization problems in all these areas, including Engineering, Artificial Intelligence, Machine Learning (backpropagation), Control Theory, and Reinforcement Learning.

We reviewed 7 physical machines, proposed or built, that purported to solve the Ising problem, and found that 6 of the 7 were performing Lagrange multiplier optimization; further, they also obey the Principle of Minimum Entropy Generation (always subject to a power input constraint). This means that, by appropriate choice of parameter values, these physical solvers can be used to perform Lagrange multiplier optimization orders of magnitude faster, and with lower power, than conventional digital computers. This performance advantage can be utilized for optimization in machine learning applications where energy and time considerations are critical.

The questions arise: What are the action items? And what is the most promising near-term application? All the hardware approaches seem to work comparably well. The easiest to implement would be the electrical oscillator circuits, though the optical oscillator arrays can be compact and very fast. Electrically, there would be two integrated circuits: the oscillator array, and the connecting resistors that would need to be reprogrammed for different problems. The action item could be to design a first chip consisting of about 1000 oscillators, and a second chip consisting of the appropriate coupling resistor array for a specific optimization problem. The resistors should be in an addressable binary hierarchy so that any desired resistance value can be programmed in by switches, within the number of bits of accuracy. It is possible to imagine solving a new Ising problem every millisecond by reprogramming the resistor chip.

On the software side, a compiler would need to be developed to go from an unsolved optimization problem to the resistor array that matches the desired goal. If the merit function were mildly nonlinear, we believe that the Principle of Minimum Entropy Generation would still hold, but there has been less background science justifying that claim.

With regard to the most promising near-term application, it might be in Control Theory or in Reinforcement Learning for self-driving vehicles, where rapid answers are required at modest power dissipation.

The act of computation can be regarded as a search among many possible answers, with the circuit finally converging to a correct configuration. Thus the initial conditions may include a huge phase-space volume of 2^n possible solutions, ultimately transitioning into a final configuration representing a small or modest-sized binary number. This type of computing implies a substantial entropy reduction. This led to Landauer's admonition that computation costs kn log 2 of entropy decrease, and kTn log 2 of energy, for a final answer with n binary digits.

By the 2nd Law of Thermodynamics, such an entropy reduction must be accompanied by an entropy increase elsewhere. For example, in a dissipative circuit, electricity converts to heat. In Landauer's viewpoint, the energy and entropy limit of computing was associated with the final act of writing out the answer in n bits, assuming the rest of the computer was reversible. In practice, technology consumes orders of magnitude more than the Landauer limit, owing to the insensitivity of the transistors.

Turning to NP-hard problems: the physical systems we are proposing evolve by steepest descent toward a local optimum, not a global optimum. Nonetheless, many of the authors of the 7 physical systems presented here have claimed to find better local optima than their competitors, due to special adjustments in their methods. Undoubtedly, some improvements are possible, but none of the 7 papers reviewed here claims to always find the one global optimum, which would be NP-hard [31].

We have shown that a number of physical systems that perform optimization are acting through the Principle of Minimum Entropy Generation, though other physics principles could also fulfill this goal. As the systems evolve toward an extremum, they perform Lagrange function optimization, where the Lagrange multipliers are given by the gain or loss coefficients that keep the machine running. Thus, Nature provides us with a series of physical Optimization Machines that are much faster and possibly more energy-efficient than conventional computers.

The authors gratefully acknowledge useful discussions with Dr. Ryan Hamerly, Dr. Tianshi Wang, and Prof. Jaijeet Roychowdhury. The work of S. K. Vadlamani and Prof. E. Yablonovitch was supported by the National Science Foundation through the Center for Energy Efficient Electronics Science (E³S) under Award ECCS-0939514 and by the Office of Naval Research.

A Lagrange multipliers theory
In this appendix, we shall study in greater detail the method of Lagrange multipliers, one of the best-known techniques for the solution of constrained optimization problems. At its core, it simply involves the modification of the merit function by adding terms that penalize constraint violations.

In the setting of constrained optimization, we are required to minimize an objective function f(x) over all points that satisfy certain given inequality (h_j(x) ≤ 0) and equality (g_i(x) = 0) constraints.
Let us assume that the domain set D, that is, the intersection of the domains of f(x), g_i(x), and h_j(x), is a subset of Rⁿ. Further, let the set of points in D that satisfy all the constraints g_i, h_j be called F, the feasible set. Then, the optimization problem can be written as:

  minimize f(x)
  subject to h_j(x) ≤ 0, j = 1, …, m,
             g_i(x) = 0, i = 1, …, p,
             x ∈ D.

A.1 Lagrange function
We now define a new function in n + m + p variables, called the Lagrange function, as follows:

  L(x, λ, µ) = f(x) + Σ_{j=1}^m µ_j h_j(x) + Σ_{i=1}^p λ_i g_i(x).

The summations that were added to the plain objective function f(x) serve as constraint violation penalty terms. The coefficients multiplying the penalty terms, λ_i and µ_j, are known as Lagrange multipliers. The inequality Lagrange multipliers µ_j are constrained to be non-negative so that the penalty that arises when the inequality constraints are not satisfied (i.e., when h_j(x) > 0) is non-negative.
The equality Lagrange multipliers, λ_i, have no such restrictions.

The Lagrange function has the advantage that it helps us express the constrained optimization of f as an unconstrained optimization of L. That is, it can be shown that:

  min_{x ∈ F} f(x) = min_{x ∈ D} max_{µ ≥ 0, λ} L(x, λ, µ).

The minimization over x in the feasible set, F, on the left-hand side has turned into a minimization over x in the entire domain, D, on the right-hand side.

A.2 Karush-Kuhn-Tucker (KKT) sufficient conditions
The Lagrange function also appears in an important related role. While the conditions for a point x* to be an unconstrained optimum of a differentiable function f are expressed in terms of the gradient of f(x), the optimality conditions for constrained optimization problems are naturally expressed in terms of the Lagrange function. These conditions are presented next.

A point x* is a local optimum of the function f(x) subject to the constraints g_i, h_j if it satisfies the Karush-Kuhn-Tucker (KKT) sufficient conditions:

1. Primal feasibility: The point x* is feasible, that is, it satisfies all the constraints:

  h_j(x*) ≤ 0, j = 1, …, m,
  g_i(x*) = 0, i = 1, …, p.
2. First-order condition: There exist Lagrange multipliers µ_1*, …, µ_m*, λ_1*, …, λ_p* such that the following equation is satisfied:

  ∇_x L(x*, µ*, λ*) = ∇_x f(x*) + Σ_{j=1}^m µ_j* ∇_x h_j(x*) + Σ_{i=1}^p λ_i* ∇_x g_i(x*) = 0.
3. Second-order condition: In addition to the first-order condition, if all the concerned functions are twice differentiable, we require that the Hessian of L with respect to x be positive definite along all directions that respect the active constraints. That is, the following has to be satisfied:

  vᵀ ∇_xx L(x*, µ*, λ*) v > 0, ∀ v ∈ T,

where T is defined as T = { v : (∇_x h_j(x*))ᵀ v = 0 for the active constraints j at x*, and (∇_x g_i(x*))ᵀ v = 0, i = 1, …, p }.

4. Complementary slackness: µ_j* h_j(x*) = 0 is satisfied for all the inequality constraints, j = 1, …, m.

5. Dual feasibility: The Lagrange multipliers of all the inequality constraints satisfy µ_j* ≥ 0, j = 1, …, m, with the inequality being strict for active constraints.

We recognize condition number 2 as the one we encountered in Section 4 in the main text. A small worked example is given below.
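As a worked example (our own choice of problem, not one from the text), consider minimizing f(x) = x_1² + x_2² subject to the single equality constraint g(x) = x_1 + x_2 − 1 = 0. Stationarity of L = f + λg gives 2x + λ[1, 1] = 0, so x* = (1/2, 1/2) and λ* = −1; the snippet below checks the KKT conditions at that point.

```python
import numpy as np

# Verify the KKT conditions for: min x1^2 + x2^2  s.t.  x1 + x2 - 1 = 0.
x_star = np.array([0.5, 0.5])
lam_star = -1.0

grad_f = 2 * x_star                  # gradient of the objective at x*
grad_g = np.array([1.0, 1.0])        # gradient of the equality constraint

print("primal feasibility g(x*):", x_star.sum() - 1)          # = 0
print("first-order condition:", grad_f + lam_star * grad_g)   # = [0, 0]
# Second-order condition: the Hessian of L is 2*I, which is positive
# definite on the tangent space T = {v : [1, 1]^T v = 0}, so x* is a
# constrained minimum.
```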
A.3 Saddle point nature of global and local optima

If the functions f, g_i, h_j and the feasible set F satisfy certain special conditions (such as, for instance, the Slater condition), a phenomenon called strong duality occurs. Under such circumstances, it turns out that the minimization and maximization in the above equation can be switched:

  min_{x ∈ D} max_{µ ≥ 0, λ} L(x, λ, µ) = max_{µ ≥ 0, λ} min_{x ∈ D} L(x, λ, µ).

When this occurs, it can be shown that the optimal point (x*, λ*, µ*) that satisfies the KKT conditions actually forms a saddle point of the Lagrange function, L(x, λ, µ). That is, one has:

  L(x*, λ*, µ*) ≤ L(x, λ*, µ*), ∀ x ∈ D,
  L(x*, λ*, µ*) ≥ L(x*, λ, µ), ∀ µ ≥ 0, λ.

This can be viewed as a simple motivation for the gradient descent of L in x and gradient ascent of L in λ that we encountered in Section 5.5. In the next section, Appendix B, we shall follow Chapter 4 of the excellent book [25] and see how this descent/ascent procedure can be used to find local optima even if strong duality doesn't hold.

B Lagrange multipliers algorithms: Augmented Lagrangian
In this section, we discuss the so-called 'Augmented Lagrangian method of multipliers', a popular algorithm used to obtain locally optimal solutions (x*, λ*) that satisfy the KKT conditions. We shall consider only the case where there are no inequality constraints h_j (and, consequently, no µ multipliers). This algorithm is discussed in detail in [25].

To motivate the 'Augmented Lagrangian' approach, let us observe the KKT conditions a bit more closely. We conclude from the first-order condition that a locally optimal x* renders the function L(x, λ*) stationary. However, this doesn't guarantee that x* is a minimum of L(x, λ*). Indeed, observation of the second-order condition tells us that x* could be a saddle point of L(x, λ*). This means that gradient descent-based algorithms will not converge to x* if the starting point is not in the correct region. It was to solve this problem that the 'Augmented Lagrangian' method was invented.

The Augmented Lagrange function, L_c(x, λ), is given by:

  L_c(x, λ) = f(x) + Σ_{i=1}^p λ_i g_i(x) + c ( Σ_{i=1}^p (g_i(x))² ).

This function L_c(x, λ), for a suitable choice of c, has the property that x* forms a minimum of L_c(x, λ*), and not just a saddle point as was the case with L(x, λ*).

In other words, local optima of the original constrained optimization problem can be obtained if we perform an unconstrained optimization of L_c(x, λ*). However, for this procedure to be a feasible solution approach, we would have to know the right λ*. It has been shown in [25] that the way to do this is to perform gradient ascent of L_c in the λ variables.

The 'Augmented Lagrangian method of multipliers' involves the repeated minimization of L_c(x, λ^(k)) using progressively better estimates, λ^(k), of λ*. The algorithm starts off with an arbitrary starting point (x^(0), λ^(0)) and then performs the following steps repeatedly:

1. Locally minimize L_c(x, λ^(k)) and call the minimum point x^(k).
2. Update λ_i^(k+1) = λ_i^(k) + c g_i(x^(k)).

The second step above corresponds to gradient ascent of L_c(x, λ) in the λ variables. Basically, this method performs a fast gradient descent of L_c in the x directions in conjunction with a slow gradient ascent of L_c in the λ directions. A dynamical system that performs this process in continuous time is given below:

  dx_i/dt = −κ ∂L_c(x, λ)/∂x_i,
  dλ_i/dt = κ′ ∂L_c(x, λ)/∂λ_i.
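The following sketch applies the two-step method above to a toy problem of our own illustrative choosing: minimize f(x) = x_1² + x_2² subject to g(x) = x_1 + x_2 − 1 = 0. The step sizes and the penalty coefficient c are illustrative values.

```python
import numpy as np

# Augmented Lagrangian method of multipliers for
#   min f(x) = x1^2 + x2^2   s.t.   g(x) = x1 + x2 - 1 = 0.
# Each outer step minimizes L_c(x, lambda_k) by gradient descent, then takes
# one ascent step lambda <- lambda + c * g(x).
c, kappa = 2.0, 0.1

def g(x):
    return x[0] + x[1] - 1.0

def grad_Lc(x, lam):
    # gradient of L_c = f + lam*g + c*g^2 with respect to x
    return 2 * x + (lam + 2 * c * g(x)) * np.ones(2)

x, lam = np.zeros(2), 0.0
for _ in range(50):                  # outer multiplier updates
    for _ in range(200):             # inner minimization of L_c(., lam)
        x -= kappa * grad_Lc(x, lam)
    lam += c * g(x)                  # gradient ascent in lambda

print("x* ~", x, " lambda* ~", lam)  # -> [0.5, 0.5] and -1.0
```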
C Application of Lagrange multipliers to the Ising problem; Cubic terms

In this appendix, we shall explain the role played by the cubic terms (neglected in the main text) in the methods of Yamamoto et al. [7], Kalinin et al. [15], and others. It will turn out that the inclusion of cubic terms helps implement the 'Augmented Lagrangian method of multipliers' from Appendix B.

The statement of the Ising problem is as follows:

  minimize xᵀ J x
  subject to x_i² − 1 = 0, i = 1, …, n.

The corresponding Lagrange function is given by:

  L(x, λ) = xᵀ J x − Σ_{i=1}^n λ_i (x_i² − 1).

Next, we write down the Augmented Lagrange function:

  L_c(x, λ) = xᵀ J x − Σ_{i=1}^n λ_i (x_i² − 1) + c ( Σ_{i=1}^n (x_i² − 1)² ).
Substituting the above Augmented Lagrange function into the dynamical system provided at the end of Appendix B, we get:

  dx_i/dt = 2κ [ λ_i x_i + 2c x_i − 2c x_i³ − Σ_{j=1}^n J_ij x_j ],

  dλ_i/dt = κ′ (1 − x_i²),

where κ and κ′ are appropriately chosen step sizes. We notice the equivalence in form between this dynamical system and those in the papers discussed in the main text, and conclude that the cubic terms that appear in most of those systems are in fact helping to implement the Augmented Lagrangian method of multipliers.

D Adiabatic method using Kerr (χ⁽³⁾) nonlinear coupled oscillators

Researchers at Toshiba proposed an adiabatic Ising solver consisting of networks of coupled Kerr nonlinear oscillators [17], [32], [33]. The big difference between the machines we have studied so far and the Toshiba machine is that Toshiba uses dissipation-less optical systems and utilizes the adiabatic method to optimize the Ising Hamiltonian. By dissipation-less, we mean that the coupling between different oscillators is not dissipative but perfectly elastic.

Further, they replace parametric oscillators that use the χ⁽²⁾ nonlinearity with Kerr-parametric oscillators that possess both the χ⁽²⁾ and the χ⁽³⁾ nonlinearities. The χ⁽²⁾ nonlinearity gives rise to effects such as parametric amplification, whereas the χ⁽³⁾ nonlinearity leads to an intensity-dependent refractive index change. A second-harmonic pump signal is used to achieve parametric amplification, and the refractive index of the modes in the oscillators is modulated by their own intensities through the third-order nonlinearity.

The slowly varying amplitude equations for the sine (s_i) and cosine (c_i) components of the electric field in their system are given by:

  dc_i/dt = K (c_i² + s_i²) s_i + p s_i + ξ Σ_j J_ij s_j,

  ds_i/dt = −K (c_i² + s_i²) c_i + p c_i − ξ Σ_j J_ij c_j,

where c_i and s_i represent the cosine and sine components of the i-th oscillator, p refers to the strength of the parametric pumping, K is the value of the third-order nonlinearity (Kerr coefficient), and ξ is the strength of the coupling interaction between the oscillators.

We rotate the axes and recast the above set of equations in phasor notation as follows, to elucidate the similarity between this method and that of Cambridge:

  dE_i/dt = p E_i* + iK |E_i|² E_i + iξ Σ_j J_ij E_j.

The first term on the right-hand side is the parametric gain, the second term is the nonlinear rotation caused by the Kerr nonlinearity, and the third term is the dissipation-less coupling between oscillators.

The machine, which follows the adiabatic principle, works by ramping up the value of the pump p adiabatically. The authors show that their system Hamiltonian mimics the Ising Hamiltonian well when p becomes large enough.

We note the interesting fact that, though the authors have parametric gain, the system does not blow up to infinity, owing to the presence of the nonlinear rotating term. This term ensures that the quadrature vector keeps traversing gain and loss regions periodically in phase space, hence keeping the amplitudes in check.
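A crude integration sketch of the quadrature equations above, with a slow pump ramp, is given below. The Kerr coefficient K, coupling strength ξ, ramp schedule, problem size, and the semi-implicit Euler integrator are all illustrative assumptions rather than the authors' settings.

```python
import numpy as np

# Crude integration of the Kerr-parametric oscillator quadrature equations
# above, ramping the pump p adiabatically from 0. All values illustrative.
rng = np.random.default_rng(5)
n = 4
J = rng.choice([-1.0, 1.0], size=(n, n)); J = (J + J.T) / 2; np.fill_diagonal(J, 0.0)

K, xi, dt, steps = 1.0, 0.1, 1e-3, 100000
c = 0.01 * rng.standard_normal(n)   # cosine quadratures
s = 0.01 * rng.standard_normal(n)   # sine quadratures

for t in range(steps):
    p = 1.5 * t / steps             # slow (adiabatic) pump ramp-up
    amp2 = c**2 + s**2
    c = c + dt * (K * amp2 * s + p * s + xi * (J @ s))
    amp2 = c**2 + s**2              # semi-implicit update, for numerical stability
    s = s + dt * (-K * amp2 * c + p * c - xi * (J @ c))

print("spins from sign(c):", np.sign(c).astype(int))
```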
E Iterative Analog Matrix Multipliers

In this appendix, we present the system designed by Soljacic et al. [16]. We shall see that the simplified system we presented in Sec. 6a of the main text differs significantly from that of Soljacic et al., in that their method adds noise and nonlinear thresholding after each iteration. It is possible that their modifications lead to performance improvements.

Formally, their iteration is given by:

  E(t+1) = u( 2KE(t) + N(t) ),

where E(t) is the vector of electric field amplitude values at the beginning of the t-th iteration, u(x) is the Heaviside step function that is 1 for positive x and 0 for negative x, N(t) is a zero-mean Gaussian random noise vector, and K is a matrix given by K = √(J + αM), with J the Ising connectivity matrix, α a real number, and M some suitably chosen matrix. More specifically, M is chosen to have the same eigenvectors as J. It will turn out that the eigenvalues of M play the role of Lagrange multipliers.

The authors showed that, under the condition of high noise N(t), their system performs minimization of the following effective merit function:

  H = −β Σ_{ij} (J_ij + αM_ij) E_i E_j,

where β is a parameter dependent on the noise.

Using the fact that the matrix M is chosen to have the same eigenvectors as J, we rewrite the above merit function, modulo additive constants, as the following Lagrange function:

  H = L(E, γ) = −β Σ_{ij} J_ij E_i E_j + α Σ_i γ_i (z_i² − 1),

where the γ_i are the eigenvalues of the matrix M, and the vector z is the vector of electric field amplitudes E expressed in the basis of eigenvectors of M (that is, the eigenvectors of J). We see that the eigenvalues of M play the role of Lagrange multipliers, albeit for different constraints than those required by the Ising problem. This difference seems to be caused by M not being a diagonal matrix.

In conclusion, we interpret their algorithm as optimizing a Lagrange function whose merit function is the Ising Hamiltonian itself, with the constraints being that the components of the spin vector, when expressed in the eigenvector basis of J, be restricted to +1 and −1.
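A minimal sketch of this iteration is given below. The construction of M (same eigenvectors as J, with eigenvalues shifted so that J + αM has a real matrix square root), the noise level, and the value of α are illustrative assumptions.

```python
import numpy as np

# Sketch of the noisy thresholded iteration E(t+1) = u(2*K*E(t) + N(t)),
# with K = sqrt(J + alpha*M) built in the eigenbasis of J.
rng = np.random.default_rng(6)
n = 8
J = rng.standard_normal((n, n)); J = (J + J.T) / 2; np.fill_diagonal(J, 0.0)

# M shares J's eigenvectors but has shifted eigenvalues, chosen so that
# J + alpha*M is positive definite and therefore has a real square root.
w, V = np.linalg.eigh(J)
alpha = 1.0
M = V @ np.diag(np.abs(w) + 1.0) @ V.T
K = V @ np.diag(np.sqrt(w + alpha * (np.abs(w) + 1.0))) @ V.T

E = rng.integers(0, 2, n).astype(float)       # binary field amplitudes
for _ in range(200):
    noise = 0.5 * rng.standard_normal(n)      # zero-mean Gaussian noise N(t)
    E = (2 * K @ E + noise > 0).astype(float) # Heaviside thresholding u(.)

print("final binary configuration:", E.astype(int))
```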
References

[1] Y. Shen, N. C. Harris, S. Skirlo, M. Prabhu, T. Baehr-Jones, M. Hochberg, X. Sun, S. Zhao, H. Larochelle, D. Englund, and M. Soljačić, "Deep learning with coherent nanophotonic circuits," Nature Photonics, vol. 11, pp. 441-446, July 2017.

[2] L. Onsager, "Reciprocal Relations in Irreversible Processes. II.," Physical Review, vol. 38, pp. 2265-2279, Dec. 1931.

[3] A. Lucas, "Ising formulations of many NP problems," arXiv preprint arXiv:1302.5843, 2013.

[4] S. Utsunomiya, K. Takata, and Y. Yamamoto, "Mapping of Ising models onto injection-locked laser systems," Optics Express, vol. 19, no. 19, pp. 18091-18108, 2011.

[5] Z. Wang, A. Marandi, K. Wen, R. L. Byer, and Y. Yamamoto, "Coherent Ising machine based on degenerate optical parametric oscillators," Physical Review A, vol. 88, no. 6, p. 063853, 2013.

[6] A. Marandi, Z. Wang, K. Takata, R. L. Byer, and Y. Yamamoto, "Network of time-multiplexed optical parametric oscillators as a coherent Ising machine," Nature Photonics, vol. 8, no. 12, p. 937, 2014.

[7] Y. Haribara, S. Utsunomiya, and Y. Yamamoto, "Computational Principle and Performance Evaluation of Coherent Ising Machine Based on Degenerate Optical Parametric Oscillator Network," Entropy, vol. 18, p. 151, Apr. 2016.

[8] P. L. McMahon, A. Marandi, Y. Haribara, R. Hamerly, C. Langrock, S. Tamate, T. Inagaki, H. Takesue, S. Utsunomiya, and K. Aihara, "A fully programmable 100-spin coherent Ising machine with all-to-all connections," Science, vol. 354, no. 6312, pp. 614-617, 2016.

[9] T. Inagaki, K. Inaba, R. Hamerly, K. Inoue, Y. Yamamoto, and H. Takesue, "Large-scale Ising spin network based on degenerate optical parametric oscillators," Nature Photonics, vol. 10, no. 6, p. 415, 2016.

[10] T. Inagaki, Y. Haribara, K. Igarashi, T. Sonobe, S. Tamate, T. Honjo, A. Marandi, P. L. McMahon, T. Umeki, K. Enbutsu, O. Tadanaga, H. Takenouchi, K. Aihara, K.-i. Kawarabayashi, K. Inoue, S. Utsunomiya, and H. Takesue, "A coherent Ising machine for 2000-node optimization problems," Science, vol. 354, pp. 603-606, Nov. 2016.

[11] T. Leleu, Y. Yamamoto, P. L. McMahon, and K. Aihara, "Destabilization of Local Minima in Analog Spin Systems by Correction of Amplitude Heterogeneity," Physical Review Letters, vol. 122, p. 040607, Feb. 2019.

[12] T. P. Xiao, Optoelectronics for refrigeration and analog circuits for combinatorial optimization. PhD Thesis, UC Berkeley, 2019.

[13] M. Babaeian, D. T. Nguyen, V. Demir, M. Akbulut, P.-A. Blanche, Y. Kaneda, S. Guha, M. A. Neifeld, and N. Peyghambarian, "A single shot coherent Ising machine based on a network of injection-locked multicore fiber lasers," Nature Communications, vol. 10, p. 3516, Dec. 2019.

[14] T. Wang and J. Roychowdhury, "OIM: Oscillator-Based Ising Machines for Solving Combinatorial Optimisation Problems," in Unconventional Computation and Natural Computation (I. McQuillan and S. Seki, eds.), vol. 11493, pp. 232-256, Cham: Springer International Publishing, 2019.

[15] K. P. Kalinin and N. G. Berloff, "Global optimization of spin Hamiltonians with gain-dissipative systems," Scientific Reports, vol. 8, p. 17791, Dec. 2018.

[16] C. Roques-Carmes, Y. Shen, C. Zanoci, M. Prabhu, F. Atieh, L. Jing, T. Dubček, C. Mao, M. R. Johnson, V. Čeperić, J. D. Joannopoulos, D. Englund, and M. Soljačić, "Heuristic recurrent algorithms for photonic Ising machines," Nature Communications, vol. 11, p. 249, Dec. 2020.

[17] H. Goto, K. Tatsumura, and A. R. Dixon, "Combinatorial optimization by simulating adiabatic bifurcations in nonlinear Hamiltonian systems," Science Advances, vol. 5, p. eaav2372, Apr. 2019.

[18] B. Molnár, F. Molnár, M. Varga, Z. Toroczkai, and M. Ercsey-Ravasz, "A continuous-time MaxSAT solver with high analog performance," Nature Communications, vol. 9, p. 4864, Dec. 2018.

[19] D. Pierangeli, G. Marcucci, and C. Conti, "Large-Scale Photonic Ising Machine by Spatial Light Modulation," Physical Review Letters, vol. 122, p. 213902, May 2019.

[20] D. Pierangeli, M. Rafayelyan, C. Conti, and S. Gigan, "Scalable spin-glass optical simulator," arXiv:2006.00828 [physics], June 2020.

[21] I. Prigogine, Étude Thermodynamique des Phénomènes irréversibles. Editions Desoer, Liège, 1947. Chap. V.

[22] S. de Groot, Thermodynamics of Irreversible Processes. Interscience Publishers, Inc., New York, 1951. Chap. X.

[23] N. G. Dickson, M. W. Johnson, M. H. Amin, R. Harris, F. Altomare, A. J. Berkley, P. Bunyk, J. Cai, E. M. Chapple, and P. Chavez, "Thermally assisted quantum annealing of a 16-qubit problem," Nature Communications, vol. 4, p. 1903, 2013.

[24] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.

[25] D. Bertsekas, Nonlinear Programming. Athena Scientific, 1999.

[26] https://web.stanford.edu/~yyye/yyye/Gset/.

[27] U. Benlic and J.-K. Hao, "Breakout Local Search for the Max-Cut problem," Engineering Applications of Artificial Intelligence, vol. 26, pp. 1162-1173, Mar. 2013.

[28] J. Roychowdhury, "Boolean Computation Using Self-Sustaining Nonlinear Oscillators," Proceedings of the IEEE, vol. 103, pp. 1958-1969, Nov. 2015.

[29] A. Demir, A. Mehrotra, and J. Roychowdhury, "Phase noise in oscillators: A unifying theory and numerical methods for characterization," IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, vol. 47, no. 5, pp. 655-674, 2000.

[30] M. Reck, A. Zeilinger, H. J. Bernstein, and P. Bertani, "Experimental realization of any discrete unitary operator," Physical Review Letters, vol. 73, pp. 58-61, July 1994.

[31] R. M. Karp, "Reducibility among combinatorial problems," in Complexity of Computer Computations, pp. 85-103, Springer, 1972.

[32] H. Goto, "Bifurcation-based adiabatic quantum computation with a nonlinear oscillator network," Scientific Reports, vol. 6, p. 21686, 2016.

[33] H. Goto, "Quantum computation based on quantum adiabatic bifurcations of Kerr-nonlinear parametric oscillators,"