Optimization algorithms inspired by the geometry of dissipative systems
Alessandro Bravetti, Maria L. Daza-Torres, Hugo Flores-Arguedas, Michael Betancourt
Alessandro Bravetti, Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas (IIMAS), Universidad Nacional Autónoma de México, A.P. 70-543, 04510 Ciudad de México, México, [email protected]
Maria L. Daza-Torres, Universidad de Guadalajara, Guadalajara, México, [email protected]
Hugo Flores-Arguedas, Centro de Investigación en Matemáticas (CIMAT), Guanajuato, México, [email protected]
Michael Betancourt, Symplectomorphic LLC, New York, USA, [email protected]
Abstract
Accelerated gradient methods are a powerful optimization tool in machine learning and statistics, but their development has traditionally been driven by heuristic motivations. Recent research, however, has demonstrated that these methods can be derived as discretizations of dynamical systems, which in turn has provided a basis for more systematic investigations, especially into the structure of those dynamical systems and their structure-preserving discretizations. In this work we introduce dynamical systems defined through a contact geometry which are not only naturally suited to the optimization goal but also subsume all previous methods based on geometric dynamical systems. These contact dynamical systems also admit a natural, robust discretization through geometric contact integrators. We demonstrate these features in paradigmatic examples which show that we can indeed obtain optimization algorithms that achieve oracle lower bounds on convergence rates while also improving on previous proposals in terms of stability.
Keywords: optimization, accelerated gradient, dynamical systems, geometric inte-grators, contact geometry
Despite their practical utility and explicit convergence bounds, accelerated gradient methods have long been difficult to motivate from a fundamental theory. This lack of understanding limits the theoretical foundations of the methods, which in turn hinders the development of new and more principled schemes. Recently a progression of work has studied the continuum limit of accelerated gradient methods, demonstrating that these methods can be derived as discretizations of continuous dynamical systems. Shifting focus to the structure and discretization of these latent dynamical systems provides a foundation for the systematic development and implementation of new accelerated gradient methods.

This recent direction of research began with [21], which found a continuum limit of Nesterov's accelerated gradient method (NAG),

X_k = P_{k−1} − τ ∇f(P_{k−1})   (1)
P_k = X_k + ((k − 1)/(k + 2)) (X_k − X_{k−1}),   (2)

by discretizing the ordinary differential equation

Ẍ + (3/t) Ẋ + ∇f(X) = 0,   (3)

for t > 0 with the initial conditions X(0) = X_0 and Ẋ(0) = 0. By generalizing the ordinary differential equation (3) they were then able to derive new accelerated gradient methods that achieved comparable convergence rates.

[23] continued in this direction by showing that NAG can also be derived as a discretization of a more structured variational dynamical system specified with a time-dependent Lagrangian, or equivalent Hamiltonian. Consider an objective function f : X → ℝ which is continuously differentiable, convex, and has a unique minimizer X* ∈ X.
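Before introducing the variational setup, it is useful to have the discrete recursion (1)–(2) in executable form. The following is a minimal Python transcription of NAG; the quadratic test objective at the end is purely illustrative and not one of the benchmarks used later in the paper.

```python
import numpy as np

def nag(grad_f, x0, tau, n_iters):
    """Nesterov's accelerated gradient method, the recursion (1)-(2)."""
    x_prev = np.asarray(x0, dtype=float)
    p = x_prev.copy()                              # P_0 = X_0
    for k in range(1, n_iters + 1):
        x = p - tau * grad_f(p)                    # (1): gradient step at the lookahead point
        p = x + (k - 1) / (k + 2) * (x - x_prev)   # (2): momentum extrapolation
        x_prev = x
    return x_prev

# illustration on f(X) = ||X||^2 / 2, whose gradient is X itself
x_min = nag(lambda x: x, [1.0, -2.0], tau=0.1, n_iters=200)
```

Note how the gradient in (1) is evaluated at the extrapolated point P rather than at X; this "lookahead" is the structural feature that the contact decomposition of Proposition 3 below isolates.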
Moreover assume that X is a convex set endowed with a distance-generating function h : X → ℝ that is also convex and essentially smooth. From the Bregman divergence induced by h,

D_h(Y, X) = h(Y) − h(X) − ⟨∇h(X), Y − X⟩,   (4)

they derived the Bregman Lagrangian

L(X, V, t) = e^{α+γ} ( D_h(X + e^{−α} V, X) − e^{β} f(X) ),   (5)

where α, β, and γ are continuously differentiable functions of time. They then proved that under the ideal scaling conditions

β̇ ≤ e^{α}   (6)
γ̇ = e^{α},   (7)

the solutions of the resulting Euler–Lagrange equations

Ẍ + (e^{α} − α̇) Ẋ + e^{2α+β} [∇²h(X + e^{−α} Ẋ)]^{−1} ∇f(X) = 0   (8)

satisfy [23, Theorem 1.1]

f(X) − f(X*) ≤ O(e^{−β}).   (9)

From a physical perspective the two terms in equation (5) play the roles of a kinetic and a potential energy, respectively. At the same time the explicit time-dependence of the Lagrangian (5) is a necessary ingredient for the dynamical system to dissipate energy and relax to a minimum of the potential, and hence to a minimum of the objective function. Moreover, by (6), the optimal convergence rate is achieved by choosing β̇ = e^{α}, i.e. β = ∫_{t₀}^{t} e^{α(s)} ds, and we observe that in the Euclidean case, h(X) = ½‖X‖², the Hessian is the identity matrix and thus (8) simplifies to

Ẍ + (e^{α} − α̇) Ẋ + e^{2α+β} ∇f(X) = 0.   (10)

Finally the authors developed a heuristic discretization of (8) that yielded optimization algorithms matching the continuous convergence rates.

[3] considered more systematic discretizations of these variational dynamical systems, exploiting the fact that they are well suited for numerical discretizations that preserve their geometric structure [16].
In particular they Legendre transformed the Bregman Lagrangian (5) to derive the Bregman Hamiltonian,

H(X, P, t) = e^{α+γ} ( D_{h*}(e^{−γ} P + ∇h(X), ∇h(X)) + e^{β} f(X) ),   (11)

where h*(P) = sup_V [⟨P, V⟩ − h(V)] is the Legendre transform of h, and then argued that a principled way to obtain reliable and rate-matching discretizations of the resulting dynamical system

Ẋ = ∇_P H(X, P)   (12)
Ṗ = −∇_X H(X, P),   (13)

is with symplectic integrators. They numerically demonstrated in the Euclidean case that a standard leapfrog integrator yields an optimization algorithm that achieves polynomial convergence rates, and showed how the introduction of a gradient flow could achieve late-time exponential convergence rates matching those seen empirically in other accelerated gradient methods.

Perhaps most importantly these variational methods allowed the focus to shift from existing accelerated gradient methods to the structure of the latent dynamical systems. Although the variational dynamical systems were initially derived to reproduce existing accelerated gradient methods, their existence suggests the introduction of new, principled dynamical systems that promise entirely new methods.

For example we can replace variational dynamical systems that exploit heuristic time-dependencies to achieve dissipation with explicitly dissipative dynamical systems. [19] and [12] considered a dynamical systems perspective on these systems, showing how relatively simple dissipations can achieve state-of-the-art convergence. [13] took a more geometric perspective, replacing the time-dependent Hamiltonian geometry of the variational systems with a conformally symplectic geometry that generates dynamical systems of the form

Ẋ = ∇_P H(X, P)   (14)
Ṗ = −∇_X H(X, P) − c P,   (15)

with c ∈ ℝ a constant. Being geometric dynamical systems, this approach also admits effective geometric integrators similar to [3].
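In the Euclidean case the symplectic discretization mentioned above reduces to the familiar leapfrog (Strang-split) scheme for a separable Hamiltonian H(X, P) = ½‖P‖² + U(X). The sketch below, with an illustrative harmonic potential rather than the time-dependent Bregman Hamiltonian itself, shows the long-time near-conservation of energy that motivates the use of such integrators.

```python
import numpy as np

def leapfrog_step(x, p, grad_U, tau):
    """One Strang-split (leapfrog) step for H(X, P) = ||P||^2/2 + U(X)."""
    p = p - 0.5 * tau * grad_U(x)   # half kick
    x = x + tau * p                 # full drift
    p = p - 0.5 * tau * grad_U(x)   # half kick
    return x, p

# long-time energy behavior on the harmonic potential U(X) = ||X||^2 / 2
x, p = np.array([1.0]), np.array([0.0])
H0 = 0.5 * (p @ p) + 0.5 * (x @ x)
for _ in range(1000):
    x, p = leapfrog_step(x, p, lambda y: y, tau=0.1)
H1 = 0.5 * (p @ p) + 0.5 * (x @ x)
```

The energy error oscillates but does not drift, which is the hallmark of symplecticity; the contact integrators introduced in Section 3 extend this same splitting idea to dissipative systems.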
These conformally symplectic dynamical systems, however, are less general than the time-dependent variational dynamical systems; in particular NAG cannot be exactly recovered in this framework [13].

Another relevant aspect uncovered by studying optimization algorithms through a variational or Hamiltonian analysis is the importance of a particular degree of freedom, the choice of the kinetic energy, which plays a fundamental role in the construction of fast and stable algorithms that can possibly escape local minima, in direct analogy with what happens in Hamiltonian Monte Carlo methods [1, 2, 17]. In particular, first [18] and then [13] showed that a careful choice of the kinetic energy term can stabilize the dynamical systems when the objective function is rapidly changing, similar to the regularization induced by trust regions. Indeed, like the popular Adam, AdaGrad and RMSprop algorithms, the resulting Relativistic Gradient Descent (RGD) algorithm regularizes the dynamical velocities to achieve faster convergence and improved stability.

Finally [18] introduced another way of incorporating dissipative terms into Hamilton's equations (12)–(13) (see also [20]). Their
Hamiltonian descent algorithm is derived from the equations of motion

Ẋ = ∇_P H(X, P) + X* − X   (16)
Ṗ = −∇_X H(X, P) + P* − P,   (17)

where (X*, P*) = argmin H(X, P). Because the dynamics are defined using terms only linear in X and P, they converge to the optimal solution exponentially quickly under mild conditions on H [20]. That said, this exponential convergence requires already knowing the optimum (X*, P*) in order to generate the dynamics. Additionally this particular dynamical system lies outside of the variational and conformally symplectic families of dynamical systems and so cannot take advantage of the corresponding geometric integrators.

In this work we show that all of the above-mentioned dynamical systems can be incorporated into a single family of contact Hamiltonian systems [4, 6] endowed with a contact geometry. The geometric foundation provides a powerful set of tools both for studying the convergence of the continuous dynamics and for generating structure-preserving discretizations. We also produce a new version of RGD, Contact Relativistic Gradient Descent (CRGD), and provide numerical experiments that show how it improves over RGD in terms of stability and rates of convergence.

The structure of this work is as follows: in Section 2 we introduce contact Hamiltonian systems and show that all the systems corresponding to Equations (10), (14)–(15), and (16)–(17) can be easily recovered as particular examples. In Section 3 we provide the basics of the geometric theory of time-discretization of contact systems by means of splitting, thus deriving optimization algorithms similar in spirit to, but more general than, those introduced in [3, 13]. Then in Section 4 and Section 5 we show numerically that our algorithms can improve both the speed of convergence and the stability with respect to those previously proposed. Finally, in Section 6 we summarize the results and discuss future directions.
Contact geometry is a rich subject, but to ease exposition we will consider here only how contact geometries manifest in Euclidean spaces. For a treatment of the more general theory see [5, 6, 9, 10, 11, 14].

In the Euclidean context a contact state space is odd-dimensional, ℝ^{2n+1}, and coordinated by the variables (X, P, S), where X ∈ ℝⁿ plays the role of the generalized coordinates, P ∈ ℝⁿ the corresponding momenta, and S ∈ ℝ the action of the system. A contact geometry is defined by a contact structure. On a Euclidean state space there are two common ways to specify such a structure.

Example 1 (The standard contact structure in canonical coordinates). The standard structure is defined as the kernel of the 1-form

η_std1 := dS − P dX.   (18)

We use "standard" because one can show that a contact structure on any manifold looks like this one locally [15].
Example 2 (The standard contact structure in non-canonical coordinates). This structure is defined as the kernel of the 1-form

η_std2 := dS − ½ (P dX − X dP).   (19)

Although this appears different from the structure in Example 1, the two define equivalent geometries. Transformations that preserve the contact structure, and hence the contact geometry, play a special role on these spaces.
Definition 1. A contact transformation or contactomorphism F : (M, η) → (N, η) is a map that preserves the contact structure,

F* η = α_F η,   (20)

where F* is the pullback induced by F, and α_F : M → ℝ is a nowhere-vanishing function.

Remark 1.
From Definition 1 and Examples 1 and 2 we see that a contact map re-scales the contact 1-form by multiplying it by a nowhere-vanishing function. Indeed, such a multiplication preserves the kernel of the 1-form, and hence the resulting geometry.

Remark 2.
We can explicitly construct a contact transformation between η_std1 and η_std2 above. The map

F : (X, P, S) ↦ ( (X + P)/√2, (P − X)/√2, S − XP/2 )   (21)

satisfies F* η_std2 = η_std1. Consequently the two structures defined in Examples 1 and 2 are equivalent; the superficial difference arises only because they are written in different coordinates. We can now define dynamical systems that generalize the Hamiltonian systems arising in symplectic geometries.
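Since the coefficients of both 1-forms are polynomial, the pullback identity of Remark 2 can be checked numerically at random points (shown here for one degree of freedom). Representing a 1-form by its coefficient vector over the basis (dX, dP, dS) is just a bookkeeping device for this check.

```python
import numpy as np

rng = np.random.default_rng(0)
r2 = np.sqrt(2.0)
ok = True
for _ in range(100):
    X, P, S = rng.normal(size=3)
    # image point under F and the differentials of its components,
    # written as coefficient vectors over the basis (dX, dP, dS)
    Xn, Pn = (X + P) / r2, (P - X) / r2
    dXn = np.array([1 / r2, 1 / r2, 0.0])     # d((X+P)/sqrt(2))
    dPn = np.array([-1 / r2, 1 / r2, 0.0])    # d((P-X)/sqrt(2))
    dSn = np.array([-P / 2, -X / 2, 1.0])     # d(S - XP/2)
    eta1 = np.array([-P, 0.0, 1.0])           # dS - P dX at (X, P, S)
    # F* eta_std2 evaluated at (X, P, S)
    pullback_eta2 = dSn - 0.5 * (Pn * dXn - Xn * dPn)
    ok = ok and np.allclose(pullback_eta2, eta1)
```

At every sampled point the pullback of η_std2 coincides with η_std1, as claimed.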
Definition 2 (Contact Hamiltonian systems). Given a possibly time-dependent differentiable function H(X, P, S, t) on the contact state space (ℝ^{2n+1}, η), we define the contact Hamiltonian vector field associated to H as the vector field X_H satisfying

£_{X_H} η = f_H η,   η(X_H) = −H,   (22)

where £_{X_H} η denotes the Lie derivative of η with respect to X_H, and η can be either η_std1 or η_std2. We call the flow of X_H the contact Hamiltonian system associated to H.

Remark 3. The first condition in (22) simply ensures that the flow of X_H generates contact transformations, while the second condition requires the vector field to be generated by a Hamiltonian function.

Lemma 1 (Contact Hamiltonian systems: std1). Given a (possibly time-dependent) differentiable function H(X, P, S, t) on the contact state space (ℝ^{2n+1}, η_std1), the associated contact Hamiltonian system is the following dynamical system:

Ẋ = ∇_P H   (23)
Ṗ = −∇_X H − P ∂H/∂S   (24)
Ṡ = ⟨P, ∇_P H⟩ − H.   (25)

Lemma 2 (Contact Hamiltonian systems: std2). Given a (possibly time-dependent) differentiable function H(X, P, S, t) on the contact state space (ℝ^{2n+1}, η_std2), the associated contact Hamiltonian system is the following dynamical system:

Ẋ = ∇_P H − (X/2) ∂H/∂S   (26)
Ṗ = −∇_X H − (P/2) ∂H/∂S   (27)
Ṡ = ½ (⟨X, ∇_X H⟩ + ⟨P, ∇_P H⟩) − H.   (28)

The proofs of the above lemmas follow from writing explicitly the conditions in (22) for η_std1 and η_std2 respectively, using Cartan's identity for the Lie derivative of a 1-form, and then contracting the first identity in (22) with the vector field ∂/∂S to show that the factor f_H has to be −∂H/∂S in both cases.

Remark 4 (Lagrangian formulation). Contact systems can alternatively be introduced starting from a Lagrangian function L(X, V, S, t) and its corresponding generalized Euler–Lagrange equations

d/dt (∂L/∂V) − ∂L/∂X − (∂L/∂V)(∂L/∂S) = 0,   (29)

together with the action equation

Ṡ = L(X, V, S, t).   (30)

Indeed, for regular Lagrangians it can be shown that (23)–(25) are equivalent to the system (29)–(30). We arrive at our main result with a direct calculation using equations (23)–(25) and (26)–(28).
Proposition 1 (Recovering previous frameworks). All the previously-mentioned frameworks for describing continuous-time optimization methods can be recovered as follows:

i) If H = H(X, P, t), that is, if H does not depend explicitly on S, then from equations (23)–(24) we obtain the standard Hamiltonian equations (12)–(13), with (25) completely decoupled from the system. Consequently we recover all the results for the symplectic Hamiltonian analysis of the continuous-time Bregman dynamics considered in [3] as particular cases.

ii) If H = H(X, P, t) + c S, then from equations (23)–(24) we obtain the standard equations for conformally symplectic systems (14)–(15), with (25) once again completely decoupled from the system. Consequently we recover all the results for the conformally symplectic analysis of continuous-time optimization dynamics considered in [13] as particular cases.

iii) If H = ½‖P‖² + e^{2α+β} f(X) + (e^{α} − α̇) S, then from equations (23)–(24) we obtain the Euler–Lagrange equations (10) for the Euclidean Bregman Lagrangian. As in the first two cases (25) completely decouples from the system. Here we recover all the results obtained in [23] by exploiting this variational formulation in continuous time.

iv) If H = H(X, P, t) + ⟨X*, P⟩ − ⟨P*, X⟩ + 2S, then from equations (26)–(27) we obtain the Hamiltonian descent equations (16)–(17), with (28) decoupled from the system. Consequently we recover all the results for the Hamiltonian descent analysis of continuous-time optimization dynamics considered in [18] and [20] as particular cases.

Remark 5 (H and Lyapunov functions).
Although in Proposition 1.iii) we start with the Hamiltonian formulation and recover equation (10) directly, in principle we can alternatively start by defining the Euclidean contact Bregman Lagrangian,

L(X, V, S, t) = ½ e^{−2α} ‖V‖² − e^{β} f(X) − (e^{α} + α̇) S,   (31)

and check that its associated generalized Euler–Lagrange equations (29) coincide with (10). Moreover, if we compute the momenta using the standard Legendre transform,

P = ∇_V L = e^{−2α} V,   (32)

then we obtain the corresponding contact Hamiltonian

H = ½ e^{2α} ‖P‖² + e^{β} f(X) + (e^{α} + α̇) S.   (33)
The Hamiltonians in Proposition 1.iii) and in (33) are equivalent, in the sense that both lead to the same dynamical system (10). (33), however, can be further manipulated into the equivalent form

H = E_t + (e^{α} + α̇) S,   (34)

where

E_t = ½ e^{2α} ‖P‖² + e^{β} ( f(X) − f(X*) )

is the Lyapunov function used in [23] to prove the rate of convergence of the dynamics (10).

This suggests constructing principled contact Hamiltonians for optimization purposes by taking

H = E_t + g(t) G(S)   (35)

and properly engineering the functions g(t) and G(S) to ensure that the dynamical system converges to the desired optimum.

Remark 6 (Continuous versus discrete equivalence). Although the different formulations of (10) lead to the same continuous dynamical system, they will not, in general, lead to the same discretized dynamical system. Having multiple formulations may make it easier to identify optimal discretizations, and hence the most effective optimization algorithms.
Remark 7 (Avoiding optimum-dependent Hamiltonians). In Proposition 1.iv) the contact Hamiltonian and the resulting equations of motion suffer from the same problem as the Hamiltonian descent case, namely that one needs to know the optimum in order to define the dynamical system. Besides the possibility of using the techniques introduced in [20], we can also use the geometry of contact Hamiltonian systems to investigate equivalent contact Hamiltonians that do not require knowing the optimum a priori.
The possibilities discussed in these remarks will be explored in future works. Weconclude this section with an important corollary of Proposition 1.
Proposition 2 (Recovering NAG). Continuous-time NAG is contact.

Proof. (3) is a particular case of (10) and therefore the continuous limit of NAG is a contact system. One possible contact Hamiltonian that generates these dynamics is

H = ½ ‖P‖² + f(X) + (3/t) S.   (36)

Remark 8.
Indeed, all the ODEs giving rise to the generalized Nesterov schemes proposed in [21] are analogously seen to be contact systems.
Before considering new optimization algorithms that stem from the discretization of contact Hamiltonian systems with geometric integrators, we will first prove the discrete-time analogue of Proposition 2 and show that discrete-time NAG is given not by a dynamical system alone but rather by the composition of a contact map and a gradient descent step. This result is inspired by the conjecture put forward in [3], which argued that symplectic maps followed by gradient descent steps can generate the exponential convergence near convex optima empirically observed in discrete-time NAG. Here we provide an actual proof that NAG is based on the composition of a contact map and a gradient step.
Proposition 3 (Recovering NAG). Discrete-time NAG, (1)–(2), is given by the composition of a contact map and a gradient descent step.

Proof. First we recall from Definition 1 that a contact transformation for the contact structure given by (19) is a map that satisfies

dS_{k+1} − ½ (P_{k+1} dX_{k+1} − X_{k+1} dP_{k+1}) = f(X_k, P_k, S_k) ( dS_k − ½ (P_k dX_k − X_k dP_k) ),   (37)

for some function f(X_k, P_k, S_k) that is nowhere 0. Then we claim that NAG can be exactly decomposed in the contact state space as the composition of the map

X_{k+1} = P_k   (38)
P_{k+1} = X_{k+1} + ((k − 1)/(k + 2)) (X_{k+1} − X_k)   (39)
S_{k+1} = ((k − 1)/(k + 2)) S_k,   (40)

which is readily seen to be a contact transformation satisfying

dS_{k+1} − ½ (P_{k+1} dX_{k+1} − X_{k+1} dP_{k+1}) = ((k − 1)/(k + 2)) [ dS_k − ½ (P_k dX_k − X_k dP_k) ],   (41)

followed by a standard gradient descent map,

X_{k+1} = X_k − τ ∇f(X_k)   (42)
P_{k+1} = P_k   (43)
S_{k+1} = S_k.   (44)

Remark 9.
The fact that NAG has a latent geometric nature is already a step forward towards understanding its effectiveness. It is also of interest that we have been able to prove exactly the conjecture in [3] that discrete-time NAG can be obtained by composing structure-preserving maps, in this case a contact transformation, with gradient descent steps. In light of our result, it seems that this can indeed be the intrinsic mechanism responsible for the late-stage exponential convergence so often seen in NAG. We will not pursue this direction here, leaving it to future work.
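The conformal identity (41) for the map (38)–(40) can also be verified numerically. In the scalar case the map is linear, so its pullback acts on the coefficient vector of the 1-form (over the basis (dX, dP, dS)) through the transpose of its Jacobian:

```python
import numpy as np

def eta2(X, P, S):
    """Coefficients of dS - (P dX - X dP)/2 over the basis (dX, dP, dS)."""
    return np.array([-P / 2, X / 2, 1.0])

rng = np.random.default_rng(1)
ok = True
for k in range(2, 20):
    c = (k - 1) / (k + 2)
    X, P, S = rng.normal(size=3)
    Xn, Pn, Sn = P, P + c * (P - X), c * S       # the map (38)-(40), scalar case
    J = np.array([[0.0, 1.0, 0.0],               # differential of X_{k+1}
                  [-c, 1.0 + c, 0.0],            # differential of P_{k+1}
                  [0.0, 0.0, c]])                # differential of S_{k+1}
    pullback = J.T @ eta2(Xn, Pn, Sn)
    ok = ok and np.allclose(pullback, c * eta2(X, P, S))   # identity (41)
```

For every k the pullback of the 1-form is exactly (k − 1)/(k + 2) times the original, confirming that the momentum step of NAG is a contact transformation.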
We now review a more systematic procedure to discretize contact Hamiltonian systems which, when applied to equation (3) or to the more general (10), leads to new optimization algorithms. First we introduce the following lemmas from [8], which show how and when we can construct contact integrators of any even order.
Lemma 3 (Second-order contact integrator). Let the possibly time-dependent contact Hamiltonian be separable into a sum of functions,

H(X, P, S, t) = Σ_{j=1}^{n} φ_j(X, P, S, t),   (45)

such that each of the associated contact Hamiltonian vector fields X_{φ_j} is exactly integrable. Then the integrator

S₂(τ) = e^{(τ/2) ∂_t} e^{(τ/2) X_{φ₁}} e^{(τ/2) X_{φ₂}} ⋯ e^{τ X_{φ_n}} ⋯ e^{(τ/2) X_{φ₂}} e^{(τ/2) X_{φ₁}} e^{(τ/2) ∂_t}   (46)

is a second-order contact integrator.

Lemma 4 (Higher-order integrator with exact coefficients). If S_n(τ) is an integrator of order n, then the map

S_{n+2}(τ) = S_n(z₁ τ) S_n(z₀ τ) S_n(z₁ τ),   (47)

with z₀ and z₁ given by

z₀(n) = −2^{1/(n+1)} / (2 − 2^{1/(n+1)}),   z₁(n) = 1 / (2 − 2^{1/(n+1)}),   (48)

is an integrator of order n + 2.

Lemma 5 (Higher-order integrator with approximated coefficients). There exist m ∈ ℕ and a set of real coefficients {w_j}_{j=0}^{m} such that the map

S^{(m)}(τ) = S₂(w_m τ) S₂(w_{m−1} τ) ⋯ S₂(w₀ τ) ⋯ S₂(w_{m−1} τ) S₂(w_m τ)   (49)

is an integrator of order n. The coefficients w₀, …, w_m are obtained as approximate solutions to an algebraic equation derived from the Baker–Campbell–Hausdorff formula.

We refer to [8] for the proofs of these lemmas and to [8, 22] for the analysis of the corresponding geometric integrators. Consequently, all we need in order to obtain contact integrators for contact Hamiltonian systems such as (10) is a splitting of the corresponding contact Hamiltonian (33) into a sum of contact Hamiltonians whose vector fields are exactly integrable. As an example, we provide the next result.
Proposition 4 (Second-order contact optimization algorithm). Splitting the contact Hamiltonian (33) into the terms

H_{φ₁} = ½ e^{2α} ‖P‖²   (50)
H_{φ₂} = e^{β} f(X)   (51)
H_{φ₃} = (e^{α} + α̇) S   (52)

gives the following second-order contact integrator, which in turn defines an explicit optimization algorithm,

S₂(τ) = e^{(τ/2) ∂_t} e^{(τ/2) X_{φ₁}} e^{(τ/2) X_{φ₂}} e^{τ X_{φ₃}} e^{(τ/2) X_{φ₂}} e^{(τ/2) X_{φ₁}} e^{(τ/2) ∂_t},   (53)

where each map is given by

e^{τ ∂_t} (X, P, S, t) = (X, P, S, t + τ)   (54)
e^{τ X_{φ₁}} (X, P, S, t) = (X + τ e^{2α} P, P, S + (τ/2) e^{2α} ‖P‖², t)   (55)
e^{τ X_{φ₂}} (X, P, S, t) = (X, P − τ e^{β} ∇f(X), S − τ e^{β} f(X), t)   (56)
e^{τ X_{φ₃}} (X, P, S, t) = (X, P e^{−τ(e^{α} + α̇)}, S e^{−τ(e^{α} + α̇)}, t).   (57)

Corollary 1 (Higher-order contact optimization algorithms). By combining the second-order contact optimization algorithm of Proposition 4 as in Lemmas 4 and 5, we can obtain contact optimization algorithms of any even order.
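The composition coefficients (48) invoked by Corollary 1 are the classical "triple jump" weights; a quick numeric check confirms the two conditions z₀ + 2z₁ = 1 (consistency of the composed step) and z₀^{n+1} + 2z₁^{n+1} = 0 (cancellation of the leading error term) that make the map (47) gain two orders.

```python
def triple_jump(n):
    """Composition weights z0(n), z1(n) of (48), which raise a symmetric
    integrator of (even) order n to order n + 2."""
    r = 2.0 ** (1.0 / (n + 1))
    z0 = -r / (2.0 - r)
    z1 = 1.0 / (2.0 - r)
    return z0, z1

# e.g. raising a second-order scheme to fourth order
z0, z1 = triple_jump(2)
```

Note that z₀ is negative, so the composed integrator necessarily takes one backward sub-step; this is harmless for the conservative part of the flow but is one reason higher-order compositions must be applied with some care to dissipative terms.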
Remark 10 (Matching rates) . Given that the thus–obtained optimization algorithmsare based on contact integrators of any even order, in principle one can use backwarderror analysis and show that the convergence rates of the corresponding discrete mapsmatch those of the continuous differential equation up to the order of the integrator.See, for example, the discussion in [13]. This goes beyond the scope of the presentwork and will be presented elsewhere.
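The maps (54)–(57) of Proposition 4 can be collected into a single step function. The sketch below composes them as in (53); the schedule α(t) = log(2/t), β(t) = log(t²/4) used in the illustration is one convenient choice satisfying the optimal scaling β̇ = e^{α}, for which the continuous dynamics reduce to the NAG ODE (3). The step size, horizon, and quadratic objective are illustrative.

```python
import numpy as np

def contact_step(x, p, s, t, tau, f, grad_f, al, dal, be):
    """One step of the second-order contact integrator (53) for the contact
    Hamiltonian (33); al, dal, be return alpha(t), alpha'(t), beta(t)."""
    def drift(x, p, s, h):   # exact flow of H_phi1 = e^{2a} ||P||^2 / 2, cf. (55)
        a2 = np.exp(2 * al(t))
        return x + h * a2 * p, p, s + 0.5 * h * a2 * (p @ p)
    def kick(x, p, s, h):    # exact flow of H_phi2 = e^b f(X), cf. (56)
        b = np.exp(be(t))
        return x, p - h * b * grad_f(x), s - h * b * f(x)
    def damp(x, p, s, h):    # exact flow of H_phi3 = (e^a + a') S, cf. (57)
        d = np.exp(-h * (np.exp(al(t)) + dal(t)))
        return x, p * d, s * d

    t += tau / 2                      # e^{(tau/2) d_t}
    x, p, s = drift(x, p, s, tau / 2)
    x, p, s = kick(x, p, s, tau / 2)
    x, p, s = damp(x, p, s, tau)
    x, p, s = kick(x, p, s, tau / 2)
    x, p, s = drift(x, p, s, tau / 2)
    t += tau / 2                      # e^{(tau/2) d_t}
    return x, p, s, t

# schedule reproducing the NAG ODE (3): e^a = 2/t, a' = -1/t, e^b = t^2/4
al  = lambda t: np.log(2.0 / t)
dal = lambda t: -1.0 / t
be  = lambda t: np.log(t * t / 4.0)
f      = lambda x: 0.5 * (x @ x)
grad_f = lambda x: x
x, p, s, t = np.array([1.0, -2.0]), np.zeros(2), 0.0, 1.0
f0 = f(x)
for _ in range(2000):
    x, p, s, t = contact_step(x, p, s, t, 0.05, f, grad_f, al, dal, be)
f1 = f(x)
```

Each sub-map is the exact flow of its split Hamiltonian with time frozen, so the only error comes from the Strang composition itself.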
Remark 11 (Increased stability). Since contact integrators are very stable under increases of the step size τ (see [8]), we can use larger steps than standard optimization algorithms and achieve an effective increase in the rate of convergence. In particular, the higher the order of the integrator, the larger we can choose τ; see, for example, Example 3 below.

As an example of how the contact geometry framework can guide the generalization of optimization algorithms, let us consider generalizing the Relativistic Gradient Descent proposed in [13].
Proposition 5 (Contact Relativistic Gradient Descent). Consider the contact version of the Relativistic Gradient Descent (RGD) introduced in [13]. We start with the contact Hamiltonian

H(X, P, S, t) = c √(‖P‖² + (mc)²) + f(X) + h(t) S,   (58)

with a time-dependent dissipative term h(t) = γ + α/t instead of the constant factor γ considered in [13]. Splitting the contact Hamiltonian into the sum of

H_{φ₁} = h(t) S   (59)
H_{φ₂} = f(X)   (60)
H_{φ₃} = c √(‖P‖² + (mc)²)   (61)

yields the following second-order contact integrator,

S₂(τ) = e^{(τ/2) ∂_t} e^{(τ/2) X_{φ₁}} e^{(τ/2) X_{φ₂}} e^{τ X_{φ₃}} e^{(τ/2) X_{φ₂}} e^{(τ/2) X_{φ₁}} e^{(τ/2) ∂_t},   (62)

where each map is given explicitly by

e^{τ ∂_t} (X, P, S, t) = (X, P, S, t + τ)   (63)
e^{τ X_{φ₁}} (X, P, S, t) = (X, P e^{−h(t)τ}, S e^{−h(t)τ}, t)   (64)
e^{τ X_{φ₂}} (X, P, S, t) = (X, P − τ ∇f(X), S − τ f(X), t)   (65)
e^{τ X_{φ₃}} (X, P, S, t) = (X + τ cP/√(‖P‖² + (mc)²), P, S − τ m²c³/√(‖P‖² + (mc)²), t).   (66)

These discretized dynamics define a new accelerated optimization algorithm that we call Contact Relativistic Gradient Descent (CRGD).

Remark 12.
If we take h(t) = γ and use the first-order discretization

S(τ) = e^{τ X_{φ₁}} e^{τ X_{φ₂}} e^{τ X_{φ₃}},   (67)

applied only to the variables (X, P), then we re-obtain the RGD algorithm originally proposed in [13]. RGD, as well as any other algorithm based on conformally symplectic systems, can thus be considered a particular contact optimization algorithm.

Remark 13.
If we fix γ = α and µ = e^{−γτ}, then we can rewrite the map for X_{φ₁} as

e^{τ X_{φ₁}} (X, P, S, t) = (X, P µ^{1+1/t}, S µ^{1+1/t}, t).   (68)

The dissipation parameter µ ∈ (0, 1) in RGD is constant and therefore has to be carefully tuned; in regions where f(X) has high curvature one would prefer such a parameter to be small, while in flat regions one would like it to be as large as possible. A dynamically tuned µ, however, has the potential to work well in both regimes. If we assume convex functions that are relatively flat around the minimum, then the dynamics are likely to start in a region of high curvature before converging to the low-curvature region near the optimum. In other words we would want the dissipation parameter to transition from a small value to a larger value as the system evolves. Indeed the choice h(t) = γ + γ/t in CRGD is equivalent to a time-dependent dissipation parameter

µ(t) = µ^{1+1/t},   (69)

which interpolates between 0 and µ < 1 with increasing time. Consequently this choice should improve convergence for objective functions that exhibit nearly flat regions. We explore this possibility in the numerical experiments below.

This is just one example of the type of reasoning that can guide the generalization from standard symplectic and conformally symplectic to contact optimization algorithms.

In this section we compare the classical momentum algorithm (CM), NAG, RGD and CRGD on the benchmark examples considered in [13].
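For reference, one CRGD step composes the maps (63)–(66) as in (62). The sketch below is a direct transcription; the parameter values in the illustration (γ, α, c, m, τ, and the quadratic objective) are arbitrary demonstration choices, not the tuned values used in the experiments that follow.

```python
import numpy as np

def crgd_step(x, p, s, t, tau, f, grad_f, gam, alp, c, m):
    """One step of the second-order contact integrator (62) for the CRGD
    Hamiltonian (58), with dissipation h(t) = gam + alp / t."""
    def damp(x, p, s, h):    # exact flow of H_phi1 = h(t) S, cf. (64)
        d = np.exp(-(gam + alp / t) * h)
        return x, p * d, s * d
    def kick(x, p, s, h):    # exact flow of H_phi2 = f(X), cf. (65)
        return x, p - h * grad_f(x), s - h * f(x)
    def drift(x, p, s, h):   # exact flow of H_phi3 = c sqrt(||P||^2 + (mc)^2), cf. (66)
        w = np.sqrt(p @ p + (m * c) ** 2)
        return x + h * c * p / w, p, s - h * m**2 * c**3 / w

    t += tau / 2
    x, p, s = damp(x, p, s, tau / 2)
    x, p, s = kick(x, p, s, tau / 2)
    x, p, s = drift(x, p, s, tau)
    x, p, s = kick(x, p, s, tau / 2)
    x, p, s = damp(x, p, s, tau / 2)
    t += tau / 2
    return x, p, s, t

f      = lambda x: 0.5 * (x @ x)
grad_f = lambda x: x
x, p, s, t = np.array([2.0, -1.0]), np.zeros(2), 0.0, 0.0
f0 = f(x)
for _ in range(500):
    x, p, s, t = crgd_step(x, p, s, t, 0.1, f, grad_f, gam=1.0, alp=1.0, c=1.0, m=1.0)
f1 = f(x)
```

The relativistic kinetic term bounds the update ‖ΔX‖ per step by τc regardless of how large the momentum grows, which is the velocity regularization responsible for the stability discussed above.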
Example 3 (Quadratic function). Let us start with a simple quadratic function,

f(X) = ½ Xᵀ A X,   (70)

where A ∈ ℝ^{n×n} is a positive-definite random matrix with eigenvalues λ(A) uniformly distributed over a fixed positive range. In Figure 1 we show the convergence rate of each algorithm when minimizing f(X) in each of 50 Monte Carlo simulations. We see that CRGD converges the fastest for most sampled objective functions.
Figure 1: CRGD demonstrates the fastest convergence on random quadratic functions (70). The initial state for each run is always X₀ = (1, …, 1)ᵀ, P₀ = 0 (and S₀ = 0 for CRGD).

In Figure 2 we compare the performance of RGD and CRGD on this problem, with the dissipation parameter µ, step size τ = 0.43, speed of light v ≈ 4.4 × 10⁶, and mass m all tuned to optimize the performance of RGD. For this particular tuning both RGD and CRGD exhibit similar rates of convergence, but if we increase the step size slightly to τ = 0.53 then RGD becomes unstable and fails to converge entirely while CRGD sustains the rapid convergence. This increased stability allows one to run CRGD with larger step sizes, and hence faster convergence, in practice.
Figure 2: When well-tuned, RGD quickly converges to the optimum of a random quadratic objective function, but so too does CRGD with the same configuration. When we increase the step size, however, RGD becomes unstable and diverges while CRGD maintains its rapid convergence.
Example 4 (Correlated quadratic function). Next let us consider the correlated quadratic function

f(X) = ½ Xᵀ A X,   A_ij = √(ij) / 2^{|i−j|}   for i, j = 1, …, n.   (71)

We perform 200 Monte Carlo simulations in which the initial position is sampled uniformly at random from a symmetric interval around the origin, to test the robustness of each method to the initialization. In this case we observe similar behavior for the four methods (Fig. 3).
Figure 3: NAG, RGD and CRGD perform very similarly when targeting a strongly correlated quadratic objective function (71) with varying initial conditions.

Example 5 (Camelback function). To push the algorithms further we consider the nonconvex Camelback objective function,

f(X₁, X₂) = 2X₁² − 1.05X₁⁴ + X₁⁶/6 + X₁X₂ + X₂².   (72)

The contour plot in Fig. 4(a) demonstrates the multimodality of this objective function, with three locally convex neighborhoods separated by nonconvex valleys. A unique global minimum can be found at f(0) = 0, while two local minima can be found at (X₁, X₂) ≈ ±(−1.7476, 0.8738) with f ≈ 0.2986.

In Fig. 4(b) we set the initial state X₀ away from all of the minima, with P₀ = 0 and S₀ = 0 for CRGD, and see that CRGD has the fastest convergence. For Fig. 4(c) we repeat the same experiment but initialize each algorithm close to one of the local minimizers. While CM and NAG are unable to escape the local minimum, both RGD and CRGD escape to the global minimum, with CRGD escaping earlier and converging to the global minimum faster after it has escaped.

For a quantitative comparison, we numerically estimate power-law convergence rates t^{−r} for RGD and CRGD over the iteration windows 0–50, 50–150, and 150–300 for the initialization of Fig. 4(b), and over 0–50 for that of Fig. 4(c) (Table 1).

Table 1: The superior convergence rate of CRGD seen in the examples of Fig. 4(b)–(c) can be quantified by numerically estimating the convergence rates of the competing algorithms.
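The stated landmark values of the Camelback objective can be confirmed directly from (72), independently of any optimizer:

```python
def camelback(x1, x2):
    """The three-hump Camelback objective (72)."""
    return 2 * x1**2 - 1.05 * x1**4 + x1**6 / 6 + x1 * x2 + x2**2

f_global = camelback(0.0, 0.0)          # unique global minimum, f = 0
f_local  = camelback(-1.7476, 0.8738)   # one of the two symmetric local minima
```

By the symmetry f(X₁, X₂) = f(−X₁, −X₂), the two local minima come in the ± pair quoted above and share the same objective value.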
In Fig. 4(d) we perform 500 experiments where the initial position is sampled uniformly at random over a symmetric range. Every algorithm was vulnerable to being trapped by the local minima, but CRGD found the global minimum more often than the other algorithms (Table 2).

Method    Frequency of finding global minimum
CM        30.72 %
NAG       29.53 %
RGD       32.97 %
CRGD      33.10 %

Table 2: CRGD is able to find the global minimum of the Camelback objective function more frequently than CM, NAG, and RGD.
Figure 4: (a) The Camelback function (72) features three modes: one global minimum in the center surrounded by two local minima. (b) When initialized away from all of the minima, X_0 = (5, ·)^T, CRGD converges the fastest. (c) When initialized near one of the local minima, X_0 = (1.·, −0.·)^T, CRGD continues to dominate. (d) When the initialization is chosen at random, −· ≤ X_{0,i} ≤ ·, CRGD displays the best asymptotic convergence.

Example 6 (Rosenbrock Function). For a higher-dimensional challenge, let's consider the nonconvex Rosenbrock function,

    f(X) = sum_{i=1}^{n−1} [ 100 (X_{i+1} − X_i^2)^2 + (1 − X_i)^2 ],    (73)

with n = 100 dimensions. The Rosenbrock landscape is quite complex; for instance there are only two local minimizers, one global at X* = (1, 1, …, 1)^T where f(X*) = 0, and one local near X ≈ (−1, 1, …, 1)^T.

CRGD demonstrates the fastest convergence both when the algorithms are initialized close to the local minimum, X_{0,2i} = 1 and X_{0,2i−1} = −·.· for i = 1, …, 50 (Fig. 5(b)), and when initialized in the tails of the Rosenbrock function, X_{0,2i} = 5 and X_{0,2i−1} = −· for i = 1, …, 50 (Fig. 5(c)). Moreover, if we sample the initialization uniformly at random in the range −·.· ≤ X_{0,i} ≤ ·.·, then we see that the performance of CRGD is more robust to the specific initialization than that of CM, NAG, and RGD (Fig. 5(d)).
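For concreteness, the Rosenbrock objective (73), with the conventional coefficient of 100 on the first term (the coefficient that is standard for this function), can be written and sanity-checked as follows; the sketch is ours, not the authors' code.

```python
import numpy as np

def rosenbrock(x):
    """Rosenbrock objective (73) with the conventional coefficient 100."""
    x = np.asarray(x, dtype=float)
    return np.sum(100.0 * (x[1:] - x[:-1]**2)**2 + (1.0 - x[:-1])**2)

n = 100
x_star = np.ones(n)                                      # global minimizer, f = 0
x_near_local = np.concatenate(([-1.0], np.ones(n - 1)))  # near the local minimizer
```

Here `rosenbrock(x_star)` returns exactly 0, while `rosenbrock(x_near_local)` returns 4, since only the (1 − (−1))^2 summand is nonzero, consistent with a shallow local minimizer near (−1, 1, …, 1)^T.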
Figure 5: (a) A slice of the Rosenbrock function (73), where X = (X_1, X_2, 1, …, 1), demonstrates the complex landscape that can frustrate optimization algorithms. CRGD converges the fastest when the optimization algorithms are initialized (b) close to and (c) far from the local minimum. At the same time CRGD is less sensitive to the specific initialization, demonstrating much less variability as the initialization is randomly sampled.

Parameter tuning

In this section we describe the tuning process for the parameters used in the examples of Section 4. In all the experiments we use an exhaustive random search on parameter space and the same number of Monte Carlo runs for each algorithm. Moreover, we always set P_0 = 0 and S_0 = 0. All the numerical implementations have been performed in Python, using the scipy, numpy and matplotlib packages. The code is available in a GitHub repository (see [7]).

Quadratic function
In Fig. 1 we performed 50 samples of A; for each sample we ran each algorithm 150 times for 200 iterations and chose the parameters that give the lowest objective function value. Search ranges are shown in Table 3.

    Algorithm   Step Size            Momentum µ      Mass m             Speed of Light c
    CM          [10^-·, ·×10^-·]     [0.·, 0.·]      —                  —
    NAG         [10^-·, ·×10^-·]     [0.·, 0.·]      —                  —
    RGD         [10^-·, ·×10^-·]     [0.·, 0.95]     [10^-·, 10^-·]     [10^·, 10^·]
    CRGD        [10^-·, ·×10^-·]     [0.·, 0.9]      [10^-·, 10^-·]     [10^·, 10^·]

Table 3: Ranges for the parameter search in the experiment of Fig. 1.
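The exhaustive random search used throughout this section can be sketched as follows. Each parameter is drawn log-uniformly from its range (the text only states "random search", so log-uniform sampling is our assumption) and the draw achieving the lowest final objective value is kept; `run_gd`, a fixed-step gradient descent on a quadratic, is a hypothetical stand-in for CM, NAG, RGD and CRGD.

```python
import numpy as np

def random_search(run, ranges, n_draws=150, seed=0):
    """Sample each parameter log-uniformly from its range, run the algorithm,
    and keep the draw that achieves the lowest final objective value."""
    rng = np.random.default_rng(seed)
    best_params, best_value = None, np.inf
    for _ in range(n_draws):
        params = {name: np.exp(rng.uniform(np.log(lo), np.log(hi)))
                  for name, (lo, hi) in ranges.items()}
        value = run(**params)
        if value < best_value:
            best_params, best_value = params, value
    return best_params, best_value

def run_gd(step):
    """Stand-in algorithm: 200 gradient-descent steps on f(x) = ||x||^2 / 2."""
    x = np.full(2, 5.0)
    for _ in range(200):
        x = x - step * x          # the gradient of ||x||^2 / 2 is x itself
    return 0.5 * np.dot(x, x)

params, value = random_search(run_gd, {"step": (1e-4, 1.5)})
```

For the actual experiments the search would range over step size, momentum, mass and speed of light simultaneously, with the per-algorithm bounds of Tables 3 through 6.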
Correlated quadratic function
For the experiment in Fig. 3, we performed 50 samples of the initial position, where each sample is chosen uniformly in the range −· ≤ X_{0,i} ≤ ·. Then, for each initial position, we ran each algorithm 100 times for 100 iterations and chose the parameters which give the lowest objective function value. Search ranges are shown in Table 4.

    Algorithm   Step Size            Momentum µ      Mass m             Speed of Light c
    CM          [10^-·, 10^-·]       [0.·, 0.·]      —                  —
    NAG         [10^-·, 10^-·]       [0.·, 0.·]      —                  —
    RGD         [10^-·, ·×10^-·]     [0.·, 0.·]      [10^-·, 10^-·]     [10^·, 10^·]
    CRGD        [10^-·, ·×10^-·]     [0.·, 0.·]      [10^-·, 10^-·]     [10^·, 10^·]

Table 4: Ranges for the parameter search in the experiment of Fig. 3.
Camelback Function
For the experiment in Fig. 4(b)–(c), we ran each algorithm 1500 times for 300 iterations and chose the parameters which give the lowest objective function value. For the experiment in Fig. 4(d) we performed 100 samples of the initial position, where each sample is chosen uniformly in the range −· ≤ X_0 ≤ ·. Then, for each initial position, we ran each algorithm 500 times for 1200 iterations and chose the parameters which give the lowest objective function value. Search ranges are shown in Table 5.

    Algorithm   Step Size            Momentum µ      Mass m             Speed of Light c
    CM          [10^-·, 10^-·]       [0.·, 0.·]      —                  —
    NAG         [10^-·, 10^-·]       [0.·, 0.·]      —                  —
    RGD         [10^-·, ·×10^-·]     [0.·, 0.8]      [10^-·, 10^-·]     [10^·, 10^·]
    CRGD        [10^-·, ·×10^-·]     [0.·, 0.65]     [10^-·, 10^-·]     [10^·, 10^·]

Table 5: Ranges for the parameter search in the experiments of Fig. 4.
Rosenbrock Function
In the experiment in Fig. 5(b)–(c), we ran each algorithm 500 times for 1200 iterations and chose the parameters which give the lowest objective function value. For CM, NAG and RGD, we used the same search ranges as in Table 5, except for the momentum factor, which is searched in the range µ ∈ [0.·, 0.·]. For CRGD, we used the search ranges of Table 6. In the experiment in Fig. 5(d), we performed 20 samples of the initial position, where each sample is chosen uniformly in the range −·.· ≤ X_0 ≤ ·.·. Then, for each initial position, we ran each algorithm 500 times for 1200 iterations and chose the parameters which give the lowest objective function value. Search ranges are shown in Table 6.

    Algorithm   Step Size                Momentum µ      Mass m               Speed of Light c
    CM          [2×10^-·, ·×10^-·]       [0.·, 0.·]      —                    —
    NAG         [2×10^-·, ·×10^-·]       [0.·, 0.·]      —                    —
    RGD         [10^-·, 10^-·]           [0.·, 0.97]     [4×10^-·, 10^-·]     [10^·, ·×10^·]
    CRGD        [10^-·, ·×10^-·]         [0.·, 0.98]     [10^-·, 10^-·]       [10^·, 10^·]

Table 6: Ranges for the parameter search in the experiments of Fig. 5(d).
Conclusions

In this paper we have demonstrated that contact geometry, and contact Hamiltonian systems, naturally generate the dissipative dynamical systems whose discretizations give rise to accelerated gradient algorithms. Not only do these geometric systems subsume a wide range of dynamical systems previously considered in the literature, but their geometric integration provides a principled means of constructing discretizations that preserve the important structure of the latent dynamics. Quite remarkably, NAG itself can be decomposed as a contact transformation followed by a standard gradient descent step, demonstrating the fundamental nature of these systems.

We expect that unifying the development of optimization algorithms through this contact geometric lens will allow us not only to better understand these algorithms but also to identify how their structure translates into ultimate performance, and hence to derive improved algorithms more effectively. As a preliminary example we have shown that the RGD algorithm can be immediately generalized to the contact case, resulting in a more stable and generally faster algorithm.

The geometric foundation also brings with it a wealth of mathematical tools for the thorough analysis of the convergence of these algorithms, which we defer to subsequent work.

Acknowledgements
We would like to thank Diego Tapias, Mario Díaz, and Shin-itiro Goto for valuable comments, and Ana Pérez Arteaga and Ramiro Chávez Tovar for their help.
References

[1] Betancourt M. A conceptual introduction to Hamiltonian Monte Carlo. arXiv:1701.02434, 2017.
[2] Betancourt M., Byrne S., Livingstone S., Girolami M. et al. The geometric foundations of Hamiltonian Monte Carlo. Bernoulli, 23: 2257–2298, 2017.
[3] Betancourt M., Jordan M. I. & Wilson A. C. On symplectic optimization. arXiv:1802.03653, 2018.
[4] Bravetti A. Contact Hamiltonian dynamics: The concept and its use. Entropy, 19, 2017.
[5] Bravetti A. Contact geometry and thermodynamics. International Journal of Geometric Methods in Modern Physics: 1940003, 2018.
[6] Bravetti A., Cruz H. & Tapias D. Contact Hamiltonian mechanics. Annals of Physics, 376: 17–39, 2017.
[7] Bravetti A., Daza-Torres M. L., Flores-Arguedas H. & Betancourt M. Contact optimization algorithms. https://github.com/mdazatorres/Contact-optimization-algorithms, 2020.
[8] Bravetti A., Seri M., Vermeeren M. & Zadra F. Numerical integration in celestial mechanics: a case for contact geometry. Celestial Mechanics and Dynamical Astronomy, 132: 1–29, 2020.
[9] Ciaglia F. M., Cruz H. & Marmo G. Contact manifolds and dissipation, classical and quantum. Annals of Physics, 398: 159–179, 2018.
[10] de León M. & Lainz Valcázar M. Contact Hamiltonian systems. Journal of Mathematical Physics, 60: 102902, 2019.
[11] de León M. & Sardón C. Cosymplectic and contact structures for time-dependent and dissipative Hamiltonian systems. Journal of Physics A: Mathematical and Theoretical, 50: 255205, 2017.
[12] Diakonikolas J. & Jordan M. I. Generalized momentum-based methods: A Hamiltonian perspective. arXiv:1906.00436, 2019.
[13] França G., Sulam J., Robinson D. P. & Vidal R. Conformal symplectic and relativistic optimization. arXiv:1903.04100, 2019.
[14] Gaset J., Gràcia X., Muñoz-Lecanda M. C., Rivas X. & Román-Roy N. New contributions to the Hamiltonian and Lagrangian contact formalisms for dissipative mechanical systems and their symmetries. arXiv:1907.02947, 2019.
[15] Geiges H. An introduction to contact topology. Volume 109. Cambridge University Press, 2008.
[16] Hairer E., Lubich C. & Wanner G. Geometric Numerical Integration: Structure-Preserving Algorithms for Ordinary Differential Equations. Volume 31 of Springer Series in Computational Mathematics. Springer-Verlag, Berlin Heidelberg, 2006.
[17] Livingstone S., Faulkner M. F. & Roberts G. O. Kinetic energy choice in Hamiltonian/hybrid Monte Carlo. Biometrika, 106: 303–319, 2019.
[18] Maddison C. J., Paulin D., Teh Y. W., O'Donoghue B. & Doucet A. Hamiltonian descent methods. arXiv:1809.05042, 2018.
[19] Muehlebach M. & Jordan M. I. A dynamical systems perspective on Nesterov acceleration. arXiv:1905.07436, 2019.
[20] O'Donoghue B. & Maddison C. J. Hamiltonian descent for composite objectives. arXiv:1906.02608, 2019.
[21] Su W., Boyd S. & Candès E. J. A differential equation for modeling Nesterov's accelerated gradient method: Theory and insights. Journal of Machine Learning Research, 17: 1–43. http://jmlr.org/papers/v17/15-084.html, 2016.
[22] Vermeeren M., Bravetti A. & Seri M. Contact variational integrators. Journal of Physics A: Mathematical and Theoretical, 52: 445206, 2019.
[23] Wibisono A., Wilson A. C. & Jordan M. I. A variational perspective on accelerated methods in optimization. Proceedings of the National Academy of Sciences, 113: E7351–E7358, 2016.