A Variational Formulation of Accelerated Optimization on Riemannian Manifolds
VALENTIN DURUISSEAUX AND MELVIN LEOK
Abstract.
It was shown recently by Su et al. [19] that Nesterov's accelerated gradient method for minimizing a smooth convex function f can be thought of as the time discretization of a second-order ODE, and that f(x(t)) converges to its optimal value at a rate of O(1/t²) along any trajectory x(t) of this ODE. A variational formulation was introduced in Wibisono et al. [21] which allowed for accelerated convergence at a rate of O(1/t^p), for arbitrary p > 0, in normed vector spaces. This framework was exploited in Duruisseaux et al. [6] using time-adaptive geometric integrators to design efficient explicit algorithms for symplectic accelerated optimization. In Alimisis et al. [3], a second-order ODE was proposed as the continuous-time limit of a Riemannian accelerated algorithm, and it was shown that the objective function f(x(t)) converges to its optimal value at a rate of O(1/t²) along solutions of this ODE, thereby generalizing the earlier Euclidean result to the Riemannian manifold setting. In this paper, we show that on Riemannian manifolds, the convergence rate of f(x(t)) to its optimal value can also be accelerated to an arbitrary convergence rate O(1/t^p), by considering a family of time-dependent Bregman Lagrangian and Hamiltonian systems on Riemannian manifolds. This generalizes the results of [21] to Riemannian manifolds and also provides a variational framework for accelerated optimization on Riemannian manifolds. In particular, we will establish results for objective functions on Riemannian manifolds that are geodesically convex, weakly-quasi-convex, and strongly convex. An approach based on the time-invariance property of the family of Bregman Lagrangians and Hamiltonians was used to construct very efficient optimization algorithms in [6], and we establish a similar time-invariance property in the Riemannian setting. This lays the foundation for constructing similarly efficient optimization algorithms on Riemannian manifolds, once the Riemannian analogue of time-adaptive Hamiltonian variational integrators has been developed. The experience with the numerical discretization of variational accelerated optimization flows on vector spaces suggests that the combination of time-adaptivity and symplecticity is important for the efficient, robust, and stable discretization of these variational flows describing accelerated optimization.
One expects that a geometric numerical integrator that is time-adaptive, symplectic, and Riemannian manifold preserving will yield a class of similarly promising optimization algorithms on manifolds.

1. Introduction
Efficient optimization has become one of the major concerns in data analysis. Many machine learning algorithms are designed around the minimization of a loss function or the maximization of a likelihood function. Due to the ever-growing scale of the data sets and size of the problems, there has been a lot of focus on first-order optimization algorithms because of their low cost per iteration. The first gradient descent algorithm was proposed in [5] by Cauchy to deal with the very large systems of equations he was facing when trying to simulate orbits of celestial bodies, and many gradient-based optimization methods have been proposed since Cauchy's work in 1847. In 1983, Nesterov's accelerated gradient method was introduced in [16], and was shown to converge in O(1/k²) to the minimum of the convex objective function f, improving on the O(1/k) convergence rate exhibited by the standard gradient descent methods. This O(1/k²) convergence rate was shown in [17] to be optimal among first-order methods using only information about ∇f at consecutive iterates. This phenomenon in which an algorithm displays this improved rate of convergence is referred to as acceleration, and other accelerated algorithms have been derived since Nesterov's algorithm, such as accelerated mirror descent [15] and accelerated cubic-regularized
Newton's method [18]. More recently, it was shown in [19] that Nesterov's accelerated gradient method limits to a second-order ODE, as the timestep goes to 0, and that the objective function f(x(t)) converges to its optimal value at a rate of O(1/t²) along the trajectories of this ODE. It was then shown in [21] that in continuous time, the convergence rate of f(x(t)) can be accelerated to an arbitrary convergence rate O(1/t^p) in normed spaces, by considering flow maps generated by a family of time-dependent Bregman Lagrangian and Hamiltonian systems which is closed under time rescaling. This variational framework and the time-invariance property of the family of Bregman Lagrangians was then exploited in [6] using time-adaptive geometric integrators to design efficient explicit algorithms for symplectic accelerated optimization. It was observed that a careful use of adaptivity and symplecticity could result in a significant gain in computational efficiency. In the past few years, there has been some effort to derive accelerated optimization algorithms in the Riemannian manifold setting [2; 3; 13; 22; 23]. In [3], a second-order ODE was proposed as the continuous-time limit of a Riemannian accelerated algorithm, and it was shown that the objective function f(x(t)) converges to its optimal value at a rate of O(1/t²) along solutions of this ODE, generalizing the Euclidean result obtained in [19] to the Riemannian manifold setting. In this paper, we show that in continuous time, the convergence rate of f(x(t)) to its optimal value can be accelerated to an arbitrary convergence rate O(1/t^p) on Riemannian manifolds, thereby generalizing the results of [21] to the Riemannian setting.
This is achieved by considering a family of time-dependent Bregman Lagrangian and Hamiltonian systems on Riemannian manifolds. This also provides a variational framework for accelerated optimization on Riemannian manifolds, generalizing the variational formulation introduced in [21] of accelerated optimization on normed vector spaces. We will then illustrate the derived theoretical convergence rates by integrating the Bregman Euler–Lagrange equations using a simple numerical scheme to solve eigenvalue and distance minimization problems on Riemannian manifolds. Finally, we will show that the family of Bregman dynamics is closed under time rescaling, and we will draw inspiration from the approach introduced in [6] to take advantage of this invariance property via a carefully chosen Poincaré transformation that will allow for the integration of higher-order Bregman dynamics while benefiting from the computational efficiency of integrating lower-order Bregman dynamics on Riemannian manifolds.

2. Definitions and Preliminaries
We first introduce the main objects and notions from Riemannian geometry and Lagrangian and Hamiltonian mechanics that will be used throughout this paper (see [3; 7; 8; 10; 11; 14] for more details).

2.1. Riemannian Geometry.

Definition 2.1.
Suppose we have a Riemannian manifold Q with Riemannian metric g(⋅,⋅) = ⟨⋅,⋅⟩, represented by the positive-definite symmetric matrix (g_ij) in local coordinates. Then, we define the musical isomorphism g♭ : TQ → T*Q via g♭(u)(v) = g_p(u, v) for all p ∈ Q and all u, v ∈ T_pQ, and its inverse musical isomorphism g♯ : T*Q → TQ. The Riemannian metric g(⋅,⋅) = ⟨⋅,⋅⟩ induces a fiber metric g*(⋅,⋅) = ⟪⋅,⋅⟫ on T*Q via

⟪u, v⟫ = ⟨g♯(u), g♯(v)⟩ for all u, v ∈ T*Q,

represented by the positive-definite symmetric matrix (g^ij) in local coordinates, which is the inverse of the Riemannian metric matrix (g_ij). Definition 2.2.
The Riemannian gradient grad f(q) ∈ T_qQ at a point q ∈ Q of a smooth function f : Q → R is the tangent vector at q such that

⟨grad f(q), u⟩ = df(q) u for all u ∈ T_qQ,

where df is the differential of f.

Definition 2.3. A vector field on a Riemannian manifold Q is a map X : Q → TQ such that X(q) ∈ T_qQ for all q ∈ Q. The set of all vector fields on Q is denoted X(Q). The integral curve at q of X ∈ X(Q) is the smooth curve c on Q such that c(0) = q and c′(t) = X(c(t)).

Definition 2.4. A geodesic in a Riemannian manifold Q is a parametrized curve γ : [0, 1] → Q which is of minimal local length. It can be thought of as a curve having zero "acceleration" or constant "speed", that is, as a generalization of the notion of straight line from Euclidean spaces to Riemannian manifolds. Given two points q, q̃ ∈ Q, a vector in T_qQ can be transported to T_q̃Q along a geodesic γ by an operation Γ(γ)_q^q̃ : T_qQ → T_q̃Q called parallel transport along γ. We will simply write Γ_q^q̃ to denote the parallel transport along some geodesic connecting the two points q, q̃ ∈ Q, and given A ∈ X(Q), we will denote by Γ(A) the parallel transport along integral curves of A. Note that parallel transport preserves inner products: given a geodesic γ from q ∈ Q to q̃ ∈ Q,

g_q(u, v) = g_q̃(Γ(γ)_q^q̃ u, Γ(γ)_q^q̃ v) for all u, v ∈ T_qQ.

Definition 2.5.
Given X, Y ∈ X(Q), the covariant derivative ∇_X Y ∈ X(Q) of Y along X is

∇_X Y(q) = lim_{h→0} [Γ(γ)_{γ(h)}^q Y(γ(h)) − Y(q)] / h,

where γ is the unique integral curve of X such that γ(0) = q, for any q ∈ Q.

Definition 2.6.
A function f : Q → R is called L-smooth if for any two points q, q̃ ∈ Q and geodesic γ connecting them,

∥grad f(q) − Γ(γ)_q̃^q grad f(q̃)∥ ≤ L length(γ).

Definition 2.7.
The Riemannian Exponential map Exp_q : T_qQ → Q at q ∈ Q is defined via Exp_q(v) = γ_v(1), where γ_v is the unique geodesic in Q such that γ_v(0) = q and γ′_v(0) = v, for any v ∈ T_qQ. Exp_q is a diffeomorphism in some neighborhood U ⊂ T_qQ containing 0, so we can define its inverse map, the Riemannian Logarithm map Log_q : Exp_q(U) → T_qQ.

Definition 2.8.
Given a Riemannian manifold Q with sectional curvature bounded below by K_min, and an upper bound D for the diameter of the considered domain, define

ζ = √(−K_min) D coth(√(−K_min) D) if K_min < 0, and ζ = 1 if K_min ≥ 0. (2.1)

Note that ζ ≥ 1 since x coth x ≥ 1 for all real values of x.

2.2. Convexity in Riemannian Manifolds.

Definition 2.9.
A subset A of a Riemannian manifold Q is called geodesically uniquely convex if every two points of A are connected by a unique geodesic in A. A function f : Q → R is called geodesically convex if for any two points q, q̃ ∈ Q and geodesic γ connecting them (with γ(0) = q and γ(1) = q̃),

f(γ(t)) ≤ (1 − t) f(q) + t f(q̃) for all t ∈ [0, 1].

Note that if f is a smooth geodesically convex function on a geodesically uniquely convex subset A of a Riemannian manifold, then

f(q) − f(q̃) ≥ ⟨grad f(q̃), Log_q̃(q)⟩ for all q, q̃ ∈ A.

A function f : A → R is called geodesically α-weakly-quasi-convex (α-WQC) with respect to q ∈ Q for some α ∈ (0, 1] if

α (f(q) − f(q̃)) ≥ ⟨grad f(q̃), Log_q̃(q)⟩ for all q̃ ∈ A.
A function f : A → R is called geodesically μ-strongly-convex (μ-SC) for some μ > 0 if

f(q) − f(q̃) ≥ ⟨grad f(q̃), Log_q̃(q)⟩ + (μ/2) ∥Log_q̃(q)∥² for all q, q̃ ∈ A.

A local minimum of a geodesically convex or α-WQC function is also a global minimum, and a geodesically strongly convex function either has no minimum or a unique global minimum.

2.3. Lagrangian and Hamiltonian Mechanics.
Given an n-dimensional Riemannian manifold Q with local coordinates (q¹, ..., qⁿ), a Lagrangian is a function L : TQ × R → R. The action integral S is defined to be the functional

S(q) = ∫₀ᵀ L(q, q̇, t) dt,

over the space of smooth curves q : [0, T] → Q. Hamilton's Variational Principle states that δS = 0, where the variation δS is induced by an infinitesimal variation δq of the trajectory q that vanishes at the endpoints. Hamilton's Variational Principle can be shown to be equivalent to the Euler–Lagrange equations

d/dt (∂L/∂q̇^k) = ∂L/∂q^k for k = 1, ..., n. (2.2)

The Legendre transform FL : TQ → T*Q of L is defined fiberwise via FL : (q^i, q̇^i) ↦ (q^i, p_i), where p_i = ∂L/∂q̇^i ∈ T*Q is the conjugate momentum of q^i. We can then define the associated Hamiltonian H : T*Q × R → R via

H(q, p, t) = [ Σ_{j=1}^n p_j q̇^j − L(q, q̇, t) ] evaluated at p_i = ∂L/∂q̇^i. (2.3)

We can also define a Hamiltonian Variational Principle on the Hamiltonian side in momentum phase space,

δ ∫₀ᵀ [ Σ_{j=1}^n p_j q̇^j − H(q, p, t) ] dt = 0,

where the variation is induced by an infinitesimal variation δq of the trajectory q that vanishes at the endpoints. This is equivalent to Hamilton's equations, given by

ṗ_k = −∂H/∂q^k, q̇^k = ∂H/∂p_k for k = 1, ..., n, (2.4)

which can also be shown to be equivalent to the Euler–Lagrange equations (2.2).

3. Variational Formulation and Convergence Rates
Throughout this paper, we will make the following assumptions on the function f : Q → R to be minimized, and on the ambient Riemannian manifold Q:

Assumption 1.
The solution of the differential equation remains inside a geodesically uniquely convex subset A of a complete Riemannian manifold Q (i.e. any two points in Q can be connected by a geodesic), such that diam(A) is bounded above by some constant D, the sectional curvature is bounded from below by K_min on A, and the exponential map is globally a diffeomorphism. Furthermore, f is bounded below, geodesically L-smooth, and all its minima are inside A.
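Assumption 1 supplies the two quantities, K_min and D, that determine the constant ζ of equation (2.1). As a minimal numerical sketch (illustrative values only, not from the paper), ζ and its basic properties can be computed directly:

```python
import math

def zeta(K_min, D):
    # Equation (2.1): sqrt(-K_min)*D*coth(sqrt(-K_min)*D) when K_min < 0,
    # and 1 when K_min >= 0 (coth computed as cosh/sinh).
    if K_min < 0:
        x = math.sqrt(-K_min) * D
        return x * math.cosh(x) / math.sinh(x)
    return 1.0

# zeta >= 1 always, since x*coth(x) >= 1 for all real x,
# and zeta = 1 exactly in nonnegative curvature.
assert zeta(0.0, 2.0) == 1.0
assert zeta(1.0, 2.0) == 1.0
assert zeta(-1.0, 0.1) > 1.0
assert zeta(-1.0, 2.0) > zeta(-1.0, 0.1)   # grows with curvature and diameter
```

In particular, ζ reduces to 1 in the Euclidean and positively curved settings, so the Riemannian Bregman dynamics below recover their normed-space counterparts there.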
3.1. Convex Case.
Suppose f : Q → R is a geodesically convex function, and that Assumption 1 holds true. We define the corresponding p-Bregman Lagrangian L_p^C : TQ × R → R for p > 0 by

L_p^C(X, V, t) = [t^(ζp+1)/(2p)] ⟨V, V⟩ − Cp t^((ζ+1)p−1) f(X), (3.1)

and the corresponding p-Bregman Hamiltonian H_p^C : T*Q × R → R is given by

H_p^C(X, R, t) = [p/(2 t^(ζp+1))] ⟪R, R⟫ + Cp t^((ζ+1)p−1) f(X), (3.2)

where X ∈ Q denotes position on the manifold Q, V and R are velocity vector and momentum covector fields, t is the time variable, C is a constant, and ζ is given by equation (2.1). This choice of Bregman Lagrangian is inspired by the results of [4; 6; 21], and can be thought of as a generalization of the normed space p-Bregman Lagrangians and Hamiltonians

L(X, V, t) = [t^(p+1)/(2p)] ⟨V, V⟩ − Cp t^(2p−1) f(X), H(X, R, t) = [p/(2 t^(p+1))] ⟨R, R⟩ + Cp t^(2p−1) f(X),

obtained in [6], where the structure of the Riemannian manifold Q has now been incorporated through the constant ζ. These p-Bregman Lagrangians and Hamiltonians arise from the Bregman Lagrangians and Hamiltonians introduced in [21],

L_{α,β,γ}(x, v, t) = e^(α(t)+γ(t)) [ D_h(x + e^(−α(t)) v, x) − e^(β(t)) f(x) ], (3.3)
H_{α,β,γ}(x, r, t) = e^(α(t)+γ(t)) [ D_{h*}(∇h(x) + e^(−γ(t)) r, ∇h(x)) + e^(β(t)) f(x) ], (3.4)

where the Bregman divergence D_h is given by

D_h(y, x) = h(y) − h(x) − ⟨∇h(x), y − x⟩,

for the chosen convex, continuously differentiable function h(x) = ½⟨x, x⟩, and where the Legendre transform (or convex dual function) h* is given by h*(r) = sup_v [⟨r, v⟩ − h(v)]. The parameter functions α, β, γ are chosen to be

α_t = log(ζp) − log t, β_t = p log t + log C − 2 log ζ, γ_t = ζp log t + log ζ,

and these satisfy the ideal scaling conditions β̇_t ≤ e^(α_t) and γ̇_t = e^(α_t). The ideal scaling conditions were necessary conditions introduced in [21] for the Bregman Lagrangians and Hamiltonians to have flows that converge to the minimizer at the rate O(e^(−β_t)). In this paper, we will simplify the exposition by focusing on the more practically relevant case of Bregman Lagrangians and Hamiltonians that are parametrized by p > 0, for which the corresponding convergence rate is O(1/t^p).

Theorem 3.1.
The p-Bregman Euler–Lagrange equation corresponding to the p-Bregman Lagrangian L_p^C is given by

∇_Ẋ Ẋ + [(ζp+1)/t] Ẋ + Cp² t^(p−2) grad f(X) = 0. (3.5)

Proof.
See Appendix A.1, with λ = ζ . Theorem 3.2.
Suppose f : Q → R is a geodesically convex function, and Assumption 1 is satisfied. Then, the p-Bregman Euler–Lagrange equation

∇_Ẋ Ẋ + [(ζp+1)/t] Ẋ + Cp² t^(p−2) grad f(X) = 0,

has a solution, and any solution converges to a minimizer x* of f with rate

f(X(t)) − f(x*) ≤ ζ ∥Log_{x₀}(x*)∥² / (2C t^p). (3.6)

Proof.
See Appendix B.1 for the existence of a solution and Appendix C.1 for the convergence rate. Note that this theorem reduces to Theorem 5 from [3] when p = 2 and C = 1/4.

3.2. Weakly-Quasi-Convex Case. Suppose f : Q → R is a geodesically α-weakly-quasi-convex function, and suppose that Assumption 1 is satisfied. We define the corresponding p-Bregman Lagrangian L_p^WQC : TQ × R → R for p > 0 by

L_p^WQC(X, V, t) = [t^((ζ/α)p+1)/(2p)] ⟨V, V⟩ − Cp t^((ζ/α+1)p−1) f(X), (3.7)

and the corresponding p-Bregman Hamiltonian H_p^WQC : T*Q × R → R is given by

H_p^WQC(X, R, t) = [p/(2 t^((ζ/α)p+1))] ⟪R, R⟫ + Cp t^((ζ/α+1)p−1) f(X), (3.8)

where X ∈ Q denotes position on Q, V and R are velocity vector and momentum covector fields, t is the time variable, C is a constant, and ζ is given by equation (2.1).

Theorem 3.3.
The p-Bregman Euler–Lagrange equation corresponding to the p-Bregman Lagrangian L_p^WQC is given by

∇_Ẋ Ẋ + [(ζp+α)/(αt)] Ẋ + Cp² t^(p−2) grad f(X) = 0. (3.9)

Proof.
See Appendix A.1, with λ = ζ / α . Theorem 3.4.
Suppose f : Q → R is a geodesically α-weakly-quasi-convex function, and suppose that Assumption 1 is satisfied. Then, the p-Bregman Euler–Lagrange equation

∇_Ẋ Ẋ + [(ζp+α)/(αt)] Ẋ + Cp² t^(p−2) grad f(X) = 0,

has a solution, and any solution converges to a minimizer x* of f with rate

f(X(t)) − f(x*) ≤ ζ ∥Log_{x₀}(x*)∥² / (2Cα t^p). (3.10)

Proof.
See Appendix B.1 for the existence of a solution and Appendix C.2 for the convergence rate.

3.3. Strongly Convex Case.
Suppose f : Q → R is a geodesically μ-strongly-convex function, and suppose that Assumption 1 is satisfied. With ζ given by equation (2.1), let

η = (√ζ + 1/√ζ) √μ. (3.11)

We define the corresponding Bregman Lagrangian L^SC : TQ × R → R via

L^SC(X, V, t) = (e^(ηt)/2) ⟨V, V⟩ − e^(ηt) f(X), (3.12)

and the corresponding Bregman Hamiltonian H^SC : T*Q × R → R is given by

H^SC(X, R, t) = (e^(−ηt)/2) ⟪R, R⟫ + e^(ηt) f(X). (3.13)

Theorem 3.5.
The Bregman Euler–Lagrange equation corresponding to the Bregman Lagrangian L^SC is given by

∇_Ẋ Ẋ + η Ẋ + grad f(X) = 0. (3.14)

Proof.
The derivation of the Bregman Euler–Lagrange equation is presented in Appendix A.2. From Theorem 7 in [3], we have the following convergence rate for solutions of the Bregman Euler–Lagrange equation:
Theorem 3.6.
Suppose f : Q → R is a geodesically μ-strongly-convex function, and suppose that Assumption 1 is satisfied. Then, the Bregman Euler–Lagrange equation

∇_Ẋ Ẋ + η Ẋ + grad f(X) = 0,

has a solution, and any solution converges to a minimizer x* of f with rate

f(X(t)) − f(x*) ≤ [ (μ/2) ∥Log_{x₀}(x*)∥² + (f(x₀) − f(x*)) ] e^(−√(μ/ζ) t). (3.15)

Proof.
See Appendix B.2 for the existence of a solution and [3] for the proof of the convergence rate.

4. Numerical Experiments
The p-Bregman Euler–Lagrange equations can be rewritten as the first-order system

Ẋ = V, ∇_V V = −[(λp+1)/t] V − Cp² t^(p−2) grad f(X),

where λ = ζ in the geodesically convex case and λ = ζ/α in the geodesically α-weakly-quasi-convex case, and as the first-order system

Ẋ = V, ∇_V V = −(√ζ + 1/√ζ) √μ V − grad f(X),

for the μ-strongly convex case. As in [3], we can adapt a semi-implicit Euler scheme (explicit Euler update for the velocity V followed by an update for position X based on the updated value of V) to the Riemannian setting to obtain the following algorithm:

Algorithm 1:
Semi-Implicit Euler Integration of the p-Bregman Euler–Lagrange Equations

Input: A function f : Q → R. Constants C, h, p > 0. Initial X₀ ∈ Q and V₀ ∈ T_{X₀}Q.
while convergence criterion is not met do (k denotes the iteration counter, starting at k = 1)
    if f is μ-geodesically strongly convex then
        b_k ← 1 − h(√ζ + 1/√ζ)√μ,  c_k ← 1
    else if f is geodesically convex (λ = ζ) or α-weakly-quasi-convex (λ = ζ/α) then
        b_k ← 1 − (λp+1)/k,  c_k ← Cp²(kh)^(p−2)
    Version I:  a_k ← b_k V_k − h c_k grad f(X_k)
    Version II: a_k ← b_k V_k − h c_k grad f(Exp_{X_k}(h b_k V_k))
    X_{k+1} ← Exp_{X_k}(h a_k),  V_{k+1} ← Γ_{X_k}^{X_{k+1}} a_k
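As an illustration of Algorithm 1, the following sketch implements the Version I update on the unit sphere S^(n−1) for the Rayleigh-type objective f(v) = −vᵀAv, with a randomly generated symmetric matrix A (an assumed toy instance; the problem size, constants C and h, and iteration count are arbitrary choices, and the final renormalization and tangent projection are numerical safeguards rather than part of the algorithm). The sphere has curvature K = 1 ≥ 0, so ζ = 1 by equation (2.1), and we use the convex branch λ = ζ:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10
B = rng.standard_normal((n, n))
A = (B + B.T) / 2.0                    # symmetric test matrix (assumed example)

def grad_f(x):                         # Riemannian gradient of f(v) = -v^T A v
    return -2.0 * (A @ x - (x @ A @ x) * x)

def exp_map(x, u):                     # Exp_x(u) on the unit sphere
    nu = np.linalg.norm(u)
    if nu < 1e-14:
        return x
    return np.cos(nu) * x + np.sin(nu) * (u / nu)

def transport(x, a, s):                # parallel transport of a along its own
    na = np.linalg.norm(a)             # geodesic, i.e. from x to Exp_x(s*a)
    if na < 1e-14:
        return a
    u = a / na
    return na * (np.cos(s * na) * u - np.sin(s * na) * x)

C, h, p, lam = 0.5, 2e-3, 2, 1.0       # zeta = 1 on the sphere, convex branch
x = rng.standard_normal(n); x /= np.linalg.norm(x)
v = np.zeros(n)
for k in range(1, 100001):
    b = 1.0 - (lam * p + 1.0) / k
    c = C * p**2 * (k * h) ** (p - 2)
    a = b * v - h * c * grad_f(x)      # Version I update
    x_new = exp_map(x, h * a)
    v = transport(x, a, h)
    x = x_new / np.linalg.norm(x_new)  # numerical safeguard: stay on sphere
    v -= (v @ x) * x                   # numerical safeguard: keep v tangent

lam_max = np.max(np.linalg.eigvalsh(A))
assert abs(np.linalg.norm(x) - 1.0) < 1e-9
assert x @ A @ x > lam_max - 0.05      # near the top of the Rayleigh quotient
```

The parallel transport exploits the fact that the velocity of a geodesic is itself parallel along that geodesic, which suffices here because a_k is transported along its own geodesic.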
Version I of Algorithm 1 corresponds to the usual update for the semi-implicit Euler scheme, while Version II is inspired by the reformulation of Nesterov's method from [20] that uses a corrected gradient ∇f(X_k + h b_k V_k) instead of the traditional gradient ∇f(X_k). Note that the SIRNAG algorithm presented in [3] corresponds to the special case where p = 2 and C = 1/4.

We first consider a distance minimization problem on the hyperbolic plane H². More precisely, we are minimizing the strongly convex function f(x) = d(x, q)² for a given point q. Note that the hyperbolic plane H² is a manifold with constant negative curvature K = −1.

We then consider a Rayleigh quotient problem: the eigenvectors corresponding to the largest eigenvalue of a symmetric n × n matrix A maximize the Rayleigh quotient (vᵀAv)/(vᵀv) over Rⁿ. Thus, a unit eigenvector v* corresponding to the largest eigenvalue of the matrix A is a minimizer of the function f : S^(n−1) → R, v ↦ f(v) = −vᵀAv, over the unit sphere Q = S^(n−1), which can be thought of as a Riemannian submanifold with constant positive curvature K = 1 of Rⁿ endowed with the Riemannian metric inherited from the Euclidean inner product g_v(u, w) = uᵀw. More information concerning the geometry of S^(n−1), such as its tangent bundle, its orthogonal projection and exponential map, can be found in [1]. Solving the Rayleigh quotient optimization problem efficiently is challenging when the given symmetric matrix A is ill-conditioned and high-dimensional. Note that an efficient algorithm that solves the above minimization problem can also be used to find eigenvectors corresponding to the smallest eigenvalue of A, by using the fact that the eigenvalues of −A are the negatives of the eigenvalues of A.

Experiments carried out in [3] showed that SIRNAG (the p = 2 convex case of Algorithm 1) performs well in practice, and the theory of Section 3 predicts that larger values of p should achieve a faster O(t^(−p)) convergence rate. The numerical results obtained for the distance minimization and Rayleigh minimization problems are illustrated in Figure 1, where all the algorithms were implemented with the same fixed timestep. We can see that the algorithms with larger values of p converge faster, achieving or exceeding their theoretical O(t^(−p)) rates. While Version I of Algorithm 1 exhibits polynomial rates of convergence on the objective functions considered, Version II of Algorithm 1 exhibits a much faster exponential rate of convergence on both examples.

Note however that an increase in the value of p in Algorithm 1, which corresponds to an increase in the order of the Bregman dynamics integrated, requires a decrease in the timestep, in agreement with intuitive expectations. This timestep decrease requirement is especially important due to the polynomially growing coefficient h c_k = h Cp²(kh)^(p−2) multiplying the gradient of f in the updates of the algorithm.
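The growth of this gradient coefficient with p can be made concrete; a small sketch with illustrative constants:

```python
# Growth of the gradient coefficient c_k = C * p^2 * (k*h)^(p-2) in
# Algorithm 1 (illustrative constants, not from the paper's experiments):
# for p = 2 it is constant in k, while for larger p it grows polynomially,
# which is why higher-order Bregman dynamics demand smaller timesteps.
C, h = 1.0, 1e-3
coeff = lambda p, k: C * p**2 * (k * h) ** (p - 2)

assert coeff(2, 10**6) == 4.0                              # p = 2: constant
assert coeff(6, 10**6) / coeff(6, 10**3) == (10**3) ** 4   # p = 6: grows like t^4
```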
Such a decrease in the timestep does not really affect the convergence rate, but the transition between the initialization and convergence phases takes longer. Similar issues arise when discretizing the continuous Euler–Lagrange flow associated with accelerated optimization on vector spaces, and in that situation, it was observed that time-adaptive symplectic integrators based on Hamiltonian variational integrators resulted in dramatically improved robustness and stability. As such, it will be natural to explore generalizations of time-adaptive symplectic integrators based on Hamiltonian variational integrators applied to Poincaré transformed Hamiltonians, that respect the Riemannian manifold structure, in order to yield more robust and stable numerical discretizations of the flows studied in this paper, and thereby construct accelerated optimization algorithms on Riemannian manifolds.
Figure 1. Comparison of the rates of convergence of the μ-strongly convex (SC) Algorithm 1 and of the convex Algorithm 1 with different values of p, with the two versions of the update corresponding to the traditional and corrected gradients, on the Distance Minimization and Rayleigh Minimization problems.

5. Time Invariance and Poincaré Transformation
Let f : Q → R be a given function, and suppose that Assumption 1 is satisfied. In both the cases where f is geodesically convex and α-weakly-quasi-convex, we have formulated in Section 3 a variational framework for the minimization of f, via a p-Bregman Lagrangian L_p : TQ × R → R and a corresponding p-Bregman Hamiltonian H_p : T*Q × R → R for p > 0,

L_p(X, V, t) = [t^(λp+1)/(2p)] ⟨V, V⟩ − Cp t^((λ+1)p−1) f(X), (5.1)
H_p(X, R, t) = [p/(2 t^(λp+1))] ⟪R, R⟫ + Cp t^((λ+1)p−1) f(X), (5.2)

with associated p-Bregman Euler–Lagrange equations given by

∇_Ẋ Ẋ + [(λp+1)/t] Ẋ + Cp² t^(p−2) grad f(X) = 0, (5.3)

where λ = ζ in the geodesically convex case, and λ = ζ/α in the geodesically α-weakly-quasi-convex case. Theorems 3.2 and 3.4 imply that in both cases, solutions to the p-Bregman Euler–Lagrange equations converge to a minimizer of f at a convergence rate of O(1/t^p). Now, the following two theorems show that in both cases, time-rescaling a solution to the p-Bregman Euler–Lagrange equations via τ(t) = t^(p̊/p) yields a solution to the p̊-Bregman Euler–Lagrange equations.

Theorem 5.1.
Suppose f : Q → R is a geodesically convex function, and Assumption 1 is satisfied. Suppose the curve X(t) satisfies the corresponding p-Bregman Euler–Lagrange equation (3.5). Then, the reparametrized curve X(t^(p̊/p)) satisfies the p̊-Bregman Euler–Lagrange equation (3.5).

Proof. See Appendix D with λ = ζ.

Theorem 5.2.
Suppose f : Q → R is a geodesically α-weakly-quasi-convex function, and Assumption 1 is satisfied. Suppose the curve X(t) satisfies the p-Bregman Euler–Lagrange equation (3.9). Then, the reparametrized curve X(t^(p̊/p)) satisfies the p̊-Bregman Euler–Lagrange equation (3.9).

Proof. See Appendix D with λ = ζ/α.

Thus, the entire subfamily of Bregman trajectories indexed by the parameter p can be obtained by speeding up or slowing down along the Bregman curve in spacetime corresponding to any specific value of p. Inspired by the computational efficiency of the approach introduced in [6], it is natural to attempt to exploit the time-rescaling property of the Bregman dynamics together with a carefully chosen Poincaré transformation to transform the p-Bregman Hamiltonian into an autonomous version of the p̊-Bregman Hamiltonian in extended phase-space, where p̊ < p. This would allow us to integrate the higher-order p-Bregman dynamics while benefiting from the computational efficiency of integrating the lower-order p̊-Bregman dynamics. Explicitly, the time rescaling τ(t) = t^(p̊/p) is associated to the monitor function

dt/dτ = g_{p→p̊}(t) = (p/p̊) t^(1−p̊/p), (5.4)

and generates a Poincaré transformed Hamiltonian

H̄_{p→p̊}(X̄, R̄) = g_{p→p̊}(X_t) (H_p(X̄, R) + R_t), (5.5)

in the extended space Q̄ = Q × R, where X̄ = (X, X_t) and R̄ = (R, R_t). We will make the conventional choice X_t = t, and set R_t(0) = −H_p(X(0), R(0), 0) = −H₀, chosen so that H̄_{p→p̊}(X̄(0), R̄(0)) = 0. The time t shall be referred to as the physical time, while τ will be referred to as the fictive time. The corresponding Hamiltonian equations of motion in the extended phase space are then given by

dX̄/dτ = ∂H̄_{p→p̊}/∂R̄, dR̄/dτ = −∂H̄_{p→p̊}/∂X̄.
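This construction can be exercised numerically on a one-dimensional Euclidean instance (an assumed toy example with f(x) = x²/2, λ = ζ = 1, and arbitrary p, p̊, C, not from the paper): integrating the extended equations of motion with a classical RK4 scheme, the transformed Hamiltonian should stay at its initial value 0, and the physical-time component should follow X_t(τ) = τ^(p/p̊):

```python
import math

lam, p, pring, C = 1.0, 4.0, 2.0, 1.0
f  = lambda x: 0.5 * x * x
df = lambda x: x

def g(t):                       # monitor function (5.4)
    return (p / pring) * t ** (1.0 - pring / p)

def dg(t):
    return (p / pring) * (1.0 - pring / p) * t ** (-pring / p)

def Hp(x, r, t):                # p-Bregman Hamiltonian (5.2) on R
    return 0.5 * p * t ** (-(lam * p + 1)) * r * r \
         + C * p * t ** ((lam + 1) * p - 1) * f(x)

def dHp_dt(x, r, t):
    return -0.5 * p * (lam * p + 1) * t ** (-(lam * p + 2)) * r * r \
         + C * p * ((lam + 1) * p - 1) * t ** ((lam + 1) * p - 2) * f(x)

def rhs(y):                     # extended Hamilton's equations
    x, r, xt, rt = y
    return (g(xt) * p * xt ** (-(lam * p + 1)) * r,                  # dX/dtau
            -g(xt) * C * p * xt ** ((lam + 1) * p - 1) * df(x),      # dR/dtau
            g(xt),                                                   # dX_t/dtau
            -dg(xt) * (Hp(x, r, xt) + rt) - g(xt) * dHp_dt(x, r, xt))

y = [1.0, 0.0, 1.0, -Hp(1.0, 0.0, 1.0)]    # so that Hbar = 0 initially (t0 = 1)
tau, dtau = 1.0, 2e-5
while tau < 3.0:                            # physical time runs from 1 to ~9
    k1 = rhs(y)
    k2 = rhs([yi + 0.5 * dtau * ki for yi, ki in zip(y, k1)])
    k3 = rhs([yi + 0.5 * dtau * ki for yi, ki in zip(y, k2)])
    k4 = rhs([yi + dtau * ki for yi, ki in zip(y, k3)])
    y = [yi + dtau * (a + 2*b + 2*c + d) / 6.0
         for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]
    tau += dtau

x, r, xt, rt = y
Hbar = g(xt) * (Hp(x, r, xt) + rt)
assert abs(xt - tau ** (p / pring)) < 1e-6   # t(tau) = tau^(p/pring)
assert abs(Hbar) < 1e-3                      # conserved extended Hamiltonian
assert f(x) < 0.05                           # objective decreased from f(1) = 0.5
```

Note that the integration uses a constant step in the fictive time τ, which corresponds to a growing step in the physical time t, as described below.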
Now, suppose (X̄(τ), R̄(τ)) are solutions to these extended equations of motion, and let (X(t), R(t)) solve Hamilton's equations for the original Hamiltonian H_p. Then

H̄_{p→p̊}(X̄(τ), R̄(τ)) = H̄_{p→p̊}(X̄(0), R̄(0)) = 0.

Thus, the components (X(τ), R(τ)) in the original phase space of (X̄(τ), R̄(τ)) satisfy

H_p(X(τ), R(τ)) = H₀.

Therefore, (X(τ), R(τ)) and (X(t), R(t)) both satisfy Hamilton's equations for the original Hamiltonian H_p with the same initial values, so they must be the same. As a consequence, instead of integrating the p-Bregman Hamiltonian system (5.2), we can focus on the Poincaré transformed Hamiltonian H̄_{p→p̊} in extended phase-space given by equation (5.5), with H_p and g_{p→p̊} given by equations (5.2) and (5.4), that is

H̄_{p→p̊}(X̄, R̄) = [p²/(2p̊)] (X_t)^(−λp−p̊/p) ⟪R, R⟫ + (Cp²/p̊) (X_t)^((λ+1)p−p̊/p) f(X) + (p/p̊) (X_t)^(1−p̊/p) R_t,

where λ = ζ if f is geodesically convex, and λ = ζ/α if f is geodesically α-weakly-quasi-convex. The resulting integrator has constant timestep in fictive time τ but variable timestep in physical time t.

In our prior work on discretizations of variational formulations of accelerated optimization on normed vector spaces [6], we performed a very careful computational study of how time-adaptivity and symplecticity of the numerical scheme improve the performance of the resulting numerical optimization algorithm. In particular, we observed that time-adaptive Hamiltonian variational discretizations, which are automatically symplectic, and where the adaptive timesteps are informed by the time invariance of the family of p-Bregman Lagrangians and Hamiltonians, yielded the most robust and computationally efficient numerical optimization algorithms, outperforming fixed-timestep symplectic discretizations, adaptive-timestep non-symplectic discretizations, and Nesterov's accelerated gradient algorithm, which is neither time-adaptive nor symplectic. As such, it would be desirable to generalize the time-adaptive Hamiltonian variational integrator framework to Riemannian manifolds, and apply it to the variational formulation of accelerated optimization on Riemannian manifolds.

6. Conclusion
We have shown that on Riemannian manifolds, the convergence rate in continuous time of a geodesically convex, α-weakly-quasi-convex, or μ-strongly convex function f(x(t)) to its optimal value can be accelerated to an arbitrary convergence rate O(1/t^p), which extends the results of [21] from normed vector spaces to Riemannian manifolds. This rate of convergence is achieved along solutions of the Euler–Lagrange and Hamilton's equations corresponding to a family of time-dependent Bregman Lagrangian and Hamiltonian systems on Riemannian manifolds. As was demonstrated in the normed vector space setting, such families of Bregman Lagrangians and Hamiltonians can be used to construct practical, robust, and computationally efficient numerical optimization algorithms that outperform Nesterov's accelerated gradient method, by considering geometric structure-preserving discretizations of the continuous-time flows.

Numerical experiments implementing a simple discretization of the p-Bregman Euler–Lagrange equations, applied to distance minimization and Rayleigh minimization problems, confirmed that the higher-order algorithms significantly outperform their lower-order counterparts and exceed their theoretical O(t^(−p)) convergence rates. Numerical results also showed that using a corrected gradient in the update instead of the traditional gradient, as was done in [20], improved the theoretically predicted polynomial convergence rate to an exponential rate of convergence in practice. While higher values of p result in faster rates of convergence, they also appear to be more prone to stability issues under numerical discretization, which can cause the numerical optimization algorithm to diverge, but we anticipate that symplectic discretizations will address these stability issues. Finally, in analogy to what was done in [21] for normed vector spaces, we proved that the family of time-dependent Bregman Lagrangians and Hamiltonians on Riemannian manifolds is closed under time rescaling.
Inspired by the computational efficiency of the approach introduced in [6], we canthen exploit this invariance property via a carefully chosen Poincar´e transformation that will allowus to integrate higher-order p -Bregman dynamics while benefiting from the computational efficiencyof integrating a lower-order ˚ p -Bregman Hamiltonian system.It was observed in our prior computational experiments in the normed vector space case [6]that geometric discretizations which respect the time-rescaling invariance and symplecticity of theBregman Lagrangian and Hamiltonian flows were substantially less prone to stability issues, andwere therefore more robust, reliable, and computationally efficient. As such, it is natural to developtime-adaptive Hamiltonian variational integrators for the Bregman Hamiltonian introduced in thispaper describing accelerated optimization on Riemannian manifolds.This will require some additional work, since the current approach for Hamiltonian variationalintegrators involves Type II/Type III generating functions H + d ( q k .p k + ) , H − d ( p k , q k + ) , which de-pend on the position at one boundary point, and the momentum at the other boundary point.However, this does not make intrinsic sense on a manifold, since one needs the base point inorder to specify the corresponding cotangent space, and one should ideally consider a Hamilton-ian variational integrator construction based on discrete Dirac mechanics [12], which would yielda generating function E + d ( q k , q k + , p k + ) , E − d ( q k , p k , q k + ) , that depends on the position at bothboundary points and the momentum at one of the boundary points, which can be viewed as adiscretization of the generalized energy E ( q, v, p ) = ⟨ p, v ⟩ − L ( q, v ) , in contrast to the Hamiltonian H ( q, p ) = ext v ⟨ p, v ⟩ − L ( q, v ) = ⟨ p, v ⟩ − L ( q, v )∣ p = ∂L∂v . 
Applying such a generalization of Hamiltonian variational integrators to the Bregman Hamiltonians introduced in this paper will yield a novel class of robust and efficient accelerated optimization algorithms on Riemannian manifolds.

Acknowledgments
The authors were supported in part by NSF under grants DMS-1411792, DMS-1345013, DMS-1813635, and by AFOSR under grant FA9550-18-1-0288.
Appendix A. Derivation of the Euler–Lagrange Equations
A.1. Convex and Weakly-Quasi-Convex Cases.

Theorem A.1.
The Euler–Lagrange equation corresponding to the Lagrangian
$$\mathcal{L}(X,V,t) = \frac{t^{\lambda p + 1}}{2p}\,\langle V, V\rangle - Cp\, t^{(\lambda+1)p-1} f(X)$$
is given by
$$\nabla_{\dot X}\dot X + \frac{\lambda p + 1}{t}\,\dot X + Cp^2 t^{p-2}\,\mathrm{grad}f(X) = 0.$$

Proof.
Consider a path on the manifold $Q$ described in coordinates by $(x(t), \dot x(t)) = (q^1(t), \ldots, q^n(t), v^1(t), \ldots, v^n(t))$. Then, with $\langle \cdot, \cdot \rangle = \sum_{i,j=1}^{n} g_{ij}\, dx^i\, dx^j$, the $p$-Bregman Lagrangian can be written as
$$\mathcal{L}(x(t), \dot x(t), t) = \frac{t^{\lambda p+1}}{2p} \sum_{i,j=1}^{n} g_{ij}(x(t))\, v^i(t)\, v^j(t) - Cp\, t^{(\lambda+1)p-1} f(x(t)).$$
The Euler–Lagrange equations are given by
$$\frac{d}{dt}\left( \frac{\partial \mathcal{L}}{\partial v^k}(x(t), \dot x(t), t) \right) = \frac{\partial \mathcal{L}}{\partial q^k}(x(t), \dot x(t), t) \qquad \text{for } k = 1, \ldots, n,$$
and here, for $k = 1, \ldots, n$,
\begin{align*}
\frac{d}{dt}\left( \frac{\partial \mathcal{L}}{\partial v^k}(x(t), \dot x(t), t) \right) &= \frac{t^{\lambda p+1}}{p} \sum_{i=1}^{n} g_{ik}(x(t)) \frac{dv^i}{dt}(t) + \frac{t^{\lambda p+1}}{p} \sum_{i,j=1}^{n} \frac{\partial g_{kj}}{\partial q^i}(x(t))\, v^i(t)\, v^j(t) + \frac{\lambda p+1}{p}\, t^{\lambda p} \sum_{i=1}^{n} g_{ik}(x(t))\, v^i(t),\\
\frac{\partial \mathcal{L}}{\partial q^k}(x(t), \dot x(t), t) &= \frac{t^{\lambda p+1}}{2p} \sum_{i,j=1}^{n} \frac{\partial g_{ij}}{\partial q^k}(x(t))\, v^i(t)\, v^j(t) - Cp\, t^{(\lambda+1)p-1}\, \frac{\partial f}{\partial q^k}(x(t)).
\end{align*}
If we multiply both sides by $p\, t^{-(\lambda p+1)}$, the Euler–Lagrange equations are given, for $k = 1, \ldots, n$, by
$$0 = \sum_{i=1}^{n} g_{ik}(x(t)) \frac{dv^i}{dt}(t) + \sum_{i,j=1}^{n} \frac{\partial g_{kj}}{\partial q^i}(x(t))\, v^i(t)\, v^j(t) + \frac{\lambda p+1}{t} \sum_{i=1}^{n} g_{ik}(x(t))\, v^i(t) - \frac{1}{2}\sum_{i,j=1}^{n} \frac{\partial g_{ij}}{\partial q^k}(x(t))\, v^i(t)\, v^j(t) + Cp^2 t^{p-2}\, \frac{\partial f}{\partial q^k}(x(t)).$$
Multiplying by the matrix $(g^{ij})$, which is the inverse of $(g_{ij})$, we get for $k = 1, \ldots, n$
$$\left( \frac{dv^k}{dt}(t) + \sum_{i,j=1}^{n} \Gamma^k_{ij}(x(t))\, v^i(t)\, v^j(t) \right) + \frac{\lambda p+1}{t}\, v^k(t) + Cp^2 t^{p-2}\, \big( \mathrm{grad}f(x(t)) \big)^k = 0,$$
where the $\Gamma^k_{ij}$ are the Christoffel symbols, given by
$$\Gamma^k_{ij} = \frac{1}{2} \sum_{l=1}^{n} g^{kl} \left[ \frac{\partial g_{jl}}{\partial x^i} + \frac{\partial g_{li}}{\partial x^j} - \frac{\partial g_{ij}}{\partial x^l} \right].$$
Therefore, the Euler–Lagrange equation is given by
$$\nabla_{\dot X}\dot X + \frac{\lambda p+1}{t}\,\dot X + Cp^2 t^{p-2}\,\mathrm{grad}f(X) = 0. \qquad \square$$

A.2. Strongly Convex Case.

Theorem A.2. The Bregman Euler–Lagrange equation corresponding to the Bregman Lagrangian $\mathcal{L}^{SC}$ is given by
$$\nabla_{\dot X}\dot X + \eta\,\dot X + \mathrm{grad}f(X) = 0.$$

Proof.
Consider a path on the manifold $Q$ described in coordinates by $(x(t), \dot x(t)) = (q^1(t), \ldots, q^n(t), v^1(t), \ldots, v^n(t))$. Then, with $\langle \cdot, \cdot \rangle = \sum_{i,j=1}^{n} g_{ij}\, dx^i\, dx^j$, the Bregman Lagrangian can be written as
$$\mathcal{L}^{SC}(x(t), \dot x(t), t) = \frac{e^{\eta t}}{2} \sum_{i,j=1}^{n} g_{ij}(x(t))\, v^i(t)\, v^j(t) - e^{\eta t} f(x(t)).$$
The Euler–Lagrange equations are given by
$$\frac{d}{dt}\left( \frac{\partial \mathcal{L}^{SC}}{\partial v^k}(x(t), \dot x(t), t) \right) = \frac{\partial \mathcal{L}^{SC}}{\partial q^k}(x(t), \dot x(t), t) \qquad \text{for } k = 1, \ldots, n,$$
and here, for $k = 1, \ldots, n$,
\begin{align*}
\frac{d}{dt}\left( \frac{\partial \mathcal{L}^{SC}}{\partial v^k}(x(t), \dot x(t), t) \right) &= e^{\eta t} \sum_{i=1}^{n} g_{ik}(x(t)) \frac{dv^i}{dt}(t) + e^{\eta t} \sum_{i,j=1}^{n} \frac{\partial g_{kj}}{\partial q^i}(x(t))\, v^i(t)\, v^j(t) + \eta\, e^{\eta t} \sum_{i=1}^{n} g_{ik}(x(t))\, v^i(t),\\
\frac{\partial \mathcal{L}^{SC}}{\partial q^k}(x(t), \dot x(t), t) &= \frac{e^{\eta t}}{2} \sum_{i,j=1}^{n} \frac{\partial g_{ij}}{\partial q^k}(x(t))\, v^i(t)\, v^j(t) - e^{\eta t}\, \frac{\partial f}{\partial q^k}(x(t)).
\end{align*}
If we multiply both sides by $e^{-\eta t}$, the Euler–Lagrange equations are given, for $k = 1, \ldots, n$, by
$$0 = \sum_{i=1}^{n} g_{ik}(x(t)) \frac{dv^i}{dt}(t) + \sum_{i,j=1}^{n} \frac{\partial g_{kj}}{\partial q^i}(x(t))\, v^i(t)\, v^j(t) + \eta \sum_{i=1}^{n} g_{ik}(x(t))\, v^i(t) - \frac{1}{2}\sum_{i,j=1}^{n} \frac{\partial g_{ij}}{\partial q^k}(x(t))\, v^i(t)\, v^j(t) + \frac{\partial f}{\partial q^k}(x(t)).$$
Rearranging terms, and multiplying by the matrix $(g^{ij})$, which is the inverse of $(g_{ij})$, we get for $k = 1, \ldots, n$
$$\left( \frac{dv^k}{dt}(t) + \sum_{i,j=1}^{n} \Gamma^k_{ij}(x(t))\, v^i(t)\, v^j(t) \right) + \eta\, v^k(t) + \big( \mathrm{grad}f(x(t)) \big)^k = 0,$$
where the $\Gamma^k_{ij}$ are the Christoffel symbols, given by
$$\Gamma^k_{ij} = \frac{1}{2} \sum_{l=1}^{n} g^{kl} \left[ \frac{\partial g_{jl}}{\partial x^i} + \frac{\partial g_{li}}{\partial x^j} - \frac{\partial g_{ij}}{\partial x^l} \right].$$
Therefore, the Bregman Euler–Lagrange equation is given by
$$\nabla_{\dot X}\dot X + \eta\,\dot X + \mathrm{grad}f(X) = 0. \qquad \square$$

Appendix B. Proof of Existence Theorems
B.1. Convex and α-Weakly-Quasi-Convex Cases.

Theorem B.1. Suppose Assumption 1 is satisfied, and let $C, p > 0$ and $v > 0$ be given constants. Then the differential equation
$$\nabla_{\dot X}\dot X + \frac{v}{t}\,\dot X + Ct^{p-2}\,\mathrm{grad}f(X) = 0$$
has a global solution $X : [0, \infty) \to Q$ under the initial conditions $X(0) = x_0 \in Q$ and $\dot X(0) = 0$.

Proof.
The proof is similar to that of Lemma 3 in [3], which extended Theorem 1 in [19] to the Riemannian setting. We first define a family of smoothed equations for which we then show existence of a solution for all time. After choosing an equicontinuous and uniformly bounded subfamily of smoothed solutions, we use the Arzelà–Ascoli Theorem on the complete Riemannian manifold $Q$ to obtain a subsequence converging uniformly, and argue that the limit of this subsequence is a solution of the original problem. When $p = 2$, we recover the simpler case considered in Lemma 3 of [3], so we assume $p \neq 2$. For $\delta > 0$, consider the smoothed equations
\begin{align*}
\nabla_{\dot X}\dot X + \frac{v}{\max(\delta,t)}\,\dot X + C\big(\max(\delta,t)\big)^{p-2}\,\mathrm{grad}f(X) &= 0 && \text{if } p < 2,\\
\nabla_{\dot X}\dot X + \frac{v}{\max(\delta,t)}\,\dot X + Ct^{p-2}\,\mathrm{grad}f(X) &= 0 && \text{if } p > 2.
\end{align*}
Exp and Log are defined globally on $Q$ by Assumption 1, so we can choose geodesically normal coordinates $\varphi = \psi^{-1}$ around $x_0$, defined globally on $Q$, and put $c = \varphi \circ X$. Using the smoothness of $f$ and letting $u = \dot c$ gives a system of first-order ODEs defining a local representation for a vector field in $TQ$, and section IV.3 of [10] guarantees that the smoothed ODE has a unique solution $X_\delta$ locally around 0. In fact, $X_\delta$ exists on $[0, \infty)$. Indeed, suppose towards a contradiction that $[0, T)$ is the maximal interval of existence of $X_\delta$, for some finite $T > 0$.
Using $\frac{d}{dt}f(X_\delta(t)) = \langle \mathrm{grad}f(X_\delta), \dot X_\delta\rangle$ gives
\begin{align*}
\frac{d}{dt}f(X_\delta) &= -\frac{\delta^{2-p}}{C}\langle\nabla_{\dot X_\delta}\dot X_\delta, \dot X_\delta\rangle - \frac{v\delta^{1-p}}{C}\langle\dot X_\delta, \dot X_\delta\rangle = -\frac{\delta^{2-p}}{2C}\frac{d}{dt}\|\dot X_\delta\|^2 - \frac{v\delta^{1-p}}{C}\|\dot X_\delta\|^2 && \text{if } \delta > t,\ p < 2,\\
\frac{d}{dt}f(X_\delta) &= -\frac{t^{2-p}}{C}\langle\nabla_{\dot X_\delta}\dot X_\delta, \dot X_\delta\rangle - \frac{vt^{2-p}}{C\delta}\langle\dot X_\delta, \dot X_\delta\rangle = -\frac{t^{2-p}}{2C}\frac{d}{dt}\|\dot X_\delta\|^2 - \frac{vt^{2-p}}{C\delta}\|\dot X_\delta\|^2 && \text{if } \delta > t,\ p > 2,\\
\frac{d}{dt}f(X_\delta) &= -\frac{t^{2-p}}{C}\langle\nabla_{\dot X_\delta}\dot X_\delta, \dot X_\delta\rangle - \frac{vt^{1-p}}{C}\langle\dot X_\delta, \dot X_\delta\rangle = -\frac{1}{2C}\frac{d}{dt}\big(t^{2-p}\|\dot X_\delta\|^2\big) - \frac{2v-(2-p)}{2C}\, t^{1-p}\|\dot X_\delta\|^2 && \text{if } \delta < t.
\end{align*}
Let $\theta = \frac{2v-(2-p)}{2C}$. Integrating and using the Cauchy–Schwarz inequality gives, in the case $p < 2$,
\begin{align*}
\int_0^T \sqrt{(\max(\delta,t))^{1-p}}\,\|\dot X_\delta\|\, dt &= \int_0^\delta \sqrt{\delta^{1-p}}\,\|\dot X_\delta\|\, dt + \int_\delta^T \sqrt{t^{1-p}}\,\|\dot X_\delta\|\, dt\\
&\leq \sqrt{\frac{C\delta}{v}\Big(f(x_0) - \inf_u f(u)\Big) + \frac{\delta^{3-p}}{2v}\Big(\|\dot X_\delta(0)\|^2 - \inf_{t\in[0,T)}\|\dot X_\delta(t)\|^2\Big)}\\
&\quad + \sqrt{\frac{T-\delta}{\theta}\Big(f(X_\delta(\delta)) - \inf_u f(u)\Big) + \frac{T-\delta}{2C\theta}\Big(\delta^{2-p}\|\dot X_\delta(\delta)\|^2 - \inf_{t\in[0,T)} t^{2-p}\|\dot X_\delta(t)\|^2\Big)} < \infty,
\end{align*}
since $f$ is bounded below by Assumption 1. If $\delta \geq T$, then $\sqrt{\delta^{1-p}}\,\dot X_\delta$ is integrable on $[0,T)$. If $\delta < T$, then the integrals on $[0,T)$ and $[0,\delta)$ are finite, so the integral on $[\delta,T)$ must also be finite, so $\sqrt{t^{1-p}}\,\dot X_\delta$ is integrable on $[\delta,T)$. Now, $\big\|\int_a^T \dot X_\delta\, dt\big\| \leq \int_a^T \|\dot X_\delta\|\, dt < \infty$ for $a = 0, \delta$ implies that $\lim_{t\to T} X_\delta(t)$ exists. Since $Q$ is complete by Assumption 1, the limit is in $Q$, contradicting the maximality of $[0,T)$. In the $p > 2$ case, the integrand is replaced by $\sqrt{t^{2-p}(\max(\delta,t))^{-1}}\,\|\dot X_\delta\|$, and the integral on $[\delta,T)$ remains unchanged while the integral on $[0,\delta)$ can be bounded by the same expression using $t < \delta$. Thus, in both cases, we can find a solution $X_\delta : [0,\infty) \to Q$ to the smoothed initial value problem, and its corresponding solution $X_\delta : [0,\infty) \to \mathbb{R}^n$ in local coordinates. Now let
$$M_\delta(t) = \sup_{u \in (0,t]} \frac{\|\dot X_\delta(u)\|}{u}.$$
When $0 < t \leq \delta$, the smoothed ODE can be written as
\begin{align*}
\nabla_{\dot X_\delta}\big(\dot X_\delta\, e^{vt/\delta}\big) &= -C\delta^{p-2}\,\mathrm{grad}f(X_\delta)\, e^{vt/\delta} && \text{if } p < 2,\\
\nabla_{\dot X_\delta}\big(\dot X_\delta\, e^{vt/\delta}\big) &= -Ct^{p-2}\,\mathrm{grad}f(X_\delta)\, e^{vt/\delta} && \text{if } p > 2.
\end{align*}
Thus, we can use Lemma 4 in [3] to get, for $p > 2$,
\begin{align*}
\Gamma^{x_0}_{X_\delta(t)}\dot X_\delta(t) &= -e^{-vt/\delta}\int_0^t \Big(\Gamma^{x_0}_{X_\delta(u)}\mathrm{grad}f(X_\delta(u)) - \Gamma^{x_0}_{X_\delta(u)}\Gamma^{X_\delta(u)}_{x_0}\mathrm{grad}f(x_0)\Big)\, Cu^{p-2} e^{vu/\delta}\, du\\
&\quad - e^{-vt/\delta}\int_0^t Cu^{p-2}\,\Gamma^{x_0}_{X_\delta(u)}\Gamma^{X_\delta(u)}_{x_0}\mathrm{grad}f(x_0)\, e^{vu/\delta}\, du,
\end{align*}
where $\Gamma^y_x$ denotes parallel transport from $x$ to $y$. From the Lipschitz assumption on $f$, we have that
$$\big\|\mathrm{grad}f(X_\delta(u)) - \Gamma^{X_\delta(u)}_{x_0}\mathrm{grad}f(x_0)\big\| \leq L\int_0^u \|\dot X_\delta(s)\|\, ds = L\int_0^u s\,\frac{\|\dot X_\delta(s)\|}{s}\, ds \leq \frac{L}{2}\, M_\delta(u)\, u^2.$$
Thus, since parallel transport preserves inner products,
$$\frac{\|\dot X_\delta(t)\|}{t} \leq \Big(\frac{CL}{2}M_\delta(\delta)\,\delta^p + C\delta^{p-2}\|\mathrm{grad}f(x_0)\|\Big)\, \frac{e^{-vt/\delta}}{t}\int_0^t e^{vu/\delta}\, du \leq \Big(\frac{CL}{2}M_\delta(\delta)\,\delta^p + C\delta^{p-2}\|\mathrm{grad}f(x_0)\|\Big)\, \frac{\delta}{vt}\big(1 - e^{-vt/\delta}\big) \leq \frac{CL}{2}M_\delta(\delta)\,\delta^p + C\delta^{p-2}\|\mathrm{grad}f(x_0)\|.$$
Taking the supremum over $0 < t \leq \delta$ and rearranging gives, for $\delta < \delta_M = \big(\tfrac{2}{CL}\big)^{1/p}$, that
$$M_\delta(\delta) \leq \frac{C\delta^{p-2}\,\|\mathrm{grad}f(x_0)\|}{1 - \tfrac{1}{2}CL\delta^p}.$$
The case $p < 2$ is obtained by replacing $u^{p-2}$ by $\delta^{p-2}$ in the integrals, since the $t^{p-2}$ term in the differential equation is already replaced by $\delta^{p-2}$.

Then, when $\delta < \delta_M$ and $\delta < t < t_M = \big(\tfrac{2(v+p+1)}{CL}\big)^{1/p}$, the smoothed ODE can be rewritten as
$$\frac{d}{dt}\big(t^v \dot X_\delta(t)\big) = -Ct^{v+p-2}\,\mathrm{grad}f(X_\delta),$$
so using Lemma 4 in [3] once again gives
\begin{align*}
\Gamma^{X_\delta(\delta)}_{X_\delta(t)}\, t^v \dot X_\delta(t) - \delta^v \dot X_\delta(\delta) &= -\int_\delta^t \Big(\Gamma^{X_\delta(\delta)}_{X_\delta(u)}\mathrm{grad}f(X_\delta(u)) - \Gamma^{X_\delta(\delta)}_{X_\delta(u)}\Gamma^{X_\delta(u)}_{x_0}\mathrm{grad}f(x_0)\Big)\, Cu^{v+p-2}\, du\\
&\quad - \int_\delta^t Cu^{v+p-2}\,\Gamma^{X_\delta(\delta)}_{X_\delta(u)}\Gamma^{X_\delta(u)}_{x_0}\mathrm{grad}f(x_0)\, du.
\end{align*}
Using the fact that parallel transport preserves inner products, and dividing by $t^{v+1}$, gives
\begin{align*}
\frac{\|\dot X_\delta(t)\|}{t} &\leq \frac{\delta^{v+1}}{t^{v+1}}\,\frac{\|\dot X_\delta(\delta)\|}{\delta} + \frac{CL}{2t^{v+1}}\int_\delta^t M_\delta(u)\, u^{v+p}\, du + \frac{C}{t^{v+1}}\,\|\mathrm{grad}f(x_0)\|\int_\delta^t u^{v+p-2}\, du\\
&\leq \frac{\delta^{v+1}}{t^{v+1}}\cdot\frac{C\delta^{p-2}\,\|\mathrm{grad}f(x_0)\|}{1 - \tfrac{1}{2}CL\delta^p} + \frac{CL}{2(v+p+1)}\, M_\delta(t)\, t^p + \frac{C\big(t^{v+p-1} - \delta^{v+p-1}\big)}{(v+p-1)\, t^{v+1}}\,\|\mathrm{grad}f(x_0)\|,
\end{align*}
and since this upper bound is an increasing function of $t$, we have for any $t' \in (\delta, t)$ that
$$\frac{\|\dot X_\delta(t')\|}{t'} \leq \frac{C\delta^{p-2}\,\|\mathrm{grad}f(x_0)\|}{1 - \tfrac{1}{2}CL\delta^p} + \frac{CL}{2(v+p+1)}\, M_\delta(t)\, t^p + \frac{Ct^{p-2}}{v+p-1}\,\|\mathrm{grad}f(x_0)\|.$$
Taking the supremum over all $t' \in (0,t)$ gives, for $\delta < \delta_M$ and $\delta < t < t_M$,
$$M_\delta(t) \leq \Big(1 - \frac{CL}{2(v+p+1)}\, t^p\Big)^{-1}\Big(\frac{C\delta^{p-2}}{1 - \tfrac{1}{2}CL\delta^p} + \frac{Ct^{p-2}}{v+p-1}\Big)\,\|\mathrm{grad}f(x_0)\|.$$
Now consider the family of functions
$$\mathcal{F} = \Big\{ X_\delta : [0,T] \to \mathbb{R}^n \ \Big|\ \delta = 2^{-n}\tilde\delta,\ n = 1, 2, \ldots \Big\},$$
where $T = \big(\tfrac{v+p+1}{CL}\big)^{1/p}$ and $\tilde\delta = \big(\tfrac{1}{CL}\big)^{1/p}$. By definition of $M_\delta$, we have for $t \in [0,T]$ and $\delta \in (0, \tilde\delta)$ that
$$\|\dot X_\delta\|_\infty \leq T\, M_\delta(T) \leq 2CT\Big(2\tilde\delta^{\,p-2} + \frac{T^{p-2}}{v+p-1}\Big)\,\|\mathrm{grad}f(x_0)\|$$
and
$$d\big(X_\delta(t), X_\delta(0)\big) \leq \int_0^t \|\dot X_\delta(u)\|\, du \leq t\,\|\dot X_\delta\|_\infty \leq T\,\|\dot X_\delta\|_\infty.$$
Thus, $\mathcal{F}$ is equicontinuous and uniformly bounded, and the Riemannian manifold $Q$ is complete by Assumption 1, so by the Arzelà–Ascoli Theorem (Theorem 17 in [9]), $\mathcal{F}$ contains a subsequence that converges uniformly on $[0,T]$ to some function $X^*$. The same argument as in part 5 of the proof of Lemma 3 of [3] shows that $X^*$ is a solution to the original ODE on $[0,T]$, which can then be extended to a global solution on $[0,\infty)$, and that $X^*$ satisfies the initial conditions $X(0) = x_0$ and $\dot X(0) = 0$. $\square$

B.2. Strongly Convex Case.

Theorem B.2.
Suppose that Assumption 1 is satisfied, and that $\eta > 0$ is a given constant. Then the differential equation
$$\nabla_{\dot X}\dot X + \eta\,\dot X + \mathrm{grad}f(X) = 0$$
has a global solution $X : [0,\infty) \to Q$ under the initial conditions $X(0) = x_0 \in Q$ and $\dot X(0) = 0$.

Proof.
Exp and Log are defined globally on $Q$ by Assumption 1, so we can choose geodesically normal coordinates $\varphi = \psi^{-1}$ around $x_0$, defined globally on $Q$, and put $c = \varphi \circ X$. As in [3], using the smoothness of $f$ and letting $u = \dot c$ gives a system of first-order ODEs which defines a local representation for a vector field in $TQ$, and results from section IV.3 of [10] guarantee that the initial value problem has a unique solution locally around 0. It remains to show that this solution actually exists on $[0,\infty)$. Towards a contradiction, suppose $[0,T)$ is the maximal interval of existence of the solution $X$, for some finite $T > 0$. Then,
$$\frac{d}{dt}f(X(t)) = \langle \mathrm{grad}f(X), \dot X\rangle = -\langle \nabla_{\dot X}\dot X, \dot X\rangle - \eta\langle \dot X, \dot X\rangle = -\frac{1}{2}\frac{d}{dt}\|\dot X\|^2 - \eta\,\|\dot X\|^2.$$
Rearranging, integrating both sides, and using the Cauchy–Schwarz inequality gives
$$\int_0^T \|\dot X\|\, dt \leq \sqrt{\frac{T}{\eta}\Big(f(x_0) - \inf_u f(u)\Big) + \frac{T}{2\eta}\Big(\|\dot X(0)\|^2 - \inf_{t\in[0,T)}\|\dot X(t)\|^2\Big)} < \infty,$$
since $f$ is bounded from below by Assumption 1. Thus, $\lim_{t\to T} X(t)$ exists, and since $Q$ is complete, the limit is in $Q$, contradicting the maximality of $[0,T)$, thereby concluding the proof. $\square$

Appendix C. Proofs of Convergence Rates
The proofs of the convergence rates of solutions to the $p$-Bregman Euler–Lagrange equations are inspired by those of Theorems 5 and 6 from [3], and make use of Lemmas 2 and 12 therein:

Lemma C.1.
Given a Riemannian manifold $Q$ with sectional curvature bounded above by $K_{\max}$ and below by $K_{\min}$, and such that
$$\mathrm{diam}(Q) < \begin{cases} \dfrac{\pi}{\sqrt{K_{\max}}} & \text{if } K_{\max} > 0, \\ \infty & \text{if } K_{\max} \leq 0, \end{cases}$$
we have that $\langle \nabla_{\dot X}\mathrm{Log}_X(p), -\dot X\rangle \leq \zeta\,\|\dot X\|^2$, where $\zeta$ is given by equation (2.1).
Lemma C.2.
Given a point $q$ and a smooth curve $X(t)$ on a Riemannian manifold $Q$,
$$\frac{1}{2}\frac{d}{dt}\|\mathrm{Log}_{X(t)}(q)\|^2 = \langle \mathrm{Log}_{X(t)}(q), \nabla_{\dot X}\mathrm{Log}_{X(t)}(q)\rangle = \langle \mathrm{Log}_{X(t)}(q), -\dot X(t)\rangle.$$

C.1. Convex Case.

Theorem C.1. Suppose $f : Q \to \mathbb{R}$ is a geodesically convex function, and Assumption 1 is satisfied. Then, any solution $X(t)$ of the $p$-Bregman Euler–Lagrange equation
$$\nabla_{\dot X}\dot X + \frac{\zeta p + 1}{t}\,\dot X + Cp^2 t^{p-2}\,\mathrm{grad}f(X) = 0$$
converges to a minimizer $x^*$ of $f$ with rate
$$f(X(t)) - f(x^*) \leq \frac{\zeta\,\|\mathrm{Log}_{x_0}(x^*)\|^2}{2Ct^p}.$$

Proof.
Let
$$\mathcal{E}(t) = Ct^p\big(f(X) - f(x^*)\big) + \frac{\zeta-1}{2}\,\|\mathrm{Log}_X(x^*)\|^2 + \frac{1}{2}\,\Big\|\frac{t}{p}\dot X - \mathrm{Log}_X(x^*)\Big\|^2.$$
Then, using Lemma C.2,
\begin{align*}
\dot{\mathcal{E}}(t) &= Cpt^{p-1}\big(f(X) - f(x^*)\big) + Ct^p\,\langle \mathrm{grad}f(X), \dot X\rangle + (\zeta-1)\,\langle \mathrm{Log}_X(x^*), -\dot X\rangle\\
&\quad + \Big\langle \frac{t}{p}\dot X - \mathrm{Log}_X(x^*),\ \frac{1}{p}\dot X + \frac{t}{p}\nabla_{\dot X}\dot X - \nabla_{\dot X}\mathrm{Log}_X(x^*)\Big\rangle\\
&= Cpt^{p-1}\big(f(X) - f(x^*)\big) + Ct^p\,\langle \mathrm{grad}f(X), \dot X\rangle + (\zeta-1)\,\langle \mathrm{Log}_X(x^*), -\dot X\rangle\\
&\quad + \Big\langle \frac{t}{p}\dot X - \mathrm{Log}_X(x^*),\ \Big(\frac{1}{p}\dot X + \frac{t}{p}\nabla_{\dot X}\dot X + \zeta\dot X\Big) - \zeta\dot X - \nabla_{\dot X}\mathrm{Log}_X(x^*)\Big\rangle.
\end{align*}
Now, the $p$-Bregman Euler–Lagrange equation can be rewritten as
$$\frac{1}{p}\dot X + \frac{t}{p}\nabla_{\dot X}\dot X + \zeta\dot X = -Cpt^{p-1}\,\mathrm{grad}f(X).$$
Thus,
\begin{align*}
\dot{\mathcal{E}}(t) &= Cpt^{p-1}\big(f(X) - f(x^*)\big) + Ct^p\,\langle \mathrm{grad}f(X), \dot X\rangle + (\zeta-1)\,\langle \mathrm{Log}_X(x^*), -\dot X\rangle\\
&\quad + \Big\langle \frac{t}{p}\dot X - \mathrm{Log}_X(x^*),\ -Cpt^{p-1}\,\mathrm{grad}f(X) - \zeta\dot X - \nabla_{\dot X}\mathrm{Log}_X(x^*)\Big\rangle.
\end{align*}
Canceling the $\langle \mathrm{grad}f(X), \dot X\rangle$ and $\langle \mathrm{Log}_X(x^*), -\dot X\rangle$ terms using Lemma C.2, we get
$$\dot{\mathcal{E}}(t) = Cpt^{p-1}\Big[f(X) - f(x^*) + \langle \mathrm{Log}_X(x^*), \mathrm{grad}f(X)\rangle\Big] - \frac{t}{p}\Big(\langle \dot X, \nabla_{\dot X}\mathrm{Log}_X(x^*)\rangle + \zeta\,\langle \dot X, \dot X\rangle\Big).$$
Now, since $f$ is geodesically convex, we have that
$$f(X) - f(x^*) + \langle \mathrm{Log}_X(x^*), \mathrm{grad}f(X)\rangle \leq 0,$$
and Lemma C.1 gives
$$\langle \dot X, \nabla_{\dot X}\mathrm{Log}_X(x^*)\rangle + \zeta\,\langle \dot X, \dot X\rangle \geq 0.$$
Thus, $\dot{\mathcal{E}}(t) \leq 0$, so
$$Ct^p\big(f(X) - f(x^*)\big) \leq Ct^p\big(f(X) - f(x^*)\big) + \frac{\zeta-1}{2}\,\|\mathrm{Log}_X(x^*)\|^2 + \frac{1}{2}\,\Big\|\frac{t}{p}\dot X - \mathrm{Log}_X(x^*)\Big\|^2 = \mathcal{E}(t) \leq \mathcal{E}(0) = \frac{\zeta-1}{2}\,\|\mathrm{Log}_{x_0}(x^*)\|^2 + \frac{1}{2}\,\|\mathrm{Log}_{x_0}(x^*)\|^2 = \frac{\zeta}{2}\,\|\mathrm{Log}_{x_0}(x^*)\|^2,$$
which gives the desired rate of convergence. $\square$

C.2. Weakly-Quasi-Convex Case.

Theorem C.2. Suppose $f : Q \to \mathbb{R}$ is a geodesically $\alpha$-weakly-quasi-convex function, and suppose that Assumption 1 is satisfied. Then, any solution $X(t)$ of the $p$-Bregman Euler–Lagrange equation
$$\nabla_{\dot X}\dot X + \frac{\zeta p + \alpha}{\alpha t}\,\dot X + Cp^2 t^{p-2}\,\mathrm{grad}f(X) = 0$$
converges to a minimizer $x^*$ of $f$ with rate
$$f(X(t)) - f(x^*) \leq \frac{\zeta\,\|\mathrm{Log}_{x_0}(x^*)\|^2}{2C\alpha^2 t^p}.$$

Proof.
Let
$$\mathcal{E}(t) = C\alpha^2 t^p\big(f(X) - f(x^*)\big) + \frac{\zeta-1}{2}\,\|\mathrm{Log}_X(x^*)\|^2 + \frac{1}{2}\,\Big\|\frac{\alpha t}{p}\dot X - \mathrm{Log}_X(x^*)\Big\|^2.$$
Then, using Lemma C.2,
\begin{align*}
\dot{\mathcal{E}}(t) &= Cp\alpha^2 t^{p-1}\big(f(X) - f(x^*)\big) + C\alpha^2 t^p\,\langle \mathrm{grad}f(X), \dot X\rangle + (\zeta-1)\,\langle \mathrm{Log}_X(x^*), -\dot X\rangle\\
&\quad + \Big\langle \frac{\alpha t}{p}\dot X - \mathrm{Log}_X(x^*),\ \frac{\alpha}{p}\dot X + \frac{\alpha t}{p}\nabla_{\dot X}\dot X - \nabla_{\dot X}\mathrm{Log}_X(x^*)\Big\rangle\\
&= Cp\alpha^2 t^{p-1}\big(f(X) - f(x^*)\big) + C\alpha^2 t^p\,\langle \mathrm{grad}f(X), \dot X\rangle + (\zeta-1)\,\langle \mathrm{Log}_X(x^*), -\dot X\rangle\\
&\quad + \Big\langle \frac{\alpha t}{p}\dot X - \mathrm{Log}_X(x^*),\ \Big(\frac{\alpha}{p}\dot X + \frac{\alpha t}{p}\nabla_{\dot X}\dot X + \zeta\dot X\Big) - \zeta\dot X - \nabla_{\dot X}\mathrm{Log}_X(x^*)\Big\rangle.
\end{align*}
Now, the $p$-Bregman Euler–Lagrange equation can be rewritten as
$$\frac{\alpha}{p}\dot X + \frac{\alpha t}{p}\nabla_{\dot X}\dot X + \zeta\dot X = -Cp\alpha\, t^{p-1}\,\mathrm{grad}f(X).$$
Thus,
\begin{align*}
\dot{\mathcal{E}}(t) &= Cp\alpha^2 t^{p-1}\big(f(X) - f(x^*)\big) + C\alpha^2 t^p\,\langle \mathrm{grad}f(X), \dot X\rangle + (\zeta-1)\,\langle \mathrm{Log}_X(x^*), -\dot X\rangle\\
&\quad + \Big\langle \frac{\alpha t}{p}\dot X - \mathrm{Log}_X(x^*),\ -Cp\alpha\, t^{p-1}\,\mathrm{grad}f(X) - \zeta\dot X - \nabla_{\dot X}\mathrm{Log}_X(x^*)\Big\rangle.
\end{align*}
Canceling the $\langle \mathrm{grad}f(X), \dot X\rangle$ and $\langle \mathrm{Log}_X(x^*), -\dot X\rangle$ terms using Lemma C.2, we get
$$\dot{\mathcal{E}}(t) = Cp\alpha\, t^{p-1}\Big[\alpha\big(f(X) - f(x^*)\big) + \langle \mathrm{Log}_X(x^*), \mathrm{grad}f(X)\rangle\Big] - \frac{\alpha t}{p}\Big(\langle \dot X, \nabla_{\dot X}\mathrm{Log}_X(x^*)\rangle + \zeta\,\langle \dot X, \dot X\rangle\Big).$$
Now, $f$ is geodesically $\alpha$-weakly-quasi-convex, so
$$\alpha\big(f(X) - f(x^*)\big) + \langle \mathrm{Log}_X(x^*), \mathrm{grad}f(X)\rangle \leq 0,$$
and Lemma C.1 gives
$$\langle \dot X, \nabla_{\dot X}\mathrm{Log}_X(x^*)\rangle + \zeta\,\langle \dot X, \dot X\rangle \geq 0.$$
Thus, $\dot{\mathcal{E}}(t) \leq 0$, so
$$C\alpha^2 t^p\big(f(X) - f(x^*)\big) \leq \mathcal{E}(t) \leq \mathcal{E}(0) = \frac{\zeta-1}{2}\,\|\mathrm{Log}_{x_0}(x^*)\|^2 + \frac{1}{2}\,\|\mathrm{Log}_{x_0}(x^*)\|^2 = \frac{\zeta}{2}\,\|\mathrm{Log}_{x_0}(x^*)\|^2.$$
Therefore, we obtain the desired rate of convergence
$$f(X(t)) - f(x^*) \leq \frac{\zeta\,\|\mathrm{Log}_{x_0}(x^*)\|^2}{2C\alpha^2 t^p}. \qquad \square$$
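As a sanity check on the Lyapunov argument, the Euclidean specialization of Theorem C.1 can be integrated numerically: in flat space $\zeta = 1$, $\mathrm{Log}_x(x^*) = x^* - x$, the $(\zeta-1)$ term drops out of $\mathcal{E}$, and the monotonicity $\dot{\mathcal{E}} \leq 0$ forces $f(X(t)) - f(x^*) \leq \mathcal{E}(t_0)/(Ct^p)$ for $t \geq t_0$. The integrator, objective, and parameters below are illustrative choices:

```python
import numpy as np

p, C, h, t0, T = 3, 0.5, 1e-3, 1.0, 20.0
b = np.array([2.0, -1.0])                  # minimizer of the test objective
f = lambda x: 0.5 * np.dot(x - b, x - b)   # convex quadratic with f(b) = 0
grad_f = lambda x: x - b

def lyapunov(x, v, t):
    # Euclidean specialization of E(t); the (zeta - 1) term vanishes.
    w = t / p * v - (b - x)
    return C * t**p * f(x) + 0.5 * np.dot(w, w)

# Semi-implicit Euler on  x'' + ((p+1)/t) x' + C p^2 t^(p-2) grad f(x) = 0.
x, v, t = np.array([5.0, 3.0]), np.zeros(2), t0
E0 = lyapunov(x, v, t0)
while t < T:
    v += h * (-(p + 1) / t * v - C * p**2 * t**(p - 2) * grad_f(x))
    x += h * v
    t += h

gap = f(x)   # f(X(T)) - f(x*), since f(b) = 0
```

Along the computed trajectory one can monitor `lyapunov(x, v, t)` and observe that it does not increase (up to discretization error), which is precisely the mechanism driving the $\mathcal{O}(1/t^p)$ rate in the proof above.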
Appendix D. Proof of Invariance Theorem
Theorem D.1.
Suppose Assumption 1 is satisfied and that the curve $X(t)$ satisfies a $p$-Bregman Euler–Lagrange equation of the form
$$\nabla_{\dot X}\dot X + \frac{\lambda p + 1}{t}\,\dot X + Cp^2 t^{p-2}\,\mathrm{grad}f(X) = 0,$$
for some $\lambda \in \mathbb{R}$. Then the reparametrized curve $X(t^{\mathring{p}/p})$ satisfies the corresponding $\mathring{p}$-Bregman Euler–Lagrange equation.

Proof. Let $\tau(t) = t^{\mathring{p}/p}$ and $Y(t) = X(\tau(t))$. Then
$$\dot Y(t) = \dot\tau(t)\,\dot X(\tau(t)), \qquad \nabla_{\dot Y(t)}\dot Y(t) = \ddot\tau(t)\,\dot X(\tau(t)) + \dot\tau(t)^2\,\nabla_{\dot X(\tau(t))}\dot X(\tau(t)).$$
Inverting these relations gives
$$\dot X(\tau(t)) = \frac{\dot Y(t)}{\dot\tau(t)}, \qquad \nabla_{\dot X(\tau(t))}\dot X(\tau(t)) = \frac{1}{\dot\tau(t)^2}\,\nabla_{\dot Y(t)}\dot Y(t) - \frac{\ddot\tau(t)}{\dot\tau(t)^3}\,\dot Y(t).$$
The $p$-Bregman Euler–Lagrange equation at time $\tau(t)$ is given by
$$\nabla_{\dot X(\tau(t))}\dot X(\tau(t)) + \frac{\lambda p + 1}{\tau(t)}\,\dot X(\tau(t)) + Cp^2\,\tau(t)^{p-2}\,\mathrm{grad}f(X(\tau(t))) = 0.$$
Substituting the expressions for $\dot X(\tau(t))$ and $\nabla_{\dot X(\tau(t))}\dot X(\tau(t))$ in terms of $Y(t)$ and its derivatives, and multiplying by $\dot\tau(t)^2$, gives
$$\nabla_{\dot Y(t)}\dot Y(t) + \Big( (\lambda p + 1)\,\frac{\dot\tau(t)}{\tau(t)} - \frac{\ddot\tau(t)}{\dot\tau(t)} \Big)\,\dot Y(t) + Cp^2\,\dot\tau(t)^2\,\tau(t)^{p-2}\,\mathrm{grad}f(Y(t)) = 0.$$
Substituting $\tau(t) = t^{\mathring{p}/p}$, so that $\frac{\dot\tau}{\tau} = \frac{\mathring{p}}{pt}$ and $\frac{\ddot\tau}{\dot\tau} = \frac{\mathring{p}/p - 1}{t}$, hence $(\lambda p+1)\frac{\dot\tau}{\tau} - \frac{\ddot\tau}{\dot\tau} = \frac{\lambda\mathring{p}+1}{t}$ and $Cp^2\dot\tau^2\tau^{p-2} = C\mathring{p}^2 t^{\mathring{p}-2}$, yields the $\mathring{p}$-Bregman Euler–Lagrange equation at time $t$:
$$\nabla_{\dot Y(t)}\dot Y(t) + \frac{\lambda\mathring{p} + 1}{t}\,\dot Y(t) + C\mathring{p}^2 t^{\mathring{p}-2}\,\mathrm{grad}f(Y(t)) = 0. \qquad \square$$

References

[1] P.-A. Absil, R. Mahony, and R. Sepulchre.
Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2008. ISBN 978-0-691-13298-3. doi: 10.1515/9781400830244.
[2] F. Alimisis, A. Orvieto, G. Bécigneul, and A. Lucchi. Practical accelerated optimization on Riemannian manifolds, 2020.
[3] F. Alimisis, A. Orvieto, G. Bécigneul, and A. Lucchi. A continuous-time perspective for modeling acceleration in Riemannian optimization. In Silvia Chiappa and Roberto Calandra, editors, Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pages 1297–1307, 26–28 Aug 2020.
[4] M. Betancourt, M. I. Jordan, and A. Wilson. On symplectic optimization, 2018.
[5] A.-L. Cauchy. Méthode générale pour la résolution des systèmes d'équations simultanées. Acad. Sci. Paris, 25:536–538, 1847.
[6] V. Duruisseaux, J. Schmitt, and M. Leok. Adaptive Hamiltonian variational integrators and applications to symplectic accelerated optimization, 2020.
[7] E. Hairer, C. Lubich, and G. Wanner. Geometric numerical integration, volume 31 of Springer Series in Computational Mathematics. Springer-Verlag, Berlin, second edition, 2006.
[8] J. Jost. Riemannian geometry and geometric analysis. Universitext. Springer, Cham, seventh edition, 2017.
[9] J. L. Kelley. General Topology. Graduate Texts in Mathematics. Springer New York, 1975. ISBN 9780387901251.
[10] S. Lang. Fundamentals of Differential Geometry, volume 191 of Graduate Texts in Mathematics. Springer-Verlag, New York, 1999. ISBN 9780387985930.
[11] J. M. Lee. Introduction to Riemannian Manifolds, volume 176 of Graduate Texts in Mathematics. Springer, Cham, second edition, 2018.
[12] M. Leok and T. Ohsawa. Variational and geometric structures of discrete Dirac mechanics. Found. Comput. Math., 11(5):529–562, 2011.
[13] Y. Liu, F. Shang, J. Cheng, H. Cheng, and L. Jiao. Accelerated first-order methods for geodesically convex optimization on Riemannian manifolds. In Advances in Neural Information Processing Systems, volume 30, pages 4868–4877. Curran Associates, Inc., 2017.
[14] J. E. Marsden and T. S. Ratiu. Introduction to mechanics and symmetry, volume 17 of Texts in Applied Mathematics. Springer-Verlag, New York, second edition, 1999.
[15] A. S. Nemirovsky and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley-Interscience Series in Discrete Mathematics. Wiley, 1983.
[16] Y. Nesterov. A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady, 27(2):372–376, 1983.
[17] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87 of Applied Optimization. Kluwer Academic Publishers, Boston, MA, 2004.
[18] Y. Nesterov. Accelerating the cubic regularization of Newton's method on convex problems. Math. Program., 112:159–181, 2008.
[19] W. Su, S. Boyd, and E. Candes. A differential equation for modeling Nesterov's Accelerated Gradient method: theory and insights. Journal of Machine Learning Research, 17(153):1–43, 2016.
[20] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on International Conference on Machine Learning - Volume 28, ICML'13, pages III-1139–III-1147, 2013.
[21] A. Wibisono, A. Wilson, and M. Jordan. A variational perspective on accelerated methods in optimization. Proceedings of the National Academy of Sciences, 113(47):E7351–E7358, 2016.
[22] H. Zhang and S. Sra. First-order methods for geodesically convex optimization. In Vitaly Feldman, Alexander Rakhlin, and Ohad Shamir, editors, volume 49 of Proceedings of Machine Learning Research, pages 1617–1638, Columbia University, New York, New York, USA, 23–26 Jun 2016.
[23] H. Zhang and S. Sra. An estimate sequence for geodesically convex optimization. In Sébastien Bubeck, Vianney Perchet, and Philippe Rigollet, editors, Proceedings of the 31st Conference On Learning Theory, volume 75 of