Lagrangian and Hamiltonian Mechanics for Probabilities on the Statistical Manifold
GOFFREDO CHIRCO, LUIGI MALAGÒ, AND GIOVANNI PISTONE
Abstract.
We provide an Information-Geometric formulation of Classical Mechanics on the Riemannian manifold of probability distributions, which is an affine manifold endowed with a dually-flat connection. In a non-parametric formalism, we consider the full set of positive probability functions on a finite sample space, and we provide a specific expression for the tangent and cotangent spaces over the statistical manifold, in terms of a Hilbert bundle structure that we call the Statistical Bundle. In this setting, we compute velocities and accelerations of a one-dimensional statistical model using the canonical dual pair of parallel transports and define a coherent formalism for Lagrangian and Hamiltonian mechanics on the bundle. Finally, in a series of examples, we show how our formalism provides a consistent framework for accelerated natural gradient dynamics on the probability simplex, paving the way for direct applications in optimization, game theory and neural networks.
Contents
1. Introduction
2. Statistical bundle
2.1. Maximal exponential family
2.2. The exponential bundle and the mixture bundle
3. Hessian structure and second order geometry
3.1. Velocities and covariant derivatives
3.2. Higher order statistical bundles and accelerations
4. Natural gradient
5. Mechanics of the statistical bundle
5.1. Action integral
5.2. Legendre transform
5.3. Hamilton equations
6. Examples of Lagrangians on the statistical bundle
6.1. Reduction to ordinary differential equations
7. Application to accelerated optimization
7.1. Damped KL Lagrangian
7.2. Damped KL Hamiltonian
8. Discussion
Acknowledgements
Appendix A. Covering
Appendix B. Entropy flow
Appendix C. Covariant time-derivative of the KL Legendre transform
References

1. Introduction
Lagrangian mechanics and Hamiltonian mechanics live on the tangent bundle and the cotangent bundle, respectively, of a finite-dimensional Riemannian manifold, as presented, for example, in [7, Ch. III-IV].
Date: September 22, 2020.

Information Geometry (IG), as first formalized by S.-I. Amari and H. Nagaoka [6], views parametric statistical models as a manifold endowed with a Riemannian metric and a family of dual connections, the $\alpha$-connections. In particular, it provides an affine structure and a pair of dually-flat connections. Recently, some authors have started to inquire about the relation between the geometry of classical mechanics and IG [24, 29]. Indeed, interest in dynamical systems on probability functions has risen in several areas, for example, Compartmental Models, Replicator Equations, Prey-Predator Equations, Mass Action Equations, Differential Games, and also, more recently, in Optimization Methods and Machine Learning Theory.

In the present paper, we approach this research program with two specific qualifications. First, we consider the full set of positive probability functions on a finite sample space and discuss IG in the non-parametric geometric language, as it is in [23, 20]. In Data Analysis, the non-parametric statistical study of compositional data was started by [3]. We use here the simplest instance of non-parametric Information Geometry as it is described in the review papers [28, 31].

The second and most qualifying choice consists in considering IG as defined on a linear bundle, not just on a manifold of probability densities. Indeed, in classical mechanics, the study of the evolution of a system requires both positions $q$ and velocities $\dot q$, or conjugate momenta $p$, in a phase space (co)tangent bundle description. Similarly, we are led to consider a manifold of couples of probability densities and scores $\overset{\star}{q}$ (log-derivatives), or associated conjugate momenta $\eta$. We call such a bundle the Statistical Bundle (SB) [29].
This idea should be compared with the use of the Grassmannian manifold, as defined, for example in [2], to describe the various centering of the space of the sufficient statistics of an exponential family, see [25, § § Lagrangian and the
Hamiltonian function on the full bundle. Therefore, in Section 5, we gathered all the necessary structure to define a mechanics of the probability simplex. We define an action integral in terms of a generic notion of Lagrangian function on the statistical bundle. We can then derive the Euler-Lagrange equation via a standard variational approach on the simplex [29]. We define a Legendre transform, hence we derive the Hamilton equations. As a starting point for our analysis, we look at the dynamics induced by a standard, though local here, free particle Lagrangian, obtained from the quadratic form on the statistical bundle. In this case we can compute the full analytic solution of the geodesic motion. Further, we take the quadratic free particle Lagrangian as a quadratic approximation of a Kullback-Leibler (KL) divergence function, and we set up the study of the dynamics induced by a KL divergence Lagrangian. We focus on the formal construction of a Lagrangian function from a divergence in Section 6. Here, we provide complete examples of both quadratic and KL Lagrangian and Hamiltonian flows on the bundle. Finally, in Section 7, we consider the case of a time-dependent, damped extension of the KL Lagrangian, and we apply the Lagrange-Hamilton duality to provide a first realization on the statistical bundle of the variational approach to accelerated optimisation methods recently proposed in [39].

We end with a brief discussion in Section 8. In the appendices, we report the calculations for the geodesic solution of the quadratic Lagrangian on the sphere (Appendix A), the calculation of the gradient of the negative entropy potential on the simplex (Appendix B), and the derivation in chart of the fiber derivative of the KL Lagrangian (Appendix C).

2. Statistical bundle
Let a finite sample space $\Omega$, $\#\Omega = N$, be given. The probability simplex on $\Omega$ is denoted $\Delta(\Omega)$, while $\Delta^{\circ}(\Omega)$ is its interior. The uniform probability function is denoted by $\mu$, that is, $\mu(x) = \frac{1}{N}$, $x \in \Omega$. In general, we denote by lower case letters both the densities with respect to the uniform probability function $\mu$ and the random variables. The upper case is reserved for probability functions and geometrical objects. The expected value of $f \colon \Omega \to \mathbb{R}$ with respect to the density $p$ is written $\mathbb{E}_{p}[f]$. Note that the reference probability $\mu$ has density $p = 1$, so that $\frac{1}{N}\sum_{x} f(x) = \mathbb{E}_{1}[f]$. We define the entropy function on densities, $H(p) = -\mathbb{E}_{p}[\log p]$, so that it is minus the special case, obtained for $q = 1$, of the Kullback-Leibler divergence, $D(p\,\|\,q) = \mathbb{E}_{p}\left[\log\frac{p}{q}\right]$.

2.1. Maximal exponential family.
We regard $\Delta^{\circ}(\Omega)$ as the maximal exponential family $\mathcal{E}(\mu)$ in the sense that each strictly positive density $p$ can be written as $p \propto \mathrm{e}^{f}$. The random variable $f$ is defined up to a constant. Uniqueness can be obtained in (at least) two ways. For each given reference density $p \in \mathcal{E}(\mu)$, one can write either
\[ q(x) = \exp\left(u(x) - K_{p}(u)\right) \cdot p(x), \quad \mathbb{E}_{p}[u] = 0, \quad K_{p}(u) = \log \mathbb{E}_{p}[\mathrm{e}^{u}] = D(p\,\|\,q), \tag{1} \]
or
\[ q(x) = \exp\left(v(x) + H(v)\right) \cdot p(x), \quad \mathbb{E}_{q}[v] = 0, \quad H(v) = -\log \mathbb{E}_{p}[\mathrm{e}^{v}] = D(q\,\|\,p), \tag{2} \]
where $D$ denotes the Kullback-Leibler divergence.

In the first case (1), the set of all $u$'s is a vector space of random variables and the mapping $q \mapsto u = \log\frac{q}{p} - \mathbb{E}_{p}\left[\log\frac{q}{p}\right]$ provides a chart of $\mathcal{E}(\mu)$. In the second case (2), there is no fixed co-domain for the mapping $q \mapsto v = \log\frac{q}{p} - \mathbb{E}_{q}\left[\log\frac{q}{p}\right]$. The proper structure is the vector bundle to be defined below.

2.2. The exponential bundle and the mixture bundle.
The statistical bundle with base $\Omega$ is
\[ S\mathcal{E}(\mu) = \{ (q, v) \mid q \in \mathcal{E}(\mu),\ \mathbb{E}_{q}[v] = 0 \}. \tag{3} \]
The elements of the statistical bundle are couples of a probability density $q$ and a random variable $v$, respectively. The mapping $q \mapsto v = \log\frac{q}{p} - D(q\,\|\,p)$ uniquely defined in eq. (2) provides a section of the statistical bundle.

In the present finite dimensional case, the statistical bundle coincides with the dual statistical bundle, which is here denoted ${}^{*}S\mathcal{E}(\mu)$, as a Banach space. Nevertheless it will be useful to distinguish between the two by calling the first one the exponential statistical bundle and the second one the dual or mixture (statistical) bundle, respectively. The two bundles have different geometries, in the sense that they will be given different affine transports. The duality mapping is defined on the fiber at $q$ by
\[ {}^{*}S_{q}\mathcal{E}(\mu) \times S_{q}\mathcal{E}(\mu) \ni (\eta, v) \mapsto \langle \eta, v \rangle_{q} = \mathbb{E}_{q}[\eta v]. \tag{4} \]
The statistical bundle is a semi-algebraic subset of $\mathbb{R}^{2N}$, namely an open subset of the algebraic variety defined by
\[ (q, v) \in S\mathcal{E}(\mu) \iff \sum_{x \in \Omega} q(x) = N, \quad \sum_{x \in \Omega} v(x) q(x) = 0, \quad q(x) > 0,\ x \in \Omega. \]
We retain the manifold structure induced by $\mathbb{R}^{2N}$, but we will define a different metric and affine structure. The inner product on the fiber $S_{q}\mathcal{E}(\mu)$ is defined to be
\[ \langle v_{1}, v_{2} \rangle_{q} = \frac{1}{N} \sum_{x \in \Omega} v_{1}(x) v_{2}(x) q(x). \]
The geometry of the statistical bundles is related to the more traditional set-up of IG as follows.

Consider the tangent bundle of the positive part of the sphere of radius 2, $TS_{>}(2)$. As a semi-algebraic set, it is defined by
\[ (\rho, w) \in TS_{>}(2) \iff \sum_{x \in \Omega} \rho(x)^{2} = 4, \quad \sum_{x \in \Omega} w(x) \rho(x) = 0, \quad \rho(x) > 0,\ x \in \Omega. \]
Consider now a further bundle, namely the open simplex and its affine (trivial) tangent bundle
\[ (q, u) \in T\Delta^{\circ}(\Omega) \iff \sum_{x \in \Omega} q(x) = N, \quad \sum_{x \in \Omega} u(x) = 0, \quad q(x) > 0,\ x \in \Omega. \]
The mapping
\[ TS_{>}(2) \ni (\rho, w) \mapsto \left( \frac{N}{4}\rho^{2}, \frac{N}{2}\rho w \right) = (q, u) \in T\Delta^{\circ}(\Omega) \]
is 1-to-1 and surjective. Notice that the action on the tangent vectors is the tangent transformation of the space. The inner product on the fiber $T_{\rho}S_{>}(2)$ is pushed forward as
\[ w_{1} \cdot w_{2} = \sum_{x \in \Omega} \frac{2 u_{1}(x)}{N \rho(x)} \, \frac{2 u_{2}(x)}{N \rho(x)} = \frac{1}{N} \sum_{x \in \Omega} \frac{u_{1}(x) u_{2}(x)}{q(x)}. \]
This inner product is the well known Fisher-Rao metric.

Now, the mapping
\[ T\Delta^{\circ}(\Omega) \ni (q, u) \mapsto (q, u/q) = (q, v) \in {}^{*}S\mathcal{E}(\mu) = S\mathcal{E}(\mu) \]
is 1-to-1 and surjective. It is, in fact, a trivialization of the mixture bundle. The push forward of the metric is given by
\[ \frac{1}{N} \sum_{x \in \Omega} \frac{u_{1}(x) u_{2}(x)}{q(x)} = \frac{1}{N} \sum_{x \in \Omega} \frac{q(x) v_{1}(x)\, q(x) v_{2}(x)}{q(x)} = \langle v_{1}, v_{2} \rangle_{q}. \]
In conclusion, the same metric structure has three different canonical expressions, according to the scheme
\[ TS_{>}(2) \leftrightarrow S\mathcal{E}(\mu) = {}^{*}S\mathcal{E}(\mu) \leftrightarrow T\Delta^{\circ}(\Omega). \]
Our choice of the statistical bundle is motivated by two arguments. First, the representation on the sphere produces computations that do not have a clear statistical interpretation. Second, the description of the affine connections is especially simple in the statistical bundle, see below. It is not relevant in the present finite state case, but it is conceptually interesting to remark that in the infinite case the representation on the sphere is problematic and the same is true for the identification of the bundles in duality [28].

Let us discuss briefly the issue of the parametrisation of the statistical bundle. This is frequently done with reference to the presentation $T\Delta^{\circ}(\Omega)$ as follows. Let us code the sample points with $\Omega = \{1, \dots, N\}$, and let us parametrise the open simplex as
\[ \Delta^{\circ}(N) = \left\{ \Big( \theta_{1}, \dots, \theta_{N-1}, 1 - \sum_{j} \theta_{j} \Big) \,\middle|\, \theta_{j} > 0,\ \sum_{j} \theta_{j} < 1 \right\}. \]
The set of parameters is the solid simplex
\[ \Gamma_{N-1} = \left\{ \theta \in \mathbb{R}^{N-1} \,\middle|\, \theta_{j} > 0,\ \sum_{j} \theta_{j} < 1 \right\}. \]
The affine space of $\Gamma_{N-1}$ is the full real space $\mathbb{R}^{N-1}$, and the parametrisation of the trivialised bundle is
\[ \Gamma_{N-1} \times \mathbb{R}^{N-1} \ni (\theta, \dot\theta) \mapsto \left( \Big( \theta_{1}, \dots, \theta_{N-1}, 1 - \sum_{j} \theta_{j} \Big), \Big( \dot\theta_{1}, \dots, \dot\theta_{N-1}, -\sum_{j} \dot\theta_{j} \Big) \right) = (Q, w) \in T\Delta^{\circ}(N). \]
In this parametrisation, the Fisher inner product is expressed by the Fisher matrix in the standard basis of $\mathbb{R}^{N-1}$, which is conveniently expressed as an inverse matrix function,
\[ I(\theta) = \left( \operatorname{diag}(\theta) - \theta\theta^{T} \right)^{-1}. \]
See the related computations, for example, in [32, Prop. 1].

The corresponding parametrisations of the other presentations of the statistical bundle are easily derived.

We now proceed to introduce the affine geometry of the statistical bundle. Here, we look at the inner product on the fibers as a duality pairing between ${}^{*}S_{q}\mathcal{E}(\mu)$ and $S_{q}\mathcal{E}(\mu)$.
This point of view allows for a natural definition of a dual covariant structure. Namely, we define two affine transports between the fibers of each of the statistical bundles.

Definition 1. The exponential transport is defined for each $p, q \in \mathcal{E}(\mu)$ by
\[ {}^{e}\mathbb{U}_{p}^{q} \colon S_{p}\mathcal{E}(\mu) \to S_{q}\mathcal{E}(\mu), \quad {}^{e}\mathbb{U}_{p}^{q} v = v - \mathbb{E}_{q}[v], \tag{5} \]
while the mixture transport is
\[ {}^{m}\mathbb{U}_{p}^{q} \colon {}^{*}S_{p}\mathcal{E}(\mu) \to {}^{*}S_{q}\mathcal{E}(\mu), \quad {}^{m}\mathbb{U}_{p}^{q} \eta = \frac{p}{q}\,\eta. \tag{6} \]
The following properties are easily proved.
Proposition 1.
The two transports defined above are conjugate with respect to the duality pairing,
\[ \left\langle {}^{m}\mathbb{U}_{p}^{q} \eta, v \right\rangle_{q} = \left\langle \eta, {}^{e}\mathbb{U}_{q}^{p} v \right\rangle_{p}, \quad \eta \in {}^{*}S_{p}\mathcal{E}(\mu),\ v \in S_{q}\mathcal{E}(\mu). \tag{7} \]
Moreover, it holds
\[ \left\langle {}^{m}\mathbb{U}_{p}^{q} \eta, {}^{e}\mathbb{U}_{p}^{q} v \right\rangle_{q} = \langle \eta, v \rangle_{p}, \quad \eta \in {}^{*}S_{p}\mathcal{E}(\mu),\ v \in S_{p}\mathcal{E}(\mu). \tag{8} \]
We now use this structure to define a special affine atlas of charts in order to build the structure of the affine manifold which provides the set-up of IG in this case. Notice that we define a manifold with global charts whose co-domain depends on the chart itself and is actually a fiber of the bundle.
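The two identities of Proposition 1 can be checked numerically on a small sample space. The following is only a minimal sketch: it assumes densities are stored as lists that sum to $N$ (densities with respect to the uniform $\mu$), and the helper names `expect`, `pairing`, `exp_transport`, `mix_transport` and the numerical values are illustrative choices, not taken from the paper.

```python
# Densities are taken with respect to the uniform probability mu on N points,
# so a density q is a list of positive numbers with sum(q) == N.

def expect(q, f):
    """E_q[f] = (1/N) * sum_x f(x) q(x)."""
    return sum(fx * qx for fx, qx in zip(f, q)) / len(q)

def pairing(q, eta, v):
    """Duality pairing <eta, v>_q = E_q[eta * v], as in eq. (4)."""
    return expect(q, [e * w for e, w in zip(eta, v)])

def exp_transport(q, v):
    """Exponential transport into the fiber at q: v - E_q[v], as in eq. (5)."""
    m = expect(q, v)
    return [vx - m for vx in v]

def mix_transport(p, q, eta):
    """Mixture transport from the fiber at p to the fiber at q: (p/q) * eta, as in eq. (6)."""
    return [px / qx * ex for px, qx, ex in zip(p, q, eta)]

# Two arbitrary positive densities on N = 4 points (each sums to 4).
p = [0.4, 1.6, 1.2, 0.8]
q = [1.0, 0.5, 1.5, 1.0]

# Centering an arbitrary random variable puts it in the fiber at p or q.
eta = exp_transport(p, [1.0, -2.0, 0.5, 3.0])   # element of *S_p
v_q = exp_transport(q, [0.3, 1.0, -1.0, 2.0])   # element of S_q
v_p = exp_transport(p, [2.0, 0.0, -1.0, 1.0])   # element of S_p

# Eq. (7): <mU_p^q eta, v>_q = <eta, eU_q^p v>_p
lhs7 = pairing(q, mix_transport(p, q, eta), v_q)
rhs7 = pairing(p, eta, exp_transport(p, v_q))

# Eq. (8): <mU_p^q eta, eU_p^q v>_q = <eta, v>_p
lhs8 = pairing(q, mix_transport(p, q, eta), exp_transport(q, v_p))
rhs8 = pairing(p, eta, v_p)

# The transported momentum is again centered at q.
mean_transported = expect(q, mix_transport(p, q, eta))
```

Both identities rest on the single fact that $\mathbb{E}_{q}\left[\frac{p}{q} f\right] = \mathbb{E}_{p}[f]$, together with the centering of $\eta$ at $p$.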
Definition 2.
The exponential atlas of the exponential statistical bundle $S\mathcal{E}(\mu)$ is the collection of charts given for each $p \in \mathcal{E}(\mu)$ by
\[ s_{p} \colon S\mathcal{E}(\mu) \ni (q, v) \mapsto \left( s_{p}(q), {}^{e}\mathbb{U}_{q}^{p} v \right) \in S_{p}\mathcal{E}(\mu) \times S_{p}\mathcal{E}(\mu), \tag{9} \]
where
\[ s_{p}(q) = \log\frac{q}{p} - \mathbb{E}_{p}\left[ \log\frac{q}{p} \right]. \tag{10} \]
As $s_{p}(p, v) = (0, v)$, we say that $s_{p}$ is the chart centered at $p$. If $s_{p}(q) = u$, from eq. (10) follows the exponential form of $q$ as a density with respect to $p$, namely $q = \mathrm{e}^{u - \mathbb{E}_{p}\left[\log\frac{p}{q}\right]} \cdot p$. As $\mathbb{E}_{\mu}[q] = 1$, then
\[ 1 = \mathbb{E}_{p}\left[ \mathrm{e}^{u - \mathbb{E}_{p}\left[\log\frac{p}{q}\right]} \right] = \mathbb{E}_{p}[\mathrm{e}^{u}]\, \mathrm{e}^{-\mathbb{E}_{p}\left[\log\frac{p}{q}\right]}, \]
so that the cumulant function $K_{p}$ is defined on $S_{p}\mathcal{E}(\mu)$ by
\[ K_{p}(u) = \log \mathbb{E}_{p}[\mathrm{e}^{u}] = \mathbb{E}_{p}\left[ \log\frac{p}{q} \right] = D(p\,\|\,q), \tag{11} \]
that is, $K_{p}(u)$ is the expression in the chart at $p$ of the Kullback-Leibler divergence $q \mapsto D(p\,\|\,q)$, and we can write
\[ q = \mathrm{e}^{u - K_{p}(u)} \cdot p = e_{p}(u). \tag{12} \]
In conclusion, the patch centered at $p$ is
\[ s_{p}^{-1} = e_{p} \colon S_{p}\mathcal{E}(\mu) \times S_{p}\mathcal{E}(\mu) \ni (u, v) \mapsto \left( e_{p}(u), {}^{e}\mathbb{U}_{p}^{e_{p}(u)} v \right) \in S\mathcal{E}(\mu). \tag{13} \]
In statistical terms, the random variable $\log(q/p)$ is the point-wise information about $q$ relative to the reference $p$, while $s_{p}(q)$ is the deviation from its mean value at $p$.

The expression of the other divergence in the chart centered at $p$ is
\[ D(q\,\|\,p) = \mathbb{E}_{q}\left[ \log\frac{q}{p} \right] = \mathbb{E}_{q}[u - K_{p}(u)] = \mathbb{E}_{q}[u] - K_{p}(u). \tag{14} \]

Definition 3.
The dual atlas of the mixture statistical bundle ${}^{*}S\mathcal{E}(\mu)$ is the collection of charts given for each $p \in \mathcal{E}(\mu)$ by
\[ \eta_{p} \colon {}^{*}S\mathcal{E}(\mu) \ni (q, w) \mapsto \left( s_{p}(q), {}^{m}\mathbb{U}_{q}^{p} w \right) \in S_{p}\mathcal{E}(\mu) \times {}^{*}S_{p}\mathcal{E}(\mu). \tag{15} \]
We say that $\eta_{p}$ is the chart centered at $p$. The patch centered at $p$ is
\[ \eta_{p}^{-1} \colon S_{p}\mathcal{E}(\mu) \times {}^{*}S_{p}\mathcal{E}(\mu) \ni (u, v) \mapsto \left( e_{p}(u), {}^{m}\mathbb{U}_{p}^{e_{p}(u)} v \right) \in {}^{*}S\mathcal{E}(\mu). \tag{16} \]
We will see that the affine structure is defined by the affine atlases.

A further structure, that interpolates between the exponential and the mixture bundle, is the Hilbert bundle, which is modeled on the Riemannian connection of the positive sphere. The push forward of the Riemannian parallel transport can be computed explicitly so as to have the isometric property
\[ \langle v, w \rangle_{q} = \left\langle \mathbb{U}_{q}^{p} v, \mathbb{U}_{q}^{p} w \right\rangle_{p}. \]
See the explicit definition of $\mathbb{U}_{p}^{q}$ together with some details in the recent tutorial [31]. The notion of Hilbert bundle was introduced originally by M. Kumon and S.-I. Amari [22] and developed by S.-I. Amari [5] as a general set-up for the duality of connections in IG. See also the discussion in the monograph by R.E. Kass and P.W. Vos [18, §]. The full bundle is
\[ {}^{1}S^{1}\mathcal{E}(\mu) = \{ (q, \eta, w) \mid q \in \mathcal{E}(\mu),\ \eta \in {}^{*}S_{q}\mathcal{E}(\mu),\ w \in S_{q}\mathcal{E}(\mu) \}. \]
In general, ${}^{h}S^{k}$ will denote $h$ mixture factors and $k$ exponential factors. Note ${}^{1}S^{0} = {}^{*}S$ and ${}^{0}S^{1} = S$.

3. Hessian structure and second order geometry
In the construction of the statistical bundle given above, we were inspired by Amari's original Information Geometry [6] in that we have shown that the statistical bundle is an extension of the tangent bundle of the Riemannian manifold whose metric is the Fisher metric and, moreover, we have provided a system of dually affine parallel transports. In this section, we proceed by introducing a further structure, namely, we show that the base manifold $\mathcal{E}(\mu)$ is actually a Hessian manifold with respect to any of the convex functions $K_{p}(u) = \log \mathbb{E}_{p}[\mathrm{e}^{u}]$, $u \in S_{p}\mathcal{E}(\mu)$, see H. Shima's monograph [34]. Many useful computations in classical Statistical Physics and, later, in Mathematical Statistics, have actually been performed using the derivatives of a master convex function, that is, using the Hessian structure.

The connection is established by the following equations, which are easily checked:
\[ \mathbb{E}_{e_{p}(u)}[h] = dK_{p}(u)[h]; \tag{17} \]
\[ {}^{e}\mathbb{U}_{p}^{e_{p}(u)} h = h - dK_{p}(u)[h]; \tag{18} \]
\[ d^{2}K_{p}(u)[h_{1}, h_{2}] = \left\langle {}^{e}\mathbb{U}_{p}^{e_{p}(u)} h_{1}, {}^{e}\mathbb{U}_{p}^{e_{p}(u)} h_{2} \right\rangle_{e_{p}(u)}; \tag{19} \]
\[ d^{3}K_{p}(u)[h_{1}, h_{2}, h_{3}] = \mathbb{E}_{e_{p}(u)}\left[ \left( {}^{e}\mathbb{U}_{p}^{e_{p}(u)} h_{1} \right) \left( {}^{e}\mathbb{U}_{p}^{e_{p}(u)} h_{2} \right) \left( {}^{e}\mathbb{U}_{p}^{e_{p}(u)} h_{3} \right) \right]. \tag{20} \]
With such computational tools, we can proceed to discuss the kinematics of the statistical bundles.

3.1. Velocities and covariant derivatives.
Let us compute the expression of the velocity at time $t$ of a smooth curve
\[ t \mapsto \gamma(t) = (q(t), w(t)) \in S\mathcal{E}(\mu) \tag{21} \]
in the exponential chart centered at $p$. The expression of the curve is
\[ \gamma_{p}(t) = \left( s_{p}(q(t)), {}^{e}\mathbb{U}_{q(t)}^{p} w(t) \right), \tag{22} \]
and hence we have, by denoting the ordinary derivative of a curve in $\mathbb{R}^{N}$ by the dot,
\[ \frac{d}{dt} s_{p}(q(t)) = \frac{d}{dt}\left( \log\frac{q(t)}{p} - \mathbb{E}_{p}\left[ \log\frac{q(t)}{p} \right] \right) = \frac{\dot q(t)}{q(t)} - \mathbb{E}_{p}\left[ \frac{\dot q(t)}{q(t)} \right] = {}^{e}\mathbb{U}_{q(t)}^{p} \frac{\dot q(t)}{q(t)} = {}^{e}\mathbb{U}_{q(t)}^{p} \frac{d}{dt}\log q(t), \tag{23} \]
and
\[ \frac{d}{dt}\, {}^{e}\mathbb{U}_{q(t)}^{p} w(t) = \frac{d}{dt}\left( w(t) - \mathbb{E}_{p}[w(t)] \right) = \dot w(t) - \mathbb{E}_{p}[\dot w(t)]. \tag{24} \]
There is a clear advantage in expressing the tangent at each time $t$ in the moving frame centered at the position $q(t)$ of the curve itself. Because of that, we define the velocity of the curve
\[ t \mapsto q(t) = \mathrm{e}^{u(t) - K_{p}(u(t))} \cdot p, \quad u(t) = s_{p}(q(t)), \tag{25} \]
to be
\[ \overset{\star}{q}(t) = {}^{e}\mathbb{U}_{p}^{q(t)} \frac{d}{dt} s_{p}(q(t)) = \dot u(t) - \mathbb{E}_{q(t)}[\dot u(t)] = \dot u(t) - dK_{p}(u(t))[\dot u(t)] = \frac{d}{dt}\log q(t) = \frac{\dot q(t)}{q(t)}. \tag{26} \]
It follows that $t \mapsto (q(t), \overset{\star}{q}(t))$ is a curve in the statistical bundle whose expression in the chart centered at $p$ (the reference density in eq. (25)) is $t \mapsto (u(t), \dot u(t))$. In fact,
\[ {}^{e}\mathbb{U}_{q(t)}^{p} \left( \dot u(t) - dK_{p}(u(t))[\dot u(t)] \right) = \dot u(t). \tag{27} \]
The mapping $q \mapsto (q, \overset{\star}{q})$ is a lift of the curve to the statistical bundle.

Remark. The velocity as defined above is nothing else than the score function of a one-dimensional parametric statistical model, see, for example, the contemporary textbook by B. Efron and T. Hastie [14, §]. If $f$ is any random variable, then the variation of the expectation is
\[ \frac{d}{dt} \mathbb{E}_{q(t)}[f] = \left\langle f - \mathbb{E}_{q(t)}[f], \overset{\star}{q}(t) \right\rangle_{q(t)}. \]
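The velocity formula (26) and the identity for the variation of the expectation can be checked by finite differences. The following is a minimal sketch under stated assumptions: the curve is taken to be the one-dimensional exponential family $q(t) = e_{p}(tu)$, the helper names `expect` and `gibbs` and all numerical values are illustrative, and derivatives are approximated by central differences.

```python
import math

def expect(q, f):
    """E_q[f] = (1/N) * sum_x f(x) q(x), q a density w.r.t. the uniform mu."""
    return sum(fx * qx for fx, qx in zip(f, q)) / len(q)

def gibbs(p, u, t):
    """Exponential curve q(t) = exp(t*u - K_p(t*u)) * p through the density p."""
    weights = [px * math.exp(t * ux) for px, ux in zip(p, u)]
    z = sum(weights) / len(p)          # E_p[exp(t*u)]
    return [wx / z for wx in weights]

p = [0.4, 1.6, 1.2, 0.8]               # reference density, sums to N = 4
u = [1.0, -0.5, 0.25, 0.0]             # direction of the exponential curve
f = [2.0, 0.0, -1.0, 3.0]              # an arbitrary random variable
t0, h = 0.7, 1e-5

q0 = gibbs(p, u, t0)
q_plus, q_minus = gibbs(p, u, t0 + h), gibbs(p, u, t0 - h)

# Velocity (score) in closed form: for this curve it is u - E_q[u].
mean_u = expect(q0, u)
score = [ux - mean_u for ux in u]

# Velocity as d/dt log q(t), approximated by a central difference.
fd_score = [(math.log(a) - math.log(b)) / (2 * h) for a, b in zip(q_plus, q_minus)]

# Left-hand side d/dt E_q[f] by central difference.
fd_mean = (expect(q_plus, f) - expect(q_minus, f)) / (2 * h)

# Right-hand side <f - E_q[f], score>_q.
mean_f = expect(q0, f)
covariance = expect(q0, [(fx - mean_f) * sx for fx, sx in zip(f, score)])

score_error = max(abs(a - b) for a, b in zip(score, fd_score))
```

The score is automatically centered at $q(t)$, which is the statement that the lifted curve stays in the statistical bundle.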
Moreover, the variance of the score function, that is, the squared norm with respect to $q(t)$ of the velocity $\overset{\star}{q}(t)$, is classically known in Statistics as the Fisher information at $t$ of the statistical model $t \mapsto q(t)$. Namely,
\[ I(t) = \int \left( \overset{\star}{q}(t) \right)^{2} q(t)\, d\mu = \int \frac{\dot q(t)^{2}}{q(t)}\, d\mu. \]
In turn, the Schwarz inequality applied to the two equations above produces the Cramér-Rao bound
\[ I(t)^{-1} \leq \left( \frac{d}{dt} \mathbb{E}_{q(t)}[f] \right)^{-2} \operatorname{Var}_{q(t)}(f). \]
Let us turn to the interpretation of the second component in eq. (24). Given the exponential parallel transport, we define a covariant derivative by setting
\[ \frac{D}{dt} w(t) = {}^{e}\mathbb{U}_{p}^{q(t)} \frac{d}{dt}\, {}^{e}\mathbb{U}_{q(t)}^{p} w(t) = {}^{e}\mathbb{U}_{p}^{q(t)} \left( \dot w(t) - \mathbb{E}_{p}[\dot w(t)] \right) = \dot w(t) - \mathbb{E}_{q(t)}[\dot w(t)]. \tag{28} \]
Throughout the paper, the notation $\frac{D}{dt}$ denotes the covariant time derivative in a given transport or connection, whose choice will depend on the context.

Let us do the computation in the dual bundle. The curve now is $\zeta(t) = (q(t), \eta(t))$ and the expression of the second component is ${}^{m}\mathbb{U}_{q(t)}^{p} \eta(t) = \frac{q(t)}{p} \eta(t)$. This gives
\[ \frac{d}{dt}\, {}^{m}\mathbb{U}_{q(t)}^{p} \eta(t) = \frac{d}{dt} \frac{q(t)}{p} \eta(t) = \frac{1}{p}\left( \dot q(t) \eta(t) + q(t) \dot\eta(t) \right), \tag{29} \]
which, in turn, gives the dual covariant derivative
\[ \frac{D}{dt} \eta(t) = {}^{m}\mathbb{U}_{p}^{q(t)} \frac{d}{dt}\, {}^{m}\mathbb{U}_{q(t)}^{p} \eta(t) = \frac{p}{q(t)} \frac{1}{p}\left( \dot q(t) \eta(t) + q(t) \dot\eta(t) \right) = \overset{\star}{q}(t) \eta(t) + \dot\eta(t). \tag{30} \]
The couple of covariant derivatives of eqs. (28) and (30) are compatible with the duality pairing, as the following proposition shows.
Proposition 2 (Duality of the covariant derivatives). For each smooth curve in the full statistical bundle, $t \mapsto (q(t), \eta(t), w(t)) \in {}^{1}S^{1}\mathcal{E}(\mu)$, it holds
\[ \frac{d}{dt} \langle \eta(t), w(t) \rangle_{q(t)} = \left\langle \frac{D}{dt} \eta(t), w(t) \right\rangle_{q(t)} + \left\langle \eta(t), \frac{D}{dt} w(t) \right\rangle_{q(t)}. \tag{31} \]

Proof. The proof is a simple computation based on eq. (7).
\begin{align*}
\frac{d}{dt} \langle \eta(t), w(t) \rangle_{q(t)}
&= \frac{d}{dt} \left\langle {}^{m}\mathbb{U}_{q(t)}^{p} \eta(t), {}^{e}\mathbb{U}_{q(t)}^{p} w(t) \right\rangle_{p} \\
&= \left\langle \frac{d}{dt}\, {}^{m}\mathbb{U}_{q(t)}^{p} \eta(t), {}^{e}\mathbb{U}_{q(t)}^{p} w(t) \right\rangle_{p} + \left\langle {}^{m}\mathbb{U}_{q(t)}^{p} \eta(t), \frac{d}{dt}\, {}^{e}\mathbb{U}_{q(t)}^{p} w(t) \right\rangle_{p} \\
&= \left\langle {}^{m}\mathbb{U}_{p}^{q(t)} \frac{d}{dt}\, {}^{m}\mathbb{U}_{q(t)}^{p} \eta(t), w(t) \right\rangle_{q(t)} + \left\langle \eta(t), {}^{e}\mathbb{U}_{p}^{q(t)} \frac{d}{dt}\, {}^{e}\mathbb{U}_{q(t)}^{p} w(t) \right\rangle_{q(t)} \\
&= \left\langle \frac{D}{dt} \eta(t), w(t) \right\rangle_{q(t)} + \left\langle \eta(t), \frac{D}{dt} w(t) \right\rangle_{q(t)}.
\end{align*}

Let us now look at the duality pairing $(\square, \lozenge) \mapsto \langle \square, \lozenge \rangle_{q}$ as an inner product $(\bigcirc, \bigcirc) \mapsto \langle \bigcirc, \bigcirc \rangle_{q}$ on the Hilbert space $L^{2}_{0}(q)$. As topological vector spaces, we can use the identification $L^{2}_{0}(q) = {}^{*}S_{q}\mathcal{E}(\mu) = S_{q}\mathcal{E}(\mu)$, so that we can consider the full bundle as a Hilbert bundle.

Let a smooth curve in such a bundle be given, $t \mapsto (q(t), \alpha(t), \beta(t))$. Because now the two statistical bundles are identified, we are bound to provisionally use different notations for the two covariant derivatives.

By using the symmetry, we get
\begin{align*}
\frac{d}{dt} \langle \alpha(t), \beta(t) \rangle_{q(t)}
&= \left\langle \frac{D^{m}}{dt} \alpha(t), \beta(t) \right\rangle_{q(t)} + \left\langle \alpha(t), \frac{D^{e}}{dt} \beta(t) \right\rangle_{q(t)} \\
&= \left\langle \frac{D^{e}}{dt} \alpha(t), \beta(t) \right\rangle_{q(t)} + \left\langle \alpha(t), \frac{D^{m}}{dt} \beta(t) \right\rangle_{q(t)} \\
&= \left\langle \frac{D^{0}}{dt} \alpha(t), \beta(t) \right\rangle_{q(t)} + \left\langle \alpha(t), \frac{D^{0}}{dt} \beta(t) \right\rangle_{q(t)},
\end{align*}
where
\[ \frac{D^{0}}{dt} = \frac{1}{2}\left( \frac{D^{m}}{dt} + \frac{D^{e}}{dt} \right). \]
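The product rule (31), and its symmetric version for the averaged derivative, can be tested numerically. The sketch below is only illustrative and rests on several assumptions: the base curve is the exponential curve $q(t) = e_{p}(tu)$, the fields $\eta(t)$ and $w(t)$ are arbitrary smooth curves centered at $q(t)$, ordinary time derivatives are approximated by central differences, and every function name is hypothetical.

```python
import math

def expect(q, f):
    """E_q[f] = (1/N) * sum_x f(x) q(x)."""
    return sum(fx * qx for fx, qx in zip(f, q)) / len(q)

def centered(q, f):
    """Project f onto the fiber at q by subtracting E_q[f]."""
    m = expect(q, f)
    return [fx - m for fx in f]

def gibbs(p, u, t):
    """Exponential curve q(t) = exp(t*u - K_p(t*u)) * p."""
    weights = [px * math.exp(t * ux) for px, ux in zip(p, u)]
    z = sum(weights) / len(p)
    return [wx / z for wx in weights]

def ddt(curve, t, h=1e-5):
    """Central finite difference of a list-valued curve."""
    return [(a - b) / (2 * h) for a, b in zip(curve(t + h), curve(t - h))]

p = [0.4, 1.6, 1.2, 0.8]
u = [1.0, -0.5, 0.25, 0.0]

def q(t):
    return gibbs(p, u, t)

def eta(t):   # a smooth curve in the mixture fibers along q(t)
    return centered(q(t), [math.sin(t + k) for k in range(4)])

def w(t):     # a smooth curve in the exponential fibers along q(t)
    return centered(q(t), [math.cos(2 * t * k) + k for k in range(4)])

def score(t):                      # velocity of the exponential curve
    return centered(q(t), u)

def d_exp(field, t):               # eq. (28): derivative minus its mean at q(t)
    return centered(q(t), ddt(field, t))

def d_mix(field, t):               # eq. (30): score * field + derivative
    return [s * f + df for s, f, df in zip(score(t), field(t), ddt(field, t))]

def d_hilbert(field, t):           # average of the two derivatives
    return [(a + b) / 2 for a, b in zip(d_mix(field, t), d_exp(field, t))]

def pair(t, a, b):
    return expect(q(t), [x * y for x, y in zip(a, b)])

t0, h = 0.3, 1e-5
lhs = (pair(t0 + h, eta(t0 + h), w(t0 + h))
       - pair(t0 - h, eta(t0 - h), w(t0 - h))) / (2 * h)
rhs_dual = pair(t0, d_mix(eta, t0), w(t0)) + pair(t0, eta(t0), d_exp(w, t0))
rhs_hilbert = (pair(t0, d_hilbert(eta, t0), w(t0))
               + pair(t0, eta(t0), d_hilbert(w, t0)))
```

In this sketch the agreement of `lhs` with both right-hand sides holds only up to the finite-difference error, so any comparison should use a tolerance well above machine precision.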
Up to now, we have defined the following derivation operators on the statistical bundles:
(1) A velocity $\overset{\star}{q}(t) = \frac{d}{dt}\log q(t)$, which is the expression in the moving frame of the derivative.
(2) An exponential covariant derivative $\frac{D}{dt} w(t) = \frac{D^{e}}{dt} w(t) = {}^{e}\mathbb{U}_{p}^{q(t)} \frac{d}{dt}\, {}^{e}\mathbb{U}_{q(t)}^{p} w(t)$.
(3) A mixture covariant derivative $\frac{D}{dt} \eta(t) = \frac{D^{m}}{dt} \eta(t) = {}^{m}\mathbb{U}_{p}^{q(t)} \frac{d}{dt}\, {}^{m}\mathbb{U}_{q(t)}^{p} \eta(t)$.
(4) A Hilbert covariant derivative
\[ \frac{D^{0}}{dt} \alpha(t) = \frac{1}{2}\left( \frac{D^{m}}{dt} \alpha(t) + \frac{D^{e}}{dt} \alpha(t) \right) = \dot\alpha(t) - \mathbb{E}_{q(t)}[\dot\alpha(t)] + \frac{1}{2} \overset{\star}{q}(t) \alpha(t). \]

Remark. We have used here a presentation based on one-dimensional statistical models. From the differential geometry point of view, it is more common to define covariant derivation on a vector field. We briefly comment on this issue below.

Given two smooth sections $X, Y$ of the statistical bundle, that is, two differentiable mappings $X, Y \colon \mathcal{E}(\mu) \to \mathbb{R}^{N}$ such that for all $q$ it holds $\mathbb{E}_{q}[X(q)] = \mathbb{E}_{q}[Y(q)] = 0$, the covariant derivative is defined by
\[ D_{Y} X(q) = \left. \frac{D}{dt} X(q(t)) \right|_{t=0} \quad \text{for } q(0) = q \text{ and } \overset{\star}{q}(0) = Y(q). \]
A detailed discussion of the geometry associated to our setting should include, for example, the computation of the Christoffel coefficients and the curvature of each of the three connections we have introduced. Some of these computations are not really relevant for our main goal, that is, the foundations of the mechanics of the statistical bundle. Others are probably useful and interesting.

As an example, let us check whether the Hilbert connection defined above is symmetric, that is, $D_{Y} X - D_{X} Y = [X, Y]$. If such a condition holds true, then the connection is the unique Levi-Civita connection. We have, for each $\overset{\star}{q} = Y(q)$ and $q(0) = q$, that
\begin{align*}
D_{Y} X(q(t)) &= \frac{d}{dt} X(q(t)) - \mathbb{E}_{q(t)}\left[ \frac{d}{dt} X(q(t)) \right] + \frac{1}{2} \overset{\star}{q}(t) X(q(t)) \\
&= dX(q(t))[\dot q(t)] - \mathbb{E}_{q(t)}\left[ dX(q(t))[\dot q(t)] \right] + \frac{1}{2} X(q(t)) Y(q(t)) \\
&= q(t)\, dX(q(t))[Y(q(t))] - \mathbb{E}_{q(t)}\left[ q(t)\, dX(q(t))[Y(q(t))] \right] + \frac{1}{2} X(q(t)) Y(q(t)).
\end{align*}
The form of the Hilbert covariant derivative in terms of ordinary derivatives of fields is
\[ D_{Y} X(q) = q\, dX(q)[Y(q)] - \mathbb{E}_{q}\left[ q\, dX(q)[Y(q)] \right] + \frac{1}{2} X(q) Y(q). \]
It follows that the bracket is
\[ [X, Y](q) = D_{Y} X(q) - D_{X} Y(q) = q\, dX(q)[Y(q)] - q\, dY(q)[X(q)]. \]
In fact, the expectation term is zero because $\mathbb{E}_{q}[[X, Y](q)] = 0$.

3.2. Higher order statistical bundles and accelerations.
We define the second statistical bundle to be
\[ S^{3}\mathcal{E}(\mu) = \{ (q, w_{1}, w_{2}, w_{3}) \mid q \in \mathcal{E}(\mu),\ w_{1}, w_{2}, w_{3} \in S_{q}\mathcal{E}(\mu) \}, \tag{32} \]
with charts centered at each $p \in \mathcal{E}(\mu)$ defined by
\[ s_{p}(q, w_{1}, w_{2}, w_{3}) = \left( s_{p}(q), {}^{e}\mathbb{U}_{q}^{p} w_{1}, {}^{e}\mathbb{U}_{q}^{p} w_{2}, {}^{e}\mathbb{U}_{q}^{p} w_{3} \right). \tag{33} \]
The second bundle is an expression of the tangent bundle of the exponential bundle. For each curve $t \mapsto \gamma(t) = (q(t), w(t))$ in the statistical bundle, we define its velocity at $t$ to be
\[ \overset{\star}{\gamma}(t) = \left( q(t), w(t), \overset{\star}{q}(t), \frac{D}{dt} w(t) \right), \tag{34} \]
because $t \mapsto \overset{\star}{\gamma}(t)$ is a curve in the second statistical bundle and its expression in the chart at $p$ has the last two components equal to the values given in eq. (23) and eq. (24), respectively. The corresponding notion of gradient will be discussed in the next section.

In particular, for each smooth curve $t \mapsto q(t)$, the velocity of its lift $t \mapsto \chi(t) = (q(t), \overset{\star}{q}(t))$ is
\[ \overset{\star}{\chi}(t) = \left( q(t), \overset{\star}{q}(t), \overset{\star}{q}(t), \overset{\star\star}{q}(t) \right), \tag{35} \]
where the acceleration $\overset{\star\star}{q}(t)$ at $t$ is
\[ \overset{\star\star}{q}(t) = \frac{D}{dt} \overset{\star}{q}(t) = \frac{d}{dt} \frac{\dot q(t)}{q(t)} - \mathbb{E}_{q(t)}\left[ \frac{d}{dt} \frac{\dot q(t)}{q(t)} \right] = \frac{\ddot q(t)}{q(t)} - \left( \overset{\star}{q}(t)^{2} - \mathbb{E}_{q(t)}\left[ \overset{\star}{q}(t)^{2} \right] \right). \tag{36} \]
Notice that the computations above are performed in the embedding space. The acceleration has been defined using the transports.
Indeed, the connection here is defined by the transports $\mathbb{U}_{p}^{q}$, an approach that seems natural from the probabilistic point of view, cf. [15]. The non-parametric approach to IG allows one to define naturally a dual transport, hence the dual connection of [6].

The last equality in eq. (36) is obtained as follows. In fact,
\begin{align*}
\frac{d}{dt} \frac{\dot q(t)}{q(t)} - \mathbb{E}_{q(t)}\left[ \frac{d}{dt} \frac{\dot q(t)}{q(t)} \right]
&= \frac{\ddot q(t) q(t) - \dot q(t)^{2}}{q(t)^{2}} - \mathbb{E}_{q(t)}\left[ \frac{\ddot q(t) q(t) - \dot q(t)^{2}}{q(t)^{2}} \right] \\
&= \frac{\ddot q(t)}{q(t)} - \left( \frac{\dot q(t)}{q(t)} \right)^{2} - \mathbb{E}_{q(t)}\left[ \frac{\ddot q(t)}{q(t)} - \left( \frac{\dot q(t)}{q(t)} \right)^{2} \right] \\
&= \frac{\ddot q(t)}{q(t)} - \left( \overset{\star}{q}(t) \right)^{2} - \mathbb{E}_{1}[\ddot q(t)] + \mathbb{E}_{q(t)}\left[ \left( \overset{\star}{q}(t) \right)^{2} \right], \tag{37}
\end{align*}
where we write $\mathbb{E}_{1}[f] = \int f\, d\mu$ when the density is 1. Now, eq. (36) follows from $\mathbb{E}_{1}[\ddot q(t)] = \frac{d^{2}}{dt^{2}} \int q(t)\, d\mu = 0$. Recall that
\[ t \mapsto \mathbb{E}_{q(t)}\left[ \overset{\star}{q}(t)^{2} \right] = \mathbb{E}_{1}\left[ \frac{\dot q(t)^{2}}{q(t)} \right] \tag{38} \]
is the Fisher information of $t \mapsto q(t)$.

The acceleration defined above has the one-dimensional exponential families as (differential) geodesics. Every exponential (Gibbs) curve $t \mapsto q(t) = e_{p}(tu)$ has velocity $\overset{\star}{q}(t) = u - dK_{p}(tu)[u]$, so that the acceleration is $\overset{\star\star}{q}(t) = 0$. Conversely, if one writes $v(t) = \log q(t)$, then
\[ 0 = \overset{\star\star}{q}(t) = \ddot v(t) - \mathbb{E}_{q(t)}[\ddot v(t)], \]
so that $v(x; t) = t v(x) + c(t)$.

Example. Let us discuss a representation of the acceleration that does not involve the construction of a second order bundle. Consider the curve $t \mapsto q(t)$ and its lift $t \mapsto (q(t), \overset{\star}{q}(t)) \in S\mathcal{E}(\mu)$. From the retraction $(q, \overset{\star}{q}) \to (q, \chi)$, we can define a new curve
\[ t \mapsto \chi(t) = e_{q(t)}\left( \overset{\star}{q}(t) \right) = \mathrm{e}^{\overset{\star}{q}(t) - K_{q(t)}(\overset{\star}{q}(t))} \cdot q(t), \]
such that $\overset{\star}{q}(t) = s_{q(t)}(\chi(t))$.

Let us compute the velocity of $\chi$.
\begin{align*}
\overset{\star}{\chi}(t) &= \frac{d}{dt} \log \chi(t) = \frac{d}{dt}\left( \overset{\star}{q}(t) - K_{q(t)}(\overset{\star}{q}(t)) + \log q(t) \right) \\
&= \overset{\star\star}{q}(t) + \mathbb{E}_{q(t)}\left[ \frac{d}{dt} \overset{\star}{q}(t) \right] - \frac{d}{dt} K_{q(t)}(\overset{\star}{q}(t)) + \overset{\star}{q}(t) \\
&= \overset{\star\star}{q}(t) + \overset{\star}{q}(t) + c(t),
\end{align*}
where $c(t)$ is a scalar. That is, $\overset{\star}{\chi}(t)$ and $\overset{\star\star}{q}(t) + \overset{\star}{q}(t)$ differ by a scalar; in particular,
\[ \mathbb{E}_{q(t)}\left[ \overset{\star}{\chi}(t) \right] + \mathbb{E}_{\chi(t)}\left[ \overset{\star\star}{q}(t) + \overset{\star}{q}(t) \right] = 0. \]
In conclusion, the following representation of the acceleration $\overset{\star\star}{q}$ in terms of the velocities $\overset{\star}{\chi}$ and $\overset{\star}{q}$ holds true:
\[ \overset{\star\star}{q} = {}^{e}\mathbb{U}_{\chi}^{q} \overset{\star}{\chi} - \overset{\star}{q} \quad \text{and} \quad \overset{\star}{\chi} = {}^{e}\mathbb{U}_{q}^{\chi}\left( \overset{\star\star}{q} + \overset{\star}{q} \right). \tag{39} \]
In a chart centered at $p$, we have $q(t) = e_{p}(u(t))$ and $\overset{\star}{q}(t) = {}^{e}\mathbb{U}_{p}^{q(t)} \dot u(t)$, so that $\chi(t)/p \propto \mathrm{e}^{\dot u(t) + u(t)}$. In particular, in the case of an exponential model, $u(t) = tu$ and $\chi(t) = q(t+1)$, a property that is equivalent to the geodesic property $\overset{\star\star}{q}(t) = 0$.

We can also define other types of acceleration. In fact, we have three different interpretations of the lifted curve, namely, we can consider $t \mapsto (q(t), \overset{\star}{q}(t))$ as a curve in the statistical bundle $S\mathcal{E}(\mu)$, or a curve in the dual bundle ${}^{*}S\mathcal{E}(\mu)$, or a curve in the Hilbert bundle.
Each of these frameworks provides a different derivation, hence a different acceleration.

We have the already defined exponential acceleration ${}^{e}D^{2} q(t) = \overset{\star\star}{q}(t)$, and we can define the mixture acceleration as
\[ {}^{m}D^{2} q(t) = \frac{D^{m}}{dt} \overset{\star}{q}(t) = {}^{m}\mathbb{U}_{p}^{q(t)} \frac{d}{dt}\, {}^{m}\mathbb{U}_{q(t)}^{p} \overset{\star}{q}(t) = \frac{\ddot q(t)}{q(t)} \tag{40} \]
and the Riemannian acceleration by
\[ {}^{0}D^{2} q(t) = \frac{1}{2}\left( {}^{e}D^{2} q(t) + {}^{m}D^{2} q(t) \right) = \frac{\ddot q(t)}{q(t)} - \frac{1}{2}\left( \left( \frac{\dot q(t)}{q(t)} \right)^{2} - \mathbb{E}_{q(t)}\left[ \left( \frac{\dot q(t)}{q(t)} \right)^{2} \right] \right). \tag{41} \]
In the review papers [28, 30], the various accelerations are used to derive the relevant Taylor formulæ and the relevant Hessians. Moreover, it is shown that the Riemannian acceleration can be derived using a family of isometric transports on the Hilbert bundle. Here, we will be mostly interested in the mechanical interpretation of the acceleration.

4. Natural gradient
In this section we generalize the (non-parametric) natural gradient to the statistical bundles. Let us first recall the definition we are going to generalize. Given a scalar field $F \colon \mathcal{E}(\mu) \to \mathbb{R}$, the natural gradient is the section $q \mapsto \operatorname{grad} F(q)$ of the dual bundle ${}^{*}S\mathcal{E}(\mu)$ such that for every smooth curve $t \mapsto q(t) \in \mathcal{E}(\mu)$ it holds
\[ \frac{d}{dt} F(q(t)) = \left\langle \operatorname{grad} F(q(t)), \overset{\star}{q}(t) \right\rangle_{q(t)}. \tag{42} \]
The natural gradient can be computed in some cases without recourse to the computation in charts, for example,
\[ \frac{d}{dt} H(q(t)) = -\frac{d}{dt} \mathbb{E}_{1}[q(t) \log q(t)] = -\mathbb{E}_{1}[\dot q(t)(\log q(t) + 1)] = -\mathbb{E}_{q(t)}\left[ \log q(t)\, \overset{\star}{q}(t) \right] = \left\langle -\log q(t) - H(q(t)), \overset{\star}{q}(t) \right\rangle_{q(t)}. \tag{43} \]
In general, the natural gradient can be expressed in charts as a function of the ordinary gradient $\nabla$ as follows. In the generic chart at $p$, with $q = e_{p}(u)$ and $F(q) = F_{p}(u)$, it holds
\begin{align*}
\left\langle \operatorname{grad} F(q(t)), \overset{\star}{q}(t) \right\rangle_{q(t)}
&= \frac{d}{dt} F(q(t)) = \frac{d}{dt} F_{p}(u(t)) = dF_{p}(u(t))[\dot u(t)] = dF_{p}(u(t))\left[ {}^{e}\mathbb{U}_{q(t)}^{p} \overset{\star}{q}(t) \right] \\
&= \left\langle p^{-1} \nabla F_{p}(u(t)), {}^{e}\mathbb{U}_{q(t)}^{p} \overset{\star}{q}(t) \right\rangle_{p} = \left\langle {}^{m}\mathbb{U}_{p}^{q(t)} p^{-1} \nabla F_{p}(u(t)), \overset{\star}{q}(t) \right\rangle_{q(t)} \\
&= \left\langle q^{-1} \nabla F_{p}(u(t)), \overset{\star}{q}(t) \right\rangle_{q(t)} = \left\langle q^{-1} \nabla F_{p}(u(t)) - \mathbb{E}_{q(t)}\left[ q^{-1} \nabla F_{p}(u(t)) \right], \overset{\star}{q}(t) \right\rangle_{q(t)}. \tag{44}
\end{align*}
We use here the name of natural gradient for a computation which does not involve the Fisher matrix because of our choice of the inner product. The push forward of our definition to the tangent bundle of the simplex with the Fisher metric would indeed map our definition to the Riemannian one.

We are going to generalize the computation of the gradient to other cases of interest, namely, the
Lagrangian function, or Lagrangian field, defined on the exponential bundle S E ( µ ),and the Hamiltonian function, or Hamiltonian field, defined on the dual bundle ∗ S E ( µ ).To include both cases, we derive below the generalization of natural gradient to functionsdefined on the full statistical bundle S E ( µ ) and possibly depending on external parameters.While this derivation is essentially trivial, nevertheless we present here a full proof in order tointroduce and clarify the geometrical features of our presentation of the mechanics of the openprobability simplex in the next section.In the statistical bundles the partial derivatives are not defined, but they are defined in thetrivialisations given by the affine charts. Precisely, let be given a scalar field F : S E ( µ ) × D → R , D a domain of R k , and a generic smooth curve t (cid:55)→ ( q ( t ) , η ( t ) , w ( t ) , c ( t )) ∈ S E ( µ ) × D . We want to write(45) ddt F (cid:0) q ( t ) , η ( t ) , w ( t ) , c ( t ) (cid:1) = (cid:10) grad F (cid:0) q ( t ) , η ( t ) , w ( t ) , c ( t ) (cid:1) , (cid:63) q ( t ) (cid:11) q ( t ) + (cid:28) Ddt η ( t ) , grad m F (cid:0) q ( t ) , η ( t ) , w ( t ) , c ( t ) (cid:1)(cid:29) q ( t ) + (cid:28) grad e F (cid:0) q ( t ) , η ( t ) , w ( t ) , c ( t ) (cid:1) , Ddt w ( t ) (cid:29) q ( t ) + ∇ F (cid:0) q ( t ) , η ( t ) , w ( t ) , c ( t ) (cid:1) · ˙ c ( t ) , here the four components of the gradient are S E ( µ ) × D (cid:51) ( q, η, w, c ) (cid:55)→ ( q, grad F (cid:0) q, η, w, c (cid:1) ) ∈ ∗ S q E ( µ )( q, grad m F (cid:0) q, η, w, c (cid:1) ) ∈ S q E ( µ )( q, grad e F (cid:0) q, η, w, c (cid:1) ) ∈ ∗ S q E ( µ )( q, ∇ F (cid:0) q, η, w, c (cid:1) ) ∈ E ( µ ) × R k Let us fix a reference density p and express both the given function and the generic curve inthe chart at p . 
We can write the total derivative as
\[
\begin{aligned}
\frac{d}{dt}F\big(q(t), \eta(t), w(t), c(t)\big) &= \frac{d}{dt}F_p\big(u(t), \zeta(t), v(t), c(t)\big) \\
&= d_1F_p\big(u(t), \zeta(t), v(t), c(t)\big)[\dot u(t)] + d_2F_p\big(u(t), \zeta(t), v(t), c(t)\big)[\dot\zeta(t)] \\
&\quad + d_3F_p\big(u(t), \zeta(t), v(t), c(t)\big)[\dot v(t)] + d_4F_p\big(u(t), \zeta(t), v(t), c(t)\big)[\dot c(t)].
\end{aligned}
\]
In the equation above, $d_j$ denotes the partial derivative with respect to the $j$-th variable of $F_p$, $j = 1, \dots, 4$, which is intended to provide a linear operator to be represented by the appropriate dual vector, that is, the value of the proper gradient.

The last term does not require any comment and we can use the ordinary Euclidean gradient:
\[
d_4F_p\big(u(t), \zeta(t), v(t), c(t)\big)[\dot c(t)] = \nabla F_p\big(u(t), \zeta(t), v(t), c(t)\big) \cdot \dot c(t).
\]
Let us consider together the second and the third term. This is a computation of the fiber derivative and does not involve the representation in chart. Given $\alpha \in {}^{*}S_p\mathcal{E}(\mu)$ and $\beta \in S_p\mathcal{E}(\mu)$, that is, $(\alpha, \beta)$ in the fiber at $p$ of the full bundle, we have
\[
\begin{aligned}
d_2F_p(u, \zeta, v, c)[\alpha] + d_3F_p(u, \zeta, v, c)[\beta]
&= \left.\frac{d}{dt}F_p(u, \zeta + t\alpha, v + t\beta, c)\right|_{t=0} \\
&= \left.\frac{d}{dt}F\big(q, \eta + t\,{}^{m}\mathbb{U}_{p}^{q}\alpha, w + t\,{}^{e}\mathbb{U}_{p}^{q}\beta, c\big)\right|_{t=0} \\
&= \mathcal{F}F(q, \eta, w, c)\left[\left({}^{m}\mathbb{U}_{p}^{q}\alpha, {}^{e}\mathbb{U}_{p}^{q}\beta\right)\right] \\
&= \left\langle {}^{m}\mathbb{U}_{p}^{q}\alpha, \operatorname{grad}_{m}F(q, \eta, w, c)\right\rangle_{q} + \left\langle \operatorname{grad}_{e}F(q, \eta, w, c), {}^{e}\mathbb{U}_{p}^{q}\beta\right\rangle_{q},
\end{aligned}
\]
where $\mathcal{F}$ denotes the fiber derivative in the fiber at $q$, which is expressed, in turn, with the relevant gradients. The notation is possibly confusing, but consider that the inner product always has ${}^{*}S_q\mathcal{E}(\mu)$ first, followed by $S_q\mathcal{E}(\mu)$, and that the subscript to the grad symbol displays which component of the full bundle is considered.

We have that
\[
\frac{D}{dt}w(t) = {}^{e}\mathbb{U}_{p}^{q(t)}\dot v(t), \qquad \frac{D}{dt}\eta(t) = {}^{m}\mathbb{U}_{p}^{q(t)}\dot\zeta(t).
\]
Putting together all the results obtained up to now, we have proved that
\[
\begin{aligned}
\frac{d}{dt}F\big(q(t), \eta(t), w(t), c(t)\big)
&= d_1F_p\big(u(t), \zeta(t), v(t), c(t)\big)\left[{}^{e}\mathbb{U}_{e_p(u(t))}^{p}\overset{\star}{q}(t)\right]
+ \left\langle \frac{D}{dt}\eta(t), \operatorname{grad}_{m}F\big(q(t), \eta(t), w(t), c(t)\big)\right\rangle_{q(t)} \\
&\quad + \left\langle \operatorname{grad}_{e}F\big(q(t), \eta(t), w(t), c(t)\big), \frac{D}{dt}w(t)\right\rangle_{q(t)}
+ \nabla F\big(q(t), \eta(t), w(t), c(t)\big) \cdot \dot c(t).
\end{aligned}
\]
To identify the first term in the total derivative above, consider the "constant" case,
\[
q(t) = e_p(u(t)), \qquad \eta(t) = {}^{m}\mathbb{U}_{p}^{e_p(u(t))}\zeta, \qquad w(t) = {}^{e}\mathbb{U}_{p}^{e_p(u(t))}v, \qquad c(t) = c,
\]
so that the covariant derivatives of $\eta$ and $w$ vanish, $\dot c = 0$, and the total derivative reduces to the first term, $d_1F_p(u(t), \zeta, v, c)\left[{}^{e}\mathbb{U}_{e_p(u(t))}^{p}\overset{\star}{q}(t)\right]$. It follows that the proper way to compute the first gradient is to consider the function on $\mathcal{E}(\mu)$ defined by
\[
q \mapsto F_{\zeta, v, c}(q) = F\big(q, {}^{m}\mathbb{U}_{p}^{q}\zeta, {}^{e}\mathbb{U}_{p}^{q}v, c\big),
\]
which has a natural gradient whose chart representation is precisely that first term. We state the results obtained above in the following formal statement.

Proposition 3. The total derivative eq. (45) holds true, where
(1) $\operatorname{grad}F(q, \eta, w, c)$ is the natural gradient of $q \mapsto F\big(q, {}^{m}\mathbb{U}_{p}^{q}\zeta, {}^{e}\mathbb{U}_{p}^{q}v, c\big)$, that is, with the representation in the $p$-chart $F_p(u, \zeta, v, c) = F\big(e_p(u), {}^{m}\mathbb{U}_{p}^{e_p(u)}\zeta, {}^{e}\mathbb{U}_{p}^{e_p(u)}v, c\big)$, it is defined by
\[
\left\langle \operatorname{grad}F(q, \eta, w, c), \overset{\star}{q}\right\rangle_{q} = d_1F_p(u, \zeta, v, c)\left[{}^{e}\mathbb{U}_{q}^{p}\overset{\star}{q}\right], \qquad (q, \overset{\star}{q}) \in S\mathcal{E}(\mu);
\]
(2) $\operatorname{grad}_{m}F(q, \eta, w, c)$ and $\operatorname{grad}_{e}F(q, \eta, w, c)$ are the fiber gradients;
(3) $\nabla F(q, \eta, w, c)$ is the Euclidean gradient w.r.t. the last variable.

We have concluded the computation of the total derivative of a parametric function of the full bundle. The special cases of the Lagrangian and the Hamiltonian easily follow as a specialization. Notice that the computation of the natural gradient in proposition 3(1) is done by fixing the variables in the fibers to be translations of fixed ones.

We provide below three simple examples that we are going to use repeatedly.
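The defining identity of eq. (42) and the entropy gradient of eq. (43) can be checked numerically on a finite sample space. The following is a minimal sketch, not part of the original derivation: it assumes the uniform reference probability on $N = 5$ points, so that $\mathbb{E}_q[f]$ is the mean of $qf$, and the helper names `expect`, `entropy`, `grad_entropy` and `curve` are hypothetical.

```python
import numpy as np

def expect(q, f):
    """E_q[f] for a density q w.r.t. the uniform probability on N points."""
    return np.mean(q * f)

def entropy(q):
    """H(q) = -E_q[log q]."""
    return -expect(q, np.log(q))

def grad_entropy(q):
    """Natural gradient from eq. (43): -(log q + H(q)), centred under q."""
    return -(np.log(q) + entropy(q))

def curve(t, u0, h):
    """Curve q(t) proportional to exp(u0 + t*h), normalised to mean 1."""
    z = np.exp(u0 + t * h)
    return z / np.mean(z)

rng = np.random.default_rng(0)
u0, h = rng.normal(size=5), rng.normal(size=5)
t, dt = 0.3, 1e-5

q = curve(t, u0, h)
score = h - expect(q, h)  # velocity d/dt log q(t) along this curve

# Left side of eq. (42): central finite difference of t -> H(q(t)).
lhs = (entropy(curve(t + dt, u0, h)) - entropy(curve(t - dt, u0, h))) / (2 * dt)
# Right side of eq. (42): pairing <grad H(q), score>_q.
rhs = expect(q, grad_entropy(q) * score)
print(lhs, rhs)
```

The two printed numbers should agree up to finite-difference error; both the gradient and the velocity have zero expectation under $q$, as required for elements of the fibers at $q$.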
Example 2. If $L(q, w) = \frac12\langle w, w\rangle_{q}$, then
\[
L\left(e_p(u), {}^{e}\mathbb{U}_{p}^{e_p(u)}v\right) = \frac12\mathbb{E}_{e_p(u)}\left[\left({}^{e}\mathbb{U}_{p}^{e_p(u)}v\right)^{2}\right] = \frac12\mathbb{E}_{p}\left[e^{u - K_p(u)}\left({}^{e}\mathbb{U}_{p}^{e_p(u)}v\right)^{2}\right], \tag{46}
\]
with derivative with respect to $u$ in the direction $h$ given by
\[
\frac12\mathbb{E}_{p}\left[e^{u - K_p(u)}\,{}^{e}\mathbb{U}_{p}^{e_p(u)}h\left({}^{e}\mathbb{U}_{p}^{e_p(u)}v\right)^{2}\right] + \mathbb{E}_{p}\left[e^{u - K_p(u)}\left({}^{e}\mathbb{U}_{p}^{e_p(u)}v\right)\left(-\operatorname{Cov}_{e_p(u)}(v, h)\right)\right]
= \frac12\mathbb{E}_{q}\left[w^{2}\,{}^{e}\mathbb{U}_{p}^{e_p(u)}h\right] = \frac12\left\langle w^{2} - \mathbb{E}_{q}\left[w^{2}\right], {}^{e}\mathbb{U}_{p}^{e_p(u)}h\right\rangle_{q},
\]
where the second term vanishes because $\mathbb{E}_{q}[w] = 0$. This, in turn, identifies the natural gradient as $\operatorname{grad}\frac12\langle w, w\rangle_{q} = \frac12\left(w^{2} - \mathbb{E}_{q}\left[w^{2}\right]\right)$.

Example 3. If $L(q, w) = K_q(w)$, then
\[
\begin{aligned}
L\left(e_p(u), {}^{e}\mathbb{U}_{p}^{e_p(u)}v\right) &= K_{e_p(u)}\left({}^{e}\mathbb{U}_{p}^{e_p(u)}v\right) = \log\mathbb{E}_{e_p(u)}\left[e^{v - \mathbb{E}_{e_p(u)}[v]}\right] = \log\mathbb{E}_{p}\left[e^{u - K_p(u) + v - \mathbb{E}_{e_p(u)}[v]}\right] \\
&= \log\mathbb{E}_{p}\left[e^{u + v - K_p(u) - dK_p(u)[v]}\right] = K_p(u + v) - K_p(u) - dK_p(u)[v].
\end{aligned} \tag{47}
\]
Notice that the last member of the equalities is the Bregman divergence of the convex function $K_p$. The derivative with respect to $u$ in the direction $h$ is
\[
\begin{aligned}
dK_p(u + v)[h] - dK_p(u)[h] - d^{2}K_p(u)[v, h]
&= \mathbb{E}_{e_p(u + v)}[h] - \mathbb{E}_{e_p(u)}[h] - \mathbb{E}_{e_p(u)}\left[\left({}^{e}\mathbb{U}_{p}^{e_p(u)}v\right)\left({}^{e}\mathbb{U}_{p}^{e_p(u)}h\right)\right] \\
&= \mathbb{E}_{e_p(u)}\left[\frac{e_p(u + v)}{e_p(u)}h\right] - \mathbb{E}_{e_p(u)}[h] - \mathbb{E}_{q}\left[w\left({}^{e}\mathbb{U}_{p}^{e_p(u)}h\right)\right] \\
&= \mathbb{E}_{e_p(u)}\left[\frac{e_p(u + v)}{e_p(u)}\left({}^{e}\mathbb{U}_{p}^{e_p(u)}h\right)\right] - \left\langle w, {}^{e}\mathbb{U}_{p}^{e_p(u)}h\right\rangle_{q}.
\end{aligned}
\]
The first term is
\[
\begin{aligned}
\mathbb{E}_{e_p(u)}\left[e^{v - (K_p(u + v) - K_p(u))}\left({}^{e}\mathbb{U}_{p}^{e_p(u)}h\right)\right]
&= \mathbb{E}_{e_p(u)}\left[e^{v - \left(K_{e_p(u)}\left({}^{e}\mathbb{U}_{p}^{e_p(u)}v\right) + dK_p(u)[v]\right)}\left({}^{e}\mathbb{U}_{p}^{e_p(u)}h\right)\right] \\
&= \mathbb{E}_{e_p(u)}\left[e^{{}^{e}\mathbb{U}_{p}^{e_p(u)}v - K_{e_p(u)}\left({}^{e}\mathbb{U}_{p}^{e_p(u)}v\right)}\left({}^{e}\mathbb{U}_{p}^{e_p(u)}h\right)\right] \\
&= \mathbb{E}_{q}\left[e^{w - K_q(w)}\left({}^{e}\mathbb{U}_{p}^{e_p(u)}h\right)\right] = \left\langle \frac{e_q(w)}{q} - 1, {}^{e}\mathbb{U}_{p}^{e_p(u)}h\right\rangle_{q}.
\end{aligned}
\]
In conclusion, $\operatorname{grad}K_q(w) = \left(\frac{e_q(w)}{q} - 1\right) - w$. The fiber gradient is easily seen to be $\operatorname{grad}_{e}K_q(w) = \frac{e_q(w)}{q} - 1$. Writing $\chi(t) = e_{q(t)}(\overset{\star}{q}(t))$, we have, for example,
\[
\frac{d}{dt}K_{q(t)}(\overset{\star}{q}(t)) = \left\langle \frac{\chi(t)}{q(t)} - 1 - \overset{\star}{q}(t), \overset{\star}{q}(t)\right\rangle_{q(t)} + \left\langle \frac{\chi(t)}{q(t)} - 1, \overset{\star\star}{q}(t)\right\rangle_{q(t)} = \mathbb{E}_{\chi(t)}\left[\overset{\star\star}{q}(t) + \overset{\star}{q}(t)\right] - \mathbb{E}_{q(t)}\left[\overset{\star}{q}(t)^{2}\right].
\]
This example is of interest to us because it is connected with the KL divergence, $K_q(w) = D(q \,\|\, e_q(w))$.

Example 4. The Hamiltonian
\[
{}^{*}S\mathcal{E}(\mu) \ni (q, \eta) \mapsto H(q, \eta) = \mathbb{E}_{q}\left[(1 + \eta)\log(1 + \eta)\right], \qquad \eta > -1,
\]
is the Legendre transform of the cumulant function $K_q$,
\[
H(q, \eta) = \left\langle \eta, (\operatorname{grad}K_q)^{-1}(\eta)\right\rangle_{q} - K_q\left((\operatorname{grad}K_q)^{-1}(\eta)\right).
\]
In particular, the fiber gradient of $H_q$ is $\operatorname{grad}_{m}H(q, \eta) = \log(1 + \eta) - \mathbb{E}_{q}[\log(1 + \eta)]$, which is the inverse of the fiber gradient of $K_q$. Notice that $r = (1 + \eta)q$ is a density, and $D(r \,\|\, q) = H(q, \eta)$.

Let us compute the natural gradient. The expression of the Hamiltonian in the chart at $p$ is
\[
H_p(u, \zeta) = \mathbb{E}_{e_p(u)}\left[\left(1 + \frac{p}{e_p(u)}\zeta\right)\log\left(1 + \frac{p}{e_p(u)}\zeta\right)\right] = \mathbb{E}_{p}\left[\left(\frac{e_p(u)}{p} + \zeta\right)\log\left(1 + \frac{p}{e_p(u)}\zeta\right)\right].
\]
As, for $h \in S_p\mathcal{E}(\mu)$,
\[
d\left(\frac{e_p(u)}{p} + \zeta\right)[h] = \frac{e_p(u)}{p}\,{}^{e}\mathbb{U}_{p}^{e_p(u)}h \qquad\text{and}\qquad d\left(\frac{p}{e_p(u)}\zeta\right)[h] = -\frac{p}{e_p(u)}\zeta\,{}^{e}\mathbb{U}_{p}^{e_p(u)}h,
\]
the derivative of $H_p$ with respect to $u$ in the direction $h$ is given by
\[
\begin{aligned}
d_1H_p(u, \zeta)[h] &= \mathbb{E}_{p}\left[\left(\frac{e_p(u)}{p}\,{}^{e}\mathbb{U}_{p}^{e_p(u)}h\right)\log\left(1 + \frac{p}{e_p(u)}\zeta\right)\right] - \mathbb{E}_{p}\left[\left(\frac{e_p(u)}{p} + \zeta\right)\left(1 + \frac{p}{e_p(u)}\zeta\right)^{-1}\frac{p}{e_p(u)}\zeta\,{}^{e}\mathbb{U}_{p}^{e_p(u)}h\right] \\
&= \mathbb{E}_{q}\left[\log(1 + \eta)\,{}^{e}\mathbb{U}_{p}^{e_p(u)}h\right] - \mathbb{E}_{q}\left[\eta\,{}^{e}\mathbb{U}_{p}^{e_p(u)}h\right],
\end{aligned}
\]
where $q = e_p(u)$ and $\eta = \frac{p}{q}\zeta$. Hence $\operatorname{grad}H(q, \eta) = \log(1 + \eta) - \mathbb{E}_{q}[\log(1 + \eta)] - \eta$.

5. Mechanics of the statistical bundle
Here, we adapt the general set up of analytic mechanics to the statistical bundle. The presentation extends the formalism first introduced in [29].

5.1. Action integral. If $q\colon [0, 1] \ni t \mapsto q(t)$ is a smooth curve in the exponential manifold $\mathcal{E}(\mu)$ and $t \mapsto (q(t), \overset{\star}{q}(t))$, $\overset{\star}{q}(t) = \frac{d}{dt}\log q(t)$, is its lift to the statistical bundle $S\mathcal{E}(\mu)$, an action integral is
\[
q \mapsto A(q) = \int_{0}^{1} L(q(t), \overset{\star}{q}(t), t)\,dt, \tag{48}
\]
where $L\colon S\mathcal{E}(\mu) \times [0, 1] \to \mathbb{R}$ is a smooth Lagrangian function.

Let us express the action integral in the exponential chart $s_p$ centered at $p$. If $q(t) = e^{u(t) - K_p(u(t))} \cdot p$, with $t \mapsto u(t) \in S_p\mathcal{E}(\mu)$, we have
\[
s_p(q(t), \overset{\star}{q}(t)) = (u(t), \dot u(t)), \tag{49}
\]
hence,
\[
L(q(t), \overset{\star}{q}(t), t) = L\left(e_p(u(t)), {}^{e}\mathbb{U}_{p}^{e_p(u(t))}\dot u(t), t\right) = L_p(u(t), \dot u(t), t), \tag{50}
\]
so that the expression of the action integral is
\[
u \mapsto A_p(u) = \int_{0}^{1} L_p(u(t), \dot u(t), t)\,dt. \tag{51}
\]
Equation (51) is the vector form of the action integral. The Euler-Lagrange equation, written with partial derivatives, that is, without the gradients to be computed below, is
\[
d_1L_p(u(t), \dot u(t), t)[h] = \frac{d}{dt}\,d_2L_p(u(t), \dot u(t), t)[h], \qquad t \in [0, 1],\ h \in S_p\mathcal{E}(\mu). \tag{52}
\]
Exercise.
The equation above is well known but, nevertheless, we repeat the variational argument here because of the unusual set-up. Given $\varphi \in C^{1}([0, 1])$ with $\varphi(0) = \varphi(1) = 0$, for each $\delta \in \mathbb{R}$ and $h \in S_p\mathcal{E}(\mu)$ we define the perturbed curve
\[
q_\delta(t) = e^{(u(t) + \delta\varphi(t)h) - K_p(u(t) + \delta\varphi(t)h)} \cdot p.
\]
Notice that $q_\delta(0) = q(0)$ and $q_\delta(1) = q(1)$. The velocity is
\[
\overset{\star}{q}_\delta(t) = \frac{d}{dt}\log q_\delta(t) = \dot u(t) + \delta\dot\varphi(t)h - \mathbb{E}_{q_\delta(t)}\left[\dot u(t) + \delta\dot\varphi(t)h\right] = {}^{e}\mathbb{U}_{p}^{q_\delta(t)}\left(\dot u(t) + \delta\dot\varphi(t)h\right),
\]
whose expression in the chart centered at $p$ is $\dot u(t) + \delta\dot\varphi(t)h$.

The perturbation of the action integral is
\[
\delta \mapsto \int_{0}^{1} L_p\left(u(t) + \delta\varphi(t)h, \dot u(t) + \delta\dot\varphi(t)h\right)dt,
\]
whose derivative at $\delta$ is, with $u_\delta = u + \delta\varphi h$ and after an integration by parts in which the boundary terms vanish because $\varphi(0) = \varphi(1) = 0$,
\[
\begin{aligned}
\frac{d}{d\delta}\int_{0}^{1} L_p\left(u(t) + \delta\varphi(t)h, \dot u(t) + \delta\dot\varphi(t)h\right)dt
&= \int_{0}^{1}\left(\varphi(t)\,d_1L_p(u_\delta(t), \dot u_\delta(t))[h] + \dot\varphi(t)\,d_2L_p(u_\delta(t), \dot u_\delta(t))[h]\right)dt \\
&= \int_{0}^{1}\varphi(t)\left(d_1L_p(u_\delta(t), \dot u_\delta(t))[h] - \frac{d}{dt}\,d_2L_p(u_\delta(t), \dot u_\delta(t))[h]\right)dt.
\end{aligned}
\]
In particular, the value of the derivative at $\delta = 0$ is
\[
\left.\frac{d}{d\delta}\int_{0}^{1} L_p\left(u(t) + \delta\varphi(t)h, \dot u(t) + \delta\dot\varphi(t)h\right)dt\right|_{\delta = 0} = \int_{0}^{1}\varphi(t)\left(d_1L_p(u(t), \dot u(t))[h] - \frac{d}{dt}\,d_2L_p(u(t), \dot u(t))[h]\right)dt.
\]
If the curve $t \mapsto (q(t), \overset{\star}{q}(t))$ is an extremal of the action integral, then the expression above is zero for all $\varphi$. We have obtained the Euler-Lagrange equation in the exponential chart.

We now derive the Euler-Lagrange equations in the statistical bundle.

Proposition 4 (Euler-Lagrange equation). If $q$ is an extremal of the action integral, then, with the notations of proposition 3,
\[
\frac{D}{dt}\operatorname{grad}_{e}L(q(t), \overset{\star}{q}(t), t) = \operatorname{grad}L(q(t), \overset{\star}{q}(t), t). \tag{53}
\]
Proof.
Consider first the lhs of eq. (52). From proposition 3(1) we have
\[
d_1L_p(u(t), \dot u(t), t)[h] = \left\langle \operatorname{grad}L(q(t), \overset{\star}{q}(t), t), {}^{e}\mathbb{U}_{p}^{q(t)}h\right\rangle_{q(t)}. \tag{54}
\]
Concerning the rhs, from proposition 3(2) we have
\[
d_2L_p(u(t), \dot u(t), t)[h] = \left\langle \operatorname{grad}_{e}L(q(t), \overset{\star}{q}(t), t), {}^{e}\mathbb{U}_{p}^{q(t)}h\right\rangle_{q(t)}. \tag{55}
\]
The derivation formula of eq. (31) gives
\[
\begin{aligned}
\frac{d}{dt}\,d_2L_p(u(t), \dot u(t), t)[h] &= \frac{d}{dt}\left\langle \operatorname{grad}_{e}L(q(t), \overset{\star}{q}(t), t), {}^{e}\mathbb{U}_{p}^{q(t)}h\right\rangle_{q(t)} \\
&= \left\langle \frac{D}{dt}\operatorname{grad}_{e}L(q(t), \overset{\star}{q}(t), t), {}^{e}\mathbb{U}_{p}^{q(t)}h\right\rangle_{q(t)} + \left\langle \operatorname{grad}_{e}L(q(t), \overset{\star}{q}(t), t), \frac{D}{dt}\,{}^{e}\mathbb{U}_{p}^{q(t)}h\right\rangle_{q(t)} \\
&= \left\langle \frac{D}{dt}\operatorname{grad}_{e}L(q(t), \overset{\star}{q}(t), t), {}^{e}\mathbb{U}_{p}^{q(t)}h\right\rangle_{q(t)},
\end{aligned}
\]
because $\frac{D}{dt}\,{}^{e}\mathbb{U}_{p}^{q(t)}h = 0$. As $h$ is arbitrary, the conclusion follows. □

5.2. Legendre transform.
At each fixed density $q \in \mathcal{E}(\mu)$, and each time $t$, the partial mapping
\[
S_q\mathcal{E}(\mu) \ni w \mapsto L_{q,t}(w) = L(q, w, t) \tag{56}
\]
is defined on the vector space $S_q\mathcal{E}(\mu)$, and its gradient mapping in the duality of ${}^{*}S_q\mathcal{E}(\mu) \times S_q\mathcal{E}(\mu)$ is the mapping $w \mapsto \operatorname{grad}_{e}L(q, w, t)$.

Assumption 1.
In the following, we will always restrict our attention to Lagrangians such that the fiber gradient mapping at $q$, $w \mapsto \eta = \operatorname{grad}_{e}L_q(w)$, is a 1-to-1 mapping from $S_q\mathcal{E}(\mu)$ to ${}^{*}S_q\mathcal{E}(\mu)$. In particular, this is true when the partial mappings $w \mapsto L_q(w)$ are strictly convex for each $q$. In our finite dimensional context, this is actually equivalent to the fact that the fiber gradient is a diffeomorphism of the statistical bundles, $\operatorname{grad}_{e}L\colon S\mathcal{E}(\mu) \to {}^{*}S\mathcal{E}(\mu)$. This is related to the properties of regularity and hyper-regularity, cf. [1].

The duality pairing ${}^{*}S_q\mathcal{E}(\mu) \times S_q\mathcal{E}(\mu) \ni (\eta, w) \mapsto \langle \eta, w\rangle_{q} = \mathbb{E}_{q}[\eta w]$ will always be written in this order. The Legendre transform $H_{q,t}$ of $L_{q,t}$ is defined for each $\eta \in {}^{*}S_q\mathcal{E}(\mu)$ in the image of $\operatorname{grad}_{e}L(q, \cdot, t)$ by
\[
H_{q,t}(\eta) = \left\langle \eta, (\operatorname{grad}_{e}L_{q,t})^{-1}(\eta)\right\rangle_{q} - L_{q,t}\left((\operatorname{grad}_{e}L_{q,t})^{-1}(\eta)\right),
\]
which, in turn, defines the Hamiltonian
\[
H(q, \eta, t) = \left\langle \eta, (\operatorname{grad}_{e}L_{q,t})^{-1}(\eta)\right\rangle_{q} - L\left(q, (\operatorname{grad}_{e}L_{q,t})^{-1}(\eta), t\right). \tag{57}
\]
It is a general property of the Legendre transform that
\[
\operatorname{grad}_{m}H_{q,t}(\eta) = (\operatorname{grad}_{e}L_{q,t})^{-1}(\eta),
\]
which, in turn, implies the equality
\[
H(q, \eta, t) + L(q, w, t) = \langle \eta, w\rangle_{q} \quad\text{if } \eta = \operatorname{grad}_{e}L(q, w, t) \text{ or } \operatorname{grad}_{m}H(q, \eta, t) = w. \tag{58}
\]

\[
\begin{aligned}
\overset{\star}{q}(t) &= \frac{d}{dt}\log q(t) = \frac{\dot q(t)}{q(t)} \\
\frac{D}{dt}\eta(t) &= {}^{m}\mathbb{U}_{p}^{q(t)}\,\frac{d}{dt}\,{}^{m}\mathbb{U}_{q(t)}^{p}\eta(t) = \overset{\star}{q}(t)\eta(t) + \dot\eta(t) \\
\frac{D}{dt}w(t) &= {}^{e}\mathbb{U}_{p}^{q(t)}\,\frac{d}{dt}\,{}^{e}\mathbb{U}_{q(t)}^{p}w(t) = \dot w(t) - \mathbb{E}_{q(t)}[\dot w(t)] \\
\frac{d}{dt}H(q(t), \eta(t)) &= \left\langle \operatorname{grad}H(q(t), \eta(t)), \overset{\star}{q}(t)\right\rangle_{q(t)} + \left\langle \frac{D}{dt}\eta(t), \operatorname{grad}_{m}H(q(t), \eta(t))\right\rangle_{q(t)} \\
\frac{d}{dt}L(q(t), w(t)) &= \left\langle \operatorname{grad}L(q(t), w(t)), \overset{\star}{q}(t)\right\rangle_{q(t)} + \left\langle \operatorname{grad}_{e}L(q(t), w(t)), \frac{D}{dt}w(t)\right\rangle_{q(t)}
\end{aligned}
\]
Table 1. Main notations.

Let $t \mapsto (q(t), w(t))$ be a smooth curve in $S\mathcal{E}(\mu)$ and consider the smooth curve in ${}^{*}S\mathcal{E}(\mu)$ given by $t \mapsto (q(t), \eta(t)) = (q(t), \operatorname{grad}_{e}L(q(t), w(t), t))$. From eq. (58), proposition 3 and proposition 2, we get
\[
\begin{aligned}
0 &= \frac{d}{dt}\left(H(q(t), \eta(t), t) + L(q(t), w(t), t) - \langle \eta(t), w(t)\rangle_{q(t)}\right) \\
&= \left\langle \operatorname{grad}H(q(t), \eta(t), t), \overset{\star}{q}(t)\right\rangle_{q(t)} + \left\langle \frac{D}{dt}\eta(t), \operatorname{grad}_{m}H(q(t), \eta(t), t)\right\rangle_{q(t)} + \frac{\partial}{\partial t}H(q(t), \eta(t), t) \\
&\quad + \left\langle \operatorname{grad}L(q(t), w(t), t), \overset{\star}{q}(t)\right\rangle_{q(t)} + \left\langle \operatorname{grad}_{e}L(q(t), w(t), t), \frac{D}{dt}w(t)\right\rangle_{q(t)} + \frac{\partial}{\partial t}L(q(t), w(t), t) \\
&\quad - \left(\left\langle \frac{D}{dt}\eta(t), w(t)\right\rangle_{q(t)} + \left\langle \eta(t), \frac{D}{dt}w(t)\right\rangle_{q(t)}\right) \\
&= \left\langle \operatorname{grad}H(q(t), \eta(t), t), \overset{\star}{q}(t)\right\rangle_{q(t)} + \left\langle \operatorname{grad}L(q(t), w(t), t), \overset{\star}{q}(t)\right\rangle_{q(t)},
\end{aligned} \tag{59}
\]
where the fiber terms cancel because $\operatorname{grad}_{m}H = w$ and $\operatorname{grad}_{e}L = \eta$, and the two partial derivatives $\partial/\partial t$ cancel because of eq. (58). In conclusion,
\[
\operatorname{grad}H(q, \eta, t) + \operatorname{grad}L(q, w, t) = 0 \quad\text{if } \eta = \operatorname{grad}_{e}L(q, w, t) \text{ or } \operatorname{grad}_{m}H(q, \eta, t) = w. \tag{60}
\]

5.3. Hamilton equations.
Let $t \mapsto q(t)$ be a solution of the Euler-Lagrange eq. (53) and define the curve $t \mapsto \zeta(t) = (q(t), \eta(t))$ in ${}^{*}S\mathcal{E}(\mu)$, where $\eta(t) = \operatorname{grad}_{e}L(q(t), \overset{\star}{q}(t), t)$ is the momentum.

Proposition 5 (Hamilton equations). In the notation above, and when Assumption 1 holds true, the Euler-Lagrange eq. (53) becomes
\[
\frac{D}{dt}\eta(t) = \frac{D}{dt}\operatorname{grad}_{e}L(q(t), \overset{\star}{q}(t), t) = \operatorname{grad}L(q(t), \overset{\star}{q}(t), t),
\]
and, by eq. (60), the Hamilton equations hold, namely,
\[
\begin{cases}
\dfrac{D}{dt}\eta(t) = -\operatorname{grad}H(q(t), \eta(t), t), \\[1ex]
\overset{\star}{q}(t) = \operatorname{grad}_{m}H(q(t), \eta(t), t).
\end{cases} \tag{61}
\]
For each solution of the Hamilton equations, it holds
\[
\frac{d}{dt}H(q(t), \eta(t), t) = \frac{\partial}{\partial t}H(q(t), \eta(t), t). \tag{62}
\]
Proof. Notice that the first derivative is a covariant derivative defined in the mixture transport, while the second derivative is a velocity defined in the logarithmic scale. The Hamilton equations eq. (61) follow by substitution into the Euler-Lagrange equations. The conservation eq. (62) follows from the total derivative equations in which the Hamilton equations are substituted. □
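The Legendre identity of eq. (58) can be illustrated with the pair already computed in examples 3 and 4, namely the cumulant $K_q$ and the Hamiltonian $\mathbb{E}_q[(1 + \eta)\log(1 + \eta)]$. The following is a minimal numerical sketch under the assumption of a uniform reference probability on six points; the function names are hypothetical and are not taken from the paper.

```python
import numpy as np

def expect(q, f):
    """E_q[f] for a density q w.r.t. the uniform probability on N points."""
    return np.mean(q * f)

def cumulant(q, w):
    """K_q(w) = log E_q[exp(w)], for w with E_q[w] = 0."""
    return np.log(expect(q, np.exp(w)))

def fiber_grad_K(q, w):
    """grad_e K_q(w) = e_q(w)/q - 1 = exp(w - K_q(w)) - 1."""
    return np.exp(w - cumulant(q, w)) - 1.0

def hamiltonian(q, eta):
    """H(q, eta) = E_q[(1 + eta) log(1 + eta)], eta > -1."""
    return expect(q, (1.0 + eta) * np.log1p(eta))

def fiber_grad_H(q, eta):
    """grad_m H(q, eta) = log(1 + eta) - E_q[log(1 + eta)]."""
    log_term = np.log1p(eta)
    return log_term - expect(q, log_term)

rng = np.random.default_rng(1)
q = rng.uniform(0.5, 1.5, size=6)
q /= np.mean(q)                      # density with mean 1
w = rng.normal(size=6)
w -= expect(q, w)                    # centred: w lies in the fiber S_q

eta = fiber_grad_K(q, w)             # momentum conjugate to w
gap = hamiltonian(q, eta) + cumulant(q, w) - expect(q, eta * w)
w_back = fiber_grad_H(q, eta)        # should recover w
print(gap, np.max(np.abs(w_back - w)))
```

Both printed values should vanish up to rounding: the first is eq. (58), the second is the statement that $\operatorname{grad}_{m}H_q$ inverts $\operatorname{grad}_{e}K_q$.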
In the following two examples, we compute the Euler-Lagrange equation and the Hamiltonequation for the cases in examples 2 to 4. The relevant notations are summarized in table 1.
Example 5. If $L(q, w) = \frac12\langle w, w\rangle_{q}$, then the Legendre transform is $H(q, \eta) = \frac12\langle \eta, \eta\rangle_{q}$. The gradients are
\[
\begin{aligned}
\operatorname{grad}H(q, \eta) &= -\tfrac12\left(\eta^{2} - \mathbb{E}_{q}\left[\eta^{2}\right]\right), & \operatorname{grad}_{m}H(q, \eta) &= \eta, \\
\operatorname{grad}L(q, w) &= \tfrac12\left(w^{2} - \mathbb{E}_{q}\left[w^{2}\right]\right), & \operatorname{grad}_{e}L(q, w) &= w.
\end{aligned}
\]
For $\overset{\star}{q} = w \in {}^{*}S\mathcal{E}(\mu)$, the Euler-Lagrange equation is
\[
\frac{D}{dt}\overset{\star}{q}(t) = \frac12\left(\overset{\star}{q}(t)^{2} - \mathbb{E}_{q(t)}\left[\overset{\star}{q}(t)^{2}\right]\right),
\]
where the covariant derivative is computed in ${}^{*}S\mathcal{E}(\mu)$, that is, $\frac{D}{dt}\overset{\star}{q}(t) = \ddot q(t)/q(t)$. In terms of the exponential acceleration $\overset{\star\star}{q}(t) = \ddot q(t)/q(t) - \left(\overset{\star}{q}(t)^{2} - \mathbb{E}_{q(t)}\left[\overset{\star}{q}(t)^{2}\right]\right)$, the Euler-Lagrange equation reads
\[
\overset{\star\star}{q}(t) = -\frac12\left(\overset{\star}{q}(t)^{2} - \mathbb{E}_{q(t)}\left[\overset{\star}{q}(t)^{2}\right]\right),
\]
while in terms of the Riemannian acceleration in eq. (41), it holds $D^{2}q(t) = 0$.

The Hamilton equations are
\[
\begin{cases}
\dfrac{D}{dt}\eta(t) = \dfrac12\left(\eta(t)^{2} - \mathbb{E}_{q(t)}\left[\eta(t)^{2}\right]\right), \\[1ex]
\overset{\star}{q}(t) = \eta(t),
\end{cases}
\]
with the covariant derivative again computed in ${}^{*}S\mathcal{E}(\mu)$.

The conserved energy is
\[
H(q(t), \eta(t)) = \frac12\left\langle \overset{\star}{q}(t), \overset{\star}{q}(t)\right\rangle_{q(t)} = \frac12\mathbb{E}\left[\frac{\dot q(t)^{2}}{q(t)}\right],
\]
which reflects the conservation of the Fisher information (cf. eq. (38)).

In fact, this variational problem has a closed-form solution which is the image of a geodesic (great circle) on the sphere through an isometric covering from the tangent bundle of the sphere to the statistical bundle, see appendix A. It is interesting to note that this solution is a periodic curve in the set of all densities, but consists of different sections in $\mathcal{E}(\mu)$ because it is interrupted when it touches tangentially the border of the simplex of probability densities.

Example 6. If $L(q, w) = K_q(w)$, then its Legendre transform is $H(q, \eta) = \mathbb{E}_{q}[(1 + \eta)\log(1 + \eta)]$. This is an expression of the dual divergence: as $\eta = \frac{r}{q} - 1$ with $r = e_q(w)$, then $H(q, \eta) = D(r \,\|\, q)$, namely the relative entropy dual to the cumulant.

The gradients are
\[
\begin{aligned}
\operatorname{grad}H(q, \eta) &= \log(1 + \eta) - \mathbb{E}_{q}[\log(1 + \eta)] - \eta, \\
\operatorname{grad}_{m}H(q, \eta) &= \log(1 + \eta) - \mathbb{E}_{q}[\log(1 + \eta)], \\
\operatorname{grad}L(q, w) &= \operatorname{grad}K_q(w) = \left(\frac{e_q(w)}{q} - 1\right) - w, \\
\operatorname{grad}_{e}L(q, w) &= \operatorname{grad}_{e}K_q(w) = \frac{e_q(w)}{q} - 1.
\end{aligned}
\]
The Euler-Lagrange equation is
\[
\frac{D}{dt}\left(\frac{\chi(t)}{q(t)} - 1\right) = \left(\frac{\chi(t)}{q(t)} - 1\right) - \overset{\star}{q}(t), \qquad \chi(t) = e_{q(t)}(\overset{\star}{q}(t)). \tag{63}
\]
The Hamilton equations are
\[
\begin{cases}
\dfrac{D}{dt}\eta(t) = -\log(1 + \eta(t)) + \mathbb{E}_{q(t)}[\log(1 + \eta(t))] + \eta(t), \\[1ex]
\overset{\star}{q}(t) = \log(1 + \eta(t)) - \mathbb{E}_{q(t)}[\log(1 + \eta(t))].
\end{cases} \tag{64}
\]
The conserved energy is
\[
H(q(t), \eta(t)) = \mathbb{E}_{q(t)}\left[\frac{\chi(t)}{q(t)}\log\frac{\chi(t)}{q(t)}\right] = D(\chi(t) \,\|\, q(t)) \quad\text{with } \chi(t) = e_{q(t)}(\overset{\star}{q}(t)).
\]
In the next section, we apply our findings to the solution of problems involving a Lagrangian obtained from a Lagrangian of one of the types above and a potential function.

6. Examples of Lagrangians on the statistical bundle
The derivations above allow us to consider the use of statistical divergences in a setup inspiredby Lagrangian and Hamiltonian mechanics.
Example 7. Our first example is inspired by the standard free particle Lagrangian, where the role of the point particle is played by a probability density as a point on the statistical manifold. The Lagrangian is written as a difference of the quadratic form of example 5 and a potential function given by the negative of the entropy function $H(q(t)) = -\mathbb{E}_{q(t)}[\log q(t)]$. We keep the inertial mass $m$ as a parameter,
\[
L(q, w) = \frac{m}{2}\langle w, w\rangle_{q} + \kappa H(q), \qquad m, \kappa > 0,\ (q, w) \in S\mathcal{E}(\mu). \tag{65}
\]
The first component of the natural gradient reads
\[
\operatorname{grad}L(q, w) = \frac{m}{2}\left(w^{2} - \mathbb{E}_{q}\left[w^{2}\right]\right) - \kappa\left(\log q + H(q)\right).
\]
The natural gradient of the entropy has been computed in eq. (43). The Euler-Lagrange equation gives
\[
m\frac{D}{dt}\overset{\star}{q}(t) = \frac{m}{2}\left(\overset{\star}{q}(t)^{2} - \mathbb{E}_{q(t)}\left[\overset{\star}{q}(t)^{2}\right]\right) - \kappa\left(\log q(t) + H(q(t))\right), \tag{66}
\]
which is Newton's law, written in terms of the mixture covariant derivative [29].

Let us express eq. (66) as a system of ordinary differential equations for $q$ and $\overset{\star}{q}$. We write $v(t) = \overset{\star}{q}(t)$ and note that $v(t) = \frac{d}{dt}\log q(t)$ implies
\[
\frac{d}{dt}q(x; t) = q(x; t)\,v(x; t), \qquad x \in \Omega. \tag{67}
\]
In particular, eq. (67), together with the assumption $\sum_x q(x; t) = N$, implies $\mathbb{E}_{q(t)}[v(t)] = 0$. Conversely, if $\mathbb{E}_{q(t)}[v(t)] = 0$, then $\sum_x q(x; t) = \sum_x q(x; 0)$.

We have
\[
\frac{d}{dt}v(t) = \frac{d}{dt}\frac{\dot q(t)}{q(t)} = \frac{\ddot q(t)}{q(t)} - \left(\frac{\dot q(t)}{q(t)}\right)^{2} = \frac{\ddot q(t)}{q(t)} - v(t)^{2}, \tag{68}
\]
where the first term on the rhs is the mixture acceleration of eq. (40). Thereby, via eq. (66), we have
\[
\frac{d}{dt}v(t) = \frac{\ddot q(t)}{q(t)} - v(t)^{2} = \frac{D}{dt}v(t) - v(t)^{2} = -\frac12 v(t)^{2} - \frac12\mathbb{E}_{q(t)}\left[v(t)^{2}\right] + \frac{\kappa}{m}\operatorname{grad}H(q(t)). \tag{69}
\]
With $A(q, v) = \frac{v^{2}}{2} + \frac{\kappa}{m}\log(q)$ and $B(q, v) = \frac{v^{2}}{2} - \frac{\kappa}{m}\log(q)$, the system of first order differential equations is
\[
\begin{cases}
\dfrac{d}{dt}q(x; t) = q(x; t)\,v(x; t), \\[1ex]
\dfrac{d}{dt}v(x; t) = -A(q(x; t), v(x; t)) - \dfrac{1}{N}\displaystyle\sum_{y} q(y; t)\,B(q(y; t), v(y; t)),
\end{cases} \qquad x \in \Omega. \tag{70}
\]

[Figure 1; panels "Quadratic Lagrangian" and "KL Lagrangian", showing the components $q_i$, the components $v_i$, the sum $\sum_i q_i$ and the expected value $\mathbb{E}_q[v]$.]
Figure 1. Free-motion trajectories on the simplex for the quadratic (blue) and KL (dashed red) Lagrangians. The pink straight line in the right panels indicates the value of the sum of the probability components varying with time. The constant value equal to one confirms that the trajectories never leave the simplex. The gray straight lines indicate the expected value of the score velocity at $q$. These values vanish at any time, implying that the velocities belong to the statistical bundle. In the quadratic case, the geodesic motion approaches the boundary of the simplex tangentially, where one component of the probability vanishes (while the associated velocity diverges). Similarly, the solution of the free KL Lagrangian flow moves toward the boundary of the simplex. In the latter case, however, the motion is faster and the components of the score velocity diverge at the boundary, while the probability tends to it non-tangentially. Both systems share the same initial conditions (black cross).

Example 8. A non-quadratic, non-symmetric generalization of the kinetic energy on the simplex is realised via the Kullback-Leibler divergence.

A divergence is a smooth mapping $D\colon \mathcal{E}(\mu) \times \mathcal{E}(\mu) \to \mathbb{R}$ such that for all $p, q \in \mathcal{E}(\mu)$ it holds $D(p, q) \geq 0$, and $D(p, q) = 0$ if, and only if, $p = q$. Typically, a divergence is not symmetric, and frequently the discussion involves both the divergence and the so-called dual divergence $D^{*}(p, q) = D(q, p)$.

Every divergence can be associated to a Lagrangian by the canonical mapping:
\[
\mathcal{E}(\mu) \times \mathcal{E}(\mu) \ni (q, r) \mapsto (q, s_q(r)) = (q, w) \in S\mathcal{E}(\mu), \tag{71}
\]
where $r = e^{w - K_q(w)} \cdot q$, that is, $w = s_q(r)$.
The inverse mapping is the retraction
\[
S\mathcal{E}(\mu) \ni (q, w) \mapsto (q, e_q(w)) = (q, r) \in \mathcal{E}(\mu) \times \mathcal{E}(\mu). \tag{72}
\]
As the curve $t \mapsto e_q(tw)$ has null exponential acceleration, one could say that eq. (72) defines the exponential mapping of the exponential connection, while eq. (71) defines the so-called logarithmic mapping. It seems to be more informative to observe that we have here an elementary feature of the affine geometry, that is, the equivalence of, on the one side, a couple of a point and a vector and, on the other side, a couple of points.

The expression in a chart centered at $p$ of the mapping of eq. (72) is affine:
\[
\begin{aligned}
S_p\mathcal{E}(\mu) \times S_p\mathcal{E}(\mu) &\to \mathcal{E}(\mu) \times \mathcal{E}(\mu) \to S\mathcal{E}(\mu) \to S_p\mathcal{E}(\mu) \times S_p\mathcal{E}(\mu) \\
(u, v) &\mapsto (e_p(u), e_p(v)) \mapsto \left(e_p(u), s_{e_p(u)}(e_p(v))\right) \mapsto \left(u, {}^{e}\mathbb{U}_{p}^{e_p(u)}(v - u)\right),
\end{aligned} \tag{73}
\]
where we have used the computation
\[
s_{e_p(u)}(e_p(v)) = \log\frac{e^{v - K_p(v)}}{e^{u - K_p(u)}} - \mathbb{E}_{e_p(u)}\left[\log\frac{e^{v - K_p(v)}}{e^{u - K_p(u)}}\right] = (v - u) - \mathbb{E}_{e_p(u)}[v - u].
\]
The correspondence above maps every divergence $D$ into a divergence Lagrangian, and conversely,
\[
L(q, w) = D(q, e_q(w)), \qquad D(q, r) = L(q, s_q(r)). \tag{74}
\]
Notice that, according to our assumptions on the divergence, the divergence Lagrangian defined in eq. (74) is non-negative and zero if, and only if, $w = 0$.

Similarly, every divergence can be associated to a Hamiltonian function on the dual bundle by the canonical mapping:
\[
\mathcal{E}(\mu) \times \mathcal{E}(\mu) \ni (q, r) \mapsto (q, \eta_q(r)) = (q, \eta) \in {}^{*}S\mathcal{E}(\mu), \tag{75}
\]
where $r = (1 + \eta) \cdot q$. The inverse mapping is
\[
{}^{*}S\mathcal{E}(\mu) \ni (q, \eta) \mapsto (q, (1 + \eta) \cdot q) = (q, r) \in \mathcal{E}(\mu) \times \mathcal{E}(\mu). \tag{76}
\]
The canonical example is the Kullback-Leibler divergence $D(q \,\|\, r)$, with cumulant Lagrangian function $K_q(w)$.
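The correspondence between couples of points and couples of a point and a vector can be made concrete on a finite sample space. The sketch below is illustrative only: it assumes the uniform reference probability on six points, and the names `exp_map`, `log_map` and `kl` are hypothetical. It checks that the mappings of eqs. (71) and (72) invert each other and that the cumulant Lagrangian is the divergence Lagrangian of the Kullback-Leibler divergence, as in eq. (74).

```python
import numpy as np

def expect(q, f):
    """E_q[f] for a density q w.r.t. the uniform probability on N points."""
    return np.mean(q * f)

def cumulant(q, w):
    """K_q(w) = log E_q[exp(w)]."""
    return np.log(expect(q, np.exp(w)))

def exp_map(q, w):
    """Retraction of eq. (72): e_q(w) = exp(w - K_q(w)) * q."""
    r = q * np.exp(w)
    return r / np.mean(r)

def log_map(q, r):
    """Mapping of eq. (71): s_q(r) = log(r/q) - E_q[log(r/q)]."""
    log_ratio = np.log(r / q)
    return log_ratio - expect(q, log_ratio)

def kl(q, r):
    """Kullback-Leibler divergence D(q || r) = E_q[log(q/r)]."""
    return expect(q, np.log(q / r))

rng = np.random.default_rng(2)
q = rng.uniform(0.5, 1.5, size=6)
q /= np.mean(q)                      # density with mean 1
w = rng.normal(size=6)
w -= expect(q, w)                    # centred: w lies in the fiber S_q

r = exp_map(q, w)
w_back = log_map(q, r)               # round trip (q, w) -> (q, r) -> (q, w)
print(np.max(np.abs(w_back - w)), kl(q, r) - cumulant(q, w))
```

Both printed values should be zero up to rounding; the second one is the identity $K_q(w) = D(q \,\|\, e_q(w))$.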
Accordingly, the dual divergence $D(r \,\|\, q) = \mathbb{E}_{r}\left[\log\frac{r}{q}\right]$ is naturally associated to the Hamiltonian function
\[
H(q, \eta) = \mathbb{E}_{(1 + \eta) \cdot q}\left[\log\frac{(1 + \eta)q}{q}\right] = \mathbb{E}_{q}\left[(1 + \eta)\log(1 + \eta)\right].
\]
The quadratic Lagrangian $\frac12\langle w, w\rangle_{q}$ previously considered gives another example, whose associated divergence is $\frac12\operatorname{Var}_{q}\left(\log\frac{r}{q}\right)$.

A more general case is the class of the $f$-divergences, $f$ a smooth convex real function,
\[
D_f(r, q) = \mathbb{E}_{q}\left[f\left(\frac{r}{q}\right)\right].
\]
The corresponding Lagrangian is
\[
L(q, w) = \mathbb{E}_{q}\left[f\left(\frac{q}{e_q(w)}\right)\right],
\]
whose expression at $p$ is
\[
\begin{aligned}
L_p(u, v) &= \mathbb{E}_{e_p(u)}\left[f\left(\frac{e_p(u)}{e_{e_p(u)}\left({}^{e}\mathbb{U}_{p}^{e_p(u)}v\right)}\right)\right] = \mathbb{E}_{e_p(u)}\left[f\left(e^{-{}^{e}\mathbb{U}_{p}^{e_p(u)}v + K_{e_p(u)}\left({}^{e}\mathbb{U}_{p}^{e_p(u)}v\right)}\right)\right] \\
&= \mathbb{E}_{p}\left[e^{u - K_p(u)}f\left(e^{-v + K_p(u + v) - K_p(u)}\right)\right],
\end{aligned}
\]
cf. example 3. Notice that the gradients can be computed in terms of $f$ and $f'$.

We leave the latter case for future work, while focusing for the rest of this paper on the case of the Kullback-Leibler divergence. In particular, motivated by our interest in optimization, we will focus on a family of parametrised Lagrangians of the following standard form,
\[
L_{a,b,c}(q, w) = c\left(a^{-1}K_q(aw) - bf(q)\right), \tag{77}
\]
where $a, b, c > 0$ and $f$ is a scalar field on $\mathcal{E}(\mu)$ as, for example, the negative entropy previously introduced. The Lagrangian above is parameterized in such a way that $\lim_{a \to 0} L_{a,b,b^{-1}}(q, w) = b^{-1}\left(dK_q(0)[w] - bf(q)\right) = -f(q)$.
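The role of the parameter $a$ in eq. (77) can be illustrated numerically. The sketch below relies on a standard fact that is not stated in the text: the second-order expansion of the cumulant, $K_q(aw) = \frac{a^{2}}{2}\mathbb{E}_q[w^{2}] + O(a^{3})$ for centred $w$. Under that assumption the kinetic term $a^{-1}K_q(aw)$ behaves like $\frac{a}{2}\langle w, w\rangle_q$ for small $a$ and vanishes as $a \to 0$. The uniform reference probability on six points and the function names are hypothetical choices.

```python
import numpy as np

def expect(q, f):
    """E_q[f] for a density q w.r.t. the uniform probability on N points."""
    return np.mean(q * f)

def scaled_cumulant(q, w, a):
    """a^{-1} K_q(a w), the kinetic part of the Lagrangian in eq. (77)."""
    return np.log(expect(q, np.exp(a * w))) / a

rng = np.random.default_rng(3)
q = rng.uniform(0.5, 1.5, size=6)
q /= np.mean(q)                      # density with mean 1
w = rng.normal(size=6)
w -= expect(q, w)                    # centred: w lies in the fiber S_q

quadratic = 0.5 * expect(q, w * w)   # (1/2) <w, w>_q

# Compare the scaled cumulant with its quadratic approximation a * quadratic.
for a in (1.0, 1e-1, 1e-2, 1e-3):
    print(a, scaled_cumulant(q, w, a), a * quadratic)

small_a = 1e-3
ratio = scaled_cumulant(q, w, small_a) / (small_a * quadratic)
```

For decreasing $a$ the two printed columns should approach each other, and `ratio` should be close to one.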
The cumulant term is the scaled Lagrangian $(q, w) \mapsto a^{-1}K_q(aw)$, whose divergence function in terms of $q$ and $r = e_q(w)$ is
\[
\frac{1}{a}\log\mathbb{E}_{q}\left[\exp\left(a\left(\log\frac{r}{q} - \mathbb{E}_{q}\left[\log\frac{r}{q}\right]\right)\right)\right] = \frac{1}{a}\log\mathbb{E}_{q}\left[\left(\frac{r}{q}\right)^{a}\right] + D(q \,\|\, r), \tag{78}
\]
where the first term on the right-hand side is related to the $a$-Rényi divergence of $r$ from $q$ [33].

Here, the constant $a$ is intended to introduce a mass effect in the model in such a way that $a = 0$ implies that the Lagrangian loses any dependence on the velocity $w$. We could also talk of an inertia of the system. Typically, the notion of inertia describes the resistance of any physical object to any change in its velocity. In our statistical setting, the dynamics of a state along some direction in the manifold can be interpreted as the result of the balance of a gain of motion, determined from the descent along some potential function (payoff), against the cost of motion to changes along a given direction from the given state. In this sense, the Lagrangian vector field on the statistical manifold consistently minimizes the action of the difference of a divergence and a potential function.

From an optimization viewpoint, our variational problem corresponds to the minimization of an objective function, the potential, with a proximity constraint enforced via the kinetic energy term. The kinetic energy acts as a regulariser for the velocities, leading to faster converging and more stable optimization algorithms (e.g. Hamiltonian Monte Carlo methods [11], Adam [19], AdaGrad [13] and RMSprop [38] algorithms, Relativistic Gradient Descent (RGD) algorithms [12]).

Let us derive the Hamiltonian of our standard Lagrangian in eq. (77).
If $f$ is a real function with convex conjugate $f^*(\eta) = \sup_w \left(\langle \eta, w \rangle - f(w)\right)$, then $g(w) = c\left(a^{-1} f(aw) - b\right)$ defines a new function whose convex conjugate is $g^*(\eta) = c\left(a^{-1} f^*(c^{-1}\eta) + b\right)$, where the conjugate momentum $\eta = \operatorname{grad}_e L_{a,b,c}(q,w)$ is now a function of the parameters. In the case of a convex function, the Legendre transform coincides with the convex conjugate on the interior of the proper domain. In our case, $f^*(\eta) = E_q[(1+\eta)\log(1+\eta)]$, $\eta > -1$, and the constant $b$ is replaced by $b f(q)$, so that
\[
H_{a,b,c}(q,\eta) = c\left(a^{-1} E_q\left[(1 + c^{-1}\eta)\log(1 + c^{-1}\eta)\right] + b f(q)\right). \tag{79}
\]
As $\lim_{c\to\infty} c\, E_q\left[(1 + c^{-1}\eta)\log(1 + c^{-1}\eta)\right] = 0$, we have $\lim_{c\to\infty} H_{a,c^{-1},c}(q,\eta) = f(q)$.

We now compute the relevant natural gradients with proposition 3, using the computations done in examples 3 and 6. By computing the total differential on the curve $t \mapsto (q(t), aw(t))$, with the results in example 6, we get
\[
\frac{d}{dt} K_{q(t)}(aw(t)) = \left\langle \left(\frac{e_{q(t)}(aw(t))}{q(t)} - 1\right) - aw(t), \overset{\star}{q}(t) \right\rangle_{q(t)} + \left\langle \frac{e_{q(t)}(aw(t))}{q(t)} - 1, \frac{D}{dt}\, aw(t) \right\rangle_{q(t)}.
\]
The gradients of the Lagrangian (77) are
\[
\operatorname{grad} L_{a,b,c}(q,w) = c\left(a^{-1}\left(\frac{e_q(aw)}{q} - 1\right) - w - b \operatorname{grad} f(q)\right), \qquad
\operatorname{grad}_e L_{a,b,c}(q,w) = c\left(\frac{e_q(aw)}{q} - 1\right). \tag{80}
\]
The limit cases are controlled by $\lim_{a\to 0}\left(\frac{e_q(aw)}{q} - 1\right) = 0$ and $\lim_{a\to 0} a^{-1}\left(\frac{e_q(aw)}{q} - 1\right) = w$, so that
\[
\lim_{a\to 0} \operatorname{grad} L_{a,b,c}(q,w) = -cb \operatorname{grad} f(q) \quad\text{and}\quad \lim_{a\to 0} \operatorname{grad}_e L_{a,b,c}(q,w) = 0 .
\]
The Euler-Lagrange equation is
\[
\frac{D}{dt}\left(\frac{e_{q(t)}(a\overset{\star}{q}(t))}{q(t)} - 1\right) = a^{-1}\left(\frac{e_{q(t)}(a\overset{\star}{q}(t))}{q(t)} - 1\right) - \overset{\star}{q}(t) - b \operatorname{grad} f(q(t)). \tag{81}
\]
Let us compute the covariant time-derivative on the left-hand side using the trick of example 1. (The covariant time derivative can also be computed in the chart at $p$, as shown in appendix C for $a = b = c = 1$. This is a lengthier though interesting computation, as it shows the use of the triple moments.) We write $\chi_a(t) = e_{q(t)}(a\overset{\star}{q}(t))$ and recall that we are using the mixture covariant derivative for $\chi_a(t)/q(t) - 1 \in {}^*S_{q(t)}\mathcal{E}(\mu)$. Then the left-hand side of the Euler-Lagrange equation becomes
\[
\frac{D}{dt}\left(\frac{\chi_a(t)}{q(t)} - 1\right) = q(t)^{-1}\frac{d}{dt}\left(\chi_a(t) - q(t)\right) = \frac{\chi_a(t)}{q(t)}\overset{\star}{\chi}_a(t) - \overset{\star}{q}(t) = {}^{m}\mathbb{U}^{q(t)}_{\chi_a(t)}\overset{\star}{\chi}_a(t) - \overset{\star}{q}(t), \tag{82}
\]
where
\[
\overset{\star}{\chi}_a(t) = \frac{d}{dt}\log\chi_a(t) = \frac{d}{dt}\, a\overset{\star}{q}(t) - \frac{d}{dt} K_{q(t)}(a\overset{\star}{q}(t)) + \overset{\star}{q}(t),
\]
which in turn implies
\[
\overset{\star}{\chi}_a(t) = {}^{e}\mathbb{U}^{\chi_a(t)}_{q(t)}\left(a\overset{\star\star}{q}(t) + \overset{\star}{q}(t)\right),
\]
so that
\[
\frac{D}{dt}\left(\frac{\chi_a(t)}{q(t)} - 1\right) = {}^{m}\mathbb{U}^{q(t)}_{\chi_a(t)}\, {}^{e}\mathbb{U}^{\chi_a(t)}_{q(t)}\left(a\overset{\star\star}{q}(t) + \overset{\star}{q}(t)\right) - \overset{\star}{q}(t).
\]
The term $-\overset{\star}{q}(t)$ cancels on both sides of (81), and the Euler-Lagrange equation becomes an equation in $q$, $\overset{\star}{q}$, $\overset{\star\star}{q}$,
\[
{}^{m}\mathbb{U}^{q(t)}_{\chi_a(t)}\, {}^{e}\mathbb{U}^{\chi_a(t)}_{q(t)}\left(a\overset{\star\star}{q}(t) + \overset{\star}{q}(t)\right) = a^{-1}\left(\frac{\chi_a(t)}{q(t)} - 1\right) - b \operatorname{grad} f(q(t)),
\]
or, moving the transports to the right-hand side,
\[
a\overset{\star\star}{q}(t) + \overset{\star}{q}(t) = {}^{e}\mathbb{U}^{q(t)}_{\chi_a(t)}\, {}^{m}\mathbb{U}^{\chi_a(t)}_{q(t)}\left(a^{-1}\left(\frac{\chi_a(t)}{q(t)} - 1\right) - b \operatorname{grad} f(q(t))\right), \tag{83}
\]
where the right-hand side could be rewritten with ${}^{m}\mathbb{U}^{\chi}_{q}(\chi/q - 1) = 1 - q/\chi$. Notice that the limit form as $a \to 0$ is $\operatorname{grad} f(q(t)) = 0$. For example, if $f(q) = -H(q)$, then $q(t) = 1$.

A similar argument applies to the computation of the gradients of the Hamiltonian (79). The variation of $H(q,\eta) = E_q[(1+\eta)\log(1+\eta)]$ on the curve $t \mapsto (q(t), c^{-1}\eta(t))$ is (cf. example 6)
\[
\frac{d}{dt} H(q(t), c^{-1}\eta(t)) = \left\langle \log(1 + c^{-1}\eta) - E_q\left[\log(1 + c^{-1}\eta)\right] - c^{-1}\eta, \overset{\star}{q}(t) \right\rangle_{q(t)} + \left\langle \frac{D}{dt}\, c^{-1}\eta(t), \log(1 + c^{-1}\eta) - E_q\left[\log(1 + c^{-1}\eta)\right] \right\rangle_{q(t)} .
\]
Substitution in (79) gives
\[
\operatorname{grad} H_{a,b,c}(q,\eta) = c\left(a^{-1}\left(\log(1 + c^{-1}\eta) - E_q\left[\log(1 + c^{-1}\eta)\right] - c^{-1}\eta\right) + b \operatorname{grad} f(q)\right),
\]
\[
\operatorname{grad}_m H_{a,b,c}(q,\eta) = a^{-1}\left(\log(1 + c^{-1}\eta) - E_q\left[\log(1 + c^{-1}\eta)\right]\right).
\]
The Hamilton equations are
\[
\begin{aligned}
\frac{D}{dt}\eta(t) &= -c\left(a^{-1}\left(\log(1 + c^{-1}\eta) - E_q\left[\log(1 + c^{-1}\eta)\right] - c^{-1}\eta\right) + b \operatorname{grad} f(q)\right), \\
\overset{\star}{q}(t) &= a^{-1}\left(\log(1 + c^{-1}\eta) - E_q\left[\log(1 + c^{-1}\eta)\right]\right).
\end{aligned} \tag{84}
\]

Remark. There is a way, other than the Hamilton equations, to write a first-order system of differential equations equivalent to the second-order Euler-Lagrange eq. (83). We have found in eq. (82) that the Euler-Lagrange eq. (81) can be written as
\[
\frac{\chi_a(t)}{q(t)}\overset{\star}{\chi}_a(t) - \overset{\star}{q}(t) = a^{-1}\left(\frac{\chi_a(t)}{q(t)} - 1\right) - \overset{\star}{q}(t) - b \operatorname{grad} f(q(t)),
\]
which simplifies to
\[
\chi_a(t)\overset{\star}{\chi}_a(t) = a^{-1}\left(\chi_a(t) - q(t)\right) - b\, q(t) \operatorname{grad} f(q(t)),
\]
and, in turn, provides a remarkably simple system of replicator equations,
\[
\begin{aligned}
\dot{\chi}_a(t) &= a^{-1}\left(\chi_a(t) - q(t)\right) - b\, q(t) \operatorname{grad} f(q(t)), \\
\dot{q}(t) &= a^{-1} q(t)\left(\log\frac{\chi_a(t)}{q(t)} - E_{q(t)}\left[\log\frac{\chi_a(t)}{q(t)}\right]\right).
\end{aligned} \tag{85}
\]
Notice that the vector field is null if, and only if, $\chi_a = q$ and $\operatorname{grad} f(q) = 0$.

6.1. Reduction to ordinary differential equations
We now express the equations we have obtained in the ordinary Euclidean space. After computing the transports on the right-hand side, the Euler-Lagrange eq. (83) becomes
\[
\begin{aligned}
a\overset{\star\star}{q}(t) + \overset{\star}{q}(t) ={}& a^{-1}\left(E_{q(t)}\left[\mathrm{e}^{-a\overset{\star}{q}(t) + K_{q(t)}(a\overset{\star}{q}(t))}\right] - \mathrm{e}^{-a\overset{\star}{q}(t) + K_{q(t)}(a\overset{\star}{q}(t))}\right) \\
& - b\left(\mathrm{e}^{-a\overset{\star}{q}(t) + K_{q(t)}(a\overset{\star}{q}(t))} \operatorname{grad} f(q(t)) - E_{q(t)}\left[\mathrm{e}^{-a\overset{\star}{q}(t) + K_{q(t)}(a\overset{\star}{q}(t))} \operatorname{grad} f(q(t))\right]\right).
\end{aligned} \tag{86}
\]
Notice the common constant factor $\mathrm{e}^{K_{q(t)}(a\overset{\star}{q}(t))} = E_{q(t)}\left[\mathrm{e}^{a\overset{\star}{q}(t)}\right]$ in each term.

There are many ways to rewrite eq. (86) as a system of ordinary differential equations in $\mathbb{R}^{2N}$. An immediate option is to introduce the variables $q$ and $v = \overset{\star}{q}$, in which case the solution will stay in the Grassmannian manifold $\sum_x q(x) v(x) = 0$. It holds $\frac{d}{dt} q(t) = q(t) v(t)$. The acceleration is
\[
\overset{\star\star}{q}(t) = \frac{d}{dt} v(t) - E_{q(t)}\left[\frac{d}{dt} v(t)\right] = \dot{v}(t) + E_{q(t)}\left[v(t)^2\right].
\]
The left-hand side of eq. (86) becomes
\[
a\left(\dot{v}(t) + E_{q(t)}\left[v(t)^2\right]\right) + v(t),
\]
while the right-hand side becomes
\[
a^{-1} E_{q(t)}\left[\mathrm{e}^{av(t)}\right]\left(E_{q(t)}\left[\mathrm{e}^{-av(t)}\right] - \mathrm{e}^{-av(t)}\right) - b\, E_{q(t)}\left[\mathrm{e}^{av(t)}\right]\left(\mathrm{e}^{-av(t)} \operatorname{grad} f(q(t)) - E_{q(t)}\left[\mathrm{e}^{-av(t)} \operatorname{grad} f(q(t))\right]\right).
\]
Writing the expected values as sums, we have obtained a system of $2N$ ordinary differential equations. The system could be further reduced to $2(N-1)$ equations between independent variables by using one of the possible parametrisations of the Grassmannian manifold.

Footnote (to the gradients of the Hamiltonian): the gradients could also have been computed using the fact that the two fiber gradients are inverses of each other.
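The reduction to the variables $(q, v)$ can be sketched numerically. The following is a minimal illustration, not the authors' implementation: it assumes that $q$ is stored as a density with respect to the uniform measure on $N$ points (so that $E_q[u] = \frac{1}{N}\sum_x q(x)u(x)$), that the potential is the negative entropy, and that the first term on the right-hand side carries the sign that reproduces the special case $a = 1$, $b = 0$ of eq. (87). The function names are hypothetical.

```python
import numpy as np

def expect(q, u):
    """E_q[u] when q is a density w.r.t. the uniform measure on N points."""
    return np.mean(q * u)

def grad_neg_entropy(q):
    """Natural gradient of f(q) = E_q[log q], i.e. log q - E_q[log q]."""
    log_q = np.log(q)
    return log_q - expect(q, log_q)

def kl_lagrangian_field(q, v, a=1.0, b=0.0, grad_f=grad_neg_entropy):
    """Return (dq/dt, dv/dt) for the (q, v) form of the Euler-Lagrange equation."""
    g = grad_f(q)
    m = expect(q, np.exp(a * v))   # exp(K_q(a v)) = E_q[exp(a v)]
    w = np.exp(-a * v)
    rhs = m * (expect(q, w) - w) / a - b * m * (w * g - expect(q, w * g))
    # Solve a * (dv/dt + E_q[v^2]) + v = rhs for dv/dt.
    v_dot = (rhs - v) / a - expect(q, v**2)
    return q * v, v_dot
```

Because the right-hand side is centered with respect to $q$, the field satisfies $E_q[\dot v] = -E_q[v^2]$, which is exactly the condition that keeps $E_q[v] = 0$ along the flow; any standard ODE integrator can then be applied to the pair returned by the function.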
Example. If $a = 1$ and $b = 0$, the system is
\[
\begin{aligned}
\frac{d}{dt} q(x;t) &= q(x;t)\, v(x;t), \\
\frac{d}{dt} v(x;t) &= -v(x;t) - \frac{1}{N}\sum_y q(y;t)\, v(y;t)^2 - \frac{1}{N}\sum_y q(y;t)\, \mathrm{e}^{v(y;t)}\left(\mathrm{e}^{-v(x;t)} - \frac{1}{N}\sum_y q(y;t)\, \mathrm{e}^{-v(y;t)}\right).
\end{aligned} \tag{87}
\]
Notice that the replicator equations (85) are already differential equations in $\mathbb{R}^{2N}$ and the invariant manifold is the product of two open simplexes.

The Hamiltonian equations (84) form a differential system in the two variables $q, \eta$ in the dual statistical bundle, which is an open subset of the Grassmannian manifold. The solution curve and its derivatives can be expressed in the global space in which the dual bundle is embedded by
\[
t \mapsto (q(t), \eta(t)) \in {}^*S\mathcal{E}(\mu) \subset \mathbb{R}^{\Omega} \times \mathbb{R}^{\Omega}, \qquad
\frac{D}{dt}\eta(t) = \frac{\dot{q}(t)}{q(t)}\eta(t) + \dot{\eta}(t), \qquad
\overset{\star}{q}(t) = \frac{\dot{q}(t)}{q(t)} .
\]

Example. In the case $a = c = 1$ and $b = 0$, the resulting ODEs are
\[
\begin{aligned}
\dot{\eta}(x;t) &= \eta(x;t) - (1 + \eta(x;t))\left(\log(1 + \eta(x;t)) - \frac{1}{N}\sum_y q(y;t)\log(1 + \eta(y;t))\right), \\
\dot{q}(x;t) &= q(x;t)\left(\log(1 + \eta(x;t)) - \frac{1}{N}\sum_y q(y;t)\log(1 + \eta(y;t))\right).
\end{aligned}
\]
In fig. 2, we plot the solutions of the Lagrangian motion in a convex potential given by the negative entropy, $f(q) = E_q[\log q]$, for the quadratic and the KL kinetic energy.

7. Application to accelerated optimization
As a final example of statistical Lagrangian dynamics, we consider the case of a damped mass-spring system on the probability space, defined via a time-dependent parametrised KL Lagrangian.

This choice is motivated by a series of recent results in optimization, where a time-dependent family of so-called Bregman Lagrangians [39] is introduced to derive a variational approach to accelerated optimization methods.

While the geometric setting in [39] is a generic Hessian manifold over a convex set in $\mathbb{R}^d$, our goal is to reproduce such a derivation on the statistical bundle, so as to provide a first consistent description of accelerated optimization on the dually-flat geometry of the exponential manifold. Recent related work on the accelerated gradient flow for probability distributions can be found in [37] and in the references therein.

[Figure 2: panels "Quadratic Lagrangian" and "KL Lagrangian", plotted against $t$.]

Figure 2. Projection of the solutions of the Euler-Lagrange equation in the simplex, for the quadratic (blue, cf. eq. (70)) and the KL (red, cf. eq. (87), $a = b = 1$) Lagrangian flows, in a potential given by the negative of the entropy function on the simplex. In the right panels, both systems show the expected harmonic oscillating behavior, while generally displaying different trajectories for the same initial conditions (black cross).

7.1. Damped KL Lagrangian
On the statistical bundle, let us consider a damped Lagrangian given by the difference of a time-scaled KL divergence and a potential function, multiplied by an overall time-dependent damping factor,
\[
\begin{aligned}
L(q,w,t) &= \mathrm{e}^{\gamma_t}\left(\mathrm{e}^{\alpha_t} D\left(q \,\|\, e_q(\mathrm{e}^{-\alpha_t} w)\right) - \mathrm{e}^{\alpha_t + \beta_t} f(q)\right) \\
&= \mathrm{e}^{\alpha_t + \gamma_t}\left(E_q\left[\log\frac{q}{e_q(\mathrm{e}^{-\alpha_t} w)}\right] - \mathrm{e}^{\beta_t} f(q)\right) \\
&= \mathrm{e}^{\alpha_t + \gamma_t}\left(K_q(\mathrm{e}^{-\alpha_t} w) - \mathrm{e}^{\beta_t} f(q)\right).
\end{aligned} \tag{88}
\]
For each fixed $t$, the time-dependent Lagrangian above is an instance of the standard Lagrangian eq. (77) with
\[
a = \mathrm{e}^{-\alpha_t}, \qquad b = \mathrm{e}^{\alpha_t + \beta_t}, \qquad c = \mathrm{e}^{\gamma_t},
\]
which reproduces, on the statistical bundle, the time-dependent family of Bregman Lagrangians proposed in [39].

As in [39], we assume $\alpha_t, \beta_t, \gamma_t \colon I \to \mathbb{R}$ to be continuously differentiable functions of time. The overall damping factor $\gamma_t$ is responsible for the dissipative behaviour of the Lagrangian system; $\beta_t$ provides the potential $f$ with an explicit time dependence; finally, $\alpha_t$ defines a scaling in time of the score velocity.

In our setting, the scaling of the score is associated to a time-dependent lift to the statistical bundle. In the exponential map, we consider a time-dependent scaling of the shift vector, such that $\chi = e_q(\mathrm{e}^{-\alpha_t} w)$ and $s_p(\chi) = u + \mathrm{e}^{-\alpha_t} v \in S_p\mathcal{E}(\mu)$, with $\alpha_t \colon I \to \mathbb{R}$ smooth and $I \subset \mathbb{R}$ an open time interval. With this choice the KL divergence reads
\[
D \colon I \times S\mathcal{E}(\mu) \ni (q,w,t) \mapsto D\left(q \,\|\, e_q(\mathrm{e}^{-\alpha_t} w)\right) \in \mathbb{R} .
\]
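The equalities in eq. (88) and the identification with the standard Lagrangian can be checked numerically. The sketch below is only an illustration under stated assumptions: $q$ is a density with respect to the uniform measure on $N$ points, the standard Lagrangian is taken in the form $c\,(a^{-1}K_q(aw) - b f(q))$, and all function names are hypothetical.

```python
import numpy as np

def expect(q, u):
    """E_q[u] for a density q w.r.t. the uniform measure on N points."""
    return np.mean(q * u)

def cumulant(q, w):
    """K_q(w) = log E_q[exp(w)]."""
    return np.log(expect(q, np.exp(w)))

def exp_map(q, w):
    """e_q(w) = exp(w - K_q(w)) * q."""
    return np.exp(w - cumulant(q, w)) * q

def kl(q, r):
    """D(q || r) = E_q[log(q / r)]."""
    return expect(q, np.log(q / r))

def damped_lagrangian(q, w, f_q, alpha, beta, gamma):
    """Last form of eq. (88); f_q is the value of the potential at q."""
    return np.exp(alpha + gamma) * (cumulant(q, np.exp(-alpha) * w) - np.exp(beta) * f_q)

def standard_lagrangian(q, w, f_q, a, b, c):
    """Assumed form of the standard Lagrangian: c * (K_q(a w) / a - b f(q))."""
    return c * (cumulant(q, a * w) / a - b * f_q)
```

The key identity is $D(q \,\|\, e_q(u)) = K_q(u)$ for any centered $u$, which holds because $E_q[u] = 0$; this is what turns the first line of eq. (88) into the last one.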
The overall scaling by the inverse factor $\mathrm{e}^{\alpha_t}$ makes the divergence closed under time-dilation and leads to a time-reparametrization invariant [35] action
\[
\begin{aligned}
(q(\tau), t(\tau)) \mapsto{}& \int \dot{\tau}^{-1}\, d\tau\, \mathrm{e}^{\alpha_{t(\tau)}} \mathrm{e}^{\gamma_{t(\tau)}}\left[D\left(q(t(\tau)) \,\|\, e_q\left(\mathrm{e}^{-\alpha_{t(\tau)}} \overset{\star}{q}(t(\tau))\, \dot{\tau}\right)\right) - \mathrm{e}^{\beta_{t(\tau)}} f(q)\right] \\
={}& \int d\tau\, \mathrm{e}^{\tilde{\alpha}_\tau} \mathrm{e}^{\gamma_\tau}\left[D\left(q(\tau) \,\|\, e_q\left(\mathrm{e}^{-\tilde{\alpha}_\tau} \overset{\star}{q}(\tau)\right)\right) - \mathrm{e}^{\beta_\tau} f(q)\right],
\end{aligned}
\]
where we set $\tilde{\alpha}_\tau = \alpha_t - \log(\dot{\tau})$.

It follows directly from eq. (79) that the Hamiltonian is
\[
H(q,\eta,t) = \mathrm{e}^{\alpha_t + \gamma_t}\left(E_q\left[(1 + \mathrm{e}^{-\gamma_t}\eta)\log(1 + \mathrm{e}^{-\gamma_t}\eta)\right] + \mathrm{e}^{\beta_t} f(q)\right). \tag{89}
\]
The gradients of the Lagrangian have already been computed in eq. (80),
\[
\operatorname{grad} L(q,w,t) = \mathrm{e}^{\gamma_t}\left(\mathrm{e}^{\alpha_t}\left(\frac{e_q(\mathrm{e}^{-\alpha_t} w)}{q} - 1\right) - w - \mathrm{e}^{\alpha_t + \beta_t} \operatorname{grad} f(q)\right), \qquad
\operatorname{grad}_e L(q,w,t) = \mathrm{e}^{\gamma_t}\left(\frac{e_q(\mathrm{e}^{-\alpha_t} w)}{q} - 1\right).
\]
The Euler-Lagrange equation is
\[
\frac{D}{dt}\left(\mathrm{e}^{\gamma_t}\left(\frac{e_q(\mathrm{e}^{-\alpha_t} \overset{\star}{q}(t))}{q} - 1\right)\right) = \mathrm{e}^{\gamma_t}\left(\mathrm{e}^{\alpha_t}\left(\frac{e_q(\mathrm{e}^{-\alpha_t} \overset{\star}{q}(t))}{q} - 1\right) - \overset{\star}{q}(t) - \mathrm{e}^{\alpha_t + \beta_t} \operatorname{grad} f(q(t))\right),
\]
or, canceling the factor $\mathrm{e}^{\gamma_t}$,
\[
\dot{\gamma}_t\left(\frac{e_q(\mathrm{e}^{-\alpha_t} \overset{\star}{q}(t))}{q} - 1\right) + \frac{D}{dt}\left(\frac{e_q(\mathrm{e}^{-\alpha_t} \overset{\star}{q}(t))}{q} - 1\right) = \mathrm{e}^{\alpha_t}\left(\frac{e_q(\mathrm{e}^{-\alpha_t} \overset{\star}{q}(t))}{q(t)} - 1\right) - \overset{\star}{q}(t) - \mathrm{e}^{\alpha_t + \beta_t} \operatorname{grad} f(q(t)). \tag{90}
\]
Let us compute the left-hand side. If we write $\chi(t) = e_{q(t)}\left(\mathrm{e}^{-\alpha_t} \overset{\star}{q}(t)\right)$, then
\[
\frac{D}{dt}\left(\frac{\chi(t)}{q(t)} - 1\right) = \frac{1}{q(t)}\frac{d}{dt}\left(\chi(t) - q(t)\right) = \frac{\chi(t)}{q(t)}\overset{\star}{\chi}(t) - \overset{\star}{q}(t),
\]
where
\[
\begin{aligned}
\overset{\star}{\chi}(t) &= \frac{d}{dt}\left(\mathrm{e}^{-\alpha_t} \overset{\star}{q}(t) - K_{q(t)}(\mathrm{e}^{-\alpha_t} \overset{\star}{q}(t)) + \log q(t)\right) \\
&= -\dot{\alpha}_t \mathrm{e}^{-\alpha_t} \overset{\star}{q}(t) + \mathrm{e}^{-\alpha_t}\frac{d}{dt}\overset{\star}{q}(t) - \frac{d}{dt} K_{q(t)}(\mathrm{e}^{-\alpha_t} \overset{\star}{q}(t)) + \overset{\star}{q}(t) \\
&= \left(1 - \dot{\alpha}_t \mathrm{e}^{-\alpha_t}\right)\overset{\star}{q}(t) + \mathrm{e}^{-\alpha_t} \overset{\star\star}{q}(t) + \mathrm{e}^{-\alpha_t} E_{q(t)}\left[\frac{d}{dt}\overset{\star}{q}(t)\right] - \frac{d}{dt} K_{q(t)}(\mathrm{e}^{-\alpha_t} \overset{\star}{q}(t)).
\end{aligned}
\]
It follows that
\[
\frac{D}{dt}\left(\frac{\chi(t)}{q(t)} - 1\right) = {}^{m}\mathbb{U}^{q(t)}_{\chi(t)}\, {}^{e}\mathbb{U}^{\chi(t)}_{q(t)}\left(\left(1 - \dot{\alpha}_t \mathrm{e}^{-\alpha_t}\right)\overset{\star}{q}(t) + \mathrm{e}^{-\alpha_t} \overset{\star\star}{q}(t)\right) - \overset{\star}{q}(t),
\]
and, in turn, after the term $-\overset{\star}{q}(t)$ cancels on both sides, the Euler-Lagrange eq. (90) becomes
\[
\dot{\gamma}_t\left(\frac{\chi(t)}{q(t)} - 1\right) + {}^{m}\mathbb{U}^{q(t)}_{\chi(t)}\, {}^{e}\mathbb{U}^{\chi(t)}_{q(t)}\left(\left(1 - \dot{\alpha}_t \mathrm{e}^{-\alpha_t}\right)\overset{\star}{q}(t) + \mathrm{e}^{-\alpha_t} \overset{\star\star}{q}(t)\right) = \mathrm{e}^{\alpha_t}\left(\frac{\chi(t)}{q(t)} - 1\right) - \mathrm{e}^{\alpha_t + \beta_t} \operatorname{grad} f(q(t)).
\]
The equation above can be rearranged to read
\[
\mathrm{e}^{-\alpha_t} \overset{\star\star}{q}(t) + \left(1 - \dot{\alpha}_t \mathrm{e}^{-\alpha_t}\right)\overset{\star}{q}(t) = \left(\mathrm{e}^{\alpha_t} - \dot{\gamma}_t\right)\left(-\frac{q(t)}{\chi(t)} + E_{q(t)}\left[\frac{q(t)}{\chi(t)}\right]\right) - \mathrm{e}^{\alpha_t + \beta_t}\left(\frac{q(t)}{\chi(t)} \operatorname{grad} f(q(t)) - E_{q(t)}\left[\frac{q(t)}{\chi(t)} \operatorname{grad} f(q(t))\right]\right).
\]
We now regard the Euler-Lagrange equation above as describing the solution of an optimization problem on the simplex, where the potential $f(q)$ represents the objective function to be minimized, with a proximity condition induced by the KL divergence. The explicit time-dependence of the Lagrangian is the fundamental ingredient for the dynamical system to dissipate energy and relax to a minimum of the potential, hence to a minimum of the objective function.

Example. It was shown in [39] that the following so-called ideal scaling conditions
\[
\dot{\beta}_t \le \mathrm{e}^{\alpha_t}, \qquad \dot{\gamma}_t = \mathrm{e}^{\alpha_t}
\]
lead to solutions of the Euler-Lagrange equations which reproduce the vanishing-step-size-limit trajectories of the accelerated gradient optimization schemes (see also [36, 21, 8]). As a concrete example for our formalism, we are going to apply the same ideal scaling conditions to the KL Euler-Lagrange equations on the statistical bundle.
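As a quick sanity check of the ideal scaling conditions, the sketch below evaluates them on the polynomial family $\alpha_t = \log p - \log t$, $\beta_t = p\log t + \log C$, $\gamma_t = p\log t$ adopted in (92). It is only an illustration: the function names are hypothetical, and the values of $p$ and $C$ used in any call are free choices, not values taken from the paper.

```python
import numpy as np

def polynomial_schedule(t, p=2.0, C=1.0):
    """Return (alpha, beta, gamma) and their time derivatives for the polynomial family."""
    t = np.asarray(t, dtype=float)
    alpha = np.log(p) - np.log(t)
    beta = p * np.log(t) + np.log(C)
    gamma = p * np.log(t)
    d_alpha = -1.0 / t
    d_beta = p / t
    d_gamma = p / t
    return (alpha, beta, gamma), (d_alpha, d_beta, d_gamma)

def ideal_scaling_holds(t, p=2.0, C=1.0, tol=1e-12):
    """Check d(beta)/dt <= exp(alpha) and d(gamma)/dt == exp(alpha) on the grid t."""
    (alpha, _, _), (_, d_beta, d_gamma) = polynomial_schedule(t, p, C)
    bound_ok = np.all(d_beta <= np.exp(alpha) + tol)
    equality_ok = np.allclose(d_gamma, np.exp(alpha))
    return bool(bound_ok and equality_ok)

def damping_coefficient(t, p=2.0):
    """Coefficient exp(alpha_t) - d(alpha_t)/dt multiplying the velocity."""
    (alpha, _, _), (d_alpha, _, _) = polynomial_schedule(t, p)
    return np.exp(alpha) - d_alpha
```

For this family both conditions hold with equality, $\dot\beta_t = \dot\gamma_t = \mathrm{e}^{\alpha_t} = p/t$, and the damping coefficient reduces to $(p+1)/t$.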
This leads to the simplified equation
\[
\overset{\star\star}{q}(t) + \left(\mathrm{e}^{\alpha_t} - \dot{\alpha}_t\right)\overset{\star}{q}(t) = -\mathrm{e}^{2\alpha_t + \beta_t}\left(\frac{q(t)}{\chi(t)} \operatorname{grad} f(q(t)) - E_{q(t)}\left[\frac{q(t)}{\chi(t)} \operatorname{grad} f(q(t))\right]\right). \tag{91}
\]
Along with [39], we further restrict the parametrised Lagrangian family by the following choice of parameters, indexed by $p > 0$,
\[
\alpha_t = \log p - \log t, \qquad \beta_t = p \log t + \log C, \qquad \gamma_t = p \log t, \tag{92}
\]
where $C > 0$ is a constant. In [39], this choice is shown to recover the continuous-time limit of Nesterov's accelerated gradient descent (when $p = 2$) and that of Nesterov's accelerated cubic-regularized Newton's method (when $p = 3$).

For the same system on the bundle, we have $\overset{\star}{q}(t) = v(t)$, such that
\[
\frac{d}{dt} q(x;t) = q(x;t)\, v(x;t), \qquad x \in \Omega, \tag{93}
\]
and we get
\[
\begin{aligned}
\frac{d}{dt} v(t) ={}& \overset{\star\star}{q}(t) - E_{q(t)}\left[v(t)^2\right] \\
={}& -\left(\mathrm{e}^{\alpha_t} - \dot{\alpha}_t\right) v(t) - \frac{\mathrm{e}^{2\alpha_t + \beta_t} \operatorname{grad} f(q(t))}{\mathrm{e}^{\mathrm{e}^{-\alpha_t} v(t) - K_q(\mathrm{e}^{-\alpha_t} v(t))}} + E_{q(t)}\left[\frac{\mathrm{e}^{2\alpha_t + \beta_t} \operatorname{grad} f(q(t))}{\mathrm{e}^{\mathrm{e}^{-\alpha_t} v(t) - K_q(\mathrm{e}^{-\alpha_t} v(t))}}\right] - E_{q(t)}\left[v(t)^2\right].
\end{aligned} \tag{94}
\]
Then, via (92), we get
\[
\begin{aligned}
\frac{d}{dt} q(x;t) ={}& q(x;t)\, v(x;t), \\
\frac{d}{dt} v(x;t) ={}& -\frac{p+1}{t} v(t) - \frac{C p^2 t^{p-2} \operatorname{grad} f(q(t))}{\mathrm{e}^{\frac{t}{p} v(t) - K_q\left(\frac{t}{p} v(t)\right)}} + \frac{1}{N}\sum_x q(x;t)\left(\frac{C p^2 t^{p-2} \operatorname{grad} f(q(t))}{\mathrm{e}^{\frac{t}{p} v(t) - K_q\left(\frac{t}{p} v(t)\right)}}\right) - \frac{1}{N}\sum_x q(x;t)\, v(x;t)^2, \qquad x \in \Omega .
\end{aligned}
\]

[Figure 3: panel "Damped KL Lagrangian", plotted against $t$.]

Figure 3. Projection in the simplex of the solution of the Euler-Lagrange equation for the damped KL Lagrangian in eq. (88), under the ideal scaling condition, with the specific choice of parametrization given in (92) (with $p = 2$, $C = 0.$…).

7.2. Damped KL Hamiltonian
The momentum corresponding to the damped KL Lagrangian is given by
\[
\eta = \mathrm{e}^{\gamma_t}\left(\frac{e_q(\mathrm{e}^{-\alpha_t} w)}{q} - 1\right) \in I \times {}^*S_q\mathcal{E}(\mu), \tag{95}
\]
corresponding to a time-dependent damping of the mechanical momentum associated to the free KL Lagrangian
\[
L(q,w,t) = \mathrm{e}^{\alpha_t} D\left(q \,\|\, e_q(\mathrm{e}^{-\alpha_t} w)\right).
\]
Now, for $(1 + \mathrm{e}^{-\gamma_t}\eta)\, q = e_q(\mathrm{e}^{-\alpha_t} w)$, we can use the exponential chart at $q$ to easily invert the Legendre transform and solve for the velocity $w$. We have
\[
w(\eta) = (\operatorname{grad}_e L_q)^{-1}(\eta) = \mathrm{e}^{\alpha_t} s_q\left((1 + \mathrm{e}^{-\gamma_t}\eta)\, q\right) = \mathrm{e}^{\alpha_t}\left(\log(1 + \mathrm{e}^{-\gamma_t}\eta) - E_q\left[\log(1 + \mathrm{e}^{-\gamma_t}\eta)\right]\right).
\]
Thereby, we can write the KL Hamiltonian as
\[
\begin{aligned}
H(q,\eta,t) ={}& \mathrm{e}^{\alpha_t}\left\langle \eta, \log(1 + \mathrm{e}^{-\gamma_t}\eta) - E_q\left[\log(1 + \mathrm{e}^{-\gamma_t}\eta)\right] \right\rangle_q \\
& - \mathrm{e}^{\alpha_t + \gamma_t} K_q\left(\log(1 + \mathrm{e}^{-\gamma_t}\eta) - E_q\left[\log(1 + \mathrm{e}^{-\gamma_t}\eta)\right]\right) + \mathrm{e}^{\alpha_t + \beta_t + \gamma_t} f(q) \\
={}& \mathrm{e}^{\alpha_t + \gamma_t}\Big(E_q\left[\mathrm{e}^{-\gamma_t}\eta\left(\log(1 + \mathrm{e}^{-\gamma_t}\eta) - E_q\left[\log(1 + \mathrm{e}^{-\gamma_t}\eta)\right]\right)\right] \\
& \qquad\quad - K_q\left(\log(1 + \mathrm{e}^{-\gamma_t}\eta) - E_q\left[\log(1 + \mathrm{e}^{-\gamma_t}\eta)\right]\right)\Big) + \mathrm{e}^{\alpha_t + \beta_t + \gamma_t} f(q) \\
={}& \mathrm{e}^{\alpha_t + \gamma_t}\left(E_q\left[(1 + \mathrm{e}^{-\gamma_t}\eta)\log(1 + \mathrm{e}^{-\gamma_t}\eta)\right] + \mathrm{e}^{\beta_t} f(q)\right) \\
={}& \mathrm{e}^{\alpha_t + \gamma_t}\left(D\left(e_q(\mathrm{e}^{-\alpha_t} w) \,\|\, q\right) + \mathrm{e}^{\beta_t} f(q)\right),
\end{aligned}
\]
as presented in (89).

[Figure 4: panel "Damped KL Hamiltonian", plotted against $t$.]

Figure 4.
Projection in the simplex of the solution of the damped KL Hamiltonian system in eq. (96). We use the same ideal scaling condition and parametrization ($p = 2$, $C = 0.$…).

Remark. In classical mechanics, we normally identify the Hamiltonian with the total energy of the system, given by the sum of the kinetic and potential energy. In fact, the Hamiltonian kinetic energy is the conjugate of the Lagrangian kinetic energy. This is apparent in the generalised case of the KL, where the symmetry of the standard quadratic form is generalised to a conjugacy relation with respect to the dual pairing. In the finite dimensional case, where the statistical bundle coincides with the dual statistical bundle, the mechanical interpretation is then preserved.

The Hamilton equations read
\[
\begin{aligned}
\frac{D}{dt}\eta &= \mathrm{e}^{\alpha_t}\eta - \mathrm{e}^{\alpha_t + \gamma_t}\left(\log(1 + \mathrm{e}^{-\gamma_t}\eta) - E_{q(t)}\left[\log(1 + \mathrm{e}^{-\gamma_t}\eta)\right]\right) - \mathrm{e}^{\gamma_t + \alpha_t + \beta_t} \operatorname{grad} f(q), \\
\overset{\star}{q}(t) &= \mathrm{e}^{\alpha_t}\left(\log(1 + \mathrm{e}^{-\gamma_t}\eta) - E_{q(t)}\left[\log(1 + \mathrm{e}^{-\gamma_t}\eta)\right]\right).
\end{aligned} \tag{96}
\]
Given
\[
\frac{D}{dt}\eta(t) = \frac{\dot{q}(t)}{q(t)}\eta(t) + \dot{\eta}(t) = \overset{\star}{q}(t)\,\eta(t) + \dot{\eta}(t),
\]
we get a system of first order equations
\[
\begin{cases}
\dot{\eta}(t) = \mathrm{e}^{\alpha_t}\eta - \mathrm{e}^{\alpha_t + \gamma_t}\left(1 + \mathrm{e}^{-\gamma_t}\eta\right)\left(\log(1 + \mathrm{e}^{-\gamma_t}\eta) - E_{q(t)}\left[\log(1 + \mathrm{e}^{-\gamma_t}\eta)\right]\right) - \mathrm{e}^{\gamma_t + \alpha_t + \beta_t} \operatorname{grad} f(q), \\
\dot{q}(t) = \mathrm{e}^{\alpha_t} q(t)\left(\log(1 + \mathrm{e}^{-\gamma_t}\eta) - E_{q(t)}\left[\log(1 + \mathrm{e}^{-\gamma_t}\eta)\right]\right).
\end{cases} \tag{97}
\]

Remark. It is interesting to note that the momentum derived from the parametric Lagrangian in (88) (a generalization of the Kanai-Caldirola Lagrangian, see e.g. [10, 17]) is nothing but the mechanical conjugate momentum to $q(t)$, defined in example 6 for the undamped system, multiplied by a scaling factor $\mathrm{e}^{\gamma_t}$. In fact, despite giving the correct equation of motion for the damped harmonic oscillator, a Lagrangian of the type
\[
L = \mathrm{e}^{\gamma_t}(T - V)
\]
has been shown to rather describe a harmonic oscillator with variable mass [16]. This is an interesting perspective to be explored for understanding the role of inertia and acceleration in the momentum approaches for optimization.

[Figure 5: $f - \min(f)$ plotted against $t$ for the damped KL Lagrangian and the damped KL Hamiltonian.]
Figure 5.
A toy example of optimization on the statistical bundle. The plot provides a comparison of the convergence rates to the minimum of the shifted negative entropy potential $f(q)$, for the damped KL Lagrangian (blue) and Hamiltonian (orange) flows on the simplex. We see that the KL Hamiltonian flow appears to be noticeably faster than the Lagrangian one.

Remark. In our case, we see that the system in (97) can be easily rewritten in terms of the mechanical momentum $\bar{\eta} = \frac{e_q(\mathrm{e}^{-\alpha_t} w)}{q} - 1$. For $\eta = \mathrm{e}^{\gamma_t}\bar{\eta}$, we have $\dot{\eta} = \mathrm{e}^{\gamma_t}\left(\dot{\gamma}_t \bar{\eta} + \dot{\bar{\eta}}\right)$, hence
\[
\begin{cases}
\dot{\bar{\eta}}(t) = \left(\mathrm{e}^{\alpha_t} - \dot{\gamma}_t\right)\bar{\eta} - \mathrm{e}^{\alpha_t}(1 + \bar{\eta})\left(\log(1 + \bar{\eta}) - E_{q(t)}\left[\log(1 + \bar{\eta})\right]\right) - \mathrm{e}^{\alpha_t + \beta_t} \operatorname{grad} f(q), \\
\dot{q}(t) = \mathrm{e}^{\alpha_t} q(t)\left(\log(1 + \bar{\eta}) - E_{q(t)}\left[\log(1 + \bar{\eta})\right]\right).
\end{cases} \tag{98}
\]
Notice that the ideal scaling condition $\dot{\gamma}_t = \mathrm{e}^{\alpha_t}$ [39] in this case leads to a cancellation of the dissipative term in the Hamilton equations (98).

8. Discussion
This paper provides a first full analysis of a new formalism of interest for the study of the evolution of probability densities on a finite sample space.

We believe we have shown convincingly that a fully non-parametric presentation of the Lagrangian and Hamiltonian dynamics is feasible on the statistical bundle. Our version of the mechanical formalism applies to a set-up that differs from the standard one. Namely, it acts on that specific version of the tangent bundle of the open probability simplex that has the most natural interpretation in terms of statistical quantities.

All the underlying mechanical concepts, such as velocity, parallel transport, accelerations, second-order equations, oscillation, and damping, receive a specific statistical interpretation and are, in some cases, related to non-mechanical features, such as a divergence. Several simple examples illustrate the numerical implementation and graphical illustration of the results, which are both important in statistical and machine learning applications, as well as for a deeper understanding of accelerated methods (see e.g. fig. 5) and the construction of new geometric discretization schemes for optimization algorithms (cf. [40, 4]).

The use of formal mechanical concepts in statistical modelling has been unusual in the applied literature, and we believe this paper could prompt a change of perspective. On the other hand, the non-parametric treatment clearly shows the relation with the modelling used in Statistical Physics, that is, Boltzmann-Gibbs theory.

The case of a generic exponential family is obtained by considering the vector space generated by the constant 1 and a finite set of sufficient statistics, $B = \operatorname{Span}(1, u_1, \dots, u_n)$. In this case the fibers are $B_p = \{u \in B \mid E_p[u] = 0\}$.
The exponential transport acts properly, while the mixture transport is computed as
\[
\left\langle \eta, {}^{e}\mathbb{U}^{q}_{p} w \right\rangle_q = \left\langle \Pi_p\, {}^{m}\mathbb{U}^{p}_{q} \eta, w \right\rangle_p,
\]
where $\Pi_p$ is the $L^2(p)$ orthogonal projection onto $B_p$.

As a further direction for future research, the dynamical system induced by the Hamilton equations on the dual bundle suggests the study of the measures on the statistical bundle and their evolution. The consideration of such measures provides an interesting extension of the Bayes paradigm in that there is a probability measure on the simplex and a transition to a measure on the fiber.

Finally, future work should consider the extension of the statistical bundle mechanics to the continuous state space, which requires the definition of a proper functional set-up. One option would be to model the fibers of the statistical bundle with an exponential Orlicz space. In such a case, the fibers of the dual statistical bundle should be modeled as the pre-dual space. Many other options are available, in particular, Orlicz-Sobolev Banach spaces, or Fréchet spaces of infinitely differentiable densities.

An extension to the continuous state space and to arbitrary exponential families would allow a broad application of the information geometric formalism for accelerated methods to the optimization over statistical models, in particular in large dimensions. Three examples of such applications are the optimization of functions defined over the cone of the positive definite matrices, the minimization of a loss function for the training of neural networks, and the stochastic relaxation of functions defined over the sample space.

Acknowledgements
G.C. and L.M. are supported by the DeepRiemann project, co-funded by the EuropeanRegional Development Fund and the Romanian Government through the Competitiveness Op-erational Programme 2014-2020, project ID P 37 714, contract no. 136/27.09.2016, SMIS code103321. G.P. is supported by de Castro Statistics, Collegio Carlo Alberto, Turin, Italy. He is amember of GNAMPA-INDAM.
Appendix A. Covering
The expression in the chart centered at $p$ of the Lagrangian in example 5 follows from eq. (19),
\[
L_p(u,v) = \frac{m}{2}\left\langle {}^{e}\mathbb{U}^{e_p(u)}_{p} v, {}^{e}\mathbb{U}^{e_p(u)}_{p} v \right\rangle_{e_p(u)} = \frac{m}{2}\, d^2 K_p(u)[v,v], \tag{99}
\]
where $q = e_p(u)$ and $w = {}^{e}\mathbb{U}^{q}_{p} v$.

The derivative of $L_p(u,v)$ in the direction $(h,k)$ is
\[
\begin{aligned}
dL_p(u,v)[(h,k)] &= \frac{m}{2}\, d^3 K_p(u)[v,v,h] + m\, d^2 K_p(u)[v,k] \\
&= \frac{m}{2} E_q\left[w^2\, {}^{e}\mathbb{U}^{q}_{p} h\right] + m E_q\left[w\, {}^{e}\mathbb{U}^{q}_{p} k\right] \\
&= \frac{m}{2}\left\langle w^2 - E_q\left[w^2\right], {}^{e}\mathbb{U}^{q}_{p} h \right\rangle_q + m\left\langle w, {}^{e}\mathbb{U}^{q}_{p} k \right\rangle_q .
\end{aligned} \tag{100}
\]
The components of the natural gradient are
\[
\operatorname{grad} L(q,w) = \frac{m}{2}\left(w^2 - E_q\left[w^2\right]\right), \qquad \operatorname{grad}_e L(q,w) = m w .
\]
With $w(t) = \overset{\star}{q}(t)$, the Euler-Lagrange equation is
\[
\frac{D}{dt}\overset{\star}{q}(t) = \frac{1}{2}\left(\overset{\star}{q}(t)^2 - E_{q(t)}\left[\overset{\star}{q}(t)^2\right]\right).
\]
Notice that $\overset{\star}{q}(t)^2 - E_{q(t)}\left[\overset{\star}{q}(t)^2\right]$ belongs to the fiber ${}^*S_{q(t)}\mathcal{E}(\mu)$ of the dual bundle.

Let us express this equation as a system of second order ODEs. By recalling that here $D/dt$ is the dual (mixture) covariant derivative, we find that
\[
\frac{\ddot{q}(t)}{q(t)} = \frac{1}{2}\left(\left(\frac{\dot{q}(t)}{q(t)}\right)^2 - E_{q(t)}\left[\left(\frac{\dot{q}(t)}{q(t)}\right)^2\right]\right).
\]
If $\Omega = \{1, \dots, N\}$, this is a system of $N$ second-order ODEs,
\[
\ddot{q}(j;t) = \frac{\dot{q}(j;t)^2}{2 q(j;t)} - \frac{q(j;t)}{2N}\sum_{i=1}^{N}\frac{\dot{q}(i;t)^2}{q(i;t)}, \qquad j = 1, \dots, N .
\]
There is a closed form solution. In fact, the solution is the image of the Riemannian exponential on the sphere of radius 1 by a proper transformation.

The mapping
\[
S\mathcal{E}(\mu) \ni (q,w) \mapsto \left((q/N)^{1/2}, (q/N)^{1/2} w\right) = (\alpha, \beta) \in TS,
\]
where $TS$ is the tangent bundle of the unit sphere $S$, is well defined, is 1-to-1 onto the positive quadrant, and preserves the respective metrics. In fact, for the inverse mapping $TS \ni (\alpha,\beta) \mapsto (N\alpha^2, \alpha^{-1}\beta) \in S\mathcal{E}(\mu)$,
\[
\sum_x \alpha(x)^2 = \sum_x q(x)\frac{1}{N} = 1, \qquad
\sum_x \alpha(x)\beta(x) = \sum_x w(x) q(x)\frac{1}{N} = E_q[w] = 0, \qquad
\sum_x \beta_1(x)\beta_2(x) = \sum_x w_1(x) w_2(x) q(x)\frac{1}{N} = \langle w_1, w_2 \rangle_q .
\]
Notice that this is not Amari's embedding that maps $T\Delta^\circ$ onto $TS$; see the discussion in section 2.2.

The exponential mapping on the Riemannian manifold $S$ is given by the geodesic defined for each $(\alpha,\beta) \in TS$ by
\[
\alpha(t) = \cos(\|\beta\| t)\,\alpha + \|\beta\|^{-1}\sin(\|\beta\| t)\,\beta .
\]
We have
\[
\begin{aligned}
\alpha(0) &= \alpha, \\
\|\alpha(t)\|^2 &= \cos^2(\|\beta\| t) + \|\beta\|^{-2}\|\beta\|^2\sin^2(\|\beta\| t) = 1, \\
\dot{\alpha}(t) &= -\|\beta\|\sin(\|\beta\| t)\,\alpha + \cos(\|\beta\| t)\,\beta, \\
\dot{\alpha}(0) &= \beta, \\
\dot{\alpha}(t)\cdot\alpha(t) &= -\|\beta\|\cos(\|\beta\| t)\sin(\|\beta\| t)\,\alpha\cdot\alpha + \|\beta\|^{-1}\sin(\|\beta\| t)\cos(\|\beta\| t)\,\beta\cdot\beta = 0, \\
\|\dot{\alpha}(t)\|^2 &= \|\beta\|^2\sin^2(\|\beta\| t)\|\alpha\|^2 + \cos^2(\|\beta\| t)\|\beta\|^2 = \|\beta\|^2 .
\end{aligned}
\]
We leave out the check that it is a geodesic.

A second option is to take as an indeterminate $\ell(t) = \log q(t)$. It follows that
\[
\dot{\ell}(t) = \overset{\star}{q}(t) \quad\text{and}\quad \frac{D}{dt}\overset{\star}{q}(t) = \ddot{\ell}(t) + \dot{\ell}(t)^2,
\]
and the equation becomes
\[
\ddot{\ell}(j;t) = -\frac{1}{2}\dot{\ell}(j;t)^2 - \frac{1}{2N}\sum_{i=1}^{N}\mathrm{e}^{\ell(i;t)}\dot{\ell}(i;t)^2 .
\]
Given $(q,w) \in S\mathcal{E}(\mu)$, let us apply the isometric transformation, and define
\[
\begin{aligned}
q(t) &= N\left(\cos\left(\left\|(q/N)^{1/2} w\right\| t\right)(q/N)^{1/2} + \left\|(q/N)^{1/2} w\right\|^{-1}\sin\left(\left\|(q/N)^{1/2} w\right\| t\right)(q/N)^{1/2} w\right)^2 \\
&= q\left(\cos(\sigma(w) t) + \sigma(w)^{-1}\sin(\sigma(w) t)\, w\right)^2,
\end{aligned}
\]
with
\[
\sigma(w) = \left\|(q/N)^{1/2} w\right\| = \sqrt{\sum_x w(x)^2 q(x)\frac{1}{N}} = \sqrt{E_q\left[w^2\right]} .
\]
We assume that $t \in I$ and $q(t) > 0$. Let us compute the velocity and the acceleration of $t \mapsto q(t)$. It holds $q(0) = q$ and
\[
\begin{aligned}
\overset{\star}{q}(t) &= \frac{2\left(w\cos(\sigma t) - \sigma\sin(\sigma t)\right)}{\sigma^{-1} w\sin(\sigma t) + \cos(\sigma t)}, \\
\overset{\star}{q}(t)^2 &= \frac{4\left(w\cos(\sigma t) - \sigma\sin(\sigma t)\right)^2}{\left(\sigma^{-1} w\sin(\sigma t) + \cos(\sigma t)\right)^2}, \\
\frac{\ddot{q}(t)}{q(t)} &= -\frac{2\sigma^2\left(\left(\sigma^2 - w^2\right)\cos(2\sigma t) + 2\sigma w\sin(2\sigma t)\right)}{\left(\sigma\cos(\sigma t) + w\sin(\sigma t)\right)^2}, \\
\frac{\ddot{q}(t)}{q(t)} - \frac{1}{2}\overset{\star}{q}(t)^2 &= -2\sigma^2, \\
\overset{\star}{q}(t)^2\, q(t) &= 4 q\left(w\cos(\sigma t) - \sigma\sin(\sigma t)\right)^2 .
\end{aligned}
\]
It follows that the Riemannian acceleration on the Hilbert bundle is null.
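The closed form above can be checked numerically against the second-order ODE $\ddot q / q = \frac12\left((\dot q/q)^2 - E_q[(\dot q/q)^2]\right)$. The sketch below is a hedged illustration only: it assumes $q$ is a density with respect to the uniform measure on $N$ points, approximates the time derivatives by central finite differences with a hypothetical step `h`, and uses hypothetical function names.

```python
import numpy as np

def expect(q, u):
    """E_q[u] for a density q w.r.t. the uniform measure on N points."""
    return np.mean(q * u)

def closed_form_curve(q0, w, t):
    """q(t) = q0 * (cos(sigma t) + sin(sigma t) w / sigma)^2, sigma^2 = E_q0[w^2]."""
    sigma = np.sqrt(expect(q0, w**2))
    return q0 * (np.cos(sigma * t) + np.sin(sigma * t) * w / sigma) ** 2

def euler_lagrange_residual(q0, w, t, h=1e-4):
    """Residual of q''/q = ((q'/q)^2 - E_q[(q'/q)^2]) / 2 via central differences."""
    q_minus, q, q_plus = (closed_form_curve(q0, w, s) for s in (t - h, t, t + h))
    q_dot = (q_plus - q_minus) / (2 * h)
    q_ddot = (q_plus - 2 * q + q_minus) / h**2
    score = q_dot / q
    return q_ddot / q - 0.5 * (score**2 - expect(q, score**2))
```

The curve stays normalized because $E_q[w] = 0$ and $E_q[w^2] = \sigma^2$, so that $E_q\left[(\cos(\sigma t) + \sigma^{-1}\sin(\sigma t) w)^2\right] = \cos^2(\sigma t) + \sin^2(\sigma t) = 1$; the residual should vanish up to discretization error as long as $q(t)$ remains positive.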
Appendix B. Entropy flow
We compute the natural gradient of the entropy $H(q) = -E_q[\log q]$ by using the Hessian formalism. Notice that, by definition, $\log q + H(q) \in S_q\mathcal{E}(\mu)$.

If
\[
t \mapsto q(t) = \mathrm{e}^{v(t) - K_p(v(t))} \cdot p, \qquad v(t) = s_p(q(t)), \tag{101}
\]
is a smooth curve in $\mathcal{E}(\mu)$ expressed in the chart centered at $p$, then we can write
\[
\begin{aligned}
H(q(t)) &= -E_{q(t)}\left[v(t) - K_p(v(t)) + \log p\right] \\
&= K_p(v(t)) - E_{q(t)}\left[v(t) + \log p + H(p)\right] + H(p) \\
&= K_p(v(t)) - dK_p(v(t))\left[v(t) + \log p + H(p)\right] + H(p),
\end{aligned} \tag{102}
\]
where the argument $v(t) + \log p + H(p)$ of the expectation belongs to the fiber $S_p\mathcal{E}(\mu)$ and we have expressed the expected value as a derivative by using eq. (17).

By using eq. (17) and eq. (19), we see that the derivative of the entropy along the given curve is
\[
\begin{aligned}
\frac{d}{dt} H(q(t)) &= \frac{d}{dt} K_p(v(t)) - \frac{d}{dt}\, dK_p(v(t))\left[v(t) + \log p + H(p)\right] \\
&= dK_p(v(t))[\dot{v}(t)] - d^2 K_p(v(t))\left[v(t) + \log p + H(p), \dot{v}(t)\right] - dK_p(v(t))[\dot{v}(t)] \\
&= -E_{q(t)}\left[{}^{e}\mathbb{U}^{q(t)}_{p}\left(v(t) + \log p + H(p)\right)\, {}^{e}\mathbb{U}^{q(t)}_{p}\dot{v}(t)\right].
\end{aligned}
\]
We then use the identities
\[
v(t) + \log p + H(p) = \log q(t) + K_p(v(t)) + H(p), \tag{103}
\]
\[
{}^{e}\mathbb{U}^{q(t)}_{p}\left(\log q(t) + K_p(v(t)) + H(p)\right) = \log q(t) + H(q(t)), \tag{104}
\]
\[
{}^{e}\mathbb{U}^{q(t)}_{p}\dot{v}(t) = \overset{\star}{q}(t), \tag{105}
\]
to obtain
\[
\frac{d}{dt} H(q(t)) = -\left\langle \log q(t) + H(q(t)), \overset{\star}{q}(t) \right\rangle_{q(t)} . \tag{106}
\]
Hence, we can identify the gradient of the entropy in the statistical bundle with
\[
\operatorname{grad} H(q) = -\left(\log q + H(q)\right). \tag{107}
\]
Notice that the previous computation could have been done using the exponential family $q(t) = e_p(tv)$.

The integral curves of the gradient flow equation
\[
\overset{\star}{q}(t) = \operatorname{grad} H(q(t)) \tag{108}
\]
are exponential families of the form $q(t) \propto q(0)^{\mathrm{e}^{-t}}$.
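The claim about the integral curves can be verified numerically. The following sketch is an illustration under stated assumptions: $q$ is a density with respect to the uniform measure on $N$ points, the velocity $\frac{d}{dt}\log q(t)$ is approximated by a central finite difference, and the function names are hypothetical.

```python
import numpy as np

def expect(q, u):
    """E_q[u] for a density q w.r.t. the uniform measure on N points."""
    return np.mean(q * u)

def entropy(q):
    """H(q) = -E_q[log q]."""
    return -expect(q, np.log(q))

def grad_entropy(q):
    """Natural gradient of the entropy: -(log q + H(q))."""
    return -(np.log(q) + entropy(q))

def entropy_flow(q0, t):
    """q(t) proportional to q0 ** exp(-t), normalised to have mean 1."""
    unnormalised = q0 ** np.exp(-t)
    return unnormalised / unnormalised.mean()

def score_velocity(q0, t, h=1e-5):
    """Central-difference approximation of d/dt log q(t) along the flow."""
    return (np.log(entropy_flow(q0, t + h)) - np.log(entropy_flow(q0, t - h))) / (2 * h)
```

Along the curve, $\frac{d}{dt}\log q(t) = -\mathrm{e}^{-t}\left(\log q(0) - E_{q(t)}[\log q(0)]\right)$, which coincides with $-(\log q(t) + H(q(t)))$ once the normalising constant is taken into account; as $t \to \infty$ the curve approaches the uniform density.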
In fact, if
\[
\frac{d}{dt} \log q(t) = -\left(\log q(t) + H(q(t))\right), \tag{109}
\]
then $H(q(t))$ is a scalar, constant over the sample space, so that at each sample point the equation reduces, up to the normalization, to an elementary linear differential equation in $t$. See [27] for more details.
Appendix C. Covariant time-derivative of the KL Legendre transform
We report here the explicit calculation of the covariant time derivative
\[
\frac{D}{dt}\left(e^{\overset{\star}{q}(t) - K_{q(t)}(\overset{\star}{q}(t))} - 1\right),
\]
via the expression of $\operatorname{grad} K_q(w)$ in the $p$-chart, contracted in the direction $\dot u \in S_p\mathcal{E}(\mu)$. The calculation gives an explicit example of the formalism adopted in dealing with a triple covariance. We have
\[
\begin{aligned}
&\frac{d}{dt}\left\langle e^{\overset{\star}{q}(t) - K_{q(t)}(\overset{\star}{q}(t))} - 1,\ {}^{e}\mathbb{U}_{p}^{q(t)} \dot u \right\rangle_{q(t)} \\
&\quad= \frac{d}{dt}\left( dK_p(u(t) + v(t))[\dot u] - dK_p(u(t))[\dot u] \right) \\
&\quad= d^{2}K_p(u(t) + v(t))\left[\dot u, \tfrac{d}{dt}u(t)\right] + d^{2}K_p(u(t) + v(t))\left[\dot u, \tfrac{d}{dt}v(t)\right] - d^{2}K_p(u(t))\left[\dot u, \tfrac{d}{dt}u(t)\right] \\
&\quad= E_{e_p(u(t)+v(t))}\left[ {}^{e}\mathbb{U}_{p}^{e_p(u(t)+v(t))} \dot u \ \ {}^{e}\mathbb{U}_{p}^{e_p(u(t)+v(t))}\left(\tfrac{d}{dt}u(t) + \tfrac{d}{dt}v(t)\right) \right] - \left\langle {}^{e}\mathbb{U}_{p}^{q(t)} \dot u,\ {}^{e}\mathbb{U}_{p}^{q(t)} \tfrac{d}{dt}u(t) \right\rangle_{q(t)} \\
&\quad= E_{e_p(u(t))}\left[ {}^{e}\mathbb{U}_{e_p(u(t)+v(t))}^{e_p(u(t))} \circ {}^{e}\mathbb{U}_{p}^{e_p(u(t)+v(t))} \dot u \ \ {}^{m}\mathbb{U}_{e_p(u(t)+v(t))}^{e_p(u(t))} \circ {}^{e}\mathbb{U}_{p}^{e_p(u(t)+v(t))}\left(\tfrac{d}{dt}u(t) + \tfrac{d}{dt}v(t)\right) \right] - \left\langle {}^{e}\mathbb{U}_{p}^{q(t)} \dot u,\ {}^{e}\mathbb{U}_{p}^{q(t)} \tfrac{d}{dt}u(t) \right\rangle_{q(t)} \\
&\quad= E_{e_p(u(t))}\left[ {}^{e}\mathbb{U}_{p}^{e_p(u(t))} \dot u \ \ {}^{m}\mathbb{U}_{e_p(u(t)+v(t))}^{e_p(u(t))} \circ {}^{e}\mathbb{U}_{p}^{e_p(u(t)+v(t))}\left(\tfrac{d}{dt}u(t) + \tfrac{d}{dt}v(t)\right) \right] - \left\langle {}^{e}\mathbb{U}_{p}^{q(t)} \dot u,\ {}^{e}\mathbb{U}_{p}^{q(t)} \tfrac{d}{dt}u(t) \right\rangle_{q(t)} \\
&\quad= \left\langle {}^{e}\mathbb{U}_{p}^{e_p(u(t))} \dot u,\ {}^{m}\mathbb{U}_{e_p(u(t)+v(t))}^{e_p(u(t))} \circ {}^{e}\mathbb{U}_{p}^{e_p(u(t)+v(t))}\left(\tfrac{d}{dt}u(t) + \tfrac{d}{dt}v(t)\right) \right\rangle_{q(t)} - \left\langle {}^{e}\mathbb{U}_{p}^{q(t)} \dot u,\ {}^{e}\mathbb{U}_{p}^{q(t)} \tfrac{d}{dt}u(t) \right\rangle_{q(t)} \\
&\quad= \left\langle {}^{e}\mathbb{U}_{p}^{e_p(u(t))} \dot u,\ {}^{m}\mathbb{U}_{e_p(u(t)+v(t))}^{e_p(u(t))} \circ {}^{e}\mathbb{U}_{e_p(u(t))}^{e_p(u(t)+v(t))} \circ {}^{e}\mathbb{U}_{p}^{e_p(u(t))}\left(\tfrac{d}{dt}u(t) + \tfrac{d}{dt}v(t)\right) \right\rangle_{q(t)} - \left\langle {}^{e}\mathbb{U}_{p}^{q(t)} \dot u,\ {}^{e}\mathbb{U}_{p}^{q(t)} \tfrac{d}{dt}u(t) \right\rangle_{q(t)} \\
&\quad= \left\langle {}^{e}\mathbb{U}_{p}^{e_p(u(t))} \dot u,\ {}^{m}\mathbb{U}_{e_p(u(t)+v(t))}^{e_p(u(t))} \circ {}^{e}\mathbb{U}_{e_p(u(t))}^{e_p(u(t)+v(t))}\left(\overset{\star}{q}(t) + \overset{\star\star}{q}(t)\right) \right\rangle_{q(t)} - \left\langle {}^{e}\mathbb{U}_{p}^{e_p(u(t))} \dot u,\ \overset{\star}{q}(t) \right\rangle_{q(t)}.
\end{aligned}
\]
We have
\[
\begin{aligned}
&{}^{m}\mathbb{U}_{e_p(u(t)+v(t))}^{e_p(u(t))} \circ {}^{e}\mathbb{U}_{e_p(u(t))}^{e_p(u(t)+v(t))}\left(\overset{\star}{q}(t) + \overset{\star\star}{q}(t)\right) \\
&\quad= {}^{m}\mathbb{U}_{e_p(u(t)+v(t))}^{e_p(u(t))}\left(\overset{\star}{q}(t) + \overset{\star\star}{q}(t) - E_{e_p(u(t)+v(t))}\left[\overset{\star}{q}(t) + \overset{\star\star}{q}(t)\right]\right) \\
&\quad= \frac{e_p(u(t) + v(t))}{e_p(u(t))}\left(\overset{\star}{q}(t) + \overset{\star\star}{q}(t) - E_{e_p(u(t)+v(t))}\left[\overset{\star}{q}(t) + \overset{\star\star}{q}(t)\right]\right).
\end{aligned}
\]
Hence, eventually, we get
\[
\frac{D}{dt}\left(e^{\overset{\star}{q}(t) - K_{q(t)}(\overset{\star}{q}(t))} - 1\right) = \frac{e_p(u(t) + v(t))}{e_p(u(t))}\left(\overset{\star}{q}(t) + \overset{\star\star}{q}(t) - E_{e_p(u(t)+v(t))}\left[\overset{\star}{q}(t) + \overset{\star\star}{q}(t)\right]\right).
\]
References
1. Ralph Abraham and Jerrold E. Marsden, Foundations of mechanics, Benjamin/Cummings Publishing Co., Inc., Advanced Book Program, Reading, Mass., 1978, Second edition, revised and enlarged, With the assistance of Tudor Raţiu and Richard Cushman. MR 515141
2. P.-A. Absil, R. Mahony, and R. Sepulchre, Optimization algorithms on matrix manifolds, Princeton University Press, 2008, With a foreword by Paul Van Dooren. MR 2364186 (2009a:90001)
3. J. Aitchison, The statistical analysis of compositional data, Monographs on Statistics and Applied Probability, Chapman & Hall, London, 1986. MR 865647
4. Foivos Alimisis, Antonio Orvieto, Gary Bécigneul, and Aurelien Lucchi, A continuous-time perspective for modeling acceleration in Riemannian optimization, 2019.
5. Shun-ichi Amari, Dual connections on the Hilbert bundles of statistical models, Geometrization of statistical theory (Lancaster, 1987), ULDM Publ., 1987, pp. 123–151.
6. Shun-ichi Amari and Hiroshi Nagaoka, Methods of information geometry, American Mathematical Society, 2000, Translated from the 1993 Japanese original by Daishi Harada. MR 1800071
7. V. I. Arnold, Mathematical methods of classical mechanics, Graduate Texts in Mathematics, vol. 60, Springer-Verlag, New York, 1989, Translated from the 1974 Russian original by K. Vogtmann and A. Weinstein, Corrected reprint of the second (1989) edition. MR 1345386
8. Hedy Attouch, Zaki Chbani, Juan Peypouquet, and Patrick Redont, Fast convergence of inertial dynamics and algorithms with asymptotic vanishing viscosity, Mathematical Programming (2018), no. 1, 123–175.
9. Nihat Ay, Jürgen Jost, Hông Vân Lê, and Lorenz Schwachhöfer, Information geometry, Ergebnisse der Mathematik und ihrer Grenzgebiete. 3. Folge. A Series of Modern Surveys in Mathematics [Results in Mathematics and Related Areas. 3rd Series. A Series of Modern Surveys in Mathematics], vol. 64, Springer, Cham, 2017. MR 3701408
10. H. Bateman, On dissipative systems and related variational principles, Phys. Rev. (1931), 815–819.
11. Michael Betancourt, A conceptual introduction to Hamiltonian Monte Carlo, arXiv:1701.02434, 2017.
12. Guilherme França, Jeremias Sulam, Daniel P. Robinson, and René Vidal, Conformal symplectic and relativistic optimization, arXiv:1903.04100, 2019.
13. John Duchi, Elad Hazan, and Yoram Singer, Adaptive subgradient methods for online learning and stochastic optimization, Journal of Machine Learning Research (2011), no. 61, 2121–2159.
14. Bradley Efron and Trevor Hastie, Computer age statistical inference, Institute of Mathematical Statistics (IMS) Monographs, vol. 5, Cambridge University Press, New York, 2016, Algorithms, evidence, and data science. MR 3523956
15. Paolo Gibilisco and Giovanni Pistone, Connections on non-parametric statistical manifolds by Orlicz space geometry, IDAQP (1998), no. 2, 325–347. MR 1628177
16. Daniel M. Greenberger, A critique of the major approaches to damping in quantum theory, Journal of Mathematical Physics (1979), no. 5, 762–770.
17. L. Herrera, L. Núñez, A. Patiño, and H. Rago, A variational principle and the classical and quantum mechanics of the damped harmonic oscillator, American Journal of Physics (1986), 273.
18. Robert E. Kass and Paul W. Vos, Geometrical foundations of asymptotic inference, Wiley Series in Probability and Statistics: Probability and Statistics, John Wiley & Sons, Inc., New York, 1997, A Wiley-Interscience Publication. MR 1461540 (99b:62032)
19. Diederik P. Kingma and Jimmy Ba, Adam: A method for stochastic optimization, arXiv:1412.6980, 2014.
20. Wilhelm P. A. Klingenberg, Riemannian geometry, second ed., De Gruyter Studies in Mathematics, vol. 1, Walter de Gruyter & Co., Berlin, 1995. MR 1330918
21. Walid Krichene, Alexandre Bayen, and Peter L. Bartlett, Accelerated mirror descent in continuous and discrete time, Advances in Neural Information Processing Systems 28 (C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, eds.), Curran Associates, Inc., 2015, pp. 2845–2853.
22. Masayuki Kumon and Shun-ichi Amari, Differential geometry of testing hypothesis: a higher order asymptotic theory in multi-parameter curved exponential family, J. Fac. Engrg. Univ. Tokyo Ser. B (1988), no. 3, 241–273. MR 1407894
23. Serge Lang, Differential and Riemannian manifolds, third ed., Graduate Texts in Mathematics, vol. 160, Springer-Verlag, 1995. MR 96d:53001
24. Melvin Leok and Jun Zhang, Connecting information geometry and geometric mechanics, Entropy (2017), no. 10, 518.
25. Luigi Malagò and Giovanni Pistone, Combinatorial optimization with information geometry: Newton method, Entropy (2014), 4260–4289.
26. Mateusz Michałek, Bernd Sturmfels, Caroline Uhler, and Piotr Zwiernik, Exponential varieties, Proc. Lond. Math. Soc. (3) (2016), no. 1, 27–56. MR 3458144
27. Giovanni Pistone, Examples of the application of nonparametric information geometry to statistical physics, Entropy (2013), no. 10, 4042–4065. MR 3130268
28. Giovanni Pistone, Nonparametric information geometry, Geometric science of information (Frank Nielsen and Frédéric Barbaresco, eds.), Lecture Notes in Comput. Sci., vol. 8085, Springer, Heidelberg, 2013, First International Conference, GSI 2013 Paris, France, August 28-30, 2013 Proceedings, pp. 5–36. MR 3126029
29. Giovanni Pistone, Lagrangian function on the finite state space statistical bundle, Entropy (2018), no. 2, 139.
30. Giovanni Pistone, Information geometry of the probability simplex: A short course, NPCS (Nonlinear Phenomena in Complex Systems) (2019), 221–242, arXiv:1911.01876.
31. Giovanni Pistone, Information geometry of the probability simplex: A short course, (to appear, see arXiv:1911.01876), 2020.
32. Giovanni Pistone and Maria Piera Rogantin, The gradient flow of the polarization measure. With an appendix, arXiv:1502.06718, 2015.
33. Alfréd Rényi, On measures of entropy and information, Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics (Berkeley, Calif.), University of California Press, 1961, pp. 547–561.
34. Hirohiko Shima, The geometry of Hessian structures, World Scientific Publishing Co. Pte. Ltd., Hackensack, NJ, 2007. MR 2293045
35. Jean-Marie Souriau, Structure des systèmes dynamiques, Dunod, 1970, Réimpression autorisées, 2e tirage, Éditions Jacques Gabay 2012.
36. Weijie Su, Stephen Boyd, and Emmanuel J. Candès, A differential equation for modeling Nesterov's accelerated gradient method: Theory and insights, Journal of Machine Learning Research (2016), no. 153, 1–43.
37. Amirhossein Taghvaei and Prashant Mehta, Accelerated flow for probability distributions, Proceedings of Machine Learning Research, vol. 97, PMLR, 09–15 Jun 2019, pp. 6076–6085.
38. T. Tieleman and G. Hinton, Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude, COURSERA: Neural Networks for Machine Learning, 4, 26-31, 2012.
39. Andre Wibisono, Ashia C. Wilson, and Michael I. Jordan, A variational perspective on accelerated methods in optimization, Proceedings of the National Academy of Sciences (2016), no. 47, E7351–E7358.
40. Hongyi Zhang and Suvrit Sra, Towards Riemannian accelerated gradient methods, 2018.
Romanian Institute of Science and Technology, Strada Virgil Fulicea 17, 400022, Cluj-Napoca, Romania, & Max Planck Institute for Gravitational Physics (Albert Einstein Institute), Am Mühlenberg 1, 14476 Potsdam-Golm, Germany
E-mail address: [email protected]
Romanian Institute of Science and Technology, Strada Virgil Fulicea 17, 400022, Cluj-Napoca, Romania
E-mail address: [email protected]
de Castro Statistics, Collegio Carlo Alberto, Piazza Vincenzo Arbarello 8, 10122 Torino, Italy
E-mail address: [email protected]