TRANSPORT INFORMATION HESSIAN DISTANCES
WUCHEN LI
Abstract.
We formulate closed-form Hessian distances of information entropies in one-dimensional probability density space embedded with the $L^2$-Wasserstein metric.

1. Introduction
Hessian distances of information entropies in probability density space play crucial roles in information theory, with applications in signal and image processing, inverse problems, and AI [1, 3, 4]. One important example is the Hellinger distance, known as the Hessian distance of the negative Boltzmann-Shannon entropy in $L^2$ (Euclidean) space. Nowadays, the Hellinger distance has shown various useful properties in AI inference problems. Recently, the optimal transport distance, a.k.a. the Wasserstein distance, has provided another type of distance function in probability density space [2, 9]. (There are various generalizations of optimal transport distances using different ground costs defined on a sample space; for simplicity of discussion, we focus on the $L^2$-Wasserstein distance.) Unlike the $L^2$ distance, the Wasserstein distance compares probability densities through their pushforward mapping functions. More importantly, it introduces a metric space under which information entropies exhibit convexity properties in terms of mapping functions. These properties have been shown useful in fluid dynamics, inverse problems, and AI. It is also known that the optimal transport distance itself is a Hessian distance, namely the Hessian distance of a second moment functional in Wasserstein space. Natural questions arise. Can we construct Hessian distances of information entropies in Wasserstein space? And what is the Hessian distance of the negative Boltzmann-Shannon entropy in Wasserstein space?
In this paper, we derive closed-form Hessian distances of information entropies in Wasserstein space supported on a one-dimensional sample space. In detail, given a compact set $\Omega\subset\mathbb{R}$, consider a (negative) $f$-entropy
$$\mathcal{F}(p)=\int_\Omega f(p(x))\,dx,$$
where $f\colon\mathbb{R}\rightarrow\mathbb{R}$ is a twice differentiable convex function and $p$ is a given probability density function. We show that the Hessian distance of the $f$-entropy in Wasserstein space between probability density functions $p$ and $q$ satisfies
$$\mathrm{Dist}_H(p,q)=\sqrt{\int_0^1\big\|h(\nabla_y F_p^{-1}(y))-h(\nabla_y F_q^{-1}(y))\big\|^2\,dy},$$
where $h(y)=\int^y\sqrt{f''\big(\frac{1}{z}\big)}\,z^{-\frac{3}{2}}\,dz$, $F_p$, $F_q$ are the cumulative distribution functions (CDFs) of $p$, $q$, respectively, and $F_p^{-1}$, $F_q^{-1}$ are their inverse CDFs. We call $\mathrm{Dist}_H(p,q)$ transport information Hessian distances. Shortly, we show that the proposed distances are constructed from the Jacobi operators of mapping functions between the density functions $p$ and $q$.

This paper is organized as follows. In section 2, we briefly review the Wasserstein space and its Hessian operators for information entropies. In section 3, we derive closed-form solutions of Hessian distances in Wasserstein space. Several analytical examples are presented in section 4.

Key words and phrases: Hessian distance; Optimal transport; Information geometry.

2. Review of Transport information Hessian metric
In this section, we briefly review the Wasserstein space and its induced Hessian metrics for information entropies.
2.1. Wasserstein space.
We recall the definition of the one-dimensional Wasserstein distance [2]. Denote the spatial domain by $\Omega=[0,1]\subset\mathbb{R}$. Consider the space of smooth probability densities
$$\mathcal{P}(\Omega)=\Big\{p(x)\in C^{\infty}(\Omega)\colon\int_\Omega p(x)\,dx=1,\ p(x)\geq 0\Big\}.$$
Given any two probability densities $p$, $q\in\mathcal{P}(\Omega)$, the squared Wasserstein distance in $\Omega$ is defined by
$$\mathrm{Dist}_T(p,q)^2=\int_\Omega\|T(x)-x\|^2\,q(x)\,dx,$$
where $\|\cdot\|$ represents the Euclidean norm and $T$ is a monotone mapping function such that $T_{\#}q=p$, i.e.,
$$p(T(x))\nabla_x T(x)=q(x).$$
Since $\Omega\subset\mathbb{R}$, the mapping function $T$ can be solved analytically. Concretely,
$$T(x)=F_p^{-1}(F_q(x)),$$
where $F_p^{-1}$, $F_q^{-1}$ are the inverse CDFs of the probability densities $p$, $q$, respectively. Equivalently,
$$\mathrm{Dist}_T(p,q)^2=\int_\Omega\|F_p^{-1}(F_q(x))-x\|^2\,q(x)\,dx=\int_0^1\|F_p^{-1}(y)-F_q^{-1}(y)\|^2\,dy,$$
where we apply the change of variables $y=F_q(x)\in[0,1]$ in the second equality.
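For a numerical illustration of the inverse-CDF formula, here is a minimal sketch (assuming NumPy and SciPy; the function name and the Gaussian test case, used formally on $\mathbb{R}$ rather than on $[0,1]$, are our own choices):

```python
import numpy as np
from scipy.stats import norm

def wasserstein_1d_sq(inv_cdf_p, inv_cdf_q, n=10_000):
    """Squared Wasserstein distance via the inverse-CDF formula:
    Dist_T(p, q)^2 = int_0^1 |F_p^{-1}(y) - F_q^{-1}(y)|^2 dy."""
    y = (np.arange(n) + 0.5) / n          # midpoint quadrature on (0, 1)
    return np.mean((inv_cdf_p(y) - inv_cdf_q(y)) ** 2)

# For Gaussians, F^{-1}(y) = mu + sigma * Phi^{-1}(y), and analytically
# Dist_T^2 = (mu_1 - mu_0)^2 + (sigma_1 - sigma_0)^2 = 2 in this test.
print(wasserstein_1d_sq(norm(1.0, 2.0).ppf, norm(0.0, 1.0).ppf))
```

The midpoint rule keeps the quadrature away from the endpoints $y=0,1$, where inverse CDFs of densities with unbounded support diverge.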
There is a metric formulation for the $L^2$-Wasserstein distance. Denote the tangent space of $\mathcal{P}(\Omega)$ at a probability density $p$ by
$$T_p\mathcal{P}(\Omega)=\Big\{\sigma\in C^{\infty}(\Omega)\colon\int_\Omega\sigma(x)\,dx=0\Big\}.$$
The $L^2$-Wasserstein metric refers to the following bilinear form:
$$g_T(p)(\sigma,\sigma)=\int_\Omega(\nabla_x\Phi(x),\nabla_x\Phi(x))\,p(x)\,dx,$$
where $\Phi\in C^{\infty}(\Omega)$ and $\sigma\in T_p\mathcal{P}(\Omega)$ satisfy the elliptic equation
$$\sigma(x)=-\nabla_x\cdot(p(x)\nabla_x\Phi(x)),\qquad(1)$$
with either Neumann or periodic boundary conditions on $\Omega$. The above boundary conditions ensure that $\sigma$ stays in the tangent space of probability density space, i.e. $\int_\Omega\sigma(x)\,dx=0$. As a known fact, the Wasserstein metric can be derived from a Taylor expansion of the Wasserstein distance, i.e.,
$$\mathrm{Dist}_T(p,p+\sigma)^2=g_T(p)(\sigma,\sigma)+o(\|\sigma\|_{L^2}^2),$$
where $\sigma\in T_p\mathcal{P}(\Omega)$ and $\|\cdot\|_{L^2}$ represents the $L^2$ norm. In the literature, $(\mathcal{P}(\Omega),g_T)$ is often called the Wasserstein space.
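In one dimension, equation (1) with Neumann boundary conditions integrates directly to $p(x)\nabla_x\Phi(x)=-\int_0^x\sigma(s)\,ds$, so the metric $g_T(p)(\sigma,\sigma)=\int_\Omega\big(\int_0^x\sigma\big)^2/p\,dx$ can be evaluated without a linear solve. A grid-based sketch (the helper name and test data are our own):

```python
import numpy as np
from scipy.integrate import cumulative_trapezoid, trapezoid

def metric_gT(p, sigma, x):
    """g_T(p)(sigma, sigma) on a 1D grid (Neumann boundary conditions):
    equation (1) gives p * Phi' = -int_0^x sigma ds."""
    flux = cumulative_trapezoid(sigma, x, initial=0.0)   # = -p * Phi'
    return trapezoid(flux**2 / p, x)

x = np.linspace(0.0, 1.0, 2001)
p = np.ones_like(x)                     # uniform density on [0, 1]
sigma = np.cos(2 * np.pi * x)           # tangent vector: integral is zero
print(metric_gT(p, sigma, x))           # analytic value: 1/(8 pi^2)
```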
2.2. Hessian metrics in Wasserstein space.
We next review the Hessian operator of $f$-entropies in Wasserstein space [9]; see a derivation in the appendix. Consider a (negative) $f$-entropy
$$\mathcal{F}(p)=\int_\Omega f(p(x))\,dx,$$
where $f\colon\mathbb{R}\rightarrow\mathbb{R}$ is a one-dimensional twice differentiable convex function. The Hessian operator of the $f$-entropy in Wasserstein space is a bilinear form satisfying
$$\mathrm{Hess}_T\mathcal{F}(p)(\sigma,\sigma)=\int_\Omega\|\nabla_x^2\Phi(x)\|^2\,f''(p(x))\,p(x)^2\,dx,$$
where $f''$ represents the second derivative of the function $f$ and $(\Phi,\sigma)$ satisfies the elliptic equation (1).
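The Hessian operator can be evaluated on a grid in the same manner as the metric above, noting that in one dimension $\nabla_x^2\Phi=\Delta_x\Phi$. A sketch under the same Neumann-boundary assumptions (helper names are ours):

```python
import numpy as np
from scipy.integrate import cumulative_trapezoid, trapezoid

def hess_form(p, sigma, x, f_pp):
    """Hess_T F(p)(sigma, sigma) on a 1D grid: solve (1) for
    Phi' = -(int_0^x sigma ds)/p, differentiate once more, then
    integrate |Phi''|^2 f''(p) p^2 dx."""
    phi_x = -cumulative_trapezoid(sigma, x, initial=0.0) / p
    phi_xx = np.gradient(phi_x, x)
    return trapezoid(phi_xx**2 * f_pp(p) * p**2, x)

x = np.linspace(0.0, 1.0, 2001)
p = np.ones_like(x)
sigma = np.cos(2 * np.pi * x)
# Boltzmann-Shannon: f(p) = p log p, f''(p) = 1/p; analytic value is 1/2.
print(hess_form(p, sigma, x, lambda r: 1.0 / r))
```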
In this paper, we seek closed-form solutions for the transport Hessian distances of $\mathcal{F}$. This amounts to solving an action functional in $(\mathcal{P}(\Omega),\mathrm{Hess}_T\mathcal{F})$.

Definition 1 (Transport information Hessian distance [7]). Define a distance function $\mathrm{Dist}_H\colon\mathcal{P}(\Omega)\times\mathcal{P}(\Omega)\rightarrow\mathbb{R}$ by
$$\mathrm{Dist}_H(p,q)^2=\inf_{p\colon[0,1]\times\Omega\rightarrow\mathbb{R}}\Big\{\int_0^1\mathrm{Hess}_T\mathcal{F}(p)(\partial_t p,\partial_t p)\,dt\ \colon\ p(0,x)=q(x),\ p(1,x)=p(x)\Big\}.\qquad(2)$$
Here the infimum is taken among all smooth density paths $p\colon[0,1]\times\Omega\rightarrow\mathbb{R}$ which connect the initial and terminal time probability density functions $q$, $p\in\mathcal{P}(\Omega)$.

From now on, we call $\mathrm{Dist}_H$ the transport information Hessian distance. Here the terminology "transport" corresponds to the use of Wasserstein space, while the name "information" refers to the use of $f$-entropies.

3. Transport information Hessian distances
In this section, we present the main result of this paper.

3.1. Formulations.
We first derive closed-form solutions for the transport information Hessian distances defined by (2).
Theorem 1.
Denote a one-dimensional function $h\colon\mathbb{R}_{+}\rightarrow\mathbb{R}$ by
$$h(y)=\int^y\sqrt{f''\Big(\frac{1}{z}\Big)}\,z^{-\frac{3}{2}}\,dz.$$
Then the squared transport Hessian distance of the $f$-entropy has the following formulations.

(i) Inverse CDF formulation:
$$\mathrm{Dist}_H(p,q)^2=\int_0^1\big\|h(\nabla_y F_p^{-1}(y))-h(\nabla_y F_q^{-1}(y))\big\|^2\,dy.$$

(ii) Mapping formulation:
$$\mathrm{Dist}_H(p,q)^2=\int_\Omega\Big\|h\Big(\frac{\nabla_x T(x)}{q(x)}\Big)-h\Big(\frac{1}{q(x)}\Big)\Big\|^2\,q(x)\,dx,$$
where $T$ is the mapping function such that $T_{\#}q=p$ and $T(x)=F_p^{-1}(F_q(x))$. Equivalently,
$$\mathrm{Dist}_H(p,q)^2=\int_\Omega\Big\|h\Big(\frac{\nabla_x T^{-1}(x)}{p(x)}\Big)-h\Big(\frac{1}{p(x)}\Big)\Big\|^2\,p(x)\,dx,$$
where $T^{-1}$ is the inverse of the mapping function $T$, such that $(T^{-1})_{\#}p=q$ and $T^{-1}(x)=F_q^{-1}(F_p(x))$.

Proof. We first derive formulation (i). To do so, we consider the following two sets of changes of variables.

Firstly, denote $p_0(x)=q(x)$ and $p_1(x)=p(x)$. Write the variational problem (2) as
$$\mathrm{Dist}_H(p_0,p_1)^2=\inf_{\Phi,\,p}\Big\{\int_0^1\int_\Omega\|\nabla_y^2\Phi(t,y)\|^2 f''(p(t,y))\,p(t,y)^2\,dy\,dt\ \colon\ \partial_t p(t,y)+\nabla_y\cdot(p(t,y)\nabla_y\Phi(t,y))=0,\ \textrm{fixed }p_0,\,p_1\Big\},$$
where the infimum is among all smooth density paths $p\colon[0,1]\times\Omega\rightarrow\mathbb{R}$ satisfying the continuity equation with gradient drift vector fields $\nabla_y\Phi\colon[0,1]\times\Omega\rightarrow\mathbb{R}$. Denote
$$y=T(t,x),\qquad\partial_t T(t,x)=\nabla_y\Phi(t,T(t,x)).$$
Hence
$$\mathrm{Dist}_H(p_0,p_1)^2=\inf_{T}\Big\{\int_0^1\int_\Omega\|\nabla_y\partial_t T(t,x)\|^2 f''(p(t,T(t,x)))\,p(t,T(t,x))^2\,dT(t,x)\,dt\ \colon\ T(t,\cdot)_{\#}p_0=p(t,\cdot)\Big\},$$
where the infimum is taken among all smooth mapping functions $T\colon[0,1]\times\Omega\rightarrow\Omega$ with $T(0,x)=x$ and $T(1,x)=T(x)$. We observe that the above variational problem leads to
$$\begin{aligned}
&\int_0^1\int_\Omega\|\nabla_y\partial_t T(t,x)\|^2 f''(p(t,T(t,x)))\,p(t,T(t,x))^2\,\nabla_x T(t,x)\,dx\,dt\\
=&\int_0^1\int_\Omega\Big\|\nabla_x\partial_t T(t,x)\,\frac{dx}{dy}\Big\|^2 f''(p(t,T(t,x)))\,p(t,T(t,x))^2\,\nabla_x T(t,x)\,dx\,dt\\
=&\int_0^1\int_\Omega\Big\|\nabla_x\partial_t T(t,x)\,\frac{1}{\nabla_x T(t,x)}\Big\|^2 f''\Big(\frac{p(0,x)}{\nabla_x T(t,x)}\Big)\Big(\frac{p(0,x)}{\nabla_x T(t,x)}\Big)^2\,\nabla_x T(t,x)\,dx\,dt\\
=&\int_0^1\int_\Omega\Big\|\partial_t\nabla_x T(t,x)\,\frac{1}{(\nabla_x T(t,x))^{3/2}}\,\sqrt{f''\Big(\frac{q(x)}{\nabla_x T(t,x)}\Big)}\Big\|^2\,q(x)^2\,dx\,dt.\qquad(3)
\end{aligned}$$
Secondly, denote $y=F_q(x)$, where $y\in[0,1]$. Since $T(t,\cdot)_{\#}q=p_t$ with $p_t:=p(t,\cdot)$, we have
$$q(x)=\frac{dy}{dx}=\frac{1}{\nabla_y F_q^{-1}(y)},\qquad\nabla_x T(t,x)=\nabla_x F_{p_t}^{-1}(F_q(x))=\nabla_y F_{p_t}^{-1}(y)\,\frac{dy}{dx}=\frac{\nabla_y F_{p_t}^{-1}(y)}{\nabla_y F_q^{-1}(y)}.$$
Under the above change of variables, the variational problem (3) leads to
$$\begin{aligned}
\mathrm{Dist}_H(p_0,p_1)^2&=\inf\Big\{\int_0^1\int_0^1\Big\|\partial_t\nabla_y F_{p_t}^{-1}(y)\,\frac{1}{(\nabla_y F_{p_t}^{-1}(y))^{3/2}}\,\sqrt{f''\Big(\frac{1}{\nabla_y F_{p_t}^{-1}(y)}\Big)}\Big\|^2\,dy\,dt\Big\}\\
&=\inf\Big\{\int_0^1\int_0^1\|\partial_t h(\nabla_y F_{p_t}^{-1}(y))\|^2\,dy\,dt\Big\},\qquad(4)
\end{aligned}$$
where the infimum is taken among all paths $t\mapsto\nabla_y F_{p_t}^{-1}$ with fixed initial and terminal time conditions. Here we apply the fact that
$$\nabla_y h(y)=\sqrt{f''\Big(\frac{1}{y}\Big)}\,y^{-\frac{3}{2}},$$
and treat $\nabla_y F_{p_t}^{-1}$ as an individual variable. By using the Euler-Lagrange equation for $\nabla_y F_{p_t}^{-1}$, we show that the geodesic equation in the transport Hessian metric satisfies
$$\partial_{tt}h(\nabla_y F_{p_t}^{-1}(y))=0.$$
In detail, we have
$$h(\nabla_y F_{p_t}^{-1}(y))=t\,h(\nabla_y F_p^{-1}(y))+(1-t)\,h(\nabla_y F_q^{-1}(y)),$$
and
$$\partial_t h(\nabla_y F_{p_t}^{-1}(y))=h(\nabla_y F_p^{-1}(y))-h(\nabla_y F_q^{-1}(y)).$$
Therefore, variational problem (4) leads to formulation (i).

We next derive formulation (ii). Denote $y=F_q(x)$ and $T(x)=F_p^{-1}(F_q(x))$. By using formulation (i) and the change of variables formula in integration, we have
$$\begin{aligned}
\mathrm{Dist}_H(p_0,p_1)^2&=\int_0^1\|h(\nabla_y F_p^{-1}(y))-h(\nabla_y F_q^{-1}(y))\|^2\,dy\\
&=\int_\Omega\|h(\nabla_y F_p^{-1}(F_q(x)))-h(\nabla_y F_q^{-1}(F_q(x)))\|^2\,dF_q(x)\\
&=\int_\Omega\Big\|h\Big(\nabla_x T(x)\,\frac{dx}{dy}\Big)-h\Big(\frac{dx}{dy}\Big)\Big\|^2\,q(x)\,dx\\
&=\int_\Omega\Big\|h\Big(\frac{\nabla_x T(x)}{q(x)}\Big)-h\Big(\frac{1}{q(x)}\Big)\Big\|^2\,q(x)\,dx.
\end{aligned}$$
This finishes the first part of the proof. Similarly, we can derive the transport information Hessian distance in terms of the inverse mapping function $T^{-1}$. □
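Formulation (i) can be evaluated numerically from quantile functions alone. A minimal sketch (the helper name, grid size, and Gaussian test case are our own illustrative choices):

```python
import numpy as np
from scipy.stats import norm

def dist_H(inv_cdf_p, inv_cdf_q, h, n=20_000):
    """Formulation (i): Dist_H^2 is the squared L^2 norm over y of
    h(d/dy F_p^{-1}(y)) - h(d/dy F_q^{-1}(y))."""
    y = np.linspace(0.0, 1.0, n + 1)[1:-1]   # stay away from the endpoints
    dy = y[1] - y[0]
    diff = h(np.gradient(inv_cdf_p(y), dy)) - h(np.gradient(inv_cdf_q(y), dy))
    return np.sqrt(np.mean(diff**2))

# With h = log (Example 1 in section 4) and two Gaussians, T is affine
# with slope sigma_p / sigma_q, so Dist_H = |log(sigma_p / sigma_q)|.
print(dist_H(norm(0.0, 2.0).ppf, norm(1.0, 1.0).ppf, np.log))  # ~ log 2
```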
Remark 1. We notice that $\mathrm{Dist}_H$ forms a class of distance functions in probability density space. Compared to the classical optimal transport distance, it emphasizes the differences between the Jacobi operators of the mapping functions. In this sense, we call the transport information Hessian distance the "Optimal Jacobi transport distance".

Remark 2. Transport information Hessian distances share similarities with the transport Bregman divergences defined in [8]. Here we remark that transport Hessian distances are symmetric in $p$, $q$, i.e. $\mathrm{Dist}_H(p,q)=\mathrm{Dist}_H(q,p)$, while transport Bregman divergences are often asymmetric in $p$, $q$; see examples in [8].
3.2. Properties.

We next demonstrate that transport Hessian distances satisfy several basic properties.
Proposition 1.
The transport Hessian distance has the following properties.

(i) Nonnegativity: $\mathrm{Dist}_H(p,q)\geq 0$. In addition, $\mathrm{Dist}_H(p,q)=0$ iff $p(x+c)=q(x)$, where $c\in\mathbb{R}$ is a constant.

(ii) Symmetry: $\mathrm{Dist}_H(p,q)=\mathrm{Dist}_H(q,p)$.

(iii) Triangle inequality: For any probability densities $p$, $q$, $r\in\mathcal{P}(\Omega)$, we have
$$\mathrm{Dist}_H(p,r)\leq\mathrm{Dist}_H(p,q)+\mathrm{Dist}_H(q,r).$$

(iv) Hessian metric: The Taylor expansion
$$\mathrm{Dist}_H(p,p+\sigma)^2=\mathrm{Hess}_T\mathcal{F}(p)(\sigma,\sigma)+o(\|\sigma\|_{L^2}^2)$$
holds, where $\sigma\in T_p\mathcal{P}(\Omega)$.

Proof. Properties (ii)-(iv) follow from the construction of the transport Hessian distance. Here we only need to show (i). $\mathrm{Dist}_H(p,q)=0$ implies that
$$\Big\|h\Big(\frac{\nabla_x T(x)}{q(x)}\Big)-h\Big(\frac{1}{q(x)}\Big)\Big\|=0$$
on the support of the density $q$. Notice that $h$ is a monotone function.
Thus $\nabla_x T(x)=1$, and hence $T(x)=x+c$ for some constant $c\in\mathbb{R}$. From the fact that $T_{\#}q=p$, we derive $p(T(x))\nabla_x T(x)=q(x)$. This implies $p(x+c)=q(x)$. □
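The properties can also be sanity-checked numerically with the quantile-based evaluation of formulation (i) from section 3.1. In the Gaussian test below the three pairwise distances are $\log 2$, $\log\frac{3}{2}$, and $\log 3$, so the triangle inequality holds with equality; the checks are illustrative, not part of the proof:

```python
import numpy as np
from scipy.stats import norm

def dist_H(inv_cdf_p, inv_cdf_q, h=np.log, n=20_000):
    # Quantile-based evaluation of formulation (i), as in section 3.1.
    y = np.linspace(0.0, 1.0, n + 1)[1:-1]
    dy = y[1] - y[0]
    diff = h(np.gradient(inv_cdf_p(y), dy)) - h(np.gradient(inv_cdf_q(y), dy))
    return np.sqrt(np.mean(diff**2))

p, q, r = norm(0, 1).ppf, norm(0, 2).ppf, norm(1, 3).ppf
assert abs(dist_H(p, q) - dist_H(q, p)) < 1e-12             # (ii) symmetry
assert dist_H(p, r) <= dist_H(p, q) + dist_H(q, r) + 1e-9   # (iii) triangle
assert dist_H(norm(0, 1).ppf, norm(5, 1).ppf) < 1e-8        # (i) translation
```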
4. Closed-form distances

In this section, we provide several closed-form examples of transport information Hessian distances. From now on, we always denote $y=F_q(x)$ and $T(x)=F_p^{-1}(F_q(x))$.

Example 1 (Boltzmann-Shannon entropy). Let $f(p)=p\log p$, i.e.
$$\mathcal{F}(p)=-\mathcal{H}(p)=\int_\Omega p(x)\log p(x)\,dx.$$
Then $h(y)=\log y$. Hence
$$\mathrm{Dist}_H(p,q)=\sqrt{\int_\Omega\|\log\nabla_x T(x)\|^2\,q(x)\,dx}=\sqrt{\int_0^1\|\log\nabla_y F_p^{-1}(y)-\log\nabla_y F_q^{-1}(y)\|^2\,dy}.\qquad(5)$$

Remark 3. We compare the Hessian distances of the Boltzmann-Shannon entropy defined in $L^2$ space and in Wasserstein space. In $L^2$ space, this Hessian distance is known as the Hellinger distance, where
$$\mathrm{Hellinger}(p,q)=\sqrt{\int_\Omega\big\|\sqrt{p(x)}-\sqrt{q(x)}\big\|^2\,dx}.$$
Here distance (5) is an analog of the Hellinger distance in Wasserstein space.

Example 2 (Quadratic entropy). Let $f(p)=p^2$, then $h(y)=-2\sqrt{2}\,\big(y^{-\frac{1}{2}}-1\big)$. Hence
$$\mathrm{Dist}_H(p,q)=2\sqrt{2}\,\sqrt{\int_\Omega\big\|(\nabla_x T(x))^{-\frac{1}{2}}-1\big\|^2\,q(x)^2\,dx}=2\sqrt{2}\,\sqrt{\int_0^1\big\|(\nabla_y F_p^{-1}(y))^{-\frac{1}{2}}-(\nabla_y F_q^{-1}(y))^{-\frac{1}{2}}\big\|^2\,dy}.$$

Example 3 (Cross entropy). Let $f(p)=-\log p$, then $h(y)=2\big(y^{\frac{1}{2}}-1\big)$. Hence
$$\mathrm{Dist}_H(p,q)=2\sqrt{\int_\Omega\big\|(\nabla_x T(x))^{\frac{1}{2}}-1\big\|^2\,dx}=2\sqrt{\int_0^1\big\|(\nabla_y F_p^{-1}(y))^{\frac{1}{2}}-(\nabla_y F_q^{-1}(y))^{\frac{1}{2}}\big\|^2\,dy}.$$
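As a quick cross-check of Example 3 with densities of our own choosing: take $p$ uniform on $[0,1]$ and $q(x)=2x$, so that $F_q(x)=x^2$, $T(x)=F_p^{-1}(F_q(x))=x^2$, and $\nabla_x T(x)=2x$. The mapping and inverse-CDF formulations then agree numerically:

```python
import numpy as np

n = 200_000
x = (np.arange(n) + 0.5) / n      # midpoint grid on (0, 1)

# Mapping formulation: 2 * sqrt( int_0^1 |(2x)^{1/2} - 1|^2 dx ).
mapping_form = 2.0 * np.sqrt(np.mean((np.sqrt(2.0 * x) - 1.0) ** 2))

# Inverse-CDF formulation: d/dy F_p^{-1}(y) = 1, d/dy F_q^{-1}(y) = 1/(2 sqrt(y)).
y = x
inv_cdf_form = 2.0 * np.sqrt(np.mean((1.0 - 1.0 / np.sqrt(2.0 * np.sqrt(y))) ** 2))

print(mapping_form, inv_cdf_form)  # both approximately 0.676
```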
Example 4. Let $f(p)=\frac{1}{2}p^{-1}$, then $h(y)=y-1$. Hence
$$\mathrm{Dist}_H(p,q)=\sqrt{\int_\Omega\|\nabla_x T(x)-1\|^2\,q(x)^{-1}\,dx}=\sqrt{\int_0^1\|\nabla_y F_p^{-1}(y)-\nabla_y F_q^{-1}(y)\|^2\,dy}.$$

Example 5 ($\gamma$-entropy). Let $f(p)=\frac{1}{(1-\gamma)(2-\gamma)}p^{2-\gamma}$ with $\gamma\neq 1,2$, then $h(y)=\frac{2}{\gamma-1}\big(y^{\frac{\gamma-1}{2}}-1\big)$. Hence
$$\mathrm{Dist}_H(p,q)=\frac{2}{|\gamma-1|}\sqrt{\int_\Omega\big\|(\nabla_x T(x))^{\frac{\gamma-1}{2}}-1\big\|^2\,q(x)^{2-\gamma}\,dx}=\frac{2}{|\gamma-1|}\sqrt{\int_0^1\big\|(\nabla_y F_p^{-1}(y))^{\frac{\gamma-1}{2}}-(\nabla_y F_q^{-1}(y))^{\frac{\gamma-1}{2}}\big\|^2\,dy}.$$
In particular, $\gamma=3$ recovers Example 4, while $\gamma=0$ recovers the quadratic entropy of Example 2 up to a constant multiple.

Acknowledgement: W. Li is supported by a start-up funding from the University of South Carolina.
References

[1] S. Amari. Information Geometry and Its Applications. Springer Publishing Company, Incorporated, 1st edition, 2016.
[2] L. Ambrosio, N. Gigli, and G. Savaré. Gradient Flows in Metric Spaces and in the Space of Probability Measures. Birkhäuser, 2008.
[3] S. Cheng and S. T. Yau. The real Monge-Ampère equation and affine flat structures. Proc. 1980 Beijing Symp. Differ. Geom. and Diff. Eqns., Vol. 1, pp. 339-370, 1982.
[4] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley Series in Telecommunications. Wiley, New York, 1991.
[5] J. D. Lafferty. The density manifold and configuration space quantization. Transactions of the American Mathematical Society, 305(2):699-741, 1988.
[6] W. Li. Transport information geometry: Riemannian calculus on probability simplex. arXiv:1803.06360 [math], 2018.
[7] W. Li. Hessian metric via transport information geometry. arXiv:2003.10526, 2020.
[8] W. Li. Transport information Bregman divergences. arXiv:2101.01162, 2021.
[9] C. Villani. Optimal Transport: Old and New. Number 338 in Grundlehren der Mathematischen Wissenschaften. Springer, Berlin, 2009.
Appendix
For the completeness of this paper, we present the derivation of the Hessian operator of the $f$-entropy in Wasserstein space below.

Proposition 2 (Formula 15.7 in [9]). Denote $\mathcal{F}(p)=\int_\Omega f(p(x))\,dx$ and $\Omega\subset\mathbb{R}$. Then
$$\mathrm{Hess}_T\mathcal{F}(p)(\sigma,\sigma)=\int_\Omega\|\nabla_x^2\Phi(x)\|^2\,f''(p(x))\,p(x)^2\,dx,$$
where $\sigma(x)=-\nabla_x\cdot(p(x)\nabla_x\Phi(x))$.
Proof.
Notice that the geodesics in Wasserstein space satisfy
$$\begin{cases}\partial_t p(t,x)+\nabla_x\cdot(p(t,x)\nabla_x\Phi(t,x))=0,\\ \partial_t\Phi(t,x)+\frac{1}{2}\|\nabla_x\Phi(t,x)\|^2=0,\end{cases}$$
where $p(0,x)=p(x)$ and $\partial_t p(0,x)=\sigma(x)=-\nabla_x\cdot(p(x)\nabla_x\Phi(x))$. We only need to show that
$$\mathrm{Hess}_T\mathcal{F}(p)(\sigma,\sigma)=\frac{d^2}{dt^2}\mathcal{F}(p(t,\cdot))\Big|_{t=0}.$$
The proof follows by a direct calculation. Denote $k(p)=f'(p)p-f(p)$. Then the first-order derivative of $\mathcal{F}$ w.r.t. $t$ forms
$$\begin{aligned}
\frac{d}{dt}\mathcal{F}(p(t,\cdot))&=-\int_\Omega\nabla_x\cdot(p(t,x)\nabla_x\Phi(t,x))\,f'(p(t,x))\,dx\\
&=\int_\Omega\nabla_x\Phi(t,x)\,\nabla_x f'(p(t,x))\,p(t,x)\,dx\\
&=\int_\Omega\nabla_x\Phi(t,x)\,\nabla_x k(p(t,x))\,dx\\
&=-\int_\Omega\Delta_x\Phi(t,x)\,k(p(t,x))\,dx,
\end{aligned}$$
where the first and last equalities use integration by parts and the second equality applies the fact that $k'(p)=f''(p)p$. The second-order derivative of $\mathcal{F}$ w.r.t. $t$ satisfies
$$\begin{aligned}
\frac{d^2}{dt^2}\mathcal{F}(p(t,\cdot))\Big|_{t=0}&=-\int_\Omega\Big\{\Delta_x\partial_t\Phi(t,x)\,k(p(t,x))+\Delta_x\Phi(t,x)\,k'(p(t,x))\,\partial_t p(t,x)\Big\}\,dx\,\Big|_{t=0}\\
&=\int_\Omega\Big\{\frac{1}{2}\Delta_x\|\nabla_x\Phi(x)\|^2\,k(p(x))+\Delta_x\Phi(x)\,k'(p(x))\,\nabla_x\cdot(p(x)\nabla_x\Phi(x))\Big\}\,dx\\
&=\int_\Omega\Big\{\frac{1}{2}\Delta_x\|\nabla_x\Phi\|^2\,k(p(x))+\Delta_x\Phi(x)\,k'(p(x))\,(\nabla_x p(x),\nabla_x\Phi(x))+(\Delta_x\Phi(x))^2\,k'(p(x))\,p(x)\Big\}\,dx\\
&=\int_\Omega\Big\{\frac{1}{2}\Delta_x\|\nabla_x\Phi\|^2\,k(p(x))+\Delta_x\Phi(x)\,(\nabla_x k(p(x)),\nabla_x\Phi(x))+(\Delta_x\Phi(x))^2\,k'(p(x))\,p(x)\Big\}\,dx.
\end{aligned}$$
We notice that
$$\begin{aligned}
\int_\Omega\Delta_x\Phi(x)\,(\nabla_x k(p(x)),\nabla_x\Phi(x))\,dx&=-\int_\Omega\nabla_x\cdot(\nabla_x\Phi(x)\,\Delta_x\Phi(x))\,k(p(x))\,dx\\
&=-\int_\Omega\Big[(\nabla_x\Delta_x\Phi(x),\nabla_x\Phi(x))+(\Delta_x\Phi(x))^2\Big]\,k(p(x))\,dx.
\end{aligned}$$
Combining the above two formulas and using Bochner's formula
$$\frac{1}{2}\Delta_x\|\nabla_x\Phi(x)\|^2-(\nabla_x\Delta_x\Phi(x),\nabla_x\Phi(x))=\|\nabla_x^2\Phi(x)\|^2,$$
we obtain
$$\frac{d^2}{dt^2}\mathcal{F}(p(t,\cdot))\Big|_{t=0}=\int_\Omega\Big\{\|\nabla_x^2\Phi(x)\|^2\,k(p(x))+(\Delta_x\Phi(x))^2\big(k'(p(x))\,p(x)-k(p(x))\big)\Big\}\,dx.$$
Since $\Omega\subset\mathbb{R}$, we have $\|\nabla_x^2\Phi\|^2=(\Delta_x\Phi)^2$, so the $k(p)$ terms cancel and the right-hand side equals $\int_\Omega\|\nabla_x^2\Phi(x)\|^2\,k'(p(x))\,p(x)\,dx=\int_\Omega\|\nabla_x^2\Phi(x)\|^2\,f''(p(x))\,p(x)^2\,dx$. This finishes the proof. □

Email address: wuchen@mailbox.sc.edu