TRANSPORT INFORMATION HESSIAN DISTANCES
WUCHEN LI
Abstract.
We formulate closed-form Hessian distances of information entropies in one-dimensional probability density space embedded with the $L^2$-Wasserstein metric.

1. Introduction
Hessian distances of information entropies in probability density space play crucial roles in information theory, with applications in signal and image processing, inverse problems, and AI [1, 3, 4]. One important example is the Hellinger distance, known as the Hessian distance of the negative Boltzmann-Shannon entropy in $L^2$ (Euclidean) space. Nowadays, the Hellinger distance has shown various useful properties in AI inference problems. Recently, the optimal transport distance, a.k.a. the Wasserstein distance, has provided another type of distance function in probability density space [2, 9]. (There are various generalizations of optimal transport distances using different ground costs defined on a sample space; for simplicity of discussion, we focus on the $L^2$-Wasserstein distance.) Unlike the $L^2$ distance, the Wasserstein distance compares probability densities through their pushforward mapping functions. More importantly, it introduces a metric space under which information entropies exhibit convexity properties in terms of mapping functions. These properties have been shown useful in fluid dynamics, inverse problems, and AI. It is also known that the optimal transport distance itself is a Hessian distance, namely the Hessian distance of a second moment functional in Wasserstein space. Natural questions arise. Can we construct Hessian distances of information entropies in Wasserstein space? And what is the Hessian distance of the negative Boltzmann-Shannon entropy in Wasserstein space?
In this paper, we derive closed-form Hessian distances of information entropies in Wasserstein space supported on a one-dimensional sample space. In detail, given a compact set $\Omega\subset\mathbb{R}$, consider a (negative) $f$-entropy
$$\mathcal{F}(p)=\int_\Omega f(p(x))\,dx,$$
where $f\colon\mathbb{R}\rightarrow\mathbb{R}$ is a twice differentiable convex function and $p$ is a given probability density function. We show that the Hessian distance of the $f$-entropy in Wasserstein space between probability density functions $p$ and $q$ satisfies
$$\mathrm{Dist}_H(p,q)=\sqrt{\int_0^1\big\|h(\nabla_y F_p^{-1}(y))-h(\nabla_y F_q^{-1}(y))\big\|^2\,dy},$$
where $h(y)=\int^y\sqrt{f''\big(\frac{1}{z}\big)}\,z^{-\frac{3}{2}}\,dz$, $F_p$, $F_q$ are the cumulative distribution functions (CDFs) of $p$, $q$, respectively, and $F_p^{-1}$, $F_q^{-1}$ are their inverse CDFs. We call $\mathrm{Dist}_H(p,q)$ transport information Hessian distances. Shortly, we show that the proposed distances are constructed from the Jacobi operators of mapping functions between the density functions $p$ and $q$.

This paper is organized as follows. In section 2, we briefly review the Wasserstein space and its Hessian operators for information entropies. In section 3, we derive closed-form solutions of Hessian distances in Wasserstein space. Several analytical examples are presented in section 4.

Key words and phrases: Hessian distance; Optimal transport; Information geometry.

2. Review of Transport information Hessian metric
In this section, we briefly review the Wasserstein space and its induced Hessian metrics for information entropies.
2.1. Wasserstein space.
We recall the definition of the one-dimensional Wasserstein distance [2]. Denote the spatial domain by $\Omega=[0,1]\subset\mathbb{R}$. Consider the space of smooth probability densities
$$\mathcal{P}(\Omega)=\Big\{p(x)\in C^{\infty}(\Omega)\colon\int_\Omega p(x)\,dx=1,\ p(x)\geq 0\Big\}.$$
Given any two probability densities $p$, $q\in\mathcal{P}(\Omega)$, the squared Wasserstein distance in $\Omega$ is defined by
$$\mathrm{Dist}_T(p,q)^2=\int_\Omega\|T(x)-x\|^2\,q(x)\,dx,$$
where $\|\cdot\|$ represents the Euclidean norm and $T$ is a monotone mapping function such that $T_{\#}q=p$, i.e.,
$$p(T(x))\nabla_x T(x)=q(x).$$
Since $\Omega\subset\mathbb{R}$, the mapping function $T$ can be solved analytically. Concretely,
$$T(x)=F_p^{-1}(F_q(x)),$$
where $F_p^{-1}$, $F_q^{-1}$ are the inverse CDFs of the probability densities $p$, $q$, respectively. Equivalently,
$$\mathrm{Dist}_T(p,q)^2=\int_\Omega\|F_p^{-1}(F_q(x))-x\|^2\,q(x)\,dx=\int_0^1\|F_p^{-1}(y)-F_q^{-1}(y)\|^2\,dy,$$
where we apply the change of variables $y=F_q(x)\in[0,1]$ in the second equality.
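For a numerical illustration of the inverse-CDF formula, here is a minimal sketch (assuming NumPy and SciPy; the function name and the Gaussian test case, used formally on $\mathbb{R}$ rather than on $[0,1]$, are our own choices):

```python
import numpy as np
from scipy.stats import norm

def wasserstein_1d_sq(inv_cdf_p, inv_cdf_q, n=10_000):
    """Squared Wasserstein distance via the inverse-CDF formula:
    Dist_T(p, q)^2 = int_0^1 |F_p^{-1}(y) - F_q^{-1}(y)|^2 dy."""
    y = (np.arange(n) + 0.5) / n          # midpoint quadrature on (0, 1)
    return np.mean((inv_cdf_p(y) - inv_cdf_q(y)) ** 2)

# For Gaussians, F^{-1}(y) = mu + sigma * Phi^{-1}(y), and analytically
# Dist_T^2 = (mu_1 - mu_0)^2 + (sigma_1 - sigma_0)^2 = 2 in this test.
print(wasserstein_1d_sq(norm(1.0, 2.0).ppf, norm(0.0, 1.0).ppf))
```

The midpoint rule keeps the quadrature away from the endpoints $y=0,1$, where inverse CDFs of densities with unbounded support diverge.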
There is a metric formulation for the $L^2$-Wasserstein distance. Denote the tangent space of $\mathcal{P}(\Omega)$ at a probability density $p$ by
$$T_p\mathcal{P}(\Omega)=\Big\{\sigma\in C^{\infty}(\Omega)\colon\int_\Omega\sigma(x)\,dx=0\Big\}.$$
The $L^2$-Wasserstein metric refers to the following bilinear form:
$$g_T(p)(\sigma,\sigma)=\int_\Omega(\nabla_x\Phi(x),\nabla_x\Phi(x))\,p(x)\,dx,$$
where $\Phi\in C^{\infty}(\Omega)$ and $\sigma\in T_p\mathcal{P}(\Omega)$ satisfy the elliptic equation
$$\sigma(x)=-\nabla_x\cdot(p(x)\nabla_x\Phi(x)),\qquad(1)$$
with either Neumann or periodic boundary conditions on $\Omega$. The above boundary conditions ensure that $\sigma$ stays in the tangent space of probability density space, i.e. $\int_\Omega\sigma(x)\,dx=0$. As a known fact, the Wasserstein metric can be derived from a Taylor expansion of the Wasserstein distance, i.e.,
$$\mathrm{Dist}_T(p,p+\sigma)^2=g_T(p)(\sigma,\sigma)+o(\|\sigma\|_{L^2}^2),$$
where $\sigma\in T_p\mathcal{P}(\Omega)$ and $\|\cdot\|_{L^2}$ represents the $L^2$ norm. In the literature, $(\mathcal{P}(\Omega),g_T)$ is often called the Wasserstein space.
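In one dimension, equation (1) with Neumann boundary conditions integrates directly to $p(x)\nabla_x\Phi(x)=-\int_0^x\sigma(s)\,ds$, so the metric $g_T(p)(\sigma,\sigma)=\int_\Omega\big(\int_0^x\sigma\big)^2/p\,dx$ can be evaluated without a linear solve. A grid-based sketch (the helper name and test data are our own):

```python
import numpy as np
from scipy.integrate import cumulative_trapezoid, trapezoid

def metric_gT(p, sigma, x):
    """g_T(p)(sigma, sigma) on a 1D grid (Neumann boundary conditions):
    equation (1) gives p * Phi' = -int_0^x sigma ds."""
    flux = cumulative_trapezoid(sigma, x, initial=0.0)   # = -p * Phi'
    return trapezoid(flux**2 / p, x)

x = np.linspace(0.0, 1.0, 2001)
p = np.ones_like(x)                     # uniform density on [0, 1]
sigma = np.cos(2 * np.pi * x)           # tangent vector: integral is zero
print(metric_gT(p, sigma, x))           # analytic value: 1/(8 pi^2)
```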
2.2. Hessian metrics in Wasserstein space.
We next review the Hessian operator of $f$-entropies in Wasserstein space [9]; see a derivation in the appendix. Consider a (negative) $f$-entropy
$$\mathcal{F}(p)=\int_\Omega f(p(x))\,dx,$$
where $f\colon\mathbb{R}\rightarrow\mathbb{R}$ is a one-dimensional twice differentiable convex function. The Hessian operator of the $f$-entropy in Wasserstein space is a bilinear form satisfying
$$\mathrm{Hess}_T\mathcal{F}(p)(\sigma,\sigma)=\int_\Omega\|\nabla_x^2\Phi(x)\|^2\,f''(p(x))\,p(x)^2\,dx,$$
where $f''$ represents the second derivative of the function $f$ and $(\Phi,\sigma)$ satisfies the elliptic equation (1).
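The Hessian operator can be evaluated on a grid in the same manner as the metric above, noting that in one dimension $\nabla_x^2\Phi=\Delta_x\Phi$. A sketch under the same Neumann-boundary assumptions (helper names are ours):

```python
import numpy as np
from scipy.integrate import cumulative_trapezoid, trapezoid

def hess_form(p, sigma, x, f_pp):
    """Hess_T F(p)(sigma, sigma) on a 1D grid: solve (1) for
    Phi' = -(int_0^x sigma ds)/p, differentiate once more, then
    integrate |Phi''|^2 f''(p) p^2 dx."""
    phi_x = -cumulative_trapezoid(sigma, x, initial=0.0) / p
    phi_xx = np.gradient(phi_x, x)
    return trapezoid(phi_xx**2 * f_pp(p) * p**2, x)

x = np.linspace(0.0, 1.0, 2001)
p = np.ones_like(x)
sigma = np.cos(2 * np.pi * x)
# Boltzmann-Shannon: f(p) = p log p, f''(p) = 1/p; analytic value is 1/2.
print(hess_form(p, sigma, x, lambda r: 1.0 / r))
```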
In this paper, we seek closed-form solutions for the transport Hessian distances of $\mathcal{F}$. This amounts to solving an action functional in $(\mathcal{P}(\Omega),\mathrm{Hess}_T\mathcal{F})$.

Definition 1 (Transport information Hessian distance [7]). Define a distance function $\mathrm{Dist}_H\colon\mathcal{P}(\Omega)\times\mathcal{P}(\Omega)\rightarrow\mathbb{R}$ by
$$\mathrm{Dist}_H(p,q)^2=\inf_{p\colon[0,1]\times\Omega\rightarrow\mathbb{R}}\Big\{\int_0^1\mathrm{Hess}_T\mathcal{F}(p)(\partial_t p,\partial_t p)\,dt\ \colon\ p(0,x)=q(x),\ p(1,x)=p(x)\Big\}.\qquad(2)$$
Here the infimum is taken among all smooth density paths $p\colon[0,1]\times\Omega\rightarrow\mathbb{R}$ which connect the initial and terminal time probability density functions $q$, $p\in\mathcal{P}(\Omega)$.

From now on, we call $\mathrm{Dist}_H$ the transport information Hessian distance. Here the terminology "transport" corresponds to the use of Wasserstein space, while the name "information" refers to the use of $f$-entropies.

3. Transport information Hessian distances
In this section, we present the main result of this paper.

3.1. Formulations.
We first derive closed-form solutions for the transport information Hessian distances defined by (2).
Theorem 1.
Denote a one-dimensional function $h\colon\mathbb{R}_{+}\rightarrow\mathbb{R}$ by
$$h(y)=\int^y\sqrt{f''\Big(\frac{1}{z}\Big)}\,z^{-\frac{3}{2}}\,dz.$$
Then the squared transport Hessian distance of the $f$-entropy has the following formulations.

(i) Inverse CDF formulation:
$$\mathrm{Dist}_H(p,q)^2=\int_0^1\big\|h(\nabla_y F_p^{-1}(y))-h(\nabla_y F_q^{-1}(y))\big\|^2\,dy.$$

(ii) Mapping formulation:
$$\mathrm{Dist}_H(p,q)^2=\int_\Omega\Big\|h\Big(\frac{\nabla_x T(x)}{q(x)}\Big)-h\Big(\frac{1}{q(x)}\Big)\Big\|^2\,q(x)\,dx,$$
where $T$ is the mapping function such that $T_{\#}q=p$ and $T(x)=F_p^{-1}(F_q(x))$. Equivalently,
$$\mathrm{Dist}_H(p,q)^2=\int_\Omega\Big\|h\Big(\frac{\nabla_x T^{-1}(x)}{p(x)}\Big)-h\Big(\frac{1}{p(x)}\Big)\Big\|^2\,p(x)\,dx,$$
where $T^{-1}$ is the inverse of the mapping function $T$, such that $(T^{-1})_{\#}p=q$ and $T^{-1}(x)=F_q^{-1}(F_p(x))$.

Proof. We first derive formulation (i). To do so, we consider the following two sets of changes of variables.

Firstly, denote $p_0(x)=q(x)$ and $p_1(x)=p(x)$. Write the variational problem (2) as
$$\mathrm{Dist}_H(p_0,p_1)^2=\inf_{\Phi,\,p}\Big\{\int_0^1\int_\Omega\|\nabla_y^2\Phi(t,y)\|^2 f''(p(t,y))\,p(t,y)^2\,dy\,dt\ \colon\ \partial_t p(t,y)+\nabla_y\cdot(p(t,y)\nabla_y\Phi(t,y))=0,\ \textrm{fixed }p_0,\,p_1\Big\},$$
where the infimum is among all smooth density paths $p\colon[0,1]\times\Omega\rightarrow\mathbb{R}$ satisfying the continuity equation with gradient drift vector fields $\nabla_y\Phi\colon[0,1]\times\Omega\rightarrow\mathbb{R}$. Denote
$$y=T(t,x),\qquad\partial_t T(t,x)=\nabla_y\Phi(t,T(t,x)).$$
Hence
$$\mathrm{Dist}_H(p_0,p_1)^2=\inf_{T}\Big\{\int_0^1\int_\Omega\|\nabla_y\partial_t T(t,x)\|^2 f''(p(t,T(t,x)))\,p(t,T(t,x))^2\,dT(t,x)\,dt\ \colon\ T(t,\cdot)_{\#}p_0=p(t,\cdot)\Big\},$$
where the infimum is taken among all smooth mapping functions $T\colon[0,1]\times\Omega\rightarrow\Omega$ with $T(0,x)=x$ and $T(1,x)=T(x)$. We observe that the above variational problem leads to
$$\begin{aligned}
&\int_0^1\int_\Omega\|\nabla_y\partial_t T(t,x)\|^2 f''(p(t,T(t,x)))\,p(t,T(t,x))^2\,\nabla_x T(t,x)\,dx\,dt\\
=&\int_0^1\int_\Omega\Big\|\nabla_x\partial_t T(t,x)\,\frac{dx}{dy}\Big\|^2 f''(p(t,T(t,x)))\,p(t,T(t,x))^2\,\nabla_x T(t,x)\,dx\,dt\\
=&\int_0^1\int_\Omega\Big\|\nabla_x\partial_t T(t,x)\,\frac{1}{\nabla_x T(t,x)}\Big\|^2 f''\Big(\frac{p(0,x)}{\nabla_x T(t,x)}\Big)\Big(\frac{p(0,x)}{\nabla_x T(t,x)}\Big)^2\,\nabla_x T(t,x)\,dx\,dt\\
=&\int_0^1\int_\Omega\Big\|\partial_t\nabla_x T(t,x)\,\frac{1}{(\nabla_x T(t,x))^{3/2}}\,\sqrt{f''\Big(\frac{q(x)}{\nabla_x T(t,x)}\Big)}\Big\|^2\,q(x)^2\,dx\,dt.\qquad(3)
\end{aligned}$$
Secondly, denote $y=F_q(x)$, where $y\in[0,1]$. Since $T(t,\cdot)_{\#}q=p_t$ with $p_t:=p(t,\cdot)$, we have
$$q(x)=\frac{dy}{dx}=\frac{1}{\nabla_y F_q^{-1}(y)},\qquad\nabla_x T(t,x)=\nabla_x F_{p_t}^{-1}(F_q(x))=\nabla_y F_{p_t}^{-1}(y)\,\frac{dy}{dx}=\frac{\nabla_y F_{p_t}^{-1}(y)}{\nabla_y F_q^{-1}(y)}.$$
Under the above change of variables, the variational problem (3) leads to
$$\begin{aligned}
\mathrm{Dist}_H(p_0,p_1)^2&=\inf\Big\{\int_0^1\int_0^1\Big\|\partial_t\nabla_y F_{p_t}^{-1}(y)\,\frac{1}{(\nabla_y F_{p_t}^{-1}(y))^{3/2}}\,\sqrt{f''\Big(\frac{1}{\nabla_y F_{p_t}^{-1}(y)}\Big)}\Big\|^2\,dy\,dt\Big\}\\
&=\inf\Big\{\int_0^1\int_0^1\|\partial_t h(\nabla_y F_{p_t}^{-1}(y))\|^2\,dy\,dt\Big\},\qquad(4)
\end{aligned}$$
where the infimum is taken among all paths $t\mapsto\nabla_y F_{p_t}^{-1}$ with fixed initial and terminal time conditions. Here we apply the fact that
$$\nabla_y h(y)=\sqrt{f''\Big(\frac{1}{y}\Big)}\,y^{-\frac{3}{2}},$$
and treat $\nabla_y F_{p_t}^{-1}$ as an individual variable. By using the Euler-Lagrange equation for $\nabla_y F_{p_t}^{-1}$, we show that the geodesic equation in the transport Hessian metric satisfies
$$\partial_{tt}h(\nabla_y F_{p_t}^{-1}(y))=0.$$
In detail, we have
$$h(\nabla_y F_{p_t}^{-1}(y))=t\,h(\nabla_y F_p^{-1}(y))+(1-t)\,h(\nabla_y F_q^{-1}(y)),$$
and
$$\partial_t h(\nabla_y F_{p_t}^{-1}(y))=h(\nabla_y F_p^{-1}(y))-h(\nabla_y F_q^{-1}(y)).$$
Therefore, variational problem (4) leads to formulation (i).

We next derive formulation (ii). Denote $y=F_q(x)$ and $T(x)=F_p^{-1}(F_q(x))$. By using formulation (i) and the change of variables formula in integration, we have
$$\begin{aligned}
\mathrm{Dist}_H(p_0,p_1)^2&=\int_0^1\|h(\nabla_y F_p^{-1}(y))-h(\nabla_y F_q^{-1}(y))\|^2\,dy\\
&=\int_\Omega\|h(\nabla_y F_p^{-1}(F_q(x)))-h(\nabla_y F_q^{-1}(F_q(x)))\|^2\,dF_q(x)\\
&=\int_\Omega\Big\|h\Big(\nabla_x T(x)\,\frac{dx}{dy}\Big)-h\Big(\frac{dx}{dy}\Big)\Big\|^2\,q(x)\,dx\\
&=\int_\Omega\Big\|h\Big(\frac{\nabla_x T(x)}{q(x)}\Big)-h\Big(\frac{1}{q(x)}\Big)\Big\|^2\,q(x)\,dx.
\end{aligned}$$
This finishes the first part of the proof. Similarly, we can derive the transport information Hessian distance in terms of the inverse mapping function $T^{-1}$. □
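Formulation (i) can be evaluated numerically from quantile functions alone. A minimal sketch (the helper name, grid size, and Gaussian test case are our own illustrative choices):

```python
import numpy as np
from scipy.stats import norm

def dist_H(inv_cdf_p, inv_cdf_q, h, n=20_000):
    """Formulation (i): Dist_H^2 is the squared L^2 norm over y of
    h(d/dy F_p^{-1}(y)) - h(d/dy F_q^{-1}(y))."""
    y = np.linspace(0.0, 1.0, n + 1)[1:-1]   # stay away from the endpoints
    dy = y[1] - y[0]
    diff = h(np.gradient(inv_cdf_p(y), dy)) - h(np.gradient(inv_cdf_q(y), dy))
    return np.sqrt(np.mean(diff**2))

# With h = log (Example 1 in section 4) and two Gaussians, T is affine
# with slope sigma_p / sigma_q, so Dist_H = |log(sigma_p / sigma_q)|.
print(dist_H(norm(0.0, 2.0).ppf, norm(1.0, 1.0).ppf, np.log))  # ~ log 2
```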
Remark 1. We notice that $\mathrm{Dist}_H$ forms a class of distance functions in probability density space. Compared to the classical optimal transport distance, it emphasizes the differences between the Jacobi operators of the mapping functions. In this sense, we call the transport information Hessian distance the "Optimal Jacobi transport distance".

Remark 2. Transport information Hessian distances share similarities with the transport Bregman divergences defined in [8]. Here we remark that transport Hessian distances are symmetric in $p$, $q$, i.e. $\mathrm{Dist}_H(p,q)=\mathrm{Dist}_H(q,p)$, while transport Bregman divergences are often asymmetric in $p$, $q$; see examples in [8].
3.2. Properties.

We next demonstrate that transport Hessian distances satisfy several basic properties.
Proposition 1.
The transport Hessian distance has the following properties.

(i) Nonnegativity: $\mathrm{Dist}_H(p,q)\geq 0$. In addition, $\mathrm{Dist}_H(p,q)=0$ iff $p(x+c)=q(x)$, where $c\in\mathbb{R}$ is a constant.

(ii) Symmetry: $\mathrm{Dist}_H(p,q)=\mathrm{Dist}_H(q,p)$.

(iii) Triangle inequality: For any probability densities $p$, $q$, $r\in\mathcal{P}(\Omega)$, we have
$$\mathrm{Dist}_H(p,r)\leq\mathrm{Dist}_H(p,q)+\mathrm{Dist}_H(q,r).$$

(iv) Hessian metric: The Taylor expansion
$$\mathrm{Dist}_H(p,p+\sigma)^2=\mathrm{Hess}_T\mathcal{F}(p)(\sigma,\sigma)+o(\|\sigma\|_{L^2}^2)$$
holds, where $\sigma\in T_p\mathcal{P}(\Omega)$.

Proof. Properties (ii)-(iv) follow from the construction of the transport Hessian distance. Here we only need to show (i). $\mathrm{Dist}_H(p,q)=0$ implies that
$$\Big\|h\Big(\frac{\nabla_x T(x)}{q(x)}\Big)-h\Big(\frac{1}{q(x)}\Big)\Big\|=0$$
on the support of the density $q$. Notice that $h$ is a monotone function.
Thus $\nabla_x T(x)=1$, and hence $T(x)=x+c$ for some constant $c\in\mathbb{R}$. From the fact that $T_{\#}q=p$, we derive $p(T(x))\nabla_x T(x)=q(x)$. This implies $p(x+c)=q(x)$. □
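The properties can also be sanity-checked numerically with the quantile-based evaluation of formulation (i) from section 3.1. In the Gaussian test below the three pairwise distances are $\log 2$, $\log\frac{3}{2}$, and $\log 3$, so the triangle inequality holds with equality; the checks are illustrative, not part of the proof:

```python
import numpy as np
from scipy.stats import norm

def dist_H(inv_cdf_p, inv_cdf_q, h=np.log, n=20_000):
    # Quantile-based evaluation of formulation (i), as in section 3.1.
    y = np.linspace(0.0, 1.0, n + 1)[1:-1]
    dy = y[1] - y[0]
    diff = h(np.gradient(inv_cdf_p(y), dy)) - h(np.gradient(inv_cdf_q(y), dy))
    return np.sqrt(np.mean(diff**2))

p, q, r = norm(0, 1).ppf, norm(0, 2).ppf, norm(1, 3).ppf
assert abs(dist_H(p, q) - dist_H(q, p)) < 1e-12             # (ii) symmetry
assert dist_H(p, r) <= dist_H(p, q) + dist_H(q, r) + 1e-9   # (iii) triangle
assert dist_H(norm(0, 1).ppf, norm(5, 1).ppf) < 1e-8        # (i) translation
```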
4. Closed-form distances

In this section, we provide several closed-form examples of transport information Hessian distances. From now on, we always denote $y=F_q(x)$ and $T(x)=F_p^{-1}(F_q(x))$.

Example 1 (Boltzmann-Shannon entropy). Let $f(p)=p\log p$, i.e.
$$\mathcal{F}(p)=-\mathcal{H}(p)=\int_\Omega p(x)\log p(x)\,dx.$$
Then $h(y)=\log y$. Hence
$$\mathrm{Dist}_H(p,q)=\sqrt{\int_\Omega\|\log\nabla_x T(x)\|^2\,q(x)\,dx}=\sqrt{\int_0^1\|\log\nabla_y F_p^{-1}(y)-\log\nabla_y F_q^{-1}(y)\|^2\,dy}.\qquad(5)$$

Remark 3. We compare the Hessian distances of the Boltzmann-Shannon entropy defined in $L^2$ space and in Wasserstein space. In $L^2$ space, this Hessian distance is known as the Hellinger distance, where
$$\mathrm{Hellinger}(p,q)=\sqrt{\int_\Omega\big\|\sqrt{p(x)}-\sqrt{q(x)}\big\|^2\,dx}.$$
Here distance (5) is an analog of the Hellinger distance in Wasserstein space.

Example 2 (Quadratic entropy). Let $f(p)=p^2$, then $h(y)=-2\sqrt{2}\,\big(y^{-\frac{1}{2}}-1\big)$. Hence
$$\mathrm{Dist}_H(p,q)=2\sqrt{2}\,\sqrt{\int_\Omega\big\|(\nabla_x T(x))^{-\frac{1}{2}}-1\big\|^2\,q(x)^2\,dx}=2\sqrt{2}\,\sqrt{\int_0^1\big\|(\nabla_y F_p^{-1}(y))^{-\frac{1}{2}}-(\nabla_y F_q^{-1}(y))^{-\frac{1}{2}}\big\|^2\,dy}.$$

Example 3 (Cross entropy). Let $f(p)=-\log p$, then $h(y)=2\big(y^{\frac{1}{2}}-1\big)$. Hence
$$\mathrm{Dist}_H(p,q)=2\sqrt{\int_\Omega\big\|(\nabla_x T(x))^{\frac{1}{2}}-1\big\|^2\,dx}=2\sqrt{\int_0^1\big\|(\nabla_y F_p^{-1}(y))^{\frac{1}{2}}-(\nabla_y F_q^{-1}(y))^{\frac{1}{2}}\big\|^2\,dy}.$$
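As a quick cross-check of Example 3 with densities of our own choosing: take $p$ uniform on $[0,1]$ and $q(x)=2x$, so that $F_q(x)=x^2$, $T(x)=F_p^{-1}(F_q(x))=x^2$, and $\nabla_x T(x)=2x$. The mapping and inverse-CDF formulations then agree numerically:

```python
import numpy as np

n = 200_000
x = (np.arange(n) + 0.5) / n      # midpoint grid on (0, 1)

# Mapping formulation: 2 * sqrt( int_0^1 |(2x)^{1/2} - 1|^2 dx ).
mapping_form = 2.0 * np.sqrt(np.mean((np.sqrt(2.0 * x) - 1.0) ** 2))

# Inverse-CDF formulation: d/dy F_p^{-1}(y) = 1, d/dy F_q^{-1}(y) = 1/(2 sqrt(y)).
y = x
inv_cdf_form = 2.0 * np.sqrt(np.mean((1.0 - 1.0 / np.sqrt(2.0 * np.sqrt(y))) ** 2))

print(mapping_form, inv_cdf_form)  # both approximately 0.676
```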
Example 4. Let $f(p)=\frac{1}{2}p^{-1}$, then $h(y)=y-1$. Hence
$$\mathrm{Dist}_H(p,q)=\sqrt{\int_\Omega\|\nabla_x T(x)-1\|^2\,q(x)^{-1}\,dx}=\sqrt{\int_0^1\|\nabla_y F_p^{-1}(y)-\nabla_y F_q^{-1}(y)\|^2\,dy}.$$

Example 5 ($\gamma$-entropy). Let $f(p)=\frac{1}{(1-\gamma)(2-\gamma)}p^{2-\gamma}$ with $\gamma\neq 1,2$, then $h(y)=\frac{2}{\gamma-1}\big(y^{\frac{\gamma-1}{2}}-1\big)$. Hence
$$\mathrm{Dist}_H(p,q)=\frac{2}{|\gamma-1|}\sqrt{\int_\Omega\big\|(\nabla_x T(x))^{\frac{\gamma-1}{2}}-1\big\|^2\,q(x)^{2-\gamma}\,dx}=\frac{2}{|\gamma-1|}\sqrt{\int_0^1\big\|(\nabla_y F_p^{-1}(y))^{\frac{\gamma-1}{2}}-(\nabla_y F_q^{-1}(y))^{\frac{\gamma-1}{2}}\big\|^2\,dy}.$$
In particular, $\gamma=3$ recovers Example 4, while $\gamma=0$ recovers the quadratic entropy of Example 2 up to a constant multiple.

Acknowledgement: W. Li is supported by a start-up funding from the University of South Carolina.
References

[1] S. Amari. Information Geometry and Its Applications. Springer Publishing Company, Incorporated, 1st edition, 2016.
[2] L. Ambrosio, N. Gigli, and G. Savaré. Gradient Flows in Metric Spaces and in the Space of Probability Measures. Birkhäuser, 2008.
[3] S. Cheng and S. T. Yau. The real Monge-Ampère equation and affine flat structures. Proc. 1980 Beijing Symp. Differ. Geom. and Diff. Eqns., Vol. 1, pp. 339-370, 1982.
[4] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley Series in Telecommunications. Wiley, New York, 1991.
[5] J. D. Lafferty. The density manifold and configuration space quantization. Transactions of the American Mathematical Society, 305(2):699-741, 1988.
[6] W. Li. Transport information geometry: Riemannian calculus on probability simplex. arXiv:1803.06360 [math], 2018.
[7] W. Li. Hessian metric via transport information geometry. arXiv:2003.10526, 2020.
[8] W. Li. Transport information Bregman divergences. arXiv:2101.01162, 2021.
[9] C. Villani. Optimal Transport: Old and New. Number 338 in Grundlehren der Mathematischen Wissenschaften. Springer, Berlin, 2009.
Appendix
For the completeness of this paper, we present the derivation of the Hessian operator of the $f$-entropy in Wasserstein space below.

Proposition 2 (Formula 15.7 in [9]). Denote $\mathcal{F}(p)=\int_\Omega f(p(x))\,dx$ and $\Omega\subset\mathbb{R}$. Then
$$\mathrm{Hess}_T\mathcal{F}(p)(\sigma,\sigma)=\int_\Omega\|\nabla_x^2\Phi(x)\|^2\,f''(p(x))\,p(x)^2\,dx,$$
where $\sigma(x)=-\nabla_x\cdot(p(x)\nabla_x\Phi(x))$.
Proof.
Notice that the geodesics in Wasserstein space satisfy
$$\begin{cases}\partial_t p(t,x)+\nabla_x\cdot(p(t,x)\nabla_x\Phi(t,x))=0,\\ \partial_t\Phi(t,x)+\frac{1}{2}\|\nabla_x\Phi(t,x)\|^2=0,\end{cases}$$
where $p(0,x)=p(x)$ and $\partial_t p(0,x)=\sigma(x)=-\nabla_x\cdot(p(x)\nabla_x\Phi(x))$. We only need to show that
$$\mathrm{Hess}_T\mathcal{F}(p)(\sigma,\sigma)=\frac{d^2}{dt^2}\mathcal{F}(p(t,\cdot))\Big|_{t=0}.$$
The proof follows by a direct calculation. Denote $k(p)=f'(p)p-f(p)$. Then the first-order derivative of $\mathcal{F}$ w.r.t. $t$ forms
$$\begin{aligned}
\frac{d}{dt}\mathcal{F}(p(t,\cdot))&=-\int_\Omega\nabla_x\cdot(p(t,x)\nabla_x\Phi(t,x))\,f'(p(t,x))\,dx\\
&=\int_\Omega\nabla_x\Phi(t,x)\,\nabla_x f'(p(t,x))\,p(t,x)\,dx\\
&=\int_\Omega\nabla_x\Phi(t,x)\,\nabla_x k(p(t,x))\,dx\\
&=-\int_\Omega\Delta_x\Phi(t,x)\,k(p(t,x))\,dx,
\end{aligned}$$
where the first and last equalities use integration by parts and the second equality applies the fact that $k'(p)=f''(p)p$. The second-order derivative of $\mathcal{F}$ w.r.t. $t$ satisfies
$$\begin{aligned}
\frac{d^2}{dt^2}\mathcal{F}(p(t,\cdot))\Big|_{t=0}&=-\int_\Omega\Big\{\Delta_x\partial_t\Phi(t,x)\,k(p(t,x))+\Delta_x\Phi(t,x)\,k'(p(t,x))\,\partial_t p(t,x)\Big\}\,dx\,\Big|_{t=0}\\
&=\int_\Omega\Big\{\frac{1}{2}\Delta_x\|\nabla_x\Phi(x)\|^2\,k(p(x))+\Delta_x\Phi(x)\,k'(p(x))\,\nabla_x\cdot(p(x)\nabla_x\Phi(x))\Big\}\,dx\\
&=\int_\Omega\Big\{\frac{1}{2}\Delta_x\|\nabla_x\Phi\|^2\,k(p(x))+\Delta_x\Phi(x)\,k'(p(x))\,(\nabla_x p(x),\nabla_x\Phi(x))+(\Delta_x\Phi(x))^2\,k'(p(x))\,p(x)\Big\}\,dx\\
&=\int_\Omega\Big\{\frac{1}{2}\Delta_x\|\nabla_x\Phi\|^2\,k(p(x))+\Delta_x\Phi(x)\,(\nabla_x k(p(x)),\nabla_x\Phi(x))+(\Delta_x\Phi(x))^2\,k'(p(x))\,p(x)\Big\}\,dx.
\end{aligned}$$
We notice that
$$\begin{aligned}
\int_\Omega\Delta_x\Phi(x)\,(\nabla_x k(p(x)),\nabla_x\Phi(x))\,dx&=-\int_\Omega\nabla_x\cdot(\nabla_x\Phi(x)\,\Delta_x\Phi(x))\,k(p(x))\,dx\\
&=-\int_\Omega\Big[(\nabla_x\Delta_x\Phi(x),\nabla_x\Phi(x))+(\Delta_x\Phi(x))^2\Big]\,k(p(x))\,dx.
\end{aligned}$$
Combining the above two formulas and using Bochner's formula
$$\frac{1}{2}\Delta_x\|\nabla_x\Phi(x)\|^2-(\nabla_x\Delta_x\Phi(x),\nabla_x\Phi(x))=\|\nabla_x^2\Phi(x)\|^2,$$
we obtain
$$\frac{d^2}{dt^2}\mathcal{F}(p(t,\cdot))\Big|_{t=0}=\int_\Omega\Big\{\|\nabla_x^2\Phi(x)\|^2\,k(p(x))+(\Delta_x\Phi(x))^2\big(k'(p(x))\,p(x)-k(p(x))\big)\Big\}\,dx.$$
Since $\Omega\subset\mathbb{R}$, we have $\|\nabla_x^2\Phi\|^2=(\Delta_x\Phi)^2$, so the $k(p)$ terms cancel and the right-hand side equals $\int_\Omega\|\nabla_x^2\Phi(x)\|^2\,k'(p(x))\,p(x)\,dx=\int_\Omega\|\nabla_x^2\Phi(x)\|^2\,f''(p(x))\,p(x)^2\,dx$. This finishes the proof. □

Email address: wuchen@mailbox.sc.edu