Density estimation and modeling on symmetric spaces
Didong Li, Yulong Lu, Emmanuel Chevallier and David Dunson

Department of Computer Science, Princeton University; Department of Biostatistics, University of California, Los Angeles; Department of Mathematics and Statistics, University of Massachusetts Amherst; Aix-Marseille University, CNRS, Centrale Marseille, Institut Fresnel; Department of Statistical Science and Mathematics, Duke University

Abstract

In many applications, data and/or parameters are supported on non-Euclidean manifolds. It is important to take into account the geometric structure of manifolds in statistical analysis to avoid misleading results. Although there has been considerable focus on simple and specific manifolds, there is a lack of general and easy-to-implement statistical methods for density estimation and modeling on manifolds. In this article, we consider a very broad class of manifolds: non-compact Riemannian symmetric spaces. For this class, we provide a very general mathematical result for easily calculating volume changes of the exponential and logarithm map between the tangent space and the manifold. This allows one to define statistical models on the tangent space, push these models forward onto the manifold, and easily calculate induced distributions by Jacobians. To illustrate the statistical utility of this theoretical result, we provide a general method to construct distributions on symmetric spaces. In particular, we define the log-Gaussian distribution as an analogue of the multivariate Gaussian distribution in Euclidean space. With these new kernels on symmetric spaces, we also consider the problem of density estimation. Our proposed approach can use any existing density estimation approach designed for Euclidean spaces and push it forward to the manifold with an easy-to-calculate adjustment. We provide theorems showing that the induced density estimators on the manifold inherit the statistical optimality properties of the parent Euclidean density estimator; this holds for both frequentist and Bayesian nonparametric methods. We illustrate the theory and practical utility of the proposed approach on the space of positive definite matrices.

Key Words: Convergence rate; Density estimation; Log-Gaussian distribution; Poincaré disk; Positive definite matrices; Riemannian symmetric space; Siegel space; Tangent space.
Introduction

Assume $X_1, \dots, X_n$ are i.i.d. random variables on a Riemannian manifold $(\mathcal{M}, g)$, with $g$ the Riemannian metric, such that the law of $X_1$ has a density $f$. There are various approaches to estimating $f$ from $n$ i.i.d. observations, including histograms, orthogonal series, kernel approaches, and mixture models. Kernel and mixture estimators rely on families of parametric probability distributions on $\mathcal{M}$. This article provides a way to explicitly construct such families on a large class of Riemannian manifolds: non-compact Riemannian symmetric spaces.

On most manifolds, kernel and mixture estimators are more convenient to implement in practice than histograms and orthogonal series. For histograms, the main difficulty is to construct appropriate tessellations of the manifold $(\mathcal{M}, g)$ which guarantee convergence as the number of samples $n$ goes to infinity, and which can be computed with a reasonable algorithmic complexity. Even in simple examples like the hyperbolic space, where there exist numerous regular tessellations, histograms appear to be an impractical solution. Orthogonal series approaches rely on estimating scalar products between the density $f$ and a set of functions, commonly corresponding to the eigenfunctions of the Laplace-Beltrami operator. This strategy enables one to obtain convergence rates adapted to the order of differentiability of the density $f$ (Hendriks, 1990). The method is particularly suited to manifolds on which the spectrum of the Laplacian is countable and its eigenfunctions have explicit expressions. This is the case for the torus and the sphere, but not for non-compact manifolds, such as the hyperbolic space, where the spectrum of the Laplacian is continuous. Estimation of $f$ then requires estimating an uncountable number of coefficients, with no clear approach for producing an accurate approximation that can be practically implemented.

A primary challenge for kernel and mixture models is to construct families of distributions on the manifold $(\mathcal{M}, g)$ whose densities have explicit expressions. A seemingly simple candidate family includes distributions whose densities with respect to the Riemannian measure are normalized indicator functions,

$$\varphi_{p,r} = \frac{1}{v_g(B(p,r))} \mathbf{1}_{B(p,r)}, \tag{1}$$

where $B(p,r)$ is the ball centered at $p$ of radius $r$ and $v_g$ is the Riemannian volume. However, computing the normalization constant is typically non-trivial, as there are no closed-form expressions for the volume of balls on arbitrary manifolds.

Another approach to building probability distributions on $\mathcal{M}$ is to define a probability distribution in a tangent space $T_p\mathcal{M}$ and map it to the manifold. The Riemannian exponential is a natural way to perform this mapping. Pelletier (2005) used such maps to define probability kernels for density estimation. In expressing the density on $\mathcal{M}$, it is necessary to compute the local volume change between $T_p\mathcal{M}$ and $\mathcal{M}$ induced by the exponential map; see Figure 1. A primary contribution of this article is to provide an expression of this volume change when $\mathcal{M}$ is a locally symmetric space.

This volume change appears in various geometry problems and is known to physicists as the van Vleck-Morette determinant (Van Vleck, 1928; Morette, 1951; Kim and Thorne, 1991; Visser, 1993).
An example is gravitational lensing: for a point light source in spacetime, the van Vleck-Morette determinant describes the light flux (Reimberg and Abramo, 2013; Aazami and Werner, 2016). We will see in Section 3 that the local volume change induced by the exponential map is obtained by solving a Jacobi equation, a second-order differential equation whose coefficients are functions of the Riemannian curvature tensor $R$. In physical problems, the solutions of this equation have to be estimated numerically (Visser, 1993). Riemannian symmetric spaces are Riemannian manifolds for which the covariant derivative of the Riemann curvature tensor vanishes everywhere. On these manifolds, the Jacobi equation has constant coefficients and its solutions have explicit expressions. Thus, Riemannian symmetric spaces form a broad class of spaces for which the volume change induced by the exponential can be computed explicitly.

Figure 1: The infinitesimal volume change between the grey areas depends on how the neighboring geodesics deviate or draw together. It is determined by the Riemannian curvature along the geodesic $\exp_p(tv)$.

These spaces play an important role in statistics since they include Euclidean spaces, spheres, orthogonal groups, Grassmannians, hyperbolic spaces, and spaces of symmetric positive definite matrices, to name just a few. Although there has been substantial consideration of specific symmetric spaces on a case-by-case basis, there has not been any general methodology that can be implemented in practice for density estimation on arbitrary symmetric spaces. Some of the specific cases that have been considered include spheres (Hall et al., 1987; Bai et al., 1989; Baldi et al., 2009; Eğecioğlu and Srinivasan, 2000; Healy et al., 1996; Healy Jr et al., 1998), hyperbolic spaces (Huckemann et al., 2010; Chevallier et al., 2015), symmetric positive definite matrices (Kim and Richards, 2011; Said et al., 2017a,b; Chevallier et al., 2017), rotations (Kim, 1998; Hielscher, 2013), Grassmannians (Chikuse, 2002; Turaga et al., 2008) and the Siegel space (Chevallier et al., 2016; Said et al., 2017b), used to model the geometry of Toeplitz-block-Toeplitz covariance matrices (Barbaresco, 2013b). Recently, Said et al. (2017a,b) proposed an approach for defining Gaussian distributions on symmetric spaces with non-positive curvature, with the normalizing constant computable at a reasonable cost.

Our main contributions are to (1) introduce a broad class of log-Gaussian distributions on Riemannian symmetric spaces; (2) provide a general theorem on volume changes in applying the exponential map from a tangent plane to a Riemannian symmetric space; (3) use this result to obtain a simple analytic form for the log-Gaussian density and a corresponding algorithm that can be used practically for kernel density estimation in arbitrary Riemannian symmetric spaces; (4) provide a general theory on how statistical optimality for any density estimator in Euclidean spaces implies statistical optimality for the push-forward density estimator; and (5) apply our theory and methods to a variety of interesting datasets, improving on methods designed for specific manifolds, such as the space of positive definite matrices.

Defining valid and computationally tractable distributions on general manifolds is known to be a hard problem.
In order to address this problem, we start by attempting to define Gaussian-based distributions on manifolds. For manifolds that are embedded in an ambient Euclidean space, a common strategy is to define a Gaussian distribution in the ambient space and then project onto the manifold. For example, a projected normal distribution on a sphere can be obtained by normalizing a Gaussian random vector. Similarly, a Gaussian distribution on the real line can be used to define a wrapped Gaussian distribution on a circle. However, such constructions cannot be generalized to broader classes of manifolds.

An alternative approach is to define a Gaussian-like distribution on a manifold $\mathcal{M}$ by using the geodesic distance $d(x, \tilde{x})$ from the intrinsic mean $\tilde{x}$ in place of the Euclidean distance from the Euclidean mean. This leads to the so-called Riemannian Gaussian (RG) distribution (Said et al., 2017a,b), defined as

$$p(x \mid \tilde{x}, \sigma) \propto \exp\left(-\frac{d(x,\tilde{x})^2}{2\sigma^2}\right), \quad x, \tilde{x} \in \mathcal{M},\ \sigma > 0,$$

where $\sigma$ controls the variance. However, the RG distribution suffers from multiple practical issues: (1) computation is often not feasible outside of toy cases, due to intractability of the geodesic distance and the lack of a closed form for the normalizing constant; (2) RG is not very flexible, due to the isotropic structure with a single variance parameter instead of a matrix $\Sigma$; and (3) sampling from the distribution is complicated and may require numerical approximations. Pennec (2006) proposed similar Gaussian laws on manifolds with anisotropy to overcome (2), but the link between the matrix parameter in the exponential and the covariance of the law is intractable.

There is also an existing literature focusing on Fréchet means or log-Euclidean means on Riemannian manifolds and corresponding central limit theorems. From this point of view, Schwartzman (2016) defined Gaussian laws on PD($m$), called the log-normal distribution, but without further statistical theory, applications, or generalizations to other manifolds.

In this paper, we propose a new class of log-Gaussian distributions, which addresses all three practical issues of the RG distribution. The log-Gaussian distribution is obtained by defining a multivariate Gaussian on a tangent plane $T_p\mathcal{M}$ to the manifold $\mathcal{M}$ at a point $p \in \mathcal{M}$ and then mapping the distribution onto the manifold via the exponential map. The exponential map from $T_p\mathcal{M}$ to $\mathcal{M}$ is denoted by $\exp_p$, with its inverse from $\mathcal{M}$ to $T_p\mathcal{M}$ called the logarithm map or log map, denoted by $\log_p$. The Riemannian metric on $\mathcal{M}$ is denoted by $g$. Figure 1 provides an illustration of this mapping and the associated volume change. For more details about exponential and log maps, see the Supplementary Materials and Carmo (1992).

Through the exponential map, any distribution in Euclidean space can be pushed to the manifold. Let $\nu$ be a measure on the tangent space; its push-forward measure, denoted $\exp_*\nu$, is defined by $(\exp_*\nu)(A) := \nu(\exp^{-1}(A))$ for any $A \in \mathcal{F}_\mathcal{M}$, the sigma field on $\mathcal{M}$. Similarly, any distribution on the manifold can be pulled back to the tangent space through the log map. In this paper, we focus on the log-Gaussian distribution, chosen so that the pullback is multivariate Gaussian. However, this technique applies to any other distribution in Euclidean space.

Definition 1.
Given a fixed point $p \in \mathcal{M}$ (discussed in Section 3), a random variable $X$ on $\mathcal{M}$ is said to be log-Gaussian if $\log_p(X)$ is $d$-dimensional multivariate Gaussian in $T_p\mathcal{M}$. $X$ is said to follow $\mathrm{lG}(\mu, \Sigma)$ if $\log_p(X) \sim N(\mu, \Sigma)$.

The density function of the log-Gaussian distribution admits an analytic form whenever $J$ does, as given by the following theorem.

Theorem 1. The density function of the log-Gaussian distribution $\mathrm{lG}(\mu, \Sigma)$ is

$$f(x) = (2\pi)^{-d/2} \det(\Sigma)^{-1/2}\, J_p^{-1}(x) \exp\left(-\frac{1}{2}(\log_p(x) - \mu)^\top \Sigma^{-1} (\log_p(x) - \mu)\right),$$

where $J_p(x)$ is the Jacobian of the exponential map at $p$, or the volume change; see Figure 1.

The proof is straightforward by the change-of-variables rule; see also Equation (3). In order to obtain an analytic form for the log-Gaussian distribution, one needs to calculate the exponential and log maps as well as the Jacobian for the specific manifold under consideration. For example, when $\mathcal{M} = \mathrm{PD}(m)$, the space of all $m \times m$ positive definite matrices, the exponential and logarithm maps are the matrix exponential and matrix logarithm, and the Jacobian also admits a simple form. We postpone the technical details of the derivation of the exponential map (exp), the log map (log) and the Jacobian ($J$) for general manifolds to the next section, with explicit formulae for specific examples including the space of positive definite matrices and the Poincaré disk.

In contrast to the Riemannian Gaussian and other Gaussian-based distributions on manifolds, the log-Gaussian is an intrinsic distribution on the manifold that does not depend on any embedding, its specification does not require evaluation of geodesic distances, its normalizing constant is available, and sampling is simple. In particular, in order to sample from $\mathrm{lG}(\mu, \Sigma)$, one can first sample vectors from $N(\mu, \Sigma)$ and then push them to the manifold by the exponential map. This procedure is straightforward for broad classes of manifolds; for example, in the space of positive definite matrices $\mathrm{PD}(m)$, we first sample $Z_1, \dots, Z_n \sim N(\mu, \Sigma)$ with $\mu \in \mathbb{R}^d$ and $\Sigma \in \mathrm{PD}(d)$, where $d = m(m+1)/2$, identify each $Z_i$ with a symmetric matrix, and then simply let $X_i = e^{Z_i}$ (matrix exponential) to obtain samples $X_1, \dots, X_n \sim \mathrm{lG}(\mu, \Sigma)$. Exploiting these appealing properties, we propose to use the log-Gaussian as a new kernel for density estimation on manifolds.

In the next section, we discuss the class of manifolds for which the above construction holds, that is, for which the exponential and logarithm maps are globally well defined and the Jacobian admits an analytic form. We will also provide general tools to calculate exp, log and $J$.

Let $(\mathcal{M}, g)$ be a $d$-dimensional Riemannian manifold with Riemannian metric $g$. Let $\mu_g$ be the Riemannian measure and $\exp_p$ the Riemannian exponential at $p \in \mathcal{M}$, denoted exp when there is no ambiguity with the usual exponential function. The Riemannian metric $g$ induces a Lebesgue measure on $T_p\mathcal{M}$, denoted by $\nu_g$. The volume change described in Figure 1 can be expressed as

$$J_p(x) = \frac{d\mu_g}{d\exp_{p*}(\nu_g)}(x), \quad x \in \mathcal{M}. \tag{2}$$

For any measure $\nu$ on $T_p\mathcal{M}$ dominating $\nu_g$ with density function $f = d\nu/d\nu_g$, the density function of the push-forward measure $\exp_{p*}(\nu)$ with respect to the Riemannian measure $\mu_g$ is given by

$$\frac{d\exp_{p*}(\nu)}{d\mu_g}(x) = J_p^{-1}(x)\, f(\exp_p^{-1}(x)), \quad x \in \mathcal{M}. \tag{3}$$

For simplicity, we also denote the push-forward density function by $\exp_*f$, that is, $(\exp_{p*}f)(x) = J_p^{-1}(x) f(\exp_p^{-1}(x))$. When $J_p(x)$ is known, the measure $\exp_{p*}(\nu_f)$ has an explicit density. It can then be used for kernel density estimation or in a mixture model.
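Before specializing further, here is a minimal Python sketch of the sampling recipe from Section 2 for PD($m$). The helper `sym_from_vec` and its normalization are our own conventions for identifying $\mathbb{R}^d$ with Sym($m$), not something fixed by the paper:

```python
import numpy as np
from scipy.linalg import expm

def sym_from_vec(z, m):
    """Identify a vector z in R^d, d = m(m+1)/2, with a symmetric m x m matrix.

    Off-diagonal coordinates are divided by sqrt(2) so the map is an isometry
    for the Frobenius inner product; any fixed basis choice would also work."""
    X = np.zeros((m, m))
    iu = np.triu_indices(m)
    X[iu] = np.where(iu[0] == iu[1], z, z / np.sqrt(2.0))
    return X + X.T - np.diag(np.diag(X))

def sample_log_gaussian(mu, Sigma, n, m, seed=None):
    """Draw n samples from lG(mu, Sigma) on PD(m): sample Z ~ N(mu, Sigma)
    in the tangent space and push forward by the matrix exponential."""
    rng = np.random.default_rng(seed)
    Z = rng.multivariate_normal(mu, Sigma, size=n)
    return np.stack([expm(sym_from_vec(z, m)) for z in Z])

m = 2
d = m * (m + 1) // 2
X = sample_log_gaussian(np.zeros(d), np.eye(d), n=5, m=m)  # five draws in PD(2)
```

Evaluating the density of a draw additionally requires the Jacobian $J$, which is derived next.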
From now on, we always assume:

(A) $\mathcal{M}$ has non-positive sectional curvature.

(B) $\mathcal{M}$ is locally symmetric, that is, $\nabla R = 0$, where $\nabla$ is the Levi-Civita connection induced by the Riemannian metric $g$ and $R$ is the curvature tensor.

From the theory of symmetric spaces, any locally symmetric Riemannian space with non-positive sectional curvature is of non-compact type: the Cartan-Hadamard theorem shows that the space is simply connected, and hence $\mathcal{M}$ is a non-compact Riemannian symmetric space. Throughout the remainder of the article, we assume that $\mathcal{M}$ satisfies assumptions (A) and (B), or equivalently, that $\mathcal{M}$ is a non-compact Riemannian symmetric space.

Any Riemannian symmetric space can be written as $\mathcal{M} = G/K$, where $G$ is the isometry group of $\mathcal{M}$, with Lie algebra $\mathfrak{g}$, and $K$ is the isotropy subgroup of $G$, with Lie algebra $\mathfrak{k}$. The Lie algebra decomposes as $\mathfrak{g} = \mathfrak{k} \oplus \mathfrak{m}$, where $\exp(\mathfrak{m})$ is diffeomorphic to $\mathcal{M}$ and to $G/K$. After identifying $\mathcal{M}$ with $\exp(\mathfrak{m})$, the symmetric space can be seen as a submanifold of $G$; see Theorem 11 in the Supplementary Materials. Under this identification, the group exponential coincides with the Riemannian exponential at the identity element $e$. This is a key property that leads to a variety of the powerful tools used in this article. Assumption (A) ensures the global existence of $\exp = \exp_e$ and $\log = \log_e$, while assumption (B) implies a closed form for the Jacobian $J = J_e$. In the remaining sections, we fix the base point at $p = e$ for two reasons. First, at $e$, the maps exp and log and the Jacobian $J$ admit simpler forms than at any other point. Second, the performance of our method does not depend on the specific choice of the base point, since the manifold is symmetric.

We now describe how to compute $J$ when $\mathcal{M}$ is locally symmetric. For $u \in T_e\mathcal{M}$, let $R_u : T_e\mathcal{M} \to T_e\mathcal{M}$ be the self-adjoint endomorphism of $T_e\mathcal{M}$ given by $R_u(v) = R(u,v)u$, where $R$ is the curvature tensor, and denote its $i$-th eigenvalue by $\lambda_i(R_u)$. Similarly, the adjoint representation of the Lie algebra $\mathfrak{g}$ is given by $\mathrm{ad}_u : T_e\mathcal{M} \to T_e\mathcal{M}$, $v \mapsto \mathrm{ad}_u(v) = [u,v]$; denote the $i$-th eigenvalue of $\mathrm{ad}_u$ by $\alpha_i(u)$. The Jacobian, or volume change, is then given by the following theorem.

Theorem 2. If $\mathcal{M}$ is (locally) symmetric of non-compact type, the Jacobian of the exponential map at $x \in \mathcal{M}$ is

$$J(x) = J_e(x) = \prod_{i=1}^{d} \frac{\sinh\big(\sqrt{\lambda_i(R_{\log x})}\big)}{\sqrt{\lambda_i(R_{\log x})}} = \prod_{i=1}^{d} \frac{\sinh(|\alpha_i(\log x)|)}{|\alpha_i(\log x)|},$$

where factors corresponding to vanishing eigenvalues are understood as 1.

The above theorem gives two explicit formulae for the Jacobian, and hence for the density function of the measure induced by the exponential/logarithm map, enhancing the flexibility of calculation. We will apply these formulae throughout the remaining article. The proof of Theorem 2 requires a number of concepts and results from Lie group theory and differential geometry; it is given in the appendix, and more technical details can be found in the Supplementary Materials and in Kobayashi and Nomizu (1963) and Helgason (1979). Below we discuss several examples of non-compact Riemannian symmetric spaces and provide the formulae for the exp and log maps as well as the Jacobian. These pieces allow one to define explicit analytic expressions for log-Gaussian densities, kernel mixtures and kernel density estimators.
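In code, the second expression in Theorem 2 reduces to a one-line product once the eigenvalues $\alpha_i(\log x)$ are available. The sketch below is ours and simply implements the product, assuming the caller supplies the eigenvalues:

```python
import numpy as np

def jacobian_from_alphas(alphas):
    """Volume change of Theorem 2: J = prod_i sinh(|a_i|) / |a_i|,
    with the convention that a factor equals 1 when a_i = 0."""
    a = np.abs(np.asarray(alphas, dtype=float))
    factors = np.ones_like(a)
    nz = a > 0
    factors[nz] = np.sinh(a[nz]) / a[nz]
    return factors.prod()

# On the hyperbolic plane, ad_{log x} has one zero eigenvalue (the radial
# direction) and one of size r = d(e, x), recovering J(x) = sinh(r)/r.
print(jacobian_from_alphas([0.0, 1.5]))   # = sinh(1.5)/1.5
```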
Examples of locally symmetric spaces

Positive definite matrices. The space of all positive definite matrices arises in a variety of applications. One type of problem occurs when the parameters of interest are positive definite matrices, such as covariance matrices. Another popular area is network and graph analysis, where the graph Laplacian, a positive definite matrix, is of key interest in inferring properties of graphs. Also, in a number of areas, data take the form of positive definite matrices. These data play an important role in many applications including visualization (Moakher and Batchelor, 2006), pedestrian detection (Tuzel et al., 2008), foreground segmentation (Caseiro et al., 2012), texture recognition (Jayasumana et al., 2013), and radar signal processing (Arnaudon et al., 2013, 2011).

Let Sym($m$) and PD($m$) be, respectively, the sets of real symmetric and real symmetric positive definite $m \times m$ matrices. We have $\mathrm{PD}(m) = \mathrm{GL}^+(m)/\mathrm{SO}(m)$, where $\mathrm{GL}^+(m)$ is the space of all $m \times m$ matrices with positive determinant and $\mathrm{SO}(m)$ is the space of all $m \times m$ orthogonal matrices with determinant 1. The identity element is Id, the identity matrix. The tangent space of PD($m$) at the identity is Sym($m$); for any $X \in \mathrm{Sym}(m)$, the exponential map at the identity is given by $\exp(X) = e^X$, the matrix exponential, and for any $\Sigma \in \mathrm{PD}(m)$ the log map is given by the matrix logarithm. The volume change of the exponential is

$$J(\Sigma) = J_{\mathrm{Id}}(\Sigma) = \prod_{i<j} \frac{\sinh\big((\lambda_i - \lambda_j)/2\big)}{(\lambda_i - \lambda_j)/2}, \tag{4}$$

where $\lambda_1, \dots, \lambda_m$ are the eigenvalues of $\log \Sigma$ and factors with $\lambda_i = \lambda_j$ are understood as 1.

The Poincaré disk. The Poincaré disk P(2) is the open unit disk in $\mathbb{R}^2$ equipped with the hyperbolic metric; it can be written as $\mathrm{P}(2) = \mathrm{SO}^+(2,1)/\mathrm{SO}(2)$, where $\mathrm{SO}^+(d,1)$ ($d > 1$) is the generalized special orthogonal group. The identity element is the origin $0 \in \mathbb{R}^2$ and the tangent space at $0$ is $\mathbb{R}^2$. The exponential map is

$$\exp(z) = \tanh\big(\|z\|/2\big)\,\frac{z}{\|z\|}, \quad z \in T_0\mathrm{P}(2),$$

and the log map at the origin is

$$\log(x) = \operatorname{acosh}\left(1 + \frac{2\|x\|^2}{1-\|x\|^2}\right)\frac{x}{\|x\|}.$$

The Jacobian $J$ is given by

$$J(x) = \frac{\sinh(r)}{r}, \quad r = \operatorname{acosh}\left(1 + \frac{2\|x\|^2}{1-\|x\|^2}\right).$$

High-dimensional Poincaré disk. For $d > 2$, the same construction applies to $\mathrm{P}(d) = \mathrm{SO}^+(d,1)/\mathrm{SO}(d)$ with $x, z \in \mathbb{R}^d$: by the radial symmetry of the hyperbolic metric, exp and log take the same form, and the Jacobian becomes $J(x) = (\sinh(r)/r)^{d-1}$.
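A direct transcription of the Poincaré disk formulae above, as a small self-contained sketch (the curvature $-1$ normalization is assumed; function names are ours):

```python
import numpy as np

def p2_exp(z):
    """Exponential map at the origin of the Poincare disk."""
    r = np.linalg.norm(z)
    return np.zeros_like(z) if r == 0 else np.tanh(r / 2.0) * z / r

def p2_log(x):
    """Logarithm map at the origin; inverse of p2_exp."""
    s = np.linalg.norm(x)
    if s == 0:
        return np.zeros_like(x)
    r = np.arccosh(1.0 + 2.0 * s**2 / (1.0 - s**2))  # geodesic distance to 0
    return r * x / s

def p2_jacobian(x):
    """Volume change J(x) = sinh(r)/r at x in the disk."""
    s = np.linalg.norm(x)
    if s == 0:
        return 1.0
    r = np.arccosh(1.0 + 2.0 * s**2 / (1.0 - s**2))
    return np.sinh(r) / r

z = np.array([0.3, -0.4])
assert np.allclose(p2_log(p2_exp(z)), z)  # exp and log invert each other
```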
The Siegel space. The Siegel space plays an important role in many fields including radar processing (Barbaresco, 2011, 2013a,b), image processing (Lenz, 2016) and statistical mechanics (Berezin, 1975). The Siegel upper half-space of degree $m$ (or genus $m$), also called the Siegel space and denoted $\mathcal{S}_m$, is the set of complex $m \times m$ matrices with symmetric real part and positive definite imaginary part (Siegel, 1939):

$$\mathcal{S}_m = \{ Z = X + iY : X \in \mathrm{Sym}(m),\ Y \in \mathrm{PD}(m) \}.$$

The Siegel space can be viewed as a matrix generalization of the upper half-space model of hyperbolic spaces. $\mathcal{S}_m$ is equipped with the Riemannian metric

$$ds^2 = 2\,\mathrm{tr}\big(Y^{-1}\,dZ\,Y^{-1}\,d\bar{Z}\big),$$

making it a Riemannian symmetric space. Similarly to the upper half-space case, there is a disk representation:

$$\mathcal{S}_m \cong D_m = \{ Z \in M_m(\mathbb{C}) : Z^*Z \prec \mathrm{Id} \},$$

where $M_m(\mathbb{C})$ is the set of all $m \times m$ complex matrices, Id is the $m \times m$ identity matrix, $\cdot^*$ is the conjugate transpose and $\prec$ is the (strict) Loewner order: $A \prec B$ if and only if $B - A$ is positive definite. Observe that when $m = 1$, $Z \in \mathbb{C} \cong \mathbb{R}^2$, and the condition becomes $\|Z\| < 1$, which is exactly the Poincaré disk. The identity element is $0$, and the tangent space at $0$ is

$$T_0 D_m = \left\{ X = \begin{pmatrix} A & B \\ -B & A \end{pmatrix} : A, B \in M_m(\mathbb{R}),\ A^\top = -A,\ B^\top = B \right\}.$$

The exponential and logarithm maps are the matrix exponential and logarithm, and the Jacobian $J(Z)$ again takes the product form given by Theorem 2, over the eigenvalues $\alpha_i(\log Z)$ of the adjoint map.

We now turn to the statistical properties of push-forward measures and densities. The exponential map preserves, exactly or up to constants, the standard distances and divergences between measures; this is the key to transferring optimality results from the tangent space to the manifold. Let $\nu_1$ and $\nu_2$ be probability measures on $T_e\mathcal{M}$ that are absolutely continuous with respect to $\nu_g$.

Theorem 3. Under the above notation,

(1) $\mathrm{TV}(\exp_*\nu_1, \exp_*\nu_2) = \mathrm{TV}(\nu_1, \nu_2)$, where TV is the total variation distance.

(2) $d_H(\exp_*\nu_1, \exp_*\nu_2) = d_H(\nu_1, \nu_2)$, where $d_H$ is the Hellinger distance.

(3) $\mathrm{KL}(\exp_*\nu_1 \,\|\, \exp_*\nu_2) = \mathrm{KL}(\nu_1 \,\|\, \nu_2)$.

(4) $d_{L^p}(\exp_*\nu_1, \exp_*\nu_2) \le d_{L^p}(\nu_1, \nu_2)$ for $p \ge 1$.

(5) If we further assume that $\Omega = \mathrm{support}(\exp_*\nu_1) \cup \mathrm{support}(\exp_*\nu_2)$ is compact, then

$$W_p(\nu_1, \nu_2) \le W_p(\exp_*\nu_1, \exp_*\nu_2) \le C\, W_p(\nu_1, \nu_2),$$

where $C = \max(\pi, r_\Omega/s_\Omega)$, with $r_\Omega = \mathrm{diam}(\Omega) = \sup_{x,y\in\Omega} d(x,y)$ the diameter of $\Omega$ and $s_\Omega$ the minimum branch separation (Bernstein et al., 2000).

Assume now that we observe $X_1, \dots, X_n \sim \mu$ on the manifold $\mathcal{M}$, where $\mu$ is an unknown probability measure and $f$ is the corresponding density with respect to the Riemannian measure $\mu_g$. The following algorithm describes a simple approach for estimating $\mu$ and $f$, leveraging algorithms for estimating measures and densities in Euclidean spaces. In the following subsections, we consider kernel density estimation and Bayesian density estimation under the proposed approach.

Algorithm 1: Density and measure estimation on manifolds.
Input: data $X_1, \dots, X_n \sim \mu$ with density function $f$.
Output: estimated density and measure $\hat{f}_n$, $\hat{\mu}_n$.
1. Transform the data to the tangent space: $Z_i = \log(X_i)$.
2. Apply density estimation in the tangent space by any existing method, obtaining the estimated density and measure of the $Z_i$, denoted $\hat{g}_n$ and $\hat{\nu}_n$.
3. Pull the estimated density and measure back to the manifold:

$$\hat{f}_n(x) = (\exp_*\hat{g}_n)(x) = J^{-1}(x)\,\hat{g}_n(\log(x)), \ \forall x \in \mathcal{M}; \quad \hat{\mu}_n(A) = (\exp_*\hat{\nu}_n)(A) = \hat{\nu}_n(\log(A)), \ \forall A \in \mathcal{F}_\mathcal{M}.$$

Kernel density estimation. In Euclidean space, the kernel density estimator takes the form $\hat{g}_n(z) = \frac{1}{n}\sum_{i=1}^n K_h(z - Z_i)$. An issue for data on non-Euclidean manifolds is how to choose an appropriate kernel. We solve this problem using the log-Gaussian kernel defined in Section 2. This leads to the estimator

$$\hat{f}_n(X) = \frac{1}{n} \sum_{i=1}^{n} J^{-1}(X)\, N\big(\log X \mid \log X_i,\ h^2 \mathrm{Id}\big). \tag{5}$$

The convergence rate of this log-Gaussian kernel density estimator is given by the following theorem.

Theorem 4. Let $\mu_0$ be a compactly supported probability measure on $\mathcal{M}$ with density function $f_0$ such that $\|f_0\|_{W^{2,p}(\mathcal{M})} := \big(\sum_{i=0}^{2} \|f_0^{(i)}\|_{L^p}^p\big)^{1/p} \le r$. Let $\hat{f}_n$ be the kernel density estimator defined above. Then there exists a constant $C(r) > 0$ such that

$$\mathbb{E}_{f_0} \|\hat{f}_n - f_0\|_p^p \le C(r)\, n^{-2p/(4+d)}.$$

This is the same convergence rate as that obtained for the KDE with a Gaussian kernel in a $d$-dimensional Euclidean space when the true density satisfies the same norm conditions (Cleanthous et al., 2019, Theorem 3.2).

Bayesian density estimation. Consider Bayesian estimation of a probability density $f$ defined on the manifold $\mathcal{M}$ from i.i.d. samples $X^n = \{X_1, \dots, X_n\} \subset \mathcal{M}$ from $f$. Assume that the unknown density $f$ belongs to some set $\mathcal{F}$. Given a prior distribution $\Pi$ on $\mathcal{F}$, the posterior distribution given $X^n$ is

$$\Pi(f \mid X^n) \propto \prod_{i=1}^{n} f(X_i)\, \Pi(df).$$

We are interested in the contraction rate of the posterior from a frequentist perspective; in particular, assuming that $\{X_i\}_{i=1}^n$ are drawn from some underlying truth $f_0$, we would like to identify the smallest shrinking neighborhood of $f_0$ in which the posterior $\Pi(f \mid X^n)$ concentrates as $n \to \infty$. Specifically, we say that the posterior contracts at rate $\varepsilon_n$ with respect to the Hellinger metric if, for every large constant $M > 0$, as $n \to \infty$,

$$\Pi\big(f : d_H(f, f_0) \ge M\varepsilon_n \mid X^n\big) \to 0 \quad \text{in } P_{f_0}\text{-probability}.$$
There is a rich literature on posterior contraction rates on Euclidean spaces, but very limited consideration of the manifold setting. A notable exception is Castillo et al. (2014), who studied contraction rates of posterior distributions for several nonparametric models defined on a compact manifold, using priors defined by a heat-kernel Gaussian process. However, their results are restricted to compact manifolds without boundary, since they rely heavily on heat kernel estimates, which are only known to be valid on compact manifolds. In this section, we provide posterior contraction results for Bayesian density estimation on a general class of locally symmetric Riemannian manifolds.

For a density $f$ defined on such a $d$-dimensional manifold $\mathcal{M}$, let $\tilde{f} = \log_* f$ be the push-forward of $f$ under the logarithmic transform; $\tilde{f}$ is a density on $\mathbb{R}^d$. Conversely, $f = \exp_* \tilde{f}$. To study frequentist properties of the posterior, we assume that the underlying true density $f_0$ of the samples $X^n$ is supported on a compact set $\Omega \subset \mathcal{M}$. Since both the exponential and logarithmic transformations preserve the topological properties of the space, the push-forward density $\tilde{f}_0$ is supported on the compact set $\tilde{\Omega} = \log(\Omega) \subset \mathbb{R}^d$. Without loss of generality, we may assume that $\tilde{\Omega}$ is strictly contained in the cube $[0, D]^d$ for some $D > 0$.

Prior. Consider the prior distribution $\Pi$ on a set of densities $\mathcal{F}$ supported on the manifold $\mathcal{M}$. Let $\tilde{\Pi}$ be the set of push-forwards of $\Pi$ via the logarithmic map, i.e. $\tilde{\Pi} = \{\log_* f,\ f \in \Pi\}$; note that $\tilde{\Pi}$ is a set of distributions on $\mathbb{R}^d$. Conversely, we can build $\Pi$ from $\tilde{\Pi}$ via the exponential map, i.e. $\Pi = \exp_* \tilde{\Pi}$. In the sequel, we consider the prior $\tilde{\Pi}$ chosen as a truncated version of the log-Gaussian process prior with adaptive bandwidth proposed in van der Vaart et al. (2009). Specifically, $\tilde{\Pi}$ is the distribution of the random function $\tilde{f} \sim \tilde{\Pi}$ defined by

$$\tilde{f}(x) = p_w(x) := \frac{\chi(x)\, e^{w(Ax)}}{\int_{[0,D]^d} \chi(y)\, e^{w(Ay)}\, dy}, \tag{6}$$

where $0 \le \chi \le 1$ is a smooth cut-off function equal to 1 on $\tilde{\Omega}$, and $w$ is a centered Gaussian process on $\mathbb{R}^d$ with smooth trajectories. For simplicity, and to fix ideas, we consider in the present paper only the squared exponential covariance function

$$\mathbb{E}[w(x)w(y)] = \exp(-\|x - y\|^2), \quad \forall x, y \in \mathbb{R}^d.$$

The scaling parameter $A$ in (6) is a positive random variable, independent of $w$, which is introduced to tune the regularity of the sample paths of $w$ and hence that of the density $\tilde{f}$. We assume that $A$ possesses a positive density $h$ satisfying, for some positive constants $C_1, C_2$, non-negative $p, q$, and every sufficiently large $a$,

$$C_1 a^p \exp(-a^d \log^q a) \le h(a) \le C_2 a^p \exp(-a^d \log^q a).$$

Note that the above is satisfied with $q = 0$ if $A^d$ follows a Gamma distribution. The next theorem establishes the contraction rate of the posterior $\Pi$ on $\mathcal{M}$.

Theorem 5. Assume that the true density $f_0 = \exp_* \tilde{f}_0$ is supported on a compact set $\Omega \subset \mathcal{M}$, with $\tilde{f}_0$ supported on $\tilde{\Omega} \subset [0,D]^d$. Assume further that $\tilde{f}_0$ takes the form

$$\tilde{f}_0(x) = \frac{\chi(x)\, e^{w_0(x)}}{\int_{[0,D]^d} \chi(y)\, e^{w_0(y)}\, dy}$$

for some $w_0 \in C^\alpha([0,D]^d)$ with $\alpha > 0$. Then the posterior distribution $\Pi(\cdot \mid X^n)$ corresponding to the prior $\Pi$ contracts at rate

$$\varepsilon_n = n^{-\alpha/(d+2\alpha)} (\log n)^{(\alpha+d)/(\alpha+2d)}.$$
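To make the prior (6) concrete, the following is a one-dimensional sketch of a single draw, under simplifying assumptions of our own: $\chi \equiv 1$ on the whole interval, and $A^d \sim \mathrm{Gamma}(1,1)$ (the $q = 0$ case noted above):

```python
import numpy as np

def sample_prior_density(D=1.0, d=1, ngrid=200, seed=None):
    """One draw from the prior (6) on a grid over [0, D] (d = 1 sketch)."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, D, ngrid)
    A = rng.gamma(shape=1.0, scale=1.0) ** (1.0 / d)   # random bandwidth
    # Covariance of w(Ax): E[w(s)w(t)] = exp(-(s - t)^2) at rescaled points.
    C = np.exp(-np.subtract.outer(A * x, A * x) ** 2)
    w = rng.multivariate_normal(np.zeros(ngrid), C + 1e-10 * np.eye(ngrid))
    p = np.exp(w)                                       # chi = 1 everywhere
    return x, p / np.trapz(p, x)                        # normalize to a density

x, p = sample_prior_density(seed=0)
```

Small $A$ yields slowly varying densities and large $A$ rough ones, which is how the prior adapts to the unknown smoothness $\alpha$.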
The above convergence rate and posterior contraction rate results are meant to give a flavor of the importance of Theorem 3 and of how it can be used to carry convergence rate results for existing density estimators in Euclidean spaces over to a broad class of non-Euclidean manifolds.

Simulation Study

In this section, we present numerical experiments on the space of positive definite matrices with simulated data, focusing on the following four algorithms:

1. KDE with Wishart kernel $W$: $\hat{f}_W(X) = \frac{1}{n}\sum_{i=1}^n W(X \mid X_i/\nu,\ \nu)$, where $\nu > m - 1$ is chosen so that the $i$-th kernel has mean $X_i$.

2. KDE with inverse Wishart kernel $W^{-1}$: $\hat{f}_{W^{-1}}(X) = \frac{1}{n}\sum_{i=1}^n W^{-1}(X \mid \nu X_i,\ \nu + 3)$, where $\nu > 0$; for $2 \times 2$ matrices the scale $\nu X_i$ with $\nu + 3$ degrees of freedom again gives the $i$-th kernel mean $X_i$.

3. KDE with log-Gaussian kernel lG:

$$\hat{f}_{\mathrm{lG}}(X) = \frac{1}{n}\sum_{i=1}^{n} \mathrm{lG}(X \mid \log X_i,\ h^2\mathrm{Id}) = \frac{1}{n} J^{-1}(X) \sum_{i=1}^{n} N(\log X \mid \log X_i,\ h^2\mathrm{Id}),$$

where $J$ is the Jacobian presented in Equation (4) and $h$ is the bandwidth.

4. Mixture of log-Gaussians:

$$\hat{f}_{\mathrm{MlG}}(X) = \sum_{j=1}^{L} \omega_j\, \mathrm{lG}(X \mid \mu_j, \Sigma_j) = J^{-1}(X) \sum_{j=1}^{L} \omega_j\, N(\log X \mid \mu_j, \Sigma_j).$$

The bandwidths are chosen by cross-validation using the log-likelihood of held-out data. Implementing algorithms 1-3 is straightforward, with 3 relying on Algorithm 1. To implement 4, we transform the $X_i$ to the tangent space by the log map and then apply standard algorithms for fitting mixtures of multivariate Gaussians; in particular, we use the Matlab function "fitgmdist", which relies on the Expectation-Maximization (EM) algorithm (Dempster et al., 1977). Given the very poor performance of Wishart and inverse Wishart based KDE, and preliminary results suggesting that mixtures of Wishart or inverse Wishart kernels are far from competitive, we do not consider mixture models using such kernels.
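For concreteness, here is a minimal Python sketch of estimator 3 and of the PD($m$) Jacobian (4) it relies on; the vectorization convention is ours, and the mixture estimator 4 is obtained by replacing the kernel average with a Gaussian mixture fitted to the tangent coordinates (e.g. with sklearn.mixture.GaussianMixture) and dividing by $J(X)$ in the same way:

```python
import numpy as np
from scipy.stats import multivariate_normal

def spd_log(X):
    """Matrix logarithm of a symmetric positive definite matrix."""
    w, V = np.linalg.eigh(X)
    return (V * np.log(w)) @ V.T

def vec_sym(S):
    """Orthonormal coordinates of a symmetric matrix (off-diagonals * sqrt 2)."""
    iu = np.triu_indices(S.shape[0])
    return S[iu] * np.where(iu[0] == iu[1], 1.0, np.sqrt(2.0))

def jacobian_pd(X):
    """Equation (4): J = prod_{i<j} sinh((l_i - l_j)/2) / ((l_i - l_j)/2),
    where l_i are the eigenvalues of log X."""
    lam = np.log(np.linalg.eigvalsh(X))
    J = 1.0
    for i in range(len(lam)):
        for j in range(i + 1, len(lam)):
            t = 0.5 * (lam[i] - lam[j])
            if t != 0.0:
                J *= np.sinh(t) / t
    return J

def lg_kde(X_train, X_query, h):
    """Log-Gaussian KDE, Equation (5)."""
    Z = np.array([vec_sym(spd_log(X)) for X in X_train])
    d = Z.shape[1]
    kern = multivariate_normal(mean=np.zeros(d), cov=h**2 * np.eye(d))
    vals = []
    for X in X_query:
        z = vec_sym(spd_log(X))
        vals.append(kern.pdf(Z - z).mean() / jacobian_pd(X))
    return np.array(vals)
```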
In each experiment we sample 2000 $2 \times 2$ positive definite matrices, evaluating the fitted densities by held-out log-likelihood.

Wishart. We sample $X \sim W(\mathrm{Id}, \nu)$ separately for a grid of degrees of freedom $\nu$ and repeat the experiment 10 times. Figure 2 shows the results averaged across the 10 replicates.

Figure 2: Test data log-likelihood scores for the four algorithms with different $\nu$ in the Wishart model. The proposed log-Gaussian based methods correspond to the blue and magenta lines.

The performance of both the Wishart and inverse Wishart KDE decays as $\nu$ increases; due to the inflexibility of both distributions, they are very poor choices of kernel. We provide a rationale for the poor performance of the Wishart; the inverse Wishart can be ruled out for similar reasons. Recall that the expectation and variance of $X \sim W(V, \nu)$ are $\mathbb{E}[X] = \nu V$ and $\mathrm{var}(X_{ij}) = \nu(v_{ij}^2 + v_{ii}v_{jj})$. In KDE, the expectation of the $i$-th kernel is $X_i$, and its variance is determined by the scale $V = X_i/\nu$. For the Wishart to be well defined, $\nu > m - 1$; hence $V \prec X_i/(m-1)$, so the variance of the $i$-th kernel is bounded above in terms of the entries of $X_i$. When this upper bound is small, such as when $X_i$ is close to the zero matrix, the kernel at $X_i$ is forced to concentrate at $X_i$. If the true density is not high at such locations, poor performance results. In contrast, the bandwidth of the log-Gaussian can be controlled freely.

Inverse Wishart. We sample $X \sim W^{-1}(V, \nu)$ with $V = [100, 30; 30, 10]$, separately for a range of $\nu$, repeating the experiment 10 times. Figure 3 shows the results averaged across the 10 replicates. The result is similar to the Wishart case, except that the performance of the Wishart and inverse Wishart KDE methods improves with the true degrees of freedom. This is because the expectation of $W^{-1}(V, \nu)$ is $V/(\nu - 3)$ for $2 \times 2$ matrices, so smaller $\nu$ corresponds to a larger scale of the samples, implying worse performance under the argument discussed above in the Wishart case.

Figure 3: Test data log-likelihood scores for the four algorithms with different $\nu$ in the inverse Wishart model.

Log-Gaussian. We sample $X \sim \mathrm{lG}(0, \sigma^2 \mathrm{Id})$ separately for a logarithmic grid of $\sigma^2$ values and repeat the experiment 10 times. Figure 4 shows the results averaged over the 10 replicates. The performance of all algorithms decays as $\sigma^2$ increases, as expected, since this reflects a decreasing signal-to-noise ratio. The log-Gaussian based KDE and mixture models dominate the Wishart and inverse Wishart KDE estimators across all true values of the parameters.

Figure 4: Test data log-likelihood scores for the four algorithms with different $\sigma$ in the log-Gaussian model.

Applications

In this section, we apply our proposed approach to data in the space of positive definite matrices, considering two applications to texture classification. The same method can be applied directly to data supported on other non-compact locally symmetric spaces, by simply replacing the exponential map, logarithm map and Jacobian.

Texture classification is a canonical problem in computer vision, focusing on assigning images to predefined texture classes. Important application areas include material science and medical image analysis, both of which are considered in this section. In the texture analysis literature, a texture is considered as a pattern of local variation in image intensity (Tuzel et al., 2006), observed on either gray or RGB scale(s), so an image can be represented by a covariance matrix in the following way:

- Given an image $I$ with $p \times q$ pixels, first define $m$ features at each pixel, $F_{uv} \in \mathbb{R}^m$. Common features include the gray scale, the gradient of the gray scale, and the Hessian of the gray scale; the choice varies across datasets, mainly depending on the complexity of each image dataset (the more complex, the more features).
- Define $X = \mathrm{Cov}(F_{uv})$. For simplicity and lower computational cost, sometimes only features at grid vertices contribute to the covariance matrix, because neighboring pixels tend to contain similar information.

As a result, texture classification is based on the space of positive definite matrices PD($m$), with manifold dimension $d = m(m+1)/2$, where $m$ is the number of features of the image. For each image $I_i$, we first extract features $F_i$ and then calculate $X_i \in \mathrm{PD}(m)$ by the above two steps. We denote the label of image $I_i$ by $y_i \in \{1, \dots, L\}$. Among density-based classifiers, we consider naive Bayes (NB), where all features of $X$, denoted $X^1, \dots, X^d$, are assumed to be mutually independent. Given $X$, the posterior is given by

$$p(y = k \mid X) = \frac{p(y=k)\, p(X \mid y=k)}{p(X)} = \frac{p(y=k)\prod_{j=1}^{d} p(X^j \mid y=k)}{\sum_{l=1}^{L} p(y=l)\prod_{j=1}^{d} p(X^j \mid y=l)},$$

where $p(y = k)$ is the marginal probability of $y = k$ and $p(X^j \mid y = k)$ is the conditional density of the $j$-th feature given the image class. Assuming a 0-1 loss function, class assignments are made according to

$$\hat{y} = \operatorname*{argmax}_{k \in \{1,\dots,L\}} p(y = k \mid X) = \operatorname*{argmax}_{k \in \{1,\dots,L\}} p(y = k) \prod_{j=1}^{d} p(X^j \mid y = k).$$

The above naive Bayes approach has clear conceptual problems, in considering the different elements of the positive definite matrix $X_i$ as independent and separately estimating the density within each class for each element.
The elements of the matrix $X_i$ are constrained by positive definiteness, but the naive Bayes method ignores this constraint and treats each element as having unconstrained support on the real line. As a more reasonable alternative that characterizes dependence among the features and takes into account their non-Euclidean support, we apply the log-Gaussian KDE and mixture modeling approaches described in the previous section to estimate the within-class feature densities. In summary, we consider the following four conditional density estimators:

1. Letting $Z_i = \mathrm{Vec}(X_i)$, the Gaussian naive Bayes (GNB) algorithm sets

$$p(X \mid y = k) = N\big(\mathrm{Vec}(X) \mid \mu_k, \mathrm{diag}\{\sigma^2_{jk}\}\big), \quad \mu_k = \frac{1}{n_k}\sum_{i: y_i = k} Z_i, \quad \sigma^2_{jk} = \frac{1}{n_k}\sum_{i: y_i = k} (Z_i^j - \mu_k^j)^2,$$

where $n_k = |\{i : y_i = k\}|$ and $j = 1, \dots, d$ indexes the coordinates.

2. Applying KDE to the vectorized features, the Gaussian kernel classifier (GKC) sets

$$p(X \mid y = k) = \frac{1}{n_k}\sum_{i: y_i = k} N\big(\mathrm{Vec}(X) \mid Z_i, h^2\mathrm{Id}\big).$$

3. Replacing the Gaussian kernel by the log-Gaussian kernel, we have the log-Gaussian naive Bayes (LGNB) algorithm:

$$p(X \mid y = k) = \mathrm{lG}\big(X \mid \mu_k, \mathrm{diag}\{\sigma^2_{jk}\}\big), \quad \mu_k = \frac{1}{n_k}\sum_{i: y_i = k} \log(X_i), \quad \sigma^2_{jk} = \frac{1}{n_k}\sum_{i: y_i = k} \big(\log(X_i)^j - \mu_k^j\big)^2.$$

4. Similarly, the log-Gaussian kernel classifier (LGKC) is

$$p(X \mid y = k) = \frac{1}{n_k}\sum_{i: y_i = k} \mathrm{lG}\big(X \mid \log(X_i), h^2\mathrm{Id}\big).$$

We do not show results for Wishart and inverse Wishart kernels, because their performance is dramatically worse than for the above methods, for the reasons stated in the Simulation Study. In our analyses, we assume diagonal covariances in the Gaussian and log-Gaussian kernels, for simplicity of computation. To measure classification performance, we consider both the classification accuracy and the Brier score. For each example, we randomly split the data 50-50, repeat 100 times, and present boxplots of both the classification accuracy and the Brier score for the above four algorithms. We use the Matlab toolbox "Classification Learner", where all tuning parameters are chosen by 5-fold cross-validation.
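A sketch of the LGNB classifier (estimator 3) in Python; function names and the tangent vectorization are our own. One useful observation, also reflected in the code, is that the Jacobian factor $J^{-1}(X)$ is common to all classes and therefore cancels in the argmax:

```python
import numpy as np

def spd_log_vec(X):
    """Tangent coordinates of X in PD(m): matrix log, then upper-triangular
    coordinates with off-diagonals scaled by sqrt(2)."""
    w, V = np.linalg.eigh(X)
    L = (V * np.log(w)) @ V.T
    iu = np.triu_indices(L.shape[0])
    return L[iu] * np.where(iu[0] == iu[1], 1.0, np.sqrt(2.0))

def fit_lgnb(X_list, y):
    """Per-class tangent means and per-coordinate variances, plus class priors."""
    Z = np.array([spd_log_vec(X) for X in X_list])
    y = np.asarray(y)
    model = {}
    for k in np.unique(y):
        Zk = Z[y == k]
        model[k] = (np.mean(y == k), Zk.mean(0), Zk.var(0) + 1e-8)
    return model

def predict_lgnb(model, X):
    z = spd_log_vec(X)
    best, best_score = None, -np.inf
    for k, (prior, mu, var) in model.items():
        # Diagonal Gaussian log-likelihood; J^{-1}(X) cancels across classes.
        score = np.log(prior) - 0.5 * np.sum((z - mu)**2 / var + np.log(2*np.pi*var))
        if score > best_score:
            best, best_score = k, score
    return best
```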
Figure 5: Sample images from the Kylberg dataset.

The Kylberg dataset (Kylberg, 2011) contains 28 texture classes of different natural and man-made surfaces, with 160 unique texture patches per class. Each texture patch is represented by a losslessly compressed 8-bit png file with $576 \times 576$ pixels after normalization. Samples from this dataset are shown in Figure 5. Following the convention of Harandi et al. (2014), we resized the images to $128 \times 128$ and extracted five features at each pixel:

$$F_{uv} = \left[ I_{uv},\ \left|\frac{\partial I}{\partial u}\right|,\ \left|\frac{\partial I}{\partial v}\right|,\ \left|\frac{\partial^2 I}{\partial u^2}\right|,\ \left|\frac{\partial^2 I}{\partial v^2}\right| \right]^\top.$$

We then generate the covariance matrix on a $32 \times 32$ grid evenly distributed over the image, resulting in a $5 \times 5$ covariance matrix for each image. That is, we obtain $28 \times 160 = 4480$ samples in PD(5), a symmetric space of dimension 15. Our goal is to classify these images. The accuracy and Brier score boxplots are shown in Figure 6. There is a clear improvement for our log-Gaussian kernel relative to the Gaussian kernel, for both the standard naive Bayes and kernel naive Bayes procedures.

The virus dataset (Kylberg et al., 2012) contains 15 virus classes of various shapes in transmission electron microscopy images. Each image is represented by an 8-bit png file with $41 \times 41$ pixels. Samples from this dataset are shown in Figure 7; the first panel resembles the coronavirus. Since viruses from some classes look very similar, classifying virus images is known to be a very hard problem (Harandi et al., 2014). As a result, more features are needed in addition to the gray scale and its derivatives. We extract 25 features commonly used in this field:

$$F_{uv} = \left[ I_{uv},\ \left|\frac{\partial I}{\partial u}\right|,\ \left|\frac{\partial I}{\partial v}\right|,\ \left|\frac{\partial^2 I}{\partial u^2}\right|,\ \left|\frac{\partial^2 I}{\partial v^2}\right|,\ G^{1,1}_{uv}, \dots, G^{4,5}_{uv} \right]^\top,$$

where $G^{o,s}_{uv}$ is the response of a 2D Gabor wavelet with orientation $o$ and scale $s$ (Lee, 1996), with 4 orientations and 5 scales. We then generate the covariance matrix on a $10 \times 10$ grid evenly distributed over the image, resulting in a $25 \times 25$ positive definite matrix representing each image. That is, we obtain $15 \times 100 = 1500$ samples in PD(25), a symmetric space of dimension 325. Our goal is to classify these images. The accuracy and Brier score boxplots are shown in Figure 8. From the accuracy and Brier scores we can tell that classification on the virus dataset is much harder than on the Kylberg dataset, and that the log-Gaussian kernel again outperforms the Gaussian.

The above results focus on classifiers that rely on estimating the density of the features within each class, motivated by our focus on density estimation on manifolds. However, distance-based classifiers can also benefit from the log transform. In particular, in place of the Euclidean distance between the original features, we use the log-Euclidean distance on PD($m$). We illustrate these gains on the above two datasets in Tables 1 and 2, showing the accuracy of the best distance-based classifiers. There is clear improvement using the log-Euclidean distance on PD($m$) in place of the Frobenius norm in these examples.

Table 2: Accuracy for different classifiers on the Virus dataset.

                 Subspace Discriminant   Linear SVM   Gaussian SVM   Linear Discriminant
Frobenius                 57.1               54.3          51.7             51.9
log-Euclidean             62.1               59.1          57.9             55.6
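The drop-in change underlying Tables 1 and 2 is only the distance function; a short sketch of the replacement distance is below (the helper names are ours):

```python
import numpy as np

def spd_log(X):
    """Matrix logarithm of a symmetric positive definite matrix."""
    w, V = np.linalg.eigh(X)
    return (V * np.log(w)) @ V.T

def log_euclidean_distance(X, Y):
    """d(X, Y) = || log X - log Y ||_F on PD(m); replaces the Frobenius
    distance || X - Y ||_F in any distance-based classifier."""
    return np.linalg.norm(spd_log(X) - spd_log(Y), "fro")
```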
Discussion

In this article, we propose a simple and general class of methods for the analysis of data supported on any one of a very broad class of non-compact manifolds. In particular, by focusing on locally symmetric spaces, we are able to obtain a simple analytic form for the Jacobian of the transformation between the tangent plane and the manifold. This simple form, which is a novel mathematical result in the statistics literature, allows us to go back and forth easily between Euclidean models on the tangent plane and induced models on the manifold.

In addition to the form of the Jacobian, our main results are to introduce a broadly useful class of log-Gaussian distributions for manifold data, while also showing how the log-Gaussian can be used for density estimation and in classification problems. Remarkably, we also show that statistical optimality properties of density estimators on Euclidean spaces are maintained in applying the exponential map to push the density estimator onto the manifold. This is due to the fact that the transformation preserves a number of important distances between densities and measures; we provide novel theory in this respect.

The framework provided in this article makes it considerably easier to fully take into account the manifold structure of data and/or parameters in statistical analyses. This is a significant step forward from current methods, as it tends to be very difficult to define densities and measures on manifolds that are simple enough to work with routinely in statistical analyses. For example, current approaches are often limited to very specific manifolds, involve geodesic distances that are intractable to calculate, and/or include intractable normalizing constants. It will be interesting to build on the framework to define a rich class of models beyond the log-Gaussian distribution; for example, one can consider alternative kernels, log-Gaussian processes on manifolds, and so on.

One limitation is that, although non-compact locally symmetric spaces are a broad class, there are many statistically important manifolds that are not included in this class. We would like to generalize the approach to compact symmetric spaces, such as spheres and Grassmannians. In compact spaces, the logarithm map is not globally well defined, but only defined within the cut locus. Potentially one can rely on multiple log maps and paste together local patches in this case. We would also like to generalize the methods to accommodate spaces with less symmetry; for example, the Stiefel manifold is widely used in statistics but is not locally symmetric. A primary challenge in such cases is obtaining a tractable form for the Jacobian; a promising direction is to develop approximation algorithms and otherwise leverage the results in the current paper.

Acknowledgements

The authors acknowledge support for this research from Office of Naval Research grant N00014-14-1-0245/N00014-16-1-2147.

Appendix: Proof of Theorem 2

We start with the first equation,

$$J(x) = \prod_i \frac{\sinh\big(\sqrt{\lambda_i(R_{\log x})}\big)}{\sqrt{\lambda_i(R_{\log x})}}.$$

For $u = \log x \in T_e\mathcal{M}$, let $\gamma(t) = \exp(tu)$ be the unit-speed geodesic starting from $\gamma(0) = e$ with initial velocity $u_0 = u/\|u\|$. Without loss of generality, we assume $\|u\| = 1$, so $u_0 = u$; otherwise we can rescale $u$, and the eigenvalues rescale accordingly. To simplify notation, let $x = \gamma(1)$. Let $e_1, \dots, e_d$ be an orthonormal basis of $T_e\mathcal{M}$ with $e_d = u$. The volume change can be expressed through the Jacobian determinant of the differential of the exponential,

$$J(t) := J(\gamma(t)) = \det\left(\frac{\partial \exp}{\partial e_1}(te_d), \cdots, \frac{\partial \exp}{\partial e_d}(te_d)\right).$$

Although we are only interested in $J(x) = J(\gamma(1))$, we consider the general $J(t)$ for any $t$, which provides the Jacobian at any point on the geodesic $\gamma$. Let $Y_1, \dots, Y_d$ be the vector fields along $\gamma$ solving the Jacobi equation with $Y_i(0) = 0$, $Y_i'(0) = e_i$, $i = 1, \dots, d$. Since for $t > 0$ we have $Y_i(t) = t\,\frac{\partial \exp}{\partial e_i}(te_d)$, it follows that

$$J(t) = \det\left(\frac{1}{t} Y_1(t), \cdots, \frac{1}{t} Y_d(t)\right); \tag{7}$$

see Eq. (6.93) on page 220 of Willmore (1996). The Jacobi fields along a geodesic $\gamma$ are the solutions of the Jacobi equation,

$$Y'' + R(\gamma', Y)\gamma' = 0.$$

Let $e_i(t)$ be the parallel transport of $e_i$ along the geodesic $\gamma$. Given a vector field $X$ along $\gamma$, let $[X](t)$ be its coordinates in the basis $e_1(t), \dots, e_d(t)$. On a locally symmetric space, $R_{\gamma'}$ has zero derivative, hence $R_{\gamma'(t)} \equiv R_{\gamma'(0)} = R_u$ and the Jacobi equation becomes

$$[Y''] + [R_u][Y] = 0,$$

where $[R_u]$ is the matrix of the self-adjoint endomorphism of $T_e\mathcal{M}$ given by $R_u(v) = R(u,v)u$. We can thus choose the basis $e_1(0), \dots, e_d(0)$ to consist of orthogonal eigenvectors of $R_u$. Let $\lambda_1, \dots, \lambda_d$ be the eigenvalues of $R_u$.
We then obtain

$$Y_i(t) = \frac{\sin(\sqrt{\lambda_i}\, t)}{\sqrt{\lambda_i}}\, e_i(t), \tag{8}$$

where $\sin(\sqrt{\lambda_i} t)/\sqrt{\lambda_i}$ is defined through the power series $\sum_{k=0}^{\infty} (-1)^k \lambda_i^k t^{2k+1}/(2k+1)!$. Note that if $\lambda_i$ is a non-positive eigenvalue, the sine becomes a hyperbolic sine. Since the locally symmetric space $\mathcal{M}$ is non-compact, the eigenvalues $\lambda_i$ are all non-positive here, and combining Equations (7) and (8) we obtain

$$J(t) = J(\gamma(t)) = \det\left(\frac{\sinh(\sqrt{|\lambda_1|}\, t)}{\sqrt{|\lambda_1|}\, t} e_1(t), \ldots, \frac{\sinh(\sqrt{|\lambda_d|}\, t)}{\sqrt{|\lambda_d|}\, t} e_d(t)\right) = \prod_{i=1}^{d} \frac{\sinh(\sqrt{|\lambda_i|}\, t)}{\sqrt{|\lambda_i|}\, t}. \tag{9}$$

In particular, when $t = 1$, $\gamma(1) = x$, so the desired equation holds:

$$J(x) = J(1) = \prod_{i=1}^{d} \frac{\sinh(\sqrt{|\lambda_i|})}{\sqrt{|\lambda_i|}}.$$

To show the second equation, it suffices to show that $|\lambda_i| = \alpha_i^2$, for which we need the following lemma, a standard result in symmetric space theory; see the Supplementary Materials or (Helgason, 1979, Theorem 4.2) for a proof.

Lemma 1. Assume $\mathcal{M}$ is a symmetric space. Then for any $u, v, w \in T_e\mathcal{M}$,

$$R(u, v)w = [w, [u, v]].$$

By the Cartan-Hadamard theorem, a non-compact symmetric space is diffeomorphic to its tangent space at $e$, hence simply connected. Since any simply connected locally symmetric space is symmetric, Lemma 1 applies:

$$R_u(x) = R(u, x)u = [u, [u, x]] = \mathrm{ad}_u(\mathrm{ad}_u(x)) = \mathrm{ad}_u^2(x).$$

Letting $v$ be the eigenvector of $\mathrm{ad}_u$ corresponding to $\alpha_i$, we have

$$R_u(v) = \mathrm{ad}_u(\mathrm{ad}_u(v)) = \mathrm{ad}_u(\alpha_i v) = \alpha_i^2 v.$$

That is, $v$ is also an eigenvector of $R_u$, and the corresponding eigenvalue satisfies $|\lambda_i| = \alpha_i^2$, so that $\sqrt{|\lambda_i|} = |\alpha_i|$, which proves the second expression.

Proof of Theorem 3

We prove the five results one by one.

(1) Let $\mathcal{F}_e$ and $\mathcal{F}_\mathcal{M}$ be the sigma algebras of $T_e\mathcal{M}$ and $\mathcal{M}$. Then

$$\mathrm{TV}(\exp_*\nu_1, \exp_*\nu_2) = \sup_{B \in \mathcal{F}_\mathcal{M}} |\exp_*\nu_1(B) - \exp_*\nu_2(B)| = \sup_{B \in \mathcal{F}_\mathcal{M}} \big|\nu_1(\exp^{-1}(B)) - \nu_2(\exp^{-1}(B))\big| = \sup_{A \in \mathcal{F}_e} |\nu_1(A) - \nu_2(A)| = \mathrm{TV}(\nu_1, \nu_2),$$

where the third equality holds because exp is a bijection between $T_e\mathcal{M}$ and $\mathcal{M}$.

(2) For the Hellinger distance,

$$d_H^2(\exp_*\nu_1, \exp_*\nu_2) = \int_\mathcal{M} \left(\sqrt{\frac{d\exp_*\nu_1}{d\mu_g}} - \sqrt{\frac{d\exp_*\nu_2}{d\mu_g}}\right)^2 d\mu_g = \int_\mathcal{M} \left(\sqrt{\frac{d\exp_*\nu_1}{d\exp_*\nu_g}} - \sqrt{\frac{d\exp_*\nu_2}{d\exp_*\nu_g}}\right)^2 \frac{d\exp_*\nu_g}{d\mu_g}\, d\mu_g = \int_{T_e\mathcal{M}} \left(\sqrt{\frac{d\nu_1}{d\nu_g}} - \sqrt{\frac{d\nu_2}{d\nu_g}}\right)^2 d\nu_g = d_H^2(\nu_1, \nu_2).$$

(3) Similarly,

$$\mathrm{KL}(\exp_*\nu_1 \,\|\, \exp_*\nu_2) = \int_\mathcal{M} \log\left(\frac{d\exp_*\nu_1}{d\exp_*\nu_2}\right) d\exp_*\nu_1 = \int_{T_e\mathcal{M}} \log\left(\frac{d\nu_1}{d\nu_2}\right) d\nu_1 = \mathrm{KL}(\nu_1 \,\|\, \nu_2).$$

(4) Observe that $J \ge 1$, since $\sinh(y) \ge y$ for $y \ge 0$ in Theorem 2. Writing $f_k = d\nu_k/d\nu_g$ and using Equation (3) together with $d\mu_g = J\, d(\exp_*\nu_g)$,

$$d_{L^p}(\exp_*\nu_1, \exp_*\nu_2) = \left(\int_\mathcal{M} \big|J^{-1}(x) f_1(\log x) - J^{-1}(x) f_2(\log x)\big|^p\, d\mu_g\right)^{1/p} = \left(\int_{T_e\mathcal{M}} |f_1 - f_2|^p(u)\, J^{1-p}(\exp u)\, d\nu_g(u)\right)^{1/p} \le \left(\int_{T_e\mathcal{M}} |f_1 - f_2|^p\, d\nu_g\right)^{1/p} = d_{L^p}(\nu_1, \nu_2),$$

since $J^{1-p} \le 1$ for $p \ge 1$.

(5) Observe that

$$\gamma \in \Gamma(\exp_*\nu_1, \exp_*\nu_2) \iff \log_*\gamma \in \Gamma(\nu_1, \nu_2). \tag{10}$$
Let $d_g(\cdot,\cdot)$ be the geodesic distance on $\mathcal{M}$. Then

$$W_p^p(\exp_*\nu_1, \exp_*\nu_2) = \inf_{\gamma \in \Gamma(\exp_*\nu_1, \exp_*\nu_2)} \int_{\mathcal{M}\times\mathcal{M}} d_g(x,y)^p\, d\gamma(x,y) = \inf_{\log_*\gamma \in \Gamma(\nu_1,\nu_2)} \int_{T_e\mathcal{M}\times T_e\mathcal{M}} (\exp^* d_g)(a,b)^p\, d(\log_*\gamma)(a,b) = \inf_{\tilde{\gamma} \in \Gamma(\nu_1,\nu_2)} \int_{T_e\mathcal{M}\times T_e\mathcal{M}} \tilde{d}_g(a,b)^p\, d\tilde{\gamma}(a,b),$$

where $\tilde{d}_g$ is the geodesic distance on $T_e\mathcal{M}$ with respect to the induced Riemannian metric $\tilde{g} = \exp^* g$. Recall that

$$W_p^p(\nu_1, \nu_2) = \inf_{\tilde{\gamma} \in \Gamma(\nu_1,\nu_2)} \int_{T_e\mathcal{M}\times T_e\mathcal{M}} d_E(a,b)^p\, d\tilde{\gamma}(a,b),$$

where $d_E(a,b) = \|a - b\|$ is the Euclidean distance on $T_e\mathcal{M}$. So we only need to relate $\tilde{d}_g$ and $d_E$ on $T_e\mathcal{M}$. For one direction: since the Euclidean distance is the shortest among the geodesic distances under consideration, $d_E(a,b) \le \tilde{d}_g(a,b)$ for any $a, b \in T_e\mathcal{M}$. For the other direction, let $s_\Omega$ be the minimum branch separation and $r_\Omega$ the diameter of $\Omega$, where the finiteness of $s_\Omega$ and $r_\Omega$ results from the compactness of $\Omega$. We split the argument into two cases:

(a) When $d_E(a,b) \ge s_\Omega$, we have $\tilde{d}_g(a,b)/d_E(a,b) \le r_\Omega/s_\Omega$ by definition.

(b) When $d_E(a,b) < s_\Omega$, by Lemma 3 in Bernstein et al. (2000), $\tilde{d}_g(a,b) \le \pi\, d_E(a,b)$.

We conclude that $d_E(a,b) \le \tilde{d}_g(a,b) \le C\, d_E(a,b)$ with $C = \max(\pi, r_\Omega/s_\Omega)$. Combining the above pieces, we have the desired inequalities:

$$W_p^p(\nu_1, \nu_2) = \inf_{\tilde{\gamma}} \int d_E^p\, d\tilde{\gamma} \le \inf_{\tilde{\gamma}} \int \tilde{d}_g^p\, d\tilde{\gamma} = W_p^p(\exp_*\nu_1, \exp_*\nu_2) \le C^p \inf_{\tilde{\gamma}} \int d_E^p\, d\tilde{\gamma} = C^p\, W_p^p(\nu_1, \nu_2).$$

Proof of Theorem 4

By assumption, the support $\Omega = \mathrm{support}(f_0) \subset \mathcal{M}$ is compact. Since log, exp, $J$ and $J^{-1}$ are all smooth, they are all bounded on $\Omega$, as are their derivatives of any order. Let $F_0 = \log_* f_0 = (f_0 \circ \exp)\cdot(J \circ \exp)$, that is, $F_0(u) = (\log_* f_0)(u) = J(\exp(u))\, f_0(\exp(u))$; then $F_0$ is supported on $\tilde{\Omega} = \log(\Omega)$, which is again compact by the smoothness of log. In addition, there exists a constant $C = C(\Omega)$ such that

$$\|F_0\|_{W^{2,p}(\mathbb{R}^d)} \le C\, \|f_0\|_{W^{2,p}(\mathcal{M})} \le C r.$$

Recall the kernel density estimator in $\mathbb{R}^d$:

$$\hat{F}_n(u) = \frac{1}{n}\sum_{i=1}^{n} K_h(u - \log(X_i)) = \log_* \hat{f}_n,$$

which converges to $F_0$ at the known rate (Cleanthous et al., 2019):

$$\mathbb{E}\, \|\hat{F}_n - F_0\|_p^p \le C(r)\, n^{-2p/(4+d)}.$$

Finally, by Theorem 3 (4), we have

$$\mathbb{E}\, \|\hat{f}_n - f_0\|_p^p = \mathbb{E}\, \|\exp_*\hat{F}_n - \exp_*F_0\|_p^p \le \mathbb{E}\, \|\hat{F}_n - F_0\|_p^p \le C(r)\, n^{-2p/(4+d)}.$$

Proof of Theorem 5

The proof relies on the following lemma, an analogue of van der Vaart et al. (2008, Lemma 3.1). The proofs are different due to the cut-off function $\chi$, so we present the details here.

Lemma 2. For any measurable functions $v, w : [0,D]^d \to \mathbb{R}$, we have the following inequalities:

(1) $d_H(p_v, p_w) \le \|v - w\|_\infty\, e^{\|v-w\|_\infty/2}$;

(2) $\mathrm{KL}(p_v \,\|\, p_w) \lesssim \|v - w\|_\infty^2\, e^{\|v-w\|_\infty} (1 + 2\|v - w\|_\infty)$;

(3) letting $V(p \mid q) = \int \log^2(p/q)\, dp$, we have $V(p_v \mid p_w) \lesssim \|v - w\|_\infty^2\, e^{\|v-w\|_\infty} (1 + 2\|v - w\|_\infty)^2$.

Proof.
(1) First recall that

$$d_H(p_v, p_w) = \left\| \frac{\chi^{1/2} e^{v/2}}{\|\chi^{1/2} e^{v/2}\|_2} - \frac{\chi^{1/2} e^{w/2}}{\|\chi^{1/2} e^{w/2}\|_2} \right\|_2.$$

By the triangle inequality,

$$d_H(p_v, p_w) \le \frac{\|\chi^{1/2}(e^{v/2} - e^{w/2})\|_2}{\|\chi^{1/2} e^{w/2}\|_2} + \|\chi^{1/2} e^{v/2}\|_2 \left| \frac{1}{\|\chi^{1/2} e^{v/2}\|_2} - \frac{1}{\|\chi^{1/2} e^{w/2}\|_2} \right| \le \frac{\|\chi^{1/2}(e^{v/2} - e^{w/2})\|_2}{\|\chi^{1/2} e^{w/2}\|_2} + \frac{\big|\|\chi^{1/2} e^{v/2}\|_2 - \|\chi^{1/2} e^{w/2}\|_2\big|}{\|\chi^{1/2} e^{w/2}\|_2} \le \frac{2\|\chi^{1/2}(e^{v/2} - e^{w/2})\|_2}{\|\chi^{1/2} e^{w/2}\|_2}.$$

Observe that $|e^{v/2} - e^{w/2}| \le e^{w/2}\, e^{|v-w|/2}\, |v - w|/2$ for any $v, w \in \mathbb{R}$, so the square of the right-hand side is bounded above by

$$4\cdot \frac{\int \chi e^{w}\, e^{\|v-w\|_\infty}\, \|v - w\|_\infty^2/4}{\int \chi e^{w}} = \|v - w\|_\infty^2\, e^{\|v-w\|_\infty},$$

giving $d_H(p_v, p_w) \le \|v - w\|_\infty\, e^{\|v-w\|_\infty/2}$.

(2) By Ghosal and Van der Vaart (2017, Lemma B.2),

$$\mathrm{KL}(p_v \,\|\, p_w) \lesssim d_H^2(p_v, p_w)\left(1 + \left\|\log \frac{p_v}{p_w}\right\|_\infty\right),$$

so we only need to bound $\|\log(p_v/p_w)\|_\infty$. Let $c_v = \log\big(\int \chi e^{v}\big)$, so that $p_v = \chi e^{v - c_v}$. Observe that $w - \|v - w\|_\infty \le v \le w + \|v - w\|_\infty$; then, by the monotonicity of $c$ and the non-negativity of $\chi$,

$$c_w - \|v - w\|_\infty \le c_v \le c_w + \|v - w\|_\infty.$$

So $\|c_v - c_w\|_\infty \le \|v - w\|_\infty$ and

$$\left\|\log \frac{p_v}{p_w}\right\|_\infty = \|(\log\chi + v - c_v) - (\log\chi + w - c_w)\|_\infty \le 2\|v - w\|_\infty.$$

As a result, $\mathrm{KL}(p_v \,\|\, p_w) \lesssim \|v - w\|_\infty^2\, e^{\|v-w\|_\infty}(1 + 2\|v - w\|_\infty)$.

(3) Again by Ghosal and Van der Vaart (2017, Lemma B.2),

$$V(p_v \mid p_w) \lesssim d_H^2(p_v, p_w)\left(1 + \left\|\log \frac{p_v}{p_w}\right\|_\infty\right)^2.$$

By (2), $\|\log(p_v/p_w)\|_\infty \le 2\|v - w\|_\infty$, so (3) follows.

Now we can prove Theorem 5. According to van der Vaart et al. (2009, Theorem 2.1) and van der Vaart et al. (2008, Theorem 3.1), Lemma 2, together with the regularity assumption on $\tilde{f}_0$, implies that the push-forward posterior measure $\tilde{\Pi}(\cdot \mid X^n)$ on $\mathbb{R}^d$ contracts at the desired rate $\varepsilon_n$:

$$\tilde{\Pi}\big(\tilde{f} : d_H(\tilde{f}, \tilde{f}_0) \ge M\varepsilon_n \mid X^n\big) \to 0,$$

where $\varepsilon_n = n^{-\alpha/(d+2\alpha)} (\log n)^{(\alpha+d)/(\alpha+2d)}$. The invariance of the Hellinger distance under the exponential map (see Theorem 3 (2)) gives the same rate for the posterior $\Pi(\cdot \mid X^n)$ on $\mathcal{M}$:

$$\Pi\big(f : d_H(f, f_0) \ge M\varepsilon_n \mid X^n\big) = \tilde{\Pi}\big(\tilde{f} : d_H(\tilde{f}, \tilde{f}_0) \ge M\varepsilon_n \mid X^n\big) \to 0.$$

References

Aazami, A. B. and Werner, M. C. (2016). The geometry of gravitational lensing magnification. Journal of Geometry and Physics, 100:52-61.

Arnaudon, M., Barbaresco, F., and Yang, L. (2013). Riemannian medians and means with applications to radar signal processing. IEEE Journal of Selected Topics in Signal Processing, 7(4):595-604.

Arnaudon, M., Yang, L., and Barbaresco, F. (2011).
Now we can prove Theorem 5. According to van der Vaart et al. (2009, Theorem 2.1) and van der Vaart et al. (2008, Theorem 3.1), Lemma 2 together with the regularity assumption on $\widetilde f_0$ implies that the push-forward posterior measure $\widetilde\Pi(\cdot \mid X^n)$ on $\mathbb{R}^d$ contracts at the desired rate $\varepsilon_n$:

$$\widetilde\Pi\big(\widetilde f : d_H(\widetilde f, \widetilde f_0) \ge M\varepsilon_n \mid X^n\big) \to 0,$$

where $\varepsilon_n = n^{-\frac{\alpha}{d+2\alpha}} (\log n)^{\frac{4\alpha+d}{4\alpha+2d}}$. The invariance of the Hellinger distance under the exponential map (see Theorem 3 (2)) gives the same rate for the posterior $\Pi(\cdot \mid X^n)$ on $M$:

$$\Pi\big(f : d_H(f, f_0) \ge M\varepsilon_n \mid X^n\big) = \widetilde\Pi\big(\widetilde f : d_H(\widetilde f, \widetilde f_0) \ge M\varepsilon_n \mid X^n\big) \to 0.$$

References

Aazami, A. B. and Werner, M. C. (2016). The geometry of gravitational lensing magnification. Journal of Geometry and Physics, 100:52–61.

Arnaudon, M., Barbaresco, F., and Yang, L. (2013). Riemannian medians and means with applications to radar signal processing. IEEE Journal of Selected Topics in Signal Processing, 7(4):595–604.

Arnaudon, M., Yang, L., and Barbaresco, F. (2011). Stochastic algorithms for computing p-means of probability measures, geometry of radar Toeplitz covariance matrices and applications to HR Doppler processing. In , pages 651–656. IEEE.

Bai, Z., Rao, C. R., and Zhao, L. (1989). Kernel estimators of density function of directional data. In Multivariate Statistics and Probability, pages 24–39. Elsevier.

Baldi, P., Kerkyacharian, G., Marinucci, D., and Picard, D. (2009). Adaptive density estimation for directional data using needlets. The Annals of Statistics, 37(6A):3362–3395.

Barbaresco, F. (2011). Robust statistical radar processing in Fréchet metric space: OS-HDR-CFAR and OS-STAP processing in Siegel homogeneous bounded domains. In , pages 639–644. IEEE.

Barbaresco, F. (2013a). Information geometry manifold of Toeplitz Hermitian positive definite covariance matrices: Mostow/Berger fibration and Berezin quantization of Cartan-Siegel domains. Int. J. Emerg. Trends Signal Process., 1:1–87.

Barbaresco, F. (2013b). Information geometry of covariance matrix: Cartan-Siegel homogeneous bounded domains, Mostow/Berger fibration and Fréchet median. In Matrix Information Geometry, pages 199–255. Springer.

Berezin, F. A. (1975). Quantization in complex symmetric spaces. Mathematics of the USSR-Izvestiya, 9(2):341.

Bernstein, M., De Silva, V., Langford, J. C., and Tenenbaum, J. B. (2000). Graph approximations to geodesics on embedded manifolds. Technical report, Citeseer.

Bhattacharya, A. and Dunson, D. B. (2010). Nonparametric Bayesian density estimation on manifolds with applications to planar shapes. Biometrika, 97(4):851–865.

Bhattacharya, A. and Dunson, D. B. (2012). Strong consistency of nonparametric Bayes density estimation on compact metric spaces with applications to specific manifolds. Annals of the Institute of Statistical Mathematics, 64(4):687–714.

Carmo, M. P. d. (1992). Riemannian Geometry. Birkhäuser.

Caseiro, R., Martins, P., Henriques, J. F., and Batista, J. (2012). A nonparametric Riemannian framework on tensor field with application to foreground segmentation. Pattern Recognition, 45(11):3997–4017.

Castillo, I., Kerkyacharian, G., and Picard, D. (2014). Thomas Bayes' walk on manifolds. Probability Theory and Related Fields, 158(3-4):665–710.

Chen, W. and Li, X. (2004). Introduction to Riemannian Geometry, volume 2. Peking University Press.

Chevallier, E., Barbaresco, F., and Angulo, J. (2015). Probability density estimation on the hyperbolic space applied to radar processing. In International Conference on Geometric Science of Information, pages 753–761. Springer.

Chevallier, E., Forget, T., Barbaresco, F., and Angulo, J. (2016). Kernel density estimation on the Siegel space with an application to radar processing. Entropy, 18(11):396.

Chevallier, E., Kalunga, E., and Angulo, J. (2017). Kernel density estimation on spaces of Gaussian distributions and symmetric positive definite matrices. SIAM Journal on Imaging Sciences, 10(1):191–215.

Chikuse, Y. (2002). Methods of density estimation on the Grassmann manifold. Linear Algebra and its Applications, 354(1-3):85–102.

Cleanthous, G., Georgiadis, A. G., and Porcu, E. (2019). Minimax density estimation on Sobolev spaces with dominating mixed smoothness. arXiv preprint arXiv:1906.06835.

Dai, S., Gan, Z., Cheng, Y., Tao, C., Carin, L., and Liu, J. (2020). APo-VAE: Text generation in hyperbolic space. arXiv preprint arXiv:2005.00054.

Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1–22.
Eğecioğlu, Ö. and Srinivasan, A. (2000). Efficient nonparametric density estimation on the sphere with applications in fluid mechanics. SIAM Journal on Scientific Computing, 22(1):152–176.

Ghosal, S. and Van der Vaart, A. (2017). Fundamentals of Nonparametric Bayesian Inference, volume 44. Cambridge University Press.

Giné, E. and Nickl, R. (2016). Mathematical Foundations of Infinite-Dimensional Statistical Models, volume 40. Cambridge University Press.

Hall, P., Watson, G., and Cabrera, J. (1987). Kernel density estimation with spherical data. Biometrika, 74(4):751–762.

Harandi, M., Salzmann, M., and Porikli, F. (2014). Bregman divergences for infinite dimensional covariance matrices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1003–1010.

Healy, D. M., Kim, P. T., et al. (1996). An empirical Bayes approach to directional data and efficient computation on the sphere. The Annals of Statistics, 24(1):232–254.

Healy Jr, D. M., Hendriks, H., and Kim, P. T. (1998). Spherical deconvolution. Journal of Multivariate Analysis, 67(1):1–22.

Helgason, S. (1979). Differential Geometry, Lie Groups, and Symmetric Spaces. Academic Press.

Hendriks, H. (1990). Nonparametric estimation of a probability density on a Riemannian manifold using Fourier expansions. The Annals of Statistics, pages 832–849.

Hielscher, R. (2013). Kernel density estimation on the rotation group and its application to crystallographic texture analysis. Journal of Multivariate Analysis, 119:119–143.

Huckemann, S. F., Kim, P. T., Koo, J.-Y., Munk, A., et al. (2010). Möbius deconvolution on the hyperbolic plane with application to impedance density estimation. The Annals of Statistics, 38(4):2465–2498.

Jayasumana, S., Hartley, R., Salzmann, M., Li, H., and Harandi, M. (2013). Kernel methods on the Riemannian manifold of symmetric positive definite matrices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 73–80.

Kim, P. T. (1998). Deconvolution density estimation on SO(N). The Annals of Statistics, 26(3):1083–1102.

Kim, P. T. and Richards, D. S. P. (2011). Deconvolution density estimation on the space of positive definite symmetric matrices. In Nonparametric Statistics and Mixture Models: A Festschrift in Honor of Thomas P. Hettmansperger, pages 147–168. World Scientific.

Kim, S.-W. and Thorne, K. S. (1991). Do vacuum fluctuations prevent the creation of closed timelike curves? Physical Review D, 43(12):3929.

Kingma, D. P. and Welling, M. (2019). An introduction to variational autoencoders. arXiv preprint arXiv:1906.02691.

Kobayashi, S. and Nomizu, K. (1963). Foundations of Differential Geometry, volume 1. New York, London.

Kylberg, G. (2011). The Kylberg texture dataset v. 1.0. External report (Blue series) 35, Centre for Image Analysis, Swedish University of Agricultural Sciences and Uppsala University, Uppsala, Sweden.

Kylberg, G., Uppström, M., Hedlund, K.-O., Borgefors, G., and Sintorn, I.-M. (2012). Segmentation of virus particle candidates in transmission electron microscopy images. Journal of Microscopy, 245(2):140–147.

Lee, T. S. (1996). Image representation using 2D Gabor wavelets. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(10):959–971.

Lenz, R. (2016). Siegel descriptors for image processing. IEEE Signal Processing Letters, 23(5):625–628.

Mathieu, E., Le Lan, C., Maddison, C. J., Tomioka, R., and Teh, Y. W. (2019). Continuous hierarchical representations with Poincaré variational auto-encoders. In Advances in Neural Information Processing Systems, pages 12544–12555.
Moakher, M. and Batchelor, P. G. (2006). Symmetric positive-definite matrices: from geometry to applications and visualization. In Visualization and Processing of Tensor Fields, pages 285–298. Springer.

Morette, C. (1951). On the definition and approximation of Feynman's path integrals. Physical Review, 81(5):848.

Ovinnikov, I. (2019). Poincaré Wasserstein autoencoder. arXiv preprint arXiv:1901.01427.

Pelletier, B. (2005). Kernel density estimation on Riemannian manifolds. Statistics & Probability Letters, 73(3):297–304.

Pennec, X. (2006). Intrinsic statistics on Riemannian manifolds: Basic tools for geometric measurements. Journal of Mathematical Imaging and Vision, 25(1):127.

Reimberg, P. H. and Abramo, L. R. (2013). The Jacobi map for gravitational lensing: the role of the exponential map. Classical and Quantum Gravity, 30(6):065020.

Said, S., Bombrun, L., and Berthoumieu, Y. (2014). New Riemannian priors on the univariate normal model. Entropy, 16(7):4015–4031.

Said, S., Bombrun, L., Berthoumieu, Y., and Manton, J. H. (2017a). Riemannian Gaussian distributions on the space of symmetric positive definite matrices. IEEE Transactions on Information Theory, 63(4):2153–2170.

Said, S., Hajri, H., Bombrun, L., and Vemuri, B. C. (2017b). Gaussian distributions on Riemannian symmetric spaces: statistical learning with structured covariance matrices. IEEE Transactions on Information Theory, 64(2):752–772.

Schwartzman, A. (2016). Lognormal distributions and geometric averages of symmetric positive definite matrices. International Statistical Review, 84(3):456–486.

Siegel, C. L. (1939). Einführung in die Theorie der Modulfunktionen n-ten Grades. Mathematische Annalen, 116(1):617–657.

Skopek, O., Ganea, O.-E., and Bécigneul, G. (2019). Mixed-curvature variational autoencoders. arXiv preprint arXiv:1911.08411.

Tsybakov, A. B. (2008). Introduction to Nonparametric Estimation. Springer Science & Business Media.

Turaga, P., Veeraraghavan, A., and Chellappa, R. (2008). Statistical analysis on Stiefel and Grassmann manifolds with applications in computer vision. In , pages 1–8. IEEE.

Tuzel, O., Porikli, F., and Meer, P. (2006). Region covariance: A fast descriptor for detection and classification. In European Conference on Computer Vision, pages 589–600. Springer.

Tuzel, O., Porikli, F., and Meer, P. (2008). Pedestrian detection via classification on Riemannian manifolds. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(10):1713–1727.

van der Vaart, A. W., van Zanten, J. H., et al. (2008). Rates of contraction of posterior distributions based on Gaussian process priors. The Annals of Statistics, 36(3):1435–1463.

van der Vaart, A. W., van Zanten, J. H., et al. (2009). Adaptive Bayesian estimation using a Gaussian random field with inverse Gamma bandwidth. The Annals of Statistics, 37(5B):2655–2675.

Van Vleck, J. H. (1928). The correspondence principle in the statistical interpretation of quantum mechanics. Proceedings of the National Academy of Sciences of the United States of America, 14(2):178.

Visser, M. (1993). van Vleck determinants: Geodesic focusing in Lorentzian spacetimes. Physical Review D, 47(6):2395.

Willmore, T. J. (1996). Riemannian Geometry. Oxford University Press.

A Riemannian symmetric spaces

In this section, we introduce basic concepts related to Riemannian symmetric spaces.
This supplementary material leads to Lemma 1 at the end, which is the key to the proof of Theorem 2. Most of the material in this section is from Chen and Li (2004); readers interested in more details may consult Helgason (1979) and Kobayashi and Nomizu (1963).

A.1 Symmetric spaces

Definition 2. Let $(M, g)$ be an $m$-dimensional Riemannian manifold and $p \in M$. A map $\sigma_p : M \to M$ is called a central symmetry at $p$ if

1. $\sigma_p \in I(M)$, the isometry group of $M$; that is, $\sigma_p$ is a diffeomorphism and $\sigma_p^*(g) = g$;

2. $\sigma_p$ is an involution; that is, $\sigma_p \circ \sigma_p = \mathrm{Id}$;

3. $p$ is an isolated fixed point of $\sigma_p$; that is, there exists a neighborhood $U$ of $p$ such that $p$ is the unique fixed point of $\sigma_p$ in $U$.

Definition 3. $(M, g)$ is said to be symmetric about a point $p$ if a central symmetry $\sigma_p$ exists. $(M, g)$ is said to be a (Riemannian) symmetric space if it is symmetric about every point $p \in M$.

In this section, $(M, g)$ is always assumed to be a Riemannian symmetric space and $\sigma_p$ a symmetry about $p$.

Lemma 3. $(\sigma_p)_{*p} = -\mathrm{Id} : T_pM \to T_pM$.

Corollary 1. $\sigma_p(\exp_p(tv)) = \exp_p(-tv)$.

This corollary says that $\sigma_p$ reverses geodesics through $p$ and is an isometry; $\sigma_p$ is therefore referred to as the geodesic symmetry at $p$.

Theorem 6. Let $R$ be the curvature tensor; then $\nabla R = 0$.

Corollary 2. Let $R$ be the Riemannian curvature tensor; then $R$ is parallel with respect to the Levi-Civita connection.

Definition 4. $(M, g)$ is said to be a locally symmetric space if $R$ is parallel.

Any complete, simply connected locally symmetric space is symmetric. Furthermore, the universal covering space of a complete locally symmetric space is symmetric.

Theorem 7. If $(M, g)$ is a locally symmetric space, then for any $p \in M$ there exist a neighborhood $U$ and a local isometry $\sigma_p : U \to U$ such that $p$ is the unique fixed point of $\sigma_p$ in $U$ and $\sigma_p$ is an involution; $\sigma_p$ is called the local symmetry about $p$.

Theorem 8. A Riemannian symmetric space is complete.

Definition 5. Let $(M, g)$ be a Riemannian manifold. If there exists a Lie transformation group $G \subset I(M)$ acting transitively on $M$, then $M$ is said to be a (Riemannian) homogeneous space.

Proposition 1. A homogeneous space that is symmetric about one point is a symmetric space.

Homogeneous spaces are "less symmetric" than symmetric spaces.

Theorem 9. $I(M)$ acts transitively on $M$, so a symmetric space is homogeneous. Fixing any $p \in M$, define $K := \{\tau \in I(M) \mid \tau(p) = p\} < I(M)$, a subgroup of $I(M)$. $K$ is called the isotropic group of $I(M)$ at $p$, and there exists a bijection between $M$ and $I(M)/K$.

Definition 6. Letting

$$S := \Big\{ W(C, U) = \{\tau \in I(M) \mid \tau(C) \subset U\} \;\Big|\; C \subset M \text{ compact},\ U \subset M \text{ open} \Big\},$$

the topology generated by $S$ is called the compact-open topology on $I(M)$.

The compact-open topology gives $I(M)$ a topological structure, making it a topological group. What is more, it induces a smooth structure, so $I(M)$ admits a natural Lie group structure; see the following theorem.

Theorem 10 (Myers-Steenrod, 1939). The compact-open topology induces a smooth structure so that $I(M)$ is a Lie transformation group acting on $M$. If $M$ is compact, so is $I(M)$.

Corollary 3. For any fixed $p$, the isotropic subgroup $K$ at $p$ is a compact subgroup of $I(M)$. Furthermore, $M$ is diffeomorphic to $I(M)/K$.

The above corollary shows that from the manifold $M$ we can find a Lie transformation group acting on it, and $M$ itself is diffeomorphic to the quotient of this Lie group by a compact subgroup, called the isotropic subgroup.
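For a concrete instance of these definitions, consider $M = \mathrm{PD}(m)$ with the affine-invariant metric, the running example of this paper. For this space it is a standard fact that the central symmetry at $p$ has the explicit form $\sigma_p(q) = p\,q^{-1}\,p$ and that the geodesic through $p$ with initial velocity $V$ is $\gamma(t) = p^{1/2}\exp(t\,p^{-1/2}Vp^{-1/2})\,p^{1/2}$. The sketch below (names and test values ours) checks numerically that $\sigma_p$ is an involution fixing $p$ and that it reverses geodesics, as in Corollary 1.

```python
import numpy as np
from scipy.linalg import expm, sqrtm

# sigma_p(q) = p q^{-1} p on M = PD(m): an involution fixing p that
# reverses geodesics through p (Corollary 1), for the affine-invariant metric.
rng = np.random.default_rng(1)
m = 4
B = rng.standard_normal((m, m))
p = B @ B.T + m * np.eye(m)       # a positive definite base point
V = rng.standard_normal((m, m))
V = (V + V.T) / 2                 # a symmetric tangent direction at p

s = np.real(sqrtm(p))             # p^{1/2}
s_inv = np.linalg.inv(s)
A = s_inv @ V @ s_inv             # p^{-1/2} V p^{-1/2}

def gamma(t):
    """Geodesic through p with initial velocity V."""
    return s @ expm(t * A) @ s

def sigma_p(q):
    """Central symmetry at p."""
    return p @ np.linalg.inv(q) @ p

t0 = 0.7
print(np.allclose(sigma_p(sigma_p(gamma(t0))), gamma(t0)))  # involution
print(np.allclose(sigma_p(p), p))                           # p is fixed
print(np.allclose(sigma_p(gamma(t0)), gamma(-t0)))          # geodesic reversed
```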
The remarkable fact is that this procedure is reversible: starting from a pair of Lie groups, called a Riemannian symmetric pair, we can recover the manifold.

A.2 Riemannian symmetric pair

Theorem 11. Let $G = I_0(M)$ be the component of $I(M)$ containing the identity element $\mathrm{Id}_M$, fix $p \in M$, and let $K$ be the isotropic subgroup at $p$. Then:

1. $G$ is connected, $K$ is a compact subgroup of $G$, and $M$ is diffeomorphic to $G/K$.

2. The map $\sigma : G \to G$, $\sigma(\tau) := \sigma_p \circ \tau \circ \sigma_p$, is an involutive endomorphism.

3. $K_\sigma = \{\tau \in G \mid \sigma(\tau) = \tau\}$ is a closed subgroup of $G$.

4. Denoting the connected component of $K_\sigma$ containing the identity element by $K_\sigma^0$, we have $K_\sigma^0 \subset K \subset K_\sigma$, and $K$ does not contain any nontrivial normal subgroup of $G$.

5. Denote the Lie algebras of $G$ and $K$ by $\mathfrak{g}$ and $\mathfrak{k}$; then $\mathfrak{k} = \{X \in \mathfrak{g} \mid \sigma_{*e}(X) = X\}$. If $\mathfrak{m} := \{X \in \mathfrak{g} \mid \sigma_{*e}(X) = -X\}$, then we have the decomposition $\mathfrak{g} = \mathfrak{k} \oplus \mathfrak{m}$ and the relations $[\mathfrak{k}, \mathfrak{k}] \subset \mathfrak{k}$, $[\mathfrak{m}, \mathfrak{m}] \subset \mathfrak{k}$, $[\mathfrak{k}, \mathfrak{m}] \subset \mathfrak{m}$.

6. Let $\pi : G \to M$ be the canonical projection $\pi(g) = g \cdot p$; then $\pi_{*e}(\mathfrak{k}) = \{0\}$ and $\pi_{*e}(\mathfrak{m}) = T_pM$.

7. For any $X \in \mathfrak{m}$, let $\gamma(t)$ be the geodesic starting at $p$ in the direction $\pi_{*e}(X)$; then $\gamma(t) = \exp(tX) \cdot p$, where $\exp : \mathfrak{g} \to G$ is the exponential map of the Lie group. For any $Y \in T_pM$, $(\exp(tX))_{*p}(Y)$ is the parallel translation of $Y$ along the geodesic $\gamma(t)$.

Remark 1. The above theorem implies that the exponential map of a symmetric space is the "same" as the exponential map of the Lie group, which is often easier to deal with, especially when $G$ is a matrix Lie group: in such a case, the exponential map is nothing but the matrix exponential. We have seen this in Section 3.

The Lie algebra decomposition $\mathfrak{g} = \mathfrak{k} \oplus \mathfrak{m}$ shows that $\mathfrak{m}$ can be identified with the tangent space of $M$, while $\mathfrak{k}$ represents the subgroup $K$ that is quotiented out in $G/K$.

Example (Positive definite matrices). Let $M = \mathrm{PD}(m)$; then $G = \mathrm{GL}^+(m)$ and $K = \mathrm{SO}(m)$. In this case, the Lie algebra decomposition is

$$\mathfrak{g} = M_m(\mathbb{R}) = \mathfrak{so}(m) \oplus \mathrm{Sym}(m),$$

where $\mathfrak{so}(m) = \{A : A^\top = -A\}$ is the Lie algebra of the special orthogonal group $\mathrm{SO}(m)$. This is a well-known fact: any matrix can be written as the sum of a symmetric and an antisymmetric matrix,

$$A = \frac{A + A^\top}{2} + \frac{A - A^\top}{2}.$$
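The decomposition in this example, together with the bracket relations of Theorem 11(5), can be verified numerically in a few lines; in the sketch below (all names ours), $\mathfrak{k} = \mathfrak{so}(m)$ consists of the antisymmetric matrices and $\mathfrak{m} = \mathrm{Sym}(m)$ of the symmetric ones.

```python
import numpy as np

# Check gl(m, R) = so(m) (+) Sym(m) and the bracket relations
# [k, k] c k, [m, m] c k, [k, m] c m from Theorem 11(5).
rng = np.random.default_rng(2)
m = 5

sym = lambda A: (A + A.T) / 2     # projection onto Sym(m), the "m" part
skew = lambda A: (A - A.T) / 2    # projection onto so(m), the "k" part
br = lambda A, B: A @ B - B @ A   # the Lie bracket (matrix commutator)

A = rng.standard_normal((m, m))
assert np.allclose(A, sym(A) + skew(A))            # g = k (+) m

K1, K2 = (skew(rng.standard_normal((m, m))) for _ in range(2))
M1, M2 = (sym(rng.standard_normal((m, m))) for _ in range(2))

assert np.allclose(br(K1, K2), skew(br(K1, K2)))   # [k, k] inside k
assert np.allclose(br(M1, M2), skew(br(M1, M2)))   # [m, m] inside k
assert np.allclose(br(K1, M1), sym(br(K1, M1)))    # [k, m] inside m
print("decomposition and bracket relations verified")
```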
From a symmetric space $(M, g)$, we can thus construct a triple $(G, K, \sigma)$, where $G$ is a Lie group, $\sigma : G \to G$ is an involutive endomorphism, and $K_\sigma^0 \subset K \subset K_\sigma$, where $K_\sigma$ is the fixed-point subgroup of $(G, \sigma)$ and $K_\sigma^0$ is the component of $K_\sigma$ containing the identity. In the above theorem, $G = I_0(M)$ is the component of $I(M)$ containing the identity, $\sigma : \tau \mapsto \sigma_p \circ \tau \circ \sigma_p$ is an involutive endomorphism, and $K$ is the isotropic subgroup at $p$. In fact, this process is invertible; that is, given $(G, K, \sigma)$ satisfying certain conditions, we can construct a symmetric space $(M, g)$ such that $G = I_0(M)$, $\sigma : \tau \mapsto \sigma_p \circ \tau \circ \sigma_p$, and $K$ is the isotropic subgroup at $p$.

Definition 7. Let $G$ be a connected Lie group and $K$ a closed subgroup of $G$. If there exists an involutive endomorphism $\sigma : G \to G$ such that $K_\sigma^0 \subset K \subset K_\sigma$, where $K_\sigma$ is the fixed-point subgroup of $(G, \sigma)$ and $K_\sigma^0$ is the component of $K_\sigma$ containing the identity, then $(G, K, \sigma)$ is called a symmetric pair.

Definition 8. Let $\mathrm{Ad} : G \to \mathrm{GL}(\mathfrak{g})$ be the adjoint representation of the Lie group $G$, where $\mathfrak{g}$ is its Lie algebra. For a symmetric pair $(G, K, \sigma)$, if $\mathrm{Ad}(K)$ is a compact subgroup of $\mathrm{GL}(\mathfrak{g})$, then $(G, K, \sigma)$ is called a Riemannian symmetric pair.

Corollary 4. Assume $(M, g)$ is a symmetric space; let $G = I_0(M)$, let $K$ be the isotropic subgroup at $p \in M$, and let $\sigma : \tau \mapsto \sigma_p \circ \tau \circ \sigma_p$. Then $(G, K, \sigma)$ is a Riemannian symmetric pair.

This is a direct corollary of Theorem 11. The construction in the opposite direction is guaranteed by the following theorem.

Theorem 12. Let $(G, K, \sigma)$ be a Riemannian symmetric pair and let $\pi : G \to G/K$ be the natural projection. Let $p = \pi(e)$, where $e$ is the identity of $G$. Then there exists a $G$-invariant Riemannian metric $Q$ on the homogeneous space $G/K$ such that $(G/K, Q)$ is a symmetric space. In this case, the symmetry at $p$, denoted by $\sigma_p$, satisfies

$$\sigma_p \circ \pi = \pi \circ \sigma,$$

and for any $g \in G$, $\tau(\sigma(g)) = \sigma_p \circ \tau(g) \circ \sigma_p$, where $\tau(g) : G/K \to G/K$ is given by $hK \mapsto ghK$.

This theorem provides the construction in the other direction: from a Riemannian symmetric pair, we can construct a symmetric space associated with it.

A.3 Curvature tensor in Riemannian symmetric spaces

In this section we discuss the special properties of the curvature tensor in Riemannian symmetric spaces, which lead to Lemma 1.

Definition 9. Assume $(M, g)$ is a Riemannian manifold with Riemannian metric $g$, and let $X \in \mathfrak{X}(M)$ be a smooth vector field. For any $p \in U_p \subset M$, let $\varphi_p : (-\varepsilon_p, \varepsilon_p) \times U_p \to M$ be the local one-parameter transformation group generated by $X$. If for any $p \in M$ and any $t \in (-\varepsilon_p, \varepsilon_p)$, $\varphi_p(t, \cdot) : U_p \to M$ is a local isometry, then $X$ is called a Killing vector field.

The Levi-Civita connection $\nabla$ applied to Killing fields has the following properties:

Lemma 4. If $X$ is a Killing field, then for any $Y, Z \in \mathfrak{X}(M)$,

$$g(\nabla_Y X, Z) + g(\nabla_Z X, Y) = 0.$$

Furthermore, if $X, Y, Z$ are all Killing fields, then

$$g(\nabla_X Y, Z) = \frac{1}{2}\big( g([X, Y], Z) + g([Y, Z], X) - g([Z, X], Y) \big).$$

When $(G, K, \sigma)$ is a Riemannian symmetric pair and $M = G/K$ is a symmetric space with Lie algebra decomposition $\mathfrak{g} = \mathfrak{k} \oplus \mathfrak{m}$, the tangent map of the natural projection $\pi : G \to M$ is an epimorphism $\pi_{*e} : \mathfrak{g} = T_eG \to T_{[e]}M$. For any $\xi \in \mathfrak{g}$, let $\widetilde\xi$ be the Killing field corresponding to $\pi_{*e}(\xi)$; in particular, $\pi_{*e}(\xi) = \widetilde\xi([e])$. As a result, $\ker \pi_{*e} = \mathfrak{k}$, so $\pi_{*e}|_{\mathfrak{m}} : \mathfrak{m} \to T_{[e]}M \cong \mathfrak{g}/\mathfrak{k}$ is a linear isomorphism and an isometry. This isometry enables us to carry the curvature tensor on $T_{[e]}M$ over to $\mathfrak{m}$. In addition, due to the symmetry of $M$, its geometry remains the same across the manifold, so we only need to consider the geometry at $[e] = K$, or equivalently, on $T_{[e]}M \cong \mathfrak{m}$. For any $\xi, \eta, \zeta, \lambda \in \mathfrak{m}$, we can define the curvature tensor on $\mathfrak{m}$:

$$g(R(\xi, \eta)\zeta, \lambda) := g\big(R(\widetilde\xi, \widetilde\eta)\widetilde\zeta, \widetilde\lambda\big)([e]), \qquad R(\xi, \eta)\zeta := (\pi_{*e})^{-1} R(\widetilde\xi, \widetilde\eta)\widetilde\zeta.$$

Definition 10. For any $\xi \in \mathfrak{g}$, define $\mathrm{ad}\,\widetilde\xi : \mathfrak{X}(M) \to \mathfrak{X}(M)$ by

$$\mathrm{ad}\,\widetilde\xi(X) := [\widetilde\xi, X], \quad \forall X \in \mathfrak{X}(M).$$

The conjugate of $\mathrm{ad}\,\widetilde\xi$, denoted by $(\mathrm{ad}\,\widetilde\xi)^*$, is given by

$$g\big((\mathrm{ad}\,\widetilde\xi)^*\widetilde\eta, X\big) = g\big(\widetilde\eta, \mathrm{ad}\,\widetilde\xi(X)\big) = g\big(\widetilde\eta, [\widetilde\xi, X]\big), \quad \forall\, \widetilde\eta, X \in \mathfrak{X}(M).$$

Lemma 5. Let $M = G/K$ be a symmetric space. Then for any $\xi, \eta \in \mathfrak{g}$,

$$\nabla_{\widetilde\xi}\,\widetilde\eta = \frac{1}{2}\Big( [\widetilde\xi, \widetilde\eta] + (\mathrm{ad}\,\widetilde\xi)^*\widetilde\eta + (\mathrm{ad}\,\widetilde\eta)^*\widetilde\xi \Big). \qquad (11)$$

In particular, when $\xi \in \mathfrak{m}$ and $\eta \in \mathfrak{m}$, $\nabla_{\widetilde\xi}\,\widetilde\eta\,([e]) = 0$; when $\xi \in \mathfrak{m}$ and $\eta \in \mathfrak{k}$, $\nabla_{\widetilde\xi}\,\widetilde\eta\,([e]) = -\widetilde{[\xi, \eta]}([e])$.
Now we have all the ingredients to prove the following theorem, from which Lemma 1 follows immediately.

Theorem 13. Let $M = G/K$ be a Riemannian symmetric space corresponding to the Riemannian symmetric pair $(G, K, \sigma)$, where $G$ acts on $M$ effectively. Then for any $\xi, \eta, \zeta \in \mathfrak{m}$,

$$R(\xi, \eta)\zeta = [\zeta, [\xi, \eta]].$$

Proof. Recall that for any $\xi, \eta, \zeta \in \mathfrak{m}$,

$$R(\widetilde\xi, \widetilde\eta)\widetilde\zeta = \big( \nabla_{\widetilde\xi}\nabla_{\widetilde\eta}\widetilde\zeta - \nabla_{\widetilde\eta}\nabla_{\widetilde\xi}\widetilde\zeta - \nabla_{[\widetilde\xi, \widetilde\eta]}\widetilde\zeta \big)([e]).$$

When $\xi, \eta \in \mathfrak{m}$, $[\widetilde\xi, \widetilde\eta]([e]) = 0$, so the last term vanishes: $\nabla_{[\widetilde\xi, \widetilde\eta]}\widetilde\zeta\,([e]) = 0$. Applying Equation (11) twice, the first term is

$$\nabla_{\widetilde\xi}\nabla_{\widetilde\eta}\widetilde\zeta = \nabla_{\widetilde\xi}\,\frac{1}{2}\Big( [\widetilde\eta, \widetilde\zeta] + (\mathrm{ad}\,\widetilde\eta)^*\widetilde\zeta + (\mathrm{ad}\,\widetilde\zeta)^*\widetilde\eta \Big)$$
$$= \frac{1}{4}\Big( [\widetilde\xi, [\widetilde\eta, \widetilde\zeta]] + (\mathrm{ad}\,\widetilde\xi)^*([\widetilde\eta, \widetilde\zeta]) + (\mathrm{ad}\,[\widetilde\eta, \widetilde\zeta])^*\widetilde\xi \Big) + \frac{1}{2}\Big( \nabla_{\widetilde\xi}\big((\mathrm{ad}\,\widetilde\eta)^*\widetilde\zeta\big) + \nabla_{\widetilde\xi}\big((\mathrm{ad}\,\widetilde\zeta)^*\widetilde\eta\big) \Big),$$

evaluated at $[e]$. We simplify this expression term by term. Fix any $\lambda \in \mathfrak{m}$. By the $\mathrm{Ad}_K$-invariance of the inner product $\langle\cdot,\cdot\rangle$ on $\mathfrak{m}$, we have

$$g\big([\widetilde\xi, [\widetilde\eta, \widetilde\zeta]], \widetilde\lambda\big)([e]) = \langle [\xi, [\eta, \zeta]], \lambda \rangle,$$
$$g\big((\mathrm{ad}\,\widetilde\xi)^*([\widetilde\eta, \widetilde\zeta]), \widetilde\lambda\big)([e]) = g\big([\widetilde\eta, \widetilde\zeta], [\widetilde\xi, \widetilde\lambda]\big)([e]) = 0,$$
$$g\big((\mathrm{ad}\,[\widetilde\eta, \widetilde\zeta])^*\widetilde\xi, \widetilde\lambda\big)([e]) = \langle \xi, [[\eta, \zeta], \lambda] \rangle = -\langle [[\eta, \zeta], \xi], \lambda \rangle = \langle [\xi, [\eta, \zeta]], \lambda \rangle.$$

Observe that $\xi, \zeta, \lambda \in \mathfrak{m}$ and $[\eta, \lambda] \in \mathfrak{k}$; by Lemma 5,

$$g\big(\nabla_{\widetilde\xi}((\mathrm{ad}\,\widetilde\eta)^*\widetilde\zeta), \widetilde\lambda\big)([e]) = \Big( \widetilde\xi\, g\big(\widetilde\zeta, [\widetilde\eta, \widetilde\lambda]\big) - g\big((\mathrm{ad}\,\widetilde\eta)^*\widetilde\zeta, \nabla_{\widetilde\xi}\widetilde\lambda\big) \Big)([e])$$
$$= \Big( g\big(\nabla_{\widetilde\xi}\widetilde\zeta, [\widetilde\eta, \widetilde\lambda]\big) + g\big(\widetilde\zeta, \nabla_{\widetilde\xi}[\widetilde\eta, \widetilde\lambda]\big) \Big)([e]) = -g\big(\widetilde\zeta, \nabla_{\widetilde\xi}\,\widetilde{[\eta, \lambda]}\big)([e]) = \langle \zeta, [\xi, [\eta, \lambda]] \rangle,$$

and similarly

$$g\big(\nabla_{\widetilde\xi}((\mathrm{ad}\,\widetilde\zeta)^*\widetilde\eta), \widetilde\lambda\big)([e]) = \langle \eta, [\xi, [\zeta, \lambda]] \rangle.$$

Combining the above pieces, we have

$$g\big(\nabla_{\widetilde\xi}\nabla_{\widetilde\eta}\widetilde\zeta, \widetilde\lambda\big)([e]) = \frac{1}{2}\big( \langle [\xi, [\eta, \zeta]], \lambda \rangle + \langle [\xi, [\eta, \lambda]], \zeta \rangle + \langle [\xi, [\zeta, \lambda]], \eta \rangle \big).$$

Similarly, by switching $\xi$ and $\eta$, we have

$$g\big(\nabla_{\widetilde\eta}\nabla_{\widetilde\xi}\widetilde\zeta, \widetilde\lambda\big)([e]) = \frac{1}{2}\big( \langle [\eta, [\xi, \zeta]], \lambda \rangle + \langle [\eta, [\xi, \lambda]], \zeta \rangle + \langle [\eta, [\zeta, \lambda]], \xi \rangle \big).$$

By the Jacobi identity and the $\mathrm{Ad}_K$-invariance of $\langle\cdot,\cdot\rangle$, identities such as

$$\langle [\xi, [\eta, \zeta]], \lambda \rangle = -\langle \xi, [\lambda, [\eta, \zeta]] \rangle$$

allow every term to be rewritten as a pairing with $\lambda$.
As a result,

$$\langle R(\xi, \eta)\zeta, \lambda \rangle = g\big(R(\widetilde\xi, \widetilde\eta)\widetilde\zeta, \widetilde\lambda\big)([e]) = \frac{1}{2}\big( \langle [\xi, [\eta, \zeta]], \lambda \rangle - \langle [\eta, [\xi, \zeta]], \lambda \rangle + \langle [\xi, [\eta, \lambda]], \zeta \rangle - \langle [\eta, [\xi, \lambda]], \zeta \rangle$$
$$+ \langle [\xi, [\zeta, \lambda]], \eta \rangle - \langle [\eta, [\zeta, \lambda]], \xi \rangle \big) = \langle [\zeta, [\xi, \eta]], \lambda \rangle,$$

since after this rewriting the first four terms cancel and the last two terms each equal $\langle [\zeta, [\xi, \eta]], \lambda \rangle$. As $\lambda \in \mathfrak{m}$ is arbitrary, the desired equation follows:

$$R(\xi, \eta)\zeta = [\zeta, [\xi, \eta]].$$
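As a final sanity check, Theorem 13 can be tested numerically on $\mathfrak{m} = \mathrm{Sym}(n)$, the positive definite example, taking $\langle X, Y \rangle = \mathrm{tr}(XY)$ as the $\mathrm{Ad}_K$-invariant inner product. The formula also recovers the non-positive curvature expected of a symmetric space of non-compact type, since $\langle R(\xi, \eta)\eta, \xi \rangle = \mathrm{tr}([\eta, [\xi, \eta]]\,\xi) = -\|[\xi, \eta]\|_F^2 \le 0$. The sketch below (names ours) checks this together with two symmetries of the curvature tensor.

```python
import numpy as np

# Theorem 13 on m = Sym(n): R(xi, eta)zeta = [zeta, [xi, eta]],
# with <X, Y> = tr(XY) as the Ad_K-invariant inner product.
rng = np.random.default_rng(3)
n = 4
sym = lambda A: (A + A.T) / 2
br = lambda A, B: A @ B - B @ A
ip = lambda X, Y: np.trace(X @ Y)

xi, eta, zeta, lam = (sym(rng.standard_normal((n, n))) for _ in range(4))

def R(x, y, z):
    """Curvature tensor on m, as given by Theorem 13."""
    return br(z, br(x, y))

# antisymmetry in (xi, eta) and the pair symmetry <R(x,y)z,l> = <R(z,l)x,y>
assert np.isclose(ip(R(xi, eta, zeta), lam), -ip(R(eta, xi, zeta), lam))
assert np.isclose(ip(R(xi, eta, zeta), lam), ip(R(zeta, lam, xi), eta))

# non-positive curvature: <R(xi,eta)eta, xi> = -||[xi,eta]||_F^2 <= 0
print(ip(R(xi, eta, eta), xi), -np.linalg.norm(br(xi, eta)) ** 2)
```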