Statistical models and probabilistic methods on Riemannian manifolds
Salem Said – CNRS, Université de Bordeaux

A guide to this thesis

This thesis reflects the major themes of my work, which I have carried out in the past four years. It does leave out some of this work, especially on the subject of warped information metrics. I hope readers may find time for at least a glance at this "missing" part (for example, in [1]). However, the thesis is rather self-contained, and I feel that the best way of reading it is just from beginning to end, uninterrupted.

At any rate, I would like to ask readers to begin with Chapter 1. Then, once this is done, they can skip to Chapters 3 and 4, which I would like to ask them to read together, or go on to Chapter 2, which is quite independent from the following. The same goes for Chapter 5, which can be read right after Chapter 1, provided just a little bit of familiarity with Chapter 3.

Each chapter begins with a table of contents, followed by a sort of "abstract", which provides some additional details on the table of contents, and points to some of the more interesting results.

I have done my best to avoid the thesis being a copy-paste of published research papers. Chapter 3 uncovers several new connections between Riemannian Gaussian distributions and random matrix theory, while Chapter 4 is entirely made up of previously unpublished material. The other chapters stick more closely to my existing papers (published, or under review), although I have made a consistent effort to improve the presentation, and to include useful background and historical discussion.

I hope that readers will find it stimulating to read an "original thesis". On the other hand, exploring new ideas exposes one to the risk of making mistakes (of various magnitudes), and I also hope these are duly pointed out, and the appropriate criticism is served up, without restraint. On the whole, writing this thesis has been a humbling experience for me. I have found out, time and again, that I was unable to answer questions or to prove statements, even when they seemed very natural. Chapter 6, a short final chapter, contains a list of such "open problems" (they are open to me, but others may find them easy).

I should acknowledge the input of many colleagues, who have shaped the ideas laid out in the following. Chapter 2 was born out of discussions with Marc Arnaudon, and Chapter 5 relies heavily on joint work with Alain Durmus, Pablo Jimenez, and Eric Moulines [2][3]. For Chapter 3, the idea of a useful connection between Riemannian Gaussian distributions and random matrix theory was first suggested to me by my colleague Yannick Berthoumieu. During the summer of 2020, I worked on this idea with Cyrus Mostajeran and Simon Heuveline. Later on, when I was nearly finished writing this thesis, I was very excited to discover the work of Leonardo Santilli and Miguel Tierz [4], who were simultaneously developing the same idea. It is really a great satisfaction to see a whole project unfold out of an "innocent" discussion. For this, I want to thank all of the colleagues I just mentioned.

Perhaps nobody will ever write a better preface than Cervantes, whose following famous words certainly apply here and now.
Idle reader: Without my swearing to it, you can believe that I would like this book, the child of my understanding, to be the most beautiful, the most brilliant, and the most discreet that anyone could imagine. But I have not been able to contravene the natural order; in it, like begets like.

Chapter 1
Notation and background
Contents R ( p , q ) . . . . . . . . . . . . . . . . . . . . . . . Certainly, this thesis is intended for specialised readers, who are already familiar with the basicsof Riemannian geometry. This first chapter is not a stand-alone introduction to Riemannian geometry,but merely hopes to help the readers ease into the material in subsequent chapters : by recalling someelementary notions in Riemannian geometry, I hope to find a shared language with my benevolent readers.Some original, or even unpublished, material is still included. As discussed in the following : • C function, in terms of its gradient and Hessian. • • • • • • .1 The Levi-Civita connection A smooth (0,2)-tensor field g , on a finite-dimensional smooth manifold M , is called a Riemannianmetric, if the bilinear form h u,v i x = g x ( u, v ) u , v ∈ T x M (1.1)is a true scalar product, for each x ∈ M . In this case, for u ∈ T x M and t ∈ T ∗ x M , the identities( u ♭ , v ) = h u,v i x and h t ,v i x = ( t, v ) v ∈ T x M uniquely define u ♭ ∈ T ∗ x M and t ∈ T x M . By a useful abuse of notation, u ♭ = g ( u ) t = g − ( t ) (1.2)The Levi-Civita connection of g is the unique affine connection ∇ which is metric, so that ∇ g = 0 (1.3)and tortionless, so that the exterior derivative dθ , of any 1-form θ , reads dθ ( X, Y ) = ∇ X θ ( Y ) − ∇ Y θ ( X ) (1.4)for vector fields X and Y . In effect, by (1.3) and (1.4), the (1 , ∇ X (the covariantderivative of the vector field X ), decomposes into self-adjoint and skew parts,2 h∇ Y X, Z i = L X g ( Y, Z ) + dX ♭ ( Y, Z ) (1.5)where L X g denotes the Lie derivative of the metric g along X , the “the linear elasticity tensor”(the equivalence between (1.3)–(1.4) and (1.5) is the content of Koszul’s theorem).Given local coordinates ( x i ; i = 1 , . . . , n ) on an open U ⊂ M , there is a coordinate frame( ∂ i ), along with a coframe ( dx i ) — of course, ∂ i stands for ∂ (cid:14) ∂x i . In terms of these coordinates,the metric g takes on the form of a length element g = g ij dx i ⊗ dx j g ij = h ∂ i , ∂ j i (1.6)and covariant derivatives may be expressed, in coordinate form, ∇ X = n ∂ j X i + Γ ijk X k o ∂ i ⊗ dx j ∇ ∂j ∂ k = Γ ijk ∂ i (1.7)using the Christoffel symbols (Γ ijk ). A vector field X , along a smooth curve c : I → M , defined on some interval I ⊂ R , is a map X : I → T M such that π ◦ X = c — of course, π : T M → M denotes the canonical projection.The Levi-Civita connection ∇ can be used to compute the covariant derivative of X along c ,itself a vector field along c , here denoted ∇ ˙ c X . In local coordinates, ∇ ˙ c X ( t ) = (cid:26) ddt X i ( t ) + (Γ ijk ◦ c ( t )) ˙ c j ( t ) X k ( t ) (cid:27) ( ∂ i ◦ c )( t ) (1.8)and this suggests writing ∇ ˙ c X = ∇ t X or even ˙ X , when c is understood from the context. Now, X is called parallel along c if ∇ ˙ c X = 0. From (1.8), this means that the components X i ( t )satisfy a linear differential equation with smooth coefficients.4hus, if X is parallel along c , then X is completely determined by its value at any instant,say t o ∈ I . Equivalently, if v is tangent to M at c ( t o ), then there exists a unique parallel vectorfield X along c , with X ( t o ) = v . It follows that, for t ∈ I , there exists a linear operator Π tt o which maps T c ( t o ) M onto T c ( t ) M , by Π tt o ( v ) = X ( t ). 
This linear operator Π tt o is called paralleltransport along c , from c ( t o ) to c ( t ), and has the following properties,hemigroup property Π tt o = Π tt ◦ Π t t o (1.9)isometry property k Π tt o ( v ) k c ( t ) = k v k c ( to ) (1.10)where k · k x is the norm associated with the scalar product in (1.1), for any x ∈ M . Clearly, ifone knows how to compute parallel transports, then one is able to recover covariant derivatives, ∇ ˙ c X ( t o ) = ddt (cid:12)(cid:12)(cid:12)(cid:12) t = t o Π t o t ( X ( t )) (1.11)A smooth curve c : I → M is called a geodesic curve, if its velocity vector field ˙ c is parallel.This means that c satisfies the geodesic equation, ∇ ˙ c ˙ c = 0 or ¨ c = 0 (1.12)Written out in local coordinates, this is a non-linear ordinary differential equation, d dt c i ( t ) + Γ ijk ( c ( t )) ddt c j ( t ) ddt c k ( t ) = 0 (1.13)If its solutions c ( t ) exists at all finite t ∈ R , for any initial conditions c ( t o ) = x and ˙ c ( t o ) = v , thenthe metric g on the manifold M is called geodesically complete. In this case, the Riemannianexponential map Exp : T M → M , given by Exp x ( v ) = c (1) is well-defined.The geodesic equation (1.12) states that the curve c has zero acceleration (just like a particlein free motion). This means that geodesic curves are extremals of the energy functional E ( c ) = Z I k ˙ c ( t ) k dt (1.14)and that re-parameterised geodesic curves (of the form c ◦ ϕ where c is a geodesic, and t ′ = ϕ ( t )a new parameterisation) are extremals of the length functional L ( c ) = Z I k ˙ c ( t ) k dt (1.15)This leads to the notion of Riemannian distance, which will be discussed in 1.7 below. Let f : M → R be a C function, denote df its differential. The gradient of f is the vector fieldgrad f = g − ( df ) (1.16)In the notation of (1.2). The Hessian of f is the (1 , f = ∇ grad f (1.17)The following proposition says that geodesic curves are exactly the curves which admit a Taylorexpansion of any C function, in terms of its gradient and Hessian.5 roposition 1.1. A smooth curve c : I → M is a geodesic curve, if and only if, for any s, t ∈ I ,and any C function f : M → R , ( f ◦ c )( t ) = ( f ◦ c )( s ) + h grad f, ˙ c i c ( s ) ( t − s ) + 12 h Hess f · ˙ c , ˙ c i c ( s ) ( t − s ) + o ( | t − s | ) (1.18)The proof of this proposition follows from the identity, d dt ( f ◦ c )( t ) = ddt h grad f, ˙ c i c ( t ) = h∇ ˙ c grad f, ˙ c i c ( t ) + h grad f, ¨ c i c ( t ) Indeed, the last term is identically zero ( i.e. , for any C function f ), if and only if ¨ c is identicallyzero, as in the geodesic equation (1.12).In (1.17), Hess f is a (1 , , f = ∇ df = 12 L grad f g (1.19)where the second equality follows from (1.5) and (1.16). This yields a lighter notation, h Hess f · u , v i = Hess f ( u, v )Recall the Riemannian exponential map Exp, from 1.2 (always assume geodesic completeness).Proposition 1.1 can be used to write down a Taylor expansion with Lagrange remainder, f (Exp x ( v )) = f ( x ) + h grad f, v i x + 12 Hess f c ( t ∗ ) ( ˙ c, ˙ c ) (1.20)where c ( t ∗ ) is a point along the geodesic c ( t ) = Exp x ( t v ), corresponding to an instant t ∗ ∈ (0 , Remark : writing (1.19) in local coordinates,Hess f = n ∂ ij f − Γ kij ∂ k f o dx i ⊗ dx j (1.21)The second derivatives ∂ ij f do not transform like a covariant tensor, but the Christoffel symbolscorrect for this problem, yeilding a true covariant tensor, Hess f . 
A very nice way of saying thisis that the Levi-Civita connection transforms second-order differentials, into covariant tensors.The concepts of second-order vectors and of second-order differentials are reviewed in [5],where they are used as a starting point for stochastic analysis in manifolds. The problem of principal component analysis consists in maximising the objective function f ( x ) = tr ( x ∆) x ∈ Gr R ( p , q ) (1.22)where ∆ is a symmetric positive-definite matrix, of size ( p + q ) × ( p + q ). The maximisation isover x in the real Grassmann manifold Gr R ( p , q ), identified with a space of orthogonal projectorsGr R ( p , q ) = n x ∈ R ( p + q ) × ( p + q ) : x † − x = 0 , x − x = 0 , tr( x ) = p o (1.23)where † denotes the transpose. Remarkably, it is possible to show that Gr R ( p , q ) is a submanifoldof S( p + q ), the affine space of symmetric matrices of size ( p + q ) × ( p + q ), with tangent spaces(the proof of this statement may be found in [6]), T x Gr R ( p , q ) = { v ∈ S( p + q ) : xv + vx = v } (1.24)6t then follows that Gr R ( p , q ) is of dimension pq . Clearly, Gr R ( p , q ) admits of a Riemannianmetric, which is the restriction of the trace scalar product of S( p + q ), h u,v i x = tr( uv ) u , v ∈ T x Gr R ( p , q ) (1.25)By (1.2) and (1.16), it follows from (1.22) that the gradient of f ( x ) is given bygrad f ( x ) = P x (∆) (1.26)where P x : S( p + q ) → S( p + q ) is the orthogonal projection onto T x Gr R ( p , q ). Now, let x = o ,the projector onto the span of the first p vectors in the canonical basis of R p + q . One readilychecks from (1.24) that T o Gr R ( p , q ) = ( ˜ ω = p × p ω † ω q × q ! ; ω ∈ R q × p ) (1.27)Therefore, P o (∆) is just ∆ with its main diagonal blocks of size p × p and q × q set to zero.Then, note that the orthogonal group O ( p + q ) acts transitively on Gr R ( p , q ), by g · x = gxg † for g ∈ O ( p + q ) and x ∈ Gr R ( p , q ), and that this action preserves the Riemannian metric (1.25).Therefore, one has the following alternative to (1.24) , T x Gr R ( p , q ) = { v = g · ˜ ω ; g · o = x , ˜ ω ∈ T o Gr R ( p , q ) } (1.28)where g · o = x simply means the first p columns of g span the image space of x . Since theaction of O ( p + q ) preserves the Riemannian metric (1.25), it easily followsP x (∆) = g · P o ( g † · ∆) for any g such that g · o = x (1.29)which can be used to evaluate the gradient of f ( x ), in (1.26). For the Hessian of f ( x ), notethat, according to Propositon 1.1,Hess f x ( v, v ) = d dt (cid:12)(cid:12)(cid:12)(cid:12) t =0 f (Exp x ( tv )) (1.30)Here, the Riemannian exponential can be transformed into a matrix exponential (see Proposition1.10, in 1.10). For g ∈ O ( p + q ), note that g · o = o if and only if g ∈ O ( p ) × O ( q ) ⊂ O ( p + q ).Denote g and k the Lie algebras of O ( p + q ) and O ( p ) × O ( q ) ⊂ O ( p + q ). Let p denote theorthogonal complement of k (with respect to the bilinear form Q ( ξ , η ) = tr( ξη ), for ξ , η ∈ o ( p + q )). Then, p = ( ˆ ω = p × p − ω † ω q × q ! ; ω ∈ R q × p ) (1.31)From (1.27) and (1.31), it is clear there exists a canonical isomorphism π o : T o Gr R ( p , q ) → p (just add a minus sign in front of ω † in (1.27)). 
In terms of this isomorphism,

Exp_x(tv) = exp(t ω̂_v) · x,   where ω̂_v = g · π_o(g† · v)    (1.32)

Replacing (1.32) into (1.30), the second derivative is easily computed,

Hess f_x(v, v) = tr(∆ ω̂_v² x) + tr(∆ x ω̂_v²) − 2 tr(∆ ω̂_v x ω̂_v)    (1.33)

Remark : a nice property of the linear map v ↦ ω̂_v is that −tr(ω̂_v²) = ⟨v, v⟩_x. By an abuse of notation, g · a = g a g†, for any matrix a of size (p + q) × (p + q).

1.5 Regular retractions

A retraction is a map Ret :
T M → M , taking v ∈ T x M to Ret x ( v ), and which verifies [7][8],Ret x (0 x ) = x d Ret x (0 x ) = Id x (1.34)where 0 x ∈ T x M is the zero vector in T x M , and Id x is the identity map of T x M . While theRiemannian exponential Exp : T M → M is itself a retraction , other retractions are often usedas computationally cheap (or numerically stable) substitutes for the Riemannian exponential.From (1.34), for any retraction Ret, Ret x agrees with Exp x up to first-order derivatives.Further, Ret will be called geodesic, if Ret x agrees with Exp x up to second-order derivatives.This means the curve c ( t ) = Ret x ( tv ) has zero initial acceleration : ¨ c (0) = 0 x , for any v ∈ T x M (in the notation of (1.12).To compare a retraction Ret with the exponential Exp, it is useful to introduce the mapsΦ x : T x M → T x M Φ x ( v ) = (cid:0) Exp − x ◦ Ret x (cid:1) ( v ) (1.35)These maps are well-defined if Ret is regular. That is, if Ret x ( v ) / ∈ Cut( x ) for any v ∈ T x M (Cut( x ) denotes the cut locus of x , whose definition is recalled in 1.7). In addition, they satisfythe following propositions. Proposition 1.2.
Let
Ret :
T M → M be a regular retraction. Then, Φ x : T x M → T x M verify (a) Φ x (0 x ) = 0 x and Φ ′ x (0 x ) = Id x (the prime denotes the Fr´echet derivative). (b) Φ ′′ x (0 x )( v, v ) = ¨ c (0) , where the curve c ( t ) is given by c ( t ) = Ret x ( tv ) . Proposition 1.3.
Let
Ret :
T M → M be a regular retraction and f : M → R be a C function. f (Ret x ( v )) = f ( x ) + h grad f, Φ x ( v ) i x + 12 Hess f γ ( t ∗ ) ( ˙ γ, ˙ γ ) (1.36) where γ ( t ∗ ) is a point along the geodesic γ ( t ) = Exp x ( t Φ x ( v )) , corresponding to some t ∗ ∈ (0 , . As an application of Proposition 1.2, consider the following examples.
Example 1 : let M = S n ⊂ R n +1 , the unit sphere of dimension n , with its usual (round)Riemannian metric. The retraction Ret x ( v ) = ( x + v )/ k x + v k ( k · k is the Euclidean norm) isregular, and the maps Φ x are given byΦ x ( v ) = arctan( k v k ) v k v k (1.37) Example 2 : let M = U ( d ), the Lie group of d × d unitary matrices, with its bi-invariant metric h u,v i x = − (1 / uv ). The retraction Ret x ( v ) = Pol( x + v ) (Pol denotes the left polar factor)is regular, and the maps Φ x are given byΦ x ( v ) = x (cid:16) u exp( i arctan( θ )) u † (cid:17) (1.38)where † denotes the conjugate-tranpose, and ω = x † v has spectral decomposition ω = u ( iθ ) u † ,where u is unitary and θ is real and diagonal — as one may expect, arctan( θ ) = diag(arctan( θ ii )).Now, (b) of Proposition 1.2 implies the retractions in question are geodesic, since the Taylorexpansion at zero of the arctangent only contains odd powers. Both of these retractions arebased on orthogonal projection onto the manifold M , which is embedded in a Euclidean space. Recall that it is always assumed M is geodesically complete. roof of Proposition 1.2 : note that (a) is immediate, by (1.34), and the fact that Exp is aretraction. To prove (b), note thatΦ x ( v ) = τ i (Ret x ( v )) ∂ i ( x ) (1.39)where ( τ i ; i = 1 , . . . , n ) are normal coordinates with origin at x , and where ∂ i = ∂ (cid:14) ∂τ i . SinceΦ x is smooth (precisely, C ), Φ ′′ x (0 x )( v, v ) = d dt (cid:12)(cid:12)(cid:12)(cid:12) t =0 Φ x ( tv )Thus, if c ( t ) = Ret x ( tv ) and c i ( t ) = ( τ i ◦ c )( t ), thenΦ ′′ x (0 x )( v, v ) = d dt c i (0) ∂ i ( x ) = (cid:26) d dt c i (0) + Γ ijk ( c (0)) ddt c j (0) ddt c k (0) (cid:27) ∂ i ( x )where the second equality holds since Γ ijk ( c (0)) = Γ ijk ( x ) = 0, by the definition of normalcoordinates. Comparing to (1.12) and (1.13), it is clear Φ ′′ x (0 x )( v, v ) = ¨ c (0). Proof of Proposition 1.3 : this is a direct application of (1.20), using Ret x ( v ) = Exp x (Φ x ( v )). Remark : the claims in Examples 1 and 2 above will not be proved in detail. Example 1 isquite elementary, and only requires one to recall that Cut( x ) = {− x } . For Example 2, the cutlocus on a point x in U ( d ) is described in [9], and (1.38) follows by a straightforward matrixcalculation. Gr R ( p , q ) Let St R ( p , q ) denote the Stiefel manifold, whose elements are the d × p matrices b with b † b = I p (I p is the p × p identity matrix, and d = p + q ). Note that T b St R ( p , q ) = { w : w † b + b † w = 0 } .For w ∈ T b St R ( p , q ), let [ b ] = bb † and [ w ] = wb † + bw † . If v ∈ T x Gr R ( p , q ), one says that ( b, w )is representative of ( x, v ), whenever x = [ b ] and v = [ w ].Recall that x and v may always be expressed x = g · o and v = g · ˜ ω , using (1.28). If x = [ b ],then g may be chosen g = ( b, b ⊥ ) (the columns of b ⊥ span the orthogonal complement of theimage space of x ). Then, a direct calculation shows v = [ w v ], where w v = b ⊥ ω . Now, defineRet x ( v ) = Proj ( b + w v ) for some b such that x = [ b ] (1.40)where Proj( h ) denotes the orthogonal projector onto the span of the columns of h , for h ∈ R d × p .This is well-defined, since it does not depend on the choice of b and b ⊥ , and is indeed a retraction,since it verifies (1.34).For a nicer expression of (1.40), identify each x ∈ Gr R ( p , q ) with its image space Im( x ).In other word, consider Gr R ( p , q ) as the space of all p -dimensional subspaces of R d . 
Then,Ret x ( v ) = Span ( b + w v ) (1.41)where Span( h ) denotes the span of the column space of h ∈ R d × p . Proposition 1.4. The retraction
Ret in (1.41) is regular, and the corresponding maps Φ x (defined as in (1.35) are given by Φ x ( v ) = h b ⊥ ( r arctan( a ) s † ) i (1.42) for x = [ b ] and v = [ b ⊥ ω ] , where ω has s.v.d. ω = ras † with r ∈ O ( q ) and s ∈ O ( p ) . As for Examples 1 and 2 in 1.5, (b) of Proposition 1.2 now implies Ret is a geodesic retraction. Whenever a = ( α , p × q ) † , where α is p × p and diagonal, let arctan( a ) = (arctan( α ) , p × q ) † where arctan( α ) =diag(arctan( α ii )). For the proof on the following page, define cos( a ) and sin( a ) in the same way. roof of Proposition 1.4 : here, x ∈ Gr R ( p , q ) is identified with its image space, Im( x ).Without loss of generality, it is assumed p ≤ q .With Φ x given by (1.42), the aim will be to show that, for x ∈ Gr R ( p , q ) and v ∈ T x Gr R ( p , q ),Exp x (Φ x ( v )) = Ret x ( v ) (1.43)In [9], the cut locus of x is obtained under the formCut( x ) = n Exp x ([ b ⊥ ω ]) ; ω = ras † , k a k ∞ = π o (1.44)where ω = ras † is the s.v.d. of ω ∈ R q × p , with r ∈ O ( q ) and s ∈ O ( p ), and k a k ∞ = max ij | a ij | .Since k arctan( a ) k ∞ < π/
2, it follows from(1.42) and (1.43) that Ret x ( v ) / ∈ Cut( x ), so Ret is aregular retraction. Thus, to prove the proposition, one only has to prove (1.43).Starting with the left-hand side of (1.43), let ϕ = r arctan( a ) s † , so Φ x ( v ) = [ b ⊥ ϕ ]. By thediscussion before (1.40), it follows thatΦ x ( v ) = g · ˜ ϕ (where g = ( b, b ⊥ )) (1.45)However, then, by (1.32),Exp x (Φ x ( v )) = exp( g · ˆ ϕ ) · x = ( g exp( ˆ ϕ )) · o (1.46)where the second equality follows from g † x = o , using g · ˆ ϕ = g ˆ ϕ g † . Using the s.v.d. of ϕ ( ϕ = r arctan( a ) s † ), a straightforward matrix multiplication yieldsˆ ϕ = k · ˆ q where k = s r ! , q = arctan( a ) (1.47)Thus, from (1.46) and (1.47), using the fact that k ∈ O ( p ) × O ( q ), so k · o = o (or k † · o = o ),Exp x (Φ x ( v )) = ( g k exp(ˆ q )) · o That is, by the group action property,Exp x (Φ x ( v )) = g k · (exp(ˆ q ) · o ) (1.48)Now, let b o = (I p , p × q ) † , so o = Span( b o ) andexp(ˆ q ) · o = Span (exp(ˆ q ) b o ) (1.49)Then, let a = ( α , p × q ) † , where α is p × p and diagonal. It will be shown below thatexp(ˆ q ) b o = cos(arctan( α ))sin(arctan( a )) ! = I p a ! (I p + α ) − (1.50)where the second equality follows from the identitiescos(arctan( α ii )) = (1 + α ii ) − and sin(arctan( α ii )) = α ii (1 + α ii ) − By (1.49) and (1.50), after ignoring the invertible matrix (I p + α ) − ,exp(ˆ q ) · o = Span I p a ! = Span ( b o + b ⊥ o a )10eplacing this into (1.48), it follows thatExp x (Φ x ( v )) = g k · Span ( b o + b ⊥ o a ) = g · Span ( k ( b o + b ⊥ o a )) (1.51)and, by carrying out the matrix products, one may perform the simplification,Span ( k ( b o + b ⊥ o a )) = Span ( b o + b ⊥ o ra ) = Span ( b o + b ⊥ o ω )to obtain from (1.51), Exp x (Φ x ( v )) = g · Span ( b o + b ⊥ o ω )which immediately yields (1.43), since g b o = b and g b ⊥ o = b ⊥ . Proof of (1.50) : write q = ( κ , p × q ) † , where κ is p × p and diagonal. It is enough to showexp(ˆ q ) = cos( κ ) − sin( q ) † sin( q ) cos( κ ) q × q ! (1.52)wehre cos( κ ) q × q is the q × q matrix,cos( κ ) q × q = cos( κ ) I q − p ! This follows by writing ˆ q = p X i =1 κ ii ˆ f i f i = ( δ i , p × q ) † where δ i is p × p , diagonal, with its only non-zero element on the i -th line, and equal to 1.Indeed, the matrices ˆ f i commute with one another, so thatexp(ˆ q ) = p Y i =1 exp( κ ii ˆ f i ) (1.53)and one readily checks ˆ f i = − e i , where e i is d × d , diagonal, with its only non-zero elementson the i -th and ( p + i )-th lines, and equal to 1. Therefore,exp( t ˆ f i ) = I d + (cos( t ) − e i + sin( t ) ˆ f i (1.54)Then, (1.52) obtains after replacing (1.54) into (1.53), and using e i e j = 0 e i ˆ f j = 0ˆ f i e j = 0 ˆ f i ˆ f j = 0 for i = j which may be shown by performing the matrix products. Remark : the above proof has a flavor of the structure theory of Riemannian symmetric spaces.In fact, Gr R ( p , q ) = O ( p + q )/ O ( p ) × O ( q ) is a Riemannian symmetric space. The associatedCartan decomposition is o ( p + q ) = k + p (1.55)where k is the Lie algebra of K = O ( p ) × O ( q ), and where p was given in (1.31). Then, a = n ˆ a ; a = ( α , p × q ) † , α is p × p diagonal o (1.56)is a maximal Abelian subspace of p . From [10] (Lemma 6.3, Chapter V), it follows that anyˆ ω ∈ p is of the form ˆ ω = Ad( k ) ˆ a where Ad denotes the adjoint representation, k ∈ K and ˆ a ∈ a .In the present context, this reads ˆ ω = k · ˆ a , which is indeed realised if ω has s.v.d. ω = ras † ,and k is the same as in (1.47). 
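The identity (1.43) underlying Proposition 1.4 is easy to check numerically. The following sketch (in Python/NumPy; the helper names and the random test data are mine, not part of the preceding text) generates a random point x = [b] of Gr_R(p,q) and a random tangent vector v = [b⊥ω], evaluates the retraction (1.41) directly, and compares it with Exp_x(Φ_x(v)), where Φ_x(v) is given by (1.42) and the exponential is evaluated through the usual thin-SVD expression for Grassmann geodesics (the same computation as in the proof of (1.50)).

```python
import numpy as np

# Numerical check of Proposition 1.4 (a sketch; all names here are ad hoc).
# x = [b] is a point of Gr_R(p,q), v = [b_perp @ omega] a tangent vector.
# We compare the retraction Ret_x(v) = Span(b + b_perp @ omega) with
# Exp_x(Phi_x(v)), where Phi_x(v) = [b_perp @ (r arctan(a) s^T)] as in (1.42).

rng = np.random.default_rng(0)
p, q = 2, 3
d = p + q

# Random orthonormal frame (b, b_perp) of R^d: the first p columns span Im(x).
g, _ = np.linalg.qr(rng.standard_normal((d, d)))
b, b_perp = g[:, :p], g[:, p:]

omega = rng.standard_normal((q, p))          # tangent direction, as in (1.27)

def projector(h):
    """Orthogonal projector onto the column span of h."""
    u, _ = np.linalg.qr(h)
    return u @ u.T

# Retraction (1.41): project b + b_perp @ omega back onto the Grassmannian.
ret = projector(b + b_perp @ omega)

# Phi_x(v) from (1.42): replace the singular values of omega by their arctangents.
r, a, st = np.linalg.svd(omega, full_matrices=False)   # omega = r @ diag(a) @ st
phi = (r * np.arctan(a)) @ st                          # q x p matrix

# Riemannian exponential of the tangent vector [b_perp @ phi], computed from the
# thin SVD of its horizontal lift delta = b_perp @ phi (the standard Grassmann
# geodesic expression, consistent with the proof of (1.50)).
delta = b_perp @ phi
u, s, vt = np.linalg.svd(delta, full_matrices=False)
exp_basis = b @ vt.T @ np.diag(np.cos(s)) + u @ np.diag(np.sin(s))
exp_point = exp_basis @ exp_basis.T

print(np.max(np.abs(ret - exp_point)))   # ~1e-15: Ret_x(v) = Exp_x(Phi_x(v))
```

The discrepancy is at the level of rounding error, as expected from (1.43).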
11 .7 The squared distance function A Riemannian manifold M becomes a metric space, when equipped with the distance function d ( x, y ) = inf (cid:8) L ( c ) ; c ∈ C ([0 , , M ) : c (0) = x , c (1) = y (cid:9) (1.57)known as the Riemannian distance . Here, L ( c ) is the length functional (1.15). When M isgeodesically complete, the infimum in (1.57) is always achieved by some curve c ∗ , which is thensaid to be length-minimising. In addition, any length-minimising curve is a geodesic.This is not to say that all geodesics are length-minimising. A geodesic curve c , with c (0) = x ,may reach a point c ( t ) = y , such that L ( c | [0 ,t ] ) ≥ d ( x, y ). Roughly, this happens when t is solarge that c becomes too long.For v ∈ T x M with k v k x = 1 ( k · k x is the norm given by the scalar product h· , ·i x ), definet( v ) = sup n t ≥ L ( c v | [0 ,t ] ) = d ( x, c v ( t )) o (1.58)where c v denotes the geodesic curve with ˙ c v (0) = v . The following setsTC( x ) = { t v ; t = t( v ) , k v k x = 1 } TD( x ) = { t v ; t < t( v ) , k v k x = 1 } (1.59)are known as the tangent cut locus and tangent injectivity domain of x . The cut locus andinjectivity domain of x are the sets Cut( x ) = Exp (TC( x )) and D( x ) = Exp (TD( x )).Since any two points x and y in M are connected by a length-minimising geodesic c ∗ , M = D( x ) ∪ Cut( x ) (1.60)It is interesting to note that Cut( x ) is a closed and negligible set. The exponential map Exp x is a diffeomorphism of TD( x ) onto D( x ) (TD( x ) is the largest subsetof T x M with this property). Pick some orthonormal basis ( u i ) of T x M , and define, for y ∈ D( x ), τ i ( y ) = (cid:10) Exp − x ( y ) , u i (cid:11) x i = 1 , . . . , n (1.61)Then, τ i : D( x ) → R are well-defined local coordinates, known as normal coordinates. Thesecoordinates satisfy τ i ( x ) = 0 g ij ( x ) = δ ij Γ ijk ( x ) = 0 (1.62)in the notation of (1.6) and (1.7). Even more, (1.61) is equivalent to the property that geodesicsthrough x appear as straight lines through 0 ∈ R n , in the normal coordinate map τ : D( x ) → R n .Now, the coordinate vector fields ∂ i = ∂ (cid:14) ∂τ i are given by ∂ i ( y ) = dExp x ( v )( u i ) where v = τ i ( y ) u i (1.63)where dExp x is the derivative of Exp x : T x M → M . This may be computed using Jacobi fields,dExp x ( tv )( tu ) = J ( t ) (1.64)where J is a vector field (Jacobi field) along the geodesic c ( t ) = Exp x ( tv ), which solves theJacobi equation ∇ t J − R ( ˙ c, J ) ˙ c = 0 (1.65)where J (0) = 0, ∇ t J (0) = u , and where R denotes the Riemann curvature tensor. Of course,there do exist other means of computing the derivative dExp x ( e.g. when Exp x coincides witha matrix exponential). 12 .7.3 Distance function For x ∈ M , consider the distance function r x ( y ) = d ( x, y ). For y ∈ D( x ), it is possible to show r x ( y ) = n X i =1 τ i ( y ) ! (1.66)in terms of the normal coordinates τ i . From (1.66), the distance function r x is smooth onU x = D( x ) − { x } .When y ∈ U x is of the form y = c v ( t ), where c v is a geodesic with ˙ c v (0) = v and k v k x = 1,define ∂ r ( y ) = ˙ c v ( t ). By the first variation of arc length formula (Theorem II.4.1 in [11]),grad r x ( y ) = ∂ r ( y ) for y ∈ U x (1.67)Introduce geodesic spherical coordinates ( r, θ α ) on U x . 
If y = c v ( t ) these are given by r = t and( θ α ) = θ ( v ), where θ identifies the unit sphere in T x M with the Euclidean unit sphere S n − .In these coordinates, the metric is given by g = dr ⊗ dr + g r αβ dθ α ⊗ dθ β (1.68)reflecting the fact that ∂ r is orthogonal to constant r x surfaces, here parameterised by ( θ α ).The coordinate vector fields ∂ α are given by (1.64) : ∂ α ( y ) = J ( r ) for y = c v ( r ), where J (0) = 0and ∇ t J (0) = u α (where u α = ∂ / ∂θ α are coordinate vector fields on the unit sphere in T x M ).In particular, if A : T x M → T y M solves the operator Jacobi equation (along the geodesic c v ) ∇ t A − R ˙ c v A = 0 A (0) = 0 , ∇ t A (0) = Id x (1.69)where R ˙ c v ( · ) = R ( ˙ c v , · ) ˙ c v , then ∂ α ( y ) = A ( r ) u α . Thus, if A ( y ) : T x M → T x M is given by A ( y ) = Π r ◦ A ( r ), then g r ( y ) = ( A ( y )) ∗ ( h ), the pullback under A ( y ) of the metric h of the unitsphere in T x M . It should be noted A ( y ) maps tangent spaces of this unit sphere to themselves.The Hessian of r x follows from (1.17) and (1.67), which yield (after using the fact that thevector fields ∂ r and ∂ α commute)Hess r x · ∂ r = 0 and Hess r x · ∂ α = ∇ ∂ r ∂ α Then, using the expression of the ∂ α as Jacobi fields,Hess r x ( y ) = ∇ t A ( t ) A − ( t ) (cid:12)(cid:12) t = r (1.70)Taking the covariant derivative ∇ t of this formula yields the Ricatti equation ∇ ∂ r Hess r x = R ∂r − (Hess r x ) (1.71)The Jacobi equation (1.69) and the Ricatti equation (1.71) lead up to the comparison theorems . Theorem 1.1.
Assume the sectional curvatures of M lie within the interval [ κ min , κ max ] . Then, sn κ max ( r ) h ≤ g r ( y ) ≤ sn κ min ( r ) h (1.72)ct κ max ( r ) g r ( y ) ≤ Hess r x ( y ) ≤ ct κ min ( r ) g r ( y ) (1.73) for y ∈ U x . Here, sn ′′ κ ( r ) + κ sn κ ( r ) = 0 with sn κ (0) = 0 and sn ′ κ (0) = 1 , and ct κ = sn ′ κ /sn κ . Remark : in addition to its singularity at x , the distance function r x is singular on Cut( x ).If y ∈ Cut( x ), then either y is a first conjugate point ( A ( r ) is singular, for the first time after x ),or there exist two distinct length-minimising geodesics connecting x to y . In the first case,Hess r x ( y ) has an eigenvalue equal to −∞ . In the second case, grad r x is discontinuous at y .The distributional Hessian of r x was studied in [12]. Remark : the reader may have noted, or recalled, that y ∈ Cut( x ) if and only if x ∈ Cut( y ). The inequalities (1.72) and (1.73) are in the sense of the usual Loewner order for self-adjoint operators. .7.4 Squared distance For x ∈ M , consider the squared distance function f x ( y ) = d ( x, y ) /
2. For y ∈ D( x ), f x ( y ) = 12 n X i =1 τ i ( y ) (1.74)in terms of the normal coordinates τ i . It follows that f x is smooth on D( x ). Of course, f x = r x / f x ( y ) = − Exp − y ( x ) for y ∈ D( x ) (1.75)and, by another application of the chain rule,Hess f x ( y ) = dr x ⊗ dr x + r x Hess r x (1.76)Just like r x , f x is singular on Cut( x ). If y ∈ Cut( x ) is a first conjugate point, then Hess f x ( y )has an eigenvalue equal to −∞ .The convexity of the function f x will play a significant rˆole, in the following, especially when M is a Hadamard manifold : a simply connected, geodesically complete Riemannian manifoldof non-positive sectional curvature. When M is a Hadamard manifold, the following propertieshold : any x, y ∈ M are connected by a unique geodesic c ; for all x ∈ M , Cut( x ) is empty,and f x is smooth and 1 / M is a Hadamard manifold. In addition, assume that the sectional curvature of M is bounded below by κ min = − c . Theorem 1.1 may be applied to (1.76), after setting κ max = 0.This yields g ( y ) ≤ Hess f x ( y ) ≤ cr x ( y ) coth( cr x ( y )) g ( y ) (1.77)for y ∈ M . In addition to showing that f x is 1 / f x has,at most, linear growth Hess f x ( y ) ≤ (1 + cr x ( y )) g ( y ) (1.78)since x coth( x ) ≤ x for x ≥ Remark : a subset A ⊂ M is called convex (that is, strongly convex, in the terminology of [11])if any x, y ∈ A are connected by a unique length-minimising geodesic c , and c lies entirely in A .A function f : A → R is then called (strictly) convex if f ◦ c : R → R is (strictly) convex, forany geodesic c which lies in A . It is called α -strongly convex (for some α >
0) if f ◦ c : R → R is α -strongly convex, for any geodesic c which lies in A ,( f ◦ c )( ps + q t ) ≤ p ( f ◦ c )( s ) + q ( f ◦ c )( t ) − αpq d ( c ( s ) , c ( t )) (1.79)whenever p, q ≥ p + q = 1. For example, if M is a sphere and A is the open northernhemisphere, then A is convex. Then, f x : A → R , where x denotes the north pole, is strictlyconvex, but not strongly convex. Remark : for x ∈ M , let inj( x ) = d ( x, Cut( x )) denote the injectivity radius at x . Then, letinj( M ) = inf x ∈ M inj( x ), the injectivity radius of M . Assume all the sectional curvatures of M are less than κ max = c . If B ( x, R ) is a geodesic ball with radius R ≤ (1 /
2) min{ inj(M), π c⁻¹ }, then B(x, R) is convex. Here, if κ_max = 0, then c⁻¹ is understood to be +∞. However, there do exist manifolds M with negative sectional curvature, and with inj(M) = 0 (e.g. the quotient of the Poincaré upper half-plane, by a discrete group of translations).

1.8 Example : robust Riemannian barycentre

Let M be a Hadamard manifold, with sectional curvatures bounded below by κ_min = −c². Recall that f_x is (1/2)-strongly convex, and that Hess f_x has, at most, linear growth (as in (1.78)). On the other hand, consider the function

V_x(y) = δ² [ 1 + d²(x, y)/δ² ]^{1/2} − δ²    (1.80)

where δ > 0. Then V_x(y) ≥
0, and V_x(y) = 0 if and only if x = y. Moreover, V_x ∼ f_x when d(x, y)/δ is small, and V_x ∼ δ r_x when d(x, y)/δ is large.

Proposition 1.5.
Let M be a Hadamard manifold, with sectional curvatures bounded below by κ_min = −c². If V_x : M → R is defined as in (1.80), then V_x is smooth, strictly (but not strongly) convex, and Hess V_x is bounded by 1 + δc.

Let π be a probability measure on M, and consider the problem of minimising

V_π(y) = ∫_M V_x(y) π(dx)    (1.81)

A global minimiser of V_π will be called a robust Riemannian barycentre of π. Here, the adjective "robust" comes from the field of robust statistics [13].

Proposition 1.6.
Let π be a probability distribution on a Hadamard manifold M . If π hasfinite first-order moments, then the function V π is a proper, strictly convex function, with aunique global minimum x ∗ ∈ Θ . Therefore, π has a unique robust Riemannian barycentre x ∗ . Recall that π has finite first-order moments, if and only if there exists y o ∈ M with Z M r x ( y o ) π ( dx ) < ∞ (1.82)and recall that V π is said to be proper if it takes on finite values. Proof of Proposition 1.5 : by applying the chain rule to (1.80), and using (1.75),grad V x ( y ) = − Exp − y ( x ) (cid:2) d ( x, y )/ δ ) (cid:3) (1.83)Then, by applying (1.17),Hess V x ( y ) = − Exp − y ( x ) ⊗ Exp − y ( x ) δ (cid:2) d ( x, y )/ δ ) (cid:3) − ∇ Exp − y ( x ) (cid:2) d ( x, y )/ δ ) (cid:3) (1.84)To conclude, it is enough to note the inequalities,0 ≤ Exp − y ( x ) ⊗ Exp − y ( x ) ≤ d ( x, y ) g ( y )which follows since Exp − y ( x ) ⊗ Exp − y ( x ) is a rank-one operator in T y M , and g ( y ) ≤ − ∇ Exp − y ( x ) ≤ (1 + cr x ( y )) g ( y )which is the same as (1.78), and follows from (1.17) and (1.75). Replacing these into (1.84), adirect calculation shows 0 < Hess V x ( y ) ≤ (1 + δc ) g ( y ) (1.85)which completes the proof. 15 roof of Proposition 1.6 : using the sub-additivity of the square root, (1.80) and (1.81)imply that for any y ∈ M , V π ( y ) ≤ Z M r x ( y ) π ( dx )But, by the triangle inequality, and (1.82), Z M r x ( y ) π ( dx ) ≤ d ( y , y o ) + Z M r x ( y o ) π ( dx ) < ∞ Therefore, V π is proper. That V π is also strictly convex is an immediate result of Proposition 1.5 :each function V x is strictly convex, and V π ( y ) is the expectation of V x ( y ) with respect to arandom x with distribution π . Now, to show that V π has a unique global minimum, it is enoughto show that V π ( y ) goes to infinity as y goes to infinity. Note that ϕ ( x ) = (1 + x ) is convex.This implies (using the elementary fact that the graph of a convex function remains above anyof its tangents), V x ( y ) ≥ ( √ − δ + δ √ r x ( y )Taking the expectation with respect to π , V π ( y ) ≥ ( √ − δ + δ √ Z M r x ( y ) π ( dx )To see that V π ( y ) goes to infinity as y goes to infinity, it is now enough to note, using thetriangle inequality, Z M r x ( y ) π ( dx ) ≥ d ( y , y o ) − Z M r x ( y o ) π ( dx )where d ( y , y o ) goes to infinity as y goes to infinity. Remark : the above Proposition 1.6 only requires M to be a Hadamard manifold, without theadditional condition that it have sectional curvatures bounded below. Indeed, Proposition 1.6only relies on the fact that V x is strictly convex, and not on the fact that the Hessian of V x isbounded above by 1 + δc . Remark : if a function V : M → R , on a Riemannian manifold M , has bounded Hessian, thenit has Lipschitz-gradient. That is, if there exists ℓ ≥ | Hess V ( x )( u, u ) | ≤ ℓg ( u, u )for all x ∈ M and v ∈ T x M , then (cid:13)(cid:13) Π (cid:0) grad V c (1) (cid:1) − grad V c (0) (cid:13)(cid:13) c (0) ≤ ℓL ( c ) (1.86)for any smooth curve c : [0 , → M , where L ( c ) is the length of c . This is due to the following. Lemma 1.1.
Let X be a vector field on a Riemannian manifold M . If the operator norm ofthe covariant derivative ∇ X is bounded by ℓ ≥ , then (cid:13)(cid:13) Π (cid:0) X c (1) (cid:1) − X c (0) (cid:13)(cid:13) c (0) ≤ ℓL ( c ) (1.87) for any smooth curve c : [0 , → M . Sketch of proof : let u i be a parallel orthonormal base along c ( u i are vector fields along c ,with u i ( t ) an orthonormal basis of T c ( t ) M , for each t ). Let X i ( t ) = h X , u i i c ( t ) and note (cid:13)(cid:13) Π (cid:0) X c (1) (cid:1) − X c (0) (cid:13)(cid:13) c (0) = n X i =1 (cid:0) X i (1) − X i (0) (cid:1) = n X i =1 (cid:18)Z h∇ ˙ c X , u i i c ( t ) dt (cid:19) the proof then follows by using Jensen’s inequality, since k∇ ˙ c X k c ( t ) ≤ ℓ k ˙ c k c ( t ) .16 .9 Riemannian volume and integral formulae If a Riemannian manifold M is orientable, then M admits a volume form, called the Riemannianvolume form, to be denoted vol, in the following. In terms of local coordinates ( x i ; i = 1 , . . . , n )vol = det( g ) dx ∧ . . . ∧ dx n (1.88)where det( g ) is the determinant of the metric, which is equal the determinant of the matrix ( g ij ),defined in (1.6). Then, the integral of a continuous, compactly-supported function f : M → R ,with respect to vol, is the integral of the n -form f vol over M . This is denoted R M f ( x ) vol( dx ).There exists a unique measure | vol | on the Borel σ -algebra of M , such that [14] (Chapter 8),for continuous, compactly-supported f , Z M f ( x ) vol( dx ) = Z M f ( x ) | vol | ( dx )where the integral on the left is a Riemann integral, and the integral on the right is a Lebesgueintegral. It is quite useful to study these integrals using geodesic spherical coordinates (whichwere introduced in 1.7.3). Let ( r, θ α ) be geodesic spherical coordinates, with origin at x ∈ M .Recall that these are defined on U x = D( x ) − { x } , where D( x ) is the injectivity domain of x .Since M can be decomposed as in (1.60), and Cut( x ) is negligible, Z M f ( y ) vol( dy ) = Z U x f ( y ) vol( dy ) (1.89)Using (1.68) and (1.88), vol( dy ) = det( A ( y )) dr ∧ ω n − ( dθ ), where ω n − is the area measureon the unit sphere in T x M (as of now, this is identified with the Euclidean unit sphere S n − ).Using (1.59) and D( x ) = Exp (TD( x )), (1.89) yields Z M f ( y ) vol( dy ) = Z t( θ )0 Z S n − f ( r, θ ) det( A ( r, θ )) dr ω n − ( dθ ) (1.90)where t was defined in (1.58). This formula expresses integrals, with respect to the Riemannianvolume form, using geodesic spherical coordinates.Recall the Laplacian ∆ r x = div ∂ r . By definition of the divergence, L ∂ r vol = (div ∂ r )vol.Writing this in geodesic spherical coordinates,∆ r x ( r, θ ) = ∂ r log det( A ( r, θ )) (1.91)Accordingly, the comparison theorems 1.1 can be used to obtain the volume comparison theorem. Theorem 1.2.
Assume the sectional curvatures of M lie within the interval [κ_min, κ_max]. Then,

sn_{κ_max}^{n−1}(r) ≤ det(A(r, θ)) ≤ sn_{κ_min}^{n−1}(r)    (1.92)

(n − 1) ct_{κ_max}(r) ≤ ∂_r log det(A(r, θ)) ≤ (n − 1) ct_{κ_min}(r)    (1.93)

This volume comparison theorem is quite elementary, as stronger and deeper comparison results do exist. Moreover, in this theorem, the lower bound on sectional curvature may be replaced by a lower bound on Ricci curvature, without any change to the conclusion. For example, Gromov's volume comparison theorem can be used to give a short proof of the famous "sphere theorem", Theorem III.4.6 in [11].

Remark : roughly, (1.92) states that "more curvature means less volume". If f : M → R is a non-negative function of distance to x, so f(y) = f(r) in terms of the coordinates (r, θ^α), then

ω_{n−1} ∫_0^R f(r) sn_{κ_max}^{n−1}(r) dr ≤ ∫_{B(x,R)} f(y) vol(dy) ≤ ω_{n−1} ∫_0^R f(r) sn_{κ_min}^{n−1}(r) dr    (1.94)

for any R ≤ min{ inj(x), π c⁻¹ }. Here, inj(x) is the injectivity radius at x, c = |κ_max|^{1/2}, and ω_{n−1} denotes the area of S^{n−1}. In addition, if κ_max ≤
0, then c⁻¹ is understood to be +∞.

In general, it may be impossible to apply the integral formula (1.90), since t(θ) may be unknown. Here are two examples where t(θ) is known, and quite tractable (in fact, constant).

Example 1 : if M is a Hadamard manifold, then for any choice of the origin x, and any θ ∈ S^{n−1}, one has t(θ) = ∞, and (1.90) becomes

∫_M f(y) vol(dy) = ∫_0^∞ ∫_{S^{n−1}} f(r, θ) det(A(r, θ)) dr ω_{n−1}(dθ)    (1.95)

Example 2 : compact rank-one symmetric spaces are the following manifolds : spheres, real projective spaces, complex projective spaces, quaternion projective spaces, and the Cayley plane. These are manifolds all of whose geodesics are closed (i.e. periodic) and isometric to one another (see [15] for a detailed account). Therefore, t(θ) does not depend on x nor on θ, but is always equal to l/
2, where l is the length of a simple geodesic loop. Scaling the metric so the maximumsectional curvature is equal to 1, it can be shown l = π for real projective spaces, and l = 2 π in all other cases. Moreover (1.90) takes on the form (this may be found by looking up thesolution of the Jacobi equation in [15], Page 82), Z M f ( y ) vol( dy ) = Z l Z S n − f ( r, θ ) (sin( r )) k − (2 sin( r/ n − k dr ω n − ( dθ ) (1.96)where k = n for spheres and real projective spaces, and k = 2 or 4 for complex or quaternionprojective spaces, respectively. For the Cayley plane, n = 16 and k = 8. A Riemannian symmetric space is a Riemannian manifold M , such that, for each x ∈ M , thereexists an isometry s x : M → M , with s x ( x ) = x and ds x ( x ) = − Id x . This isometry s x is calledthe geodesic symmetry at x .Let G denote the identity component of the isometry goup of M , and K = K o be thestabiliser in G of some point o ∈ M . Then, M = G/K is a Riemannian homogeneous space.The mapping θ : G → G , where θ ( g ) = s o ◦ g ◦ s o is an involutive isomorphism of G .Let g denote the Lie algebra of G , and consider the Cartan decomposition, g = k + p , where k is the +1 eigenspace of dθ and p is the − dθ . One clearly has the commutationrelations, [ k , k ] ⊂ k ; [ k , p ] ⊂ p ; [ p , p ] ⊂ k (1.97)In addition, it turns out that k is the Lie algebra of K , and that p may be identified with T o M .The Riemannian metric of M may always be expressed in terms of an Ad( K )-invariantscalar product Q on g . If x ∈ M is given by x = g · o for some g ∈ G (where g · o = g ( o )), then h u,v i x = Q ( g − · u, g − · v ) (1.98)where the vectors g − · u and g − · v , which belong to T o M , are identified with elements of p .Here, by an abuse of notation, dg − · u is denoted g − · u . According to the Myers-Steenrod theorem, G is a connected Lie group, and K a compact subgroup of G . g → G denote the Lie group exponential. If v ∈ T o M , then the Riemannianexponential Exp o ( v ) is given by Exp o ( v ) = exp( v ) · o (1.99)Moreover, if Π t denotes parallel transport along the geodesic c ( t ) = Exp o ( tv ), thenΠ t ( u ) = exp( tv ) · u (1.100)for any u ∈ T o M (note that the identification T o M ≃ p is always made, implicitly). Using(1.100), one can derive the following expression for the Riemann curvature tensor at o , R o ( v, u ) w = − [[ v , u ] , w ] v, u, w ∈ T o M (1.101)A fundamental property of symmetric spaces is that the curvature tensor is parallel : ∇ R = 0.This is often used to solve the Jacobi equation (1.65), and then express the derivative of theRiemannian exponential, using in (1.64),dExp x ( v )( u ) = exp( v ) · sh( R v )( u ) (1.102)where sh( R v ) = P ∞ n =0 ( R v ) n / (2 n + 1)! for the self-adjoint curvature operator R v ( u ) = [ v , [ v, u ]].Since exp( v ) is an isometry, the following expression of the Riemannian volume is immediateExp ∗ o (vol) = | det(sh( R v )) | dv (1.103)where dv denotes the volume form on T o M , associated with the restriction of Q to p .Expression (1.103) yields applicable integral formulae, when g is a reductive Lie algebra( g = z + g ss : z the centre of g and g ss semisimple). 
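As a small aside, before turning to the root-space machinery below, the commutation relations (1.97) and the curvature formula (1.101) can be illustrated numerically on the example of 1.4, Gr_R(p,q) = O(p+q)/O(p)×O(q), whose subspace p has the block form (1.31). The following sketch (Python/NumPy, with ad hoc helper names of my own) checks that [k,k] ⊂ k, [k,p] ⊂ p, [p,p] ⊂ k, and that R_o(v,u)w = −[[v,u],w] again lies in p ≃ T_oM.

```python
import numpy as np

# A small numerical illustration of (1.97) and (1.101) for the symmetric space
# Gr_R(p,q) = O(p+q)/O(p)xO(q) of 1.4, using the block structure (1.31).
# (A sketch; the helper names are ad hoc.)

rng = np.random.default_rng(1)
p, q = 2, 3
d = p + q

def hat(omega):
    """Element of the subspace p, as in (1.31)."""
    z = np.zeros((d, d))
    z[p:, :p] = omega
    z[:p, p:] = -omega.T
    return z

def random_k():
    """Element of k = o(p) + o(q): block-diagonal skew-symmetric matrix."""
    a = rng.standard_normal((p, p)); b = rng.standard_normal((q, q))
    z = np.zeros((d, d))
    z[:p, :p] = a - a.T
    z[p:, p:] = b - b.T
    return z

def bracket(x, y):
    return x @ y - y @ x

def in_k(z, tol=1e-12):   # membership in k: off-diagonal blocks vanish
    return np.allclose(z[p:, :p], 0, atol=tol) and np.allclose(z[:p, p:], 0, atol=tol)

def in_p(z, tol=1e-12):   # membership in p: diagonal blocks vanish
    return np.allclose(z[:p, :p], 0, atol=tol) and np.allclose(z[p:, p:], 0, atol=tol)

v, u, w = (hat(rng.standard_normal((q, p))) for _ in range(3))
k1, k2 = random_k(), random_k()

print(in_k(bracket(k1, k2)))                  # [k, k] contained in k
print(in_p(bracket(k1, v)))                   # [k, p] contained in p
print(in_k(bracket(v, u)))                    # [p, p] contained in k
print(in_p(-bracket(bracket(v, u), w)))       # R_o(v,u)w = -[[v,u],w] lies in p
```

All four checks print True, as the block structure of (1.31) makes immediately visible.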
If a is a maximal Abelian subspace of p ,any v ∈ p is of the form v = Ad( k ) a for some k ∈ K and a ∈ a (see [10], Lemma 6.3, Chapter V).Moreover, using the fact that Ad( k ) is an isomorphism of g ,Ad( k − ) ◦ R v ◦ Ad( k ) = R a = X λ ∈ ∆ + ( λ ( a )) Π λ (1.104)where each λ ∈ ∆ + is a linear form λ : a → R , and Π λ is the orthogonal projectors onto thecorresponding eigenspace of R a . Here, ∆ + is the set of positive roots of g with respect to a [10](see Lemma 2.9, Chapter VII).It is possible to use the diagonalisation (1.104), in order to evaluate the determinant (1.103).To obtain a regular parameterisation, let S = K/K a , where K a is the centraliser of a in K .Then, let ϕ : S × a → M be given by ϕ ( s, a ) = Exp o ( β ( s, a )) where β ( s, a ) = Ad( s ) a . Now, by(1.103) and (1.104), ϕ ∗ (vol) = Y λ ∈ ∆ + (cid:12)(cid:12)(cid:12)(cid:12) sinh λ ( a ) λ ( a ) (cid:12)(cid:12)(cid:12)(cid:12) m λ β ∗ ( dv )where m λ is the multiplicity of λ (the rank of Π λ ). On the other hand, one may show that β ∗ ( dv ) = Y λ ∈ ∆ + | λ ( a ) | m λ da ω ( ds ) (1.105)where da is the volume form on a , and ω is the invariant volume induced onto S from K .Finally, the Riemannian volume, in terms of the parameterisation ϕ , takes on the form ϕ ∗ (vol) = Y λ ∈ ∆ + | sinh λ ( a ) | m λ da ω ( ds ) (1.106)Using (1.106), it will be possible to write down integral formulae for Riemannian symmetricspaces, either non-compact or compact. Recall that the dimension of a is known as the rank of M . In fact, Exp o ( a ) is a totally flat submanifold of M ,of maximal dimension, and the only such submanifold, up to isometry. he non-compact case This is the case were g admits an Ad( G )-invariant, non-degenerate, symmetric bilinear form B ,such that Q ( u, z ) = − B ( u, dθ ( z )) is an Ad( K )-invariant scalar product on g . In this case, B isnegative-definite on k and positive-definite on p . Moreover, ad( z ) = [ z, · ] is skew-symmmetricor symmetric (with respect to Q ), according to whether z ∈ k or z ∈ p .If u , u ∈ p are orthonormal, the sectional curvature of Span( u , u ) is found from (1.101), κ ( u , u ) = −k [ u , u ] k o ≤
0. Therefore, M has non-positive sectional curvature. In fact, M is a Hadamard manifold. It is geodesically complete by (1.99). It is moreover simply connected, because Exp_o : p → M is a diffeomorphism [10] (Theorem 1.1, Chapter VI). Thus, (1.103) yields a first integral formula,

∫_M f(x) vol(dx) = ∫_p f(Exp_o(v)) |det(sh(R_v))| dv    (1.107)

To obtain an integral formula from (1.106), one should first note that β : S × a → p is neither regular nor one-to-one. Recall the following :

• the hyperplanes λ(a) = 0, where λ ∈ ∆⁺, divide a into finitely many connected components, which are open and convex sets, known as Weyl chambers. From (1.105), β is regular on each Weyl chamber.

• let K′_a denote the normaliser of a in K. Then, W = K′_a / K_a is a finite group of automorphisms of a, called the Weyl group, which acts freely and transitively on the set of Weyl chambers [10] (Theorem 2.12, Chapter VII).

Then, for each Weyl chamber C, β is regular and one-to-one, from S × C onto its image in p. Moreover, if a_r is the union of the Weyl chambers (a ∈ a_r if and only if λ(a) ≠ 0 for all λ ∈ ∆⁺), then β is regular and |W|-to-one from S × a_r onto its image in p. To obtain the desired integral formula, it only remains to note that ϕ is a diffeomorphism from S × C onto its image in M. However, this image is the set M_r of regular values of ϕ. By Sard's lemma, its complement is negligible [16].

Proposition 1.7.
Let M = G/K be a Riemannian symmetric space, which belongs to the "non-compact case", just described. Then, for any bounded continuous function f : M → R,

∫_M f(x) vol(dx) = ∫_{C⁺} ∫_S f(ϕ(s, a)) ∏_{λ∈∆⁺} (sinh λ(a))^{m_λ} da ω(ds)    (1.108)
                 = (1/|W|) ∫_a ∫_S f(ϕ(s, a)) ∏_{λ∈∆⁺} |sinh λ(a)|^{m_λ} da ω(ds)    (1.109)

Here, C⁺ is the Weyl chamber C⁺ = { a ∈ a : λ ∈ ∆⁺ ⇒ λ(a) > 0 }.

Example 1 : consider M = H(N), the space of N × N Hermitian positive-definite matrices. Here, G = GL(N, C) and K = U(N), the groups of N × N complex invertible and unitary matrices. Moreover, B(u, z) = Re(tr(uz)) and dθ(z) = −z†. Thus, p is the space of N × N Hermitian matrices, and one may choose for a the space of N × N real diagonal matrices. The positive roots are the linear maps λ(a) = a_ii − a_jj where i < j, and each one has its multiplicity equal to 2. Thus, C⁺ is the cone of real diagonal matrices a with a_11 > · · · > a_NN. W is the group of permutation matrices in U(N) (so |W| = N!). Finally, S = U(N)/T_N ≡ S_N, where T_N is the torus of diagonal unitary matrices. By (1.109),

∫_{H(N)} f(x) vol(dx) = (1/N!) ∫_a ∫_{S_N} f(s exp(2a) s†) ∏_{i<j} sinh²(a_ii − a_jj) da ω(ds)    (1.110)

The compact case
Here, G ∗ = U ( N ) × U ( N ) and K ≃ U ( N ), is the diagonal group K = { ( x, x ) ; x ∈ U ( N ) } . The Riemannian metric is given bythe trace scalar product Q ( u,z ) = − tr( uz ). Moreover, T ∗ = T N and S = S N (this is U ( N ) /T N ).The positive roots are λ ( ia ) = a ii − a jj where i < j and where a is N × N , real and diagonal .By writing the integral over T N as a multiple integral, (1.115) reads, Z U ( N ) f ( x ) vol( dx ) = 1 N ! Z [0 , π ] N Z S N f (cid:16) s exp(2 ia ) s † (cid:17) Y i In the notation of the previous proposition, Exp x ( v ) = exp( ω v ) · x (1.122) for x ∈ M and v ∈ T x M . Propositions 1.9 and 1.10 offer a straightforward computational route to the Riemannianexponential map Exp. To compute Exp x ( v ), one begins by “lifting” v from T x M to g , underthe form of ω v . Then, it is enough to compute the action of exp( ω v ), which is just a matrixexponential, in practice. Example 1 : consider an example of the non-compact case, M = H( N ), the space of N × N Hermitian positive-definite matrices. Here, G = GL( N, C ) and π ( g ) = gg † for g ∈ G . Then, dπ ( g ) · h = hg † + gh † for h ∈ T g G . For x = π ( g ) and v ∈ T x M , it follows that v H ( g ) = v θ ( g )/2 , where θ ( g ) = ( g † ) − .By definition, ω v = dR g − ( v H ( g )). Since gg † = x , this gives ω v = ( v/ x − . Therefore, usingthe fact that g · x = gxg † , it followsExp x ( v ) = exp (cid:0) v x − (cid:14) (cid:1) x exp (cid:0) x − v (cid:14) (cid:1) Accordingly, by an elementary property of the matrix exponential ,Exp x ( v ) = x exp (cid:16) x − v x − (cid:17) x (1.123)which is the formula made popular by [21]. Example 2 : let M = G/K be a Riemannian symmetric space of the compact case. That is,the scalar product Q on g is Ad( G )-invariant. Write g = k + p the Cartan decomposition of g .For x ∈ M , denote K x the stabiliser of x in G . If x = π ( g ), this has Lie algebra k x = Ad( g )( k )(that is, the image under Ad( g ) of k ). For v ∈ T x M , by Proposition 1.9, its “lift” ω v shouldverify (note that, for the present example, B = Q ) ω v · x = v and Q ( ξ , ω v ) = 0 for ξ ∈ k x where the second identity is because ξ · x = 0 for ξ ∈ k x . Because Q is Ad( G )-invariant, thissecond identity is equivalent to Q ( κ, Ad( g − )( ω v )) = 0 for κ ∈ k . That is, ω v = Ad( g )( ω v ( o ))for some ω v ( o ) ∈ p . This ω v ( o ) is determined from ω v · x = v , which yields ω v ( o ) · o = g − · v .However, the map ω ω · o is an isomorphism from p onto T o M . Denoting its inverse by π o : T o M → p , it follows that ω v ( o ) = π o ( g − · v ). Finally, ω v = Ad( g ) ( π o ( g − · v )) (1.124)A special case of this formula was used in (1.32) of 1.4. Proof of Proposition 1.9 : to begin, one must prove ω v · x = v (1.125) Matrix functions (powers, logarithms, etc. ) of Hermitian arguments should be understood as Hermitianmatrix functions, obtained using the spectral decomposition — see [20]. ω v and v H ( e ), it is clear ω v = dR g − ( v H ( g )). Replacing this into (1.121),the left-hand side of (1.125) becomes, ω v · x = ddt (cid:12)(cid:12)(cid:12)(cid:12) t =0 exp( t dR g − v H ( g )) · x = ddt (cid:12)(cid:12)(cid:12)(cid:12) t =0 ( γ ( t ) g − ) · x where γ is any curve in G , through g with ˙ γ (0) = v H ( g ). Therefore, ω v · x = ddt (cid:12)(cid:12)(cid:12)(cid:12) t =0 γ ( t ) · o = dπ ( v H ( g )) = v from the definition of v H ( g ). This proves (1.125). It remains to show, h u, v i x = Q ( ξ , ω v ) for u = ξ · x (1.126)The proof is separated into two cases. 
non-compact case : in this case, Q ( ξ , ω ) = − B ( ξ, dθ ( ω )), where B is an Ad( G )-invariant,non-degenerate, symmetric bilinear form. To prove (1.126), note that dπ ( g ) ( dR g ( ξ )) = ddt (cid:12)(cid:12)(cid:12)(cid:12) t =0 (exp( tξ ) g ) · o = ddt (cid:12)(cid:12)(cid:12)(cid:12) t =0 exp( tξ ) · x = u Therefore, dR g ( ξ ) = u H ( g ) + w where w ∈ V g . From (1.120), using left-invariance of ( · , · ), h u, v i x = ( dR g ( ξ ) , v H ( g )) g = Q (Ad( g − )( ξ ) , v H ( e )) (1.127)Thus, using the definition of Q , and the fact that v H ( e ) ∈ p , h u, v i x = − B (Ad( g − )( ξ ) , dθ ( v H ( e ))) = B (Ad( g − )( ξ ) , v H ( e ))Finally, since B is Ad( G )-invariant, h u, v i x = B (Ad( g − )( ξ ) , v H ( e )) = B ( ξ, Ad( g )( v H ( e )))which is the same as (1.126), by the definition of ω v . Indeed, in the present case, B = B . compact case : this follows from (1.127), using the fact that Q is Ad( G )-invariant. Indeed, inthe present case, B = Q . Proof of Proposition 1.10 : for ξ ∈ g , introduce the corresponding vector fields X ξ on M ,given by X ξ ( x ) = ξ · x . Since this is a Killing vector field [19], if c : R → M is a geodesiccurve in M , then ℓ ( ξ ) = h X ξ , ˙ c i c ( t ) is a constant, (a law of conservation, really due to Noether’stheorem!). Now, in the notation of Proposition 1.9, let ω ( t ) = ω ˙ c ( t ) . By Proposition 1.9,B( ω ( t ) , ξ ) = ℓ ( ξ )Since this is a constant, and since B is non-degenerate, it follows that ω ( t ) = ω is a constant.Proposition 1.9 also implies that c satisfies the ordinary differential equation˙ c = ω · c But this differential equation is also satisfied by c ( t ) = exp( tω ) · c (0), as one may see from (1.121).By uniqueness of the solution, for given initial conditions, c ( t ) = exp( tω ˙ c (0) ) · c (0)This immediately implies (1.122), by setting t = 1, c (0) = x and ˙ c (0) = v .24 hapter 2 The barycentre problem Contents T W and T δ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . State-of-the art results establish the existence and uniqueness of the Riemannian barycentre of aprobability distribution which is supported inside a compact convex geodesic ball. What happens fora probability distribution which is not supported, but concentrated, inside a convex geodesic ball ?This question raises new difficulties that cannot be resolved by using the tools applicable to distributionswhich have compact convex support. The present chapter develops new tools, able to deal with thesedifficulties (at least in part), following the approach in [22]. • • • π T ∝ exp( − U/T ) be a Gibbs distribution on asimply connected compact Riemannian symmetric space M , such that the potential function U has a unique global minimum at x ∗ ∈ M . If M is simply connected, then for each δ < r cx / r cx is the convexity radius of M ), there exists a critical temperature T δ such that T < T δ impliesthat π T has a unique Riemannian barycentre ˆ x T and this ˆ x T belongs to the geodesic ball B ( x ∗ , δ ).The assumption that M is simply connected cannot be removed (see Lemma 2.1 and the followingremark). • T δ . • • .1 Fr´echet’s fruitful idea In 1948, Maurice Fr´echet proposed a generalisation of the concept of mean value, from Euclideanspaces to general metric spaces [23]. 
Today, this generalisation is known as the Fréchet mean. Precisely, a Fréchet mean of a probability distribution π on a metric space M is any global minimum of the so-called variance function

E_π(y) = ½ ∫_M d²(y, x) π(dx)   (2.1)

where d(x, y) denotes the distance between x and y in M. In the following, the focus will be on the case where M is a Riemannian manifold. Then, a Fréchet mean of π will be called a Riemannian barycentre, or just a barycentre, of π.

If E_π(y) takes on finite values (in fact, if it is finite for just one y = y_o), then π has at least one Fréchet mean. In particular, if M is a Euclidean space, then this Fréchet mean is always unique, and equal to the mean value (expectation) of π. In general, the Fréchet mean of a probability distribution π is not unique, and one may think of the Fréchet mean of π as the set F(π) of all global minima of its variance function E_π.

Example 1 : if M = S¹, the unit circle, and π is the uniform distribution (i.e. Haar measure) on S¹, then F(π) = S¹. Any point on the circle is a barycentre of the uniform distribution.

If x_1, …, x_N ∈ M, then an empirical Fréchet mean of (x_1, …, x_N) is any Fréchet mean of the empirical distribution (δ_{x_1} + … + δ_{x_N})/N (δ_x denotes the Dirac distribution concentrated at x). In other words, an empirical Fréchet mean of (x_1, …, x_N) is any global minimum of the empirical variance function

E_N(y) = (1/2N) Σ_{n=1}^N d²(y, x_n)   (2.2)

When M is a Riemannian manifold, the term “empirical Fréchet mean” will be replaced by the term “empirical barycentre”.

Example 2 : if M = S¹, and x_1, x_2 are two opposite points on S¹, then the empirical barycentre of (x_1, x_2) is a two-point set. For example, if x_1 = 1 and x_2 = −1, then the empirical barycentre is the set {i, −i} (i being the square root of −1).

Now, assume (x_n ; n ≥ 1) are independent samples from the distribution π. If F_N is the set of empirical Fréchet means of (x_1, …, x_N), then one is interested in using F_N to somehow approximate F(π).

In [24], it was shown that, if the metric space M is such that any closed and bounded subset of M is compact (that is, for any x ∈ M, the function y ↦ d(x, y) is proper, meaning it has compact sublevel sets), then for any ε > 0, the set F_N almost-surely belongs to the ε-neighborhood of the set F(π), when N is sufficiently large.

Moreover, if π has a unique Fréchet mean, say F(π) = {x̂_π}, then any sequence of empirical Fréchet means x̄_N ∈ F_N converges almost-surely to x̂_π (an extension of this last result, from independent to Markovian samples, is obtained in 4.3.2).

In [25], a central limit theorem was added to this last convergence result. Specifically, if M is a Riemannian manifold, the distribution of √N Exp⁻¹_{x̂_π}(x̄_N) converges to a multivariate normal distribution (in the tangent space at x̂_π). This “central limit theorem” requires several technical conditions in order to hold true, and should therefore only be applied after due verification.

2.2 Existence and uniqueness

The problem of the existence and uniqueness of Riemannian barycentres has generated a rich literature, with ramifications in stochastic analysis on manifolds, Riemannian geometry, and probability theory. The present section attempts a quick, non-exhaustive summary of some famous results from this literature.
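Before surveying these results, the non-uniqueness phenomenon of Example 2 above is easy to reproduce numerically. The following is a small sketch of mine (not taken from the thesis), assuming only numpy: it evaluates the empirical variance function (2.2) on a grid of the circle, for two antipodal sample points, and locates its global minima.

```python
import numpy as np

def circle_dist(a, b):
    # geodesic (arc-length) distance on the unit circle, angles in radians
    d = np.abs(a - b) % (2 * np.pi)
    return np.minimum(d, 2 * np.pi - d)

def empirical_variance(theta, samples):
    # empirical variance function E_N of (2.2), on S^1
    return 0.5 * np.mean(circle_dist(theta, samples) ** 2)

samples = np.array([0.0, np.pi])                 # the points x_1 = 1 and x_2 = -1, as angles
grid = np.linspace(-np.pi, np.pi, 20001)
values = np.array([empirical_variance(t, samples) for t in grid])
minima = grid[values < values.min() + 1e-7]
print(minima)
```

The printed angles form two clusters, around −π/2 and +π/2, recovering the two-point barycentre set {i, −i}.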
The works of Emery and Kendall [26], later expanded upon by Afsari [27], are related tothe existence and uniqueness of the Riemannian barycentre of a probability distribution π ,supported inside some geodesic ball B ( x ∗ , δ ), in a Riemannian manifold M .Emery and Kendall, among others, considered the so-called Karcher mean of π . This isa local minimum of the variance function E π in (2.1). In [26], π is assumed to have compactsupport, inside a so-called regular geodesic ball B ( x ∗ , δ ). Here, “regular geodesic ball” means • δ < π c − , where all sectional curvatures of M are less than κ max = c . • the cut locus of x ∗ does not intersect B ( x ∗ , δ ) (that is δ < inj( x ∗ )).These two conditions guarantee that the closed ball ¯ B ( x ∗ , δ ) is weakly convex, and that it hasconvex geometry.Weakly convex means for any x, y ∈ ¯ B ( x ∗ , δ ) there exists a unique geodesic γ : [0 , → M ,such that γ (0) = x , γ (1) = y and γ ( t ) ∈ ¯ B ( x ∗ , δ ) for all t ∈ [0 , 1] (this is equivalent to theterminology of [11] ).Convex geometry means there exists a positive, bounded, continuous, and convex functionΨ, defined on ¯ B ( x ∗ , δ ) × ¯ B ( x ∗ , δ ), such that Ψ( x, y ) = 0 if and only if x = y .When π is supported inside B ( x ∗ , δ ), the function E π takes on finite values, and therefore hasa global minimum ˆ x π . However, it is not immediately clear this ˆ x π should lie within B ( x ∗ , δ ).In [26], the existence of a local minimum, i.e. Karcher mean, within B ( x ∗ , δ ) is guaranteed,subject to interpreting the distance in (2.1) as geodesic distance within B ( x ∗ , δ ).If ˆ x π is a local minimum of E π in B ( x ∗ , δ ), then the convex geometry property of the closedball ¯ B ( x ∗ , δ ) guarantees this local minimum is unique. This follows by using a general form ofJensen’s inequality, due to Emery. Specifically, if ˆ x and ˆ x are Karcher means in B ( x ∗ , δ ), then(ˆ x , ˆ x ) is a Karcher mean of the image distribution δ ∗ π of π , under the map δ ( x ) = ( x, x ).Then, applying Jensen’s inequality to the convex function Ψ, it followsΨ(ˆ x , ˆ x ) ≤ Z ¯ B ( x ∗ ,δ ) Ψ( x, x ) π ( dx )so Ψ(ˆ x , ˆ x ) = 0, and therefore ˆ x = ˆ x . Remark : it was conjectured by Emery that any weakly convex geodesic ball should also haveconvex geometry. A counterexample to this conjecture was provided by Kendall, in the form ofhis “propeller” [28]. Afsari’s seminal work on Riemannian barycentres was published ten years ago [27]. It providedthe following statement : if π is supported inside a geodesic ball B ( x ∗ , δ ), then π has a uniqueRiemannian barycentre ˆ x π and ˆ x π ∈ B ( x ∗ , δ ), as soon as δ < 12 min { πc − , inj( M ) } (2.3) This geodesic γ is the unique length-minimising curve, among all curves which connect x to y and lie in¯ B ( x ∗ , δ ). See the proof of Theorem IX.6.2, Page 405 in [11]. c is such that all sectional curvatures of M are less than κ max = c (if M has negativesectional curvatures, c − is understood to be + ∞ ), and inj( M ) is the injectivity radius of M .Condition (2.3) ensures the geodesic ball B ( x ∗ , δ ) is convex, in the sense of 1.7.4 (stronglyconvex, in the terminology of [11]), rather than just weakly convex as in 2.2.1. This strongercondition is required, because the Riemannian barycentre (Fr´echet mean) is considered, ratherthan just the Karcher mean. 
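In practice, a Karcher mean inside such a geodesic ball is computed by the standard fixed-point (Riemannian gradient-descent) iteration on the variance function. The following is a minimal sketch on the unit sphere S², assuming numpy; the exp/log maps and the iteration are the textbook ones, not a procedure taken from [26] or [27].

```python
import numpy as np

def exp_map(p, v):
    # Riemannian exponential on the unit sphere S^2
    n = np.linalg.norm(v)
    return p if n < 1e-15 else np.cos(n) * p + np.sin(n) * v / n

def log_map(p, x):
    # inverse exponential Exp_p^{-1}(x), valid away from the cut locus of p
    c = np.clip(np.dot(p, x), -1.0, 1.0)
    v = x - c * p
    nv = np.linalg.norm(v)
    return np.zeros(3) if nv < 1e-15 else np.arccos(c) * v / nv

def karcher_mean(points, iters=50):
    # fixed-point iteration y <- Exp_y( mean of Exp_y^{-1}(x_i) ), i.e. gradient
    # descent with unit step on the empirical variance function (2.2)
    y = points[0]
    for _ in range(iters):
        y = exp_map(y, np.mean([log_map(y, x) for x in points], axis=0))
    return y

rng = np.random.default_rng(1)
pole = np.array([0.0, 0.0, 1.0])
# samples inside a small geodesic ball around the north pole, well within the
# uniqueness regime of (2.3) for the unit sphere (c = 1, inj(M) = pi)
tangent = 0.3 * rng.normal(size=(20, 2))
points = np.array([exp_map(pole, np.array([u, v, 0.0])) for u, v in tangent])
print(karcher_mean(points))      # a point close to the north pole
```

For data concentrated in such a ball, the iteration converges to the unique barycentre guaranteed by the statement above.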
In fact, Afsari extended his results beyond Riemannian barycentresto L p Riemannian barycentres, which are obtained by replacing the squared distance in (2.1)with a distance elevated to the power p , where p ≥ E π must lie inside B ( x ∗ , δ ). This is done using theAlexandrov-Toponogov comparison theorem, under its form stated in [11] (Page 420). Then,the Poincar´e-Hopf theorem is employed, in order to prove uniqueness of local minima, insidethe geodesic ball B ( x ∗ , δ ).Specifically, E π is differentiable at any point y which belongs to the closed ball ¯ B ( x ∗ , δ ), andgrad E π ( y ) = − Z M Exp − y ( x ) π ( dx ) for y ∈ ¯ B ( x ∗ , δ )Then, it is shown that, if y ∈ B ( x ∗ , δ ) and grad E π ( y ) = 0, then Hess E π ( y ) is positive-definite.In other words, the singular point y of the gradient vector field grad E π has its index equal to 1.Since this vector field is outward pointing on the boundary of ¯ B ( x ∗ , δ ), the Poincar´e-Hopftheorem implies the sum of the indices of all its singular points in B ( x ∗ , δ ) is equal to theEuler-Poincar´e characteristic of ¯ B ( x ∗ , δ ), which is equal to 1 (since ¯ B ( x ∗ , δ ) is homeomorphic toa closed ball in R n ). Remark : the argument just summarised not only shows that E π has a unique local minimumin B ( x ∗ , δ ), but that it has a unique stationary point in B ( x ∗ , δ ). Moreover, the advantageof this argument, over the “convex geometry” uniqueness argument, (summarised in 2.2.1),is that it can be used to show the uniqueness of L p Riemannian barycentres, for general p > Existence and uniqueness of Riemannian barycentres hold under quite general conditions, whenthe underlying Riemannian manifold M is a Hadamard manifold (recall definition from 1.7.4).Mostly, these existence and uniqueness properties are just special cases of the properties ofFr´echet means in metric spaces of non-positive curvature, which were developed by Sturm [29].Let π be a probability distribution on a Hadamard manifold M . As already mentioned in2.1, if the variance function E π in (2.1) takes on finite values, then π has at least one Riemannianbarycentre, say ˆ x π . For this, it is enough that E π ( y o ) < ∞ , for just one y o ∈ M . In other words,it is enough that π should have a finite second-order moment Z M d ( y o , x ) π ( dx ) < ∞ (2.4)Indeed, if (2.4) is verified, then a straightforward application of the triangle inequality impliesthat E π ( y ) < ∞ for all y ∈ M .When M is a Hadamard manifold, existence of a Riemannian barycentre automaticallyimplies its uniqueness. This can be shown using the “convex geometry” uniqueness argument,discussed in 2.2.1. Indeed, if M is a Hadamard manifold, then Ψ : M × M → R , whereΨ( x, y ) = d ( x, y ) is convex, and Ψ( x, y ) = 0 if and only if x = y . Alternatively, uniquenessof the Riemannian barycentre follows from the strong convexity of the variance function E π .Recall from 1.7.4 that f x ( y ) = d ( x, y ) / / x ∈ M .28hen, (2.1) says that E π is an expectation of 1 / / E π has a unique global minimum, ˆ x π ∈ M .When M is a Hadamard manifold, it should also be noted that E π is smooth throughout M ,and that its gradient is given bygrad E π ( y ) = − Z M Exp − y ( x ) π ( dx ) (2.5)as can be found by applying (1.75) under the integral in (2.1). Strong convexity of E π impliesits global minimum ˆ x π is also its unique stationary point in M ( i.e. the unique point wheregrad E π is equal to zero). The empirical barycentre of the points ( x , . . . 
, x_N), in any complete Riemannian manifold M, is generically unique. This means that this empirical barycentre is unique for almost all (x_1, …, x_N) in the product Riemannian manifold M^N = M × … × M, equipped with its Riemannian volume measure. This interesting result was obtained by Arnaudon and Miclo [30]. In particular, it implies that when (x_1, …, x_N) are independent samples from a distribution π which has a probability density with respect to the Riemannian volume of M, then their empirical barycentre x̄_N is almost-surely unique.

Throughout the following, M will be a compact, orientable Riemannian manifold, with positive sectional curvatures, all less than κ_max = c². Afsari’s statement, recalled in 2.2.2, says that if π is a probability distribution on M, supported inside a convex geodesic ball B(x*, δ), then π has a unique Riemannian barycentre x̂_π, as soon as

δ < ½ min{π c⁻¹, inj(M)}   (2.6)

where inj(M) denotes the injectivity radius of M.

Inequality (2.6) is optimal. Indeed, it is easy to think of examples which show that, if it is replaced by an equality, then x̂_π will immediately fail to be unique. On the other hand, this inequality does not tell us what happens in the important case where π = π_T is a Gibbs distribution,

π_T(dx) = (Z(T))⁻¹ exp[ −U(x)/T ] vol(dx)   (2.7)

for some temperature T and potential function U : M → R, where Z(T) is a normalising constant (vol denotes the Riemannian volume form).

The present chapter will introduce several results which deal with this case. These are concerned with the concentration, differentiability, convexity, and uniqueness properties of the Riemannian barycentre x̂_T of the Gibbs distribution π_T.

The starting assumption for these results is that the potential function U has a unique global minimum at x* ∈ M. Under this assumption, while π_T is not supported inside any convex geodesic ball B(x*, δ), it is still concentrated on any such ball, provided the temperature T is sufficiently small. Then, the aim is to know exactly how small T should be made, in order to ensure the required properties of x̂_T. This aim can be fully achieved, under the further assumption that M is a simply connected compact Riemannian symmetric space.

Given these two assumptions, the following conclusion will be obtained : for each δ < r_cx (r_cx denotes the convexity radius of M), there exists a critical temperature T_δ such that T < T_δ implies π_T has a unique Riemannian barycentre x̂_T and this x̂_T belongs to the geodesic ball B(x*, δ). Moreover, if U is invariant by geodesic symmetry about x*, then x̂_T = x*.

Remark : if M is a Riemannian manifold, the convexity radius r_cx(x) of x ∈ M is the supremum of R > 0 such that the geodesic ball B(x, R) is convex (this is strictly positive, for any x ∈ M). The convexity radius r_cx(M) of M is the infimum of r_cx(x) over all x ∈ M (if M is compact, this is strictly positive). Here, r_cx(M) is just denoted r_cx.

Denote the variance function of the Gibbs distribution π_T in (2.7) by E_T. According to (2.1),

E_T(y) = ½ ∫_M d²(y, x) π_T(dx)   (2.8)

Throughout the following, it will be assumed that the potential function U, which appears in (2.7), has a unique global minimum at x* ∈ M. While U is not required to be smooth, it is required to be well-behaved near x*, in the sense that there exist µ_min, µ_max > 0 and ρ > 0 such that

(µ_min/2) d²(x, x*) ≤ U(x) − U(x*) ≤ (µ_max/2) d²(x, x*)   (2.9)

whenever d(x, x*) ≤ ρ.
This is always verified if U is twice differentiable at x ∗ , and the spectrumof Hess U ( x ∗ ) is contained in the open interval ( µ min , µ max ).The following Proposition 2.1 establishes the concentration property of the Riemannianbarycentres of π T as the temperature T is made small. In this proposition, W denotes theKantorovich ( L -Wasserstein) distance, and δ x ∗ the Dirac distribution concentrated at x ∗ . Proposition 2.1. Let M be a compact, orientable Riemannian manifold, with positive sectionalcurvatures, and dimension equal to n .(i) Let η > . For any Riemannian barycentre ˆ x T of π T W ( π T , δ x ∗ ) < η M = ⇒ d (ˆ x T , x ∗ ) < η (2.10) where diam M is the diameter of M .(ii) There exists a temperature T W such that T ≤ T W implies W ( π T , δ x ∗ ) ≤ (8 π ) B − n (cid:16) π (cid:17) n − (cid:18) µ max µ min (cid:19) n (cid:18) Tµ min (cid:19) (2.11) where B n = B (1 / , n/ in terms of the Euler Beta function. Proposition 2.1 shows exactly how small T should be made, in order to ensure that all theRiemannian barycentres ˆ x T concentrate within an open ball B ( x ∗ , η ). Roughly, (i) states that,if π T is close to δ x ∗ , then all ˆ x T will be close to x ∗ . On the other hand, (ii) bounds the distancebetween π T and δ x ∗ , as a function of T . The temperature T W mentioned in (ii) will be expressedexplicitly in 2.7, below.Here, two things should be noted, concerning (2.11). First, this inequality is both optimaland explicit. It is optimal because the dependence on T in its right-hand side cannot beimproved. Indeed, the multi-dimensional Laplace approximation (for example, see [31]), shows30he left-hand side is equivalent to L · T when T → 0. While this constant L is not tractable,the constants appearing in (2.11) depend explicitly on the manifold M and the function U . Infact, (2.11) does not follow from the multi-dimensional Laplace approximation, but rather fromthe volume comparison theorems, in 1.9.1.Second, in spite of these nice properties, (2.11) does not escape the curse of dimensionality.Indeed, for fixed T , its right-hand side increases exponentially with the dimension n of M (notethat B n decreases like n − ). In fact, the temperature T W also depends on n , but it is typicallymuch less affected by it, and decreases slower than n − as n increases. Assume that M is a simply connected compact Riemannian symmetric space. Under thisassumption, it turns out that the variance function E T ( y ) is C throughout M , for any value T > T . This surprising result is contained in Proposition 2.2.To state Proposition 2.2, consider for x ∈ M the function f x ( y ) = d ( x, y ) / 2. Recall from1.7.4 that this function is C on the open set D( x ) = M − Cut( x ). When y ∈ D( x ), denote G y ( x ) and H y ( x ) the gradient and Hessian of f x ( y ).With this notation, for any x ∈ M , the gradient G y ( x ) belongs to T y M , and the Hessian H y ( x ) defines a symmetric bilinear form on T y M . However (recall the remarks in 1.7.3), both G y ( x ) and H y ( x ) are singular on Cut( x ), where H y ( x ) will even blow up, as it has an eigenvalueequal to −∞ . Proposition 2.2. Let M be a simply connected compact Riemannian symmetric space.(i) The following integrals converge for any temperature T > G y = Z D( y ) G y ( x ) π T ( dx ) ; H y = Z D( y ) H y ( x ) π T ( dx ) (2.12) and both depend continuously on y .(ii) The gradient and Hessian of the variance function E T ( y ) are given by grad E T ( y ) = G y ; Hess E T ( y ) = H y (2.13) so that E T ( y ) is C throughout M . 
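Before discussing the proof of Proposition 2.2, the T^{1/2} behaviour in (2.11) can be observed numerically in a simple case. The sketch below (mine, not from the thesis) takes M = S², the unit sphere, with U(x) = d²(x, x*)/2, so that µ_min = µ_max = 1 in (2.9); by rotational symmetry about x*, W(π_T, δ_{x*}) reduces to a one-dimensional radial integral, evaluated here with a plain Riemann sum (numpy assumed).

```python
import numpy as np

# Toy illustration on the unit sphere S^2 (n = 2), with U(x) = d^2(x, x*)/2.
# By rotational symmetry around x*, integrals against pi_T reduce to integrals
# in r = d(x, x*), with the area element 2*pi*sin(r) dr, and the Kantorovich
# distance W(pi_T, delta_{x*}) is just the first moment of r.
r = np.linspace(0.0, np.pi, 20001)
dr = r[1] - r[0]

for T in [0.5, 0.1, 0.02, 0.004]:
    w = np.exp(-(0.5 * r ** 2) / T) * np.sin(r)   # unnormalised radial density
    Z = np.sum(w) * dr
    W1 = np.sum(r * w) * dr / Z                   # W(pi_T, delta_{x*})
    print(T, W1, W1 / np.sqrt(T))                 # last column stabilises as T -> 0
```

The last printed column stabilises near (π/2)^{1/2}, in agreement with the T^{1/2} rate; the limiting constant is the one given by the Laplace approximation, not by the explicit (and necessarily looser) bound (2.11).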
The proof of Proposition 2.2 relies on the following lemma. Lemma 2.1. Assume M is a simply connected compact Riemannian symmetric space. Let γ : I → M be a geodesic defined on a compact interval I . Denote by Cut( γ ) the union of allcut loci Cut( γ ( t )) for t ∈ I . Then, the Hausdorff dimension of Cut( γ ) is strictly less than thedimension of M . In particular, Cut( γ ) is a set with Riemannian volume equal to zero. Remark : the assumption that M is simply connected cannot be removed. For example, theconclusion of Lemma 2.1 does not hold if M is a real projective space.The proof of Lemma 2.1 uses the structure of Riemannian symmetric spaces, as well as someresults from dimension theory, found in [32]. The notion of Hausdorff dimension is needed,because Cut( γ ) may fail to be a manifold.Lemma 2.1 is crucial to Proposition 2.2, because it leads to the following expression, E T ( γ ( t )) = Z M f x ( γ ( t )) π T ( dx ) = Z D( γ ) f x ( γ ( t )) π T ( dx ) for all t ∈ I where D( γ ) = M − Cut( γ ), and the second inequality follows since Cut( γ ) has Riemannianvolume equal to zero. Then, recalling that x ∈ Cut( γ ( t )) if and only if γ ( t ) ∈ Cut( x ), itbecomes possible to differentiate f x ( γ ( t )) under the integral. This leads to the proof of (ii).31 .6 Uniqueness of the barycentre The following Proposition 2.3 establishes the uniqueness of ˆ x T as the temperature T is madesmall. As in the previous Proposition 2.2, M is a simply connected compact Riemanniansymmetric space. The convexity radius of M is denoted r cx . This is given by r cx = π c − (see2.8, below).Recall the definition (2.7) of the Gibbs distribution π T , where the potential function U hasa unique global minimum at x ∗ ∈ M . Let s x ∗ denote the geodesic symmetry at x ∗ (recalldefinition from 1.9.2). The potential function U is said to be invariant by geodesic symmetryabout x ∗ , if U ◦ s x ∗ = U . Proposition 2.3. Let M be a simply connected compact Riemannian symmetric space, withconvexity radius r cx . For δ < r cx , there exists a critical temperature T δ such that(i) When T < T δ , the Riemannian barycentre ˆ x T of π T is unique and ˆ x T ∈ B ( x ∗ , δ ) .(ii) If, in addition, U is invariant by geodesic symmetry about x ∗ , then ˆ x T = x ∗ . Proposition 2.3 shows exactly how small T should be made, in order to ensure that theRiemannian barycentre ˆ x T is unique. In turn, this uniqueness of ˆ x T follows from the convexityof the variance function E T ( y ), obtained in the following Lemma 2.2.To state this lemma, consider the function f ( T ) of the temperature Tf ( T ) = (cid:18) π (cid:19) (cid:16) π (cid:17) n − (cid:16) µ max T (cid:17) n exp (cid:18) − U δ T (cid:19) (2.14)for any given δ , where U δ = inf { U ( x ) − U ( x ∗ ) ; x / ∈ B ( x ∗ , δ ) } . Note that f ( T ) decreases to zeroas T is made arbitrarily small. Lemma 2.2. Under the same assumptions as Proposition 2.3, let δ < r cx .(i) For all y ∈ B ( x ∗ , δ ) , Hess E T ( y ) ≥ Ct(2 δ )[1 − vol( M ) f ( T )] − πA M f ( T ) (2.15) where Ct(2 δ ) = 2 cδ cot(2 cδ ) and A M > is a constant which depends only on the symmetricspace M .(ii) There exists a critical temperature T δ such that T < T δ implies the variance function E T ( y ) is strongly convex on B ( x ∗ , δ ) . The inequality in (2.15) should be understood as saying all the eigenvalues of Hess E T ( y )are greater than the right-hand side (of course, this is an abuse of notation). The criticaltemperature T δ will be expressed in the following section. 
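Before giving the exact expressions of T_W and T_δ, it may help to see how a critical temperature is extracted from Lemma 2.2 in practice: one solves the scalar condition f(T) < Ct(2δ)[Ct(2δ) vol(M) + πA_M]⁻¹ for T. The sketch below (assuming scipy) uses purely illustrative, hypothetical constants, which are not values computed from any particular symmetric space, and a schematic stand-in for f(T).

```python
import numpy as np
from scipy.optimize import brentq

# Hypothetical constants, for illustration only (not values from the thesis).
c, delta = 1.0, 0.5            # c^2 = maximum sectional curvature, delta < r_cx
vol_M, A_M = 4 * np.pi, 1.0    # volume of M and the constant A_M of (2.15)
mu_max, U_delta, n = 2.0, 0.3, 2

def f(T):
    # schematic stand-in for the function f(T) of (2.14); it tends to 0 as T -> 0
    return (mu_max / T) ** (n / 2) * np.exp(-U_delta / T)

Ct = 2 * c * delta / np.tan(2 * c * delta)     # Ct(2*delta) = 2*c*delta * cot(2*c*delta)
threshold = Ct / (Ct * vol_M + np.pi * A_M)    # f(T) below this makes (2.15) positive

# smallest temperature at which f crosses the threshold: below it, the lower
# bound (2.15) is strictly positive on B(x*, delta)
T_delta = brentq(lambda T: f(T) - threshold, 1e-3, 0.3)
print(T_delta, f(0.5 * T_delta) < threshold)   # True: the condition holds below T_delta
```

Below the computed temperature, the right-hand side of (2.15) is strictly positive, so the variance function is strongly convex on B(x*, δ) and the barycentre there is unique.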
T W and T δ The present paragraph provides expressions of the temperatures T W and T δ , which appearin Propositions 2.1 and 2.3. These are expressions (2.17) and (2.18) below, which should beconsidered as part of Propositions 2.1 and 2.3, and will accordingly be proved in 2.9.Expressions (2.17) and (2.18) allow T W and T δ to be computed as solutions of scalar non-linear equations, which depend on Condition (2.9) and on the Riemannian symmetric space M .In order to state them, write f ( T, m, ρ ) = (cid:18) π (cid:19) (cid:16) µ max T (cid:17) m exp (cid:18) − U ρ T (cid:19) (2.16)32n terms of the temperature T and positive m and ρ , where U ρ is defined as in (2.14). It shouldbe noted that f ( T, m, ρ ) decreases to 0 as T is made arbitrarily small, for fixed m and ρ . Thefollowing expression holds for T W , T W = min { T W , T W } (2.17)with T W and T W given by T W = inf { T > f ( T, n − , ρ ) > ρ − n A n − } T W = inf (cid:8) T > f ( T, n + 1 , ρ ) > ( µ max / µ min ) n C n (cid:9) where A n is the n -th absolute moment of a standard normal random variable ( A n = E | X | n where X ∼ N (0 , C n = ( ω n − A n )/(diam M × vol M ) , where ω n − is the area of theunit sphere S n − ⊂ R n . Moreover, for T δ , T δ = min { T δ , T δ } (2.18)where, in the notation of (2.11) and (2.15), T δ = inf n T ≤ T W : (2 πT / µ min ) > δ ( µ min / µ max ) n D n o T δ = inf (cid:8) T ≤ T W : f ( T ) > Ct(2 δ )[Ct(2 δ )vol( M ) + πA M ] − (cid:9) with D n = (2 /π ) n − B n /(4diam M ) . Remark : the following formulae for A n and ω n − will be useful in 2.9, A n = π − n Γ(( n + 1) / 2) ; ω n − = 2 π n Γ( n/ 2) (2.19)These are well-known, and follow easily from the definition of the Euler Gamma function [33]. Compact Riemannian symmetric spaces belong to the “compact case”, already treated in 1.9.2.Some additional material, on these spaces, is needed for the proofs of Propositions 2.2 and 2.3. As of now, let M = G/K be a symmetric space, where G is semisimple and compact, and K = K y the stabiliser in G of some point y ∈ M . Recall the Cartan decomposition g = k + p ,where g and k are the Lie algebras of G and K , respectively. Moreover, let a be a maximalAbelian subspace of p , and denote ∆ + the corresponding set of positive roots λ : a → R .Then, p may be identified with T y M , and any v ∈ p can be written v = Ad( k ) a for some k ∈ K and a ∈ a . Accordinly, the self-adjoint curvature operator, R v (given by R v ( u ) = [ v , [ v, u ]]for u ∈ T y M ), can be diagonalised (the reader may wish to note (2.20) differs from (1.104) bya minus sign, since the space here denoted p would have been p ∗ = i p , in Chapter 1)Ad( k − ) ◦ R v ◦ Ad( k ) = R a where R a = − X λ ∈ ∆ + ( λ ( a )) Π λ (2.20)and where Π λ is the orthogonal projector onto the eigenspace of R a which corresponds to theeigenvalue − ( λ ( a )) . The rank of Π λ is denoted m λ and called the multiplicity of λ .33ecall that the curvature tensor of a symmetric space is parallel : ∇ R = 0. This property,when combined with the diagonalisation (2.20), yields the solutions of the operator Jacobiequation (1.69), and of the Ricatti equation (1.71).Alternatively, if A ( t ) solves (1.69) and A ( t ) = Π t ◦ A ( t ), where Π t denotes parallel transport,along the geodesic c v with c v (0) = y and ˙ c v (0) = v , then A ( t ) solves the differential equation A ′′ − R v A = 0 A (0) = 0 , A ′ (0) = Id y (2.21)where the prime denotes differentiation with respect to t . 
Using (2.20), it follows that A ( t ) = Π k a + X λ ∈ ∆ + (sin( λ ( a ) t )/ λ ( a )) Π kλ (2.22)where Π k a = Ad( k ) ◦ Π a ◦ Ad( k − ) and Π kλ = Ad( k ) ◦ Π λ ◦ Ad( k − ), with Π a the orthogonalprojector onto a . Let M be a compact Riemannian symmetric space, as above. Assume, as in Propositions 2.2and 2.3, that M is simply connected. In ths case, the following important property holds [9] :the cut locus of any point y ∈ M is identical to the first conjugate locus of this point.Accordingly, if v is a unit vector in p ≃ T y M , the geodesic c v will meet the cut locus of y for the first time, when det( A ( t )) = 0 for the first time after t = 0. But, as seen from (2.22),if v = Ad( k ) a , then this happens when t = t( v ) given byt( v ) = min λ ∈ ∆ + π | λ ( a ) | = min λ ∈ ∆ + πλ ( a ) (2.23)where the absolute value can be dropped because it is always possible to assume a belongs to¯ C + , the closure of the Weyl chamber C + (the set of a ∈ a such that λ ( a ) > λ ∈ ∆ + ).If M is an irreducible symmetric space, then there exists a maximal root c ∈ ∆ + , so that c ( a ) ≥ λ ( a ) for all λ ∈ ∆ + and a ∈ ¯ C + [10]. In this case, t( v ) = π/c ( a ). On the other hand, if M is not irreducible, it is a product of irreducible compact Riemannian symmetric spaces, say M = M × . . . × M s . If c , . . . , c s are the corresponding maximal roots,t( v ) = min ℓ =1 ,...,s πc ℓ ( a ) (2.24)The cut locus of y is the set of all points c v (t( v )) where v is a unit vector in T y M . Then, theinjectivity radius inj( y ) of y is equal to the minimum of t( v ), taken over all unit vectors v .From (2.24), this is equal to π c − where c = max ℓ =1 ,...,s k c ℓ k and k c ℓ k denotes the norm of c ℓ ∈ a ∗ (the dual space of a ). Since M is a homogeneous space, the injectivity radius of M isalso equal to π c − , since it is equal to the injectivity radius of any point y in M . Incidentally, c is the maximum sectional curvature of M .With a bit of additional work, the above description of the cut locus of y can be strengthened,to yield the following statements. Let S = K/K a where K a is the centraliser of a in K . Moreover,denote Q + the set of a ∈ a such that λ ( a ) ∈ (0 , π ) for each λ ∈ ∆ + . Then, consider the mapping ϕ ( s, a ) = Exp y ( β ( s, a )) ( s, a ) ∈ S × ¯ Q + (2.25)where β ( s, a ) = Ad( s ) a and ¯ Q + is the closure of Q + . This mapping ϕ is onto M , and is adiffeomorphism of S × Q + onto its image M r , which is also the set of regular values of ϕ . Finally,Cut( y ) = ϕ ( S × ¯ Q π ) where ¯ Q π = ¯ Q + ∩ ( ∪ ℓ { a : c ℓ ( a ) = π } ) (2.26)34 .8.3 The squared distance function For x ∈ M , consider the squared distance function f x ( y ) = d ( x, y ) / 2. If x / ∈ Cut( y ), then f x is C near y (this is because y ∈ Cut( x ) if and only if x ∈ Cut( y )).In this case, write x = ϕ ( s, a ), where the map ϕ was defined in (2.25). Let G y ( x ) and H y ( x )denote the gradient and Hessian of f x at y . These are given by G y ( x ) = − β ( s, a ) (2.27) H y ( x ) = Π s a + X λ ∈ ∆ + λ ( a ) cot λ ( a ) Π sλ (2.28)in the notation of (2.22). Here, (2.27) follows from (1.75), since x = Exp y ( β ( s, a )), and (2.28)follows from the solution of the Ricatti equation (1.71), discussed in 2.8.1.If M is simply connected, then Cut( y ) is given by (2.26). Now, if x ∈ Cut( y ) is written x = ϕ ( s, a ), then λ ( a ) = π for some λ ∈ ∆ + ( λ = c ℓ which achieves the minimum in (2.24)).By (2.28), H y ( x ) then has an eigenvalue equal to −∞ . 
In other words, H y ( x ) blows up when x approaches Cut( y ).The convexity radius of a simply connected compact Riemannian symmetric space M is equal to half its injectivity radius. Accordingly, the convexity radius of M is r cx = ( π/ c − .The proof of this statement may be summarised in the following way :If δ < r cx , then any y , y in B ( x, δ ) must have d ( y , y ) < π c − , the injectivity radius of M ,and are therefore connected by a unique length-minimising geodesic curve γ . But, by (2.28), thesquared distance function f x is convex on B ( x, δ ), where all eigenvalues of its Hessian are greaterthan cδ cot( cδ ) > 0. This can be used to show that the geodesic γ lies entirely in B ( x, δ ) [34](Page 177). In other words, the geodesic ball B ( x, δ ) is convex. On the other hand [9], if δ = r cx then there exists a closed (i.e. periodic) geodesic, of length 2 π c − , contained in B ( x, δ ), so thatthis geodesic ball cannot be convex. Consider again the map ϕ , defined in (2.25). Let M r denote the set of regular values of ϕ .By Sard’s lemma [16], the complement of M r in M has zero Riemannian volume. Therefore,if f : M → R is a measurable function, Z M f ( x )vol( dx ) = Z M r f ( x )vol( dx ) (2.29)However, it was seen in 2.8.2 that ϕ is a diffeomorphism of S × Q + onto M r . Then, performinga “change of variables”, it follows that Z M f ( x )vol( dx ) = Z Q + Z S f ( s, a ) D ( a ) daω ( ds ) (2.30)where f ( s, a ) = f ( ϕ ( s, a )) and ϕ ∗ (vol) = D ( a ) daω ( ds ). In particular, the “volume density” D ( a ) can be read from (1.113), D ( a ) = Y λ ∈ ∆ + | sin λ ( a ) | m λ (2.31)where the absolute value may be dropped, whenever a ∈ Q + is understood from the context. Remark : the integral formula (2.30) is somewhat similar to (1.115). Roughly, both formulaeinvolve the same change of variables, but (2.30) takes advantage of the the description of thecut locus of y in (2.26). Of course, (2.30) only works when the compact symmetric space M issimply connected. 35 .9 All the proofs Throughout the following proofs, it will be assumed that U ( x ∗ ) = 0. There is no loss ofgenerality in making this assumption. Indeed, looking back at the definition (2.7) of the Gibbsdistribution π T , it is clear that a factor exp( − U ( x ∗ ) /T ) may always be absorbed into Z ( T ). Proof of (i) For each y ∈ M , let f y ( x ) = d ( y , x ) / 2. It follows from (2.8) that E T ( y ) = Z M f y ( x ) π T ( dx ) (2.32)On the other hand, consider the function E ( y ), E ( y ) = Z M f y ( x ) δ x ∗ ( dx ) = d ( y , x ∗ ) / y ∈ M , it is elementary that f y ( x ) is a Lipschitz function of x , with Lipschitz constantdiam M . Then, from the Kantorovich-Rubinshtein formula [35] (see VIII.4) |E T ( y ) − E ( y ) | ≤ (diam M ) W ( π T , δ x ∗ ) (2.34)a uniform bound in y ∈ M . It now follows that, for any η > y ∈ B ( x ∗ ,η ) E T ( y ) − inf y ∈ B ( x ∗ ,η ) E ( y ) ≤ (diam M ) W ( π T , δ x ∗ ) (2.35)inf y / ∈ B ( x ∗ ,η ) E ( y ) − inf y / ∈ B ( x ∗ ,η ) E T ( y ) ≤ (diam M ) W ( π T , δ x ∗ ) (2.36)However, from (2.33), it is clear thatinf y ∈ B ( x ∗ ,η ) E ( y ) = 0 and inf y / ∈ B ( x ∗ ,η ) E ( y ) = η y ∈ B ( x ∗ ,η ) E T ( y ) < η < inf y / ∈ B ( x ∗ ,η ) E T ( y ) (2.37)However, this means any global minimum of E T ( y ) must belong to B ( x ∗ , η ). Equivalently, anyRiemannian barycentre ˆ x T of π T must verify d (ˆ x T , x ∗ ) < η . Thus, the conclusion in (2.10) holds. Proof of (ii) Recall the condition in (2.9), which holds for d ( x, x ∗ ) ≤ ρ . 
By choosing ρ < min { inj( x ∗ ) , π c − } ,it will be possible to apply (1.94) from 1.9.1, in the remainder of the proof. Consider thetruncated distribution π ρT ( dx ) = Bρ ( x ) π T ( B ρ ) π T ( dx ) (2.38)where denotes the indicator function, and B ρ denotes the open ball B ( x ∗ , ρ ). Of course, bythe triangle inequality W ( π T , δ x ∗ ) ≤ W ( π T , π ρT ) + W ( π ρT , δ x ∗ ) (2.39)36ow, the proof relies on the following estimates, which use the notation of 2.7. – first estimate : if T ≤ T W , then W ( π T , π ρT ) ≤ (diam M × vol M ) (cid:18) π (cid:19) (cid:16) π (cid:17) n (cid:16) µ max T (cid:17) n exp (cid:18) − U ρ T (cid:19) (2.40) – second estimate : if T ≤ T W , then W ( π ρT , δ x ∗ ) ≤ (2 π ) B − n (cid:16) π (cid:17) n − (cid:18) µ max µ min (cid:19) n (cid:18) Tµ min (cid:19) (2.41)These two estimates will be proved below. To obtain (2.11), assume that they hold, and that T ≤ T W . Then, T ≤ T W and the definition of T W implies f ( T, n + 1 , ρ ) ≤ ( µ max / µ min ) n C n Using the definition of C n and formulae (2.19), this inequality reads(diam M × vol M ) f ( T, n + 1 , ρ ) ≤ π ) n B − n ( µ max / µ min ) n This is the same as(diam M × vol M ) π − (cid:16) π (cid:17) n f ( T, n + 1 , ρ ) ≤ (cid:16) π (cid:17) n − B − n ( µ max / µ min ) n From the definition of f ( T, n + 1 , ρ ), it then follows that the right-hand side of (2.40) is lessthan half the right-hand side of (2.41). Since this is the case, (2.11) follows from the triangleinequality (2.39). – proof of first estimate : consider the probability distribution K on M × M , K ( dx × dx ) = π ρT ( dx ) h π T ( B ρ ) δ x ( dx ) + Bcρ ( x ) π T ( dx ) i (2.42)where B cρ denotes the complement of B ρ in M . This distribution K provides a coupling between π T and π ρT . Therefore, replacing (2.42) into the definition of the Kantorovich distance, it follows W ( π T , π ρT ) ≤ (diam M ) π T ( B cρ ) (2.43)However, the definition (2.7) of π T implies π T ( B cρ ) ≤ ( Z ( T )) − (vol M ) exp (cid:18) − U ρ T (cid:19) (2.44)Now, (2.40) follows directly from (2.43) and (2.44), if the following lower bound on Z ( T ) canbe proved Z ( T ) ≥ (cid:16) π (cid:17) (cid:18) π (cid:19) n (cid:18) Tµ max (cid:19) n for T ≤ T W (2.45)To prove this lower bound, note that Z ( T ) = Z M exp (cid:18) − U ( x ) T (cid:19) vol( dx ) ≥ Z B ρ exp (cid:18) − U ( x ) T (cid:19) vol( dx )Replacing (2.9) into this last inequality, it is possible to write Z ( T ) ≥ Z B ρ exp (cid:18) − U ( x ) T (cid:19) vol( dx ) ≥ Z B ρ exp (cid:16) − µ max T d ( x, x ∗ ) (cid:17) vol( dx ) (2.46)37ince ρ < min { inj( x ∗ ) , π c − } , it is possible to apply (1.94) from 1.9.1, to (2.46). Specifically,the lower bound in (1.94) yields, Z ( T ) ≥ ω n − Z ρ e − µ max2 T r ( c − sin( cr )) n − dr ≥ ω n − (2 /π ) n − Z ρ e − µ max2 T r r n − dr (2.47)where the second inequality follows since sin( t ) is a concave function of t ∈ [0 , π/ cr ) ≥ (2 /π ) cr for r ∈ [0 , ρ ]. Now, the required bound (2.45) follows from (2.47) by noting Z ρ e − µ max2 T r r n − dr = (2 π ) (cid:18) Tµ max (cid:19) n A n − − Z ∞ ρ e − µ max2 T r r n − dr where A n = E | X | n for X ∼ N (0 , Z ∞ ρ e − µ max2 T r r n − dr ≤ ρ n − Tµ max e − µ max2 T ρ ≤ ρ n − Tµ max e − UρT Indeed, taken together, these give Z ( T ) ≥ ω n − (2 /π ) n − " (2 π ) (cid:18) Tµ max (cid:19) n A n − − ρ n − Tµ max e − UρT (2.48)Then, (2.45) can be obtained by noting that the second term in square brackets is negligible incomparison to the first as T → 0, and using formulae (2.19) for A n − and ω n − . 
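Before turning to the second estimate, the constants can be checked numerically. The following quick verification of the absolute-moment formula A_n = 2^{n/2} π^{−1/2} Γ((n+1)/2) quoted in (2.19) assumes scipy, and is of course not part of the proof.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gamma

def A(n):
    # n-th absolute moment of a standard normal variable, as in (2.19)
    return 2 ** (n / 2) * gamma((n + 1) / 2) / np.sqrt(np.pi)

for n in range(1, 6):
    numeric, _ = quad(lambda x: abs(x) ** n * np.exp(-x * x / 2) / np.sqrt(2 * np.pi),
                      -np.inf, np.inf)
    print(n, round(numeric, 6), round(A(n), 6))   # the two columns agree
```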
– proof of second estimate : the Kantorovich distance W ( π ρT , δ x ∗ ) between π ρT and δ x ∗ isequal to the first-order moment Z M d ( x, x ∗ ) π ρT ( dx )According to (2.7) and (2.38), this means W ( π ρT , δ x ∗ ) = ( π T ( B ρ ) Z ( T )) − Z B ρ d ( x, x ∗ ) exp (cid:18) − U ( x ) T (cid:19) vol( dx )Using (2.7) to express the probability in parentheses, this becomes W ( π ρT , δ x ∗ ) = R B ρ d ( x, x ∗ ) exp (cid:16) − U ( x ) T (cid:17) vol( dx ) R B ρ exp (cid:16) − U ( x ) T (cid:17) vol( dx ) (2.49)A lower bound on the denominator can be found from (2.46) and the subsequent inequalities,which were used to prove (2.45). These inequalities provide Z B ρ exp (cid:18) − U ( x ) T (cid:19) vol( dx ) ≥ ω n − (2 /π ) n − (2 π ) A n − ( T / µ max ) n (2.50)whenever T ≤ T W . For the numerator in (2.49), it will be shown that Z B ρ d ( x, x ∗ ) exp (cid:18) − U ( x ) T (cid:19) vol( dx ) ≤ ω n − (2 π ) A n ( T / µ min ) n +12 (2.51)Then, (2.41) follows by dividing (2.51) by (2.50), and replacing in (2.49), using the fact that A n /A n − = (2 π ) B − n , which can be found from formulae (2.19). Now, it only remails to prove(2.51). This is done by noting, from (2.9), Z B ρ d ( x, x ∗ ) exp (cid:18) − U ( x ) T (cid:19) vol( dx ) ≤ Z B ρ d ( x, x ∗ ) exp (cid:16) − µ min T d ( x, x ∗ ) (cid:17) vol( dx )38pplying the upper bound in (1.94) (with κ min = 0), to the last integral, it follows that Z B ρ d ( x, x ∗ ) exp (cid:16) − µ min T d ( x, x ∗ ) (cid:17) vol( dx ) ≤ ω n − Z ρ e − µ min2 T r r n dr ≤ ω n − Z ∞ e − µ min2 T r r n dr The integral on the right-hand side is half the n -th absolute moment of a normal distribution.By expressing it in terms of A n , it is possible to directly recover (2.51). Proof of (i) Under the integrals in (2.12), G y ( x ) and H y ( x ) are given by (2.27) and (2.28), for any x ∈ D( y ).Furthermore, by (2.27) and (2.28), both integrands G y ( x ) and H y ( x ) are continuous on thedomaine of integration D( y ).The integral G y converges, because G y ( x ) is uniformly bounded on D( y ). Indeed, from(2.27), k G y ( x ) k y = k β ( s, a ) k y = d ( y , x )where the second equality follows from the fact that x = Exp y ( β ( s, a )). Of course, d ( y , x ) isalways less than diam M .The integral H y is an improper integral, since H y ( x ) blows up when x approaches Cut( y ),as explained in 2.8.3. Nonetheless, this integral converges absolutely, as shall be seen from thematerial in 2.8.4.Precisely, recall the mapping ϕ defined in (2.25). Because M is simply connected, Cut( y )is identical to the first conjugate locus of y . This means that Cut( y ) is contained in the set ofcritical values of Exp y , and therefore also in the set of critical values of ϕ . Equivalently, D( y )contains the set of regular values of ϕ , denoted M r in (2.29). It then follows, as in (2.30), H y = Z Q + Z S H y ( s, a ) p T ( s, a ) D ( a ) daω ( ds ) (2.52)where p T denotes the density of π T with respect to the Riemannian volume.To prove that H y converges absolutely, it is enough to prove the integrand in (2.52) isuniformly bounded. However, the density p T is bounded, since it is continuous and M iscompact. Moreover, it is clear from (2.28) and (2.31) that H y ( s, a ) = Π s a + X λ ∈ ∆ + λ ( a ) cot λ ( a ) Π sλ (2.53) D ( a ) = Y λ ∈ ∆ + | sin λ ( a ) | m λ (2.54)The product of these two expressions is uniformly bounded, because λ ( a ) ∈ (0 , π ) on Q + .Thus, the integrals G y and H y converge, and it is clear from the above that this is true forany temperature T > 0. 
The fact that both G y and H y depend continuously on y will be clearfrom the arguments in the proof of (ii). Proof of (ii) The proof relies in a crucial way on Lemma 2.1, which is proved in 2.9.3, below. To compute thegradient and Hessian of E T at y ∈ M , consider any geodesic γ : I → M , defined on a compactinterval I = [ − τ, τ ], with γ (0) = y . For each t ∈ I , it is immediate from (2.8) that E T ( γ ( t )) = Z M f x ( γ ( t )) π T ( dx ) (2.55)39owever, Lemma 2.1 states that the setCut( γ ) = [ t ∈ I Cut( γ ( t ))has Riemannian volume equal to zero. Thus, since π T is uniformly continuous with respect toRiemannian volume, Cut( γ ) can be removed from the domain of integration in (2.55), to obtain E T ( γ ( t )) = Z D( γ ) f x ( γ ( t )) π T ( dx ) for all t ∈ I (2.56)where D( γ ) = M − Cut( γ ). Now, if x ∈ D( γ ), then x / ∈ Cut( γ ( t )) for any t ∈ I . According to2.8.3, this implies that f x ( γ ( t )) is C near each t ∈ I . In other words, f x ( t ) = f x ( γ ( t )) is C inside the interval I . Then, the first and second derivatives of this function are given by f ′ x ( t ) = (cid:10) G γ ( t ) ( x ) , ˙ γ (cid:11) γ ( t ) ; f ′′ x ( t ) = (cid:10) H γ ( t ) ( x ) · ˙ γ , ˙ γ (cid:11) γ ( t ) (2.57)as in Proposition 1.1. Formally, (2.13) follows by differentiating under the integral sign in (2.56),replacing from (2.57), and then putting t = 0. This differentiation under the integral sign isjustified, as soon as it is shown that the families of functions, (cid:8) x G γ ( t ) ( x ) ; t ∈ I (cid:9) ; (cid:8) x H γ ( t ) ( x ) ; t ∈ I (cid:9) which all have common domain of definition D( γ ), are uniformly integrable with respect to theprobability distribution π T (precisely, with respect to the restriction of π T to D( γ )).Roughly (for the exact definition, see [16]), uniform integrability means that the rate ofabsolute convergence of the following integrals G γ ( t ) = Z D( γ ) G γ ( t ) ( x ) π T ( dx ) ; H γ ( t ) = Z D( γ ) H γ ( t ) ( x ) π T ( dx ) (2.58)does not depend on t ∈ I . This is clear for the integrals G γ ( t ) because, as in the proof of (i), k G γ ( t ) ( x ) k γ ( t ) = d ( γ ( t ) , x )and this is bounded by diam M , independently of x and t .Then, consider the integral H γ (0) = H y , and recall Formulae (2.52)–(2.53). For simplicity,assume that M is an irreducible symmetric space (see Chapter VIII of [10], Page 307). In thiscase, according to 2.8.2, there exists a maximal root c ∈ ∆ + , so that c ( a ) ≥ λ ( a ) for all λ ∈ ∆ + and a ∈ Q + . Therefore, it follows from (2.53) that k H y ( x ) k F ≤ (dim M ) max { , | c ( a ) cot c ( a ) |} (2.59)where k · k F denotes the Frobenius norm given by the Riemannian metric of M . Now, therequired uniform integrability is equivalent to the statement thatlim K →∞ Z D( γ ) k H y ( x ) k F {k H y ( x ) k F > K } π T ( dx ) = 0 uniformly in y (2.60)But from the inequality in (2.59), if K > ǫ > {k H y ( x ) k F > K } ⊂ { c ( a ) > π − ǫ } ǫ → K → ∞ . In this case, the integral in (2.60) is less than(dim M ) (sup p T ( x )) Z D( γ ) | c ( a ) cot c ( a ) | { c ( a ) > π − ǫ } vol( dx ) (2.61)By expressing this integral as in (2.52), it is seen to be equal to R Q + R S | c ( a ) cot c ( a ) | { c ( a ) > π − ǫ } D ( a ) daω ( ds ) = ω ( S ) R Q + [ | c ( a ) cot c ( a ) | D ( a )] { c ( a ) > π − ǫ } da In view, of (2.54), since c ∈ ∆ + , the function in square brackets is bounded on the closure of Q + by c = k c k (incidentally, this is the maximum sectional curvature of M , as explained in2.8.2). 
Finally, by (2.61), the integral in (2.60) is less than(dim M ) (sup p T ( x )) ω ( S ) c Z Q + { c ( a ) > π − ǫ } da Recall that c ( a ) ∈ (0 , π ) for a ∈ Q + . It is then clear this last integral converges to 0 as ǫ → y . This proves the required uniform integrability, so theproof is now complete, at least in the case where M is an irreducible symmetric space.In the general case, where M is not irreducible, it is enough to note that, according to 2.8.2, M is a product of irreducible Riemannian symmetric spaces, M = M × . . . × M s . Then, theproof boils down to the special case where M is irreducible, as treated above. The proof uses the following general remark. Remark : let M be a Riemannian manifold, and g : M → M be an isometry. Recall that g · y is used to denote g ( y ), for y ∈ M . Similarly, if A ⊂ M , let g · A denote the image of A under g .Then, for any y ∈ M , Cut( g · y ) = g · Cut( y ). This is because a point x ∈ M belongs toCut( y ), if and only if x is a first conjugate point to y along some geodesic, or there exist twodifferent length-minimising geodesics connecting y to x , and because both of these propertiesare preserved by any isometry g .Assume M is a simply connected compact Riemannian symmetric space. In the notation of2.8, M ≃ G/K . Recall (by Proposition 1.10 of 1.10) any geodesic γ : I → M is given by γ ( t ) = exp( tω ) · y (2.62)for some y ∈ M and ω ∈ p , where exp denotes the Lie group exponential. From the aboveremark, for each t ∈ I , the cut locus Cut( γ ( t )) of γ ( t ) is given byCut( γ ( t )) = exp( tω ) · Cut( y ) (2.63)However, Cut( y ) is described by (2.26) in 2.8.2, which readsCut( y ) = ϕ ( S × ¯ Q π ) (2.64)in terms of the mapping ϕ defined in (2.25). It follows from (2.63) and (2.64) thatCut( γ ) = Φ( I × S × ¯ Q π ) Φ( t, s, a ) = exp( tω ) · ϕ ( s, a ) (2.65)41he aim is to show that this set has Hausdorff dimension strictly less than dim M . This is doneusing results from dimension theory [32]. Precisely, note from (2.26) that¯ Q π = ∪ ℓ ¯ Q ℓ where ¯ Q ℓ = ¯ Q + ∩ { a : c ℓ ( a ) = π } Therefore, it is clear that Cut( γ ) = [ ℓ Φ( I × S × ¯ Q ℓ ) (2.66)Then, it follows from [32] (Item (2) of Theorem 2) thatdim H Cut( γ ) ≤ max ℓ dim H Φ( I × S × ¯ Q ℓ ) (2.67)where dim H is used to denote the Hausdorff dimension. Now, for each ℓ ,Φ( I × S × ¯ Q ℓ ) = Φ( I × S ℓ × ¯ Q ℓ ) ⊂ Φ( R × S ℓ × { c ℓ ( a ) = π } )where S ℓ = K/K ℓ with K ℓ the centraliser of { c ℓ ( a ) = π } in K . This last inclusion implies [32](Item (1) of Theorem 2)dim H Φ( I × S × ¯ Q ℓ ) ≤ dim H Φ( R × S ℓ × { c ℓ ( a ) = π } ) (2.68)To conclude, note that the set R × S ℓ × { c ℓ ( a ) = π } is a differentiable manifold. Let thisdifferentiable manifold be equipped with a product Riemannian metric (arising from flat metricson R and { c ℓ ( a ) = π } , and from the invariant metric induced onto S ℓ from K ). It is clear from(2.65) that Φ is smooth, and therefore locally Lipschitz. Then [32] (Item (5) of Theorem 2),dim H Φ( R × S ℓ × { c ℓ ( a ) = π } ) ≤ dim H ( R × S ℓ × { c ℓ ( a ) = π } ) (2.69)But the Hausdorff dimension of a Riemannian manifold is the same as its (usual) dimension.Accordingly, dim H ( R × S ℓ × { c ℓ ( a ) = π } ) = 1 + dim S ℓ + (dim a − a is dim a − 1. In addition, from [10] (Page 253),dim S ℓ < dim S . Therefore,dim H ( R × S ℓ × { c ℓ ( a ) = π } ) = dim S ℓ + dim a < dim M since dim M = dim S + dim a , as can be seen from (2.25). Replacing this into (2.69), it followsfrom (2.67) and (2.68) that dim Cut( γ ) < dim M . 
The lemma has therefore been proved. Assume that Lemma 2.2 is true. This lemma is proved in 2.9.5, below. Proof of (i) For δ < r cx , let T δ be given by (2.18). By (ii) of Lemma 2.2, T < T δ implies the variancefunction E T ( y ) is strongly convex on B ( x ∗ , δ ). It will be proved that any Riemannian barycentreˆ x T of π T belongs to B ( x ∗ , δ ). Then, since ˆ x T is a minimum of E T ( y ) in B ( x ∗ , δ ), it follows thatˆ x T is unique (thanks to the strong convexity of E T ( y )).By (i) of Proposition 2.1, to prove that any ˆ x T belongs to B ( x ∗ , δ ), it is enough to prove W ( π T , δ x ∗ ) < δ M (2.70)42owever, if T < T δ then T < T W and, by (ii) of Proposition 2.1, W ( π T , δ x ∗ ) satisfies inequality(2.11). In addition (from the definition of T δ and T δ ) one has T < T δ and(2 π ) ( T /µ min ) < δ ( µ min /µ max ) n D n By replacing the expression of D n and simplifying, this is the same as(2 π ) B − n ( π/ n − ( µ max /µ min ) n ( T /µ min ) < δ M (2.71)Now, (2.70) follows from (2.11) and (2.71). Proof of (ii) From the proof of (i), E T ( y ) is strongly convex on B ( x ∗ , δ ), and ˆ x T is the minimum of E T ( y )in B ( x ∗ , δ ). To prove that ˆ x T = x ∗ , it is then enough to prove that x ∗ is a stationary point of E T ( y ). However, the fact that U is invariant by the geodesic symmetry s x ∗ will be seen to imply ds x ∗ · G x ∗ = G x ∗ (2.72)which is equivalent to G x ∗ = 0, since ds x ∗ is equal to minus the identity on T x ∗ M (see 1.9.2).Then, by (2.13) in Proposition 2.2, grad E T ( x ∗ ) = 0, so x ∗ is indeed a stationary point of E T ( y ).To obtain (2.72), it is possible to write from (2.12), ds x ∗ · G x ∗ = ds x ∗ · Z D( x ∗ ) G x ∗ ( x ) π T ( dx ) (2.73)But, it follows from (2.25) and (2.27) that x = Exp x ∗ ( − G x ∗ ( x )). Then, since s x ∗ reversesgeodesics through x ∗ , ds x ∗ · G x ∗ ( x ) = G x ∗ ( s x ∗ · x )Replacing this into (2.74), and introducing the new variable of integration z = s x ∗ · x , ds x ∗ · G x ∗ = Z D( x ∗ ) G x ∗ ( z )( π T ◦ s x ∗ )( dz ) (2.74)since s − x ∗ = s x ∗ and s x ∗ maps D( x ∗ ) onto itself. Now, note that π T ◦ s x ∗ = π T . This is clear,since from (2.7), ( π T ◦ s x ∗ )( dz ) = ( Z ( T )) − exp (cid:20) − ( U ◦ s x ∗ )( z ) T (cid:21) (vol ◦ s x ∗ )( dz )However, by assumption, ( U ◦ s x ∗ )( z ) = U ( z ), since U is invariant by geodesic symmetryabout x ∗ . Moreover, because s x ∗ is an isometry, it must preserve Riemannian volume, so(vol ◦ s x ∗ )( dz ) = vol( dz ). Thus, (2.74) reads ds x ∗ · G x ∗ = Z D( x ∗ ) G x ∗ ( z ) π T ( dz )From (2.12), the right-hand side is G x ∗ , so (2.72) is obtained. From (2.72), since ds x ∗ = − Id x ∗ ,and G x ∗ belongs to T x ∗ M , G x ∗ = − G x ∗ Of course, this means G x ∗ = 0, as required. π T ◦ s x ∗ is the image of the distribution π T under the map s x ∗ : M → M . In other places of this thesis, thiswould be noted ( s x ∗ ) ∗ π T , but this notation seems kind of clumsy, in the present case. .9.5 Proof of Lemma 2.2 Proof of (i) Let y ∈ B ( x ∗ , δ ) where δ < r cx . Now, recall that Hess E T ( y ) = H y for all y ∈ B ( x ∗ , δ ). Then,from (2.12), it is possible to write H y = Z B ( y,r cx ) H y ( x ) π T ( dx ) + Z D( y ) − B ( y,r cx ) H y ( x ) π T ( dx ) (2.75)Indeed, B ( y , r cx ) ⊂ D( y ), since the injectivity radius of M is 2 r cx as given in 2.8.2. The firstintegral in (2.75) will be denoted I and the second integral I .With regard to I , note the inclusions B ( x ∗ , δ ) ⊂ B ( y , δ ) ⊂ B ( y , r cx ), which follow fromthe triangle inequality. 
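The symmetry argument just given is easy to confirm numerically. In the sketch below (a hypothetical example of mine, assuming numpy), M = S² and the potential has its unique global minimum at the north pole x*, is invariant under the geodesic symmetry s_{x*} : (x, y, z) ↦ (−x, −y, z), but is not rotationally symmetric; the gradient of E_T at x*, computed by quadrature from (2.12), vanishes up to discretisation error.

```python
import numpy as np

p = np.array([0.0, 0.0, 1.0])    # x* = north pole; s_{x*}: (x, y, z) -> (-x, -y, z)
T = 0.2

def U(x):
    # unique global minimum at p, invariant under the geodesic symmetry about p,
    # but not rotationally symmetric (hypothetical potential)
    return -x[2] + 0.5 * x[0] ** 2

def log_p(x):
    # inverse exponential Exp_p^{-1}(x) on the unit sphere
    c = np.clip(x[2], -1.0, 1.0)
    v = x - c * p
    n = np.linalg.norm(v)
    return np.zeros(3) if n < 1e-12 else np.arccos(c) * v / n

theta = np.linspace(1e-3, np.pi - 1e-3, 200)    # polar angle, avoiding the cut locus
phi = np.linspace(0.0, 2 * np.pi, 400, endpoint=False)
num, den = np.zeros(3), 0.0
for t in theta:
    for f in phi:
        x = np.array([np.sin(t) * np.cos(f), np.sin(t) * np.sin(f), np.cos(t)])
        w = np.exp(-U(x) / T) * np.sin(t)       # Gibbs weight times the area element
        num += w * log_p(x)
        den += w
print(-num / den)                               # grad E_T(x*): numerically ~ (0, 0, 0)
```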
In addition, note that H y ( x ) ≥ x ∈ B ( y , r cx ). Therefore, I ≥ Z B ( x ∗ ,δ ) H y ( x ) π T ( dx ) (2.76)where H y ( x ) is given by (2.28). But, from (2.28); the eigenvalues of H y ( x ) are λ ( a ) cot λ ( a ) ≥ min ℓ c ℓ ( a ) cot c ℓ ( a ) (2.77)where the maximal roots c ℓ were introduced before (2.24). By the Cauchy-Schwarz inequality, c ℓ ( a ) ≤ k c ℓ kk a k ≤ c k a k , where c denotes the maximum sectional curvature of M , whoseexpression was recalled in 2.8.2. Now, if x ∈ B ( y , δ ), then k a k = d ( y , x ) < δ , and it followsfrom (2.77) that H y ( x ) ≥ min ℓ c ℓ ( a ) cot c ℓ ( a ) ≥ cδ cot(2 cδ ) = Ct(2 δ ) > cδ < π . Replacing in (2.76) gives I ≥ Ct(2 δ ) π T ( B ( x ∗ , δ )) = Ct(2 δ )[1 − π T ( B c ( x ∗ , δ ))]Finally, (2.44) and (2.45) imply that π T ( B c ( x ∗ , δ )) ≤ vol( M ) f ( T ), where f ( T ) was defined in(2.14) (precisely, this follows after replacing ρ by δ in (2.44)). Thus, I ≥ Ct(2 δ )[1 − vol( M ) f ( T )] (2.79)The proof of (2.15) will be completed by showing I ≥ − πA M f ( T ) (2.80)To do so, introduce the function k ( a ) = min ℓ c ℓ ( a ) cot c ℓ ( a ) for a ∈ Q + (2.81)and note using (2.77) that I ≥ Z D( y ) − B ( y,r cx ) k ( a ) π T ( dx ) ≥ Z D( y ) { k ( a ) ≤ } k ( a ) π T ( dx ) (2.82)Indeed, the set of a such that k ( a ) ≤ y ) − B ( y , r cx ), because { k ( a ) ≤ } = ∪ ℓ { c ℓ ( a ) cot c ℓ ( a ) ≤ } = ∪ ℓ { c ℓ ( a ) ≥ π/ } (2.83)44nd c ℓ ( a ) ≥ π/ d ( y , x ) = k a k ≥ π c − = r cx (by Cauchy-Schwarz). By expressing thelast integral in (2.82) as in (2.52), it is seen to be equal to R Q + R S { k ( a ) ≤ } k ( a ) p T ( s, a ) D ( a ) daω ( ds ) ≥− π R Q + R S { k ( a ) ≤ } p T ( s, a ) daω ( ds )Indeed, it follows from (2.54) and (2.81) that k ( a ) D ( a ) ≥ min ℓ − c ℓ ( a ), and this is greater than − π because c ℓ ( a ) ∈ (0 , π ) for a ∈ Q + . Now, (2.82) implies I ≥ − π Z Q + Z S { k ( a ) ≤ } p T ( s, a ) daω ( ds ) (2.84)As seen from (2.83), the set { k ( a ) ≤ } ⊂ B c ( y , r cx ) ⊂ B c ( x ∗ , δ ). On the other hand, p T ( x ) ≤ f ( T ) for x ∈ B c ( x ∗ , δ ). Replacing in (2.84), I ≥ − πf ( T ) Z Q + Z S daω ( ds ) (2.85)The double integral on the right-hand side is a positive constant which depends only on thesymmetric space M . Denoting this by A M yields (2.80). Proof of (ii) Let δ < r cx . According to (2.15), which has just been provedHess E T ( y ) ≥ Ct(2 δ )[1 − vol( M ) f ( T )] − πA M f ( T ) (2.86)for all y ∈ B ( x ∗ , δ ). Now, let T δ be given by (2.18). It follows from the definition of T δ that T < T δ implies f ( T ) < Ct(2 δ )Ct(2 δ )vol( M ) + πA M (2.87)This amounts to saying the right-hand side of (2.86) is strictly positive. Since this is independentof y , it is clear that the variance function E T ( y ) is indeed strongly convex on B ( x ∗ , δ ).45 hapter 3 Gaussian distributions and RMT Contents Z ( σ ) . . . . . . . . . . . . . . . . . . . . . . . . . . . Z ( σ ) . . . . . . . . . . . . . . . . . . . . . . . . . . . N asymptotics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Gaussian distributions on Riemannian symmetric spaces were introduced in [36]. The present chapterexpands on this work in several ways. In particular, it uncovers and exploits the connection betweenGaussian distributions on Riemannian symmetric spaces and random matrix theory (RMT). • what is a Gaussian distribution ? , by adoptinga historical perspective (the source material used is from [37][38][39]). • maximum-likelihood estimation is equivalent to the Riemannian barycentre problem . 
• • • • N ) of N × N Hermitian positive-definite matrices. • N asymptotics of Gaussian distributions on H( N ). In particular, 3.7provides an asymptotic expression of the normalising factor. • U ( N ) (which is thedual symmetric space of H( N )), with a remarkable connection to Gaussian distributions on H( N ). .1 From Gauss to Shannon The story of Gaussian distributions is a story of discovery and re-discovery. Different scientists,at different times, were repeatedly lead to these distributions, through different routes.In 1801, on New Year’s day, Giuseppe Piazzi sighted a heavenly body (in fact, the asteroidCeres), which he thought to be a new planet. Less than six weeks later, this “new planet”disappeared behind the sun. Using a method of least squares, Gauss predicted the area in thesky, where it re-appeared one year later. His justification of this method of least squares (castin modern language) is that measurement errors follow a family of distributions, which satisfies property 1 : maximum-likelihood estimation is equivalent to the least-squares problem.In an 1809 paper, he used this property to show that the distribution of measurement errorsis (again, in modern language) a Gaussian distribution.In 1810, Laplace studied the distribution of a quantity, which is the aggregate of a greatnumber of elementary observations. He was lead in this (completely different) way, to the samedistribution discovered by Gauss. Laplace was among the first scientists to show property 2 : the distribution of the sum of a large number of elementary observations is(asymptotically) a Gaussian distribution.Around 1860, Maxwell rediscovered Gaussian distributions, through his investigation of thevelocity distribution of particles in an ideal gas (which he viewed as freely colliding perfectelastic spheres). Essentially, he showed that property 3 : the distribution of a rotationally-invariant random vector, which has independentcomponents, is a Gaussian distribution. Kinetic theory lead to another fascinating development, related to Gaussian distributions.Around 1905, Einstein (and, independently, Smoluchowsky) showed that property 4 : the distribution of the position of a particle, which is undergoing a Brownianmotion, is a Gaussian distribution.In addition to kinetic theory, alternative routes to Gaussian distributions have been foundin quantum mechanics, information theory, and other fields. In quantum mechanics, a Gaussiandistribution is a position distribution with minimum uncertainty. That is, it achieves equality inHeisenberg’s inequality (this is because a Gaussian function is proportional to its own Fouriertransform). In information theory, one may attribute to Shannon the following maximumentropy characterisation property 5 : a probability distribution with maximum entropy, among all distributions witha given mean and variance, is a Gaussian distribution.The above list of re-discoveries of Gaussian distributions, by means of different definitions,may be extended much longer. However, the main point is the following. 
In Euclidean space, anyone of the above five properties leads to the same famous expression of a Gaussian distribution, P ( dx | ¯ x, σ ) = (cid:0) πσ (cid:1) − n exp (cid:20) − ( x − ¯ x ) σ (cid:21) dx In non-Euclidean space, each one of these properties may lead to a different kind of distribution,which may then be called a Gaussian distribution, but only from a more restricted point of view.People interested in Brownian motion may call the heat kernel of a Riemannian manifold aGaussian distribution on that manifold. However, statisticians will not like this definition, sinceit will (in general) fail to have a straightforward connection to maximum-likelihood estimation. A deeper version of Maxwell’s idea was obtained by Poincar´e and Borel, around 1912, who showed that :if v = ( v n ; n = 1 , . . . , N ) is uniformly distributed, on an ( N − N , then thedistribution of v is (asymptotically) a Gaussian distribution. This is Poincar´e’s model of the one-dimensionalideal gas, with N particles. As of now, the following definition of Gaussian distributions is chosen. Gaussian distributions,on a Riemannian manifold M , are a family of distributions P (¯ x, σ ), parameterised by ¯ x ∈ M and σ > 0, such that : a maximum-likelihood estimate ˆ x N of ¯ x , based on independent samples( x n ; n = 1 , . . . , N ) from P (¯ x, σ ), is a solution of the least-squares problemminimise over x ∈ M E N ( x ) = N X n =1 d ( x n , x )Of course, this is the same least-squares problem as (2.2), so ˆ x N is an empirical barycentre ofthe samples ( x n ). Therefore (as discussed in 2.2.4), ˆ x N is almost-surely unique, if P (¯ x, σ ) has aprobability density with respect to the Riemannian volume of M (this will indeed be the case).Now, consider the density profile f ( x | ¯ x, σ ) = exp (cid:20) − d ( x, ¯ x )2 σ (cid:21) (3.1)and the normalising factor, Z (¯ x, σ ) = Z M f ( x | ¯ x, σ ) vol( dx ) (3.2)If this is finite, then P ( dx | ¯ x, σ ) = ( Z (¯ x, σ )) − f ( x | ¯ x, σ ) vol( dx ) (3.3)is a well-defined probability distribution on M . In 3.4, below, it will be shown that P (¯ x, σ ), asdefined by (3.3), is indeed a Gaussian distribution, if M is a Hadamard manifold, and also ahomogeneous space. The following propositions will then be helpful. Proposition 3.1. Let M be a Hadamard manifold, whose sectional curvatures lie in [ κ, ,where κ = − c . Then, for any ¯ x ∈ M and σ > , if Z (¯ x, σ ) is given by (3.2), Z ( σ ) ≤ Z (¯ x, σ ) ≤ Z c ( σ ) (3.4) where Z ( σ ) = (cid:0) πσ (cid:1) n and Z c ( σ ) is positive and given by (recall n is the dimension of M ) Z c ( σ ) = ω n − σ (2 c ) n − n − X k =0 ( − k (cid:18) n − k (cid:19) Φ (( n − − k ) σc )Φ ′ (( n − − k ) σc ) (3.5) with ω n − the area of the unit sphere S n − , and Φ the standard normal distribution function. Proposition 3.2. If M is a Riemannian homogeneous space, and Z (¯ x, σ ) is given by (3.2),then Z (¯ x, σ ) does not depend on ¯ x . In other words, Z (¯ x, σ ) = Z ( σ ) . If M is a Hadamard manifold, and also a homogeneous space, then both Propositions 3.1 and3.2 apply to M . Indeed, if M is a Riemannian homogeneous space, then its sectional curvatureslie within a bounded subset of the real line. Therefore, Proposition 3.1 implies Z (¯ x, σ ) is finitefor all ¯ x ∈ M and σ > 0. 
On the other hand, Proposition 3.2 implies that Z (¯ x, σ ) = Z ( σ ).48hus, if M is a Hadamard manifold, and also a homogeneous space, then (3.3), reduces to P ( dx | ¯ x, σ ) = ( Z ( σ )) − exp (cid:20) − d ( x, ¯ x )2 σ (cid:21) vol( dx ) (3.6)and yields a well-defined probability distribution P (¯ x, σ ) on M . This will be the main focus,throughout the following. Proof of Proposition 3.1 : (3.4) is a direct application of (1.94). Let f ( y ) = f ( y | ¯ x, σ ), and κ max = 0, κ min = κ . Also, since M is a Hadamard manifold, note that min { inj(¯ x ) , π c − } = ∞ .Therefore, (1.94) (applied with x = ¯ x ), yields ω n − Z ∞ exp (cid:20) − r σ (cid:21) sn n − ( r ) dr ≤ Z (¯ x, σ ) ≤ ω n − Z ∞ exp (cid:20) − r σ (cid:21) sn n − κ ( r ) dr However, sn ( r ) = r and sn κ ( r ) = c − sinh( cr ). Therefore, the expression for Z ( σ ) followseasily. For Z c ( σ ), on the other hand, note that Z ∞ exp (cid:20) − r σ (cid:21) sn n − κ ( r ) dr = 1(2 c ) n − Z ∞ exp (cid:20) − r σ (cid:21) (cid:0) e cr − e − cr (cid:1) n − dr Then, (3.5) follows by performing a binomial expansion, and using Z ∞ exp (cid:20) − r σ + ( n − − k ) cr (cid:21) dr = σ Φ (( n − − k ) σc )Φ ′ (( n − − k ) σc ) Remark : clearly, Z ( σ ) is the normalising factor of a Gaussian distribution, when M is aEuclidean space, M = R n . On the other hand, Z c ( σ ) is the normalising factor of a Gaussiandistribution, when M is a hyperbolic space of dimension n , and constant negative curvature κ = − c . This will become clear in 3.3, below. Proof of Proposition 3.2 : assume M is a homogeneous space, and fix some point o ∈ M .There exists an isometry g of M such that g · ¯ x = o . In the integral (3.2), introduce the newvariable of integration z = g · x . Since g (being an isometry) preserves Riemannian volume, Z (¯ x, σ ) = Z M f ( g − · z | ¯ x, σ ) vol( dz ) = Z M f ( z | o, σ ) vol( dz ) = Z ( o, σ )where the second equality follows from (3.1). Thus, Z (¯ x, σ ) = Z ( o, σ ) does not depend on ¯ x . Z ( σ ) Assume now M = G/K is a Riemannian symmetric space which belongs to the non-compactcase, described in 1.9.2. In particular, M is a Hadamard manifold, and also a homogeneousspace. Thus, for each ¯ x ∈ M and σ > 0, there is a well-defined probability distribution P (¯ x, σ )on M , given by (3.6). Here, the normalising factor Z ( σ ) can be expressed as a multiple integral,using the integral formula (1.109), of Proposition 1.7. Applying this proposition (with o = ¯ x ),it is enough to note f ( ϕ ( s, a ) | ¯ x, σ ) = exp (cid:20) − k a k B σ (cid:21) where k a k B = B ( a, a ), in terms of the Ad( G )-invariant symmetric bilinear form B . Since thisexpression only depends on a , it is possible to integrate s out of (1.109), to obtain Z ( σ ) = ω ( S ) | W | Z a exp (cid:20) − k a k B σ (cid:21) Y λ ∈ ∆ + | sinh λ ( a ) | m λ da (3.7)This formula expresses the normalising factor Z ( σ ) as a multiple integral on the vector space a .49 xample 1 : the easiet instance of (3.7) arises when M is a hyperbolic space of dimension n ,and constant sectional curvature equal to − 1. Then, M has rank equal to 1, so that a = R ˆ a forsome unit vector ˆ a ∈ a . Since the sectional curvature is equal to − 1, there is only one positiveroot λ , say λ (ˆ a ) = 1, with multiplicity m λ = n − 1. In addition, there are two Weyl chambers, C + = { t ˆ a ; t > } and C − = { t ˆ a ; t < } . In other words, | W | = 2. 
Now, (3.7) reads Z ( σ ) = ω n − Z + ∞−∞ exp (cid:20) − r σ (cid:21) | sinh( r ) | n − dr = ω n − Z + ∞ exp (cid:20) − r σ (cid:21) sinh n − ( r ) dr In general, if all distances are divided by c > 0, the sectional curvature − − c .Thus, when M is a hyperbolic space of dimension n , and sectional curvature − c , one has Z ( σ ) = ω n − Z + ∞ exp (cid:20) − r σ (cid:21) ( c − sinh( cr )) n − dr This is exactly Z c ( σ ), expressed analytically in (3.5). Example 2 : another example, also susceptible of analytic expression, is when M is a cone ofpositive-definite matrices (covariance matrices), with real, complex, or quaternion coefficients.Then, M = G/K with G = GL( N, K ), where K = R , C or H (real numbers, complex numbers,or quaternions), and K is a maximal compact subgroup of G , say K = U ( N ) , O ( N ) or Sp ( n ).In each of these three cases, a is the space of N × N real diagonal matrices, and the positiveroots are the linear maps λ ( a ) = a ii − a jj where i < j , each one having its multiplicity m λ = β ,( β = 1 , K = R , C or H ). In addition, k a k B = 4tr( a ) = 4 a + . . . + 4 a NN .The Weyl group W is the groupe of permutation matrices in K , so | W | = N !. Finally, S = K/T N where T N is the subgroup of all matrices t which are diagonal and belong to K .Replacing all of this into (3.7), it follows that Z ( σ ) = ω β ( N ) N ! Z a N Y i =1 exp (cid:20) − a ii σ (cid:21) Y i 1) + 1, ρ ( x, k ) = exp( − log ( x ) /k ) and V ( x ) = Q i 0) onto R . By approximating Z ( σ n ) at certain nodes σ n , and then performing a suitablespline interpolation, it becomes possible to guarantee this behavior of ψ ( η ).This Monte Carlo method applies, with very little modification, not only to the computationof (3.13), but to the computation of the general formula (3.7). It has been used to producetables of the function Z ( σ ), for various Riemannian symmetric spaces M , of rank N up to30, which have been successfully used, in numerical computation (recall the rank of M is thedimension of a ). Unfortunately, this method breaks down, when N is larger (roughly ≈ Z ( σ ) (see 3.6, below), or an asymptotic formula, for large N (see 3.7, below), are then needed. 51 .4 MLE and maximum entropy Let M be a Hadamard manifold, which is also a homogeneous space. Propositions 3.1 and 3.2then imply that, for any ¯ x ∈ M and σ > 0, there exists a well-defined probability distribution P (¯ x, σ ) on M , given by (3.6). The family of distributions P (¯ x, σ ) fits the definition of Gaussiandistributions, stated at the beginning of 3.2. Proposition 3.3. Let P (¯ x, σ ) be given by (3.6), for ¯ x ∈ M and σ > . The maximum-likelihoodestimate of the parameter ¯ x , based on independent samples ( x n ; n = 1 , . . . , N ) from P (¯ x, σ ) , isunique and equal to the empirical barycentre ˆ x N of the samples ( x n ) . The proof of this proposition is immediate. From (3.6), one has the log-likelihood function ℓ (¯ x, σ ) = − N log Z ( σ ) − σ N X n =1 d ( x n , ¯ x ) (3.15)Since the first term does not depend on ¯ x , one may maximise ℓ (¯ x, σ ), first over ¯ x and thenover σ . Clearly, maximising over ¯ x is equivalent to minimising the sum of squared distances d ( x n , ¯ x ). This is just the least-squares problem (2.2), whose solution is the empirical barycentreˆ x N . Moreover, ˆ x N is unique, since M is a Hadamard manifold (as discussed in 2.2.2)Consider now maximum-likelihood estimation of σ . 
This is better carried out in terms ofthe natural parameter η = ( − σ ) − , or in terms of the moment parameter δ = ψ ′ ( η ), where ψ ( η ) = log Z ( σ ) and the prime denotes the derivative. Proposition 3.4. The function ψ ( η ) , just defined, is a strictly convex function, which mapsthe half-line ( −∞ , onto R . The maximum-likelihood estimates of the parameters η and δ are ˆ η N = ( ψ ′ ) − (ˆ δ N ) and ˆ δ N = 1 N N X n =1 d ( x n , ˆ x N ) (3.16) where ( ψ ′ ) − denotes the reciprocal function. The proof of this proposition is given below. For now, note the maximum-entropy propertyof Gaussian distributions, stated in the following proposition. Proposition 3.5. The Gaussian distribution P (¯ x, σ ) , given by (3.6), is the unique distributionon M , having maximum Shannon entropy, among all distributions with given barycentre ¯ x anddispersion δ = E x ∼ P [ d ( x, ¯ x )] . Its entropy is equal to ψ ∗ ( δ ) where ψ ∗ is the Legendre transformof ψ . Proof of Proposition 3.4 : denote µ the image of the distribution P (¯ x, σ ) under the mapping x d ( x, ¯ x ). Then, ψ ( η ) is the cumulant generating function of µ , ψ ( η ) = log Z ∞ e ηs µ ( ds ) (3.17)and is therefore strictly convex. Note from (3.4) and (3.5) that Z ( σ ) = 0 when σ = 0 and Z ( σ )increases to + ∞ when σ increases to + ∞ . Recalling η = ( − σ ) − and ψ ( η ) = log Z ( σ ), itbecomes clear that ψ is (in fact, strictly increasing, and) maps the half-line ( −∞ , 0) onto R .After maximisation with respect to ¯ x , the log-likelihood function (3.15) becomes, ℓ ( η ) = N n η ˆ δ N − ψ ( η ) o (3.18)which is a strictly concave function. Differentiating, and setting the derivative equal to 0,directly yields the maximum-likelihood estimates (3.16).52 emark : ˆ η N in (3.16) is well-defined, since the range of ψ ′ is equal to (0 , ∞ ). Indeed, it ispossible to use (1.94), as in the proof of (3.4), to show that ψ ′ ( η ) ≤ ψ ′ ( η ) ≤ ψ ′ c ( η ) (3.19)where ψ ( η ) = log Z ( σ ), and ψ c ( η ) = log Z c ( σ ), with κ = − c a lower bound on the sectionalcurvatures of M . Precisely, (3.19) can be obtained by replacing f ( y ) = d ( y , ¯ x ) p ( y | ¯ x, σ ) into(1.94), where p ( y | ¯ x, σ ) is the probability density function in (3.6). Now, ψ ′ ( η ) = nσ , whichincreases to + ∞ when σ increases to + ∞ . On the other hand, by a straightforward applicationof the chain rule, it is seen that ψ ′ c ( η ) = σ ddσ (log Z c ( σ )) (3.20)which, from (3.5), is = 0 when σ = 0. Now, it follows from (3.19), ψ ′ maps the half-line ( −∞ , , + ∞ ). Proof of Proposition 3.5 : let Q ( dx ) be a probability distribution on M with barycentre ¯ x and dispersion δ = E x ∼ Q [ d ( x, ¯ x )]. Assume Q ( dx ) has probablity density function q ( x ), withrespect to Riemannian volume. The Shannon entropy of Q is given by S ( q ) = Z M log( q ( x )) q ( x )vol( dx ) (3.21)Since M is a homogeneous space, S ( q ) does not depend on ¯ x . Fixing some point o ∈ M , it ispossible to assume, without loss of generality, that ¯ x = o . 
Then, it is enough to maximise S ( q ),subject to the constraints, Z M q ( x )vol( dx ) = 1 and Z M d ( x, o ) q ( x )vol( dx ) = δ Using the method of Lagrange multipliers, this leads to a stationary point q ( x ) = exp ( η d ( x, o ) − ψ ( η )) (3.22)where the Lagrange multiplier η is finally given by η = ( ψ ′ ) − ( δ ), in terms of the cumulantgenerating function, ψ ( η ) = log Z M exp ( η d ( x, o )) vol( dx )Of course, q ( x ) in (3.22) is just p ( x | o, σ ), once the parameter σ > η = ( − σ ) − .Since the Shannon entropy is strictly concave, this stationary point q ( x ) is a unique maximum,over the (convex) set of probability density functions on M , which satisfy the above constraints.Its entropy is equal to S ( q ) = Z M ( η d ( x, o ) − ψ ( η )) q ( x )vol( dx ) = ηδ − ψ ( η ) (3.23)To show that this is ψ ∗ ( δ ), as stated in the proposition, it is enough to show S ( q ) = sup η { ηδ − ψ ( η ) } (3.24)However, since ψ is a strictly convex function, it is seen by differentiation that the sup isachieved when ψ ′ ( η ) = δ , exactly as in (3.22). Accordingly, the right-hand side of (3.24) isequal to ηδ − ψ ( η ), as in (3.23). 53 .5 Barycentre and covariance Let M be a Hadamard manifold, which is also a homogeneous space. Here, it is shown that thebarycentre of the Gaussian distribution P (¯ x, σ ) on M , given by (3.6), is equal to ¯ x .First, it should be noted P (¯ x, σ ) does indeed have a well-defined Riemannian barycentre,since it has finite second-order moments. To see that this is true, it is enough to note that Z M d (¯ x, x ) p ( x | ¯ x, σ )vol( dx ) < ∞ Ineded, this integral is just ψ ′ ( η ) in (3.19). This means π = P (¯ x, σ ) satisfies (2.4) for y o = ¯ x . Proposition 3.6. Let P (¯ x, σ ) be given by (3.6), for ¯ x ∈ M and σ > . The Riemannianbarycentre of P (¯ x, σ ) is equal to ¯ x . First proof : the proof of this proposition relies on the fact that the variance function, E ( y ) = 12 Z M d ( y , x ) p ( x | ¯ x, σ )vol( dx )is 1 / x with grad E (ˆ x ) = 0,which is also its unique global minimum, and (by definition) the Riemannian barycentre of P (¯ x, σ ). Now, let f (¯ x ) be the function given by f (¯ x ) = Z M p ( x | ¯ x, σ )vol( dx )Clearly, this is a constant function, equal to 1 for all ¯ x . On the other hand, its gradient may bewritten down, by differentiating under the integral, with respect to ¯ x , using (1.75) and (3.6),grad f (¯ x ) = σ − Z M Exp − x ( x ) p ( x | ¯ x, σ )vol( dx )Now, grad f (¯ x ) is identically zero. But, the right-hand side of the above expression is equal to − σ − grad E (¯ x ), by (2.5). This shows that grad E (¯ x ) = 0, and therefore ¯ x is the Riemannianbarycentre of P (¯ x, σ ). Second proof : this proof works if M is a Riemannnian symmetric space which belongs to thenon-compact case. From (2.5),grad E (¯ x ) = − Z M Exp − x ( x ) p ( x | ¯ x, σ )vol( dx )Let s ¯ x be the geodesic symmetry at ¯ x . 
From the definition of s ¯ x , s ¯ x · grad E (¯ x ) = − grad E (¯ x ).On the other hand, s ¯ x · grad E (¯ x ) = − Z M (cid:0) s ¯ x · Exp − x ( x ) (cid:1) p ( x | ¯ x, σ )vol( dx )Since s ¯ x is an isometry and fixes ¯ x , it follows that s ¯ x · Exp − x ( x ) = Exp − x ( s ¯ x · x ) and p ( x | ¯ x, σ ) = p ( s ¯ x · x | ¯ x, σ )Therefore, s ¯ x · grad E (¯ x ) = − Z M Exp − x ( s ¯ x · x ) p ( s ¯ x · x | ¯ x, σ )vol( dx )and, introducing the variable of integration z = s ¯ x · x , it follows that s ¯ x · grad E (¯ x ) = grad E (¯ x ).Now, it has been shown that s ¯ x · grad E (¯ x ) = − grad E (¯ x ) and that s ¯ x · grad E (¯ x ) = grad E (¯ x ).Thus, grad E (¯ x ) = 0 and one may conclude as in the first proof.54 .5.2 The covariance tensor The covariance form of the distribution P (¯ x, σ ) is the symmetric bilinear form C ¯ x on T ¯ x M , C ¯ x ( u, v ) = Z M h u, Exp − x ( x ) ih Exp − x ( x ) , v i p ( x | ¯ x, σ )vol( dx ) u , v ∈ T ¯ x M (3.25)With σ > x ∈ M the covariance form C ¯ x is a (0,2)-tensorfield on M , here called the covariance tensor of P (¯ x, σ ). In order to compute this tensor field,consider the following situation.Assume M = G/K is a Riemannian symmetric space which belongs to the non-compact case.Here, K = K o , the stabiliser in G of o ∈ M . For k ∈ K and u ∈ T o M , it is clear k · u ∈ T o M .This defines a representation of K in the tangent space T o M , called the isotropy representation.One says that M is an irreducible symmetric space, if this isotropy representation is irreducible.If M is not irreducible, then it is a product of irreducible Riemannian symmetric spaces M = M × . . . × M s [10] (Proposition 5.5, Chapter VIII. This is the de Rham decomposition of M ). Accordingly, for x ∈ M and u ∈ T x M , one may write x = ( x , . . . , x s ) and u = ( u , . . . , u s ),where x r ∈ M r and u r ∈ T x r M r . Now, looking back at (3.6), it may be seen that p ( x | ¯ x, σ ) = s Y r =1 p ( x r | ¯ x r , σ ) p ( x r | ¯ x r , σ ) = ( Z r ( σ )) − exp (cid:20) − d ( x r , ¯ x r )2 σ (cid:21) (3.26)For the following proposition, let η = ( − σ ) − and ψ r ( η ) = log Z r ( σ ). Proposition 3.7. Assume that M is a product of irreducible Riemannian symmetric spaces, M = M × . . . × M s . The covariance tensor C in (3.25) is given by C ¯ x ( u, u ) = s X r =1 ψ ′ r ( η )dim M r k u r k x r (3.27) for u ∈ T ¯ x M where ¯ x = (¯ x , . . . , ¯ x s ) and u = ( u , . . . , u s ) , with ¯ x r ∈ M r and u r ∈ T ¯ x r M r . Example : let M = H( N ), so M = GL( N, C ) /U ( N ), with U ( N ) the stabiliser of o = I N .The de Rham decomposition of M is M = M × M , where M = R and M is the submanifoldwhose elements are those x ∈ M such that det( x ) = 1. Accordingly, each ¯ x ∈ M is identifiedwith the couple (¯ x , ¯ x ), ¯ x = 1 N log det(¯ x ) ¯ x = (det(¯ x )) − /N ¯ x and each u ∈ T ¯ x M is written u = u ¯ x + u u = 1 N tr(¯ x − u ) u = u − N tr(¯ x − u ) ¯ x These may be replaced into expression (3.27), C ¯ x ( u, u ) = ψ ′ ( η ) u + ψ ′ ( η ) N − k u k x (3.28)where ψ ( η ) = log (cid:0) π σ (cid:1) , and ψ ( η ) = log Z ( σ ) − ψ ( η ) ( Z ( σ ) is given by (3.35) in 3.6, below).After a direct calculation, this can be brought under the form C ¯ x ( u, u ) = g ( σ )tr (¯ x − u ) + g ( σ )tr(¯ x − u ) (3.29)where g ( σ ) and g ( σ ) are certain functions of σ . Remark : as a corollary of Proposition 3.7, the covariance tensor C is a G -invariant Riemannianmetric on M . 
This is clear, for example, in the special case of (3.29), which coincides with thegeneral expression of a GL( N, C )-invariant metric.55 roof of Proposition 3.7 : since C ¯ x is bilinear C ¯ x ( u, u ) = s X r =1 s X q =1 C ¯ x ( u r , u q ) (3.30)It will be shown that C ¯ x ( u r , u q ) = 0 for r = q (3.31)and, on the other hand, that C ¯ x ( u r , u r ) = ψ ′ r ( η )dim M r k u r k x r (3.32)Then, (3.27) will follow immediately, by replacing (3.31) and (3.32) into (3.30). Proof of (3.31) : from (3.25), C ¯ x ( u r , u q ) = Z M h u r , Exp − x ( x ) ih Exp − x ( x ) , u q i p ( x | ¯ x, σ )vol( dx ) (3.33)However, since M is given as a product Riemannian manifold, h u r , Exp − x ( x ) i = h u r , Exp − x r ( x r ) i and h u q , Exp − x ( x ) i = h u q , Exp − x q ( x q ) i (3.34)Using (3.26) and (3.34), it follows from (3.33) that C ¯ x ( u r , u q ) = R Mr h u r , Exp − x r ( x r ) i p ( x r | ¯ x r , σ )vol( dx r ) R Mq h u q , Exp − x q ( x q ) i p ( x q | ¯ x q , σ )vol( dx q )= grad E r (¯ x r ) grad E q (¯ x q )= 0where the second equality follows from (2.5), applied to the variance functions E r ( y ) = 12 Z M r d ( y , x r ) p ( x r | ¯ x r , σ )vol( dx r ) and E q ( y ) = 12 Z M q d ( y , x q ) p ( x q | ¯ x q , σ )vol( dx q )which, by Proposition 3.6, respectively have their global minima at ¯ x r and ¯ x q . Proof of (3.32) : let K ¯ x denote the stabiliser of ¯ x in G . For k ∈ K ¯ x and u r ∈ T ¯ x r M r ,note that k · u r ∈ T ¯ x r M r . This defines an irreducible representation of K ¯ x in T ¯ x r M r . Thesymmetric bilinear form C ¯ x is invariant under this representation. Precisely, since any k ∈ K ¯ x is an isometry which fixes ¯ x , it follows from (3.25), C ¯ x ( k · u r , k · u r ) = R Mr h k · u r , Exp − x r ( x r ) i p ( x r | ¯ x r , σ )vol( dx r )= R Mr h u r , Exp − x r ( k − · x r ) i p ( x r | ¯ x r , σ )vol( dx r )= R Mr h u r , Exp − x r ( k − · x r ) i p ( k − · x r | ¯ x r , σ )vol( dx r ) = C ¯ x ( u r , u r )where the last equality follows by introducing the new variable of integration z = k − · x r .Finally, from Schur’s lemma [40], C ¯ x is a multiple of the metric, C ¯ x ( u r , u r ) = f ( η ) k u r k x r where f ( η ) may be found from tr( C ¯ x ) = (dim M r ) f ( η ). To conclude, it is enough to note thatthe trace may be evaluated by introducing an orthonormal basis of T ¯ x r M r . It then follows that,tr( C ¯ x ) = Z M r k Exp − x r ( x r ) k p ( x r | ¯ x r , σ )vol( dx r ) = Z M r d (¯ x r , x r ) p ( x r | ¯ x r , σ )vol( dx r )which is equal to ψ ′ r ( η ), by the same argument as in the discussion before Proposition 3.6.56 .6 An analytic formula for Z ( σ ) Consider the special case where M = H( N ), which corresponds to β = 2 in Example 2 of 3.3.In this case, using the tools of random matrix theory (see [17], Chapter 5), it is possible toprovide an analytic formula for the normalising factor Z ( σ ). Proposition 3.8. When M = H( N ) , the normalising factor Z ( σ ) , given by (3.9) with β = 2 ,admits of the following analytic formula Z ( σ ) = ω ( N )2 N (cid:0) π σ (cid:1) N exp (cid:20)(cid:18) N − N (cid:19) σ (cid:21) N − Y n =1 (cid:16) − e − nσ (cid:17) N − n (3.35) Remark : when N = 2, (3.35) reduces to Z ( σ ) = (cid:16) πσ (cid:17) (cid:16) e σ − (cid:17) (3.36)which can be checked, by directly calculating the integral (3.9). Proof of Proposition 3.8 : putting β = 2 in (3.9), and noting that N = N , it follows that Z ( σ ) = ω ( N )2 N N ! 
exp (cid:20) − N σ (cid:21) × I (3.37)where I is the integral I = Z R N + N Y i =1 ρ ( u i , σ ) | V ( u ) | N Y i =1 du i (3.38)This can be expressed using a well-known formula from random matrix theory [17] (Chapter5, Page 79). Precisely, if ( p n ; n = 0 , , . . . ) are orthonormal polynomials, with respect to theweight function ρ ( u, σ ) on R + , then I is given by I = N ! N − Y n =0 p − nn (3.39)where p nn is the leading coefficient in p n . The required orthonormal polynomials p n are given by p n = (2 πσ ) − s n , where s n are the Stieltjes-Wigert polynomials [43] (Page 33). Accordingly, p − nn = (cid:0) π σ (cid:1) exp (cid:20) (2 n + 1) σ (cid:21) n Y m =1 (cid:16) − e − mσ (cid:17) Then, working out the product (3.39), it easily follows I = N ! (cid:0) π σ (cid:1) N exp (cid:20)(cid:18) N − N (cid:19) σ (cid:21) N − Y n =1 (cid:16) − e − nσ (cid:17) N − n (3.40)and (3.35) may be obtained by replacing this into (3.37). Remark : the product appearing in (3.40) can be written as a product of q -Gamma functions.Letting q = e − σ , and recalling the definition of the q -Gamma function [44], it may be seen that N − Y n =1 (cid:16) − e − nσ (cid:17) N − n = (1 − q ) ( N − N ) / N Y n =2 Γ q ( n ) (Γ q the q -Gamma function) (3.41)In other words, the product of q -Gamma functions plays, for the present problem, the samerole that the product of classical Gamma functions (known as the Barnes function) plays, forthe Gaussian unitary ensemble. 57 .7 Large N asymptotics Pursuing the development started in 3.6, it is possible to derive an asymptotic expression of Z ( σ ), valid in the limit where N goes to infinity, while the product t = N σ remains constant. Proposition 3.9. Let Z ( σ ) be given by (3.35). If N → ∞ , while t = N σ remains constant,then the following equivalence holds, N log Z ( σ ) ∼ − 12 log (cid:18) Nπ (cid:19) + 34 + t − Li ( e − t ) − ζ (3) t (3.42) where Li ( x ) = P ∞ k =1 x k /k for | x | < (the trilogarithm), and ζ is the Riemann Zeta function. The proposition follows by a direct calculation, once the following lemmas have been shown. Lemma 3.1. In the notation of (3.35), if N → ∞ , N log ω ( N ) ∼ − 12 log (cid:18) N π (cid:19) + 34 (3.43) Lemma 3.2. If N → ∞ , while t = N σ remains constant, then lim 1 N log N − Y n =1 (cid:16) − e − nσ (cid:17) N − n = Z (1 − x ) log (cid:0) − e − tx (cid:1) dx (3.44) and this improper integral is equal to − (Li ( e − t ) − ζ (3)) /t . Proof of Lemma 3.1 : recall, from the footnote in 3.3, that ω ( N ) = (2 π ) ( N − N ) / /G ( N ) where G ( N ) = 1! × × . . . × ( N − ω ( N ) = N π ) − N (cid:20) 12 log( N ) − (cid:21) + o ( N )which directly implies (3.43). Proof of Lemma 3.2 : taking the logarithm of the product, the left-hand side of (3.44) reads1 N log N − Y n =1 (cid:16) − e − nσ (cid:17) N − n = 1 N N − X n =1 (cid:16) − nN (cid:17) log (cid:16) − e − t nN (cid:17) which is a Riemann sum for the improper integral in the right-hand side. To evaluate thisintegral, one may resort to a symbolic computation software, or introduce the power series ofthe logarithm, under the integral, Z (1 − x ) log (cid:0) − e − tx (cid:1) dx = − ∞ X k =1 k Z (1 − x ) e − ktx dx and note that Z (1 − x ) e − ktx dx = 1 − e − ktx ( kt ) in order to obtain − (Li ( e − t ) − ζ (3)) /t . Remark : from (3.42), it follows that Z ( σ ) → N → ∞ , while t = N σ remains constant.However, this is merely because ω ( N ) → N → ∞ . 
Therefore, one should keep in mind,lim 1 N log (cid:20) Z ( σ ) ω ( N ) (cid:21) = − 12 log(2) + 34 + t − Li ( e − t ) − ζ (3) t (3.45)which may be thought of as the “asymptotic cumulant generating function”.58 .8 The asymptotic distribution From the point of view of random matrix theory, a Gaussian distribution P (I N , σ ) on M = H( N )defines a unitary matrix ensemble. If x is a random matrix, drawn from this ensemble, and( x i ; i = 1 , . . . , N ) are its eigenvalues, which all belong to (0 , ∞ ), then the empirical distribution ν N , which is given by (as usual, δ x i is the Dirac distribution at x i ) ν N ( B ) = E " N N X i =1 δ x i ( B ) (3.46)for measurable B ⊂ (0 , ∞ ), converges to an absolutely continuous distribution ν t , when N goesto infinity, while the product t = N σ remains constant. Proposition 3.10. Let c = e − t and a ( t ) = c (1 + √ − c ) − while b ( t ) = c (1 − √ − c ) − . When N goes to infinity, while the product t = N σ remains constant, the empirical distribution ν N converges weakly to the distribution ν t with probability density function dν t dx ( x ) = 1 πtx arctan (cid:18) e t x − ( x + 1) x + 1 (cid:19) [ a ( t ) ,b ( t )] ( x ) (3.47) where [ a ( t ) ,b ( t )] denotes the indicator function of the interval [ a ( t ) , b ( t )] . Remark : as one should expect, when t = 0 (so σ = 0), a ( t ) = b ( t ) = 1.The proof of Proposition 3.10 is a relatively direct application of a result in [45] (Page 191).Recall the variables u i = e t x i which appear in (3.9). Let ˜ ν N be the empirical distribution of the u i (this is the same as (3.46), but with u i instead of x i ). By applying [17] (Chapter 5, Page 81),˜ ν N ( B ) = 1 N Z B R (1) N ( u )( du ) (3.48)for measurable B ⊂ (0 , ∞ ), where the one-point correlation function R (1) N ( u ) is given by R (1) N ( u ) = ρ ( u, σ ) N − X n =0 p n ( u ) (3.49)in the notation of 3.6 ( p n are orthonormal polynomials, with respect to the weight ρ ( u, σ )).According to [46] (Page 133), ˜ ν N given by (3.48) converges weakly to the so-called equilibriumdistribution ˜ ν t , which minimises the electrostatic energy functional E ( ν ) = 1 t Z ∞ 12 log ( u ) ν ( du ) − Z ∞ Z ∞ log | u − v | ν ( du ) ν ( dv ) (3.50)over probability distributions ν on (0 , ∞ ). Also according to [46] (Page 133), this equilibriumdistribution is the asymptotic distribution of the zeros of the polynomial p N (in the limit N → ∞ while N σ = t ). Fortunately, p N is just a constant multiple of the Stieltjes-Wigert polynomial s N [43] (Page 33). Therefore, the required asymptotic distribution of zeros can be read from [45](Page 191). Finally, (3.47) follows by introducing the change of variables x = e − t u . Remark : in [47], the equilibrium distribution ˜ ν t is derived directly, by searching for stationarydistributions of the energy functional (3.50). This leads to a singular integral equation, whosesolution reduces to a Riemann-Hilbert problem. Astoundingly, the Gaussian distributions onH( N ), as introduced in the present chapter, provide a matrix model for Chern-Simons quantumfield theory (a detailed account is given in [47]).59 .9 Duality : the Θ distributions Recall the Riemannian symmetric space M = H( N ) of 3.6. Its dual space is the unitary group M ∗ = U ( N ). 
Consider now a family of distributions on M ∗ , which will be called Θ distributions,and which display an interesting connection with Gaussian distributions on M , studied in 3.6.Recall Jacobi’s ϑ function , ϑ ( e iφ | σ ) = + ∞ X m = −∞ exp( − m σ + 2 miφ )As a function of φ , up to some minor modifications, this is just a wrapped normal distribution(in other words, the heat kernel of the unit circle),12 π ϑ (cid:0) e iφ | σ (cid:1) = ∞ X m = −∞ exp (cid:20) − (2 φ − mπ ) σ (cid:21) Each x ∈ M ∗ can be written x = k · e iθ for some k ∈ U ( N ) and e iθ = diag( e iθ i ; i = 1 , . . . , N ),where k · y = ky k † , for y ∈ M ∗ . With this notation, define the following matrix ϑ function,Θ (cid:0) x (cid:12)(cid:12) σ (cid:1) = k · ϑ (cid:0) e iθ | σ (cid:1) (3.51)which is obtained from x by applying Jacobi’s ϑ function to each eigenvalue of x . Further,consider the positive function, f ∗ ( x | ¯ x, σ ) = det h(cid:0) π σ (cid:1) Θ (cid:16) x ¯ x † (cid:12)(cid:12)(cid:12) σ (cid:17)i (3.52)which is also equal to det h(cid:0) π σ (cid:1) Θ (cid:16) ¯ x † x (cid:12)(cid:12)(cid:12) σ (cid:17)i since the matrices x ¯ x † and ¯ x † x are similar. Then, let Z M ∗ ( σ ) denote the normalising constant Z M ∗ ( σ ) = Z M ∗ f ∗ ( x | ¯ x, σ ) vol( dx ) (3.53)which does not depend on ¯ x , as can be seen, by introducing the new variable of integration z = x ¯ x † , and using the invariance of vol( dx ). (compare to the proof of Proposition 3.2).Now, define a Θ distribution Θ(¯ x, σ ) as the probability distribution on M ∗ , whose probabilitydensity function, with respect to vol( dx ), is given by p ∗ ( x | ¯ x, σ ) = ( Z M ∗ ( σ )) − f ∗ ( x | ¯ x, σ ) (3.54) Proposition 3.11. Let Z M ( σ ) = Z ( σ ) , be given by (3.35), and Z M ∗ ( σ ) be given by (3.53).Then, the following equality holds Z M ( σ ) Z M ∗ ( σ ) = exp (cid:20)(cid:18) N − N (cid:19) σ (cid:21) (3.55) Remark : the Gaussian density (3.6) on M , and the Θ distribution density (3.54) on M ∗ areapparently unrelated. Therefore, it is interesting to note their normalising constants Z M ( σ ) and Z M ∗ ( σ ) scale together according to the simple relation (3.55). The connection between the twodistributions is due to the duality between the two spaces ( M and M ∗ ). To follow the original notation of Jacobi [33], this should be written ϑ ( e iφ | q ) where q = e − σ . In otherpopular notations, this function is called ϑ or ϑ . roof of Proposition 3.11 : since Z M ∗ ( σ ) does not depend on ¯ x , one may set ¯ x = o in(3.53), where o = I N . Then, f ∗ ( x | o, σ ) is a class function, so (3.53) can be computed using(1.118). Note that ω ( S N ), which appears in (1.118), is equal to ω ( N ), in the current notation.Therefore, Z M ∗ ( σ ) = ω ( N )2 N N ! (cid:0) π σ (cid:1) N × I (3.56)where I is the integral I = Z [0 , π ] N N Y i =1 ϑ (cid:0) e iθi | σ (cid:1) | V ( e iθ ) | dθ . . . θ N (3.57)which follows from the identity det Θ (cid:0) x (cid:12)(cid:12) σ (cid:1) = N Y i =1 ϑ (cid:0) e iθi | σ (cid:1) Now, I can be expressed using [17] (Chapter 5, Page 79), as in the proof of Proposition 3.8.Precisely, if ( p n ; n = 0 , , . . . ) are orthonormal trigonometric polynomials, with respect to theweight function ϑ ( e iθ | σ / ), on the unit circle, then I is given by (3.39), I = N ! N − Y n =0 p − nn in terms of the leading coefficients p nn of the polynomials p n (these leading coefficients mayalways be chosen to be real). 
At present, the required orthonormal polynomials p n are given by p n ( z ) = " q n n Y m =1 (1 − q m ) − r n ( − q − z ) (3.58)where q = e − σ and r n ( z ) is the n -th Rogers-Szeg¨o polynomial, which is monic [48]. Therefore, p − nn = n Y m =1 (cid:16) − e − mσ (cid:17) (3.59)and, from (3.57), I is given by I = N ! N − Y n =1 (cid:16) − e − nσ (cid:17) N − n (3.60)which may be replaced into (3.56) to obtain Z M ∗ ( σ ) = ω ( N )2 N (cid:0) π σ (cid:1) N N − Y n =1 (cid:16) − e − nσ (cid:17) N − n (3.61)Finally, (3.55) follows easily, by comparing (3.61) to (3.35). Remark : the construction of the Θ distributions seems to indicate a general constructionof “dual distributions” on pairs of dual Riemannian symmetric spaces. Recalling the generalnotation of 1.9.2, it seems that Gaussian distributions arise from a classical Gaussian densityprofile on the maximal Abelian subspace a , while Θ distributions (“their duals”) arise fromwrapping this Gaussian density profile around the torus Exp o ( i a ).61 hapter 4 Bayesian inference and MCMC Contents The present chapter is entirely made up of previously unpublished material. It continues the studyof Gaussian distributions, from the previous chapter, in a new direction : Bayesian inference, and theMarkov chain Monte Carlo (MCMC) techniques, useful in Bayesian inference. • M . Proposition 4.1 states these two estimators are equal, if thelikelihood and prior densities are identical. • M is a space of constant negative curvature,numerical computation shows the MAP and the MMS are so close to each other that they appearto be equal, even if the likelihood and prior densities are different. • • • • .1 MAP versus MMS Let M be a Riemannian symmetric space, which belongs to the non-compact case (see 1.9.2).Recall the Gaussian distribution P ( x, σ ) on M is given by its probability density function (3.6) p ( y | x, σ ) = ( Z ( σ )) − exp (cid:20) − d ( y, x )2 σ (cid:21) (4.1)In 3.4, it was seen that maximum-likelihood estimation of the parameter x , based on independentsamples ( y n ; n = 1 , . . . , N ), amounts to computing the Riemannian barycentre of these samples.The one-sample maximum-likelihood estimate, given a single observation y , is therefore ˆ x ML = y .Instead of maximum-likelihood estimation, consider a Bayesian approach to estimating x ,based on the observation y . To do so, assign to x a prior density, which is also Gaussian, p ( x | z , τ ) = ( Z ( τ )) − exp (cid:20) − d ( x, z )2 τ (cid:21) (4.2)Upon observation of y , Bayesian inference concerning x is carried out, using the posterior density π ( x ) ∝ exp (cid:20) − d ( y, x )2 σ − d ( x, z )2 τ (cid:21) (4.3)where ∝ indicates a missing (unknown) normalising factor.In particular, the maximum a posteriori estimator ˆ x MAP of x is equal to the mode of theposterior density π ( x ). In other words, ˆ x MAP minimises the weighted sum of squared distances d ( y, x ) /σ + d ( x, z ) /τ . This is expressed in the following notation ,ˆ x MAP = z ρ y where ρ = τ σ + τ (4.4)Thus, ˆ x MAP is a geodesic convex combination of the prior barycentre z and the observation y ,with respective weights σ / ( σ + τ ) and τ / ( σ + τ ).On the other hand, the minimum mean square error estimator ˆ x MMS is the barycentre ofthe posterior density π ( x ). That is, ˆ x MMS is the global minimiser of E π ( y ) = 12 Z M d ( y , x ) π ( x )vol( dx ) (4.5)whose existence and uniqueness are established in the remark below. 
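To make (4.4) concrete, here is a minimal sketch (not the code used for the experiments reported below) of the geodesic convex combination $z\,\#_\rho\,y$ on the cone of positive-definite matrices, equipped with its GL-invariant (affine-invariant) metric, as in Example 2 of 3.3; a constant rescaling of the metric changes neither the geodesics nor the weight $\rho$, and all helper names are mine.

```python
import numpy as np

def spd_power(a, t):
    """a^t for a symmetric positive-definite matrix a, via eigendecomposition."""
    w, v = np.linalg.eigh(a)
    return (v * w**t) @ v.T

def spd_log(a):
    """Matrix logarithm of a symmetric positive-definite matrix."""
    w, v = np.linalg.eigh(a)
    return (v * np.log(w)) @ v.T

def geodesic_point(z, y, t):
    """Point z #_t y on the affine-invariant geodesic from z (t = 0) to y (t = 1)."""
    z_half, z_half_inv = spd_power(z, 0.5), spd_power(z, -0.5)
    return z_half @ spd_power(z_half_inv @ y @ z_half_inv, t) @ z_half

def riemannian_distance(a, b):
    """Affine-invariant distance d(a, b) = || log(a^{-1/2} b a^{-1/2}) ||_F."""
    a_half_inv = spd_power(a, -0.5)
    return np.linalg.norm(spd_log(a_half_inv @ b @ a_half_inv), 'fro')

def map_estimate(y, z, sigma, tau):
    """MAP estimate (4.4): the geodesic convex combination z #_rho y,
    with rho = tau^2 / (sigma^2 + tau^2)."""
    rho = tau**2 / (sigma**2 + tau**2)
    return geodesic_point(z, y, rho)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((3, 3)); y = A @ A.T + np.eye(3)   # observation
    B = rng.standard_normal((3, 3)); z = B @ B.T + np.eye(3)   # prior barycentre
    sigma, tau = 1.0, 0.5
    x_map = map_estimate(y, z, sigma, tau)
    # x_map lies on the geodesic from z to y, so d(z, x_map)/d(z, y) should equal rho
    rho = tau**2 / (sigma**2 + tau**2)
    print(riemannian_distance(z, x_map) / riemannian_distance(z, y), rho)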
While it is easy to computeˆ x MAP from (4.4), it is much harder to find ˆ x MMS , as this requires minimising the integral (4.5),where the density π ( x ) is known only up to normalisation.Still, there is one special case where these two estimators are equal. Proposition 4.1. In the above notation, if σ = τ (that is ρ = 1 / ), then ˆ x MMS = ˆ x MAP . This relies on the following (intuitively quite obvious) lemma. Lemma 4.1. Assume that π is a probability distribution on M with Riemannian barycentre b .If g is an isometry of M such that g ∗ π = π ( g ∗ π denotes the image of the distribution π underthe mapping g : M → M ), then g · b = b . This lemma is proved by noting that, for any isometry g of M , one has E g ∗ π = E π ◦ g − .Accordingly, if b is the Riemannian barycentre of π , g · b is the Riemannian barycentre of g ∗ π . If p, q ∈ M and c : [0 , → M is a geodesic curve with c (0) = p and c (1) = q , then p t q = c ( t ), for t ∈ [0 , p t q is a geodesic convex combination of p and q , with respective weights (1 − t ) and t . roof of Proposition 4.1 : in this case, π ( x ) ∝ exp (cid:20) − d ( y, x ) + d ( x, z )2 σ (cid:21) On the other hand, ˆ x MAP = z / y is the midpoint of the geodesic segment connecting z to y (note that ρ = 1 / s denote the geodesic symmetry at ˆ x MAP . Then, s permutes z and y ,and therefore leaves invariant π ( x ). Lemma 4.1 (applied with g = s ) implies the Riemannianbarycentre ˆ x MMS of π verifies s · ˆ x MMS = ˆ x MMS . However, ˆ x MAP is the unique fixed point of s .Therefore, ˆ x MMS = ˆ x MAP . Remark : to see that ˆ x MMS is well-defined, it is enough to show the posterior density π in (4.3)satisfies (2.4). Indeed, this implies that π has a well-defined Riemannian barycentre.Consider then the second-order moment in (2.4), with y o = ˆ x MAP . Specifically, this is m (ˆ x MAP ) = Z M d (ˆ x MAP , x ) π ( x )vol( dx ) (4.6)Rearrange (4.3) to obtain π ( x ) ∝ exp [ − h ( ρf y ( x ) + (1 − ρ ) f z ( x ))] (cid:0) h = 1 /σ + 1 /τ (cid:1) (4.7)in the notation of 1.7. Now, let f ( x ) = ρf y ( x ) + (1 − ρ ) f z ( x ). For x ∈ M , let x = Exp ˆ x MAP ( v ),and recall the Taylor expansion (1.20), f ( x ) = f (ˆ x MAP ) + h grad f, v i ˆ xMAP + 12 Hess f c ( t ∗ ) ( ˙ c, ˙ c ) (4.8)where c ( t ∗ ) is a point along the geodesic c ( t ) = Exp x ( t v ), corresponding to an instant t ∗ ∈ (0 , f (ˆ x MAP ) = 0, as can be checked from (4.4), and that, using (1.77),Hess f ( x ) = ρ Hess f y ( x ) + (1 − ρ )Hess f z ( x ) ≥ g ( y )Replacing these into (4.8), it follows that f ( x ) ≥ ρ (1 − ρ ) d ( z, y ) + 12 d (ˆ x MAP , x )Then, if C − π is the missing normalising factor in (4.7), π ( x ) ≤ C − π exp (cid:20) − ρτ d ( z, y ) − h d (ˆ x MAP , x ) (cid:21) (4.9)From (4.6) and (4.9), m (ˆ x MAP ) ≤ C − π exp h − ρτ d ( z, y ) i Z M d (ˆ x MAP , x ) exp (cid:20) − h d (ˆ x MAP , x ) (cid:21) vol( dx ) (4.10)which is finite, as required in (2.4). In fact, by a direct application of the integral formula(1.109), it is possible to show that Z M d (ˆ x MAP , x ) exp (cid:20) − h d (ˆ x MAP , x ) (cid:21) vol( dx ) = h − / Z ′ ( h − / )where Z ( σ ) was given in (3.7), and the prime denotes the derivative. Finally, replacing this into(4.10), it follows that m (ˆ x MAP ) ≤ C − π exp (cid:2) ( − ρ/τ ) d ( z, y ) (cid:3) h − / Z ′ ( h − / ) (4.11)64 .2 Bounding the distance Proposition 4.1 states that ˆ x MMS = ˆ x MAP , if ρ = 1 / 2. When M is a Euclidean space, it isfamously known that ˆ x MMS = ˆ x MAP for any value of ρ . 
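For the record, this Euclidean claim can be checked in one line (written here for dimension one; the general case is identical, coordinate by coordinate):
$$ \pi(x) \;\propto\; \exp\!\left[-\frac{(y-x)^2}{2\sigma^2} - \frac{(x-z)^2}{2\tau^2}\right] \;\propto\; \exp\!\left[-\frac{\big(x - (\rho\,y + (1-\rho)\,z)\big)^2}{2\,\rho\,\sigma^2}\right], \qquad \rho = \frac{\tau^2}{\sigma^2+\tau^2}, $$
so the posterior is itself Gaussian, with variance $\rho\sigma^2 = \sigma^2\tau^2/(\sigma^2+\tau^2)$, and its barycentre (mean) coincides with its mode: $\hat x_{\mathrm{MMS}} = \hat x_{\mathrm{MAP}} = \rho\,y + (1-\rho)\,z$, for every value of $\rho$.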
In general, one expects these twoestimators to be different from one another, if ρ = 1 / M is a space of constant negative curvature, numerical experiments showthat ˆ x MMS and ˆ x MAP lie surprisingly close to each other, and that they even appear to be equal.I am still unaware of any mathematical explanation of this phenomenon.It is possible to bound the distance between ˆ x MMS and ˆ x MAP , using the so-called fundamentalcontraction property [29] (this is an immediate application of Jensen’s inequality, as explainedin the proof of Theorem 6.3 in [29]). d (ˆ x MMS , ˆ x MAP ) ≤ W ( π, δ ˆ x MAP ) (4.12)where W denotes the Kantorovich ( L -Wasserstein) distance, and δ ˆ x MAP denotes the Diracprobability distribution concentrated at ˆ x MAP . Now, the right-hand side of (4.12) is equal tothe first-order moment m (ˆ x MAP ) = Z M d (ˆ x MAP , x ) π ( x )vol( dx ) (4.13)Of course, the upper bound in (4.12) is not tight, since it is strictly positive, even when ρ = 1 / x n ; n ≥ 1) from the posterior density π .Using these samples, it is possible to approximate (4.13), by an empirical average,¯ m (ˆ x MAP ) = 1 N N X n =1 d (ˆ x MAP , x n ) (4.14)In addition, the samples ( x n ) can be used to compute a convergent approximation of ˆ x MMS .Precisely, the empirical barycentre ¯ x MMS of the samples ( x , . . . , x N ) converges almost-surely toˆ x MMS (this is proved in 4.3.2).Numerical experiments were conducted in the case when M is a space of constant curvature,equal to − 1, and of dimension n . The following table was obtained for the values σ = τ = 0 . x , . . . , x N ) where N = 2 × .dimension n m (ˆ x MAP ) 0 . 28 0 . 35 0 . 41 0 . 47 0 . 50 0 . 57 0 . 60 0 . 66 0 . d (¯ x MMS , ˆ x MAP ) 0 . 00 0 . 00 0 . 00 0 . 01 0 . 01 0 . 02 0 . 02 0 . 02 0 . σ = 1 and τ = 0 . 5, again using N = 2 × .dimension n m (ˆ x MAP ) 0 . 75 1 . 00 1 . 12 1 . 44 1 . 73 1 . 97 2 . 15 2 . 54 2 . d (¯ x MMS , ˆ x MAP ) 0 . 00 0 . 00 0 . 03 0 . 02 0 . 02 0 . 03 0 . 04 0 . 03 0 . x MMS and ˆ x MAP can be quite close to each other, even when ρ = 1 / σ and τ lead to similar orders of magnitude for ¯ m (ˆ x MAP ) and d (¯ x MMS , ˆ x MAP ).While ¯ m (ˆ x MAP ) increases with the dimension n , d (¯ x MMS , ˆ x MAP ) does not appear sensitive toincreasing dimension.Based on these experimental results, one may be tempted to conjecture that ˆ x MMS = ˆ x MAP ,even when ρ = 1 / 2. Naturally, numerical experiments do not equate to a mathematical proof.65 .3 Computing the MMS A crucial step, in Bayesian inference, is sampling from the posterior density. Here, this is π ( x ),given by (4.3). Since π ( x ) is known only up to normalisation, a suitable sampling methodis afforded by the Metropolis-Hastings algorithm. This algorithm generates a Markov chain( x n ; n ≥ P f ( x ) = Z M α ( x, y ) q ( x, y ) f ( y )vol( dy ) + ρ ( x ) f ( x ) (4.15)for any bounded measurable function f : M → R , where α ( x, y ) is the probability of acceptinga transition from x to dy , and ρ ( x ) is the probability of staying at x , and where q ( x, y ) is theproposed transition density q ( x, y ) ≥ Z M q ( x, y )vol( dy ) = 1 for x ∈ M (4.16)In the following, ( x n ) will always be an isotropic Metropolis-Hastings chain, in the sense that q ( x, y ) = q ( d ( x, y )), so q ( x, y ) only depends on the distance d ( x, y ). In this case, the acceptanceprobability α ( x, y ) is given by α ( x, y ) = min { , π ( y ) /π ( x ) } .The aim of the Metropolis-Hastings algorithm is to produce a Markov chain ( x n ) which isgeometrically ergodic. 
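Before turning to ergodicity, here is a minimal, self-contained sketch of the isotropic Metropolis-Hastings chain just described, together with the empirical barycentre computed by the Riemannian gradient descent of 4.4, below. It works in the hyperboloid model of $n$-dimensional hyperbolic space (constant curvature $-1$, as in the experiments of 4.2). The proposal is an isotropic Gaussian tangent step at a base point $o$, transported to the current state by an isometry, so that $q(x,y)$ depends only on $d(x,y)$. All function names and tuning constants are mine; this is an illustration, not the implementation used for the tables above.

```python
import numpy as np

# Hyperboloid model: H = {x in R^{n+1} : <x,x>_L = -1, x[0] > 0},
# with the Minkowski form <x,y>_L = -x0*y0 + x1*y1 + ... + xn*yn.

def minkowski(x, y):
    return -x[0] * y[0] + np.dot(x[1:], y[1:])

def dist(x, y):
    return np.arccosh(max(1.0, -minkowski(x, y)))

def exp_map(x, v):
    """Riemannian exponential at x, applied to a tangent vector v (with <x,v>_L = 0)."""
    nv = np.sqrt(max(minkowski(v, v), 0.0))
    return x if nv < 1e-12 else np.cosh(nv) * x + np.sinh(nv) * (v / nv)

def log_map(x, y):
    """Riemannian logarithm: the tangent vector at x pointing to y, with norm d(x,y)."""
    d = dist(x, y)
    if d < 1e-12:
        return np.zeros_like(x)
    w = y + minkowski(x, y) * x
    return d * w / np.sqrt(minkowski(w, w))

def geodesic_symmetry(m, w):
    """Geodesic symmetry s_m(w) = -w - 2<w,m>_L m (an isometry of H fixing m)."""
    return -w - 2.0 * minkowski(w, m) * m

def isotropic_proposal(x, tau_q, rng):
    """Isotropic Gaussian step of scale tau_q at the base point o, moved to x by the
    geodesic symmetry at the midpoint of o and x, so q(x,y) depends only on d(x,y)."""
    n = x.size - 1
    o = np.zeros(n + 1); o[0] = 1.0
    xi = np.zeros(n + 1); xi[1:] = tau_q * rng.standard_normal(n)   # tangent at o
    m = exp_map(o, 0.5 * log_map(o, x))                             # midpoint of o and x
    return geodesic_symmetry(m, exp_map(o, xi))

def log_posterior(x, y_obs, z_prior, sigma, tau):
    """Unnormalised log-density of the posterior (4.3)."""
    return -dist(y_obs, x) ** 2 / (2 * sigma**2) - dist(x, z_prior) ** 2 / (2 * tau**2)

def metropolis_hastings(y_obs, z_prior, sigma, tau, n_samples, tau_q=0.3, seed=0):
    rng = np.random.default_rng(seed)
    x, chain = z_prior.copy(), []
    for _ in range(n_samples):
        y = isotropic_proposal(x, tau_q, rng)
        # isotropic proposal: acceptance probability min{1, pi(y)/pi(x)}
        if np.log(rng.uniform()) < log_posterior(y, y_obs, z_prior, sigma, tau) \
                                  - log_posterior(x, y_obs, z_prior, sigma, tau):
            x = y
        chain.append(x)
    return chain

def empirical_barycentre(chain, n_iter=50, step=0.5):
    """Riemannian gradient descent (4.35) on the variance function (4.21)."""
    x = chain[0]
    for _ in range(n_iter):
        g = sum(log_map(x, p) for p in chain) / len(chain)   # equals minus grad E_N(x)
        x = exp_map(x, step * g)
    return x

if __name__ == "__main__":
    n = 3
    o = np.zeros(n + 1); o[0] = 1.0
    v = np.zeros(n + 1); v[1] = 1.0
    y_obs, z_prior = exp_map(o, 0.8 * v), o
    chain = metropolis_hastings(y_obs, z_prior, sigma=0.5, tau=0.5, n_samples=5000)
    print(empirical_barycentre(chain[1000:]))   # approximates the MMS estimator
```

With $\sigma = \tau$ in this usage example, Proposition 4.1 predicts that the output should lie close to the midpoint of $z$ and $y$, which provides a check on the whole pipeline.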
Geometric ergodicity means the distribution π n of x n converges to π ,with a geometric rate, in the sense that there exist β ∈ (0 , 1) and R ( x ) ∈ (0 , ∞ ), as well as afunction V : M → R , such that (in the following, π ( dx ) = π ( x )vol( dx )) V ( x ) ≥ max (cid:8) , d ( x, x ∗ ) (cid:9) for some x ∗ ∈ M (4.17) (cid:12)(cid:12)(cid:12)(cid:12)Z M f ( x )( π n ( dx ) − π ( dx )) (cid:12)(cid:12)(cid:12)(cid:12) ≤ R ( x ) β n (4.18)for any function f : M → R with | f | ≤ V . If the chain ( x n ) is geometrically ergodic, then itsatisfies the strong law of large numbers [51]1 N N X n =1 f ( x n ) −→ Z M f ( x ) π ( dx ) (almost-surely) (4.19)as well as a corresponding central limit theorem (see Theorem 17.0.1, in [51]). Then, in practice,the Metropolis-Hastings algorithm generates samples ( x n ) from the posterior density π ( x ).In 4.6, the following general statement will be proved, concerning the geometric ergodicityof isotropic Metropolis-Hastings chains. Proposition 4.2. Let M be a Riemannian symmetric space, which belongs to the non-compactcase. Assume ( x n ; n ≥ is a Markov chain in M , with transition kernel given by (4.15), withproposed transition density q ( x, y ) = q ( d ( x, y )) , and with strictly positive invariant density π .The chain ( x n ) satisfies (4.17) and (4.18), if the following assumptions hold,(a1) there exists x ∗ ∈ M , such that r ( x ) = d ( x ∗ , x ) and ℓ ( x ) = log π ( x ) satisfy lim sup r ( x ) →∞ h grad r, grad ℓ i x r ( x ) < (a2) if n ( x ) = grad ℓ ( x )/ k grad ℓ ( x ) k , then n ( x ) satisfies lim sup r ( x ) →∞ h grad r, n i x < (a3) there exist δ q > and ε q > such that d ( x, y ) < δ q implies q ( x, y ) > ε q emark : the posterior density π in (4.3) verifies Assumptions (a1) and (a2). To see this, let x ∗ = z , and note from (1.67) and (1.75) thatgrad ℓ ( x ) = − τ r ( x )grad r ( x ) − σ grad f y ( x )Then, taking the scalar product with grad r , h grad r, grad ℓ i x = − τ r ( x ) − σ h grad r, grad f y i x (4.20)since grad r ( x ) is a unit vector, for all x ∈ M . Now, grad f y ( x ) = − Exp − x ( y ), by (1.75). But,since r ( x ) is a convex function of x , h grad r, Exp − x ( y ) i ≤ r ( y ) − r ( x )for any y ∈ M . Thus, the right-hand side of (4.20) is strictly negative, as soon as r ( x ) > r ( y ),and Assumption (a1) is indeed verified. That Assumption (a2) is also verified can be proved bya similar reasoning. Remark : on the other hand, Assumption (a3) holds, if the proposed transition density q ( x, y )is a Gaussian density, q ( x, y ) = p ( y | x, τ q ).With this choice of q ( x, y ), all the assumptions of Proposition 4.2 are verified, for theposterior density π in (4.3). Proposition 4.2 therefore implies that the Metropolis-Hastingsalgorithm generates geometrically ergodic samples ( x n ; n ≥ Let ( x n ; n ≥ 1) be a Metropolis-Hastings Markov chain in M , with its transition kernel (4.15),and invariant density π . Assume the chain ( x n ) is geometrically ergodic, so it satisfies the stronglaw of large numbers (4.19).Then, let ¯ x N denote the empirical barycentre of the first N samples ( x , . . . , x N ). This isthe unique global minimum of the variance function E N ( y ) = 12 N N X n =1 d ( y , x n ) (4.21)Assuming it is well-defined, let ˆ x denote the Riemannian barycentre of the invariant density π .It turns out that ¯ x N converges almost-surely to ˆ x . Proposition 4.3. Let ( x n ) be any Markov chain in a Hadamard manifold M , with invariantdistribution π . Denote ¯ x N the empirical barycentre of ( x , . . . 
, x N ) , and ˆ x the Riemannianbarycentre of π (assuming it is well-defined). If ( x n ) satisfies the strong law of large numbers(4.19), then ¯ x N converges to ˆ x , almost-surely. According to the remarks after Proposition 4.2, the Metropolis-Hastings Markov chain ( x n ),whose invariant density is the posterior density π ( x ), given by (4.3), is geometrically ergodic.Therefore, by Proposition 4.3, the empirical barycentre ¯ x MMS , of the samples ( x , . . . , x N ),converges almost-surely to the minimum mean square error estimator ˆ x MMS (since this is justthe barycentre of the posterior density π ). This provides a practical means of approximatingˆ x MMS . Indeed, ¯ x MMS can be computed using the Riemannian gradient descent method (thismethod is discussed in 4.4, below).The proof of Proposition 4.3 is nearly a word-for-word repetition of the proof in [24] (thatof Theorem 2.3). 67 roof of Proposition 4.3 : denote E π the variance function of the invariant distribution π , E π ( y ) = 12 Z M d ( y , x ) π ( dx )First, for any compact K ⊂ M , it will be proved thatsup y ∈ K |E N ( y ) − E π ( y ) | −→ δ > { w j ; j = 1 , . . . , J } be a δ -net in K (for any y ∈ K , there exists w j such that d ( w j , y ) < δ ). By the strong law of large numbers (4.19),max j =1 ,...,J |E N ( w j ) − E π ( w j ) | −→ | d ( y , x n ) − d ( w, x n ) | ≤ ( d ( y , x n ) + d ( w, x n )) | d ( y , x n ) − d ( w, x n ) | it follows by the triangle inequality that | d ( y , x n ) − d ( w, x n ) | ≤ ( d ( y , x n ) + d ( w, x n )) d ( w, y ) (4.24)From (4.24), it is possible to show that, for y and w in K , |E N ( y ) − E N ( w ) | ≤ sup z ∈ K N N X n =1 d ( z , x n ) ! d ( w, y ) (4.25)However, by the strong law of large numbers (4.19), if y o ∈ K and N is sufficiently large,1 N N X n =1 d ( z , x n ) ≤ Z M d ( y o , x ) π ( dx ) + diam K (almost-surely)Calling this quantity A , it follows that for N sufficiently large (note that this is the same N ,for all y and w in K ), |E N ( y ) − E N ( w ) | ≤ Ad ( w, y ) (almost-surely) (4.26)From (4.24), it is also possible to show that, for y and w in K , |E π ( y ) − E π ( w ) | ≤ Ad ( w, y ) (4.27)Now, if y ∈ K , let w ( y ) ∈ { w j } be such that d ( w ( y ) , y ) < δ . Then, for y in K , |E N ( y ) − E π ( y ) | ≤ |E N ( y ) − E N ( w ( y )) | + |E N ( w ( y )) − E π ( w ( y )) | + |E π ( w ( y )) − E π ( y ) | By (4.26) and (4.27), if N is sufficiently large, it follows that |E N ( y ) − E π ( y ) | ≤ Aδ + max j =1 ,...,J |E N ( w j ) − E π ( w j ) | and (4.22) follows from (4.23), since δ > N sufficiently large, and for any C > 0, it will be proved that there exists acompact K ⊂ M , such that y / ∈ K = ⇒ E N ( y ) > C (almost-surely) (4.28)68o do so, note from (4.21), by the triangle inequality E N ( y ) ≥ N N X n =1 ( d ( y , ˆ x ) − d (ˆ x, x n )) ≥ d ( y , ˆ x ) − N N X n =1 d (ˆ x, x n ) ! d ( y , ˆ x )However, by the strong law of large numbers (4.19), if N is sufficiently large1 N N X n =1 d (ˆ x, x n ) ≤ Z M d (ˆ x, x ) π ( dx )Calling this quantity B , it follows that for N sufficiently large, E N ( y ) ≥ d ( y , ˆ x ) − B d ( y , ˆ x ) (4.29)and this directly yields (4.28), since closed and bounded sets are compact (as a consequence ofthe Hopf-Rinow theorem [11]).Now, to complete the proof, note the following. By (4.28), for N sufficiently large, thereexists a compact K ⊂ M , such that E N ( y ) > E π (ˆ x ) + 1 almost-surely, whenever y / ∈ K . 
That is,inf y / ∈ K E N ( y ) > E π (ˆ x ) + 1 (almost-surely) (4.30)Moreover, one may always assume that K is a neighborhood of ˆ x . Then, if B (ˆ x, ǫ ) ⊂ K , itfollows from (4.22) that, for N sufficiently large,inf y ∈ B (ˆ x,ǫ ) E N ( y ) < inf y ∈ B (ˆ x,ǫ ) E π ( y ) + ǫ x is the unique global minimum of E π ( y ),inf y ∈ B (ˆ x,ǫ ) E N ( y ) < E π (ˆ x ) + ǫ N sufficiently large,inf y ∈ K − B (ˆ x,ǫ ) E N ( y ) > inf y ∈ K − B (ˆ x,ǫ ) E π ( y ) − ǫ E π is 1 / x , E π ( y ) ≥ E π (ˆ x ) + 12 d ( y , ˆ x )and this implies inf y ∈ K − B (ˆ x,ǫ ) E N ( y ) > E π (ˆ x ) + ǫ N sufficiently largeinf y ∈ M E N ( y ) = inf y ∈ B (ˆ x,ǫ ) E N ( y ) (almost-surely)Since E N has a unique global minimum ¯ x N , it follows that ¯ x N belongs to the closure of B (ˆ x, ǫ ),almost-surely, when N is sufficiently large. The proof is now complete, since ǫ is arbitrary.69 .4 Riemannian gradient descent Since the minimum mean square error estimator ˆ x MMS could not be computed directly, it wasapproximated by ¯ x MMS , the global minimum of the variance function E N , defined as in (4.21).This function E N being 1 / f : M → R , where M is a Hadamard manifold, with sectional curvatures in the interval [ − c , f is an ( α/ f is ( α/ M .In particular, for x, y ∈ M , f ( y ) − f ( x ) ≥ h Exp − x ( y ) , grad f ( x ) i x + ( α/ d ( x, y ) (4.33)This implies that f has compact sublevel sets. Indeed, let x ∗ be the global minimum of f , sograd f ( x ∗ ) = 0. Putting x = x ∗ and y = x in (4.33), it follows that f ( x ) − f ( x ∗ ) ≥ ( α/ d ( x ∗ , x ) (4.34)Accordingly, if S ( y ) is the sublevel set of y , then S ( y ) is contained in the closed ball ¯ B ( x ∗ , R y ),where R y = (2 /α )( f ( y ) − f ( x ∗ )). Therefore, S ( y ) is compact, since it is closed and bounded [11].The Riemannian gradient descent method is based on the iterative scheme x t +1 = Exp x t ( − µ grad f ( x t )) (4.35)where µ is a positive step-size, µ ≤ 1. If this is chosen sufficiently small, then the iterates x t remain within the sublevel set S ( x ). In fact, let ¯ B = ¯ B ( x ∗ , R x ) and ¯ B ′ = ¯ B ( x ∗ , R x + G ),where G denotes the supremum of the norm of grad f ( x ), taken over x ∈ ¯ B . Then, let H ′ denote the supremum of the operator norm of Hess f ( x ), taken over x ∈ ¯ B ′ . Lemma 4.2. For the Riemannian gradient descent method (4.35), if µ ≤ /H ′ , then the iterates x t remain within the sublevel set S ( x ) . Once it has been ensured that the iterates x t remain within S ( x ), it is even possible tochoose µ in such a way that these iterates achieve an exponential rate of convergence towards x ∗ . This relies on the fact that x ∗ is a “strongly attractive” critical point of the vector fieldgrad f . Precisely, putting y = x ∗ in (4.33), it follows that h Exp − x ( x ∗ ) , grad f ( x ) i x ≤ − ( α/ d ( x, x ∗ ) + ( f ( x ∗ ) − f ( x )) (4.36)Now, let C = cR x coth( cR x ). Proposition 4.4. Let ¯ H ′ = max { H ′ , } . If µ ≤ / ( ¯ H ′ C ) (this implies µ ≤ /H ′ ) and µ ≤ /α , d ( x t , x ∗ ) ≤ (1 − µα ) t d ( x , x ∗ ) (4.37)The proof of Proposition 4.4 will employ the following lemma. Lemma 4.3. Let ¯ H ′ = max { H ′ , } . For any x ∈ ¯ B , k grad f k x ≤ H ′ ( f ( x ) − f ( x ∗ )) (4.38) Remark : the rate of convergence predicted by (4.37) is exponential, but depends on the initialguess x , through the constants ¯ H ′ and C . This rate can become arbitrarily bad, if x is chosensufficiently far from x ∗ , since both ¯ H ′ and C may then become arbitrarily large. 
By contrast,if M is a Euclidean space (that is, in the limit c = 0), C = 1, is a constant. Remark : I have never met with a function f : M → R ( M a non-Euclidean Hadamardmanifold), which is strongly convex, and also has a bounded Hessian. I do not even knowwhether it is possible or not to construct such a function.70 roof of Lemma 4.2 : let c : [0 , → M be the geodesic curve with c (0) = x t and c (1) = x t +1 .From (4.35), ˙ c (0) = − µ grad f ( x t ). Then, by the Taylor expansion (1.20), f ( x t +1 ) = f ( x t ) − µ k grad f k x t + 12 Hess f c ( u ) ( ˙ c, ˙ c ) (4.39)for some u ∈ (0 , x t belongs to S ( x ) ⊂ ¯ B . Then, by the triangle inequality, d ( x ∗ , c ( u )) ≤ d ( x ∗ , x t ) + d ( x t , c ( u )) ≤ R x + µG where the second inequality follows from the definition of G , because d ( x t , c ( u )) = u k ˙ c (0) k .Since µ ≤ 1, it follows that d ( x ∗ , c ( u )) ≤ R x + G . Therefore, c ( u ) ∈ ¯ B ′ . Then, from thedefinition of H ′ , Hess f c ( u ) ( ˙ c, ˙ c ) ≤ H ′ k ˙ c k c ( u ) = H ′ µ k grad f k x t Replacing this into (4.39), f ( x t +1 ) ≤ f ( x t ) − µ (1 − µ ( H ′ / k grad f k x t (4.40)Clearly, then, taking µ ≤ /H ′ , it follows that f ( x t +1 ) ≤ f ( x t ) so that x t +1 belongs to S ( x ).The lemma is proved by induction. Proof of Proposition 4.4 : let c : [0 , → M be the geodesic with c (0) = x t and c (1) = x t +1 .Note from (4.35) that ˙ c (0) = − µ grad f ( x t ). Let W ( x ) = d ( x, x ∗ ) / 2, and write down its Taylorexpansion (1.20), W ( x t +1 ) = W ( x t ) − µ h grad W, grad f i x t + 12 Hess W c ( u ) ( ˙ c, ˙ c ) (4.41)for some u ∈ (0 , W and Hess W are given by (1.75) and (1.77), and also that x t and x t +1 belong to S ( x ) ⊂ ¯ B , by Lemma 4.2, since µ ≤ /H ′ . Since S ( x ) is a convex set(recall the definition from 1.7.4), c ( u ) also belongs to S ( x ) ⊂ ¯ B . By the definition of C ,Hess W c ( u ) ( ˙ c, ˙ c ) ≤ C k ˙ c k c ( u ) = C µ k grad f k x t Replacing into (4.41), one now has W ( x t +1 ) ≤ W ( x t ) + µ h Exp − x t ( x ∗ ) , grad f i x t + ( C / µ k grad f k x t (4.42)Therefore, by (4.36) and (4.38), W ( x t +1 ) ≤ W ( x t )(1 − µα ) + µ (1 − µ ( ¯ H ′ C ))( f ( x ∗ ) − f ( x )) (4.43)If µ ≤ / ( ¯ H ′ C ), then (4.43) implies W ( x t +1 ) ≤ (1 − µα ) W ( x t ), because f ( x ∗ ) − f ( x ) ≤ − µα ≥ Proof of Lemma 4.3 : let c denote the geodesic with c (0) = x and ˙ c (0) = ( − / ¯ H ′ )grad f ( x ).By the same arguments as in the proof of Lemma 4.2, one has that c ( u ) ∈ ¯ B ′ for all u ∈ [0 , y = c (1) and writing down the Taylor expansion (1.20), f ( y ) − f ( x ) ≤ ( − / ¯ H ′ ) k grad f k x + ( ¯ H ′ / k (1 / ¯ H ′ )grad f ( x ) k x = ( − / H ′ ) k grad f k x Multiplying this inequality by − H ′ ,2 ¯ H ′ ( f ( x ) − f ( y )) ≥ k grad f k x Now, (4.38) obtains by noting that f ( x ) − f ( x ∗ ) ≥ f ( x ) − f ( y ).71 .5 A volume growth lemma Lemma 4.4 will be used in the proof of Proposition 4.2, to be carried out in 4.6. This lemma isof a purely geometric content, and is therefore considered separately, beforehand.Let M be a Riemannian symmetric space, which belongs to the non-compact case (see 1.9.2).Then, in particular, M is a Hadamard manifold.Fix x ∗ ∈ M , and let ( r, θ ) be geodesic spherical coordinates, with origin at x ∗ . Any z ∈ M ,other than x ∗ , is uniquely determined by its coordinates ( r, θ ), and will be written z ( r, θ ).Recall the volume density function det( A ( r, θ )), from the integral formula (1.95). 
This willbe denoted λ ( r, θ ) = det( A ( r, θ )).Essentially, the following lemma states the logarithmic rate of growth of the volume densityfunction λ ( r, θ ) is bounded at infinity. Lemma 4.4. Let M be a Riemannian symmetric space, which belongs to the non-compact case.Fix x ∗ ∈ M and denote r ( x ) = d ( x ∗ , x ) for x ∈ M . Then, for any R > , lim sup r ( x ) →∞ sup z ( r,θ ) ∈ B ( x,R ) λ ( r, θ )inf z ( r,θ ) ∈ B ( x,R ) λ ( r, θ ) < ∞ (4.44)The proof of this lemma proceeds in the following way. Identify the unit sphere in T x ∗ M with S n − , and consider for θ ∈ S n − the self-adjoint curvature operator R θ : T x ∗ M → T x ∗ M ,given by R θ ( v ) = − R ( θ, v ) θ ; v ∈ T x ∗ M Recall that the Riemann curvature tensor is parallel (because M is a symmetric space). Then,from (1.69) and the definition of A ( r, θ ), it follows that A ( r, θ ) solves the Jacobi equation A ′′ − R θ A = 0 A (0) = 0 , A ′ (0) = Id x ∗ (4.45)where the prime denotes differentiation with respect to r . At present, all the eigenvalues of R θ are positive. If c ( θ ) runs through these eigenvalues, then it follows from (4.45) that λ ( r, θ ) = Y c ( θ ) (cid:18) sinh( c ( θ ) r ) c ( θ ) (cid:19) m c ( θ ) (4.46)where m c ( θ ) denotes the multiplicity of the eigenvalue c ( θ ) of R θ .It is possible to express (4.46) in a different form. Let M = G/K where K is the stabiliserin G of x ∗ . Let g and k be the Lie algebras of G and K , and g = k + p the corresponding Cartandecomposition. Let a be a maximal Abelian subspace of p , and recall that it is always possibleto write rθ = Ad( k ) a for some k ∈ K and a ∈ a (see Lemma 6.3, Chapter V, in [10]). In thisnotation, r = k a k x ∗ and c ( θ ) = λ ( a ) / k a k x ∗ , where λ is a positive roots of g with respect to a ,with multiplicity m λ = m c ( θ ) (see Lemma 2.9, Chapter VII, in [10]). Replacing into (4.46) gives λ ( r, θ ) = Y λ ∈ ∆ + (cid:18) sinh( λ ( a )) λ ( a ) / k a k (cid:19) m λ (4.47)Here, if the right-hand side is denoted by f ( a ), then it is elementary that log f ( a ) is a Lipschitzfunction, on the complement of any bounded subset of a which contains the zero element of a .Returning to (4.44), let the supremum in the numerator be achieved at ( r max , θ max ) andthe infimum in the denominator be achieved at ( r min , θ min ). Let ( k max , a max ) and ( k min , a min )be corresponding values of k and a . Note that for z ( r, θ ) ∈ B ( x, R ), by the triangle inequality, r ≥ r ( x ) − R . But, since r = k a k x ∗ , this also means k a k x ∗ ≥ r ( x ) − R .72herefore, if r ( x ) > R then, as stated above, log f ( a ) is a Lipschitz function, on the set of a such that k a k x ∗ ≥ r ( x ) − R . If C is the corresponding Lipschitz constant,sup z ( r,θ ) ∈ B ( x,R ) λ ( r, θ )inf z ( r,θ ) ∈ B ( x,R ) λ ( r, θ ) ≤ exp[C k a max − a min k x ∗ ] (4.48)Now, (4.44) will follow by showing that k a max − a min k x ∗ < R wherever r ( x ) > R .To do so, let z max = z ( r max , θ max ) and z min = z ( r min , θ min ), and note d ( z max , z min ) ≤ R .If c : [0 , → M is a geodesic curve with c (0) = z min and c (1) = z max , then Z k ˙ c ( t ) k c ( t ) dt = d ( z max , z min ) ≤ R (4.49)On the other hand, if c ( t ) = c ( r ( t ) , θ ( t )), then it is possible to write r ( t ) θ ( t ) = Ad( k ( t )) a ( t ),where k ( t ) and a ( t ) are differentiable curves in K and a . 
It will be shown below that this implies k ˙ c ( t ) k c ( t ) = k ˙ a ( t ) k x ∗ + X λ ∈ ∆ + sinh ( λ ( a ( t )) k ˙ k λ ( t ) k x ∗ (4.50)where ˙ k λ ( t ) is defined following (4.53), below. Finally, from (4.49) and (4.50), it follows that k a max − a min k x ∗ ≤ Z k ˙ a ( t ) k x ∗ dt ≤ Z k ˙ c ( t ) k c ( t ) dt ≤ R Replacing into (4.48), this yieldssup z ( r,θ ) ∈ B ( x,R ) λ ( r, θ )inf z ( r,θ ) ∈ B ( x,R ) λ ( r, θ ) ≤ exp(2C R )for all x such that r ( x ) > R . However, this immediately implies (4.44). Proof of (4.50) : in the notation of 1.9.2, c ( t ) = ϕ ( s ( t ) , a ( t )), where s ( t ) is the representativeof k ( t ) in the quotient K/K a . Recall that ϕ ( s, a ) = Exp o ( β ( s, a )) where β ( s, a ) = Ad( s ) a (the dependence on t is now suppressed). Then, by differentiating with respect to t ,˙ β ( s, a ) = Ad( s ) ( ˙ a + [ ˙ s, a ])Further, by replacing from (1.102),˙ c = exp( rθ ) · sh( R rθ )( ˙ β ( s, a ))However, Ad( s ) preserves norms, and Ad( s − ) ◦ R rθ ◦ Ad( s ) = R a , as in 1.104). Therefore, k ˙ c k c = k sh( R a ) ( ˙ a + [ ˙ s, a ]) k x ∗ (4.51)and from the definition of sh( R a ),sh( R a ) = Π a + X λ ∈ ∆ + sinh( λ ( a )) λ ( a ) Π λ (4.52)Now, one has the orthogonal decomposition ˙ s = P λ ∈ ∆ + ( ξ λ + dθ ( ξ λ )) where [ a, ξ λ ] = λ ( a ) ξ λ and dθ was introduced before (1.97) (see Lemma 3.6, Chapter VI, in [10]). In turn, this yieldsthe orthogonal decomposition [ a, ˙ s ] = X λ ∈ ∆ + λ ( a )( ξ λ − dθ ( ξ λ )) (4.53)Letting ˙ s λ = ( ξ λ − dθ ( ξ λ )), it follows from (4.51) and (4.52) that k ˙ c k c = k ˙ a k x ∗ + X λ ∈ ∆ + sinh ( λ ( a )) k ˙ s λ k x ∗ This is the same as (4.50), once ˙ k is identified with its representative ˙ s .73 .6 Proof of geometric ergodicity The proof of Proposition 4.2 relies on the so-called geometric drift condition. This conditionrequires that there exist a function V : M → R such that V ( x ) ≥ max (cid:8) , d ( x, x ∗ ) (cid:9) for some x ∗ ∈ M (4.54) P V ( x ) ≤ λV ( x ) + b C ( x ) (4.55)for some λ ∈ (0 , 1) and b ∈ (0 , ∞ ), and where C is a small set for P (for the definition, see [51]).If the geometric drift condition (4.55) is verified, then the geometric ergodicity condition (4.18)holds [51].The proof is a generalisation of the proof carried out in the special case where M is aEuclidean space, in [49]. The idea is to use Assumptions (a1)–(a3) to show that the followingtwo conditions hold, lim sup r ( x ) →∞ P V ( x ) V ( x ) < x ∈ M P V ( x ) V ( x ) < ∞ (4.57)where r ( x ) = d ( x ∗ , x ), and V ( x ) = aπ − ( x ) with a chosen so V ( x ) ≥ x ∈ M . However,under Assumption (a3), these two conditions are shown to imply (4.55). Lemma 4.5. Let ( x n ) be a Markov chain in M , with transition kernel (4.15), with proposedtransition density q ( x, y ) = q ( d ( x, y )) , and continuous, strictly positive invariant density π .Moreover, assume the proposed transition density satisfies Assumption (a3). If Conditions(4.56) and (4.57) are verified, then the geometric drift condition (4.55) holds. On the other hand, (4.54) (which is just the same as (4.17)) is a straightforward result ofAssumption (a1), which implies the existence of strictly positive µ, R and π R such that r ( x ) ≥ R = ⇒ π ( x ) ≤ π R exp (cid:0) − µr ( x ) (cid:1) (4.58)Then, to obtain (4.54), it is enough to chose a = max (cid:8) , R , π / R , µ − (cid:9) . Proof of Lemma 4.5 : the proof is almost identical to the proofs for random-walk Metropolischains in Euclidean space [49][52]. 
The main point is that Assumption (a3) implies that everynon-empty bounded subset of M is a small set for the transition kernel P in (4.15). With thisin mind, the geometric drift condition (4.55) follows almost directly from the two conditions(4.56) and (4.57). Indeed, (4.56) implies that there exist λ ∈ (0 , 1) and R ∈ (0 , ∞ ) such that r ( x ) ≥ R = ⇒ P V ( x ) ≤ λV ( x )That is, (4.55) is verified on M − C , where C is the open ball B ( x ∗ , R ). In addition, by (4.57), b = " sup x ∈ B ( x ∗ ,R ) V ( x ) sup x ∈ M P V ( x ) V ( x ) (cid:21) < ∞ Therefore, (4.55) is also verified on C , since for x ∈ C , P V ( x ) ≤ b ≤ λV ( x ) + b Thus, (4.55) is verified throughout M . It remains to note that C is a small set, since it isbounded. 74ow, the aim is to establish the two conditions (4.56) and (4.57). These will follow fromPropositions 4.5 and 4.6, below. Consider the proposed transition kernel Qf ( x ) = Z M q ( x, y ) f ( y )vol( dy ) (4.59)for any bounded measurable function f : M → R . If f is the indicator function of a measurableset A , then it is usual to write Qf ( x ) = Q ( x, A ). For x ∈ M , consider its acceptance region A ( x ) = { y ∈ M : π ( y ) ≥ π ( x ) } Proposition 4.5. Under the assumptions of Proposition 4.2, the following limit holds lim inf r ( x ) →∞ Q ( x, A ( x )) > Proposition 4.6. Under the assumptions of Proposition 4.2, if (4.60) holds, then the twoconditions (4.56) and (4.57) are verified, where V ( x ) = aπ − ( x ) with a chosen so V ( x ) ≥ forall x ∈ M . The proof of these two propositions will use the following fact, concerning the contourmanifolds of the probability density function π ( x ). For x ∈ M , the contour manifold of x is theset C x of all y ∈ M such that π ( y ) = π ( x ). This is a hypersurface in M , whenever π ( x ) is aregular value of π (by the “regular level set theorem” [53]). fact : if r ( x ) is sufficiently large, then C x can be parameterised by the unit sphere in T x ∗ M .Precisely, it is possible to write C x = { Exp x ∗ ( c ( v ) v ) ; v ∈ S x ∗ M } (4.61)where c is a positive continuous function on S x ∗ M , the set of unit vectors v in T x ∗ M . Moreover, A ( x ) is exactly the region inside of C x . Precisely, y ∈ A ( x ) if and only if y = Exp x ∗ ( cv ) where v ∈ S x ∗ M and c ≤ c ( v ). Proof of Proposition 4.5 : by Assumption (a2), there exist δ > R > r ( y ) ≥ R = ⇒ h grad r, n i y < − δ (4.62)Let − c be a lower bound on the sectional curvatures of M , and Λ be a positive number with(dim M ) Λ ≤ δ c tanh( cR ) (4.63)Now, for any x ∈ M with r ( x ) ≥ R + Λ, consider the setΩ( x ) = (cid:26) Exp x ( − au ) ; a ∈ (0 , Λ) , u ∈ S x M , k grad r ( x ) − u k x ≤ δ (cid:27) Let y = Exp x ( − au ) be a point in Ω( x ), and γ ( t ) the unit-speed geodesic with γ (0) = x and γ ( a ) = y . It is first proved that h ˙ γ , n i γ ( t ) > t ∈ (0 , a ) (4.64)Indeed, the left-hand side of (4.64) may be written h ˙ γ , n i γ ( t ) = − h grad r, n i γ ( t ) + h ˙ γ + grad r, n i γ ( t ) t denotes the parallel transport along γ from γ (0) = x to γ ( t ), h ˙ γ , n i γ ( t ) = − h grad r, n i γ ( t ) + h Π t (grad r ( x ) − u ) , n i γ ( t ) + h grad r − Π t (grad r ( x )) , n i γ ( t ) (4.65)which may be checked by adding together the three terms, and noting that ˙ γ ( t ) = Π t ( − u ),since γ is a geodesic with ˙ γ (0) = − u . But, by the triangle inequality r ( γ ( t )) ≥ r ( x ) − d ( x, γ ( t )) > ( R + Λ) − Λ = R since d ( x ∗ , x ) = r ( x ) ≥ R + Λ and d ( x, γ ( t )) ≤ a ≤ Λ. 
Thus, it follows from (4.62) − h grad r, n i γ ( t ) > δ (4.66)Moreover, since the parallel transport Π t preserves norms, and since by definition of Ω( x ), k grad r ( x ) − u k x ≤ δ/ 2, it follows from the Cauchy-Schwarz inequality h Π t (grad r ( x ) − u ) , n i γ ( t ) ≥ −k Π t (grad r ( x ) − u ) k x = −k grad r ( x ) − u k x ≥ − δ/ e i ; 1 , . . . , n ) be a parallel orthonormal base, along the geodesic γ . Then, h grad r − Π t (grad r ( x )) , e i i γ ( t ) = Z t h Hess r · ˙ γ , e i i γ ( s ) ds But, according to (1.73) from Theorem 1.1, Z t h Hess r · ˙ γ , e i i γ ( s ) ds ≤ Z t c coth ( cr ( γ ( s ))) ds ≤ Λ c coth ( cR )Thus, using (4.63), it follows by the Cauchy-Schwarz inequality h grad r − Π t (grad r ( x )) , n i γ ( t ) ≥ − δ/ h ˙ γ , n i γ ( t ) > δ − δ/ − δ/ x ) ⊂ A ( x ) (4.69)for all x such that r ( x ) ≥ R + Λ, where A ( x ) is the acceptance region of x , defined after (4.59).To prove (4.69), consider y ∈ Ω( x ) and γ ( t ) as before, with γ (0) = x and γ ( a ) = y . Now,assume that y ∈ C x , the contour manifold of x , defined in (4.61). Then, π ( γ (0)) = π ( γ ( a )), sothat, by the mean-value theorem, there exists t ∈ (0 , a ) such that ddt π ( γ ( t )) = h ˙ γ ( t ) , grad π i γ ( t ) = 0But, from the definition of n ( x ), this implies h ˙ γ ( t ) , n i γ ( t ) = k grad π ( x ) k − h ˙ γ ( t ) , grad π i γ ( t ) = 0in contradiction with (4.64). Thus, the assumption that y ∈ C x cannot hold. Since y ∈ Ω( x ) isarbitrary, this means that Ω( x ) ∩ C x = ∅ (4.70)However, note that y ∗ = Exp x ( − a grad r ( x )) belongs to Ω( x ), as can be seen from the definitionof Ω( x ). Also, since r ( y ∗ ) = r ( x ) − a , it follows that y ∗ is inside of C x . Therefore, y ∗ ∈ A ( x ),and the intersection of Ω( x ) and A ( x ) is non-empty. Finally, it is enouh to note that the setΩ( x ) is connected, since it is the image under Exp x of a connected set. This implies that, if theintersection of Ω( x ) and R ( x ), the complement of A ( x ), were non-empty, then Ω( x ) would alsointersect C x . Clearly, this would be in contradiction with (4.70).76sing (4.69), it is now possible to prove (4.60). Indeed, for x such that r ( x ) ≥ R + Λ, itfollows from (4.69) that Q ( x, A ( x )) ≥ Q ( x, Ω( x )) = Z Ω( x ) q ( x, y )vol( dy ) (4.71)where the last equality follows from (4.59). However, by Assumption (a3), Z Ω( x ) q ( x, y )vol( dy ) ≥ ε q × vol (Ω( x ) ∩ B ( x, δ q )) (4.72)Now, to prove (4.60), it only remains to show thatvol (Ω( x ) ∩ B ( x, δ q )) ≥ c > x . Indeed, it is then clear from (4.71) and (4.72) thatlim inf r ( x ) →∞ Q ( x, A ( x )) > ε q × c > r, θ ) be geodesic spherical coordinates, with origin at x . Using the integralformula (1.95), after noting λ ( r, θ ) = det( A ( r, θ )), it followsvol (Ω( x ) ∩ B ( x, δ q )) = Z τ Z Sn − {k grad r ( x ) − u ( θ ) k x ≤ δ/ } λ ( r, θ ) drω n − ( dθ ) (4.74)where τ = min { Λ , δ q } , and the map θ u ( θ ) identifies S n − with S x M . Here, by (1.92) fromTheoreom 1.2, λ ( r, θ ) ≥ r n − . Therefore, (4.76) impliesvol (Ω( x ) ∩ B ( x, δ q )) ≥ ( τ n /n ) × ω n − ( {k grad r ( x ) − u ( θ ) k x ≤ δ/ } )However, since the area measure ω is invariant by rotation, the area ω n − ( {k grad r ( x ) − u ( θ ) k x ≤ δ/ } ) = ς does not depend on x . Precisely, ς is equal to the area of a spherical cap, with angle equal to2acos(1 − δ / τ d /d ) × ς . Proof of Proposition 4.6 : let V ( x ) = aπ − ( x ), as in the proposition. 
Recall the transitionkernel P is given by (4.15), which implies ρ ( x ) = Z M (1 − α ( x, y )) q ( x, y )vol( dy )since the right-hand side of (4.15) should integrate to 1 when f ( x ) is the constant function f ( x ) = 1. But, since α ( x, y ) = min { , π ( y ) /π ( x ) } , it follows that 1 − α ( x, y ) = 0 when y ∈ A ( x ), the acceptance region of x , defined after (4.59). Thus, ρ ( x ) = Z R ( x ) (cid:20) − π ( y ) π ( x ) (cid:21) q ( x, y )vol( dy )where R ( x ), the complement of A ( x ), is the rejection region of x . With this expression of ρ ( x ),putting f ( x ) = V ( x ) in (4.15), it follows by a direct calculation that P V ( x ) /V ( x ) is equal to Z A ( x ) q ( x, y ) (cid:20) π ( x ) π ( y ) (cid:21) vol( dy ) + Z R ( x ) q ( x, y ) − (cid:20) π ( y ) π ( x ) (cid:21) + (cid:20) π ( y ) π ( x ) (cid:21) ! vol( dy ) (4.75)77ere, all the ratios are less than or equal to 1, so that (4.16) immediately implies (4.57).In order to prove (4.56), it is enough to prove thatlim r ( x ) →∞ Z A ( x ) q ( x, y ) (cid:20) π ( x ) π ( y ) (cid:21) vol( dy ) = 0 (4.76)lim r ( x ) →∞ Z R ( x ) q ( x, y ) (cid:20) π ( y ) π ( x ) (cid:21) − (cid:20) π ( y ) π ( x ) (cid:21)! vol( dy ) = 0 (4.77)Indeed, if these two limits are replaced in (4.75), it will follow thatlim sup r ( x ) →∞ P V ( x ) V ( x ) = lim sup r ( x ) →∞ Q ( x, R ( x )) = lim sup r ( x ) →∞ − Q ( x, A ( x )) < Proof of (4.76) : this is divided into three steps. First, it is proved thatlim L →∞ Z A ( x ) − B ( x,L ) q ( x, y )( α ( y, x )) vol( dy ) = 0 uniformly in x (4.78)where α ( y, x ) = π ( x ) /π ( y ). To prove (4.78) note that α ( y, x ) ≤ y ∈ A ( x ), and that A ( x ) − B ( x, L ) ⊂ M − B ( x, L ). It follows that, for any x ∈ M , Z A ( x ) − B ( x,L ) q ( x, y )( α ( y, x )) vol( dy ) ≤ Z M − B ( x,L ) q ( x, y )vol( dy ) (4.79)Since M is a symmetric space, there exists an isometry g of M such that g · x ∗ = x . Since g preserves Riemannian volume, Z M − B ( x,L ) q ( x, y )vol( dy ) = Z M − B ( x ∗ ,L ) q ( x, g · y )vol( dy )But, q ( x, y ) = q ( d ( x, y )) depends only on the Riemannian distance d ( x, y ). This implies that q ( x, g · y ) = q ( x ∗ , y ), since g is an isometry. Thus, Z M − B ( x,L ) q ( x, y )vol( dy ) = Z M − B ( x ∗ ,L ) q ( x ∗ , y )vol( dy )Here, the right-hand side does not depend on x , and tends to zero as L → ∞ , as can be seenby putting x = x ∗ in (4.16). Now (4.78) follows directly from (4.79).Second, assume that r ( x ) is so large that the level set C x verifies (4.61) and A ( x ) is equalto the region inside C x . It is then proved that, for any L > r ( x ) →∞ Z A ( x ) ∩ B ( x,L ) − Cx ( ε ) q ( x, y )( α ( y, x )) vol( dy ) = 0 (4.80)where C x ( ε ) is the tubular neighborhood of C x given by C x ( ε ) = (cid:8) Exp y ( s grad r ( y )) ; y ∈ C x , | s | < ε (cid:9) Because of (4.16), to prove (4.80) it is enough to prove thatlim r ( x ) →∞ α ( y, x ) = 0 uniformly in y ∈ A ( x ) ∩ B ( x, L ) − C x ( ε ) (4.81)78owever, this follows by Assumption (a1). Indeed, this assumption guarantees the existenceof some strictly positive µ, R and π R , as in (4.58). Then, take r ( x ) ≥ R + ε and note that,by (4.58), for y as in (4.81), if r ( y ) ≤ R , α ( y, x ) ≤ π R exp (cid:0) − µr ( x ) (cid:1) π ( y ) ≤ π R exp (cid:0) − µr ( x ) (cid:1) min r ( y ) ≤ R π ( y ) (4.82)where the right-hand side converges to zero as r ( x ) → ∞ , uniformly in y . On the other hand,if r ( y ) > R , let c be the unit-speed geodesic connecting x ∗ to y . 
Since y ∈ A ( x ) (so y lies inside C x ) there exists some r ≥ r ( y ) such that c ( r ) ∈ C x . Moreover, since y / ∈ C x ( ε ), it follows that r > r ( y ) + ε . Then, it is possible to show, by Assumption (a1), α ( y, x ) = π ( c ( r )) π ( c ( r ( y ))) ≤ exp[ − µ ( r − r ( y ))]By a direct calculation, this implies α ( y, x ) ≤ exp[ − µ (2 εr − ε )] ≤ exp[ − µ (2 εr ( w ) − ε )] (4.83)where w ∈ C x is such that r ( w ) is the minimum of r ( w ′ ), taken over all w ′ ∈ C x . Note thatthe right-hand side of (4.83) does not depend on y . Moreover, π ( w ) tends to zero as r ( x ) → ∞ ,since π ( w ) = π ( x ), and π ( x ) tends to zero as r ( x ) → ∞ . Therefore, because π ( w ) is positive, itfollows that r ( w ) → ∞ as r ( x ) → ∞ . But, this implies the right-hand side of (4.83) convergesto zero as r ( x ) → ∞ , uniformly in y . Now, (4.81) follows from (4.83).The third, and final, step is to show that, for any L > ε → lim sup r ( x ) →∞ Z A ( x ) ∩ B ( x,L ) ∩ Cx ( ε ) q ( x, y )( α ( y, x )) vol( dy ) = 0 (4.84)For brevity, the proof is carried out under the assumption that q ( x, y ) is bounded, uniformly in x and y . If this assumption holds, then (4.84) follows immediately by showinglim ε → lim sup r ( x ) →∞ vol ( B ( x, L ) ∩ C x ( ε )) = 0 (4.85)To show (4.85), let θ v ( θ ) identify the Euclidean unit sphere S n − with S x ∗ M , and considerthe following sets T ( x ) = { θ ∈ S n − : Exp x ∗ ( rv ( θ )) ∈ B ( x, L ) for some r ≥ } S ( x ) = { Exp x ∗ ( rv ( θ )) ; θ ∈ T ( x ) and | r − r ( x ) | ≤ L } Using the triangle inequality, it is possible to show that B ( x, L ) ⊂ S ( x ) ⊂ B ( x, L ) (4.86)To estimate the volume in (4.85), let ( r, θ ) be geodesic spherical coordinates, with origin at x ∗ .The first inclusion in (4.86) implies vol ( B ( x, L ) ∩ C x ( ε )) ≤ vol ( S ( x ) ∩ C x ( ε )), and this yieldsvol ( B ( x, L ) ∩ C x ( ε )) ≤ Z r ( x )+ Lr ( x ) − L Z T ( x ) C x ( ε ) (Exp x ∗ ( rv ( θ ))) λ ( r, θ ) drω n − ( dθ )in the notation of (1.95), where λ ( r, θ ) = det( A ( r, θ )). Bounding the last integral from above,vol ( B ( x, L ) ∩ C x ( ε )) ≤ εω n − ( T ( x )) sup z ( r,θ ) ∈ B ( x, L ) λ ( r, θ ) (4.87)79here z ( r, θ ) = Exp x ∗ ( rv ( θ )). Similarly, the second inclusion in (4.86) impliesvol ( B ( x, L )) ≥ vol( S ( x )) = Z r ( x )+ Lr ( x ) − L Z T ( x ) λ ( r, θ ) drω n − ( dθ )and bounding the last integral from below givesvol ( B ( x, L )) ≥ Lω n − ( T ( x )) inf z ( r,θ ) ∈ B ( x, L ) λ ( r, θ ) (4.88)From (4.87) and (4.88), it follows thatvol ( B ( x, L ) ∩ C x ( ε )) ≤ ( ε / L ) vol ( B ( x, L )) sup z ( r,θ ) ∈ B ( x, L ) λ ( r, θ )inf z ( r,θ ) ∈ B ( x, L ) λ ( r, θ ) (4.89)However, by the volume growth lemma 4.4, from 4.5,lim sup r ( x ) →∞ sup z ( r,θ ) ∈ B ( x, L ) λ ( r, θ )inf z ( r,θ ) ∈ B ( x, L ) λ ( r, θ ) = R < ∞ Replacing into (4.89), and noting that, since M is a symmetric space,vol ( B ( x, L )) = vol ( B ( x ∗ , L )), it follows that lim sup r ( x ) →∞ vol ( B ( x, L ) ∩ C x ( ε )) ≤ ( ε / L ) vol ( B ( x ∗ , L )) RThis immediately implies (4.85), and therefore (4.84). Conclusion : finally, (4.76) can be obtained by combining (4.78), (4.80) and (4.84). Precisely,the integral under the limit in (4.76) can be decomposed into the sum of three integrals (cid:18)Z A ( x ) − B ( x,L ) + Z A ( x ) ∩ B ( x,L ) − Cx ( ε ) + Z A ( x ) ∩ B ( x,L ) ∩ Cx ( ε ) (cid:19) q ( x, y )( α ( y, x )) vol( dy )By (4.78), for any ∆ > 0, it is possible to choose L to make the first integral less than ∆ / x and ε . 
By (4.84), it is possible to choose ε to make the third integral less than∆ / 3, for all x with sufficiently large r ( x ). With L and ε chosen in this way, (4.80) implies thesecond integral is less than ∆ / 3, if r ( x ) is sufficiently large. Then, the sum of the three integralsis less than ∆, and (4.76) follows, because ∆ is arbitrary.80 hapter 5 Stochastic approximation Contents The present chapter is based on [2][3]. It aims to give a general treatment, under realistic assumptions,of two problems related to stochastic approximation on Riemannian manifolds.The first problem is to estimate the rate of convergence of a stochastic approximation scheme, to the setof critical points ( i.e. zeros) of its mean field. • • • • • • x t ; t = 0 , , . . . ) with values in aHadamard manifold M , where each x t +1 is a geodesic convex combination of the old x t and ofa new input y t +1 , with respective weights 1 − µ (for x t ) and µ (for y t +1 ), for some µ ∈ (0 , y t ; t = 1 , , . . . ) are independent samples from a probability distribution P on M , then theMarkov chain ( x t ) is geometrically ergodic, and its invariant distribution concentrates at theRiemannian barycentre of P , as µ goes to zero. .1 Approximate critical points Here, the main object of study will be a stochastic approximation scheme, on a Riemannianmanifold M . Given some initial value x ∈ M , and independent observations ( y t ; t = 1 , , . . . ),drawn from a probability distribution P on a measurable space Y , this computes a sequence ofiterates ( x t ; t = 1 , , . . . ), according to the update rule x t +1 = Ret x t (cid:0) µ t +1 X y t +1 ( x t ) (cid:1) (5.1)where Ret : T M → M is a retraction, ( µ t ; t = 1 , , . . . ) is a sequence of (positive) step-sizes,and the map X : Y × M → T M is such that X ( y, x ) = X y ( x ) always belongs to T x M .One says that X : Y × M → T M is a random vector field. The corresponding mean vectorfield X : M → T M is given by X ( x ) = Z Y X y ( x ) P ( dy ) (5.2)which means that the noise vector field, given by e y ( x ) = X y ( x ) − X ( x ), has zero expectation.In the following, it will be assumed the variance of this noise vector field is not too large, Z Y k e y ( x ) k x P ( dy ) ≤ σ + σ k X k x (5.3)for some constants σ , σ .The scheme (5.1) is often used to search for zeros (critical points) of the mean vector field X .After t iterations, this scheme will have generated the iterates ( x s ; s = 1 , . . . , t ). One mayrandomly sample these, by looking at x τ t where τ t follows a discrete probability distribution P ( τ t = s ) = µ s +1 P ts =1 µ s +1 s = 1 , . . . , t (5.4)Then, the scheme is said to have found an approximate critical point (precisely, an ǫ -criticalpoint, for some suitable accuracy ǫ > 0) in expectation, if E k X ( x τ t ) k ≤ ǫ .For example, note that if µ t = µ is a constant, so (5.1) is a constant-step-size scheme, E h k X ( x τ t ) k x τt i = 1 t t X s =1 E (cid:2) k X ( x s ) k x s (cid:3) is just the average, over the first t iterates, of the expected norm of the mean field.In order to study the stochastic approximation scheme (5.1), it is helpful to introduce aLyapunov function V : M → R . This is a positive function, which is continuously differentiable,and has ℓ -Lipschitz gradient, in the sense of (1.86). It is moreover assumed to satisfy c k X k x ≤ −h grad V, X i x (5.5)for some constant c > Example 1 : let M = S n ⊂ R n +1 , the unit sphere of dimension n . 
If x ∗ is some critical pointof the mean field X , then one may choose V ( x ) = 1 − cos d ( x, x ∗ ), where d ( x, x ∗ ) denotes theRiemannian distance between x and x ∗ . In this case, V is positive and has 1-Lipschitz gradient. Example 2 : let M be a Hadamard manifold, with sectional curvatures bounded below by κ min = − c . If x ∗ is some critical point of the mean field X , then one may choose V ( x ) = V x ∗ ( x ),for some δ > 0, as in (1.80). From Proposition 1.5 and Lemma 1.1, V is positive and has (1+ δc )-Lipschitz gradient. Zeros of vector fields are also called “singular points”, and “stationary points”. The term “critical points”seems more in line with the context of stochastic approximation, where the mean vector field is often a gradientvector field, so the scheme (5.1) is a stochastic gradient scheme, used to solve some optimisation problem. Lemma 5.1. If V : M → R has ℓ -Lipschitz gradient, then | V (Exp x ( v )) − V ( x ) − h grad V, v i x | ≤ ( ℓ/ k v k x (5.6) for any x ∈ M and v ∈ T x M . Sketch of proof : consider the geodesic c : [0 , → M , given by c ( t ) = Exp x ( tv ). Then, let V ( t ) = V ( c ( t )) and note that V ′ ( t ) = h grad V, ˙ c i c ( t ) . Let Π t denote parallel transport along c ,from c ( t ) to c (0). Since this preserves scalar products, and ˙ c is parallel, V ′ ( t ) = h grad V, ˙ c i c (0) + h Π t (cid:0) grad V c ( t ) (cid:1) − grad V c (0) , ˙ c i c (0) (5.7)Then, using (1.86), it may be shown that (cid:12)(cid:12) h Π t (cid:0) grad V c ( t ) (cid:1) − grad V c (0) , ˙ c i c (0) (cid:12)(cid:12) ≤ ℓt ( L ( c )) (5.8)Since c (0) = x and ˙ c (0) = v , (5.6) follows by replacing (5.8) into (5.7), and integrating over t . Consider now the case where Ret = Exp, in (5.1). That is, x t +1 = Exp x t (cid:0) µ t +1 X y t +1 ( x t ) (cid:1) (5.9)For this exponential scheme, Proposition 5.1 provides a non-asymptotic bound on E k X ( x τ t ) k ,where τ t was defined in (5.4). This proposition uses the notation { µ p } t = P ts =1 µ p +1 s +1 P ts =1 µ s +1 (5.10)which is motivated by the fact that if µ t = µ is a constant, so (5.9) is a constant-step-size scheme,then { µ p } t = µ p . In this spirit, { µ } t will be written { µ } t = { µ } t , throughout the following. Proposition 5.1. Consider the exponential scheme (5.9), with mean vector field (5.2), andwhere the noise variance satisfies (5.3). Assume that there exists a positive function V : M → R ,with ℓ -Lipschitz gradient, which verifies (5.5). If µ t ≤ c (2 ℓ (1 + σ )) − for all t , then E h k X ( x τ t ) k x τt i ≤ (2 /c ) (cid:2) ( V ( x )/ t ) { µ − } t + ( ℓσ ) { µ } t (cid:3) (5.11) Remark : the simplest application of this proposition is to a stochastic gradient scheme, whithmean field X ( x ) = − grad f ( x ) for a cost function f : M → R . If f is positive (or justbounded below), and has ℓ f -Lipschitz gradient, then V = f can be introduced, as a Lyapunovfunction, since (5.5) then holds with c = 1. In the case of a constant-step-size scheme, with µ ≤ (2 ℓ f (1 + σ )) − , it follows from (5.11) that12 t t X s =1 E (cid:2) k grad f ( x s ) k x s (cid:3) ≤ ( f ( x )/ tµ ) + ( ℓ f σ ) µ (5.12)In particular, if t is sufficiently large, then one must have E k grad f ( x s ) k ≤ ℓ f σ ) µ , for atleast one s in the range s = 1 , . . . , t . 83 emark : Proposition 5.1 provides an estimate of the rate of convergence of a stochasticapproximation scheme, to the set of critical points of its mean field, which is applicable evenwhen this set of critical points is complicated. 
This is especially helpful for stochastic gradientschemes, with a cost function that has many global minima (see the the example in 5.4).Proposition 5.2 will extend Proposition 5.1, from exponential schemes, to retraction schemes. Proof of Proposition 5.1 : for each s = 0 , , . . . , it follows from Lemma 5.1 that V ( x s +1 ) − V ( x s ) ≤ µ s +1 h grad V, X y s +1 i x s + µ s +1 ( ℓ/ k X y s +1 k x s Then, since X y s +1 ( x s ) = X ( x s ) + e y s +1 ( x s ), V ( x s +1 ) − V ( x s ) ≤ µ s +1 h grad V, X y s +1 i x s + µ s +1 ℓ (cid:0) k X k x s + k e y s +1 k x s (cid:1) (5.13)Let Y s be the σ -algebra generated by y , . . . , y s . Taking conditional expectations in (5.13), itfollows from (5.2) that − µ s +1 h grad V, X i x s ≤ E [ V ( x s ) − V ( x s +1 ) |Y s ] + µ s +1 ℓ (cid:0) k X k x s + E (cid:2) k e y s +1 k x s (cid:12)(cid:12) Y s (cid:3)(cid:1) Then, from (5.3), − µ s +1 h grad V, X i x s ≤ E [ V ( x s ) − V ( x s +1 ) |Y s ] + µ s +1 ℓ (cid:0) σ + (1 + σ ) k X k x s (cid:1) Therefore, using (5.5), and rearranging terms,( c − ℓ (1 + σ ) µ s +1 ) µ s +1 k X k x s ≤ E [ V ( x s ) − V ( x s +1 ) |Y s ] + ( ℓσ ) µ s +1 (5.14)If µ s +1 ≤ c (2 ℓ (1 + σ )) − , this becomes( c/ µ s +1 k X k x s ≤ E [ V ( x s ) − V ( x s +1 ) |Y s ] + ( ℓσ ) µ s +1 Finally, (5.11) follows by summing over s = 1 , . . . , t and dividing by P ts =1 µ s +1 = t/ { µ − } t . Consider now the case where Ret in (5.1) is a regular retraction, in the sense of 1.5. Then (5.1)can be written under an exponential form, x t +1 = Exp x t (cid:0) Φ x t (cid:0) µ t +1 X y t +1 ( x t ) (cid:1)(cid:1) (5.15)where the maps Φ x : T x M → T x M were defined in (1.35). This new exponential form is useful,since it renders possible the application of Lemma 5.1, as in the proof of Proposition 5.1.In addition to being regular, the retraction Ret is assumed to verify k Φ x ( v ) k x ≤ k v k x and k Φ x ( v ) − v k x ≤ δ k v k x (5.16)for all x ∈ M and v ∈ T x M , where δ > X y ( x ) has bounded third-order moments, Z Y k X y ( x ) k ax P ( dy ) ≤ τ a ; a = 2 , τ , τ > 0. This implies that it is possible to replace σ = 0 in (5.3).The following Proposition 5.2 is obtained by applying Lemma 5.1 to the exponential form(5.15) of the retraction scheme (5.1), and taking advantage of the assumptions (5.16) and (5.17).84 roposition 5.2. Consider the retraction scheme (5.1), where Ret is a regular retraction, whichsatisfies (5.16). Assume that (5.17) holds, so it is possible replace σ = 0 in (5.3). Assume alsothat there exists a positive function, with bounded and ℓ -Lipschitz gradient, which verifies (5.5).If µ t ≤ ( c/ ℓ ) for all t , then E h k X ( x τ t ) k x τt i ≤ (2 /c ) (cid:2) ( V ( x )/ t ) { µ − } t + ( ℓσ ) { µ } t + ( δτ k V k , ∞ ) { µ } t (cid:3) (5.18) where k V k , ∞ = sup x ∈ M k grad V k x . The assumptions of Proposition 5.2 (namely, that X y ( x ) has bounded third order moments,and that grad V ( x ) is uniformly bounded), can seem a bit too strong. In fact, these assumptionsare quite natural, in several applications, where the underlying Riemannian manifold M iscompact. One such application, to the PCA problem, is presented in 5.5. Remark : the first two terms on the right-hand side of (5.18) are the same as on the right-handside of (5.11). Thus, replacing the Riemannian exponential Exp by a regular retraction Ret hasthe effect of introducing a second-order term ( i.e. a constant multiple of { µ } t ) into (5.18).This additional term vanishes, in the limit where δ goes to zero. 
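Before the proof, here is a minimal numerical sketch of the exponential scheme (5.9) in its simplest stochastic-gradient form, as in the remark following Proposition 5.1. It is written on the manifold of symmetric positive-definite matrices with the affine-invariant metric (a choice made purely for illustration), for the cost f(x) = (1/2N) Σₙ d²(x, yₙ), whose stochastic gradient at x, given a sample y, is −Exp_x^{-1}(y); all helper names and parameter values are mine, and the script only monitors the time-averaged squared gradient norm, whose expectation is bounded in (5.12).

```python
# Constant-step-size stochastic gradient scheme (5.9) on SPD matrices with the
# affine-invariant metric; a sketch under illustrative assumptions, not the
# text's own implementation.
import numpy as np
from scipy.linalg import expm, logm, sqrtm

def spd_exp(x, v):
    """Exponential map Exp_x(v) for the affine-invariant metric."""
    xh = np.real(sqrtm(x)); xih = np.linalg.inv(xh)
    return xh @ expm(xih @ v @ xih) @ xh

def spd_log(x, y):
    """Inverse exponential map Exp_x^{-1}(y)."""
    xh = np.real(sqrtm(x)); xih = np.linalg.inv(xh)
    return xh @ np.real(logm(xih @ y @ xih)) @ xh

def sq_norm(x, v):
    """Squared norm of the tangent vector v at x."""
    m = np.linalg.inv(x) @ v
    return float(np.trace(m @ m).real)

rng = np.random.default_rng(0)
d, N = 3, 50

def random_spd():
    s = rng.standard_normal((d, d)); s = 0.5 * (s + s.T)
    return expm(0.3 * s)

data = [random_spd() for _ in range(N)]

def full_gradient(x):
    # grad f(x) for f(x) = (1/2N) sum_n d^2(x, y_n): minus the mean of the logs
    return -sum(spd_log(x, y) for y in data) / N

mu = 0.05                       # constant step-size
x = np.eye(d)
running = 0.0
for t in range(1, 201):
    y = data[rng.integers(N)]
    x = spd_exp(x, mu * spd_log(x, y))     # update (5.9), with X_y(x) = Exp_x^{-1}(y)
    running += sq_norm(x, full_gradient(x))
    if t % 50 == 0:
        print(f"t = {t:3d}   (1/t) sum_s ||grad f(x_s)||^2 = {running / t:.3e}")
```

The same skeleton applies verbatim when Exp is replaced by a regular retraction, which is the situation covered by Proposition 5.2, whose proof follows.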
Proof of Proposition 5.2 : for s = 0 , , . . . , it follows by applying Lemma 5.1 to (5.15) that V ( x s +1 ) − V ( x s ) ≤ (cid:10) grad V, Φ x s (cid:0) µ s +1 X y s +1 (cid:1)(cid:11) x s + ( ℓ/ (cid:13)(cid:13) Φ x s (cid:0) µ s +1 X y s +1 (cid:1)(cid:13)(cid:13) x s (5.19)Here, the right-hand side may also be written µ s +1 (cid:10) grad V, X y s +1 (cid:11) x s + ( ℓ/ (cid:13)(cid:13) Φ x s (cid:0) µ s +1 X y s +1 (cid:1)(cid:13)(cid:13) x s + (cid:10) grad V, Φ x s (cid:0) µ s +1 X y s +1 (cid:1) − µ s +1 X y s +1 (cid:11) x s However, by (5.16), (cid:13)(cid:13) Φ x s (cid:0) µ s +1 X y s +1 (cid:1)(cid:13)(cid:13) x s ≤ µ s +1 k X y s +1 k x s (5.20)and, in addition, (cid:13)(cid:13) Φ x s (cid:0) µ s +1 X y s +1 (cid:1) − µ s +1 X y s +1 (cid:13)(cid:13) x s ≤ δµ s +1 k X y s +1 k x s (5.21)Replacing (5.20) and (5.21) into (5.19), and using the Cauchy-Schwarz inequality, V ( x s +1 ) − V ( x s ) ≤ µ s +1 h grad V, X y s +1 i x s + µ s +1 ( ℓ/ k X y s +1 k x s + ( δ k V k , ∞ ) µ s +1 k X y s +1 k x s Now, it is possible to proceed as in the proof of Proposition 5.1. Taking conditional expectationswith respect to Y s , and using (5.2) and (5.3), − µ s +1 h grad V, X i x s − µ s +1 ℓ k X y s +1 k x s ≤ − ∆ V s + ( ℓσ ) µ s +1 + ( δ k V k , ∞ ) µ s +1 E (cid:2) k X y s +1 k x s (cid:12)(cid:12) Y s (cid:3) where ∆ V s = E [ V ( x s +1 ) − V ( x s ) |Y s ]. Then, using (5.5), it follows that( c − ℓµ s +1 ) µ s +1 k X k x s ≤ − ∆ V s + ( ℓσ ) µ s +1 + ( δ k V k , ∞ ) µ s +1 E (cid:2) k X y s +1 k x s (cid:12)(cid:12) Y s (cid:3) so, inserting (5.17), one obtains the inequality( c − ℓµ s +1 ) µ s +1 k X k x s ≤ − ∆ V s + ( ℓσ ) µ s +1 + ( δτ k V k , ∞ ) µ s +1 Here, if µ t ≤ ( c/ ℓ ) for all t , then( c/ µ s +1 k X k x s ≤ − ∆ V s + ( ℓσ ) µ s +1 + ( δτ k V k , ∞ ) µ s +1 Finally, (5.18) follows by summing over s = 1 , . . . , t and dividing by P ts =1 µ s +1 = t/ { µ − } t .85 .4 Example : mixture estimation Let M be a Riemannian symmetric space, which belongs to the non-compact case, (see 1.9.2).Consider a probability density m on M , which is a mixture of Gaussian densities (of the kinddefined in 3.2), m ( y | x ) = 1 K K X κ =1 p ( y | x κ ) where p ( y | x κ ) = ( Z (1)) − exp (cid:20) − d ( y, x κ )2 (cid:21) (5.22)where K is the number of mixture components, and the normalising factor Z (1) is given by (3.7).The parameters x = ( x κ ; κ = 1 , . . . , K ) are to be estimated, by fitting the mixture density (5.22)to data y , . . . , y N . Then, maximum-likelihood estimation amounts to minimising the negativelog-likelihood function f ( x ) = − log Z (1) − N N X n =1 log m ( y n | x ) (5.23)where the first term, − log Z (1), has been added to ensure that f ( x ) is positive. The function f is defined on the product Riemannian manifold, M K = M × . . . × M . Its gradient is thengrad f = (grad κ f ; κ = 1 , . . . , K ), where grad κ f denotes the gradient with respect to x κ . Lemma 5.2. For the negative log-likelihood function (5.23), grad κ f ( x ) = − N N X n =1 ω κ ( y n )Exp − x κ ( y n ) (5.24) where ω κ ( y ) ∝ p ( y | x κ ) are positive weights, which add up to . Let ( y t ; t = 1 , , . . . ) be chosen at random among the data y , . . . , y N . By Lemma 5.2, x t +1 κ = Exp x tκ (cid:0) µX κ ( y t +1 , x tκ ) (cid:1) where X κ ( y t +1 , x tκ ) = ω κ ( y t +1 )Exp − x tκ ( y t +1 ) (5.25)is a constant-step-size stochastic gradient scheme, for the cost function f . 
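The update (5.25) is easy to put into code. The sketch below does so on a deliberately simple one-dimensional Hadamard manifold, the positive reals with the metric dx²/x² (a toy stand-in, chosen by me, for the symmetric spaces considered in this section); the responsibilities ω_κ(y) are computed from p(y|x_κ) ∝ exp(−d²(y, x_κ)/2), the common factor Z(1) cancelling in the normalisation. All names and numerical values are illustrative.

```python
# One possible rendering of the stochastic gradient scheme (5.25) for mixture
# estimation, on a toy one-dimensional Hadamard manifold (the positive reals
# with metric dx^2 / x^2).  Illustrative only.
import numpy as np

def dist(x, y):      return abs(np.log(y / x))          # Riemannian distance
def exp_map(x, v):   return x * np.exp(v / x)           # Exp_x(v)
def log_map(x, y):   return x * np.log(y / x)           # Exp_x^{-1}(y)

rng = np.random.default_rng(1)

# Synthetic data: two "Riemannian Gaussian" components, centred at 0.5 and 8.
true_centres = [0.5, 8.0]
data = [exp_map(c, c * rng.standard_normal()) for c in true_centres for _ in range(200)]

x = [1.0, 2.0]        # initial component centres x_kappa
mu = 0.05             # constant step-size, mu < 1

for t in range(2000):
    y = data[rng.integers(len(data))]
    # responsibilities omega_kappa(y), proportional to p(y | x_kappa);
    # the normalising factor Z(1) is the same for all components and cancels
    w = np.array([np.exp(-0.5 * dist(y, xk) ** 2) for xk in x])
    w = w / w.sum()
    # update (5.25): move each centre along the geodesic towards y
    x = [exp_map(xk, mu * w[k] * log_map(xk, y)) for k, xk in enumerate(x)]

print("estimated centres:", sorted(round(float(v), 2) for v in x))
```

In the setting of this section proper, dist, exp_map and log_map would be those of the symmetric space M, and the step-size condition discussed next applies to the scheme (5.25) itself.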
Here, the step-size µ is assumed to be less than 1 (in comparison to (5.9), t and t +1 have been written as superscripts,rather than subscripts, in order to accommodate the appearance of κ ).Now, let C be a compact and convex subset of M , which contains all of the data points y n ,as well as all of the initial values x κ (since M is a Hadamard manifold, one may take C to beany sufficiently large closed geodesic ball). The diameter of C will be denoted D C . From (5.25), x t +1 κ = x tκ ρt +1 y t +1 where ρ t +1 = µω κ ( y t +1 )in the notation of (4.4), from 4.1. Accordingly, the iterates x tκ remain within C , for all t and κ .Because C is compact and convex, it then becomes possible to derive the following result, byrepeating, with very minor changes, the arguments leading to (5.12). Proposition 5.3. For the stochastic gradient scheme (5.25), let C be a compact and convexsubset of M , which contains all of the data points y n , as well as all of the initial values x κ .If µ ≤ (1 / ℓ C ) , where ℓ C denotes the supremum of the operator norm of Hess f ( x ) , taken over x = ( x κ ; κ = 1 , . . . , K ) ∈ C K , then for all t = 1 , , . . . , t t X s =1 E (cid:2) k grad f ( x s ) k x s (cid:3) ≤ ( f C / tµ ) + ( ℓ C σ ) µ (5.26) Here, f C = sup x ∈ C K f ( x ) and σ = sup x ∈ C K k grad f k x (explicit bounds on f C , σ and ℓ C aregiven in the remark below). emark : tedious, but straightforward, calculations provide the upper bounds f C ≤ D C σ ≤ K D C (5.27) ℓ C ≤ (1 + c D C ) + (1 + Z (1) exp(D C / C (5.28)where c is such that the sectional curvatures of M lie within [ − c , Proof of Lemma 5.2 : taking the gradient of (5.23), it is clear thatgrad κ f ( x ) = − N N X n =1 grad κ log m ( y n | x ) (5.29)Now, grad κ log m ( y | x ) can be computed as follows. If λ is a random variable, independent from y , with P ( λ = κ ) = K − , for κ = 1 , . . . , K , then P ( λ = κ | y ) = ω κ ( y ), with ω κ ( y ) as in (5.24).Therefore, using Bayes rule, p ( λ, y ) m ( y | x ) = K X ν =1 { λ = ν } ω ν ( y )where p ( λ, y ) is the joint distribution of the couple ( λ, y ). Taking logarithms,log p ( λ, y ) − log m ( y | x ) = K X ν =1 { λ = ν } log ω ν ( y ) (5.30)If E y denotes conditional expectation with respect to y , E y " grad κ K X ν =1 { λ = ν } log ω ν ( y ) = K X ν =1 ω ν ( y )grad κ log ω ν ( y ) = 0where the second equality follows since the conditional probabilities ω ν ( y ) always add up to 1.By replacing this into (5.30),grad κ log m ( y | x ) = E y [grad κ log p ( λ, y )] (5.31)But, since λ and y are independent, the joint distribution p ( λ, y ) reads p ( λ, y ) = 1 K K X ν =1 { λ = ν } p ( y | x ν )Therefore, taking logarithms,log p ( λ, y ) = − log( K ) + K X ν =1 { λ = ν } log p ( y | x ν )This immediately yields, E y [grad κ log p ( λ, y )] = ω κ ( y )grad κ log p ( y | x κ ) = ω κ ( y )Exp − x κ ( y ) (5.32)where the second equality follows from (1.75), and from the definition of p ( y | x κ ) in (5.22).Finally, replacing (5.32) into (5.31),grad κ log m ( y | x ) = ω κ ( y )Exp − x κ ( y ) (5.33)so that (5.24) follows by plugging (5.33) into (5.29).87 .5 Example : the PCA problem Here, the notation will be the same as in 1.4 and 1.6. 
The aim is to apply Proposition 5.2, toa constant-step-size stochastic gradient scheme, for the objective function (1.22), f ( x ) = tr ( x ∆) x ∈ Gr R ( p , q ) (5.34)where ∆ is the covariance matrix of a zero-mean random vector y , with values in R d ( d = p + q ).It is assumed that y has finite moments of order 6.The gradient of the objective function f was given by (1.26) and (1.29). These can bewritten grad f ( x ) = g · ˜ ω ( x ) where ˜ ω ( x ) = P o ( g † · ∆) (5.35)Now, let b ∈ St R ( p , q ) be such that x = [ b ]. By the discussion before (1.40), choosing g = ( b, b ⊥ ),it follows that grad f ( x ) = [ X ( b )], where X ( b ) = b ⊥ ω ( b ). From (1.27) and (1.29), it is clear that ω ( b ) = ( b ⊥ ) † ∆ b . Therefore, using the fact that x = bb † (this is the definition of [ b ]), X ( b ) = (I d − x )∆ b (5.36)In terms of the random vector y , X ( b ) is the expectation of X y ( b ), where X y ( b ) = (I d − x )( yy † ) b (5.37)Let X y ( x ) = [ X y ( b )], and note that the expectation of X y ( x ) is equal to grad f ( x ) (by linearity).Then, consider the constant-step-size stochastic gradient scheme x t +1 = Ret x t (cid:0) µX y t +1 ( x t ) (cid:1) (5.38)where ( y t ; t = 1 , , . . . ) are independent copies of y . If the retraction Ret is given as in (1.41),this becomes x t +1 = Span( b t +1 ) ; b t +1 = b t + µ (I d − b t b † t )( y t +1 y † t +1 ) b t (5.39)Proposition 5.2, applied to this scheme, yields the following bound. Proposition 5.4. Consider the constant-step-size scheme (5.38)-(5.39). For all t = 1 , , . . . , t t X s =1 E (cid:2) k grad f ( x s ) k x s (cid:3) ≤ ( p k ∆ k op )/ tµ ) + (4 k ∆ k op m y ) µ + ( √ k ∆ k F m y ) µ (5.40) where k ∆ k op and k ∆ k F denote the operator norm and Frobenius norm of the matrix ∆ , while m y and m y denote the fourth-order and sixth-order moments of the random vector y . Proposition 5.4 follows directly from Proposition 5.2, by introducing V ( x ) = p k ∆ k op − f ( x ),which satisfies 0 ≤ V ( x ) ≤ p k ∆ k op . Since − grad V ( x ) = grad f ( x ), (5.5) now holds with c = 1.The function V has 2 k ∆ k op -Lipschitz gradient, as will be shown in the remark below, andthe norm of its gradient can be computed from (5.35), k grad V k x = k grad f k x = k ˜ ω x k o ≤ k g † · ∆ k F (5.41)where the inequality follows from (1.25), since P o is an orthogonal projection. But, since g isorthogonal, (5.41) implies that k grad V k x is bounded by k ∆ k F , uniformly in x .Thus, to obtain (5.40), it is possible to replace into (5.18), V ( x ) ≤ p k ∆ k op , ℓ = 2 k ∆ k op ,and k V k , ∞ = k ∆ k F . 88or σ and τ , recall that X y ( x ) = [ X y ( b )], where X y ( b ) = b ⊥ ω y ( b ), with ω y ( b ) = ( b ⊥ ) † ( yy † ) b .However, this implies X y ( x ) = g · ˜ ω y ( b ), (˜ ω y ( b ) is obtained from ω y ( b ), according to (1.27)).Thus, k X y k x = k ˜ ω y ( b ) k o = √ k ω y ( b ) k F . By evaluating the Frobenius norm, k ω y ( b ) k F = tr (cid:16) (I d − x )( yy † ) x ( yy † ) (cid:17) ≤ k (I d − x )( yy † ) k F k ( yy † ) x k F ≤ k yy † k F where the first inequality follows from the Cauchy-Schwarz inequality, and the second inequalityfollows because x and I d − x are orthogonal projectors. Since k yy † k F = k y k (the squaredEuclidean norm of y ), this implies k X y k x ≤ √ k y k . Therefore, it is possible to set σ = 2 m y and τ = √ m y .Finally, the constant δ in (5.16) can be taken equal to 1. Indeed, if Ret x is given by (1.41)and Φ x is given by (1.42), then for v ∈ T x Gr R ( p , q ), where v = g · ˜ ω and ω has s.v.d. 
ω = ras † ,Φ x ( v ) = g · ˜ ϕ where ϕ has s.v.d. ϕ = r arctan( a ) s † . Therefore, k Φ x ( v ) − v k x = k g · ˜ ϕ − g · ˜ ω k x = k ˜ ϕ − ˜ ω k o (5.42)If k is given by (1.47), then k ˜ ϕ − ˜ ω k o = k k · arctan(˜ a ) − k · ˜ a k o = k arctan(˜ a ) − ˜ a k o By an elementary property of the arctan function, k arctan(˜ a ) − ˜ a k o ≤ k ˜ a k o . Therefore, k ˜ ϕ − ˜ ω k o ≤ k ˜ a k o (5.43)Replacing (5.43) into (5.42), and noting that k ˜ a k o = k v k x , it follows that k Φ x ( v ) − v k x ≤ k v k x .This is the second inequality in (5.16), with δ = 1. The first inequality in (5.16) is obtained byan analogous reasoning, using once more the properties of the arctan function. Remark : it was claimed that the function V has 2 k ∆ k op -Lipschitz gradient (this means grad V satisfies (1.86), with ℓ = 2 k ∆ k op ). To prove this claim, let c ( t ) be a geodesic, with c (0) = x and˙ c (0) = v . In the notation of (1.32), c ( t ) = exp( t ˆ ω v ) · x . From [10] (Theorem 3.3, Chapter IV),Π (grad V ( x )) = exp(ˆ ω v ) · grad V ( x ) (5.44)But, grad V ( x ) = − grad f ( x ), which is given by (5.35). Therefore,Π (grad V ( x )) = (exp(ˆ ω v ) g ) · P o ( g † · ∆) (5.45)On the other hand, letting y = c (1), one has y = (exp(ˆ ω v ) g ) · o . Thus, from (5.35),grad V ( y ) = (exp(ˆ ω v ) g ) · P o ((exp(ˆ ω v ) g ) † · ∆) (5.46)From (5.45) and (5.46), k grad V ( y ) − Π (grad V ( x )) k y = k P o ((exp(ˆ ω v ) g ) † · ∆) − P o ( g † · ∆) k o ≤ k (exp(ˆ ω v ) g ) † · ∆ − g † · ∆ k F where the inequality holds since P o is an orthogonal projection. Using the fact that(exp(ˆ ω v ) g ) † · ∆ − g † · ∆ = Z (∆( t ) ˆ ω v − ˆ ω v ∆( t )) dt where ∆( t ) = (exp( t ˆ ω v ) g ) † · ∆, so k ∆( t ) k op = k ∆ k op , it follows that k grad V ( y ) − Π (grad V ( x )) k y ≤ Z k ∆( t ) ˆ ω v − ˆ ω v ∆( t ) k F dt ≤ k ∆ k op k ˆ ω v k F By the remark at the end of 1.4, the right-hand side is 2 k ∆ k op k v k x . In other words, k grad V ( y ) − Π (grad V ( x )) k y ≤ k ∆ k op L ( c )This is equivalent to the required form of (1.86), as can be seen by applying Π under the norm.89 .6 A central limit theorem (CLT) Here, the aim will be to derive a central limit theorem, describing the asymptotic behaviorof certain constant-step-size exponential schemes, defined on Hadamard manifolds. This is ageneralisation of the central limit theorem, which holds in Euclidean space, found in [54]. Consider the constant-step-size exponential scheme, defined on a Hadamard manifold M , x t +1 = Exp x t (cid:0) µX y t +1 ( x t ) (cid:1) (5.47)Since the observations ( y t ; t = 1 , , . . . ) are independent and identically distributed, it followsthat ( x t ; t = 0 , , . . . ) is a time-homogeneous Markov chain with values in M . The followingassumptions ensure that ( x t ) is geometrically ergodic, and therefore has a unique invariantdistribution π µ . e1. the noise vector e y ( x ) satisfies (5.3). e2. e y ( x ) is P -almost-surely a continuous function of x , and the distribution of e y ( x ) has strictlypositive density, with respect to the Lebesgue measure on T x M . v1. there exists a positive function V : M → R , with compact sublevel sets, and ℓ -Lipschitzgradient, which satisfies (5.5). v2. there exist x ∗ ∈ M and λ > 0, such that h grad V, X i x ≤ − λV ( x ) for x = x ∗ (5.48) v3. V ( x ) = 0 if and only if x = x ∗ . Proposition 5.5. Consider the constant-step-size scheme (5.47), on a Hadamard manifold M .Assume that e1 , e2 , v1 , v2 hold. 
If µ ≤ c (2 ℓ (1 + σ )) − , then the Markov chain ( x t ) isgeometrically ergodic, with a unique invariant distribution π µ . As the step-size µ goes to zero, the invariant distribution π µ concentrates on the point x ∗ . Proposition 5.6. Under the same conditions as in Proposition 5.5, if v3 holds, then π µ ⇒ δ x ∗ as µ → (here, ⇒ denotes weak convergence of probability measures). Consider now the re-scaled sequence ( u t ; t = 0 , , . . . ), with values in T x ∗ M , u t = ψ µ ( x t ) where ψ µ ( x ) = µ − Exp − x ∗ ( x ) (5.49)This is the image of ( x t ; t = 0 , , . . . ), under the diffeomorphism ψ µ : M → T x ∗ M . It is thereforea time-homogeneous Markov chain with values in T x ∗ M . The trasition kernels of ( x t ) and ( u t )will be denoted Q µ and ˜ Q µ , respectively. Note that˜ Q µ φ ( u ) = Q µ ( φ ◦ ψ µ )( ψ − ( u )) (5.50)for any measurable function φ : T x M → R .The following assumptions ensure that, as µ goes to zero, ( u t ; t = 0 , , . . . ) behave likesamples, taken at evenly spaced times τ t = tµ , from a linear diffusion process ( U τ ; τ ≥ d1. the (2,0)-tensor field Σ, defined byΣ( x ) = Z Y e y ( x ) ⊗ e y ( x ) P ( dy ) for x ∈ M (5.51)is continuous on M . 90 there exists a linear map A : T x ∗ M → T x ∗ M , such that for x ∈ M , X ( x ) = Π (cid:0) A (cid:0) Exp − x ∗ ( x ) (cid:1) + R ( x ) (cid:1) (5.52)where Π denotes parallel transport along the unique geodesic c : [0 , → M , connecting x ∗ to x ,and k R ( x ) k x ∗ = o ( d ( x, x ∗ )).Now, let ( U τ : τ ≥ L φ ( u ) = A ij u j ∂φ∂u i ( u ) + 12 Σ ij ∗ ∂ φ∂u i ∂u j ( u ) (5.53)where ( A ij ) and (Σ ij ∗ ) are the matrices which represent the linear map A and the tensor Σ( x ∗ ),in a basis of normal coordinates centred at x ∗ . Proposition 5.7. Consider the constant-step-size scheme (5.47), on a Hadamard manifold M .Let ( u t : t = 0 , , . . . ) be given by (5.49), and assume that e1 , d1 , d2 hold. For any compactly-supported, smooth function φ : T x M → R , µ − h ˜ Q µ φ ( u ) − φ ( u ) i = L φ ( u ) + ε µ ( u ) (5.54) where ε µ ( u ) → as µ → , uniformly on compact subsets of T x ∗ M . Remark : this proposition implies a functional central limit theorem , by application of [55](Theorem 19.28). This says that the process ( U µτ ; τ ≥ u t for tµ ≤ τ < ( t + 1) µ ,converges in distribution to the linear diffusion U , with generator (5.53), in Skorokhod space.This functional central limit theorem can be used to study the asymptotic behavior of ( x t ), neara general critical point, which satisfies d2 . Sadly, I have not yet had time to develop this idea. A central limit theorem can be derived, in the case where x ∗ is a stable critical point of themean field X , in the following sense. t1. the linear map A in (5.52) has its spectrum contained in the open left half-plane.In this case, the generator L in (5.53) admits of a unique invariant distribution, which ismultivariate normal with mean zero and covariance matrix V , the solution of the Lyapunovequation AV + VA † = Σ ∗ [56] ( A = ( A ij ) and Σ ∗ = (Σ ij ∗ )). This will be denoted N(0 , V ).Under the conditions of Proposition 5.5, the Markov chain ( x t ) has a unique invariantdistribution π µ . Then, the same holds for the Markov chain ( u t ), which will have a uniqueinvariant distribution ˜ π µ . This is ˜ π µ ( A ) = π µ (Exp x ∗ ( µ / A )), for any measurable A ⊂ T x ∗ M .The following assumptions will be essential for the central limit theorem, which is stated inProposition 5.8. 
Assumption t2 ensures tightness of the family (˜ π µ ; µ ≤ c (2 ℓ (1 + σ )) − ). t2. for each r > v ( r ) > V ( x ) ≥ v ( r ) if d ( x, x ∗ ) > r . Moreover, v ( r ) → ∞ as r → ∞ and a / v ( a / r ) is a non-descreasing function of a > 0, for any r . t3. there exists α > Z Y k e y ( x ) k αx P ( dy ) ≤ ˜ σ + ˜ σ V ( x ) (5.55)for some constants ˜ σ , ˜ σ . Proposition 5.8. Under the conditions of Propositions 5.6 and 5.7, if t1 , t2 , t3 hold, then ˜ π µ ⇒ N(0 , V ) as µ → . .7 Proof of the CLT The proof relies on the following two lemmas, which will be proved below. Lemma 5.3. Assume that e2 holds. Then, the Markov chain ( x t ) is Feller, and | vol | -irreducibleand aperiodic (where | vol | denotes the Riemannian volume measure on M ). Lemma 5.4. Assume that e1 , v1 , v2 hold. If µ ≤ c (2 ℓ (1 + σ )) − , then Q µ V ( x ) ≤ (1 − λµ/ V ( x ) + ( ℓσ ) µ (5.56) for all x ∈ M . Admitting these lemmas, the fact that the chain ( x t ) is geometrically ergodic follows from [51](Theorem 16.0.1). Specifically, let ˜ V ( x ) = V ( x ) + 1. Then, by (5.56), Q µ ˜ V ( x ) ≤ (1 − λµ/ 2) ˜ V ( x ) + b where b = ( ℓσ ) µ + λµ/ C = { x : V ( x ) ≤ b/ ( λµ ) } . Clearly, Q µ ˜ V ( x ) ≤ (1 − λµ/ 4) ˜ V ( x ) + b C ( x ) (5.57)By v1 , ˜ V has compact sublevel sets, so C is a compact subset of M . Therefore, since ( x t ) isFeller, C is a small set for Q µ [51] (Theorem 6.0.1). Then, (5.57) is a geometric drift conditiontowards the small set C . This is equivalent to ( x t ) being geometrically ergodic. Proof of Lemma 5.3 : let f : M → R be a bounded continuous function. By a slight abuseof notation, let y denote a random variable with distribution P . From (5.47), Q µ f ( x ) = E [ f (Exp x ( µX ( x ) + µe y ( x )))] (5.58)By e2 , e y ( x ) is P -almost-surely a continuous function of x . By the dominated convergencetheorem, Q µ f ( x ) is a bounded continuous function of x . In other words, the transition kernel Q µ is a Feller kernel, so the chain ( x t ) is Feller.To show that ( x t ) is | vol | -irreducible and aperiodic, it is enough to show that Q µ ( x, B ) > B ) > 0, where Q µ ( x, B ) = Q µ B ( x ). By e2 , if w = e y ( x ) then the distributionof w is of the form γ ( w ) dw , where γ ( w ) > 0, and dw denotes the Lebesgue measure on T x M .Therefore, from (5.47), Q µ ( x, B ) = Z T x M B (Exp x ( µX ( x ) + µw )) γ ( w ) dw Since M is a Hadamard manifold, Exp x is a diffeomorphism of T x M onto M . Accordingly, Q µ ( x, B ) = (1 /µ ) n Z B γ (cid:0) (1 /µ )Exp − x ( z ) − X ( x ) (cid:1) | J x ( z ) | − vol( dz )where n is the dimension of M , and Exp ∗ x (vol)( dw ) = ( | J x | ◦ Exp x )( w ) dw , so that | J x ( z ) | > B ) > 0, it is clear that Q µ ( x, B ) > Proof of Lemma 5.4 : for any x ∈ M , it follows from (5.47) that Q µ V ( x ) = E [ V ( x )] where x = Exp x ( µX ( x ) + µe y ( x )) (5.59)where y denotes a random variable with distribution P (this is the same abuse of notation madein (5.58)). 
However, using Lemma 5.1, it is possible to write, as in (5.13),92 ( x ) ≤ V ( x ) + µ h grad V, X y i x + µ ℓ (cid:0) k X k x + k e y k x (cid:1) By e1 and (5.59), it follows after taking expectations, Q µ V ( x ) ≤ V ( x ) + µ h grad V, X i x + µ ℓ (1 + σ ) k X k x + ( ℓσ ) µ (5.60)By v1 , V satisfies (5.5), so that (5.60) implies Q µ V ( x ) ≤ V ( x ) + µ (1 − µℓ (1 + σ ) /c ) h grad V, X i x + +( ℓσ ) µ (5.61)Since h grad V, X i is negative, if µ ≤ c (2 ℓ (1 + σ )) − , then (5.61) implies Q µ V ( x ) ≤ V ( x ) + ( µ/ h grad V, X i x + ( ℓσ ) µ Finally, by v2 , this yields Q µ V ( x ) ≤ (1 − λµ/ V ( x ) + ( ℓσ ) µ (5.62)which is the same as (5.56), since x is arbitrary. Proposition 4.2 implies that the chain ( x t ) has a unique invariant distribution, here denoted π µ ,for any µ ≤ c (2 ℓ (1 + σ )) − . Integrating both sides of (5.56) with respect to π µ , it follows that Z M Q µ V ( x ) π µ ( dx ) ≤ (1 − λµ/ Z M V ( x ) π µ ( dx ) + ( ℓσ ) µ Since π µ is an invariant distribution of the transition kernel Q µ , this means Z M V ( x ) π µ ( dx ) ≤ (1 − λµ/ Z M V ( x ) π µ ( dx ) + ( ℓσ ) µ In other words, Z M V ( x ) π µ ( dx ) ≤ ℓσ /λ ) µ (5.63)so, by Markov’s inequality, π µ ( V > v ) ≤ ℓσ /λ ) µv for all v > v1 , V has compact sublevel sets, so (5.64) implies the family ( π µ ; µ ≤ c (2 ℓ (1 + σ )) − )is tight. If π ∗ is a limit point of this family at µ = 0, then by the Portmaneau theorem, π ∗ ( V > v ) = 0 for all v > 0. In other words, π ∗ ( V = 0) = 1. By v3 , this is equivalent to π ∗ ( { x ∗ } ) = 1, or to π ∗ = δ x ∗ . The proof exploits the relation (5.50), between the transition kernels Q µ and ˜ Q µ , using thefollowing Lemmas 5.5, 5.6, and 5.7.In Lemma 5.5, [ H : T ] will denote the contraction of a (0,2)-tensor H with a (2,0)-tensor T .This is [ H : T ] = H ij T ij , in any local coordinates. Moreover, if f : M → R is compactly-supported and smooth, A f , B f denote positive numbers such that | Hess f x ( w, w ) | ≤ A f k w k x and |∇ Hess f x ( w, w, w ) | ≤ B f k w k x (5.65)for any x ∈ M and w ∈ T x M , where ∇ Hess f is the covariant derivative of the Hessian of f (respectively, Hess f and ∇ Hess f are (0,2)- and (0,3)-tensor fields).93 emma 5.5. For any compactly-supported, smooth f : M → R , Q µ f ( x ) = f ( x ) + µXf ( x ) + µ f : Σ + X ⊗ X ] x + µ R x ( f, µ ) (5.66) where the remainder term R x ( f, µ ) satisfies |R x ( f, µ ) | ≤ A f E (cid:2) {k e y k x > K }k e y k x (cid:3) +6 B f µ (2 k X k x + 2 K ) E [ {k e y k x > K }k e y k x ] + 2 B f µ (4 k X k x + 4 K ) (5.67) for any (arbitrarily chosen) K > . Given normal coordinates ( x i ; i = 1 , . . . , n ) on M , with origin at x ∗ , recall the coordinatevector fields ∂ i = ∂ (cid:14) ∂x i . Any function φ : T x ∗ M → R may be identified with a function of n variables, φ ( u ) = φ ( u , . . . , u n ), for u ∈ T x ∗ M where u = u i ∂ i ( x ∗ ). Lemma 5.6. Let ( x i ; i = 1 , . . . , n ) be normal coordinates on M with origin at x ∗ . For anysmooth function φ : T x ∗ M → R , if ψ µ is given by (5.49), then ∂ i ( φ ◦ ψ µ )( ψ − ( u )) = µ − ∂φ∂u i ( u ) ; ∂ ij ( φ ◦ ψ µ )( ψ − ( u )) = µ − ∂ φ∂u i ∂u j ( u ) (5.68) Lemma 5.7. Let X i ( x ) denote the components of the mean field X , with respect to the normalcoordinates ( x i ; i = 1 , . . . , n ) . If d2 holds, then X i ( ψ − ( u )) = µ A ij u j + R i ( µ u ) (5.69) where | R i ( u ) | = o ( k u k x ∗ ) . Lemmas 5.5, 5.6, and 5.7 will be proved below. 
Accepting them to be true, recall (5.50)˜ Q µ φ ( u ) = Q µ ( φ ◦ ψ µ )( ψ − ( u ))Replacing (5.66) into the right-hand side gives µ − h ˜ Q µ φ ( u ) − φ ( u ) i = X ( φ ◦ ψ µ )( ψ − ( u )) + µ φ ◦ ψ µ ) : T ] ψ − ( u ) + µ R ψ − ( u ) ( φ ◦ ψ µ , µ ) (5.70)where T = Σ + X ⊗ X . However, working in normal coordinates, X ( φ ◦ ψ µ )( ψ − ( u )) = X i ( ψ − ( u )) ∂ i ( φ ◦ ψ µ )( ψ − ( u ))so that, by (5.68) and (5.69), X ( φ ◦ ψ µ )( ψ − ( u )) = n A ij u j + µ − R i ( µ u ) o ∂φ∂u i ( u )Since φ is compactly-supported, this can be written X ( φ ◦ ψ µ )( ψ − ( u )) = A ij u j ∂φ∂u i ( u ) + ε µ ( u ) (5.71)where ε µ ( u ) → 0, uniformly on T x ∗ M , as µ → 0. For the second term in (5.70), using (1.21),[Hess ( φ ◦ ψ µ ) : T ] ψ − ( u ) = T ij ( ψ − ( u )) h ∂ ij ( φ ◦ ψ µ )( ψ − ( u )) − Γ kij ( ψ − ( u )) ∂ k ( φ ◦ ψ µ )( ψ − ( u )) i 94o that, by (5.68),[Hess ( φ ◦ ψ µ ) : T ] ψ − ( u ) = µ − T ij ( ψ − ( u )) (cid:20) ∂ φ∂u i ∂u j ( u ) − µ Γ kij ( ψ − ( u )) ∂φ∂u k ( u ) (cid:21) where (Γ ijk ) denote the Christoffel symbols. Since φ is compactly-supported, this can be written[Hess ( φ ◦ ψ µ ) : T ] ψ − ( u ) = µ − T ij ∗ ∂ φ∂u i ∂u j ( u ) + ε µ ( u )where ( T ij ∗ ) is the matrix which represents the tensor T ( x ∗ ) in normal coordinates, and where ε µ ( u ) → 0, uniformly on T x ∗ M , as µ → 0. Since (clearly, from (5.52)), T ( x ∗ ) = Σ( x ∗ ), it follows[Hess ( φ ◦ ψ µ ) : T ] ψ − ( u ) = µ − Σ ij ∗ ∂ φ∂u i ∂u j ( u ) + ε µ ( u ) (5.72)Then, replacing (5.71) and (5.72) into (5.70), and recalling the definition of L from (5.53), µ − h ˜ Q µ φ ( u ) − φ ( u ) i = L φ ( u ) + ε µ ( u ) + ε µ ( u ) + µ R ψ − ( u ) ( φ ◦ ψ µ , µ ) (5.73)To conclude, let ε µ ( u ) = ε µ ( u ) + ε µ ( u ) + µ R ψ − ( u ) ( φ ◦ ψ µ , µ ), and recall that ε µ ( u ) and ε µ ( u )converge to zero, uniformly on T x ∗ M . Moreover, using (5.3) and (5.67), it is straightforwardthat R ψ − ( u ) ( φ ◦ ψ µ , µ ) is bounded on compact subsets of T x ∗ M , (by an upper bound which isindependent of µ ). Therefore, ε µ ( u ) → µ → 0, uniformly on compact subsets of T x ∗ M . Proof of Lemma 5.5 : the proof will rely on the following variant of Taylor expansion(compare to [54], Section 2). Let f : M → R be a compactly-supported, smooth function.If A f , B f are given by (5.65), x ∈ M and ξ , η ∈ T x M , then f (Exp x ( ξ + η )) = f ( x ) + ( ξ + η ) f + 12 [Hess f : ( ξ + η ) ⊗ ( ξ + η )] + R f ( x )where |R f ( x ) | ≤ A f k η k x + 6 B f k ξ k x k η k x + 2 B f k ξ k x (5.74)To apply (5.74), recall (5.58), Q µ f ( x ) = E [ f (Exp x ( µX ( x ) + µe y ( x )))]and let ξ = µX ( x ) + µ {k e y k x ≤ K } e y ( x ), η = µ {k e y k x > K } e y ( x ). 
Taking the expectationof the Taylor expansion in (5.74) and using (5.2) and (5.51), it follows that, as in (5.66), Q µ f ( x ) = f ( x ) + µXf ( x ) + µ f : Σ + X ⊗ X ] x + µ R x ( f, µ )where |R x ( f, µ ) | is less than or equal to2 A f E (cid:2) {k e y k x > K }k e y k x (cid:3) +6 B f µ E (cid:2) k X ( x ) + {k e y k x ≤ K } e y ( x ) k x × k {k e y k x > K } e y ( x ) k x (cid:3) +2 B f µ E (cid:2) k X ( x ) + {k e y k x ≤ K } e y ( x ) k x (cid:3) Then, to obtain (5.67), it is enough to note k X ( x ) + {k e y k x ≤ K } e y ( x ) k x ≤ k X k x + 4 K k X ( x ) + {k e y k x ≤ K } e y ( x ) k x ≤ k X k x + 2 K which follow from the elementary inequalities ( a + b ) ≤ a + 4 b and ( a + b ) ≤ a + 2 b .95 roof of (5.74) : if f : M → R is smooth and compactly-supported, for x ∈ M and ζ ∈ T x M ,one has from the second- and third-order Taylor expansions of f at x , that f (Exp x ( ζ )) = f ( x ) + ζf + 12 [Hess f : ζ ⊗ ζ ] + R f ( x )where, simultaneously, |R f ( x ) | ≤ A f k ζ k x and |R f ( x ) | ≤ B f k ζ k x . If ζ = ξ + η , then |R f ( x ) | ≤ A f k η k x if k η k x ≥ k ξ k x |R f ( x ) | ≤ B f k ξ k x + 6 B f k ξ k x k η k x if k η k x < k ξ k x and (5.74) is obtained by adding up these two cases. Proof of Lemma 5.6 : let f : M → R be a smooth function. From the definition of coordinatevector fields [53] (Page 49), ∂ i f ( x ) = ( f ◦ Exp x ∗ ) ′ (Exp − x ∗ ( x ))( ∂ i ( x ∗ )) (5.75)where the prime denotes the Fr´echet derivative. To obtain (5.68), set f = φ ◦ ψ µ and x = ψ − ( u ),so that f ◦ Exp x ∗ ( w ) = φ ( µ − w ) (for w ∈ T x ∗ M ) and Exp − x ∗ ( x ) = µ u . Then, in particular,( f ◦ Exp x ∗ ) ′ = µ − φ ′ . Replacing into (5.75), it follows that ∂ i ( φ ◦ ψ µ )( ψ − ( u )) = µ − φ ′ ( u )( ∂ i ( x ∗ ))Now, if φ is identified with a function of n variables, φ ( u ) = φ ( u , . . . , u n ) where u = u i ∂ i ( x ∗ ),then φ ′ ( u )( ∂ i ( x ∗ )) = ∂φ ( u ) /∂u i . This yields the first identity in (5.68). The second identityfollows from the first by repeated application. Proof of Lemma 5.7 : assume that d2 holds. Using the same notation as in (5.52), considerthe Taylor expansion of the coordinate vector fields ∂ i (see [11], Page 90), ∂ i ( x ) = Π (cid:0) ∂ i ( x ∗ ) + ∇ ∂ i ( x ∗ ) (cid:0) Exp − x ∗ ( x ) (cid:1) + o ( d ( x, x ∗ )) (cid:1) where ∇ ∂ i ( x ∗ ) : T x ∗ M → T x ∗ M is the covariant derivative of ∂ i at x ∗ . From (1.7) and (1.62),it is clear that ∇ ∂ i ( x ∗ ) = 0, and therefore ∂ i ( x ) = Π ( ∂ i ( x ∗ ) + o ( d ( x, x ∗ ))) (5.76)Take the scalar product of (5.52) and (5.76). Since parallel transport preserves scalar products, h X , ∂ i i x = h A (cid:0) Exp − x ∗ ( x ) (cid:1) , ∂ i ( x ∗ ) i x ∗ + o ( d ( x, x ∗ ))However, Exp − x ∗ ( x ) = x i ∂ i ( x ∗ ), and ∂ i ( x ∗ ) form an orthonormal basis of T x ∗ M . Therefore, h X , ∂ i i x = A ij x j + o ( d ( x, x ∗ )) (5.77)where A ( ∂ i ( x ∗ )) = A ki ∂ k ( x ∗ ). Finally, note that (in normal coordinates), the metric coefficientssatisfy g ij ( x ) = δ ij + o ( d ( x, x ∗ ))Using these to express the scalar product in (5.77), it can be seen that X i ( x ) = A ij x j + o ( d ( x, x ∗ )) (5.78)Thus, (5.69) follows by putting x = ψ − µ ( u ) in (5.78). Then, x j = µ u j and d ( x, x ∗ ) = µ k u k x ∗ .96 .7.4 Proof of Proposition 5.8 To begin, it is helpful to establish tightness of the family (˜ π µ ; µ ≤ c (2 ℓ (1 + σ )) − ). Lemma 5.8. Assume that e1 , e2 , v1 , v2 , t2 hold. Then, the family of probability distributions (˜ π µ ; µ ≤ c (2 ℓ (1 + σ )) − ) is tight. 
Accepting this lemma, let $\tilde\pi_*$ be some limit point of the family $(\tilde\pi_\mu\,;\,\mu \le c\,(2\ell(1+\sigma))^{-1})$ as $\mu \to 0$. By integrating both sides of (5.54) with respect to $\tilde\pi_\mu$, and recalling that $\tilde\pi_\mu$ is an invariant distribution of $\tilde Q_\mu$ (so the integral of the left-hand side is zero), it follows that
\[
\int_{T_{x_*}M} L\phi(u)\,\tilde\pi_\mu(du) \;=\; -\int_{T_{x_*}M} \varepsilon_\mu(u)\,\tilde\pi_\mu(du)
\tag{5.79}
\]
where $\varepsilon_\mu(u) = \varepsilon^1_\mu(u) + \varepsilon^2_\mu(u) + \mu\,\mathcal R_{\psi^{-1}_\mu(u)}(\phi\circ\psi_\mu,\mu)$, in the notation of (5.70), (5.71) and (5.72), from the proof of Proposition 5.7. Since both $\varepsilon^1_\mu(u)$ and $\varepsilon^2_\mu(u)$ converge to zero as $\mu \to 0$, uniformly on $T_{x_*}M$, it follows from (5.79) that
\[
\biggl|\int_{T_{x_*}M} L\phi(u)\,\tilde\pi_*(du)\biggr| \;\le\; \limsup_{\mu\to 0}\,\int_{T_{x_*}M} \mu\,\bigl|\mathcal R_{\psi^{-1}_\mu(u)}(\phi\circ\psi_\mu,\mu)\bigr|\,\tilde\pi_\mu(du)
\]
Since $\tilde\pi_\mu$ is the image of $\pi_\mu$ under $\psi_\mu$, this is the same as
\[
\biggl|\int_{T_{x_*}M} L\phi(u)\,\tilde\pi_*(du)\biggr| \;\le\; \limsup_{\mu\to 0}\,\int_M \mu\,\bigl|\mathcal R_x(\phi\circ\psi_\mu,\mu)\bigr|\,\pi_\mu(dx)
\tag{5.80}
\]
To bound the right-hand side, put $f = \phi\circ\psi_\mu$ in (5.67). If $\bar f = \phi\circ\mathrm{Exp}^{-1}_{x_*}$, then $\bar f$ is compactly supported and smooth. Moreover, applying the chain rule, it follows from (5.65) that $A_f = \mu^{-1}A_{\bar f}$ and $B_f = \mu^{-3/2}B_{\bar f}$. Therefore, by (5.67),
\[
\bigl|\mathcal R_x(\phi\circ\psi_\mu,\mu)\bigr| \;\le\; 4\mu^{-1}A_{\bar f}\,\mathbb E\bigl[\mathbb 1\{\lVert e_y\rVert_x > K\}\,\lVert e_y\rVert^2_x\bigr]
\;+\; 6\mu^{-1/2}B_{\bar f}\,\bigl(2\lVert X\rVert^2_x + 2K^2\bigr)\,\mathbb E\bigl[\mathbb 1\{\lVert e_y\rVert_x > K\}\,\lVert e_y\rVert_x\bigr]
\;+\; 2\mu^{-1/2}B_{\bar f}\,\bigl(4\lVert X\rVert^3_x + 4K^3\bigr)
\tag{5.81}
\]
Now, since t3 holds, it follows from (5.55) that
\[
\mathbb E\bigl[\mathbb 1\{\lVert e_y\rVert_x > K\}\,\lVert e_y\rVert^2_x\bigr] \;\le\; K^{-\alpha}\bigl(\tilde\sigma + \tilde\sigma\,V(x)\bigr)
\tag{5.82}
\]
Moreover, by (5.3) (assuming, without loss of generality, that $K \ge 1$),
\[
\mathbb E\bigl[\mathbb 1\{\lVert e_y\rVert_x > K\}\,\lVert e_y\rVert_x\bigr] \;\le\; \sigma + \sigma\,\lVert X\rVert^2_x
\tag{5.83}
\]
Then, it follows from (5.81), (5.82) and (5.83) that
\[
\mu\,\bigl|\mathcal R_x(\phi\circ\psi_\mu,\mu)\bigr| \;\le\; 4A_{\bar f}\,K^{-\alpha}\bigl(\tilde\sigma + \tilde\sigma\,V(x)\bigr)
\;+\; 6\mu^{1/2}B_{\bar f}\,\bigl(2\lVert X\rVert^2_x + 2K^2\bigr)\bigl(\sigma + \sigma\,\lVert X\rVert^2_x\bigr)
\;+\; 2\mu^{1/2}B_{\bar f}\,\bigl(4\lVert X\rVert^3_x + 4K^3\bigr)
\]
Integrate this inequality with respect to $\pi_\mu$, and recall from Proposition 5.6 that $\pi_\mu$ converges weakly to $\delta_{x_*}$ as $\mu \to 0$. It follows that
\[
\limsup_{\mu\to 0}\,\int_M \mu\,\bigl|\mathcal R_x(\phi\circ\psi_\mu,\mu)\bigr|\,\pi_\mu(dx) \;\le\; 4A_{\bar f}\,K^{-\alpha}\bigl(\tilde\sigma + \tilde\sigma\,V(x_*)\bigr)
\]
However, since $K$ can be chosen arbitrarily large, and $\alpha > 0$, the limit superior is equal to zero, and (5.80) becomes
\[
\int_{T_{x_*}M} L\phi(u)\,\tilde\pi_*(du) \;=\; 0
\]
This means that $\tilde\pi_*$ is an invariant distribution of the generator $L$, and therefore $\tilde\pi_* = \mathrm N(0,V)$, as required.

Proof of Lemma 5.8: by Proposition 5.5, e1, e2, v1 and v2 ensure that the chain $(x_t)$ has a unique invariant distribution $\pi_\mu$, whenever $\mu \le c\,(2\ell(1+\sigma))^{-1}$. Then, the chain $(u_t)$ has a unique invariant distribution $\tilde\pi_\mu$. According to (5.49), this is $\tilde\pi_\mu(A) = \pi_\mu\bigl(\mathrm{Exp}_{x_*}(\mu^{1/2}A)\bigr)$, for any measurable $A \subset T_{x_*}M$. The same e1, e2, v1, v2 also imply (5.63), in the proof of Proposition 5.6. Now, for $u \in T_{x_*}M$, let $x = \mathrm{Exp}_{x_*}(\mu^{1/2}u)$, and note that $\lVert u\rVert_{x_*} > r$ if and only if $d(x,x_*) > \mu^{1/2}r$. It then follows from Assumption t2 that
\[
\tilde\pi_\mu\bigl(\lVert u\rVert_{x_*} > r\bigr) \;\le\; \pi_\mu\bigl(V > v(\mu^{1/2}r)\bigr)
\]
so, using Markov's inequality and (5.63),
\[
\tilde\pi_\mu\bigl(\lVert u\rVert_{x_*} > r\bigr) \;\le\; (\ell\sigma/\lambda)\,\Bigl(\mu \big/ v(\mu^{1/2}r)\Bigr)
\]
To conclude, let $\bar\mu = c\,(2\ell(1+\sigma))^{-1}$. By t2, $\mu\big/v(\mu^{1/2}r) \le \bar\mu\big/v(\bar\mu^{1/2}r)$. Therefore,
\[
\tilde\pi_\mu\bigl(\lVert u\rVert_{x_*} > r\bigr) \;\le\; (\ell\sigma/\lambda)\,\Bigl(\bar\mu \big/ v(\bar\mu^{1/2}r)\Bigr)
\]
However (again by t2), the right-hand side of this inequality is independent of $\mu$, and goes to zero as $r \to \infty$. This is equivalent to the required tightness.
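To make the statement of Proposition 5.8 a little more tangible, the following is a minimal numerical sanity check, in the Euclidean special case $M = \mathbb R^n$, where $\mathrm{Exp}_x(v) = x + v$ and normal coordinates are global. Everything in it is an illustrative assumption of this paragraph, not part of the proofs above: a linear mean vector field $X(x) = A(x - x_*)$ with a stable matrix $A$, Gaussian noise of covariance $\Sigma$, and the standard fact that an Ornstein-Uhlenbeck generator with drift matrix $A$ and diffusion $\Sigma$ has Gaussian invariant distribution $\mathrm N(0,V)$, where $V$ solves the Lyapunov equation $AV + VA^{\top} + \Sigma = 0$ (the constants should be matched against the definition of $L$ in (5.53)).
\begin{verbatim}
# Euclidean sanity check (M = R^n) of the Gaussian limit of the rescaled
# invariant distribution. All matrices and names below are chosen only
# for this illustration; they are not taken from the text.
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

rng = np.random.default_rng(0)
n = 2
A = np.array([[-1.0, 0.5],
              [0.0, -2.0]])      # stable drift matrix (eigenvalues < 0)
Sigma = np.array([[1.0, 0.3],
                  [0.3, 0.5]])   # noise covariance
chol = np.linalg.cholesky(Sigma)
x_star = np.array([1.0, -1.0])   # stationary point, X(x_star) = 0

mu = 0.01                        # constant step size
T, burn_in = 500_000, 50_000

x = x_star.copy()
samples = []
for t in range(T):
    e = chol @ rng.standard_normal(n)        # noise with covariance Sigma
    x = x + mu * (A @ (x - x_star) + e)      # x_{t+1} = Exp_x(mu X_y(x)) in R^n
    if t >= burn_in:
        samples.append((x - x_star) / np.sqrt(mu))   # rescaled variable u

U = np.asarray(samples)
V_empirical = np.cov(U.T)
V_lyapunov = solve_continuous_lyapunov(A, -Sigma)   # A V + V A^T = -Sigma

print("empirical covariance of u:\n", V_empirical)
print("Lyapunov solution V:\n", V_lyapunov)
\end{verbatim}
With a small step size, the two matrices printed at the end agree up to sampling error; this is nothing more than the Euclidean shadow of the convergence $\tilde\pi_\mu \Rightarrow \mathrm N(0,V)$ discussed above.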
Let $M$ be a Hadamard manifold, and $P$ a probability distribution on $M$, which has a strictly positive probability density with respect to the Riemannian volume. Then, let $(y_t\,;\,t = 1, 2, \ldots)$ be independent samples from $P$, and $x_0$ be a point in $M$. Consider the stochastic update rule, starting from $x_0$ at $t = 0$,
\[
x_{t+1} \;=\; x_t\,\#_\mu\,y_{t+1} \qquad \text{where } \mu \in (0,1)
\tag{5.84}
\]
where the notation is that of (4.4) from 4.1. This will be called a Riemannian AR(1) model, since each new $x_{t+1}$ is a geodesic convex combination of the old $x_t$ and of the new sample $y_{t+1}$. If $M$ is a Euclidean space, $M = \mathbb R^n$, then (5.84) reads $x_{t+1} = (1-\mu)\,x_t + \mu\,y_{t+1}$, which is a first-order auto-regressive model (whence the name AR(1)).

The update rule (5.84) may be viewed as a constant-step-size exponential scheme, of the form (5.47). Specifically, (5.84) is equivalent to
\[
x_{t+1} \;=\; \mathrm{Exp}_{x_t}\bigl(\mu\,X_{y_{t+1}}(x_t)\bigr) \qquad \text{where } X_y(x) = \mathrm{Exp}^{-1}_x(y)
\tag{5.85}
\]
which defines a time-homogeneous Markov chain $(x_t)$ with values in $M$.

One is tempted to apply the results of 5.6 (e.g. on geometric ergodicity) directly to the scheme (5.85). However, some of the assumptions in 5.6 (especially e1) turn out to be quite unnatural in this setting. Fortunately, it is possible to proceed along a different path, which only requires the existence of second-order moments. Specifically, it is merely required that
\[
E(x) \;=\; \frac12\int_M d^2(x,y)\,P(dy) \;<\; \infty
\tag{5.86}
\]
for some (and therefore all) $x \in M$. As discussed in 2.2.3, (5.86) guarantees existence and uniqueness of the Riemannian barycentre $x_*$ of $P$. This is enough for the following proposition.

Proposition 5.9. Consider the Riemannian AR(1) model (5.84) (or (5.85)), on a Hadamard manifold $M$. If (5.86) is verified, then the Markov chain $(x_t)$ is geometrically ergodic, with a unique invariant distribution $\pi_\mu$. Moreover, $\pi_\mu \Rightarrow \delta_{x_*}$ as $\mu \to 0$.
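Before turning to the proof, here is a minimal numerical sketch of the update (5.84)-(5.85), on the Hadamard manifold of symmetric positive-definite matrices with the affine-invariant metric (see [21]). The closed-form exponential map and its inverse are the standard ones for this metric; the sampling model for $P$, and all names in the code, are illustrative assumptions of this sketch, rather than anything prescribed in the text.
\begin{verbatim}
# Riemannian AR(1) update x_{t+1} = Exp_{x_t}( mu * Exp^{-1}_{x_t}(y_{t+1}) )
# on SPD matrices with the affine-invariant metric; illustrative sketch only.
import numpy as np

rng = np.random.default_rng(1)
d, mu = 3, 0.05                       # matrix dimension, step size in (0, 1)

def sym_apply(x, f):
    """Apply a scalar function to a symmetric matrix via its eigenvalues."""
    w, q = np.linalg.eigh((x + x.T) / 2.0)
    return q @ np.diag(f(w)) @ q.T

def exp_map(x, v):
    """Exp_x(v) for the affine-invariant metric on SPD matrices."""
    s = sym_apply(x, np.sqrt)
    s_inv = np.linalg.inv(s)
    return s @ sym_apply(s_inv @ v @ s_inv, np.exp) @ s

def log_map(x, y):
    """Exp^{-1}_x(y), i.e. the vector field X_y(x) of (5.85)."""
    s = sym_apply(x, np.sqrt)
    s_inv = np.linalg.inv(s)
    return s @ sym_apply(s_inv @ y @ s_inv, np.log) @ s

def sample_y():
    """A sample from P: a log-normal model around the identity (an assumption)."""
    g = rng.standard_normal((d, d))
    return sym_apply(0.3 * (g + g.T), np.exp)

x = 5.0 * np.eye(d)                   # arbitrary starting point x_0
for _ in range(2000):
    x = exp_map(x, mu * log_map(x, sample_y()))   # the update (5.84)/(5.85)

# x now fluctuates near the barycentre of P (here, close to the identity),
# with fluctuations of order sqrt(mu).
print(np.round(x, 3))
\end{verbatim}
The eigenvalue decomposition is only used to obtain matrix exponentials, logarithms and square roots of symmetric matrices without further dependencies; after a few thousand iterations, the iterate fluctuates around the barycentre of $P$ at a scale of order $\mu^{1/2}$, in line with Proposition 5.9 and with (5.91) below.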
The proof of this proposition begins like that of Proposition 5.5, by noting that the Markov chain $(x_t)$ is Feller, and $|\mathrm{vol}|$-irreducible and aperiodic. Indeed, since $X_y(x)$ is given by (5.85), and since $P$ has a strictly positive probability density with respect to $|\mathrm{vol}|$, it follows that Assumption e2 holds. Therefore, it is possible to argue exactly as in the proof of Lemma 5.3.

Now, let $V(x) = d^2(x_*,x)/2$. To prove that the chain $(x_t)$ is geometrically ergodic, it is enough to obtain the inequality
\[
Q_\mu V(x) \;\le\; (1-\mu)^2\,V(x) + \mu^2\,E(x_*)
\tag{5.87}
\]
which is similar to (5.56) of Lemma 5.4. This can then be used, exactly as in the proof of Proposition 5.5, based on [51] (Theorem 16.0.1).

Proof of (5.87): for any $x \in M$, note from (5.84) that
\[
Q_\mu V(x) \;=\; \mathbb E\bigl[V(x\,\#_\mu\,y)\bigr]
\tag{5.88}
\]
where $y$ denotes a random variable with distribution $P$. Recall from 1.7.4 that $V$ is $1$-strongly convex along geodesics, so that
\[
V(x\,\#_\mu\,y) \;\le\; (1-\mu)\,V(x) + \mu\,V(y) - \tfrac12\,\mu(1-\mu)\,d^2(x,y)
\]
Taking the expectation with respect to $y$, this yields
\[
Q_\mu V(x) \;\le\; (1-\mu)\,V(x) + \mu\,E(x_*) - \mu(1-\mu)\,E(x)
\tag{5.89}
\]
after using the fact that $E(x) = \mathbb E\bigl[d^2(x,y)/2\bigr]$ for any $x \in M$, which is clear from (5.86). Now, recall from 2.2.3 that $E$ is also $1$-strongly convex, so that
\[
E(x) \;\ge\; E(x_*) + d^2(x_*,x)/2 \;=\; E(x_*) + V(x)
\tag{5.90}
\]
Thus, replacing (5.90) into (5.89), one has
\[
Q_\mu V(x) \;\le\; (1-\mu)\,V(x) + \mu\,E(x_*) - \mu(1-\mu)\,V(x) - \mu(1-\mu)\,E(x_*)
\]
which immediately yields (5.87).

Geometric ergodicity ensures the chain $(x_t)$ has a unique invariant distribution $\pi_\mu$. To prove that $\pi_\mu \Rightarrow \delta_{x_*}$ as $\mu \to 0$, it is possible to argue as in the proof of Proposition 5.6. Precisely, integrating both sides of (5.87) with respect to $\pi_\mu$, it follows that
\[
\int_M Q_\mu V(x)\,\pi_\mu(dx) \;\le\; (1-\mu)^2\int_M V(x)\,\pi_\mu(dx) + \mu^2\,E(x_*)
\]
Since $\pi_\mu$ is an invariant distribution of the transition kernel $Q_\mu$, this means
\[
\int_M V(x)\,\pi_\mu(dx) \;\le\; (1-\mu)^2\int_M V(x)\,\pi_\mu(dx) + \mu^2\,E(x_*)
\]
In other words,
\[
\int_M V(x)\,\pi_\mu(dx) \;\le\; E(x_*)\,\mu/(2-\mu)
\tag{5.91}
\]
Since $\mu/(2-\mu) \le \mu$ for $\mu \le 1$, (5.91) can be used like (5.63) in the proof of Proposition 5.6. In this way, $\pi_\mu \Rightarrow \delta_{x_*}$ follows by noting that $V$, defined by $V(x) = d^2(x_*,x)/2$, has compact sublevel sets, by the Hopf-Rinow theorem and the fact that $M$ is complete (see [11]), and that $V(x) = 0$ if and only if $x = x_*$.

Remark: thanks to Proposition 5.9, it is now possible to prove that a central limit theorem, identical to Proposition 5.8, holds for the Riemannian AR(1) model (5.84). This only requires the additional condition (5.55).

Chapter 6

Open problems

While working on this thesis, I came across several problems which I found very interesting, and important, but could not solve, or even attack in a meaningful way. I would therefore like to close the thesis with a list of these problems, in the hope that they will attract the attention of more people (not just myself).

In Chapter 2: the conclusion of Lemma 2.1 only holds for compact Riemannian symmetric spaces which are simply connected. Therefore, the subsequent Propositions 2.2 and 2.3 are restricted to this simply connected case. The problem is to describe, at least partially, what happens for compact Riemannian symmetric spaces which are not simply connected. It would be particularly interesting to give counterexamples to either one of Propositions 2.2 and 2.3, in the non-simply-connected case.

In Chapter 3: formula (3.7) gives the normalising factor $Z(\sigma)$ for a Gaussian distribution on a symmetric space $M$ which belongs to the non-compact case. When $M$ is the space of $N \times N$ Hermitian positive-definite matrices, (3.35) and (3.42) provide a closed-form expression (valid for any $N$), and an asymptotic expression (valid for large $N$), of the multiple integral in (3.7). The problem is to find similar expressions of this integral, for other symmetric spaces. It should be easiest to first deal with the spaces of $N \times N$ real or quaternion positive-definite matrices, and then move on to other spaces, such as the Siegel domain (Example 3, in 3.3).

In Chapter 4: the problem is to prove or disprove the conjecture mentioned in 4.2, namely that the MAP and MMS Bayesian estimators are equal, for Gaussian distributions on a space of constant negative curvature.

In Chapter 4: as mentioned in 4.4, I have never come across a function $f : M \to \mathbb R$ ($M$ a non-Euclidean Hadamard manifold) which is strongly convex and also has a bounded Hessian. The problem is to construct a function $f$ with these properties, or to show that this is not possible. Another problem, which is quite important for convex optimisation, is to show that a function $f : M \to \mathbb R$, which is convex and has a bounded Hessian, has a co-coercive gradient, in the sense of [57] (Theorem 2.1.5, property (2.1.11)).

In Chapter 5: to state in clear and general terms the functional central limit theorem which follows from Proposition 5.7, and also to derive a similar functional central limit theorem for decreasing-step-size schemes. These can be used in studying the behaviour of stochastic approximation schemes in the presence of unstable critical points (only stable critical points were considered in the above).
In Chapter 5: to generalise Proposition 5.8 to the case where $M$ is not a Hadamard manifold. I believe that, in this case, the asymptotic form of the invariant distribution will no longer be multivariate normal (roughly, because the scheme can always "jump across" the cut locus of a stable critical point).

Bibliography

[1] S. Said, L. Bombrun, and Y. Berthoumieu, "Warped Riemannian metrics for location-scale models," in Geometric structures of information, F. Nielsen, Ed. Springer Switzerland, 2019.
[2] A. Durmus, P. Jimenez, E. Moulines, S. Said, and H. T. Wai, "Convergence analysis of Riemannian stochastic approximation schemes," arXiv:2005.13284, 2020.
[3] A. Durmus, P. Jimenez, E. Moulines, and S. Said, "On Riemannian stochastic approximation schemes with fixed step-size (under review)," in Artificial Intelligence and Statistics, 2021.
[4] L. Santilli and M. Tierz, "Riemannian Gaussian distributions, random matrix ensembles and diffusion kernels," arXiv:2011.13680, 2020.
[5] P. A. Meyer, "Géométrie stochastique sans larmes," Séminaire de probabilités (Strasbourg), vol. 15, pp. 44–102, 1981.
[6] J. H. Eschenburg, Lecture notes on symmetric spaces (course material available online), uni-augsburg.de/eschenbu/symspace.pdf, 1997.
[7] P. A. Absil, R. Mahony, and R. Sepulchre, Optimization algorithms on matrix manifolds. Princeton University Press, 2008.
[8] P. A. Absil and J. Malick, "Projection-like retractions on matrix manifolds," SIAM Journal on Optimization, vol. 22, pp. 135–158, 2012.
[9] T. Sakai, "On cut loci of compact symmetric spaces," Hokkaido Mathematical Journal, vol. 6, pp. 136–161, 1977.
[10] S. Helgason, Differential geometry and symmetric spaces. New York and London: Academic Press, 1962.
[11] I. Chavel, Riemannian geometry, a modern introduction. Cambridge University Press, 2006.
[12] C. Mantegazza, G. Mascellani, and G. Uraltsev, "On the distributional Hessian of the distance function," arXiv:1303.1421, 2013.
[13] P. J. Huber and E. M. Ronchetti, Robust statistics (2nd edition). Wiley-Blackwell, 2009.
[14] J. E. Marsden, T. Ratiu, and R. Abraham, Manifolds, tensor analysis, and applications. Springer-Verlag, 2001.
[15] A. L. Besse, Manifolds all of whose geodesics are closed. New York: Springer-Verlag, 1978.
[16] V. I. Bogachev, Measure theory, Volume I. Springer-Verlag, 2007.
[17] M. L. Mehta, Random matrices (3rd edition). Elsevier Ltd., 2004.
[18] E. S. Meckes, The random matrix theory of the classical compact groups. Cambridge University Press, 2019.
[19] S. Kobayashi and K. Nomizu, Foundations of differential geometry, Volume II. Interscience Publishers, 1969.
[20] N. J. Higham, Functions of matrices: theory and computation. SIAM Publications, 2008.
[21] X. Pennec, P. Fillard, and N. Ayache, "A Riemannian framework for tensor computing," International Journal of Computer Vision, vol. 66, no. 1, pp. 41–66, 2006.
[22] S. Said and J. H. Manton, "Riemannian barycentres of Gibbs distributions: new results on concentration and convexity," Information Geometry (under review), 2020.
[23] M. Fréchet, "Les éléments aléatoires de nature quelconque dans un espace distancié," Annales de l'I.H.P., vol. 10, no. 4, pp. 215–310, 1948.
[24] R. Bhattacharya and V. Patrangenaru, "Large sample theory of intrinsic and extrinsic sample means on manifolds I," The Annals of Statistics, vol. 31, no. 1, pp. 1–29, 2003.
[25] R. Bhattacharya and V. Patrangenaru, "Large sample theory of intrinsic and extrinsic sample means on manifolds II," The Annals of Statistics, vol. 33, no. 3, pp. 1225–1259, 2005.
[26] W. S. Kendall, "Probability, convexity, and harmonic maps with small image I: uniqueness and fine existence," Proceedings of the London Mathematical Society, vol. 61, no. 2, pp. 371–406, 1990.
[27] B. Afsari, "Riemannian L^p center of mass: existence, uniqueness and convexity," Proceedings of the American Mathematical Society, vol. 139, no. 2, pp. 655–673, 2010.
[28] W. S. Kendall, "The propeller: a counterexample to a conjectured criterion for the existence of certain convex functions," Journal of the London Mathematical Society, vol. 36, no. 2, pp. 364–374, 1992.
[29] K. T. Sturm, "Probability measures on metric spaces of nonpositive curvature," Contemporary Mathematics, vol. 338, pp. 1–34, 2003.
[30] M. Arnaudon and L. Miclo, "Means in complete manifolds: completeness and approximation," ESAIM: Probability and Statistics, vol. 18, pp. 185–206, 2014.
[31] R. Wong, Asymptotic approximation of integrals. Society for Industrial and Applied Mathematics, 2001.
[32] D. Schleicher, Hausdorff dimension, its properties and its surprises (available online), arXiv:0505099, 2007.
[33] E. T. Whittaker and G. N. Watson, A course of modern analysis (4th edition). Cambridge University Press, 1950.
[34] P. Petersen, Riemannian geometry (2nd edition). Springer Science, 2006.
[35] L. P. Kantorovich and G. P. Akilov, Functional analysis (2nd edition). Pergamon Press, 1982.
[36] S. Said, H. Hajri, L. Bombrun, and B. C. Vemuri, "Gaussian distributions on Riemannian symmetric spaces: statistical learning with structured covariance matrices," IEEE Transactions on Information Theory, vol. 64, no. 2, pp. 752–772, 2018.
[37] S. Stahl, "The evolution of the normal distribution," Mathematics Magazine, vol. 79, no. 2, pp. 96–113, 2006.
[38] E. Borel, Introduction géométrique à quelques théories physiques. Gauthier-Villars, 1914.
[39] J. Perrin, "Mouvement brownien et molécules," Journal de physique théorique et appliquée, vol. 9, no. 1, pp. 5–39, 1910.
[40] A. W. Knapp, Lie groups, beyond an introduction (2nd edition). Birkhäuser, 2002.
[41] C. L. Siegel, "Symplectic geometry," American Journal of Mathematics, vol. 65, no. 1, pp. 1–86, 1943.
[42] A. Terras, Harmonic analysis on symmetric spaces and applications, Vol. II. Springer-Verlag, 1988.
[43] G. Szegő, Orthogonal polynomials (1st edition). American Mathematical Society, 1939.
[44] B. C. Berndt, What is a q-series? (appeared in Ramanujan rediscovered, available online), faculty.math.illinois.edu/berndt/articles/q.pdf, 2012.
[45] A. B. J. Kuijlaars and W. Van Assche, "The asymptotic zero distribution of orthogonal polynomials with varying recurrence coefficients," Journal of Approximation Theory, vol. 99, pp. 167–197, 1999.
[46] P. Deift, Orthogonal polynomials and random matrices: a Riemann–Hilbert approach. American Mathematical Society, 1998.
[47] M. Mariño, Chern–Simons theory, matrix models, and topological strings. Oxford University Press, 2005.
[48] G. E. Andrews, The theory of partitions. Addison-Wesley Publishing Company, 1976.
[49] S. F. Jarner and E. Hansen, "Geometric ergodicity of Metropolis algorithms," Stochastic Processes and their Applications, vol. 58, pp. 341–361, 1998.
[50] G. O. Roberts and J. S. Rosenthal, "General state-space Markov chains and MCMC algorithms," Probability Surveys, vol. 1, pp. 20–71, 2004.
[51] S. Meyn and R. L. Tweedie, Markov chains and stochastic stability. Cambridge University Press, 2008.
[52] G. O. Roberts and R. L. Tweedie, "Geometric ergodicity and central limit theorems for multidimensional Metropolis and Hastings algorithms," Biometrika, vol. 82, no. 1, pp. 95–110, 1996.
[53] J. M. Lee, Introduction to smooth manifolds (2nd edition). Springer Science, 2012.
[54] G. C. Pflug, "Stochastic minimisation with constant step-size: asymptotic laws," SIAM Journal on Control and Optimization, vol. 24, no. 4, pp. 655–666, 1986.
[55] O. Kallenberg, Foundations of modern probability (2nd edition). Springer-Verlag, 2002.
[56] M. Duflo, Algorithmes stochastiques. Springer-Verlag, 1996.
[57] Y. Nesterov,