On the minmax regret for statistical manifolds: the role of curvature
Bruno Mera*†, Paulo Mateus*†, Alexandra M. Carvalho*‡
*Instituto de Telecomunicações, 1049-001 Lisboa, Portugal
{bruno.mera, paulo.mateus, alexandra.carvalho}@lx.it.pt
†Departamento de Matemática, Instituto Superior Técnico, Universidade de Lisboa
‡Departamento de Engenharia Eletrotécnica e Computadores, Instituto Superior Técnico, Universidade de Lisboa
Abstract
Model complexity plays an essential role in model selection, namely, in choosing a model that fits the data and is also succinct. Two-part codes and the minimum description length have been successful in delivering procedures that single out the best models while avoiding overfitting. In this work, we pursue this approach and complement it by making further assumptions on the parameter space. Concretely, we assume that the parameter space is a smooth manifold, and, using tools of Riemannian geometry, we derive a sharper expression than the standard one given by the stochastic complexity, in which the scalar curvature of the Fisher information metric plays a dominant role. Furthermore, we derive the minmax regret for general statistical manifolds and apply our results to derive the optimal dimensional reduction in the context of principal component analysis.
I. INTRODUCTION

Two-part codes are an essential tool in model selection. Not only do they optimize the likelihood of the data given the model, but they also take into account model complexity. There has been a line of research where one considers, in the most abstract setting, families of distributions satisfying minimal requirements and derives an expression for model complexity, such as the stochastic complexity, among others [1], [2]. These formulas are sharp to the extent of the absence of assumptions in the assignment of a probability distribution to each point in the parameter space. Moreover, it is a rather usual assumption that this parameter space has the topology of an open subset of $\mathbb{R}^n$.

In this paper, we show that by making additional assumptions on the parameter space and endowing it with natural information-geometric structures, we can arrive at sharper results by applying techniques from Riemannian geometry. In practice, parameters of the distributions are usually taken to live on a smooth manifold, and the distribution is assumed to vary smoothly with the parameters. However, one usually takes the simplification that this manifold is a trivial open subset of Euclidean space. In this work, we drop this assumption, hence allowing for non-trivial topologies. Moreover, Information Theory endows the manifold with a positive (semi-)definite covariant 2-tensor, namely a Riemannian metric – the Fisher information [3], [4]. Since we are given a Riemannian structure, we have a natural notion of a uniform distribution over the manifold of parameters, which corresponds to what is known in the literature as Jeffreys' prior [5], [6].

In the literature, when the parameter space is just a bounded open set in $\mathbb{R}^n$, one can find the (normalized) maximum likelihood code, defined by
\[
p^*(x^N) = \frac{p(x^N \mid \hat\theta(x^N))}{\int_{y^N \in \mathcal{X}^N} p(y^N \mid \hat\theta(y^N))\, dy^N}. \tag{1}
\]
The associated length was first given by Rissanen [1], computed through Laplace's formula, and has the form
\[
L^*(x^N) = -\log(p^*(x^N)) = -\log p(x^N \mid \hat\theta) + \frac{n}{2}\log\left(\frac{N}{2\pi}\right) + \log \int \sqrt{|I(\theta)|}\, d\theta + o(1), \tag{2}
\]
where the expansion is stated in terms of the size of the dataset $N$. While in Rissanen's original work he considered $x^N$ beyond i.i.d. processes, in the present work we will focus only on this case. Observe that Eq. (2) does not account for the possible dependence of the $o(1)$ term on the dimension of the parameter space. Indeed, in this work, using techniques from Riemannian geometry, we find the sharper formula
\[
L^*(x^N) = -\log p(x^N \mid \hat\theta) + \frac{n}{2}\log\left(\frac{N}{2\pi}\right) + \log \mathrm{vol}_g(M) - \log\left(\frac{\sqrt{\det(g_{\hat\theta})}}{\sqrt{\det(I(x^N,\hat\theta))}}\right) - \frac{1}{6N}R(\hat\theta) + O\left(\frac{1}{N^2}\right), \tag{3}
\]
where the last two terms are $o(1)$ as functions of $N$, and where three classical geometric invariants can easily be identified, namely: (i) the dimension of the manifold $n$; (ii) the Riemannian volume $\mathrm{vol}_g(M)$; and (iii) the Ricci scalar curvature $R(\hat\theta)$ evaluated at the maximum likelihood estimate $\hat\theta$. While in Eq. (2) the term $\log\int\sqrt{|I(\theta)|}\, d\theta$ is precisely the logarithm of the Riemannian volume, we choose to write it explicitly to highlight its geometric nature. Note that the scalar curvature might be very large as a function of the type of data involved. For example, it is currently very common to have high-dimensional data, and this curvature will most likely depend on this dimension, as is the case for Gaussian models, as we shall see below. To derive Eq.
(3), motivated by the results in [7], we follow a Bayesian approach considering Jeffreys' prior, and we adapt Laplace's method to manifolds, using canonical Riemann normal coordinates to our advantage.

In order to obtain the minmax regret akin to Eq. (1), we use Haussler's version of the capacity theorem [8], which requires the map $p : \theta \mapsto p(\cdot \mid \theta)$ to be continuous with respect to the weak topology on the target space of probability distributions on $\mathcal{X}^N$; that is, for every bounded continuous function $f$ we have that if $\theta_n \to \theta$ then
\[
\mathbb{E}_{p(\cdot\mid\theta_n)}[f] \to \mathbb{E}_{p(\cdot\mid\theta)}[f], \tag{4}
\]
where $(\theta_n)_{n\in\mathbb{N}}$ is a (convergent) sequence in $M$. In [9], such a condition is present and is equivalent to the soundness assumption on the parametrization. Since, locally, a smooth manifold looks like an open set in $\mathbb{R}^n$, the natural condition to take is that such soundness holds for every coordinate neighborhood, a property that we call the local soundness assumption of the statistical model. Under this assumption, we show that the minmax regret of data $x^N$ generated by $\theta$ is given by
\[
R_N(x^N) = \frac{n}{2}\log\left(\frac{N}{2\pi}\right) + \log \mathrm{vol}_g(M) - \log\left(\frac{\sqrt{\det(g_\theta)}}{\sqrt{\det(I(x^N,\theta))}}\right) - \frac{1}{6N}R(\theta) + O\left(\frac{1}{N^2}\right). \tag{5}
\]
Observe that Eq. (3) follows from this result by adding the length of the optimal code, $-\log p(x^N \mid \theta)$, and replacing $\theta$ with the unique (by assumption) estimator $\hat\theta$ in the manifold. Thus, we can see Eq. (3) as a two-part code, where Eq. (5), with $\theta$ replaced by $\hat\theta$, is a refinement of the stochastic complexity [1] that takes into account the geometry of the statistical model, and we therefore call it the Geometric Complexity.

We apply our results to a very well established method for dimensional reduction, namely, Principal Component Analysis (PCA). In particular, our results yield a natural criterion for the choice of the optimal dimension, obtained by adapting the two-part code given in Eq. (3) to zero-mean Gaussian families with varying covariance. The underlying parameter space is the manifold $P_m$ of positive definite matrices of reduced dimension $m \times m$, which we want to optimize, equipped with the Fisher metric. We consider a bounded subset $M(s)$ of $P_m$, controlled by an integer $s$, namely the smallest integer such that $I_d \le \Sigma \le s I_d$, where $\Sigma = XX^T/N$ is the empirical covariance matrix and $I_d$ is the $d \times d$ identity matrix. We also assume that each component of the data is written as an integer multiple of the precision for each variable, so that the volume depends on the precision and not on a particular system of units. For this particular case, the formula becomes
\[
L^*(x^N) = -\log p(x^N \mid \hat Q) + \frac{m(m+1)}{4}\log\left(\frac{N}{2\pi}\right) + \log \mathrm{vol}_g(M(s)) + \frac{(m+2)m(m-1)}{24N} + O\left(\frac{1}{N^2}\right), \tag{6}
\]
where the term involving $\det I(x^N, \hat Q)$ drops out since, for this exponential family, the observed information at the maximum likelihood estimate $\hat Q$ coincides with the Fisher metric. Here $\log \mathrm{vol}_g(M(s))$ admits a closed-form expression in terms of the Glaisher constant $A$, the Barnes $G$-function, and an explicit integral $I(s)$ over the cube $[0,1]^m$ that carries the dependence on $s$; we compute it in Section IV.
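As a concrete point of comparison for Eqs. (1) and (2), the following minimal numerical sketch (our own illustration; the Bernoulli model is an assumption and is not one of the closed manifolds treated below) computes the exact parametric complexity $\log\sum_{y^N} p(y^N \mid \hat\theta(y^N))$ and compares it with Rissanen's asymptotics, with $n = 1$ and $\int_0^1 \sqrt{|I(\theta)|}\, d\theta = \int_0^1 d\theta/\sqrt{\theta(1-\theta)} = \pi$:

```python
import numpy as np
from scipy.special import gammaln

# Minimal sketch (assumption: Bernoulli model, not from the paper): the exact
# Shtarkov normalizer of Eq. (1) versus the asymptotics of Eq. (2),
# (n/2) log(N/(2 pi)) + log int sqrt(|I|), with n = 1 and the integral = pi.

def log_shtarkov_sum(N):
    k = np.arange(1, N)
    # log C(N,k) + k log(k/N) + (N-k) log(1-k/N); the k = 0 and k = N terms equal 1
    terms = (gammaln(N + 1) - gammaln(k + 1) - gammaln(N - k + 1)
             + k * np.log(k / N) + (N - k) * np.log(1 - k / N))
    return np.log(2 + np.exp(terms).sum())

for N in [100, 1000, 10000]:
    print(N, log_shtarkov_sum(N), 0.5 * np.log(N / (2 * np.pi)) + np.log(np.pi))
```

The two columns agree to within $o(1)$, illustrating that the constant term in Eq. (2) is exactly the logarithm of the Riemannian volume of the parameter space.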
II. GEOMETRIC COMPLEXITY

Let $M$ be a smooth closed (compact and without boundary), connected, oriented manifold of dimension $n$, and let $S = \{p(X \mid \theta)\}_{\theta\in M}$ be a smooth family of probability distributions modeling a random variable $X$ taking values in the space of outcomes $\mathcal{X}$. By a smooth family of probability distributions we mean that the map $M \ni \theta \mapsto p(X = x \mid \theta) := p(x \mid \theta) \in \mathbb{R}$ is smooth for every $x \in \mathcal{X}$. We will also assume that this map is injective, i.e., that the statistical model is identifiable.

The set $S$ is also known as a statistical model or a parametric model. It is often the case that $M \subset \mathbb{R}^k$ for some $k$, but we choose to leave it as a general abstract manifold. We refer to the pair $(M, p(X \mid \cdot))$ as a statistical manifold. The map $p : M \ni \theta \mapsto p(X \mid \theta)$ allows us, by pullback, to define a (possibly degenerate) Riemannian structure on $M$, known in the literature as the Fisher information metric [3], [4]:
\[
g(\theta) = \mathbb{E}_\theta\left[\, d\log p(X \mid \theta) \otimes d\log p(X \mid \theta)\,\right] = \sum_{\mu,\nu=1}^n g_{\mu\nu}(\theta)\, d\theta^\mu\, d\theta^\nu, \tag{7}
\]
where $\mathbb{E}_\theta$ denotes the expectation value with respect to the probability distribution $p(X \mid \theta)$, and $(\theta^1, \dots, \theta^n)$ are arbitrary local coordinates on the manifold $M$. The locally defined matrix $[g_{\mu\nu}(\theta)]_{1\le\mu,\nu\le n}$ is usually referred to as the Fisher information matrix, and it is a measure of the amount of information that an observable random variable $X$ carries about an unknown parameter $\theta$ of $p(X \mid \theta)$ modeling $X$.

If we have a discrete and finite space of outcomes, say $\mathcal{X} = \{1, \dots, N\}$, then a statistical model is described by smooth functions $\{p_i(\theta) \ge 0 : i = 1, \dots, N\}$ with $\sum_i p_i(\theta) = 1$, and
\[
g(\theta) = \sum_{i=1}^N \sum_{\mu,\nu=1}^n \frac{1}{p_i(\theta)}\, \frac{\partial p_i}{\partial\theta^\mu}(\theta)\, \frac{\partial p_i}{\partial\theta^\nu}(\theta)\, d\theta^\mu\, d\theta^\nu. \tag{8}
\]
If one considers the standard simplex $\Delta^{N-1} = \{(p_1, \dots, p_N) \in \mathbb{R}^N : \sum_{i=1}^N p_i = 1,\ p_i \ge 0\}$, then the map $\Phi : \Delta^{N-1} \ni (p_1, \dots, p_N) \mapsto (\sqrt{p_1}, \dots, \sqrt{p_N}) \in S^{N-1}$, where $S^{N-1}$ denotes the unit sphere in $\mathbb{R}^N$, provides a homeomorphism onto its image and endows $\Delta^{N-1}$ with the structure of a smooth manifold. Furthermore, if we equip the sphere $S^{N-1} \subset \mathbb{R}^N$ with the standard round metric, then $\Delta^{N-1}$ canonically inherits, by restriction, the structure of a Riemannian manifold $(\Delta^{N-1}, g_{\mathrm{can}})$. The Fisher metric on $M$ is, up to a multiplicative constant factor (this constant is equal to $4$), the metric induced on $M$ by the map $p : M \ni \theta \mapsto p(X \mid \theta) \in \Delta^{N-1}$. Yet another description of the Fisher metric is provided by the formula
\[
g_{\mu\nu}(\theta) = -\mathbb{E}_\theta\left[\frac{\partial^2 \log p(X \mid \theta)}{\partial\theta^\mu\, \partial\theta^\nu}\right], \quad \mu, \nu = 1, \dots, n. \tag{9}
\]
Among the various important features of this metric is its role in the Cramér–Rao inequality [3], which states that the covariance matrix of an unbiased estimator minus the inverse of the Fisher information matrix is positive semi-definite. As a consequence, the Fisher information provides the covariance of the best unbiased estimator, in the sense that its variance is the minimum possible.

Suppose we are given a collection of i.i.d. observations of the random variable $X$, $x^N = (x_1, \dots, x_N)$. We wish to infer the best statistical model describing the data set $x^N$. Given a statistical model $S = \{p(X \mid \theta)\}_{\theta\in M}$, the probability distribution governing $x^N \in \mathcal{X}^N$ is given by
\[
p(x^N \mid \theta) = \prod_{i=1}^N p(x_i \mid \theta). \tag{10}
\]
We may then take the random vector $X^N$, taking values in $\mathcal{X}^N$ and corresponding to the $N$ observations of the single random variable $X$, and describe it through the statistical model $S^N = \{p(X^N \mid \theta)\}_{\theta\in M}$ such that $p(X^N = x^N \mid \theta) = p(x^N \mid \theta)$. If we denote by $g(\theta)$ and $g_N(\theta)$ the Fisher metrics associated with $S$ and $S^N$, respectively, we have
\[
g_N(\theta) = N\, g(\theta). \tag{11}
\]
We shall refer to Eq. (11) as the extensive property of the Fisher metric. As a consequence, the geometry of $S$ and that of $S^N$ are the same modulo the scale factor $N$.
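Returning to Eq. (8) and the sphere embedding above, the following is a minimal sketch (our own illustration; the Bernoulli family is an assumption, not an example from the text) that evaluates the Fisher metric by finite differences and checks the factor-of-4 relation with the round metric pulled back through the square-root map:

```python
import numpy as np

# Minimal sketch (assumption: Bernoulli family p(theta) = (theta, 1 - theta)):
# Eq. (8) gives g(theta) = 1/(theta(1-theta)), and the metric induced through
# Phi(p) = (sqrt(p_1), sqrt(p_2)) on the unit sphere is exactly g/4.

def fisher_metric_eq8(theta, eps=1e-6):
    """Eq. (8): g = sum_i (1/p_i) (dp_i/dtheta)^2, derivatives by central differences."""
    p = lambda t: np.array([t, 1.0 - t])
    dp = (p(theta + eps) - p(theta - eps)) / (2 * eps)
    return np.sum(dp**2 / p(theta))

def sphere_induced_metric(theta, eps=1e-6):
    """Metric pulled back from the round sphere through the square-root embedding."""
    phi = lambda t: np.sqrt([t, 1.0 - t])
    dphi = (phi(theta + eps) - phi(theta - eps)) / (2 * eps)
    return np.sum(dphi**2)

theta = 0.3
print(fisher_metric_eq8(theta))          # ~ 1/(0.3 * 0.7) = 4.7619...
print(4 * sphere_induced_metric(theta))  # same value: Fisher = 4 x round metric
```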
In the absence of additional information, the Fisher metric allows us to introduce a probability distribution on $M$. This probability distribution has the interpretation of a uniform probability distribution for the statistical model $S^N$, and it is called Jeffreys' prior in the field of Bayesian statistics. The associated probability density is given by the top differential form
\[
\frac{\sqrt{\det[g_N(\theta)]}\, d\theta^1 \wedge \dots \wedge d\theta^n}{\mathrm{vol}_{g_N}(M)},
\]
where the normalization factor is the Riemannian volume of $M$ according to the Fisher metric $g_N$:
\[
\mathrm{vol}_{g_N}(M) = \int_M \sqrt{\det[g_N(\theta)]}\, d\theta^1 \wedge \dots \wedge d\theta^n.
\]
Notice that if $M$ is compact this integral is well defined, but if $M$ is not compact one has to regularize it in some way. By the extensive property of the Fisher metric, Eq. (11), this probability distribution is the same as the one provided by $g$:
\[
\frac{\sqrt{\det[g_N(\theta)]}\, d\theta^1 \wedge \dots \wedge d\theta^n}{\mathrm{vol}_{g_N}(M)} = \frac{\sqrt{\det[g(\theta)]}\, d\theta^1 \wedge \dots \wedge d\theta^n}{\mathrm{vol}_g(M)}.
\]
From now on, for the sake of simplicity, we will write $dV_g := \sqrt{\det[g(\theta)]}\, d\theta^1 \wedge \dots \wedge d\theta^n$.

From a Bayesian perspective, the probability of the statistical model $S$ (or, equivalently, of $S^N$) given the observed data $x^N$, $\Pr(S \mid x^N)$, is given by
\[
\Pr(S \mid x^N) = \frac{\Pr(S)}{\Pr(x^N)} \times \int_M p(x^N \mid \theta)\, \frac{dV_g}{\mathrm{vol}_g(M)},
\]
where $\Pr(S)$ and $\Pr(x^N)$ denote the prior probabilities of the statistical model $S$ and of the data $x^N$, and $\int_M p(x^N \mid \theta)\, dV_g / \mathrm{vol}_g(M)$ is our posterior likelihood according to the prescription of Jeffreys' prior. Without prior knowledge of details of the true distribution of $X$, any statistical model $S$ should be equally likely. Maximizing $\Pr(S \mid x^N)$ is therefore equivalent to maximizing the functional
\[
F(x^N, S) = \int_M p(x^N \mid \theta)\, \frac{dV_g}{\mathrm{vol}_g(M)}
\]
with respect to the statistical model $S = \{p(X \mid \theta)\}_{\theta\in M}$. Mathematically, finding a maximum of $F$ is a very difficult problem, since the space of all such $S$ is very complicated: we are considering the union, over all smooth manifolds $M$, of the spaces of maps from these manifolds to the set of probability distributions on a given outcome space $\mathcal{X}$, namely $p : \theta \mapsto p(X \mid \theta)$ such that, for every $x \in \mathcal{X}$, $M \ni \theta \mapsto p(x \mid \theta) \in \mathbb{R}$ is smooth. However, we can go a bit further by using the Riemannian structure on $M$ and the assumption that $N$ is large. We rewrite the functional $F(x^N, S)$ as
\[
F(x^N, S) = \int_M e^{-Nf(\theta)}\, \frac{dV_g}{\mathrm{vol}_g(M)}, \tag{12}
\]
with $f(\theta) := -(1/N)\log p(x^N \mid \theta)$. Notice that the minima of $f$ are precisely the maximum likelihood parameters, denoted by $\hat\theta \in M$. The minimum of $f$, in the large $N$ limit, is unique because we assume that the statistical model is identifiable. In the following, we will perform a saddle point approximation to this integral, valid in the limit when $N$ is large.

We will use the following theorem, which is a generalization of Laplace's method in $\mathbb{R}^n$ to the case of closed oriented Riemannian manifolds.

Theorem 1 (Laplace's method). Let $(M, g)$ be a closed oriented Riemannian manifold of dimension $n$, where $g$ is the Riemannian metric, let $dV_g$ denote the Riemannian volume form, and let $f$ be a smooth function with a single non-degenerate maximum at $p \in M$. Then,
\[
\lim_{N\to\infty} \frac{\int_M e^{Nf}\, dV_g}{\left(\frac{2\pi}{N}\right)^{n/2} e^{Nf(p)}\, \frac{\sqrt{\det(g_p)}}{\sqrt{\det(\mathrm{Hess}_p(f))}} \left[1 - \frac{1}{6N}\,\mathrm{tr}\!\left(\mathrm{Hess}_p(f)^{-1} R_p\right)\right]} = 1,
\]
where $R_p$ denotes the Ricci tensor at $p$. We leave the proof to the Appendix of this paper.

Corollary 1 (Saddle point approximation). Under the same conditions of Theorem 1, it follows that, as $N \to \infty$,
\[
-\log \int_M e^{Nf}\, dV_g = -Nf(p) + \frac{n}{2}\log\left(\frac{N}{2\pi}\right) - \log\left(\frac{\sqrt{\det(g_p)}}{\sqrt{\det(\mathrm{Hess}_p(f))}}\right) - \frac{1}{6N}\,\mathrm{tr}\!\left(\mathrm{Hess}_p(f)^{-1} R_p\right) + O\left(\frac{1}{N^2}\right).
\]
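The leading order of Theorem 1 can be checked numerically. The following minimal sketch (our own illustration; the choice of manifold and of $f$ is ours) integrates $e^{Nf}$ over the unit 2-sphere for a function depending only on the polar angle and compares it with the leading Laplace factor; in normal coordinates at the maximum, $\det(g_p) = 1$:

```python
import numpy as np
from scipy.integrate import quad

# Minimal sketch (not from the paper): the leading order of Theorem 1 on the
# unit 2-sphere, with f(t) = cos(t) + 0.3*cos(t)**2 as a function of the polar
# angle t. The maximum is at the north pole p with f(p) = 1.3 and, in normal
# coordinates, Hess_p(f) = f''(0) * Id = -1.6 * Id, so the leading factor is
# (2*pi/N)**(n/2) * exp(N*f(p)) / sqrt(det(-Hess_p(f))), with n = 2.

f = lambda t: np.cos(t) + 0.3 * np.cos(t) ** 2

def exact_integral(N):
    # int_{S^2} exp(N(f - f(p))) dV = 2*pi * int_0^pi exp(N(f(t) - 1.3)) sin(t) dt
    val, _ = quad(lambda t: np.exp(N * (f(t) - 1.3)) * np.sin(t), 0.0, np.pi, limit=200)
    return 2 * np.pi * val  # exp(N*f(p)) factored out to avoid overflow

def laplace_leading(N):
    return (2 * np.pi / N) / np.sqrt(1.6**2)  # exp(N*f(p)) factored out as well

for N in [10, 100, 1000]:
    print(N, exact_integral(N) / laplace_leading(N))  # -> 1, with O(1/N) error
```

The ratio tends to 1 with a correction of order $1/N$, which is precisely the order at which the curvature bracket of Theorem 1 acts.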
The strong law of large numbers, which applies to independent identically distributed random variables, ensures that the random variable $-(1/N)\log p(X^N \mid \theta) = -(1/N)\sum_{i=1}^N \log p(X_i \mid \theta)$ satisfies
\[
\Pr\left[\lim_{N\to\infty}\left(-\frac{1}{N}\log p(X^N \mid \theta)\right) = \mathbb{E}\left[-\log p(X \mid \theta)\right]\right] = 1;
\]
in other words, the function $f(\theta) = -(1/N)\log p(x^N \mid \theta)$, as $N \to \infty$, approaches the entropy of the distribution $p(X \mid \theta)$. Moreover, if we take local coordinates $(\theta^1, \dots, \theta^n)$, we can define the matrix
\[
I(x^N, \theta) = [I_{\mu\nu}(x^N, \theta)]_{1\le\mu,\nu\le n} := \left[-\frac{1}{N}\frac{\partial^2 \log p(x^N \mid \theta)}{\partial\theta^\mu\,\partial\theta^\nu}\right]_{1\le\mu,\nu\le n}.
\]
We further have, by smoothness and the strong law of large numbers, that
\[
I_{\mu\nu}(x^N, \theta) \to \mathbb{E}\left[-\frac{1}{N}\frac{\partial^2 \log p(X^N \mid \theta)}{\partial\theta^\mu\,\partial\theta^\nu}\right] = g_{\mu\nu}(\theta), \quad \text{as } N \to \infty,
\]
for all $\mu, \nu = 1, \dots, n$. We can then apply the results of Theorem 1 to get
\[
-\log F(x^N, S) = -\log p(x^N \mid \hat\theta) + \frac{n}{2}\log\left(\frac{N}{2\pi}\right) + \log\mathrm{vol}_g(M) - \log\left(\frac{\sqrt{\det(g_{\hat\theta})}}{\sqrt{\det(I(x^N,\hat\theta))}}\right) - \frac{1}{6N}\,\mathrm{tr}\!\left[\left(I(x^N,\hat\theta)\right)^{-1} R_{\hat\theta}\right] + O\left(\frac{1}{N^2}\right).
\]
Furthermore, it is safe to replace $I(x^N,\hat\theta)^{-1}$ by $g^{-1}(\hat\theta)$, because their difference goes to zero as $N \to \infty$ and, hence, when multiplied by $1/(6N)$, the result goes to zero faster than $1/N$. Thus, we get the following theorem, which is one of the main results of our paper:

Theorem 2. Let $S = \{p(X \mid \theta)\}_{\theta\in M}$ be a smooth statistical model with $M$ closed and oriented. Let $g$ denote the Fisher metric, so that the pair $(M, g)$ is a Riemannian manifold. Then the functional $-\log F(x^N, S)$ has the following large $N$ asymptotic expansion:
\[
-\log F(x^N, S) = -\log p(x^N \mid \hat\theta) + \frac{n}{2}\log\left(\frac{N}{2\pi}\right) + \log\mathrm{vol}_g(M) - \log\left(\frac{\sqrt{\det(g_{\hat\theta})}}{\sqrt{\det(I(x^N,\hat\theta))}}\right) - \frac{1}{6N}R(\hat\theta) + O\left(\frac{1}{N^2}\right),
\]
where $R(\hat\theta) := \sum_{\mu,\nu=1}^n g^{\mu\nu}(\hat\theta)\, R_{\mu\nu}(\hat\theta)$ denotes the Ricci scalar curvature at $\hat\theta$, and $[g^{\mu\nu}(\theta)]_{1\le\mu,\nu\le n}$ is the inverse of $[g_{\mu\nu}(\theta)]_{1\le\mu,\nu\le n}$.
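The replacement of $I(x^N, \theta)$ by $g(\theta)$ invoked above is easy to observe numerically. The following minimal sketch (our own illustration; the Bernoulli model is an assumption, not an example from the text) evaluates the observed information by finite differences and watches it converge to the Fisher metric:

```python
import numpy as np

# Minimal sketch (assumption: Bernoulli model): the observed information
# I(x^N, theta) defined above converges to the Fisher metric
# g(theta) = 1/(theta(1-theta)) as N grows, which is what justifies trading
# I^{-1} for g^{-1} inside the O(1/N) curvature term of Theorem 2.

rng = np.random.default_rng(0)

def observed_information(x, theta, eps=1e-5):
    """I(x^N, theta) = -(1/N) d^2/dtheta^2 log p(x^N | theta), central differences."""
    loglik = lambda t: np.sum(x * np.log(t) + (1 - x) * np.log(1 - t))
    d2 = (loglik(theta + eps) - 2 * loglik(theta) + loglik(theta - eps)) / eps**2
    return -d2 / len(x)

theta0 = 0.3
for N in [100, 10_000, 1_000_000]:
    x = rng.binomial(1, theta0, size=N)
    print(N, observed_information(x, theta0), 1 / (theta0 * (1 - theta0)))
```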
III. THE MINMAX REGRET FOR GENERAL STATISTICAL MANIFOLDS

Herein we obtain the minmax regret in the present context of statistical manifolds. We begin by considering a natural assumption, which generalizes the soundness condition of Clarke and Barron in Ref. [9]. Concretely, we assume that the smooth family $\{p_\theta\}_{\theta\in M}$ is locally sound, i.e., for every coordinate neighborhood $U \subset M$ with chart $\phi : U \subset M \to \phi(U) \subset \mathbb{R}^n$, the induced map from $\phi(U)$ to the set of probability distributions with space of outcomes $\mathcal{X}$ is sound. According to this definition, if $(\phi(\theta_n))$ is a sequence converging in Euclidean norm to $\phi(\theta)$, denoted $\phi(\theta_n) \to \phi(\theta)$, then $(p_{\theta_n})$ converges weakly to $p_\theta$, also denoted $p_{\theta_n} \to p_\theta$. Weak convergence means that for every bounded continuous function $f : \mathcal{X} \to \mathbb{R}$ we have $\mathbb{E}_{p_{\theta_n}}[f] \to \mathbb{E}_{p_\theta}[f]$.

The previous assumption has two important consequences. First, in proving their results, Clarke and Barron assume that the posterior distribution is sound. This implies that the posterior localizes on neighborhoods of the true value of the distribution at a fast enough rate for Laplace's approximation to be applicable. In the present situation, the equivalent statement is made on
\[
p_N(\theta \mid x^N) = \frac{w(\theta)\, p_N(x^N \mid \theta)}{m_N(x^N)},
\]
which is taken to be locally sound in the sense described above. In the previous formula, $w(\theta)\, dV_g$ is a top form on the manifold $M$ (notice that for Jeffreys' prior, $w(\theta) = 1/\mathrm{vol}_g(M)$ is the uniform distribution with respect to the Riemannian metric), and $m_N(x^N) = \int_M w(\theta)\, p_N(x^N \mid \theta)\, dV_g$. As a consequence, we can apply Laplace's formula for Riemannian manifolds (see Corollary 1). Secondly, the local soundness condition implies that Haussler's version of the capacity theorem holds; see [8]. This result states that the following two quantities (there is actually a third one, which we do not use here) are equal:
\[
\sup_w \inf_q I(w, q) = \inf_q \sup_{\theta\in M} D_{\mathrm{KL}}\!\left(p(x^N \mid \theta)\, \|\, q(x^N)\right) =: R_N,
\]
where $I(w, q) = \int_M w(\theta)\, D_{\mathrm{KL}}\!\left(p(x^N \mid \theta)\, \|\, q(x^N)\right) dV_g$ is the cross information between $M$ under $w(\theta)\, dV_g$ and $\mathcal{X}^N$ under $q$.

The following two technical lemmas are useful to derive the minmax regret in the present setup.

Lemma 1. For all distributions $q$ on the $N$-fold Cartesian product $\mathcal{X}^N$, we have
\[
\int_M w(\theta)\, D_{\mathrm{KL}}\!\left(p^N_\theta \| q\right) dV_g = \int_M w(\theta)\, D_{\mathrm{KL}}\!\left(p^N_\theta \| m_N\right) dV_g + D_{\mathrm{KL}}(m_N \| q),
\]
where $p^N_\theta(x^N) = \prod_{i=1}^N p(x_i \mid \theta)$. Hence,
\[
\inf_q \int_M w(\theta)\, D_{\mathrm{KL}}\!\left(p^N_\theta \| q\right) dV_g = \int_M w(\theta)\, D_{\mathrm{KL}}\!\left(p^N_\theta \| m_N\right) dV_g.
\]
The proof of this lemma follows easily by noticing that $m_N$ and $q$ do not depend on $\theta$ and that $\int_M w(\theta)\, dV_g = 1$.

Lemma 2.
\[
\int_M w(\theta)\, D_{\mathrm{KL}}\!\left(p^N_\theta \| m_N\right) dV_g = -D_{\mathrm{KL}}\!\left(w \| w_{\mathrm{Jeffreys}}\right) + \frac{n}{2}\log\frac{N}{2\pi} + \log\mathrm{vol}_g(M) + o(1).
\]

Proof. The local soundness assumption on $p(x \mid \theta)$ yields localization, at a sufficiently fast rate [9], of the distribution
\[
p(\theta \mid x^N) = \frac{w(\theta)\, p(x^N \mid \theta)}{m_N(x^N)}
\]
on a neighborhood of $\theta \in M$, where $\theta$ is the value that generates the data $x^N$. The argument for localization goes as follows. Let $\{U_\alpha\}_{\alpha\in A}$ be an open covering of $M$ by coordinate neighborhoods, with $\phi_\alpha : U_\alpha \to \mathbb{R}^n$ the chart maps. Then, over $\phi_\alpha(U_\alpha)$, $\alpha \in A$, the family of distributions $\{p(\theta = \phi_\alpha^{-1}(\xi) \mid x^N)\}_{\xi\in\phi_\alpha(U_\alpha)\subset\mathbb{R}^n}$ is sound, as in the definition of Clarke and Barron [9]. It follows from their results that the distribution localizes on $\phi_\alpha(\theta)$ for some $\alpha \in A$, i.e., on an open set containing $\theta$, where $\theta$ is the value that generated the data $x^N$.

This fact allows for the use of Laplace's approximation, generalized to manifolds, on the integral defining $m_N(x^N)$. Concretely, we have
\[
m_N(x^N) = \int_M w(\theta)\, p(x^N \mid \theta)\, dV_g = w(\theta)\, p(x^N \mid \theta) \times \left(\frac{2\pi}{N}\right)^{n/2} \times \frac{\sqrt{\det g_\theta}}{\sqrt{\det I_\theta}} \times \left(1 - \frac{1}{6N}\mathrm{Tr}\!\left(I_\theta^{-1} R_\theta\right) + \frac{1}{N}c + O\left(\frac{1}{N^2}\right)\right), \tag{13}
\]
where $c$ is a constant that depends on the Hessian of $w$ and vanishes for Jeffreys' prior. For the purposes of this proof, it is enough to keep the terms up to $O(1)$. The lemma follows by applying the resulting expression for $m_N$ to $\int_M w(\theta)\, D_{\mathrm{KL}}\!\left(p^N_\theta \| m_N\right) dV_g$.
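Before stating the theorem, the Jeffreys-mixture asymptotics can be probed numerically. The following minimal sketch is our own check, under the assumption of a Bernoulli model (a one-dimensional and hence flat parameter space, so the curvature term of Eq. (5) vanishes, and an open interval rather than a closed manifold, so only the leading terms are tested); here $m_N$ is a Beta integral in closed form, $n = 1$, and $\mathrm{vol}_g(M) = \int_0^1 d\theta/\sqrt{\theta(1-\theta)} = \pi$:

```python
import numpy as np
from scipy.special import betaln

# Minimal sketch (assumption: Bernoulli model with Jeffreys prior Beta(1/2,1/2)):
# the pointwise regret log p(x^N | theta_hat) - log m_N(x^N) of the Jeffreys
# mixture should approach (n/2) log(N/(2 pi)) + log vol_g(M) as in Eq. (5),
# with n = 1 and vol_g(M) = pi.

def regret(k, N):
    theta_hat = k / N
    log_p_hat = k * np.log(theta_hat) + (N - k) * np.log(1 - theta_hat)
    # log m_N(x^N) = log B(k + 1/2, N - k + 1/2) - log B(1/2, 1/2)
    log_m = betaln(k + 0.5, N - k + 0.5) - betaln(0.5, 0.5)
    return log_p_hat - log_m

for N in [100, 10_000, 1_000_000]:
    k = int(0.3 * N)  # a sequence with empirical frequency 0.3
    predicted = 0.5 * np.log(N / (2 * np.pi)) + np.log(np.pi)
    print(N, regret(k, N), predicted)
```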
Theorem 3. Let $\{p(X \mid \theta)\}_{\theta\in M}$ be a locally sound smooth family of probability distributions over $\mathcal{X}$, where $M$ is an oriented smooth manifold of dimension $n$. Let $x^N$ be a data set generated by the probability distribution $p_N(X^N \mid \theta)$ for some $\theta \in M$. The minmax regret $R_N(x^N)$ is given by
\[
R_N(x^N) = \int_M w_{\mathrm{Jeffreys}}(\theta)\, D_{\mathrm{KL}}\!\left(p^N_\theta \| m_N^{\mathrm{Jeffreys}}\right) dV_g = \frac{n}{2}\log\left(\frac{N}{2\pi}\right) + \log\mathrm{vol}_g(M) - \log\left(\frac{\sqrt{\det(g_\theta)}}{\sqrt{\det(I(x^N,\theta))}}\right) - \frac{1}{6N}R(\theta) + O\left(\frac{1}{N^2}\right).
\]

Proof. Given the assumption that $p(x \mid \theta)$ is locally sound, we have the topology of weak convergence (i.e., the topology considered in Haussler's paper [8]). Haussler's version of the capacity theorem gives
\[
\sup_w \inf_q I(w, q) = \inf_q \sup_{\theta\in M} D_{\mathrm{KL}}\!\left(p(x^N \mid \theta)\, \|\, q(x^N)\right) =: R_N.
\]
By Lemma 1, we conclude that
\[
R_N = \sup_w I(w, m_N) = \sup_w \int_M w(\theta)\, D_{\mathrm{KL}}\!\left(p^N_\theta \| m_N\right) dV_g.
\]
By Lemma 2, it follows that the supremum is achieved for $w = w_{\mathrm{Jeffreys}}$. Finally, if in the proof of Lemma 2 we replace $w$ by $w_{\mathrm{Jeffreys}}$ and keep all the terms as in Eq. (13), the result follows.

Observe that Theorem 2 follows from this result by adding the length of the optimal code, $-\log p(x^N \mid \theta)$, and replacing $\theta$ with the unique (by assumption) estimator $\hat\theta$ in the manifold.

IV. APPLICATION TO PCA

Let $x^N = (x_1, \dots, x_N) \in \mathcal{X}^N$ be a data set, where now we take $\mathcal{X} = \mathbb{R}^d$; thus $x^N$ will be interpreted as a $d \times N$ real-valued matrix. Suppose that the empirical mean $\bar x = (1/N)\sum_{i=1}^N x_i$ vanishes. If it does not, we can always shift the data by the empirical mean so that the transformed data satisfies this requirement. Let $\Sigma = x^N (x^N)^T/N$ be the empirical covariance matrix, and let $s$ be the smallest integer such that $\Sigma \le s I_d$, where $I_d$ is the $d \times d$ identity matrix. For the data points to be independent of a unit system, we assume all the data to be an integer multiple of some fundamental precision. With this convention, all covariance matrices $\Sigma$ satisfy $I_d \le \Sigma$. Moreover, let $\Lambda = \mathrm{Tr}(\Sigma)$; then $\Lambda \le d\, s$.

Principal component analysis (PCA) is a method for dimensional reduction of the data using the information contained in the empirical covariance $\Sigma$. Namely, given the dimension $d$ of the Euclidean space where the data points live, we construct a new covariance matrix $\Sigma_r$ as follows. Let $S$ be a rotation matrix of eigenvectors of $\Sigma$, so that
\[
\Sigma = S\, \mathrm{diag}(\lambda_1, \dots, \lambda_d)\, S^t.
\]
By applying a permutation matrix if necessary, we may assume that $\lambda_i \ge \lambda_{i+1}$, $i = 1, \dots, d-1$. The idea is to simplify the representation of the data by keeping the first $m$ directions of distinguishability and simplifying the description of the others by taking an isotropic subspace where the variance is the average of the remaining ones. Explicitly,
\[
\Sigma_r = S\left(\mathrm{diag}(\lambda_1, \dots, \lambda_m) \oplus \bar\lambda\, I_{d-m}\right) S^t,
\]
where $\bar\lambda = (\Lambda - \sum_{i=1}^m \lambda_i)/(d-m)$.

The problem is to find a criterion to determine an optimal $m$. In the following, by using the results of the previous sections, we provide one natural criterion. We write $S = [v_1, \dots, v_d] = [A\ B]$, where $A = [v_1, \dots, v_m]$ and $B = [v_{m+1}, \dots, v_d]$, with $v_i \in \mathbb{R}^d$, $i = 1, \dots, d$. Let $V_A = \mathrm{span}\{v_1, \dots, v_m\}$, with $\dim V_A = m$, be the subspace generated by the first $m$ columns of $S$, and similarly $V_B = \mathrm{span}\{v_{m+1}, \dots, v_d\}$. It is clear that $V_A \oplus V_B$ is an orthogonal decomposition of $\mathbb{R}^d$. We take as our statistical model the family of Gaussian distributions centered at $0 \in \mathbb{R}^d$ whose covariance matrix assumes the form
\[
Q = A q A^t + \bar\lambda\, B B^t, \tag{14}
\]
where $A$, $B$, and $\bar\lambda$ are fixed by the data set, and $q$ is an $m \times m$ positive definite matrix with $I_d \le Q \le s I_d$, which is equivalent to $I_m \le q \le s I_m$. The corresponding probability density is
\[
p(x \mid Q) = \frac{1}{\sqrt{\det(2\pi Q)}}\exp\left(-\frac{1}{2}\, x^t Q^{-1} x\right).
\]
The induced Fisher metric is simply given by
\[
ds^2 = \frac{1}{2}\mathrm{Tr}\!\left(Q(q)^{-1}\, dQ(q)\, Q(q)^{-1}\, dQ(q)\right) = \frac{1}{2}\mathrm{Tr}\!\left(q^{-1}\, dq\, q^{-1}\, dq\right),
\]
where we used the map $q \mapsto Q(q)$ from Eq.
(14) to get to the last result (formally, this is called a pullback). Note that this is exactly the Fisher metric of the space of zero-mean Gaussian distributions in dimension $m$: the specific details of the subspace $V_A$ (or, equivalently, $V_B$) do not enter its description, and neither does $\bar\lambda$. Moreover, it can be shown [10] that the Ricci scalar of this metric is constant and equal to
\[
R = -\frac{(m+2)m(m-1)}{4}.
\]
The Riemannian volume element on the space $P_m = \{q \in \mathrm{Mat}_{m\times m}(\mathbb{R}) : q^t = q,\ q > 0\}$, equipped with the Fisher metric $g = (1/2)\mathrm{Tr}(q^{-1}dq\, q^{-1}dq)$, is given by (see Ref. [11], where a Riemannian metric differing by a constant conformal factor, $g' = 2g$, is used)
\[
dV_g(q) = 2^{-m/2}\det(q)^{-\frac{m+1}{2}} \prod_{1\le i\le j\le m} dq_{ij},
\]
where $q = [q_{ij}]_{1\le i\le j\le m}$. We wish to evaluate the volume of the compact subspace $M(s) = \{q \in P_m : I_m \le q \le s I_m\}$ with respect to this measure:
\[
\int_{M(s)} dV_g = 2^{-m/2}\int_{M(s)} \det(q)^{-\frac{m+1}{2}} \prod_{1\le i\le j\le m} dq_{ij}.
\]
Passing to eigenvalue coordinates, this integral can be evaluated in closed form, yielding the expression for $\log\mathrm{vol}_g(M(s))$ in terms of the Glaisher constant, the Barnes $G$-function, and the integral $I(s)$ quoted in the Introduction. Substituting it into Theorem 2 gives the criterion of Eq. (6): the optimal reduced dimension is the value of $m$ minimizing the resulting code length $L^*(x^N)$.
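To make the criterion concrete, the following is a minimal sketch (our own illustration, with hypothetical data and helper names). It keeps only the terms of Eq. (6) that vary with both $m$ and $N$, omitting $\log\mathrm{vol}_g(M(s))$; that constant also depends on $m$ and matters at small $N$, so this is a simplification, not the paper's exact criterion:

```python
import numpy as np

# Minimal sketch (hypothetical example, not the paper's exact criterion):
# choose the PCA cutoff m by minimizing a truncated form of Eq. (6), i.e.,
# the Gaussian code length under the reduced covariance Sigma_r, plus the
# (m(m+1)/4) log(N/(2 pi)) parameter cost and the curvature term
# (m+2)m(m-1)/(24N). The log vol_g(M(s)) term is omitted for brevity.

def code_length(eigvals, m, N):
    d = len(eigvals)
    lam_bar = eigvals[m:].mean()  # isotropic variance for the trailing block
    logdet = np.sum(np.log(eigvals[:m])) + (d - m) * np.log(lam_bar)
    # -log p(x^N | Sigma_r) = (N/2)(d log(2 pi) + log det Sigma_r + tr(Sigma_r^{-1} Sigma)),
    # and the trace equals d because Sigma_r and Sigma share eigenvectors.
    nll = 0.5 * N * (d * np.log(2 * np.pi) + logdet + d)
    penalty = (m * (m + 1) / 4) * np.log(N / (2 * np.pi))
    curvature = (m + 2) * m * (m - 1) / (24 * N)
    return nll + penalty + curvature

rng = np.random.default_rng(1)
d, N, m_true = 10, 5000, 3
scales = np.array([10.0, 6.0, 3.0] + [1.0] * (d - m_true))
x = rng.normal(size=(d, N)) * scales[:, None]       # zero-mean synthetic data
eigvals = np.sort(np.linalg.eigvalsh(x @ x.T / N))[::-1]
best_m = min(range(1, d), key=lambda m: code_length(eigvals, m, N))
print(best_m)  # expected to recover m_true = 3
```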
V. CONCLUSION

In this paper, we derived an asymptotic formula for the posterior according to Jeffreys' prior by extending Laplace's method to manifolds, which we called the geometric complexity (see Theorem 2 and compare it with Eq. (3)). Then, we provided the minmax regret for general statistical manifolds by introducing the notion of locally sound smooth families of probability distributions, which builds on Clarke and Barron's results for bounded open sets in $\mathbb{R}^n$. Finally, we gave an explicit formula for the geometric complexity of families of zero-mean Gaussian distributions with varying covariance, and applied this formula to optimal dimensional reduction in PCA.

Future work includes finding further expressions of the geometric complexity for other families of probability distributions. Another interesting avenue of research is to understand the higher-order corrections to the geometric complexity, as they might be relevant for high-dimensional data.

APPENDIX A
PROOF OF THEOREM 1

We begin by recalling Laplace's method in $\mathbb{R}^n$.

Theorem 4 (Laplace's method). Let $f \in C^2(\mathbb{R}^n)$, with $\int_{\mathbb{R}^n} e^{f(x)}\, dx < \infty$, be such that there exists a unique $x_0$ with $df(x_0) = 0$ and $\mathrm{Hess}(f)(x_0) < 0$; i.e., $x_0$ is the unique global maximum of $f$. Suppose additionally that for every $x \in \mathbb{R}^n \setminus \{x_0\}$ we have $f(x) < f(x_0)$, i.e., $f(x_0)$ is really the maximum value $f$ can attain. Then
\[
\lim_{N\to\infty} \frac{\int_{\mathbb{R}^n} e^{Nf(x)}\, dx}{e^{Nf(x_0)}\sqrt{\det\left[2\pi\left(-N\,\mathrm{Hess}(f)(x_0)\right)^{-1}\right]}} = 1.
\]

Remark 1. Another useful formulation of the above theorem, found recurrently in the literature, is given by
\[
\int_{\mathbb{R}^n} h(x)\, e^{Nf(x)}\, dx \sim h(x_0)\, e^{Nf(x_0)}\sqrt{\det\left[2\pi\left(-N\,\mathrm{Hess}(f)(x_0)\right)^{-1}\right]}, \quad \text{as } N \to \infty,
\]
for a function $h$.

Now let $(M, g)$ be a compact closed oriented Riemannian manifold of dimension $n$, and let $dV_g = \sqrt{\det(g)(x)}\, dx^1 \wedge \dots \wedge dx^n$ be the associated Riemannian volume form written in local coordinates $(x^1, \dots, x^n)$. We wish to generalize Laplace's method to integrals of the form
\[
\int_M e^{Nf}\, dV_g,
\]
for large positive $N$ and $f$ a smooth function with a non-degenerate maximum at $p$. Recall that, at $p$, there is a well-defined non-degenerate bilinear form $\mathrm{Hess}_p(f) : T_pM \times T_pM \to \mathbb{R}$ defined by
\[
\mathrm{Hess}_p(f)(X, Y) = \tilde X \cdot (\tilde Y \cdot f)(p),
\]
where $\tilde X$ and $\tilde Y$ are arbitrary extensions of $X, Y \in T_pM$ to vector fields in an open neighbourhood of $p$. We will also need the following result.

Proposition 1. Let $(x^1, \dots, x^n)$ be Riemann normal coordinates centered at some point $p$, defined in some open neighborhood $U \subset M$. Then there exists a neighborhood of $p$, $V \subset U$, such that
\[
\sqrt{\det(g(x))} = 1 - \frac{1}{6}\sum_{i,j=1}^n R_{ij}(0)\, x^i x^j + O(\|x\|^3),
\]
where $R_{ij}(0)$ are the components of the Ricci tensor with respect to the $x^i$'s.

Using Proposition 1, we can now proceed to the proof of Theorem 1.

Proof of Theorem 1. Take $\mathcal{A} = \{U_k\}_{k=1}^K$, $K < \infty$ (since $M$ is compact, we can pass to a finite subcover if necessary), an open cover of $M$ associated with positively oriented charts $\varphi_k : U_k \to \mathbb{R}^n$, and let $\{f_k\}$ denote a partition of unity subordinate to $\mathcal{A}$. Then
\[
\int_M e^{Nf}\, dV_g = \int_M \sum_{k=1}^K f_k\, e^{Nf}\, dV_g = \sum_{k=1}^K \int_{U_k} f_k\, e^{Nf}\, dV_g = \sum_{k=1}^K \int_{\varphi_k(U_k)} f_k\circ\varphi_k^{-1}\, e^{Nf\circ\varphi_k^{-1}}\, (\varphi_k^{-1})^* dV_g.
\]
The functions $f_k$, by definition, satisfy $0 \le f_k(p) \le 1$ for every $p \in M$. Fix $k \in \{1, \dots, K\}$ and suppose $p \notin U_k$. Since $M \setminus U_k$ is a closed subset of a compact space, it is compact. Therefore $f$ reaches a maximum value, say $f(p) - \eta$, on $M \setminus U_k$, for some $\eta > 0$. Therefore,
\[
0 \le \int_{U_k} f_k\, e^{Nf}\, dV_g \le \int_M e^{f}\, e^{(N-1)(f(p)-\eta)}\, dV_g \le e^{(N-1)(f(p)-\eta)}\int_M e^{f}\, dV_g.
\]
If we divide both sides by $\left(\frac{2\pi}{N}\right)^{n/2} e^{Nf(p)}\frac{\sqrt{\det(g_p)}}{\sqrt{\det(\mathrm{Hess}_p(f))}}\left[1 - \frac{1}{6N}\mathrm{tr}\!\left(\mathrm{Hess}_p(f)^{-1}R_p\right)\right]$ and take the limit $N \to \infty$, it is clear that this contribution vanishes and, thus, plays no role. For simplicity, and without loss of generality, we assume that $p$ lies in $U_k$ for a single $k$ only. Then we need to focus on
\[
\int_{U_k} f_k\, e^{Nf}\, dV_g = \int_{\varphi_k(U_k)} f_k\circ\varphi_k^{-1}\, e^{Nf\circ\varphi_k^{-1}}\, (\varphi_k^{-1})^* dV_g.
\]
We assume, without loss of generality, $\varphi_k = (x^1, \dots, x^n)$ to be a normal coordinate system centered at $p$, and, by abuse of notation, we denote $f\circ\varphi_k^{-1}$ simply by $f$ and $f_k\circ\varphi_k^{-1}$ simply by $f_k$. The image $\varphi_k(U_k)$ is an open set in $\mathbb{R}^n$, which we denote by $V$. We are then dealing with the integral
\[
\int_V f_k(x)\, e^{Nf(x)}\sqrt{\det g(x)}\, dx.
\]
We can take a smaller open subset $W \subset V$, with $\varphi_k(p) = 0 \in W$, where $f_k|_W = 1$. Notice that over $V \setminus W$, since the maximum of $f$ is attained at $0 \in W$, we have, quite similarly to what we did above,
\[
\int_{V\setminus W} f_k(x)\, e^{Nf(x)}\sqrt{\det g(x)}\, dx \le \int_V f_k(x)\, e^{f(x)}\, e^{(N-1)(f(0)-\eta)}\sqrt{\det g(x)}\, dx = e^{(N-1)(f(0)-\eta)}\int_V f_k(x)\, e^{f(x)}\sqrt{\det g(x)}\, dx,
\]
where $\eta > 0$ exists since $f(p) > f(p')$ for all $p' \ne p$ in $M$, and the inequality follows from the integral being positive. When we divide both sides by $\left(\frac{2\pi}{N}\right)^{n/2} e^{Nf(p)}\frac{\sqrt{\det(g_p)}}{\sqrt{\det(\mathrm{Hess}_p(f))}}\left[1 - \frac{1}{6N}\mathrm{tr}\!\left(\mathrm{Hess}_p(f)^{-1}R_p\right)\right]$, it is clear that this term goes to zero in the limit $N \to \infty$. It is then enough to consider the integral
\[
\int_{B_\delta(0)} e^{Nf(x)}\sqrt{\det g(x)}\, dx,
\]
where we have replaced $W$ by a ball $B_\delta(0)$ containing $\varphi_k(p) = 0$. Next, by choosing $\delta$ sufficiently small, we can use Proposition 1 to write
\[
\sqrt{\det g(x)} = 1 - \frac{1}{6}\sum_{i,j=1}^n R_{ij}(0)\, x^i x^j + O(\|x\|^3).
\]
By the identification $T_pM \cong \mathbb{R}^n$ provided by normal coordinates, this can be reformulated as
\[
\sqrt{\det g(x)} = 1 - \frac{1}{6}\, x^t R_p\, x + \|x\|^3 g_3(x),
\]
where we see $R_p$ as an $n \times n$ matrix and $g_3(x)$ is some function that remains bounded as $x \to 0$. By compactness of $\overline{B_\delta(0)}$, there exists a constant $C > 0$ such that
\[
\left|\sqrt{\det g(x)} - \left(1 - \frac{1}{6}\, x^t R_p\, x\right)\right| \le C\, \|x\|^3.
\]
We can replace $\|x\|^3$ on the right-hand side by the absolute value of an arbitrary polynomial in the $x^i$'s whose lowest-order term is of degree $3$, say $P(x) = \sum_{i_1,i_2,i_3=1}^n a_{i_1 i_2 i_3}\, x^{i_1} x^{i_2} x^{i_3} + \dots$, with an appropriate new choice of $C$. Therefore,
\[
\left|\int_{B_\delta(0)} e^{Nf(x)}\sqrt{\det g(x)}\, dx - \int_{B_\delta(0)} e^{Nf(x)}\left(1 - \frac{1}{6}x^t R_p x\right) dx\right| \le \int_{B_\delta(0)} e^{Nf(x)}\left|\sqrt{\det g(x)} - \left(1 - \frac{1}{6}x^t R_p x\right)\right| dx \le C\int_{B_\delta(0)} e^{Nf(x)}\, |P(x)|\, dx.
\]
Next, we let $A = -\mathrm{Hess}_p(f)$ and perform the change of variables $y = \sqrt{N}\, A^{1/2} x =: F(x)$. Notice that $F$, as defined, is a diffeomorphism of open sets in $\mathbb{R}^n$, where we see $A$ as a linear endomorphism of $\mathbb{R}^n$ using the orthogonal normal coordinates. It is clear that, as $N \to \infty$, the image under $F$ of $B_\delta(0)$ becomes $\mathbb{R}^n$. We then have
\[
C\int_{B_\delta(0)} e^{Nf(x)}\, |P(x)|\, dx = C\det\!\left(N^{-1/2}A^{-1/2}\right)\int_{F(B_\delta(0))} e^{Nf\circ F^{-1}(y)}\, |P\circ F^{-1}(y)|\, dy.
\]
Now $Nf\circ F^{-1}(y) = Nf(0) - \frac{1}{2}\|y\|^2 + O(N^{-1/2}\|y\|^3)$. As $N$ grows larger, all we need to do is the integral over $\mathbb{R}^n$ of $e^{Nf\circ F^{-1}}|P\circ F^{-1}|$, which, by Laplace's approximation in $\mathbb{R}^n$ (see Remark 1), is proportional to evaluating $|P|$ at $0$, which yields zero. Therefore,
\[
\lim_{N\to\infty}\int_{B_\delta(0)} e^{Nf(x)}\sqrt{\det g(x)}\, dx = \lim_{N\to\infty}\int_{B_\delta(0)} e^{Nf(x)}\left(1 - \frac{1}{6}x^t R_p x\right) dx.
\]
Moreover, for finite $N$,
\[
\int_{B_\delta(0)} e^{Nf(x)}\left(1 - \frac{1}{6}x^t R_p x\right) dx = \det\!\left(N^{-1/2}A^{-1/2}\right) e^{Nf(0)} \times \int_{F(B_\delta(0))} e^{N\left(f\circ F^{-1}(y) - f(0)\right)}\left(1 - \frac{1}{6N}\, y^t A^{-1/2} R_p A^{-1/2}\, y\right) dy.
\]
In the large $N$ limit, we just need to evaluate the Gaussian integral, yielding
\[
\left(\frac{2\pi}{N}\right)^{n/2}\frac{e^{Nf(0)}}{\sqrt{\det(\mathrm{Hess}_p(f))}}\left[1 - \frac{1}{6N}\mathrm{Tr}\!\left(\mathrm{Hess}_p(f)^{-1}R_p\right)\right].
\]
We then get
\[
\lim_{N\to\infty}\frac{\int_M e^{Nf}\, dV_g}{\left(\frac{2\pi}{N}\right)^{n/2}\frac{e^{Nf(p)}}{\sqrt{\det(\mathrm{Hess}_p(f))}}\left[1 - \frac{1}{6N}\mathrm{Tr}\!\left(\mathrm{Hess}_p(f)^{-1}R_p\right)\right]} = 1.
\]
Note that the identification of $-\mathrm{Hess}_p(f)$ as a linear map implies the use of the metric $g_p$ on $T_pM$, which in the orthogonal normal coordinates is just the identity matrix. Therefore, the invariant form of $\sqrt{\det(\mathrm{Hess}_p(f))}$ is $\sqrt{\det(\mathrm{Hess}_p(f))}/\sqrt{\det(g_p)}$, where now $\mathrm{Hess}_p(f)$ and $g_p$ are understood as the bilinear forms $\mathrm{Hess}_p(f)$ and $g_p$ expressed as matrices in arbitrary, but of course the same, coordinates. This yields the final result:
\[
\lim_{N\to\infty}\frac{\int_M e^{Nf}\, dV_g}{\left(\frac{2\pi}{N}\right)^{n/2} e^{Nf(p)}\frac{\sqrt{\det(g_p)}}{\sqrt{\det(\mathrm{Hess}_p(f))}}\left[1 - \frac{1}{6N}\mathrm{Tr}\!\left(\mathrm{Hess}_p(f)^{-1}R_p\right)\right]} = 1.
\]

Remark 2. One can extend these results to the paracompact case, i.e., $(M, g)$ an arbitrary oriented Riemannian manifold without boundary, with the additional assumptions that $\int_M e^{Nf}\, dV_g < \infty$ for some finite $N$ and that $f(p)$ is the maximum value $f$ attains on $M$ (assumptions which are immediate for compact $M$).
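The expansion of Proposition 1, on which the whole proof rests, can be checked directly on the simplest curved example. The following minimal sketch (our own illustration, not from the paper) does so on the unit 2-sphere:

```python
import numpy as np

# Minimal sketch (not from the paper): check Proposition 1 on the unit
# 2-sphere. In Riemann normal coordinates centered at the north pole,
# sqrt(det g) = sin(r)/r with r = ||x||, while the Ricci tensor is
# R_ij = delta_ij, so Proposition 1 predicts 1 - ||x||^2/6 + O(||x||^3).

for r in [0.5, 0.1, 0.01]:
    exact = np.sin(r) / r
    predicted = 1 - r**2 / 6
    print(r, exact, predicted, abs(exact - predicted))  # error ~ r^4/120
```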
ACKNOWLEDGMENT

BM and PM thank the support from SQIG – Security and Quantum Information Group. BM, PM and AC thank the Fundação para a Ciência e a Tecnologia (FCT) project UID/EEA/50008/2020 and European funds, namely the H2020 project SPARTA. BM, PM and AC acknowledge the project PREDICT PTDC/CCI-CIF/29877/2017, funded by FCT. We also thank J. Mourão and J. P. Nunes for valuable discussions concerning the Laplace formula in the context of manifolds, and we acknowledge discussions with colleagues from the Electrical Engineering department concerning the applications of manifolds to Information Theory.

REFERENCES

[1] J. Rissanen. Fisher information and stochastic complexity. IEEE Transactions on Information Theory, 42(1):40–47, 1996.
[2] A. Suzuki and K. Yamanishi. Exact calculation of normalized maximum likelihood code length using Fourier analysis. In 2018 IEEE International Symposium on Information Theory (ISIT), pages 1211–1215, 2018.
[3] S. Amari and H. Nagaoka. Methods of Information Geometry, volume 191. American Mathematical Society, 2007.
[4] S. Amari. Differential-Geometrical Methods in Statistics, volume 28. Springer Science & Business Media, 2012.
[5] H. Jeffreys. An invariant form for the prior probability in estimation problems. Proceedings of the Royal Society of London. Series A. Mathematical and Physical Sciences, 186(1007):453–461, 1946.
[6] B. Clarke and A. Barron. Jeffreys' prior is asymptotically least favorable under entropy risk. Journal of Statistical Planning and Inference, 41:37–60, 1994.
[7] V. Balasubramanian. MDL, Bayesian inference, and the geometry of the space of probability distributions. In Advances in Minimum Description Length: Theory and Applications, pages 81–98. MIT Press, 2005.
[8] D. Haussler. A general minimax result for relative entropy. IEEE Transactions on Information Theory, 43(4):1276–1280, 1997.
[9] B. Clarke and A. Barron. Information-theoretic asymptotics of Bayes methods. IEEE Transactions on Information Theory, 36(3):453–471, 1990.
[10] A. Dolcetti and D. Pertici. Differential properties of spaces of symmetric real matrices. Rend. Semin. Mat. Univ. Politec. Torino, 77(1):25–43, 2019.
[11] A. Terras. Harmonic Analysis on Symmetric Spaces and Applications II. Springer Science & Business Media, 2012.
[12] S. Said, L. Bombrun, Y. Berthoumieu, and J. H. Manton. Riemannian Gaussian distributions on the space of symmetric positive definite matrices.