A geometer's view of the Cramér-Rao bound on estimator variance
ANTHONY D. BLAOM
Abstract.
The classical Cramér-Rao inequality gives a lower bound for the variance of an unbiased estimator of an unknown parameter in some statistical model of a random process. In this note we rewrite the statement and proof of the bound using contemporary geometric language.
The Cramér-Rao inequality gives a lower bound for the variance of an unbiased estimator of a parameter in some statistical model of a random process. Below is a restatement and proof in sympathy with the underlying geometry of the problem. While our presentation is mildly novel, its mathematical content is very well known. Assuming some very basic familiarity with Riemannian geometry, and that one has reformulated the bound appropriately, the essential parts of the proof boil down to half a dozen lines. For completeness we explain the connection with log-likelihoods, and show how to recover the more usual statement in terms of the Fisher information matrix. We thank Jakob Ströhl for helpful feedback.
1. The Cramér-Rao inequality
The mathematical setting of statistical inference consists of: (i) a smooth manifold $X$, the sample space, which we will suppose is finite-dimensional; and (ii) a set $P$ of probability measures on $X$, called the space of models or parameters. The objective is to make inferences about an unknown model $p \in P$, given one or more observations $x \in X$, drawn at random from $X$ according to $p$.

Under certain regularity assumptions detailed below, this data suffices to make $P$ into a Riemannian manifold, whose geometric properties are related to problems of statistical inference. It seems that Calyampudi Radhakrishna Rao was the first to articulate this connection between geometry and statistics [2].

In formulating the Cramér-Rao inequality, we suppose that $P$ is a smooth finite-dimensional manifold (i.e., we are doing so-called parametric inference). We say that $P$ is regular if the probability measures $p \in P$ are all Borel measures on $X$, and if there exists some positive Borel measure $\mu$ on $X$, hereafter called a reference measure, such that

$$ p = f_p\,\mu, \tag{1} $$

for some collection of smooth functions $f_p$, $p \in P$, on $X$. The definition of regularity furthermore requires that we may arrange $(x, p) \mapsto f_p(x)$ to be jointly smooth. In this note 'smooth' means $C^\infty$.

An unbiased estimator of some smooth function $\theta \colon P \to \mathbb{R}$ (the "parameter") is a smooth function $\hat\theta \colon X \to \mathbb{R}$ whose expectation under each $p \in P$ is precisely $\theta(p)$:

$$ \theta(p) = E(\hat\theta \mid p) := \int_{x \in X} \hat\theta(x)\,dp; \qquad p \in P. \tag{2} $$

Theorem 1 (Rao-Cramér [2, 1]). The space of models $P$ determines a natural Riemannian metric on $P$, known as the Fisher-Rao metric, with respect to which there is the following lower bound on the variance of an unbiased estimator $\hat\theta$ of $\theta$:

$$ V(\hat\theta \mid p) \geq |\nabla\theta(p)|^2; \qquad p \in P. \tag{3} $$

More informally: The parameter space $P$ comes equipped with a natural way of measuring distances, leading to a well-defined notion of steepest rate of ascent for any function $\theta$ on $P$. The square of this rate is precisely the lower bound for the variance of an unbiased estimator $\hat\theta$.
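Before proceeding, a minimal numerical sketch may help fix ideas. It assumes the hypothetical one-parameter family $P = \{N(m, 1) : m \in \mathbb{R}\}$ with parameter $\theta(p) = m$ and estimator $\hat\theta(x) = x$; for this family the bound (3) works out to $\sigma^2 = 1$ (by the coordinate computation of Section 5), and is attained with equality:

```python
# Monte Carlo sketch of unbiasedness (2) and the bound (3), assuming the
# hypothetical family P = {N(m, 1)} with theta(p) = m and theta_hat(x) = x.
# For this family |grad theta|^2 = sigma^2 = 1, and the bound is attained.
import numpy as np

rng = np.random.default_rng(0)
m, sigma, n = 1.7, 1.0, 10**6

x = rng.normal(m, sigma, size=n)      # draws from p = N(m, sigma^2)
theta_hat = x                         # the estimator applied to each draw

print("E(theta_hat | p) ~", theta_hat.mean())  # ~ m  (unbiasedness, eq. (2))
print("V(theta_hat | p) ~", theta_hat.var())   # ~ 1 = |grad theta|^2
```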
2. Observation-dependent one-forms on the space of models

It is fundamental to the present geometric point of view that each observation $x \in X$ determines a one-form $\lambda_x$ on the space $P$ of models in the following way: Let $v \in T_pP$ be a tangent vector, understood as the derivative of some path $t \mapsto p_t \in P$ through $p$:

$$ v = \frac{d}{dt}\,p_t \Big|_{t=0}. \tag{4} $$

Then, recalling that each $p_t$ is a probability measure on $X$ (and $P$ is regular), we may write $p_t = g_t\,p$, for some smooth function $g_t \colon X \to \mathbb{R}$, and define

$$ \lambda_x(v) = \frac{d}{dt}\,g_t(x) \Big|_{t=0}. $$
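To make the definition concrete, the following sketch evaluates $\lambda_x(v)$ for the hypothetical path $p_t = N(m + t, 1)$ through $p = N(m, 1)$, where $g_t(x) = f_{p_t}(x)/f_p(x)$; a central difference at $t = 0$ recovers the closed form $\lambda_x(v) = x - m$ for this path:

```python
# Sketch of the one-form lambda_x for the hypothetical Gaussian family:
# along the path p_t = N(m + t, 1) the density ratio is
# g_t(x) = f_{p_t}(x) / f_p(x), and d/dt g_t(x) at t = 0 equals x - m.
import numpy as np

def g(t, x, m):
    """Density ratio g_t(x) = f_{p_t}(x) / f_p(x) for p_t = N(m + t, 1)."""
    return np.exp(-(x - m - t)**2 / 2 + (x - m)**2 / 2)

m, x, h = 0.3, 1.25, 1e-6
lam = (g(h, x, m) - g(-h, x, m)) / (2 * h)   # d/dt g_t(x) at t = 0
print(lam, "vs closed form", x - m)
```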
The proof of the following is straightforward:

Lemma 2. $E(\lambda_x(v) \mid p) = 0$ for all $p \in P$ and $v \in T_pP$.

(To see this, differentiate the identity $\int_{x \in X} g_t(x)\,dp = 1$ at $t = 0$.)

Now if $v \in T_pP$ is a tangent vector as in (4), and if (2) holds, then

$$ d\theta(v) = \frac{d}{dt} \int_{x \in X} \hat\theta(x)\,dp_t \Big|_{t=0} = \frac{d}{dt} \int_{x \in X} \hat\theta(x)\,g_t(x)\,dp \Big|_{t=0} = \int_{x \in X} \hat\theta(x)\,\lambda_x(v)\,dp, $$

giving us:

Proposition 3. For any unbiased estimator $\hat\theta \colon X \to \mathbb{R}$ of $\theta \colon P \to \mathbb{R}$, one has

$$ d\theta(v) = \int_{x \in X} \hat\theta(x)\,\lambda_x(v)\,dp; \qquad v \in T_pP. $$
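Here is a Monte Carlo check of both statements in the running hypothetical example, where $\lambda_x(v) = x - m$ for the unit-speed path $t \mapsto N(m + t, 1)$, $\hat\theta(x) = x$, and $\theta(p) = m$, so that $d\theta(v) = 1$:

```python
# Monte Carlo check of Lemma 2 and Proposition 3 in the hypothetical example:
# lambda_x(v) = x - m, theta_hat(x) = x, theta(p) = m, d theta(v) = 1.
import numpy as np

rng = np.random.default_rng(1)
m, n = 0.3, 10**6
x = rng.normal(m, 1.0, size=n)

lam = x - m                           # lambda_x(v) evaluated on the sample
print("E(lambda_x(v) | p) ~", lam.mean())                 # ~ 0   (Lemma 2)
print("int theta_hat lambda_x(v) dp ~", (x * lam).mean()) # ~ 1 = d theta(v)
```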
3. Log-likelihoods

As an aside, we shall now see that the observation-dependent one-forms $\lambda_x$ are exact, and at the same time give their more usual interpretation in terms of log-likelihoods. Choosing a reference measure $\mu$, and defining $f_p$ as in (1), one defines the log-likelihood function $(x, p) \mapsto L_x(p) \colon X \times P \to \mathbb{R}$ by

$$ L_x(p) = \log f_p(x). $$

While the log-likelihood depends on the reference measure $\mu$, its derivative $dL_x$ (a one-form on $P$) does not, for in fact:

Lemma 4. $dL_x = \lambda_x$.

Proof.
With a reference measure fixed as in (1), we have, along a path $t \mapsto p_t$, $p_t = g_t\,p$, where $g_t = f_{p_t}/f_p$. Applying the definition of $\lambda_x$, we compute

$$ \lambda_x\Big(\frac{d}{dt}\,p_t \Big|_{t=0}\Big) = \frac{d}{dt}\,\frac{f_{p_t}(x)}{f_p(x)} \Big|_{t=0} = \frac{d}{dt}\,\frac{e^{L_x(p_t)}}{e^{L_x(p)}} \Big|_{t=0} = dL_x\Big(\frac{d}{dt}\,p_t \Big|_{t=0}\Big). \qquad \Box $$

In particular, local maxima of $L_x$ (points of so-called maximum likelihood) do not depend on the reference measure.
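The following sketch illustrates Lemma 4 and the independence claim for the hypothetical family $N(m, 1)$: the log-likelihoods computed with respect to Lebesgue measure $dx$ and with respect to the tilted measure $e^x\,dx$ differ by the $m$-independent term $-x$, and so have the same derivative $\lambda_x(v) = x - m$:

```python
# Sketch of dL_x = lambda_x and its independence of the reference measure,
# for the hypothetical family N(m, 1).  With Lebesgue measure dx,
# L_x(m) = -(x - m)^2/2 - log(sqrt(2 pi)); with the tilted measure e^x dx
# the density is f_p(x) e^{-x}, shifting L_x by the m-independent term -x.
import numpy as np

def L_lebesgue(m, x):
    return -(x - m)**2 / 2 - np.log(np.sqrt(2 * np.pi))

def L_tilted(m, x):                   # reference measure e^x dx
    return L_lebesgue(m, x) - x

m, x, h = 0.3, 1.25, 1e-6
for L in (L_lebesgue, L_tilted):
    dL = (L(m + h, x) - L(m - h, x)) / (2 * h)   # derivative in m
    print(dL, "vs lambda_x =", x - m)
```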
4. The metric and derivation of the bound

With the observation-dependent one-forms in hand, we may now define the Fisher-Rao Riemannian metric on $P$. It is given by

$$ I(u, v) = \int_{x \in X} \lambda_x(u)\,\lambda_x(v)\,dp, $$

for $u, v \in T_pP$. Now that we have a metric, it is natural to consider $\nabla\theta$ instead of $d\theta$ in Proposition 3. By the definition of gradient, we have

$$ |\nabla\theta(p)|^2 = d\theta(\nabla\theta(p)). $$

Applying this equation and Proposition 3 with $v = \nabla\theta(p)$ gives

$$ |\nabla\theta(p)|^2 = \int_{x \in X} \hat\theta(x)\,\lambda_x(\nabla\theta(p))\,dp = \int_{x \in X} \big(\hat\theta(x) - \theta(p)\big)\,\lambda_x(\nabla\theta(p))\,dp. $$

The second equality holds because $\int_{x \in X} \lambda_x(\nabla\theta(p))\,dp = 0$, by Lemma 2. Applying the Cauchy-Schwarz inequality to the right-hand side gives

$$ |\nabla\theta(p)|^2 \leq \left(\int_{x \in X} \big(\hat\theta(x) - \theta(p)\big)^2\,dp\right)^{1/2} \left(\int_{x \in X} \lambda_x(\nabla\theta(p))\,\lambda_x(\nabla\theta(p))\,dp\right)^{1/2} = \sqrt{V(\hat\theta \mid p)}\,\sqrt{I(\nabla\theta(p), \nabla\theta(p))} = \sqrt{V(\hat\theta \mid p)}\,|\nabla\theta(p)|. $$

Dividing both sides by $|\nabla\theta(p)|$ and squaring, the Cramér-Rao bound now follows.
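A sketch of the metric and the bound in action, again for the hypothetical family $N(m, 1)$, now with $\theta(p) = m^2$ and the unbiased estimator $\hat\theta(x) = x^2 - 1$ (unbiased since $E(x^2 \mid p) = m^2 + 1$). In one dimension $|\nabla\theta|^2 = (d\theta/dm)^2 / I(v, v)$; the exact values are $V = 4m^2 + 2$ versus the bound $4m^2$, so the inequality is strict here:

```python
# Monte Carlo sketch of the Fisher-Rao metric and the bound for the
# hypothetical model N(m, 1), with theta(p) = m^2 and the unbiased
# estimator theta_hat(x) = x^2 - 1.  Exact: V = 4 m^2 + 2 > 4 m^2 = bound.
import numpy as np

rng = np.random.default_rng(2)
m, n = 0.8, 10**6
x = rng.normal(m, 1.0, size=n)

lam = x - m                              # lambda_x(v), v = d/dm
I_vv = (lam**2).mean()                   # Fisher-Rao metric I(v, v) ~ 1
grad_sq = (2 * m)**2 / I_vv              # |grad theta|^2 for theta = m^2

theta_hat = x**2 - 1
print("V(theta_hat | p) ~", theta_hat.var())  # ~ 4 m^2 + 2
print("|grad theta|^2   ~", grad_sq)          # ~ 4 m^2
```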
5. The bound in terms of Fisher information
Theorem 1 is a coordinate-free formulation. To recover the more usual statement of the Cramér-Rao bound, let $\phi^1, \dots, \phi^k$ be local coordinates on $P$, the space of models on $X$, and $\frac{\partial}{\partial\phi^1}, \dots, \frac{\partial}{\partial\phi^k}$ the corresponding vector fields on $P$, characterised by

$$ d\phi^i\Big(\frac{\partial}{\partial\phi^j}\Big) = \delta^i_j. $$

Here $\delta^i_j = 1$ if $i = j$ and is zero otherwise. Applying Lemma 4, the coordinate representation $I_{ij}$ of the Fisher-Rao metric $I$ is given by

$$ I_{ij}(p) = I\Big(\frac{\partial}{\partial\phi^i}(p), \frac{\partial}{\partial\phi^j}(p)\Big) = \int_{x \in X} \lambda_x\Big(\frac{\partial}{\partial\phi^i}(p)\Big)\,\lambda_x\Big(\frac{\partial}{\partial\phi^j}(p)\Big)\,dp = \int_{x \in X} \frac{\partial L_x}{\partial\phi^i}(p)\,\frac{\partial L_x}{\partial\phi^j}(p)\,dp, $$

where $L_x(p) = \log f_p(x)$ is the log-likelihood. In statistics $I_{ij}$ is known as the Fisher information matrix.

For the moment we continue to let $\theta$ denote an arbitrary function on $P$, and $\hat\theta$ an unbiased estimator. Now $\nabla\theta$ is the gradient of $\theta$ with respect to the metric $I$. Since the coordinate representation of the metric is $I_{ij}$, a standard computation gives the local coordinate formula

$$ \nabla\theta = \sum_{i,j} I^{ij}\,\frac{\partial\theta}{\partial\phi^i}\,\frac{\partial}{\partial\phi^j}, $$

where $\{I^{ij}\}$ is the inverse of $\{I_{ij}\}$. Regarding the lower bound in Theorem 1, we compute

$$ |\nabla\theta(p)|^2 = I(\nabla\theta(p), \nabla\theta(p)) = \sum_{i,j,m,n} I\Big(I^{ij}(p)\,\frac{\partial\theta}{\partial\phi^i}(p)\,\frac{\partial}{\partial\phi^j}(p),\; I^{mn}(p)\,\frac{\partial\theta}{\partial\phi^m}(p)\,\frac{\partial}{\partial\phi^n}(p)\Big) = \sum_{i,j,m,n} I^{ij}(p)\,I^{mn}(p)\,\frac{\partial\theta}{\partial\phi^i}(p)\,\frac{\partial\theta}{\partial\phi^m}(p)\,I\Big(\frac{\partial}{\partial\phi^j}(p), \frac{\partial}{\partial\phi^n}(p)\Big) = \sum_{i,j,m,n} I^{ij}(p)\,I_{jn}(p)\,I^{mn}(p)\,\frac{\partial\theta}{\partial\phi^i}(p)\,\frac{\partial\theta}{\partial\phi^m}(p) = \sum_{i,m,n} \delta^n_i\,I^{mn}(p)\,\frac{\partial\theta}{\partial\phi^i}(p)\,\frac{\partial\theta}{\partial\phi^m}(p) = \sum_{i,m} I^{mi}(p)\,\frac{\partial\theta}{\partial\phi^i}(p)\,\frac{\partial\theta}{\partial\phi^m}(p). $$

Theorem 1 now reads

$$ V(\hat\theta \mid p) \geq \sum_{i,m} I^{mi}(p)\,\frac{\partial\theta}{\partial\phi^i}(p)\,\frac{\partial\theta}{\partial\phi^m}(p). $$
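Before specialising $\theta$ to a coordinate function, here is a sketch of these formulas for a hypothetical two-parameter family $p = N(\mu, \sigma^2)$, with coordinates $(\phi^1, \phi^2) = (\mu, \sigma)$: the scores $\partial L_x/\partial\mu = (x - \mu)/\sigma^2$ and $\partial L_x/\partial\sigma = (x - \mu)^2/\sigma^3 - 1/\sigma$ are averaged to estimate $I_{ij}$, and the bound for $\theta = \mu$ is then read off from the inverse matrix:

```python
# Sketch of the coordinate formulas for the hypothetical two-parameter
# family N(mu, sigma^2) with coordinates (phi^1, phi^2) = (mu, sigma).
# Exact Fisher information: [[1/sigma^2, 0], [0, 2/sigma^2]].
import numpy as np

rng = np.random.default_rng(3)
mu, sigma, n = 0.0, 2.0, 10**6
x = rng.normal(mu, sigma, size=n)

score = np.stack([(x - mu) / sigma**2,
                  (x - mu)**2 / sigma**3 - 1 / sigma])   # shape (2, n)
I = score @ score.T / n                   # Fisher information matrix I_ij

dtheta = np.array([1.0, 0.0])             # theta = phi^1 = mu
bound = dtheta @ np.linalg.inv(I) @ dtheta  # = I^{11} ~ sigma^2
print(I)
print("bound on V(mu_hat | p):", bound)   # ~ sigma^2 = 4
```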
In particular, if we suppose $\theta$ is one of the coordinate functions, say $\theta = \phi^j$, then we obtain

$$ V(\hat\phi^j \mid p) \geq \sum_{i,m} I^{mi}(p)\,\frac{\partial\phi^j}{\partial\phi^i}(p)\,\frac{\partial\phi^j}{\partial\phi^m}(p) = \sum_{i,m} I^{mi}(p)\,\delta^j_i\,\delta^j_m = I^{jj}(p), $$

the version of the Cramér-Rao bound to be found in statistics textbooks.
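As a closing sanity check, assume $n$ i.i.d. draws, so that the sample space becomes the product $X^n$ and the information matrix scales to $n\,I_{ij}$ (a standard fact not derived in this note). For $N(\mu, \sigma^2)$ the bound $I^{jj}$ for the coordinate $\mu$ is then $\sigma^2/n$, attained by the sample mean:

```python
# Textbook instance of the final display, assuming n i.i.d. draws: the
# information matrix scales to n * I_ij, so the bound for mu is sigma^2 / n,
# and the sample mean attains it.
import numpy as np

rng = np.random.default_rng(4)
mu, sigma, n, reps = 0.0, 2.0, 50, 10**5

means = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)  # mu_hat per sample
print("V(mu_hat) ~", means.var())        # ~ sigma^2 / n = 0.08
print("bound I^{jj} =", sigma**2 / n)    # the Cramer-Rao bound
```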
References

[1] Harald Cramér. Mathematical Methods of Statistics. Princeton Mathematical Series, vol. 9. Princeton University Press, Princeton, N.J., 1946.
[2] C. Radhakrishna Rao. Information and the accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math. Soc., 37:81–91, 1945.