Wrong Priors
Carlos C. Rodríguez http://omega.albany.edu:8008/
Department of Mathematics and Statistics, The University at Albany, SUNY, Albany, NY 12222, USA
Abstract.
All priors are not created equal. There are right and there are wrong priors. That is the main conclusion of this contribution. I use a cooked-up example designed to create drama, and a typical textbook example, to show the pervasiveness of wrong priors in standard statistical practice.
Keywords:
Information Geometry, Volume Prior, Bayesian Inference, Bayesian Information Geometry, Ignorance Priors
INTRODUCTION
The information geometry available in regular statistical models can be used to build objectively meaningful prior distributions. When the information volume of the model is finite, the uniform distribution over the model manifold coincides with Jeffreys invariant rule. A simple example in two dimensions shows that the popular naive diffuse prior over the parameters of this model is in fact a wrong prior, requiring more than ten thousand observations to match Jeffreys rule with only 100 samples. Bayesian inference is suffering from an epidemic of wrong priors, and to prove it I consider standard simple logistic regression with naive diffuse priors and with uniform priors over the manifold. The results are obviously less dramatic but similar to the previous cooked-up example. When the information volume of the model is infinite, the uniform distribution over the model does not exist. However, the available geometry can still be exploited, and it provides a semiparametric family of invariant objectively ignorant priors.
A SIMPLE EXAMPLE
Consider bivariate normals with unit covariance matrix and mean vector restricted to a region of the Euclidean plane. Specifically, for given values of a and b the experiment consists of choosing (x, y) at random on the Euclidean plane with
$$ x = \exp\left(-(a^2 + b^2)\right) + \epsilon_1, \qquad y = \exp\left(-\left((a-c)^2 + (b-c)^2\right)\right) + \epsilon_2 . $$
The unknown parameters are a, b ∈ R; c is a fixed known constant, and ε₁, ε₂ are independent standard normals. The problem consists of learning the parameters θ = (a, b) from n independent observations (x₁, y₁), …, (xₙ, yₙ). We want to compare the performance of two priors on (a, b): the naive "ignorant" prior π₁ that takes a and b independently from diffuse normal distributions, and the uniform prior over the model manifold, π₂, given by
$$ \pi_2(a, b) = \frac{|a-b|}{Z}\, \exp\left(-2(a^2 + b^2) + 2c(a + b)\right) \qquad (1) $$
where Z is a finite normalization constant. Equation (1) is just the normalized volume form of the model, computed trivially as $\sqrt{\det g}/Z$ with g the information matrix (minus the expected values of the second derivatives of the log likelihood). This prior puts positive mass on the entire (a, b) plane (except on the line a = b, but that region has measure 0), yet it is very far from uniform, as shown in figure 1. Notice also that there are two peaks, because the likelihood is invariant under the exchange of a with b. The volume prior respects this symmetry.

FIGURE 1. The prior π₂ is the true uniform over the model. Notice that the picture is not drawn to scale; the actual peaks should be more than 40 times taller than the ones displayed.
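Since equation (1) is just $\sqrt{\det g}$ normalized, the prior can also be evaluated numerically from the Jacobian of the mean map. A minimal sketch in R (the language of the MCMC package used later); the value c = 0.01 is only an assumed placeholder, not the constant used in the paper:

```r
## Minimal sketch (R): the volume prior (1) as sqrt(det g) = |det J|, where J is
## the Jacobian of the mean map (a, b) -> (u, v), since the covariance is the
## identity.  The value c = 0.01 is an assumed placeholder.
volume_prior_ab <- function(a, b, c = 0.01) {
  u <- exp(-(a^2 + b^2))
  v <- exp(-((a - c)^2 + (b - c)^2))
  J <- matrix(c(-2 * a * u,       -2 * b * u,
                -2 * (a - c) * v, -2 * (b - c) * v),
              nrow = 2, byrow = TRUE)
  abs(det(J))                               # unnormalized pi_2(a, b)
}

## Unnormalized density on a grid, to reproduce the shape of figure 1
a <- seq(-2, 2, length.out = 201)
b <- seq(-2, 2, length.out = 201)
dens <- outer(a, b, Vectorize(volume_prior_ab))
persp(a, b, dens, theta = 30, phi = 30)     # two peaks, symmetric under a <-> b
```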
FIGURE 2. The right side shows the posteriors for the naive flat prior; the left side shows the posteriors computed with the true uniform over the model.
Posterior Inference
With the help of the free MCMC package [1, 2] it only takes a few lines of code to realize the inadequacy of the naive prior for this example. The results of the MCMC simulations are summarized in figure 2. The true parameters (a, b) were fixed and independent samples were drawn from the distribution with those parameters. With the naive flat prior, the posteriors after observing 100, 500 and 1000 samples were essentially identical to the prior, i.e. nothing was learned from the data. Only with 10000 observations was the program able to learn the values of the true parameters. In contrast, after just 100 observations the posterior with the true uniform prior estimates the parameters very precisely: one extra order of magnitude of accuracy over the posterior with the flat prior, obtained with two fewer orders of magnitude of data!
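The comparison can be reproduced along the following lines with the general-purpose Metropolis sampler MCMCmetrop1R from MCMCpack [1, 2]. This is only a sketch under stated assumptions: the constant c, the "true" (a, b), and the diffuse prior's standard deviation are placeholders, not the values used in the paper.

```r
## Sketch (R): posterior sampling under the two priors with MCMCpack [1, 2].
## Placeholders (not from the paper): c = 0.01, true (a, b) = (0.5, -0.5),
## naive prior = independent N(0, 10^2) on a and b.
library(MCMCpack)

c0 <- 0.01
mu <- function(th) c(exp(-(th[1]^2 + th[2]^2)),
                     exp(-((th[1] - c0)^2 + (th[2] - c0)^2)))
loglik <- function(th, xy) {
  m <- mu(th)
  sum(dnorm(xy[, 1], m[1], 1, log = TRUE)) + sum(dnorm(xy[, 2], m[2], 1, log = TRUE))
}
lp.naive  <- function(th, xy)          # naive diffuse prior
  loglik(th, xy) + sum(dnorm(th, 0, 10, log = TRUE))
lp.volume <- function(th, xy)          # volume prior (1), up to an additive constant
  loglik(th, xy) + log(abs(th[1] - th[2])) - 2 * sum(th^2) + 2 * c0 * sum(th)

set.seed(1)
n  <- 100
ab <- c(0.5, -0.5)
xy <- cbind(rnorm(n, mu(ab)[1]), rnorm(n, mu(ab)[2]))

## theta.init is kept off the line a = b, where the volume prior vanishes
post.naive  <- MCMCmetrop1R(lp.naive,  theta.init = c(0.1, -0.1), xy = xy,
                            burnin = 1000, mcmc = 20000)
post.volume <- MCMCmetrop1R(lp.volume, theta.init = c(0.1, -0.1), xy = xy,
                            burnin = 1000, mcmc = 20000)
summary(post.naive)
summary(post.volume)
```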
FIGURE 3. Ten thousand (u, v) points from (a, b) points chosen uniformly inside a ball of radius 3 centered at 0. Notice that more than 9000 points disappear into the origin.

Why is the volume prior so good?
To understand why the naive flat prior is so bad and the volume prior so good, let's identify the transformed region of means (u, v) given by
$$ u = \exp\left(-(a^2 + b^2)\right), \qquad v = \exp\left(-\left((a-c)^2 + (b-c)^2\right)\right), $$
as (a, b) range over the entire plane. An easy way to find the shape of this region is to pick points (a, b) at random on the plane and plot the corresponding (u, v) points. Figure 3 shows 10000 (u, v) points obtained from 10000 (a, b) points uniformly distributed inside a circle of radius 3 centered at the origin. Notice that lots of points disappear into the origin! Now take another 10000 (a, b) points, but now distributed according to π₂ with density given in (1). Figure 4 shows these (a, b) points. Notice that they are all highly concentrated about two points close to the origin. The corresponding (u, v) points are shown in figure 5. Got it?
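A few lines of R reproduce the experiment behind figure 3 (again with an assumed placeholder value for c):

```r
## Sketch (R): 10000 (a, b) points uniform in the disk of radius 3, mapped to
## (u, v).  The constant c = 0.01 is an assumed placeholder.
set.seed(2)
c0  <- 0.01
r   <- 3 * sqrt(runif(10000))             # radius: uniform over the disk
phi <- runif(10000, 0, 2 * pi)            # angle
a <- r * cos(phi); b <- r * sin(phi)
u <- exp(-(a^2 + b^2))
v <- exp(-((a - c0)^2 + (b - c0)^2))
plot(u, v, pch = ".")                      # cf. figure 3: most points pile up at 0
## For figures 4 and 5, draw (a, b) from the volume prior instead (e.g. with the
## MCMC sketch above) and push them through the same map.
```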
FIGURE 4. Ten thousand (a, b) points from the true uniform over the model.
TABLE 1. Bioassay data from [3, p. 88].

    Dose, x_i (log g/ml)    Number of animals, n_i    Number of deaths, y_i
          -0.863                       5                         0
          -0.296                       5                         1
          -0.053                       5                         3
           0.727                       5                         5

The equation of the boundary of the leaf of (u, v) points

The computation of the exact equation of the leaf boundary in figure 5 is a nice exercise in simple optimization: find the maximum and minimum of v subject to the constraint u = t. For fixed u = t the points (a, b) lie on the circle of radius $\sqrt{-\log t}$ about the origin, whose distance to the point (c, c) ranges between $|\sqrt{-\log t} - c\sqrt{2}|$ and $\sqrt{-\log t} + c\sqrt{2}$. The maximum is therefore given by the red (dark) curve R(t) in figure 6, with
$$ R(t) = \exp\left(-\left(\sqrt{-\log t} - c\sqrt{2}\right)^2\right), \qquad (2) $$
and the minimum by the green (light) curve,
$$ G(t) = \exp\left(-\left(\sqrt{-\log t} + c\sqrt{2}\right)^2\right), \qquad (3) $$
with 0 < t ≤ 1.
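Equations (2) and (3) can be checked by plotting them directly; a short sketch (with c again an assumed placeholder):

```r
## Sketch (R): the leaf boundary of equations (2)-(3); c = 0.01 is a placeholder.
c0 <- 0.01
tt <- seq(1e-6, 1, length.out = 2000)
R  <- exp(-(sqrt(-log(tt)) - c0 * sqrt(2))^2)   # upper (red) boundary
G  <- exp(-(sqrt(-log(tt)) + c0 * sqrt(2))^2)   # lower (green) boundary
plot(tt, R, type = "l", col = "red", xlab = "u", ylab = "v")   # cf. figure 6
lines(tt, G, col = "green")
```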
FIGURE 5. Ten thousand (u, v) points from (a, b) chosen from the density (1).

FIGURE 6. The exact boundary of the (u, v) points. The upper dark (red) part is R(t); the lower light (green) part is G(t).

TEXTBOOK EXAMPLE
Perhaps the first non-trivial example of a multiparameter Bayesian model is simple logistic regression (see [3, p. 88]). Twenty animals were tested, five at each of four dose levels (see Table 1). The standard model for this kind of data is: observations $(x_i, n_i, y_i)$, for $i = 1, \dots, k$, assumed independent with
$$ y_i \mid \theta_i \sim \mathrm{Bin}(n_i, \theta_i), $$
where $\theta_i$ is the probability of death for animals given dose $x_i$. The standard logistic dose-response relation is
$$ \log\frac{\theta_i}{1-\theta_i} = a + b x_i . \qquad (4) $$
The joint distribution of $(y_1, \dots, y_k)$ is a function of the unknown parameters $(a, b)$, and straightforward (but tedious) calculations give the volume element $dV = \sqrt{\det g}\, da\, db$ in the $(a, b)$ parameterization as
$$ dV = T \sigma \, da \, db, \qquad T = \sum_j w_j = \sum_j n_j \theta_j (1 - \theta_j), $$
where $\sigma$ is the standard deviation of the random variable $X$ defined by
$$ P\{X = x_j \mid \theta\} \propto w_j , \qquad \theta_j = \frac{1}{1 + \exp(-a - b x_j)} . $$
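The formula above is easy to evaluate; here is a minimal sketch in R for the bioassay doses of Table 1 (the grid limits are arbitrary illustrative choices, not from the paper):

```r
## Sketch (R): unnormalized volume prior sqrt(det g) = T * sigma for simple
## logistic regression with the bioassay data of Table 1.
x <- c(-0.863, -0.296, -0.053, 0.727)   # doses, log g/ml
n <- c(5, 5, 5, 5)                      # animals per dose

volume_prior_logit <- function(a, b) {
  theta <- 1 / (1 + exp(-a - b * x))
  w     <- n * theta * (1 - theta)
  Tw    <- sum(w)
  p     <- w / Tw                                  # P{X = x_j} proportional to w_j
  sigma <- sqrt(sum(p * x^2) - sum(p * x)^2)       # standard deviation of X
  Tw * sigma                                       # sqrt(det g), unnormalized prior
}

## Contours on an (arbitrary) grid, cf. figure 7
a.grid <- seq(-4, 8, length.out = 200)
b.grid <- seq(-5, 30, length.out = 200)
dens   <- outer(a.grid, b.grid, Vectorize(volume_prior_logit))
contour(a.grid, b.grid, dens, xlab = "a", ylab = "b")
```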
This is a strange looking density (see figure 7). In particular, this prior is proper and it assigns a definite correlation between a and b; this correlation is known a priori from the underlying geometry. In fact, the volume prior provides a better fit to the data than the standard diffuse naive prior that models a and b as independent variables with large variances. Figure 8 shows the results of the posterior simulations with both priors: left panel with the naive prior, right panel with the volume prior. The red (dark) middle curves represent the logistic curves associated with the posterior mean values of (a, b) (from 100 thousand posterior samples). The pictures also show 500 logistic curves obtained by sampling 500 (a, b) pairs from the available posterior samples. There is clearly more spread of logistic curves in the right panel than in the left. This is compatible with the fact that the volume prior samples uniformly over the manifold. Just like in the cooked-up example, the over-spread (a, b) points of the naive prior cover only a small region of the manifold.

FIGURE 7. Uniform prior for logistic regression in the (a, b) parameterization. Bottom left: Contours. Bottom right: samples from this prior.

FIGURE 8. Posterior logistic curves. Left panel with the naive prior; right panel with the volume prior.

BEYOND FINITE VOLUMES
When the information volume V(M) of the model M,
$$ V(M) = \int_M dV = \int_\Theta \sqrt{\det g}\; d\theta , $$
is infinite, there is no uniform distribution over M. However, the underlying information geometry provides the following class of priors, given as scalar density fields defined invariantly on M by
$$ \pi(p \mid t, \nu, \delta, \alpha) = \frac{1}{Z}\left[\,1 + \alpha\nu\, I_\delta(p:t)\,\right]^{-1/\nu} \qquad (5) $$
where p ∈ M, t is a probability distribution guessing the actual distribution of the data, δ and ν are scalar parameters in [0, 1], α > 0, Z < ∞ is a normalization constant, and $I_\delta(p:t)$ is the δ-information deviation between (unnormalized) distributions p and t, given by
$$ I_\delta(p:t) = \frac{1}{\delta(1-\delta)} \int \left[\, \delta p + (1-\delta)\, t - p^\delta\, t^{1-\delta} \,\right] , $$
where the integral is over the whole data space manifold. This family of priors exists for any regular model and it has many remarkable properties. In particular, this family maximizes a simple and objective notion of ignorance. For details see my A geometric theory of ignorance. The hyperparameters can be estimated with priors of the same kind, or with a nonparametric prior of the Dirichlet Process type (which could itself be seen as part of this family if we allow M to be infinite dimensional). There are still many open problems but the road ahead seems clear: More geometry.
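To make the definition concrete, here is a toy numerical evaluation of $I_\delta(p:t)$ in R for two densities on the real line; the normal densities and the δ values are arbitrary illustrations, not taken from the paper:

```r
## Toy sketch (R): numerical delta-information deviation I_delta(p:t) between
## two normalized densities on the real line; q plays the role of the guess t.
I.delta <- function(p, q, delta, lower = -30, upper = 30) {
  integrand <- function(x)
    delta * p(x) + (1 - delta) * q(x) - p(x)^delta * q(x)^(1 - delta)
  integrate(integrand, lower, upper)$value / (delta * (1 - delta))
}

p <- function(x) dnorm(x, 0, 1)
q <- function(x) dnorm(x, 1, 2)
I.delta(p, q, delta = 0.5)     # symmetric, Hellinger-like case
I.delta(p, q, delta = 0.99)    # close to the Kullback-Leibler limit (delta -> 1)
```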
ACKNOWLEDGMENTS
I am indebted to Phil Dawid, whose invitation to talk at UCL prompted the finding of the example in this paper, and to Ariel Caticha, Kevin Knuth, and John Skilling for many interesting discussions.
REFERENCES
1. A. D. Martin and K. M. Quinn, MCMCpack: Markov chain Monte Carlo (MCMC) Package (2007), URL http://mcmcpack.wustl.edu, R package version 0.8-2.
2. R Development Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria (2007), URL http://www.R-project.org, ISBN 3-900051-07-0.
3. A. Gelman, J. Carlin, H. Stern, and D. Rubin, Bayesian Data Analysis, 2nd edn., Chapman & Hall/CRC.