On Bayesian Estimation of Densities and Sampling Distributions: the Posterior Predictive Distribution as the Bayes Estimator
A.G. Nogales
Received: date / Accepted: date
Abstract
Optimality results for three outstanding Bayesian estimation problems are presented in this paper: the estimation of the sampling distribution for the squared total variation loss function, the estimation of the density for the $L^1$-squared loss function, and the estimation of a real distribution function for the $L^\infty$-squared loss function. The posterior predictive distribution provides the solution to these problems. Some examples are presented to illustrate the results.

Keywords
Bayesian density estimation · Posterior predictive distribution
Mathematics Subject Classification (2010)

In the next pages, the problems of estimation of a density or a probability measure (or even of a distribution function in the real case) are considered from the Bayesian point of view. These problems are addressed in a number of previous references such as Ghosh et al. (2003, Ch. 5), Lijoi et al. (2010, Sect. 3.4), Lo (1984), Ferguson (1983) or, recently, Marchand et al. (2018), to mention just a few. Popular choices for Bayesian density estimation are Dirichlet-process mixture models, due to their large support and the ease of their implementation (see Bean et al. (2016)). Ghosal et al. (2017, p. 121) contains a brief historical review of Bayesian density estimation. But, unlike Theorem 2 below, no general optimality result can be found in the mentioned literature.

Since the Bayesian statistical experiment is in fact a probability space, Theorem 2 is basically a probabilistic result. Moreover, it is not simply an existence result for an optimal estimator of the density: it shows that the optimal estimator is the posterior predictive density.
Dpto. de Matemáticas, Universidad de Extremadura, Avda. de Elvas, s/n, 06006 Badajoz, SPAIN. E-mail: [email protected]
The posterior predictive distribution has been presented as the keystone of Predictive Inference, which seeks to make inferences about a new unknown observation from the previous random sample, in contrast to the greater emphasis that statistical inference has placed on the estimation and testing of parameters since its mathematical foundations in the early twentieth century (see Geisser (1993) or Gelman et al. (2014)). With that idea in mind, it has also been used in other areas such as model selection, testing for discordancy, goodness of fit, perturbation analysis or classification (see additional fields of application in Geisser (1993) and Rubin (1984)), but never as a possible solution to the Bayesian density estimation problem.

Here, the posterior predictive density appears as the optimal estimator of the density for the $L^1$-squared loss function, and this is true whatever the prior distribution may be. In fact, the posterior predictive distribution is the optimal estimator of the probability measure $P_\theta$ for the squared total variation loss function. Moreover, in the real case, the posterior predictive distribution function becomes the optimal estimator of the sampling distribution function for the $L^\infty$-squared loss function. The proofs of Theorems 1 and 2 show that the square in the total variation, $L^1$ and $L^\infty$ loss functions comes from the quadratic error loss function used in the estimation of a real function of the parameter. In this sense, these loss functions should be considered natural for their respective estimation problems. Finally, the results are general enough to simultaneously cover continuous and discrete, univariate and multivariate, parametric and nonparametric cases.

Several examples are presented in Section 4 to illustrate the results. Gelman et al. (2014) contains many other examples of determination of the posterior predictive distribution. In practice, however, the explicit evaluation of the posterior predictive distribution can be cumbersome and its simulation may become preferable. Gelman et al. (2014) is also a good reference for such simulation methods and, hence, for the computation of the Bayes estimators of the density and the sampling distribution.

In what follows we place ourselves in a general framework for Bayesian inference, as described in Barra (1971). First, let us briefly recall some basic concepts about Markov kernels, mainly to fix the notation. In what follows, $(\Omega,\mathcal A)$, $(\Omega_1,\mathcal A_1)$ and so on will denote measurable spaces.

Definition 1
1) (Markov kernel) A Markov kernel $M:(\Omega_1,\mathcal A_1)\succ\!\rightarrow(\Omega_2,\mathcal A_2)$ is a map $M:\Omega_1\times\mathcal A_2\rightarrow[0,1]$ such that: (i) $\forall\,\omega_1\in\Omega_1$, $M(\omega_1,\cdot)$ is a probability measure on $\mathcal A_2$; (ii) $\forall\,A_2\in\mathcal A_2$, $M(\cdot,A_2)$ is $\mathcal A_1$-measurable.

2) (Image of a Markov kernel) The image (or probability distribution) of a Markov kernel $M:(\Omega_1,\mathcal A_1,P_1)\succ\!\rightarrow(\Omega_2,\mathcal A_2)$ defined on a probability space is the probability measure $P_1^M$ on $\mathcal A_2$ defined by $P_1^M(A_2):=\int_{\Omega_1}M(\omega_1,A_2)\,dP_1(\omega_1)$.

3) (Composition of Markov kernels) Given two Markov kernels $M_1:(\Omega_1,\mathcal A_1)\succ\!\rightarrow(\Omega_2,\mathcal A_2)$ and $M_2:(\Omega_2,\mathcal A_2)\succ\!\rightarrow(\Omega_3,\mathcal A_3)$, their composition is defined as the Markov kernel $M_2M_1:(\Omega_1,\mathcal A_1)\succ\!\rightarrow(\Omega_3,\mathcal A_3)$ given by
\[
M_2M_1(\omega_1,A_3)=\int_{\Omega_2}M_2(\omega_2,A_3)\,M_1(\omega_1,d\omega_2).
\]
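For readers who prefer a computational picture, the following minimal sketch (in Python, with finite sample spaces as a simplifying assumption not made in the paper) represents a Markov kernel as a row-stochastic matrix and illustrates the image of a kernel under a probability and the composition of two kernels.

```python
import numpy as np

# Finite-space illustration (an assumption made only for this sketch): a Markov kernel
# M : (Omega1, A1) >--> (Omega2, A2) between finite spaces is a row-stochastic matrix
# with M[i, j] = M(omega1_i, {omega2_j}).
M1 = np.array([[0.7, 0.3],
               [0.2, 0.8]])          # kernel from Omega1 (2 points) to Omega2 (2 points)
M2 = np.array([[0.5, 0.5, 0.0],
               [0.1, 0.6, 0.3]])     # kernel from Omega2 (2 points) to Omega3 (3 points)
P1 = np.array([0.4, 0.6])            # a probability on Omega1

# Image (probability distribution) of M1 on Omega2: P1^{M1}(A) = sum_w M1(w, A) P1({w})
P1_M1 = P1 @ M1
assert np.isclose(P1_M1.sum(), 1.0)

# Composition M2 M1 : Omega1 >--> Omega3,
# (M2 M1)(w1, A3) = sum_{w2} M2(w2, A3) M1(w1, {w2}); as matrices this is M1 @ M2.
M2M1 = M1 @ M2
assert np.allclose(M2M1.sum(axis=1), 1.0)   # each row is again a probability

print("image of M1 under P1:", P1_M1)
print("composition kernel M2M1:\n", M2M1)
```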
Remarks

1) (Markov kernels as extensions of the concept of random variable) The concept of Markov kernel extends the concept of random variable (or measurable map). A random variable $T:(\Omega_1,\mathcal A_1,P_1)\rightarrow(\Omega_2,\mathcal A_2)$ will be identified with the Markov kernel $M^T:(\Omega_1,\mathcal A_1,P_1)\succ\!\rightarrow(\Omega_2,\mathcal A_2)$ defined by $M^T(\omega_1,A_2)=\delta_{T(\omega_1)}(A_2)=I_{A_2}(T(\omega_1))$, where $\delta_{T(\omega_1)}$ denotes the Dirac measure (the degenerate distribution) at the point $T(\omega_1)$, and $I_{A_2}$ is the indicator function of the event $A_2$. In particular, the probability distribution $P_1^{M^T}$ of $M^T$ coincides with the probability distribution $P_1^T$ of $T$ defined as $P_1^T(A_2):=P_1(T\in A_2)$.

2) Given a Markov kernel $M:(\Omega_1,\mathcal A_1)\succ\!\rightarrow(\Omega_2,\mathcal A_2)$ and a random variable $X:(\Omega_2,\mathcal A_2)\rightarrow(\Omega_3,\mathcal A_3)$, we have that
\[
M^XM(\omega_1,A_3)=M(\omega_1,X^{-1}(A_3))=M(\omega_1,\cdot)^X(A_3),
\]
where $M^X$ denotes the Markov kernel associated with $X$. We write $XM:=M^XM$. $\Box$

Let $(\Omega,\mathcal A,\{P_\theta:\theta\in(\Theta,\mathcal T,Q)\})$ be a Bayesian statistical experiment, where $Q$ is the prior distribution, a probability measure on the measurable space $(\Theta,\mathcal T)$; $(\Omega,\mathcal A)$ is the sample space and $(\Theta,\mathcal T)$ is the parameter space. When needed, we shall suppose that $P_\theta$ has a density (or Radon-Nikodym derivative) $p_\theta$ with respect to a $\sigma$-finite measure $\mu$ on $\mathcal A$ and that the likelihood function $L:(\omega,\theta)\in(\Omega\times\Theta,\mathcal A\otimes\mathcal T)\mapsto L(\omega,\theta):=p_\theta(\omega)$ is measurable. So we have a Markov kernel $P:(\Theta,\mathcal T)\succ\!\rightarrow(\Omega,\mathcal A)$ defined by $P(\theta,A):=P_\theta(A)$. Let $P^*:(\Omega,\mathcal A)\succ\!\rightarrow(\Theta,\mathcal T)$ be the Markov kernel determined by the posterior distributions. In fact, if we denote by $\Pi$ the only probability measure on $\mathcal A\otimes\mathcal T$ such that
\[
\Pi(A\times T)=\int_T P_\theta(A)\,dQ(\theta),\qquad A\in\mathcal A,\ T\in\mathcal T,\tag{1}
\]
then $P^*$ is defined in such a way that
\[
\Pi(A\times T)=\int_A P^*_\omega(T)\,d\beta^*_Q(\omega),\qquad A\in\mathcal A,\ T\in\mathcal T,\tag{2}
\]
where $\beta^*_Q$ denotes the so-called prior predictive probability, defined by
\[
\beta^*_Q(A)=\int_\Theta P_\theta(A)\,dQ(\theta),\qquad A\in\mathcal A.
\]
In other terms, $\beta^*_Q=Q^P$, the probability distribution of the Markov kernel $P$ with respect to the prior distribution $Q$. The probability measure $\Pi$ integrates all the basic ingredients of the Bayesian model, and these ingredients can be essentially derived from $\Pi$, something that allows us to identify the Bayesian model with the probability space $(\Omega\times\Theta,\mathcal A\otimes\mathcal T,\Pi)$ (so is done, for instance, in Florens et al. (1990)).
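The relations (1) and (2) can be checked numerically in a toy discrete model. The following sketch is a minimal illustration; the two-point parameter space and the particular numbers are assumptions made only for the example. It builds the joint distribution $\Pi$ from the kernel $P$ and the prior $Q$, recovers the prior predictive $\beta^*_Q$ as the $\Omega$-marginal, and obtains the posterior $P^*_\omega$ by disintegration (Bayes' rule).

```python
import numpy as np

# Toy discrete Bayesian experiment (illustrative assumption): Theta = {theta_0, theta_1}, Omega = {0, 1, 2}.
P = np.array([[0.6, 0.3, 0.1],    # P_{theta_0}
              [0.1, 0.4, 0.5]])   # P_{theta_1}
Q = np.array([0.3, 0.7])          # prior distribution on Theta

# Joint distribution Pi:  Pi({w} x {theta_t}) = P_{theta_t}({w}) * Q({theta_t})
Pi = Q[:, None] * P               # stored as Pi[t, w]

# Prior predictive beta*_Q = Q^P, the Omega-marginal of Pi
beta = Pi.sum(axis=0)

# Posterior P*_w, obtained by disintegrating Pi over the Omega-marginal (Bayes' rule)
posterior = Pi / beta             # posterior[t, w] = P*_w({theta_t})
assert np.allclose(posterior.sum(axis=0), 1.0)

print("prior predictive:", beta)
print("posterior given w = 2:", posterior[:, 2])
```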
It is well known that, for $\omega\in\Omega$, the posterior density with respect to the prior distribution is proportional to the likelihood. Namely,
\[
p^*_\omega(\theta):=\frac{dP^*_\omega}{dQ}(\theta)=C(\omega)\,p_\theta(\omega),
\]
where $C(\omega)=\bigl[\int_\Theta p_\theta(\omega)\,dQ(\theta)\bigr]^{-1}$.

This way we obtain a statistical experiment $(\Theta,\mathcal T,\{P^*_\omega:\omega\in\Omega\})$ on the parameter space $(\Theta,\mathcal T)$. We can reconsider the Markov kernel $P$ defined on this statistical experiment:
\[
P:(\Theta,\mathcal T,\{P^*_\omega:\omega\in\Omega\})\succ\!\rightarrow(\Omega,\mathcal A).
\]
Since $\bigl(P^*_\omega\bigr)^P(A)=\int_\Theta P_\theta(A)\,dP^*_\omega(\theta)$, for $A\in\mathcal A$, it is called the posterior predictive distribution on $\mathcal A$ given $\omega$, and the statistical experiment image of $P$ is $\bigl(\Omega,\mathcal A,\{(P^*_\omega)^P:\omega\in\Omega\}\bigr)$.

Note that, given $\omega\in\Omega$, according to Fubini's theorem,
\[
\bigl(P^*_\omega\bigr)^P(A)=\int_\Theta P_\theta(A)\,dP^*_\omega(\theta)=\int_\Theta\int_A p_\theta(\omega')\,d\mu(\omega')\,p^*_\omega(\theta)\,dQ(\theta)=\int_A\int_\Theta p_\theta(\omega')\,p^*_\omega(\theta)\,dQ(\theta)\,d\mu(\omega').
\]
So, the posterior predictive density is
\[
\frac{d\bigl(P^*_\omega\bigr)^P}{d\mu}(\omega')=\int_\Theta p_\theta(\omega')\,p^*_\omega(\theta)\,dQ(\theta).
\]
If we consider the composition of the Markov kernels $P^*$ and $P$,
\[
(\Omega,\mathcal A)\ \overset{P^*}{\succ\!\rightarrow}\ (\Theta,\mathcal T)\ \overset{P}{\succ\!\rightarrow}\ (\Omega,\mathcal A),
\]
defined by
\[
PP^*(\omega,A):=\int_\Theta P_\theta(A)\,dP^*_\omega(\theta)=\int_A\int_\Theta p_\theta(\omega')\,p^*_\omega(\theta)\,dQ(\theta)\,d\mu(\omega'),\tag{3}
\]
we have that
\[
\frac{dPP^*(\omega,\cdot)}{d\mu}(\omega')=\int_\Theta p_\theta(\omega')\,p^*_\omega(\theta)\,dQ(\theta).
\]
Notice that $PP^*(\omega,\cdot)=\bigl(P^*_\omega\bigr)^P$.

Remark 1 Because of (1), we introduce the notation $\Pi:=P\otimes Q$. So, (2) reads as $\Pi=\beta^*_Q\otimes P^*$. Hence, after observing $\omega\in\Omega$, replacing the prior distribution $Q$ by the posterior distribution $P^*_\omega$, we get the probability distribution $\Pi_\omega:=P\otimes P^*_\omega$ on $\mathcal A\otimes\mathcal T$. According to (3),
\[
PP^*(\omega,A)=\Pi_\omega(A\times\Theta)=\Pi_\omega^I(A),
\]
where $I(\omega,\theta)=\omega$. This way the posterior predictive distribution $\bigl(P^*_\omega\bigr)^P$ given $\omega$ appears as the marginal $\Pi_\omega$-distribution on $\Omega$. $\Box$

According to the Bayesian philosophy, given $A\in\mathcal A$, a natural estimator of $f_A(\theta):=P_\theta(A)$ is the posterior mean of $f_A$, which coincides with the posterior predictive probability of $A$, $T(\omega):=\bigl(P^*_\omega\bigr)^P(A)$. In fact, this is the Bayes estimator of $f_A$ (see Theorem 1.(i)). So, the posterior predictive distribution $\bigl(P^*_\omega\bigr)^P$ appears as the natural Bayesian estimator of the probability distribution $P_\theta$.

To estimate probability measures, the squared total variation loss function
\[
W(Q,P):=\sup_{A\in\mathcal A}|Q(A)-P(A)|^2
\]
will be considered. An estimator of $f(\theta):=P_\theta$ is a Markov kernel $M:(\Omega,\mathcal A)\succ\!\rightarrow(\Omega,\mathcal A)$ so that, having observed $\omega\in\Omega$, $M(\omega,\cdot)$ is a probability measure on $\mathcal A$ which is considered as an estimate of $f(\theta)$. We wonder whether the Bayes mean risk of the estimator $M^*:=\bigl(P^*\bigr)^P$ is less than that of any other estimator $M$ of $f$, i.e., whether
\[
\int_{\Omega\times\Theta}\sup_{A\in\mathcal A}\bigl|\bigl(P^*_\omega\bigr)^P(A)-P_\theta(A)\bigr|^2\,d\Pi(\omega,\theta)\le\int_{\Omega\times\Theta}\sup_{A\in\mathcal A}\bigl|M(\omega,A)-P_\theta(A)\bigr|^2\,d\Pi(\omega,\theta).
\]
Theorem 1.(ii) below gives the answer.

An estimator of the density $p_\theta$ on $(\Omega,\mathcal A,\{P_\theta:\theta\in(\Theta,\mathcal T,Q)\})$ is a measurable map $m:(\Omega\times\Omega,\mathcal A\otimes\mathcal A)\rightarrow\mathbb R$ in such a way that, having observed $\omega\in\Omega$, the map $\omega'\mapsto m(\omega,\omega')$ is an estimate of $p_\theta$.

It is well known (see Ghosal et al. (2017), p. 126) that, given two probability measures $Q$ and $P$ on $(\Omega,\mathcal A)$ having densities $q$ and $p$ with respect to a $\sigma$-finite measure $\mu$,
\[
\sup_{A\in\mathcal A}|Q(A)-P(A)|=\frac12\int|q-p|\,d\mu.
\]
So the Bayesian estimation of the sampling distribution $P_\theta$ for the squared total variation loss function corresponds to the Bayesian estimation of its density $p_\theta$ for the $L^1$-squared loss function $W'(q,p):=\bigl(\int|q-p|\,d\mu\bigr)^2$. The next theorem also solves the estimation problem of the density.
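The identity linking the total variation distance and the $L^1$ distance between the densities can be verified directly on a small finite space; the sketch below is purely illustrative, with arbitrarily chosen densities, and enumerates all events to compute the supremum.

```python
import numpy as np
from itertools import chain, combinations

# Check of sup_A |Q(A) - P(A)| = (1/2) * integral |q - p| dmu on a 4-point space
# (the densities below are arbitrary illustrative choices; the identity holds in general).
q = np.array([0.10, 0.20, 0.30, 0.40])   # density of Q w.r.t. the counting measure
p = np.array([0.25, 0.25, 0.25, 0.25])   # density of P

points = range(len(q))
events = chain.from_iterable(combinations(points, r) for r in range(len(q) + 1))
tv_sup = max(abs(q[list(A)].sum() - p[list(A)].sum()) for A in events)
tv_l1 = 0.5 * np.abs(q - p).sum()

print(tv_sup, tv_l1)                     # both equal 0.2 here
assert np.isclose(tv_sup, tv_l1)
```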
Theorem 1
Let $(\Omega,\mathcal A,\{P_\theta:\theta\in(\Theta,\mathcal T,Q)\})$ be a Bayesian statistical experiment dominated by a $\sigma$-finite measure $\mu$, where the $\sigma$-field $\mathcal A$ is supposed to be separable. We suppose that the likelihood function $L(\omega,\theta):=p_\theta(\omega)=dP_\theta/d\mu(\omega)$ is $\mathcal A\otimes\mathcal T$-measurable.

(i) Given $A\in\mathcal A$, the posterior predictive probability $\bigl(P^*_\omega\bigr)^P(A)$ of $A$ is the Bayes estimator of the probability $P_\theta(A)$ of $A$ for the squared error loss function $W_1(x,\theta):=(x-P_\theta(A))^2$. Moreover, if $X$ is a real statistic with finite mean, its posterior predictive mean
\[
E_{(P^*_\omega)^P}(X)=\int_\Theta\int_\Omega X(\omega')\,dP_\theta(\omega')\,dP^*_\omega(\theta)
\]
is the Bayes estimator of $E_\theta(X)$.

(ii) The posterior predictive distribution $\bigl(P^*_\omega\bigr)^P$ is the Bayes estimator of the sampling distribution $P_\theta$ for the squared total variation loss function $W(P,Q):=\sup_{A\in\mathcal A}|P(A)-Q(A)|^2$.

(iii) The posterior predictive density
\[
b^*_{Q,\omega}(\omega'):=\frac{d\bigl(P^*_\omega\bigr)^P}{d\mu}(\omega')=\int_\Theta p_\theta(\omega')\,p^*_\omega(\theta)\,dQ(\theta)
\]
is the Bayes estimator of the density $p_\theta$ for the $L^1$-squared loss function $W'(p,q):=\bigl(\int_\Omega|p-q|\,d\mu\bigr)^2$.

More generally, an estimator of $f(\theta):=P_\theta$ from a sample of size $n$ of this distribution is a Markov kernel $M_n:(\Omega^n,\mathcal A^n)\succ\!\rightarrow(\Omega,\mathcal A)$. Let us consider the Markov kernel $P_n:(\Theta,\mathcal T)\succ\!\rightarrow(\Omega^n,\mathcal A^n)$ defined by $P_n(\theta,A)=P_\theta^n(A)$, $A\in\mathcal A^n$, $\theta\in\Theta$. We write $\Pi_n:=P_n\otimes Q$, so that
\[
\Pi_n(A\times T)=\int_T P_\theta^n(A)\,dQ(\theta),\qquad A\in\mathcal A^n,\ T\in\mathcal T.
\]
The corresponding prior predictive distribution is
\[
\beta^*_{Q,n}(A)=\int_\Theta P_\theta^n(A)\,dQ(\theta)=\Pi_n^I(A),
\]
where $I(\omega,\theta)=\omega$ for $\omega\in\Omega^n$. Let us write $I_i(\omega)=\omega_i$ and $\hat I_i(\omega,\theta)=\omega_i$, for $\omega\in\Omega^n$ and $i=1,\dots,n$. Hence
\[
\bigl(\beta^*_{Q,n}\bigr)^{I_i}(A_i)=\int_\Theta P_\theta(A_i)\,dQ(\theta)=\beta^*_Q(A_i)
\]
and
\[
\Pi_n^{\hat I_i}(A_i\times T)=\int_T P_\theta(A_i)\,dQ(\theta),
\]
so $\bigl(\beta^*_{Q,n}\bigr)^{I_i}=\beta^*_Q$ and $\Pi_n^{\hat I_i}=\Pi$. Denoting $J(\omega,\theta)=\theta$, the posterior distribution $P^*_{\omega,n}:=\Pi_n^{J|I=\omega}$, $\omega\in\Omega^n$, is defined in such a way that
\[
\Pi_n(A\times T)=\int_A P^*_{\omega,n}(T)\,d\beta^*_{Q,n}(\omega).
\]
The $\mu_n$-density of $P_\theta^n$ ($\mu_n$ being the product of $n$ copies of $\mu$) is
\[
p_{\theta,n}(\omega):=\frac{dP_\theta^n}{d\mu_n}(\omega)=\prod_{i=1}^n p_\theta(\omega_i)
\]
for $\omega=(\omega_1,\dots,\omega_n)\in\Omega^n$. The posterior density given $\omega\in\Omega^n$ is of the form
\[
p^*_{\omega,n}(\theta):=\frac{dP^*_{\omega,n}}{dQ}(\theta)\propto p_{\theta,n}(\omega).
\]
According to Theorem 1.(ii), the Markov kernel $\bigl(P^*_n\bigr)^{P_n}:(\Omega^n,\mathcal A^n)\succ\!\rightarrow(\Omega^n,\mathcal A^n)$ defined by
\[
\bigl(P^*_n\bigr)^{P_n}(\omega,A):=\bigl(P^*_{\omega,n}\bigr)^{P_n}(A)=\int_\Theta P_\theta^n(A)\,dP^*_{\omega,n}(\theta)
\]
is the Bayes estimator of the product probability measure $f_n(\theta):=P_\theta^n$. That is to say,
\[
\int_{\Omega^n\times\Theta}\sup_{A\in\mathcal A^n}\bigl|\bigl(P^*_{\omega,n}\bigr)^{P_n}(A)-P_\theta^n(A)\bigr|^2\,d\Pi_n(\omega,\theta)\le\int_{\Omega^n\times\Theta}\sup_{A\in\mathcal A^n}\bigl|M(\omega,A)-P_\theta^n(A)\bigr|^2\,d\Pi_n(\omega,\theta)
\]
for every estimator $M:(\Omega^n,\mathcal A^n)\succ\!\rightarrow(\Omega^n,\mathcal A^n)$ of $P_\theta^n$.

The next theorem shows how, marginalizing the posterior predictive distribution $\bigl(P^*_{\omega,n}\bigr)^{P_n}$, we can get the Bayes estimator of the sampling probability measure $P_\theta$ or of its density.

Theorem 2 (Bayesian density estimation from a sample of size $n$) Let $(\Omega,\mathcal A,\{P_\theta:\theta\in(\Theta,\mathcal T,Q)\})$ be a Bayesian statistical experiment dominated by a $\sigma$-finite measure $\mu$, where the $\sigma$-field $\mathcal A$ is supposed to be separable. We suppose that the likelihood function $L(\omega,\theta):=p_\theta(\omega)=dP_\theta/d\mu(\omega)$ is $\mathcal A\otimes\mathcal T$-measurable. Let $n\in\mathbb N$. All the estimation problems below refer to the product Bayesian statistical experiment $(\Omega^n,\mathcal A^n,\{P_\theta^n:\theta\in(\Theta,\mathcal T,Q)\})$ corresponding to an $n$-sized sample of the observed unknown distribution. Let $I(\omega_1,\dots,\omega_n):=\omega_1$.

(i) Given $A\in\mathcal A$,
\[
\Bigl[\bigl(P^*_{\omega,n}\bigr)^{P_n}\Bigr]^I(A)
\]
is the Bayes estimator of the probability $P_\theta(A)$ of $A$ for the squared error loss function $W_1(x,\theta):=(x-P_\theta(A))^2$.
(ii) The distribution
\[
\Bigl[\bigl(P^*_{\omega,n}\bigr)^{P_n}\Bigr]^I
\]
of the projection $I$ under the posterior predictive probability $\bigl(P^*_{\omega,n}\bigr)^{P_n}$ is the Bayes estimator of the sampling distribution $P_\theta$ for the squared total variation loss function $W(P,Q):=\sup_{A\in\mathcal A}|P(A)-Q(A)|^2$.

(iii) The marginal posterior predictive density
\[
b^*_{Q,\omega,n}(\omega'):=\frac{d\bigl[\bigl(P^*_{\omega,n}\bigr)^{P_n}\bigr]^I}{d\mu}(\omega')=\int_\Theta p_\theta(\omega')\,p^*_{\omega,n}(\theta)\,dQ(\theta)
\]
is the Bayes estimator of the density $p_\theta$ for the $L^1$-squared loss function $W'(p,q):=\bigl(\int_\Omega|p-q|\,d\mu\bigr)^2$.

We end this section with a remark that addresses the problem of estimating a real distribution function.
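Before turning to that remark, here is a small simulation sketch of the marginalization in Theorem 2 (the exponential likelihood with a gamma posterior anticipates Example 2 below, and all numerical values are assumptions made only for the illustration): a draw from the posterior predictive distribution on $\Omega^n$ is obtained by drawing $\theta$ from the posterior and then a fresh $n$-sample from $P_\theta^n$; keeping only the first coordinate produces a draw from the marginal $[(P^*_{\omega,n})^{P_n}]^I$, the Bayes estimator of $P_\theta$.

```python
import numpy as np

rng = np.random.default_rng(1)
lam, x = 2.0, np.array([0.5, 1.2, 0.3, 2.1])       # prior rate and observed sample (illustrative)
n, s = len(x), x.sum()

# Posterior draws (conjugate exponential-gamma model: posterior is Gamma(n+1, rate = lam + sum x_i))
thetas = rng.gamma(shape=n + 1, scale=1.0 / (lam + s), size=100_000)

# One fresh n-sample from P_theta^n per posterior draw; keep only the first coordinate I(omega') = omega'_1
new_samples = rng.exponential(scale=1.0 / thetas[:, None], size=(thetas.size, n))
first_coord = new_samples[:, 0]

t = 1.0
empirical = np.mean(first_coord <= t)               # simulated [(P*_{omega,n})^{P_n}]^I((-inf, t])
analytic = np.mean(1.0 - np.exp(-thetas * t))       # Monte Carlo of  ∫ P_theta((-inf, t]) dP*_{omega,n}(theta)
print(empirical, analytic)                          # the two values are close for large simulation sizes
```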
Remark 2 (Bayesian estimation of a distribution function) When $P_\theta$ is a probability distribution on the real line, we may be interested in the estimation of its distribution function $F_\theta(t):=P_\theta(]-\infty,t])$. An estimator of such a distribution function is a map $F:(x,t)\in\mathbb R^n\times\mathbb R\mapsto F(x,t):=M(x,]-\infty,t])$ for a Markov kernel $M:(\mathbb R^n,\mathcal R_n)\succ\!\rightarrow(\mathbb R,\mathcal R)$, where $\mathcal R$ (resp. $\mathcal R_n$) denotes the Borel $\sigma$-field of $\mathbb R$ (resp. $\mathbb R^n$).

According to the previous results, given $t\in\mathbb R$,
\[
F^*_x(t):=\Bigl[\bigl(P^*_{x,n}\bigr)^{P_n}\Bigr]^I(]-\infty,t])=\int_{-\infty}^t\int_\Theta p_\theta(y)\,p^*_{x,n}(\theta)\,dQ(\theta)\,d\mu(y)
\]
is the Bayes estimator of $F_\theta(t)$ for the squared error loss function. So
\[
\int_{\mathbb R^n\times\Theta}|F^*_x(t)-F_\theta(t)|^2\,d\Pi_n(x,\theta)\le\int_{\mathbb R^n\times\Theta}|F(x,t)-F_\theta(t)|^2\,d\Pi_n(x,\theta)
\]
for any other estimator $F$ of $F_\theta$. Since
\[
\sup_{t\in\mathbb R}|F(x,t)-F_\theta(t)|=\sup_{r\in\mathbb Q}|F(x,r)-F_\theta(r)|,
\]
we have that, given $(x,\theta)\in\mathbb R^n\times\Theta$ and $k\in\mathbb N$, there exists $r_k\in\mathbb Q$ such that $C(x,\theta)-\frac1k\le|F^*_x(r_k)-F_\theta(r_k)|$, where $C(x,\theta):=\sup_{t\in\mathbb R}|F^*_x(t)-F_\theta(t)|$, and hence (see Remark 3 at the end of Section 6; note that all the quantities involved lie in $[0,1]$)
\[
\int_{\mathbb R^n\times\Theta}C(x,\theta)^2\,d\Pi_n(x,\theta)\le\int_{\mathbb R^n\times\Theta}|F^*_x(r_k)-F_\theta(r_k)|^2\,d\Pi_n(x,\theta)+\frac2k\le\int_{\mathbb R^n\times\Theta}\sup_{t\in\mathbb R}|F(x,t)-F_\theta(t)|^2\,d\Pi_n(x,\theta)+\frac2k.
\]
We have proved that the posterior predictive distribution function $F^*_x$ is the Bayes estimator of the distribution function $F_\theta$ for the $L^\infty$-squared loss function $W''(F,G)=\bigl(\sup_{t\in\mathbb R}|F(t)-G(t)|\bigr)^2$. $\Box$

Example 1 Let $P_\theta$ be the normal distribution $N(\theta,\sigma^2)$ with unknown mean $\theta\in\mathbb R$ and known variance $\sigma^2$. Let $Q:=N(\mu,\tau^2)$ be the prior distribution, where the mean $\mu$ and the variance $\tau^2$ are known constants. It is well known that the posterior distribution is $P^*_{x,n}=N(m_n(x),s_n^2)$, where
\[
m_n(x)=\frac{n\tau^2\bar x+\sigma^2\mu}{n\tau^2+\sigma^2}\qquad\text{and}\qquad s_n^2=\frac{\tau^2\sigma^2}{n\tau^2+\sigma^2}.
\]
It can be shown that the distribution of $I$ with respect to the posterior predictive distribution is
\[
\Bigl[\bigl(P^*_{x,n}\bigr)^{P_n}\Bigr]^I=N\bigl(m_n(x),\sigma^2+s_n^2\bigr).
\]
For the details, the reader is referred to Bolstad (2004, p. 185), where the distribution of $I$ with respect to the posterior predictive distribution is referred to as the predictive distribution for the next observation given the observation $x$. So $M^*_n(x,\cdot):=N(m_n(x),\sigma^2+s_n^2)$ is the Bayes estimator of the sampling distribution $N(\theta,\sigma^2)$ for the squared total variation loss function, and the density of $N(m_n(x),\sigma^2+s_n^2)$ is the Bayes estimator of the density of $N(\theta,\sigma^2)$ for the $L^1$-squared loss function. $\Box$
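A quick numerical check of Example 1 (the data and hyperparameters below are illustrative assumptions): the closed-form predictive density $N(m_n(x),\sigma^2+s_n^2)$ is compared with the defining integral $\int_\Theta p_\theta(x')\,p^*_{x,n}(\theta)\,d\theta$ evaluated by quadrature.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

sigma2, mu, tau2 = 1.0, 0.0, 4.0             # known variance, prior mean and prior variance (assumed values)
x = np.array([0.8, 1.3, 0.2, 1.9, 0.7])      # observed sample (illustrative)
n, xbar = len(x), x.mean()

m_n = (n * tau2 * xbar + sigma2 * mu) / (n * tau2 + sigma2)   # posterior mean
s2_n = tau2 * sigma2 / (n * tau2 + sigma2)                    # posterior variance

x_new = 1.0
closed_form = stats.norm(m_n, np.sqrt(sigma2 + s2_n)).pdf(x_new)   # density of N(m_n, sigma2 + s2_n)

# Defining integral of the marginal posterior predictive density
integrand = lambda t: stats.norm(t, np.sqrt(sigma2)).pdf(x_new) * stats.norm(m_n, np.sqrt(s2_n)).pdf(t)
numerical = quad(integrand, m_n - 10.0, m_n + 10.0)[0]

print(closed_form, numerical)    # the two values agree to numerical precision
```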
Example 2 Let $G(\alpha,\beta)$ be the gamma distribution with parameters $\alpha,\beta>0$. Let $P_\theta:=G(1,\theta^{-1})$, whose density is $p_\theta(x)=\theta\exp\{-\theta x\}$ for $x>0$. So $P_\theta^n$ is the joint distribution of a sample of size $n$ from an exponential distribution of parameter $1/\theta$, and its density is $p_{\theta,n}(x)=\theta^n\exp\{-\theta\sum_i x_i\}$ for $x=(x_1,\dots,x_n)\in\mathbb R_+^n$.

Consider the prior distribution $Q:=G(1,\lambda^{-1})$ for some known $\lambda>0$. Since, for $a>0$,
\[
\int_0^\infty\theta^n\exp\{-a\theta\}\,d\theta=\frac{n!}{a^{n+1}},
\]
we have that the posterior density given $x\in\mathbb R_+^n$ is
\[
p^*_{x,n}(\theta)=\frac{\bigl(\lambda+\sum_i x_i\bigr)^{n+1}}{n!}\,\theta^n\exp\bigl\{-\theta\bigl(\lambda+\textstyle\sum_i x_i\bigr)\bigr\}.
\]
So, denoting by $\mu_n$ the Lebesgue measure on $\mathbb R_+^n$, the density of the posterior predictive probability given $x$ is
\[
\frac{d(P^*_{x,n})^{P_n}}{d\mu_n}(x')=\int_\Theta p_{\theta,n}(x')\,p^*_{x,n}(\theta)\,d\theta=\frac{(2n)!}{n!}\cdot\frac{\bigl(\lambda+\sum_i x_i\bigr)^{n+1}}{\bigl(\lambda+\sum_i x'_i+\sum_i x_i\bigr)^{2n+1}}.
\]
According to the previous results, this is the Bayes estimator of the joint density $p_{\theta,n}$ for the loss function $W'_n(q,p):=\bigl(\int_{\mathbb R_+^n}|q-p|\,d\mu_n\bigr)^2$, while the posterior predictive distribution $\bigl(P^*_{x,n}\bigr)^{P_n}$ is the Bayes estimator of the sampling distribution $P_\theta^n$ for the squared total variation loss function on $(\Omega^n,\mathcal A^n)$.

Moreover, the image $M^*_n(x,\cdot):=\bigl[\bigl(P^*_{x,n}\bigr)^{P_n}\bigr]^I=I\bigl(\bigl(P^*_{x,n}\bigr)^{P_n}\bigr)$ is the Bayes estimator of the probability distribution $P_\theta$ for the squared total variation loss function on $(\Omega,\mathcal A)$, and its density, for $x'>0$,
\[
\frac{dM^*_n(x,\cdot)}{d\mu}(x')=\int_0^\infty p_\theta(x')\,p^*_{x,n}(\theta)\,d\theta=\frac{(n+1)\bigl(\lambda+\sum_{i=1}^n x_i\bigr)^{n+1}}{\bigl(\lambda+x'+\sum_{i=1}^n x_i\bigr)^{n+2}},
\]
is the Bayes estimator of the density $p_\theta$ for the $L^1$-squared loss function $W'$. $\Box$
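The closed form obtained in Example 2 for the Bayes estimator of the density $p_\theta$ can be confirmed by numerical integration against the posterior; the sketch below (with illustrative values of $\lambda$ and of the sample) compares the two.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

lam = 2.0                                    # prior parameter (illustrative)
x = np.array([0.5, 1.2, 0.3, 2.1])           # observed sample (illustrative)
n, s = len(x), x.sum()

def marginal_predictive_pdf(x_new):
    """Closed form of Example 2: (n+1) (lam + s)^(n+1) / (lam + x_new + s)^(n+2)."""
    return (n + 1) * (lam + s) ** (n + 1) / (lam + x_new + s) ** (n + 2)

def marginal_predictive_quad(x_new):
    """Defining integral of p_theta(x_new) against the posterior Gamma(n+1, rate = lam + s)."""
    post = stats.gamma(a=n + 1, scale=1.0 / (lam + s))
    return quad(lambda t: t * np.exp(-t * x_new) * post.pdf(t), 0.0, np.inf)[0]

x_new = 0.7
print(marginal_predictive_pdf(x_new), marginal_predictive_quad(x_new))   # the two values agree
```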
Example 3 Let $P_\theta$ be the Poisson distribution with parameter $\theta>0$, whose probability function (or density with respect to the counting measure $\mu$ on $\mathbb N$) is $p_\theta(k)=\exp\{-\theta\}\theta^k/k!$ for $k\in\mathbb N$. So $P_\theta^n$ is the joint distribution of a sample of size $n$ from a Poisson distribution with parameter $\theta$, and its probability function (or density with respect to the counting measure $\mu_n$ on $\mathbb N^n$) is
\[
p_{\theta,n}(k)=\frac{\exp\{-n\theta\}\,\theta^{|k|}}{\prod_{i=1}^n(k_i!)}
\]
for $k=(k_1,\dots,k_n)\in\mathbb N^n$, where $|k|:=\sum_{i=1}^n k_i$.

Consider the prior distribution $Q:=G(1,\lambda^{-1})$ for some known $\lambda>0$. The posterior distribution given $k\in\mathbb N^n$ is the gamma distribution $G\bigl(|k|+1,(\lambda+n)^{-1}\bigr)$, whose density is
\[
p^*_{k,n}(\theta)=\frac{(\lambda+n)^{|k|+1}}{(|k|)!}\,\theta^{|k|}\exp\{-\theta(\lambda+n)\}.
\]
So the probability function of the posterior predictive probability given $k\in\mathbb N^n$ is
\[
\frac{d(P^*_{k,n})^{P_n}}{d\mu_n}(k')=\int_\Theta p_{\theta,n}(k')\,p^*_{k,n}(\theta)\,d\theta=\frac{(|k'|+|k|)!}{\prod_{i=1}^n(k'_i!)\cdot(|k|)!}\cdot\frac{(\lambda+n)^{|k|+1}}{(\lambda+2n)^{|k'|+|k|+1}}.
\]
According to the previous results, this is the Bayes estimator of the joint density $p_{\theta,n}$ for the loss function $W'_n(q,p):=\bigl(\int_{\mathbb N^n}|q-p|\,d\mu_n\bigr)^2$, while the posterior predictive distribution $\bigl(P^*_{k,n}\bigr)^{P_n}$ is the Bayes estimator of the sampling distribution $P_\theta^n$ for the squared total variation loss function on $\mathbb N^n$.

Moreover, the image $M^*_n(k,\cdot):=\bigl[\bigl(P^*_{k,n}\bigr)^{P_n}\bigr]^I=I\bigl(\bigl(P^*_{k,n}\bigr)^{P_n}\bigr)$ is the Bayes estimator of the probability distribution $P_\theta$ for the squared total variation loss function on $\mathbb N$, and its probability function, for $k'\ge0$,
\[
\frac{dM^*_n(k,\cdot)}{d\mu}(k')=\int_0^\infty p_\theta(k')\,p^*_{k,n}(\theta)\,d\theta=\frac{(k'+|k|)!}{k'!\cdot(|k|)!}\cdot\frac{(\lambda+n)^{|k|+1}}{(\lambda+n+1)^{k'+|k|+1}},
\]
is the Bayes estimator of the probability function $p_\theta$ for the loss function $W'$. $\Box$
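The marginal predictive probability function of Example 3 is in fact a negative binomial probability function; the sketch below (with illustrative values) checks the closed form against scipy's negative binomial, a convenient way to evaluate it in practice.

```python
import numpy as np
from scipy import stats
from scipy.special import factorial

lam = 2.0                              # prior parameter (illustrative)
k = np.array([3, 1, 4, 2])             # observed counts (illustrative)
n, K = len(k), k.sum()                 # K plays the role of |k| = sum_i k_i

def marginal_predictive_pmf(k_new):
    """Closed form of Example 3: (k'+|k|)!/(k'! |k|!) * (lam+n)^(|k|+1) / (lam+n+1)^(k'+|k|+1)."""
    return (factorial(k_new + K) / (factorial(k_new) * factorial(K))
            * (lam + n) ** (K + 1) / (lam + n + 1) ** (k_new + K + 1))

# The closed form is a negative binomial pmf with r = |k| + 1 and p = (lam+n)/(lam+n+1)
nb = stats.nbinom(K + 1, (lam + n) / (lam + n + 1))
for k_new in range(5):
    print(k_new, marginal_predictive_pmf(k_new), nb.pmf(k_new))   # the two columns agree
```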
Example 4 Let $P_\theta$ be the Bernoulli distribution with parameter $\theta\in(0,1)$, whose probability function is $p_\theta(k):=\theta^k(1-\theta)^{1-k}$, $k=0,1$. So $P_\theta^n$ is the joint distribution of a sample of size $n$ from a Bernoulli distribution with parameter $\theta$, and its probability function is $p_{\theta,n}(k)=\theta^{|k|}(1-\theta)^{n-|k|}$, $k\in\{0,1\}^n$, where $|k|:=\sum_{i=1}^n k_i$. Consider the uniform distribution on the unit interval as prior distribution. So the posterior distribution given $k\in\{0,1\}^n$ is the beta distribution $P^*_{k,n}=B(|k|+1,\,n-|k|+1)$ with parameters $|k|+1$ and $n-|k|+1$. Hence, denoting by $\mu_n$ the counting measure on $\{0,1\}^n$ and by $\beta$ the Euler beta function, the probability function of the posterior predictive probability given $k\in\{0,1\}^n$ is
\[
\frac{d(P^*_{k,n})^{P_n}}{d\mu_n}(k')=\int_\Theta p_{\theta,n}(k')\,p^*_{k,n}(\theta)\,d\theta=\frac{\beta(|k|+|k'|+1,\,2n-|k|-|k'|+1)}{\beta(|k|+1,\,n-|k|+1)}=\frac{\Gamma(n+2)}{\Gamma(2n+2)}\cdot\frac{(|k'|+|k|)!\,(2n-|k'|-|k|)!}{(|k|)!\,(n-|k|)!}.
\]
This is the Bayes estimator of the joint probability function $p_{\theta,n}$ for the loss function $W'_n(q,p):=\bigl(\int_{\{0,1\}^n}|q-p|\,d\mu_n\bigr)^2$, while the posterior predictive distribution $\bigl(P^*_{k,n}\bigr)^{P_n}$ is the Bayes estimator of the sampling distribution $P_\theta^n$ for the squared total variation loss function on $\{0,1\}^n$.

Moreover, the image $M^*_n(k,\cdot):=\bigl[\bigl(P^*_{k,n}\bigr)^{P_n}\bigr]^I=I\bigl(\bigl(P^*_{k,n}\bigr)^{P_n}\bigr)$ is the Bayes estimator of the probability distribution $P_\theta$ for the squared total variation loss function on $\{0,1\}$, and its probability function, for $k'\in\{0,1\}$,
\[
\frac{dM^*_n(k,\cdot)}{d\mu}(k')=\int_0^1 p_\theta(k')\,p^*_{k,n}(\theta)\,d\theta=\frac{\beta(k'+|k|+1,\,n-|k|-k'+2)}{\beta(|k|+1,\,n-|k|+1)},
\]
is the Bayes estimator of the probability function $p_\theta$ for the $L^1$-squared loss function $W'$. $\Box$

Proof (of Theorem 1) (i) Notice that, writing $f_A(\theta):=P_\theta(A)$,
\[
\bigl(P^*_\omega\bigr)^P(A)=\int_\Theta P_\theta(A)\,dP^*_\omega(\theta)=E_{P^*_\omega}(f_A),
\]
which, as a consequence of Jensen's inequality (see Lehmann et al. (1998), p. 228), is the Bayes estimator of $f_A$ for the quadratic error loss function. In the same way, if $X$ is a real integrable statistic on $(\Omega,\mathcal A)$ and $f(\theta):=E_\theta(X)$, we have that
\[
E_{(P^*_\omega)^P}(X)=\int_\Theta\int_\Omega X(\omega')\,dP_\theta(\omega')\,dP^*_\omega(\theta)=E_{P^*_\omega}(f)
\]
is the Bayes estimator of $f$, the mean of $X$.

(ii) According to (i), given $A\in\mathcal A$,
\[
\int_{\Omega\times\Theta}\bigl|\bigl(P^*_\omega\bigr)^P(A)-P_\theta(A)\bigr|^2\,d\Pi(\omega,\theta)\le\int_{\Omega\times\Theta}\bigl|X(\omega)-P_\theta(A)\bigr|^2\,d\Pi(\omega,\theta)
\]
for any real measurable function $X$ on $(\Omega,\mathcal A)$. If $\mathcal A$ is a separable $\sigma$-field, there exists a countable algebra $\mathcal A_0$ such that $\mathcal A=\sigma(\mathcal A_0)$. In particular, it follows that
\[
\sup_{A\in\mathcal A}|M(\omega,A)-P_\theta(A)|=\sup_{A\in\mathcal A_0}|M(\omega,A)-P_\theta(A)|
\]
is $(\mathcal A\otimes\mathcal T)$-measurable. Given $(\omega,\theta)\in\Omega\times\Theta$, let
\[
C(\omega,\theta):=\sup_{A\in\mathcal A_0}\bigl|\bigl(P^*_\omega\bigr)^P(A)-P_\theta(A)\bigr|
\]
and, given $n\in\mathbb N$, choose $A_n\in\mathcal A_0$ so that $C-\frac1n\le\bigl|\bigl(P^*_\omega\bigr)^P(A_n)-P_\theta(A_n)\bigr|$. It follows from this (all the quantities involved lying in $[0,1]$) that
\[
\int_{\Omega\times\Theta}C^2\,d\Pi\le\int_{\Omega\times\Theta}\bigl|\bigl(P^*_\omega\bigr)^P(A_n)-P_\theta(A_n)\bigr|^2\,d\Pi(\omega,\theta)+\frac2n\le\int_{\Omega\times\Theta}\sup_{A\in\mathcal A_0}\bigl|M(\omega,A)-P_\theta(A)\bigr|^2\,d\Pi(\omega,\theta)+\frac2n,
\]
and this gives the proof, as $n$ is arbitrary. To refine the proof from a measure-theoretical point of view, a judicious use of the Ryll-Nardzewski and Kuratowski measurable selection theorem would also be helpful.
See the details in Remark 3 at the end of the section.

(iii) It follows from (ii) that, to estimate the density $p_\theta$, the posterior predictive density
\[
b^*_{Q,\omega}(\omega'):=\frac{d(P^*_\omega)^P}{d\mu}(\omega')
\]
minimizes the Bayes mean risk for the loss function $W'(q,p):=\bigl(\int|q-p|\,d\mu\bigr)^2$, i.e.,
\[
E_\Pi\Bigl[\Bigl(\int|b^*_{Q,\omega}-p_\theta|\,d\mu\Bigr)^2\Bigr]\le E_\Pi\Bigl[\Bigl(\int|m(\omega,\cdot)-p_\theta|\,d\mu\Bigr)^2\Bigr]
\]
for any measurable function $m:\Omega\times\Omega\rightarrow[0,\infty)$ such that $\int_\Omega m(\omega,\omega')\,d\mu(\omega')=1$ for every $\omega$. $\Box$

Proof (of Theorem 2) (i) Given $A\in\mathcal A^n$, Theorem 1.(i) shows that the posterior predictive probability $\bigl(P^*_{\omega,n}\bigr)^{P_n}(A)$ of $A$ is the Bayes estimator of $f_A(\theta):=P_\theta^n(A)$ in the product Bayesian statistical experiment, as
\[
\bigl(P^*_{\omega,n}\bigr)^{P_n}(A)=\int_\Theta P_\theta^n(A)\,dP^*_{\omega,n}(\theta)=E_{P^*_{\omega,n}}(f_A),
\]
i.e.,
\[
\int_{\Omega^n\times\Theta}\bigl|\bigl(P^*_{\omega,n}\bigr)^{P_n}(A)-P_\theta^n(A)\bigr|^2\,d\Pi_n(\omega,\theta)\le\int_{\Omega^n\times\Theta}\bigl|X(\omega)-P_\theta^n(A)\bigr|^2\,d\Pi_n(\omega,\theta)
\]
for any other estimator $X:(\Omega^n,\mathcal A^n)\rightarrow\mathbb R$ of $f_A$. In particular, given $A\in\mathcal A$, applying this result to $I^{-1}(A)=A\times\Omega^{n-1}\in\mathcal A^n$, we obtain that
\[
\int_{\Omega^n\times\Theta}\bigl|\bigl(P^*_{\omega,n}\bigr)^{P_n}(I^{-1}(A))-P_\theta(A)\bigr|^2\,d\Pi_n(\omega,\theta)\le\int_{\Omega^n\times\Theta}\bigl|X(\omega)-P_\theta(A)\bigr|^2\,d\Pi_n(\omega,\theta)
\]
for any other estimator $X:(\Omega^n,\mathcal A^n)\rightarrow\mathbb R$ of $g_A(\theta):=P_\theta(A)$.

(ii) $\mathcal A$ being a separable $\sigma$-field, there exists a countable algebra $\mathcal A_0$ such that $\mathcal A=\sigma(\mathcal A_0)$. In particular, it follows that
\[
\sup_{A\in\mathcal A}|M(\omega,A)-P_\theta(A)|=\sup_{A\in\mathcal A_0}|M(\omega,A)-P_\theta(A)|
\]
is $(\mathcal A^n\otimes\mathcal T)$-measurable. Given $(\omega,\theta)\in\Omega^n\times\Theta$, let
\[
C_n(\omega,\theta):=\sup_{A\in\mathcal A_0}\bigl|\bigl(P^*_{\omega,n}\bigr)^{P_n}(I^{-1}(A))-P_\theta(A)\bigr|
\]
and, given $k\in\mathbb N$, choose $A_k\in\mathcal A_0$ so that $C_n-\frac1k\le\bigl|\bigl(P^*_{\omega,n}\bigr)^{P_n}(I^{-1}(A_k))-P_\theta(A_k)\bigr|$. It follows (all the quantities involved lying in $[0,1]$) that
\[
\int_{\Omega^n\times\Theta}C_n^2\,d\Pi_n\le\int_{\Omega^n\times\Theta}\bigl|\bigl(P^*_{\omega,n}\bigr)^{P_n}(I^{-1}(A_k))-P_\theta(A_k)\bigr|^2\,d\Pi_n(\omega,\theta)+\frac2k\le\int_{\Omega^n\times\Theta}\sup_{A\in\mathcal A_0}\bigl|M(\omega,A)-P_\theta(A)\bigr|^2\,d\Pi_n(\omega,\theta)+\frac2k
\]
for any Markov kernel $M:(\Omega^n,\mathcal A^n)\succ\!\rightarrow(\Omega,\mathcal A)$ and, $k$ being arbitrary, this proves that
\[
M^*_n(\omega,A):=\bigl(P^*_{\omega,n}\bigr)^{P_n}(I^{-1}(A))
\]
is the Bayes estimator of $f(\theta):=P_\theta$ for the squared total variation loss function in the Bayesian statistical experiment $(\Omega^n,\mathcal A^n,\{P_\theta^n:\theta\in(\Theta,\mathcal T,Q)\})$ corresponding to an $n$-sized sample of the observed distribution. See Remark 3 below.

(iii) Note that, given $A\in\mathcal A$, Fubini's theorem yields
\[
\bigl(P^*_{\omega,n}\bigr)^{P_n}(I^{-1}(A))=\int_\Theta P_\theta(A)\,dP^*_{\omega,n}(\theta)=\int_A\int_\Theta p_\theta(\omega')\,p^*_{\omega,n}(\theta)\,dQ(\theta)\,d\mu(\omega'),
\]
where $p^*_{\omega,n}$ denotes the posterior density with respect to the prior distribution $Q$. Hence, for $\omega\in\Omega^n$, the $\mu$-density of $M^*_n(\omega,\cdot)$ is
\[
\frac{dM^*_n(\omega,\cdot)}{d\mu}(\omega')=\int_\Theta p_\theta(\omega')\,p^*_{\omega,n}(\theta)\,dQ(\theta),
\]
and this is the Bayes estimator of the sampling density $p_\theta$ for the loss function $W'$. $\Box$

Remark 3 (A precision on measure-theoretical technicalities in the proofs of the previous results) We detail the proof of Theorem 1.(ii), that of Theorem 2.(ii) (and even that of the last remark of Section 3) being similar.
It follows from Theorem 1.(i) that, given $(\omega,\theta)\in\Omega\times\Theta$, and writing
\[
C(\omega,\theta):=\sup_{A\in\mathcal A_0}\bigl|\bigl(P^*_\omega\bigr)^P(A)-P_\theta(A)\bigr|,
\]
we have that, given $n\in\mathbb N$, there exists $A_n(\omega,\theta)\in\mathcal A_0$ so that
\[
C(\omega,\theta)-\frac1n\le\bigl|\bigl(P^*_\omega\bigr)^P(A_n(\omega,\theta))-P_\theta(A_n(\omega,\theta))\bigr|.
\]
To continue the proof we will use the Ryll-Nardzewski and Kuratowski measurable selection theorem as it appears in Bogachev (2007), p. 36. With the notations of that book, we take $(T,\mathcal M)=(\Omega\times\Theta,\mathcal A\otimes\mathcal T)$ and $X=\mathcal A_0$ (the countable algebra generating $\mathcal A$). Given $n\in\mathbb N$, let us consider the map $S_n:\Omega\times\Theta\rightarrow\mathcal P(X)$ defined by
\[
S_n(\omega,\theta)=\Bigl\{A\in\mathcal A_0:\ C(\omega,\theta)-\frac1n\le\bigl|\bigl(P^*_\omega\bigr)^P(A)-P_\theta(A)\bigr|\Bigr\}.
\]
We have that $\emptyset\ne S_n(\omega,\theta)\subset X$ and $S_n(\omega,\theta)$ is closed for the discrete topology on $\mathcal A_0$. Moreover, given an open set $U\subset\mathcal A_0$, $\{(\omega,\theta):S_n(\omega,\theta)\cap U\ne\emptyset\}\in\mathcal A\otimes\mathcal T$ because, given $A\in\mathcal A_0$,
\[
\{(\omega,\theta):A\in S_n(\omega,\theta)\}=\Bigl\{(\omega,\theta):\ C(\omega,\theta)-\bigl|\bigl(P^*_\omega\bigr)^P(A)-P_\theta(A)\bigr|\le\frac1n\Bigr\}\in\mathcal A\otimes\mathcal T.
\]
So, according to the measurable selection theorem cited above, there exists a measurable map $s_n:(\Omega\times\Theta,\mathcal A\otimes\mathcal T)\rightarrow(\mathcal A_0,\mathcal P(\mathcal A_0))$ such that $s_n(\omega,\theta)\in S_n(\omega,\theta)$ for every $(\omega,\theta)$ or, which is the same,
\[
C(\omega,\theta)-\frac1n\le\bigl|\bigl(P^*_\omega\bigr)^P(s_n(\omega,\theta))-P_\theta(s_n(\omega,\theta))\bigr|.
\]
It follows (all the quantities involved lying in $[0,1]$) that
\[
\int_{\Omega\times\Theta}C(\omega,\theta)^2\,d\Pi(\omega,\theta)\le\int_{\Omega\times\Theta}\bigl|\bigl(P^*_\omega\bigr)^P(s_n(\omega,\theta))-P_\theta(s_n(\omega,\theta))\bigr|^2\,d\Pi(\omega,\theta)+\frac2n\le\int_{\Omega\times\Theta}\sup_{A\in\mathcal A_0}\bigl|M(\omega,A)-P_\theta(A)\bigr|^2\,d\Pi(\omega,\theta)+\frac2n,
\]
which gives the proof, as $n$ is arbitrary. $\Box$

Acknowledgements This paper has been supported by the Junta de Extremadura (Spain) under the grant Gr18016.
References
– Barra, J.R. (1971). Notions Fondamentales de Statistique Mathématique. Dunod, Paris.
– Bean, A., Xu, X., MacEachern, S. (2016). Transformations and Bayesian density estimation. Electronic Journal of Statistics, 10, 3355–3373.
– Bogachev, V.I. (2007). Measure Theory, Vol. II. Springer, Berlin.
– Bolstad, W.M. (2004). Introduction to Bayesian Statistics. Wiley, New Jersey.
– Ferguson, T.S. (1983). Bayesian density estimation by mixtures of normal distributions. In: Recent Advances in Statistics, pp. 287–302. Academic Press, New York.
– Florens, J.P., Mouchart, M., Rolin, J.M. (1990). Elements of Bayesian Statistics. Marcel Dekker, New York.
– Geisser, S. (1993). Predictive Inference: An Introduction. Springer Science+Business Media, Dordrecht.
– Gelman, A., Carlin, J.B., Stern, H.S., Dunson, D.B., Vehtari, A., Rubin, D.B. (2014). Bayesian Data Analysis, 3rd ed. CRC Press.
– Ghosal, S., van der Vaart, A. (2017). Fundamentals of Nonparametric Bayesian Inference. Cambridge University Press, Cambridge, UK.
– Ghosh, J.K., Ramamoorthi, R.V. (2003). Bayesian Nonparametrics. Springer, New York.
– Lehmann, E.L., Casella, G. (1998). Theory of Point Estimation, 2nd ed. Springer, New York.
– Lijoi, A., Prünster, I. (2010). Models beyond the Dirichlet process. In: Hjort, N.L., Holmes, C., Müller, P., Walker, S.G. (eds.), Bayesian Nonparametrics, Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge.
– Lo, A.Y. (1984). On a class of Bayesian nonparametric estimates. I. Density estimates. Annals of Statistics, 12(1), 351–357.
– Marchand, É., Sadeghkhani, A. (2018). Predictive density estimation with additional information. Electronic Journal of Statistics, 12, 4209–4238.