Bayesian finite mixtures: a note on prior specification and posterior computation
Agostino Nobile∗
University of Glasgow, Scotland
May 2005

∗ Department of Statistics, University of Glasgow, Glasgow G12 8QW, Scotland. Email: [email protected]
Abstract
A new method for the computation of the posterior distribution of the number k of components in a finite mixture is presented. Two aspects of prior specification are also studied: an argument is made for the use of a Poi(1) distribution as the prior for k; and methods are given for the selection of hyperparameter values in the mixture of normals model, with natural conjugate priors on the component parameters.

Keywords: Galaxy data, Marginal Likelihood, Markov Chain Monte Carlo, Mixtures of Normals.
1 Introduction

Finite mixture distributions have become widely used as a tool of semi-parametric inference: they partake of the conceptual simplicity of parametric models and of the flexibility of non-parametric ones. This paper is a contribution to the Bayesian analysis of finite mixtures with an unspecified number of components. I give arguments to support the use of a Poi(1) prior for the number of components and present a new method for the numerical computation of its posterior. The method exploits a fundamental probability identity already used by Chib (1995), but combines it with the representation of mixture marginal likelihoods given in Nobile (2004). I also discuss a more specific topic, hyperparameter selection in a finite mixture of univariate normals. Throughout the paper, the galaxy data set is used for illustrative purposes.

The remainder of this section provides a brief introduction to Bayesian finite mixtures, representations of the associated marginal likelihoods and the mixture of normals model. Section 2 deals with the estimation of marginal likelihoods using the frequency of empty components in a Markov chain Monte Carlo sample of the mixture allocations. Section 3 argues that the structure of the model suggests the
Poi(1) distribution as a suitable prior for the number of components, when no substantive information on it is available. Section 4 concerns the more practical issue of hyperparameter determination in the mixture of normals model, when natural conjugate priors on the means and variances of the components are employed.

1.1 Bayesian finite mixtures

A finite mixture is a distribution with density, with respect to some underlying measure, given by

    f(x) = Σ_{j=1}^k λ_j p_j(x | θ_j).   (1)

The weights λ_j are non-negative and sum to 1, while the component densities p_j(· | θ_j) belong to some known parametric family. Observations x_1, ..., x_n are regarded as proceeding from the distribution (1) and interest lies in the number of components k and, conditional on k, in the weights λ and the components' parameters θ.

The model can be rewritten by introducing a latent allocation vector g = (g_1, ..., g_n), with g_i ∈ {1, ..., k} denoting the mixture component that generated the i-th observation:

    Pr[g_i = j | k, λ] = λ_j,  i = 1, ..., n, independently,
    x_i | k, g, λ, θ ~ind p_j(x_i | θ_j),  j = g_i,  i = 1, ..., n.   (2)

In the Bayesian analysis of the model, typically one assumes that λ | k ~ Dir(α_1, ..., α_k), where the α_j's are fixed constants. Also, the component parameters θ_j are assumed a priori independent, conditionally on k and, possibly, a vector of hyperparameters φ:

    π(θ | k, φ) = Π_{j=1}^k π_j(θ_j | φ_j).

If a prior distribution π(k) is specified, then one can obtain a sample from the joint posterior of (k, λ, θ) by means of Markov chain Monte Carlo methods; see e.g. Richardson and Green (1997), Phillips and Smith (1996), Stephens (2000a), Nobile and Fearnside (2005). Inference about λ and θ is not straightforward, because the likelihood is invariant with respect to permutations of the components' labels. Achieving identifiability by imposing constraints on the parameters does not always work and other methods have been proposed; see Richardson and Green (1997) and its discussion (especially the contributions of G. Celeux and M. Stephens), Celeux, Hurn and Robert (2000), Stephens (2000b), Frühwirth-Schnatter (2001), Nobile and Fearnside (2005).

1.2 Marginal likelihood representations

An alternative to sampling from the posterior of (k, λ, θ) consists of estimating the marginal likelihoods f_k of the mixture model with k components:

    f_k := f(x | k) = ∫∫ f(x | k, λ, θ) π(λ, θ | k) dλ dθ,  k = 1, 2, ..., k_max.

Each marginal likelihood estimate makes use of MCMC output for a model with fixed k. The estimates can be used to compute Bayes factors for k vs. k − 1 components, or the posterior distribution of k, π(k | x) ∝ π(k) f_k. It should be noted that estimation of the marginal likelihood from MCMC output is not as simple as other posterior inference using MCMC, and as a consequence several methods have been proposed; see e.g. Chib (1995), Raftery (1996), DiCiccio et al. (1997), Gelman and Meng (1998) and references therein.

The marginal likelihoods can be rewritten as

    f_k = Σ_{g ∈ G_k} f(g | k) f(x | k, g)   (3)

where the sum extends over the set G_k of all the allocation vectors with entries less than or equal to k; see e.g. Nobile (1994, 2004). In equation (3),

    f(g | k) = ∫ f(g | k, λ) π(λ | k) dλ = [Γ(α_{+k}) / Γ(α_{+k} + n)] Π_{j=1}^k [Γ(α_j + n_j) / Γ(α_j)],

with α_{+k} = Σ_{j=1}^k α_j and n_j equal to the number of observations that g allocates to component j.
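Since f(g | k) is available in closed form, it is straightforward to evaluate. The following sketch (my own illustration, not code from the paper; function and variable names are mine) computes its logarithm, using log-gamma functions for numerical stability.

```python
import numpy as np
from scipy.special import gammaln

def log_allocation_prior(g, k, alpha):
    """Log of f(g | k) = [Γ(α₊ₖ)/Γ(α₊ₖ + n)] Πⱼ Γ(αⱼ + nⱼ)/Γ(αⱼ),
    the allocation mass obtained by integrating the Dirichlet weights out of (2).

    g     : allocations, coded 0, ..., k-1
    alpha : array of the k Dirichlet hyperparameters α₁, ..., α_k
    """
    g = np.asarray(g)
    alpha = np.asarray(alpha, dtype=float)
    n = g.size
    n_j = np.bincount(g, minlength=k)       # occupancy counts n_1, ..., n_k
    a_plus = alpha.sum()                    # α₊ₖ = Σⱼ αⱼ
    return (gammaln(a_plus) - gammaln(a_plus + n)
            + np.sum(gammaln(alpha + n_j) - gammaln(alpha)))

# Example: n = 5 observations allocated among k = 3 components, αⱼ = 1
print(log_allocation_prior([0, 0, 1, 1, 2], k=3, alpha=np.ones(3)))
```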
The other term in the right hand side of (3), f(x | k, g), is obtained by integrating f(x | k, g, θ) from (2) with respect to the prior distribution of θ | k. Although this integration can be performed in closed form only for some prior distributions, notably natural conjugate priors on θ, representation (3) is always valid.

Under the assumption that the Dirichlet hyperparameters α_j and the prior distributions π_j(· | φ_j) remain the same for fixed j as k varies, the marginal likelihoods enjoy further representations. Partition the set of allocation vectors G_k as

    G_k = ∪_{t=1}^k G*_t,   G*_t ∩ G*_s = ∅, t ≠ s,

where G*_t is the set of allocation vectors which assign at least one observation to component t and none to higher components. Also, let f*_t be the portion of the marginal likelihood f_t that accounts for vectors g allocating at least one observation to component t and none to higher:

    f*_t = Σ_{g ∈ G*_t} f(g | t) f(x | t, g).   (4)

Then one can show (see Nobile 2004, page 2049) that, for all g ∈ G*_t with t < k,

    f(x | k, g) = f(x | t, g)   and   f(g | k)/f(g | t) = [Γ(α_{+k})/Γ(α_{+k} + n)] [Γ(α_{+t} + n)/Γ(α_{+t})] =: a_{kt}   (5)

and that

    f_k = Σ_{t=1}^k a_{kt} f*_t   (6)
        = a_{k,k−1} f_{k−1} + f*_k.   (7)

If the prior distribution of (λ, θ) | k is invariant to permutations of the components labels, a stronger result is available. Let G_{th} be the subset of G*_t consisting of allocations with h ≤ t non-empty components. In particular, any vector g ∈ G_{hh} ⊂ G_h assigns at least one observation to each mixture component 1, ..., h. Let f†_h be the portion of f_h which corresponds to allocations with no empty components:

    f†_h = Σ_{g ∈ G_{hh}} f(g | h) f(x | h, g).

Then (Nobile 2004, page 2053)

    f_k = Σ_{h=1}^{k∧n} C(k,h) a_{kh} f†_h,   (8)

where C(k,h) denotes the binomial coefficient k!/(h!(k−h)!). For related representations see Ishwaran, James and Sun (2001).

1.3 Mixture of normals

The method to be presented in the following section is of general applicability. Since mixtures of univariate normals will be used as an illustration, I introduce here some notation. It is assumed that the component densities p_j(· | θ_j) are normal with mean m_j and variance r_j^{−1}:

    x_i | k, g, λ, θ ~ind N(m_j, r_j^{−1}),  j = g_i,  i = 1, ..., n.

A natural conjugate prior is assumed for θ_j = (m_j, r_j), j = 1, ..., k:

    r_j ~ind Ga(γ, δ),   m_j | r_j ~ind N(µ, {τ r_j}^{−1}),   (9)

with E(r_j) = γ/δ. See Diebolt and Robert (1994) or Nobile and Fearnside (2005) for more details. Other priors on θ, such as the independent prior used by Richardson and Green (1997), could be used as well.

The prior distribution (9) requires the specification of four hyperparameters. The overall mean µ is set to a round value close to the sample mean x̄. The shape parameter γ is half the degrees of freedom of the prior predictive t distribution. I choose γ = 2, to have a t₄ prior predictive, with relatively thick tails but finite second order moments. The choice of the scale parameter δ and of τ, the prior ratio between the within-components variance and the variance of the means, is discussed in Section 4.

2 Estimating marginal likelihoods from the frequency of empty components

For a given parametric model f(x | θ) and prior distribution f(θ), the marginal likelihood of the observed data x is defined as f(x) = ∫ f(x | θ) f(θ) dθ. Using Bayes theorem, f(x) can be rewritten as

    f(x) = f(θ) f(x | θ) / f(θ | x)   (10)

where f(θ) and f(x | θ) are assumed computable, including their normalizing constants, and the formula holds for any parameter value θ. Expression (10) forms the basis of a method of marginal likelihood estimation, see Chib (1995) and Raftery (1996).
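Identity (10) is easiest to appreciate in a fully conjugate toy model, where every factor has a closed form. The following sketch (my own illustration, not part of the paper; hyperparameters and data are arbitrary) checks the identity in a Beta-Bernoulli model.

```python
import numpy as np
from scipy.stats import beta
from scipy.special import betaln

# Conjugate Beta-Bernoulli toy model: every factor in (10) is available
# in closed form, so the identity can be checked exactly.
a, b = 2.0, 3.0                      # Beta prior hyperparameters (illustrative)
x = np.array([1, 0, 1, 1, 0, 1])     # Bernoulli data (illustrative)
s, n = int(x.sum()), x.size

# Direct marginal likelihood: f(x) = B(a + s, b + n - s) / B(a, b)
log_fx_direct = betaln(a + s, b + n - s) - betaln(a, b)

# Via identity (10), at an arbitrary parameter value theta:
theta = 0.37
log_prior = beta.logpdf(theta, a, b)                        # log f(theta)
log_lik = s * np.log(theta) + (n - s) * np.log(1 - theta)   # log f(x | theta)
log_post = beta.logpdf(theta, a + s, b + n - s)             # log f(theta | x)
log_fx_chib = log_prior + log_lik - log_post

print(log_fx_direct, log_fx_chib)    # agree up to floating point error
```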
In short, although typically the posterior f(θ | x) cannot be evaluated exactly, an estimate of it at some parameter value θ can be obtained using a Monte Carlo sample; substituting this estimate into (10) yields an estimate of f(x).

In the context of Section 1.2, the marginal likelihood for the model with k components can be written as

    f_k = f(g | k) f(x | k, g) / f(g | k, x).   (11)

Here the allocation vector g plays the role of θ in the above discussion and everything is conditional on k. Since (11) holds for all g ∈ G_k, it still holds if one sums both numerator and denominator over any non-empty set E:

    f_k = [ Σ_{g ∈ E} f(g | k) f(x | k, g) ] / [ Σ_{g ∈ E} f(g | k, x) ].   (12)

Letting E = G*_k, the denominator of (12) is the posterior probability of G*_k, while the numerator equals f*_k = f_k − a_{k,k−1} f_{k−1}, using equations (4) and (7). One then obtains

    f_k = (f_k − a_{k,k−1} f_{k−1}) / Pr[G*_k | k, x]

and after rearranging

    f_k / f_{k−1} = a_{k,k−1} / (1 − Pr[G*_k | k, x]).   (13)

The left hand side of (13) is the Bayes factor B_{k,k−1} for the model with k components against the model with k − 1 components. The numerator a_{k,k−1} is a known constant, while the denominator is the posterior probability, according to the model with k components, that the k-th component is empty, which can be easily estimated using an MCMC sample from f(g | k, x). In some mixture models f_1 = f*_1 is computable exactly; if this is the case, estimates of the marginal likelihoods, if needed, can be readily produced from the Bayes factors. Otherwise, one can still obtain estimates of normalized marginal likelihoods, by setting f_1 = 1 and then rescaling the sequence of f_k's.

Using formula (13) is somewhat wasteful, since it only employs the fixed-k MCMC sample to estimate Pr[G*_k | k, x]: the MCMC sample for k + 1 components can be used to estimate Pr[G*_k | k + 1, x], a quantity that is related to the probability in (13). Let Pr[G*_t | k, x] with t ≤ k be the posterior probability, conditional on k components, that component t is non-empty and components t + 1 through k are empty. One can show, see the Appendix, that

    f*_{t+1} / f*_t = a_{t+1,t} [ Σ_{k=t+1}^{k_max} Pr[G*_{t+1} | k, x] ] / [ Σ_{k=t+1}^{k_max} Pr[G*_t | k, x] ].   (14)

Setting f*_1 = f_1 if available, or f*_1 = 1 if not, the sequence f*_t, t = 1, ..., k_max, can be estimated by replacing the probabilities in (14) with MCMC estimates. An application of (6), followed by rescaling, then produces estimates of the normalized marginal likelihoods.

Formulae (13) and (14) do not assume that the prior on (λ, θ) | k is invariant to permutations of the components labels, only that the hyperparameters α_j, φ_j are the same for all k ≥ j. If the prior is invariant, the additional symmetry can be exploited as follows. Let G̃_{kh} = ∪_{t=h}^k G_{th} be the set of allocations g in G_k which assign observations to exactly h components. Then Pr[G̃_{kh} | k, x] is the posterior probability, conditional on k components, that h ≤ k components are non-empty. One can show, see the Appendix, that

    f†_{h+1} / f†_h = (h + 1) a_{h+1,h} [ Σ_{k=h+1}^{k_max} Pr[G̃_{k,h+1} | k, x] ] / [ Σ_{k=h+1}^{k_max} (k − h) Pr[G̃_{kh} | k, x] ].   (15)

Replacing the probabilities in (15) with MCMC estimates and setting f†_1 to 1 (or f_1 if available) yields estimates of the sequence of f†_h's; plugging these estimates in formula (8) and rescaling produces estimates of normalized marginal likelihoods.
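As a rough sketch of how formula (13) can be turned into code, suppose a separate fixed-k MCMC run has produced, for each k, an estimate of the posterior probability that the k-th component is empty. The function below (names and the illustrative probabilities are mine, and a symmetric Dirichlet prior with parameter alpha is assumed, so that α_{+k} = kα) chains these estimates into log Bayes factors.

```python
import numpy as np
from scipy.special import gammaln

def log_bayes_factors(p_empty, alpha, n):
    """Log Bayes factors log B_{k,k-1} via formula (13), for a symmetric
    Dirichlet prior with parameter alpha.

    p_empty : dict mapping k to an MCMC estimate of the posterior
              probability, under the k-component model, that component k
              is empty.
    """
    log_B = {}
    for k, p in p_empty.items():
        # log a_{k,k-1} from equation (5), with alpha_{+k} = k * alpha
        log_a = (gammaln(k * alpha) + gammaln((k - 1) * alpha + n)
                 - gammaln(k * alpha + n) - gammaln((k - 1) * alpha))
        log_B[k] = log_a - np.log(p)
    return log_B

# Purely illustrative estimates of Pr[k-th component empty | k, x]:
print(log_bayes_factors({2: 0.10, 3: 0.35, 4: 0.60}, alpha=1.0, n=82))
# Setting f_1 = 1 and cumulating the log Bayes factors gives the
# normalized marginal likelihoods, as described in the text.
```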
To illustrate the method, formula (15) was used to compute the marginal likelihood of k components for the galaxy data. This data set consists of velocity measurements (in units of 1000 km/sec) of 82 galaxies from the Corona Borealis region. Since its appearance in Roeder (1990), it has been studied by several authors; see Aitkin (2001) for an interesting comparison of likelihood and Bayesian analyses of this data set. The data was modelled as a finite mixture of univariate normals, as set out in Section 1.3. The weights hyperparameters α_j were all set to 1, while the other hyperparameters were µ = 20, τ = 0.04, γ = 2 and δ = 2; their choice is discussed in Section 4. In this example I used Gibbs sampling of the allocation vectors g, after integrating out the weights and components parameters, see Nobile and Fearnside (2005). However, the method applies equally well to the Gibbs sampling scheme involving both parameters and allocations, see for instance Diebolt and Robert (1994) and Richardson and Green (1997), as long as empty components are allowed. Each Gibbs sampler with fixed k was run for 20000 sweeps, with 1000 sweeps of burn-in. The final allocation in the run with k components served as the starting allocation for the run with k + 1 components. The estimates of the marginal likelihoods normalized to sum to 1 are displayed as line-joined dots in Figure 1. For comparison, the figure also contains the estimate of the posterior of k with uniform prior on k = 1, ..., k_max = 50 using a different method, the allocation sampler of Nobile and Fearnside (2005). This sampler was run for 1 million sweeps with a burn-in of 10000 sweeps and keeping only one draw every 10.
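For concreteness, here is a deliberately naive sketch of one sweep of a collapsed Gibbs sampler over the allocations (my own illustration, not the implementation used in the paper). The argument log_joint stands for the log of f(g | k) f(x | k, g); a real implementation would update sufficient statistics incrementally rather than recompute the joint from scratch at every step.

```python
import numpy as np

def gibbs_sweep(g, k, log_joint, rng):
    """One sweep of collapsed Gibbs sampling over the allocations g.

    log_joint : function g -> log[ f(g | k) f(x | k, g) ], e.g. the
                allocation prior sketched earlier plus a closed-form
                log f(x | k, g) under a natural conjugate prior.
    """
    for i in range(g.size):
        logp = np.empty(k)
        for j in range(k):            # evaluate the full conditional of g_i
            g[i] = j
            logp[j] = log_joint(g)
        logp -= logp.max()            # stabilize before exponentiating
        p = np.exp(logp)
        g[i] = rng.choice(k, p=p / p.sum())
    return g
```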
Figure 1: Two estimates of the posterior distribution of k for the galaxy data, using a discrete uniform prior on k = 1, ..., k_max = 50. The histogram gives the frequency of the model with k components using the allocation sampler of Nobile and Fearnside (2005); the line-joined dots are the normalized marginal likelihoods computed using formula (15). See main text for the hyperparameter values used.

The agreement between the estimates from the two unrelated methods provides a welcome check on them, all the more so since visual inspection of the galaxy data suggests between three and six clusters, while the posterior of k displayed in Figure 1 assigns to this range of values a probability smaller than 0.02. If one is to believe the estimates in Figure 1, as the agreement between the two methods seems to suggest, it would seem that the posterior of k has little to tell about the number of clusters in a data set. In the next section I argue that this is not the case and that replacing the uniform prior on k with a Poi(1) distribution yields a posterior that is more suitable for inference about the number of actual groups in the data.
3 A Poi(1) prior on the number of components

In this section I assume that the prior on (λ, θ) | k is invariant to permutations of the components labels. Recall from Section 1.2 that f†_h is the part of the marginal likelihood f_h corresponding to no empty components,

    f†_h = Σ_{g ∈ G_{hh}} f(g | h) f(x | h, g),

and that representation (8) holds:

    f_k = Σ_{h=1}^{k∧n} C(k,h) a_{kh} f†_h.

To have an understanding of how formula (8) arises, look at Figure 2, which displays the nested structure of G_4, with each set of digits denoting allocation vectors with entries equal to those digits only. From formula (3), f_4 is the sum over G_4 of f(g | k = 4) f(x | k = 4, g). Formula (8) gives this sum in terms of the f†_h's, which are sums over the sets G_{hh}, h = 1, ..., 4. The constants a_{kh} serve to rescale f(g | h) to f(g | k), while the combinatorial terms C(k,h) give the number of "copies" of G_{hh} that are present in G_k.

Figure 2: The nested structure of G_k, k = 4. Each set of digits denotes a set of allocation vectors with entries equal to those digits only. Nested boxes denote G_1, ..., G_4. Ellipses enclose the sets G_{hh}, h = 1, ..., 4. In box G_k, the complement of the enclosed G_{k−1} box is the set G*_k.

A consequence of formula (8) is that the marginal likelihood of k components may derive to a large extent from allocations with fewer than k non-empty components. For instance, consider a hypothetical data set of n = 80 observations clearly clustered in nine well separated groups, to such an extent that f†_h/f†_9 is nearly 0 for h ≠ 9. Then formula (8) implies that

    f_k/f_9 = C(k,9) a_{k,9},  k ≥ 9.   (16)

With α = 1, one obtains the values reported in Table 1.

    k          9       10      11      12      13      14      15
    f_k/f_9    1.000   1.011   0.618   0.299   0.127   0.050   0.018

Table 1: Ratios of marginal likelihoods f_k/f_9 for a hypothetical data set with n = 80 observations, such that f†_h/f†_9 = 0 for h ≠ 9.

With a discrete uniform prior on k, the posterior of k then gives probability less than 1/3 to k = 9. Put differently, upper bounds on the posterior of k can be derived from representation (8). Table 2, taken from Nobile (2004), displays upper bounds corresponding to a discrete uniform prior and α = 1. Nobile (2004) contains further discussion and tables for α = 2 and α = 0.5.
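The entries of Table 1 are easy to reproduce from equations (5) and (16); the following sketch (function names are mine) does so for n = 80 and α = 1.

```python
import numpy as np
from scipy.special import gammaln

def log_a(k, t, alpha, n):
    """log a_{kt} from equation (5), symmetric Dirichlet with parameter alpha."""
    return (gammaln(k * alpha) + gammaln(t * alpha + n)
            - gammaln(k * alpha + n) - gammaln(t * alpha))

n, alpha, h = 80, 1.0, 9
for k in range(9, 16):
    log_binom = gammaln(k + 1) - gammaln(h + 1) - gammaln(k - h + 1)  # log C(k,9)
    print(k, round(float(np.exp(log_binom + log_a(k, h, alpha, n))), 3))
# prints 1.0, 1.011, 0.618, 0.299, 0.127, 0.05, 0.018 for k = 9, ..., 15
```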
The overall conclusion is that the bounds are weaker for larger sample sizes, smaller values of k and larger values of α.

Table 2: Bounds on π(k | x) for several sample sizes n, with π(k) = 1/k_max, k = 1, ..., k_max = 50, and α = 1.

It is worth mentioning at this point that, as the sample size grows, the marginal likelihood of k components will tend to reflect more and more only allocations with no empty components. Formally,

    C(k,h) a_{kh} → δ_{k=h},  n → ∞,   (17)

see the Appendix for a proof. Hence, from formula (8), f_k − f†_k → 0 as n → ∞.

Returning to the example of nine well separated groups, it is the combinatorial term C(k,h) that makes f_10 > f_9 in Table 1. If one were to drop the C(k,9) term from equation (16), the entries in Table 1 would be as in Table 3.
    k          9         10        11        12        13        14        15
    f_k/f_9    1.00000   0.10112   0.01124   0.00136   0.00018   0.00002   0.00000

Table 3: Ratios f_k/f_9 as in Table 1, but dropping the term C(k,9) from equation (16).

The C(k,9) term accounts for the fact that with k = 10 components, there are ten possible ways of choosing nine components to have observations and one component to be empty. Of course, this is a consequence of the model entertained and its ability to allow for empty components, which correspond to mass on small values for some weights in the prior of λ. Nonetheless, the increasing effect on the marginal likelihoods, as k grows, of the many ways in which some of k components may be empty, is a rather unappealing feature of the model. Nobile (2004) has suggested shifting attention from the number of components to the number of non-empty components and computing its posterior distribution. In this paper I follow a different approach: trying to counteract the combinatorial terms in the marginal likelihoods by an appropriate choice of the prior distribution of k.

Multiplying equation (8) by the prior distribution π(k) and writing the result explicitly for the first few k, one has

    π(1 | x) = A · π(1) f†_1
    π(2 | x) = A [ π(2) f†_2 + π(2) C(2,1) a_{21} f†_1 ]
    π(3 | x) = A [ π(3) f†_3 + π(3) C(3,2) a_{32} f†_2 + π(3) C(3,1) a_{31} f†_1 ]
    · · ·

where A is a normalizing constant. Although there is no prior π(k) which exactly cancels out the binomial coefficients C(k,h), one can keep the contribution of f†_h to π(k | x) small, relative to its contribution to π(h | x), by requiring that

    π(h) = sup_{k>h} { π(k) C(k,h) },  h = 1, 2, ....   (18)

It is easy to verify that a Poi(1) distribution satisfies equations (18). Indeed, every prior π(k) satisfying equations (18) is proportional to a truncated Poi(1) distribution, see the Appendix. For simplicity, I will take the prior on k to be Poi(1).
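To see why the Poi(1) prior, π(k) ∝ 1/k!, satisfies (18), note that

    π(k) C(k,h) ∝ (1/k!) · k!/(h!(k−h)!) = 1/(h!(k−h)!),

so that

    sup_{k>h} π(k) C(k,h) ∝ (1/h!) sup_{k>h} 1/(k−h)! = 1/h! ∝ π(h),

with the supremum attained at k = h + 1.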
One way to illustrate the effect of the Poi(1) prior on k is by recomputing the bounds on π(k | x) with this prior; they are reported in Table 4. Compared to the bounds with a discrete uniform prior in Table 2, they are much weaker, especially for higher values of k.

Table 4: Bounds on π(k | x) for several sample sizes n, with π(k) ∝ 1/k! and α = 1.

For another illustration, reconsider the example of nine well separated groups in Table 1. The ratios of posterior probabilities π(k | x)/π(9 | x) using a Poi(1) prior are given in Table 5.

    k                 9         10        11        12        13        14        15
    π(k|x)/π(9|x)     1.00000   0.10112   0.00562   0.00023   0.00001   0.00000   0.00000

Table 5: Ratios of posterior probabilities π(k | x)/π(9 | x) for the same hypothetical data set as in Table 1, using a Poi(1) prior on k.
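The entries of Table 5 follow from those of Table 1 by reweighting with the prior ratio π(k)/π(9) = 9!/k!; a short sketch (reusing the Table 1 ratios):

```python
import numpy as np
from scipy.special import gammaln

# pi(k | x)/pi(9 | x) = [pi(k)/pi(9)] * (f_k/f_9), with pi(k) ∝ 1/k!,
# so that pi(k)/pi(9) = 9!/k!; the f_k/f_9 values are those of Table 1.
f_ratio = {9: 1.000, 10: 1.011, 11: 0.618, 12: 0.299,
           13: 0.127, 14: 0.050, 15: 0.018}
for k, r in f_ratio.items():
    log_prior_ratio = gammaln(10) - gammaln(k + 1)   # log(9!/k!)
    print(k, r * np.exp(log_prior_ratio))
# k = 10 gives about 0.101, k = 11 about 0.0056; the ratios then drop
# off much faster than those of Table 1.
```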
As a further illustration, return to the galaxy data example in Section 2. Figure 3 contains the posterior of k computed using the same hyperparameters and methods as in Figure 1, but with a Poi(1) prior on k, rather than a discrete uniform one. Other examples, for real and artificial data sets, of posterior distributions of k based on a Poi(1) prior can be found in Nobile and Fearnside (2005).

Figure 3: Two estimates of the posterior distribution of k for the galaxy data, using a Poisson(1) prior. The histogram gives the frequency of the model with k components using the allocation sampler of Nobile and Fearnside (2005); the solid line is the posterior of k obtained from the normalized marginal likelihoods using formula (15). See main text for the hyperparameter values used.

4 Hyperparameter selection in the mixture of normals

This section is concerned with the choice of hyperparameters in mixtures of univariate normals, with natural conjugate priors on the means and variances. The method to be described can be readily adapted to the case of multivariate components, or components from other parametric families. I continue to use the galaxy data set for illustrative purposes. The marked sensitivity to prior specification exhibited in the analysis of this data is, in my experience, far from typical. However, it demonstrates well what difficulties may arise. Patterns of dependence of the marginal likelihood on the prior of θ are likely to be simpler in one-parameter families; see Aitkin (2001, page 289) for a related remark.

Figure 4 displays estimates of the posterior distribution of the number of components for the galaxy data, corresponding to several values of the hyperparameters τ and δ. The other hyperparameters were set to µ = 20 and γ = 2, as discussed in Section 1.3. The prior on k was Poi(1) and
Two estimates of the posterior distribution of k for the galaxy data, using a Poisson(1) prior.Histogram gives the frequency of model with k components using the allocation sampler of Nobileand Fearnside (2005). Solid line is the posterior of k obtained from the normalized marginallikelihood using formula (15). See main text for hyperparameter values used. computations were done using formula (15). Although in all plots most of the mass is concentratedon values of k between 2 and 8, a simple glance at the figure conveys how dependent on hyperpa-rameter values π ( k | x ) may be. One can also see that, for given δ , as τ increases at first posteriormass shifts to higher values of k , and then it moves back to lower values of k . The behaviour for τ fixed and δ increasing consists, apart for few exceptions, of a shift of probability mass from higherto lower values of k . Most pairs ( τ, δ ) yield negligible posterior mass for k = 2. However, somepairs in the upper right corner of the plot assign considerable mass to it. These pairs correspondto a prior distribution that makes likely high values of the variance within each normal component;in turns this makes it plausible to place in a single group the smallest and largest observations inthe galaxy data, with a central group accounting for most of the other observations.Putting a hyperprior π ( τ, δ ) on the two hyperparameters, and sampling from the joint posteriorof all the unknowns, including τ and δ , did not solve the problem. Some experimentation with a fewhyperpriors showed that π ( k | x ) was to a considerable extent affected by the choice of hyperprior:the marginal posterior distributions of τ and δ had very long tails and changed markedly with π ( τ, δ ). For this reason, I preferred to adopt an empirical Bayes stance and estimate τ and δ ratherthan mixing with respect to their posterior distribution.I settled on independent priors: U n (0 ,
1) for (1 + τ ) − , the prior proportion of variance withina component to the total variance, and U n (0 , δ U ) for δ , where δ U = ( γ − s x and s x is the samplevariance. The choice of δ U yields a prior expectation of the components variance equal to s x /
2. Therandom walk Metropolis-Hastings algorithm was used to update τ and δ given all other variables.Figure 5 displays boxplots of the marginal posterior distributions of τ and δ , on a logarithmic scale,conditional on k . Both plots display a pattern whereby a clear change of level occurs as k increases.10 . . . . tau=1, delta=0.5 . . . . tau=1, delta=1 . . . . tau=1, delta=2 . . . tau=1, delta=7 . . . tau=1, delta=20 . . . tau=0.2, delta=0.5 . . . . tau=0.2, delta=1 . . . tau=0.2, delta=2 . . tau=0.2, delta=7 . . . . tau=0.2, delta=20 . . . tau=0.05, delta=0.5 . . . . tau=0.05, delta=1 . . . . tau=0.05, delta=2 . . . tau=0.05, delta=7 . . . tau=0.05, delta=20 . . tau=0.01, delta=0.5 . . . . tau=0.01, delta=1 . . . . tau=0.01, delta=2 . . tau=0.01, delta=7 . . . tau=0.01, delta=20 . . . tau=0.002, delta=0.5 . . . tau=0.002, delta=1 . . . tau=0.002, delta=2 . . . tau=0.002, delta=7 . . . tau=0.002, delta=20 Figure 4:
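A minimal sketch of such a random-walk update, proposing on the log scale so that both hyperparameters stay positive, is given below (my own illustration; log_target stands for the model-specific log full conditional of (τ, δ), assumed available up to an additive constant).

```python
import numpy as np

def mh_update_tau_delta(tau, delta, log_target, rng, step=0.3):
    """One random-walk Metropolis-Hastings update of (tau, delta),
    proposed on the log scale.

    log_target : (tau, delta) -> log full conditional density of
                 (tau, delta) given all other variables.
    """
    log_new = np.log([tau, delta]) + step * rng.standard_normal(2)
    tau_new, delta_new = np.exp(log_new)
    # The log-scale proposal contributes the Jacobian factor
    # (tau_new * delta_new) / (tau * delta) to the acceptance ratio.
    log_acc = (log_target(tau_new, delta_new) - log_target(tau, delta)
               + log_new.sum() - np.log(tau) - np.log(delta))
    if np.log(rng.uniform()) < log_acc:
        return tau_new, delta_new
    return tau, delta
```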
The procedure I used to estimate τ and δ consists of taking the median of the posterior draws, after discarding those corresponding to values of k preceding the point where a rough level-off of the medians has occurred. The rationale is that if increasing k by 1 markedly changes the posteriors of τ and δ, it is because it affords a considerable reduction of the within-components variability, by replacing it with between-means variability. The median of τ seems to level off at k = 4. For δ the picture is less clear, but the decreases are much smaller past k = 6. The end result is the rough estimates τ̂ = 0.04 and δ̂ = 2. These values were used in the runs reported in Sections 2 and 3. A similar procedure was used by Nobile and Fearnside (2005), to which I refer for further examples.

The overall lesson seems to be that estimates of π(k | x) provide only a rough, though useful, guide to the number of groups in the data and that there is really no substitute for the kind of sensitivity analysis performed in Figure 4.

Appendix
A.1 Proof of formula (14)
Letting E = G*_t in formula (12), with t ≤ k, yields

    f_k = [ Σ_{g ∈ G*_t} f(g | k) f(x | k, g) ] / Pr[G*_t | k, x].

Now from formulae (5) the numerator equals a_{kt} f*_t, so that solving for f*_t produces

    f*_t = Pr[G*_t | k, x] f_k / a_{kt}.

Then taking the ratio between f*_{t+1} and f*_t and rearranging terms gives

    f*_{t+1} Pr[G*_t | k, x] = a_{t+1,t} f*_t Pr[G*_{t+1} | k, x],  t < k,

where one uses a_{k,t}/a_{k,t+1} = a_{t+1,t}, a simple consequence of (5). Summing both sides over k from t + 1 to k_max and rearranging yields (14).

A.2 Proof of formula (15)
Let E = G̃_{kh} in formula (12) to obtain

    f_k = [ Σ_{g ∈ G̃_{kh}} f(g | k) f(x | k, g) ] / Pr[G̃_{kh} | k, x],  h ≤ k.

The numerator is equal to C(k,h) a_{kh} f†_h (see Nobile 2004, Proof of Proposition 4.3), so that rearranging one has

    f†_h = f_k Pr[G̃_{kh} | k, x] / [ C(k,h) a_{kh} ].

Taking the ratio between f†_{h+1} and f†_h and rearranging terms produces

    f†_{h+1} (k − h) Pr[G̃_{kh} | k, x] = f†_h (h + 1) a_{h+1,h} Pr[G̃_{k,h+1} | k, x].

Finally, sum both sides over k = h + 1, ..., k_max and solve for f†_{h+1}/f†_h to obtain (15).

A.3 Proof of formula (17)
From formula (5) and under the assumption that the prior is invariant with respect to permutations of the labels (so that α_j = α for all j),

    C(k,t) a_{kt} = C(k,t) [Γ(kα)/Γ(kα + n)] [Γ(tα + n)/Γ(tα)],  t ≤ k.

Since n^{(k−t)α} Γ(tα + n)/Γ(kα + n) → 1 as n → ∞, one has

    n^{(k−t)α} C(k,t) a_{kt} → C(k,t) Γ(kα)/Γ(tα)  as n → ∞.

The limit is finite, while n^{(k−t)α} → ∞ for t < k; therefore, for t ≤ k, C(k,t) a_{kt} → δ_{k=t} as n → ∞.

A.4 Proof that every distribution satisfying equations (18) is truncated Poi(1)
The proof proceeds as follows: assume that π(k) = 0 for k larger than some value k_0, use induction to derive π(k) with k ≤ k_0, and finally let k_0 → ∞. Suppose that, for all k = j + 1, ..., k_0, one has

    π(k) = π(k_0) k_0!/k!.   (19)

Then equation (19) also holds for k = j:

    π(j) = sup_{k>j} { π(k) C(k,j) } = max_{j<k≤k_0} { π(k_0) (k_0!/k!) · k!/(j!(k−j)!) } = (π(k_0) k_0!/j!) max_{j<k≤k_0} 1/(k−j)! = π(k_0) k_0!/j!,

the maximum being attained at k = j + 1. Since (19) holds trivially for k = k_0, it follows by induction that π(k) ∝ 1/k! for all k ≤ k_0, i.e. a Poi(1) distribution truncated at k_0. Letting k_0 → ∞ completes the argument.

References
Aitkin, M. (2001). Likelihood and Bayesian analysis of mixtures. Statistical Modelling, 1, 287–304.

Celeux, G., Hurn, M. and Robert, C. P. (2000). Computational and Inferential Difficulties with Mixture Posterior Distributions. Journal of the American Statistical Association, 95, 957–970.

Chib, S. (1995). Marginal Likelihood from the Gibbs Output. Journal of the American Statistical Association, 90, 1313–1321.

DiCiccio, T. J., Kass, R. E., Raftery, A. and Wasserman, L. (1997). Computing Bayes Factors by Combining Simulation and Asymptotic Approximations. Journal of the American Statistical Association, 92, 903–915.

Diebolt, J. and Robert, C. P. (1994). Estimation of finite mixture distributions through Bayesian sampling. Journal of the Royal Statistical Society B, 56, 363–375.

Frühwirth-Schnatter, S. (2001). Markov Chain Monte Carlo Estimation of Classical and Dynamic Switching and Mixture Models. Journal of the American Statistical Association, 96, 194–209.

Gelman, A. and Meng, X. L. (1998). Simulating Normalizing Constants: From Importance Sampling to Bridge Sampling to Path Sampling. Statistical Science, 13, 163–185.

Ishwaran, H., James, L. F. and Sun, J. (2001). Bayesian Model Selection in Finite Mixtures by Marginal Density Decompositions. Journal of the American Statistical Association, 96, 1316–1332.

Nobile, A. (1994). Bayesian Analysis of Finite Mixture Distributions. Ph.D. dissertation, Department of Statistics, Carnegie Mellon University, Pittsburgh. Available at

Nobile, A. (2004). On the posterior distribution of the number of components in a finite mixture. The Annals of Statistics, 32, 2044–2073.

Nobile, A. and Fearnside, A. (2005). Bayesian finite mixtures with an unknown number of components: the allocation sampler. Technical Report 05-4, Department of Statistics, University of Glasgow.

Phillips, D. B. and Smith, A. F. M. (1996). Bayesian model comparison via jump diffusions. In Markov Chain Monte Carlo in Practice (eds W. R. Gilks, S. Richardson and D. J. Spiegelhalter), 215–239, Chapman & Hall.

Raftery, A. E. (1996). Hypothesis testing and model selection. In Markov Chain Monte Carlo in Practice (eds W. R. Gilks, S. Richardson and D. J. Spiegelhalter), 163–187, Chapman & Hall.

Richardson, S. and Green, P. J. (1997). On Bayesian analysis of mixtures with an unknown number of components (with discussion). Journal of the Royal Statistical Society B, 59, 731–792.

Roeder, K. (1990). Density estimation with confidence sets exemplified by superclusters and voids in galaxies. Journal of the American Statistical Association, 85, 617–624.

Stephens, M. (2000a). Bayesian analysis of mixture models with an unknown number of components – an alternative to reversible jump methods. The Annals of Statistics, 28, 40–74.

Stephens, M. (2000b). Dealing with Label Switching in Mixture Models. Journal of the Royal Statistical Society B, 62, 795–809.