Importance sampling schemes for evidence approximation in mixture models
Jeong Eun Lee∗ and Christian P. Robert†

Abstract.
The marginal likelihood is a central tool for drawing Bayesian inference about the number of components in mixture models. It is often approximated since the exact form is unavailable. A bias in the approximation may be due to an incomplete exploration by a simulated Markov chain (e.g., a Gibbs sequence) of the collection of posterior modes, a phenomenon also known as lack of label switching, as all possible label permutations must be simulated by a chain in order to converge and hence overcome the bias. In an importance sampling approach, imposing label switching on the importance function results in an exponential increase of the computational cost with the number of components. In this paper, two importance sampling schemes are proposed through choices for the importance function: an MLE proposal and a Rao–Blackwellised importance function. The second scheme is called dual importance sampling. We demonstrate that this dual importance sampling is a valid estimator of the evidence. To reduce the induced high computational demand, the original importance function is approximated, and a suitable approximation can produce an estimate with the same precision and a reduced computational workload.
Keywords:
Model evidence, importance sampling, mixture models, marginal likelihood
∗ Auckland University of Technology, New Zealand, [email protected]
† PSL, Université Paris-Dauphine, CEREMADE; Department of Statistics, University of Warwick; and CREST, Paris, [email protected]

Consider a sample $\mathbf{x} = \{x_1, \dots, x_{n_x}\}$ that is a realisation of a random sample (univariate or multivariate) from a finite mixture of $k$ distributions,
$$X_j \sim f_k(x \mid \theta) = \sum_{i=1}^{k} \lambda_i f(x \mid \xi_i), \qquad j = 1, \dots, n_x,$$
where the component weights $\lambda = \{\lambda_i\}_{i=1}^{k}$ are non-negative and sum to 1. The collection of the component-specific parameters is denoted $\xi = \{\xi_i\}_{i=1}^{k}$ and the collection of all parameters by $\theta = \{\lambda, \xi\}$. As is now standard (Marin et al. 2005), each observation $x_j$ from the sample can be assumed to originate from a specific if unobserved component of $f_k$, denoted $z_j$, and the mixture inference problem can then be analysed as a missing data model, with discrete missing data $\mathbf{z} = \{z_1, \dots, z_{n_x}\}$, such that $x_j \mid \mathbf{z} \sim f(x_j \mid \xi_{z_j})$, independently for $j = 1, \dots, n_x$. The conditional distribution of $z_j \in \{1, \dots, k\}$ is then given by
$$z_j \mid \mathbf{x}, \theta \sim \mathcal{M}\!\left( \frac{\lambda_1 f(x_j \mid \xi_1)}{\sum_{i=1}^{k} \lambda_i f(x_j \mid \xi_i)}, \dots, \frac{\lambda_k f(x_j \mid \xi_k)}{\sum_{i=1}^{k} \lambda_i f(x_j \mid \xi_i)} \right).$$
This interpretation of the mixture model leads to a natural clustering of the observations according to their label, and the cluster associated with the mixture component $i$ provides information about $\lambda_i$ and $\xi_i$. In particular, when the full conditional distribution of the parameter $\theta$ is available in closed form, conditional simulation from $\pi(\xi, \lambda \mid \mathbf{x}, \mathbf{z})$ becomes straightforward (see Diebolt and Robert (1994)).

In a Bayesian mixture modelling setup, the goal is to perform inference on the parameter $\theta$, and the posterior distribution $\pi(\theta \mid \mathbf{x})$ is usually approximated via MCMC methods. The likelihood function $p_k(\mathbf{x} \mid \theta)$ is both available and invariant under permutations of the component indices.
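As a concrete illustration of this completion step, the following minimal sketch (our own toy code, univariate Gaussian components, illustrative names) evaluates $f_k$ and the multinomial allocation probabilities of $z_j$:

```python
import math

def normal_pdf(x, m, s):
    """Density of N(m, s), with s the standard deviation."""
    return math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))

def mixture_density(x, lam, mu, sd):
    """f_k(x | theta) = sum_i lambda_i f(x | xi_i) for a k-component Gaussian mixture."""
    return sum(l * normal_pdf(x, m, s) for l, m, s in zip(lam, mu, sd))

def allocation_probs(x, lam, mu, sd):
    """Conditional distribution of the allocation z_j given x_j and theta:
    the i-th weight is lambda_i f(x_j | xi_i) / sum_l lambda_l f(x_j | xi_l)."""
    terms = [l * normal_pdf(x, m, s) for l, m, s in zip(lam, mu, sd)]
    total = sum(terms)
    return [t / total for t in terms]
```

By symmetry, `allocation_probs(0.0, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])` returns `[0.5, 0.5]`.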
If an exchangeable prior is chosen on $(\lambda, \xi)$, the posterior density reproduces the likelihood invariance and component labels are not identifiable. This phenomenon is called label switching and is well studied in the literature (Celeux et al. 2000; Stephens 2000b; Jasra et al. 2005). From a simulation perspective, label switching induces multimodality in the target and, while it is desirable that a simulated Markov chain targeting the posterior explores all of the $k!$ symmetric modes of the posterior distribution, most samplers fail to switch between modes (Celeux et al. 2000). For instance, when using a data augmentation scheme, which is a form of Gibbs sampler adapted to missing data problems (Robert and Casella 2004), the Markov chain very slowly if ever switches between the symmetric modes. Therefore, since the chain only explores a certain region of the support of the multimodal posterior, estimates based on the simulation output are necessarily biased. When label switching is missing from the MCMC output, it can be simulated by modifying the MCMC sample (see Frühwirth-Schnatter (2001); Papastamoulis and Roberts (2008); Papastamoulis and Iliopoulos (2010)).

A different perspective on the label switching phenomenon is inferential. Indeed, point estimates of the component-wise parameters are harder to produce when the Markov chain moves freely between modes. To achieve component-specific inference and give a meaning to each component, relabelling methods have been proposed in the literature (see Richardson and Green (1997); Celeux et al. (2000); Stephens (2000b); Jasra et al. (2005); Marin and Robert (2007); Geweke (2012); Rodriguez and Walker (2014), among others). An R package, label.switching (Papastamoulis 2013), implements some of those relabelling methods.

Evaluating the number of components $k$ is a special case of model comparison, which can be conducted by comparing the posterior probabilities of the models.
Those probabilities are in turn computed via the marginal likelihoods $E(k)$, also known as model evidences (Richardson and Green 1997),
$$E(k) = \int_{S} p_k(\mathbf{x} \mid \theta)\, \pi_k(\theta)\, \mathrm{d}\theta,$$
where $\pi_k(\theta)$ is the selected prior for the $k$-component mixture. (We assume here that it is exchangeable with respect to its components.) Recall that the ratio of evidences is a Bayes factor and is properly scaled to be readily compared to 1 (Jeffreys 1939). When a large collection of values of $k$ is considered for model comparison, sophisticated MCMC methods have been developed to bypass computing evidences (Richardson and Green 1997; Stephens 2000a), even though those are estimated as a byproduct of the methods. Alternatively, estimating the number of components can also proceed from a Bayesian nonparametric (BNP) modelling, which assumes an infinite number of components and then evaluates the non-empty components implicitly through partitioning the data, using for instance the Chinese restaurant process (Antoniak 1974; Escobar and West 1995; Rasmussen 2000). This however requires a modification of the prior modelling and we will not cover it in this paper, which reassesses Monte Carlo ways of approximating the evidence.

The difficulty with approaches using $E(k)$ is that the quantity often cannot directly and reliably be derived from simulations from the posterior distribution $\pi(\theta \mid \mathbf{x})$ (Newton and Raftery 1994). The quantity has been approximated using dedicated methods such as harmonic means (Satagopan et al. 2000; Raftery et al. 2006), importance sampling (Rubin 1987, 1988; Gelman and Meng 1998), bridge sampling (Meng and Wong 1996; Meng and Schilling 2002), Laplace approximation (Tierney and Kadane 1986; DiCiccio et al.
1997), stochastic substitution (Gelfand and Smith 1990; Chib 1995, 1996), nested sampling (Chopin and Robert 2010), Savage–Dickey representations (Verdinelli and Wasserman 1995; Marin and Robert 2010b) and erroneous implementations of the Carlin and Chib algorithm (Carlin and Chib 1995; Scott 2002; Congdon 2006; Robert and Marin 2008). Comparative studies of those methods are found in Marin and Robert (2010a) and Ardia et al. (2012).

In the specific case of mixtures, the invariance of the posterior density under an arbitrary relabelling of the mixture components must be exhibited by simulations and approximations to achieve a valid estimate of $E(k)$, as discussed in Neal (1999); Berkhof et al. (2003); Marin and Robert (2008). This often leads to computationally intensive steps in approximation methods, especially when $k$ is large, and it is the purpose of this paper to provide a partial answer to this specific issue.

We consider here two estimators of $E(k)$, both based on importance sampling (IS). One is a version of Chib's estimator and the second one a novel representation called dual importance sampling. Our importance construction aims to better approximate the posterior distribution both around a specific local mode and at the corresponding $(k! - 1)$ symmetric modes of the posterior distribution. A particular mode is first approximated based on (i) a point estimate and (ii) Rao–Blackwellisation from a Gibbs sequence. Then, the corresponding local density approximation is permuted to reach all modes. We demonstrate here that dual importance sampling is comparable to our benchmark method, Chib's approach. Taking advantage of the symmetry in the posterior distribution, we show how to reduce computational demands by approximating our importance function.

The paper starts by recalling the approximation techniques of Chib's method and bridge sampling in Section 2. In Section 3, importance sampling is considered, including our choices of importance functions. Our importance function approximation approach is introduced in Section 4. Simulation studies using both simulated and benchmark datasets, namely the galaxy and fishery datasets used in Richardson and Green (1997), are reported in Section 5, and the paper concludes with a short discussion in Section 6.
In this paper, the reference estimator of evidence is Chib's (1995) method, derived from rewriting Bayes' theorem as
$$\hat{E}(k) = m_k(\mathbf{x}) = \frac{\pi_k(\theta^o)\, p_k(\mathbf{x} \mid \theta^o)}{\pi_k(\theta^o \mid \mathbf{x})} \qquad (1)$$
where $\theta^o$ is any plug-in value for $\theta$. When $\pi_k(\theta^o \mid \mathbf{x})$ is not available in closed form, the Gibbs sampling decomposition allows a Rao–Blackwellised approximation (Gelfand and Smith 1990; Robert and Casella 2004),
$$\hat{\pi}_k(\theta^o \mid \mathbf{x}) = \frac{1}{T} \sum_{t=1}^{T} \pi_k(\theta^o \mid \mathbf{x}, \mathbf{z}^t),$$
where $(\mathbf{z}^t)_{t=1}^{T}$ is a Markov chain with stationary distribution $\pi_k(\mathbf{z} \mid \mathbf{x})$. The appeal of this estimator, when available, is that it constitutes a non-parametric density estimator converging at a regular parametric rate.

It is now an accepted fact that label switching is necessary for the above Rao–Blackwellised $\hat{\pi}_k(\theta^o \mid \mathbf{x})$ to converge. When $(\mathbf{z}^1, \dots, \mathbf{z}^T)$ only explores part of the modes of the posterior, this estimator is biased, generally missing the target quantity $\log(m_k(\mathbf{x}))$ by a factor of order $O(\log k!)$, with no simple correction factor (Neal 1999). Berkhof et al. (2003) later suggested a generic correction by averaging $\hat{\pi}_k(\theta^o \mid \mathbf{x})$ over all possible permutations of the labels, hence forcing "perfect" label switching. The resulting approximation is expressed as
$$\tilde{\pi}_k(\theta^o \mid \mathbf{x}) = \frac{1}{T\, k!} \sum_{\sigma \in S_k} \sum_{t=1}^{T} \pi_k(\theta^o \mid \mathbf{x}, \sigma(\mathbf{z}^t)),$$
where $S_k$ denotes the set of the $k!$ permutations of $\{1, \dots, k\}$ and $\sigma$ is one of those permutations. Note that the above correction can also be rewritten as
$$\tilde{\pi}_k(\theta^o \mid \mathbf{x}) = \frac{1}{T\, k!} \sum_{\sigma \in S_k} \sum_{t=1}^{T} \pi_k(\sigma(\theta^o) \mid \mathbf{x}, \mathbf{z}^t), \qquad (2)$$
using the notational shortcut $\sigma(\theta^o)$, meaning that the components of $\theta^o$ are switched according to the permutation $\sigma$. This representation may induce computational gains since only the $k!$ versions of $\sigma(\theta^o)$ need to be stored.

Meng and Wong (1996) proposed a sample-based method to compute a ratio of normalizing constants of two densities with common support.
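Before moving on, the permutation averaging in (2) can be sketched as follows (a toy illustration of our own, with a generic `cond_density` standing in for the full conditional $\pi_k(\cdot \mid \mathbf{x}, \mathbf{z}^t)$ and completed draws reduced to tuples of component centres):

```python
import itertools
import math

def permute(theta, sigma):
    """Relabel the component-specific parameters of theta according to sigma."""
    return tuple(theta[i] for i in sigma)

def rb_density(theta_o, z_draws, cond_density):
    """Plain Rao-Blackwellised estimate: average of pi_k(theta_o | x, z_t)."""
    return sum(cond_density(theta_o, z) for z in z_draws) / len(z_draws)

def corrected_rb_density(theta_o, z_draws, cond_density, k):
    """Permutation-averaged version, eq. (2): average pi_k(sigma(theta_o) | x, z_t)
    over all k! relabellings sigma, forcing 'perfect' label switching."""
    perms = list(itertools.permutations(range(k)))
    total = sum(cond_density(permute(theta_o, sigma), z)
                for z in z_draws for sigma in perms)
    return total / (len(z_draws) * len(perms))

# Toy conditional: a Gaussian-like kernel around component centres implied by z.
def toy_cond(theta, centres):
    return math.exp(-sum((t - c) ** 2 for t, c in zip(theta, centres)))
```

With a single draw whose centres are (0, 1) and $\theta^o = (0, 1)$, the corrected density averages the value at $\theta^o$ and at its relabelled copy (1, 0), i.e. $(1 + e^{-2})/2$.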
The method is well suited to estimate the marginal likelihood (Frühwirth-Schnatter 2001, 2004) and has been used to compute the posterior density at a plug-in value in Chib's method (Mira and Nicholls 2004). Considering a normalised density $q$ and the unnormalised posterior distribution $\pi^*_k(\theta \mid \mathbf{x}) = \pi_k(\theta)\, p_k(\mathbf{x} \mid \theta)$, the bridge sampling identity is given by
$$\hat{E}(k) = \frac{\mathbb{E}_{q(\theta)}\big[\alpha(\theta)\, \pi^*_k(\theta \mid \mathbf{x})\big]}{\mathbb{E}_{\pi_k(\theta \mid \mathbf{x})}\big[\alpha(\theta)\, q(\theta)\big]},$$
for an arbitrary function $\alpha$ (provided all expectations are well defined; Chen et al. 2000). The (formally) optimal choice for $\alpha$ (Meng and Wong 1996) leads to the following iterative estimator:
$$\hat{E}^{(t)}(k) = \hat{E}^{(t-1)}(k)\;
\frac{\displaystyle \frac{1}{M_1}\sum_{l=1}^{M_1} \frac{\hat{\pi}_{t-1}(\tilde{\theta}_l \mid \mathbf{x})}{M_1\, q(\tilde{\theta}_l) + M_2\, \hat{\pi}_{t-1}(\tilde{\theta}_l \mid \mathbf{x})}}
{\displaystyle \frac{1}{M_2}\sum_{m=1}^{M_2} \frac{q(\hat{\theta}_m)}{M_1\, q(\hat{\theta}_m) + M_2\, \hat{\pi}_{t-1}(\hat{\theta}_m \mid \mathbf{x})}} \qquad (3)$$
where $\hat{\pi}_{t-1}(\theta \mid \mathbf{x}) = \pi^*_k(\theta \mid \mathbf{x}) / \hat{E}^{(t-1)}(k)$. Here, $(\tilde{\theta}_1, \dots, \tilde{\theta}_{M_1})$ and $(\hat{\theta}_1, \dots, \hat{\theta}_{M_2})$ are samples from $q(\theta)$ and $\pi_k(\theta \mid \mathbf{x})$, respectively.

The convergence of bridge sampling (with the above optimal $\alpha$) is trivial when $\pi^*_k(\theta \mid \mathbf{x})$ and $q(\theta)$ share a sufficiently large portion of their supports. If the support intersection is too small, the resulting bridge sampling estimator may end up with an infinite variance (Voter 1985; Servidea 2002). Improvements of the algorithm, like path sampling (Gelman and Meng 1998), a simple location shift of the proposal distribution (Voter 1985), and warp bridge sampling (Meng and Schilling 2002), have been proposed.

In the specific case of the mixture posterior distribution, the parameter $\theta$ can be split into $\lambda$ and $k$ further blocks $\xi_1, \dots, \xi_k$ in the Gibbs sampling steps. The output samples from the Gibbs sampler are denoted by $\{\theta^{(j)}, \mathbf{z}^{(j)}\}_{j=1}^{J}$, where the $\mathbf{z}^{(j)}$'s are the component allocation vectors associated with the observations $\mathbf{x}$.
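Writing $w = \pi^*_k(\theta \mid \mathbf{x})/q(\theta)$ for the ratio evaluated at each draw, the recursion (3) collapses to a simple fixed-point update; a minimal sketch (a generic interface of our own, not code from the paper):

```python
def bridge_estimate(w_tilde, w_hat, iters=50, e0=1.0):
    """Iterative bridge sampling estimator, eq. (3), with the optimal alpha.
    w_tilde: pi_k*(theta | x) / q(theta) at draws theta ~ q;
    w_hat:   pi_k*(theta | x) / q(theta) at draws theta ~ posterior.
    Returns the evidence estimate after `iters` fixed-point updates from e0."""
    m1, m2 = len(w_tilde), len(w_hat)
    e = e0
    for _ in range(iters):
        # numerator of (3), with pi_hat/(M1 q + M2 pi_hat) rewritten via w = pi*/q
        num = sum(w / (m1 * e + m2 * w) for w in w_tilde) / m1
        # denominator of (3), with q/(M1 q + M2 pi_hat) rewritten the same way
        den = sum(1.0 / (m1 * e + m2 * w) for w in w_hat) / m2
        e = num / den
    return e
```

When $\pi^*_k$ is exactly proportional to $q$, every ratio equals the evidence and the recursion returns it after a single update, regardless of the starting value.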
For bridge sampling, Frühwirth-Schnatter (2004) suggested using a Rao–Blackwellised function $q(\theta) = q(\lambda, \xi)$ of the form
$$q(\theta) = \frac{1}{J} \sum_{j=1}^{J} \pi_k(\theta \mid \theta^{(j)}, \mathbf{z}^{(j)}, \mathbf{x}) = \frac{1}{J} \sum_{j=1}^{J} p(\lambda \mid \mathbf{z}^{(j)}) \prod_{i=1}^{k} p(\xi_i \mid \xi^{(j)}, \mathbf{z}^{(j)}, \mathbf{x}), \qquad (4)$$
assuming $\{\theta^{(j)}, \mathbf{z}^{(j)}\}_{j=1}^{J}$ is well mixed, followed by switching the labels of the simulations from the posterior distribution (Frühwirth-Schnatter 2001). Frühwirth-Schnatter (2004) demonstrated that the iterative bridge sampling estimator (3), using (4) as $q(\cdot)$, converges relatively quickly, in about $t = 10$ iterations, even with different starting values.

If $q(\theta)$ is an importance function with support $S_q$, generating a sample $\theta = (\theta^{(1)}, \dots, \theta^{(T)})$ from $q(\theta)$ leads to the evidence approximation
$$\hat{E}(k) = \frac{1}{T} \sum_{t=1}^{T} \frac{\pi_k(\theta^{(t)})\, p_k(\mathbf{x} \mid \theta^{(t)})}{q(\theta^{(t)})} \overset{\mathrm{def}}{=} \frac{1}{T} \sum_{t=1}^{T} \omega(\theta^{(t)}). \qquad (5)$$
To provide a good approximation, the support of $q(\theta)$ must overlap the support of the posterior distribution, which is both symmetric under permutations and multimodal. In this sense, a Rao–Blackwellised estimate like (4) is a natural solution for the choice of $q$, despite the drawback that $J$ increases "factorially" fast with $k$ due to the permutations over $\{\theta^{(j)}, \mathbf{z}^{(j)}\}_{j=1}^{J}$ (Frühwirth-Schnatter 2004, 2006).

In the following sections, the parameter $\theta$ is decomposed into $(k+1)$ blocks, $\theta = (\lambda, \xi_1, \dots, \xi_k)$. Note that $\xi_i$ is a component-wise block, most often a vector. Two types of importance functions, based on the product of marginal posterior distributions, will be considered. The usefulness and details of the product of block marginal posterior distributions are well summarised in Perrakis et al. (2014).

Using a very simple Rao–Blackwell argument inspired from Chib's representation, a natural importance function is
$$q(\theta) = \pi_k(\theta \mid \mathbf{z}^o, \theta^o, \mathbf{x}).$$
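For reference, the generic estimator (5) is a one-line average of importance weights; a minimal sketch on a toy target whose evidence, $\sqrt{2\pi}$, is known in closed form (a deliberately simple stand-in, not the mixture setup of the paper):

```python
import math
import random

def is_evidence(draws, log_target, log_q):
    """Evidence estimate, eq. (5): average of pi_k(theta) p_k(x | theta) / q(theta)
    over draws theta ~ q, computed through log densities for stability."""
    return sum(math.exp(log_target(t) - log_q(t)) for t in draws) / len(draws)

def log_target(t):
    """Unnormalised 'posterior' exp(-t^2/2); its integral is sqrt(2*pi)."""
    return -0.5 * t * t

def log_q(t):
    """Proposal N(0, 2), wider than the target so weights have finite variance."""
    return -0.5 * (t / 2.0) ** 2 - math.log(2.0 * math.sqrt(2.0 * math.pi))

random.seed(1)
draws = [random.gauss(0.0, 2.0) for _ in range(50_000)]
estimate = is_evidence(draws, log_target, log_q)
```

The resulting `estimate` is close to $\sqrt{2\pi} \approx 2.5066$; narrowing the proposal below the target width would instead inflate the weight variance, the failure mode discussed below for the MAP proposal.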
Samples are generated from the posterior distribution conditional on a given completion vector $\mathbf{z}^o$, which is usually taken equal to the MAP (maximum a posteriori) or the marginal MAP estimate of $\mathbf{z}$ derived from MCMC simulations. Taking the full permutation of $\mathbf{z}^o$ and $\theta^o$ (inspired by Berkhof et al. (2003) and Marin and Robert (2008)), we thus propose a symmetrised version of a MAP proposal,
$$q(\theta) = \frac{1}{k!} \sum_{\sigma \in S_k} \pi_k(\theta \mid \sigma(\theta^o, \mathbf{z}^o), \mathbf{x}) = \frac{1}{k!} \sum_{\sigma \in S_k} p(\lambda \mid \sigma(\mathbf{z}^o)) \prod_{i=1}^{k} p(\xi_i \mid \sigma(\xi^o), \sigma(\mathbf{z}^o), \mathbf{x}). \qquad (6)$$
This proposal is equivalent to generating $\theta$ from $\pi_k(\theta \mid \theta^o, \mathbf{z}^o, \mathbf{x})$ and then operating a random permutation on the components of $\theta$. The computational cost of producing $\omega(\theta)$ in (5), hence $\hat{E}(k)$, is then multiplied by $k!$, under the provision that the support of (6) is sufficiently wide. If the tails of samples generated from (6) are deemed to be too narrow, as signalled by the effective sample size, additional selected (and thinned) simulations $\mathbf{z}^1, \dots, \mathbf{z}^t$ taken from the Gibbs output can be included to make the proposal more robust.

While this estimator is theoretically valid, providing an unbiased estimator of $E(k)$, it may face difficulties in practice by missing wide regions of the parameter space when simulating from $\pi_k(\theta \mid \mathbf{x}, \mathbf{z}^o)$. This is indeed the practical version of simulating from an importance function with a support that is smaller than the support of the integrand, a setting that leads to an erroneous approximation of the corresponding integral. In the current situation, since $\pi_k(\theta \mid \mathbf{x}, \mathbf{z}^o)$ is everywhere positive, this is not a theoretical issue. However, in practice, the conditional density is numerically equal to zero around the alternative modes.

A dual exploitation of the above Rao–Blackwellisation argument produces an alternative importance sampling proposal, based on MCMC draws $\{\theta^{(j)}, \mathbf{z}^{(j)}\}_{j=1}^{J}$ from the unconstrained posterior distribution.
The new importance function is expressed as
$$q(\theta) = \frac{1}{J\, k!} \sum_{j=1}^{J} \sum_{\sigma \in S_k} \pi_k(\theta \mid \sigma(\theta^{(j)}, \mathbf{z}^{(j)}), \mathbf{x}) = \frac{1}{J\, k!} \sum_{j=1}^{J} \sum_{\sigma \in S_k} p(\lambda \mid \sigma(\mathbf{z}^{(j)})) \prod_{i=1}^{k} p(\xi_i \mid \sigma(\xi^{(j)}), \sigma(\mathbf{z}^{(j)}), \mathbf{x}). \qquad (7)$$
Here, $\pi_k(\theta \mid \sigma(\theta^{(j)}, \mathbf{z}^{(j)}), \mathbf{x})$ is a product of full conditional densities on each parameter in a Gibbs sampler representation, and $\{\theta^{(j)}, \mathbf{z}^{(j)}\}_{j=1}^{J}$ is the original, albeit not necessarily well-mixed, simulation outcome. Label switching is imposed upon those $J$ conditional densities through all $k!$ permutations and, conversely, the average of $J$ well-selected conditional densities should approximate the posterior around any of the $k!$ symmetric modes of this posterior.

If we now assume that the component labels of the terms $\{\theta^{(j)}, \mathbf{z}^{(j)}\}_{j=1}^{J}$ in (7) have not been permuted and that any label switching occurrence has been removed from the simulations by a recentering method (Celeux et al. 2000), we denote the resulting transforms by $\{\varphi^{(j)}\}_{j=1}^{J}$. They can be interpreted as hyperparameters of $q$. The density (7) then satisfies
$$q(\theta) = \frac{1}{J\, k!} \sum_{j=1}^{J} \sum_{i=1}^{k!} \pi(\theta \mid \sigma_i(\varphi^{(j)}), \mathbf{x}) \overset{\mathrm{def}}{=} \frac{1}{k!} \sum_{i=1}^{k!} h_{\sigma_i}(\theta), \qquad (8)$$
where
$$h_{\sigma_i}(\theta) = \frac{1}{J} \sum_{j=1}^{J} \pi(\theta \mid \sigma_i(\varphi^{(j)}), \mathbf{x}).$$
Each of the densities $h_{\sigma_1}, \dots, h_{\sigma_{k!}}$ has a support (i.e., a domain where it takes non-negligible values), denoted by $S_{\sigma_1}, \dots, S_{\sigma_{k!}}$, and $S_q = \bigcup_{i=1}^{k!} S_{\sigma_i}$. Note that an estimator using (8) is equivalent to an estimator using (7).

From a computational perspective, an artificial label switching step is necessary in computing $q(\theta)$ but not in generating a proposal $\theta$ from $q$. For arbitrary permutation representations $\sigma_m, \sigma_c, \sigma_i \in S_k = \{\sigma_1, \dots, \sigma_{k!}\}$, acting on both $\theta$ and $\varphi$, the following holds for (7):
$$\pi(\sigma_c(\theta) \mid \sigma_i(\varphi), \mathbf{x}) = \pi(\sigma_m \sigma_c(\theta) \mid \sigma_m \sigma_i(\varphi), \mathbf{x}),$$
where $\sigma_m \sigma_c(\theta) = \sigma_m(\sigma_c(\theta))$.
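A toy sketch of the construction (7)–(8), with a generic conditional density standing in for the product of full conditionals and the hyperparameters $\varphi^{(j)}$ reduced to tuples of component centres (all names ours), also allows a numerical check of the label invariance of $q$ implied by the identity above:

```python
import itertools
import math

def h_sigma(theta, phis, sigma, cond_density):
    """h_sigma(theta) in eq. (8): average of the conditional density given each
    relabelled hyperparameter sigma(phi^(j))."""
    return sum(cond_density(theta, tuple(phi[i] for i in sigma))
               for phi in phis) / len(phis)

def dual_q(theta, phis, k, cond_density):
    """q(theta) in eq. (8): uniform mixture of h_sigma over all k! permutations."""
    perms = list(itertools.permutations(range(k)))
    return sum(h_sigma(theta, phis, s, cond_density) for s in perms) / len(perms)

def toy_cond(theta, phi):
    """Toy conditional: Gaussian-like kernel of theta around the centres phi."""
    return math.exp(-sum((t - p) ** 2 for t, p in zip(theta, phi)))

phis = [(0.0, 1.0), (0.1, 0.9)]            # two 'Gibbs draws' after relabelling
q1 = dual_q((0.0, 1.0), phis, 2, toy_cond)
q2 = dual_q((1.0, 0.0), phis, 2, toy_cond)  # same point, labels swapped
# q1 == q2: q is invariant under permutations of the component labels of theta
```

Swapping the labels of the evaluation point merely reorders the $k!$ terms of the sum, so `q1` and `q2` agree.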
Since the full permutation set is invariant under composition with an additional permutation $\sigma_m$ (i.e., $S_k = \{\sigma_m \sigma_1, \dots, \sigma_m \sigma_{k!}\}$), $q(\sigma_c(\theta))$ and $q(\sigma_m \sigma_c(\theta))$ are equal. Thus the standard estimator using $q$ in (7) is equivalent (from a computational viewpoint) to an estimator such that (i) proposals are generated from a particular term $h_{\sigma_c}(\theta)$ of (8) and (ii) importance weights are computed according to (8). This makes the proposal generating step easier by ignoring label switching, even though all the $h_\sigma(\theta)$'s need be evaluated to compute $q(\theta)$.

The importance functions found in (4) and (8) have the same underlying motivation of a better approximation of the joint posterior distribution, and the resulting estimate of (5) should therefore be more efficient. Both are designed to cover the $k!$ clusters, which are created either by symmetrizing the labels of the hyperparameter set $\{\theta^{(j)}, \mathbf{z}^{(j)}\}_{j=1}^{J}$ as in (8) or by randomly permuting the label of each $\{\theta^{(j)}, \mathbf{z}^{(j)}\}_{j=1}^{J}$ as in (4). Once $k!$ clusters of hyperparameters are thus constructed, the corresponding conditional densities constitute clusters for $q$.

Consider $\kappa \in \{1, \dots, k!\}$, a cluster index of $q$. Associating the cluster component function $q_\kappa(\cdot \mid \mathbf{x})$ with a support $S_\kappa$, the importance function $q$ is expressed as
$$q(\theta \mid \mathbf{x}) = \sum_{\kappa=1}^{k!} p(\kappa)\, q_\kappa(\theta \mid \mathbf{x}), \qquad (9)$$
where $p(\kappa)$ is the proportion of those conditional densities that are associated with the cluster $\kappa$ and $\sum_{\kappa=1}^{k!} p(\kappa) = 1$. The importance function representation (8) is thus a special case of (9) with, for $\kappa = 1, \dots, k!$, $S_{\sigma_\kappa} = S_\kappa$, $h_{\sigma_\kappa}(\theta) = q_\kappa(\theta \mid \mathbf{x})$ and $p(\kappa) = 1/k!$. By contrast, the density (4) does not achieve perfect symmetry, which means $\kappa$ is not uniformly distributed, although $p(\kappa) \to 1/k!$ as $J \to \infty$.

A marginal likelihood estimate using $q(\theta)$ as in (9) follows from a standard importance sampling identity,
$$E(k) = \int_{S_q} \frac{\pi(\theta)\, p_k(\mathbf{x} \mid \theta)}{q(\theta \mid \mathbf{x})} \left( \sum_{\kappa=1}^{k!} p(\kappa)\, q_\kappa(\theta \mid \mathbf{x}) \right) \mathrm{d}\theta = \sum_{\kappa=1}^{k!} \int_{S_\kappa} \frac{\pi(\theta)\, p_k(\mathbf{x} \mid \theta)}{q(\theta \mid \mathbf{x})}\, p(\kappa)\, q_\kappa(\theta \mid \mathbf{x})\, \mathrm{d}\theta = \mathbb{E}_{p(\theta, \kappa)}[\omega(\theta)], \qquad (10)$$
leading to
$$\hat{E}(k) = \frac{1}{T} \sum_{t=1}^{T} \omega(\theta^{(t)}),$$
where $\omega(\theta) = \pi(\theta)\, p_k(\mathbf{x} \mid \theta) \big/ q(\theta \mid \mathbf{x})$, namely a weighted sum of integrals over the $S_\kappa$'s ($\kappa = 1, \dots, k!$).

Due to the perfect symmetry in the clusters of (8), the integrals of $\omega q_\kappa$ with respect to $\theta$ over $S_\kappa$ for $\kappa = 1, \dots, k!$ are equal. Given an arbitrary cluster $\kappa_o$, the evidence is
$$E(k) = \sum_{\kappa=1}^{k!} p(\kappa) \left( \int_{S_\kappa} \omega(\theta)\, q_\kappa(\theta \mid \mathbf{x})\, \mathrm{d}\theta \right) = \int_{S_{\kappa_o}} \omega(\theta)\, q_{\kappa_o}(\theta \mid \mathbf{x})\, \mathrm{d}\theta = \mathbb{E}_{q_{\kappa_o}(\theta \mid \mathbf{x})}[\omega(\theta)]. \qquad (11)$$
Note that the corresponding estimator (a Monte Carlo approximation based on $T$ draws) for the above has exactly the same form as the estimator for (10).

Both (10) and (11) are thus importance sampling estimators, using (4) and (8) respectively. Hence standard convergence results hold: by the Law of Large Numbers, both estimates converge a.s. to $E(k)$, and the Central Limit Theorem also holds,
$$\sqrt{T} \left( \frac{1}{T} \sum_{t=1}^{T} \omega(\theta^{(t)}) - E(k) \right) \underset{T \to \infty}{\longrightarrow} N(0, V_1), \qquad \sqrt{T} \left( \frac{1}{T} \sum_{t=1}^{T} \omega(\theta^{(t)}) - E(k) \right) \underset{T \to \infty}{\longrightarrow} N(0, V_2),$$
where $V_1 = \mathrm{var}_{p(\theta, \kappa \mid \mathbf{x})}(\omega(\theta))$ and $V_2 = \mathrm{var}_{q_{\kappa_o}(\theta \mid \mathbf{x})}(\omega(\theta))$. The perfect symmetry in the clusters of (8) does not guarantee a better efficiency in estimation; those variances are rather highly related to how well the importance functions approximate the joint posterior distribution. If $J_1 = J_2\, k!$ and both importance functions provide a good approximation, $V_1 \approx V_2$ is expected.

Both estimators (10) and (11) suffer from massive computational demands when $k$ is large. In this section, we show how to approximate (7) and thereby increase the computational efficiency (i.e., reduce computing time).

It was shown in Section 3.2 that $q$ as in (7) is invariant under a permutation of the labels of $\theta$ and that proposals can be generated from a particular term $h_{\sigma_c}(\theta)$ of (8) without any loss of statistical efficiency. Given $(\theta^{(1)}, \dots, \theta^{(T)}) \sim h_{\sigma_c}(\theta)$, it is natural to consider whether or not all terms in $\{h_{\sigma_1}(\theta^{(t)}), \dots, h_{\sigma_{k!}}(\theta^{(t)})\}$ are different from zero for $t = 1, \dots, T$. In the case some are not, it is obviously computationally relevant to determine which ones among them are likely to be insignificant (i.e., almost zero). This perspective motivates the following section.

Given a proposal $\theta$ generated from a particular $h_{\sigma_c}(\theta)$, $\theta \in S_{\sigma_c}$, the importance function at $\theta$ is an average of all $h_\sigma(\theta)$'s, and the relative contribution of each term is
$$\eta_{\sigma_i}(\theta) = h_{\sigma_i}(\theta) \big/ k!\, q(\theta) = h_{\sigma_i}(\theta) \Big/ \sum_{l=1}^{k!} h_{\sigma_l}(\theta), \qquad i = 1, \dots, k!.$$
If $\eta_{\sigma_i}(\theta)$ is close to zero, $h_{\sigma_i}(\theta)$ is negligible within $q(\theta)$; at the opposite extreme, $\eta_{\sigma_i}(\theta) \approx 1$ when $k!\, q(\theta) \approx h_{\sigma_i}(\theta)$. The expected relative contribution of $h_{\sigma_i}(\theta)$,
$$\mathbb{E}_{h_{\sigma_c}}[\eta_{\sigma_i}(\theta)] = \int_{S_{\sigma_c}} \eta_{\sigma_i}(\theta)\, h_{\sigma_c}(\theta)\, \mathrm{d}\theta,$$
is estimated by
$$\hat{\mathbb{E}}_{h_{\sigma_c}}[\eta_{\sigma_i}(\theta)] = \frac{1}{M} \sum_{l=1}^{M} \eta_{\sigma_i}(\theta^{(l)}), \qquad \theta^{(l)} \sim h_{\sigma_c}(\theta). \qquad (12)$$
After an appropriate permutation of the indices, we obtain that $\hat{\mathbb{E}}_{h_{\sigma_c}}[\eta_{\sigma_1}(\theta)] \geq \cdots \geq \hat{\mathbb{E}}_{h_{\sigma_c}}[\eta_{\sigma_{k!}}(\theta)]$, namely that the corresponding $h_{\sigma_1}, \dots, h_{\sigma_{k!}}$ are in decreasing order of expected contributions. The importance function $q(\theta)$ can then be approximated using only the $n$ most important $h_\sigma$'s ($1 \leq n \leq k!$), leading to the approximation
$$\tilde{q}_n(\theta) = \frac{1}{k!} \sum_{i=1}^{n} h_{\sigma_i}(\theta), \qquad (13)$$
and the mean absolute difference from $q(\theta)$ is approximated by
$$\hat{\phi}_n = \frac{1}{M} \sum_{l=1}^{M} \left| \tilde{q}_n(\theta^{(l)}) - q(\theta^{(l)}) \right|, \qquad \theta^{(l)} \sim h_{\sigma_c}(\theta). \qquad (14)$$
If $\hat{\phi}_n$ falls below a given tolerance $\tau$, $\tilde{q}_n$ is considered to be an appropriate approximation for $q$. We define the corresponding approximate set $A(k) \subseteq S_k$ to be made of $\{\sigma_1, \dots, \sigma_n\}$, $n$ being defined as the smallest size that satisfies the condition $\hat{\phi}_n < \tau$.
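The selection of $A(k)$ through (12)–(14) amounts to ranking the $k!$ terms by estimated contribution and keeping the shortest prefix whose truncated mixture stays within $\tau$ of $q$; a minimal sketch (taking a precomputed $M \times k!$ array of $h$ values as input, an interface of our own choosing):

```python
def approximate_set(h_values, tau):
    """Select A(k) as in eqs. (12)-(14).
    h_values[t][i] holds h_{sigma_i}(theta_t) for draws theta_t ~ h_{sigma_c}.
    Returns the indices of the n highest-contribution terms, n being the
    smallest size whose mean absolute difference phi_n is below tau."""
    m = len(h_values)
    kfact = len(h_values[0])
    # full importance function q(theta_t), eq. (8)
    q = [sum(row) / kfact for row in h_values]
    # estimated expected relative contributions, eq. (12)
    contrib = [sum(row[i] / sum(row) for row in h_values) / m
               for i in range(kfact)]
    order = sorted(range(kfact), key=lambda i: -contrib[i])
    for n in range(1, kfact + 1):
        # truncated approximation q_n, eq. (13), and its error phi_n, eq. (14)
        q_n = [sum(row[i] for i in order[:n]) / kfact for row in h_values]
        phi_n = sum(abs(a - b) for a, b in zip(q_n, q)) / m
        if phi_n < tau:
            return order[:n]
    return order
```

When one relabelling dominates, as in rows like `[1.0, 0.0]`, the selected set reduces to a single permutation, matching the $|A(k)| = 1$ behaviour reported in the experiments of Section 5.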
With this truncation, the computational efficiency obviously improves. Note that the set $A(k)$ is determined under the assumption that all proposals $(\theta^{(t)})$ are generated from $h_{\sigma_c}$, since the quality of the approximation is only guaranteed under this assumption. Due to the perfect symmetry of $q(\theta)$ over the $k!$ permutations, the choice of $\sigma_c$ is irrelevant for the computational gains. The evidence estimate using such an approximation is detailed in the following algorithm.

Algorithm 1: Dual importance sampling algorithm with approximation

1. Randomly select $\{\mathbf{z}^{(j)}, \theta^{(j)}\}_{j=1}^{J}$ from the Gibbs sample and remove label switching by an appropriate method. Then construct $q(\theta)$ as in (8).
2. Derive the corresponding term $h_{\sigma_c}(\theta)$ and generate particles $\{\theta^{(t)}\}_{t=1}^{T} \sim h_{\sigma_c}(\theta)$.
3. Construct an approximation, $\tilde{q}(\theta)$, using the first $M$ terms in $\{\theta^{(t)}\}_{t=1}^{T}$:
   (a) Compute $(h_{\sigma_1}(\theta^{(t)}), \dots, h_{\sigma_{k!}}(\theta^{(t)}), \eta_{\sigma_1}(\theta^{(t)}), \dots, \eta_{\sigma_{k!}}(\theta^{(t)}))$ for $t = 1, \dots, M$, and $\hat{\mathbb{E}}_{h_{\sigma_c}}[\eta_{\sigma_1}(\theta)], \dots, \hat{\mathbb{E}}_{h_{\sigma_c}}[\eta_{\sigma_{k!}}(\theta)]$ as in (12).
   (b) Reorder the $\sigma$'s so that $\hat{\mathbb{E}}_{h_{\sigma_c}}[\eta_{\sigma_1}(\theta)] \geq \cdots \geq \hat{\mathbb{E}}_{h_{\sigma_c}}[\eta_{\sigma_{k!}}(\theta)]$.
   (c) Initialise $n = 1$ and compute $\tilde{q}_n(\theta^{(1)}), \dots, \tilde{q}_n(\theta^{(M)})$ as in (13) and $\hat{\phi}_n$ as in (14). If $\hat{\phi}_{n=1} < \tau$, go to Step 4. Otherwise increase $n$ by one and update $\tilde{q}_n(\theta)$ and $\hat{\phi}_n$ until $\hat{\phi}_n < \tau$.
4. Calculate $\tilde{q}_n(\theta^{(M+1)}), \dots, \tilde{q}_n(\theta^{(T)})$ and replace $q(\theta^{(1)}), \dots, q(\theta^{(T)})$ with $\tilde{q}_n(\theta^{(1)}), \dots, \tilde{q}_n(\theta^{(T)})$ in (5) to estimate $\hat{E}$.

In Step 1, we used the method of Jasra et al. (2005), even though alternatives implemented in the label.switching package of Papastamoulis and Iliopoulos (2010) or in Rodriguez and Walker (2014) could be used as well. The total number of $h$ values that are computed is $Tk!$ in the standard dual importance sampling scheme, but it decreases to $Mk! + |A(k)|(T - M)$ when using $\tilde{q}(\theta)$. The relative gain in the total number of terms is thus
$$\Delta(A(k)) = \frac{Mk! + |A(k)|(T - M)}{Tk!} = \frac{M}{T}\left(1 - \frac{|A(k)|}{k!}\right) + \frac{|A(k)|}{k!}. \qquad (15)$$
The gain will thus depend on how small $|A(k)|$ is when compared with $k!$, hence ultimately on the acceptable mean absolute difference $\tau$.

Two simulated mixture datasets and two real datasets are used to examine the performance of seven marginal likelihood estimators. The simulated datasets, $D_1$ and $D_2$, are

• $D_1$: $x_1, \dots, x_n \sim 0.{\cdot}\, N(-\cdot, 1) + 0.{\cdot}\, N(5, \cdot)$,
• $D_2$: $x_1, \dots, x_n \sim 0.{\cdot}\, N(-\cdot, 1) + 0.{\cdot}\, N(1, \cdot) + 0.{\cdot}\, N(6, \cdot)$,

where $N(5, 2)$ denotes a normal distribution with a mean of 5 and a standard deviation of 2. Two real datasets, called the galaxy and fishery datasets respectively, are shown in Figure 1. They have been frequently used as benchmarks in the literature (see, e.g., Chib 1995; Frühwirth-Schnatter 2006; Jasra et al. 2005; Richardson and Green 1997; Stephens 2000b).

Gaussian and Dirichlet priors are used for the means $\{\mu_i\}_{i=1}^{k}$ and the proportions $\lambda$,
$$\{\mu_i\}_{i=1}^{k} \sim N(0, \cdot) \qquad \text{and} \qquad (\lambda_1, \dots, \lambda_k) \sim \mathrm{Dir}(1, \dots, 1).$$
For the variance parameters $\{\sigma_i\}_{i=1}^{k}$, inverse gamma distributions with two sets of hyperparameters, $IG(2, 3)$ and $IG(2, \cdot)$, are considered. Gibbs simulations are used to approximate $E(k)$.

Firstly, a sensitivity analysis is conducted on the expected relative contribution of $h_{\sigma_i}$ to $q(\theta)$ with respect to $M$. Then we set the values for both $M$ and $\tau$. In Section 5.2, the performance of the seven estimators of $E(k)$ is compared through a large simulation study, which confirms that the asymptotic variance of $\hat{E}(k)$ based on (7) is smaller than when based on (4).

Figure 1: Histogram of the data against estimated six- and four-component Gaussian mixture densities (solid line) for (a) the galaxy dataset and (b) the fishery dataset, respectively.

Choosing M and τ

The approximation set is constructed in two steps. First, we compute $\hat{\mathbb{E}}_{h_{\sigma_c}}[\eta_{\sigma_1}(\theta)], \dots, \hat{\mathbb{E}}_{h_{\sigma_c}}[\eta_{\sigma_{k!}}(\theta)]$, based on reduced samples of size $M$, as in (12). Second, we determine which terms are negligible when compared with the threshold $\tau$. In our experiments, we chose $\tau$ conservatively so that all zero terms are identified: since $10^{-\cdot}$ is rounded down to 0 in MatLab, $\tau = 10^{-\cdot}$ was chosen for the following simulation studies.

The expected relative contribution measures for $D_1$ and $D_2$ are given in Tables 1 and 2, respectively. For $J = 10$ initial Gibbs simulations, significantly contributing clusters are easily identified by $\{\hat{\mathbb{E}}_{h_\sigma}[\eta_{\sigma_i}(\theta)]\}_{i=1}^{k!}$, and both $|A(k)|$ and $\hat{\phi}$ are relatively stable against $M$. Under a natural lack of label switching, $q(\theta)$ seems to be well approximated using a single term $h_\sigma(\theta)$, as seen in Table 1. Even when some label switching occurs in a Gibbs sequence corresponding to a Gaussian mixture model with three components, only two of the terms $h_\sigma(\theta)$ significantly contribute to $q(\theta)$, as seen in Table 2. For the subsequent analyses in this paper, we chose $J = 10$, $M = 10$ and $\tau = 10^{-\cdot}$.

Table 1: Estimates of $\{\hat{\mathbb{E}}_{h_\sigma}[\eta_{\sigma_i}]\}_{i=1}^{k!}$, $|A(k)|$ and $\hat{\phi}$ against $M$ for $D_1$ ($k = 2$); across all values of $M$, $|A(k)| = 1$ and $\hat{\phi} = 0$. The prior for a variance parameter is $IG(2, \cdot)$.

Table 2: Estimates of $\{\hat{\mathbb{E}}_{h_\sigma}[\eta_{\sigma_i}]\}_{i=1}^{k!}$, $|A(k)|$ and $\hat{\phi}$ against $M$ for $D_2$ ($k = 3$); across all values of $M$, $|A(k)| = 2$ and $\hat{\phi} = 0$. The prior for a variance parameter is $IG(2, \cdot)$.

The following seven marginal likelihood estimators, using an equal number of proposals, are compared:

• $\hat{E}^{*}_{Ch}$: Chib's method (2), using $T = 10$ samples and multiplying by $k!$ to compensate for a lack of label switching;
• $\hat{E}_{Ch}$: Chib's method with density estimate (2), using $T = 10$ randomly permuted Gibbs samples;
• $\hat{E}_{IS}$: importance sampling using $q$ as in (6), with a maximum likelihood estimate for $z^{o}_1, \dots, z^{o}_{n_x}$ and $T = 10$ particles;
• $\hat{E}_{DS}$: dual importance sampling using $q$ as in (7), with $T = 10$ particles and $J = 100$ Gibbs samples in $q(\theta)$;
• $\hat{E}_{ADS}$: dual importance sampling using an approximation as in (13), with $T = 10$ particles, $J = 100$ and $M = 10$;
• $\hat{E}_{J}$: importance sampling using $q$ as in (4), with $T = 10$ particles; when $k < \cdot$, $J = 100\,k!$, and otherwise $J = 5000$;
• $\hat{E}_{BS}$: bridge sampling (3), using $M_1 = M_2 = 6 \times 10^{\cdot}$ samples and $q(\theta)$ as in (4) via 10 iterations; for $q$, $J = 4000$ and label switching is imposed on the hyperparameters $\{\theta^{(j)}, \mathbf{z}^{(j)}\}_{j=1}^{J}$.

The marginal likelihood estimates (on the log scale) and the effective sample size (ESS) ratios, $R = \mathrm{ESS}/T$, are summarized in Figures 2 and 3 by boxplots, based on 50 replicates. Subscripts of $\hat{E}$ and $R$ denote the estimating technique. Note that a modified ESS, provided by equation (35) in Doucet et al. (2000), is used here for numerical stability.
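For reference, the standard ESS ratio $R = \mathrm{ESS}/T$ computed from importance weights is sketched below (the experiments use the modified ESS of Doucet et al. (2000) instead; this unmodified version conveys the same diagnostic):

```python
def ess_ratio(weights):
    """Standard effective-sample-size ratio R = ESS / T for importance weights:
    ESS = (sum w)^2 / sum w^2. R equals 1 for equal weights and approaches
    1 / T when a single weight dominates."""
    s1 = sum(weights)
    s2 = sum(w * w for w in weights)
    return (s1 * s1 / s2) / len(weights)
```

For instance, `ess_ratio([1.0] * 10)` is 1.0 (equal weights), while `ess_ratio([1.0, 0.0, 0.0, 0.0])` is 0.25, signalling that a single particle carries all the weight.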
All estimators are based on $10$ proposals, as in Table 3, where summing the second and third columns leads to a fixed total number of function evaluations. Within our setup, $\hat{E}_{IS}$ is the least demanding in terms of computational workload, while the remaining importance estimators require the same computing time, except for $\hat{E}_{ADS}$.

Simulated mixture datasets
Mixture models of two and three components are fitted to D₁ and D₂, respectively. Regardless of the presence or absence of label switching in the resulting Gibbs sequences, all estimates based on importance sampling except Ê_IS coincide with Ê_Ch, albeit with smaller Monte Carlo variations, as seen in Figures 2 and 3. When a suitable approximation of q(θ) is used for the dual importance sampling, no significant difference in the estimates log(Ê(k)) and in the effective sample sizes is observed. The mean sizes of A(k) in Table 4 are always smaller than k!, which shows that E(k) can be estimated with a lesser computational workload. When the posterior modes are very well separated (no natural label switching ever present in the Gibbs sequences), the number of evaluations in q is reduced almost by the maximal factor of 1/k!.

Table 3: Computation steps required by the different evidence estimation approaches; the computation for Ê_BS is given per iteration.
  Ê_IS : T;  Tk!;  T
  Ê_DS : T;  TJk!;  T
  Ê_ADS: T;  (M + (T − M)|A(k)|/k!)Jk!;  T
  Ê_J  : T;  TJ;  T
  Ê_BS : M₁;  (M₁ + M₂)J;  M₁ + M₂

In Table 5, the least computational demand is observed for Chib's methods, while bridge sampling costs more than 100 times as much. When |A(k)| < k!, some reduction in CPU time for Ê_ADS(k) is observed, due to skipping zero-valued function evaluations.

The disagreement between the values of Ê_IS and Ê_Ch shows that an importance function may fail to properly approximate p_k(x|θ)π(θ), resulting in an unreliable estimate with large variation. The significantly small effective sample sizes (i.e., very small values of R_IS) support this observation. In our simulation experiments, we observed that Ê_BS is correctly calibrated for a large value of J (i.e., a large number of conditional densities in q). When label switching naturally occurs, as in the Gibbs sequence under the variance prior IG(2, 15), Ê*_Ch disagrees with the other estimates; see Figure 3. Unsurprisingly, this indicates that the simplistic correction through a multiplication by k! is of no use, as reported in Neal (1999), Frühwirth-Schnatter (2006) and Marin and Robert (2008).
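The permutation-symmetrised importance functions that do work can be sketched on a toy two-component Gaussian mixture. Everything below (data, proposal centre and scale) is illustrative and not the paper's q construction: the importance density is a Gaussian averaged over the k! = 2 label permutations, so that its support covers all symmetric posterior modes.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy two-component Gaussian mixture: unit variances, equal weights,
# only the two means unknown.
x = np.concatenate([rng.normal(-2.0, 1.0, 50), rng.normal(2.0, 1.0, 50)])

def log_norm(y, m, s):
    return -0.5 * np.log(2.0 * np.pi * s ** 2) - (y - m) ** 2 / (2.0 * s ** 2)

def log_lik(mu):
    """Mixture log-likelihood; invariant under permutations of mu."""
    comp = log_norm(x[:, None], np.asarray(mu)[None, :], 1.0) + np.log(0.5)
    return np.logaddexp.reduce(comp, axis=1).sum()

def log_prior(mu):
    return log_norm(np.asarray(mu), 0.0, 5.0).sum()

# Importance function: a Gaussian centred near one posterior mode,
# symmetrised over the k! = 2 label permutations.
centre, scale = np.array([-2.0, 2.0]), 0.3

def log_q(mu):
    mu = np.asarray(mu)
    terms = [log_norm(mu, centre, scale).sum(),
             log_norm(mu[::-1], centre, scale).sum()]
    return np.logaddexp.reduce(terms) - np.log(2.0)

T = 2000
base = rng.normal(centre, scale, size=(T, 2))
flip = rng.random(T) < 0.5                      # pick a label permutation at random
draws = np.where(flip[:, None], base[:, ::-1], base)

log_w = np.array([log_lik(m) + log_prior(m) - log_q(m) for m in draws])
log_evidence = np.logaddexp.reduce(log_w) - np.log(T)
```

Because log_q is permutation-invariant by construction, the weights behave the same in either labelling of the modes, which is exactly the property a plain k! correction fails to deliver.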
Table 4: Mean and standard deviation (in brackets) estimates of the approximation set size, |A(k)|, and of the reduction rate of the number of evaluated h-terms, ∆(A), as in (15), for D₁ and D₂. Subscripts 1 and 2 indicate the results under the priors σ² ~ IG(2, 3) and σ² ~ IG(2, 15), respectively.

Figure 2: Boxplots of evidence estimates in log scale (left, middle) and effective sample size ratios (right). Mixture models with two and three Gaussian components are fitted to D₁ (top) and D₂ (bottom), respectively. The prior for {σ²_i} (i = 1, …, k) is IG(2, 3); label switching did not occur in the Gibbs samples. One outlier of Ê_IS in the top-left panel is discarded.

Figure 3: Boxplots of evidence estimates in log scale (left, middle) and effective sample size ratios (right). Mixture models with two and three Gaussian components are fitted to D₁ (top) and D₂ (bottom), respectively. The prior for {σ²_i} (i = 1, …, k) is IG(2, 15); label switching naturally occurred in the Gibbs samples. Two outliers of Ê*_Ch in the top-left panel are discarded.

Galaxy and fishery dataset
The priors suggested by Richardson and Green (1997) are used for our simulation study:

  μ₁, …, μ_k ~ N(x̄, r²),  σ₁², …, σ_k² ~ IG(2, β),  β ~ G(0.2, 10/r²),  λ₁, …, λ_k ~ Dirichlet(1, …, 1),
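This hierarchy is straightforward to simulate. The sketch below assumes the standard Richardson and Green (1997) constants g = 0.2 and h = 10/r² for the hyperprior on β, since the exact values are not legible in our copy.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_prior(x, k, g=0.2, h_scale=10.0):
    """Draw one parameter set (mu, sigma^2, lambda) from the hierarchical
    prior above.  g = 0.2 and h = 10/r^2 follow Richardson and Green (1997)
    and are assumed values here."""
    x_med, r = np.median(x), np.ptp(x)                 # median and range of the data
    mu = rng.normal(x_med, r, size=k)                  # mu_i ~ N(x_med, r^2)
    beta = rng.gamma(g, scale=r ** 2 / h_scale)        # beta ~ G(g, h) with rate h = 10/r^2
    sigma2 = 1.0 / rng.gamma(2.0, scale=1.0 / beta, size=k)  # sigma_i^2 ~ IG(2, beta)
    lam = rng.dirichlet(np.ones(k))                    # lambda ~ Dirichlet(1, ..., 1)
    return mu, sigma2, lam
```

Note that numpy's gamma sampler is parameterised by shape and scale, so a rate of h corresponds to scale 1/h, and the inverse gamma draw is obtained as the reciprocal of a gamma draw.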
where x̄ and r are the median and the range of x, respectively.

Table 5: Elapsed CPU time in seconds for the evidence approximation of mixture models for D₁ and D₂. Subscripts 1 and 2 of CPU indicate the results under the priors σ² ~ IG(2, 3) and σ² ~ IG(2, 15), respectively.

Normal mixture models are fitted to both datasets, and estimates of log(E(k)) and R are summarized in Figures 4 and 5. In general, similar behaviour of log(Ê(k)) and R across the methods is observed. For all cases, the dual importance sampling schemes (Ê_DS and Ê_ADS) and Ê_J agree with Chib's approach (Ê_Ch). Unless the modes of the joint posterior distribution are clearly separated (e.g., |A(k)| ≈ 1), Ê*_Ch is biased due to an improper permutation correction. When a poor q(θ) is used for importance sampling, inaccurate approximations result and the range of the Ê_IS estimates deviates markedly from the other estimates.

Symptoms of the "curse of dimensionality" can be observed. As k increases, the effective sample size decreases exponentially fast and the variation in the estimates increases. Given the complex shape of the posterior distribution, the support common to q(θ) and π*_k(θ|z) becomes progressively smaller and Ê_BS becomes less accurate, as shown in both figures. When k = 6, the variation in the values of Ê_Ch is much larger than that of the dual importance sampling estimates. When J is much smaller than 100k!, q does not provide a good approximation of the joint posterior and log(Ê_J) is then biased. Due to the fast increase of k!, fast-increasing CPU times are seen for all estimators in Table 7.

The reduction in the number of evaluated terms used to approximate Ê(k) varies case by case, as shown in Table 6. When k = 4 and k = 6, components of the posterior distribution for the galaxy data tend to have long flat tails and thus have a higher chance of overlapping each other. Consequently, the workload reduction is of lesser magnitude than for a model with a smaller number of components. Provided that some functions are neglected for Ê_ADS, there is some gain in computational efficiency, as can be seen in Table 6.
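The tabulated reduction rates are consistent with the per-particle accounting of Table 3: a share M/T of the particles is evaluated over all k! permutations and the remainder only over the approximation set A(k). The sketch below uses the assumed value M/T = 0.1, which reproduces the reported figures; the formal definition of ∆(A) is equation (15) in the text.

```python
from math import factorial

def reduction_rate(A_size, k, m_ratio=0.1):
    """Delta(A): fraction of h-term evaluations retained when a share
    m_ratio of the particles is evaluated over all k! permutations and
    the rest only over the approximation set A(k).  m_ratio = M/T = 0.1
    is an assumed constant matching the tabulated values; the exact
    definition is eq. (15) in the text."""
    return m_ratio + (1.0 - m_ratio) * A_size / factorial(k)
```

For the galaxy data with k = 4 and |A(k)| = 13.34 this gives ∆ ≈ 0.60, matching Table 6; the well-separated fishery case with |A(k)| ≈ 1 approaches the floor M/T + (1 − M/T)/k!.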
Table 6: Mean and standard deviation (in brackets) of the approximation set sizes, |A(k)|, and of the reduction rate of the number of evaluated h-terms, ∆(A), as in (15), for the (a) fishery and (b) galaxy datasets.

(a) Fishery data
  k   k!   |A(k)|          ∆(A)
  3   6    1.00 (0.00)     0.25 (0.00)
  4   24   2.10 (0.76)     0.18 (0.03)

(b) Galaxy data
  k   k!   |A(k)|          ∆(A)
  3   6    1.06 (0.24)     0.26 (0.04)
  4   24   13.34 (5.35)    0.60 (0.20)
  6   720  176.78 (75.31)  0.32 (0.09)

Table 7: Elapsed CPU time in seconds for the evidence approximation of mixture models for the fishery and galaxy datasets.

Figure 4: Boxplots of evidence estimates in log scale (left, middle) and effective sample size ratios (right). Mixture models with three (top) and four (bottom) Gaussian components are fitted to the fishery dataset. Two outliers of Ê_Ch in the top-left panel are discarded.

Figure 5: Boxplots of evidence estimates in log scale (left, middle) and effective sample size ratios (right). Mixture models with three (top), four (middle), and six (bottom) Gaussian components are fitted to the galaxy dataset. One outlier of Ê_Ch in the top-left panel is discarded.

Conclusion

This paper considered evidence approximation by importance sampling for mixture models and re-evaluated some of the known challenges resulting from the high multimodality of the posterior density. Importance sampling requires that the support of the importance function encompasses the support of the posterior density in order to perform properly. In the specific case of mixture models, missing some of the invariance under permutations is likely to produce an unsuitable support and hence a poor estimate of the evidence.

In our study, exchangeable priors are used, which implies that the posterior and marginal posterior densities exhibit k! symmetrical terms.
Two marginal likelihood estimators are proposed here and tested against other existing estimators. The first approach exploits the permutation invariance of π(·|x, z°) with a pointwise MLE, z°, to create an importance function. However, due to the resulting poor support, this approach performs quite poorly in our simulation studies. Another poor estimate is derived from Chib's method when the invariance under permutation is not reproduced in the sample (Neal 2001).

A second importance function is constructed by a double Rao–Blackwellisation, hence the denomination of dual importance sampling. We demonstrate both methodologically and practically that this solution fits the demands of mixture estimation. Moreover, by introducing a suitable and implementable approximation scheme, we show how to avoid the exponential increase in k of the computational workload. The idea at the core of this approximation is to bypass negligible elements in the approximation thanks to the perfect symmetry of the posterior density. When posterior modes are well separated, the gain is of a larger magnitude than when those modes strongly overlap.

Borrowing from the original approach in Chib (1996), dual importance sampling can be extended to cases where the conditional Gibbs sampling densities are not available in closed form. However, this solution suffers from the curse of dimensionality, just like any other importance sampling estimator.

Alternative evidence approximation techniques could be considered for this problem, as exemplified in Friel and Wyse (2012). For instance, ensemble Monte Carlo draws samples from local ensembles that are extensions or compositions of the original target, e.g., using parallel tempering Monte Carlo methods. Extending this idea, Bayes factor approximations were proposed using annealed importance sampling (Neal 2001) and power posteriors (Friel and Pettitt 2008).
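As an illustration of the power posterior route (not one of the paper's estimators), thermodynamic integration can be checked on a conjugate toy model where every tempered posterior is available in closed form, so each rung is sampled exactly rather than by MCMC.

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(1.0, 1.0, size=20)   # toy data: y_i ~ N(mu, 1), prior mu ~ N(0, 1)
n, s = y.size, y.sum()

def loglik(mu):
    """Log-likelihood of the toy model, vectorised over an array of mu values."""
    mu = np.atleast_1d(mu)
    return -0.5 * n * np.log(2.0 * np.pi) - 0.5 * ((y[:, None] - mu) ** 2).sum(axis=0)

# Power posterior pi_t(mu) ∝ p(y|mu)^t pi(mu); conjugacy gives
# mu | y, t ~ N(ts/(1 + tn), 1/(1 + tn)), so each rung is sampled exactly.
ts = np.linspace(0.0, 1.0, 50) ** 3           # temperature grid, denser near t = 0
post_mean = ts * s / (1.0 + ts * n)
post_sd = 1.0 / np.sqrt(1.0 + ts * n)
draws = rng.normal(post_mean, post_sd, size=(5000, ts.size))
e_loglik = loglik(draws.ravel()).reshape(draws.shape).mean(axis=0)

# Thermodynamic integration: log E = ∫_0^1 E_{pi_t}[log p(y|mu)] dt (trapezoid rule).
log_evidence = 0.5 * ((e_loglik[1:] + e_loglik[:-1]) * np.diff(ts)).sum()

# Closed-form check: marginally y ~ N(0, I + 11'), so the exact answer is available.
log_evidence_exact = (-0.5 * n * np.log(2.0 * np.pi) - 0.5 * np.log(1.0 + n)
                      - 0.5 * ((y ** 2).sum() - s ** 2 / (1.0 + n)))
```

The denser grid near t = 0 compensates for the fast change of the expected log-likelihood there, a standard choice for power posterior schedules.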
Further investigation is needed to characterize the performance of these alternative solutions in the setting of mixture models and label switching.

References
Antoniak, C. (1974). "Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems." The Annals of Statistics, 2: 1152–1174.
Ardia, D., Baştürk, N., Hoogerheide, L., and van Dijk, H. K. (2012). "A comparative study of Monte Carlo methods for efficient evaluation of marginal likelihood." Computational Statistics and Data Analysis, 56: 3398–3414.
Berkhof, J., van Mechelen, I., and Gelman, A. (2003). "A Bayesian approach to the selection and testing of mixture models." Statistica Sinica, 13(3): 423–442.
Carlin, B. and Chib, S. (1995). "Bayesian model choice through Markov chain Monte Carlo." Journal of the Royal Statistical Society, Series B, 57(3): 473–484.
Celeux, G., Hurn, M., and Robert, C. P. (2000). "Computational and inferential difficulties with mixture posterior distributions." Journal of the American Statistical Association, 95(3): 957–979.
Chen, M.-H., Shao, Q. M., and Ibrahim, J. G. (2000). Monte Carlo Methods in Bayesian Computation. Springer Series in Statistics, 1st edition.
Chib, S. (1995). "Marginal likelihoods from the Gibbs output." Journal of the American Statistical Association, 90: 1313–1321.
— (1996). "Calculating posterior distributions and modal estimates in Markov mixture models." Journal of Econometrics, 75: 79–97.
Chopin, N. (2002). "A sequential particle filter method for static models." Biometrika, 89(3): 539–552.
Chopin, N. and Robert, C. P. (2010). "Properties of nested sampling." Biometrika, 97: 741–755.
Congdon, P. (2006). "Bayesian model choice based on Monte Carlo estimates of posterior model probabilities." Computational Statistics and Data Analysis, 50: 346–357.
DiCiccio, T. J., Kass, R. E., Raftery, A., and Wasserman, L. (1997). "Computing Bayes factors by combining simulation and asymptotic approximations." Journal of the American Statistical Association, 92: 903–915.
Diebolt, J. and Robert, C. (1994). "Estimation of finite mixture distributions through Bayesian sampling." Journal of the Royal Statistical Society, Series B, 56: 363–375.
Doucet, A., Godsill, S., and Andrieu, C. (2000). "On sequential Monte Carlo sampling methods for Bayesian filtering." Statistics and Computing, 10: 197–208.
Escobar, M. and West, M. (1995). "Bayesian density estimation and inference using mixtures." Journal of the American Statistical Association, 90(430): 577–588.
Friel, N. and Pettitt, A. N. (2008). "Marginal likelihood estimation via power posteriors."
Journal of the Royal Statistical Society, Series B, 70: 589–607.
Friel, N. and Wyse, J. (2012). "Estimating the evidence: a review." Statistica Neerlandica, 66(3): 288–308.
Frühwirth-Schnatter, S. (2001). "Markov chain Monte Carlo estimation for classical and dynamic switching and mixture models." Journal of the American Statistical Association, 96: 194–209.
— (2004). "Estimating marginal likelihoods for mixture and Markov switching models using bridge sampling techniques." The Econometrics Journal, 7: 143–167.
Frühwirth-Schnatter, S. (2006). Finite Mixture and Markov Switching Models. Springer Series in Statistics, 1st edition.
Gelfand, A. E. and Smith, A. F. M. (1990). "Sampling-based approaches to calculating marginal densities." Journal of the American Statistical Association, 85: 398–409.
Gelman, A. and Meng, X. L. (1998). "Simulating normalizing constants: from importance sampling to bridge sampling to path sampling." Statistical Science, 13: 163–185.
Geweke, J. (2007). "Interpretation and inference in mixture models: simple MCMC works." Computational Statistics and Data Analysis, 51: 3529–3550.
Green, P. (1995). "Reversible jump Markov chain Monte Carlo computation and Bayesian model determination." Biometrika, 82(4): 711–732.
Jasra, A., Holmes, C., and Stephens, D. (2005). "Markov chain Monte Carlo methods and the label switching problem in Bayesian mixture modeling." Statistical Science, 20(1): 50–67.
Jeffreys, H. (1939). Theory of Probability. Oxford: The Clarendon Press, 1st edition.
Marin, J. and Robert, C. (2007). Bayesian Core. Springer-Verlag, New York.
— (2010a). "Importance sampling methods for Bayesian discrimination between embedded models." In Chen, M.-H., Dey, D., Müller, P., Sun, D., and Ye, K. (eds.), Frontiers of Statistical Decision Making and Bayesian Analysis. Springer-Verlag, New York.
— (2010b). "On resolving the Savage–Dickey paradox." Electronic Journal of Statistics, 4: 643–654.
Marin, J.-M., Mengersen, K., and Robert, C. P. (2005). "Bayesian modelling and inference on mixtures of distributions." In Rao, C. and Dey, D. (eds.), Handbook of Statistics, volume 25. Springer-Verlag, New York.
Marin, J.-M. and Robert, C. P. (2008). "Approximating the marginal likelihood in mixture models." Bulletin of the Indian Chapter of ISBA, 1: 2–7.
Meng, X. L. and Schilling, S. (2002). "Warp bridge sampling." Journal of Computational and Graphical Statistics, 11(3): 552–586.
Meng, X. L. and Wong, W. H. (1996). "Simulating ratios of normalizing constants via a simple identity." Statistica Sinica, 6: 831–860.
Mira, A. and Nicholls, G. (2004). "Bridge estimation of the probability density at a point." Statistica Sinica, 14: 603–612.
Neal, R. M. (1999). "Erroneous results in 'Marginal likelihood from the Gibbs output'." Letter, available at ∼radford/chib-letter.html.
— (2001). "Annealed importance sampling." Statistics and Computing, 11: 125–139.
Newton, M. A. and Raftery, A. E. (1994). "Approximate Bayesian inference with the weighted likelihood bootstrap."
Journal of the Royal Statistical Society, Series B, 56(1): 3–48.
Papastamoulis, P. (2013). label.switching: Relabelling MCMC outputs of mixture models. R package version 1.2. URL http://CRAN.R-project.org/package=label.switching
Papastamoulis, P. and Iliopoulos, G. (2010). "An artificial allocations based solution to the label switching problem in Bayesian analysis of mixtures of distributions." Journal of Computational and Graphical Statistics, 19(2): 313–331.
Biometrika, 95: 315–321.
Perrakis, K., Ntzoufras, I., and Tsionas, E. G. (2014). "On the use of marginal posteriors in marginal likelihood estimation via importance sampling." Computational Statistics and Data Analysis, 77: 54–69.
Raftery, A., Newton, M., Satagopan, J., and Krivitsky, P. (2006). "Estimating the integrated likelihood via posterior simulation using the harmonic mean identity." Technical Report 499, University of Washington, Department of Statistics.
Rasmussen, C. E. (2000). "The infinite Gaussian mixture model." In Advances in Neural Information Processing Systems 12, 554–560. MIT Press.
Richardson, S. and Green, P. (1997). "On Bayesian analysis of mixtures with an unknown number of components." Journal of the Royal Statistical Society, Series B, 59(4): 731–792.
Robert, C. and Marin, J.-M. (2008). "On some difficulties with a posterior probability approximation technique." Bayesian Analysis, 3(2): 427–442.
Robert, C. P. and Casella, G. (2004). Monte Carlo Statistical Methods. Springer, 2nd edition.
Rodriguez, C. E. and Walker, S. (2014). "Label switching in Bayesian mixture models: deterministic relabeling strategies." Journal of Computational and Graphical Statistics, 21(1): 23–45.
Rubin, D. B. (1987). "Comment on 'The calculation of posterior distributions by data augmentation' by M. A. Tanner and W. H. Wong." Journal of the American Statistical Association, 82: 543–546.
— (1988). "Using the SIR algorithm to simulate posterior distributions." In Bernardo, J. M., DeGroot, M. H., Lindley, D. V., and Smith, A. F. M. (eds.), Bayesian Statistics 3, 395–402. Oxford University Press.
Satagopan, J., Newton, M., and Raftery, A. (2000). "Easy estimation of normalizing constants and Bayes factors from posterior simulation: stabilizing the harmonic mean estimator." Technical Report 1028, University of Wisconsin-Madison, Department of Statistics.
Scott, S. L. (2002). "Bayesian methods for hidden Markov models: recursive computing in the 21st century." Journal of the American Statistical Association, 97: 337–351.
Servidea, J. D. (2002). "Bridge sampling with dependent random draws: techniques and strategy." Ph.D. thesis, Department of Statistics, The University of Chicago.
Skilling, J. (2006). "Nested sampling for Bayesian computations." Bayesian Analysis, 1(4): 833–859.
Stephens, M. (2000a). "Bayesian analysis of mixture models with an unknown number of components - an alternative to reversible jump methods." The Annals of Statistics, 28(1): 40–74.
— (2000b). "Dealing with label switching in mixture models." Journal of the Royal Statistical Society, Series B, 62: 795–809.
Tierney, L. and Kadane, J. (1986). "Accurate approximations for posterior moments and marginal densities." Journal of the American Statistical Association, 81: 82–86.
Verdinelli, I. and Wasserman, L. (1995). "Computing Bayes factors using a generalization of the Savage–Dickey density ratio." Journal of the American Statistical Association, 90: 614–618.
Voter, A. F. (1985). "A Monte Carlo method for determining free-energy differences and transition state theory rate constants." Journal of Chemical Physics, 82: 1890–1899.