Rao–Blackwellization in the MCMC era
Christian P. Robert∗ and Gareth O. Roberts
University of Warwick, Université Paris Dauphine PSL, and CREST-ENSAE
Abstract
Rao–Blackwellization is a notion often occurring in the MCMC literature, with possibly different meanings and connections with the original Rao–Blackwell theorem (Blackwell, 1947; Rao, 1945), including a reduction of the variance of the resulting Monte Carlo approximations. This survey reviews some of the meanings of the term.
Keywords:
Monte Carlo, simulation, Rao–Blackwellization, Metropolis–Hastings algorithm, Gibbs sampler, importance sampling, mixtures, parallelisation
This paper is dedicated to Professor C.R. Rao in honour of his 100th birthday.
The neologism Rao–Blackwellization stems from the famous Rao–Blackwelltheorem (Blackwell, 1947; Rao, 1945), which states that replacing an estimator byits conditional expectation given a sufficient statistic improves estimation under anyconvex loss. This is a famous mathematical statistics result, both dreaded andappreciated by our students for involving conditional expectation and for produc-ing a constructive improvement, respectively. While Monte Carlo approximationtechniques cannot really be classified as estimation, since they operate over con-trolled simulations, rather than observations, with the ability to increase the samplesize if need be, and since there is rarely a free and unknown parameter involved,hence almost never a corresponding notion of sufficiency, seeking improvement inMonte Carlo approximation via partial conditioning has nonetheless been namedafter this elegant theorem. As shown in Figure 1, the use of the expression Rao–Blackwellization has considerably increased in the 1990’s, once the foundationalpaper popularising MCMC techniques refered to this technique to reduce MonteCarlo variability, ∗ The work of the first author was partly supported in part by the French government undermanagement of Agence Nationale de la Recherche as part of the “Blanc SIMI 1” program, referenceANR-18-CE40-0034 and in part by the French government under management of Agence Nationalede la Recherche as part of the “Investissements d’avenir” program, reference ANR19-P3IA-0001(PRAIRIE 3IA Institute). The first author also acknowledges the support of l’Institut Universitairede France through two consecutive senior chairs. We will use the American English spelling of the neologism as this version is more commonlyused in the literature. Berkson (1955) may have been the first one to use (p.142) this neologism. a r X i v : . [ s t a t . 
Figure 1: Google assessment of the popularity of the names Rao–Blackwell, Rao–Blackwellisation, and Rao–Blackwellization since the derivation of the Rao–Blackwell theorem.

The concept indeed started in the 1990 foundational paper by Gelfand and Smith (“foundational” as it launched the MCMC revolution, see Green et al., 2015). While this is not exactly what is proposed in the paper, as detailed in the following section, it is now perceived that the authors remarked that, given a Gibbs sampler whose component $\theta_1$ is simulated from the conditional distribution $\pi(\theta_1\mid\theta_2,x)$, the estimation of the marginal $\pi(\theta_1\mid x)$ is improved by considering the average of the (full) conditionals across iterations,

$$\frac{1}{T}\sum_{t=1}^{T} \pi(\theta_1\mid\theta_2^{(t)},x),$$

which provides a parametric, unbiased and $O(1/\sqrt{T})$ estimator. Similarly, the approximation to $E[\theta_1\mid x]$ based on this representation,

$$\frac{1}{T}\sum_{t=1}^{T} E[\theta_1\mid\theta_2^{(t)},x],$$

is using conditional expectations with lesser variances than the original $\theta_1^{(t)}$ and may thus lead to a reduced variance for the estimator, if correlation does not get in the way.
(In that specific two-step sampler, this is always the case, Liu et al., 1994.)

We are thus facing the difficult classification task of separating what is Rao–Blackwellization from what is not Rao–Blackwellization in simulation and in particular MCMC settings. The difficulty resides in setting the limits, as

• there is no clear notion of sufficiency in simulation and, further, conditioning may increase the variance of the resulting estimator or slow down convergence;

• variance reduction and unbiasedness are not always relevant (as shown by the infamous harmonic mean estimator, Neal, 1999; Robert and Wraith, 2009), as for instance in infinite variance importance sampling (Chatterjee and Diaconis, 2018; Vehtari et al., 2019);

• there are (too) many forms of potential conditioning in simulation settings to hope for a ranking (see, e.g., the techniques of partitioning, antithetic or auxiliary variables, control variates as in Berg et al., 2019, delayed acceptance as in Banterle et al., 2019; Beskos et al., 2006a, adaptive mixtures as in Cornuet et al., 2012; Elvira et al., 2019; Owen and Zhou, 2000, the latter more closely connected to Rao–Blackwellization);

See e.g. the comment in the Introduction Section of Liu et al. (1994).
• the large literature on the approximation of normalising constants and Bayes factors (Marin and Robert, 2010, 2011; Robert and Marin, 2008) contains many proposals that relate to Rao–Blackwellisation, as, e.g., through the simulation of auxiliary samples from instrumental distributions, as initiated in Geyer (1993) and expanded into bridge sampling by Chopin and Robert (2010) and noise-contrastive estimation by Gutmann and Hyvärinen (2012);

• in connection with the above, many versions of demarginalisation such as slice sampling (Mira et al., 2001; Roberts and Rosenthal, 1999) introduce auxiliary variables that could be exploited towards bringing a variance reduction;

• there is no optimal solution in simulation as, mathematically, a quantity such as an expectation is uniquely and exactly defined once the distribution is known: if computation time is not accounted for, the exact value is the optimal solution;

• while standing outside a probabilistic framework, quasi-Monte Carlo techniques (Liao, 1998) can also be deemed to constitute an ultimate form of Rao–Blackwellization, with the proposal of Kong et al. (2003) being an intermediate solution; but we will not cover these aspects any further here.

The rest of this review paper discusses Gibbs sampling in Section 2, other MCMC settings in Section 3, particle filters and SMC in Section 5, and concludes in Section 6.

Let us recall that a Gibbs sampler (Geman and Geman, 1984) is a specific way of building a Markov chain with stationary density $\pi(\cdot)$ through the iterative generation from conditional densities associated with the joint $\pi(\cdot)$. Its simplest version consists in partitioning the argument $\theta$ into $\theta=(\theta_1,\theta_2)$ and generating alternately from $\pi(\theta_1\mid\theta_2)$ and from $\pi(\theta_2\mid\theta_1)$.
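As a concrete illustration of the averaged conditional expectations discussed above, the following sketch (our own toy example, not taken from the paper) runs the two-step Gibbs sampler on a bivariate normal target with correlation $\rho$, where $E[\theta_1\mid\theta_2]=\rho\theta_2$ is available in closed form, and compares the variability of the plain ergodic average with its Rao–Blackwellized counterpart over repeated runs.

```python
import numpy as np

# Two-step Gibbs sampler for a bivariate normal target with unit
# variances and correlation rho.  The full conditionals are
#   theta1 | theta2 ~ N(rho*theta2, 1-rho^2),
#   theta2 | theta1 ~ N(rho*theta1, 1-rho^2),
# so E[theta1 | theta2] = rho*theta2 is available in closed form.

def gibbs_estimators(T, rho, rng):
    """Return (raw, rb): plain ergodic average of the theta1 draws versus
    the Rao-Blackwellized average of conditional expectations."""
    s = np.sqrt(1.0 - rho**2)
    t1, t2 = 0.0, 0.0
    raw_sum, rb_sum = 0.0, 0.0
    for _ in range(T):
        t1 = rho * t2 + s * rng.standard_normal()
        t2 = rho * t1 + s * rng.standard_normal()
        raw_sum += t1          # plain average of the theta1 draws
        rb_sum += rho * t2     # E[theta1 | theta2^{(t)}], closed form
    return raw_sum / T, rb_sum / T

rng = np.random.default_rng(0)
reps = [gibbs_estimators(500, 0.9, rng) for _ in range(200)]
raw, rb = zip(*reps)
print(np.var(raw), np.var(rb))  # the RB variance is the smaller one
```

Both estimators are unbiased for $E[\theta_1]=0$; in accordance with Liu et al. (1994), the Rao–Blackwellized average shows the smaller variance across replications for this two-step sampler.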
This binary version is sometimes called data augmentation, in reference to Tanner and Wong (1987), who implemented an algorithm related to the Gibbs sampler for latent variable models.

When proposing this algorithm as a way of simulating from marginal densities and (hence) posterior distributions, Gelfand and Smith (1990) explicitly relate to the Rao–Blackwell theorem, as shown by the following quote:

...we consider the problem of calculating a final form of marginal density from the final sample produced by either the substitution or Gibbs sampling algorithms. Since for any estimated marginal the corresponding full conditional has been assumed available, efficient inference about the marginal should clearly be based on using this full conditional distribution. In the simplest case of two variables, this implies that $[X\mid Y]$ and the $Y_j^{(i)}$'s ($j=1,\ldots,m$) should be used to make inferences about $[X]$, rather than imputing $X_j^{(i)}$ ($j=1,\ldots,m$) and basing inference on these $X_j^{(i)}$'s. Intuitively, this follows, because to estimate $[X]$ using the $X_j^{(i)}$'s requires a kernel density estimate. Such an estimate ignores the known form $[X\mid Y]$ that is mixed to obtain $[X]$. The formal argument is essentially based on the Rao–Blackwell theorem. We sketch a proof in the context of the density estimator itself. If $X$ is a continuous $p$-dimensional random variable, consider any kernel density estimator of $[X]$ based on the $X_j^{(i)}$'s (e.g., see Devroye and Györfi, 1985) evaluated at $x$: $\Delta_x^{(i)} = (1/m h_m^p) \sum_{j=1}^m K[(x - X_j^{(i)})/h_m]$, say, where $K$ is a bounded density on $\mathbb{R}^p$ and the sequence $\{h_m\}$ is such that, as $m\to\infty$, $h_m\to 0$, whereas $m h_m^p \to\infty$. To simplify notation, set $Q_{m,x}(X) = (1/h_m^p) K[(x-X)/h_m]$, so that $\Delta_x^{(i)} = (1/m)\sum_{j=1}^m Q_{m,x}(X_j^{(i)})$. Define $\gamma_x^i = (1/m)\sum_{j=1}^m E[Q_{m,x}(X)\mid Y_j^{(i)}]$. By our earlier theory, both $\Delta_x^{(i)}$ and $\gamma_x^i$ have the same expectation.

This may however be seen as a perversion of Rao–Blackwellization in that the dimension of the random variable used in the simulation is increased, with the resulting estimate being obtained by the so-called Law of the Unconscious Statistician. As mentioned by the authors, the “group-averaged estimator may be interpreted as Rao–Blackwellization given the orbit, so group averaging cannot increase the variance” (p. 592). The text has been retyped and may hence contain typos. The notations are those introduced by Gelfand and Smith (1990) and used for a while in the literature, see e.g. Spiegelhalter et al. (1995), with $[X\mid Y]$ denoting the conditional density of $X$ given $Y$. The double indexation of the sequence is explained below.
By the Rao–Blackwell theorem, $\mathrm{var}(E[Q_{m,x}(X)\mid Y]) \le \mathrm{var}(Q_{m,x}(X))$, and hence $MSE(\gamma_x^i) \le MSE(\Delta_x^{(i)})$, where $MSE$ denotes the mean squared error of the estimate of $[X]$.

This Section 2.6 of the paper calls for several precisions:

• the simulations $x_j^{(i)}$ and $y_j^{(i)}$ are double-indexed because the authors consider $m$ parallel and independent runs of the Gibbs sampler, $i$ being the number of iterations since the initial step, in continuation of Tanner and Wong (1987),

• the Rao–Blackwell argument is more specifically a conditional expectation step,

• as later noted by Geyer (1994), the conditioning argument is directed at (better) approximating the entire density $[X]$, even though the authors mention on the following page that the argument is “simpler for estimation of” a posterior expectation,

• they compare the mean squared errors of the expected density estimate rather than the rates of convergence of a non-parametric kernel estimator (at a dimension-dependent nonparametric rate) versus an unbiased parametric density estimator (at the parametric rate), which does not call for a Rao–Blackwellization argument,

• they do not (yet) mention “Rao–Blackwellization” as a technique,

• and they do not envision (more) ergodic averages across iterations, possibly fearing the potential impact of the correlation between the terms for a given chain.

A more relevant step in the use of Rao–Blackwellization techniques for the Gibbs sampler is found in Liu et al. (1994). This later article establishes in particular that, for the two-step Gibbs sampler, Rao–Blackwellization always produces a decrease in the variance of the empirical averages. This is established in a most elegant manner by showing that each extra conditioning (or further lag) decreases the correlation, which is always positive. The proof relies on the associated notion of interleaving and expresses the above correlation as the variance of a multiply conditioned expectation:

$$\mathrm{cov}\bigl(h(\theta_1^{(0)}),\, h(\theta_1^{(n)})\bigr) = \mathrm{var}\bigl(E(\cdots E[E\{h(\theta_1)\mid\theta_2\}\mid\theta_1]\cdots)\bigr),$$

where the number of conditional expectations on the right-hand side is $n$. The authors also warn that a “fast mixing scheme gains an extra factor in efficiency if the mixture estimate can be easily computed” and give a counter-example where Rao–Blackwellization increases the variance. This counter-example is exploited in a contemporary paper by Geyer (1994), where a necessary and sufficient but highly theoretical condition is given for an improvement. As the author puts it in his conclusion,

The point of this article is not that Rao–Blackwellized estimators are a good thing or a bad thing. They may be better or worse than simple averaging of the functional of interest without conditioning. The point is that, when the autocorrelation structure of the Markov chain is taken into account, it is not a theorem that Rao–Blackwellized estimators are always better than simple averaging. Hence the name Rao–Blackwellized should be avoided, because it brings to mind optimality properties that these estimators do not really possess. Perhaps “averaging a conditional expectation” is a better name.

but his recommendation was not particularly popular, to judge from the subsequent literature resorting to this denomination.

Another connection between Rao–Blackwellization and Gibbs sampling can be found in Chib (1995), where the approximation to the marginal likelihood

$$m(x) = \frac{\pi(\theta^*)\, f(x\mid\theta^*)}{\hat\pi(\theta^*\mid x)}$$

is generally based on an estimate of the posterior density using a latent (or auxiliary) variable, as in Gelfand and Smith (1990),

$$\hat\pi(\theta^*\mid x) = \frac{1}{T}\sum_{t=1}^{T} \pi(\theta^*\mid x, z^{(t)}).$$

The stabilisation brought by this parametric approximation is notable when compared with kernel estimates, even though it requires that the marginal distribution on $z$ is correctly simulated (Neal, 1999).

The “always” qualification applies to every transform of the chain and to every time lag.
The function leading to the counter-example is however a function of both $\theta_1$ and $\theta_2$, which may be considered as less relevant in latent variable settings. Geyer (1994) also points out that a similar Rao–Blackwellization was proposed by Pearl (1987). Chib (1995) mentions this connection (p. 1314) but seems to restrict it to the two-stage Gibbs sampler. In the earlier version known as “the candidate's formula”, due to a Durham student coming up with it, Besag (1989) points out the possibility of using an approximation such as a Laplace approximation, rather than an MCMC estimation. A question found on the statistics forum Cross Validated illustrates the difficulty with understanding demarginalisation and joint simulation: “Chib suggests that we can insert the Gibbs sampling outputs of $\mu$ into the summation [of the full conditionals]. But aren't the outputs obtained from Gibbs about the joint posterior $p(\mu,\phi\mid y)$? Why suddenly can we use the results from joint distribution to replace the marginal distribution?”

Markov chain Monte Carlo methods
In the more general setting of Markov chain Monte Carlo (MCMC) algorithms (Robert and Casella, 2004), further results characterise the improvement brought by Rao–Blackwellization. Let us briefly recall that the concept behind MCMC is to create a Markov sequence $(\theta^{(n)})_n$ of dependent variables that converge (in distribution) to the distribution of interest (also called the target). One of the most ubiquitous versions of an MCMC algorithm is the Metropolis–Hastings algorithm (Green et al., 2015; Hastings, 1970; Metropolis et al., 1953).

One direct exploitation of the Rao–Blackwell theorem is found in McKeague and Wefelmeyer (2000), who show in particular that, when estimating the mean of $\theta$ under the target distribution, a Rao–Blackwellized version based on $E[h(\theta^{(n)})\mid\theta^{(n-1)}]$ will improve the asymptotic variance of the ordinary empirical estimator when the chain $(\theta^{(n)})$ is reversible. While the setting may appear quite restrictive, the authors manage to recover data augmentation with a double conditional expectation (when compared with Liu et al., 1994) as well as reversible Gibbs and Metropolis samplers of the Ising model. The difficulty in applying the method resides in computing the conditional expectation, since a replacement with a Monte Carlo approximation cancels its appeal.

Casella and Robert (1996) consider an altogether different form of Rao–Blackwellization for both accept-reject and Metropolis–Hastings samples. The core idea is to integrate out, via a global conditional expectation, the Uniform variates used to accept or reject the proposed values. A sample produced by the Metropolis–Hastings algorithm, $\theta^{(1)},\ldots,\theta^{(T)}$, is in fact based on two simulated samples, the sample of proposed values $\eta_1,\ldots,\eta_T$ and the sample of decision variates $u_1,\ldots$
, $u_T$, with $\eta_t\sim q(\eta\mid\theta^{(t-1)})$ and $u_t\sim\mathcal{U}([0,1])$. Since $\theta^{(t)}$ is equal to one of the earlier proposed values, an empirical average associated with this sample can be written

$$\delta^{MH} = \frac{1}{T}\sum_{t=1}^{T} h(\theta^{(t)}) = \frac{1}{T}\sum_{t=1}^{T}\sum_{i=1}^{t} \mathbb{I}_{\theta^{(t)}=\eta_i}\, h(\eta_i).$$

Therefore, taking a conditional expectation of the above by integrating out the decision variates,

$$\delta^{RB} = \frac{1}{T}\sum_{i=1}^{T} h(\eta_i)\, E\Biggl[\sum_{t=i}^{T} \mathbb{I}_{\theta^{(t)}=\eta_i}\,\Bigm|\,\eta_1,\ldots,\eta_T\Biggr] = \frac{1}{T}\sum_{i=1}^{T} h(\eta_i) \sum_{t=i}^{T} P(\theta^{(t)}=\eta_i\mid\eta_1,\ldots,\eta_T),$$

leads to an improvement over the empirical average $\delta^{MH}$ under convex losses. While, for the independent Metropolis–Hastings algorithm, the conditional probability can be obtained in closed form (see also Atchadé and Perron, 2005, and Jacob et al., 2011), the general case, based on an arbitrary proposal distribution $q(\cdot\mid\theta)$, is such that $\delta^{RB}$ is less tractable, but Casella and Robert (1996) derive a tractable recursive expression for the weights of $h(\eta_i)$ in $\delta^{RB}$, with complexity of order $O(T)$. Follow-up papers are Perron (1999) and Casella et al. (2004).

In order to avoid additional notations, we assume a continuous model where all $\eta_i$'s are different with probability one.

Douc and Robert (2011) consider an alternative representation of $\delta^{MH}$, using the accepted chain $(\xi_i)_i$ instead of the proposed sequence of the $\eta_i$'s as in Casella and Robert (1996). The version based on accepted values is indeed rewritten as

$$\delta^{MH} = \frac{1}{T}\sum_{i=1}^{M} n_i\, h(\xi_i),$$

where the $\xi_i$'s are the accepted $\eta_j$'s, $M$ is the number of accepted $\eta_j$'s till iteration $T$, and $n_i$ is the number of times $\xi_i$ appears in the sequence $(\theta^{(t)})_t$. This representation is also exploited in Gåsemyr (2002); Sahu and Zhigljavsky (1998, 2003), and Malefaki and Iliopoulos (2008). The Rao–Blackwellisation construct of Douc and Robert (2011) exploits the following properties:

1. $(\xi_i, n_i)_i$ is a Markov chain;

2. $\xi_{i+1}$ and $n_i$ are independent given $\xi_i$;

3.
$n_i$ is distributed as a Geometric random variable with probability parameter

$$p(\xi_i) := \int \alpha(\xi_i,\eta)\, q(\eta\mid\xi_i)\,\mathrm{d}\eta;\qquad(1)$$

4. $(\xi_i)_i$ is a Markov chain with transition kernel $\tilde{Q}(\xi,\mathrm{d}\eta) = \tilde{q}(\eta\mid\xi)\,\mathrm{d}\eta$ and stationary distribution $\tilde{\pi}$ such that

$$\tilde{q}(\cdot\mid\xi) \propto \alpha(\xi,\cdot)\, q(\cdot\mid\xi) \quad\text{and}\quad \tilde{\pi}(\cdot) \propto \pi(\cdot)\, p(\cdot).$$

Since the Metropolis–Hastings estimator $\delta^{MH}$ only involves the $\xi_i$'s, i.e. the accepted $\eta_t$'s, an optimal weight for those random variables is the importance weight $1/p(\xi_i)$, leading to the corresponding importance sampling estimator

$$\delta^{IS} = \frac{1}{N}\sum_{i=1}^{M} \frac{h(\xi_i)}{p(\xi_i)},$$

but this quantity is almost invariably unavailable in closed form and needs to be estimated by an unbiased estimator. The Geometric $n_i$ is the de facto solution that is used in the original Metropolis–Hastings estimate, but solutions with smaller variance are also available, based on the property that (if $\alpha(\xi,\eta)$ denotes the Metropolis–Hastings acceptance probability)

$$\hat{\zeta}_i = 1 + \sum_{j=1}^{\infty} \prod_{\ell\le j} \{1 - \alpha(\xi_i,\eta_\ell)\},\qquad \eta_\ell \overset{\text{i.i.d.}}{\sim} q(\eta\mid\xi_i),$$

is an unbiased estimator of $1/p(\xi_i)$ whose variance, conditional on $\xi_i$, is lower than the conditional variance of $n_i$, $\{1-p(\xi_i)\}/p(\xi_i)^2$. For practical implementation, in the event $\alpha(\xi,\eta)$ is too rarely equal to one, the number of terms where the indicator function is replaced with its expectation $\alpha(\xi_i,\eta_\ell)$ may be limited, without jeopardising the variance domination.

Retrospective: continuous-time Monte Carlo methods
Retrospective simulation (Beskos et al., 2006a) is an attempt to take advantage of the redundancy inherent in modern simulation algorithms (particularly MCMC and rejection sampling) by subverting the traditional order of algorithm steps. It is connected to demarginalisation and pseudo-marginal (Andrieu and Roberts, 2009) techniques in that it replaces a probability of acceptance with an unbiased estimate of the said probability, hence creating an auxiliary variable in the process. In the case of the Metropolis–Hastings algorithm, this means substituting the ratio

$$\frac{\pi(\theta')\, q(\theta^{(t)}\mid\theta')}{\pi(\theta^{(t)})\, q(\theta'\mid\theta^{(t)})}$$

with

$$\frac{\hat{\pi}'\, q(\theta^{(t)}\mid\theta')}{\hat{\pi}^{(t)}\, q(\theta'\mid\theta^{(t)})},$$

where $\hat{\pi}'$ is an auxiliary variable such that

$$E[\hat{\pi}'\mid\theta'] = \kappa\,\pi(\theta'),\qquad E[\hat{\pi}^{(t)}\mid\theta^{(t)}] = \kappa\,\pi(\theta^{(t)}).$$

Retrospective simulation is most powerful in infinite-dimensional contexts, where its natural competitors are approximate and computationally expensive. The solution advanced by Beskos et al. (2006a) and Beskos et al. (2006b) to simulate diffusions in an exact manner (for a finite number of points) relies on an auxiliary and bounding Poisson process. The selected points produced this way actually act as a random sufficient statistic, in the sense that the stochastic process can be generated from Brownian bridges between these points, and closed-form estimators conditional on these points may be available, with a smaller variance. See also Fearnhead et al. (2017) for related results on continuous-time importance sampling (CIS). This includes a sequential importance sampling procedure with a random variable whose expectation is equal to the importance weight.

Also known as particle filtering, sequential Monte Carlo (Del Moral et al., 2006; Doucet et al., 1999; Liu and Chen, 1998) is another branch of the Monte Carlo methodology where the concept of Rao–Blackwellisation has had an impact.
We briefly recall here that sequential Monte Carlo is used in state-space and other Bayesian dynamic models where the magnitude of the latent variable prevents the call to traditional Monte Carlo (and MCMC) techniques. It is also relevant for dealing with complex static problems, by creating a sequence of intermediate and artificial models, a technique called tempering (Marinari and Parisi, 1992).

Doucet et al. (2000) introduce a general version of the Rao–Blackwellized particle filter by commenting on the inherent inefficiency of particle filters in large dimensions, compounded by the dynamic nature of the sampling scheme. The central update is

$$p(z_{1:t}\mid y_{1:t}) \propto p(z_{1:t-1}\mid y_{1:t-1})\, p(y_t\mid z_t)\, p(z_t\mid z_{t-1})\qquad(2)$$

in a state-space formulation where $(z_t)_t$ (also denoted $z_{1:T}$) is the latent Markov chain and $(y_t)_t$ the observed sequence. In this update, the conditional densities of $z_t$ and $z_{t-1}$ are usually unavailable and need to be approximated by sampling solutions. If some marginalisation of the sampling is available for the model at hand, this reduces the degeneracy phenomenon at the core of particle filters. The example provided in Doucet et al. (2000) is one where $z_t = (x_t, r_t)$, with

$$p(x_{1:t}, r_{1:t}\mid y_{1:t}) = p(x_{1:t}\mid y_{1:t}, r_{1:t})\, p(r_{1:t}\mid y_{1:t})$$

and $p(x_{1:t}\mid y_{1:t}, r_{1:t})$ available in closed form. This component can then be used in the approximation of the filtering distribution (2), instead of weighted Dirac masses, which improves its precision, if only by bringing a considerable decrease in the dimension of the particles (Doucet et al., 2000, Proposition 2). It is indeed sufficient to resort only to particles for the intractable part.

One difficulty with the approach is the possible occurrence of negative importance weights (Jacob and Thiery, 2015). An early instance, called the bootstrap filter (Gordon et al., 1993), involved one of the authors of Gelfand and Smith (1990), who thus contributed to the birth of two major advances in the field.

See Andrieu et al. (2001); Johansen et al.
(2012); Lindsten (2011); Lindsten et al. (2011) for further extensions of this principle. In particular, the PhD thesis of Lindsten (2011) contains the following relevant paragraph:
Moving from [the particle estimator] to [its Rao–Blackwellized version] resembles a Rao–Blackwellisation of the estimator (see also Lehmann, 1983). In some sense, we move from a Monte Carlo integration to a partially analytical integration. However, it is not clear that the Rao–Blackwellized particle filter truly is a Rao–Blackwellisation of [the original], in the factual meaning of the concept. That is, it is not obvious that the conditional expectation of [the original] results in [its Rao–Blackwellized version]. This is due to the nontrivial relationship between the normalised weights generated by the [particle filter], and those generated by [its Rao–Blackwellized version]. It can thus be said that [it] has earned its name from being inspired by the Rao–Blackwell theorem, and not because it is a direct application of it.

Nonetheless, any exploitation of conditional properties that does not induce a (significant) bias is bound to bring stability and faster convergence to particle filters.
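The decomposition discussed above can be illustrated by a minimal Rao–Blackwellized particle filter for a toy switching linear-Gaussian model; all model choices and parameters below are our own illustrative assumptions, not an implementation from the cited papers. Particles carry only the discrete regime $r_t$, while $x_t$ is marginalised exactly through one Kalman filter per particle, so only the intractable part is handled by particles.

```python
import numpy as np

# Sketch of a Rao-Blackwellized particle filter for a toy switching
# linear-Gaussian model (illustrative parameters):
#   r_t in {0,1} Markov,  x_t = x_{t-1} + drift[r_t] + N(0, q),
#   y_t = x_t + N(0, r).
# Particles track only the discrete regime; conditional on the regime
# path, x_t is integrated out exactly by a per-particle Kalman filter.

def rbpf(y, drift, trans, q, r, n_part, rng):
    m = np.zeros(n_part)      # per-particle Kalman means for x_t
    v = np.ones(n_part)       # per-particle Kalman variances
    reg = rng.integers(0, 2, n_part)
    out = []
    for yt in y:
        # propagate the discrete regime, then Kalman-predict x
        reg = np.array([rng.choice(2, p=trans[s]) for s in reg])
        mp, vp = m + drift[reg], v + q
        # weight by the exact predictive density of y_t (the RB step)
        s = vp + r
        w = np.exp(-0.5 * (yt - mp)**2 / s) / np.sqrt(s)
        w /= w.sum()
        # Kalman update, record the filtered mean, then resample
        k = vp / s
        m, v = mp + k * (yt - mp), (1.0 - k) * vp
        out.append(float(np.dot(w, m)))
        idx = rng.choice(n_part, n_part, p=w)
        m, v, reg = m[idx], v[idx], reg[idx]
    return out

rng = np.random.default_rng(2)
trans = np.array([[0.95, 0.05], [0.05, 0.95]])
y = np.cumsum(rng.normal(0.2, 0.5, 50))      # synthetic observations
means = rbpf(y, np.array([-0.5, 0.5]), trans, 0.25, 1.0, 500, rng)
print(means[-1])
```

A convenient sanity check is to make the two regimes identical (equal drifts), in which case every particle runs the same Kalman filter and the Rao–Blackwellized output reduces exactly to the single Kalman filter, whatever the resampling does.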
The term Rao–Blackwellisation is therefore common enough in the MCMC literature to be considered as a component of the MCMC toolbox. As we pointed out in the introduction, many tricks and devices introduced in the past can fall under the hat of that term and, while a large fraction of them does not come with a demonstrated improvement over earlier proposals, promoting the concepts of conditioning and demarginalising as central to the field should be seen as essential for researchers and students alike. Linking such concepts, shared by statistics and Monte Carlo, with an elegant and historical result like the Rao–Blackwell theorem stresses both the universality and the resilience of the idea.

Doucet et al. (2000) provide a realistic illustration for a neural network where the manageable part is obtained via a Kalman filter.

References

Andrieu, C., de Freitas, N., and Doucet, A. (2001). Rao–Blackwellised particle filtering via data augmentation. In Advances in Neural Information Processing Systems (NIPS), pages 561–567.

Andrieu, C. and Roberts, G. (2009). The pseudo-marginal approach for efficient Monte Carlo computations. Ann. Statist., 37(2):697–725.

Atchadé, Y. and Perron, F. (2005). Improving on the independent Metropolis–Hastings algorithm. Statistica Sinica, 15:3–18.

Banterle, M., Grazian, C., Lee, A., and Robert, C. P. (2019). Accelerating Metropolis–Hastings algorithms by delayed acceptance. Foundations of Data Science, 1(2):103.

Berg, S., Zhu, J., and Clayton, M. K. (2019). Control variates and Rao–Blackwellization for deterministic sweep Markov chains. arXiv:1912.06926.

Berkson, J. (1955). Maximum likelihood and minimum χ² estimates of the logistic function. J. American Statist. Assoc., 50(269):130–162.

Besag, J. (1989). A candidate's formula: A curious result in Bayesian prediction. Biometrika, 76(1):183–183.

Beskos, A., Papaspiliopoulos, O., Roberts, G., and Fearnhead, P. (2006a). Exact and computationally efficient likelihood-based estimation for discretely observed diffusion processes (with discussion). J. Royal Statist. Society Series B, 68(3):333–382.

Beskos, A., Papaspiliopoulos, O., and Roberts, G. O. (2006b). Retrospective exact simulation of diffusion sample paths with applications. Bernoulli, 12(6):1077–1098.

Blackwell, D. (1947). Conditional expectation and unbiased sequential estimation. Ann. Math. Statist., 18(1):105–110.

Casella, G. and Robert, C. (1996). Rao–Blackwellization of sampling schemes. Biometrika, 83:81–94.

Casella, G., Robert, C. P., and Wells, M. T. (2004). Generalized accept–reject sampling schemes. Lecture Notes–Monograph Series, 45:342–347.

Chatterjee, S. and Diaconis, P. (2018). The sample size required in importance sampling. Ann. Appl. Probab., 28(2):1099–1135.

Chib, S. (1995). Marginal likelihood from the Gibbs output. J. American Statist. Assoc., 90:1313–1321.

Chopin, N. and Robert, C. (2010). Properties of nested sampling. Biometrika, 97:741–755.

Cornuet, J.-M., Marin, J.-M., Mira, A., and Robert, C. (2012). Adaptive multiple importance sampling. Scandinavian Journal of Statistics, 39(4):798–812.

Del Moral, P., Doucet, A., and Jasra, A. (2006). Sequential Monte Carlo samplers. J. Royal Statist. Society Series B, 68(3):411–436.

Douc, R. and Robert, C. (2011). A vanilla Rao–Blackwellization of Metropolis–Hastings algorithms. Ann. Statist., 39(1):261–277.

Doucet, A., de Freitas, N., and Gordon, N. (1999). Sequential Monte Carlo Methods in Practice. Springer-Verlag.

Doucet, A., de Freitas, N., Murphy, K. P., and Russell, S. J. (2000). Rao–Blackwellised particle filtering for dynamic Bayesian networks. In Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, UAI '00, pages 176–183, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

Elvira, V., Martino, L., Luengo, D., and Bugallo, M. (2019). Generalized multiple importance sampling. Statist. Science, 34(1):129–155.

Fearnhead, P., Łatuszyński, K., Roberts, G. O., and Sermaidis, G. (2017). Continuous-time importance sampling: Monte Carlo methods which avoid time-discretisation error. arXiv:1712.06201.

Gåsemyr, J. (2002). Markov chain Monte Carlo algorithms with independent proposal distribution and their relation to importance sampling and rejection sampling. Technical Report 2, Department of Statistics, Univ. of Oslo.

Gelfand, A. and Smith, A. (1990). Sampling based approaches to calculating marginal densities. J. American Statist. Assoc., 85:398–409.

Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell., 6:721–741.

Geyer, C. (1993). Estimating normalizing constants and reweighting mixtures in Markov chain Monte Carlo. Technical Report 568, School of Statistics, Univ. of Minnesota.

Geyer, C. (1994). Conditioning in Markov chain Monte Carlo. J. Comput. Graph. Statist., 4:148–154.

Gordon, N., Salmond, J., and Smith, A. (1993). A novel approach to non-linear/non-Gaussian Bayesian state estimation. IEEE Proceedings on Radar and Signal Processing, 140:107–113.

Green, P. J., Łatuszyński, K., Pereyra, M., and Robert, C. P. (2015). Bayesian computation: a summary of the current state, and samples backwards and forwards. Statistics and Computing, 25(4):835–862.

Gutmann, M. U. and Hyvärinen, A. (2012). Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. J. Mach. Learn. Res., 13(1):307–361.

Hastings, W. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57:97–109.

Jacob, P., Robert, C., and Smith, M. (2011). Using parallel computation to improve independent Metropolis–Hastings based estimation. J. Comput. Graph. Statist., 20(3):616–635.

Jacob, P. E. and Thiery, A. H. (2015). On nonnegative unbiased estimators. Ann. Statist., 43(2):769–784.

Johansen, A. M., Whiteley, N., and Doucet, A. (2012). Exact approximation of Rao–Blackwellised particle filters. IFAC Proceedings Volumes, 45(16):488–493. 16th IFAC Symposium on System Identification.

Kong, A., McCullagh, P., Meng, X. L., Nicolae, D., and Tan, Z. (2003). A theory of statistical models for Monte-Carlo integration. J. Royal Statist. Society Series B, 65(3):585–618.

Liao, J. (1998). Variance reduction in Gibbs sampler using quasi-random numbers. J. Comput. Graph. Statist., 7(3):253–266.

Lindsten, F. (2011). Rao–Blackwellised particle methods for inference and identification. PhD thesis, University of Linköping, Sweden.

Lindsten, F., Schön, T. B., and Olsson, J. (2011). An explicit variance reduction expression for the Rao–Blackwellised particle filter. IFAC Proceedings Volumes, 44(1):11979–11984. 18th IFAC World Congress.

Liu, J. and Chen, R. (1998). Sequential Monte-Carlo methods for dynamic systems. J. American Statist. Assoc., 93:1032–1044.

Liu, J., Wong, W., and Kong, A. (1994). Covariance structure of the Gibbs sampler with applications to the comparisons of estimators and sampling schemes. Biometrika, 81:27–40.

Malefaki, S. and Iliopoulos, G. (2008). On convergence of importance sampling and other properly weighted samples to the target distribution. J. Statist. Plann. Inference, 138:1210–1225.

Marin, J. and Robert, C. (2010). On resolving the Savage–Dickey paradox. Electron. J. Statist., 4:643–654.

Marin, J. and Robert, C. (2011). Importance sampling methods for Bayesian discrimination between embedded models. In Chen, M.-H., Dey, D., Müller, P., Sun, D., and Ye, K., editors, Frontiers of Statistical Decision Making and Bayesian Analysis, pages 513–527. Springer-Verlag, New York.

Marinari, E. and Parisi, G. (1992). Simulated tempering: A new Monte Carlo scheme. Europhysics Letters, 19:451–458.

McKeague, I. and Wefelmeyer, W. (2000). Markov chain Monte Carlo and Rao–Blackwellisation. J. Statist. Plann. Inference, 85:171–182.

Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. (1953). Equations of state calculations by fast computing machines. J. Chem. Phys., 21:1087–1092.

Mira, A., Møller, J., and Roberts, G. (2001). Perfect slice samplers. J. Royal Statist. Society Series B, 63:583–606.

Neal, R. (1999). Erroneous results in “Marginal likelihood from the Gibbs output”. Technical report, University of Toronto.

Owen, A. and Zhou, Y. (2000). Safe and effective importance sampling. J. American Statist. Assoc., 95:135–143.

Pearl, J. (1987). Evidential reasoning using stochastic simulation in causal models. Artificial Intelligence, 32:247–257.

Perron, F. (1999). Beyond accept–reject sampling. Biometrika, 86(4):803–813.

Rao, C. (1945). Information and the accuracy attainable in the estimation of statistical parameters. Bulletin of the Calcutta Mathematical Society, 37:81–91.

Robert, C. and Casella, G. (2004). Monte Carlo Statistical Methods. Springer-Verlag, New York, second edition.

Robert, C. and Marin, J.-M. (2008). On some difficulties with a posterior probability approximation technique. Bayesian Analysis, 3(2):427–442.

Robert, C. and Wraith, D. (2009). Computational methods for Bayesian model choice. In Goggans, P. M. and Chan, C.-Y., editors, MaxEnt 2009 proceedings, volume 1193. AIP.

Roberts, G. and Rosenthal, J. (1999). Convergence of slice sampler Markov chains. J. Royal Statist. Society Series B, 61:643–660.

Sahu, S. and Zhigljavsky, A. (1998). Adaptation for self regenerative MCMC. Technical report, Univ. of Wales, Cardiff.

Sahu, S. and Zhigljavsky, A. (2003). Self regenerative Markov chain Monte Carlo with adaptation. Bernoulli, 9:395–422.

Spiegelhalter, D., Thomas, A., Best, N., and Gilks, W. (1995). BUGS: Bayesian inference using Gibbs sampling. Technical report, Medical Research Council Biostatistics Unit, Institute of Public Health, Cambridge University.

Tanner, M. and Wong, W. (1987). The calculation of posterior distributions by data augmentation. J. American Statist. Assoc., 82:528–550.

Vehtari, A., Simpson, D., Gelman, A., Yao, Y., and Gabry, J. (2019). Pareto smoothed importance sampling. arXiv:1507.02646.