An Adaptive Sequential Monte Carlo Sampler
Paul Fearnhead and Benjamin M. Taylor

October 24, 2018
Abstract
Sequential Monte Carlo (SMC) methods are not only a popular tool in the analysis of state–space models, but offer an alternative to MCMC in situations where Bayesian inference must proceed via simulation. This paper introduces a new SMC method that uses adaptive MCMC kernels for particle dynamics. The proposed algorithm features an online stochastic optimization procedure to select the best MCMC kernel and simultaneously learn optimal tuning parameters. Theoretical results are presented that justify the approach and give guidance on how it should be implemented. Empirical results, based on analysing data from mixture models, show that the new adaptive SMC algorithm (ASMC) can both choose the best MCMC kernel and learn an appropriate scaling for it. ASMC with a choice between kernels outperformed the adaptive MCMC algorithm of Haario et al. (1998) in 5 out of the 6 cases considered.
Keywords:
Adaptive MCMC, Adaptive Sequential Monte Carlo, Bayesian Mixture Analysis, Optimal Scaling, Stochastic Optimization.

Introduction
Sequential Monte Carlo (SMC) is a class of algorithms that enable simulation from a target distribution of interest. These algorithms are based on defining a series of distributions, and generating samples from each distribution in turn. SMC was initially used in the analysis of state-space models. In this setting there is a time-evolving hidden state of interest, inference about which is based on a set of noisy observations (Gordon et al., 1993; Liu and Chen, 1998; Doucet et al., 2001; Fearnhead, 2002). The sequence of distributions is defined to be the posterior distributions of the state at consecutive time-points given the observations up to those time points. More recent work has looked at developing SMC methods that can analyse state-space models which have unknown fixed parameters. Such methods introduce steps into the algorithm to allow the support of the sample of parameter values to change over time, for example by using ideas from kernel density estimation (Liu and West, 2001), or MCMC moves (Gilks and Berzuini, 1999; Storvik, 2002; Fearnhead, 2002).

Most recently, SMC methods have been applied as an alternative to MCMC for standard Bayesian inference problems (Neal, 2001; Chopin, 2002; Del Moral et al., 2006; Fearnhead, 2008). In this paper the focus will be on methods for sampling from the posterior distribution of a set of parameters of interest. SMC methods for this class of targets introduce an artificial sequence of distributions that runs from the prior to the posterior, and sample recursively from these using a combination of importance sampling and MCMC moves. This approach to sampling has been demonstrated empirically to often be more effective than using a single MCMC chain (Jasra et al., 2007, 2008).
There are heuristic reasons why this may be true in general: the annealing of the target and the spread of samples over its support mean that SMC is less likely to become trapped in posterior modes.

Simply invoking an untuned MCMC move within an SMC algorithm would likely lead to poor results, because the move step would not be effective in combating sample depletion. The structure of SMC means that at the time of a move there is a sample from the target readily available; this can be used to compute posterior moments and inform the shape of the proposal kernel, as in Jasra et al. (2008). However, further refinements can lead to even better performance. Such refinements include the scaling of estimated target moments by an optimal factor; see Roberts and Rosenthal (2001) for example. For general targets and proposals no theoretical results for the choice of scaling exist, and this has led to the recent popularity of adaptive MCMC (Haario et al., 1998; Andrieu and Robert, 2001; Roberts and Rosenthal, 2009; Craiu et al., 2009; Andrieu and Thoms, 2008). In this paper the idea of adapting the MCMC kernel within an SMC algorithm will be explored.

To date there has been little work on adapting SMC methods. Exceptions include the method of Jasra et al. (2008), which assumes a likelihood-tempered sequence of target densities (see Neal (2001)); their adaptation procedure both chooses this sequence online and computes the variance of a random walk proposal kernel used for particle dynamics. Cornebise et al. (2008) also consider adapting the proposal distribution within SMC for state-space models. Assuming that the proposal density belongs to a parametric family with parameter θ, their method proceeds by simulating a number of realisations for each of a range of values of θ and selecting the value that minimises the empirical Shannon entropy of the importance weights; new samples are then re-proposed using this approximately optimal value.
Further related work includes that of Douc et al. (2005) and Cappé et al. (2008) on, respectively, population Monte Carlo and adaptive importance sampling.

The aims of this paper are to introduce a new adaptive SMC algorithm (ASMC) that automatically tunes MCMC move kernels and chooses between different proposal densities, and to provide theoretical justification of the method. The algorithm is based on having a distribution of kernels and their tuning parameters at each iteration. Each current sample value, called a particle, is moved using an MCMC kernel drawn from this distribution. By observing the expected square jumping distance (Craiu et al., 2009; Sherlock and Roberts, 2009) for each particle it is possible to learn which MCMC kernels are mixing better. The information thus obtained can then be used to update the distribution of kernels. The key assumption of the new approach is that the optimal MCMC kernel for moving particles does not change much over the iterations of the SMC algorithm. As will be discussed, and shown empirically in Section 5, this can often be achieved by appropriate parameterisation of a family of MCMC kernels.

The structure of the paper is as follows. In the next section, the model of interest will be introduced, followed by a review of MCMC and SMC approaches. Then in Section 3, the new adaptive SMC algorithm will be presented. Guidelines on implementing the algorithm, as well as some theory on its convergence, will be presented in Section 4. In Section 5 the method will be evaluated using simulated data. The results show that the proposed method can successfully choose both an appropriate MCMC kernel and an appropriate scaling for the kernel. The paper ends with a discussion.

The focus of this paper will be on Bayesian inference for parameters, θ, from a model where independent identically distributed data are available. Note that the ideas behind the proposed adaptive SMC algorithm can be applied more generally (see Section 6).
Let π(θ) denote the prior for θ and π(y | θ) the probability density for the observations. The aim will be to calculate the posterior density,

π(θ | y_{1:n}) ∝ π(θ) ∏_{i=1}^{n} π(y_i | θ),    (1)

where, here and throughout, π will be used to denote a probability density, and y_{1:t} means y_1, …, y_t.

In general, π(θ | y_{1:n}) is analytically intractable, and so to compute posterior functionals of interest, for example expectations, Monte Carlo simulation methods are often employed. Sections 2.1 and 2.2 provide a brief description of two such Monte Carlo approaches.

An MCMC transition kernel, K_h, is an iterative rule for generating samples from a target probability density, for example a posterior. K_h comprises a proposal kernel, here and throughout denoted q_h (the subscript h indicates dependence on a tuning parameter), and an acceptance ratio that depends on the target and, in general, the proposal densities (see Gilks et al. (1995); Gamerman and Lopes (2006) for reviews of MCMC methodology). The most generally applicable MCMC method is Metropolis–Hastings; see Algorithm 1 (Metropolis et al., 1953; Hastings, 1970).

Algorithm 1
Metropolis–Hastings Algorithm (Metropolis et al., 1953; Hastings, 1970)

Start with an initial sample, θ^{(0)}, drawn from any density, π_0.
for i = 1, 2, … do
  Propose a move to a new location, θ̃, by drawing a sample from q_h(θ^{(i−1)}, ·).
  Accept the move (i.e. set θ^{(i)} = θ̃) with probability

    min{ 1, [π(θ̃ | y_{1:n}) q_h(θ̃, θ^{(i−1)})] / [π(θ^{(i−1)} | y_{1:n}) q_h(θ^{(i−1)}, θ̃)] },    (2)

  else set θ^{(i)} = θ^{(i−1)}.
end for

Probably the simplest MH algorithm is the random walk Metropolis (RWM). The proposal kernel for RWM is a symmetric density centred on the current state, the most common example being a multivariate normal, q_h(θ^{(i−1)}, θ̃) = N(θ̃; θ^{(i−1)}, h Σ̂_π), where Σ̂_π is an estimate of the target covariance. Both the value of Σ̂_π and that of h are critical to the performance of the algorithm. If Σ̂_π does not accurately estimate the posterior covariance matrix, then the directions of the random walk moves will likely be inappropriate. On the other hand, a value of h that is too small will lead to high acceptance rates, but the samples will be highly correlated. If h is too large then the algorithm will rarely move, which in the worst case could lead to a degenerate sample.

These observations on the role of h point to the idea of an optimal scaling, an h somewhere between the extremes that promotes the best mixing of the algorithm. In the case of elliptically symmetric unimodal targets, an optimal random walk scaling can sometimes be computed numerically; this class of targets includes the multivariate Gaussian (Sherlock and Roberts, 2009). Other theoretical results include optimal acceptance rates, which are derived in the limit as the dimension of θ, d → ∞ (see Roberts and Rosenthal (2001) for examples of targets and proposals).
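To make the RWM step concrete, here is a minimal self-contained Python sketch (not from the paper; the toy Gaussian target, seed and iteration count are illustrative, and the scaling h = 2.38²/d follows the optimal-scaling heuristic discussed above):

```python
import numpy as np

def rwm(log_target, theta0, h, cov, n_iter, rng=None):
    """Random walk Metropolis (Algorithm 1 with a symmetric proposal, so
    the q-ratio in (2) cancels): propose from N(theta, h * cov)."""
    rng = rng or np.random.default_rng(0)
    L = np.linalg.cholesky(h * cov)
    theta = np.asarray(theta0, dtype=float)
    lp = log_target(theta)
    out, accepts = [theta.copy()], 0
    for _ in range(n_iter):
        prop = theta + L @ rng.standard_normal(theta.size)
        lp_prop = log_target(prop)
        # accept with probability min(1, pi(prop) / pi(theta))
        if np.log(rng.uniform()) < lp_prop - lp:
            theta, lp = prop, lp_prop
            accepts += 1
        out.append(theta.copy())
    return np.array(out), accepts / n_iter

# Standard 2-d Gaussian target, scaled by the 2.38^2 / d rule of thumb.
samples, acc = rwm(lambda th: -0.5 * th @ th, np.zeros(2),
                   h=2.38 ** 2 / 2, cov=np.eye(2), n_iter=5000)
```

With a well-chosen h the acceptance rate settles in the moderate range the optimal-scaling results predict; h too small or too large degrades mixing exactly as described above.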
In general, however, there are no such theoretical results. One way of circumventing the need for analytical optimal scalings is to try to learn them online (Andrieu and Robert, 2001; Atchadé and Rosenthal, 2005); this can include learning both a good scaling, h, and an estimate of the target covariance, Σ̂_π (Haario et al., 1998). Recent research in adaptive MCMC has generated a number of new algorithms (see for example Andrieu and Thoms (2008); Roberts and Rosenthal (2009); Craiu et al. (2009)), though some care must be taken to ensure that the resulting chain has the correct ergodic distribution.

An alternative approach to generating samples from a posterior is to use sequential Monte Carlo (SMC; see Del Moral et al. (2006) for a review). The main idea behind SMC is to introduce a sequence of densities leading from the prior to the target density of interest and to iteratively update an approximation to these densities. For the application considered here, it is natural to define these densities as π_t(θ) = π(θ | y_{1:t}) for t = 1, …, n; this 'data tempered' schedule will be used in the sequel. The approximation to each density is defined in terms of a set of weighted particles, {θ_t^{(j)}, w_t^{(j)}}_{j=1}^{M}, produced so that as M → ∞, Monte Carlo sums converge to their 'correct' expectations:

lim_{M→∞} [ Σ_{j=1}^{M} w_t^{(j)} f(θ_t^{(j)}) / Σ_{i=1}^{M} w_t^{(i)} ] = E_{π_t}[f(θ_t)],

for all π_t-integrable functions, f. One step of an SMC algorithm can involve importance reweighting, resampling and moving the particles via an MCMC kernel (Gilks and Berzuini, 1999; Chopin, 2002). For concreteness, this paper will focus on the iterated batch importance sampling (IBIS) algorithm of Chopin (2002). The simplest way to update the particle approximation in model (1) is to let θ_t^{(j)} = θ_{t−1}^{(j)} and w_t^{(j)} = w_{t−1}^{(j)} π(y_t | θ_t^{(j)}).
However such an algorithm will degenerate for large t, as eventually only one particle will have non-negligible weight. Within IBIS, resample–move steps (sometimes referred to here simply as 'move steps') are introduced to alleviate this. In a move step, the particles are first resampled so that the expected number of copies of particle θ_t^{(j)} is proportional to w_t^{(j)}. This process produces multiple copies of some particles. In order to create particle diversity, each resampled particle is moved by an MCMC kernel. The MCMC kernel is chosen to have stationary distribution π_t. The resulting particles are then assigned a weight of 1/M.

The decision of whether to apply a resample–move step within IBIS is based on the effective sample size (ESS; see Kong et al. (1994); Liu and Chen (1998)). The ESS is a measure of variability of the particle weights; using this to decide whether to resample is justified by arguments within Liu and Chen (1995) and Liu et al. (1998). Full details of IBIS are given in Algorithm 2.

Algorithm 2 Chopin's IBIS algorithm

Initialise from the prior: {θ_0^{(j)}, w_0^{(j)}}_{j=1}^{M} ∼ π_0.
for t = 1, …, n do
  Assume current {θ_{t−1}^{(j)}, w_{t−1}^{(j)}}_{j=1}^{M} ∼ π_{t−1}.
  Reweight: w_t^{(j)} = w_{t−1}^{(j)} π_t(θ_{t−1}^{(j)}) / π_{t−1}(θ_{t−1}^{(j)}). Result: {θ_{t−1}^{(j)}, w_t^{(j)}}_{j=1}^{M} ∼ π_t.
  if particle weights not degenerate (see text) then
    {θ_t^{(j)}, w_t^{(j)}}_{j=1}^{M} ← {θ_{t−1}^{(j)}, w_t^{(j)}}_{j=1}^{M}; t → t + 1.
  else
    Resample: let K = {k_1, …, k_M} ⊆ {1, …, M} be the resampling indices, so that {θ_{t−1}^{(k)}, 1/M}_{k∈K} ∼ π_t. Relabel: k_j ← j, the j-th resampling index, so that {θ_{t−1}^{(j)}, 1/M}_{j=1}^{M} ∼ π_t.
    Move via a π_t-invariant MCMC kernel. Result: {θ_t^{(j)}, 1/M}_{j=1}^{M} ∼ π_t.
  end if
end for

Chopin's IBIS algorithm is a special case of the resample–move (RM) algorithm of Gilks and Berzuini (1999) and of the general algorithm described by Del Moral et al. (2006) (note that the latter method applies beyond MCMC-within-SMC and provides a unifying framework for sampling from sequences of targets). The main difference between RM and IBIS is that, in their presentation of RM, Gilks and Berzuini (1999) use resampling and move steps at each iteration of the sampler. Chopin noticed that at a particular iteration it may be better just to reweight the particles, rather than incur the computational cost and degeneracy induced by a resample–move step. Another related algorithm, a development of simulated annealing (Kirkpatrick et al., 1983) due to Neal (2001), utilises an alternative 'likelihood tempered' sequence of targets. The proposed target sequence is π_t(θ) = π(θ) π(y_{1:n} | θ)^{ξ_t}, where {ξ_t} is a sequence of real numbers starting at 0 (the prior) and ending at 1 (the posterior). Since each move step requires evaluation of the likelihood over all available observations, the main disadvantage of likelihood tempering is computational cost, although for models with sufficient statistics this is not an issue. Further disadvantages of Neal's proposed algorithm are the absence of resampling steps, which eventually leads to sample degeneracy, and the lack of interpretability of the intermediate target densities.

The efficiency of an SMC algorithm, such as IBIS, depends on the mixing properties of the associated MCMC kernel. Within SMC there is the advantage of being able to use the current set of particles to help tune an MCMC kernel. For example, the weighted particles can give an estimate of the posterior covariance matrix, which can be used within a random walk proposal. However, even in this case the proposal variance still needs to be appropriately scaled (Roberts and Rosenthal, 2001; Sherlock and Roberts, 2009).
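A skeleton of the IBIS recursion can be written in a few lines of Python. This is an illustrative sketch only: the π_t-invariant move kernel `mcmc_move` is left abstract, and the ESS threshold of M/2 is a choice made for the example, not a prescription of the paper.

```python
import numpy as np

def ess(w):
    """Effective sample size of a weight vector (Kong et al., 1994)."""
    w = w / w.sum()
    return 1.0 / np.sum(w ** 2)

def ibis(y, prior_sample, loglik, mcmc_move, M=1000, rng=None):
    """Skeleton of Chopin's IBIS (Algorithm 2): reweight at each
    observation; resample and move only when the weights look degenerate."""
    rng = rng or np.random.default_rng(1)
    theta = prior_sample(M, rng)                 # particles from the prior
    w = np.ones(M)
    for t in range(len(y)):
        w = w * np.exp(loglik(y[t], theta))      # reweight: pi_t / pi_{t-1}
        if ess(w) < M / 2:                       # degeneracy check
            idx = rng.choice(M, size=M, p=w / w.sum())   # resample
            theta = mcmc_move(theta[idx], t)     # move, invariant for pi_t
            w = np.ones(M)                       # equal weights after a move
    return theta, w / w.sum()
```

Here `mcmc_move(theta, t)` must leave π(θ | y_{1:t+1}) invariant, e.g. a few RWM steps targeting the current posterior.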
In the next section the new adaptive SMC procedure will be introduced; the algorithm can learn an appropriate tuning for the MCMC kernel, and can also be used to choose between a set of possible kernels.

First consider the case where the move step in the IBIS algorithm involves one type of MCMC kernel. Let π_t be an arbitrary continuous probability density (the target) and K_{h,t} a π_t-invariant MCMC kernel with tuning parameter, h. The parameter h is to be chosen to maximise the following utility function,

g^{(t)}(h) = ∫ π_t(θ_{t−1}) K_{h,t}(θ_{t−1}, θ_t) Λ(θ_{t−1}, θ_t) dθ_{t−1} dθ_t    (3)
          = E[Λ(θ_{t−1}, θ_t)],

where Λ(θ_{t−1}, θ_t) > 0. Thus g^{(t)} is the average performance of the chain with respect to Λ, which would normally be some measure of mixing. The ideal choice for Λ would be the integrated autocorrelation time (whence the goal would be to maximise −g^{(t)}), but a computationally simpler measure of mixing is the expected square jumping distance (ESJD). Maximising the ESJD is equivalent to minimising the lag-1 autocorrelation; this measure is often used within adaptive MCMC, see for example Sherlock and Roberts (2009); Pasarica and Gelman (2010).

In the following it will be assumed that the proposal distribution can depend on quantities calculated from the current set of particles (for example estimates of the posterior variance), but this will be suppressed in the notation. The main idea of ASMC is to use the observed instances of Λ(θ_{t−1}, θ_t) to help choose the best h. The tuning parameter will be treated as an auxiliary random variable. At time-step t the aim is to derive a density for the tunings, π^{(t)}(h). If a move step is invoked at this time, a sample of M realisations from π^{(t)}(h), denoted {h_t^{(j)}}_{j=1}^{M}, will be drawn and 'allocated' to particles at random. When moving the j-th resampled particle, the tuning parameter h_t^{(j)} will be used within the proposal distribution.
For notational simplicity this value will be denoted h in the following. Let θ_{t−1}^{(j)} be the j-th resampled particle (see the resample step of Algorithm 2). In moving this particle, θ̃_t^{(j)} is drawn from q_h(θ_{t−1}^{(j)}, ·), and accepted with probability α_h(θ_{t−1}^{(j)}, θ̃_t^{(j)}), given by (2). If the proposed particle is accepted then θ_t^{(j)} = θ̃_t^{(j)}, otherwise θ_t^{(j)} = θ_{t−1}^{(j)}. The utility function in (3) simplifies to

g^{(t)}(h) = ∫ π_t(θ_{t−1}) q_h(θ_{t−1}, θ̃_t) Λ̃(θ_{t−1}, θ̃_t) dθ_{t−1} dθ̃_t,

where

Λ̃(θ_{t−1}^{(j)}, θ̃_t^{(j)}) = α_h(θ_{t−1}^{(j)}, θ̃_t^{(j)}) Λ(θ_{t−1}^{(j)}, θ̃_t^{(j)}).

Since by assumption the resampled particles are approximately drawn from π_t, and proposed particles are drawn from q_h(θ_{t−1}, θ̃_t), the quantity Λ̃(θ_{t−1}^{(j)}, θ̃_t^{(j)}) can be viewed as an unbiased estimate of g^{(t)}(h).

The approach in this paper is to use the observed Λ̃(θ_{t−1}^{(j)}, θ̃_t^{(j)}) to update the distribution π^{(t)}(h) to a new distribution π^{(t+1)}(h). In particular each h_t^{(j)} will be assigned a weight, f(Λ̃(θ_{t−1}^{(j)}, θ̃_t^{(j)})), for some function f : R_+ → R_+. The new density of scalings will be defined as

π^{(t+1)}(h) ∝ Σ_{j=1}^{M} f(Λ̃(θ_{t−1}^{(j)}, θ̃_t^{(j)})) R(h − h_t^{(j)}),    (4)

where R(h − h_t^{(j)}) is a density for h which is centred on h_t^{(j)}. Simulating from π^{(t+1)}(h) is achieved by first resampling the h_t^{(j)}s with probabilities proportional to their weights and then adding noise to each resampled value; the distribution of this noise is given by R(·). The motivation for adding noise to the resampled h-values is to avoid the distributions π^{(t)}(h) degenerating too quickly to a point mass on a single value. Similar ideas are used in dynamic SMC methods for dealing with fixed parameters, for example West (1993); Liu and West (2001).
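The update (4), with the linear weighting f(Λ̃) = a + Λ̃ discussed in the text, amounts to a weighted resample of the current h-values followed by jittering. A sketch (the Gaussian jitter kernel R and its standard deviation are illustrative choices, not the paper's):

```python
import numpy as np

def update_h(h, jumps, a=1.0, noise_sd=0.05, rng=None):
    """One update of the tuning distribution, equation (4): weight each
    h^(j) by f(Lambda~) = a + Lambda~, resample with those weights, then
    jitter with R(.) taken here to be N(0, noise_sd^2)."""
    rng = rng or np.random.default_rng(3)
    w = a + np.asarray(jumps)                  # linear weights, a >= 0
    idx = rng.choice(len(h), size=len(h), p=w / w.sum())
    return h[idx] + noise_sd * rng.standard_normal(len(h))

# Toy demonstration: observed jumping distances peak near h = 1, so the
# updated distribution of tunings should concentrate around 1.
rng = np.random.default_rng(5)
h0 = rng.uniform(0.0, 3.0, size=5000)
jumps = 10.0 * np.exp(-4.0 * (h0 - 1.0) ** 2)
h1 = update_h(h0, jumps, a=0.1)
```

Shrinking `noise_sd` as M grows matches the kernel-density-estimation analogy made below for the variance of R.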
In practice the variance of the noise can depend on the variance of π^{(t)}(h) and, by analogy to kernel density estimation, should tend to 0 as the number of particles gets large. If there is no resampling at step t then set π^{(t+1)}(h) = π^{(t)}(h). The scheme is initiated with an arbitrary distribution π^{(0)}(h). The specific choice of f considered in this paper is a simple linear weighting scheme,

f(Λ̃) = a + Λ̃,  a ≥ 0.

Theoretical justification for this choice is given in the next section.

One assumption of the proposed approach is that a good choice of h at one time-step will be a good choice at nearby time-steps. Note that this is based on an implicit assumption within SMC that successive targets are similar (see Chopin (2002); Del Moral et al. (2006) for example). Furthermore, using estimates of posterior variances within the proposal distribution can also help ensure that good values of h at one time-step will be good choices at nearby time-steps. Some theoretical results concerning this matter will be presented in Section 4.

Choosing between different types of MCMC kernel is now a relatively straightforward extension of the above. Assume there are I different MCMC kernels, each defined by a proposal distribution q_{h,i}, where i ∈ {1, …, I}. The algorithm now learns a set of distributions, π^{(t)}(h, i), for the pair of kernel type and associated tuning parameter. Each particle is assigned a random kernel type and tuning drawn from this distribution, with the pair (h_{t−1}^{(j)}, i_{t−1}^{(j)}) associated with θ_{t−1}^{(j)}. The algorithm proceeds by weighting this pair based on the observed Λ̃(θ_{t−1}^{(j)}, θ̃_t^{(j)}) values as before, and updating the distribution,

π^{(t)}(h, i) ∝ Σ_{j=1}^{M} f(Λ̃(θ_{t−1}^{(j)}, θ̃_t^{(j)})) R(h − h_{t−1}^{(j)}) δ_{i_{t−1}^{(j)}}(i),    (5)

where δ_{i_{t−1}^{(j)}}(i) is a point mass on i = i_{t−1}^{(j)}. The method is described in detail below; see Algorithm 3.
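The joint update (5) can be sketched in the same spirit: the pairs (h, i) are weighted and resampled together, and only h is jittered, since the kernel type is discrete (an illustrative sketch, not the paper's code):

```python
import numpy as np

def update_kernels(h, kern, jumps, a=1.0, noise_sd=0.05, rng=None):
    """One update of the joint tuning/kernel-type distribution, equation
    (5): weight each pair (h^(j), i^(j)) by a + Lambda~, resample the
    pairs together, then jitter h only (the index keeps its point mass)."""
    rng = rng or np.random.default_rng(4)
    w = a + np.asarray(jumps)
    idx = rng.choice(len(h), size=len(h), p=w / w.sum())
    return h[idx] + noise_sd * rng.standard_normal(len(h)), kern[idx]

# Toy demonstration with I = 2 kernel types, where type 1 jumps further
# on average: its share of the population should grow after one update.
rng = np.random.default_rng(6)
h0 = rng.uniform(0.0, 1.0, size=4000)
kern0 = rng.integers(0, 2, size=4000)
jumps = np.where(kern0 == 1, 2.0, 0.5)
h1, kern1 = update_kernels(h0, kern0, jumps, a=0.5)
```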
Within the specific implementation described, the sample of pairs, (h, i), from π^{(t)}(h, i) is allocated to particles randomly immediately after the resample–move step at iteration t. These pairs are then kept until the next iteration at which a resample–move step is called.

Algorithm 3
The Adaptive SMC algorithm. Here π_1(·), …, π_n(·) is an arbitrary sequence of targets; an MCMC kernel is assumed for particle dynamics.

Initialise from the prior: {θ_0^{(j)}, w_0^{(j)}}_{j=1}^{M} ∼ π_0. Draw a selection of pairs of MCMC kernels with associated tuning parameters, {(h_0^{(j)}, K_{h,0}^{(j)})}_{j=1}^{M} ≡ {(h_0^{(j)}, i_0^{(j)})}_{j=1}^{M} ∼ π^{(0)}(h, i), and attach one to each particle arbitrarily.
for t = 1, …, n do
  Assume current {θ_{t−1}^{(j)}, w_{t−1}^{(j)}}_{j=1}^{M} ∼ π_{t−1}.
  Reweight: w_t^{(j)} = w_{t−1}^{(j)} π_t(θ_{t−1}^{(j)}) / π_{t−1}(θ_{t−1}^{(j)}). Result: {θ_{t−1}^{(j)}, w_t^{(j)}}_{j=1}^{M} ∼ π_t.
  if particle weights not degenerate (see text) then
    {θ_t^{(j)}, w_t^{(j)}}_{j=1}^{M} ← {θ_{t−1}^{(j)}, w_t^{(j)}}_{j=1}^{M}
    {(h_t^{(j)}, K_{h,t}^{(j)})}_{j=1}^{M} ← {(h_{t−1}^{(j)}, K_{h,t−1}^{(j)})}_{j=1}^{M}
    t → t + 1.
  else
    Resample: let K = {k_1, …, k_M} ⊆ {1, …, M} be the resampling indices, so that {θ_{t−1}^{(k)}, 1/M}_{k∈K} ∼ π_t. Relabel: k_j ← j, the j-th resampling index, so that {θ_{t−1}^{(j)}, 1/M}_{j=1}^{M} ∼ π_t. DO NOT resample kernels or tuning parameters at this stage.
    Move θ_{t−1}^{(j)} via the π_t-invariant MCMC kernel, K_{h,t}^{(j)}, with tuning parameter h_{t−1}^{(j)}; denote the proposed new particle by θ̃_t^{(j)} and the accepted/rejected particle by θ_t^{(j)}. Result: {θ_t^{(j)}, 1/M}_{j=1}^{M} ∼ π_t.
    To obtain {(h_t^{(j)}, K_{h,t}^{(j)})}_{j=1}^{M} ≡ {(h_t^{(j)}, i_t^{(j)})}, sample M times from (5). Allocate the new selection to particles at random.
  end if
end for

Theoretical Results
In this section the proposed algorithm will be justified by a series of theoretical results; guidance as to how it should best be implemented will also be given. The results presented here apply in the limit as the number of particles, M → ∞. As discussed above, in this limit the variance of the kernel R(·) in (4) tends to 0. To simplify the discussion, it will be assumed that tunings are one dimensional (the arguments presented extend readily to the multivariate case). For a slight notational simplification, the criterion Λ will be used, rather than Λ̃ (as suggested in Algorithm 3); this does not affect the validity of any of the arguments, which also hold for Λ̃. The section is split into two parts. Firstly, in Section 4.1, it is of interest to examine what happens to the distribution of the hs after one step of reweighting and resampling; this result will lead to a criterion for the choice of weight function that guarantees MCMC mixing improvement with respect to Λ. In Section 4.2, the sequential improvement of the hs will be considered over many steps of the ASMC algorithm and with a changing target. General conditions for convergence of ASMC to the optimal kernel and tuning parameter will be provided.

In this section and in the relevant proofs, it is convenient to temporarily drop the t superscript, e.g. g^{(t)} ≡ g, θ_{t−1} ≡ θ and θ_t ≡ θ′. To study the effect of reweighting and resampling on the distribution of the hs, suppose that currently {h^{(j)}}_{j=1}^{M} are i.i.d. from π(h), the pdf of a random variable, H. The dependence on current and proposed particles means the weight attached to h^{(j)} is random, but also, due to the independence of h from the particles, it is an unbiased estimator of the 'true' weight, E_{Θ,Θ′|H}[f(Λ) | H = h^{(j)}], where E_{Θ,Θ′|H} denotes the expectation with respect to the joint density of the random variables Θ and Θ′ conditional on H.
The true weighting function will be denoted

w(h) = E_{Θ,Θ′|H}[f(Λ) | H = h].    (6)

The following proposition, which is used repeatedly in subsequent results, shows how reweighting and resampling affect π(h).

Proposition 4.1
Suppose that currently {h^{(j)}}_{j=1}^{M} are i.i.d. from π(h), the pdf of a random variable, H, independent of θ. Let w(h) be the weighting function defined in (6). Then in the limit as M → ∞, the distribution of the reweighted and subsequently resampled hs is

π*(h) = w(h) π(h) / ∫ w(h) π(h) dh.
See Appendix A. □
Since ASMC uses a selection of hs, it is appropriate as a starting point to look for conditions under which their distribution is improved. It would be desirable if, over π*(h), the objective function would on average take a higher value, for then the new distribution would on average perform better with respect to Λ than the old. This criterion can be stated in mathematical form: conditions on f are sought for which

∫ π*(h) g(h) dh ≥ ∫ π(h) g(h) dh.

Lemma 4.1
Assuming g is π(h)-integrable, in the limit as M → ∞,

E_{π*(h)}[g(h)] ≥ E_{π(h)}[g(h)]  ⟺  cov_{π(h)}[g(h), w(h)] ≥ 0.    (7)

That is, provided there is positive correlation between the objective function, g(h), and the weighting function, w(h), the new distribution of hs will on average perform better (on g(h)) with respect to Λ than the old.
The result is obtained by expanding the definitions in (7):

E_{π*(h)}[g(h)] ≥ E_{π(h)}[g(h)]
⟺ E_{π(h)}[w(h) g(h)] ≥ E_{π(h)}[w(h)] E_{π(h)}[g(h)]
⟺ cov_{π(h)}[g(h), w(h)] ≥ 0. □

Although this result does not directly yield a general form for f, it does give a simple criterion that must be fulfilled by any candidate function. An immediate corollary gives more concrete guidance:

Corollary 4.1
A simple linear weighting scheme, f(Λ) = a + Λ, where a ≥ 0, satisfies (7).
This is trivially verified using the linearity property of the covariance. □
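Spelling the corollary out: with f(Λ) = a + Λ, the true weight (6) is w(h) = E[a + Λ | H = h] = a + g(h), so

```latex
\operatorname{cov}_{\pi(h)}\left[g(h),\, w(h)\right]
  = \operatorname{cov}_{\pi(h)}\left[g(h),\, a + g(h)\right]
  = \operatorname{var}_{\pi(h)}\left[g(h)\right] \ge 0 .
```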
A consequence of this lemma is that the ASMC algorithm with linear weights will lead to sequential improvement with respect to Λ under very weak assumptions on the target and the initial density for h. A linear weighting scheme may at first glance seem sub-optimal, in that it should be possible to learn h more quickly using a function f(Λ) that increases at a super-linear rate. The present authors conjecture that such functions will not always guarantee an improvement in the distribution of h. For example, consider f(Λ) = Λ², for which the weighting function takes the form w(h) = g(h)² + V[Λ | H = h]. Because of the V[Λ | H = h] term, which may be large for values of h where g(h) is small, it is no longer true that cov_{π(h)}[g(h), w(h)] ≥ 0.

The goal of this section is to provide a theoretical result concerning the ability of ASMC to update the distribution of hs with respect to a sequence of targets, π_1(θ_1), …, π_n(θ_n). To simplify notation, it will be assumed that a move occurs at each iteration of the algorithm. The result can be extended to the case where moves occur intermittently, provided they occur infinitely often in the limit as the number of data points goes to infinity. Define a set of functions, {g^{(t)}(h)}_{t=1}^{n},

g^{(t)}(h) = ∫ π_t(θ_{t−1}) K_{h,t}(θ_{t−1}, θ_t) Λ(θ_{t−1}, θ_t) dθ_{t−1} dθ_t ≥ 0,

where, for each t, K_{h,t} is a π_t-invariant MCMC kernel. For a linear weighting scheme,

π^{(t)}(h) ∝ π^{(0)}(h) ∏_{s=1}^{t} (a + g^{(s)}(h)).

Below it will be shown that, as t → ∞, if the sequence of functions {g^{(t)}(h)} converges to a fixed function, g(h), and if g has a unique global maximum, h_opt, then π^{(t)}(h) will converge to a point mass on h_opt. The key assumption of this theorem regards the convergence of the functions {g^{(t)}(h)}. This assumption is linked to the idea that a good value of h for the target at time t is required to remain a good value at later times.
As mentioned above, the motivation behind SMC is that successive targets should be similar. Moreover, standard Bayesian asymptotic theory shows that as the number of observations, n, tends to infinity, the posterior tends in distribution to that of a Gaussian random variable. Thus, provided information from the current particles about the posterior variance is used appropriately, it should be expected that the sequence of functions {g^{(t)}(h)} would also converge. This issue will be explored empirically in the next section.

Theorem 4.1
Let π^{(0)}(h) be the initial density for the tuning parameter, with support H ⊆ R, and let a > 0. Define, as above,

π^{(t)}(h) ∝ π^{(0)}(h) ∏_{s=1}^{t} (a + g^{(s)}(h)).

Suppose there exists a function g : H → R_{≥0} such that

sup_{h∈H} |g^{(t)} − g| ≤ k_g t^{−α},  α ∈ (0, 1], k_g > 0.

Furthermore, suppose g has a unique global maximum, h_opt, contained in the interior of H, and that g is twice differentiable in an interval containing h_opt. Then as t → ∞, π^{(t)}(h) tends to a Dirac mass centred on the optimal scaling, h_opt.
See Appendix B. □
This section is organised as follows. In Section 5.1, the convergence of h to an optimal scaling will be demonstrated empirically using a linear Gaussian model. Then in Section 5.2 the problem of Bayesian mixture analysis will be introduced. In Sections 5.3 and 5.4 the proposed method will be evaluated in simulation studies, with Bayesian mixture posteriors defining the sequence of targets of interest.

Following Sherlock and Roberts (2009), the expected (Mahalanobis) square jumping distance will be considered as the MCMC performance criterion:

Λ(θ_{t−1}, θ_t) = (θ_{t−1} − θ_t)^T Σ̂_{π_t}^{−1} (θ_{t−1} − θ_t),

where θ_{t−1} and θ_t are two points in the parameter space and Σ̂_{π_t} is an empirical estimate of the target covariance obtained from the current set of particles. Two different MCMC kernels will be considered within the SMC algorithm; these are defined by the following two proposals:

q_rw(θ_{t−1}, θ̃_t) = N(θ_{t−1}, h Σ̂_{π_t}),
q_lw(θ_{t−1}, θ̃_t) = N(α θ_{t−1} + (1 − α) θ̄_t, h Σ̂_{π_t}),  h ∈ (0, 1), α = √(1 − h),

where θ̄_t and Σ̂_{π_t} are respectively estimates of the target mean and covariance. The first of these is a random walk proposal. The second is based upon a method for updating parameter values in Liu and West (2001), here named the 'Liu/West' proposal. The Liu/West proposal has its mean shrunk towards the mean of the target, and the imposed choice of α = √(1 − h) sets the mean and variance of the proposed particles equal to those of the current particles. Note that if the target is Gaussian, then this proposal can be shown to be equivalent to a Langevin proposal (Roberts and Tweedie, 1996).

It is of interest to see an example g(h) and to demonstrate the convergence of one of the proposed algorithms to the optimal scaling. This will be achieved using a Gaussian target, for which a usable analytic expression for the optimal scaling of the random walk kernel is available.
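The ESJD criterion and the two proposals above can be sketched as follows (an illustrative Python sketch; `chol` stands for a Cholesky factor of the covariance estimate, and all names are ours):

```python
import numpy as np

def esjd(theta_old, theta_new, cov_inv):
    """Mahalanobis square jumping distance, averaged over particles:
    mean of (theta_{t-1} - theta_t)^T Sigma^{-1} (theta_{t-1} - theta_t)."""
    d = theta_old - theta_new
    return np.mean(np.einsum('ij,jk,ik->i', d, cov_inv, d))

def rw_proposal(theta, h, chol, rng):
    """q_rw: N(theta, h * Sigma_hat), with chol @ chol.T = Sigma_hat."""
    return theta + np.sqrt(h) * (rng.standard_normal(theta.shape) @ chol.T)

def lw_proposal(theta, h, mean, chol, rng):
    """q_lw (Liu/West): N(alpha * theta + (1 - alpha) * mean, h * Sigma_hat)
    with alpha = sqrt(1 - h), preserving the particle mean and variance."""
    alpha = np.sqrt(1.0 - h)
    centre = alpha * theta + (1.0 - alpha) * mean
    return centre + np.sqrt(h) * (rng.standard_normal(theta.shape) @ chol.T)

# Moving a cloud of particles with the Liu/West proposal leaves the overall
# spread essentially unchanged, as the shrinkage argument above predicts.
rng = np.random.default_rng(7)
particles = rng.standard_normal((5000, 2))
moved = lw_proposal(particles, 0.3, particles.mean(axis=0), np.eye(2), rng)
jump = esjd(particles, moved, np.eye(2))
```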
The results in this section are based on 100 observations simulated from a 5-dimensional standard Gaussian density, $y \sim N(0, I)$, where $I$ is the $5 \times 5$ identity matrix, with likelihood $\pi(y \mid \theta) = N(y; \theta, I)$. The prior on the unknown parameter $\theta$, the vector of means, was set to $N(0, I)$. ASMC with a random-walk proposal was used to generate $M = 2000$ particles from the posterior. Resampling was invoked when the ESS dropped below $M/2$, with the $h$s perturbed after resampling. The initial distribution for $h$ was chosen to be uniform on $(0, 2)$.

The left-hand plot of Figure 1 shows $g(h)$ for this target. Note in this case that the sequence of functions, $\{g^{(t)}\}$, does not change much, since each intermediate target is exactly Gaussian and the proposal is scaled by the approximate variance of the target. The optimum scaling, $h_{\mathrm{opt}}$, was computed using 1-dimensional numerical integration and Theorem 1 of Sherlock and Roberts (2009). The right-hand plot illustrates several features of the adaptive RWM: the resampling frequency, that the algorithm does indeed converge to the true optimal scaling, and the approximate rate of this convergence.

[Figure 1: Left plot: $g(h)$ for a 5-dimensional Gaussian target, explored with RWM and with ESJD as the optimization criterion. Right plot: convergence of $h$ for the same density based on 100 simulated observations (mean with 2.5-97.5% range); the horizontal line is the approximately optimal scaling, 1.06.]

The ability of the ASMC algorithm to learn MCMC tuning parameters in more complicated scenarios was evaluated using simulated data from mixture likelihoods (for a complete review of this topic, see Frühwirth-Schnatter (2006)). Let $p_1, \ldots, p_r > 0$ with $\sum_{i=1}^{r} p_i = 1$, let $N(\cdot; \mu, v)$ denote the normal density function with mean $\mu$ and variance $v$, and let $\theta = \{p_{1:r-1}, v_{1:r}, \mu_{1:r}\}$. The likelihood for a single observation, $y_i$, is
$$\pi(y_i \mid \theta) = \sum_{j=1}^{r} p_j N(y_i; \mu_j, v_j). \quad (8)$$
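The single-observation mixture density above extends to a full-data log-likelihood by summing over observations. A minimal sketch (the helper name is hypothetical, not the authors' code), using a log-sum-exp step for numerical stability:

```python
import numpy as np

# Sketch of evaluating the r-component mixture likelihood for n observations:
# log pi(y | theta) = sum_i log( sum_j p_j N(y_i; mu_j, v_j) ).
def log_mixture_likelihood(y, p, mu, v):
    y = np.asarray(y, dtype=float)[:, None]        # shape (n, 1)
    log_dens = (-0.5 * np.log(2 * np.pi * v)
                - 0.5 * (y - mu) ** 2 / v)         # (n, r) componentwise log N
    log_w = np.log(p) + log_dens                   # weighted, still in log space
    m = log_w.max(axis=1, keepdims=True)           # log-sum-exp for stability
    return float((m[:, 0] + np.log(np.exp(log_w - m).sum(axis=1))).sum())

p = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
v = np.array([0.5, 0.5])
ll = log_mixture_likelihood([0.0, 1.0, -1.0], p, mu, v)
print(np.isfinite(ll))
```

Working in log space avoids underflow when a point sits far from every component, which matters when the likelihood is evaluated repeatedly inside the SMC weights.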
The prior on $\theta$ was multivariate normal on a transformed space, using the generalised logit scale for the weights, the log scale for the variances, and leaving the means untransformed. The components of $\theta$ were assumed independent a priori; the priors were $\log(p_j/p_r) \sim N(0, \cdot)$, $\log(v_j) \sim N(-0.\!\cdot, 0.\!\cdot)$ and $\mu_j \sim N(0, 0.\!\cdot)$, where $j = 1, \ldots, r-1$ for the weights and $j = 1, \ldots, r$ for the means and variances. The MCMC moves within the SMC algorithm were performed in the transformed space, using the appropriate inverse-transformed values to compute the likelihood in (8).

An issue with mixture models is that, for the above choice of prior, the likelihood and posterior are invariant to permutation of the component labels (Stephens, 2000). As a result the posterior distribution has a multiple of $r!$ modes, corresponding to each possible permutation. One way of overcoming this problem is to introduce a constraint on the parameters, such as labelling the components so that $\mu_1 < \mu_2 < \cdots < \mu_r$, or so that $v_1 < v_2 < \cdots < v_r$. This choice affects the empirical moments of the resulting posterior and hence the proposal distribution of the MCMC kernel: both the random-walk and Liu/West proposals depend on the posterior covariance, and the latter also depends on the mean. In particular, if there is a choice of ordering under which the posterior is closer to Gaussian, then this is likely to lead to better mixing of the MCMC kernels. This phenomenon motivates the idea that it is also possible to choose between orderings on the parameter vector, which will be investigated in the sequel.

In analysing the simulated data, a number of SMC and ASMC algorithms were compared. These correspond to using the following MCMC kernels:
RWfixed: Random walk ordered by means, with $h$ chosen based on the theoretical results for Gaussian targets (Roberts and Rosenthal, 2001; Sherlock and Roberts, 2009).

RWadaptive: Adaptive random walk ordered by means, with a uniform prior on $h$.

LWmean: Adaptive Liu/West proposal ordered by means.

LWvariance: Adaptive Liu/West proposal ordered by variances.

Kmix: Adaptive choice between the random walk ordered by means, the Liu/West proposal ordered by means and the Liu/West proposal ordered by variances.

In each case the reference to ordering relates to how the component labels were defined, and thus affects the estimates of the posterior mean and covariance used. The above methods were also compared with the adaptive MCMC algorithm of Haario et al. (1998), denoted
AMCMC. The specific implementation used here is as follows. The prior densities were identical to those for ASMC, the parameter vector was ordered by means, and the random-walk tuning was computed using the approximately optimal Gaussian scaling of $h = 2.38/\sqrt{3r-1}$. The AMCMC algorithm was run for 12000 iterations for the 5-dimensional datasets and for 30000 iterations for the 8-dimensional datasets: these values were chosen so as to approximately match the number of likelihood computations involved between the ASMC and AMCMC methods. The burn-in period was set to half of the number of iterations and the method was initialised by a draw from the prior. There was an initial non-adaptive phase, lasting 1000 iterations, during which the proposal kernel was scaled by the prior covariance; after this phase the scaling used estimates of the posterior covariance computed from the chain to date, updated every 100 iterations.

For the ASMC algorithms, the initial distribution of the $h$s was chosen to be uniform on $(0, 2)$ for the random walk and on $(0, 1)$ for the Liu/West proposal. In the case of the random walk, this range of $h$s can be justified by considering the optimal scaling for a random-walk Metropolis on a multivariate Gaussian target in 5 dimensions, namely $2.38/\sqrt{5} \approx 1.06$ (which decays with increasing dimension as $O(d^{-1/2})$). For the Liu/West proposal, $h$ must lie in $(0, 1)$. A small noise variance was used in (4); a sensitivity analysis showed that slightly changing the variance of the noise did not affect the conclusions of this research. The parameter for the linear $h$-weighting scheme was $a = 0$. If any $h$ was perturbed below zero, a small positive value was imputed; similarly, for the Liu/West approach, any $h$ perturbed above 1 was replaced by 1.

The number of particles was set to $M = 2000$ for the 2-mixture datasets and $M = 5000$ for the 3-mixture datasets. Each algorithm was run 100 times on each dataset, with the order of the observations randomised each time. For the MCMC-based methods an ESS tolerance of $M/2$ was used, with the $h$s perturbed after resampling. For ease of computing posterior quantities of interest, each of the above algorithms was forced to resample and move on the last iteration.

To compare the performance of the different methods, a measure of the accuracy of the estimated predictive density was used. This is advantageous because it is invariant to re-labelling of the mixture components. The chosen accuracy measure, the variability of the predictive density (VPD), was calculated as follows. Each run of the algorithm produces a weighted particle set, from which an estimate of $E[\pi(y^{(i)} \mid y_{1:n})]$ can be obtained at 100 points, $\{y^{(i)}\}_{i=1}^{100}$, equi-spaced between $-2.5$ and $2.5$. For each $i$, the 100 simulation runs produce 100 realisations of $E[\pi(y^{(i)} \mid y_{1:n})]$; let $\hat{y}^{(i,j)}$ be the estimate of $y^{(i)}$ obtained from run $j$. The VPD measure used in this paper is
$$\mathrm{mean}_i\left[\mathrm{var}_j\left(\hat{y}^{(i,j)}\right)\right],$$
where $\mathrm{mean}_i$ is the mean over the $i$s and $\mathrm{var}_j$ is the variance of the estimates of $y^{(i)}$ obtained from the 100 simulations. The VPD gives an indication of the global variability of the predictive density across the simulations. In the tables, the relative VPD is used, which gives a scale-free comparison between methods.
The SMC/ASMC algorithm with a relative VPD of 1 is the reference algorithm, having the smallest VPD of the SMC/ASMC methods; larger values indicate higher VPDs. For the AMCMC methods, the predictive densities were computed using all available samples, i.e. 6000 for the 2-mixture datasets and 15000 for the 3-mixture datasets. For the SMC/ASMC methods a Rao-Blackwellised version of the predictive density was computed using all current and proposed particles available from the last iteration (that is, using 4000 and 10000 sample points respectively for the 2- and 3-mixture datasets).
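The VPD calculation can be sketched as follows. The data here are illustrative only (the grid, noise levels and run counts are invented for the example), but the measure itself is the mean-over-grid-points of the across-run variance described above.

```python
import numpy as np

# VPD sketch: yhat[i, j] is the predictive-density estimate at grid point i
# from simulation run j; VPD = mean_i[ var_j(yhat[i, j]) ].
def vpd(yhat):
    return float(np.mean(np.var(yhat, axis=1)))

rng = np.random.default_rng(0)
grid = np.linspace(-2.5, 2.5, 100)                      # the 100 evaluation points
true_dens = np.exp(-grid**2 / 2) / np.sqrt(2 * np.pi)   # stand-in predictive density

# two hypothetical "methods": 100 runs each, with different run-to-run noise
runs_a = true_dens[:, None] + 0.01 * rng.standard_normal((100, 100))
runs_b = true_dens[:, None] + 0.05 * rng.standard_normal((100, 100))

print(vpd(runs_a) < vpd(runs_b))        # noisier runs give a larger VPD
# Relative VPD: divide each method's VPD by the smallest VPD among the methods.
print(round(vpd(runs_a) / min(vpd(runs_a), vpd(runs_b)), 2))
```

The second printed value is the relative VPD of the better method, which is 1 by construction, matching the convention used in Table 1.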
100 realisations were simulated from each of the following likelihoods (digits marked "·" were lost in extraction):

Dataset 1: π(y | θ) = 0.· N(y; −0.·, 0.·) + 0.· N(y; 0.·, 0.·)
Dataset 2: π(y | θ) = 0.· N(y; 0, ·) + 0.· N(y; 0, 0.·)
Dataset 3: π(y | θ) = 0.· N(y; −·, 0.·) + 0.· N(y; 1, 0.·)
Dataset 4: π(y | θ) = 0.· N(y; −0.·, 0.·) + 0.· N(y; 0.·, 0.·)
Dataset 5: π(y | θ) = 0.· N(y; −0.·, 0.·) + 0.· N(y; 0, 0.·) + 0.· N(y; 0.·, ·)
Dataset 6: π(y | θ) = 0.· N(y; −0.·, 0.·) + 0.· N(y; 0, 0.·) + 0.· N(y; 0.·, 0.·)

Dataset 1
Method       Rel. VPD    JD      Acc.    h                       Propn.
LWvariance   1           1.869   0.3     0.941
LWmean       1.189       1.818   0.32    0.956
Kmix         1.258       1.845   0.317   LWm 0.963, LWv 0.958    LWm 0.785, LWv 0.215
RWadaptive   2.391       0.708   0.21    0.946
AMCMC        2.396       0.575   0.13    1.073
RWfixed      3.414       0.641   0.18    1.064

Dataset 2
Method       Rel. VPD    JD      Acc.    h                       Propn.
LWvariance   1           9.139   0.873   0.978
Kmix         2.843       9.023   0.854   LWm 0.984, LWv 0.978    LWm 0.005, LWv 0.995
AMCMC        28.333      0.197   0.019   1.073
LWmean       112.23      1.869   0.129   0.969
RWadaptive   188.094     0.77    0.134   0.584
RWfixed      219.907     0.596   0.041   1.064

Dataset 3
Method       Rel. VPD    JD      Acc.    h                       Propn.
LWmean       1           6.38    0.792   0.98
Kmix         1.54        6.378   0.806   LWm 0.979               LWm 1
AMCMC        7.465       0.847   0.146   1.073
RWfixed      40.538      1.124   0.277   1.064
RWadaptive   45.739      1.057   0.369   1.045
LWvariance   148.827     0.737   0.064   0.966

Dataset 4
Method       Rel. VPD    JD      Acc.    h                       Propn.
LWmean       1           7.132   0.875   0.98
Kmix         1.099       7.127   0.877   LWm 0.979               LWm 1
AMCMC        24.024      0.462   0.057   1.073
RWadaptive   48.606      1.143   0.274   1.086
RWfixed      51.919      1.167   0.298   1.064
LWvariance   1096.167    0.632   0.027   0.961

Dataset 5
Method       Rel. VPD    JD      Acc.    h                       Propn.
AMCMC        0.883       0.356   0.04    0.849
Kmix         1           2.258   0.234   LWm 0.964, LWv 0.971    LWm 0.044, LWv 0.956
LWvariance   1.151       2.284   0.183   0.971
LWmean       2.792       1.007   0.092   0.961
RWadaptive   4.923       0.847   0.205   0.435
RWfixed      5.187       0.56    0.055   0.84

Dataset 6
Method       Rel. VPD    JD      Acc.    h                       Propn.
LWmean       1           4.099   0.277   0.972
Kmix         1.018       3.994   0.363   LWm 0.973               LWm 1
AMCMC        1.556       0.211   0.04    0.849
RWfixed      3.244       0.996   0.429   0.84
RWadaptive   3.259       0.93    0.192   0.693
LWvariance   3.951       1.951   0.13    0.944
Table 1: Rel. VPD is the relative VPD, JD is the mean square jumping distance, Acc. is the mean final acceptance probability, h is the mean final scaling by kernel and Propn. is the mean final kernel proportions. The kernels 'LWm' and 'LWv' indicate respectively a Liu/West proposal ordering on means or on variances.

In most of the datasets the fixed random-walk scaling was not a good estimate, and the adaptive version of the algorithm was able to rescale to compensate for this. In datasets 3 and 6, the fixed random walk marginally outperformed the adaptive version.

The 'correctly ordered' sequential Liu/West algorithms considerably outperform the sequential RW-based methods in all six datasets, and the incorrectly ordered versions perform worse than, or as poorly as, the RW. For the Liu/West proposals, the $h$ selected in each dataset was very close to 1: this special value corresponds to an independence kernel in the form of a moment-matched Gaussian approximation of the target. This is of interest because, in combination with the high acceptance rates of between 80% and 87% in datasets 2-4, it suggests that the 'correct' ordering makes the target, ostensibly a very complex density function, approximately Gaussian in these cases.

The Kmix algorithm is able to choose between orderings; the advantages of this are clearly evidenced in the results, as it selects the best ordering in each case, with the exception of dataset 1 (where the means and variances are both similar). The Kmix sampler settles almost unanimously on one ordering above the others. These results show empirically that there is not much difference between using a single (correctly chosen) kernel and using a selection of kernels.

The performance of AMCMC was surpassed in all cases by the Kmix algorithm, with the exception of dataset 5, where AMCMC was the best performing algorithm. In this latter case, and in dataset 6, neither AMCMC nor the SMC/ASMC algorithms performed well. AMCMC outperformed RWadaptive in each case apart from dataset 1, where the difference was small.
However, the results show that the average jumping distance of the kernel used in the ASMC algorithm was greater than that of AMCMC in all cases, suggesting that ASMC is better able to adapt towards well-mixing kernels. To make this comparison clearer, two MCMC algorithms were run on each dataset: one using the final kernel found by AMCMC, and one using a kernel based on the ASMC run, with the final estimated covariance matrix and the final mean value of the tuning parameter. The resulting MCMC algorithms performed very similarly in 3 cases (the VPDs of the two MCMC algorithms were within 10% of each other), and the kernel found by ASMC performed better in the other 3 (VPD reduced by 30%, 40% and 80%).

This paper introduces a new method for automatically tuning, and choosing between, different MCMC kernels. Where MCMC-based SMC code already exists, adapting the $h$s would be a relatively straightforward means of enhancing performance, the main effort being in calculating the ratio of the proposed particles in the accept/reject step. Probably the most important conclusion from the simulation studies presented is that not much is lost, in terms of performance, in the adaptation process: the Kmix algorithm performed comparably to the respective best performing individual component, and the adaptive random-walk Metropolis performed similarly to the fixed, approximately optimally scaled version.

Although the method as presented has assumed that i.i.d. observations are available from the likelihood, ASMC readily extends to the case of a dependent sequence. Furthermore, the extension to general sequences of target densities is immediate, and is implied by the choice of notation in Algorithm 3.
The theoretical results presented in Section 4 apply only to a one-dimensional $h$. In the case that the tuning parameter is a vector, the proposed algorithm and theoretical results still apply (with slight modifications), but convergence is likely to be at a slower rate.

The main assumption of ASMC is that a good $h$ at time $t$ is likely also to perform well at time $t + 1$. One piece of evidence that supports this assumption is that the resampling frequency decreases with an increasing number of observations (Chopin, 2002). This implies that, although $\pi_1$ and $\pi_2$ may be quite different, $\pi_t$ and $\pi_{t+1}$ are likely to be less so for larger $t$, provided that the data provide sufficient information on the parameters. As mentioned earlier in the text, the assumption of similar successive target densities is also required for the non-adaptive version (Chopin, 2002; Del Moral et al., 2006).

ASMC can easily be extended by considering other proposal densities. For example, it is possible to formulate a $t$-distributed version of the Liu/West proposal; this allows for heavier-tailed proposals, the heaviness of which can be selected automatically by adaptively choosing the number of degrees of freedom. This $t$-based proposal includes the Liu/West proposal as a special case. Other interesting algorithms can be formulated using DE proposals (Ter Braak, 2006) (which generalise the snooker algorithm of Gilks et al. (1994)) or regional MCMC proposals (Roberts and Rosenthal, 2009; Craiu et al., 2009), both of which appeal strongly to the particle structure of the new method.

A Proof of Proposition 4.1
Let $\Lambda^{(j)} = \Lambda(\theta^{(j)}, \theta'^{(j)})$, i.e. the observed $\Lambda$ for the $j$th particle, and let $\mathbb{I}$ denote the indicator function. The collection $\{h^{(j)}, 1/M\}_{j=1}^{M}$ is an iid sample from $\pi(h)$; but with weights defined as
$$W^{(j)} = \frac{f(\Lambda^{(j)})}{\sum_{i=1}^{M} f(\Lambda^{(i)})},$$
the weighted particle set, $\{h^{(j)}, W^{(j)}\}_{j=1}^{M}$, has empirical density
$$\tilde{\pi}(h) = \sum_{j=1}^{M} W^{(j)} \mathbb{I}(h = h^{(j)}).$$
Define a discrete random variable $H^{\star}$, which takes value $h^{(j)}$ with probability $W^{(j)}$. For $h^{\star} \in \mathbb{R}$,
$$P(H^{\star} \leq h^{\star}) = \sum_{j=1}^{M} W^{(j)} \mathbb{I}(h^{(j)} \leq h^{\star}) = \frac{\frac{1}{M}\sum_{j=1}^{M} f(\Lambda^{(j)}) \mathbb{I}(h^{(j)} \leq h^{\star})}{\frac{1}{M}\sum_{i=1}^{M} f(\Lambda^{(i)})}.$$
In the limit as $M \to \infty$, with $(\theta, \theta') \stackrel{\mathrm{iid}}{\sim} \pi(\theta) K(\theta, \theta')$, the strong law of large numbers implies
$$\frac{\frac{1}{M}\sum_{j=1}^{M} f(\Lambda^{(j)}) \mathbb{I}(h^{(j)} \leq h^{\star})}{\frac{1}{M}\sum_{i=1}^{M} f(\Lambda^{(i)})} \to \frac{E_{\pi(\theta)K(\theta,\theta')}\left[f(\Lambda)\mathbb{I}(H < h^{\star})\right]}{E_{\pi(\theta)K(\theta,\theta')}\left[f(\Lambda)\right]} = \frac{E_{H}\left[E_{\Theta,\Theta'|H}[f(\Lambda) \mid H = h]\, \mathbb{I}(H < h^{\star})\right]}{E_{H}\left[E_{\Theta,\Theta'|H}[f(\Lambda) \mid H = h]\right]}.$$
Writing $w(H) = E_{\Theta,\Theta'|H}[f(\Lambda) \mid H = h]$, we obtain
$$\lim_{M \to \infty} P(H^{\star} \leq h^{\star}) = \frac{E_{\pi(h)}[w(H)\mathbb{I}(H < h^{\star})]}{E_{\pi(h)}[w(H)]} = \frac{\int_{s \leq h^{\star}} w(s)\pi(s)\,\mathrm{d}s}{\int w(h)\pi(h)\,\mathrm{d}h};$$
convergence in distribution follows as required. $\Box$

B Proof of Theorem 4.1
The proof proceeds in two parts, and starts by observing that $\pi^{(n)}(h) = \pi(h)\exp\{n f_n(h)\}$, where
$$f_n(h) = \frac{1}{n}\sum_{t=1}^{n} \log\left(a + g^{(t)}(h)\right).$$
In the first part, the following results will be proved:

• There exists a function $f : \mathcal{H} \to \mathbb{R}$ such that $\sup_{h \in \mathcal{H}} |f_n - f| < k_f n^{-\alpha}$.

• $h_{\mathrm{opt}}$ is the unique global maximum of $f$.

• $f$ is twice differentiable in an interval containing $h_{\mathrm{opt}}$.

In the second part of the proof, these results will be used to show that as $n \to \infty$, $\pi^{(n)}(h)$ approaches a Dirac mass centred on $h_{\mathrm{opt}}$.

Part 1
Claim that $f(h) = \log(a + g(h))$. It is easy to show that
$$\sup_{h \in \mathcal{H}} \left| \frac{a + g^{(t)}}{a + g} - 1 \right| < k_l t^{-\alpha},$$
where $k_l = k_g / \inf_{h \in \mathcal{H}}\{a + g(h)\} < \infty$ by assumption. Put $k_m = k_l + 1 > k_l + k_l^2 t^{-\alpha}$, which holds for all sufficiently large $t$. For sufficiently large (finite) $t$ and any $h \in \mathcal{H}$, $(a + g^{(t)})/(a + g)$ is close to 1; a Taylor series argument can therefore be applied to give
$$-k_m t^{-\alpha} < -k_l t^{-\alpha} - k_l^2 t^{-2\alpha} < \log\left(1 - k_l t^{-\alpha}\right) < \log\left[\frac{a + g^{(t)}}{a + g}\right] < k_l t^{-\alpha}.$$
The preceding argument shows that
$$\sup_{h \in \mathcal{H}} \left| \log\left\{\frac{a + g^{(t)}}{a + g}\right\} \right| = \sup_{h \in \mathcal{H}} \left| \log(a + g^{(t)}) - \log(a + g) \right| < k_m t^{-\alpha}.$$
For all $h \in \mathcal{H}$,
$$|f_n - \log(a + g)| < \frac{1}{n}\sum_{t=1}^{n} \left| \log(a + g^{(t)}) - \log(a + g) \right| < \frac{k_m}{n}\sum_{t=1}^{n} t^{-\alpha} < k_f n^{-\alpha},$$
as required. If $g$ is twice differentiable in an interval $I \subseteq \mathcal{H}$, where
$I \ni h_{\mathrm{opt}}$, then, being a continuous function of $g$, $f$ is also twice differentiable on $I$. That $h_{\mathrm{opt}}$ is the unique global maximum of $f$ is now implied by the assumptions on $g$ and the strictly increasing monotonicity of the logarithm.

Part 2
In this part, the properties of $f$ will be used to show that, for any interval containing $h_{\mathrm{opt}}$ as an interior point, the probability that $h$ belongs to that interval tends to 1 as $n \to \infty$.

Let $\bar{X}$ denote the complement of $X$ in $\mathcal{H}$. Let $I_1 \subset \mathcal{H}$ be any interval containing $h_{\mathrm{opt}}$ as an interior point. By virtue of the global uniqueness of $h_{\mathrm{opt}}$, there exists an open interval $I_2 \subset I_1$, also containing $h_{\mathrm{opt}}$, such that $f''(h) < 0$ for $h \in I_2$ and with the property $\inf_{h \in I_2} f \geq \sup_{h \in \bar{I}_1} f$. Then, for all open intervals $I_3 \subset I_2$ containing $h_{\mathrm{opt}}$, define
$$\sup_{h \in I_3}\{f(h_{\mathrm{opt}}) - f(h)\} = \epsilon_1, \qquad \inf_{h \in \bar{I}_3}\{f(h_{\mathrm{opt}}) - f(h)\} = \epsilon_2.$$
The strict concavity of $f$ on $I_2$ implies $\epsilon_2 > \epsilon_1$ (note the strict inequality). Consider the probability of $h \in I_1$ after $n$ updates:
$$P(h \in I_1) > P(h \in I_3) = \frac{\int_{I_3} \pi^{(n)}(h)\,\mathrm{d}h}{\int_{\mathcal{H}} \pi^{(n)}(h)\,\mathrm{d}h} = \frac{\int_{I_3} \pi(h)\exp\{n f_n(h)\}\,\mathrm{d}h}{\int_{\mathcal{H}} \pi(h)\exp\{n f_n(h)\}\,\mathrm{d}h} = \frac{1}{1 + \dfrac{\int_{\bar{I}_3} \pi(h)\exp\{n f_n(h)\}\,\mathrm{d}h}{\int_{I_3} \pi(h)\exp\{n f_n(h)\}\,\mathrm{d}h}},$$
since $\int_{\mathcal{H}} = \int_{I_3} + \int_{\bar{I}_3}$. By uniform convergence of $f_n$, the quotient of integrals in the denominator can be bounded above by
$$\frac{\int_{\bar{I}_3} \pi(h)\exp\{n f_n(h)\}\,\mathrm{d}h}{\int_{I_3} \pi(h)\exp\{n f_n(h)\}\,\mathrm{d}h} \leq \frac{\int_{\bar{I}_3} \pi(h)\exp\{n f(h) + k_f n^{1-\alpha}\}\,\mathrm{d}h}{\int_{I_3} \pi(h)\exp\{n f(h) - k_f n^{1-\alpha}\}\,\mathrm{d}h} \leq \frac{\int_{\bar{I}_3} \pi(h)\exp\{n f(h_{\mathrm{opt}}) - n\epsilon_2 + k_f n^{1-\alpha}\}\,\mathrm{d}h}{\int_{I_3} \pi(h)\exp\{n f(h_{\mathrm{opt}}) - n\epsilon_1 - k_f n^{1-\alpha}\}\,\mathrm{d}h}$$
$$\leq \frac{P_{\pi(h)}(h \in \bar{I}_3)}{P_{\pi(h)}(h \in I_3)} \exp\left\{n(\epsilon_1 - \epsilon_2) + 2 k_f n^{1-\alpha}\right\} \to 0 \text{ as } n \to \infty,$$
since $\epsilon_1 - \epsilon_2 < 0$. Therefore $P(h \in I_1) \to 1$ as $n \to \infty$. Since the choice of $I_1 \ni h_{\mathrm{opt}}$ was arbitrary, it may be made infinitesimally narrow and still, after enough iterations of the sampler, $P(h \in I_1) \to 1$. This implies that $\pi^{(n)}(h)$ tends in distribution to a Dirac mass centred on $h_{\mathrm{opt}}$, and establishes the claim. $\Box$

References
Andrieu, C. and C. Robert (2001). Controlled MCMC for optimal sampling. Technical report, Université Paris-Dauphine.

Andrieu, C. and J. Thoms (2008). A tutorial on adaptive MCMC. Statistics and Computing 18(4), 343-373.

Atchadé, Y. and J. Rosenthal (2005). On adaptive Markov chain Monte Carlo algorithms. Bernoulli 11(5), 815-828.

Cappé, O., R. Douc, A. Guillin, J.-M. Marin, and C. P. Robert (2008). Adaptive importance sampling in general mixture classes. Statistics and Computing 18(4), 447-459.

Chopin, N. (2002). A sequential particle filter method for static models. Biometrika 89(3), 539-552.

Cornebise, J., E. Moulines, and J. Olsson (2008). Adaptive methods for sequential importance sampling with application to state space models. Statistics and Computing 18(4), 461-480.

Craiu, R. V., J. Rosenthal, and C. Yang (2009). Learn from thy neighbor: Parallel-chain and regional adaptive MCMC. Journal of the American Statistical Association 104(488), 1454-1466.

Del Moral, P., A. Doucet, and A. Jasra (2006). Sequential Monte Carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68(3), 411-436.

Douc, R., A. Guillin, J.-M. Marin, and C. P. Robert (2005). Minimum variance importance sampling via population Monte Carlo. Technical report.

Doucet, A., N. de Freitas, and N. Gordon (Eds.) (2001). Sequential Monte Carlo Methods in Practice. Springer-Verlag New York.

Fearnhead, P. (2002). MCMC, sufficient statistics and particle filters. Journal of Computational and Graphical Statistics 11, 848-862.

Fearnhead, P. (2008). Computational methods for complex stochastic systems: A review of some alternatives to MCMC. Statistics and Computing 18, 151-171.

Frühwirth-Schnatter, S. (2006). Finite Mixture and Markov Switching Models. Springer.

Gamerman, D. and H. F. Lopes (2006). Markov chain Monte Carlo: Stochastic simulation for Bayesian inference (2nd ed.).

Gilks, W. and C. Berzuini (1999). Following a moving target - Monte Carlo inference for dynamic Bayesian models. Journal of the Royal Statistical Society, Series B 63(1), 127-146.

Gilks, W., S. Richardson, and D. Spiegelhalter (Eds.) (1995). Markov Chain Monte Carlo in Practice. Chapman & Hall/CRC.

Gilks, W. R., G. O. Roberts, and E. I. George (1994). Adaptive direction sampling. Journal of the Royal Statistical Society, Series D (The Statistician) 43(1), 179-189.

Gordon, N. J., D. J. Salmond, and A. F. M. Smith (1993). Novel approach to nonlinear/non-Gaussian Bayesian state estimation. Radar and Signal Processing, IEE Proceedings F 140(2), 107-113.

Haario, H., E. Saksman, and J. Tamminen (1998). An adaptive Metropolis algorithm. Bernoulli 7, 223-242.

Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57(1), 97-109.

Jasra, A., A. Doucet, D. A. Stephens, and C. C. Holmes (2008). Interacting sequential Monte Carlo samplers for trans-dimensional simulation. Computational Statistics & Data Analysis 52(4), 1765-1791.

Jasra, A., D. A. Stephens, A. Doucet, and T. Tsagaris (2008). Inference for Lévy driven stochastic volatility models via adaptive SMC.

Jasra, A., D. A. Stephens, and C. C. Holmes (2007). On population-based simulation for static inference. Statistics and Computing 17(3), 263-279.

Kirkpatrick, S., C. D. Gelatt, and M. P. Vecchi (1983). Optimization by simulated annealing. Science 220(4598), 671-680.

Kong, A., J. S. Liu, and W. H. Wong (1994). Sequential imputations and Bayesian missing data problems. Journal of the American Statistical Association 89(425), 278-288.

Liu, J. and M. West (2001). Sequential Monte Carlo Methods in Practice, Chapter 10: Combined Parameter and State Estimation in Simulation-Based Filtering. Springer-Verlag New York.

Liu, J. S. and R. Chen (1995). Blind deconvolution via sequential imputations. Journal of the American Statistical Association 90, 567-576.

Liu, J. S. and R. Chen (1998). Sequential Monte Carlo methods for dynamic systems. Journal of the American Statistical Association 93(443), 1032-1044.

Liu, J. S., R. Chen, and W. H. Wong (1998). Rejection control and sequential importance sampling. Journal of the American Statistical Association 93(443), 1022-1031.

Metropolis, N., A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller (1953). Equation of state calculations by fast computing machines. The Journal of Chemical Physics 21(6), 1087-1092.

Neal, R. (2001). Annealed importance sampling. Statistics and Computing 11(2), 125-139.

Pasarica, C. and A. Gelman (2010). Adaptively scaling the Metropolis algorithm using expected squared jumped distance. To appear: Statistica Sinica.

Roberts, G. and J. Rosenthal (2001). Optimal scaling for various Metropolis-Hastings algorithms. Statistical Science 16(4), 351-367.

Roberts, G. O. and J. S. Rosenthal (2009, June). Examples of adaptive MCMC. Journal of Computational and Graphical Statistics 18(2), 349-367.

Roberts, G. O. and R. L. Tweedie (1996). Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli 2(4), 341-363.

Sherlock, C. and G. Roberts (2009). Optimal scaling of the random walk Metropolis on elliptically symmetric unimodal targets. Bernoulli 15(3), 774-798.

Stephens, M. (2000). Dealing with label switching in mixture models. Journal of the Royal Statistical Society, Series B 62(4), 795-809.

Storvik, G. (2002). Particle filters for state-space models with the presence of unknown static parameters. IEEE Transactions on Signal Processing 50, 281-289.

Ter Braak, C. J. F. (2006). A Markov chain Monte Carlo version of the genetic algorithm differential evolution: easy Bayesian computing for real parameter spaces. Statistics and Computing 16(3), 239-249.

West, M. (1993). Mixture models, Monte Carlo, Bayesian updating and dynamic models. Computing Science and Statistics 24, 325-333.

Whitley, D. (1994). A genetic algorithm tutorial.