Convergence of adaptive mixtures of importance sampling schemes
The Annals of Statistics
© Institute of Mathematical Statistics, 2007
CONVERGENCE OF ADAPTIVE MIXTURES OF IMPORTANCE SAMPLING SCHEMES

By R. Douc, A. Guillin, J.-M. Marin and C. P. Robert

École Polytechnique, École Centrale Marseille and LATP, CNRS, INRIA Futurs, Projet Select, Université d'Orsay, and Université Paris Dauphine and CREST-INSEE
In the design of efficient simulation algorithms, one is often beset with a poor choice of proposal distributions. Although the performance of a given simulation kernel can clarify a posteriori how adequate this kernel is for the problem at hand, a permanent on-line modification of kernels causes concerns about the validity of the resulting algorithm. While the issue is most often intractable for MCMC algorithms, the equivalent version for importance sampling algorithms can be validated quite precisely. We derive sufficient convergence conditions for adaptive mixtures of population Monte Carlo algorithms and show that Rao–Blackwellized versions asymptotically achieve an optimum in terms of a Kullback divergence criterion, while more rudimentary versions do not benefit from repeated updating.
1. Introduction.
1.1. Monte Carlo calibration.
In the simulation settings found in optimization and (Bayesian) integration, it is well documented [20] that the choice of the instrumental distributions is paramount for the efficiency of the resulting algorithms. Indeed, in an importance sampling algorithm with importance function g(x), we are relying on a distribution g that is customarily difficult to calibrate outside a limited range of well-known cases. For instance, a standard result is that the optimal importance density for approximating an integral
$$I = \int f(x)\,\pi(x)\,dx$$

Received January 2005; revised December 2005. Supported in part by an ACI “Nouvelles Interfaces des Mathématiques” grant from the Ministère de la Recherche.
AMS 2000 subject classifications.
Key words and phrases.
Bayesian statistics, Kullback divergence, LLN, MCMC algorithm, population Monte Carlo, proposal distribution, Rao–Blackwellization.
This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in The Annals of Statistics, 2007, Vol. 35, No. 1, 420–448. This reprint differs from the original in pagination and typographic detail.
is g⋆(x) ∝ |f(x)| π(x) ([20], Theorem 3.12), but this formal result is not very informative about the practical choice of g, while a poor choice of g may result in an infinite variance estimator. MCMC algorithms somehow attenuate this difficulty by using local proposals like random walk kernels, but two drawbacks of these proposals are that they may take a long time to converge [18] and their efficiency ultimately depends on the scale of the local exploration.

The goals of Monte Carlo experiments are multifaceted and therefore the efficiency of an algorithm can be evaluated from many different perspectives. In particular, in Bayesian statistics the Monte Carlo sample can be used to approximate a variety of posterior quantities. Nonetheless, if we try to assess the generic efficiency of an algorithm and thus develop a portmanteau device, a natural approach is to use a measure of agreement between the target and the proposal distribution, similar to the intrinsic loss function proposed in [19] for an invariant estimation of parameters. A robust measure of similarity ubiquitous in statistical approximation theory [7] is the Kullback divergence
$$E(\pi, \tilde\pi) = \int \log\bigg(\frac{d\pi(x)}{d\tilde\pi(x)}\bigg)\,\pi(dx),$$
and this paper aims to minimize E(π, π̃) within a class of proposals π̃.

1.2. Adaptivity in Monte Carlo settings.
Given the complexity of the original optimization or integration problem (which does itself require Monte Carlo approximations), it is rarely the case that the optimization of the proposal distribution against an efficiency measure can be achieved in closed form. Even the computation of the efficiency measure for a given proposal is impossible in the majority of cases. For this reason, a number of adaptive schemes have appeared in the recent literature ([20], Section 7.6.3) in order to design better proposals against a given measure of efficiency without resorting to a standard optimization algorithm. For instance, in the MCMC community, sequential changes in the variance of Markov kernels have been proposed in [13, 14], while adaptive changes taking advantage of regeneration properties of the kernels have been constructed by Gilks, Roberts and Sahu [12] and Sahu and Zhigljavsky [23, 24]. From a more general perspective, Andrieu and Robert [2] have developed a two-level stochastic optimization scheme to update parameters of a proposal towards a given integrated efficiency criterion such as the acceptance rate (or its difference with a value known to be optimal—see Roberts, Gelman and Gilks [21]). However, as reflected in this general technical report of Andrieu and Robert [2], the complexity of devising valid adaptive MCMC schemes is a genuine drawback in their extension, given that the constraints on the inhomogeneous Markov
chain that results from this adaptive construction are either difficult to satisfy or result in a fixed proposal after a certain number of iterations.

Cappé, Guillin, Marin and Robert [3] (see also [20], Chapter 14) developed a methodology called Population Monte Carlo (PMC) [16], motivated by the observation that the importance sampling perspective is much more amenable to adaptivity than MCMC, due to its unbiased nature: using sampling importance resampling, any given sample from an importance distribution g can be transformed into a sample of points marginally distributed from the target distribution π, and Cappé et al. [3] (see also [8]) showed that this property is also preserved by repeated and adaptive sampling. (In this setting, “adaptive” is to be understood as a modification of the importance distribution based on the results of previous iterations.) The asymptotics of adaptive importance sampling are therefore much more manageable than those of adaptive MCMC algorithms, at least at a primary level, if only because the algorithm can be stopped at any time. Indeed, since every iteration is a valid importance sampling scheme, the algorithm does not require a burn-in time. Borrowing from the sequential sampling literature [11], the methodology of Cappé et al. [3] thus aimed at producing an adaptive importance sampling scheme via a learning mechanism on a population of points, themselves marginally distributed from the target distribution. However, as shown by the current paper, the original implementation of Cappé et al. [3] may suffer from an asymptotic lack of adaptivity that can be overcome by Rao–Blackwellization.

1.3. Plan and objectives.
This paper focuses on a specific family of importance functions that are represented as mixtures of an arbitrary number of fixed proposals. These proposals can be educated guesses of the target distribution, random walk proposals for local exploration of the target, nonparametric kernel approximations to the target or any combination of these. Using these fixed proposals as a basis, we then devise an updating mechanism for the weights of the mixture and prove that this mechanism converges to the optimal mixture, the optimality being defined here in terms of Kullback divergence. From a probabilistic point of view, the techniques used in this paper are related to techniques and results found in [4, 6, 17]. In particular, the triangular array technique that is central to the CLT proofs below can be found in [4, 10].

The paper is organized as follows. We first present the algorithmic and mathematical details in Section 2 and establish a generic central limit theorem. We evaluate the convergence properties of the basic version of PMC in Section 3, exhibiting its limitations, and show in Section 5 that its Rao–Blackwellized version overcomes these limitations and achieves optimality for the Kullback criterion developed in Section 4. Section 6 illustrates the practical convergence of the method on a few benchmark examples.
2. Population Monte Carlo.
The Population Monte Carlo (PMC) algorithm introduced in [3] is a form of iterated sampling importance resampling (SIR). The appeal of using a repeated form of SIR is that previous samples are informative about the connections between the proposal (importance) and the target distributions. We stress from the outset that this scheme has very few connections with MCMC algorithms since (a) PMC is not Markovian, being based on the whole sequence of simulations and (b) PMC can be stopped at any time, being validated by the basic importance sampling identity ([20], equation (3.9)) rather than by a probabilistic convergence result like the ergodic theorem. These features motivate the use of the method in setups where off-the-shelf MCMC algorithms cannot be of use. We first recall basic Monte Carlo principles, mostly to define notation and to make precise our goals.

2.1. The Monte Carlo framework.
On a measurable space (Ω, 𝒜), we are given a target, that is, a probability distribution π on (Ω, 𝒜). We assume that π is dominated by a reference measure µ, π ≪ µ, and also denote by π(dx) = π(x)µ(dx) its density. In most settings, including Bayesian statistics, the density π is known up to a normalizing constant, π(x) ∝ π̃(x). The purpose of running a simulation experiment with the target π is to approximate quantities related to π, such as intractable integrals
$$\pi(f) = \int f(x)\,\pi(dx),$$
but we do not focus here on a specific quantity π(f). In this setting, a standard stochastic approximation method is the Monte Carlo method, based on an i.i.d. sample x_1, …, x_N simulated from π, that approximates π(f) by
$$\hat\pi^{MC}_N(f) = N^{-1} \sum_{i=1}^N f(x_i),$$
which almost surely converges to π(f) (as N goes to infinity) by the law of large numbers (LLN). The central limit theorem (CLT) implies, in addition, that if π(f²) = ∫ f²(x) π(dx) < ∞, then
$$\sqrt{N}\,\{\hat\pi^{MC}_N(f) - \pi(f)\} \xrightarrow{\ \mathcal{L}\ } \mathcal{N}(0, V_\pi(f)),$$
where V_π(f) = π([f − π(f)]²). Obviously, this approach requires a direct i.i.d. simulation from π (or π̃), which often is impossible. An alternative ([20], Chapter 3) is to use importance sampling, that is, to choose a probability distribution ν ≪ µ on (Ω, 𝒜) called the proposal or importance distribution, the density of which is also denoted by ν, and to estimate π(f) by
$$\hat\pi^{IS}_{\nu,N}(f) = N^{-1} \sum_{i=1}^N f(x_i)\,\bigg(\frac{\pi}{\nu}\bigg)(x_i)$$
(the x_i's now being simulated i.i.d. from ν). If π is also dominated by ν, π ≪ ν, then π̂^{IS}_{ν,N}(f) almost surely converges to π(f) and if ν(f²(π/ν)²) < ∞, then the CLT also applies, that is,
$$\sqrt{N}\,\{\hat\pi^{IS}_{\nu,N}(f) - \pi(f)\} \xrightarrow{\ \mathcal{L}\ } \mathcal{N}\bigg(0, V_\nu\bigg(f\,\frac{\pi}{\nu}\bigg)\bigg).$$
As the normalizing constant of the target distribution π is unknown, it is not possible to directly use the IS estimator π̂^{IS}_{ν,N}(f).
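As a concrete illustration of the two estimators above, here is a small numerical sketch (not from the paper: the Gaussian target, the proposal scale and all variable names are our own illustrative choices, and both densities are taken fully normalized so that the plain IS estimator is usable):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000

# Target pi = N(0,1); we estimate pi(f) for f(x) = x^2, whose true value is 1.
def f(x):
    return x ** 2

def norm_pdf(x, s):
    """Density of N(0, s^2)."""
    return np.exp(-0.5 * (x / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

# Plain Monte Carlo estimator: i.i.d. draws from pi itself.
x = rng.normal(0.0, 1.0, N)
mc_est = f(x).mean()

# Importance sampling estimator with proposal nu = N(0, 2^2):
# draws from nu, each weighted by the density ratio (pi/nu)(x_i).
y = rng.normal(0.0, 2.0, N)
w = norm_pdf(y, 1.0) / norm_pdf(y, 2.0)
is_est = np.mean(f(y) * w)

print(mc_est, is_est)  # both estimates are consistent for pi(f) = 1
```

Both estimators converge at the √N rate, but the IS variance V_ν(f π/ν) depends heavily on the proposal ν, which is exactly the calibration issue discussed in the Introduction.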
A convenient substitute is the self-normalized IS estimator,
$$\hat\pi^{SNIS}_{\nu,N}(f) = \Bigg(\sum_{i=1}^N (\pi/\nu)(x_i)\Bigg)^{-1} \sum_{i=1}^N f(x_i)\,(\pi/\nu)(x_i),$$
which also converges almost surely to π(f). If ν((1 + f²)(π/ν)²) < ∞, then the CLT applies:
$$\sqrt{N}\,\{\hat\pi^{SNIS}_{\nu,N}(f) - \pi(f)\} \xrightarrow{\ \mathcal{L}\ } \mathcal{N}\bigg(0, V_\nu\bigg\{[f - \pi(f)]\,\frac{\pi}{\nu}\bigg\}\bigg).$$
Obviously, the quality of the IS and SNIS approximations strongly depends on the choice of the proposal distribution ν, which is delicate for complex distributions like those that occur in high-dimensional problems. (While multiplication of the number of proposals may offer some reprieve in this regard, we must stress at this point that our PMC methodology also suffers from the curse of dimensionality to which all importance sampling methods are subject, in the sense that high-dimensional problems require a considerable increase in computational power.)

2.2. Sampling importance resampling.
The sampling importance resampling (SIR) method of Rubin [22] is an extension of the IS method that achieves simulation from π by adding resampling to simple reweighting. More precisely, the SIR algorithm operates in two stages. The first stage is identical to IS and consists in generating an i.i.d. sample (x_1, …, x_N) from ν. The second stage builds a sample from π, (x̃_1, …, x̃_M), based on the instrumental sample (x_1, …, x_N), by resampling. While there are many resampling methods ([20], Section 14.3.5), the most standard (if least efficient) approach is multinomial sampling from {x_1, …, x_N} with probabilities proportional to the importance weights [(π/ν)(x_1), …, (π/ν)(x_N)], that is, the replacement of the weighted sample (x_1, …, x_N) by an unweighted sample (x̃_1, …, x̃_M), where x̃_i = x_{J_i} (1 ≤ i ≤ M) and where (J_1, …, J_M) ∼ M(M, ϱ_1, …, ϱ_N), the multinomial distribution with probabilities
$$P[J_\ell = i \mid x_1, \ldots, x_N] = \varrho_i \propto \frac{\pi}{\nu}(x_i), \qquad 1 \le i \le N,\ 1 \le \ell \le M.$$
The SIR estimator of π(f) is then the standard average
$$\hat\pi^{SIR}_{\nu,N,M}(f) = M^{-1} \sum_{i=1}^M f(\tilde x_i),$$
which also converges to π(f) since each x̃_i is marginally distributed from π. By construction, the variance of the SIR estimator is greater than the variance of the SNIS estimator. Indeed, the expectation of π̂^{SIR}_{ν,N,M}(f) conditional on the sample (x_1, …, x_N) is equal to π̂^{SNIS}_{ν,N}(f). Note, however, that an asymptotic analysis of π̂^{SIR}_{ν,N,M}(f) is quite delicate because of the dependencies in the SIR sample (which, again, is not an i.i.d. sample from π).

2.3. The population Monte Carlo algorithm.
In their generalization of importance sampling, Cappé et al. [3] introduce an iterative dimension in the production of importance samples, aimed at adapting the importance distribution ν to the target distribution π. Iterations are then used to learn about π from the (poor or good) performance of earlier proposals and that performance can be evaluated using different criteria, as, for example, the entropy of the distribution of importance weights.

More precisely, at iteration t (t = 0, 1, …, T) of the PMC algorithm, a sample of N values from the target distribution π is produced by a SIR algorithm whose importance function ν_t depends on t, in the sense that ν_t can be derived from the N × (t − 1) previous realizations of the algorithm, except for the first iteration t = 0, where the proposal distribution ν_0 is chosen as in a regular importance sampling experiment. A generic rendering of the algorithm is thus as follows:

PMC algorithm.
At time t = 0, 1, …, T,
1. generate (x_{i,t})_{1≤i≤N} by i.i.d. sampling from ν_t and compute the normalized importance weights ω̄_{i,t};
2. resample (x̃_{i,t})_{1≤i≤N} from (x_{i,t})_{1≤i≤N} by multinomial sampling with weights ω̄_{i,t};
3. construct ν_{t+1} based on (x_{i,τ}, ω̄_{i,τ})_{1≤i≤N, 0≤τ≤t}.

At this stage of the description of the PMC algorithm, we do not give in detail the construction of ν_t, which can thus arbitrarily depend on the past simulations. Section 3 studies a particular choice of the ν_t's in detail. A major finding of Cappé et al. [3] is, however, that the dependence of ν_t on earlier proposals and realizations does not jeopardize the fundamental importance sampling identity. Local adaptive importance sampling schemes can thus be chosen in much wider generality than was previously thought and this shows that, thanks to the introduction of a temporal dimension in the selection of the importance function, an adaptive perspective can be adopted at little cost, with a potentially large gain in efficiency.

Obviously, when the construction of the proposal distribution ν_t is completely open, there is no guarantee of permanent improvement of the simulation scheme, whatever the criterion adopted to measure this improvement. For instance, as illustrated by Cappé et al. [3], a constant dependence of ν_t on the past sample quickly leads to stable behavior. A more extreme illustration is the case where the sequence (ν_t) degenerates into a quasi-Dirac mass at a value based on earlier simulations and where the performance of the resulting PMC algorithm worsens. In order to study the positive and negative effects of the update of the importance function ν_t, we hereafter consider a special class of proposals based on mixtures and study two particular updating schemes, one in which improvement does not occur and one in which it does.
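The three steps of the generic PMC loop can be sketched in code as follows. This is a minimal illustration only, not the paper's D-kernel scheme: the recentering rule in `update_proposal`, the Gaussian target and all names are hypothetical choices made for the example, and the proposal-update mechanism is deliberately left as an abstract argument, as in the generic algorithm above.

```python
import numpy as np

rng = np.random.default_rng(1)

def pmc(target_pdf, proposal, update_proposal, N, T):
    """Generic PMC loop: sample, weight, resample, adapt (a sketch).

    `proposal` is a (sampler, pdf) pair for nu_0; `update_proposal` maps
    the past weighted samples to the next (sampler, pdf) pair."""
    sampler, pdf = proposal
    history = []
    for t in range(T + 1):
        x = sampler(N)                            # 1. i.i.d. sampling from nu_t
        w = target_pdf(x) / pdf(x)                #    weights pi/nu_t ...
        w = w / w.sum()                           #    ... self-normalized
        x_tilde = x[rng.choice(N, size=N, p=w)]   # 2. multinomial resampling
        history.append((x, w, x_tilde))
        sampler, pdf = update_proposal(history)   # 3. build nu_{t+1} from the past
    return history

# Toy run: target N(0,1) known up to a constant, proposals N(mu_t, 1) with
# mu_t recentered at the current self-normalized mean (one simple rule).
target = lambda x: np.exp(-0.5 * x ** 2)          # unnormalized density

def make_normal(mu):
    return (lambda n: rng.normal(mu, 1.0, n),
            lambda x: np.exp(-0.5 * (x - mu) ** 2))  # same missing constant

def update_proposal(history):
    x, w, _ = history[-1]
    return make_normal(np.sum(w * x))

hist = pmc(target, make_normal(2.0), update_proposal, N=5000, T=3)
x, w, _ = hist[-1]
print(np.sum(w * x))   # self-normalized estimate of the target mean, near 0
```

Since the weights are self-normalized, the missing normalizing constants of the target and proposal densities cancel, which mirrors the SNIS construction of Section 2.1.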
3. The D-kernel PMC algorithm. In this section and the following ones, we study adaptivity for a particular type of parameterized PMC scheme, in the case where ν_t is a mixture of measures ζ_d (1 ≤ d ≤ D) that are chosen prior to the simulation experiment, based either on an educated guess about π or on local approximations (as in nonparametric kernel estimation). The dependence of the ν_t's on the past simulations is of the form
$$\nu_t(dx) = \sum_{d=1}^D \alpha_d^t\, \zeta_d(\{x_{i,t-1}, \bar\omega_{i,t-1}\}_{1\le i\le N}, dx) = \sum_{d=1}^D \alpha_d^t \sum_{j=1}^N \bar\omega_{j,t-1}\, Q_d(x_{j,t-1}, dx),$$
where the ω̄_{i,t}'s denote the importance weights and the transition kernels Q_d (1 ≤ d ≤ D) are given. This situation is rather common in MCMC settings where several competing transition kernels are simultaneously available, but difficult to compare. For instance, the cycle and mixture MCMC schemes discussed by Tierney [25] are of this nature.

Hereafter, we thus focus on adapting the weights (α_d^t)_{1≤d≤D} toward a better fit with the target distribution. A natural approach to updating the weights (α_d^t)_{1≤d≤D} is to favor those kernels Q_d that lead to a high acceptance probability in the resampling step of the PMC algorithm. We thus choose the α_d^t's to be proportional to the survival rates of the corresponding Q_d's in the previous resampling step [since using the double mixture of the measures Q_d(x_{j,t−1}, dx) means that we first select a point x_{j,t−1} in the previous sample with probability ω̄_{j,t−1}, then select a component d with probability α_d^t and, finally, simulate from Q_d(x_{j,t−1}, dx)].

3.1. Algorithmic details.
The family (Q_d)_{1≤d≤D} of transition kernels on Ω × 𝒜 is such that (Q_d(x, ·))_{1≤d≤D, x∈Ω} is dominated by the reference measure µ introduced earlier. As above, we denote the corresponding densities of the target and of the transition kernels by π and q_d(·, ·) (1 ≤ d ≤ D), respectively. The associated PMC algorithm then updates the proposal weights (and generates the corresponding samples from π) as follows:

D-kernel PMC algorithm.

At time 0,
(a) generate x_{i,0} i.i.d. ∼ ν_0 (1 ≤ i ≤ N) and compute the normalized importance weights ω̄_{i,0} ∝ (π/ν_0)(x_{i,0});
(b) resample (x_{i,0})_{1≤i≤N} into (x̃_{i,0})_{1≤i≤N} by multinomial sampling, (J_{i,0})_{1≤i≤N} ∼ M(N, (ω̄_{i,0})_{1≤i≤N}), that is, x̃_{i,0} = x_{J_{i,0},0};
(c) set α_d^{0,N} = 1/D (1 ≤ d ≤ D).

At time t = 1, …, T,
(a) select the mixture components (K_{i,t})_{1≤i≤N} ∼ M(N, (α_d^{t,N})_{1≤d≤D});
(b) generate independent x_{i,t} ∼ q_{K_{i,t}}(x̃_{i,t−1}, x) (1 ≤ i ≤ N) and compute the normalized importance weights ω̄_{i,t} ∝ π(x_{i,t})/q_{K_{i,t}}(x̃_{i,t−1}, x_{i,t});
(c) resample (x_{i,t})_{1≤i≤N} into (x̃_{i,t})_{1≤i≤N} by multinomial sampling with weights (ω̄_{i,t})_{1≤i≤N};
(d) set α_d^{t+1,N} = Σ_{i=1}^N ω̄_{i,t} 𝕀_d(K_{i,t}) (1 ≤ d ≤ D).

In this implementation, at time t ≥ 1, step (a) chooses a kernel index d in the mixture for each point of the sample, while step (d) updates the weight α_d as the relative importance of kernel Q_d in the current round or, in other words, as the relative survival rate of the points simulated from kernel Q_d. (Indeed, since the survival of a simulated value x_{i,t} is driven by its importance weight ω̄_{i,t}, reweighting is related to the respective magnitudes of the importance weights for the different kernels.) Also, note that the resampling step (c) is used to avoid the propagation of very small importance weights along iterations and the subsequent degeneracy phenomenon that plagues iterated IS schemes like particle filters [11]. At time t = T, the algorithm should thus stop at step (b), so that integral approximations are based on the weighted sample (x_{i,T}, ω̄_{i,T})_{1≤i≤N}.

3.2. Convergence properties.
In order to assess the impact of this update mechanism on the performance of the PMC scheme, we now consider the convergence properties of the above algorithm when the sample size N grows to infinity. Indeed, as already pointed out in Cappé et al. [3], it does not make sense to consider the alternative asymptotics of the PMC scheme (namely, when T grows to infinity), given that this algorithm is intended to be run with a small number T of iterations.

A basic assumption on the kernels Q_d is that the corresponding importance weights are almost surely finite, that is,
$$\forall d \in \{1, \ldots, D\}\qquad \pi \otimes \pi\{q_d(x, x') = 0\} = 0,\tag{A1}$$
where ξ ⊗ ζ denotes the product measure, that is, ξ ⊗ ζ(A × B) = ∫_{A×B} ξ(dx) ζ(dy). We denote by γ_u the uniform distribution on {1, …, D}, that is, γ_u(k) = 1/D for all k ∈ {1, …, D}. The following result (whose proof is given in Appendix B) is a general LLN on the pairs (x_{i,t}, K_{i,t}) produced by the above algorithm.

Proposition 3.1.
Under (A1), for any π ⊗ γ_u-measurable function h and every t ∈ ℕ,
$$\sum_{i=1}^N \bar\omega_{i,t}\, h(x_{i,t}, K_{i,t}) \xrightarrow[N\to\infty]{\ P\ } \pi \otimes \gamma_u(h).$$

Note that this convergence result is more than we need for Monte Carlo purposes since the K_{i,t}'s are auxiliary variables that are not relevant to the original problem. However, the consequences of this general result are far from negligible. First, it implies that the approximation provided by the algorithm is convergent, in the sense that Σ_{i=1}^N ω̄_{i,t} f(x_{i,t}) is a convergent estimator of π(f). But, more importantly, it also shows that, for t ≥ 0,
$$\alpha_d^{t+1,N} = \sum_{i=1}^N \bar\omega_{i,t}\, \mathbb{I}_d(K_{i,t}) \xrightarrow[N\to\infty]{\ P\ } 1/D.$$
Therefore, at each iteration, the weights of all kernels converge to 1/D when the sample size grows to infinity. This translates into a lack of learning properties for the D-kernel PMC algorithm: its properties at iteration 1 and at iteration 10 are the same. In other words, this algorithm is not adaptive and only requires one iteration for a large value of N. We can also relate this to the fast stabilization of the approximation in [3]. (Note that a CLT can be established for this algorithm, but given its unappealing features, we leave the exercise for the interested reader.)
4. The Kullback divergence.
In order to obtain a correct and adaptive version of the D-kernel algorithm, we must first choose an effective criterion to evaluate both the adaptivity and the approximation of the target distribution by the proposal distribution. We then propose a modification of the original D-kernel algorithm that achieves efficiency in this sense.

As argued in many papers, using a wide range of arguments, a natural choice of approximation metric is the Kullback divergence. We thus aim to derive the D-kernel mixture that minimizes the Kullback divergence between this mixture and the target measure π,
$$\int\!\!\int \log\Bigg(\frac{\pi(x)\,\pi(x')}{\pi(x) \sum_{d=1}^D \alpha_d q_d(x, x')}\Bigg)\, (\pi \otimes \pi)(dx, dx').\tag{1}$$
This section is devoted to the problem of finding an iterative choice of mixing coefficients α_d that converge to this minimum. A detailed description of the optimal PMC scheme is given in Section 5.
4.1. The criterion.
Using the same notation as above, in conjunction with the choice of weights α_d in the D-kernel mixture, we introduce the simplex of ℝ^D,
$$S = \Bigg\{\alpha = (\alpha_1, \ldots, \alpha_D);\ \forall d \in \{1, \ldots, D\},\ \alpha_d \ge 0 \text{ and } \sum_{d=1}^D \alpha_d = 1\Bigg\},$$
and denote π̄ = π ⊗ π. We now assume that the D kernels also satisfy the condition
$$\forall d \in \{1, \ldots, D\}\qquad E_{\bar\pi}[|\log q_d(X, X')|] = \int\!\!\int |\log q_d(x, x')|\, \bar\pi(dx, dx') < \infty,\tag{A2}$$
which is automatically satisfied when all ratios π/q_d are bounded. From the Kullback divergence, we derive a function on S such that for α ∈ S,
$$E_{\bar\pi}(\alpha) = \int\!\!\int \log\Bigg(\sum_{d=1}^D \alpha_d q_d(x, x')\Bigg)\, \bar\pi(dx, dx') = E_{\bar\pi}\Bigg[\log\Bigg(\sum_{d=1}^D \alpha_d q_d(X, X')\Bigg)\Bigg].$$
By virtue of Jensen's inequality, E_π̄(α) ≤ ∫ π(dx) log π(x) for all α ∈ S. Note that due to the strict concavity of the log function, E_π̄ is a strictly concave function on a connected compact set and thus has no local maximum besides the global maximum, denoted α_max. Since
$$\int \log \pi(x)\, \pi(dx) - E_{\bar\pi}(\alpha) = E_{\bar\pi}\Bigg[\log\Bigg(\pi(X)\,\pi(X') \Bigg/ \pi(X)\Bigg\{\sum_{d=1}^D \alpha_d q_d(X, X')\Bigg\}\Bigg)\Bigg],$$
α_max is the optimal vector of weights for a mixture of transition kernels such that the product of π by this mixture is the closest to the product distribution π̄.

4.2. A maximization algorithm.
We now study an iterative procedure, akin to the EM algorithm, that updates the weights so that the function E_π̄(α) increases at each step. Defining the function F on S as
$$F(\alpha) = \Bigg(E_{\bar\pi}\Bigg[\alpha_d q_d(X, X') \Bigg/ \sum_{j=1}^D \alpha_j q_j(X, X')\Bigg]\Bigg)_{1 \le d \le D},$$
we construct the sequence (α^t)_{t≥0} on S such that α^0 = (1/D, …, 1/D) and α^{t+1} = F(α^t) for t ≥ 0. Note that under assumption (A1), for all t ≥ 0 and 1 ≤ d ≤ D,
$$E_{\bar\pi}\Bigg[q_d(X, X') \Bigg/ \sum_{j=1}^D \alpha_j^t q_j(X, X')\Bigg] > 0$$
and thus, for all t ≥ 1 and 1 ≤ d ≤ D, we have α_d^t > 0. If we define the extremal set
$$I_D = \Bigg\{\alpha \in S;\ \forall d \in \{1, \ldots, D\},\ \text{either } \alpha_d = 0 \text{ or } E_{\bar\pi}\Bigg(\frac{q_d(X, X')}{\sum_{j=1}^D \alpha_j q_j(X, X')}\Bigg) = 1\Bigg\},$$
we then have the following fixed-point result:

Proposition 4.1.
Under (A1) and (A2):
(i) E_π̄ ∘ F − E_π̄ is continuous;
(ii) for all α ∈ S, E_π̄ ∘ F(α) ≥ E_π̄(α);
(iii) I_D = {α ∈ S; F(α) = α} = {α ∈ S; E_π̄ ∘ F(α) = E_π̄(α)} and I_D is finite.

Proof. E_π̄ is clearly continuous. Moreover, by Lebesgue's dominated convergence theorem, the function α ↦ E_π̄(α_d q_d(X, X')/Σ_{j=1}^D α_j q_j(X, X')) is also continuous, which implies that F is continuous. This completes the proof of (i). Due to the concavity of the log function,
$$E_{\bar\pi}(F(\alpha)) - E_{\bar\pi}(\alpha) = E_{\bar\pi}\Bigg[\log\Bigg(\sum_{d=1}^D \frac{\alpha_d q_d(X,X')}{\sum_{j=1}^D \alpha_j q_j(X,X')}\, E_{\bar\pi}\Bigg(\frac{q_d(X,X')}{\sum_{j=1}^D \alpha_j q_j(X,X')}\Bigg)\Bigg)\Bigg]$$
$$\ge E_{\bar\pi}\Bigg[\sum_{d=1}^D \frac{\alpha_d q_d(X,X')}{\sum_{j=1}^D \alpha_j q_j(X,X')}\, \log E_{\bar\pi}\Bigg(\frac{q_d(X,X')}{\sum_{j=1}^D \alpha_j q_j(X,X')}\Bigg)\Bigg]$$
$$= \sum_{d=1}^D \alpha_d\, E_{\bar\pi}\Bigg(\frac{q_d(X,X')}{\sum_{j=1}^D \alpha_j q_j(X,X')}\Bigg)\, \log E_{\bar\pi}\Bigg(\frac{q_d(X,X')}{\sum_{j=1}^D \alpha_j q_j(X,X')}\Bigg).$$
Applying the inequality u log u ≥ u − 1 and noting that Σ_{d=1}^D α_d E_π̄(q_d(X, X')/Σ_{j=1}^D α_j q_j(X, X')) = 1, the last sum is nonnegative, with equality if and only if each term with α_d ≠ 0 satisfies u log u = u − 1, that is, u = 1. Therefore, equality above is equivalent to
$$\forall d \text{ such that } \alpha_d \ne 0\qquad E_{\bar\pi}\Bigg(q_d(X, X') \Bigg/ \sum_{j=1}^D \alpha_j q_j(X, X')\Bigg) = 1.$$
Thus, I_D = {α ∈ S; E_π̄ ∘ F(α) = E_π̄(α)}. The second equality, I_D = {α ∈ S; F(α) = α}, is straightforward.

We now prove by recursion on D that I_D is finite, which is equivalent to proving that
$$\{\alpha \in I_D;\ \alpha_d \ne 0\ \forall d \in \{1, \ldots, D\}\}$$
is empty or finite. If this set is nonempty, then any element α in this set satisfies
$$\forall d \in \{1, \ldots, D\}\qquad E_{\bar\pi}\Bigg(\frac{q_d(X, X')}{\sum_{j=1}^D \alpha_j q_j(X, X')}\Bigg) = 1,$$
which implies that
$$0 = \sum_{d=1}^D \alpha_d^{\max}\Bigg(E_{\bar\pi}\Bigg(\frac{q_d(X, X')}{\sum_{j=1}^D \alpha_j q_j(X, X')}\Bigg) - 1\Bigg) = E_{\bar\pi}\Bigg(\frac{\sum_{d=1}^D \alpha_d^{\max} q_d(X, X')}{\sum_{j=1}^D \alpha_j q_j(X, X')} - 1\Bigg) \ge E_{\bar\pi}\Bigg(\log \frac{\sum_{d=1}^D \alpha_d^{\max} q_d(X, X')}{\sum_{j=1}^D \alpha_j q_j(X, X')}\Bigg) \ge 0.$$
Since the global maximum of E_π̄ is unique, we conclude that α = α_max and hence (iii) follows. □

4.3. Averaged EM.
Proposition 4.1 implies that our recursive procedure satisfies E_π̄(α^{t+1}) ≥ E_π̄(α^t). Therefore, the Kullback divergence criterion (1) decreases at each step. This property is closely linked to the EM algorithm ([20], Section 5.3). More precisely, consider the mixture model
$$V \sim \mathcal{M}(1, \{\alpha_1, \ldots, \alpha_D\})\quad\text{and}\quad W = (X, X') \mid V \sim \pi(dx)\, Q_V(x, dx'),$$
with parameter α. We denote by Ē_α the corresponding expectation, by p_α(v, w) the joint density of (V, W) with respect to µ ⊗ µ and by p_α(w) the density of W with respect to µ ⊗ µ. It is then easy to check that E_π̄(α) = ∫ log(p_α(w)) π̄(dw), which is an averaged version of the criterion to be maximized in the EM algorithm when only W is observed. In this case, a natural idea, adapted from the EM algorithm, is to update α according to the iterative scheme
$$\alpha^{t+1} = \arg\max_{\alpha \in S} \int \bar E_{\alpha^t}[\log p_\alpha(V, w) \mid w]\, \bar\pi(dw).$$
Straightforward algebra can be used to show that this definition of α^{t+1} is equivalent to the update formula α^{t+1} = F(α^t) that we used above. Our algorithm then appears as an averaged EM, but shares with EM the property that the criterion increases deterministically at each step.

The following result ensures that any α different from α_max is repulsive.

Proposition 4.2.
Under (A1) and (A2), for every α ≠ α_max ∈ S, there exists a neighborhood V_α of α such that if α^{t_0} ∈ V_α, then (α^t)_{t≥t_0} leaves V_α within a finite time.
Let α = α max . Then using the inequality u − ≥ log u , we have D X d =1 α max d E ¯ π (cid:18) q d ( X, X ′ ) P Dj =1 α j q j ( X, X ′ ) (cid:19) − ≥ E ¯ π (cid:18) log P Dd =1 α max d q d ( X, X ′ ) P Dj =1 α j q j ( X, X ′ ) (cid:19) > , which implies that there exists 1 ≤ d ≤ D such that E ¯ π q d ( X, X ′ ) (cid:30) D X j =1 α j q j ( X, X ′ ) ! > . Using a nonincreasing sequence ( W n ) n ≥ of neighborhoods of α in S , themonotone convergence theorem implies that1 < E ¯ π (cid:18) q d ( X, X ′ ) P Dj =1 α j q j ( X, X ′ ) (cid:19) = E ¯ π (cid:18) lim n →∞ inf β ∈ W n q d ( X, X ′ ) P Dj =1 β j q j ( X, X ′ ) (cid:19) ≤ lim n →∞ inf β ∈ W n E ¯ π (cid:18) q d ( X, X ′ ) P Dj =1 β j q j ( X, X ′ ) (cid:19) . Thus, there exist W n = V α , a neighborhood of α and η > ∀ β ∈ V α E ¯ π q d ( X, X ′ ) (cid:30) D X j =1 β j q j ( X, X ′ ) ! > η. (2)Now, for all t ≥ ≤ d ≤ D , we have 1 ≥ α td >
0. Combining (2) withthe update formulas for α t = F ( α t − ) shows that ( α t ) t ≥ leaves V α within afinite time. (cid:3) We thus conclude that the maximization algorithm is convergent, as as-serted by the following proposition:
Proposition 4.3.
Under (A1) and (A2), lim_{t→∞} α^t = α_max.
First, recall that I D is a finite set and that α max ∈ I D , that is, I D = { β , β , . . . , β I } with β = α max . We introduce a sequence ( W i ) ≤ i ≤ I of disjoint neighborhoods of the β i ’s such that for all 0 ≤ i ≤ I , F ( W i ) isdisjoint from S j = i W j [this is possible since F ( β i ) = β i and F is continuous]and for all i ∈ { , . . . , I } , W i ⊂ V β i , where the ( V β i )’s are defined as in theproof of Proposition 4.2.The sequence ( E ¯ π ( α t )) t ≥ is upper-bounded and nondecreasing; thereforeit converges. This implies that lim t →∞ E ¯ π ◦ F ( α t ) − E ¯ π ( α t ) = 0. By continuity R. DOUC, A. GUILLIN, J.-M. MARIN AND C. P. ROBERT of E ¯ π ◦ F − E ¯ π , there exists T > t ≥ T , α t ∈ S j W j . Since F ( W i ) is disjoint from S j = i W j , this implies that there exists i ∈ { , . . . , I } such that for all t ≥ T , α t ∈ W i . By Proposition 4.2, i cannot be in { , . . . , I } .Thus, for all t ≥ T , α t ∈ W , which is a neighborhood of β = α max . (cid:3)
5. The Rao–Blackwellized D-kernel PMC. Update of the weights α_d through the transform F thus improves the Kullback divergence criterion. We now discuss how to implement this mechanism within a PMC algorithm that resembles the previous D-kernel algorithm. The only difference with the algorithm of Section 3.1 is that we make use of the kernel structure in the computation of the importance weight. In MCMC terminology, this is called “Rao–Blackwellization” ([20], Section 4.2) and it is known to provide variance reduction in data augmentation settings ([20], Section 9.2). In the current context, the improvement brought about by Rao–Blackwellization is dramatic, in that the modified algorithm does converge to the proposal mixture that is closest to the target distribution in the sense of the Kullback divergence. More precisely, a Monte Carlo version of the update via F can be implemented in the iterative definition of the mixture weights, in the same way that MCEM approximates EM ([20], Section 5.3.3).

5.1. The algorithm.
In importance sampling, as well as in MCMC settings, the (de)conditioning improvement brought about by Rao–Blackwellization may be significant [5]. In the case of the $D$-kernel PMC scheme, the Rao–Blackwellization argument is that it is not necessary to condition on the value of the mixture component in the computation of the importance weight and that the improvement is brought about by using the whole mixture. The importance weight should therefore be
$$\pi(x_{i,t}) \Big/ \sum_{d=1}^{D} \alpha_d^{t,N} q_d(\tilde x_{i,t-1}, x_{i,t})$$
rather than $\pi(x_{i,t})/q_{K_{i,t}}(\tilde x_{i,t-1}, x_{i,t})$, as in the algorithm of Section 3.1. As already noted by Hesterberg [15], the use of the whole mixture in the importance weight is a robust tool that prevents infinite variance importance sampling estimators. In the next section, we show that this choice of weight guarantees that the following modification of the $D$-kernel algorithm converges to the optimal mixture (in terms of Kullback divergence).

Rao–Blackwellized $D$-kernel PMC algorithm. At time 0, use the same steps as in the $D$-kernel PMC algorithm to obtain $(\tilde x_{i,0})_{1\le i\le N}$ and set $\alpha_d^{0,N} = 1/D$ ($1\le d\le D$). At time $t = 1, \dots, T$,

(a) generate $(K_{i,t})_{1\le i\le N} \sim \mathcal M(N, (\alpha_d^{t,N})_{1\le d\le D})$;
(b) generate independent $x_{i,t} \sim q_{K_{i,t}}(\tilde x_{i,t-1}, x)$ ($1\le i\le N$) and compute the normalized importance weights $\bar\omega_{i,t} \propto \pi(x_{i,t})/\sum_{d=1}^{D} \alpha_d^{t,N} q_d(\tilde x_{i,t-1}, x_{i,t})$;
(c) resample $(x_{i,t})$ into $(\tilde x_{i,t})_{1\le i\le N}$ by multinomial sampling with weights $(\bar\omega_{i,t})_{1\le i\le N}$;
(d) set $\alpha_d^{t+1,N} = \sum_{i=1}^{N} \bar\omega_{i,t}\, \mathbb I_d(K_{i,t})$ ($1\le d\le D$).

5.2. The corresponding law of large numbers.
Not surprisingly, the sample obtained at each iteration of the above Rao–Blackwellized algorithm approximates the target distribution in the sense of the weak law of large numbers (LLN). Note that the convergence holds under the very weak assumption (A1) and for any test function $h$ that is absolutely integrable with respect to the target distribution $\pi$. The function $h$ may thus be unbounded.

Theorem 5.1.
Under (A1), for any function $h \in L^1_\pi$ and for all $t \ge 0$,
$$\sum_{i=1}^{N} \bar\omega_{i,t}\, h(x_{i,t}) \xrightarrow[N\to\infty]{\;P\;} \pi(h) \qquad\text{and}\qquad \frac{1}{N}\sum_{i=1}^{N} h(\tilde x_{i,t}) \xrightarrow[N\to\infty]{\;P\;} \pi(h).$$

Proof.
First, convergence of the second average follows from the convergence of the first average by Theorem A.1, since the latter is simply a multinomial sampling perturbation of the former. We thus focus on the weighted average and proceed by induction on $t$. The case $t = 0$ is the basic importance sampling convergence result. For $t \ge 1$, if the convergence holds for $t - 1$, then to prove convergence at iteration $t$, we need only check, as in Proposition 3.1, that
$$\frac{1}{N}\sum_{i=1}^{N} \omega_{i,t}\, h(x_{i,t}) \xrightarrow[N\to\infty]{\;P\;} \pi(h),$$
where $\omega_{i,t}$ denotes the importance weight $\pi(x_{i,t})/\sum_{d=1}^{D}\alpha_d^{t,N} q_d(\tilde x_{i,t-1}, x_{i,t})$. (The special case $h \equiv 1$ provides the convergence of the normalizing constant.) We apply Theorem A.1 with $\mathcal G_N = \sigma((\tilde x_{i,t-1})_{1\le i\le N}, (\alpha_d^{t,N})_{1\le d\le D})$ and $U_{N,i} = N^{-1}\omega_{i,t} h(x_{i,t})$. Then, conditionally on $\mathcal G_N$, the $x_{i,t}$'s ($1\le i\le N$) are independent and
$$x_{i,t} \mid \mathcal G_N \sim \sum_{d=1}^{D} \alpha_d^{t,N} Q_d(\tilde x_{i,t-1}, \cdot).$$
Noting that
$$\sum_{i=1}^{N} \mathbb E\!\left(\frac{\omega_{i,t} h(x_{i,t})}{N} \,\Big|\, \mathcal G_N\right) = \sum_{i=1}^{N} \mathbb E\!\left(\frac{\pi(x_{i,t}) h(x_{i,t})}{N \sum_{d=1}^{D} \alpha_d^{t,N} q_d(\tilde x_{i,t-1}, x_{i,t})} \,\Big|\, \mathcal G_N\right) = \pi(h),$$
we only need to check condition (iii). The end of the proof is then quite similar to the proof of Proposition 3.1. □
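The Rao–Blackwellized algorithm of Section 5.1 is short enough to sketch in code. The paper's own experiments are written in R; the following Python sketch is only an illustration under assumed settings: a one-dimensional $\mathcal N(0,1)$ target, $D = 3$ Gaussian random-walk kernels with assumed scales, and a crude Gaussian draw standing in for the time-0 importance sampling step.

```python
import math
import random

random.seed(1)

def norm_pdf(x, mu, sd):
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def target(x):                      # assumed target: standard normal density
    return norm_pdf(x, 0.0, 1.0)

scales = [0.1, 1.0, 10.0]           # assumed random-walk scales for the D = 3 kernels
D, N, T = len(scales), 5000, 10

alpha = [1.0 / D] * D               # alpha_d^{0,N} = 1/D
x_tilde = [random.gauss(0.0, 3.0) for _ in range(N)]   # stand-in for the time-0 step

for t in range(T):
    # (a) kernel indices drawn from the current mixture weights
    K = random.choices(range(D), weights=alpha, k=N)
    # (b) propagate and compute Rao-Blackwellized weights:
    #     the WHOLE mixture appears in the denominator, not just the kernel used
    x = [random.gauss(x_tilde[i], scales[K[i]]) for i in range(N)]
    w = [target(x[i]) / sum(a * norm_pdf(x[i], x_tilde[i], s)
                            for a, s in zip(alpha, scales))
         for i in range(N)]
    tot = sum(w)
    wbar = [wi / tot for wi in w]
    # (c) multinomial resampling
    x_tilde = random.choices(x, weights=wbar, k=N)
    # (d) weight update: alpha_d <- sum of normalized weights attached to kernel d
    alpha = [sum(wbar[i] for i in range(N) if K[i] == d) for d in range(D)]

# LLN (Theorem 5.1): weighted averages approximate pi(h)
est_mean = sum(wb * xi for wb, xi in zip(wbar, x))         # approximates E[X] = 0
est_second = sum(wb * xi * xi for wb, xi in zip(wbar, x))  # approximates E[X^2] = 1
```

In this sketch the intermediate scale quickly dominates the mixture, while the widest kernel retains a small weight, consistent with its stabilizing role discussed in Example 3 below.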
5.3. Convergence of the weights.
The next proposition ensures that at each iteration, the update of the mixture weights in the Rao–Blackwellized algorithm approximates the theoretical update obtained in Section 4.2 for minimizing the Kullback divergence criterion.
Proposition 5.1.
Under (A1), for all $t \ge 1$ and all $1 \le d \le D$,
$$\alpha_d^{t,N} \xrightarrow[N\to\infty]{\;P\;} \alpha_d^{t}, \tag{3}$$
where $\alpha^{t} = F(\alpha^{t-1})$.

Combining Proposition 5.1 with Proposition 4.3, we obtain that, under assumptions (A1)–(A2), the Rao–Blackwellized version of the PMC algorithm adapts the weights of the proposed mixture of kernels, in the sense that it converges to the optimal combination of mixtures with respect to the Kullback divergence criterion obtained in Section 4.2.
Proof of Proposition 5.1.
The case $t = 1$ is obvious. Now, assume that (3) holds for some $t \ge 1$. As in the proof of Proposition 3.1, we now establish that
$$\frac{1}{N}\sum_{i=1}^{N} \omega_{i,t}\, \mathbb I_d(K_{i,t}) = \frac{1}{N}\sum_{i=1}^{N} \frac{\pi(x_{i,t})}{\sum_{l=1}^{D} \alpha_l^{t,N} q_l(\tilde x_{i,t-1}, x_{i,t})}\, \mathbb I_d(K_{i,t}) \xrightarrow[N\to\infty]{\;P\;} \alpha_d^{t+1},$$
the convergence of the renormalizing sum to 1 being a consequence of this convergence. We apply Theorem A.1 with $\mathcal G_N = \sigma((\tilde x_{i,t-1})_{1\le i\le N}, (\alpha_d^{t,N})_{1\le d\le D})$ and $U_{N,i} = N^{-1}\omega_{i,t}\,\mathbb I_d(K_{i,t})$. Conditionally on $\mathcal G_N$, the $(K_{i,t}, x_{i,t})$'s ($1\le i\le N$) are independent and for all $(d, A)$ in $\{1,\dots,D\}\times\mathcal A$, we have
$$\mathbb P(K_{i,t} = d,\; x_{i,t}\in A \mid \mathcal G_N) = \alpha_d^{t,N}\, Q_d(\tilde x_{i,t-1}, A).$$
To apply Theorem A.1, we need only check condition (iii). We have, for $C > 0$,
$$\mathbb E\!\left(\sum_{i=1}^{N} \frac{\omega_{i,t}\,\mathbb I_d(K_{i,t})}{N}\, \mathbb I\{\omega_{i,t}\,\mathbb I_d(K_{i,t}) > C\} \,\Big|\, \mathcal G_N\right) \le \sum_{j=1}^{D} \frac{1}{N}\sum_{i=1}^{N} \int \frac{\alpha_d^{t,N} q_d(\tilde x_{i,t-1}, x)}{\sum_{l=1}^{D}\alpha_l^{t,N} q_l(\tilde x_{i,t-1}, x)}\, \mathbb I\!\left\{\frac{\pi(x)}{D^{-1} q_j(\tilde x_{i,t-1}, x)} > C\right\} \pi(dx)$$
$$\le \sum_{j=1}^{D} \frac{1}{N}\sum_{i=1}^{N} \int \mathbb I\!\left\{\frac{\pi(x)}{D^{-1} q_j(\tilde x_{i,t-1}, x)} > C\right\} \pi(dx) \xrightarrow[N\to\infty]{\;P\;} \sum_{j=1}^{D} \bar\pi\!\left(\frac{\pi(x)}{D^{-1} q_j(x', x)} > C\right),$$
by the LLN of Theorem 5.1. The right-hand side converges to 0 as $C$ tends to infinity since, by assumption (A1), $\bar\pi(\{q_j(x, x') = 0\}) = 0$. Thus, Theorem A.1 applies and
$$\frac{1}{N}\sum_{i=1}^{N} \omega_{i,t}\,\mathbb I_d(K_{i,t}) - \mathbb E\!\left(\frac{1}{N}\sum_{i=1}^{N} \omega_{i,t}\,\mathbb I_d(K_{i,t}) \,\Big|\, \mathcal G_N\right) \xrightarrow[N\to\infty]{\;P\;} 0.$$
To complete the proof, it simply remains to show that
$$\mathbb E\!\left(\frac{1}{N}\sum_{i=1}^{N} \omega_{i,t}\,\mathbb I_d(K_{i,t}) \,\Big|\, \mathcal G_N\right) = \frac{1}{N}\sum_{i=1}^{N} \int \frac{\alpha_d^{t,N} q_d(\tilde x_{i,t-1}, x)}{\sum_{l=1}^{D}\alpha_l^{t,N} q_l(\tilde x_{i,t-1}, x)}\, \pi(dx) \xrightarrow[N\to\infty]{\;P\;} \alpha_d^{t+1}. \tag{4}$$
It follows from the LLN stated in Theorem 5.1 that
$$\frac{1}{N}\sum_{i=1}^{N} \int \pi(dx)\, \frac{\alpha_d^{t} q_d(\tilde x_{i,t-1}, x)}{\sum_{l=1}^{D}\alpha_l^{t} q_l(\tilde x_{i,t-1}, x)} \xrightarrow[N\to\infty]{\;P\;} \mathbb E^{\bar\pi}\!\left(\frac{\alpha_d^{t} q_d(X, X')}{\sum_{l=1}^{D}\alpha_l^{t} q_l(X, X')}\right) = \alpha_d^{t+1} \tag{5}$$
and it thus suffices to check that the difference between (4) and (5) converges to 0 in probability. To show this, first note that for all $t \ge 1$ and all $1 \le d \le D$, $\alpha_d^{t} > 0$ and, by the induction hypothesis, $(\alpha_d^{t,N} - \alpha_d^{t})/\alpha_d^{t} \xrightarrow[N\to\infty]{P} 0$. By straightforward algebra,
$$\left|\frac{\alpha_d^{t,N} q_d(\tilde x_{i,t-1}, x)}{\sum_{l=1}^{D}\alpha_l^{t,N} q_l(\tilde x_{i,t-1}, x)} - \frac{\alpha_d^{t} q_d(\tilde x_{i,t-1}, x)}{\sum_{l=1}^{D}\alpha_l^{t} q_l(\tilde x_{i,t-1}, x)}\right| \le \frac{\alpha_d^{t,N} q_d(\tilde x_{i,t-1}, x)}{\sum_{j=1}^{D}\alpha_j^{t,N} q_j(\tilde x_{i,t-1}, x)} \left(\sup_{l\in\{1,\dots,D\}} \left|\frac{\alpha_l^{t,N} - \alpha_l^{t}}{\alpha_l^{t}}\right|\right) + \left|\frac{\alpha_d^{t,N} - \alpha_d^{t}}{\alpha_d^{t}}\right| \frac{\alpha_d^{t} q_d(\tilde x_{i,t-1}, x)}{\sum_{l=1}^{D}\alpha_l^{t} q_l(\tilde x_{i,t-1}, x)} \le 2 \sup_{l\in\{1,\dots,D\}} \left|\frac{\alpha_l^{t,N} - \alpha_l^{t}}{\alpha_l^{t}}\right|.$$
The proposition then follows from the convergence $(\alpha_d^{t,N} - \alpha_d^{t})/\alpha_d^{t} \xrightarrow[N\to\infty]{P} 0$. □
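The convergence of the empirical weights toward the deterministic recursion $\alpha^t = F(\alpha^{t-1})$ is easy to observe numerically in the special case where the $D$ kernels are independent proposals equal to the components of a mixture target (the situation of Example 1 in Section 6): the fixed point of $F$ is then the target's own mixture weights. The following Python sketch is an assumed one-dimensional analogue, not the paper's own (R) experiment.

```python
import math
import random

random.seed(2)

def npdf(x, mu):                # unit-variance normal density
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)

mus = [-2.0, 2.0]               # assumed target: 0.5 N(-2,1) + 0.5 N(2,1)
w_true = [0.5, 0.5]

def target(x):
    return sum(w * npdf(x, m) for w, m in zip(w_true, mus))

N, T = 20000, 15
alpha = [0.9, 0.1]              # deliberately unbalanced starting weights

for t in range(T):
    # independent proposals: each kernel ignores the previous particle position
    K = random.choices(range(2), weights=alpha, k=N)
    x = [random.gauss(mus[k], 1.0) for k in K]
    # Rao-Blackwellized importance weights: whole mixture in the denominator
    w = [target(xi) / (alpha[0] * npdf(xi, mus[0]) + alpha[1] * npdf(xi, mus[1]))
         for xi in x]
    tot = sum(w)
    # step (d) of the algorithm: alpha_d <- sum of normalized weights with K_i = d
    alpha = [sum(w[i] for i in range(N) if K[i] == d) / tot for d in range(2)]

# alpha should now be close to the fixed point of F, here the target's own weights
```

Because the update is a Monte Carlo version of an EM-type step, the weights move toward $(0.5, 0.5)$ within a few iterations in this well-separated setting.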
5.4. A corresponding central limit theorem.
We now establish a CLT for the weighted and the unweighted samples when the sample size goes to infinity. As noted in Section 2.2 for the SIR algorithm, the asymptotic variance associated with the unweighted sample is larger than the variance of the weighted sample because of the additional multinomial step.
Theorem 5.2.
Under (A1),
(i) for a function $h$ such that $\bar\pi\{h^2(x')\pi(x')/q_d(x, x')\} < \infty$ for at least one $1 \le d \le D$, we have
$$\sqrt N \sum_{i=1}^{N} \bar\omega_{i,t}\{h(x_{i,t}) - \pi(h)\} \xrightarrow{\;\mathcal L\;} \mathcal N(0, \sigma_t^2), \tag{6}$$
where $\sigma_t^2 = \bar\pi(\{h(x') - \pi(h)\}^2\, \pi(x') / \sum_{d=1}^{D}\alpha_d^{t} q_d(x, x'))$;
(ii) if, moreover, $\pi(h^2) < \infty$, then
$$\frac{1}{\sqrt N}\sum_{i=1}^{N} \{h(\tilde x_{i,t}) - \pi(h)\} \xrightarrow{\;\mathcal L\;} \mathcal N(0, \sigma_t^2 + V_\pi(h)). \tag{7}$$

Note that amongst the conditions under which this theorem applies, the integrability condition
$$\bar\pi\!\left(\frac{h^2(x')\,\pi(x')}{q_d(x, x')}\right) < \infty \tag{8}$$
is required for some $d$ in $\{1,\dots,D\}$ and not for all $d$'s. Thus, settings where some transition kernels $q_d(\cdot,\cdot)$ do not satisfy (8) can still be covered by this theorem provided (8) holds for at least one particular kernel. An equivalent expression for the asymptotic variance $\sigma_t^2$ is
$$\sigma_t^2 = V_\nu\!\left(\{h - \pi(h)\}\,\frac{d\bar\pi}{d\nu}\right) \qquad\text{where } \nu(dx, dx') = \pi(dx)\left(\sum_{d=1}^{D}\alpha_d^{t} Q_d(x, dx')\right).$$
Written in this form, $\sigma_t^2$ turns out to be the expression of the asymptotic variance that appears in the CLT associated with a self-normalized IS algorithm (SNIS) (see Section 2.1 for a description of the algorithm), where the proposal distribution would be $\nu$ and the target distribution $\bar\pi$. Obviously, this SNIS algorithm cannot be implemented since, given the above definition of $\nu$, the proposal distribution depends on both $\pi$ and the weights $(\alpha_d^{t})$, which are unknown.

Proof of Theorem 5.2.
Without loss of generality, we assume that $\pi(h) = 0$. Let $d \in \{1,\dots,D\}$ be such that $\bar\pi(h^2(x')\pi(x')/q_d(x, x')) < \infty$. A consequence of the proof of Theorem 5.1 is that $N^{-1}\sum_{i=1}^{N} \omega_{i,t} \xrightarrow[N\to\infty]{P} 1$, so we only need to prove that
$$\frac{1}{\sqrt N}\sum_{i=1}^{N} \omega_{i,t}\, h(x_{i,t}) \xrightarrow{\;\mathcal L\;} \mathcal N(0, \sigma_t^2). \tag{9}$$
We will apply Theorem A.2 with
$$U_{N,i} = \frac{1}{\sqrt N}\,\omega_{i,t}\, h(x_{i,t}) = \frac{1}{\sqrt N}\,\frac{\pi(x_{i,t})\, h(x_{i,t})}{\sum_{d=1}^{D}\alpha_d^{t,N} q_d(\tilde x_{i,t-1}, x_{i,t})}, \qquad \mathcal G_N = \sigma\{(\tilde x_{i,t-1})_{1\le i\le N}, (\alpha_d^{t,N})_{1\le d\le D}\}.$$
Conditionally on $\mathcal G_N$, the $(x_{i,t})$'s ($1\le i\le N$) are independent and
$$x_{i,t} \mid \mathcal G_N \sim \sum_{d=1}^{D} \alpha_d^{t,N} Q_d(\tilde x_{i,t-1}, \cdot).$$
Conditions (i) and (ii) of Theorem A.2 are clearly satisfied. To check condition (iii), first note that $\mathbb E(U_{N,i}\mid\mathcal G_N) = \pi(h)/\sqrt N = 0$. Moreover,
$$A_N = \sum_{i=1}^{N}\mathbb E(U_{N,i}^2\mid\mathcal G_N) = \frac{1}{N}\sum_{i=1}^{N}\int \frac{h^2(x)\,\pi(x)}{\sum_{d=1}^{D}\alpha_d^{t,N} q_d(\tilde x_{i,t-1}, x)}\,\pi(dx).$$
By the LLN for $(\tilde x_{i,t})$ stated in Theorem 5.1, we have
$$B_N = \frac{1}{N}\sum_{i=1}^{N}\int \frac{h^2(x)\,\pi(x)}{\sum_{d=1}^{D}\alpha_d^{t} q_d(\tilde x_{i,t-1}, x)}\,\pi(dx) \xrightarrow[N\to\infty]{\;P\;} \sigma_t^2.$$
To prove that (iii) holds, it is thus sufficient to show that $|B_N - A_N| \xrightarrow[N\to\infty]{P} 0$. Since $\alpha_d^{t,N} \xrightarrow[N\to\infty]{P} \alpha_d^{t} > 0$, we need only consider the upper bound
$$\mathbb I\{\alpha_d^{t,N} > 2^{-1}\alpha_d^{t}\}\,|B_N - A_N| \le \mathbb I\{\alpha_d^{t,N} > 2^{-1}\alpha_d^{t}\}\sup_{1\le d\le D}\left|\frac{\alpha_d^{t} - \alpha_d^{t,N}}{\alpha_d^{t}}\right|\,\frac{1}{N}\sum_{i=1}^{N}\int \frac{h^2(x)\,\pi(x)}{\sum_{d=1}^{D}\alpha_d^{t,N} q_d(\tilde x_{i,t-1}, x)}\,\pi(dx)$$
$$\le \left\{\sup_{1\le d\le D}\left|\frac{\alpha_d^{t} - \alpha_d^{t,N}}{\alpha_d^{t}}\right|\right\}\frac{1}{N}\sum_{i=1}^{N}\int \frac{h^2(x)\,\pi(x)}{2^{-1}\alpha_d^{t} q_d(\tilde x_{i,t-1}, x)}\,\pi(dx) \xrightarrow[N\to\infty]{\;P\;} 0.$$
Thus, condition (iii) is satisfied. Finally, we consider condition (iv). Using the same argument as was used for condition (iii), we have that
$$\mathbb I\{\alpha_d^{t,N} > 2^{-1}\alpha_d^{t}\}\sum_{i=1}^{N}\mathbb E\!\left[\frac{1}{N}\,\omega_{i,t}^2\, h^2(x_{i,t})\, \mathbb I\!\left\{\frac{\pi(x_{i,t})\, h^2(x_{i,t})}{\sum_{d=1}^{D}\alpha_d^{t,N} q_d(\tilde x_{i,t-1}, x_{i,t})} > C\right\} \Big|\, \mathcal G_N\right]$$
$$\le \frac{1}{N}\sum_{i=1}^{N}\int \frac{h^2(x)\,\pi(x)}{2^{-1}\alpha_d^{t} q_d(\tilde x_{i,t-1}, x)}\,\mathbb I\!\left\{\frac{\pi(x)\, h^2(x)}{2^{-1}\alpha_d^{t} q_d(\tilde x_{i,t-1}, x)} > C\right\}\pi(dx) \xrightarrow[N\to\infty]{\;P\;} \bar\pi\!\left(\frac{h^2(x)\,\pi(x)}{2^{-1}\alpha_d^{t} q_d(x', x)}\,\mathbb I\!\left\{\frac{\pi(x)\, h^2(x)}{2^{-1}\alpha_d^{t} q_d(x', x)} > C\right\}\right),$$
which converges to 0 as $C$ tends to infinity. Thus, Theorem A.2 can be applied and the proof of (6) is complete. The proof of (7) follows from a direct application of Theorem A.2, as in the SIR result, by setting $U_{N,i} = N^{-1/2} h(\tilde x_{i,t})$ and $\mathcal G_N = \sigma\{(x_{i,t})_{1\le i\le N}, (\bar\omega_{i,t})_{1\le i\le N}\}$. □
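The identification of $\sigma_t^2$ with the asymptotic variance of an SNIS scheme with proposal $\nu$ can be checked empirically once the mixture weights are held fixed and the kernels are taken as independent proposals. The Python sketch below (all parameters assumed; the paper's own experiments are in R) compares the empirical variance of $\sqrt N$ times the SNIS error with a quadrature approximation of the variance integral.

```python
import math
import random

random.seed(3)

def npdf(x, mu, sd):
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def target(x):                              # assumed target: N(0,1)
    return npdf(x, 0.0, 1.0)

alpha = [0.5, 0.5]                          # fixed mixture weights (assumed)
sds = [1.5, 3.0]                            # two independent Gaussian proposals (assumed)

def nu(x):                                  # the SNIS proposal density
    return sum(a * npdf(x, 0.0, s) for a, s in zip(alpha, sds))

def h(x):                                   # test function with pi(h) = 0
    return x

# asymptotic variance: integral of {h(x) - pi(h)}^2 * pi(x)^2 / nu(x) dx (trapezoid rule)
M_GRID = 4000
grid = [-10.0 + 20.0 * i / M_GRID for i in range(M_GRID + 1)]
vals = [h(x) ** 2 * target(x) ** 2 / nu(x) for x in grid]
sigma2 = sum((vals[i] + vals[i + 1]) * (grid[i + 1] - grid[i]) / 2.0
             for i in range(M_GRID))

# empirical variance of sqrt(N) * (SNIS estimate - pi(h)) over M replications
N, M = 2000, 400
errs = []
for _ in range(M):
    comp = random.choices(range(2), weights=alpha, k=N)
    xs = [random.gauss(0.0, sds[c]) for c in comp]
    ws = [target(xi) / nu(xi) for xi in xs]
    est = sum(w * h(xi) for w, xi in zip(ws, xs)) / sum(ws)
    errs.append(math.sqrt(N) * est)
emp_var = sum(e * e for e in errs) / M      # should be close to sigma2
```

With bounded weights (the proposal mixture has heavier tails than the target), the agreement between `emp_var` and `sigma2` is close already at this modest sample size.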
6. Illustrations.
In this section, we briefly show how the iterations of the PMC algorithm quickly implement adaptivity toward the most efficient mixture of kernels, through three examples of moderate difficulty. (The R programs are available from the last author's website via an Snw file.)
Example 1.
As a first toy example, we take the target $\pi$ to be the density of a five-dimensional normal mixture,
$$\frac{1}{3}\sum_{i=1}^{3}\mathcal N(0, \Sigma_i), \tag{10}$$
and an independent normal mixture proposal with the same means and variances as (10), but started with different weights $\alpha_d^{0,N}$. Note that this is a very special case of a $D$-kernel PMC scheme in that the transition kernels of Section 5.1 are then independent proposals. In this case, the optimal choice of weights is obviously $\alpha_d^\star = 1/3$. In our experiment, we used three Wishart-simulated variances $\Sigma_i$, with 10, 15 and 7 degrees of freedom, respectively. The starting values $\alpha_d^{0,N}$ are indicated on the left of Figure 1, which clearly shows the convergence to the optimal values $1/3$ and $2/3$.

Example 2.
As a second toy example, consider the case of a three-dimensional normal $\mathcal N(0, \Sigma)$ target with covariance matrix
$$\Sigma = \begin{pmatrix} 0.986 & 0.154 & 3.523 \\ 0.154 & 15.433 & 3.528 \\ 3.523 & 3.528 & 18.\ldots \end{pmatrix}$$
(the off-diagonal entries follow by symmetry) and the mixture of three kernels given by
$$\alpha_1^{t,N}\, T(\tilde x_{i,t-1}, \Sigma_1) + \alpha_2^{t,N}\, \mathcal N(\tilde x_{i,t-1}, \Sigma_2) + \alpha_3^{t,N}\, \mathcal N(\tilde x_{i,t-1}, \Sigma_3), \tag{11}$$
where the $\Sigma_i$'s are random Wishart matrices. [The first proposal in the mixture is a product of unidimensional Student $t$-distributions with two degrees of freedom, centered at the current values of the components of $\tilde x_{i,t-1}$ and rotated by $\sqrt\tau\,\mathrm{diag}(\Sigma^{1/2})$.]

A compelling feature of this example is that we can visualize the Kullback divergence on the $\mathbb R^3$ simplex in the sense that the divergence
$$E(\pi, \tilde\pi) = \mathbb E^{\bar\pi}\!\left[\log\!\left(\pi(X')\Big/\Big\{\sum_{d=1}^{3}\alpha_d\, q_d(X, X')\Big\}\right)\right] = e(\alpha_1, \alpha_2, \alpha_3)$$

Fig. 1.
Convergence of the accumulated weights $\alpha_1^{t,N}$ and $\alpha_1^{t,N} + \alpha_2^{t,N}$ for the three-component normal mixture to the optimal values $1/3$ and $2/3$ (represented by dotted lines). At each iteration, $N = 1{,}\ldots$ points were simulated from the $D$-kernel proposal.
Fig. 2.
Grey level and contour representation of the Kullback divergence $e(\alpha_1, \alpha_2, \alpha_3)$ between the $\mathcal N(0, \Sigma)$ distribution and the three-component mixture proposal (11). (The darker pixels correspond to lower values of the divergence.) We also represent (in white) the path of one run of the $D$-kernel PMC algorithm when started from a random value $(\alpha_1, \alpha_2, \alpha_3)$. The number of iterations $T$ is equal to …, while the sample size $N$ is ….

can be approximated by a Monte Carlo experiment on a grid of values of $(\alpha_1, \alpha_2)$. Figure 2 shows the result of this Monte Carlo experiment based on 25,… $\mathcal N(0, \Sigma)$ simulations and exhibits a minimum divergence inside the $\mathbb R^3$ simplex. Running the Rao–Blackwellized $D$-kernel PMC algorithm from a random starting weight $(\alpha_1, \alpha_2, \alpha_3)$ always leads to a neighborhood of the minimum, even though a strict decrease in the divergence requires a large value for $N$ and a precise convergence to $\alpha^{\max}$ necessitates a large number of iterations $T$.

Example 3.
Our third example is a contingency table inspired by [1], given here as Table 1. We model this dataset by a Poisson regression,
$$x_{ij} \sim \mathcal P(\exp(\alpha_i + \beta_j)), \qquad i, j = 0, 1,$$
with $\alpha_0 = 0$ for identifiability reasons. We use a flat prior on the parameter $\theta = (\alpha_1, \beta_0, \beta_1)$ and run the PMC $D$-kernel algorithm with a mixture of ten normal random walk proposals, $\mathcal N(\tilde\theta_{i,t-1}, \varrho_d\, I(\hat\theta))$, $d = 1, \dots, 10$, where $I(\hat\theta)$ is the Fisher information matrix evaluated at the MLE, $\hat\theta = (-0.\ldots, 0.\ldots, 0.\ldots)$, and the $\varrho_d$ vary from $1.\ldots \times 10^{-19}$ to $1.\ldots$ (the $\varrho_d$'s are equidistributed on a logarithmic scale). The results of five (successive) iterations of the Rao–Blackwellized $D$-kernel algorithm are as follows: unsurprisingly, the largest variance kernels are hardly ever sampled, but fulfill their main role of variance stabilizers in the importance sampling weights, while the mixture concentrates on the medium variances, with a quick convergence of the mixture weights to the limiting weights; the accumulated weights of the 5th, 6th, 7th and 8th components of the mixture converge to 0, 0.…, 0.259 and 0.…, respectively.
7. Conclusion and perspectives.
This paper shows that it is possible to build an adaptive mixture of proposals aimed at minimizing the Kullback divergence from the distribution of interest. We can therefore set different goals for a simulation experiment and expect to arrive at the most accurate proposal. For instance, in a companion paper [9], we also derive an adaptive update of the weights targeted at the minimal variance proposal for a given integral $I$ of interest. Rather naturally, these results are achieved under strong restrictions on the family of proposals, in the sense that the parameterization of those families is restricted to the weights of the mixture. It is, however, possible to extend the above results to general mixtures of parameterized proposals, as shown by work currently under development.

Table 1
Two-by-two contingency table

Fig. 3.
Distribution of the resampled points after five iterations of the Rao–Blackwellized $D$-kernel PMC sampler for the contingency table example. Top: histograms of the components $\alpha_1$, $\beta_0$ and $\beta_1$; bottom: scatterplots of the points $(\alpha_1, \beta_0)$, $(\alpha_1, \beta_1)$ and $(\beta_0, \beta_1)$ on the profile slices of the log-likelihood.

Fig. 4.
Evolution of the samples over four iterations of the Rao–Blackwellized $D$-kernel PMC sampler for the contingency table example (the output from each iteration is a block of four graphs, to be read from left to right and from top to bottom): histograms of the resampled samples of $\alpha_1$, $\beta_0$ and $\beta_1$, and (lower right of each block) the log-likelihood and the empirical cumulative distribution function of the importance weights.
A more practical direction of research is the implementation of such adaptive algorithms in large-dimensional problems. While our algorithms are in fine importance sampling algorithms, it is conceivable that mixtures of Gibbs-like proposals can take better advantage of the intuition gained from MCMC methodology, while keeping the finite-horizon validation of importance sampling methods. The major difficulty in this direction, however, is that the curse of dimensionality still holds, in the sense that (a) we need to simultaneously consider more and more proposals as the dimension increases (as, e.g., the set of all full conditionals) and (b) the number of parameters to tune in the proposals increases exponentially with the dimension.

APPENDIX A: CONVERGENCE THEOREMS FOR TRIANGULAR ARRAYS

In this section, we recall convergence results for triangular arrays of random variables (see [4] or [10] for more details, including the proofs). We will use these results to study the asymptotic behavior of the PMC algorithm. In what follows, let $\{U_{N,i}\}_{N\ge 1,\, 1\le i\le N}$ be a triangular array of random variables defined on the same measurable space $(\Omega, \mathcal A)$ and let $\{\mathcal G_N\}_{N\ge 1}$ be a sequence of $\sigma$-algebras included in $\mathcal A$. The symbol $X_N \xrightarrow{P} a$ means that $X_N$ converges in probability to $a$ as $N$ goes to infinity.

Definition A.1.
The sequence $\{U_{N,i}\}_{N\ge 1,\, 1\le i\le N}$ is independent given $\{\mathcal G_N\}_{N\ge 1}$ if for all $N \ge 1$, the random variables $U_{N,1}, \dots, U_{N,N}$ are independent given $\mathcal G_N$.

Definition A.2.
The sequence of variables $\{Z_N\}_{N\ge 1}$ is bounded in probability if
$$\lim_{C\to\infty}\sup_{N\ge 1}\mathbb P[|Z_N| \ge C] = 0.$$

Theorem A.1. If
(i) $\{U_{N,i}\}_{N\ge 1,\, 1\le i\le N}$ is independent given $\{\mathcal G_N\}_{N\ge 1}$;
(ii) the sequence $\{\sum_{i=1}^{N}\mathbb E[|U_{N,i}| \mid \mathcal G_N]\}_{N\ge 1}$ is bounded in probability;
(iii) for all $\eta > 0$, $\sum_{i=1}^{N}\mathbb E[|U_{N,i}|\,\mathbb I_{|U_{N,i}| > \eta} \mid \mathcal G_N] \xrightarrow{P} 0$;
then $\sum_{i=1}^{N}(U_{N,i} - \mathbb E[U_{N,i} \mid \mathcal G_N]) \xrightarrow{P} 0$.

Theorem A.2. If
(i) $\{U_{N,i}\}_{N\ge 1,\, 1\le i\le N}$ is independent given $\{\mathcal G_N\}_{N\ge 1}$;
(ii) for all $N \ge 1$ and all $i \in \{1, \dots, N\}$, $\mathbb E[U_{N,i}^2 \mid \mathcal G_N] < \infty$;
(iii) there exists $\sigma^2 > 0$ such that $\sum_{i=1}^{N}(\mathbb E[U_{N,i}^2 \mid \mathcal G_N] - (\mathbb E[U_{N,i} \mid \mathcal G_N])^2) \xrightarrow{P} \sigma^2$;
(iv) for all $\eta > 0$, $\sum_{i=1}^{N}\mathbb E[U_{N,i}^2\,\mathbb I_{|U_{N,i}| > \eta} \mid \mathcal G_N] \xrightarrow{P} 0$;
then for all $u \in \mathbb R$,
$$\mathbb E\!\left[\exp\!\left(iu\sum_{i=1}^{N}(U_{N,i} - \mathbb E[U_{N,i} \mid \mathcal G_N])\right)\Big|\,\mathcal G_N\right] \xrightarrow{P} \exp\!\left(-\frac{u^2\sigma^2}{2}\right).$$

APPENDIX B: PROOF OF PROPOSITION 3.1

We proceed by induction with respect to $t$. Using Theorem A.1, the case $t = 0$ is straightforward as a direct consequence of the convergence of the importance sampling algorithm. For $t \ge 1$, let us assume that the LLN holds for $t - 1$. Then to prove that $\sum_{i=1}^{N}\bar\omega_{i,t}\, h(x_{i,t}, K_{i,t})$ converges in probability to $\pi\otimes\gamma_u(h)$, we need only check that
$$N^{-1}\sum_{i=1}^{N}\frac{\pi(x_{i,t})}{q_{K_{i,t}}(\tilde x_{i,t-1}, x_{i,t})}\,h(x_{i,t}, K_{i,t}) \xrightarrow[N\to\infty]{\;P\;} \pi\otimes\gamma_u(h),$$
the special case $h \equiv 1$ providing the renormalization. Applying Theorem A.1 with
$$U_{N,i} = N^{-1}\frac{\pi(x_{i,t})}{q_{K_{i,t}}(\tilde x_{i,t-1}, x_{i,t})}\,h(x_{i,t}, K_{i,t}) \quad\text{and}\quad \mathcal G_N = \sigma\{(\tilde x_{i,t-1})_{1\le i\le N}, (\alpha_d^{t,N})_{1\le d\le D}\},$$
where $\sigma\{(X_i)_i\}$ denotes the $\sigma$-algebra induced by the $X_i$'s, we need only check condition (iii). For any $C > 0$, we have
$$N^{-1}\sum_{i=1}^{N}\mathbb E\!\left[\frac{\pi(x_{i,t})}{q_{K_{i,t}}(\tilde x_{i,t-1}, x_{i,t})}\,h(x_{i,t}, K_{i,t})\,\mathbb I\!\left\{\frac{\pi(x_{i,t})}{q_{K_{i,t}}(\tilde x_{i,t-1}, x_{i,t})}\,h(x_{i,t}, K_{i,t}) > C\right\}\Big|\,\mathcal G_N\right] = \sum_{d=1}^{D} N^{-1}\sum_{i=1}^{N} F_C(\tilde x_{i,t-1}, d)\,\alpha_d^{t,N}, \tag{12}$$
where $F_C(x, k) = \int \pi(du)\,h(u, k)\,\mathbb I\{\pi(u)\,h(u, k)/q_k(x, u) \ge C\}$. By induction, we have
$$\alpha_d^{t,N} = \sum_{i=1}^{N}\bar\omega_{i,t-1}\,\mathbb I_d(K_{i,t-1}) \xrightarrow{\;P\;} 1/D \quad\text{and}\quad N^{-1}\sum_{i=1}^{N} F_C(\tilde x_{i,t-1}, k) \xrightarrow[N\to\infty]{\;P\;} \pi(F_C(\cdot, k)).$$
Using these limits in (12) yields
$$N^{-1}\sum_{i=1}^{N}\mathbb E\!\left[\frac{\pi(x_{i,t})\,h(x_{i,t}, K_{i,t})}{q_{K_{i,t}}(\tilde x_{i,t-1}, x_{i,t})}\,\mathbb I\!\left\{\frac{\pi(x_{i,t})\,h(x_{i,t}, K_{i,t})}{q_{K_{i,t}}(\tilde x_{i,t-1}, x_{i,t})} > C\right\}\Big|\,\mathcal G_N\right] \xrightarrow[N\to\infty]{\;P\;} \pi\otimes\gamma_u(F_C).$$
Since $\pi\otimes\gamma_u(F_C)$ converges to 0 as $C$ goes to infinity, this proves that for any $\eta > 0$,
$$N^{-1}\sum_{i=1}^{N}\mathbb E\!\left[\frac{\pi(x_{i,t})\,h(x_{i,t}, K_{i,t})}{q_{K_{i,t}}(\tilde x_{i,t-1}, x_{i,t})}\,\mathbb I\!\left\{\frac{\pi(x_{i,t})\,h(x_{i,t}, K_{i,t})}{q_{K_{i,t}}(\tilde x_{i,t-1}, x_{i,t})} > N\eta\right\}\Big|\,\mathcal G_N\right] \xrightarrow[N\to\infty]{\;P\;} 0.$$
Condition (iii) is satisfied and Theorem A.1 applies. The proof follows. □
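The conditional LLN of Theorem A.1, which drives all of the proofs above, can be illustrated on an assumed toy triangular array in which each row carries a random, $\mathcal G_N$-measurable scale (a Python sketch, not part of the paper's experiments):

```python
import random

random.seed(4)

def row_gap(N):
    """|sum_i U_{N,i} - sum_i E[U_{N,i} | G_N]| for one row of the array."""
    R = 1.0 + random.random()   # G_N-measurable random scale, bounded in [1, 2]
    # Conditionally on R, the U_{N,i} = Exp(mean R)/N are i.i.d. with
    # E[U_{N,i} | R] = R/N, so conditions (i)-(iii) of Theorem A.1 hold
    # and the row sum should approach R as N grows.
    row_sum = sum(random.expovariate(1.0 / R) / N for _ in range(N))
    return abs(row_sum - R)

gaps = [row_gap(N) for N in (100, 10000, 1000000)]
```

The gap between the row sum and its conditional expectation shrinks at the usual $N^{-1/2}$ rate, exactly the convergence in probability asserted by the theorem.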
Acknowledgments.
The authors are grateful to Olivier Cappé, Paul Fearnhead and Éric Moulines for helpful comments and discussions. Comments from two referees helped considerably in improving the focus and presentation of our results.

REFERENCES

[1] Agresti, A. (2002). Categorical Data Analysis, 2nd ed. Wiley, New York. MR1914507
[2] Andrieu, C. and Robert, C. (2001). Controlled Markov chain Monte Carlo methods for optimal sampling. Technical Report 0125, Univ. Paris Dauphine.
[3] Cappé, O., Guillin, A., Marin, J.-M. and Robert, C. (2004). Population Monte Carlo. J. Comput. Graph. Statist.
[4] Cappé, O., Moulines, É. and Rydén, T. (2005). Inference in Hidden Markov Models. Springer, New York. MR2159833
[5] Celeux, G., Marin, J.-M. and Robert, C. (2006). Iterated importance sampling in missing data problems. Comput. Statist. Data Anal.
[6] Chopin, N. (2004). Central limit theorem for sequential Monte Carlo methods and its application to Bayesian inference. Ann. Statist.
[7] Csiszár, I. and Tusnády, G. (1984). Information geometry and alternating minimization procedures. Recent results in estimation theory and related topics. Statist. Decisions (suppl. 1) 205–237. MR0785210
[8] Del Moral, P., Doucet, A. and Jasra, A. (2006). Sequential Monte Carlo samplers. J. R. Stat. Soc. Ser. B Stat. Methodol.
[9] Douc, R., Guillin, A., Marin, J.-M. and Robert, C. (2005). Minimum variance importance sampling via population Monte Carlo. Technical report, Cahiers du CEREMADE, Univ. Paris Dauphine.
[10] Douc, R. and Moulines, É. (2005). Limit theorems for properly weighted samples with applications to sequential Monte Carlo. Technical report, TSI, Telecom Paris.
[11] Doucet, A., de Freitas, N. and Gordon, N., eds. (2001). Sequential Monte Carlo Methods in Practice. Springer, New York. MR1847783
[12] Gilks, W., Roberts, G. and Sahu, S. (1998). Adaptive Markov chain Monte Carlo through regeneration. J. Amer. Statist. Assoc.
[13] Haario, H., Saksman, E. and Tamminen, J. (1999). Adaptive proposal distribution for random walk Metropolis algorithm. Comput. Statist.
[14] Haario, H., Saksman, E. and Tamminen, J. (2001). An adaptive Metropolis algorithm. Bernoulli.
[15] Hesterberg, T. (1995). Weighted average importance sampling and defensive mixture distributions. Technometrics.
[16] Iba, Y. (2000). Population-based Monte Carlo algorithms. Trans. Japanese Society for Artificial Intelligence.
[17] Künsch, H. (2005). Recursive Monte Carlo filters: Algorithms and theoretical analysis. Ann. Statist.
[18] Mengersen, K. L. and Tweedie, R. L. (1996). Rates of convergence of the Hastings and Metropolis algorithms. Ann. Statist.
[19] Robert, C. (1996). Intrinsic losses. Theory and Decision.
[20] Robert, C. and Casella, G. (2004). Monte Carlo Statistical Methods, 2nd ed. Springer, New York. MR2080278
[21] Roberts, G. O., Gelman, A. and Gilks, W. R. (1997). Weak convergence and optimal scaling of random walk Metropolis algorithms. Ann. Appl. Probab.
[22] Rubin, D. (1988). Using the SIR algorithm to simulate posterior distributions. In Bayesian Statistics 3 (J. M. Bernardo, M. H. DeGroot, D. V. Lindley and A. F. M. Smith, eds.) 395–402. Oxford Univ. Press.
[23] Sahu, S. and Zhigljavsky, A. (1998). Adaptation for self regenerative MCMC. Technical report, Univ. of Wales, Cardiff.
[24] Sahu, S. and Zhigljavsky, A. (2003). Self regenerative Markov chain Monte Carlo with adaptation. Bernoulli.
[25] Tierney, L. (1994). Markov chains for exploring posterior distributions (with discussion). Ann. Statist.

R. Douc
CMAP, École Polytechnique, CNRS
Route de Saclay
91128 Palaiseau cedex
France
E-mail: [email protected]

A. Guillin
École Centrale de Marseille et LATP, CNRS
Centre de Mathématiques et Informatique
Technopôle Château-Gombert
39 rue F. Joliot Curie
13453 Marseille cedex 13
France
E-mail: [email protected]

J.-M. Marin
INRIA Futurs, Projet Select
Laboratoire de Mathématiques
Université d'Orsay
91405 Orsay cedex
France
E-mail: