Advances in Importance Sampling
Víctor Elvira* and Luca Martino†
* School of Mathematics, University of Edinburgh (United Kingdom)
† Universidad Rey Juan Carlos de Madrid (Spain)
Abstract
Importance sampling (IS) is a Monte Carlo technique for the approximation of intractable distributions and integrals with respect to them. The origin of IS dates from the early 1950s. In the last decades, the rise of the Bayesian paradigm and the increase of the available computational resources have propelled the interest in this theoretically sound methodology. In this paper, we first describe the basic IS algorithm and then revisit the recent advances in this methodology. We pay particular attention to two sophisticated lines of work. First, we focus on multiple IS (MIS), the case where more than one proposal is available. Second, we describe adaptive IS (AIS), the generic methodology for adapting one or more proposals.
Keywords:
Monte Carlo methods, computational statistics, importance sampling.
In many problems of science and engineering, intractable integrals must be approximated. Let us denote an integral of interest as
$$I(f) = E_{\tilde\pi}[f(x)] = \int f(x)\,\tilde\pi(x)\, dx, \qquad (1)$$
where $f: \mathbb{R}^{d_x} \to \mathbb{R}$ and $\tilde\pi(x)$ is a distribution of the r.v. $X \in \mathbb{R}^{d_x}$. (For the sake of easing the notation, from now on we use the same notation for denoting a random variable or one realization of a random variable.) Note that although Eq. (1) involves a distribution, more generic integrals could be targeted with the techniques described below.

Integrals of this form appear often in the Bayesian framework, where a set of observations is available in $y \in \mathbb{R}^{d_y}$, and the goal is to infer some hidden parameters and/or latent variables $x \in \mathbb{R}^{d_x}$ that are connected to the observations through a probabilistic model [61]. The information provided by the observations is compacted in the likelihood function $\ell(y|x)$, and the prior knowledge on $x$ is encoded in the prior distribution $p(x)$. Both sources of information are fused, through the simple Bayes' rule, to create the posterior probability density function (pdf), also called target distribution, given by
$$\tilde\pi(x) = p(x|y) = \frac{\ell(y|x)\, p(x)}{Z(y)}, \qquad (2)$$
where $Z(y) = \int \ell(y|x)\, p(x)\, dx$ is the marginal likelihood (a.k.a. partition function, Bayesian evidence, model evidence, or normalizing constant) [7, 88]. In most models of interest $Z(y)$ is unknown, and in many applications it must be approximated [7, 9, 88]. But even when its approximation is not needed, the unavailability of $Z(y)$ implies that the posterior can be evaluated only up to that (unknown) constant, i.e., we can only evaluate
$$\pi(x) = \ell(y|x)\, p(x), \qquad (3)$$
which we denote as the unnormalized target distribution. Table 1 summarizes the notation of this article.

Table 1: Summary of the notation.
  $d_x$ : dimension of the inference problem, $x \in \mathbb{R}^{d_x}$
  $d_y$ : dimension of the observed data, $y \in \mathbb{R}^{d_y}$
  $x$ : r.v. of interest; parameter to be inferred
  $y$ : observed data
  $\ell(y|x)$ : likelihood function
  $p(x)$ : prior pdf
  $\tilde\pi(x)$ : posterior pdf (target), $\tilde\pi(x) \equiv p(x|y)$
  $\pi(x)$ : unnormalized posterior density (unnormalized target), $\pi(x) \equiv \ell(y|x)\, p(x) \propto \tilde\pi(x)$
  $q(x)$ : proposal density
  $Z$ : normalizing constant or marginal likelihood, $Z \equiv Z(y)$
  $I(f)$ : integral to be approximated, $I(f) \equiv E_{\tilde\pi}[f(x)]$

The integral $I(f)$ cannot be computed in closed form in many practical scenarios and hence must be approximated. The approximation methods can be divided into deterministic and stochastic. While many deterministic numerical methods are available in the literature [1, 5, 11, 59, 87], it is generally accepted that they tend to become less efficient than stochastic approximations when the problem dimension $d_x$ grows.

The Monte Carlo approach consists in approximating the integral $I(f)$ in Eq. (1) with random samples [22, 51, 44, 58, 63, 78, 91]. In the standard Monte Carlo solution (often called vanilla, raw, classical, or direct Monte Carlo), $N$ samples $x_n$ are independently simulated from $\tilde\pi(x)$. The standard Monte Carlo estimator is built as
$$I^N(f) = \frac{1}{N}\sum_{n=1}^{N} f(x_n). \qquad (4)$$
First, note that $I^N(f)$ is unbiased, since $E_{\tilde\pi}[I^N(f)] = I(f)$.
Moreover, due to the weak law of large numbers, $I^N(f)$ is consistent and converges in probability to the true value $I(f)$, i.e., $I^N(f) \xrightarrow{p} I(f)$, which is equivalent to stating that, for any positive number $\epsilon > 0$, we have $\lim_{N\to\infty} \Pr(|I^N(f) - I(f)| > \epsilon) = 0$. The variance of $I^N(f)$ is simply $\sigma^2/N$, with $\sigma^2 = I(f^2) - I(f)^2$. If the second moment is finite, $I(f^2) < \infty$, then the central limit theorem (CLT) applies and the estimator converges in distribution to a well-defined Gaussian when $N$ grows to infinity, i.e.,
$$\sqrt{N}\left(I^N(f) - I(f)\right) \xrightarrow{d} \mathcal{N}(0, \sigma^2). \qquad (5)$$
(From now on, we drop $y$ to ease the notation, e.g., $Z \equiv Z(y)$.)

There exist multiple families of Monte Carlo methods [91, 89, 100]. We address the interested reader to the articles on Markov chain Monte Carlo (including Metropolis-Hastings [90, 67] and Gibbs sampling [15]) and to previous articles on importance sampling [104, 55].

The first use of the importance sampling (IS) methodology dates from 1950, for rare event estimation in statistical physics, in particular for the approximation of the probability of nuclear particles penetrating shields [52]. IS was later used as a variance reduction technique when standard Monte Carlo integration was not possible and/or not efficient [48]. The renewed interest in IS has run in parallel with the hectic activity in the community of Bayesian analysis and its ever increasing computational demands. In most cases, the posterior in (2) is not available due to the intractability of the normalizing constant. See [101] for a previous review of IS.
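To make the vanilla Monte Carlo estimator of Eq. (4) concrete, the following minimal Python sketch approximates $E_{\tilde\pi}[f(x)]$ by averaging $f$ over samples drawn directly from the target; the Gaussian target and the choice of $f$ are assumptions made only for this illustration.

import numpy as np

rng = np.random.default_rng(0)

# Example target: x ~ N(2, 1); integrand f(x) = x**2, so I(f) = 2**2 + 1 = 5.
def f(x):
    return x**2

N = 100_000
x = rng.normal(loc=2.0, scale=1.0, size=N)   # direct sampling from the target
I_N = np.mean(f(x))                          # Eq. (4): unbiased vanilla MC estimator
print(f"I_N = {I_N:.4f} (true value 5)")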
Let us start by defining the proposal pdf, $q(x)$, used to simulate the samples. It is widely accepted that the proposal is supposed to have heavier tails than the target, i.e., the target $\tilde\pi(x)$ decays faster than $q(x)$ when $x$ is far from the region where most of the probability mass is concentrated. However, this usual restriction is too vague, and it will be clarified below. Here, we simply stick to the restriction that $q(x) > 0$ for all $x$ where $\tilde\pi(x) f(x) \neq 0$. IS consists of two simple steps:

1. Sampling: $N$ samples are simulated as
$$x_n \sim q(x), \qquad n = 1, \ldots, N. \qquad (6)$$

2. Weighting: Each sample receives an associated importance weight given by
$$w_n = \frac{\pi(x_n)}{q(x_n)}, \qquad n = 1, \ldots, N. \qquad (7)$$

The importance weights describe how representative the samples simulated from $q(x)$ are when one is interested in computing integrals w.r.t. $\tilde\pi(x)$. The set of $N$ weighted samples can be used to approximate the generic integral $I(f)$ of Eq. (1) by the two following IS estimators:

• Unnormalized (or non-normalized) IS (UIS) estimator:
$$\hat I^N(f) = \frac{1}{NZ}\sum_{n=1}^{N} w_n f(x_n). \qquad (8)$$
Note that the UIS estimator can be used only when $Z$ is known.

• Self-normalized IS (SNIS) estimator:
$$\tilde I^N(f) = \sum_{n=1}^{N} \bar w_n f(x_n), \qquad (9)$$
where
$$\bar w_n = \frac{w_n}{\sum_{j=1}^{N} w_j} \qquad (10)$$
are the normalized weights.

The derivation of the SNIS estimator departs from the UIS estimator of Eq. (8), substituting $Z$ by its unbiased estimator [91]
$$\hat Z = \frac{1}{N}\sum_{n=1}^{N} w_n. \qquad (11)$$
After a few manipulations, one recovers Eq. (9). The normalized weights also allow us to approximate the target distribution by
$$\tilde\pi^N(x) = \sum_{n=1}^{N} \bar w_n\, \delta(x - x_n), \qquad (12)$$
where $\delta$ represents the Dirac measure.

The UIS estimator is unbiased, since it can be easily proven that $E_q[\hat I^N(f)] = I(f)$. Its variance, $\mathrm{Var}_q[\hat I^N(f)] = \sigma_q^2/N$, is given by
$$\sigma_q^2 = \int \frac{\left(f(x)\tilde\pi(x) - I(f)\, q(x)\right)^2}{q(x)}\, dx, \qquad (13)$$
if $q(x) > 0$ for all $x$ where $\tilde\pi(x) f(x) \neq 0$, as we have stated above [84]. We remark that it is not strictly necessary to have a proposal with heavier tails than the target distribution as long as $\sigma_q^2 < \infty$. One counterexample is a case where $f(x)$ decays fast enough to compensate the heavier tails of the target distribution. Another counterexample is a case where $f(x)$ takes non-zero and finite values only in a bounded set.

Note that $q(x)$ is chosen by the practitioner, and a good choice is critical for the efficiency of IS. Let us first suppose that $\mathrm{sign}(f(x))$ is constant for all $x$ and $I(f) \neq 0$. Let us also suppose that it is possible to simulate from
$$q^*(x) = \frac{f(x)\,\tilde\pi(x)}{\int f(z)\,\tilde\pi(z)\, dz}. \qquad (14)$$
Then, the UIS estimator has zero variance, $\sigma_q^2 = 0$, for any $N \geq 1$. However, it is very unlikely to have access to the proposal of (14). The main reason is that its normalizing constant is exactly the intractable integral we are trying to approximate, $I(f)$. However, $q^*(x)$ gives the useful intuition that the proposal should have mass proportional to the targeted integrand in Eq. (1). More precisely, inspecting (13), we see that the efficiency is penalized by the mismatch between $f(x)\tilde\pi(x)$ and $q(x)$, with this penalization amplified inversely proportionally to the density $q(x)$. This explains the usual safe practice of over-spreading the proposal.
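As an illustration of Eqs. (6)–(11), the short Python sketch below computes the raw weights, the normalized weights, and both the UIS and SNIS estimators; the Gaussian unnormalized target (with a known $Z$) and the wider Gaussian proposal are assumptions made only for the example.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Unnormalized target pi(x) = Z * N(x; 2, 1), with Z chosen as 3.0 for the example.
Z_true = 3.0
def pi_unnorm(x):
    return Z_true * norm.pdf(x, loc=2.0, scale=1.0)

def f(x):
    return x**2                                   # I(f) = 5 for this target

# Proposal q(x): a wider Gaussian (over-spread with respect to the target).
q_loc, q_scale = 0.0, 3.0
N = 100_000
x = rng.normal(q_loc, q_scale, size=N)            # Eq. (6): sampling
w = pi_unnorm(x) / norm.pdf(x, q_loc, q_scale)    # Eq. (7): importance weights

I_uis = np.sum(w * f(x)) / (N * Z_true)           # Eq. (8): UIS (needs Z)
w_bar = w / np.sum(w)                             # Eq. (10): normalized weights
I_snis = np.sum(w_bar * f(x))                     # Eq. (9): SNIS (Z-free)
Z_hat = np.mean(w)                                # Eq. (11): unbiased estimate of Z
print(I_uis, I_snis, Z_hat)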
Returning to the optimal proposal of Eq. (14), the case where $\mathrm{sign}(f(x))$ alternates can easily be handled by splitting the function as $f(x) = f^+(x) + f^-(x)$, where $f^+(x)$ is non-negative and $f^-(x)$ is non-positive. It is easy to show that, with the use of two proposals and $N = 2$, a zero-variance estimator is possible (see [84, Section 9.13]). In summary, the UIS estimator $\hat I^N(f)$ is unbiased, while $\tilde I^N(f)$ is only asymptotically unbiased, i.e., with a bias that goes to 0 when $N$ grows to infinity. Both UIS and SNIS are consistent estimators of $I(f)$, with a variance that depends on the discrepancy between $\pi(x)|f(x)|$ and $q(x)$, although the variance of the SNIS is more difficult to evaluate and its bias also plays a central role when $N$ is not large enough [84]. When several different moments $f$ of the target must be estimated, a common strategy in IS is to decrease the mismatch between the proposal $q(x)$ and the target $\tilde\pi(x)$ [20]. This is equivalent to minimizing the variance of the weights, and consequently the variance of the estimator $\hat Z$, and it is closely linked to the diagnostics of Section 2.4.

It is a legitimate question to wonder about the efficiency of the set of simulated weighted samples in the task of approximating the target distribution and/or moments of it. Usual metrics of efficiency involve the computation of the variance of the IS estimators. However, the computation of those variances is intractable and, even more, their approximation is usually a harder problem than computing Eq. (1) (see [84, Chapter 9.3] for a discussion). A classic diagnostic metric in the IS literature [54] is
$$\widehat{\mathrm{ESS}} = \frac{1}{\sum_{n=1}^{N} \bar w_n^2}. \qquad (15)$$
Note that $1 \leq \widehat{\mathrm{ESS}} \leq N$, taking the value $\widehat{\mathrm{ESS}} = 1$ when one $\bar w_j = 1$ and hence $\bar w_i = 0$ for all $i \neq j$. Conversely, $\widehat{\mathrm{ESS}} = N$ only when $\bar w_j = 1/N$ for all $j = 1, \ldots, N$. Hence $\widehat{\mathrm{ESS}}$ measures the discrepancy among normalized weights. This diagnostic is commonly called effective sample size, although it is an approximation of the more reasonable but intractable diagnostic given by [37]
$$\mathrm{ESS}^* = N\, \frac{\mathrm{Var}[I^N]}{\mathrm{MSE}[\tilde I^N]}. \qquad (16)$$
Then, $\mathrm{ESS}^*$ can be interpreted as the number of standard Monte Carlo samples that are necessary to obtain the same performance (in terms of MSE) as with the SNIS estimator with $N$ samples. The interested reader can find the derivation from $\mathrm{ESS}^*$ to $\widehat{\mathrm{ESS}}$ through a series of approximations and assumptions that rarely hold (see [37] for a thorough analysis). In practice, a low $\widehat{\mathrm{ESS}}$ is a symptom of malfunctioning, but a high $\widehat{\mathrm{ESS}}$ does not necessarily imply good behavior of the IS method.

New ESS-like methods have been proposed in the last years. In [72, 70], novel discrepancy measures with similar properties to $\widehat{\mathrm{ESS}}$ are proposed and discussed, mitigating some of the deficiencies of the original diagnostic. For instance, an alternative to $\mathrm{ESS}^*$ is using $1/\max(\bar w_n)$ instead, which preserves some of those properties (e.g., it takes values between 1 and $N$, being 1 if all the normalized weights are zero except one, and $N$ if all weights are the same). Another metric in the same spirit has been recently proposed in [49]. Finally, the use of the importance trick within quadrature schemes has been recently proposed [38, 42]. Note that these importance quadrature schemes are not stochastic, but they are strongly inspired by IS and its variants.

The research in IS methods has been very active in the last decade, not only in the development of novel methodology but also in increasing the understanding and theoretical characterization of IS-based methods. For instance, [2] unifies different perspectives about how many samples are necessary in IS for given proposal and target densities, a problem that is usually related to some notion of distance (more precisely, divergence) between the two densities. With a similar aim, in [14] it is shown that, in a fairly general setting, IS requires a number of samples proportional to the exponential of the KL divergence between the target and the proposal densities. The notion of divergence between both densities is also explored in [92] through the Rényi generalized divergence, and in [79] in terms of the Pearson $\chi^2$ divergence. Both divergences are connected with the variance of the $\hat Z$ estimator in Eq. (11).

As described in Section 2.4, a large variability in the importance weights is usually responsible for a large variance in the IS estimators. One alternative is adapting the proposals in order to diminish the mismatch with the target, as we describe in Section 4. However, this usually means throwing away past weighted samples (or sticking to large-variance estimators from the early iterations). Another alternative is the nonlinear transformation of the IS weights. The first work in this line is truncated importance sampling [50], where the standard unnormalized weights $w_n$ are truncated as $w'_n = \min(w_n, \tau)$, where $\tau$ is a maximum value allowed for the transformed/truncated weights. The consistency of the method and a central limit theorem for the modified estimator are proved. This transformation of the weights was also proposed in [53], called nonlinear importance sampling, within an adaptive IS scheme (the N-PMC algorithm). The convergence of this method is analyzed in [53, 79, 80]. The underlying problem that those methods fight is the heavy right tail in the distribution of the importance weights when the proposal is not well fit. In [103], the authors go a step beyond by characterizing the distribution of the importance weights with a generalized Pareto distribution that fits the upper tail. Based on this fitting, a method is proposed for the stabilization of the importance weights. The authors provide proofs for consistency, finite variance, and asymptotic normality. See [76] for a review of the clipping methodologies.
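The diagnostic of Eq. (15) and the weight-truncation idea of [50] are both one-liners on top of the weights already computed in the basic IS scheme. The sketch below is purely illustrative: the heavy-tailed synthetic weights, the integrand evaluations, and the choice of threshold $\tau$ are assumptions made for the example.

import numpy as np

def ess_hat(w):
    """Effective sample size of Eq. (15) from unnormalized weights w."""
    w_bar = w / np.sum(w)
    return 1.0 / np.sum(w_bar**2)

def snis_truncated(w, fx, tau):
    """SNIS estimate with truncated weights w'_n = min(w_n, tau), as in [50]."""
    w_clip = np.minimum(w, tau)
    return np.sum(w_clip * fx) / np.sum(w_clip)

# Illustrative usage with synthetic heavy-tailed weights:
rng = np.random.default_rng(2)
w = rng.lognormal(mean=0.0, sigma=2.0, size=10_000)
fx = rng.normal(size=10_000)
print(ess_hat(w), snis_truncated(w, fx, tau=np.quantile(w, 0.99)))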
Particle filtering (also known as sequential Monte Carlo) is an IS-based methodology for performing approximate Bayesian inference on a hidden state that evolves over time in state-space models, a class of probabilistic Markovian models. Due to the structure of the Bayesian network, it is possible to process sequentially and efficiently the observations related to the hidden state in order to build the sequence of filtering distributions (i.e., the posterior distribution of a given hidden state conditioned on all available observations). Particle filters (PFs) are based on importance sampling, incorporating in most cases a resampling step that helps to increase the diversity of the particle approximation [18, 62]. Since the publication of the seminal paper [45], where the bootstrap PF (BPF) is developed, a plethora of PFs have been proposed in the literature [21, 86, 57, 17, 36]. Advanced MIS and AIS techniques are often implicit in those algorithms, but they are rarely made explicit. In [39], a novel perspective on the BPF and the auxiliary PF (APF), based on MIS, is introduced. In these state-space models, the ESS and its approximations are also used as diagnostic metrics for PFs (see Section 2.4). Moreover, since the observations are dependent in these models, other metrics have been recently developed, mostly based on the predictive distribution of the observations [60, 8, 35, 41].
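As a deliberately minimal illustration of the bootstrap PF of [45], the following Python sketch filters a linear-Gaussian state-space model; the model parameters and the multinomial resampling at every step are assumptions made for the example, not prescriptions from the paper.

import numpy as np

rng = np.random.default_rng(3)

# Assumed linear-Gaussian state-space model:
#   x_t = 0.9 x_{t-1} + u_t,  u_t ~ N(0, 1);   y_t = x_t + v_t,  v_t ~ N(0, 0.5^2)
T, N = 50, 1000
x_true = np.zeros(T)
y = np.zeros(T)
y[0] = x_true[0] + rng.normal(scale=0.5)
for t in range(1, T):
    x_true[t] = 0.9 * x_true[t - 1] + rng.normal()
    y[t] = x_true[t] + rng.normal(scale=0.5)

# Bootstrap PF: propagate with the transition prior, weight with the likelihood, resample.
particles = rng.normal(size=N)
filt_mean = np.zeros(T)
for t in range(T):
    particles = 0.9 * particles + rng.normal(size=N)       # propagate (proposal = prior)
    logw = -0.5 * ((y[t] - particles) / 0.5) ** 2           # log-likelihood weights (up to a constant)
    w = np.exp(logw - logw.max())
    w /= w.sum()
    filt_mean[t] = np.sum(w * particles)                    # SNIS estimate of E[x_t | y_{1:t}]
    idx = rng.choice(N, size=N, p=w)                        # multinomial resampling
    particles = particles[idx]

print(filt_mean[-5:])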
The IS methodology can easily be extended to the case where the samples are simulated from $M$ proposals, $\{q_m(x)\}_{m=1}^{M}$, instead of only one. In a generic setting, one can consider that $n_m$ samples are simulated from each proposal (with $\sum_{m=1}^{M} n_m = N$) and weighted appropriately. This extension is usually called multiple importance sampling (MIS), and it has strong connections with the case of standard IS with a single mixture proposal whose components are those distributions, which is sometimes called mixture IS. Here we consider mixture IS as a subset of MIS methods, in which the $n_m$ are not deterministic numbers of samples but r.v.'s instead.

A unifying framework for MIS has been recently proposed in [40]. The framework encompasses most existing IS methods with multiple proposals, proposes new schemes, and compares them in terms of variance. For the sake of clarity, the framework is described for the case where (a) no prior information about the adequateness of the proposals is available; and (b) $M = N$ proposals are available (i.e., exactly the same number of proposals as samples to be simulated). However, straightforward extensions to more generic settings are possible. According to this framework, a MIS scheme is proper if it fulfills two conditions related to the sampling and weighting processes. A valid sampling scheme for the simulation of $N$ samples, $\{x_n\}_{n=1}^{N}$, can be agnostic to the dependence of those samples but must fulfill the following statistical property: a sample $x$ randomly picked from the whole set of $N$ simulated samples must be distributed as the mixture of proposals $\psi(x) = \frac{1}{N}\sum_{n=1}^{N} q_n(x)$. A valid weighting scheme must yield an unbiased and consistent UIS estimator, $\hat I^N$. These properness conditions extend the standard properness in IS established by [63], and they have also been used to assign proper importance weights to resampled particles [71]. The paper analyzes and ranks several resulting MIS schemes (different combinations of valid sampling and weighting procedures) in terms of variance. Due to space restrictions, here we show only two MIS schemes commonly used in the literature. Let us simulate exactly one sample per proposal (sampling scheme S in [40]) as
$$x_n \sim q_n(x), \qquad n = 1, \ldots, N. \qquad (17)$$
The two following weighting schemes are possible (among many others):

• Option 1: Standard MIS (s-MIS, also called the N1 scheme):
$$w_n = \frac{\pi(x_n)}{q_n(x_n)}, \qquad n = 1, \ldots, N. \qquad (18)$$

• Option 2: Deterministic mixture MIS (DM-MIS, also called the N3 scheme):
$$w_n = \frac{\pi(x_n)}{\psi(x_n)} = \frac{\pi(x_n)}{\frac{1}{N}\sum_{j=1}^{N} q_j(x_n)}, \qquad n = 1, \ldots, N.$$

In both cases, it is possible to build the UIS and SNIS estimators. In [40], it is shown that
$$\mathrm{Var}[\tilde I^N_{N3}] \leq \mathrm{Var}[\tilde I^N_{N1}],$$
i.e., that using the second weighting option, with the whole mixture in the denominator, is always better than using just the proposal that simulated the sample (equality in the variance relation happens only when all the proposals are the same). The result is relevant since N1 is widely used in the literature, but it should be avoided whenever possible. Note that both N1 and N3 require just one target evaluation per sample. However, N3 requires $N$ proposal evaluations per sample, while N1 requires just one. For a small number of proposals, or when the target evaluation is very expensive (and hence the bottleneck), this extra complexity of N3 may not be relevant, but it can become cumbersome otherwise. Several MIS strategies have been proposed in the literature to alleviate this problem.
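To illustrate the difference between the N1 and N3 weightings above, the sketch below draws one sample per proposal and computes both sets of weights; the Gaussian proposals and the unnormalized Gaussian target are assumptions made for this illustration. Either set of weights can then be plugged into the UIS or SNIS estimators.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)

def pi_unnorm(x):
    return 3.0 * norm.pdf(x, loc=2.0, scale=1.0)   # assumed unnormalized target

# N proposals q_n = N(mu_n, 1.5^2), one sample drawn from each (Eq. (17)).
mus = np.linspace(-3.0, 5.0, 20)
x = rng.normal(mus, 1.5)

# Option 1 (N1): weight with the proposal that generated each sample, Eq. (18).
w_n1 = pi_unnorm(x) / norm.pdf(x, mus, 1.5)

# Option 2 (N3): weight with the full mixture psi(x) = (1/N) sum_j q_j(x).
psi = np.mean(norm.pdf(x[:, None], mus[None, :], 1.5), axis=1)
w_n3 = pi_unnorm(x) / psi

# SNIS estimates of E[x] under the target with both weightings.
print(np.sum(w_n1 * x) / np.sum(w_n1), np.sum(w_n3 * x) / np.sum(w_n3))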
Among these strategies, in [29] a partition of the proposals is done a priori, and then the N3 scheme is applied within each cluster (i.e., small mixtures appear in the denominator of the weights). This method is called partial deterministic mixture and, in some examples, it is able to achieve a variance reduction similar to that of the N3 method while reducing drastically the number of proposal evaluations (see [29, Fig. 1]). The overlapped partial deterministic mixture method [32] extends the framework to the case where the proposals can belong to more than one cluster. However, the way the proposals are clustered remains an open problem, and few attempts have been made at optimizing the clustering (see [31], where the clusters are formed after the sampling, using the information of the samples, and hence biasing the estimators). When the selection of the proposals is also random, unlike in the sampling in (17), there exist options to evaluate only the proposals that have been used for sampling (scheme R2 in [40]) instead of using all of them in the denominator (scheme R3 in [40]). A recent paper explores the R2 scheme and some of its statistical properties [77].

Since the seminal works of [102, 48] in the computer graphics community, several works have addressed the case where the number of samples (also called counts) per proposal (also called techniques) can be different (see also [83], where the authors introduce control variates in MIS). In particular, the so-called balance heuristic estimator, proposed in [102] and closely related to the scheme N3 of Section 3.1, has attracted attention due to its high performance. The UIS balance heuristic estimator is given by
$$\hat I^N(f) = \sum_{j=1}^{M} \sum_{i=1}^{n_j} \frac{f(x_{j,i})\,\tilde\pi(x_{j,i})}{\sum_{k=1}^{M} n_k\, q_k(x_{j,i})}, \qquad (19)$$
where again $\{q_m(x)\}_{m=1}^{M}$ is the set of available proposals, $\{n_m\}_{m=1}^{M}$ are the numbers of samples associated to each proposal, $N = \sum_{k=1}^{M} n_k$ is the total number of samples, and $x_{j,i} \sim q_j(x)$, for $i = 1, \ldots, n_j$ and $j = 1, \ldots, M$. Regarding the denominator in (19), it can be interpreted that the $N$ samples are simulated from the mixture $\frac{1}{N}\sum_{k=1}^{M} n_k q_k(x)$ via stratified sampling (a similar interpretation can be made for the aforementioned N3 scheme). In [96], this estimator is revisited and novel bounds are obtained. In [93], the balance heuristic estimator of Eq. (19) is generalized, introducing more degrees of freedom that detach the sampling from the denominator of the importance weights, being able to obtain unbiased estimators that reduce the variance with respect to the standard balance heuristic. In [47], control variates are introduced in an IS scheme with a mixture proposal (similarly to [83]), and all parameters (including the mixture weights) are optimized to minimize the variance of the UIS estimator (which is jointly convex w.r.t. the mixture probabilities and the control variate regression coefficients). More works with a variable number of samples per proposal (either fixed or optimized) include [95, 94, 98].
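A direct transcription of the balance heuristic of Eq. (19), with unequal sample counts per proposal, could look as follows; the normalized Gaussian target, the Gaussian proposals, and the counts $n_m$ are illustrative assumptions.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)

def pi_tilde(x):
    return norm.pdf(x, loc=2.0, scale=1.0)      # assumed normalized target
def f(x):
    return x**2                                 # I(f) = 5 for this target

# M proposals with different counts n_m (Eq. (19) allows unequal counts).
mus = np.array([-2.0, 0.0, 2.0, 4.0])
scales = np.array([2.0, 2.0, 2.0, 2.0])
n = np.array([100, 200, 400, 300])

I_bh = 0.0
for mu, s, nj in zip(mus, scales, n):
    x = rng.normal(mu, s, size=nj)                                  # x_{j,i} ~ q_j
    denom = np.sum(n[None, :] * norm.pdf(x[:, None], mus[None, :], scales[None, :]), axis=1)
    I_bh += np.sum(f(x) * pi_tilde(x) / denom)                      # balance heuristic terms

print(I_bh)   # estimate of I(f)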
Importance sampling is often considered as a variance reduction technique, not only in the case where sampling from $\tilde\pi$ is not possible, but also when it is possible but not efficient. A classical example is the case of Eq. (1) with $f(x) = \mathbb{I}_S(x)$, where $\mathbb{I}_S$ is the indicator function taking value 1 for all $x \in S$, and 0 otherwise. In rare event estimation, $S$ is usually a set where the target $\tilde\pi$ has little probability mass, and hence $I(f)$ is a small positive number. It is then not practical to simulate from the target, since most of the samples will not contribute to the estimator because their evaluation in $\mathbb{I}_S(x)$ is zero. IS allows sampling from a different distribution, which increases the efficiency of the method when $q(x)$ is close to $\mathbb{I}_S$. A recent MIS method called ALOE ("At least one sample") is able to simulate from a mixture of proposals ensuring that all samples fall in the integration region $S$, in the case where $\tilde\pi(x)$ is Gaussian and $S$ is the union of half-spaces defined by a set of hyperplanes (linear constraints) [85]. As an example, the authors show successful results in a 326-dimensional problem with 5772 constraints and a very small probability $I(f)$, using a remarkably small number of samples. ALOE has been recently applied to the characterization of wireless communication systems through the estimation of the symbol error rate [27, 28].

In the last years, several works have focused on alleviating the computational complexity, communication, or storage requirements of intensive IS methods. This computational burden appears often when the inferential problem is challenging and requires a large amount of simulated samples. This can happen because the adaptive schemes may require many iterations, because of the high-dimensional nature of the tackled problem, and/or because a high precision (low variance) is required in the estimate. In [75], several compression schemes are proposed and theoretically analyzed for assigning importance weights to groups of samples for distributed or decentralized Bayesian inference. The framework is extended in [68], where a stronger theoretical support is given, and new deterministic and random rules for compression are provided. The approach in [56, 6] considers the case of a single node that keeps simulating samples and assigning them importance weights. The bottleneck here is the storage of the samples, so one needs to decide at each time whether a sample is stored or discarded. A compression algorithm is introduced for building a dictionary based on greedy subspace projections and a kernel density estimator of the targeted distribution with a limited number of samples. It is shown that the asymptotic bias of this method is a tunable constant depending on the kernel bandwidth parameter and a compression parameter. Finally, some works have studied the combination of IS estimators in the distributed setting. For instance, in [19, Section 4], independent estimators are linearly combined with the combination weights being the inverse of the variance of each estimator. A similar approach is followed in [82], using the $\widehat{\mathrm{ESS}}$ instead of the variance of the estimator (which is unknown in most practical problems). A Bayesian combination of Monte Carlo estimators is considered in [65, 66]. Note that the MIS approach is, due to its own nature, an implicit linear combination of multiple estimators (each of them using samples from one or several proposals). This perspective is exploited, for instance, in [46, 97].
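As a small illustration of the estimator-combination idea attributed above to [19, Section 4] (inverse-variance weights) and to [82] (ESS-based weights), the snippet below combines hypothetical independent IS estimates; the per-estimator variances and ESS values are assumed to be available and are invented for the example.

import numpy as np

def combine_inverse_variance(estimates, variances):
    """Linear combination with weights proportional to 1/variance (cf. [19, Sec. 4])."""
    lam = 1.0 / np.asarray(variances)
    lam /= lam.sum()
    return np.sum(lam * np.asarray(estimates))

def combine_ess(estimates, ess_values):
    """Same idea, but weighting each estimator by its ESS, as in [82]."""
    lam = np.asarray(ess_values, dtype=float)
    lam /= lam.sum()
    return np.sum(lam * np.asarray(estimates))

# Hypothetical outputs of three independent IS runs:
print(combine_inverse_variance([4.9, 5.2, 5.6], [0.01, 0.05, 0.20]))
print(combine_ess([4.9, 5.2, 5.6], [900, 400, 150]))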
Since choosing a good proposal (or set of proposals) in advance is in general impossible, a common approach is the use of adaptive importance sampling (AIS) [10]. AIS algorithms are iterative methods for the gradual learning of one or multiple proposals that aim at approximating the target pdf. Algorithm 1 describes a generic AIS algorithm through three basic steps: the simulation of samples from one or several proposals (sampling), the computation of the importance weight of each sample (weighting), and the update of the parameters that characterize the proposal(s) before repeating the previous steps in the next iteration (adaptation).

Most existing algorithms can be described within this framework, which we now describe in more detail. The generic AIS algorithm initializes $N$ proposals $\{q_n(x|\theta_{n,1})\}_{n=1}^{N}$, each of them parametrized by a vector $\theta_{n,1}$. Then, $K$ samples are simulated from each proposal, $x_{n,1}^{(k)}$, $n = 1, \ldots, N$, $k = 1, \ldots, K$, and weighted properly. Here again, many ways of sampling and weighting are possible, as described in Section 3.1. At the end of the weighting step, it is possible to approximate the integral of Eq. (1) with either UIS or SNIS, and the target distribution with a discrete random measure, by using the set of weighted samples $\{x_{n,1}^{(k)}, w_{n,1}^{(k)}\}$, $n = 1, \ldots, N$, $k = 1, \ldots, K$. Finally, the parameters of the $n$-th proposal are updated from $\theta_{n,1}$ to $\theta_{n,2}$. This three-step process is repeated until an iteration stoppage criterion is met (e.g., a maximum number of iterations, $J$, is reached). Note that, at the end, the estimators can either use all weighted samples from iterations 1 to $J$, or only the samples from the last iteration.
Algorithm 1: Generic AIS algorithm
Input: Choose $K$, $N$, $J$, and $\{\theta_{n,1}\}_{n=1}^{N}$.
for $j = 1, \ldots, J$ do
    Sampling: Draw $K$ samples from each of the $N$ proposal pdfs $\{q_{n,j}(x|\theta_{n,j})\}_{n=1}^{N}$, i.e., $x_{n,j}^{(k)}$, $k = 1, \ldots, K$, $n = 1, \ldots, N$.
    Weighting: Calculate the weights, $w_{n,j}^{(k)}$, for each of the $KN$ generated samples.
    Adaptation: Update the proposal parameters $\{\theta_{n,j}\}_{n=1}^{N} \longrightarrow \{\theta_{n,j+1}\}_{n=1}^{N}$.
end for
Output: Return the $KNJ$ pairs $\{x_{n,j}^{(k)}, w_{n,j}^{(k)}\}$ for all $k = 1, \ldots, K$, $n = 1, \ldots, N$, $j = 1, \ldots, J$.

The literature on AIS methods is vast, and a detailed description of all of them goes beyond the scope of this paper (see [10] for a thorough review). Most of the AIS algorithms can be classified within three categories, depending on how the proposals are adapted. Figure 1 describes graphically the three families of AIS algorithms, showing the dependencies for the adaptation of the proposal parameters and the simulation of the samples. Each subplot corresponds to one family, whose description and corresponding AIS algorithms from the literature are given below.

a) The proposal parameters are adapted using the last set of drawn samples (e.g., standard PMC [12], DM-PMC [33, 34], N-PMC [53], M-PMC [13], APIS [69]).

b) The proposal parameters are adapted using all drawn samples up to the latest iteration (e.g., AMIS [16], CAIS [23], Daisee [64], EAMIS [25], RS-CAIS [24]).

c) The proposal parameters are adapted using a process independent of the samples (e.g., LAIS [74, 73], GAPIS [30], GIS [99], IMIS [43], SL-PMC [26]).

Figure 1: Graphical description of three possible dependencies between the adaptation of the proposal parameters $\theta_{n,t}$ and the samples (note that $q_{n,t} \equiv q_{n,t}(x|\theta_{n,t})$): (a) adaptation based on the last set of drawn samples; (b) adaptation based on all drawn samples up to the latest iteration; (c) adaptation based on a process independent of the samples.

In Table 2, we describe some relevant AIS algorithms according to different features: the number of proposals; the weighting scheme (nonlinear corresponds to the clipping strategies of Section 2.5.1, standard is equivalent to Option 1 in Section 3.1, spatial mixture corresponds to Option 2 with $\psi(x) = \sum_{i=1}^{N} q_{i,j}(x|\theta_{i,j})$, and temporal mixture corresponds to Option 2 with $\psi(x) = \sum_{\tau=1}^{j} q_{n,\tau}(x|\theta_{n,\tau})$); and the parameters that are adapted (either location and scale, or only location). In Table 3, we describe the computational complexity of the same algorithms according to the number of target evaluations, proposal evaluations, target evaluations per proposal, and proposal evaluations per proposal. In some AIS algorithms, the proposals converge with the number of iterations $J$, although proving this convergence (and the associated convergence rates) is in general a tough problem (see a recent result in [3]). For many other AIS algorithms (e.g., DM-PMC, LAIS, APIS), the proposals do not converge to any limiting distribution. Convergence rates have been established only for simple classes of AIS algorithms which are based on optimized parametric proposals [3]. Note that AIS-based algorithms have also been used for optimization purposes [81, 4].
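The following Python sketch instantiates Algorithm 1 in its simplest form: Gaussian proposals, DM-MIS-style (spatial mixture) weighting, and an adaptation rule that moves each proposal location to the weighted mean of its own samples. The adaptation rule and all numerical values are assumptions made for illustration, in the spirit of family (a) above; they do not reproduce any particular published algorithm.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)

def pi_unnorm(x):
    return norm.pdf(x, loc=2.0, scale=1.0)        # assumed (un)normalized target

N, K, J, scale = 5, 50, 20, 2.0                   # proposals, samples per proposal, iterations
theta = rng.uniform(-5, 5, size=N)                # initial proposal locations theta_{n,1}

samples, weights = [], []
for j in range(J):
    # Sampling: K draws from each of the N Gaussian proposals.
    x = rng.normal(theta[:, None], scale, size=(N, K))
    # Weighting: spatial-mixture (DM-MIS) weights with psi(x) = (1/N) sum_n q_n(x).
    psi = np.mean(norm.pdf(x[None, :, :], theta[:, None, None], scale), axis=0)
    w = pi_unnorm(x) / psi
    samples.append(x.ravel()); weights.append(w.ravel())
    # Adaptation: move each location to the weighted mean of its own samples
    # (an illustrative rule; published algorithms use resampling, moment matching, gradients, etc.).
    theta = np.sum(w * x, axis=1) / np.sum(w, axis=1)

# SNIS estimate of E[x] using all KNJ weighted samples.
x_all, w_all = np.concatenate(samples), np.concatenate(weights)
print(np.sum(w_all * x_all) / np.sum(w_all))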
Table 2: Comparison of various AIS algorithms according to different features (number of proposals, weighting scheme, and adapted parameters).
Acknowledgements

V.E. acknowledges support from the Agence Nationale de la Recherche of France under the PISCES project (ANR-17-CE40-0031-01).

Table 3: Comparison of various AIS algorithms according to their computational complexity (target and proposal evaluations, in total and per proposal).
References

[1] F. S. Acton.
Numerical Methods That Work. The Mathematical Association of America, Washington, DC, 1990. [2] S. Agapiou, O. Papaspiliopoulos, D. Sanz-Alonso, A. Stuart, et al. Importance sampling: Intrinsic dimension and computational cost.
Statistical Science , 32(3):405–431, 2017.[3] ¨O. D. Akyildiz and J. M´ıguez. Convergence rates for optimised adaptive importancesamplers. arXiv preprint arXiv:1903.12044 , 2019.[4] O. D. Akyildiz, I. P. Marino, and J. M´ıguez. Adaptive noisy importance sampling forstochastic optimization. In , pages 1–5. IEEE, 2017.[5] M. C. Aus´ın. Quadrature and numerical integration.
Wiley StatsRef: Statistics ReferenceOnline (stat03875) , pages 1–10, 2014.[6] A. S. Bedi, A. Koppel, V. Elvira, and B. M. Sadler. Compressed streaming importancesamplingfor efficient representations of localization distributions. In , pages 1–5. IEEE, 2019.[7] J. M. Bernardo and A. F. M. Smith.
Bayesian Theory . Wiley & sons, 1994.[8] A. Bhadra and E. L. Ionides. Adaptive particle allocation in iterated sequential Monte Carlovia approximating meta-models.
Statistics and Computing , 26(1-2):393–407, 2016.[9] G. E. P. Box and G. C. Tiao.
Bayesian Inference in Statistical Analysis . Wiley & sons,1973.[10] M. F. Bugallo, V. Elvira, L. Martino, D. Luengo, J. M´ıguez, and P. M. Djuric. Adaptiveimportance sampling: The past, the present, and the future.
IEEE Signal Process. Mag., 34(4):60–79, 2017. [11] R. L. Burden and J. D. Faires.
Numerical Analysis . Brooks Cole, 2000.[12] O. Capp´e, A. Guillin, J. M. Marin, and C. P. Robert. Population Monte Carlo.
Journal ofComp. and Graphical Statistics , 13(4):907–929, 2004.[13] O. Capp´e, R. Douc, A. Guillin, J. M. Marin, and C. P. Robert. Adaptive importancesampling in general mixture classes.
Stat. Comput. , 18:447–459, 2008.[14] S. Chatterjee, P. Diaconis, et al. The sample size required in importance sampling.
TheAnnals of Applied Probability , 28(2):1099–1135, 2018.[15] J. A. Christen. Gibbs sampling.
Wiley StatsRef: Statistics Reference Online , pages 1–9,2014.[16] J. M. Cornuet, J. M. Marin, A. Mira, and C. P. Robert. Adaptive multiple importancesampling.
Scandinavian Journal of Statistics , 39(4):798–812, December 2012.[17] P. M. Djuric, T. Lu, and M. F. Bugallo. Multiple particle filtering. In , volume 3,pages III–1181. IEEE, 2007.[18] R. Douc, O. Capp´e, and E. Moulines. Comparison of resampling schemes for particlefiltering. In
Proc. 4 th Int. Symp. on Image and Signal Processing and Analysis , pages 64–69,September 2005.[19] R. Douc, A. Guillin, J. M. Marin, and C. P. Robert. Minimum variance importance samplingvia population Monte Carlo.
ESAIM: Probability and Statistics , 11:427–447, 2007.[20] A. Doucet and A. M. Johansen. A tutorial on particle filtering and smoothing: Fifteen yearslater.
Handbook of nonlinear filtering , 12(656-704):3, 2009.[21] A. Doucet, N. De Freitas, K. Murphy, and S. Russell. Rao-blackwellised particle filteringfor dynamic Bayesian networks. In
Proceedings of the Sixteenth conference on Uncertaintyin artificial intelligence , pages 176–183. Morgan Kaufmann Publishers Inc., 2000.[22] W. L. Dunn and J. K. Shultis.
Exploring Monte Carlo Methods . Elsevier Science, Amsterdam(The Netherlands), 2011.[23] Y. El-Laham, V. Elvira, and M. F. Bugallo. Robust covariance adaptation in adaptiveimportance sampling.
IEEE Signal Processing Letters , 25(7):1049–1053, 2018.[24] Y. El-Laham, V. Elvira, and M. F. Bugallo. Recursive shrinkage covariance learning inadaptive importance sampling. In
Proc. IEEE Int. Work. Comput. Adv. Multi-Sensor Adap. Process. (CAMSAP 2019), pages 1–5, 2019. [25] Y. El-Laham, L. Martino, V. Elvira, and M. F. Bugallo. Efficient adaptive multiple importance sampling. In , pages 1–5. IEEE, 2019. [26] V. Elvira and É. Chouzenoux. Langevin-based strategy for efficient proposal adaptation in population Monte Carlo. In
ICASSP 2019-2019 IEEE International Conference onAcoustics, Speech and Signal Processing (ICASSP) , pages 5077–5081. IEEE, 2019.[27] V. Elvira and I. Santamar´ıa. Efficient ser estimation for mimo detectors via importancesampling schemes. In , pages1–5. IEEE, 2019.[28] V. Elvira and I. Santamar´ıa. Multiple importance sampling for efficient symbol error rateestimation.
IEEE Signal Processing Letters , 26(3):420–424, 2019.[29] V. Elvira, L. Martino, D. Luengo, and M. F. Bugallo. Efficient multiple importance samplingestimators.
Signal Processing Letters, IEEE , 22(10):1757–1761, 2015.[30] V. Elvira, L. Martino, L. Luengo, and J. Corander. A gradient adaptive populationimportance sampler. In
Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP2015) , pages 4075–4079, Brisbane, Australia, 19-24 April 2015.[31] V. Elvira, L. Martino, D. Luengo, and M. F. Bugallo. Heretical multiple importancesampling.
IEEE Signal Processing Letters , 23(10):1474–1478, 2016.[32] V. Elvira, L. Martino, D. Luengo, and M. F. Bugallo. Multiple importance sampling withoverlapping sets of proposals.
IEEE Workshop on Statistical Signal Processing (SSP) , 2016.[33] V. Elvira, L. Martino, D. Luengo, and M. F. Bugallo. Improving Population Monte Carlo:Alternative weighting and resampling schemes.
Sig. Process. , 131(12):77–91, 2017.[34] V. Elvira, L. Martino, D. Luengo, and M. F. Bugallo. Population Monte Carlo schemeswith reduced path degeneracy. In
Proc. IEEE Int. Work. Comput. Adv. Multi-Sensor Adap.Process. (CAMSAP 2017) , pages 1–5, 2017.[35] V. Elvira, J. M´ıguez, and P. Djuri´c. Adapting the number of particles in sequential montecarlo methods through an online scheme for convergence assessment.
IEEE Transactions on Signal Processing, 65(7):1781–1794, 2017. [36] V. Elvira, L. Martino, M. F. Bugallo, and P. M. Djurić. In search for improved auxiliary particle filters. In , pages 1637–1641. IEEE, 2018. [37] V. Elvira, L. Martino, and C. P. Robert. Rethinking the effective sample size. arXiv preprint arXiv:1809.04129, 2018. [38] V. Elvira, P. Closas, and L. Martino. Gauss-Hermite quadrature for non-Gaussian inference via an importance sampling interpretation. In , pages 1–5. IEEE, 2019. [39] V. Elvira, L. Martino, M. F. Bugallo, and P. M. Djuric. Elucidating the auxiliary particle filter via multiple importance sampling [lecture notes].
IEEE Signal Processing Magazine ,36(6):145–152, 2019.[40] V. Elvira, L. Martino, D. Luengo, and M. F. Bugallo. Generalized multiple importancesampling.
Statistical Science , 34(1):129–155, 2019.[41] V. Elvira, J. M´ıguez, and P. M. Djuri´c. New results on particle filters with adaptive numberof particles. arXiv preprint arXiv:1911.01383 , 2019.[42] V. Elvira, L. Martino, and P. Closas. Importance Gaussian quadrature. arXiv preprintarXiv:2001.03090 , 2020.[43] M. Fasiolo, F. E. de Melo, and S. Maskell. Langevin incremental mixture importancesampling.
Stat. Comput. , 28(3):549–561, 2018.[44] J. E. Gentle.
Random Number Generation and Monte Carlo Methods . Springer, 2004.[45] N. Gordon, D. Salmond, and A. F. M. Smith. Novel approach to nonlinear and non-GaussianBayesian state estimation.
IEE Proceedings-F Radar and Signal Processing , 140:107–113,1993.[46] V. Havran and M. Sbert. Optimal combination of techniques in multiple importancesampling. In
Proceedings of the 13th ACM SIGGRAPH International Conference on Virtual-Reality Continuum and its Applications in Industry , pages 141–150, 2014.[47] H. Y. He and A. B. Owen. Optimal mixture weights in multiple importance sampling. arXivpreprint arXiv:1411.3954 , 2014.[48] T. Hesterberg. Weighted average importance sampling and defensive mixture distributions.
Technometrics, 37(2):185–194, 1995. [49] J. H. Huggins, D. M. Roy, et al. Sequential Monte Carlo as approximate sampling: bounds, adaptive resampling via ∞-ESS, and an application to particle Gibbs.
Bernoulli , 25(1):584–622, 2019.[50] E. L. Ionides. Truncated importance sampling.
Journal of Computational and GraphicalStatistics , 17(2):295–311, 2008.[51] P. Jaeckel.
Monte Carlo Methods in Finance . Wiley, 2002.[52] H. Kahn. Random sampling (Monte Carlo) techniques in neutron attenuation problems.
Nucleonics , 6(5):27–passim, 1950.[53] E. Koblents and J. M´ıguez. A population Monte Carlo scheme with transformed weightsand its application to stochastic kinetic models.
Statistics and Computing, 25(2):407–425, 2015. [54] A. Kong. A note on importance sampling using standardized weights.
University of Chicago,Dept. of Statistics, Tech. Rep , 348, 1992.[55] A. Kong. Importance sampling.
Wiley StatsRef: Statistics Reference Online , 2014.[56] A. Koppel, A. S. Bedi, V. Elvira, and B. M. Sadler. Approximate shannon samplingin importance sampling: Nearly consistent finite particle estimates. arXiv preprintarXiv:1909.10279 , 2019.[57] J. Kotecha and P. M. Djuri´c. Gaussian particle filtering.
IEEE Transactions SignalProcessing , 51(10):2592–2601, October 2003.[58] D. Kroese, T. Taimre, and Z. Botev.
Handbook of Monte Carlo Methods . Wiley Series inProbability and Statistics, John Wiley and Sons, New York, 2011.[59] P. K. Kythe and M. R. Schaferkotter.
Handbook of Computational Methods for Integration .Chapman and Hall/CRC, 2004.[60] A. Lee and N. Whiteley. Variance estimation and allocation in the particle filter. arXiv:1509.00394v1 [stat.CO] , 2015.[61] P. M. Lee. Bayesian inference.
Wiley StatsRef: Statistics Reference Online (stat00207.pub2) ,pages 1–9, 2014.[62] T. Li, M. Bolic, and P. M. Djuric. Resampling methods for particle filtering: Classification,implementation, and strategies.
IEEE Signal Processing Magazine , 32(3):70–86, 2015.[63] J. S. Liu.
Monte Carlo Strategies in Scientific Computing . Springer, 2004.[64] X. Lu, T. Rainforth, Y. Zhou, J.-W. van de Meent, and Y. W. Teh. Onexploration, exploitation and learning in adaptive importance sampling. arXiv preprintarXiv:1810.13296 , 2018.[65] D. Luengo, L. Martino, V. Elvira, and M. Bugallo. Bias correction for distributed bayesianestimators. In , pages 253–256. IEEE, 2015.[66] D. Luengo, L. Martino, V. Elvira, and M. Bugallo. Efficient linear fusion of partialestimators.
Digital Signal Processing , 78:265–283, 2018.[67] L. Martino and V. Elvira. Metropolis sampling.
Wiley StatsRef: Statistics Reference Online ,pages 1–18, 2017.[68] L. Martino and V. Elvira. Compressed Monte Carlo for distributed Bayesian inference. viXra:1811.0505 , 2018.[69] L. Martino, V. Elvira, D. Luengo, and J. Corander. An adaptive population importancesampler: Learning from the uncertanity.
IEEE Transactions on Signal Processing, 63(16):4422–4437, 2015. [70] L. Martino, V. Elvira, and F. Louzada. Alternative effective sample size measures for importance sampling. In , pages 1–5. IEEE, 2016. [71] L. Martino, V. Elvira, and F. Louzada. Weighting a resampled particle in sequential Monte Carlo. In , pages 1–5. IEEE, 2016. [72] L. Martino, V. Elvira, and F. Louzada. Effective sample size for importance sampling based on discrepancy measures.
Signal Processing , 131:386–401, 2017.[73] L. Martino, V. Elvira, and D. Luengo. Anti-tempered layered adaptive importance sampling.In , pages 1–5. IEEE,2017.[74] L. Martino, V. Elvira, D. Luengo, and J. Corander. Layered adaptive importance sampling.
Statistics and Computing , 27(3):599–623, 2017.[75] L. Martino, V. Elvira, and G. Camps-Valls. Group importance sampling for particle filteringand mcmc.
Digital Signal Processing , 82:133–151, 2018.[76] L. Martino, V. Elvira, J. M´ıguez, A. Art´es-Rodr´ıguez, and P. Djuri´c. A comparison ofclipping strategies for importance sampling. In , pages 558–562. IEEE, 2018.[77] F. J. Medina-Aguayo and R. G. Everitt. Revisiting the balance heuristic for estimatingnormalising constants. arXiv preprint arXiv:1908.06514 , 2019.[78] E. Medova. Bayesian Analysis and Markov Chain Monte Carlo simulation.
Wiley StatsRef:Statistics Reference Online (stat03616) , pages 1–12, 2015.[79] J. M´ıguez. On the performance of nonlinear importance samplers and population MonteCarlo schemes. In ,pages 1–5. IEEE, 2017.[80] J. Miguez, I. P. Mari˜no, and M. A. V´azquez. Analysis of a nonlinear importance samplingscheme for Bayesian parameter estimation in state-space models.
Signal Processing , 142:281–291, 2018.[81] P. D. Moral, A. Doucet, and A. Jasra. Sequential Monte Carlo samplers.
J. R. Stat. Soc.Ser. B Stat. Methodol. , 68(3):411–436, 2006.[82] T. L. T. Nguyen, F. Septier, G. W. Peters, and Y. Delignon. Improving smc sampler estimateby recycling all past simulated particles. In
Statistical Signal Processing (SSP), 2014 IEEEWorkshop on , pages 117–120. IEEE, 2014.[83] A. Owen and Y. Zhou. Safe and effective importance sampling.
Journal of the American Statistical Association, 95(449):135–143, 2000. [84] A. B. Owen.
Monte Carlo theory, methods and examples . 2013.[85] A. B. Owen, Y. Maximov, M. Chertkov, et al. Importance sampling the union of rareevents with an application to power systems analysis.
Electronic Journal of Statistics , 13(1):231–254, 2019.[86] M. K. Pitt and N. Shephard. Auxiliary variable based particle filters. In A. Doucet,N. de Freitas, and N. Gordon, editors,
Sequential Monte Carlo Methods in Practice ,chapter 13, pages 273–293. Springer, 2001.[87] B. F. Plybon.
An Introduction to Applied Numerical Analysis . PWS-Kent, Boston, MA,1992.[88] C. P. Robert.
The Bayesian Choice . Springer, 2007.[89] C. P. Robert. Monte Carlo methods.
Wiley StatsRef: Statistics Reference Online , pages1–13, 2014.[90] C. P. Robert. The metropolis–hastings algorithm.
Wiley StatsRef: Statistics ReferenceOnline , pages 1–13, 2015.[91] C. P. Robert and G. Casella.
Monte Carlo Statistical Methods . Springer, 2004.[92] E. K. Ryu and S. P. Boyd. Adaptive importance sampling via stochastic convexprogramming. arXiv preprint arXiv:1412.4845 , 2014.[93] M. Sbert and V. Elvira. Generalizing the balance heuristic estimator in multiple importancesampling. arXiv preprint arXiv:1903.11908 , 2019.[94] M. Sbert and V. Havran. Adaptive multiple importance sampling for general functions.
TheVisual Computer , 33(6-8):845–855, 2017.[95] M. Sbert, V. Havran, and L. Szirmay-Kalos. Variance analysis of multi-sample and one-sample multiple importance sampling. In
Computer Graphics Forum , volume 35, pages451–460. Wiley Online Library, 2016.[96] M. Sbert, V. Havran, and L. Szirmay-Kalos. Multiple importance sampling revisited:breaking the bounds.
EURASIP Journal on Advances in Signal Processing , 2018(1):15,2018.[97] M. Sbert, V. Havran, L. Szirmay-Kalos, and V. Elvira. Multiple importance samplingcharacterization by weighted mean invariance.
The Visual Computer , 34(6-8):843–852, 2018.[98] M. Sbert, V. Havran, and L. Szirmay-Kalos. Optimal deterministic mixture sampling. In
Eurographics (Short Papers), pages 73–76, 2019. [99] I. Schuster. Gradient importance sampling. Technical report, 2015. https://arxiv.org/abs/1507.05781. [100] T. Taimre, D. P. Kroese, and Z. I. Botev. Monte Carlo methods.
Wiley StatsRef: StatisticsReference Online DOI , 10:9781118445112, 2019.[101] S. T. Tokdar and R. E. Kass. Importance sampling: a review.
Wiley InterdisciplinaryReviews: Computational Statistics , 2(1):54–60, 2010.[102] E. Veach and L. Guibas. Optimally combining sampling techniques for Monte Carlorendering.
In SIGGRAPH 1995 Proceedings, pages 419–428, 1995. [103] A. Vehtari, A. Gelman, and J. Gabry. Pareto smoothed importance sampling. arXiv preprint arXiv:1507.02646, 2015. [104] S. Wang. Importance sampling including the bootstrap.