Nonasymptotic bounds for suboptimal importance sampling
Carsten Hartmann and Lorenz Richter. Institute of Mathematics, BTU Cottbus-Senftenberg, 03046 Cottbus, Germany, [email protected]. Institute of Mathematics, Freie Universität Berlin, 14195 Berlin, Germany, [email protected]
February 22, 2021
Abstract
Importance sampling is a popular variance reduction method for Monte Carlo estimation, where a notorious question is how to design good proposal distributions. While in most cases optimal (zero-variance) estimators are theoretically possible, in practice only suboptimal proposal distributions are available, and it can often be observed numerically that those can reduce statistical performance significantly, leading to large relative errors and therefore counteracting the original intention. In this article, we provide nonasymptotic lower and upper bounds on the relative error in importance sampling that depend on the deviation of the actual proposal from optimality, and we thus identify potential robustness issues that importance sampling may have, especially in high dimensions. We focus on path sampling problems for diffusion processes, for which generating good proposals comes with additional technical challenges, and we provide numerous numerical examples that support our findings.
1 Introduction

The numerical approximation of expectations by the Monte Carlo method is ubiquitous in various disciplines such as quantitative finance [25, 26], machine learning [7], computational statistics [23] or statistical physics [52], to name just a few. Depending on the problem at hand, this estimation problem can be more or less difficult, but it turns out that a major challenge is the potentially large statistical error of naive sampling strategies. It is therefore a common goal to build estimators that have a small variance, as compared to the quantity of interest, and thus a small relative error. A typical situation in which variance reduction is indispensable is the simulation of rare events, with its characteristic exponential divergence of the relative error with the parameter that controls the rarity of the quantity of interest (e.g. a level when computing level-crossing probabilities).

There are multiple strategies for variance reduction in Monte Carlo estimation [3]. In this article, we focus on importance sampling. The idea here is to sample from an alternative probability measure and reweight the resulting random variables with the likelihood ratio in order to produce an unbiased estimator for the quantity of interest. Naturally, the question arises which probability distribution to choose. In theory, under appropriate assumptions, there exists an optimal proposal that yields a zero-variance estimator and therefore removes all the stochasticity from the problem. However, this measure depends on the quantity of interest and is therefore practically useless. Coming up with feasible proposals, on the other hand, is a science in itself, and various numerical experiments demonstrate that it is indeed a crucial one, as making bad choices can even increase the relative error of importance sampling estimators significantly, and therefore counteract the original intention.
Loosely speaking, importance sampling gets increasingly difficult and sensitive to small deviations from an optimal proposal distribution if the quantity of interest is mainly supported on small regions which have little overlap with the regions of the proposal measure; such a phenomenon is more likely to appear in high dimensions. Moreover, concentration of measure, which may lead to degeneracies of likelihood ratios when the probability of certain events becomes exponentially small, is more likely to occur in high dimensions [4, 42].

To better understand the robustness (or better: fragility) of the optimal proposal in importance sampling is the main goal of this article. In applications, one often faces situations in which the probability measures admit densities on a subset of $\mathbb{R}^d$ or on a function space like the space of (semi-)continuous trajectories with values in $\mathbb{R}^d$ (called: path space). In this article, we shall put special emphasis on the latter case, specifically on diffusion processes that are particularly relevant e.g. in molecular dynamics [31], mathematical finance [25], or climate modelling [43]. What the aforementioned examples have in common is that the quantities of interest are often related to rare events or large deviations from a mean or an equilibrium state, and, often, the dynamics exhibits metastability, i.e. it features rare transitions between semi-stable equilibria. To simulate these systems, variance reduction techniques like importance sampling are indispensable, and we will provide quantitative bounds on the relative error that explain the fragility of importance sampling in these situations. Some of those bounds are formulated on an abstract measurable space, but they can be readily applied to the density case. For the path space measures, we deduce some additional bounds that, in particular, highlight the challenges due to high dimensionality or long trajectories.
Importance sampling is a classic variance reduction method in Monte Carlo simulation, and introductions can be found in many textbooks, such as [40, Section 9] or [25, 35], however mostly for the finite-dimensional case $\mathbb{R}^d$. The non-robustness of importance sampling in high dimensions is well known and has often been observed in numerical experiments [6, 27, 36, 49]. Recently, the authors of [9] have proved that the sample size required for importance sampling to be accurate scales exponentially in the KL divergence between the proposal and the target measure, when accuracy is understood in the sense of the $L^1$ error rather than the commonly used relative error. (Clearly, an unbounded $L^1$ error implies that the relative error will be unbounded.) Similar results can be found in [1], in which the authors analyze a self-normalized importance sampling estimator in connection with inverse problems and filtering. Necessary conditions that any importance sampling proposal distribution has to satisfy have been derived in [45], using the more general $f$-divergences and adopting an information-theoretic perspective.

An important class of techniques for building proposal distributions is known by the name sequential importance sampling, for which we recommend [15] for a comprehensive review. Closely related are methods based on interacting particle systems and nonlinear (mean-field) Feynman-Kac semigroups, in which the variance is controlled by adaptively annihilating and generating particles to approximate good proposal distributions [13]. Adaptive importance sampling for rare event simulation has been pioneered in [19, 20]; it is typically based on exponential change of measure techniques and the theory of large deviations, dating back to the seminal work [48]. For diffusion processes, large deviation principles can be used to approximate the optimal change of measure in the small noise regime, where the resulting change of measure turns out to be asymptotically optimal [51, 53].
Pre-asymptotic approximations to the optimal proposal are necessary when studying escape problems, for which the time horizon of the problem is either indefinite or infinitely large, a case that has been analysed in [18]. A non-asymptotic variant of the aforementioned approaches for finite noise diffusions is based on the stochastic control formulation of the optimal change of measure [28, 30]. Furthermore, we should note that there have been many attempts to find good (low-dimensional) proposals by taking advantage of specific structures of the problem at hand, using simplified models that approximate a complicated multiscale system [17, 29, 32, 50]. Recently, the scaling properties of certain approximations to control-based importance sampling estimators with the system dimension have been analyzed in [38], suggesting that the empirical loss function that is used to numerically approximate the optimal proposal distribution is essential.

In Section 2 we define importance sampling in an abstract setting and recall the notions of divergences between proposal and target measures, while refining a bound on the relative error and highlighting robustness issues in high dimensions. In Section 3 we move to importance sampling of stochastic processes. We translate the bounds from the previous section to this setting and derive an exact formula for the relative error, with which we can state novel bounds that allow for interpretations with respect to robustness in higher dimensions and long time horizons. When focusing on PDE methods in Section 3.2, we can essentially re-derive bounds from the previous section. In Section 3.3 we comment on how our bounds can help to understand potential issues in the small noise regime. Finally, in Section 4 we present a couple of numerical examples with which we illustrate the previously discussed issues. We conclude the article with Section 5 and discuss future perspectives for importance sampling in high dimensions.
The article contains an appendix that records some proofs and various technical lemmas.
2 Importance sampling bounds based on divergences
Let us consider the probability space $(\Omega, \mathcal{F}, \nu)$, on which we want to compute expected values
$$Z = \mathbb{E}\left[ e^{-W(X)} \right], \qquad (1)$$
where $X$ is a random variable taking values in $\Omega$ that is distributed according to the measure $\nu$, and $W : \Omega \to \mathbb{R}$ is some functional of $X$. Later on we will specify $\Omega$ to be either $\mathbb{R}^d$ or the path space $C([0,T], \mathbb{R}^d)$.

The idea of importance sampling is to sample instead $\widetilde{X} \in \Omega$ from another distribution $\widetilde{\nu}$ and weight the samples back according to the corresponding likelihood ratio (or Radon-Nikodym derivative), provided that $\nu \ll \widetilde{\nu}$, namely
$$Z = \mathbb{E}\left[ e^{-W(\widetilde{X})} \frac{d\nu}{d\widetilde{\nu}}(\widetilde{X}) \right]. \qquad (2)$$
One notorious intention of importance sampling is the reduction of the variance of the corresponding Monte Carlo estimator
$$\widehat{Z}_K = \frac{1}{K} \sum_{k=1}^K e^{-W(\widetilde{X}_k)} \frac{d\nu}{d\widetilde{\nu}}(\widetilde{X}_k), \qquad (3)$$
where $K$ is the sample size and the $\widetilde{X}_k$ are i.i.d. samples from $\widetilde{\nu}$. We therefore study the relative error
$$r(\widetilde{\nu}) = \frac{\sqrt{\operatorname{Var}\left( e^{-W(\widetilde{X})} \frac{d\nu}{d\widetilde{\nu}}(\widetilde{X}) \right)}}{Z}, \qquad (4)$$
noting that the true relative error of the estimator (3) is given by $r(\widetilde{\nu})/\sqrt{K}$. It can be readily seen that choosing the optimal proposal measure $\widetilde{\nu} = \nu^*$ defined via
$$\frac{d\nu^*}{d\nu} = \frac{e^{-W}}{Z} \qquad (5)$$
yields an unbiased zero-variance estimator. Of course, this estimator is usually infeasible in practice, as $Z$ is just the quantity we are after, and therefore not available. In this article, we study the relative error when using any other absolutely continuous, suboptimal proposal measure $\widetilde{\nu} \neq \nu^*$.

As a remark on our notation, let us mention that we sometimes endow the expectation operator with a subscript indicating with respect to which measure the expectation is taken, e.g. $\mathbb{E}_\nu$ indicates that the expectation is considered with respect to the measure $\nu$. When explicitly writing down the corresponding random variable, e.g. $\mathbb{E}[X]$, it is usually clear from the context with respect to which measure the expectation shall be understood, and we omit the subscript. Moreover, the exponential form $e^{-W}$ constrains our observable to be positive. We make this choice in order to be able to have a zero-variance proposal density without additional tricks, as the optimal proposal measure $\nu^*$ defined in (5) has to be non-negative. Assuming strict positivity is convenient in order to get variational dualities that rely on logarithmic transformations, cf. [30]. An extension of importance sampling to observables with negative parts can for instance be found in [39].

It turns out that divergences between the actual and the optimal proposal measure are helpful in this analysis, and we therefore start by noting the equivalence of the squared relative error and the $\chi^2$ divergence between those measures.

Lemma 2.1 (Equivalence with $\chi^2$ divergence). Let $\widetilde{\nu}$ be a measure that is absolutely continuous with respect to $\nu$, let $\nu^*$ be the optimal proposal measure as defined in (5) and let $r(\widetilde{\nu})$ be the relative error as in (4). Then
$$r^2(\widetilde{\nu}) = \chi^2(\nu^* | \widetilde{\nu}). \qquad (6)$$

Proof. By using the definition of the $\chi^2$ divergence in the first step, we compute
$$\chi^2(\nu^* | \widetilde{\nu}) = \mathbb{E}_{\widetilde{\nu}}\left[ \left( \frac{d\nu^*}{d\widetilde{\nu}} \right)^2 \right] - 1 = \mathbb{E}_{\widetilde{\nu}}\left[ \left( \frac{d\nu^*}{d\widetilde{\nu}} \right)^2 \right] - \left( \mathbb{E}_{\widetilde{\nu}}\left[ \frac{d\nu^*}{d\widetilde{\nu}} \right] \right)^2 = \operatorname{Var}_{\widetilde{\nu}}\left( \frac{d\nu^*}{d\widetilde{\nu}} \right) = \frac{1}{Z^2} \operatorname{Var}\left( e^{-W(\widetilde{X})} \frac{d\nu}{d\widetilde{\nu}}(\widetilde{X}) \right) = r^2(\widetilde{\nu}). \qquad (7)$$

Motivated by known bounds on the $\chi^2$ divergence, we can formulate our first statement, where we quantify the suboptimality by the Kullback-Leibler divergence between the actual and the optimal proposal measure.

Proposition 2.2 (Lower bound on relative error). Let $W : \Omega \to \mathbb{R}$, let $\widetilde{\nu}$ be a measure and let $\nu^*$ be the optimal proposal measure as defined in (5); then for the relative error (4) it holds
$$r(\widetilde{\nu}) \geq \sqrt{e^{\operatorname{KL}(\nu^* | \widetilde{\nu})} - 1}. \qquad (8)$$

Proof. With Jensen's inequality we have
$$\operatorname{KL}(\nu^* | \widetilde{\nu}) = \mathbb{E}_{\nu^*}\left[ \log \frac{d\nu^*}{d\widetilde{\nu}} \right] \leq \log \mathbb{E}_{\nu^*}\left[ \frac{d\nu^*}{d\widetilde{\nu}} \right]. \qquad (9)$$
Combining this with Lemma 2.1 yields
$$r^2(\widetilde{\nu}) = \mathbb{E}_{\widetilde{\nu}}\left[ \left( \frac{d\nu^*}{d\widetilde{\nu}} \right)^2 \right] - 1 = \mathbb{E}_{\nu^*}\left[ \frac{d\nu^*}{d\widetilde{\nu}} \right] - 1 \geq e^{\operatorname{KL}(\nu^* | \widetilde{\nu})} - 1. \qquad (10)$$

Remark 2.3 ($\chi^2$ divergence). In the setting of importance sampling the $\chi^2$ divergence also appears in [10]. A bound of the $\chi^2$ divergence that is sometimes used is $\chi^2(\nu^* | \widetilde{\nu}) \geq \operatorname{KL}(\nu^* | \widetilde{\nu})$, which is essentially based on $x \leq e^x - 1$ and therefore yields a less tight bound compared to Proposition 2.2. The exponential bound we use instead can for instance be found in [16, Theorem 4] and [46, Proposition 4] in a discrete setting; here, a lower bound in terms of the total variation distance is provided as well. [24] offers a continuous version and some other helpful relations between divergences. An application of the bound to importance sampling relative errors can be found in [1], and more analysis with respect to more general $f$-divergences has been done in [45]. The statement should also be compared to the results in [9], where the required sample size of importance sampling is proved to be exponentially large in the KL divergence between the proposal and the target measure.

Remark 2.4. Note that the expression $\operatorname{KL}(\nu^* | \widetilde{\nu})$ appearing in (8) is exactly the quantity that is minimized in the so-called cross-entropy method [12, 54], which aims at approximating the optimal importance sampling proposal in a family of reference proposals.

Remark 2.5. We recall that the KL divergence usually gets larger with increasing state space dimension, as can for instance be seen by Lemma A.6 in the appendix, implying that importance sampling is especially difficult in high-dimensional settings. Another way of noting bad scaling behavior in high dimensions is motivated by [38, Proposition 5.7]. Assume that
$$\widetilde{\nu} = \bigotimes_{i=1}^d \widetilde{\nu}_i, \qquad \nu^* = \bigotimes_{i=1}^d \nu^*_i, \qquad (11)$$
where each $\widetilde{\nu}_i$, and $\nu^*_i$ respectively, shall be identical for $i \in \{1, \ldots, d\}$. Then
$$r^2(\widetilde{\nu}) = \operatorname{Var}_{\widetilde{\nu}}\left( \frac{d\nu^*}{d\widetilde{\nu}} \right) = \mathbb{E}_{\widetilde{\nu}_i}\left[ \left( \frac{d\nu^*_i}{d\widetilde{\nu}_i} \right)^2 \right]^d - \mathbb{E}_{\widetilde{\nu}_i}\left[ \frac{d\nu^*_i}{d\widetilde{\nu}_i} \right]^{2d} = \mathbb{E}_{\widetilde{\nu}_i}\left[ \left( \frac{d\nu^*_i}{d\widetilde{\nu}_i} \right)^2 \right]^d - 1 \geq C^d - 1, \qquad (12)$$
where $C := \mathbb{E}_{\widetilde{\nu}_i}\left[ \left( \frac{d\nu^*_i}{d\widetilde{\nu}_i} \right)^2 \right] > 1$ whenever $\widetilde{\nu} \neq \nu^*$, due to Jensen's inequality. This can be compared to [45, Section 5.2.1], and, to be fair, we should note that also naive sampling, i.e. choosing $\widetilde{\nu} = \nu$, usually leads to an exponential dependency of the relative error on the dimension. (The factorization of the optimal proposal measure $\nu^*$ in (11) assumes a factorization of the quantity $e^{-W}$.)

We have so far constructed a lower bound for the relative error. In order to get an upper bound, let us first state the following version of a generalized Jensen inequality, which will turn out to be helpful and is essentially borrowed from [37, Theorem 2].

Proposition 2.6 (Generalized Jensen inequality). Let $\lambda$ and $\nu$ be measures on $(\Omega, \mathcal{F})$, let
$$J(f, \nu, \varphi) := \mathbb{E}_\nu[f(\varphi)] - f(\mathbb{E}_\nu[\varphi]) \qquad (13)$$
be the normalized Jensen functional, where $f : \mathbb{R} \to \mathbb{R}$ is convex and $\varphi : \Omega \to \mathbb{R}$ is continuous, and let
$$m = \inf_{E \in \mathcal{F}} \frac{\nu(E)}{\lambda(E)}, \qquad M = \sup_{E \in \mathcal{F}} \frac{\nu(E)}{\lambda(E)}.$$
Then
$$m J(f, \lambda, \varphi) \leq J(f, \nu, \varphi) \leq M J(f, \lambda, \varphi). \qquad (14)$$

Proof. See Appendix A.1.

We can now derive an upper bound as well as a tighter lower bound for the relative error.
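As an aside, the exponential dimension dependence in (12) can be checked numerically in a simple product-of-Gaussians toy setting (our own illustration, not taken from the article): with $\nu^*_i = \mathcal{N}(0,1)$ and $\widetilde{\nu}_i = \mathcal{N}(\theta,1)$ a direct computation gives $C = e^{\theta^2}$, so that $r^2(\widetilde{\nu}) = e^{d\theta^2} - 1$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Per-coordinate measures: nu*_i = N(0,1) (optimal), nu~_i = N(theta,1) (proposal).
# For unit-variance Gaussians, C = E_{nu~}[(d nu*/d nu~)^2] = exp(theta^2),
# so by (12) the squared relative error is r^2 = C^d - 1.
theta = 0.3
C_exact = np.exp(theta**2)

# Monte Carlo check of the per-coordinate constant C.
x = rng.normal(theta, 1.0, size=10**6)
log_ratio = -0.5 * x**2 + 0.5 * (x - theta) ** 2   # log of d nu*/d nu~ at x
C_mc = np.mean(np.exp(2.0 * log_ratio))

# Exponential growth of the squared relative error with the dimension d.
for d in [1, 10, 50, 100]:
    print(d, C_exact**d - 1.0)
```

Even a mild per-coordinate perturbation ($\theta = 0.3$) thus yields a squared relative error of order $10^3$ once $d = 100$.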
Proposition 2.7 (Refined bounds on relative error). Let $\widetilde{\nu}$ be a measure that is absolutely continuous with respect to $\nu$ and let $\nu^*$ be the optimal proposal measure as in (5). Let $m$ and $M$ be as defined in Proposition 2.6 (with the measures $\nu$ and $\lambda$ being replaced by $\nu^*$ and $\widetilde{\nu}$, respectively). Then for the relative error (4) it holds
$$\sqrt{e^{m \operatorname{KL}(\widetilde{\nu} | \nu^*) + \operatorname{KL}(\nu^* | \widetilde{\nu})} - 1} \;\leq\; r(\widetilde{\nu}) \;\leq\; \sqrt{e^{M \operatorname{KL}(\widetilde{\nu} | \nu^*) + \operatorname{KL}(\nu^* | \widetilde{\nu})} - 1}. \qquad (15)$$

Proof. Inspired by [47] (which focuses on a discrete probability space) we choose
$$\nu = \nu^*, \qquad \lambda = \widetilde{\nu}, \qquad \varphi = \frac{d\nu^*}{d\widetilde{\nu}}$$
and $f(x) = -\log(x)$ for the expressions in (14) in order to get
$$J(f, \nu^*, \varphi) = -\mathbb{E}_{\nu^*}\left[ \log \frac{d\nu^*}{d\widetilde{\nu}} \right] + \log \mathbb{E}_{\nu^*}\left[ \frac{d\nu^*}{d\widetilde{\nu}} \right] = -\operatorname{KL}(\nu^* | \widetilde{\nu}) + \log\left( \chi^2(\nu^* | \widetilde{\nu}) + 1 \right), \qquad (16)$$
$$J(f, \widetilde{\nu}, \varphi) = -\mathbb{E}_{\widetilde{\nu}}\left[ \log \frac{d\nu^*}{d\widetilde{\nu}} \right] + \log \mathbb{E}_{\widetilde{\nu}}\left[ \frac{d\nu^*}{d\widetilde{\nu}} \right] = \operatorname{KL}(\widetilde{\nu} | \nu^*). \qquad (17)$$
With Proposition 2.6 we then get
$$m \operatorname{KL}(\widetilde{\nu} | \nu^*) + \operatorname{KL}(\nu^* | \widetilde{\nu}) \;\leq\; \log\left( \chi^2(\nu^* | \widetilde{\nu}) + 1 \right) \;\leq\; M \operatorname{KL}(\widetilde{\nu} | \nu^*) + \operatorname{KL}(\nu^* | \widetilde{\nu}), \qquad (18)$$
and with Lemma 2.1 our statement follows.

Remark 2.8. One should note that $m$ and $M$ depend on $\widetilde{\nu}$ and $\nu^*$, respectively, and are hard to compute in practice. We have $m \in [0, 1]$ and $M \in [1, \infty]$, and indeed it is possible to get $m = 0$ or $M = \infty$. The former case brings back the ordinary Jensen inequality, and the lower bound from Proposition 2.7 is then equivalent to the one from Proposition 2.2. The case $M = \infty$, on the other hand, yields a trivial upper bound, for which we provide an illustration in Example 2.9.

Example 2.9 (Upper bound for relative error). In order to illustrate the case where the upper bound in Proposition 2.7 becomes meaningless, consider for instance the measure $\nu$ on $[1, \infty) \subset \mathbb{R}$ admitting the one-dimensional density
$$p(x) = \frac{\alpha}{x^{\alpha+1}},$$
defined for $x \geq 1$. This density is special since for $\alpha \leq 1$ we have $\mathbb{E}[X] = \infty$; however, for $\alpha \in (1, 2]$ it holds $\mathbb{E}[X] < \infty$, whereas still $\mathbb{E}[X^2] = \infty$ and therefore $J(x \mapsto x^2, \nu, \varphi) = \infty$ for $\varphi(x) = x$. Now Proposition 2.6 implies that the upper bound also has to be infinity. Let us illustrate this for the particular choice of the measure $\lambda$ admitting the density
$$q(x) = \frac{2\alpha}{x^{2\alpha+1}}.$$
For this choice we have $J(x \mapsto x^2, \lambda, x \mapsto x) < \infty$ for $\alpha \in (1, 2]$; however, we compute
$$M = \sup_{\substack{a, b \in [1, \infty] \\ a \neq b}} \frac{\int_a^b p(x) \, dx}{\int_a^b q(x) \, dx} \;\geq\; \sup_{a \in [1, \infty)} \frac{\int_a^\infty p(x) \, dx}{\int_a^\infty q(x) \, dx} = \sup_{a \in [1, \infty)} \frac{a^{-\alpha}}{a^{-2\alpha}} = \sup_{a \in [1, \infty)} a^\alpha = \infty. \qquad (19)$$
In fact, Proposition 2.6 implies that one cannot find any $\lambda$ for which both $J(x \mapsto x^2, \lambda, x \mapsto x)$ and $M$ are finite.

To conclude this section, let us illustrate our bounds by looking at a concrete example using Gaussians on $\Omega = \mathbb{R}^d$ (which should be compared to [36, Section 6]).

Example 2.10 (High-dimensional Gaussians). Suppose we want to compute $\mathbb{E}\left[ e^{-\alpha \cdot X} \right]$ with a given vector $\alpha \in \mathbb{R}^d$, where $X \sim \mathcal{N}(\mu, \Sigma) =: p$ is distributed according to a multidimensional Gaussian with mean $\mu \in \mathbb{R}^d$ and covariance matrix $\Sigma \in \mathbb{R}^{d \times d}$. Then the optimal importance sampling density is given by
$$p^*(x) = \frac{e^{-\alpha \cdot x}}{Z} p(x) = \mathcal{N}(\mu - \Sigma\alpha, \Sigma). \qquad (20)$$
If we however sample from a perturbed version
$$\widetilde{p}_\varepsilon := \mathcal{N}(\mu - \Sigma(\alpha + \varepsilon), \Sigma) \qquad (21)$$
with a vector $\varepsilon \in \mathbb{R}^d$, we get the relative error
$$r(\widetilde{p}_\varepsilon) = \frac{1}{Z} \sqrt{\operatorname{Var}\left( e^{-\alpha \cdot \widetilde{X}} \frac{p}{\widetilde{p}_\varepsilon}(\widetilde{X}) \right)} = \sqrt{e^{\varepsilon \cdot \Sigma\varepsilon} - 1}. \qquad (22)$$
In this particular case, the computations can be compared to the relative error of a log-normally distributed random variable, see Appendix A.3.1. Taking, for instance, $\varepsilon = (\widetilde{\varepsilon}, \ldots, \widetilde{\varepsilon})^\top$ and $\Sigma = \operatorname{diag}(\sigma^2, \ldots, \sigma^2)$ yields
$$r(\widetilde{p}_\varepsilon) = \sqrt{e^{d\sigma^2\widetilde{\varepsilon}^2} - 1}, \qquad (23)$$
where we see an exponential dependence on the variance $\sigma^2$, the squared suboptimality parameter $\widetilde{\varepsilon}^2$ and the dimension $d$. This implies that, in order to control the relative error in high dimensions, any suboptimal importance sampling estimator needs about $K = \mathcal{O}(e^{d\sigma^2\widetilde{\varepsilon}^2})$ independent realisations to reach convergence. This observation is in agreement with the seminal result of Bengtsson and Bickel [4] that any importance sampling estimator for Gaussians ceases to be asymptotically efficient when $\log(K)/d \to 0$ as $K, d \to \infty$ (see also [34, Thm. 3.1]).

For this example, we can also apply the bound from Proposition 2.2: noting that $\operatorname{KL}(p^* | \widetilde{p}_\varepsilon) = \frac{1}{2}\varepsilon \cdot \Sigma\varepsilon$, we get
$$r(\widetilde{p}_\varepsilon) \geq \sqrt{e^{\frac{1}{2}\varepsilon \cdot \Sigma\varepsilon} - 1}. \qquad (24)$$
A comparison to the exact quantity (22) reveals that this lower bound is not tight. For an application of Proposition 2.7 we note that also $\operatorname{KL}(\widetilde{p}_\varepsilon | p^*) = \frac{1}{2}\varepsilon \cdot \Sigma\varepsilon$; however, $m$ and $M$ are intractable. Still, it is intuitively clear that $m$ becomes smaller and $M$ larger the more the two Gaussians are apart from each other.

We made the particular choice of $\widetilde{p}_\varepsilon$ in (21) in order to have an analogy to the path measure setting, which we will discuss in the next section. In fact, the added term $\Sigma\varepsilon$ in (21) can be compared to a constant control $\sigma\sigma^\top\varepsilon$ in a stochastic process as in (25), which, as will be seen in (45), yields a completely analogous expression for the relative error, noting that standard $d$-dimensional Brownian motion is distributed according to $W_T \sim \mathcal{N}(0, \Sigma)$ with $\Sigma = T \, I_{d \times d}$.
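The closed-form relative error (22) and the lower bound (24) from Example 2.10 are easy to verify by direct simulation. The following sketch (with arbitrarily chosen parameter values $d = 5$, $\alpha_i = 0.5$, $\widetilde{\varepsilon} = 0.2$, $\Sigma = I$) estimates $Z$ and the empirical relative error by importance sampling from the perturbed proposal:

```python
import numpy as np

rng = np.random.default_rng(1)

# Setting of Example 2.10: estimate Z = E[exp(-alpha.X)] with X ~ N(mu, Sigma),
# sampling from the perturbed proposal p~_eps = N(mu - Sigma (alpha + eps), Sigma).
d = 5
mu = np.zeros(d)
sigma2 = 1.0                      # Sigma = sigma2 * I
alpha = 0.5 * np.ones(d)
eps = 0.2 * np.ones(d)

Z_exact = np.exp(-alpha @ mu + 0.5 * sigma2 * alpha @ alpha)
r_exact = np.sqrt(np.exp(sigma2 * eps @ eps) - 1.0)        # eq. (22)
r_lower = np.sqrt(np.exp(0.5 * sigma2 * eps @ eps) - 1.0)  # eq. (24)

# Importance sampling with the suboptimal proposal.
K = 10**6
m_eps = mu - sigma2 * (alpha + eps)
X = rng.normal(m_eps, np.sqrt(sigma2), size=(K, d))
# log of the weight exp(-alpha.x) p(x) / p~_eps(x), for Sigma = sigma2 * I
log_w = (-X @ alpha
         - 0.5 / sigma2 * np.sum((X - mu) ** 2, axis=1)
         + 0.5 / sigma2 * np.sum((X - m_eps) ** 2, axis=1))
w = np.exp(log_w)

Z_hat = w.mean()
r_hat = w.std() / w.mean()
print(Z_hat, Z_exact)   # unbiased estimate of Z
print(r_hat, r_exact)   # empirical vs exact relative error (22)
print(r_lower)          # KL-based lower bound (24), visibly not tight
```

Increasing `d` (or `eps`) in this sketch quickly makes the empirical relative error explode, in line with (23).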
T I d × d . Up to now we have formulated importance sampling for measures on an abstract probability space and provided someillustrations for densities. Let us now elevate those considerations to solutions of stochastic differential equations(SDEs) of the form d X s = b ( X s , s ) d s + σ ( X s , s ) d W s , X t = x init , (25)on the time interval s ∈ [ t, T ], 0 ≤ t < T < ∞ . Here, b : R d × [ t, T ] → R d denotes the drift coefficient, σ : R d × [ t, T ] → R d × d the diffusion coefficient, ( W s ) t ≤ s ≤ T standard d -dimensional Brownian motion, and x init ∈ R d is the (deterministic) initial condition. Our goal is to compute expectations of the form Z = E (cid:104) e −W ( X ) (cid:105) , W ( X ) = T (cid:90) t f ( X s , s )d s + g ( X T ) , (26)where f : R d × [0 , T ] → R , g : R d → R are given functions. We will usually fix the initial time to be t = 0,i.e. consider the SDE (25) on the interval [0 , T ]. For fixed initial condition x init ∈ R d , let us introduce the pathspace C = C x init ([0 , T ] , R d ) = (cid:8) X : [0 , T ] → R d | X continuous , X = x init (cid:9) , (27)equipped with the supremum norm and the corresponding Borel- σ -algebra, and denote the set of probability mea-sures on C by P ( C ).As in the previous section, the idea of importance sampling is to not sample from the original path measure P ∈ P ( C )that corresponds to paths of SDE (25), but from a different measure P u ∈ P ( C ) and weight back accordingly. Justas in (2) one then gets an unbiased estimator via Z = E (cid:20) e − W ( X u ) d P d P u ( X u ) (cid:21) , (28) Whether with X (and r, Z correspondingly) we refer to random variables in R d or solutions to SDEs should usually be clear fromthe context. P u is just a controlled version of the original one,d X us = ( b ( X us , s ) + σ ( X us , s ) u ( X us , s )) d s + σ ( X us , s ) d W s , X ut = x init . 
(29)We think of u : R d × [ t, T ] → R d as a control term steering the dynamics and note (as already hinted at by thenotation) the correspondence between u and P u . As before, our quantity of interest is the relative error, which nowdepends on the control u : r ( u ) = (cid:113) Var (cid:0) e − W ( X u ) d P d P u ( X u ) (cid:1) Z . (30)Given suitable conditions, there exists u ∗ ∈ U that brings (30), the relative error of the importance samplingestimator, to zero [30]. It turns out that there are multiple equivalent perspectives on the problem of finding sucha u ∗ , for instance by solving either a (high-dimensional) Hamilton-Jacobi-Bellman PDE, a forward-backward SDE,a stochastic optimal control problem, or a conditioning of path measures – the corresponding details regardingthose equivalences can for instance be found in [38]. Let us just relate to the last perspective, which claims that W induces a reweighted path measure Q on C via d Q d P = e −W Z , (31)assuming f and g are such that Z is finite (which we shall tacitly assume from now on). It turns out that Q = P u ∗ and we realize that the above formula is the same as in (5).Let us now bring an example that shall illustrate why variance reduction methods are indispensable in certain SDEsettings. Example 3.1 (Rare events of SDEs) . Monte Carlo estimation gets particularly challenging when considering rareevents. As a prominent example, let us consider the one-dimensional Langevin dynamics d X s = −∇ Ψ( X s ) d s + √ η d W s , X = x, (32) with double well potential Ψ( x ) = κ ( x − , κ > , and noise coefficient η > , as illustrated in Figure 1. Wesuppose that the dynamics starts in the left well and choose a function g such that e − g is concentrated in the rightwell, e.g. g ( x ) = ρ ( x − , ρ > . We are interested in computing E [exp( − g ( X T )) | X = x ] for, say, x = − . 
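A naive Euler–Maruyama simulation makes the difficulty of Example 3.1 tangible. The following sketch is our own illustration: the parameters $\eta = 2$ and $T = 5$ are chosen milder than in Figure 1 so that the run is fast, yet the empirical relative error of the uncontrolled estimator is already large.

```python
import numpy as np

rng = np.random.default_rng(2)

# Example 3.1: dX = -Psi'(X) ds + sqrt(eta) dW with Psi(x) = kappa (x^2 - 1)^2.
# Naive Monte Carlo for Z = E[exp(-g(X_T))], g(x) = rho (x - 1)^2, X_0 = -1.
kappa, rho = 5.0, 3.0
eta = 2.0          # fairly large noise; smaller eta makes the problem much harder
T, dt = 5.0, 0.005
K = 20_000

x = -np.ones(K)
for _ in range(int(T / dt)):
    drift = -4.0 * kappa * x * (x**2 - 1.0)      # -Psi'(x)
    x += drift * dt + np.sqrt(eta * dt) * rng.normal(size=K)

w = np.exp(-rho * (x - 1.0) ** 2)                # observable exp(-g(X_T))
Z_hat = w.mean()
rel_err = w.std() / w.mean()                     # empirical relative error r(0)
print(Z_hat, rel_err)  # the relative error is large: most paths stay in the left well
```

Decreasing `eta` or increasing `kappa` and `rho` in this sketch makes the transition event exponentially rarer and the relative error correspondingly larger, in line with the discussion below.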
Figure 1: Illustration of rare events in a metastable double well potential. We consider the problem described in Example 3.1 with $\kappa = 5$, $\rho = 3$ on a time horizon $T = 10$ and display the target $e^{-g}$ and the distributions of $X_T$ as well as $X^{u^*}_T$, which is controlled with the optimal importance sampling control $u^*$, yielding a time-dependent optimal potential.

To understand the difficulties associated with this sampling problem, let $p_T$ be the law of $X_T$ for some $T > 0$ and recall that the optimal change of measure is given by the (unnormalized) likelihood $dq_T / dp_T \propto \exp(-g)$, which is concentrated in the right well. However, regions where $\exp(-g)$ is strongly supported have probability close to zero under $p_T$, for $p_T$ drops to zero quickly for $x > 0$. This can be seen as follows: Let $\tau$ be the first exit time of the set $D = \{x : x \leq 0\}$. By Kramers' law [5], the mean first exit time (MFET) satisfies the large deviations asymptotics $\mathbb{E}[\tau] \asymp \exp(2\Delta\Psi/\eta)$ as $\eta \to 0$, where $\Delta\Psi$ is the energy barrier that the dynamics has to overcome to leave the set $D$, and it turns out that the MFET is independent of the initial condition $x \in D$. Therefore
$$\lim_{\eta \to 0} \eta \log \mathbb{P}(\tau < T) = -2\Delta\Psi, \qquad T \ll \mathbb{E}[\tau], \qquad (33)$$
which is a straight consequence of Kramers' law, combined with the Donsker-Varadhan large deviations principle which, for a system of the form (32), states that $\mathbb{P}(\tau < T) \asymp 1 - \exp(\lambda_0 T)$ as $T \to \infty$ and $\eta \to 0$, where $\lambda_0 \asymp -1/\mathbb{E}[\tau]$ is the principal eigenvalue of the infinitesimal generator associated with (32); see, e.g., [8]. Now, by (33), we can conclude that $p_T(x) \asymp \exp(-2\Delta\Psi/\eta)$ for $x > 0$. Since $p_T$ is essentially supported on $(-\infty, 0]$, we can approximate $\exp(-g(x))$ by a step function $\mathbf{1}_{\{x \in D^c\}}$, and it thus follows that (up to an exponentially small error) the relative error for small $\eta > 0$ can be approximated by
$$r(0) = \frac{\sqrt{\mathbb{E}[\exp(-2g(X_T))] - \mathbb{E}[\exp(-g(X_T))]^2}}{\mathbb{E}[\exp(-g(X_T))]} \qquad (34a)$$
$$\approx \sqrt{\frac{\exp(-2\Delta\Psi/\eta) - \exp(-4\Delta\Psi/\eta)}{\exp(-4\Delta\Psi/\eta)}} \approx \exp(\Delta\Psi/\eta). \qquad (34b)$$
This kind of exponential behavior is typical for rare event simulation and metastable systems like (32). So unless our terminal time $T$ is very large or the energy barrier rather small, $X_T$ is mostly supported on the left side of the well and therefore does not overlap very much with $e^{-g}$, which leads to an extremely large relative error. Note that this problem gets even more severe with growing values of $\kappa$ and $\rho$.

We have stated that, given suitable conditions, there exists $u^* \in \mathcal{U}$ that brings (30), the relative error of the importance sampling estimator, to zero. However, in practice, $u^*$ is usually not available (just as $\nu^*$ is not available in the abstract setting). Let us instead consider the setting where we have a control $u \in \mathcal{U}$ at hand. We want to investigate how the relative error (30) behaves depending on how far from optimal $u$ is. For the upcoming analysis, it will turn out that it makes sense to measure the suboptimality, and therefore the difference between $P^u$ and $P^{u^*}$, in terms of the difference $\delta := u^* - u$. The first statement is an implication of Proposition 2.2.

Corollary 3.2 (Lower bound for relative error on path space). Consider the path measures $P^u, P^{u^*} \in \mathcal{P}(\mathcal{C})$ as previously defined and let $\delta = u^* - u$. For the relative error (30) it holds
$$r^2(u) \geq \exp\left( \operatorname{KL}(P^{u^*} | P^u) \right) - 1 \qquad (35)$$
and therefore
$$r^2(u) \geq \exp\left( \frac{1}{2} \, \mathbb{E} \int_0^T |\delta(X^{u^*}_s, s)|^2 \, ds \right) - 1. \qquad (36)$$

Proof.
The first statement is just Proposition 2.2 with the abstract measures replaced by path measures. The second statement then follows from Girsanov's theorem as stated in Lemma A.2.

One can of course also transfer the more general bound from Proposition 2.7 to path measures; however, the computation of the quantities $m$ and $M$ seems even more difficult and impractical than in the density case. In order to still find tighter and more applicable bounds, let us now identify an exact formula for the relative error in the SDE setting.

Proposition 3.3 (Formula for path space relative error). Let $X^u_s$ be the solution to SDE (29) and let $\delta = u^* - u$. Then the relative error (30) is given by
$$r^2(u) = \mathbb{E}\left[ \exp\left( -\int_0^T |\delta(X^u_s, s)|^2 \, ds + 2 \int_0^T \delta(X^u_s, s) \cdot dW_s \right) \right] - 1, \qquad (37)$$
or equivalently
$$r^2(u) = \mathbb{E}\left[ \exp\left( \int_0^T |\delta(X^{u+2\delta}_s, s)|^2 \, ds \right) \right] - 1. \qquad (38)$$

Proof. The proof can be found in Appendix A.2. Alternatively, the second statement follows as well from Proposition 3.10.
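Formula (37) can be checked numerically in a toy setting where everything is explicit (our own illustration, not from the article): take $b = 0$, $\sigma = I$, $f = 0$ and $g(x) = \alpha \cdot x$, so that $\mathcal{W}(X) = \alpha \cdot X_T$ and the optimal control is the constant $u^* = -\alpha$. For a constant perturbation $\delta$, the exponent in (38) is deterministic and $r^2(u) = e^{|\delta|^2 T} - 1$ exactly; constant controls also allow exact (discretization-free) simulation.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy path-space setting: b = 0, sigma = I, f = 0, g(x) = alpha.x, so the
# optimal control is the constant u* = -alpha. For a constant delta = u* - u,
# eq. (38) gives exactly r(u)^2 = exp(|delta|^2 T) - 1.
d, T = 2, 1.0
x0 = np.zeros(d)
alpha = np.array([1.0, -0.5])
delta = 0.5 * np.ones(d)
u = -alpha - delta                       # suboptimal control u = u* - delta

# Constant controls allow exact simulation: X^u_T = x0 + u T + W_T.
K = 10**6
WT = rng.normal(0.0, np.sqrt(T), size=(K, d))
XT = x0 + u * T + WT
# Girsanov weight dP/dP^u(X^u) = exp(-u.W_T - |u|^2 T / 2); full weight
# includes the observable exp(-alpha.X_T).
log_w = -XT @ alpha - WT @ u - 0.5 * (u @ u) * T
w = np.exp(log_w)

Z_exact = np.exp(-alpha @ x0 + 0.5 * (alpha @ alpha) * T)
r_exact = np.sqrt(np.exp((delta @ delta) * T) - 1.0)

print(w.mean(), Z_exact)            # unbiased despite the suboptimal control
print(w.std() / w.mean(), r_exact)  # empirical relative error matches (38)
```

With $\delta = 0$ the weights in this sketch become constant, reproducing the zero-variance property of the optimal control.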
Remark . We note that in formula (37) the forward process is controlled by u , whereas in (38) it is controlledby u + 2 δ = 2 u ∗ − u , which of course is usually not available in practice. In the upcoming Corollary 3.6 we will seehow we can still make use of the formula. Remark . Note that Proposition 3.3 entails Corollary 3.2 since E P u (cid:32) d P u ∗ d P u (cid:33) = E P u ∗ (cid:34) d P u ∗ d P u (cid:35) = E exp T (cid:90) | δ ( X u ∗ s , s ) | d s + T (cid:90) δ ( X u ∗ s , s ) · d W s (39a) ≥ exp E T (cid:90) | δ ( X u ∗ s , s ) | d s . (39b)Without the change of the measures as in (39a) we obtain E P u (cid:32) d P u ∗ d P u (cid:33) ≥ exp E − T (cid:90) | δ ( X us , s ) | d s , (40)where now the process is controlled by u , however this expression has a negative sign in the exponential and istherefore rather useless. The bound E P u (cid:32) d P u ∗ d P u (cid:33) = E exp T (cid:90) | δ ( X u +2 δs , s ) | d s ≥ exp E T (cid:90) | δ ( X u +2 δs , s ) | d s (41)on the other hand, seems more useful.The following corollary derives bounds from the previous Proposition 3.3 that might be useful in practice. Corollary 3.6 (Bounds for path space relative error) . Let again δ = u ∗ − u and let us assume there exist functions h , h : [0 , T ] → R such that h ( t ) ≤ | δ ( x, t ) | ≤ h ( t ) (42) for all x ∈ R d , t ∈ [0 , T ] , then exp T (cid:90) h ( s )d s − ≤ r ( u ) ≤ exp T (cid:90) h ( s )d s − . (43) In particular, if (cid:101) ε ≤ | δ i ( x, t ) | ≤ (cid:101) ε (44) for all components i ∈ { , . . . , d } and for all ( x, t ) ∈ R d × [0 , T ] with (cid:101) ε , (cid:101) ε ∈ R , then (cid:16) e d (cid:101) ε T − (cid:17) ≤ r ( u ) ≤ (cid:16) e d (cid:101) ε T − (cid:17) . (45) Proof.
Both statements follow directly from equation (38) in Proposition 3.3 by noting that the dependence on the stochastic process, and therefore the expectation, disappears if we consider bounds on $\delta$ that do not depend on $x$. Two alternative proofs of the corresponding statements can be found in Appendix A.2.

Remark 3.7. Note that bounding the suboptimality $\delta$ uniformly in $x$ can be a strong assumption for practical applications, as $\delta$ might often vary substantially in $x$. Still, even such conservative bounds often yield lower bounds that render importance sampling a very challenging endeavor. On the other hand, it seems hard to make $x$-dependent bounds on $\delta$ useful, due to potentially very complex stochastic dynamics. Let us further note that the bounds in (43) imply that errors made at different points in time accumulate: it does not matter whether they have been made at the beginning or the end of a trajectory, and neither can they be compensated at later stages.

Another upper bound on the relative error can be derived by means of the Hölder inequality.

Proposition 3.8 (Another bound for path space relative error). Let $\delta = u^* - u$. For the relative error (30) it holds
$$r(u) \le \sqrt{\mathbb{E}\left[\exp\left((1+\sqrt{2})^2\int_0^T |\delta(X^u_s,s)|^2\,\mathrm{d}s\right)\right]^{\sqrt{2}-1} - 1}. \qquad (46)$$

Proof.
See Appendix A.2.
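For a quick numerical feel for the two-sided bound (45) from Corollary 3.6, it can be evaluated directly. The following sketch is our own illustration (not part of the paper's code) and shows how even small componentwise errors $\widetilde{\varepsilon}_i$ translate into large relative errors as the dimension $d$ grows:

```python
import math

def relative_error_bounds(eps1, eps2, d, T):
    """Lower and upper bounds (45) on the relative error r(u), valid when
    the componentwise suboptimality satisfies eps1 <= |delta_i(x, t)| <= eps2."""
    lower = math.sqrt(math.expm1(d * eps1**2 * T))
    upper = math.sqrt(math.expm1(d * eps2**2 * T))
    return lower, upper

# Even a modest componentwise error accumulates exponentially in d (and in T):
for d in (1, 10, 100):
    lo, up = relative_error_bounds(0.1, 0.2, d, T=1.0)
    print(f"d = {d:3d}: {lo:.3f} <= r(u) <= {up:.3f}")
```

Note the use of `expm1` to keep the lower bound accurate when $d\,\widetilde{\varepsilon}_1^2 T$ is tiny.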
Remark 3.9. Some intuition about the quality of this bound can be gained by assuming, for instance, that $\delta(x,t) = \varepsilon$ with a constant vector $\varepsilon = (\widetilde{\varepsilon},\dots,\widetilde{\varepsilon})^\top \in \mathbb{R}^d$. Then this bound yields $r(u) \le \sqrt{\exp\left((1+\sqrt{2})\,d\,\widetilde{\varepsilon}^2 T\right) - 1}$, which is less tight than the bound (45) in Corollary 3.6. Nevertheless the bound is useful in that it only depends on the stochastic process controlled by $u$, which is a known quantity.

Another means of studying the relative error $r(u)$ are partial differential equations (PDEs). We will formulate a PDE for the relative error (4), which might be helpful for future analysis and by which we can rederive bounds from the previous section. By a slight generalization of [51], one can identify a PDE for the $u$-dependent second moment (conditioned on $X^u_t = x$),
$$M^u(x,t) = \mathbb{E}\left[e^{-2W(X^u)}\left(\frac{\mathrm{d}\mathbb{P}}{\mathrm{d}\mathbb{P}_u}(X^u)\right)^2 \,\middle|\, X^u_t = x\right], \qquad (47)$$
namely
$$\left(\partial_t + L - \sigma u(x,t)\cdot\nabla - 2f(x,t) + |u(x,t)|^2\right)M^u(x,t) = 0, \quad (x,t) \in \mathbb{R}^d \times [0,T), \qquad (48a)$$
$$M^u(x,T) = e^{-2g(x)}, \quad x \in \mathbb{R}^d, \qquad (48b)$$
where $L = \frac{1}{2}(\sigma\sigma^\top)(x,t):\nabla^2 + b(x,t)\cdot\nabla$ is the infinitesimal generator associated to the SDE (25). Defining $\delta = u^* - u$, this then immediately leads to the PDE
$$\left(\partial_t + L + \sigma\left(\sigma^\top\nabla V(x,t) + \delta(x,t)\right)\cdot\nabla - 2f(x,t) + \left|\sigma^\top\nabla V(x,t) + \delta(x,t)\right|^2\right)M^u(x,t) = 0, \quad (x,t) \in \mathbb{R}^d \times [0,T), \qquad (49a)$$
$$M^u(x,T) = e^{-2g(x)}, \quad x \in \mathbb{R}^d, \qquad (49b)$$
which describes the second moment of suboptimal importance sampling. It can be shown that for $\delta = 0$, i.e. under the optimal control $u = u^*$, we indeed recover the zero-variance property of the corresponding importance sampling estimator, see Proposition A.5 in the appendix. In the following statement we construct the PDE that is relevant for the relative error $r(u)$ and re-derive a formula that we have already seen before.
Proposition 3.10 (PDE for the relative error). Let $\delta = u^* - u$. We consider the second moment as in (47) and the conditional expectation $\psi(x,t) = \mathbb{E}\left[e^{-W(X)} \mid X_t = x\right]$. Then the function $h^u \colon \mathbb{R}^d \times [0,T] \to \mathbb{R}$ defined by
$$h^u(x,t) = \frac{M^u(x,t)}{\psi^2(x,t)} \qquad (50)$$
solves the PDE
$$\left(\partial_t + L^{u+2\delta} + |\delta(x,t)|^2\right)h^u(x,t) = 0, \quad (x,t) \in \mathbb{R}^d \times [0,T), \qquad (51a)$$
$$h^u(x,T) = 1, \quad x \in \mathbb{R}^d, \qquad (51b)$$
with $L^{u+2\delta} := L + \sigma(u+2\delta)\cdot\nabla$. This then implies
$$h^u(x,t) = \mathbb{E}\left[\exp\left(\int_t^T |\delta(X^{u+2\delta}_s,s)|^2\,\mathrm{d}s\right) \,\middle|\, X^{u+2\delta}_t = x\right]. \qquad (52)$$

Proof.
We plug the ansatz
$$M^u(x,t) = h^u(x,t)\,\psi^2(x,t) = h^u(x,t)\,e^{-2V(x,t)} \qquad (53)$$
into the PDE (49a). Noting that
$$\tfrac{1}{2}(\sigma\sigma^\top):\nabla^2\left(h^u e^{-2V}\right) = \tfrac{1}{2}(\sigma\sigma^\top):\left(\nabla\left(\nabla h^u\,e^{-2V} - 2h^u\nabla V\,e^{-2V}\right)\right) \qquad (54a)$$
$$= \tfrac{1}{2}e^{-2V}\left((\sigma\sigma^\top):\nabla^2 h^u - 4\,\sigma\sigma^\top\nabla V\cdot\nabla h^u + 4h^u|\sigma^\top\nabla V|^2 - 2h^u(\sigma\sigma^\top):\nabla^2 V\right), \qquad (54b)$$
we get the PDE
$$-2h^u\underbrace{\left(\partial_t V + LV - \tfrac{1}{2}|\sigma^\top\nabla V|^2 + f\right)}_{=0} + \partial_t h^u + Lh^u - \sigma\sigma^\top\nabla V\cdot\nabla h^u + \sigma\delta\cdot\nabla h^u + |\delta|^2 h^u = 0, \qquad (55)$$
from which the statement follows with the identity $u^* = -\sigma^\top\nabla V$ and a specific Hamilton-Jacobi-Bellman equation that is for instance stated in [38, Problem 2.2]. The probabilistic representation (52) follows immediately from the Feynman-Kac formula [41, Theorem 1.3.17].

Remark 3.11. First note that $h^u$ from Proposition 3.10 is related to the relative error (30) via $r(u) = \sqrt{h^u(x_{\mathrm{init}},0) - 1}$. One might be tempted to believe that the PDE (51) does not depend on $f$ and $g$. This is of course not true, and we should note that the PDE depends on $u^*$, which again depends on $f$ and $g$. Finally, note that with (52) we recover the result (38) from Proposition 3.3.

3.3 Small noise diffusions

A prominent application of importance sampling in stochastic processes can be found in the context of small noise diffusions and rare event simulations (relating to Example 3.1, see also [17, 50, 51, 53]). We model small noise with the smallness parameter $\eta > 0$ in
$$\mathrm{d}X^\eta_s = b(X^\eta_s,s)\,\mathrm{d}s + \sqrt{\eta}\,\widetilde{\sigma}(X^\eta_s,s)\,\mathrm{d}W_s, \qquad X^\eta_t = x_{\mathrm{init}}, \qquad (56)$$
and we want to compute quantities like
$$\psi^\eta(x,t) = \mathbb{E}\left[e^{-\frac{1}{\eta}W(X^\eta)} \,\middle|\, X^\eta_t = x\right]. \qquad (57)$$
As $\eta$ gets smaller, it becomes harder to estimate $\psi^\eta(x,t)$ via Monte Carlo methods, as the variance grows exponentially in $\eta^{-1}$.
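The exponential blow-up of the naive relative error in $\eta^{-1}$ can already be seen in a closed-form toy computation. The following sketch is our own illustration (not from the paper) and assumes $W(X^\eta) = g(X^\eta_T)$ with $g(x) = (x-1)^2/2$ and $X^\eta_T = \sqrt{\eta}\,W_T$, for which both moments of the naive estimator are Gaussian integrals:

```python
import math

# Toy model (assumption, not from the paper): X_T^eta = sqrt(eta) W_T ~ N(0, eta T)
# and g(x) = (x - 1)^2 / 2. Then, using E[exp(-(X - m)^2 / (2 v))] for Gaussian X,
#   E[exp(-g/eta)]   = exp(-1 / (2 eta (1 + T))) / sqrt(1 + T),
#   E[exp(-2 g/eta)] = exp(-1 / (eta (1 + 2 T))) / sqrt(1 + 2 T).
T = 1.0

def naive_relative_error(eta):
    m1 = math.exp(-1.0 / (2.0 * eta * (1.0 + T))) / math.sqrt(1.0 + T)
    m2 = math.exp(-1.0 / (eta * (1.0 + 2.0 * T))) / math.sqrt(1.0 + 2.0 * T)
    return math.sqrt(m2 / m1**2 - 1.0)

# The relative error explodes like exp(c / eta) as eta decreases:
for eta in (1.0, 0.5, 0.25, 0.125):
    print(eta, naive_relative_error(eta))
```

Here the exponent $2\gamma_1 - \gamma_2 = T/((1+T)(1+2T)) > 0$, in line with the Varadhan-type asymptotics discussed next.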
To be more precise, by Varadhan's lemma [14, Theorem 4.3.1], using the quantities
$$\gamma_1 := -\lim_{\eta\to 0}\eta\log\mathbb{E}\left[e^{-\frac{1}{\eta}W(X^\eta)}\right] \quad\text{and}\quad \gamma_2 := -\lim_{\eta\to 0}\eta\log\mathbb{E}\left[e^{-\frac{2}{\eta}W(X^\eta)}\right], \qquad (58)$$
one gets for the relative error of the uncontrolled process
$$r(0) = \sqrt{e^{\frac{1}{\eta}\left(2\gamma_1 - \gamma_2 + o(1)\right)} - 1} \qquad (59)$$
as $\eta \to 0$. (To be consistent with the notation from before, we could hide the smallness parameter $\eta$ in the diffusion coefficient, i.e. $\sigma = \sqrt{\eta}\,\widetilde{\sigma}$. Then the HJB equation that provides the zero-variance control is $\left(\partial_t + \frac{\eta}{2}(\widetilde{\sigma}\widetilde{\sigma}^\top):\nabla^2 + b\cdot\nabla\right)\widetilde{V} - \frac{1}{2}\left|\sqrt{\eta}\,\widetilde{\sigma}^\top\nabla\widetilde{V}\right|^2 + \frac{1}{\eta}f(x,t) = 0$, $\widetilde{V}(x,T) = \frac{1}{\eta}g(x)$, and the relation $\widetilde{V} = V^\eta/\eta$ yields HJB equation (62).) By Jensen's inequality we have $2\gamma_1 > \gamma_2$ unless $W$ is a.s. constant, but we note that even for $2\gamma_1 = \gamma_2$ the relative error explodes in the limit $\eta \to 0$.

Let us again consider a controlled process
$$\mathrm{d}X^{u,\eta}_s = \left(b(X^{u,\eta}_s,s) + \widetilde{\sigma}(X^{u,\eta}_s,s)\,u(X^{u,\eta}_s,s)\right)\mathrm{d}s + \sqrt{\eta}\,\widetilde{\sigma}(X^{u,\eta}_s,s)\,\mathrm{d}W_s, \qquad X^{u,\eta}_t = x_{\mathrm{init}}, \qquad (60)$$
and realize that the optimal importance sampling control that yields zero variance,
$$u^* = -\widetilde{\sigma}^\top\nabla V^\eta = \eta\,\widetilde{\sigma}^\top\nabla\log\psi^\eta, \qquad (61)$$
can be computed via the HJB equation
$$\left(\partial_t + \frac{\eta}{2}(\widetilde{\sigma}\widetilde{\sigma}^\top)(x,t):\nabla^2 + b(x,t)\cdot\nabla\right)V^\eta(x,t) - \frac{1}{2}\left|(\widetilde{\sigma}^\top\nabla V^\eta)(x,t)\right|^2 + f(x,t) = 0, \qquad V^\eta(x,T) = g(x). \qquad (62)$$
Since solving this PDE is notoriously difficult (especially in high dimensions), various approximations have been suggested that lead to estimators which enjoy log-efficiency or a vanishing relative error in the regime of vanishing $\eta$. However, since log-efficient estimators still often perform badly in practice (as for instance discussed in [2, 27]), in [53] it is suggested to replace $u^*$ by the vanishing viscosity approximation $u^0$ based on the corresponding HJB equation with $\eta = 0$:
$$u^0 = -\widetilde{\sigma}^\top\nabla V^0, \qquad (63)$$
where $V^0$ is the solution to
$$\left(\partial_t + b(x,t)\cdot\nabla\right)V^0(x,t) - \frac{1}{2}\left|(\widetilde{\sigma}^\top\nabla V^0)(x,t)\right|^2 + f(x,t) = 0, \qquad V^0(x,T) = g(x). \qquad (64)$$
While it can be shown that, given some regularity assumptions on $f$ and $g$, it holds [53]
$$\lim_{\eta\to 0} r(u^0) = 0, \qquad (65)$$
a large relative error for a small but fixed $\eta > 0$ is still possible. In fact, with $\delta = u^* - u^0$, Propositions 3.3 and 3.10 show that
$$r(u^0) = \sqrt{\mathbb{E}\left[\exp\left(\int_0^T |u^* - u^0|^2\left(X^{2u^*-u^0}_s,s\right)\mathrm{d}s\right)\right] - 1}. \qquad (66)$$
Even though this expression converges to zero as $\eta \to 0$ (since $V^\eta \to V^0$ and $u^* \to u^0$ [21]), we expect an exponential dependence on the time $T$ and the dimension $d$ for any fixed $\eta > 0$. To see this, consider the asymptotic expansion
$$\nabla V^\eta = \nabla V^0 + \eta\nabla v_1 + o(\eta), \qquad (67)$$
uniformly on all compact subsets of $\mathbb{R}^d \times (0,T)$, where $v_1$ solves the PDE stated in Appendix A.3.2.
As a consequence, we can write
$$|\nabla V^\eta - \nabla V^0| = |\eta\nabla v_1 + o(\eta)| = \eta\,|\nabla v_1 + o(1)| \qquad (68)$$
and
$$r(u^0) = \sqrt{\mathbb{E}\left[\exp\left(\eta^2\int_0^T \left|(\widetilde{\sigma}^\top\nabla v_1)(X^{2u^*-u^0}_s,s)\right|^2\mathrm{d}s + o(\eta^2)\right)\right] - 1}. \qquad (69)$$
Specifically, if there exist constants $C_1, C_2 > 0$ such that $C_1 < |(\widetilde{\sigma}^\top\nabla v_1)(x,t)| < C_2$ for all $(x,t) \in \mathbb{R}^d \times (0,T)$, then the relative error grows exponentially in the time horizon, as
$$\sqrt{e^{\eta^2 C_1^2 T + o(\eta^2)} - 1} \;\le\; r(u^0) \;\le\; \sqrt{e^{\eta^2 C_2^2 T + o(\eta^2)} - 1}. \qquad (70)$$
However, the quantity $\nabla v_1$ can be strongly $x$-dependent, as is illustrated with a numerical example in Section 4.4.

Remark 3.12. The above considerations show that the relative error is potentially only small if $\eta$ is (much) smaller than $(C_2\sqrt{T})^{-1}$. This can be compared to equation (5.3) in [51] and in particular to [18], where a concrete example is constructed for which the second moment can be lower bounded by $e^{-\frac{C_1}{\eta} + (T-K)C_2}$ for constants $C_1, C_2, K > 0$, i.e. the time $T$ and the smallness parameter $\eta$ compete. We illustrate the degeneracy with growing $T$ for a toy example in Figure 7.

4 Numerical examples
In this section we provide numerical examples that illustrate some of the formulas and bounds derived in the previous sections. We particularly demonstrate that importance sampling can be very sensitive to small perturbations of the optimal proposal measure. Here we focus on path space measures and provide several examples of importance sampling of diffusions. The code can be found at https://github.com/lorenzrichter/suboptimal-importance-sampling.

4.1 Ornstein-Uhlenbeck process

An example where the optimal importance sampling control is analytically computable is the following. Consider the $d$-dimensional Ornstein-Uhlenbeck process
$$\mathrm{d}X_s = AX_s\,\mathrm{d}s + B\,\mathrm{d}W_s, \qquad X_0 = 0, \qquad (71)$$
and its controlled version
$$\mathrm{d}X^u_s = \left(AX^u_s + Bu(X^u_s,s)\right)\mathrm{d}s + B\,\mathrm{d}W_s, \qquad X^u_0 = 0, \qquad (72)$$
where $A, B \in \mathbb{R}^{d\times d}$ are given matrices. In (26) we set $f = 0$ and $g(x) = \alpha\cdot x$ for a fixed vector $\alpha \in \mathbb{R}^d$, i.e. we want to estimate the quantity
$$Z = \mathbb{E}\left[e^{-\alpha\cdot X_T}\right]. \qquad (73)$$
As shown in [38], the zero-variance importance sampling control is given by
$$u^*(x,t) = -B^\top e^{A^\top(T-t)}\alpha. \qquad (74)$$
We choose $A = -\mathrm{Id}_{d\times d} + (\xi_{ij})_{1\le i,j\le d}$ and $B = \mathrm{Id}_{d\times d} + (\xi_{ij})_{1\le i,j\le d}$, where $\xi_{ij} \sim \mathcal{N}(0,\sigma^2)$ are i.i.d. random coefficients that are held fixed throughout the simulation. We set $T = 1$, $\sigma = 1$, $\alpha = (1,\dots,1)^\top$ and first consider the perturbed control
$$u = u^* + (\varepsilon,\dots,\varepsilon)^\top. \qquad (75)$$
In the two left panels of Figure 2 we display a Monte Carlo estimation of the relative error (30) using $K$ samples and compare it to the formulas from Corollary 3.6 and to the bound from Corollary 3.2, once with varying perturbation strength $\varepsilon$ and once with varying dimension $d$. In both cases the simulations agree with our formula, even though for moderate to large deviations from optimality the estimated values of $r$ are observed to fluctuate.
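Because the perturbation in (75) is constant, the suboptimality $\delta = u^* - u$ is constant as well, and formula (37) involves only the terminal value $W_T$ of the Brownian motion. The following minimal sketch (ours, not taken from the linked repository) verifies the resulting closed-form relative error $\sqrt{e^{d\varepsilon^2 T}-1}$ from Corollary 3.6 by direct sampling:

```python
import numpy as np

rng = np.random.default_rng(0)

d, T, eps, K = 2, 1.0, 0.3, 200_000

# For u = u* + (eps, ..., eps)^T the suboptimality delta = -(eps, ..., eps)^T is
# constant, so the stochastic integral in formula (37) reduces to -2 eps . W_T,
# with W_T ~ N(0, T I_d); no path simulation is needed:
W_T = rng.normal(scale=np.sqrt(T), size=(K, d))
second_moment = np.mean(np.exp(-d * eps**2 * T - 2.0 * eps * W_T.sum(axis=1)))
r_sampled = np.sqrt(second_moment - 1.0)

# Corollary 3.6 (or Corollary A.3) predicts r = sqrt(exp(d eps^2 T) - 1):
r_exact = np.sqrt(np.expm1(d * eps**2 * T))
print(r_sampled, r_exact)  # the two values should be close
```

Larger values of $\varepsilon$, $d$, or $T$ make the sampled estimate itself fluctuate heavily, which is precisely the robustness issue discussed in the text.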
Figure 2: Sampled relative error with varying constant or time-dependent perturbation $\varepsilon$ and dimension $d$, compared to the formulas derived in Corollary 3.6 and to the lower bound from Corollary 3.2.

Let us now look at an example with a time-dependent perturbation of the optimal control. More specifically, we consider a perturbation that is active only for a certain amount of time $s < T$, namely
$$u(x,t) = u^*(x,t) + (\varepsilon,\dots,\varepsilon)^\top\,\mathbf{1}_{[0,s]}(t), \qquad (76)$$
where in our experiment we choose $s = 0.2$. In the two right panels of Figure 2 we display the same comparisons as before, however now using formula (43) in order to account for the time-dependent nature of the perturbation.

4.2 Double well potential
For strongly metastable systems, Monte Carlo estimation is notoriously difficult and variance reduction methods are often indispensable. Importance sampling seems like a method of choice, but we want to illustrate that one has to be very careful with the design of the importance sampling control. As in Example 3.1, let us consider the Langevin SDE
$$\mathrm{d}X_s = -\nabla\Psi(X_s)\,\mathrm{d}s + B\,\mathrm{d}W_s, \qquad X_0 = x_{\mathrm{init}}, \qquad (77)$$
in $d = 1$, where $B \in \mathbb{R}$ is the diffusion coefficient and $\Psi(x) = \kappa(x^2-1)^2$ is a double well potential with $\kappa > 0$. We start at $x_{\mathrm{init}} = -1$ and choose $f = 0$ and $g(x) = \rho(x-1)^2$, where $\rho > 0$; the terminal time is set to $T = 1$. Note that choosing higher values for $\rho$ and $\kappa$ accentuates the metastable features, making sample-based estimation of $\mathbb{E}[\exp(-g(X_T))]$ more challenging. For an illustration, the two top panels of Figure 3 show the potential $\Psi$ and the weight $e^{-g(x)}$ from (31) for different values of $\rho$ and $\kappa$ and for $B = 1$. We also plot the 'optimally tilted potentials' $\Psi^* = \Psi + BB^\top V$, noting that $-\nabla\Psi^* = -\nabla\Psi + Bu^*$. In the bottom left panel we show the relative error of the naive estimator depending on different values of $\rho$ and $\kappa$.

Figure 3: Top panels: Double well potentials and optimal tiltings as well as additive perturbations for different values of $\rho$ and $\kappa$. Bottom left: Relative error of the naive Monte Carlo estimator for different values of $\rho$ and $\kappa$. Bottom right: Relative error depending on the multiplicative perturbation factor $\zeta$.

As before, let us perturb the optimal control, this time both in an additive and a multiplicative way, namely
$$u = u^* + \varepsilon = -B^\top\nabla\left(V - B^{-\top}\varepsilon\cdot x\right) \quad\text{and}\quad u = \zeta u^*, \qquad (78)$$
where $\varepsilon \in \mathbb{R}^d$ and $\zeta \in \mathbb{R}$ specify the perturbation strengths. In the bottom right panel of Figure 3 we show the relative error for the multiplicative perturbation and see that for higher values of $\rho$ and $\kappa$ the exponential divergence becomes more severe, demonstrating that the robustness issues of importance sampling are particularly present in metastable settings. Let us now consider perturbations depending either on time or space,
$$u(x,t) = u^*(x,t) + \varepsilon\sin(\alpha t) \quad\text{and}\quad u(x,t) = u^*(x,t) + \varepsilon\sin(\alpha x), \qquad (79)$$
as illustrated in Figure 4 with $\alpha = 50$.
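To reproduce the qualitative behavior of the naive estimator from the bottom left panel of Figure 3, a plain Euler-Maruyama scheme suffices. The following is our own sketch (step size and sample size are illustrative choices, not those of the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

def naive_relative_error(kappa, rho, B=1.0, T=1.0, n_steps=200, K=50_000):
    """Euler-Maruyama estimate of the naive relative error for
    Z = E[exp(-g(X_T))] with Psi(x) = kappa (x^2 - 1)^2 and g(x) = rho (x - 1)^2,
    started in the left well at x = -1."""
    dt = T / n_steps
    X = np.full(K, -1.0)
    for _ in range(n_steps):
        # drift -Psi'(x) = -4 kappa x (x^2 - 1)
        X += -4.0 * kappa * X * (X**2 - 1.0) * dt \
             + B * np.sqrt(dt) * rng.standard_normal(K)
    weights = np.exp(-rho * (X - 1.0) ** 2)
    return np.std(weights) / np.mean(weights)  # r(0) = sqrt(E[w^2]/E[w]^2 - 1)

for kappa, rho in [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]:
    print(kappa, rho, naive_relative_error(kappa, rho))
```

For larger $\kappa$ and $\rho$ the estimate is dominated by the rare trajectories that cross the barrier, so the sampled relative error itself becomes unreliable, mirroring the discussion in the text.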
Figure 4: Left: Optimal importance sampling control and time perturbation for two different values of $x$. Middle: Optimal importance sampling control and space perturbation for two different values of $t$. Right: Relative error of suboptimal importance sampling estimators depending on the perturbation strength $\varepsilon$; here, the dashed line refers to the exact formula (80).

In the former case we can analytically compute the relative error due to Corollary 3.6 to be
$$r(\varepsilon) = \sqrt{\exp\left(\varepsilon^2\left(\frac{T}{2} - \frac{\sin(2\alpha T)}{4\alpha}\right)\right) - 1}. \qquad (80)$$
Let us again illustrate how the relative error depends on the perturbation strength $\varepsilon$. In the right panel of Figure 4 we can see the agreement of the sampled version with formula (80) when considering the time-dependent perturbation. We do not have a formula in the case of a space-dependent perturbation; however, we can still observe the exponential dependence on the perturbation strength in the estimated relative error, which is expected for instance from formulas (36) and (37).

4.3 Hitting times

The suboptimal importance sampling bounds from Section 3 can be transferred to problems that involve a random stopping time $\tau$ rather than a fixed time horizon $T$, where mostly $\tau^u = \inf\{t > 0 : X^u_t \notin \mathcal{D}\}$ is defined as the first exit time from a bounded domain $\mathcal{D} \subset \mathbb{R}^d$. However, one has to be careful with applying our formulas and bounds from above, as $\tau^u$ itself depends on the law of the process. For illustration, let us consider a one-dimensional toy example, where the dynamics is a scaled Brownian motion
$$X_t = \sqrt{2}\,W_t \qquad (81)$$
and we choose $f = 1$, $g = 0$ in (26), such that
$$Z = \mathbb{E}\left[e^{-\tau}\right]. \qquad (82)$$
By noting that $\psi(x) = \mathbb{E}[e^{-\tau} \mid X_0 = x]$ fulfills the boundary value problem
$$(\Delta - 1)\,\psi(x) = 0, \quad x \in \mathcal{D}, \qquad (83a)$$
$$\psi(x) = 1, \quad x \in \partial\mathcal{D}, \qquad (83b)$$
we can compute the optimal zero-variance importance sampling control to be
$$u^*(x) = \sqrt{2}\,\nabla\log\psi(x) = \sqrt{2}\,\frac{1 - e^{-2x}}{e^{-2x} + 1}. \qquad (84)$$
In our experiment, we again perturb the optimal control via
$$u = u^* + \varepsilon. \qquad (85)$$
We denote by $\tau = \tau^0$ the hitting time of the uncontrolled process $X_t$. The analysis from Section 3 carries over when $T$ is replaced by the random time $\tau$ (which we leave for the reader to check), namely
$$r(u) = \sqrt{\mathbb{E}\left[e^{\varepsilon^2\tau^{2u^*-u}}\right] - 1} \;\ge\; \sqrt{e^{\varepsilon^2\,\mathbb{E}\left[\tau^{2u^*-u}\right]} - 1}, \qquad (86)$$
where it is essential that $\tau^{2u^*-u}$ refers to the hitting time of the process $X^{2u^*-u}_t$. We applied Jensen's inequality in the last expression and note that naively assuming
$$r(u) \approx \sqrt{e^{\varepsilon^2\,\mathbb{E}[\tau]} - 1} \qquad (87)$$
is usually wrong. Figure 5 compares the sampled relative error with the exact formula, the lower bound in (86) and the wrong expression (87).

Figure 5: Relative error of a quantity involving a random stopping time compared to the exact formula, a lower bound as well as a naive, but usually wrong approximation.
Remark . Let us note again that estimating quantities involving hitting times gets particularly challenging inrare event settings, where the expected hitting time might become very large, cf. Example 3.1. The relation (86)for the relative error then indicates that Monte Carlo estimation becomes especially difficult.
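A minimal simulation of the hitting-time toy example can sanity-check $Z = \mathbb{E}[e^{-\tau}]$ against the boundary value problem (83). Note that the text leaves the domain implicit; the sketch below (ours) assumes for concreteness $\mathcal{D} = (-1,1)$ and $X_0 = 0$, for which (83) is solved by $\psi(x) = \cosh(x)/\cosh(1)$:

```python
import numpy as np

rng = np.random.default_rng(2)

def estimate_Z(K=5_000, dt=1e-3, x0=0.0, bound=1.0):
    """Monte Carlo estimate of Z = E[exp(-tau)] for X_t = sqrt(2) W_t,
    where tau is the first exit time from the assumed domain (-bound, bound)."""
    X = np.full(K, x0)
    tau = np.zeros(K)
    alive = np.ones(K, dtype=bool)
    while alive.any():
        n = alive.sum()
        X[alive] += np.sqrt(2.0 * dt) * rng.standard_normal(n)
        tau[alive] += dt
        alive &= np.abs(X) < bound
    return np.exp(-tau).mean()

Z_mc = estimate_Z()
Z_exact = 1.0 / np.cosh(1.0)
print(Z_mc, Z_exact)  # the estimate should be close to 1/cosh(1) ≈ 0.648
```

The small systematic gap stems from the time discretization of the exit time, which overestimates $\tau$ slightly; it shrinks with the step size `dt`.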
4.4 Small noise diffusion

As an example for a small noise diffusion, we consider a modification of a one-dimensional toy example that has been proposed in [53]. We take the scaled Brownian motion
$$X^\eta_s = \sqrt{\eta}\,W_s, \qquad X^\eta_0 = 0, \qquad (88)$$
and want to compute
$$\mathbb{E}\left[e^{-\frac{1}{\eta}g(X^\eta_T)}\right] \qquad (89)$$
with
$$g(x) = \alpha\left(1 - \frac{|x|}{\sqrt{\alpha}}\right)^2 \qquad (90)$$
for $\alpha > 0$. One readily sees that
$$V^0(x,t) = \frac{\alpha\left(1 - \frac{|x|}{\sqrt{\alpha}}\right)^2}{2(T-t)+1} \qquad (91)$$
is the unique viscosity solution to the deterministic problem (64); we refer to [22] for a discussion of the theory of viscosity solutions. Since an explicit solution $V^\eta(x,t)$ to the second-order HJB equation (62) is not available, we approximate it with finite differences. In Figure 6 we show the corresponding controls $u^0(x,t) = -\partial_x V^0(x,t)$ and $u^*(x,t) = -\partial_x V^\eta(x,t)$ for different values of the noise coefficient $\eta$.

Figure 6: For a small noise diffusion problem we display the optimal control and the control resulting from the zero-noise approximation for different noise scalings $\eta$.

In the middle panel of Figure 7 we show the relative error depending on the noise parameter $\eta$. Unlike one could expect from (70), it seems not to grow exponentially in $\eta$, which can be explained by looking at the exponentiated $L^2$ error, $\exp\left(\mathbb{E}\left[\int_0^T |u^* - u^0|^2(X^{u^0}_s,s)\,\mathrm{d}s\right]\right)$, which we plot in the left panel. The observation that this quantity does not grow exponentially seems to be rooted in the fact that the suboptimality $\delta = u^* - u^0$ is very different for different values of $x$. If we vary $T$, however, we can observe an exponential dependence on the time horizon, as displayed in the right panel of Figure 7, again in accordance with the considerations in Section 3.3.

Figure 7: Small noise diffusions with vanishing noise coefficient $\eta$. Left: Exponential of the $L^2$ error between $u^*$ and $u^0$ depending on $\eta$ for $T = 1$. Middle: Relative importance sampling error depending on $\eta$. Right: Relative importance sampling error depending on $T$ for a fixed value of $\eta$.
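The closed form (91) can be sanity-checked numerically against the Hopf-Lax representation of the viscosity solution of the first-order problem (64) (here with $b = 0$, $f = 0$, $\widetilde{\sigma} = 1$). This is our own sketch, assuming $g$ as given in (90):

```python
import numpy as np

# Hopf-Lax check (a sketch, not from the paper's code): for the deterministic
# problem (64) with b = 0 and f = 0 the viscosity solution is
#   V0(x, t) = min_y [ g(y) + |x - y|^2 / (2 (T - t)) ],
# which should agree with the closed form (91).
alpha, T = 1.0, 1.0

def g(x):
    return alpha * (1.0 - np.abs(x) / np.sqrt(alpha)) ** 2

def V0_closed_form(x, t):
    return g(x) / (2.0 * (T - t) + 1.0)

def V0_hopf_lax(x, t, grid=np.linspace(-5.0, 5.0, 20_001)):
    return np.min(g(grid) + (x - grid) ** 2 / (2.0 * (T - t)))

for x in (0.3, -0.7, 1.5):
    print(x, V0_closed_form(x, 0.0), V0_hopf_lax(x, 0.0))
```

The agreement of the two evaluations (up to the grid resolution of the minimization) confirms the formula away from the kink of $g$ at the origin.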
5 Conclusion

In this article, we have provided quantitative bounds on the relative error of importance sampling that depend on the divergence between the actual proposal measure and the theoretically optimal one. These bounds indicate that importance sampling is very sensitive with respect to suboptimal choices of the proposal, which has been observed frequently in numerical experiments and is in line with recent theoretical analysis [1, 9, 45]. We showed that the relative error of importance sampling estimators scales exponentially in the KL divergence between the optimal and the proposal measure, and argued that this renders importance sampling especially challenging in high dimensions.

We have focused on importance sampling of stochastic processes and derived some novel formulas for the relative error depending on the suboptimality of the function $u$ that controls the drift of the process. These formulas can be used to get practically useful bounds, but they also indicate two potential issues for importance sampling in path space: for systems with large state space and for problems on a long (or infinite) time horizon, the relative error becomes exponentially large in the state space dimension $d$ and in the time horizon $T$. We have briefly discussed how this observation can be transferred to random stopping times, such as first hitting times, and have applied our formulas to importance sampling in the small noise regime, offering new perspectives and revealing some potential drawbacks of existing methods.

Even though the key message of the paper regarding the use of path space importance sampling in high dimensions seems to be rather discouraging, let us finally mention that there is hope. In practice, the approximations to optimal proposals use iterative methods to minimize a divergence (or: loss function) between the approximant and the target, using, for instance, stochastic gradient descent.
A crucial question then is which divergence to take, and it turns out that different choices lead to proposals with vastly different statistical properties. Let us mention four possible choices for the loss function in the approximation scheme: (a) relative entropy, (b) cross-entropy, (c) $\chi^2$ divergence or relative error, and (d) the recently introduced (see [38]) log-variance divergence $\operatorname{Var}_{\widetilde{\nu}}\left(\log\frac{\mathrm{d}\nu^*}{\mathrm{d}\widetilde{\nu}}\right)$, where we remark that in all four cases straightforward implementations for both probability densities on finite-dimensional spaces and (infinite-dimensional) path measures are available. Since we normally rely on Monte Carlo approximations of the measures and of our quantities of interest, it is crucial that the relative error of these divergences and their gradients is as small as possible. While the analysis in this article suggests that the $\chi^2$ divergence cannot be expected to lead to a low-variance gradient estimator in general, the other divergences have been recently analyzed in [38] in the context of path sampling (see [44] for related results on densities), some of which show better scaling properties when going to high dimensions.

We expect that these perspectives can turn out fruitful in the future, in that they can guide the design of stable importance sampling schemes that work even in high dimensions. We therefore conclude that while importance sampling itself is often not robust, there are strategies to approximate the optimal proposal measure in a more robust way that go beyond cross-entropy minimisation and control of the $\chi^2$ divergence.

Acknowledgements: This research has been funded by Deutsche Forschungsgemeinschaft (DFG) through the grant CRC 1114 'Scaling Cascades in Complex Systems' (project A05, project number 235221301). We would like to thank Wei Zhang and Nikolas Nüsken for many very useful discussions.
Appendix
A.1 Proofs for Section 2
Proof of Proposition 2.6.
We adapt a proof of [37]. Assume first that $m \ge 1$. Then for any $E \in \mathcal{F}$,
$$\nu(E) - \lambda(E) \ge \nu(E) - m\lambda(E) \ge 0, \qquad (92)$$
where the last inequality follows from the definition of $m$. On the other hand, for $E = \Omega$ we have $\nu(E) - \lambda(E) = 0$, and therefore it follows that $m = 1$, i.e. $\nu = \lambda$.

Let now $m < 1$. We want to show $m\,J(f,\lambda,\varphi) \le J(f,\nu,\varphi)$, which is equivalent to
$$\mathbb{E}_\nu[f(\varphi)] - m\,\mathbb{E}_\lambda[f(\varphi)] + m\,f(\mathbb{E}_\lambda[\varphi]) \ge f(\mathbb{E}_\nu[\varphi]). \qquad (93)$$
We compute
$$\mathbb{E}_\nu[f(\varphi)] - m\,\mathbb{E}_\lambda[f(\varphi)] + m\,f(\mathbb{E}_\lambda[\varphi]) \ge \left(\mathbb{E}_\nu[1] - m\,\mathbb{E}_\lambda[1]\right)f\!\left(\frac{\mathbb{E}_\nu[\varphi] - m\,\mathbb{E}_\lambda[\varphi]}{\mathbb{E}_\nu[1] - m\,\mathbb{E}_\lambda[1]}\right) + m\,f(\mathbb{E}_\lambda[\varphi]) \qquad (94a)$$
$$= (1-m)\,f\!\left(\frac{\mathbb{E}_\nu[\varphi] - m\,\mathbb{E}_\lambda[\varphi]}{1-m}\right) + m\,f(\mathbb{E}_\lambda[\varphi]) \qquad (94b)$$
$$\ge f\!\left(\mathbb{E}_\nu[\varphi] - m\,\mathbb{E}_\lambda[\varphi] + m\,\mathbb{E}_\lambda[\varphi]\right) \qquad (94c)$$
$$= f(\mathbb{E}_\nu[\varphi]), \qquad (94d)$$
where we used the convexity of $f$ twice. The other inequality follows analogously.

A.2 Proofs for Section 3
Proof of Proposition 3.3.
We compute
$$\mathbb{E}\left[e^{-2W(X^u)}\left(\frac{\mathrm{d}\mathbb{P}}{\mathrm{d}\mathbb{P}_u}(X^u)\right)^2\right] = \mathbb{E}\left[e^{-2W(X^u)}\left(\frac{\mathrm{d}\mathbb{P}}{\mathrm{d}\mathbb{P}_{u^*}}(X^u)\right)^2\left(\frac{\mathrm{d}\mathbb{P}_{u^*}}{\mathrm{d}\mathbb{P}_u}(X^u)\right)^2\right] \qquad (95a)$$
$$= Z^2\,\mathbb{E}\left[\left(\frac{\mathrm{d}\mathbb{P}_{u^*}}{\mathrm{d}\mathbb{P}_u}(X^u)\right)^2\right], \qquad (95b)$$
where we used
$$\frac{\mathrm{d}\mathbb{P}}{\mathrm{d}\mathbb{P}_{u^*}}(X^u) = e^{W(X^u)}Z. \qquad (96)$$
Equation (37) now follows by the Girsanov formula (see Lemma A.2) and the definition of the variance. For equation (38) note that we can write
$$\mathbb{E}_{\mathbb{P}_u}\left[\left(\frac{\mathrm{d}\mathbb{P}_{u^*}}{\mathrm{d}\mathbb{P}_u}\right)^2\right] = \mathbb{E}_{\mathbb{P}_{u+2\delta}}\left[\left(\frac{\mathrm{d}\mathbb{P}_{u^*}}{\mathrm{d}\mathbb{P}_u}\right)^2\frac{\mathrm{d}\mathbb{P}_u}{\mathrm{d}\mathbb{P}_{u+2\delta}}\right]. \qquad (97)$$
We compute
$$\left(\frac{\mathrm{d}\mathbb{P}_{u^*}}{\mathrm{d}\mathbb{P}_u}\right)^2(X^{u+2\delta}) = \exp\left(3\int_0^T |\delta(X^{u+2\delta}_s,s)|^2\,\mathrm{d}s + 2\int_0^T \delta(X^{u+2\delta}_s,s)\cdot\mathrm{d}W_s\right) \qquad (98)$$
and
$$\frac{\mathrm{d}\mathbb{P}_u}{\mathrm{d}\mathbb{P}_{u+2\delta}}(X^{u+2\delta}) = \exp\left(-2\int_0^T |\delta(X^{u+2\delta}_s,s)|^2\,\mathrm{d}s - 2\int_0^T \delta(X^{u+2\delta}_s,s)\cdot\mathrm{d}W_s\right), \qquad (99)$$
from which the desired formula immediately follows.

Alternative proof of Corollary 3.6. We follow the reasoning in [33, Thm. 2.1] and apply Grönwall's inequality to the square integrable exponential martingale $Z_t$. To this end, we define the shorthands $\delta(x,t) := (u^*-u)(x,t)$ and
$$Z_t := \exp\left(-\frac{1}{2}\int_0^t |\delta(X_s,s)|^2\,\mathrm{d}s + \int_0^t \delta(X_s,s)\cdot\mathrm{d}W_s\right). \qquad (100)$$
Then, by Itô's formula,
$$Z_t^2 = 1 + 2\int_0^t Z_s\,\mathrm{d}Z_s + \int_0^t Z_s^2\,|\delta(X_s,s)|^2\,\mathrm{d}s \qquad (101)$$
and therefore, after taking expectations,
$$\mathbb{E}\left[Z_t^2\right] = 1 + \mathbb{E}\left[\int_0^t Z_s^2\,|\delta(X_s,s)|^2\,\mathrm{d}s\right] \qquad (102a)$$
$$\le 1 + \int_0^t \mathbb{E}\left[Z_s^2\right]h_2(s)\,\mathrm{d}s. \qquad (102b)$$
We can now apply Grönwall's inequality to get
$$\mathbb{E}\left[Z_t^2\right] \le \exp\left(\int_0^t h_2(s)\,\mathrm{d}s\right) \qquad (103)$$
and therefore the desired statement after applying Proposition 3.3. The other direction follows analogously by noting that
$$-\mathbb{E}\left[Z_t^2\right] \le -1 - \int_0^t \mathbb{E}\left[Z_s^2\right]h_1(s)\,\mathrm{d}s. \qquad (104)$$

Remark A.1. Yet another alternative to prove Corollary 3.6 is by computing
$$\mathbb{E}\left[\left(\frac{\mathrm{d}\mathbb{P}_{u^*}}{\mathrm{d}\mathbb{P}_u}(X^u)\right)^2\right] = \mathbb{E}\left[\exp\left(-\int_0^T |\delta(X^u_s,s)|^2\,\mathrm{d}s + 2\int_0^T \delta(X^u_s,s)\cdot\mathrm{d}W_s\right)\right] \qquad (105a)$$
$$= \mathbb{E}\left[\exp\left(\int_0^T |\delta(X^u_s,s)|^2\,\mathrm{d}s - 2\int_0^T |\delta(X^u_s,s)|^2\,\mathrm{d}s + 2\int_0^T \delta(X^u_s,s)\cdot\mathrm{d}W_s\right)\right] \qquad (105b)$$
$$\le \exp\left(\int_0^T h_2(s)\,\mathrm{d}s\right)\mathbb{E}\left[\exp\left(-2\int_0^T |\delta(X^u_s,s)|^2\,\mathrm{d}s + 2\int_0^T \delta(X^u_s,s)\cdot\mathrm{d}W_s\right)\right] \qquad (105c)$$
$$= \exp\left(\int_0^T h_2(s)\,\mathrm{d}s\right), \qquad (105d)$$
where we used the constant expectation property of the exponential martingale in the last step. The other direction follows analogously. See also Theorem 2 in http://math.ucsd.edu/~pfitz/downloads/courses/spring05/math280c/expmart.pdf.

Proof of Proposition 3.8. From Lemma A.4 it holds for $n, p, q > 1$ with $\frac{1}{p} + \frac{1}{q} = 1$ that
$$\mathbb{E}\left[\left(\frac{\mathrm{d}\mathbb{P}_{u^*}}{\mathrm{d}\mathbb{P}_u}(X^u)\right)^n\right] \le \mathbb{E}\left[\exp\left(\frac{nq(np-1)}{2}\int_0^T |u^*-u|^2(X^u_s,s)\,\mathrm{d}s\right)\right]^{\frac{1}{q}}. \qquad (106)$$
We write $q = \frac{p}{p-1}$ and note that $q(np-1) = \frac{p(np-1)}{p-1}$ is minimized by $p^* = 1 \pm \sqrt{1-\frac{1}{n}}$, from which we are only allowed to take the positive branch due to the constraint $p \ge 1$. For $n = 2$ this yields $p^* = 1 + \frac{1}{\sqrt{2}}$ and $q^* = 1 + \sqrt{2}$, such that the statement follows with
$$r(u)^2 = \operatorname{Var}_{\mathbb{P}_u}\!\left(\frac{\mathrm{d}\mathbb{P}_{u^*}}{\mathrm{d}\mathbb{P}_u}\right) = \mathbb{E}_{\mathbb{P}_u}\left[\left(\frac{\mathrm{d}\mathbb{P}_{u^*}}{\mathrm{d}\mathbb{P}_u}\right)^2\right] - 1. \qquad (107)$$

A.3 Auxiliary statements
In this section, we recall some known statements and provide some helpful additional analysis. First note that the Radon-Nikodym derivative appearing in the importance sampling estimator in path space can be computed explicitly.
Lemma A.2 (Girsanov). For $u \in \mathcal{U}$, the measures $\mathbb{P}$ and $\mathbb{P}_u$, relating to the SDEs (25) and (29), are equivalent. Moreover, the Radon-Nikodym derivative satisfies
$$\frac{\mathrm{d}\mathbb{P}_u}{\mathrm{d}\mathbb{P}}(X) = \exp\left(\int_0^T \left(u^\top\sigma^{-1}\right)(X_s,s)\cdot\mathrm{d}X_s - \int_0^T \left(\sigma^{-1}b\cdot u\right)(X_s,s)\,\mathrm{d}s - \frac{1}{2}\int_0^T |u(X_s,s)|^2\,\mathrm{d}s\right). \qquad (108)$$

Proof.
See [38, Lemma A.1].
Corollary A.3 (Formula for path space relative error in a special case). If the difference $u^* - u$ does not depend on $x$, then
$$r(u) = \sqrt{\exp\left(\int_0^T |u^*-u|^2(s)\,\mathrm{d}s\right) - 1}. \qquad (109)$$

Proof.
This is a direct consequence of (38). For the reader's convenience, we provide an alternative proof. If $u^* - u$ does not depend on $x$, then the random variable
$$Y = -\int_0^T |u^*-u|^2(s)\,\mathrm{d}s + 2\int_0^T (u^*-u)(s)\cdot\mathrm{d}W_s \qquad (110)$$
is normally distributed, with mean and variance given by
$$\mu = -\int_0^T |u^*-u|^2(s)\,\mathrm{d}s, \qquad \sigma^2 = 4\int_0^T |u^*-u|^2(s)\,\mathrm{d}s, \qquad (111)$$
where the second expression follows from the Itô isometry. The random variable $\left(\frac{\mathrm{d}\mathbb{P}_{u^*}}{\mathrm{d}\mathbb{P}_u}(X^u)\right)^2 = e^Y$ is then log-normally distributed and we compute
$$\mathbb{E}\left[e^Y\right] = e^{\mu + \frac{\sigma^2}{2}} = e^{\frac{\sigma^2}{4}}, \qquad (112)$$
which gives the desired statement.

Lemma A.4. Let $n, p, q > 1$ with $\frac{1}{p} + \frac{1}{q} = 1$. Then it holds that
$$\mathbb{E}\left[\left(\frac{\mathrm{d}\mathbb{P}_{u^*}}{\mathrm{d}\mathbb{P}_u}(X^u)\right)^n\right] \le \mathbb{E}\left[\exp\left(\frac{nq(np-1)}{2}\int_0^T |u^*-u|^2(X^u_s,s)\,\mathrm{d}s\right)\right]^{\frac{1}{q}}. \qquad (113)$$

Proof.
Let us write $\delta(x,s) := (u^*-u)(x,s)$ and let $n, p, q > 1$. Then, using the Hölder inequality with $\frac{1}{p} + \frac{1}{q} = 1$, it holds
$$\mathbb{E}_{\mathbb{P}_u}\left[\left(\frac{\mathrm{d}\mathbb{P}_{u^*}}{\mathrm{d}\mathbb{P}_u}\right)^n\right] = \mathbb{E}_{\mathbb{P}_u}\left[\exp\left(n\int_0^T \delta(X_s,s)\cdot\mathrm{d}W_s - \frac{n^2 p}{2}\int_0^T |\delta(X_s,s)|^2\,\mathrm{d}s + \frac{n(np-1)}{2}\int_0^T |\delta(X_s,s)|^2\,\mathrm{d}s\right)\right] \qquad (114a)$$
$$\le \mathbb{E}_{\mathbb{P}_u}\left[\exp\left(\int_0^T np\,\delta(X_s,s)\cdot\mathrm{d}W_s - \frac{1}{2}\int_0^T |np\,\delta(X_s,s)|^2\,\mathrm{d}s\right)\right]^{\frac{1}{p}} \qquad (114b)$$
$$\quad\cdot\;\mathbb{E}_{\mathbb{P}_u}\left[\exp\left(\frac{nq(np-1)}{2}\int_0^T |\delta(X_s,s)|^2\,\mathrm{d}s\right)\right]^{\frac{1}{q}} \qquad (114c)$$
$$= \mathbb{E}_{\mathbb{P}_u}\left[\exp\left(\frac{nq(np-1)}{2}\int_0^T |\delta(X_s,s)|^2\,\mathrm{d}s\right)\right]^{\frac{1}{q}}, \qquad (114d)$$
where the first factor in (114b) equals one by the constant expectation property of the exponential martingale. Note that, even though Hölder's inequality holds for $p, q \in [1,\infty]$, the inequality becomes useless for $q = 1$ and $p = \infty$.

Proposition A.5 (Zero-variance property). We get a vanishing relative error $r(u) = 0$ if and only if $\delta = u^* - u = 0$, i.e. when having the optimal control $u = u^* = -\sigma^\top\nabla V$.

Proof. The fact that $\delta = 0$ implies $r(u) = 0$ follows directly from (38) or (52). For the other direction, note that $r(u) = 0$ implies $M^u(x,t) = \psi^2(x,t)$ (as defined in Proposition 3.10) for all $(x,t) \in \mathbb{R}^d \times [0,T]$, and therefore equation (48a) becomes
$$\left(\partial_t + L - \sigma u(x,t)\cdot\nabla - 2f(x,t) + |u(x,t)|^2\right)\psi^2(x,t) = 0. \qquad (115)$$
Further note that due to the Kolmogorov backward equation it holds
$$\left(\partial_t + L - 2f(x,t)\right)\psi^2(x,t) - \left|(\sigma^\top\nabla\psi)(x,t)\right|^2 = 0. \qquad (116)$$
Combining these two PDEs brings
$$\psi^2(x,t)\,|u(x,t)|^2 - 2\left(\psi\,\sigma u\cdot\nabla\psi\right)(x,t) + \left|(\sigma^\top\nabla\psi)(x,t)\right|^2 = \left|(\psi u)(x,t) - (\sigma^\top\nabla\psi)(x,t)\right|^2 = 0, \qquad (117)$$
which implies that
$$u = \frac{\sigma^\top\nabla\psi}{\psi} = \sigma^\top\nabla\log\psi = -\sigma^\top\nabla V. \qquad (118)$$

The following lemma shows that the KL divergence increases with the number of dimensions. This result follows from the chain rule of the KL divergence, see, e.g., [11].

Lemma A.6 (Dimension dependence of KL divergence). Let $u^{(d)}(z_1,\dots,z_d)$ and $v^{(d)}(z_1,\dots,z_d)$ be two arbitrary probability densities on $\mathbb{R}^d$. For $j \in \{1,\dots,d\}$ denote their marginals on the first $j$ coordinates by $u^{(j)}$ and $v^{(j)}$, i.e.
$$u^{(j)}(z_1,\dots,z_j) = \int\cdots\int u^{(d)}(z_1,\dots,z_d)\,\mathrm{d}z_{j+1}\dots\mathrm{d}z_d \qquad (119)$$
and
$$v^{(j)}(z_1,\dots,z_j) = \int\cdots\int v^{(d)}(z_1,\dots,z_d)\,\mathrm{d}z_{j+1}\dots\mathrm{d}z_d. \qquad (120)$$
Then
$$\mathrm{KL}(u^{(1)}\,|\,v^{(1)}) \le \mathrm{KL}(u^{(2)}\,|\,v^{(2)}) \le \dots \le \mathrm{KL}(u^{(d)}\,|\,v^{(d)}), \qquad (121)$$
i.e. the function $j \mapsto \mathrm{KL}(u^{(j)}\,|\,v^{(j)})$ is increasing.

A.3.1 Relative error of log-normal random variables
Let $Y \sim \mathcal{N}(\mu, \Sigma)$ with arbitrary $\mu \in \mathbb{R}^d$, $\Sigma \in \mathbb{R}^{d\times d}$, and take $\gamma \in \mathbb{R}^d$, $c \in \mathbb{R}$. Then $e^{\gamma\cdot Y + c}$ is log-normally distributed and its relative error is
\[
r(\gamma,\Sigma) = \sqrt{\frac{\mathbb{E}\left[\left(e^{\gamma\cdot Y + c}\right)^2\right]}{\mathbb{E}\left[e^{\gamma\cdot Y + c}\right]^2} - 1} = \sqrt{\frac{\mathbb{E}\left[e^{2\gamma\cdot Y}\right]}{\mathbb{E}\left[e^{\gamma\cdot Y}\right]^2} - 1} = \sqrt{e^{\gamma\cdot\Sigma\gamma} - 1}, \tag{122}
\]
independent of $c$. With the setting and notation from Example 2.10 we can now for instance compute
\[
e^{-g(\widetilde{X})}\,\frac{p}{\widetilde{p}_\varepsilon}(\widetilde{X}) = \exp\left(-\alpha\cdot\widetilde{X} + \log\frac{p}{\widetilde{p}_\varepsilon}\right) = \exp\left(\varepsilon\cdot\widetilde{X} - \mu\cdot(\alpha+\varepsilon) + \frac{1}{2}(\alpha+\varepsilon)\cdot\Sigma(\alpha+\varepsilon)\right) \tag{123}
\]
and with $\gamma = \varepsilon$, $c = -\mu\cdot(\alpha+\varepsilon) + \frac{1}{2}(\alpha+\varepsilon)\cdot\Sigma(\alpha+\varepsilon)$, $\Sigma = \Sigma$ one therefore gets the relative error $r(\widetilde{p}_\varepsilon) = \sqrt{e^{\varepsilon\cdot\Sigma\varepsilon} - 1}$.

A.3.2 Asymptotic expansion in small noise diffusions
To get further intuition on the small noise diffusions defined in Section 3.3, let us consider the formal expansion of the solution to the HJB equation (62),
\[
V = v_0 + \eta v_1 + \eta^2 v_2 + \dots. \tag{125}
\]
Inserting into (62) (with $\sigma = I_{d\times d}$) and comparing the powers of $\eta$ yields the PDEs
\[
\partial_t v_0 + b\cdot\nabla v_0 - \tfrac{1}{2}|\nabla v_0|^2 = 0, \tag{126a}
\]
\[
\partial_t v_1 + \tfrac{1}{2}\Delta v_0 + b\cdot\nabla v_1 - \nabla v_0\cdot\nabla v_1 = 0, \tag{126b}
\]
\[
\partial_t v_2 + \tfrac{1}{2}\Delta v_1 + b\cdot\nabla v_2 - \nabla v_0\cdot\nabla v_2 - \tfrac{1}{2}|\nabla v_1|^2 = 0, \tag{126c}
\]
and so on, where all but the first PDE are transport equations (see [51]). We note that (given some appropriate assumptions) we have $v_0 = V_0$, with $V_0$ being the solution to (64). In [21] it is proven that
\[
\nabla V = \nabla V_0 + \eta\nabla v_1 + o(\eta), \tag{127}
\]
where $v_1$ fulfills the PDE above and $V$ is the solution to the original HJB equation (62).

References

[1] S. Agapiou, O. Papaspiliopoulos, D. Sanz-Alonso, and A. M. Stuart. Importance sampling: computational complexity and intrinsic dimension.
Statistical Science, 32(3), 2015.
[2] S. Asmussen, P. Dupuis, R. Rubinstein, and H. Wang. Importance sampling for rare events. Aarhus Univ., Aarhus, Denmark, Tech. Rep., 2011.
[3] S. Asmussen and P. W. Glynn. Stochastic Simulation: Algorithms and Analysis. Springer, New York, 2007.
[4] T. Bengtsson, P. Bickel, B. Li, et al. Curse-of-dimensionality revisited: collapse of the particle filter in very large scale systems. In Probability and Statistics: Essays in Honor of David A. Freedman, pages 316–334. Institute of Mathematical Statistics, 2008.
[5] N. Berglund. Kramers' law: validity, derivations and generalisations. Markov Processes and Related Fields, 19(3):459–490, 2013.
[6] P. Bickel, B. Li, and T. Bengtsson. Sharp failure rates for the bootstrap particle filter in high dimensions. In Pushing the Limits of Contemporary Statistics: Contributions in Honor of Jayanta K. Ghosh, pages 318–329. Institute of Mathematical Statistics, 2008.
[7] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[8] A. Bovier, V. Gayrard, and M. Klein. Metastability in reversible diffusion processes II: precise asymptotics for small eigenvalues. J. Eur. Math. Soc., 7(1):69–99, 2005.
[9] S. Chatterjee and P. Diaconis. The sample size required in importance sampling. The Annals of Applied Probability, 28(2):1099–1135, 2018.
[10] Y. Chen. Another look at rejection sampling through importance sampling. Statistics & Probability Letters, 72(4):277–283, 2005.
[11] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons, 2012.
[12] P.-T. De Boer, D. P. Kroese, S. Mannor, and R. Y. Rubinstein. A tutorial on the cross-entropy method. Annals of Operations Research, 134(1):19–67, 2005.
[13] P. Del Moral. Mean Field Simulation for Monte Carlo Integration. Chapman and Hall/CRC, 2013.
[14] A. Dembo and O. Zeitouni. Large Deviations Techniques and Applications. Springer Berlin Heidelberg, 2009.
[15] A. Doucet, N. De Freitas, and N. Gordon. An introduction to sequential Monte Carlo methods. In Sequential Monte Carlo Methods in Practice, pages 3–14. Springer, 2001.
[16] S. S. Dragomir and V. Gluscevic. Some inequalities for the Kullback-Leibler and χ²-distances in information theory and applications. RGMIA Research Report Collection, 3(2):199–210, 2000.
[17] P. Dupuis, K. Spiliopoulos, and H. Wang. Importance sampling for multiscale diffusions. Multiscale Modeling & Simulation, 10(1):1–27, 2012.
[18] P. Dupuis, K. Spiliopoulos, and X. Zhou. Escaping from an attractor: importance sampling and rest points I. The Annals of Applied Probability, pages 2909–2958, 2015.
[19] P. Dupuis and H. Wang. Importance sampling, large deviations, and differential games. Stochastics: An International Journal of Probability and Stochastic Processes, 76(6):481–508, 2004.
[20] P. Dupuis and H. Wang. Subsolutions of an Isaacs equation and efficient schemes for importance sampling. Mathematics of Operations Research, 32(3):723–757, 2007.
[21] W. H. Fleming. Stochastic control for small noise intensities. SIAM Journal on Control, 9(3):473–517, 1971.
[22] W. H. Fleming and H. M. Soner. Controlled Markov Processes and Viscosity Solutions, volume 25. Springer Science & Business Media, 2006.
[23] A. Gelman and X.-L. Meng. Simulating normalizing constants: from importance sampling to bridge sampling to path sampling. Statistical Science, pages 163–185, 1998.
[24] A. L. Gibbs and F. E. Su. On choosing and bounding probability metrics. International Statistical Review, 70(3):419–435, 2002.
[25] P. Glasserman. Monte Carlo Methods in Financial Engineering, volume 53. Springer Science & Business Media, 2013.
[26] P. Glasserman and J. Li. Importance sampling for portfolio credit risk. Management Science, 51(11):1643–1656, 2005.
[27] P. Glasserman and Y. Wang. Counterexamples in importance sampling for large deviations probabilities. The Annals of Applied Probability, 7(3):731–746, 1997.
[28] C. Hartmann, O. Kebiri, L. Neureither, and L. Richter. Variational approach to rare event simulation using least-squares regression. Chaos, 29(6):063107, 2019.
[29] C. Hartmann, J. C. Latorre, G. A. Pavliotis, and W. Zhang. Optimal control of multiscale systems using reduced-order models. J. Computational Dynamics, 1:279–306, 2014.
[30] C. Hartmann, L. Richter, C. Schütte, and W. Zhang. Variational characterization of free energy: theory and algorithms. Entropy, 19(11):626, 2017.
[31] C. Hartmann and C. Schütte. Efficient rare event simulation by optimal nonequilibrium forcing. Journal of Statistical Mechanics: Theory and Experiment, 2012(11):P11004, 2012.
[32] C. Hartmann, C. Schütte, and W. Zhang. Model reduction algorithms for optimal control and importance sampling of diffusions. Nonlinearity, 29(8):2298, 2016.
[33] F. Klebaner and R. Liptser. When a stochastic exponential is a true martingale. Extension of the Beneš method. Theory of Probability and its Applications, 58(1):38–62, 2014.
[34] B. Li, T. Bengtsson, and P. Bickel. Curse-of-dimensionality revisited: collapse of importance sampling in very high-dimensional systems. Tech. Reports, Department of Statistics, UC Berkeley, 696:1–18, 2005.
[35] J. S. Liu. Monte Carlo Strategies in Scientific Computing. Springer Science & Business Media, 2008.
[36] X.-L. Meng and W. H. Wong. Simulating ratios of normalizing constants via a simple identity: a theoretical exploration. Statistica Sinica, pages 831–860, 1996.
[37] F. C. Mitroi. Estimating the normalized Jensen functional. J. Math. Inequal., 5(4):507–521, 2011.
[38] N. Nüsken and L. Richter. Solving high-dimensional Hamilton-Jacobi-Bellman PDEs using neural networks: perspectives from the theory of controlled diffusions and measures on path space. arXiv preprint arXiv:2005.05409, 2020.
[39] A. Owen and Y. Zhou. Safe and effective importance sampling. Journal of the American Statistical Association, 95(449):135–143, 2000.
[40] A. B. Owen. Monte Carlo Theory, Methods and Examples. Self-published, 2013.
[41] H. Pham. Continuous-time Stochastic Control and Optimization with Financial Applications, volume 61. Springer Science & Business Media, 2009.
[42] B. Polyak and P. Shcherbakov. Why does Monte Carlo fail to work properly in high-dimensional optimization problems? Journal of Optimization Theory and Applications, 173(2):612–627, 2017.
[43] F. Ragone, J. Wouters, and F. Bouchet. Computation of extreme heat waves in climate models using a large deviation algorithm. Proceedings of the National Academy of Sciences, 115(1):24–29, 2018.
[44] L. Richter, A. Boustati, N. Nüsken, F. J. Ruiz, and Ö. D. Akyildiz. VarGrad: a low-variance gradient estimator for variational inference. arXiv preprint arXiv:2010.10436, 2020.
[45] D. Sanz-Alonso. Importance sampling and necessary sample size: an information theory approach. SIAM/ASA Journal on Uncertainty Quantification, 6(2):867–879, 2018.
[46] I. Sason. On improved bounds for probability metrics and f-divergences. arXiv preprint arXiv:1403.7164, 2014.
[47] I. Sason. Tight bounds for symmetric divergence measures and a new inequality relating f-divergences. Pages 1–5. IEEE, 2015.
[48] D. Siegmund. Importance sampling in the Monte Carlo study of sequential tests. The Annals of Statistics, pages 673–684, 1976.
[49] C. Snyder, T. Bengtsson, P. Bickel, and J. Anderson. Obstacles to high-dimensional particle filtering. Monthly Weather Review, 136(12):4629–4640, 2008.
[50] K. Spiliopoulos. Large deviations and importance sampling for systems of slow-fast motion. Applied Mathematics & Optimization, 67(1):123–161, 2013.
[51] K. Spiliopoulos. Nonasymptotic performance analysis of importance sampling schemes for small noise diffusions. Journal of Applied Probability, 52(3):797–810, 2015.
[52] G. Stoltz, M. Rousset, et al. Free Energy Computations: A Mathematical Perspective. World Scientific, 2010.
[53] E. Vanden-Eijnden and J. Weare. Rare event simulation of small noise diffusions. Communications on Pure and Applied Mathematics, 65(12):1770–1803, 2012.
[54] W. Zhang, H. Wang, C. Hartmann, M. Weber, and C. Schütte. Applications of the cross-entropy method to importance sampling and optimal control of diffusions.