Importance Sampling and Necessary Sample Size: an Information Theory Approach
arXiv [stat.CO], August 2016, pp. 1–9

BY D. SANZ-ALONSO
Division of Applied Mathematics, Brown University, Providence, Rhode Island 02906, U.S.A.
daniel_[email protected]

SUMMARY
Importance sampling approximates expectations with respect to a target measure by using samples from a proposal measure. The performance of the method over large classes of test functions depends heavily on the closeness between the two measures. We derive a general bound that needs to hold for importance sampling to be successful, and that relates the f-divergence between the target and the proposal to the sample size. The bound is deduced from a new and simple information theory paradigm for the study of importance sampling. As examples of the general theory we give necessary conditions on the sample size in terms of the Kullback-Leibler and χ² divergences, and the total variation and Hellinger distances. Our approach is non-asymptotic, and its generality allows us to tell apart the relative merits of these metrics. Unsurprisingly, the non-symmetric divergences give sharper bounds than total variation or Hellinger. Our results extend existing necessary conditions, and complement sufficient ones, on the sample size required for importance sampling.

Some key words: f-divergence; Importance Sampling; Information Theory; Sample Size.
1. INTRODUCTION
Let P and Q be, throughout, two probability measures on a measurable space (X, F), with P absolutely continuous with respect to Q. Importance sampling is a Monte Carlo technique that approximates expectations with respect to the target P by using samples from the proposal Q. Our aim is to introduce a simple information theory paradigm to determine situations where this method cannot be successful. The results are non-asymptotic, and are based on information barriers on f-divergences.

In what follows it is best to view importance sampling as a way to approximate the target P by a weighted empirical measure

    π^N := Σ_{n=1}^N w_n δ_{v_n},    (1)

where N is the number of samples v_n drawn from Q. We now present the main idea heuristically. Let Q^N_MC := (1/N) Σ_n δ_{v_n} denote the standard Monte Carlo approximation of Q. Importance sampling replaces the uniform weights u = {1/N, ..., 1/N} associated with the particles v^N = {v_n}_{n=1}^N by non-uniform weights w^N = {w_n}_{n=1}^N. The hope is that then (1) approximates P rather than Q. Let D be some notion (to be made precise) of distance that allows one to assess the closeness of measures in (X, F), and also of probability vectors in [0, 1]^N. Suppose that the
bound D(w^N, u) ≤ U(N) holds for any choice of non-negative weights w^N with Σ w_n = 1 and a given function U. If we can guarantee, under conditions ensuring the success of standard Monte Carlo and importance sampling, that D(P, Q) is close to D(π^N, Q^N_MC), then we would like to conclude that

    D(P, Q) ≈ D(π^N, Q^N_MC) = D(w^N, u) ≤ U(N).

There is an information barrier: if N is such that U(N) < D(P, Q), then N samples from Q do not contain enough information on the model P for importance sampling to be successful. This heuristic is made rigorous in our main results, Theorems 1 and 2 below. Note that π^N (respectively Q^N_MC) will never be close to P (respectively Q) in the metrics considered below, since the former is atomic and the latter, in general, is not. However, it is still possible to guarantee that D(P, Q) ≈ D(π^N, Q^N_MC) under appropriate performance conditions on Monte Carlo and importance sampling.

The first step in formalizing our argument is to agree on a metric to assess the closeness of measures. This is a major point, since this choice typically impacts the convergence, or the rate of convergence, of sequences of measures (Gibbs & Su, 2002). For this reason we allow for flexibility in our analysis, and work with the family of f-divergences. These metrics have a convex function f as a free parameter. We use several important members of this family as running examples: the Kullback-Leibler and χ² divergences, and the total variation and Hellinger distances. Previous non-asymptotic analyses of importance sampling have focused on the first two. For instance, Chatterjee & Diaconis (2015) suggested, under a certain concentration condition on the density, the necessity and sufficiency of the sample size being larger than the exponential of the Kullback-Leibler divergence between target and proposal, and Agapiou et al. (2015) proved the sufficiency of the sample size being larger than the χ² divergence for autonormalized importance sampling with bounded test functions. Indeed the function U in the above argument is given by U(N) = log N when D is the Kullback-Leibler divergence, and U(N) = N − 1 for the χ² divergence, in agreement with Chatterjee & Diaconis (2015) and complementing Agapiou et al. (2015). The Kullback-Leibler divergence also plays a key role in the asymptotic analysis, since it provides the rate function of the large deviation principle for the empirical measure (Sanov, 1958), and it also appears in the rate function for weighted empirical measures (Hult & Nyquist, 2012).

A second step in formalizing our idea is to agree on how to interpret the statement that importance sampling is not successful. In this regard, moderate mean squared error seems too exacting as a necessary criterion, since this statistic may even be infinite while the method gives small errors with overwhelming probability. Moderate mean squared error seems more adequate as a sufficient than as a necessary requirement. We propose, following Chatterjee & Diaconis (2015), to consider the method unsuccessful when there are test functions for which importance sampling gives significant errors with large probability. Our main results, Theorems 1 and 2, show that when the sample size is not sufficiently large in terms of the f-divergence between the target and the proposal, the method breaks down, with high probability, for either φ ≡ 1 or φ_f := f ∘ (dP/dQ).
Note that the latter test function depends on the choice of f-divergence; for a given choice of f, the Q-integrability of φ_f will naturally determine the class of test functions for which our upper bounds are meaningful.

We close the introduction with a brief literature review and an outline of the paper. Importance sampling is a standard tool in computational statistics (Liu, 2008). It was first proposed as a variance reduction technique for standard Monte Carlo integration (Kahn & Marshall, 1953), and has been extensively used in the simulation of rare events since Siegmund (1976), where the interest lies in computing the expectation of a given test function φ (typically an indicator). Importance sampling has received recent interest as a building block of particle filters (Del Moral, 2004; Doucet et al., 2001). In this complementary range of applications, which motivates our presentation, the interest lies in approximating a measure, and in computing expectations over a class of test functions (Del Moral, 2004; Agapiou et al., 2015). f-divergences were introduced in Csiszár (1963), Csiszár (1967) and Ali & Silvey (1966) as a generalization of the Kullback-Leibler divergence (Kullback & Leibler, 1951). They have been widely studied in information theory, and an in-depth treatment is given in Liese & Vajda (1987). A recent survey of bounds on f-divergences is Sason & Verdú (2015). Finally, Gibbs & Su (2002) contains a brief and clear exposition of the relationships between probability metrics, sufficient for the purposes of this paper.

Section 2 provides the necessary background on importance sampling. Section 3 briefly reviews f-divergences, and some bounds on and between them are established. The main results are in Section 4. Examples are given in Section 5, and we conclude in Section 6.

Notation: We let g := dP/dQ denote the Radon-Nikodym derivative of P with respect to Q. We denote measures that are not necessarily probabilities with Greek letters. For any measure ν on (X, F) and measurable function φ : X → R we set ν(φ) := ∫_X φ(x) ν(dx). Randomness arising from sampling is indicated with a superscript: N for the number of samples and n for the indices of the samples. Vectors are denoted in bold face, and u := {1/N, ..., 1/N} denotes the uniform probability vector.
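To make the information-barrier heuristic above concrete, here is a minimal numerical sketch (not part of the paper's development). It takes the Kullback-Leibler divergence as the choice of D, for which U(N) = log N as discussed above, and a Gaussian mean-shift pair as in Section 5, for which D_KL(P ‖ Q) has a closed form; the specific parameter values are illustrative assumptions.

    import numpy as np

    # Proposal Q = N(0, 1), target P = N(m, 1); then D_KL(P || Q) = m**2 / 2.
    m = 3.0
    d_kl = m ** 2 / 2.0

    # U(N) = log N is the largest value D_KL(w^N || u) can take (see Section 4 and Table 1),
    # so N samples sit below the information barrier whenever log N < D_KL(P || Q).
    for N in (10, 100, 1_000, 10_000):
        status = "below barrier" if np.log(N) < d_kl else "above barrier"
        print(f"N = {N:6d}: log N = {np.log(N):5.2f}, D_KL = {d_kl:.2f} -> {status}")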
2. IMPORTANCE SAMPLING BACKGROUND
Given samples {v_n}_{n=1}^N from Q, importance sampling approximates P(φ) as follows:

    P(φ) = ∫_X φ(x) P(dx) = ∫_X φ(x) g(x) Q(dx) ≈ (1/N) Σ_{n=1}^N φ(v_n) g(v_n) = Σ_{n=1}^N w_n φ(v_n),

where w_n := g(v_n)/N. Our presentation does not cover autonormalized importance sampling, but we expect that our paradigm could be generalized with additional effort. Recalling the definition of the particle approximation measure in (1), the previous display can be rewritten as

    P(φ) ≈ π^N(φ).

The right-hand side is an unbiased estimator of P(φ), and its mean squared error is given by

    MSE(π^N(φ)) = Var_Q(gφ) / N.
As noted in the introduction, we argue that small mean squared error is sufficient but not necessary for successful importance sampling. An important point for further developments is that π^N is a random measure, but in general it is not a probability measure, since the weights w_n typically do not add up to one. It is clear, however, that in the large-N asymptotic

    Σ_{n=1}^N w_n = (1/N) Σ_{n=1}^N g(v_n) ≈ 1,

and precise statements about the quality of the above approximation can be made under different assumptions on the moments of g under Q.
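The following minimal sketch illustrates the displays above with an assumed Gaussian pair; the choice of P, Q and of the test function is ours, purely for illustration.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    N = 10_000
    Q = stats.norm(loc=0.0, scale=1.0)           # proposal
    P = stats.norm(loc=1.0, scale=0.5)           # target, absolutely continuous w.r.t. Q

    v = Q.rvs(size=N, random_state=rng)          # samples v_n from the proposal
    g = P.pdf(v) / Q.pdf(v)                      # density ratio g = dP/dQ evaluated at v_n
    w = g / N                                    # weights w_n = g(v_n)/N

    phi = lambda x: x ** 2                       # an example test function
    print("pi^N(phi) =", np.sum(w * phi(v)))     # importance sampling estimate of P(phi)
    print("P(phi)    =", 1.0 ** 2 + 0.5 ** 2)    # exact value: second moment of N(1, 0.5^2)
    print("pi^N(1)   =", np.sum(w))              # total mass of pi^N; close to one for large N

Note that the total mass π^N(1) printed at the end fluctuates around one, in line with the last display.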
3. THE FAMILY OF f-DIVERGENCES
In this section we introduce the family of f-divergences, and a slight generalization for atomic measures that need not integrate to one. The section closes with useful upper bounds for the analysis of importance sampling.

Given a convex function f with f(1) = 0, the f-divergence between P and Q is defined by

    D_f(P ‖ Q) := ∫_X f ∘ g(x) Q(dx) ≡ Q(f ∘ g),

where, recall, g = dP/dQ. The assumptions on f and Jensen's inequality ensure that these divergences are non-negative. However, D_f does not in general constitute a distance in the space P(X) of probability measures on (X, F): it typically does not satisfy the triangle inequality or the requirement of symmetry, it takes the value ∞ if f ∘ g is not Q-integrable, and it may need to be redefined when the first argument is not absolutely continuous with respect to the second. Examples are given in Table 1, which shows a choice of f (the choice is in general not unique) that results in the Kullback-Leibler and χ² divergences, and in the total variation and squared Hellinger distances. We spell out the definitions here, together with some useful characterizations:

    D_KL(P ‖ Q)   := Q(g log g) ≡ P(log g),
    D_{χ²}(P ‖ Q) := Q(g²) − 1 ≡ P(g) − 1,
    D_TV(P, Q)    := (1/2) Q(|1 − g|) ≡ sup_{A ∈ F} |P(A) − Q(A)| ∈ [0, 1],
    D_Hell²(P, Q) := Q((√g − 1)²) ∈ [0, 2].

While D_TV and D_Hell can be shown to be distances in P(X), D_KL and D_{χ²} are not. In particular, these latter divergences fail to be symmetric, a feature that makes them appealing for the analysis of importance sampling. Indeed, the very formulation of the method is built on an asymmetric premise (the absolute continuity of P with respect to Q). Moreover, it is well acknowledged that it is desirable that the proposal has heavier tails than the target, again an asymmetric requirement.

We have already stressed that we are not interested in the mean squared error as a statistic to discard estimators. The next result is, however, instructive for comparison purposes. It gives necessary conditions on the sample size for bounded mean squared error over bounded test functions. Here and later we will drop the arguments of the divergences when no confusion may arise.

PROPOSITION 1. Let φ ≡ 1 be the constant function 1, and let C > 0. If MSE(π^N(φ)) ≤ C, then

    N ≥ C⁻¹ D_{χ²},    N ≥ C⁻¹ (exp(D_KL) − 1),    N ≥ C⁻¹ D_TV²,    N ≥ C⁻¹ D_Hell².

Proof.
First note that, for φ ≡ 1,

    MSE(π^N(φ)) = Var_Q(g) / N = (Q(g²) − 1) / N = D_{χ²} / N.
This gives the bound for D_{χ²}. The bounds for the other metrics follow from the general bounds

    D_KL ≤ log(1 + D_{χ²}),    D_TV ≤ √D_{χ²},    D_Hell ≤ √D_{χ²},

which can be found, for instance, in Gibbs & Su (2002). □

Table 1. Summary of the four f-divergences used as examples in this paper. The third column contains the maximum value these divergences can take when the first argument is a probability vector and the second is the uniform probability vector u. The fourth column is the generalization to the case where the first argument is a non-negative vector with total mass ≤ 1 + ε.

    Divergence            f(x)           U_f(N)            U_f(N, ε)
    Kullback-Leibler      x log(x)       log(N)            (1 + ε) log{N(1 + ε)}
    χ²                    (x − 1)²       N − 1             N(1 + ε)² − (1 + 2ε)
    Total variation       |x − 1|/2      1 − 1/N           1 − 1/N + ε/2
    Squared Hellinger     (√x − 1)²      2(1 − 1/√N)       2(1 − √((1 + ε)/N) + ε/2)

Remark 1. Always D_TV ≤ 1 and D_Hell ≤ √2. Therefore the mean squared error calculation above gives, in fact, no bounds for these distances unless √(CN) < 1, respectively √(CN) < √2. On the other hand, the bounds for D_KL and D_{χ²} are meaningful for any values of C and N, since these divergences could a priori take arbitrarily large values.

We now generalize the definition of f-divergence to atomic measures that need not be probabilities. Let p := {p_1, ..., p_N} and q := {q_1, ..., q_N} be vectors with p_i ≥ 0, q_i > 0, 1 ≤ i ≤ N. Given a convex function f with f(1) = 0, the f-divergence between p and q is defined by

    D_f(p ‖ q) := Σ_{i=1}^N q_i f(p_i / q_i).

This generalization will be useful for the analysis. We note, however, that the interpretation of these generalized f-divergences as "distances" is somewhat lost, as they can take negative values. The following lemma gives a general upper bound on the f-divergence between arbitrary weights and the uniform weights u. Examples are given in Table 1.

LEMMA 1. Let p := {p_1, ..., p_N} and u = {1/N, ..., 1/N} be probability vectors. Then

    D_f(p ‖ u) ≤ (f(N) + (N − 1) f(0)) / N =: U_f(N).

If p has non-negative entries but is allowed to have total mass Σ p_i ≤ 1 + ε, then

    D_f(p ‖ u) ≤ (f((1 + ε)N) + (N − 1) f(0)) / N =: U_f(N, ε).    (2)

Equality is achieved when p_i = 1 (respectively p_i = 1 + ε) for some 1 ≤ i ≤ N and all other entries vanish.

Proof.
It follows from the convexity of f that D_f(· ‖ ·) is convex in its first argument (e.g. Csiszár & Shields, 2004). Hence, by convexity, the p in the probability simplex that maximizes D_f(p ‖ u) is at one of the vertices of the simplex, i.e. there is 1 ≤ i ≤ N with p_i = 1. The expression for U_f is then obtained by substituting such p in the definition of D_f. The proof when p is allowed to have total mass 1 + ε is identical. □
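A quick numerical check of Lemma 1 (a sketch under our own choice of f and of the weight vectors): for the χ² case f(x) = (x − 1)², Table 1 gives U_f(N) = N − 1, and the maximum is attained at a vertex of the simplex.

    import numpy as np

    f = lambda x: (x - 1.0) ** 2                     # the chi^2 choice of f

    def D_f(p, q):
        # generalized f-divergence between non-negative vectors p and q (q_i > 0)
        return float(np.sum(q * f(p / q)))

    N = 50
    u = np.full(N, 1.0 / N)                          # uniform probability vector

    rng = np.random.default_rng(1)
    for _ in range(5):                               # random probability vectors p
        p = rng.dirichlet(np.ones(N))
        assert D_f(p, u) <= N - 1 + 1e-10            # Lemma 1: D_f(p || u) <= U_f(N) = N - 1

    vertex = np.zeros(N)
    vertex[0] = 1.0                                  # vertex of the simplex: equality case
    print(D_f(vertex, u), "should equal", N - 1)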
4. MAIN RESULTS: NECESSARY SAMPLE SIZE
This section contains the main results of the paper, and formalizes the heuristic argument outlined in the introduction.

THEOREM 1 (NECESSARY SAMPLE SIZE). Let ε, δ > 0, and let U_f(N, ε) be defined as in (2). Assume that with Q-positive probability i) and ii) below hold simultaneously:

    i) π^N(1) ≤ 1 + ε;
    ii) |Q(f ∘ g) − Q^N_MC(f ∘ g)| ≤ δ.

Then

    D_f(P ‖ Q) ≤ U_f(N, ε) + δ.

Proof.
Note that

    Q(f ∘ g) = D_f(P ‖ Q),    Q^N_MC(f ∘ g) = D_f(w^N ‖ u).

Let A ∈ F be the set where i) and ii) hold. Using Lemma 1,

    Q(A) D_f(P ‖ Q) ≤ ∫_A (|D_f(P ‖ Q) − D_f(w^N ‖ u)| + |D_f(w^N ‖ u)|) dQ
                    = ∫_A (|Q(f ∘ g) − Q^N_MC(f ∘ g)| + |D_f(w^N ‖ u)|) dQ
                    ≤ Q(A) {δ + U_f(N, ε)}.

Since by assumption Q(A) > 0, the proof is complete. □

For the Kullback-Leibler and χ² divergences, condition ii) in Theorem 1 can be rewritten in terms of the particle measure π^N. The next result is thus a reformulation of Theorem 1, where the necessary sample size is derived by using the expressions for U_f(N, ε) in Table 1. The proof is immediate and will be omitted.
THEOREM 2 (NECESSARY SAMPLE SIZE: EXAMPLES). Let ε, δ > 0.

1. If N < (1 + ε)⁻¹ exp{(D_KL − δ)/(1 + ε)}, then, with probability at least 1/2, either π^N(1) − P(1) > ε or |π^N(log g) − P(log g)| > δ.

2. If N < (1 + ε)⁻² (1 + 2ε + D_{χ²} − δ), then, with probability at least 1/2, either π^N(1) − P(1) > ε or |π^N(g) − P(g)| > δ.

3. If N < (1 + ε/2 + δ − D_TV)⁻¹, then, with probability at least 1/2, either π^N(1) − P(1) > ε or |Q(|g − 1|/2) − Q^N_MC(|g − 1|/2)| > δ.

4. If N < 4(1 + ε)(2 + ε + δ − D_Hell²)⁻², then, with probability at least 1/2, either π^N(1) − P(1) > ε or |Q((√g − 1)²) − Q^N_MC((√g − 1)²)| > δ.

Remark 2. Note that ε and δ in Theorems 1 and 2 are arbitrary. In particular, choosing δ* ∈ (0, 1) and replacing δ by δ* Q(f ∘ g) immediately gives relative error conditions, as opposed to the absolute ones above. It could also be interesting to consider scaling ε and δ with N, but we do not pursue this here.
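For convenience, the four thresholds of Theorem 2 can be packaged as follows (a sketch; the divergence values passed in the example call are placeholders, not taken from the paper, and the default ε = δ = 0.1 is simply the choice later used in Section 5).

    import numpy as np

    def necessary_sample_sizes(d_kl, d_chi2, d_tv, d_hell2, eps=0.1, delta=0.1):
        """Thresholds of Theorem 2: sample sizes below these values imply failure,
        with large probability, for the corresponding test functions."""
        n_kl = np.exp((d_kl - delta) / (1.0 + eps)) / (1.0 + eps)
        n_chi2 = (1.0 + 2.0 * eps + d_chi2 - delta) / (1.0 + eps) ** 2
        n_tv = 1.0 / (1.0 + eps / 2.0 + delta - d_tv)
        n_hell = 4.0 * (1.0 + eps) / (2.0 + eps + delta - d_hell2) ** 2
        return n_kl, n_chi2, n_tv, n_hell

    # Placeholder divergence values, chosen only to exercise the formulas.
    print(necessary_sample_sizes(d_kl=4.5, d_chi2=50.0, d_tv=0.85, d_hell2=1.3))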
For D KL and D χ the required sample size grows without bound, regardless of ǫ and δ , as thetarget and proposal become further apart. This is in analogy with Remark 1. D. S
Table 2. Necessary sample size given by Theorem 2 for Q = N(0, 1) and P = N(m, 1), for several values of m; columns report N_KL, N_{χ²}, N_TV and N_Hell.
Table 3. Necessary sample size given by Theorem 2 for Q = N(0, 1) and P = N(0, σ²), for several values of σ²; columns report N_KL, N_{χ²}, N_TV and N_Hell.
5. EXAMPLES
We illustrate the bounds in Theorem 2 with simple examples. In all of them we let the proposal be a standard Gaussian distribution, Q = N(0, 1), and we take as target a Gaussian distribution P = N(m, σ²) with mean m and covariance σ² to be specified later. In this framework D_KL and D_Hell can be computed in closed form, and we estimate D_{χ²} and D_TV by an intensive Monte Carlo computation with a large number of samples. Clearly, absolute continuity of P with respect to Q always holds. We fix ε = δ = 0.1. Tables 2 and 3 contain the necessary sample size given by Theorem 2 for all four metrics D_KL, D_{χ²}, D_TV and D_Hell. In Table 2, σ² = 1 is fixed, and we vary the values of m. For any value of m, g has bounded Q-moments of all orders. In Table 3, m = 0 is fixed and we vary the values of σ². Here, g has a finite Q-moment of order α > 1 if and only if σ² < α/(α − 1). In particular D_{χ²} is finite if and only if σ² < 2, and when this condition is not met the corresponding bound in Theorem 2 becomes meaningless (see Remark 4).

In order to compare the results derived with each divergence, it is important to keep in mind the discussion in Remark 4. Tables 2 and 3 show, as predicted, that D_{χ²} yields the largest required sample size. The analysis with this metric becomes, however, meaningless when Q(g²) is infinite (Table 3, σ² ≥ 2). Table 3 is illustrative. As mentioned above, it is desirable that the tails of the proposal are heavier than those of the target (i.e. σ² < 1). The total variation and Hellinger distances are symmetric in their arguments, and hence they fail to see the problems arising from heavy target tails (large σ²). The asymmetric divergences D_KL and D_{χ²} do capture the asymmetric behaviour of the problem: D_{χ²} in a dramatic fashion, as it becomes infinite for σ² ≥ 2, and D_KL gives a bound of the same order when the ratio of the covariances of Q and P is inverted. It is perhaps more surprising to see the poor bounds that these metrics yield in Table 2, where target and proposal differ only by a shift, but this is also explained by the discussion in Remark 4.
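As a sketch of how the inputs to Theorem 2 can be obtained in this Gaussian setting (our own reconstruction, not the exact computation behind Tables 2 and 3), the divergences for a mean shift admit closed forms, and D_TV can also be checked by Monte Carlo; the shift m and the Monte Carlo sample size below are illustrative assumptions.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    m = 2.0                                           # mean shift: Q = N(0,1), P = N(m,1)

    d_kl = m ** 2 / 2.0                               # Q(g log g)
    d_chi2 = np.exp(m ** 2) - 1.0                     # Q(g^2) - 1
    d_tv = 2.0 * stats.norm.cdf(m / 2.0) - 1.0        # sup_A |P(A) - Q(A)|
    d_hell2 = 2.0 * (1.0 - np.exp(-m ** 2 / 8.0))     # Q((sqrt(g) - 1)^2)

    # Monte Carlo check of D_TV = (1/2) Q(|g - 1|) with samples from Q.
    v = rng.standard_normal(10 ** 6)
    g = np.exp(m * v - m ** 2 / 2.0)                  # density ratio dP/dQ for the shift
    print("D_TV closed form vs Monte Carlo:", d_tv, 0.5 * np.mean(np.abs(g - 1.0)))
    print("D_KL, D_chi2, D_Hell^2:", d_kl, d_chi2, d_hell2)

These divergence values can then be passed to the necessary_sample_sizes sketch after Theorem 2 to reproduce the kind of thresholds reported in the tables.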
6. CONCLUSION
The approach and results in this paper give new insight into the fundamental challenge that importance sampling faces as a building block of more sophisticated algorithms: the target and the proposal must be sufficiently close. As noted elsewhere (Agapiou et al., 2015), the often claimed curse of dimension of importance sampling (Bengtsson et al., 2008), and consequently of particle filters, hinges exclusively on the observation that measures tend to become further apart in larger dimensional spaces. However, this need not be the case, and is indeed not so in many data assimilation problems of applied interest (Agapiou et al., 2015; Chorin & Morzfeld, 2013). Topics for further research include the extension to autonormalized importance sampling and other related algorithms, and the question of how to optimize over the choice of f with a given Q-integrability of f ∘ g to achieve the largest necessary condition on the required sample size.

REFERENCES
AGAPIOU, S., PAPASPILIOPOULOS, O., SANZ-ALONSO, D. & STUART, A. M. (2015). Importance sampling: computational complexity and intrinsic dimension. arXiv preprint arXiv:1511.06196.
ALI, S. M. & SILVEY, S. D. (1966). A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society, Series B (Methodological), 131–142.
BENGTSSON, T., BICKEL, P., LI, B. et al. (2008). Curse-of-dimensionality revisited: collapse of the particle filter in very large scale systems. In Probability and Statistics: Essays in Honor of David A. Freedman. Institute of Mathematical Statistics, pp. 316–334.
CHATTERJEE, S. & DIACONIS, P. (2015). The sample size required in importance sampling. arXiv preprint arXiv:1511.01437.
CHORIN, A. J. & MORZFELD, M. (2013). Conditions for successful data assimilation. Journal of Geophysical Research: Atmospheres, 11,522–11,533.
CSISZÁR, I. (1963). Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten. Publ. Math. Inst. Hungar. Acad. Sci., 95–108.
CSISZÁR, I. (1967). Information-type measures of difference of probability distributions and indirect observations. Studia Sci. Math. Hungar., 299–318.
CSISZÁR, I. & SHIELDS, P. C. (2004). Information Theory and Statistics: A Tutorial. Now Publishers Inc.
DEL MORAL, P. (2004). Feynman-Kac Formulae. Springer.
DOUCET, A., DE FREITAS, N. & GORDON, N. (2001). An introduction to sequential Monte Carlo methods. In Sequential Monte Carlo Methods in Practice. Springer, pp. 3–14.
GIBBS, A. L. & SU, F. E. (2002). On choosing and bounding probability metrics. International Statistical Review, 419–435.
HULT, H. & NYQUIST, P. (2012). Large deviations for weighted empirical measures arising in importance sampling. arXiv preprint arXiv:1210.2251.
KAHN, H. & MARSHALL, A. W. (1953). Methods of reducing sample size in Monte Carlo computations. Journal of the Operations Research Society of America, 263–278.
KULLBACK, S. & LEIBLER, R. A. (1951). On information and sufficiency. The Annals of Mathematical Statistics, 79–86.
LIESE, F. & VAJDA, I. (1987). Convex Statistical Distances, vol. 95. Teubner-Texte zur Mathematik.
LIU, J. S. (2008). Monte Carlo Strategies in Scientific Computing. Springer Science & Business Media.
SANOV, I. N. (1958). On the Probability of Large Deviations of Random Variables. United States Air Force, Office of Scientific Research.
SASON, I. & VERDÚ, S. (2015). Bounds among f-divergences. arXiv preprint arXiv:1508.00335.
SIEGMUND, D. (1976). Importance sampling in the Monte Carlo study of sequential tests. The Annals of Statistics, 673–684.