Product-form estimators: exploiting independence to scale up Monte Carlo
Juan Kuntz†‡, Francesca R. Crucinio†, Adam M. Johansen†‡

February 24, 2021

∗ JK and AMJ acknowledge support from the EPSRC.
† Department of Statistics, University of Warwick, Coventry, U.K. Email: {juan.kuntz-nussio, f.crucinio, a.m.johansen}@warwick.ac.uk
‡ The Alan Turing Institute, London, U.K.
Abstract
We introduce a novel class of Monte Carlo estimators for product-form targets that aim to overcome the rapid growth of variance with dimension often observed for standard estimators. We establish their unbiasedness, consistency, and asymptotic normality. We show that they achieve lower variances than their conventional counterparts given the same number of samples drawn from the target, investigate the gap in variance via several examples, and identify the situations in which the difference is most, and least, pronounced. We also study the estimators' computational cost and investigate the settings in which they are most efficient. We illustrate their utility beyond the product-form setting by giving two simple extensions (one to targets that are mixtures of product-form distributions and another to targets that are absolutely continuous with respect to a product-form distribution) and conclude by discussing further possible uses.
1 Introduction

Monte Carlo methods are often said to overcome the curse of dimensionality because their rates of convergence are square root in the number of samples drawn regardless of the target's dimension. However, in practice one faces two issues when computing high dimensional integrals with Monte Carlo: (a) the target distributions featuring in the integrals are often intractable and cannot be sampled from (e.g. see [23, p.358]); and (b) the constants present in the convergence rates typically grow rapidly with dimension. Consequently, even if we are able to draw samples from the high dimensional target, the number of samples necessary to obtain estimates of a predetermined accuracy is often prohibitively large [35, 36, 4]. Variants of the following toy example are sometimes given to illustrate this point (e.g. [8, p.95]).

Let µ be a K-dimensional isotropic Gaussian distribution N(1, I) with unit means and variances, and consider the basic Monte Carlo estimator for the mean (µ(ϕ) = 1) of the product (ϕ(x) := x_1 x_2 ⋯ x_K) of its components (x_1, x_2, ..., x_K):

\[ \mu^N(\varphi) := \frac{1}{N}\sum_{n=1}^N \varphi(X_1^n,\dots,X_K^n) = \frac{1}{N}\sum_{n=1}^N X_1^n \cdots X_K^n, \tag{1} \]

where (X_1^n, ..., X_K^n)_{n=1}^N denote i.i.d. samples drawn from N(1, I). Because the estimator has an asymptotic variance σ²(ϕ) of 2^K − 1, the number of samples required for it to produce a reasonable estimate of µ(ϕ) grows exponentially with the target's dimension. Hence, µ^N(ϕ) is impractical for large Ks. For instance, if K = 20, we require ≈ 10^10 samples to obtain an estimate with standard deviation of 0.01, reaching the limits of most present-day personal computers, and if K = 30, we require 10^13 samples, exceeding these limits.

The literature dedicated to overcoming our inability to sample from most target distributions of interest is vast. Typical solutions involve the use of importance sampling (IS), sequential Monte Carlo (SMC) generalisations thereof, Markov chain Monte Carlo (MCMC), or a combination of these; see [34, 8] and references therein. These methods, however, were not designed with the second issue in mind and, hence, there is no reason to believe that they should solve it. Quite the opposite: they generally aggravate it. Unless one uses proposals specially designed for a particular test function ϕ, IS and SMC estimators typically exhibit higher variances than the basic i.i.d. estimator and, at least for SMC, it is common practice to tailor the proposal to the target but not to any individual test functions [8, p.88]. For that matter, the effective sample size criterion [16] typically used to monitor impoverishment of SMC weights does not even take into account the test functions of interest. It also never exceeds the actual sample size, further emphasising the idea that an unweighted i.i.d. ensemble is generally the best approximation to the target one can hope for using SMC. Similarly, the central limit theorem for Markov chains [25, Theorem 17.0.1] tells us that the asymptotic variance σ²_MCMC(ϕ) of an MCMC estimator N^{-1} \sum_{n=0}^{N-1} ϕ(Y_n), for a positive Harris recurrent chain (Y_n)_{n=0}^∞ with stationary distribution µ, equals that of the i.i.d. estimator plus twice the integrated autocovariance function:

\[ \sigma^2_{\mathrm{MCMC}}(\varphi) = \sigma^2(\varphi) + 2\sum_{n=1}^\infty \mathbb{E}_\mu[(\varphi(Y_0)-\mu(\varphi))(\varphi(Y_n)-\mu(\varphi))], \]

where E_µ denotes expectation with respect to an underlying measure P_µ such that Y_0 ∼ µ. It follows that σ²_MCMC(ϕ) is bounded below by σ²(ϕ) unless the samples are negatively correlated. This is rarely the case with MCMC algorithms because the 'local moves' most of them use to explore the state space tend to make consecutive samples highly correlated. For instance, any reversible chain with a positive transition kernel (i.e. one for which consecutive samples are positively correlated: E_µ[(ϕ(Y_0) − µ(ϕ))(ϕ(Y_1) − µ(ϕ))] ≥ 0 for all square µ-integrable ϕs) has a non-negative autocovariance function (e.g. [26, Theorem 3.7.1]). This covers the great majority of Markov chains used in practice: the Metropolized independence sampler [20], random-walk Metropolis with Gaussian or Student-t proposals [3] or autoregressively correlated proposals of the same form [10], to name but a few. While in principle it might be possible to construct antithetic Markov chains with negative odd-order autocovariances, and doing so remains an active area of research, this has only been attained in special situations (e.g. for Hamiltonian Monte Carlo when Hamilton's equations can be solved analytically [29]). Furthermore, to counteract the exponential growth of σ²(ϕ) with K, one would need to achieve negative autocovariances that grow exponentially with the dimension of the state space: an even more challenging task.

To deal with the exponential growth of the variance, one could instead try one of the many variance reduction techniques available (see, for example, [2, 34]), e.g. control variates, antithetic sampling, Rao-Blackwellization, stratification, etc. These techniques can be tremendously powerful when they can be successfully applied.
However, this requires more detailed information on µ than is available for most targets. Moreover, most of these techniques predate the last few decades, during which interest in high dimensional targets became widespread, and, consequently, were not designed to mitigate the fast growth of estimator variances with dimension.

There is, however, a trivial way of using Monte Carlo to overcome the issue for the above example without any knowledge about µ beyond the fact that it is product-form. In particular, µ is the product µ_1 × ⋯ × µ_K of K univariate unit-mean, unit-variance Gaussian distributions µ_1, ..., µ_K and ϕ is the product of the K univariate functions ϕ_1(x_1) = x_1, ..., ϕ_K(x_K) = x_K. Hence, µ(ϕ) is the product µ_1(ϕ_1) ⋯ µ_K(ϕ_K) of the corresponding K univariate means µ_1(ϕ_1), ..., µ_K(ϕ_K). As we will see in Section 2.1, estimating each of these means separately and taking the resulting product, we obtain an estimator for µ(ϕ) whose asymptotic variance is K:

\[ \mu_\times^N(\varphi) := \frac{1}{N^K}\sum_{n_1=1}^N\cdots\sum_{n_K=1}^N \varphi(X_1^{n_1},\dots,X_K^{n_K}) = \left(\frac{1}{N}\sum_{n_1=1}^N X_1^{n_1}\right)\cdots\left(\frac{1}{N}\sum_{n_K=1}^N X_K^{n_K}\right). \tag{2} \]

Consequently, the number of samples necessary for µ^N_×(ϕ) to yield a reasonable estimate of µ(ϕ) only grows linearly with the dimension, allowing us to practically deal with Ks in the millions.

The central expression in (2) makes sense regardless of whether ϕ is the product of univariate test functions. It defines a, to the best of our knowledge, novel type of Monte Carlo estimator for general ϕ and product-form µ. The aim of this paper is to theoretically characterise these product-form estimators and their extensions.

Aside from being unbiased, consistent, and asymptotically normal, these estimators achieve lower variances than the standard one in (1). The reason behind the variance reduction is simple: if (X_1^1, ..., X_1^N), ..., (X_K^1, ..., X_K^N) are independent sequences of samples respectively drawn from µ_1, ..., µ_K, then every permutation of these samples has law µ, that is,

\[ (X_1^{n_1},\dots,X_K^{n_K}) \sim \mu \quad \forall\, n_1,\dots,n_K \le N. \tag{3} \]

Hence, µ^N_×(ϕ) in (2) averages over N^K tuples with law µ while its conventional counterpart (1) only averages over N such tuples. This increase in tuple number leads to a decrease in estimator variance, and we say that the product-form estimator is more statistically efficient than the standard one. Moreover, obtaining these N^K tuples does not require drawing any further samples from µ and, in this sense, product-form estimators make the most out of every sample available. However, in contrast with the tuples in (1), those in (2) are not independent (the same components are repeated across several tuples). For this reason, product-form estimators achieve the same O(N^{-1/2}) rate of convergence that conventional ones do, and the variance reduction materialises only in lower proportionality constants (i.e. lim_{N→∞} Var(µ^N_×(ϕ))/Var(µ^N(ϕ)) = C for some constant C ≤ 1). The estimators' memory requirements are modest: to store the N^K permuted tuples in (2) we need only store the KN numbers X_1^1, ..., X_1^N, ..., X_K^1, ..., X_K^N. Their time complexity, however, scales exponentially with dimension: computing the sum in (2) requires O(N^K) operations.
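The gap between (1) and (2) is easy to reproduce numerically. Below is a minimal sketch (our illustration, not code from the paper; all names are ours) that evaluates the product-form estimator through the middle expression in (2), so that its cost is O(KN) rather than O(N^K):

```python
import numpy as np

rng = np.random.default_rng(0)
K, N, reps = 10, 1_000, 200   # dimension, samples per estimate, repetitions

std_est, prod_est = [], []
for _ in range(reps):
    X = rng.normal(1.0, 1.0, size=(N, K))     # N i.i.d. draws from N(1, I_K)
    std_est.append(X.prod(axis=1).mean())     # conventional estimator (1)
    prod_est.append(X.mean(axis=0).prod())    # product-form estimator (2)

print("true value              :", 1.0)
print("conventional: mean, var :", np.mean(std_est), np.var(std_est))
print("product-form: mean, var :", np.mean(prod_est), np.var(prod_est))
# Theory: Var of (1) is (2**K - 1)/N; Var of (2) is approximately K/N.
print("theoretical variances   :", (2**K - 1) / N, K / N)
```

With K = 10 and N = 1000, the empirical variances differ by roughly two orders of magnitude, in line with the theoretical ratio (2^K − 1)/K ≈ 10².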
Consequently, the use of product-form estimators generally proves to be a balancing act in which one must weigh the cost of acquiring new samples from µ (be it a computational one if the samples are obtained from simulations, or a real-life one if they are obtained from experiments) against the extra overhead required to evaluate these estimators.

For test functions that are sums of products (SOP) of univariate test functions, the cost is easily lowered to just O(N). For instance, in the case of the toy Gaussian example above, we can evaluate the product-form estimator in O(N) operations by expressing it as the product of the component-wise sample averages and computing each separately (i.e. using the middle expression in (2)). Because many test functions can be approximated well with SOPs (e.g. analytic functions approximated by polynomials), this cheaper approach can be extended to non-SOP test functions at the expense of introducing some bias. Doing so just amounts to approximating a high-dimensional integral with a linear combination of products of low-dimensional integrals, estimating each of the latter separately, and plugging the estimates back into the linear combination to obtain an estimate of the original integral.

As we show in the paper, product-form estimators are also of use for non-product-form targets. In particular, they are straightforwardly extended to mixtures of product-form distributions. By then borrowing ideas from importance sampling, we further extend our methodology to targets that are absolutely continuous with respect to product-form distributions.

Relation to the literature.
As discussed in [17], product-form estimators and their importance sampling variants underpin much of the divide-and-conquer sequential Monte Carlo methodology introduced in [19]. Given the simplicity of the idea underlying product-form estimators and their potential utility, we find it difficult to believe that they have not been considered elsewhere. Indeed, in the simplest setting of integrating product-form functions with respect to product measures, it is well known that better performance is obtained by separately approximating the marginal integrals; see, for example, [8, Section 8.7]. However, to the best of the authors' knowledge, there has been no previous systematic attempt to use or understand product-form estimators outside of trivial settings.
Paper structure.
Section 2 studies product-form estimators and their theoretical properties. We show that the estimators are strongly consistent, unbiased, and asymptotically normal, and we give expressions for their finite sample and asymptotic variances (Section 2.1). We then argue that they are more statistically efficient than their conventional counterparts in the sense that they achieve lower variances given the same number of samples (Section 2.2). Lastly, we consider their computational cost (Section 2.3) and explore the circumstances in which they prove most computationally efficient (Section 2.4). Section 3 considers two extensions for targets that are not product-form: the first for targets that are mixtures of product-form distributions (Section 3.1) and the second for targets that are absolutely continuous with respect to product-form distributions (Section 3.2). We finish the paper with a discussion of the results, future research directions, and potential applications (Section 4).
2 Product-form estimators

Consider the basic Monte Carlo problem: given a probability distribution µ on a measurable space (S, 𝒮) and a function ϕ belonging to the space L²_µ of square µ-integrable real-valued functions on S, estimate the average

\[ \mu(\varphi) := \int \varphi(x)\,\mu(dx). \]

Throughout this section, we focus on the question 'by exploiting the product-form structure of a target µ, can we design estimators of µ(ϕ) that are more efficient than the usual ones?'. By product-form, we mean that µ is the product of K > 1 distributions µ_1, ..., µ_K on measurable spaces (S_1, 𝒮_1), ..., (S_K, 𝒮_K) satisfying S = S_1 × ⋯ × S_K and 𝒮 = 𝒮_1 × ⋯ × 𝒮_K, where the latter denotes the product sigma algebra. Furthermore, if A is a non-empty subset of [K] := {1, ..., K}, then µ_A := ∏_{k∈A} µ_k denotes the product of the µ_ks indexed by ks in A and µ_A(ϕ) denotes the measurable function on ∏_{k∉A} S_k obtained by integrating the arguments of ϕ indexed by ks in A with respect to µ_A:

\[ \mu_A(\varphi)(x_{A^c}) := \int \varphi(x_A, x_{A^c})\,\mu_A(dx_A) \quad \forall x_{A^c} \in \prod_{k\notin A} S_k, \]

where A^c := [K] \ A denotes A's complement. If A is empty, we set µ_A(ϕ) := ϕ.

2.1 Unbiasedness, consistency, and asymptotic normality

Suppose we have at our disposal N i.i.d. samples X^1, ..., X^N drawn from µ. That is, N tuples (X_1^1, ..., X_K^1), ..., (X_1^N, ..., X_K^N) of i.i.d. samples X_1^1, ..., X_1^N, ..., X_K^1, ..., X_K^N independently drawn from µ_1, ..., µ_K, respectively. As we will see in Section 2.2, the product-form estimator,

\[ \mu_\times^N(\varphi) := \frac{1}{N^K}\sum_{n_1,\dots,n_K=1}^N \varphi(X_1^{n_1},\dots,X_K^{n_K}), \tag{4} \]

where '\sum_{n_1,\dots,n_K=1}^N' is to be interpreted as '\sum_{n_1=1}^N \cdots \sum_{n_K=1}^N', extracts better estimates of µ(ϕ) from X^1, ..., X^N than those afforded by the conventional estimator,

\[ \mu^N(\varphi) := \frac{1}{N}\sum_{n=1}^N \varphi(X^n), \tag{5} \]

regardless of whether the test function ϕ possesses any sort of product structure. The reason behind this is as follows: while the conventional estimator directly approximates the target with the samples' empirical distribution,

\[ \mu \approx \frac{1}{N}\sum_{n=1}^N \delta_{X^n} =: \mu^N, \tag{6} \]

the product-form estimator instead first approximates the marginals µ_1, ..., µ_K of the target with the corresponding component-wise empirical distributions,

\[ \mu_1^N := \frac{1}{N}\sum_{n=1}^N \delta_{X_1^n}, \quad \dots, \quad \mu_K^N := \frac{1}{N}\sum_{n=1}^N \delta_{X_K^n}, \]

and then takes the product of these to obtain an approximation of µ,

\[ \mu \approx \left(\frac{1}{N}\sum_{n_1=1}^N \delta_{X_1^{n_1}}\right) \times \cdots \times \left(\frac{1}{N}\sum_{n_K=1}^N \delta_{X_K^{n_K}}\right) = \frac{1}{N^K}\sum_{n_1,\dots,n_K=1}^N \delta_{(X_1^{n_1},\dots,X_K^{n_K})} =: \mu_\times^N. \tag{7} \]

The built-in product structure in µ^N_× makes it a better suited approximation to the product-form target µ than the non-product µ^N. Before pursuing this further, we take a moment to show that µ^N_×(ϕ) is a well-founded estimator for µ(ϕ) and obtain expressions for its variance:

Theorem 2.1 (Product-form estimators are unbiased, strongly consistent, and asymptotically normal). If ϕ is µ-integrable, then µ^N_×(ϕ) in (4) is unbiased:

\[ \mathbb{E}[\mu_\times^N(\varphi)] = \mu(\varphi) \quad \forall N > 0. \]

If, furthermore, ϕ belongs to L²_µ, then µ_{A^c}(ϕ) belongs to L²_{µ_A} for all subsets A of [K]. The estimator's variance is given by

\[ \mathrm{Var}(\mu_\times^N(\varphi)) = \sum_{\emptyset \neq A \subseteq [K]} \frac{1}{N^{|A|}} \sum_{B \subseteq A} (-1)^{|A|-|B|} \sigma^2_{A,B}(\mu_{A^c}(\varphi)) \quad \forall N > 0, \tag{8} \]

where |A| and |B| denote the cardinalities of A and B, and

\[ \sigma^2_{A,B}(\psi) := \mu_B([\mu_{A\setminus B}(\psi) - \mu_A(\psi)]^2) \quad \forall \psi \in L^2_{\mu_A},\; B \subseteq A \subseteq [K]. \tag{9} \]

Furthermore, µ^N_×(ϕ) is consistent and asymptotically normal:

\[ \mu_\times^N(\varphi) \to \mu(\varphi) \quad \text{as } N \to \infty, \text{ almost surely,} \tag{10} \]
\[ N^{1/2}[\mu_\times^N(\varphi) - \mu(\varphi)] \Rightarrow \mathcal{N}(0, \sigma^2_\times(\varphi)) \quad \text{as } N \to \infty, \tag{11} \]

where σ²_×(ϕ) := \sum_{k=1}^K σ²_k(ϕ) with σ²_k(ϕ) := µ_k([µ_{{k}^c}(ϕ) − µ(ϕ)]²) for all k in [K], and ⇒ denotes convergence in distribution.

There is one outstanding technical issue pertinent to the above theorem that we have been unable to resolve: if the test function ϕ is µ-integrable but not square µ-integrable (i.e. ϕ ∉ L²_µ), then the basic estimator µ^N(ϕ) is still strongly consistent even though its variance is infinite. The expression (8) shows that the product-form estimator µ^N_×(ϕ) also has infinite variance for such ϕ. Is it, however, consistent (i.e. does (10) still hold)? We leave this as an open question.
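For test functions without product structure, (4) must be evaluated by summing over all N^K permuted tuples. The following brute-force sketch (our own illustration; the function names are ours) computes (4) and (5) from the same samples for a small K, where the O(N^K) cost is still affordable:

```python
import itertools

import numpy as np

def product_form_estimate(X, phi):
    """Estimator (4): average phi over all N**K permuted tuples of X.

    X is an (N, K) array whose k-th column holds N i.i.d. draws from mu_k;
    phi maps a length-K vector to a scalar. Costs O(N**K) evaluations."""
    N, K = X.shape
    cols = list(range(K))
    total = 0.0
    for idx in itertools.product(range(N), repeat=K):
        total += phi(X[list(idx), cols])
    return total / N ** K

def conventional_estimate(X, phi):
    """Estimator (5): average phi over the N original tuples only."""
    return np.mean([phi(x) for x in X])

rng = np.random.default_rng(1)
X = rng.normal(1.0, 1.0, size=(30, 3))        # mu = N(1, I_3), N = 30, K = 3
phi = lambda x: np.exp(-np.abs(x).sum())      # a non-product test function
print(product_form_estimate(X, phi), conventional_estimate(X, phi))
```

Both estimates are unbiased for µ(ϕ) (Theorem 2.1); the first averages 27,000 tuples built from the same 90 stored numbers.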
The remainder of this section is dedicated to the theorem's proof. For it, we require the following expressions for the L² norms of products of the 'marginal errors' µ^N_1 − µ_1, ..., µ^N_K − µ_K. They tell us that the product of l of these errors has O(N^{-l/2}) norm, as one would expect given that the errors are independent and that classical theory (e.g. [8, p.168]) tells us that the norm of each is O(N^{-1/2}).

Lemma 2.2 (L² norm for products of marginal errors). If ∅ ≠ A ⊆ [K], ψ ∈ L²_{µ_A}, and σ²_{A,B}(ψ) is as in (9), then

\[ \mathbb{E}\left[\left(\prod_{k\in A}[\mu_k^N - \mu_k]\right)(\psi)^2\right] = \frac{1}{N^{|A|}}\sum_{B\subseteq A}(-1)^{|A|-|B|}\sigma^2_{A,B}(\psi) \quad \forall N > 0. \]

Proof. We start with a simple identity: because \sum_{i=0}^j (-1)^i \binom{j}{i} = (1-1)^j vanishes unless j = 0,

\[ \sum_{B\subseteq A}(-1)^{|B|} = \sum_{i=0}^{|A|}\sum_{B\subseteq A:\,|B|=i}(-1)^{|B|} = \sum_{i=0}^{|A|}(-1)^i\binom{|A|}{i} = 1_{\{|A|=0\}} = 1_{\{A=\emptyset\}} \quad \forall A \subseteq [K]. \tag{12} \]

Next, fix N > 0 and ∅ ≠ A ⊆ [K], and note that

\[ \left(\prod_{k\in A}[\mu_k^N - \mu_k]\right)(\psi) = \mu_A^N(\psi_A), \quad \text{where } \psi_A := \sum_{B\subseteq A}(-1)^{|A|-|B|}\mu_{A\setminus B}(\psi). \tag{13} \]

We now establish that the function ψ_A has the following property:

\[ \mu_B(\psi_A) = 0 \quad \forall\, \emptyset \neq B \subseteq A. \tag{14} \]

Because µ_B(ψ_A) = µ_{B∖{k}}(µ_k(ψ_A)) for any k in a non-empty subset B of A, we need only show that µ_k(ψ_A) = 0 for any k in A. Fix any such k and note that

\[ (-1)^{|A|}\psi_A = \sum_{B\subseteq A\setminus\{k\}}(-1)^{-|B|}\mu_{A\setminus B}(\psi) + \sum_{B\subseteq A\setminus\{k\}}(-1)^{-|B\cup\{k\}|}\mu_{A\setminus(B\cup\{k\})}(\psi) = \sum_{B\subseteq A\setminus\{k\}}(-1)^{-|B|}\mu_{A\setminus B}(\psi) - \sum_{B\subseteq A\setminus\{k\}}(-1)^{-|B|}\mu_{A\setminus(B\cup\{k\})}(\psi). \]

We obtain µ_k(ψ_A) = 0 by integrating both sides with respect to µ_k (for each B ⊆ A \ {k}, µ_k leaves µ_{A∖B}(ψ) unchanged and maps µ_{A∖(B∪{k})}(ψ) to µ_{A∖B}(ψ), so the two sums cancel), and (14) follows.

Returning to (13), assume, without loss of generality, that A = {1, ..., |A|}, and note that

\[ \mathbb{E}\left[\left(\prod_{k\in A}[\mu_k^N-\mu_k]\right)(\psi)^2\right] = \mathbb{E}[\mu_A^N(\psi_A)^2] = \frac{1}{N^{2|A|}}\sum_{i_1,\dots,i_{|A|}=1}^N\sum_{j_1,\dots,j_{|A|}=1}^N \mathbb{E}[\psi_A(X_1^{i_1},\dots,X_{|A|}^{i_{|A|}})\,\psi_A(X_1^{j_1},\dots,X_{|A|}^{j_{|A|}})]. \]

Suppose that i_k ≠ j_k for some k. Because the components are independent and both X_k^{i_k} and X_k^{j_k} have law µ_k, and letting 𝓕_{i_k,j_k} denote the sigma algebra generated by all X_1^{i_1}, ..., X_{|A|}^{i_{|A|}}, X_1^{j_1}, ..., X_{|A|}^{j_{|A|}} except X_k^{i_k}, X_k^{j_k},

\[ \mathbb{E}[\psi_A(X_1^{i_1},\dots,X_{|A|}^{i_{|A|}})\,\psi_A(X_1^{j_1},\dots,X_{|A|}^{j_{|A|}})] = \mathbb{E}[\mathbb{E}[\psi_A(X_1^{i_1},\dots,X_{|A|}^{i_{|A|}})\mid\mathcal{F}_{i_k,j_k}]\;\mathbb{E}[\psi_A(X_1^{j_1},\dots,X_{|A|}^{j_{|A|}})\mid\mathcal{F}_{i_k,j_k}]] = \mathbb{E}[\mu_k(\psi_A)\,\mu_k(\psi_A)] = 0, \]

where the last equality follows from (14). Hence,

\[ \mathbb{E}[\mu_A^N(\psi_A)^2] = \frac{1}{N^{2|A|}}\sum_{i_1,\dots,i_{|A|}=1}^N \mathbb{E}[\psi_A(X_1^{i_1},\dots,X_{|A|}^{i_{|A|}})^2] = \frac{1}{N^{2|A|}}\sum_{i_1,\dots,i_{|A|}=1}^N \mu_A(\psi_A^2) = \frac{\mu_A(\psi_A^2)}{N^{|A|}}. \tag{15} \]

Because A is non-empty by assumption and (-1)^{|B|} = (-1)^{-|B|} for all B ⊆ A, (12) implies that

\[ \mu_A(\psi_A^2) = \mu_A\left(\left[\psi_A - \mu_A(\psi)(-1)^{|A|}\sum_{B\subseteq A}(-1)^{-|B|}\right]^2\right) = \mu_A\left(\left[\sum_{B\subseteq A}(-1)^{|A|-|B|}[\mu_{A\setminus B}(\psi)-\mu_A(\psi)]\right]^2\right) = \sum_{B\subseteq A}\sum_{B'\subseteq A}(-1)^{2|A|-|B|-|B'|}\,\mu_A([\mu_{A\setminus B}(\psi)-\mu_A(\psi)][\mu_{A\setminus B'}(\psi)-\mu_A(\psi)]). \]

But,

\[ \mu_A([\mu_{A\setminus B}(\psi)-\mu_A(\psi)][\mu_{A\setminus B'}(\psi)-\mu_A(\psi)]) = \mu_{B\cap B'}\big(\mu_{B\setminus B'}([\mu_{A\setminus B}(\psi)-\mu_A(\psi)])\,\mu_{B'\setminus B}([\mu_{A\setminus B'}(\psi)-\mu_A(\psi)])\big) = \mu_{B\cap B'}([\mu_{A\setminus(B\cap B')}(\psi)-\mu_A(\psi)]^2), \]

and, since (-1)^{2|A|-|B|-|B'|} = (-1)^{|B|+|B'|}, we obtain

\[ \mu_A(\psi_A^2) = \sum_{B\subseteq A}\sum_{B'\subseteq A}(-1)^{|B|+|B'|}\mu_{B\cap B'}([\mu_{A\setminus(B\cap B')}(\psi)-\mu_A(\psi)]^2) = \sum_{C\subseteq A}\mu_C([\mu_{A\setminus C}(\psi)-\mu_A(\psi)]^2)\sum_{B,B'\subseteq A:\,B\cap B'=C}(-1)^{|B|+|B'|}. \tag{16} \]

However, our starting identity implies that

\[ \sum_{B,B'\subseteq A:\,B\cap B'=C}(-1)^{|B|+|B'|} = \sum_{D\subseteq A\setminus C}\;\sum_{D'\subseteq A\setminus(C\cup D)}(-1)^{|C\cup D|+|C\cup D'|} = \sum_{D\subseteq A\setminus C}(-1)^{|D|}\sum_{D'\subseteq A\setminus(C\cup D)}(-1)^{|D'|} = \sum_{D\subseteq A\setminus C}(-1)^{|D|}1_{\{A\setminus(C\cup D)=\emptyset\}} = (-1)^{|A\setminus C|} = (-1)^{|A|-|C|}, \]

and the lemma follows from (16). ∎

We are now ready to tackle the proof of Theorem 2.1. We do it in steps.

Step 1: square integrability of µ_{A^c}(ϕ). That µ_{A^c}(ϕ) is square µ_A-integrable, for any subset A of [K], follows from our assumption that ϕ is square µ-integrable and Jensen's inequality:

\[ \mu_A(\mu_{A^c}(\varphi)^2) \le \mu_A(\mu_{A^c}(\varphi^2)) = \mu(\varphi^2) < \infty \quad \forall A \subseteq [K]. \]

Step 2: unbiasedness.
The key here is the following decomposition of the 'global error' µ^N_× − µ in terms of the marginal errors µ^N_1 − µ_1, ..., µ^N_K − µ_K:

\[ \mu_\times^N = \prod_{k=1}^K \mu_k^N = \prod_{k=1}^K [(\mu_k^N - \mu_k) + \mu_k] = \sum_{A\subseteq[K]}\left(\prod_{k\in A}[\mu_k^N-\mu_k]\right)\times \mu_{A^c}. \tag{17} \]

But, for any subset A of [K],

\[ \left(\left(\prod_{k\in A}[\mu_k^N-\mu_k]\right)\times\mu_{A^c}\right)(\varphi) = \left(\prod_{k\in A}[\mu_k^N-\mu_k]\right)(\mu_{A^c}(\varphi)) = \mu_A^N(\psi_A), \tag{18} \]

where ψ_A is as in (13) with ψ := µ_{A^c}(ϕ). Suppose that, w.l.o.g., A = {1, ..., |A|} and note that

\[ \mathbb{E}[\mu_A^N(\psi_A)] = \frac{1}{N^{|A|}}\sum_{i_1,\dots,i_{|A|}=1}^N \mathbb{E}[\psi_A(X_1^{i_1},\dots,X_{|A|}^{i_{|A|}})] = \mu_A(\psi_A). \]

But µ_A(ψ_A) = 0 unless A = ∅ (c.f. (14)). Taking expectations in (17,18) and comparing with (13), we find that the estimator is unbiased:

\[ \mathbb{E}[\mu_\times^N(\varphi)] = \psi_\emptyset = \mu_{[K]}(\varphi) = \mu(\varphi) \quad \forall N > 0. \]

Step 3: variance expressions. Note that, by (17,18),

\[ \mathrm{Var}(\mu_\times^N(\varphi)) = \mathbb{E}[[\mu_\times^N(\varphi)-\mu(\varphi)]^2] = \mathbb{E}\left[\left(\sum_{\emptyset\neq A\subseteq[K]}\mu_A^N(\psi_A)\right)^2\right] = \sum_{\emptyset\neq A\subseteq[K]}\sum_{\emptyset\neq A'\subseteq[K]}\mathbb{E}[\mu_A^N(\psi_A)\,\mu_{A'}^N(\psi_{A'})]. \tag{19} \]

If A \ A' is non-empty, then, picking any k therein, setting 𝓕_{≠k} to be the sigma algebra generated by all (X^n_{k'})_{n=1}^N with k' ≠ k, and exploiting independence, we obtain

\[ \mathbb{E}[\mu_A^N(\psi_A)\,\mu_{A'}^N(\psi_{A'})] = \mathbb{E}[\mathbb{E}[\mu_A^N(\psi_A)\,\mu_{A'}^N(\psi_{A'})\mid\mathcal{F}_{\neq k}]] = \mathbb{E}[\mathbb{E}[\mu_A^N(\psi_A)\mid\mathcal{F}_{\neq k}]\,\mu_{A'}^N(\psi_{A'})]. \]

But, once again assuming without loss of generality that A = {1, ..., |A|} and applying (14),

\[ \mathbb{E}[\mu_A^N(\psi_A)\mid\mathcal{F}_{\neq k}] = \frac{1}{N^{|A|}}\sum_{i_1,\dots,i_{|A|}=1}^N \mathbb{E}[\psi_A(X_1^{i_1},\dots,X_{|A|}^{i_{|A|}})\mid\mathcal{F}_{\neq k}] = \frac{1}{N^{|A|}}\sum_{i_1,\dots,i_{|A|}=1}^N \mu_k(\psi_A)(X_1^{i_1},\dots,X_{k-1}^{i_{k-1}},X_{k+1}^{i_{k+1}},\dots,X_{|A|}^{i_{|A|}}) = 0 \quad \text{a.s.,} \]

and we find that E[µ^N_A(ψ_A)µ^N_{A'}(ψ_{A'})] = 0. Because an analogous argument shows that this is the case if A' \ A is non-empty, (19) reduces to

\[ \mathrm{Var}(\mu_\times^N(\varphi)) = \sum_{\emptyset\neq A\subseteq[K]}\mathbb{E}[\mu_A^N(\psi_A)^2] = \sum_{\emptyset\neq A\subseteq[K]}\mathbb{E}\left[\left(\prod_{k\in A}[\mu_k^N-\mu_k]\right)(\mu_{A^c}(\varphi))^2\right], \]

and the variance expressions in (8) follow from Lemma 2.2.

Step 4: consistency and asymptotic normality.
From (17), it follows that

\[ \mu_\times^N(\varphi) - \mu(\varphi) = \sum_{k=1}^K [\mu_k^N(\mu_{\{k\}^c}(\varphi)) - \mu(\varphi)] + R^N \quad \forall N > 0, \tag{20} \]

where R^N := \sum_{A⊆[K]: |A|>1} (\prod_{k∈A}[µ^N_k − µ_k])(µ_{A^c}(ϕ)). Given the square integrability in Step 1 and that µ(ϕ) = µ_k(µ_{{k}^c}(ϕ)), the classical law of large numbers and central limit theorem imply that

\[ \lim_{N\to\infty}\, \mu_k^N(\mu_{\{k\}^c}(\varphi)) - \mu(\varphi) = 0 \text{ almost surely,} \quad N^{1/2}[\mu_k^N(\mu_{\{k\}^c}(\varphi)) - \mu(\varphi)] \Rightarrow \mathcal{N}(0, \sigma_k^2(\varphi)) \text{ as } N\to\infty, \quad \forall k \in [K]. \]

Because (X_1^n)_{n=1}^∞, ..., (X_K^n)_{n=1}^∞ are independent sequences, so are

\[ (\mu_1^N(\mu_{\{1\}^c}(\varphi)) - \mu(\varphi))_{N=1}^\infty, \quad \dots, \quad (\mu_K^N(\mu_{\{K\}^c}(\varphi)) - \mu(\varphi))_{N=1}^\infty, \]

and the continuous mapping theorem yields

\[ \sum_{k=1}^K [\mu_k^N(\mu_{\{k\}^c}(\varphi)) - \mu(\varphi)] \to 0, \quad \sum_{k=1}^K N^{1/2}[\mu_k^N(\mu_{\{k\}^c}(\varphi)) - \mu(\varphi)] \Rightarrow \mathcal{N}(0, \sigma_\times^2(\varphi)), \quad \text{as } N \to \infty. \]

Hence, if we can show that the remainder term R^N tends to zero fast enough, i.e.

\[ R^N \to 0 \text{ a.s.,} \quad N^{1/2}R^N \Rightarrow 0, \quad \text{as } N \to \infty, \tag{21} \]

then (10,11) follow from (20) and Slutsky's theorem. To do so, note that the same argument as in Step 3 shows that

\[ \mathbb{E}[(R^N)^2] = \mathbb{E}\left[\left(\sum_{A\subseteq[K]:\,|A|>1}\mu_A^N(\psi_A)\right)^2\right] = \sum_{A\subseteq[K]:\,|A|>1}\frac{1}{N^{|A|}}\sum_{B\subseteq A}(-1)^{|A|-|B|}\sigma^2_{A,B}(\mu_{A^c}(\varphi)) \]

for all N > 0. Markov's inequality then implies that

\[ P(\{|R^N| \ge \varepsilon\}) \le \frac{\mathbb{E}[(R^N)^2]}{\varepsilon^2} = \sum_{A\subseteq[K]:\,|A|>1}\frac{1}{N^{|A|}\varepsilon^2}\sum_{B\subseteq A}(-1)^{|A|-|B|}\sigma^2_{A,B}(\mu_{A^c}(\varphi)) \quad \forall \varepsilon > 0. \]

Summing over N and applying a Borel-Cantelli argument (the terms with |A| > 1 are O(N^{-2}), so the sum over N is finite), we obtain the first limit in (21). For the second, set ε := N^{-1/2}ϵ, for any ϵ > 0, and take the limit N → ∞. ∎

2.2 Statistical efficiency

Product-form estimators achieve lower variances than their conventional counterparts:
Proposition 2.3 (Statistical efficiency). If ϕ belongs to L²_µ and σ²(ϕ) := µ([ϕ − µ(ϕ)]²) denotes µ^N(ϕ)'s asymptotic variance, then

\[ \mathrm{Var}(\mu_\times^N(\varphi)) \le \frac{\sigma^2(\varphi)}{N} = \mathrm{Var}(\mu^N(\varphi)) \quad \forall N > 0, \qquad \text{and} \qquad \sigma_\times^2(\varphi) \le \sigma^2(\varphi). \]

Proof.
By definition,

\[ \mu_\times^N(\varphi) = \frac{1}{N^K}\sum_{n_1,\dots,n_K=1}^N \varphi(X_1^{n_1}, X_2^{n_2},\dots,X_K^{n_K}) = \frac{1}{N^{K-1}}\sum_{m_1,\dots,m_{K-1}=1}^N \underbrace{\frac{1}{N}\sum_{n=1}^N \varphi(X_1^n,\, X_2^{(n+m_1 \bmod N)+1},\, \dots,\, X_K^{(m_1+\cdots+m_{K-1}+n \bmod N)+1})}_{=:\,Y^N_{m_1,\dots,m_{K-1}}}. \]

Because the sample components are independent, (Y^N_{m_1,...,m_{K-1}})_{m_1,...,m_{K-1}=1}^N are identically distributed. Moreover, Y^N_{N-1,N,...,N} = µ^N(ϕ) and, so, (Y^N_{m_1,...,m_{K-1}})_{m_1,...,m_{K-1}=1}^N share the same distribution as the conventional estimator µ^N(ϕ). For this reason, the finite sample variance bound follows from the Cauchy-Schwarz inequality:

\[ \mathrm{Var}(\mu_\times^N(\varphi)) = \mathrm{Var}\left(\frac{1}{N^{K-1}}\sum_{m_1,\dots,m_{K-1}=1}^N Y^N_{m_1,\dots,m_{K-1}}\right) = \frac{1}{N^{2(K-1)}}\sum_{m_1,\dots,m_{K-1}=1}^N\;\sum_{m'_1,\dots,m'_{K-1}=1}^N \mathrm{Cov}(Y^N_{m_1,\dots,m_{K-1}},\, Y^N_{m'_1,\dots,m'_{K-1}}) \]
\[ \le \frac{1}{N^{2(K-1)}}\sum_{m_1,\dots,m_{K-1}=1}^N\;\sum_{m'_1,\dots,m'_{K-1}=1}^N \sqrt{\mathrm{Var}(Y^N_{m_1,\dots,m_{K-1}})}\sqrt{\mathrm{Var}(Y^N_{m'_1,\dots,m'_{K-1}})} = \mathrm{Var}(\mu^N(\varphi)) \quad \forall N > 0. \]

For the asymptotic variance bound, note that, with ψ := µ_{[3:K]}(ϕ) and [3:K] := {3, ..., K},

\[ \sigma_1^2(\varphi) = \mu_1([\mu_{\{1\}^c}(\varphi) - \mu(\varphi)]^2) = \mu_1([\mu_2(\psi) - \mu_{[2]}(\psi)]^2), \qquad \sigma_2^2(\varphi) = \mu_2([\mu_1(\psi) - \mu_{[2]}(\psi)]^2). \]

Because Jensen's inequality implies that µ_2([µ_1(ψ) − µ_{[2]}(ψ)]²) = µ_2([µ_1(ψ − µ_2(ψ))]²) ≤ µ_2(µ_1([ψ − µ_2(ψ)]²)), it follows that

\[ \sigma_1^2(\varphi) + \sigma_2^2(\varphi) \le \mu_1([\mu_2(\psi) - \mu_{[2]}(\psi)]^2) + \mu_2(\mu_1([\psi - \mu_2(\psi)]^2)) = [\mu_1(\mu_2(\psi)^2) - \mu_{[2]}(\psi)^2] + [\mu_{[2]}(\psi^2) - \mu_1(\mu_2(\psi)^2)] = \mu_{[2]}([\psi - \mu_{[2]}(\psi)]^2). \]

Setting now ψ' := µ_{[4:K]}(ϕ), we have that

\[ \sigma_1^2(\varphi) + \sigma_2^2(\varphi) \le \mu_{[2]}([\mu_3(\psi') - \mu_{[3]}(\psi')]^2), \qquad \sigma_3^2(\varphi) = \mu_3([\mu_{[2]}(\psi') - \mu_{[3]}(\psi')]^2). \]

Hence, applying Jensen's inequality again, we obtain

\[ \sigma_1^2(\varphi) + \sigma_2^2(\varphi) + \sigma_3^2(\varphi) \le \mu_{[3]}([\mu_{[4:K]}(\varphi) - \mu(\varphi)]^2). \]

Iterating the argument forward then gives us the desired bound:

\[ \sigma_\times^2(\varphi) = \sum_{k=1}^K \sigma_k^2(\varphi) \le \mu_{[K]}([\mu_\emptyset(\varphi) - \mu(\varphi)]^2) = \mu([\varphi - \mu(\varphi)]^2) = \sigma^2(\varphi). \qquad \blacksquare \]

In other words, the product-form estimator is more statistically efficient than the standard one: using the same number of independent samples drawn from the target, µ^N_×(ϕ) achieves a lower variance than µ^N(ϕ). The reason behind this variance reduction was outlined in the introduction: the product-form estimator uses the empirical distribution of the collection (X_1^{n_1}, ..., X_K^{n_K})_{n_1,...,n_K=1}^N of permuted tuples as an approximation to µ. Because µ is product-form, each of these permuted tuples is as much a sample drawn from µ as any of the original unpermuted tuples (X_1^n, ..., X_K^n)_{n=1}^N. Hence, product-form estimators transform N samples drawn from µ into N^K samples and, consequently, lower the variance. However, the permuted tuples are not independent and we get a diminishing returns effect: the more permutations we make, the greater the correlations among them, and the less 'new information' each new permutation affords us. For this reason, the estimator variance remains O(N^{-1}), c.f. (8), instead of the O(N^{-K}) that would be achieved by the standard estimator using N^K independent samples. As we discuss in Section 4, there is also a pragmatic middle ground here: use N < M < N^K permutations instead of all N^K possible ones. In particular, by choosing these M permutations to be as uncorrelated as possible (e.g. so that they have few overlapping entries), it might be possible to retain most of the variance reduction while avoiding the full O(N^K) cost.

Given that the variances of both estimators are (asymptotically) proportional to each other, we are now faced with the question 'how large might the proportionality constant be?'. If the test function is linear or constant, e.g.

\[ S_1 = \cdots = S_K = \mathbb{R} \quad \text{and} \quad \varphi(x) = \sum_{k=1}^K x_k, \tag{22} \]

then the two estimators trivially coincide, no variance reduction is achieved, and the constant is one. However, these are the cases in which the standard estimator performs well: e.g. for (22), µ^N(ϕ)'s variance breaks down into a sum of K univariate integrals and, consequently, grows slowly with the dimension K. If, on the other hand, the test function includes dependencies between the components, then the proportionality constant can be arbitrarily large and the variance reduction unbounded:

Example 1. If K = 2, µ_1 = µ_2 = N(0, σ²) for some σ > 0, and ϕ(x) := e^{x_1 x_2}, then

\[ \mu(\varphi) = \mu_1(e^{\sigma^2 x_1^2/2}) = \frac{1}{\sigma\sqrt{2\pi}}\int e^{-\left(\frac{1}{\sigma^2}-\sigma^2\right)\frac{x^2}{2}}\,dx = \begin{cases}\frac{1}{\sqrt{1-\sigma^4}} & \text{if } \sigma^2 < 1,\\ \infty & \text{otherwise,}\end{cases} \]

implying that µ(ϕ)² = 1/(1 − σ⁴) whenever the average is finite. Similarly,

\[ \mu(\varphi^2) = \begin{cases}\frac{1}{\sqrt{1-4\sigma^4}} & \text{if } \sigma^2 < \frac{1}{2},\\ \infty & \text{otherwise,}\end{cases} \qquad \mu_1(\mu_2(\varphi)^2) = \mu_2(\mu_1(\varphi)^2) = \begin{cases}\frac{1}{\sqrt{1-2\sigma^4}} & \text{if } \sigma^2 < \frac{1}{\sqrt{2}},\\ \infty & \text{otherwise.}\end{cases} \]

Consequently, ϕ belongs to L²_µ if and only if σ² < 1/2; in which case,

\[ \frac{\sigma^2(\varphi)}{\sigma_\times^2(\varphi)} = \frac{\frac{1}{\sqrt{1-4\sigma^4}} - \frac{1}{1-\sigma^4}}{2\left(\frac{1}{\sqrt{1-2\sigma^4}} - \frac{1}{1-\sigma^4}\right)}. \]

Taking the limits σ² → 0 and σ² → 1/2, we see that the variance reduction ranges from one to infinity.

Of particular interest is the case of high dimensional targets (i.e. large K) for which obtaining accurate estimates of µ(ϕ) proves challenging. Even though the exact manner in which the variance reduction achieved by product-form estimators scales with dimension of course depends on the precise target and test function, it is straightforward to gain some insight by revisiting our starting example:

Example 2. Suppose that S_1 = ⋯ = S_K, 𝒮_1 = ⋯ = 𝒮_K, µ_1 = ⋯ = µ_K = ρ, and

\[ \varphi = \prod_{k=1}^K \varphi_k, \qquad \varphi_1 = \cdots = \varphi_K = \psi, \]

for some univariate distribution ρ and test function ψ satisfying ρ(ψ) ≠ 0. In this case,

\[ \sigma^2(\varphi) = \mu(\varphi^2) - \mu(\varphi)^2 = \rho(\psi^2)^K - \rho(\psi)^{2K}, \]
\[ \sigma_\times^2(\varphi) = K\rho([\rho(\psi)^{K-1}[\psi - \rho(\psi)]]^2) = K\rho(\psi)^{2(K-1)}\rho([\psi-\rho(\psi)]^2) = \mathrm{CV}^2\, K\, \rho(\psi)^{2K}, \tag{23} \]

where CV := √(ρ([ψ − ρ(ψ)]²)) / |ρ(ψ)| denotes the coefficient of variation of ψ w.r.t. ρ. Hence,

\[ \frac{\sigma^2(\varphi)}{\sigma_\times^2(\varphi)} = \frac{(\rho(\psi^2)/\rho(\psi)^2)^K - 1}{\mathrm{CV}^2 K} = \frac{(1+\mathrm{CV}^2)^K - 1}{\mathrm{CV}^2 K} = \frac{1}{K}\sum_{k=0}^{K-1}\binom{K}{k+1}\mathrm{CV}^{2k}, \tag{24} \]

and we see that the reduction in variance grows exponentially with the dimension K.

At first glance, (23) might appear to imply that the number of samples required for µ^N_×(ϕ) to yield a reasonable estimate of µ(ϕ) grows exponentially with K if |ρ(ψ)| > 1. However, what we deem a 'reasonable estimate' here should take into account the magnitude of the average µ(ϕ) we are estimating. In particular, it is natural to ask for the standard deviation of our estimates to be ε|µ(ϕ)| for some prescribed relative tolerance ε > 0. In this case, we find that the number of samples required equals (approximately) σ²_×(ϕ)/(ε²µ(ϕ)²) = CV²Kε^{-2}. In the case of the conventional estimator µ^N(ϕ), the number required to achieve the same accuracy instead equals σ²(ϕ)/(ε²µ(ϕ)²) = ε^{-2}[(1 + CV²)^K − 1]. In other words, the requisite sample number grows linearly with K for µ^N_×(ϕ) and exponentially for µ^N(ϕ).

Notice that the univariate coefficient of variation CV features heavily in Example 2's analysis: the greater it is, the greater the variance reduction, and the difference gets amplified exponentially with the dimension K. This observation might be explained as follows: if µ is highly peaked (so that the coefficient is close to zero), then the unpermuted tuples are clumped together around the peak (Fig. 1a), permuting their entries only yields further tuples around the peak (Fig. 1b), and the empirical average changes little. If, on the other hand, µ is spread out (so that the coefficient is large), then the unpermuted pairs are scattered across the space (Fig. 1c), permuting their entries reveals (similarly likely) unexplored regions of the space (Fig. 1d), and the estimates improve. Of course, how spread out the target is must be measured in terms of the test function and we end up with the coefficients of variation in (24).

Figure 1: Ensembles of unpermuted (a,c) and permuted (b,d) pairs for a peaked target (a,b) and a heavy-tailed one (c,d). (a) 10 pairs (dots) independently drawn from a two-dimensional isotropic Gaussian (contours) with mean zero and small variance. (b) The 100 pairs (dots) obtained by permuting the pairs in (a). (c) 20 pairs (dots) independently drawn from the product of two heavy-tailed Student-t distributions (contours). (d) The 400 permuted pairs (dots) obtained by permuting the pairs in (c).

2.3 Computational cost

As shown in the previous section, product-form estimators are at least as statistically efficient as their conventional counterparts: the variances of the former are bounded above by those of the latter. These gains in statistical efficiency come at a computational cost: even though both conventional and product-form estimators have the same O(N) memory needs, the latter requires evaluating the test function N^K times, while the former requires only N evaluations. For this reason, the question of whether product-form estimators are more computationally efficient than their conventional counterparts (i.e. achieve smaller errors given the same computational budget) is not as straightforward. In short, sometimes but not always.

One way to answer the computational efficiency question is to compare the cost incurred by each estimator in order to achieve a desired given variance σ̄². To do so, we approximate µ^N_×(ϕ)'s variance by the estimator's asymptotic variance divided by the sample number (as justified by Theorem 2.1). The number of samples required for the variance to equal σ̄² is N := σ²(ϕ)/σ̄² for the conventional estimator and (approximately) N_× := σ²_×(ϕ)/σ̄² for the product-form one. The costs of evaluating the former with N samples and the latter with N_× samples are

\[ N(\underbrace{\text{cost of evaluating } \varphi}_{=:\,C_\varphi}) + N(\underbrace{\text{cost of generating a sample } X^n}_{=:\,C_X}) + N, \qquad N_\times^K C_\varphi + N_\times C_X + N_\times^K, \]

respectively, where C_ϕ and C_X are costs relative to that of a single elementary arithmetic operation and the rightmost N and N_×^K terms account for the cost of computing the corresponding sample average once all evaluations of ϕ are carried out. It follows that µ^N_× is (asymptotically) at least as computationally efficient as µ^N if and only if the ratio of their respective costs is no smaller than one or, after some re-arranging,

\[ \frac{\sigma^2(\varphi)}{\sigma_\times^2(\varphi)} \ge \frac{(\sigma_\times^2(\varphi)/\bar\sigma^2)^{K-1}\, C_r + 1}{C_r + 1}, \tag{25} \]

where C_r := (C_ϕ + 1)/C_X denotes the relative cost of evaluating the test function and drawing samples. Our first observation here is that, because σ²(ϕ) ≥ σ²_×(ϕ) (Proposition 2.3), the above is always satisfied in the limit C_r → 0. This corresponds to the case where the cost of acquiring the samples dwarfs the overhead of evaluating the sample averages (for instance, if the samples are obtained from long simulations or real-life experiments). If so, we do really want to make the most of the samples we have and product-form estimators help us to do so.

Next, if rough estimates with variance σ̄² ≥ σ²_×(ϕ) are sufficient for our purposes, then we are also better off using product-form estimators as the RHS of (25) is at most one. Otherwise, in the high dimensional (i.e. large K) case which is of particular interest, (25) reduces (approximately) to

\[ \frac{\sigma^2(\varphi)}{\sigma_\times^2(\varphi)} \ge \left(\frac{\sigma_\times^2(\varphi)}{\bar\sigma^2}\right)^{K-1}. \tag{26} \]

To gain insight into whether it is reasonable to expect the above to hold, we revisit Example 2.

Example 3. Setting once again our desired standard deviation to be proportional to the magnitude of the target average (i.e. σ̄ = ε|µ(ϕ)|) and calling on (23,24), we can re-write (26) as

\[ \frac{(1+\mathrm{CV}^2)^K - 1}{\mathrm{CV}^2 K} \ge (\mathrm{CV}^2 K \varepsilon^{-2})^{K-1} \quad \Leftrightarrow \quad A := \frac{(1+\mathrm{CV}^2)^K - 1}{\mathrm{CV}^{2K}} \ge \varepsilon^2\left(\frac{K}{\varepsilon^2}\right)^K. \]

The expression shows that, in this full O(N^K) cost case, µ^N(ϕ) outperforms µ^N_×(ϕ) for large enough dimension K or small enough relative tolerance ε. However, if K is large, then obtaining accurate estimates using either estimator might prove too computationally expensive (recall that the variance, and hence the cost, of µ^N(ϕ) grows exponentially with K). For rough enough estimates, µ^N_×(ϕ) instead outperforms µ^N(ϕ). The exact boundary depends on the coefficient of variation. In particular, if CV > 1, then A ≈ 1 and µ^N_×(ϕ) generally outperforms µ^N(ϕ) whenever ε ≥ √K. If instead CV < 1, then A ≈ KCV²/CV^{2K} and we find that µ^N_×(ϕ) generally outperforms µ^N(ϕ) whenever ε ≥ CV√K.

Of course, the above analysis is out of place for this example because we can express the product-form estimator as the product

\[ \mu_\times^N(\varphi) = \prod_{k=1}^K\left(\frac{1}{N}\sum_{n=1}^N \psi(X_k^n)\right) = \prod_{k=1}^K \mu_k^N(\psi) \tag{27} \]

of the univariate sample averages µ^N_1(ψ), ..., µ^N_K(ψ) and evaluate each of these separately at a total O(N) cost. Indeed, in this setting in which the expectation of interest factorises as a product of univariate expectations, this is the approach that one would naturally take. Given that the number of samples required for µ^N_×(ϕ) to yield a reasonable estimate scales linearly with dimension (Example 2), it follows that the cost incurred by computing such an estimate scales quadratically with dimension. In the case of µ^N(ϕ), the number of samples required, and hence the cost, scales exponentially with dimension; making the product-form estimator the clear choice for this simple case. Indeed, the linear-cost-in-sample-number trick (27) significantly expands the usefulness of product-form estimators, as we see in the following section.

2.4 Sums of products

Recall our starting example from Section 1. In this case, the product-form estimator trivially breaks down into the product of K sample averages (2) and, consequently, we can evaluate it in O(N) operations. We can exploit this trick whenever the test function possesses product-like structure: if ϕ is a sum

\[ \varphi = \sum_{j=1}^J \varphi_j \quad \text{of products} \quad \varphi_j := \prod_{k=1}^K \varphi_{jk} \quad \text{of univariate functions } \varphi_{jk}: S_k \to \mathbb{R} \;\;\forall k \in [K], \tag{28} \]

the product-form estimator decomposes into a sum of products (SOP) of univariate averages,

\[ \mu_\times^N(\varphi) = \sum_{j=1}^J\prod_{k=1}^K \mu_k^N(\varphi_{jk}), \quad \text{where} \quad \mu_k^N(\varphi_{jk}) := \frac{1}{N}\sum_{n=1}^N \varphi_{jk}(X_k^n) \quad \forall j \in [J],\, k \in [K], \]

and we are able to evaluate it in O(N) operations. In these cases, the use of product-form estimators amounts to nothing more than a dimensionality-reduction technique: we exploit the independence of the target to express our K-dimensional integral in terms of an SOP of one-dimensional integrals,

\[ \mu(\varphi) = \sum_{j=1}^J\prod_{k=1}^K \mu_k(\varphi_{jk}) =: \mathrm{sop}(\{\mu_k(\varphi_{jk})\}_{j\in[J],\,k\in[K]}), \]

estimate each of these separately, µ_k(ϕ_{jk}) ≈ µ^N_k(ϕ_{jk}) for all j ∈ [J], k ∈ [K], and replace the one-dimensional integrals with their estimates in the SOP to obtain an estimate for the K-dimensional integral:

\[ \mu(\varphi) \approx \mathrm{sop}(\{\mu_k^N(\varphi_{jk})\}_{j\in[J],\,k\in[K]}) = \mu_\times^N(\varphi). \]

By so exploiting the structure in µ and ϕ, the product-form estimator achieves a lower variance than the standard estimator (Proposition 2.3). Moreover, evaluating each univariate sample average µ^N_k(ϕ_{jk}) requires only O(N) operations and, consequently, the computational complexity of µ^N_×(ϕ) is O(N) instead of the O(N^K) of the non-SOP case (Section 2.3). The running time can be further reduced by calculating the univariate sample averages in parallel. Of course, similar considerations apply if the test function ϕ is a sum of products of low-dimensional functions instead of univariate ones; the sketch below illustrates the univariate case.
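For instance, a sketch of this linear-cost evaluation (again our own illustration, with illustrative names) for a test function with J = 2 products of univariate factors, under a Gaussian product target, might read:

```python
import numpy as np

def sop_estimate(X, phis):
    """mu_x^N(phi) for an SOP test function, via component-wise averages.

    phis[j][k] holds the univariate factor phi_{jk}; the estimate is
    sum_j prod_k (1/N) sum_n phi_{jk}(X[n, k]), an O(JKN) computation
    rather than the O(N**K) of the general case."""
    return sum(
        np.prod([np.mean(phi_jk(X[:, k])) for k, phi_jk in enumerate(row)])
        for row in phis
    )

rng = np.random.default_rng(2)
K, N = 5, 100_000
X = rng.normal(1.0, 1.0, size=(N, K))         # product-form target N(1, I_K)
phis = [
    [lambda x: x] * K,                        # phi_1(x) = x_1 ... x_K
    [lambda x: x ** 2] * K,                   # phi_2(x) = x_1**2 ... x_K**2
]
print(sop_estimate(X, phis))                  # mu(phi) = 1 + 2**K = 33 here
```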
For general ϕ, we are often able to extend the linear-cost approach by approximating ϕ with SOPs (e.g. using truncated Taylor expansions for analytic real-valued ϕ). The idea is that, if ϕ ≈ ϕ_sop for some SOP ϕ_sop, then

\[ \mu(\varphi) \approx \mu(\varphi_{\mathrm{sop}}), \qquad \mathrm{Var}(\mu_\times^N(\varphi)) \approx \mathrm{Var}(\mu_\times^N(\varphi_{\mathrm{sop}})) \quad \forall N > 0, \]

and we can use µ^N_×(ϕ_sop) ≈ µ(ϕ_sop) as a linear-cost estimator for µ(ϕ) without significantly affecting the variance reduction. This of course comes at the expense of introducing a bias in our estimates, albeit one that can often be made arbitrarily small by using more and more refined approximations (these biases may in principle be removed using multi-level randomisation [24, 33], at the cost of additional computational overheads). The choice of approximation quality itself proves to be a balancing act as more refined approximations typically incur higher evaluation costs. If these costs are high enough, then any potential computational gains afforded by the reduction in variance are lost. In summary, this SOP approximation approach is most beneficial for test functions (a) that are effectively approximated by SOP functions (so that the bias is low), (b) whose SOP approximations are relatively cheap to compute (so that the cost is low), and (c) that have a high-dimensional product-form component to them (so that the variance reduction is large, c.f. Section 2.2). In these cases, the gains in performance can be substantial, as illustrated by the following toy example.

Example 4. Let µ_1, ..., µ_K be uniform distributions over [0, a] with a > 1, and let ϕ(x) := e^{x_1⋯x_K}. The integral can be expressed in terms of the generalised hypergeometric function _pF_q,

\[ \mu(\varphi) = \sum_{j=0}^\infty \frac{\mu_1(x_1^j)\cdots\mu_K(x_K^j)}{j!} = \sum_{j=0}^\infty \frac{1}{j!}\left[\frac{a^j}{j+1}\right]^K = {}_KF_K(1,\dots,1;\,2,\dots,2;\,a^K), \]

and grows super-exponentially with the dimension K (Fig. 2a). Because

\[ \varphi(x) = e^{x_1\dots x_K} \approx \sum_{j=0}^J \frac{[x_1\dots x_K]^j}{j!} = 1 + \sum_{j=1}^J \frac{x_1^j\dots x_K^j}{j!} =: \varphi_J(x) \]

for large enough truncation cutoffs J, we have that

\[ \mu_\times^N(\varphi) \approx \mu_\times^N(\varphi_J) = 1 + \sum_{j=1}^J \frac{\mu_1^N(x_1^j)\cdots\mu_K^N(x_K^j)}{j!}. \]

Using µ^N_×(ϕ_J) instead of µ^N_×(ϕ) as an estimator for µ(ϕ), we lower the computational cost from O(N^K) to just O(KN). In exchange, we introduce a bias:

\[ \mathbb{E}[\mu_\times^N(\varphi_J)] - \mu(\varphi) = \mu(\varphi_J) - \mu(\varphi) = \mu(\varphi_J - \varphi) = -\mu\left(\sum_{j=J+1}^\infty \frac{x_1^j\dots x_K^j}{j!}\right) = -\sum_{j=J+1}^\infty \frac{1}{j!}\left[\frac{a^j}{j+1}\right]^K. \]

Because \sum_{j=J+1}^\infty a^{jK}/j! = o(a^{JK}/J!), it is straightforward to verify that the bias decays super-exponentially with the cutoff J, at least for sufficiently large J. In practice, we found it to be significant for Js somewhat smaller than a^K and negligible for Js somewhat larger than a^K (Fig. 2b). In particular, the cutoff J necessary for µ^N_×(ϕ_J) to yield estimates with small bias grows exponentially with the dimension K.

However, similar manipulations to those above reveal that

\[ \sigma^2(\varphi) = {}_KF_K(1,\dots,1;\,2,\dots,2;\,2a^K) - {}_KF_K(1,\dots,1;\,2,\dots,2;\,a^K)^2, \]
\[ \sigma_\times^2(\varphi_J) = K\sum_{i=0}^J\sum_{j=0}^J \frac{1}{i!\,j!}\,\frac{ij}{i+j+1}\left(\frac{a^{i+j}}{(i+1)(j+1)}\right)^K, \]

and we find that the variance reduction achieved by µ^N_×(ϕ_J) far outpaces the growth in K of the cutoff (and, consequently, the computational cost of µ^N_×(ϕ_J)) necessary to achieve a small bias (Fig. 2c). Indeed, the asymptotic-standard-deviation-to-mean ratio, σ(ϕ)/µ(ϕ), rapidly diverges with K in the case of the standard estimator (Fig. 2d, solid). In that of the biased product-form estimator, the ratio, σ_×(ϕ_J)/µ(ϕ), also diverges with K but at a much slower rate (Fig. 2d, dashed). For this reason, the number of samples necessary to obtain a, say, 1% accuracy estimate of µ(ϕ) using µ^N_×(ϕ_J) remains manageable for a substantially larger range of as and Ks than in the case of µ^N(ϕ), even after factoring in the extra cost required to evaluate µ^N_×(ϕ_J) for Js large enough that the bias is negligible. For instance, with a = 1.5 and K = 10, a cutoff of seventy (J = 70), one million samples (N = 10⁶), and less than one minute of computation time suffice for µ^N_×(ϕ_J) to produce a 1% accuracy estimate of µ(ϕ), which runs into the millions (Fig. 2e). Using the same one million samples and the standard estimator, we obtain very poor estimates (Fig. 2f). Indeed, its asymptotic variance is so large that the number of samples needed to obtain 1% accuracy estimates using µ^N(ϕ) lies far beyond present-day computational capabilities.
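A sketch of the biased linear-cost estimator µ^N_×(ϕ_J) in the setting of Example 4 (our own illustration; we use a smaller sample size than the N = 10⁶ quoted above, so the estimate is correspondingly rougher, and the series cutoff used for the reference value is our choice):

```python
import numpy as np
from math import factorial

def truncated_estimate(X, J):
    """mu_x^N(phi_J) from Example 4, phi(x) = exp(x_1 ... x_K):
    1 + sum_{j=1}^J (1/j!) prod_k mean_n X[n, k]**j, at O(JKN) cost."""
    return 1.0 + sum(
        np.prod([np.mean(X[:, k] ** j) for k in range(X.shape[1])])
        / factorial(j)
        for j in range(1, J + 1)
    )

rng = np.random.default_rng(3)
a, K, N = 1.5, 10, 100_000
X = rng.uniform(0.0, a, size=(N, K))
J = 70                                 # cutoff comfortably above a**K ~ 58
# Reference value: the series for mu(phi) has converged well before j = 150.
true = sum(a ** (j * K) / (factorial(j) * (j + 1) ** K) for j in range(150))
print(truncated_estimate(X, J), "vs", true)
```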
Figure 2: (a–d) Plots generated for three values of a: a = 1.4, a = 1.45 (magenta), and a = 1.5. (a) Target average µ(ϕ) as a function of dimension K. (b) Bias of the product-form estimator (normalised by the target average) as a function of K, for truncation cut-offs of the form J = ⌈c·a^K⌉ + 2 with a smaller constant c (solid) and a larger one (dashed); we added the +2 to avoid trivial cutoffs for low values of a^K. (c) Ratio of asymptotic variances σ²(ϕ)/σ²_×(ϕ_J) as a function of K. (d) Asymptotic standard deviation normalised by the target average for the conventional (solid) and biased product-form (dashed) estimators as a function of K. (e) A kernel density estimate [37] (blue) obtained from 100 repeats of µ^N_×(ϕ_J), each using N = 10⁶ samples, is a good match to the corresponding sampling distribution (magenta) predicted by the CLT in Theorem 2.1. Comparing with the target average (yellow), we find a mean absolute error across repeats of ≈ 0.01µ(ϕ). (f) As in (e) but for µ^N(ϕ). This time, the predicted sampling distribution is extremely wide and a poor match to the kernel density estimate (almost a Dirac delta close to zero). The mean absolute error is ≈ µ(ϕ). The estimator's failure traces to the extreme rarity of the samples X^n that achieve very large values ϕ(X^n) (i.e. those whose components are all close to a). Because the components are independent, they are extremely unlikely to simultaneously be close to a, and the aforementioned samples are not observed for realistic ensemble sizes N. The product-form estimator avoids this issue by averaging over each component separately.

3 Beyond product-form targets

While interesting product-form distributions can be found throughout the applied probability literature (ranging from stationary distributions of Jackson queues [14, 15] and complex-balanced stochastic reaction networks [1, 7] to mean-field approximations in variational inference [32, 5]), most target distributions are not product-form. For this reason, we study in this section how to expand the applicability of product-form estimators beyond product-form targets.

We consider two extensions, the first being the 'mixture of product-form' estimators in Section 3.1 for targets that are mixtures of product-form distributions. Because we can approximate more complicated targets with these types of distributions, this extension opens the door to tackling still more complex targets at the expense of introducing some bias. For targets ill-suited to this approach, we instead merge in Section 3.2 the ideas underlying importance sampling with those of Section 2. As we will see, the product-form versions of importance sampling estimators achieve greater statistical efficiency than their conventional counterparts for general targets whenever the proposal is product-form.
3.1 Mixtures of product-form distributions

Suppose that our target µ is not a product-form distribution but a mixture of several:

\[ \mu := \sum_{i=1}^I \theta_i \mu_i, \tag{29} \]

where the mixture weights θ_1, ..., θ_I > 0 satisfy \sum_{i=1}^I θ_i = 1 and, for each i in [I], µ_i is the product \prod_{k=1}^{K_i} µ_{ik} of distributions µ_{i1}, ..., µ_{iK_i} respectively defined on

\[ (S_{i1}, \mathcal{S}_{i1}) := \left(\prod_{k\in\mathcal{K}_{i1}} S_k,\; \prod_{k\in\mathcal{K}_{i1}} \mathcal{S}_k\right), \quad \dots, \quad (S_{iK_i}, \mathcal{S}_{iK_i}) := \left(\prod_{k\in\mathcal{K}_{iK_i}} S_k,\; \prod_{k\in\mathcal{K}_{iK_i}} \mathcal{S}_k\right), \]

for some given partitions {𝒦_{11}, ..., 𝒦_{1K_1}}, ..., {𝒦_{I1}, ..., 𝒦_{IK_I}} of [K]. Notice that, to further broaden the applicability of the ensuing estimators, we are allowing for mixture components that are products over some but not all dimensions (in the case in which every mixture component factorises fully, K_i = K and 𝒦_{ik} = {k} for i = 1, ..., I and k = 1, ..., K). Fix a square µ-integrable test function ϕ and suppose we wish to compute µ(ϕ). It is well known [28, 13] that the basic Monte Carlo estimator µ^N(ϕ) := N^{-1}\sum_{n=1}^N ϕ(X^n), where X^1, ..., X^N denote i.i.d. samples drawn from µ, is outperformed by its stratified variant:

\[ \mu_s^N(\varphi) := \sum_{i=1}^I \theta_i\left(\frac{1}{N_i}\sum_{n=1}^{N_i}\varphi(X^{i,n})\right) =: \sum_{i=1}^I \theta_i\, \mu^{i,N_i}(\varphi), \tag{30} \]

where (X^{1,1}, ..., X^{1,N_1}), ..., (X^{I,1}, ..., X^{I,N_I}) denote independent sequences of i.i.d. samples respectively drawn from µ_1, ..., µ_I and N_1 = θ_1 N, ..., N_I = θ_I N for some N > 0 (assuming, for simplicity, that these are integers). In particular, µ^N_s(ϕ) is a consistent, unbiased, and asymptotically normal estimator for µ(ϕ) whose variance is bounded above by that of µ^N(ϕ):

\[ \mathrm{Var}(\mu_s^N(\varphi)) = \sum_{i=1}^I \theta_i^2\,\mathrm{Var}(\mu^{i,N_i}(\varphi)) = \sum_{i=1}^I \theta_i^2\,\frac{\mu_i([\varphi-\mu_i(\varphi)]^2)}{N_i} = \frac{\sum_{i=1}^I \theta_i\,\mu_i([\varphi-\mu_i(\varphi)]^2)}{N} = \frac{\sum_{i=1}^I\theta_i\mu_i(\varphi^2) - \sum_{i=1}^I\theta_i\mu_i(\varphi)^2}{N} \le \frac{\mu(\varphi^2)-\mu(\varphi)^2}{N} = \frac{\sigma^2(\varphi)}{N} = \mathrm{Var}(\mu^N(\varphi)) \]

for all N > 0 (the inequality holds because \sum_i θ_i µ_i(ϕ)² ≥ (\sum_i θ_i µ_i(ϕ))² = µ(ϕ)² by Jensen's inequality). Given our assumption that the distributions in the mixture are product-form, we easily improve on µ^N_s(ϕ) using the product-form analogues of µ^{1,N_1}(ϕ), ..., µ^{I,N_I}(ϕ): with X_1^{i,n}, ..., X_{K_i}^{i,n} denoting the components of the n-th sample X^{i,n} drawn from µ_i,

\[ \mu_\times^N(\varphi) := \sum_{i=1}^I \theta_i\,\mu_\times^{i,N_i}(\varphi), \quad \text{where} \quad \mu_\times^{i,N_i} := \frac{1}{N_i^{K_i}}\sum_{n_1,\dots,n_{K_i}=1}^{N_i} \delta_{(X_1^{i,n_1},\dots,X_{K_i}^{i,n_{K_i}})} \quad \forall i \in [I], \tag{31} \]

is more statistically efficient than µ^N_s(ϕ) as long as at least one of the mixture components is product-form (notice that if a component, µ_i, does not factorise at all, then K_i = 1 and µ^{i,N_i}_×(ϕ) reduces to the standard Monte Carlo estimator µ^{i,N_i}(ϕ) for that component).

Corollary 3.1 (Mixture-of-products estimators are unbiased, strongly consistent, and asymptotically normal). If ϕ is µ-integrable, then µ^N_×(ϕ) in (31) is an unbiased estimator for µ(ϕ). If ϕ belongs to L²_µ and N_1, ..., N_I are integers proportional to N (i.e. N_1 = α_1 N, ..., N_I = α_I N for some α_1, ..., α_I > 0), then µ^N_×(ϕ) is strongly consistent and asymptotically normal, with variance

\[ \mathrm{Var}(\mu_\times^N(\varphi)) = \sum_{i=1}^I \theta_i^2\,\mathrm{Var}(\mu_\times^{i,N_i}(\varphi)) = \sum_{i=1}^I \theta_i^2 \sum_{\emptyset\neq A\subseteq[K_i]} \frac{\sum_{B\subseteq A}(-1)^{|A|-|B|}\,[\sigma^i_{A,B}(\mu^i_{A^c}(\varphi))]^2}{\alpha_i^{|A|}\, N^{|A|}} \]

for all N > 0, and asymptotic variance σ²_×(ϕ) = \sum_{i=1}^I θ_i²[σ^i_×(ϕ)]²/α_i, where

\[ [\sigma^i_{A,B}(\psi)]^2 := \mu^i_B([\mu^i_{A\setminus B}(\psi) - \mu^i_A(\psi)]^2) \quad \forall \psi \in L^2_{\mu^i_A},\; B \subseteq A \subseteq [K_i],\; i \in [I], \]
\[ [\sigma^i_\times(\varphi)]^2 := \sum_{k=1}^{K_i} \mu_{ik}([\mu^i_{[K_i]\setminus\{k\}}(\varphi) - \mu_i(\varphi)]^2) \quad \forall i \in [I]. \]

If, in particular, α_1 = θ_1, ..., α_I = θ_I, then the variances of µ^N_× are bounded above by those of µ^N_s:

\[ \mathrm{Var}(\mu_\times^N(\varphi)) = \sum_{i=1}^I \theta_i^2\,\mathrm{Var}(\mu_\times^{i,N_i}(\varphi)) \le \sum_{i=1}^I \theta_i^2\,\mathrm{Var}(\mu^{i,N_i}(\varphi)) = \mathrm{Var}(\mu_s^N(\varphi)) \quad \forall N > 0, \]
\[ \sigma_\times^2(\varphi) = \sum_{i=1}^I \theta_i\, [\sigma^i_\times(\varphi)]^2 \le \sum_{i=1}^I \theta_i\,\mu_i([\varphi - \mu_i(\varphi)]^2) = \lim_{N\to\infty} N\,\mathrm{Var}(\mu_s^N(\varphi)). \]

Proof. Follows from Theorem 2.1, Proposition 2.3, and the continuous mapping theorem. ∎
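A sketch of (31) with the proportional allocation N_i = θ_iN (our own illustration, with illustrative names). For brevity it assumes fully factorised components and a product test function, so that each inner estimator µ^{i,N_i}_×(ϕ) can be evaluated through component-wise averages as in Section 2.4; in general, the inner estimators cost O(N_i^{K_i}).

```python
import numpy as np

def mixture_product_estimate(components, theta, N, phi, rng):
    """Mixture-of-products estimator (31) with allocation N_i = theta_i * N.

    components[i] is a list of K per-coordinate samplers for mu_i; phi is
    a list of K univariate factors, so phi(x) = phi_1(x_1) * ... * phi_K(x_K)."""
    total = 0.0
    for samplers, weight in zip(components, theta):
        Ni = max(1, round(weight * N))
        est = 1.0
        for sampler_k, phi_k in zip(samplers, phi):
            est *= np.mean(phi_k(sampler_k(Ni, rng)))   # mu_ik^{N_i}(phi_k)
        total += weight * est
    return total

rng = np.random.default_rng(4)
components = [  # two bivariate Gaussian components with independent coordinates
    [lambda n, r: r.normal(0.0, 1.0, n), lambda n, r: r.normal(0.0, 1.0, n)],
    [lambda n, r: r.normal(2.0, 1.0, n), lambda n, r: r.normal(2.0, 1.0, n)],
]
theta = [0.3, 0.7]
phi = [lambda x: x, lambda x: x]                        # phi(x) = x_1 * x_2
print(mixture_product_estimate(components, theta, 10_000, phi, rng))
# exact value: 0.3 * (0 * 0) + 0.7 * (2 * 2) = 2.8
```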
If all computations are carried out in serial, then, employing a Lagrange multiplier as in [2, p.153], we find that N_i* ∝ θ_i σ^i_×(ϕ) for i = 1, ..., I achieves the smallest possible asymptotic variance, (\sum_{i=1}^I θ_i σ^i_×(ϕ))², for fixed N = \sum_{i=1}^I N_i. Of course, σ^1_×(ϕ), ..., σ^I_×(ϕ) are unknown in practice and must be estimated, which would lead to an adaptive scheme similar to those for µ^N(ϕ) in [30, 11, 9]. However, if we have at our disposal enough cores that all I empirical averages µ^{1,N_1}(ϕ), ..., µ^{I,N_I}(ϕ) can be evaluated in parallel, we would clearly be better off fixing a total computation time and generating as many samples from each mixture component as possible within that time.

3.2 Targets absolutely continuous with respect to product-form distributions

Suppose now that we have an unnormalised (but finite) unsigned target measure γ that is absolutely continuous with respect to the product-form distribution µ in Section 2 (analogous considerations apply if µ is instead the mixture-of-products distribution in Section 3.1), and let w := dγ/dµ be the corresponding Radon-Nikodym derivative. Instead of the usual importance sampling (IS) estimator [34], γ^N(ϕ) := µ^N(wϕ), for γ(ϕ), we consider its product-form variant, γ^N_×(ϕ) := µ^N_×(wϕ). The results of Section 2 immediately give us the following:

Corollary 3.2 (Product-form IS estimators are unbiased, strongly consistent, and asymptotically normal). If ϕ is γ-integrable, then γ^N_×(ϕ) is an unbiased estimator for γ(ϕ). If, furthermore, wϕ belongs to L²_µ, then γ^N_×(ϕ) is strongly consistent, asymptotically normal, and its finite sample and asymptotic variances are bounded above by those of γ^N(ϕ):

\[ \mathrm{Var}(\gamma_\times^N(\varphi)) = \mathrm{Var}(\mu_\times^N(w\varphi)) \le \mathrm{Var}(\mu^N(w\varphi)) = \mathrm{Var}(\gamma^N(\varphi)) \quad \forall N > 0, \]
\[ \sigma^2_{\gamma,\times}(\varphi) = \sigma^2_\times(w\varphi) \le \sigma^2(w\varphi) = \sigma^2_\gamma(\varphi), \]

where Var(µ^N_×(wϕ)) and σ²_×(wϕ) are as in Theorem 2.1.

Proof. Replace ϕ with wϕ in Theorem 2.1 and Proposition 2.3. ∎

Corollary 3.2 tells us that γ^N_×(ϕ) is more statistically efficient than the conventional IS estimator γ^N(ϕ) regardless of whether the target γ is product-form or not. In a nutshell, µ^N_× in (7) is a better approximation to the proposal µ than µ^N in (6) and, consequently, γ^N_×(ϕ) = µ^N_×(wϕ) is a better approximation to γ(ϕ) = µ(wϕ) than γ^N(ϕ) = µ^N(wϕ), for any test function ϕ. Indeed, by constructing all N^K permutations of the tuples X^1, ..., X^N, we reveal other areas of similar µ-probability and further explore the state space. This can be particularly useful when the proposal and target are mismatched, as it can amplify the number of tuples landing in the target's high probability regions (i.e. achieving high weights w) and, consequently, substantially improve the quality of the finite sample approximation (Fig. 3).
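A self-contained sketch of γ^N_×(ϕ) = µ^N_×(wϕ) for K = 2 (our own toy construction, not taken from the paper: a Gaussian product proposal and a uniform target on the unit square, so Z = 1). The N² permuted pairs are formed explicitly, and the self-normalised variant discussed below is included for completeness:

```python
import numpy as np

rng = np.random.default_rng(5)
N = 500
x1 = rng.normal(size=N)          # proposal mu: product of two standard Gaussians
x2 = rng.normal(size=N)

def w(a, b):
    """Radon-Nikodym derivative dgamma/dmu for gamma = Unif([0, 1]^2)."""
    inside = (0 <= a) & (a <= 1) & (0 <= b) & (b <= 1)
    gauss = np.exp(-(a ** 2 + b ** 2) / 2) / (2 * np.pi)
    return np.where(inside, 1.0 / gauss, 0.0)

phi = lambda a, b: a + b
A, B = np.meshgrid(x1, x2, indexing="ij")      # all N**2 permuted pairs
W = w(A, B)
print("gamma_x^N(phi):", np.mean(W * phi(A, B)))              # gamma(phi) = 1
print("pi_x^N(phi)   :", np.sum(W * phi(A, B)) / np.sum(W))   # pi(phi) = 1
```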
Figure 3: Product-form approximations improve state space exploration.
The target, a uniform distribution on $[0,1]^2$ (dashed square), and the proposal, the product of two Student-t distributions, are mismatched: the unpermuted sample pairs cover the target's support poorly (a) and the corresponding weighted sample approximation (a, inset, dot diameter proportional to sample weight) is poor. By permuting these pairs, we improve the coverage of the state space (b), increase the number of pairs lying within the target's support, and obtain a much better weighted sample approximation (b, inset, dot diameter proportional to sample weight).

Similarly, the self-normalised version $\pi_\times^N(\phi) := \gamma_\times^N(\phi)/\gamma_\times^N(S)$ of the product-form IS estimator $\gamma_\times^N(\phi)$ is a consistent and asymptotically normal estimator for averages $\pi(\phi)$ with respect to the normalised target $\pi := \gamma/Z$, where $Z := \gamma(S)$ denotes the normalising constant. As in the case of the standard self-normalised importance sampling (SNIS) estimator $\pi^N(\phi) := \gamma^N(\phi)/\gamma^N(S)$, the ratio in $\pi_\times^N(\phi)$'s definition introduces an $O(N^{-1})$ bias and stops us from obtaining an analytic expression for its finite sample variance (that the bias is $O(N^{-1})$ follows from an argument analogous to that given for the conventional SNIS estimator in [21, p. 35] and requires assumptions on the higher moments of $\phi(X)$). Otherwise, $\pi_\times^N(\phi)$'s theoretical properties are analogous to those of the product-form estimator $\mu_\times^N(\phi)$ and its importance sampling variant $\gamma_\times^N(\phi)$:

Corollary 3.3 (Product-form SNIS estimators are strongly consistent and asymptotically normal). If $w\phi$ belongs to $L^2_\mu$, then $\pi_\times^N(\phi)$ is strongly consistent, asymptotically normal, and its asymptotic variance is bounded above by that of $\pi^N(\phi)$:
$$\sigma_{\pi,\times}^2(\phi) = \sigma_\times^2(w_\pi[\phi-\pi(\phi)]) \le \sigma^2(w_\pi[\phi-\pi(\phi)]) = \sigma_\pi^2(\phi),$$
where $w_\pi$ denotes the normalised weight function $w/Z$ and $\sigma_\times^2(w_\pi[\phi-\pi(\phi)])$ is as in Theorem 2.1.

Proof. Given Theorem 2.1 and Proposition 2.3, the arguments here follow closely those for standard SNIS. In particular, because $\pi_\times^N(\phi) = \gamma_\times^N(\phi)/\gamma_\times^N(S) = \mu_\times^N(w\phi)/\mu_\times^N(w)$ and $\mu(w) = \gamma(S) = Z$,
$$\pi_\times^N(\phi) - \pi(\phi) = \frac{\mu_\times^N(w\phi)}{\mu_\times^N(w)} - \pi(\phi) = \frac{\mu(w)}{\mu_\times^N(w)}\,\mu_\times^N\!\left(\frac{w[\phi-\pi(\phi)]}{Z}\right) = \frac{\mu(w)}{\mu_\times^N(w)}\,\mu_\times^N(w_\pi[\phi-\pi(\phi)]).$$
Because $\mu_\times^N(w)$ tends to $\mu(w)$ in probability as $N$ approaches infinity (Theorem 2.1) and $\sigma_{\pi,\times}^2(\phi) = \sigma_\times^2(w_\pi[\phi-\pi(\phi)])$, the strong consistency and asymptotic normality of $\pi_\times^N(\phi)$ follow from those of $\mu_\times^N(w_\pi[\phi-\pi(\phi)])$ (Theorem 2.1) and Slutsky's theorem. Given that $\sigma_\pi^2(\phi) = \sigma^2(w_\pi[\phi-\pi(\phi)])$, the asymptotic variance bound follows from that in Proposition 2.3.

4 Conclusion

The main message of this paper is that, when employing Monte Carlo estimators to tackle problems possessing some sort of product structure, one should endeavour to exploit this structure to improve the estimators' performance. The resulting product-form estimators are not a panacea for the curse of dimensionality in Monte Carlo, but they are a useful, and seemingly overlooked, tool in the practitioner's arsenal that makes certain problems solvable when they otherwise would not be. In particular, they shine when the integral in question is laden with product structure and falter when little such structure is present. More specifically, whenever the target, or proposal, we are drawing samples from is product-form, these estimators achieve a smaller variance than their conventional counterparts. In our experience (e.g.
Examples 2 and 4), the gap in variance grows exponentially with dimension whenever the integrand does not decompose into a sum of low-dimensional functions as in the trivial case (22). For the reasons given in Section 2.2, we expect the variance reduction to be further accentuated for targets that are 'spread out' rather than highly peaked.

The gains in statistical efficiency come at a computational cost: in general, product-form estimators require $N^K$ evaluations of the test function while conventional estimators require only $N$. For this reason, product-form estimators are of greatest use when the variance reduction is particularly pronounced or when samples are expensive to acquire (both estimators require drawing the same number $N$ of samples). In the latter case, product-form estimators enable us to extract the most possible from the samples we have gathered so far: by permuting the samples' components, the estimators artificially generate further samples. Of course, the more permutations we make, the more correlated our sample ensemble becomes, and we get a diminishing-returns effect that results in an $O(N^{-1/2})$ rate of convergence instead of the $O(N^{-K/2})$ we would achieve using $N^K$ independent samples. There is a middle ground here that remains unexplored: using $N < M < N^K$ permutations instead of all $N^K$ possible ones, so lowering the cost to $O(M)$ at the expense of sacrificing some of the variance reduction (see [19, 18] for similar ideas). The $M$ permutations should be chosen to minimise the correlations among them, making the $M$ permutations with least overlap among their components an attractive option.

There is one setting in which we believe that product-form estimators should be applied without hesitation: if the integrand is a sum of products (SOP) of univariate functions, the cost comes down to $O(N)$ without affecting the variance reduction (Section 2.4; see the sketch below). This applies, for instance, when estimating ELBO gradients to optimise mean-field approximations [32] of posteriors $e^v$ with SOP potentials $v$. Furthermore, because many test functions can be effectively approximated with SOPs, it is possible to extend the $O(N)$ approach to integrands that are not SOPs at the cost of introducing some bias (Example 4). If the bias proves particularly undesirable, it can in principle be removed using multi-level randomisation [24, 33] or similar debiasing techniques. The main challenge then becomes devising accurate SOP approximations that do not involve prohibitively large numbers of low-dimensional functions. Care also needs to be taken to ensure that the resulting integrals closely approximate the original one. For example, applying Taylor expansions blindly can easily lead to infinite sums of moments that are not absolutely convergent, the truncations of which will never settle down to the desired integral. Indeed, how to systematically design SOP approximations that mix well with product-form estimators remains an open question, and the success of this type of approach hinges on its resolution.

This product-form-estimators-combined-with-SOP-approximations approach amounts to nothing more than a dimensionality reduction technique: we approximate a high-dimensional integral with a linear combination of products of low-dimensional integrals, estimate each of the latter separately, and plug the estimates back into the linear combination to obtain an estimate of the original integral.
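To make the SOP point concrete: by linearity, and because averages over permuted tuples factorise across components, $\mu_\times^N(\phi) = \sum_{j=1}^J \prod_{k=1}^K \mu_k^N(f_{j,k})$ whenever $\phi(x) = \sum_{j=1}^J \prod_{k=1}^K f_{j,k}(x^k)$, so the $N^K$-term sum never needs to be enumerated. A minimal Python sketch, with function names and a toy test function of our own choosing:

import numpy as np

def sop_product_form_estimate(samples, factors):
    # mu_x^N(phi) for phi(x) = sum_j prod_k f_{j,k}(x^k), computed in
    # O(N J K) time: each factor reduces to a univariate empirical mean.
    total = 0.0
    for row in factors:  # one SOP summand prod_k f_{j,k}(x^k) at a time
        prod = 1.0
        for k, f in enumerate(row):
            prod *= np.mean(f(samples[:, k]))  # mu_k^N(f_{j,k})
        total += prod
    return total

# Toy check (our choice): phi(x) = x^1 x^2 + (x^1)^2 under a standard
# Gaussian on R^2, for which mu(phi) = 0 + 1 = 1.
rng = np.random.default_rng(2)
X = rng.normal(size=(10_000, 2))
factors = [[lambda x: x, lambda x: x],                   # x^1 * x^2
           [lambda x: x**2, lambda x: np.ones_like(x)]]  # (x^1)^2 * 1
print(sop_product_form_estimate(X, factors))             # approximately 1

The same reduction applies verbatim to the weighted estimators of Section 3.2 whenever $w\phi$ is itself an SOP.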
This approach is certainly not without precedent: for instance, [31, 22, 12, 6] all propose, in rather different contexts, similar approximations, except that the low-dimensional integrals are computed using closed-form analytical expressions or cubature (for a very well-known example, see the delta method for moments [27]). By using Monte Carlo, we extend this type of approach to cases where no analytical expressions are available and the low-dimensional integrals are still sufficiently high-dimensional to preclude the effective use of cubature. In practice, the best option will likely involve a mix of these: use analytic expressions where available, cubature where possible, and Monte Carlo for everything else. One could also use quasi-Monte Carlo here and, in doing so, exploit its strength at approximating low-dimensional integrals to tackle higher-dimensional problems.

Even though product-form estimators are directly applicable only to product-form distributions, they are easily extended to other targets using well-established techniques from the Monte Carlo literature. To demonstrate this point, we considered two simple extensions: to mixtures of product-form targets (Section 3.1) and to targets that are absolutely continuous with respect to product-form distributions (Section 3.2). The latter extension exemplifies an important point: product-form estimators and their variants achieve lower variance than the usual Monte Carlo estimators whenever the proposal is product-form, even for targets that are not product-form. Many other extensions are possible. For instance, putting together these two extensions, we obtain a product-form version of (stratified) mixture importance sampling estimators [28, 13] that is particularly appropriate for multi-modal targets. For another example, see the divide-and-conquer SMC algorithm [19], which is obtained by combining product-form estimators with SMC, as demonstrated in [17]. In general, we believe that product-form estimators will be of greatest use when embedded within more complicated Monte Carlo routines to tackle the aspects of a problem exhibiting product structure. There remains much work to be done in this direction.

References
[1] D. F. Anderson, G. Craciun, and T. G. Kurtz. Product-form stationary distributions for deficiency zero chemical reaction networks. Bulletin of Mathematical Biology, 72(8):1947–1970, 2010. doi:10.1007/s11538-010-9517-4.
[2] S. Asmussen and P. W. Glynn. Stochastic Simulation: Algorithms and Analysis. Springer-Verlag New York, 2007. doi:10.1007/978-0-387-69033-9.
[3] P. H. Baxendale. Renewal theory and computable convergence rates for geometrically ergodic Markov chains. Annals of Applied Probability, 15(1B):700–738, 2005. doi:10.1214/105051604000000710.
[4] T. Bengtsson, P. Bickel, and B. Li. Curse-of-dimensionality revisited: Collapse of the particle filter in very large scale systems, volume 2, pages 316–334. Institute of Mathematical Statistics, 2008. doi:10.1214/193940307000000518.
[5] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017. doi:10.1080/01621459.2017.1285773.
[6] M. Braun and J. McAuliffe. Variational inference for large-scale models of discrete choice. Journal of the American Statistical Association, 105(489):324–335, 2010. doi:10.1198/jasa.2009.tm08030.
[7] D. Cappelletti and C. Wiuf. Product-form Poisson-like distributions and complex balanced reaction systems. SIAM Journal on Applied Mathematics, 76(1):411–432, 2016. doi:10.1137/15M1029916.
[8] N. Chopin and O. Papaspiliopoulos. An Introduction to Sequential Monte Carlo. Springer, Cham, 2020. doi:10.1007/978-3-030-47845-2.
[9] R. Douc, A. Guillin, J.-M. Marin, and C. P. Robert. Convergence of adaptive mixtures of importance sampling schemes. The Annals of Statistics, 35(1):420–448, 2007. doi:10.1214/009053606000001154.
[10] A. Doucet, M. K. Pitt, G. Deligiannidis, and R. Kohn. Efficient implementation of Markov chain Monte Carlo when using an unbiased likelihood estimator. Biometrika, 102(2):295–313, 2015. doi:10.1093/biomet/asu075.
[11] P. Étoré and B. Jourdain. Adaptive optimal allocation in stratified sampling methods. Methodology and Computing in Applied Probability, 12:335–360, 2010. doi:10.1007/s11009-008-9108-0.
[12] S. J. Gershman, M. D. Hoffman, and D. M. Blei. Nonparametric variational inference. In Proceedings of the 29th International Conference on Machine Learning, pages 235–242, 2012.
[13] T. Hesterberg. Weighted average importance sampling and defensive mixture distributions. Technometrics, 37(2):185–194, 1995. doi:10.1080/00401706.1995.10484303.
[14] J. R. Jackson. Networks of waiting lines. Operations Research, 5(4):518–521, 1957. doi:10.1287/opre.5.4.518.
[15] F. P. Kelly. Reversibility and Stochastic Networks. Wiley, Chichester, 1st edition, 1979.
[16] A. Kong, J. S. Liu, and W. H. Wong. Sequential imputations and Bayesian missing data problems. Journal of the American Statistical Association, 89(425):278–288, 1994. doi:10.1080/01621459.1994.10476469.
[17] J. Kuntz, F. R. Crucinio, and A. M. Johansen. The divide-and-conquer sequential Monte Carlo algorithm: theoretical properties and limit theorems. In preparation, 2021.
[18] M. T. Lin, J. L. Zhang, Q. Cheng, and R. Chen. Independent particle filters. Journal of the American Statistical Association, 100(472):1412–1421, 2005. doi:10.1198/016214505000000349.
[19] F. Lindsten, A. M. Johansen, C. A. Naesseth, B. Kirkpatrick, T. B. Schön, J. A. D. Aston, and A. Bouchard-Côté. Divide-and-Conquer with sequential Monte Carlo. Journal of Computational and Graphical Statistics, 26(2):445–458, 2017. doi:10.1080/10618600.2016.1237363.
[20] J. S. Liu. Metropolized independent sampling with comparisons to rejection sampling and importance sampling. Statistics and Computing, 6(2):113–119, 1996. doi:10.1007/BF00162521.
[21] J. S. Liu. Monte Carlo Strategies in Scientific Computing. Springer, New York, 2001. doi:10.1007/978-0-387-76371-2.
[22] X. Ma and N. Zabaras. An adaptive hierarchical sparse grid collocation algorithm for the solution of stochastic differential equations. Journal of Computational Physics, 228(8):3084–3113, 2009. doi:10.1016/j.jcp.2009.01.006.
[23] D. J. C. MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003.
[24] D. McLeish. A general method for debiasing a Monte Carlo estimator. Monte Carlo Methods and Applications, 17(4):301–315, 2011. doi:10.1515/mcma.2011.013.
[25] S. Meyn and R. L. Tweedie. Markov Chains and Stochastic Stability. Cambridge Mathematical Library. Cambridge University Press, 2nd edition, 2009. doi:10.1017/CBO9780511626630.
[26] A. Mira. Ordering, slicing and splitting Monte Carlo Markov chains. PhD thesis, University of Minnesota, 1998.
[27] G. W. Oehlert. A note on the delta method. The American Statistician, 46(1):27–29, 1992. doi:10.1080/00031305.1992.10475842.
[28] M.-S. Oh and J. O. Berger. Integration of multimodal functions by Monte Carlo importance sampling. Journal of the American Statistical Association, 88(422):450–456, 1993. doi:10.1080/01621459.1993.10476295.
[29] A. Pakman and L. Paninski. Exact Hamiltonian Monte Carlo for truncated multivariate Gaussians. Journal of Computational and Graphical Statistics, 23(2):518–542, 2014. doi:10.1080/10618600.2013.788448.
[30] N. Raghavan and D. D. Cox. Adaptive mixture importance sampling. Journal of Statistical Computation and Simulation, 60(3):237–259, 1998. doi:10.1080/00949659808811890.
[31] S. Rahman and H. Xu. A univariate dimension-reduction method for multi-dimensional integration in stochastic mechanics. Probabilistic Engineering Mechanics, 19(4):393–408, 2004. doi:10.1016/j.probengmech.2004.04.003.
[32] R. Ranganath, S. Gerrish, and D. Blei. Black box variational inference. In Proceedings of the 17th International Conference on Artificial Intelligence and Statistics, volume 33, pages 814–822, 2014.
[33] C.-H. Rhee and P. W. Glynn. Unbiased estimation with square root convergence for SDE models. Operations Research, 63(5):1026–1043, 2015. doi:10.1287/opre.2015.1404.
[34] C. Robert and G. Casella. Monte Carlo Statistical Methods. Springer-Verlag New York, 2nd edition, 2004. doi:10.1007/978-1-4757-4145-2.
[35] B. W. Silverman. Density Estimation for Statistics and Data Analysis, volume 26 of Monographs on Statistics and Applied Probability. CRC Press, 1986.
[36] C. Snyder, T. Bengtsson, P. Bickel, and J. Anderson. Obstacles to high-dimensional particle filtering. Monthly Weather Review, 136(12):4629–4640, 2008. doi:10.1175/2008MWR2529.1.
[37] M. P. Wand and M. C. Jones. Multivariate plug-in bandwidth selection.