On Unbiased Estimation for Discretized Models
Jeremy Heng, Ajay Jasra, Kody J. H. Law, Alexander Tarakanov
BY JEREMY HENG, AJAY JASRA, KODY J. H. LAW & ALEXANDER TARAKANOV
ESSEC Business School, Singapore, 139408, SG. E-Mail: [email protected]
Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology, Thuwal, 23955, KSA. E-Mail: [email protected]
School of Mathematics, University of Manchester, Manchester, M13 9PL, UK. E-Mail: [email protected]; [email protected]
Abstract
In this article, we consider computing expectations w.r.t. probability measures which are subject to discretization error. Examples include partially observed diffusion processes or inverse problems, where one may have to discretize time and/or space in order to practically work with the probability of interest. Given access only to these discretizations, we consider the construction of unbiased Monte Carlo estimators of expectations w.r.t. such target probability distributions. It is shown how to obtain such estimators using a novel adaptation of randomization schemes and Markov simulation methods. Under appropriate assumptions, these estimators possess finite variance and finite expected cost. There are two important consequences of this approach: (i) unbiased inference is achieved at the canonical complexity rate, and (ii) the resulting estimators can be generated independently, thereby allowing strong scaling to arbitrarily many parallel processors. Several algorithms are presented and applied to some examples of Bayesian inference problems, with both simulated and real observed data.
Key words: Randomization Methods; Markov chain Monte Carlo; Bayesian Inverse Problems.
1 Introduction

Consider a probability measure $\pi$ on a measurable space $(\mathsf{X}, \mathcal{X})$ for which one wants to compute $\pi(\varphi) := \int_{\mathsf{X}} \varphi(x)\,\pi(dx)$ with $\varphi : \mathsf{X} \to \mathbb{R}$ a $\pi$-integrable function. Suppose one can only deal with a sequence of biased probability measures $(\pi_l)_{l \in \mathbb{Z}_+}$ on $(\mathsf{X}, \mathcal{X})$, with $\pi_l(|\varphi|) < +\infty$ for all $l \in \mathbb{Z}_+$, such that $\lim_{l\to\infty} \pi_l(\varphi) = \pi(\varphi)$ and $|\pi_{l+1}(\varphi) - \pi(\varphi)| \le |\pi_l(\varphi) - \pi(\varphi)|$; examples include partially observed diffusion processes, e.g. [18], or inverse problems, e.g. [3]. These latter models have a wide range of real applications, for instance in engineering, finance and applied mathematics; see [29].

In many applications of interest, one often resorts to constructing a $\pi_l$-invariant and ergodic Markov chain Monte Carlo (MCMC) kernel $K_l$ to estimate the expectation $\pi_l(\varphi) = \int_{\mathsf{X}} \varphi(x)\,\pi_l(dx)$. It is often the case that as $l$ grows, the cost of applying $K_l$ will also increase, often exponentially in $l$. Therefore one would often fix $l$ to achieve a given bias, and run the Markov chain for long enough to obtain a pre-specified variance which balances the bias. In this article, we consider the task of producing unbiased estimators with finite variance. In particular, using a stochastic simulation scheme based upon a family of Markov kernels $(K_l)_{l\in\mathbb{Z}_+}$, one can construct an estimator $\widehat{\pi(\varphi)}$ such that $\mathbb{E}[\widehat{\pi(\varphi)}] = \pi(\varphi)$ and $\mathrm{Var}[\widehat{\pi(\varphi)}] < \infty$, where $\mathbb{E}$ and $\mathrm{Var}$ denote expectation and variance w.r.t. the law of the stochastic scheme to be developed, respectively. This scheme is of interest for several reasons:
1. One can produce unbiased estimators of score functions which can be employed within stochastic gradient algorithms to perform parameter inference.
2. One can simulate i.i.d. replicates of such unbiased estimators in parallel and combine them to construct lower variance estimators in a static context (sometimes referred to as strong parallel scaling).
3. The method provides a benchmark for other computations.
In terms of the first point, it is often simpler to verify the validity of stochastic gradient algorithms when the estimate of the noisy gradient is unbiased. The second point means that the variance can be reduced proportionally to the number of available processors, for the same fixed expected cost per processor. The third point means that one can check the precision of biased methods against the results.

The approach that we follow is based upon an idea that was outlined in [16] and belongs to the class of doubly-randomized estimators; more specifically, estimators which arise from applying randomization of the type [25, 28] (see also [33]) twice, in a nested fashion. The baseline version of the randomized estimators of [25, 28] places a probability distribution $P_L$ over the level of discretization $l$. Given a simulation from this probability distribution, one way to obtain unbiased estimates of $\pi(\varphi)$ is to unbiasedly estimate $\pi_l(\varphi) - \pi_{l-1}(\varphi)$ for $l \in \mathbb{N}$, or $\pi_0(\varphi)$ if one samples $l = 0$. Denoting these estimators by $\xi_l$, the so-called "single-term" estimator is given by $\xi_L / P_L(L)$, where $L \sim P_L$. In the inference context, it is challenging to obtain unbiased estimators, and this is where the second randomization comes into the picture. For $\xi_0$, one can use the recently introduced unbiased MCMC scheme of [14] (see also [10]). This estimator is built by truncating an infinite series of increments of coupled MCMC chains once the chains meet.
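To fix ideas, the following is a minimal Python sketch of the single-term construction, assuming a hypothetical routine increment_estimator(l, rng) that returns an unbiased estimate of $\pi_l(\varphi) - \pi_{l-1}(\varphi)$ (of $\pi_0(\varphi)$ when $l = 0$); the truncation of the support of $P_L$ is only for illustration.

```python
import numpy as np

def single_term_estimate(increment_estimator, p_l, l_max=20, rng=None):
    # Truncated proxy for the probability mass function P_L on {0, ..., l_max}.
    rng = np.random.default_rng() if rng is None else rng
    probs = np.array([p_l(l) for l in range(l_max + 1)], dtype=float)
    probs /= probs.sum()
    L = rng.choice(l_max + 1, p=probs)            # sample the level L ~ P_L
    # Inverse-probability weighting makes the expectation telescope to pi(phi).
    return increment_estimator(L, rng) / probs[L]
```

Conditioning on $L$, the expectation of this estimator is $\sum_l P_L(l)\,\mathbb{E}[\xi_l]/P_L(l)$, which telescopes to $\pi(\varphi)$ provided the increment estimates are unbiased and a summability condition holds; this is formalized in Section 2.3.1.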
The main complication is then to unbiasedly estimate $\pi_l(\varphi) - \pi_{l-1}(\varphi)$. It will typically not suffice to estimate $\pi_l(\varphi)$ and $\pi_{l-1}(\varphi)$ independently, because the resulting estimator would often have infinite variance. Therefore an additional technique is required. The main contribution of this article is to develop several novel coupled MCMC schemes that can produce unbiased estimates $\xi_l$ for $l > 0$ such that the resulting estimator of $\pi(\varphi)$ is unbiased and of finite variance. The latter properties are proved mathematically under assumptions. We also implement our proposed algorithms on several challenging statistical applications.

The idea of using doubly-randomized estimators has appeared in several recent works. The work [20] utilizes a "coupled-sum" estimator over sample size to debias multilevel estimators of the type introduced in [3], which are then utilized in the framework described above. That method is applicable to static/non-dynamic problems where one can evaluate the target distribution, up to a normalizing constant, like the method we introduce here. The work [17] uses a "single-term" estimator over sample sizes in order to debias estimators of the type introduced in [18]. Those estimators are designed for online inference in dynamic problems, such as state space models, and partially observed diffusion processes in particular. The methodology in [20] has infinite expected cost, whereas this is not always the case for the method introduced in this paper. In a companion paper [13], we show how to extend the framework of this article to the context of partially observed diffusion processes. A possible alternative to our approach would be that of [1].

This article is structured as follows. In Sections 2.2 and 2.3, the precise problem is stated and our strategy outlined. We show in Section 2.4, under assumptions, that our general approach can produce unbiased and finite variance estimators with finite expected costs. In Section 3, we present some specific Markov kernels which fall under our general framework. We illustrate our methodology on several numerical examples in Section 4. The proofs of our mathematical results are given in Appendix A.

2 Approach

2.1 Notation

Let $(\mathsf{X}, \mathcal{X})$ be a measurable space. We write $\mathcal{B}_b(\mathsf{X})$ to denote the collection of bounded measurable functions $\varphi : \mathsf{X} \to \mathbb{R}$ and, if $\mathsf{X} \subseteq \mathbb{R}^d$, $L^2(\mathsf{X})$ as the collection of square Lebesgue-integrable functions. For $\varphi \in \mathcal{B}_b(\mathsf{X})$, we write the supremum norm as $\|\varphi\|_\infty = \sup_{x \in \mathsf{X}} |\varphi(x)|$. We denote the Borel sets on $\mathbb{R}^d$ as $\mathcal{B}(\mathbb{R}^d)$. The $d$-dimensional Lebesgue measure is written as $dx$. For a metric $d : \mathsf{X} \times \mathsf{X} \to \mathbb{R}_+$ on $\mathsf{X}$, $\mathrm{Lip}_d(\mathsf{X})$ denotes the Lipschitz functions $\varphi : \mathsf{X} \to \mathbb{R}$ (with finite Lipschitz constants), that is, for every $(x, w) \in \mathsf{X} \times \mathsf{X}$, $|\varphi(x) - \varphi(w)| \le \|\varphi\|_{\mathrm{Lip}}\, d(x, w)$. $\mathcal{P}(\mathsf{X})$ denotes the collection of probability measures on $(\mathsf{X}, \mathcal{X})$. For a finite measure $\mu$ on $(\mathsf{X}, \mathcal{X})$ and a $\varphi \in \mathcal{B}_b(\mathsf{X})$, the notation $\mu(\varphi) = \int_{\mathsf{X}} \varphi(x)\,\mu(dx)$ is used. For $(\mathsf{X} \times \mathsf{W}, \mathcal{X} \vee \mathcal{W})$ a measurable space and $\mu$ a non-negative finite measure on this space, we use the tensor-product of functions notation for $(\varphi, \psi) \in \mathcal{B}_b(\mathsf{X}) \times \mathcal{B}_b(\mathsf{W})$: $\mu(\varphi \otimes \psi) = \int_{\mathsf{X}\times\mathsf{W}} \varphi(x)\psi(w)\,\mu(d(x,w))$. Given a Markov kernel $K : \mathsf{X} \to \mathcal{P}(\mathsf{X})$ and a finite measure $\mu$, we use the notations $\mu K(dx') = \int_{\mathsf{X}} \mu(dx) K(x, dx')$ and $K(\varphi)(x) = \int_{\mathsf{X}} \varphi(x') K(x, dx')$ for $\varphi \in \mathcal{B}_b(\mathsf{X})$.
The iterated kernel is $K^n(x_0, dx_n) = \int_{\mathsf{X}^{n-1}} \prod_{i=1}^n K(x_{i-1}, dx_i)$. For $A \in \mathcal{X}$, the indicator function is written as $\mathbb{I}_A(x)$. $\mathbb{Z}_+$ is the set of non-negative integers. For $(\mu, \nu) \in \mathcal{P}(\mathsf{X}) \times \mathcal{P}(\mathsf{X})$, $\|\mu - \nu\|_{\mathrm{tv}} := \sup_{A \in \mathcal{X}} |\mu(A) - \nu(A)|$ is the total variation distance. $\mathcal{N}_d(\mu, \Sigma)$ is the $d$-dimensional Gaussian distribution with mean $\mu$ and covariance $\Sigma$, with corresponding Lebesgue density written as $x \mapsto \phi_d(x; \mu, \Sigma)$. $\mathcal{U}_A$ denotes the uniform distribution on a measurable set $A$. $0_d$ denotes the $d$-dimensional column vector of zeros. $I_d$ denotes the $d \times d$ identity matrix. The transpose of a vector or matrix $x$ is denoted as $x^T$. For a set $C \subset \mathbb{R}^d$, the $L^2(C)$ norm of $f$ is written as $\|f\|^2 = \int_C f(x)^2\,dx$, and the space of square integrable functions on $C$ is denoted by $L^2(C) = \{f : C \to \mathbb{R} : \|f\| < \infty\}$. For a vector $x \in \mathbb{R}^d$, its Euclidean norm is also written as $\|x\|$.

2.2 Motivating Example

A particular Bayesian inverse problem associated to partial differential equations (PDEs) is now introduced as a motivating example. The objective is to infer the permeability field associated to a porous medium, based on pressure measurements of the fluid flow governed by Darcy's law. This example is prototypical in the context of subsurface inversion, with applications ranging from oil recovery to contaminant transport in groundwater [31, 29].

2.2.1 Problem Formulation

Let $C \subset \mathbb{R}^D$ with the boundary $\partial C$ convex and once continuously differentiable, and suppose $f \in L^2(C)$. Consider the following PDE for the pressure field $h$ on $C$:

$-\nabla \cdot (\Phi \nabla h) = f$ on $C$, with $h = 0$ on $\partial C$,   (1)

where, for $t \in C$, the permeability is

$\Phi(t; X) := \bar{\Phi}(t) + \sum_{j=1}^d X_j \vartheta_j v_j(t).$

The known forcing $f$ can represent, e.g., injection and/or extraction of fluid from wells. In the above:
• $X = (X_1, \ldots, X_d)$, with $X_j$ i.i.d. $\sim \mathcal{U}_{[-1,1]}$. This determines the prior distribution for $X$ on the state space $\mathsf{X} = [-1,1]^d$.
• $\bar{\Phi} : C \to \mathbb{R}$, and for $j \in \{1, \ldots, d\}$, $v_j : C \to \mathbb{R}$ with $\sup_{t \in C} |v_j(t)| \le 1$, and $\vartheta_j \in \mathbb{R}_+$.
• $h(\cdot\,; X)$ (or just $h(X)$) denotes the weak solution of (1) for a given $X$.
We remark that one can allow $d \to \infty$ in the above if the $\vartheta_j$ decay to zero sufficiently fast with $j$; see [2, 3] and the references therein for further details. The following will be assumed.

(H1) There exists a $\Phi_\star > 0$ such that $\inf_{t \in C} \bar{\Phi}(t) \ge \sum_{j=1}^d \vartheta_j + \Phi_\star$.

This assumption guarantees that for all $x \in \mathsf{X}$, $\inf_{t \in C} \Phi(t; x) \ge \Phi_\star$, hence there is a well-defined and unique weak solution $h(x) \in L^2(C)$, and $\sup_{x \in \mathsf{X}} \|h(x)\| \le C$ for some $C > 0$ [6].

Define the vector-valued function $G : \mathsf{X} \to \mathbb{R}^P$ by

$G : x \mapsto G(x) = [g_1(h(x)), \ldots, g_P(h(x))],$   (2)

where the $g_p : L^2(C) \to \mathbb{R}$ are bounded linear functionals on $L^2(C)$ for $p \in \{1, \ldots, P\}$. It is assumed that the data $y \in \mathbb{R}^P$ take the form

$Y \mid (X = x) \sim \mathcal{N}_P(G(x), \theta^{-1} I_P),$   (3)

where $\theta > 0$ is a parameter that we will be interested in inferring. In fact, unbiased estimators are particularly useful in this context, and this will be considered in the numerical examples of Section 4. We simplify notation by suppressing explicit dependence on the parameter $\theta$ and the data $y$, and write the un-normalized Lebesgue density of $X$ for fixed $y$ and $\theta$ as

$\gamma(x) = \exp\{-\tfrac{\theta}{2}\|y - G(x)\|^2\}\, \mathbb{I}_{\mathsf{X}}(x),$   (4)

and the normalized density as $\pi(x) = \gamma(x) / \int_{\mathsf{X}} \gamma(x)\,dx$.
These densities will be written as $\gamma_\theta(x)$ and $\pi_\theta(x)$ when we consider inference for $\theta$. This posterior distribution is in general intractable due to the nonlinear dependence of $y$ on $x$, even if the PDE were to admit an analytical solution, and one must resort to computationally intensive inference methods such as MCMC. A further complication is that the analytical solution of the PDE is in general not available, so one must resort to numerical approximations, which will be discussed in the next section.

2.2.2 Finite Element Discretization

For simplicity we present the case $D = 1$ and $C = [0, 1]$, but the extension to higher dimensions is straightforward; see e.g. [2, 5]. The PDE problem at resolution level $l$ is solved using a finite element method (FEM) with piecewise linear shape functions on a uniform mesh of width $\Delta_l = 2^{-(l+l_0)}$, for $l \in \{0, 1, \ldots\}$ and a fixed $l_0$ determining the maximal mesh width. In particular, the finite element basis functions $\{\psi_i^l\}_{i=1}^{\Delta_l^{-1}-1}$ on level $l$ are defined as follows for $t_i = i\Delta_l$:

$\psi_i^l(t) = \Delta_l^{-1}[t - (t_i - \Delta_l)]$ if $t \in [t_i - \Delta_l, t_i]$, and $\psi_i^l(t) = \Delta_l^{-1}[t_i + \Delta_l - t]$ if $t \in [t_i, t_i + \Delta_l]$,

with $\psi_i^l(t) = 0$ otherwise. To solve the PDE for a given $x \in \mathsf{X}$, $h_l(x) = \sum_{i=1}^{\Delta_l^{-1}-1} h_i^l(x) \psi_i^l$ is substituted into (1), and projected onto each basis element:

$-\Big\langle \nabla \cdot \Big(\Phi \nabla \sum_{i=1}^{\Delta_l^{-1}-1} h_i^l(x) \psi_i^l\Big), \psi_j^l \Big\rangle = \langle f, \psi_j^l \rangle.$

We introduce the matrix $A^l(x)$ with entries $A_{ij}^l(x) = \langle \Phi(x) \nabla \psi_i^l, \nabla \psi_j^l \rangle$, and vectors $h^l(x)$, $f^l$ with entries $h_i^l(x)$ and $f_i^l = \langle f, \psi_i^l \rangle$, respectively. Solving the discretized problem amounts to solving the linear system

$A^l(x)\, h^l(x) = f^l.$   (5)

Define $G_l(x) = [g_1(h_l(x)), \ldots, g_P(h_l(x))]$. We denote the corresponding approximated un-normalized density by

$\gamma_l(x) = \exp\{-\tfrac{\theta}{2}\|y - G_l(x)\|^2\}\, \mathbb{I}_{\mathsf{X}}(x),$   (6)

and the approximated normalized density by $\pi_l(x) = \gamma_l(x) / \int_{\mathsf{X}} \gamma_l(x)\,dx$.
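As an illustration of the discretization just described, the following is a minimal Python sketch of assembling and solving (5) in the case $D = 1$, $C = [0, 1]$, under the simplifying assumptions of midpoint quadrature for the permeability and load integrals and a dense linear solver (a sparse tridiagonal solver, scaling as $O(\Delta_l^{-1})$, would be used in practice).

```python
import numpy as np

def fem_solve(phi, f, l, l0=3):
    """Solve -(phi h')' = f with h(0) = h(1) = 0 on a mesh of width 2^-(l+l0)."""
    delta = 2.0 ** -(l + l0)
    n = round(1.0 / delta) - 1                 # number of interior nodes
    t = np.linspace(0.0, 1.0, n + 2)           # mesh nodes t_0, ..., t_{n+1}
    mid = 0.5 * (t[:-1] + t[1:])               # element midpoints for quadrature
    k = phi(mid) / delta                       # elementwise <phi psi_i', psi_j'>
    A = np.diag(k[:-1] + k[1:]) - np.diag(k[1:-1], 1) - np.diag(k[1:-1], -1)
    b = delta * f(t[1:-1])                     # quadrature for <f, psi_i>
    h = np.zeros(n + 2)
    h[1:-1] = np.linalg.solve(A, b)            # dense solve, for clarity only
    return t, h
```

For example, fem_solve(lambda t: 1.0 + 0.0 * t, lambda t: np.sin(np.pi * t), l=4) returns the mesh and the approximate pressure field at level 4.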
We now present some fundamental convergence results relating to this approximation, which are crucial for the application of our proposed methodology.

Proposition 2.1.
Assume (H1). Then:
1. For all $x \in \mathsf{X}$, $\lim_{l\to\infty} h_l(x) = h(x)$. In addition, there exists a $C \in (0, \infty)$ such that for every $(l, x) \in \mathbb{Z}_+ \times \mathsf{X}$:

$\|h_l(x)\|_\infty \le C$ and $\|h_l(x) - h(x)\| \le C \Delta_l^\beta$   (7)

with $\beta = 2$.
2. For all $x \in \mathsf{X}$, $\lim_{l\to\infty} \gamma_l(x) = \gamma(x)$. In addition, there exists a $(C, \tilde{l}) \in (0, \infty) \times \mathbb{Z}_+$ such that for every $(l, x) \in \{\tilde{l}, \tilde{l}+1, \ldots\} \times \mathsf{X}$:

$|\gamma_l(x) - \gamma(x)| \le C \Delta_l^\beta$ and $|\pi_l(x) - \pi(x)| \le C \Delta_l^\beta$   (8)

with $\beta = 2$.

Proof. The first part in (7) is a standard result in finite element methods [5, 6]. For the second, recall that for $p \in \{1, \ldots, P\}$, each $g_p$ is a bounded linear functional. Using this fact and (7), it is straightforward to establish (8); see e.g. [3, 20] and the references therein.

2.3 Methodology

2.3.1 Randomized Estimators

We now describe our strategy to construct unbiased estimators of $\pi(\varphi)$. Consider a positive probability mass function $P_L$ on $\mathbb{Z}_+$. It is known [28, 33] that if one can find a sequence of independent random variables $(\xi_l)_{l\ge0}$, independent of $L \sim P_L$, such that

$\mathbb{E}[\xi_0] = \pi_0(\varphi),$   (9)

$\mathbb{E}[\xi_l] = \pi_l(\varphi) - \pi_{l-1}(\varphi), \quad \forall l \in \mathbb{N},$   (10)

$\sum_{l \in \mathbb{Z}_+} \mathbb{E}[\xi_l^2] / P_L(l) < +\infty,$   (11)

then

$\widehat{\pi(\varphi)}_S := \xi_L / P_L(L)$   (12)

is an unbiased and finite variance estimator of $\pi(\varphi)$. This is the 'single term' estimator as discussed by [28, 33], and alternatives such as the 'independent sum' estimator are also possible. In this latter case, if one can construct independent random variables $(\xi_l)_{l\ge0}$ that are independent of $L \sim P_L$, which satisfy (9)-(10) and additionally

$\sum_{l \in \mathbb{Z}_+} \{\mathrm{Var}[\xi_l] + (\pi_l(\varphi) - \pi(\varphi))^2\} / \bar{P}_L(l) < +\infty,$   (13)

where $\bar{P}_L(l) = \sum_{k \ge l} P_L(k)$, then

$\widehat{\pi(\varphi)}_I := \sum_{l=0}^L \xi_l / \bar{P}_L(l)$   (14)

is also an unbiased estimator of $\pi(\varphi)$ with finite variance. Typically, one will run $N \in \mathbb{N}$ independent replicates of either (12) or (14) and then use the average

$\frac{1}{N} \sum_{i=1}^N \big(\widehat{\pi(\varphi)}_k\big)^i,$

where $k \in \{S, I\}$ and $(\widehat{\pi(\varphi)}_k)^i$ represents the $i$-th independent replicate of the estimate. We note that this idea was mentioned in [16] and also used in various different ways in [9, 17, 20]. The main point of these schemes is that one can completely remove the discretization bias (represented by $l$) associated to, for example, Euler discretizations of stochastic differential equations or FEM discretizations of PDEs, whilst only working with biased versions of $\pi$, denoted $\pi_l$, with $l < \infty$. In addition, the method is completely parallelizable, as one can run each replicate independently.

We remark that the condition (10) is not a necessary one (in the case of (12), but it is needed for (14)), but is certainly sufficient. In many contexts, satisfying (10) is not trivial, as exact simulation from any of the distributions $(\pi_l)_{l\in\mathbb{Z}_+}$ is often not possible. In the work of [10, 14] (see also [12, 15]), the authors consider a methodology to unbiasedly estimate $\pi_l(\varphi)$ for each $l$, which we shall build upon. In order to satisfy (11) (or (13)), it will typically not be sufficient (or at least efficient) to run two independent unbiased MCMC algorithms to estimate $\pi_l(\varphi)$ and $\pi_{l-1}(\varphi)$ respectively. To see this, let $l \in \mathbb{N}$ be given and consider the easier situation where one can sample exactly from $\pi_l$ and $\pi_{l-1}$. Then an unbiased estimator of $[\pi_l - \pi_{l-1}](\varphi)$ is given by

$\xi_l = \frac{1}{N} \sum_{n=1}^N \{\varphi(X^n) - \varphi(W^n)\},$

where $(X^n)_{n\in\mathbb{N}}$ are i.i.d. from $\pi_l$ and $(W^n)_{n\in\mathbb{N}}$ are i.i.d. from $\pi_{l-1}$.
In this case, we have

$\mathbb{E}[\xi_l^2] = \frac{\mathrm{Var}_{\pi_l}[\varphi(X)] + \mathrm{Var}_{\pi_{l-1}}[\varphi(W)]}{N} + \big([\pi_l - \pi_{l-1}](\varphi)\big)^2.$

In practice, it will typically be difficult to choose $P_L$ so that the estimator would satisfy (11) and have finite expected cost, unless one takes $N = O(2^l)$. We shall introduce a novel solution to circumvent this difficulty. To motivate our approach, we begin by recalling the methodology in [10, 14].

2.3.2 Unbiased MCMC

We consider the unbiased estimation of $\pi_l(\varphi)$ with $l \in \mathbb{Z}_+$ fixed. Suppose we have a $\pi_l$-invariant, ergodic MCMC kernel $K_l : \mathsf{X} \to \mathcal{P}(\mathsf{X})$ and an initial distribution $\nu_l \in \mathcal{P}(\mathsf{X})$. Let $\tilde{\nu}_l \in \mathcal{P}(\mathsf{X}^2)$ be a coupling of $\nu_l$, i.e. $\int_{w\in\mathsf{X}} \tilde{\nu}_l(d(x,w)) = \nu_l(dx)$ and $\int_{x\in\mathsf{X}} \tilde{\nu}_l(d(x,w)) = \nu_l(dw)$. The idea is to run a Markov chain $(Z_{n,l})_{n\in\mathbb{Z}_+} = (X_{n,l}, W_{n,l})_{n\in\mathbb{Z}_+}$ on $\mathsf{X} \times \mathsf{X}$, initialized from

$\check{\nu}_l(d(x,w)) := \int_{\mathsf{X}^2} \tilde{\nu}_l(d(x',w'))\, K_l(x', dx)\, \delta_{\{w'\}}(dw),$   (15)

and evolving according to a coupled transition kernel $\check{K}_l : \mathsf{X} \times \mathsf{X} \to \mathcal{P}(\mathsf{X} \times \mathsf{X})$ satisfying

$\int_{w'\in\mathsf{X}} \check{K}_l((x,w), d(x',w')) = K_l(x, dx'), \qquad \int_{x'\in\mathsf{X}} \check{K}_l((x,w), d(x',w')) = K_l(w, dw').$

The process is simulated as follows.
1. Sample $Z_{0,l}' = (X_{0,l}', W_{0,l}')$ from $\tilde{\nu}_l(\cdot)$.
2. Generate $X_{0,l} \mid X_{0,l}'$ according to $K_l(X_{0,l}', \cdot)$ and set $W_{0,l} = W_{0,l}'$.
3. For $n \ge 1$, generate $Z_{n,l} \mid Z_{n-1,l}$ according to $\check{K}_l(Z_{n-1,l}, \cdot)$.
Marginally, the sequences of random variables $(X_{n,l})_{n\in\mathbb{Z}_+}$ and $(W_{n,l})_{n\in\mathbb{Z}_+}$ are coupled time-homogeneous Markov chains with initial distributions $\nu_l K_l$ and $\nu_l$, respectively, and the same transition kernel $K_l$. Define the meeting time

$\tau_l := \inf\{n \ge 1 : X_{n,l} = W_{n,l}\}.$

The coupled chains $(Z_{n,l})_{n\in\mathbb{Z}_+}$ should be constructed so that $\tau_l$ is almost surely finite, at the very least. In addition, we require that the chains remain faithful after meeting, that is $X_{n,l} = W_{n,l}$ for all $n \ge \tau_l$. Under fairly weak assumptions (e.g. [10]), the following is an unbiased estimator of $\pi_l(\varphi)$ for any $k \in \mathbb{Z}_+$:

$\widehat{\pi_l(\varphi)}_k := \varphi(X_{k,l}) + \sum_{n=k+1}^{\tau_l - 1} \{\varphi(X_{n,l}) - \varphi(W_{n,l})\}.$   (16)

We remark that a time-averaged extension is also possible; let $(m, k) \in \mathbb{Z}_+ \times \mathbb{Z}_+$ with $m \ge k$, then one can also use (see [14]) the time-averaged estimator

$\widehat{\pi_l(\varphi)}_{T,k,m} := \frac{1}{m-k+1} \sum_{n=k}^m \varphi(X_{n,l}) + \sum_{n=k+1}^{\tau_l - 1} \Big(1 \wedge \frac{n-k}{m-k+1}\Big) \{\varphi(X_{n,l}) - \varphi(W_{n,l})\},$   (17)

which recovers (16) in the case $m = k$.
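For illustration, a minimal Python sketch of (17) given recorded coupled trajectories is as follows; the lists X and W (with X[n] == W[n] for $n \ge \tau_l$, and X recorded up to time $\max(m, \tau_l - 1)$) and the meeting time tau are assumed to come from a simulation of the procedure above. Setting m = k recovers (16).

```python
import numpy as np

def unbiased_estimate(phi, X, W, k, m, tau):
    # Ergodic average over times k, ..., m (the first term of (17)).
    est = np.mean([phi(X[n]) for n in range(k, m + 1)], axis=0)
    # Bias-correction terms involving both chains, weighted as in (17).
    for n in range(k + 1, tau):
        weight = min(1.0, (n - k) / (m - k + 1.0))
        est = est + weight * (phi(X[n]) - phi(W[n]))
    return est
```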
We now describe a constructive procedure to generate such a coupled Metropolis–Hastings (MH) kernel $\check{K}_l$ [14]. Let $Q_l : \mathsf{X} \to \mathcal{P}(\mathsf{X})$ be a proposal Markov kernel, such that $\pi_l(dx) Q_l(x, dx')$ has a positive density $\pi_l(x) q_l(x, x')$ w.r.t. a dominating measure. We define the MH acceptance probability for $(x, x') \in \mathsf{X} \times \mathsf{X}$ as

$\alpha_l(x, x') := 1 \wedge \frac{\pi_l(x')\, q_l(x', x)}{\pi_l(x)\, q_l(x, x')}.$

For $(x, w) \in \mathsf{X} \times \mathsf{X}$, we define a maximal coupling of the proposal Markov kernels $Q_l(x, dx')$ and $Q_l(w, dw')$:

$\check{Q}_l((x,w), d(x',w')) = S_l(x,w) \int_{u\in\mathsf{X}} \frac{O_l((x,w), du)}{S_l(x,w)}\, \delta_{(u,u)}(d(x',w')) + (1 - S_l(x,w)) \Big(\frac{Q_l(x, dx') - O_l((x,w), dx')}{1 - S_l(x,w)}\Big) \otimes \Big(\frac{Q_l(w, dw') - O_l((x,w), dw')}{1 - S_l(x,w)}\Big),$   (18)

where $O_l((x,w), du) := Q_l(x, du) \wedge Q_l(w, du)$ denotes the overlapping kernel on $\mathsf{X}$ and $S_l(x,w) := \int_{\mathsf{X}} O_l((x,w), du)$ is the size of the overlap. Under this transition kernel, with probability $S_l(x,w)$ one simulates $X' = W'$ from the overlap $O_l((x,w), du)/S_l(x,w)$, and with probability $1 - S_l(x,w)$, $X'$ and $W'$ are simulated independently from the residuals, which ensures they retain the appropriate marginals. As this transition achieves the maximum probability of having $X' = W'$, $\check{Q}_l$ is known as a maximal coupling of $Q_l(x, dx')$ and $Q_l(w, dw')$. This coupling can be simulated using the algorithm of [32], assuming one can sample from the proposal kernels and evaluate their densities. One can then obtain a sample from the coupled MH kernel $\check{K}_l$ by accepting the proposals $X'$ and $W'$ with a common uniform random variable $U \sim \mathcal{U}_{[0,1]}$. If the proposal is a Gaussian random walk, then the idea of using maximal couplings within MH goes back to at least [21]. We note that alternatives to a maximal coupling are possible; these are described in Section 3.

We now outline the procedure required to compute the time-averaged estimator $\widehat{\pi_l(\varphi)}_{T,k,m}$ in (17).
1. Sample $Z_{0,l}' = (X_{0,l}', W_{0,l}')$ from $\tilde{\nu}_l(\cdot)$.
2. Generate $X^* \mid X_{0,l}'$ according to $Q_l(X_{0,l}', \cdot)$ and $U \sim \mathcal{U}_{[0,1]}$. If $U < \alpha_l(X_{0,l}', X^*)$, set $X_{0,l} = X^*$; otherwise, set $X_{0,l} = X_{0,l}'$. Set $W_{0,l} = W_{0,l}'$ and $n = 1$.
3. Generate $(X^*, W^*) \mid (X_{n-1,l}, W_{n-1,l})$ according to $\check{Q}_l((X_{n-1,l}, W_{n-1,l}), \cdot)$ and $U \sim \mathcal{U}_{[0,1]}$.
   • If $U < \alpha_l(X_{n-1,l}, X^*)$, set $X_{n,l} = X^*$; otherwise, set $X_{n,l} = X_{n-1,l}$.
   • If $U < \alpha_l(W_{n-1,l}, W^*)$, set $W_{n,l} = W^*$; otherwise, set $W_{n,l} = W_{n-1,l}$.
4. If $X_{n,l} = W_{n,l}$ and $n \ge m$, stop; otherwise set $n = n + 1$ and return to step 3.
Assuming that the resulting coupled kernel $\check{K}_l$ costs twice as much as $K_l$, the above procedure requires $(2\tau_l - 1) \vee (m + \tau_l - 1)$ applications of $K_l$.
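For reference, here is a minimal Python sketch of the sampling algorithm of [32] for the maximal coupling (18), assuming hypothetical helpers sample_q(x) that draws from $Q_l(x, \cdot)$ and log_q(x, u) that evaluates $\log q_l(x, u)$.

```python
import numpy as np

def maximal_coupling(x, w, sample_q, log_q, rng):
    # Attempt the overlap O_l((x, w), du): accept U with
    # probability min{1, q_l(w, U) / q_l(x, U)}.
    U = sample_q(x)
    if np.log(rng.uniform()) <= log_q(w, U) - log_q(x, U):
        return U, U                      # proposals meet: X' = W' = U
    # Otherwise rejection-sample W' from the residual of Q_l(w, .).
    while True:
        V = sample_q(w)
        if np.log(rng.uniform()) > log_q(x, V) - log_q(w, V):
            return U, V
```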
2.3.3 Unbiased Estimation of Increments

We now describe, abstractly, how one can obtain a sequence of independent random variables $(\xi_l)_{l\ge0}$ with the properties prescribed in Section 2.3.1. Concrete approaches are detailed in Section 3. In the case of $\xi_0$, one can simply use the unbiased MCMC methodology described in Section 2.3.2. Hence we consider how to unbiasedly estimate $[\pi_l - \pi_{l-1}](\varphi)$, for a fixed $l \in \mathbb{N}$, using MCMC. For $s \in \{l, l-1\}$, suppose we have a $\pi_s$-invariant, ergodic MCMC kernel $K_s : \mathsf{X} \to \mathcal{P}(\mathsf{X})$ and an initial distribution $\nu_s \in \mathcal{P}(\mathsf{X})$. Let $\check{\nu}_{l,l-1}$ be a coupling of the distributions $\check{\nu}_l$ and $\check{\nu}_{l-1}$ defined in (15). We will generate a Markov chain $(Z_{n,l,l-1})_{n\in\mathbb{Z}_+}$ with

$Z_{n,l,l-1} = ((X_{n,l}, W_{n,l}), (X_{n,l-1}, W_{n,l-1})) \in (\mathsf{X} \times \mathsf{X}) \times (\mathsf{X} \times \mathsf{X}) =: \mathsf{Z}$

for each $n \in \mathbb{Z}_+$. The Markov chain is such that, marginally, for each $s \in \{l, l-1\}$, the sequences of random variables $(X_{n,s})_{n\in\mathbb{Z}_+}$ and $(W_{n,s})_{n\in\mathbb{Z}_+}$ are time-homogeneous Markov chains with initial distributions $\nu_s K_s$ and $\nu_s$, respectively, and the same transition kernel $K_s$. The four sequences will be constructed in a dependent manner in order to satisfy (11) or (13). We denote the transition kernel of $(Z_{n,l,l-1})_{n\in\mathbb{Z}_+}$ as $\check{K}_{l,l-1} : \mathsf{Z} \to \mathcal{P}(\mathsf{Z})$. Define the meeting times $\tau_s = \inf\{n \ge 1 : X_{n,s} = W_{n,s}\}$ for $s \in \{l, l-1\}$. It is explicitly assumed that (at the very least) $(Z_{n,l,l-1})_{n\in\mathbb{Z}_+}$ is constructed so that the stopping time $\check{\tau}_{l,l-1} := \tau_l \vee \tau_{l-1}$ is almost surely finite. In addition, the pair of chains on each level should be faithful, i.e. for $s \in \{l, l-1\}$, we have

$X_{n,s} = W_{n,s}$ for all $n \ge \tau_s$.   (19)

Hence for times $n \ge \check{\tau}_{l,l-1}$, $Z_{n,l,l-1}$ only has a distinct state on each level. We will give explicit examples of Markov kernels which satisfy these constraints in Section 3. Note that we do not require the pairs $(Z_{n,l})_{n\in\mathbb{Z}_+} = (X_{n,l}, W_{n,l})_{n\in\mathbb{Z}_+}$ or $(Z_{n,l-1})_{n\in\mathbb{Z}_+} = (X_{n,l-1}, W_{n,l-1})_{n\in\mathbb{Z}_+}$ to be Markov chains with exactly the properties considered in Section 2.3.2.

One can estimate $[\pi_l - \pi_{l-1}](\varphi)$ as follows, for any $k \in \mathbb{Z}_+$:

$\widehat{[\pi_l - \pi_{l-1}](\varphi)}_k := \widehat{\pi_l(\varphi)}_k - \widehat{\pi_{l-1}(\varphi)}_k,$   (20)

where $\widehat{\pi_s(\varphi)}_k$ is computed using (16) based on the pair of chains $(X_{n,s}, W_{n,s})_{n\in\mathbb{Z}_+}$ on level $s \in \{l, l-1\}$. One can also employ time-averaging, for $(m, k) \in \mathbb{Z}_+ \times \mathbb{Z}_+$ satisfying $m \ge k$:

$\widehat{[\pi_l - \pi_{l-1}](\varphi)}_{T,k,m} := \widehat{\pi_l(\varphi)}_{T,k,m} - \widehat{\pi_{l-1}(\varphi)}_{T,k,m},$   (21)

where $\widehat{\pi_s(\varphi)}_{T,k,m}$ is computed using (17) based on the pair of chains $(X_{n,s}, W_{n,s})_{n\in\mathbb{Z}_+}$ on level $s \in \{l, l-1\}$. The steps to compute the estimator (21) are outlined below; a code sketch follows.
1. Sample $Z_{0,l,l-1}$ from $\check{\nu}_{l,l-1}(\cdot)$. Set $n = 1$.
2. Generate $Z_{n,l,l-1} \mid Z_{n-1,l,l-1}$ according to $\check{K}_{l,l-1}(Z_{n-1,l,l-1}, \cdot)$.
3. If $X_{n,l} = W_{n,l}$, $X_{n,l-1} = W_{n,l-1}$ and $n \ge m$, stop; otherwise set $n = n + 1$ and return to step 2.
Assuming the cost of $\check{K}_{l,l-1}$ is two times that of running both $K_l$ and $K_{l-1}$, and $K_l$ costs twice as much as $K_{l-1}$, the cost of the above procedure is at most

$(2\tau_l - 1) \vee (m + \tau_l - 1) + \tfrac{1}{2}\{(2\tau_{l-1} - 1) \vee (m + \tau_{l-1} - 1)\} \le \tfrac{3}{2}\{(2\check{\tau}_{l,l-1} - 1) \vee (m + \check{\tau}_{l,l-1} - 1)\}$   (22)

applications of the kernel $K_l$.
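The following is a minimal Python sketch of this loop, reusing the unbiased_estimate function from the sketch in Section 2.3.2; init_coupled(l, rng), drawing $Z_{0,l,l-1}$ from $\check{\nu}_{l,l-1}$, and coupled_step(Z, l, rng), applying one transition of $\check{K}_{l,l-1}$, are hypothetical samplers, with states flattened as 4-tuples $(X_l, W_l, X_{l-1}, W_{l-1})$ for simplicity.

```python
import numpy as np

def increment_estimate(phi, init_coupled, coupled_step, k, m, l, rng):
    # Record the four trajectories (X_l, W_l, X_{l-1}, W_{l-1}) over time.
    traj = [[z] for z in init_coupled(l, rng)]
    met = lambda i: np.array_equal(traj[i][-1], traj[i + 1][-1])
    n = 0
    while n < m or not (met(0) and met(2)):   # stop when both pairs met, n >= m
        Z = coupled_step(tuple(t[-1] for t in traj), l, rng)
        for t, z in zip(traj, Z):
            t.append(z)
        n += 1
    # Meeting times tau_l and tau_{l-1} of the pair of chains on each level.
    taus = [next(s for s in range(1, n + 1)
                 if np.array_equal(traj[i][s], traj[i + 1][s])) for i in (0, 2)]
    # (21) is the difference of the two time-averaged estimators (17).
    return (unbiased_estimate(phi, traj[0], traj[1], k, m, taus[0])
            - unbiased_estimate(phi, traj[2], traj[3], k, m, taus[1]))
```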
2.3.4 Summary of Proposed Methodology

We now consolidate the above discussion by summarizing our proposed methodology to unbiasedly estimate $\pi(\varphi)$. We begin with the single term estimator $\widehat{\pi(\varphi)}_S$ in (12).
1. Sample $L \sim P_L$.
2. If $L = 0$, generate a Markov chain $(Z_{n,0})_{n\in\mathbb{Z}_+}$ according to $\check{K}_0$ as described in Section 2.3.2 and compute the estimator $\widehat{\pi_0(\varphi)}_k$ in (16) or $\widehat{\pi_0(\varphi)}_{T,k,m}$ in (17).
3. If $L > 0$, generate a Markov chain $(Z_{n,L,L-1})_{n\in\mathbb{Z}_+}$ according to $\check{K}_{L,L-1}$ as described in Section 2.3.3 and compute the estimator $\widehat{[\pi_L - \pi_{L-1}](\varphi)}_k$ in (20) or $\widehat{[\pi_L - \pi_{L-1}](\varphi)}_{T,k,m}$ in (21).
We then return the single term estimator

$\widehat{\pi(\varphi)}_{S,k} := \frac{1}{P_L(L)} \Big( \mathbb{I}_{\{0\}}(L)\, \widehat{\pi_0(\varphi)}_k + \mathbb{I}_{\mathbb{N}}(L)\, \widehat{[\pi_L - \pi_{L-1}](\varphi)}_k \Big)$   (23)

or

$\widehat{\pi(\varphi)}_{S,T,k,m} := \frac{1}{P_L(L)} \Big( \mathbb{I}_{\{0\}}(L)\, \widehat{\pi_0(\varphi)}_{T,k,m} + \mathbb{I}_{\mathbb{N}}(L)\, \widehat{[\pi_L - \pi_{L-1}](\varphi)}_{T,k,m} \Big),$   (24)

depending on whether one chooses the time-averaged estimator or not.

For the independent sum estimator $\widehat{\pi(\varphi)}_I$ in (14), the steps are quite similar.
1. Sample $L \sim P_L$.
2. Generate a Markov chain $(Z_{n,0})_{n\in\mathbb{Z}_+}$ according to $\check{K}_0$ as described in Section 2.3.2 and compute the estimator $\widehat{\pi_0(\varphi)}_k$ in (16) or $\widehat{\pi_0(\varphi)}_{T,k,m}$ in (17).
3. If $L > 0$, for all $l \in \{1, \ldots, L\}$, generate a Markov chain $(Z_{n,l,l-1})_{n\in\mathbb{Z}_+}$ according to $\check{K}_{l,l-1}$ as described in Section 2.3.3 and compute the estimator $\widehat{[\pi_l - \pi_{l-1}](\varphi)}_k$ in (20) or $\widehat{[\pi_l - \pi_{l-1}](\varphi)}_{T,k,m}$ in (21).
We then return the independent sum estimator

$\widehat{\pi(\varphi)}_{I,k} := \widehat{\pi_0(\varphi)}_k + \sum_{l=1}^L \frac{1}{\bar{P}_L(l)}\, \widehat{[\pi_l - \pi_{l-1}](\varphi)}_k$   (25)

or

$\widehat{\pi(\varphi)}_{I,T,k,m} := \widehat{\pi_0(\varphi)}_{T,k,m} + \sum_{l=1}^L \frac{1}{\bar{P}_L(l)}\, \widehat{[\pi_l - \pi_{l-1}](\varphi)}_{T,k,m}.$   (26)

As noted, in practice one can also run the given procedure $N$ times and use an average. For instance, in the context of the single term estimator (23), one would use

$\widehat{\pi(\varphi)}(N) = \frac{1}{N} \sum_{i=1}^N \big(\widehat{\pi(\varphi)}_{S,k}\big)^i$   (27)

to estimate $\pi(\varphi)$, where $(\widehat{\pi(\varphi)}_{S,k})^i$ denotes the $i$-th independent replicate of the estimate.

2.4 Theoretical Results

The main objective of this section is to establish, under assumptions, that (23)-(26) are unbiased and finite variance estimators of $\pi(\varphi)$. We first state the assumptions that we will rely on. In the following, we suppose that the quality of approximation of $\pi$ by $\pi_l$ is controlled by a scalar parameter $\Delta_l = 2^{-l}$, which is consistent with the examples that are to be considered. For a constant $0 < C < +\infty$ and a metric $d$ on $\mathsf{X}$, we define the set

$B(C, \Delta_l, d) := \{(x_1, x_2, x_3, x_4) \in \mathsf{X}^4 : \forall (i,j) \in \{1, \ldots, 4\}^2,\ d(x_i, x_j) \le C \Delta_l\}.$

We will make the following assumptions, with $\mathsf{X}$ compact.

(A1) There exist $(C, \rho) \in (0, \infty) \times (0, 1)$ such that for any $n \in \mathbb{N}$:

$\sup_{l \in \mathbb{Z}_+} \sup_{x \in \mathsf{X}} \|K_l^n(x, \cdot) - \pi_l(\cdot)\|_{\mathrm{tv}} \le C \rho^n.$

(A2) There exist $(C, \rho) \in (0, \infty) \times (0, 1)$ such that for any $(l, n) \in \mathbb{Z}_+ \times \mathbb{N}$:

$\mathbb{E}[\mathbb{I}_{\{\tau_l > n\}}] \le C \rho^n.$

(A3) There exist a $C < \infty$ and a metric $\tilde{d} : \mathsf{X} \times \mathsf{X} \to \mathbb{R}_+$ on $\mathsf{X}$ such that for any $(l, \varphi, (x,w)) \in \mathbb{Z}_+ \times (\mathcal{B}_b(\mathsf{X}) \cap \mathrm{Lip}_{\tilde{d}}(\mathsf{X})) \times (\mathsf{X} \times \mathsf{X})$:

$|K_l(\varphi)(x) - K_l(\varphi)(w)| \le C (\|\varphi\|_\infty \vee \|\varphi\|_{\mathrm{Lip}})\, \tilde{d}(x, w).$

(A4) There exist $(C, \beta_1) \in (0, \infty) \times (0, \infty)$ such that for any $(l, \varphi) \in \mathbb{N} \times \mathcal{B}_b(\mathsf{X})$:
1. $|[\pi_l - \pi](\varphi)| \le C \|\varphi\|_\infty \Delta_l^{\beta_1}$;
2. $\sup_{x \in \mathsf{X}} |K_l(\varphi)(x) - K_{l-1}(\varphi)(x)| \le C \|\varphi\|_\infty \Delta_l^{\beta_1}.$

(A5) There exist $(C, \beta_2, \epsilon) \in (0, \infty)^3$ such that for the metric $\tilde{d}$ in (A3) and any $(l, n) \in \mathbb{N} \times \mathbb{N}$:

$\mathbb{E}[\mathbb{I}_{B(C, \Delta_l^{\beta_2}, \tilde{d})^c}(Z_{0,l,l-1})] \le C \Delta_l^{\beta_2(2+\epsilon)},$

$\mathbb{E}[\mathbb{I}_{B(C, \Delta_l^{\beta_2}, \tilde{d})^c \times B(C, \Delta_l^{\beta_2}, \tilde{d})}(Z_{n,l,l-1}, Z_{n-1,l,l-1})] \le C \Delta_l^{\beta_2(2+\epsilon)}.$

2.4.1 Main Results

Our main result focusses upon (23), as the proofs of the other results are more-or-less a direct corollary of the first result. The proofs of all results are in Appendix A.
Theorem 2.1.
Assume (A1-5). Then there exists a choice of positive probability mass function $P_L$ such that for the metric $\tilde{d}$ in (A3) and any $\varphi \in \mathcal{B}_b(\mathsf{X}) \cap \mathrm{Lip}_{\tilde{d}}(\mathsf{X})$, (23) is an unbiased and finite variance estimator of $\pi(\varphi)$.

Remark 2.1.
It is straightforward to establish that one can find a $P_L$ so that (24) is also unbiased with finite variance. The proof can be constructed via the technical results in the appendix.

The following result can be deduced by observing (13) and using the technical results in the appendix.
Corollary 2.1.
Assume (A1-5). Then there exists a choice of positive probability mass function $P_L$ such that for the metric $\tilde{d}$ in (A3) and any $\varphi \in \mathcal{B}_b(\mathsf{X}) \cap \mathrm{Lip}_{\tilde{d}}(\mathsf{X})$, (25) is an unbiased and finite variance estimator of $\pi(\varphi)$.

Remark 2.2. As for (24), one can find a $P_L$ so that (26) is also unbiased with finite variance.

The main strategy of the proof is to establish a martingale plus remainder type decomposition for $\widehat{\pi_0(\varphi)}_k$ and $\widehat{[\pi_l - \pi_{l-1}](\varphi)}_k$. Given this, one can rely on the optional sampling theorem to establish that the former quantities are unbiased estimates of $\pi_0(\varphi)$ and $[\pi_l - \pi_{l-1}](\varphi)$. One is then left to control the second moments of the decomposition, which can be achieved in a variety of ways; we rely on martingale methods.

2.4.2 Discussion of the Assumptions

Assumption (A1) is a strong assumption, although reasonable as we only consider compact $\mathsf{X}$. It is verified for an example in [19]. Assumption (A2) has been considered in [13] and is shown to hold in a related context; see the results in [13, Lemmata 14 & 22]. Assumption (A3) has been verified for an example in [19]. Assumption (A4).1 relates to the rate at which the bias converges and can hold for inverse problems (see [3]). Assumption (A4).2 can be achieved by considering an appropriate coupling of $(K_l(x, \cdot), K_{l-1}(x, \cdot))$ and is problem specific; see [19] for an example where this is verified.

Assumption (A5) appears to be quite non-standard. We first remark that assumptions of this type (not identical) can be verified in complex settings [13]: (A5) is shown in [13, Lemma 16]. Secondly, if one can establish that the pairs $(X_{n,l}, X_{n,l-1})_{n\in\mathbb{Z}_+}$ and $(W_{n,l}, W_{n,l-1})_{n\in\mathbb{Z}_+}$ are (marginally) uniformly ergodic Markov chains with an invariant measure $\check{\pi}_{l,l-1}$, then it is sufficient to assume that

$\check{\pi}_{l,l-1}\big((\varphi \otimes 1 - 1 \otimes \varphi)^2\big) \le C \Delta_l^{2\beta_2}.$

This is because the proof that is used turns these one-step type properties of the coupled Markov chains in (A5) into similar properties of the Markov chain at any time step (see Lemma A.3). However, these properties are inherited directly from the invariant measure $\check{\pi}_{l,l-1}$, if such a quantity exists and the chain converges sufficiently fast to it.

2.4.3 Choice of P_L

The discussion below relates to the single term estimator $\widehat{\pi(\varphi)}_{S,k}$ in (23) with increments estimated using $\xi_l = \widehat{[\pi_l - \pi_{l-1}](\varphi)}_k$ defined in (20), and is easily extended to the time-averaged estimator (21). The case of the independent sum estimators (25) and (26) follows along the same lines and is thus omitted. The variance and expected cost of $\widehat{\pi(\varphi)}_{S,k}$ can be bounded as follows:

$\mathrm{Var}\big[\widehat{\pi(\varphi)}_{S,k}\big] \le \sum_{l \in \mathbb{Z}_+} \frac{\mathbb{E}[\xi_l^2]}{P_L(l)},$   (28)

$\mathrm{Cost}\big[\widehat{\pi(\varphi)}_{S,k}\big] \le C \sum_{l \in \mathbb{Z}_+} \mathbb{E}[\check{\tau}_{l,l-1}]\, \mathrm{Cost}(K_l)\, P_L(l),$   (29)

where $C > 0$ is some constant (see e.g. (22)), $\mathrm{Cost}(K_l)$ denotes the cost of an application of the marginal kernel $K_l$, and $\check{\tau}_{l,l-1} = \tau_l \vee \tau_{l-1}$ is the stopping time of the Markov chain $(Z_{n,l,l-1})_{n\in\mathbb{Z}_+}$, with $\check{\tau}_{0,-1} := \tau_0$. Averaging $N$ single term estimators as in (27) would yield a variance of $\mathrm{Var}[\widehat{\pi(\varphi)}_{S,k}]\, N^{-1}$ and an expected cost of $\mathrm{Cost}[\widehat{\pi(\varphi)}_{S,k}]\, N$.

For the second moment of $\xi_l$ featuring in (28), Lemma A.5 provides the bound

$\mathbb{E}[\xi_l^2] \le C \Delta_l^{2\beta},$   (30)

with $\beta = \beta_1 \wedge \beta_2 > 0$, where $\beta_1$ and $\beta_2$ are given in Assumptions (A4) and (A5), respectively, and $\tilde{d}$ in Assumption (A3) is given by the standard Euclidean distance.
Using Assumption (A2), one can upper-bound the expected stopping time $\mathbb{E}[\check{\tau}_{l,l-1}]$ appearing in (29) by a constant that is independent of $l$. We assume furthermore that there exist $\omega < 2\beta$ and $C_K > 0$ such that the cost of $K_l$ satisfies

$\mathrm{Cost}(K_l) \le C_K \Delta_l^{-\omega}.$   (31)

Suppose the probability mass function on $\mathbb{Z}_+$ is of the form $P_L(l) \propto \Delta_l^\eta$. Following (30) and (31), there are constants $C_V, C_C > 0$ such that the right-hand sides of (28) and (29) are bounded above by $C_V \sum_{l\in\mathbb{Z}_+} \Delta_l^{2\beta - \eta}$ and $C_C \sum_{l\in\mathbb{Z}_+} \Delta_l^{\eta - \omega}$, respectively. Both quantities are finite for any $\eta \in (\omega, 2\beta)$; e.g., one can let $\eta = (2\beta + \omega)/2$. The above discussion requires $\omega < 2\beta$; see e.g. [28, 20, 9] for the case $\omega \ge 2\beta$.

We now consider the above discussion in the particular context of the motivating example in Section 2.2. In this setting, Proposition 2.1 provides $\beta_1 = 2$ in Assumption (A4).1. As discussed in Section 2.4.2, the rates for Assumptions (A4).2 and (A5) depend upon the particular kernels used and are more difficult to establish theoretically, but the value of $\beta$ in (30) can be estimated numerically. Evaluation of the density (6) at level $l$ requires the solution of the tridiagonal linear system (5), which has $O(\Delta_l^{-1})$ degrees of freedom. Therefore (31) holds with $\omega = 1$, and we choose $\eta = 5/2$.

3 Coupled MCMC Kernels

We consider various strategies to construct the kernels $\check{K}_{l,l-1}$, $l \in \mathbb{N}$, described in Section 2.3.3. We begin with Metropolis–Hastings algorithms in Sections 3.1 and 3.2, and consider the case of Hamiltonian Monte Carlo methods in Section 3.3.

3.1 Coupled Metropolis–Hastings Kernels

We consider a collection of coupled MH kernels $\check{K}_{l,l-1}$ that are defined by the following simulation procedure (a code sketch of the transition is given after the requirements below).
1. Given the current state $z_{l,l-1} = ((x_l, w_l), (x_{l-1}, w_{l-1})) \in \mathsf{Z}$, generate a proposal $Z_{l,l-1}' = ((X_l', W_l'), (X_{l-1}', W_{l-1}'))$ according to $\check{Q}_{l,l-1}(z_{l,l-1}, \cdot)$.
2. Generate $U \sim \mathcal{U}_{[0,1]}$ and for level $s \in \{l, l-1\}$:
   • If $U < \alpha_s(x_s, X_s')$, set $X_s^\star = X_s'$; otherwise set $X_s^\star = x_s$.
   • If $U < \alpha_s(w_s, W_s')$, set $W_s^\star = W_s'$; otherwise set $W_s^\star = w_s$.
3. Return $Z_{l,l-1}^\star = ((X_l^\star, W_l^\star), (X_{l-1}^\star, W_{l-1}^\star))$ as a sample according to $\check{K}_{l,l-1}(z_{l,l-1}, \cdot)$.
The notation $\check{Q}_{l,l-1} : \mathsf{Z} \to \mathcal{P}(\mathsf{Z})$ refers to a coupling of the proposal kernels $Q_l(x_l, dx_l')$, $Q_l(w_l, dw_l')$, $Q_{l-1}(x_{l-1}, dx_{l-1}')$ and $Q_{l-1}(w_{l-1}, dw_{l-1}')$, in the sense that generating $Z_{l,l-1}' = ((X_l', W_l'), (X_{l-1}', W_{l-1}'))$ according to $\check{Q}_{l,l-1}(z_{l,l-1}, \cdot)$ is marginally equivalent to

$X_l' \sim Q_l(x_l, \cdot), \quad W_l' \sim Q_l(w_l, \cdot), \quad X_{l-1}' \sim Q_{l-1}(x_{l-1}, \cdot), \quad W_{l-1}' \sim Q_{l-1}(w_{l-1}, \cdot).$   (32)

We also require the coupled proposal kernel $\check{Q}_{l,l-1}$ to satisfy

$\check{Q}_{l,l-1}(z_{l,l-1}, D \times D) > 0,$   (33)

where $D := \{(x, w) \in \mathsf{X} \times \mathsf{X} : x = w\}$ denotes the diagonal set, and the following faithfulness property:

$x_s = w_s \implies X_s' = W_s'$ almost surely   (34)

for each level $s \in \{l, l-1\}$. The condition (33) requires the coupling mechanism to generate identical proposals on each level with positive probability.
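A minimal Python sketch of one transition of $\check{K}_{l,l-1}$ follows, with states flattened as 4-tuples and assuming symmetric proposals so that the proposal-density ratio in $\alpha_s$ cancels; coupled_proposal (a sampler for $\check{Q}_{l,l-1}$, e.g. one of the couplings below) and log_pi(s, x), evaluating $\log \pi_s(x)$ up to a constant, are assumptions.

```python
import numpy as np

def coupled_mh_step(Z, l, coupled_proposal, log_pi, rng):
    xl, wl, xlm1, wlm1 = Z
    Xp_l, Wp_l, Xp_lm1, Wp_lm1 = coupled_proposal(Z, l, rng)
    log_u = np.log(rng.uniform())     # one common uniform for all four chains
    accept = lambda s, x, xp: log_u < log_pi(s, xp) - log_pi(s, x)
    return (Xp_l if accept(l, xl, Xp_l) else xl,
            Wp_l if accept(l, wl, Wp_l) else wl,
            Xp_lm1 if accept(l - 1, xlm1, Xp_lm1) else xlm1,
            Wp_lm1 if accept(l - 1, wlm1, Wp_lm1) else wlm1)
```

The common uniform preserves faithfulness: if $x_s = w_s$ and the coupled proposal returns $X_s' = W_s'$, the two chains on level $s$ accept or reject together.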
Under (34) and the use of a common uniform random variable in Step 2, the pair of MH chains on each level is faithful, as required in (19).

3.2 Couplings of Proposal Kernels

In the following, we will consider $\mathsf{X} = \mathbb{R}^d$ and describe concrete examples of $\check{Q}_{l,l-1}$ for proposal kernels associated to random walk Metropolis–Hastings (RWMH),

$Q_s(x, dx') = \phi_d(x'; x, \Sigma_s)\, dx', \quad (x, s) \in \mathsf{X} \times \mathbb{Z}_+,$   (35)

and pre-conditioned Crank–Nicolson (pCN) [7, 27],

$Q_s(x, dx') = \phi_d(x'; \rho_s x, (1 - \rho_s^2) \Sigma_s)\, dx', \quad (x, s) \in \mathsf{X} \times \mathbb{Z}_+,$   (36)

where $\Sigma_s = \sigma_s \sigma_s^T$, $\sigma_s$ is an invertible $d \times d$ matrix and $\rho_s \in (-1, 1)$.

3.2.1 Independent Couplings

A naive approach is to employ

$\check{Q}_{l,l-1}^{(I)}(z_{l,l-1}, dz_{l,l-1}') := \check{Q}_l(z_l, dz_l') \otimes \check{Q}_{l-1}(z_{l-1}, dz_{l-1}'),$

which independently samples from the maximal coupling of the proposal kernels on each level given in (18). Although it does satisfy the requirements (32), (33) and (34), this choice is unlikely to ensure that the conditions in (11) or (13) hold, as the proposals on levels $l$ and $l-1$ are sampled independently.

3.2.2 Maximal Couplings

We now introduce an extension of the maximal coupling in (18) to the case of four marginals. For $z_{l,l-1} = ((x_l, w_l), (x_{l-1}, w_{l-1})) \in \mathsf{Z}$, define the overlapping kernel on $\mathsf{X}$ as

$O_{l,l-1}(z_{l,l-1}, du) := Q_l(x_l, du) \wedge Q_l(w_l, du) \wedge Q_{l-1}(x_{l-1}, du) \wedge Q_{l-1}(w_{l-1}, du),$

and the size of the overlap $S_{l,l-1}(z_{l,l-1}) := \int_{\mathsf{X}} O_{l,l-1}(z_{l,l-1}, du)$. For each level $s \in \{l, l-1\}$, we define the two residual probability measures on $\mathsf{X}$:

$r_s(z_{l,l-1}, du) := \frac{Q_s(x_s, du) - O_{l,l-1}(z_{l,l-1}, du)}{1 - S_{l,l-1}(z_{l,l-1})}, \qquad \tilde{r}_s(z_{l,l-1}, du) := \frac{Q_s(w_s, du) - O_{l,l-1}(z_{l,l-1}, du)}{1 - S_{l,l-1}(z_{l,l-1})}.$

To accommodate the event when the pair of chains on a level has met, i.e. $(x_s, w_s) \in D$, we define

$R_s(z_{l,l-1}, d(x_s', w_s')) := r_s(z_{l,l-1}, dx_s') \Big( \mathbb{I}_D(x_s, w_s)\, \delta_{\{x_s'\}}(dw_s') + \mathbb{I}_{D^c}(x_s, w_s)\, \tilde{r}_s(z_{l,l-1}, dw_s') \Big).$

Writing

$\check{R}_{l,l-1}(z_{l,l-1}, dz_{l,l-1}') := R_l(z_{l,l-1}, d(x_l', w_l'))\, R_{l-1}(z_{l,l-1}, d(x_{l-1}', w_{l-1}')),$   (37)

we may then define our coupled proposal kernel $\check{Q}_{l,l-1}^{(M)} : \mathsf{Z} \to \mathcal{P}(\mathsf{Z})$ as

$\check{Q}_{l,l-1}^{(M)}(z_{l,l-1}, dz_{l,l-1}') := S_{l,l-1}(z_{l,l-1}) \int_{\mathsf{X}} \frac{O_{l,l-1}(z_{l,l-1}, du)}{S_{l,l-1}(z_{l,l-1})}\, \delta_{(u,u,u,u)}(dz_{l,l-1}') + (1 - S_{l,l-1}(z_{l,l-1}))\, \check{R}_{l,l-1}(z_{l,l-1}, dz_{l,l-1}').$   (38)

One can check that this satisfies the requirements in (32), (33) and (34). Moreover, $\check{Q}_{l,l-1}^{(M)}$ is a maximal coupling, as it achieves the maximum probability of having identical proposals $X_l' = W_l' = X_{l-1}' = W_{l-1}'$, which is given by the size of the overlap $S_{l,l-1}(z_{l,l-1})$.

Algorithm 1 provides a method to sample from (38), assuming that we can sample from the proposal transition kernels and evaluate their transition densities. These assumptions clearly hold for the Gaussian proposal kernels in (35) and (36).
Algorithm 1: A maximal coupling of four transition kernels.

Input: transition kernels $Q_s(x_s, dx_s')$ and $Q_s(w_s, dw_s')$ for level $s \in \{l, l-1\}$.
1. Sample $U \sim Q_l(x_l, \cdot)$. With probability

$\min\Big\{1, \frac{q_l(w_l, U)}{q_l(x_l, U)}, \frac{q_{l-1}(x_{l-1}, U)}{q_l(x_l, U)}, \frac{q_{l-1}(w_{l-1}, U)}{q_l(x_l, U)}\Big\},$

output $((X_l', W_l'), (X_{l-1}', W_{l-1}')) = ((U, U), (U, U))$.
2. Otherwise:
(a) Set $X_l' = U$. If $x_l = w_l$, set $W_l' = U$.
(b) If $x_l \ne w_l$, propose $U_l \sim Q_l(w_l, \cdot)$ and with probability

$1 - \min\Big\{1, \frac{q_l(x_l, U_l)}{q_l(w_l, U_l)}, \frac{q_{l-1}(x_{l-1}, U_l)}{q_l(w_l, U_l)}, \frac{q_{l-1}(w_{l-1}, U_l)}{q_l(w_l, U_l)}\Big\},$

set $W_l' = U_l$; otherwise repeat until acceptance.
(c) Propose $U_{l-1} \sim Q_{l-1}(x_{l-1}, \cdot)$ and with probability

$1 - \min\Big\{1, \frac{q_l(x_l, U_{l-1})}{q_{l-1}(x_{l-1}, U_{l-1})}, \frac{q_l(w_l, U_{l-1})}{q_{l-1}(x_{l-1}, U_{l-1})}, \frac{q_{l-1}(w_{l-1}, U_{l-1})}{q_{l-1}(x_{l-1}, U_{l-1})}\Big\},$

set $X_{l-1}' = U_{l-1}$; otherwise repeat until acceptance. If $x_{l-1} = w_{l-1}$, set $W_{l-1}' = X_{l-1}'$.
(d) If $x_{l-1} \ne w_{l-1}$, propose $U_{l-1} \sim Q_{l-1}(w_{l-1}, \cdot)$ and with probability

$1 - \min\Big\{1, \frac{q_l(x_l, U_{l-1})}{q_{l-1}(w_{l-1}, U_{l-1})}, \frac{q_l(w_l, U_{l-1})}{q_{l-1}(w_{l-1}, U_{l-1})}, \frac{q_{l-1}(x_{l-1}, U_{l-1})}{q_{l-1}(w_{l-1}, U_{l-1})}\Big\},$

set $W_{l-1}' = U_{l-1}$; otherwise repeat until acceptance.
Output: Sample $Z_{l,l-1}' = ((X_l', W_l'), (X_{l-1}', W_{l-1}'))$ from $\check{Q}_{l,l-1}^{(M)}(z_{l,l-1}, \cdot)$.

Step 1 can be seen as an attempt to sample $X_l' = W_l' = X_{l-1}' = W_{l-1}'$ from the overlap $O_{l,l-1}(z_{l,l-1}, du)/S_{l,l-1}(z_{l,l-1})$, and if this fails, Step 2 corresponds to a rejection sampler for the residuals (37). The four cases considered in Step 2 are needed to ensure the faithfulness property in (34). The coupling $\check{Q}_{l,l-1}^{(M)}$ in (38) can be readily employed with RWMH and pCN proposals, and more general proposal transitions outside the Gaussian family.
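For concreteness, a minimal Python sketch of Algorithm 1 is given below; sample_q(s, x) and log_q(s, x, u), which draw from $Q_s(x, \cdot)$ and evaluate $\log q_s(x, u)$ respectively, are assumed helpers (available in closed form for (35) and (36)).

```python
import numpy as np

def maximal_coupling_four(Z, l, sample_q, log_q, rng):
    xl, wl, xlm1, wlm1 = Z
    marg = [(l, xl), (l, wl), (l - 1, xlm1), (l - 1, wlm1)]
    lq = lambda i, u: log_q(marg[i][0], marg[i][1], u)

    def residual(i):
        # Rejection sampler for the residual of marginal i, as in Step 2.
        while True:
            V = sample_q(*marg[i])
            others = min(lq(j, V) for j in range(4) if j != i)
            if np.log(rng.uniform()) > others - lq(i, V):
                return V

    U = sample_q(l, xl)               # Step 1: attempt the four-way overlap
    if np.log(rng.uniform()) <= min(lq(i, U) for i in (1, 2, 3)) - lq(0, U):
        return U, U, U, U
    Xp_l = U                          # Step 2: residuals, preserving faithfulness
    Wp_l = U if np.array_equal(xl, wl) else residual(1)
    Xp_lm1 = residual(2)
    Wp_lm1 = Xp_lm1 if np.array_equal(xlm1, wlm1) else residual(3)
    return Xp_l, Wp_l, Xp_lm1, Wp_lm1
```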
Next we present an alternative coupling that is specific to the Gaussian case.

3.2.3 Reflection Maximal Couplings

We consider the case $\mathsf{X} = \mathbb{R}^d$ and proposal kernels of the form

$Q_s(x, dx') = \phi_d(x'; \mu_s(x), \Sigma_s)\, dx', \quad (x, s) \in \mathsf{X} \times \mathbb{Z}_+,$   (39)

where $\mu_s : \mathsf{X} \to \mathsf{X}$, $\Sigma_s = \sigma_s \sigma_s^T$ and $\sigma_s$ is an invertible $d \times d$ matrix. The following will exploit the fact that a sample $X'$ from $Q_s(x, \cdot)$ can be represented as $X' = \mu_s(x) + \sigma_s v_s$ with $v_s \sim \mathcal{N}_d(0_d, I_d)$. As noted in [14], the case of $v_s$ following a spherically symmetric distribution can also be accommodated.

Algorithm 2: Synchronous pairwise reflection maximal couplings.

Input: transition kernels $Q_s(x_s, dx_s') = \phi_d(x_s'; \mu_s(x_s), \Sigma_s)\, dx_s'$ and $Q_s(w_s, dw_s') = \phi_d(w_s'; \mu_s(w_s), \Sigma_s)\, dw_s'$ for level $s \in \{l, l-1\}$.
Sample $v \sim \mathcal{N}_d(0_d, I_d)$ and for level $s \in \{l, l-1\}$:
1. Set $v_s = v$ and $X_s' = \mu_s(x_s) + \sigma_s v_s$.
2. Set $u_s = \sigma_s^{-1}(\mu_s(x_s) - \mu_s(w_s))$ and $e_s = u_s / \|u_s\|$.
3. With probability $\min\{1, \phi_d(v_s + u_s; 0_d, I_d) / \phi_d(v_s; 0_d, I_d)\}$, set $\tilde{v}_s = v_s + u_s$; otherwise set $\tilde{v}_s = v_s - 2(v_s^T e_s) e_s$. Set $W_s' = \mu_s(w_s) + \sigma_s \tilde{v}_s$.
Output: Sample $Z_{l,l-1}' = ((X_l', W_l'), (X_{l-1}', W_{l-1}'))$ from $\check{Q}_{l,l-1}^{(R)}(z_{l,l-1}, \cdot)$.

The coupled proposal kernel $\check{Q}_{l,l-1}^{(R)} : \mathsf{Z} \to \mathcal{P}(\mathsf{Z})$ that we construct here is based on a synchronous coupling of the reflection maximal coupling in [4] for the pair of proposals on each level. Algorithm 2 details how to obtain a sample $Z_{l,l-1}' = ((X_l', W_l'), (X_{l-1}', W_{l-1}'))$ from $\check{Q}_{l,l-1}^{(R)}(z_{l,l-1}, \cdot)$, which will satisfy the requirements in (32), (33) and (34). The synchronous use of $v_l = v_{l-1} = v$ in Step 1 induces a coupling between the proposals across levels. In Step 3, for each level $s \in \{l, l-1\}$, note that the event $\tilde{v}_s = v_s + u_s$ yields identical proposals $X_s' = W_s'$, and this occurs with maximal probability given by the size of the overlap $S_s(x_s, w_s)$. When identical proposals are not possible, we take $\tilde{v}_s$ as the reflection of $v_s$ with respect to the hyperplane orthogonal to $e_s$, lying right between $\sigma_s^{-1} x_s$ and $\sigma_s^{-1} w_s$. Under this reflection coupling [22], one can show that $X_s' - W_s' = \varrho(v_s)(\mu_s(x_s) - \mu_s(w_s))$ with $\varrho(v_s) = 1 + 2(v_s^T e_s)/\|u_s\|$. Since $\varrho(v_s) \sim \mathcal{N}(1, 4\|u_s\|^{-2})$, one has contraction of the proposals on each level with probability of almost $1/2$ when $\|u_s\|$ is large. If the proposals on each level are both accepted, contraction is desirable as it leads to states $X_s'$ and $W_s'$ that are closer. This in turn yields a higher probability of generating identical proposals in the next application of $\check{Q}_{l,l-1}^{(R)}$. As (39) clearly includes (35) and (36) as special cases, the coupling $\check{Q}_{l,l-1}^{(R)}$ is applicable to both RWMH and pCN proposals.
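A minimal Python sketch of Algorithm 2 follows, under the simplifying assumption of scalar covariances $\Sigma_s = \sigma_s^2 I_d$; mu(s, x), the proposal mean on level $s$ (e.g. $\mu_s(x) = x$ for RWMH or $\mu_s(x) = \rho_s x$ for pCN), and sigma(s), the scale on level $s$, are assumed helpers.

```python
import numpy as np

def reflection_coupling_four(Z, l, mu, sigma, rng):
    xl, wl, xlm1, wlm1 = Z
    v = rng.standard_normal(np.size(xl))     # common noise couples the two levels
    out = []
    for s, x, w in ((l, xl, wl), (l - 1, xlm1, wlm1)):
        Xp = mu(s, x) + sigma(s) * v
        u = (mu(s, x) - mu(s, w)) / sigma(s)
        # Accept with probability min{1, phi_d(v + u; 0, I) / phi_d(v; 0, I)}.
        if np.log(rng.uniform()) <= -0.5 * (np.sum((v + u) ** 2) - np.sum(v ** 2)):
            vt = v + u                       # proposals on level s meet: W' = X'
        else:
            e = u / np.linalg.norm(u)        # reflect v about the hyperplane
            vt = v - 2.0 * np.dot(v, e) * e  # orthogonal to e_s
        out += [Xp, mu(s, w) + sigma(s) * vt]
    return out[0], out[1], out[2], out[3]    # (X'_l, W'_l, X'_{l-1}, W'_{l-1})
```

Note that when $x_s = w_s$ one has $u_s = 0$ and the first branch is taken with probability one, so faithfulness (34) holds.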
In the next section, we consider another construction for pCN that always induces contractive proposals.

3.2.4 Synchronous Couplings

We consider a coupled proposal kernel $\check{Q}_{l,l-1}^{(S)} : \mathsf{Z} \to \mathcal{P}(\mathsf{Z})$ for pCN (36). To obtain a sample $Z_{l,l-1}' = ((X_l', W_l'), (X_{l-1}', W_{l-1}'))$ from $\check{Q}_{l,l-1}^{(S)}(z_{l,l-1}, \cdot)$, we simulate $v \sim \mathcal{N}_d(0_d, I_d)$ and take

$X_s' = \rho_s x_s + \sqrt{1 - \rho_s^2}\, \sigma_s v, \qquad W_s' = \rho_s w_s + \sqrt{1 - \rho_s^2}\, \sigma_s v,$

for both levels $s \in \{l, l-1\}$. The synchronous use of $v$ guarantees contraction of the proposals on each level, i.e. $\|X_s' - W_s'\| = \rho_s \|x_s - w_s\|$ for $\rho_s \in (-1, 1)$, and induces dependencies between the pairs of proposals across levels. The latter is crucial in our context, as we want (11) or (13) to hold. Although this synchronous coupling satisfies the requirements in (32) and (34), it is not possible to have identical proposals as needed in (33). A solution is to consider a mixture of $\check{Q}_{l,l-1}^{(S)}$ and either $\check{Q}_{l,l-1}^{(M)}$ or $\check{Q}_{l,l-1}^{(R)}$, which allow identical proposals to occur. More precisely, we take the coupled proposal kernel in Section 3.1 as the mixture kernel

$\check{Q}_{l,l-1}(z_{l,l-1}, dz_{l,l-1}') = \kappa\, \check{Q}_{l,l-1}^{(S)}(z_{l,l-1}, dz_{l,l-1}') + (1 - \kappa)\, \check{Q}_{l,l-1}^{(p)}(z_{l,l-1}, dz_{l,l-1}'), \quad p \in \{M, R\},$   (40)

which will satisfy (32), (33) and (34) for any choice of $\kappa \in (0, 1)$.

3.3 Hamiltonian Monte Carlo

We restrict ourselves to the case $\mathsf{X} = \mathbb{R}^d$. Hamiltonian Monte Carlo (HMC) [8] considers the following auxiliary target distribution on $(\mathbb{R}^{2d}, \mathcal{B}(\mathbb{R}^{2d}))$ for each $l \in \mathbb{Z}_+$:

$\bar{\pi}_l(d(x, v)) := \pi_l(x)\, \phi_d(v; 0_d, I_d)\, d(x, v),$   (41)

where $d(x, v)$ denotes the Lebesgue measure on $\mathbb{R}^{2d}$. We will assume that the target density $x \mapsto \pi_l(x)$ has a well-defined gradient, and write the Hamiltonian corresponding to (41) as $H_l(x, v) = -\log \pi_l(x) + \tfrac{1}{2}\|v\|^2$. Given a current position $x \in \mathbb{R}^d$, HMC samples an initial velocity $v \sim \mathcal{N}_d(0_d, I_d)$ and generates a proposal by discretizing the Hamiltonian dynamics associated to $H_l$ using a leapfrog integrator. Given a stepsize $\varepsilon_l > 0$ and a number of steps $M_l \in \mathbb{N}$, this numerical scheme initializes at $(x_0, v_0) = (x, v) \in \mathbb{R}^{2d}$ and iterates, for $m \in \{0, \ldots, M_l - 1\}$:

$v_{m+1/2} = v_m + \tfrac{\varepsilon_l}{2} \nabla \log \pi_l(x_m), \quad x_{m+1} = x_m + \varepsilon_l v_{m+1/2}, \quad v_{m+1} = v_{m+1/2} + \tfrac{\varepsilon_l}{2} \nabla \log \pi_l(x_{m+1}).$   (42)

As the Hamiltonian is not exactly conserved under the leapfrog integrator, the proposal $(x', v') = (x_{M_l}, v_{M_l})$ is then subjected to a Metropolis–Hastings accept-reject step, i.e. for $U \sim \mathcal{U}_{[0,1]}$, if

$U \le \bar{\alpha}_l\{(x, v), (x', v')\} := 1 \wedge \exp\{H_l(x, v) - H_l(x', v')\},$   (43)

we output $(X^\star, V^\star) = (x', v')$; otherwise we output $(X^\star, V^\star) = (x, v)$. The resulting transition $X^\star \sim P_l(x, \cdot)$ on the position coordinate defines an HMC kernel $P_l$ at level $l$.

Following [12], one can construct a faithful coupling of $P_l(x, \cdot)$ and $P_l(w, \cdot)$ for $(x, w) \in \mathbb{R}^d \times \mathbb{R}^d$ by employing a common velocity $v$ to initialize (42) and a common uniform random variable $U$ in the accept-reject step (43). If the initial positions $x$ and $w$ are in a region where $\pi_l$ is log-concave, and the integration time $\varepsilon_l M_l$ is appropriately chosen, this can lead to contractive proposals which are then accepted with high probability for small $\varepsilon_l$. The resulting coupled HMC kernel $\check{P}_l((x, w), \cdot)$ cannot be employed within the framework of Section 2.3.2, as it does not allow the chains to meet. To circumvent this issue, [12] considered a mixture kernel

$\check{K}_l((x, w), d(x', w')) = \kappa\, \check{P}_l((x, w), d(x', w')) + (1 - \kappa)\, \check{K}_l^{(M)}((x, w), d(x', w')),$   (44)

where $\kappa \in (0, 1)$ and $\check{K}_l^{(M)}$ denotes a coupled RWMH kernel based on the maximal coupling in (18). Although the latter enables meetings, the marginal kernel induced by (44) is not the HMC kernel $P_l$, but remains close if $\kappa$ is close to one.
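As an illustration, a minimal Python sketch of the leapfrog scheme (42) is given below; grad_log_pi(l, x), returning $\nabla \log \pi_l(x)$, is an assumed helper. Coupled HMC proposals are obtained by calling it on each level and each chain with the same initial velocity v.

```python
import numpy as np

def leapfrog(x, v, l, eps, M, grad_log_pi):
    x, v = np.array(x, dtype=float), np.array(v, dtype=float)
    for _ in range(M):
        v += 0.5 * eps * grad_log_pi(l, x)   # half-step for the velocity
        x += eps * v                         # full step for the position
        v += 0.5 * eps * grad_log_pi(l, x)   # second half-step for the velocity
    return x, v
```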
We now extend the work of [12] to our context. For $z_{l,l-1} = ((x_l, w_l), (x_{l-1}, w_{l-1})) \in \mathsf{Z}$, one can also construct a faithful coupling of $P_l(x_l, \cdot)$, $P_l(w_l, \cdot)$, $P_{l-1}(x_{l-1}, \cdot)$ and $P_{l-1}(w_{l-1}, \cdot)$ by using a common initial velocity in the leapfrog integrators and a common uniform random variable for all accept-reject steps. Let $\check{P}_{l,l-1}(z_{l,l-1}, \cdot)$ denote the resulting coupled HMC kernel on levels $l$ and $l-1$. In addition to the above-mentioned behaviour for the pairs on each level, this coupling also induces dependencies between the pairs across levels, which are crucial for the conditions in (11) or (13) to hold. Analogous to (44), we take

$\check{K}_{l,l-1}(z_{l,l-1}, dz_{l,l-1}') = \kappa\, \check{P}_{l,l-1}(z_{l,l-1}, dz_{l,l-1}') + (1 - \kappa)\, \check{K}_{l,l-1}^{(M)}(z_{l,l-1}, dz_{l,l-1}'),$   (45)

where $\kappa \in (0, 1)$ and $\check{K}_{l,l-1}^{(M)}$ denotes a coupled RWMH kernel based on the maximal couplings in Section 3.2.2 or Section 3.2.3. We note that the mixture kernel (45) satisfies the properties stated in Section 2.3.3. An algorithmic description of how to sample from it is provided in Algorithm 3.
Algorithm 3: Mixture of coupled HMC and coupled RWMH.

Input: current state $z_{l,l-1} = ((x_l, w_l), (x_{l-1}, w_{l-1})) \in \mathsf{Z}$, leapfrog stepsize $\varepsilon_l > 0$, number of leapfrog steps $M_l \in \mathbb{N}$, mixing probability $\kappa \in (0, 1)$ and coupled RWMH kernel $\check{K}_{l,l-1}^{(M)}$.
With probability $\kappa$:
1. Sample an initial velocity $v \sim \mathcal{N}_d(0_d, I_d)$ and $U \sim \mathcal{U}_{[0,1]}$.
2. For $s \in \{l, l-1\}$, run the leapfrog integrator (42) on level $s$ with initial condition $(x_s, v)$ to obtain the proposal $(x_s', v_s')$.
3. For $s \in \{l, l-1\}$, run the leapfrog integrator (42) on level $s$ with initial condition $(w_s, v)$ to obtain the proposal $(w_s', \tilde{v}_s')$.
4. For $s \in \{l, l-1\}$, if $U \le \bar{\alpha}_s\{(x_s, v), (x_s', v_s')\}$, set $X_s^\star = x_s'$; otherwise set $X_s^\star = x_s$.
5. For $s \in \{l, l-1\}$, if $U \le \bar{\alpha}_s\{(w_s, v), (w_s', \tilde{v}_s')\}$, set $W_s^\star = w_s'$; otherwise set $W_s^\star = w_s$.
Otherwise, generate $Z_{l,l-1}^\star = ((X_l^\star, W_l^\star), (X_{l-1}^\star, W_{l-1}^\star))$ according to $\check{K}_{l,l-1}^{(M)}(z_{l,l-1}, \cdot)$.
Output: Sample $Z_{l,l-1}^\star = ((X_l^\star, W_l^\star), (X_{l-1}^\star, W_{l-1}^\star))$ from $\check{K}_{l,l-1}(z_{l,l-1}, \cdot)$ in (45).

4 Numerical Examples

Three numerical examples will be used to illustrate the properties of various algorithms and our theoretical results. The elliptic PDE problem introduced in Section 2.2 will be considered in Section 4.1. In Section 4.1.1, we begin with a case where an analytical solution of the PDE is tractable. Subsequently, in Section 4.1.2, a particular example of the problem described in Section 2.2 is considered. Finally, we examine a model from epidemiology in Section 4.2, to analyze COVID-19 infections in the UK.

Two quantities of interest will be used to illustrate our methodology. The first is the expected value corresponding to the choice of function $\varphi(x) = x$. The next quantity of interest is motivated by the estimation of parameters $\theta$, such as the precision of the observation model in (3). One approach is based on maximizing the marginal likelihood $Z(\theta)$, defined as the normalizing constant of (4). We will compute the maximum likelihood estimator (MLE) $\theta_{\mathrm{MLE}} \in \arg\max Z(\theta)$ by employing a stochastic gradient algorithm [11, 20, 23], given by the iterative scheme

$\theta^{(i)} = \theta^{(i-1)} + \alpha_i\, \widehat{\nabla_\theta \log Z}(\theta^{(i-1)}), \quad i \ge 1,$   (46)

where $(\alpha_i)_{i\in\mathbb{N}}$ is a sequence of learning rates and $\widehat{\nabla_\theta \log Z}(\theta)$ denotes an unbiased estimator of

$\nabla_\theta \log Z(\theta) = \int_{\mathsf{X}} (\nabla_\theta \log \gamma_\theta(x))\, \pi_\theta(x)\, dx.$   (47)

Following convergence results in [11, 23], we select $\alpha_i = \alpha_0/i$ and choose $\alpha_0$ appropriately. We will rely on our methodology to obtain unbiased estimators of the score function (47) by choosing the function $\varphi_\theta(x) = \nabla_\theta \log \gamma_\theta(x)$. Finally, to deal with parameters that have positivity constraints, we apply a logarithmic transformation before employing (46).
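To illustrate, the following is a minimal Python sketch of (46) with the logarithmic transformation, where score_estimate(theta, rng) is a hypothetical unbiased estimator of (47), e.g. an averaged single term estimator (27) with $\varphi_\theta(x) = \nabla_\theta \log \gamma_\theta(x)$.

```python
import numpy as np

def stochastic_gradient(theta0, score_estimate, alpha0, n_iters, rng):
    log_theta = np.log(theta0)               # reparametrize for positivity
    for i in range(1, n_iters + 1):
        theta = np.exp(log_theta)
        # Chain rule: the gradient w.r.t. log(theta) is theta * d/dtheta log Z.
        log_theta += (alpha0 / i) * theta * score_estimate(theta, rng)
    return np.exp(log_theta)
```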
4.1 Elliptic PDE

4.1.1 Analytically Tractable Case

We first consider an example where an analytical solution is available. The PDE on $C = [0, 2\pi]$ is defined by (1) with constant diffusion coefficient $\Phi = 1$ and forcing $f(t; X) = 4X_1 \sin(2t) + X_2 \sin(t)$. The analytical solution is given by $h(t; X) = X_1 \sin(2t) + X_2 \sin(t)$. Furthermore, we assume a prior of $X \sim \mathcal{N}_2(0_2, 16 I_2)$ on the state space $\mathsf{X} = \mathbb{R}^2$. Although this setting extends beyond the theoretical framework we have considered, we expect our results to generalize. The observation functions (2) are given by the Dirac delta functions $g_p = \delta_{t_p}$, where $t_p = 2\pi(2p-1)/(2P)$ for $p \in \{1, \ldots, P\}$ with $P = 50$. We simulate observations $y \in \mathbb{R}^P$ from (3) using $x^\star = (2, -1)$ and $\theta = 100$.

Given a value of $\theta$ and data $y$, the posterior of $X$ in (4), denoted as $\pi_\theta$, has the form $\mathcal{N}_2(\mu_\theta, \Sigma_\theta)$, where

$\mu_\theta = \theta \Sigma_\theta G^T y, \qquad \Sigma_\theta^{-1} = \theta G^T G + 16^{-1} I_2.$

In the preceding line, $G \in \mathbb{R}^{P \times 2}$ is the forward model matrix (such that $G(x) = Gx$) with entries $G_{p,1} = \sin(2 t_p)$ and $G_{p,2} = \sin(t_p)$ for $p \in \{1, \ldots, P\}$. The above quantities of interest are also analytically tractable and used as ground truth. Firstly, for $\varphi(x) = x$, the expected value is $\pi_\theta(\varphi) = \mu_\theta$. Secondly, the score function (47) can be computed using the fact that the marginal likelihood satisfies $Z(\theta) = \phi_P(y; 0_P, 16\, G G^T + \theta^{-1} I_P)$.

Figure 1: Toy model of Section 4.1.1. Left: the error of the forward model approximation $\|h_l - h_{l-1}\|$ against the discretization level $l$ satisfies $\|h_l - h_{l-1}\| \le C \Delta_l^\beta$ with a rate of $\beta = 2$. Right: convergence of the stochastic gradient iterates $(\theta^{(i)})_{i\in\mathbb{Z}_+}$, defined in (46), to the maximum likelihood estimator $\theta_{\mathrm{MLE}}$, measured in terms of the mean squared error $\mathbb{E}[(\theta^{(i)} - \theta_{\mathrm{MLE}})^2]$, which was estimated using independent realizations. The learning rates considered here are $\alpha_i = \alpha_0/i$ with $\alpha_0 \in \{0.03, 0.01\}$.

To suit the domain under consideration, we take the mesh width of the FEM scheme in Section 2.2.2 as $\Delta_l = 2\pi \times 2^{-(l+l_0)}$ with $l_0 = 5$. Firstly, in the left panel of Figure 1, we numerically verify that our approximation of the forward model indeed converges at the rate of $\beta = 2$. Next, we consider the time-averaged estimator $\xi_l = \widehat{[\pi_l - \pi_{l-1}](\varphi)}_{T,k,m}$ in (21) with $k = 100$ and $m = 1000$, and examine the rate at which its second moment converges to zero as $l$ increases in Figure 2. These estimators are computed using a Markov chain $(Z_{n,l,l-1})_{n\in\mathbb{Z}_+}$ that is simulated using either the mixture of coupled RWMH and HMC kernels (Algorithms 2 and 3) in (45) (left panel), or a coupled pCN kernel based on the reflection maximal coupling of Section 3.2.3 (right panel). The algorithmic settings of (45) include a fixed mixing probability $\kappa$, a fixed stepsize $\varepsilon_l$ and $M_l = 10$ leapfrog steps for all $l \in \mathbb{Z}_+$ in the HMC kernels, and a fixed proposal covariance for all $l \in \mathbb{Z}_+$ in the RWMH kernels. For the pCN kernels, we took a fixed $\rho_l$ and $\sigma_l$ for all $l \in \mathbb{Z}_+$. To satisfy Assumption (A5), we initialize $Z_{0,l,l-1} = ((X_{0,l}, W_{0,l}), (X_{0,l-1}, W_{0,l-1}))$ from a coupling $\check{\nu}_{l,l-1}$ that can be described by the following steps: 1) sample $X_{0,l-1}'$ and $W_{0,l-1}$ from the prior $\mathcal{N}_2(0_2, 16 I_2)$ independently; 2) generate $X_{0,l}' \mid X_{0,l-1}' \sim \mathcal{N}_2(X_{0,l-1}', 2^{-(2l+1)} I_2)$ and $W_{0,l} \mid W_{0,l-1} \sim \mathcal{N}_2(W_{0,l-1}, 2^{-(2l+1)} I_2)$ independently; 3) generate $(X_{0,l}, X_{0,l-1}) \mid (X_{0,l}', X_{0,l-1}')$ according to $K_{l,l-1}((X_{0,l}', X_{0,l-1}'), \cdot)$, the marginal kernel on $\mathsf{X} \times \mathsf{X}$ induced by $\check{K}_{l,l-1}$ for a pair across levels.

Figure 2: Toy model of Section 4.1.1 at data generating parameter $\theta = 1$. Second moment of the time-averaged estimator $\xi_l = \widehat{[\pi_l - \pi_{l-1}](\varphi)}_{T,k,m}$ in (21) with $k = 100$ and $m = 1000$ against the discretization level $l$, using either the mixture of coupled RWMH and HMC kernels in (45) (left), or the coupled pCN kernel based on the reflection maximal coupling of Section 3.2.3 (right).
Figure 2: Toy model of Section 4.1.1 at the data-generating parameter $\theta = 100$. Second moment of the time-averaged estimator $\xi_l = \widehat{[\pi_l - \pi_{l-1}]}(\varphi)_{T,k,m}$ in (21) with $k = 100$ and $m = 1000$ against the discretization level $l$, using either the mixture of coupled RWMH and HMC kernels in (45) (left), or the coupled pCN kernel based on the reflection maximal coupling of Section 3.2.3 (right). The functions considered here are $\varphi(x) = \nabla_\theta \log \gamma_\theta(x)$ (blue), $\varphi(x) = x_1$ (orange) and $\varphi(x) = x_2$ (green). In both cases, we have $\mathbb{E}[\xi_l^2] \leq C \Delta_l^\beta$ with a rate of $\beta = 2$. The second moment was estimated using independent realizations.

Figure 3: Toy model of Section 4.1.1 at the data-generating parameter $\theta = 100$. Mean squared error $\mathbb{E}[(\widehat{\pi(\varphi)}(N) - \pi(\varphi))^2]$ of the averaged estimator against the number of single-term replicates $N$, using either the mixture of coupled RWMH and HMC kernels in (45) (left), or the coupled pCN kernel based on the reflection maximal coupling of Section 3.2.3 (right). The functions considered here are $\varphi(x) = \nabla_\theta \log \gamma_\theta(x)$ (blue), $\varphi(x) = x_1$ (orange) and $\varphi(x) = x_2$ (green). In both cases, we observe the standard Monte Carlo rate $\mathbb{E}[(\widehat{\pi(\varphi)}(N) - \pi(\varphi))^2] \leq C N^{-1}$. The mean squared error was estimated using the exact value of $\pi(\varphi)$ as ground truth and independent realizations.

We can infer from both plots in Figure 2 that $\mathbb{E}[\xi_l^2] \leq C \Delta_l^\beta$ with a rate of $\beta = 2$, which matches that of the forward model approximation. Hence the condition in (11) ensuring the unbiasedness and finite variance of the single-term estimator $\widehat{\pi(\varphi)}_{S,T,k,m}$ in (24) can be verified. As the cost of the marginal kernel at level $l$ is of order $\Delta_l^{-\omega}$ with $\omega = 1$, following the discussion in Section 2.4.3, we select $\mathbb{P}_L(l) \propto \Delta_l^\eta$ with $\eta = 5/4$ to ensure finite expected cost (a sketch of this randomization over levels appears at the end of this subsection). Figure 3 shows that by averaging $N \in \mathbb{N}$ independent replicates of the single-term estimator, we obtain an unbiased estimator $\widehat{\pi(\varphi)}(N)$ in (27) that satisfies the standard Monte Carlo rate as $N \to \infty$.

Lastly, in the right panel of Figure 1, we illustrate the convergence of the stochastic gradient algorithm (46), initialized at $\theta^{(0)} = 1$, to the maximum likelihood estimator $\theta_{\mathrm{MLE}}$, for two sequences of learning rates. The MLE was computed numerically by maximizing the marginal likelihood $Z(\theta)$ with the exact score function (47).
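The randomization over levels is straightforward to implement. The following sketch draws $L$ from $\mathbb{P}_L(l) \propto \Delta_l^\eta \propto 2^{-\eta l}$ exactly, via a geometric variable, and returns the single-term estimator; xi is a hypothetical callable returning the increment estimator $\xi_l$ for $l \geq 1$, or an unbiased estimator of $\pi_0(\varphi)$ for $l = 0$.

```python
import numpy as np

def single_term(xi, eta, rng):
    """One single-term estimator: E[xi(L) / P_L(L)] recovers pi(phi)
    when the expectations of xi(l) telescope over the levels."""
    q = 1.0 - 2.0 ** (-eta)      # P_L(l) = q * 2^{-eta * l}, l = 0, 1, ...
    L = rng.geometric(q) - 1     # numpy's geometric is supported on {1, 2, ...}
    return xi(L) / (q * 2.0 ** (-eta * L))

# averaging N i.i.d. replicates as in (27):
# rng = np.random.default_rng(1)
# est = np.mean([single_term(xi, 5 / 4, rng) for _ in range(1000)])
```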
4.1.2 Elliptic Bayesian inverse problem

The general case of Section 2.2 is now considered, with unknown diffusion coefficient $\Phi$ and forcing $f(t) = 100t$. The prior specification of $X = (X_1, X_2)$ is adopted with $d = 2$, basis functions $v_1(t) = \sin(\pi t)$ and $v_2(t) = \cos(2\pi t)$, and fixed values of $\bar{\Phi}$, $\vartheta_1$ and $\vartheta_2$. For this particular setting, the solution $h$ is continuous and hence pointwise observations are well-defined. The observation function $G(x)$ in (2) is chosen as $g_p(h(X)) = h(0.01 + 0.02(p-1); X)$ for $p \in \{1, \ldots, P\}$ with $P = 50$. We employ the FEM scheme in Section 2.2.2 with mesh width $\Delta_l = 2^{-(l + l_0)}$, where $l_0 = 3$. Using a discretization level of $l = 10$ to approximate $G(x)$ with $G_l(x)$, we simulate observations $y \in \mathbb{R}^P$ from (3) at a fixed data-generating value $x^\star$ and $\theta = 1$.

Figure 4 shows that the forward model approximation and the second moment of the time-averaged estimator $\xi_l = \widehat{[\pi_l - \pi_{l-1}]}(\varphi)_{T,k,m}$ in (21) with $k = 100$ and $m = 1000$ converge at the same rate of $\beta = 2$ as $l$ increases. The estimators $(\xi_l)$ are computed using the reflection maximal coupling of pCN kernels in Section 3.2.3 (a sketch of one coupled proposal is given at the end of this subsection), with $\sigma_l = I_d$ and a level-independent value of $\rho_l$ for all $l \in \mathbb{Z}_+$. The Markov chain is initialized in a similar manner to Section 4.1.1, with the exception that the prior is uniform in this case. As before, we can select $\mathbb{P}_L(l) \propto \Delta_l^\eta$ with $\eta = 5/4$ to ensure that the single-term estimator $\widehat{\pi(\varphi)}_{S,T,k,m}$ in (24) has finite variance and finite expected cost. The left panel of Figure 5 illustrates that averaging single-term estimators yields a consistent estimator that converges at the standard Monte Carlo rate.

Finally, we consider inference for $\theta$ in the Bayesian framework, under a prior $p(\theta)$ specified as a standard Gaussian prior on $\log \theta$. By adding the gradient of the log prior density $\nabla_\theta \log p(\theta)$ to (46), we can run a stochastic gradient algorithm, initialized at a fixed $\theta^{(0)}$, to compute the maximum a posteriori probability (MAP) estimator $\theta_{\mathrm{MAP}} \in \arg\max p(\theta) Z(\theta)$. The right panel of Figure 5 displays the convergence of the stochastic iterates to $\theta_{\mathrm{MAP}}$. As a competing algorithm, we consider the approach of [20], which can also compute unbiased estimators of the score function (47) by using the multilevel SMC algorithm of [2] instead of MCMC. The plot shows some gains over [20] when the same learning rates are employed.
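For concreteness, here is a sketch of one step of a coupled pCN proposal under a reflection maximal coupling, in the spirit of Section 3.2.3 and [14, 22]; the exact construction used in our experiments may differ in minor details. On acceptance of the common noise, the two proposals coincide, which is what allows the coupled chains to meet.

```python
import numpy as np

def coupled_pcn_proposal(x, w, rho, sigma, rng):
    """Reflection maximal coupling of the pCN proposals
    N(rho * x, (1 - rho^2) sigma^2 I) and N(rho * w, (1 - rho^2) sigma^2 I)."""
    s = np.sqrt(1.0 - rho ** 2) * sigma
    z = rho * (x - w) / s                  # standardized difference of means
    xi = rng.standard_normal(x.shape)
    nz = np.linalg.norm(z)
    if nz == 0.0:                          # chains already coincide
        return rho * x + s * xi, rho * w + s * xi
    # accept with probability min(1, phi(xi + z) / phi(xi))
    if np.log(rng.uniform()) < 0.5 * (xi @ xi - (xi + z) @ (xi + z)):
        eta = xi + z                       # the two proposals are then identical
    else:
        e = z / nz
        eta = xi - 2.0 * (e @ xi) * e      # reflect xi in the direction of z
    return rho * x + s * xi, rho * w + s * eta
```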
Figure 4: Elliptic Bayesian inverse problem of Sections 2.2 and 4.1.2. Left: the error of the forward model approximation $\|h_l - h_{l-1}\|$ against the discretization level $l$ satisfies $\|h_l - h_{l-1}\| \leq C \Delta_l^\beta$ with a rate of $\beta = 2$. Right: the second moment of the time-averaged estimator $\xi_l = \widehat{[\pi_l - \pi_{l-1}]}(\varphi)_{T,k,m}$ in (21) with $k = 100$ and $m = 1000$ against the discretization level $l$ satisfies $\mathbb{E}[\xi_l^2] \leq C \Delta_l^\beta$ with a rate of $\beta = 2$. The function considered here is $\varphi(x) = \nabla_\theta \log \gamma_\theta(x)$ at a fixed value of $\theta$. The second moment was estimated using independent realizations.

Figure 5: Elliptic Bayesian inverse problem of Sections 2.2 and 4.1.2. Left: mean squared error $\mathbb{E}[(\widehat{\pi(\varphi)}(N) - \pi(\varphi))^2]$ of the averaged estimator against the number of single-term replicates $N$, for the function $\varphi(x) = \nabla_\theta \log \gamma_\theta(x)$ at a fixed value of $\theta$. The mean squared error was estimated using an average of independent replicates with a minimal discretization level of $l = 8$ as ground truth for $\pi(\varphi)$, and independent realizations. Right: convergence of the stochastic gradient iterates $(\theta^{(i)})_{i \in \mathbb{Z}_+}$ to the maximum a posteriori probability estimator $\theta_{\mathrm{MAP}}$. The mean squared error $\mathbb{E}[(\theta^{(i)} - \theta_{\mathrm{MAP}})^2]$ was estimated using a longer run of the stochastic gradient algorithm as ground truth for $\theta_{\mathrm{MAP}}$, and independent realizations. The learning rates considered here are $\alpha_i = \alpha_0 / i$ with $\alpha_0 \in \{0.30, 0.10, 0.03\}$. The red curve corresponds to a comparison with the unbiased MLSMC algorithm of [20]. Here the cost is measured in terms of the number of forward model simulations, weighted by the complexity associated with each discretization level, and averaged over realizations.

4.2 Compartmental model

Our final application concerns parameter inference for an epidemiological model to analyze COVID-19 infections in the UK. We consider a recently developed compartmental model [24] for a closed population, where $S(t)$ denotes the proportion that is susceptible to the disease, $I(t)$ denotes the proportion of infected individuals, $R(t)$ denotes the proportion of individuals who have recovered and are no longer part of the transmission process, and $\Xi(t)$ denotes the proportion of symptomatic and infected individuals who have been quarantined. The model dynamics are governed by the following system of ordinary differential equations:
$$\frac{d}{dt} S(t) = -a S(t) I(t) - x_1 S(t), \qquad \frac{d}{dt} I(t) = a S(t) I(t) - (b + x_1 + x_2) I(t), \qquad (48)$$
$$\frac{d}{dt} R(t) = b I(t) + x_1 S(t), \qquad \frac{d}{dt} \Xi(t) = (x_1 + x_2) I(t).$$
A unit of time in the model will represent the duration of a day. In Equation (48), $a > 0$ and $b > 0$ denote the transmission rate and recovery rate, respectively; following [24], we adopt the values estimated there for COVID-19. The parameter $x_1 > 0$ captures public containment policies or individual behavioural changes in response to the epidemic, while quarantine measures for symptomatic and infected individuals are described by the parameter $x_2 > 0$. We will also infer the time elapsed between the first infection and its reporting, $x_3 > 0$. Letting time $t = 0$ correspond to the first reported case on January 24, 2020, the initial condition is $(S(-x_3), I(-x_3), R(-x_3), \Xi(-x_3)) = (1 - 1/N_{\mathrm{pop}}, 1/N_{\mathrm{pop}}, 0, 0)$, where $N_{\mathrm{pop}}$ (approximately 66 million) denotes the size of the UK population. A minimal numerical integration of (48) is sketched below.
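The following is a minimal sketch of the dynamics (48) and their integration with the classical fourth-order Runge-Kutta scheme [30], matching the discretization discussed next; the parameter values are deliberately left as arguments since the values of $a$ and $b$ are taken from [24].

```python
import numpy as np

def sirx_rhs(u, a, b, x1, x2):
    """Right-hand side of the compartmental model (48)."""
    S, I, R, Xi = u
    return np.array([
        -a * S * I - x1 * S,
        a * S * I - (b + x1 + x2) * I,
        b * I + x1 * S,
        (x1 + x2) * I,
    ])

def rk4_solve(u0, a, b, x1, x2, t_end, dt):
    """Integrate (48) from t = 0 to t_end with stepsize dt using the
    classical fourth-order Runge-Kutta scheme; returns the trajectory."""
    u = np.asarray(u0, dtype=float)
    traj = [u.copy()]
    for _ in range(int(round(t_end / dt))):
        k1 = sirx_rhs(u, a, b, x1, x2)
        k2 = sirx_rhs(u + 0.5 * dt * k1, a, b, x1, x2)
        k3 = sirx_rhs(u + 0.5 * dt * k2, a, b, x1, x2)
        k4 = sirx_rhs(u + dt * k3, a, b, x1, x2)
        u = u + (dt / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
        traj.append(u.copy())
    return np.array(traj)
```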
Given parameters $x = (x_1, x_2, x_3)$, we will write the solution of (48) at time $t$ as $(S(t;x), I(t;x), R(t;x), \Xi(t;x))$. As our prior specification takes $X_1$, $X_2$ and $X_3$ to be independent and uniformly distributed on bounded intervals, we will work on the compact state space $\mathsf{X}$ given by the product of these intervals.

To account for under-reporting, the observed proportion of daily confirmed cases $(Y_i)_{i=1}^P$ is modelled as
$$\log(Y_i) = \log(G_i(x)) - \Gamma_i, \qquad (49)$$
for $i \in \{1, \ldots, P\}$, where
$$G_i(x) = a \int_{n_0 + i - 1}^{n_0 + i} S(t;x)\, I(t;x)\, dt \qquad (50)$$
denotes the proportion of daily new infections under model (48), and $(\Gamma_i)_{i=1}^P$ are independent gamma random variables with shape parameter $\theta_1 > 0$ and scale parameter $\theta_2 > 0$. We set $n_0 = 29$ to consider only $P = 40$ observations $y = (y_i)_{i=1}^P$ from February 12, 2020, as earlier data seem to be unreliable. We note that the observation model in (49) differs from [24], which adopted a least squares approach to infer parameters. Under the gamma likelihood, the unnormalized posterior density of $X$ given $y$ and $\theta = (\theta_1, \theta_2)$ is
$$\gamma_\theta(x) = \prod_{i=1}^P \frac{1}{\Gamma(\theta_1)\, \theta_2^{\theta_1}} \big(\log(G_i(x)/y_i)\big)^{\theta_1 - 1} \exp\big(-\log(G_i(x)/y_i)/\theta_2\big)\, \mathbb{I}_A(x), \qquad (51)$$
where $A = \{x \in \mathsf{X} : G_i(x) \geq y_i,\ i = 1, \ldots, P\}$; a sketch of the evaluation of (51) is given below.

Any practical implementation of MCMC targeting (51) requires an approximation of $G_i(x)$ in (50). As it suffices to approximate $h(t;x)$ satisfying $(d/dt) h(t) = a S(t) I(t)$, we augment the system in (48) and employ a fourth-order Runge-Kutta numerical integrator [30] with stepsize $\Delta_l \propto 2^{-l}$, for $l \in \mathbb{Z}_+$. The left panel of Figure 6 shows that the resulting approximation $h_l(t;x)$ converges to $h(t;x)$ at the expected rate of $\beta = 4$. We can now apply our proposed methodology to approximate expectations. The right panel shows that the second moment of the time-averaged estimator $\xi_l = \widehat{[\pi_l - \pi_{l-1}]}(\varphi)_{T,k,m}$ in (21) with $k = 200$ and $m = 2000$ also converges at the same rate. To compute the estimators $(\xi_l)$, we used the reflection maximal coupling of pCN kernels in Section 3.2.3, with $\sigma_l = I_d$ and a level-independent value of $\rho_l$ for all $l \in \mathbb{Z}_+$. The Markov chain is initialized in a similar manner to Section 4.1.1, with the exception of truncating the prior to the subset $A$ in this case. Since the cost of the marginal pCN kernel at level $l$ is of order $\Delta_l^{-\omega}$ with $\omega = 1$, we choose $\mathbb{P}_L(l) \propto \Delta_l^\eta$ with $\eta = 9/4$ to ensure that the single-term estimator $\widehat{\pi(\varphi)}_{S,T,k,m}$ in (24) has finite variance and finite expected cost. The left panel of Figure 7 illustrates the impact of averaging independent replicates. As before, we compute the MLE of $\theta$ using the stochastic gradient algorithm (46), initialized at a fixed $\theta^{(0)}$. Finally, we exploit the fitted model to infer the extent of under-reporting during the time period under consideration: in Figure 8, we display the ratio of the total number of reported cases to the expected number of total infections under the posterior distribution associated with model (48) and $\theta = \theta_{\mathrm{MLE}}$.
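For illustration, the log of the unnormalized density (51) can be evaluated as follows, given the vector $(G_i(x))_{i=1}^P$ (computed, e.g., from the Runge-Kutta approximation sketched above); multiplicative constants that do not depend on $x$ are immaterial for MCMC, but the normalizing terms in $\theta$ are retained since (51) is also differentiated in $\theta$ for the score.

```python
import numpy as np
from scipy.stats import gamma

def log_gamma_target(G, y, theta1, theta2):
    """Unnormalized log posterior density (51): the residuals
    log(G_i(x) / y_i) are i.i.d. Gamma(shape=theta1, scale=theta2)."""
    r = np.log(np.asarray(G) / np.asarray(y))
    if np.any(r < 0.0):                   # outside the support A in (51)
        return -np.inf
    return gamma.logpdf(r, a=theta1, scale=theta2).sum()
```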
Figure 6: Compartmental model of Section 4.2. Left: the error of the forward model approximation $|h_l(t;x) - h_{l-1}(t;x)|$ against the discretization level $l$ satisfies $|h_l(t;x) - h_{l-1}(t;x)| \leq C \Delta_l^\beta$ with a rate of $\beta = 4$, at the observation times $t$ and a fixed parameter $x$. Right: the second moment of the time-averaged estimator $\xi_l = \widehat{[\pi_l - \pi_{l-1}]}(\varphi)_{T,k,m}$ in (21) with $k = 200$ and $m = 2000$ against the discretization level $l$ satisfies $\mathbb{E}[\xi_l^2] \leq C \Delta_l^\beta$ with a rate of $\beta = 4$. The function considered here is $\varphi(x) = \nabla_\theta \log \gamma_\theta(x)$ at a fixed value of $\theta$. The second moment was estimated using independent realizations.

Figure 7: Compartmental model of Section 4.2. Left: mean squared error $\mathbb{E}[(\widehat{\pi(\varphi)}(N) - \pi(\varphi))^2]$ of the averaged estimator against the number of single-term replicates $N$, for the functions $\varphi(x) = \partial_{\theta_1} \log \gamma_\theta(x)$ (blue) and $\varphi(x) = \partial_{\theta_2} \log \gamma_\theta(x)$ (orange) at a fixed value of $\theta$. The mean squared error was estimated using an average of independent replicates at a fine discretization as ground truth for $\pi(\varphi)$, and independent realizations. Right: convergence of the stochastic gradient iterates $(\theta^{(i)})_{i \in \mathbb{Z}_+}$ to the maximum likelihood estimator $\theta_{\mathrm{MLE}}$. The mean squared error $\mathbb{E}[(\theta^{(i)} - \theta_{\mathrm{MLE}})^2]$ was estimated using a longer run of the stochastic gradient algorithm as ground truth for $\theta_{\mathrm{MLE}}$, and independent realizations. The sequence of learning rates considered here is $\alpha_i = \alpha_0 / i$ for a fixed $\alpha_0$.

Figure 8: Compartmental model of Section 4.2. Ratio of the total number of reported cases $C_i = C_0 + \sum_{j=1}^i y_j$ ($C_0$ denotes the number of cases before February 12, 2020) to the expected number of total infections $\mathbb{E}\big[a \int_{-X_3}^{n_0 + i} S(t;X)\, I(t;X)\, dt \mid y, \theta_{\mathrm{MLE}}\big]$ on day $n_0 + i$. The vertical axis displays the fraction of observed cases.
Acknowledgements

JH was funded by CY Initiative of Excellence (grant "Investissements d'Avenir" ANR-16-IDEX-0008). AJ was supported by KAUST baseline funding. AT and KJHL were supported by The Alan Turing Institute under the EPSRC grant EP/N510129/1.
A Proofs
Throughout the appendix, $C$ denotes a finite constant that does not depend upon $l$ nor on the time parameter of the Markov chain; its value may change upon each appearance. For ease of notation only, we will set $k = 0$ from here on; the proofs, with some minor modifications, hold for any $k \in \mathbb{Z}_+$. The appendix first gives the proof of Theorem 2.1 and then a collection of technical results which are used to achieve the proof.

Proof of Theorem 2.1.
To prove the result we must verify, for any $(l, \varphi) \in \mathbb{N} \times \mathcal{B}_b(\mathsf{X}) \cap \mathrm{Lip}_{\tilde{d}}(\mathsf{X})$,
$$\mathbb{E}\big[\widehat{\pi_0(\varphi)}\big] = \pi_0(\varphi), \qquad (52)$$
$$\mathbb{E}\big[\widehat{[\pi_l - \pi_{l-1}]}(\varphi)\big] = [\pi_l - \pi_{l-1}](\varphi), \qquad (53)$$
and that
$$\mathbb{P}_L(0)^{-1}\, \mathbb{E}\big[\widehat{\pi_0(\varphi)}^2\big] + \sum_{l \in \mathbb{N}} \mathbb{P}_L(l)^{-1}\, \mathbb{E}\big[\{\widehat{[\pi_l - \pi_{l-1}]}(\varphi)\}^2\big] < +\infty. \qquad (54)$$
Equations (52) and (53) are verified in Proposition A.1. For (54), $\mathbb{E}[\widehat{\pi_0(\varphi)}^2]$ can be controlled by using Lemma A.6, and then by Lemma A.5 we have that
$$\sum_{l \in \mathbb{N}} \mathbb{P}_L(l)^{-1}\, \mathbb{E}\big[\{\widehat{[\pi_l - \pi_{l-1}]}(\varphi)\}^2\big] \leq C\, (\|\varphi\|_\infty \vee \|\varphi\|_{\mathrm{Lip}})^2 \sum_{l \in \mathbb{N}} \mathbb{P}_L(l)^{-1}\, \Delta_l^{\beta \wedge \bar{\beta}}.$$
The proof can now easily be completed.
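For context, the quantity in (54) is precisely the second moment of the single-term estimator: writing $\xi_0 = \widehat{\pi_0(\varphi)}$ and $\xi_l = \widehat{[\pi_l - \pi_{l-1}]}(\varphi)$ for $l \in \mathbb{N}$, conditioning on the level $L \sim \mathbb{P}_L$ gives
$$\mathbb{E}\Big[\Big(\frac{\xi_L}{\mathbb{P}_L(L)}\Big)^2\Big] = \sum_{l \in \mathbb{Z}_+} \mathbb{P}_L(l)\, \frac{\mathbb{E}[\xi_l^2]}{\mathbb{P}_L(l)^2} = \sum_{l \in \mathbb{Z}_+} \frac{\mathbb{E}[\xi_l^2]}{\mathbb{P}_L(l)},$$
so that finiteness of (54) delivers the finite variance claimed in the theorem.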
Proposition A.1. Assume (A1-2). Then we have, for any $(l, \varphi) \in \mathbb{N} \times \mathcal{B}_b(\mathsf{X})$,
$$\mathbb{E}\big[\widehat{\pi_0(\varphi)}\big] = \pi_0(\varphi), \qquad \mathbb{E}\big[\widehat{[\pi_l - \pi_{l-1}]}(\varphi)\big] = [\pi_l - \pi_{l-1}](\varphi).$$

Proof.
The case of $\widehat{\pi_0(\varphi)}$ is essentially that in [10], so we focus on the second expectation. We have the standard martingale-plus-remainder decomposition:
$$\widehat{[\pi_l - \pi_{l-1}]}(\varphi) = \varphi(X_{0,l}) + \sum_{n=1}^{\check{\tau}_{l,l-1} - 1} \{\varphi(X_{n,l}) - \varphi(W_{n,l})\} - \Big(\varphi(X_{0,l-1}) + \sum_{n=1}^{\check{\tau}_{l,l-1} - 1} \{\varphi(X_{n,l-1}) - \varphi(W_{n,l-1})\}\Big)$$
$$= \varphi(X_{0,l}) + \sum_{n=1}^{\check{\tau}_{l,l-1}} \zeta_l(X_{n-1:n,l}, W_{n-1:n,l}) + K_l(\widehat{\varphi}_l)(X_{0,l}) - K_l(\widehat{\varphi}_l)(W_{0,l})$$
$$\quad - \Big(\varphi(X_{0,l-1}) + \sum_{n=1}^{\check{\tau}_{l,l-1}} \zeta_{l-1}(X_{n-1:n,l-1}, W_{n-1:n,l-1}) + K_{l-1}(\widehat{\varphi}_{l-1})(X_{0,l-1}) - K_{l-1}(\widehat{\varphi}_{l-1})(W_{0,l-1})\Big),$$
where, for $s \in \{l, l-1\}$ and $(x_{n-1:n}, w_{n-1:n}) \in \mathsf{X}^2 \times \mathsf{X}^2$,
$$\zeta_s(x_{n-1:n}, w_{n-1:n}) := \widehat{\varphi}_s(x_n) - K_s(\widehat{\varphi}_s)(x_{n-1}) - \big(\widehat{\varphi}_s(w_n) - K_s(\widehat{\varphi}_s)(w_{n-1})\big),$$
and
$$\widehat{\varphi}_s(x) = \sum_{n \in \mathbb{Z}_+} [K_s^n(\varphi)(x) - \pi_s(\varphi)] \qquad (55)$$
is well-defined for each $x \in \mathsf{X}$ and solves the Poisson equation
$$\widehat{\varphi}_s(x) - K_s(\widehat{\varphi}_s)(x) = \varphi(x) - \pi_s(\varphi).$$
Set $(\mathcal{F}_n)_{n \in \mathbb{Z}_+}$ as the natural filtration generated by $(Z_{n,l,l-1})_{n \geq 0}$ and, for $(n, s) \in \mathbb{N} \times \{l, l-1\}$,
$$M_{n,s} := \sum_{j=1}^n \zeta_s(X_{j-1:j,s}, W_{j-1:j,s}),$$
with $M_{0,l} = M_{0,l-1} = 0$. Then for each $s \in \{l, l-1\}$, $(M_{n,s}, \mathcal{F}_n)_{n \in \mathbb{Z}_+}$ is a martingale. Thus, we have
$$\widehat{[\pi_l - \pi_{l-1}]}(\varphi) = \varphi(X_{0,l}) + M_{\check{\tau}_{l,l-1},l} + K_l(\widehat{\varphi}_l)(X_{0,l}) - K_l(\widehat{\varphi}_l)(W_{0,l}) - \big(\varphi(X_{0,l-1}) + M_{\check{\tau}_{l,l-1},l-1} + K_{l-1}(\widehat{\varphi}_{l-1})(X_{0,l-1}) - K_{l-1}(\widehat{\varphi}_{l-1})(W_{0,l-1})\big). \qquad (56)$$
Now taking expectations on both sides of the equation and applying the optional sampling theorem, we have
$$\mathbb{E}\big[\widehat{[\pi_l - \pi_{l-1}]}(\varphi)\big] = \nu_l K_l(\varphi) + \nu_l K_l^2(\widehat{\varphi}_l) - \nu_l K_l(\widehat{\varphi}_l) - \big(\nu_{l-1} K_{l-1}(\varphi) + \nu_{l-1} K_{l-1}^2(\widehat{\varphi}_{l-1}) - \nu_{l-1} K_{l-1}(\widehat{\varphi}_{l-1})\big).$$
Since the Poisson equation gives $\nu_s K_s(\widehat{\varphi}_s) - \nu_s K_s^2(\widehat{\varphi}_s) = \nu_s K_s(\varphi) - \pi_s(\varphi)$ for $s \in \{l, l-1\}$, each bracketed expression reduces to $\pi_s(\varphi)$. Hence, we have that $\mathbb{E}[\widehat{[\pi_l - \pi_{l-1}]}(\varphi)] = [\pi_l - \pi_{l-1}](\varphi)$.

Remark A.1.
In the proof we have essentially derived the first Wald equality for Markov chains (see [26]). The main point is to provide the martingale-plus-remainder decomposition (56).

Lemma A.1.
Assume (A1,3). Then there exists a $C \in (0, \infty)$ such that for any $(l, \varphi) \in \mathbb{Z}_+ \times \mathcal{B}_b(\mathsf{X}) \cap \mathrm{Lip}_{\tilde{d}}(\mathsf{X})$ we have:
$$|\widehat{\varphi}_l(x) - \widehat{\varphi}_l(w)| \vee |K_l(\widehat{\varphi}_l)(x) - K_l(\widehat{\varphi}_l)(w)| \leq C\, (\|\varphi\|_\infty \vee \|\varphi\|_{\mathrm{Lip}})\, \tilde{d}(x, w).$$

Lemma A.2.
Assume (A1,4). Then there exists a $C \in (0, \infty)$ such that for any $(l, \varphi) \in \mathbb{N} \times \mathcal{B}_b(\mathsf{X})$ we have:
$$\sup_{x \in \mathsf{X}} |\widehat{\varphi}_l(x) - \widehat{\varphi}_{l-1}(x)| \vee \sup_{x \in \mathsf{X}} |K_l(\widehat{\varphi}_l)(x) - K_{l-1}(\widehat{\varphi}_{l-1})(x)| \leq C\, \|\varphi\|_\infty\, \Delta_l^\beta,$$
where $\beta$ is as in (A4).

Lemma A.3.
Assume (A5). Then there exists a $C \in (0, \infty)$ such that for any $(l, n) \in \mathbb{N} \times \mathbb{Z}_+$ we have:
$$\mathbb{E}\big[\mathbb{I}_{B(C, \Delta_l^{\bar{\beta}}, \tilde{d})^c}(Z_{n,l,l-1})\big] \leq C\, (n+1)\, \Delta_l^{\bar{\beta}(2+\epsilon)},$$
$$\mathbb{E}\big[\tilde{d}(X_{n,l}, X_{n,l-1})^{2+\epsilon}\big] \vee \mathbb{E}\big[\tilde{d}(W_{n,l}, W_{n,l-1})^{2+\epsilon}\big] \leq C\, (n+1)\, \Delta_l^{\bar{\beta}(2+\epsilon)},$$
where $\bar{\beta}$ and $\epsilon$ are as in (A5).

Proof. The proof of the first statement is by induction: the statement holds at step zero by assumption, so assume the result at $n - 1$. Writing $B$ for $B(C, \Delta_l^{\bar{\beta}}, \tilde{d})$, we have
$$\mathbb{E}[\mathbb{I}_{B^c}(Z_{n,l,l-1})] = \mathbb{E}[\mathbb{I}_{B^c \times B}(Z_{n,l,l-1}, Z_{n-1,l,l-1})] + \mathbb{E}[\mathbb{I}_{B^c \times B^c}(Z_{n,l,l-1}, Z_{n-1,l,l-1})].$$
Then, applying (A5) along with the induction hypothesis, one can conclude that
$$\mathbb{E}[\mathbb{I}_{B^c}(Z_{n,l,l-1})] \leq C\, (n+1)\, \Delta_l^{\bar{\beta}(2+\epsilon)},$$
and hence the proof of the first statement is complete.

For the second statement, we consider only $\tilde{d}(X_{n,l}, X_{n,l-1})^{2+\epsilon}$, as the argument is the same for $\tilde{d}(W_{n,l}, W_{n,l-1})^{2+\epsilon}$. We have
$$\mathbb{E}[\tilde{d}(X_{n,l}, X_{n,l-1})^{2+\epsilon}] = \mathbb{E}\big[\tilde{d}(X_{n,l}, X_{n,l-1})^{2+\epsilon}\{\mathbb{I}_B(Z_{n,l,l-1}) + \mathbb{I}_{B^c}(Z_{n,l,l-1})\}\big].$$
On $B(C, \Delta_l^{\bar{\beta}}, \tilde{d})$, one has $\tilde{d}(X_{n,l}, X_{n,l-1})^{2+\epsilon} \leq C\, \Delta_l^{\bar{\beta}(2+\epsilon)}$. For the second term on the R.H.S., $\mathsf{X}$ is compact and $\tilde{d}(X_{n,l}, X_{n,l-1})^{2+\epsilon}$ is bounded, so one can use the first part of the statement to deduce that
$$\mathbb{E}[\tilde{d}(X_{n,l}, X_{n,l-1})^{2+\epsilon}] \leq C\, (n+1)\, \Delta_l^{\bar{\beta}(2+\epsilon)}.$$

Lemma A.4.
Assume (A1-5). Then there exists a $C \in (0, \infty)$ such that for any $(l, \varphi) \in \mathbb{N} \times \mathcal{B}_b(\mathsf{X}) \cap \mathrm{Lip}_{\tilde{d}}(\mathsf{X})$ we have:
$$\mathbb{E}\big[\{M_{\check{\tau}_{l,l-1},l} - M_{\check{\tau}_{l,l-1},l-1}\}^2\big] \leq C\, (\|\varphi\|_\infty \vee \|\varphi\|_{\mathrm{Lip}})^2\, \Delta_l^{\beta \wedge \bar{\beta}},$$
where $\tilde{d}$ is as in (A3), $\beta$ is as in (A4) and $\bar{\beta}$ is as in (A5).

Proof. Set, for $n \in \mathbb{Z}_+$,
$$\check{M}_{n,l,l-1} := M_{n,l} - M_{n,l-1};$$
then $(\check{M}_{n,l,l-1}, \mathcal{F}_n)_{n \in \mathbb{Z}_+}$ is a martingale and moreover, as $\check{\tau}_{l,l-1}$ is an $(\mathcal{F}_n)$-stopping time, so is $(\check{M}_{\check{\tau}_{l,l-1} \wedge n, l,l-1}, \mathcal{F}_n)_{n \in \mathbb{Z}_+}$. Then, by the Burkholder-Davis-Gundy inequality, we have
$$\mathbb{E}\big[\check{M}^2_{\check{\tau}_{l,l-1} \wedge n, l,l-1}\big] \leq \mathbb{E}\bigg[\sum_{j=1}^{\check{\tau}_{l,l-1} \wedge n} \{\zeta_l(X_{j-1:j,l}, W_{j-1:j,l}) - \zeta_{l-1}(X_{j-1:j,l-1}, W_{j-1:j,l-1})\}^2\bigg].$$
By (A2), $\check{\tau}_{l,l-1}$ is almost surely finite, so by the monotone convergence theorem:
$$\mathbb{E}\big[\check{M}^2_{\check{\tau}_{l,l-1}, l,l-1}\big] \leq \mathbb{E}\bigg[\sum_{j=1}^{\check{\tau}_{l,l-1}} \{\zeta_l(X_{j-1:j,l}, W_{j-1:j,l}) - \zeta_{l-1}(X_{j-1:j,l-1}, W_{j-1:j,l-1})\}^2\bigg]. \qquad (57)$$
Then we can upper-bound the R.H.S. to yield $\mathbb{E}[\check{M}^2_{\check{\tau}_{l,l-1},l,l-1}] \leq C(T_1 + T_2)$, where
$$T_1 := \mathbb{E}\bigg[\sum_{n=1}^{\check{\tau}_{l,l-1}} \big\{\widehat{\varphi}_l(X_{n,l}) - \widehat{\varphi}_l(W_{n,l}) - \widehat{\varphi}_{l-1}(X_{n,l-1}) + \widehat{\varphi}_{l-1}(W_{n,l-1})\big\}^2\bigg], \qquad (58)$$
$$T_2 := \mathbb{E}\bigg[\sum_{n=1}^{\check{\tau}_{l,l-1}} \big\{K_l(\widehat{\varphi}_l)(X_{n-1,l}) - K_l(\widehat{\varphi}_l)(W_{n-1,l}) - K_{l-1}(\widehat{\varphi}_{l-1})(X_{n-1,l-1}) + K_{l-1}(\widehat{\varphi}_{l-1})(W_{n-1,l-1})\big\}^2\bigg]. \qquad (59)$$
To conclude the proof, one must appropriately upper-bound (58) and (59). As the forms are very similar, owing to the expression (55) for $\widehat{\varphi}_s$, $s \in \{l, l-1\}$, we will give the proof for (58) only; the proof for (59) is almost the same.

Define, for $z \in \mathsf{Z}$,
$$\widehat{\varphi}_{l,l-1}(z) := \sum_{n \in \mathbb{Z}_+} \check{K}^n_{l,l-1}\big(\widehat{\psi}_{l,l-1}\big)(z),$$
where $\widehat{\psi}_{l,l-1}(x_l, w_l, x_{l-1}, w_{l-1}) = (\widehat{\varphi}_l(x_l) - \widehat{\varphi}_l(w_l) - \widehat{\varphi}_{l-1}(x_{l-1}) + \widehat{\varphi}_{l-1}(w_{l-1}))^2$. We note that by (A2), one can easily verify that $\widehat{\varphi}_{l,l-1}$ is a well-defined function. Now, by the first Wald equality for Markov chains, one has
$$\mathbb{E}[T_1] = \mathbb{E}\big[\check{K}_{l,l-1}(\widehat{\varphi}_{l,l-1})(Z_{0,l,l-1})\big].$$
This is because $\mathbb{E}[\check{K}_{l,l-1}(\widehat{\varphi}_{l,l-1})(Z_{\check{\tau}_{l,l-1},l,l-1})] = 0$, $\check{\pi}_{l,l-1}(\widehat{\psi}_{l,l-1}) = 0$ and $\mathbb{E}[\check{\tau}_{l,l-1}] < +\infty$, where $\check{\pi}_{l,l-1}$ is the invariant measure of $\check{K}_{l,l-1}$ (marginally, the invariant distribution of $(X_s, W_s)$ is $\pi_s(dx_s)\, \delta_{x_s}(dw_s)$ for $s \in \{l, l-1\}$). Now
$$T_1 = \mathbb{E}[\widehat{\psi}_{l,l-1}(Z_{0,l,l-1})] + \mathbb{E}\bigg[\sum_{n \in \mathbb{N}} \mathbb{E}[\widehat{\psi}_{l,l-1}(Z_{n,l,l-1}) \mid Z_{0,l,l-1}]\bigg]. \qquad (60)$$
We have
$$\mathbb{E}[\widehat{\psi}_{l,l-1}(Z_{0,l,l-1})] = \mathbb{E}\big[\big(\widehat{\varphi}_l(X_{0,l}) - \widehat{\varphi}_l(W_{0,l}) - \widehat{\varphi}_{l-1}(X_{0,l-1}) + \widehat{\varphi}_{l-1}(W_{0,l-1})\big)^2\big]$$
$$\leq C\Big\{\mathbb{E}\big[\big(\widehat{\varphi}_l(X_{0,l}) - \widehat{\varphi}_{l-1}(X_{0,l}) + \widehat{\varphi}_{l-1}(X_{0,l}) - \widehat{\varphi}_{l-1}(X_{0,l-1})\big)^2\big] + \mathbb{E}\big[\big(\widehat{\varphi}_l(W_{0,l}) - \widehat{\varphi}_{l-1}(W_{0,l}) + \widehat{\varphi}_{l-1}(W_{0,l}) - \widehat{\varphi}_{l-1}(W_{0,l-1})\big)^2\big]\Big\}$$
$$\leq C\, (\|\varphi\|_\infty \vee \|\varphi\|_{\mathrm{Lip}})^2 \big(\Delta_l^{2\beta} + \mathbb{E}[\tilde{d}(X_{0,l}, X_{0,l-1})^2] + \Delta_l^{2\beta} + \mathbb{E}[\tilde{d}(W_{0,l}, W_{0,l-1})^2]\big)$$
$$\leq C\, (\|\varphi\|_\infty \vee \|\varphi\|_{\mathrm{Lip}})^2\, \Delta_l^{\beta \wedge \bar{\beta}}, \qquad (61)$$
where we have applied the $C_2$-inequality and Lemmata A.1-A.2 to go to the third line, and Lemma A.3 (with Jensen's inequality) to go to the last line.
Then, we also have
$$\mathbb{E}\bigg[\sum_{n \in \mathbb{N}} \mathbb{E}[\widehat{\psi}_{l,l-1}(Z_{n,l,l-1}) \mid Z_{0,l,l-1}]\bigg] = \mathbb{E}\bigg[\sum_{n \in \mathbb{N}} \mathbb{E}[\mathbb{I}_{\{\check{\tau}_{l,l-1} > n\}}\, \widehat{\psi}_{l,l-1}(Z_{n,l,l-1}) \mid Z_{0,l,l-1}]\bigg]$$
(since $\widehat{\psi}_{l,l-1}(Z_{n,l,l-1}) = 0$ once the chains have met)
$$= \sum_{n \in \mathbb{N}} \mathbb{E}[\mathbb{I}_{\{\check{\tau}_{l,l-1} > n\}}\, \widehat{\psi}_{l,l-1}(Z_{n,l,l-1})]$$
$$\leq \sum_{n \in \mathbb{N}} \mathbb{E}[\mathbb{I}_{\{\check{\tau}_{l,l-1} > n\}}]^{\epsilon/(2+\epsilon)}\, \mathbb{E}[\widehat{\psi}_{l,l-1}(Z_{n,l,l-1})^{(2+\epsilon)/2}]^{2/(2+\epsilon)}$$
$$\leq C \sum_{n \in \mathbb{N}} \big(\rho^{\epsilon/(2+\epsilon)}\big)^n\, \mathbb{E}[\widehat{\psi}_{l,l-1}(Z_{n,l,l-1})^{(2+\epsilon)/2}]^{2/(2+\epsilon)},$$
where we have used Hölder's inequality to go to the third line and (A2) to go to the fourth line. Now, by similar calculations to those that lead to (61), we have that
$$\mathbb{E}[\widehat{\psi}_{l,l-1}(Z_{n,l,l-1})^{(2+\epsilon)/2}]^{2/(2+\epsilon)} \leq C\, (\|\varphi\|_\infty \vee \|\varphi\|_{\mathrm{Lip}})^2\, (n+1)\, \Delta_l^{\beta \wedge \bar{\beta}},$$
and hence that
$$\mathbb{E}\bigg[\sum_{n \in \mathbb{N}} \mathbb{E}[\widehat{\psi}_{l,l-1}(Z_{n,l,l-1}) \mid Z_{0,l,l-1}]\bigg] \leq C\, (\|\varphi\|_\infty \vee \|\varphi\|_{\mathrm{Lip}})^2\, \Delta_l^{\beta \wedge \bar{\beta}}. \qquad (62)$$
Then combining (61)-(62) with (60) yields
$$T_1 \leq C\, (\|\varphi\|_\infty \vee \|\varphi\|_{\mathrm{Lip}})^2\, \Delta_l^{\beta \wedge \bar{\beta}},$$
and hence the proof is concluded.

Lemma A.5.
Assume (A1-5). Then there exists a $C \in (0, \infty)$ such that for any $(l, \varphi) \in \mathbb{N} \times \mathcal{B}_b(\mathsf{X}) \cap \mathrm{Lip}_{\tilde{d}}(\mathsf{X})$ we have:
$$\mathbb{E}\big[\widehat{[\pi_l - \pi_{l-1}]}(\varphi)^2\big] \leq C\, (\|\varphi\|_\infty \vee \|\varphi\|_{\mathrm{Lip}})^2\, \Delta_l^{\beta \wedge \bar{\beta}},$$
where $\tilde{d}$ is as in (A3), $\beta$ is as in (A4) and $\bar{\beta}$, $\epsilon$ are as in (A5).

Proof. Using the decomposition in (56) along with the $C_2$-inequality, we have the upper bound
$$\mathbb{E}\big[\widehat{[\pi_l - \pi_{l-1}]}(\varphi)^2\big] \leq C\Big(\mathbb{E}[\{\varphi(X_{0,l}) - \varphi(X_{0,l-1})\}^2] + \mathbb{E}[\{M_{\check{\tau}_{l,l-1},l} - M_{\check{\tau}_{l,l-1},l-1}\}^2]$$
$$+\ \mathbb{E}\big[\big(K_l(\widehat{\varphi}_l)(X_{0,l}) - K_l(\widehat{\varphi}_l)(W_{0,l}) - K_{l-1}(\widehat{\varphi}_{l-1})(X_{0,l-1}) + K_{l-1}(\widehat{\varphi}_{l-1})(W_{0,l-1})\big)^2\big]\Big).$$
The first term on the R.H.S. can be treated by using $\varphi \in \mathrm{Lip}_{\tilde{d}}(\mathsf{X})$ and Lemma A.3; for the second term one can use Lemma A.4; and for the last term one can use very similar calculations to those used to derive (61). The proof is thus completed.

Lemma A.6. Assume (A1-2). Then there exists a $C \in (0, \infty)$ such that for any $\varphi \in \mathcal{B}_b(\mathsf{X})$ we have:
$$\mathbb{E}\big[\widehat{\pi_0(\varphi)}^2\big] \leq C\, \|\varphi\|_\infty^2.$$

Proof.
We have
$$\widehat{\pi_0(\varphi)} = \varphi(X_{0,0}) + \check{M}_{\check{\tau}_0, 0} + K_0(\widehat{\varphi}_0)(X_{0,0}) - K_0(\widehat{\varphi}_0)(W_{0,0}),$$
where $\widehat{\varphi}_0(x) = \sum_{n \in \mathbb{Z}_+} [K_0^n - \pi_0](\varphi)(x)$, and for any $n \in \mathbb{N}$,
$$\check{M}_{n,0} = \sum_{j=1}^n \{\widehat{\varphi}_0(X_{j,0}) - K_0(\widehat{\varphi}_0)(X_{j-1,0}) - \widehat{\varphi}_0(W_{j,0}) + K_0(\widehat{\varphi}_0)(W_{j-1,0})\},$$
with $\check{M}_{0,0} := 0$. One has, by using the $C_2$-inequality and the fact that $\varphi \in \mathcal{B}_b(\mathsf{X})$ along with (A1),
$$\mathbb{E}\big[\widehat{\pi_0(\varphi)}^2\big] \leq C\big(\|\varphi\|_\infty^2 + \mathbb{E}[\check{M}^2_{\check{\tau}_0, 0}]\big).$$
Then, by using an argument similar to that used to derive (57), we have
$$\mathbb{E}[\check{M}^2_{\check{\tau}_0, 0}] \leq \mathbb{E}\bigg[\sum_{j=1}^{\check{\tau}_0} \{\widehat{\varphi}_0(X_{j,0}) - K_0(\widehat{\varphi}_0)(X_{j-1,0}) - \widehat{\varphi}_0(W_{j,0}) + K_0(\widehat{\varphi}_0)(W_{j-1,0})\}^2\bigg].$$
The proof is now completed, as the summands on the R.H.S. are upper-bounded by $C\|\varphi\|_\infty^2$ and the expectation of the stopping time is finite via (A2).

References
[1] Agapiou, S., Roberts, G. O. & Vollmer, S. (2018). Unbiased Monte Carlo: posterior estimation for intractable/infinite-dimensional models. Bernoulli, 24, 1726-1786.
[2] Beskos, A., Jasra, A., Law, K. J. H., Marzouk, Y. & Zhou, Y. (2018). Multilevel sequential Monte Carlo with dimension-independent likelihood-informed proposals. SIAM/ASA J. Uncertain. Quantif., 6, 762-786.
[3] Beskos, A., Jasra, A., Law, K. J. H., Tempone, R. & Zhou, Y. (2017). Multilevel sequential Monte Carlo samplers. Stoch. Proc. Appl., 127, 1417-1440.
[4] Bou-Rabee, N., Eberle, A. & Zimmer, R. (2020). Coupling and convergence for Hamiltonian Monte Carlo. Ann. Appl. Probab., 30, 1209-1250.
[5] Brenner, S. & Scott, R. (2007). The Mathematical Theory of Finite Element Methods. Springer: New York.
[6] Ciarlet, P. G. (2002). The Finite Element Method for Elliptic Problems. SIAM: Philadelphia.
[7] Cotter, S. L., Roberts, G. O., Stuart, A. M. & White, D. (2013). MCMC methods for functions: modifying old algorithms to make them faster. Stat. Sci., 28, 424-446.
[8] Duane, S., Kennedy, A. D., Pendleton, B. J. & Roweth, D. (1987). Hybrid Monte Carlo. Phys. Lett. B, 195, 216-222.
[9] Franks, J., Jasra, A., Law, K. J. H., Chada, N. & Vihola, M. (2018). Unbiased inference for discretely observed hidden Markov model diffusions. arXiv preprint.
[10] Glynn, P. W. & Rhee, C. H. (2014). Exact estimation for Markov chain equilibrium expectations. J. Appl. Probab., 51A, 377-389.
[11] Gower, R. M., Loizou, N., Qian, X., Sailanbayev, A., Shulgin, E. & Richtarik, P. (2019). SGD: general analysis and improved rates. Proceedings of the 36th International Conference on Machine Learning, PMLR, 5200-5209.
[12] Heng, J. & Jacob, P. (2019). Unbiased Hamiltonian Monte Carlo with couplings. Biometrika, 106, 287-302.
[13] Heng, J., Houssineau, J. & Jasra, A. (2021). On unbiased score estimation for partially observed diffusions. Work in progress.
[14] Jacob, P., O'Leary, J. & Atchadé, Y. (2020). Unbiased Markov chain Monte Carlo with couplings (with discussion). J. R. Statist. Soc. Ser. B, 82, 543-600.
[15] Jacob, P., Lindsten, F. & Schön, T. (2020). Smoothing with couplings of conditional particle filters. J. Amer. Statist. Assoc., 115, 721-729.
[16] Jasra, A., Heng, J. & Law, K. J. H. (2020). Discussion of Jacob et al. J. R. Statist. Soc. Ser. B, 82, 586-587.
[17] Jasra, A., Law, K. J. H. & Yu, F. (2020). Unbiased filtering of a class of partially observed diffusions. arXiv preprint.
[18] Jasra, A., Kamatani, K., Law, K. J. H. & Zhou, Y. (2017). Multilevel particle filters. SIAM J. Numer. Anal., 55, 3068-3096.
[19] Jasra, A., Law, K. J. H. & Xu, Y. (2021). Markov chain simulation for multilevel Monte Carlo. Found. Data Sci. (to appear).
[20] Jasra, A., Law, K. J. H. & Lu, D. (2021). Unbiased estimation of the gradient of the log-likelihood in inverse problems. Stat. Comp. (to appear).
[21] Johnson, V. (1996). Studying convergence of Markov chain Monte Carlo algorithms using coupled sample paths. J. Amer. Statist. Assoc., 91, 154-166.
[22] Lindvall, T. & Rogers, L. (1986). Coupling of multidimensional diffusions by reflection. Ann. Probab., 14, 860-872.
[23] Kushner, H. & Yin, G. G. (2003). Stochastic Approximation and Recursive Algorithms and Applications. Springer: New York.
[24] Maier, B. F. & Brockmann, D. (2020). Effective containment explains sub-exponential growth in recent confirmed COVID-19 cases in China. Science, 368, 742-746.
[25] McLeish, D. (2011). A general method for debiasing a Monte Carlo estimator. Monte Carlo Meth. Appl., 17, 301-315.
[26] Moustakides, G. (1999). An extension of Wald's first lemma for Markov processes. J. Appl. Probab., 36, 48-59.
[27] Neal, R. M. (1998). Regression and classification using Gaussian process priors. In Bayesian Statistics 6 (Bernardo et al., eds), 475-501. Oxford: OUP.
[28] Rhee, C. H. & Glynn, P. (2015). Unbiased estimation with square root convergence for SDE models. Op. Res., 63, 1026-1043.
[29] Stuart, A. M. (2010). Inverse problems: a Bayesian perspective. Acta Numerica, 19, 451-559.
[30] Süli, E. & Mayers, D. F. (2003). An Introduction to Numerical Analysis. Cambridge: CUP.
[31] Tarantola, A. (2005). Inverse Problem Theory and Methods for Model Parameter Estimation. SIAM: Philadelphia.
[32] Thorisson, H. (2000). Coupling, Stationarity and Regeneration. Springer: New York.
[33] Vihola, M. (2018). Unbiased estimators and multilevel Monte Carlo. Op. Res., 66, 448-462.