On Unbiased Estimation for Discretized Models
Jeremy Heng, Ajay Jasra, Kody J. H. Law, Alexander Tarakanov
BY JEREMY HENG, AJAY JASRA, KODY J. H. LAW & ALEXANDER TARAKANOV
ESSEC Business School, Singapore, 139408, SG. E-Mail: [email protected]
Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology, Thuwal, 23955, KSA. E-Mail: [email protected]
School of Mathematics, University of Manchester, Manchester, M13 9PL, UK. E-Mail: [email protected]; [email protected]
Abstract
In this article, we consider computing expectations w.r.t. probability measures which are subject to discretization error. Examples include partially observed diffusion processes or inverse problems, where one may have to discretize time and/or space in order to practically work with the probability of interest. Given access only to these discretizations, we consider the construction of unbiased Monte Carlo estimators of expectations w.r.t. such target probability distributions. It is shown how to obtain such estimators using a novel adaptation of randomization schemes and Markov simulation methods. Under appropriate assumptions, these estimators possess finite variance and finite expected cost. There are two important consequences of this approach: (i) unbiased inference is achieved at the canonical complexity rate, and (ii) the resulting estimators can be generated independently, thereby allowing strong scaling to arbitrarily many parallel processors. Several algorithms are presented and applied to some examples of Bayesian inference problems, with both simulated and real observed data.
Key words: Randomization Methods; Markov chain Monte Carlo; Bayesian Inverse Problems.
1 Introduction

Consider a probability measure $\pi$ on a measurable space $(\mathsf{X}, \mathcal{X})$ for which one wants to compute $\pi(\varphi) := \int_{\mathsf{X}} \varphi(x)\,\pi(dx)$ with $\varphi : \mathsf{X} \to \mathbb{R}$ a $\pi$-integrable function. Suppose one can only deal with a sequence of biased probability measures $(\pi_l)_{l \in \mathbb{Z}_+}$ on $(\mathsf{X}, \mathcal{X})$, with $\pi_l(|\varphi|) < +\infty$ for all $l \in \mathbb{Z}_+$, such that $\lim_{l\to\infty} \pi_l(\varphi) = \pi(\varphi)$ and $|\pi_{l+1}(\varphi) - \pi(\varphi)| \le |\pi_l(\varphi) - \pi(\varphi)|$; examples include partially observed diffusion processes, e.g. [18], or inverse problems, e.g. [3]. These latter models have a wide range of real applications, for instance in engineering, finance and applied mathematics; see [29].

In many applications of interest, one often resorts to constructing a $\pi_l$-invariant and ergodic Markov chain Monte Carlo (MCMC) kernel $K_l$ to estimate the expectation $\pi_l(\varphi) = \int_{\mathsf{X}} \varphi(x)\,\pi_l(dx)$. It is often the case that as $l$ grows, the cost of applying $K_l$ will also increase, often exponentially in $l$. Therefore one would often fix $l$ to achieve a given bias, and run the Markov chain for long enough to obtain a pre-specified variance which balances the bias. In this article, we consider the task of producing unbiased estimators with finite variance. In particular, using a stochastic simulation scheme based upon a family of Markov kernels $(K_l)_{l\in\mathbb{Z}_+}$, one can construct an estimator $\widehat{\pi(\varphi)}$ such that $\mathbb{E}[\widehat{\pi(\varphi)}] = \pi(\varphi)$ and $\mathrm{Var}[\widehat{\pi(\varphi)}] < \infty$, where $\mathbb{E}$ and $\mathrm{Var}$ denote expectation and variance w.r.t. the law of the stochastic scheme to be developed, respectively. This scheme is of interest for several reasons:
1. One can produce unbiased estimators of score functions which can be employed within stochastic gradient algorithms to perform parameter inference.
2. One can simulate i.i.d. replicates of such unbiased estimators in parallel and combine them to construct lower variance estimators in a static context (sometimes referred to as strong parallel scaling).
3. The method provides a benchmark for other computations.
In terms of the first point, it is often simpler to verify the validity of stochastic gradient algorithms when the estimate of the noisy gradient is unbiased. The second point means that the variance can be reduced proportionally to the number of available processors, for the same fixed expected cost per processor. The third point means that one can check the precision of biased methods against the results.

The approach that we follow is based upon an idea that was outlined in [16] and belongs to the class of doubly-randomized estimators; more specifically, estimators which arise from applying randomization of the type [25, 28] (see also [33]) twice, in a nested fashion. The baseline version of the randomized estimators of [25, 28] places a probability distribution $P_L$ over the level of discretization $l$. Given a simulation from this probability distribution, one way to obtain unbiased estimates of $\pi(\varphi)$ is to unbiasedly estimate $\pi_l(\varphi) - \pi_{l-1}(\varphi)$ for $l \in \mathbb{N}$, or $\pi_0(\varphi)$ if one samples $l = 0$. Denoting these estimators by $\xi_l$, the so-called "single-term" estimator is given by $\xi_L / P_L(L)$, where $L \sim P_L$. In the inference context, it is challenging to obtain unbiased estimators, and this is where the second randomization comes into the picture. For $\xi_0$, one can use the recently introduced unbiased MCMC scheme of [14] (see also [10]). This estimator is built by truncating an infinite series of increments of coupled MCMC chains once the chains meet.
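To fix ideas, the following is a minimal Python sketch of the single-term construction, assuming a hypothetical routine increment_estimator(l, rng) that returns an unbiased estimate of $\pi_l(\varphi) - \pi_{l-1}(\varphi)$ (of $\pi_0(\varphi)$ when $l = 0$); the truncation of the support of $P_L$ is only for illustration.

```python
import numpy as np

def single_term_estimate(increment_estimator, p_l, l_max=20, rng=None):
    # Truncated proxy for the probability mass function P_L on {0, ..., l_max}.
    rng = np.random.default_rng() if rng is None else rng
    probs = np.array([p_l(l) for l in range(l_max + 1)], dtype=float)
    probs /= probs.sum()
    L = rng.choice(l_max + 1, p=probs)            # sample the level L ~ P_L
    # Inverse-probability weighting makes the expectation telescope to pi(phi).
    return increment_estimator(L, rng) / probs[L]
```

Conditioning on $L$, the expectation of this estimator is $\sum_l P_L(l)\,\mathbb{E}[\xi_l]/P_L(l)$, which telescopes to $\pi(\varphi)$ provided the increment estimates are unbiased and a summability condition holds; this is formalized in Section 2.3.1.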
The main complication is then to unbiasedly estimate $\pi_l(\varphi) - \pi_{l-1}(\varphi)$. It will typically not suffice to estimate $\pi_l(\varphi)$ and $\pi_{l-1}(\varphi)$ independently, because the resulting estimator would often have infinite variance. Therefore an additional technique is required. The main contribution of this article is to develop several novel coupled MCMC schemes that can produce unbiased estimates $\xi_l$ for $l > 0$ such that the resulting estimator of $\pi(\varphi)$ is unbiased and of finite variance. The latter properties are proved mathematically under assumptions. We also implement our proposed algorithms on several challenging statistical applications.

The idea of using doubly-randomized estimators has appeared in several recent works. The work [20] utilizes a "coupled-sum" estimator over sample size to debias multilevel estimators of the type introduced in [3], which are then utilized in the framework described above. That method is applicable to static/non-dynamic problems where one can evaluate the target distribution, up to a normalizing constant, like the method we introduce here. The work [17] uses a "single-term" estimator over sample sizes in order to debias estimators of the type introduced in [18]. Those estimators are designed for online inference in dynamic problems, such as state space models, and partially observed diffusion processes in particular. The methodology in [20] has infinite expected cost, whereas this is not always the case for the method introduced in this paper. In a companion paper [13], we show how to extend the framework of this article to the context of partially observed diffusion processes. A possible alternative to our approach would be that of [1].

This article is structured as follows. In Sections 2.2 and 2.3, the precise problem is stated and our strategy outlined. We show in Section 2.4, under assumptions, that our general approach can produce unbiased and finite variance estimators with finite expected costs. In Section 3, we present some specific Markov kernels which fall under our general framework. We illustrate our methodology on several numerical examples in Section 4. The proofs of our mathematical results are given in Appendix A.

2 Approach

2.1 Notation

Let $(\mathsf{X}, \mathcal{X})$ be a measurable space. We write $\mathcal{B}_b(\mathsf{X})$ to denote the collection of bounded measurable functions $\varphi : \mathsf{X} \to \mathbb{R}$ and, if $\mathsf{X} \subseteq \mathbb{R}^d$, $L^2(\mathsf{X})$ as the collection of square Lebesgue-integrable functions. For $\varphi \in \mathcal{B}_b(\mathsf{X})$, we write the supremum norm as $\|\varphi\|_\infty = \sup_{x \in \mathsf{X}} |\varphi(x)|$. We denote the Borel sets on $\mathbb{R}^d$ as $\mathcal{B}(\mathbb{R}^d)$. The $d$-dimensional Lebesgue measure is written as $dx$. For a metric $d : \mathsf{X} \times \mathsf{X} \to \mathbb{R}_+$ on $\mathsf{X}$, $\mathrm{Lip}_d(\mathsf{X})$ denotes the Lipschitz functions $\varphi : \mathsf{X} \to \mathbb{R}$ (with finite Lipschitz constants), that is, for every $(x, w) \in \mathsf{X} \times \mathsf{X}$, $|\varphi(x) - \varphi(w)| \le \|\varphi\|_{\mathrm{Lip}}\, d(x, w)$. $\mathcal{P}(\mathsf{X})$ denotes the collection of probability measures on $(\mathsf{X}, \mathcal{X})$. For a finite measure $\mu$ on $(\mathsf{X}, \mathcal{X})$ and a $\varphi \in \mathcal{B}_b(\mathsf{X})$, the notation $\mu(\varphi) = \int_{\mathsf{X}} \varphi(x)\,\mu(dx)$ is used. For $(\mathsf{X} \times \mathsf{W}, \mathcal{X} \vee \mathcal{W})$ a measurable space and $\mu$ a non-negative finite measure on this space, we use the tensor-product of functions notation for $(\varphi, \psi) \in \mathcal{B}_b(\mathsf{X}) \times \mathcal{B}_b(\mathsf{W})$: $\mu(\varphi \otimes \psi) = \int_{\mathsf{X}\times\mathsf{W}} \varphi(x)\psi(w)\,\mu(d(x,w))$. Given a Markov kernel $K : \mathsf{X} \to \mathcal{P}(\mathsf{X})$ and a finite measure $\mu$, we use the notations $\mu K(dx') = \int_{\mathsf{X}} \mu(dx) K(x, dx')$ and $K(\varphi)(x) = \int_{\mathsf{X}} \varphi(x') K(x, dx')$ for $\varphi \in \mathcal{B}_b(\mathsf{X})$.
The iterated kernel is $K^n(x_0, dx_n) = \int_{\mathsf{X}^{n-1}} \prod_{i=1}^n K(x_{i-1}, dx_i)$. For $A \in \mathcal{X}$, the indicator function is written as $\mathbb{I}_A(x)$. $\mathbb{Z}_+$ is the set of non-negative integers. For $(\mu, \nu) \in \mathcal{P}(\mathsf{X}) \times \mathcal{P}(\mathsf{X})$, $\|\mu - \nu\|_{\mathrm{tv}} := \sup_{A \in \mathcal{X}} |\mu(A) - \nu(A)|$ is the total variation distance. $\mathcal{N}_d(\mu, \Sigma)$ is the $d$-dimensional Gaussian distribution with mean $\mu$ and covariance $\Sigma$, with corresponding Lebesgue density written as $x \mapsto \phi_d(x; \mu, \Sigma)$. $\mathcal{U}_A$ denotes the uniform distribution on a measurable set $A$. $0_d$ denotes the $d$-dimensional column vector of zeros. $I_d$ denotes the $d \times d$ identity matrix. The transpose of a vector or matrix $x$ is denoted as $x^T$. For a set $C \subset \mathbb{R}^d$, the $L^2(C)$ norm of $f$ is written as $\|f\|^2 = \int_C f(x)^2\,dx$, and the space of square integrable functions on $C$ is denoted by $L^2(C) = \{f : C \to \mathbb{R} : \|f\| < \infty\}$. For a vector $x \in \mathbb{R}^d$, its Euclidean norm is also written as $\|x\|$.

2.2 Motivating Example

A particular Bayesian inverse problem associated to partial differential equations (PDEs) is now introduced as a motivating example. The objective is to infer the permeability field associated to a porous medium, based on pressure measurements of the fluid flow governed by Darcy's law. This example is prototypical in the context of subsurface inversion, with applications ranging from oil recovery to contaminant transport in groundwater [31, 29].

2.2.1 Problem Formulation

Let $C \subset \mathbb{R}^D$ with the boundary $\partial C$ convex and once continuously differentiable, and suppose $f \in L^2(C)$. Consider the following PDE for the pressure field $h$ on $C$:

$-\nabla \cdot (\Phi \nabla h) = f$ on $C$, with $h = 0$ on $\partial C$,   (1)

where, for $t \in C$, the permeability is

$\Phi(t; X) := \bar{\Phi}(t) + \sum_{j=1}^d X_j \vartheta_j v_j(t).$

The known forcing $f$ can represent, e.g., injection and/or extraction of fluid from wells. In the above:
• $X = (X_1, \ldots, X_d)$, with $X_j$ i.i.d. $\sim \mathcal{U}_{[-1,1]}$. This determines the prior distribution for $X$ on the state space $\mathsf{X} = [-1,1]^d$.
• $\bar{\Phi} : C \to \mathbb{R}$, and for $j \in \{1, \ldots, d\}$, $v_j : C \to \mathbb{R}$ with $\sup_{t \in C} |v_j(t)| \le 1$, and $\vartheta_j \in \mathbb{R}_+$.
• $h(\cdot\,; X)$ (or just $h(X)$) denotes the weak solution of (1) for a given $X$.
We remark that one can allow $d \to \infty$ in the above if the $\vartheta_j$ decay to zero sufficiently fast with $j$; see [2, 3] and the references therein for further details. The following will be assumed.

(H1) There exists a $\Phi_\star > 0$ such that $\inf_{t \in C} \bar{\Phi}(t) \ge \sum_{j=1}^d \vartheta_j + \Phi_\star$.

This assumption guarantees that for all $x \in \mathsf{X}$, $\inf_{t \in C} \Phi(t; x) \ge \Phi_\star$, hence there is a well-defined and unique weak solution $h(x) \in L^2(C)$, and $\sup_{x \in \mathsf{X}} \|h(x)\| \le C$ for some $C > 0$ [6].

Define the vector-valued function $G : \mathsf{X} \to \mathbb{R}^P$ by

$G : x \mapsto G(x) = [g_1(h(x)), \ldots, g_P(h(x))],$   (2)

where the $g_p : L^2(C) \to \mathbb{R}$ are bounded linear functionals on $L^2(C)$ for $p \in \{1, \ldots, P\}$. It is assumed that the data $y \in \mathbb{R}^P$ take the form

$Y \mid (X = x) \sim \mathcal{N}_P(G(x), \theta^{-1} I_P),$   (3)

where $\theta > 0$ is a parameter that we will be interested in inferring. In fact, unbiased estimators are particularly useful in this context, and this will be considered in the numerical examples of Section 4. We simplify notation by suppressing explicit dependence on the parameter $\theta$ and the data $y$, and write the un-normalized Lebesgue density of $X$ for fixed $y$ and $\theta$ as

$\gamma(x) = \exp\{-\tfrac{\theta}{2}\|y - G(x)\|^2\}\, \mathbb{I}_{\mathsf{X}}(x),$   (4)

and the normalized density as $\pi(x) = \gamma(x) / \int_{\mathsf{X}} \gamma(x)\,dx$.
These densities will be written as $\gamma_\theta(x)$ and $\pi_\theta(x)$ when we consider inference for $\theta$. This posterior distribution is in general intractable due to the nonlinear dependence of $y$ on $x$, even if the PDE were to admit an analytical solution, and one must resort to computationally intensive inference methods such as MCMC. A further complication is that the analytical solution of the PDE is in general not available, so one must resort to numerical approximations, which will be discussed in the next section.

2.2.2 Finite Element Discretization

For simplicity we present the case $D = 1$ and $C = [0, 1]$, but the extension to higher dimensions is straightforward; see e.g. [2, 5]. The PDE problem at resolution level $l$ is solved using a finite element method (FEM) with piecewise linear shape functions on a uniform mesh of width $\Delta_l = 2^{-(l+l_0)}$, for $l \in \{0, 1, \ldots\}$ and a fixed $l_0$ determining the maximal mesh width. In particular, the finite element basis functions $\{\psi_i^l\}_{i=1}^{\Delta_l^{-1}-1}$ on level $l$ are defined as follows for $t_i = i\Delta_l$:

$\psi_i^l(t) = \Delta_l^{-1}[t - (t_i - \Delta_l)]$ if $t \in [t_i - \Delta_l, t_i]$, and $\psi_i^l(t) = \Delta_l^{-1}[t_i + \Delta_l - t]$ if $t \in [t_i, t_i + \Delta_l]$,

with $\psi_i^l(t) = 0$ otherwise. To solve the PDE for a given $x \in \mathsf{X}$, $h_l(x) = \sum_{i=1}^{\Delta_l^{-1}-1} h_i^l(x) \psi_i^l$ is substituted into (1), and projected onto each basis element:

$-\Big\langle \nabla \cdot \Big(\Phi \nabla \sum_{i=1}^{\Delta_l^{-1}-1} h_i^l(x) \psi_i^l\Big), \psi_j^l \Big\rangle = \langle f, \psi_j^l \rangle.$

We introduce the matrix $A^l(x)$ with entries $A_{ij}^l(x) = \langle \Phi(x) \nabla \psi_i^l, \nabla \psi_j^l \rangle$, and vectors $h^l(x)$, $f^l$ with entries $h_i^l(x)$ and $f_i^l = \langle f, \psi_i^l \rangle$, respectively. Solving the discretized problem amounts to solving the linear system

$A^l(x)\, h^l(x) = f^l.$   (5)

Define $G_l(x) = [g_1(h_l(x)), \ldots, g_P(h_l(x))]$. We denote the corresponding approximated un-normalized density by

$\gamma_l(x) = \exp\{-\tfrac{\theta}{2}\|y - G_l(x)\|^2\}\, \mathbb{I}_{\mathsf{X}}(x),$   (6)

and the approximated normalized density by $\pi_l(x) = \gamma_l(x) / \int_{\mathsf{X}} \gamma_l(x)\,dx$.
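As an illustration of the discretization just described, the following is a minimal Python sketch of assembling and solving (5) in the case $D = 1$, $C = [0, 1]$, under the simplifying assumptions of midpoint quadrature for the permeability and load integrals and a dense linear solver (a sparse tridiagonal solver, scaling as $O(\Delta_l^{-1})$, would be used in practice).

```python
import numpy as np

def fem_solve(phi, f, l, l0=3):
    """Solve -(phi h')' = f with h(0) = h(1) = 0 on a mesh of width 2^-(l+l0)."""
    delta = 2.0 ** -(l + l0)
    n = round(1.0 / delta) - 1                 # number of interior nodes
    t = np.linspace(0.0, 1.0, n + 2)           # mesh nodes t_0, ..., t_{n+1}
    mid = 0.5 * (t[:-1] + t[1:])               # element midpoints for quadrature
    k = phi(mid) / delta                       # elementwise <phi psi_i', psi_j'>
    A = np.diag(k[:-1] + k[1:]) - np.diag(k[1:-1], 1) - np.diag(k[1:-1], -1)
    b = delta * f(t[1:-1])                     # quadrature for <f, psi_i>
    h = np.zeros(n + 2)
    h[1:-1] = np.linalg.solve(A, b)            # dense solve, for clarity only
    return t, h
```

For example, fem_solve(lambda t: 1.0 + 0.0 * t, lambda t: np.sin(np.pi * t), l=4) returns the mesh and the approximate pressure field at level 4.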
We now present some fundamental convergence results relating to this approximation, which are crucial for the application of our proposed methodology.

Proposition 2.1.
Assume (H1). Then:
1. For all $x \in \mathsf{X}$, $\lim_{l\to\infty} h_l(x) = h(x)$. In addition, there exists a $C \in (0, \infty)$ such that for every $(l, x) \in \mathbb{Z}_+ \times \mathsf{X}$:

$\|h_l(x)\|_\infty \le C$ and $\|h_l(x) - h(x)\| \le C \Delta_l^\beta$   (7)

with $\beta = 2$.
2. For all $x \in \mathsf{X}$, $\lim_{l\to\infty} \gamma_l(x) = \gamma(x)$. In addition, there exists a $(C, \tilde{l}) \in (0, \infty) \times \mathbb{Z}_+$ such that for every $(l, x) \in \{\tilde{l}, \tilde{l}+1, \ldots\} \times \mathsf{X}$:

$|\gamma_l(x) - \gamma(x)| \le C \Delta_l^\beta$ and $|\pi_l(x) - \pi(x)| \le C \Delta_l^\beta$   (8)

with $\beta = 2$.

Proof. The first part in (7) is a standard result in finite element methods [5, 6]. For the second, recall that for $p \in \{1, \ldots, P\}$, each $g_p$ is a bounded linear functional. Using this fact and (7), it is straightforward to establish (8); see e.g. [3, 20] and the references therein.

2.3 Methodology

2.3.1 Randomized Estimators

We now describe our strategy to construct unbiased estimators of $\pi(\varphi)$. Consider a positive probability mass function $P_L$ on $\mathbb{Z}_+$. It is known [28, 33] that if one can find a sequence of independent random variables $(\xi_l)_{l\ge0}$, independent of $L \sim P_L$, such that

$\mathbb{E}[\xi_0] = \pi_0(\varphi),$   (9)

$\mathbb{E}[\xi_l] = \pi_l(\varphi) - \pi_{l-1}(\varphi), \quad \forall l \in \mathbb{N},$   (10)

$\sum_{l \in \mathbb{Z}_+} \mathbb{E}[\xi_l^2] / P_L(l) < +\infty,$   (11)

then

$\widehat{\pi(\varphi)}_S := \xi_L / P_L(L)$   (12)

is an unbiased and finite variance estimator of $\pi(\varphi)$. This is the 'single term' estimator as discussed by [28, 33], and alternatives such as the 'independent sum' estimator are also possible. In this latter case, if one can construct independent random variables $(\xi_l)_{l\ge0}$ that are independent of $L \sim P_L$, which satisfy (9)-(10) and additionally

$\sum_{l \in \mathbb{Z}_+} \{\mathrm{Var}[\xi_l] + (\pi_l(\varphi) - \pi(\varphi))^2\} / \bar{P}_L(l) < +\infty,$   (13)

where $\bar{P}_L(l) = \sum_{k \ge l} P_L(k)$, then

$\widehat{\pi(\varphi)}_I := \sum_{l=0}^L \xi_l / \bar{P}_L(l)$   (14)

is also an unbiased estimator of $\pi(\varphi)$ with finite variance. Typically, one will run $N \in \mathbb{N}$ independent replicates of either (12) or (14) and then use the average

$\frac{1}{N} \sum_{i=1}^N \big(\widehat{\pi(\varphi)}_k\big)^i,$

where $k \in \{S, I\}$ and $(\widehat{\pi(\varphi)}_k)^i$ represents the $i$-th independent replicate of the estimate. We note that this idea was mentioned in [16] and also used in various different ways in [9, 17, 20]. The main point of these schemes is that one can completely remove the discretization bias (represented by $l$) associated to, for example, Euler discretizations of stochastic differential equations or FEM discretizations of PDEs, whilst only working with biased versions of $\pi$, denoted $\pi_l$, with $l < \infty$. In addition, the method is completely parallelizable, as one can run each replicate independently.

We remark that the condition (10) is not a necessary one (in the case of (12), but it is needed for (14)), but is certainly sufficient. In many contexts, satisfying (10) is not trivial, as exact simulation from any of the distributions $(\pi_l)_{l\in\mathbb{Z}_+}$ is often not possible. In the work of [10, 14] (see also [12, 15]), the authors consider a methodology to unbiasedly estimate $\pi_l(\varphi)$ for each $l$, which we shall build upon. In order to satisfy (11) (or (13)), it will typically not be sufficient (or at least efficient) to run two independent unbiased MCMC algorithms to estimate $\pi_l(\varphi)$ and $\pi_{l-1}(\varphi)$ respectively. To see this, let $l \in \mathbb{N}$ be given and consider the easier situation where one can sample exactly from $\pi_l$ and $\pi_{l-1}$. Then an unbiased estimator of $[\pi_l - \pi_{l-1}](\varphi)$ is given by

$\xi_l = \frac{1}{N} \sum_{n=1}^N \{\varphi(X^n) - \varphi(W^n)\},$

where $(X^n)_{n\in\mathbb{N}}$ are i.i.d. from $\pi_l$ and $(W^n)_{n\in\mathbb{N}}$ are i.i.d. from $\pi_{l-1}$.
In this case, we have

$\mathbb{E}[\xi_l^2] = \frac{\mathrm{Var}_{\pi_l}[\varphi(X)] + \mathrm{Var}_{\pi_{l-1}}[\varphi(W)]}{N} + \big([\pi_l - \pi_{l-1}](\varphi)\big)^2.$

In practice, it will typically be difficult to choose $P_L$ so that the estimator would satisfy (11) and have finite expected cost, unless one takes $N = O(2^l)$. We shall introduce a novel solution to circumvent this difficulty. To motivate our approach, we begin by recalling the methodology in [10, 14].

2.3.2 Unbiased MCMC

We consider the unbiased estimation of $\pi_l(\varphi)$ with $l \in \mathbb{Z}_+$ fixed. Suppose we have a $\pi_l$-invariant, ergodic MCMC kernel $K_l : \mathsf{X} \to \mathcal{P}(\mathsf{X})$ and an initial distribution $\nu_l \in \mathcal{P}(\mathsf{X})$. Let $\tilde{\nu}_l \in \mathcal{P}(\mathsf{X}^2)$ be a coupling of $\nu_l$, i.e. $\int_{w\in\mathsf{X}} \tilde{\nu}_l(d(x,w)) = \nu_l(dx)$ and $\int_{x\in\mathsf{X}} \tilde{\nu}_l(d(x,w)) = \nu_l(dw)$. The idea is to run a Markov chain $(Z_{n,l})_{n\in\mathbb{Z}_+} = (X_{n,l}, W_{n,l})_{n\in\mathbb{Z}_+}$ on $\mathsf{X} \times \mathsf{X}$, initialized from

$\check{\nu}_l(d(x,w)) := \int_{\mathsf{X}^2} \tilde{\nu}_l(d(x',w'))\, K_l(x', dx)\, \delta_{\{w'\}}(dw),$   (15)

and evolving according to a coupled transition kernel $\check{K}_l : \mathsf{X} \times \mathsf{X} \to \mathcal{P}(\mathsf{X} \times \mathsf{X})$ satisfying

$\int_{w'\in\mathsf{X}} \check{K}_l((x,w), d(x',w')) = K_l(x, dx'), \qquad \int_{x'\in\mathsf{X}} \check{K}_l((x,w), d(x',w')) = K_l(w, dw').$

The process is simulated as follows.
1. Sample $Z_{0,l}' = (X_{0,l}', W_{0,l}')$ from $\tilde{\nu}_l(\cdot)$.
2. Generate $X_{0,l} \mid X_{0,l}'$ according to $K_l(X_{0,l}', \cdot)$ and set $W_{0,l} = W_{0,l}'$.
3. For $n \ge 1$, generate $Z_{n,l} \mid Z_{n-1,l}$ according to $\check{K}_l(Z_{n-1,l}, \cdot)$.
Marginally, the sequences of random variables $(X_{n,l})_{n\in\mathbb{Z}_+}$ and $(W_{n,l})_{n\in\mathbb{Z}_+}$ are coupled time-homogeneous Markov chains with initial distributions $\nu_l K_l$ and $\nu_l$, respectively, and the same transition kernel $K_l$. Define the meeting time

$\tau_l := \inf\{n \ge 1 : X_{n,l} = W_{n,l}\}.$

The coupled chains $(Z_{n,l})_{n\in\mathbb{Z}_+}$ should be constructed so that $\tau_l$ is almost surely finite, at the very least. In addition, we require that the chains remain faithful after meeting, that is $X_{n,l} = W_{n,l}$ for all $n \ge \tau_l$. Under fairly weak assumptions (e.g. [10]), the following is an unbiased estimator of $\pi_l(\varphi)$ for any $k \in \mathbb{Z}_+$:

$\widehat{\pi_l(\varphi)}_k := \varphi(X_{k,l}) + \sum_{n=k+1}^{\tau_l - 1} \{\varphi(X_{n,l}) - \varphi(W_{n,l})\}.$   (16)

We remark that a time-averaged extension is also possible; let $(m, k) \in \mathbb{Z}_+ \times \mathbb{Z}_+$ with $m \ge k$, then one can also use (see [14]) the time-averaged estimator

$\widehat{\pi_l(\varphi)}_{T,k,m} := \frac{1}{m-k+1} \sum_{n=k}^m \varphi(X_{n,l}) + \sum_{n=k+1}^{\tau_l - 1} \Big(1 \wedge \frac{n-k}{m-k+1}\Big) \{\varphi(X_{n,l}) - \varphi(W_{n,l})\},$   (17)

which recovers (16) in the case $m = k$.
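For illustration, a minimal Python sketch of (17) given recorded coupled trajectories is as follows; the lists X and W (with X[n] == W[n] for $n \ge \tau_l$, and X recorded up to time $\max(m, \tau_l - 1)$) and the meeting time tau are assumed to come from a simulation of the procedure above. Setting m = k recovers (16).

```python
import numpy as np

def unbiased_estimate(phi, X, W, k, m, tau):
    # Ergodic average over times k, ..., m (the first term of (17)).
    est = np.mean([phi(X[n]) for n in range(k, m + 1)], axis=0)
    # Bias-correction terms involving both chains, weighted as in (17).
    for n in range(k + 1, tau):
        weight = min(1.0, (n - k) / (m - k + 1.0))
        est = est + weight * (phi(X[n]) - phi(W[n]))
    return est
```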
We now describe a constructive procedure to generate such a coupled Metropolis–Hastings (MH) kernel $\check{K}_l$ [14]. Let $Q_l : \mathsf{X} \to \mathcal{P}(\mathsf{X})$ be a proposal Markov kernel, such that $\pi_l(dx) Q_l(x, dx')$ has a positive density $\pi_l(x) q_l(x, x')$ w.r.t. a dominating measure. We define the MH acceptance probability for $(x, x') \in \mathsf{X} \times \mathsf{X}$ as

$\alpha_l(x, x') := 1 \wedge \frac{\pi_l(x')\, q_l(x', x)}{\pi_l(x)\, q_l(x, x')}.$

For $(x, w) \in \mathsf{X} \times \mathsf{X}$, we define a maximal coupling of the proposal Markov kernels $Q_l(x, dx')$ and $Q_l(w, dw')$:

$\check{Q}_l((x,w), d(x',w')) = S_l(x,w) \int_{u\in\mathsf{X}} \frac{O_l((x,w), du)}{S_l(x,w)}\, \delta_{(u,u)}(d(x',w')) + (1 - S_l(x,w)) \Big(\frac{Q_l(x, dx') - O_l((x,w), dx')}{1 - S_l(x,w)}\Big) \otimes \Big(\frac{Q_l(w, dw') - O_l((x,w), dw')}{1 - S_l(x,w)}\Big),$   (18)

where $O_l((x,w), du) := Q_l(x, du) \wedge Q_l(w, du)$ denotes the overlapping kernel on $\mathsf{X}$ and $S_l(x,w) := \int_{\mathsf{X}} O_l((x,w), du)$ is the size of the overlap. Under this transition kernel, with probability $S_l(x,w)$ one simulates $X' = W'$ from the overlap $O_l((x,w), du)/S_l(x,w)$, and with probability $1 - S_l(x,w)$, $X'$ and $W'$ are simulated independently from the residuals, which ensures they retain the appropriate marginals. As this transition achieves the maximum probability of having $X' = W'$, $\check{Q}_l$ is known as a maximal coupling of $Q_l(x, dx')$ and $Q_l(w, dw')$. This coupling can be simulated using the algorithm of [32], assuming one can sample from the proposal kernels and evaluate their densities. One can then obtain a sample from the coupled MH kernel $\check{K}_l$ by accepting the proposals $X'$ and $W'$ with a common uniform random variable $U \sim \mathcal{U}_{[0,1]}$. If the proposal is a Gaussian random walk, then the idea of using maximal couplings within MH goes back to at least [21]. We note that alternatives to a maximal coupling are possible; these are described in Section 3.

We now outline the procedure required to compute the time-averaged estimator $\widehat{\pi_l(\varphi)}_{T,k,m}$ in (17).
1. Sample $Z_{0,l}' = (X_{0,l}', W_{0,l}')$ from $\tilde{\nu}_l(\cdot)$.
2. Generate $X^* \mid X_{0,l}'$ according to $Q_l(X_{0,l}', \cdot)$ and $U \sim \mathcal{U}_{[0,1]}$. If $U < \alpha_l(X_{0,l}', X^*)$, set $X_{0,l} = X^*$; otherwise, set $X_{0,l} = X_{0,l}'$. Set $W_{0,l} = W_{0,l}'$ and $n = 1$.
3. Generate $(X^*, W^*) \mid (X_{n-1,l}, W_{n-1,l})$ according to $\check{Q}_l((X_{n-1,l}, W_{n-1,l}), \cdot)$ and $U \sim \mathcal{U}_{[0,1]}$.
   • If $U < \alpha_l(X_{n-1,l}, X^*)$, set $X_{n,l} = X^*$; otherwise, set $X_{n,l} = X_{n-1,l}$.
   • If $U < \alpha_l(W_{n-1,l}, W^*)$, set $W_{n,l} = W^*$; otherwise, set $W_{n,l} = W_{n-1,l}$.
4. If $X_{n,l} = W_{n,l}$ and $n \ge m$, stop; otherwise set $n = n + 1$ and return to step 3.
Assuming that the resulting coupled kernel $\check{K}_l$ costs twice as much as $K_l$, the above procedure requires $(2\tau_l - 1) \vee (m + \tau_l - 1)$ applications of $K_l$.
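For reference, here is a minimal Python sketch of the sampling algorithm of [32] for the maximal coupling (18), assuming hypothetical helpers sample_q(x) that draws from $Q_l(x, \cdot)$ and log_q(x, u) that evaluates $\log q_l(x, u)$.

```python
import numpy as np

def maximal_coupling(x, w, sample_q, log_q, rng):
    # Attempt the overlap O_l((x, w), du): accept U with
    # probability min{1, q_l(w, U) / q_l(x, U)}.
    U = sample_q(x)
    if np.log(rng.uniform()) <= log_q(w, U) - log_q(x, U):
        return U, U                      # proposals meet: X' = W' = U
    # Otherwise rejection-sample W' from the residual of Q_l(w, .).
    while True:
        V = sample_q(w)
        if np.log(rng.uniform()) > log_q(x, V) - log_q(w, V):
            return U, V
```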
2.3.3 Unbiased Estimation of Increments

We now describe, abstractly, how one can obtain a sequence of independent random variables $(\xi_l)_{l\ge0}$ with the properties prescribed in Section 2.3.1. Concrete approaches are detailed in Section 3. In the case of $\xi_0$, one can simply use the unbiased MCMC methodology described in Section 2.3.2. Hence we consider how to unbiasedly estimate $[\pi_l - \pi_{l-1}](\varphi)$, for a fixed $l \in \mathbb{N}$, using MCMC. For $s \in \{l, l-1\}$, suppose we have a $\pi_s$-invariant, ergodic MCMC kernel $K_s : \mathsf{X} \to \mathcal{P}(\mathsf{X})$ and an initial distribution $\nu_s \in \mathcal{P}(\mathsf{X})$. Let $\check{\nu}_{l,l-1}$ be a coupling of the distributions $\check{\nu}_l$ and $\check{\nu}_{l-1}$ defined in (15). We will generate a Markov chain $(Z_{n,l,l-1})_{n\in\mathbb{Z}_+}$ with

$Z_{n,l,l-1} = ((X_{n,l}, W_{n,l}), (X_{n,l-1}, W_{n,l-1})) \in (\mathsf{X} \times \mathsf{X}) \times (\mathsf{X} \times \mathsf{X}) =: \mathsf{Z}$

for each $n \in \mathbb{Z}_+$. The Markov chain is such that, marginally, for each $s \in \{l, l-1\}$, the sequences of random variables $(X_{n,s})_{n\in\mathbb{Z}_+}$ and $(W_{n,s})_{n\in\mathbb{Z}_+}$ are time-homogeneous Markov chains with initial distributions $\nu_s K_s$ and $\nu_s$, respectively, and the same transition kernel $K_s$. The four sequences will be constructed in a dependent manner in order to satisfy (11) or (13). We denote the transition kernel of $(Z_{n,l,l-1})_{n\in\mathbb{Z}_+}$ as $\check{K}_{l,l-1} : \mathsf{Z} \to \mathcal{P}(\mathsf{Z})$. Define the meeting times $\tau_s = \inf\{n \ge 1 : X_{n,s} = W_{n,s}\}$ for $s \in \{l, l-1\}$. It is explicitly assumed that (at the very least) $(Z_{n,l,l-1})_{n\in\mathbb{Z}_+}$ is constructed so that the stopping time $\check{\tau}_{l,l-1} := \tau_l \vee \tau_{l-1}$ is almost surely finite. In addition, the pair of chains on each level should be faithful, i.e. for $s \in \{l, l-1\}$, we have

$X_{n,s} = W_{n,s}$ for all $n \ge \tau_s$.   (19)

Hence for times $n \ge \check{\tau}_{l,l-1}$, $Z_{n,l,l-1}$ only has a distinct state on each level. We will give explicit examples of Markov kernels which satisfy these constraints in Section 3. Note that we do not require the pairs $(Z_{n,l})_{n\in\mathbb{Z}_+} = (X_{n,l}, W_{n,l})_{n\in\mathbb{Z}_+}$ or $(Z_{n,l-1})_{n\in\mathbb{Z}_+} = (X_{n,l-1}, W_{n,l-1})_{n\in\mathbb{Z}_+}$ to be Markov chains with exactly the properties considered in Section 2.3.2.

One can estimate $[\pi_l - \pi_{l-1}](\varphi)$ as follows, for any $k \in \mathbb{Z}_+$:

$\widehat{[\pi_l - \pi_{l-1}](\varphi)}_k := \widehat{\pi_l(\varphi)}_k - \widehat{\pi_{l-1}(\varphi)}_k,$   (20)

where $\widehat{\pi_s(\varphi)}_k$ is computed using (16) based on the pair of chains $(X_{n,s}, W_{n,s})_{n\in\mathbb{Z}_+}$ on level $s \in \{l, l-1\}$. One can also employ time-averaging, for $(m, k) \in \mathbb{Z}_+ \times \mathbb{Z}_+$ satisfying $m \ge k$:

$\widehat{[\pi_l - \pi_{l-1}](\varphi)}_{T,k,m} := \widehat{\pi_l(\varphi)}_{T,k,m} - \widehat{\pi_{l-1}(\varphi)}_{T,k,m},$   (21)

where $\widehat{\pi_s(\varphi)}_{T,k,m}$ is computed using (17) based on the pair of chains $(X_{n,s}, W_{n,s})_{n\in\mathbb{Z}_+}$ on level $s \in \{l, l-1\}$. The steps to compute the estimator (21) are outlined below; a code sketch follows.
1. Sample $Z_{0,l,l-1}$ from $\check{\nu}_{l,l-1}(\cdot)$. Set $n = 1$.
2. Generate $Z_{n,l,l-1} \mid Z_{n-1,l,l-1}$ according to $\check{K}_{l,l-1}(Z_{n-1,l,l-1}, \cdot)$.
3. If $X_{n,l} = W_{n,l}$, $X_{n,l-1} = W_{n,l-1}$ and $n \ge m$, stop; otherwise set $n = n + 1$ and return to step 2.
Assuming the cost of $\check{K}_{l,l-1}$ is two times that of running both $K_l$ and $K_{l-1}$, and $K_l$ costs twice as much as $K_{l-1}$, the cost of the above procedure is at most

$(2\tau_l - 1) \vee (m + \tau_l - 1) + \tfrac{1}{2}\{(2\tau_{l-1} - 1) \vee (m + \tau_{l-1} - 1)\} \le \tfrac{3}{2}\{(2\check{\tau}_{l,l-1} - 1) \vee (m + \check{\tau}_{l,l-1} - 1)\}$   (22)

applications of the kernel $K_l$.
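The following is a minimal Python sketch of this loop, reusing the unbiased_estimate function from the sketch in Section 2.3.2; init_coupled(l, rng), drawing $Z_{0,l,l-1}$ from $\check{\nu}_{l,l-1}$, and coupled_step(Z, l, rng), applying one transition of $\check{K}_{l,l-1}$, are hypothetical samplers, with states flattened as 4-tuples $(X_l, W_l, X_{l-1}, W_{l-1})$ for simplicity.

```python
import numpy as np

def increment_estimate(phi, init_coupled, coupled_step, k, m, l, rng):
    # Record the four trajectories (X_l, W_l, X_{l-1}, W_{l-1}) over time.
    traj = [[z] for z in init_coupled(l, rng)]
    met = lambda i: np.array_equal(traj[i][-1], traj[i + 1][-1])
    n = 0
    while n < m or not (met(0) and met(2)):   # stop when both pairs met, n >= m
        Z = coupled_step(tuple(t[-1] for t in traj), l, rng)
        for t, z in zip(traj, Z):
            t.append(z)
        n += 1
    # Meeting times tau_l and tau_{l-1} of the pair of chains on each level.
    taus = [next(s for s in range(1, n + 1)
                 if np.array_equal(traj[i][s], traj[i + 1][s])) for i in (0, 2)]
    # (21) is the difference of the two time-averaged estimators (17).
    return (unbiased_estimate(phi, traj[0], traj[1], k, m, taus[0])
            - unbiased_estimate(phi, traj[2], traj[3], k, m, taus[1]))
```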
2.3.4 Summary of Proposed Methodology

We now consolidate the above discussion by summarizing our proposed methodology to unbiasedly estimate $\pi(\varphi)$. We begin with the single term estimator $\widehat{\pi(\varphi)}_S$ in (12).
1. Sample $L \sim P_L$.
2. If $L = 0$, generate a Markov chain $(Z_{n,0})_{n\in\mathbb{Z}_+}$ according to $\check{K}_0$ as described in Section 2.3.2 and compute the estimator $\widehat{\pi_0(\varphi)}_k$ in (16) or $\widehat{\pi_0(\varphi)}_{T,k,m}$ in (17).
3. If $L > 0$, generate a Markov chain $(Z_{n,L,L-1})_{n\in\mathbb{Z}_+}$ according to $\check{K}_{L,L-1}$ as described in Section 2.3.3 and compute the estimator $\widehat{[\pi_L - \pi_{L-1}](\varphi)}_k$ in (20) or $\widehat{[\pi_L - \pi_{L-1}](\varphi)}_{T,k,m}$ in (21).
We then return the single term estimator

$\widehat{\pi(\varphi)}_{S,k} := \frac{1}{P_L(L)} \Big( \mathbb{I}_{\{0\}}(L)\, \widehat{\pi_0(\varphi)}_k + \mathbb{I}_{\mathbb{N}}(L)\, \widehat{[\pi_L - \pi_{L-1}](\varphi)}_k \Big)$   (23)

or

$\widehat{\pi(\varphi)}_{S,T,k,m} := \frac{1}{P_L(L)} \Big( \mathbb{I}_{\{0\}}(L)\, \widehat{\pi_0(\varphi)}_{T,k,m} + \mathbb{I}_{\mathbb{N}}(L)\, \widehat{[\pi_L - \pi_{L-1}](\varphi)}_{T,k,m} \Big),$   (24)

depending on whether one chooses the time-averaged estimator or not.

For the independent sum estimator $\widehat{\pi(\varphi)}_I$ in (14), the steps are quite similar.
1. Sample $L \sim P_L$.
2. Generate a Markov chain $(Z_{n,0})_{n\in\mathbb{Z}_+}$ according to $\check{K}_0$ as described in Section 2.3.2 and compute the estimator $\widehat{\pi_0(\varphi)}_k$ in (16) or $\widehat{\pi_0(\varphi)}_{T,k,m}$ in (17).
3. If $L > 0$, for all $l \in \{1, \ldots, L\}$, generate a Markov chain $(Z_{n,l,l-1})_{n\in\mathbb{Z}_+}$ according to $\check{K}_{l,l-1}$ as described in Section 2.3.3 and compute the estimator $\widehat{[\pi_l - \pi_{l-1}](\varphi)}_k$ in (20) or $\widehat{[\pi_l - \pi_{l-1}](\varphi)}_{T,k,m}$ in (21).
We then return the independent sum estimator

$\widehat{\pi(\varphi)}_{I,k} := \widehat{\pi_0(\varphi)}_k + \sum_{l=1}^L \frac{1}{\bar{P}_L(l)}\, \widehat{[\pi_l - \pi_{l-1}](\varphi)}_k$   (25)

or

$\widehat{\pi(\varphi)}_{I,T,k,m} := \widehat{\pi_0(\varphi)}_{T,k,m} + \sum_{l=1}^L \frac{1}{\bar{P}_L(l)}\, \widehat{[\pi_l - \pi_{l-1}](\varphi)}_{T,k,m}.$   (26)

As noted, in practice one can also run the given procedure $N$ times and use an average. For instance, in the context of the single term estimator (23), one would use

$\widehat{\pi(\varphi)}(N) = \frac{1}{N} \sum_{i=1}^N \big(\widehat{\pi(\varphi)}_{S,k}\big)^i$   (27)

to estimate $\pi(\varphi)$, where $(\widehat{\pi(\varphi)}_{S,k})^i$ denotes the $i$-th independent replicate of the estimate.

2.4 Theoretical Results

The main objective of this section is to establish, under assumptions, that (23)-(26) are unbiased and finite variance estimators of $\pi(\varphi)$. We first state the assumptions that we will rely on. In the following, we suppose that the quality of approximation of $\pi$ by $\pi_l$ is controlled by a scalar parameter $\Delta_l = 2^{-l}$, which is consistent with the examples that are to be considered. For a constant $0 < C < +\infty$ and a metric $d$ on $\mathsf{X}$, we define the set

$B(C, \Delta_l, d) := \{(x_1, x_2, x_3, x_4) \in \mathsf{X}^4 : \forall (i,j) \in \{1, \ldots, 4\}^2,\ d(x_i, x_j) \le C \Delta_l\}.$

We will make the following assumptions, with $\mathsf{X}$ compact.

(A1) There exist $(C, \rho) \in (0, \infty) \times (0, 1)$ such that for any $n \in \mathbb{N}$:

$\sup_{l \in \mathbb{Z}_+} \sup_{x \in \mathsf{X}} \|K_l^n(x, \cdot) - \pi_l(\cdot)\|_{\mathrm{tv}} \le C \rho^n.$

(A2) There exist $(C, \rho) \in (0, \infty) \times (0, 1)$ such that for any $(l, n) \in \mathbb{Z}_+ \times \mathbb{N}$:

$\mathbb{E}[\mathbb{I}_{\{\tau_l > n\}}] \le C \rho^n.$

(A3) There exist a $C < \infty$ and a metric $\tilde{d} : \mathsf{X} \times \mathsf{X} \to \mathbb{R}_+$ on $\mathsf{X}$ such that for any $(l, \varphi, (x,w)) \in \mathbb{Z}_+ \times (\mathcal{B}_b(\mathsf{X}) \cap \mathrm{Lip}_{\tilde{d}}(\mathsf{X})) \times (\mathsf{X} \times \mathsf{X})$:

$|K_l(\varphi)(x) - K_l(\varphi)(w)| \le C (\|\varphi\|_\infty \vee \|\varphi\|_{\mathrm{Lip}})\, \tilde{d}(x, w).$

(A4) There exist $(C, \beta_1) \in (0, \infty) \times (0, \infty)$ such that for any $(l, \varphi) \in \mathbb{N} \times \mathcal{B}_b(\mathsf{X})$:
1. $|[\pi_l - \pi](\varphi)| \le C \|\varphi\|_\infty \Delta_l^{\beta_1}$;
2. $\sup_{x \in \mathsf{X}} |K_l(\varphi)(x) - K_{l-1}(\varphi)(x)| \le C \|\varphi\|_\infty \Delta_l^{\beta_1}.$

(A5) There exist $(C, \beta_2, \epsilon) \in (0, \infty)^3$ such that for the metric $\tilde{d}$ in (A3) and any $(l, n) \in \mathbb{N} \times \mathbb{N}$:

$\mathbb{E}[\mathbb{I}_{B(C, \Delta_l^{\beta_2}, \tilde{d})^c}(Z_{0,l,l-1})] \le C \Delta_l^{\beta_2(2+\epsilon)},$

$\mathbb{E}[\mathbb{I}_{B(C, \Delta_l^{\beta_2}, \tilde{d})^c \times B(C, \Delta_l^{\beta_2}, \tilde{d})}(Z_{n,l,l-1}, Z_{n-1,l,l-1})] \le C \Delta_l^{\beta_2(2+\epsilon)}.$

2.4.1 Main Results

Our main result focusses upon (23), as the proofs of the other results are more-or-less a direct corollary of the first result. The proofs of all results are in Appendix A.
Theorem 2.1.
Assume (A1-5). Then there exists a choice of positive probability mass function $P_L$ such that for the metric $\tilde{d}$ in (A3) and any $\varphi \in \mathcal{B}_b(\mathsf{X}) \cap \mathrm{Lip}_{\tilde{d}}(\mathsf{X})$, (23) is an unbiased and finite variance estimator of $\pi(\varphi)$.

Remark 2.1.
It is straightforward to establish that one can find a $P_L$ so that (24) is also unbiased with finite variance. The proof can be constructed via the technical results in the appendix.

The following result can be deduced by observing (13) and using the technical results in the appendix.
Corollary 2.1.
Assume (A1-5). Then there exists a choice of positive probability mass function $P_L$ such that for the metric $\tilde{d}$ in (A3) and any $\varphi \in \mathcal{B}_b(\mathsf{X}) \cap \mathrm{Lip}_{\tilde{d}}(\mathsf{X})$, (25) is an unbiased and finite variance estimator of $\pi(\varphi)$.

Remark 2.2. As for (24), one can find a $P_L$ so that (26) is also unbiased with finite variance.

The main strategy of the proof is to establish a martingale plus remainder type decomposition for $\widehat{\pi_0(\varphi)}_k$ and $\widehat{[\pi_l - \pi_{l-1}](\varphi)}_k$. Given this, one can rely on the optional sampling theorem to establish that the former quantities are unbiased estimates of $\pi_0(\varphi)$ and $[\pi_l - \pi_{l-1}](\varphi)$. One is then left to control the second moments of the decomposition, which can be achieved in a variety of ways; we rely on martingale methods.

2.4.2 Discussion of the Assumptions

Assumption (A1) is a strong assumption, although reasonable as we only consider compact $\mathsf{X}$. It is verified for an example in [19]. Assumption (A2) has been considered in [13] and is shown to hold in a related context; see the results in [13, Lemmata 14 & 22]. Assumption (A3) has been verified for an example in [19]. Assumption (A4).1 relates to the rate at which the bias converges and can hold for inverse problems (see [3]). Assumption (A4).2 can be achieved by considering an appropriate coupling of $(K_l(x, \cdot), K_{l-1}(x, \cdot))$ and is problem specific; see [19] for an example where this is verified.

Assumption (A5) appears to be quite non-standard. We first remark that assumptions of this type (not identical) can be verified in complex settings [13]: (A5) is shown in [13, Lemma 16]. Secondly, if one can establish that the pairs $(X_{n,l}, X_{n,l-1})_{n\in\mathbb{Z}_+}$ and $(W_{n,l}, W_{n,l-1})_{n\in\mathbb{Z}_+}$ are (marginally) uniformly ergodic Markov chains with an invariant measure $\check{\pi}_{l,l-1}$, then it is sufficient to assume that

$\check{\pi}_{l,l-1}\big((\varphi \otimes 1 - 1 \otimes \varphi)^2\big) \le C \Delta_l^{2\beta_2}.$

This is because the proof that is used turns these one-step type properties of the coupled Markov chains in (A5) into similar properties of the Markov chain at any time step (see Lemma A.3). However, these properties are inherited directly from the invariant measure $\check{\pi}_{l,l-1}$, if such a quantity exists and the chain converges sufficiently fast to it.

2.4.3 Choice of P_L

The discussion below relates to the single term estimator $\widehat{\pi(\varphi)}_{S,k}$ in (23) with increments estimated using $\xi_l = \widehat{[\pi_l - \pi_{l-1}](\varphi)}_k$ defined in (20), and is easily extended to the time-averaged estimator (21). The case of the independent sum estimators (25) and (26) follows along the same lines and is thus omitted. The variance and expected cost of $\widehat{\pi(\varphi)}_{S,k}$ can be bounded as follows:

$\mathrm{Var}\big[\widehat{\pi(\varphi)}_{S,k}\big] \le \sum_{l \in \mathbb{Z}_+} \frac{\mathbb{E}[\xi_l^2]}{P_L(l)},$   (28)

$\mathrm{Cost}\big[\widehat{\pi(\varphi)}_{S,k}\big] \le C \sum_{l \in \mathbb{Z}_+} \mathbb{E}[\check{\tau}_{l,l-1}]\, \mathrm{Cost}(K_l)\, P_L(l),$   (29)

where $C > 0$ is some constant (see e.g. (22)), $\mathrm{Cost}(K_l)$ denotes the cost of an application of the marginal kernel $K_l$, and $\check{\tau}_{l,l-1} = \tau_l \vee \tau_{l-1}$ is the stopping time of the Markov chain $(Z_{n,l,l-1})_{n\in\mathbb{Z}_+}$, with $\check{\tau}_{0,-1} := \tau_0$. Averaging $N$ single term estimators as in (27) would yield a variance of $\mathrm{Var}[\widehat{\pi(\varphi)}_{S,k}]\, N^{-1}$ and an expected cost of $\mathrm{Cost}[\widehat{\pi(\varphi)}_{S,k}]\, N$.

For the second moment of $\xi_l$ featuring in (28), Lemma A.5 provides the bound

$\mathbb{E}[\xi_l^2] \le C \Delta_l^{2\beta},$   (30)

with $\beta = \beta_1 \wedge \beta_2 > 0$, where $\beta_1$ and $\beta_2$ are given in Assumptions (A4) and (A5), respectively, and $\tilde{d}$ in Assumption (A3) is given by the standard Euclidean distance.
Using Assumption (A2), one can upper-bound the expected stopping time $\mathbb{E}[\check{\tau}_{l,l-1}]$ appearing in (29) by a constant that is independent of $l$. We assume furthermore that there exist $\omega < 2\beta$ and $C_K > 0$ such that the cost of $K_l$ satisfies

$\mathrm{Cost}(K_l) \le C_K \Delta_l^{-\omega}.$   (31)

Suppose the probability mass function on $\mathbb{Z}_+$ is of the form $P_L(l) \propto \Delta_l^\eta$. Following (30) and (31), there are constants $C_V, C_C > 0$ such that the right-hand sides of (28) and (29) are bounded above by $C_V \sum_{l\in\mathbb{Z}_+} \Delta_l^{2\beta - \eta}$ and $C_C \sum_{l\in\mathbb{Z}_+} \Delta_l^{\eta - \omega}$, respectively. Both quantities are finite for any $\eta \in (\omega, 2\beta)$; e.g., one can let $\eta = (2\beta + \omega)/2$. The above discussion requires $\omega < 2\beta$; see e.g. [28, 20, 9] for the case $\omega \ge 2\beta$.

We now consider the above discussion in the particular context of the motivating example in Section 2.2. In this setting, Proposition 2.1 provides $\beta_1 = 2$ in Assumption (A4).1. As discussed in Section 2.4.2, the rates for Assumptions (A4).2 and (A5) depend upon the particular kernels used and are more difficult to establish theoretically, but the value of $\beta$ in (30) can be estimated numerically. Evaluation of the density (6) at level $l$ requires the solution of the tridiagonal linear system (5), which has $O(\Delta_l^{-1})$ degrees of freedom. Therefore (31) holds with $\omega = 1$, and we choose $\eta = 5/2$.

3 Coupled MCMC Kernels

We consider various strategies to construct the kernels $\check{K}_{l,l-1}$, $l \in \mathbb{N}$, described in Section 2.3.3. We begin with Metropolis–Hastings algorithms in Sections 3.1 and 3.2, and consider the case of Hamiltonian Monte Carlo methods in Section 3.3.

3.1 Coupled Metropolis–Hastings Kernels

We consider a collection of coupled MH kernels $\check{K}_{l,l-1}$ that are defined by the following simulation procedure (a code sketch of the transition is given after the requirements below).
1. Given the current state $z_{l,l-1} = ((x_l, w_l), (x_{l-1}, w_{l-1})) \in \mathsf{Z}$, generate a proposal $Z_{l,l-1}' = ((X_l', W_l'), (X_{l-1}', W_{l-1}'))$ according to $\check{Q}_{l,l-1}(z_{l,l-1}, \cdot)$.
2. Generate $U \sim \mathcal{U}_{[0,1]}$ and for level $s \in \{l, l-1\}$:
   • If $U < \alpha_s(x_s, X_s')$, set $X_s^\star = X_s'$; otherwise set $X_s^\star = x_s$.
   • If $U < \alpha_s(w_s, W_s')$, set $W_s^\star = W_s'$; otherwise set $W_s^\star = w_s$.
3. Return $Z_{l,l-1}^\star = ((X_l^\star, W_l^\star), (X_{l-1}^\star, W_{l-1}^\star))$ as a sample according to $\check{K}_{l,l-1}(z_{l,l-1}, \cdot)$.
The notation $\check{Q}_{l,l-1} : \mathsf{Z} \to \mathcal{P}(\mathsf{Z})$ refers to a coupling of the proposal kernels $Q_l(x_l, dx_l')$, $Q_l(w_l, dw_l')$, $Q_{l-1}(x_{l-1}, dx_{l-1}')$ and $Q_{l-1}(w_{l-1}, dw_{l-1}')$, in the sense that generating $Z_{l,l-1}' = ((X_l', W_l'), (X_{l-1}', W_{l-1}'))$ according to $\check{Q}_{l,l-1}(z_{l,l-1}, \cdot)$ is marginally equivalent to

$X_l' \sim Q_l(x_l, \cdot), \quad W_l' \sim Q_l(w_l, \cdot), \quad X_{l-1}' \sim Q_{l-1}(x_{l-1}, \cdot), \quad W_{l-1}' \sim Q_{l-1}(w_{l-1}, \cdot).$   (32)

We also require the coupled proposal kernel $\check{Q}_{l,l-1}$ to satisfy

$\check{Q}_{l,l-1}(z_{l,l-1}, D \times D) > 0,$   (33)

where $D := \{(x, w) \in \mathsf{X} \times \mathsf{X} : x = w\}$ denotes the diagonal set, and the following faithfulness property:

$x_s = w_s \implies X_s' = W_s'$ almost surely   (34)

for each level $s \in \{l, l-1\}$. The condition (33) requires the coupling mechanism to generate identical proposals on each level with positive probability.
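A minimal Python sketch of one transition of $\check{K}_{l,l-1}$ follows, with states flattened as 4-tuples and assuming symmetric proposals so that the proposal-density ratio in $\alpha_s$ cancels; coupled_proposal (a sampler for $\check{Q}_{l,l-1}$, e.g. one of the couplings below) and log_pi(s, x), evaluating $\log \pi_s(x)$ up to a constant, are assumptions.

```python
import numpy as np

def coupled_mh_step(Z, l, coupled_proposal, log_pi, rng):
    xl, wl, xlm1, wlm1 = Z
    Xp_l, Wp_l, Xp_lm1, Wp_lm1 = coupled_proposal(Z, l, rng)
    log_u = np.log(rng.uniform())     # one common uniform for all four chains
    accept = lambda s, x, xp: log_u < log_pi(s, xp) - log_pi(s, x)
    return (Xp_l if accept(l, xl, Xp_l) else xl,
            Wp_l if accept(l, wl, Wp_l) else wl,
            Xp_lm1 if accept(l - 1, xlm1, Xp_lm1) else xlm1,
            Wp_lm1 if accept(l - 1, wlm1, Wp_lm1) else wlm1)
```

The common uniform preserves faithfulness: if $x_s = w_s$ and the coupled proposal returns $X_s' = W_s'$, the two chains on level $s$ accept or reject together.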
Under (34) and the use of a common uniform random variable in Step 2, the pair of MH chains on each level is faithful, as required in (19).

3.2 Couplings of Proposal Kernels

In the following, we will consider $\mathsf{X} = \mathbb{R}^d$ and describe concrete examples of $\check{Q}_{l,l-1}$ for proposal kernels associated to random walk Metropolis–Hastings (RWMH),

$Q_s(x, dx') = \phi_d(x'; x, \Sigma_s)\, dx', \quad (x, s) \in \mathsf{X} \times \mathbb{Z}_+,$   (35)

and pre-conditioned Crank–Nicolson (pCN) [7, 27],

$Q_s(x, dx') = \phi_d(x'; \rho_s x, (1 - \rho_s^2) \Sigma_s)\, dx', \quad (x, s) \in \mathsf{X} \times \mathbb{Z}_+,$   (36)

where $\Sigma_s = \sigma_s \sigma_s^T$, $\sigma_s$ is an invertible $d \times d$ matrix and $\rho_s \in (-1, 1)$.

3.2.1 Independent Couplings

A naive approach is to employ

$\check{Q}_{l,l-1}^{(I)}(z_{l,l-1}, dz_{l,l-1}') := \check{Q}_l(z_l, dz_l') \otimes \check{Q}_{l-1}(z_{l-1}, dz_{l-1}'),$

which independently samples from the maximal coupling of the proposal kernels on each level given in (18). Although it does satisfy the requirements (32), (33) and (34), this choice is unlikely to ensure that the conditions in (11) or (13) hold, as the proposals on levels $l$ and $l-1$ are sampled independently.

3.2.2 Maximal Couplings

We now introduce an extension of the maximal coupling in (18) to the case of four marginals. For $z_{l,l-1} = ((x_l, w_l), (x_{l-1}, w_{l-1})) \in \mathsf{Z}$, define the overlapping kernel on $\mathsf{X}$ as

$O_{l,l-1}(z_{l,l-1}, du) := Q_l(x_l, du) \wedge Q_l(w_l, du) \wedge Q_{l-1}(x_{l-1}, du) \wedge Q_{l-1}(w_{l-1}, du),$

and the size of the overlap $S_{l,l-1}(z_{l,l-1}) := \int_{\mathsf{X}} O_{l,l-1}(z_{l,l-1}, du)$. For each level $s \in \{l, l-1\}$, we define the two residual probability measures on $\mathsf{X}$:

$r_s(z_{l,l-1}, du) := \frac{Q_s(x_s, du) - O_{l,l-1}(z_{l,l-1}, du)}{1 - S_{l,l-1}(z_{l,l-1})}, \qquad \tilde{r}_s(z_{l,l-1}, du) := \frac{Q_s(w_s, du) - O_{l,l-1}(z_{l,l-1}, du)}{1 - S_{l,l-1}(z_{l,l-1})}.$

To accommodate the event when the pair of chains on a level has met, i.e. $(x_s, w_s) \in D$, we define

$R_s(z_{l,l-1}, d(x_s', w_s')) := r_s(z_{l,l-1}, dx_s') \Big( \mathbb{I}_D(x_s, w_s)\, \delta_{\{x_s'\}}(dw_s') + \mathbb{I}_{D^c}(x_s, w_s)\, \tilde{r}_s(z_{l,l-1}, dw_s') \Big).$

Writing

$\check{R}_{l,l-1}(z_{l,l-1}, dz_{l,l-1}') := R_l(z_{l,l-1}, d(x_l', w_l'))\, R_{l-1}(z_{l,l-1}, d(x_{l-1}', w_{l-1}')),$   (37)

we may then define our coupled proposal kernel $\check{Q}_{l,l-1}^{(M)} : \mathsf{Z} \to \mathcal{P}(\mathsf{Z})$ as

$\check{Q}_{l,l-1}^{(M)}(z_{l,l-1}, dz_{l,l-1}') := S_{l,l-1}(z_{l,l-1}) \int_{\mathsf{X}} \frac{O_{l,l-1}(z_{l,l-1}, du)}{S_{l,l-1}(z_{l,l-1})}\, \delta_{(u,u,u,u)}(dz_{l,l-1}') + (1 - S_{l,l-1}(z_{l,l-1}))\, \check{R}_{l,l-1}(z_{l,l-1}, dz_{l,l-1}').$   (38)

One can check that this satisfies the requirements in (32), (33) and (34). Moreover, $\check{Q}_{l,l-1}^{(M)}$ is a maximal coupling, as it achieves the maximum probability of having identical proposals $X_l' = W_l' = X_{l-1}' = W_{l-1}'$, which is given by the size of the overlap $S_{l,l-1}(z_{l,l-1})$.

Algorithm 1 provides a method to sample from (38), assuming that we can sample from the proposal transition kernels and evaluate their transition densities. These assumptions clearly hold for the Gaussian proposal kernels in (35) and (36).
Algorithm 1: A maximal coupling of four transition kernels.

Input: transition kernels $Q_s(x_s, dx_s')$ and $Q_s(w_s, dw_s')$ for level $s \in \{l, l-1\}$.
1. Sample $U \sim Q_l(x_l, \cdot)$. With probability

$\min\Big\{1, \frac{q_l(w_l, U)}{q_l(x_l, U)}, \frac{q_{l-1}(x_{l-1}, U)}{q_l(x_l, U)}, \frac{q_{l-1}(w_{l-1}, U)}{q_l(x_l, U)}\Big\},$

output $((X_l', W_l'), (X_{l-1}', W_{l-1}')) = ((U, U), (U, U))$.
2. Otherwise:
(a) Set $X_l' = U$. If $x_l = w_l$, set $W_l' = U$.
(b) If $x_l \ne w_l$, propose $U_l \sim Q_l(w_l, \cdot)$ and with probability

$1 - \min\Big\{1, \frac{q_l(x_l, U_l)}{q_l(w_l, U_l)}, \frac{q_{l-1}(x_{l-1}, U_l)}{q_l(w_l, U_l)}, \frac{q_{l-1}(w_{l-1}, U_l)}{q_l(w_l, U_l)}\Big\},$

set $W_l' = U_l$; otherwise repeat until acceptance.
(c) Propose $U_{l-1} \sim Q_{l-1}(x_{l-1}, \cdot)$ and with probability

$1 - \min\Big\{1, \frac{q_l(x_l, U_{l-1})}{q_{l-1}(x_{l-1}, U_{l-1})}, \frac{q_l(w_l, U_{l-1})}{q_{l-1}(x_{l-1}, U_{l-1})}, \frac{q_{l-1}(w_{l-1}, U_{l-1})}{q_{l-1}(x_{l-1}, U_{l-1})}\Big\},$

set $X_{l-1}' = U_{l-1}$; otherwise repeat until acceptance. If $x_{l-1} = w_{l-1}$, set $W_{l-1}' = X_{l-1}'$.
(d) If $x_{l-1} \ne w_{l-1}$, propose $U_{l-1} \sim Q_{l-1}(w_{l-1}, \cdot)$ and with probability

$1 - \min\Big\{1, \frac{q_l(x_l, U_{l-1})}{q_{l-1}(w_{l-1}, U_{l-1})}, \frac{q_l(w_l, U_{l-1})}{q_{l-1}(w_{l-1}, U_{l-1})}, \frac{q_{l-1}(x_{l-1}, U_{l-1})}{q_{l-1}(w_{l-1}, U_{l-1})}\Big\},$

set $W_{l-1}' = U_{l-1}$; otherwise repeat until acceptance.
Output: Sample $Z_{l,l-1}' = ((X_l', W_l'), (X_{l-1}', W_{l-1}'))$ from $\check{Q}_{l,l-1}^{(M)}(z_{l,l-1}, \cdot)$.

Step 1 can be seen as an attempt to sample $X_l' = W_l' = X_{l-1}' = W_{l-1}'$ from the overlap $O_{l,l-1}(z_{l,l-1}, du)/S_{l,l-1}(z_{l,l-1})$, and if this fails, Step 2 corresponds to a rejection sampler for the residuals (37). The four cases considered in Step 2 are needed to ensure the faithfulness property in (34). The coupling $\check{Q}_{l,l-1}^{(M)}$ in (38) can be readily employed with RWMH and pCN proposals, and more general proposal transitions outside the Gaussian family.
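For concreteness, a minimal Python sketch of Algorithm 1 is given below; sample_q(s, x) and log_q(s, x, u), which draw from $Q_s(x, \cdot)$ and evaluate $\log q_s(x, u)$ respectively, are assumed helpers (available in closed form for (35) and (36)).

```python
import numpy as np

def maximal_coupling_four(Z, l, sample_q, log_q, rng):
    xl, wl, xlm1, wlm1 = Z
    marg = [(l, xl), (l, wl), (l - 1, xlm1), (l - 1, wlm1)]
    lq = lambda i, u: log_q(marg[i][0], marg[i][1], u)

    def residual(i):
        # Rejection sampler for the residual of marginal i, as in Step 2.
        while True:
            V = sample_q(*marg[i])
            others = min(lq(j, V) for j in range(4) if j != i)
            if np.log(rng.uniform()) > others - lq(i, V):
                return V

    U = sample_q(l, xl)               # Step 1: attempt the four-way overlap
    if np.log(rng.uniform()) <= min(lq(i, U) for i in (1, 2, 3)) - lq(0, U):
        return U, U, U, U
    Xp_l = U                          # Step 2: residuals, preserving faithfulness
    Wp_l = U if np.array_equal(xl, wl) else residual(1)
    Xp_lm1 = residual(2)
    Wp_lm1 = Xp_lm1 if np.array_equal(xlm1, wlm1) else residual(3)
    return Xp_l, Wp_l, Xp_lm1, Wp_lm1
```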
Next we present an alternative coupling that is specific to the Gaussian case.

3.2.3 Reflection Maximal Couplings

We consider the case $\mathsf{X} = \mathbb{R}^d$ and proposal kernels of the form

$Q_s(x, dx') = \phi_d(x'; \mu_s(x), \Sigma_s)\, dx', \quad (x, s) \in \mathsf{X} \times \mathbb{Z}_+,$   (39)

where $\mu_s : \mathsf{X} \to \mathsf{X}$, $\Sigma_s = \sigma_s \sigma_s^T$ and $\sigma_s$ is an invertible $d \times d$ matrix. The following will exploit the fact that a sample $X'$ from $Q_s(x, \cdot)$ can be represented as $X' = \mu_s(x) + \sigma_s v_s$ with $v_s \sim \mathcal{N}_d(0_d, I_d)$. As noted in [14], the case of $v_s$ following a spherically symmetric distribution can also be accommodated.

Algorithm 2: Synchronous pairwise reflection maximal couplings.

Input: transition kernels $Q_s(x_s, dx_s') = \phi_d(x_s'; \mu_s(x_s), \Sigma_s)\, dx_s'$ and $Q_s(w_s, dw_s') = \phi_d(w_s'; \mu_s(w_s), \Sigma_s)\, dw_s'$ for level $s \in \{l, l-1\}$.
Sample $v \sim \mathcal{N}_d(0_d, I_d)$ and for level $s \in \{l, l-1\}$:
1. Set $v_s = v$ and $X_s' = \mu_s(x_s) + \sigma_s v_s$.
2. Set $u_s = \sigma_s^{-1}(\mu_s(x_s) - \mu_s(w_s))$ and $e_s = u_s / \|u_s\|$.
3. With probability $\min\{1, \phi_d(v_s + u_s; 0_d, I_d) / \phi_d(v_s; 0_d, I_d)\}$, set $\tilde{v}_s = v_s + u_s$; otherwise set $\tilde{v}_s = v_s - 2(v_s^T e_s) e_s$. Set $W_s' = \mu_s(w_s) + \sigma_s \tilde{v}_s$.
Output: Sample $Z_{l,l-1}' = ((X_l', W_l'), (X_{l-1}', W_{l-1}'))$ from $\check{Q}_{l,l-1}^{(R)}(z_{l,l-1}, \cdot)$.

The coupled proposal kernel $\check{Q}_{l,l-1}^{(R)} : \mathsf{Z} \to \mathcal{P}(\mathsf{Z})$ that we construct here is based on a synchronous coupling of the reflection maximal coupling in [4] for the pair of proposals on each level. Algorithm 2 details how to obtain a sample $Z_{l,l-1}' = ((X_l', W_l'), (X_{l-1}', W_{l-1}'))$ from $\check{Q}_{l,l-1}^{(R)}(z_{l,l-1}, \cdot)$, which will satisfy the requirements in (32), (33) and (34). The synchronous use of $v_l = v_{l-1} = v$ in Step 1 induces a coupling between the proposals across levels. In Step 3, for each level $s \in \{l, l-1\}$, note that the event $\tilde{v}_s = v_s + u_s$ yields identical proposals $X_s' = W_s'$, and this occurs with maximal probability given by the size of the overlap $S_s(x_s, w_s)$. When identical proposals are not possible, we take $\tilde{v}_s$ as the reflection of $v_s$ with respect to the hyperplane orthogonal to $e_s$, lying right between $\sigma_s^{-1} x_s$ and $\sigma_s^{-1} w_s$. Under this reflection coupling [22], one can show that $X_s' - W_s' = \varrho(v_s)(\mu_s(x_s) - \mu_s(w_s))$ with $\varrho(v_s) = 1 + 2(v_s^T e_s)/\|u_s\|$. Since $\varrho(v_s) \sim \mathcal{N}(1, 4\|u_s\|^{-2})$, one has contraction of the proposals on each level with probability of almost $1/2$ when $\|u_s\|$ is large. If the proposals on each level are both accepted, contraction is desirable as it leads to states $X_s'$ and $W_s'$ that are closer. This in turn yields a higher probability of generating identical proposals in the next application of $\check{Q}_{l,l-1}^{(R)}$. As (39) clearly includes (35) and (36) as special cases, the coupling $\check{Q}_{l,l-1}^{(R)}$ is applicable to both RWMH and pCN proposals.
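A minimal Python sketch of Algorithm 2 follows, under the simplifying assumption of scalar covariances $\Sigma_s = \sigma_s^2 I_d$; mu(s, x), the proposal mean on level $s$ (e.g. $\mu_s(x) = x$ for RWMH or $\mu_s(x) = \rho_s x$ for pCN), and sigma(s), the scale on level $s$, are assumed helpers.

```python
import numpy as np

def reflection_coupling_four(Z, l, mu, sigma, rng):
    xl, wl, xlm1, wlm1 = Z
    v = rng.standard_normal(np.size(xl))     # common noise couples the two levels
    out = []
    for s, x, w in ((l, xl, wl), (l - 1, xlm1, wlm1)):
        Xp = mu(s, x) + sigma(s) * v
        u = (mu(s, x) - mu(s, w)) / sigma(s)
        # Accept with probability min{1, phi_d(v + u; 0, I) / phi_d(v; 0, I)}.
        if np.log(rng.uniform()) <= -0.5 * (np.sum((v + u) ** 2) - np.sum(v ** 2)):
            vt = v + u                       # proposals on level s meet: W' = X'
        else:
            e = u / np.linalg.norm(u)        # reflect v about the hyperplane
            vt = v - 2.0 * np.dot(v, e) * e  # orthogonal to e_s
        out += [Xp, mu(s, w) + sigma(s) * vt]
    return out[0], out[1], out[2], out[3]    # (X'_l, W'_l, X'_{l-1}, W'_{l-1})
```

Note that when $x_s = w_s$ one has $u_s = 0$ and the first branch is taken with probability one, so faithfulness (34) holds.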
In the next section, we consider another construction for pCN that always induces contractive proposals.

3.2.4 Synchronous Couplings

We consider a coupled proposal kernel $\check{Q}_{l,l-1}^{(S)} : \mathsf{Z} \to \mathcal{P}(\mathsf{Z})$ for pCN (36). To obtain a sample $Z_{l,l-1}' = ((X_l', W_l'), (X_{l-1}', W_{l-1}'))$ from $\check{Q}_{l,l-1}^{(S)}(z_{l,l-1}, \cdot)$, we simulate $v \sim \mathcal{N}_d(0_d, I_d)$ and take

$X_s' = \rho_s x_s + \sqrt{1 - \rho_s^2}\, \sigma_s v, \qquad W_s' = \rho_s w_s + \sqrt{1 - \rho_s^2}\, \sigma_s v,$

for both levels $s \in \{l, l-1\}$. The synchronous use of $v$ guarantees contraction of the proposals on each level, i.e. $\|X_s' - W_s'\| = \rho_s \|x_s - w_s\|$ for $\rho_s \in (-1, 1)$, and induces dependencies between the pairs of proposals across levels. The latter is crucial in our context, as we want (11) or (13) to hold. Although this synchronous coupling satisfies the requirements in (32) and (34), it is not possible to have identical proposals as needed in (33). A solution is to consider a mixture of $\check{Q}_{l,l-1}^{(S)}$ and either $\check{Q}_{l,l-1}^{(M)}$ or $\check{Q}_{l,l-1}^{(R)}$, which allow identical proposals to occur. More precisely, we take the coupled proposal kernel in Section 3.1 as the mixture kernel

$\check{Q}_{l,l-1}(z_{l,l-1}, dz_{l,l-1}') = \kappa\, \check{Q}_{l,l-1}^{(S)}(z_{l,l-1}, dz_{l,l-1}') + (1 - \kappa)\, \check{Q}_{l,l-1}^{(p)}(z_{l,l-1}, dz_{l,l-1}'), \quad p \in \{M, R\},$   (40)

which will satisfy (32), (33) and (34) for any choice of $\kappa \in (0, 1)$.

3.3 Hamiltonian Monte Carlo

We restrict ourselves to the case $\mathsf{X} = \mathbb{R}^d$. Hamiltonian Monte Carlo (HMC) [8] considers the following auxiliary target distribution on $(\mathbb{R}^{2d}, \mathcal{B}(\mathbb{R}^{2d}))$ for each $l \in \mathbb{Z}_+$:

$\bar{\pi}_l(d(x, v)) := \pi_l(x)\, \phi_d(v; 0_d, I_d)\, d(x, v),$   (41)

where $d(x, v)$ denotes the Lebesgue measure on $\mathbb{R}^{2d}$. We will assume that the target density $x \mapsto \pi_l(x)$ has a well-defined gradient, and write the Hamiltonian corresponding to (41) as $H_l(x, v) = -\log \pi_l(x) + \tfrac{1}{2}\|v\|^2$. Given a current position $x \in \mathbb{R}^d$, HMC samples an initial velocity $v \sim \mathcal{N}_d(0_d, I_d)$ and generates a proposal by discretizing the Hamiltonian dynamics associated to $H_l$ using a leapfrog integrator. Given a stepsize $\varepsilon_l > 0$ and a number of steps $M_l \in \mathbb{N}$, this numerical scheme initializes at $(x_0, v_0) = (x, v) \in \mathbb{R}^{2d}$ and iterates, for $m \in \{0, \ldots, M_l - 1\}$:

$v_{m+1/2} = v_m + \tfrac{\varepsilon_l}{2} \nabla \log \pi_l(x_m), \quad x_{m+1} = x_m + \varepsilon_l v_{m+1/2}, \quad v_{m+1} = v_{m+1/2} + \tfrac{\varepsilon_l}{2} \nabla \log \pi_l(x_{m+1}).$   (42)

As the Hamiltonian is not exactly conserved under the leapfrog integrator, the proposal $(x', v') = (x_{M_l}, v_{M_l})$ is then subjected to a Metropolis–Hastings accept-reject step, i.e. for $U \sim \mathcal{U}_{[0,1]}$, if

$U \le \bar{\alpha}_l\{(x, v), (x', v')\} := 1 \wedge \exp\{H_l(x, v) - H_l(x', v')\},$   (43)

we output $(X^\star, V^\star) = (x', v')$; otherwise we output $(X^\star, V^\star) = (x, v)$. The resulting transition $X^\star \sim P_l(x, \cdot)$ on the position coordinate defines an HMC kernel $P_l$ at level $l$.

Following [12], one can construct a faithful coupling of $P_l(x, \cdot)$ and $P_l(w, \cdot)$ for $(x, w) \in \mathbb{R}^d \times \mathbb{R}^d$ by employing a common velocity $v$ to initialize (42) and a common uniform random variable $U$ in the accept-reject step (43). If the initial positions $x$ and $w$ are in a region where $\pi_l$ is log-concave, and the integration time $\varepsilon_l M_l$ is appropriately chosen, this can lead to contractive proposals which are then accepted with high probability for small $\varepsilon_l$. The resulting coupled HMC kernel $\check{P}_l((x, w), \cdot)$ cannot be employed within the framework of Section 2.3.2, as it does not allow the chains to meet. To circumvent this issue, [12] considered a mixture kernel

$\check{K}_l((x, w), d(x', w')) = \kappa\, \check{P}_l((x, w), d(x', w')) + (1 - \kappa)\, \check{K}_l^{(M)}((x, w), d(x', w')),$   (44)

where $\kappa \in (0, 1)$ and $\check{K}_l^{(M)}$ denotes a coupled RWMH kernel based on the maximal coupling in (18). Although the latter enables meetings, the marginal kernel induced by (44) is not the HMC kernel $P_l$, but remains close if $\kappa$ is close to one.
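As an illustration, a minimal Python sketch of the leapfrog scheme (42) is given below; grad_log_pi(l, x), returning $\nabla \log \pi_l(x)$, is an assumed helper. Coupled HMC proposals are obtained by calling it on each level and each chain with the same initial velocity v.

```python
import numpy as np

def leapfrog(x, v, l, eps, M, grad_log_pi):
    x, v = np.array(x, dtype=float), np.array(v, dtype=float)
    for _ in range(M):
        v += 0.5 * eps * grad_log_pi(l, x)   # half-step for the velocity
        x += eps * v                         # full step for the position
        v += 0.5 * eps * grad_log_pi(l, x)   # second half-step for the velocity
    return x, v
```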
We now extend the work of [12] to our context. For $z_{l,l-1} = ((x_l, w_l), (x_{l-1}, w_{l-1})) \in \mathsf{Z}$, one can also construct a faithful coupling of $P_l(x_l, \cdot)$, $P_l(w_l, \cdot)$, $P_{l-1}(x_{l-1}, \cdot)$ and $P_{l-1}(w_{l-1}, \cdot)$ by using a common initial velocity in the leapfrog integrators and a common uniform random variable for all accept-reject steps. Let $\check{P}_{l,l-1}(z_{l,l-1}, \cdot)$ denote the resulting coupled HMC kernel on levels $l$ and $l-1$. In addition to the above-mentioned behaviour for the pairs on each level, this coupling also induces dependencies between the pairs across levels, which are crucial for the conditions in (11) or (13) to hold. Analogous to (44), we take

$\check{K}_{l,l-1}(z_{l,l-1}, dz_{l,l-1}') = \kappa\, \check{P}_{l,l-1}(z_{l,l-1}, dz_{l,l-1}') + (1 - \kappa)\, \check{K}_{l,l-1}^{(M)}(z_{l,l-1}, dz_{l,l-1}'),$   (45)

where $\kappa \in (0, 1)$ and $\check{K}_{l,l-1}^{(M)}$ denotes a coupled RWMH kernel based on the maximal couplings in Section 3.2.2 or Section 3.2.3. We note that the mixture kernel (45) satisfies the properties stated in Section 2.3.3. An algorithmic description of how to sample from it is provided in Algorithm 3.
Algorithm 3: Mixture of coupled HMC and coupled RWMH.

Input: current state $z_{l,l-1} = ((x_l, w_l), (x_{l-1}, w_{l-1})) \in \mathsf{Z}$, leapfrog stepsize $\varepsilon_l > 0$, number of leapfrog steps $M_l \in \mathbb{N}$, mixing probability $\kappa \in (0, 1)$ and coupled RWMH kernel $\check{K}_{l,l-1}^{(M)}$.
With probability $\kappa$:
1. Sample an initial velocity $v \sim \mathcal{N}_d(0_d, I_d)$ and $U \sim \mathcal{U}_{[0,1]}$.
2. For $s \in \{l, l-1\}$, run the leapfrog integrator (42) on level $s$ with initial condition $(x_s, v)$ to obtain the proposal $(x_s', v_s')$.
3. For $s \in \{l, l-1\}$, run the leapfrog integrator (42) on level $s$ with initial condition $(w_s, v)$ to obtain the proposal $(w_s', \tilde{v}_s')$.
4. For $s \in \{l, l-1\}$, if $U \le \bar{\alpha}_s\{(x_s, v), (x_s', v_s')\}$, set $X_s^\star = x_s'$; otherwise set $X_s^\star = x_s$.
5. For $s \in \{l, l-1\}$, if $U \le \bar{\alpha}_s\{(w_s, v), (w_s', \tilde{v}_s')\}$, set $W_s^\star = w_s'$; otherwise set $W_s^\star = w_s$.
Otherwise, generate $Z_{l,l-1}^\star = ((X_l^\star, W_l^\star), (X_{l-1}^\star, W_{l-1}^\star))$ according to $\check{K}_{l,l-1}^{(M)}(z_{l,l-1}, \cdot)$.
Output: Sample $Z_{l,l-1}^\star = ((X_l^\star, W_l^\star), (X_{l-1}^\star, W_{l-1}^\star))$ from $\check{K}_{l,l-1}(z_{l,l-1}, \cdot)$ in (45).

4 Numerical Examples

Three numerical examples will be used to illustrate the properties of various algorithms and our theoretical results. The elliptic PDE problem introduced in Section 2.2 will be considered in Section 4.1. In Section 4.1.1, we begin with a case where an analytical solution of the PDE is tractable. Subsequently, in Section 4.1.2, a particular example of the problem described in Section 2.2 is considered. Finally, we examine a model from epidemiology in Section 4.2, to analyze COVID-19 infections in the UK.

Two quantities of interest will be used to illustrate our methodology. The first is the expected value corresponding to the choice of function $\varphi(x) = x$. The next quantity of interest is motivated by the estimation of parameters $\theta$, such as the precision of the observation model in (3). One approach is based on maximizing the marginal likelihood $Z(\theta)$, defined as the normalizing constant of (4). We will compute the maximum likelihood estimator (MLE) $\theta_{\mathrm{MLE}} \in \arg\max Z(\theta)$ by employing a stochastic gradient algorithm [11, 20, 23], given by the iterative scheme

$\theta^{(i)} = \theta^{(i-1)} + \alpha_i\, \widehat{\nabla_\theta \log Z}(\theta^{(i-1)}), \quad i \ge 1,$   (46)

where $(\alpha_i)_{i\in\mathbb{N}}$ is a sequence of learning rates and $\widehat{\nabla_\theta \log Z}(\theta)$ denotes an unbiased estimator of

$\nabla_\theta \log Z(\theta) = \int_{\mathsf{X}} (\nabla_\theta \log \gamma_\theta(x))\, \pi_\theta(x)\, dx.$   (47)

Following convergence results in [11, 23], we select $\alpha_i = \alpha_0/i$ and choose $\alpha_0$ appropriately. We will rely on our methodology to obtain unbiased estimators of the score function (47) by choosing the function $\varphi_\theta(x) = \nabla_\theta \log \gamma_\theta(x)$. Finally, to deal with parameters that have positivity constraints, we apply a logarithmic transformation before employing (46).
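To illustrate, the following is a minimal Python sketch of (46) with the logarithmic transformation, where score_estimate(theta, rng) is a hypothetical unbiased estimator of (47), e.g. an averaged single term estimator (27) with $\varphi_\theta(x) = \nabla_\theta \log \gamma_\theta(x)$.

```python
import numpy as np

def stochastic_gradient(theta0, score_estimate, alpha0, n_iters, rng):
    log_theta = np.log(theta0)               # reparametrize for positivity
    for i in range(1, n_iters + 1):
        theta = np.exp(log_theta)
        # Chain rule: the gradient w.r.t. log(theta) is theta * d/dtheta log Z.
        log_theta += (alpha0 / i) * theta * score_estimate(theta, rng)
    return np.exp(log_theta)
```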
4.1 Elliptic PDE

4.1.1 Analytically Tractable Case

We first consider an example where an analytical solution is available. The PDE on $C = [0, 2\pi]$ is defined by (1) with constant diffusion coefficient $\Phi = 1$ and forcing $f(t; X) = 4X_1 \sin(2t) + X_2 \sin(t)$. The analytical solution is given by $h(t; X) = X_1 \sin(2t) + X_2 \sin(t)$. Furthermore, we assume a prior of $X \sim \mathcal{N}_2(0_2, 16 I_2)$ on the state space $\mathsf{X} = \mathbb{R}^2$. Although this setting extends beyond the theoretical framework we have considered, we expect our results to generalize. The observation functions (2) are given by the Dirac delta functions $g_p = \delta_{t_p}$, where $t_p = 2\pi(2p-1)/(2P)$ for $p \in \{1, \ldots, P\}$ with $P = 50$. We simulate observations $y \in \mathbb{R}^P$ from (3) using $x^\star = (2, -1)$ and $\theta = 100$.

Given a value of $\theta$ and data $y$, the posterior of $X$ in (4), denoted as $\pi_\theta$, has the form $\mathcal{N}_2(\mu_\theta, \Sigma_\theta)$, where

$\mu_\theta = \theta \Sigma_\theta G^T y, \qquad \Sigma_\theta^{-1} = \theta G^T G + 16^{-1} I_2.$

In the preceding line, $G \in \mathbb{R}^{P \times 2}$ is the forward model matrix (such that $G(x) = Gx$) with entries $G_{p,1} = \sin(2 t_p)$ and $G_{p,2} = \sin(t_p)$ for $p \in \{1, \ldots, P\}$. The above quantities of interest are also analytically tractable and used as ground truth. Firstly, for $\varphi(x) = x$, the expected value is $\pi_\theta(\varphi) = \mu_\theta$. Secondly, the score function (47) can be computed using the fact that the marginal likelihood satisfies $Z(\theta) = \phi_P(y; 0_P, 16\, G G^T + \theta^{-1} I_P)$.

Figure 1: Toy model of Section 4.1.1. Left: the error of the forward model approximation $\|h_l - h_{l-1}\|$ against the discretization level $l$ satisfies $\|h_l - h_{l-1}\| \le C \Delta_l^\beta$ with a rate of $\beta = 2$. Right: convergence of the stochastic gradient iterates $(\theta^{(i)})_{i\in\mathbb{Z}_+}$, defined in (46), to the maximum likelihood estimator $\theta_{\mathrm{MLE}}$, measured in terms of the mean squared error $\mathbb{E}[(\theta^{(i)} - \theta_{\mathrm{MLE}})^2]$, which was estimated using independent realizations. The learning rates considered here are $\alpha_i = \alpha_0/i$ with $\alpha_0 \in \{0.03, 0.01\}$.

To suit the domain under consideration, we take the mesh width of the FEM scheme in Section 2.2.2 as $\Delta_l = 2\pi \times 2^{-(l+l_0)}$ with $l_0 = 5$. Firstly, in the left panel of Figure 1, we numerically verify that our approximation of the forward model indeed converges at the rate of $\beta = 2$. Next, we consider the time-averaged estimator $\xi_l = \widehat{[\pi_l - \pi_{l-1}](\varphi)}_{T,k,m}$ in (21) with $k = 100$ and $m = 1000$, and examine the rate at which its second moment converges to zero as $l$ increases in Figure 2. These estimators are computed using a Markov chain $(Z_{n,l,l-1})_{n\in\mathbb{Z}_+}$ that is simulated using either the mixture of coupled RWMH and HMC kernels (Algorithms 2 and 3) in (45) (left panel), or a coupled pCN kernel based on the reflection maximal coupling of Section 3.2.3 (right panel). The algorithmic settings of (45) include a fixed mixing probability $\kappa$, a fixed stepsize $\varepsilon_l$ and $M_l = 10$ leapfrog steps for all $l \in \mathbb{Z}_+$ in the HMC kernels, and a fixed proposal covariance for all $l \in \mathbb{Z}_+$ in the RWMH kernels. For the pCN kernels, we took a fixed $\rho_l$ and $\sigma_l$ for all $l \in \mathbb{Z}_+$. To satisfy Assumption (A5), we initialize $Z_{0,l,l-1} = ((X_{0,l}, W_{0,l}), (X_{0,l-1}, W_{0,l-1}))$ from a coupling $\check{\nu}_{l,l-1}$ that can be described by the following steps: 1) sample $X_{0,l-1}'$ and $W_{0,l-1}$ from the prior $\mathcal{N}_2(0_2, 16 I_2)$ independently; 2) generate $X_{0,l}' \mid X_{0,l-1}' \sim \mathcal{N}_2(X_{0,l-1}', 2^{-(2l+1)} I_2)$ and $W_{0,l} \mid W_{0,l-1} \sim \mathcal{N}_2(W_{0,l-1}, 2^{-(2l+1)} I_2)$ independently; 3) generate $(X_{0,l}, X_{0,l-1}) \mid (X_{0,l}', X_{0,l-1}')$ according to $K_{l,l-1}((X_{0,l}', X_{0,l-1}'), \cdot)$, the marginal kernel on $\mathsf{X} \times \mathsf{X}$ induced by $\check{K}_{l,l-1}$ for a pair across levels.

Figure 2: Toy model of Section 4.1.1 at data generating parameter $\theta = 1$. Second moment of the time-averaged estimator $\xi_l = \widehat{[\pi_l - \pi_{l-1}](\varphi)}_{T,k,m}$ in (21) with $k = 100$ and $m = 1000$ against the discretization level $l$, using either the mixture of coupled RWMH and HMC kernels in (45) (left), or the coupled pCN kernel based on the reflection maximal coupling of Section 3.2.3 (right).
Figure 2: Toy model of Section 4.1.1 at the data-generating parameter $\theta = 100$. Second moment of the time-averaged estimator $\xi_l = \widehat{[\pi_l - \pi_{l-1}]}(\varphi)_{T,k,m}$ in (21) with $k = 100$ and $m = 1000$ against the discretization level $l$, using either the mixture of coupled RWMH and HMC kernels in (45) (left), or the coupled pCN kernel based on the reflection maximal coupling of Section 3.2.3 (right). The functions considered here are $\varphi(x) = \nabla_\theta \log \gamma_\theta(x)$ (blue), $\varphi(x) = x_1$ (orange) and $\varphi(x) = x_2$ (green). In both cases, we have $\mathbb{E}[\xi_l^2] \leq C \Delta_l^\beta$ with a rate of $\beta = 2$. The second moment was estimated using independent realizations.

Figure 3: Toy model of Section 4.1.1 at the data-generating parameter $\theta = 100$. Mean squared error $\mathbb{E}[(\widehat{\pi(\varphi)}(N) - \pi(\varphi))^2]$ of the averaged estimator against the number of single-term replicates $N$, using either the mixture of coupled RWMH and HMC kernels in (45) (left), or the coupled pCN kernel based on the reflection maximal coupling of Section 3.2.3 (right). The functions considered here are $\varphi(x) = \nabla_\theta \log \gamma_\theta(x)$ (blue), $\varphi(x) = x_1$ (orange) and $\varphi(x) = x_2$ (green). In both cases, we observe the standard Monte Carlo rate $\mathbb{E}[(\widehat{\pi(\varphi)}(N) - \pi(\varphi))^2] \leq C N^{-1}$. The mean squared error was estimated using the exact value of $\pi(\varphi)$ as ground truth and independent realizations.

We can infer from both plots in Figure 2 that $\mathbb{E}[\xi_l^2] \leq C \Delta_l^\beta$ with a rate of $\beta = 2$, which matches that of the forward model approximation. Hence the condition in (11) ensuring the unbiasedness and finite variance of the single-term estimator $\widehat{\pi(\varphi)}_{S,T,k,m}$ in (24) can be verified. As the cost of the marginal kernel at level $l$ is of order $\Delta_l^{-\omega}$ with $\omega = 1$, following the discussion in Section 2.4.3, we select $\mathbb{P}_L(l) \propto \Delta_l^\eta$ with $\eta = 5/4$ to ensure finite expected cost (a sketch of this randomization over levels appears at the end of this subsection). Figure 3 shows that by averaging $N \in \mathbb{N}$ independent replicates of the single-term estimator, we obtain an unbiased estimator $\widehat{\pi(\varphi)}(N)$ in (27) that satisfies the standard Monte Carlo rate as $N \to \infty$.

Lastly, in the right panel of Figure 1, we illustrate the convergence of the stochastic gradient algorithm (46), initialized at $\theta^{(0)} = 1$, to the maximum likelihood estimator $\theta_{\mathrm{MLE}}$, for two sequences of learning rates. The MLE was computed numerically by maximizing the marginal likelihood $Z(\theta)$ with the exact score function (47).
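The randomization over levels is straightforward to implement. The following sketch draws $L$ from $\mathbb{P}_L(l) \propto \Delta_l^\eta \propto 2^{-\eta l}$ exactly, via a geometric variable, and returns the single-term estimator; xi is a hypothetical callable returning the increment estimator $\xi_l$ for $l \geq 1$, or an unbiased estimator of $\pi_0(\varphi)$ for $l = 0$.

```python
import numpy as np

def single_term(xi, eta, rng):
    """One single-term estimator: E[xi(L) / P_L(L)] recovers pi(phi)
    when the expectations of xi(l) telescope over the levels."""
    q = 1.0 - 2.0 ** (-eta)      # P_L(l) = q * 2^{-eta * l}, l = 0, 1, ...
    L = rng.geometric(q) - 1     # numpy's geometric is supported on {1, 2, ...}
    return xi(L) / (q * 2.0 ** (-eta * L))

# averaging N i.i.d. replicates as in (27):
# rng = np.random.default_rng(1)
# est = np.mean([single_term(xi, 5 / 4, rng) for _ in range(1000)])
```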
4.1.2 Elliptic Bayesian inverse problem

The general case of Section 2.2 is now considered, with unknown diffusion coefficient $\Phi$ and forcing $f(t) = 100t$. The prior specification of $X = (X_1, X_2)$ is adopted with $d = 2$, basis functions $v_1(t) = \sin(\pi t)$ and $v_2(t) = \cos(2\pi t)$, and fixed values of $\bar{\Phi}$, $\vartheta_1$ and $\vartheta_2$. For this particular setting, the solution $h$ is continuous and hence pointwise observations are well-defined. The observation function $G(x)$ in (2) is chosen as $g_p(h(X)) = h(0.01 + 0.02(p-1); X)$ for $p \in \{1, \ldots, P\}$ with $P = 50$. We employ the FEM scheme in Section 2.2.2 with mesh width $\Delta_l = 2^{-(l + l_0)}$, where $l_0 = 3$. Using a discretization level of $l = 10$ to approximate $G(x)$ with $G_l(x)$, we simulate observations $y \in \mathbb{R}^P$ from (3) at a fixed data-generating value $x^\star$ and $\theta = 1$.

Figure 4 shows that the forward model approximation and the second moment of the time-averaged estimator $\xi_l = \widehat{[\pi_l - \pi_{l-1}]}(\varphi)_{T,k,m}$ in (21) with $k = 100$ and $m = 1000$ converge at the same rate of $\beta = 2$ as $l$ increases. The estimators $(\xi_l)$ are computed using the reflection maximal coupling of pCN kernels in Section 3.2.3 (a sketch of one coupled proposal is given at the end of this subsection), with $\sigma_l = I_d$ and a level-independent value of $\rho_l$ for all $l \in \mathbb{Z}_+$. The Markov chain is initialized in a similar manner to Section 4.1.1, with the exception that the prior is uniform in this case. As before, we can select $\mathbb{P}_L(l) \propto \Delta_l^\eta$ with $\eta = 5/4$ to ensure that the single-term estimator $\widehat{\pi(\varphi)}_{S,T,k,m}$ in (24) has finite variance and finite expected cost. The left panel of Figure 5 illustrates that averaging single-term estimators yields a consistent estimator that converges at the standard Monte Carlo rate.

Finally, we consider inference for $\theta$ in the Bayesian framework, under a prior $p(\theta)$ specified as a standard Gaussian prior on $\log \theta$. By adding the gradient of the log prior density $\nabla_\theta \log p(\theta)$ to (46), we can run a stochastic gradient algorithm, initialized at a fixed $\theta^{(0)}$, to compute the maximum a posteriori probability (MAP) estimator $\theta_{\mathrm{MAP}} \in \arg\max p(\theta) Z(\theta)$. The right panel of Figure 5 displays the convergence of the stochastic iterates to $\theta_{\mathrm{MAP}}$. As a competing algorithm, we consider the approach of [20], which can also compute unbiased estimators of the score function (47) by using the multilevel SMC algorithm of [2] instead of MCMC. The plot shows some gains over [20] when the same learning rates are employed.
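For concreteness, here is a sketch of one step of a coupled pCN proposal under a reflection maximal coupling, in the spirit of Section 3.2.3 and [14, 22]; the exact construction used in our experiments may differ in minor details. On acceptance of the common noise, the two proposals coincide, which is what allows the coupled chains to meet.

```python
import numpy as np

def coupled_pcn_proposal(x, w, rho, sigma, rng):
    """Reflection maximal coupling of the pCN proposals
    N(rho * x, (1 - rho^2) sigma^2 I) and N(rho * w, (1 - rho^2) sigma^2 I)."""
    s = np.sqrt(1.0 - rho ** 2) * sigma
    z = rho * (x - w) / s                  # standardized difference of means
    xi = rng.standard_normal(x.shape)
    nz = np.linalg.norm(z)
    if nz == 0.0:                          # chains already coincide
        return rho * x + s * xi, rho * w + s * xi
    # accept with probability min(1, phi(xi + z) / phi(xi))
    if np.log(rng.uniform()) < 0.5 * (xi @ xi - (xi + z) @ (xi + z)):
        eta = xi + z                       # the two proposals are then identical
    else:
        e = z / nz
        eta = xi - 2.0 * (e @ xi) * e      # reflect xi in the direction of z
    return rho * x + s * xi, rho * w + s * eta
```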
Figure 4: Elliptic Bayesian inverse problem of Sections 2.2 and 4.1.2. Left: the error of the forward model approximation $\|h_l - h_{l-1}\|$ against the discretization level $l$ satisfies $\|h_l - h_{l-1}\| \leq C \Delta_l^\beta$ with a rate of $\beta = 2$. Right: the second moment of the time-averaged estimator $\xi_l = \widehat{[\pi_l - \pi_{l-1}]}(\varphi)_{T,k,m}$ in (21) with $k = 100$ and $m = 1000$ against the discretization level $l$ satisfies $\mathbb{E}[\xi_l^2] \leq C \Delta_l^\beta$ with a rate of $\beta = 2$. The function considered here is $\varphi(x) = \nabla_\theta \log \gamma_\theta(x)$ at a fixed value of $\theta$. The second moment was estimated using independent realizations.

Figure 5: Elliptic Bayesian inverse problem of Sections 2.2 and 4.1.2. Left: mean squared error $\mathbb{E}[(\widehat{\pi(\varphi)}(N) - \pi(\varphi))^2]$ of the averaged estimator against the number of single-term replicates $N$, for the function $\varphi(x) = \nabla_\theta \log \gamma_\theta(x)$ at a fixed value of $\theta$. The mean squared error was estimated using an average of independent replicates with a minimal discretization level of $l = 8$ as ground truth for $\pi(\varphi)$, and independent realizations. Right: convergence of the stochastic gradient iterates $(\theta^{(i)})_{i \in \mathbb{Z}_+}$ to the maximum a posteriori probability estimator $\theta_{\mathrm{MAP}}$. The mean squared error $\mathbb{E}[(\theta^{(i)} - \theta_{\mathrm{MAP}})^2]$ was estimated using a longer run of the stochastic gradient algorithm as ground truth for $\theta_{\mathrm{MAP}}$, and independent realizations. The learning rates considered here are $\alpha_i = \alpha_0 / i$ with $\alpha_0 \in \{0.30, 0.10, 0.03\}$. The red curve corresponds to a comparison with the unbiased MLSMC algorithm of [20]. Here the cost is measured in terms of the number of forward model simulations, weighted by the complexity associated with each discretization level, and averaged over realizations.

4.2 Compartmental model

Our final application concerns parameter inference for an epidemiological model to analyze COVID-19 infections in the UK. We consider a recently developed compartmental model [24] for a closed population, where $S(t)$ denotes the proportion that is susceptible to the disease, $I(t)$ denotes the proportion of infected individuals, $R(t)$ denotes the proportion of individuals who have recovered and are no longer part of the transmission process, and $\Xi(t)$ denotes the proportion of symptomatic and infected individuals who have been quarantined. The model dynamics are governed by the following system of ordinary differential equations:
$$\frac{d}{dt} S(t) = -a S(t) I(t) - x_1 S(t), \qquad \frac{d}{dt} I(t) = a S(t) I(t) - (b + x_1 + x_2) I(t), \qquad (48)$$
$$\frac{d}{dt} R(t) = b I(t) + x_1 S(t), \qquad \frac{d}{dt} \Xi(t) = (x_1 + x_2) I(t).$$
A unit of time in the model will represent the duration of a day. In Equation (48), $a > 0$ and $b > 0$ denote the transmission rate and recovery rate, respectively; following [24], we adopt the values estimated there for COVID-19. The parameter $x_1 > 0$ captures public containment policies or individual behavioural changes in response to the epidemic, while quarantine measures for symptomatic and infected individuals are described by the parameter $x_2 > 0$. We will also infer the time elapsed between the first infection and its reporting, $x_3 > 0$. Letting time $t = 0$ correspond to the first reported case on January 24, 2020, the initial condition is $(S(-x_3), I(-x_3), R(-x_3), \Xi(-x_3)) = (1 - 1/N_{\mathrm{pop}}, 1/N_{\mathrm{pop}}, 0, 0)$, where $N_{\mathrm{pop}}$ (approximately 66 million) denotes the size of the UK population. A minimal numerical integration of (48) is sketched below.
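The following is a minimal sketch of the dynamics (48) and their integration with the classical fourth-order Runge-Kutta scheme [30], matching the discretization discussed next; the parameter values are deliberately left as arguments since the values of $a$ and $b$ are taken from [24].

```python
import numpy as np

def sirx_rhs(u, a, b, x1, x2):
    """Right-hand side of the compartmental model (48)."""
    S, I, R, Xi = u
    return np.array([
        -a * S * I - x1 * S,
        a * S * I - (b + x1 + x2) * I,
        b * I + x1 * S,
        (x1 + x2) * I,
    ])

def rk4_solve(u0, a, b, x1, x2, t_end, dt):
    """Integrate (48) from t = 0 to t_end with stepsize dt using the
    classical fourth-order Runge-Kutta scheme; returns the trajectory."""
    u = np.asarray(u0, dtype=float)
    traj = [u.copy()]
    for _ in range(int(round(t_end / dt))):
        k1 = sirx_rhs(u, a, b, x1, x2)
        k2 = sirx_rhs(u + 0.5 * dt * k1, a, b, x1, x2)
        k3 = sirx_rhs(u + 0.5 * dt * k2, a, b, x1, x2)
        k4 = sirx_rhs(u + dt * k3, a, b, x1, x2)
        u = u + (dt / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
        traj.append(u.copy())
    return np.array(traj)
```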
Given parameters $x = (x_1, x_2, x_3)$, we will write the solution of (48) at time $t$ as $(S(t;x), I(t;x), R(t;x), \Xi(t;x))$. As our prior specification takes $X_1$, $X_2$ and $X_3$ to be independent and uniformly distributed on bounded intervals, we will work on the compact state space $\mathsf{X}$ given by the product of these intervals.

To account for under-reporting, the observed proportion of daily confirmed cases $(Y_i)_{i=1}^P$ is modelled as
$$\log(Y_i) = \log(G_i(x)) - \Gamma_i, \qquad (49)$$
for $i \in \{1, \ldots, P\}$, where
$$G_i(x) = a \int_{n_0 + i - 1}^{n_0 + i} S(t;x)\, I(t;x)\, dt \qquad (50)$$
denotes the proportion of daily new infections under model (48), and $(\Gamma_i)_{i=1}^P$ are independent gamma random variables with shape parameter $\theta_1 > 0$ and scale parameter $\theta_2 > 0$. We set $n_0 = 29$ to consider only $P = 40$ observations $y = (y_i)_{i=1}^P$ from February 12, 2020, as earlier data seem to be unreliable. We note that the observation model in (49) differs from [24], which adopted a least squares approach to infer parameters. Under the gamma likelihood, the unnormalized posterior density of $X$ given $y$ and $\theta = (\theta_1, \theta_2)$ is
$$\gamma_\theta(x) = \prod_{i=1}^P \frac{1}{\Gamma(\theta_1)\, \theta_2^{\theta_1}} \big(\log(G_i(x)/y_i)\big)^{\theta_1 - 1} \exp\big(-\log(G_i(x)/y_i)/\theta_2\big)\, \mathbb{I}_A(x), \qquad (51)$$
where $A = \{x \in \mathsf{X} : G_i(x) \geq y_i,\ i = 1, \ldots, P\}$; a sketch of the evaluation of (51) is given below.

Any practical implementation of MCMC targeting (51) requires an approximation of $G_i(x)$ in (50). As it suffices to approximate $h(t;x)$ satisfying $(d/dt) h(t) = a S(t) I(t)$, we augment the system in (48) and employ a fourth-order Runge-Kutta numerical integrator [30] with stepsize $\Delta_l \propto 2^{-l}$, for $l \in \mathbb{Z}_+$. The left panel of Figure 6 shows that the resulting approximation $h_l(t;x)$ converges to $h(t;x)$ at the expected rate of $\beta = 4$. We can now apply our proposed methodology to approximate expectations. The right panel shows that the second moment of the time-averaged estimator $\xi_l = \widehat{[\pi_l - \pi_{l-1}]}(\varphi)_{T,k,m}$ in (21) with $k = 200$ and $m = 2000$ also converges at the same rate. To compute the estimators $(\xi_l)$, we used the reflection maximal coupling of pCN kernels in Section 3.2.3, with $\sigma_l = I_d$ and a level-independent value of $\rho_l$ for all $l \in \mathbb{Z}_+$. The Markov chain is initialized in a similar manner to Section 4.1.1, with the exception of truncating the prior to the subset $A$ in this case. Since the cost of the marginal pCN kernel at level $l$ is of order $\Delta_l^{-\omega}$ with $\omega = 1$, we choose $\mathbb{P}_L(l) \propto \Delta_l^\eta$ with $\eta = 9/4$ to ensure that the single-term estimator $\widehat{\pi(\varphi)}_{S,T,k,m}$ in (24) has finite variance and finite expected cost. The left panel of Figure 7 illustrates the impact of averaging independent replicates. As before, we compute the MLE of $\theta$ using the stochastic gradient algorithm (46), initialized at a fixed $\theta^{(0)}$. Finally, we exploit the fitted model to infer the extent of under-reporting during the time period under consideration: in Figure 8, we display the ratio of the total number of reported cases to the expected number of total infections under the posterior distribution associated with model (48) and $\theta = \theta_{\mathrm{MLE}}$.
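For illustration, the log of the unnormalized density (51) can be evaluated as follows, given the vector $(G_i(x))_{i=1}^P$ (computed, e.g., from the Runge-Kutta approximation sketched above); multiplicative constants that do not depend on $x$ are immaterial for MCMC, but the normalizing terms in $\theta$ are retained since (51) is also differentiated in $\theta$ for the score.

```python
import numpy as np
from scipy.stats import gamma

def log_gamma_target(G, y, theta1, theta2):
    """Unnormalized log posterior density (51): the residuals
    log(G_i(x) / y_i) are i.i.d. Gamma(shape=theta1, scale=theta2)."""
    r = np.log(np.asarray(G) / np.asarray(y))
    if np.any(r < 0.0):                   # outside the support A in (51)
        return -np.inf
    return gamma.logpdf(r, a=theta1, scale=theta2).sum()
```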
Figure 6: Compartmental model of Section 4.2. Left: the error of the forward model approximation $|h_l(t;x) - h_{l-1}(t;x)|$ against the discretization level $l$ satisfies $|h_l(t;x) - h_{l-1}(t;x)| \leq C \Delta_l^\beta$ with a rate of $\beta = 4$, at the observation times $t$ and a fixed parameter $x$. Right: the second moment of the time-averaged estimator $\xi_l = \widehat{[\pi_l - \pi_{l-1}]}(\varphi)_{T,k,m}$ in (21) with $k = 200$ and $m = 2000$ against the discretization level $l$ satisfies $\mathbb{E}[\xi_l^2] \leq C \Delta_l^\beta$ with a rate of $\beta = 4$. The function considered here is $\varphi(x) = \nabla_\theta \log \gamma_\theta(x)$ at a fixed value of $\theta$. The second moment was estimated using independent realizations.

Figure 7: Compartmental model of Section 4.2. Left: mean squared error $\mathbb{E}[(\widehat{\pi(\varphi)}(N) - \pi(\varphi))^2]$ of the averaged estimator against the number of single-term replicates $N$, for the functions $\varphi(x) = \partial_{\theta_1} \log \gamma_\theta(x)$ (blue) and $\varphi(x) = \partial_{\theta_2} \log \gamma_\theta(x)$ (orange) at a fixed value of $\theta$. The mean squared error was estimated using an average of independent replicates at a fine discretization as ground truth for $\pi(\varphi)$, and independent realizations. Right: convergence of the stochastic gradient iterates $(\theta^{(i)})_{i \in \mathbb{Z}_+}$ to the maximum likelihood estimator $\theta_{\mathrm{MLE}}$. The mean squared error $\mathbb{E}[(\theta^{(i)} - \theta_{\mathrm{MLE}})^2]$ was estimated using a longer run of the stochastic gradient algorithm as ground truth for $\theta_{\mathrm{MLE}}$, and independent realizations. The sequence of learning rates considered here is $\alpha_i = \alpha_0 / i$ for a fixed $\alpha_0$.

Figure 8: Compartmental model of Section 4.2. Ratio of the total number of reported cases $C_i = C_0 + \sum_{j=1}^i y_j$ ($C_0$ denotes the number of cases before February 12, 2020) to the expected number of total infections $\mathbb{E}\big[a \int_{-X_3}^{n_0 + i} S(t;X)\, I(t;X)\, dt \mid y, \theta_{\mathrm{MLE}}\big]$ on day $n_0 + i$. The vertical axis displays the fraction of observed cases.
Acknowledgements

JH was funded by CY Initiative of Excellence (grant "Investissements d'Avenir" ANR-16-IDEX-0008). AJ was supported by KAUST baseline funding. AT and KJHL were supported by The Alan Turing Institute under the EPSRC grant EP/N510129/1.
A Proofs
Throughout the appendix, $C$ denotes a finite constant that does not depend upon $l$ nor on the time parameter of the Markov chain; its value may change upon each appearance. For ease of notation only, we will set $k = 0$ from here on; the proofs, with some minor modifications, hold for any $k \in \mathbb{Z}_+$. The appendix first gives the proof of Theorem 2.1 and then a collection of technical results which are used to achieve the proof.

Proof of Theorem 2.1.
To prove the result we must verify, for any $(l, \varphi) \in \mathbb{N} \times \mathcal{B}_b(\mathsf{X}) \cap \mathrm{Lip}_{\tilde{d}}(\mathsf{X})$,
$$\mathbb{E}\big[\widehat{\pi_0(\varphi)}\big] = \pi_0(\varphi), \qquad (52)$$
$$\mathbb{E}\big[\widehat{[\pi_l - \pi_{l-1}]}(\varphi)\big] = [\pi_l - \pi_{l-1}](\varphi), \qquad (53)$$
and that
$$\mathbb{P}_L(0)^{-1}\, \mathbb{E}\big[\widehat{\pi_0(\varphi)}^2\big] + \sum_{l \in \mathbb{N}} \mathbb{P}_L(l)^{-1}\, \mathbb{E}\big[\{\widehat{[\pi_l - \pi_{l-1}]}(\varphi)\}^2\big] < +\infty. \qquad (54)$$
Equations (52) and (53) are verified in Proposition A.1. For (54), $\mathbb{E}[\widehat{\pi_0(\varphi)}^2]$ can be controlled by using Lemma A.6, and then by Lemma A.5 we have that
$$\sum_{l \in \mathbb{N}} \mathbb{P}_L(l)^{-1}\, \mathbb{E}\big[\{\widehat{[\pi_l - \pi_{l-1}]}(\varphi)\}^2\big] \leq C\, (\|\varphi\|_\infty \vee \|\varphi\|_{\mathrm{Lip}})^2 \sum_{l \in \mathbb{N}} \mathbb{P}_L(l)^{-1}\, \Delta_l^{\beta \wedge \bar{\beta}}.$$
The proof can now easily be completed.
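For context, the quantity in (54) is precisely the second moment of the single-term estimator: writing $\xi_0 = \widehat{\pi_0(\varphi)}$ and $\xi_l = \widehat{[\pi_l - \pi_{l-1}]}(\varphi)$ for $l \in \mathbb{N}$, conditioning on the level $L \sim \mathbb{P}_L$ gives
$$\mathbb{E}\Big[\Big(\frac{\xi_L}{\mathbb{P}_L(L)}\Big)^2\Big] = \sum_{l \in \mathbb{Z}_+} \mathbb{P}_L(l)\, \frac{\mathbb{E}[\xi_l^2]}{\mathbb{P}_L(l)^2} = \sum_{l \in \mathbb{Z}_+} \frac{\mathbb{E}[\xi_l^2]}{\mathbb{P}_L(l)},$$
so that finiteness of (54) delivers the finite variance claimed in the theorem.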
Proposition A.1. Assume (A1-2). Then we have, for any $(l, \varphi) \in \mathbb{N} \times \mathcal{B}_b(\mathsf{X})$,
$$\mathbb{E}\big[\widehat{\pi_0(\varphi)}\big] = \pi_0(\varphi), \qquad \mathbb{E}\big[\widehat{[\pi_l - \pi_{l-1}]}(\varphi)\big] = [\pi_l - \pi_{l-1}](\varphi).$$

Proof.
The case of $\widehat{\pi_0(\varphi)}$ is essentially that in [10], so we focus on the second expectation. We have the standard martingale-plus-remainder decomposition:
$$\widehat{[\pi_l - \pi_{l-1}]}(\varphi) = \varphi(X_{0,l}) + \sum_{n=1}^{\check{\tau}_{l,l-1} - 1} \{\varphi(X_{n,l}) - \varphi(W_{n,l})\} - \Big(\varphi(X_{0,l-1}) + \sum_{n=1}^{\check{\tau}_{l,l-1} - 1} \{\varphi(X_{n,l-1}) - \varphi(W_{n,l-1})\}\Big)$$
$$= \varphi(X_{0,l}) + \sum_{n=1}^{\check{\tau}_{l,l-1}} \zeta_l(X_{n-1:n,l}, W_{n-1:n,l}) + K_l(\widehat{\varphi}_l)(X_{0,l}) - K_l(\widehat{\varphi}_l)(W_{0,l})$$
$$\quad - \Big(\varphi(X_{0,l-1}) + \sum_{n=1}^{\check{\tau}_{l,l-1}} \zeta_{l-1}(X_{n-1:n,l-1}, W_{n-1:n,l-1}) + K_{l-1}(\widehat{\varphi}_{l-1})(X_{0,l-1}) - K_{l-1}(\widehat{\varphi}_{l-1})(W_{0,l-1})\Big),$$
where, for $s \in \{l, l-1\}$ and $(x_{n-1:n}, w_{n-1:n}) \in \mathsf{X}^2 \times \mathsf{X}^2$,
$$\zeta_s(x_{n-1:n}, w_{n-1:n}) := \widehat{\varphi}_s(x_n) - K_s(\widehat{\varphi}_s)(x_{n-1}) - \big(\widehat{\varphi}_s(w_n) - K_s(\widehat{\varphi}_s)(w_{n-1})\big),$$
and
$$\widehat{\varphi}_s(x) = \sum_{n \in \mathbb{Z}_+} [K_s^n(\varphi)(x) - \pi_s(\varphi)] \qquad (55)$$
is well-defined for each $x \in \mathsf{X}$ and solves the Poisson equation
$$\widehat{\varphi}_s(x) - K_s(\widehat{\varphi}_s)(x) = \varphi(x) - \pi_s(\varphi).$$
Set $(\mathcal{F}_n)_{n \in \mathbb{Z}_+}$ as the natural filtration generated by $(Z_{n,l,l-1})_{n \geq 0}$ and, for $(n, s) \in \mathbb{N} \times \{l, l-1\}$,
$$M_{n,s} := \sum_{j=1}^n \zeta_s(X_{j-1:j,s}, W_{j-1:j,s}),$$
with $M_{0,l} = M_{0,l-1} = 0$. Then for each $s \in \{l, l-1\}$, $(M_{n,s}, \mathcal{F}_n)_{n \in \mathbb{Z}_+}$ is a martingale. Thus, we have
$$\widehat{[\pi_l - \pi_{l-1}]}(\varphi) = \varphi(X_{0,l}) + M_{\check{\tau}_{l,l-1},l} + K_l(\widehat{\varphi}_l)(X_{0,l}) - K_l(\widehat{\varphi}_l)(W_{0,l}) - \big(\varphi(X_{0,l-1}) + M_{\check{\tau}_{l,l-1},l-1} + K_{l-1}(\widehat{\varphi}_{l-1})(X_{0,l-1}) - K_{l-1}(\widehat{\varphi}_{l-1})(W_{0,l-1})\big). \qquad (56)$$
Now taking expectations on both sides of the equation and applying the optional sampling theorem, we have
$$\mathbb{E}\big[\widehat{[\pi_l - \pi_{l-1}]}(\varphi)\big] = \nu_l K_l(\varphi) + \nu_l K_l^2(\widehat{\varphi}_l) - \nu_l K_l(\widehat{\varphi}_l) - \big(\nu_{l-1} K_{l-1}(\varphi) + \nu_{l-1} K_{l-1}^2(\widehat{\varphi}_{l-1}) - \nu_{l-1} K_{l-1}(\widehat{\varphi}_{l-1})\big).$$
Since the Poisson equation gives $\nu_s K_s(\widehat{\varphi}_s) - \nu_s K_s^2(\widehat{\varphi}_s) = \nu_s K_s(\varphi) - \pi_s(\varphi)$ for $s \in \{l, l-1\}$, each bracketed expression reduces to $\pi_s(\varphi)$. Hence, we have that $\mathbb{E}[\widehat{[\pi_l - \pi_{l-1}]}(\varphi)] = [\pi_l - \pi_{l-1}](\varphi)$.

Remark A.1.
In the proof we have essentially derived the first Wald equality for Markov chains (see [26]). The main point is to provide the martingale-plus-remainder decomposition (56).

Lemma A.1.
Assume (A1,3). Then there exists a $C \in (0, \infty)$ such that for any $(l, \varphi) \in \mathbb{Z}_+ \times \mathcal{B}_b(\mathsf{X}) \cap \mathrm{Lip}_{\tilde{d}}(\mathsf{X})$ we have:
$$|\widehat{\varphi}_l(x) - \widehat{\varphi}_l(w)| \vee |K_l(\widehat{\varphi}_l)(x) - K_l(\widehat{\varphi}_l)(w)| \leq C\, (\|\varphi\|_\infty \vee \|\varphi\|_{\mathrm{Lip}})\, \tilde{d}(x, w).$$

Lemma A.2.
Assume (A1,4). Then there exists a $C \in (0, \infty)$ such that for any $(l, \varphi) \in \mathbb{N} \times \mathcal{B}_b(\mathsf{X})$ we have:
$$\sup_{x \in \mathsf{X}} |\widehat{\varphi}_l(x) - \widehat{\varphi}_{l-1}(x)| \vee \sup_{x \in \mathsf{X}} |K_l(\widehat{\varphi}_l)(x) - K_{l-1}(\widehat{\varphi}_{l-1})(x)| \leq C\, \|\varphi\|_\infty\, \Delta_l^\beta,$$
where $\beta$ is as in (A4).

Lemma A.3.
Assume (A5). Then there exists a $C \in (0, \infty)$ such that for any $(l, n) \in \mathbb{N} \times \mathbb{Z}_+$ we have:
$$\mathbb{E}\big[\mathbb{I}_{B(C, \Delta_l^{\bar{\beta}}, \tilde{d})^c}(Z_{n,l,l-1})\big] \leq C\, (n+1)\, \Delta_l^{\bar{\beta}(2+\epsilon)},$$
$$\mathbb{E}\big[\tilde{d}(X_{n,l}, X_{n,l-1})^{2+\epsilon}\big] \vee \mathbb{E}\big[\tilde{d}(W_{n,l}, W_{n,l-1})^{2+\epsilon}\big] \leq C\, (n+1)\, \Delta_l^{\bar{\beta}(2+\epsilon)},$$
where $\bar{\beta}$ and $\epsilon$ are as in (A5).

Proof. The proof of the first statement is by induction: the statement holds at step zero by assumption, so assume the result at $n - 1$. Writing $B$ for $B(C, \Delta_l^{\bar{\beta}}, \tilde{d})$, we have
$$\mathbb{E}[\mathbb{I}_{B^c}(Z_{n,l,l-1})] = \mathbb{E}[\mathbb{I}_{B^c \times B}(Z_{n,l,l-1}, Z_{n-1,l,l-1})] + \mathbb{E}[\mathbb{I}_{B^c \times B^c}(Z_{n,l,l-1}, Z_{n-1,l,l-1})].$$
Then, applying (A5) along with the induction hypothesis, one can conclude that
$$\mathbb{E}[\mathbb{I}_{B^c}(Z_{n,l,l-1})] \leq C\, (n+1)\, \Delta_l^{\bar{\beta}(2+\epsilon)},$$
and hence the proof of the first statement is complete.

For the second statement, we consider only $\tilde{d}(X_{n,l}, X_{n,l-1})^{2+\epsilon}$, as the argument is the same for $\tilde{d}(W_{n,l}, W_{n,l-1})^{2+\epsilon}$. We have
$$\mathbb{E}[\tilde{d}(X_{n,l}, X_{n,l-1})^{2+\epsilon}] = \mathbb{E}\big[\tilde{d}(X_{n,l}, X_{n,l-1})^{2+\epsilon}\{\mathbb{I}_B(Z_{n,l,l-1}) + \mathbb{I}_{B^c}(Z_{n,l,l-1})\}\big].$$
On $B(C, \Delta_l^{\bar{\beta}}, \tilde{d})$, one has $\tilde{d}(X_{n,l}, X_{n,l-1})^{2+\epsilon} \leq C\, \Delta_l^{\bar{\beta}(2+\epsilon)}$. For the second term on the R.H.S., $\mathsf{X}$ is compact and $\tilde{d}(X_{n,l}, X_{n,l-1})^{2+\epsilon}$ is bounded, so one can use the first part of the statement to deduce that
$$\mathbb{E}[\tilde{d}(X_{n,l}, X_{n,l-1})^{2+\epsilon}] \leq C\, (n+1)\, \Delta_l^{\bar{\beta}(2+\epsilon)}.$$

Lemma A.4.
Assume (A1-5). Then there exists a $C \in (0, \infty)$ such that for any $(l, \varphi) \in \mathbb{N} \times \mathcal{B}_b(\mathsf{X}) \cap \mathrm{Lip}_{\tilde{d}}(\mathsf{X})$ we have:
$$\mathbb{E}\big[\{M_{\check{\tau}_{l,l-1},l} - M_{\check{\tau}_{l,l-1},l-1}\}^2\big] \leq C\, (\|\varphi\|_\infty \vee \|\varphi\|_{\mathrm{Lip}})^2\, \Delta_l^{\beta \wedge \bar{\beta}},$$
where $\tilde{d}$ is as in (A3), $\beta$ is as in (A4) and $\bar{\beta}$ is as in (A5).

Proof. Set, for $n \in \mathbb{Z}_+$,
$$\check{M}_{n,l,l-1} := M_{n,l} - M_{n,l-1};$$
then $(\check{M}_{n,l,l-1}, \mathcal{F}_n)_{n \in \mathbb{Z}_+}$ is a martingale and moreover, as $\check{\tau}_{l,l-1}$ is an $(\mathcal{F}_n)$-stopping time, so is $(\check{M}_{\check{\tau}_{l,l-1} \wedge n, l,l-1}, \mathcal{F}_n)_{n \in \mathbb{Z}_+}$. Then, by the Burkholder-Davis-Gundy inequality, we have
$$\mathbb{E}\big[\check{M}^2_{\check{\tau}_{l,l-1} \wedge n, l,l-1}\big] \leq \mathbb{E}\bigg[\sum_{j=1}^{\check{\tau}_{l,l-1} \wedge n} \{\zeta_l(X_{j-1:j,l}, W_{j-1:j,l}) - \zeta_{l-1}(X_{j-1:j,l-1}, W_{j-1:j,l-1})\}^2\bigg].$$
By (A2), $\check{\tau}_{l,l-1}$ is almost surely finite, so by the monotone convergence theorem:
$$\mathbb{E}\big[\check{M}^2_{\check{\tau}_{l,l-1}, l,l-1}\big] \leq \mathbb{E}\bigg[\sum_{j=1}^{\check{\tau}_{l,l-1}} \{\zeta_l(X_{j-1:j,l}, W_{j-1:j,l}) - \zeta_{l-1}(X_{j-1:j,l-1}, W_{j-1:j,l-1})\}^2\bigg]. \qquad (57)$$
Then we can upper-bound the R.H.S. to yield $\mathbb{E}[\check{M}^2_{\check{\tau}_{l,l-1},l,l-1}] \leq C(T_1 + T_2)$, where
$$T_1 := \mathbb{E}\bigg[\sum_{n=1}^{\check{\tau}_{l,l-1}} \big\{\widehat{\varphi}_l(X_{n,l}) - \widehat{\varphi}_l(W_{n,l}) - \widehat{\varphi}_{l-1}(X_{n,l-1}) + \widehat{\varphi}_{l-1}(W_{n,l-1})\big\}^2\bigg], \qquad (58)$$
$$T_2 := \mathbb{E}\bigg[\sum_{n=1}^{\check{\tau}_{l,l-1}} \big\{K_l(\widehat{\varphi}_l)(X_{n-1,l}) - K_l(\widehat{\varphi}_l)(W_{n-1,l}) - K_{l-1}(\widehat{\varphi}_{l-1})(X_{n-1,l-1}) + K_{l-1}(\widehat{\varphi}_{l-1})(W_{n-1,l-1})\big\}^2\bigg]. \qquad (59)$$
To conclude the proof, one must appropriately upper-bound (58) and (59). As the forms are very similar, owing to the expression (55) for $\widehat{\varphi}_s$, $s \in \{l, l-1\}$, we will give the proof for (58) only; the proof for (59) is almost the same.

Define, for $z \in \mathsf{Z}$,
$$\widehat{\varphi}_{l,l-1}(z) := \sum_{n \in \mathbb{Z}_+} \check{K}^n_{l,l-1}\big(\widehat{\psi}_{l,l-1}\big)(z),$$
where $\widehat{\psi}_{l,l-1}(x_l, w_l, x_{l-1}, w_{l-1}) = (\widehat{\varphi}_l(x_l) - \widehat{\varphi}_l(w_l) - \widehat{\varphi}_{l-1}(x_{l-1}) + \widehat{\varphi}_{l-1}(w_{l-1}))^2$. We note that by (A2), one can easily verify that $\widehat{\varphi}_{l,l-1}$ is a well-defined function. Now, by the first Wald equality for Markov chains, one has
$$\mathbb{E}[T_1] = \mathbb{E}\big[\check{K}_{l,l-1}(\widehat{\varphi}_{l,l-1})(Z_{0,l,l-1})\big].$$
This is because $\mathbb{E}[\check{K}_{l,l-1}(\widehat{\varphi}_{l,l-1})(Z_{\check{\tau}_{l,l-1},l,l-1})] = 0$, $\check{\pi}_{l,l-1}(\widehat{\psi}_{l,l-1}) = 0$ and $\mathbb{E}[\check{\tau}_{l,l-1}] < +\infty$, where $\check{\pi}_{l,l-1}$ is the invariant measure of $\check{K}_{l,l-1}$ (marginally, the invariant distribution of $(X_s, W_s)$ is $\pi_s(dx_s)\, \delta_{x_s}(dw_s)$ for $s \in \{l, l-1\}$). Now
$$T_1 = \mathbb{E}[\widehat{\psi}_{l,l-1}(Z_{0,l,l-1})] + \mathbb{E}\bigg[\sum_{n \in \mathbb{N}} \mathbb{E}[\widehat{\psi}_{l,l-1}(Z_{n,l,l-1}) \mid Z_{0,l,l-1}]\bigg]. \qquad (60)$$
We have
$$\mathbb{E}[\widehat{\psi}_{l,l-1}(Z_{0,l,l-1})] = \mathbb{E}\big[\big(\widehat{\varphi}_l(X_{0,l}) - \widehat{\varphi}_l(W_{0,l}) - \widehat{\varphi}_{l-1}(X_{0,l-1}) + \widehat{\varphi}_{l-1}(W_{0,l-1})\big)^2\big]$$
$$\leq C\Big\{\mathbb{E}\big[\big(\widehat{\varphi}_l(X_{0,l}) - \widehat{\varphi}_{l-1}(X_{0,l}) + \widehat{\varphi}_{l-1}(X_{0,l}) - \widehat{\varphi}_{l-1}(X_{0,l-1})\big)^2\big] + \mathbb{E}\big[\big(\widehat{\varphi}_l(W_{0,l}) - \widehat{\varphi}_{l-1}(W_{0,l}) + \widehat{\varphi}_{l-1}(W_{0,l}) - \widehat{\varphi}_{l-1}(W_{0,l-1})\big)^2\big]\Big\}$$
$$\leq C\, (\|\varphi\|_\infty \vee \|\varphi\|_{\mathrm{Lip}})^2 \big(\Delta_l^{2\beta} + \mathbb{E}[\tilde{d}(X_{0,l}, X_{0,l-1})^2] + \Delta_l^{2\beta} + \mathbb{E}[\tilde{d}(W_{0,l}, W_{0,l-1})^2]\big)$$
$$\leq C\, (\|\varphi\|_\infty \vee \|\varphi\|_{\mathrm{Lip}})^2\, \Delta_l^{\beta \wedge \bar{\beta}}, \qquad (61)$$
where we have applied the $C_2$-inequality and Lemmata A.1-A.2 to go to the third line, and Lemma A.3 (with Jensen's inequality) to go to the last line.
Then, we also have
$$\mathbb{E}\bigg[\sum_{n \in \mathbb{N}} \mathbb{E}[\widehat{\psi}_{l,l-1}(Z_{n,l,l-1}) \mid Z_{0,l,l-1}]\bigg] = \mathbb{E}\bigg[\sum_{n \in \mathbb{N}} \mathbb{E}[\mathbb{I}_{\{\check{\tau}_{l,l-1} > n\}}\, \widehat{\psi}_{l,l-1}(Z_{n,l,l-1}) \mid Z_{0,l,l-1}]\bigg]$$
(since $\widehat{\psi}_{l,l-1}(Z_{n,l,l-1}) = 0$ once the chains have met)
$$= \sum_{n \in \mathbb{N}} \mathbb{E}[\mathbb{I}_{\{\check{\tau}_{l,l-1} > n\}}\, \widehat{\psi}_{l,l-1}(Z_{n,l,l-1})]$$
$$\leq \sum_{n \in \mathbb{N}} \mathbb{E}[\mathbb{I}_{\{\check{\tau}_{l,l-1} > n\}}]^{\epsilon/(2+\epsilon)}\, \mathbb{E}[\widehat{\psi}_{l,l-1}(Z_{n,l,l-1})^{(2+\epsilon)/2}]^{2/(2+\epsilon)}$$
$$\leq C \sum_{n \in \mathbb{N}} \big(\rho^{\epsilon/(2+\epsilon)}\big)^n\, \mathbb{E}[\widehat{\psi}_{l,l-1}(Z_{n,l,l-1})^{(2+\epsilon)/2}]^{2/(2+\epsilon)},$$
where we have used Hölder's inequality to go to the third line and (A2) to go to the fourth line. Now, by similar calculations to those that lead to (61), we have that
$$\mathbb{E}[\widehat{\psi}_{l,l-1}(Z_{n,l,l-1})^{(2+\epsilon)/2}]^{2/(2+\epsilon)} \leq C\, (\|\varphi\|_\infty \vee \|\varphi\|_{\mathrm{Lip}})^2\, (n+1)\, \Delta_l^{\beta \wedge \bar{\beta}},$$
and hence that
$$\mathbb{E}\bigg[\sum_{n \in \mathbb{N}} \mathbb{E}[\widehat{\psi}_{l,l-1}(Z_{n,l,l-1}) \mid Z_{0,l,l-1}]\bigg] \leq C\, (\|\varphi\|_\infty \vee \|\varphi\|_{\mathrm{Lip}})^2\, \Delta_l^{\beta \wedge \bar{\beta}}. \qquad (62)$$
Then combining (61)-(62) with (60) yields
$$T_1 \leq C\, (\|\varphi\|_\infty \vee \|\varphi\|_{\mathrm{Lip}})^2\, \Delta_l^{\beta \wedge \bar{\beta}},$$
and hence the proof is concluded.

Lemma A.5.
Assume (A1-5). Then there exists a $C \in (0, \infty)$ such that for any $(l, \varphi) \in \mathbb{N} \times \mathcal{B}_b(\mathsf{X}) \cap \mathrm{Lip}_{\tilde{d}}(\mathsf{X})$ we have:
$$\mathbb{E}\big[\widehat{[\pi_l - \pi_{l-1}]}(\varphi)^2\big] \leq C\, (\|\varphi\|_\infty \vee \|\varphi\|_{\mathrm{Lip}})^2\, \Delta_l^{\beta \wedge \bar{\beta}},$$
where $\tilde{d}$ is as in (A3), $\beta$ is as in (A4) and $\bar{\beta}$, $\epsilon$ are as in (A5).

Proof. Using the decomposition in (56) along with the $C_2$-inequality, we have the upper bound
$$\mathbb{E}\big[\widehat{[\pi_l - \pi_{l-1}]}(\varphi)^2\big] \leq C\Big(\mathbb{E}[\{\varphi(X_{0,l}) - \varphi(X_{0,l-1})\}^2] + \mathbb{E}[\{M_{\check{\tau}_{l,l-1},l} - M_{\check{\tau}_{l,l-1},l-1}\}^2]$$
$$+\ \mathbb{E}\big[\big(K_l(\widehat{\varphi}_l)(X_{0,l}) - K_l(\widehat{\varphi}_l)(W_{0,l}) - K_{l-1}(\widehat{\varphi}_{l-1})(X_{0,l-1}) + K_{l-1}(\widehat{\varphi}_{l-1})(W_{0,l-1})\big)^2\big]\Big).$$
The first term on the R.H.S. can be treated by using $\varphi \in \mathrm{Lip}_{\tilde{d}}(\mathsf{X})$ and Lemma A.3; for the second term one can use Lemma A.4; and for the last term one can use very similar calculations to those used to derive (61). The proof is thus completed.

Lemma A.6. Assume (A1-2). Then there exists a $C \in (0, \infty)$ such that for any $\varphi \in \mathcal{B}_b(\mathsf{X})$ we have:
$$\mathbb{E}\big[\widehat{\pi_0(\varphi)}^2\big] \leq C\, \|\varphi\|_\infty^2.$$

Proof.
We have
$$\widehat{\pi_0(\varphi)} = \varphi(X_{0,0}) + \check{M}_{\check{\tau}_0, 0} + K_0(\widehat{\varphi}_0)(X_{0,0}) - K_0(\widehat{\varphi}_0)(W_{0,0}),$$
where $\widehat{\varphi}_0(x) = \sum_{n \in \mathbb{Z}_+} [K_0^n - \pi_0](\varphi)(x)$, and for any $n \in \mathbb{N}$,
$$\check{M}_{n,0} = \sum_{j=1}^n \{\widehat{\varphi}_0(X_{j,0}) - K_0(\widehat{\varphi}_0)(X_{j-1,0}) - \widehat{\varphi}_0(W_{j,0}) + K_0(\widehat{\varphi}_0)(W_{j-1,0})\},$$
with $\check{M}_{0,0} := 0$. One has, by using the $C_2$-inequality and the fact that $\varphi \in \mathcal{B}_b(\mathsf{X})$ along with (A1),
$$\mathbb{E}\big[\widehat{\pi_0(\varphi)}^2\big] \leq C\big(\|\varphi\|_\infty^2 + \mathbb{E}[\check{M}^2_{\check{\tau}_0, 0}]\big).$$
Then, by using an argument similar to that used to derive (57), we have
$$\mathbb{E}[\check{M}^2_{\check{\tau}_0, 0}] \leq \mathbb{E}\bigg[\sum_{j=1}^{\check{\tau}_0} \{\widehat{\varphi}_0(X_{j,0}) - K_0(\widehat{\varphi}_0)(X_{j-1,0}) - \widehat{\varphi}_0(W_{j,0}) + K_0(\widehat{\varphi}_0)(W_{j-1,0})\}^2\bigg].$$
The proof is now completed, as the summands on the R.H.S. are upper-bounded by $C\|\varphi\|_\infty^2$ and the expectation of the stopping time is finite via (A2).

References
[1] Agapiou, S., Roberts, G. O. & Vollmer, S. (2018). Unbiased Monte Carlo: posterior estimation for intractable/infinite-dimensional models. Bernoulli, 24, 1726-1786.
[2] Beskos, A., Jasra, A., Law, K. J. H., Marzouk, Y. & Zhou, Y. (2018). Multilevel sequential Monte Carlo with dimension-independent likelihood-informed proposals. SIAM/ASA J. Uncertain. Quantif., 6, 762-786.
[3] Beskos, A., Jasra, A., Law, K. J. H., Tempone, R. & Zhou, Y. (2017). Multilevel sequential Monte Carlo samplers. Stoch. Proc. Appl., 127, 1417-1440.
[4] Bou-Rabee, N., Eberle, A. & Zimmer, R. (2020). Coupling and convergence for Hamiltonian Monte Carlo. Ann. Appl. Probab., 30, 1209-1250.
[5] Brenner, S. & Scott, R. (2007). The Mathematical Theory of Finite Element Methods. Springer: New York.
[6] Ciarlet, P. G. (2002). The Finite Element Method for Elliptic Problems. SIAM: Philadelphia.
[7] Cotter, S. L., Roberts, G. O., Stuart, A. M. & White, D. (2013). MCMC methods for functions: modifying old algorithms to make them faster. Stat. Sci., 28, 424-446.
[8] Duane, S., Kennedy, A. D., Pendleton, B. J. & Roweth, D. (1987). Hybrid Monte Carlo. Phys. Lett. B, 195, 216-222.
[9] Franks, J., Jasra, A., Law, K. J. H., Chada, N. & Vihola, M. (2018). Unbiased inference for discretely observed hidden Markov model diffusions. arXiv preprint.
[10] Glynn, P. W. & Rhee, C. H. (2014). Exact estimation for Markov chain equilibrium expectations. J. Appl. Probab., 51A, 377-389.
[11] Gower, R. M., Loizou, N., Qian, X., Sailanbayev, A., Shulgin, E. & Richtarik, P. (2019). SGD: general analysis and improved rates. Proceedings of the 36th International Conference on Machine Learning, PMLR, 5200-5209.
[12] Heng, J. & Jacob, P. (2019). Unbiased Hamiltonian Monte Carlo with couplings. Biometrika, 106, 287-302.
[13] Heng, J., Houssineau, J. & Jasra, A. (2021). On unbiased score estimation for partially observed diffusions. Work in progress.
[14] Jacob, P., O'Leary, J. & Atchadé, Y. (2020). Unbiased Markov chain Monte Carlo with couplings (with discussion). J. R. Statist. Soc. Ser. B, 82, 543-600.
[15] Jacob, P., Lindsten, F. & Schön, T. (2020). Smoothing with couplings of conditional particle filters. J. Amer. Statist. Assoc., 115, 721-729.
[16] Jasra, A., Heng, J. & Law, K. J. H. (2020). Discussion of Jacob et al. J. R. Statist. Soc. Ser. B, 82, 586-587.
[17] Jasra, A., Law, K. J. H. & Yu, F. (2020). Unbiased filtering of a class of partially observed diffusions. arXiv preprint.
[18] Jasra, A., Kamatani, K., Law, K. J. H. & Zhou, Y. (2017). Multilevel particle filters. SIAM J. Numer. Anal., 55, 3068-3096.
[19] Jasra, A., Law, K. J. H. & Xu, Y. (2021). Markov chain simulation for multilevel Monte Carlo. Found. Data Sci. (to appear).
[20] Jasra, A., Law, K. J. H. & Lu, D. (2021). Unbiased estimation of the gradient of the log-likelihood in inverse problems. Stat. Comp. (to appear).
[21] Johnson, V. (1996). Studying convergence of Markov chain Monte Carlo algorithms using coupled sample paths. J. Amer. Statist. Assoc., 91, 154-166.
[22] Lindvall, T. & Rogers, L. (1986). Coupling of multidimensional diffusions by reflection. Ann. Probab., 14, 860-872.
[23] Kushner, H. & Yin, G. G. (2003). Stochastic Approximation and Recursive Algorithms and Applications. Springer: New York.
[24] Maier, B. F. & Brockmann, D. (2020). Effective containment explains sub-exponential growth in recent confirmed COVID-19 cases in China. Science, 368, 742-746.
[25] McLeish, D. (2011). A general method for debiasing a Monte Carlo estimator. Monte Carlo Meth. Appl., 17, 301-315.
[26] Moustakides, G. (1999). An extension of Wald's first lemma for Markov processes. J. Appl. Probab., 36, 48-59.
[27] Neal, R. M. (1998). Regression and classification using Gaussian process priors. In Bayesian Statistics 6 (Bernardo et al., eds), 475-501. Oxford: OUP.
[28] Rhee, C. H. & Glynn, P. (2015). Unbiased estimation with square root convergence for SDE models. Op. Res., 63, 1026-1043.
[29] Stuart, A. M. (2010). Inverse problems: a Bayesian perspective. Acta Numerica, 19, 451-559.
[30] Süli, E. & Mayers, D. F. (2003). An Introduction to Numerical Analysis. Cambridge: CUP.
[31] Tarantola, A. (2005). Inverse Problem Theory and Methods for Model Parameter Estimation. SIAM: Philadelphia.
[32] Thorisson, H. (2000). Coupling, Stationarity and Regeneration. Springer: New York.
[33] Vihola, M. (2018). Unbiased estimators and multilevel Monte Carlo. Op. Res., 66, 448-462.