A Hierarchical Multilevel Markov Chain Monte Carlo Algorithm with Applications to Uncertainty Quantification in Subsurface Flow
T. J. Dodwell, C. Ketelsen, R. Scheichl and A. L. Teckentrup

Dept of Mechanical Engineering, University of Bath, Bath BA2 7AY, UK
Dept of Applied Mathematics, 526 UCB, University of Colorado at Boulder, CO 80309-0526, USA
Dept of Mathematical Sciences, University of Bath, Bath BA2 7AY, UK. Email: [email protected]
Mathematics Institute, Zeeman Building, University of Warwick, Coventry CV4 7AL, UK

∗ Part of this work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. LLNL-JRNL-630212-DRAFT.
Abstract
In this paper we address the problem of the prohibitively large computational cost of existing Markov chain Monte Carlo methods for large-scale applications with high-dimensional parameter spaces, e.g. in uncertainty quantification in porous media flow. We propose a new multilevel Metropolis-Hastings algorithm, and give an abstract, problem-dependent theorem on the cost of the new multilevel estimator based on a set of simple, verifiable assumptions. For a typical model problem in subsurface flow, we then provide a detailed analysis of these assumptions and show significant gains over the standard Metropolis-Hastings estimator. Numerical experiments confirm the analysis and demonstrate the effectiveness of the method, with consistent reductions of more than an order of magnitude in the cost of the multilevel estimator over the standard Metropolis-Hastings algorithm for tolerances $\varepsilon < 10^{-2}$.

Keywords. Elliptic PDEs with random coefficients, log-normal coefficients, finite element analysis, Bayesian approach, Metropolis-Hastings algorithm, multilevel Monte Carlo.
Mathematics Subject Classification (2000).
1 Introduction

The parameters in mathematical models for many physical processes are often impossible to determine fully or accurately, and are hence subject to uncertainty. It is of great importance to quantify the uncertainty in the model outputs based on the (uncertain) information that is available on the model inputs. A popular way to achieve this is stochastic modelling. Based on the available information, a probability distribution (the prior in the Bayesian framework) is assigned to the input parameters. If, in addition, some dynamic data (or observations) $F_{\rm obs}$ related to the model outputs are available, it is possible to reduce the overall uncertainty and to get a better representation of the model by conditioning the prior distribution on this data (leading to the posterior).

In most situations, however, the posterior distribution is intractable in the sense that exact sampling from it is impossible. One way to circumvent this problem is to generate samples using a Metropolis-Hastings-type Markov chain Monte Carlo (MCMC) approach [22, 28, 30], which consists of two main steps: (i) given the previous sample, a new sample is generated according to some proposal distribution, such as a random walk; (ii) the likelihood of this new sample (i.e. the model fit to $F_{\rm obs}$) is compared to the likelihood of the previous sample. Based on this comparison, the proposed sample is either accepted and used for inference, or rejected, in which case the previous sample is used again, leading to a Markov chain. A major problem with MCMC is the high cost of the likelihood calculation for large-scale applications, e.g. in subsurface flow where, for accuracy reasons, a partial differential equation (PDE) with highly varying coefficients needs to be solved numerically on a fine spatial grid. Due to the slow convergence of Monte Carlo averaging, the number of samples is also large and, moreover, the likelihood has to be calculated also for all the samples that are rejected in the end. Altogether, this often leads to an intractably high overall complexity, particularly in the context of high-dimensional parameter spaces (typical in subsurface flow), where the acceptance rate of MCMC methods can be very low.

We show here how the computational cost of the standard Metropolis-Hastings algorithm can be reduced significantly by using a multilevel approach. This has already proved highly successful in the context of standard Monte Carlo estimators based on independent and identically distributed (i.i.d.) samples [9, 1, 19, 6, 34] for subsurface flow problems. The multilevel Monte Carlo (MLMC) method was first introduced by Heinrich for the computation of high-dimensional, parameter-dependent integrals [25], and then rediscovered by Giles [18] in the context of stochastic differential equations in finance. Similar ideas were also used in [2, 3] to accelerate statistical mechanics calculations. The basic ideas are to (i) exploit the linearity of expectation, (ii) introduce a hierarchy of computational models that converge (with increasing model resolution) to some limit model (e.g. the original PDE), and (iii) build estimators for the differences of output quantities instead of the quantities themselves.
In the context of PDEs with random coefficients, the multilevel estimators use a hierarchy of spatial grids and exploit the fact that the numerical solution of a PDE, and thus the evaluation of the likelihood, is computationally much cheaper on coarser spatial grids. In that way, the individual estimators will either have small variance, since differences of output quantities from consecutive models go to zero with increased model resolution, or they will require significantly less computational work per sample for low model resolutions. Either way, the cost of all the individual estimators is significantly reduced, easily compensating for the cost of having to compute $L+1$ estimators instead of one, where $L$ is the number of levels.

However, the application of the multilevel approach in the context of MCMC is not straightforward. The posterior distribution, which depends on the likelihood, has to be level-dependent, since otherwise the cost on all levels would be dominated by the evaluation of the likelihood on the finest level, leading to no real cost reduction. In order to avoid introducing extra bias in the estimator, we construct instead two parallel Markov chains $\{\theta^n_\ell\}_{n \ge 0}$ and $\{\Theta^n_{\ell-1}\}_{n \ge 0}$ on levels $\ell$ and $\ell-1$, respectively, which are judiciously coupled so that their difference can be used in the estimator. Although similar two-level sampling strategies have been investigated in other applications [7, 15, 16], the computationally cheaper coarse models were only used to accelerate the MCMC sampling and not as a variance reduction technique in the estimator. Some ideas on how to obtain a multilevel version of the MCMC estimator can also be found in the recent work [26] on sparse MCMC finite element methods.

The central result of the paper is a complexity theorem (cf. Theorem 3.4) that quantifies, for an abstract large-scale inference problem, the gains in the $\varepsilon$-cost of the multilevel Metropolis-Hastings algorithm over the standard version, i.e. the cost to achieve a root mean square error less than $\varepsilon$, in terms of powers of the tolerance $\varepsilon$. For a particular application in stationary, single phase subsurface flow with log-normal permeability prior and exponential covariance, we then verify the assumptions of Theorem 3.4. We show that the $\varepsilon$-cost of our new multilevel version is indeed one order of $\varepsilon$ lower than its single-level counterpart (cf. Theorem 4.9), i.e. $\mathcal{O}(\varepsilon^{-(d+1)-\delta})$ instead of $\mathcal{O}(\varepsilon^{-(d+2)-\delta})$, for any $\delta > 0$, where $d$ is the spatial dimension of the problem. The numerical experiments for $d = 2$ in Section 5 confirm the theoretical results. In fact, in practice the cost for the multilevel estimator grows only like $\mathcal{O}(\varepsilon^{-d})$, but this seems to be a pre-asymptotic effect. The absolute cost is about $\mathcal{O}(10$–$50)$ times lower than for the standard estimator for values of $\varepsilon$ around $10^{-3}$, which is a vast improvement and brings the cost of the multilevel MCMC estimator down to a similar order as the cost of standard multilevel Monte Carlo estimators based on i.i.d. samples. This provides real hope for practical applications of MCMC analyses in subsurface flow and other large-scale PDE applications.

The outline of the rest of the paper is as follows. In Section 2, we recall, in a very general context, the Metropolis-Hastings algorithm, together with results on its convergence. In Section 3, we then present a new multilevel version and give a general convergence analysis under a set of problem-dependent, but verifiable assumptions. A typical model problem arising in subsurface flow modelling is then presented in Section 4.
We briefly describe the application of the new multilevel algorithm to this problem, and give a rigorous convergence analysis and cost estimate of the new multilevel estimator by verifying the abstract assumptions from Section 3. Finally, in Section 5, we present some numerical results for the model problem discussed in Section 4.

2 Standard Markov chain Monte Carlo

We will start with a review of the standard Metropolis-Hastings algorithm, described in a general context. For ease of presentation, we leave a precise mathematical description of our model problem until Section 4. We denote by $\theta := (\theta_i)_{i=1}^{R}$ the $\mathbb{R}^R$-valued random input vector to the model, and denote by $X := (X_j)_{j=1}^{M} = X(\theta)$ the $\mathbb{R}^M$-valued random output. Let further $Q_{M,R} = \mathcal{G}(X)$ be some linear or non-linear functional of $X$. In the context of groundwater flow modelling, this could for example be the value of the pressure or the Darcy flux at or around a given point in the computational domain, or the outflow over parts of the boundary. In practice, both $\theta$ and $X$ are often finite dimensional approximations of infinite dimensional objects, and an underlying "true" model is recovered as $M, R \to \infty$. We shall therefore refer to $M$ as the discretisation level of the model. For more details see Section 4.

We consider the setting where we have some real-world data (or observations) $F_{\rm obs}$ available, and want to incorporate this information into our simulation in order to reduce the overall uncertainty. The data $F_{\rm obs}$ is assumed to be finite dimensional, with $F_{\rm obs} \in \mathbb{R}^m$ for some $m \in \mathbb{N}$, and usually corresponds to another linear or non-linear functional $F(X)$ of the model output.

Let us denote the density of the conditional distribution of $\theta$ given $F_{\rm obs}$ by $\mathbb{P}(\theta \,|\, F_{\rm obs})$. Using Bayes' Theorem, we have
\[
\mathbb{P}(\theta \,|\, F_{\rm obs}) = \frac{\mathcal{L}(F_{\rm obs} \,|\, \theta)\, \pi^R(\theta)}{\mathbb{P}(F_{\rm obs})} \;\propto\; \mathcal{L}(F_{\rm obs} \,|\, \theta)\, \pi^R(\theta).
\]
In the Bayesian framework, one usually refers to the conditional distribution $\mathbb{P}(\theta \,|\, F_{\rm obs})$ as the posterior distribution, to $\mathcal{L}(F_{\rm obs} \,|\, \theta)$ as the likelihood and to $\pi^R(\theta)$ as the prior distribution. Since the normalising constant $\mathbb{P}(F_{\rm obs})$ is not known in general, the conditional distribution $\mathbb{P}(\theta \,|\, F_{\rm obs})$ is generally intractable and exact sampling is not available.

The likelihood $\mathcal{L}(F_{\rm obs} \,|\, \theta)$ gives the probability of observing the data $F_{\rm obs}$ given a particular value of $\theta$. In practice, this usually involves computing the model response $F_{M,R} := F(X(\theta))$ and comparing this to the observed data $F_{\rm obs}$. Note that since the model response depends on the discretisation parameter $M$, in practice we compute an approximation $\mathcal{L}_M(F_{\rm obs} \,|\, \theta)$ of the true likelihood $\mathcal{L}(F_{\rm obs} \,|\, \theta)$. We will denote the corresponding density of the approximate posterior distribution by
\[
\pi^{M,R}(\theta) \;\propto\; \mathcal{L}_M(F_{\rm obs} \,|\, \theta)\, \pi^R(\theta).
\]
Let now $\nu^{M,R}(\theta) := \pi^{M,R}(\theta)\, {\rm d}\theta$ denote the probability measure corresponding to the density $\pi^{M,R}$. We assume that, as
$M, R \to \infty$, we have $\mathbb{E}_{\nu^{M,R}}[Q_{M,R}] \to \mathbb{E}_{\rho}[Q]$, for some (inaccessible) random variable $Q$ and measure $\rho$. The goal of the simulation is to estimate $\mathbb{E}_{\nu^{M,R}}[Q_{M,R}]$, for $M$, $R$ sufficiently large. Hence, we compute approximations (or estimators) $\widehat{Q}_{M,R}$ of $\mathbb{E}_{\nu^{M,R}}[Q_{M,R}]$. To estimate this with a Monte Carlo type estimator, or in other words by a finite sample average, we need to generate samples from the conditional distribution $\nu^{M,R}$, which is usually intractable, as already mentioned. Instead, we will use the Metropolis-Hastings MCMC method given in Algorithm 1.
ALGORITHM 1. (Metropolis-Hastings MCMC)

Choose $\theta^0$. For $n \ge 0$:

• Given $\theta^n$, generate a proposal $\theta'$ from a given proposal distribution $q(\theta' \,|\, \theta^n)$.

• Accept $\theta'$ as a sample with probability
\[
\alpha^{M,R}\big(\theta' \,|\, \theta^n\big) = \min\left\{1, \; \frac{\pi^{M,R}(\theta')\, q(\theta^n \,|\, \theta')}{\pi^{M,R}(\theta^n)\, q(\theta' \,|\, \theta^n)}\right\}, \tag{2.1}
\]
i.e. $\theta^{n+1} = \theta'$ with probability $\alpha^{M,R}$ and $\theta^{n+1} = \theta^n$ with probability $1 - \alpha^{M,R}$.

Algorithm 1 creates a Markov chain $\{\theta^n\}_{n \in \mathbb{N}}$, and the states $\theta^n$ are used as samples for inference in a Monte Carlo sampler in the usual way. The proposal distribution $q(\theta' \,|\, \theta^n)$ is what defines the algorithm. A common choice is a simple random walk. However, as outlined in [21], the basic random walk does not lead to a convergence that is independent of the input dimension $R$. A better choice is the preconditioned Crank-Nicolson (pCN) algorithm [11], which is also a crucial ingredient in the multilevel Metropolis-Hastings algorithm applied to the subsurface flow model problem below.

Under reasonable assumptions, one can show that $\theta^n \sim \nu^{M,R}$, as $n \to \infty$, and that sample averages computed with these samples converge to expected values with respect to the desired target distribution $\nu^{M,R}$ (see Theorem 2.2). The first few samples of the chain, say $\theta^0, \ldots, \theta^{n_0}$, are not usually used for inference, to allow the chain to get close to the target distribution $\nu^{M,R}$. This is referred to as the burn-in of the MCMC algorithm. Although the length of the burn-in is crucial for practical purposes, and largely influences the behaviour of the resulting MCMC estimator for finite sample sizes, asymptotic statements about the estimator are usually independent of the burn-in. We will denote our MCMC estimator by
\[
\widehat{Q}^{\rm MC}_{N} := \frac{1}{N} \sum_{n = n_0 + 1}^{N + n_0} Q^n_{M,R} = \frac{1}{N} \sum_{n = n_0 + 1}^{N + n_0} \mathcal{G}\big(X(\theta^n)\big), \tag{2.2}
\]
for any $n_0 \ge 0$, and only explicitly state the dependence on $n_0$ where needed.
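To make the structure of Algorithm 1 concrete, the following minimal Python sketch implements the Metropolis-Hastings loop for a toy target density. The function names and the simple symmetric random-walk proposal are our own illustrative choices; in the large-scale setting of this paper, `log_post` would involve a PDE solve, and the random walk would be replaced by pCN:

```python
import numpy as np

def metropolis_hastings(log_post, theta0, n_samples, n_burn, step=0.1, rng=None):
    """Random-walk Metropolis-Hastings (Algorithm 1) for a toy target.

    log_post : callable returning log pi^{M,R}(theta) up to a constant;
    the unknown normalising constant cancels in the accept ratio (2.1).
    """
    rng = rng or np.random.default_rng(0)
    theta = np.asarray(theta0, dtype=float)
    lp = log_post(theta)
    chain = []
    for n in range(n_samples + n_burn):
        # symmetric random-walk proposal, so q cancels in (2.1)
        prop = theta + step * rng.standard_normal(theta.shape)
        lp_prop = log_post(prop)
        # accept with probability min(1, pi(prop) / pi(theta))
        if np.log(rng.uniform()) < lp_prop - lp:
            theta, lp = prop, lp_prop
        if n >= n_burn:            # discard the burn-in samples
            chain.append(theta.copy())
    return np.array(chain)

# toy posterior: standard Gaussian prior times a Gaussian data misfit
log_post = lambda t: -0.5 * np.sum(t**2) - 0.5 * (t.mean() - 1.0)**2
samples = metropolis_hastings(log_post, np.zeros(4), n_samples=5000, n_burn=1000)
Q_hat = samples.mean()   # the estimator (2.2) with G(X(theta)) = mean(theta)
```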
2.1 Convergence analysis

Let us give a brief overview of the convergence properties of Algorithm 1, which we will need below in the analysis of the multilevel variant. For more details we refer the reader, e.g., to [30]. Let
\[
K(\theta' \,|\, \theta) := \alpha^{M,R}(\theta' \,|\, \theta)\, q(\theta' \,|\, \theta) + \left(1 - \int_{\mathbb{R}^R} \alpha^{M,R}(\theta'' \,|\, \theta)\, q(\theta'' \,|\, \theta)\, {\rm d}\theta''\right) \delta(\theta - \theta')
\]
denote the transition kernel of the Markov chain $\{\theta^n\}_{n \in \mathbb{N}}$, with $\delta(\cdot)$ the Dirac delta function, and let
\[
\mathcal{E} = \big\{\theta : \pi^{M,R}(\theta) > 0\big\}, \qquad \mathcal{D} = \big\{\theta : q(\theta \,|\, \theta^*) > 0 \text{ for some } \theta^* \in \mathcal{E}\big\}.
\]
The set $\mathcal{E}$ contains all parameter vectors which have a positive posterior probability, and is the set that Algorithm 1 should sample from. The set $\mathcal{D}$, on the other hand, consists of all samples which can be generated by the proposal distribution $q$, and hence contains the set that Algorithm 1 will actually sample from. For the algorithm to fully explore the target distribution, we therefore crucially require $\mathcal{E} \subset \mathcal{D}$. The following results are classical, and can be found in [30].

Lemma 2.1. Provided $\mathcal{E} \subset \mathcal{D}$, $\nu^{M,R}$ is a stationary distribution of the chain $\{\theta^n\}_{n \in \mathbb{N}}$.
Note that the condition $\mathcal{E} \subset \mathcal{D}$ is also sufficient for the transition kernel $K(\cdot \,|\, \cdot)$ to satisfy the usual detailed balance condition $K(\theta \,|\, \theta^*)\, \pi^{M,R}(\theta^*) = K(\theta^* \,|\, \theta)\, \pi^{M,R}(\theta)$.

Theorem 2.2. Suppose that $\mathbb{E}_{\nu^{M,R}}[|Q_{M,R}|] < \infty$ and
\[
q(\theta \,|\, \theta^*) > 0, \quad \text{for all } (\theta, \theta^*) \in \mathcal{E} \times \mathcal{E}. \tag{2.3}
\]
Then $\lim_{N \to \infty} \widehat{Q}^{\rm MC}_N = \mathbb{E}_{\nu^{M,R}}[Q_{M,R}]$, for any $\theta^0 \in \mathcal{E}$ and $n_0 \ge 0$.

The condition (2.3) is sufficient for the chain $\{\theta^n\}_{n \in \mathbb{N}}$ to be irreducible, and it is satisfied for example for the random walk sampler or for the pCN algorithm (cf. [21]). Lemma 2.1 and Theorem 2.2 above ensure that asymptotically, sample averages computed with samples generated by Algorithm 1 converge to the desired expected value. In particular, we note that stationarity of $\{\theta^n\}_{n \in \mathbb{N}}$ is not required in Theorem 2.2, and the estimator converges for any burn-in $n_0 \ge 0$ and any starting value $\theta^0 \in \mathcal{E}$.

Now that we have established the (asymptotic) convergence of the MCMC estimator (2.2), let us bound its cost. We will quantify the accuracy of our estimator via the mean square error (MSE)
\[
e\big(\widehat{Q}^{\rm MC}_N\big)^2 := \mathbb{E}_{\boldsymbol{\Theta}}\Big[\big(\widehat{Q}^{\rm MC}_N - \mathbb{E}_{\rho}(Q)\big)^2\Big], \tag{2.4}
\]
where $\mathbb{E}_{\boldsymbol{\Theta}}$ denotes the expected value with respect to the joint distribution of $\boldsymbol{\Theta} := \{\theta^n\}_{n \in \mathbb{N}}$ as generated by Algorithm 1 (not with respect to the target measure $\nu^{M,R}$). We denote by $\mathcal{C}_\varepsilon(\widehat{Q}^{\rm MC}_N)$ the computational $\varepsilon$-cost of the estimator, i.e. the number of floating point operations needed to achieve a MSE $e(\widehat{Q}^{\rm MC}_N)^2 < \varepsilon^2$.

Classically, the MSE can be written as the sum of the variance of the estimator and its bias squared,
\[
e\big(\widehat{Q}^{\rm MC}_N\big)^2 = \mathbb{V}_{\boldsymbol{\Theta}}\big[\widehat{Q}^{\rm MC}_N\big] + \Big(\mathbb{E}_{\boldsymbol{\Theta}}\big[\widehat{Q}^{\rm MC}_N\big] - \mathbb{E}_{\rho}[Q]\Big)^2.
\]
Here, $\mathbb{V}_{\boldsymbol{\Theta}}$ is again the variance with respect to the approximating measure generated by Algorithm 1. Using the triangle inequality and linearity of expectation, we can further bound this by
\[
e\big(\widehat{Q}^{\rm MC}_N\big)^2 \le \mathbb{V}_{\boldsymbol{\Theta}}\big[\widehat{Q}^{\rm MC}_N\big] + 2\Big(\mathbb{E}_{\boldsymbol{\Theta}}\big[\widehat{Q}^{\rm MC}_N\big] - \mathbb{E}_{\nu^{M,R}}\big[\widehat{Q}^{\rm MC}_N\big]\Big)^2 + 2\Big(\mathbb{E}_{\nu^{M,R}}[Q_{M,R}] - \mathbb{E}_{\rho}[Q]\Big)^2. \tag{2.5}
\]
The three terms in (2.5) correspond to the three sources of error in the MCMC estimator. The third (and last) term in (2.5) is the discretisation error due to approximating $Q$ by $Q_{M,R}$ and $\rho$ by $\nu^{M,R}$. The other two terms are the errors introduced by using an MCMC estimator for the expected value; the first term is the error due to using a finite number of samples and the second term is due to the samples not all being perfect (i.i.d.) samples from the target distribution $\nu^{M,R}$.

Let us first consider the two MCMC related error terms. Quantifying, or even bounding, the variance and bias of an MCMC estimator in terms of the number of samples $N$ is not an easy task, and is in fact still a very active area of research. The main issue with bounding the variance is that the samples used in the MCMC estimator are not independent, which means that knowledge of the covariance structure is required in order to bound the variance of the estimator. Asymptotically, the behaviour of the MCMC related errors (i.e. Terms 1 and 2 on the right hand side of (2.5)) can be described using the following Central Limit Theorem, which can again be found in [30].

Let $\tilde{\theta}^0 \sim \nu^{M,R}$. Then the auxiliary chain $\widetilde{\boldsymbol{\Theta}} := \{\tilde{\theta}^n\}_{n \in \mathbb{N}}$ constructed by Algorithm 1 starting from $\tilde{\theta}^0$ is stationary, i.e. $\tilde{\theta}^n \sim \nu^{M,R}$ for all $n \ge 0$.
The covariance structure of $\widetilde{\boldsymbol{\Theta}}$ is still implicitly defined by Algorithm 1, as for $\boldsymbol{\Theta}$. However, now $\mathbb{V}_{\widetilde{\boldsymbol{\Theta}}}[\tilde{Q}^n_{M,R}] = \mathbb{V}_{\nu^{M,R}}[\tilde{Q}_{M,R}]$, $\mathbb{E}_{\widetilde{\boldsymbol{\Theta}}}[\tilde{Q}^n_{M,R}] = \mathbb{E}_{\nu^{M,R}}[\tilde{Q}_{M,R}]$ and
\[
{\rm Cov}_{\widetilde{\boldsymbol{\Theta}}}\Big[\tilde{Q}^0_{M,R},\, \tilde{Q}^n_{M,R}\Big] = \mathbb{E}_{\widetilde{\boldsymbol{\Theta}}}\Big[\big(\tilde{Q}^0_{M,R} - \mathbb{E}_{\nu^{M,R}}[Q_{M,R}]\big)\big(\tilde{Q}^n_{M,R} - \mathbb{E}_{\nu^{M,R}}[Q_{M,R}]\big)\Big], \quad n \ge 0,
\]
where $\tilde{Q}^n_{M,R} := \mathcal{G}(X(\tilde{\theta}^n))$.
The so-called asymptotic variance of the MCMC estimator is now defined as
\[
\sigma_Q^2 := \mathbb{V}_{\nu^{M,R}}\Big[\tilde{Q}^0_{M,R}\Big] + 2 \sum_{n=1}^{\infty} {\rm Cov}_{\widetilde{\boldsymbol{\Theta}}}\Big[\tilde{Q}^0_{M,R},\, \tilde{Q}^n_{M,R}\Big]. \tag{2.6}
\]
Note that stationarity of the chain is assumed only in the definition of $\sigma_Q^2$, i.e. for $\widetilde{\boldsymbol{\Theta}}$, and it is not necessary for the samples $\boldsymbol{\Theta}$ actually used in the computation of $\widehat{Q}^{\rm MC}_N$.

Theorem 2.3 (Central Limit Theorem). Suppose (2.3) holds, $\sigma_Q < \infty$, and
\[
\mathbb{P}\big[\alpha^{M,R} = 1\big] < 1. \tag{2.7}
\]
Then we have, for any $n_0 \ge 0$ and $\theta^0 \in \mathcal{E}$,
\[
\sqrt{N}\Big(\widehat{Q}^{\rm MC}_N - \mathbb{E}_{\nu^{M,R}}[Q_{M,R}]\Big) \xrightarrow{\;\mathcal{D}\;} \mathcal{N}(0, \sigma_Q^2),
\]
where $\xrightarrow{\;\mathcal{D}\;}$ denotes convergence in distribution.

The condition (2.7) is sufficient for the chain $\boldsymbol{\Theta}$ to be aperiodic. It is difficult to prove theoretically. In practice, however, this condition is always satisfied, since not all proposals in Algorithm 1 will agree with the observed data and thus be accepted. Theorem 2.3 shows that asymptotically, the sampling error of the MCMC estimator decays at the same rate as the sampling error of an estimator based on i.i.d. samples. Note that $\sigma_Q^2$ includes both sampling errors, and so the constant $\sigma_Q^2$ is in general larger than in the i.i.d. case, where it is simply $\mathbb{V}_{\nu^{M,R}}[Q_{M,R}]$.
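Although $\sigma_Q^2$ is defined through the infinite sum (2.6), in practice it has to be estimated from chain output. The following is a minimal sketch of such an estimator; the truncation of the sum at the first non-positive autocovariance is a common heuristic, not something prescribed by the theory:

```python
import numpy as np

def asymptotic_variance(q_chain, max_lag=200):
    """Estimate sigma_Q^2 in (2.6) from one (burnt-in) chain of
    scalar samples Q^n, truncating the autocovariance sum."""
    q = np.asarray(q_chain, dtype=float)
    q = q - q.mean()
    n = len(q)
    sigma2 = np.dot(q, q) / n                   # lag-0 term, V[Q]
    for lag in range(1, min(max_lag, n - 1)):
        cov = np.dot(q[:-lag], q[lag:]) / n     # empirical Cov[Q^0, Q^lag]
        if cov <= 0:                            # heuristic truncation point
            break
        sigma2 += 2.0 * cov
    return sigma2

# effective sample size of the chain: N_eff = N * V[Q] / sigma_Q^2
```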
Since we are interested in a bound on the MSE of our MCMC estimator for a fixed number of samples $N$, we make the following assumption:

A1. For any $N \in \mathbb{N}$,
\[
\mathbb{V}_{\boldsymbol{\Theta}}\big[\widehat{Q}^{\rm MC}_N\big] + \Big(\mathbb{E}_{\boldsymbol{\Theta}}\big[\widehat{Q}^{\rm MC}_N\big] - \mathbb{E}_{\nu^{M,R}}\big[\widehat{Q}^{\rm MC}_N\big]\Big)^2 \;\lesssim\; \frac{\mathbb{V}_{\nu^{M,R}}[Q_{M,R}]}{N}, \tag{2.8}
\]
with a constant that is independent of $M$, $N$ and $R$.

Such non-asymptotic bounds on the sampling errors are difficult to obtain, but have recently been proved for certain Metropolis-Hastings algorithms, see e.g. [21, 31, 26], provided the chain is sufficiently burnt in. The implied constant in Assumption A1 usually depends on quantities such as the covariances appearing in the asymptotic variance $\sigma_Q^2$ and will in general only be independent of the dimension $R$ for judiciously chosen proposal distributions such as the pCN algorithm. For the simple random walk, for example, the hidden constant grows linearly in $R$. It is possible to relax Assumption A1 and prove convergence for algorithms also in this case, but we choose not to do this for ease of presentation.

To complete the error analysis, let us now consider the last term in the MSE (2.5), the discretisation bias. As before, we assume $\mathbb{E}_{\nu^{M,R}}[Q_{M,R}] \to \mathbb{E}_{\rho}[Q]$ for $M, R \to \infty$, with a certain order of convergence, i.e.
\[
\big|\mathbb{E}_{\nu^{M,R}}[Q_{M,R}] - \mathbb{E}_{\rho}[Q]\big| \;\lesssim\; M^{-\alpha} + R^{-\alpha'}, \tag{2.9}
\]
for some $\alpha, \alpha' > 0$.
The rates $\alpha$ and $\alpha'$ will be problem dependent. Let now $R = M^{\alpha/\alpha'}$, such that the two error contributions in (2.9) are balanced. Then it follows from (2.5), (2.8) and (2.9) that the MSE of the MCMC estimator can be bounded by
\[
e\big(\widehat{Q}^{\rm MC}_N\big)^2 \;\lesssim\; \frac{\mathbb{V}_{\nu^{M,R}}[Q_{M,R}]}{N} + M^{-2\alpha}. \tag{2.10}
\]
Under the assumption that $\mathbb{V}_{\nu^{M,R}}[Q_{M,R}]$ is approximately constant, independent of $M$ and $R$, it is hence sufficient to choose $N \gtrsim \varepsilon^{-2}$ and $M \gtrsim \varepsilon^{-1/\alpha}$ to get a MSE of $\mathcal{O}(\varepsilon^2)$.
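To illustrate how (2.10) drives the choice of $N$ and $M$, the following small script tabulates the resulting single-level budgets; the rates $\alpha$ and $\gamma$ used here are purely hypothetical placeholders:

```python
import math

# Illustrative single-level budget from (2.10)-(2.11), for hypothetical
# rates alpha (bias decay) and gamma (cost growth per sample).
alpha, gamma = 1.0, 1.5
for eps in [1e-1, 1e-2, 1e-3]:
    N = math.ceil(eps**-2)             # sampling error ~ N^{-1/2} ~ eps
    M = math.ceil(eps**(-1.0/alpha))   # bias ~ M^{-alpha} ~ eps
    cost = N * M**gamma                # total cost ~ eps^{-2-gamma/alpha}
    print(f"eps={eps:.0e}:  N={N:.0e}  M={M:.0e}  cost~{cost:.1e}")
```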
To bound the computational cost to achieve this error, the so-called $\varepsilon$-cost, we assume that one sample $Q^n_{M,R}$ can be obtained at cost $\mathcal{C}(Q^n_{M,R}) \lesssim M^{\gamma}$, for some $\gamma > 0$. Thus, with $N \gtrsim \varepsilon^{-2}$ and $M \gtrsim \varepsilon^{-1/\alpha}$, the $\varepsilon$-cost of our MCMC estimator can be bounded by
\[
\mathcal{C}_\varepsilon\big(\widehat{Q}^{\rm MC}_N\big) \;\lesssim\; N M^{\gamma} \;\lesssim\; \varepsilon^{-2-\gamma/\alpha}. \tag{2.11}
\]
In many practical applications, especially in subsurface flow, both the discretisation parameter $M$ and the length of the input $R$ need to be very large in order for $\mathbb{E}_{\nu^{M,R}}[Q_{M,R}]$ to be a good approximation to $\mathbb{E}_{\rho}[Q]$. Moreover, as outlined, we need to use a large number of samples $N$ in order to get an accurate MCMC estimator with a small MSE. Since each sample requires the evaluation of the likelihood $\mathcal{L}_M(F_{\rm obs} \,|\, \theta^n)$, and this is very expensive when $M$ and $R$ are large, the standard MCMC estimator (2.2) is often too expensive in practical situations. Additionally, the acceptance rate of the algorithm can be very low when $R$ is large. This means that the covariance between the different samples will decay more slowly, which again makes the hidden constant in Assumption A1 larger, and the number of samples we have to take increases even further. To overcome the prohibitively large computational cost of the standard MCMC estimator (2.2), we will now introduce a new multilevel version of the estimator.

3 Multilevel Markov chain Monte Carlo

The main idea of multilevel Monte Carlo (MLMC) simulation is very simple. We sample not just from one approximation $Q_{M,R}$ of $Q$, but from several. Let us recall the main ideas from [18, 9].

Let $\{M_\ell\}_{\ell=0}^{L} \subset \mathbb{N}$ be an increasing sequence in $\mathbb{N}$, i.e. $M_0 < M_1 < \ldots < M_L =: M$, and assume for simplicity that there exists an $s \in \mathbb{N} \setminus \{1\}$ such that
\[
M_\ell = s\, M_{\ell-1}, \quad \text{for all } \ell = 1, \ldots, L. \tag{3.1}
\]
We also choose a (not necessarily strictly) increasing sequence $\{R_\ell\}_{\ell=0}^{L} \subset \mathbb{N}$, i.e. $R_\ell \ge R_{\ell-1}$, for all $\ell = 1, \ldots, L$. For each level $\ell$, denote correspondingly the parameter vector by $\theta_\ell \in \mathbb{R}^{R_\ell}$, the quantity of interest by $Q_\ell := Q_{M_\ell, R_\ell}$, the posterior distribution by $\nu^\ell := \nu^{M_\ell, R_\ell}$ and the posterior density by $\pi^\ell := \pi^{M_\ell, R_\ell}$. For simplicity we assume that the parameter vectors $\{\theta_\ell\}_{\ell=0}^{L}$ are nested, i.e. that $\theta_{\ell-1}$ is a subset of $\theta_\ell$, and that the elements of $\theta_\ell$ are independent.

As for multigrid methods applied to discretised (deterministic) PDEs, the key is to avoid estimating the expected value of $Q_\ell$ directly on level $\ell$, but instead to estimate the correction with respect to the next lower level. Since in the context of MCMC simulations the target distribution $\nu^\ell$ depends on $\ell$, the new multilevel MCMC (MLMCMC) estimator has to be defined carefully. We will use the identity
\[
\mathbb{E}_{\nu^L}[Q_L] = \mathbb{E}_{\nu^0}[Q_0] + \sum_{\ell=1}^{L} \Big(\mathbb{E}_{\nu^\ell}[Q_\ell] - \mathbb{E}_{\nu^{\ell-1}}[Q_{\ell-1}]\Big) \tag{3.2}
\]
as a basis. Note that in the case where the distributions are the same, the above reduces to the telescoping sum used for multilevel Monte Carlo estimators based on i.i.d. samples.

The idea is now to estimate each of the terms on the right hand side of (3.2) separately, in such a way that the variance of the resulting multilevel estimator is small. In particular, we will estimate each term in (3.2) by an MCMC estimator. The first term $\mathbb{E}_{\nu^0}[Q_0]$ can be estimated using the standard MCMC estimator in Algorithm 1, i.e. $\widehat{Q}^{\rm MC}_{0,N_0}$ as in (2.2) with $N_0$ samples.
We need to be more careful in estimating the differences $\mathbb{E}_{\nu^\ell}[Q_\ell] - \mathbb{E}_{\nu^{\ell-1}}[Q_{\ell-1}]$, and build an effective two-level version of Algorithm 1. For every $\ell \ge 1$, we denote $Y_\ell := Q_\ell - Q_{\ell-1}$ and define the estimator on level $\ell$ as
\[
\widehat{Y}^{\rm MC}_{\ell, N_\ell} := \frac{1}{N_\ell} \sum_{n = n^\ell_0 + 1}^{n^\ell_0 + N_\ell} Y^n_\ell = \frac{1}{N_\ell} \sum_{n = n^\ell_0 + 1}^{n^\ell_0 + N_\ell} \Big(Q_\ell(\theta^n_\ell) - Q_{\ell-1}(\Theta^n_{\ell-1})\Big), \tag{3.3}
\]
where $n^\ell_0$ again denotes the burn-in of the estimator, $N_\ell$ is the number of samples on level $\ell$ and $\Theta_{\ell-1}$ has the same dimension as $\theta_{\ell-1}$. The main ingredient in this two-level estimator is a judicious choice of the two Markov chains $\{\theta^n_\ell\}$ and $\{\Theta^n_{\ell-1}\}$ (see Section 3.1). The full MLMCMC estimator is defined as
\[
\widehat{Q}^{\rm ML}_{L, \{N_\ell\}} := \widehat{Q}^{\rm MC}_{0, N_0} + \sum_{\ell=1}^{L} \widehat{Y}^{\rm MC}_{\ell, N_\ell}, \tag{3.4}
\]
where it is important that the two chains $\{\theta^n_\ell\}_{n \in \mathbb{N}}$ and $\{\Theta^n_\ell\}_{n \in \mathbb{N}}$, that are used in $\widehat{Y}^{\rm MC}_{\ell, N_\ell}$ and in $\widehat{Y}^{\rm MC}_{\ell+1, N_{\ell+1}}$ respectively, are drawn from the same posterior distribution $\nu^\ell$, so that $\widehat{Q}^{\rm ML}_{L, \{N_\ell\}}$ is an unbiased estimator of $\mathbb{E}_{\nu^L}[Q_L]$. A minimal sketch of how (3.4) is assembled from the per-level chains is given below.
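The sketch assumes (hypothetically) that the per-level chains have already been generated and burn-in removed; it simply combines the level-0 estimator (2.2) with the correction estimators (3.3):

```python
import numpy as np

def mlmcmc_estimate(level_chains):
    """Assemble the MLMCMC estimator (3.4).

    level_chains[0] : array of samples Q_0(theta_0^n)   (burn-in removed)
    level_chains[l] : array of pairs (Q_l(theta_l^n), Q_{l-1}(Theta_{l-1}^n))
                      produced by the two-level Algorithm 2 on level l.
    """
    # level 0: standard MCMC estimator (2.2)
    estimate = np.mean(level_chains[0])
    # levels 1..L: correction estimators (3.3)
    for pairs in level_chains[1:]:
        fine, coarse = np.asarray(pairs).T
        estimate += np.mean(fine - coarse)
    return estimate
```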
There are two main ideas in [18, 9] underlying the reduction in computational cost associated with the multilevel estimator. Firstly, samples of $Q_\ell$, for $\ell < L$, are cheaper to compute than samples of $Q_L$, reducing the cost of the estimators on the coarser levels for any fixed number of samples. Secondly, if the variance of $Y_\ell = Q_\ell(\theta_\ell) - Q_{\ell-1}(\Theta_{\ell-1})$ tends to 0 as $\ell \to \infty$, we need only a small number of samples to obtain a sufficiently accurate estimate of the expected value of $Y_\ell$ on the fine grids, and so the computational effort on the fine grids is also greatly reduced.

By using the telescoping sum (3.2) and by sampling from the posterior distribution $\nu^\ell$ on level $\ell$, we ensure that a sample of $Q_\ell$, for $\ell < L$, is indeed cheaper to compute than a sample of $Q_L$. It remains to ensure that the variance of $Y_\ell = Q_\ell(\theta_\ell) - Q_{\ell-1}(\Theta_{\ell-1})$ tends to 0 as $\ell \to \infty$. This will be ensured by the choice of $\theta_\ell$ and $\Theta_{\ell-1}$. Note that crucially, this requires the two chains $\{\theta^n_\ell\}$ and $\{\Theta^n_{\ell-1}\}$ to be correlated. However, as long as the stationary marginal distributions of $\{\theta^n_\ell\}$ and $\{\Theta^n_{\ell-1}\}$ are $\nu^\ell$ and $\nu^{\ell-1}$ respectively, this correlation does not introduce any bias in the telescoping sum (3.2).

3.1 The estimator for $Y_\ell = Q_\ell - Q_{\ell-1}$

Let us fix $1 \le \ell \le L$. The challenge is now to generate the chains $\{\theta^n_\ell\}_{n \in \mathbb{N}}$ and $\{\Theta^n_{\ell-1}\}_{n \in \mathbb{N}}$ such that the variance of $Y_\ell$ is small. To this end, we partition the vector $\theta_\ell$ into two parts: the entries which are present already on level $\ell-1$ (the "coarse" modes), and the entries which are new on level $\ell$ (the "fine" modes):
\[
\theta_\ell = [\theta_{\ell,C}, \theta_{\ell,F}],
\]
where $\theta_{\ell,C}$ has length $R_{\ell-1}$, i.e. the same length as $\Theta_{\ell-1}$. The vector $\theta_{\ell,F}$ has length $R_\ell - R_{\ell-1}$.

An easy way to construct $\theta^n_\ell$ and $\Theta^n_{\ell-1}$ such that the variance of $Y_\ell$ is small would be to generate $\theta^n_\ell$ first, and then simply use $\Theta^n_{\ell-1} = \theta^n_{\ell,C}$. However, since we require $\Theta^n_{\ell-1}$ to come from a Markov chain with stationary distribution $\nu^{\ell-1}$, and $\theta^n_\ell$ comes from the distribution $\nu^\ell$, this approach would lead to additional bias.
We do, however, use a similar idea in Algorithm 2.

Let us for the moment assume that we have a way of producing i.i.d. samples from the posterior distribution $\nu^{\ell-1}$. Since the distributions $\nu^{\ell-1}$ and $\nu^\ell$ are both approximations of the true posterior distribution $\rho$, and differ only in the choice of approximation parameters $M$ and $R$, the distributions $\nu^{\ell-1}$ and $\nu^\ell$ will, for sufficiently large $\ell$, be very similar. The distribution $\nu^{\ell-1}$ is hence an ideal candidate for the proposal distribution on level $\ell$, and this is what is used in Algorithm 2. First, we generate a sample $\Theta^{n+1}_{\ell-1}$ from the distribution $\nu^{\ell-1}$, which is independent of the previous sample $\Theta^n_{\ell-1}$. We will use the independence of these samples in Lemma 3.1. Based on $\Theta^{n+1}_{\ell-1}$, we then generate $\theta^{n+1}_\ell$ using a new two-level proposal density $q^\ell_{\rm ML}$ in conjunction with the usual Metropolis-Hastings accept/reject step in Algorithm 1.

ALGORITHM 2. (Metropolis-Hastings MCMC for $Q_\ell - Q_{\ell-1}$)

Choose initial states $\Theta^0_{\ell-1} \sim \nu^{\ell-1}$ and $\theta^0_\ell := [\Theta^0_{\ell-1}, \theta^0_{\ell,F}]$. For $n \ge 0$:

• On level $\ell-1$: Generate an independent sample $\Theta^{n+1}_{\ell-1}$ from the distribution $\nu^{\ell-1}$.

• On level $\ell$: Given $\theta^n_\ell$ and $\Theta^{n+1}_{\ell-1}$, generate $\theta^{n+1}_\ell$ using Algorithm 1 with the specific proposal distribution $q^\ell_{\rm ML}(\theta'_\ell \,|\, \theta^n_\ell)$ induced by taking $\theta'_{\ell,C} := \Theta^{n+1}_{\ell-1}$ and by generating a proposal for $\theta'_{\ell,F}$ from some proposal distribution $q^{\ell,F}_{\rm ML}(\theta'_{\ell,F} \,|\, \theta^n_{\ell,F})$ that is independent of the coarse modes. The acceptance probability is
\[
\alpha^\ell_{\rm ML}(\theta'_\ell \,|\, \theta^n_\ell) = \min\left\{1, \; \frac{\pi^\ell(\theta'_\ell)\, q^\ell_{\rm ML}(\theta^n_\ell \,|\, \theta'_\ell)}{\pi^\ell(\theta^n_\ell)\, q^\ell_{\rm ML}(\theta'_\ell \,|\, \theta^n_\ell)}\right\}.
\]

In particular, to make a proposal on level $\ell$, we take $\theta'_{\ell,C} = \Theta^{n+1}_{\ell-1}$ and independently generate $\theta'_{\ell,F}$ from a proposal distribution $q^{\ell,F}_{\rm ML}$ for the fine modes, which can again be a simple random walk or the pCN algorithm.

At each step in Algorithm 2, there are two different outcomes, depending on whether we accept or reject on level $\ell$. The different possibilities are given in Table 1. Observe that when we accept on level $\ell$, we have $\theta^{n+1}_{\ell,C} = \Theta^{n+1}_{\ell-1}$, i.e. the coarse modes are the same. If, on the other hand, we reject on level $\ell$, we crucially return to the previous state $\theta^n_\ell$ on that level, which means that the coarse modes of the two states may differ.

Level $\ell$ test | $\Theta^{n+1}_{\ell-1}$ | $\theta^{n+1}_{\ell,C}$
accept | $\Theta^{n+1}_{\ell-1}$ | $\Theta^{n+1}_{\ell-1}$
reject | $\Theta^{n+1}_{\ell-1}$ | $\theta^n_{\ell,C}$

Table 1: Possible states of $\Theta^{n+1}_{\ell-1}$ and $\theta^{n+1}_{\ell,C}$ in Algorithm 2.

In general, this "divergence" of the coarse modes may mean that the variance of $Y_\ell$ does not go to 0 as $\ell \to \infty$ for a particular application. But provided the modes are ordered according to their relative "influence" on the likelihood $\mathcal{L}(F_{\rm obs} \,|\, \theta)$, we can guarantee that $\alpha^\ell_{\rm ML}(\theta'_\ell \,|\, \theta^n_\ell) \to 1$ as $\ell \to \infty$, and hence that the variance of $Y_\ell$ does in fact tend to 0 as $\ell \to \infty$.
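In code, one step of Algorithm 2 might look as follows. This is a minimal sketch with hypothetical placeholders (`coarse_sampler`, `log_pi_l`, `log_pi_lm1`, `propose_fine`); the fine-mode proposal is assumed symmetric, so that the acceptance ratio reduces to Case (i) of Lemma 3.1 below:

```python
import numpy as np

def two_level_step(theta_l, coarse_sampler, log_pi_l, log_pi_lm1,
                   propose_fine, R_lm1, rng):
    """One step of Algorithm 2: returns the pair (Theta_{l-1}^{n+1}, theta_l^{n+1}).

    coarse_sampler : returns an (approximately) independent sample from nu^{l-1}
    log_pi_l, log_pi_lm1 : log posterior densities on levels l and l-1
    propose_fine : symmetric proposal for the fine modes theta_{l,F}
    """
    Theta_c = coarse_sampler()                  # independent coarse sample
    prop = np.concatenate([Theta_c, propose_fine(theta_l[R_lm1:])])
    # Case (i) of Lemma 3.1:
    # pi^l(prop) pi^{l-1}(theta_C) / (pi^l(theta) pi^{l-1}(prop_C))
    log_alpha = (log_pi_l(prop) + log_pi_lm1(theta_l[:R_lm1])
                 - log_pi_l(theta_l) - log_pi_lm1(Theta_c))
    if np.log(rng.uniform()) < min(0.0, log_alpha):
        theta_l = prop                          # accept on level l
    # on rejection, theta_l is kept: the coarse modes may diverge (Table 1)
    return Theta_c, theta_l
```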
We will show this for a subsurface flow application in Section 4.

The specific proposal distribution $q^\ell_{\rm ML}$ in Algorithm 2 can be computed very easily and at no additional cost, leading to a simple formula for the "two-level" acceptance probability $\alpha^\ell_{\rm ML}$.
Lemma 3.1. Let $\ell \ge 1$. Then
\[
\alpha^\ell_{\rm ML}(\theta'_\ell \,|\, \theta^n_\ell) = \min\left\{1, \; \frac{\pi^\ell(\theta'_\ell)\, \pi^{\ell-1}(\theta^n_{\ell,C})\, q^{\ell,F}_{\rm ML}(\theta^n_{\ell,F} \,|\, \theta'_{\ell,F})}{\pi^\ell(\theta^n_\ell)\, \pi^{\ell-1}(\theta'_{\ell,C})\, q^{\ell,F}_{\rm ML}(\theta'_{\ell,F} \,|\, \theta^n_{\ell,F})}\right\}
\]
and the induced transition kernel $K^\ell_{\rm ML}$ satisfies detailed balance. Furthermore, if the distribution $q^{\ell,F}_{\rm ML}$ is either (i) symmetric, or (ii) the pCN proposal distribution, then
\[
\alpha^\ell_{\rm ML}(\theta'_\ell \,|\, \theta^n_\ell) = \begin{cases} \displaystyle \min\left\{1, \; \frac{\pi^\ell(\theta'_\ell)\, \pi^{\ell-1}(\theta^n_{\ell,C})}{\pi^\ell(\theta^n_\ell)\, \pi^{\ell-1}(\theta'_{\ell,C})}\right\}, & \text{Case (i)}, \\[2ex] \displaystyle \min\left\{1, \; \frac{\mathcal{L}_\ell(F_{\rm obs} \,|\, \theta'_\ell)\, \mathcal{L}_{\ell-1}(F_{\rm obs} \,|\, \theta^n_{\ell,C})}{\mathcal{L}_\ell(F_{\rm obs} \,|\, \theta^n_\ell)\, \mathcal{L}_{\ell-1}(F_{\rm obs} \,|\, \theta'_{\ell,C})}\right\}, & \text{Case (ii)}. \end{cases}
\]
Proof. Since the proposals for the coarse modes $\theta_{\ell,C}$ and for the fine modes $\theta_{\ell,F}$ are generated independently, the proposal density $q^\ell_{\rm ML}(\theta'_\ell \,|\, \theta^n_\ell)$ can be written as a product of densities on the two parts of $\theta_\ell$, i.e. $q^{\ell,C}_{\rm ML}$ and $q^{\ell,F}_{\rm ML}$. For the coarse part of the proposal distribution, we simply have $q^{\ell,C}_{\rm ML}(\theta'_{\ell,C} \,|\, \theta^n_{\ell,C}) = \pi^{\ell-1}(\theta'_{\ell,C})$ and $q^{\ell,C}_{\rm ML}(\theta^n_{\ell,C} \,|\, \theta'_{\ell,C}) = \pi^{\ell-1}(\theta^n_{\ell,C})$. This completes the proof of the first result. Detailed balance for $K^\ell_{\rm ML}$ follows trivially due to the Metropolis-Hastings construction. The corollary for symmetric distributions $q^{\ell,F}_{\rm ML}$ follows by definition. The corollary for pCN proposals follows from the identity $q^{\ell,F}_{\rm ML}(\theta^n_{\ell,F} \,|\, \theta'_{\ell,F}) / q^{\ell,F}_{\rm ML}(\theta'_{\ell,F} \,|\, \theta^n_{\ell,F}) = \pi^{\ell,F}(\theta^n_{\ell,F}) / \pi^{\ell,F}(\theta'_{\ell,F})$ (see, e.g., [11]), together with the factorisation $\pi^\ell(\theta_\ell) = \pi^{\ell-1}(\theta_{\ell,C})\, \pi^{\ell,F}(\theta_{\ell,F})$.

3.2 Generating (almost) i.i.d. samples from $\nu^{\ell-1}$

In practice, it will not be possible to generate independent samples of the coarse level posterior distribution $\nu^{\ell-1}$ directly. We instead suggest approximating independent samples of $\nu^{\ell-1}$ using Algorithm 1 in the following manner: after a sufficiently long burn-in period, Algorithm 1 will produce samples which are (approximately) distributed according to $\nu^{\ell-1}$. Although the samples produced in this way are correlated, the correlation between the $n$th and $(n+j)$th sample decays as $j$ increases, and for sufficiently large $j$, the samples $\Theta^n_{\ell-1}$ and $\Theta^{n+j}_{\ell-1}$ will be nearly uncorrelated. Hence, an i.i.d. sequence of samples of $\nu^{\ell-1}$ can be approximated by subsampling a chain $\{\Theta^n_{\ell-1}\}_{n \in \mathbb{N}}$ generated by Algorithm 1 with, e.g., the pCN proposal distribution.

This procedure can be applied very naturally in a recursive manner. Starting on the coarsest level, burning in a Markov chain of samples and subsampling this chain to produce (nearly) independent samples from $\nu^0$, we can then apply Algorithm 2 to produce a Markov chain of samples from $\nu^1$. This can then be subsampled again to apply Algorithm 2 on level 2. Continuing in this way, we can recursively produce independent samples from $\nu^{\ell-1}$ for any $\ell > 0$; see Algorithm 3 in Section 5 for details. A sketch of this recursion is given below.
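The recursion just described can be expressed compactly. The following sketch is illustrative only: `mh_step` (one step of Algorithm 1 on level 0, or of Algorithm 2 otherwise) and the subsampling rate are hypothetical placeholders; the actual procedure used in the experiments is Algorithm 3 in Section 5:

```python
def coarse_sampler_factory(level, mh_step, subsample=20):
    """Build a generator of (nearly) independent samples from nu^{level}
    by recursive subsampling of the chain on each level below.

    mh_step(level, state, coarse_sampler) advances the chain on `level`
    by one step and must handle state=None by initialising the chain.
    """
    coarse = coarse_sampler_factory(level - 1, mh_step) if level > 0 else None
    state = None

    def sampler():
        nonlocal state
        for _ in range(subsample):      # skip steps to decorrelate
            state = mh_step(level, state, coarse)
        return state
    return sampler
```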
Although, in general, the i.i.d. samples of $\nu^{\ell-1}$ will in practice have to be approximated, for the analysis of our multilevel algorithm we will assume that the chains $\{\Theta^n_{\ell-1}\}_{n \in \mathbb{N}}$ and $\{\theta^n_\ell\}_{n \in \mathbb{N}}$ are generated as in Algorithm 2. The additional bias introduced in the practical Algorithm 3 below is in fact so small that we did initially not detect it in our numerical experiments, even for very short subsampling rates.

3.3 Convergence analysis

Let us now move on to convergence properties of the multilevel estimator. As in Section 2.1, we define, for all $\ell = 0, \ldots, L$, the sets
\[
\mathcal{E}_\ell = \big\{\theta_\ell : \pi^\ell(\theta_\ell) > 0\big\}, \qquad \mathcal{D}_\ell = \big\{\theta_\ell : q^\ell_{\rm ML}(\theta_\ell \,|\, \theta^*_\ell) > 0 \text{ for some } \theta^*_\ell \in \mathcal{E}_\ell\big\}.
\]
The following convergence results follow from the classical results, due to the telescoping sum property (3.2) and the algebra of limits.
Lemma 3.2. Provided $\mathcal{E}_\ell \subset \mathcal{D}_\ell$, $\nu^\ell$ is a stationary marginal distribution of the chain $\{\theta^n_\ell\}_{n \in \mathbb{N}}$.

Theorem 3.3. Suppose that for all $\ell = 0, \ldots, L$, $\mathbb{E}_{\nu^\ell}[|Q_\ell|] < \infty$ and
\[
q^\ell_{\rm ML}(\theta_\ell \,|\, \theta^*_\ell) > 0, \quad \text{for all } \theta_\ell, \theta^*_\ell \in \mathcal{E}_\ell. \tag{3.5}
\]
Then $\lim_{\{N_\ell\} \to \infty} \widehat{Q}^{\rm ML}_{L, \{N_\ell\}} = \mathbb{E}_{\nu^L}[Q_L]$, for any $\theta^0_\ell \in \mathcal{E}_\ell$ and $n^\ell_0 \ge 0$.

Let us have a closer look at the irreducibility condition (3.5). As in the proof of Lemma 3.1, we have
\[
q^\ell_{\rm ML}(\theta_\ell \,|\, \theta^*_\ell) = \pi^{\ell-1}(\theta_{\ell,C})\, q^{\ell,F}_{\rm ML}(\theta_{\ell,F} \,|\, \theta^*_{\ell,F}),
\]
and thus (3.5) holds if and only if $\pi^{\ell-1}(\theta_{\ell,C})$ and $q^{\ell,F}_{\rm ML}(\theta_{\ell,F} \,|\, \theta^*_{\ell,F})$ are both positive, for all $(\theta_\ell, \theta^*_\ell) \in \mathcal{E}_\ell \times \mathcal{E}_\ell$. Both terms are positive for common choices of likelihood, prior and proposal distributions.

We finish the abstract discussion of the new, hierarchical multilevel Metropolis-Hastings MCMC algorithm with the main theorem that establishes a bound on the $\varepsilon$-cost of the multilevel estimator under certain assumptions on the MCMC error, on the (weak) model error, on the strong error between the states on level $\ell$ and on level $\ell-1$ (i.e. the variance of $Y_\ell$), as well as on the cost $\mathcal{C}_\ell$ to advance Algorithm 2 by one state from $n$ to $n+1$ (i.e. one evaluation of the likelihood on level $\ell$ and one on level $\ell-1$). Let us denote $\boldsymbol{\Theta}_\ell := \{\theta^n_\ell\}_{n \in \mathbb{N}} \cup \{\Theta^n_{\ell-1}\}_{n \in \mathbb{N}}$, for $\ell \ge 1$,
and $\boldsymbol{\Theta}_0 := \{\theta^n_0\}_{n \in \mathbb{N}}$, and define by $\mathbb{E}_{\boldsymbol{\Theta}_\ell}$ (respectively $\mathbb{V}_{\boldsymbol{\Theta}_\ell}$) the expected value (respectively variance) with respect to the distribution of $\boldsymbol{\Theta}_\ell$ generated by Algorithm 2. Furthermore, let us denote by $\nu^{\ell,\ell-1}$ the joint distribution of $\theta_\ell$ and $\Theta_{\ell-1}$, for $\ell \ge 1$, which is defined by the marginals of $\theta_\ell$ and $\Theta_{\ell-1}$ being $\nu^\ell$ and $\nu^{\ell-1}$, respectively, and the correlation being determined by Algorithm 2. For convenience, we define $Y_0 := Q_0$, $\nu^{0,-1} := \nu^0$ and $M_{-1} = R_{-1} = 1$.
Theorem 3.4. Let $\varepsilon < \exp[-1]$ and suppose there are positive constants $\alpha, \alpha', \beta, \beta', \gamma > 0$ such that $\alpha \ge \frac{1}{2}\min(\beta, \gamma)$. Under the following assumptions, for $\ell = 0, \ldots, L$,

M1. $\big|\mathbb{E}_{\nu^\ell}[Q_\ell] - \mathbb{E}_{\rho}[Q]\big| \le C_{\rm M1} \big(M_\ell^{-\alpha} + R_\ell^{-\alpha'}\big)$,

M2. $\mathbb{V}_{\nu^{\ell,\ell-1}}[Y_\ell] \le C_{\rm M2} \big(M_{\ell-1}^{-\beta} + R_{\ell-1}^{-\beta'}\big)$,

M3. $\mathbb{V}_{\boldsymbol{\Theta}_\ell}\big[\widehat{Y}^{\rm MC}_{\ell,N_\ell}\big] + \Big(\mathbb{E}_{\boldsymbol{\Theta}_\ell}\big[\widehat{Y}^{\rm MC}_{\ell,N_\ell}\big] - \mathbb{E}_{\nu^{\ell,\ell-1}}\big[\widehat{Y}^{\rm MC}_{\ell,N_\ell}\big]\Big)^2 \le C_{\rm M3}\, N_\ell^{-1}\, \mathbb{V}_{\nu^{\ell,\ell-1}}[Y_\ell]$,

M4. $\mathcal{C}_\ell \le C_{\rm M4}\, M_\ell^{\gamma}$,

and provided $R_\ell \gtrsim M_\ell^{\max\{\alpha/\alpha',\, \beta/\beta'\}}$, there exist a number of levels $L$ and a sequence $\{N_\ell\}_{\ell=0}^{L}$ such that
\[
e\big(\widehat{Q}^{\rm ML}_{L,\{N_\ell\}}\big)^2 := \mathbb{E}_{\cup_\ell \boldsymbol{\Theta}_\ell}\Big[\big(\widehat{Q}^{\rm ML}_{L,\{N_\ell\}} - \mathbb{E}_{\rho}[Q]\big)^2\Big] < \varepsilon^2,
\]
and
\[
\mathcal{C}_\varepsilon\big(\widehat{Q}^{\rm ML}_{L,\{N_\ell\}}\big) \le C_{\rm ML} \begin{cases} \varepsilon^{-2}\, |\log \varepsilon|, & \text{if } \beta > \gamma, \\ \varepsilon^{-2}\, |\log \varepsilon|^3, & \text{if } \beta = \gamma, \\ \varepsilon^{-2-(\gamma-\beta)/\alpha}\, |\log \varepsilon|, & \text{if } \beta < \gamma. \end{cases}
\]
Proof. The proof of this theorem is very similar to the proof of the complexity theorem in the case of multilevel estimators based on i.i.d. samples (cf. [9, Theorem 1]), which can be found in the appendix of [9]. First note that by assumption we have $R_\ell^{-\alpha'} \lesssim M_\ell^{-\alpha}$ and $R_\ell^{-\beta'} \lesssim M_\ell^{-\beta}$.

Furthermore, in the same way as in (2.5), we can expand
\[
e\big(\widehat{Q}^{\rm ML}_{L,\{N_\ell\}}\big)^2 \le \mathbb{V}_{\cup_\ell \boldsymbol{\Theta}_\ell}\big[\widehat{Q}^{\rm ML}_{L,\{N_\ell\}}\big] + \underbrace{2\Big(\mathbb{E}_{\cup_\ell \boldsymbol{\Theta}_\ell}\big[\widehat{Q}^{\rm ML}_{L,\{N_\ell\}}\big] - \mathbb{E}_{\nu^L}\big[\widehat{Q}^{\rm ML}_{L,\{N_\ell\}}\big]\Big)^2}_{(I)} + 2\Big(\mathbb{E}_{\nu^L}[Q_L] - \mathbb{E}_{\rho}[Q]\Big)^2.
\]
It follows from the Cauchy-Schwarz inequality that
\[
\mathbb{V}_{\cup_\ell \boldsymbol{\Theta}_\ell}\big[\widehat{Q}^{\rm ML}_{L,\{N_\ell\}}\big] = \sum_{\ell=0}^{L} \mathbb{V}_{\boldsymbol{\Theta}_\ell}\big[\widehat{Y}^{\rm MC}_{\ell,N_\ell}\big] + 2 \sum_{0 \le \ell < \ell' \le L} {\rm Cov}_{\cup_\ell \boldsymbol{\Theta}_\ell}\big[\widehat{Y}^{\rm MC}_{\ell,N_\ell},\, \widehat{Y}^{\rm MC}_{\ell',N_{\ell'}}\big] \;\lesssim\; (L+1) \sum_{\ell=0}^{L} \mathbb{V}_{\boldsymbol{\Theta}_\ell}\big[\widehat{Y}^{\rm MC}_{\ell,N_\ell}\big].
\]
We can bound the second term in the MSE above by
\[
(I) = 2\bigg(\sum_{\ell=0}^{L} \Big(\mathbb{E}_{\boldsymbol{\Theta}_\ell}\big[\widehat{Y}^{\rm MC}_{\ell,N_\ell}\big] - \mathbb{E}_{\nu^{\ell,\ell-1}}\big[\widehat{Y}^{\rm MC}_{\ell,N_\ell}\big]\Big)\bigg)^2 \le 2(L+1) \sum_{\ell=0}^{L} \Big(\mathbb{E}_{\boldsymbol{\Theta}_\ell}\big[\widehat{Y}^{\rm MC}_{\ell,N_\ell}\big] - \mathbb{E}_{\nu^{\ell,\ell-1}}\big[\widehat{Y}^{\rm MC}_{\ell,N_\ell}\big]\Big)^2,
\]
and thus it follows from Assumption M3 that
\[
e\big(\widehat{Q}^{\rm ML}_{L,\{N_\ell\}}\big)^2 \;\lesssim\; (L+1) \sum_{\ell=0}^{L} N_\ell^{-1}\, \mathbb{V}_{\nu^{\ell,\ell-1}}[Y_\ell] + \Big(\mathbb{E}_{\nu^L}[Q_L] - \mathbb{E}_{\rho}[Q]\Big)^2. \tag{3.6}
\]
In contrast to the i.i.d. case, we have an additional factor $(L+1)$ multiplying the sampling error term on the right hand side of (3.6). Hence, in order to make this term less than $\varepsilon^2/2$,
the number of samples $N_\ell$ needs to be increased by a factor of $(L+1)$ compared to the i.i.d. case, which also increases the cost of the multilevel estimator by a factor of $(L+1)$. The remainder of the proof remains identical. Since $L$ is chosen such that the second term in (3.6) (the bias of the multilevel estimator) is less than $\varepsilon^2/2$,
it follows from Assumption M1 that $L + 1 \lesssim |\log \varepsilon|$. The bounds on the $\varepsilon$-cost then follow as in [9, Theorem 1], but with an extra $|\log \varepsilon|$ factor.

Note that in our proof we do not require the estimators $\widehat{Y}^{\rm MC}_{\ell,N_\ell}$, $\ell = 0, \ldots, L$, to be independent. However, in practice we found that independent estimators lead to a faster absolute performance of the multilevel estimator (in terms of cost versus error).

Assumptions M1 and M4 are the same assumptions as in the single-level case, and are related to the bias in the model (e.g. due to discretisation) and to the cost per sample, respectively. Assumption M3 is similar to Assumption A1, in that it is a non-asymptotic bound for the sampling errors of the MCMC estimator $\widehat{Y}^{\rm MC}_{\ell,N_\ell}$. For this assumption to hold, it is in general necessary that the chains have been sufficiently burnt in, i.e. that the values $n^\ell_0$ are sufficiently large.

4 Application to subsurface flow

In this section, we will apply the proposed MLMCMC algorithm to a simple model problem arising in subsurface flow modelling. Probabilistic uncertainty quantification in subsurface flow is of interest in a number of situations, as for example in risk analysis for radioactive waste disposal or in oil reservoir simulation. The classical equations governing (steady state) single-phase subsurface flow consist of Darcy's law coupled with an incompressibility condition (see e.g. [14, 10]):
\[
w + k \nabla p = g \quad \text{and} \quad {\rm div}\, w = 0, \qquad \text{in } D \subset \mathbb{R}^d, \; d = 1, 2, 3, \tag{4.1}
\]
subject to suitable boundary conditions. In physical terms, $p$ denotes the pressure head of the fluid, $k$ is the permeability tensor, $w$ is the filtration velocity (or Darcy flux) and $g$ is the source term.

4.1 Uncertainty quantification

A typical approach to quantify uncertainty in $p$ and $w$ is to model the permeability as a random field $k = k(x, \omega)$ on $D \times \Omega$, for some probability space $(\Omega, \mathcal{A}, \mathbb{P})$. The mean and covariance structure of $k$ has to be inferred from the (limited) geological information available. This means that (4.1) becomes a system of PDEs with random coefficients, which can be written in second order form as
\[
-\nabla \cdot \big(k(x, \omega)\, \nabla p(x, \omega)\big) = f(x), \quad \text{in } D, \tag{4.2}
\]
with $f := -{\rm div}\, g$. This means that the solution $p$ itself will also be a random field on $D \times \Omega$. For simplicity, we shall restrict ourselves to Dirichlet conditions $p(\omega, x) = \psi(x)$ on $\partial D$, and assume that the boundary data $\psi$ and the source term $g$ are known (and thus deterministic).

In this general form, solving (4.2) is extremely challenging computationally, and so in practice it is common to use relatively simple models for $k$ that are as faithful as possible to the measurements. One model that has been studied extensively is a log-normal distribution for $k$, i.e. replacing the permeability tensor by a scalar valued field whose log is Gaussian. It guarantees that $k > 0$ almost surely. Moreover, in typical applications the correlation length of $k$ is significantly smaller than the size of the computational region. In addition, typical sedimentation processes lead to fairly irregular structures and pore networks. Faithful models should therefore also only assume limited spatial regularity of $k$. A covariance function that has been proposed in the application literature (cf. [27]) is the following exponential two-point covariance function for $\log k$:
\[
C(x, y) := \sigma^2 \exp\left(-\frac{\|x - y\|_r}{\lambda}\right), \quad x, y \in D, \tag{4.3}
\]
where $\|\cdot\|_r$ denotes the $\ell_r$-norm in $\mathbb{R}^d$ and typically $r = 1$ or $2$. The parameters $\sigma^2$ and $\lambda$ denote variance and correlation length, respectively.
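To illustrate (4.3), the following minimal sketch draws realisations of $\log k$ on a uniform 1-D grid by factorising the covariance matrix directly. The dense Cholesky factorisation is our own illustrative shortcut; the paper instead uses the Karhunen-Loève expansion introduced below, which exposes the level-dependent truncation needed by the multilevel algorithm:

```python
import numpy as np

# Realisations of log k from the exponential covariance (4.3) with r = 1,
# on D = (0,1), via a dense Cholesky factorisation of the covariance matrix.
sigma2, lam = 1.0, 0.3          # variance sigma^2 and correlation length lambda
x = np.linspace(0.0, 1.0, 201)  # uniform grid
C = sigma2 * np.exp(-np.abs(x[:, None] - x[None, :]) / lam)   # covariance matrix
Lc = np.linalg.cholesky(C + 1e-12 * np.eye(len(x)))           # jitter for stability
rng = np.random.default_rng(0)
log_k = Lc @ rng.standard_normal(len(x))                      # one Gaussian sample
k = np.exp(log_k)                                             # log-normal permeability
```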
In subsurface flow applications, typically only $\sigma^2 \ge 1$ and $\lambda \le {\rm diam}\, D$ will be of interest. The choice of covariance function in (4.3) implies that $k$ is homogeneous, and it follows from Kolmogorov's theorem [29] that $k(\cdot, \omega) \in C^{0,t}(D)$ almost surely, for any $t < 1/2$. In the analysis below, we assume that $k$ is a log-normal random field, where $\log k$ has mean zero and exponential covariance function (4.3) with $r = 1$. However, other models for $k$ are possible, and the required theoretical results can be found in [6, 34, 33].

Let us now put model problem (4.2) into context for the MCMC and MLMCMC methods described in Sections 2 and 3. The quantity of interest $Q$ is in this case some functional $\mathcal{G}$ of the PDE solution $p$, and $Q_{M,R}$ is the same functional $\mathcal{G}$ evaluated at a discretised solution $p_{M,R}$. The discretisation level $M$ denotes the number of degrees of freedom for the numerical solution of (4.2) for a given sample, and the parameter $R$ denotes the number of random variables used to model the permeability $k$. The random vector $X$ will contain the $M$ degrees of freedom of the discrete pressure $p_{M,R}$.

For the spatial discretisation of model problem (4.2), we will use standard, continuous, piecewise linear finite elements (FEs), see e.g. [4, 8] for more details. Other spatial discretisation schemes are possible, see for example [9] for a numerical study with finite volume methods and [20] for a theoretical treatment of mixed finite elements. We choose a regular triangulation $\mathcal{T}_h$ of mesh width $h$ of our spatial domain $D$, which results in $M = \mathcal{O}(h^{-d})$ degrees of freedom for the numerical approximation.

In order to apply the proposed MCMC methods to model problem (4.2), we need to represent the permeability $k$ in terms of a set of random variables. For this, we will use the Karhunen-Loève (KL) expansion. For the Gaussian field $\log k$, this is an expansion in terms of a countable set of independent, standard Gaussian random variables $\{\xi_n\}_{n \in \mathbb{N}}$. It is given by
\[
\log k(\omega, x) = \sum_{n=1}^{\infty} \sqrt{\mu_n}\, \phi_n(x)\, \xi_n(\omega),
\]
where $\{\mu_n\}_{n \in \mathbb{N}}$ are the eigenvalues and $\{\phi_n\}_{n \in \mathbb{N}}$ the corresponding $L^2$-normalised eigenfunctions of the covariance operator with kernel function $C(x, y)$. For more details on its derivation and properties, see e.g. [17]. We will here only mention that the eigenvalues $\{\mu_n\}_{n \in \mathbb{N}}$ are non-negative with $\sum_{n \ge 1} \mu_n < \infty$. For the particular covariance function (4.3) with $r = 1$, we have $\mu_n \lesssim n^{-2}$ and hence there is an intrinsic ordering of importance in the KL-expansion. Truncating the KL-expansion after $R$ terms gives an approximation of $k$ in terms of $R$ standard normal random variables,
\[
k_R(\omega, x) = \exp\left[\sum_{n=1}^{R} \sqrt{\mu_n}\, \phi_n(x)\, \xi_n(\omega)\right]. \tag{4.4}
\]
Denote by $\vartheta := \{\xi_n\}_{n \in \mathbb{N}} \in \mathbb{R}^{\mathbb{N}}$ the vector of independent random variables appearing in the KL-expansion of $\log k$. We will work with prior and posterior measures on the space $\mathbb{R}^{\mathbb{N}}$. To this end, we equip $\mathbb{R}^{\mathbb{N}}$ with the product sigma algebra $\mathcal{B} := \bigotimes_{n \in \mathbb{N}} \mathcal{B}(\mathbb{R})$, where $\mathcal{B}(\mathbb{R})$ denotes the sigma algebra of Borel sets of $\mathbb{R}$. We denote by $\rho_0$ the prior measure on $\mathbb{R}^{\mathbb{N}}$, defined by $\{\xi_n\}_{n \in \mathbb{N}}$ being independent and identically distributed (i.i.d.) $N(0,1)$
random variables, such that
\[
\rho_0 = \bigotimes_{n \in \mathbb{N}} g(\xi_n)\, {\rm d}\xi_n, \tag{4.5}
\]
where $g : \mathbb{R} \to \mathbb{R}_+$ is the Lebesgue density of a $N(0,1)$
random variable and ${\rm d}\xi_n$ denotes the one-dimensional Lebesgue measure.

We assume that the observed data is finite dimensional, i.e. $F_{\rm obs} \in \mathbb{R}^m$ for some $m \in \mathbb{N}$, and that
\[
F_{\rm obs} = F(p(\vartheta)) + \eta, \tag{4.6}
\]
where $F : H^1(D) \to \mathbb{R}^m$ is a continuous function of $p$, the (weak) solution to model problem (4.1), which depends on $\vartheta$ through $k$. The observational noise $\eta$ is assumed to be a realisation of a $N(0, \sigma_F^2 I_m)$ random variable (independent of $\vartheta$). The parameter $\sigma_F$ is a fidelity parameter that indicates the level of observational noise present in $F_{\rm obs}$.

With $\rho_0$ as in (4.5), we have $\rho_0(\mathbb{R}^{\mathbb{N}}) = 1$. Furthermore, since $p$ depends continuously on $\vartheta$ (see [5, Propositions 3.6 and 4.1] or [35, Lemmas 2.20 and 5.13]), the map $F \circ p : \mathbb{R}^{\mathbb{N}} \to \mathbb{R}^m$ is also continuous (by assumption). The posterior distribution, which we will denote by $\rho$, is then known to be absolutely continuous with respect to the prior and satisfies
\[
\frac{\partial \rho}{\partial \rho_0}(\vartheta) \;\propto\; \exp\left[-\frac{\|F_{\rm obs} - F(p(\vartheta))\|^2}{\sigma_F^2}\right] =: \exp\big[-\Phi(\vartheta; F_{\rm obs})\big], \tag{4.7}
\]
where $\|\cdot\|$ denotes the Euclidean norm on $\mathbb{R}^m$. The hidden constant depends only on $F_{\rm obs}$ and is generally not known (for more details see [32] and the references therein). The right hand side of (4.7) is referred to as the likelihood.

Since the exact solution $p(\vartheta)$ is not available, the likelihood $\exp[-\Phi(\vartheta; F_{\rm obs})]$ needs to be approximated in practical computations. We use a truncation of the KL-expansion of $\log k$ after $R$ terms and a spatial approximation $p_{M,R}$ of $p(\vartheta)$ by piecewise linear FEs. The value of $\sigma_F$ may also be changed to $\sigma_{F,M}$. We denote the resulting approximate posterior measure correspondingly by $\rho^{M,R}$, with
\[
\frac{\partial \rho^{M,R}}{\partial \rho_0}(\vartheta) \;\propto\; \exp\left[-\frac{\|F_{\rm obs} - F(p_{M,R}(\vartheta))\|^2}{\sigma_{F,M}^2}\right] =: \exp\big[-\Phi_{M,R}(\vartheta; F_{\rm obs})\big]. \tag{4.8}
\]
Since $F \circ p_{M,R}$ only depends on $\theta := \{\xi_n\}_{n=1}^{R}$, the first $R$ components of $\vartheta$, and since the prior measure factorises as $\rho_0 = \rho_0^R \otimes \rho_0^{\perp}$, the approximate posterior measure also factorises as $\rho^{M,R} = \nu^{M,R} \otimes \rho^{\perp}$, where
\[
\frac{\partial \nu^{M,R}}{\partial \rho_0^R}(\theta) \;\propto\; \exp\big[-\Phi_{M,R}(\theta; F_{\rm obs})\big], \tag{4.9}
\]
and $\rho^{\perp} = \rho_0^{\perp}$ [12]. Note that $\nu^{M,R}$ is a measure on the finite dimensional space $\mathbb{R}^R$. Denoting by $\pi^{M,R}$ and $\pi^R$ the densities with respect to the $R$-dimensional Lebesgue measure of $\nu^{M,R}$ and $\rho_0^R$, respectively, it follows from (4.9) that
\[
\pi^{M,R}(\theta) \;\propto\; \exp\big[-\Phi_{M,R}(\theta; F_{\rm obs})\big]\, \pi^R(\theta). \tag{4.10}
\]
Our goal is to approximate the expected value of a quantity $Q = \mathcal{G}(p(\vartheta))$ with respect to the posterior $\rho$, for some continuous $\mathcal{G} : H^1(D) \to \mathbb{R}$. We denote this expected value by $\mathbb{E}_{\rho}[Q] := \int_{\mathbb{R}^{\mathbb{N}}} \mathcal{G}(p(\vartheta))\, \rho({\rm d}\vartheta)$ and assume that, as $M, R \to \infty$,
\[
\mathbb{E}_{\nu^{M,R}}[Q_{M,R}] \to \mathbb{E}_{\rho}[Q], \quad \text{where} \quad \mathbb{E}_{\nu^{M,R}}[Q_{M,R}] := \int_{\mathbb{R}^R} \mathcal{G}(p_{M,R}(\theta))\, \nu^{M,R}({\rm d}\theta)
\]
is a finite dimensional integral.

Finally, let us set the notation for our MLMCMC algorithm. To achieve a level-dependent representation of $k$, we simply truncate the KL-expansion after a sufficiently large, level-dependent number of terms $R_\ell$, such that the truncation error on each level is bounded by the discretisation error, and set $\theta_\ell := \{\xi_n\}_{n=1}^{R_\ell}$. A sequence of discretisation levels $M_\ell$ satisfying (3.1) can be constructed by choosing a coarsest mesh width $h_0$ for the spatial approximation, and choosing $h_\ell := s^{-\ell} h_0$. A common (but not necessarily optimal) choice is $s = 2$ and uniform refinement between the levels.
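Putting the pieces together, the level-$\ell$ likelihood evaluation might be organised as in the following minimal sketch. The helpers `kl_modes`, `solve_fe` and `observe` are hypothetical placeholders for the KL eigenpairs, the FE solve of (4.2) on mesh level $\ell$, and the observation functional $F_\ell$, respectively:

```python
import numpy as np

def log_likelihood_level(theta_l, level, F_obs, sigma_F_l, kl_modes,
                         solve_fe, observe):
    """Level-l log-likelihood (4.12), up to an additive constant.

    theta_l   : the first R_l KL coefficients xi_1, ..., xi_{R_l}
    sigma_F_l : callable level -> sigma_{F,l}          [hypothetical]
    kl_modes  : callable x -> sqrt(mu_n) * phi_n(x)    [hypothetical]
    """
    # KL-truncated log-permeability (4.4), evaluated pointwise
    log_k = lambda x: kl_modes(x)[: len(theta_l)] @ theta_l
    p_l = solve_fe(level, lambda x: np.exp(log_k(x)))   # FE solve of (4.2)
    misfit = F_obs - observe(p_l)                       # data misfit F_obs - F_l
    return -np.dot(misfit, misfit) / sigma_F_l(level) ** 2
```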
We denote the resulting (truncated) FE solution by $p_\ell := p_{M_\ell, R_\ell}$. The prior density $\pi^{R_\ell}$ of $\theta_\ell$ is simply a standard $R_\ell$-dimensional Gaussian:
\[
\pi^{R_\ell}(\theta_\ell) = \frac{1}{(2\pi)^{R_\ell/2}} \exp\left[-\frac{1}{2} \sum_{j=1}^{R_\ell} \xi_j^2\right]. \tag{4.11}
\]
For the likelihood, we have
\[
\mathcal{L}_\ell(F_{\rm obs} \,|\, \theta_\ell) \;\propto\; \exp\left[-\frac{\|F_{\rm obs} - F_\ell(\theta_\ell)\|^2}{\sigma_{F,\ell}^2}\right], \tag{4.12}
\]
where $F_\ell(\theta_\ell) = F(p_\ell(\theta_\ell))$. Recall that the coarser levels in our multilevel estimator are introduced only to accelerate the convergence and that the multilevel estimator is still an unbiased estimator of the expected value of $Q_L$ with respect to the posterior $\nu^L$ on the finest level $L$. Hence, the posterior distributions on the coarser levels $\nu^\ell$, $\ell = 0, \ldots, L-1$,
do not have to model the measured data as faithfully as $\nu^L$. In particular, this means that we can choose larger values of the fidelity parameter $\sigma_{F,\ell}$ on the coarse levels, which will increase the acceptance probability on the coarser levels. The growth in $\sigma_{F,\ell}$ has to be controlled, as we will see below (cf. Assumption A3). A simple admissible schedule is sketched below.
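The following sketch shows one possible fidelity schedule consistent with Assumption A3 below, which requires the perturbation $\sigma_F^{-2} - \sigma_{F,\ell}^{-2}$ to decay at the discretisation rate. Both the functional form and the constant `c` are hypothetical tuning choices, not prescribed by the paper:

```python
def sigma_F_level(level, sigma_F, M0, s=2, d=2, c=1.0):
    """One possible schedule: sigma_{F,l} >= sigma_F with
    sigma_F^{-2} - sigma_{F,l}^{-2} = c * M_l^{-1/d},
    so the perturbation decays at the discretisation rate (cf. A3)."""
    M_l = M0 * s ** (d * level)              # degrees of freedom on level l
    inv_sq = sigma_F ** -2 - c * M_l ** (-1.0 / d)
    assert inv_sq > 0, "choose c small enough that sigma_{F,l}^2 stays positive"
    return inv_sq ** -0.5                    # sigma_{F,l}, largest on level 0
```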
4.2 Convergence analysis of the MLMCMC estimator

We now perform a rigorous convergence analysis of the MLMCMC estimator $\widehat{Q}^{\rm ML}_{L,\{N_\ell\}}$ introduced in Section 3, applied to model problem (4.1). We will first verify that the multilevel estimator is indeed an unbiased estimator of $\mathbb{E}_{\nu^L}[Q_L]$. To achieve this, we only need to verify the irreducibility condition (3.5) in Theorem 3.3. As already noted, for common choices of proposal distributions, the condition holds true if $\pi^{\ell-1}(\theta_{\ell,C}) > 0$, for all $\theta_\ell$ such that $\pi^\ell(\theta_\ell) > 0$. The conclusion follows, since both the prior and the likelihood were chosen as normal distributions, and normal distributions have infinite support.
Theorem 4.1. Suppose that for all $\ell = 0, \ldots, L$, $\mathbb{E}_{\nu^\ell}[|Q_\ell|] < \infty$. Then
\[
\lim_{\{N_\ell\} \to \infty} \widehat{Q}^{\rm ML}_{L,\{N_\ell\}} = \mathbb{E}_{\nu^L}[Q_L], \quad \text{for any } \theta^0_\ell \in \mathcal{E}_\ell \text{ and } n^\ell_0 \ge 0.
\]

Let us now verify the assumptions of the abstract complexity theorem, Theorem 3.4, starting with Assumption M4, i.e. the cost to compute one sample of $Q_\ell$. In the best case, with an optimal linear solver to solve the discretised (FE) equations for each sample, M4 is satisfied with $\gamma = 1$.
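As an illustration of the cost assumption M4, here is a minimal, self-contained sketch of the per-sample forward solve in $d = 1$: piecewise linear FEs for (4.2) with homogeneous Dirichlet data, solved by the Thomas algorithm in $\mathcal{O}(M_\ell)$ operations, i.e. $\gamma = 1$. This is an illustrative stand-in, not the solver used in the paper's experiments:

```python
import numpy as np

def solve_fe_1d(k_vals, f_vals, h):
    """P1 finite elements for -(k p')' = f on (0,1), with p(0) = p(1) = 0.

    k_vals : permeability at the M+1 cell midpoints
    f_vals : source at the M interior nodes;  h : mesh width.
    The tridiagonal system is solved in O(M) operations (gamma = 1 in M4).
    """
    M = len(f_vals)
    main = (k_vals[:-1] + k_vals[1:]) / h          # diagonal of stiffness matrix
    off = -k_vals[1:-1] / h                        # sub/super-diagonal
    b = h * f_vals.copy()                          # lumped load vector
    # Thomas algorithm: forward elimination, then back substitution
    c, d = off.copy(), main.copy()
    for i in range(1, M):
        w = c[i - 1] / d[i - 1]
        d[i] -= w * c[i - 1]
        b[i] -= w * b[i - 1]
    p = np.zeros(M)
    p[-1] = b[-1] / d[-1]
    for i in range(M - 2, -1, -1):
        p[i] = (b[i] - c[i] * p[i + 1]) / d[i]
    return p                                       # pressure at interior nodes
```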
Let g be a Gaussian field with constant mean and covariance function (4.3) with r = 1, and let k = exp[g] in model problem (4.2). Suppose D ⊂ R^d is Lipschitz polygonal (polyhedral). Then
\[
E_{\rho_0}\bigl[|p - p_\ell|^q_{H^1(D)}\bigr]^{1/q} \le C_{k,f,\psi,q}\bigl(M_\ell^{-1/(2d)+\delta} + R_\ell^{-1/2+\delta}\bigr),
\]
for any q < ∞ and δ > 0, where the (generic) constant C_{k,f,ψ,q} (here and below) depends on the data k, f, ψ and on q, but is independent of any other parameters.

Proof. This follows from [34, Proposition 4.1].

Convergence results for functionals of the solution p can now be derived from Theorem 4.2 using a duality argument. We will here for simplicity only consider bounded, linear functionals, but the results extend to continuously Fréchet differentiable functionals (see [34, §4]). We make the following assumption on the functional G (cf. Assumption F1 in [34]).

A2. Let G : H^1(D) → R be linear, and suppose there exists C_G ∈ R such that
\[
|G(v)| \le C_G \|v\|_{H^{1/2-\delta}(D)}, \quad \text{for all } \delta > 0.
\]
An example is the average value G(p) = (1/|D*|) ∫_{D*} p dx over some subdomain D* ⊂ D.

The main result on the convergence for functionals is the following.

Corollary 4.3. Let the assumptions of Theorem 4.2 be satisfied, and suppose G satisfies A2. Then
\[
E_{\rho_0}\bigl[|G(p) - G(p_\ell)|^q\bigr]^{1/q} \le C_{k,f,\psi,q}\bigl(M_\ell^{-1/d+\delta} + R_\ell^{-1/2+\delta}\bigr),
\]
for any q < ∞ and δ > 0.

Proof. This follows from [34, Corollary 4.1].

Note that assumption A2 is crucial in order to get the faster convergence rates of the spatial discretisation error in Corollary 4.3. For multilevel estimators based on i.i.d. samples, it follows immediately from Corollary 4.3 that the (corresponding) assumptions M1 and M2 are satisfied, with α = 1/d − δ, α′ = 1/2 − δ and β = 2α, β′ = 2α′, for any δ > 0.

In the MCMC setting, two additional difficulties arise. Firstly, the expectations are with respect to the posterior measures ν_ℓ and ρ, which are not known explicitly, but are related to the prior distributions ρ_0^ℓ and ρ_0 through Bayes' Theorem. Secondly, the samples on levels ℓ and ℓ−1 used to compute Y_ℓ = Q_ℓ − Q_{ℓ−1} are generated by Algorithm 2, and may differ not only due to discretisation and truncation order, but also because they come from different Markov chains (i.e. Θ^n_{ℓ−1} is not necessarily equal to θ^n_{ℓ,C}, as seen in Table 1).

To circumvent the problem of the intractability of the posterior distribution, we have the following lemma (recall that ν_ℓ := ν_{M_ℓ,R_ℓ} and ρ_ℓ := ρ_{M_ℓ,R_ℓ}), which relates moments with respect to the posterior distribution to moments with respect to the prior distribution.

Lemma 4.4.
For any random variable Z = Z(θ_ℓ) and for any q s.t. E_{ρ_0^ℓ}[|Z|^q] < ∞, we have
\[
\bigl| E_{\nu_\ell}[Z^q] \bigr| \;\lesssim\; E_{\rho_0^\ell}\bigl[|Z|^q\bigr].
\]
Similarly, for any random variable Z = Z(ϑ) and for any q s.t. E_{ρ_0}[|Z|^q] < ∞, we have
\[
\bigl| E_{\rho_\ell}[Z^q] \bigr| \;\lesssim\; E_{\rho_0}\bigl[|Z|^q\bigr].
\]

Proof. Using (4.10), we have
\[
\bigl| E_{\nu_\ell}[Z^q] \bigr| \;\propto\; \Bigl| \int_{R^{R_\ell}} Z^q(\theta_\ell)\, \exp\bigl[-\Phi_{M,R}(\theta_\ell; F_{\rm obs})\bigr]\, \pi_0^\ell(\theta_\ell)\, d\theta_\ell \Bigr|
\;\lesssim\; \sup_{\theta_\ell} \bigl\{ \exp\bigl[-\Phi_{M,R}(\theta_\ell; F_{\rm obs})\bigr] \bigr\} \int_{R^{R_\ell}} |Z(\theta_\ell)|^q\, \pi_0^\ell(\theta_\ell)\, d\theta_\ell.
\]
The first claim of the lemma then follows, since the above supremum can be bounded by 1. The proof of the second claim is analogous, using the Radon-Nikodym derivative (4.7).

We are now ready to prove assumption M1, under the following assumption on the parameters σ_{F,ℓ} in the likelihood model (4.12):
A3. The sequence of fidelity parameters {σ_{F,ℓ}}_{ℓ=0}^{∞} satisfies
\[
\sigma_F^{-2} - \sigma_{F,\ell}^{-2} \;\lesssim\; \max\bigl(R_\ell^{-1/2+\delta},\, M_\ell^{-1/d+\delta}\bigr), \quad \text{for all } \delta > 0.
\]
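As a concrete illustration (this specific schedule is our example, not taken from the analysis above), any choice of the form
\[
\sigma_{F,\ell}^{-2} = \sigma_F^{-2}\Bigl(1 - c\,\max\bigl(R_\ell^{-1/2},\, M_\ell^{-1/d}\bigr)\Bigr), \qquad 0 < c \le 1,
\]
satisfies A3: it gives σ_{F,ℓ} ≥ σ_F on every level, i.e. more observational noise is tolerated on the coarser levels, σ_{F,ℓ} → σ_F as ℓ → ∞, and σ_F^{−2} − σ_{F,ℓ}^{−2} = c σ_F^{−2} max(R_ℓ^{−1/2}, M_ℓ^{−1/d}), which is exactly of the order required.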
Lemma 4.5. Let the assumptions of Corollary 4.3 be satisfied. Suppose F satisfies A2 (componentwise), and A3 holds. Then
\[
\bigl| E_{\nu_\ell}[Q_\ell] - E_\rho[Q] \bigr| \le C_{k,f,\psi}\bigl(M_\ell^{-1/d+\delta} + R_\ell^{-1/2+\delta}\bigr).
\]

Proof. Since Q_ℓ only depends on θ_ℓ, we have E_{ν_ℓ}[Q_ℓ] = E_{ρ_ℓ}[Q_ℓ] and so, using the triangle inequality,
\[
\bigl| E_{\nu_\ell}[Q_\ell] - E_\rho[Q] \bigr| \le \bigl| E_{\rho_\ell}[Q_\ell] - E_{\rho_\ell}[Q] \bigr| + \bigl| E_{\rho_\ell}[Q] - E_\rho[Q] \bigr|. \tag{4.13}
\]
The first term can be bounded using Corollary 4.3 and Lemma 4.4, i.e.
\[
\bigl| E_{\rho_\ell}[Q_\ell] - E_{\rho_\ell}[Q] \bigr| \le C_{k,f,\psi}\bigl(M_\ell^{-1/d+\delta} + R_\ell^{-1/2+\delta}\bigr).
\]
For the second term, we will prove a bound on the Hellinger distance d_Hell(ρ, ρ_ℓ). This proof follows closely the proof of [26, Proposition 10]. Denote by Z and Z_ℓ the normalising constants of ρ and ρ_ℓ:
\[
Z = \int_{R^N} \exp\bigl[-\Phi(\vartheta; F_{\rm obs})\bigr]\, d\rho_0(\vartheta)
\quad \text{and} \quad
Z_\ell = \int_{R^N} \exp\bigl[-\Phi_\ell(\vartheta; F_{\rm obs})\bigr]\, d\rho_0(\vartheta),
\]
respectively. Since F satisfies Assumption A2, it follows from the results in [32] that both Z and Z_ℓ can be bounded away from zero. Next, we have
\[
2\, d_{\rm Hell}(\rho, \rho_\ell)^2 = \int_{R^N} \Bigl( Z^{-1/2} \exp\bigl[-\tfrac12 \Phi(\vartheta; F_{\rm obs})\bigr] - Z_\ell^{-1/2} \exp\bigl[-\tfrac12 \Phi_\ell(\vartheta; F_{\rm obs})\bigr] \Bigr)^2 d\rho_0(\vartheta) \;\le\; I + II,
\]
where
\[
I := \frac{2}{Z} \int_{R^N} \Bigl( \exp\bigl[-\tfrac12 \Phi(\vartheta; F_{\rm obs})\bigr] - \exp\bigl[-\tfrac12 \Phi_\ell(\vartheta; F_{\rm obs})\bigr] \Bigr)^2 d\rho_0(\vartheta),
\qquad
II := 2\, \bigl| Z^{-1/2} - Z_\ell^{-1/2} \bigr|^2 \int_{R^N} \exp\bigl[-\Phi_\ell(\vartheta; F_{\rm obs})\bigr]\, d\rho_0(\vartheta).
\]
To estimate I, note that both exp[−½Φ(ϑ; F_obs)] and exp[−½Φ_ℓ(ϑ; F_obs)] are bounded above by 1, so that
\[
\Bigl( \exp\bigl[-\tfrac12 \Phi(\vartheta; F_{\rm obs})\bigr] - \exp\bigl[-\tfrac12 \Phi_\ell(\vartheta; F_{\rm obs})\bigr] \Bigr)^2 \;\lesssim\; \bigl| \Phi(\vartheta; F_{\rm obs}) - \Phi_\ell(\vartheta; F_{\rm obs}) \bigr|^2.
\]
Denoting F := F(p(ϑ)) and F_ℓ := F(p_ℓ(θ_ℓ)), and using the triangle inequality, we have that
\[
\Bigl| \frac{\|F_{\rm obs} - F\|^2}{\sigma_F^2} - \frac{\|F_{\rm obs} - F_\ell\|^2}{\sigma_{F,\ell}^2} \Bigr|
\le \Bigl| \frac{\bigl(\|F_{\rm obs} - F_\ell\| + \|F - F_\ell\|\bigr)^2}{\sigma_F^2} - \frac{\|F_{\rm obs} - F_\ell\|^2}{\sigma_{F,\ell}^2} \Bigr|
= \|F_{\rm obs} - F_\ell\|^2 \bigl( \sigma_F^{-2} - \sigma_{F,\ell}^{-2} \bigr) + \frac{2\|F_{\rm obs} - F_\ell\| + \|F - F_\ell\|}{\sigma_F^2}\, \|F - F_\ell\|.
\]
Since F was assumed to satisfy A2, it follows from Corollary 4.3 that
\[
E_{\rho_0}\bigl[ \|F - F_\ell\|^q \bigr]^{1/q} \le C_{k,f,\psi}\bigl(M_\ell^{-1/d+\delta} + R_\ell^{-1/2+\delta}\bigr).
\]
Moreover, since ‖F_ℓ‖ can be bounded independently of ℓ (again courtesy of Assumption A2), and since ‖F_obs − F_ℓ‖ ≤ ‖F_obs‖ + ‖F_ℓ‖, we can deduce that
\[
I \;\lesssim\; E_{\rho_0}\bigl[ \bigl| \Phi(\vartheta; F_{\rm obs}) - \Phi_\ell(\vartheta; F_{\rm obs}) \bigr|^2 \bigr] \le C_{k,f,\psi}\bigl(M_\ell^{-1/d+\delta} + R_\ell^{-1/2+\delta}\bigr)^2,
\]
using Assumption A3. For the second term II, we note that |Z^{−1/2} − Z_ℓ^{−1/2}| ≲ max{Z^{−3/2}, Z_ℓ^{−3/2}} |Z − Z_ℓ|, and an analysis similar to the above shows that
\[
II \;\lesssim\; E_{\rho_0}\bigl[ \bigl| \Phi(\vartheta; F_{\rm obs}) - \Phi_\ell(\vartheta; F_{\rm obs}) \bigr| \bigr]^2 \le C_{k,f,\psi}\bigl(M_\ell^{-1/d+\delta} + R_\ell^{-1/2+\delta}\bigr)^2.
\]
The claim of the lemma then follows, since |E_{ρ_ℓ}[Q] − E_ρ[Q]| ≤ C_{k,f,ψ} d_Hell(ρ, ρ_ℓ).

In order to prove M2, we further have to analyse the situation where the two samples θ^{n+1}_ℓ and Θ^{n+1}_{ℓ−1} used to compute Y^{n+1}_ℓ "diverge", i.e. when Θ^{n+1}_{ℓ−1} ≠ θ^{n+1}_{ℓ,C}. For the remainder we will consider only symmetric or pCN proposal distributions.
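To fix ideas before the variance analysis, here is a minimal sketch of one step of the level-ℓ chain of Algorithm 2 and of the coupled sample pair entering Y^{n+1}_ℓ. All names (Q, log_lik, etc.) are illustrative stand-ins for the quantities defined above, not an actual implementation from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

def step_level_ell(theta, Theta_coarse, Q, log_lik, R_c, beta):
    """One step of the level-ell chain of Algorithm 2 (a sketch).
    theta        : current fine state theta_ell^n (first R_c entries = coarse modes)
    Theta_coarse : independent sample from nu_{ell-1} (e.g. produced by Algorithm 3)
    Q, log_lik   : assumed callables, indexed [0] for level ell-1 and [1] for level ell
    """
    # proposal: new independent coarse modes, pCN move on the fine modes
    prop_F = np.sqrt(1 - beta**2) * theta[R_c:] + beta * rng.standard_normal(len(theta) - R_c)
    prop = np.concatenate([Theta_coarse, prop_F])
    # multilevel acceptance probability alpha_ML^ell (likelihood ratios only)
    log_alpha = (log_lik[1](prop) + log_lik[0](theta[:R_c])
                 - log_lik[1](theta) - log_lik[0](prop[:R_c]))
    accepted = np.log(rng.uniform()) < log_alpha
    theta_new = prop if accepted else theta   # on rejection the two chains "diverge"
    Y = Q[1](theta_new) - Q[0](Theta_coarse)  # sample of Y_ell^{n+1}
    return theta_new, Y
```

Note that the difference sample Y always uses the new coarse state Theta_coarse, while a rejection leaves the old coarse modes inside theta_new; this is precisely the divergence event Θ^{n+1}_{ℓ−1} ≠ θ^{n+1}_{ℓ,C} analysed next.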
Lemma 4.6. Let θ^{n+1}_ℓ and Θ^{n+1}_{ℓ−1} have joint distribution ν_{ℓ,ℓ−1}, and set Y^{n+1}_ℓ = Q_ℓ(θ^{n+1}_ℓ) − Q_{ℓ−1}(Θ^{n+1}_{ℓ−1}). If q^{ℓ,F}_ML is a pCN proposal distribution, then
\[
V_{\nu_{\ell,\ell-1}}\bigl[ Y^{n+1}_\ell \bigr] \le C_{k,f,\psi}\bigl(M_{\ell-1}^{-1/d+\delta} + R_{\ell-1}^{-1/2+\delta}\bigr),
\]
for any δ > 0. This bound also holds for a symmetric proposal distribution q^{ℓ,F}_ML under the additional assumption that
\[
(R_\ell - R_{\ell-1})\, (2\pi)^{-(R_\ell - R_{\ell-1})/2} \;\lesssim\; R_{\ell-1}^{-1/2+\delta}, \quad \text{for all } \delta > 0. \tag{4.14}
\]
For the growth condition (4.14) to be satisfied, it suffices that R_ℓ − R_{ℓ−1} grows logarithmically with R_{ℓ−1}.

To prove Lemma 4.6, we first need some preliminary results. Firstly, note that Θ^{n+1}_{ℓ−1} ≠ θ^{n+1}_{ℓ,C} only if the proposal θ′_ℓ generated for θ^{n+1}_ℓ was rejected. Given the states θ^n_ℓ and θ′_ℓ, the probability of this rejection is given by 1 − α^ℓ_ML(θ′_ℓ | θ^n_ℓ). The total probability of a rejection is then E_ζ[1 − α^ℓ_ML], where ζ denotes the joint distribution of the two variables. We need to quantify this probability.

Before we can do so, we need to specify the (marginal) distribution of the proposal θ′_ℓ, which we denote by ζ′_ℓ. The first R_{ℓ−1} entries of θ′_ℓ are distributed as ν_{ℓ−1}, since they come from Θ_{ℓ−1}. The remaining R_ℓ − R_{ℓ−1} dimensions are distributed according to the proposal density q^{ℓ,F}_ML(θ′_{ℓ,F} | θ^n_{ℓ,F}) (independent of the first R_{ℓ−1} dimensions). The same proof technique as in Lemma 4.4 shows again that |E_{ζ′_ℓ}[Z^q]| ≲ E_{ρ_0^ℓ}[|Z|^q], for any random variable Z = Z(θ′_ℓ).
Lemma 4.7. Let θ^n_ℓ and θ′_ℓ be as generated by Algorithm 2 at the (n+1)th step. Denote their joint distribution by ζ, with marginal distributions ν_ℓ and ζ′_ℓ, respectively. Suppose F satisfies A2, and A3 and the assumptions of Corollary 4.3 hold. If q^{ℓ,F}_ML is a pCN proposal distribution, then
\[
E_\zeta\bigl[ 1 - \alpha^\ell_{\rm ML}(\theta'_\ell \mid \theta^n_\ell) \bigr] \le C_{k,f,\psi}\bigl(M_{\ell-1}^{-1/d+\delta} + R_{\ell-1}^{-1/2+\delta}\bigr),
\]
for any δ > 0. This bound also holds for a symmetric proposal distribution q^{ℓ,F}_ML under the additional assumption (4.14).

Proof. We will start by assuming that q^{ℓ,F}_ML is a pCN proposal distribution. For brevity, denote L_ℓ(F_obs | ·) =: L_ℓ(·). We will first derive a bound on 1 − α^ℓ_ML(θ′_ℓ | θ^n_ℓ), for ℓ > 0 and θ′_ℓ and θ^n_ℓ given. First note that if
L_ℓ(θ′_ℓ) L_{ℓ−1}(θ^n_{ℓ,C}) / (L_ℓ(θ^n_ℓ) L_{ℓ−1}(θ′_{ℓ,C})) ≥ 1, then 1 − α^ℓ_ML(θ′_ℓ | θ^n_ℓ) = 0. Otherwise, we have
\[
1 - \alpha^\ell_{\rm ML}(\theta'_\ell \mid \theta^n_\ell)
= \Bigl( 1 - \frac{L_\ell(\theta'_\ell)}{L_{\ell-1}(\theta'_{\ell,C})} \Bigr)
+ \frac{L_\ell(\theta'_\ell)\, L_{\ell-1}(\theta^n_{\ell,C})}{L_\ell(\theta^n_\ell)\, L_{\ell-1}(\theta'_{\ell,C})} \Bigl( \frac{L_\ell(\theta^n_\ell)}{L_{\ell-1}(\theta^n_{\ell,C})} - 1 \Bigr)
\le \Bigl| 1 - \frac{L_\ell(\theta'_\ell)}{L_{\ell-1}(\theta'_{\ell,C})} \Bigr| + \Bigl| 1 - \frac{L_\ell(\theta^n_\ell)}{L_{\ell-1}(\theta^n_{\ell,C})} \Bigr|. \tag{4.15}
\]
Let us consider either of these two terms and set θ_ℓ = (ξ_j)_{j=1}^{R_ℓ} to be either θ′_ℓ or θ^n_ℓ. Using the definition (4.12) of the likelihood, we have
\[
\frac{L_\ell(\theta_\ell)}{L_{\ell-1}(\theta_{\ell,C})} = \exp\Bigl( -\frac{\|F_{\rm obs} - F_\ell(\theta_\ell)\|^2}{\sigma_{F,\ell}^2} + \frac{\|F_{\rm obs} - F_{\ell-1}(\theta_{\ell,C})\|^2}{\sigma_{F,\ell-1}^2} \Bigr). \tag{4.16}
\]
Denoting F_ℓ := F_ℓ(θ_ℓ) and F_{ℓ−1} := F_{ℓ−1}(θ_{ℓ,C}), we get as in the proof of Lemma 4.5 that
\[
\Bigl| \frac{\|F_{\rm obs} - F_\ell\|^2}{\sigma_{F,\ell}^2} - \frac{\|F_{\rm obs} - F_{\ell-1}\|^2}{\sigma_{F,\ell-1}^2} \Bigr|
\le \|F_{\rm obs} - F_{\ell-1}\|^2 \bigl| \sigma_{F,\ell}^{-2} - \sigma_{F,\ell-1}^{-2} \bigr|
+ \frac{2\|F_{\rm obs} - F_{\ell-1}\| + \|F_\ell - F_{\ell-1}\|}{\sigma_{F,\ell}^2}\, \|F_\ell - F_{\ell-1}\|. \tag{4.17}
\]
Using the inequality |1 − exp(x)| ≤ 2|x|, for 0 ≤ |x| ≤ 1, it follows immediately from (4.17), Assumption A3, Corollary 4.3, Lemma 4.4 and Hölder's inequality that
\[
E_\zeta\Bigl[ \Bigl| 1 - \frac{L_\ell(\theta_\ell)}{L_{\ell-1}(\theta_{\ell,C})} \Bigr| \Bigr] \le C_{k,f,\psi}\bigl(M_{\ell-1}^{-1/d+\delta} + R_{\ell-1}^{-1/2+\delta}\bigr). \tag{4.18}
\]
A bound on the expected value of 1 − α^ℓ_ML(θ′_ℓ | θ^n_ℓ) now follows from Minkowski's inequality.

The proof in the case of a symmetric proposal distribution is analogous. The bound (4.15) is replaced by
\[
1 - \alpha^\ell_{\rm ML}(\theta'_\ell \mid \theta^n_\ell) \le \Bigl| 1 - \frac{\pi^\ell(\theta'_\ell)}{\pi^{\ell-1}(\theta'_{\ell,C})} \Bigr| + \Bigl| 1 - \frac{\pi^\ell(\theta^n_\ell)}{\pi^{\ell-1}(\theta^n_{\ell,C})} \Bigr|.
\]
Using the definition of the posterior density π^ℓ in (4.10), as well as the models (4.11) and (4.12) for the prior and the likelihood, respectively, we have instead of (4.16) that
\[
\frac{\pi^\ell(\theta_\ell)}{\pi^{\ell-1}(\theta_{\ell,C})} = \frac{\pi_0^\ell(\theta_\ell)\, L_\ell(\theta_\ell)}{\pi_0^{\ell-1}(\theta_{\ell,C})\, L_{\ell-1}(\theta_{\ell,C})} \tag{4.19}
\]
\[
= (2\pi)^{-(R_\ell - R_{\ell-1})/2} \exp\Bigl( -\tfrac12 \sum_{j=R_{\ell-1}+1}^{R_\ell} \xi_j^2 - \frac{\|F_{\rm obs} - F_\ell(\theta_\ell)\|^2}{\sigma_{F,\ell}^2} + \frac{\|F_{\rm obs} - F_{\ell-1}(\theta_{\ell,C})\|^2}{\sigma_{F,\ell-1}^2} \Bigr).
\]
Since Σ_{j=R_{ℓ−1}+1}^{R_ℓ} ξ_j² is χ²-distributed with R_ℓ − R_{ℓ−1} degrees of freedom, we have
\[
E_{\rho_0^\ell}\Bigl[ \Bigl( \sum_j \xi_j^2 \Bigr)^{1/2} \Bigr] = \sqrt{2}\, \frac{\Gamma\bigl( (R_\ell - R_{\ell-1} + 1)/2 \bigr)}{\Gamma\bigl( (R_\ell - R_{\ell-1})/2 \bigr)} \;\lesssim\; R_\ell - R_{\ell-1}.
\]
Together with the assumption in (4.14), this implies that the expected value of the additional term in (4.19) is bounded by R_{ℓ−1}^{−1/2+δ}. The proof then reduces to that in the pCN case above.

We will further need the following result.
Lemma 4.8. For any θ_ℓ, let k_ℓ(θ_ℓ) := exp( Σ_{j=1}^{R_ℓ} √μ_j φ_j (θ_ℓ)_j ) and κ_ℓ(θ_ℓ) := min_{x∈D̄} k_ℓ(θ_ℓ, x). Then
\[
| p_\ell(\theta_\ell) - p_\ell(\theta^*_\ell) |_{H^1(D)} \;\lesssim\; \frac{\|f\|_{H^{-1}(D)}}{\kappa_\ell(\theta_\ell)\, \kappa_\ell(\theta^*_\ell)}\, \| k_\ell(\theta_\ell) - k_\ell(\theta^*_\ell) \|_{C^0(\bar D)}, \quad \text{for almost all } \theta_\ell,\, \theta^*_\ell, \tag{4.20}
\]
and
\[
E_{\rho_0^\ell}\bigl[ | p_\ell(\theta_\ell) |^q_{H^1(D)} \bigr] \le \text{constant}, \tag{4.21}
\]
for any q < ∞, where the hidden constants are independent of ℓ and p_ℓ.

Proof. Using the definition of κ_ℓ(θ_ℓ), as well as the identity
\[
\int_D k_\ell(\theta_\ell)\, \nabla p_\ell(\theta_\ell) \cdot \nabla v \, dx = \int_D f v \, dx = \int_D k_\ell(\theta^*_\ell)\, \nabla p_\ell(\theta^*_\ell) \cdot \nabla v \, dx, \quad \text{for all } v \in H^1_0(D),
\]
we have
\[
\kappa_\ell(\theta_\ell)\, | p_\ell(\theta_\ell) - p_\ell(\theta^*_\ell) |^2_{H^1(D)} \le \int_D k_\ell(\theta_\ell)\, \nabla( p_\ell(\theta_\ell) - p_\ell(\theta^*_\ell) ) \cdot \nabla( p_\ell(\theta_\ell) - p_\ell(\theta^*_\ell) ) \, dx
\le \int_D ( k_\ell(\theta_\ell) - k_\ell(\theta^*_\ell) )\, \nabla p_\ell(\theta^*_\ell) \cdot \nabla( p_\ell(\theta_\ell) - p_\ell(\theta^*_\ell) ) \, dx.
\]
Due to the standard estimate |p_ℓ(θ*_ℓ)|_{H¹(D)} ≤ ‖f‖_{H^{−1}(D)} / κ_ℓ(θ*_ℓ), (4.20) follows from an application of the Cauchy-Schwarz inequality, and (4.21) follows from the fact that E_{ρ_0^ℓ}[κ_ℓ(·)^{−q}] is bounded independently of ℓ ([5, Prop. 3.10]).

Using Lemmas 4.7 and 4.8, we are now ready to prove Lemma 4.6.

Proof of Lemma 4.6. Let θ^{n+1}_ℓ and Θ^{n+1}_{ℓ−1} be as generated by Algorithm 2 at the (n+1)th step, with joint distribution ν_{ℓ,ℓ−1}. As before, denote the proposal generated for θ^n_ℓ by θ′_ℓ. Firstly, since θ′_{ℓ,C} = Θ^{n+1}_{ℓ−1}, it follows from Minkowski's inequality that
\[
V_{\nu_{\ell,\ell-1}}\bigl[ Y^{n+1}_\ell \bigr] \le E_{\nu_{\ell,\ell-1}}\bigl[ \bigl( Q_\ell(\theta^{n+1}_\ell) - Q_{\ell-1}(\Theta^{n+1}_{\ell-1}) \bigr)^2 \bigr]
\;\lesssim\; E_{\tilde\zeta}\bigl[ \bigl( Q_\ell(\theta^{n+1}_\ell) - Q_\ell(\theta'_\ell) \bigr)^2 \bigr] + E_{\zeta'_\ell}\bigl[ \bigl( Q_\ell(\theta'_\ell) - Q_{\ell-1}(\theta'_{\ell,C}) \bigr)^2 \bigr]. \tag{4.22}
\]
Here, ζ̃ denotes the joint distribution of θ′_ℓ and θ^{n+1}_ℓ, and ζ′_ℓ is the marginal distribution of θ′_ℓ. A bound on the second term follows immediately from Corollary 4.3 and Lemma 4.4, i.e.
\[
E_{\zeta'_\ell}\bigl[ \bigl( Q_\ell(\theta'_\ell) - Q_{\ell-1}(\theta'_{\ell,C}) \bigr)^2 \bigr] \;\lesssim\; E_{\rho_0^\ell}\bigl[ \bigl( Q_\ell(\theta'_\ell) - Q_{\ell-1}(\theta'_{\ell,C}) \bigr)^2 \bigr] \le C_{k,f,\psi}\bigl(M_{\ell-1}^{-2/d+\delta} + R_{\ell-1}^{-1+\delta}\bigr). \tag{4.23}
\]
The first term in (4.22) is nonzero only if θ^{n+1}_ℓ ≠ θ′_ℓ. We will now use Lemmas 4.7 and 4.8, as well as the characteristic function 1{θ^{n+1}_ℓ ≠ θ′_ℓ} ∈ {0, 1}, to bound it. Firstly, Hölder's inequality gives
\[
E_{\tilde\zeta}\bigl[ \bigl( Q_\ell(\theta^{n+1}_\ell) - Q_\ell(\theta'_\ell) \bigr)^2 \bigr]
= E_{\tilde\zeta}\bigl[ \bigl( Q_\ell(\theta^{n+1}_\ell) - Q_\ell(\theta'_\ell) \bigr)^2\, 1\{\theta^{n+1}_\ell \neq \theta'_\ell\} \bigr]
\le E_{\tilde\zeta}\bigl[ \bigl( Q_\ell(\theta^{n+1}_\ell) - Q_\ell(\theta'_\ell) \bigr)^{2q_1} \bigr]^{1/q_1}\, E_{\tilde\zeta}\bigl[ 1\{\theta^{n+1}_\ell \neq \theta'_\ell\} \bigr]^{1/q_2}, \tag{4.24}
\]
for any q_1, q_2 such that q_1^{−1} + q_2^{−1} = 1. Since G satisfies assumption A2, it follows from Lemmas 4.4 and 4.8 that the term E_{ζ̃}[(Q_ℓ(θ^{n+1}_ℓ) − Q_ℓ(θ′_ℓ))^{2q_1}]^{1/q_1} in (4.24) can be bounded by a constant independent of ℓ, for any q_1 < ∞:
\[
E_{\tilde\zeta}\bigl[ \bigl( Q_\ell(\theta^{n+1}_\ell) - Q_\ell(\theta'_\ell) \bigr)^{2q_1} \bigr] \;\lesssim\; E_{\nu_\ell}\bigl[ Q_\ell(\theta^{n+1}_\ell)^{2q_1} \bigr] + E_{\zeta'_\ell}\bigl[ Q_\ell(\theta'_\ell)^{2q_1} \bigr] \;\lesssim\; E_{\rho_0^\ell}\bigl[ | p_\ell(\theta_\ell) |^{2q_1}_{H^1(D)} \bigr] \le \text{constant}.
\]
Since θ^{n+1}_ℓ ≠ θ′_ℓ only if the proposal θ′_ℓ has been rejected on level ℓ at the (n+1)th step, the probability that this happens can be bounded by E_ζ[1 − α^ℓ_ML(θ′_ℓ | θ^n_ℓ)], where the joint distribution ζ is as in Lemma 4.7. It follows by Lemma 4.7 that
\[
E_{\tilde\zeta}\bigl[ 1\{\theta^{n+1}_\ell \neq \theta'_\ell\} \bigr] = P\bigl[ \theta^{n+1}_\ell \neq \theta'_\ell \bigr] \le C_{k,f,\psi}\bigl(M_{\ell-1}^{-1/d+\delta} + R_{\ell-1}^{-1/2+\delta}\bigr). \tag{4.25}
\]
Combining (4.22)-(4.25), the claim of the lemma then follows.

We now collect the results in the preceding lemmas to state the main result of this section.
Theorem 4.9. Under the same assumptions as in Lemma 4.6, Assumptions M1 and M2 of Theorem 3.4 are satisfied, with α = β = 1/d − δ and α′ = β′ = 1/2 − δ, for any δ > 0.

If we assume that we can obtain individual samples in optimal cost C_ℓ ≲ M_ℓ log(M_ℓ), e.g. via a multigrid solver, we can satisfy Assumption M4 with γ = 1 + δ, for any δ > 0. It follows from Theorems 3.4 and 4.9 that we can get the following theoretical upper bounds for the ε-costs of classical and multilevel MCMC applied to model problem (4.2) with log-normal coefficients k:
\[
C_\varepsilon\bigl( \widehat{Q}^{\rm MC}_N \bigr) \;\lesssim\; \varepsilon^{-(d+2)-\delta}
\quad \text{and} \quad
C_\varepsilon\bigl( \widehat{Q}^{\rm ML}_{L,\{N_\ell\}} \bigr) \;\lesssim\; \varepsilon^{-(d+1)-\delta}, \quad \text{for any } \delta > 0. \tag{4.26}
\]
We clearly see the advantages of the multilevel method, which gives a saving of one power of ε^{−1} compared to the standard MCMC method. Note that for multilevel estimators based on i.i.d. samples, the savings of the multilevel method over the standard method are two powers of ε^{−1}, since β = 2α in that case, compared to β = α in the MCMC analysis above. The numerical results in the next section for d = 2 show that in practice we do seem to observe β ≈ 2α, leading to C_ε(Q̂^ML_{L,{N_ℓ}}) = O(ε^{−2}). However, we do not believe that this is a lack of sharpness in our theory, but rather a pre-asymptotic phase. The constant in front of the leading order term in the bound of V_{ν_{ℓ,ℓ−1}}[Y^n_ℓ], namely the term E_{ζ̃}[(Q_ℓ(θ^{n+1}_ℓ) − Q_ℓ(θ′_ℓ))^{2q_1}]^{1/q_1} in (4.24), depends on the difference between Q_ℓ(θ^{n+1}_ℓ) and Q_ℓ(θ′_ℓ). In the case of the pCN algorithm for the proposal distributions q^{ℓ−1} and q^{ℓ,F}_ML (as used in Section 5 below), this difference will be small, since θ^n_ℓ and θ′_ℓ will in general be very close to each other. However, the difference is bounded from below, and so we should eventually see the slower convergence rate for the variance as predicted by our theory.

In this section we describe the implementation details of the MLMCMC algorithm and examine the performance of the method in estimating the posterior expectation of some quantity of interest for our model problem (4.2). We consider (4.2) on the domain D = (0,1)² with f ≡ 1. On the lateral boundaries of the domain we choose Dirichlet boundary conditions; on the top and bottom we choose Neumann conditions:
\[
p\big|_{x_1=0} = 0, \quad p\big|_{x_1=1} = 1, \quad \frac{\partial p}{\partial n}\Big|_{x_2=0} = 0 \quad \text{and} \quad \frac{\partial p}{\partial n}\Big|_{x_2=1} = 0. \tag{5.1}
\]
The quantity of interest is the flux across the boundary at x_1 = 1, given by
\[
Q := -\int_0^1 k\, \frac{\partial p}{\partial x_1}\Big|_{x_1=1} \, dx_2. \tag{5.2}
\]
The (prior) permeability field k is modelled as a log-normal random field, with covariance function (4.3) with r = 1, σ² = 1 and λ = 0.5. The log-normal distribution is approximated using the truncated KL-expansion (4.4) with an increasing number R_ℓ of terms as ℓ increases. For r = 1, the KL eigenfunctions in (4.4) are known explicitly [9].

The model problem is discretised using piecewise linear FEs on a uniform triangular mesh. The coarsest mesh consists of m_0 + 1 grid points in each direction, with refined meshes containing m_ℓ + 1 = 2^ℓ m_0 + 1 points, so that the total number of grid points on level ℓ is M_ℓ = (m_ℓ + 1)². All our algorithms have been implemented within FreeFem++ [23]. As the linear solver for the resulting linear equation system for each sample, we used UMFPACK [13].
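For illustration, realisations of the truncated expansion can also be generated numerically, by an eigendecomposition of the covariance matrix on the grid, instead of using the analytic eigenpairs. The sketch below does exactly that; note that, without quadrature weights, this is only a discrete approximation of the KL expansion, and all names and parameter values are illustrative:

```python
import numpy as np

sigma2, lam, m = 1.0, 0.5, 16                    # variance, correlation length, grid
x = np.linspace(0.0, 1.0, m + 1)
X, Y = np.meshgrid(x, x)                         # tensor grid on D = (0,1)^2
pts = np.column_stack([X.ravel(), Y.ravel()])
dist1 = np.abs(pts[:, None, :] - pts[None, :, :]).sum(axis=2)  # 1-norm distances
C = sigma2 * np.exp(-dist1 / lam)                # covariance (4.3) with r = 1
mu, phi = np.linalg.eigh(C)
mu, phi = mu[::-1], phi[:, ::-1]                 # eigenpairs, sorted descending

R = 50                                           # truncation: R_l modes on level l
xi = np.random.default_rng(0).standard_normal(R)             # theta_l ~ N(0, I)
log_k = phi[:, :R] @ (np.sqrt(np.maximum(mu[:R], 0.0)) * xi)
k = np.exp(log_k).reshape(m + 1, m + 1)          # one realisation of k on the grid
```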
Let us first define two important quantities for the convergence analysis of Metropolis-Hastings MCMC.
Effective sample size and integrated autocorrelation time.
Let {θ^n}_{n≥0} be the Markov chain produced by Algorithm 1, and Q̂^MC_N the resulting MCMC estimator defined in (2.2). The integrated autocorrelation time τ_Q of the correlated samples Q^n_{M,R} := G(p_{M,R}(θ^n)) produced by Algorithm 1 is defined to be the ratio of the asymptotic variance σ²_Q of the MCMC estimator Q̂^MC_N, defined in (2.6), and the actual variance V_{ν_{M,R}}[Q_{M,R}] of Q_{M,R}. If
\[
s_Q^2 := \frac{1}{N} \sum_{n=0}^{N} \bigl( Q^n_{M,R} - \widehat{Q}^{\rm MC}_N \bigr)^2
\]
denotes the sample variance, then a good estimate for τ_Q, used e.g. in R, is given by τ_Q ≈ ρ(0)/s_Q², where ρ(0) is the so-called spectral density at frequency zero. Details of a method for approximating the spectral density are given in [24] (included in R under the package 'coda'). The effective sample size is defined as N_eff := N/τ_Q. It represents the number of i.i.d. samples from ν_{M,R} that would lead to a Monte Carlo estimator with the same variance as Q̂^MC_N.

ALGORITHM 3. (Recursive independence sampling)
Choose initial states Θ̃^0_{ℓ−1}, ..., Θ̃^0_0, such that Θ̃^0_{k,C} = Θ̃^0_{k−1}, and subsampling rates t_k, for all k = 1, ..., ℓ−1. Then, for j ≥ 0:
• On level 0:
– Given Θ̃^j_0, generate Θ̃′_0 from a pCN proposal distribution.
– Compute
α^0(Θ̃′_0 | Θ̃^j_0) = min{ 1, L_0(F_obs | Θ̃′_0) / L_0(F_obs | Θ̃^j_0) }.
– Set Θ̃^{j+1}_0 = Θ̃′_0 with probability α^0(Θ̃′_0 | Θ̃^j_0). Set Θ̃^{j+1}_0 = Θ̃^j_0 otherwise.
• On level k = 1, ..., ℓ−1:
– Given Θ̃^j_k, let Θ̃′_{k,C} = Θ̃^{(j+1)t_{k−1}}_{k−1} and generate Θ̃′_{k,F} from a pCN proposal distribution.
– Compute
α^k_ML(Θ̃′_k | Θ̃^j_k) = min{ 1, [L_k(F_obs | Θ̃′_k) L_{k−1}(F_obs | Θ̃^j_{k,C})] / [L_k(F_obs | Θ̃^j_k) L_{k−1}(F_obs | Θ̃′_{k,C})] }.
– Set Θ̃^{j+1}_k = Θ̃′_k with probability α^k_ML(Θ̃′_k | Θ̃^j_k). Set Θ̃^{j+1}_k = Θ̃^j_k otherwise.
• Set Θ^{j+1}_{ℓ−1} = Θ̃^{(j+1)t_{ℓ−1}}_{ℓ−1}.
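A compact implementation of Algorithm 3 is possible with nested generators. The sketch below is illustrative: log_lik[k] is an assumed callable returning log L_k(F_obs | ·), R[k] is the number of KL modes, t[k] are the subsampling rates and beta[k] the pCN step sizes of (5.3)-(5.4) below:

```python
import numpy as np

rng = np.random.default_rng(0)

def pcn(theta, beta):
    # pCN proposal, cf. (5.3)/(5.4): preserves the standard Gaussian prior
    return np.sqrt(1.0 - beta**2) * theta + beta * rng.standard_normal(theta.shape)

def subsampled(chain_iter, rate):
    # keep only every rate-th state of the chain
    while True:
        for _ in range(rate - 1):
            next(chain_iter)
        yield next(chain_iter)

def aux_chain(k, log_lik, R, t, beta):
    """Auxiliary chain on level k in Algorithm 3 (a sketch)."""
    if k == 0:
        theta = rng.standard_normal(R[0])
        while True:
            prop = pcn(theta, beta[0])
            # standard Metropolis-Hastings acceptance on level 0
            if np.log(rng.uniform()) < log_lik[0](prop) - log_lik[0](theta):
                theta = prop
            yield theta
    else:
        coarse = subsampled(aux_chain(k - 1, log_lik, R, t, beta), t[k - 1])
        theta = np.concatenate([next(coarse), rng.standard_normal(R[k] - R[k - 1])])
        while True:
            prop = np.concatenate([next(coarse), pcn(theta[R[k - 1]:], beta[k])])
            # two-level acceptance probability alpha_ML^k of Algorithm 3
            log_alpha = (log_lik[k](prop) + log_lik[k - 1](theta[:R[k - 1]])
                         - log_lik[k](theta) - log_lik[k - 1](prop[:R[k - 1]]))
            if np.log(rng.uniform()) < log_alpha:
                theta = prop
            yield theta

def independent_coarse_samples(ell, log_lik, R, t, beta):
    # ~independent samples Theta_{ell-1} ~ nu_{ell-1}, for use in Algorithm 2
    return subsampled(aux_chain(ell - 1, log_lik, R, t, beta), t[ell - 1])
```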
The final ingredient for our hierarchical multilevel MCMC algorithm is an efficient practical algorithm to obtain independent samples Θ^n_{ℓ−1} from the coarse posterior ν_{ℓ−1}, which we need in Algorithm 2 in Section 3 to estimate E_{ν_ℓ}[Q_ℓ] − E_{ν_{ℓ−1}}[Q_{ℓ−1}]. The algorithm is summarised in Algorithm 3.

We start on level 0 by creating a sufficiently long Markov chain {Θ̃^j_0}_{j≥0} using Algorithm 1 with pCN proposal distribution q^0 [11] (see (5.3) below for details). Let Q̃^j_0 := G(p_0(Θ̃^j_0)) be the sample of the output quantity of interest associated with the jth sample of the auxiliary chain {Θ̃^j_0}_{j≥0} on level 0. The samples in this chain are correlated, but by subsampling it with a sufficiently large rate t_0 ∈ N, we obtain independent samples. The typical rule in statistics to achieve independence is to choose t_0 to be twice the integrated autocorrelation time τ̃_0 of the Markov chain {Q̃^j_0}_{j≥0}. In practice, we found that much shorter subsampling rates were sufficient (see below).

Then, on level 0 < k ≤ ℓ−1, we use the independent samples created on level k−1 as coarse proposals to build the auxiliary chain {Θ̃^j_k}_{j≥0} on level k. The proposal distribution q^{k,F} for the modes that are added on level k is again chosen to be a pCN random walk (see (5.4) below for details). We subsample this chain again with a sufficiently large rate t_k ∈ N to obtain independent samples on level k. Finally, we set Θ^n_{ℓ−1} := Θ̃^{n t_{ℓ−1}}_{ℓ−1}. In summary, to produce one independent sample Θ^n_{ℓ−1} on level ℓ−1, we need to compute T_k := Π_{k′=k}^{ℓ−1} t_{k′} samples on each of the levels k = 0, ..., ℓ−1. Since the acceptance probability α^k_ML(Θ̃′_k | Θ̃^j_k) converges to 1 as k increases (cf. Lemma 4.7), and since we are using independent proposals from level k−1, the integrated autocorrelation times τ̃_k of the auxiliary chains {Θ̃^j_k}_{j≥0}, k = 1, ..., ℓ−1, converge to 1, i.e. the samples are essentially independent for large k. As a consequence, T_k is actually of the same order as the autocorrelation time of samples that Algorithm 1 with pCN proposals would produce on level k (see below for more details).

At the jth state of the auxiliary chain on level 0, the pCN proposal from the standard multivariate normal prior distribution is generated as follows:
\[
(\tilde\Theta'_0)_i = \sqrt{1 - \beta_0^2}\, (\tilde\Theta^j_0)_i + \beta_0 \Psi_i, \quad i = 1, \ldots, R_0. \tag{5.3}
\]
Here, Ψ_i ~ N(0,1) and β_0 is a tuning parameter used to control the size of the step in the pCN random walk [11]. Similarly, the proposal Θ̃′_{k,F} for the fine modes at the jth state of the auxiliary chain on level k ∈ {1, ..., L} is generated by
\[
(\tilde\Theta'_{k,F})_i = \sqrt{1 - \beta_k^2}\, (\tilde\Theta^j_{k,F})_i + \beta_k \Psi_i, \quad i = 1, \ldots, R_k - R_{k-1}. \tag{5.4}
\]
The actual values of β_k = 0.1, for all k = 0, ..., L, that are used in all the calculations that follow were chosen after carrying out a series of preliminary tests to achieve "good" mixing properties.

As in (2.2), in practice, the first j^0_k samples from each of the auxiliary chains are discarded by prescribing a "burn-in" period. We choose the length j^0_k of the "burn-in" period on level k to be twice the integrated autocorrelation time τ̃_k.
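The integrated autocorrelation times τ̃_k that determine the subsampling rates and burn-in lengths can be estimated directly from the chain output. The sketch below uses a simple truncated autocorrelation sum rather than the spectral method of [24]; the function names are illustrative:

```python
import numpy as np

def iact(q, max_lag=None):
    """Integrated autocorrelation time tau = 1 + 2*sum_k rho(k), truncating
    the sum at the first non-positive autocorrelation (a crude but common
    alternative to the spectral estimator of [24])."""
    q = np.asarray(q, dtype=float)
    n, qc = len(q), q - np.mean(q)
    var = qc @ qc / n
    tau = 1.0
    for k in range(1, max_lag or n // 2):
        rho = (qc[:n - k] @ qc[k:]) / ((n - k) * var)
        if rho <= 0.0:
            break
        tau += 2.0 * rho
    return tau

def effective_sample_size(q):
    return len(q) / iact(q)      # N_eff = N / tau_Q

# e.g. subsampling rate and burn-in length on level k: t_k = ceil(2 * iact(Q_k))
```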
Multilevel estimator. We can now use the independent samples Θ^n_{ℓ−1} ~ ν_{ℓ−1} produced by Algorithm 3 above in Algorithm 2 to produce samples θ^n_ℓ of the fine chain on level ℓ, and thus samples Y^n_ℓ := G(p_ℓ(θ^n_ℓ)) − G(p_{ℓ−1}(Θ^n_{ℓ−1})) for the estimator Ŷ^MC_{ℓ,N_ℓ} of E_{ν_ℓ}[Q_ℓ] − E_{ν_{ℓ−1}}[Q_{ℓ−1}] in (3.3). The samples for the estimator Q̂^MC_{0,N_0} on level 0 are produced with Algorithm 1, using again pCN proposals. This completes the definition of the multilevel MCMC estimator Q̂^ML_{L,{N_ℓ}} in (3.4). It only remains to decide on an optimal sample size N_ℓ on each level that will ensure that the total sampling error is below the prescribed tolerance and that the total cost of the estimator is minimised.

Let τ_ℓ be the integrated autocorrelation time of the chain {Y^n_ℓ} (resp. {Q^n_0}), for ℓ = 1, ..., L (resp. ℓ = 0), and let s²_ℓ be the sample variance on level ℓ. Then N^eff_ℓ := N_ℓ/τ_ℓ is the effective sample size on level ℓ, and s²_ℓ/N^eff_ℓ is an estimate of the variance of the estimator Ŷ^MC_{ℓ,N_ℓ}. Our aim is to achieve the following bound on the total sampling error for the multilevel MCMC estimator:
\[
\sum_{\ell=0}^{L} \frac{s_\ell^2}{N^{\rm eff}_\ell} \le \frac{\varepsilon^2}{2}, \tag{5.5}
\]
for some prescribed tolerance ε. In what follows, we will choose ε such that the bias error on level L is ε/√2, and thus the two contributions to the mean square error in (3.6) are balanced.

To decide on a cost-optimal strategy for the choice of the N_ℓ, we first need to discuss the cost per sample. Recall that C_ℓ denotes the cost to evaluate Q_ℓ for a single sample Θ_ℓ from the prior on level ℓ. However, to quantify the cost of the estimator Ŷ^MC_{ℓ,N_ℓ} on level ℓ, we also need to take all the samples in the auxiliary chains on the coarser levels in Algorithm 3 into account, as well as the integrated autocorrelation time τ_ℓ of the chain {Y^n_ℓ}. Recalling that t_k is the subsampling rate on level k in Algorithm 3 and that T_k = Π_{k′=k}^{ℓ−1} t_{k′}, the total cost to produce one independent (effective) sample is
\[
C^{\rm eff}_\ell := \lceil \tau_\ell \rceil \Bigl( C_\ell + \sum_{k=0}^{\ell-1} T_k C_k \Bigr). \tag{5.6}
\]
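For concreteness, the cost model (5.6) and the resulting cost-optimal sample allocation, stated in (5.7) below, can be implemented in a few lines. The following sketch assumes that estimates of the autocorrelation times tau[l], per-sample costs C[l], sample variances s2[l] and subsampling rates t[k] are given; all names are illustrative:

```python
import numpy as np

def effective_cost(tau, C, t):
    """Cost per independent sample, cf. (5.6):
    C_eff[l] = ceil(tau[l]) * (C[l] + sum_{k=0}^{l-1} T_k C[k]),
    with T_k = prod_{k'=k}^{l-1} t[k'] the number of level-k samples needed."""
    tau, C, t = map(np.asarray, (tau, C, t))
    c_eff = np.empty(len(C))
    for l in range(len(C)):
        T = np.cumprod(t[l - 1::-1])[::-1] if l > 0 else np.zeros(0)  # [T_0,...,T_{l-1}]
        c_eff[l] = np.ceil(tau[l]) * (C[l] + T @ C[:l])
    return c_eff

def optimal_samples(s2, c_eff, eps):
    """Cost-optimal effective sample sizes, cf. (5.7), subject to (5.5)."""
    s2, c_eff = np.asarray(s2), np.asarray(c_eff)
    return np.ceil(2.0 / eps**2 * np.sum(np.sqrt(s2 * c_eff))
                   * np.sqrt(s2 / c_eff)).astype(int)
```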
Figure 1: Left: Autocorrelation function for a typical (burnt-in) coarse level chain {Q^n_0}, with an integrated autocorrelation time of τ_0 ≈ 86. Right: Plot of E[Ŷ^MC_1] against subsampling rate t_0; the solid line shows the computed results, whilst the dashed lines give the two-sided 95% confidence interval.

As in the case of standard multilevel MC with i.i.d. samples, the total cost of the multilevel estimator is minimised, subject to the constraint (5.5), when the effective number of samples on each level satisfies
\[
N^{\rm eff}_\ell = \frac{2}{\varepsilon^2} \Bigl( \sum_{k=0}^{L} \sqrt{s_k^2\, C^{\rm eff}_k} \Bigr) \sqrt{\frac{s_\ell^2}{C^{\rm eff}_\ell}}, \tag{5.7}
\]
as described in [18, 9]. In practice, the optimal number of samples can be estimated adaptively, after an initial number of samples to get an estimate for s²_ℓ (see again [18, 9] for standard MLMC).

In all calculations which follow, we simultaneously run P parallel chains. This allows for an efficient parallelisation and aids exploration of multi-modal posterior distributions. Furthermore, the calculation of the total sampling error (5.5) is simplified. The parallel chains provide P independent estimates for Ŷ^MC_ℓ. Therefore, using standard statistical tools, the sampling error on each level can be calculated without the need for accurate estimates of the integrated autocorrelation times. For the implementation considered here, we chose P = 128 and distributed the computations across 128 processors.

We start with a two level test to investigate the additional bias created in Algorithm 2 due to the dependence of the coarse samples from the recursive subsampling procedure in Algorithm 3, and how that bias depends on the subsampling rate t_0. We choose two grids with m_0 = 8 and m_1 = 16 and fix the numbers of KL modes to be R_0 = R_1 = 20. The data is generated synthetically from a single random sample from the prior distribution computed on grid level 4, i.e. with m_4 = 128. The observations F_obs are taken to be the pressure values at 16 uniformly spaced points interior to the domain. The data fidelity σ_F is set to the same (small) value on both levels.

We first computed the autocorrelation function for a typical (burnt-in) coarse level chain {Q^n_0} (see Fig. 1(left)), and note that the integrated autocorrelation time is approximately τ_0 ≈ 86 in this case. We then ran Algorithms 2 and 3, for different subsampling rates from t_0 = 1 to 100 > τ_0, until the standard error for the estimator Ŷ^MC_1 reached a prescribed (small) tolerance ε. Fig. 1(right) shows the expected value E_Θ[Ŷ^MC_1] as a function of t_0, as well as the two-sided 95% confidence interval, i.e. E_Θ[Ŷ^MC_1] ± 1.96ε. We note that the difference E_{ν_1}[Q_1] − E_{ν_0}[Q_0] ≈ E[Q̂^MC_1] − E[Q̂^MC_0] is itself only of the order of the tolerance ε.
Figure 2: Left: Synthetic data used in Section 5.3. Right: Posterior sample created by our algorithm on grid level 4. For both plots, data points are marked by crosses.

We note that, for the example considered here, the additional bias error due to the dependence of the samples is less than 30% even if no subsampling is used (i.e. t_0 = 1). In practice, a value of t_0 = 50 would be sufficient to reduce the bias to a negligible amount. In the computations that follow, we nevertheless make the conservative choice t_ℓ = ⌈τ̃_ℓ⌉.

We now test the full MLMCMC algorithm, using the same coarsest grid with m_0 = 8 and considering up to five levels in our method, with a uniformly increasing number of KL modes across the levels from R_0 = 50 to R_4 = 150. As for the two level example, the data is generated synthetically from a single random sample from the prior distribution on level 4, see Fig. 2(left). We note that since R_4 = 150 here, the data differs slightly from that used in the two-level results in Sect. 5.2 (although we used the same random numbers for the first 20 KL modes). The fidelity parameter was again chosen to be level-independent, σ_{F,ℓ} = σ_F for all ℓ = 0, ..., 4. A typical sample from the posterior distribution on grid level 4, produced by our multilevel algorithm, is shown in Fig. 2(right).

We compare the performance of our new multilevel method to standard Metropolis-Hastings MCMC with pCN proposal distribution (again with tuning parameter β_ℓ = 0.1). The cost C_ℓ to compute one individual sample of Q_ℓ on level ℓ with our code is shown in actual CPU time in Fig. 3(left), obtained on a 2.4GHz Intel Core i7 processor. The cost in FreeFem++ is dominated by the assembly of the FE stiffness matrix, and so it grows like O(h_ℓ^{−2}) = O(M_ℓ). We believe that this behaviour is representative for problems of this size when the uniform grid structure is not exploited in the assembly process, and that these CPU times are competitive. For larger problem sizes, the cost of the linear solver will become the dominant part. However, for the MLMCMC algorithm we are really interested in the cost C^eff_ℓ, defined in (5.6), to compute one independent sample on level ℓ using Algorithms 2 and 3 with t_k = ⌈τ̃_k⌉. These times are shown in Fig. 3(right). They are compared to the cost to produce one independent sample on level ℓ using the standard MCMC Algorithm 1. The integrated autocorrelation times τ̃_ℓ for the auxiliary chains {Q̃^n_ℓ} on each level in our example are given in Tab. 2. Note that since the coarse samples are (essentially) independent, the integrated autocorrelation times τ_ℓ for the chains {Y^n_ℓ} are almost identical, i.e. τ_ℓ ≈ τ̃_ℓ.

Figure 3: Left: Cost (CPU time in seconds) to compute one sample of Q_h as a function of h, split into the evaluation of the KL expansion, the assembly and solve of the linear system, and the total time per sample. Right: Cost C^eff_ℓ per independent sample on level ℓ, for MLMCMC and for standard MCMC.

Table 2: Integrated autocorrelation times τ̃_ℓ of the auxiliary chains {Q̃^n_ℓ} on levels ℓ = 0, ..., L (the times decrease towards 1 with increasing ℓ; τ̃_4 = 1.23).

We now compare the total cost of the MLMCMC estimator, with L varying from 1 to 4, with standard MCMC on the same level. The tolerance ε_L for each of the cases is chosen such that the bias error is less than ε_L/√2, leading to tolerances that roughly halve with each added level (e.g. ε_3 = 0.013). The bias was observed to converge with O(h), which is faster than what we would expect for the functional in (5.2), which does not satisfy Assumption A2 (see [34]). It is likely that this is because the second term in (4.13), i.e. the bias error in the posterior distribution, dominates. That bias error is due to the FE approximation of pressure evaluations at points here, which are expected to converge with O(h log|h|) (see [35]). The slight variation in the convergence rate could mean that some features in the posterior were only picked up on a sufficiently fine grid. The optimal numbers N^eff_ℓ of (independent) samples on each level are chosen according to formula (5.7). They are plotted in Fig. 4(left). Please note that these are numbers of independent samples. The total number of samples computed on the coarser levels is much larger. For example, for the four level estimator, the auxiliary chains on level 0 combined required a very large number of actual PDE solves; however, each of these solves is about 250 times cheaper than a solve on level 4. Because τ_4 ≈ τ̃_4 = 1.23, we see from Fig. 4(left) that we need only about 562 PDE solves on level 4. These are huge savings against standard MCMC, which requires vastly more solves on level 4 to achieve the same sampling error. We can see this clearly in the overall cost comparison in Fig. 4(right). The gains are even more pronounced if we relax the overly conservative choice of t_k = ⌈τ̃_k⌉ for the subsampling rates.

In our final Fig. 5, we confirm our theoretical results and plot our estimates of V_{ν_{ℓ,ℓ−1}}[Y^n_ℓ] (left) and of E_ζ[1 − α^ℓ_ML(θ′_ℓ | θ^n_ℓ)] (right). Ignoring the last data point in each of the plots, which seem to be outliers, the variance seems to converge with almost O(h²), and the multilevel rejection probability slightly faster than O(h). We are not sure whether this means that the bounds in Lemma 4.6 and in Lemma 4.7 are both slightly pessimistic, or whether this is just some pre-asymptotic behaviour.

Figure 4: Left: Number of independent samples N^eff_ℓ on each level for four different tolerances. Right: Total cost (in seconds) for the multilevel and the single-level estimators plotted against tolerance ε.

Figure 5: Convergence plots for V_{ν_{ℓ,ℓ−1}}[Y^n_ℓ] (log variance of Q_ℓ and of Q_ℓ − Q_{ℓ−1} per level, left) and E_ζ[1 − α^ℓ_ML(θ′_ℓ | θ^n_ℓ)] (right).

Remark 5.1.
It is worth pointing out that the recursive independence sampling in Algorithm 3 also brings significant savings if used to produce proposals for a standard MCMC algorithm, as the comparison of the cost per independent sample in Fig. 3(right) clearly shows. This is related to the delayed acceptance method of [7]. The multilevel approach also provides a very efficient burn-in method, due to the significantly reduced integrated autocorrelation times on the finer levels and since most of the burn-in happens on the coarsest level. This is related to the approach in [15].
Bayesian inverse problems in large scale applications are often too costly to solve using conventional Metropolis-Hastings MCMC algorithms, due to the high dimension of the parameter space and the large cost of computing the likelihood. In this paper, we employed a hierarchy of computational models to define a novel multilevel version of a Metropolis-Hastings algorithm, leading to significant reductions in computational cost. The main idea underlying the cost reduction is to build estimators for the difference in the quantity of interest between two successive models in the hierarchy, rather than estimators for the quantity itself. The new algorithm was then analysed and implemented for a single-phase Darcy flow problem in groundwater modelling, confirming the effectiveness of the algorithm.

The algorithm presented in this paper is not reliant on the specific computational model underlying the simulations, and is generally applicable. The underlying computational model will in general influence the convergence rates α, α′, β and β′ of the discretisation errors, and the growth rate γ of the cost of the likelihood computation (cf. Theorem 3.4), which in turn govern the cost of the standard and multilevel Metropolis-Hastings algorithms. The gain to be expected from employing the multilevel algorithm is always significant, and it is in fact larger for more challenging model problems, where the values of α, α′, β and β′ are small and γ is large.

The algorithm also allows for the use of a variety of proposal distributions. The crucial result in this context is the convergence of the multilevel acceptance probability to 1 (cf. Lemma 4.7), which in general has to be verified for each proposal distribution individually, but is expected to hold for most proposal distributions.

Acknowledgement.
Big thanks go to Panayot Vassilevski, who initiated and financially supported this work during two visits of Scheichl and Teckentrup at Lawrence Livermore National Laboratory (LLNL), California. He was involved in most of the original discussions about this method. Christian Ketelsen was a postdoctoral researcher under his supervision at LLNL under Contract DE-AC52-07A27344 at the time. We would also like to particularly thank Finn Lindgren and Rob Jack for spotting an error in our original version of Lemma 3.1 and for helping us to find a fix.
References

[1] A. Barth, Ch. Schwab, and N. Zollinger. Multi-level Monte Carlo finite element method for elliptic PDEs with stochastic coefficients. Numer. Math., 119(1):123-161, 2011.
[2] A. Brandt, M. Galun, and D. Ron. Optimal multigrid algorithms for calculating thermodynamic limits. J. Stat. Phys., 74(1-2):313-348, 1994.
[3] A. Brandt and V. Ilyin. Multilevel Monte Carlo methods for studying large scale phenomena in fluids. J. Mol. Liq., 105(2-3):245-248, 2003.
[4] S.C. Brenner and L.R. Scott. The Mathematical Theory of Finite Element Methods, volume 15 of Texts in Applied Mathematics. Springer, third edition, 2008.
[5] J. Charrier. Strong and weak error estimates for the solutions of elliptic partial differential equations with random coefficients. SIAM J. Numer. Anal., 50(1):216-246, 2012.
[6] J. Charrier, R. Scheichl, and A.L. Teckentrup. Finite element error analysis of elliptic PDEs with random coefficients and its application to multilevel Monte Carlo methods. SIAM J. Numer. Anal., 51(1):322-352, 2013.
[7] J.A. Christen and C. Fox. MCMC using an approximation. J. Comput. Graph. Stat., 14(4):795-810, 2005.
[8] P.G. Ciarlet. The Finite Element Method for Elliptic Problems. North-Holland, 1978.
[9] K.A. Cliffe, M.B. Giles, R. Scheichl, and A.L. Teckentrup. Multilevel Monte Carlo methods and applications to elliptic PDEs with random coefficients. Comput. Vis. Sci., 14:3-15, 2011.
[10] K.A. Cliffe, I.G. Graham, R. Scheichl, and L. Stals. Parallel computation of flow in heterogeneous media using mixed finite elements. J. Comput. Phys., 164:258-282, 2000.
[11] S.L. Cotter, M. Dashti, and A.M. Stuart. Variational data assimilation using targetted random walks. Int. J. Numer. Meth. Fluids, 68:403-421, 2012.
[12] M. Dashti and A. Stuart. Uncertainty quantification and weak approximation of an elliptic inverse problem. SIAM J. Numer. Anal., 49(6):2524-2542, 2011.
[13] T.A. Davis. Algorithm 832: UMFPACK v4.3, an unsymmetric-pattern multifrontal method. ACM Transactions on Mathematical Software (TOMS), 30(2):196-199, 2004.
[14] G. de Marsily. Quantitative Hydrogeology. Academic Press, 1986.
[15] Y. Efendiev, T. Hou, and W. Lou. Preconditioning Markov chain Monte Carlo simulations using coarse-scale models. Water Resour. Res., pages 1-10, 2005.
[16] M.A.R. Ferreira, Z. Bi, M. West, H. Lee, and D. Higdon. Multi-scale modelling of 1-D permeability fields. In Bayesian Statistics 7, pages 519-527. Oxford University Press, 2003.
[17] R.G. Ghanem and P.D. Spanos. Stochastic Finite Elements: A Spectral Approach. Springer, New York, 1991.
[18] M.B. Giles. Multilevel Monte Carlo path simulation. Oper. Res., 56(3):607-617, 2008.
[19] C.J. Gittelson, J. Könnö, Ch. Schwab, and R. Stenberg. The multilevel Monte Carlo finite element method for a stochastic Brinkman problem. Numer. Math., 125:347-386, 2013.
[20] I.G. Graham, R. Scheichl, and E. Ullmann. Mixed finite element analysis of lognormal diffusion and multilevel Monte Carlo methods. Stoch. PDE Anal. Comp., pages 1-35, published online June 12, 2015.
[21] M. Hairer, A.M. Stuart, and S.J. Vollmer. Spectral gaps for a Metropolis-Hastings algorithm in infinite dimensions. Ann. Appl. Probab., 24(6):2455-2490, 2014.
[22] W.K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97-109, 1970.
[23] F. Hecht. New developments in FreeFem++. J. Numer. Math., 20(3-4):251-265, 2012.
[24] P. Heidelberger and P.D. Welch. A spectral method for confidence interval generation and run length control in simulations. Communications of the ACM, 24(4):233-245, 1981.
[25] S. Heinrich. Multilevel Monte Carlo methods. In volume 2179 of Lecture Notes in Comput. Sci., pages 3624-3651. Springer, 2001.
[26] V.H. Hoang, Ch. Schwab, and A.M. Stuart. Complexity analysis of accelerated MCMC methods for Bayesian inversion. Inverse Probl., 29(8):085010, 2013.
[27] R.J. Hoeksema and P.K. Kitanidis. Analysis of the spatial structure of properties of selected aquifers. Water Resour. Res., 21:536-572, 1985.
[28] N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller, and E. Teller. Equation of state calculations by fast computing machines. The J. of Chemical Physics, 21:1087, 1953.
[29] G. Da Prato and J. Zabczyk. Stochastic Equations in Infinite Dimensions, volume 44 of Encyclopedia Math. Appl. Cambridge University Press, Cambridge, 1992.
[30] C. Robert and G. Casella. Monte Carlo Statistical Methods. Springer, 1999.
[31] D. Rudolf. Explicit error bounds for Markov chain Monte Carlo. PhD thesis, Friedrich-Schiller-Universität Jena, 2011. Available at http://arxiv.org/abs/1108.3201.
[32] A.M. Stuart. Inverse problems, volume 19 of Acta Num., pages 451-559. Cambridge University Press, 2010.
[33] A.L. Teckentrup. Multilevel Monte Carlo methods for highly heterogeneous media. In Proceedings of the Winter Simulation Conference 2012, Article Nr. 32, 2012. Available at http://informs-sim.org.
[34] A.L. Teckentrup, R. Scheichl, M.B. Giles, and E. Ullmann. Further analysis of multilevel Monte Carlo methods for elliptic PDEs with random coefficients. Numer. Math., 125(3):569-600, 2013.
[35] A.L. Teckentrup. Multilevel Monte Carlo methods and uncertainty quantification. PhD thesis, University of Bath, 2013. Available at http://people.bath.ac.uk/masrs/Teckentrup PhD.pdf.