Accelerating Uncertainty Quantification of Groundwater Flow Modelling Using a Deep Neural Network Proxy
Mikkel B. Lykkegaard*, Tim J. Dodwell, and David Moxey
CEMPS, University of Exeter, UK. The Alan Turing Institute, London, NW1 2DB, UK.
July 2, 2020
Abstract
Quantifying the uncertainty in model parameters and output is a critical component in model-driven decision support systems for groundwater management. This paper presents a novel algorithmic approach which fuses Markov Chain Monte Carlo (MCMC) and Machine Learning methods to accelerate uncertainty quantification for groundwater flow models. We formulate the governing mathematical model as a Bayesian inverse problem, considering model parameters as a random process with an underlying probability distribution. MCMC allows us to sample from this distribution, but it comes with some limitations: it can be prohibitively expensive when dealing with costly likelihood functions, subsequent samples are often highly correlated, and the standard Metropolis-Hastings algorithm suffers from the curse of dimensionality. This paper designs a Metropolis-Hastings proposal which exploits a deep neural network (DNN) approximation of the model to significantly accelerate the Bayesian computations. We modify a delayed acceptance (DA) model hierarchy, whereby proposals are generated by running short subchains using an inexpensive DNN approximation, resulting in a decorrelation of subsequent fine model proposals. Using a simple adaptive error model, we estimate and correct the bias of the DNN approximation with respect to the posterior distribution on-the-fly. The approach is tested on two synthetic examples: an isotropic two-dimensional problem, and an anisotropic three-dimensional problem. The results show that the cost of uncertainty quantification can be reduced by up to 75% compared to single-level MCMC, depending on the precomputation cost and accuracy of the employed DNN.
Modelling of groundwater flow and transport is an important decision support tool when, for example, estimating the sustainable yield of an aquifer or remediating groundwater pollution. However, the input parameters for mathematical models of groundwater flow (such as subsurface transmissivity and boundary conditions) are often impossible to determine fully or accurately, and are hence subject to various uncertainties. In order to make informed decisions, it is of critical importance to decision makers to obtain robust and unbiased estimates of the total model uncertainty, which in turn is a product of the uncertainty of these input parameters [1]. A popular way to achieve this, in relation to groundwater flow or any inverse problem in general, is stochastic or Bayesian modelling [2, 3, 4]. In this context, a probability distribution, the prior, is assigned to the input parameters, in accordance with any readily available information. Given some real-world measurements corresponding to the model outputs (e.g. sparse spatial measurements of hydraulic head, Darcy flow or concentration of pollutants), it is possible to reduce the overall uncertainty and obtain a better representation of the model by conditioning the prior distribution on this data. The result is a distribution of the model input parameters given data, which is also referred to as the posterior.

* Address for correspondence: Mikkel Bue Lykkegaard, College of Engineering, Mathematics and Physical Sciences, Harrison Building, University of Exeter, EX4 4QF, UK, E-mail: [email protected]

Obtaining samples from the posterior distribution directly is not possible for all but the simplest of problems. A popular approach for generating samples is the Metropolis-Hastings type Markov Chain Monte Carlo (MCMC) method [5]. Samples are generated by a sequential process. First, given a current sample, a new proposal for the input parameters is made using a so-called proposal distribution. Evaluating the model with this new set of parameters, a likelihood is computed: a measure of misfit between the model outputs and the data. The likelihoods of the proposed and current samples are then compared. Based on this comparison, the proposal is either accepted or rejected, and the whole process is repeated, generating a Markov chain of probabilistically feasible input parameters. The key point is that the distribution of samples in the chain converges to the posterior, the distribution of input parameters given the data [5]. This relatively simple algorithm can lead to extremely expensive Bayesian computations for three key reasons. First, each step of the chain requires the evaluation of an (often expensive) mathematical model. Second, the sequential nature of the algorithm means subsequent samples are often highly correlated, and even repeated if a step is rejected. Therefore the chains must often be very long to obtain good statistics on the distribution of outputs of the model. Third, without special care, the approach does not generally scale well to large numbers of uncertain input parameters; the so-called curse of dimensionality. Addressing these scientific challenges is at the heart of modern research in MCMC algorithms.
As with this paper, there is a particular focus on developing novel and innovative proposal distributions, which seek to de-correlate adjacent samples and limit the computational burden of evaluating expensive models.

Broadly in the literature, simple Darcy type models and other variants of the diffusion equation have long been popular toy example problems for demonstrating MCMC methodologies in the applied mathematics community (see e.g. [6, 7, 8]). There appears to be much less interest in MCMC in the applied groundwater modelling community. This may be because of the computational cost of running MCMC on highly parametrised, expensive models, or the lack of an easy-to-use MCMC software framework, akin to the parameter estimation toolbox PEST [9].

An exciting approach to significantly reduce the computational cost has been proposed in multi-level, multi-fidelity and Delayed Acceptance (DA) MCMC methods. In each case, to alleviate computational cost, a hierarchy of models is established, consisting of a fine model and (possibly multiple) coarse, computationally cheap approximations. Typically, the coarser models are finite element solutions of the PDE on a mesh with a coarser resolution, but as we show in this paper, they can be taken to be any general approximation, similar to the multi-fidelity philosophy [10]. Independent of the approach, the central idea is the same: to obtain significant efficiency gains by exploiting approximate coarse models to generate ‘good’ proposals cheaply, using additional accept/reject steps to filter out highly unlikely proposals before evaluating the fine, expensive model. Previous studies of two-stage approaches include [11], who modelled multi-phase flow with coarse level proposals evaluated by a coarse-mesh single-phase flow model (an idea that was developed further in [12]), as well as [13] and [14]. We note that the latter, instead of simply using a coarser discretisation, implemented a data-driven polynomial chaos expansion as a surrogate model. We intend to demonstrate how the development of novel techniques in MCMC and machine learning can be combined to help realise the potential of MCMC in this field.

In this work, we propose a combination of multiple cutting-edge MCMC techniques to allow for efficient inversion and uncertainty quantification of groundwater flow. We propose an improved delayed acceptance (DA) MCMC algorithm, adapted from the approach proposed by [15]. In our case, similarly to multi-level MCMC [7], proposals are generated by computing a subchain using a Deep Neural Network (DNN) as an approximate model, leading to cheaply computed, decorrelated proposals passed on to the fine model. For our first example, the subchain is driven by the preconditioned Crank-Nicolson (pCN) proposal distribution [16] to ensure the proposed Metropolis-Hastings algorithm is robust with respect to the dimension of the uncertain parameter space. For our second example, proposals for the subchains are generated using the Adaptive Metropolis (AM) proposal [17], since the posterior distribution in this case is highly non-spherical and multiple parameters are correlated. Finally, we propose an enhanced error model, in which the DNN is trained by sampling the prior distribution, yet the bias of the approximation is adaptively estimated and corrected on-the-fly by testing the approximations against the full model in an adaptive delayed acceptance setting [18].
Preliminaries
In this section we briefly introduce the forward model, defining the governing equations underpinning groundwater flow and their corresponding weak form, enabling us to solve the equations using FEM methods. We then formulate our model as a Bayesian inverse problem with random input parameters, effectively resulting in a stochastic model, which can be accurately characterised by sampling from the posterior distribution of parameters using MCMC. The simple Metropolis-Hastings MCMC algorithm is then introduced and extended with the preconditioned Crank-Nicolson (pCN) and Adaptive Metropolis (AM) transition kernels.
Consider steady groundwater flow in a confined, inhomogeneous aquifer which occupies the domain Ω with boundary Γ. Assuming that water is incompressible, the governing equations for groundwater flow can be written as the scalar elliptic partial differential equation

∇ · (−T(x) ∇h(x)) = g(x)  for all x ∈ Ω,   (1)

subject to boundary conditions on Γ = Γ_N ∪ Γ_D defined by the constraint equations

h(x) = h_D(x) on Γ_D  and  (−T(x) ∇h(x)) · n = q_N(x) on Γ_N.   (2)

Here T(x) is the heterogeneous, depth-integrated transmissivity, h(x) is hydraulic head, h_D(x) is the fixed hydraulic head at boundaries with Dirichlet constraints, g(x) represents fluid sources and sinks, q(x) is the Darcy velocity, q_N(x) is the Darcy velocity across boundaries with Neumann constraints, and Γ_D ⊂ ∂Ω and Γ_N ⊂ ∂Ω define the boundaries comprising Dirichlet and Neumann conditions, respectively. Following standard FEM practice (see e.g. [19]), eq. (1) is converted into weak form by multiplying by an appropriate test function w ∈ H¹(Ω) and integrating by parts, so that

∫_Ω ∇w · (T(x) ∇h) dx + ∫_{Γ_N} w q_N(x) ds = ∫_Ω w g(x) dx,  for all w ∈ H¹(Ω),   (3)

where H¹(Ω) is the Hilbert space of weakly differentiable functions on Ω. To approximate the hydraulic head solution h(x), we introduce a finite element space V_τ ⊂ H¹(Ω) on a finite element mesh Q_τ(Ω). This is defined by a basis of piecewise linear Lagrange polynomials {φ_i(x)}, i = 1, ..., M, associated with each of the M finite element nodes. As a result, (3) can be rewritten as a system of sparse linear equations A h = b, where

A_ij = ∫_Ω ∇φ_i(x) · T(x) ∇φ_j(x) dx  and   (4)
b_i = ∫_Ω φ_i(x) g(x) dx − ∫_{Γ_N} φ_i(x) q_N(x) ds,   (5)

where A ∈ R^{M×M} and b ∈ R^M are the global stiffness matrix and load vector, respectively. The vector h := [h_1, h_2, ..., h_M] ∈ R^M is the solution vector of hydraulic head at each node within the finite element mesh, so that h(x) = Σ_{i=1}^M h_i φ_i(x). In our numerical experiments, these equations are solved using the open source general-purpose FEM framework FEniCS [20]. While there are well-established groundwater simulation software packages available, such as MODFLOW [21] and FEFLOW [19], FEniCS was chosen because of its flexibility and ease of integration with other software and analysis codes.
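To make the discretisation concrete, the following is a minimal sketch of how a problem of this form can be set up in the legacy FEniCS (dolfin) Python interface. It is illustrative only: the mesh, the transmissivity expression and the boundary markers below are placeholder choices of ours, not the configuration used in the numerical experiments later in the paper.

```python
# Minimal, illustrative sketch of the weak form (3) in legacy FEniCS (not the authors' code).
from fenics import (UnitSquareMesh, FunctionSpace, TrialFunction, TestFunction,
                    DirichletBC, Constant, Expression, Function, dot, grad, dx, solve)

mesh = UnitSquareMesh(32, 32)                       # placeholder mesh
V = FunctionSpace(mesh, "P", 1)                     # piecewise linear Lagrange basis

T = Expression("exp(sin(3*x[0])*cos(3*x[1]))", degree=2)   # placeholder transmissivity T(x)
g = Constant(0.0)                                   # no sources or sinks

h = TrialFunction(V)
w = TestFunction(V)
a = dot(T * grad(h), grad(w)) * dx                  # bilinear form, left-hand side of (3)
L = g * w * dx                                      # a nonzero q_N would add a ds-term here

# Dirichlet conditions on the left and right edges; remaining boundaries default to no-flow.
bcs = [DirichletBC(V, Constant(1.0), "on_boundary && near(x[0], 0.0)"),
       DirichletBC(V, Constant(0.0), "on_boundary && near(x[0], 1.0)")]

h_sol = Function(V)
solve(a == L, h_sol, bcs)                           # assembles A h = b and solves the sparse system
```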
The aquifer transmissivity T(x) is not known everywhere on the domain, and a typical approach is therefore to model it as a log-Gaussian random field. There exists extensive literature on modelling groundwater flow transmissivity using log-Gaussian random fields (see e.g. [22, 23, 14]). This may not always prove a good model, particularly in cases with highly correlated extreme values and/or preferential flow paths [24, 25], as seen when considering faults and other discontinuities [26, 27]. However, the log-Gaussian distribution remains relevant for modelling transmissivity in a range of aquifers [28, 29, 14].

Our starting point is a covariance operator with kernel C(x, y), which defines the correlation structure of the uncertain transmissivity field. For our numerical experiments, we consider the ARD (Automatic Relevance Determination) squared exponential kernel, a generalisation of the ‘classic’ squared exponential kernel, which allows for handling directional anisotropy:

C(x, y) = exp( − Σ_{j=1}^d ((x_j − y_j) / l_j)² ),   (6)

where d is the spatial dimensionality of the problem and l ∈ R^d is a vector of length scales corresponding to each spatial dimension. We emphasise that the covariance kernel is a modelling choice, and that different options are available, such as the Matérn kernel, which offers additional control over the smoothness of the field.

In our work, transmissivity was modelled as a discrete log-Gaussian random field expanded in an orthogonal eigenbasis with k Karhunen-Loève (KL) eigenmodes. To achieve this we construct a covariance matrix C ∈ R^{M×M}, where entries are given by C_ij = C(x_i, x_j) for each pair of nodal coordinates within the finite element mesh, i, j = 1, ..., M. Once constructed, the largest k eigenvalues {λ_i} and associated eigenvectors {ψ_i}, i = 1, ..., k, of C can be computed. The transmissivity at the nodes, t := [t_1, t_2, ..., t_M], is given by

log t = µ + σ Ψ Λ^{1/2} θ,  where Λ = diag(λ_1, λ_2, ..., λ_k) and Ψ = [ψ_1, ψ_2, ..., ψ_k],   (7)

the vector µ defines the log of the mean transmissivity field, σ is a scalar parameterising the variance, and θ is a vector of Gaussian random variables such that θ ∼ N(0, I_k), as in [30]. The random field can be interpolated from nodal values across Ω using the shape functions {φ_i(x)}, so that T(x) = Σ_{i=1}^M t_i φ_i(x).

Truncating the KL eigenmodes at the k-th mode and interpolating the field both have a smoothing effect on the recovered transmissivity fields, which may or may not be desirable, depending on the application. Figure 1 shows some examples of realisations of Gaussian random fields with a squared exponential kernel, which illustrates the effect of the covariance length scale l and the number of admitted KL eigenmodes k.

To set up the Bayesian inverse problem and thereby quantify the uncertainty in the transmissivity field T(x), the starting point is to define a statistical model which describes the distribution of the mismatch between observations and model predictions. The observations are expressed in a single vector d_obs ∈ R^m and, for a given set of model input parameters θ, the model’s prediction of the data is defined by the forward map F(θ): R^k → R^m.
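A minimal sketch of the field parameterisation (6)-(7), using plain NumPy and assuming the nodal coordinates are available as an array, is given below; the function names and the toy node set are ours and purely illustrative.

```python
# Illustrative sketch (not the authors' code) of the KL parameterisation of log T at the nodes.
import numpy as np

def kl_basis(nodes, lengthscales, k):
    """Leading k eigenpairs of the ARD squared-exponential covariance matrix (6)."""
    diff = nodes[:, None, :] - nodes[None, :, :]           # pairwise differences, shape (M, M, d)
    C = np.exp(-np.sum((diff / lengthscales) ** 2, axis=-1))
    lam, psi = np.linalg.eigh(C)                           # eigenvalues in ascending order
    return lam[-k:][::-1], psi[:, -k:][:, ::-1]            # largest k modes first

def log_transmissivity(theta, lam, psi, mu=0.0, sigma=1.0):
    """Evaluate log t = mu + sigma * Psi * Lambda^(1/2) * theta at the mesh nodes, as in (7)."""
    return mu + sigma * psi @ (np.sqrt(lam) * theta)

# Toy usage with k = 8 modes on randomly placed nodes:
nodes = np.random.rand(200, 2)
lam, psi = kl_basis(nodes, lengthscales=np.array([0.1, 0.1]), k=8)
theta = np.random.standard_normal(8)                       # theta ~ N(0, I_k)
log_t = log_transmissivity(theta, lam, psi)
```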
The statistical model assumes the connection between model and observations through the relationship

d_obs = F(θ) + ε,   (8)

where we take ε ∼ N(0, Σ_ε), which represents the uncertainty in the connection between model and data, capturing both model mis-specification and measurement noise as sources of this uncertainty. The backbone of a Bayesian approach is Bayes’ theorem, which allows for computing posterior beliefs of model parameters using both prior beliefs and observations. Bayes’ theorem states that the posterior probability of a parameter realisation θ given data d_obs can be computed as

π(θ | d_obs) = π(θ) L(d_obs | θ) / π(d_obs),   (9)

where π(θ | d_obs) is referred to as the posterior distribution, L(d_obs | θ) is called the likelihood, π(θ) the prior distribution, and

π(d_obs) = ∫_Θ π(d_obs | θ) dθ = ∫_Θ π(θ) L(d_obs | θ) dθ   (10)

is a normalising constant, sometimes referred to as the evidence. In most cases this integral does not have a closed-form solution and is infeasible to estimate numerically in most real-world applications, particularly when the dimension of the unknown parameter space is large and the evaluation of the model (required to compute L(d_obs | θ)) is computationally expensive. A family of methods called Markov Chain Monte Carlo (MCMC) are often employed to approximate the solution [31]. Importantly, MCMC, whilst computationally expensive, allows indirect sampling from the posterior distribution and avoids the explicit need to estimate (10). Moreover, it can be designed to be independent of the dimension of the parameter space and has no embedded unquantifiable bias.

Figure 1: A selection of Gaussian random process realisations with a squared exponential kernel using different covariance length scales l (0.1, 0.05, 0.01) and numbers of KL eigenmodes k. All displayed realisations were generated using the same appropriately truncated random vector ξ with identical eigenvectors for each l.

In this paper we consider a subclass of MCMC methods called the Metropolis-Hastings algorithm [32, 33, 5], which is described in Algorithm 1. The algorithm generates a Markov chain {θ^(n)}, n ∈ N, with a distribution converging to π(θ | d_obs). It is difficult (often impossible) to sample directly from the posterior, hence at each step, at position θ^(i) in the chain, a proposal θ′ is made from a simpler known (proposal) distribution q(θ′ | θ^(i)). An accept/reject step then determines whether the proposal comes from (probabilistically) the posterior distribution or not. This accept/reject step is achieved by essentially computing the ratio of the densities of the current state and the proposal. To do this we exploit Bayes’ theorem. The key observation in MCMC is that the normalising constant π(d_obs) is independent of θ, and so

π(θ | d_obs) ∝ π(θ) L(d_obs | θ).   (11)

Therefore, when comparing the ratio of the densities, the normalising constant (being independent of θ) cancels.

Algorithm 1: Metropolis-Hastings Algorithm
1. Given a parameter realisation θ^(i) and a transition kernel q(θ′ | θ^(i)), generate a proposal θ′.

2. Compute the likelihood ratio between the proposal and the previous realisation:

α = min{ 1, [π(θ′) L(d_obs | θ′) q(θ^(i) | θ′)] / [π(θ^(i)) L(d_obs | θ^(i)) q(θ′ | θ^(i))] }
3. If u ∼ U(0, 1) > α, set θ^(i+1) = θ^(i); otherwise, set θ^(i+1) = θ′.

In our model problem, the prior density of the parameters π(θ) represents the available a priori knowledge about the transmissivity of the aquifer. From our statistical model (8) we see that d_obs − F(θ) ∼ N(0, Σ_ε), hence

L(d_obs | θ) = exp( −(1/2) (F(θ) − d_obs)^⊤ Σ_ε^{-1} (F(θ) − d_obs) ).   (12)
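For reference, a minimal sketch of evaluating the corresponding log-likelihood is shown below; `forward` is a placeholder of ours for the forward map F(θ), e.g. a call to the FEM solver.

```python
# Illustrative sketch: Gaussian log-likelihood corresponding to (12).
import numpy as np

def log_likelihood(theta, d_obs, forward, cov_eps):
    """Return log L(d_obs | theta) up to an additive constant."""
    misfit = forward(theta) - d_obs
    return -0.5 * misfit @ np.linalg.solve(cov_eps, misfit)
```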
Importantly, we note that for each step of the Metropolis-Hastings algorithm we are required to compute L(d_obs | θ). This requires the evaluation of the forward map F(θ), which can be computationally expensive. Moreover, due to the sequential nature of MCMC-based approaches, consecutive samples are correlated and hence many samples are required to obtain good statistics on the outputs.

The proposal distribution q(θ′ | θ^(n)) is the key element which drives the Metropolis-Hastings algorithm and controls its effectiveness. A common choice is a simple random walk, for which q_RW(θ′ | θ^(i)) = N(θ^(i), Σ), yet as shown in [34], the basic random walk does not lead to a convergence that is independent of the input dimension. Better choices are the preconditioned Crank-Nicolson proposal (pCN, [16]), which has dimension-independent acceptance probability, or the Adaptive Metropolis algorithm (AM, [17]), which adaptively aligns the proposal distribution to the posterior during sampling. Moreover, unlike the Metropolis-Adjusted Langevin Algorithm (MALA), the No-U-Turn Sampler (NUTS) and Hamiltonian Monte Carlo, none of these proposals rely on gradient information, which can be infeasible to compute for expensive forward models.

To generate a proposal using the pCN transition kernel, one computes

θ′ = √(1 − β²) θ^(i) + β ξ,   (13)

where ξ is a random sample from the prior distribution, ξ ∼ N(0, Σ). This expression corresponds to the transition kernel q_pCN(θ′ | θ^(i)) = N(√(1 − β²) θ^(i), β² Σ). Moreover, for the pCN transition kernel, the acceptance probability simplifies to

α = min{ 1, L(d_obs | θ′) / L(d_obs | θ^(i)) },

following the identity

π(θ^(i)) / π(θ′) = q_pCN(θ^(i) | θ′) / q_pCN(θ′ | θ^(i)),   (14)

as given in [7]. Additional details of the derivation of the pCN proposal are provided in Appendix A.

Similarly, to generate a proposal using the AM transition kernel, we draw a random sample

θ′ ∼ N(θ^(i), Σ^(i)),   (15)

where Σ^(i) is an iteratively updated covariance structure:

Σ^(i) = Σ^(0)  if i ≤ i_0,  and  Σ^(i) = s_d Cov(θ^(0), θ^(1), ..., θ^(i)) + s_d γ I_d  otherwise.

A fixed covariance Σ^(0) is used for a given initial period i_0, after which adaptivity is ‘switched on’ and used for the remaining samples. The adaptive covariance Σ^(i) = s_d Cov(θ^(0), θ^(1), ..., θ^(i)) + s_d γ I_d can be constructed iteratively during sampling using the following recursive formula:

Σ^(i+1) = ((i − 1)/i) Σ^(i) + (s_d/i) ( i θ̄^(i−1) θ̄^(i−1)⊤ − (i + 1) θ̄^(i) θ̄^(i)⊤ + θ^(i) θ^(i)⊤ + γ I_d ),   (16)

where θ̄ denotes the arithmetic mean, s_d = 2.4²/d is a scaling parameter, d is the dimension of the proposal distribution and γ is a parameter which prevents Σ^(i) from becoming singular [17]. This corresponds to the transition kernel q_AM(θ′ | θ^(0), θ^(1), ..., θ^(i)) = N(θ^(i), Σ^(i)), which is not guaranteed to be ergodic, since it depends on the history of the chain. However, the Diminishing Adaptation condition [35] holds, as adaptation will naturally decrease as sampling progresses.
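The two kernels can be sketched in a few lines of code. The helper functions below are illustrative (our naming), assuming a standard normal prior for the pCN step and implementing one step of the recursive covariance update (16) for AM.

```python
# Illustrative sketches of the pCN proposal (13) and the AM covariance update (16).
import numpy as np

def pcn_proposal(theta, beta, rng):
    """Preconditioned Crank-Nicolson step for a N(0, I_k) prior."""
    xi = rng.standard_normal(theta.shape)              # sample from the prior
    return np.sqrt(1.0 - beta ** 2) * theta + beta * xi

def am_covariance_update(cov, mean_prev, mean_curr, theta_curr, i, d, gamma=1e-8):
    """One step of the recursive AM covariance update, valid for i >= 1."""
    s_d = 2.4 ** 2 / d
    outer = (i * np.outer(mean_prev, mean_prev)
             - (i + 1) * np.outer(mean_curr, mean_curr)
             + np.outer(theta_curr, theta_curr)
             + gamma * np.eye(d))
    return (i - 1) / i * cov + s_d / i * outer
```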
The approximate/surrogate model in our experiments is a feed-forward deep neural network (DNN), a type of artificial neural network with multiple hidden layers, as implemented in the open-source neural-network library Keras [36], utilising the Theano backend [37]. The DNN approximates the forward map, accepting a vector of KL coefficients θ ∈ R^k and returning an approximation of the vector of model outputs F̂(θ) ∈ R^m; in this paper, this is a vector of hydraulic heads at given sampling points, i.e. F̂(θ): R^k → R^m. Figure 2 shows the graph of one particular DNN employed in our experiments.
Figure 2: Graph showing the structure of a feed-forward DNN.

Each edge in Figure 2 is equipped with a weight w^l_{i,j}, where l is the index of the layer that the weight feeds into, i is the index of the node in that layer and j is the index of the node in the previous layer. These weights can be arranged in matrices W^l for each layer l. Similarly, each node is equipped with a bias b^l_i, where l is the index of its layer and i is the index of the node, and these biases can be arranged in vectors b^l. Data is propagated through the network such that the output y^l of a layer l with activation function A^l(·) is

y^l = A^l( b^l + W^l y^{l−1} ).   (17)

Activation functions A(·) are applied element-wise on their input vectors x, so that A(x) = (A(x_1), A(x_2), ..., A(x_n))^⊤. Many different activation functions are available for artificial neural networks, and we here give a short description of the ones employed in our experiments: the sigmoid and the rectified linear unit (‘ReLU’). The transfer function of the nodes in the first layer of each DNN was of the sigmoid type,

S(x) = 1 / (1 + e^{−x}),   (18)

squashing the input into the interval (0, 1). The remaining hidden layers used the de facto standard hidden layer activation function for deep neural networks, the rectified linear unit (‘ReLU’):

R(x) = x if x > 0, and 0 otherwise.

To fit an artificial neural network to a given set of data, the network is initially compiled using random weights and biases and then trained using a dataset of known inputs and their corresponding outputs. The weights and biases are updated iteratively during training by way of an appropriate optimisation algorithm and a loss function and, if appropriately set up, will converge towards a set of optimal values, allowing the MLP to predict the response of the forward model to some level of accuracy [38]. Our particular DNNs were trained using the mean squared error (MSE) loss function,

MSE = (1/m) Σ_{i=1}^m (h_i − ĥ_i)²,

for m output variables, and the RMSprop optimiser, a stochastic, gradient-based and adaptive algorithm suggested by [39] and widely used for training DNNs.
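As an illustration, a surrogate of this kind can be assembled in a few lines. The sketch below is not the authors' code: it uses the modern tf.keras API rather than the original Keras/Theano setup, and the 4k-8k-4k layer widths follow the pattern of DNN1 in Table 1 below.

```python
# Illustrative sketch of a surrogate network with sigmoid/ReLU layers, MSE loss and RMSprop.
from tensorflow import keras

def build_surrogate(k, m):
    """k KL coefficients in, m hydraulic-head predictions out."""
    model = keras.Sequential([
        keras.layers.Dense(4 * k, activation="sigmoid", input_shape=(k,)),
        keras.layers.Dense(8 * k, activation="relu"),
        keras.layers.Dense(4 * k, activation="relu"),
        # An "exponential" output activation (as in DNN1/DNN3) would guarantee positive heads.
        keras.layers.Dense(m, activation="linear"),
    ])
    model.compile(optimizer="rmsprop", loss="mse")
    return model

# Training on precomputed (theta, head) pairs, mirroring the settings reported in the paper:
# model = build_surrogate(k=64, m=25)
# model.fit(theta_train, heads_train, epochs=200, batch_size=50, validation_split=0.1)
```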
In this section we describe a modified adaptive delayed acceptance proposal for MCMC, using ideas from multi-level MCMC [7]. The general approach generates proposals by running Markov subchains driven by an approximate model. In our case this approximation is constructed from a DNN of the forward map F(θ), trained from offline samples of the prior distribution. Finally, we show how the approximate map can be corrected online, by adaptively learning a simple multivariate Gaussian correction to the outputs of the neural network.

Delayed Acceptance (DA) [15] is a technique that exploits a model hierarchy consisting of an expensive fine model and a relatively inexpensive coarse approximation. The idea is simple: a proposal is first evaluated (pre-screened) by an approximate model and immediately discarded if it is rejected. Only if accepted is it subjected to a second accept/reject step using the fine model. In this context, the likelihood of observations given a parameter set is henceforth denoted L̂(d_obs | θ) when evaluated on the approximate model and remains L(d_obs | θ) when evaluated on the fine model. This simple screening mechanism cheaply filters out poor proposals, wasting minimal time evaluating unlikely proposals on the expensive, fine model. In this paper we extend this approach by not evaluating every accepted approximation proposal with the fine model. Instead, a proposal for the fine model is generated by running an approximate subchain until t approximate proposals have been accepted, and only then evaluating it using the fine model. This modified Delayed Acceptance MCMC algorithm is described in Algorithm 2, and an illustration of the process is given in Figure 3.

Algorithm 2: Modified Delayed Acceptance MCMC
1. Given a realisation of the approximation parameters θ̂^(j) and the transition kernel q(θ̂′ | θ̂^(j)), generate a proposal θ̂′ for the approximation.

2. Compute the likelihood ratio on the approximate model between the proposal and the previous realisation:

α = min{ 1, [π(θ̂′) L̂(d_obs | θ̂′)] / [π(θ̂^(j)) L̂(d_obs | θ̂^(j))] }  (AM)
α = min{ 1, L̂(d_obs | θ̂′) / L̂(d_obs | θ̂^(j)) }  (pCN)

3. If u ∼ U(0, 1) > α, set θ̂^(j+1) = θ̂^(j) and return to (1); otherwise set θ̂^(j+1) = θ̂′ and continue to (4).

4. If t proposals have been accepted in the approximation subchain, continue to (5); otherwise return to (1).

5. Given the latest realisation of the entire parameter set θ^(i) = [θ̂^(i), θ̃^(i)] with fine parameters θ̃^(i) and the transition kernel q(θ̃′ | θ̃^(i)), generate a proposal θ̃′ for the fine parameters and set θ′ := [θ̂′, θ̃′].

6. Compute the likelihood ratio on the fine model between the proposal and the previous realisation:

α = min{ 1, [π(θ′) L(d_obs | θ′) π(θ̂^(i)) L̂(d_obs | θ̂^(i))] / [π(θ^(i)) L(d_obs | θ^(i)) π(θ̂′) L̂(d_obs | θ̂′)] }  (AM)
α = min{ 1, [L(d_obs | θ′) L̂(d_obs | θ̂^(i))] / [L(d_obs | θ^(i)) L̂(d_obs | θ̂′)] }  (pCN)

7. If u ∼ U(0, 1) > α, set θ^(i+1) = θ^(i); otherwise set θ^(i+1) = θ′.

Figure 3: Illustration of the principle used to offset fine level samples to reduce autocorrelation. The fine model F is only evaluated using the full set of proposed parameters θ′ after a prescribed number t of approximation parameter sets {θ̂^(1), θ̂^(2), ..., θ̂^(t)} have been evaluated on the approximate model F̂ and accepted into the chain.

This way, the autocorrelation of the fine chain is reduced, since proposals are ‘more independent’. This approach is strongly related to a two-level version of multi-level MCMC. Since the fine model likelihood ratio is corrected by the inverse of the approximate likelihood ratio in step 6 of Algorithm 2, detailed balance is satisfied, the resulting Markov chain is guaranteed to come from the true posterior and there is no loss of accuracy, even if the approximate model is severely wrong [15]. To demonstrate that this approach does indeed decrease the autocorrelation in our fine chain MCMC samples, we compute the Effective Sample Size N_eff of each MCMC simulation according to the procedures described in [40].
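A compact sketch of the procedure (pCN variant) is given below. It is not the authors' implementation: for clarity it uses a single parameter vector for both levels (i.e. it omits the extra fine-level modes proposed in step 5), and `log_like_fine` / `log_like_coarse` stand for the log-likelihoods evaluated with the FEM model and the DNN, respectively.

```python
# Illustrative sketch of the modified delayed acceptance MCMC (Algorithm 2, pCN variant),
# for which prior and proposal terms cancel in both acceptance ratios.
import numpy as np

def pcn_step(theta, beta, rng):
    return np.sqrt(1.0 - beta ** 2) * theta + beta * rng.standard_normal(theta.shape)

def delayed_acceptance_mcmc(theta0, log_like_fine, log_like_coarse, n_fine, t, beta, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    theta = np.asarray(theta0, dtype=float)
    ll_fine, ll_coarse = log_like_fine(theta), log_like_coarse(theta)
    chain = [theta.copy()]

    for _ in range(n_fine):
        # Subchain on the DNN approximation until t proposals have been accepted.
        sub_theta, sub_ll = theta.copy(), ll_coarse
        accepted = 0
        while accepted < t:
            prop = pcn_step(sub_theta, beta, rng)
            ll_prop = log_like_coarse(prop)
            if np.log(rng.uniform()) < ll_prop - sub_ll:
                sub_theta, sub_ll = prop, ll_prop
                accepted += 1

        # Second accept/reject step on the fine (FEM) model, correcting for the coarse ratio.
        ll_fine_prop = log_like_fine(sub_theta)
        log_alpha = (ll_fine_prop - ll_fine) - (sub_ll - ll_coarse)
        if np.log(rng.uniform()) < log_alpha:
            theta, ll_fine, ll_coarse = sub_theta, ll_fine_prop, sub_ll
        chain.append(theta.copy())

    return np.asarray(chain)
```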
3.2 Adaptive correction of the approximate posterior

Whilst in theory the modified delayed acceptance proposal described in Section 3.1 will provide a convergent Metropolis-Hastings algorithm, there are cases in which the rate of convergence will be extremely slow. To demonstrate this, the left-hand contour plot in Fig. 4 shows an artificially bad example. In this case the approximate model (red isolines) poorly captures the target likelihood distribution (blue density); there is a clear offset in the distributions, and the scale, shape and orientation of the approximate likelihood are incorrect. If using the modified delayed acceptance algorithm without alteration, it is easy to see that the proposal mechanism would struggle to traverse the whole of the target distribution, since much of it lies in the tails of the approximate likelihood distribution. As a result, we would observe extremely slow convergence to the true posterior; in practice, at finite computational times, results would contain a significant bias.

An ad hoc way to overcome this is to apply so-called tempering to the statistical model which drives the subchain. In this technique, the variance of the misfit Σ_ε on the subchain is artificially inflated to capture the uncertainty in the approximate model. The issue in adopting this approach is the difficulty in selecting a robust inflation factor for tempering, particularly in higher dimensions. Furthermore, an isotropic inflation of the approximate posterior will in general be sub-optimal.

Figure 4: Fine/target likelihood (blue) and approximate likelihood (red). (Left) Original likelihood before correction, (middle) approximate likelihood corrected by a constant shift µ_bias and (right) approximate likelihood corrected by a multivariate Gaussian.

In this paper we instead implement an adaptive enhanced error model (EEM), which overcomes many of these challenges. Moreover, it is easy to implement and has negligible additional computational cost. Let F̂ denote the approximate forward map of the fine/target model F. Then, following [41, 18], we apply a trick to the statistical model (8), where we add and subtract the coarse map F̂. With some rearrangement we obtain the expression

d_obs = F(θ) + ε = F(θ) + F̂(θ) − F̂(θ) + ε = F̂(θ) + (F(θ) − F̂(θ)) + ε,  with B(θ) := F(θ) − F̂(θ).   (19)

Here B(θ) = F(θ) − F̂(θ) is the bias associated with the approximation at given parameter values θ. We approximate this bias using a multivariate Gaussian distribution, i.e. B ∼ N(µ_bias, Σ_bias), and therefore the likelihood function (12) can be rewritten as

L̂(d_obs | θ) = exp( −(1/2) (F̂(θ) + µ_bias − d_obs)^⊤ (Σ_bias + Σ_ε)^{-1} (F̂(θ) + µ_bias − d_obs) ).   (20)
The influence of redefining the likelihood is best demonstrated geometrically, as shown in Fig. 4 (middle and right). Firstly, as shown in Fig. 4 (middle), we can make a better approximation by simply adding a shift of the mean bias µ_bias to the original approximate model F̂(θ). This has the effect of aligning the ‘centre of mass’ of each of the distributions. Secondly, we can learn the covariance structure of the bias. This has the effect of stretching and rotating the approximate distribution to give an even better overall approximation, as shown in Fig. 4 (right). The final mismatch between the approximate and target distributions will be driven by the assumption that the bias can be represented by a multivariate Gaussian, although more complex distributions could be constructed using, for example, Gaussian process regression. Whilst this is an avenue to explore in the future, any such approach would surrender the simplicity of the present method, which from the results appears particularly effective.

The idea of using an EEM when dealing with model hierarchies originates from [41], who suggested using samples from the prior distribution of parameters to construct the EEM prior to Bayesian inversion, so that

µ_bias = (1/N) Σ_{i=1}^N B(θ^(i))  and  Σ_bias = (1/(N − 1)) Σ_{i=1}^N (B(θ^(i)) − µ_bias)(B(θ^(i)) − µ_bias)^⊤.   (21)

The estimates for µ_bias and Σ_bias could be obtained by sampling the prior distribution and comparing the approximate forward map against the target forward map. However, since the structure of the bias in the prior and the posterior could be significantly different, this approach could result in a suboptimal EEM, and it requires additional fine model evaluations. Furthermore, in our examples, where the approximate model is built from samples from the prior, it is expected that such an approach would underestimate the bias, since the neural network has been explicitly trained to minimise the error with respect to samples from the prior. Instead of estimating the bias using the prior, the posterior bias can be constructed on-line by iteratively updating its mean µ_bias and covariance Σ_bias using coarse/fine solution pairs from the MCMC samples, as suggested by [42]. In this case we select

µ_bias,i+1 = (1/(i + 1)) ( i µ_bias,i + B(θ^(i+1)) )  and   (22)
Σ_bias,i+1 = ((i − 1)/i) Σ_bias,i + (1/i) ( B(θ^(i+1)) B(θ^(i+1))^⊤ − µ_bias,i+1 µ_bias,i+1^⊤ ).   (23)

While this approach does not in theory guarantee ergodicity of the chain, the bias distribution will converge as the chain progresses and adaptation diminishes, resulting in a de facto ergodic process after an initial period of high adaptivity. This is a common feature of adaptive MCMC algorithms, as discussed in the classic paper on adaptive Metropolis [17]. Our experiments showed that the bias distribution did indeed converge for every simulation, and that repeated experiments converged towards the same posterior bias distribution. Admitting a bias term in the inverse problem further introduces an issue of identifiability, as highlighted in [43]. Since observations are now modelled as a sum of coarse model output and multiple stochastic terms, the stochastic terms B ∼ N(µ_bias, Σ_bias) and ε ∼ N(0, σ² I_n) are generally unidentifiable in the coarse model formulation.
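The on-line updates (22)-(23) amount to a few lines of code; the sketch below is illustrative and uses our own function name.

```python
# Illustrative sketch of the recursive enhanced error model updates (22)-(23).
# `bias` is B(theta) = F(theta) - F_hat(theta), available whenever a fine solve is performed.
import numpy as np

def update_error_model(mu, cov, bias, i):
    """Return the updated bias mean and covariance after the (i+1)-th coarse/fine pair (i >= 1)."""
    mu_new = (i * mu + bias) / (i + 1)
    cov_new = (i - 1) / i * cov + (np.outer(bias, bias) - np.outer(mu_new, mu_new)) / i
    return mu_new, cov_new

# The corrected likelihood (20) then uses F_hat(theta) + mu_new and Sigma_bias + Sigma_eps.
```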
In this section, we examine the effectiveness of our proposed strategy on two synthetic groundwater flow problems: a two-dimensional problem with an isotropic covariance kernel and a three-dimensional problem with an anisotropic covariance kernel. For both examples, we begin by outlining the model setup, for which we select a ‘true’ transmissivity field and a number of fixed observation points. For the first example, the influence of training size for the DNNs is examined, and the total cost of uncertainty quantification using a selection of DNNs is computed. For the second example we use a single DNN setup and analyse the resulting posterior marginal distributions and the quantity of interest. The first example was completed on commodity hardware, an HP Elitebook 840 G5 with an Intel Xeon E3-1200 quad-core processor, while the second example was completed on a TYAN Thunder FT48T-B7105 GPU server with two Intel Xeon Gold 6252 processors and an NVIDIA RTX 2080Ti GPU.
This example was conducted on a unit square domain Ω = [0, 1]², meshed using an unstructured triangular grid comprising 2,601 degrees of freedom. Dirichlet boundary conditions were imposed on the left and right boundaries with hydraulic heads of 1 and 0, respectively. The top and bottom edges impose homogeneous no-flow Neumann boundary conditions. For the forward model, the covariance length scales of the ARD squared exponential kernel were all set to 0.1, effectively resulting in an isotropic covariance kernel, equal to the ‘classic’ squared exponential kernel with l = 0.1. This resulted in a KL decomposition with >80% of total signal energy in the first 32 modes and >95% of signal energy in the first 64 KL modes. Hence, 32 modes were included in the approximate model whilst 64 modes were included in the fine model.

(a) Log-transmissivity. (b) Hydraulic head.
Figure 5: “True” transmissivity field, its corresponding solution and sampling points.

Figure 5a shows the ‘true’ transmissivity field that we attempt to recover through our MCMC methodology and the modelled, corresponding hydraulic head. Synthetic samples for the likelihood function were extracted at 25 points on a regular grid with a horizontal and vertical spacing of 0.2 m (Figure 5b).
We evaluated a selection of different DNNs to investigate the impact of various network depths and activation functions on the DNN performance. Table 1 shows the layers of the employed DNNs, the number of nodes in each layer and their corresponding activation functions. DNN1 and DNN2 had three hidden layers, while DNN3 and DNN4 had only two, as the ReLU layer with 8k nodes was not included in these networks. The output layer of DNN1 and DNN3 consisted of nodes with an exponential activation function E(x) = e^x, resulting in a strictly positive output. The DNNs with an exponential activation function in the final layer tended overall to lead to the best performance.

Layer   | Nodes             | DNN1        | DNN2    | DNN3        | DNN4
Input   | k KL coefficients | —           | —       | —           | —
1       | 4k                | Sigmoid     | Sigmoid | Sigmoid     | Sigmoid
2       | 8k                | ReLU        | ReLU    | —           | —
3       | 4k                | ReLU        | ReLU    | ReLU        | ReLU
Output  | n datapoints      | Exponential | Linear  | Exponential | Linear

Table 1: Neural network layers and activation functions in the model approximation DNNs.

Each DNN was trained on a set of samples from the prior distribution of parameters π(θ) = N(0, I_k), in advance of running the MCMC. Hence, the DNN samples were drawn from a Latin Hypercube [44] in the interval [0, 1]
and transformed to the standard normal distribution using the probit function, such that θ_train ∼ N(0, I_k). The FEM model was then run for every parameter sample, obtaining for each a vector of model outputs at the sampling points. We trained and tested each DNN on a range of different total sample sizes N_DNN, where N_DNN = N_train + N_test, with a 9:1 training/test splitting ratio. Each DNN was then trained for 200 epochs with a batch size of 50 using the RMSprop optimiser [39].

DNN performance was compared using the RMSE of their respective testing datasets,

RMSE = sqrt( (1/n) Σ_{i=1}^n (h_i − ĥ_i)² ).   (24)

The residual RMSE (24) of each DNN was computed to compare the network designs described in Table 1 and to investigate the influence of training dataset size on the DNN performance (Figure 6). As expected, each DNN performed better as the number of samples in the training dataset was increased. In comparison, the DNN design had much less influence on the testing performance, suggesting that the main driver for constructing an accurate surrogate model, within the bounds of the examined DNN designs, was the number of training samples. For the remaining analysis, we chose the overall best performing network, namely DNN1, constructed from N_DNN = {4000, 16000, 64000} samples.
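For illustration, the training-data generation described above (a Latin Hypercube design mapped through the probit function, followed by FEM runs) can be sketched as follows; `forward` is a placeholder of ours for the fine model evaluated at the sampling points.

```python
# Illustrative sketch of generating DNN training data from the prior.
import numpy as np
from scipy.stats import qmc, norm

def generate_training_data(forward, k, n_samples, seed=0):
    u = qmc.LatinHypercube(d=k, seed=seed).random(n_samples)   # uniform design on [0, 1]^k
    theta = norm.ppf(u)                                        # probit transform to N(0, I_k)
    heads = np.array([forward(t) for t in theta])              # fine model runs (parallelisable)
    return theta, heads
```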
Figure 6: Testing performance (RMSE) of each DNN against the total sample size (N_DNN = N_train + N_test). Please refer to Table 1 for details of the structure of each DNN.

Further performance analysis consisted of analysing the DNN error e = h_true − h_pred, for true and predicted heads h_true and h_pred, respectively, at datapoints 0, 8, 16 and 24 (Figure 7). All error distributions were approximately Gaussian, with the errors for the DNN with N_DNN = 4000 exhibiting some right skew at sampling point 24. For all DNNs, the sampling points closer to the boundaries (sampling points 0 and 24) had lower errors than those further away, since the heads close to the boundaries were more constrained by the model.
For inversion and uncertainty quantification, we chose a multivariate standard normal distribution as the prior parameter distribution, π(θ) = N(0, I_k), and set the error covariance Σ_ε to a scaled identity matrix. While computationally convenient, the zero-centred prior in practice favours transmissivity field realisations capable of reproducing the observed heads with as little variation as possible. In total, seven different sampling strategies were investigated, namely single-level ‘vanilla’ MCMC, DA using three different DNNs trained and tested on N_DNN = {4000, 16000, 64000} samples, and DA with an enhanced error model (DA/EEM) using the same three DNNs. Every simulation was completed using the pCN transition kernel. Each MCMC sampling strategy was repeated (n = 32) using randomly generated random seeds, to ensure that every starting point would converge towards the same stationary distribution and to allow for cross-chain statistics to be computed. Results given in this section pertain to these multi-chain samples rather than individual MCMC realisations, unless otherwise stated.
Figure 7: Density plot of the error (e = h_true − h_pred) on the testing dataset for DNN1 trained and tested on N_DNN = {4000, 16000, 64000} samples, for sampling points 0, 8, 16 and 24. Bars show the density of each bin, while the curve shows a Gaussian kernel density estimate.

Our sampling strategies recovered the ground truth with good accuracy. Figure 8 shows the mean and variance of the recovered field from the DA/EEM MCMC using the DNN with N_DNN = 64000. All recovered fields exhibit higher smoothness than the ground truth, which could be attributed to the relatively low number of sampling points and their regular distribution on the domain. None of the chains recovered the local peak in transmissivity on the right side of the domain, since there was too little data to discover this particular feature.

While the recovered fields indicate that every MCMC sampling strategy converged towards the desired stationary distribution, they do not reveal the relative efficiency of each strategy. Hence, the Effective Sample Size (N_eff) was computed for each MCMC realisation. Every DA sampling strategy produced a higher N_eff than the vanilla pCN sampler, relative to the simulation time, with a clear correlation between DNN testing performance and N_eff. This was mainly because the better performing DNNs allowed for a longer coarse chain offset without diverging. Moreover, utilising the EEM produced an even higher N_eff for every DA chain (Table 2).
Strategy | N_DNN | t | N_C / N_F        | Acc. Rate | Time (min) | N_eff
Vanilla  | —     | — | — / 80000        | 0.23      | 63.3       | 84.8
DA       | 4000  | 2 | 85461.9 / 20000  | 0.27      | 16.2       | 64.5
DA/EEM   | 4000  | 2 | 78853.4 / 20000  | 0.31      | 15.2       | 79.0
DA       | 16000 | 4 | 172383.1 / 20000 | 0.27      | 18.2       | 116.3
DA/EEM   | 16000 | 4 | 178978.4 / 20000 | 0.30      | 18.4       | 143.6
DA       | 64000 | 8 | 336447.5 / 20000 | 0.24      | 30.1       | 196.5
DA/EEM   | 64000 | 8 | 377524.4 / 20000 | 0.30      | 29.9       | 235.7
Table 2: Results for various MCMC sampling strategies; means of multiple chains with n = 32. N_DNN is the number of total samples used to construct the DNN. t is the DA offset length. N_C / N_F is the final length of the coarse and fine chain, respectively, after subtracting burn-in. Acc. rate is the fine chain acceptance rate.
Time (min) is the total running time of the simulation in minutes. N_eff is the Effective Sample Size.

(a) Mean of recovered log-transmissivity. (b) Variance of recovered log-transmissivity.

Figure 8: Mean and variance (n = 32) of recovered log-transmissivity fields using DA/EEM MCMC with N_DNN = 64000.
Since the DA chains required computation of a significant number of fine model solutions and training of a DNN in advance of running the chain, the total cost C_total of each strategy was computed as

C_total = (t_fine + t_train + t_run) / N_eff,   (25)

where t_fine is the time spent on precomputing fine model solutions, t_train is the time spent on training the respective DNN, t_run is the time taken to run the chain and N_eff is the resulting effective sample size (Fig. 9).

Figure 9: Violin plot showing the total cost C_total (min/N_eff) of each MCMC strategy with n = 32. Points show individual Markov chains.

The mean cost of every DA chain was lower than that of the vanilla pCN chain, with the chains using the EEM consistently cheaper than their non-EEM counterparts. Moreover, using the EEM reduced the variance of the cost in repeated experiments, allowing each repetition to produce a consistently high N_eff. The overall cheapest inversion was completed using the DNN trained on 16,000 samples using the EEM, reducing the total cost by 75% relative to the vanilla pCN MCMC. Notice that these results are extremely conservative in the sense that the entire cost of evaluating every DNN training sample and training the DNN in serial on a CPU was factored into the cost of every repetition, even though the same DNN was used for all the repetitions within each sampling strategy. The precomputation cost can be dramatically reduced by evaluating the DNN samples in parallel and utilising high-performance hardware, such as GPUs, for training the DNN.
5] meshed usingan unstructured tetrahedral grid with 10,416 degrees of freedom (Figure 10). Dirichlet boundaryconditions of h = 1 and h = 0 were imposed at x = 0 and x = 2, respectively. No-flow Neumannconditions were imposed on all remaining boundaries. (a) Log-Conductivity of ground truth.(b) Hydraulic head of ground truth and location of sampling points. Figure 10: “True” conductivity field, its corresponding solution and sampling points.We drew w = 50 sampling well locations randomly using the Maximin Latin Hypercube Design[45], and samples of hydralic head were extracted at each well at datums x = { . , . , . , . , . } ,measured from the bottom of the domain, resulting in m = 250 datapoints in total (Figure10b). The covariance lengths scales for ARD squared exponential covariance kernel were set to16 = (0 . , . , . (cid:124) , resulting in a highly anisotropic random process with high variation in the x direction to simulate geological stratification, some variation in the x direction and little variationin the x direction (Figure 10a). Like in the first model, the random process was truncated at 64KL eigenmodes for the fine model and 32 KL eigenmodes for the coarse model, embodying > >
For this example, we first converged the conductivity parameters to the maximum a posteriori (MAP) estimate θ_MAP = arg max_θ π(θ) L(d_obs | θ) using gradient descent, since initial MCMC experiments struggled to converge to the posterior distribution for random initial parameter sets.

Training a DNN to accurately emulate the model response for this setup was challenging, and we found no single combination of neural network layers and activation functions that would predict the head at every datapoint with sufficient accuracy. We hypothesise that this limitation could be caused by a strong ill-posedness of the DNN: for a single neural network, the output dimension greatly exceeded the input dimension, m >> k. When we instead predicted the heads at each datum using a separate DNN, we found that we could utilise largely the same DNN design as had been employed in the first example. Hence, to predict the head at all datapoints, we utilised five identically designed but independent DNNs (Figure 11), each with four hidden layers and activation functions as indicated in Table 3.

Figure 11: Layout of the multi-DNN design. Each DNN outputs a vector h_x of w head predictions at datum x.

Layer  | Nodes             | Activation
Input  | k KL coefficients | —
1      | 4k                | Sigmoid
2      | 8k                | ReLU
3      | 8k                | ReLU
4      | 4k                | ReLU
Output | w wells           | Exponential

Table 3: Layers and activation functions in the DNNs. Each DNN takes all k KL coefficients as input and predicts the head h_x at w wells for a given datum.

Each DNN was trained and tested on a dataset of N_DNN = 16000 samples, with KL coefficients drawn from a Latin Hypercube [44] in the interval [0, 1]
and transformed to a normal distribution centred on the MAP estimate of the parameters θ_MAP, i.e. θ_train ∼ N(θ_MAP, I_k). This was done to increase the density of samples and thus improve the DNN prediction at and around the MAP point, which ideally equals the mode of the posterior distribution. The DNNs were then trained for 200 epochs using a batch size of 50, the MSE loss function and the RMSprop optimiser [39]. Figure 12 shows performance plots of each DNN for both the training (top) and the testing (bottom) datasets. While every DNN is clearly moderately biased by the training data, they all performed adequately with respect to the testing data.

Figure 12: Performance of each DNN with respect to the training dataset (top) and the testing dataset (bottom).

Similarly to the first example, we chose a multivariate standard normal distribution π(θ) = N(0, I_k) as the prior distribution of parameters, and set the error covariance Σ_ε to a scaled identity matrix. In this example, we utilised the Adaptive Metropolis (AM) transition kernel for generating proposals. We completed n = 8 independent simulations, each initialised from a random initial point close to the MAP point θ_MAP, with a burn-in of 1000 and a final sample size of N = 10,000. The subchain offset length was set to t = 2, since longer subchains tended to diverge, leading to sub-optimal acceptance rates on the fine level. The simulations had a mean acceptance rate of 0.26, a mean effective sample size (N_eff) of 55.2 and a mean autocorrelation length τ = N/N_eff of 181.0. The samples of each independent simulation were pruned according to the respective autocorrelation length, and the remaining samples were pooled together to yield 443 statistically independent samples that were then analysed further.

Figure 13 shows the marginal distributions of the six coarsest KL coefficients along with a scatterplot matrix of all the samples remaining after pruning. All the marginal distributions are approximately Gaussian, and the two-parameter marginal distributions are mostly elliptical. It is evident that several pairs of these parameters are correlated. It is worth mentioning that in every independent simulation, the AM proposal kernel managed to capture these correlations.

Moreover, we analysed the hydraulic head as a function of datum, h(x_3), along a vertical line in the centre of the domain, x = (1.0, 0.5, x_3)^⊤. Figure 14 shows h(x_3) for the ground truth, the MAP point θ_MAP, the mean of the n = 8 independent simulations, and all the samples remaining after pruning. We observe that both the MAP point and the sample mean are fairly close to the ground truth, albeit exhibiting higher smoothness, particularly between the observation depths, where the head is essentially allowed to vary freely. It is also clear that the individual samples encapsulate the ground truth, indicating that the ground truth is indeed contained by the posterior distribution.

In this paper, we have demonstrated the use of a novel Markov Chain Monte Carlo methodology which employs a delayed acceptance (DA) model hierarchy with a deep neural network (DNN) as an approximate model and a FEM solver as a fine model, and generates proposals using the pCN and AM transition kernels. Results from the first example clearly indicate that the use of a carefully designed DNN as a model approximation can significantly reduce the cost of uncertainty quantification, even for DNNs trained on relatively small sample sizes.
Figure 13: One- and two-dimensional posterior marginal distributions (diagonal and lower triangle) and scatterplots (upper triangle) of posterior samples, pruned according to the autocorrelation length of each chain, for the largest 5 KL eigenmodes. Please note that the axis scales are not equal.

Figure 14: Hydraulic head as a function of datum, h(x_3), at x = (1.0, 0.5, x_3)^⊤. The solid red line shows the hydraulic head of the ground truth, the dashed orange line shows the head of the maximum a posteriori (MAP) point θ_MAP, the dotted yellow line shows the mean head of the independent simulations (n = 8) and the thin black lines show the heads of the n = 443 statistically independent samples remaining after pruning according to the autocorrelation length of each chain. The vertical dotted lines show the observation depths.

We have established that offsetting fine model evaluations in the DA algorithm reduces the autocorrelation of the fine chain, resulting in a higher effective sample size which, in turn, improves the statistical validity of the results. In this context, the performance of the DNN is a critical driver when determining a feasible offset length to avoid divergence of the coarse chain. Hence, if a high effective sample size is required, it may be desirable to invest in a well-performing DNN. Moreover, we have shown that an enhanced error model, which introduces an iteratively constructed bias distribution in the coarse chain likelihood function, further increases the effective sample size and decreases the variance of the cost in repeated experiments. Finally, we observed that for the second example, even when employing a relatively well-performing model approximation, we had to constrain the offset length of the subchains rather strongly to achieve optimal acceptance rates. This can be attributed in part to an apparent non-spherical and correlated posterior distribution, causing the employed proposal kernels to struggle to discover areas of high posterior probability.

We have demonstrated that relatively simple inverse hydrogeological problems can be solved in reasonable time on a commonly available personal computer with no GPU acceleration. This opens the opportunity to apply robust uncertainty quantification during fieldwork and as a decision-support tool for groundwater surveying campaigns. We have also demonstrated the applicability of our approach on a larger scale three-dimensional problem, utilising a GPU-accelerated high-performance computer (HPC). Aside from the benefit of using an HPC for accelerating the fine model evaluations, utilising the GPU allowed for rapidly training and testing multiple different DNN designs to efficiently establish a well-performing model approximation. There are other obvious ways to further increase the efficiency of the proposed methodology. For example, construction of the DNNs used as coarse models comes with the cost of evaluating multiple models from the prior distribution, and, unlike the MCMC sampler, the prior models are independent and these fine model evaluations can thus be massively parallelised.

Our methodology was demonstrated in the context of two relatively simple groundwater flow problems with log-Gaussian transmissivity fields parameterised by Karhunen-Loève decompositions. While this model provides a convenient computational structure for our purposes, it may not reflect the full-scale transmissivity of real-world aquifers, particularly in the presence of geological faults and other heterogeneities, as discussed in [24]. Future research could address this problem through geological layer stratification using the universal cokriging interpolation method suggested in [46], potentially utilising the open-source geological modelling tool GemPy [47], which allows for simple parametric representation of geological strata. Spatially heterogeneous parameters within each stratum could then be modelled hierarchically using a low order log-Gaussian random field to account for within-stratum variation, as demonstrated in [12].
This work was funded as part of the Water Informatics Science and Engineering Centre for Doctoral Training (WISE CDT) under a grant from the Engineering and Physical Sciences Research Council (EPSRC), grant number EP/L016214/1. TD was funded by a Turing AI Fellowship (2TAFFP\100007). DM acknowledges support from the EPSRC Platform Grant PRISM (EP/R029423/1). The authors have no competing interests. Data supporting the findings in this study are under preparation and will be made available in the Open Research Exeter (ORE, https://ore.exeter.ac.uk/repository/) data repository.
References
[1] M. P. Anderson, W. W. Woessner, R. J. Hunt, Applied groundwater modeling: simulation of flow and advective transport, 2nd Edition, Academic Press, London; San Diego, CA, 2015. OCLC: ocn921253555.
[2] A. D. Woodbury, T. J. Ulrych, A full-Bayesian approach to the groundwater inverse problem for steady state flow, Water Resources Research 36 (8) (2000) 2081–2093. doi:10.1029/2000WR900086. URL http://doi.wiley.com/10.1029/2000WR900086
[3] G. Mariethoz, P. Renard, J. Caers, Bayesian inverse problem and optimization with iterative spatial resampling, Water Resources Research 46 (11) (Nov. 2010). doi:10.1029/2010WR009274. URL http://doi.wiley.com/10.1029/2010WR009274
[4] M. de la Varga, J. F. Wellmann, Structural geologic modeling as an inference problem: A Bayesian perspective, Interpretation 4 (3) (2016) SM1–SM16. doi:10.1190/INT-2015-0188.1. URL http://library.seg.org/doi/10.1190/INT-2015-0188.1
[5] C. P. Robert, G. Casella, Monte Carlo statistical methods, 2nd Edition, Springer texts in statistics, Springer, New York, NY, 2010. OCLC: 837651914.
[6] D. Higdon, H. Lee, C. Holloman, Markov chain Monte Carlo-based approaches for inference in computationally intensive inverse problems, in: Bayesian Statistics, Vol. 7, Oxford University Press, 2003, pp. 181–197.
[7] T. J. Dodwell, C. Ketelsen, R. Scheichl, A. L. Teckentrup, A Hierarchical Multilevel Markov Chain Monte Carlo Algorithm with Applications to Uncertainty Quantification in Subsurface Flow, SIAM/ASA Journal on Uncertainty Quantification 3 (1) (2015) 1075–1108. doi:10.1137/130915005. URL http://epubs.siam.org/doi/10.1137/130915005
[8] G. Detommaso, T. Dodwell, R. Scheichl, Continuous Level Monte Carlo and Sample-Adaptive Model Hierarchies, arXiv:1802.07539 [math] (Feb. 2018). URL http://arxiv.org/abs/1802.07539
[9] J. Doherty, Calibration and uncertainty analysis for complex environmental models, 2015. OCLC: 991568728.
[10] B. Peherstorfer, K. Willcox, M. Gunzburger, Survey of multifidelity methods in uncertainty propagation, inference, and optimization, SIAM Review 60 (3) (2018) 550–591.
[11] Y. Efendiev, A. Datta-Gupta, V. Ginting, X. Ma, B. Mallick, An efficient two-stage Markov chain Monte Carlo method for dynamic data integration, Water Resources Research 41 (12) (2005). doi:10.1029/2004WR003764. URL https://agupubs.onlinelibrary.wiley.com/doi/abs/10.1029/2004WR003764
[12] doi:10.1016/j.advwatres.2009.10.010. URL https://linkinghub.elsevier.com/retrieve/pii/S0309170809001729
[13] P. Dostert, Y. Efendiev, B. Mohanty, Efficient uncertainty quantification techniques in inverse problems for Richards' equation using coarse-scale simulation models, Advances in Water Resources 32 (3) (2009) 329–339. doi:10.1016/j.advwatres.2008.11.009. URL https://linkinghub.elsevier.com/retrieve/pii/S0309170808002121
[14] E. Laloy, B. Rogiers, J. A. Vrugt, D. Mallants, D. Jacques, Efficient posterior exploration of a high-dimensional groundwater model from two-stage Markov chain Monte Carlo simulation and polynomial chaos expansion, Water Resources Research 49 (5) (2013) 2664–2682. doi:10.1002/wrcr.20226. URL http://doi.wiley.com/10.1002/wrcr.20226
[15] J. A. Christen, C. Fox, Markov chain Monte Carlo Using an Approximation, Journal of Computational and Graphical Statistics 14 (4) (2005) 795–810. doi:10.1198/106186005X76983.
[16] S. L. Cotter, G. O. Roberts, A. M. Stuart, D. White, MCMC Methods for Functions: Modifying Old Algorithms to Make Them Faster, Statistical Science 28 (3) (2013) 424–446, arXiv:1202.0709. doi:10.1214/13-STS421. URL http://arxiv.org/abs/1202.0709
[17] H. Haario, E. Saksman, J. Tamminen, An Adaptive Metropolis Algorithm, Bernoulli 7 (2) (2001) 223. doi:10.2307/3318737.
[18] T. Cui, C. Fox, M. J. O'Sullivan, A posteriori stochastic correction of reduced models in delayed acceptance MCMC, with application to multiphase subsurface inverse problems, arXiv:1809.03176 [stat] (Sep. 2018). URL http://arxiv.org/abs/1809.03176
[19] H.-J. G. Diersch, FEFLOW: Finite Element Modeling of Flow, Mass and Heat Transport in Porous and Fractured Media, Springer Berlin Heidelberg, Berlin, Heidelberg, 2014. doi:10.1007/978-3-642-38739-5. URL http://link.springer.com/10.1007/978-3-642-38739-5
[20] H. P. Langtangen, A. Logg, Solving PDEs in Python – The FEniCS Tutorial Volume I (2017).
[21] A. W. Harbaugh, MODFLOW-2005: the U.S. Geological Survey modular ground-water model – the ground-water flow process, Report (2005). doi:10.3133/tm6A16. URL http://pubs.er.usgs.gov/publication/tm6A16
[22] R. A. Freeze, A stochastic-conceptual analysis of one-dimensional groundwater flow in nonuniform homogeneous media, Water Resources Research 11 (5) (1975) 725–741. doi:10.1029/WR011i005p00725. URL http://doi.wiley.com/10.1029/WR011i005p00725
[23] N.-O. Kitterrød, L. Gottschalk, Simulation of normal distributed smooth fields by Karhunen-Loève expansion in combination with kriging, Stochastic Hydrology and Hydraulics 11 (6) (1997) 459–482. doi:10.1007/BF02428429. URL http://link.springer.com/10.1007/BF02428429
[24] J. Gómez-Hernández, X.-H. Wen, To be or not to be multi-Gaussian? A reflection on stochastic hydrogeology, Advances in Water Resources 21 (1) (1998) 47–61. doi:10.1016/S0309-1708(96)00031-0. URL http://linkinghub.elsevier.com/retrieve/pii/S0309170896000310
[25] doi:10.1016/j.advwatres.2007.07.002. URL https://linkinghub.elsevier.com/retrieve/pii/S0309170807001236
[26] D. Russo, M. Bouton, Statistical analysis of spatial variability in unsaturated flow parameters, Water Resources Research 28 (7) (1992) 1911–1925. doi:10.1029/92WR00669. URL http://doi.wiley.com/10.1029/92WR00669
[27] R. J. Hoeksema, P. K. Kitanidis, Analysis of the Spatial Structure of Properties of Selected Aquifers, Water Resources Research 21 (4) (1985) 563–572. doi:10.1029/WR021i004p00563. URL http://doi.wiley.com/10.1029/WR021i004p00563
[28] P. Dostert, Y. Efendiev, T. Hou, W. Luo, Coarse-gradient Langevin algorithms for dynamic data integration and uncertainty quantification, Journal of Computational Physics 217 (1) (2006) 123–142. doi:10.1016/j.jcp.2006.03.012. URL https://linkinghub.elsevier.com/retrieve/pii/S0021999106001380
[29] Y. M. Marzouk, H. N. Najm, Dimensionality reduction and polynomial chaos acceleration of Bayesian inference in inverse problems, Journal of Computational Physics 228 (6) (2009) 1862–1902. doi:10.1016/j.jcp.2008.11.024. URL https://linkinghub.elsevier.com/retrieve/pii/S0021999108006062
[30] C. Scarth, S. Adhikari, P. H. Cabral, G. H. Silva, A. P. d. Prado, Random field simulation over curved surfaces: Applications to computational structural mechanics, Computer Methods in Applied Mechanics and Engineering 345 (2019) 283–301. doi:10.1016/j.cma.2018.10.026. URL https://linkinghub.elsevier.com/retrieve/pii/S0045782518305309
[31] A. Gelman (Ed.), Bayesian data analysis, 2nd Edition, Texts in statistical science, Chapman & Hall/CRC, Boca Raton, FL, 2004.
[32] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, E. Teller, Equation of State Calculations by Fast Computing Machines, The Journal of Chemical Physics 21 (6) (1953) 1087–1092. doi:10.1063/1.1699114. URL http://aip.scitation.org/doi/10.1063/1.1699114
[33] W. K. Hastings, Monte Carlo sampling methods using Markov chains and their applications, Biometrika 57 (1) (1970) 97–109.
[34] L. Katafygiotis, K. Zuev, Geometric insight into the challenges of solving high-dimensional reliability problems, Probabilistic Engineering Mechanics 23 (2-3) (2008) 208–218. doi:10.1016/j.probengmech.2007.12.026. URL https://linkinghub.elsevier.com/retrieve/pii/S0266892007000707
[35] G. O. Roberts, J. S. Rosenthal, Examples of Adaptive MCMC, Journal of Computational and Graphical Statistics 18 (2) (2009) 349–367. doi:10.1198/jcgs.2009.06134.
[36] F. Chollet, et al., Keras, https://github.com/fchollet/keras (2015).
[37] Theano Development Team, Theano: A Python framework for fast computation of mathematical expressions, arXiv e-prints abs/1605.02688 (May 2016). URL http://arxiv.org/abs/1605.02688
[38] T. Hastie, R. Tibshirani, J. H. Friedman, The elements of statistical learning: data mining, inference, and prediction, 2nd Edition, Springer series in statistics, Springer, New York, NY, 2009.
[39] G. Hinton, N. Srivastava, K. Swersky, Neural networks for machine learning, lecture 6a: Overview of mini-batch gradient descent, Coursera, University of Toronto (2012).
[40] U. Wolff, Monte Carlo errors with less errors, Computer Physics Communications 176 (5) (2007) 383, arXiv:hep-lat/0306017. doi:10.1016/j.cpc.2006.12.001. URL http://arxiv.org/abs/hep-lat/0306017
[41] J. Kaipio, E. Somersalo, Statistical inverse problems: Discretization, model reduction and inverse crimes, Journal of Computational and Applied Mathematics 198 (2) (2007) 493–504. doi:10.1016/j.cam.2005.09.027. URL https://linkinghub.elsevier.com/retrieve/pii/S0377042705007296
[42] T. Cui, C. Fox, M. J. O'Sullivan, Bayesian calibration of a large-scale geothermal reservoir model by a new adaptive delayed acceptance Metropolis Hastings algorithm, Water Resources Research 47 (10) (Oct. 2011). doi:10.1029/2010WR010352. URL http://doi.wiley.com/10.1029/2010WR010352
[43] J. Brynjarsdóttir, A. O'Hagan, Learning about physical parameters: the importance of model discrepancy, Inverse Problems 30 (11) (2014) 114007. doi:10.1088/0266-5611/30/11/114007. URL http://stacks.iop.org/0266-5611/30/i=11/a=114007?key=crossref.7b886360dda7b385609c577ad82450aa
[44] M. D. McKay, R. J. Beckman, W. J. Conover, A comparison of three methods for selecting values of input variables in the analysis of output from a computer code, Technometrics 21 (2) (1979) 239–245.
[45] M. D. Morris, T. J. Mitchell, Exploratory designs for computational experiments, Journal of Statistical Planning and Inference 43 (3) (1995) 381–402. doi:10.1016/0378-3758(94)00035-T. URL https://linkinghub.elsevier.com/retrieve/pii/037837589400035T
[46] C. Lajaunie, G. Courrioux, L. Manuel, Foliation fields and 3d cartography in geology: Principles of a method based on potential interpolation, Mathematical Geology 29 (4) (1997) 571–584. doi:10.1007/BF02775087. URL http://link.springer.com/10.1007/BF02775087
[47] M. de la Varga, A. Schaaf, F. Wellmann, GemPy 1.0: open-source stochastic geological modeling and inversion, Geoscientific Model Development 12 (1) (2019) 1–32. doi:10.5194/gmd-12-1-2019.

Preconditioned Crank-Nicolson
The preconditioned Crank-Nicolson (pCN) proposal was developed in [16] and is based on the following Stochastic Partial Differential Equation (SPDE):
\[ \frac{du}{ds} = -KL\,u + \sqrt{2K}\,\frac{db}{ds} \]
where $L = C^{-1}$ is the precision operator for the prior distribution $\mu$, $b$ is Brownian motion with covariance operator $I$, and $K$ is a positive operator. This equation can be discretised using the Crank-Nicolson approach to yield
\[ v = u - \tfrac{1}{2}\delta\,KL(u + v) + \sqrt{2K\delta}\,\xi \]
for white noise $\xi$ and a weight $\delta \in [0, 2]$. If we choose $K = I$, we get the plain Crank-Nicolson (CN) proposal:
\[ (2C + \delta I)\,v = (2C - \delta I)\,u + \sqrt{8\delta}\,C\xi \]
where $\xi \sim N(0, C)$. If we instead choose $K = C$, we get the pCN proposal:
\[ v = \sqrt{1 - \beta^2}\,u + \beta\xi, \qquad \beta = \frac{\sqrt{8\delta}}{2 + \delta}, \quad \beta \in [0, 1]. \]
Applied to the model parameters $\theta$, the pCN proposal for the next state of the chain thus takes the form
\[ \theta' = \sqrt{1 - \beta^2}\,\theta_i + \beta\xi, \qquad \xi \sim N(0, C). \]
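The pCN update is simple to implement once samples from the prior $N(0, C)$ can be drawn. Below is a minimal Python/NumPy sketch for a finite-dimensional parameter vector; the function and variable names (pcn_proposal, chol_C) and the squared-exponential example covariance are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def pcn_proposal(theta, chol_C, beta, rng):
    """One preconditioned Crank-Nicolson (pCN) proposal step.

    theta  : current parameter vector, with Gaussian prior N(0, C)
    chol_C : lower-triangular Cholesky factor of the prior covariance C
    beta   : pCN step size in [0, 1], beta = sqrt(8*delta) / (2 + delta)
    rng    : NumPy random generator
    """
    xi = chol_C @ rng.standard_normal(theta.shape[0])  # xi ~ N(0, C)
    return np.sqrt(1.0 - beta**2) * theta + beta * xi


# Illustrative usage with a squared-exponential prior covariance on a 1D grid.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
C = np.exp(-0.5 * ((x[:, None] - x[None, :]) / 0.2) ** 2) + 1e-8 * np.eye(50)
chol_C = np.linalg.cholesky(C)

theta = chol_C @ rng.standard_normal(50)  # initial state drawn from the prior
theta_new = pcn_proposal(theta, chol_C, beta=0.25, rng=rng)
```

Because the pCN kernel leaves the Gaussian prior invariant, the Metropolis-Hastings acceptance probability reduces to a ratio of likelihoods, which is what gives this proposal its robustness with respect to the number of uncertain parameters.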