Scaling Up Bayesian Uncertainty Quantification for Inverse Problems using Deep Neural Networks
Shiwei Lan∗, Shuyi Li, and Babak Shahbaba
School of Mathematical and Statistical Sciences, Arizona State University; Department of Statistics, University of California, Irvine
Abstract
Due to the importance of uncertainty quantification (UQ), the Bayesian approach to inverse problems has recently gained popularity in applied mathematics, physics, and engineering. However, traditional Bayesian inference methods based on Markov Chain Monte Carlo (MCMC) tend to be computationally intensive and inefficient for such high dimensional problems. To address this issue, several methods based on surrogate models have been proposed to speed up the inference process. More specifically, the calibration-emulation-sampling (CES) scheme has been proven to be successful in large dimensional UQ problems. In this work, we propose a novel CES approach for Bayesian inference based on deep neural network (DNN) models for the emulation phase. The resulting algorithm is not only computationally more efficient, but also less sensitive to the training set. Further, by using an Autoencoder (AE) for dimension reduction, we have been able to speed up our Bayesian inference method up to three orders of magnitude. Overall, our method, henceforth called the
Dimension-Reduced Emulative Autoencoder Monte Carlo (DREAM) algorithm, is able to scale Bayesian UQ up to thousands of dimensions in physics-constrained inverse problems. Using two low-dimensional (linear and nonlinear) inverse problems, we illustrate the validity of this approach. Next, we apply our method to two high-dimensional numerical examples (elliptic and advection-diffusion) to demonstrate its computational advantage over existing algorithms.
Keywords—
Bayesian Inverse Problems, Ensemble Kalman Methods, Emulation, Convolutional Neural Network, Dimension Reduction, Autoencoder
∗ [email protected], http://math.asu.edu/~slan

1 Introduction

In applied mathematics, physics and engineering, there is a growing interest in uncertainty quantification (UQ) in order to calibrate model inadequacy, carry out sensitivity analysis, or practice optimal control under uncertainty. As a result, Bayesian methods for inverse problems, such as reservoir modeling and weather forecasting, have become increasingly popular. Models in these application domains are usually constrained to physical laws and are typically represented as ordinary and partial differential equation (ODE/PDE) systems, which involve expensive numerical simulation. Therefore, Bayesian UQ for such physics-constrained inverse problems is quite difficult because it involves 1) computationally intensive simulations for solving ODE/PDE systems, and 2) sampling from the resulting high dimensional posterior distributions. To address these issues, we follow the work of Cleary et al. (2020) [1] and propose a scalable framework for Bayesian UQ that combines ensemble Kalman methods and Markov Chain Monte Carlo (MCMC) algorithms.

Ensemble Kalman methods, which originated from geophysics [2], have achieved significant success in state estimation for complex dynamical systems with noisy observations [3–10]. More recently, these methods have been used to solve inverse problems with the objective of estimating the model parameters instead of the states [11–17]. As gradient-free optimization algorithms based on a small number of ensembles, these methods have gained increasing popularity for solving inverse problems since they can be implemented non-intrusively and in parallel. However, due to the collapse of ensembles [18–21], they tend to underestimate the posterior variances and thus fail to provide a rigorous basis for systematic UQ. Combining Kalman methods with MCMC can alleviate this issue. This approach consists of three stages: (i) calibrating models with ensemble Kalman methods, (ii) emulating the parameter-to-data map using evaluations of forward models, and (iii) sampling posteriors with MCMC based on cheaper emulators. We refer to this approach as Calibration-Emulation-Sampling (CES) [1]. Two immediate benefits of such a framework are 1) the reuse of expensive forward evaluations, and 2) the computationally cheap surrogates in the MCMC procedure.

For the emulation component, one could rely on Gaussian Process (GP) models, which have been widely used for emulating computer models [22], uncertainty analysis [23], sensitivity analysis [24], and computer code calibration [25–27]. The derivative information extracted from GP [28, 29] can also be used to improve the sampling component [25, 23]. In general, however, GP emulation involves intensive computation (O(N³) with N inputs) and does not scale well to high dimensions. Additionally, the prediction accuracy of a GP emulator highly depends on the training set, which usually demands a substantial effort of experimental design [22, 30]. For these reasons, we propose to use deep neural networks (DNN) [31] for the emulation.

A DNN is an artificial neural network with multiple layers between the input and output layers. Each layer contains neurons (units), synapses (connections), weights, biases, and activation functions. The composition of multiple layers of different types makes the whole network capable of learning complex non-linear relationships and hence a flexible functional approximation tool [32–34].
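To make this description concrete, the following is a minimal sketch of such a network in Python with TensorFlow/Keras (the framework the paper later uses for emulation); the layer sizes, input/output dimensions and activation choices here are illustrative placeholders, not the architecture used in this work.

```python
import tensorflow as tf

# A small fully connected network: weights, biases and nonlinear activations
# composed layer by layer into a flexible function approximator u -> G(u).
# input_dim, output_dim and the 64-unit hidden layers are illustrative only.
input_dim, output_dim = 100, 25
dnn = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(input_dim,)),
    tf.keras.layers.Dense(64, activation="softplus"),
    tf.keras.layers.Dense(64, activation="softplus"),
    tf.keras.layers.Dense(output_dim, activation="linear"),
])
# Trained by back-propagation with a stochastic (mini-batch) optimizer.
dnn.compile(optimizer="adam", loss="mse")
```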
DNN has achieved enormous successes in image processing[35, 36], speech recognition [37], natural language processing [38, 39], bioinformatics [40, 41],and many more areas in artificial intelligence [42, 43]. DNN is usually trained using back-2ropagation [44, 45]. Its computational cost can be reduced by using stochastic batchoptimization algorithms such as stochastic gradient descent [46]. This way, DNN is trainedat a much lower computational cost, O ( N B ), with B being the batch size.In this paper, we particularly focus on convolutional neural networks (CNN), which arecommonly used in image recognition [35]. In many physics-constrained inverse problems,parameter functions are defined on a finite 2d or 3d domain that can be viewed as images.Therefore, we expect CNN to be more suitable than regular dense neural networks (denotedas ‘DNN’ in the following) for such problems. This novel observation could lead to a moreeffective and computationally efficient emulators. Although there have been few relatedworks where CNN is used on actual images in inverse problems [47–49], to the best ofour knowledge this is first application of CNN for training generic emulators in Bayesianinverse problems.Besides computational challenges associated with building emulators, sampling fromposterior distributions in physics-constrained inverse problems is also a challenging taskdue to the high dimensionality of the target distribution. Traditional Metropolis-Hastingsalgorithms defined on the finite-dimensional space suffer from deteriorating mixing timesupon mesh-refinement or increasing dimensions [50–52]. This has promoted the recentdevelopment of a class of ‘dimension-independent’ MCMC methods [53–58] defined oninfinite-dimensional Hilbert space to overcome this deficiency. Despite of the robustness toincreasing dimensions, these ∞ -dimensional MCMC algorithms are still computationallydemanding. To address this issue, several recent papers have investigated the use of di-mension reduction methods to find an intrinsic low-dimensional subspace that contains themost information on the posterior distribution [59–61]. In this paper, we adopt autoen-coder (AE) [62] to learn a low-dimensional latent representation of the parameter spacewith an encoder (original → latent) and a decoder (latent → reconstruction). We haveshown previously that MCMC defined in the latent space can run more efficiently, andsamples can move effectively between the original and latent spaces [63].Combining CNN and AE, we propose a class of hybrid MCMC algorithms named Dimension-Reduced Emulative Autoencoder Monte Carlo (DREAM) that can improve andscale up the application of the CES framework for Bayesian UQ from hundreds of dimen-sions (with GP emulation) [1] to thousands of dimensions. In summary, this paper makesmultiple contributions as follows:1. apply CNN to train emulators for Bayesian inverse problems,2. embed AE in CES to significantly improve its computational efficiency,3. scale Bayesian UQ for physics-constrained inverse problems up to thousands of di-mensions with DREAM.The rest of the paper is organized as follows. Section 2 includes the necessary back-ground on Bayesian inverse problems and the CES framework for UQ. Further, gradientbased MCMC algorithms ( ∞ -MALA and ∞ -HMC) are reviewed for sampling. In Section 3we apply various neural networks including DNN, CNN, and AE to scale up Bayesian UQfor physics-constrained inverse problems. 
Details of emulating functions, extracting gradients, and reducing dimensions will be discussed. Then, we illustrate the validity of DREAM in Section 4 with simulated linear and nonlinear inverse problems. In Section 5, using two high-dimensional inverse problems involving an elliptic equation and an advection-diffusion equation, we demonstrate that our methods can speed up traditional MCMC up to three orders of magnitude. Section 6 concludes with some discussion of future research directions.

In many physics-constrained inverse problems, we are interested in finding an unknown parameter function u based on finite-dimensional observations y. There is a forward parameter-to-observation mapping G : X → Y from a separable Hilbert space X to the data space Y (e.g. Y = R^m for m ≥
1) that connects u to y as follows:

y = G(u) + η,   (1)

where η ∈ Y is assumed to be Gaussian noise η ∼ N(0, Γ). We can define the potential function (negative log-likelihood), Φ : X × Y → R, as

Φ(u; y) = (1/2) ‖y − G(u)‖²_Γ = (1/2) ⟨y − G(u), Γ⁻¹(y − G(u))⟩.   (2)

For given u, the forward output G(u) is obtained by solving large and complex ODE/PDE systems and thus G could be highly non-linear. Moreover, repeated evaluations of G(u) could be expensive for different u's, which could appear as coefficients or boundary conditions in these systems.

In the Bayesian setting, a prior measure µ is imposed on u, independent of η. For example, we could assume a Gaussian prior µ = N(0, C) with the covariance C being a positive, self-adjoint and trace-class operator on X. Then we can get the posterior of u, denoted as µ^y, using Bayes' theorem [64, 65]:

dµ^y/dµ (u) = (1/Z) exp(−Φ(u; y)),   if 0 < Z := ∫_X exp(−Φ(u; y)) µ(du) < +∞.

Notice that the posterior µ^y can exhibit strongly non-Gaussian behavior, posing enormous challenges for efficiently sampling from the distribution. For simplicity, we drop y from the terms involved, so we denote the posterior as µ(du) and the potential function as Φ(u).

Among the above-mentioned challenges in Bayesian UQ for inverse problems, the high dimensionality of the discretized parameter u makes the forward evaluation computationally intensive and also challenges the robustness of sampling algorithms. To address all these issues, the CES framework has been proposed for approximate Bayesian parameter learning, which consists of the following three stages [17]:

1. Calibration: use optimization (ensemble Kalman) based algorithms to obtain parameter estimates and collect expensive forward evaluations for emulation;
2.
Emulation: recycle forward evaluations from the calibration stage to build an emulator for sampling;
3.
Sampling : sample the posterior approximately based on the emulator which is muchcheaper than the original forward mapping.The CES scheme is promising for high-dimensional Bayesian UQ in physics-constrainedinverse problems. The repeated forward solving ODE/PDE systems dominates the com-putation. Emulation bypasses the expensive evaluation of original forward models, makingthe sampling only a small computational overhead. The sampling also benefits from thecalibration which provides MCMC algorithms a good initial point in high density regionso that the burning time can be reduced.In this paper, we choose neural networks for emulation instead of GP used in [17]for the computational efficiency and flexibility. Moreover, we extract gradients from theneural networks and apply gradient-based ∞ -dimensional MCMC for sampling. AE is alsoadopted as a dimension reduction technique to further improve the efficiency of the originalCES method. In the following, we review ensemble Kalman methods for calibration and ∞ -dimensional MCMC algorithms for sampling. Initializing J ensemble particles { u ( j ) } Jj =1 with, for example, prior samples, the basic en-semble Kalman inversion (EKI) method uses the following iterative equation to estimatethe unknown function u : u ( j ) n +1 = u ( j ) n + C upn ( C ppn + h − Γ) − ( y ( j ) n +1 − G ( u ( j ) n )) , j = 1 , · · · , J, n = 1 , · · · , N − h = 1 /N , y ( j ) n +1 = y + ξ ( j ) n +1 with ξ ( j ) n +1 ∼ N (0 , h − Σ ), u n := J (cid:80) Jj =1 u ( j ) n , G n := J (cid:80) Jj =1 G ( u ( j ) n ), and C ppn = 1 J J (cid:88) j =1 ( G ( u ( j ) n ) − G n ) ⊗ ( G ( u ( j ) n ) − G n ) , C upn = 1 J J (cid:88) j =1 ( u ( j ) n − u n ) ⊗ ( G ( u ( j ) n ) − G n )It can be shown [18] that Equation (3) has the following time-continuous limit as h → du ( j ) dt = 1 J J (cid:88) k =1 (cid:42) G ( u ( k ) ) − G , y − G ( u ( j ) ) + √ Σ dW ( j ) dt (cid:43) Γ ( u ( k ) − u ) (4)where { W ( j ) } are independent cylindrical Brownian motions on Y . We can set Σ = 0 toremove noise for an optimization algorithm, or set Σ = Γ to add noise for a dynamics thattransforms the prior to the posterior in one time unit for linear forward maps [18, 17].5igure 1: Comparing the estimation of standard deviation by MCMC (left) and ensembleKalman methods (right two) in an elliptic inverse problem (Section 5.1).A variant of EKI to approximately sample from the posterior is ensemble Kalmansampler (EKS) [17]. This is obtained by adding a prior-related damping term as in [21],and modifying the position-dependent noise in Equation (4): du ( j ) dt = 1 J J (cid:88) k =1 (cid:68) G ( u ( k ) ) − G , y − G ( u ( j ) ) (cid:69) Γ ( u ( k ) − u ) − C ( u ) C − u ( j ) + (cid:112) C ( u ) dW ( j ) dt (5)where C ( u ) := J (cid:80) Jj =1 ( u ( j ) − u ) ⊗ ( u ( j ) − u ). The time discretization by means of a linearlyimplicit split-step scheme is given by [17] with an adaptive time scheme ∆ t n as in [66]. u ( ∗ ,j ) n +1 = u ( j ) n + ∆ t n J J (cid:88) k =1 (cid:68) G ( u ( k ) n ) − G , y − G ( u ( j ) n ) (cid:69) Γ u ( k ) n − ∆ t n C ( u n ) C − u ( ∗ ,j ) n +1 u ( j ) n +1 = u ( ∗ ,j ) n +1 + (cid:112) t n C ( u n ) ξ ( j ) n (6)However, due to the collapse of ensembles [18–21], sample variance estimated by ensem-bles tends to underestimate the true uncertainty. Therefore, these methods do not providea rigorous basis for systematic UQ. 
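Before turning to the sampling correction, the following is a rough NumPy sketch of one step of the basic (stochastic) ensemble Kalman inversion update described above; the function and variable names are ours, h = 1/N is the generic step size, and the data perturbation corresponds to the choice Σ = Γ. It is a sketch of the generic form of Equation (3), not a particular implementation.

```python
import numpy as np

def eki_step(U, G_fun, y, Gamma, h=1.0, rng=np.random.default_rng(0)):
    """One basic ensemble Kalman inversion (EKI) step.

    U: (J, d) ensemble of parameters; G_fun maps a (d,) vector to an (m,) output;
    y: (m,) data; Gamma: (m, m) noise covariance. The data are perturbed with
    N(0, h^{-1} Gamma) noise (the choice Sigma = Gamma in the text).
    """
    J, d = U.shape
    GU = np.stack([G_fun(u) for u in U])   # (J, m) forward evaluations (reused later for emulation)
    u_bar, g_bar = U.mean(axis=0), GU.mean(axis=0)
    du, dg = U - u_bar, GU - g_bar
    C_up = du.T @ dg / J                   # (d, m) parameter-output covariance
    C_pp = dg.T @ dg / J                   # (m, m) output covariance
    Y = y + rng.multivariate_normal(np.zeros(len(y)), Gamma / h, size=J)   # perturbed data
    K = C_up @ np.linalg.solve(C_pp + Gamma / h, np.eye(len(y)))           # Kalman-type gain
    return U + (Y - GU) @ K.T
```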
Figure 1 illustrates that both EKI and EKS severelyunderestimate the posterior standard deviation (plot in the 2d domain) of the parameterfunction in an elliptic inverse problem (see more details in Section 5.1). In what follows,we discuss scalable (dimension robust) inference methods and propose to use them in thesampling step of CES. ∞ -MCMC) Traditional Metropolis-Hastings algorithms are characterized by deteriorating mixing timesupon mesh-refinement or with increasing dimensions. In contrast, a new class of di-6ension robust algorithms, including preconditioned Crank-Nicolson (pCN) [55], infinite-dimensional MALA ( ∞ -MALA) [53], infinite-dimensional HMC ( ∞ -HMC) [54] and infinite-dimensional manifold MALA ( ∞ -mMALA) [57], are well-defined on the infinite-dimensionalHilbert space, and thus yield the important computational benefit of dimension-independentmixing times for finite, but high-dimensional problems in practice.Consider the following continuous-time Hamiltonian dynamics: d udt + K (cid:8) C − u + D Φ( u ) (cid:9) = 0 , (cid:18) v := dudt (cid:19)(cid:12)(cid:12)(cid:12)(cid:12) t =0 ∼ N (0 , K ) . (7)If we let K ≡ C , Equation (7) preserves the total energy H ( u, v ) = Φ( u ) + (cid:107) v (cid:107) K . HMCalgorithm [67] solves the dynamics (7) using the following St¨ormer-Verlet symplectic inte-grator [68]: v − = v − αε C D Φ( u ) ; (cid:20) u ε v + (cid:21) = (cid:20) cos ε sin ε − sin ε cos ε (cid:21) (cid:20) u v − (cid:21) ; v ε = v + − αε C D Φ( u ε ) . (8)Equation (8) gives rise to the leapfrog map Ψ ε : ( u , v ) (cid:55)→ ( u ε , v ε ). Given a time horizon τ and current position u , the MCMC mechanism proceeds by concatenating I = (cid:98) τ /ε (cid:99) stepsof leapfrog map consecutively, u (cid:48) = P u (cid:8) Ψ Iε ( u, v ) (cid:9) , v ∼ N (0 , K ) . where P u denotes the projection onto the u -argument. Then, the proposal u (cid:48) is acceptedwith probability a ( u, u (cid:48) ) = 1 ∧ exp( − ∆ H ( u, v )), where∆ H ( u, v ) = H (Ψ Iε ( u, v )) − H ( u, v )=Φ( u I ) − Φ( u ) − α ε (cid:110) (cid:107)C D Φ( u I ) (cid:107) − (cid:107)C D Φ( u ) (cid:107) (cid:111) − αε I − (cid:88) i =0 ( (cid:104) v i , D Φ( u i ) (cid:105) + (cid:104) v i +1 , D Φ( u i +1 ) (cid:105) )This yields ∞ -HMC [54]. We can use different step-sizes in (8): ε for the first and thirdequations, and ε for the second equation and let I = 1, ε = h, cos ε = − h/ h/ , sin ε = √ h h/ . Then, ∞ -HMC reduces to ∞ -MALA, which can also be derived from Langevindynamics [53, 58]. When α = 0, ∞ -MALA further reduces to pCN [58], which does notuse gradient information and can be viewed as an infinite-dimensional analogy of randomwalk Metropolis. 7igure 2: A typical architecture of convolutional neural network (CNN). As we mentioned before, two majors challenges limit the scalability of Bayesian UQ forinverse problems: the intensive computation required for repeated evaluations of likelihood(potential), Φ( u ), and the large dimensionality of the discretized space. If we use ∞ -MALAor ∞ -HMC, we also need the gradient D Φ( u ), which is often not available. Here, we willuse neural networks to address these issues. More specifically, we train CNN to emulatethe forward evaluation and AE to reduce the parameter dimensionality. In the following,we discretize the parameter function u and denote its dimension as d . We still denote thediscretized parameter as u when there is no confusion. 
Usually we have d (cid:29) The ensemble-based algorithms in the calibration phase produce parameters and forwardsolutions { u ( j ) n , G ( u ( j ) n ) } Jj =1 for n = 0 , · · · , N . These input-output pairs can be used to traina DNN as an emulator G e of the forward mapping G [31]: G e ( u ; θ ) = F K − ◦ · · · ◦ F ( u ) , F k ( · ) = g k ( W k · + b k ) ∈ C ( R d k , R d k +1 ) (9)where d = d , d K = m ; W k ∈ R d k +1 × d k , b k ∈ R d k +1 , θ k = ( W k , b k ), θ = ( θ , · · · , θ K − );and g k ’s are activation functions. There are multiple choices of activation functions, e.g. g k ( x ) = ( σ ( x ) , · · · , σ ( x d k +1 )) with σ ∈ C ( R , R ) including rectified linear unit (ReLU, σ ( x i ) = 0 ∨ x i ), leaky ReLU ( σ ( x i ; α ) = x i I ( x i ≥
0) + αx i I ( x i < σ ( x i ) =( e x i − / ( e x i + 1)); or alternatively, g k ( x ) = ( σ ( x ) , · · · , σ d k +1 ( x )) ∈ C ( R d k +1 , R d k +1 )such as softmax ( σ i ( x ) = e x i / (cid:80) d k +1 i (cid:48) =1 e x i (cid:48) ).In many physics-constrained inverse problems, the parameter function u is defined overa 2-d or 3-d field, which possesses unique spatial features resembling an image. This has8otivated our choice of CNN for emulators. Inspired by biological processes, where theconnectivity pattern between neurons resembles the organization of visual cortex [69], CNNhas become a powerful tool in image recognition and classification [35]. As a regularizedneural network with varying depth and width, CNN has much fewer connections and thusfewer training parameters compared with standard fully connected DNN of similar size.Therefore, CNN is preferred to DNN in the CES framework due to its flexibility andcomputational efficiency.In general, CNN consists of a series of convolution layers with filters (kernels), poolinglayers to extract features, and fully connected layers to connect to outputs. The con-volutional layer is introduced for sparse interaction, parameter sharing, and equivariantrepresentations [31]. Therefore, instead of full matrix multiplication, we consider the fol-lowing discrete convolution [44] F k ( · ) = g k ( w k ∗ · + b k ) ∈ C ( R d k , R d k +1 ) (10)where w k is a kernel function with discrete format defined by (multiplying) a circulantmatrix W ∗ k . Convolution is the first layer to extract spatial features (corresponding tofilters) from an input image. Each image could be seen as a c × h × w array of pixels( c = 3 for RGB images and c = 1 for grayscale images; h, w are image size in height andwidth respectively). For convolution of the image with the kernel, CNN slides a window ofpre-specified size (kernel size) with certain stride (step size) over the image. The resultingoperation typically reduces dimension.After the convolutional layer, a pooling layer is added to reduce the number of param-eters by generating summary statistics of the nearby outputs. Such operation is a formof non-linear down-sampling that sparsifies the neural network but retains the most im-portant information of the input image. Spatial pooling can be of different types such asmax-pooling, average-pooling, and sum-pooling. Finally, we use a dense layer to generateforward outputs {G ( u ) } . Figure 2 illustrates the structure of a CNN used in the ellipticinverse problem (Section 5.1).CES [1] uses GP for the emulation step. However, CNN has several advantages overGP for building the emulator: 1) it is computationally more efficient for large training sets,2) it is less sensitive to the locations of training samples, and 3) we could take advantageof all the ensemble samples collected by EKI or EKS to train CNN without the need tocarefully “design” a training set of controlled size as it is common in GP. After the emulatoris trained, we could approximate the potential function using the prediction of CNN:Φ( u ∗ ) ≈ Φ e ( u ∗ ) = 12 (cid:107) y − G e ( u ∗ ) (cid:107) (11)In the sampling stage, significant computation will be saved if we use Φ e instead of Φ inthe accept/reject step of MCMC. If the emulator is a good representation of the forwardmapping, then the difference between Φ e and Φ is small. Thus the samples by such emula-tive MCMC have the stationary distribution with small discrepancy compared to the trueposterior µ ( du ). 
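The sketch below illustrates this kind of CNN emulator in Keras for the elliptic example discussed later (a 41 × 41 discretized parameter treated as a one-channel image, mapped to m = 25 observations), together with the emulated potential of Equation (11). The filter sizes, strides and unit counts are illustrative guesses, not the exact architecture of Figure 2.

```python
import tensorflow as tf

m = 25  # number of observations (elliptic example)
cnn = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(41, 41, 1)),
    tf.keras.layers.Conv2D(16, kernel_size=3, strides=2, activation="softplus"),
    tf.keras.layers.Conv2D(32, kernel_size=3, strides=2, activation="softplus"),
    tf.keras.layers.MaxPooling2D(pool_size=2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(256, activation="softmax"),   # latent layer
    tf.keras.layers.Dense(m, activation="linear"),      # forward outputs G^e(u)
])
cnn.compile(optimizer="adam", loss="mse")
# cnn.fit(U_train.reshape(-1, 41, 41, 1), G_train, batch_size=64, epochs=100)  # calibration ensembles

def emulated_potential(u_img, y, Gamma_inv):
    # Phi^e(u) = 0.5 (y - G^e(u))^T Gamma^{-1} (y - G^e(u)), cf. Equation (11)
    pred = cnn(u_img[None, ..., None])[0]
    r = tf.constant(y, dtype=pred.dtype) - pred
    return 0.5 * tf.reduce_sum(r * tf.linalg.matvec(tf.constant(Gamma_inv, dtype=pred.dtype), r))
```

In the sampling stage, this emulated potential replaces the expensive PDE-based evaluation in the accept/reject step.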
Figure 3: Comparing the emulation G^e : R^1681 → R^25 in an elliptic inverse problem (Section 5.1) by GP, DNN and CNN in terms of error (left) and time (right).

In gradient-based MCMC algorithms, we need to calculate DΦ(u) – the derivative of the (log) density function Φ(u) with respect to the parameter function u. This is not always available because the forward mapping may not be differentiable, or the derivative may not be provided with the solutions of the ODE/PDE. However, (almost everywhere) differentiability is required for each layer of a neural network as it uses back-propagation in the training. Therefore, the gradient of the emulated potential function can be obtained by the chain rule

DΦ^e(u*) = −⟨y − G^e(u*), DG^e(u*)⟩_Γ,   (12)

where DG^e(u*) can be the output of the CNN's back-propagation, e.g. implemented in GradientTape of TensorFlow. We can see that DΦ(u*) ≈ DΦ^e(u*) if DΦ(u*) exists.

The following theorem generalizes [70] and gives the error bound of the CNN emulator in approximating the true potential Φ and its gradient DΦ.

Theorem 3.1.
Let 2 ≤ s ≤ d and Ω ⊂ [−1, 1]^d. Assume G_j ∈ H^r(R^d) for r > 2 + d/2, j = 1, ..., m. If K ≥ 2d/(s − 2), then there exists G^e given by a CNN with ReLU activation function such that

‖Φ − Φ^e‖_{H^1(Ω)} ≤ c ‖G‖ √(log K) K^{−1/2 − 1/(2d)},   (13)

where ‖Φ‖_{H^1(Ω)} = (‖Φ‖²_{L²(Ω)} + ‖DΦ‖²_{L²(Ω)})^{1/2}, c is an absolute constant, and ‖G‖ = max_{1≤j≤m} ‖G_j‖_{H^r(R^d)} with ‖G_j‖_{H^r(R^d)} := ‖(1 + |ω|²)^{r/2} Ĝ_j(ω)‖_{L²(R^d)}.

Proof. See Appendix A.
Remark 1.
Based on Theorem 3 of [71], we have a weaker bound with sup-norm: (cid:107) Φ − Φ e (cid:107) W , ∞ (Ω) ≤ ˜ c (cid:107)G(cid:107) K − (14)10igure 4: A typical architecture of autoencoder (AE) neural network. where (cid:107) Φ (cid:107) W , ∞ (Ω) = max ≤ i ≤ d (cid:107) D i Φ (cid:107) L ∞ (Ω) ( D Φ = Φ ). Since H r (Ω) (cid:44) → W , ∞ (Ω) (cid:44) → C (Ω) , we can show that: • For any continuous forward mapping G , a CNN emulator, G e , with depth K can bebuilt such that (cid:107) Φ − Φ e (cid:107) L ∞ (Ω) → as K → ∞ . • If G is further continuously differentiable, a CNN, G e , with depth K exists such that (cid:107) Φ − Φ e (cid:107) W , ∞ (Ω) → as K → ∞ . Even if D Φ( u ∗ ) does not exist, such gradient information D Φ e ( u ∗ ) can still be extractedfrom the emulator G e to inform the landscape of Φ in the vicinity of u ∗ . Note that wetrain CNN only on { u ( j ) n , G ( u ( j ) n ) } as opposed to { u ( j ) n , D G ( u ( j ) n ) } . That is, no gradientinformation is used for training. This is similar to extracting geometric information fromGP emulator [28, 29]. Figure 3 compares GP, DNN, and CNN in emulating a forward mapthat takes a 1681(41 ×
41) dimensional discretized parameter function as input, with 25 observations taken from the solution of an elliptic PDE as the output (see more details in Section 5.1). Given limited training data, CNN outperforms both GP and DNN, rendering smaller approximation errors in less computation time.
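As a hedged illustration of how such gradients can be extracted by back-propagation (Equation (12)), the snippet below differentiates the emulated potential with tf.GradientTape. It assumes the emulator cnn, data y and noise precision Gamma_inv from the earlier sketch; all names are placeholders rather than the authors' actual code.

```python
import tensorflow as tf

def emulator_gradient(u_img, y, Gamma_inv):
    """D Phi^e(u) = -<y - G^e(u), D G^e(u)>_Gamma via back-propagation (Equation (12)).

    u_img: (41, 41) array; returns the gradient of the emulated potential with
    respect to the discretized parameter, reshaped to the mesh.
    """
    u = tf.Variable(u_img[None, ..., None], dtype=tf.float32)
    with tf.GradientTape() as tape:
        r = tf.constant(y, dtype=tf.float32) - cnn(u)[0]
        phi_e = 0.5 * tf.reduce_sum(r * tf.linalg.matvec(tf.constant(Gamma_inv, dtype=tf.float32), r))
    return tape.gradient(phi_e, u)[0, ..., 0].numpy()
```

No gradient data are needed for training; the derivative is obtained purely by differentiating the fitted network.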
Although we can reduce computation by emulation, the MCMC algorithms we use for Bayesian inference are still defined in high-dimensional spaces. In this section, we use AE to reduce the dimensions and further speed up the UQ process [63]. AE is a special type of feed-forward neural network for latent representation learning. The input is encoded into a low-dimensional latent representation (code). The code is then decoded into a reconstruction of the original input (see Figure 4 for the structure of an AE). The model is trained to minimize the difference between the input and the reconstruction. An AE could learn complicated nonlinear dimensionality reduction and thus is widely used in many challenging tasks such as image recognition and artificial data generation [62].

While AE is commonly used to reduce the dimensionality of the data, here we use it to reduce the dimensionality of the parameter space. Denote the latent space as X_L with dimensionality d_L ≪ d. Let u_L ∈ X_L be the latent representation of the parameter u. Then the encoder φ and the decoder ψ are defined respectively as follows:

φ : X → X_L, u ↦ u_L;   ψ : X_L → X, u_L ↦ u_R,   (15)

where u_R ∈ X is a reconstruction of u; φ and ψ can be chosen as multilayer neural networks similarly as in Equation (9). Depending on the layers and structures, we could have convolutional AE (CAE) [72, 73], variational AE (VAE) [74, 75], etc.

According to the universal approximation theorem [76–78], a feed-forward artificial neural network can approximate any continuous function given some mild assumptions about the activation functions. Theoretically, an AE with suitable activation functions could represent an identity map, i.e. ψ ∘ φ = id. An accurate reconstruction of the input implies a good low-dimensional representation encoded in φ. In practice, the success of the algorithm heavily relies on the quality of the trained AE. Note that we train the AE with ensembles {u_n^{(j)}, j = 1, ..., J, n = 0, ..., N} from the calibration stage. Even though there is a difference between ψ ∘ φ and the identity map id, AE could provide a reconstruction ψ ∘ φ(u) very close to the original parameter u. See Figure 10b (Section 5.1) and Figure 15b (Section 5.2) for examples.

The potential function Φ(u) and its derivative DΦ(u) can be projected to the latent space X_L – denoted as Φ_r(u_L) and DΦ_r(u_L) respectively – as follows:

Φ_r(u_L) = Φ(u) = Φ(ψ(u_L)),   DΦ_r(u_L) = (∂u/∂u_L)^T ∂Φ(u)/∂u = (dψ(u_L))^T DΦ(ψ(u_L)),   (16)

where dψ = ∂u/∂u_L is the Jacobian matrix of size d × d_L for the decoder ψ. The derivative information DΦ_r(u_L) needed in the gradient-based MCMC, ∞-MALA and ∞-HMC, will be discussed in Section 3.3. In practice, we avoid explicit computation of the Jacobian matrix dψ by calculating the Jacobian-vector action altogether:

DΦ_r(u_L) = ∂/∂u_L [ψ(u_L)^T DΦ(ψ(u_L))],   (17)

which is output by the AE's back-propagation.

Note, we can combine the emulation to further reduce the computation. The resulting approximate MCMC algorithms in the latent space involve a potential function and its derivative, denoted as Φ^e_r(u_L) and DΦ^e_r(u_L) respectively, which are defined by replacing Φ with Φ^e in Equation (16).
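A minimal sketch of such an autoencoder and of the latent projection in Equations (16)–(17) is given below in Keras. The layer widths, latent dimension and the potential_and_grad callback are illustrative assumptions; the Jacobian-vector product (dψ)ᵀDΦ is obtained by back-propagating ψ(u_L)ᵀDΦ through the decoder rather than forming dψ explicitly.

```python
import tensorflow as tf

d, d_L = 1681, 121   # illustrative: discretized-parameter and latent dimensions
encoder = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(d,)),
    tf.keras.layers.Dense(512), tf.keras.layers.LeakyReLU(2.0),
    tf.keras.layers.Dense(d_L), tf.keras.layers.LeakyReLU(2.0),
])
decoder = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(d_L,)),
    tf.keras.layers.Dense(512), tf.keras.layers.LeakyReLU(2.0),
    tf.keras.layers.Dense(d, activation="linear"),
])
autoencoder = tf.keras.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(U_train, U_train, ...)   # reconstruction loss on calibration ensembles

def latent_potential_and_gradient(u_L_np, potential_and_grad):
    """Project Phi and D Phi to the latent space (Equations (16)-(17)).

    potential_and_grad maps a d-vector to (Phi(u), D Phi(u)) in the original
    space (e.g. the emulated quantities); (d psi)^T D Phi is then obtained by
    back-propagating u^T D Phi through the decoder with D Phi held fixed.
    """
    u_L = tf.Variable(u_L_np[None, :], dtype=tf.float32)
    with tf.GradientTape() as tape:
        u = decoder(u_L)[0]
        phi, dphi = potential_and_grad(u.numpy())      # evaluated at psi(u_L), treated as constant
        proj = tf.reduce_sum(u * tf.constant(dphi, dtype=tf.float32))
    return phi, tape.gradient(proj, u_L)[0].numpy()
```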
Figure 5: Relationship among quantities in various MCMC algorithms. Node sizes indicate relative dimensions of these quantities. Thick solid arrows mean training neural networks. Dashed arrows with colors represent mappings that are not directly calculated but actually have equivalent compositions indicated by the same color, e.g. u ↦ Φ^e(u) (dashed red arrow) obtained by training the CNN (thick solid red arrow) followed by network prediction (solid red arrow); or by color mixing, e.g. u_L ↦ Φ^e_r(u_L) (dashed orange arrow) as a result of combining the decoder ψ (thick solid yellow arrow), u ↦ Φ^e(u) (dashed red arrow), and the change of variable (solid red arrow).

Next, we combine all the techniques discussed above to speed up Bayesian UQ for inverse problems. More specifically, our main method is composed of the following three stages:
1.
Calibration: collect
J N samples { u ( j ) n , G ( u ( j ) n ) } j,n from EKI or EKS procedure;2. Emulation : build an emulator of the forward mapping G e based on { u ( j ) n , G ( u ( j ) n ) } j,n (and extract D G e ) using CNN; train an AE ( φ, ψ ) based on { u ( j ) n } j,n ;3. Sampling : run approximate MCMC based on emulation to propose u (cid:48) from u :i) obtain the projection of u by u L = φ ( u );ii) propose u (cid:48) L from u L by ∞ -MCMC (with Φ er and D Φ er ) in the latent space X L ;iii) obtain the sample u (cid:48) = ψ ( u (cid:48) L )In the class of ∞ -MCMC, we can use the emulated potential and its derivative instead ofthe faithful evalutions. We refer to the resulting algorithms as emulative ∞ -MCMC (e-MCMC) . Further, we can use AE to project these approximate MCMC into low-dimensional13atent space. We denote these algorithms as dimension-reduced emulative autoencoder ∞ -MCMC (DREAM) . Figure 5 illustrates the relationship among the quantities involved inthese MCMC algorithms. For example, the mapping u L (cid:55)→ D Φ er ( u L ) (dashed brown arrow)is not directly calculated, but we could combine the decoder ψ (thick solid yellow arrow),emulated gradient u (cid:55)→ D Φ e ( u ) (dashed violet arrow), and left multiplying Jacobian matrix( dψ ) T (solid violet arrow).We note that if we accept/reject proposals u (cid:48) L in the latent space with Φ er , then there isno need to traverse between the original space and the latent space constantly. The chaincan mainly stay in the latent space X L to collect samples { u L } , as shown in the bottomlevel of Figure 5, and only needs to go back to the original space X when relevant emulatedquantities are needed. In the following, we describe the details of DREAM algorithms.For the convenience of following disposition, we first whiten the coordinates by thetransformation u (cid:55)→ ˜ u := C − u . The whitened variable ˜ u has the prior ˜ µ = N (0 , I ),where the identity covariance operator is not a trace-class on X . However, random drawsfrom ˜ µ are square-integrable in the weighted space Im( C − ). We can still get a well-definedfunction space proposal for parameter u after inverting the transformation [59, 61]. In thewhitened coordinates ˜ u , the Langevin and Hamiltonian (7) dynamics (with algorithmicparameter α ≡
1) become the following respectively d ˜ udt = − (cid:8) I ˜ u + αD Φ(˜ u ) (cid:9) + dWdt (18) d ˜ udt + (cid:8) I ˜ u + αD Φ(˜ u ) (cid:9) = 0 , (cid:18) ˜ v := d ˜ udt (cid:19)(cid:12)(cid:12)(cid:12)(cid:12) t =0 ∼ N (0 , I ) . (19)where D Φ(˜ u ) = C D Φ( u ). Then we can train CNN based on { ˜ u ( j ) n , G (˜ u ( j ) n ) } j,n and AEbased on { ˜ u ( j ) n } j,n .On the other hand, since the AE does not preserve the volume ( ψ ◦ φ ≈ id but ψ ◦ φ (cid:54) = id ), the acceptance of proposals in the latent space needs to be adjusted with a volumecorrection term V (cid:48) V in order to maintain the ergodicity [79, 63]. Note, the volume adjustmentterm V (cid:48) V breaks into the product of Jacobian determinants of the encoder φ and the decoder ψ that can be calculated with Gramian function as follows [63]: V (cid:48) V = det( dψ (˜ u (cid:48) L )) det( dφ (˜ u )) = (cid:118)(cid:117)(cid:117)(cid:116) det (cid:34)(cid:18) ∂ ˜ u (cid:48) ∂ ˜ u (cid:48) L (cid:19) T (cid:18) ∂ ˜ u (cid:48) ∂ ˜ u (cid:48) L (cid:19)(cid:35)(cid:118)(cid:117)(cid:117)(cid:116) det (cid:34)(cid:18) ∂ ˜ u L ∂ ˜ u (cid:19) (cid:18) ∂ ˜ u L ∂ ˜ u (cid:19) T (cid:35) (20)where terms under square root are determinants of matrices with small size d L × d L , whichcan be obtained by the singular value decomposition of the Jacobian matrices respectively.In practice, we can exclude V (cid:48) V from the acceptance probability and use it as a resamplingweight as in importance sampling [80]. Alternatively, we can ignore the accept/reject stepfor an approximate Bayesian UQ [81, 63]. 14 .3.1 DREAM-pCN In the whitened latent space, pCN proposal becomes˜ u (cid:48) L = ρ ˜ u L + (cid:112) − ρ ˜ ξ L , ˜ ξ L ∼ N (0 , I d L ) (21)If we use emulated potential energy, then the acceptance probability with volume adjust-ment (20) of the resulting DREAM-pCN algorithm is a (˜ u L , ˜ u (cid:48) L ) = 1 ∧ exp {− Φ er (˜ u (cid:48) L ) + Φ er (˜ u L ) + log det( dψ (˜ u (cid:48) L )) + log det( dφ (˜ u )) } ∞ -MALA Based on the Langevin dynamics in the whitened coordinates, ∞ -MALA proposal in thelatent space with emulated gradient becomes˜ u (cid:48) L = ρ ˜ u L + (cid:112) − ρ ˜ v L , ˜ v L = ˜ ξ L − α √ h D Φ er (˜ u L ) , ρ = (1 − h ) / (1 + h ) . (22)The resulting DREAM- ∞ -MALA has adjusted acceptance probability a (˜ u L , ˜ u (cid:48) L ) = 1 ∧ κ (˜ u (cid:48) L , ˜ u L ) κ (˜ u L , ˜ u (cid:48) L ) V (cid:48) V with V (cid:48) V as in (20) and κ (˜ u L , ˜ u (cid:48) L ) = exp( − Φ er (˜ u L )) · exp (cid:40) − α h (cid:107) D Φ er (˜ u L ) (cid:107) − α √ h (cid:104) D Φ er (˜ u L ) , ˜ v L (˜ u L , ˜ u (cid:48) L ) (cid:105) (cid:41) ∞ -HMC To derive ∞ -HMC in the latent space based on the Hamiltonian dynamics in the whitenedcoordinates (19), we also need to project ˜ v ∼ N (0 , I d ) into d L -dimensional latent space. Wecould have used the same encoder φ as in [63]; however, since ˜ v i iid ∼ N (0 ,
1) for i = 1 , · · · , d ,we just set ˜ v L ∼ N (0 , I d L ) for simplicity. Then, the ∞ -HMC proposal Ψ ε : (˜ u L, , ˜ v L, ) (cid:55)→ (˜ u L,ε , ˜ v L,ε ) in the whitened augmented latent space with emulated gradient becomes˜ v − L = ˜ v L, − αε D Φ er (˜ u L, ) ; (cid:20) ˜ u L,ε ˜ v + L (cid:21) = (cid:20) cos ε sin ε − sin ε cos ε (cid:21) (cid:20) ˜ u L, ˜ v − L (cid:21) ;˜ v L,ε = ˜ v + L − αε D Φ er (˜ u L,ε ) . (23)The acceptance probability for the resulting DREAM- ∞ -HMC algorithm involves H (˜ u L , ˜ v L ) =Φ er (˜ u L ) + (cid:107) ˜ v L (cid:107) and becomes a (˜ u L , ˜ u (cid:48) L ) = 1 ∧ exp( − ∆ H (˜ u L , ˜ v L )) V (cid:48) V with V (cid:48) V as in (20) and15 lgorithm 1 Dimension Reduced Emulative Autoencoder ∞ -dimensional HMC(DREAM- ∞ -HMC) Require:
Collect
N·J samples {u_n^{(j)}, G(u_n^{(j)})}_{j,n} from the EKI or EKS procedure; whiten coordinates {ũ_n^{(j)} = C^{−1/2} u_n^{(j)}}_{j,n}. Require:
Build an emulator of the forward mapping G e based on { ˜ u ( j ) n , G ( j ) } j,n (and ex-tract D G e ) using CNN; train an AE ( φ, ψ ) based on { ˜ u ( j ) n } j,n ; Initialize current state ˜ u (0) and project it to the latent space by ˜ u (0) L = φ (˜ u (0) ) Sample velocity ˜ v (0) L ∼ N (0 , I d L ) Calculate current energy E = Φ er (˜ u (0) L ) − α ε (cid:107) D Φ er (˜ u (0) L ) (cid:107) + log det( dφ (˜ u (0) )) for i = 0 to I − do Run Ψ ε : (˜ u ( i ) L , ˜ v ( i ) L ) (cid:55)→ (˜ u ( i +1) L , ˜ v ( i +1) L ) according to (23). Update the energy E ← E + αε ( (cid:104) ˜ v L,i , D Φ er (˜ u L,i ) (cid:105) + (cid:104) ˜ v L,i +1 , D Φ er (˜ u L,i +1 ) (cid:105) ) end for Calculate new energy E = Φ er (˜ u ( I ) L ) − α ε (cid:107) D Φ er (˜ u ( I ) L ) (cid:107) − log det( dψ (˜ u ( I ) L )) Calculate acceptance probability a = exp( − E + E ) Accept ˜ u ( I ) L with probability a for the next state ˜ u (cid:48) L or set ˜ u (cid:48) L = ˜ u (0) L in the latent space. Record the next state u (cid:48) = C ψ (˜ u (cid:48) L ) in the original space.∆ H (˜ u L , ˜ v L ) = H (Ψ Iε (˜ u L , ˜ v L )) − H (˜ u L , ˜ v L )=Φ(˜ u L,I ) − Φ(˜ u L, ) − α ε (cid:8) (cid:107) D Φ er (˜ u L,I ) (cid:107) − (cid:107) D Φ er (˜ u L, ) (cid:107) (cid:9) − αε I − (cid:88) i =0 ( (cid:104) ˜ v L,i , D Φ er (˜ u L,i ) (cid:105) + (cid:104) ˜ v L,i +1 , D Φ er (˜ u L,i +1 ) (cid:105) ) (24)We summarize DREAM- ∞ -HMC in Algorithm 1, which includes DREAM- ∞ -MALA with I = 1 and DREAM-pCN with α = 0. In this section, we investigate two low-dimensional inverse problems: a three-dimensionallinear Gaussian inverse problem with analytic solution of posterior distribution and a four-dimensional nonlinear banana-biscuit-donoughnut (BBD) distribution [29] with complexgeometry. We illustrate the validity of our proposed emulative ∞ -MCMC and DREAMalgorithms using these two examples. 16igure 6: Linear Gaussian inverse problem: pairwise marginal posterior density estimation. We first consider a three-dimensional linear Gaussian inverse problem. y = G ( u ) + η, η ∼ N (0 , Γ) G ( u ) = Au,u ∼ N (0 , Σ )where A is a matrix of size m × d . In this example, we choose Γ = σ η I m with σ η = 0 . Σ = σ u I d with σ u = 1. We randomly generate A with a ij ∼ unif[0 , d = 3and set the true value u † = ( − , , m = 100 data points y = { y n } mn =1 with G ( u † ) = Au † . The inverse problem involves finding the posterior distribution of u | y whichhas the following analytic form u | y ∼ N ( µ, Σ ) µ = ΣA T Γ − y, Σ − = Σ − + A T Γ − A We follow the procedure of CES outlined in Section 3.3. First, we run EKS withensemble size J = 100 for N = 50 iterations and collect ensembles { u ( j ) n , G ( u ( j ) n ) } j,n . Thenwe train DNN with 75% of these 5000 ensembles and use the remaining 25% for testing.The DNN has 3 layers with ‘softplus’ activation function for the hidden layers and ‘linear’activation for the output layer with the units linearly interpolated between input dimension17igure 7: Comparing the emulated potential Φ e (colored contours) and its gradient D Φ e (arrows) with G e : R → R in a four-dimensional nonlinear BBD inverse problem by GP(right) and DNN (middle) with the truth (left).( d = 3) and output dimension ( m = 100). We also train AE with the same split oftraining/testing data. The AE has latent dimension d L = 2, 2 layers for encoder, 2 layersfor decoder, and activation function chosen as ‘LeakyReLU( α = 2)’. 
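For reference, the analytic posterior stated above (Σ⁻¹ = Σ₀⁻¹ + AᵀΓ⁻¹A, µ = ΣAᵀΓ⁻¹y) can be computed directly. The NumPy sketch below does so for an illustrative configuration; the random seed, the noise level σ_η, and the "true" value u† here are placeholders, not necessarily those used in the experiment.

```python
import numpy as np

rng = np.random.default_rng(2021)
d, m, sigma_eta, sigma_u = 3, 100, 0.1, 1.0
A = rng.uniform(0, 1, size=(m, d))                     # random forward matrix (placeholder)
Gamma = sigma_eta**2 * np.eye(m)                       # observation noise covariance
Sigma0 = sigma_u**2 * np.eye(d)                        # prior covariance
u_true = np.array([-1.0, 1.0, 2.0])                    # hypothetical true parameter
y = A @ u_true + rng.multivariate_normal(np.zeros(m), Gamma)

# Posterior: Sigma^{-1} = Sigma0^{-1} + A^T Gamma^{-1} A,  mu = Sigma A^T Gamma^{-1} y
Sigma = np.linalg.inv(np.linalg.inv(Sigma0) + A.T @ np.linalg.inv(Gamma) @ A)
mu = Sigma @ A.T @ np.linalg.inv(Gamma) @ y
print(mu, np.sqrt(np.diag(Sigma)))                     # reference values to compare MCMC samples against
```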
We run ∞ -HMC,emulative ∞ -HMC and DREAM ∞ -HMC (Algorithm 1) to collect 10000 posterior samplesof u after burning in the first 10000 respectively. Figure 6 shows that both emulative ∞ -HMC and DREAM ∞ -HMC generate samples very close to the original ∞ -HMC showingthat they all represent the true distribution whose 3-standard deviation contours are plottedin red ellipses. Next, we challenge our proposed methodology with a four-dimensional Banana-Biscuit-Doughnut (BBD) distribution [29]. BBD distribution was first proposed in [29] as a bench-mark for testing MCMC algorithms and has recently revisited by [63]. The name of thedistribution comes from 3 possible shapes of pairwise marginal distributions. It can be castinto a Bayesian inverse problem with parameters u = ( u , · · · , u d ): y = G ( u ) + η, η ∼ N (0 , σ η I m ) G ( u ) = A S u, S u = (cid:16) u , u , · · · , u p ( k ) k , · · · , u p ( d ) d (cid:17) , p ( k ) = (cid:40) , k odd2 , k even u ∼ N (0 , σ u I d )where A is a matrix of size m × d . Therefore G : R d → R m is a linear forward mapping of S u but a non-linear operator of u .We generate a matrix A similarly as above and true vector u † with elements beingrandom integers between 0 and d . Then, we obtain m = 100 data points { y n } mn =1 with18igure 8: BBD inverse problem: pairwise marginal posterior density estimation. G ( u † ) = A S u † , σ η = 1. In the prior, we set σ u = 1. The inverse problem involves finding u for given data { y n } mn =1 . Due to the non-identifiability of u in A S u = G ( u † ), the d = 4dimensional posterior distribution u | y resembles a banana in (1,2) dimension, a biscuit in(1,3) dimension, and a doughnut in (2,4) dimension (See Figure 19). The distribution has acomplex geometric structure, which makes it challenging for MCMC algorithms to explore.We collect training samples { u ( j ) n , G ( u ( j ) n ) } j,n from EKS ensembles with size J = 100for N = 50 iterations. Then we train DNN with a similar structure (but for 5 layers)and AE with the same configuration (but for latent dimension d L = 3) as the above linearGaussian inverse problem. As illustrated in Figure 7, emulators G e : R → R are builtfor BBD distribution and the log-likelihood is computed using (11). Compared with thetruth (left), emulation by DNN (middle) is much closer than that of GP (right).Next, we run ∞ -HMC, emulative ∞ -HMC and DREAM ∞ -HMC to collect 10000posterior samples of u after burning in the first 10000 respectively. We plot the marginaldistributions of model parameters estimated by these MCMC samples in Figure 8. Aswe can see, emulative ∞ -HMC reproduces the distributions as the original ∞ -HMC. Inthis example, the intrinsic dimension (the dimension of space containing essential datainformation) is d = 4 but the latent dimension is d L = 3. Therefore, some information islost in AE. This leads to the discrepancy between DREAM ∞ -HMC and the other twoalgorithms. Still, the DREAM algorithm recovers enough details of the posterior which ismuch better than the ensemble Kalman methods for UQ.19 a) Forcing field f ( s ) (left), and the solution p ( s ) with true transmissivity field k ( s ) (right).(b) True log-transmissivity field u † ( s ) (left), and 25 observations on selected locations indicated bycircles (right), with color indicating their values. Figure 9: Elliptic inverse problem.
In this section, we consider two high-dimensional inverse problems involving an elliptic PDE and an advection-diffusion equation. In both problems, the forward parameter-to-observation mappings are nonlinear and the posterior distributions are non-Gaussian. The high dimensionality of the discretized parameter imposes a big challenge on Bayesian UQ. The second inverse problem, involving the advection-diffusion equation, is even more difficult because it is based on spatiotemporal observations. We demonstrate substantial numerical advantages of our proposed methods and show that they indeed can scale up Bayesian UQ for PDE-constrained inverse problems to thousands of dimensions. Python codes are publicly available at https://github.com/lanzithinking/DREAM-BUQ.

5.1 Elliptic Inverse Problem
The following elliptic PDE [59, 61] is defined on the unit square domain Ω = [0, 1]²:

−∇·(k(s)∇p(s)) = f(s),   s ∈ Ω,
⟨k(s)∇p(s), n⃗(s)⟩ = 0,   s ∈ ∂Ω,
∫_∂Ω p(s) dl(s) = 0,   (25)

where k(s) is the transmissivity field, p(s) is the potential function, f(s) is the forcing term, and n⃗(s) is the outward normal to the boundary. The source/sink term f(s) is defined by the superposition of four weighted Gaussian plumes with standard deviation 0.
05, centeredat [0 . , . , [0 . , . , [0 . , . , [0 . , . { , − , , − } respectively, as shownin the left panel of Figure 9a.The transmissivity field is endowed with a log-Gaussian prior, i.e. k ( s ) = exp( u ( s )) , u ( s ) ∼ N (0 , C )where the covariance operator C is defined through an exponential kernel function C : X → X , u ( s ) (cid:55)→ (cid:90) c ( s, s (cid:48) ) u ( s (cid:48) ) ds (cid:48) , c ( s, s (cid:48) ) = σ u exp (cid:18) − (cid:107) s − s (cid:48) (cid:107) s (cid:19) , for s, s (cid:48) ∈ Ωwith the prior standard deviation σ u = 1 .
25 and the correlation length s = 0 . u † ( s ) that is not drawn from the prior, as shown on the left panel of Figure 9b. Theright panel of Figure 9a shows the potential function, p ( s ), solved with u † ( s ), which is alsoused for generating noisy observations. Partial observations are obtained by solving p ( s )on an 81 ×
81 mesh and then collecting them at 25 measurement sensors as shown by the circles on the right panel of Figure 9b. The corresponding observation operator O yields the data

y = O p(s) + η,   η ∼ N(0, σ²_η I),

where we consider the signal-to-noise ratio SNR = max_s{u(s)}/σ_η = 50 in this example. The inverse problem involves sampling from the posterior of the log-transmissivity field u(s), which becomes a vector of dimension 1681 after being discretized on a 41 × 41 mesh.

In the calibration stage, we collect {u_n^{(j)}, G(u_n^{(j)})}_{n=1,j=1}^{N,J} from N = 10 iterations of EKS runs with ensemble size J = 500. For the emulation, we train DNN and CNN with 75% of these 5000 ensembles and test/validate them with the remaining 25%. The DNN has 3 layers with 'softplus' activation for the hidden layers, 'linear' activation for the output layer, and 40% of nodes dropped out. The structure of the CNN is illustrated in Figure 2, with 'softplus' activation for the convolution layers, 'softmax' activation for the latent layer (dimension 256), and 'linear' activation for the output layer. The trained CNN has a dropout rate of 50% on all its nodes.

Figure 10: Elliptic inverse problem: outputs by neural networks viewed as 2d images. (a) CNN (middle) and DNN (right) emulation (G^e : R^1681 → R^25) extracting gradients DΦ^e(u_MAP) compared with the true gradient DΦ(u_MAP) (left). (b) AE compressing the original function u_MAP (left) into latent space u_MAP_r (middle) and reconstructing it in the original space u_MAP′ (right).

Figure 10a compares the true gradient function DΦ(u_MAP) (left) and its emulations DΦ^e(u_MAP) (right two) as in Equation (12) extracted from the two types of neural networks. These gradient functions are plotted on the 2d domain [0, 1]². We can see that, even trained on forward outputs without any gradient information, these extracted gradients from the neural network models provide decent approximations to the true gradient that capture its main graphical features viewed as a 2d image. The result by CNN is qualitatively better than that by DNN, which is supported by the numerical evidence of the error comparison illustrated in the left panel of Figure 3.

In the sampling stage, we train an AE with the structure illustrated in Figure 4. The latent dimension is d_L = 121 (11 ×
11) and the node sizes of the hidden layers between input and latent, and between latent and output, are linearly interpolated. All the activation functions are chosen as 'LeakyReLU(α = 2)'. Figure 10b plots the original u_MAP (left), the latent representation u_MAP_r = φ(u_MAP) (middle) and the reconstruction u_MAP′ = ψ(u_MAP_r) (right). Even though the latent representation is not very intuitive, the output function (image) decoded from the latent space can be viewed as a 'faithful' reconstruction of the original function (image), indicating a sufficiently good AE that compresses and restores information. Therefore, our proposed MCMC algorithms, defined on the latent space, generate samples that can be projected back to the original space without losing too much accuracy in representing the posterior distribution.

Figure 11: Elliptic inverse problem: Bayesian posterior mean estimates of the log-transmissivity field u(s) based on 5000 samples by various MCMC algorithms.

We compare the performance of algorithms including vanilla pCN, ∞-MALA, ∞-HMC, their emulative versions and the corresponding DREAM algorithms. For each algorithm, we run 6000 iterations and burn in the first 1000. For HMC algorithms, we set I = 5. We tune the step sizes for each algorithm so that they have similar acceptance rates around 60∼70%. As summarized in Table 1, DREAM ∞-HMC and DREAM ∞-MALA achieve more than 600 times speed-up in sampling efficiency, and DREAM ∞-HMC attains 3 orders of magnitude improvement compared to the benchmark pCN. Such a comparison focuses on the cost of obtaining uncertainty estimates and does not include the time for training CNN and AE, which is relatively much smaller compared with the overall sampling time.

Figure 12: Elliptic inverse problem: Bayesian posterior standard deviation estimates of the log-transmissivity field u(s) based on 5000 samples by various MCMC algorithms.

Figure 13a shows the traceplots of the potential function (data-misfit) on the left panel and autocorrelation functions on the right panel. HMC algorithms make distant proposals with the least autocorrelation, followed by MALA algorithms and then pCN algorithms with the highest autocorrelation. This is also supported numerically by the ESS of parameters (the lower the autocorrelation, the higher the ESS) in Table 1. Note that DREAM ∞-MALA has similar autocorrelation as the HMC algorithms.

Finally, we plot the Kullback–Leibler (KL) divergence between the posterior and the prior in terms of iteration (upper) and time (lower) respectively in Figure 13b. Among all the MCMC algorithms, the emulative MCMC algorithms stabilize this measurement the fastest and attain smaller values for given iterations and time.

In the following example, we quantify the uncertainty in the solution of an inverse problem governed by a parabolic PDE via the Bayesian inference framework.
Table 1: Elliptic inverse problem: sampling efficiency of various MCMC algorithms.

Method          h^a    AP^b   s/iter^c  ESS (min, med, max)^d         minESS/s^e  spdup^f  PDE solns^g
pCN             0.03   0.65   0.49      (7.8, 28.93, 73.19)           0.0032      1.00     6001
∞-MALA          0.15   0.61   0.56      (29.21, 120.79, 214.85)       0.0105      3.30     12002
∞-HMC           0.10   0.70   1.65      (547.62, 950.63, 1411.6)      0.0663      20.82    36210
e-pCN           0.05   0.60   0.02      (10.07, 43.9, 93.62)          0.0879      27.60    0
e-∞-MALA        0.15   0.67   0.03      (33.23, 133.54, 227.71)       0.2037      63.95    0
e-∞-HMC         0.10   0.77   0.07      (652.54, 1118.08, 1455.56)    1.9283      —        0
DREAM-∞-MALA    1.00   0.66   0.04      (391.53, 782.06, 927.08)      2.0988      —        —
DREAM-∞-HMC     0.60   0.64   0.11      (2289.86, 3167.03, 3702.4)    4.1720      —        —

a step size; b acceptance probability; c seconds per iteration; d (minimum, median, maximum) effective sample size; e minimal ESS per second; f comparison of minESS/s with pCN as benchmark; g number of PDE solutions.

5.2 Advection-Diffusion Inverse Problem

The underlying PDE is a time-dependent advection-diffusion equation in which we seek to infer an unknown initial condition from spatiotemporal point measurements. The parameter-to-observable forward mapping G : u₀ → O u maps an initial condition u₀ ∈ L²(Ω) to pointwise spatiotemporal observations of the concentration field u(x, t) through the solution of the following advection-diffusion equation [82, 83]:

u_t − κΔu + v·∇u = 0   in Ω × (0, T),
u(·, 0) = u₀   in Ω,
κ∇u·n⃗ = 0   on ∂Ω × (0, T),   (26)

where Ω ⊂ [0, 1]² is a bounded domain shown in Figure 14a, κ > 0 is the diffusion coefficient, and T > 0 is the final time. The velocity field v is computed by solving the following steady-state Navier-Stokes equation with the side walls driving the flow [82]:

−(1/Re)Δv + ∇q + v·∇v = 0   in Ω,
∇·v = 0   in Ω,
v = g   on ∂Ω.   (27)

Here, q is the pressure and Re is the Reynolds number, which is set to 100 in this example. The Dirichlet boundary data g ∈ R² are given by g = e₂ = (0,
1) on the left wall of the domain, g = −e₂ on the right wall, and g = 0 everywhere else. We set the true initial condition u† = 0.5 ∧ exp{−100[(x₁ − 0.35)² + (x₂ − 0.7)²]}, illustrated in the top left panel of Figure 14a, which also shows a few snapshots of the solution u at other time points on a regular grid mesh of size 61 ×
61. To obtain spatiotemporal observations, we collect the solutions u(x, t), solved on a refined mesh, at 80 selected locations across 16 time points evenly distributed between 1 and 3 seconds (thus denoted as O u), and inject some Gaussian noise N(0, σ²_η) such that the relative noise standard deviation is 0.5 (σ_η / max O u = 0.5):

y = O u(x, t) + η,   η ∼ N(0, σ²_η I).

Figure 14b plots 4 snapshots of these observations at 80 locations along the inner boundary.

Figure 13: Elliptic inverse problem: analysis of posterior samples. (a) The trace plots of the data-misfit function evaluated at each sample (left, values have been offset for easier comparison) and the auto-correlation of data-misfits as a function of lag (right). (b) The KL divergence between the posterior and the prior as a function of iteration (upper) and time (lower) respectively.

In the Bayesian setting, we adopt the following Gaussian process prior with the covariance kernel C defined through the Laplace operator Δ:

u ∼ µ = N(0, C),   C = (δI − γΔ)⁻²,

where δ governs the variance of the prior and γ/δ controls the correlation length. We set γ = 2 and δ = 10 in this example. The Bayesian inverse problem involves obtaining an estimate of the initial condition u₀ and quantifying its uncertainty based on the 80 ×
16 spatiotemporal observations. For notational convenience, we still denote u₀(x) as u(x) when it is not confused with the general concentration field u(x, t). The Bayesian UQ in this example is especially challenging not only because of the large dimensionality (3413) of the spatially discretized u (Lagrange degree 1) at each time t, but also due to the spatiotemporal interactions in these observations.

Figure 14: Advection-diffusion inverse problem. (a) True initial condition (top left), and the solutions u(s) at different time points. (b) Spatiotemporal observations at 80 selected locations indicated by circles across different time points, with color indicating their values.

We follow the CES framework as in Section 3.3. In the calibration stage, we collect {u_n^{(j)}, G(u_n^{(j)})}_{n=1,j=1}^{N,J} by running EKS with ensemble size J = 500 for N = 10 iterations. For the emulation, we train DNN and CNN with the same 3 : 1 splitting of these 5000 ensembles for training/testing data. The DNN has 5 layers with 'LeakyReLU' activation functions. Figure 15a compares the true gradient DΦ(u_MAP) (left) and its emulations DΦ^e(u_MAP) (right two) as in Equation (12) extracted from the two types of neural networks. As before, we can see that the extracted gradient output by CNN provides a better approximation to the true gradient compared with DNN.
Figure 15: Advection-diffusion inverse problem: outputs by neural networks viewed as 2d images. (a) CNN (middle) and DNN (right) emulation (G^e : R^3413 → R^1280) extracting gradients DΦ^e(u_MAP) compared with the true gradient DΦ(u_MAP) (left). (b) AE compressing the original function u_MAP (left) into latent space u_MAP_r (middle) and reconstructing it in the original space u_MAP′ (right).

Due to the large dimensionality of the inputs and outputs (G^e : R^3413 → R^1280) and the memory requirement, GP failed to fit and output gradient extraction.

In the sampling stage, we adopt an AE with the same structure as in Figure 4, the latent dimension d_L = 417, and the activation functions chosen as 'elu'. Figure 15b plots the original u_MAP (left), the latent representation u_MAP_r = φ(u_MAP) (middle) and the reconstruction u_MAP′ = ψ(u_MAP_r) (right). Again we see a 'faithful' reconstruction of the original function (image) by the AE even though the latent representation is not very intuitive.

We compare the performance of the ∞-MCMC algorithms (pCN, ∞-MALA, ∞-HMC), their emulative versions and the corresponding DREAM algorithms. For each algorithm, we run 6000 iterations and burn in the first 1000. For HMC algorithms, we set I = 5. We tune the step sizes for each algorithm so that they have similar acceptance rates around 60∼70%.

Figure 16: Advection-diffusion inverse problem: Bayesian posterior mean estimates of the initial concentration field u(x) based on 5000 samples by various MCMC algorithms.

Figures 16 and 17 show the Bayesian posterior mean and standard deviation estimates of the initial concentration field by various MCMC algorithms, contrasted with those by ensemble Kalman methods, which severely underestimate the posterior standard deviations.

Table 2: Advection-diffusion inverse problem: sampling efficiency of MCMC algorithms.

Method          h^a    AP^b   s/iter^c  ESS (min, med, max)^d         minESS/s^e  spdup^f  PDE solns^g
pCN             0.00   0.69   0.03      (3.16, 6.37, 40.7)            0.0222      1.00     6001
∞-MALA          0.01   0.68   0.06      (3.78, 11.6, 51.5)            0.0122      0.55     12002
∞-HMC           0.01   0.78   0.12      (31.55, 83.54, 240.34)        0.0507      2.29     35872
e-pCN           0.00   0.69   0.02      (3.33, 7.19, 58.2)            0.0324      1.46     0
e-∞-MALA        0.01   0.72   0.05      (4.28, 14.3, 62)              0.0157      0.71     0
e-∞-HMC         0.01   0.72   0.11      (25.41, 113.11, 270.79)       0.0475      2.14     0
DREAM-pCN       0.02   0.68   0.02      (8.88, 16.99, 53.35)          0.0727      —        —
DREAM-∞-MALA    0.10   0.83   0.06      (37.65, 66.58, 157.09)        0.1310      —        —
DREAM-∞-HMC     0.10   0.72   0.17      (564.12, 866.72, 1292.11)     0.6791      —        —

a step size; b acceptance probability; c seconds per iteration; d (minimum, median, maximum) effective sample size; e minimal ESS per second; f comparison of minESS/s with pCN as benchmark; g number of PDE solutions.

Table 2 compares the sampling efficiency of various MCMC algorithms measured by minESS/s. The three most efficient sampling algorithms are all DREAM algorithms. DREAM ∞-HMC attains up to 30 times speed-up compared to the benchmark pCN. Considering the complexity of this inverse problem with spatiotemporal observations, this is a significant achievement. Again, we exclude the training time of CNN and AE from the comparison since it is rather negligible compared with the overall sampling time.

Figure 17: Advection-diffusion inverse problem: Bayesian posterior standard deviation estimates of the initial concentration field u(x) based on 5000 samples by various MCMC algorithms.

Figure 20a verifies that DREAM ∞-HMC is the most efficient MCMC algorithm, with the smallest autocorrelation shown on the right panel. It is followed by the other HMC algorithms and DREAM ∞-MALA, which is even better than ∞-HMC. Figure 20b plots the KL divergence between the posterior and the prior in terms of iteration (upper) and time (lower) respectively. As we can see, ∞-HMC converges the fastest.

6 Conclusion

In this paper, we have proposed a new framework to scale up Bayesian UQ for physics-constrained inverse problems. More specifically, we use CNN – a regularized neural network, which is a powerful tool for image recognition and amenable to inverse problems if we treat the input parameter function as an image. This way, CNN is capable of learning spatial features.
In this paper, we have proposed a new framework to scale up Bayesian UQ for physics-constrained inverse problems. More specifically, we use CNN, a regularized neural network, which is a powerful tool for image recognition and is amenable to inverse problems if we treat the input parameter function as an image. This way, the CNN is capable of learning spatial features. In addition, the resulting algorithm has low computational complexity and is robust: as seen in Figure 3, the performance of the CNN as an emulator is relatively stable across different training sizes. If a larger training set is required for certain problems, we could train the CNN adaptively as more samples are collected from the parameter space [63]. We have also adopted an AE to further reduce the dimension of the parameter space and speed up the sampling process. Overall, by utilizing different techniques based on neural networks, we have been able to scale up Bayesian UQ to thousands of dimensions.

In the current framework, we rely on a regular grid mesh to facilitate the CNN training: a discretized function over a 2d mesh needs to be converted to a matrix of image pixels. This limitation can be alleviated by using mesh CNN [84], which can train the CNN directly on an irregular mesh; for example, triangular meshes have been used extensively for solving PDEs. This would extend our methodology and further enhance its utility.

The standard AE used in our proposed framework, with its latent projection by dense layers, might not be the optimal choice. Alternatively, we could use a convolutional AE (CAE) [72], which generates a more recognizable latent representation, as illustrated in Figure 18. In this case, the latent parameter can be interpreted as a representation of the original function on a coarser mesh. We can even modify the loss function to develop 'multilevel' emulation based on CNN.

Figure 18: Elliptic inverse problem: CAE compressing the original function (left) into latent space (middle) and reconstructing it in the original space (right).

Lastly, some inverse problems involve spatiotemporal data (e.g., the advection-diffusion equation). In such cases, we could model the temporal pattern of the observations in the emulation using recurrent neural networks (RNN) [85], e.g., long short-term memory (LSTM) [86]. We could then build a 'CNN-RNN' emulator with convolutional layers for function (image) inputs and an RNN layer for multivariate time series outputs. While we have obtained promising preliminary results, we intend to pursue this idea in our future work.
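As a rough sketch of what such a 'CNN-RNN' emulator might look like, the toy Keras model below couples a convolutional encoder of the parameter image with an LSTM that emits one observation vector per time point; the layer sizes, the 64×64 image, and the 80-location-by-16-time output shape are assumptions for illustration only.

```python
# Illustrative CNN-RNN ('CNN-LSTM') emulator: convolutional layers encode the
# input function (as an image); an LSTM decodes a multivariate time series of
# observations. All sizes are assumptions, not a tested configuration.
import tensorflow as tf
from tensorflow.keras import layers, models

n_pix, n_times, n_locs = 64, 16, 80   # assumed image size, observation times, locations

inp = layers.Input(shape=(n_pix, n_pix, 1))
h = layers.Conv2D(16, 5, strides=2, padding="same", activation="relu")(inp)
h = layers.Conv2D(32, 5, strides=2, padding="same", activation="relu")(h)
h = layers.Flatten()(h)
h = layers.Dense(128, activation="relu")(h)
h = layers.RepeatVector(n_times)(h)             # repeat the spatial code across time steps
h = layers.LSTM(128, return_sequences=True)(h)  # model the temporal pattern of the observations
out = layers.TimeDistributed(layers.Dense(n_locs))(h)  # one output vector per time point

cnn_rnn_emulator = models.Model(inp, out)
cnn_rnn_emulator.compile(optimizer="adam", loss="mse")
cnn_rnn_emulator.summary()
```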
Appendix

A Proof of Theorem 3.1

Proof.
Note that, since
\[
|\Phi(u)-\Phi^e(u)| \le \big|\langle \mathcal{G}(u)-\mathcal{G}^e(u),\, y-\mathcal{G}(u)\rangle_\Gamma\big| + \big|\langle y-\mathcal{G}^e(u),\, \mathcal{G}(u)-\mathcal{G}^e(u)\rangle_\Gamma\big|,
\]
\[
|D\Phi(u)-D\Phi^e(u)| \le \big|\langle \mathcal{G}(u)-\mathcal{G}^e(u),\, D\mathcal{G}(u)\rangle_\Gamma\big| + \big|\langle y-\mathcal{G}^e(u),\, D\mathcal{G}(u)-D\mathcal{G}^e(u)\rangle_\Gamma\big|,
\]
it suffices to prove
\[
\|\mathcal{G}_j-\mathcal{G}^e_j\|_{H^1(\Omega)} \le c_j\,\|\mathcal{G}_j\|\,\sqrt{\log K}\,K^{-\frac12-\frac1d}, \qquad j=1,\cdots,m. \tag{28}
\]
Let $K^*$ be the integer part of $(s-1)K/d-1$, i.e. $K^*=\lceil (s-1)K/d\rceil-1\ge 1$. For each $j=1,\cdots,m$, there exists a linear combination of ramp ridge functions of the following form by Theorem 2 of [87]:
\[
\mathcal{G}^e_j(u) = \mathcal{G}_j(0) + D\mathcal{G}_j(0)\cdot u + \frac{v}{K^*}\sum_{k=1}^{K^*}\beta_k\,(\alpha_k\cdot u-t_k)_+ \tag{29}
\]
with $\beta_k\in[-1,1]$, $\|\alpha_k\|_1=1$, $t_k\in[0,1]$ and $|v|\le 2v_{\mathcal{G}_j,2} := 2\int_{\mathbb{R}^d}\|\omega\|_1^2\,|\widehat{\mathcal{G}}_j(\omega)|\,d\omega \le c_{d,r}\|\mathcal{G}_j\|$, such that
\[
\|\mathcal{G}_j-\mathcal{G}^e_j\|_{L^\infty([-1,1]^d)} \le c\,v_{\mathcal{G}_j,2}\,\big(\sqrt{\log K^*}\vee\sqrt{d}\big)\,(K^*)^{-\frac12-\frac1d}. \tag{30}
\]
[70, Theorem B] constructs weights $W$ and biases $b$ of a CNN whose output has the form in Equation (29). Therefore,
\[
\|\mathcal{G}_j-\mathcal{G}^e_j\|_{L^2(\Omega)} \le C\,\|\mathcal{G}_j-\mathcal{G}^e_j\|_{L^\infty(\Omega)} \le c_j\,\|\mathcal{G}_j\|\,\sqrt{\log K}\,K^{-\frac12-\frac1d}, \qquad j=1,\cdots,m. \tag{31}
\]
Now we take the derivative on both sides of (29) to get
\[
D\mathcal{G}^e_j(u) = D\mathcal{G}_j(0) + \frac{v}{K^*}\sum_{k=1}^{K^*}\alpha_k\beta_k\,H(\alpha_k\cdot u-t_k), \tag{32}
\]
where $H(x)=I(x\ge 0)$ is the Heaviside function. For any $i=1,\cdots,d$, we have
\[
v_{D_i\mathcal{G}_j,1} := \int_{\mathbb{R}^d}\|\omega\|_1\,|\widehat{D_i\mathcal{G}}_j(\omega)|\,d\omega \le C\int_{\mathbb{R}^d}\|\omega\|_1\,|\omega_i|\,|\widehat{\mathcal{G}}_j(\omega)|\,d\omega \le C\,v_{\mathcal{G}_j,2}.
\]
Therefore, by Theorem 3 of [71], we have
\[
\|D_i\mathcal{G}_j-D_i\mathcal{G}^e_j\|_{L^2([-1,1]^d)} \le c'\,v_{\mathcal{G}_j,2}\,K^{-\frac12-\frac1d}. \tag{33}
\]
Inequalities (31) and (33) together yield the error bound (28), which completes the proof.

B Extended Results
Figure 20: Advection-diffusion inverse problem: analysis of posterior samples. (a) Trace plots of the data-misfit function evaluated at each sample (left; values are offset for easier comparison) and the autocorrelation of the data-misfits as a function of lag (right). (b) The KL divergence between the posterior and the prior as a function of iteration (upper) and time (lower), respectively.

References

[1] Emmet Cleary, Alfredo Garbuno-Inigo, Shiwei Lan, Tapio Schneider, and Andrew M Stuart. Calibrate, emulate, sample, 2020.
[2] Geir Evensen. Sequential data assimilation with a nonlinear quasi-geostrophic model using monte carlo methods to forecast error statistics. Journal of Geophysical Research, 99(C5):10143, 1994.
[3] Geir Evensen and Peter Jan van Leeuwen. Assimilation of geosat altimeter data for the agulhas current using the ensemble kalman filter with a quasi-geostrophic model. 1996.
[4] P. L. Houtekamer and Herschel L. Mitchell. A sequential ensemble kalman filter for atmospheric data assimilation. Monthly Weather Review, 129(1):123–137, Jan 2001.
[5] Aanonsen, S. I., Naevdal, G., Oliver, D. S., Reynolds, A. C., and Valles, B. The ensemble kalman filter in reservoir engineering–a review. Society of Petroleum Engineers, 14(September), 2009.
[6] Geir Evensen. Data Assimilation: The Ensemble Kalman Filter. Springer-Verlag Berlin Heidelberg, 2 edition, 2009.
[7] G. Evensen. The ensemble Kalman filter: theoretical formulation and practical implementation. Ocean Dyn., 53:343–367, 2003.
[8] Eugenia Kalnay. Atmospheric modeling, data assimilation and predictability. Nov 2002.
[9] Kody Law, Andrew Stuart, and Konstantinos Zygalakis. Data Assimilation: A Mathematical Introduction, volume 62 of Texts in Applied Mathematics. Springer International Publishing, 1 edition, 2015.
[10] P. L. Houtekamer and Fuqing Zhang. Review of the ensemble kalman filter for atmospheric data assimilation. Monthly Weather Review, 144(12):4489–4532, Dec 2016.
[11] Dean S. Oliver, Albert C. Reynolds, and Ning Liu. Inverse theory for petroleum reservoir characterization and history matching. 2008.
[12] Yan Chen and Dean S. Oliver. Ensemble randomized maximum likelihood method as an iterative ensemble smoother. Mathematical Geosciences, 44(1):1–26, Dec 2011.
[13] Alexandre A. Emerick and Albert C. Reynolds. Investigation of the sampling performance of ensemble-based methods with a simple reservoir model. Computational Geosciences, 17(2):325–350, Jan 2013.
[14] Marco A Iglesias, Kody J H Law, and Andrew M Stuart. Ensemble kalman methods for inverse problems. Inverse Problems, 29(4):045001, Mar 2013.
[15] Marco A Iglesias. A regularizing iterative ensemble kalman method for pde-constrained inverse problems. Inverse Problems, 32(2):025002, Jan 2016.
[16] Geir Evensen. Analysis of iterative ensemble smoothers for solving inverse problems. Computational Geosciences, 22(3):885–908, Mar 2018.
[17] Alfredo Garbuno-Inigo, Franca Hoffmann, Wuchen Li, and Andrew M. Stuart. Interacting langevin diffusions: Gradient structure and ensemble kalman sampler. SIAM Journal on Applied Dynamical Systems, 19(1):412–441, 2020.
[18] C. Schillings and A. Stuart. Analysis of the ensemble kalman filter for inverse problems. SIAM Journal on Numerical Analysis, 55(3):1264–1290, 2017.
[19] C. Schillings and A. M. Stuart. Convergence analysis of ensemble kalman inversion: the linear, noisy case. Applicable Analysis, 97(1):107–123, Oct 2017.
[20] J. de Wiljes, S. Reich, and W. Stannat. Long-time stability and accuracy of the ensemble kalman–bucy filter for fully observed processes and small measurement noise. SIAM Journal on Applied Dynamical Systems, 17(2):1152–1181, 2018.
[21] Neil K. Chada, Andrew M. Stuart, and Xin T. Tong. Tikhonov regularization within ensemble kalman inversion, 2019.
[22] Jerome Sacks, William J. Welch, Toby J. Mitchell, and Henry P. Wynn. Design and analysis of computer experiments. Statistical Science, 4(4):409–423, Nov 1989.
[23] Jeremy Oakley and Anthony O'Hagan. Bayesian inference for the uncertainty distribution of computer model outputs. Biometrika, 89(4):769–784, 12 2002.
[24] Jeremy E. Oakley and Anthony O'Hagan. Probabilistic sensitivity analysis of complex models: A bayesian approach. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 66(3):751–769, 2004.
[25] Marc C. Kennedy and Anthony O'Hagan. Bayesian calibration of computer models. Journal of the Royal Statistical Society, Series B, Methodological, 63:425–464, 2001.
[26] Dave Higdon, Marc Kennedy, James C. Cavendish, John A. Cafeo, and Robert D. Ryne. Combining field data and computer simulations for calibration and prediction. SIAM Journal on Scientific Computing, 26(2):448–466, 2004.
[27] A. O'Hagan. Bayesian analysis of computer code outputs: A tutorial. Reliability Engineering & System Safety, 91(10):1290–1300, 2006. The Fourth International Conference on Sensitivity Analysis of Model Output (SAMO 2004).
[28] Gemma Stephenson. Using derivative information in the statistical analysis of computer models. PhD thesis, University of Southampton, 2010.
[29] Shiwei Lan, Tan Bui-Thanh, Mike Christie, and Mark Girolami. Emulation of higher-order tensors in manifold Monte Carlo methods for Bayesian inverse problems. Journal of Computational Physics, 308:81–101, 2016.
[30] Thomas J Santner, Brian J Williams, and William Notz. The design and analysis of computer experiments. Springer, 2003.
[31] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
[32] Y. Bengio. Learning deep architectures for AI. Foundations and Trends® in Machine Learning, 2(1):1–127, 2009.
[33] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, May 2015.
[34] Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117, 2015.
[35] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 25, pages 1097–1105. Curran Associates, Inc., 2012.
[36] Dan Cireșan, Ueli Meier, Jonathan Masci, and Jürgen Schmidhuber. Multi-column deep neural network for traffic sign classification. Neural Networks, 32:333–338, 2012. Selected Papers from IJCNN 2011.
[37] Alex Graves, Douglas Eck, Nicole Beringer, and Juergen Schmidhuber. Biologically plausible speech recognition with LSTM neural nets. In Biologically Inspired Approaches to Advanced Information Technology, pages 127–136. Springer Berlin Heidelberg, 2004.
[38] F.A. Gers and E. Schmidhuber. LSTM recurrent networks learn simple context-free and context-sensitive languages. IEEE Transactions on Neural Networks, 12(6):1333–1340, 2001.
[39] G. Mesnil, Y. Dauphin, K. Yao, Y. Bengio, L. Deng, D. Hakkani-Tur, X. He, L. Heck, G. Tur, D. Yu, and G. Zweig. Using recurrent neural networks for slot filling in spoken language understanding. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(3):530–539, 2015.
[40] Davide Chicco, Peter Sadowski, and Pierre Baldi. Deep autoencoder neural networks for gene ontology annotation predictions. In Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics. ACM, Sep 2014.
[41] Andrew W. Senior, Richard Evans, John Jumper, James Kirkpatrick, Laurent Sifre, Tim Green, Chongli Qin, Augustin Žídek, Alexander W. R. Nelson, Alex Bridgland, Hugo Penedones, Stig Petersen, Karen Simonyan, Steve Crossan, Pushmeet Kohli, David T. Jones, David Silver, Koray Kavukcuoglu, and Demis Hassabis. Improved protein structure prediction using potentials from deep learning. Nature, 577(7792):706–710, Jan 2020.
[42] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, Jan 2016.
[43] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis. Mastering the game of go without human knowledge. Nature, 550(7676):354–359, Oct 2017.
[44] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.
[45] Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006. PMID: 16764513.
[46] Léon Bottou. On-line learning and stochastic approximations. In On-Line Learning in Neural Networks, pages 9–42. Cambridge University Press, Jan 1999.
[47] K. H. Jin, M. T. McCann, E. Froustey, and M. Unser. Deep convolutional neural network for inverse problems in imaging. IEEE Transactions on Image Processing, 26(9):4509–4522, 2017.
[48] M. T. McCann, K. H. Jin, and M. Unser. Convolutional neural networks for inverse problems in imaging: A review. IEEE Signal Processing Magazine, 34(6):85–95, 2017.
[49] Feng Wang, Alberto Eljarrat, Johannes Müller, Trond R. Henninen, Rolf Erni, and Christoph T. Koch. Multi-resolution convolutional neural networks for inverse problems. Scientific Reports, 10(1), Mar 2020.
[50] Gareth O Roberts, Andrew Gelman, and Walter R Gilks. Weak convergence and optimal scaling of random walk metropolis algorithms. The Annals of Applied Probability, 7(1):110–120, 1997.
[51] Gareth O. Roberts and Jeffrey S. Rosenthal. Optimal scaling of discrete approximations to langevin diffusions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 60(1):255–268, 1998.
[52] Alexandros Beskos, Gareth Roberts, and Andrew Stuart. Optimal scalings for local metropolis–hastings chains on nonproduct targets in high dimensions. The Annals of Applied Probability, 19(3):863–898, Jun 2009.
[53] Alexandros Beskos, Gareth Roberts, Andrew Stuart, and Jochen Voss. MCMC methods for diffusion bridges. Stochastics and Dynamics, 8(03):319–350, 2008.
[54] A. Beskos, F. J. Pinski, J. M. Sanz-Serna, and A. M. Stuart. Hybrid Monte-Carlo on Hilbert spaces. Stochastic Processes and their Applications, 121:2201–2230, 2011.
[55] Simon L Cotter, Gareth O Roberts, AM Stuart, and David White. MCMC methods for functions: modifying old algorithms to make them faster. Statistical Science, 28(3):424–446, 2013.
[56] Kody Law. Proposals which speed up function-space MCMC. Journal of Computational and Applied Mathematics, 262:127–138, 2014.
[57] Alexandros Beskos. A stable manifold MCMC method for high dimensions. Statistics & Probability Letters, 90:46–52, 2014.
[58] Alexandros Beskos, Mark Girolami, Shiwei Lan, Patrick E. Farrell, and Andrew M. Stuart. Geometric mcmc for infinite-dimensional inverse problems. Journal of Computational Physics, 335:327–351, 2017.
[59] Tiangang Cui, Kody J.H. Law, and Youssef M. Marzouk. Dimension-independent likelihood-informed MCMC. Journal of Computational Physics, 304:109–137, 2016.
[60] P. Constantine, C. Kent, and T. Bui-Thanh. Accelerating markov chain monte carlo with active subspaces. SIAM Journal on Scientific Computing, 38(5):A2779–A2805, 2016.
[61] Shiwei Lan. Adaptive dimension reduction to accelerate infinite-dimensional geometric markov chain monte carlo. Journal of Computational Physics, 392:71–95, September 2019.
[62] G.E. Hinton and R.R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
[63] Babak Shahbaba, Luis Martinez Lomeli, Tian Chen, and Shiwei Lan. Deep markov chain monte carlo. arXiv:1910.05692, 2019.
[64] Andrew M Stuart. Inverse problems: a bayesian perspective. Acta Numerica, 19:451–559, 2010.
[65] Masoumeh Dashti and Andrew M. Stuart. The Bayesian Approach to Inverse Problems, pages 311–428. Springer International Publishing, Cham, 2017.
[66] Nikola B Kovachki and Andrew M Stuart. Ensemble kalman inversion: a derivative-free technique for machine learning tasks. Inverse Problems, 35(9):095005, Aug 2019.
[67] R. M. Neal. MCMC using Hamiltonian dynamics. In S. Brooks, A. Gelman, G. Jones, and X. L. Meng, editors, Handbook of Markov Chain Monte Carlo. Chapman and Hall/CRC, 2010.
[68] L. Verlet. Computer "Experiments" on Classical Fluids. I. Thermodynamical Properties of Lennard-Jones Molecules. Phys. Rev., 159(1):98–103, 1967.
[69] Kunihiko Fukushima. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4):193–202, 1980.
[70] Ding-Xuan Zhou. Universality of deep convolutional neural networks. Applied and Computational Harmonic Analysis, 48(2):787–794, 2020.
[71] Y. Makovoz. Random approximants and neural networks. Journal of Approximation Theory, 85(1):98–109, 1996.
[72] Xifeng Guo, Xinwang Liu, En Zhu, and Jianping Yin. Deep clustering with convolutional autoencoders. In Neural Information Processing, pages 373–382. Springer International Publishing, 2017.
[73] Manassés Ribeiro, André Eugênio Lazzaretti, and Heitor Silvério Lopes. A study of deep convolutional auto-encoders for anomaly detection in videos. Pattern Recognition Letters, 105:13–22, 2018. Machine Learning and Applications in Artificial Intelligence.
[74] Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes. In ICLR, 2014.
[75] Diederik P. Kingma and Max Welling. An introduction to variational autoencoders. Foundations and Trends® in Machine Learning, 12(4):307–392, 2019.
[76] G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 2(4):303–314, Dec 1989.
[77] Allan Pinkus. Approximation theory of the MLP model in neural networks. Acta Numerica, 8:143–195, Jan 1999.
[78] Zhou Lu, Hongming Pu, Feicheng Wang, Zhiqiang Hu, and Liwei Wang. The expressive power of neural networks: A view from the width. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30, pages 6231–6239. Curran Associates, Inc., 2017.
[79] Shiwei Lan, Vasileios Stathopoulos, Babak Shahbaba, and Mark Girolami. Markov Chain Monte Carlo from Lagrangian Dynamics. Journal of Computational and Graphical Statistics, 24(2):357–378, 2015.
[80] Shiwei Lan, Bo Zhou, and Babak Shahbaba. Spherical hamiltonian monte carlo for constrained target distributions. In Proceedings of The 31st International Conference on Machine Learning, pages 629–637, Beijing, China, 2014.
[81] M. Welling and Y. W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the International Conference on Machine Learning, 2011.
[82] N. Petra and G. Stadler. Model variational inverse problems governed by partial differential equations. Technical report, The Institute for Computational Engineering and Sciences, The University of Texas at Austin, 2011.
[83] Umberto Villa, Noemi Petra, and Omar Ghattas. hippylib: An extensible software framework for large-scale inverse problems governed by pdes; part i: Deterministic inversion and linearized bayesian inference, 2020.
[84] Rana Hanocka, Amir Hertz, Noa Fish, Raja Giryes, Shachar Fleishman, and Daniel Cohen-Or. Meshcnn: A network with an edge. ACM Trans. Graph., 38(4), July 2019.
[85] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, Oct 1986.
[86] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory.