Novel deep neural networks for solving Bayesian statistical inverse problems
Harbir Antil, Howard C Elman, Akwum Onwunta, Deepanshu Verma
HA, AO, and DV are partially supported by NSF grants DMS-1818772, DMS-1913004, the Air Force Office of Scientific Research (AFOSR) under Award NO: FA9550-19-1-0036, and the Department of Navy, Naval Postgraduate School under Award NO: N00244-20-1-0005. HCE is partially supported by NSF grant DMS-1819115.

Abstract.
We consider the simulation of Bayesian statistical inverse problems governed by large-scale linear and nonlinear partial differential equations (PDEs). Markov chain Monte Carlo (MCMC) algorithms are standard techniques to solve such problems. However, MCMC techniques are computationally challenging as they require several thousands of forward PDE solves. The goal of this paper is to introduce a fractional deep neural network based approach for the forward solves within an MCMC routine. Moreover, we discuss some approximation error estimates and illustrate the efficiency of our approach via several numerical examples.
Key words.
Statistical inverse problem, Metropolis-Hastings, fractional time derivative, deep neural network, uncertainty quantification.
AMS subject classifications.
1. Introduction.
Large-scale statistical inverse problems governed by partial differential equations (PDEs) are increasingly found in different areas of computational science and engineering [5, 9, 10, 20, 36]. The basic use of this class of problems is to recover certain physical quantities from limited and noisy observations. They are generally computationally demanding and can be solved using the Bayesian framework [9, 18, 20]. In the Bayesian approach to statistical inverse problems, one models the solution as a posterior distribution of the unknown parameters conditioned on the observations. Such mathematical formulations are often ill-posed, and regularization is introduced in the model via prior information. The posterior distribution completely characterizes the uncertainty in the model. Once it has been computed, statistical quantities of interest can then be obtained from the posterior distribution. Unfortunately, many problems of practical relevance do not admit an analytic representation of the posterior distribution. Thus, the posterior distributions are generally sampled using Markov chain Monte Carlo (MCMC)-type schemes.

We point out here that a large number of samples are generally needed by MCMC methods to obtain convergence. This makes MCMC methods computationally intensive for simulating Bayesian inverse problems. Moreover, for statistical inverse problems governed by PDEs, one needs to solve the forward problem corresponding to each MCMC sample. This task further increases the computational complexity of the problem, especially if the governing PDE is a large-scale nonlinear PDE [18]. Hence, it is reasonable to construct a computationally cheap surrogate to replace the forward model solver [40]. Problems that involve large numbers of input parameters often lead to the so-called curse of dimensionality. Projection-based reduced-order models, e.g., reduced basis methods and the discrete empirical interpolation method (DEIM), are typical examples of dimensionality reduction methods for tackling parametrized PDEs [1, 6, 12, 17, 21, 33, 44]. However, they are intrusive by design in the sense that they do not allow reuse of existing codes in the forward solves, especially for nonlinear forward models. To mitigate this computational issue, the goal of this work is to demonstrate the use of deep neural networks (DNN) to construct surrogate models specifically for nonlinear parametrized PDEs governing statistical Bayesian inverse problems.
In particular, we will focus on fractional deep neural networks (fDNN), which have recently been introduced and applied to classification problems [2].

The use of DNNs for surrogate modeling in the framework of PDEs has received increasing attention in recent years, see e.g., [13, 25, 29, 41, 43, 45, 49, 52, 54] and the references therein. Some of these references also cover the type of Bayesian inverse problems considered in our paper. To accelerate or replace computationally expensive PDE solves, a DNN is trained to approximate the mapping from parameters in the PDE to observations obtained from its solution. In a supervised learning setting, training data consisting of inputs and outputs are available, and the learning problem aims at tuning the weights of the DNN. To obtain an effective surrogate model, it is desirable to train the DNN to a high accuracy.

We consider a different approach than the aforementioned works. We propose novel dynamical-systems-based neural networks which allow connectivity among all the network layers. Specifically, we consider a fractional DNN technique recently proposed in [2] in the framework of classification problems. We note that [2] is motivated by [26]. Both papers are in the spirit of the push to develop rigorous mathematical models for the analysis and understanding of DNNs. The idea is to consider DNNs as dynamical systems. More precisely, in [7, 26, 28], a DNN is thought of as an optimization problem constrained by a discrete ordinary differential equation (ODE). As pointed out in [2], designing the DNN solution algorithms at the continuous level has the appealing advantage of architecture independence; in other words, the number of optimization iterations remains the same even if the number of layers is increased. Moreover, [2] specifically considers a continuous fractional ODE constraint. Unlike standard DNNs, the resulting fractional DNN allows the network to access historic information of input and gradients across all subsequent layers, since all the layers are connected to one another.

We demonstrate that our fractional DNN leads to a significant reduction in the overall computational time used by MCMC algorithms to solve inverse problems, while maintaining accurate evaluation of the relevant statistical quantities. We note here that, although the papers [41, 49, 54] consider DNNs for inverse problems governed by PDEs, the DNNs used in these studies do not, in general, have the optimization-based formulation considered here.

The remainder of the paper is organized as follows: In section 2 we state the generic parametrized PDEs under consideration, as well as discuss the well-known surrogate modeling approaches. Our main DNN algorithm to approximate parametrized PDEs is provided in section 3. We first discuss a ResNet architecture to approximate parametrized PDEs, which is followed by our fractional ResNet (or fractional DNN) approach to carry out the same task. A brief discussion of error estimates is also provided. Next, we discuss the application of fractional DNN to statistical Bayesian inverse problems in section 4. We conclude the paper with several illustrative numerical examples in section 5.
2. Parametrized PDEs.
In this section, we present an abstract formulation of the forward problem as a discrete partial differential equation (PDE) depending on parameters. We are interested in the following model, which represents a (finite element, finite difference or finite volume) discretization of a (possibly nonlinear) parametrized PDE:

F(u(ξ); ξ) = 0,   (2.1)

where u(ξ) ∈ U ⊂ R^{N_x} and ξ ∈ P ⊂ R^{N_ξ} denote the solution of the PDE and the parameter in the model, respectively. Here, P denotes the parameter domain and U the solution manifold. For a fixed parameter ξ ∈ P, we seek the solution u(ξ) ∈ U. In other words, we have the functional relation given by the parameter-to-solution map

ξ ↦ Φ(ξ) ≡ u(ξ).   (2.2)

Several processes in computational science and engineering are modelled via parameter-dependent PDEs which, when discretized, are of the form (2.1): for instance, viscous flows governed by the Navier-Stokes equations, which are parametrized by the Reynolds number [19], and unsteady natural convection problems modelled via the Boussinesq equations, in which the Grashof or Prandtl numbers are important parameters [44]. Approximating the high-fidelity solution of (2.1) can be done with nonlinear solvers, such as Newton or Picard methods combined with Krylov solvers [19]. However, computing solutions to (2.1) can become prohibitive, especially when they are required for many parameter values and N_ξ is large. Besides, a relatively large N_x (resulting from a very fine mesh in the discretization of the PDE) yields large (nonlinear) algebraic systems which are computationally expensive to solve and may also lead to huge storage requirements. This is, for instance, the case in Bayesian inference problems governed by PDEs, where several forward solves are required to adequately sample posterior distributions through MCMC-type schemes [8, 42]. As a result of these computational challenges, it is reasonable to replace the high-fidelity model by a surrogate model which is relatively easy to evaluate.

In what follows, we discuss two classes of surrogate models, namely reduced-order models and deep learning models. Surrogate models are cheap-to-evaluate models designed to replace computationally costly models. The major advantage of surrogate models is that an approximate solution can be easily evaluated at any new parameter instance with minimal loss of accuracy, at a cost independent of the dimension of the original high-fidelity problem.

A popular class of surrogate models are the reduced-order models (ROMs), see e.g., [1, 6, 11, 16, 17, 21, 33, 44]. Typical examples of ROMs include the reduced basis method [44], proper orthogonal decomposition (POD) [11, 24, 33, 44], and the discrete empirical interpolation method (DEIM) and its variants [1, 6, 11, 17, 21]. A key feature of ROMs is that they use the so-called offline-online paradigm. The offline step essentially constructs a low-dimensional approximation to the solution space; this approximation is generally known as the reduced basis. The online step uses the reduced basis to solve a smaller reduced problem. The resulting reduced solution accurately approximates the solution of the original problem.

Deep neural network (DNN) models constitute another class of surrogate models which are well known for their high approximation capabilities. The basic idea of DNNs is to approximate multivariate functions through a set of layers of increasing
complexity [23]. Examples of DNNs for surrogate modeling include the residual neural network (ResNet) [28, 31], physics-informed neural networks (PINNs) [45] and the fractional DNN [2].

Note that ROMs require system solves and they are highly intrusive, especially for nonlinear problems [1, 11, 17, 18]. In contrast, the DNN approach is fully non-intrusive, which is essential for legacy codes. Although rigorous error estimates for ROMs under various assumptions have been well studied [44], we prefer the non-intrusive nature of DNNs, while recognizing that their error analysis is not yet as well developed. In this work, we propose a surrogate model based on the combination of POD and fractional DNN. Before we discuss our proposed model, we first review POD.
2.1. Proper orthogonal decomposition (POD). For sufficiently large N_s ∈ N, suppose that E := {ξ_1, ξ_2, ..., ξ_{N_s}} is a set of parameter samples with ξ_i ∈ R^{N_ξ}, and {u(ξ_1), u(ξ_2), ..., u(ξ_{N_s})} the corresponding snapshots (solutions of the model (2.1), with u_i ∈ R^{N_x}). Here, we assume that span{u(ξ_1), u(ξ_2), ..., u(ξ_{N_s})} sufficiently approximates the space of all possible solutions of (2.1). Next, we denote by

S = [u_1 | u_2 | ⋯ | u_{N_s}] ∈ R^{N_x × N_s}   (2.3)

the matrix whose columns are the solution snapshots. Then, the singular value decomposition (SVD) of S is given by

S = Ṽ Σ W^T,   (2.4)

where Ṽ ∈ R^{N_x × r} and W ∈ R^{N_s × r} have orthonormal columns, the left and right singular vectors, and r ≤ min{N_s, N_x} is the rank of S. Thus, Ṽ^T Ṽ = I_r, W^T W = I_r, and Σ = diag(ρ_1, ρ_2, ..., ρ_r) ∈ R^{r × r}, where ρ_1 ≥ ρ_2 ≥ ⋯ ≥ ρ_r > 0.

Now, denote by V the matrix of the first k ≤ r left singular vectors of S. Then, the columns of V ∈ R^{N_x × k} form a POD basis of dimension k. According to the Schmidt-Eckart-Young theorem [14, 22], the POD basis V minimizes, over all possible k-dimensional orthonormal bases Z ∈ R^{N_x × k}, the sum of the squares of the errors between each snapshot vector u_i and its projection onto the subspace spanned by Z. More precisely,

Σ_{i=1}^{N_s} ‖u_i − V V^T u_i‖² = min_{Z ∈ 𝒵} Σ_{i=1}^{N_s} ‖u_i − Z Z^T u_i‖² = Σ_{i=k+1}^{r} ρ_i²,   (2.5)

where 𝒵 := {Z ∈ R^{N_x × k} : Z^T Z = I_k}. Note from (2.5) that the POD basis V solves a least squares minimization problem, which guarantees that the approximation error is controlled by the singular values.

For every ξ, we then approximate the continuous solution u(ξ) as u(ξ) ≈ V û(ξ), where û(ξ) solves the reduced problem

V^T F(V û(ξ); ξ) = 0.   (2.6)

Notice that, in some reduced-modeling techniques such as DEIM, additional steps are needed to fully reduce the dimensionality of the problem (2.6). Nevertheless, one still needs to solve a nonlinear (reduced) system like (2.6) to evaluate û.

We conclude this section by emphasizing that the above approach is "linear" because u(ξ) ≈ Φ̂(ξ), where Φ̂(ξ) = V û(ξ) is an approximation of the map Φ(ξ) given in (2.2).
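For concreteness, the following sketch (a hypothetical NumPy illustration, not the authors' code; the matrix sizes are placeholders) computes a POD basis from a snapshot matrix via the SVD and checks the error identity (2.5).

```python
import numpy as np

def pod_basis(S, k):
    """Compute a k-dimensional POD basis from the snapshot matrix S (N_x-by-N_s).

    Returns V (N_x-by-k) with orthonormal columns and the singular values rho,
    so that the squared projection error equals the sum of the discarded rho_i**2,
    cf. (2.5)."""
    V_tilde, rho, _ = np.linalg.svd(S, full_matrices=False)
    return V_tilde[:, :k], rho

# Minimal usage sketch with random data standing in for PDE snapshots.
rng = np.random.default_rng(0)
S = rng.standard_normal((4096, 900))       # N_x = 4096 dofs, N_s = 900 snapshots
V, rho = pod_basis(S, k=400)
proj_err = np.linalg.norm(S - V @ (V.T @ S), 'fro')**2
print(proj_err, np.sum(rho[400:]**2))      # the two numbers agree, cf. (2.5)
```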
3. Deep neural network.
The DNN approach to modeling surrogates produces a nonlinear approximation Φ̂ of the input-output map Φ : R^{N_ξ} → R^{N_x} given in (2.2), where Φ̂ depends implicitly on a set of parameters θ ∈ R^{N_θ} configured as a layered set of latent variables that must be trained. We represent this dependence using the notation Φ̂(ξ; θ). In the context of PDE surrogate modeling, training a DNN requires a data set (E, S), where the parameter samples ξ_j ∈ E are the inputs and the corresponding snapshots u_j ∈ S are targets; training then consists of constructing θ so that the DNN Φ̂(ξ_j; θ) matches u_j for each ξ_j. This matching is determined by a loss functional. The learning problem therefore involves computing the optimal parameter θ that minimizes the loss functional and satisfies Φ̂(ξ_j; θ) ≈ u_j. The ultimate goal is that this approximation also holds with the same optimal parameter θ for a different data set; in other words, for ξ ∉ E, we take Φ̂(ξ; θ) to represent a good approximation to Φ(ξ).

Deep learning can be either supervised or unsupervised, depending on the data set used in training. In the supervised learning technique for DNN, all the input samples ξ_j are available together with all the corresponding samples of the targets u_j. In contrast, the unsupervised learning framework does not require all the outputs to accomplish the training phase. In this paper, we adopt the supervised learning approach to model our surrogate and apply it to Bayesian inverse problems. In particular, we shall discuss the residual neural network (ResNet) [28, 31] and the fractional DNN [2] in the context of PDE surrogate modeling.
3.1. Residual neural network (ResNet). The residual neural network (ResNet) model was originally proposed in [31]. For a given input datum ξ, ResNet approximates Φ through the following recursive expression:

φ_1 = σ(W_0 ξ + b_0),
φ_j = φ_{j−1} + h σ(W_{j−1} φ_{j−1} + b_{j−1}),  2 ≤ j ≤ L − 1,   (3.1)
φ_L = W_{L−1} φ_{L−1},

where the {W_j, b_j} are the weights and biases, h > 0 is a step size, L is the number of layers, and σ is an activation function which is applied element-wise to its arguments. Typical examples of σ include the hyperbolic tangent function, the logistic function and the rectified linear unit (ReLU) function. ReLU is the nonlinear function given by σ(x) = max{x, 0}. In this work, we use a smoothed ReLU function defined, for ε > 0, as

σ(x) := x  if x > ε;   0  if x < −ε;   (1/(4ε)) x² + (1/2) x + ε/4  if −ε ≤ x ≤ ε.   (3.2)

Note that as ε → 0, the smoothed ReLU approaches ReLU, see Figure 3.1.

It follows from (3.1) that

Φ̂(ξ; θ) = φ_L(ξ) = W_{L−1} ( (I + h(σ ∘ K_{L−2})) ∘ ⋯ ∘ (I + h(σ ∘ K_1)) ∘ (σ ∘ K_0) )(ξ),   (3.3)

where K_j(y) = W_j y + b_j for all j = 0, ..., L − 2.

To this end, the following two critical questions naturally come to mind:
(a) How well does Φ̂(ξ; θ) approximate Φ(ξ)?
(b) How do we determine θ?
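As an illustration of (3.1)-(3.3), the following sketch (hypothetical NumPy code, with randomly initialized weights standing in for trained parameters) evaluates the smoothed ReLU and a ResNet forward pass; the quadratic form of the smoothing follows the reconstruction of (3.2) above.

```python
import numpy as np

def smooth_relu(x, eps=0.1):
    """C^1 smoothing of ReLU, cf. (3.2); equals (x + eps)**2 / (4*eps) on [-eps, eps]
    and reduces to max(x, 0) as eps -> 0."""
    return np.where(x > eps, x,
           np.where(x < -eps, 0.0, (x + eps)**2 / (4.0 * eps)))

def resnet_forward(xi, weights, biases, h=0.25, eps=0.1):
    """ResNet forward pass, cf. (3.1)."""
    phi = smooth_relu(weights[0] @ xi + biases[0], eps)   # phi_1
    for W, b in zip(weights[1:-1], biases[1:]):           # residual updates, j = 2, ..., L-1
        phi = phi + h * smooth_relu(W @ phi + b, eps)
    return weights[-1] @ phi                              # phi_L = W_{L-1} phi_{L-1}

# Usage sketch: L = 4 layers, hidden width 15, input dim 2, output dim 400.
rng = np.random.default_rng(1)
dims = [2, 15, 15, 15, 400]
weights = [0.1 * rng.standard_normal((dims[i + 1], dims[i])) for i in range(len(dims) - 1)]
biases = [np.zeros(dims[i + 1]) for i in range(len(dims) - 2)]
print(resnet_forward(np.array([1.0, 2.0]), weights, biases).shape)   # (400,)
```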
Fig. 3.1: Smooth ReLU and its derivative for different values of ε.

We will address (b) first and leave (a) for the discussion provided in section 3.3. Now, setting θ = {W_j, b_j}, it follows that the problem of approximating Φ via ResNet is essentially the problem of learning the unknown parameter θ. More specifically, the learning problem is the solution of the minimization problem [28]

min_θ J(θ; ξ, u)   (3.4)

subject to the constraints (3.1), where J(θ; ξ, u) is a suitable loss function. In this work, we consider the mean squared error, together with a regularization term, as our loss functional:

J(θ; ξ, u) = (1/(2 N_s)) Σ_{j=1}^{N_s} ‖Φ̂(ξ_j; θ) − u_j‖² + λ ‖θ‖²,   (3.5)

where λ is the regularization parameter. Due to its highly non-convex nature, this is a very difficult optimization problem. Indeed, a search over a high-dimensional parameter space for the global minimum of a non-convex function can be intractable. The current state-of-the-art approaches to solve these optimization problems are based on the stochastic gradient descent method [41, 49, 54].

As pointed out in, for instance, [28], the middle equation in expression (3.1) mimics the forward Euler discretization of a nonlinear differential equation:

d_t φ(t) = σ(W(t) φ(t) + b(t)),  t ∈ (0, T],   φ(0) = φ_0.

It is known that standard DNNs are prone to the vanishing gradient problem [50], leading to loss of information during the training phase. ResNets do help, but more helpful is the so-called DenseNet [35]. Notice that in a typical DenseNet, each layer takes into account all the previous layers. However, this is an ad hoc approach with no rigorous mathematical framework. Recently in [2], a mathematically rigorous approach based on fractional derivatives has been introduced. This ResNet is called
Fractional DNN.
In [2], the authors numerically establish that the fractional-derivative-based ResNet outperforms ResNet in overcoming the vanishing gradient problem. This is not surprising because fractional derivatives allow connectivity among all the layers. Building rigorously on the idea of [2], we proceed to present the fractional DNN surrogate model for the discrete PDE in (2.1).

3.2. Fractional deep neural network. As pointed out in the previous section, the fractional DNN approach is based on the replacement of the standard ODE constraint in the learning problem by a fractional differential equation. To this end, we first introduce the definitions and concepts on which we shall rely to discuss fractional DNN.

Let u : [0, T] → R be an absolutely continuous function, assume γ ∈ (0, 1), and consider the fractional differential equations

d_t^γ u(t) = f(u(t)),  u(0) = u_0,   (3.6)

and

d_{T−t}^γ u(t) = f(u(t)),  u(T) = u_T.   (3.7)

Here, d_t^γ and d_{T−t}^γ denote the left and right Caputo fractional derivatives, respectively [2]. Then, setting u(t_j) = u_j, and using the L1-scheme (see e.g., [2]) for the discretization of (3.6) and (3.7) yields, respectively,

u_{j+1} = u_j − Σ_{k=0}^{j−1} a_{j−k} (u_{k+1} − u_k) + h^γ Γ(2 − γ) f(u_j),  j = 0, ..., L − 1,   (3.8)

and

u_{j−1} = u_j + Σ_{k=j}^{L−1} a_{k−j} (u_{k+1} − u_k) − h^γ Γ(2 − γ) f(u_j),  j = L, ..., 1,   (3.9)

where h > 0 is the step size, Γ(·) is the Euler gamma function, and

a_{j−k} = (j + 1 − k)^{1−γ} − (j − k)^{1−γ}.   (3.10)

After this brief overview of fractional derivatives, we are ready to introduce our fractional DNN (cf. (3.1)):

φ_1 = σ(W_0 φ_0 + b_0),  φ_0 = ξ,
φ_j = φ_{j−1} − Σ_{k=0}^{j−2} a_{j−1−k} (φ_{k+1} − φ_k) + h^γ Γ(2 − γ) σ(W_{j−1} φ_{j−1} + b_{j−1}),  2 ≤ j ≤ L − 1,   (3.11)
φ_L = W_{L−1} φ_{L−1}.

Our learning problem then amounts to

min_θ J(θ; ξ, u)   (3.12)

subject to the constraints (3.11). Notice that the middle equation in (3.11) mimics the L1-in-time discretization of the following nonlinear fractional differential equation:

d_t^γ φ(t) = σ(W(t) φ(t) + b(t)),  t ∈ (0, T],   φ(0) = φ_0.
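The following sketch (again hypothetical NumPy code, not the authors' implementation) propagates an input through a fractional DNN in the spirit of (3.11), accumulating the layer history with the L1 coefficients (3.10). Since φ_0 = ξ generally has a different dimension than the hidden states, the history sum in this sketch starts at the first hidden state φ_1; this is an implementation choice, not something prescribed by the paper.

```python
import numpy as np
from math import gamma

def smooth_relu(x, eps=0.1):
    # C^1 smoothing of ReLU, cf. (3.2).
    return np.where(x > eps, x, np.where(x < -eps, 0.0, (x + eps)**2 / (4.0 * eps)))

def frac_dnn_forward(xi, weights, biases, gam=0.5, h=0.25, eps=0.1):
    """Fractional DNN forward pass in the spirit of (3.11): each layer update sees
    the history of the earlier hidden states through the L1 coefficients (3.10)."""
    tau = h**gam * gamma(2.0 - gam)
    L = len(weights)                                   # weight matrices W_0, ..., W_{L-1}
    a = [(m + 1)**(1.0 - gam) - m**(1.0 - gam) for m in range(L)]
    phi = [smooth_relu(weights[0] @ xi + biases[0], eps)]   # phi_1 (phi_0 = xi is the input)
    for j in range(2, L):                              # j = 2, ..., L-1
        # History over hidden states; the phi_0 term of (3.11) is omitted here
        # because phi_0 = xi lives in a different space in this sketch.
        hist = sum(a[j - 1 - k] * (phi[k] - phi[k - 1]) for k in range(1, j - 1))
        phi.append(phi[-1] - hist
                   + tau * smooth_relu(weights[j - 1] @ phi[-1] + biases[j - 1], eps))
    return weights[-1] @ phi[-1]                       # phi_L = W_{L-1} phi_{L-1}

# Usage sketch: L = 4, hidden width 15, input dim 2, output dim 400; gam = 1 recovers (3.1).
rng = np.random.default_rng(2)
dims = [2, 15, 15, 15, 400]
weights = [0.1 * rng.standard_normal((dims[i + 1], dims[i])) for i in range(4)]
biases = [np.zeros(dims[i + 1]) for i in range(3)]
print(frac_dnn_forward(np.array([1.0, 2.0]), weights, biases).shape)    # (400,)
```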
There are two ways to approach the constrained optimization problem (3.12). The first approach is the so-called reduced approach, where we eliminate the constraints (3.11) and consider the minimization problem (3.12) only in terms of θ. The resulting problem can be solved by using a gradient-based method such as BFGS, see e.g., [38, Chapter 4]. During every step of the gradient method, one needs to solve the state equation (3.11) and an adjoint equation. These two solves enable us to derive an expression for the gradient of the reduced loss function with respect to θ. Alternatively, one can derive the gradient and adjoint equations by using the Lagrangian approach. It is well known that the gradient with respect to θ for both approaches coincides, see [3, pg. 14] for instance. We next illustrate how to evaluate this gradient using the Lagrangian approach.

The Lagrangian functional associated with the discrete constrained optimization problem (3.12) is given by

L(φ, θ, ψ) := J(θ; ξ, u) + ⟨φ_1 − σ(W_0 φ_0 + b_0), ψ_1⟩
  + Σ_{j=2}^{L−1} ⟨ φ_j − φ_{j−1} + Σ_{k=0}^{j−2} a_{j−1−k} (φ_{k+1} − φ_k) − h^γ Γ(2 − γ) σ(W_{j−1} φ_{j−1} + b_{j−1}), ψ_j ⟩
  + ⟨φ_L − W_{L−1} φ_{L−1}, ψ_L⟩,   (3.13)

where the ψ_j's are the Lagrange multipliers, also called adjoint variables, corresponding to (3.11), and ⟨·,·⟩ is the inner product on R^{N_x}.

Next we write the state and adjoint equations fulfilled at a stationary point of the Lagrangian L. In addition, we state the derivative of L with respect to the design variable θ.

(i) State equation:

φ_1 = σ(W_0 ξ + b_0),
φ_j = φ_{j−1} − Σ_{k=0}^{j−2} a_{j−1−k} (φ_{k+1} − φ_k) + h^γ Γ(2 − γ) σ(W_{j−1} φ_{j−1} + b_{j−1}),  2 ≤ j ≤ L − 1,
φ_L = W_{L−1} φ_{L−1}.   (3.14a)

(ii) Adjoint equation:

ψ_j = ψ_{j+1} + Σ_{k=j+1}^{L−2} a_{k−j} (ψ_{k+1} − ψ_k) + h^γ Γ(2 − γ) W_j^T (ψ_{j+1} ⊙ σ'(W_j φ_j + b_j)),  j = L − 2, ..., 1,
ψ_{L−1} = −W_{L−1}^T ψ_L,
ψ_L = ∂_{φ_L} J(θ; ξ, u).   (3.14b)

(iii) Derivative with respect to θ:

∂_{W_{L−1}} L = ψ_L φ_{L−1}^T = ∂_{φ_L} J(θ; ξ, u) φ_{L−1}^T,
∂_{W_j} L = −φ_j (ψ_{j+1} ⊙ σ'(W_j φ_j + b_j))^T + ∂_{W_j} J(θ; ξ, u),  j = 0, ..., L − 2,
∂_{b_j} L = −ψ_{j+1} ⊙ σ'(W_j φ_j + b_j) + ∂_{b_j} J(θ; ξ, u),  j = 0, ..., L − 2.   (3.14c)

Expression (3.14c) provides the derivative of L with respect to θ. We then use a gradient-based method (BFGS in our case) to identify θ.

3.3. Error estimates. In this section we briefly address the question of how well the deep neural approximation approximates the PDE solution map Φ : R^{N_ξ} → R^{N_x}. The approximation capabilities of neural networks have received a lot of attention recently in the literature, see e.g., [15, 34, 46, 53] and the references therein. The papers [34, 46] obtain results based on general activation functions for which the approximation rate is O(n^{−1/2}), where n is the total number of hidden neurons. This implies that neural networks can overcome the curse of dimensionality, as the approximation rate is independent of the dimension N_x.

We begin by making the observation that the fractional DNN can be written as a linear combination of activation functions evaluated at different layers. Indeed, observe from (3.11) that if Φ(ξ) is approximated by a one-hidden layer network Φ̂(ξ, θ) := φ_2 (that is, L = 2), then it can be expressed as

φ_2 = W_1 σ(W_0 ξ + b_0).   (3.15)

By a one-hidden layer network, we mean a network with the input layer, one hidden layer and the output layer [15, 34, 39, 46]. Next, for L = 3, we obtain

φ_3 = W_2 φ_2 = W_2 [ α_1 σ(W_0 φ_0 + b_0) + α_2 σ(W_1 φ_1 + b_1) + α_3 φ_0 ],   (3.16)

where α_1 = 1 − a_1, α_2 = τ := h^γ Γ(2 − γ) and α_3 = a_1.
Similarly, if we set L = 4, then we get

φ_4 = W_3 φ_3 = W_3 [ α_1 σ(W_0 φ_0 + b_0) + α_2 σ(W_1 φ_1 + b_1) + α_3 σ(W_2 φ_2 + b_2) + α_4 φ_0 ],   (3.17)

where α_1 = 1 − a_1 + a_1² − a_2, α_2 = (1 − a_1) τ, α_3 = τ, and α_4 = a_1 − a_1² + a_2. Proceeding in a similar fashion yields the following result regarding the multilayer fractional DNN.
Proposition 3.1 (Representation of multilayer fractional DNN). For L ≥ 2, the fractional DNN given by (3.11) fulfills

φ_L = W_{L−1} [ α_{L−1} φ_0 + Σ_{i=0}^{L−2} α_i σ(W_i φ_i + b_i) ],   (3.18)

where the α_i are constants depending on τ and the a_i given by (3.10). Observe that a one-hidden layer fDNN (that is, L = 2) coincides with a one-hidden layer standard DNN.

Error analysis for multilayer networks is generally challenging. Some papers that study this include [30, 53] and the references therein. The results of these two papers focus mainly on the ReLU activation function. In particular, [30] discusses the approximation of linear finite elements by ReLU deep and shallow networks.

In this paper, we restrict the analysis to the one-hidden layer fDNN, i.e., L = 2. In particular, we consider a one-hidden layer network with finitely many neurons. Notice that for L = 2, the fDNN coincides with the standard DNN according to Proposition 3.1. Therefore, it is possible to extend the result from [46] to our case. To state the result from [46], we first introduce some notation.
To this end, we make the assumption that the function Φ(ξ) is defined on a bounded domain P ⊂ R^{N_ξ} and has a bounded Barron norm

‖Φ‖_{B^m} = ∫ (1 + |ω|)^m |Φ̂(ω)| dω,  m ≥ 1,   (3.19)

where Φ̂ here denotes the Fourier transform of Φ. Now, let m ∈ N ∪ {0} and p ∈ [1, ∞]. Recall that the Sobolev space W^{m,p}(P) is the space of functions in L^p(P) whose weak derivatives of order m are also in L^p(P):

W^{m,p}(P) := { f ∈ L^p(P) : D^𝐦 f ∈ L^p(P), ∀ 𝐦 with |𝐦| ≤ m },   (3.20)

where 𝐦 = (m_1, ..., m_{N_ξ}) ∈ {0, 1, 2, ...}^{N_ξ}, |𝐦| = m_1 + m_2 + ⋯ + m_{N_ξ} and D^𝐦 f is the weak derivative. In particular, the space H^m(P) := W^{m,2}(P) is a Hilbert space with inner product and norm given respectively by

(f, g)_{H^m(P)} = Σ_{|𝐦| ≤ m} ∫_P D^𝐦 f(x) D^𝐦 g(x) dx,  and  ‖f‖_{H^m(P)} = (f, f)^{1/2}_{H^m(P)}.

For p = ∞, we note that the Sobolev space W^{m,∞}(P) is equipped with the norm

‖f‖_{W^{m,∞}(P)} = max_{𝐦 : |𝐦| ≤ m} ess sup_{x ∈ P} |D^𝐦 f(x)|.   (3.21)

Also, we say that f ∈ W^{m,p}_{loc}(P) if f ∈ W^{m,p}(P′) for all compact P′ ⊂ P.

Next, since the mapping Φ : R^{N_ξ} → R^{N_x} can be computed using N_x mappings Φ_j : R^{N_ξ} → R, it suffices to focus on networks with one output unit. Observe from (3.15) that

Φ̂(ξ, θ) := φ_2 = Σ_{i=1}^{n} c_i σ(ω_i · ξ + b_i),   (3.22)

where n is the total number of neurons in the hidden layer, b_i is an element of the bias vector, and c_i and ω_i are rows of W_1 and W_0, respectively. Now, for a given activation function σ, define the set

F^n_{N_ξ}(σ) = { Σ_{i=1}^{n} c_i σ(ω_i · ξ + b_i) :  ω_i ∈ R^{N_ξ}, b_i ∈ R }.   (3.23)

We can now state the following result for any non-zero activation function satisfying some regularity conditions [46, Corollary 1]. After this result, we will show that our activation function in (3.2) fulfills the required assumptions.

Theorem 3.2.
Let P be a bounded domain and let σ ∈ W^{m,∞}_{loc}(R) be a non-zero activation function. Suppose there exists a function ν ∈ F^q_1(σ) satisfying

|ν^{(k)}(t)| ≤ C_p (1 + |t|)^{−p},  0 ≤ k ≤ m,  p > 1.   (3.24)

Then, for any Φ ∈ B^{m+1}, we have

inf_{Φ̂_n ∈ F^n_{N_ξ}(σ)} ‖Φ − Φ̂_n‖_{H^m(P)} ≤ C(σ, m, p, β) |P|^{1/2} q^{1/2} n^{−1/2} ‖Φ‖_{B^{m+1}},   (3.25)

where β = diam(P) and C is a constant depending on σ, m, p and β.

The theorem thus gives an approximation rate of O(n^{−1/2}). Here, the function ν(x) is a linear combination of shifts and dilations of the activation σ(x). Moreover, the activation function itself need not satisfy the decay (3.24). The theorem says that it suffices to find some function ν ∈ F^q_1(σ) (cf. (3.23) with n = q, N_ξ = 1) for which (3.24) holds.

In this paper, we are working with the smooth ReLU (see (3.2)) as the activation function σ(x), for which m = 2. One choice of ν for which (3.24) holds is ν ∈ F^3_1(σ) given by

ν(t) = σ(t + 1) + σ(t − 1) − 2σ(t).   (3.26)

It can be shown that ν satisfies (3.24) with C_p = 20 and p = 1.5, that is,

|ν^{(k)}(t)| ≤ 20 (1 + |t|)^{−1.5},  0 ≤ k ≤ 2.   (3.27)

This can be verified from Figure 3.2.

Fig. 3.2: Verification of the decay condition (3.24) for ν in (3.26), which has been defined using the smooth ReLU in (3.2).

In what follows, we discuss the application of our proposed fractional DNN to the solution of two statistical Bayesian inverse problems.
4. Application to Bayesian inverse problems.
The inverse problem associated with the forward problem (2.1) essentially requires estimating a parameter vector ξ ∈ R^{N_ξ} given some observed noisy limited data d ∈ R^{N_x}. In the Bayesian inference framework, the posterior probability density, π(ξ | d) : R^{N_ξ} → R, solves the statistical inverse problem. In principle, π(ξ | d) encodes the uncertainty from the set of observed data and the sought parameter vector. More formally, the posterior probability density is given by Bayes' rule as

π_pos(ξ) := π(ξ | d) ∝ π_0(ξ) π(d | ξ),   (4.1)

where π_0(·) : R^{N_ξ} → R is the prior and π(d | ξ) is the likelihood function. The standard approach to Bayesian inference uses the assumption that the observed data are of the form (see e.g., [10, 20])

d = Φ(ξ) + η,   (4.2)

where Φ is the parameter-to-observable map and η ∼ N(0, Σ_η). In our numerical experiments, we assume that Σ_η = κ I, where I is the identity matrix with appropriate dimensions and κ denotes the variance. Now, let the log-likelihood be given by

L(ξ) = (1/2) ‖ Σ_η^{−1/2} (d − Φ(ξ)) ‖²;   (4.3)

then Bayes' rule in (4.1) becomes

π_pos(ξ) := π(ξ | d) ∝ π_0(ξ) π(d | ξ) = (1/Z) exp(−L(ξ)) π_0(ξ),   (4.4)

where Z = ∫_P exp(−L(ξ)) π_0(ξ) dξ is a normalizing constant.

We note here that, in practice, the posterior density rarely admits an analytic expression. Indeed, it is generally evaluated using Markov chain Monte Carlo (MCMC) sampling techniques. In MCMC schemes one has to perform several thousands of forward model simulations. This is a significant computational challenge, especially if the forward model represents a large-scale high fidelity discretization of a nonlinear PDE. It is therefore reasonable to construct a cheap-to-evaluate surrogate for the forward model in the offline stage for use in online computations in MCMC algorithms.

In what follows, we discuss MCMC schemes for sampling posterior probability distributions, as well as how our fractional DNN surrogate modeling strategy can be incorporated in them. Markov chain Monte Carlo (MCMC) methods are very powerful and flexible numerical sampling approaches for approximating posterior distributions, see e.g., [8, 10, 47]. Prominent MCMC methods include the Metropolis-Hastings (MH) algorithm [47], the Gibbs algorithm [5], the Hamiltonian Monte Carlo (HMC) algorithm [8, 9], the Metropolis-adjusted Langevin algorithm (MALA) [8], and the preconditioned Crank-Nicolson (pCN) algorithm [5], and their variants. In our numerical experiments, we will use the MH, pCN, HMC, and MALA algorithms. We will describe only MH in this paper and refer the reader to [8] for the details of the derivation and properties of the many variants of the other three algorithms.

To this end, consider a given parameter sample ξ = ξ_i. The MH algorithm generates a new sample ξ_{i+1} as follows:
(1) Generate a proposed sample ξ* from a proposal density q(ξ* | ξ), and compute q(ξ | ξ*).
(2) Compute the acceptance probability

α(ξ* | ξ) = min { 1, [π(ξ* | d) q(ξ | ξ*)] / [π(ξ | d) q(ξ* | ξ)] }.   (4.5)

(3) If Uniform(0, 1] < α(ξ* | ξ), then set ξ_{i+1} = ξ*. Else, set ξ_{i+1} = ξ.

Observe from (4.3) and (4.4) that each evaluation of the likelihood function, and hence of the acceptance probability (4.5), requires the evaluation of the forward model to compute π(ξ* | d).
In practice, one has to do tens of thousands of forward solves for the MH algorithm to converge. We propose to replace the forward solves by the fractional DNN surrogate model. In our numerical experiments, we follow [27], in which an adaptive Gaussian proposal is used:

q(ξ* | ξ_{i−1}) ∝ exp( −(1/2) (ξ* − ξ_{i−1})^T C_{i−1}^{−1} (ξ* − ξ_{i−1}) ),   (4.6)

where C_0 = I,

C_{i−1} = (1/i) Σ_{j=0}^{i−1} (ξ_j − ξ̄)(ξ_j − ξ̄)^T + ϑ I,   ξ̄ = (1/i) Σ_{j=0}^{i−1} ξ_j,

and ϑ > 0 is a small regularization constant. When the proposal is chosen adaptively as specified above, the MH method is referred to as an Adaptive Metropolis (AM) method [4, 27].
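To make the interplay between the sampler and the surrogate concrete, the sketch below (hypothetical; the `surrogate` callable, the box prior and all numerical values are placeholders) implements an adaptive Metropolis loop in the spirit of (4.5)-(4.6), with the surrogate replacing the forward PDE solve in the likelihood (4.3). Since the Gaussian random-walk proposal is symmetric, the q-terms in (4.5) cancel.

```python
import numpy as np

def adaptive_mh(surrogate, d, kappa, bounds, n_samples, theta=1e-8, seed=0):
    """Adaptive Metropolis sampling of the posterior (4.4): Gaussian random-walk
    proposal whose covariance is the regularized empirical covariance of past samples."""
    rng = np.random.default_rng(seed)
    dim = bounds.shape[0]

    def log_post(xi):
        if np.any(xi < bounds[:, 0]) or np.any(xi > bounds[:, 1]):
            return -np.inf                               # flat prior on the box
        r = d - surrogate(xi)
        return -0.5 * r @ r / kappa                      # log-likelihood, cf. (4.3)

    xi = bounds.mean(axis=1)                             # start at the center of the box
    lp = log_post(xi)
    chain = [xi]
    C = np.eye(dim)
    for i in range(1, n_samples):
        if i % 100 == 0:                                 # adapt the proposal covariance, cf. (4.6)
            C = np.cov(np.array(chain).T) + theta * np.eye(dim)
        prop = rng.multivariate_normal(xi, C)
        lp_prop = log_post(prop)
        if np.log(rng.uniform()) < lp_prop - lp:         # accept/reject, cf. (4.5)
            xi, lp = prop, lp_prop
        chain.append(xi)
    return np.array(chain)

# Usage sketch with a toy two-parameter "surrogate" standing in for the trained fDNN.
surrogate = lambda xi: np.array([xi[0] + xi[1], xi[0] * xi[1]])
d = surrogate(np.array([1.0, 0.5])) + 1e-2 * np.random.default_rng(1).standard_normal(2)
chain = adaptive_mh(surrogate, d, kappa=1e-4,
                    bounds=np.array([[0.0, 2.0], [0.0, 2.0]]), n_samples=2000)
print(chain.mean(axis=0))                                # posterior mean estimate
```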
5. Numerical Experiments.
In this section, we consider two statistical inverse problems. The first one is a diffusion-reaction problem in which two parameters need to be inferred [18]. The second one is a more challenging problem, a thermal fin problem from [8], which involves one hundred parameters to be identified. All experiments were performed using MATLAB R2020b on a Mac desktop with 32GB of RAM and a 3.6 GHz Quad-Core Intel Core i7 CPU.

In both of these experiments we train using a 3-hidden-layer network with 15 neurons in each hidden layer (that is, L = 4, n = 45) and k = 400 neurons in the output layer, where k matches the dimension of the POD basis as described in section 5.1 below. We chose a fixed smoothing parameter ε in (3.2), the final time T = 1 with corresponding step size h, a fractional exponent γ ∈ (0, 1), and a regularization parameter λ in (3.5). To train the network, we use the BFGS optimization method [38, Chapter 4].

Tables 5.1 and 5.2 show the number of BFGS iterations and the CPU times required to train the data from the diffusion-reaction problem and the thermal fin problem, respectively. Also shown in these tables are the relative errors obtained by evaluating the trained network at a parameter ξ_e not in the training set E; here,

Error = ‖u(ξ_e) − Φ̂(ξ_e)‖_∞ / ‖u(ξ_e)‖_∞,

where u(ξ_e) and Φ̂(ξ_e) are the true and surrogate solutions at ξ_e, respectively. Note from both tables that after 1600 BFGS iterations, the decrease in errors is not significant. Hence, for all the experiments discussed below, we use 1600 BFGS iterations.

Table 5.1: Number of BFGS iterations, relative errors and CPU training times (in seconds) for the diffusion-reaction problem.

Table 5.2: Number of BFGS iterations, relative errors and CPU training times (in seconds) for the thermal fin problem.
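As a simplified illustration of the training step, the sketch below (hypothetical; a one-hidden-layer network, cf. (3.15), and toy data stand in for the fDNN and the POD targets) minimizes a regularized mean-squared loss of the form (3.5) with SciPy's BFGS routine; all sizes and values are placeholders.

```python
import numpy as np
from scipy.optimize import minimize

def smooth_relu(x, eps=0.1):
    # C^1 smoothing of ReLU, cf. (3.2).
    return np.where(x > eps, x, np.where(x < -eps, 0.0, (x + eps)**2 / (4.0 * eps)))

def unpack(theta, n_in, width, n_out):
    # Split the flat parameter vector theta into W_0, b_0, W_1.
    i = 0
    W0 = theta[i:i + width * n_in].reshape(width, n_in); i += width * n_in
    b0 = theta[i:i + width]; i += width
    W1 = theta[i:i + n_out * width].reshape(n_out, width)
    return W0, b0, W1

def loss(theta, X, U, lam, n_in, width, n_out):
    # Regularized mean-squared loss in the spirit of (3.5).
    W0, b0, W1 = unpack(theta, n_in, width, n_out)
    pred = smooth_relu(X @ W0.T + b0) @ W1.T
    misfit = 0.5 * np.mean(np.sum((pred - U)**2, axis=1))
    return misfit + lam * np.sum(theta**2)

# Toy data: 200 parameter samples in R^2, targets in R^5.
rng = np.random.default_rng(4)
X = rng.uniform(0.0, 1.0, size=(200, 2))
U = np.column_stack([np.sin(X.sum(axis=1)), X[:, 0] * X[:, 1],
                     X[:, 0]**2, np.cos(X[:, 1]), X.prod(axis=1) + 1.0])
n_in, width, n_out, lam = 2, 15, 5, 1e-6
theta0 = 0.1 * rng.standard_normal(width * n_in + width + n_out * width)
res = minimize(loss, theta0, args=(X, U, lam, n_in, width, n_out),
               method='BFGS', options={'maxiter': 400})
print(res.fun)     # final value of the training loss
```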
5.1. Example 1: A nonlinear diffusion-reaction problem. We consider the following nonlinear diffusion-reaction problem posed in a two-dimensional spatial domain [11, 24]:

−Δu + g(u; ξ) = f,  in Ω = (0, 1)²,
u = 0,  on ∂Ω,   (5.1)

where g(u; ξ) = (ξ_1/ξ_2)[exp(ξ_2 u) − 1] and f = 100 sin(2π x_1) sin(2π x_2). Moreover, the parameters are ξ = (ξ_1, ξ_2) ∈ [0.01, 10]² ⊂ R².

Equation (5.1) is discretized on a uniform mesh in Ω with 64 grid points in each direction using centered differences, resulting in 4096 degrees of freedom. We obtained the solution of the resulting system of nonlinear equations using an inexact Newton-GMRES method, with a small stopping tolerance, as described in [37].

To train the network, we first computed N_s = 900 solution snapshots S corresponding to a set E of 900 parameters ξ = (ξ_1, ξ_2) drawn from the parameter space [0.01, 10]². These parameters were chosen using Latin hypercube sampling. Each solution snapshot is of dimension N_x = 4096. Next, we computed the SVD of the matrix of solution snapshots, S = Ṽ Σ W^T, and set our POD basis V = Ṽ(:, 1:k), where k = 400. Our training set then consisted of E as inputs and V^T S ∈ R^{k × N_s} as our targets. As reported by Hesthaven and Ubbiali in [32] in the context of solving parametrized PDEs, the POD-DNN approach accelerates the online computations for surrogate models. Thus, we follow this approach in our numerical experiments; in particular, using k-dimensional data (where k ≪ N_x) confirms an overall speed up in the solution of the statistical inverse problems.

Next, in both numerical examples considered in this paper, we solved the inverse problems using M = 20,000 MCMC samples. In each case, the first 10,000 samples were discarded for "burn-in" effects, and the remaining 10,000 samples were used to obtain the reported statistics. Here, we used an initial Gaussian proposal (4.6) with covariance C_0 = I and updated the covariance after every 100th sample. As in [18], we assume for this problem that ξ is uniformly distributed over the parameter space [0.01, 10]². Hence, (4.4) becomes

π_pos(ξ) ∝ exp( −(1/(2κ)) (d − Φ(ξ))^T (d − Φ(ξ)) )  if ξ ∈ [0.01, 10]²,  and 0 otherwise.   (5.2)

We generated the data d by using (4.2) with the true parameter to be identified, ξ_e, and a Gaussian noise vector η with variance κ.

Figures 5.1, 5.2 and 5.3 show, respectively, the histograms (which depict the posterior distribution), the Markov chains and the autocorrelation functions corresponding to the parameters, using the high fidelity model (Full) and the deep neural network surrogate (fDNN) in the MCMC algorithm. Recall that, for a Markov chain {δ_j}_{j=1}^{J} generated by the Metropolis-Hastings algorithm, with variance κ, the autocorrelation function (ACF) ϱ of the δ-chain is given by ϱ(j) = cov(δ_0, δ_{|j|})/κ. The integrated autocorrelation time (IACT) τ_int of the chain is defined as

τ_int(δ) := Σ_{j=−J+1}^{J−1} ϱ(j) ≈ Σ_{j=1}^{J−1} (1 − j/J) cov(δ_0, δ_j)/κ.   (5.3)

The ACF decays very fast to zero as j → ∞.
In practice, J is often taken to be ⌊10 log M⌋ [5]. If IACT = K, this means that roughly every K-th sample of the δ-chain is independent. We have used the following estimators to approximate ϱ(j) and τ_int in our computations [4, 48]:

ϱ̂(j) := B(j)/B(0),  and  τ̂_int := Σ_{j=−Ĵ}^{Ĵ} ϱ̂(j),

where B(j) = (1/(J − j)) Σ_{k=1}^{J−j} (δ_k − δ̄)(δ_{k+|j|} − δ̄), δ̄ is the mean of the δ-chain, and Ĵ is chosen to be the smallest integer such that Ĵ ≥ τ̂_int.
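The estimators for ϱ(j) and τ_int above can be computed directly from a chain; the sketch below (hypothetical code, not the authors') follows the sample-covariance form of B(j) and the self-consistent choice of the window Ĵ, and tests it on a synthetic AR(1) chain with known IACT.

```python
import numpy as np

def acf_and_iact(delta, max_lag=200):
    """Empirical ACF rho(j) = B(j)/B(0) and integrated autocorrelation time,
    with the window J_hat chosen self-consistently (cf. [4, 48])."""
    delta = np.asarray(delta, dtype=float)
    J = len(delta)
    dbar = delta.mean()
    B = np.array([np.mean((delta[:J - j] - dbar) * (delta[j:] - dbar))
                  for j in range(min(max_lag, J - 1))])
    rho = B / B[0]
    tau = 1.0
    for j_hat in range(1, len(rho)):
        tau = rho[0] + 2.0 * np.sum(rho[1:j_hat + 1])   # sum over -j_hat, ..., j_hat by symmetry
        if j_hat >= tau:                                # smallest window exceeding the estimate
            break
    return rho, tau

# Usage sketch: AR(1) chain with coefficient 0.8, whose exact IACT is (1+0.8)/(1-0.8) = 9.
rng = np.random.default_rng(3)
x = np.zeros(20000)
for k in range(1, len(x)):
    x[k] = 0.8 * x[k - 1] + rng.standard_normal()
rho, tau = acf_and_iact(x)
print(tau)   # roughly 8; somewhat below the exact value 9 because of the finite window
```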
1) appropriately.We also examine the impact of training accuracy in Fig 5.3. The figure shows thatthe ACFs decay very fast to zero, as expected. The value of IACT is roughly 10 forall the MCMC chains generated by the full model (left). The right of the figure showsthe IACT for fDNN model, where the red-colored curves show the results obtained6
Fig. 5.2: MCMC samples for the parameters ξ = (ξ_1, ξ_2) using the Full (first and third) and fDNN (second and fourth) models.
Fig. 5.3: Autocorrelation functions (ACFs) for the ξ_1 and ξ_2 chains computed with the Full (left) and fDNN (right) models. The red and blue lines are ACFs obtained with 1600 and 6400 BFGS iterations, respectively.

The computed IACT values are all close to 10, so that roughly every 10th sample of the MCMC chains generated by both the full and the fDNN models is independent. These results indicate that the fDNN surrogate model produces results with accuracy comparable to those obtained using the full model. The advantage of the surrogate lies in its reduced costs. In this example, using the MH algorithm, the full model required about 2347 seconds of CPU time for the MCMC simulation, whereas the fDNN surrogate reduced the online cost dramatically. There is, in addition, an offline cost associated with computing the parameter θ that defines the fDNN. For this example, this entailed construction of the 900 snapshot solutions used to produce targets for the training process, computation of the SVD of the matrix S of snapshots giving the targets, together with the training of the network (using BFGS). The times required for these computations were approximately 127 seconds to construct the snapshots, less than a minute to compute the SVD, and about 29 seconds to train the network. These offline costs are incurred only once, and they are amortized further whenever the surrogate is reused, for example by MCMC methods that run several chains. This is, for instance, the case with the Differential Evolution Adaptive Metropolis (DREAM) method, which runs multiple different chains simultaneously when used to sample the posterior distributions [51].
5.2. Example 2: A thermal fin problem. Next, we consider the following thermal fin problem from [8]:

−div(e^{ξ(x)} ∇u) = 0,  in Ω,
(e^{ξ(x)} ∇u) · n + β u = 0,  on ∂Ω \ Γ,
(e^{ξ(x)} ∇u(x)) · n = 1,  on Γ = (−0.5, 0.5) × {0},   (5.4)

where β > 0 is a fixed constant. These equations (5.4) represent a forward model for heat conduction over the non-convex domain Ω shown in Fig. 5.4. Given the heat conductivity function e^{ξ(x)}, the forward problem (5.4) is used to compute the temperature u. The goal of this inverse problem is to infer 100 unknown parameters ξ from 262 noisy observations of u. Fig. 5.4 shows the location of the observations on the boundary ∂Ω \ Γ, as well as the forward PDE solution u at the true parameter ξ.

To train the network, we first computed, as in the diffusion-reaction case, N_s = 900 solution snapshots S corresponding to 900 parameters ξ ∈ R^{100} drawn using Latin hypercube sampling. This problem is a lot more difficult than the previous problem in the sense that the dimension of the parameter space in this case is 100 rather than 2. Next, we compute the SVD of S and proceed as before. (We did not try to optimize the number of samples used for training; in particular, N_s = 900 was an essentially arbitrary choice. The results obtained here with N_s = 900 were virtually the same as those found using 10,000 training samples, with the smaller number of samples incurring dramatically lower offline costs. Further reduction in computational time could be achieved using a smaller training data set. In [32], the authors used O(400) snapshots to train the feedforward network and still achieved good results.)

In what follows, the infinite-dimensional variants of the Riemannian manifold Metropolis-adjusted Langevin algorithm (MALA) and the Hamiltonian Monte Carlo (HMC) algorithm presented in [8] are denoted, respectively, by ∞-MALA and ∞-HMC. In particular, Fig. 5.5 shows the true conductivity, as well as the posterior mean estimates obtained by three MCMC methods, pCN, ∞-MALA and ∞-HMC, using the fractional DNN as a surrogate model.

Fig. 5.4: The location of observations (circles) (left) and the forward PDE solution u under the true parameter ξ (right).

Fig. 5.5: The true heat conductivity field e^{ξ(x)} (upper left) and the mean estimates of the posterior obtained by the different MCMC methods using the fractional DNN as a surrogate model.
As in [8], for this problem we used a Gaussian prior defined on a rectangular domain D containing the fin, with covariance C whose eigen-structure is given by

C = Σ_{i ∈ I} µ_i { ϕ_i ⊗ ϕ_i },

where the eigenvalues µ_i decay algebraically in i, the eigenfunctions are

ϕ_i(x) = 2 |D|^{−1/2} cos(π (i_1 + 1/2) x_1) cos(π (i_2 + 1/2) x_2),

and I = { i = (i_1, i_2) : i_1 ≥ 0, i_2 ≥ 0 }. Moreover, the log-conductivity ξ was the true parameter to be inferred, with coordinates ξ_i proportional to µ_i for 1 ≤ i_1, i_2 ≤ 10.

Table 5.3 shows the average acceptance rates for these models and the computational times required to perform the MCMC simulation for different variants of the MCMC algorithm, using both fDNN surrogate computations and the full-order discrete PDE solution. These results indicate that the acceptance rates are comparable for the fDNN surrogate model and the full-order forward discrete PDE solver. Moreover, there is about a 90% reduction in the computational times when the fDNN surrogate solution is used, a clear advantage of the fDNN approach. The costs of the offline computations for fDNN were again modest, approximately 63 seconds for the snapshot computations plus the time needed to compute the SVD and to train the network.

Table 5.3: Average acceptance rates and computational times for the pCN, ∞-MALA and ∞-HMC algorithms together with the fDNN and full forward models.
6. Conclusions.
This paper has introduced a novel deep neural network (DNN) based approach to approximate nonlinear parametrized partial differential equations (PDEs). The proposed DNN helps learn the parameter-to-PDE-solution operator. We have used this learnt solution operator to solve two challenging Bayesian inverse problems. The proposed approach is highly efficient without compromising on accuracy. We emphasize that the proposed approach shows several advantages over traditional surrogate approaches for parametrized PDEs, such as reduced basis methods. For instance, the proposed approach is fully non-intrusive and therefore can be directly used with legacy codes, unlike the reduced basis method for nonlinear PDEs, which can be highly intrusive.
REFERENCES
[1] H. Antil, M. Heinkenschloss, and D. C. Sorensen, Application of the discrete empirical interpolation method to reduced order modeling of nonlinear and parametric systems, in Reduced Order Methods for Modeling and Computational Reduction, vol. 9 of MS&A. Model. Simul. Appl., Springer, Cham, 2014, pp. 101–136.
[2] H. Antil, R. Khatri, R. Lohner, and D. Verma, Fractional deep neural network via constrained optimization, Machine Learning: Science and Technology, (2020).
[3] H. Antil and D. Leykekhman, A brief introduction to PDE-constrained optimization, in Frontiers in PDE-Constrained Optimization, vol. 163 of IMA Vol. Math. Appl., Springer, New York, 2018, pp. 3–40.
[4] J. M. Bardsley, Computational Uncertainty Quantification for Inverse Problems, SIAM, 2018.
[5] J. M. Bardsley, A. Solonen, H. Haario, and M. Laine, Randomize-then-optimize: A method for sampling from posterior distributions in nonlinear inverse problems, SIAM Journal on Scientific Computing, 36(4) (2014), pp. A1895–A1910.
[6] M. Barrault, Y. Maday, N. C. Nguyen, and A. T. Patera, An 'empirical interpolation' method: Application to efficient reduced-basis discretization of partial differential equations, Comptes Rendus Mathematique, 339 (2004), pp. 667–672.
[7] M. Benning, E. Celledoni, M. Ehrhardt, B. Owren, and C.-B. Schönlieb, Deep learning as optimal control problems: Models and numerical methods, Journal of Computational Dynamics, 6 (2019), pp. 171–198.
[8] A. Beskos, M. Girolami, S. Lan, P. E. Farrell, and A. M. Stuart, Geometric MCMC for infinite-dimensional inverse problems, Journal of Computational Physics, 335 (2017), pp. 327–351.
[9] T. Bui-Thanh and M. Girolami, Solving large-scale PDE-constrained Bayesian inverse problems with Riemann manifold Hamiltonian Monte Carlo, Inverse Problems, 30 (2014), p. 114014.
[10] D. Calvetti and E. Somersalo, Introduction to Bayesian Scientific Computing, Springer, 2007.
[11] S. Chaturantabut, Nonlinear Model Reduction via Discrete Empirical Interpolation, PhD thesis, Rice University, Houston, 2011.
[12] S. Chaturantabut and D. C. Sorensen, Nonlinear model reduction via discrete empirical interpolation, SIAM Journal on Scientific Computing, 32 (2010), pp. 2737–2764.
[13] Y. Chen, L. Lu, G. E. Karniadakis, and L. D. Negro, Physics-informed neural networks for inverse problems in nano-optics and metamaterials, Optics Express, 28 (2020), pp. 11618–11633.
[14] C. Eckart and G. Young, The approximation of one matrix by another of lower rank, Psychometrika, 1 (1936), pp. 211–218.
[15] S. Ellacott, Aspects of the numerical analysis of neural networks, Acta Numerica, 3 (1994), pp. 145–202.
[16] H. C. Elman and V. Forstall, Preconditioning techniques for reduced basis methods for parameterized elliptic partial differential equations, SIAM Journal on Scientific Computing, 37(5) (2015), pp. S177–S194.
[17] H. C. Elman and V. Forstall, Numerical solution of the steady-state Navier-Stokes equations using empirical interpolation methods, Computer Methods in Applied Mechanics and Engineering, 317 (2017), pp. 380–399.
[18] H. C. Elman and A. Onwunta, Reduced-order modeling for nonlinear Bayesian statistical inverse problems, https://arxiv.org/abs/1909.02539, (2019), pp. 1–20.
[19] H. C. Elman, D. J. Silvester, and A. J. Wathen, Finite Elements and Fast Iterative Solvers, second ed., Oxford University Press, 2014.
[20] H. P. Flath, L. C. Wilcox, V. Akcelik, J. Hill, B. van Bloemen Waanders, and O. Ghattas, Fast algorithms for Bayesian uncertainty quantification in large-scale linear inverse problems based on low-rank partial Hessian approximations, SIAM Journal on Scientific Computing, 33 (2011), pp. 407–432.
[21] V. Forstall, Iterative Solution Methods for Reduced-order Models of Parameterized Partial Differential Equations, PhD thesis, University of Maryland, College Park, 2015.
[22] G. H. Golub and C. H. Van Loan, Matrix Computations, JHU Press, 1996.
[23] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, MIT Press, 2016.
[24] M. A. Grepl, Y. Maday, N. C. Nguyen, and A. T. Patera, Efficient reduced-basis treatment of nonaffine and nonlinear partial differential equations, Mathematical Modelling and Numerical Analysis, 41(3) (2007), pp. 575–605.
[25] M. Gulian, M. Raissi, P. Perdikaris, and G. E. Karniadakis, Machine learning of space-fractional differential equations, SIAM Journal on Scientific Computing, 41 (2019), pp. A2485–A2509.
[26] S. Günther, L. Ruthotto, J. B. Schroder, E. C. Cyr, and N. R. Gauger, Layer-parallel training of deep residual neural networks, SIAM Journal on Mathematics of Data Science, 2 (2020), pp. 1–23.
[27] H. Haario, E. Saksman, and J. Tamminen, An adaptive Metropolis algorithm, Bernoulli, 7 (2001), pp. 223–242.
[28] E. Haber and L. Ruthotto, Stable architectures for deep neural networks, Inverse Problems, 34 (2017), p. 014004.
[29] J. Han, A. Jentzen, and W. E, Solving high-dimensional partial differential equations using deep learning, PNAS, 115 (2018), pp. 8505–8510.
[30] J. He, L. Li, J. Xu, and C. Zheng, ReLU deep neural networks and linear finite elements, Journal of Computational Mathematics, 38 (2020), p. 502.
[31] K. He, X. Zhang, S. Ren, and J. Sun, Deep residual learning for image recognition, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[32] J. Hesthaven and S. Ubbiali, Non-intrusive reduced order modeling of nonlinear problems using neural networks, Journal of Computational Physics, 363 (2018), pp. 55–78.
[33] J. S. Hesthaven, G. Rozza, and B. Stamm, Certified Reduced Basis Methods for Parametrized Partial Differential Equations, SpringerBriefs in Mathematics, Springer, Bilbao, 2016.
[34] K. Hornik, M. Stinchcombe, H. White, and P. Auer, Degree of approximation results for feedforward networks approximating unknown mappings and their derivatives, Neural Computation, 6 (1994), pp. 1262–1275.
[35] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, Densely connected convolutional networks, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4700–4708.
[36] J. Kaipio and E. Somersalo, Statistical and Computational Inverse Problems, Springer, 2005.
[37] C. T. Kelley, Iterative Methods for Linear and Nonlinear Equations, SIAM, 1995.
[38] C. T. Kelley, Iterative Methods for Optimization, SIAM, 1999.
[39] M. Leshno, V. Y. Lin, A. Pinkus, and S. Schocken, Multilayer feedforward networks with a nonpolynomial activation function can approximate any function, Neural Networks, 6 (1993), pp. 861–867.
[40] J. Li and Y. M. Marzouk, Adaptive construction of surrogates for the Bayesian solution of inverse problems, SIAM Journal on Scientific Computing, 36 (2014), pp. A1163–A1186.
[41] Z. Li, N. Kovachki, K. Azizzadenesheli, B. Liu, K. Bhattacharya, A. Stuart, and A. Anandkumar, Fourier neural operator for parametric partial differential equations, arXiv preprint arXiv:2010.08895, (2020).
[42] J. Martin, L. C. Wilcox, C. Burstedde, and O. Ghattas, A stochastic Newton MCMC method for large-scale statistical inverse problems with application to seismic inversion, SIAM Journal on Scientific Computing, 34 (2012), pp. A1460–A1487.
[43] G. Pang, L. Lu, and G. E. Karniadakis, fPINNs: Fractional physics-informed neural networks, SIAM Journal on Scientific Computing, 41 (2019), pp. A2603–A2626.
[44] A. Quarteroni, A. Manzoni, and F. Negri, Reduced Basis Methods for Partial Differential Equations: An Introduction, vol. 92, Springer, 2015.
[45] M. Raissi, P. Perdikaris, and G. E. Karniadakis, Physics-informed neural networks: a deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations, Journal of Computational Physics, 378 (2019), pp. 686–707.
[46] J. W. Siegel and J. Xu, Approximation rates for neural networks with general activation functions, Neural Networks, (2020), pp. 313–321.
[47] R. C. Smith, Uncertainty Quantification: Theory, Implementation, and Applications, SIAM, 2014.
[48] A. Sokal, Monte Carlo methods in statistical mechanics: Foundations and new algorithms, in Functional Integration, C. DeWitt-Morette, P. Cartier, and A. Folacci, eds., Springer, 1997, pp. 131–192.
[49] R. K. Tripathy and I. Bilionis, Deep UQ: Learning deep neural network surrogate models for high dimensional uncertainty quantification, Journal of Computational Physics, 375 (2018), pp. 565–588.
[50] A. Veit, M. J. Wilber, and S. Belongie, Residual networks behave like ensembles of relatively shallow networks, in Advances in Neural Information Processing Systems, 2016, pp. 550–558.
[51] J. A. Vrugt, C. J. F. Ter Braak, C. G. H. Diks, B. A. Robinson, J. M. Hyman, and D. Higdon, Accelerating Markov chain Monte Carlo simulation by differential evolution with self-adaptive randomized subspace sampling, International Journal of Nonlinear Sciences and Numerical Simulation, 10 (2009), pp. 273–290.
[52] L. Yang, X. Meng, and G. E. Karniadakis, B-PINNs: Bayesian physics-informed neural networks for forward and inverse PDE problems with noisy data, Journal of Computational Physics, 425 (2021), p. 109913.
[53] D. Yarotsky, Error bounds for approximations with deep ReLU networks, Neural Networks, 94 (2017), pp. 103–114.
[54]