Approximate Bayesian Model Inversion for PDEs with Heterogeneous and State-Dependent Coefficients
D.A. Barajas-Solano∗, A.M. Tartakovsky
Pacific Northwest National Laboratory, Richland, WA 99354
Abstract
We present two approximate Bayesian inference methods for parameter estimation in partial differential equation (PDE) models with space-dependent and state-dependent parameters. We demonstrate that these methods provide accurate and cost-effective alternatives to Markov Chain Monte Carlo simulation. We assume a parameterized Gaussian prior on the unknown functions, and approximate the posterior density by a parameterized multivariate Gaussian density. The parameters of the prior and posterior are estimated from sparse observations of the PDE model's states and the unknown functions themselves by maximizing the evidence lower bound (ELBO), a lower bound on the log marginal likelihood of the observations. The first method, Laplace-EM, employs the expectation maximization algorithm to maximize the ELBO, with a Laplace approximation of the posterior on the E-step, and minimization of a Kullback-Leibler divergence on the M-step. The second method, DSVI-EB, employs the doubly stochastic variational inference (DSVI) algorithm, in which the ELBO is maximized via gradient-based stochastic optimization, with noisy gradients computed via simple Monte Carlo sampling and Gaussian backpropagation. We apply these methods to identifying diffusion coefficients in linear and nonlinear diffusion equations, and we find that both methods provide accurate estimates of posterior densities and the hyperparameters of Gaussian priors. While the Laplace-EM method is more accurate, it requires computing Hessians of the physics model. The DSVI-EB method is found to be less accurate but only requires gradients of the physics model.

∗ Corresponding author
Email addresses:
[email protected] (D.A. Barajas-Solano),
[email protected] (A.M. Tartakovsky)
Preprint submitted to Elsevier, February 19, 2019

Keywords: approximate Bayesian inference, model inversion, variational inference, empirical Bayes
1. Introduction
Partial differential equation (PDE) models of many physical systems involve space-dependent parameters and constitutive relationships that are usually only partially observed. Model inversion aims to estimate these unknown functions of space and the system's state from sparse measurements of the state, associated quantities of interest, and the unknown functions themselves. Bayesian inference provides a probabilistic framework for model inversion [1], in which data is assimilated by computing the posterior density of the parameters in terms of the likelihood of the observations given the PDE model and the prior density of the parameters codifying modeling assumptions. Unlike deterministic parameter estimation methods [2, 3], the Bayesian framework provides a probabilistic characterization of the estimated parameters that can be employed for quantifying uncertainty and evaluating modeling assumptions. For linear problems with Gaussian likelihoods and Gaussian priors, Bayesian inference can be carried out exactly (known in the context of state estimation for dynamical systems as the Kalman filter [4]). Unfortunately, physics models define nonlinear maps between the state and the model parameters, preventing exact inference even in the case of Gaussian likelihoods and Gaussian priors. The Markov Chain Monte Carlo (MCMC) method is robust for general nonlinear problems but is computationally expensive [5]. Despite recent advances in Hamiltonian Monte Carlo and ensemble and parallel MCMC [6, 7, 8], the number of forward simulations and likelihood evaluations required by MCMC sampling poses a challenge for model inversion of PDE models with high-dimensional parameters. Here, we propose two cost-effective alternatives to MCMC for estimating unknown parameters and constitutive relationships in PDE models.

Gaussian process (GP) regression, known as kriging in geostatistics, is commonly used to construct probabilistic models of heterogeneous parameters; therefore, GPs serve as a reasonable choice of prior for unknown parameters. In the context of Bayesian inference with GP priors, GP regression is equivalent to exact Bayesian inference for assimilating direct measurements of unknown parameters. Similarly, the marginal likelihood of parameter measurements can be computed in closed form, therefore allowing for model selection to be carried out by empirical Bayesian inference, also known as type-II maximum likelihood estimation [9].

Assimilating measurements of the state of PDE models is, on the other hand, less straightforward. Recently, a framework has been proposed to combine GP priors on the state and a discretization of the governing PDEs to assimilate state observations [10, 11]. In this framework, state estimation and type-II maximum likelihood estimation can be carried out in closed form when the governing equations are linear in the state; for the nonlinear case, inference is carried out approximately via linearization of the governing equations.

Parameter estimation for PDE models presents another layer of challenge, as governing equations commonly induce nonlinear relations between parameters and states. A common example is the Laplace equation with a space-dependent unknown diffusion coefficient, which is linear in the state but induces a nonlinear relation between the state and the diffusion coefficient. For the general case of parameter estimation with nonlinearity introduced by the physics model, approximate Bayesian inference methods are necessary.
The standard approximate inference tool is MCMC sampling of the Bayesian posterior. Given unbounded computational resources, MCMC will provide arbitrarily accurate results, but in practice MCMC often requires an intractable number of forward simulations of the physics model. Algorithms such as Hamiltonian Monte Carlo (HMC) and the Metropolis-adjusted Langevin algorithm (MALA) employ first-order information in the form of sensitivities of the physics model to improve the mixing and convergence of the Markov chain random walk, but nevertheless the total number of forward and sensitivity simulations remains a challenge. As an alternative, approaches such as the Laplace approximation and variational inference aim to approximate the exact posterior with an analytical, parameterized density.

In this manuscript we propose employing approximate Bayesian inference with GP priors to approximate the posterior of PDE parameters and to estimate the hyperparameters of their GP prior. We propose two optimization-based methods: the first, Laplace-EM, is based on the Laplace approximation [9, 12, 13] and the expectation maximization (EM) algorithm [14, 12]. The second, doubly stochastic variational inference for empirical Bayes inference (DSVI-EB), is based on the DSVI algorithm [15, 16]. The proposed methods employ first- and second-order information, i.e., gradients and Hessians of physics models, evaluated via the discrete adjoint method. Compared with MCMC and with other approximate Bayesian inference algorithms such as expectation propagation [17] and the Laplace approximation-based method of [13], both presented methods enjoy computational properties that render each of them attractive for model inversion, depending on the nature of the inversion problem. In particular, the Laplace-EM method is accurate for approximating the unimodal posteriors of the numerical examples of this manuscript, but requires computing Hessians. On the other hand, DSVI-EB is less accurate but only requires computing gradients, and can be trivially parallelized. We note that Gaussian mixtures can be employed in variational inference to approximate multimodal posteriors [18], but in the present work we limit our focus to unimodal posteriors. Furthermore, both methods are applicable to non-factorizing likelihoods, do not require computing moments of the likelihood, and do not require third- or higher-order derivatives of the physics model. Finally, variational inference and the Laplace approximation have been employed for model inversion [19, 20, 21, 22, 13] but, to the best of our knowledge, have not been used in the context of the empirical Bayesian framework to estimate GP prior hyperparameters, with the exception of the work of [13]. That work, based on the Laplace approximation, requires computing third-order derivatives of the physics model, which may be costly, whereas the presented methods do not require third-order derivatives.

The manuscript is structured as follows: In Section 2 we formulate the empirical Bayesian inference problem for physics models and GP priors. In Section 3 we pose the approximate Bayesian inference problem and summarize the expectation maximization (EM) algorithm. The Laplace-EM algorithm is introduced in Section 4, and the DSVI-EB algorithm is described in Section 5. The computational complexity of the algorithms is discussed in Section 6. The application of the proposed methods is presented in Section 7. Finally, conclusions are given in Section 8.
2. Problem formulation
We consider physical systems modeled by a stationary PDE over the simulation domain Ω ⊂ ℝ^d, d ∈ [1, 3]. We denote by u : Ω → U ⊂ ℝ the system's state, and by y : Ω × U → ℝ the system's parameter, an unknown scalar function of space and the system's state. Our goal is to estimate the unknown function y(x, u) from sparse, noisy measurements of u(x) and y(x, u). The PDE and boundary conditions are discretized for numerical computations, resulting in the set of M algebraic equations L(u, y) = 0, where u ∈ ℝ^M denotes the vector of M state degrees of freedom, and y ∈ ℝ^N denotes the discretized parameter vector, corresponding to the values of y(x, u) at the N discrete locations {ξ_i ∈ Ω × U}_{i=1}^N. Furthermore, we denote by Ξ the matrix of coordinates Ξ ≡ (ξ_1, ..., ξ_N).

We assume that the sparse observations of the discrete state and parameters, u_s and y_s, respectively, are collected with iid normal observation errors, that is,

$$u_s = H_u u + \epsilon_{us}, \quad \epsilon_{us} \sim \mathcal{N}(0, \sigma_{us}^2 I_{M_s}), \qquad (1)$$
$$y_s = H_y y + \epsilon_{ys}, \quad \epsilon_{ys} \sim \mathcal{N}(0, \sigma_{ys}^2 I_{N_s}), \qquad (2)$$

where u_s ∈ ℝ^{M_s}, M_s ≪ M, are the state observations, y_s ∈ ℝ^{N_s}, N_s ≪ N, are the parameter observations, H_u ∈ ℝ^{M_s × M} is the state observation operator, H_y ∈ ℝ^{N_s × N} is the parameter observation operator, and ε_us and ε_ys are observation errors satisfying E[ε_us ε_ys^⊤] = 0. Then, the likelihood of the observations D_s ≡ {u_s, y_s} given y is defined as

$$\log p(D_s \mid y) \equiv -\frac{1}{2\sigma_{us}^2}\,\lVert u_s - H_u u \rVert_2^2 - \frac{1}{2\sigma_{ys}^2}\,\lVert y_s - H_y y \rVert_2^2 + \mathrm{const.}, \qquad (3)$$

where u satisfies the physics constraint L(u, y) = 0 given y, and the constant is independent of y.

In probabilistic terms, our goal is to estimate the posterior density of y given the data D_s. By Bayes' theorem, this posterior is given by

$$p(y \mid D_s, \theta) = \frac{p(D_s \mid y)\, p(y \mid \theta)}{p(D_s \mid \theta)}, \qquad (4)$$

where p(y | θ) is the parameterized prior density of y, with hyperparameters θ, and p(D_s | θ) is the marginal likelihood or evidence of the data, given by

$$p(D_s \mid \theta) = \int p(D_s \mid y)\, p(y \mid \theta)\, \mathrm{d}y. \qquad (5)$$

If one is not interested in the uncertainty of estimating y given the data, one can compute, in lieu of the full posterior (4), the maximum a posteriori (MAP) point estimate of y, defined as the mode of the posterior, that is,

$$\hat{y} \equiv \arg\max_y \log p(y \mid D_s, \theta) = \arg\max_y \log p(y, D_s \mid \theta), \qquad (6)$$

where p(y, D_s | θ) = p(D_s | y) p(y | θ) is the joint density of the data and the parameters given θ. Here we used the fact that the marginal likelihood p(D_s | θ) is independent of y.

We employ a zero-mean Gaussian process prior, that is,

$$p(y \mid \theta) = \mathcal{N}\big(y \mid 0,\ C_p(\theta) \equiv C(\Xi, \Xi \mid \theta)\big), \qquad (7)$$

where N(· | μ, Σ) denotes the multivariate normal density with mean μ and covariance matrix Σ, and C(·, · | θ) is a parameterized covariance kernel.

The posterior density depends on the prior hyperparameters, which can be chosen based on prior expert knowledge or estimated from data. In the empirical Bayes approach, also known as type-II maximum likelihood or marginal likelihood estimation, point estimates of the hyperparameters are obtained by maximizing the marginal likelihood with respect to θ, i.e., θ̂ ≡ argmax_θ p(D_s | θ). In the fully Bayes approach, we instead pose a hyperprior on the hyperparameters, which is updated with the data by Bayes' theorem.
In this work we pursue the empirical Bayes approach.

Due to the nonlinear map from y to u defined by the physics constraint, the Bayesian inference problem of evaluating the posterior and the marginal likelihood cannot be solved in closed form. Exact inference therefore requires sampling the posterior via MCMC, which is intractable for sufficiently large N and M. As a consequence, estimating hyperparameters via marginal likelihood estimation is also intractable. As an alternative to exact inference, in this work we propose two approximate inference algorithms.
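To make the observation model concrete, the following minimal sketch assembles the Gaussian log-likelihood (3) for a generic discretized physics model. The solver solve_physics, the observation operators, and the noise values shown are illustrative placeholders rather than the implementation used in this manuscript.

```python
import numpy as np

def log_likelihood(y, u_obs, y_obs, H_u, H_y, solve_physics,
                   sigma_us=1e-2, sigma_ys=1e-2):
    """Gaussian log-likelihood (3) of the observations, up to an additive constant.

    solve_physics(y) is a placeholder for a solver returning the state u that
    satisfies the discretized physics constraint L(u, y) = 0; the noise levels
    are illustrative.
    """
    u = solve_physics(y)                          # forward solve for the state
    ru = np.asarray(u_obs) - H_u @ u              # state observation residuals
    ry = np.asarray(y_obs) - H_y @ y              # parameter observation residuals
    return -0.5 * (ru @ ru) / sigma_us**2 - 0.5 * (ry @ ry) / sigma_ys**2
```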
3. Approximate inference and Expectation Maximization
The goal is to approximate the exact posterior p(y | D_s, θ) by a density q(y). The Kullback–Leibler (KL) divergence D_KL(q(y) ‖ p(y | D_s, θ)) provides a means of rewriting the marginal likelihood (5) in terms of q(y). Namely, substituting (4) into the definition of the KL divergence gives

$$D_{\mathrm{KL}}\big(q(y) \,\|\, p(y \mid D_s, \theta)\big) = -\int q(y) \log \frac{p(y \mid D_s, \theta)}{q(y)}\, \mathrm{d}y = -\int q(y) \log \frac{p(D_s \mid y)\, p(y \mid \theta)}{q(y)\, p(D_s \mid \theta)}\, \mathrm{d}y = -\mathcal{F}[q(y), \theta] + \log p(D_s \mid \theta), \qquad (8)$$

where F[q(y), θ] is given by

$$\mathcal{F}[q(y), \theta] = \mathbb{E}_{q(y)}[\log p(D_s \mid y)] - D_{\mathrm{KL}}\big(q(y) \,\|\, p(y \mid \theta)\big), \qquad (9)$$

and E_{q(y)}[·] ≡ ∫ (·) q(y) dy denotes expectation with respect to the density q(y). Reorganizing (8), we have the following alternative expression for (5):

$$\log p(D_s \mid \theta) = \mathcal{F}[q(y), \theta] + D_{\mathrm{KL}}\big(q(y) \,\|\, p(y \mid D_s, \theta)\big).$$

Given that the KL divergence is always non-negative, we have the inequality log p(D_s | θ) ≥ F[q(y), θ]; therefore, the functional F is often called the evidence lower bound (ELBO). The inequality becomes an equality when q(y) = p(y | D_s, θ), that is, when the variational density is equal to the exact posterior. In the empirical Bayes setting, this suggests the strategy of selecting both q and θ by maximizing the ELBO [14], i.e.,

$$(\hat{q}(y), \hat{\theta}) \equiv \arg\max_{(q(y), \theta)} \mathcal{F}[q(y), \theta]. \qquad (10)$$

Instead of maximizing over q(y) and θ simultaneously, we can do so iteratively by alternating two maximization steps, resulting in the iterative scheme

$$\text{E-step:}\quad \hat{q}^{(j+1)}(y) \leftarrow \arg\max_{q(y)} \mathcal{F}\big[q(y), \theta^{(j)}\big], \qquad \text{M-step:}\quad \hat{\theta}^{(j+1)} \leftarrow \arg\max_{\theta} \mathcal{F}\big[\hat{q}^{(j+1)}(y), \theta\big], \qquad (11)$$

thus recovering the expectation maximization (EM) algorithm [14].

It remains to specify how the maximization problems (10) and (11) are solved, particularly with respect to how to optimize over the space of possible densities q(y) approximating the true posterior. The approximate inference algorithms presented in this manuscript are based on two families of approximations of the posterior. The Laplace-EM algorithm (Section 4) uses a local approximation around the MAP for a given θ and optimizes the ELBO using the EM algorithm (11). The DSVI-EB algorithm (Section 5) uses a parameterized density q(y | φ) with variational parameters φ to be selected jointly with θ via stochastic optimization.
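The alternating scheme (11) can be organized as a small driver loop. The sketch below shows only the generic iteration and stopping rule; e_step and m_step are placeholders for whichever approximations are used for the two maximizations (for instance, the Laplace approximation and KL minimization of Section 4), and the relative-change test mirrors the criterion used later for Laplace-EM.

```python
import numpy as np

def em_driver(theta0, e_step, m_step, theta_scales, rtol=1e-3, max_iter=50):
    """Generic EM loop maximizing the ELBO over (q, theta), cf. (11).

    e_step(theta)    -> approximate posterior q for fixed hyperparameters
    m_step(q, theta) -> updated hyperparameters for fixed q
    Both callables are placeholders for the method-specific steps.
    """
    theta = np.asarray(theta0, dtype=float)
    q = None
    for _ in range(max_iter):
        q = e_step(theta)                              # E-step: update q(y)
        theta_new = np.asarray(m_step(q, theta))       # M-step: update theta
        rel_change = np.max(np.abs(theta_new - theta) / np.abs(theta_scales))
        theta = theta_new
        if rel_change <= rtol:                         # relative-change stopping rule
            break
    return q, theta
```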
4. Laplace-EM algorithm
The Laplace approximation is an approach for approximating unimodal posterior densities. It consists of fitting a multivariate Gaussian density around the MAP for a given choice of hyperparameters θ. The j-th E-step of the EM algorithm (11) consists of finding the posterior for a given set of hyperparameters, θ^(j). This suggests that we can replace the E-step by a Laplace approximation to the posterior, giving rise to the Laplace-EM algorithm.

We proceed to briefly describe the Laplace approximation. Expanding the log posterior (see (4)) up to second order around the MAP (6) yields

$$\log p(y \mid D_s, \theta) = -\log p(D_s \mid \theta) + \log p(\hat{y}, D_s \mid \theta) + \tfrac{1}{2} (y - \hat{y})^\top \Big[\nabla\nabla \log p(y, D_s \mid \theta)\big|_{y=\hat{y}}\Big] (y - \hat{y}) + \dots,$$

where ∇∇ log p(y, D_s | θ)|_{y=ŷ} denotes the Hessian of the log joint density log p(y, D_s | θ) evaluated at the MAP. This quadratic expression suggests approximating the posterior by the multivariate Gaussian density q(y) ≡ N(y | μ_q, Σ_q), with mean μ_q given by the MAP and covariance Σ_q given by the negative inverse Hessian of the log joint density. In other words, we have

$$\mu_q \equiv \arg\min_y \big[-\log p(y, D_s \mid \theta)\big] = \arg\min_y \Big\{ -\log p(D_s \mid y) + \tfrac{1}{2}\big[ y^\top C_p^{-1}(\theta)\, y + \log\det C_p(\theta) + N \log 2\pi \big] \Big\}, \qquad (12)$$

and

$$\Sigma_q \equiv \Big[ -\nabla\nabla \log p(y, D_s \mid \theta)\big|_{y=\hat{y}} \Big]^{-1} = \big( H + C_p^{-1}(\theta) \big)^{-1}, \qquad (13)$$

where H ≡ −∇∇ log p(D_s | y)|_{y=μ_q} denotes the Hessian of the negative log-likelihood evaluated at μ_q. Note that Σ_q and μ_q depend on θ indirectly through the dependence of the MAP on θ.

In this work we solve the minimization problem (12) via gradient-based optimization. The necessary gradient of log p(D_s | y) with respect to y is computed via the discrete adjoint method described in Appendix C. The Hessian of log p(D_s | y) with respect to y, necessary to evaluate (13), is also computed via the discrete adjoint method.

We propose the Laplace-EM algorithm, where the Laplace approximation provides an approximation to the E-step. For the M-step, we keep the Laplace approximation fixed and maximize the ELBO with respect to the hyperparameters of the GP prior, θ. From (9) we see that, for fixed q(y), θ appears only through the KL divergence D_KL(q(y) ‖ p(y | θ)); therefore, it suffices to minimize this KL divergence in the M-step. The Laplace-EM M-step reads

$$\theta^{(j+1)} = \arg\min_\theta D_{\mathrm{KL}}\big(q(y) \,\|\, p(y \mid \theta)\big). \qquad (14)$$

For the GP prior, this KL divergence is given in closed form by

$$D_{\mathrm{KL}}\big(q(y) \,\|\, p(y \mid \theta)\big) = \tfrac{1}{2} \Big[ \operatorname{tr}\big(C_p^{-1} \Sigma_q\big) + \hat{y}^\top C_p^{-1} \hat{y} - N + \log \frac{\det C_p}{\det \Sigma_q} \Big]. \qquad (15)$$

If a gradient-based method is used, the gradient of the KL divergence is given by

$$\frac{\partial}{\partial \theta_i} D_{\mathrm{KL}}\big(q(y) \,\|\, p(y \mid \theta)\big) = -\tfrac{1}{2}\, \hat{y}^\top C_p^{-1} \frac{\partial C_p}{\partial \theta_i} C_p^{-1} \hat{y} + \tfrac{1}{2} \operatorname{tr}\Big[ C_p^{-1} \frac{\partial C_p}{\partial \theta_i} \big( I_N - C_p^{-1} \Sigma_q \big) \Big]. \qquad (16)$$

The Laplace-EM algorithm is summarized in Algorithm 1. In practice, the EM iterations are halted either once a maximum number of iterations is completed or once the relative change in hyperparameters falls below a prescribed threshold, that is, when

$$\max \Big\{ \big| \theta_i^{(j+1)} - \theta_i^{(j)} \big| \,/\, |\theta_i^{s}| \Big\}_{i=1}^{N_\theta} \leq \mathrm{rtol},$$

where N_θ is the number of prior hyperparameters, the θ_i^s, i ∈ [1, N_θ], are prescribed hyperparameter scales (that provide a sense of the magnitude of the hyperparameters), and rtol is the prescribed tolerance.
Algorithm 1: Laplace-EM
Require: θ^(0), C_p(θ), log p(D_s | y)
  j ← 0
  repeat
    Compute μ_q using (12) and Σ_q using (13)    (E-step)
    Solve (14) for θ^(j+1)                        (M-step)
    j ← j + 1
  until convergence

We note that the Laplace approximation is a commonly used tool for unimodal non-Gaussian inference [9, 12, 13]. Directly maximizing the Laplace approximation to the marginal likelihood with respect to θ requires evaluating third-order derivatives of the log-likelihood function (3) with respect to y. This is due to the implicit dependence of the MAP on θ, which requires evaluating third-order derivatives of the physics constraint. The use of the expectation maximization algorithm allows us to sidestep the need for third-order derivatives. Other methods for non-Gaussian inference, such as expectation propagation [17], require multiple evaluations of the moments of the likelihood function and are therefore not considered in this work.
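A compact sketch of the two Laplace-EM steps is given below, assuming callables for the log-likelihood, its gradient and Hessian (e.g., obtained with the adjoint method of Appendix C), and for the prior covariance C_p(θ) and its derivatives with respect to θ. The optimizer choices, tolerances, and function names are illustrative assumptions, not the authors' implementation; in particular, positivity constraints on the hyperparameters are ignored here.

```python
import numpy as np
from scipy.optimize import minimize

def laplace_e_step(theta, loglik, grad_loglik, hess_loglik, C_p, y0):
    """Laplace E-step: MAP estimate (12) and Gaussian covariance (13)."""
    Cp_inv = np.linalg.inv(C_p(theta))
    # Negative log joint density; terms constant in y are dropped.
    obj = lambda y: -loglik(y) + 0.5 * y @ Cp_inv @ y
    jac = lambda y: -grad_loglik(y) + Cp_inv @ y
    mu_q = minimize(obj, y0, jac=jac, method="L-BFGS-B").x
    H = -hess_loglik(mu_q)                      # Hessian of -log p(D_s | y) at the MAP
    Sigma_q = np.linalg.inv(H + Cp_inv)         # (13)
    return mu_q, Sigma_q

def laplace_m_step(mu_q, Sigma_q, C_p, dC_p, theta0):
    """M-step (14): minimize the KL divergence (15) over theta (unconstrained here)."""
    N = mu_q.size
    _, logdet_Sq = np.linalg.slogdet(Sigma_q)

    def kl(theta):
        Cp = C_p(theta)
        Cp_inv = np.linalg.inv(Cp)
        _, logdet_Cp = np.linalg.slogdet(Cp)
        return 0.5 * (np.trace(Cp_inv @ Sigma_q) + mu_q @ Cp_inv @ mu_q
                      - N + logdet_Cp - logdet_Sq)           # (15)

    def kl_grad(theta):
        Cp_inv = np.linalg.inv(C_p(theta))
        g = np.zeros(len(theta))
        for i, dC in enumerate(dC_p(theta)):                  # dC_p(theta)[i] = dC_p/dtheta_i
            g[i] = (-0.5 * mu_q @ Cp_inv @ dC @ Cp_inv @ mu_q
                    + 0.5 * np.trace(Cp_inv @ dC @ (np.eye(N) - Cp_inv @ Sigma_q)))  # (16)
        return g

    return minimize(kl, theta0, jac=kl_grad, method="L-BFGS-B").x
```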
5. Doubly stochastic variational inference
In variational inference (VI) [23], we restrict our choice of q(y) to a parameterized family q(y | φ). In this context we refer to q as the variational density and to φ as the variational parameters. Following (10), we estimate the variational parameters and the GP prior hyperparameters simultaneously by maximizing the corresponding ELBO, that is,

$$(\hat{\phi}, \hat{\theta}) = \arg\max_{(\phi, \theta)} \mathcal{F}[q(y \mid \phi), \theta]. \qquad (17)$$

In this section we present our proposed implementation of variational inference for empirical Bayes. The main challenges of VI are (i) approximating the expectations in the expression for the ELBO, (9), and (ii) optimizing such approximations. To address these challenges we employ the doubly stochastic variational inference (DSVI) framework [15, 16], in which a noisy simple Monte Carlo estimate of the ELBO (first source of stochasticity) is maximized via a gradient-based stochastic optimization algorithm (second source of stochasticity). In particular, we employ stochastic gradient ascent with the adaptive step-size sequence proposed by [24]. The gradients of the ELBO estimate with respect to the variational parameters and prior hyperparameters are computed via Gaussian backpropagation [25, 16, 26, 24].

To maximize the ELBO via gradient-based stochastic optimization, we construct unbiased estimates of the ELBO and its gradients with respect to φ and θ. Computing the gradient ∇_φ F is not trivial, as it involves expectations over q, which depends on φ.

We restrict ourselves to the multivariate Gaussian variational family q(y | φ) = N(y | μ_q, Σ_q), with variational mean μ_q ∈ ℝ^N and covariance Σ_q = R_q R_q^⊤ ∈ ℝ^{N×N}, where R_q is a lower triangular factor matrix. For this choice we have φ = {μ_q, R_q}. Similar to the Laplace approximation, this choice is justified for unimodal posteriors. We then introduce the change of variables y = μ_q + R_q z, with z ∼ N(0, I_N). Substituting this change of variables into (9), we can rewrite the involved expectations in terms of expectations over N(z | 0, I_N), resulting in

$$\mathcal{F}[q(y \mid \phi), \theta] = \mathbb{E}_{\mathcal{N}(z \mid 0, I_N)}[\log p(D_s \mid y)] + \mathbb{E}_{\mathcal{N}(z \mid 0, I_N)}[\log p(y \mid \theta)] + \log\det R_q + \mathcal{H}[\mathcal{N}(z \mid 0, I_N)], \qquad (18)$$

where y = μ_q + R_q z, and H[N(z | 0, I_N)] = N(1 + log 2π)/2 is the entropy of the standard multivariate normal density. This leads to the single-sample estimate

$$f(z; \phi, \theta) = \log p(D_s \mid y) + \log\det R_q - \tfrac{1}{2}\big[ y^\top C_p^{-1} y + \log\det C_p - N \big], \qquad (19)$$

with gradients

$$\nabla_{\mu_q} f(z; \phi, \theta) = \nabla_y \log p(D_s \mid y) - C_p^{-1} y, \qquad (20)$$
$$\nabla_{R_q} f(z; \phi, \theta) = \big[\nabla_y \log p(D_s \mid y)\big] z^\top - C_p^{-1} y z^\top + \big(R_q^{-1}\big)^\top, \qquad (21)$$
$$\frac{\partial}{\partial \theta_i} f(z; \phi, \theta) = \tfrac{1}{2}\, y^\top C_p^{-1} \frac{\partial C_p}{\partial \theta_i} C_p^{-1} y - \tfrac{1}{2} \operatorname{tr}\Big( C_p^{-1} \frac{\partial C_p}{\partial \theta_i} \Big), \qquad (22)$$

where again y = μ_q + R_q z. The details of the derivations of (20)–(22) are presented in Appendix A. It can be verified that E_{N(z | 0, I_N)}[f(z; φ, θ)] = F[q(y | φ), θ], so that the estimates are unbiased.

The variance of the estimate (19) and of its gradients can be reduced by using the simple Monte Carlo (MC), or batch, estimate and the corresponding gradients

$$f_n(\phi, \theta) = \frac{1}{n} \sum_{k=1}^{n} f\big(z^{(k)}; \phi, \theta\big), \quad z^{(k)} \sim \mathcal{N}(0, I_N), \qquad (23)$$
$$\nabla_{(\cdot)} f_n(\phi, \theta) = \frac{1}{n} \sum_{k=1}^{n} \nabla_{(\cdot)} f\big(z^{(k)}; \phi, \theta\big), \qquad (24)$$

where n is the size of the batch. The variance of the batch estimate (23) is lower by a factor of n, but it requires computing the gradients of the log-likelihood n times.

Our DSVI algorithm is summarized in Algorithm 2. The stochastic gradient ascent algorithm with adaptive step-size sequence proposed by [24] is reproduced in Appendix B for completeness.
Algorithm 2: Doubly stochastic variational inference
Require: φ^(0), θ^(0), C_p(θ), log p(D_s | y)
  j ← 0
  repeat
    Sample n realizations of z ∼ N(0, I_N)
    Compute ∇_φ f_n(φ^(j), θ^(j)) and ∇_θ f_n(φ^(j), θ^(j)) using (20)–(22) and (24)
    Calculate step-size vectors ρ_φ^(j) and ρ_θ^(j) using (B.3) and (B.4)
    φ^(j+1) ← φ^(j) + ρ_φ^(j) ∘ ∇_φ f_n(φ^(j), θ^(j))    (B.1)
    θ^(j+1) ← θ^(j) + ρ_θ^(j) ∘ ∇_θ f_n(φ^(j), θ^(j))    (B.2)
    j ← j + 1
  until convergence

It remains to discuss the parameterization of the factor R_q. In this manuscript, we consider three alternatives: a full parameterization, the so-called mean field parameterization, and a constrained Chevron parameterization. The sparsity patterns of these parameterizations are shown in Figure 1.

In the full-rank parameterization [24], we take R_q to be the non-unique Cholesky factor, that is, an N × N lower triangular matrix with unconstrained entries (Figure 1a). In this case, we have φ ∈ ℝ^{N + N(N+1)/2}. The number of variational parameters is therefore O(N²), which may render their optimization difficult. In order to address this challenge, we can employ the mean field and Chevron parameterizations, which result in a total number of variational parameters that is linear in N.

In the mean field parameterization, we take R_q to be a strictly positive diagonal matrix, i.e., R_q = diag[exp(ω_q)], with ω_q ∈ ℝ^N (Figure 1b) and exp(·) understood element-wise. This parameterization assumes that the variational density covariance is diagonal, and the exponential ensures that the non-zero entries of R_q are strictly positive. In this case, we have φ ≡ {μ_q, ω_q} ∈ ℝ^{2N}. The gradient of f(z; φ, θ) with respect to ω_q is given by

$$\nabla_{\omega_q} f(z; \phi, \theta) = \big[ \nabla_y \log p(D_s \mid y) - C_p^{-1} y \big] \circ z \circ \exp(\omega_q) + \mathbf{1}_N, \qquad (25)$$

where ∘ denotes the element-wise product, exponentiation is taken element-wise, and 1_N denotes the N-vector of ones. The details of the derivation are presented in Appendix A. This parameterization assumes that the posterior components of y are essentially uncorrelated and therefore cannot resolve the correlations of the true posterior, which are expected to be non-trivial for highly correlated prior covariance structures and for a small number of observations. As a consequence, the mean field parameterization tends to underestimate the posterior variance [23].

Finally, the constrained Chevron parameterization is similar to the full parameterization, but entries below the diagonal in columns beyond the Chevron parameter k < N are set to zero (Figure 1c) [27]. In this case, we have φ ∈ ℝ^{N + (2N−k)(k+1)/2}. The number of variational parameters for this parameterization is O(N(k+1)), a reduction with respect to the full parameterization, while maintaining some degree of expressivity for capturing correlations of the true posterior.

Figure 1: Parameterization of the factor matrix R_q: (a) full rank; (b) mean field; (c) Chevron with k = 3.

The DSVI-EB method follows the automatic differentiation variational inference (ADVI) algorithm [24], in which gradients of the joint density with respect to the variational parameters are computed using Gaussian backpropagation and reverse-mode automatic differentiation. ADVI is formulated for the fully Bayes case and implements the full and mean-field parameterizations of R_q. In comparison, our work is formulated for the empirical Bayes case, employs the adjoint method to compute gradients of physics solvers, and implements the constrained Chevron parameterization in addition to the full and mean-field parameterizations.

Two schemes are common in the literature for computing the gradients of the noisy ELBO estimate: the REINFORCE algorithm [28], also known as the likelihood ratio method or the log-derivative trick, and Gaussian backpropagation [26], also known as the reparameterization trick [25, 16]. The REINFORCE algorithm employs gradients of the variational density with respect to its parameters, which is convenient as it only requires zero-order information of the physics model. Unfortunately, REINFORCE estimates of the ELBO gradients are well known to be of high variance and must be paired with a variance reduction technique [15]. Gaussian backpropagation, on the other hand, results in lower-variance gradient estimates at the cost of requiring first-order information of the physics model.

An alternative formulation of VI is presented in [18], where the authors employ mixtures of diagonal multivariate Gaussian densities as the variational posterior and approximate the ELBO using a second-order Taylor expansion around the mean of each mixture component. The mean, diagonal covariance, and mixture weights are estimated by maximizing the ELBO via coordinate ascent. This entirely deterministic approach is formulated for the inference problem only and does not consider optimization over prior hyperparameters. A similar approach is also presented in [29] in the context of empirical Bayes.
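The reparameterized single-sample estimate (19) and its gradients (20)–(22) translate almost directly into code. The sketch below assumes a full-rank lower-triangular factor R_q and user-supplied callables for the log-likelihood, its gradient, the prior covariance, and the covariance derivatives; all names are illustrative. Averaging the returned quantities over n independent draws of z gives the batch estimates (23)–(24), which would then be fed to the update rules of Appendix B.

```python
import numpy as np

def dsvi_sample(mu_q, R_q, theta, loglik, grad_loglik, C_p, dC_p, rng):
    """One reparameterized draw of the ELBO estimate f (19) and its gradients (20)-(22)."""
    N = mu_q.size
    z = rng.standard_normal(N)
    y = mu_q + R_q @ z                              # reparameterization y = mu_q + R_q z
    Cp = C_p(theta)
    Cp_inv = np.linalg.inv(Cp)
    Cp_inv_y = Cp_inv @ y
    g_y = grad_loglik(y)                            # gradient of log p(D_s | y)

    _, logdet_Cp = np.linalg.slogdet(Cp)
    logdet_Rq = np.sum(np.log(np.abs(np.diag(R_q))))     # log |det R_q| for triangular R_q
    f = loglik(y) + logdet_Rq - 0.5 * (y @ Cp_inv_y + logdet_Cp - N)   # (19)

    grad_mu = g_y - Cp_inv_y                        # (20)
    # (21): only lower-triangular entries of R_q are free parameters, and only the
    # diagonal of (R_q^{-1})^T contributes to them.
    grad_R = np.tril(np.outer(g_y - Cp_inv_y, z)) + np.diag(1.0 / np.diag(R_q))
    grad_theta = np.array([0.5 * Cp_inv_y @ dC @ Cp_inv_y - 0.5 * np.trace(Cp_inv @ dC)
                           for dC in dC_p(theta)])  # (22)
    return f, grad_mu, grad_R, grad_theta

# Batch estimate (23)-(24): average the outputs over n draws, e.g.
# rng = np.random.default_rng(0)
# batch = [dsvi_sample(mu_q, R_q, theta, loglik, grad_loglik, C_p, dC_p, rng) for _ in range(n)]
```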
6. Computational cost
In this section, we discuss the computational effort of the Laplace-EM and DSVI-EB algorithms. We compute the gradient and the Hessian of the likelihood via the discrete adjoint method (see Appendix C for details). Note that the Laplace-EM method requires both gradients and Hessians, while the DSVI-EB method only requires gradients. For the physics constraint L(u, y) = 0, the computation of the gradient requires the solution of one (linear) backward sensitivity problem of size M × M. Similarly, the computation of the Hessian requires one backward sensitivity problem and N forward sensitivity problems, each of size M × M. For the following discussion we assume that the cost of each forward and backward sensitivity problem is of order O(M^γ), γ > 1.

Each E-step of the Laplace-EM algorithm requires one Cholesky factorization of C_p(θ^(j)), of cost O(N³), the solution of (12) via gradient-based optimization, and the computation of the Hessian. Therefore, the cost of each E-step is O(max(M^γ, N³)). Each iteration of the M-step requires one Cholesky factorization of C_p(θ). Therefore, the total cost of each EM cycle is again O(max(M^γ, N³)).

For the DSVI-EB algorithm, each iteration requires evaluating n gradients and one Cholesky factorization of C_p. If n is chosen independent of M, the total cost per iteration is also O(max(M^γ, N³)).

Finally, in general we expect the number of iterations of each E- and M-step, as well as the number of EM cycles and DSVI iterations, to increase with increasing N. The analysis of how these numbers scale with N is beyond the scope of this manuscript.

7. Numerical experiments

In this section, we present the application of the proposed approximate inference algorithms to the identification of the diffusion coefficient in diffusion equations. In particular, we are interested in identifying the diffusion coefficient k(x, u) of the homogeneous diffusion equation ∇·(k(x, u)∇u) = 0 in Ω ⊂ ℝ^d, from measurements of both the diffusion coefficient and the state u. For the linear case (k ≡ k(x)), the diffusion equation models phenomena such as stationary heat transfer and Darcy flow. For the nonlinear case (k ≡ k(u)), one recovers the so-called Richards equation for horizontal flows in unsaturated porous media.

7.1. Linear diffusion

We consider the one-dimensional diffusion equation with Dirichlet boundary conditions

$$\frac{\partial}{\partial x}\Big[ k(x) \frac{\partial}{\partial x} u(x) \Big] = 0, \quad x \in [0, 1], \qquad (26)$$
$$u(0) = u_L, \quad u(1) = u_R, \qquad (27)$$

where u : [0, 1] → ℝ is the state and k : [0, 1] → ℝ⁺ is the diffusion coefficient. The state is discretized into M degrees of freedom u_i organized into the vector u ∈ ℝ^M. The diffusion coefficient is discretized into N degrees of freedom k_i = exp(y_i) corresponding to N spatial coordinates {x_i}_{i=1}^N, with the y_i organized into the vector y ∈ ℝ^N. The discretized problem (26)–(27) is of the algebraic form L(u, y) ≡ S(y)u − b(y) = 0, where S : ℝ^N → ℝ^{M×M} and b : ℝ^N → ℝ^M. In (26), y can only be identified from measurements of u up to an additive constant [1], and measurements of y are required to estimate it uniquely.

We apply the presented model inversion algorithms to estimating a synthetic diffusion coefficient from 10 measurements of the state u and one measurement of the log-diffusion coefficient y. The reference values of y and u and the corresponding observations are shown in Figure 2.
The reference y is taken as a realization of the zero-mean GP with squared exponential covariance

$$C(x, x' \mid \theta) = \sigma^2 \exp\Big[ -\frac{(x - x')^2}{2\lambda^2} \Big] + \sigma_n^2\, \delta_{x x'}, \qquad (28)$$

where θ ≡ (σ, λ), δ_{xx'} is equal to one for x = x' and zero otherwise, and σ_n is set to 1 × 10^{-…}. We refer to σ and λ as the standard deviation and correlation length, respectively, of the prior covariance. The reference values of θ are presented in Table 1. The y and u observations are taken at randomly selected degrees of freedom, with observation error standard deviations σ_us = σ_ys = 1 × 10^{-…}. Finally, the boundary conditions are set to u_L = 1.0 and u_R = 0.0, and the numbers M and N of degrees of freedom are both set to 50.

Figure 2: Reference diffusion coefficient and state fields (continuous lines) and observations (crosses) for the one-dimensional linear diffusion problem.
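A sketch of assembling the prior covariance matrix (28) on a discretization grid is given below. The grid, hyperparameter values, and nugget are illustrative, and the 1/(2λ²) scaling follows the standard squared-exponential convention assumed in the reconstruction of (28).

```python
import numpy as np

def squared_exponential_cov(x, sigma, lam, sigma_n=1e-6):
    """Squared-exponential covariance matrix C_p(theta) on the points x, cf. (28).

    sigma: prior standard deviation; lam: correlation length;
    sigma_n: small diagonal nugget (illustrative value).
    """
    dx = x[:, None] - x[None, :]                       # pairwise separations
    C = sigma**2 * np.exp(-dx**2 / (2.0 * lam**2))
    return C + sigma_n**2 * np.eye(len(x))

# Illustrative grid and hyperparameters for the N = 50 linear-diffusion example
x = np.linspace(0.0, 1.0, 50)
C_p = squared_exponential_cov(x, sigma=1.0, lam=0.1)
```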
Figure 3 shows the estimated diffusion coefficient, together with the 95% confidence intervals centered around the posterior mean, computed using the Laplace-EM algorithm and DSVI-EB with the Chevron parameterization and k = 20. It can be seen that both methods provide accurate estimates of the reference field, with the reference field falling inside the confidence interval of the estimates (with localized exceptions for DSVI-EB with the Chevron parameterization in the vicinity of the x = 1 boundary). Table 1 presents the estimated hyperparameters, together with the simple MC estimate of the ELBO, F̂, computed using 1 × 10^{…} realizations of the estimated posterior. It can be seen that the hyperparameter estimates are very similar for the Laplace-EM method and the DSVI method. In particular, estimates of the correlation length are close to the reference values, while the standard deviation is underestimated across all methods. Estimates are also similar for the different factor matrix parameterizations of the DSVI method, with the exception of the full-rank parameterization, which resulted in more pronounced underestimation of both the standard deviation and the correlation length. In terms of the ELBO, the Laplace-EM method results in the highest value, followed by the DSVI method with the full-rank factor parameterization. This indicates that full-rank representations of the covariance matrix of the estimated posterior density result in better estimates of the true posterior, whereas reduced representations such as Chevron and mean field are less accurate. In practice, it can be seen in Figure 3b that the reduced representation is less capable of resolving the uncertainty of the y estimate in the vicinity of the x = 1 boundary.

Figure 3: Reference and estimated diffusion coefficient for the one-dimensional linear diffusion problem: (a) Laplace-EM; (b) DSVI with Chevron factor, k = 20.
We proceed to evaluate the accuracy of the proposed inference algorithms in approximating the posterior density p(y | D_s, θ). For this purpose, we employ MCMC simulation as the benchmark, as it is known to converge to the exact posterior density. In order to restrict the focus to the approximation of the posterior density, we set the prior hyperparameters to fixed values equal to the reference values, θ_ref, and employ the Laplace-EM and DSVI algorithms to estimate the posterior p(y | D_s, θ_ref). (Note that in this context the Laplace-EM algorithm reduces to the Laplace approximation for given θ, i.e., a single E-step of the EM algorithm, but we will refer to the associated results as Laplace-EM results for the sake of convenience.) We compare the estimated posterior mean and standard deviation against the sample mean and standard deviation computed from 1 × 10^{…} MCMC realizations of the posterior generated using the No-U-Turn Sampler (NUTS) [8].

Table 1: Reference and estimated hyperparameters, and simple MC estimate of the ELBO, for the one-dimensional linear diffusion problem.

Figure 4 presents a point-wise comparison against MCMC of the estimated mean and standard deviation computed using both Laplace-EM and DSVI with the full-rank parameterization and the Chevron parameterization with k = 20 and 5. It can be seen that all estimates of the mean are very accurate, which indicates that the presented approximate inference algorithms provide accurate estimates of the mean of unimodal posterior densities. This result is expected for the Laplace-EM algorithm, where the estimated posterior mean is set to the MAP, but for the DSVI algorithm this is less of a given.

For the standard deviation, the Laplace-EM method provides the most accurate estimates, followed by DSVI with the full-rank parameterization. This reinforces the conclusion drawn previously that full-rank representations of the covariance lead to better estimates of the true posterior. Furthermore, it can be seen that the Chevron representation accurately resolves the bulk of the point-wise standard deviation values (clustered at the bottom left of each plot in Figure 4b) but leads to noticeable underestimation of the larger point-wise values (i.e., the top half of Figure 4b). The underestimation of the standard deviation is more pronounced for decreasing k, and is most pronounced for the mean field parameterization (not shown), which, as remarked previously, tends to underestimate the variance of the posterior [23].

Figure 4: Posterior (a) mean and (b) standard deviation estimated via approximate inference, compared against the sample mean and standard deviation computed from MCMC realizations of the posterior (reference).

Finally, in Figure 5 we present the posterior mean and variance for Laplace-EM and for DSVI with the Chevron parameterization, k = 20, obtained for fixed θ = θ_ref. Comparing Figure 3 against Figure 5 reveals that even though the empirical Bayes estimation procedure results in a standard deviation estimate lower than the reference value (see Table 1), the posterior density with empirical Bayes hyperparameter estimates, p(y | D_s, θ̂), is a good approximation to the posterior density with reference hyperparameters, p(y | D_s, θ_ref).

Figure 5: Estimated posterior mean and 95% confidence interval computed from the estimated posterior variance, for the one-dimensional linear diffusion problem for fixed θ = θ_ref: (a) Laplace-EM; (b) DSVI with Chevron factor, k = 20.
7.2. Nonlinear diffusion

We consider the one-dimensional nonlinear diffusion equation with Dirichlet boundary conditions

$$\frac{\partial}{\partial x}\Big[ k(u(x)) \frac{\partial}{\partial x} u(x) \Big] = 0, \quad x \in [0, 1], \qquad (29)$$
$$u(0) = u_L, \quad u(1) = u_R, \quad u_L < u_R \leq 0, \qquad (30)$$

where u : [0, 1] → (−∞, 0] is the state and k : (−∞, 0] → ℝ⁺ is the diffusion coefficient. Similarly to Section 7.1, the state is discretized into M degrees of freedom u_i organized into the vector u ∈ ℝ^M. The diffusion coefficient function is discretized into N degrees of freedom k_i = exp(y_i) corresponding to N values of u over [u_min, 0] (where u_min < u_L), organized into the vector y ∈ ℝ^N. The discretized problem (29)–(30) is of the algebraic form L(u, y) ≡ S(u, y)u − b(u, y) = 0, where S : ℝ^M × ℝ^N → ℝ^{M×M} and b : ℝ^M × ℝ^N → ℝ^M.

Inspection of (29) reveals that y(u) ≡ log k(u) can only be identified over the range [u_L, u_R] and up to an additive constant (this can be verified by introducing the Kirchhoff transformation f(u) = ∫_{u_min}^{u} k(u') du', with which (29) can be written as a linear equation in f). To disambiguate the estimate, we provide measurements of y(u) at u = u_min and u = 0.

We apply the presented model inversion algorithms to estimating a function k(u) from 5 measurements of the state u and 2 measurements of y(u) ≡ log k(u), at u = u_min and u = 0. The reference diffusion coefficient is k(u) = exp(u) (i.e., y(u) = u). The u observations are taken at randomly selected degrees of freedom and are shown in Figure 6. The observation error standard deviations σ_us and σ_ys are set to 1 × 10^{-…}. The boundary conditions u_L and u_R and the lower bound u_min are set to fixed negative values satisfying u_min < u_L < u_R ≤ 0. Finally, the numbers M and N are set to 50 and 21, respectively.

Figure 6: Reference state field (continuous lines) and observations (crosses) for the one-dimensional nonlinear diffusion problem.
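For the state-dependent case, y(u) = log k(u) is represented by its values at N = 21 nodes spanning [u_min, 0], and the coefficient entering the discretized operator has to be evaluated at the current state values. The sketch below illustrates one way to do this; piecewise-linear interpolation and the particular grid and u_min value are assumptions made for illustration, not the discretization used in the manuscript.

```python
import numpy as np

def k_of_u(u, y_nodes, u_nodes):
    """Diffusion coefficient k(u) = exp(y(u)) from nodal values of y = log k.

    y_nodes holds y at the u_nodes grid spanning [u_min, 0]; piecewise-linear
    interpolation is an illustrative choice, not the manuscript's scheme.
    """
    return np.exp(np.interp(u, u_nodes, y_nodes))      # interpolate log k, then exponentiate

# Illustrative grid: N = 21 nodes on [u_min, 0]
u_min = -1.5                                           # placeholder value
u_nodes = np.linspace(u_min, 0.0, 21)
y_nodes = u_nodes.copy()                               # reference case y(u) = u, i.e., k(u) = exp(u)
print(k_of_u(-0.75, y_nodes, u_nodes))
```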
Figure 7 shows the estimated diffusion coefficient obtained using the proposed model inversion methods, together with the 95% confidence intervals centered around the posterior mean. Presented are the results for the Laplace-EM method and the DSVI-EB method with the Chevron parameterization and k = 5. It can be seen that the estimated posterior mean and confidence intervals for both methods are nearly identical. As the prior covariance C(u, u' | θ), we employ the squared exponential model (28) with σ_n set to 1 × 10^{-…}. As in the linear case, both methods accurately estimate the reference function y(u), and the reference function falls inside the 95% confidence intervals provided by the estimated posterior covariance.

Table 2 presents the estimated hyperparameters, together with the simple MC estimate of the ELBO computed using 1 × 10^{…} realizations of the corresponding estimated posterior densities. It can be seen that the estimated hyperparameters are different for the different methods (note that here we do not have reference values for the hyperparameters of the prior, as the reference k(u) is not drawn from a GP model). Nevertheless, both methods result in similar estimates of y. In agreement with the linear case, the Laplace-EM method and the DSVI-EB method with the full-rank parameterization result in the largest values of the ELBO. Additionally, it can be seen that the ELBO decreases with increasing sparsity of the posterior covariance factor parameterization (i.e., with decreasing Chevron factor k), being lowest for the mean field parameterization. This illustrates the compromise between the sparsity of the covariance factor and its expressive capacity for approximating the true posterior, that is, less sparse covariance factors produce more accurate approximate posteriors.

Figure 7: Estimated diffusion coefficient for the one-dimensional nonlinear diffusion problem: (a) Laplace-EM; (b) DSVI with Chevron factor, k = 5.

Table 2: Estimated hyperparameters, and simple MC estimate of the ELBO, for the one-dimensional nonlinear diffusion problem.
8. Conclusions and discussion
We have presented two approximate empirical Bayesian methods, Laplace-EM and DSVI-EB, for estimating unknown parameters and constitutive relations in PDE models. Compared to other methods for approximate Bayesian inference, the proposed methods do not require third-order derivatives of the physics model, do not involve computing moments of non-Gaussian likelihoods, and are applicable to non-factorizing likelihoods. Furthermore, the calculation of the batch estimate of the ELBO and its gradients employed in the DSVI-EB method is trivially parallelizable, leading to savings in computational time. The numerical experiments presented show that both methods accurately approximate the posterior density and the hyperparameters of the GP prior. In particular, we find that the Laplace-EM method more accurately approximates the posterior density, at the cost of computing Hessians of the physics model, which increases the computational cost of each EM cycle. The DSVI-EB method, on the other hand, is less accurate but does not require Hessians. Consistent with the literature, we find that the accuracy of the DSVI-EB method in approximating the posterior decreases with increasing sparsity of the covariance factor parameterization employed.

For a very large number of degrees of freedom in the discretization of the unknown functions, the computational cost of the proposed methods is dominated by the associated cubic complexity of factorizing the prior covariance. Future work will aim to address this challenge by employing sparse GP inference.
Appendix A. Gaussian backpropagation rules
In this section we present the derivation of the gradients (20)–(22).

For the gradient with respect to the variational mean, (20), we have by the chain rule, in index notation,

$$\frac{\partial}{\partial \mu_{q,i}} \log p(D_s \mid y) = \frac{\partial y_j}{\partial \mu_{q,i}} \frac{\partial}{\partial y_j} \log p(D_s \mid y) = \delta_{ji} \frac{\partial}{\partial y_j} \log p(D_s \mid y),$$

since ∂y_j/∂μ_{q,i} = δ_{ji}. It follows that ∇_{μ_q} log p(D_s | y) = ∇_y log p(D_s | y). Similarly, we have

$$\nabla_{\mu_q} \tfrac{1}{2} y^\top C_p^{-1} y = \nabla_y \tfrac{1}{2} y^\top C_p^{-1} y = C_p^{-1} y,$$

and thus we recover (20).

For the gradient with respect to the Cholesky factor R_q, we note that ∂y_k/∂R_{q,ij} = δ_{ki} z_j. By the chain rule, we have

$$\frac{\partial}{\partial R_{q,ij}} \log p(D_s \mid y) = \frac{\partial y_k}{\partial R_{q,ij}} \frac{\partial}{\partial y_k} \log p(D_s \mid y) = \frac{\partial}{\partial y_i} \log p(D_s \mid y)\, z_j,$$

so that ∇_{R_q} log p(D_s | y) = [∇_y log p(D_s | y)] z^⊤. Similarly,

$$\nabla_{R_q} \tfrac{1}{2} y^\top C_p^{-1} y = \Big[ \nabla_y \tfrac{1}{2} y^\top C_p^{-1} y \Big] z^\top = C_p^{-1} y z^\top.$$

Finally, we have ∇_{R_q} log det R_q = (R_q^{-1})^⊤, from which we recover (21).

The gradients with respect to the prior hyperparameters, (22), can be derived from [9], Eqs. (A.14) and (A.15).

For the gradient with respect to ω_q of the mean-field parameterization R_q = diag[exp(ω_q)], we employ the relation

$$\frac{\partial R_{q,ij}}{\partial \omega_{q,k}} = \begin{cases} \exp(\omega_{q,k}) & \text{for } k = i = j, \\ 0 & \text{otherwise.} \end{cases}$$

By the chain rule, we have

$$\frac{\partial}{\partial \omega_{q,k}} \log p(D_s \mid y) = \frac{\partial R_{q,ij}}{\partial \omega_{q,k}} \frac{\partial}{\partial R_{q,ij}} \log p(D_s \mid y) = \frac{\partial}{\partial y_k} \log p(D_s \mid y)\, z_k \exp(\omega_{q,k}),$$

summation over k not implied. It follows that ∇_{ω_q} log p(D_s | y) = [∇_y log p(D_s | y)] ∘ z ∘ exp(ω_q). Similarly,

$$\frac{\partial}{\partial \omega_{q,k}} \tfrac{1}{2} y^\top C_p^{-1} y = \frac{\partial R_{q,ij}}{\partial \omega_{q,k}} \big( C_p^{-1} \big)_{im} y_m z_j = \big( C_p^{-1} \big)_{km} y_m z_k \exp(\omega_{q,k}),$$

summation over k not implied. Finally,

$$\nabla_{\omega_q} \log\det R_q = \nabla_{\omega_q} \log \prod_k \exp(\omega_{q,k}) = \nabla_{\omega_q} \sum_k \omega_{q,k} = \mathbf{1}_N,$$

from which we recover (25).

Appendix B. Stochastic gradient ascent with adaptive step-size sequence

Here we reproduce for completeness the stochastic gradient ascent algorithm with adaptive step-size sequence proposed in [24]. The presentation is expanded to the empirical Bayes context for the update of the prior hyperparameters. At each iteration, the variational parameters and prior hyperparameters are updated using the rules

$$\phi^{(j+1)} = \phi^{(j)} + \rho_\phi^{(j)} \circ \nabla_\phi f_n^{(j)}, \qquad (B.1)$$
$$\theta^{(j+1)} = \theta^{(j)} + \rho_\theta^{(j)} \circ \nabla_\theta f_n^{(j)}, \qquad (B.2)$$

where f_n^{(j)} ≡ f_n(φ^{(j)}, θ^{(j)}), and the vectors of step sizes ρ_φ^{(j)} and ρ_θ^{(j)} are given by

$$\rho_{\phi\backslash\theta}^{(j)} = \eta\, (j+1)^{-1/2+\epsilon} \Big( \tau + \sqrt{s_{\phi\backslash\theta}^{(j)}} \Big)^{-1}, \qquad (B.3)$$

with the sequence

$$s_{\phi\backslash\theta}^{(j)} = \alpha \big( \nabla_{\phi\backslash\theta} f_n^{(j)} \big)^2 + (1 - \alpha)\, s_{\phi\backslash\theta}^{(j-1)} \ \ \text{for } j > 0, \qquad s_{\phi\backslash\theta}^{(0)} = \big( \nabla_{\phi\backslash\theta} f_n^{(0)} \big)^2, \qquad (B.4)$$

where √· and (·)² are understood element-wise. The parameters τ, α, and ε are set to 1.0, 0.1, and 1 × 10^{-…}, respectively, while the scale parameter η > 0 is chosen for each problem.

Appendix C. Discrete adjoint method for Darcy flow
In this section we describe the computation of the gradient and Hessian of the log-likelihood, ∇_y log p(D_s | y), via the discrete adjoint method [30, 31]. For this purpose we introduce the function

$$h(u, y) = -\frac{1}{2\sigma_{us}^2}\, \lVert u_s - H_u u \rVert_2^2 - \frac{1}{2\sigma_{ys}^2}\, \lVert y_s - H_y y \rVert_2^2, \qquad \text{(C.1)}$$

so that ∇_y log p(D_s | y) = ∇ h(u(y), y) by virtue of (3) (as the constant in (3) is independent of y). In the following we employ the following notation: let a be a scalar function, b and c be vector functions, and γ be a scalar variable; then ∂a/∂b denotes the row vector with entries ∂a/∂b_i, ∂b/∂γ denotes the column vector with entries ∂b_i/∂γ, and ∂b/∂c denotes the matrix with ij-th entry ∂b_i/∂c_j.

Differentiating h(u, y) with respect to y_j gives

$$\frac{\mathrm{d} h}{\mathrm{d} y_j} = \frac{\partial h}{\partial y_j} + \frac{\partial h}{\partial u} \frac{\partial u}{\partial y_j}. \qquad \text{(C.2)}$$

Similarly, differentiating the physics constraint L(u, y) = 0 with respect to y_j gives

$$\frac{\partial L}{\partial y_j} + \frac{\partial L}{\partial u} \frac{\partial u}{\partial y_j} = 0, \qquad \text{(C.3)}$$

which implies ∂u/∂y_j = −(∂L/∂u)^{-1}(∂L/∂y_j). Substituting this relation into (C.2) gives the following expression for the j-th component of the gradient:

$$\frac{\mathrm{d} h}{\mathrm{d} y_j} = \frac{\partial h}{\partial y_j} + \lambda^\top \frac{\partial L}{\partial y_j}, \qquad \text{(C.4)}$$

where the adjoint variables λ satisfy the adjoint equation

$$\Big( \frac{\partial L}{\partial u} \Big)^{\!\top} \lambda + \Big( \frac{\partial h}{\partial u} \Big)^{\!\top} = 0. \qquad \text{(C.5)}$$

It can be seen that computing the gradient ∇h(u(y), y) requires the solution of a single linear backward sensitivity problem, (C.5), of size M × M.

For the Hessian, we differentiate (C.2) with respect to y_i, obtaining

$$\frac{\mathrm{d}^2 h}{\mathrm{d} y_i\, \mathrm{d} y_j} = \frac{\partial h}{\partial u} \frac{\partial^2 u}{\partial y_i\, \partial y_j} + D_{i,j} h, \qquad \text{(C.6)}$$

where D_{i,j} h is given by

$$D_{i,j} h = \frac{\partial^2 h}{\partial y_i\, \partial y_j} + \frac{\partial^2 h}{\partial y_i\, \partial u} \frac{\partial u}{\partial y_j} + \frac{\partial^2 h}{\partial y_j\, \partial u} \frac{\partial u}{\partial y_i} + \frac{\partial^2 h}{\partial u^2} \Big( \frac{\partial u}{\partial y_i} \otimes \frac{\partial u}{\partial y_j} \Big), \qquad \text{(C.7)}$$

and ∂²h/∂u² denotes the Hessian of h with respect to u, i.e., the matrix with ij-th entry ∂²h/∂u_i ∂u_j. Similarly, differentiating (C.3) with respect to y_i gives

$$\frac{\partial L}{\partial u} \frac{\partial^2 u}{\partial y_i\, \partial y_j} + D_{i,j} L = 0, \qquad \text{(C.8)}$$

where D_{i,j} L is given element-wise in a manner similar to (C.7). Equation (C.8) implies ∂²u/∂y_i ∂y_j = −(∂L/∂u)^{-1} D_{i,j} L. Substituting into (C.6) gives the following expression for the ij-th component of the Hessian:

$$\frac{\mathrm{d}^2 h}{\mathrm{d} y_i\, \mathrm{d} y_j} = \lambda^\top D_{i,j} L + D_{i,j} h. \qquad \text{(C.9)}$$

Computing the Hessian therefore requires the solution of N linear forward sensitivity problems, (C.3), one for each ∂u/∂y_i, and of one backward sensitivity problem for the adjoint variables, each problem of size M × M.
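For the linear problem of Section 7.1, where the constraint is L(u, y) = S(y)u − b(y) = 0, the adjoint gradient (C.4)–(C.5) can be sketched as follows. The assembly routine and its derivative interface are hypothetical placeholders; only the adjoint bookkeeping is meant to be illustrative.

```python
import numpy as np

def adjoint_gradient(y, u_obs, y_obs, H_u, H_y, assemble, d_assemble,
                     sigma_us=1e-2, sigma_ys=1e-2):
    """Gradient of h(u(y), y) in (C.1) via the discrete adjoint method (C.4)-(C.5).

    assemble(y)   -> (S, b) such that S u = b is the discretized constraint
    d_assemble(y) -> iterable of (dS_j, db_j) = (dS/dy_j, db/dy_j)
    Both are hypothetical interfaces used only for illustration.
    """
    S, b = assemble(y)
    u = np.linalg.solve(S, b)                          # forward solve, L(u, y) = 0
    dh_du = H_u.T @ (u_obs - H_u @ u) / sigma_us**2    # (dh/du)^T
    dh_dy = H_y.T @ (y_obs - H_y @ y) / sigma_ys**2    # (dh/dy)^T
    lam = np.linalg.solve(S.T, -dh_du)                 # adjoint equation (C.5)
    grad = np.empty_like(np.asarray(y, dtype=float))
    for j, (dS_j, db_j) in enumerate(d_assemble(y)):
        dL_dyj = dS_j @ u - db_j                       # dL/dy_j at fixed u
        grad[j] = dh_dy[j] + lam @ dL_dyj              # (C.4)
    return grad
```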
Acknowledgments

This work was supported by the Applied Mathematics Program within the U.S. Department of Energy Office of Advanced Scientific Computing Research. Pacific Northwest National Laboratory is operated by Battelle for the DOE under Contract DE-AC05-76RL01830.
References

[1] A. M. Stuart, Inverse problems: A Bayesian perspective, Acta Numerica 19 (2010) 451–559. doi:10.1017/S0962492910000061.
[2] M. Hanke, A regularizing Levenberg–Marquardt scheme, with applications to inverse groundwater filtration problems, Inverse Problems 13 (1) (1997) 79. URL http://stacks.iop.org/0266-5611/13/i=1/a=007
[3] D. A. Barajas-Solano, B. E. Wohlberg, V. V. Vesselinov, D. M. Tartakovsky, Linear functional minimization for inverse modeling, Water Resour. Res. 51 (2014) 4516–4531. doi:10.1002/2014WR016179.
[4] G. Evensen, Data Assimilation: The Ensemble Kalman Filter, Springer-Verlag, Berlin, Heidelberg, 2006.
[5] T. Salimans, D. Kingma, M. Welling, Markov chain Monte Carlo and variational inference: Bridging the gap, in: F. Bach, D. Blei (Eds.), Proceedings of the 32nd International Conference on Machine Learning, Vol. 37 of Proceedings of Machine Learning Research, PMLR, Lille, France, 2015, pp. 1218–1226. URL http://proceedings.mlr.press/v37/salimans15.html
[6] J. Goodman, J. Weare, Ensemble samplers with affine invariance, Commun. Appl. Math. Comput. Sci. 5 (1) (2010) 65–80. doi:10.2140/camcos.2010.5.65.
[7] arXiv:1311.4780.
[8] M. D. Hoffman, A. Gelman, The No-U-Turn Sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo, Journal of Machine Learning Research 15 (2014) 1593–1623. URL http://jmlr.org/papers/v15/hoffman14a.html
[9] C. E. Rasmussen, C. K. I. Williams, Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning), The MIT Press, 2005.
[10] M. Raissi, P. Perdikaris, G. E. Karniadakis, Numerical Gaussian processes for time-dependent and nonlinear partial differential equations, SIAM Journal on Scientific Computing 40 (1) (2018) A172–A198. doi:10.1137/17M1120762.
[11] M. Raissi, P. Perdikaris, G. E. Karniadakis, Machine learning of linear differential equations using Gaussian processes, Journal of Computational Physics 348 (2017) 683–693. doi:10.1016/j.jcp.2017.07.050.
[12] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics), Springer-Verlag, Berlin, Heidelberg, 2006.
[13] N. D. Lawrence, G. Sanguinetti, M. Rattray, Modelling transcriptional regulation using Gaussian processes, in: B. Schölkopf, J. C. Platt, T. Hoffman (Eds.), Advances in Neural Information Processing Systems 19, MIT Press, 2007, pp. 785–792. URL http://papers.nips.cc/paper/3119-modelling-transcriptional-regulation-using-gaussian-processes.pdf
[14] R. M. Neal, G. E. Hinton, A view of the EM algorithm that justifies incremental, sparse, and other variants, Springer Netherlands, Dordrecht, 1998, pp. 355–368. doi:10.1007/978-94-011-5014-9_12. URL http://dl.acm.org/citation.cfm?id=2074022.2074067
[18] P. Tsilifis, I. Bilionis, I. Katsounaros, N. Zabaras, Computationally efficient variational approximations for Bayesian inverse problems, Journal of Verification, Validation and Uncertainty Quantification 1 (3) (2016) 031004.
[19] B. Jin, J. Zou, Hierarchical Bayesian inference for ill-posed problems via variational method, Journal of Computational Physics 229 (19) (2010) 7317–7343. doi:10.1016/j.jcp.2010.06.016.
[20] I. M. Franck, P. S. Koutsourelakis, Sparse variational Bayesian approximations for nonlinear inverse problems: Applications in nonlinear elastography, Computer Methods in Applied Mechanics and Engineering 299 (2016) 215–244. doi:10.1016/j.cma.2015.10.015.
[21] N. Guha, X. Wu, Y. Efendiev, B. Jin, B. K. Mallick, A variational Bayesian approach for inverse problems with skew-t error distributions, Journal of Computational Physics 301 (2015) 377–393. doi:10.1016/j.jcp.2015.07.062.
[22] K. Yang, N. Guha, Y. Efendiev, B. K. Mallick, Bayesian and variational Bayesian approaches for flows in heterogeneous random media, Journal of Computational Physics 345 (2017) 275–293. doi:10.1016/j.jcp.2017.04.034.
[23] D. M. Blei, A. Kucukelbir, J. D. McAuliffe, Variational inference: A review for statisticians, Journal of the American Statistical Association 112 (518) (2017) 859–877. doi:10.1080/01621459.2017.1285773.
[24] A. Kucukelbir, D. Tran, R. Ranganath, A. Gelman, D. M. Blei, Automatic differentiation variational inference, Journal of Machine Learning Research 18 (14) (2017) 1–45.
[25] D. P. Kingma, M. Welling, Auto-encoding variational Bayes, arXiv:1312.6114.
[26] D. J. Rezende, S. Mohamed, D. Wierstra, Stochastic backpropagation and approximate inference in deep generative models, arXiv:1401.4082.
[27] E. Challis, D. Barber, Gaussian Kullback–Leibler approximate inference, Journal of Machine Learning Research 14 (2013) 2239–2286. URL http://jmlr.org/papers/v14/challis13a.html
[28] R. J. Williams, Simple statistical gradient-following algorithms for connectionist reinforcement learning, Springer US, Boston, MA, 1992, pp. 5–32.
[29] K. Friston, J. Mattout, N. Trujillo-Barreto, J. Ashburner, W. Penny, Variational free energy and the Laplace approximation, NeuroImage 34 (1) (2007) 220–234. doi:10.1016/j.neuroimage.2006.08.035.
[30] M. B. Giles, M. C. Duta, J.-D. Müller, N. A. Pierce, Algorithm developments for discrete adjoint methods, AIAA Journal 41 (2) (2003) 198–205.
[31] D. Ghate, M. Giles, Efficient Hessian calculation using automatic differentiation, in: 25th AIAA Applied Aerodynamics Conference, 2007, p. 4059. doi:10.2514/6.2007-4059.