UNCECOMP 2017
2nd ECCOMAS Thematic Conference on Uncertainty Quantification in Computational Sciences and Engineering
M. Papadrakakis, V. Papadopoulos, G. Stefanou (eds.)
Rhodes Island, Greece, 15–17 June 2017
PROBABILISTIC REDUCED-ORDER MODELING FOR STOCHASTIC PARTIAL DIFFERENTIAL EQUATIONS
Constantin Grigo and Phaedon-Stelios Koutsourelakis

Continuum Mechanics Group, Department of Mechanical Engineering, Technical University of Munich, D-85748 Garching b. München
Keywords: Reduced-order modeling, generative Bayesian model, SPDE, effective material properties
Abstract.
We discuss a Bayesian formulation for the coarse-graining (CG) of PDEs whose coefficients (e.g. material parameters) exhibit random, fine-scale variability. The direct solution of such problems requires grids fine enough to resolve this variability, which unavoidably leads to the repeated solution of very large systems of algebraic equations. We establish a physically inspired, data-driven coarse-grained model which learns a low-dimensional set of microstructural features that are predictive of the fine-grained (FG) model response. Once learned, those features provide a sharp distribution over the coarse-scale effective coefficients of the PDE that are most suitable for prediction of the fine-scale model output. This ultimately allows us to replace the computationally expensive FG by a generative probabilistic model that relies on evaluating the much cheaper CG several times. Sparsity-enforcing priors further increase predictive efficiency and reveal the microstructural features that are important in predicting the FG response. Moreover, the model yields probabilistic rather than single-point predictions, which enables the quantification of the unavoidable epistemic uncertainty due to the information loss that occurs during the coarse-graining process.

1 INTRODUCTION

Many engineering design problems, such as flow in porous media or the mechanical behavior of composite materials, require simulations that are capable of resolving the microstructure of the underlying medium. If the material components under consideration exhibit fine-scale heterogeneity, popular discretization schemes (e.g. finite elements) yield very large systems of algebraic equations. Pertinent solution strategies at best (e.g. multigrid methods) scale linearly with the dimension of the unknown state vector. Despite the ongoing improvements in computer hardware, the repeated solution of such problems, as is required in the context of uncertainty quantification (UQ), poses insurmountable difficulties.

It is obvious that viable strategies for such problems, as well as for a host of other deterministic problems where repeated evaluations are needed (inverse problems, control/design problems etc.), should focus on methods that exhibit sublinear complexity with respect to the dimension of the original problem. In the context of UQ, a popular and general such strategy involves the use of surrogate models or emulators, which attempt to learn the input-output map implied by the fine-grained model. Such models, e.g. Gaussian processes [26], polynomial chaos expansions [8], neural nets [2] and many more, are trained on a finite set of fine-grained model runs. Nevertheless, their performance is seriously impeded by the curse of dimensionality: they usually become inaccurate for input dimensions larger than a few tens or hundreds, or equivalently, the number of FG runs required to achieve an acceptable level of accuracy grows exponentially fast with the input dimension.

Alternative strategies for high-dimensional problems make use of multi-fidelity models [10, 16] as inexpensive predictors of the FG output.
As shown in [12], lower-fidelity models whose output deviates significantly from that of the FG can still yield accurate estimates with significant computational savings, as long as the outputs of the models exhibit statistical dependence. In the case of PDEs where finite elements are employed as the FG, multi-fidelity solvers can simply be obtained by using coarser discretizations in space or time. While linear and nonlinear dimensionality reduction techniques are suitable for dealing with high-dimensional inputs [14], it is generally not known a priori which of the microstructural features are actually predictive of the FG outputs [22]. The model proposed in the present paper attempts to address this question. By using a two-component Bayesian network, we are able to predict fine-grained model outputs based on only a finite number of training runs and the repeated solution of a much coarser model. Uncertainties can easily be quantified, as our model leads to probabilistic rather than point-like predictions.
2 METHODOLOGY

Let $(\Omega, \mathcal{F}, P)$ be a probability space and let $\mathcal{H}$ be the Hilbert space of functions defined over the domain $\mathcal{D}$ on which the physical problem is posed. We consider problems in the context of heterogeneous media whose properties are given by a random process $\lambda(x, \xi)$ defined over the product space $\mathcal{D} \times \Omega$. The corresponding stochastic PDE may be written as

$$\mathcal{L}\big(x, \lambda(x,\xi)\big)\, u\big(x, \lambda(x,\xi)\big) = f(x), \quad + \text{B.C.}, \qquad (1)$$

where $\mathcal{L}$ is a stochastic differential operator and $x \in \mathcal{D}$, $\xi \in \Omega$ are elements of the physical domain and the sample space, respectively. Discretization of the random process, $\lambda(x, \xi) \longrightarrow \lambda_f \in \mathbb{R}^{n_{\lambda_f}}$, as well as of the governing equation leads to a system of $n_f$ (potentially nonlinear) algebraic equations, which can be written in residual form as

$$r_f(U_f; \lambda_f) = 0, \qquad (2)$$

where $U_f(\lambda_f) \in \mathbb{R}^{n_f}$ is the $n_f$-dimensional discretized solution vector for a given $\lambda_f$ and $r_f: \mathbb{R}^{n_f} \times \mathbb{R}^{n_{\lambda_f}} \to \mathbb{R}^{n_f}$ is the discretized residual vector. It is the model described by equation (2) which is denoted as the fine-grained model (FG) henceforth. Let

$$p(\lambda_f) = \int \delta\big(\lambda_f - \lambda_f(\xi)\big)\, p(\xi)\, d\xi \qquad (3)$$

be the density of $\lambda_f$. The density of the fine-scale response $U_f$ is then given by

$$p(U_f) = \int p(U_f \mid \lambda_f)\, p(\lambda_f)\, d\lambda_f, \qquad (4)$$

where the conditional density $p(U_f \mid \lambda_f)$ degenerates to $\delta(U_f - U_f(\lambda_f))$ when the only uncertainties in the problem are due to $\lambda_f$.

The objective of this paper is to approximate the input-output map implied by $U_f(\lambda_f)$, or equivalently, in terms of probability densities, the conditional distribution $p(U_f \mid \lambda_f)$. The latter formulation can also account for problems where additional sources of uncertainty are present and the input-output map is stochastic. To that end, we introduce a coarse-grained model (CG) leading to an approximate distribution $\bar p(U_f \mid \lambda_f)$, which will be trained on a limited number of FG solutions $\mathcal{D}_N = \{\lambda_f^{(i)}, U_f^{(i)}(\lambda_f^{(i)})\}_{i=1}^N$.

Our approximate model $\bar p(U_f \mid \lambda_f)$ employs a set of latent [3], reduced, collective variables, which we denote by $\lambda_c \in \mathbb{R}^{n_{\lambda_c}}$ for reasons that will become apparent in the sequel, such that

$$\bar p(U_f \mid \lambda_f) = \int \bar p(U_f, \lambda_c \mid \lambda_f)\, d\lambda_c = \int \underbrace{\bar p(U_f \mid \lambda_c)}_{\text{decoder}}\; \underbrace{\bar p(\lambda_c \mid \lambda_f)}_{\text{encoder}}\, d\lambda_c. \qquad (5)$$

As can be understood from the equation above, the latent variables $\lambda_c$ act as a probabilistic filter (encoder) on the FG input $\lambda_f$, retaining the features necessary for predicting the FG output $U_f$. In order for $\bar p(U_f \mid \lambda_f)$ to approximate $p(U_f \mid \lambda_f)$ well, the latent variables $\lambda_c$ should not simply be the outcome of a dimensionality reduction on $\lambda_f$. Even if $\lambda_f$ is amenable to such a dimensionality reduction, the $\lambda_c$ found would not necessarily be predictive of $U_f$. Posed differently, it is not important that $\lambda_c$ provides a high-fidelity encoding of $\lambda_f$; it suffices that it is capable of providing a good prediction of the corresponding $U_f(\lambda_f)$.

The aforementioned desiderata do not unambiguously define the form of the encoding/decoding densities in (5), nor the type/dimension of the latent variables $\lambda_c$. In order to retain some of the physical and mathematical structure of the FG, we propose employing a coarsened, discretized version of the original continuous equation (1).
In residual form this can again be written as

$$r_c(U_c; \lambda_c) = 0, \qquad (6)$$

where $U_c \in \mathbb{R}^{n_c}$ is the $n_c$-dimensional ($n_c \ll n_f$) discretized solution and $r_c: \mathbb{R}^{n_c} \times \mathbb{R}^{n_{\lambda_c}} \to \mathbb{R}^{n_c}$ is the corresponding discretized residual vector. Due to the significant discrepancy in dimensions, $n_c \ll n_f$, the cost of solving the CG in (6) is negligible compared to the FG in (2).

Figure 1: A two-step Bayesian network/generative model defining $\bar p(U_f \mid \lambda_f, \theta_c, \theta_{cf})$.

It is clear that $\lambda_c$ plays the role of effective/equivalent properties, but it is not obvious (except for some limiting cases where homogenization results can be invoked [23]) how these should depend on the fine-scale input $\lambda_f$, nor how the solution $U_c(\lambda_c)$ of the CG should relate to $U_f(\lambda_f)$ in (2). Furthermore, it is important to recognize a priori that the use of the reduced variables $\lambda_c$ in combination with the coarse model in (6) in general implies some information loss. The latter introduces an additional source of uncertainty in the predictions of the fine-scale output $U_f$ [25], even in the limit of infinite training data. For that purpose, and in agreement with (5), we propose a generative probabilistic model composed of the following two densities:

• A probabilistic mapping from $\lambda_f$ to $\lambda_c$, which determines the effective properties $\lambda_c$ given $\lambda_f$. We write this as $p_c(\lambda_c \mid \lambda_f, \theta_c)$, where $\theta_c$ denotes a set of model parameters.

• A coarse-to-fine map $p_{cf}(U_f \mid U_c, \theta_{cf})$, which is the PDF of the FG output $U_f$ given the output $U_c$ of the CG. It is parametrized by $\theta_{cf}$.

This model is illustrated graphically in Figure 1. The density $p_c$ encodes $\lambda_f$ into $\lambda_c$, and the coarse-to-fine map $p_{cf}$ plays the role of a decoder, i.e. given the CG output $U_c$, it predicts $U_f$. Using the abbreviated notation $\theta = [\theta_c, \theta_{cf}]$, from (5) we obtain

$$\bar p(U_f \mid \lambda_f, \theta) = \int p_{cf}\big(U_f \mid U_c(\lambda_c), \theta_{cf}\big)\, p_c(\lambda_c \mid \lambda_f, \theta_c)\, d\lambda_c, \qquad (7)$$

where $U_c(\lambda_c)$ is the solution vector of equation (6). The previous discussion suggests the following generative process for drawing samples from $\bar p(U_f \mid \lambda_f, \theta)$, i.e. for predicting the FG output $U_f$ given a FG input $\lambda_f$ (a code sketch follows this list):

• draw a sample $\lambda_c \sim p_c(\lambda_c \mid \lambda_f, \theta_c)$,

• solve the CG to obtain $U_c(\lambda_c)$,

• draw a sample $U_f \sim p_{cf}(U_f \mid U_c(\lambda_c), \theta_{cf})$.
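The three steps above translate directly into code. Below is a minimal sketch in Python; the helper names (`phi`, `solve_cg`) and the argument layout are our own assumptions, with $p_c$ and $p_{cf}$ taken in the Gaussian forms specialized later in (21) and (22).

```python
import numpy as np

def predict_samples(lam_f, phi, theta_c, sigma, W, s, solve_cg,
                    n_samples=1000, rng=None):
    """Draw samples of U_f from p-bar(U_f | lam_f, theta) by ancestral sampling.

    phi(lam_f) -> design matrix Phi(lam_f), shape (n_lam_c, n_features)
    theta_c    -> feature coefficients, shape (n_features,)
    sigma      -> standard deviations of p_c, shape (n_lam_c,)
    W, s       -> coarse-to-fine interpolation matrix (n_f, n_c) and
                  diagonal noise variances (n_f,) of p_cf
    solve_cg   -> coarse solver returning U_c(lam_c), shape (n_c,)
    """
    rng = rng or np.random.default_rng()
    mu_z = phi(lam_f) @ theta_c                                 # mean of z_c = log(lam_c)
    samples = []
    for _ in range(n_samples):
        z_c = mu_z + sigma * rng.standard_normal(mu_z.shape)    # lam_c ~ p_c
        U_c = solve_cg(np.exp(z_c))                             # solve the CG
        U_f = W @ U_c + np.sqrt(s) * rng.standard_normal(W.shape[0])  # U_f ~ p_cf
        samples.append(U_f)
    return np.array(samples)
```

The returned array can be used for both point predictions (its mean) and uncertainty estimates (its spread).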
In order to train the model described above, a reasonable strategy is to minimize the Kullback-Leibler (KL) divergence [5] between the target density $p(U_f, \lambda_f)$, where $p(U_f \mid \lambda_f) = \delta(U_f - U_f(\lambda_f))$, and $\bar p(U_f \mid \lambda_f, \theta)$. As these are conditional distributions, the KL divergence would depend on $\lambda_f$. In order to calibrate the model for the $\lambda_f$ values that are of significance, we operate on the augmented densities $p(U_f, \lambda_f) = p(U_f \mid \lambda_f)\, p(\lambda_f)$ and $\bar p(U_f, \lambda_f \mid \theta) = \bar p(U_f \mid \lambda_f, \theta)\, p(\lambda_f)$, where $p(\lambda_f)$ is defined by (3). In particular, we propose minimizing with respect to $\theta$

$$\begin{aligned}
\mathrm{KL}\big(p(U_f, \lambda_f)\,\|\,\bar p(U_f, \lambda_f \mid \theta)\big)
&= \mathrm{KL}\big(p(U_f, \lambda_f)\,\|\,\bar p(U_f \mid \lambda_f, \theta)\, p(\lambda_f)\big) \\
&= \int p(U_f, \lambda_f) \log\left(\frac{p(U_f, \lambda_f)}{\bar p(U_f \mid \lambda_f, \theta)\, p(\lambda_f)}\right) dU_f\, d\lambda_f \\
&= -\int p(U_f, \lambda_f) \log \bar p(U_f \mid \lambda_f, \theta)\, dU_f\, d\lambda_f + H\big(p(U_f, \lambda_f)\big) \\
&\approx -\frac{1}{N} \sum_{i=1}^N \log \bar p\big(U_f^{(i)} \mid \lambda_f^{(i)}, \theta\big) + H\big(p(U_f, \lambda_f)\big), \qquad (8)
\end{aligned}$$

where $N$ is the number of training samples drawn from $p(U_f, \lambda_f)$, i.e.

$$\lambda_f^{(i)} \sim p(\lambda_f), \qquad U_f^{(i)} = U_f\big(\lambda_f^{(i)}\big), \qquad (9)$$

and $H(p(U_f, \lambda_f))$ is an entropy term that is nevertheless independent of the model parameters $\theta$. It is obvious from the final expression in (8) that $\sum_{i=1}^N \log \bar p(U_f^{(i)} \mid \lambda_f^{(i)}, \theta_c, \theta_{cf})$ is a log-likelihood function of the data $\mathcal{D}_N$, which we denote by $\mathcal{L}(\mathcal{D}_N \mid \theta_c, \theta_{cf})$. In a fully Bayesian setting, this can be complemented with a prior $p(\theta_c, \theta_{cf})$, leading to the posterior

$$p(\theta_c, \theta_{cf} \mid \mathcal{D}_N) \propto e^{\mathcal{L}(\mathcal{D}_N \mid \theta_c, \theta_{cf})}\, p(\theta_c, \theta_{cf}). \qquad (10)$$

It is up to the analyst whether predictions using equation (7) are carried out using point estimates of $\theta$ (e.g. maximum likelihood (MLE) or maximum a posteriori (MAP)) or whether a fully Bayesian approach is followed by averaging over the posterior $p(\theta_c, \theta_{cf} \mid \mathcal{D}_N)$. The latter has the added advantage of quantifying the uncertainty introduced by the finite training data $N$. We pursue the former in the following, as it is computationally more efficient. Our objective is thus to find $\theta^* = [\theta_c^*, \theta_{cf}^*]$ which maximizes the posterior given in equation (10), i.e.

$$\begin{aligned}
[\theta_c^*, \theta_{cf}^*] &= \arg\max_{\theta_c, \theta_{cf}}\; e^{\mathcal{L}(\mathcal{D}_N \mid \theta_c, \theta_{cf})}\, p(\theta_c, \theta_{cf}) \\
&= \arg\max_{\theta_c, \theta_{cf}} \big(\mathcal{L}(\mathcal{D}_N \mid \theta_c, \theta_{cf}) + \log p(\theta_c, \theta_{cf})\big) \\
&= \arg\max_{\theta_c, \theta_{cf}} \left(\sum_{i=1}^N \log \bar p\big(U_f^{(i)} \mid \lambda_f^{(i)}, \theta_c, \theta_{cf}\big) + \log p(\theta_c, \theta_{cf})\right). \qquad (11)
\end{aligned}$$

The main difficulty in this optimization problem arises from the log-likelihood term, which involves the log of an analytically intractable integral with respect to $\lambda_c$, since

$$\mathcal{L}_i(\theta_c, \theta_{cf}) = \log \bar p\big(U_f^{(i)} \mid \lambda_f^{(i)}, \theta_c, \theta_{cf}\big) = \log \int p_{cf}\big(U_f^{(i)} \mid U_c(\lambda_c^{(i)}), \theta_{cf}\big)\, p_c\big(\lambda_c^{(i)} \mid \lambda_f^{(i)}, \theta_c\big)\, d\lambda_c^{(i)}, \qquad (12)$$

where we note that an independent copy of $\lambda_c^{(i)}$ pertains to each data point $i$. Due to this integration, typical deterministic optimization algorithms are not directly applicable.

However, as $\lambda_c$ appears as a latent variable, we may resort to the well-known Expectation-Maximization (EM) algorithm [6]. Using Jensen's inequality, we establish a lower bound on every single term $\mathcal{L}_i$ of the log-likelihood $\mathcal{L}(\mathcal{D}_N \mid \theta_c, \theta_{cf})$ by employing an arbitrary density $q_i(\lambda_c^{(i)})$ (different for each sample $i$) as

$$\begin{aligned}
\mathcal{L}_i(\theta_c, \theta_{cf}) &= \log \int q_i(\lambda_c^{(i)})\, \frac{p_{cf}\big(U_f^{(i)} \mid U_c(\lambda_c^{(i)}), \theta_{cf}\big)\, p_c\big(\lambda_c^{(i)} \mid \lambda_f^{(i)}, \theta_c\big)}{q_i(\lambda_c^{(i)})}\, d\lambda_c^{(i)} \\
&\ge \int q_i(\lambda_c^{(i)}) \log\left(\frac{p_{cf}\big(U_f^{(i)} \mid U_c(\lambda_c^{(i)}), \theta_{cf}\big)\, p_c\big(\lambda_c^{(i)} \mid \lambda_f^{(i)}, \theta_c\big)}{q_i(\lambda_c^{(i)})}\right) d\lambda_c^{(i)} \quad \text{(Jensen)} \\
&= \mathcal{F}_i(q_i; \theta_c, \theta_{cf}). \qquad (13)
\end{aligned}$$

Hence, the log-posterior in (10) can be lower-bounded by

$$\log p(\theta_c, \theta_{cf} \mid \mathcal{D}_N) = \sum_{i=1}^N \mathcal{L}_i(\theta_c, \theta_{cf}) + \log p(\theta_c, \theta_{cf}) + \text{const} \ge \mathcal{F}\big(\{q_i\}_{i=1}^N, \theta_c, \theta_{cf}\big) + \log p(\theta_c, \theta_{cf}) + \text{const}, \qquad (14)$$

where $\mathcal{F} = \sum_{i=1}^N \mathcal{F}_i$. The introduction of the auxiliary densities $q_i$ suggests the following recursive procedure [4] for maximizing the log-posterior:
• E-step: Given $\theta_c^{(t)}, \theta_{cf}^{(t)}$ at iteration $t$, find the auxiliary densities $q_i^{(t+1)}$ that maximize $\mathcal{F}$.
• M-step: Given $q_i^{(t+1)}$, find the parameters $\theta_c^{(t+1)}, \theta_{cf}^{(t+1)}$ that maximize $\mathcal{F}$.

It can readily be shown that the optimal $q_i$ is given by

$$q_i(\lambda_c^{(i)}) \propto p_{cf}\big(U_f^{(i)} \mid U_c(\lambda_c^{(i)}), \theta_{cf}\big)\, p_c\big(\lambda_c^{(i)} \mid \lambda_f^{(i)}, \theta_c\big), \qquad (15)$$

with which the inequality in (13) becomes an equality. In fact, both E- and M-steps can be relaxed to find suboptimal $q_i^{(t)}, \theta_c^{(t)}, \theta_{cf}^{(t)}$, which enables the application of approximate schemes such as Variational Inference (VI) [24, 9]. For the M-step, we may resort to any (stochastic) optimization algorithm [17] or, on occasion, closed-form updates might also be feasible. A generic sketch of this recursion follows.
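The E/M recursion can be organized as follows. This is a generic Monte Carlo EM skeleton under our own naming; `e_step` and `m_step` are placeholders for the problem-specific steps derived below, not the authors' implementation.

```python
def mc_em(data, theta_c, theta_cf, e_step, m_step, n_iter=100):
    """Generic (Monte Carlo) EM skeleton for maximizing the lower bound F.

    e_step(datum, theta_c, theta_cf) -> samples of lam_c^(i) approximating
        the optimal q_i in (15), e.g. obtained by MCMC.
    m_step(data, q_samples) -> updated (theta_c, theta_cf) maximizing F,
        in closed form for the Gaussian model specialized below, otherwise
        by (stochastic) gradient ascent.
    """
    for t in range(n_iter):
        # E-step: one set of approximate-posterior samples per training point i
        q_samples = [e_step(datum, theta_c, theta_cf) for datum in data]
        # M-step: re-estimate the parameters given the current q_i's
        theta_c, theta_cf = m_step(data, q_samples)
    return theta_c, theta_cf
```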
3 NUMERICAL ILLUSTRATION

Figure 2: Random microstructure samples and corresponding fine-grained model outputs.

As a sample problem, we consider the 2D stationary heat equation on the unit square $\mathcal{D} = [0,1]^2$,

$$\nabla_x \cdot \big(-\lambda(x, \xi(x))\, \nabla_x U(x, \xi(x))\big) = 0, \qquad (16)$$

where $U(x, \xi(x))$ represents the temperature field. For the boundary conditions, we fix the temperature $U$ in the upper left corner (see Figure 2) and prescribe the heat flux $Q(x)$ on the remainder of the domain boundary $\partial\mathcal{D}$.

We consider a binary random medium whose conductivity $\lambda(x)$ can take the values $\lambda_{hi}$ and $\lambda_{lo}$. To define such a field, we consider transformations of a zero-mean Gaussian process $\xi(x)$ of the form [18, 13]

$$\lambda(x, \xi(x)) = \begin{cases} \lambda_{hi}, & \text{if } \xi(x) > c_0, \\ \lambda_{lo}, & \text{otherwise}, \end{cases} \qquad (17)$$

where the thresholding constant $c_0$ is selected so as to achieve the target volume fraction $\phi_{hi}$ (or equivalently $\phi_{lo} = 1 - \phi_{hi}$) of the material with conductivity $\lambda_{hi}$ (or equivalently of $\lambda_{lo}$). In the following, we use an isotropic squared-exponential covariance function for $\xi(x)$ of the form

$$\mathrm{cov}(x_i, x_j) = \exp\left\{-\frac{|x_i - x_j|^2}{l^2}\right\}, \qquad (18)$$

with small values of the correlation length $l$.

Due to the small correlation length $l$, we discretize the SPDE in equation (16) using $256 \times 256$ standard quadrilateral finite elements, leading to a linear system of $N_{eq} = \dim(U_f)$ algebraic equations. We choose the discretization mesh of the random process $\lambda(x, \xi(x))$ to coincide with the finite element discretization mesh of the SPDE, i.e. $\dim(\lambda_f) = 256 \times 256 = 65536$. Samples $\mathcal{D}_N = \{\lambda_f^{(i)}, U_f^{(i)}\}_{i=1}^N$ are readily obtained by simulating realizations of the discretized Gaussian field, transforming them according to (17) and solving the discretized SPDE. Three such samples are shown in Figure 2.
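Generating such training microstructures is straightforward. The following sketch draws a thresholded Gaussian field per (17)-(18) via a dense Cholesky factorization, which is adequate for small grids; the paper's 256 × 256 fields would call for spectral or circulant-embedding samplers instead. The parameter values shown are illustrative, not the paper's.

```python
import numpy as np
from scipy.stats import norm
from scipy.spatial.distance import cdist

def binary_microstructure(n=64, l=0.05, phi_hi=0.3,
                          lam_hi=10.0, lam_lo=1.0, rng=None):
    """Sample a binary conductivity field by thresholding a zero-mean,
    unit-variance Gaussian field with squared-exponential covariance
    exp(-|x_i - x_j|^2 / l^2), cf. eqs. (17)-(18)."""
    rng = rng or np.random.default_rng()
    x = (np.arange(n) + 0.5) / n
    X, Y = np.meshgrid(x, x)
    pts = np.column_stack([X.ravel(), Y.ravel()])
    C = np.exp(-cdist(pts, pts, 'sqeuclidean') / l**2)
    L = np.linalg.cholesky(C + 1e-8 * np.eye(n * n))   # jitter for stability
    xi = L @ rng.standard_normal(n * n)                # GP realization
    c0 = norm.ppf(1.0 - phi_hi)                        # threshold: P(xi > c0) = phi_hi
    return np.where(xi > c0, lam_hi, lam_lo).reshape(n, n)
```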
We define a coarse model employing $n_{\lambda_c}$ quadrilateral elements, the conductivities of which are given by the vector $\lambda_c$. Since these need to be strictly positive, we operate instead on $z_c$ defined as

$$z_c = \log \lambda_c. \qquad (19)$$

For each element $k = 1, \ldots, n_{\lambda_c}$ of the coarse model/mesh, we assume that

$$z_{c,k} = \sum_{j=1}^{N_{\text{features}}} \theta_{c,j}\, \chi_j(\lambda_{f,k}) + \sigma_k Z_k, \qquad Z_k \sim \mathcal{N}(0, 1), \qquad (20)$$

where $\lambda_{f,k}$ is the subset of the vector $\lambda_f$ that belongs to coarse element $k$ and the $\chi_j$ are feature functions of the underlying $\lambda_{f,k}$ that we specify below. In more compact form, we can write

$$p_c(z_c \mid \lambda_f, \theta_c, \sigma) = \mathcal{N}\big(z_c \mid \Phi(\lambda_f)\, \theta_c,\, \mathrm{diag}(\sigma^2)\big), \qquad (21)$$

where $\Phi(\lambda_f)$ is an $n_{\lambda_c} \times N_{\text{features}}$ design matrix with $\Phi_{kj}(\lambda_f) = \chi_j(\lambda_{f,k})$ and $\mathrm{diag}(\sigma^2)$ is a diagonal covariance matrix with entries $\sigma_k^2$.

For the coarse-to-fine map $p_{cf}$, we postulate the relation

$$p_{cf}\big(U_f \mid U_c(z_c), \theta_{cf}\big) = \mathcal{N}\big(U_f \mid W U_c(z_c),\, S\big), \qquad (22)$$

where $\theta_{cf} = (W, S)$ are parameters to be learned from the data. The matrix $W$ is of dimension $n_f \times n_c$ and $S$ is a positive definite covariance matrix of size $n_f \times n_f$. To reduce the large number of free parameters in the model, we fix $W$ to express the coarse model's shape-function interpolation. Furthermore, we assume that $S = \mathrm{diag}(s)$, where $s$ is the $n_f$-dimensional vector of diagonal entries of $S$. The aforementioned expression implies, on average, a linear relation between the fine and coarse model outputs, supplemented by the residual Gaussian noise implied by $S$. The latter expresses the uncertainty in predicting the FG output when only the CG solution is available. We note that the parameters of equation (20) (i.e. $\theta_c$ and $\sigma_k$) are adjusted during training so that the model implied by equation (22) represents the data as well as possible.

From equation (15), we have that

$$\begin{aligned}
\log q_i^{(t+1)}(z_c^{(i)}) = \text{const} &- \frac{1}{2} \sum_{k=1}^{n_{\lambda_c}} \left[\log \big(\sigma_k^{(t)}\big)^2 + \big(\sigma_k^{(t)}\big)^{-2} \Big(z_c^{(i)} - \Phi(\lambda_f^{(i)})\, \theta_c^{(t)}\Big)_k^2\right] \\
&- \frac{1}{2} \sum_{j=1}^{n_f} \left[\log s_j^{(t)} + \big(s_j^{(t)}\big)^{-1} \Big(U_f^{(i)} - W U_c(z_c^{(i)})\Big)_j^2\right], \qquad (23)
\end{aligned}$$

where $(\cdot)_k$ and $(\cdot)_j$ denote components of the vectors in brackets. For use in the M-step, we compute the gradients

$$\frac{\partial \mathcal{F}}{\partial \sigma_k^{-2}} = \frac{N}{2} \sigma_k^2 - \frac{1}{2} \sum_{i=1}^N \left\langle \Big(z_c^{(i)} - \Phi(\lambda_f^{(i)})\, \theta_c^{(t)}\Big)_k^2 \right\rangle_{q_i}, \qquad (24)$$

$$\nabla_{\theta_c} \mathcal{F} = \sum_{i=1}^N \left(\Phi^T(\lambda_f^{(i)})\, \Sigma^{-1} \big\langle z_c^{(i)} \big\rangle_{q_i^{(t)}} - \Phi^T(\lambda_f^{(i)})\, \Sigma^{-1}\, \Phi(\lambda_f^{(i)})\, \theta_c\right), \qquad (25)$$

where $\Sigma = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_{n_{\lambda_c}}^2)$. We can readily compute the roots of these gradients to find the optimal $\Sigma, \theta_c$. Furthermore, from $\partial \mathcal{F} / \partial s_j = 0$ we get

$$s_j^{(t+1)} = \frac{1}{N} \sum_{i=1}^N \left\langle \Big(U_f^{(i)} - W U_c(z_c^{(i)})\Big)_j^2 \right\rangle_{q_i^{(t)}}. \qquad (26)$$

A code sketch of these closed-form M-step updates follows.
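A sketch of the updates (24)-(26), with the expectations $\langle\cdot\rangle_{q_i}$ replaced by averages over E-step samples. The data layout is our own assumption; note that $\theta_c$ and $\sigma^2$ are coupled, so one typically iterates these updates within the M-step or alternates them across EM iterations.

```python
import numpy as np

def m_step(Phis, Zs, Ucs, Ufs, W, sigma2):
    """One pass of closed-form M-step updates, cf. eqs. (24)-(26).

    Phis[i] : design matrix Phi(lam_f^(i)),      (n_lam_c, n_feat)
    Zs[i]   : S samples of z_c^(i) from q_i,     (S, n_lam_c)
    Ucs[i]  : corresponding CG solutions U_c,    (S, n_c)
    Ufs[i]  : FG training output U_f^(i),        (n_f,)
    sigma2  : current variances sigma_k^2,       (n_lam_c,)
    """
    N = len(Phis)
    D = 1.0 / sigma2                                 # diagonal of Sigma^{-1}
    # theta_c from setting (25) to zero: a weighted least-squares solve
    A = sum(P.T @ (D[:, None] * P) for P in Phis)
    b = sum(P.T @ (D * Z.mean(axis=0)) for P, Z in zip(Phis, Zs))
    theta_c = np.linalg.solve(A, b)
    # sigma_k^2 from the root of (24)
    sigma2_new = sum(((Z - P @ theta_c) ** 2).mean(axis=0)
                     for P, Z in zip(Phis, Zs)) / N
    # s_j from (26): mean posterior coarse-to-fine residual
    s = sum(((Uf - Uc @ W.T) ** 2).mean(axis=0)
            for Uf, Uc in zip(Ufs, Ucs)) / N
    return theta_c, sigma2_new, s
```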
4 FEATURE FUNCTIONS χ_j

The framework advocated allows for any form and any number of feature functions $\chi_j$ in (20). Naturally, such a selection can be guided by physical insight [11]. In practice, the feature functions we use can be roughly classified into two groups.

4.1 Effective-medium approximations

We consider existing effective-medium approximations from the literature [23] as ingredients for the construction of feature functions $\chi_j$ in equation (20). The majority of commonly used approximations of this kind retain only low-order topological information. The following approximations to the effective property $\lambda_{\text{eff}}$ can be used as building blocks for feature functions $\chi$ in the sense that we can transform them nonlinearly, $\chi(\lambda_f) = f(\lambda_{\text{eff}}(\lambda_f))$. In particular, as we are modeling $z_c = \log \lambda_c$, we include feature functions of the type $\chi(\lambda_f) = \log(\lambda_{\text{eff}}(\lambda_f))$. A code sketch of the three approximations below is given after the DEM paragraph.

Maxwell-Garnett approximation (MGA)
The Maxwell-Garnett approximation is assumed to be valid if the microstructure consists of a matrix phase $\lambda_{\text{mat}}$ and spherical inclusions $\lambda_{\text{inc}}$ that are small and dilute enough that interactions between them can be neglected. Moreover, the heat flux far away from any inclusion is assumed to be constant. Under such conditions, the effective conductivity can be approximated in 2D as

$$\lambda_{\text{eff}} = \lambda_{\text{mat}}\, \frac{\lambda_{\text{mat}} + \lambda_{\text{inc}} + \phi_{\text{inc}}(\lambda_{\text{inc}} - \lambda_{\text{mat}})}{\lambda_{\text{mat}} + \lambda_{\text{inc}} - \phi_{\text{inc}}(\lambda_{\text{inc}} - \lambda_{\text{mat}})}, \qquad (27)$$

where the volume fraction $\phi_i(\lambda_f)$ is given by the fraction of phase-$i$ elements in the binary vector $\lambda_f$.

Self-consistent approximation (SCA)
The self-consistent approximation (or Bruggeman formula) was originally developed for the effective electrical properties of random microstructures. It also considers non-interacting spherical inclusions and follows from the assumption that perturbations of the electric field due to the inclusions average to zero. In 2D, the formula reads

$$\lambda_{\text{eff}} = \frac{\alpha + \sqrt{\alpha^2 + 4 \lambda_{\text{mat}} \lambda_{\text{inc}}}}{2}, \qquad \alpha = \lambda_{\text{mat}}(2\phi_{\text{mat}} - 1) + \lambda_{\text{inc}}(2\phi_{\text{inc}} - 1). \qquad (28)$$

Note that the SCA exhibits phase-inversion symmetry.

Figure 3: Left: a sample microstructure. The blue line encompasses the convex area of the encircled low-conducting phase blob; the two red lines mark its maximum extent in the $x$- and $y$-directions. The green lines are paths along which we count pixels and compute generalized means in the pixel-cross and straight-path-mean feature functions. Right: distance transform (distance to the nearest black pixel) of the microstructure.

Differential effective medium (DEM)
From a first-order expansion of the effective conductivity in the volume fraction, in the dilute limit of spherical inclusions, one can deduce the differential equation [23]

$$(1 - \phi_{\text{inc}})\, \frac{d}{d\phi_{\text{inc}}} \lambda_{\text{eff}}(\phi_{\text{inc}}) = 2\, \lambda_{\text{eff}}(\phi_{\text{inc}})\, \frac{\lambda_{\text{inc}} - \lambda_{\text{eff}}(\phi_{\text{inc}})}{\lambda_{\text{inc}} + \lambda_{\text{eff}}(\phi_{\text{inc}})}, \qquad (29)$$

which can be integrated to

$$\left(\frac{\lambda_{\text{inc}} - \lambda_{\text{eff}}}{\lambda_{\text{inc}} - \lambda_{\text{mat}}}\right) \sqrt{\frac{\lambda_{\text{mat}}}{\lambda_{\text{eff}}}} = 1 - \phi_{\text{inc}}. \qquad (30)$$

We solve this relation for $\lambda_{\text{eff}}$ and use it as a feature function $\chi$.
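For concreteness, a sketch of the three approximations (27), (28) and (30) as plain functions; a corresponding feature is then, e.g., $\chi(\lambda_f) = \log(\lambda_{\text{eff}}(\lambda_f))$. The bisection solve for the DEM (valid for $0 < \phi_{\text{inc}} < 1$) is our own choice; any root finder works.

```python
import numpy as np
from scipy.optimize import brentq

def mga_eff(lam_mat, lam_inc, phi_inc):
    """Maxwell-Garnett approximation in 2D, eq. (27)."""
    d = lam_inc - lam_mat
    return lam_mat * (lam_mat + lam_inc + phi_inc * d) \
                   / (lam_mat + lam_inc - phi_inc * d)

def sca_eff(lam_mat, lam_inc, phi_inc):
    """Self-consistent (Bruggeman) approximation in 2D, eq. (28)."""
    phi_mat = 1.0 - phi_inc
    alpha = lam_mat * (2 * phi_mat - 1) + lam_inc * (2 * phi_inc - 1)
    return 0.5 * (alpha + np.sqrt(alpha ** 2 + 4 * lam_mat * lam_inc))

def dem_eff(lam_mat, lam_inc, phi_inc):
    """Differential effective-medium approximation: solve the implicit
    relation (30) for lam_eff, which lies between lam_mat and lam_inc."""
    f = lambda lam: ((lam_inc - lam) / (lam_inc - lam_mat)) \
                    * np.sqrt(lam_mat / lam) - (1.0 - phi_inc)
    lo, hi = sorted((lam_mat, lam_inc))
    return brentq(f, lo + 1e-12, hi - 1e-12)
```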
4.2 Morphological feature functions

Apart from the effective-medium approximations, which only take into account the phase conductivities and volume fractions, we wish to have feature functions $\chi_j$ that describe the morphology of the underlying microstructure more thoroughly. Popular members of this class of microstructural features are the two-point correlation, the lineal-path function, the pore-size density or the specific surface, to mention only a few [23]. We are, however, free to use any function $\chi_j: (\mathbb{R}^+)^{\dim(\lambda_{f,k})} \mapsto \mathbb{R}$ as a feature, no matter from which field or consideration it may originate. We thus make use of existing image-processing features [15] as well as novel topology-describing functions. Some important examples follow; a code sketch of several of them is given after the last item of this list.

Convex area of connected phase blobs
This feature identifies distinct, connected "blobs" of pixels belonging exclusively to the high- or the low-conducting phase and computes the area of the convex hull of each blob. We can then use, e.g., the mean or maximum value thereof as a feature function.
Blob extent
From an identified phase blob, we can compute its maximum extent in the $x$- and $y$-directions. One can, for instance, take the mean or maximum of the maximum extents among the identified blobs as a feature.

Distance transformation functions
The distance transform of a binary image assigns to each pixel the distance from that pixel to the nearest nonzero pixel in the binary image; see the right part of Figure 3. Either phase can be chosen to correspond to "nonzero" in the binary image, and different distance metrics may be used. As a feature, one can take, e.g., the mean or maximum of the distance-transformed image.

Pixel-cross function
This feature counts the number of high- or low-conducting pixels one has to cross going on a straight line from boundary to boundary in the $x$- or $y$-direction. One can again take, e.g., mean, maximum or minimum values as the feature function outputs.

Straight path mean function
A further refinement of the latter function is to take generalized means, instead of numbers of crossed pixels, along straight lines from boundary to boundary. In particular, we use harmonic, geometric and arithmetic means as features.
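A sketch of a few of these morphological features in terms of standard image-processing routines (scipy.ndimage and scikit-image); the phase conventions and the reduction to scalars follow the descriptions above, and the exact variants used in the paper may differ.

```python
import numpy as np
from scipy import ndimage

def distance_transform_features(img_hi):
    """Mean/max of the distance transform of a binary image (high phase = 1);
    each pixel gets its Euclidean distance to the nearest 0-pixel."""
    dt = ndimage.distance_transform_edt(img_hi)
    return dt.mean(), dt.max()

def pixel_cross(img_hi, axis=0):
    """Number of high-conducting pixels crossed on straight boundary-to-
    boundary lines; mean/max/min over all lines along the given axis."""
    counts = img_hi.sum(axis=axis)
    return counts.mean(), counts.max(), counts.min()

def straight_path_means(lam_field, axis=0):
    """Harmonic, geometric and arithmetic means of the conductivity along
    straight boundary-to-boundary paths, averaged over all paths."""
    harm = 1.0 / np.mean(1.0 / lam_field, axis=axis)
    geom = np.exp(np.mean(np.log(lam_field), axis=axis))
    arith = np.mean(lam_field, axis=axis)
    return harm.mean(), geom.mean(), arith.mean()

def max_convex_blob_area(img_phase):
    """Largest convex-hull area (in pixels) over the connected blobs of one
    phase, using a per-blob convex hull from scikit-image."""
    from skimage.morphology import convex_hull_image
    labels, n = ndimage.label(img_phase)
    areas = [convex_hull_image(labels == k).sum() for k in range(1, n + 1)]
    return max(areas) if areas else 0.0
```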
4.3 Sparsity priors

It is clear from the previous discussion that the number of possible feature functions is practically limitless. The more such $\chi_j$ one introduces, the more unknown parameters $\theta_{c,j}$ must be learned. From a modeling point of view, while ML estimates can always be found, it is desirable to have as clear a distinction as possible between relevant and irrelevant features, which can provide further insight, and to be able to achieve this with the fewest possible training data. For that purpose, we advocate the use of sparsity-enforcing priors on $\theta_c$ [1, 7, 19]. From a statistical perspective, this is also motivated by the bias-variance trade-off. Model prediction accuracy is adversely affected by two factors: one is noise in the training data (variance), the other is due to overly simple model assumptions (bias). Maximum-likelihood estimates of model parameters tend to have low bias but high variance, i.e. they accurately predict the training data but generalize poorly. To address this issue, a common Bayesian approach to controlling model complexity is the use of priors, which is the equivalent of regularization in frequentist formulations. A particularly appealing choice is the Laplacian prior (corresponding to LASSO regression [21]), as it sets redundant or unimportant predictors to exactly 0, thereby simplifying interpretation. In particular, we use a prior on the coefficients $\theta_c$ of the form

$$\log p(\theta_c) = N_{\theta_c} \log \frac{\sqrt{\gamma}}{2} - \sqrt{\gamma} \sum_{i=1}^{N_{\theta_c}} |\theta_{c,i}|, \qquad (31)$$

with a hyper-parameter $\gamma$ which can be set either by applying some cross-validation scheme or, more efficiently, by minimization of Stein's unbiased risk estimate [20, 27]. A straightforward implementation of this prior is described in [7].
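The Laplace prior enters the M-step objective as an $\ell_1$ penalty. Its sparsifying effect can be realized, e.g., with the soft-thresholding (proximal) operator shown below; this is one standard option, not necessarily the scheme of [7].

```python
import numpy as np

def soft_threshold(theta, kappa):
    """Proximal operator of kappa * ||theta||_1: components with
    |theta_j| <= kappa are set to exactly zero."""
    return np.sign(theta) * np.maximum(np.abs(theta) - kappa, 0.0)

# One proximal-gradient ascent step on F + log p(theta_c), cf. (31),
# with step size eta and gradient grad_F of the likelihood bound:
#   theta_c = soft_threshold(theta_c + eta * grad_F, eta * np.sqrt(gamma))
```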
5 RESULTS

5.1 Prediction errors

We use a total of 306 feature functions $\chi_j$ and a Laplacian prior (31) with a hyper-parameter $\gamma$ found by cross-validation. The length-scale parameter $l$ and the expected volume fraction $\phi_{hi}$ of the high-conducting phase are kept fixed. For the contrast, we take $c = \lambda_{hi}/\lambda_{lo} = 10$, where we set $\lambda_{lo} = 1$. The full- and reduced-order models are both computed on regular quadrilateral finite element meshes, of size $256 \times 256$ for the FG and $2 \times 2$, $4 \times 4$ and $8 \times 8$ for the CG.

Figure 4: Relative squared prediction error $\langle (U_{f,\text{true}} - \langle U_f \rangle_{\bar p})^2 \rangle / \mathrm{var}(U_f)$ versus the number of training data samples $N$, for different CG mesh sizes $N_{el,c} = \dim(\lambda_c)$ and contrast $c = 10$.

Our goal is to measure the predictive capabilities of the described model. To that end, we compute the mean squared distance

$$d = \frac{1}{N_{\text{test}}\, n_f} \sum_{j=1}^{n_f} \sum_{i=1}^{N_{\text{test}}} \left(U_{f,\text{true},j}^{(i)} - \left\langle U_{f,j}^{(i)} \right\rangle_{\bar p(U_f^{(i)} \mid \lambda_f^{(i)}, \theta)}\right)^2 \qquad (32)$$

of the predictive mean $\langle U_f^{(i)} \rangle_{\bar p(U_f^{(i)} \mid \lambda_f^{(i)}, \theta)}$ to the true FG output $U_f^{(i)}$ on a test set of $N_{\text{test}} = 256$ samples. Predictions are carried out by drawing 10,000 samples of $\lambda_c$ from the learned distribution $p_c(\log \lambda_c \mid \lambda_f, \theta_c^*, \sigma) = \mathcal{N}(\log \lambda_c \mid \Phi(\lambda_f)\, \theta_c^*, \mathrm{diag}(\sigma^2))$, solving the coarse model $U_c^{(i)} = U_c(\lambda_c^{(i)})$ and drawing a sample from $p_{cf}(U_f^{(i)} \mid U_c^{(i)}, \theta_{cf}^*) = \mathcal{N}(U_f^{(i)} \mid W U_c^{(i)}, S^*)$ for every test data point. The resulting Monte Carlo noise is small enough to be neglected. As a reference value for the computed error $d$, we compute the mean variance of the FG output,

$$\mathrm{var}(U_f) = \frac{1}{n_f} \sum_{i=1}^{n_f} \left(\left\langle U_{f,i}^2 \right\rangle - \left\langle U_{f,i} \right\rangle^2\right), \qquad (33)$$

where the expectation values $\langle \cdot \rangle$ are estimated on a set of 1024 FG samples, such that errors due to Monte Carlo sampling can be neglected.
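The error measures (32)-(33) amount to a few lines; a sketch under our own array layout (in the paper, var($U_f$) is estimated from a separate set of 1024 FG samples):

```python
import numpy as np

def relative_prediction_error(U_true, U_pred_mean, U_ref):
    """Relative squared prediction error d / var(U_f), cf. eqs. (32)-(33).

    U_true, U_pred_mean : (N_test, n_f) FG references and predictive means
    U_ref               : (M, n_f) independent FG samples for var(U_f)
    """
    d = np.mean((U_true - U_pred_mean) ** 2)   # eq. (32)
    var_uf = np.mean(U_ref.var(axis=0))        # eq. (33), Monte Carlo estimate
    return d / var_uf
```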
Figure 4 shows the relative squared prediction error $d / \mathrm{var}(U_f)$ in dependence of the number of training data samples for the different coarse mesh sizes $\dim(\lambda_c) = 2 \times 2$, $4 \times 4$ and $8 \times 8$. We observe that the predictive error converges to a finite value already for relatively few training data. This is due to the inevitable information loss during the coarsening process $\lambda_f \to \lambda_c$, as well as to finite element discretization errors. In accordance with that, we see that the error drops with the dimension of the coarse mesh, $\dim(\lambda_c)$.

5.2 Identified features

Figure 5: Components of the optimal $\theta_c^*$ for three different values of the contrast $c$. Most feature functions are deactivated by the sparsity prior. We observe that with increasing contrast, the importance of the log-SCA diminishes at the expense of more geometric features.

For the same volume fraction and microstructural length-scale parameter as above, three different values of the contrast $c$, a coarse model dimension of $\dim(\lambda_c) = 4 \times 4$ and a training set of $N = 128$ samples, we find the optimal $\theta_c^*$ shown in Figure 5. Due to the application of a sparsity-enforcing prior as described in section 4.3, most components of $\theta_c^*$ are exactly 0. Comparability between different feature functions can be ensured by standardization or normalization of the feature function outputs on the training data. For all three contrast values, we see that the three most important feature functions are the maximum convex area of blobs of conductivity $\lambda_{hi}$, the maximum log of the geometric mean of conductivities along a straight line from boundary to boundary in the $y$-direction, and the log of the self-consistent effective-medium approximation described above. The maximum convex area feature returns the largest convex area of all blobs found within a coarse element. It is interesting to note that although the (max/min/mean/variance of the) plain blob area (for both the high- and the low-conducting phase) is also included in the set of 306 feature functions $\chi_j$, it is only the convex area which is activated.

In Figure 5, it is observed that with increasing contrast, the coefficient $\theta_{c,j}^*$ belonging to the log self-consistent approximation decreases, in contrast to the increasing values of the $\theta_{c,j}^*$'s corresponding to the max high-conducting convex area and the max log geometric mean along the $y$-direction. We believe this is because the higher the contrast $c$, the more the exact geometry and connectedness of the microstructure matter for predicting the effective properties. As the SCA only considers the volume fraction and the conductivities of the two phases, it disregards such information.

5.3 The latent variables λ_c

Figure 6: Three test samples and the corresponding mode $\exp(\Phi(\lambda_f)\, \theta_c^*)$ of $p_c$.

Figure 6 shows three test samples at contrast $c = 10$ (top row) along with the mode $\exp(\Phi(\lambda_f)\, \theta_c^*)$ of the learned distribution of the effective conductivity $p_c(\lambda_c \mid \lambda_f, \theta_c^*)$, which is connected to the mean by $\langle \lambda_c \rangle_{p_c} = \exp\big(\Phi(\lambda_f)\, \theta_c^* + \sigma^{*2}/2\big)$. We emphasize again that the latent variable $\lambda_c$ is not a lower-dimensional compression of $\lambda_f$ with the objective of most accurately reconstructing $\lambda_f$, but of providing good predictions of $U_f(\lambda_f)$. Even though Figure 6 gives the impression of a simple local averaging relation between $\lambda_f$ and $\lambda_c$, this is not always the case. In particular, $p_c(\lambda_c \mid \lambda_f, \theta_c^*)$ was found to have non-vanishing probability mass for $\lambda_{c,i} < \lambda_{lo}$ or $\lambda_{c,i} > \lambda_{hi}$, especially for more general models where the coarse-to-fine mapping (22) is not fixed to the shape-function interpolation $W$.

5.4 Predictive uncertainty

Figure 7: Variance parameters $\sigma^*$ and $s^*$ of $p_c$ and $p_{cf}$, respectively.

The predictive uncertainty is composed of the uncertainty in having an accurate encoding of $\lambda_f$ in $\lambda_c$, which is described by the $\sigma_k$ in $p_c$ (21), as well as the uncertainty in the reconstruction process from $U_c$ to $U_f$, which is given by the diagonal covariance $S = \mathrm{diag}(s)$ in $p_{cf}$ (22). Both are shown after training (at contrast $c = 10$) in Figure 7. For $\sigma^*$, we observe that the values in the corner elements always converge to very tight values, whereas some non-corner elements can converge to comparably large values; the exact location of these elements is data dependent. The coarse-to-fine reconstruction variances $s^*$ are depicted in the right column of Figure 7. As expected, we see that the estimated coarse-to-fine reconstruction error is largest in the centers of coarse elements, i.e. at large distances from the coarse model's finite element nodes.

5.5 Predictive distributions

As in section 5.1, for a model trained with $N_{\text{train}} = 32$ samples, $\dim(\lambda_c) = 8 \times 8$ and random microstructures at contrast $c = 10$, we consider predictions by drawing 10,000 samples from $\bar p(U_f \mid \lambda_f, \theta_{cf}, \theta_c)$. The predictive histogram for the temperature $U_{f,lr}$ in the lower right corner of the domain is shown in Figure 8. Figure 9 shows a surface plot of the true solution (colored) and the predictive mean (blue) $\pm \sigma$ (gray). As can be seen, the true solution $U_f$ is everywhere well contained within $\bar p$.

Figure 8: Predictive histogram (samples from $\bar p(U_{f,lr} \mid \lambda_f, \theta^*)$) for the temperature $U_{f,lr}$ at the lower right corner of the domain. The true solution $U_{f,lr,\text{true}}$ is well captured by the distribution $\bar p$.

Figure 9: Predictions over the whole domain for four different test samples. The true solution (colored) lies within the predictive mean (blue) $\pm \sigma$ (grey); the predictive mean is very close to the true solution.

6 CONCLUSIONS

In this paper, we described a generative Bayesian model which is capable of producing probabilistic predictions of an expensive fine-grained model (FG) based only on a small, finite number of training data and multiple solutions of a fast but less accurate reduced-order model (CG). In particular, we considered the discretized solution of stochastic PDEs with random coefficients, where the FG corresponds to a fine-scale discretization. Naturally, this comes along with a very high-dimensional vector of input uncertainties.
The proposed model is capable of extracting the most relevant features of those input uncertainties and provides a mapping to a much lower-dimensional space of effective properties (encoding). These lower-dimensional effective properties serve as the input to the CG, which solves the PDE on a much coarser scale. The last step consists of a probabilistic reconstruction mapping from the coarse- to the fine-scale solution (decoding).

We demonstrated the features and capabilities of the proposed model on a 2D steady-state heat problem, where the fine scale of the conductivity implies (upon discretization) a random input vector of dimension $256 \times 256$ and the solution of a discretized system of equations of comparable size. In combination with a sparsity-enforcing prior, the proposed model identified the most salient features of the fine-scale conductivity field and allowed accurate predictions of the FG response using CGs of size only $2 \times 2$, $4 \times 4$ and $8 \times 8$. The predictive distribution always included the true FG solution and provided uncertainty bounds arising from the information loss taking place during the coarse-graining process.

REFERENCES
[1] J. Bernardo, M. Bayarri, J. Berger, A. Dawid, D. Heckerman, A. Smith, and M. West. Bayesian factor regression models in the large p, small n paradigm. Bayesian Statistics, 7:733–742, 2003.

[2] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.

[3] C. M. Bishop. Latent variable models. In Learning in Graphical Models, pages 371–403. Springer Netherlands, Dordrecht, 1998.

[4] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

[5] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons, 2012.

[6] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), pages 1–38, 1977.

[7] M. A. T. Figueiredo. Adaptive sparseness for supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(9):1150–1159, 2003.

[8] R. G. Ghanem and P. D. Spanos. Stochastic Finite Elements: A Spectral Approach. Dover Publications, 2003.

[9] M. D. Hoffman, D. M. Blei, C. Wang, and J. W. Paisley. Stochastic variational inference. Journal of Machine Learning Research, 14(1):1303–1347, 2013.

[10] M. C. Kennedy and A. O'Hagan. Predicting the output from a complex computer code when fast approximations are available. Biometrika, 87(1):1–13, 2000.

[11] P. Koutsourelakis. Probabilistic characterization and simulation of multi-phase random media. Probabilistic Engineering Mechanics, 21(3):227–234, 2006.

[12] P.-S. Koutsourelakis. Accurate uncertainty quantification using inaccurate computational models. SIAM Journal on Scientific Computing, 31(5):3274–3300, 2009.

[13] P.-S. Koutsourelakis and G. Deodatis. Simulation of binary random fields with applications to two-phase random media. Journal of Engineering Mechanics, 131(4):397–412, 2005.

[14] X. Ma and N. Zabaras. Kernel principal component analysis for stochastic input model generation. Journal of Computational Physics, 230(19):7311–7331, 2011.

[15] MATLAB. Version 9.1.0.441655 (R2016b). The MathWorks Inc., Natick, Massachusetts, 2016.

[16] P. Perdikaris, D. Venturi, J. O. Royset, and G. E. Karniadakis. Multi-fidelity modelling via recursive co-kriging and Gaussian-Markov random fields. Proceedings of the Royal Society A, 471(2179), 2015.

[17] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.

[18] A. Roberts and M. Teubner. Transport properties of heterogeneous materials derived from Gaussian random fields: bounds and simulation. Physical Review E, 51(5):4141, 1995.

[19] M. Schöberl, N. Zabaras, and P.-S. Koutsourelakis. Predictive coarse-graining. Journal of Computational Physics, 333:49–77, 2017.

[20] C. M. Stein. Estimation of the mean of a multivariate normal distribution. The Annals of Statistics, 9(6):1135–1151, 1981.

[21] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society B, 58(1):267–288, 1996.

[22] N. Tishby, F. C. N. Pereira, and W. Bialek. The information bottleneck method. arXiv:physics/0004057, 2000.

[23] S. Torquato. Random Heterogeneous Materials: Microstructure and Macroscopic Properties. Springer, 2002.

[24] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305, 2008.

[25] E. Weinan. Principles of Multiscale Modeling. Cambridge University Press, 2011.

[26] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

[27] H. Zou, T. Hastie, and R. Tibshirani. On the "degrees of freedom" of the lasso. The Annals of Statistics, 35(5):2173–2192, 2007.