Bayesian Calibration of Computer Models with Informative Failures
Peter W. Marcy∗
Statistical Sciences Group (CCS-6), Los Alamos National Laboratory
and
Curtis B. Storlie
Mayo Clinic
Abstract
There are many practical difficulties in the calibration of computer models to experimental data. One such complication is the fact that certain combinations of the calibration inputs can cause the code to output data lacking fundamental properties, or even to produce no output at all. In many cases the researchers want or need to exclude the possibility of these “failures” within their analyses. We propose a Bayesian (meta-)model in which the posterior distribution for the calibration parameters naturally excludes regions of the input space corresponding to failed runs. That is, we define a statistical selection model to rigorously couple the disjoint problems of binary classification and computer model calibration. We demonstrate our methodology using data from a carbon capture experiment in which the numerics of the computational fluid dynamics are prone to instability.
Keywords: computer experiment, Gaussian process, emulator, latent process, classification, selection model, weighted distribution, uncertainty quantification, Bayesian analysis.

∗ This work was funded by the U.S. Department of Energy, Office of Fossil Energy’s Carbon Capture Simulation Initiative through the National Energy Technology Laboratory. The authors thank Canhai (Kevin) Lai and Zhijie (Jay) Xu for providing the MFIX simulator data as well as Derek Bingham for mentioning the COMPAS model. The authors are also grateful to three anonymous reviewers for their helpful comments.

1 Introduction
Computer codes (also called simulators) are often used in the physical sciences to model complex systems and phenomena. Beginning with Sacks et al. (1989), the statistical design/analysis of experiments involving simulators has become so ubiquitous that this subdiscipline needs little introduction; uninitiated readers may consult the plethora of standard references within Fang et al. (2006), Levy and Steinberg (2010), and Santner et al. (2018).

One typical inferential goal involving computer models is calibration: when actual experimental/field data of a system are available, these measurements can be used to infer the most probable values of unobservable input quantities necessary to run the computer model. Historically, calibrating computer models to data took the form of nonlinear regression (e.g., Seber and Wild, 1989) in which a least-squares criterion was minimized, an early example being Box and Coutie (1956). Modern calibration analyses often feature fast approximations of the computer model and can go by the names “tuning” (Park, 1991; Cox et al., 2001) or “history-matching” (Craig et al., 1997). The latter takes a Bayes-linear approach and explicitly incorporates a discrepancy term to acknowledge the imperfection of the computer model. Bayesian calibration featuring Gaussian process (GP) surrogates (emulators) and model discrepancy can be found in Kennedy and O’Hagan (2001), Higdon et al. (2008), and a glut of subsequent modifications/applications by many authors.

One impediment to calibration, or even emulation of the computer model, is the possibility that certain combinations of the calibration inputs will result in output absent some necessary property P: in other words, a failure. In certain cases the scientists consider these failures data; as such, the parameter combinations leading to failures are deemed implausible, to use the language of history-matching.
When this is true, the failed runs should be used to exclude or down-weight those regions of the calibration input space. Without doing so, it is possible that these regions will be admitted to the calibration posterior via an emulator which is forced to extrapolate where there is no data. A smaller parameter space could be desirable in its own right, but especially when the calibrated model is to be used for further purposes such as uncertainty analysis (Oakley and O’Hagan, 2002).

Here is some concrete motivation by way of examples. First consider the Allometrically Constrained Growth and Carbon Allocation model of Gemoets et al. (2013) and Fell et al. (2018). Calibrating this code to a particular tree using radius or height is bedeviled by the fact that the model will sometimes report the tree as being dead when the real tree is very much alive (Property P). As the code behaves this way for particular and a priori unknown calibration input configurations, it would be all too reasonable to estimate and utilize the P-constraint on the parameter space. (If the constraint were known, this could be easily accommodated by the implausibility metric for a history-match.) Another example is the COMPAS (Compact Object Mergers: Population Astrophysics and Statistics) model of binary black hole formation (Barrett et al., 2016). Calibrating this code to existing binary black holes using chirp mass only makes sense when (P:) a black hole is actually formed; this is not guaranteed for all inputs to the code. The motivation behind our work is the calibration of data to a particular computational fluid dynamics (CFD) model. The code is prone to numerical instability for certain parts of the input space, but its issues are not easily addressed as it is specialized software. Property P in this case is the very presence of output.
Those regions of parameter space yielding no output pragmatically needed to be ruled out for further upscaling and uncertainty analyses because these were built around the same code.

The outline of the paper is as follows. In Sections 2 and 3 we review the disparate methodologies of calibration and classification upon which our method will build. In Section 4 we combine them. The new methodology will be demonstrated using the CFD code mentioned above with data from a carbon capture experiment in Section 5.

2 Calibration

In this section we review the details of univariate calibration using the framework of Kennedy and O’Hagan (2001). The restriction to a single output is purely for expositional clarity; because our method uses only the input space to couple the two sources of information, it is directly applicable to the case of multivariate output.

The inputs to the computer model are partitioned into two groups. The calibration inputs t = (t_1, . . . , t_{D_t}) are quantities that can be changed within the code, but ideally would be fixed at one setting θ = (θ_1, . . . , θ_{D_t}) that represents the true state of nature. These calibration parameters θ represent physical quantities of interest that cannot directly be measured in the real system. The second group contains the variable inputs x = (x_1, . . . , x_{D_x}) that correspond to observable or controllable experimental settings. The computer code η is executed M times to produce simulated data η_m = η(x*_m, t*_m), (m = 1, . . . , M). In addition there are N measurements of the true physical process τ:

    y_n = τ(x_n) + ε_n,   n = 1, . . . , N,

where {ε_n} are iid N(0, σ²_ε) observational errors. Note that this experimental data is the result of conditions x_n which may differ from those of the simulated data (hence the “∗” notation above). The end goal is to use the simulated data to find the values of θ most consistent with the experimental data.
To make this inference the following relationship is assumed:

    τ(x) = η(x, θ) + δ(x).

That is, the real physical system can be decomposed as the simulated system at a “true” setting θ, plus a systematic discrepancy which depends only on the variable inputs. Kennedy and O’Hagan (2001) use independent GPs to model the unknown functions η(x, t) and δ(x) so that uncertainty at unobserved input configurations can be accounted for (e.g., O’Hagan and Kingman, 1978; Sacks et al., 1989; Santner et al., 2018). Specifically, η is assumed to have mean μ_η and covariance kernel k_η((x_1, t_1), (x_2, t_2)); the discrepancy δ is assumed to have mean μ_δ and covariance kernel k_δ(x_1, x_2). In this work, we take the kernels of the independent processes to have the form:

    k_η((x_1, t_1), (x_2, t_2)) = σ²_η ∏_{d=1}^{D_x} R(|x_{1,d} − x_{2,d}|; λ_{η,x,d}) · ∏_{e=1}^{D_t} R(|t_{1,e} − t_{2,e}|; λ_{η,t,e})
    k_δ(x_1, x_2) = σ²_δ ∏_{d=1}^{D_x} R(|x_{1,d} − x_{2,d}|; λ_{δ,d}),

where R(·; λ) is a one-dimensional correlation function (such as a squared-exponential or Matérn with specified smoothness) and (λ_{η,x}, λ_{η,t}, λ_δ) =: λ are 2D_x + D_t correlation length parameters. Under these assumptions, together with the relationships

    y_n = η(x_n, θ) + δ(x_n) + ε_n,   n = 1, . . . , N     (1)
    η_m = η(x*_m, t*_m),   m = 1, . . . , M,

the joint likelihood for all of the observations is obtained readily. Let y denote all the experimental data and η denote the collected simulation runs. The distribution for all the observed data d := (y⊤, η⊤)⊤ is then multivariate normal (MVN):

    N_{N+M}( mean = [ (μ_η + μ_δ) 1_N ; μ_η 1_M ],
             cov  = [ k_η([X, θ], [X, θ]) + k_δ(X, X) + σ²_ε I_N ,  k_η([X, θ], [X*, T*]) ;
                      k_η([X*, T*], [X, θ]) ,                       k_η([X*, T*], [X*, T*]) ] ).     (2)

Above, k_η([X*, T*], [X, θ]) is the matrix whose (i, j)-entry is k_η((x*_i, t*_i), (x_j, θ)); other blocks within the covariance are defined similarly. If the non-calibration parameters are collected in the vector α_{η,δ} := (μ_η, μ_δ, σ²_η, σ²_δ, σ²_ε, λ), the posterior distribution for all parameters is

    [θ, α_{η,δ} | d] ∝ L(θ, α_{η,δ}; d) · [θ | α_{η,δ}] · [α_{η,δ}],     (3)

and a priori independence is almost always assumed: [θ | α_{η,δ}] ≡ [θ]. The notation [X] is shorthand for f_X(x), the density or mass function of the variable X, and L(·; d) represents the likelihood function, which in this case is proportional to the MVN pdf defined by (2). We have partitioned the parameter space into emulator/discrepancy and calibration parameters to make later sections more transparent.

The posterior given in (3) can be explored using Markov chain Monte Carlo (MCMC) with Metropolis-Hastings updates for some of the parameters and Gibbs steps for the remaining. In updating the emulator/discrepancy parameters, the use of Gibbs versus Metropolis will depend on the prior specification. For all analyses in this paper we use the following prior distributions (which assume that the inputs within x and t have all been scaled to [0, 1], and that the output d has been rescaled by its sample standard deviation):

    [θ | α_{η,δ}] · [α_{η,δ}] := [θ] · [α_{η,δ}]
                               = [θ] · [μ_η] · [μ_δ] · [σ²_η] · [σ²_δ] · [σ²_ε] · [λ_{η,x}, λ_{η,t}, λ_δ]
    [μ_η] · [μ_δ] ∝ 1
    σ_η ~ Unif(0, ·),  σ_δ ~ Unif(0, ·),  σ_ε ~ Unif(0, ·),  λ iid ~ Unif(·, ·)  for all λ ∈ {λ_{η,x}, λ_{η,t}, λ_δ}.

The use of uniform priors on the square root of the variances was inspired by Gelman (2006), and we found that this produces better mixing than a typical inverse-gamma specification.
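As a concrete aid, the assembly of the joint mean and covariance in (2) can be sketched in a few lines. The code below is a minimal illustration of this construction (our own sketch, not the authors' implementation), assuming squared-exponential one-dimensional correlations and hypothetical toy inputs; all names are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def corr(d, lam):
    """One-dimensional squared-exponential correlation R(|d|; lambda)."""
    return np.exp(-(d / lam) ** 2)

def k_eta(A, B, sig2_eta, lam_x, lam_t):
    """Product-form emulator kernel over rows of [x, t] design matrices A, B."""
    K = np.full((A.shape[0], B.shape[0]), sig2_eta)
    for j, lam in enumerate(list(lam_x) + list(lam_t)):
        K *= corr(A[:, j][:, None] - B[None, :, j], lam)
    return K

def k_delta(X1, X2, sig2_del, lam_d):
    """Product-form discrepancy kernel over the variable inputs only."""
    K = np.full((X1.shape[0], X2.shape[0]), sig2_del)
    for j, lam in enumerate(lam_d):
        K *= corr(X1[:, j][:, None] - X2[None, :, j], lam)
    return K

def joint_loglik(d, X, XTstar, theta, mu_eta, mu_del, sig2_eta, sig2_del,
                 sig2_eps, lam_x, lam_t, lam_d):
    """Log of the (N+M)-dimensional MVN density in equation (2)."""
    N = X.shape[0]
    Xtheta = np.hstack([X, np.tile(theta, (N, 1))])   # rows [x_n, theta]
    K11 = (k_eta(Xtheta, Xtheta, sig2_eta, lam_x, lam_t)
           + k_delta(X, X, sig2_del, lam_d) + sig2_eps * np.eye(N))
    K12 = k_eta(Xtheta, XTstar, sig2_eta, lam_x, lam_t)
    K22 = k_eta(XTstar, XTstar, sig2_eta, lam_x, lam_t)
    mean = np.concatenate([np.full(N, mu_eta + mu_del),
                           np.full(XTstar.shape[0], mu_eta)])
    cov = np.block([[K11, K12], [K12.T, K22]])
    cov += 1e-8 * np.eye(len(mean))                   # jitter for stability
    return multivariate_normal(mean, cov, allow_singular=True).logpdf(d)
```

Within an MCMC, this log-likelihood would be evaluated at each proposed (θ, α_{η,δ}) pair; the blocks K11, K12, K22 correspond directly to the partitioned covariance of (2).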
All correlation functions in this work are parameterized such that bounding the correlation lengths λ from above works well to keep the covariance of the likelihood well-conditioned without sacrificing the emulator’s predictive ability. Finally, the priors for the calibration parameters will depend on the particular application. Therefore, under this scheme of prior distributions, μ_η and μ_δ can be updated by Gibbs, and the rest of the parameters by Metropolis steps.

We now introduce a simple “toy” example with which to illustrate our methodology in stages. The data are obtained using (1), where the “simulator” η(·, ·) in this case is actually a realization of a zero-mean GP on the domain (x, t) ∈ [0, 1]² having covariance kernel

    k_η((x_1, t_1), (x_2, t_2)) = σ²_η exp(−(x_1 − x_2)²/λ_{η,x}) · exp(−(t_1 − t_2)²/λ_{η,t})

with parameter values σ²_η = 10, λ_{η,x} = 1.0, and λ_{η,t} = 2.0. The discrepancy function δ(x) is taken to be a fixed low-order polynomial in x, and the experimental error variance σ²_ε is set to a small value; the simulated data comprise M = 114 runs. The experimental data of the physical system are obtained at the same x values as the simulated data (X* = X, N = 18) using the true value of t set to θ = 0.43.

Figure 1: A simple calibration problem with failed “simulator” runs. The full data is given in the left panel (experimental data and simulation successes/failures, with slices at t = 0.00, 0.43, 1.00), and the input space showing the failed runs is presented in the right panel.

A different view of both datasets is presented in Figure 2; the response surface is presented with missing regions corresponding to failed simulation runs. As a visual aid, the experimental data are connected by line segments and plotted at t = θ (of course, the goal of calibration is to infer this location). The green lines indicate the true region of the calibration input space for which all runs are successful.

We set a uniform prior on θ and use a fairly standard MCMC routine to explore the posterior distribution of all parameters (θ, α_{η,δ}). A total of 120,000 iterations (20,000 of those for burn-in) were completed, and every third draw was recorded. Diagnostics for all parameters within the thinned chain were deemed adequate. A plot of the estimated posterior density for θ is given by the black curve in Figure 4.
Here it can be seen that the tails of this distribution give positive probability to regions of θ that give rise to simulator failures for certain values within the x-space. If this calibrated model were to be used in a subsequent uncertainty analysis, the additional support would add superfluous variation to the quantity of interest.

Figure 2: Response surface of completed “simulator” runs for the illustrative example. The experimental data (blue points) are plotted along the slice t = θ = 0.43; the green lines enclose the region of t-space for which all runs are successful.

3 Classification

In addition to the completed computer model runs η, we assume that there is also a set of M₀ input configurations that result in model failures; let these be denoted (x*_m, t*_m), where m = (M + 1), . . . , (M + M₀). Latent variable approaches can be used to model the binary outcome of success or failure at arbitrary inputs (x, t) given the M_tot := M + M₀ observed outcomes.

In this section we describe Bayesian estimation and prediction under one such model: the binary or “clipped” random field. Within this model, a response variable realization is 1 if and only if a continuous latent random field is positive. This differs from a more standard logistic/probit approach in which a transformed random field (or regression function) determines the non-binary probability that a response value will be 1 (Chib and Greenberg, 1998; Neal, 1999; Rasmussen and Williams, 2006) instead of the outcome itself. The binary random field model will lead to sharper inference and predictions when the successes/failures are strongly correlated in space. We will indicate when and how this model can be extended.

We define an indicator random variable Z_m to be 1 if and only if the m-th run is successful, and this is tied to a latent continuous random variable ζ_m through the relationship Z_m = I{ζ_m > 0}. Thus for the given data, z := (z_1, . . . , z_{M_tot})⊤ is realized to be 1 in the first M positions (and 0 for the remaining M₀) as a result of a latent realization ζ := (ζ_1, . . . , ζ_{M_tot})⊤ with positive values only in the first M entries.

In the Bayesian framework, the introduction of the latent ζ is also called data augmentation (Tanner and Wong, 1987) and can readily facilitate inference for parameters of certain models. This is indeed the case for binary data when the latent variables are normally distributed, because the joint posterior of model parameters and latent variables suggests the use of Gibbs sampling (Albert and Chib, 1993; Chib and Greenberg, 1998; De Oliveira, 2000). This reasoning is detailed below.

By construction, the model for the observed binary data given the latent values is a point-mass

    [Z | ζ] = ∏_{m=1}^{M_tot} I{ζ_m > 0}^{z_m} (1 − I{ζ_m > 0})^{1−z_m}
            = ∏_{m=1}^{M} I{ζ_m > 0}^{z_m} · ∏_{m=1}^{M₀} I{ζ_{M+m} ≤ 0}^{1−z_{M+m}}.     (4)

If the latent data are modeled as depending on parameters α_ζ, then the joint posterior for all unobservables is proportional to [Z | ζ, α_ζ] · [ζ, α_ζ] = [Z | ζ] · [ζ | α_ζ] · [α_ζ]. Furthermore, when the latent data model [ζ | α_ζ] is MVN, then (4) implies that the full conditional distribution [ζ | α_ζ, z] is a truncated MVN and can be sampled with a Gibbs algorithm (Geweke, 1991; Robert, 1995). We use a model of this form because it readily admits a posterior predictive process which is also MVN and hence easy to sample from; a tractable predictive distribution is crucial to our methodology, as Section 4 will demonstrate.

We suppose that the latent process defining the space of failures is continuous and well described by a GP; in other words, that there is spatial correlation to the successes and failures.
Let ζ_m := ζ(x*_m, t*_m), and also suppose that ζ has mean μ_ζ and covariance kernel k_ζ with form

    k_ζ((x*_1, t*_1), (x*_2, t*_2); λ_ζ) = ∏_{d=1}^{D_x} R(|x*_{1,d} − x*_{2,d}|; λ_{ζ,x,d}) · ∏_{e=1}^{D_t} R(|t*_{1,e} − t*_{2,e}|; λ_{ζ,t,e}),     (5)

where (λ_{ζ,x}, λ_{ζ,t}) =: λ_ζ are D_x + D_t correlation length parameters. Note here that for identifiability, the variance of the ζ process is set to unity (De Oliveira, 2000). Therefore, in this latent model the parameter vector is α_ζ := (μ_ζ, λ_ζ), and the distribution of ζ | α_ζ is then MVN:

    N_{M_tot}( mean = μ_ζ 1_{M_tot}, cov = k_ζ([X*, T*], [X*, T*]) ),     (6)

where [X*, T*] here denotes the design matrix of the successful runs row-augmented by the experimental and calibration input settings for the runs that fail.

The posterior for all parameters in the classification problem combines the likelihood in (4), the latent prior of (6), and our choice of priors for hyperparameters:

    [ζ, α_ζ | z] ∝ [Z | ζ] · [ζ | α_ζ] · [α_ζ]     (7)
    [α_ζ] = [μ_ζ] · [λ_{ζ,x}, λ_{ζ,t}]
    [μ_ζ] ∝ 1
    λ iid ~ Unif(·, ·)  for all λ ∈ {λ_{ζ,x}, λ_{ζ,t}}.

Within the MCMC, the mean parameter μ_ζ can be updated with a Gibbs step and λ_ζ with a Metropolis step. The elements of the latent vector are updated using the full conditional

    [ζ_i | ζ_{−i}, μ_ζ, α_ζ, z] ∝ N( μ_ζ − q_ii^{−1} Q_{i,−i}(ζ_{−i} − μ_ζ), q_ii^{−1} ) · I{ζ_i ∈ A(z_i)}     (8)
    Q = Σ*^{−1},   q_ii = Q_{i,i},   A(z_i) := (−∞, 0] if z_i = 0;  (0, ∞) if z_i = 1.

We have written the usual conditional distributions in an equivalent form featuring the precision matrix to show that Σ* need be inverted only once (Rue and Held, 2005). All the necessary Gibbs steps can be accomplished through, e.g., the R package tmvtnorm (Wilhelm and Manjunath, 2015). All told, the posterior sampling just described is essentially the routine of De Oliveira (2000), Section 3.1, which is contained in the first two steps of Algorithm 1 below.

The motivation behind the use of the latent GP model was the assumption that the failure mechanism is (almost surely) continuous on the latent space. Evidence against this stipulation comes in the form of at least one very small value amongst the estimated correlation lengths, and more conspicuously, poor leave-one-out cross-validation (LOOCV) classification rates. This may point the modeler to a specification that accounts for noise, such as a probit GP model where the indicator functions of (4) are replaced with Φ(ζ). Subsequent equations would have to be adjusted accordingly. Another option is a probit regression model (Albert and Chib, 1993), where the latent mean is expanded from a constant μ_ζ to a function μ_ζ(x, t; β_ζ) which is linear in β_ζ, and wherein the covariance model is simplified to iid N(0, 1) errors. This would greatly simplify the computations relating to (7), (8), and (9) below, but we avoided probit regression due to the undesirable model selection surrounding the choice of mean function μ_ζ(x, t; β_ζ). In moderate dimensional input spaces, determining a functional basis in (x, t) to adequately describe potential curvature in the latent process can be problematic, especially with relatively sparse data. This was the case even in the low-dimensional illustrative example, whose pattern of failed runs was given in the right panel of Figure 1.

The posterior predictive distribution for the latent process at M̃ new locations given in the rows of the matrix [X̃, T̃] is denoted ζ̃ := ζ(X̃, T̃) | (ζ, α_ζ) and has distribution

    N_{M̃}( mean = μ_ζ 1_{M̃} + Σ̃⊤ Σ*^{−1} (ζ − μ_ζ 1_{M_tot}),  cov = Σ̄ − Σ̃⊤ Σ*^{−1} Σ̃ )     (9)
    Σ* = k_ζ([X*, T*], [X*, T*]),   Σ̃ = k_ζ([X*, T*], [X̃, T̃]),   Σ̄ = k_ζ([X̃, T̃], [X̃, T̃]).

This immediately yields predictions on the binary scale: z̃⊤ = (I{ζ̃_1 > 0}, . . . , I{ζ̃_{M̃} > 0}), and this type of binary prediction can be used for predictive model assessments. For example, as the MCMC sampling progresses, leave-one-out cross validation conditional upon the hyperparameters can be used to assess goodness-of-fit. The i-th prediction is I{ζ̂_i > 0}, where ζ̂_i := ζ([X*, T*]_{i,·}) | (ζ_{−i}, α_ζ) is obtained from the single inversion of the full Σ* used in the updates of the ζ vector. After the MCMC is finished, posterior summaries of the LOOCV classification rate can be calculated for the assessment of latent model adequacy. Making posterior predictions for LOOCV can be seen as repeating the prediction step of Algorithm 1 a total of M_tot times.

In addition to the predictions for cross-validation, realizations along the “slice” t = θ will be of interest, and this motivates two cases.
The posterior predictive latent process can be:

(C1) insensitive to x: k_ζ((x_1, t_1), (x_2, t_2); λ_ζ) ≡ k_ζ(t_1, t_2; λ_{ζ,t})
(C2) sensitive to x: the covariance kernel is as given above in (5).

Case (C1) will be highly desirable because prediction along the slice amounts to a prediction at a single point. This fact then strongly encourages model selection on the (x, t)-space: the latent GP model using the kernel of (C1) can be built first, and the posterior of the LOOCV classification rate can be used to assess this assumption. Prediction at either θ for (C1) or [X̃, θ] for (C2) is also a subcase of the prediction step in Algorithm 1.

Algorithm 1: Metropolis-within-Gibbs algorithm for Bayesian classification and prediction using a latent Gaussian process.

Given α_ζ^{(t−1)} = (μ_ζ^{(t−1)}, λ_ζ^{(t−1)}) and ζ^{(t−1)}:

⟨1⟩ Update hyperparameters of the latent process α_ζ^{(t−1)} → α_ζ^{(t)} using a Gibbs step for μ_ζ and a Metropolis-Hastings update for λ_ζ using (7).
⟨2⟩ Update latent values ζ^{(t−1)} → ζ^{(t)} with a sequence of Gibbs steps according to (8):
  • sample ζ_1^{(t)} ~ ζ_1 | (ζ_2^{(t−1)}, . . . , ζ_{M_tot}^{(t−1)}, μ_ζ^{(t)}, α_ζ^{(t)}, z)
  • sample ζ_2^{(t)} ~ ζ_2 | (ζ_1^{(t)}, ζ_3^{(t−1)}, . . . , ζ_{M_tot}^{(t−1)}, μ_ζ^{(t)}, α_ζ^{(t)}, z)
  ...
  • sample ζ_{M_tot}^{(t)} ~ ζ_{M_tot} | (ζ_1^{(t)}, . . . , ζ_{M_tot−1}^{(t)}, μ_ζ^{(t)}, α_ζ^{(t)}, z)
⟨3⟩ Sample from the posterior predictive distribution (9) at M̃ locations given in the rows of some matrix [X̃, T̃].

Before returning to the illustrative example, we note that the modeler may be interested in better estimating the boundary of the failure-space from sequential batches of simulation runs. Adaptively refining the estimate of the ζ(x, t) = 0 contour under the latent GP model can be accomplished in a principled way using the Expected Improvement procedure of Ranjan et al. (2008, pg. 530).
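The latent-value updates of (8) and the predictive draw of (9) can be sketched numerically. The code below is a minimal single-sweep illustration (our own sketch, not the paper's code), assuming a covariance matrix Σ* has already been built from some kernel; the function names and toy setup are hypothetical.

```python
import numpy as np
from scipy.stats import truncnorm

def gibbs_sweep_latent(zeta, z, mu, Q, rng):
    """One pass of single-site Gibbs for [zeta_i | zeta_{-i}, z], eq. (8),
    using the precision matrix Q = Sigma*^{-1} (inverted only once)."""
    for i in range(len(zeta)):
        qii = Q[i, i]
        # conditional mean/sd from the precision parameterization:
        # mu - qii^{-1} * Q_{i,-i} (zeta_{-i} - mu)
        cond_mu = mu - (Q[i, :] @ (zeta - mu) - qii * (zeta[i] - mu)) / qii
        cond_sd = 1.0 / np.sqrt(qii)
        # truncate to (0, inf) for a success, (-inf, 0] for a failure
        if z[i] == 1:
            a, b = (0.0 - cond_mu) / cond_sd, np.inf
        else:
            a, b = -np.inf, (0.0 - cond_mu) / cond_sd
        zeta[i] = truncnorm.rvs(a, b, loc=cond_mu, scale=cond_sd,
                                random_state=rng)
    return zeta

def predict_latent(zeta, mu, K_inv, K_cross, K_new, rng):
    """Draw from the predictive MVN (9) at new [x, t] locations and clip
    to the binary scale z~ = I{zeta~ > 0}."""
    mean = mu + K_cross.T @ K_inv @ (zeta - mu)
    cov = K_new - K_cross.T @ K_inv @ K_cross
    draw = rng.multivariate_normal(mean, cov + 1e-9 * np.eye(len(mean)))
    return draw > 0
```

A full run of Algorithm 1 would wrap `gibbs_sweep_latent` (step ⟨2⟩) and `predict_latent` (step ⟨3⟩) inside an outer loop that also updates the hyperparameters α_ζ.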
In this example the failure mechanism is spatially correlated in x and t, implying case (C2) and the need for two correlation length parameters λ_{ζ,x} and λ_{ζ,t} within the covariance structure:

    k_ζ((x_1, t_1), (x_2, t_2)) = R(x_1 − x_2; λ_{ζ,x}) · R(t_1 − t_2; λ_{ζ,t})
    R(l; λ) = (1 + √3 |l|/λ) exp(−√3 |l|/λ).

The use of the Matérn kernel with smoothness ν = 1.5 helps prevent Σ* from becoming ill-conditioned.

To fit the latent GP model, the hyperparameter and latent-value update steps of Algorithm 1 were performed within an MCMC of 120,000 iterations. LOOCV predictions were performed every 200 iterations, and these resulted in a median posterior classification rate of 1.00 with a 95% credible set of (0.993, 1.00). Additional predictions were made on a dense grid [X̃, T̃], and the ζ = 0 contours for two such realizations are shown in Figure 3. Each pair of green lines in the figure enclose the region for which the latent process is positive for all values of x.

Figure 3: Contours of ζ = 0 for two realizations (solid/dashed red curves) of the posterior predictive process, and corresponding “admissible” regions whose bounds are defined by the solid/dashed green lines.

After investigating the pattern of simulator failures and fitting/checking the statistical model for the latent failure mechanism, the results must be discussed with the subject matter experts who provided the simulation data in order to make sense of the failed runs. It could be, for example, that an extreme combination of variable inputs x (outside the range of conditions where the experimental data was obtained) is leading to output absent Property P.
Or, in the case of altogether missing output, it may be that the code’s numerical solvers are not converging within some configured time limit which is altogether too small. In these two instances it would not be reasonable to exclude regions of the calibration input space. Along the lines of the second instance, Huang et al. (2020) reasoned that they should allow regions where their ISOTSEAL code failed and that they could trust their emulator to extrapolate. Another sensible example of ignoring computer model failures comes from the spot welding example of Bayarri et al. (2007) (Table 3). Even though a high proportion of the runs failed (17/52), the absolute number is small and the failures exhibit no visible pattern as a function of their two x’s and one t; hence these seem to provide no real insight.

It is important to note that whether or not the failures are meaningful (i.e., deemed “data”), if they depend on t, then two calibration analyses can be conducted regardless: one ignoring the failed runs, and one featuring the methodology of this section to utilize them. The GP emulator within the first scenario will have to interpolate or extrapolate for regions of t-space containing failures, potentially leading to the admission of these regions into the posterior distribution for θ. This might actually be desirable in situations where the experts believe the code failures themselves are a form of discrepancy from reality. But by comparing the posteriors from each analysis, the researchers will not only get some sense of how much the extrapolation changes the answers, but might also gain additional insight into the code.
4 Calibration with informative failures

The basic idea of our approach is to use a carefully chosen weighted distribution (Patil and Rao, 1977; Bayarri and DeGroot, 1987; Patil, 2002) as a prior on the calibration parameters

    [θ | ζ] ∝ w(θ, ζ) · [θ],     (10)

where w(θ, ζ) is a non-negative weight function depending on the binary data z and additional parameters α_ζ via the latent function ζ(x, t), and [θ] is the prior that would otherwise be used in a typical calibration analysis. This explicitly couples calibration to classification. The approach we advocate is to use a selection model (Bayarri and DeGroot, 1987, 1992) on an unknown set that is determined by the model failures:

    w(θ, ζ) := I{θ ∈ S_ζ}.     (11)

In the above construction, the admissible set S_ζ does not (and cannot) depend on x, but this stands to reason: if θ corresponds to a true state of nature, then it should not depend on the specific configuration of experimental settings that were used/observed. This reasoning further suggests sets of the form

    S_ζ = { θ : P(success across x, as determined by ζ) ≥ 1 − p_tol }     (12)

for a given tolerance probability p_tol. A particular value of θ will be said to be admissible if it is an element of S_ζ. When it is non-negative, the quantity p_tol acts as a form of discrepancy. It acknowledges a mismatch between observed and “true” failures in that while there may be some failed runs expected along the slice ζ(x, t = θ), θ can still be considered admissible.

As a concrete example, temporarily assume that the latent classifier function ζ(x, t) is known and deterministic. The set S_ζ in (12) then becomes the collection of all θ such that

    P_θ{ζ(x, t = θ) > 0} = ∫_{X = [0,1]^{D_x}} I{x : ζ(x, θ) > 0} dG(x) ≥ 1 − p_tol     (13)

for choice of p_tol and measure G on the space of variable inputs. When

    dG(x) > 0 on X and p_tol = 0,     (14)

a particular θ is admissible if and only if the minimum of the classifier function along the slice t = θ is positive. The problem of checking admissibility is as such reduced to the problem of function minimization. The reasoning extends to the case when the classifier function is unknown/random, given next.

Let ζ̃(x, t; α_ζ) be a realization of a posterior predictive GP, ζ̃ ~ ζ | (ζ, α_ζ), such as the one given in (9). The conditional prior on the calibration parameters using (10) and (11), while assuming the conditions in (14), is

    [θ | ζ̃] ∝ [θ] · I{ min_x ζ̃(x, t = θ; α_ζ) > 0 }.     (15)

Case (C1) of Section 3.2 reduces the indicator to I{ζ̃(θ; α_ζ) > 0}, i.e., evaluating the criterion at a single point. Case (C2) will necessitate a practical compromise due to the fact that the GP cannot be realized at all points in X along t = θ. The simplest and most natural approximation is I{min ζ̃_θ > 0}, where ζ̃_θ := ζ(x = X̃, t = θ) | (ζ, α_ζ) for some large space-filling design X̃. In either case, uncertainty in the admissible set S_ζ will be derived from the stochastic nature of ζ̃(x, t; ·) as well as from the uncertainty in the hyperparameters α_ζ themselves.

On a final note, we return to the discussion at the end of Section 3.1 and admit the possibility of a probit regression model for the classifier function. Given the regression coefficients β_ζ, a realization at the appropriate slice of the posterior predictive ζ̃(X̃, θ) | (ζ, β_ζ) is MVN with iid noise and hence discontinuous everywhere. So even though the practical check of admissibility (12) on a large design X̃ is straightforward, namely I{ ∑_{m=1}^{M̃} w_m · I{ζ̃_{θ,m} > 0} ≥ 1 − p_tol } (for pre-specified weights w_m summing to unity), care must be taken to formally define the integral in the analogue of (13). Of course, one fix is to define the admissibility criterion as using only the continuous mean of the latent process:

    ∫_X I{x : μ_ζ(x, θ; β_ζ) > 0} dG(x) ≥ 1 − p_tol.     (16)

This can allow for closed-form evaluations of admissibility but does not use the full posterior predictive distribution; it has stochasticity only through the regression coefficients.

The full posterior of all the parameters is

    [θ, α_{η,δ}, ζ̃_θ, ζ, α_ζ | d, z] = [θ, α_{η,δ}, ζ̃_θ | ζ, α_ζ, d, z] · [ζ, α_ζ | d, z]
                                     = [θ, α_{η,δ} | ζ̃_θ, d] · [ζ̃_θ | ζ, α_ζ] · [ζ, α_ζ | z].     (17)

• The first term in (17) is proportional to L(θ, α_{η,δ}; d) · [θ | ζ̃_θ] · [α_{η,δ}], which is the posterior for a traditional calibration model, except that a marginal prior for θ is replaced by a conditional prior [θ | ζ̃_θ] such as that of (15).
• The second term is the posterior predictive distribution of the latent process (9) along the t = θ slice.
• The third term is the posterior for the latent process values and hyperparameters of the classification model (7). Here it is explicit that ζ is assumed independent of η and δ (and, further, of the response values of the successful simulations and experimental data).

This clean factorization allows for a straightforward implementation using the results of previous sections. The pseudocode of the whole procedure is given in Algorithm 2.
Metropolis-within-Gibbs algorithm for Bayesian calibration with simulator failures

⟨·⟩ Update the latent values and hyperparameters as in ⟨·⟩ and ⟨·⟩, resulting in ζ^(t) and α_ζ^(t) = (μ_ζ, λ_ζ)^(t).

⟨·⟩ Update the emulator and discrepancy parameters using (3) and the description within Section 2, resulting in α_{η,δ}^(t) = (μ_η, μ_δ, σ_η, σ_δ, σ_ε, λ_{η,x}, λ_{η,t}, λ_δ)^(t).

⟨·⟩ Update the calibration parameters θ with a Metropolis step:
  • propose θ* from a distribution having density q(θ_new | θ_old = θ^(t−1));
  • check admissibility using the indicator I{min ζ̃_{θ*} > 0} from predictions ⟨·⟩ according to one of the following cases:
      (C1) ζ̃_{θ*} ~ ζ(t = θ*) | (ζ^(t), α_ζ^(t));
      (C2) ζ̃_{θ*} ~ ζ(x = X̃, t = θ*) | (ζ^(t), α_ζ^(t)) for some large space-filling design X̃;
  • conditional upon admissibility, set θ^(t) = θ* with probability min(1, R), where

      R = [ L(θ*; α_{η,δ}^(t), d) · [θ*] · q(θ^(t−1) | θ*) ] / [ L(θ^(t−1); α_{η,δ}^(t), d) · [θ^(t−1)] · q(θ* | θ^(t−1)) ].

It is very important to observe that the classification and calibration models can be investigated independently before the coupled analysis. The latent failure process and its hyperparameters are independent of the output values of successful simulation runs, so the classification model should be analyzed first to get some idea of how useful the observed failures will be. Starting values and proposal distributions for the classification analysis can be used directly within the MCMC of the coupled model. The calibration model without the failed simulator runs can also be fit independently in order to obtain a reasonable initial proposal distribution for the combined analysis. Fitting both models separately will also allow the researcher to determine whether hyperparameters of either can be fixed, if desired.

How useful are the failed runs?
Insight is provided by first considering the case of a fixed, known constraint. For clarity, and without loss of generality, one may further suppose that the emulator and discrepancy hyperparameters within α_{η,δ} are known. Property P is guaranteed when the full domain of θ is restricted to a subset S (temporarily suppressing the "ζ" subscript), implying the constrained posterior is

    [θ | d, P] ∝ L(θ; d) · [θ] · I{θ ∈ S} ∝ [θ | d] · I{θ ∈ S}.    (18)

Furthermore, the normalizing constant of the constrained posterior, π := ∫_S [θ | d] dθ, is obviously P{θ ∈ S} under the full posterior. If this probability is close to 1, then the restriction to S is not actually very restrictive. Given an iid sample θ^(1), …, θ^(N_tot) from [θ | d], (18) implies that {θ^(n) : θ^(n) ∈ S} is a sample from the constrained posterior. An estimate of its normalizing constant is π̂ = N_tot^{−1} Σ_{n=1}^{N_tot} I{θ^(n) ∈ S}.

In the case of a random constraint, uncertainty in the hyperparameters α_ζ and in the predictive process ζ̃ | (ζ, α_ζ) for the latent classifier can be factored into an assessment of how useful the failed runs are. Consider the binary matrix B whose columns correspond to draws of the calibration parameters under the traditional posterior (disregarding the failures) and whose rows correspond to draws of the latent posterior [ζ̃, ζ, α_ζ | z]; the (i, j) entry of B is 1 if θ^(j) is predicted to be a success for realization i of the latent parameters and 0 otherwise. The row means of this matrix provide a distribution of π estimates as the latent quantities vary.
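The fixed-constraint estimate π̂ and the row means of B can be sketched in a few lines (all numbers are synthetic, and a random interval stands in for the latent classifier's admissible set):

```python
import numpy as np

rng = np.random.default_rng(2)

# Draws from the traditional (failure-ignoring) posterior [theta | d].
theta_draws = rng.normal(0.5, 0.15, size=1000)

# Fixed, known constraint S = (0.3, 0.8): pi-hat as in (18).
in_S = (theta_draws > 0.3) & (theta_draws < 0.8)
pi_hat = in_S.mean()   # estimate of P{theta in S} under the full posterior

# Random constraint: each latent-posterior draw i yields its own admissible
# interval (hypothetical: endpoints jitter around the fixed ones).
n_latent = 500
lo = rng.normal(0.3, 0.02, n_latent)
hi = rng.normal(0.8, 0.02, n_latent)
B = (theta_draws[None, :] > lo[:, None]) & (theta_draws[None, :] < hi[:, None])

row_means = B.mean(axis=1)   # distribution of pi estimates over latent draws
print(pi_hat, row_means.min(), row_means.max())
```

If the row means cluster tightly near 1, restricting to the admissible set changes little, which is exactly the diagnostic described above.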
If there is relatively little uncertainty in this distribution and it is very close to 1, the researcher need not conduct the full coupled sampling described in Algorithm 2: an approximate sample from the constrained posterior is obtained by simply omitting values from the full sample that are not in the admissible set S_ζ̂ derived from E{ζ̃ | ζ, α̂_ζ} and α̂_ζ = E{α_ζ}.

The column means of the binary prediction matrix B are arguably more informative. The mean of B_{·,j} is the estimated posterior probability π̂_j of θ^(j) being admissible (marginalizing over the latent quantities). This collection of values gives a distribution on the pointwise probability of admissibility. The proportion of {π̂_j}_{j=1}^{N_tot} which are approximately zero indicates the fraction of θ^(1), …, θ^(N_tot) that are almost always classified as failures, i.e., values for which the failed runs would be considered informative. On the other hand, the proportion which are approximately unity indicates the fraction of full-posterior values that would be admissible with or without the use of the failed simulation runs. For analysts not willing to incorporate the admissibility criterion within their calibration routine, the collection {π̂_j}_{j=1}^{N_tot} can be used as resampling weights for the traditional posterior θ draws. If estimation is not the sole purpose of calibration, then the full coupling of Algorithm 2 should be used, as the updates for the discrepancy parameters (vital for prediction) will be affected by the restricted θ updates.

Here we demonstrate the methodology with the illustrative example as well as the problem that originally motivated our calibration technique. All proposal distributions for Metropolis steps were taken to be MVN after parameter transformations. For example, the univariate marginal priors on the calibration parameters were first rescaled to [0, 1] and then to (−∞, ∞) via a probit transformation Φ⁻¹(·).
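The rescale-then-probit transformation can be sketched as below; the adaptive-covariance helper follows the commonly cited Haario et al. (2001) recipe with scaling 2.4²/d plus a small ridge, which is an assumption about the exact variant used:

```python
import numpy as np
from scipy.stats import norm

def to_unconstrained(theta, a, b):
    """Rescale [a, b] -> [0, 1], then probit-map to the real line."""
    return norm.ppf((theta - a) / (b - a))

def to_constrained(u, a, b):
    """Inverse map: real line -> [a, b]."""
    return a + (b - a) * norm.cdf(u)

def adaptive_cov(history, eps=1e-6):
    """Haario-style proposal covariance: (2.4^2/d) * empirical covariance
    of the transformed chain history, plus a small ridge for stability."""
    d = history.shape[1]
    s_d = 2.4 ** 2 / d
    return s_d * np.cov(history, rowvar=False) + eps * np.eye(d)

rng = np.random.default_rng(4)
hist = rng.normal(size=(500, 3))   # toy transformed chain history
C = adaptive_cov(hist)
u = to_unconstrained(0.25, 0.0, 1.0)
print(round(to_constrained(u, 0.0, 1.0), 6))   # round-trip recovers 0.25
```

Proposals are made as MVN steps in the unconstrained u-coordinates and mapped back through `to_constrained` before evaluating the model.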
All such MVN proposals had their covariances tuned adaptively according to Haario et al. (1999, 2001).

Failures were observed at t < 0.286 and t > 0.714 (that is, for at least one value of x), suggesting that the true set of admissible θ is contained in the interval between these two values. As we shall see, the calibration analysis incorporating failed runs will indeed produce a posterior distribution for θ having support ≈ (0.286, 0.714). Admissibility of each proposed θ* was checked using a draw of the latent posterior predictive process on the slice, with X̃ containing 50 equally spaced points. Had the entire latent process been realized, the ζ = 0 contours would appear akin to the red curves in Figure 3. Each set of red curves would define the boundaries (green lines) of the admissible set, and a proposed θ* would be admissible if it fell within these boundaries. There is a small probability that a realization of the latent process is positive within the naive bounds (0.286, 0.714) for every value of x, as desired. However, the distribution incorporating failed runs is not the full posterior restricted to the naive bounds, because there is uncertainty associated with these bounds. In fact, the probability of admissibility (under the posterior for the latent parameters) ranges from about 0.22 to 0.61; by contrast, the point estimate of this value using the naive bounds is nearly 0.7.

Figure 4: Posterior distributions for calibration parameters with and without using the failed simulations.

Regarding pointwise admissibility, 37% of the samples from the full posterior were always classified as failures, whereas 20% were always predicted to be successes. For this toy problem the true calibration parameter and the traditional posterior mean/mode/median fell squarely within the admissible region, but this need not be the case. In fact, were the inclusion of failures to rule out a region having high probability under the usual calibration case, this might challenge the researchers' understanding of the numerical model or physical system.
It would also highlight the fact that the most probable configurations of parameters were inferred by extrapolation of the response surface into regions which had no successful runs. As such, this would be an interesting and important scenario.
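A minimal sketch of the admissibility-gated Metropolis step of Algorithm 2 for a one-dimensional toy problem: a fixed interval gate at the naive bounds (0.286, 0.714) stands in for a draw of the latent classifier, and the Gaussian log-posterior is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def log_post(theta):
    """Stand-in for log L(theta; alpha, d) + log prior: a Gaussian bump."""
    return -0.5 * ((theta - 0.5) / 0.1) ** 2

def admissible(theta):
    """Stand-in for the latent-GP check I{min zeta_tilde > 0}."""
    return 0.286 < theta < 0.714

def metropolis_step(theta_old, step=0.05):
    theta_prop = theta_old + step * rng.normal()   # symmetric proposal, q cancels
    if not admissible(theta_prop):                 # failed runs veto the proposal
        return theta_old
    log_r = log_post(theta_prop) - log_post(theta_old)
    return theta_prop if np.log(rng.uniform()) < log_r else theta_old

theta = 0.5
chain = []
for _ in range(2000):
    theta = metropolis_step(theta)
    chain.append(theta)
print(min(chain) > 0.286, max(chain) < 0.714)   # prints: True True
```

Because inadmissible proposals are rejected outright, the chain's support is confined to the admissible interval, mirroring the restricted posterior in Figure 4.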
We conclude this paper by applying the methodology to the example that motivated it.

The Carbon Capture Simulation Initiative (CCSI, a partnership between Department of Energy laboratories, academic institutions, and industry) has been developing state-of-the-art computational tools to assist in the development of technologies that capture CO2 from coal-burning power plants. One technology investigated was an adsorber system with a bubbling fluidized bed of 32D1 sorbent (Lane et al., 2014; Spenik et al., 2015).

Table 1: Summary of inputs, outputs, and CFD calibration parameters. Here, trNorm(μ, σ²; [a, b]) is a N(μ, σ²) distribution truncated to the interval [a, b], and ssBeta(α, β; [a, b]) is a Beta(α, β) distribution shifted and scaled to the interval [a, b]. (A "·" marks a numeric value not recoverable here.)

Output
  y: knot 0.975 of the functional breakthrough curve

Experimental Conditions (Variable Inputs)                 Range/Prior
  x1: gas inflow rate (std. L/min)                        [15.0, 30.0]
  x2: partial pressure of CO2 (%)                         [10.0, 20.0]
  x3: coil temperature (°C)                               [39.0, 81.5]
  x4: gas inflow temperature (°C)                         [23.6, 38.4]

CFD Model Parameters (Calibration Inputs)
  dry adsorption:
    t1: formation enthalpy ΔH_c (J/mol)                   trNorm(·, ·; [·, ·])
    t2: formation entropy ΔS_c (J/mol·K)                  trNorm(·, ·; [·, ·])
    t3: activation energy ΔH‡_c (J/mol)                   Unif(·, ·)
    t4: log pre-exponential factor log C_c (unitless)     Unif(0, ·)
  water physisorption:
    t5: formation enthalpy ΔH_a                           Unif(·, ·)
    t6: formation entropy ΔS_a                            Unif(·, ·)
    t7: activation energy ΔH‡_a                           Unif(·, ·)
    t8: log pre-exponential factor log C_a                Unif(0, ·)
  wet reaction:
    t9: formation enthalpy ΔH_b                           Unif(·, ·)
    t10: formation entropy ΔS_b                           Unif(·, ·)
    t11: activation energy ΔH‡_b                          Unif(·, ·)
    t12: log pre-exponential factor log C_b               Unif(0, ·)
  t13: particle size (μm)                                 ssBeta(4.5, 3.3; [108, 125])
  t14: effective amine proportion when fresh (unitless)   trNorm(·, ·; [·, ·])

In this system, a mixture of gases that includes
CO2 flows up through the bed of solid sorbent particles, and together they form a fluid-like state. The increased mixing between gases and solids better facilitates the adsorption of CO2 from the gas stream; that is, it increases the adhesion of the CO2 molecules to the surface of the sorbent particles.

CCSI's investigation of this carbon capture apparatus was done sequentially using a validation hierarchy (Lai et al., 2016). Experiments of increasing complexity were formulated to isolate subsystems of the entire multiphase flow process; computer models of each idealized system were also formulated such that each could be calibrated to its corresponding physical experiment. In particular, we use data from just one of these validation exercises: the case of hot-reacting flow [Sec. 6.3 of Lai et al. (2016)].

The experimental data were obtained by varying four quantities: gas inflow rate, partial pressure of CO2, coil temperature, and gas inflow temperature. A Latin hypercube design of size 52, with 19 points replicated (for a total of N = 71 points), was used to cover the variable input space; the bounds of the 4-dimensional cube can be found in Table 1. Each experiment produced a breakthrough curve describing the CO2 adsorption over time, and this monotonic functional output could be captured very accurately by five points: the y-intercept and the times until 2.5, 25, 50, and 75 percent reductions from the intercept (initial adsorption), via a monotone cubic log-spline. The second of these five outputs (knot 0.975) contained a large percentage of the total variation, so we restrict our analysis to this single quantity.

The numerical representation of the experimental system was a computational fluid dynamics (CFD) model implemented within the MFIX (Multiphase Flow with Interphase eXchanges) software (Syamlal et al., 1993).
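The prior families defined in Table 1 are straightforward to sample with SciPy; in this sketch the ssBeta particle-size entry is taken from the table, while the trNorm hyperparameters shown are arbitrary placeholders rather than Table 1's values:

```python
import numpy as np
from scipy.stats import truncnorm, beta

def r_trnorm(mu, sigma, a, b, size, rng):
    """trNorm(mu, sigma^2; [a, b]): a normal truncated to [a, b].
    SciPy's truncnorm takes standardized bounds (a - mu)/sigma, (b - mu)/sigma."""
    lo, hi = (a - mu) / sigma, (b - mu) / sigma
    return truncnorm.rvs(lo, hi, loc=mu, scale=sigma, size=size, random_state=rng)

def r_ssbeta(al, be, a, b, size, rng):
    """ssBeta(alpha, beta; [a, b]): a Beta(alpha, beta) shifted/scaled to [a, b]."""
    return a + (b - a) * beta.rvs(al, be, size=size, random_state=rng)

rng = np.random.default_rng(5)
# t13, particle size: ssBeta(4.5, 3.3; [108, 125]) from Table 1.
t13 = r_ssbeta(4.5, 3.3, 108.0, 125.0, 10000, rng)
# A trNorm prior with placeholder hyperparameters (not Table 1's numbers).
t14 = r_trnorm(0.8, 0.1, 0.5, 1.0, 10000, rng)
print(t13.min() >= 108.0 and t13.max() <= 125.0,
      t14.min() >= 0.5 and t14.max() <= 1.0)   # prints: True True
```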
MFIX solves the governing equations for conservation of mass, momentum, energy, and species subject to the boundary and initial conditions in multiphase flow.

The physics of 32D1 reactive flow involve multiple mechanisms, including the hydrodynamics of multiphase flow, the transfer of heat, and the reactions between chemical species within the mixed system of the fluidized bed. Three main chemical reactions occur when 32D1 adsorbs carbon dioxide: the reaction of CO2 with the impregnated amine to form carbamate (dry adsorption, 19), the physical adsorption of H2O to the sorbent (water "physisorption", 20), and the reaction of CO2, physisorbed H2O, and amine to form bicarbonate (wet reaction, 21). The site fractions of carbamate anion, adsorbed water, and bicarbonate are denoted c, a, and b, respectively.

    ∂c/∂t = C_c T exp(−ΔH‡_c / (RT)) · f_c(c, b; ΔS_c, ΔH_c)      (19)
    ∂a/∂t = C_a T exp(−ΔH‡_a / (RT)) · f_a(a; ΔS_a, ΔH_a)         (20)
    ∂b/∂t = C_b T exp(−ΔH‡_b / (RT)) · f_b(c, a, b; ΔS_b, ΔH_b)   (21)

Above, T is the temperature of the reacting system and R is the gas constant. Each ordinary differential equation within the coupled system depends on four parameters, ΔH, ΔS, ΔH‡, and C, where the subscripts "c", "a", and "b" indicate the appropriate reaction (19, 20, 21). The f_c, f_a, and f_b functions within the rate equations have relatively simple forms but are excluded because of the additional definitions required. Two additional parameters were included for calibration, one related to the effective particle size of the sorbent and one pertaining to amine degradation. The expanded formulation of the physics and a more detailed account of the parameters can be found in Lai et al. (2016).

A total of M_tot = 471 simulations were run, but 136 of these failed to converge (implying M = 335 successes). With this large percentage of failed runs, the researchers wanted to utilize this information, if possible.
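To see how the Arrhenius-type prefactor in (19)–(21) drives the site fractions, the coupled system can be integrated with SciPy's ODE solver; the f_c, f_a, f_b forms and all parameter values below are invented stand-ins, since the paper omits the exact f's:

```python
import numpy as np
from scipy.integrate import solve_ivp

R = 8.314  # gas constant, J/(mol K)

def rate_const(C, dH_act, T):
    """Arrhenius-type prefactor C * T * exp(-dH_act / (R T)) from (19)-(21)."""
    return C * T * np.exp(-dH_act / (R * T))

def rhs(t, y, T, params):
    """Toy stand-in for the coupled system (19)-(21); simple saturating forms
    replace the paper's (omitted) f_c, f_a, f_b."""
    c, a, b = y
    k_c = rate_const(params["C_c"], params["dH_c"], T)
    k_a = rate_const(params["C_a"], params["dH_a"], T)
    k_b = rate_const(params["C_b"], params["dH_b"], T)
    return [k_c * (1.0 - c - b),        # carbamate formation
            k_a * (1.0 - a),            # water physisorption
            k_b * c * a * (1.0 - b)]    # bicarbonate (wet) reaction

# Hypothetical parameter values, chosen only for plausible magnitudes.
params = {"C_c": 1e2, "dH_c": 6e4, "C_a": 1e2, "dH_a": 5e4, "C_b": 1e2, "dH_b": 6e4}
sol = solve_ivp(rhs, (0.0, 100.0), [0.0, 0.0, 0.0], args=(400.0, params))
print(sol.success)
```

Raising C or lowering ΔH‡ speeds the reaction sharply, which is consistent with the observation below that fast dry adsorption (low ΔH‡_c, high log C_c) destabilizes the solver.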
This was largely due to necessity, as subsequent uncertainty analyses would also be based around MFIX, a code whose numerical solvers are not easily modified. In general, just because a code fails to produce output does not imply that there is no numerical regime
which could produce reasonable answers at the given input settings. This is why the simulation runs must be discussed with the subject-matter experts; further, it should be decided whether or not calibration should be allowed to include regions where the emulator is forced to extrapolate.

Figure 5: The successes/failures (black dots / red ×'s) versus three calibration inputs (ΔH‡_c, log C_c, ΔH‡_a) of the MFIX CFD code.

The chemists had indicated that extreme reaction rates would be numerically problematic (and likely to produce unrealistic output even if convergent), but they could not provide an explicit relationship between calibration parameter configurations and success/failure. Though the bounds within the prior distributions (Table 1) were a good-faith effort to encapsulate previous knowledge of the physics and the failure-space boundaries, they could not completely preclude troublesome combinations. Indeed, a number of projection-based exploratory plots of the failures were studied, but clear patterns did not emerge, some evidence that three or more of the parameters were jointly causing problems. An exception is found in Figure 5, where it can be seen that model failures frequently result from low ΔH‡_c and high log C_c (i.e., when the dry adsorption occurs too quickly, in a relative sense). As for the low values of ΔH‡_a producing instabilities, this was not known a priori, but it was concluded that the lower bound for the water physisorption activation energy was likely too pessimistic.

The latent Bayesian model under case (C1) (no dependence of the failures upon the experimental conditions) with a squared-exponential latent correlation was run for 200,000 iterations, and after a short burn-in the chains showed no evidence of non-convergence. Every 200 iterations LOOCV was performed, resulting in a classification rate between 80.0% and 98.3% with a median of 91.3%.
Given the dimension of the calibration input space, these rates were deemed adequate to proceed with the coupled calibration under (C1).

Calibrations were conducted without and with the failed simulations. (It should be noted that the results of the analysis without the failed runs differ from those of Lai et al. (2016) because different models and data were used: the current work used only one of the five outputs and a GP model instead of a Bayesian Smoothing Spline ANOVA model.) The priors for the 14 calibration parameters are given in Table 1. They are simple univariate uniform, truncated normal, or shifted/scaled beta distributions obtained from a previous analysis. For each case, three chains of 200,000 iterations were used, recording every third sample after a burn-in of 50,000.

Binary predictions of success/failure were obtained using a subset of draws from the traditional calibration posterior (ignoring the failed runs) and classifiers based upon the posterior for the latent quantities. The estimated distribution of pointwise admissibility probabilities is shown in Figure 7. In particular, while 29.1% of the posterior samples were almost always classified as successes, 28.3% of the calibration parameter samples were almost always ruled out as inadmissible. This suggested that the failures were indeed informative and that the coupled calibration could be performed in order to properly weight θ-space according to uncertainty in the latent classifier.

The bivariate marginal posterior distributions of the calibration parameters were compared, and it was seen that 11 of the 14 parameters had density estimates which appeared qualitatively similar. The three parameters that differed most were ΔH‡_c, log C_c, and ΔH‡_a; the posteriors for each case are given in Figure 6. As expected, the posterior utilizing the failed runs excluded regions with 1) low ΔH‡_c and high log C_c, and 2) low ΔH‡_a.
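The "almost always a failure/success" fractions and the resampling-weight shortcut can be sketched as follows (synthetic posterior draws; a hypothetical random-interval classifier gives the pointwise admissibility probabilities π̂_j in closed form):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)

# Traditional-posterior draws of one calibration parameter (synthetic).
theta = rng.normal(0.5, 0.15, size=1000)

# Pointwise admissibility probabilities pi_j for a hypothetical random-interval
# classifier with endpoints lo ~ N(0.3, 0.02^2) and hi ~ N(0.8, 0.02^2):
#   pi_j = P(lo < theta_j) * P(theta_j < hi).
pi_j = norm.cdf(theta, loc=0.3, scale=0.02) * norm.sf(theta, loc=0.8, scale=0.02)

frac_fail = np.mean(pi_j < 0.01)   # almost always classified as failures
frac_pass = np.mean(pi_j > 0.99)   # almost always classified as successes

# Resampling-weight shortcut for analysts who will not modify their sampler:
w = pi_j / pi_j.sum()
resampled = rng.choice(theta, size=1000, replace=True, p=w)
print(frac_fail, frac_pass)
```

The two printed fractions correspond to the lower and upper tails summarized in Figure 7, and `resampled` approximates draws from the restricted posterior without re-running MCMC.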
The fact that the latent LOOCV yielded reasonably high rates while many of the bivariate marginals of the two calibration analyses looked alike suggests that there are indeed many-way interactions in the latent failure surface. While we were not able to give an explicit closed-form description of which parameter combinations lead to failures, our method was able to learn such conditions implicitly and incorporate them into a restricted calibration posterior.

References
Albert, J. H. and Chib, S. (1993), "Bayesian analysis of binary and polychotomous response data," Journal of the American Statistical Association, 88, 669–679.
Barrett, J. W., Mandel, I., Neijssel, C. J., Stevenson, S., and Vigna-Gómez, A. (2016), "Exploring the parameter space of compact binary population synthesis,"
Proceedings of the International Astronomical Union, 12, 46–50.

Figure 6: Posterior distributions for three calibration parameters (ΔH‡_c, log C_c, ΔH‡_a) without and with using failed runs (left/right). The numbers in the panels above the diagonal are estimated pairwise correlations.

Figure 7: Estimates of the pdf and cdf for the pointwise probability of admissibility of the full-posterior calibration parameter samples under the uncertain classifier. The dashed line segments indicate the lower and upper 10%, i.e., the proportions of the full posterior samples which are almost always classified as failures and successes, respectively.
Bayarri, M. J., Berger, J. O., Paulo, R., Sacks, J., Cafeo, J. A., Cavendish, J., Lin, C.-H., and Tu, J. (2007), "A framework for validation of computer models," Technometrics, 49, 138–154.
Bayarri, M. J. and DeGroot, M. (1992), "A 'BAD' view of weighted distributions and selection models," in Bayesian Statistics 4, eds. Bernardo, J. M., Berger, J. O., Dawid, A. P., and Smith, A. F. M., Oxford University Press, pp. 17–33.
Bayarri, M. J. and DeGroot, M. H. (1987), "Bayesian analysis of selection models," The Statistician, 36, 137–146.
Box, G. E. P. and Coutie, G. A. (1956), "Application of digital computers in the exploration of functional relationships," Proceedings of the IEE - Part B: Radio and Electronic Engineering, 103, 100–107.
Chib, S. and Greenberg, E. (1998), "Analysis of multivariate probit models," Biometrika, 85, 347–361.
Cox, D. D., Park, J.-S., and Singer, C. E. (2001), "A statistical method for tuning a computer code to a data base," Computational Statistics & Data Analysis, 37, 77–92.
Craig, P. S., Goldstein, M., Seheult, A. H., and Smith, J. A. (1997), "Pressure matching for hydrocarbon reservoirs: A case study in the use of Bayes linear strategies for large computer experiments (with discussion)," in Case Studies in Bayesian Statistics, eds. Gatsonis, C., Hodges, J. S., Kass, R. E., McCulloch, R., Rossi, P., and Singpurwalla, N. D., Springer-Verlag, vol. III, pp. 36–93.
De Oliveira, V. (2000), "Bayesian prediction of clipped Gaussian random fields," Computational Statistics and Data Analysis, 34, 299–314.
Fang, K.-T., Li, R., and Sudjianto, A. (2006), Design and Modeling for Computer Experiments, Boca Raton, FL: Chapman & Hall/CRC.
Fell, M., Barber, J., Lichstein, J. W., and Ogle, K. (2018), "Multidimensional trait space informed by a mechanistic model of tree growth and carbon allocation," Ecosphere, 9, 1–25.
Gelman, A. (2006), "Prior distributions for variance parameters in hierarchical models (Comment on article by Browne and Draper)," Bayesian Analysis, 1, 515–534.
Gemoets, D., Barber, J., and Ogle, K. (2013), "Reversible jump MCMC for inference in a deterministic individual-based model of tree growth for studying forest dynamics," Environmetrics, 24, 433–448.
Geweke, J. (1991), "Efficient simulation from the multivariate normal and Student-t distributions subject to linear constraints," in Computer Science and Statistics: Proceedings of the 23rd Symposium on the Interface, Seattle, WA; Apr 21–24, 1991, pp. 571–578.
Haario, H., Saksman, E., and Tamminen, J. (1999), "Adaptive proposal distribution for random walk Metropolis algorithm," Computational Statistics, 14, 375–395.
— (2001), "An adaptive Metropolis algorithm," Bernoulli, 7, 223–242.
Higdon, D., Gattiker, J., Williams, B., and Rightley, M. (2008), "Computer model calibration using high-dimensional output," Journal of the American Statistical Association, 103, 570–583.
Huang, J., Gramacy, R. B., Binois, M., and Libraschi, M. (2020), "On-site surrogates for large-scale calibration," Applied Stochastic Models in Business and Industry, 36, 283–304.
Kennedy, M. C. and O'Hagan, A. (2001), "Bayesian calibration of computer models," Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63, 425–464.
Lai, C., Xu, Z., Pan, W., Sun, X., Storlie, C., Marcy, P. W., Dietiker, J.-F., Li, T., and Spenik, J. (2016), "Hierarchical calibration and validation of computational fluid dynamics models for solid sorbent-based carbon capture," Powder Technology, 288, 388–406.
Lane, W., Storlie, C., Montgomery, C., and Ryan, E. (2014), "Numerical modeling and uncertainty quantification of a bubbling fluidized bed with immersed horizontal tubes," Powder Technology, 253, 733–743.
Levy, S. and Steinberg, D. M. (2010), "Computer experiments: A review," Advances in Statistical Analysis, 94, 311–324.
Neal, R. M. (1999), "Regression and classification using Gaussian process priors (with discussion)," in Bayesian Statistics 6, eds. Bernardo, J. M., Berger, J. O., Dawid, A. P., and Smith, A. F. M., Oxford University Press, pp. 475–501.
Oakley, J. E. and O'Hagan, A. (2002), "Bayesian inference for the uncertainty distribution of computer model outputs," Biometrika, 89, 769–784.
O'Hagan, A. and Kingman, J. F. C. (1978), "Curve fitting and optimal design for prediction," Journal of the Royal Statistical Society: Series B (Statistical Methodology), 40, 1–24.
Park, J.-S. (1991), "Tuning Complex Computer Codes to Data and Optimal Designs," Ph.D. thesis, University of Illinois, Urbana-Champaign (unpublished).
Patil, G. P. (2002), "Weighted distributions," in Encyclopedia of Environmetrics, Vol. 4, eds. El-Shaarawi, A. H. and Piegorsch, W. W., John Wiley & Sons, Chichester, pp. 2369–2377.
Patil, G. P. and Rao, C. R. (1977), "The weighted distributions: A survey of their applications," in Applications of Statistics, ed. Krishnaiah, P., North-Holland, Amsterdam, pp. 383–405.
Ranjan, P., Bingham, D., and Michailidis, G. (2008), "Sequential experimental design for contour estimation," Technometrics, 50, 527–541.
Rasmussen, C. E. and Williams, C. K. I. (2006), Gaussian Processes for Machine Learning, Cambridge, MA: MIT Press.
Robert, C. P. (1995), "Simulation of truncated normal variables," Statistics and Computing, 5, 121–125.
Rue, H. and Held, L. (2005), Gaussian Markov Random Fields: Theory and Applications, Boca Raton, FL: Chapman & Hall/CRC.
Sacks, J., Welch, W. J., Mitchell, T. J., and Wynn, H. P. (1989), "Design and analysis of computer experiments," Statistical Science, 4, 409–435.
Santner, T. J., Williams, B. J., and Notz, W. I. (2018), The Design and Analysis of Computer Experiments, New York: Springer, 2nd ed.
Seber, G. A. F. and Wild, C. J. (1989), Nonlinear Regression, Hoboken, NJ: John Wiley & Sons.
Spenik, J. L., Shadle, L. J., Breault, R. W., Hoffman, J. S., and Gray, M. L. (2015), "Cyclic tests in batch mode of CO2 adsorption and regeneration with sorbent consisting of immobilized amine on a mesoporous silica," Industrial & Engineering Chemistry Research, 54, 5388–5397.
Syamlal, M., Rogers, W., and O'Brien, T. J. (1993), "MFIX documentation: theory guide," Tech. Rep. DOE/METC-94/1004 (DE94000087), U.S. Department of Energy, Office of Fossil Energy, Morgantown Energy Technology Center, Morgantown, West Virginia.
Tanner, M. A. and Wong, W. H. (1987), "The calculation of posterior distributions by data augmentation," Journal of the American Statistical Association, 82, 528–540.
Wilhelm, S. and Manjunath, B. G. (2015), tmvtnorm: Truncated multivariate normal and Student-t distributions, R package version 1.4-10.