A Fast, Scalable, and Calibrated Computer Model Emulator: An Empirical Bayes Approach
Vojtech Kejzlar, Mookyong Son, Shrijita Bhattacharya, Tapabrata Maiti
Department of Mathematics and Statistics, Skidmore College
Department of Statistics and Probability, Michigan State University
Abstract
Mathematical models implemented on a computer have become the driving force behind the acceleration of the cycle of scientific processes. This is because computer models are typically much faster and economical to run than physical experiments. In this work, we develop an empirical Bayes approach to predictions of physical quantities using a computer model, where we assume that the computer model under consideration needs to be calibrated and is computationally expensive. We propose a Gaussian process emulator and a Gaussian process model for the systematic discrepancy between the computer model and the underlying physical process. This allows for closed-form and easy-to-compute predictions given by a conditional distribution induced by the Gaussian processes. We provide a rigorous theoretical justification of the proposed approach by establishing posterior consistency of the estimated physical process. The computational efficiency of the methods is demonstrated in an extensive simulation study and a real data example. The newly established approach makes enhanced use of computer models both from practical and theoretical standpoints.
Keywords:
Gaussian process, posterior consistency, computer experiments, nonparametric regression, nuclear binding energies
1. Introduction
With the advancements of computer architectures in the 21st century, mathematical models implemented on a computer (computer models) have heavily contributed to the rapid speed-up of the cycle of scientific processes. This is because computer models are generally much faster and more economical to run than physical experiments. For instance, experiments conducted in high-energy particle colliders require budgets in billions of dollars and multinational collaborations. Additionally, many experiments related to natural events such as extreme weather phenomena, including tropical cyclones or tornadoes, are practically impossible to conduct. Computer models, despite being an invaluable component of the process of scientific discovery, are imperfect representations of physical systems, with each model evaluation often taking many hours. In this paper, we present an empirical Bayes approach for fast and statistically principled predictions of physical quantities using imperfect computer models that need to be calibrated with experimental observations. We particularly aim at those scenarios where the computer models under consideration are complex and computationally too expensive to be used directly for predictions with quantified uncertainties. Our approach builds on the framework for computer model aided inference developed by [19] that establishes the connection between experimental observations, the computer model, and the systematic discrepancy between the model and the physical process. The systematic discrepancy is modeled nonparametrically using a Gaussian process (GP), and the computer model is replaced by an emulator also based on a GP. This framework has reached high popularity over the past two decades with many applications in nuclear physics [16, 20], climatology [40, 35], and engineering [49, 34, 52]. There have also been various extensions of the original framework from both methodological and computational perspectives. For example, [15] consider computer models with high-dimensional output. [32] and [14] study specific GP modeling choices to improve the predictive accuracy of the framework. [17] develop a variational inference based approach for approximation of posterior densities. [44], [33], and lately [50] show theoretical properties of the framework under some modifications.

Despite these efforts, some of the practical challenges for computer enabled predictions with GPs remain. First, implementation of the framework of [19] is never straightforward and typically requires considerable effort and experience, especially under some of the extensions listed in the previous paragraph. Second, a fully Bayesian approach quickly becomes computationally demanding with increasing sample size, model complexity, and number of parameters. Third, in the absence of correct prior distributions, the full Bayesian models could be sensitive to the choice of hyperparameter values. To avoid these complications, we consider an empirical Bayes approach, which can be viewed as an approximation to the fully Bayesian treatment. This approximation principle is well established for standard statistical models. We validate this in the context of calibrated computer models.
Following are the specific contributions of this work:

a) Our methodology utilizes the statistical properties of GPs to establish easy-to-implement, closed-form, and fast-to-compute predictions of physical quantities using computationally expensive computer models that are calibrated with experimental observations. This includes a proposal of two estimators for plug-in model parameters, with negligible loss of uncertainty on predictions, that can be readily obtained using standard numerical solvers.

b) We offer a fresh perspective on the framework of [19] and provide its equivalent representation as a hierarchical model. As a consequence, we derive new theoretical properties of this framework and show that our proposed methodology estimates the values of the underlying physical process consistently. Our theoretical analysis is based on an original extension of Schwartz's theorem for nonparametric regression problems with GP priors and an unknown but consistently estimated variance.

c) We provide an extensive simulation study and demonstrate the computational efficiency of the proposed methodology compared with the Metropolis-Hastings algorithm (fully Bayesian implementation). We also conduct a sensitivity study of the fully Bayesian solution to prior selection and show that our methodology is preferred in the absence of proper and meaningful prior distributions. Additionally, we illustrate the opportunities provided by our method on an analysis of experimental nuclear binding energies. A fully documented Python code with our algorithm and examples is available at https://github.com/kejzlarv/EB_Calibration.

In Section 2, we review the general framework for Bayesian inference with computer models. Section 3 defines two plug-in estimators for GP model parameters and a consistent estimator of a noise variance component. Then, in Section 4, we discuss the theoretical properties of our approach and establish its statistical consistency. Section 5 contains a simulation study that validates the methodology in this paper empirically. A real data application is also included in Section 5.
2. Bayesian model for inference with computer models
Let us consider observations $y = (y_1, \dots, y_n)$ of a physical process $\zeta(t)$ depending on a known set of inputs $t_i$, $i = 1, \dots, n$, taking values in a compact and convex set $\Omega \subset \mathbb{R}^p$, $p \geq 1$, following the relationship
$$y_i = \zeta(t_i) + \sigma\epsilon_i, \quad i = 1, \dots, n, \qquad (1)$$
where $\sigma$ represents the scale of the observational error, typically $\epsilon_i \overset{i.i.d.}{\sim} N(0, 1)$. Our goal is the prediction of values $y^* = (y^*_1, \dots, y^*_J)$ of the physical process $\zeta$ at new, yet to be observed, inputs $(t^*_1, \dots, t^*_J)$ using $y$ and a computer model $f_m$ defined as a mapping $(t, \theta) \mapsto f_m(t, \theta)$. As we can see, the computer model depends on an additional set of inputs $\theta \in \Theta \subset \mathbb{R}^q$ that we call calibration parameters. These are considered fixed but unknown quantities common to all the observations $y_i$ and all the instances of the physical process that we intend to predict using the calibrated computer model. The calibration parameters represent inherent properties of the physical process that cannot be directly measured or controlled in an experiment. In the most rudimentary form, one can think of the calibration parameters as parameters in standard regression problems. To this extent, we suppose the relationship between the observations $y$, the physical process $\zeta$, and the computer model $f_m$ as proposed by [19]:
$$y_i = f_m(t_i, \theta) + \delta(t_i) + \sigma\epsilon_i, \qquad (2)$$
where $\delta(t_i)$ represents an unknown systematic error between the computer model and the physical process. While $\delta(t_i)$ is intrinsically deterministic, a nonparametric approach using a GP prior model is typically imposed for Bayesian inference.

GPs are a convenient way of placing a distribution over a space of functions. By definition, we say that $\delta(t)$ has a GP distribution if for every $i = 1, 2, \dots$ the joint distribution of $\delta(t_1), \dots, \delta(t_i)$ is multivariate normal. It is fully described by its mean and covariance functions that characterize the relationship of the process at different inputs. Typically, the mean function is chosen to be zero or some dense family of basis functions (wavelets, Fourier, polynomials) across the input domain: $m(\cdot) = h(\cdot)^T \beta$, where $h(\cdot) = (h_1(\cdot), \dots, h_r(\cdot))$ are the basis functions and $\beta$ is a hyperparameter. A typical choice for the covariance function is a stationary covariance function that depends on the inputs through $t - t'$. For example, a Gaussian kernel covariance function (also called squared exponential or radial basis function kernel) takes the form
$$k(t, t') = \eta \exp\left(-\frac{1}{2}(t - t')^T M (t - t')\right), \qquad (3)$$
where $M$ corresponds to a positive definite diagonal matrix of hyperparameters. We refer to the case of $M = \ell I$, for some $\ell > 0$, as an isotropic version of the kernel, because it is invariant to rotation. The case of $M$ with different diagonal terms is called an anisotropic version of the kernel. Other popular choices for stationary covariance functions are Matérn kernels, polynomial kernels, or exponential kernels [36].
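As a concrete illustration of (3), the following minimal Python sketch (our own; the function and variable names are not part of the original implementation) evaluates the kernel for an isotropic and an anisotropic choice of $M$:

```python
import numpy as np

def sq_exp_kernel(t1, t2, eta, M):
    """Squared exponential kernel (3): eta * exp(-0.5 (t - t')^T M (t - t'))."""
    diff = np.asarray(t1) - np.asarray(t2)
    return eta * np.exp(-0.5 * diff @ M @ diff)

p = 2                                   # input dimension
t, t_prime = np.array([0.2, 0.7]), np.array([0.4, 0.1])

M_iso = 2.0 * np.eye(p)                 # isotropic: M = l * I with l = 2
M_aniso = np.diag([2.0, 10.0])          # anisotropic: different diagonal terms

print(sq_exp_kernel(t, t_prime, eta=1.0, M=M_iso))
print(sq_exp_kernel(t, t_prime, eta=1.0, M=M_aniso))
```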
It is important to note that one first needs to provide an estimate of the unknown parameter $\theta$ according to the relationship (2) before making any predictions. The process of estimating such a parameter is called model calibration. In the Bayesian sense, it corresponds to obtaining a full posterior distribution of $\theta$ given data. Unfortunately, the calibration parameter $\theta$ is non-identifiable in general. Several authors have pointed this out and proposed various methods to mitigate the problem, including [3, 7, 32, 44, 45]. Our main goal here, nonetheless, is not the correct identification of $\theta$, but a prediction. Thus the problem can be thought of as a "black-box" based prediction, such as the prediction based on neural networks or deep networks, where parameters are part of the nonparametric models.

It is often the case that the evaluation of the computer model $f_m$ is too expensive in terms of both time and space (memory). Common practice is to reduce the number of necessary computer model evaluations by considering a GP prior model. We use the following notation:
$$f_m(t, \theta) \sim GP(m_f(t, \theta), k_f((t, \theta), (t', \theta'))).$$
In this setup, the data also include a set of model evaluations $z = (z_1, \dots, z_s)$ over a grid $\{(\tilde t_1, \tilde\theta_1), \dots, (\tilde t_s, \tilde\theta_s)\}$. These are usually selected sequentially using some space-filling design such as a uniform or Latin hypercube design [28], which is a design that has a good coverage of the space with evenly distributed points in each one-dimensional projection. The complete dataset $d$ in the case of computationally expensive models consists of $n$ observations $y_i$ from the physical process $\zeta$ and $s$ evaluations $z_j$ of the computer model $f_m$, i.e. $d = (d_1, \dots, d_{n+s}) := (y, z)$. We shall denote the set of unknown parameters as $(\theta, \phi, \sigma^2)$, with $\phi$ denoting the set of hyperparameters of the GPs' mean and covariance functions. Consequently, the distribution of the complete dataset $d$ conditioned on $(\theta, \phi, \sigma^2)$ is
$$d \mid \theta, \phi, \sigma^2 \sim N(M(\theta, \phi), K(\theta, \phi, \sigma^2)), \qquad (4)$$
where
$$M(\theta, \phi) = \begin{pmatrix} M_f(T_y(\theta)) + M_\delta(T_y) \\ M_f(T_z(\tilde\theta)) \end{pmatrix}, \qquad (5)$$
$M_f(T_y(\theta))$ is a column vector with $j$th element $m_f(t_j, \theta)$, $M_\delta(T_y)$ is a column vector with $j$th element $m_\delta(t_j)$, and $M_f(T_z(\tilde\theta))$ is a column vector with $j$th element $m_f(\tilde t_j, \tilde\theta_j)$. The covariance matrix of the multivariate normal distribution (4) is
$$K(\theta, \phi, \sigma^2) = \begin{pmatrix} K_f(T_y(\theta), T_y(\theta)) + K_\delta(T_y, T_y) + \sigma^2 I_n & K_f(T_y(\theta), T_z(\tilde\theta)) \\ K_f(T_z(\tilde\theta), T_y(\theta)) & K_f(T_z(\tilde\theta), T_z(\tilde\theta)) \end{pmatrix}. \qquad (6)$$
Here $K_f(T_y(\theta), T_y(\theta))$ is the matrix with $(i, j)$ element $k_f((t_i, \theta), (t_j, \theta))$, $K_\delta(T_y, T_y)$ is the matrix with $(i, j)$ element $k_\delta(t_i, t_j)$, and $K_f(T_z(\tilde\theta), T_z(\tilde\theta))$ is the matrix with $(i, j)$ element $k_f((\tilde t_i, \tilde\theta_i), (\tilde t_j, \tilde\theta_j))$. We can define the matrix $K_f(T_y(\theta), T_z(\tilde\theta))$ similarly with the kernel $k_f$.

Under a fully Bayesian treatment, the predictions of $y^*$ are specified by the posterior predictive distribution $p(y^* \mid d)$.
It is obtained by integrating the conditional density $p(y^* \mid d, \theta, \phi, \sigma^2)$, which is a multivariate normal density given by the statistical model (1) and the specification of the GPs, against the posterior density $p(\theta, \phi, \sigma^2 \mid d)$. An analogous relationship holds for the predictions of new realizations of the physical process $\zeta^*$. The posterior density $p(\theta, \phi, \sigma^2 \mid d)$, however, does not have a closed form in general, and one needs to resort either to Markov chain Monte Carlo (MCMC) methods for approximation or to variational techniques. This can be a non-trivial task to implement and requires some practical experience. Additionally, the nature of the marginal likelihood $p(d \mid \theta, \phi, \sigma^2)$ makes the problem harder to scale due to the complex structure of the covariance matrix $K(\theta, \phi, \sigma^2)$; see [19] and [17] for further discussion.

To avoid these difficulties, we propose an empirical Bayes approach which, instead of placing a (prior) distribution on $(\theta, \phi, \sigma^2)$, estimates these parameters directly from the data. One can therefore utilize the convenience of GPs to obtain closed-form, simple, and fast predictions given by the conditional distribution $p(y^* \mid d, \theta, \phi, \sigma^2)$ (or $p(\zeta^* \mid d, \theta, \phi, \sigma^2)$). The proposed approach can be viewed as an approximation of the fully Bayesian treatment that neglects some of the uncertainty associated with the unknown parameters.

3. Prediction and parameter estimation

One of the main benefits of the empirical Bayes approach is that once we estimate the unknown parameters $(\theta, \phi, \sigma^2)$, we can obtain a closed-form predictive distribution given these estimates. The framework additionally yields a principled approach for the inference of the physical process $\zeta$ that is statistically consistent (shown below in Section 4). Here we formally derive the algorithm for prediction of physical quantities. Let us consider a set of new inputs $(t^*_1, \dots, t^*_J)$ at which we want to obtain predictions according to the model (2). The joint normality between $d$ and $y^*$ implies that the conditional distribution $p(y^* \mid d, \theta, \phi, \sigma^2)$ is a multivariate normal distribution with the mean vector
$$M_{y^*}(\theta, \phi, \sigma^2) = M_f(T^*_y(\theta)) + M_\delta(T^*_y) + C_* K(\theta, \phi, \sigma^2)^{-1}(d - M(\theta, \phi)), \qquad (7)$$
and the covariance matrix
$$K_{y^*}(\theta, \phi, \sigma^2) = K_f(T^*_y(\theta), T^*_y(\theta)) + K_\delta(T^*_y, T^*_y) + \sigma^2 I_J - C_* K(\theta, \phi, \sigma^2)^{-1} C_*^T, \qquad (8)$$
where
$$C_* = \begin{pmatrix} K_f(T^*_y(\theta), T_y(\theta)) + K_\delta(T^*_y, T_y) & K_f(T^*_y(\theta), T_z(\tilde\theta)) \end{pmatrix}, \qquad (9)$$
$M(\theta, \phi)$ and $K(\theta, \phi, \sigma^2)$ are the mean vector and the covariance matrix of the data likelihood $p(d \mid \theta, \phi, \sigma^2)$, $K_f(T^*_y(\theta), T_y(\theta))$ is the matrix with $(i, j)$ element $k_f((t^*_i, \theta), (t_j, \theta))$, and $K_\delta(T^*_y, T_y)$ is the matrix with $(i, j)$ element $k_\delta(t^*_i, t_j)$. We can similarly define the matrix $K_f(T^*_y(\theta), T_z(\tilde\theta))$ with the kernel $k_f$ and the mean vectors $M_f(T^*_y(\theta))$ and $M_\delta(T^*_y)$ as in the case of the likelihood (4).
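To make the conditioning step concrete, here is a minimal NumPy sketch of equations (7)–(9) (function and variable names are ours, and a Cholesky solve replaces the explicit inverse for numerical stability; this is an illustration under those assumptions, not the paper's code):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def predictive_moments(d, M, K, M_star, C_star, K_star_prior):
    """Mean (7) and covariance (8) of p(y* | d) with plug-in parameters.

    d            : (n+s,) complete dataset (y, z)
    M            : (n+s,) prior mean (5) of d
    K            : (n+s, n+s) covariance (6) of d
    M_star       : (J,) prior mean of y*, i.e. M_f(T*_y) + M_delta(T*_y)
    C_star       : (J, n+s) cross-covariance (9) between y* and d
    K_star_prior : (J, J) prior covariance of y* (incl. sigma^2 I_J for y*)
    """
    chol = cho_factor(K, lower=True)
    alpha = cho_solve(chol, d - M)                            # K^{-1}(d - M)
    mean = M_star + C_star @ alpha                            # equation (7)
    cov = K_star_prior - C_star @ cho_solve(chol, C_star.T)   # equation (8)
    return mean, cov
```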
An analogous relationship holds for the conditional distribution of the new realizations from the physical process, $p(\zeta^* \mid d, \theta, \phi, \sigma^2)$, where the mean vector $M_{\zeta^*}(\theta, \phi, \sigma^2)$ is identical with (7), and the covariance matrix is
$$K_{\zeta^*}(\theta, \phi, \sigma^2) = K_f(T^*_y(\theta), T^*_y(\theta)) + K_\delta(T^*_y, T^*_y) - C_* K(\theta, \phi, \sigma^2)^{-1} C_*^T. \qquad (10)$$
Algorithm 1 summarizes the procedure for predictions of physical quantities using imperfect and computationally expensive computer models.

Algorithm 1: Empirical Bayes algorithm for predictions of physical quantities using computer models
Input: Data $d = (y, z)$, mean and covariance functions for the GPs, and new inputs $(t^*_1, \dots, t^*_J)$.
1. Use the experimental observations $y$ to compute the estimate of the noise scale $\hat\sigma_n$.
2. Use $d$ to obtain the estimates of the GPs' hyperparameters $(\hat\theta_{n+s}, \hat\phi_{n+s})$.
3. Compute $M_{y^*}(\hat\theta_{n+s}, \hat\phi_{n+s}, \hat\sigma_n)$ and $K_{y^*}(\hat\theta_{n+s}, \hat\phi_{n+s}, \hat\sigma_n)$, or $M_{\zeta^*}(\hat\theta_{n+s}, \hat\phi_{n+s}, \hat\sigma_n)$ and $K_{\zeta^*}(\hat\theta_{n+s}, \hat\phi_{n+s}, \hat\sigma_n)$ respectively, to get the posterior predictive distribution.

As we have closed-form expressions for all the conditional distributions in Algorithm 1, the computation avoids Monte Carlo sampling; hence negligible time is required compared to the sampling based approximations. This assumes plugged-in parameter estimates. To this extent, we propose the following estimator of the noise scale:
$$\hat\sigma_n = \sqrt{\frac{\sum_{i=1}^{n-1}(y_{i+1} - y_i)^2}{2(n-1)}}, \qquad (11)$$
where the $y_i$ are the observations from the physical process under the model (1). The advantage of considering $\hat\sigma_n$ of this form is twofold. First, the estimator requires minimal computational effort. Second, $\hat\sigma_n$ is in fact a strongly consistent estimator (see Corollary 1 in Section 4), which turns out to be a crucial assumption for the theoretical validation of the empirical Bayes framework conducted in the following section.
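For instance, a short NumPy version of the estimator (11) (our own sketch, assuming the observations are ordered by their design points so that neighboring entries correspond to nearby inputs, as required by condition (AD) below):

```python
import numpy as np

def noise_scale_estimate(y):
    """First-difference estimator (11) of the noise scale sigma.

    Assumes y is ordered so that consecutive entries correspond to
    neighboring design points, making zeta(t_{i+1}) - zeta(t_i) small.
    """
    y = np.asarray(y, dtype=float)
    return np.sqrt(np.sum(np.diff(y) ** 2) / (2 * (len(y) - 1)))

# quick check: for noise around a smooth trend the estimate is close to sigma
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 500)
y = np.sin(2 * np.pi * t) + 0.2 * rng.standard_normal(t.size)
print(noise_scale_estimate(y))  # approximately 0.2
```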
We first consider estimates of $(\theta, \phi)$ as minimizers of a loss function that is reminiscent of the standard maximum likelihood approach, namely
$$L_{MLE}(\theta, \phi) = -\log p(d \mid \theta, \phi, \hat\sigma^2_n), \qquad (12)$$
with the negative log-likelihood being
$$-\log p(d \mid \theta, \phi, \hat\sigma^2_n) = \frac{1}{2}(d - M(\theta, \phi))^T K(\theta, \phi, \hat\sigma^2_n)^{-1}(d - M(\theta, \phi)) + \frac{1}{2}\log\left|K(\theta, \phi, \hat\sigma^2_n)\right| + \frac{n+s}{2}\log 2\pi.$$
We can readily interpret the minimizer of $L_{MLE}$ as a trade-off between the data-fit term $(d - M(\theta, \phi))^T K(\theta, \phi, \hat\sigma^2_n)^{-1}(d - M(\theta, \phi))$ and the model complexity penalty $\log|K(\theta, \phi, \hat\sigma^2_n)|$, which depends only on the model parameters and the variable inputs.

Predictive likelihood with K-fold cross-validation.
Another viable approach to estimating the parameters $(\theta, \phi)$ is to base these on a model's predictive performance on unseen data. Cross-validation is a popular and robust approach to estimating this predictive performance that has been utilized across many statistical applications; see [42, 36, 26] for applications with Gaussian processes. Here, we consider a K-fold cross-validation where the basic idea is to randomly partition the training dataset into K subsets of roughly equal size. We then select K − 1 subsets for training and leave the remaining subset for the purpose of validation, with typical choices for K being 3, 5, 10, or n (leave-one-out cross-validation).

Formally, let $y_i$ represent the $i$th subset of the observations $y$ and $y_{-i} = y \setminus y_i$. The negative predictive log-likelihood under the K-fold cross-validation is
$$L_{CV(K)}(\theta, \phi) = -\sum_{i=1}^{K} \log p(y_i \mid y_{-i}, z, \theta, \phi, \hat\sigma^2_n). \qquad (13)$$
The cross-validation should be more robust against model misspecification and overfitting [47].
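Either loss can be handed to a standard numerical solver. The sketch below (our own illustration; `mean_fn` and `cov_fn` stand for implementations of the mean (5) and covariance (6), and the parameter packing is hypothetical) shows the pattern with SciPy for $L_{MLE}$:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.linalg import cho_factor, cho_solve

def neg_log_likelihood(params, d, mean_fn, cov_fn, sigma2_hat):
    """L_MLE in (12): params packs (theta, phi) into a single vector."""
    M = mean_fn(params)                       # mean vector (5)
    K = cov_fn(params, sigma2_hat)            # covariance matrix (6)
    chol = cho_factor(K, lower=True)
    resid = d - M
    quad = resid @ cho_solve(chol, resid)     # (d - M)^T K^{-1} (d - M)
    logdet = 2.0 * np.sum(np.log(np.diag(chol[0])))
    return 0.5 * (quad + logdet + len(d) * np.log(2 * np.pi))

# usage pattern (d, mean_fn, cov_fn, sigma2_hat, x0 defined by the problem):
# result = minimize(neg_log_likelihood, x0,
#                   args=(d, mean_fn, cov_fn, sigma2_hat), method="L-BFGS-B")
# theta_hat, phi_hat = unpack(result.x)
```

The cross-validation loss (13) follows the same pattern, with the objective replaced by a sum of conditional Gaussian log-densities over the K held-out folds.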
4. Theoretical analysis and posterior consistency
Below we represent the Bayesian model described in Section 2 hierarchically, using a set of prior distributions for a systematic exploitation of conjugacy. This representation of the model is crucial for the theoretical results obtained in Section 4.1. It reframes the Bayesian model as a version of a nonparametric regression problem with a GP prior for $\zeta(t)$ and additive noise. Namely, we define the model for the data $d = (d_1, \dots, d_{n+s}) = (y, z)$:
$$y_i = \zeta(t_i) + \sigma\epsilon_i, \quad i = 1, \dots, n,$$
$$z_j = f_m(\tilde t_j, \tilde\theta_j), \quad j = 1, \dots, s,$$
$$\epsilon_i \overset{i.i.d.}{\sim} N(0, 1),$$
where the $z_j$'s are the realizations of the computer model $f_m(t, \theta)$ at pre-selected design points $(\tilde t_j, \tilde\theta_j)$, and the $y_i$'s are the experimental observations from the underlying physical process. Additionally, we consider the following GP priors:
$$\zeta(t) \mid f_m(t, \theta), \delta(t) \sim f_m(t, \theta) + \delta(t),$$
$$\delta(t) \sim GP_\delta(m_\delta(t), k_\delta(t, t')),$$
$$f_m(t, \theta) \sim GP_f(m_f(t, \theta), k_f((t, \theta), (t', \theta'))).$$
Under this model, the conditional likelihoods for $y_i$ and $z_j$ are
$$p(y_i \mid \zeta(t_i), \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(y_i - \zeta(t_i))^2}{2\sigma^2}\right), \qquad (14)$$
$$p(z_j \mid f_m(\tilde t_j, \tilde\theta_j)) = \mathbb{1}_{z_j = f_m(\tilde t_j, \tilde\theta_j)}(z_j), \qquad (15)$$
where $p(z_j \mid f_m(\tilde t_j, \tilde\theta_j))$ is a likelihood with a point mass at $z_j = f_m(\tilde t_j, \tilde\theta_j)$. Consequently, the equivalence of the hierarchical formulation here and the model described in Section 2 is given through the equality between the likelihood (4) and the following integral, which shows that both model representations yield the same (marginal) data likelihood:
$$\int_\zeta \int_{\tilde f_m} p(\zeta, \tilde f_m, d \mid \theta, \phi, \sigma^2)\, d\tilde f_m\, d\zeta = \int_\zeta \int_{\tilde f_m} p(d \mid \zeta, \tilde f_m, \theta, \phi, \sigma^2)\, p(\zeta, \tilde f_m \mid \theta, \phi)\, d\tilde f_m\, d\zeta$$
$$= \int_\zeta \int_{\tilde f_m} \prod_{i=1}^n p(y_i \mid \zeta_i, \sigma^2) \prod_{j=1}^s p(z_j \mid \tilde f_{m,j})\, p(\zeta, \tilde f_m \mid \theta, \phi)\, d\tilde f_m\, d\zeta = \int_\zeta \prod_{i=1}^n p(y_i \mid \zeta_i, \sigma^2)\, p(\zeta, z \mid \theta, \phi)\, d\zeta,$$
where $\zeta = (\zeta(t_1), \dots, \zeta(t_n)) = (\zeta_1, \dots, \zeta_n)$ and $\tilde f_m = (f_m(\tilde t_1, \tilde\theta_1), \dots, f_m(\tilde t_s, \tilde\theta_s))$. The likelihood $p(\zeta, z \mid \theta, \phi)$ is the multivariate normal distribution with the mean $M(\theta, \phi)$ (see (5)) and the covariance
$$K_p(\theta, \phi) = \begin{pmatrix} K_f(T_y(\theta), T_y(\theta)) + K_\delta(T_y, T_y) & K_f(T_y(\theta), T_z(\tilde\theta)) \\ K_f(T_z(\tilde\theta), T_y(\theta)) & K_f(T_z(\tilde\theta), T_z(\tilde\theta)) \end{pmatrix}.$$
We leave the details of the integral computation for Appendix A. Using this equivalent representation, we can gain further insight into the role of the set of model runs $z$. Let us consider a function space $\mathcal{F}$ and a subset $\tilde{\mathcal{F}} \subset \mathcal{F}$; then
$$p(\zeta \in \tilde{\mathcal{F}} \mid d, \theta, \phi, \sigma^2) \propto \int_{\tilde{\mathcal{F}}} \prod_{i=1}^n p(y_i \mid \zeta_i, \sigma^2)\, p(\zeta \mid z, \theta, \phi)\, d\zeta. \qquad (16)$$
One can therefore interpret the model runs $z$ as additional information provided by the computer model $f_m$ that enhances the GP prior distribution $p(\zeta \mid z, \theta, \phi)$ over the physical process $\zeta$, having the mean function
$$m_\zeta(t) = m_f(t, \theta) + m_\delta(t) + \sum_{i,j=1}^{s} \kappa_{j,i} \left[k_f((t, \theta), (\tilde t_j, \tilde\theta_j))\right]\left[z_i - m_f(\tilde t_i, \tilde\theta_i)\right], \qquad (17)$$
and the covariance function
$$k_\zeta(t, t') = k_f((t, \theta), (t', \theta)) + k_\delta(t, t') - \sum_{i,j=1}^{s} \kappa_{j,i} \left[k_f((t, \theta), (\tilde t_j, \tilde\theta_j))\right]\left[k_f((\tilde t_i, \tilde\theta_i), (t', \theta))\right], \qquad (18)$$
where $\kappa_{j,i}$ is the $(j, i)$ element of the matrix $K_f(T_z(\tilde\theta), T_z(\tilde\theta))^{-1}$.

The revealing consequence of the previous discussion is that the framework of [19] is equivalent to the nonparametric regression model of an unknown function $\zeta(t)$ with the prior distribution $p(\zeta \mid z, \theta, \phi)$. This is not only a new perspective on the popular framework, but also happens to be the key step that allows us to validate our empirical Bayes approach theoretically and establish the posterior consistency of the physical process when the prior $p(\zeta \mid z, \theta, \phi)$ satisfies certain properties.
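As an illustration of how the model runs sharpen the prior, the following sketch (ours; the callables `k_f`, `k_delta`, `m_f`, and `m_delta` are hypothetical helpers for the kernels and means above, and zero-mean defaults are a common simplification) evaluates (17) and (18) on a grid of new inputs:

```python
import numpy as np

def conditioned_prior(k_f, k_delta, m_f, m_delta, design, z, t_grid, theta):
    """Mean (17) and covariance (18) of the prior p(zeta | z) on a grid.

    design : list of (t_tilde_j, theta_tilde_j) pairs for the model runs
    z      : model outputs at the design points
    """
    Kzz = np.array([[k_f(a, b) for b in design] for a in design])
    kappa = np.linalg.inv(Kzz)                    # kappa_{j,i} in (17)-(18)
    Ktz = np.array([[k_f((t, theta), dp) for dp in design] for t in t_grid])
    resid = np.asarray(z) - np.array([m_f(dp) for dp in design])
    mean = (np.array([m_f((t, theta)) + m_delta(t) for t in t_grid])
            + Ktz @ kappa @ resid)                # equation (17)
    cov = (np.array([[k_f((t, theta), (u, theta)) + k_delta(t, u)
                      for u in t_grid] for t in t_grid])
           - Ktz @ kappa @ Ktz.T)                 # equation (18)
    return mean, cov
```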
In the remainder of this section, we assume that the true underlying physical process $\zeta_0$ is a continuously differentiable function on the compact and convex set $\Omega \subset \mathbb{R}^p$. Without loss of generality, we take $\Omega = [0, 1]^p$. Additionally, we shall assume the hyperparameters $(\theta, \phi)$ take values in a set $\Upsilon$ that we shall later specify. For any $\nu > 0$, we aim to establish, under suitable conditions, the following:
$$\sup_{(\theta, \phi) \in \Upsilon} p(\zeta \in W^C_{\nu,n} \mid y_1, \dots, y_n, z, \theta, \phi, \hat\sigma_n) \xrightarrow[n \to \infty]{} 0 \quad \text{a.s. } P_0, \qquad (19)$$
where $P_0$ denotes the joint conditional distribution of $\{y_i\}_{i=1}^\infty$ given the true $\zeta_0$ and the true noise scale $\sigma_0$, $\hat\sigma_n$ is a strongly consistent estimator of $\sigma_0$, and
$$W_{\nu,n} = \left\{\zeta : \int |\zeta(t) - \zeta_0(t)|\, dQ_n(t) \leq \nu\right\}, \qquad (20)$$
with $Q_n$ being the empirical measure on the design points, given as $Q_n(t) = n^{-1}\sum_{i=1}^n \mathbb{1}_{t_i}(t)$.

In Theorem 1, we first present a general result on the consistency of nonparametric regression problems and subsequently discuss the theorem's conditions in the context of the model described in Section 2. This is based on the extensions of Schwartz's theorem for independent but non-identically distributed random variables given by [8] and [9], where the authors assume $\sigma$ is included in $W_{\nu,n}$ and the posterior consistency is derived jointly for $\zeta$ and $\sigma$. On the other hand, the consistency of $\zeta$ conditioned on $\hat\sigma_n$, as stated in (19), requires a non-trivial modification of their original results. The proof of Theorem 1 is provided in Appendix B.

Theorem 1.
Let $\{y_i\}_{i=1}^\infty$ be independently and normally distributed with mean $\zeta(t_i)$ and standard deviation $\sigma$ with respect to a common $\sigma$-finite measure, where $\zeta$ belongs to a space of continuously differentiable functions on $[0, 1]^p$ denoted as $\mathcal{F}$, and $\sigma > 0$. Let $\zeta_0 \in \mathcal{F}$ and let $P_0$ denote the joint conditional distribution of $\{y_i\}_{i=1}^\infty$ given the true $\zeta_0$ and $\sigma_0$. Let $\{U_n\}_{n=1}^\infty$ be a sequence of subsets of $\mathcal{F}$. Let $\zeta$ have a prior $\Pi(\cdot \mid \theta, \phi)$, where $(\theta, \phi)$ take values in a set $\Upsilon$. Then, under assumptions (A1)–(A3) (provided in Section 4.1.1 below),
$$\sup_{(\theta, \phi) \in \Upsilon} p(\zeta \in U^C_n \mid y_1, \dots, y_n, \theta, \phi, \hat\sigma_n) \xrightarrow[n \to \infty]{} 0 \quad \text{a.s. } P_0.$$

For the purpose of generality of Theorem 1, we do not explicitly condition on the set of model runs $z$. It is clear from our previous discussions (see (16) in particular) that the model runs play the role of fixed constants in the prior distribution over $\zeta$. The dependence on $z$ in (19) arises by setting $\Pi(\zeta \mid \theta, \phi) := p(\zeta \mid z, \theta, \phi)$, which is the GP prior distribution with the mean function (17) and the covariance function (18).

As a matter of convenience, for any $0 < \epsilon < 1$, writing $\zeta(t_i) = \zeta_i$ and $\zeta_0(t_i) = \zeta_{0,i}$, define
$$\Lambda_i(\zeta_0, \zeta) = \log\frac{p(y_i \mid \zeta_{0,i}, \sigma_0)}{p(y_i \mid \zeta_i, \sigma_0(1-\epsilon))}, \quad K_i(\zeta_0, \zeta) = E_{\zeta_0,\sigma_0}(\Lambda_i(\zeta_0, \zeta)), \quad V_i(\zeta_0, \zeta) = Var_{\zeta_0,\sigma_0}(\Lambda_i(\zeta_0, \zeta)).$$
The following paragraph lists all the necessary conditions of Theorem 1:
(A1) Suppose there exist a set $B$ with $\Pi(B \mid \theta, \phi) > 0$ for every $(\theta, \phi) \in \Upsilon$, a $\Delta > 0$, and $0 < \tilde\epsilon < 1$, so that for any $\epsilon < \tilde\epsilon$:
(i) $\sum_{i=1}^\infty \frac{V_i(\zeta_0, \zeta)}{i^2} < \infty$, $\forall \zeta \in B$,
(ii) $\Pi(B \cap \{\zeta : K_i(\zeta_0, \zeta) < \Delta \text{ for all } i\} \mid \theta, \phi) > 0$.

(A2) There exist tests $\{\Phi_n\}_{n=1}^\infty$, sets $\{\mathcal{F}_n\}_{n=1}^\infty$, and constants $C_1, C_2, c_1 > 0$ and $0 < \tilde\epsilon < 1$ such that:
(i) $\sum_{n=1}^\infty E_{\zeta_0, \sigma_0} \Phi_n < \infty$,
(ii) $\sup_{(\theta, \phi) \in \Upsilon} \Pi(\mathcal{F}^C_n \mid \theta, \phi) < C_1 e^{-c_1 n}$,
(iii) there exists a constant $c_2 > 0$ such that for any $0 < \epsilon < \tilde\epsilon$ the inequality $c_2 + \log(1-\epsilon) - \log(1+\epsilon) > 0$ holds and $\sup_{\zeta \in U^C_n \cap \mathcal{F}_n} E_{\zeta, \sigma_0(1+\epsilon)}(1 - \Phi_n) \leq C_2 e^{-c_2 n}$.

(A3) $\hat\sigma_n$ is strongly consistent, i.e., $\hat\sigma_n \xrightarrow[n \to \infty]{} \sigma_0$ a.s. $P_0$.

We now discuss (A1)–(A3) in the context of the model described in Section 2. These fall into three general categories: the first one addresses prior positivity conditions ((A1) and (ii) of (A2)), the second category is related to the existence of test functions $\Phi_n$ ((i) and (iii) of (A2)), and the last condition (A3) requires strong consistency of the noise scale estimator.

To verify the conditions (A1) of Theorem 1 for prior distributions, it is sufficient to show that the GP prior for $\zeta$ assigns positive probability to the following set for any $\omega > 0$:
$$B_\omega = \{\zeta : \|\zeta - \zeta_0\|_\infty < \omega\}, \qquad (21)$$
where $\|\cdot\|_\infty$ denotes the supremum norm.
For any $0 < \epsilon < 1$, a short calculation leads to
$$K_i(\zeta_0, \zeta) = \log(1-\epsilon) - \frac{1}{2}\left(1 - \frac{1}{(1-\epsilon)^2}\right) + \frac{[\zeta(t_i) - \zeta_0(t_i)]^2}{2\sigma_0^2(1-\epsilon)^2} \leq \log(1-\epsilon) - \frac{1}{2}\left(1 - \frac{1}{(1-\epsilon)^2}\right) + \frac{\|\zeta - \zeta_0\|_\infty^2}{2\sigma_0^2(1-\epsilon)^2}.$$
Let $a(\epsilon) = \log(1-\epsilon) - 1/2 + 1/[2(1-\epsilon)^2]$; it is easy to see that $a(\epsilon)$ is positive and continuous at $\epsilon = 0$. Therefore, for every $\Delta > 0$, there exist $\omega > 0$ and $0 < \tilde\epsilon < 1$ such that, for all $\zeta \in B_\omega$, $K_i(\zeta_0, \zeta) < \Delta$ for all $i$ and any $\epsilon < \tilde\epsilon$. Additionally, for any $\epsilon < \tilde\epsilon$, any $\omega > 0$, and all $\zeta \in B_\omega$,
$$V_i(\zeta_0, \zeta) = \frac{1}{2}\left[\frac{1}{(1-\epsilon)^2} - 1\right]^2 + \left[\frac{\zeta(t_i) - \zeta_0(t_i)}{\sigma_0(1-\epsilon)^2}\right]^2 < \infty \quad \text{uniformly in } i,$$
and as a result, for all $\zeta \in B_\omega$, $\sum_{i=1}^\infty \frac{V_i(\zeta_0, \zeta)}{i^2} < \infty$. The prior condition (ii) of (A2) for the sieve $\mathcal{F}_n$ (22) is addressed in Lemma 1; see Appendix C for the proof.

Lemma 1.
Let the mean function $m_\zeta(\cdot)$ of the GP prior for $\zeta$ defined on $[0, 1]^p$ be continuously differentiable, and let the covariance function $k_\zeta(\cdot, \cdot)$ have mixed partial derivatives up to order 4 that are continuous. Define
$$\rho_0(\theta, \phi) = \sup_{t \in [0,1]^p} Var(\zeta(t) \mid z, \theta, \phi), \qquad \rho_i(\theta, \phi) = \sup_{t \in [0,1]^p} Var\left(\frac{\partial}{\partial t_i}\zeta(t) \,\Big|\, z, \theta, \phi\right), \quad i = 1, \dots, p.$$
Suppose that $\Upsilon$ is a compact set, and the $\rho_i$ are continuous functions of $(\theta, \phi)$ for all $(\theta, \phi) \in \Upsilon$, $i = 0, \dots, p$. Then there exist constants $C, c > 0$ such that
$$\sup_{(\theta, \phi) \in \Upsilon} p(\mathcal{F}^C_n \mid z, \theta, \phi) < Ce^{-cn},$$
where $\mathcal{F}_n$ are the sieves defined in (22).

Our approach to establishing the existence of test functions $\{\Phi_n\}_{n=1}^\infty$ that satisfy the conditions (i) and (iii) in Theorem 1 is similar to that of Theorem 2 in [9]. We consider a sieve $\mathcal{F}_n$ which grows to the space of continuously differentiable functions on $[0, 1]^p$. Namely, let
$$\mathcal{F}_n = \left\{\zeta : \|\zeta\|_\infty < M_n, \ \left\|\frac{\partial}{\partial t_i}\zeta\right\|_\infty < M_n, \ i = 1, \dots, p\right\}, \qquad (22)$$
where $M_n = O(n^\alpha)$ for some suitable $\alpha \in (0, 1)$. The existence of tests in the case of $W_{n,\nu}$ is given in Lemma 2, with proof in Appendix D.

Lemma 2.
Let $\mathcal{F}_n$ be the sieves defined in (22). For any $\nu > 0$ there exist tests $\{\Phi_n\}_{n=1}^\infty$ and constants $C > 0$ and $0 < \tilde\epsilon < 1$ so that:
(i) $\sum_{n=1}^\infty E_{\zeta_0, \sigma_0} \Phi_n < \infty$,
(ii) there exists a constant $c > 0$ such that for any $0 < \epsilon < \tilde\epsilon$ the inequality $c + \log(1-\epsilon) - \log(1+\epsilon) > 0$ holds and $\sup_{\zeta \in W^C_{n,\nu} \cap \mathcal{F}_n} E_{\zeta, \sigma_0(1+\epsilon)}(1 - \Phi_n) \leq Ce^{-cn}$.

As we have suggested in Section 3, the estimator $\hat\sigma_n$ defined in (11) is in fact a strongly consistent estimator of the true scale parameter $\sigma_0$.

Theorem 2.
Suppose $\zeta_0(t)$ represents the true physical process and $\sigma^2_0$ is the true value of the experimental error variance, where $t \in \Omega$, a compact and convex subset of $\mathbb{R}^p$, and $\zeta_0$ is continuously differentiable on $\Omega$. Let $P_0$ denote the joint conditional distribution of $\{y_i\}_{i=1}^\infty$ given the true $\zeta_0$ and $\sigma_0$. Also assume that the following holds about the design points $t_i$:
$$\sup_{i \in \{1, \dots, n\},\, j \in \{1, \dots, p\}} |t_{i+1,j} - t_{i,j}| \xrightarrow[n \to \infty]{} 0, \qquad \text{(AD)}$$
then
$$\hat\sigma^2_n \xrightarrow[n \to \infty]{} \sigma^2_0 \quad \text{a.s. } P_0. \qquad (23)$$
The proof of Theorem 2 is given in Appendix E. The continuous mapping theorem directly implies the following.

Corollary 1. Under the assumptions of Theorem 2,
$$\hat\sigma_n = \sqrt{\hat\sigma^2_n} \xrightarrow[n \to \infty]{} \sigma_0 \quad \text{a.s. } P_0. \qquad (24)$$
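A quick numerical illustration of Theorem 2 and Corollary 1 (our own sanity check, not one of the paper's experiments) on an equally spaced design, which satisfies (AD) by Remark 1 below:

```python
import numpy as np

rng = np.random.default_rng(42)
sigma0 = 0.2
zeta0 = lambda t: 1.2 * np.sin(5 * t + 1) + 1.0    # a smooth "true" process

for n in [100, 1000, 10000]:
    t = np.linspace(0, 1, n)                        # equally spaced design
    y = zeta0(t) + sigma0 * rng.standard_normal(n)
    sigma_hat = np.sqrt(np.sum(np.diff(y) ** 2) / (2 * (n - 1)))
    print(n, round(sigma_hat, 4))                   # approaches sigma0 = 0.2
```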
Remark 1. The assumption (AD) is satisfied by a design that contains at least one point in each hypercube $H$ in $\Omega$ with Lebesgue measure $\lambda(H) \geq K/n$, for some constant $0 < K \leq 1$. This is, for example, the case of an equally spaced design.

Below we present the almost sure consistency result (19) as a direct consequence of Lemmas 1 and 2, and Theorem 1.
Theorem 3.
Let $P_0$ denote the joint conditional distribution of $\{y_i\}_{i=1}^\infty$ given the true $\zeta_0$ and $\sigma_0$. Let $m_\zeta(\cdot)$ and $k_\zeta(\cdot, \cdot)$ be the mean and covariance functions of the GP prior for $\zeta$ satisfying the conditions of Lemma 1. Assume $\Upsilon$ is a compact set, and for any $\omega > 0$, $p(B_\omega \mid z, \theta, \phi) > 0$. If $\hat\sigma_n$ is a strongly consistent estimator of $\sigma_0$, then for any $\nu > 0$,
$$\sup_{(\theta, \phi) \in \Upsilon} p(\zeta \in W^C_{\nu,n} \mid y_1, \dots, y_n, z, \theta, \phi, \hat\sigma_n) \xrightarrow[n \to \infty]{} 0 \quad \text{a.s. } P_0. \qquad (25)$$

Prior conditions.
The prior positivity condition requiring $p(B_\omega \mid z, \theta, \phi) > 0$ for any $\omega > 0$ was extensively studied by [13] and [43]. Theorem 4 of [13] implies that this condition is satisfied for a GP with continuous sample paths and continuous mean and covariance functions, as long as $\zeta_0$ and $m_\zeta$ belong to the reproducing kernel Hilbert space (RKHS) of $k_\zeta$. The continuity of the GP's sample paths is given by the application of Theorem 5 in [13], which requires the same continuity conditions as Lemma 1 in this section (excluding those on $\rho_i$). It should be clear from (17) and (18) that $m_\zeta$ is continuously differentiable on $[0, 1]^p$ and $k_\zeta$ has continuous mixed partial derivatives up to 4th order on $[0, 1]^p$, as long as the same holds for $m_f$ and $m_\delta$ and, respectively, $k_f$ and $k_\delta$. [43] shows that the RKHS of $k_\zeta$ spans the space of continuously differentiable functions on $[0, 1]^p$ if $k_\zeta$ is a product of $p$ isotropic and integrable univariate covariance functions with continuous mixed partial derivatives up to order 4. For example, the squared exponential covariance functions satisfy these requirements, including the continuity of $\rho_i$ for $i = 0, \dots, p$.

This, of course, does not directly imply that such choices for $m_f$ and $m_\delta$, and $k_f$ and $k_\delta$ respectively, result in the conditional mean $m_\zeta$ and covariance $k_\zeta$ functions satisfying these sufficient conditions. For larger applicability of our results, we note that further investigation of specific choices for mean and covariance functions that satisfy the desired conditions is needed. Nevertheless, the simulation study conducted in Section 5 strongly suggests that choosing the squared exponential kernel leads to consistent predictions.
5. Numerical analysis and applications
The main objective of this section is to establish the efficiency of the empirical Bayes method in Algorithm 1 and to support the consistency result presented in Section 4, all while sacrificing minimally in terms of the fidelity of UQ as compared to the fully Bayesian treatment. To this extent, we consider a simulation study where we compare our method (under both $L_{MLE}$ and $L_{CV(K)}$) to a fully Bayesian treatment with the posterior samples obtained using the standard Metropolis-Hastings algorithm [12]. We also conduct a prior sensitivity analysis of the fully Bayesian treatment to further demonstrate the practical advantages of the empirical Bayes. Finally, we demonstrate the opportunities provided by our method for science practitioners through predictions of nuclear binding energies using the Liquid Drop Model.

Let us consider a simple computer model representing a periodic wave disturbance that moves through a medium and causes displacement of individual atoms or molecules in the medium. This is called a transverse harmonic wave, where the displacement $f_m((t, x), \theta)$ of a particle at location $x$ over time $t$ is given by
$$f_m((t, x), \theta) = \theta_1 \sin(kx - \theta_2 t + \psi), \qquad (26)$$
where $\theta_1$ represents the amplitude of the wave, and $\theta_2$ is the frequency of the wave. The model also depends on the wave number $k$, which is reciprocal to the wavelength, and the phase constant $\psi$. For the purpose of this example, we shall consider these to be known values with $k = 5$ and $\psi = 1$, and define the model inputs $(t, x)$ over the space $[0, 1]^2$ (we assume that the length and time units are all equal to one). The true physical process is modeled according to
$$\zeta(t, x) = f_m((t, x), \theta) + \delta(t, x) = \theta_1 \sin(5x - \theta_2 t + 1) + \beta, \qquad (27)$$
where $\beta = 1$ is a constant systematic error of the model and $\theta = (\theta_1, \theta_2)$ are arbitrarily set to (1.2, 1.8).
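For reference, a minimal sketch of this data-generating setup (ours, anticipating the data-generation details described next; SciPy's `qmc.LatinHypercube` is one way to produce the Latin hypercube design mentioned below):

```python
import numpy as np
from scipy.stats import qmc

def f_m(t, x, theta1, theta2, k=5.0, psi=1.0):
    """Transverse harmonic wave computer model (26)."""
    return theta1 * np.sin(k * x - theta2 * t + psi)

def zeta(t, x, theta1=1.2, theta2=1.8, beta=1.0):
    """True physical process (27) = computer model + constant discrepancy."""
    return f_m(t, x, theta1, theta2) + beta

rng = np.random.default_rng(0)
n = 125
design = qmc.LatinHypercube(d=2, seed=0).random(n)   # (t, x) over [0, 1]^2
t, x = design[:, 0], design[:, 1]
y = zeta(t, x) + 0.2 * rng.standard_normal(n)        # observations, sigma_0 = 0.2
```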
We generate the experimental observations according to the model (2) with the true value of the observation error scale $\sigma_0 = 0.2$, where the model inputs $(t, x)$ are chosen using the Latin hypercube design over the full space $[0, 1]^2$. The space-filling properties of the design guarantee decreasing bias of the estimator $\hat\sigma_n$ with an increasing sample size. Additionally, we assume that the computer model for the periodic wave disturbance is computationally expensive and generate the set of model runs $z$ using, again, the Latin hypercube design, now over $[0, 1]^2 \times [0, 2]^2$. In each of the subsequent scenarios, the number of experimental observations is equal to the number of computer model runs, i.e. $n = s$. We define the GP priors for $f_m$ and $\delta$ to have zero means and the covariance functions
$$k_f(\{t, x, \theta\}, \{t', x', \theta'\}) = \eta_f \cdot \exp\left(-\frac{\|t - t'\|^2}{\ell_t} - \frac{\|x - x'\|^2}{\ell_x} - \frac{\|\theta_1 - \theta'_1\|^2}{\ell_{\theta_1}} - \frac{\|\theta_2 - \theta'_2\|^2}{\ell_{\theta_2}}\right),$$
$$k_\delta(\{t, x\}, \{t', x'\}) = \eta_\delta \cdot \exp\left(-\frac{\|t - t'\|^2}{\nu_t} - \frac{\|x - x'\|^2}{\nu_x}\right).$$
The hyperparameters in this scenario are therefore $\phi = (\eta_f, \ell_t, \ell_x, \ell_{\theta_1}, \ell_{\theta_2}, \eta_\delta, \nu_t, \nu_x)$. For the case of the fully Bayesian treatment, we choose inverse gamma priors with shape and scale parametrization for $(\sigma, \eta_f, \eta_\delta)$, gamma priors with shape and rate parametrization for the length scales, and independent normal distributions for the calibration parameters $(\theta_1, \theta_2)$. As we demonstrate below, the performance of the MCMC-based fit can vary greatly with different prior selections. To assess this effect, we consider the following prior variations: inverse gamma distributions with the shape fixed at 3 and the scale ranging over five values, gamma distributions with the rate equal to 3 and the shape ranging over two values, and normal distributions with the mean $\mu_\theta$ ranging over three values (including 0 and 1.2) and the standard deviation $\sigma_\theta$ ranging over four values (including 0.25). These choices reflect both fairly informative priors (e.g. $\mu_\theta = 1.2$ and $\sigma_\theta = 0.25$) and non-informative priors, given the spans of both the input space $[0, 1]^2$ and the parameter space $[0, 2]^2$.
Figure 1 shows the root mean squared errors (RMSEs) of predictions of new realizations from the true physical process (27) evaluated on a testing dataset of 225 realizations over a uniform grid on $[0, 1]^2$. The predictions are taken to be the posterior predictive means under each method. Each box plot in Figure 1 represents the distribution of RMSEs obtained through the MCMC-based fits for given values of $\mu_\theta$ and $\sigma_\theta$. We consider the estimates of hyperparameters using the $L_{MLE}$ loss and the predictive likelihood loss function with 10-fold cross-validation under the empirical Bayes approach. The noise scale parameter was estimated using the consistent estimator $\hat\sigma_n$ defined in Section 3.

Figure 1: The RMSE of the empirical Bayes approach ($L_{MLE}$ and $L_{CV(10)}$) and the fully Bayesian treatment (panels for $\sigma_\theta$ = 0.25, 0.50, and 1.00; sample sizes 125, 250, and 500). The results are grouped according to the values of the prior means $\mu_\theta$ and standard deviations $\sigma_\theta$ used in the Metropolis-Hastings algorithm. The box plots represent the distribution of RMSE values obtained with the MCMC-based fits across the prior combinations described in Section 5.1.1. The GP hyperparameters for the empirical Bayes approach were estimated using Algorithm 1.

In general, the proposed empirical Bayes approach performs comparably with the fully Bayesian treatment, and its RMSE monotonically decreases with the increasing size of the dataset. In particular, the RMSE under the $L_{CV(10)}$ loss is larger than under the other methods for the smallest size of training dataset considered; however, the RMSE under the $L_{MLE}$ loss is the smallest for the larger training sets. The likely reason for the slightly better performance of the empirical Bayes is that the parameter estimates given by the minimization of $L_{MLE}$ and $L_{CV(10)}$ are purely data driven, whereas the fully Bayesian approach needs to account for prior uncertainties. This observation is consistent with the sensitivity of the predictions to the prior selection clearly visible in Figure 1. A choice of strongly informative prior that is far from the underlying truth, such as $\mu_\theta = 0$ and $\sigma_\theta = 0.25$, can yield an especially poor fit even for large training sets.
Thus, in the absence of proper and meaningful prior distributions, an empirical Bayes approach may be preferable, besides its other advantages as discussed in this article. Overall, the empirical Bayes fit can be readily obtained in several minutes using standard numerical solvers, while sampling from posterior distributions can take hours. It took approximately 2 hours to obtain the posterior samples in the scenario with the largest sample size on a standard PC with 4 cores. For completeness, we also show the estimates of the calibration parameters and the noise scale under each method in Figure 2 and Table 1. Posterior means were taken as the estimates under the fully Bayesian solution. We can see a reasonable match between the approximate empirical Bayes method and the Metropolis-Hastings algorithm for many of the prior choices. The first notable difference is a series of outlying estimates of the calibration parameters under the MCMC-based fit. These are the consequence of the aforementioned strongly informative priors. The second difference is in terms of the noise scale estimate $\hat\sigma_n$. This is expected, since the estimate is unbiased only asymptotically.

Figure 2: The distribution of posterior means of the calibration parameters and the noise scale obtained with the Metropolis-Hastings algorithm. Unlike in Figure 1, the box plots were aggregated over all the prior choices. The values used to generate the simulation data were $(\theta_1, \theta_2) = (1.2, 1.8)$ and $\sigma_0 = 0.2$.
Table 1: The estimates of the calibration parameters and the noise scale under the empirical Bayes approach, with rows $\theta_1$, $\theta_2$, and $\hat\sigma_n$ and columns $L_{MLE}$ and $L_{CV(10)}$ for each of $n = s = 125$, $250$, and $500$. The values used to generate the simulation data were $(\theta_1, \theta_2) = (1.2, 1.8)$ and $\sigma_0 = 0.2$.
Figures 3 and 4 show that the loss in terms of UQ is negligible under the empirical Bayes approach as compared to the fully Bayesian treatment for all practical purposes. For clarity, we display only the results for inverse gamma priors with shape 3 and scale 1, gamma priors with shape 1 and rate 3, and normal priors with mean 0 and standard deviation 2. These are fairly non-informative priors. We can see that the empirical Bayes approach slightly overestimates the uncertainty for smaller sample sizes, but this quickly diminishes as the sample size increases. This is likely the consequence of the inflation of the noise scale given by the bias of $\hat\sigma_n$, which diminishes with the increasing sample size as expected. See Appendix F for additional figures of the empirical Bayes fit at the time locations $t = 0$, $t = 0.71$, and $t = 1$, among others.
Figure 3: Details of 95% credible bands of posterior predictive distributions under the empirical Bayes approach ($L_{MLE}$ and $L_{CV(10)}$) and the fully Bayesian approach of the Metropolis-Hastings algorithm, plotted at $t = 0.71$ for $n = s = 125$, $250$, and $500$.

Figure 4: Comparison of the convergence to the true physical process $\zeta(t, x)$ under the empirical Bayes approach and the fully Bayesian implementation given by the Metropolis-Hastings algorithm. The dashed line represents the true process $\zeta$, and the solid line corresponds to the mean of the posterior predictive distributions under the respective method. The curves with 95% credible intervals (shaded area) are plotted at $t = 0.71$.

Nuclear physics is one of many fields that has recently experienced a surge in the applications of Bayesian statistics, due to its intuitive way of describing uncertainties probabilistically. GP modeling and its variants have been prominently used in the context of computationally expensive theoretical mass models, for either emulation or modeling of systematic discrepancies, to produce precise and quantified predictions of nuclear observables [16, 31, 30, 39].

To illustrate our framework for computer enabled predictions on a real data example, we shall consider the 4-parameter Liquid Drop Model (LDM) [29, 21, 4] of nuclear binding energy, which is the minimum energy needed to break the nucleus of an atom into free protons and neutrons. It is equivalent (energy-mass equivalence explained by $E = mc^2$) to the mass defect, which corresponds to the difference between the mass number of a nucleus and its actual measured mass. This difference is caused by the energy released in the event of the atom's creation. The LDM is a simple yet reasonably accurate description of the atomic nucleus given by the semi-empirical mass formula:
$$E_B(N, Z) = \theta_{vol} A - \theta_{surf} A^{2/3} - \theta_{sym} \frac{(N - Z)^2}{A} - \theta_C \frac{Z(Z - 1)}{A^{1/3}}. \qquad (28)$$
The LDM is a function of the proton number $Z$ and the neutron number $N$ ($A = Z + N$ is the mass number) that depends on a set of parameters $\theta = (\theta_{vol}, \theta_{surf}, \theta_{sym}, \theta_C)$. These have physical meanings representing the volume, surface, symmetry, and Coulomb energies (see [25] for details). The semi-empirical mass formula is a particularly suitable example because it provides a good fit for heavy nuclei and a somewhat poor fit for light nuclei. This clearly points to the existence of a systematic model discrepancy that is also supported in the literature [37, 51, 18].

We now present an analysis of 595 experimental binding energies of even-even nuclei from the AME2003 dataset [2] (publicly available at http://amdc.impcas.ac.cn/web/masstab.html) randomly divided into a training set of 450 nuclei and a testing set of 145 nuclei.
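A direct Python transcription of the semi-empirical mass formula (28) (our own sketch; the coefficient values below are illustrative textbook-scale placeholders, not the estimates reported in this paper):

```python
def ldm_binding_energy(Z, N, theta_vol, theta_surf, theta_sym, theta_C):
    """Semi-empirical mass formula (28) for the binding energy in MeV."""
    A = Z + N
    return (theta_vol * A
            - theta_surf * A ** (2 / 3)
            - theta_sym * (N - Z) ** 2 / A
            - theta_C * Z * (Z - 1) / A ** (1 / 3))

# illustrative coefficients in MeV (not fitted here)
print(ldm_binding_energy(Z=50, N=70, theta_vol=15.8, theta_surf=18.3,
                         theta_sym=23.2, theta_C=0.714))
```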
Figure 5: Binding energies of even-even nuclei in the AME2003 dataset divided into the testing and training datasets, shown over the proton number $Z$ and the neutron number $N$.
We consider the statistical model (2) and model the systematic discrepancy $\delta$ with a zero mean GP and the isotropic squared exponential covariance function. For the purpose of this example, we also assume that the LDM is computationally expensive (or not directly accessible) and regard it as an unknown function of $(Z, N)$ and $\theta$. Similarly to the discrepancy $\delta$, we assign a GP prior to $E_B(N, Z)$ with zero mean and the isotropic squared exponential covariance function. To this extent, we additionally generated a set of 900 model evaluations using the Latin hypercube design over the space spanning all reasonable values of the parameters $\theta$ as given by the nuclear physics literature [48, 6, 29, 21, 4]. The corresponding nuclear configurations, the inputs $(Z, N)$, were randomly assigned to the generated values of $\theta$ from a set of two times duplicated training nuclei. We also want to point out that this is not the first application of GP modeling in the context of the LDM; see [5] for instance. We conducted a similar study previously using a fully Bayesian approach with posterior distributions approximated through variational inference [17].

The predictions of nuclear binding energies were computed as the means of the posterior predictive distribution (7) conditioned on the estimates of the calibration parameters $\theta$, the GP hyperparameters $\phi$, and the noise scale $\hat\sigma_n$. The estimates for $(\theta, \phi)$ were obtained numerically as the minimizers of $L_{MLE}$ and $L_{CV(10)}$. The priors for the GP hyperparameters in the case of the fully Bayesian treatment are discussed in Appendix G.

Table 2: The RMSEs of the predictions evaluated on 145 even-even nuclei from the AME2003 dataset, with rows $L_{MLE}$, $L_{CV(10)}$, and the Metropolis-Hastings algorithm, and columns for the parameter estimates ($\theta_{vol}$, $\theta_{surf}$, $\theta_{sym}$, $\theta_C$) and the testing error RMSE (MeV). The posterior means are shown in the case of the Metropolis-Hastings algorithm.
Table 2 gives the RMSE values calculated on the testing set of 145 even-even nuclei for the empirical Bayes approach and also the Metropolis-Hastings algorithm. The calibration parameter estimates are also provided, with values that do not significantly differ between the methods considered, while the empirical Bayes fit again required only a small fraction of the time needed to draw the posterior samples.
6. Conclusion
We presented and studied an empirical Bayes approach to prediction of physical quantities using a computer model, where we assumed that the computer model under consideration needs to be calibrated and is computationally too expensive to be used directly for inference. To this extent, we proposed a GP emulator and utilized the structural convenience of GPs to formulate closed-form and easy-to-compute predictions of new observations from a physical process. These predictions are obtained through conditional predictive distributions with plugged-in estimates of calibration parameters, GP hyperparameters, and the experimental noise scale. A strongly consistent estimator for the noise scale and two sensible estimators for the remaining parameters (defined as minimizers of two alternative loss functions) were provided.

Theoretical study and justification of the proposed methodology were also given: we revisited hierarchical models and established an equivalent representation of the framework of [19] as a nonparametric regression model with a GP prior for an unknown function corresponding to the underlying physical process. Consequently, we derived a non-trivial extension of Schwartz's theorem for nonparametric regression problems. The application of this result shows that our method consistently estimates the underlying true physical process, assuming smoothness of the mean and covariance functions of the GP priors and the existence of a strongly consistent estimator of the noise scale. To the best of our knowledge, this is the first such posterior consistency result under the original model of [19].

A simulation study that empirically supports the consistency result was given in Section 5. The speed and efficiency of the empirical Bayes approach was demonstrated in comparison to the fully Bayesian approach of the Metropolis-Hastings algorithm. Both methods yield comparable results in terms of UQ and quality of the predictions; however, the Metropolis-Hastings algorithm is significantly slower and its implementation requires considerable effort. Additionally, our sensitivity study strongly suggests that the empirical Bayes approach may be preferable in the absence of proper and meaningful prior distributions. Finally, to show the opportunities given by our method for practitioners, we analyzed a dataset of experimental binding energies using the Liquid Drop Model.

The general framework presented in this paper can be viewed as a fast and computationally efficient approximation to the sampling based fully Bayesian approach for calibration of computer models that neglects some uncertainty of the unknown parameters. Our empirical studies show that this loss becomes quickly negligible with the increasing size of datasets.
References

[1] Amewou-Atisso, M., Ghosal, S., Ghosh, J. K., and Ramamoorthi, R. (2003). Posterior consistency for semi-parametric regression problems. Bernoulli, 9(2):291–312.
[2] Audi, G., Wapstra, A., and Thibault, C. (2003). The AME2003 atomic mass evaluation: (II). Tables, graphs and references. Nuclear Physics A, 729:337–676.
[3] Bayarri, M. J., Berger, J. O., Paulo, R., Sacks, J., Cafeo, J. A., Cavendish, J., Lin, C.-H., and Tu, J. (2007). A framework for validation of computer models. Technometrics, 49:138–154.
[4] Benzaid, D., Bentridi, S., Kerraci, A., and Amrani, N. (2020). Bethe–Weizsäcker semiempirical mass formula coefficients 2019 update based on AME2016. Nucl. Sci. Tech., 31:9.
[5] Bertsch, G. F. and Bingham, D. (2017). Estimating parameter uncertainty in binding-energy models by the frequency-domain bootstrap. Physical Review Letters, 119:252501.
[6] Bethe, H. A. and Bacher, R. F. (1936). Nuclear physics A. Stationary states of nuclei. Rev. Mod. Phys., 8:82–229.
[7] Brynjarsdóttir, J. and O'Hagan, A. (2014). Learning about physical parameters: the importance of model discrepancy. Inverse Problems, 30:114007.
[8] Choi, T. (2007). Alternative posterior consistency results in nonparametric binary regression using Gaussian process priors. Journal of Statistical Planning and Inference, 137(9):2975–2983.
[9] Choi, T. and Schervish, M. J. (2007a). On posterior consistency in nonparametric regression problems. Journal of Multivariate Analysis, 98(10):1969–1987.
[10] Choi, T. and Schervish, M. J. (2007b). Posterior Consistency in Nonparametric Regression Problems under Gaussian Process Priors.
[11] Fayans, S. A. (1998). Towards a universal nuclear density functional. Journal of Experimental and Theoretical Physics Letters, 68(3):169–174.
[12] Gelman, A., Carlin, J., Stern, H., Dunson, D., Vehtari, A., and Rubin, D. (2013). Bayesian Data Analysis. CRC Press, third edition.
[13] Ghosal, S. and Roy, A. (2006). Posterior consistency of Gaussian process prior for nonparametric binary regression. Ann. Statist., 34(5):2413–2429.
[14] Gu, M. and Wang, L. (2018). Scaled Gaussian stochastic process for computer model calibration and prediction. SIAM/ASA Journal on Uncertainty Quantification, 6(4):1555–1583.
[15] Higdon, D., Gattiker, J., Williams, B., and Rightley, M. (2008). Computer model calibration using high-dimensional output. Journal of the American Statistical Association, 103:570–583.
[16] Higdon, D., McDonnell, J. D., Schunck, N., Sarich, J., and Wild, S. M. (2015). A Bayesian approach for parameter estimation and prediction using a computationally intensive model. Journal of Physics G: Nuclear and Particle Physics, 42(3):034009.
[17] Kejzlar, V. and Maiti, T. (2020). Variational inference with vine copulas: An efficient approach for Bayesian computer model calibration.
[18] Kejzlar, V., Neufcourt, L., Nazarewicz, W., and Reinhard, P.-G. (2020). Statistical aspects of nuclear mass models. Journal of Physics G: Nuclear and Particle Physics, 47(9):094001.
[19] Kennedy, M. C. and O'Hagan, A. (2001). Bayesian calibration of computer models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63:425–464.
[20] King, G. B., Lovell, A. E., Neufcourt, L., and Nunes, F. M. (2019). Direct comparison between Bayesian and frequentist uncertainty quantification for nuclear reactions. Physical Review Letters, 122:232502.
[21] Kirson, M. W. (2008). Mutual influence of terms in a semi-empirical mass formula. Nucl. Phys. A, 798(1):29–60.
[22] Kortelainen, M., Lesinski, T., Moré, J. J., Nazarewicz, W., Sarich, J., Schunck, N., Stoitsov, M. V., and Wild, S. M. (2010). Nuclear energy density optimization. Physical Review C, 82(2):024313.
[23] Kortelainen, M., McDonnell, J., Nazarewicz, W., Olsen, E., Reinhard, P.-G., Sarich, J., Schunck, N., Wild, S. M., Davesne, D., Erler, J., and Pastore, A. (2014). Nuclear energy density optimization: Shell structure. Phys. Rev. C, 89:054314.
[24] Kortelainen, M., McDonnell, J., Nazarewicz, W., Reinhard, P.-G., Sarich, J., Schunck, N., Stoitsov, M. V., and Wild, S. M. (2012). Nuclear energy density optimization: large deformations. Physical Review C, 85:024304.
[25] Krane, K. (1987). Introductory Nuclear Physics. Wiley.
[26] Martino, L., Laparra, V., and Camps-Valls, G. (2017). Probabilistic cross-validation estimators for Gaussian process regression. In , pages 823–827.
[27] McDonnell, J. D., Schunck, N., Higdon, D., Sarich, J., Wild, S. M., and Nazarewicz, W. (2015). Uncertainty quantification for nuclear density functional theory and information content of new measurements. Physical Review Letters, 114(12):122501.
[28] Morris, M. D. and Mitchell, T. J. (1995). Exploratory designs for computational experiments. Journal of Statistical Planning and Inference, 43(3):381–402.
[29] Myers, W. D. and Swiatecki, W. J. (1966). Nuclear masses and deformations. Nucl. Phys., 81(2):1–60.
[30] Neufcourt, L., Cao, Y., Nazarewicz, W., Olsen, E., and Viens, F. (2019). Neutron drip line in the Ca region from Bayesian model averaging. Physical Review Letters, 122:062502.
[31] Neufcourt, L., Cao, Y., Nazarewicz, W., and Viens, F. (2018). Bayesian approach to model-based extrapolation of nuclear observables. Physical Review C, 98:034318.
[32] Plumlee, M. (2017). Bayesian calibration of inexact computer models. Journal of the American Statistical Association, 112:1274–1285.
[33] Plumlee, M. (2019). Computer model calibration with confidence and consistency. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 81(3):519–545.
[34] Plumlee, M., Joseph, V. R., and Yang, H. (2016). Calibrating functional parameters in the ion channel models of cardiac cells. Journal of the American Statistical Association, 111:500–509.
[35] Pollard, D., Chang, W., Haran, M., Applegate, P., and DeConto, R. (2016). Large ensemble modeling of the last deglacial retreat of the West Antarctic Ice Sheet: comparison of simple and advanced statistical techniques. Geoscientific Model Development, 9(5):1697–1723.
[36] Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. Cambridge, MA: MIT Press.
[37] Reinhard, P.-G., Bender, M., Nazarewicz, W., and Vertse, T. (2006). From finite nuclei to the nuclear liquid drop: Leptodermous expansion based on self-consistent mean-field theory. Phys. Rev. C, 73:014309.
[38] Schaeffer, D. G. and Cain, J. W. (2016). Nonlinear Systems: Local Theory, pages 79–109. Springer New York, New York, NY.
[39] Schunck, N., O'Neal, J., Grosskopf, M., Lawrence, E., and Wild, S. M. (2020). Calibration of energy density functionals with deformed nuclei. Journal of Physics G: Nuclear and Particle Physics.
[40] Sexton, D. M. H., Murphy, J. M., Collins, M., and Webb, M. J. (2012). Multivariate probabilistic projections using imperfect climate models Part I: outline of methodology. Climate Dynamics, 38(11):2513–2542.
[41] Shiryaev, A. N. (1996). Convergence of Probability Measures. Central Limit Theorem, pages 308–378. Springer New York, New York, NY.
[42] Sundararajan, S. and Keerthi, S. S. (2001). Predictive approaches for choosing hyperparameters in Gaussian processes. Neural Computation, 13(5):1103–1118.
[43] Tokdar, S. T. and Ghosh, J. K. (2007). Posterior consistency of logistic Gaussian process priors in density estimation. Journal of Statistical Planning and Inference, 137(1):34–42.
[44] Tuo, R. and Wu, C. F. J. (2015). Efficient calibration for imperfect computer models. The Annals of Statistics, 43:2331–2352.
[45] Tuo, R. and Wu, C. F. J. (2016). A theoretical framework for calibration in computer models: parametrization, estimation and convergence properties. SIAM/ASA Journal on Uncertainty Quantification, 4:767–795.
[46] van der Vaart, A. and Wellner, J. (1996). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer Series in Statistics. Springer.
[47] Wahba, G. (1990). Spline Models for Observational Data. Society for Industrial and Applied Mathematics.
[48] Weizsäcker, C. F. v. (1935). Zur Theorie der Kernmassen. Z. Phys., 96(7):431–458.
[49] Williams, B., Higdon, D., Gattiker, J., Moore, L., McKay, M., and Keller-McNulty, S. (2006). Combining experimental data and computer simulations, with an application to flyer plate experiments. Bayesian Analysis, 1(4):765–792.
[50] Xie, F. and Xu, Y. (2020). Bayesian projected calibration of computer models. Journal of the American Statistical Association, 0(0):1–18.
[51] Yuan, C. (2016). Uncertainty decomposition method and its application to the liquid drop model. Phys. Rev. C, 93:034310.
[52] Zhang, L., Jiang, Z., Choi, J., Lim, C.-Y., Maiti, T., and Baek, S. (2019). Patient-specific prediction of abdominal aortic aneurysm expansion using Bayesian calibration. IEEE Journal of Biomedical and Health Informatics.

Appendix A. Equivalency of the hierarchical model
To establish the equivalency between the Bayesian model given by the data likelihood $p(\boldsymbol{d} \mid \boldsymbol{\theta}, \boldsymbol{\phi}, \sigma^2)$ and the hierarchical model (see Section 4), we need to show that the following equality holds:
$$p(\boldsymbol{d} \mid \boldsymbol{\theta}, \boldsymbol{\phi}, \sigma^2) = \int_{\boldsymbol{\zeta}} \prod_{i=1}^{n} p(y_i \mid \zeta_i, \sigma^2)\, p(\boldsymbol{\zeta}, \boldsymbol{z} \mid \boldsymbol{\theta}, \boldsymbol{\phi})\, d\boldsymbol{\zeta}, \tag{29}$$
where $\boldsymbol{\zeta} = (\zeta(t_1), \ldots, \zeta(t_n)) = (\zeta_1, \ldots, \zeta_n)$ and $p(\boldsymbol{\zeta}, \boldsymbol{z} \mid \boldsymbol{\theta}, \boldsymbol{\phi})$ is the multivariate normal distribution with mean $M(\boldsymbol{\theta}, \boldsymbol{\phi})$ (see (5)) and covariance
$$K_p(\boldsymbol{\theta}, \boldsymbol{\phi}) = \begin{pmatrix} K_f(T_y(\boldsymbol{\theta}), T_y(\boldsymbol{\theta})) + K_\delta(T_y, T_y) & K_f(T_y(\boldsymbol{\theta}), T_z(\widetilde{\boldsymbol{\theta}})) \\ K_f(T_z(\widetilde{\boldsymbol{\theta}}), T_y(\boldsymbol{\theta})) & K_f(T_z(\widetilde{\boldsymbol{\theta}}), T_z(\widetilde{\boldsymbol{\theta}})) \end{pmatrix} = \begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix}.$$
For the ease of notation, let us now write $M(\boldsymbol{\theta}, \boldsymbol{\phi}) = (M_y^T, M_z^T)^T$. Then
\begin{align*}
\int_{\boldsymbol{\zeta}} \prod_{i=1}^{n} p(y_i \mid \zeta_i, \sigma^2)\, p(\boldsymbol{\zeta}, \boldsymbol{z} \mid \boldsymbol{\theta}, \boldsymbol{\phi})\, d\boldsymbol{\zeta}
&= \int_{\boldsymbol{\zeta}} \frac{1}{(2\pi)^{n/2}|\sigma^2 I_n|^{1/2}} \exp\left(-\frac{1}{2}(\boldsymbol{y} - \boldsymbol{\zeta})^T(\sigma^2 I_n)^{-1}(\boldsymbol{y} - \boldsymbol{\zeta})\right) \\
&\quad \times \frac{1}{(2\pi)^{(n+s)/2}|K_p|^{1/2}} \exp\left(-\frac{1}{2}\begin{pmatrix}\boldsymbol{\zeta} - M_y \\ \boldsymbol{z} - M_z\end{pmatrix}^T K_p^{-1} \begin{pmatrix}\boldsymbol{\zeta} - M_y \\ \boldsymbol{z} - M_z\end{pmatrix}\right) d\boldsymbol{\zeta} \\
&= \frac{1}{(2\pi)^{(n+s)/2}|K|^{1/2}} \exp\left(-\frac{1}{2}\begin{pmatrix}\boldsymbol{y} - M_y \\ \boldsymbol{z} - M_z\end{pmatrix}^T K^{-1} \begin{pmatrix}\boldsymbol{y} - M_y \\ \boldsymbol{z} - M_z\end{pmatrix}\right) \\
&\quad \times \int_{\boldsymbol{\zeta}} \frac{|K|^{1/2}}{(2\pi)^{n/2}|\sigma^2 I_n|^{1/2}|K_p|^{1/2}} \exp\left(-\frac{1}{2}(\boldsymbol{y} - \boldsymbol{\zeta})^T(\sigma^2 I_n)^{-1}(\boldsymbol{y} - \boldsymbol{\zeta}) \right. \\
&\qquad \left. -\frac{1}{2}\begin{pmatrix}\boldsymbol{\zeta} - M_y \\ \boldsymbol{z} - M_z\end{pmatrix}^T K_p^{-1} \begin{pmatrix}\boldsymbol{\zeta} - M_y \\ \boldsymbol{z} - M_z\end{pmatrix} + \frac{1}{2}\begin{pmatrix}\boldsymbol{y} - M_y \\ \boldsymbol{z} - M_z\end{pmatrix}^T K^{-1} \begin{pmatrix}\boldsymbol{y} - M_y \\ \boldsymbol{z} - M_z\end{pmatrix}\right) d\boldsymbol{\zeta},
\end{align*}
where $K = K_p + \begin{pmatrix}\sigma^2 I_n & 0 \\ 0 & 0\end{pmatrix}$ is the marginal covariance of $(\boldsymbol{y}, \boldsymbol{z})$. The integral is equal to 1 since it is an integration of a multivariate normal probability density function over $\boldsymbol{\zeta}$ with covariance $\left((\sigma^2 I_n)^{-1} + (C_{11} - C_{12}C_{22}^{-1}C_{21})^{-1}\right)^{-1}$. Namely, writing $A = C_{11} - C_{12}C_{22}^{-1}C_{21}$ and $B = \sigma^2 I_n$, the Schur complement identity for determinants gives
$$\frac{|K|^{1/2}}{|\sigma^2 I_n|^{1/2}|K_p|^{1/2}} = \frac{|C_{22}|^{1/2}\,|C_{11} + \sigma^2 I_n - C_{12}C_{22}^{-1}C_{21}|^{1/2}}{|\sigma^2 I_n|^{1/2}\,|C_{22}|^{1/2}\,|C_{11} - C_{12}C_{22}^{-1}C_{21}|^{1/2}} = \frac{|A + B|^{1/2}}{|A|^{1/2}|B|^{1/2}} = |A^{-1}(A + B)B^{-1}|^{1/2} = |B^{-1} + A^{-1}|^{1/2} = \frac{1}{|(B^{-1} + A^{-1})^{-1}|^{1/2}},$$
which is exactly the normalizing constant of a Gaussian density in $\boldsymbol{\zeta}$ with covariance $(B^{-1} + A^{-1})^{-1}$. Lastly, considering the block notation $K_p^{-1} = \begin{pmatrix}C^{11} & C^{12} \\ C^{21} & C^{22}\end{pmatrix}$, we have
$$\exp\left(-\frac{1}{2}(\boldsymbol{y} - \boldsymbol{\zeta})^T(\sigma^2 I_n)^{-1}(\boldsymbol{y} - \boldsymbol{\zeta}) - \frac{1}{2}\begin{pmatrix}\boldsymbol{\zeta} - M_y \\ \boldsymbol{z} - M_z\end{pmatrix}^T K_p^{-1}\begin{pmatrix}\boldsymbol{\zeta} - M_y \\ \boldsymbol{z} - M_z\end{pmatrix} + \frac{1}{2}\begin{pmatrix}\boldsymbol{y} - M_y \\ \boldsymbol{z} - M_z\end{pmatrix}^T K^{-1}\begin{pmatrix}\boldsymbol{y} - M_y \\ \boldsymbol{z} - M_z\end{pmatrix}\right) \propto \exp\left(-\frac{1}{2}\boldsymbol{\zeta}^T\left((\sigma^2 I_n)^{-1} + C^{11}\right)\boldsymbol{\zeta} + \boldsymbol{\zeta}^T b\right),$$
where $C^{11} = (C_{11} - C_{12}C_{22}^{-1}C_{21})^{-1}$ and $b$ is a constant column vector that does not depend on $\boldsymbol{\zeta}$. This shows that the integral is indeed equal to 1 as stated, and the equality (29) holds.
Appendix B. Proof of Theorem 1

Note that for any $\epsilon > 0$, the posterior probability of interest $p(\zeta \in U_n^C \mid y_1, \ldots, y_n, \theta, \phi, \hat{\sigma}^2_n)$ can be bounded from above as
$$p(\zeta \in U_n^C \mid y_1, \ldots, y_n, \theta, \phi, \hat{\sigma}^2_n) \leq p(\zeta \in U_n^C \mid y_1, \ldots, y_n, \theta, \phi, \hat{\sigma}^2_n)\, \mathbb{1}_{\{|\hat{\sigma}^2_n/\sigma^2_0 - 1| \leq \epsilon\}} + \mathbb{1}_{\{|\hat{\sigma}^2_n/\sigma^2_0 - 1| > \epsilon\}},$$
where
\begin{align*}
p(\zeta \in U_n^C \mid \cdot)\, \mathbb{1}_{\{|\hat{\sigma}^2_n/\sigma^2_0 - 1| \leq \epsilon\}}
&\leq \Phi_n + (1 - \Phi_n)\, \frac{\int_{U_n^C \cap F_n} \prod_{i=1}^{n} \frac{p(y_i \mid \zeta_i, \hat{\sigma}^2_n)}{p(y_i \mid \zeta_{0,i}, \sigma^2_0)}\, \mathbb{1}_{\{|\hat{\sigma}^2_n/\sigma^2_0 - 1| \leq \epsilon\}}\, d\Pi(\zeta \mid \theta, \phi)}{\int_{F} \prod_{i=1}^{n} \frac{p(y_i \mid \zeta_i, \hat{\sigma}^2_n)}{p(y_i \mid \zeta_{0,i}, \sigma^2_0)}\, d\Pi(\zeta \mid \theta, \phi)} \\
&\quad + \frac{\int_{U_n^C \cap F_n^C} \prod_{i=1}^{n} \frac{p(y_i \mid \zeta_i, \hat{\sigma}^2_n)}{p(y_i \mid \zeta_{0,i}, \sigma^2_0)}\, \mathbb{1}_{\{|\hat{\sigma}^2_n/\sigma^2_0 - 1| \leq \epsilon\}}\, d\Pi(\zeta \mid \theta, \phi)}{\int_{F} \prod_{i=1}^{n} \frac{p(y_i \mid \zeta_i, \hat{\sigma}^2_n)}{p(y_i \mid \zeta_{0,i}, \sigma^2_0)}\, d\Pi(\zeta \mid \theta, \phi)} \\
&= \Phi_n + \frac{I_{1n}(y_1, \ldots, y_n, \theta, \phi, \hat{\sigma}^2_n, \epsilon)}{I_{3n}(y_1, \ldots, y_n, \theta, \phi, \hat{\sigma}^2_n)} + \frac{I_{2n}(y_1, \ldots, y_n, \theta, \phi, \hat{\sigma}^2_n, \epsilon)}{I_{3n}(y_1, \ldots, y_n, \theta, \phi, \hat{\sigma}^2_n)}.
\end{align*}
Since the assumption (A3) implies that $\mathbb{1}_{\{|\hat{\sigma}^2_n/\sigma^2_0 - 1| > \epsilon\}} \to 0$ a.s. $P_0$, it is enough to show that there exists $\epsilon > 0$ such that
\begin{align}
\sup_{(\theta, \phi) \in \Upsilon} \Phi_n &\to 0 \ \text{a.s.}\ P_0, \tag{30} \\
\sup_{(\theta, \phi) \in \Upsilon} e^{\beta_1 n}\, I_{1n}(y_1, \ldots, y_n, \theta, \phi, \hat{\sigma}^2_n, \epsilon) &\to 0 \ \text{a.s.}\ P_0 \ \text{for some } \beta_1 > 0, \tag{31} \\
\sup_{(\theta, \phi) \in \Upsilon} e^{\beta_2 n}\, I_{2n}(y_1, \ldots, y_n, \theta, \phi, \hat{\sigma}^2_n, \epsilon) &\to 0 \ \text{a.s.}\ P_0 \ \text{for some } \beta_2 > 0, \tag{32} \\
\inf_{(\theta, \phi) \in \Upsilon} e^{\beta n}\, I_{3n}(y_1, \ldots, y_n, \theta, \phi, \hat{\sigma}^2_n) &\to \infty \ \text{a.s.}\ P_0 \ \text{for some } \beta > 0, \tag{33}
\end{align}
where $\beta \leq \min\{\beta_1, \beta_2\}$. The rest of the proof follows the general steps of the proof of Theorem 1 in [9] and Theorem 9 in [8], with some non-trivial treatment of the constant $\epsilon$. We provide step-by-step details below.

Step 1). By Markov's inequality, for any $\rho > 0$,
$$\sum_{n=1}^{\infty} P_0(\Phi_n > \rho) \leq \frac{1}{\rho} \sum_{n=1}^{\infty} E_{\zeta_0, \sigma^2_0} \Phi_n,$$
which, due to the condition (i) of (A2) and the first Borel–Cantelli lemma, yields $\Phi_n \to 0$ a.s. $P_0$. Since this does not depend on $(\theta, \phi)$, it implies (30).

Step 2). By Fubini's theorem, for any $0 < \epsilon < \tilde{\epsilon}$,
\begin{align*}
E_{\zeta_0, \sigma^2_0}\big(I_{1n}(y_1, \ldots, y_n, \theta, \phi, \hat{\sigma}^2_n, \epsilon)\big)
&= E_{\zeta_0, \sigma^2_0}\left[(1 - \Phi_n) \int_{U_n^C \cap F_n} \prod_{i=1}^{n} \frac{p(y_i \mid \zeta_i, \hat{\sigma}^2_n)}{p(y_i \mid \zeta_{0,i}, \sigma^2_0)}\, \mathbb{1}_{\{|\hat{\sigma}^2_n/\sigma^2_0 - 1| \leq \epsilon\}}\, d\Pi(\zeta \mid \theta, \phi)\right] \\
&= \int_{U_n^C \cap F_n} \int (1 - \Phi_n) \prod_{i=1}^{n} \frac{p(y_i \mid \zeta_i, \hat{\sigma}^2_n)}{p(y_i \mid \zeta_{0,i}, \sigma^2_0)}\, \mathbb{1}_{\{|\hat{\sigma}^2_n/\sigma^2_0 - 1| \leq \epsilon\}}\, dP_0\, d\Pi(\zeta \mid \theta, \phi) \\
&\leq \left(\frac{\sigma^2_0(1 - \epsilon)}{\sigma^2_0(1 + \epsilon)}\right)^{-n} \int_{U_n^C \cap F_n} E_{\zeta, \sigma^2_0(1+\epsilon)}[(1 - \Phi_n)]\, d\Pi(\zeta \mid \theta, \phi) \\
&\leq \left(\frac{1 - \epsilon}{1 + \epsilon}\right)^{-n} \sup_{\zeta \in U_n^C \cap F_n} E_{\zeta, \sigma^2_0(1+\epsilon)}[(1 - \Phi_n)] \leq \left(\frac{1 - \epsilon}{1 + \epsilon}\right)^{-n} C_1 e^{-c_1 n} = C_1 e^{-\tilde{c}_\epsilon n},
\end{align*}
where $\tilde{c}_\epsilon = c_1 + \log(1 - \epsilon) - \log(1 + \epsilon)$, and the condition (iii) of (A2) guarantees that $\tilde{c}_\epsilon > 0$ for $\epsilon$ small enough. By Markov's inequality,
$$P_0\left\{I_{1n}(y_1, \ldots, y_n, \theta, \phi, \hat{\sigma}^2_n, \epsilon) \geq e^{-\tilde{c}_\epsilon n/2}\right\} \leq C_1 e^{\tilde{c}_\epsilon n/2}\, e^{-\tilde{c}_\epsilon n} = C_1 e^{-\tilde{c}_\epsilon n/2}.$$
Therefore, for any $0 < \epsilon < \tilde{\epsilon}$ there exists a constant $\tilde{c}_\epsilon > 0$ for which the first Borel–Cantelli lemma implies that, a.s. $P_0$, $I_{1n} < e^{-\tilde{c}_\epsilon n/2}$ for all $n$ large enough, and hence $e^{\beta_1 n} I_{1n} \to 0$ a.s. $P_0$ for any $0 < \beta_1 < \tilde{c}_\epsilon/2$. Since none of these bounds depend on $(\theta, \phi)$, this implies (31).

Step 3). Proceeding as in Step 2), Fubini's theorem implies
\begin{align*}
E_{\zeta_0, \sigma^2_0}\big(I_{2n}(y_1, \ldots, y_n, \theta, \phi, \hat{\sigma}^2_n, \epsilon)\big)
&= E_{\zeta_0, \sigma^2_0}\left[\int_{U_n^C \cap F_n^C} \prod_{i=1}^{n} \frac{p(y_i \mid \zeta_i, \hat{\sigma}^2_n)}{p(y_i \mid \zeta_{0,i}, \sigma^2_0)}\, \mathbb{1}_{\{|\hat{\sigma}^2_n/\sigma^2_0 - 1| \leq \epsilon\}}\, d\Pi(\zeta \mid \theta, \phi)\right] \\
&\leq \left(\frac{1 - \epsilon}{1 + \epsilon}\right)^{-n} \int_{U_n^C \cap F_n^C} E_{\zeta, \sigma^2_0(1+\epsilon)}[1]\, d\Pi(\zeta \mid \theta, \phi) \leq \left(\frac{1 - \epsilon}{1 + \epsilon}\right)^{-n} \Pi(F_n^C \mid \theta, \phi).
\end{align*}
The condition (ii) of (A2) and the first Borel–Cantelli lemma then imply that for any $\epsilon < \frac{1 - e^{-c_2}}{1 + e^{-c_2}}$,
$$\sup_{(\theta, \phi) \in \Upsilon} e^{\beta_2 n}\, I_{2n}(y_1, \ldots, y_n, \theta, \phi, \hat{\sigma}^2_n, \epsilon) \to 0 \ \text{a.s.}\ P_0$$
for any $0 < \beta_2 < \tilde{k}_\epsilon$, where $\tilde{k}_\epsilon = c_2 + \log(1 - \epsilon) - \log(1 + \epsilon) > 0$.

Step 4). To prove (33), take any $0 < \rho < 1$. The assumption (A3) implies that, a.s. $P_0$, $|\hat{\sigma}^2_n/\sigma^2_0 - 1| \leq \rho$ for all $n$ large enough, so that for such $n$
$$I_{3n}(y_1, \ldots, y_n, \theta, \phi, \hat{\sigma}^2_n) \geq \left(\frac{1 - \rho}{1 + \rho}\right)^{n} \int_{F} \prod_{i=1}^{n} \frac{p(y_i \mid \zeta_i, \sigma^2_0(1 - \rho))}{p(y_i \mid \zeta_{0,i}, \sigma^2_0)}\, d\Pi(\zeta \mid \theta, \phi).$$
Let us now define $\log_+(x) = \max\{0, \log(x)\}$ and $\log_-(x) = -\min\{0, \log(x)\}$ as well as
$$W_i = \log_+ \frac{p(y_i \mid \zeta_{0,i}, \sigma^2_0)}{p(y_i \mid \zeta_i, \sigma^2_0(1 - \rho))}, \quad K_i^+(\zeta_0, \zeta) = \int p(y_i \mid \zeta_{0,i}, \sigma^2_0)\, \log_+ \frac{p(y_i \mid \zeta_{0,i}, \sigma^2_0)}{p(y_i \mid \zeta_i, \sigma^2_0(1 - \rho))}\, dy_i,$$
$$K_i^-(\zeta_0, \zeta) = \int p(y_i \mid \zeta_{0,i}, \sigma^2_0)\, \log_- \frac{p(y_i \mid \zeta_{0,i}, \sigma^2_0)}{p(y_i \mid \zeta_i, \sigma^2_0(1 - \rho))}\, dy_i.$$
Then
\begin{align*}
\mathrm{Var}_{\zeta_0, \sigma^2_0}(W_i) &= E_{\zeta_0, \sigma^2_0}(W_i^2) - \{K_i^+(\zeta_0, \zeta)\}^2 \leq E_{\zeta_0, \sigma^2_0}(W_i^2) - \{K_i(\zeta_0, \zeta)\}^2 \\
&\leq E_{\zeta_0, \sigma^2_0}(W_i^2) + \int p(y_i \mid \zeta_{0,i}, \sigma^2_0)\left(\log_- \frac{p(y_i \mid \zeta_{0,i}, \sigma^2_0)}{p(y_i \mid \zeta_i, \sigma^2_0(1 - \rho))}\right)^2 dy_i - \{K_i(\zeta_0, \zeta)\}^2 \\
&= \int p(y_i \mid \zeta_{0,i}, \sigma^2_0)\left(\log \frac{p(y_i \mid \zeta_{0,i}, \sigma^2_0)}{p(y_i \mid \zeta_i, \sigma^2_0(1 - \rho))}\right)^2 dy_i - \{K_i(\zeta_0, \zeta)\}^2 = V_i(\zeta_0, \zeta).
\end{align*}
Hence, by the condition (i) of (A1), for any $\rho < \tilde{\epsilon}$ and $\zeta \in B$,
$$\sum_{i=1}^{\infty} \frac{\mathrm{Var}_{\zeta_0, \sigma^2_0}(W_i)}{i^2} \leq \sum_{i=1}^{\infty} \frac{V_i(\zeta_0, \zeta)}{i^2} < \infty,$$
and by the Kolmogorov strong law of large numbers for independent non-identically distributed random variables (e.g. [41], Chapter 3),
$$\frac{1}{n}\sum_{i=1}^{n}\big(W_i - K_i^+(\zeta_0, \zeta)\big) \to 0 \ \text{a.s.}\ P_0.$$
As a result, for every $\zeta \in B$, with $P_0$ probability 1,
\begin{align*}
\liminf_{n \to \infty} \left(\frac{1}{n}\sum_{i=1}^{n} \log \frac{p(y_i \mid \zeta_i, \sigma^2_0(1 - \rho))}{p(y_i \mid \zeta_{0,i}, \sigma^2_0)}\right)
&= -\limsup_{n \to \infty} \left(\frac{1}{n}\sum_{i=1}^{n} \log \frac{p(y_i \mid \zeta_{0,i}, \sigma^2_0)}{p(y_i \mid \zeta_i, \sigma^2_0(1 - \rho))}\right) \\
&\geq -\limsup_{n \to \infty} \left(\frac{1}{n}\sum_{i=1}^{n} \log_+ \frac{p(y_i \mid \zeta_{0,i}, \sigma^2_0)}{p(y_i \mid \zeta_i, \sigma^2_0(1 - \rho))}\right) = -\limsup_{n \to \infty} \left(\frac{1}{n}\sum_{i=1}^{n} K_i^+(\zeta_0, \zeta)\right) \\
&\geq -\limsup_{n \to \infty} \left(\frac{1}{n}\sum_{i=1}^{n} K_i(\zeta_0, \zeta) + \frac{1}{n}\sum_{i=1}^{n} \sqrt{\frac{K_i(\zeta_0, \zeta)}{2}}\right) \\
&\geq -\limsup_{n \to \infty} \left(\frac{1}{n}\sum_{i=1}^{n} K_i(\zeta_0, \zeta) + \sqrt{\frac{\frac{1}{n}\sum_{i=1}^{n} K_i(\zeta_0, \zeta)}{2}}\right),
\end{align*}
where the second-to-last step uses $K_i^+ \leq K_i + \sqrt{K_i/2}$ and the last step uses the Cauchy–Schwarz inequality. Now let $\beta > 0$, choose $\Delta > 0$ so that $\Delta + \sqrt{\Delta/2} \leq \beta/2$, and set $C = B \cap \{\zeta : K_i(\zeta_0, \zeta) < \Delta \text{ for all } i\}$. By (A1) there exists $\tilde{\epsilon}$ so that $0 < \rho < \tilde{\epsilon}$ implies $\Pi(C \mid \theta, \phi) > 0$. For every $\zeta \in C$,
$$\liminf_{n \to \infty} \left(\frac{1}{n}\sum_{i=1}^{n} \log \frac{p(y_i \mid \zeta_i, \sigma^2_0(1 - \rho))}{p(y_i \mid \zeta_{0,i}, \sigma^2_0)}\right) \geq -\left(\Delta + \sqrt{\frac{\Delta}{2}}\right),$$
since $\frac{1}{n}\sum_{i=1}^{n} K_i(\zeta_0, \zeta) < \Delta$ for all $\zeta \in C$. Finally, for any $\rho < \min\left\{\tilde{\epsilon}, \frac{1 - e^{-\beta/2}}{1 + e^{-\beta/2}}\right\}$,
\begin{align*}
\liminf_{n \to \infty} e^{n\beta}\, I_{3n}(y_1, \ldots, y_n, \theta, \phi, \hat{\sigma}^2_n)
&\geq \liminf_{n \to \infty} e^{n\beta} \left(\frac{1 - \rho}{1 + \rho}\right)^{n} \int_{F} \prod_{i=1}^{n} \frac{p(y_i \mid \zeta_i, \sigma^2_0(1 - \rho))}{p(y_i \mid \zeta_{0,i}, \sigma^2_0)}\, d\Pi(\zeta \mid \theta, \phi) \\
&\geq \liminf_{n \to \infty} e^{n\beta} \left(\frac{1 - \rho}{1 + \rho}\right)^{n} \int_{C} \prod_{i=1}^{n} \frac{p(y_i \mid \zeta_i, \sigma^2_0(1 - \rho))}{p(y_i \mid \zeta_{0,i}, \sigma^2_0)}\, d\Pi(\zeta \mid \theta, \phi) \\
&\geq \int_{C} \liminf_{n \to \infty} e^{n\beta} \left(\frac{1 - \rho}{1 + \rho}\right)^{n} \prod_{i=1}^{n} \frac{p(y_i \mid \zeta_i, \sigma^2_0(1 - \rho))}{p(y_i \mid \zeta_{0,i}, \sigma^2_0)}\, d\Pi(\zeta \mid \theta, \phi) = \infty,
\end{align*}
where the last inequality follows from Fatou's lemma. Note that the resulting lower bound on $I_{3n}$ does not depend on $(\theta, \phi)$. Taking $\epsilon < \min\left\{\tilde{\epsilon}, \frac{1 - e^{-c_2}}{1 + e^{-c_2}}\right\}$ concludes the proof.
Appendix C. Proof of Lemma 1

Theorem 5 of [13] implies that there exist positive constants $C, d_0, d_1, \ldots, d_p$ so that for $i = 1, \ldots, p$,
\begin{align*}
P\left(\sup_{t \in [0,1]^p} |\zeta(t)| > M_n \,\Big|\, z, \theta, \phi\right) &\leq C e^{-d_0 M_n^2 / \rho_0^2(\theta, \phi)}, \\
P\left(\sup_{t \in [0,1]^p} \left|\frac{\partial}{\partial t_i}\zeta(t)\right| > M_n \,\Big|\, z, \theta, \phi\right) &\leq C e^{-d_i M_n^2 / \rho_i^2(\theta, \phi)}.
\end{align*}
Continuity of $\rho_i(\theta, \phi)$, for $i = 0, \ldots, p$, on the compact set $\Upsilon$ implies that they are uniformly bounded. Therefore, there exist universal constants $(\Xi_{0,1}, \Xi_{0,2}), \ldots, (\Xi_{p,1}, \Xi_{p,2})$ such that for $i = 0, \ldots, p$,
$$0 < \Xi_{i,1} \leq \sup_{(\theta, \phi) \in \Upsilon} |\rho_i(\theta, \phi)| \leq \Xi_{i,2}.$$
Hence, for $i = 1, \ldots, p$,
\begin{align*}
\sup_{(\theta, \phi) \in \Upsilon} P\left(\sup_{t \in [0,1]^p} |\zeta(t)| > M_n \,\Big|\, z, \theta, \phi\right) &\leq C e^{-d_0 M_n^2 / \Xi_{0,2}^2}, \\
\sup_{(\theta, \phi) \in \Upsilon} P\left(\sup_{t \in [0,1]^p} \left|\frac{\partial}{\partial t_i}\zeta(t)\right| > M_n \,\Big|\, z, \theta, \phi\right) &\leq C e^{-d_i M_n^2 / \Xi_{i,2}^2}.
\end{align*}
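The Gaussian decay rate of these sup-norm tail bounds can be illustrated numerically. The sketch below is ours, not part of the proof; the squared-exponential covariance, grid, and seed are arbitrary choices. The estimated tail probabilities of the sup-norm shrink roughly like $e^{-dM^2}$ in the threshold $M$, which is what the bounds above assert.

```python
import numpy as np

rng = np.random.default_rng(3)

# Zero-mean GP with an arbitrary squared-exponential covariance on a grid of [0, 1]
grid = np.linspace(0.0, 1.0, 50)
K = np.exp(-((grid[:, None] - grid[None, :]) ** 2) / 0.1**2) + 1e-8 * np.eye(50)

# Monte Carlo estimate of P(sup_t |zeta(t)| > M) for increasing thresholds M
paths = rng.multivariate_normal(np.zeros(50), K, size=100_000)
sup_norm = np.abs(paths).max(axis=1)
for M in (1.0, 2.0, 3.0):
    print(M, (sup_norm > M).mean())  # decays roughly like exp(-d * M**2)
```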
Appendix D. Proof of Lemma 2

We shall first define some notation. Let $0 < r < \nu$ and $t = r/2$. Let $N_t = N(t, F_n, \|\cdot\|_\infty)$ be the covering number of $F_n$. By Theorem 2.7.1 of [46], there exists a constant $K$ so that $\log N_t \leq K M_n^p t^{-p}$, and therefore $\log N_t = O(M_n^p)$, where $M_n = O(n^\alpha)$ for $\alpha \in (0, \frac{1}{2p})$. Choose $\tau \in (\alpha p, \frac{1}{2})$ and define $c_n = n^\tau$, so that $\log(N_t) = o(c_n)$. Moreover, let $\zeta^1, \ldots, \zeta^{N_t} \in F_n$ be finitely many elements of the sieve so that for every $\zeta \in F_n$ there is $i \in \{1, \ldots, N_t\}$ satisfying $\|\zeta - \zeta^i\|_\infty < t$. This implies that if $\zeta \in F_n$ is such that $\int |\zeta(t) - \zeta_0(t)|\, dQ_n(t) > \nu$, then $\int |\zeta^i(t) - \zeta_0(t)|\, dQ_n(t) > \nu - t$.

The next step in the proof is to construct a test for each $\zeta^i$, with the resulting functions $\Phi_n$ defined as a combination of the individual tests, and to show that the probabilities of type I and type II errors satisfy the properties of the lemma. Let us recall that $\zeta_j = \zeta(t_j)$ and $\zeta_{0,j} = \zeta_0(t_j)$. For an arbitrary $\zeta \in F_n$ such that $\|\zeta - \zeta^i\|_\infty < t$, let us define $\zeta'_{0,j} = \zeta^i(t_j)$, and $b_j = 1$ if $\zeta'_{0,j} > \zeta_{0,j}$ and $b_j = -1$ otherwise. For $\nu > 0$, let $\Psi_n[\zeta^i, \nu]$ be the indicator of the set $A$ defined as follows:
$$A = \left\{\sum_{j=1}^{n} b_j\left(\frac{y_j - \zeta_{0,j}}{\sigma_0}\right) > 2c_n\sqrt{n}\right\}.$$
The tests $\Phi_n$ are then $\Phi_n = \max_{1 \leq j \leq N_t} \Psi_n[\zeta^j, \nu]$.

Type I error. The Mill's ratio implies
$$E_{\zeta_0, \sigma^2_0}(\Psi_n) = P_0\left[\sum_{j=1}^{n} b_j\left(\frac{y_j - \zeta_{0,j}}{\sigma_0}\right) > 2c_n\sqrt{n}\right] = 1 - \Phi(2c_n) \leq \frac{1}{2c_n\sqrt{2\pi}}\, e^{-2c_n^2} \leq e^{-2c_n^2},$$
where the function $\Phi(\cdot)$ is the CDF of the standard normal distribution and the last inequality holds for $n$ large enough. Consequently, we have
$$E_{\zeta_0, \sigma^2_0}(\Phi_n) \leq \sum_{j=1}^{N_t} E_{\zeta_0, \sigma^2_0}(\Psi_n[\zeta^j, \nu]) \leq N_t\, e^{-2c_n^2} = e^{\log(N_t) - 2c_n^2} \leq e^{-c_n^2},$$
and
$$\sum_{n=1}^{\infty} E_{\zeta_0, \sigma^2_0} \Phi_n < \infty.$$

Type II error. It is sufficient to find $i$ for which the probability of type II error of $\Psi_n[\zeta^i, \nu]$, given an arbitrary $\zeta$ in $W_{\nu,n}^C \cap F_n$, is sufficiently small. This is because the probability of type II error for the composite test $\Phi_n$ is no larger than the smallest one among the individual tests $\Psi_n[\zeta^i, \nu]$. Note that here we assume $\int |\zeta(t) - \zeta_0(t)|\, dQ_n(t) > \nu$, and then $\int |\zeta^i(t) - \zeta_0(t)|\, dQ_n(t) > \nu - t$. For every $r < \nu - t$, [10] show that
$$\sum_{j=1}^{n} |\zeta'_{0,j} - \zeta_{0,j}| > rn.$$
Let $n$ be large enough so that $8\sigma_0 c_n < r\sqrt{n}$; then for any $0 < \epsilon < \tilde{\epsilon}$,
\begin{align*}
E_{\zeta, \sigma^2_0(1+\epsilon)}(1 - \Psi_n[\zeta^i, \nu]) &= P_{\zeta, \sigma^2_0(1+\epsilon)}\left[\sum_{j=1}^{n} b_j\left(\frac{y_j - \zeta_{0,j}}{\sigma_0}\right) \leq 2c_n\sqrt{n}\right] \\
&= P_{\zeta, \sigma^2_0(1+\epsilon)}\left[\sum_{j=1}^{n} b_j\left(\frac{y_j - \zeta_j + \zeta_j - \zeta'_{0,j} + \zeta'_{0,j} - \zeta_{0,j}}{\sigma_0}\right) \leq 2c_n\sqrt{n}\right] \\
&= P_{\zeta, \sigma^2_0(1+\epsilon)}\left[\frac{1}{\sqrt{n}}\sum_{j=1}^{n} b_j\left(\frac{y_j - \zeta_j}{\sigma_0}\right) + \frac{1}{\sqrt{n}}\sum_{j=1}^{n} b_j\left(\frac{\zeta_j - \zeta'_{0,j}}{\sigma_0}\right) + \frac{1}{\sqrt{n}}\sum_{j=1}^{n} \left|\frac{\zeta'_{0,j} - \zeta_{0,j}}{\sigma_0}\right| \leq 2c_n\right] \\
&\leq P_{\zeta, \sigma^2_0(1+\epsilon)}\left[\frac{1}{\sqrt{n}}\sum_{j=1}^{n} b_j\left(\frac{y_j - \zeta_j}{\sigma_0}\right) \leq \frac{r\sqrt{n}}{2\sigma_0} - \frac{r\sqrt{n}}{\sigma_0} + 2c_n\right] \\
&\leq P_{\zeta, \sigma^2_0(1+\epsilon)}\left[\frac{1}{\sqrt{n}}\sum_{j=1}^{n} b_j\left(\frac{y_j - \zeta_j}{\sigma_0\sqrt{1+\epsilon}}\right) \leq -\frac{r\sqrt{n}}{4\sigma_0\sqrt{1+\epsilon}}\right] \\
&= \Phi\left(-\frac{r\sqrt{n}}{4\sigma_0\sqrt{1+\epsilon}}\right) \leq \frac{4\sigma_0\sqrt{1+\epsilon}}{r\sqrt{2\pi n}}\, e^{-\frac{n r^2}{32\sigma^2_0(1+\epsilon)}}.
\end{align*}
To establish the part (ii) of the lemma, we need to show that there exists $0 < \tilde{\epsilon}$ such that for all $0 < \epsilon < \tilde{\epsilon}$,
$$\frac{r^2}{32\sigma^2_0(1+\epsilon)} + \log\left(\frac{1 - \epsilon}{1 + \epsilon}\right) > 0. \tag{34}$$
Take $\kappa = \frac{r^2}{32\sigma^2_0}$ and define $b(\epsilon)$ to be the left-hand side of (34),
$$b(\epsilon) = \kappa\left(\frac{1}{1+\epsilon} + \frac{1}{\kappa}\log\left(\frac{1 - \epsilon}{1 + \epsilon}\right)\right).$$
The function $b(\epsilon)$ is clearly continuous at $\epsilon = 0$ with $b(0) = \kappa > 0$. Hence, for each $\kappa > 0$, there exists $\tilde{\epsilon}$ such that for all $0 < \epsilon < \tilde{\epsilon}$, $b(\epsilon) > 0$.
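The type I error calculation above rests on two facts: the test statistic is exactly $N(0, n)$ under $P_0$, and the Mill's-ratio bound gives $1 - \Phi(2c_n) \leq e^{-2c_n^2}$. A small simulation (our illustration, not from the paper; the choices of $n$, $c_n$, and the seed are arbitrary) checks both:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n, c_n = 100, 1.5                       # arbitrary sample size and threshold rate
b = rng.choice([-1.0, 1.0], size=n)     # signs b_j induced by an arbitrary sieve element

# Under P_0, (y_j - zeta_{0,j}) / sigma_0 are i.i.d. standard normal, so the
# test statistic is a sum of n independent signed N(0, 1) draws.
reps = 50_000
stat = (b * rng.standard_normal((reps, n))).sum(axis=1)

type1 = (stat > 2 * c_n * np.sqrt(n)).mean()
print(type1)                   # empirical type I error, ~ 1 - Phi(2 c_n) ≈ 0.00135
print(1 - norm.cdf(2 * c_n))   # exact normal tail probability
print(np.exp(-2 * c_n**2))     # Mill's-ratio bound ≈ 0.0111, which dominates both
```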
Appendix E. Proof of Theorem 2

First, we show that $\hat{\sigma}^2_n$ is asymptotically unbiased. Note that
$$E[(y_{i+1} - y_i)^2] = [\zeta_0(t_{i+1}) - \zeta_0(t_i)]^2 + \sigma^2_0\, E[(\epsilon_{i+1} - \epsilon_i)^2] = [\zeta_0(t_{i+1}) - \zeta_0(t_i)]^2 + 2\sigma^2_0,$$
because $\epsilon_i \overset{i.i.d.}{\sim} N(0, 1)$. Consequently,
$$E(\hat{\sigma}^2_n) = \frac{\sum_{i=1}^{n-1} [\zeta_0(t_{i+1}) - \zeta_0(t_i)]^2}{2(n-1)} + \sigma^2_0. \tag{35}$$
Since $\zeta_0$ is continuously differentiable on the compact and convex set $\Omega_t$, it is also (globally) Lipschitz on $\Omega_t$ (e.g. [38], Corollary 3.2.4), and there exists a real constant $K_0$ so that
$$|\zeta_0(t_{i+1}) - \zeta_0(t_i)| \leq K_0 \sum_{j=1}^{p} |t_{i+1,j} - t_{i,j}|.$$
Therefore, due to the design assumption (AD),
$$0 \leq \frac{\sum_{i=1}^{n-1} [\zeta_0(t_{i+1}) - \zeta_0(t_i)]^2}{2(n-1)} \leq \frac{K_0^2\, p^2}{2}\left[\sup_{i \in \{1, \ldots, n\},\, j \in \{1, \ldots, p\}} |t_{i+1,j} - t_{i,j}|\right]^2 \to 0 \ \text{as } n \to \infty, \tag{36}$$
and the combination of (35) with (36) implies
$$E(\hat{\sigma}^2_n) \to \sigma^2_0 \ \text{as } n \to \infty. \tag{37}$$
To show the almost sure convergence of $\hat{\sigma}^2_n$, let us now denote $x_i = (y_{i+1} - y_i)^2$ and rewrite the estimator as a sum of two estimators, each consisting of a sum of independent variables:
$$\hat{\sigma}^2_n = \frac{\sum_{i=1}^{(n-1)/2} x_{2i}}{2(n-1)} + \frac{\sum_{j=1}^{(n-1)/2} x_{2j-1}}{2(n-1)} = \hat{\sigma}^2_{n,e} + \hat{\sigma}^2_{n,o}.$$
Without loss of generality, we assumed that $n$ is an odd integer. Lastly, note that $\mathrm{Var}(x_i) \leq C < \infty$ uniformly in $i$. This is because the differences $\zeta_0(t_{i+1}) - \zeta_0(t_i)$ are uniformly bounded on the compact set $\Omega_t$ due to the continuity of $\zeta_0$, and the $y_{i+1} - y_i$ are normal and have bounded moments. We can now apply the Kolmogorov strong law of large numbers for independent non-identically distributed random variables (e.g. [41], Chapter 3):
$$\hat{\sigma}^2_{n,e} \to \frac{\sigma^2_0}{2} \ \text{a.s.}\ P_0, \qquad \hat{\sigma}^2_{n,o} \to \frac{\sigma^2_0}{2} \ \text{a.s.}\ P_0,$$
and as a result $\hat{\sigma}^2_n = \hat{\sigma}^2_{n,e} + \hat{\sigma}^2_{n,o} \to \sigma^2_0$ a.s. $P_0$.
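As an illustration of Theorem 2 (ours, not from the paper), the first-difference estimator $\hat{\sigma}^2_n = \sum_{i=1}^{n-1}(y_{i+1} - y_i)^2 / (2(n-1))$ can be checked by simulation on a one-dimensional design with shrinking gaps. The particular $\zeta_0$, noise level, and seed below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)
sigma0 = 0.3  # true noise standard deviation

def sigma2_hat(y):
    """First-difference variance estimator from Theorem 2."""
    d = np.diff(y)
    return np.sum(d**2) / (2 * (len(y) - 1))

for n in (125, 500, 2000, 8000):
    t = np.sort(rng.uniform(size=n))       # design points with gaps shrinking as in (AD)
    zeta0 = np.sin(2 * np.pi * t)          # arbitrary smooth physical process
    y = zeta0 + sigma0 * rng.standard_normal(n)
    print(n, sigma2_hat(y))                # approaches sigma0**2 = 0.09 as n grows
```

The squared-bias term in (35) vanishes here because adjacent design points get closer as $n$ grows, exactly the mechanism used in the proof.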
Appendix F. Additional results for simulation study 1

[Figures 6–13 show image panels for sample sizes n = s = 125, 250, and 500, comparing the empirical Bayes variants $L_{MLE}$ and $L_{CV(10)}$ with the Metropolis–Hastings (M-H) baseline; only the captions are reproduced here.]

Figure 6: Comparison of the convergence to the true physical process $\zeta(t, x)$ under the empirical Bayes approach and the fully Bayesian implementation given by the Metropolis–Hastings algorithm. The dashed line represents the true process $\zeta_0$, and the solid line corresponds to the mean of the posterior predictive distributions under the respective method. The curves with 95% credible intervals (shaded area) are plotted at t = 0.…

Figure 7: Comparison of the convergence to the true physical process $\zeta(t, x)$ under the empirical Bayes approach and the fully Bayesian implementation given by the Metropolis–Hastings algorithm. The dashed line represents the true process $\zeta_0$, and the solid line corresponds to the mean of the posterior predictive distributions under the respective method. The curves with 95% credible intervals (shaded area) are plotted at t = 0.…

Figure 8: Comparison of the convergence to the true physical process $\zeta(t, x)$ under the empirical Bayes approach and the fully Bayesian implementation given by the Metropolis–Hastings algorithm. The dashed line represents the true process $\zeta_0$, and the solid line corresponds to the mean of the posterior predictive distributions under the respective method. The curves with 95% credible intervals (shaded area) are plotted at t = 0.…

Figure 9: Comparison of the convergence to the true physical process $\zeta(t, x)$ under the empirical Bayes approach and the fully Bayesian implementation given by the Metropolis–Hastings algorithm. The dashed line represents the true process $\zeta_0$, and the solid line corresponds to the mean of the posterior predictive distributions under the respective method. The curves with 95% credible intervals (shaded area) are plotted at t = 1.

Figure 10: Details of 95% credible bands of posterior predictive distributions under the empirical Bayes approach and the fully Bayesian approach of the Metropolis–Hastings algorithm. These were plotted at t = 0.…

Figure 11: Details of 95% credible bands of posterior predictive distributions under the empirical Bayes approach and the fully Bayesian approach of the Metropolis–Hastings algorithm. These were plotted at t = 0.…

Figure 12: Details of 95% credible bands of posterior predictive distributions under the empirical Bayes approach and the fully Bayesian approach of the Metropolis–Hastings algorithm. These were plotted at t = 0.…

Figure 13: Details of 95% credible bands of posterior predictive distributions under the empirical Bayes approach and the fully Bayesian approach of the Metropolis–Hastings algorithm. These were plotted at t = 1.

Appendix G. The LDM calibration

The analysis of the LDM follows our previous study in [17]. Here we provide a concise discussion regarding the choices of prior distributions for the calibration parameters and hyperparameters, and the GP specifications.
GP specifications. For the computer model $E_B(Z, N)$, we consider the GP prior with mean zero and the covariance function
$$\eta^2_f \cdot \exp\left(-\frac{(Z - Z')^2}{\nu^2_Z} - \frac{(N - N')^2}{\nu^2_N} - \frac{(\theta_{vol} - \theta'_{vol})^2}{\nu^2_1} - \frac{(\theta_{surf} - \theta'_{surf})^2}{\nu^2_2} - \frac{(\theta_{sym} - \theta'_{sym})^2}{\nu^2_3} - \frac{(\theta_{C} - \theta'_{C})^2}{\nu^2_4}\right).$$
We also assume the GP prior for the systematic discrepancy $\delta(Z, N)$ with mean zero and the covariance function
$$\eta^2_\delta \cdot \exp\left(-\frac{(Z - Z')^2}{l^2_Z} - \frac{(N - N')^2}{l^2_N}\right).$$
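A direct transcription of these two covariance functions is given below as a sketch under our notational assumptions. The function names and all numeric inputs, including the hyperparameter values, are illustrative placeholders, not the calibrated values from the analysis:

```python
import numpy as np

def k_emulator(x, x_prime, eta_f, lengthscales):
    """Squared-exponential covariance on the joint input (Z, N, theta_vol,
    theta_surf, theta_sym, theta_C); `lengthscales` holds
    (nu_Z, nu_N, nu_1, nu_2, nu_3, nu_4)."""
    x, x_prime, ls = map(np.asarray, (x, x_prime, lengthscales))
    return eta_f**2 * np.exp(-np.sum((x - x_prime) ** 2 / ls**2))

def k_discrepancy(zn, zn_prime, eta_d, l_Z, l_N):
    """Covariance of the systematic discrepancy delta(Z, N)."""
    (Z, N), (Zp, Np) = zn, zn_prime
    return eta_d**2 * np.exp(-((Z - Zp) ** 2) / l_Z**2 - ((N - Np) ** 2) / l_N**2)

# Example evaluation with placeholder inputs and hyperparameters:
x = (28, 30, 15.4, 16.9, 22.4, 0.7)   # one (Z, N, theta) input
xp = (28, 32, 15.4, 16.9, 22.4, 0.7)  # same theta, different neutron number
print(k_emulator(x, xp, eta_f=1000.0, lengthscales=(10, 10, 1, 1, 1, 0.1)))
print(k_discrepancy((28, 30), (28, 32), eta_d=2.0, l_Z=5.0, l_N=5.0))
```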
Prior distributions. The prior distributions for the calibration parameters $\theta$ are chosen to be wide enough to cover the space of all their reasonable values:
$$\theta_{vol} \sim N(15.\ldots,\ \ldots), \quad \theta_{surf} \sim N(16.\ldots,\ \ldots), \quad \theta_{sym} \sim N(22.\ldots,\ \ldots), \quad \theta_{C} \sim N(0.\ldots,\ \ldots).$$
The prior distributions for the hyperparameters $\phi$ were selected as $Gamma(\alpha, \beta)$ with the shape parameter $\alpha$ and scale parameter $\beta$. They are chosen to be weakly informative so that they correspond to the scale of these parameters given by the literature on nuclear mass models [48, 6, 29, 11, 21, 27, 22, 24, 23, 4, 18]. In particular,
$$\sigma^2 \sim Gamma(2, \ldots), \quad \eta_\delta \sim Gamma(10, \ldots), \quad l_Z \sim Gamma(10, \ldots), \quad l_N \sim Gamma(10, \ldots),$$
$$\nu_Z \sim Gamma(10, \ldots), \quad \nu_N \sim Gamma(10, \ldots), \quad \nu_i \sim Gamma(10, \ldots), \ i = 1, 2, 3, 4.$$
Since the majority of the masses in the training dataset are larger than 1000 MeV, we consider the following prior for $\eta_f$ to reflect this notion:
$$\eta_f \sim Gamma(110, \ldots).$$
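For concreteness, drawing from priors of this form with scipy is straightforward. Since the scale parameters above are truncated in the source, the scales in this sketch are placeholders chosen only to make it run; only the shape parameters match the text:

```python
from scipy.stats import gamma, norm

# Shapes follow the text; every scale below is a placeholder, not the paper's value.
priors = {
    "sigma2":    gamma(a=2,   scale=1.0),
    "eta_delta": gamma(a=10,  scale=0.1),
    "l_Z":       gamma(a=10,  scale=1.0),
    "l_N":       gamma(a=10,  scale=1.0),
    "nu_Z":      gamma(a=10,  scale=1.0),
    "nu_N":      gamma(a=10,  scale=1.0),
    "eta_f":     gamma(a=110, scale=10.0),  # mean = a * scale = 1100 with this placeholder scale
}
theta_vol_prior = norm(loc=15.4, scale=1.0)  # placeholder mean and standard deviation

for name, dist in priors.items():
    print(name, dist.mean(), dist.rvs(random_state=0))  # prior mean and one draw
```

The large shape parameter of the $\eta_f$ prior is what concentrates its mass at the MeV scale of the training masses; the placeholder scale merely sets where that mass sits.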