Comparison of Gaussian process modeling software
Collin B. Erickson a,∗, Bruce E. Ankenman a, Susan M. Sanchez b

a Department of Industrial Engineering and Management Sciences, Northwestern University, 2145 Sheridan Rd., Evanston, IL 60208, USA
b Department of Operations Research, Naval Postgraduate School, 1411 Cunningham Rd., Monterey, CA 93943, USA
Abstract
Gaussian process fitting, or kriging, is often used to create a model from a set of data. Many available software packages do this, but we show that very different results can be obtained from different packages even when using the same data and model. We describe the parameterization, features, and optimization used by eight different fitting packages that run on four different platforms. We then compare these eight packages using various data functions and data sets, revealing that there are stark differences between the packages. In addition to comparing the prediction accuracy, we also evaluate the predictive variance, which is important for evaluating the precision of predictions and is often used in stopping criteria.
Keywords:
Simulation, Gaussian processes, stochastic kriging, metamodels, computer experiments
1. Introduction
When computer simulation models are used to study complex systems, it is often useful to fit an empirical mathematical model to quickly approximate the time-consuming computer simulation at input values that have not yet been evaluated. If fit to sufficient accuracy, these metamodels can replace the original computer models in optimization or "what if" analyses. Gaussian process (GP) modeling is commonly used for fitting metamodels in simulation experiments since it provides a flexible model and a model-based estimate of prediction error even if the simulation itself is deterministic. Gaussian process models can also be used when the simulation is stochastic, although this requires an extension of the model. Gaussian process models have become a large part of the expanding machine learning toolbox (Rasmussen and Williams, 2006).

In this paper we are concerned with how GP models are used by practitioners, so we compare the performance of some commonly used software packages. Many practitioners

∗ Corresponding author
Email addresses: [email protected] (Collin B. Erickson), [email protected] (Bruce E. Ankenman), [email protected] (Susan M. Sanchez)
Preprint submitted to Elsevier October 10, 2017

are not familiar with the particulars of GP fitting, so we investigate packages that are relatively easy to use and do not require extensive knowledge of all the options and parameters that can be specified. GP fitting is unlike linear regression where, for a given data set, all software packages will produce exactly the same parameter estimates and fitted surface (up to round-off error). Most GP fitting packages use essentially the same equations, but there is variability in how parameters are defined and estimated through numerical optimization. So, in practice, different packages can give substantially different results. Since GP fitting is often chosen over other fitting techniques because of its model-based estimate of prediction error, the quality of a software implementation depends not only on the accuracy of the fitted surface, but also on the accuracy of the error predictions. In many applications, random noise or computational limitations do not allow for extrinsic measures of prediction accuracy, and thus an easily obtained estimate of the uncertainty of prediction is valuable if it is at least reasonably accurate.

Our major interest in GP fitting is its use to sequentially build an accurate metamodel of a computer simulation over a broad range of input values. This global metamodel could be built automatically using excess computing capacity, and then when the need for real-time decision making arises, the quickly-evaluated metamodel can be used in place of the time-consuming computer simulation model.

Sequential algorithms require a stopping criterion. For building an accurate global metamodel, a stopping criterion that is a function of the estimated prediction error makes sense. Many fitting methods, such as splines and neural networks, provide no estimate of prediction error apart from extrinsic methods like cross validation, which are not related to the fitted model itself.
This is where the model-based estimate of GP fitting is very attractive. In addition to its use in a stopping criterion, prediction error estimates can be used by a sequential algorithm to determine the location of the next set of design points to run with the actual computer simulation model. We think of the sequential selection of these design points as a sequential experiment design. A sequential experiment design is called non-adaptive if the location of each additional design point is related only to the locations of the previous design points in the input space. An adaptive sequential experiment design uses not only the locations of the previous design points, but also the observed values of the response at those design points. If the objective is to find the location of an optimum, adaptive designs are vastly more efficient because they can select locations where the response is likely to be desirable. However, for building an accurate global metamodel, the response information can be used to select design points where the prediction error is estimated to be large or locations that would help estimate the metamodel parameters more accurately. Our goal is to determine which software packages have the capability of effectively and reliably estimating locations for new design points by estimating the metamodel and its prediction error.

GP metamodels are well suited for fitting an accurate global metamodel since they can fit complicated surfaces; however, they are computationally slow for large data sets. On the other hand, if the goal is to find an optimum, then a GP can also be used as a surrogate model to search for extrema (e.g., Jones et al., 1998).
Many practitioners in the simulation community use Gaussian processes as a simple fitting model approach and often are not familiar with the intricacies of the model. These scientists often use the basic kriging model and do not delve into the advanced parameter settings, optimization routines, and alternative correlation models that are available.

For example, in aerospace design, Christen et al. (2014) use GPs to model the acoustic transmission on launchers in an effort to reduce damage to the payload. The GP model allows them to perform global sensitivity analysis to see which parameters in their acoustic model affect the transmission. Yin et al. (2014) use these models in materials science for modeling functionally graded foam-filled tapered tubes to see which designs have the best energy absorption characteristics. Du et al. (2014) model the current density of lithium-ion batteries as a function of eight input parameters. GPs are used for metamodeling in simulations for corn crops by Villa-Vialaneix et al. (2012); their metamodels predict the "nitrogen dioxide (NO2) fluxes and nitrogen leaching from European farmlands." Gidaris and Taflanidis (2015) use kriging for earthquake engineering to see how the configuration of fluid viscous dampers affects costs. GPs are used as metamodels in sensitivity analysis for traffic simulation models by Ciuffo et al. (2013).

These examples highlight the need for Gaussian process software to be stable and reliable, in the same way that regression modeling is trusted for fitting linear models. Of particular importance for these applications are reliable predictions and accurate error estimates. The error predictions are especially useful in determining whether more data is needed.

This study is inspired by our previous research, where we found discrepancies between two software packages that were using the same GP model but were giving different results, particularly in the estimation of the prediction error.
We believe that others may also encounter similar issues with GP fitting and would benefit from an in-depth study of the various software options. Knowledgeable users may know how to improve the results by setting advanced options or tuning parameters. However, we are trying to find what works best for practitioners who may not have this specific knowledge. Thus, we select packages that are easy to use, and we have left as many options at the default setting as possible. For our comparisons we use packages from a mixture of platforms: the R (R Core Team, 2014) packages DiceKriging, GPfit, laGP, and mlegp; JMP, produced by SAS; the MATLAB toolbox DACE; and the Python modules GPy and sklearn (scikit-learn). These are described in more detail in Section 3. JMP is a commercial package; the rest are free and publicly available.

Figure 1 is a simple example that demonstrates problems that can arise. It displays a sample of size six (black points) in one dimension fit using common options for three different software packages, which then give predictions of the output for x between 0 and 1. The details of the packages will be explained later in this paper. The predictions given by laGPE smoothly interpolate between the observations. We can see that mlegpE exhibits mean reversion, where it predicts the observed points correctly, but then quickly reverts to the mean away from those points. At the other extreme, Dice2 oversmooths, causing its predictions to be far from the observed points. Furthermore, the predictions of the standard error across the range of x values will also be significantly different. Thus, even for a simple data set, we can obtain very different results.

Figure 1: Comparison of Gaussian process fits from three software packages, laGPE, mlegpE, and Dice2, on one-dimensional data. The black points are the input/output data given to each package to fit a GP model. The lines are the predicted mean over the interval [0,1] for each package, showing significant differences.
2. GP fitting
A Gaussian process is characterized such that the output from any set of input points has a multivariate normal distribution. If we have $n$ inputs in $d$ dimensions, then the $i$th input is $x_i = (x_{i1}, \ldots, x_{id})^T$. These are stored in the rows of the $n \times d$ input matrix $X$. The output is one-dimensional, $y = (y_1, \ldots, y_n)^T$.

Following Sacks et al. (1989), the surface is modeled as a mean, $\mu$, plus a Gaussian process, $z$, which is a function of $x$, as follows:
$$y = f(x) = \mu + z(x).$$
In general, a linear combination of functions can be used in place of $\mu$, which is called universal kriging (e.g., Bastos and O'Hagan (2009)). Many authors use only a constant term for $\mu$, since the Gaussian process is flexible enough to model any linear behavior in addition to many other more interesting features of the surface. In this paper we also use a mean-only model, following the suggestion of Chen et al. (2016), who claim that "there is little to be gained (and maybe even something to lose) by using other than a constant term for $\mu$." One explanation is that replacing this constant by a more complicated function $f(x)$ (e.g., a polynomial of order higher than zero) requires the estimation of additional (often extraneous) parameters (e.g., polynomial coefficients).

Under this model, the distribution of the outputs, $y$, is multivariate normal with mean $\mu \mathbf{1}_n$, where $\mathbf{1}_n$ is the $n$-length vector of ones. The covariance matrix of this multivariate normal distribution is proportional to a correlation matrix, which has a special structure such that the points will create a smooth surface. The constant of proportionality between the covariance matrix and the correlation matrix is a variance that is denoted $\sigma^2$. The correlation matrix is constructed such that the correlation of the outputs from any two distinct points in the input domain is inversely related to the distance between those two points.
In particular, the correlation goes to one as the distance between the two input points goes to zero, and the correlation goes to zero as the distance goes to infinity. This allows the output points to form a surface, but also places few constraints on the shapes and features of that surface. Thus the model is
$$y \sim MVN\!\left(\mu \mathbf{1}_n,\ \sigma^2 R\right), \qquad (1)$$
where $R$ is the correlation matrix of $y$. $R_{ij}$ denotes the element in the $i$th row and the $j$th column of $R$, and is the correlation between $y_i$ and $y_j$.

The covariance matrix is determined by the correlation function, for which there are many options. The most common correlation function, which we use here, is the Gaussian correlation that defines the correlation between the outputs at $x_i$ and $x_j$ as
$$R_{ij} = \prod_{k=1}^{d} \exp\!\left(-\theta_k (x_{ik} - x_{jk})^2\right). \qquad (2)$$
The estimator for the mean is
$$\hat{\mu} = \left(\mathbf{1}_n^T R^{-1} \mathbf{1}_n\right)^{-1} \left(\mathbf{1}_n^T R^{-1} y\right).$$
Then the best linear unbiased predictor (BLUP) of $y$ at $x$ is given by Sacks et al. (1989). Using our notation and following the derivations in MacDonald et al. (2015), we find
$$\hat{y} = \hat{f}(x) = \hat{\mu} + r^T R^{-1} (y - \hat{\mu}\mathbf{1}_n) = \left[\frac{(1 - r^T R^{-1}\mathbf{1}_n)}{\mathbf{1}_n^T R^{-1}\mathbf{1}_n}\mathbf{1}_n^T + r^T\right] R^{-1} y = C^T y, \qquad (3)$$
where $r^T = (r_1(x), \ldots, r_n(x))$, $r_i(x)$ is the correlation between $x_i$ and $x$, and $C$ is a vector of length $n$ defined as shown. The associated mean squared error of $\hat{y}$ at $x$ is
$$\varphi(x) = \sigma^2\left[1 - 2C^T r + C^T R C\right] = \sigma^2\left[1 - r^T R^{-1} r + \frac{(1 - \mathbf{1}_n^T R^{-1} r)^2}{\mathbf{1}_n^T R^{-1}\mathbf{1}_n}\right]. \qquad (4)$$
These parameters can be estimated, and then used in Equations 3 and 4 to get the predictions. The error added from parameter estimation is generally not included in the predictive equations, but can be estimated through bootstrap techniques; see Kleijnen (2015).

The model of Equation 1 does not account for random noise. This can be done by adding a nugget parameter.
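Equations 2-4 can be sketched directly in code. The following is a minimal numpy illustration; the function names, design points, and response values are our own and do not come from any of the packages compared in this paper.

```python
import numpy as np

def gauss_corr(X1, X2, theta):
    """Gaussian correlation (Equation 2): R_ij = prod_k exp(-theta_k (x_ik - x_jk)^2)."""
    d2 = (X1[:, None, :] - X2[None, :, :]) ** 2      # squared differences per dimension
    return np.exp(-(d2 * theta).sum(axis=2))         # product over dims = exp of the sum

def gp_predict(X, y, x_new, theta, sigma2=1.0):
    """BLUP y_hat (Equation 3) and its mean squared error (Equation 4) at one point."""
    n = len(y)
    one = np.ones(n)
    R = gauss_corr(X, X, theta)
    Rinv = np.linalg.inv(R)
    mu_hat = (one @ Rinv @ y) / (one @ Rinv @ one)   # estimator of the mean
    r = gauss_corr(X, x_new[None, :], theta)[:, 0]   # correlations r_i(x_new)
    y_hat = mu_hat + r @ Rinv @ (y - mu_hat * one)   # Equation 3
    mse = sigma2 * (1.0 - r @ Rinv @ r
                    + (1.0 - one @ Rinv @ r) ** 2 / (one @ Rinv @ one))  # Equation 4
    return y_hat, mse

X = np.array([[0.0], [0.3], [0.7], [1.0]])
y = np.sin(3 * X[:, 0])
y_hat, mse = gp_predict(X, y, np.array([0.3]), np.array([5.0]))
```

With no nugget the BLUP interpolates: at a design point, `y_hat` equals the observation and `mse` is zero up to round-off.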
To account for random homoscedastic (constant variance) noise, the model above must be augmented as follows:
$$y \sim MVN\!\left(\mu \mathbf{1}_n,\ \sigma^2 (R + \delta I)\right).$$
If a nugget, $\delta$, is included in the model, then the correlation matrix is increased along the diagonal by $\delta$: $R_\delta = R + \delta I$. Then the predictive Equations 3 and 4 can be used with $R_\delta$ in place of $R$. Note that $R_\delta$ is no longer a correlation matrix, since the diagonal values are greater than one. The nugget has the effect of smoothing the function and allowing for noise. Another reason for using a nugget is to provide computational stability. The calculations above all require inverting $R$, which can be near singular. Adding a nugget will improve this stability. When the noise is heteroscedastic (i.e., the noise variance varies across the input domain), then stochastic kriging methods must be used, as explained in Section 6.1.

When fitting a model to data, at least $d + 2$ parameters must be estimated: the mean $\mu$, the $d$ correlation parameters $\theta$, and the variance $\sigma^2$. One additional parameter, $\delta$, must also be estimated if the nugget is used. Given the same parameters and the same data, different software should give the same predictions since they are using the same equations. Thus, the differences that we have observed in predicted values between software packages are likely caused by different parameter estimates. Each software package uses some type of numerical optimization method to seek estimators for these parameters that maximize the likelihood function (see Section 3.3). Practitioners typically trust these optimization methods to work without intervention. Later in this paper, empirical studies essentially demonstrate the performance of the optimization methods for various software packages.

The mean for the Gaussian process is $\mu$. Predictions far away from design points will revert to the mean since they will have low correlation with the observed data, but $\mu$ has a much smaller effect on predictions near design points.
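The mean-reversion behavior follows directly from Equation 3: far from the design points every correlation $r_i(x)$ vanishes, so the predictor falls back to $\hat{\mu}$. A small numpy sketch (data and $\theta$ below are arbitrary choices of ours, not from any package):

```python
import numpy as np

def gauss_corr(X1, X2, theta):
    """Gaussian correlation (Equation 2)."""
    d2 = (X1[:, None, :] - X2[None, :, :]) ** 2
    return np.exp(-(d2 * theta).sum(axis=2))

X = np.array([[0.0], [0.5], [1.0]])
y = np.array([2.0, 3.5, 1.0])
theta = np.array([10.0])

R = gauss_corr(X, X, theta)
Rinv = np.linalg.inv(R)
one = np.ones(len(y))
mu_hat = (one @ Rinv @ y) / (one @ Rinv @ one)

def blup(x):
    """Equation 3 at a scalar input x."""
    r = gauss_corr(X, np.array([[x]]), theta)[:, 0]
    return mu_hat + r @ Rinv @ (y - mu_hat * one)

print(blup(0.5))    # at a design point: interpolates the observation, 3.5
print(blup(10.0))   # far from all design points: reverts to mu_hat
```

Near the data the prediction tracks the observations; at `x = 10` the correlations underflow to zero and the prediction is exactly the estimated mean.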
As previously discussed, the mean term can be replaced with a linear model-type function $\beta^T f$, but we do not consider such an expanded model in this paper. Some analysts prefer not to use a mean term at all.

The vector of hyperparameters, $\theta = (\theta_1, \ldots, \theta_d)^T$, contains the correlation parameters of the covariance function. There is one parameter for each dimension that determines how strong the correlation is between points in the corresponding dimension. Sometimes a different parameterization is used (see Section 3.1), which can lead to changes in the ease and stability of numerical optimization.

The parameters in $\theta$ help determine the correlation, but the parameter $\sigma^2$ also affects the fitting, since the covariance between any two points is $\mathrm{cov}(y_i, y_j) = \sigma^2 R_{ij}$. Note that this variance parameter is not the variance of a sample from the output surface. By itself it can be interpreted as the variance of a point "infinitely far" from all other points.

The nugget allows for measurement error or stochasticity of the response. If the nugget is not used (i.e., set to zero), then the model will interpolate exactly, so the prediction error at a design point will be zero. This is often useful for deterministic computer experiments, but if the data is stochastic then a nugget should be estimated and used. Using a nugget improves numerical stability by making the correlation matrices easier to invert; this inversion can be a problem when there is a large number of sample points. Ranjan et al. (2011) provide a method to use the smallest nugget value that makes the computations stable, which can help balance the benefits of having stability and a small nugget. Gramacy and Lee (2012) argue that the nugget provides protection when the assumption of stationarity is violated or the data is sparse, and they claim this protection is more important than stability.
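The stability role of the nugget is easy to see numerically: adding $\delta I$ to a near-singular $R$ sharply reduces its condition number. A sketch (the nearly coincident design points and the value of $\delta$ below are arbitrary illustrations of ours):

```python
import numpy as np

def gauss_corr(X, theta):
    """Gaussian correlation matrix (Equation 2) for one design matrix."""
    d2 = (X[:, None, :] - X[None, :, :]) ** 2
    return np.exp(-(d2 * theta).sum(axis=2))

X = np.array([[0.0], [0.0001], [0.5], [1.0]])   # first two points nearly coincide
R = gauss_corr(X, np.array([1.0]))              # two rows are almost identical

delta = 1e-6
R_delta = R + delta * np.eye(len(X))            # nugget-augmented matrix

# The nugget lifts the smallest eigenvalue, so the condition number drops by
# orders of magnitude, and inverting R_delta is far better behaved.
print(np.linalg.cond(R), np.linalg.cond(R_delta))
```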
3. Gaussian process fitting in various software packages
Table 1 lists the primary sources for our comparison of packages. Most of the packages provide users with options. We try to leave most options at their default settings, since that is what most practitioners are likely to use. However, in order to have fair comparisons, we make some simple selections so all packages use comparable models. This section provides overviews of the packages and describes the selections we use in our study.
DiceKriging is an R package for kriging created by the DICE Consortium, which has also released the R packages DiceOptim, DiceDesign, and DiceEval (Roustant et al., 2012). DiceOptim does the optimization performed when fitting DiceKriging models. DiceKriging provides many options for fitting and is very thorough, so it may be a good choice for many R users.
Table 1: Packages we are using

Package      Version  Platform  Primary source
DiceKriging  1.5.5    R         Roustant et al. (2012)
GPfit        1.0-0    R         MacDonald et al. (2015)
laGP         1.3-2    R         Gramacy (2015)
mlegp        3.1.4    R         Dancik and Dorman (2008)
JMP Pro      13.0.0   JMP       JMP: Gaussian Process (2016)
DACE         2.5      Matlab    Lophaven et al. (2002a)
GPy          1.5.6    Python    The GPy authors (2015a)
sklearn      0.18.1   Python    Pedregosa et al. (2011)

GPfit, another R package, was created by MacDonald, Ranjan, and Chipman. GPfit does the most extensive search in optimizing the maximum likelihood parameters, as detailed in MacDonald et al. (2015). Even when the control parameters for the optimization were set to reduced values, GPfit was still slower than the other packages by orders of magnitude. With more than a hundred design points GPfit becomes prohibitively slow, while the other packages still run quickly. One advantage of GPfit is that it focuses on computational stability, using the ideas put forth in Ranjan et al. (2011). It sets the nugget to be the smallest value that will avoid singularity, meaning that the nugget is never estimated. Thus GPfit is best suited for noiseless data. Since the default exponential power is 1.95 (instead of 2; see Equation 2), we include it as a separate model in some of the testing below. However, we change the power to 2 for most of the testing to be comparable with the other packages.

laGP is an R package created by Robert Gramacy that provides an entirely new method for fitting using GPs (Gramacy, 2015). The "la" stands for "local approximate" since the model is designed for large data sets. In the laGP model, sparsity is exploited so the kriging is done on a small number of design points that are most important for prediction at a given point (Gramacy and Apley, 2015). Thus, it can be run when there are a million design points, since it uses only a small number of points to make predictions and leverages parallelism. For the purpose of this research, we only use the basic GP fitting functions provided by the package.
Although it is an R package, the heart of it is a C implementation wrapped in R, meaning it should be faster than a basic R implementation. In its current state it is a rather minimal package with a focus on speed, so it has fewer additional options and may require more fine-tuning, but it is extremely fast.

The R package mlegp (Maximum Likelihood Estimates of Gaussian Processes) was created by Garrett Dancik (see Dancik, 2013). It provides full GP modeling capabilities. A distinctive feature of mlegp is that the user can specify the nugget matrix up to a multiplicative constant, which can be useful when the response is heteroscedastic, as in Dancik and Dorman (2008). Another feature is the ability to perform sensitivity analysis, letting the user quantify how the response is affected by parameters and how much variability in the output can be attributed to changes in the design matrix.

We also use the Gaussian Process capability of
JMP, a data analysis software tool provided by SAS (SAS Products: JMP, 2016). This program is commonly used by practitioners since it provides a clean interface, makes data analysis simple, and provides useful output displays.
DACE is a Matlab toolbox for fitting data from deterministic computer experiments, so it does not allow for noisy data (see Lophaven et al., 2002a). Thus it is only suitable when there is no random error at any given design point. DACE was created and last updated in 2002, so while it is commonly used, it lacks many of the additional features that other packages include.
GPy is a Python Gaussian process implementation created by the Sheffield machine learning group (The GPy authors, 2015a). GPy has a tremendous amount of functionality available for many different cases. During our preliminary tests, GPy gave poor results due to computation problems when it was version 0.6. These problems were fixed after GPy version 1.0 was released in April 2016, and we report results for version 1.5.6 in this paper.

Another open source library for Python is scikit-learn, which we call sklearn since that is the name of the Python module (Pedregosa et al., 2011). It is targeted at machine learning, not just kriging, so there are many other modeling options available in the module. Up through version 0.17, the kriging implementation was based on DACE. However, the Gaussian process functionality was vastly upgraded with version 0.18, released in September 2016 (scikit-learn developers, 2016a). The update added options for the correlation function, called the kernel, including the Gaussian, Matérn, and rational quadratic kernels, among others, as well as sum or product combinations of kernels.

Since the packages all use nearly the same equations, the real challenge in model fitting is estimating the parameters. Whereas the predictions are calculated using formulas, parameters must be estimated by solving an optimization problem. Generally, the parameters are chosen to be those that maximize the likelihood. However, the solution found depends on the starting values, the bounds, the algorithm used, and the parameterization. There is additional ambiguity since real data is typically not truly a sample from an actual Gaussian process. The Gaussian process model is just a useful approximation technique, and thus there are no true parameter values. The rest of this section examines the differences between the package parameterizations, the options available, and the estimation methods.
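Because the predictive equations are shared, differences between packages trace back to the estimated parameters. A short numpy sketch (our own code, arbitrary data) shows how two different values of $\theta$ yield noticeably different predictions from the same six observations, even though both resulting models interpolate the data:

```python
import numpy as np

def blup(X, y, x_new, theta):
    """BLUP of Equation 3 under the Gaussian correlation of Equation 2."""
    def corr(A, B):
        d2 = (A[:, None, :] - B[None, :, :]) ** 2
        return np.exp(-(d2 * theta).sum(axis=2))
    n = len(y)
    Rinv = np.linalg.inv(corr(X, X) + 1e-10 * np.eye(n))   # tiny jitter for stability
    one = np.ones(n)
    mu_hat = (one @ Rinv @ y) / (one @ Rinv @ one)
    r = corr(X, x_new)[:, 0]
    return mu_hat + r @ Rinv @ (y - mu_hat * one)

X = np.linspace(0, 1, 6).reshape(-1, 1)
y = np.sin(6 * X[:, 0])
x_new = np.array([[0.9]])                    # midway between two design points

p1 = blup(X, y, x_new, np.array([1.0]))      # long correlation length: smooth fit
p2 = blup(X, y, x_new, np.array([100.0]))    # short correlation length: mean-reverting fit
print(p1, p2)                                # noticeably different predictions
```

Which of these two fits a real package returns depends entirely on how well its optimizer estimates $\theta$, which is exactly the behavior compared in this study.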
3.1. Correlation functions and parameterization

The commonly used Gaussian correlation function is shown in Equation 2 above, but there are many variations on it and many other correlation functions that can be used.
DACE and
JMP use this standard formulation. The mlegp package uses a different notation for the correlation parameters, $\beta = \theta$, but this does not affect the calculations at all. GPfit uses $\beta = \log_{10} \theta$ in order to focus the optimization search near the center of the search space.

The Gaussian correlation can be generalized by allowing the exponent to be changed to any value in the range [1, 2]. GPfit uses 1.95 in the exponent. Ranjan et al. (2011) justify this change by explaining that it helps to reduce the computational problems caused by a near-singular correlation matrix when a space-filling design is used. In our tests, we evaluate two versions of GPfit: one using the Gaussian correlation function, which we call GPfit2, and one using 1.95 as the exponent, which we call GPfit1.95.

The package laGP moves $\theta$ to the denominator and calls it the length-scale parameter, as shown in Equation 5. This change is simply a reparameterization and does not affect the model at all. However, it will affect the optimization routine used to estimate $d$, the $p$-length vector whose $k$th element is $d_k$. This formulation is used by other authors such as Rasmussen and Williams (2006). The correlation function is
$$R_{ij} = \prod_{k=1}^{p} \exp\!\left(-(x_{ik} - x_{jk})^2 / d_k\right), \qquad (5)$$
where $p$ is used to denote the number of dimensions. The notation used for the parameters is also slightly different in the laGP code and vignette (a guide for the R package) than in the others. The nugget is referred to as $g$ in the code and $\eta$ in the vignette, while the lengthscale parameters are denoted by $d$ in the code and $\theta$ in the vignette (Gramacy, 2014).

Another parameterization adjusts the correlation function so that the lengthscale parameters, denoted as $\ell$, appear squared in the denominator. This puts $\ell$ on the same scale as $x$. In addition, a factor of two is added in the denominator so that the correlation function closely resembles the Gaussian probability density function, as shown in Equation 6.
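All of these conventions describe the same correlation; only the optimizer's search space changes. A quick numerical check (assuming numpy; the parameter values and input points are arbitrary) that $\theta$ of Equation 2, the laGP length-scale $d = 1/\theta$ of Equation 5, and a lengthscale $\ell$ with $\ell^2 = 1/(2\theta)$ as in Equation 6 give identical correlation values:

```python
import numpy as np

theta = 4.0                     # Equation 2 parameterization
d = 1.0 / theta                 # laGP length-scale (Equation 5)
ell = np.sqrt(1.0 / (2.0 * theta))  # lengthscale convention of Equation 6

x, xp = 0.3, 0.8                # two one-dimensional input points

r_theta = np.exp(-theta * (x - xp) ** 2)
r_d     = np.exp(-((x - xp) ** 2) / d)
r_ell   = np.exp(-0.5 * ((x - xp) ** 2) / ell ** 2)

print(r_theta, r_d, r_ell)      # all three agree
```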
This is used by DiceKriging, sklearn, and GPy (The GPy authors, 2015b):
$$R_{ij} = \prod_{k=1}^{p} \exp\!\left(-\frac{1}{2}(x_{ik} - x_{jk})^2 / \ell_k^2\right). \qquad (6)$$
Another popular correlation function is the Matérn function. It takes a parameter $\nu$ that determines the smoothness. Commonly used values for $\nu$ are 3/2 and 5/2. The Matérn correlation function can be seen as a generalization of the Gaussian correlation function, since they are equivalent for $\nu = \infty$. According to Roustant et al. (2012), the default correlation function for DiceKriging is the Matérn with $\nu = 5/2$, which is
$$g(h) = \left(1 + \sqrt{5}\,|h| + \frac{5}{3} h^2\right) \exp\!\left(-\sqrt{5}\,|h|\right), \quad \text{where } h = \sqrt{\sum_{k=1}^{p} (x_{ik} - x_{jk})^2 / \ell_k^2}. \qquad (7)$$
GPfit, GPy, and sklearn have the Matérn correlation as an option. We include two versions of DiceKriging in our comparisons, one using the Gaussian correlation, which we label Dice2, and another using the Matérn $\nu = 5/2$.

3.2. Nugget options

There are also options available for the nugget parameter.
DiceKriging defaults to having no nugget. There is also the option of setting the nugget to a constant or estimating it. For both DiceKriging with the Gaussian correlation (Dice2) and the Matérn (DiceM52), we let it estimate the nugget. Preliminary tests involving noiseless data did not reveal noticeable differences between fits based on no nugget and an estimated nugget. We chose to include the version that estimates the nugget since DiceKriging is also useful for stochastic kriging; see Section 6.1 for more details.
GPfit uses the smallest nugget value that keeps the computation stable, as explained in Ranjan et al. (2011), in order to prevent over-smoothing. The nugget value they use is
$$\delta_{lb} = \max\left\{ \frac{\lambda_n \left(\kappa(R) - e^a\right)}{\kappa(R)\left(e^a - 1\right)},\ 0 \right\},$$
where $\lambda_n$ is the largest eigenvalue of $R$, $a$ is a parameter set to 25 for space-filling designs, and $\kappa(R)$ is the condition number of $R$. MacDonald et al. (2015) compare GPfit to mlegp and state that "mlegp occasionally crashes due to near-singularity of the spatial correlation matrix," which agrees with what we have seen, so there is a benefit to setting the nugget in this way.

In laGP the user must either set the nugget to a fixed value or tell laGP to estimate it, since there is no default option. We tried several values for the nugget in preliminary tests, and found that $1 \times 10^{-6}$ worked best. Thus, in our tests we run laGP both with the nugget set to $1 \times 10^{-6}$, called laGP6, and with the nugget being estimated, called laGPE. laGP is also different from the other packages we use since it performs the calculations from a Bayesian perspective. In practice this makes little difference for users, since the default priors are very general. In addition, laGP is the only package we use that does not estimate a mean term.

By default, mlegp will not use a nugget unless there are repeated design points, but it can be estimated or set to a constant or a vector (Dancik, 2011). For our tests we run both with the nugget fixed to 0 and with a nugget estimated from a small starting value. We call these mlegp0 and mlegpE, respectively. As shown below, we find there is little difference.

JMP provides an option to fit the model with no nugget or with an estimated nugget. We run JMP using the Gaussian correlation both with an estimated nugget and without a nugget, and we refer to these two as JMPE and JMP0, respectively.
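The lower-bound nugget $\delta_{lb}$ of Ranjan et al. (2011) used by GPfit depends only on the spectrum of $R$ and can be computed directly. A numpy sketch (our own code; the two example correlation matrices are arbitrary illustrations):

```python
import numpy as np

def delta_lb(R, a=25.0):
    """Lower-bound nugget of Ranjan et al. (2011); a = 25 for space-filling designs."""
    eigvals = np.linalg.eigvalsh(R)        # R is symmetric
    lam_n = eigvals[-1]                    # largest eigenvalue, lambda_n
    kappa = eigvals[-1] / eigvals[0]       # condition number kappa(R)
    ea = np.exp(a)
    return max(lam_n * (kappa - ea) / (kappa * (ea - 1.0)), 0.0)

# A well-conditioned R needs no nugget, so delta_lb is zero.
X = np.linspace(0, 1, 10).reshape(-1, 1)
R_good = np.exp(-50.0 * (X - X.T) ** 2)    # Gaussian correlation, theta = 50

# A nearly singular R receives the smallest nugget that restores stability.
R_bad = np.array([[1.0, 1.0 - 1e-12],
                  [1.0 - 1e-12, 1.0]])

print(delta_lb(R_good), delta_lb(R_bad))   # zero for the first, positive for the second
```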
DACE does not let the user set or estimate the nugget, since it is designed for noiseless computer experiments. Instead it uses a small value equal to $(10 + n) \times 2.2 \times 10^{-16}$ for computational stability.

The nugget can be set or estimated in GPy by setting the noise variance parameter when using the GPRegression function. We set this parameter to a small value, which forces the model to estimate a nugget parameter.

The nugget is called alpha by sklearn, and defaults to $1 \times 10^{-10}$. Alternatively, the nugget can be specified by using a WhiteKernel, and this method allows it to be estimated. We use the default value in our investigation.

Table 2 shows the packages we use in our study, along with the differences in the parameterization of $\theta$ and the options and defaults for the nugget. The last column shows how we set the nugget for our study.

Table 2: Settings for each software version used in the study. DiceM52 uses the Matérn correlation function with $\nu = 5/2$, GPfit1.95 uses the power exponential correlation with power 1.95, and all others use the Gaussian correlation function.
Package            θ parameterization   Nugget: can set?  Can estimate?  Default          Setting used
Dice2, DiceM52     ℓ² = 1/(2θ)          yes               yes            none             Est
GPfit2, GPfit1.95  β = log10(θ)         no                no             δ_lb             δ_lb
laGP6, laGPE       d = 1/θ              yes               yes            NA               1e-6 / Est
mlegp0, mlegpE     β = θ                yes               yes            0                0 / Est
JMP0, JMPE         θ = θ                yes               yes            NA               0 / Est
DACE               θ = θ                no                no             (10+n)·2.2e-16   (10+n)·2.2e-16
GPy                ℓ² = 1/(2θ)          yes               yes            Est              Est
sklearn            ℓ² = 1/(2θ)          yes               yes            1e-10            1e-10

3.3. Optimization

There are also many options in most packages for setting up the optimization routine that estimates the parameters. There are different algorithms, choices of starting points and number of restarts, and stopping criteria. We use the default settings for all packages unless specified otherwise.

By default,
DiceKriging uses the L-BFGS-B algorithm for parameter optimization. The other option available, but not used in our study, is the genoud algorithm from the rgenoud R package (Mebane Jr et al., 2011), which combines genetic (evolutionary search) algorithms with derivative-based algorithms.
GPfit uses the most in-depth optimization algorithm. As detailed in MacDonald et al. (2015), GPfit uses L-BFGS-B (Byrd et al., 1995) with multiple starts to estimate the correlation parameters, which they have transformed to be $\beta = \log_{10} \theta$. This transformation focuses the optimization search more on the middle of the search space than on the edges. In Section 2.3 of MacDonald et al. (2015), they describe how GPfit adds bounds for each $\beta_k$ to create a domain where the optimum is likely to be found. The function that is minimized is the negative profile log-likelihood (which they call the deviance),
$$-2 \log L_\theta \propto \log |R| + n \log\left[\left(y - \mathbf{1}_n \hat{\mu}(\theta)\right)^T R^{-1} \left(y - \mathbf{1}_n \hat{\mu}(\theta)\right)\right].$$
GPfit begins its search with a space-filling LHD in the space of all the $\beta_i$'s, then selects a number of parameter sets that have low deviance. These points are clustered using the $k$-means algorithm. Then the L-BFGS-B algorithm is run using these cluster centers as the starting points for the restarts.

laGP requires an initial value for the correlation parameters and nugget (if estimated), with no default provided. However, laGP provides the functions darg and garg, which provide good initial starting values for $\theta$ and $\delta$ using Empirical Bayes (Gramacy, 2015). We use these two functions in our tests to find starting values.

The package mlegp estimates the parameters using L-BFGS (Liu and Nocedal, 1989) in a gradient method (Dancik, 2013). The starting points are found using multiple Nelder-Mead simplexes.

JMP is proprietary software and provides no details on the optimization or other details beneath the surface. There are no options that can be set for the optimization.
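The profile deviance that GPfit minimizes is straightforward to evaluate for any candidate $\theta$, with $\hat{\mu}$ profiled out as in Equation 3's derivation. A numpy sketch (the data, jitter, and the crude grid search standing in for multi-start L-BFGS-B are all arbitrary choices of ours):

```python
import numpy as np

def deviance(theta, X, y):
    """-2 log L (up to a constant): log|R| + n*log[(y - 1*mu_hat)' R^{-1} (y - 1*mu_hat)]."""
    n = len(y)
    d2 = (X[:, None, :] - X[None, :, :]) ** 2
    R = np.exp(-(d2 * theta).sum(axis=2)) + 1e-8 * np.eye(n)  # small jitter for stability
    Rinv = np.linalg.inv(R)
    one = np.ones(n)
    mu_hat = (one @ Rinv @ y) / (one @ Rinv @ one)            # profiled-out mean
    resid = y - mu_hat * one
    return np.linalg.slogdet(R)[1] + n * np.log(resid @ Rinv @ resid)

X = np.linspace(0, 1, 8).reshape(-1, 1)
y = np.sin(7 * X[:, 0])

# Grid search over candidate theta values, in the spirit of a multi-start method.
grid = 10.0 ** np.linspace(-1, 2, 31)
best = min(grid, key=lambda t: deviance(np.array([t]), X, y))
print(best, deviance(np.array([best]), X, y))
```

A real package refines such a starting value with a derivative-based method; the point of the sketch is that the objective surface itself is cheap to evaluate, and the quality of the final fit hinges on where this search ends up.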
DACE uses a pattern search that iterates through the steps of exploring, moving, and rotating, after finding a suitable starting point as in Lophaven et al. (2002b). By default, DACE uses a single correlation parameter for all dimensions and initializes it to 0.1. In order to be comparable to the other methods, we retain the 0.1 value, but make it into a d-length vector so that the package fits a correlation parameter for each dimension. In DACE, upper and lower bounds for each θ_i must be provided; we set these to be generous.

GPy begins with initial correlation parameters set to 1. However, by default GPy uses the same correlation parameter in every direction, so to get a separate parameter for each dimension, we had to set
ARD=True. We also had to set the likelihood variance to a small value, instead of its default of 1, to get good results. GPy allows a choice of optimization routines: TNC, L-BFGS-B, and BFGS from SciPy (Jones et al., 2001); Adadelta and RProp from the Python module climin; as well as a Nelder-Mead simplex routine and Scaled Conjugate Gradients. The optimization is run through the Python module paramz. We use the default of L-BFGS-B. We also use five optimization restarts to ensure the optimization results are favorable, although this increases the run time.

By default, sklearn uses the fmin_l_bfgs_b optimization algorithm from scipy.optimize (scikit-learn developers, 2016b; Jones et al., 2001). This algorithm is an implementation of the L-BFGS-B algorithm (Byrd et al., 1995). There is an option to use multiple restarts to help the optimization avoid getting stuck in a local minimum. By default the number of restarts is zero, which is what we use in our tests; however, trying more restarts may improve performance. In our initial testing with sklearn, we observed poor results when the data was not scaled. In our comparison tests in this paper, we scaled all our data to have mean 0 and range 1, as discussed in Section 5.

4. Empirical study methodology

In this section, we discuss the criteria on which we will compare different GP fitting software packages. When constructing a global metamodel, two properties of GP modeling are of primary importance: (1) the accuracy of prediction, and (2) the accuracy of the estimate of prediction error. The first is important for obvious reasons. The second is important to allow the practitioner to assess whether the metamodel is fit for use or whether additional data is needed to improve its fit. The differences between parameterizations mentioned in Section 3.1 do not matter here because our interest is in the accuracy of the predictions.
We focus on global fitting, not on optimization, where the comparison criterion would be the accuracy of the estimated optimal input vector and the estimated optimal output value.
When we evaluate model accuracy, we use a known surface and compare the actual surface and the model's predicted values at a large number of points, called prediction points. The prediction points are distributed throughout the area of interest for the input values. We use the square root of the mean of the squared errors at the prediction points as an estimate of the model's RMSE; for ease of discussion we will call this estimate the "empirical model RMSE" or "EMRMSE." Thus, using m prediction points x*_1, ..., x*_m,

EMRMSE = sqrt( (1/m) Σ_{i=1}^{m} ( ŷ(x*_i) − y(x*_i) )² ).

Santner et al. (2003, p. 108) call this the empirical root mean squared prediction error. Although this metric does not account for the distribution of the prediction errors, it is a commonly-used single number that summarizes the quality of the fit of the model across the entire surface.

We use EMRMSE to assess the quality of the fit for a model. This measure is primarily useful when comparing two models fitting the same surface, since the model with the lower EMRMSE fits the surface better. Also, in our empirical studies in Section 5, the value of EMRMSE can be very roughly thought of as an average relative error for the model. This is because EMRMSE estimates the standard deviation of the prediction errors, and, as discussed in Section 5, we scale the responses (at prediction and design points) to have a range of 1.
To evaluate the accuracy of the model's estimated prediction error, we estimate the model's mean squared error φ(x), defined in Equation 4, by φ̂(x), obtained by substituting the fitted model's parameter estimates for the unknown parameters in Equation (4). The square root of the average of the estimated mean squared errors over all prediction points is used as the summary measure for the predicted model RMSE, called the "PMRMSE":

PMRMSE = sqrt( (1/m) Σ_{i=1}^{m} φ̂(x*_i) ).

Since EMRMSE and PMRMSE both measure the model's RMSE, we expect them to be approximately equal. If we observe EMRMSE ≈ PMRMSE, that confirms the accuracy of the model's prediction error. If EMRMSE is much larger than PMRMSE, then the model is overconfident in its fit, since its estimated prediction errors will be smaller than the empirical errors. Conversely, if EMRMSE is much less than PMRMSE, then the model's estimated prediction errors are conservative.

Note that Bastos and O'Hagan (2009) suggest a different way to compare predictive errors by using the predictive covariance matrix. For a single prediction point, we could calculate the standardized prediction error, which should follow a t-distribution. For the entire set of prediction points, these will be correlated. Intuitively this makes sense because the prediction functions are continuous and the true surface is usually also continuous, so points near each other will necessarily have related errors. This error analysis requires the predictive covariance matrix so that the standardized errors can be decorrelated. This method places equal importance on all parts of the surface. However, in practice, one usually focuses on areas where the predicted error is large. Moreover, most software packages do not provide the predictive covariance matrix, and the errors can differ by orders of magnitude depending on how close they are to sample points.
For these reasons, we use EMRMSE and PMRMSE as the basis for assessing the accuracy of the estimated prediction error for a given model.

Fitting GPs can be computationally intensive when the number of points is large. Thus we only want to make the computational investment when we will see a significant improvement over simpler models. In preliminary investigations, we found that when the surface is too trivial or the sample size is too small, fitting a linear model (or even just the mean) can give predictions similar to the fitted GP model. Thus, in our empirical study, whenever we fit a GP model in d dimensions, we also fit a d-dimensional hyperplane, which we call the linear model (LM). Using the same prediction points, we then calculate the EMRMSE for the LM. Use of the complicated GP model is only beneficial if the EMRMSE of the GP model is substantially less than that of the LM. When the EMRMSE of the GP model is greater than or equal to that of the LM, it indicates that the GP model is unsuitable for the situation. In either case, comparisons of GP model fitting are not of interest. Thus throughout the empirical study we define ξ(M) as the ratio of the EMRMSE of the fitted GP model M to the EMRMSE of the LM, as shown in Equation 8:

ξ(M) = EMRMSE(M) / EMRMSE(LM). (8)

This is similar to the normalized RMSE measure, e_rmse,ho, used by Chen et al. (2016), which is the ratio of the RMSE of the GP model to the RMSE of the model that only fits the mean. We believe ξ is more useful because practitioners are more likely to consider a linear model as an alternative to the Gaussian process.

To keep PMRMSE on the same scale, we also define π(M) as the ratio of the PMRMSE of M to the EMRMSE of the LM, as shown in Equation 9:

π(M) = PMRMSE(M) / EMRMSE(LM). (9)
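The two summary measures and the ratios in Equations (8) and (9) are straightforward to compute. Here is a small sketch with hypothetical prediction arrays:

```python
# Sketch of the comparison metrics: EMRMSE, PMRMSE, and the ratios xi and pi
# against a linear-model baseline. All arrays below are hypothetical.
import numpy as np

def emrmse(y_pred, y_true):
    # Empirical model RMSE over the m prediction points.
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

def pmrmse(mse_hat):
    # Predicted model RMSE from the model's own MSE estimates phi_hat(x*).
    return np.sqrt(np.mean(mse_hat))

y_true = np.array([0.0, 0.5, 1.0, 0.25])
gp_pred = np.array([0.02, 0.48, 0.97, 0.30])
lm_pred = np.array([0.20, 0.45, 0.70, 0.55])
gp_mse_hat = np.array([0.001, 0.002, 0.001, 0.004])

xi = emrmse(gp_pred, y_true) / emrmse(lm_pred, y_true)   # Eq. (8)
pi = pmrmse(gp_mse_hat) / emrmse(lm_pred, y_true)        # Eq. (9)
```

A value of xi well below 1 indicates the GP beats the linear baseline; pi close to xi indicates the model's error estimates are honest.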
5. Empirical study results
In this section we compare the aforementioned software packages on four test functions: the borehole function, the output transformer-less (OTL) circuit function, the Dette and Pepelyshev 8-dimensional model, and the Morris function. For all the functions, we created independent maximin Latin hypercube samples (LHSs) using the R package
MaxPro (Ba and Joseph, 2015) for the design matrices, and a much larger (2,000-point) maximin LHS for the prediction points. Using a maximin LHS helps ensure that the data represents the input space well. We use x ∈ [0, 1]^d, which is commonly done to avoid numerical issues and make sure the data scale is reasonable for the correlation function. When calculating function values, the input values, x_i, are scaled to be in the appropriate domain of each function. The output, y, can also be standardized before fitting the model to it, since the range can affect how much of the variation in the data is seen as noise. Many software packages do this standardization automatically or have the option to do so. For all of our comparisons shown in Section 5, the output data is scaled to have mean 0 and range 1, as recommended by Gramacy (2007).

A common recommendation in computer experiments is to use a sample size of ten times the number of input dimensions, i.e., choose n = 10d (Loeppky et al., 2009). We find that this sample size is often too small, giving predictions only slightly better than a linear model. Thus we use input samples of size n = 10d and n = 20d taken from space-filling designs in our comparisons that follow. The amount of data needed to get a good fit depends on the curvature of the data, the quality of the design, and the desired accuracy of the model.

On each sub-plot of the following figures, we generate five surfaces, called macroreplicates, and fit them using thirteen software package versions. Different shaped icons represent the results for different macroreplicates. The five macroreplicates are generated by five different sets of design and prediction points. Thus, we have 65 fitted metamodels on each plot: thirteen packages fitting five macroreplicates each. The x-axis plots ξ(M_i) and π(M_i) for each macroreplicate, as defined in Equations 8 and 9, where M_i represents one of the 65 metamodels.
ξ(M_i) is plotted on the solid line for each package with solid icons, while the gray icons slightly above each solid line are π(M_i). Lines connecting the π value to the ξ value for each macroreplicate make it easy to see whether the packages are underestimating or overestimating the actual error. A positive slope for the line indicates overestimation of the error, while a negative slope indicates underestimation of the error. Thus, a good metamodel M_i will have a ξ(M_i) near zero and will also have π(M_i) roughly equal to ξ(M_i). Each macroreplicate uses the same icon shape for ξ(M_i) and π(M_i) so they can be compared to each other and across the different software packages.

The ranges of the plots are selected to allow the reader to see the relationships between the ξ and π values across all packages. Several of the problematic values appear to the left of the plots, indicating that their values are too small to fit on the plot. Values that are too large are shown to the right of the plots. All of the data values in the examples in this paper, including those that appear outside of the plot ranges, are available in the Data in Brief article associated with this paper (Erickson et al., 2017).

The borehole function described by Worley (1987) is commonly used for testing emulators (Morris et al., 1993). The input is 8-dimensional and each variable is confined to specified ranges. The borehole function, f(x), is

f(x) = 2π T_u (H_u − H_l) / ( log(r/r_w) [ 1 + 2 L T_u / (log(r/r_w) r_w² K_w) + T_u/T_l ] ).

We used an R implementation based on the one provided by Surjanovic and Bingham (2016), where they recommend selecting sample points following a normal distribution for r_w, a lognormal distribution for r, and uniform distributions for the other six variables in their respective ranges.
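For reference, the borehole function transcribes directly to code. This is a sketch; the input values below are the midpoints of the commonly used variable ranges and are purely illustrative.

```python
# Python transcription of the borehole function (water flow through a
# borehole); variable names follow the equation in the text.
import numpy as np

def borehole(rw, r, Tu, Hu, Tl, Hl, L, Kw):
    frac = np.log(r / rw)
    return (2 * np.pi * Tu * (Hu - Hl)) / (
        frac * (1 + 2 * L * Tu / (frac * rw**2 * Kw) + Tu / Tl)
    )

# Evaluate at illustrative mid-range inputs.
y = borehole(rw=0.1, r=25050, Tu=89335, Hu=1050,
             Tl=89.55, Hl=760, L=1400, Kw=10950)
```

The function is smooth but has strongly unequal sensitivity across inputs, which is why it is a standard emulator test case.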
We followed these recommendations for choosing sample points in each dimension, transforming them to be uniform in [0, 1]. We also fit a 4-dimensional version of the function that varies only r_w, T_u, T_l, and L, with the other values set to the middle value of their range. Figure 2 shows the results in plots of our comparisons. The top row has the 4-dimensional function, the bottom row has the full 8-dimensional function, and we use two different sample sizes for each dimension. The left column of plots in Figure 2 has results for the smaller sample sizes (n = 40 for 4 dimensions and n = 80 for 8 dimensions). The right column has results for the larger sample sizes (n = 80 for 4 dimensions and n = 160 for 8 dimensions).

When the input sample size is increased by a factor of 2, the ξ and π values are typically reduced by about 30%. All four plots in Figure 2 have been put on the same scale for easier comparison of this effect. The GP metamodels clearly fit the borehole surface better than the linear model, since almost all of the ξ values are less than one. The exceptions are one macroreplicate of sklearn in Figure 2b and some of the JMP0 macroreplicates in Figures 2c and 2d; these have been cut from the plot and placed to the right to indicate that they could not fit on the plot without skewing the axes.

We can see that there is a problem in the error estimates for all of the packages. For almost every macroreplicate, π is less than ξ, often by a factor of two or more. Users should be aware of the possibility of systematic underestimation of model error as seen in the borehole example. These discrepancies between predicted errors and actual errors are likely due to the data not actually coming from a Gaussian process, so the surface does not match the model assumptions. Methods such as cross-validation can be used to check for this problem.

Overall, GPfit, mlegp, JMPE, and GPy have the best performances on all four examples shown. sklearn has trouble on some of the macroreplicates in four dimensions, but does better in eight dimensions.
DiceKriging, laGP, and DACE generally perform a little worse than the others, while JMP0 has some serious problems on the 8-dimensional surfaces.

Figure 3 shows how long (in seconds on a log scale) it took to fit each macroreplicate and make m = 2000 predictions for the 8-dimensional borehole surface with n = 80 and n = 160 design points. All macroreplicates for all packages were run on the same node of a Linux cluster, except for JMP, which was run on a personal Dell laptop running Windows. The relative run times for each package are the same for both sample sizes, and the same pattern is found on other test functions as well. GPfit is by far the slowest, taking over 15 minutes per macroreplicate for n = 160. JMP is the next slowest, taking two minutes per macroreplicate, but this data is unreliable since it was run on a different computer. The next slowest is mlegp, taking about eight minutes, with GPy only slightly faster. The fastest packages were DiceKriging, laGP, sklearn, and DACE, which only took a handful of seconds. Thus we see that there is a massive difference in the run times, with a factor of over 1000 between the fastest and the slowest packages performing the same task. The times shown in this plot are for the borehole function, but the relative times are similar for the other functions. In particular, GPfit and JMP are extremely slow, mlegp is also very slow, and the rest are much faster. Therefore, when one is choosing a package, it may be necessary to consider not only the model options and capability, but also how quickly it runs. Run times must also be considered in the context of the data being used. If the data comes from a simulation model that takes hours per observation, then the difference of a minute may be negligible.

Ben-Ari and Steinberg (2007) use a test function that describes an output transformer-less (OTL) push-pull circuit. There are six input parameters, five for resistors (R_b1, R_b2, R_f, R_c1, R_c2) and one for circuit gain (β).
The equation is given by

V_m = (V_b1 + 0.74) β (R_c2 + 9) / (β(R_c2 + 9) + R_f) + 11.35 R_f / (β(R_c2 + 9) + R_f) + 0.74 R_f β (R_c2 + 9) / ((β(R_c2 + 9) + R_f) R_c1),

with V_b1 = 12 R_b2 / (R_b1 + R_b2).

Figure 2: Borehole 4-D and 8-D comparison ((a) d = 4, n = 40; (b) d = 4, n = 80; (c) d = 8, n = 80; (d) d = 8, n = 160). All four plots are on the same scale.

Figure 3: Run times (seconds) for Borehole 8-D with (a) n = 80 and (b) n = 160, both with m = 2000. There are enormous differences among the packages, but the relative speeds of the packages are similar.

We used an R implementation provided by Surjanovic and Bingham (2016). Figure 4 shows our results, based on five macroreplicates of n = 60 and n = 120 observations. On this function, most of the fits are much better than the linear model, since most of the ξ values are below 0.1. Some of the laGPE and JMP0 points are placed to the right of the plot to indicate that their values are off the scale. The problems exhibited by some of these packages, such as DiceKriging and laGPE, may be reduced by tuning the optimization parameters, but we did not attempt this, as not all practitioners may have this insight. For both sample sizes, the best ξ values come from GPfit2, mlegp, JMPE, GPy, and sklearn, with GPfit1.95, laGP6, and DACE performing only slightly worse. DiceKriging, laGPE, and JMP0 perform poorly compared to the best packages. The prediction errors, π values, are fairly accurate on this data. Most of the π values are less than the corresponding ξ values by a small margin, but not by as much as in the borehole results of Figure 2. Doubling the sample size reduced the ξ and π by about a factor of two, showing that increasing the sample size beyond 10d can have a large impact.
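The OTL circuit function also transcribes directly to code. This is a sketch; the midpoint inputs below are illustrative values from the usual variable ranges.

```python
# Python transcription of the OTL push-pull circuit function; names follow
# the equation in the text.
def otl_circuit(Rb1, Rb2, Rf, Rc1, Rc2, beta):
    Vb1 = 12 * Rb2 / (Rb1 + Rb2)
    denom = beta * (Rc2 + 9) + Rf
    term1 = (Vb1 + 0.74) * beta * (Rc2 + 9) / denom
    term2 = 11.35 * Rf / denom
    term3 = 0.74 * Rf * beta * (Rc2 + 9) / (denom * Rc1)
    return term1 + term2 + term3

# Evaluate at illustrative mid-range inputs.
Vm = otl_circuit(Rb1=100, Rb2=47.5, Rf=1.75, Rc1=1.85, Rc2=0.725, beta=175)
```

At mid-range inputs the midpoint voltage V_m is dominated by the first term, since β(R_c2 + 9) is much larger than R_f.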
Dette and Pepelyshev (2010) present an 8-dimensional model "which is highly curved in some variables and has less curvature in another variables." The input is in [0, 1]^8, and the output is given by the equation below:

η(x) = 4(x_1 − 2 + 8x_2 − 8x_2²)² + (3 − 4x_2)² + 16 √(x_3 + 1) (2x_3 − 1)² + Σ_{k=4}^{8} k log(1 + Σ_{i=3}^{k} x_i).

Figure 4: OTL circuit comparison, (a) n = 60, (b) n = 120.

This function and its R implementation were also taken from Surjanovic and Bingham (2016).

Figure 5 shows the results when we test this function with n = 80 and n = 160 design points in 8 dimensions. There are clear differences between the packages in these plots. GPy has the smallest ξ values for both plots, with GPfit2 not far behind. There is a large difference in the ξ values between GPy and the worst performers, so again we see that the software used makes a difference. JMP0 is consistently bad on these examples, while JMPE is very inconsistent, with a mixture of good and bad fits. laGP6 is generally very good, while laGPE, despite being consistent, is one of the worst performers. GPfit, mlegp, GPy, and sklearn perform the best on this function. The error predictions for all packages except JMP are generally good, being within 25% of the actual error. We can also see that the performance ordering of the packages on this problem is similar to that for the OTL circuit example in Figure 4. Again, increasing the number of observations beyond 10d had a decidedly beneficial effect, roughly halving the ξ and π values.

Figure 5: Dette-Pepelyshev comparison, (a) n = 80, (b) n = 160.

The Morris function is a 20-dimensional function created by Morris (1991), and we use the version presented by Le Gratiet et al. (2016):

f(x) = Σ_{i=1}^{20} β_i w_i + Σ_{i<j} β_{i,j} w_i w_j + Σ_{i<j<l} β_{i,j,l} w_i w_j w_l,

where w_i = 2(1.1 x_i / (x_i + 0.1) − 1/2) for i = 3, 5, 7, and w_i = 2(x_i − 1/2) otherwise. The coefficients are β_i = 20 for i = 1, ..., 10; β_{i,j} = −15 for i, j = 1, ..., 6; and β_{i,j,l} = 10 for i, j, l = 1, ..., 5. All other coefficients are set to β_i = (−1)^i, β_{i,j} = (−1)^{i+j}, and β_{i,j,l} = 0.
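The Morris function can be coded directly. This is a sketch; the upper limits 10 and 6 on the coefficient index ranges follow Morris (1991), and the transformed inputs i = 3, 5, 7 become 0-based indices 2, 4, 6.

```python
# Sketch of the Morris (1991) test function in 20 dimensions.
import numpy as np
from itertools import combinations

def morris(x):
    x = np.asarray(x, dtype=float)
    w = 2 * (x - 0.5)
    for i in (2, 4, 6):                   # i = 3, 5, 7 in 1-based indexing
        w[i] = 2 * (1.1 * x[i] / (x[i] + 0.1) - 0.5)

    def beta1(i):                         # 1-based coefficient rules
        return 20.0 if i <= 10 else (-1.0) ** i
    def beta2(i, j):
        return -15.0 if (i <= 6 and j <= 6) else (-1.0) ** (i + j)
    def beta3(i, j, l):
        return 10.0 if (i <= 5 and j <= 5 and l <= 5) else 0.0

    total = sum(beta1(i) * w[i - 1] for i in range(1, 21))
    total += sum(beta2(i, j) * w[i - 1] * w[j - 1]
                 for i, j in combinations(range(1, 21), 2))
    total += sum(beta3(i, j, l) * w[i - 1] * w[j - 1] * w[l - 1]
                 for i, j, l in combinations(range(1, 21), 3))
    return total
```

A useful sanity check is the input where every w_i is zero (x_i = 1/2 everywhere except x_i = 1/12 at i = 3, 5, 7), at which the function vanishes.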
The results from using the Morris function are shown in Figure 6, using input samples of size n = 200 and n = 400. DACE has fitting problems, especially for n = 200, but the rest are fairly consistent. The ξ values are all fairly large, many around 0.4. This demonstrates that in higher dimensions it is more difficult to get a fit that is substantially better than the linear model, especially when the function is relatively linear. GPfit, laGP6, JMPE, GPy, and sklearn perform the best, but they all underestimate the error by a significant amount.

6. Recent literature / Advanced models

In this paper, we have focused on an ordinary GP model. However, there are many variations of this model that can be used in situations where there is domain knowledge about the data or where the basic model is inadequate. If the data is noisy, then a nugget should always be used and estimated. There are also many different correlation functions beyond the Gaussian and Matérn that may perform better with certain types of data. If the data set is large, then there are approximation models that should be used instead, since they will run much faster with a small loss of accuracy. If there are n design points, then the computational complexity for kriging is O(n³), which is far too slow for modern problems with millions of data points.

Figure 6: Morris comparison, (a) n = 200, (b) n = 400.

While computer experiments often assume that there is no variability in the data, this is not the case in stochastic simulations. When the noise is similar across the entire response surface, the basic model should suffice by using the nugget term. However, when the noise level varies across the surface, called heteroscedasticity, stochastic kriging should be used. Stochastic kriging, by Ankenman et al. (2010), accommodates noise in the collected data by assuming that the variance of the data is different at each design point.
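The idea of a different noise variance at each design point can be sketched in scikit-learn, whose `alpha` argument accepts a vector of per-point noise variances. This is only an illustration of the model structure, not of the two packages that implement stochastic kriging itself; all data below are hypothetical.

```python
# Stochastic-kriging-style fit sketched with scikit-learn: replicate data at
# each design point give a sample mean and a variance estimate, and the
# per-point variances enter the GP through `alpha`.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(2)
X = np.linspace(0.1, 0.9, 7).reshape(-1, 1)      # unique design points
reps = 5                                          # replicates per point
Y = np.sin(3 * X) + 0.05 * (1 + X) * rng.standard_normal((7, reps))

y_bar = Y.mean(axis=1)                            # sample mean at each point
var_hat = Y.var(axis=1, ddof=1) / reps            # variance of each mean

gp = GaussianProcessRegressor(kernel=RBF(0.3), alpha=var_hat, random_state=0)
gp.fit(X, y_bar)
pred, se = gp.predict(X, return_std=True)
```

Because the noise here grows with x, the per-point `alpha` vector plays the role of the diagonal noise matrix in stochastic kriging.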
In order to estimate the noise at each point, replicates must be collected at every point in the design. This is equivalent to having a different nugget at each design point. Thus, instead of adding δI to the diagonal of the correlation matrix, diag(δ) is added, where δ_i ∝ Var(x_i). This requires Var(x_i) to be estimated by replicates at each unique design point.

Stochastic kriging has been used for modeling simulation data from many fields, such as game theory simulations (Pousi et al., 2010) and finance, for measuring portfolio risk (Liu and Staum, 2010). Stochastic kriging models are often run in two stages. In the first stage, a small number of samples are taken at all design points. For the second stage, the number of samples for each point is allocated according to Equation (29) in Ankenman et al. (2010), which puts more replicates at points that have large sample variances and are centrally located.

Of the software packages discussed above, only mlegp and DiceKriging are able to perform stochastic kriging. For each package, a variance estimate at each point must be provided. In mlegp, this vector is passed as the "nugget" parameter, and the diagonal of the nugget matrix is set to be proportional to these values (Dancik, 2011). The nugget scaling parameter is estimated along with the other parameters. In DiceKriging, this same vector is passed as the "noise.var" parameter (Roustant et al., 2012). The prototype software developed for the paper Ankenman et al. (2010) used off-the-shelf optimization algorithms that do not scale to larger problems and often have convergence issues; it will not be considered in this comparison.

To demonstrate the use of mlegp and DiceKriging for stochastic kriging, we use data taken from the standard M/M/1 queue model. The M/M/1 queue is a service model that represents a system with one server and interarrival and service times that are independently exponentially distributed. We set the service rate λ = 1 and the arrival rate 0 ≤ x < 1.
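For this queue the steady-state number in system has known moments (mean x/(1 − x), variance x/(1 − x)²), so a second-stage allocation proportional to the standard deviation at each design point can be sketched directly. This is a simplified stand-in for the sample-variance-based rule of Equation (29) in Ankenman et al. (2010).

```python
# Two-stage allocation sketch for the M/M/1 example: second-stage replicates
# are allotted proportionally to the (here theoretical) standard deviation.
import numpy as np

design = np.array([0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])   # arrival rates
sd = np.sqrt(design / (1 - design) ** 2)                  # std dev at each point

def allocate(total, weights):
    # Proportional allocation, rounded so the counts sum exactly to `total`.
    raw = total * weights / weights.sum()
    counts = np.floor(raw).astype(int)
    # Hand leftover replicates to the largest fractional parts.
    for i in np.argsort(raw - counts)[::-1][: total - counts.sum()]:
        counts[i] += 1
    return counts

n2 = allocate(100, sd)
```

As expected, the heavily loaded design points near x = 0.9 receive by far the most second-stage replicates, since their output variance explodes as x approaches 1.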
We model the number of customers waiting in the queue as a function of the arrival rate, which is known to have mean y(x) = x/(1 − x) and variance x/(1 − x)². For design points we use seven equally-spaced points at (0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9). In the first stage, we take n = 5 samples at each point. Then the second stage is run with a total of n = 100 and n = 200 points allotted to the individual design points according to the square root of the variance of the output for the first-stage samples. The results are shown in Figure 7, where there is no distinguishable difference between the two packages. Even with n = 100 observations the ξ values are relatively large, showing that stochastic kriging typically needs more observations to fit a surface. Both packages tend to underestimate the error when the sample size is small.

Many advances in Gaussian process fitting have adapted the method to be suitable for large amounts of data, such as by exploiting sparsity. Snelson and Ghahramani (2005) present a sparse GP method that reduces the size of the covariance matrix by using pseudo-input points. Hensman et al. (2013) produce a method that works for data sets with millions of data points, which would be prohibitively slow for the standard model. Sparsity can be induced in the correlation matrix by using a correlation function with compact support, meaning that it is zero for points that are sufficiently far away (Rasmussen and Williams, 2006, p. 87-8). These functions usually are piecewise polynomials that resemble the Gaussian correlation function. Gramacy and Apley (2015) provide a way to fit GPs to large experiments quickly. Their model induces sparsity by only selecting the points that provide the greatest reduction in predictive variance when calculating the metamodel function at a given point.
They also allow for quick sequential updating and trivial parallelization, making their method very practical. Gramacy (2015) has provided the R package laGP, which implements most of these methods in addition to the basic model that we explore in this paper. Further work in Sung et al. (2016) makes the search for the best sub-design much faster.

Figure 7: M/M/1 stochastic kriging comparison using mlegp and DiceKriging, (a) n = 100, (b) n = 200. Both use 5 samples at each point in the first stage; the total number of points allocated in the second stage is n = 100 for Figure 7a and n = 200 for Figure 7b.

The recent paper by Binois et al. (2016) provides a significant improvement to stochastic kriging. They provide a more favorable framework by putting the problem in an inferential scheme with a single objective and explicit derivatives. They introduce some smoothing techniques that allow design points without replicates to be used in the model, which was a shortcoming of previous versions of stochastic kriging. Also, they use the Woodbury identity (Woodbury, 1950) to ensure that the computational complexity is similar to other stochastic kriging methods.

Advances have also come from combining the Gaussian process with other models. Neural networks inspired the work of Damianou and Lawrence (2013), who present a deep Gaussian process model that uses hierarchical Gaussian process mappings. Gramacy and Lee (2008) add GPs to the Bayesian partition tree model of Chipman et al. (1998) so that a GP is fit to each partition. Williams and Barber (1998) create a model that uses GPs for Bayesian classification. There have also been advances in allowing the model to take categorical input. Platt et al. (2001) use GPs with categorical input to generate music playlists. Chen et al. (2013) address the use of stochastic kriging with categorical input.

Some of the packages we investigate have advanced models available for users.
GPy has many models, including classification, sparse regression, latent variable models, and more. In addition to providing GP classification models, scikit-learn also has other machine learning models such as clustering, neural networks, and support vector machines. laGP provides the approximate GP model, as explained above, that is useful for massive data sets. When choosing a software package, users should consider the depth of options available on the platform and what types of models they would potentially use beyond the basic GP model to get better results.

7. Discussion

In this paper, we study various Gaussian process fitting software packages (see Table 1) and compare their performance using the GP model with Gaussian correlation for global fitting. We do not compare their performance for optimization, i.e., accurately locating an optimal point. We assess them based on the quality of their response predictions and their error estimates, each of which is averaged across the region of interest. Other possible criteria that we do not consider include maximum error and relative error. In many cases the different packages give similar, or even indistinguishable, results, which is expected since they are using the same models on the same surfaces. However, due to differences in the parameter estimation routines, the packages often give different results on complicated surfaces.

DiceKriging, an R package, performed somewhat worse than many of the other packages in our examples. However, DiceKriging runs faster than the R packages mlegp and GPfit, and provides more customization options than laGP. It has multiple correlation functions available and can estimate a nugget. By default the correlation function is the Matérn with ν = 5/2, but we did not see a large difference between that and the Gaussian correlation function.
DiceKriging also provides functionality for stochastic kriging, as demonstrated in Section 6.1.

Another R package, GPfit, uses the most extensive parameter optimization, which makes it very dependable. In our tests we found GPfit to be reliable and to give good results. The cost of this is that it takes significantly longer on larger data sets, taking noticeably longer on samples larger than even 50. On the borehole test, each macroreplicate for GPfit with a sample size of 500 took over two hours, while all the other packages finished in minutes or even seconds. For this reason we do not recommend using GPfit when the data set is large and time is valuable. We ran GPfit with its default exponent of 1.95 in the correlation function and also used the Gaussian process model where that exponent is 2.00. In general we did not see a large difference between the two in performance or run time. GPfit uses 1.95 as the default because it is supposed to provide computational stability.

The R package laGP is the fastest package we tested. We used it both with estimating a nugget and with setting a small nugget; the latter gave better results. The main benefit of laGP is that it runs very quickly, especially when repeatedly adding data in a sequential manner. In addition, it provides some additional complex models that are useful for large data sets. Thus we do not recommend laGP for kriging with small sample sizes, but we do suggest looking into its advanced functionality if there are thousands of sample points.

The final R package we evaluated, mlegp, performed well in our testing. We used it both with setting the nugget to zero and with estimating the nugget, and did not see a large difference. In addition, mlegp was a little slower than the other packages, though not as slow as GPfit. One benefit of mlegp is that it also can do stochastic kriging, as shown in Section 6.1.

JMP is a commercial software platform that makes data analysis easy for practitioners.
When the nugget was set to zero, JMP performed poorly on most of our test functions. When the nugget was estimated, however, JMP performed substantially better, often on par with the best packages. JMP seems to run relatively slowly. For our testing, JMP was run on a laptop and the other packages were run on a cluster. However, in our experience, the other packages (except for GPfit) ran faster when also run on the same laptop. Thus users should be careful when using JMP, particularly when the nugget is not estimated, since the results may be spurious, and users might get better and faster results using a different software option.

The Matlab toolbox DACE was fast but generally provided a slightly worse fit than the best models. DACE is very basic and has not been updated in years, so we recommend using other packages if there is a desire to progress to more advanced models.

GPy, a Python module, gave the best fitting results in most of our tests. It was an order of magnitude slower than the fastest package, and was generally in the middle in terms of speed. GPy also provides many options and advanced models that the practitioner can explore once they have mastered the basic model.

Finally, the Python module scikit-learn contains GP fitting capability in addition to many other machine learning algorithms. It was near the best on most examples, but occasionally exhibited inconsistency. In preliminary tests, we found that it performed better when the data is scaled and more optimization restarts are used. It was also one of the fastest packages. Although we only included the results using scikit-learn with the Gaussian correlation function in this paper, we have found in some of our tests that using the Matérn correlation function gives better results. Although scikit-learn does not provide advanced GP models, it does provide other machine learning models such as support vector machines and random forests.
Thus it would be a useful tool for those who want to use a single platform for multiple machine learning methods.

This paper focuses on the traditional Gaussian process model with Gaussian correlation. Despite specifying the same type of GP metamodel, we found that there are often significant differences between the metamodel predictions made by various software packages on the same data. Practitioners should be aware of the quality of predictions, typical run time, and model options when choosing modeling software.

There are many modifications of the model, and other correlation functions, that will often give better and faster results if there is prior knowledge about the structure of the data or if there is a large number of observations. We focus on the simple model because it is used by many practitioners who do not want to spend the time to learn the intricacies of the model, but wish to use the power of GP fitting. However, if unstable or nonsensical results are observed when fitting the simple GP model, we encourage modelers to consider using the packages with more advanced features that we allude to in Section 7.1.

8. Acknowledgments

U.S. Department of Defense Distribution Statement: Approved for public release; distribution is unlimited. The views expressed in this document are those of the authors and do not reflect the official policy or position of the Department of Defense or the U.S. Government. This work was supported in part by the Office of Naval Research via NPS's CRUSER initiative, the NPS Naval Research Program NPS-17-N191-B, and the Naval Supply Systems Command Fleet Logistics grant number N00244-15-2-0004.

Bibliography

B. E. Ankenman, B. L. Nelson, J. Staum, Stochastic kriging for simulation metamodeling, Operations Research 58 (2) (2010) 371–382.
S. Ba, V. R. Joseph, MaxPro: Maximum Projection Designs, URL https://CRAN.R-project.org/package=MaxPro, R package version 3.1-2, 2015.
L. S. Bastos, A. O'Hagan, Diagnostics for Gaussian process emulators, Technometrics 51 (4) (2009) 425–438.
E. N. Ben-Ari, D. M. Steinberg, Modeling data from computer experiments: an empirical comparison of kriging with MARS and projection pursuit regression, Quality Engineering 19 (4) (2007) 327–338.
M. Binois, R. B. Gramacy, M. Ludkovski, Practical heteroskedastic Gaussian process modeling for large simulation experiments, arXiv preprint arXiv:1611.05902.
R. H. Byrd, P. Lu, J. Nocedal, C. Zhu, A limited memory algorithm for bound constrained optimization, SIAM Journal on Scientific Computing 16 (5) (1995) 1190–1208.
H. Chen, J. L. Loeppky, J. Sacks, W. J. Welch, et al., Analysis methods for computer experiments: How to assess and what counts?, Statistical Science 31 (1) (2016) 40–60.
X. Chen, K. Wang, F. Yang, Stochastic kriging with qualitative factors, in: Proceedings of the 2013 Winter Simulation Conference, IEEE Press, 790–801, 2013.
H. A. Chipman, E. I. George, R. E. McCulloch, Bayesian CART model search, Journal of the American Statistical Association 93 (443) (1998) 935–948.
J. Christen, M. Ichchou, B. Troclet, M. Ouisse, Global sensitivity analysis of acoustic transmission models through infinite plates, in: Proceedings of ISMA, 4177–4188, 2014.
B. Ciuffo, J. Casas, M. Montanino, J. Perarnau, V. Punzo, Gaussian process metamodels for sensitivity analysis of traffic simulation models: Case study of AIMSUN mesoscopic model, Transportation Research Record: Journal of the Transportation Research Board 2390 (2013) 87–98.
A. Damianou, N. Lawrence, Deep Gaussian processes, Proceedings of the Sixteenth International Workshop on Artificial Intelligence and Statistics (AISTATS) (2013) 207–215.
G. Dancik, mlegp: An R package for Gaussian process modeling and sensitivity analysis, retrieved April 14, 2016, from http://download.nextag.com/cran/web/packages/mlegp/vignettes/mlegp.pdf, 2011.
G. M. Dancik, mlegp: Maximum Likelihood Estimates of Gaussian Processes, URL http://CRAN.R-project.org/package=mlegp, R package version 3.1.4, 2013.
G. M. Dancik, K. S. Dorman, mlegp: Statistical analysis for computer models of biological systems using R, Bioinformatics 24 (17) (2008) 1966–1967.
H. Dette, A. Pepelyshev, Generalized Latin hypercube design for computer experiments, Technometrics 52 (4) (2010) 421–429.
W. Du, N. Xue, W. Shyy, J. R. Martins, A surrogate-based multi-scale model for mass transport and electrochemical kinetics in lithium-ion battery electrodes, Journal of the Electrochemical Society 161 (8) (2014) E3086–E3096.
C. B. Erickson, B. E. Ankenman, S. M. Sanchez, Data from fitting Gaussian process models to various data sets using eight Gaussian process software packages, Data in Brief, submitted.
I. Gidaris, A. A. Taflanidis, Performance assessment and optimization of fluid viscous dampers through life-cycle cost criteria and comparison to alternative design approaches, Bulletin of Earthquake Engineering 13 (4) (2015) 1003–1028.
R. Gramacy, tgp: An R package for Bayesian nonstationary, semiparametric nonlinear regression and design by treed Gaussian process models, Journal of Statistical Software 19 (9) (2007) 1–46.
R. B. Gramacy, laGP: Large-scale spatial modeling via local approximate Gaussian processes in R, Tech. Rep., The University of Chicago. Available as a vignette in the laGP package, 2014.
R. B. Gramacy, laGP: Local Approximate Gaussian Process Regression, URL http://CRAN.R-project.org/package=laGP, R package version 1.2-1, 2015.
R. B. Gramacy, D. W. Apley, Local Gaussian process approximation for large computer experiments, Journal of Computational and Graphical Statistics 24 (2) (2015) 561–578, URL http://dx.doi.org/10.1080/10618600.2014.914442.
R. B. Gramacy, H. K. Lee, Cases for the nugget in modeling computer experiments, Statistics and Computing 22 (3) (2012) 713–722.
R. B. Gramacy, H. K. H. Lee, Bayesian treed Gaussian process models with an application to computer modeling, Journal of the American Statistical Association 103 (483) (2008) 1119–1130, URL http://dx.doi.org/10.1198/016214508000000689.
J. Hensman, N. Fusi, N. D. Lawrence, Gaussian processes for big data, CoRR abs/1309.6835, URL http://arxiv.org/abs/1309.6835.
JMP: Gaussian Process, retrieved August 01, 2016, 2016.
D. R. Jones, M. Schonlau, W. J. Welch, Efficient global optimization of expensive black-box functions, Journal of Global Optimization 13 (4) (1998) 455–492.
E. Jones, T. Oliphant, P. Peterson, et al., SciPy: Open source scientific tools for Python, [Online; accessed 2016-04-19], 2001.
J. P. C. Kleijnen, Design and Analysis of Simulation Experiments, vol. 230, Springer, 2015.
L. Le Gratiet, S. Marelli, B. Sudret, Metamodel-based sensitivity analysis: Polynomial chaos expansions and Gaussian processes, arXiv preprint arXiv:1606.04273.
D. C. Liu, J. Nocedal, On the limited memory BFGS method for large scale optimization, Mathematical Programming 45 (1-3) (1989) 503–528.
M. Liu, J. Staum, Stochastic kriging for efficient nested simulation of expected shortfall, The Journal of Risk 12 (3) (2010) 3.
J. L. Loeppky, J. Sacks, W. J. Welch, Choosing the sample size of a computer experiment: A practical guide, Technometrics 51 (4) (2009) 366–376, URL http://dx.doi.org/10.1198/TECH.2009.08040.
S. N. Lophaven, H. B. Nielsen, J. Søndergaard, Aspects of the MATLAB toolbox DACE, Tech. Rep., Informatics and Mathematical Modelling, Technical University of Denmark, DTU, 2002b.
S. N. Lophaven, H. B. Nielsen, J. Søndergaard, DACE–A MATLAB kriging toolbox, version 2.0, Tech. Rep., 2002a.
B. MacDonald, P. Ranjan, H. Chipman, GPfit: An R package for fitting a Gaussian process model to deterministic simulator outputs, Journal of Statistical Software 64 (12) (2015) 1–23.
W. R. Mebane Jr, J. S. Sekhon, et al., Genetic optimization using derivatives: the rgenoud package for R, Journal of Statistical Software 42 (11) (2011) 1–26.
M. D. Morris, Factorial sampling plans for preliminary computational experiments, Technometrics 33 (2) (1991) 161–174.
M. D. Morris, T. J. Mitchell, D. Ylvisaker, Bayesian design and analysis of computer experiments: Use of derivatives in surface prediction, Technometrics 35 (3) (1993) 243–255.
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al., Scikit-learn: Machine learning in Python, The Journal of Machine Learning Research 12 (2011) 2825–2830.
J. C. Platt, C. J. Burges, S. Swenson, C. Weare, A. Zheng, Learning a Gaussian process prior for automatically generating music playlists, in: NIPS, 1425–1432, 2001.
J. Pousi, J. Poropudas, K. Virtanen, Game theoretic simulation metamodeling using stochastic kriging, in: Proceedings of the 2010 Winter Simulation Conference, 1456–1467, 2010.
R Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2014.
P. Ranjan, R. Haynes, R. Karsten, A computationally stable approach to Gaussian process interpolation of deterministic computer simulation data, Technometrics 53 (4) (2011) 366–378.
C. E. Rasmussen, C. K. I. Williams, Gaussian Processes for Machine Learning, The MIT Press, 2006.
O. Roustant, D. Ginsbourger, Y. Deville, DiceKriging, DiceOptim: Two R packages for the analysis of computer experiments by kriging-based metamodeling and optimization, Journal of Statistical Software 51 (1) (2012) 1–55.
J. Sacks, W. J. Welch, T. J. Mitchell, H. P. Wynn, Design and analysis of computer experiments, Statistical Science (1989) 409–423.
T. J. Santner, B. J. Williams, W. Notz, The Design and Analysis of Computer Experiments, Springer Science & Business Media, 2003.
SAS Products: JMP, retrieved August 01, 2016, from http://support.sas.com/software/products/jmp/, 2016.
scikit-learn developers, scikit-learn release history, retrieved January 16, 2017, from http://scikit-learn.org/stable/whats_new.html, 2016a.
scikit-learn developers, sklearn.gaussian_process.GaussianProcessRegressor, retrieved January 14, 2017, from http://scikit-learn.org/stable/modules/generated/sklearn.gaussian_process.GaussianProcessRegressor.html, 2016b.
E. Snelson, Z. Ghahramani, Sparse Gaussian processes using pseudo-inputs, in: Advances in Neural Information Processing Systems, 1257–1264, 2005.
C.-L. Sung, R. B. Gramacy, B. Haaland, Potentially predictive variance reducing subsample locations in local Gaussian process regression, arXiv e-prints.
S. Surjanovic, D. Bingham, Virtual Library of Simulation Experiments: Test Functions and Datasets, retrieved March 29, 2016, 2016.
The GPy authors, GPy: A Gaussian process framework in Python, URL http://github.com/SheffieldML/GPy, 2012–2015a.
The GPy authors, Welcome to GPy's documentation!, retrieved Oct 03, 2016, from https://pythonhosted.org/GPy/index.html, 2015b.
N. Villa-Vialaneix, M. Follador, M. Ratto, A. Leip, A comparison of eight metamodeling techniques for the simulation of N2