Comparison of Gaussian process modeling software
Collin B. Erickson a,∗, Bruce E. Ankenman a, Susan M. Sanchez b

a Department of Industrial Engineering and Management Sciences, Northwestern University, 2145 Sheridan Rd., Evanston, IL 60208, USA
b Department of Operations Research, Naval Postgraduate School, 1411 Cunningham Rd., Monterey, CA 93943, USA
Abstract
Gaussian process fitting, or kriging, is often used to create a model from a set of data. Many available software packages do this, but we show that very different results can be obtained from different packages even when using the same data and model. We describe the parameterization, features, and optimization used by eight different fitting packages that run on four different platforms. We then compare these eight packages using various data functions and data sets, revealing that there are stark differences between the packages. In addition to comparing the prediction accuracy, we also evaluate the predictive variance, which is important for evaluating the precision of predictions and is often used in stopping criteria.
Keywords:
Simulation, Gaussian processes, stochastic kriging, metamodels, computer experiments
1. Introduction
When computer simulation models are used to study complex systems, it is often useful to fit an empirical mathematical model to quickly approximate the time-consuming computer simulation at input values that have not yet been evaluated. If fit to sufficient accuracy, these metamodels can replace the original computer models in optimization or "what if" analyses. Gaussian process (GP) modeling is commonly used for fitting metamodels in simulation experiments since it provides a flexible model and a model-based estimate of prediction error even if the simulation itself is deterministic. Gaussian process models can also be used when the simulation is stochastic, although this requires an extension of the model. Gaussian process models have become a large part of the expanding machine learning toolbox (Rasmussen and Williams, 2006).

In this paper we are concerned with how GP models are used by practitioners, so we compare the performance of some commonly used software packages. Many practitioners

∗ Corresponding author
Email addresses: [email protected] (Collin B. Erickson), [email protected] (Bruce E. Ankenman), [email protected] (Susan M. Sanchez)
Preprint submitted to Elsevier October 10, 2017

are not familiar with the particulars of GP fitting, so we investigate packages that are relatively easy to use and do not require extensive knowledge of all the options and parameters that can be specified. GP fitting is unlike linear regression where, for a given data set, all software packages will produce exactly the same parameter estimates and fitted surface (up to round-off error). Most GP fitting packages use essentially the same equations, but there is variability in how parameters are defined and estimated through numerical optimization. So, in practice, different packages can give substantially different results. Since GP fitting is often chosen over other fitting techniques because of its model-based estimate of prediction error, the quality of a software implementation depends not only on the accuracy of the fitted surface, but also on the accuracy of the error predictions. In many applications, random noise or computational limitations do not allow for extrinsic measures of prediction accuracy, and thus an easily obtained estimate of the uncertainty of prediction is valuable if it is at least reasonably accurate.

Our major interest in GP fitting is its use to sequentially build an accurate metamodel of a computer simulation over a broad range of input values. This global metamodel could be built automatically using excess computing capacity, and then when the need for real-time decision making arises, the quickly-evaluated metamodel can be used in place of the time-consuming computer simulation model.

Sequential algorithms require a stopping criterion. For building an accurate global metamodel, a stopping criterion that is a function of the estimated prediction error makes sense. Many fitting methods, such as splines and neural networks, provide no estimate of prediction error apart from extrinsic methods like cross validation, which are not related to the fitted model itself.
This is where the model-based estimate of GP fitting is very attractive. In addition to its use in a stopping criterion, prediction error estimates can be used by a sequential algorithm to determine the location of the next set of design points to run with the actual computer simulation model. We think of the sequential selection of these design points as a sequential experiment design. A sequential experiment design is called non-adaptive if the location of each additional design point is related only to the locations of the previous design points in the input space. An adaptive sequential experiment design uses not only the locations of the previous design points, but also the observed values of the response at those design points. If the objective is to find the location of an optimum, adaptive designs are vastly more efficient because they can select locations where the response is likely to be desirable. However, for building an accurate global metamodel, the response information can be used to select design points where the prediction error is estimated to be large or locations that would help estimate the metamodel parameters more accurately. Our goal is to determine which software packages have the capability of effectively and reliably estimating locations for new design points by estimating the metamodel and its prediction error.

GP metamodels are well suited for fitting an accurate global metamodel since they can fit complicated surfaces; however, they are computationally slow for large data sets. On the other hand, if the goal is to find an optimum, then a GP can also be used as a surrogate model to search for extrema (e.g., Jones et al., 1998).
Many practitioners in the simulation community use Gaussian processes as a simple fitting model approach and often are not familiar with the intricacies of the model. These scientists often use the basic kriging model and do not delve into the advanced parameter settings, optimization routines, and alternative correlation models that are available.

For example, in aerospace design, Christen et al. (2014) use GPs to model the acoustic transmission on launchers in an effort to reduce damage to the payload. The GP model allows them to perform global sensitivity analysis to see which parameters in their acoustic model affect the transmission. Yin et al. (2014) use these models in materials science for modeling functionally graded foam-filled tapered tubes to see which designs have the best energy absorption characteristics. Du et al. (2014) model the current density of lithium-ion batteries as a function of eight input parameters. GPs are used for metamodeling in simulations for corn crops by Villa-Vialaneix et al. (2012); their metamodels predict the "nitrogen dioxide (NO2) fluxes and nitrogen leaching from European farmlands." Gidaris and Taflanidis (2015) use kriging for earthquake engineering to see how the configuration of fluid viscous dampers affects costs. GPs are used as metamodels in sensitivity analysis for traffic simulation models by Ciuffo et al. (2013).

These examples highlight the need for Gaussian process software to be stable and reliable, in the same way that regression modeling is trusted for fitting linear models. Of particular importance for these applications are reliable predictions and accurate error estimates. The error predictions are especially useful in determining whether more data is needed.

This study is inspired by our previous research, where we found discrepancies between two software packages that were using the same GP model but were giving different results, particularly in the estimation of the prediction error.
We believe that others may also encounter similar issues with GP fitting and would benefit from an in-depth study of the various software options. Knowledgeable users may know how to improve the results by setting advanced options or tuning parameters. However, we are trying to find what works best for practitioners who may not have this specific knowledge. Thus, we select packages that are easy to use, and we have left as many options at the default setting as possible. For our comparisons we use packages from a mixture of platforms: the R (R Core Team, 2014) packages DiceKriging, GPfit, laGP, and mlegp; JMP, produced by SAS; the MATLAB toolbox DACE; and the Python modules GPy and sklearn (scikit-learn). These are described in more detail in Section 3. JMP is a commercial package; the rest are free and publicly available.

Figure 1 is a simple example that demonstrates problems that can arise. It displays a sample of size six (black points) in one dimension fit using common options for three different software packages, which then give predictions of the output for x between 0 and 1. The details of the packages will be explained later in this paper. The predictions given by laGPE smoothly interpolate between the observations. We can see that mlegpE exhibits mean reversion, where it predicts the observed points correctly, but then quickly reverts to the mean away from those points. At the other extreme, Dice2 oversmooths, causing its predictions to be far from the observed points. Furthermore, the predictions of the standard error across the range of x values will also be significantly different. Thus, even for a simple data set, we can obtain very different results.

Figure 1: Comparison of Gaussian process fits from three software packages, laGPE, mlegpE, and Dice2, on one-dimensional data. The black points are the input/output data given to each package to fit a GP model. The lines are the predicted mean over the interval [0,1] for each package, showing significant differences.
2. GP fitting
A Gaussian process is characterized such that the output from any set of input points has a multivariate normal distribution. If we have $n$ inputs in $d$ dimensions, then the $i$th input is $x_i = (x_{i1}, \ldots, x_{id})^T$. These are stored in the rows of the $n \times d$ input matrix $X$. The output is one-dimensional, $y = (y_1, \ldots, y_n)^T$.

Following Sacks et al. (1989), the surface is modeled as a mean, $\mu$, plus a Gaussian process, $z$, which is a function of $x$, as follows:
$$y = f(x) = \mu + z(x).$$
In general, a linear combination of functions can be used in place of $\mu$, which is called universal kriging (e.g., Bastos and O'Hagan (2009)). Many authors use only a constant term for $\mu$, since the Gaussian process is flexible enough to model any linear behavior in addition to many other more interesting features of the surface. In this paper we also use a mean-only model, following the suggestion of Chen et al. (2016), who claim that "there is little to be gained (and maybe even something to lose) by using other than a constant term for $\mu$." One explanation is that replacing this constant by a more complicated function $f(x)$ (e.g., a polynomial of order higher than zero) requires the estimation of additional (often extraneous) parameters (e.g., polynomial coefficients).

Under this model, the distribution of the outputs, $y$, is multivariate normal with mean $\mu \mathbf{1}_n$, where $\mathbf{1}_n$ is the $n$-length vector of ones. The covariance matrix of this multivariate normal distribution is proportional to a correlation matrix, which has a special structure such that the points will create a smooth surface. The constant of proportionality between the covariance matrix and the correlation matrix is a variance that is denoted $\sigma^2$. The correlation matrix is constructed such that the correlation of the outputs from any two distinct points in the input domain is inversely related to the distance between those two points.
In particular, the correlation goes to one as the distance between the two input points goes to zero, and the correlation goes to zero as the distance goes to infinity. This allows the output points to form a surface, but also places few constraints on the shapes and features of that surface. Thus the model is
$$y \sim MVN\!\left(\mu \mathbf{1}_n,\ \sigma^2 R\right), \qquad (1)$$
where $R$ is the correlation matrix of $y$. $R_{ij}$ denotes the element in the $i$th row and the $j$th column of $R$, and is the correlation between $y_i$ and $y_j$.

The covariance matrix is determined by the correlation function, for which there are many options. The most common correlation function, which we use here, is the Gaussian correlation that defines the correlation between the outputs at $x_i$ and $x_j$ as
$$R_{ij} = \prod_{k=1}^{d} \exp\!\left(-\theta_k (x_{ik} - x_{jk})^2\right). \qquad (2)$$
The estimator for the mean is
$$\hat{\mu} = \left(\mathbf{1}_n^T R^{-1} \mathbf{1}_n\right)^{-1} \left(\mathbf{1}_n^T R^{-1} y\right).$$
Then the best linear unbiased predictor (BLUP) of $y$ at $x$ is given by Sacks et al. (1989). Using our notation and following the derivations in MacDonald et al. (2015), we find
$$\hat{y} = \hat{f}(x) = \hat{\mu} + r^T R^{-1} (y - \hat{\mu}\mathbf{1}_n) = \left[\frac{(1 - r^T R^{-1}\mathbf{1}_n)}{\mathbf{1}_n^T R^{-1}\mathbf{1}_n}\mathbf{1}_n^T + r^T\right] R^{-1} y = C^T y, \qquad (3)$$
where $r^T = (r_1(x), \ldots, r_n(x))$, $r_i(x)$ is the correlation between $x_i$ and $x$, and $C$ is a vector of length $n$ defined as shown. The associated mean squared error of $\hat{y}$ at $x$ is
$$\varphi(x) = \sigma^2\left[1 - 2C^T r + C^T R C\right] = \sigma^2\left[1 - r^T R^{-1} r + \frac{(1 - \mathbf{1}_n^T R^{-1} r)^2}{\mathbf{1}_n^T R^{-1}\mathbf{1}_n}\right]. \qquad (4)$$
These parameters can be estimated, and then used in Equations 3 and 4 to get the predictions. The error added from parameter estimation is generally not included in the predictive equations, but can be estimated through bootstrap techniques; see Kleijnen (2015).

The model of Equation 1 does not account for random noise. This can be done by adding a nugget parameter.
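Equations 2-4 can be sketched directly in code. The following is a minimal numpy illustration; the function names, design points, and response values are our own and do not come from any of the packages compared in this paper.

```python
import numpy as np

def gauss_corr(X1, X2, theta):
    """Gaussian correlation (Equation 2): R_ij = prod_k exp(-theta_k (x_ik - x_jk)^2)."""
    d2 = (X1[:, None, :] - X2[None, :, :]) ** 2      # squared differences per dimension
    return np.exp(-(d2 * theta).sum(axis=2))         # product over dims = exp of the sum

def gp_predict(X, y, x_new, theta, sigma2=1.0):
    """BLUP y_hat (Equation 3) and its mean squared error (Equation 4) at one point."""
    n = len(y)
    one = np.ones(n)
    R = gauss_corr(X, X, theta)
    Rinv = np.linalg.inv(R)
    mu_hat = (one @ Rinv @ y) / (one @ Rinv @ one)   # estimator of the mean
    r = gauss_corr(X, x_new[None, :], theta)[:, 0]   # correlations r_i(x_new)
    y_hat = mu_hat + r @ Rinv @ (y - mu_hat * one)   # Equation 3
    mse = sigma2 * (1.0 - r @ Rinv @ r
                    + (1.0 - one @ Rinv @ r) ** 2 / (one @ Rinv @ one))  # Equation 4
    return y_hat, mse

X = np.array([[0.0], [0.3], [0.7], [1.0]])
y = np.sin(3 * X[:, 0])
y_hat, mse = gp_predict(X, y, np.array([0.3]), np.array([5.0]))
```

With no nugget the BLUP interpolates: at a design point, `y_hat` equals the observation and `mse` is zero up to round-off.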
To account for random homoscedastic (constant variance) noise, the model above must be augmented as follows:
$$y \sim MVN\!\left(\mu \mathbf{1}_n,\ \sigma^2 (R + \delta I)\right).$$
If a nugget, $\delta$, is included in the model, then the correlation matrix is increased along the diagonal by $\delta$: $R_\delta = R + \delta I$. Then the predictive Equations 3 and 4 can be used with $R_\delta$ in place of $R$. Note that $R_\delta$ is no longer a correlation matrix, since the diagonal values are greater than one. The nugget has the effect of smoothing the function and allowing for noise. Another reason for using a nugget is to provide computational stability. The calculations above all require inverting $R$, which can be near singular. Adding a nugget will improve this stability. When the noise is heteroscedastic (i.e., the noise variance varies across the input domain), then stochastic kriging methods must be used, as explained in Section 6.1.

When fitting a model to data, at least $d + 2$ parameters must be estimated: the mean $\mu$, the $d$ correlation parameters $\theta$, and the variance $\sigma^2$. One additional parameter, $\delta$, must also be estimated if the nugget is used. Given the same parameters and the same data, different software should give the same predictions since they are using the same equations. Thus, the differences that we have observed in predicted values between software packages are likely caused by different parameter estimates. Each software package uses some type of numerical optimization method to seek estimators for these parameters that maximize the likelihood function (see Section 3.3). Practitioners typically trust these optimization methods to work without intervention. Later in this paper, empirical studies essentially demonstrate the performance of the optimization methods for various software packages.

The mean for the Gaussian process is $\mu$. Predictions far away from design points will revert to the mean since they will have low correlation with the observed data, but $\mu$ has a much smaller effect on predictions near design points.
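The mean-reversion behavior follows directly from Equation 3: far from the design points every correlation $r_i(x)$ vanishes, so the predictor falls back to $\hat{\mu}$. A small numpy sketch (data and $\theta$ below are arbitrary choices of ours, not from any package):

```python
import numpy as np

def gauss_corr(X1, X2, theta):
    """Gaussian correlation (Equation 2)."""
    d2 = (X1[:, None, :] - X2[None, :, :]) ** 2
    return np.exp(-(d2 * theta).sum(axis=2))

X = np.array([[0.0], [0.5], [1.0]])
y = np.array([2.0, 3.5, 1.0])
theta = np.array([10.0])

R = gauss_corr(X, X, theta)
Rinv = np.linalg.inv(R)
one = np.ones(len(y))
mu_hat = (one @ Rinv @ y) / (one @ Rinv @ one)

def blup(x):
    """Equation 3 at a scalar input x."""
    r = gauss_corr(X, np.array([[x]]), theta)[:, 0]
    return mu_hat + r @ Rinv @ (y - mu_hat * one)

print(blup(0.5))    # at a design point: interpolates the observation, 3.5
print(blup(10.0))   # far from all design points: reverts to mu_hat
```

Near the data the prediction tracks the observations; at `x = 10` the correlations underflow to zero and the prediction is exactly the estimated mean.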
As previously discussed, the mean term can be replaced with a linear model-type function $\beta^T f$, but we do not consider such an expanded model in this paper. Some analysts prefer not to use a mean term at all.

The vector of hyperparameters, $\theta = (\theta_1, \ldots, \theta_d)^T$, contains the correlation parameters of the covariance function. There is one parameter for each dimension that determines how strong the correlation is between points in the corresponding dimension. Sometimes a different parameterization is used (see Section 3.1), which can lead to changes in the ease and stability of numerical optimization.

The parameters in $\theta$ help determine the correlation, but the parameter $\sigma^2$ also affects the fitting, since the covariance between any two points is $\mathrm{cov}(y_i, y_j) = \sigma^2 R_{ij}$. Note that this variance parameter is not the variance of a sample from the output surface. By itself it can be interpreted as the variance of a point "infinitely far" from all other points.

The nugget allows for measurement error or stochasticity of the response. If the nugget is not used (i.e., set to zero), then the model will interpolate exactly, so the prediction error at a design point will be zero. This is often useful for deterministic computer experiments, but if the data is stochastic then a nugget should be estimated and used. Using a nugget improves numerical stability by making the correlation matrices easier to invert; this inversion can be a problem when there is a large number of sample points. Ranjan et al. (2011) provide a method to use the smallest nugget value that makes the computations stable, which can help balance the benefits of having stability and a small nugget. Gramacy and Lee (2012) argue that the nugget provides protection when the assumption of stationarity is violated or the data is sparse, and they claim this protection is more important than stability.
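The stability role of the nugget is easy to see numerically: adding $\delta I$ to a near-singular $R$ sharply reduces its condition number. A sketch (the nearly coincident design points and the value of $\delta$ below are arbitrary illustrations of ours):

```python
import numpy as np

def gauss_corr(X, theta):
    """Gaussian correlation matrix (Equation 2) for one design matrix."""
    d2 = (X[:, None, :] - X[None, :, :]) ** 2
    return np.exp(-(d2 * theta).sum(axis=2))

X = np.array([[0.0], [0.0001], [0.5], [1.0]])   # first two points nearly coincide
R = gauss_corr(X, np.array([1.0]))              # two rows are almost identical

delta = 1e-6
R_delta = R + delta * np.eye(len(X))            # nugget-augmented matrix

# The nugget lifts the smallest eigenvalue, so the condition number drops by
# orders of magnitude, and inverting R_delta is far better behaved.
print(np.linalg.cond(R), np.linalg.cond(R_delta))
```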
3. Gaussian process fitting in various software packages
Table 1 lists the primary sources for our comparison of packages. Most of the packages provide users with options. We try to leave most options at their default settings, since that is what most practitioners are likely to use. However, in order to have fair comparisons, we make some simple selections so all packages use comparable models. This section provides overviews of the packages and describes the selections we use in our study.
DiceKriging is an R package for kriging created by the DICE Consortium, which has also released the R packages DiceOptim, DiceDesign, and DiceEval (Roustant et al., 2012). DiceOptim does the optimization performed when fitting DiceKriging models. DiceKriging provides many options for fitting and is very thorough, so it may be a good choice for many R users.
Table 1: Packages we are using

Package      Version  Platform  Primary source
DiceKriging  1.5.5    R         Roustant et al. (2012)
GPfit        1.0-0    R         MacDonald et al. (2015)
laGP         1.3-2    R         Gramacy (2015)
mlegp        3.1.4    R         Dancik and Dorman (2008)
JMP Pro      13.0.0   JMP       JMP: Gaussian Process (2016)
DACE         2.5      Matlab    Lophaven et al. (2002a)
GPy          1.5.6    Python    The GPy authors (2015a)
sklearn      0.18.1   Python    Pedregosa et al. (2011)

GPfit, another R package, was created by MacDonald, Ranjan, and Chipman. GPfit does the most extensive search in optimizing the maximum likelihood parameters, as detailed in MacDonald et al. (2015). Even when the control parameters for the optimization were set to reduced values, GPfit was still slower than the other packages by orders of magnitude. With more than a hundred design points GPfit becomes prohibitively slow, while the other packages still run quickly. One advantage of GPfit is that it focuses on computational stability, using the ideas put forth in Ranjan et al. (2011). It sets the nugget to be the smallest value that will avoid singularity, meaning that the nugget is never estimated. Thus GPfit is best suited for noiseless data. Since the default exponential power is 1.95 (instead of 2; see Equation 2), we include it as a separate model in some of the testing below. However, we change the power to 2 for most of the testing to be comparable with the other packages.

laGP is an R package created by Robert Gramacy that provides an entirely new method for fitting using GPs (Gramacy, 2015). The "la" stands for "local approximate" since the model is designed for large data sets. In the laGP model, sparsity is exploited so the kriging is done on a small number of design points that are most important for prediction at a given point (Gramacy and Apley, 2015). Thus, it can be run when there are a million design points, since it uses only a small number of points to make predictions and leverages parallelism. For the purpose of this research, we only use the basic GP fitting functions provided by the package.
Although it is an R package, the heart of it is a C implementation wrapped in R, meaning it should be faster than a basic R implementation. In its current state it is a rather minimal package with a focus on speed, so it has fewer additional options and may require more fine-tuning, but it is extremely fast.

The R package mlegp (Maximum Likelihood Estimates of Gaussian Processes) was created by Garrett Dancik (see Dancik, 2013). It provides full GP modeling capabilities. A distinctive feature of mlegp is that the user can specify the nugget matrix up to a multiplicative constant, which can be useful when the response is heteroscedastic, as in Dancik and Dorman (2008). Another feature is the ability to perform sensitivity analysis, letting the user quantify how the response is affected by parameters and how much variability in the output can be attributed to changes in the design matrix.

We also use the Gaussian Process capability of
JMP, a data analysis software tool provided by SAS (SAS Products: JMP, 2016). This program is commonly used by practitioners since it provides a clean interface, makes data analysis simple, and provides useful output displays.
DACE is a Matlab toolbox for fitting data from deterministic computer experiments, so it does not allow for noisy data (see Lophaven et al., 2002a). Thus it is only suitable when there is no random error at any given design point. DACE was created and last updated in 2002, so while it is commonly used, it lacks many of the additional features that other packages include.
GPy is a Python Gaussian process implementation created by the Sheffield machine learning group (The GPy authors, 2015a). GPy has a tremendous amount of functionality available for many different cases. During our preliminary tests, GPy gave poor results due to computation problems when it was version 0.6. These problems were fixed after GPy version 1.0 was released in April 2016, and we report results for version 1.5.6 in this paper.

Another open source library for Python is scikit-learn, which we call sklearn since that is the name of the Python module (Pedregosa et al., 2011). It is targeted at machine learning, not just kriging, so there are many other modeling options available in the module. Up through version 0.17, the kriging implementation was based on DACE. However, the Gaussian process functionality was vastly upgraded with version 0.18, released in September 2016 (scikit-learn developers, 2016a). The update added options for the correlation function, called the kernel, including the Gaussian, Matérn, and rational quadratic kernels, among others, as well as sum or product combinations of kernels.

Since the packages all use nearly the same equations, the real challenge in model fitting is estimating the parameters. Whereas the predictions are calculated using formulas, parameters must be estimated by solving an optimization problem. Generally, the parameters are chosen to be those that maximize the likelihood. However, the solution found depends on the starting values, the bounds, the algorithm used, and the parameterization. There is additional ambiguity since real data is typically not truly a sample from an actual Gaussian process. The Gaussian process model is just a useful approximation technique, and thus there are no true parameter values. The rest of this section examines the differences between the package parameterizations, the options available, and the estimation methods.
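Because the predictive equations are shared, differences between packages trace back to the estimated parameters. A short numpy sketch (our own code, arbitrary data) shows how two different values of $\theta$ yield noticeably different predictions from the same six observations, even though both resulting models interpolate the data:

```python
import numpy as np

def blup(X, y, x_new, theta):
    """BLUP of Equation 3 under the Gaussian correlation of Equation 2."""
    def corr(A, B):
        d2 = (A[:, None, :] - B[None, :, :]) ** 2
        return np.exp(-(d2 * theta).sum(axis=2))
    n = len(y)
    Rinv = np.linalg.inv(corr(X, X) + 1e-10 * np.eye(n))   # tiny jitter for stability
    one = np.ones(n)
    mu_hat = (one @ Rinv @ y) / (one @ Rinv @ one)
    r = corr(X, x_new)[:, 0]
    return mu_hat + r @ Rinv @ (y - mu_hat * one)

X = np.linspace(0, 1, 6).reshape(-1, 1)
y = np.sin(6 * X[:, 0])
x_new = np.array([[0.9]])                    # midway between two design points

p1 = blup(X, y, x_new, np.array([1.0]))      # long correlation length: smooth fit
p2 = blup(X, y, x_new, np.array([100.0]))    # short correlation length: mean-reverting fit
print(p1, p2)                                # noticeably different predictions
```

Which of these two fits a real package returns depends entirely on how well its optimizer estimates $\theta$, which is exactly the behavior compared in this study.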
3.1. Correlation functions and parameterization

The commonly used Gaussian correlation function is shown in Equation 2 above, but there are many variations on it and many other correlation functions that can be used.
DACE and
JMP use this standard formulation. The mlegp package uses a different notation for the correlation parameters, $\beta = \theta$, but this does not affect the calculations at all. GPfit uses $\beta = \log_{10} \theta$ in order to focus the optimization search near the center of the search space.

The Gaussian correlation can be generalized by allowing the exponent to be changed to any value in the range [1, 2]. GPfit uses 1.95 in the exponent. Ranjan et al. (2011) justify this change by explaining that it helps to reduce the computational problems caused by a near-singular correlation matrix when a space-filling design is used. In our tests, we evaluate two versions of GPfit: one using the Gaussian correlation function, which we call GPfit2, and one using 1.95 as the exponent, which we call GPfit1.95.

The package laGP moves $\theta$ to the denominator and calls it the length-scale parameter, as shown in Equation 5. This change is simply a reparameterization and does not affect the model at all. However, it will affect the optimization routine used to estimate $d$, the $p$-length vector whose $k$th element is $d_k$. This formulation is used by other authors such as Rasmussen and Williams (2006). The correlation function is
$$R_{ij} = \prod_{k=1}^{p} \exp\!\left(-(x_{ik} - x_{jk})^2 / d_k\right), \qquad (5)$$
where $p$ is used to denote the number of dimensions. The notation used for the parameters is also slightly different in the laGP code and vignette (a guide for the R package) than in the others. The nugget is referred to as $g$ in the code and $\eta$ in the vignette, while the lengthscale parameters are denoted by $d$ in the code and $\theta$ in the vignette (Gramacy, 2014).

Another parameterization adjusts the correlation function so that the lengthscale parameters, denoted as $\ell$, appear squared in the denominator. This puts $\ell$ on the same scale as $x$. In addition, a factor of two is added in the denominator so that the correlation function closely resembles the Gaussian probability density function, as shown in Equation 6.
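All of these conventions describe the same correlation; only the optimizer's search space changes. A quick numerical check (assuming numpy; the parameter values and input points are arbitrary) that $\theta$ of Equation 2, the laGP length-scale $d = 1/\theta$ of Equation 5, and a lengthscale $\ell$ with $\ell^2 = 1/(2\theta)$ as in Equation 6 give identical correlation values:

```python
import numpy as np

theta = 4.0                     # Equation 2 parameterization
d = 1.0 / theta                 # laGP length-scale (Equation 5)
ell = np.sqrt(1.0 / (2.0 * theta))  # lengthscale convention of Equation 6

x, xp = 0.3, 0.8                # two one-dimensional input points

r_theta = np.exp(-theta * (x - xp) ** 2)
r_d     = np.exp(-((x - xp) ** 2) / d)
r_ell   = np.exp(-0.5 * ((x - xp) ** 2) / ell ** 2)

print(r_theta, r_d, r_ell)      # all three agree
```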
This is used by DiceKriging, sklearn, and GPy (The GPy authors, 2015b):
$$R_{ij} = \prod_{k=1}^{p} \exp\!\left(-\frac{1}{2}(x_{ik} - x_{jk})^2 / \ell_k^2\right). \qquad (6)$$
Another popular correlation function is the Matérn function. It takes a parameter $\nu$ that determines the smoothness. Commonly used values for $\nu$ are 3/2 and 5/2. The Matérn correlation function can be seen as a generalization of the Gaussian correlation function, since they are equivalent for $\nu = \infty$. According to Roustant et al. (2012), the default correlation function for DiceKriging is the Matérn with $\nu = 5/2$, which is
$$g(h) = \left(1 + \sqrt{5}\,|h| + \frac{5}{3} h^2\right) \exp\!\left(-\sqrt{5}\,|h|\right), \quad \text{where } h = \sqrt{\sum_{k=1}^{p} (x_{ik} - x_{jk})^2 / \ell_k^2}. \qquad (7)$$
GPfit, GPy, and sklearn have the Matérn correlation as an option. We include two versions of DiceKriging in our comparisons, one using the Gaussian correlation, which we label Dice2, and another using the Matérn $\nu = 5/2$.

3.2. Nugget options

There are also options available for the nugget parameter.
DiceKriging defaults to having no nugget. There is also the option of setting the nugget to a constant or estimating it. For both DiceKriging with the Gaussian correlation (Dice2) and the Matérn (DiceM52), we let it estimate the nugget. Preliminary tests involving noiseless data did not reveal noticeable differences between fits based on no nugget and an estimated nugget. We chose to include the version that estimates the nugget since DiceKriging is also useful for stochastic kriging; see Section 6.1 for more details.
GPfit uses the smallest nugget value that keeps the computation stable, as explained in Ranjan et al. (2011), in order to prevent over-smoothing. The nugget value they use is
$$\delta_{lb} = \max\left\{ \frac{\lambda_n \left(\kappa(R) - e^a\right)}{\kappa(R)\left(e^a - 1\right)},\ 0 \right\},$$
where $\lambda_n$ is the largest eigenvalue of $R$, $a$ is a parameter set to 25 for space-filling designs, and $\kappa(R)$ is the condition number of $R$. MacDonald et al. (2015) compare GPfit to mlegp and state that "mlegp occasionally crashes due to near-singularity of the spatial correlation matrix," which agrees with what we have seen, so there is a benefit to setting the nugget in this way.

In laGP the user must either set the nugget to a fixed value or tell laGP to estimate it, since there is no default option. We tried several values for the nugget in preliminary tests, and found that $1 \times 10^{-6}$ worked best. Thus, in our tests we run laGP both with the nugget set to $1 \times 10^{-6}$, called laGP6, and with the nugget being estimated, called laGPE. laGP is also different from the other packages we use since it performs the calculations from a Bayesian perspective. In practice this makes little difference for users, since the default priors are very general. In addition, laGP is the only package we use that does not estimate a mean term.

By default, mlegp will not use a nugget unless there are repeated design points, but it can be estimated or set to a constant or a vector (Dancik, 2011). For our tests we run both with the nugget fixed to 0 and with a nugget estimated from a small starting value. We call these mlegp0 and mlegpE, respectively. As shown below, we find there is little difference.

JMP provides an option to fit the model with no nugget or with an estimated nugget. We run JMP using the Gaussian correlation both with an estimated nugget and without a nugget, and we refer to these two as JMPE and JMP0, respectively.
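The lower-bound nugget $\delta_{lb}$ of Ranjan et al. (2011) used by GPfit depends only on the spectrum of $R$ and can be computed directly. A numpy sketch (our own code; the two example correlation matrices are arbitrary illustrations):

```python
import numpy as np

def delta_lb(R, a=25.0):
    """Lower-bound nugget of Ranjan et al. (2011); a = 25 for space-filling designs."""
    eigvals = np.linalg.eigvalsh(R)        # R is symmetric
    lam_n = eigvals[-1]                    # largest eigenvalue, lambda_n
    kappa = eigvals[-1] / eigvals[0]       # condition number kappa(R)
    ea = np.exp(a)
    return max(lam_n * (kappa - ea) / (kappa * (ea - 1.0)), 0.0)

# A well-conditioned R needs no nugget, so delta_lb is zero.
X = np.linspace(0, 1, 10).reshape(-1, 1)
R_good = np.exp(-50.0 * (X - X.T) ** 2)    # Gaussian correlation, theta = 50

# A nearly singular R receives the smallest nugget that restores stability.
R_bad = np.array([[1.0, 1.0 - 1e-12],
                  [1.0 - 1e-12, 1.0]])

print(delta_lb(R_good), delta_lb(R_bad))   # zero for the first, positive for the second
```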
DACE does not let the user set or estimate the nugget, since it is designed for noiseless computer experiments. Instead it uses a small value equal to $(10 + n) \times 2.2 \times 10^{-16}$ for computational stability.

The nugget can be set or estimated in GPy by setting the noise variance parameter when using the GPRegression function. We set this parameter to a small value, which forces the model to estimate a nugget parameter.

The nugget is called alpha by sklearn, and defaults to $1 \times 10^{-10}$. Alternatively, the nugget can be specified by using a WhiteKernel, and this method allows it to be estimated. We use the default value in our investigation.

Table 2 shows the packages we use in our study, along with the differences in the parameterization of $\theta$ and the options and defaults for the nugget. The last column shows how we set the nugget for our study.

Table 2: Settings for each software version used in the study. DiceM52 uses the Matérn correlation function with $\nu = 5/2$, GPfit1.95 uses the power exponential correlation with power 1.95, and all others use the Gaussian correlation function.
Package            θ parameterization   Nugget: can set?  Can estimate?  Default          Setting used
Dice2, DiceM52     ℓ² = 1/(2θ)          yes               yes            none             Est
GPfit2, GPfit1.95  β = log10(θ)         no                no             δ_lb             δ_lb
laGP6, laGPE       d = 1/θ              yes               yes            NA               1e-6 / Est
mlegp0, mlegpE     β = θ                yes               yes            0                0 / Est
JMP0, JMPE         θ = θ                yes               yes            NA               0 / Est
DACE               θ = θ                no                no             (10+n)·2.2e-16   (10+n)·2.2e-16
GPy                ℓ² = 1/(2θ)          yes               yes            Est              Est
sklearn            ℓ² = 1/(2θ)          yes               yes            1e-10            1e-10

3.3. Optimization

There are also many options in most packages for setting up the optimization routine that estimates the parameters. There are different algorithms, choices of starting points and number of restarts, and stopping criteria. We use the default settings for all packages unless specified otherwise.

By default,
DiceKriging uses the L-BFGS-B algorithm for parameter optimization. The other option available, but not used in our study, is the genoud algorithm from the rgenoud R package (Mebane Jr et al., 2011), which combines genetic (evolutionary search) algorithms with derivative-based algorithms.
GPfit uses the most in-depth optimization algorithm. As detailed in MacDonald et al. (2015), GPfit uses L-BFGS-B (Byrd et al., 1995) with multiple starts to estimate the correlation parameters, which they have transformed to be $\beta = \log_{10} \theta$. This transformation focuses the optimization search more on the middle of the search space than on the edges. In Section 2.3 of MacDonald et al. (2015), they describe how GPfit adds bounds for each $\beta_k$ to create a domain where the optimum is likely to be found. The function that is minimized is the negative profile log-likelihood (which they call the deviance),
$$-2 \log L_\theta \propto \log |R| + n \log\left[\left(y - \mathbf{1}_n \hat{\mu}(\theta)\right)^T R^{-1} \left(y - \mathbf{1}_n \hat{\mu}(\theta)\right)\right].$$
GPfit begins its search with a space-filling LHD in the space of all the $\beta_i$'s, then selects a number of parameter sets that have low deviance. These points are clustered using the $k$-means algorithm. Then the L-BFGS-B algorithm is run using these cluster centers as the starting points for the restarts.

laGP requires an initial value for the correlation parameters and nugget (if estimated), with no default provided. However, laGP provides the functions darg and garg, which provide good initial starting values for $\theta$ and $\delta$ using Empirical Bayes (Gramacy, 2015). We use these two functions in our tests to find starting values.

The package mlegp estimates the parameters using L-BFGS (Liu and Nocedal, 1989) in a gradient method (Dancik, 2013). The starting points are found using multiple Nelder-Mead simplexes.

JMP is proprietary software and provides no details on the optimization or other details beneath the surface. There are no options that can be set for the optimization.
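The profile deviance that GPfit minimizes is straightforward to evaluate for any candidate $\theta$, with $\hat{\mu}$ profiled out as in Equation 3's derivation. A numpy sketch (the data, jitter, and the crude grid search standing in for multi-start L-BFGS-B are all arbitrary choices of ours):

```python
import numpy as np

def deviance(theta, X, y):
    """-2 log L (up to a constant): log|R| + n*log[(y - 1*mu_hat)' R^{-1} (y - 1*mu_hat)]."""
    n = len(y)
    d2 = (X[:, None, :] - X[None, :, :]) ** 2
    R = np.exp(-(d2 * theta).sum(axis=2)) + 1e-8 * np.eye(n)  # small jitter for stability
    Rinv = np.linalg.inv(R)
    one = np.ones(n)
    mu_hat = (one @ Rinv @ y) / (one @ Rinv @ one)            # profiled-out mean
    resid = y - mu_hat * one
    return np.linalg.slogdet(R)[1] + n * np.log(resid @ Rinv @ resid)

X = np.linspace(0, 1, 8).reshape(-1, 1)
y = np.sin(7 * X[:, 0])

# Grid search over candidate theta values, in the spirit of a multi-start method.
grid = 10.0 ** np.linspace(-1, 2, 31)
best = min(grid, key=lambda t: deviance(np.array([t]), X, y))
print(best, deviance(np.array([best]), X, y))
```

A real package refines such a starting value with a derivative-based method; the point of the sketch is that the objective surface itself is cheap to evaluate, and the quality of the final fit hinges on where this search ends up.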
DACE uses a pattern search that iterates through the steps of exploring, moving, and rotating, after finding a suitable starting point as in Lophaven et al. (2002b). By default, DACE uses a single correlation parameter for all dimensions and initializes it to 0.1. In order to be comparable to the other methods, we retain the 0.1 value, but make it into a d-length vector so that the package fits a correlation parameter for each dimension. In DACE, upper and lower bounds for each θ_i must be provided; we set these to be generous.

GPy begins with initial correlation parameters set to 1. However, by default GPy uses the same correlation parameter in every direction, so to get a separate parameter for each dimension, we had to set
ARD=True. We also had to set the likelihood variance to a small value, instead of its default of 1, to get good results. GPy allows a choice of optimization routines: TNC, L-BFGS-B, and BFGS from SciPy (Jones et al., 2001); Adadelta and RProp from the Python module climin; as well as a Nelder-Mead simplex routine and Scaled Conjugate Gradients. The optimization is run through the Python module paramz. We use the default of L-BFGS-B. We also use five optimization restarts to ensure the optimization results are favorable, although this increases the run time.

By default, sklearn uses the fmin_l_bfgs_b optimization algorithm from scipy.optimize (scikit-learn developers, 2016b; Jones et al., 2001). This algorithm is an implementation of the L-BFGS-B algorithm (Byrd et al., 1995). There is an option to use multiple restarts to help the optimization avoid getting stuck in a local minimum. By default the number of restarts is zero, which is what we use in our tests; however, trying more restarts may improve performance. In our initial testing with sklearn, we observed poor results when the data was not scaled. In our comparison tests in this paper, we scaled all our data to have mean 0 and range 1, as discussed in Section 5.

4. Empirical study methodology

In this section, we discuss the criteria on which we will compare different GP fitting software packages. When constructing a global metamodel, two properties of GP modeling are of primary importance: (1) the accuracy of prediction, and (2) the accuracy of the estimate of prediction error. The first is important for obvious reasons. The second is important to allow the practitioner to assess whether the metamodel is fit for use or whether additional data is needed to improve its fit. The differences between parameterizations mentioned in Section 3.1 do not matter here because our interest is in the accuracy of the predictions.
We focus on global fitting, not on optimization, where the comparison criterion would be the accuracy of the estimated optimal input vector and the estimated optimal output value.
When we evaluate model accuracy, we use a known surface and compare the actual surface and the model's predicted values at a large number of points, called prediction points. The prediction points are distributed throughout the area of interest for the input values. We use the square root of the mean of the squared errors at the prediction points as an estimate of the model's RMSE; for ease of discussion we will call this estimate the "empirical model RMSE" or "EMRMSE." Thus, using m prediction points x*_1, ..., x*_m,

EMRMSE = sqrt( (1/m) Σ_{i=1}^{m} ( ŷ(x*_i) − y(x*_i) )² ).

Santner et al. (2003, p. 108) call this the empirical root mean squared prediction error. Although this metric does not account for the distribution of the prediction errors, it is a commonly-used single number that summarizes the quality of the fit of the model across the entire surface.

We use EMRMSE to assess the quality of the fit for a model. This measure is primarily useful when comparing two models fitting the same surface, since the model with the lower EMRMSE fits the surface better. Also, in our empirical studies in Section 5, the value of EMRMSE can be very roughly thought of as an average relative error for the model. This is because EMRMSE estimates the standard deviation of the prediction errors, and, as discussed in Section 5, we scale the responses (at prediction and design points) to have a range of 1.
To evaluate the accuracy of the model's estimated prediction error, we estimate the model's mean squared error φ(x), defined in Equation 4, by φ̂(x), obtained by substituting the fitted model's parameter estimates for the unknown parameters in Equation (4). The square root of the average of the estimated mean squared errors over all prediction points is used as the summary measure for the predicted model RMSE, called the "PMRMSE":

PMRMSE = sqrt( (1/m) Σ_{i=1}^{m} φ̂(x*_i) ).

Since EMRMSE and PMRMSE both measure the model's RMSE, we expect them to be approximately equal. If we observe EMRMSE ≈ PMRMSE, that confirms the accuracy of the model's prediction error. If EMRMSE is much larger than PMRMSE, then the model is overconfident in its fit, since its estimated prediction errors will be smaller than the empirical errors. Conversely, if EMRMSE is much less than PMRMSE, then the model's estimated prediction errors are conservative.

Note that Bastos and O'Hagan (2009) suggest a different way to compare predictive errors by using the predictive covariance matrix. For a single prediction point, we could calculate the standardized prediction error, which should follow a t-distribution. For the entire set of prediction points, these will be correlated. Intuitively this makes sense because the prediction functions are continuous and the true surface is usually also continuous, so points near each other will necessarily have related errors. This error analysis requires the predictive covariance matrix so that the standardized errors can be decorrelated. This method places equal importance on all parts of the surface. However, in practice, one usually focuses on areas where the predicted error is large. Moreover, most software packages do not provide the predictive covariance matrix, and the errors can differ by orders of magnitude depending on how close they are to sample points.
For these reasons, we use EMRMSE and PMRMSE as the basis for assessing the accuracy of the estimated prediction error for a given model.

Fitting GPs can be computationally intensive when the number of points is large. Thus we only want to make the computational investment when we will see a significant improvement over simpler models. In preliminary investigations, we found that when the surface is too trivial or the sample size is too small, fitting a linear model (or even just the mean) can give predictions similar to the fitted GP model. Thus, in our empirical study, whenever we fit a GP model in d dimensions, we also fit a d-dimensional hyperplane, which we call the linear model (LM). Using the same prediction points, we then calculate the EMRMSE for the LM. Use of the complicated GP model is only beneficial if the EMRMSE of the GP model is substantially less than that of the LM. When the EMRMSE of the GP model is greater than or equal to that of the LM, it indicates that the GP model is unsuitable for the situation. In either case, comparisons of GP model fitting are not of interest. Thus throughout the empirical study we define ξ(M) as the ratio of the EMRMSE of the fitted GP model M to the EMRMSE of the LM, as shown in Equation 8:

ξ(M) = EMRMSE(M) / EMRMSE(LM). (8)

This is similar to the normalized RMSE measure, e_rmse,ho, used by Chen et al. (2016), which is the ratio of the RMSE of the GP model to the RMSE of the model that only fits the mean. We believe ξ is more useful because practitioners are more likely to consider a linear model as an alternative to the Gaussian process.

To keep PMRMSE on the same scale, we also define π(M) as the ratio of the PMRMSE of M to the EMRMSE of the LM, as shown in Equation 9:

π(M) = PMRMSE(M) / EMRMSE(LM). (9)
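The two summary measures and the ratios in Equations (8) and (9) are straightforward to compute. Here is a small sketch with hypothetical prediction arrays:

```python
# Sketch of the comparison metrics: EMRMSE, PMRMSE, and the ratios xi and pi
# against a linear-model baseline. All arrays below are hypothetical.
import numpy as np

def emrmse(y_pred, y_true):
    # Empirical model RMSE over the m prediction points.
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

def pmrmse(mse_hat):
    # Predicted model RMSE from the model's own MSE estimates phi_hat(x*).
    return np.sqrt(np.mean(mse_hat))

y_true = np.array([0.0, 0.5, 1.0, 0.25])
gp_pred = np.array([0.02, 0.48, 0.97, 0.30])
lm_pred = np.array([0.20, 0.45, 0.70, 0.55])
gp_mse_hat = np.array([0.001, 0.002, 0.001, 0.004])

xi = emrmse(gp_pred, y_true) / emrmse(lm_pred, y_true)   # Eq. (8)
pi = pmrmse(gp_mse_hat) / emrmse(lm_pred, y_true)        # Eq. (9)
```

A value of xi well below 1 indicates the GP beats the linear baseline; pi close to xi indicates the model's error estimates are honest.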
5. Empirical study results
In this section we compare the aforementioned software packages on four test functions: the borehole function, the output transformer-less (OTL) circuit function, the Dette and Pepelyshev 8-dimensional model, and the Morris function. For all the functions, we created independent maximin Latin hypercube samples (LHSs) using the R package
MaxPro (Ba and Joseph, 2015) for the design matrices, and a much larger (2,000-point) maximin LHS for the prediction points. Using a maximin LHS helps ensure that the data represents the input space well. We use x ∈ [0, 1]^d, which is commonly done to avoid numerical issues and make sure the data scale is reasonable for the correlation function. When calculating function values, the input values, x_i, are scaled to be in the appropriate domain of each function. The output, y, can also be standardized before fitting the model to it, since the range can affect how much of the variation in the data is seen as noise. Many software packages do this standardization automatically or have the option to do so. For all of our comparisons shown in Section 5, the output data is scaled to have mean 0 and range 1, as recommended by Gramacy (2007).

A common recommendation in computer experiments is to use a sample size of ten times the number of input dimensions, i.e., choose n = 10d (Loeppky et al., 2009). We find that this sample size is often too small, giving predictions only slightly better than a linear model. Thus we use input samples of size n = 10d and n = 20d taken from space-filling designs in our comparisons that follow. The amount of data needed to get a good fit depends on the curvature of the data, the quality of the design, and the desired accuracy of the model.

On each sub-plot of the following figures, we generate five surfaces, called macroreplicates, and fit them using thirteen software package versions. Different shaped icons represent the results for different macroreplicates. The five macroreplicates are generated by five different sets of design and prediction points. Thus, we have 65 fitted metamodels on each plot: thirteen packages fitting five macroreplicates each. The x-axis plots ξ(M_i) and π(M_i) for each macroreplicate, as defined in Equations 8 and 9, where M_i represents one of the 65 metamodels.
ξ(M_i) is plotted on the solid line for each package with solid icons, while the gray icons slightly above each solid line are π(M_i). Lines connecting the π value to the ξ value for each macroreplicate make it easy to see whether the packages are underestimating or overestimating the actual error. A positive slope for the line indicates overestimation of the error, while a negative slope indicates underestimation of the error. Thus, a good metamodel M_i will have a ξ(M_i) near zero and will also have π(M_i) roughly equal to ξ(M_i). Each macroreplicate uses the same icon shape for ξ(M_i) and π(M_i) so they can be compared to each other and across the different software packages.

The ranges of the plots are selected to allow the reader to see the relationships between the ξ and π values across all packages. Several of the problematic values appear to the left of the plots, indicating that their values are too small to fit on the plot. Values that are too large are shown to the right of the plots. All of the data values in the examples in this paper, including those that appear outside of the plot ranges, are available in the Data in Brief article associated with this paper (Erickson et al., 2017).

The borehole function described by Worley (1987) is commonly used for testing emulators (Morris et al., 1993). The input is 8-dimensional and each variable is confined to specified ranges. The borehole function, f(x), is

f(x) = 2π T_u (H_u − H_l) / ( log(r/r_w) [ 1 + 2 L T_u / (log(r/r_w) r_w² K_w) + T_u/T_l ] ).

We used an R implementation based on the one provided by Surjanovic and Bingham (2016), where they recommend selecting sample points following a normal distribution for r_w, a lognormal distribution for r, and uniform distributions for the other six variables in their respective ranges.
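For reference, the borehole function transcribes directly to code. This is a sketch; the input values below are the midpoints of the commonly used variable ranges and are purely illustrative.

```python
# Python transcription of the borehole function (water flow through a
# borehole); variable names follow the equation in the text.
import numpy as np

def borehole(rw, r, Tu, Hu, Tl, Hl, L, Kw):
    frac = np.log(r / rw)
    return (2 * np.pi * Tu * (Hu - Hl)) / (
        frac * (1 + 2 * L * Tu / (frac * rw**2 * Kw) + Tu / Tl)
    )

# Evaluate at illustrative mid-range inputs.
y = borehole(rw=0.1, r=25050, Tu=89335, Hu=1050,
             Tl=89.55, Hl=760, L=1400, Kw=10950)
```

The function is smooth but has strongly unequal sensitivity across inputs, which is why it is a standard emulator test case.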
We followed these recommendations for choosing sample points in each dimension, transforming them to be uniform in [0, 1]. We also fit a 4-dimensional version of the function that varies only r_w, T_u, T_l, and L, with the other values set to the middle value of their range. Figure 2 shows the results in plots of our comparisons. The top row has the 4-dimensional function, the bottom row has the full 8-dimensional function, and we use two different sample sizes for each dimension. The left column of plots in Figure 2 has results for the smaller sample sizes (n = 40 for 4 dimensions and n = 80 for 8 dimensions). The right column has results for the larger sample sizes (n = 80 for 4 dimensions and n = 160 for 8 dimensions).

When the input sample size is increased by a factor of 2, the ξ and π values are typically reduced by about 30%. All four plots in Figure 2 have been put on the same scale for easier comparison of this effect. The GP metamodels clearly fit the borehole surface better than the linear model, since almost all of the ξ values are less than one. The exceptions are one macroreplicate of sklearn in Figure 2b and some of the JMP0 macroreplicates in Figures 2c and 2d; these have been cut from the plot and placed to the right to indicate that they could not fit on the plot without skewing the axes.

We can see that there is a problem in the error estimates for all of the packages. For almost every macroreplicate, π is less than ξ, often by a factor of two or more. Users should be aware of the possibility of systematic underestimation of model error as seen in the borehole example. These discrepancies between predicted errors and actual errors are likely due to the data not actually coming from a Gaussian process, so the surface does not match the model assumptions. Methods such as cross-validation can be used to check for this problem.

Overall, GPfit, mlegp, JMPE, and GPy have the best performances on all four examples shown. sklearn has trouble on some of the macroreplicates in four dimensions, but does better in eight dimensions.
DiceKriging, laGP, and DACE generally perform a little worse than the others, while JMP0 has some serious problems on the 8-dimensional surfaces.

Figure 3 shows how long (in seconds on a log scale) it took to fit each macroreplicate and make m = 2000 predictions for the 8-dimensional borehole surface with n = 80 and n = 160 design points. All macroreplicates for all packages were run on the same node of a Linux cluster, except for JMP, which was run on a personal Dell laptop running Windows. The relative run times for each package are the same for both sample sizes, and the same pattern is found on other test functions as well. GPfit is by far the slowest, taking over 15 minutes per macroreplicate for n = 160. JMP is the next slowest, taking two minutes per macroreplicate, but this data is unreliable since it was run on a different computer. The next slowest is mlegp, taking about eight minutes, with GPy only slightly faster. The fastest packages were DiceKriging, laGP, sklearn, and DACE, which only took a handful of seconds. Thus we see that there is a massive difference in the run times, with a factor of over 1000 between the fastest and the slowest packages performing the same task. The times shown in this plot are for the borehole function, but the relative times are similar for the other functions. In particular, GPfit and JMP are extremely slow, mlegp is also very slow, and the rest are much faster. Therefore, when one is choosing a package, it may be necessary to consider not only the model options and capability, but also how quickly it runs. Run times must also be considered in the context of the data being used. If the data comes from a simulation model that takes hours per observation, then the difference of a minute may be negligible.

Ben-Ari and Steinberg (2007) use a test function that describes an output transformer-less (OTL) push-pull circuit. There are six input parameters, five for resistors (R_b1, R_b2, R_f, R_c1, R_c2) and one for circuit gain (β).
The equation is given by

V_m = (V_b1 + 0.74) β (R_c2 + 9) / (β(R_c2 + 9) + R_f) + 11.35 R_f / (β(R_c2 + 9) + R_f) + 0.74 R_f β (R_c2 + 9) / ((β(R_c2 + 9) + R_f) R_c1),

with V_b1 = 12 R_b2 / (R_b1 + R_b2).

Figure 2: Borehole 4-D and 8-D comparison ((a) d = 4, n = 40; (b) d = 4, n = 80; (c) d = 8, n = 80; (d) d = 8, n = 160). All four plots are on the same scale.

Figure 3: Run times (seconds) for Borehole 8-D with (a) n = 80 and (b) n = 160, both with m = 2000. There are enormous differences among the packages, but the relative speeds of the packages are similar.

We used an R implementation provided by Surjanovic and Bingham (2016). Figure 4 shows our results, based on five macroreplicates of n = 60 and n = 120 observations. On this function, most of the fits are much better than the linear model, since most of the ξ values are below 0.1. Some of the laGPE and JMP0 points are placed to the right of the plot to indicate that their values are off the scale. The problems exhibited by some of these packages, such as DiceKriging and laGPE, may be reduced by tuning the optimization parameters, but we did not attempt this, as not all practitioners may have this insight. For both sample sizes, the best ξ values come from GPfit2, mlegp, JMPE, GPy, and sklearn, with GPfit1.95, laGP6, and DACE performing only slightly worse. DiceKriging, laGPE, and JMP0 perform poorly compared to the best packages. The prediction errors, π values, are fairly accurate on this data. Most of the π values are less than the corresponding ξ values by a small margin, but not by as much as in the borehole results of Figure 2. Doubling the sample size reduced the ξ and π by about a factor of two, showing that increasing the sample size beyond 10d can have a large impact.
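The OTL circuit function also transcribes directly to code. This is a sketch; the midpoint inputs below are illustrative values from the usual variable ranges.

```python
# Python transcription of the OTL push-pull circuit function; names follow
# the equation in the text.
def otl_circuit(Rb1, Rb2, Rf, Rc1, Rc2, beta):
    Vb1 = 12 * Rb2 / (Rb1 + Rb2)
    denom = beta * (Rc2 + 9) + Rf
    term1 = (Vb1 + 0.74) * beta * (Rc2 + 9) / denom
    term2 = 11.35 * Rf / denom
    term3 = 0.74 * Rf * beta * (Rc2 + 9) / (denom * Rc1)
    return term1 + term2 + term3

# Evaluate at illustrative mid-range inputs.
Vm = otl_circuit(Rb1=100, Rb2=47.5, Rf=1.75, Rc1=1.85, Rc2=0.725, beta=175)
```

At mid-range inputs the midpoint voltage V_m is dominated by the first term, since β(R_c2 + 9) is much larger than R_f.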
Dette and Pepelyshev (2010) present an 8-dimensional model "which is highly curved in some variables and has less curvature in another variables." The input is in [0, 1]^8, and the output is given by the equation below:

η(x) = 4(x_1 − 2 + 8x_2 − 8x_2²)² + (3 − 4x_2)² + 16 √(x_3 + 1) (2x_3 − 1)² + Σ_{k=4}^{8} k log(1 + Σ_{i=3}^{k} x_i).

Figure 4: OTL circuit comparison, (a) n = 60, (b) n = 120.

This function and its R implementation were also taken from Surjanovic and Bingham (2016).

Figure 5 shows the results when we test this function with n = 80 and n = 160 design points in 8 dimensions. There are clear differences between the packages in these plots. GPy has the smallest ξ values for both plots, with GPfit2 not far behind. There is a large difference in the ξ values between GPy and the worst performers, so again we see that the software used makes a difference. JMP0 is consistently bad on these examples, while JMPE is very inconsistent, with a mixture of good and bad fits. laGP6 is generally very good, while laGPE, despite being consistent, is one of the worst performers. GPfit, mlegp, GPy, and sklearn perform the best on this function. The error predictions for all packages except JMP are generally good, being within 25% of the actual error. We can also see that the performance ordering of the packages on this problem is similar to that for the OTL circuit example in Figure 4. Again, increasing the number of observations beyond 10d had a decidedly beneficial effect, roughly halving the ξ and π values.

Figure 5: Dette-Pepelyshev comparison, (a) n = 80, (b) n = 160.

The Morris function is a 20-dimensional function created by Morris (1991), and we use the version presented by Le Gratiet et al. (2016):

f(x) = Σ_{i=1}^{20} β_i w_i + Σ_{i<j} β_{i,j} w_i w_j + Σ_{i<j<l} β_{i,j,l} w_i w_j w_l,

where w_i = 2(1.1 x_i / (x_i + 0.1) − 1/2) for i = 3, 5, 7, and w_i = 2(x_i − 1/2) otherwise. The coefficients are β_i = 20 for i = 1, ..., 10; β_{i,j} = −15 for i, j = 1, ..., 6; and β_{i,j,l} = 10 for i, j, l = 1, ..., 5. All other coefficients are set to β_i = (−1)^i, β_{i,j} = (−1)^{i+j}, and β_{i,j,l} = 0.
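The Morris function can be coded directly. This is a sketch; the upper limits 10 and 6 on the coefficient index ranges follow Morris (1991), and the transformed inputs i = 3, 5, 7 become 0-based indices 2, 4, 6.

```python
# Sketch of the Morris (1991) test function in 20 dimensions.
import numpy as np
from itertools import combinations

def morris(x):
    x = np.asarray(x, dtype=float)
    w = 2 * (x - 0.5)
    for i in (2, 4, 6):                   # i = 3, 5, 7 in 1-based indexing
        w[i] = 2 * (1.1 * x[i] / (x[i] + 0.1) - 0.5)

    def beta1(i):                         # 1-based coefficient rules
        return 20.0 if i <= 10 else (-1.0) ** i
    def beta2(i, j):
        return -15.0 if (i <= 6 and j <= 6) else (-1.0) ** (i + j)
    def beta3(i, j, l):
        return 10.0 if (i <= 5 and j <= 5 and l <= 5) else 0.0

    total = sum(beta1(i) * w[i - 1] for i in range(1, 21))
    total += sum(beta2(i, j) * w[i - 1] * w[j - 1]
                 for i, j in combinations(range(1, 21), 2))
    total += sum(beta3(i, j, l) * w[i - 1] * w[j - 1] * w[l - 1]
                 for i, j, l in combinations(range(1, 21), 3))
    return total
```

A useful sanity check is the input where every w_i is zero (x_i = 1/2 everywhere except x_i = 1/12 at i = 3, 5, 7), at which the function vanishes.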
The results from using the Morris function are shown in Figure 6, using input samples of size n = 200 and n = 400. DACE has fitting problems, especially for n = 200, but the rest are fairly consistent. The ξ values are all fairly large, many around 0.4. This demonstrates that in higher dimensions it is more difficult to get a fit that is substantially better than the linear model, especially when the function is relatively linear. GPfit, laGP6, JMPE, GPy, and sklearn perform the best, but they all underestimate the error by a significant amount.

6. Recent literature / Advanced models

In this paper, we have focused on an ordinary GP model. However, there are many variations of this model that can be used in situations where there is domain knowledge about the data or where the basic model is inadequate. If the data is noisy, then a nugget should always be used and estimated. There are also many different correlation functions beyond the Gaussian and Matérn that may perform better with certain types of data. If the data set is large, then there are approximation models that should be used instead, since they will run much faster with a small loss of accuracy. If there are n design points, then the computational complexity for kriging is O(n³), which is far too slow for modern problems with millions of data points.

Figure 6: Morris comparison, (a) n = 200, (b) n = 400.

While computer experiments often assume that there is no variability in the data, this is not the case in stochastic simulations. When the noise is similar across the entire response surface, the basic model should suffice by using the nugget term. However, when the noise level varies across the surface, called heteroscedasticity, stochastic kriging should be used. Stochastic kriging, by Ankenman et al. (2010), accommodates noise in the collected data by assuming that the variance of the data is different at each design point.
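The idea of a different noise variance at each design point can be sketched in scikit-learn, whose `alpha` argument accepts a vector of per-point noise variances. This is only an illustration of the model structure, not of the two packages that implement stochastic kriging itself; all data below are hypothetical.

```python
# Stochastic-kriging-style fit sketched with scikit-learn: replicate data at
# each design point give a sample mean and a variance estimate, and the
# per-point variances enter the GP through `alpha`.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(2)
X = np.linspace(0.1, 0.9, 7).reshape(-1, 1)      # unique design points
reps = 5                                          # replicates per point
Y = np.sin(3 * X) + 0.05 * (1 + X) * rng.standard_normal((7, reps))

y_bar = Y.mean(axis=1)                            # sample mean at each point
var_hat = Y.var(axis=1, ddof=1) / reps            # variance of each mean

gp = GaussianProcessRegressor(kernel=RBF(0.3), alpha=var_hat, random_state=0)
gp.fit(X, y_bar)
pred, se = gp.predict(X, return_std=True)
```

Because the noise here grows with x, the per-point `alpha` vector plays the role of the diagonal noise matrix in stochastic kriging.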
In order to estimate the noise at each point, replicates must be collected at every point in the design. This is equivalent to having a different nugget at each design point. Thus, instead of adding δI to the diagonal of the correlation matrix, diag(δ) is added, where δ_i ∝ Var(x_i). This requires Var(x_i) to be estimated by replicates at each unique design point.

Stochastic kriging has been used for modeling simulation data from many fields, such as game theory simulations (Pousi et al., 2010) and finance, for measuring portfolio risk (Liu and Staum, 2010). Stochastic kriging models are often run in two stages. In the first stage, a small number of samples are taken at all design points. For the second stage, the number of samples for each point is allocated according to Equation (29) in Ankenman et al. (2010), which puts more replicates at points that have large sample variances and are centrally located.

Of the software packages discussed above, only mlegp and DiceKriging are able to perform stochastic kriging. For each package, a variance estimate at each point must be provided. In mlegp, this vector is passed as the "nugget" parameter, and the diagonal of the nugget matrix is set to be proportional to these values (Dancik, 2011). The nugget scaling parameter is estimated along with the other parameters. In DiceKriging, this same vector is passed as the "noise.var" parameter (Roustant et al., 2012). The prototype software developed for the paper Ankenman et al. (2010) used off-the-shelf optimization algorithms that do not scale to larger problems and often have convergence issues; it will not be considered in this comparison.

To demonstrate the use of mlegp and DiceKriging for stochastic kriging, we use data taken from the standard M/M/1 queue model. The M/M/1 queue is a service model that represents a system with one server and interarrival and service times that are independently exponentially distributed. We set the service rate λ = 1 and the arrival rate 0 ≤ x < 1.
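For this queue the steady-state number in system has known moments (mean x/(1 − x), variance x/(1 − x)²), so a second-stage allocation proportional to the standard deviation at each design point can be sketched directly. This is a simplified stand-in for the sample-variance-based rule of Equation (29) in Ankenman et al. (2010).

```python
# Two-stage allocation sketch for the M/M/1 example: second-stage replicates
# are allotted proportionally to the (here theoretical) standard deviation.
import numpy as np

design = np.array([0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])   # arrival rates
sd = np.sqrt(design / (1 - design) ** 2)                  # std dev at each point

def allocate(total, weights):
    # Proportional allocation, rounded so the counts sum exactly to `total`.
    raw = total * weights / weights.sum()
    counts = np.floor(raw).astype(int)
    # Hand leftover replicates to the largest fractional parts.
    for i in np.argsort(raw - counts)[::-1][: total - counts.sum()]:
        counts[i] += 1
    return counts

n2 = allocate(100, sd)
```

As expected, the heavily loaded design points near x = 0.9 receive by far the most second-stage replicates, since their output variance explodes as x approaches 1.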
We model the number of customers waiting in the queue as a function of the arrival rate, which is known to have mean y(x) = x/(1 − x) and variance x/(1 − x)². For design points we use seven equally-spaced points at (0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9). In the first stage, we take n = 5 samples at each point. Then the second stage is run with a total of n = 100 and n = 200 points allotted to the individual design points according to the square root of the variance of the output for the first-stage samples. The results are shown in Figure 7, where there is no distinguishable difference between the two packages. Even with n = 100 observations the ξ values are relatively large, showing that stochastic kriging typically needs more observations to fit a surface. Both packages tend to underestimate the error when the sample size is small.

Many advances in Gaussian process fitting have adapted the method to be suitable for large amounts of data, such as by exploiting sparsity. Snelson and Ghahramani (2005) present a sparse GP method that reduces the size of the covariance matrix by using pseudo-input points. Hensman et al. (2013) produce a method that works for data sets with millions of data points, which would be prohibitively slow for the standard model. Sparsity can be induced in the correlation matrix by using a correlation function with compact support, meaning that it is zero for points that are sufficiently far away (Rasmussen and Williams, 2006, p. 87-8). These functions usually are piecewise polynomials that resemble the Gaussian correlation function. Gramacy and Apley (2015) provide a way to fit GPs to large experiments quickly. Their model induces sparsity by only selecting the points that provide the greatest reduction in predictive variance when calculating the metamodel function at a given point.
They also allow for quick sequential updating and trivial parallelization, making their method very practical. Gramacy (2015) has provided the R package laGP, which implements most of these methods in addition to the basic model that we explore in this paper. Further work in Sung et al. (2016) makes the search for the best sub-design much faster.

Figure 7: M/M/1 stochastic kriging comparison using mlegp and DiceKriging, (a) n = 100, (b) n = 200. Both use 5 samples at each point in the first stage; the total number of points allocated in the second stage is n = 100 for Figure 7a and n = 200 for Figure 7b.

The recent paper by Binois et al. (2016) provides a significant improvement to stochastic kriging. They provide a more favorable framework by putting the problem in an inferential scheme with a single objective and explicit derivatives. They introduce some smoothing techniques that allow design points without replicates to be used in the model, which was a shortcoming of previous versions of stochastic kriging. Also, they use the Woodbury identity (Woodbury, 1950) to ensure that the computational complexity is similar to other stochastic kriging methods.

Advances have also come from combining the Gaussian process with other models. Neural networks inspired the work of Damianou and Lawrence (2013), who present a deep Gaussian process model that uses hierarchical Gaussian process mappings. Gramacy and Lee (2008) add GPs to the Bayesian partition tree model of Chipman et al. (1998) so that a GP is fit to each partition. Williams and Barber (1998) create a model that uses GPs for Bayesian classification. There have also been advances in allowing the model to take categorical input. Platt et al. (2001) use GPs with categorical input to generate music playlists. Chen et al. (2013) address the use of stochastic kriging with categorical input.

Some of the packages we investigate have advanced models available for users.
GPy has many models, including classification, sparse regression, latent variable models, and more. In addition to providing GP classification models, scikit-learn also has other machine learning models such as clustering, neural networks, and support vector machines. laGP provides the approximate GP model, as explained above, that is useful for massive data sets. When choosing a software package, users should consider the depth of options available on the platform and what types of models they would potentially use beyond the basic GP model to get better results.

7. Discussion

In this paper, we study various Gaussian process fitting software packages (see Table 1) and compare their performance using the GP model with Gaussian correlation for global fitting. We do not compare their performance for optimization, i.e., accurately locating an optimal point. We assess them based on the quality of their response predictions and their error estimates, each of which is averaged across the region of interest. Other possible criteria that we do not consider include maximum error and relative error. In many cases the different packages give similar, or even indistinguishable, results, which is expected since they are using the same models on the same surfaces. However, due to differences in the parameter estimation routines, the packages often give different results on complicated surfaces.

DiceKriging, an R package, performed somewhat worse than many of the other packages in our examples. However, DiceKriging runs faster than the R packages mlegp and GPfit, and provides more customization options than laGP. It has multiple correlation functions available and can estimate a nugget. By default the correlation function is the Matérn with ν = 5/2, but we did not see a large difference between that and the Gaussian correlation function.
DiceKriging also provides functionality for stochastic kriging, as demonstrated in Section 6.1.

Another R package, GPfit, uses the most extensive parameter optimization, which makes it very dependable. In our tests we found GPfit to be reliable and to give good results. The cost of this is that it takes significantly longer on larger data sets, taking noticeably longer on samples larger than even 50. On the borehole test, each macroreplicate for GPfit with a sample size of 500 took over two hours, while all the other packages finished in minutes or even seconds. For this reason we do not recommend using GPfit when the data set is large and time is valuable. We ran GPfit with its default exponent of 1.95 in the correlation function and also used the Gaussian process model where that exponent is 2.00. In general we did not see a large difference between the two in performance or run time. GPfit uses 1.95 as the default because it is supposed to provide computational stability.

The R package laGP is the fastest package we tested. We used it both with estimating a nugget and with setting a small nugget; the latter gave better results. The main benefit of laGP is that it runs very quickly, especially when repeatedly adding data in a sequential manner. In addition, it provides some additional complex models that are useful for large data sets. Thus we do not recommend laGP for kriging with small sample sizes, but we do suggest looking into its advanced functionality if there are thousands of sample points.

The final R package we evaluated, mlegp, performed well in our testing. We used it both with setting the nugget to zero and with estimating the nugget, and did not see a large difference. In addition, mlegp was a little slower than the other packages, though not as slow as GPfit. One benefit of mlegp is that it also can do stochastic kriging, as shown in Section 6.1.

JMP is a commercial software platform that makes data analysis easy for practitioners.
When the nugget was set to zero, JMP performed poorly on most of our test functions. When the nugget was estimated, however, JMP performed substantially better, often on par with the best packages. JMP seems to run relatively slowly. For our testing, JMP was run on a laptop and the other packages were run on a cluster. However, in our experience, the other packages (except for GPfit) ran faster when also run on the same laptop. Thus users should be careful when using JMP, particularly when the nugget is not estimated, since the results may be spurious, and users might get better and faster results using a different software option.

The Matlab toolbox DACE was fast but generally provided a slightly worse fit than the best models. DACE is very basic and has not been updated in years, so we recommend using other packages if there is a desire to progress to more advanced models.

GPy, a Python module, gave the best fitting results in most of our tests. It was an order of magnitude slower than the fastest package, and was generally in the middle in terms of speed. GPy also provides many options and advanced models that the practitioner can explore once they have mastered the basic model.

Finally, the Python module scikit-learn contains GP fitting capability in addition to many other machine learning algorithms. It was near the best on most examples, but occasionally exhibited inconsistency. In preliminary tests, we found that it performed better when the data is scaled and more optimization restarts are used. It was also one of the fastest packages. Although we only included the results using scikit-learn with the Gaussian correlation function in this paper, we have found in some of our tests that using the Matérn correlation function gives better results. Although scikit-learn does not provide advanced GP models, it does provide other machine learning models such as support vector machines and random forests.
Thus it would be a useful tool for those who want to use a single platform for multiple machine learning methods.

This paper focuses on the traditional Gaussian process model with Gaussian correlation. Despite specifying the same type of GP metamodel, we found that there are often significant differences between the metamodel predictions made by various software packages on the same data. Practitioners should be aware of the quality of predictions, typical run time, and model options when choosing modeling software.

There are many modifications of the model, and other correlation functions, that will often give better and faster results if there is prior knowledge about the structure of the data or if there is a large number of observations. We focus on the simple model because it is used by many practitioners who do not want to spend the time to learn the intricacies of the model, but wish to use the power of GP fitting. However, if unstable or nonsensical results are observed when fitting the simple GP model, we encourage modelers to consider using the packages with more advanced features that we allude to in Section 7.1.

8. Acknowledgments

U.S. Department of Defense Distribution Statement: Approved for public release; distribution is unlimited. The views expressed in this document are those of the authors and do not reflect the official policy or position of the Department of Defense or the U.S. Government. This work was supported in part by the Office of Naval Research via NPS's CRUSER initiative, the NPS Naval Research Program NPS-17-N191-B, and the Naval Supply Systems Command Fleet Logistics grant number N00244-15-2-0004.

Bibliography

B. E. Ankenman, B. L. Nelson, J. Staum, Stochastic kriging for simulation metamodeling, Operations Research 58 (2) (2010) 371–382.
S. Ba, V. R. Joseph, MaxPro: Maximum Projection Designs, URL https://CRAN.R-project.org/package=MaxPro, R package version 3.1-2, 2015.
L. S. Bastos, A. O'Hagan, Diagnostics for Gaussian process emulators, Technometrics 51 (4) (2009) 425–438.
E. N. Ben-Ari, D. M. Steinberg, Modeling data from computer experiments: an empirical comparison of kriging with MARS and projection pursuit regression, Quality Engineering 19 (4) (2007) 327–338.
M. Binois, R. B. Gramacy, M. Ludkovski, Practical heteroskedastic Gaussian process modeling for large simulation experiments, arXiv preprint arXiv:1611.05902.
R. H. Byrd, P. Lu, J. Nocedal, C. Zhu, A limited memory algorithm for bound constrained optimization, SIAM Journal on Scientific Computing 16 (5) (1995) 1190–1208.
H. Chen, J. L. Loeppky, J. Sacks, W. J. Welch, et al., Analysis methods for computer experiments: How to assess and what counts?, Statistical Science 31 (1) (2016) 40–60.
X. Chen, K. Wang, F. Yang, Stochastic kriging with qualitative factors, in: Proceedings of the 2013 Winter Simulation Conference, IEEE Press, 790–801, 2013.
H. A. Chipman, E. I. George, R. E. McCulloch, Bayesian CART model search, Journal of the American Statistical Association 93 (443) (1998) 935–948.
J. Christen, M. Ichchou, B. Troclet, M. Ouisse, Global sensitivity analysis of acoustic transmission models through infinite plates, in: Proceedings of ISMA, 4177–4188, 2014.
B. Ciuffo, J. Casas, M. Montanino, J. Perarnau, V. Punzo, Gaussian process metamodels for sensitivity analysis of traffic simulation models: Case study of AIMSUN mesoscopic model, Transportation Research Record: Journal of the Transportation Research Board 2390 (2013) 87–98.
A. Damianou, N. Lawrence, Deep Gaussian processes, Proceedings of the Sixteenth International Workshop on Artificial Intelligence and Statistics (AISTATS) (2013) 207–215.
G. Dancik, mlegp: An R package for Gaussian process modeling and sensitivity analysis, retrieved April 14, 2016, from http://download.nextag.com/cran/web/packages/mlegp/vignettes/mlegp.pdf, 2011.
G. M. Dancik, mlegp: Maximum Likelihood Estimates of Gaussian Processes, URL http://CRAN.R-project.org/package=mlegp, R package version 3.1.4, 2013.
G. M. Dancik, K. S. Dorman, mlegp: Statistical analysis for computer models of biological systems using R, Bioinformatics 24 (17) (2008) 1966–1967.
H. Dette, A. Pepelyshev, Generalized Latin hypercube design for computer experiments, Technometrics 52 (4) (2010) 421–429.
W. Du, N. Xue, W. Shyy, J. R. Martins, A surrogate-based multi-scale model for mass transport and electrochemical kinetics in lithium-ion battery electrodes, Journal of the Electrochemical Society 161 (8) (2014) E3086–E3096.
C. B. Erickson, B. E. Ankenman, S. M. Sanchez, Data from fitting Gaussian process models to various data sets using eight Gaussian process software packages, Data in Brief, submitted.
I. Gidaris, A. A. Taflanidis, Performance assessment and optimization of fluid viscous dampers through life-cycle cost criteria and comparison to alternative design approaches, Bulletin of Earthquake Engineering 13 (4) (2015) 1003–1028.
R. Gramacy, tgp: An R package for Bayesian nonstationary, semiparametric nonlinear regression and design by treed Gaussian process models, Journal of Statistical Software 19 (9) (2007) 1–46.
R. B. Gramacy, laGP: Large-scale spatial modeling via local approximate Gaussian processes in R, Tech. Rep., The University of Chicago. Available as a vignette in the laGP package, 2014.
R. B. Gramacy, laGP: Local Approximate Gaussian Process Regression, URL http://CRAN.R-project.org/package=laGP, R package version 1.2-1, 2015.
R. B. Gramacy, D. W. Apley, Local Gaussian process approximation for large computer experiments, Journal of Computational and Graphical Statistics 24 (2) (2015) 561–578, URL http://dx.doi.org/10.1080/10618600.2014.914442.
R. B. Gramacy, H. K. Lee, Cases for the nugget in modeling computer experiments, Statistics and Computing 22 (3) (2012) 713–722.
R. B. Gramacy, H. K. H. Lee, Bayesian treed Gaussian process models with an application to computer modeling, Journal of the American Statistical Association 103 (483) (2008) 1119–1130, URL http://dx.doi.org/10.1198/016214508000000689.
J. Hensman, N. Fusi, N. D. Lawrence, Gaussian processes for big data, CoRR abs/1309.6835, URL http://arxiv.org/abs/1309.6835.
JMP: Gaussian Process, retrieved August 01, 2016, 2016.
D. R. Jones, M. Schonlau, W. J. Welch, Efficient global optimization of expensive black-box functions, Journal of Global Optimization 13 (4) (1998) 455–492.
E. Jones, T. Oliphant, P. Peterson, et al., SciPy: Open source scientific tools for Python, [Online; accessed 2016-04-19], 2001.
J. P. C. Kleijnen, Design and Analysis of Simulation Experiments, vol. 230, Springer, 2015.
L. Le Gratiet, S. Marelli, B. Sudret, Metamodel-based sensitivity analysis: Polynomial chaos expansions and Gaussian processes, arXiv preprint arXiv:1606.04273.
D. C. Liu, J. Nocedal, On the limited memory BFGS method for large scale optimization, Mathematical Programming 45 (1-3) (1989) 503–528.
M. Liu, J. Staum, Stochastic kriging for efficient nested simulation of expected shortfall, The Journal of Risk 12 (3) (2010) 3.
J. L. Loeppky, J. Sacks, W. J. Welch, Choosing the sample size of a computer experiment: A practical guide, Technometrics 51 (4) (2009) 366–376, URL http://dx.doi.org/10.1198/TECH.2009.08040.
S. N. Lophaven, H. B. Nielsen, J. Søndergaard, Aspects of the MATLAB toolbox DACE, Tech. Rep., Informatics and Mathematical Modelling, Technical University of Denmark, DTU, 2002b.
S. N. Lophaven, H. B. Nielsen, J. Søndergaard, DACE–A MATLAB kriging toolbox, version 2.0, Tech. Rep., 2002a.
B. MacDonald, P. Ranjan, H. Chipman, GPfit: An R package for fitting a Gaussian process model to deterministic simulator outputs, Journal of Statistical Software 64 (12) (2015) 1–23.
W. R. Mebane Jr, J. S. Sekhon, et al., Genetic optimization using derivatives: the rgenoud package for R, Journal of Statistical Software 42 (11) (2011) 1–26.
M. D. Morris, Factorial sampling plans for preliminary computational experiments, Technometrics 33 (2) (1991) 161–174.
M. D. Morris, T. J. Mitchell, D. Ylvisaker, Bayesian design and analysis of computer experiments: Use of derivatives in surface prediction, Technometrics 35 (3) (1993) 243–255.
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al., Scikit-learn: Machine learning in Python, The Journal of Machine Learning Research 12 (2011) 2825–2830.
J. C. Platt, C. J. Burges, S. Swenson, C. Weare, A. Zheng, Learning a Gaussian process prior for automatically generating music playlists, in: NIPS, 1425–1432, 2001.
J. Pousi, J. Poropudas, K. Virtanen, Game theoretic simulation metamodeling using stochastic kriging, in: Proceedings of the 2010 Winter Simulation Conference, 1456–1467, 2010.
R Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2014.
P. Ranjan, R. Haynes, R. Karsten, A computationally stable approach to Gaussian process interpolation of deterministic computer simulation data, Technometrics 53 (4) (2011) 366–378.
C. E. Rasmussen, C. K. I. Williams, Gaussian Processes for Machine Learning, The MIT Press, 2006.
O. Roustant, D. Ginsbourger, Y. Deville, DiceKriging, DiceOptim: Two R packages for the analysis of computer experiments by kriging-based metamodeling and optimization, Journal of Statistical Software 51 (1) (2012) 1–55.
J. Sacks, W. J. Welch, T. J. Mitchell, H. P. Wynn, Design and analysis of computer experiments, Statistical Science (1989) 409–423.
T. J. Santner, B. J. Williams, W. Notz, The Design and Analysis of Computer Experiments, Springer Science & Business Media, 2003.
SAS Products: JMP, retrieved August 01, 2016, from http://support.sas.com/software/products/jmp/, 2016.
scikit-learn developers, scikit-learn release history, retrieved January 16, 2017, from http://scikit-learn.org/stable/whats_new.html, 2016a.
scikit-learn developers, sklearn.gaussian_process.GaussianProcessRegressor, retrieved January 14, 2017, from http://scikit-learn.org/stable/modules/generated/sklearn.gaussian_process.GaussianProcessRegressor.html, 2016b.
E. Snelson, Z. Ghahramani, Sparse Gaussian processes using pseudo-inputs, in: Advances in Neural Information Processing Systems, 1257–1264, 2005.
C.-L. Sung, R. B. Gramacy, B. Haaland, Potentially predictive variance reducing subsample locations in local Gaussian process regression, arXiv e-prints.
S. Surjanovic, D. Bingham, Virtual Library of Simulation Experiments: Test Functions and Datasets, retrieved March 29, 2016, 2016.
The GPy authors, GPy: A Gaussian process framework in Python, URL http://github.com/SheffieldML/GPy, 2012–2015a.
The GPy authors, Welcome to GPy's documentation!, retrieved Oct 03, 2016, from https://pythonhosted.org/GPy/index.html, 2015b.
N. Villa-Vialaneix, M. Follador, M. Ratto, A. Leip, A comparison of eight metamodeling techniques for the simulation of N2