Replication or exploration? Sequential design for stochastic simulation experiments
Mickael Binois, Jiangeng Huang, Robert B Gramacy, Mike Ludkovski
January 28, 2019
Abstract
We investigate the merits of replication, and provide methods for optimal design (including replicates), with the goal of obtaining globally accurate emulation of noisy computer simulation experiments. We first show that replication can be beneficial from both design and computational perspectives, in the context of Gaussian process surrogate modeling. We then develop a lookahead-based sequential design scheme that can determine if a new run should be at an existing input location (i.e., replicate) or at a new one (explore). When paired with a newly developed heteroskedastic Gaussian process model, our dynamic design scheme facilitates learning of signal and noise relationships which can vary throughout the input space. We show that it does so efficiently, on both computational and statistical grounds. In addition to illustrative synthetic examples, we demonstrate performance on two challenging real-data simulation experiments, from inventory management and epidemiology.
Keywords: computer experiment, Gaussian process, surrogate model, input-dependent noise, replicated observations, lookahead
1 Introduction

Historically, design and analysis of computer experiments focused on deterministic solvers from the physical sciences via Gaussian process (GP) interpolation (Sacks et al., 1989). But nowadays computer modeling is common in the social (Cioffi-Revilla, 2014, Chapter 8), management (Law, 2015) and biological (Johnson, 2008) sciences, where stochastic simulations abound. Noisier simulations demand bigger experiments to isolate signal from noise, and more sophisticated GP models: not just adding nuggets to smooth the noise, but variance processes to track changes in noise throughout the input space in the face of heteroskedasticity (Binois et al., 2016). In this context there are not many simple tools: most add to, rather than reduce, modeling and computational complexity.

Replication in the experiment design is an important exception, offering a pure look at noise, not obfuscated by signal. Since usually the signal is of primary interest, a natural question is: how much replication should be performed in a simulation experiment? The answer to that question depends on a number of factors. In this paper the focus is on global surrogate model prediction accuracy and computational efficiency, and we show that replication can be a great benefit to both, especially for heteroskedastic systems.

There is evidence to support this in the literature. Ankenman et al. (2010) demonstrated how replicates could facilitate signal isolation, via stochastic kriging (SK), and that accuracy could be improved without much extra computation by augmenting existing degrees of replication in stages (also see Liu and Staum, 2010; Quan et al., 2013; Mehdad and Kleijnen, 2018).

∗ Corresponding author: The University of Chicago Booth School of Business, Chicago IL, 60637; [email protected]
† Department of Statistics, Virginia Tech
‡ Department of Statistics and Applied Probability, University of California Santa Barbara
Wang and Haaland (2017) showed that replicates have an important role in characterizing sources of inaccuracy in SK. Boukouvalas et al. (2014) demonstrated the value of replication in (Fisher) information metrics, and Plumlee and Tuo (2014) provided asymptotic results favoring replication in quantile regression. Finally, replication has proved helpful in the surrogate-assisted (i.e., Bayesian) optimization of noisy blackbox functions (Horn et al., 2017; Jalali et al., 2017).

However, none of these studies address what we see as the main decision problem for design of GP surrogates in the face of noisy simulations. That is: how to allocate a set of unique locations, and the degree of replication thereon, to obtain the best overall fit to the data. That sentiment has been echoed independently in several recent publications (Kleijnen, 2015; Weaver et al., 2016; Jalali et al., 2017; Horn et al., 2017). The standard approach of allocating a uniform number of replicates leaves plenty of room for improvement. One exception is Chen and Zhou (2014, 2017), who proposed several criteria to explore the replication/exploration trade-off, but only for a finite set of candidate designs.

Here we tackle the issue sequentially, one new design element at a time. We study the conditions under which the new element should be a replicate, or rather explore a new location, under an integrated mean-square prediction error (IMSPE) criterion. We also highlight how replicates offer computational savings in surrogate model fitting and prediction with GPs, augmenting results of Binois et al. (2016) with fast updates as new data arrive. Inspired by those findings, we develop a new IMSPE-based criterion that offers lookahead over future replicates.
This criterion is the first to acknowledge that exploring now offers a new site for replication later, and conversely that replicating first offers the potential to learn a little more (cheaply, in terms of surrogate modeling computation) before committing to a new design location. A key component in solving this sequential decision problem in an efficient manner is a closed-form expression for IMSPE, and its derivatives, allowing for fast numerical optimization.

While our IMSPE criterion corrects for myopia in replication, it is important to note that it is not a full lookahead scheme. Rather, we illustrate that it is biased toward replication: longer lookahead horizons tend to tilt toward more replication in the design. In our experience, full lookahead, even when approximated, is impractical for all but the most expensive simulations. Even the cleverest dynamic programming-like schemes (e.g., Ginsbourger and Le Riche, 2010; Gonzalez et al., 2016; Lam et al., 2016; Huan and Marzouk, 2016) require approximation to remain tractable or otherwise only manage to glimpse a few steps into the future despite enormous computational cost. Our more thrifty scheme can search dozens of iterations ahead. That flexibility allows us to treat the horizon as a tuning parameter that can be adjusted, online, to meet design and/or surrogate modeling goals. When simulations are cheap and noisy, we provide an adaptive horizon scheme that favors replicates to keep surrogate modeling costs down; when surrogate modeling costs are less of a concern, we provide a scheme that optimizes out-of-sample RMSE, which might or might not favor longer horizons (i.e., higher replication).

The structure of the remainder of the paper is as follows. First we summarize relevant elements of GPs, sequential design and the computational savings enjoyed through replication in Section 2.
Then in Section 3 we detail IMSPE, with emphasis on sequential applications and computational enhancements (e.g., fast GP updating) essential for the tractability of our framework. Section 3.3 discusses our lookahead scheme, while Section 4 provides practical elements for the implementation, including tuning the horizon of the lookahead scheme. Finally, in Section 5 results are presented from several simulation experiments, including illustrative test problems, and real simulations from epidemiology and inventory management, which benefit from disparate design strategies.

Here we introduce relevant surrogate modeling and design elements while at the same time illustrating proof-of-concept for our main methodological contributions: namely, that replication can be valuable computationally, as well as for accuracy in surrogate modeling.
We consider Gaussian process (GP) surrogate models for an unknown function over a fixed domain $f : D \subset \mathbb{R}^d \to \mathbb{R}$ based on noisy observations $Y_N = (y_1, \dots, y_N)^\top$ at design locations $X_N = (\mathbf{x}_1^\top, \dots, \mathbf{x}_N^\top)^\top$. For simplicity, we assume a zero-mean GP prior, completely specified by covariance kernel $k(\cdot,\cdot)$, a positive definite function. Many different choices of kernel are possible; in the computer experiments literature the power exponential and Matérn families are the most common. Often the families are parameterized by unknown quantities such as lengthscales, scales, etc., which are inferred from data (see, e.g., Rasmussen and Williams, 2006; Santner et al., 2013). The noise is presumed to be zero-mean i.i.d. Gaussian, with variance $r(\mathbf{x}) = \mathrm{Var}[Y(\mathbf{x}) \mid f(\mathbf{x})]$. While we discuss our preferred modeling and inference apparatus in Section 4.1, for now we make the (unrealistic) assumption that kernel hyperparameters, along with the potentially non-constant $r(\mathbf{x})$, are known. Altogether, the data-generating mechanism follows a multivariate normal distribution, $Y_N \sim \mathcal{N}_N(0, \mathbf{K}_N)$, where $\mathbf{K}_N$ is an $N \times N$ matrix comprised of $k(\mathbf{x}_i, \mathbf{x}_j) + \delta_{ij} r(\mathbf{x}_i)$, for $1 \le i, j \le N$, with $\delta_{ij}$ being the Kronecker delta function.

Conditional properties of multivariate normal (MVN) distributions yield that the predictive distribution $Y(\mathbf{x}) \mid Y_N$ is Gaussian with
$$\mu_N(\mathbf{x}) = \mathbb{E}(Y(\mathbf{x}) \mid Y_N) = \mathbf{k}_N(\mathbf{x})^\top \mathbf{K}_N^{-1} Y_N, \quad \text{with } \mathbf{k}_N(\mathbf{x}) = (k(\mathbf{x}, \mathbf{x}_1), \dots, k(\mathbf{x}, \mathbf{x}_N))^\top;$$
$$\sigma_N^2(\mathbf{x}) = \mathrm{Var}(Y(\mathbf{x}) \mid Y_N) = k(\mathbf{x}, \mathbf{x}) + r(\mathbf{x}) - \mathbf{k}_N(\mathbf{x})^\top \mathbf{K}_N^{-1} \mathbf{k}_N(\mathbf{x}). \quad (1)$$
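As a concrete illustration of the predictive equations (1), the following minimal sketch computes the posterior mean and variance under a Gaussian kernel; the unit scale, lengthscale, constant noise level, and test function are illustrative choices of ours, not settings from the paper.

```python
import numpy as np

def k_gauss(a, b, theta=0.1):
    """Gaussian (squared exponential) kernel with lengthscale parameter theta."""
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / theta)

def gp_predict(x, X, Y, r):
    """Posterior mean and variance of Y(x) | Y_N as in Eq. (1), constant noise r."""
    K = k_gauss(X, X) + r * np.eye(len(X))   # K_N = kernel matrix plus nugget
    Ki = np.linalg.inv(K)
    kx = k_gauss(x, X)                       # cross-covariances k_N(x)
    mu = kx @ Ki @ Y
    s2 = 1.0 + r - np.einsum('ij,jk,ik->i', kx, Ki, kx)
    return mu, s2

rng = np.random.default_rng(0)
X = np.linspace(0, 1, 8)
Y = np.sin(2 * np.pi * X) + 0.1 * rng.standard_normal(8)
mu, s2 = gp_predict(np.array([0.5]), X, Y, r=0.01)
```

The posterior variance is necessarily below the prior variance $1 + r$, reflecting the information gained from the eight noisy observations.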
It can be shown that $\mu_N(\mathbf{x})$ is a best linear unbiased predictor (BLUP) for $Y(\mathbf{x})$ (and $f(\mathbf{x})$). Although testaments to the high accuracy and attractive uncertainty quantification features abound in the literature, one notable drawback is that when $N$ is large the computational expense of $O(N^3)$ due to decomposing $\mathbf{K}_N$ (e.g., to solve for $\mathbf{K}_N^{-1}$) can be prohibitive.

When the observations $y(\mathbf{x})$ are deterministic (i.e., $r(\mathbf{x}) = 0$), often $N$ can be kept to a manageable size. When data are noisy, with potentially varying noise level, many samples may be needed to separate signal from noise. Indeed, in our motivating applications the signal-to-noise ratios can be very low, so even for a relatively small input space, thousands of training observations are necessary. In that context replication can offer significant computational gains. To illustrate, let $\bar{\mathbf{x}}_i$, $i = 1, \dots, n$, denote the $n \le N$ unique input locations, and $y_i^{(j)}$ be the $j$th out of $a_i \ge 1$ observations at $\bar{\mathbf{x}}_i$, i.e., $j = 1, \dots, a_i$, where $\sum_{i=1}^n a_i = N$. Also, let $\bar{Y}_{(N,n)} = (\bar{y}_1, \dots, \bar{y}_n)^\top$ store averages over replicates, $\bar{y}_i = a_i^{-1} \sum_{j=1}^{a_i} y_i^{(j)}$. Then Binois et al. (2016) show that predictive equations based on this "unique-$n$" formulation, i.e., following Eq. (1) except with $\bar{Y}_{(N,n)}$ and $\mathbf{K}_{(N,n)} = \big( k(\bar{\mathbf{x}}_i, \bar{\mathbf{x}}_j) + \delta_{ij} \tfrac{r(\bar{\mathbf{x}}_i)}{a_i} \big)_{1 \le i,j \le n}$, are identical. Compared to the "full-$N$" formulation, the respective costs are reduced from $O(N^3)$ to just $O(n^3)$, without any approximations.

Although there are many criteria dedicated to design for GP regression (see, e.g., Pronzato and Müller, 2012), our focus here is on global predictive accuracy defined via integrated mean-squared prediction error (IMSPE). Fixing $X_N$, the IMSPE integrates the "de-noised" posterior variance $\check{\sigma}_N^2(\mathbf{x}) = \sigma_N^2(\mathbf{x}) - r(\mathbf{x})$ over $D$,
$$\mathrm{IMSPE}(\mathbf{x}_1, \dots, \mathbf{x}_N) = \int_{\mathbf{x} \in D} \check{\sigma}_N^2(\mathbf{x})\, d\mathbf{x} =: I_N. \quad (2)$$
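The "unique-$n$" equivalence described above is easy to verify numerically. The sketch below is our own illustration (Gaussian kernel, constant noise, made-up replication counts): it compares the full-$N$ predictor with the unique-$n$ one built from averaged replicates and per-site nuggets $r/a_i$.

```python
import numpy as np

def k_gauss(a, b, theta=0.1):
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / theta)

rng = np.random.default_rng(1)
xbar = np.array([0.1, 0.4, 0.8])              # n = 3 unique sites
a = np.array([4, 1, 3])                        # replication counts; N = 8
r = 0.05                                       # constant noise variance (illustrative)

X = np.repeat(xbar, a)                         # full-N design
Y = np.sin(2 * np.pi * X) + np.sqrt(r) * rng.standard_normal(len(X))
ybar = np.array([Y[X == s].mean() for s in xbar])   # averages over replicates

xt = np.array([0.3])                           # test location
# full-N predictive equations, Eq. (1)
KN = k_gauss(X, X) + r * np.eye(len(X))
kN = k_gauss(xt, X)
mu_full = kN @ np.linalg.solve(KN, Y)
s2_full = 1.0 + r - kN @ np.linalg.solve(KN, kN.T)

# unique-n version: averaged responses, nuggets r / a_i
Kn = k_gauss(xbar, xbar) + np.diag(r / a)
kn = k_gauss(xt, xbar)
mu_uni = kn @ np.linalg.solve(Kn, ybar)
s2_uni = 1.0 + r - kn @ np.linalg.solve(Kn, kn.T)
```

Both predictors agree to machine precision, while the unique-$n$ version decomposes a $3 \times 3$ rather than an $8 \times 8$ matrix.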
Note that although this definition removes $r(\mathbf{x})$, it is still present in $\mathbf{K}_N$ and therefore affects $\check{\sigma}_N^2(\mathbf{x})$. Removing $r(\mathbf{x})$ is not required, but since $\int r(\mathbf{x})\, d\mathbf{x}$ is constant over $\mathbf{x}_1, \dots, \mathbf{x}_N$, it simplifies future expressions.

Even in the highly idealized case where all covariance $k(\cdot,\cdot)$ and noise $r(\cdot)$ relationships are presumed known, one-shot design, i.e., choosing all $N$ locations $X_N$ at once to minimize (2), is an extraordinarily difficult task owing to the $(N \times d)$-dimensional search space. Only in very specific cases, such as $d = 1$ and an exponential kernel (Antognini and Zagoraiou, 2010), or with the simpler task of allocating $N$ replicates to a fixed set of $n$ unique sites $\bar{\mathbf{x}}_1, \dots, \bar{\mathbf{x}}_n$ (Ankenman et al., 2010), is a computationally tractable solution known.

Therefore, we consider here the simpler case of a purely sequential design, building up a big design greedily, one simulation at a time. Note that this means that $N$ grows by 1 after each iteration. While $n$ is also evolving, the precise change depends on whether a replicate or a new location is selected. In the generic step, we condition on existing $\mathbf{x}_1, \dots, \mathbf{x}_N$ locations and optimize $\mathrm{IMSPE}(\mathbf{x}_1, \dots, \mathbf{x}_N, \mathbf{x}_{N+1})$ over $\mathbf{x}_{N+1}$. Recall that the posterior variance $\check{\sigma}_N^2$ only depends on the geometry of $X_N$, i.e., it is independent of the outputs $Y_N$, and hence we can view the above as minimizing $I_{N+1}(\mathbf{x}_{N+1}) := \mathrm{IMSPE}(\mathbf{x}_{N+1} \mid \mathbf{x}_1, \dots, \mathbf{x}_N)$. Later we establish specific closed-form expressions both for $I_{N+1}$ and its gradient which enable fast optimization via library-based numerical schemes. Foreshadowing these developments, and utilizing the calculations detailed therein, we illustrate here the possibility that $\mathbf{x}_{N+1} = \operatorname{argmin}_{\mathbf{x}} I_{N+1}(\mathbf{x})$ is a replicate.
The conditions under which replication is advantageous, which we describe shortly in Section 3.1, have to our knowledge only been illustrated empirically (Boukouvalas, 2010), or conceptually (e.g., Wang and Haaland (2017) highlight that replication is more beneficial as the signal-to-noise ratio decreases, via upper bounds on the MSPE), or to bolster technical results (e.g., Plumlee and Tuo (2014) demand a sufficient degree of replication to ensure asymptotic efficiency).
Figure 1: Illustration of the effect of noise variance on IMSPE optimization. Left: two examples of noise variance functions $r(\cdot)$ (blue solid and green dashed lines), with observations at $X_N$ (five red points). The grey dotted line represents the minimum $r(\mathbf{x})$ that guarantees that replicating is optimal. Right: $I_{N+1}(\mathbf{x})$ for the two respective $r(\cdot)$. Diamonds highlight minimum values, and red dotted lines the existing designs $\mathbf{x}_1, \dots, \mathbf{x}_5$.

The left panel of Figure 1 shows two different noise levels, $r(\mathbf{x})$, for a stylized heteroskedastic GP predictor trained at $N = 5$ locations whose $\mathbf{x}_1, \dots, \mathbf{x}_5$ values are shown as red dots. The fact that the two $r(\mathbf{x})$ curves coincide at these locations is not material to this discussion. Later in Section 3.1 this feature and a description of the gray-dotted curve will be provided. The right panel in the figure shows the predicted IMSPE, $I_{N+1}(\mathbf{x}) = \mathrm{IMSPE}(\mathbf{x}_1, \dots, \mathbf{x}_5, \mathbf{x})$, derived from $\check{\sigma}_N^2$ calculations using those $\mathbf{x}_1, \dots, \mathbf{x}_5$ values combined with the two $r(\mathbf{x})$ settings. With smaller IMSPE being better, we see that the solid blue regime calls for the next point to be a replicate (the minimizer of $I_{N+1}$ coincides with one of the existing locations), whereas the dashed green regime wants to explore at a new unique location, with lower IMSPE than some local minima, meaning that augmenting with a cheap discrete search over replicates may be more effective than deploying a multi-start optimization scheme.

Ultimately, we entertain the far more realistic setup of unknown kernel hyperparameters and noise processes.
In this context, sequential design to "learn as you go" is essential. We take this approach not simply to avoid pathologies in hyperparameter mis-specification, as discussed in homoskedastic setups (e.g., Seo et al., 2000; Krause and Guestrin, 2007), but explicitly to gain the flexibility to sample non-uniformly in a manner that can only be adapted after a degree of initial sampling allows a fit of the noise process $\hat{r}(\mathbf{x})$ to be obtained, and further refined. Our empirical results illustrate that reasonable, yet inaccurate, a priori simplifications such as constant $r(\mathbf{x})$ may lead to inferior prediction, even if used just for the purposes of design, not subsequent fitting. Previously, such adaptive behavior and non-uniform sampling was only available via more cumbersome fully nonstationary methods, say involving treed partitioning (Gramacy and Lee, 2009).

Over the years several authors (e.g., Ankenman et al., 2010; Anagnostopoulos and Gramacy, 2013; Burnaev and Panov, 2015; Leatherman et al., 2017) have provided closed-form expressions for IMSPE (i.e., for the integral in (2)) via variations on the criterion's definition (i.e., versions somewhat different than our preferred version in (2)), or via simplifications to the GP specification or to the argument $\mathbf{x}_1, \dots, \mathbf{x}_N$, obtained by constraining the search set. Others have argued in more general contexts that $d$-dimensional numerical integration, usually via sums over a (potentially random) reference set, is the only viable option in their setting (Seo et al., 2000; Gramacy and Lee, 2009; Gauthier and Pronzato, 2014; Gorodetsky and Marzouk, 2016; Pratola et al., 2017).

Here we provide a new closed-form expression for the IMSPE which, despite being intimately connected to earlier versions, is quite general and, we think, could replace many of the prevailing numerical schemes.
This development uses the "unique-$n$" representation for efficient calculation under replication; the analogue "full-$N$" version is immediate. We then consider an "add one" variation, $I_{N+1}(\tilde{\mathbf{x}}) = \mathrm{IMSPE}(\tilde{\mathbf{x}} \mid \mathbf{x}_1, \dots, \mathbf{x}_N)$, for efficient calculation in the sequential design setting, and derive a condition under which replication is preferred for the next sample. Here we use $\tilde{\mathbf{x}}$ for a potential new location, while $\mathbf{x}_{N+1}$ will ultimately be chosen as the best candidate (i.e., minimizing IMSPE over $\tilde{\mathbf{x}}$). Note that if $\mathbf{x}_{N+1}$ turns out to be a replicate, $n$ does not increase.

One important reason to have a closed-form IMSPE is the calculation of gradients, also in closed form, to aid in optimization. We provide the first such derivative expressions of which we are aware. Finally, acknowledging the dual role of replication (to speed calculations and separate signal from noise), we describe two new lookahead IMSPE heuristics for tuning the lookahead horizon in an online fashion, depending on whether speed or accuracy is more important.

3.1 IMSPE closed-form expressions

We start by writing the IMSPE, shorthanded as $I_N$ in Eq. (2), as an expectation:
$$I_N = \int_{\mathbf{x} \in D} \check{\sigma}_n^2(\mathbf{x})\, d\mathbf{x} = \mathbb{E}[\check{\sigma}_n^2(X)] = \mathbb{E}[k(X, X)] - \mathbb{E}[\mathbf{k}_n(X)^\top \mathbf{K}_n^{-1} \mathbf{k}_n(X)],$$
with $X$ uniformly sampled in $D$, and using the linearity of expectation. Notice that $\mathbf{K}_n$ depends on the number of replicates per unique design, so this representation includes a tacit dependence on the noise and replication counts $a_1, \dots, a_n$. Then, as shown in Lemma 3.1, the integration of $\check{\sigma}_n^2$ over $D$ may be reduced to integrations of the covariance function.

Lemma 3.1.
Let $\mathbf{W}_n$ be an $n \times n$ matrix with entries comprising integrals of kernel products, $w(\bar{\mathbf{x}}_i, \bar{\mathbf{x}}_j) = \int_{\mathbf{x} \in D} k(\bar{\mathbf{x}}_i, \mathbf{x})\, k(\bar{\mathbf{x}}_j, \mathbf{x})\, d\mathbf{x}$ for $1 \le i, j \le n$, and let $E = \int_{\mathbf{x} \in D} k(\mathbf{x}, \mathbf{x})\, d\mathbf{x}$. Then
$$I_N = E - \mathrm{tr}(\mathbf{K}_n^{-1} \mathbf{W}_n). \quad (3)$$

Proof.
The first part, involving $E$, follows simply by definition. For the second part, let $\mathbf{z}$ be a random vector of size $n$ with mean $\mathbf{m}$ and covariance $\mathbf{M}$. Petersen et al. (2008) provide that $\mathbb{E}[\mathbf{z}^\top \mathbf{K}_n^{-1} \mathbf{z}] = \mathrm{tr}(\mathbf{K}_n^{-1} \mathbf{M}) + \mathbf{m}^\top \mathbf{K}_n^{-1} \mathbf{m}$. Therefore, using $\mathbf{m}^\top \mathbf{K}_n^{-1} \mathbf{m} = \mathrm{tr}(\mathbf{K}_n^{-1} \mathbf{m}\mathbf{m}^\top)$, we have $\mathbb{E}[\mathbf{k}_n(X)^\top \mathbf{K}_n^{-1} \mathbf{k}_n(X)] = \mathrm{tr}(\mathbf{K}_n^{-1}(\mathbf{M} + \mathbf{m}\mathbf{m}^\top))$, where $\mathbf{m} = \mathbb{E}[\mathbf{k}_n(X)]$ and $\mathbf{M} = \mathrm{Cov}(\mathbf{k}_n(X), \mathbf{k}_n(X))$. Observing that $\mathbf{W}_n = \mathbf{M} + \mathbf{m}\mathbf{m}^\top$ gives the desired result.

Our interest in the re-characterization in (3) is three-fold. First and foremost, some of the most commonly used kernels enjoy closed-form expressions for $E$ and $w(\cdot, \cdot)$. In Appendix B we provide $w(\cdot, \cdot)$ for (i) Gaussian, (ii) Matérn-5/2, (iii) Matérn-3/2, and (iv) Matérn-1/2 families. For those families, $E$ further reduces to their scale hyperparameter. Section 4.1 offers specific forms for the generic expression (3) under our hetGP model. Second, note that even when closed forms are not available, as may arise when the kernel cannot be analytically integrated over $D$, this formulation may still be advantageous. Numerically integrating $k(\bar{\mathbf{x}}_i, \cdot)$ inside $\mathbf{W}_n$ will likely be far easier than the alternative of integrating $\check{\sigma}_n^2$, which can be highly multi-modal. Third, we remark that $\mathrm{tr}(\mathbf{K}_n^{-1} \mathbf{W}_n) = \mathbf{1}^\top (\mathbf{K}_n^{-1} \circ \mathbf{W}_n) \mathbf{1}$, where $\circ$ stands for the Hadamard (i.e., element-wise) product. Once $\mathbf{K}_n^{-1}$ and $\mathbf{W}_n$ are computed, the cost is in $O(n^2)$, whereas the naïve alternative is $O(n^3)$.

Now, in sequential application the goal is to choose a new $\mathbf{x}_{N+1}$ by optimizing $I_{N+1}(\tilde{\mathbf{x}})$ over candidates $\tilde{\mathbf{x}}$. Fixing the first $n$ unique design elements simplifies calculations substantially if we assume that $\mathbf{K}_n^{-1}$ and $\mathbf{W}_n$ are previously available.
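Lemma 3.1 and the Hadamard identity hold even when $w(\cdot,\cdot)$ is computed numerically rather than in closed form. The following sketch, with our own illustrative settings (Gaussian kernel, simple grid quadrature on $D = [0,1]$, made-up per-site nuggets), compares $E - \mathrm{tr}(\mathbf{K}_n^{-1}\mathbf{W}_n)$ against direct integration of the de-noised variance:

```python
import numpy as np

def k_gauss(a, b, theta=0.1):
    return np.exp(-((a - b) ** 2) / theta)

xbar = np.array([0.15, 0.5, 0.85])             # n = 3 unique sites
rn = np.array([0.10, 0.02, 0.10])              # per-site nuggets r(xbar_i) / a_i
Kn = k_gauss(xbar[:, None], xbar[None, :]) + np.diag(rn)
Ki = np.linalg.inv(Kn)

grid = np.linspace(0, 1, 2001)                 # quadrature grid over D = [0, 1]
kX = k_gauss(grid[:, None], xbar[None, :])     # rows hold k_n(x) at each grid point

E = 1.0                                        # E = integral of k(x, x) = 1 here
Wn = (kX[:, :, None] * kX[:, None, :]).mean(axis=0)   # W_n by quadrature
I_closed = E - np.trace(Ki @ Wn)               # Lemma 3.1

# direct quadrature of the de-noised predictive variance
check_var = 1.0 - np.einsum('ij,jk,ik->i', kX, Ki, kX)
I_direct = check_var.mean()

# Hadamard identity: tr(Ki Wn) = 1^T (Ki o Wn) 1, an O(n^2) computation
had = np.ones(3) @ (Ki * Wn) @ np.ones(3)
```

Here the quadrature stands in for the closed-form $w(\cdot,\cdot)$ of Appendix B; the two IMSPE values coincide because the same grid is used on both sides.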
In that case, write
$$\mathbf{K}_{n+1} = \begin{bmatrix} \mathbf{K}_n & \mathbf{k}_n(\tilde{\mathbf{x}}) \\ \mathbf{k}_n(\tilde{\mathbf{x}})^\top & k(\tilde{\mathbf{x}}, \tilde{\mathbf{x}}) + r(\tilde{\mathbf{x}}) \end{bmatrix}, \qquad \mathbf{W}_{n+1} = \begin{bmatrix} \mathbf{W}_n & \mathbf{w}(\tilde{\mathbf{x}}) \\ \mathbf{w}(\tilde{\mathbf{x}})^\top & w(\tilde{\mathbf{x}}, \tilde{\mathbf{x}}) \end{bmatrix}$$
with $\mathbf{w}(\tilde{\mathbf{x}}) = (w(\tilde{\mathbf{x}}, \bar{\mathbf{x}}_i))_{1 \le i \le n}$. The partition inverse equations (Barnett, 1979) give
$$\mathbf{K}_{n+1}^{-1} = \begin{bmatrix} \mathbf{K}_n^{-1} + \mathbf{g}(\tilde{\mathbf{x}})\mathbf{g}(\tilde{\mathbf{x}})^\top \sigma_n^2(\tilde{\mathbf{x}}) & \mathbf{g}(\tilde{\mathbf{x}}) \\ \mathbf{g}(\tilde{\mathbf{x}})^\top & \sigma_n^2(\tilde{\mathbf{x}})^{-1} \end{bmatrix}, \quad (4)$$
where $\mathbf{g}(\tilde{\mathbf{x}}) = -\sigma_n^2(\tilde{\mathbf{x}})^{-1} \mathbf{K}_n^{-1} \mathbf{k}_n(\tilde{\mathbf{x}})$ and $\sigma_n^2(\tilde{\mathbf{x}}) = \check{\sigma}_n^2(\tilde{\mathbf{x}}) + r(\tilde{\mathbf{x}})$ as in (1). Combining those two results together leads to
$$I_{N+1}(\tilde{\mathbf{x}}) = E - \mathbf{1}^\top [\mathbf{K}_{n+1}^{-1} \circ \mathbf{W}_{n+1}] \mathbf{1} = I_N - \left( \sigma_n^2(\tilde{\mathbf{x}})\, \mathbf{g}(\tilde{\mathbf{x}})^\top \mathbf{W}_n \mathbf{g}(\tilde{\mathbf{x}}) + 2\mathbf{w}(\tilde{\mathbf{x}})^\top \mathbf{g}(\tilde{\mathbf{x}}) + \sigma_n^2(\tilde{\mathbf{x}})^{-1} w(\tilde{\mathbf{x}}, \tilde{\mathbf{x}}) \right). \quad (5)$$
Both (4) and (5) only require $O(n^2)$ computation.

After optimizing the latter part of (5) over $\tilde{\mathbf{x}}$ and choosing the best new design $\mathbf{x}_{N+1}$, one may utilize those inverse equations again to update the GP fit. Although similar identities have been provided in the literature (e.g., Gramacy and Polson, 2011; Chevalier et al., 2014), the ones we provide here are the first to exploit the thrifty "unique-$n$" representation, and to tailor to the setting where $\mathbf{x}_{N+1}$ is a replicate, i.e., an $\bar{\mathbf{x}}_k$ for $k \in \{1, \dots, n\}$, versus a new distinct $\bar{\mathbf{x}}_{n+1}$ location.

Lemma 3.2.
Suppose $\mathbf{x}_{N+1} = \bar{\mathbf{x}}_k$. Then the updated predictive mean and variance (increasing $N$ but not $n$) are given by
$$\mu_{(N+1,n)}(\mathbf{x}) = \mu_{(N,n)}(\mathbf{x}) + \mathbf{k}_n(\mathbf{x})^\top \left( \mathbf{K}_{(N,n)}^{-1} \big(\bar{Y}_{(N+1,n)} - \bar{Y}_{(N,n)}\big) + \mathbf{B}_k \bar{Y}_{(N+1,n)} \right),$$
$$\sigma_{(N+1,n)}^2(\mathbf{x}) = \sigma_{(N,n)}^2(\mathbf{x}) - \mathbf{k}_n(\mathbf{x})^\top \mathbf{B}_k \mathbf{k}_n(\mathbf{x}),$$
with
$$\mathbf{B}_k = \frac{\big( \mathbf{K}_{(N,n)}^{-1} \big)_{\cdot,k} \big( \mathbf{K}_{(N,n)}^{-1} \big)_{k,\cdot}}{a_k(a_k+1)/r(\bar{\mathbf{x}}_k) - \big( \mathbf{K}_{(N,n)}^{-1} \big)_{k,k}},$$
a rank-one matrix.

Proof. By adding a replicate at $\bar{\mathbf{x}}_k$, the only change is to augment $a_k$ by one in $\mathbf{K}_{(N+1,n)}$, namely
$$\mathbf{K}_{(N+1,n)} - \mathbf{K}_{(N,n)} = -\mathrm{Diag}\Big(0, \dots, 0, \tfrac{r(\bar{\mathbf{x}}_k)}{a_k(a_k+1)}, 0, \dots, 0\Big) =: -r(\bar{\mathbf{x}}_k)\mathbf{u}\mathbf{u}^\top = r(\bar{\mathbf{x}}_k)\mathbf{u}'\mathbf{u}^\top$$
with $\mathbf{u}' = -\mathbf{u}$. Similarly, $\bar{Y}_{(N+1,n)} - \bar{Y}_{(N,n)} = \Big(0, \dots, 0, \tfrac{1}{a_k+1}\big(y_k^{(a_k+1)} - \bar{y}_k^{(N)}\big), 0, \dots, 0\Big)^\top$ has only one non-zero element, residing in position $k$.

The Sherman-Morrison (i.e., rank-one Woodbury) formula gives
$$\mathbf{K}_{(N+1,n)}^{-1} = \big( \mathbf{K}_{(N,n)} + r(\bar{\mathbf{x}}_k)\mathbf{u}'\mathbf{u}^\top \big)^{-1} = \mathbf{K}_{(N,n)}^{-1} + \frac{\big( \mathbf{K}_{(N,n)}^{-1} \big)_{\cdot,k} \big( \mathbf{K}_{(N,n)}^{-1} \big)_{k,\cdot}}{\big(r(\bar{\mathbf{x}}_k) u_k^2\big)^{-1} - \big( \mathbf{K}_{(N,n)}^{-1} \big)_{k,k}} = \mathbf{K}_{(N,n)}^{-1} + \mathbf{B}_k. \quad (6)$$
This enables us to write $\mu_{(N+1,n)}(\mathbf{x}) - \mu_{(N,n)}(\mathbf{x}) = \mathbf{k}_n(\mathbf{x})^\top \big( \mathbf{K}_{(N+1,n)}^{-1} \bar{Y}_{(N+1,n)} - \mathbf{K}_{(N,n)}^{-1} \bar{Y}_{(N,n)} \big)$ and $\sigma_{(N+1,n)}^2(\mathbf{x}) - \sigma_{(N,n)}^2(\mathbf{x}) = -\mathbf{k}_n(\mathbf{x})^\top \big( \mathbf{K}_{(N+1,n)}^{-1} - \mathbf{K}_{(N,n)}^{-1} \big) \mathbf{k}_n(\mathbf{x})$, and substitute $\mathbf{K}_{(N+1,n)}^{-1} - \mathbf{K}_{(N,n)}^{-1} = \mathbf{B}_k$ from (6). From the proof we also see that adding a replicate $\mathbf{x}_{N+1}$ incurs $O(n^2)$ rather than the usual $O(n^3)$ cost.

As a corollary we obtain the following formula for the one-step-ahead IMSPE at existing designs, $I_{N+1}(\bar{\mathbf{x}}_k)$ (relying on the fact that $\mathbf{W}_n$ is unchanged when replicating):
$$I_{N+1}(\bar{\mathbf{x}}_k) = E - \mathrm{tr}\big(\mathbf{K}_{(N+1,n)}^{-1} \mathbf{W}_n\big) = E - \mathrm{tr}\big( (\mathbf{K}_{(N,n)}^{-1} + \mathbf{B}_k) \mathbf{W}_n \big) = I_N - \mathrm{tr}(\mathbf{B}_k \mathbf{W}_n). \quad (7)$$
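Both update formulas, Eq. (5) for a new location and the rank-one identity behind Lemma 3.2 and Eq. (7) for a replicate, can be checked against brute-force recomputation. The sketch below uses our own illustrative settings (Gaussian kernel on $[0,1]$, quadrature-based $w(\cdot,\cdot)$, made-up noise values and replication counts), not those of the paper:

```python
import numpy as np

def k_gauss(a, b, theta=0.1):
    return np.exp(-((a - b) ** 2) / theta)

grid = np.linspace(0, 1, 2001)                 # quadrature grid over D = [0, 1]
def w(u, v):                                    # w(u, v) = integral of k(u, x) k(v, x) over D
    return float(np.mean(k_gauss(u, grid) * k_gauss(v, grid)))

xbar = np.array([0.15, 0.5, 0.85])
a = np.array([2, 3, 2])                         # replication counts
rx = np.array([0.20, 0.06, 0.20])               # r(xbar_i), illustrative
K = k_gauss(xbar[:, None], xbar[None, :]) + np.diag(rx / a)
Ki = np.linalg.inv(K)
Wn = np.array([[w(u, v) for v in xbar] for u in xbar])
I_N = 1.0 - np.trace(Ki @ Wn)                   # E = 1 for this kernel on [0, 1]

# --- Eq. (5): add a new location xt ---
xt, rt = 0.3, 0.1
kt = k_gauss(xt, xbar)
s2 = 1.0 + rt - kt @ Ki @ kt                    # sigma_n^2(xt)
g = -(Ki @ kt) / s2
wt = np.array([w(xt, v) for v in xbar])
I_new_upd = I_N - (s2 * g @ Wn @ g + 2 * wt @ g + w(xt, xt) / s2)

x2 = np.append(xbar, xt)                        # brute-force recomputation with n + 1 sites
K2 = k_gauss(x2[:, None], x2[None, :]) + np.diag(np.append(rx / a, rt))
W2 = np.array([[w(u, v) for v in x2] for u in x2])
I_new_direct = 1.0 - np.trace(np.linalg.inv(K2) @ W2)

# --- Lemma 3.2 / Eq. (7): add a replicate at xbar_k ---
k_idx = 1
denom = a[k_idx] * (a[k_idx] + 1) / rx[k_idx] - Ki[k_idx, k_idx]
Bk = np.outer(Ki[:, k_idx], Ki[k_idx, :]) / denom
a2 = a.copy(); a2[k_idx] += 1
K_rep = k_gauss(xbar[:, None], xbar[None, :]) + np.diag(rx / a2)
I_rep_upd = I_N - np.trace(Bk @ Wn)
I_rep_direct = 1.0 - np.trace(np.linalg.inv(K_rep) @ Wn)
```

The updated inverse $\mathbf{K}^{-1} + \mathbf{B}_k$ matches the direct inverse after incrementing $a_k$, and both shortcut IMSPE values agree with recomputation from scratch.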
Besides enabling a "quick check" (with cost $O(n^2)$) for finding the best replicate, perhaps a more important application of this result is that (7) yields an explicit condition under which replication is optimal.

Proposition 3.1.
Given $n$ unique design locations $\bar{\mathbf{x}}_1, \dots, \bar{\mathbf{x}}_n$, replicating is optimal (with respect to $I_{N+1}$) if
$$r(\tilde{\mathbf{x}}) \ge \frac{\mathbf{k}_n(\tilde{\mathbf{x}})^\top \mathbf{K}_n^{-1} \mathbf{W}_n \mathbf{K}_n^{-1} \mathbf{k}_n(\tilde{\mathbf{x}}) - 2\mathbf{w}(\tilde{\mathbf{x}})^\top \mathbf{K}_n^{-1} \mathbf{k}_n(\tilde{\mathbf{x}}) + w(\tilde{\mathbf{x}}, \tilde{\mathbf{x}})}{\mathrm{tr}(\mathbf{B}_{k^*} \mathbf{W}_n)} - \check{\sigma}_n^2(\tilde{\mathbf{x}}), \quad \forall \tilde{\mathbf{x}} \in D, \quad (8)$$
where $k^* \in \operatorname{argmin}_{1 \le k \le n} I_{N+1}(\bar{\mathbf{x}}_k)$.

Proof. We proceed by comparing $I_{N+1}(\tilde{\mathbf{x}})$ values when $\tilde{\mathbf{x}}$ is a replicate vis-à-vis a new design. Summarizing our results from above, we have $I_{N+1}(\bar{\mathbf{x}}_{k^*}) = I_N - \mathrm{tr}(\mathbf{B}_{k^*} \mathbf{W}_n)$ for the (best) replicate and $I_{N+1}(\tilde{\mathbf{x}}) = I_N - \big( \sigma_n^2(\tilde{\mathbf{x}})\, \mathbf{g}(\tilde{\mathbf{x}})^\top \mathbf{W}_n \mathbf{g}(\tilde{\mathbf{x}}) + 2\mathbf{w}(\tilde{\mathbf{x}})^\top \mathbf{g}(\tilde{\mathbf{x}}) + \sigma_n^2(\tilde{\mathbf{x}})^{-1} w(\tilde{\mathbf{x}}, \tilde{\mathbf{x}}) \big)$ for a new design. Replicating is better if $I_{N+1}(\bar{\mathbf{x}}_{k^*}) \le I_{N+1}(\tilde{\mathbf{x}})$ for all $\tilde{\mathbf{x}}$; substituting $\mathbf{g}(\tilde{\mathbf{x}}) = -\sigma_n^2(\tilde{\mathbf{x}})^{-1} \mathbf{K}_n^{-1} \mathbf{k}_n(\tilde{\mathbf{x}})$, this holds when
$$\mathrm{tr}(\mathbf{B}_{k^*} \mathbf{W}_n) \ge \sigma_n^2(\tilde{\mathbf{x}})^{-1} \big( \mathbf{k}_n(\tilde{\mathbf{x}})^\top \mathbf{K}_n^{-1} \mathbf{W}_n \mathbf{K}_n^{-1} \mathbf{k}_n(\tilde{\mathbf{x}}) - 2\mathbf{w}(\tilde{\mathbf{x}})^\top \mathbf{K}_n^{-1} \mathbf{k}_n(\tilde{\mathbf{x}}) + w(\tilde{\mathbf{x}}, \tilde{\mathbf{x}}) \big).$$
Using the fact that $\sigma_n^2(\tilde{\mathbf{x}}) = \check{\sigma}_n^2(\tilde{\mathbf{x}}) + r(\tilde{\mathbf{x}})$ establishes the desired result.

Referring back to Figure 1, the gray-dotted line in the left panel represents the right-hand side of Eq. (8). Thus, any noise surface with $r(\mathbf{x})$ above this line will lead to the $I_{N+1}$ minimizer being a replicate, cf. the solid blue $r(\mathbf{x})$ case in the figure. Although this illustration involves a heteroskedastic example, the inequality in (8) can also hold in the homoskedastic case. In practice, replication in homoskedastic processes is most often at the edges of the input space; however, particular behavior is highly sensitive to the settings of the $n$ design locations, and their degrees of replication, $a_i$.

To facilitate the optimization of $I_{N+1}(\tilde{\mathbf{x}})$ with respect to $\tilde{\mathbf{x}}$, we provide closed-form expressions for its gradient, via partial derivatives. Below the subscript $(p)$ denotes the $p$th coordinate of the $d$-dimensional design $\tilde{\mathbf{x}} \in D$.
As a starting point, the chain rule gives
$$\frac{\partial I_{N+1}(\tilde{\mathbf{x}})}{\partial \tilde{x}_{(p)}} = -\frac{\partial\, \mathrm{tr}(\mathbf{K}_{n+1}^{-1} \mathbf{W}_{n+1})}{\partial \tilde{x}_{(p)}} = -\mathrm{tr}\left( \mathbf{K}_{n+1}^{-1} \frac{\partial \mathbf{W}_{n+1}}{\partial \tilde{x}_{(p)}} \right) - \mathrm{tr}\left( \frac{\partial \mathbf{K}_{n+1}^{-1}}{\partial \tilde{x}_{(p)}} \mathbf{W}_{n+1} \right). \quad (9)$$
To manage the computational costs, we notate below how the partial derivatives are distributed in another application of the partition inverse equations:
$$\frac{\partial \mathbf{K}_{n+1}^{-1}}{\partial \tilde{x}_{(p)}} = \begin{bmatrix} \mathbf{H}(\tilde{\mathbf{x}}) & \mathbf{h}(\tilde{\mathbf{x}}) \\ \mathbf{h}(\tilde{\mathbf{x}})^\top & v(\tilde{\mathbf{x}}) \end{bmatrix}, \quad (10)$$
$$\frac{\partial \mathbf{W}_{n+1}}{\partial \tilde{x}_{(p)}} = \begin{bmatrix} \mathbf{0}_{n \times n} & \mathbf{c}(\tilde{\mathbf{x}}) \\ \mathbf{c}(\tilde{\mathbf{x}})^\top & c(\tilde{\mathbf{x}}) \end{bmatrix}, \quad (11)$$
where the detailed expressions and derivations are given in Appendix A. The expressions above are collected into the following lemma.

Lemma 3.3. The $p$th component of the gradient for sequential IMSPE is
$$-\frac{\partial I_{N+1}}{\partial \tilde{x}_{(p)}} = 2\mathbf{c}(\tilde{\mathbf{x}})^\top \mathbf{g}(\tilde{\mathbf{x}}) + c(\tilde{\mathbf{x}})\, \sigma_n^2(\tilde{\mathbf{x}})^{-1} + \mathbf{1}_n^\top \left[ \mathbf{H}(\tilde{\mathbf{x}})\, \sigma_n^2(\tilde{\mathbf{x}}) \circ \mathbf{W}_n \right] \mathbf{1}_n + 2\mathbf{w}(\tilde{\mathbf{x}})^\top \mathbf{h}(\tilde{\mathbf{x}}) + v(\tilde{\mathbf{x}})\, w(\tilde{\mathbf{x}}, \tilde{\mathbf{x}}). \quad (12)$$

Proof.
Beginning with Eq. (9), substitute (10) for the partial derivative of $\mathbf{K}_{n+1}^{-1}$, and (11) for that of $\mathbf{W}_{n+1}$. Then, note that $\mathbf{1}_n^\top [\mathbf{H}(\tilde{\mathbf{x}})\, \sigma_n^2(\tilde{\mathbf{x}}) \circ \mathbf{W}_n] \mathbf{1}_n$ can be rewritten as $v(\tilde{\mathbf{x}})\, \mathbf{g}(\tilde{\mathbf{x}})^\top \mathbf{W}_n \mathbf{g}(\tilde{\mathbf{x}}) + 2\sigma_n^2(\tilde{\mathbf{x}})\, \mathbf{g}(\tilde{\mathbf{x}})^\top \mathbf{W}_n \mathbf{h}(\tilde{\mathbf{x}})$.

Since no further matrix decompositions are required, note that calculating the gradient of $I_{N+1}(\tilde{\mathbf{x}})$ in this way incurs computational costs in $O(n^2)$.

Under certain conditions, sequential design via IMSPE, i.e., greedily minimizing $I_{N+1}$ to choose $\mathbf{x}_{N+1}$, can well-approximate a one-shot batch design of size $N_{\max}$ because the criterion is monotone supermodular (Das and Kempe, 2008; Krause et al., 2008). However, these results assume a known kernel hyperparameterization $k(\cdot,\cdot)$ and constant noise level $r(\cdot)$. In the more realistic case where those quantities must be estimated from data, and potentially with non-constant variance, there is ample evidence in the literature suggesting that sequential design can be much better than a batch design, e.g., based on a poorly-chosen parameterization, and no worse than an idealistic one (Seo et al., 2000; Gramacy and Lee, 2009). However, that does not mean that greedy, myopic selection is optimal. By accounting for potential future selections in choosing the very next one, it is possible to obtain substantially improved final designs. However, the calculations involved, especially to "look ahead" from early sequential decisions to a far-away horizon $N_{\max}$, require expensive dynamic programming techniques to search an enormous decision space.

Approximating that full search, by limiting the lookahead horizon or otherwise reducing the scope of the decision space, has become an active area in Bayesian optimization via expected improvement (Ginsbourger and Le Riche, 2010; Gonzalez et al., 2016; Lam et al., 2016). Targeting overall accuracy has seen rather less development, the work by Huan and Marzouk (2016) being an important exception.
Here we aim to port many of these ideas to our setting of IMSPE optimization, where the nature of our approximation involves a weak bias towards replication, which we have shown can be doubly beneficial in design.

The essential decision boils down to either choosing an $\mathbf{x}_{N+1}$ to explore, i.e., a new design element $\bar{\mathbf{x}}_{n+1}$, or choosing to replicate, with $\mathbf{x}_{N+1}$ taken to be some $\bar{\mathbf{x}}_k$, for $k \in \{1, \dots, n\}$. However, rather than directly minimizing (5) or (7), respectively, we perform a "rollout" lookahead procedure similar to Lam et al. (2016) in order to explore the impact of those choices on a limited space of future design decisions. The updating equations in the previous subsections make this tractable.

In particular we consider a horizon $h \in \{0, 1, 2, \dots\}$ determining the number of design iterations to look ahead, with $h = 0$ representing ordinary (myopic) IMSPE search. Although larger values of $h$ entertain future sequential design decisions, the goal (for any $h$) is to determine what to do now. Toward that end, we evaluate $h + 1$ "decision paths" spanning alternatives between exploring sooner and replicating later, or vice versa. During each iteration along a given path, either (5) or (7) (but not simultaneously) is taken up as the hypothetical action. On the first iteration, if a new $\bar{\mathbf{x}}_{n+1}$ is chosen by optimizing Eq. (5), that location (along with the existing $\bar{\mathbf{x}}_1, \dots, \bar{\mathbf{x}}_n$) is considered as a candidate for future replication over the remaining $h$ lookahead iterations (when $h \ge 1$). Otherwise, if a replicate is chosen first, one of the subsequent $h$ iterations will pick a new $\bar{\mathbf{x}}_{n+1}$, with the others optimizing over replicates. This recursion is resolved by moving to the second iteration and again splitting the decision path into the choice between replicate-explore-replicate-... and replicate-replicate-..., etc. After recursively optimizing up to horizon $h$ along the $h + 1$ paths, the ultimate IMSPE for the respective hypothetical design with size $N + 1 + h$ is computed, and the decision path yielding the smallest IMSPE is noted.
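The rollout just described can be sketched in a few dozen lines. The sketch below is ours, not the paper's implementation: IMSPE is computed by brute-force quadrature rather than via Eqs. (5) and (7), the exploration search is a grid rather than gradient-based, and the noise surface `r_of` is a made-up stand-in for a fitted noise process. It only illustrates the $h+1$ decision paths and the resulting explore-now-or-replicate rule.

```python
import numpy as np

def k_gauss(a, b, theta=0.1):
    return np.exp(-((a - b) ** 2) / theta)

GRID = np.linspace(0.0, 1.0, 401)              # candidate and quadrature grid on D = [0, 1]

def r_of(x):
    """Made-up noise surface r(x); a stand-in for a fitted noise process."""
    return 0.05 + 0.3 * np.asarray(x)

def imspe(x, r_eff):
    """I_N by quadrature: mean de-noised variance over D (E = 1 for this kernel)."""
    K = k_gauss(x[:, None], x[None, :]) + np.diag(r_eff)
    Ki = np.linalg.inv(K)
    kX = k_gauss(GRID[:, None], x[None, :])
    return float(np.mean(1.0 - np.einsum('ij,jk,ik->i', kX, Ki, kX)))

def best_replicate(x, a):
    """Existing site whose extra replicate (nugget r/a_k -> r/(a_k+1)) minimizes IMSPE."""
    vals = []
    for k in range(len(x)):
        a2 = a.copy(); a2[k] += 1
        vals.append(imspe(x, r_of(x) / a2))
    k = int(np.argmin(vals))
    return k, vals[k]

def best_new(x, a):
    """New site on GRID minimizing IMSPE (grid search standing in for gradient search)."""
    vals = [imspe(np.append(x, c), np.append(r_of(x) / a, r_of(c))) for c in GRID]
    j = int(np.argmin(vals))
    return float(GRID[j]), vals[j]

def explore_first_is_best(x, a, h):
    """Evaluate the h+1 paths replicate^j -> explore -> replicate^(h-j); True iff j=0 wins."""
    scores = []
    for j in range(h + 1):
        xs, cnt = x.copy(), a.copy()
        for step in range(h + 1):
            if step == j:                      # explore at rollout step j
                xnew, _ = best_new(xs, cnt)
                xs = np.append(xs, xnew); cnt = np.append(cnt, 1)
            else:                              # otherwise add the best replicate
                k, _ = best_replicate(xs, cnt)
                cnt[k] += 1
        scores.append(imspe(xs, r_of(xs) / cnt))
    return int(np.argmin(scores)) == 0

x0 = np.array([0.2, 0.5, 0.8])                 # current unique sites
a0 = np.array([2, 2, 2])                       # current replication counts
explore_now = explore_first_is_best(x0, a0, h=2)
```

The greedy sub-searches inside each path mimic how (5) and (7) would be used in practice; in the paper's scheme those closed forms make each step far cheaper than the brute-force quadrature used here.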
Finally, the next ¯ x N +1 is a new location if theexplore-first path was optimal, and is a replicate otherwise. x n +1 = 0 . x n +2 = 0 . x n +3 = 0 . x n +4 = 0 . ) k = 6(95724) k = 6(94014) k = 7(92365) k = 7(92370) k = 4(92434) k = 4(90873) k = 4(90867) k = 7(90877) k = 5(93992) (97668) Figure 2: Lookahead strategy for h = 3, starting with n unique designs. Each ellipse is astate, with a specific training set. Black dashed arrows represent the action of adding thebest replicate, red solid arrows represent adding the best new design. This example considersaugmenting the design for the example shown in Figure 3. Numbers in parenthesis indicatesthe IMSPE at each stage (values have been multiplied by 10 ).A diagram depicting the decision space is shown in Figure 2. In this example we attempt11o augment the design from Figure 3 using h = 3. The path yielding the lowest IMSPE atthe horizon involves here replicating three times (adding copies of design elements 6, 5, and 7respectively), with exploration at x n +4 = 0 .
175 in the final stage. Consequently, replication is preferred over exploration and the next design element will be a replicate, duplicating x̄_6. The figure also illustrates the cost of the search: adding a new design element involves at most h + 1 global optimizations of Eq. (5) using the gradient, whereas searching for the best replicate is a discrete lookup. Although on the order of (h + 1)(h + 2)/2 criterion evaluations are entailed in total, the h + 1 searches of mixed continuous and discrete type may be performed in parallel. In practice, global optimization with a budget at least of the same order as n is an order of magnitude more expensive than looking for the best replicate.

In this scheme the horizon, h, determines the extent to which replicates are entertained in the lookahead, and therefore larger h somewhat inflates the likelihood of replication. Indeed, as h grows, there are more and more decision paths that delay exploration to a later iteration; if any of them yield a smaller IMSPE than the explore-first path, the immediate action is to replicate. However, note that although larger h allows more replication before committing to a new, unique x̄_{n+1}, it also magnifies the value of an x̄_{n+1} chosen in the first iteration, as it could potentially accrue its own replicates in subsequent rollout iterations. Therefore, although we do find in practice that larger h leads to more replication in the final design, this association is weak. Indeed, we frequently encounter situations where exploration is (temporarily) preferred for arbitrarily large horizons.

Here we consider inference and implementation details, in particular for learning the noise process r(·). Our presumption is that little is known about the noise; however, it is worth noting that this assumption may not be well aligned to some data-generating mechanisms, e.g., as arising from Monte Carlo simulations with known convergence rates (Picheny and Ginsbourger, 2013).
After reviewing a promising new framework called hetGP, for heteroskedastic GP surrogate modeling (Binois et al., 2016), we provide extensions facilitating fast sequential updating of that predictor along with its (hyper-) parameterization. We conclude with schemes for adjusting the lookahead horizon introduced in Section 3.3.

One way of dealing with heteroskedasticity in GP regression is to use empirical estimates of the variance, as in SK (Ankenman et al., 2010), described briefly in Section 2. Although this has the downside of requiring a minimum amount of replication, the calculations are straightforward and computations are thrifty. However, sequential design requires predicting the variance at new locations, and to accommodate that within SK, Ankenman et al. recommend fitting a second, independent GP for r̂(x) to smooth the empirical variances. An alternative is to model the (log) variance as a latent process, jointly with the original "mean process" (Goldberg et al., 1998; Kersting et al., 2007). However, these methods can be computationally cumbersome, and are not tailored to leverage the computational savings that come with replication. Here we rely on the hybrid approach detailed by Binois et al. (2016), leveraging replication and learning the latent log-variance GP based on a joint log-likelihood with the mean GP. We offer the following by way of a brief review.

For common choices of stationary kernel k(x, x′) = ν c(x − x′), the covariance matrix for the "mean GP" may be characterized as K_n = ν(C_n + A_n^{-1}Λ_n), with C_n = (c(x̄_i − x̄_j))_{1≤i,j≤n}; and for the "noise GP" we take the analog log Λ_n = C^{(g)}(C^{(g)} + g A_n^{-1})^{-1} Δ_n, where C^{(g)} is the equivalent of C_n for the second GP with kernel k^{(g)}. That is, log Λ_n is the prediction given by a GP based on latent variables Δ_n = (δ_1, . . .
, δ_n) that can be learned as additional parameters, alongside hyperparameters of k^{(g)} and nugget g.

Based on this representation, the MLE of ν is

  ν̂_N := N^{-1} ( Σ_{i=1}^n a_i s_i²/λ_i + Ȳ^⊤(C_n + A_n^{-1}Λ_n)^{-1}Ȳ ),  with  s_i² = a_i^{-1} Σ_{j=1}^{a_i} (y_i^{(j)} − ȳ_i)²,

whereas the rest of the parameters and hyperparameters can be optimized based on the concentrated joint log-likelihood:

  log L̃ = − (N/2) log ν̂_N − (1/2) Σ_{i=1}^n [ (a_i − 1) log λ_i + log a_i ] − (1/2) log |C_n + A_n^{-1}Λ_n|
           − (n/2) log ν̂^{(g)} − (1/2) log |C^{(g)} + g A_n^{-1}| + Const,

with ν̂^{(g)} = n^{-1} Δ_n^⊤ (C^{(g)} + g A_n^{-1})^{-1} Δ_n. Closed-form derivatives are given in Binois et al. (2016), while an R (R Core Team, 2017) package with embedded C++ subroutines is available as hetGP on CRAN.

Notice that for stationary kernels, Eq. (3) reduces to IMSPE(x_1, ..., x_N) = ν(1 − tr((C_n + A_n^{-1}Λ_n)^{-1} W_n)). The lookahead IMSPE over replicates (7) becomes I_{N+1}(x̄_k) = ν(1 − tr(B′_k W_n)) with

  B′_k = (C_n + A_n^{-1}Λ_n)^{-1} + [ ((C_n + A_n^{-1}Λ_n)^{-1})_{·,k} ((C_n + A_n^{-1}Λ_n)^{-1})_{k,·} ] / [ a_k(a_k + 1)/λ_k − ((C_n + A_n^{-1}Λ_n)^{-1})_{k,k} ].

Also, the gradient of I_{N+1}(x̃) from (5) involves ∂r(x̃)/∂x̃^{(p)}, which for hetGP reduces to

  ∂[ k^{(g)}(x̃)^⊤ (C^{(g)} + g A_n^{-1})^{-1} Δ_n ]/∂x̃^{(p)} = (∂k^{(g)}(x̃)/∂x̃^{(p)})^⊤ (C^{(g)} + g A_n^{-1})^{-1} Δ_n.

Optimizing IMSPE with lookahead over replication [Section 3.3] is only practical if the hetGP model can be updated efficiently when new simulations are performed. Two different update schemes are necessary: one for potential new designs, considered during the process of evaluating alternatives under the criteria [Eqs. (5-7)]; and another for the actual update with new simulation y(x_{N+1}).

When looking ahead, no new y-value is entertained, so hyperparameters of both GPs stay fixed and only the latents may need to be augmented. Updating K_n follows (4) or (6), depending on whether the candidate x̃ is new or a replicate. In the latter case, only A_n is updated for the "noise GP". Conversely, if a new location is added, an estimate of r(x̄_{n+1}) is required, which can come from the noise GP via exponentiating the usual GP predictive equations. That is, the new latent δ_{n+1} is taken as the predicted value by the noise GP.

The second update scheme, using the y(x_{N+1}) observation, will require updating all the GPs' hyperparameters (including latents).
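The replicate update B′_k given earlier is a rank-one (Sherman-Morrison) correction: adding a replicate at x̄_k only perturbs the kth diagonal entry of C_n + A_n^{-1}Λ_n. A small NumPy check with arbitrary toy values for the kernel, a_i, and λ_i:

```python
import numpy as np

# Numerical check of the rank-one replicate update for B'_k: adding a replicate
# at xbar_k changes only the kth diagonal entry of C_n + A_n^{-1} Lambda_n
# (lam_k/a_k -> lam_k/(a_k+1)), so Sherman-Morrison avoids a fresh O(n^3) inverse.
# The kernel, inputs, a_i, and lambda_i below are arbitrary toy values.
xbar = np.array([0.1, 0.35, 0.6, 0.9])       # unique design locations
a = np.array([2.0, 5.0, 1.0, 3.0])           # replicate counts a_i
lam = np.array([0.5, 0.2, 1.0, 0.3])         # latent variances lambda_i

C = np.exp(-(xbar[:, None] - xbar[None, :]) ** 2 / 0.1)  # Gaussian kernel
Kinv = np.linalg.inv(C + np.diag(lam / a))               # (C_n + A_n^{-1}Lambda_n)^{-1}

k = 1  # candidate replicate at xbar_k
Bk = Kinv + np.outer(Kinv[:, k], Kinv[k, :]) / (a[k] * (a[k] + 1) / lam[k] - Kinv[k, k])

a_new = a.copy(); a_new[k] += 1                          # brute-force recomputation
Bk_direct = np.linalg.inv(C + np.diag(lam / a_new))
print(np.allclose(Bk, Bk_direct))  # True
```

This identity is what keeps the lookahead search cheap: each hypothetical replicate costs O(n²) rather than O(n³).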
Optimizing all hyperparameters of our heteroskedastic GP model is a potentially costly O(n³) procedure. Instead of starting from scratch, a warm start of the MLE optimization is performed. Where they exist, previous values can be re-used as starting values, leaving only the latent δ̃ at the newest design point, that is δ̃ = δ_{n+1} for a new location or δ̃ = δ_k for a replicate, requiring special attention.

As in the first case, δ̃ may be initialized at its predicted value. But taking into account the new y(x_{N+1}) makes it possible to combine information from the latent noise GP with results from empirical estimation of the log-variances. Kamiński (2015) explores this for updating SK models when new observations are added, a special case of the typical GP update formulas. The resulting combination of the two predictions is via the geometric mean and can be summarized by the Gaussian N(δ̃, V_δ̃) with

  δ̃ = ( μ^{(g)}(x_{N+1})/σ̌²_{(g)}(x_{N+1}) + δ̂/V_δ̂ ) ( 1/σ̌²_{(g)}(x_{N+1}) + 1/V_δ̂ )^{-1},
  V_δ̃ = ( 1/σ̌²_{(g)}(x_{N+1}) + 1/V_δ̂ )^{-1},

where μ^{(g)}(x_{N+1}) and σ̌²_{(g)}(x_{N+1}) are the predictive mean and variance from the noise GP, while δ̂ is the empirical estimate of the log variance at x_{N+1}, itself with variance V_δ̂. We take σ̂² = (ν̂_N ã)^{-1} Σ_{j=1}^{ã} ( y^{(j)}(x_{N+1}) − μ_n(x_{N+1}) )², i.e., the uncorrected sample variance estimator that exists even for ã = 1, where ã is the number of observations at x_{N+1}. Supposing that the y^{(j)}(x_{N+1})'s are i.i.d. Gaussian, we have ã σ̂²/σ² ∼ χ²_ã. Accounting for the log-transformation, as in Boukouvalas (2010), we get δ̂ = log(σ̂²) − Ψ(ã/2) − log(2) + log(ã) and V_δ̂ = Ψ₁(ã/
2), with Ψ and Ψ₁ the digamma and trigamma functions.

Finally, the quick updates described above are predicated on improving local searches, and are thus not guaranteed to globally optimize the likelihood, which is always a challenge in MLE settings. The risk of becoming trapped in an inferior local mode is greater at earlier stages in the sequential design, i.e., when n is small. In practice, we find it beneficial to periodically restart the optimization with conservative (potentially random) initializations, which is cheap in that (small n) setting. As n increases, and the likelihood becomes more peaked, we find that costly restarts are of limited practical value. Local refinements, as described above, are fast and reliable.

Footnote: To avoid predictive variances close to zero for replicates (i.e., δ̃ = δ_k, such that σ̌²_{(g)}(x_{N+1}) ≈ g, and g should be small), the variance is given by the "downdated" GP instead (i.e., the predicted variance after removing the replicated design), as usually used for leave-one-out estimation and found, e.g., in Bachoc (2013), giving σ̌²_{(g)}(x_{N+1}) = ( ((C^{(g)} + g A_n^{-1})^{-1})_{k,k} )^{-1}.

4.3 Defining the horizon

Although the horizon h in the lookahead criteria of Section 3.3 could be fixed arbitrarily, or chosen based on computational considerations (smaller being faster), here we propose two heuristics to set it adaptively based on design goals. The adaptiveness means that h ≡ h_N is now indexed by the current design size.

The first heuristic involves managing surrogate modeling costs, targeting a fixed ratio ρ = n/N of unique to full design size. The goal is to ensure that each new unique location is "worth its weight" in replicates from a computational perspective. The choice of n/N is arbitrary (other targets will do), but we focus on this particular one because its magnitude is easy to intuit.
The Target heuristic we use to "maintain ρ" as sequential design steps progress is as simple as it is effective:

  h_{N+1} ← { h_N + 1           if n/N > ρ and a new point x̄_{n+1} is chosen;
              max{h_N − 1, −1}  if n/N < ρ and a replicate is chosen;
              h_N               otherwise.                                   (13)

If the current ratio is too high and a new point x̄_{n+1} was recently added, making the ratio even higher, the horizon is increased to encourage future replication. If rather a replicate has been added while the current ratio was too low, then the horizon is decreased, encouraging exploration. Otherwise the evolution is on the right trajectory and the horizon is unchanged. Observe that (13) allows a horizon of −
1, which is described shortly.

To implement the continuous search (5), we deploy a limited multistart scheme over optim searches in R with method="L-BFGS-B" and closed-form gradients (12). In parallel, a discrete search over x̄_1, ..., x̄_n is carried out via (7). The two solutions thus obtained are compared against thresholds, and if the x̃ found via continuous search (or its relative objective value) is within (say) ε of that of the best x̄_{k*}, the replicate is preferred on computational grounds. The horizon h_N = −1 always prefers the new x̃ no matter how close it is to the replicate candidate. Thus, h ≡ −1 increments n by 1 along with N at each iteration; in practice it still occasionally generates replicates, primarily at the corners of the input space, if the corresponding multistart scheme determines that the I_{N+1}-minimizer lies at the boundary of D. On the other hand, h ≡ 0 entertains replication via the discrete search (over x̄_1, ..., x̄_n). Indeed, at every iteration where we have a situation resembling Figure 1, the h = 0 rule will select a replicate and not increment n, whereas h = −1 would explore regardless. Although h_N can reach quite high values (upwards of h_N = 20), the computational cost of search is negligible compared to updating the GP fits. Meanwhile, high horizons represent a "light touch" preference for replication: they do not preclude exploration, rather they somewhat discourage it. Thus, while the ultimate number of unique locations n is dependent on the entire history of the simulations, and hence comes with a sampling distribution, the corresponding search heuristic is much simpler than one that would impose a hard constraint on the final n.

When accuracy is the ultimate goal we prefer a different adaptation of h, making a more explicit link between ρ and the signal-to-noise ratio in the data. In linear regression contexts,
one way to deal with heterogeneity is to allocate replications over unique designs such that the ratios of the empirical variance to the number of replicates are close to each other, i.e., to enforce homogeneity of σ̂²_i/a_i (Kleijnen, 2015). This approach captures the basic idea that more replicates are needed where r(x) is high, but applicability to our setup is not direct because such a scheme does not factor in the correlations estimated by GPs. Ankenman et al. (2010) address this within SK by considering the allocation of a remaining budget of evaluations over existing designs, i.e., determining where to augment with additional replicates. In particular, they show that the optimal allocation of the N simulations across n unique designs is summarized by A*_n, a diagonal matrix with components

  a*_i ≈ N sqrt(r(x̄_i) K_i) / Σ_{j=1}^n sqrt(r(x̄_j) K_j),  where  K_i = (K_n^{-1} W_n K_n^{-1})_{i,i}.   (14)

We emphasize that (14) only addresses the replication aspect: the designs x̄_1, ..., x̄_n must be entered a priori by the user. Thus, this recipe is not directly implementable in a sequential design setting. One solution could be to generate (e.g., by space-filling) a candidate design of pre-determined size n with r replicates per design and then, after learning K_n and the r(x̄_i)'s, apply (14). However, in that case one may end up with a*_i < r, as is the case in Figure 3. This illustrative 1d example highlights that in areas with low noise, a lower number of replicates would have been better, while in noisier areas, more points are necessary. The right panel shows the a*_i at this stage (referred to as batch) compared to the greedy sequential allocation of 105 replicates.
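A sketch of allocation rule (14) and the resulting replicate deficits used by the Adapt scheme (15); the r_i and K_i values below are made-up stand-ins for the current estimates:

```python
import numpy as np

# Sketch of allocation rule (14); r and Kdiag below are made-up stand-ins for
# the current estimates of r(x_i) and (K_n^{-1} W_n K_n^{-1})_{ii}.
def optimal_allocation(N, r, Kdiag):
    w = np.sqrt(r * Kdiag)          # a*_i proportional to sqrt(r_i K_i)
    return N * w / w.sum()          # rescale so allocations sum to N

r = np.array([0.1, 0.4, 1.6, 0.4])       # noisier sites get more replicates
Kdiag = np.ones(4)
a_star = optimal_allocation(200, r, Kdiag)

# Adapt scheme (15): replicate "deficits" a'_i = max(0, a*_i - a_i)
a = np.array([60.0, 50.0, 50.0, 40.0])   # replicates collected so far
deficit = np.maximum(0, np.round(a_star) - a)
```

In the Adapt scheme, h_{N+1} would then be drawn uniformly from the entries of deficit.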
The latter is more realistic because it acknowledges that design decisions cannot be undone.

Instead of such a two-stage design, we utilize (14) in a sequential fashion, by making a comparison between the allocation a*_i via (14) (employing the current estimates of the noise r(x̄) at that particular stage) and the actual a_i's collected so far from the sequential design. The existing number of replicates a_i is then either too high, in which case no more replicates should be added, or too low, in which case the location could benefit from more replication. We use this information in the Adapt scheme to adjust the horizon by sampling

  h_{N+1} ∼ Unif{a′_1, ..., a′_n}  with  a′_i := max(0, a*_i − a_i).   (15)

Hence, if there are locations that require many more replicates according to (14), h_{N+1} could be large to encourage replication.

Here we illustrate our methods and simpler variants on a suite of examples spanning synthetic and real data from computer simulation experiments. Our main metric is out-of-sample

Footnote: The batch scheme recommends fewer than five replicates after five replicates were already used.

[Figure 3 panels: left, the fitted surface with predictive mean, confidence intervals, predictive intervals, and the true function; right, allocations a_i under the batch and sequential schemes, with estimated and true variances.]
Figure 3: Left: toy example with 5 replicates at each of the 21 uniformly spaced unique design points. Right: proposed allocation of 105 new replicates (total 210 observations) based on (14), and a greedy sequential approach.

root mean-square (prediction) error (RMSE) over the sequential design iterations, and in particular after the final iteration. Since accurate estimation of variances over the input space is also an important consideration (especially in the heteroskedastic context), even though our IMSPE criterion does not explicitly target learning variances, we consider RMSE to the true log variance when it is known; when it is not, we use a proper scoring rule (Gneiting and Raftery, 2007, Eq. (27)) combining mean and variance forecasts out-of-sample. Our main comparators are non-sequential (space-filling) designs, homoskedastic GP predictors, and combinations thereof.
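For concreteness, a minimal version of such a Gaussian proper score (higher is better) in the spirit of Gneiting and Raftery (2007, Eq. (27)); the arrays below are arbitrary toy data:

```python
import numpy as np

# Sketch of a Gaussian predictive score: higher is better, rewarding accurate
# means AND well-calibrated variances. The values below are arbitrary toy data.
def score(y, mu, sigma2):
    return np.mean(-(y - mu) ** 2 / sigma2 - np.log(sigma2))

y = np.array([0.0, 1.0, -1.0, 2.0])   # held-out responses
mu = np.zeros(4)                      # predictive means
# A calibrated variance (residual mean square is 1.5) beats an overconfident one:
print(score(y, mu, np.full(4, 1.5)) > score(y, mu, np.full(4, 0.1)))  # True
```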
We start by reusing the 1d toy example from above [surrounding Figures 2 and 3] to show qualitatively the effect of the horizon choice on the resulting designs. The underlying function is f(x) = (6x − 2)² sin(12x − 4), with noise variance r(x) = (1.1 + sin(2πx))². The experiment starts with an initial maximin LHS of 10 points, no replicates, and the GPs use a Gaussian kernel.

Results are presented in Figure 4 for a total budget of N_max = 500. Each panel in the figure corresponds to a different lookahead horizon h, with the final two involving the Adaptive and Target schemes. There are several noteworthy observations. Notice that as the horizon is increased, more replicates are added; see ρ = n/N reported in the main title of each panel. The design density is greatest in the high-variance parts of the space, and that density is increasingly replaced by replication when the horizon is increased. The effect is most drastic from h = −1 to h = 0, with the ratio of unique designs to total designs dropping by more than half without impact on performance. Notice that replicates are added even with h = −
1, at the extremities of the space. Results with high horizons and the Adapt and Target schemes end up having both fewer unique designs and higher accuracy. The very best RMSE results are provided by h = 4 and the Adapt (15) scheme; with h = 4 just 60 unique locations are used (ρ = 0.12).

[Figure 4 panels, with per-panel ratio r = n/N and RMSE: h = −1 (r = 0.982, RMSE = 0.081); h = 0 (r = 0.406, RMSE = 0.082); h = 1 (r = 0.36, RMSE = 0.016); h = 2 (r = 0.222, RMSE = 0.017); h = 3 (r = 0.2, RMSE = 0.018); h = 4 (r = 0.12, RMSE = 0.015); Adaptive (r = 0.354, RMSE = 0.012); Target (r = 0.202, RMSE = 0.025).]

Figure 4: Results on the 1d example by varying the horizon. Grey vertical segments indicate the number of replicates at a given location.
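This toy problem can be simulated in a few lines; f and r are as given above, while the seed and number of replicates below are arbitrary:

```python
import math
import random

# Simulator for the 1d toy problem: mean f and noise variance r as in the text;
# the seed and replication counts are arbitrary choices for illustration.
def f(x):
    return (6 * x - 2) ** 2 * math.sin(12 * x - 4)

def r(x):  # heteroskedastic noise variance, peaking where sin(2*pi*x) is large
    return (1.1 + math.sin(2 * math.pi * x)) ** 2

def simulate(x, rng):
    return f(x) + rng.gauss(0.0, math.sqrt(r(x)))

rng = random.Random(42)
y = [simulate(0.25, rng) for _ in range(5)]  # 5 replicates at x = 0.25
```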
Here we expand the previous 1d illustrative example by exploring variation over data-generating mechanisms via Monte Carlo (MC), with input space x ∈ [0, 1]. Λ_n is sampled as log Λ_n ∼ GP(0, ν^{(g)} C^{(g)}), where C^{(g)} is stationary with Matérn 5/2 kernel k^{(g)}. Then observations are drawn via Y | Λ_n ∼ GP(0, K_n), where K_n = ν(C_n + Λ_n) and C_n is again Matérn 5/
2. We set θ = 0 .
1, and ν = 1 for the mean GP, and ν^{(g)} = 7 for the noise GP (with its own lengthscale θ^{(g)}). To manage the MC variance between runs we normalized the Λ_n-values thus obtained so that the average signal-to-noise ratio was one.

We considered a budget of N = 200 and studied various strategies for design (comparing one-shot space-filling designs without or with replication to sequential designs with a lookahead horizon of h = 0) and for modeling (testing both homoskedastic and heteroskedastic GPs). These are enumerated as follows: (i) homoskedastic GP without replication using an n = 200 grid design; (ii) hetGP without replication, again with an n = 200 grid; (iii) hetGP with a one-shot space-filling design with random replication on an n = 40 grid with random a_i; (iv) sequential learning and design using a homoskedastic GP initialized with a single-replicate n = 40 grid, iterating until N = 200; and (v) sequential learning and design using hetGP initialized with a single-replicate n = 40 grid, iterating until N = 200.
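A sketch of this data-generating mechanism with a Matérn 5/2 kernel; ν, θ, and ν^{(g)} are as stated, while the noise-GP lengthscale θ^{(g)} = 0.3 is an arbitrary stand-in for the elided value:

```python
import numpy as np

# Sketch of the MC data-generating mechanism: latent log-variances from a
# noise GP, then observations from the mean GP. theta_g = 0.3 is an arbitrary
# stand-in; nu = 1, theta = 0.1, and nu_g = 7 are as stated in the text.
def matern52(x1, x2, theta):
    d = np.abs(x1[:, None] - x2[None, :]) / theta
    return (1 + np.sqrt(5) * d + 5 * d ** 2 / 3) * np.exp(-np.sqrt(5) * d)

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 40)

nu_g, theta_g = 7.0, 0.3
Cg = matern52(x, x, theta_g) + 1e-8 * np.eye(40)            # jitter for sampling
log_lam = rng.multivariate_normal(np.zeros(40), nu_g * Cg)  # log Lambda_n

nu, theta = 1.0, 0.1
Kn = nu * (matern52(x, x, theta) + np.diag(np.exp(log_lam)))  # K_n = nu(C_n + Lambda_n)
y = rng.multivariate_normal(np.zeros(40), Kn)
```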
Figure 5: Results from the one-dimensional synthetic Monte Carlo experiment in terms of RMSE to the true mean (left), proper scores (middle), and RMSE to the true log noise (right).

Figure 5 summarizes the results from 1000 MC replicates, illustrating the subtle balance between replication and exploration. As can be seen in the left panel, our proposed sequential hetGP performs the best in terms of out-of-sample RMSE. To investigate the statistical significance of RMSE differences we conducted one-sided matched Wilcoxon signed-rank tests of adjacent performers (better vs. next-best), with the order based on median RMSE. The corresponding p-values are 4.… × 10^{−…}, 9.… × 10^{−…}, 0.…, and …. For example, the test involving our best method, hetGP with sequential design, versus the second best, hetGP with random replication, suggests that the former significantly outperforms the latter. Since our IMSPE design criterion emphasized mean-squared prediction error, it is refreshing that our proposed method wins (significantly) on that metric. The only such comparison which did not "reject the null at the 5% level" involved pitting sequential versus uniform design with a homoskedastic GP. The value of proceeding sequentially is much diminished without the capability to learn a differential noise level.

The center and right panels of the figure show that other design variations may be preferred for other performance metrics. Observe that the one-shot space-filling design with random replication using hetGP wins when using proper scores. Apparently, random replication yields better estimates of predictive variance when comparing to the truth; see the right panel. Space-filling and uniform replication are easily achieved in this one-dimensional case, but may not port well to higher dimension, as our later, more realistic examples show. Our sequential hetGP, coming in second here on score and log-noise RMSE, offers more robustness as the input dimension increases.
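The one-sided matched Wilcoxon comparison above can be reproduced with, e.g., scipy.stats.wilcoxon; the RMSE vectors below are made-up stand-ins for the per-repetition results:

```python
import numpy as np
from scipy.stats import wilcoxon

# One-sided matched comparison of per-repetition RMSEs; the two vectors are
# made-up stand-ins for the 1000 MC results described in the text.
rmse_a = np.full(12, 0.10)                       # "better" method
rmse_b = rmse_a + np.linspace(0.01, 0.05, 12)    # next-best, uniformly worse here
stat, p = wilcoxon(rmse_a, rmse_b, alternative="less")
print(p < 0.05)  # True: A's RMSE is significantly lower in this toy example
```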
Our first real example deals with estimating the future number of infecteds in a stochastic Susceptible-Infected-Recovered (SIR) epidemic model. This is a standard model for cost-benefit analysis of public health interventions related to communicable diseases, such as influenza or dengue. For our purposes we treat it as a 2d input space indexed by the count I ∈ N of initial infecteds and S ∈ N of initial susceptibles (the total population size M ≥ I + S is pre-fixed; the rest of the population is viewed as immune to the disease). The pair (I_t, S_t) ∈ {S + I ≤ M} evolves as a continuous-time Markov chain (easily simulated) following certain non-linear (hence analytically intractable) transition rates, until eventually I_t = 0 and the epidemic dies out. The response f(S, I) is the expected aggregate number of infected-days, ∫₀^∞ I_t dt, averaged across the Markov chain trajectories; determining f(S, I) is a first step towards constructing adaptive epidemic response policies. It is important to note that the signal-to-noise ratio varies drastically over D, with zero variance at I = 0 (where Y ≡
0) and up to r(x) ≈ … on the left part of the domain, in the critical region where the stochasticity in infections leads to either a quick burn-out of infecteds or a rapid infection of a significant population fraction.

Whereas Binois et al. (2016) considered static space-filling designs with random numbers of replicates, with a favorable comparison to SK, here we focus on aspects of sequential design, in particular the effect of the horizon h in the IMSPE with lookahead over replication. We perform a Monte Carlo experiment wherein designs are initialized with n = N = 10 unique design locations (just one observation each) and grown to size N = 500 over sequential design iterations, disregarding how many unique locations n are chosen along the way. A Matérn kernel with smoothness 5/2 is used.

Figure 6: RMSE and score results on the SIR test case over sequential design iterations, via summaries of 30 MC repetitions. The Target scheme aims for ρ = n/N = 0.2.

The best results in Figure 6 are obtained with larger horizons (h = 4 and the Target scheme). Since the signal-to-noise ratio is low in some parts of the input space, replication is beneficial in terms of RMSE and score. One reason for the RMSEs not being very different between the alternatives is that the underlying function is very smooth. However, the variance surface is more challenging, such that having more replicates is helpful in this case, as highlighted by the differences in score, shown in the final panel of Figure 6.

Figure 7: Ratio n/N and horizon evolution on the SIR test case over sequential design iterations, via summaries over 30 MC repetitions. The thin dotted line indicates the Target ratio of 0.2. Right: thin dotted (resp.
thick) lines represent one iteration (resp. the average) of the horizons for the Adapt and Target schemes. Colors are the same as in Figure 6.

Table 1: Average percentage of design points with no replicates, with more than five, and running time on the SIR test problem.

  Horizon                     -1      0      1      2      3      4   Adapt  Target
  Percentage of 1s          99.2   49.6   13.6    6.9    4.8    3.5     8.8     4.1
  Percentage of 5s and more 0.04    1.6    5.7    6.9    7.7    7.9     6.9     7.6
  Time (s)                   812    473    278    257    259    271     306     288

Moving on to Figure 7, the left and center panels show the ratio of unique locations to the total design size, n/N. As expected, as the horizon h increases, more replicates are selected. In turn, this lowers the computation time, as reported in Table 1. In particular, observe that the computational cost of looking ahead is negligible next to the cost saved by having smaller n relative to N. The final panel in Figure 7 shows how the horizon h evolves when fixing a Target ratio of ρ = 0.2. Maintaining this ratio sometimes requires horizons h ≥
15, yet the computational cost is never higher than the high-fixed-horizon results, which offer the best performance for this problem. Due to its random nature, the Adapt scheme changes abruptly between algorithm runs, but its horizon h_N is increasing on average in N.

Figure 8 provides a visual indication of the density of design throughout the input space for fixed and tuned horizons. As expected, in all panels the density of inputs in the design is higher in high-variance parts of the input space. The numbers in the plot indicate the numbers of replicates a_i. Observe that low-horizon heuristics result in mostly a_i = 1, whereas for the longer horizons clusters of tightly grouped unique locations are replaced with replicates. Table 1 demonstrates that this feature is consistent over MC repetitions.
[Figure 8 panels: (a) h = −1, (b) h = 1, (c) Adapt, (d) Target.]
Figure 8: Designs obtained with different strategies for the horizon, where numbers indicate how many replicates a_i are performed at a given location x̄_i. Darker colors indicate higher variance. The x-axis is the number of susceptibles, from 1200 to 2000, while the y-axis is the initial number of infecteds, from 0 to 200.

Thus, our heuristic is adept at capturing the basic logic of amalgamating singleton design locations into replicates, which apparently maintains essentially the same statistical efficiency while reducing computational overhead by a factor of more than 3.

The assemble-to-order (ATO) simulation, first introduced by Hong and Nelson (2006) with implementation in
MATLAB later provided by Xie et al. (2012), comes from inventory management. The inputs determine stocks and replenishment schedules for key items in assembled products, and the simulator estimates revenue by combining inventory costs with profits obtained from orders, which come in following a compound Poisson random process. Binois et al. (2016) showed the benefit of heteroskedastic modeling, versus several homoskedastic alternatives, on random space-filling designs with n = 1000 unique locations and a random number of replications (uniform in 1, . . . ,
10) so that the average full data size was N = 5000. Here, one of our aims is to illustrate that by building a better design (sequentially), a much lower N is possible without sacrificing accuracy. Binois et al. used a proper scoring rule (Gneiting and Raftery, 2007, Eq. (27)) as their main metric. Since our IMSPE criterion targets accuracy via squared-error loss we report RMSEs, but include scores to facilitate comparison to those space-filling designs. The best average score reported in Figure 2 of that paper was 3.3, with a min and max of 2.8 and 3.6 respectively.

Similar to the SIR experiment, we perform the following variations on sequential IMSPE design, varying the horizon h of lookahead and offering the two adaptive horizon schemes outlined in Section 3.3. We initialize with n = 100 unique space-filling locations and a random number of replicates, uniform in {1, . . . , 9}, so that the starting size is N = 500 on average. Subsequently, sequential design iterations are performed until N = 2000 total samples are gathered, regardless of how many unique locations n result. The experiment is repeated in a Monte Carlo fashion, with thirty repeats.

Figure 9 summarizes the results of the experiment in a format similar to Figure 6. The take-home message is fairly evident: in contrast to the SIR example, a shorter lookahead
Figure 9: RMSE and score results on the ATO problem, via thirty Monte Carlo iterations, in a format similar to Figure 6.

horizon is better, owing to the relatively higher signal-to-noise ratio. Observe that our average score is 3.5, and the min and max are 2.4 and 3.7 respectively. So scores based on just N = 2000 samples are higher than in the space-filling N = 5000 experiment, although the spread is a little wider. Finally, note that the Adapt heuristic (15) and the Target scheme (ρ = 0.2) eventually perform as well as the best fixed horizon (h = −1). Adapt finished at N = 2000 with an average of n = 1086 unique sites (min and max of 465 and 1211, respectively), whereas Target took only 183 minutes thanks to using n = 400 locations on average (399 to 405).

This paper addresses a design question which has been around the surrogate modeling literature, and informally in the community, for many years. There is general agreement that the “fully batched” version of the problem, of finding n locations and numbers of replicates a_1, . . . , a_n on each, is not computationally tractable, although there are some attempts in the recent literature. We therefore consider the simpler task of deciding whether the next sample should explore or replicate in a sequential design context. The condition we derive is simple to express, and leads to an intuitive suite of visualizations. Proceeding sequentially has merits, not only computationally but also in facilitating “as you go” adjustments to help avoid pathologies arising from feedbacks between design and inference. However the procedure is still myopic. To help correct this we introduced a computationally tractable lookahead scheme that emphasizes the role of replication in design.
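The explore-versus-replicate decision can be illustrated with a brute-force one-step IMSPE search. The Python sketch below is purely schematic and is not the closed-form condition derived in the paper (whose implementation is in the R package hetGP): it assumes a one-dimensional input space, a known Gaussian kernel with lengthscale theta, a constant noise variance r, and a grid approximation of the IMSPE integral; all function names and parameter values are hypothetical.

```python
import numpy as np

def sq_exp(a, b, theta):
    # Gaussian (squared exponential) correlation between 1-d input vectors
    return np.exp(-np.subtract.outer(a, b) ** 2 / theta)

def imspe(xu, reps, r, theta, ngrid=201):
    # Average predictive variance over a grid: with unique locations xu and
    # replicate counts reps, the covariance of the averaged outputs is
    # K_n = C + diag(r / a_i), the usual unique-location reduction.
    grid = np.linspace(0.0, 1.0, ngrid)
    Kn = sq_exp(xu, xu, theta) + np.diag(r / reps)
    kx = sq_exp(xu, grid, theta)                      # n x ngrid cross-correlations
    s2 = 1.0 - np.einsum('ij,ij->j', kx, np.linalg.solve(Kn, kx))
    return s2.mean()

def next_point(xu, reps, r=0.5, theta=0.05, ncand=101):
    # Greedy one-step decision: try replicating each existing site, then try
    # each fresh candidate site; return the IMSPE minimizer.
    best, arg, is_rep = np.inf, None, None
    for i in range(len(xu)):                          # replicate option
        r2 = reps.copy(); r2[i] += 1
        val = imspe(xu, r2, r, theta)
        if val < best:
            best, arg, is_rep = val, float(xu[i]), True
    for xc in np.linspace(0.01, 0.99, ncand):         # explore option
        val = imspe(np.append(xu, xc), np.append(reps, 1), r, theta)
        if val < best:
            best, arg, is_rep = val, float(xc), False
    return arg, is_rep, best
```

Under heteroskedastic noise one would substitute a location-dependent r(x_i)/a_i into the diagonal, which is what pushes replication toward high-noise regions.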
Tuning the horizon of that scheme allows the user to trade off the dual roles of replication in surrogate modeling design: computational thriftiness of inference against out-of-sample accuracy, although as we show these are not always at odds.

Our presentation focused on the integrated mean-squared prediction error (IMSPE) criterion. We chose IMSPE because it is popular, but also because it leads to closed-form derivatives for optimization, updating equations for search over lookahead horizons, and simplifications for entertaining replicates. There is, of course, a vast literature on model-based design criteria (see, e.g., Chen and Zhou, 2017; Kleijnen, 2015) targeting alternative quantities of interest, such as entropy or information for unknown model parameters. Although designs for prediction and estimation sometimes coincide, as for linear regression, the correlation structure for GPs can be a game-changer (Müller et al., 2012). It may well be that other criteria lead to strategies similar to ours, which may be an interesting avenue for future research.

Our implementation and empirical work leveraged a new heteroskedastic Gaussian process modeling library called hetGP, available for R on CRAN (Binois and Gramacy, 2017). Our IMSPE, updates, lookahead procedures, and more are provided in a recently updated version of the package. To aid in reproducibility, our supplementary material contains codes using that library to reproduce the smaller examples from the paper [Figures 2–4]. The other examples require rather more computing, and/or linking between R and MATLAB for simulation [ATO], which somewhat challenges ease of replication.
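To aid such re-implementation, the two metrics used throughout admit one-line definitions. The sketch below (in Python, for illustration only; the paper's own implementation is in the hetGP package) assumes the proper score of Gneiting and Raftery (2007, Eq. (27)) takes the Gaussian predictive form −(y − μ)²/σ² − log σ², so that higher scores are better, while RMSE is lower-better.

```python
import numpy as np

def rmse(y, mu):
    # root mean-squared error of the predictive mean (lower is better)
    return float(np.sqrt(np.mean((y - mu) ** 2)))

def score(y, mu, s2):
    # Gaussian proper score -(y - mu)^2 / s2 - log(s2), averaged over the
    # hold-out set (higher is better); s2 is the predictive variance
    return float(np.mean(-(y - mu) ** 2 / s2 - np.log(s2)))
```

A predictor is rewarded both for accurate means (first term) and honest variances (log penalty): over-inflating s2 to cover errors is itself penalized.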
However, we are happy to provide those codes upon request.

Processes (i.e., data generating mechanisms) benefiting from a heteroskedastic feature bring out the best in our sequential design schemes, demanding a greater degree of replication in high-noise regions relative to low-noise ones, confirming the intuition that replication becomes more valuable for separating signal from noise as the data get noisier (e.g., Wang and Haaland, 2017). However, the results we provide are just as valid in the homoskedastic setting, albeit with somewhat less flair. In that context, inferring the right level of replication is a global affair, except perhaps at the edges of the input space, which tend to prefer a slightly higher degree.

Our three sets of examples illustrated both that the method does what it is designed to do, and that designs with the right trade-off between exploration and replication perform better than ones which are designed more naïvely. These examples span a range of features, from low to high noise (and slow to rapid change in noise), low to moderate input dimension, and synthetic to real simulation experiments. The behavior is diverse but the results are consistent: sequential design with lookahead-based IMSPE leads to accurate prediction, and the slight bias toward replication yields computationally more thrifty predictors without a compromise on accuracy. However, sequential design might not always be appropriate. Sometimes batching, at least to a small degree, cannot be avoided. Addressing this situation represents an exciting avenue for further research.
Acknowledgments
We thank the anonymous reviewers for helpful comments on an earlier version of the paper. All four authors are grateful for support from National Science Foundation grants DMS-1521702 and DMS-1521743.

References
Anagnostopoulos, C. and Gramacy, R. (2013). “Information-Theoretic Data Discarding for Dynamic Trees on Data Streams.”
Entropy, 15, 12, 5510–5535. ArXiv:1201.5568.

Ankenman, B., Nelson, B. L., and Staum, J. (2010). “Stochastic kriging for simulation metamodeling.”
Operations Research, 58, 2, 371–382.

Antognini, A. B. and Zagoraiou, M. (2010). “Exact optimal designs for computer experiments via Kriging metamodelling.”
Journal of Statistical Planning and Inference , 140, 9, 2607–2617.Bachoc, F. (2013). “Cross Validation and Maximum Likelihood estimations of hyper-parameters of Gaussian processes with model misspecification.”
Computational Statistics& Data Analysis , 66, 55–69.Barnett, S. (1979).
Matrix Methods for Engineers and Scientists. McGraw-Hill.

Binois, M. and Gramacy, R. B. (2017). hetGP: Heteroskedastic Gaussian Process Modeling and Design under Replication. R package version 1.0.0.

Binois, M., Gramacy, R. B., and Ludkovski, M. (2016). “Practical heteroskedastic Gaussian process modeling for large simulation experiments.” arXiv preprint arXiv:1611.05902.

Boukouvalas, A. (2010). “Emulation of random output simulators.” Ph.D. thesis, Aston University.

Boukouvalas, A., Cornford, D., and Stehlík, M. (2014). “Optimal design for correlated processes with input-dependent noise.”
Computational Statistics & Data Analysis , 71,1088–1102.Burnaev, E. and Panov, M. (2015). “Adaptive design of experiments based on Gaussianprocesses.” In
Statistical Learning and Data Sciences , 116–125. Springer.Chen, X. and Zhou, Q. (2014). “Sequential experimental designs for stochastic kriging.” In
Proceedings of the 2014 Winter Simulation Conference , 3821–3832. IEEE Press.— (2017). “Sequential design strategies for mean response surface metamodeling via stochas-tic kriging with adaptive exploration and exploitation.”
European Journal of OperationalResearch , 262, 2, 575–585.Chevalier, C., Ginsbourger, D., and Emery, X. (2014). “Corrected kriging update formu-lae for batch-sequential data assimilation.” In
Mathematics of Planet Earth , 119–122.Springer.Cioffi-Revilla, C. (2014).
Introduction to computational social science. Berlin/New York: Springer.

Das, A. and Kempe, D. (2008). “Algorithms for subset selection in linear regression.” In
Proceedings of the fortieth annual ACM symposium on Theory of computing , 45–54. ACM.Forrester, A., Sobester, A., and Keane, A. (2008).
Engineering design via surrogate mod-elling: a practical guide . John Wiley & Sons.Gauthier, B. and Pronzato, L. (2014). “Spectral approximation of the IMSE criterion foroptimal designs in kernel-based interpolation models.”
SIAM/ASA Journal on UncertaintyQuantification , 2, 1, 805–825.Ginsbourger, D. and Le Riche, R. (2010). “Towards Gaussian process-based optimizationwith finite time horizon.” In mODa 9–Advances in Model-Oriented Design and Analysis ,89–96. Springer.Gneiting, T. and Raftery, A. E. (2007). “Strictly proper scoring rules, prediction, andestimation.”
Journal of the American Statistical Association , 102, 477, 359–378.Goldberg, P. W., Williams, C. K., and Bishop, C. M. (1998). “Regression with input-dependent noise: A Gaussian process treatment.” In
Advances in Neural Information Processing Systems, vol. 10, 493–499. Cambridge, MA: MIT Press.

Gonzalez, J., Osborne, M., and Lawrence, N. (2016). “GLASSES: Relieving The Myopia Of Bayesian Optimisation.” In
Proceedings of the 19th International Conference on ArtificialIntelligence and Statistics , 790–799.Gorodetsky, A. and Marzouk, Y. (2016). “Mercer kernels and integrated variance experi-mental design: connections between Gaussian process regression and polynomial approxi-mation.”
SIAM/ASA Journal on Uncertainty Quantification , 4, 1, 796–828.Gramacy, R. and Polson, N. (2011). “Particle Learning of Gaussian Process Models forSequential Design and Optimization.”
Journal of Computational and Graphical Statistics ,20, 1, 102–118.Gramacy, R. B. and Lee, H. K. H. (2009). “Adaptive Design and Analysis of SupercomputerExperiment.”
Technometrics , 51, 2, 130–145.Hong, L. and Nelson, B. (2006). “Discrete optimization via simulation using COMPASS.”
Operations Research , 54, 1, 115–129.Horn, D., Dagge, M., Sun, X., and Bischl, B. (2017). “First Investigations on Noisy Model-Based Multi-objective Optimization.” In
International Conference on Evolutionary Multi-Criterion Optimization, 298–313. Springer.

Huan, X. and Marzouk, Y. M. (2016). “Sequential Bayesian optimal experimental design via approximate dynamic programming.” arXiv preprint arXiv:1604.08320.

Jalali, H., Nieuwenhuyse, I. V., and Picheny, V. (2017). “Comparison of Kriging-based algorithms for simulation optimization with heterogeneous noise.”
European Journal ofOperational Research , 261, 1, 279 – 301.Johnson, L. (2008). “Microcolony and Biofilm Formation as a Survival Strategy for Bacteria.”
Journal of Theoretical Biology, 251, 24–34.

Kamiński, B. (2015). “A method for the updating of stochastic Kriging metamodels.”
European Journal of Operational Research, 247, 3, 859–866.

Kersting, K., Plagemann, C., Pfaff, P., and Burgard, W. (2007). “Most likely heteroscedastic Gaussian process regression.” In
Proceedings of the International Conference on MachineLearning , 393–400. New York, NY: ACM.Kleijnen, J. P. (2015).
Design and Analysis of Simulation Experiments , vol. 230. Springer.Krause, A. and Guestrin, C. (2007). “Nonmyopic active learning of gaussian processes: anexploration-exploitation approach.” In
Proceedings of the 24th international conference onMachine learning , 449–456. ACM.Krause, A., Singh, A., and Guestrin, C. (2008). “Near-optimal sensor placements in Gaus-sian processes: Theory, efficient algorithms and empirical studies.”
Journal of MachineLearning Research , 9, Feb, 235–284.Lam, R., Willcox, K., and Wolpert, D. H. (2016). “Bayesian Optimization with a FiniteBudget: An Approximate Dynamic Programming Approach.” In
Advances In NeuralInformation Processing Systems , 883–891.Law, A. M. (2015).
Simulation Modeling and Analysis . 5th ed. McGraw-Hill.Leatherman, E. R., Santner, T. J., and Dean, A. M. (2017). “Computer experiment designsfor accurate prediction.”
Statistics and Computing , 1–13.Liu, M. and Staum, J. (2010). “Stochastic kriging for efficient nested simulation of expectedshortfall.”
The Journal of Risk , 12, 3, 3.Mehdad, E. and Kleijnen, J. P. (2018). “Stochastic intrinsic Kriging for simulation meta-modeling.”
Applied Stochastic Models in Business and Industry, in press.

Müller, W. G., Pronzato, L., and Waldl, H. (2012). “Relations between designs for prediction and estimation in random fields: an illustrative case.” In
Advances and Challenges inSpace-time Modelling of Natural Events , 125–139. Springer.Petersen, K. B., Pedersen, M. S., et al. (2008). “The matrix cookbook.”
Technical University of Denmark, 7, 15.

Picheny, V. and Ginsbourger, D. (2013). “A nonstationary space-time Gaussian process model for partially converged simulations.”
SIAM/ASA Journal on Uncertainty Quantifi-cation , 1, 57–78.Plumlee, M. and Tuo, R. (2014). “Building accurate emulators for stochastic simulations viaquantile Kriging.”
Technometrics , 56, 4, 466–473.Pratola, M. T., Harari, O., Bingham, D., and Flowers, G. E. (2017). “Design and Analysisof Experiments on Nonconvex Regions.”
Technometrics , 1–12.Pronzato, L. and M¨uller, W. G. (2012). “Design of computer experiments: space filling andbeyond.”
Statistics and Computing , 22, 3, 681–701.Quan, N., Yin, J., Ng, S. H., and Lee, L. H. (2013). “Simulation optimization via kriging: asequential search using expected improvement with computing budget constraints.”
IIETransactions , 45, 7, 763–780.R Core Team (2017).
R: A Language and Environment for Statistical Computing . R Foun-dation for Statistical Computing, Vienna, Austria.Rasmussen, C. E. and Williams, C. (2006).
Gaussian Processes for Machine Learning. MIT Press.

Sacks, J., Welch, W. J., Mitchell, T. J., and Wynn, H. P. (1989). “Design and analysis of computer experiments.”
Statistical science , 4, 4, 409–423.Santner, T. J., Williams, B. J., and Notz, W. I. (2013).
The design and analysis of computerexperiments . Springer Science & Business Media.Seo, S., Wallat, M., Graepel, T., and Obermayer, K. (2000). “Gaussian Process Regression:Active Data Selection and Test Point Rejection.” In
Proceedings of the International JointConference on Neural Networks , vol. III, 241–246. IEEE.Wang, W. and Haaland, B. (2017). “Controlling Sources of Inaccuracy in Stochastic Kriging.” arXiv preprint arXiv:1706.00886 .Weaver, B. P., Williams, B. J., Anderson-Cook, C. M., Higdon, D. M., et al. (2016). “Com-putational enhancements to Bayesian design of experiments using Gaussian processes.”
Bayesian Analysis, 11, 1, 191–213.

Xie, J., Frazier, P., and Chick, S. (2012). “Assemble to Order Simulator.”
A Detailed gradient expressions
Here we provide expressions for the Section 3.2 discussion on the gradient of the IMSPE.

∂K_{n+1}^{-1}/∂x̃^{(p)} = ∂/∂x̃^{(p)} [ K_n^{-1} + g(x̃)g(x̃)^⊤ σ_n²(x̃) , g(x̃) ; g(x̃)^⊤ , σ_n²(x̃)^{-1} ] = [ H(x̃) , h(x̃) ; h(x̃)^⊤ , v(x̃) ]

as in Eq. (10), where

v(x̃) := ∂σ_n²(x̃)^{-1}/∂x̃^{(p)} = (2 d(x̃)^⊤ K_n^{-1} k(x̃) − ∂r(x̃)/∂x̃^{(p)}) / (k(x̃, x̃) − k(x̃)^⊤ K_n^{-1} k(x̃) + r(x̃))², with d(x̃) := ∂k(x̃)/∂x̃^{(p)},

h(x̃) := ∂g(x̃)/∂x̃^{(p)} = −K_n^{-1} ( v(x̃) k(x̃) + σ_n²(x̃)^{-1} d(x̃) ),

H(x̃) := ∂[g(x̃)g(x̃)^⊤ σ_n²(x̃)]/∂x̃^{(p)} = v̄(x̃) g(x̃)g(x̃)^⊤ + σ_n²(x̃) ( h(x̃)g(x̃)^⊤ + (h(x̃)g(x̃)^⊤)^⊤ ),

and v̄(x̃) := ∂σ_n²(x̃)/∂x̃^{(p)} = ∂r(x̃)/∂x̃^{(p)} − 2 d(x̃)^⊤ K_n^{-1} k(x̃). Similarly, since W_n does not depend on x̃:

∂W_{n+1}/∂x̃^{(p)} = [ 0_{n×n} , c(x̃) ; c(x̃)^⊤ , c̄(x̃) ],

as presented in Eq. (11). Expressions for c(·) and c̄(·) for particular kernels may be found in Appendix B.

B Expressions for common kernels
We consider here four kernels common in practice: the Gaussian (or squared exponential) and the Matérn with parameter α = 5/2, 3/2 or 1/2. For each we provide closed-form expressions for E, w, d, c and c̄ as introduced in Section 3. Notice that all these kernels are stationary, i.e., k(x, x′) = ν c(x − x′) with ν the process variance and c the correlation function. As a consequence, E = ∫_{x∈D} k(x, x) dx = ∫_{x∈D} ν c(0) dx = ν. In their separable form, over D = [0, 1]^d, these kernels write k(x, x′) = ν ∏_{p=1}^d k_1(x_p, x′_p) with k_1 one of the aforementioned univariate kernels. By using the separability we get:

ν² w(x_i, x_j) = ∫_{x∈D} k(x_i, x) k(x_j, x) dx = ν² ∏_{p=1}^d ∫_{x∈[0,1]} k_1(x_{i,p}, x) k_1(x_{j,p}, x) dx = ν² ∏_{p=1}^d w_p(x_{i,p}, x_{j,p}).

Below we provide our parameterization of these kernels in univariate form along with the corresponding expressions for w, d, c and c̄.

B.1 Gaussian kernel
The univariate Gaussian kernel is k_G(x, x′) = exp(−(x − x′)²/θ). Therefore:

d_i = ∂k_G(x_i, x)/∂x = (2(x_i − x)/θ) exp(−(x_i − x)²/θ),

w(x_i, x_j) = (1/2)√(πθ/2) exp(−(x_i − x_j)²/(2θ)) [ erf((2 − (x_i + x_j))/√(2θ)) + erf((x_i + x_j)/√(2θ)) ], 1 ≤ i, j ≤ n,

with erf the error function. In addition:

c̄ = ∂w(x_i, x_i)/∂x_i = exp(−2x_i²/θ) − exp(−2(1 − x_i)²/θ)

and, for the vector c, 1 ≤ i ≤ n:

∂w(x, x_i)/∂x = (1/2)√(π/(2θ)) exp(−(x − x_i)²/(2θ)) [ (x − x_i) ( erf((x + x_i − 2)/√(2θ)) − erf((x + x_i)/√(2θ)) ) + √(2θ/π) ( exp(−(x + x_i)²/(2θ)) − exp(−(x + x_i − 2)²/(2θ)) ) ].

Remark: this is the kernel used in Figure 1, with hyperparameters ν = 1, θ = 0.

B.2 Matérn kernels with α ∈ {1/2, 3/2, 5/2}

We use the following parameterization of the Matérn kernel for specific values of α:

k_{M,1/2}(x, x′) = exp(−|x − x′|/θ)

k_{M,3/2}(x, x′) = (1 + √3 |x − x′|/θ) exp(−√3 |x − x′|/θ)

k_{M,5/2}(x, x′) = (1 + √5 |x − x′|/θ + 5(x − x′)²/(3θ²)) exp(−√5 |x − x′|/θ)

The derivatives with respect to x, i.e., in d, are:

∂k_{M,1/2}(x, x′)/∂x = −(sign(x − x′)/θ) exp(−|x − x′|/θ)

∂k_{M,3/2}(x, x′)/∂x = −(3(x − x′)/θ²) exp(−√3 |x − x′|/θ)

∂k_{M,5/2}(x, x′)/∂x = −(5(x − x′)/(3θ²)) (1 + √5 |x − x′|/θ) exp(−√5 |x − x′|/θ)

The w expressions below assume x_i ≤ x_j; the case when x_i > x_j is obtained by swapping x_i and x_j.
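The Gaussian-kernel w can be checked against direct quadrature. The Python sketch below (illustrative only; function names are ours) encodes w(x_i, x_j) = (1/2)√(πθ/2) exp(−(x_i − x_j)²/(2θ)) [erf((2 − x_i − x_j)/√(2θ)) + erf((x_i + x_j)/√(2θ))], obtained by completing the square in the product of the two Gaussians, and compares it with a midpoint rule for ∫₀¹ k_G(x_i, x) k_G(x_j, x) dx.

```python
import math

def w_closed(xi, xj, theta):
    # closed-form w(xi, xj) = integral over [0,1] of k_G(xi, x) k_G(xj, x) dx
    # for the Gaussian kernel k_G(x, x') = exp(-(x - x')^2 / theta)
    m = 0.5 * (xi + xj)                       # midpoint from completing the square
    pref = 0.5 * math.sqrt(math.pi * theta / 2.0)
    gauss = math.exp(-(xi - xj) ** 2 / (2.0 * theta))
    s = math.sqrt(2.0 / theta)
    return pref * gauss * (math.erf((1.0 - m) * s) + math.erf(m * s))

def w_numeric(xi, xj, theta, n=20000):
    # midpoint-rule quadrature of the same integral, as a cross-check
    h = 1.0 / n
    tot = 0.0
    for k in range(n):
        x = (k + 0.5) * h
        tot += math.exp(-(xi - x) ** 2 / theta) * math.exp(-(xj - x) ** 2 / theta)
    return tot * h
```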
Derivatives with respect to x_i and x_j, to account for both of these cases, are provided as follows. For α = 1/2, still taking x_i ≤ x_j:

∂w_{1/2}(x_i, x_j)/∂x_i = ((x_j − x_i)/θ) exp((x_i − x_j)/θ) + (1/2) [ exp(−(x_i + x_j)/θ) − exp((x_i + x_j − 2)/θ) ]

∂w_{1/2}(x_i, x_j)/∂x_j = −((x_j − x_i)/θ) exp((x_i − x_j)/θ) + (1/2) [ exp(−(x_i + x_j)/θ) − exp((x_i + x_j − 2)/θ) ]

The corresponding derivatives ∂w_{3/2}/∂x_i, ∂w_{3/2}/∂x_j, ∂w_{5/2}/∂x_i and ∂w_{5/2}/∂x_j are lengthier combinations of polynomials in x_i, x_j and θ multiplied by exponentials of the same form; they are provided in the hetGP implementation.

Finally, we provide expressions for c̄ from (11):

c̄_{1/2} = exp(−2x_i/θ) − exp(−2(1 − x_i)/θ)

c̄_{3/2} = (1 + √3 x_i/θ)² exp(−2√3 x_i/θ) − (1 + √3 (1 − x_i)/θ)² exp(−2√3 (1 − x_i)/θ)

c̄_{5/2} = (1 + √5 x_i/θ + 5x_i²/(3θ²))² exp(−2√5 x_i/θ) − (1 + √5 (1 − x_i)/θ + 5(1 − x_i)²/(3θ²))² exp(−2√5 (1 − x_i)/θ)
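The diagonal c̄ terms can be validated numerically: for a stationary correlation k on [0, 1], ∂/∂x_i k(x_i − x)² = −∂/∂x k(x_i − x)², so the fundamental theorem of calculus gives d/dx_i ∫₀¹ k(x_i, x)² dx = k(x_i, 0)² − k(x_i, 1)². The Python sketch below (illustrative; names are ours) checks this identity for the Matérn-1/2 and 3/2 correlations against a finite difference of midpoint quadrature.

```python
import math

def k12(x, y, theta):
    # Matern-1/2 (exponential) correlation
    return math.exp(-abs(x - y) / theta)

def k32(x, y, theta):
    # Matern-3/2 correlation
    u = math.sqrt(3.0) * abs(x - y) / theta
    return (1.0 + u) * math.exp(-u)

def w_diag(k, xi, theta, n=20000):
    # w(xi, xi) = integral over [0,1] of k(xi, x)^2, by midpoint quadrature
    h = 1.0 / n
    return sum(k(xi, (j + 0.5) * h, theta) ** 2 for j in range(n)) * h

def cbar(k, xi, theta):
    # stationary-kernel identity: d/dxi of w_diag equals k(xi,0)^2 - k(xi,1)^2
    return k(xi, 0.0, theta) ** 2 - k(xi, 1.0, theta) ** 2
```

The same identity reproduces the Gaussian-kernel c̄ given in Section B.1 and extends directly to the Matérn-5/2 case.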