Statistical Analysis of Complex Computer Models in Astronomy
Joshua Lukemire, Qian Xiao, Abhyuday Mandal*, and Weng Kee Wong

Department of Biostatistics and Bioinformatics, Emory University, Atlanta, GA, USA
Department of Statistics, University of Georgia, Athens, GA, USA
Department of Biostatistics, University of California, Los Angeles, CA, USA

*Corresponding author: [email protected]
Abstract
We introduce statistical techniques required to handle complex computer models with potential applications to astronomy. Computer experiments play a critical role in almost all fields of scientific research and engineering. These computer experiments, or simulators, are often computationally expensive, leading to the use of emulators for rapidly approximating the outcome of the experiment. Gaussian process models, also known as Kriging, are the most common choice of emulator. While emulators offer significant improvements in computation over computer simulators, they require a selection of inputs along with the corresponding outputs of the computer experiment to function well. Thus, it is important to select inputs judiciously for the full computer simulation to construct an accurate emulator. Space-filling designs are efficient when the general response surface of the outcome is unknown, and thus they are a popular choice when selecting simulator inputs for building an emulator. In this tutorial we discuss how to construct these space-filling designs, perform the subsequent fitting of the Gaussian process surrogates, and briefly indicate their potential applications to astronomy research.
Computer experiments, or simulators, are an increasingly important tool in many scientific fields. In these experiments, a computer model is defined relating a set of inputs to an output. Instead of conducting a traditional experiment, a researcher will provide a set of inputs to the computer model and obtain the model output. This approach is very appealing in fields such as physics, where the computer experiment model can be set up using a series of known relationships/equations, and different inputs may consist of unknown constants in those equations or other properties such as mass or chemical compositions. These experiments can be effective alternatives to experiments which may be too expensive or otherwise impossible to perform in a traditional setting. They differ from standard experiments in several key ways. Most importantly, computer experiments are generally deterministic; for a set of input settings the experiment will return the same result every time it is conducted. Second, the experiments will generally not have an easily described response surface; for example, a standard linear regression model will not generally describe the outcome accurately.

Many research areas in astronomy do not easily permit conducting traditional experiments. For example, researchers may be interested in the formation of binary black holes. Clearly the researchers will not be able to create multiple black holes and observe their dynamics over time. Computer experiments make it possible to study such phenomena by creating computer models based on the theorized properties of these binary systems and then comparing the output to what is observed in Nature. For example, Compact Object Mergers: Population Astrophysics and Statistics (COMPAS) is used to investigate binary population synthesis. The computer experiment takes initial conditions as input and simulates the lifespan of stars [50] [56]. Similarly, the binary population synthesis code ComBinE has been used to perform binary population syntheses [32], and the tool UniverseMachine [4] allows researchers to study galaxy formation.

Figure 1: An example from a simulation examining whether two black holes merge.

Computer experiments for many complex systems can be very expensive to perform (see, for example, [62]). This computational expense can be a significant problem, especially if a researcher hopes to conduct the experiment for many sets of inputs. An alternative to directly performing these computer experiments is to instead create a surrogate or emulator [18]. Surrogates are popular for computer experiments when it is not realistic to evaluate a fine grid over the entire input space. Instead, a (relatively) small number of points are chosen to evaluate under the original computer simulation. Then, a model is fit to the output from these limited runs. Predictions under this model for new inputs, as well as uncertainty quantification, can be obtained from the surrogate without the need to re-run the expensive computer simulation at the new points. If the model fits well, then the predicted value will be close to the true value that would have been obtained if the full computer experiment was used.

The most common tool used to fit the data points and create the surrogate model is the Gaussian process (GP) [48] [18]. The GP is appealing for creating surrogates because it interpolates known data to evaluate new data points.
This is especially important when the outcome for a fixed set of inputs is deterministic, which is frequently the case in computer experiments. This approach is becoming more popular in the astronomy literature. Some recent work includes [22], who proposed using Gaussian process emulation to obtain confidence intervals for the parameter vector of a phase-space distribution function for dwarf spheroidal galaxies.

Section 2 of this tutorial paper introduces Gaussian process models and discusses their applications to computer experiments. We provide code and examples throughout in the R programming language [45]. Section 3 of this paper focuses on determining what inputs to use to generate the responses used to fit the Gaussian process model to obtain an accurate surrogate. We draw upon the statistical literature on the design of experiments to discuss the design of computer experiments. In particular, we focus on Latin hypercube designs and discuss several techniques for finding them.
Simpler surrogates or emulators are often preferred for complex deterministic computer models. Gaussian process (GP) models are a popular choice for this purpose [48]. Consider an $n$-run computer experiment with $d$-dimensional input vectors $x_i = (x_{i1}, \ldots, x_{id})^T$ and a deterministic output $y(x_i)$, for $i = 1, 2, \ldots, n$. To fix ideas, assume that we are interested in a 2-dimensional input for a computer experiment with output given by the Branin function as defined by [6], see also [14]:

y(x_1, x_2) = \left( x_2 - \frac{5.1}{4\pi^2} x_1^2 + \frac{5}{\pi} x_1 - 6 \right)^2 + 10 \left( 1 - \frac{1}{8\pi} \right) \cos(x_1) + 10,    (1)

where the design space is given by values of $x_1 \in [-5, 10]$ and $x_2 \in [0, 15]$. In R, this function can be written as:
branin <- function(x) {
  a <- 1
  b <- 5.1 / (4 * pi^2)
  c <- 5 / pi
  r <- 6
  s <- 10
  t <- 1 / (8 * pi)
  return(a * (x[2] - b * x[1]^2 + c * x[1] - r)^2 + s * (1 - t) * cos(x[1]) + s)
}

The left panel in Figure 2 displays the output for this function over the entire design space.
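A plot like the left panel of Figure 2 can be produced by evaluating the Branin function above on a dense grid over the design space; the grid resolution and viewing angles below are illustrative choices, not taken from the original figure.

x1 <- seq(from = -5, to = 10, length.out = 50)
x2 <- seq(from = 0, to = 15, length.out = 50)
grid <- expand.grid(x1, x2)
y_true <- apply(grid, 1, branin)                 # evaluate the simulator on the grid
z <- matrix(y_true, nrow = length(x1))
persp(x1, x2, z, theta = -45, phi = 45,
      xlab = "x1", ylab = "x2", zlab = "y")      # true Branin surface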
The simplest possible GP model, known as ordinary GP or ordinary kriging, is given by

y(x_i) = \mu + Z(x_i),    (2)

where $\mu$ is the mean and $Z(x)$ is a GP, denoted by $Z(x) \sim GP(0, \sigma^2 R)$. This notation implies that the GP has zero mean and covariance function $\mathrm{Cov}(Z(x_i), Z(x_j)) = \sigma^2 R(\cdot \mid \theta)$, where $\theta = (\theta_1, \ldots, \theta_d)^T$ is the vector of unknown correlation parameters with all $\theta_s > 0$ ($s = 1, \ldots, d$). The correlation between outputs is determined by a stationary correlation function $R$ with parameter $\theta$. Two of the more commonly used correlation functions are the power-exponential and the Gaussian functions. Under a power-exponential correlation structure the $(i, j)$th term is defined as

R(x_i, x_j \mid \theta) = \prod_{s=1}^{d} \exp\left\{ -\theta_s |x_{is} - x_{js}|^{p_s} \right\}  for all i, j,    (3)

where the smoothness parameters $p_1, \ldots, p_d$ are all between 0 and 2. Of special importance is $p_s = 2$ for all $s = 1, \ldots, d$, which corresponds to the popular Gaussian correlation function:

R(x_i, x_j \mid \theta) = \exp\left\{ -\sum_{s=1}^{d} \theta_s (x_{is} - x_{js})^2 \right\}  for all i, j.    (4)

The flexibility of the correlation structure is what makes the GP model a popular surrogate for complex computer models. For any given input $x^*$ in the design space, the fitted GP surrogate gives the predicted computer model response as

\hat{y}(x^*) = \mu + r^T(x^*) R^{-1} (y - \mu 1_n),    (5)

where

r(x^*) = \left[ \mathrm{corr}\big(Z(x^*), Z(x_1)\big), \mathrm{corr}\big(Z(x^*), Z(x_2)\big), \ldots, \mathrm{corr}\big(Z(x^*), Z(x_n)\big) \right]^T,    (6)

$1_n$ is a vector of ones of length $n$, $R$ is the $n \times n$ correlation matrix for $(Z(x_1), \ldots, Z(x_n))^T$, $y$ is the response vector $(y(x_1), \ldots, y(x_n))^T$, and the associated uncertainty estimate is

s^2(x^*) = \sigma^2 \left( 1 - r(x^*)^T R^{-1} r(x^*) \right).    (7)

In practice, the parameters $\mu$, $\sigma^2$ and $\theta$ in Equations (5) and (7) are unknown and need to be estimated from the data. The parameters can be estimated using the mlegp function in R. Assume that we already have a design with 10 points (details for obtaining this design will be presented in Section 3). Then we can fit a GP with a Gaussian correlation function as:

library(mlegp)
# The 10 design points used in the original example are not fully recoverable here,
# so for illustration we scale a random 10-run Latin hypercube design (see Section 3)
# to the Branin design space [-5, 10] x [0, 15].
library(LHD)
lhd <- rLHD(10, 2)
design <- cbind(-5 + 15 * (lhd[, 1] - 1) / 9, 0 + 15 * (lhd[, 2] - 1) / 9)
Yx <- apply(design, 1, branin)
branin_surrogate1 <- mlegp(design, Yx)

Similarly, estimates across the entire design space can be obtained by using the surrogate model and specifying the inputs on a grid:

x1 <- seq(from = -5, to = 10, length.out = 25)
x2 <- seq(from = 0, to = 15, length.out = 25)
test_points <- as.matrix(expand.grid(x1, x2))
yhat <- predict(branin_surrogate1, test_points)
predictions <- matrix(yhat, nrow = length(x1))
persp(x1, x2, predictions, theta = -45, phi = 45)
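The prediction uncertainty in equation (7) can also be obtained from the fitted surrogate. The sketch below assumes the se.fit argument of mlegp's predict method and uses two arbitrary new inputs; it returns standard errors alongside the predictions.

new_points <- matrix(c(0, 5,
                       2, 10), ncol = 2, byrow = TRUE)   # two illustrative inputs
pred <- predict(branin_surrogate1, new_points, se.fit = TRUE)
pred$fit      # predicted responses, equation (5)
pred$se.fit   # standard errors, the square root of equation (7)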
The right panel in Figure 2 displays a plot of the surrogate output. Comparing this output to the true values in the left panel, it is clear that the surrogate model is able to obtain a very close approximation to the true process.

The formulation in equation (2) can be extended to incorporate a global trend function for the mean $\mu$ [57]. This is known as universal kriging:

y(x) = \mu(x) + Z(x),    (8)

with $\mu(x) = g(x)^T \beta = \sum_{i=1}^{m} \beta_i g_i(x)$, where $g$ is an $m$-dimensional known function and $\beta = (\beta_1, \ldots, \beta_m)^T$ is a vector of unknown parameters. If we assume $g_1(x) = 1$ and let $G = (g(x_1), \ldots, g(x_n))^T$, then the optimal predictor under model (8) is given by

\hat{y}(x^*) = g^T(x^*) \hat{\beta} + r^T(x^*) R^{-1} (y - G \hat{\beta}),    (9)

where $\hat{\beta} = (G^T R^{-1} G)^{-1} G^T R^{-1} y$. If the assumed $\mu(x)$ is close to the truth, this formulation will lead to a better prediction than ordinary kriging. Note that this universal kriging formulation uses $\mu(x)$ to capture the known trends, but in most real applications these trends are not known, and hence ordinary kriging is commonly used [61].
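In mlegp, a simple form of this extension can be requested through the constantMean argument: setting it to 0 is assumed here to replace the constant mean with a linear regression in the inputs. The sketch below reuses the design, Yx and test_points objects created above; it is not a full implementation of equation (8) with arbitrary basis functions g.

# ordinary kriging used a constant mean; here the mean is a linear trend in x
branin_surrogate2 <- mlegp(design, Yx, constantMean = 0)
yhat_trend <- predict(branin_surrogate2, test_points)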
Note that the correlation function in equation (3) corresponds to a stationary GP, that is,

\mathrm{Cov}\big(Z(x + h), Z(x)\big) = \sigma^2 R(h),    (10)

where the correlation function $R(h)$ is a positive semidefinite function with $R(0) = 1$ and $R(-h) = R(h)$. These stationary Gaussian processes are popular surrogates for complex computer models, since it can be shown that the corresponding predictor of $\mu$ in equation (2),

\hat{\mu} = (1_n^T R^{-1} 1_n)^{-1} 1_n^T R^{-1} y,    (11)

is the best linear unbiased predictor (BLUP) in the sense that it minimizes the mean squared prediction error. In reality this assumption of stationarity may not hold. Under these circumstances, the above predictor is no longer optimal. Some literature is available to deal with non-stationary Gaussian processes for emulating computationally expensive functions. For example, [67] introduced the idea of nonlinear mapping based on a parameterized density function, and [20] proposed a Bayesian tree structure by dividing the design space into subregions.

[2] used composite Gaussian process (CGP) models to address the nonstationarity problem. In their formulation, the model takes the following form:

y(x) = Z_{\mathrm{global}}(x) + Z_{\mathrm{local}}(x),
Z_{\mathrm{global}}(x) \sim GP\big(\mu, \tau^2 R_g(\cdot)\big),    (12)
Z_{\mathrm{local}}(x) \sim GP\big(0, \sigma^2 R_l(\cdot)\big).

Here $Z_{\mathrm{global}}(x)$ and $Z_{\mathrm{local}}(x)$ are two stationary GPs that are independent of each other. Just as universal kriging generalizes ordinary kriging by adding a trend function $\mu(x)$, the composite Gaussian process model given in equation (12) is a further extension which adds a more flexible global trend component. The model was extended to incorporate a non-constant variance assumption as follows:

y(x) = Z_{\mathrm{global}}(x) + \sigma(x) Z_{\mathrm{local}}(x),
Z_{\mathrm{global}}(x) \sim GP\big(\mu, \tau^2 R_g(\cdot)\big),    (13)
Z_{\mathrm{local}}(x) \sim GP\big(0, R_l(\cdot)\big).

The model can be further extended for noisy data by adding a third GP (with zero correlation) to the model (13).

2.3 Numeric Considerations - Local GP

Note that the prediction involves the inversion of the $n \times n$ correlation matrix $R$, where $n$ is the number of data points (see equation (5) or (11), for example). This is a big hurdle in implementing GPs. To overcome this problem, [19] introduced the idea of local Gaussian process approximation for large computer models. They provided a family of local sequential design schemes that dynamically define the support points of a GP predictor based on a local subset of the data. Their approach is different from that of k-nearest neighbours. The basic idea is simple: under the standard choices of covariance structure, the correlation between points depends on the distance between those points, with data points far from $x^*$ having very little effect on its prediction. Hence it is not a good use of computational resources to invert the full covariance matrix, as the elements corresponding to "far away" points will contribute little to the prediction of $y(x^*)$. An interested reader should refer to [19] for the formulas of the GP predictor based on a local subset of data. The end result is a global predictor that takes advantage of modern multicore parallel computing tools.
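The laGP package mentioned in the concluding section implements this local approximation. Below is a minimal sketch, assuming the laGP::aGP interface, the Branin function from Section 2, and an illustrative (deliberately large) random training design; the design sizes and the end argument (the local neighborhood size) are arbitrary choices.

library(laGP)
set.seed(1)
X_big <- cbind(runif(2000, -5, 10), runif(2000, 0, 15))  # large training design
Y_big <- apply(X_big, 1, branin)                         # simulator runs
X_new <- cbind(runif(5, -5, 10), runif(5, 0, 15))        # inputs to predict
fit <- aGP(X = X_big, Z = Y_big, XX = X_new, end = 50, verb = 0)
cbind(prediction = fit$mean, variance = fit$var)         # local GP predictions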
The conventional GP models consider quantitative predictor variables only, but many computer experiments may have both quantitative and qualitative inputs. In order to construct an emulator with qualitative factors, a naive approach would be to create distinct GP models for data collected at the different level combinations of the qualitative factors. Clearly this approach has many limitations, particularly when there are several qualitative factors. There are some more advanced techniques to deal with such cases. To fix ideas, for an $n$-run computer model, denote the $k$th ($k = 1, \ldots, n$) data input as $w_k = (x_k^T, z_k^T)^T$, where $x_k = (x_{k1}, \ldots, x_{kp})^T \in \mathbb{R}^p$ is the quantitative part and $z_k = (z_{k1}, \ldots, z_{kq})^T \in \mathbb{N}^q$ is the qualitative part (coded in levels) of the input. Note here that previously $x$ denoted the input, which was entirely continuous; now $w$ denotes the entire input, with $x$ referring to the continuous part. For these kinds of problems, a popular GP-based model was introduced by [44], among many others [23], [72], [53], [70] and [71]. Specifically, an ordinary GP model with a multiplicative covariance function is considered (for any two inputs $w_1$ and $w_2$):

\mathrm{Cov}(Z(w_1), Z(w_2)) = \sigma^2 \prod_{j=1}^{q} \tau^{(j)}_{z_{1j} z_{2j}} R(x_1, x_2 \mid \theta),    (14)

where the parameter $\tau^{(j)}_{z_{1j} z_{2j}}$ represents the correlation between two levels ($z_{1j}$ and $z_{2j}$) of the $j$th qualitative factor $z^{(j)}$, and $R(x_1, x_2 \mid \theta)$ is given in equation (4). Different choices of $\tau^{(j)}_{z_{1j} z_{2j}}$ lead to different types of correlation functions. For example, an exchangeable correlation function is obtained when $\tau^{(j)}_{z_{1j} z_{2j}}$ is some constant between 0 and 1 for unequal levels. Alternatively, an additive GP model was proposed in [11], which adopts the following covariance function:

\mathrm{Cov}(Z(w_1), Z(w_2)) = \sum_{j=1}^{q} \sigma_j^2 \tau^{(j)}_{z_{1j} z_{2j}} R(x_1, x_2 \mid \theta^{(j)}),    (15)

where $\sigma_j^2$ and $\theta^{(j)}$ ($j = 1, \ldots, q$) are the process variance and correlation parameters, respectively, corresponding to $z^{(j)}$.

The methods above do not have a good physical interpretation of the correlation structures. Motivated by this, [64] proposed the EzGP method based on ANOVA (analysis of variance) ideas to jointly model the quantitative and qualitative inputs:

Y(w) = \mu + G_z(x),    (16)

which implies that for any level combination of $z$, $Y(w)$ is a Gaussian process. In particular, they considered

G_z(x) = G_0(x) + G_{z^{(1)}}(x) + \cdots + G_{z^{(q)}}(x),    (17)

where $G_0$ and the $G_{z^{(h)}}$ ($h = 1, \ldots, q$) are independent Gaussian processes with mean zero and some covariance functions. Here, $G_0$ is a standard GP taking only quantitative inputs $x$, which can be viewed as the base GP reflecting the intrinsic relation between $y$ and $x$. On the other hand, the $G_{z^{(h)}}$'s can be viewed as adjustments to the base GP by the impact of the qualitative factor $z^{(h)}$ ($h = 1, \ldots, q$). This EzGP technique enjoys some nice theoretical properties and is able to flexibly address heterogeneity in computer models involving multiple qualitative factors. [64] also developed two variants of the EzGP model to achieve computational efficiency for data with high dimensionality and large sizes.
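As a concrete illustration of equation (14), the small function below (written for this tutorial, not taken from any package) evaluates the multiplicative covariance between two mixed inputs, assuming an exchangeable correlation tau for each qualitative factor and the Gaussian correlation of equation (4) for the quantitative part.

qq_cov <- function(w1, w2, p, sigma2, theta, tau) {
  # first p entries are quantitative, remaining entries are qualitative levels
  x1 <- w1[1:p]; z1 <- w1[-(1:p)]
  x2 <- w2[1:p]; z2 <- w2[-(1:p)]
  Rx <- exp(-sum(theta * (x1 - x2)^2))       # Gaussian correlation, equation (4)
  Rz <- prod(ifelse(z1 == z2, 1, tau))       # exchangeable correlation per factor
  sigma2 * Rz * Rx
}
# two inputs with p = 2 quantitative variables and q = 2 qualitative factors
qq_cov(c(0.2, 0.7, 1, 2), c(0.3, 0.5, 1, 1), p = 2,
       sigma2 = 1, theta = c(2, 2), tau = 0.5)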
The notion of calibration and sensitivity analysis is important in the context of physical and computer experiments. Instead of observing the real physical process $y^{\mathrm{Real}}$, we are only able to observe a process $y^{\mathrm{Field}}$ as

y^{\mathrm{Field}}(x) = y^{\mathrm{Real}}(x) + \epsilon,    (18)

where $\epsilon$ is the usual normal error. This $y^{\mathrm{Real}}$ is approximated by a computer model $y^{\mathrm{Model}}$. Note that the computer model $y^{\mathrm{Model}}$ has not only the input variables $x$, but also some unknown parameters $\theta$, called calibration parameters, which are used to fine tune the model. These calibration parameters can be, for example, the correlation parameters discussed above. The field data $y^{\mathrm{Field}}$ are used mainly to learn more about the real phenomenon $y^{\mathrm{Real}}$. [30] proposed a Bayesian framework to address this as follows:

y^{\mathrm{Real}}(x) = y^{\mathrm{Model}}(x, \theta) + b(x),    (19)
y^{\mathrm{Field}}(x) = y^{\mathrm{Model}}(x, \theta) + b(x) + \epsilon,

where $b(x)$ is a functional discrepancy, called the bias. [30] used Bayesian methods to estimate the bias correction function and the unknown calibration parameter $\theta$ under a GP prior. An alternative to this Bayesian approach is an iterative history matching algorithm, such as the one proposed by [55] for calibrating a galaxy formation model called GALFORM. This is a hands-on process, which intelligently eliminates implausible points from the input (or parameter) space and returns a set of plausible candidates for the parameters $\theta$. Recently, [5] used this algorithm for calibrating hydrological time-series models, and [46] further extended this method with a more systematic approach, in which they discretize the target response series at a few time points and then iteratively apply the history matching algorithm with respect to the discretized targets.

The computer experiments under consideration have deterministic outputs, and thus replicates at a given set of input settings should be avoided, as they do not provide any further information about the response. Good designs for computer experiments are then designs that are "space-filling" in some sense, which makes it easier to fit accurate surrogate models. We will next discuss a few types of space-filling designs and examine techniques which can be used to construct them.
Latin hypercube designs (LHDs) are $n \times d$ matrices whose columns are permutations of the numbers 1 to $n$ (or 0 to $n - 1$) [39]. They have unique point projections in every dimension and avoid replications, making them ideal for determining which inputs to use for computer experiments [13]. For a given number of runs and input size, an LHD can easily be generated in R:
library(LHD)
lhd1 <- rLHD(10, 2)

While it is intuitive to favor a design that is space-filling, in practice it is difficult to identify such designs for experiments with different numbers of runs and inputs. One of the more common approaches is to use orthogonal or nearly orthogonal LHDs (OLHDs). OLHDs minimize the correlations among the input settings in the design [15] [52]. They can be obtained by minimizing a correlation-based design criterion. For example, two of the most commonly used criteria for OLHDs are the average absolute correlation (ave(|r|)) and the maximum absolute correlation (max|r|):

\mathrm{ave}(|r|) = \frac{2 \sum_{s=1}^{d-1} \sum_{s'=s+1}^{d} |r_{ss'}|}{d(d-1)}, \qquad \max|r| = \max_{s, s'} |r_{ss'}|,    (20)

where $r_{ss'}$ is the correlation between the $s$th and $s'$th columns of the design. If the design is a true orthogonal LHD, then ave(|r|) = 0 and max|r| = 0. For example, to generate an OLHD in R with 8 factors and 32 runs we can write:

# generate a 32-run, 8-factor orthogonal LHD
OLHD <- OLHD.S2010(C = 3, r = 2, type = "even")
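The quantities in equation (20) are easy to compute directly from a design matrix; the helper functions below are written for this tutorial (similar utilities may also exist in design packages).

ave_abs_cor <- function(X) { r <- abs(cor(X)); mean(r[upper.tri(r)]) }  # ave(|r|)
max_abs_cor <- function(X) { r <- abs(cor(X)); max(r[upper.tri(r)]) }   # max|r|
ave_abs_cor(lhd1)   # random LHD generated above
ave_abs_cor(OLHD)   # orthogonal LHD: the value should be (essentially) zero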
The design can easily be verified to be orthogonal by examining:

t(OLHD) %*% OLHD

However, for many combinations of run size and number of inputs an orthogonal LHD does not exist, and thus a good design will be one with small ave(|r|) and max|r| values. Many algebraic construction methods have been proposed for finding OLHDs, and they can also be found via search algorithms using ave(|r|) or max|r| as objective functions. Some specific results include [69], who proposed techniques for constructing orthogonal LHDs with run size $n = 2^m$ or $n = 2^m + 1$, where $m$ is an integer. [3] proposed to rotate $2^d$ factorial designs to construct $d$-factor orthogonal LHDs, where $d$ must be some power of 2 and the run size is $n = 2^d$. For further examples, please refer to [7], [49], [9], [35], [51] and [68]; see [59] for a survey.

Figure 3: Some examples of 9-run 2-factor LHDs: (a) orthogonal LHD, (b) OALHD, (c) orthogonal OALHD, (d) maximin OALHD.
While OLHDs are very commonly used, they are not guaranteed to be space-filling; see design (a) in Figure 3 for an example [63]. In light of this, various design optimality criteria have been developed related to measures of space-filling.

3.1.1 Centered L2-Discrepancy Criteria

[24] defined several discrepancy-based criteria, among which the centered L2-discrepancy (CD) is the most popular. The intuition behind the CD criteria is that a space-filling design should have points spread out uniformly in the whole design space or any sub-space of the design space. If this is the case, for any rectangular region of the design space we examine, the number of design points in that region should be proportional to the volume of that region. The CD criteria is defined as

\mathrm{CD}(D_n) = \sum_{v \neq \emptyset} \int_{C_v} \left| \frac{N(D_{nv}, J_{x_v})}{n} - \mathrm{Volume}(J_{x_v}) \right|^2 dx_v,    (21)

where $D_n$ is the $n$-run, $d$-factor, $q$-level design, $v$ is some non-empty subset of $\{1, 2, \ldots, q\}$, $C_v$ is the subspace defined by the coordinate indexes selected by $v$, $D_{nv}$ is the projection of $D_n$ onto the subspace $C_v$, $x_v$ is the projection of the vector $x = (x_1, x_2, \ldots, x_q)$ onto the subspace $C_v$, $J_x$ is the chosen rectangular region defined by $x$, $J_{x_v}$ is the projection of $J_x$ onto the subspace $C_v$, $N(D_{nv}, J_{x_v})$ is the total number of design points of $D_{nv}$ within the chosen region $J_{x_v}$, and $\mathrm{Volume}(J_{x_v})$ is the volume of $J_{x_v}$. For more details on the rationale of the CD criteria, see Chapter 3 in [13] for a survey.

Another commonly used metric for evaluating a design's space-filling properties is the maximin distance criterion [27]. This criterion favors designs with maximum pairwise distances between inputs. Maximin designs are popular due to their robustness, since the design criterion focuses on optimizing the worst case scenario: the closest pairwise distance between any two points. [43] defined a computationally efficient scalar value for evaluating the maximin distance criterion:

\phi_p = \left( \sum_{i=2}^{n} \sum_{j=1}^{i-1} u_{ij}^{-p} \right)^{1/p},    (22)

where $u_{ij}$ is the distance between the $i$th and $j$th design points. Designs with smaller $\phi_p$ values are more space-filling. For sufficiently large $p$, the $\phi_p$ criterion is asymptotically identical to the true maximin distance criterion.
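A direct computation of equation (22) takes only a few lines; the function below is an illustrative implementation written for this tutorial (p = 15 is a common choice in the literature), applied to the random LHD generated earlier.

phi_p <- function(X, p = 15) {
  u <- as.vector(dist(X))        # all pairwise Euclidean distances
  (sum(u^(-p)))^(1 / p)          # equation (22); smaller is better
}
phi_p(lhd1)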
Due to the desirability of both the orthogonality and maximin properties, [28] proposed a multi-objective criterion (denoted OMmcri) to generate orthogonal-maximin LHDs (OMm LHDs), which act as a compromise between orthogonal and maximin designs. The OMmcri criteria is given by

\mathrm{OMmcri}(X, \omega) = \omega \rho + (1 - \omega) \frac{\phi_p - \phi_{p,\mathrm{lowerbound}}}{\phi_{p,\mathrm{upperbound}} - \phi_{p,\mathrm{lowerbound}}}.    (23)

Here, $\phi_p$ is the maximin criteria value from Equation (22), $\rho$ is the ave(|r|) criteria value as defined in Equation (20), $\omega$ is a weight value reflecting the tradeoff between the orthogonality and maximin criteria, and $\phi_{p,\mathrm{lowerbound}}$ and $\phi_{p,\mathrm{upperbound}}$ are given by

\phi_{p,\mathrm{lowerbound}} = \left\{ \binom{n}{2} \left( \frac{\lceil \bar{u} \rceil - \bar{u}}{\lfloor \bar{u} \rfloor^{p}} + \frac{\bar{u} - \lfloor \bar{u} \rfloor}{\lceil \bar{u} \rceil^{p}} \right) \right\}^{1/p}  and  \phi_{p,\mathrm{upperbound}} = \left( \sum_{i=1}^{n-1} \frac{n-i}{(i d)^{p}} \right)^{1/p},

respectively. Here $\bar{u}$ is the average distance between the design points, and $\lfloor \bar{u} \rfloor$ and $\lceil \bar{u} \rceil$ are the largest integer smaller than $\bar{u}$ and the smallest integer larger than $\bar{u}$.

Another popular class of efficient LHDs is the orthogonal array based LHDs (OALHDs) of [54], where the levels in randomized orthogonal arrays (OAs) are expanded to form LHDs. OALHDs have desirable sampling and projection properties, but they are not necessarily space-filling [63]; see designs (b) and (c) in Figure 3 for some examples. [33] proposed to use a simulated annealing algorithm to search for space-filling OALHDs, and [66] further proposed to consider both level permutation and level expansion for generating OALHDs. Some algebraic construction methods are also available for constructing maximin LHDs of certain design sizes [65] [60].

Space-filling LHDs, including CD and maximin distance LHDs, focus on the design's properties in the full-dimensional space. Yet their space-filling properties in some sub-spaces (projections) may not be adequate. [29] proposed the maximum projection LHDs (Maxpro LHDs), which guarantee that designs have space-filling properties in all projections. The maximum projection criterion is defined as

\min_{X} \psi(X) = \left\{ \frac{1}{\binom{n}{2}} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \frac{1}{\prod_{s=1}^{d} (x_{is} - x_{js})^{2}} \right\}^{1/d}.    (24)

Here, $X$ is an $n \times d$ matrix where each row is an input to the computer experiment, and the sum is over all pairs of rows in $X$. Clearly a design minimizing $\psi$ will have every pair of design points far apart from each other in all projections, justifying the name "Maxpro."
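The Maxpro criterion in equation (24) can likewise be computed directly; the implementation below is an illustrative sketch written for this tutorial (note that it requires no two design points to share a coordinate value, which holds for any LHD).

maxpro_crit <- function(X) {
  n <- nrow(X); d <- ncol(X); total <- 0
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      total <- total + 1 / prod((X[i, ] - X[j, ])^2)   # one term of equation (24)
    }
  }
  (total / choose(n, 2))^(1 / d)
}
maxpro_crit(lhd1)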
The metrics discussed above, such as the maximin criterion, provide a way of quantifying how "good" a design is in some sense. It remains to determine how to actually construct designs that have a good value of the criterion, which is a challenging problem in many situations. For many such design problems, it is popular to use metaheuristic optimization algorithms to find designs. Metaheuristic algorithms are often used to solve problems in astronomy; see, for example, [8], [16], [41], [40], and [42]. They can also be applied to solve difficult problems such as clustering in complex data [12], [26]. These algorithms are preferred due to their flexibility: in general they will work with any objective function. For a more detailed review of metaheuristic algorithms for finding designs, see [37]. Here we will focus on two of the more commonly used approaches: simulated annealing and the genetic algorithm.

Simulated annealing (SA) is one of the most widely used general probabilistic optimization techniques [31]. The algorithm mimics the annealing process in metallurgy, in which materials are heated to a high temperature, where their properties change, and are then allowed to cool slowly. [43] adapted the classic SA algorithm for finding maximin distance LHDs, and the approach can easily be modified to search for other types of designs by using the other optimality criteria defined in Section 3.1.

SA starts with a random LHD and then improves it via an element exchange method, where two random elements from a random column in the design are exchanged. If this exchange results in a more efficient design, then the change is kept. If the exchange does not result in any improvement, the change is kept with a probability controlled by the current temperature (a tuning parameter). Allowing changes that do not improve the design helps the search algorithm to escape local optima. The SA algorithm iteratively repeats this exchange procedure. After a certain number of rounds, the temperature is annealed to decrease (cool down) the probability of updating the current design, following the annealing schedule. We summarize a general SA framework in Algorithm 1, where the target function $\Phi$ to be minimized can be the optimality criterion defined in (20), (21), (22), (23) or (24) for the orthogonal, CD, maximin, OMm and Maxpro LHDs, respectively.

In Algorithm 1, the maximum number of iterations $N$ is recommended to be around 500 according to the convergence analysis in [59]. The decreasing rate for the current temperature $T$ is another important tuning parameter. A larger rate will make $T$ decline faster, and thus lead to a faster stop of the algorithm. Yet it may also result in a larger probability of missing the true global optimum. Considering this trade-off, it is recommended to set $T$ between 0.05 and 0.15.
The tuning parameter $S$ indicates the maximum number of consecutive attempts the algorithm will make without improvement before the temperature is reduced, and [43] recommends it to be around 5, depending on how expensive the objective function is to evaluate. It is straightforward to use simulated annealing to find designs in R. For example:

LHD_SA <- SA(n = 10, k = 2, N = 25)
Similarly, designs satisfying the multi-objective approach can be found by:

# orthogonal-maximin (multi-objective) LHD
multi_obj_design <- SA2008(n = 10, k = 2, N = 25)
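The designs returned above use integer levels, whereas a simulator such as the Branin function expects inputs on its own scale. The sketch below shows the usual rescaling step, assuming levels 1 to n and mapping them to the Branin design space of Section 2; the target ranges are only an example.

scale_design <- function(D, lower, upper) {
  n <- nrow(D)
  u <- (D - 1) / (n - 1)                                 # map levels 1..n to [0, 1]
  sweep(sweep(u, 2, upper - lower, "*"), 2, lower, "+")  # stretch to [lower, upper]
}
design_scaled <- scale_design(LHD_SA, lower = c(-5, 0), upper = c(10, 15))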
The genetic algorithm (GA) is a metaheuristic algorithm inspired by the process of natural selection [25] [17]. The GA starts from a population of randomly generated candidate solutions (designs), called chromosomes. The population of chromosomes in each iteration is called a generation. For each generation the objective function is evaluated for each chromosome, with the corresponding value being known as the fitness. The more fit chromosomes are allowed to survive to the next generation, while the less fit chromosomes are replaced by new offspring. These offspring are obtained by selecting several chromosomes (called parents) and recombining their settings using crossover and mutation techniques to produce offspring with potentially better fitness.

[34] adapted the general GA framework for searching for maximin LHDs. Their approach begins with random LHDs as the initial population. They then perform a selection step in which the best half of the LHDs are allowed to survive to the next generation. Then a crossover step is performed in which random columns in these survivors are exchanged with other survivors. Additionally, to encourage diversity in the solutions and prevent the algorithm from becoming stuck in a local optimum, a mutation step is performed in which two random elements in a column are exchanged. Note that the current best chromosome is excluded from this mutation in order to preserve the best current solution. Finally, the fitness of the new population of LHDs is calculated, and the process is repeated until the stopping criterion is satisfied. We include a detailed description of the GA, along with the tuning parameters, in Algorithm 2.

It is also straightforward to use the GA to find space-filling designs in R. For example:

# OC = "phi_p" requests the maximin distance criterion as the objective
LHD_GA <- GA(n = 10, k = 2, N = 25, OC = "phi_p")
Sophisticated computer simulators allow scientists to test complex systems which would be too expensive or completely impossible to assess otherwise. These simulations are usually very time-consuming, and computationally cheap surrogates are called for to facilitate the analysis and optimization of the underlying system. Gaussian processes are popular choices for such surrogates (or emulators). In order to effectively reap the benefits of utilizing the surrogate, the simulator should be evaluated on a set of points chosen efficiently. Latin hypercube designs have proven efficient for that purpose.

In this tutorial paper we discussed design criteria and subsequent metaheuristic optimization strategies for finding designs that allow astronomy researchers to extract the maximum benefit offered by Gaussian process surrogate modeling. We provided an overview of model fitting using Gaussian processes and identification of optimal Latin hypercube designs. Relevant R code has been used for illustration. Apart from the libraries discussed in the paper, there are many other packages in R that can be used. Interested readers may want to consider the laGP (Local Approximate Gaussian Process Regression [21]), DiceKriging (Kriging Methods for Computer Experiments [47]), GPfit (Gaussian Processes Modeling [38]) and SLHD (Maximin-Distance (Sliced) Latin Hypercube Designs [1]) packages.

One consideration not covered in this tutorial paper is how to best utilize Gaussian process models when the data sets are astronomically large. Such "big data" may cause the estimation techniques to become quite slow, requiring advanced techniques to speed up the estimation. This is a topic of active research. For further details, see [36].

Algorithm 1: Simulated Annealing for LHD
1. Choose values for the tuning parameters: the starting temperature T, the number of attempts before lowering the temperature S, and the maximum number of iterations N.
2. Set the counter index C = 1.
3. Construct a random starting LHD X.
4. Select a column from X at random.
5. Exchange two randomly selected elements within this chosen column. Denote the new design by X_new.
6. If Φ(X_new) < Φ(X), then set X = X_new (accept the new design). Otherwise, let X = X_new with probability exp{-(Φ(X_new) - Φ(X)) / T}.
7. If S attempts have passed since the last improvement, decrease the temperature T and repeat Steps 4-6.
8. If C < N, increment C and repeat Steps 4-7; otherwise, terminate and return X.
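For readers who prefer to see the steps of Algorithm 1 in code, the sketch below is an illustrative from-scratch implementation written for this tutorial (it is not the LHD::SA function used earlier), using the phi_p criterion of equation (22) as the objective; the tuning values are arbitrary.

sa_lhd <- function(n, d, N = 500, Temp = 0.15, S = 5, rate = 0.9) {
  phi_p <- function(X, p = 15) (sum(as.vector(dist(X))^(-p)))^(1 / p)
  X <- sapply(1:d, function(j) sample(1:n))      # Step 3: random starting LHD
  fails <- 0
  for (C in 1:N) {                               # Steps 4-8
    Xnew <- X
    j <- sample(1:d, 1)                          # random column
    rows <- sample(1:n, 2)                       # two random elements to swap
    Xnew[rows, j] <- Xnew[rev(rows), j]
    if (phi_p(Xnew) < phi_p(X)) {
      X <- Xnew
      fails <- 0
    } else {
      if (runif(1) < exp(-(phi_p(Xnew) - phi_p(X)) / Temp)) X <- Xnew
      fails <- fails + 1
    }
    if (fails >= S) {                            # Step 7: cool the temperature
      Temp <- rate * Temp
      fails <- 0
    }
  }
  X
}
X_sa <- sa_lhd(n = 10, d = 2)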
Algorithm 2: Genetic Algorithm for LHD
1. Set the probability of mutation p_mut; the suggested setting is 1/(d - 1) [34]. Set the maximum number of iterations N and the counter index C = 1.
2. Generate m random n × d LHDs, denoted by X_1, ..., X_m, where m is the population size (number of chromosomes). Here, m must be an even number.
3. Evaluate the objective function Φ(X_i), for i = 1, ..., m.
4. Select survivors: order the X_i by their objective function values and select the best m/2 of them (those with the smallest Φ values), denoted, without loss of generality, by X_i^s for i = 1, ..., m/2.
5. Let X_b^s = argmin_i Φ(X_i^s), i.e., X_b^s is the best survivor.
6. For each X_i^s, excluding X_b^s: randomly choose a column j from X_b^s and replace it with the jth column from X_i^s (crossover with the best survivor).
7. For each X_i^s, excluding X_b^s: randomly choose a column j from X_i^s and replace it with the jth column from X_b^s.
8. Update the population: let X_1 = X_b^s and X_2, ..., X_{m/2} be the design matrices obtained in Step 6, and let X_{m/2+1} = X_b^s and X_{m/2+2}, ..., X_m be the design matrices generated in Step 7.
9. For each X_i (except X_1) and each column j of X_i: if z < p_mut, where z ~ Uniform(0, 1), exchange two randomly selected elements in column j (mutation).
10. Calculate Φ(X_i) for all i.
11. If C ≤ N, set C = C + 1 and repeat Steps 4-11; otherwise, stop the algorithm.
References
[1] Ba, S. (2015), SLHD: Maximin-Distance (Sliced) Latin Hypercube Designs, URL: https://cran.r-project.org/web/packages/SLHD/index.html, R package version 2.1-1.
[2] Ba, S. and Joseph, V. R. (2012), Composite Gaussian process models for emulating expensive functions, Ann. Appl. Stat., 4, 1838–1860.
[3] Beattie, S. D. and Lin, D. K. J., (2005), A new class of Latin hypercube for computer experiments, Contemporary Multivariate Analysis and Designs of Experiments, in Celebration of Prof. Kai-Tai Fang's 65th Birthday, Singapore: World Scientific, 205–226.
[4] Behroozi, P., Wechsler, R., Hearin, A., and Conroy, C., (2019), UniverseMachine: The correlation between galaxy growth and dark matter halo assembly from z = 0-10, Monthly Notices of the Royal Astronomical Society, 3, 3143–3194, Oxford University Press.
[5] Bhattacharjee, N., Ranjan, P., Mandal, A. and Tollner, E. W. (2019), A history matching approach for calibrating hydrological models, Environmental and Ecological Statistics, 1, 87–105.
[6] Bingham, D., Branin Function, Virtual Library of Simulation Experiments.
[7] Bursztyn, D. and Steinberg, D. M., (2002), Rotation designs: orthogonal first-order designs with higher order projectivity, Applied Stochastic Models in Business and Industry, 3, 197–206, Wiley Online Library.
[8] Charbonneau, P., (1995), Genetic algorithms in astronomy and astrophysics, The Astrophysical Journal Supplement Series, 309–334.
[9] Cioppa, T. M. and Lucas, T. W., (2007), Efficient nearly orthogonal and space-filling Latin hypercubes, Technometrics, 1, 45–55, Taylor & Francis.
[10] Dancik, G. M. (2020), mlegp: Maximum Likelihood Estimates of Gaussian Processes, URL: https://cran.r-project.org/web/packages/mlegp/index.html, R package version 3.1.8.
[11] Deng, X., Lin, C. D., Liu, K.-W. and Rowe, R. K., (2017), Additive Gaussian process for computer models with qualitative and quantitative factors, Technometrics, 3, 283–292, Taylor & Francis.
[12] Djorgovski, S. G., Brunner, R., Mahabal, A., Williams, R., Granat, R., and Stolorz, P., (2003), Challenges for cluster analysis in a virtual observatory, Statistical Challenges in Astronomy, 127–141, Springer.
[13] Fang, K. T., Li, R. and Sudjianto, A., (2006), Design and Modeling for Computer Experiments, CRC Press.
[14] Forrester, A., Sobester, A., and Keane, A., (2008), Engineering Design via Surrogate Modelling: A Practical Guide, John Wiley & Sons.
[15] Georgiou, S. D., (2009), Orthogonal Latin hypercube designs from generalized orthogonal designs, Journal of Statistical Planning and Inference, 4, 1530–1540, Elsevier.
[16] Giuliano, M., and Johnston, M., (2008), Multi-Objective Evolutionary Algorithms for Scheduling the James Webb Space Telescope, ICAPS, 107–115.
[17] Goldberg, D. E., (1989), Genetic Algorithms in Search, Optimization and Machine Learning, Addison Wesley Publishing Co. Inc.
[18] Gramacy, R. B. (2020), Surrogates: Gaussian Process Modeling, Design, and Optimization for the Applied Sciences, CRC Press.
[19] Gramacy, R. B. and Apley, D. W. (2015), Local Gaussian process approximation for large computer experiments, Journal of Computational and Graphical Statistics, 2.
[20] Gramacy, R. B. and Lee, H. K. H. (2008), Bayesian treed Gaussian process models with an application to computer modeling, J. Amer. Statist. Assoc., 1119–1130.
[21] Gramacy, R. B., Sun, F. (2019), laGP: Local Approximate Gaussian Process Regression, URL: https://cran.r-project.org/web/packages/laGP/index.html, R package version 1.5-5.
[22] Gration, A. and Wilkinson, M., (2019), Dynamical modelling of dwarf spheroidal galaxies using Gaussian-process emulation, Monthly Notices of the Royal Astronomical Society, 4, 4878–4892, Oxford University Press.
[23] Han, G., Santner, T. J., Notz, W. I., Bartel, D. L., (2009), Prediction for computer experiments having quantitative and qualitative input variables, Technometrics, 3, 278–288, Taylor & Francis.
[24] Hickernell, F., (1998), A generalized discrepancy and quadrature error bound, Mathematics of Computation of the American Mathematical Society, 221, 299–322.
[25] Holland, J. H. and others, (1992), Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence, MIT Press.
[26] Hruschka, E., Campello, R., Freitas, A., and others, (2009), A survey of evolutionary algorithms for clustering, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 2, 133–155, IEEE.
[27] Johnson, M. E., Moore, L. M. and Ylvisaker, D., (1990), Minimax and maximin distance designs, Journal of Statistical Planning and Inference, 2, 131–148, Elsevier.
[28] Joseph, V. R. and Hung, Y., (2008), Orthogonal-maximin Latin hypercube designs, Statistica Sinica, 171–186, JSTOR.
[29] Joseph, V. R., Gul, E. and Ba, S., (2015), Maximum projection designs for computer experiments, Biometrika, 2, 371–380, Oxford University Press.
[30] Kennedy, M. and O'Hagan, A. (2002), Bayesian calibration of computer models, Journal of the Royal Statistical Society Series B (Statistical Methodology), 3, 425–464, Wiley Online Library.
[31] Kirkpatrick, S., Gelatt, C. D. and Vecchi, M. P., (1983), Optimization by simulated annealing, Science, 4598, 671–680, American Association for the Advancement of Science.
[32] Kruckow, M., Tauris, T., Langer, N., Kramer, M., and Izzard, R. G., (2018), Progenitors of gravitational wave mergers: binary evolution with the stellar grid-based code COMBINE, Monthly Notices of the Royal Astronomical Society, 2, 1908–1949, Oxford University Press.
[33] Leary, S., Bhaskar, A., and Keane, A., (2003), Optimal orthogonal-array-based Latin hypercubes, Journal of Applied Statistics, 5, 585–598, Taylor & Francis.
[34] Liefvendahl, M. and Stocki, R., (2006), A study on algorithms for optimization of Latin hypercubes, Journal of Statistical Planning and Inference, 9, 3231–3247, Elsevier.
[35] Lin, C. D., Mukerjee, R. and Tang, B., (2009), Construction of orthogonal and nearly orthogonal Latin hypercubes, Biometrika, 243–247, JSTOR.
[36] Liu, H., Ong, Y.-S., Shen, X. and Cai, J., (2020), When Gaussian process meets big data: A review of scalable GPs, IEEE Transactions on Neural Networks and Learning Systems, 11, 4405–4423, IEEE.
[37] Mandal, A., Wong, W. K., and Yu, Y. (2015), Algorithmic searches for optimal designs, Handbook of Design and Analysis of Experiments, 755–783, CRC Press, Boca Raton, FL.
[38] MacDonald, B., Chipman, H., Campbell, C. and Ranjan, P. (2019), GPfit: Gaussian Processes Modeling, URL: https://cran.r-project.org/web/packages/GPfit/index.html, R package version 1.0-8.
[39] McKay, M. D., Beckman, R. J. and Conover, W. J., (1979), Comparison of three methods for selecting values of input variables in the analysis of output from a computer code, Technometrics, 2, 239–245, Taylor & Francis.
[40] Misiak, M., and others, (2016), Evolutionary algorithms in astrodynamics, International Journal of Astronomy and Astrophysics, 4, 435–439, Scientific Research Publishing.
[41] Mohanty, S., (2012), Particle Swarm Optimization and regression analysis - I, Astronomical Review, 2, 29–35, Taylor & Francis.
[42] Mohanty, S. and Fahnestock, E., (2020), Adaptive spline fitting with particle swarm optimization, Computational Statistics, 155–191, Springer.
[43] Morris, M. D. and Mitchell, T. J., (1995), Exploratory designs for computational experiments, Journal of Statistical Planning and Inference, 3, 381–402, Elsevier.
[44] Qian, P. Z. G., Wu, H. and Wu, C. F. J., (2008), Gaussian process models for computer experiments with qualitative and quantitative factors, Technometrics, https://arxiv.org/abs/2010.08941.
[47] Roustant, O., Ginsbourger, D., Deville, Y., Clement, C. and Richet, Y. (2020), DiceKriging: Kriging Methods for Computer Experiments, URL: https://cran.r-project.org/web/packages/DiceKriging/index.html, R package version 1.5.8.
[48] Sacks, J., Welch, W. J., Mitchell, T. J., and Wynn, H. P. (1989), Design and analysis of computer experiments, Statistical Science, 409–423.
[49] Steinberg, D. M. and Lin, D. K. J., (2006), A construction method for orthogonal Latin hypercube designs, Biometrika, 2, 279–288, JSTOR.
[50] Stevenson, S., Vigna-Gómez, A., Mandel, I., Barrett, J. W., Neijssel, C. J., Perkins, D. and De Mink, S. E., (2017), Formation of the first three gravitational-wave observations through isolated binary evolution, Nature Communications, 1, 1–7.
[51] Sun, F., Liu, M.-Q. and Lin, D. K. J., (2010), Construction of orthogonal Latin hypercube designs with flexible run sizes, Journal of Statistical Planning and Inference, 11, 3236–3242, Elsevier.
[52] Sun, F. and Tang, B., (2017), A general rotation method for orthogonal Latin hypercubes, Biometrika, 2, 465–472, Oxford University Press.
[53] Swiler, L. P., Hough, P. D., Qian, P. Z. G., Xu, X., Storlie, C. and Lee, H., (2014), Surrogate models for mixed discrete-continuous variables, Constraint Programming and Decision Making, 181–202, Springer.
[54] Tang, B., (1993), Orthogonal array-based Latin hypercubes, J. Amer. Statist. Assoc., 424, 1392–1397, Taylor & Francis.
[55] Vernon, I., Goldstein, M. and Bower, R. G., (2010), Galaxy formation: a Bayesian uncertainty analysis, Bayesian Analysis, 4, 619–669, International Society for Bayesian Analysis.
[56] Vigna-Gómez, A., Neijssel, C. J., Stevenson, S., Barrett, J. W., Belczynski, K., Justham, S., de Mink, S. E., Müller, B., Podsiadlowski, P., Renzo, M. and Szécsi, D., (2018), On the formation history of Galactic double neutron stars, Monthly Notices of the Royal Astronomical Society, 3, 4009–4029.
[57] Wackernagel, H. (2002), Multivariate Geostatistics, Springer.
[58] Wang, H., Xiao, Q., and Mandal, A., (2020), LHD: Latin Hypercube Designs (LHDs), URL: https://CRAN.R-project.org/package=LHD, R package version 1.3.1.
[59] Wang, H., Xiao, Q. and Mandal, A., (2020), Musings about constructions of efficient Latin hypercube designs with flexible run-sizes, arXiv preprint arXiv:2010.09154v2.
[60] Wang, L., Xiao, Q. and Xu, H., (2018), Optimal maximin L1-distance Latin hypercube designs based on good lattice point designs, Annals of Statistics, 6B, 3741–3766, Institute of Mathematical Statistics.
[61] Welch, W. J., Buck, R. J., Sacks, J., Wynn, H. P., Mitchell, T. J. and Morris, M. D. (1992), Screening, predicting, and computer experiments, Technometrics, 15–25.
[62] Williams, D., Heng, I. S., Gair, J., Clark, J. A. and Khamesra, B., (2019), A precessing numerical relativity waveform surrogate model for binary black holes: A Gaussian process regression approach, arXiv preprint arXiv:1903.09204.
[63] Xiao, Q., (2017), Constructions and Applications of Space-Filling Designs, Ph.D. Dissertation, University of California, Los Angeles.
[64] Xiao, Q., Mandal, A., Lin, C. D., and Deng, X. (2021), EzGP: Easy-to-interpret Gaussian process models for computer experiments with both quantitative and qualitative factors, under revision for SIAM/ASA Journal on Uncertainty Quantification.
[65] Xiao, Q. and Xu, H., (2017), Construction of maximin distance Latin squares and related Latin hypercube designs, Biometrika, 2, 455–464, Oxford University Press.
[66] Xiao, Q. and Xu, H., (2018), Construction of maximin distance designs via level permutation and expansion, Statistica Sinica, 3, 1395–1414, JSTOR.
[67] Xiong, Y., Chen, W., Apley, D. W. and Ding, X. (2007), A non-stationary covariance-based kriging method for metamodelling in engineering design, Internat. J. Numer. Methods Engrg., 733–756.
[68] Yang, J. and Liu, M.-Q., (2012), Construction of orthogonal and nearly orthogonal Latin hypercube designs from orthogonal designs, Statistica Sinica, 433–442, JSTOR.
[69] Ye, K. Q., (1998), Orthogonal column Latin hypercubes and their application in computer experiments, Journal of the American Statistical Association, 444, 1430–1439, Taylor & Francis.
[70] Zhang, Y. and Notz, W. I., (2015), Computer experiments with qualitative and quantitative variables: a review and reexamination, Quality Engineering, 1, 2–13, Taylor & Francis.
[71] Zhang, Y., Tao, S., Chen, W. and Apley, D. W., (2019), A latent variable approach to Gaussian process modeling with qualitative and quantitative factors, Technometrics, 1–12, Taylor & Francis.
[72] Zhou, Q., Qian, P. Z. G. and Zhou, S., (2011), A simple approach to emulation for computer models with qualitative and quantitative factors, Technometrics, 3, 266–273, Taylor & Francis.