Optimal Design Emulators: A Point Process Approach
Matthew T. Pratola, C. Devon Lin, and Peter F. Craigmile
Department of Statistics, The Ohio State University, Columbus, OH 43210, USA
Queen's University, Department of Mathematics and Statistics, Jeffery Hall, University Avenue, Kingston, ON K7L 3N6, Canada
[email protected], [email protected], [email protected]
Last updated May 17, 2019
Abstract
Design of experiments is a fundamental topic in applied statistics with a long history. Yet its application is often limited by the complexity and costliness of constructing experimental designs in the first place, which involve searching a high-dimensional input space and evaluating computationally expensive criterion functions. In this work, we introduce a novel approach to the challenging design problem. We will take a probabilistic view of the problem by representing the optimal design as being one element (or a subset of elements) of a probability space. Given a suitable distribution on this space, a generative point process can be specified from which stochastic design realizations can be drawn. The appropriate class of point processes is motivated by exploring a connection to Latin Hypercube designs. We then describe a scenario where the classical (point estimate) entropy-optimal design for Gaussian Process regression coincides with the mode of a particular point process. We conclude with outlining an algorithm for drawing such design realizations, its extension to sequential design, and applying the techniques developed to constructing space-filling designs for Stochastic Gradient Descent and entropy designs for Gaussian process regression.

Keywords: Big data, Design of experiments, Determinantal point process, Exchange algorithm, Gaussian process, Stochastic gradient descent, Sequential design
1 Introduction

Optimal design of experiments (DOE) is a fundamental topic in statistics with a long history [e.g. Fedorov, 1972, Silvey, 1980, Kiefer et al., 1985, Atkinson et al., 2007, Dean et al., 2017]. Yet more recently, DOE has seemed to attract less attention from theoreticians, methodologists, and practitioners. One reason for this decreased interest may be due to an increased reliance on observational data, but this ignores current challenges in the statistical analysis of data. Datasets are becoming ever larger and more complex. Statistical models are more complicated than they once were. Uncertainty quantification is more challenging. As the challenges increase, DOE should play an important role in modern statistical analyses.

Typically DOE targets a particular aspect of a model which is deemed "important", such as the prediction error in a spatial model, or the variance of some set of parameters of a regression model. Given n input settings ξ_1, . . . , ξ_n drawn from some set χ ⊂ R^d, a design criterion L(ξ_1, . . . , ξ_n) is specified, and the n-run optimal design involves finding the input settings Ξ_n that minimize this criterion:

Ξ_n = arg min_{ξ_1, . . . , ξ_n} L(ξ_1, . . . , ξ_n).    (1)

In many settings, solving this problem is challenging. Assuming χ is a discretized candidate set of cardinality N, the number of possible designs to explore is (N choose n). The (Fedorov)
Exchange Algorithm [Fedorov, 1972] is the most popular approach to solving this problem, which performs one-at-a-time updates to the design. Unfortunately, the optimization problem is notoriously difficult due to the large number of possible designs and the multi-modality of the optimization problem. More recently, modern optimization algorithms such as particle-swarm methods and simulated annealing [Chen et al., 2013] have been applied, but these can be difficult to implement reliably in modern settings [e.g. Nguyen et al., 2019].

When the dimensionality of the input space is high or the number of design points desired is large, performing optimal design remains practically infeasible. This is because constructing a designed experiment involves searching the d-dimensional input space χ ⊂ R^d, and, for each plausible solution, calculating an optimality criterion which can itself be computationally expensive. This challenging problem has typically only been made tractable by changing the optimality criterion to one based on a simplified model that is more computationally amenable. This approach is sometimes justifiable but rarely broadly desirable.

The goal of this work is to introduce a novel approach to the challenging problem of constructing optimal designs. Our focus in particular is designs for Gaussian process (GP) regression models, which have broad applications in spatial statistics, computer experiments and statistical/machine learning [Sacks et al., 1989, Furrer et al., 2006, Cressie and Johannesson, 2008, Banerjee et al., 2008, Higdon et al., 2008, Guhaniyogi et al., 2011, Sang et al., 2011, Katzfuss, 2013, Pratola et al., 2013, 2014, Gramacy and Apley, 2015, Katzfuss and Hammerling, 2017]. We take a probabilistic view of the problem that is more general than the traditional probabilistic view previously discussed in the DOE literature [Kiefer et al., 1985, Müller, 2007]. The idea is to represent the collection of points that form an optimal design as being one element (or a subset of isometrically equivalent elements) of a stochastic process defined on the space of point patterns. By specifying an appropriate generative point process (PP) for this distribution, we introduce the idea of an optimal design emulator, where the classical optimal design solution typically coincides with the mode of this generative stochastic process. Since the generative process can be specified in terms of a low-dimensional parameter space, constructing an optimal design reduces to drawing a realization of this process given appropriately tuned parameter settings of the process rather than performing a difficult optimization problem. Among other things, this approach lends itself to a wide selection of Markov Chain Monte Carlo (MCMC) tools for efficient computation, and also gives a measure of how optimal the design drawn actually is. As such, our work draws a connection between PP models and optimal designs – particularly for Gaussian processes – while taking a distinctively Bayesian perspective on the design problem.

1.1 GP Regression and Design

The Gaussian process (GP) regression model is used extensively in modern applications as a model of an unknown, potentially smoothly varying, process f(x), observed at continuous inputs x ∈ R^d. The continuous inputs x represent the input settings where our process may be observed and/or predicted. In contrast, the discrete, countable, set of design candidates is represented as χ ⊂ R^d, from which n-run experimental designs
Ξ_n = {ξ_1, . . . , ξ_n} may be constructed, where ξ_i ∈ χ, i = 1, . . . , n. The process f(x) may or may not be observed with noise ε(x), leading to a model for the observations y(x) given by

y(x) = f(x) + ε(x).

The error term ε(x) is often assumed to be independent and identically distributed (i.i.d.) normal with mean zero and error variance σ_ε². In the simplest case, the unknown process f(x) is modeled as a stationary GP with mean E[f(x)] = µ(x) and covariance Cov(f(x), f(x′)) = σ² c(x, x′) at input settings x, x′ ∈ R^d, with mean model µ(x), process scale σ², and positive definite correlation function c(x, x′). The choice of mean function can be as simple as a constant or can include covariates that are related to f. A popular choice is the isotropic Gaussian correlation function [Stein, 2012], c(x, x′) = ρ^{||x − x′||²}, where ρ is the correlation parameter of the GP. Given n observations Y = (y(x_1), . . . , y(x_n)), and assuming µ(x) = 0 for all x, the GP model is

Y | σ², ρ ∼ N_n(0, σ² R + σ_ε² I),    (2)

where R_ij = c(x_i, x_j) is the (i, j) entry of the correlation matrix R.

In the setting of spatial statistics or computer experiments, the two most popular model-based design criteria are the integrated mean squared prediction error (IMSPE) optimal designs, L = ∫_χ E[(Y(x) − E[Y(x) | ξ_1, . . . , ξ_n])²] dx, and the entropy optimal designs, L = E[log(f_Y)], where f_Y is the usual multivariate Gaussian density corresponding to (2). IMSPE optimal designs are useful as they minimize the error in out-of-sample predictions. Entropy optimal designs provide improved estimates of the GP correlation parameter, ρ, which can be important for accurately quantifying prediction uncertainties, interpreting which variables are active in a variable selection problem [Morris et al., 1993, Linkletter et al., 2006], or improved estimation of the variogram [Cressie, 1993].

Both the IMSPE and entropy-based criteria involve O(n³) operations on the potentially large n × n correlation matrix R, which makes an already challenging optimization problem even more difficult. An alternative approach is to use model-robust designs that are geometrically motivated, such as the Latin Hypercube Sampling (LHS) designs [McKay et al., 1979], or other space-filling designs such as maximin distance designs [Nychka et al., 2015, Carnell, 2016]. LHS designs are popular model-free designs that are colloquially described as "space-filling" designs, and are theoretically justified as variance-reduction designs [McKay et al., 1979] while also having theoretical connections to Gaussian Process (GP) regression models [Johnson et al., 1990]. Maximin distance designs were found to be the limiting form of entropy optimal designs for GP regression as the correlation ρ decays to 0 [Johnson et al., 1990]. Similar to LHS and maximin designs, both IMSPE and entropy designs empirically lead to designs exhibiting "space-fillingness", that is, the chosen design points tend to spread out over the design region thereby filling in any empty space, such as the LHS sample shown in Figure 1(c). However, LHS and maximin designs are usually more amenable in terms of computational cost. In practice, a combined criterion is often used, such as the space-filling LHS implemented in the popular R package fields [Nychka et al., 2015].
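To make these objects concrete, the following minimal R sketch builds the correlation matrix R of model (2) for a candidate design under the isotropic Gaussian correlation and evaluates its log-determinant, the quantity underlying the entropy criterion revisited in Section 2. The helper names cor_mat and log_det, the design, and the value of ρ are ours, chosen purely for illustration.

cor_mat <- function(X, rho) {
  D <- as.matrix(dist(X))      # pairwise Euclidean distances between design points
  rho^(D^2)                    # isotropic Gaussian correlation rho^{||x - x'||^2}
}
log_det <- function(M) as.numeric(determinant(M, logarithm = TRUE)$modulus)

set.seed(0)
X <- matrix(runif(20 * 2), ncol = 2)   # an arbitrary 20-run design in [0,1]^2
R <- cor_mat(X, rho = 0.1)
log_det(R)                             # log det(R): the quantity the entropy criterion targets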
The statistical study of point patterns attempts to classify what type of pattern a given, observed, set of points exhibits. Such applications are numerous in the area of ecology, where often the spatial location and spatial density of an object being studied is to be inferred and possibly also used for prediction. For example, the locations of trees in a forest form a geographically indexed set of points, and the pattern exhibited by these points relates to the competition for resources between trees in a given forest, while such patterns of competition may also differ between species of trees.

A common summary measure of such point patterns are the nearest-neighbour distribution functions F(h) and G(h) [Diggle, 2006], defined as functions of distance, h, where F(h) denotes the probability a point has nearest grid neighbour less than h units away and G(h) denotes the probability a point has a nearest neighbour point less than h units away. These measures can be used to categorize point patterns as follows:

F(h) > G(h) ⇒ regular point pattern;
F(h) = G(h) ⇒ complete spatial randomness;
F(h) < G(h) ⇒ clustered point pattern.

These classification rules make intuitive sense. For instance, imagine a 2-dimensional space over which we overlay a regular grid, and for simplicity let us assume the grid spacing is 1 unit in size. For fixed h, when F(h) is large it means there is a high probability that one would observe a point ≤ h units from the nearest grid intersection. Similarly, for fixed h, when G(h) is large it means there is a high probability that one would observe a point ≤ h units from another point in the point pattern. So, when F(h) dominates G(h) for all h, it means points have a higher probability of being near a grid intersection than another point at all distances h, hence the notion of a 'regular' point pattern. Meanwhile, if G(h) dominates F(h) at all distances h, it means points have a higher probability of being near one another rather than being near a grid intersection, hence the notion of a 'clustered' point pattern. When both of these functions are approximately equal, then we have 'complete spatial randomness' – the pattern of points realized has no discernible pattern at all.

To visualize this classification of point patterns, we consider 3 scenarios: point patterns generated uniformly at random, point patterns generated to exhibit clustering, and point patterns generated as LHS designs using the R package lhs [Carnell, 2016]. All three examples were investigated in p = 2 dimensions with each pattern consisting of n = 20 points, and for each of these point patterns, we calculated empirical estimates of the functions F and G along with the empirical 95% pointwise confidence intervals over 1,000 replicates. Figure 1(a-c) displays sample draws of the random, clustered, and LHS point patterns. The corresponding mean and associated pointwise confidence intervals for the F and G functions are shown in Figure 1(d-f).

Figure 1: (a) Random point pattern generated by drawing from the uniform distribution. (b) Clustered point pattern generated by drawing from a Normal distribution. (c) LHS point pattern generated using the R package lhs. (d) Estimates of the nearest-neighbor distribution functions, F(h) and G(h), with 90% pointwise confidence intervals from 1,000 replicates of random designs. (e) Estimates of nearest-neighbor distribution functions, F(h) and G(h), with 90% pointwise confidence intervals from 1,000 replicates of clustered designs. (f) Estimates of nearest-neighbor distribution functions, F(h) and G(h), with 90% pointwise confidence intervals from 1,000 replicates of LHS space-filling designs.

From this figure, one would reasonably conclude that F(h) > G(h) for the point patterns generated as LHS designs. This suggests that space-filling designs belong to the class of regular point patterns rather than clustering, or random, point patterns. This recognition – interpreting the patterns of design points from the perspective of statistical models for point patterns – will lead to our proposal for a probabilistic emulator of optimal designs for GP regression.
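The classification above is easy to reproduce on single draws. The sketch below is our own illustration (not the code used for Figure 1): it uses the lhs and spatstat packages to estimate F(h) and G(h) on a common grid of distances for one random pattern and one LHS pattern.

library(spatstat)   # ppp(), Fest(), Gest()
library(lhs)        # randomLHS()

set.seed(1)
n   <- 20
win <- owin(c(0, 1), c(0, 1))
X_rand <- matrix(runif(2 * n), ncol = 2)   # completely random pattern
X_lhs  <- randomLHS(n, 2)                  # LHS pattern

pp_rand <- ppp(X_rand[, 1], X_rand[, 2], window = win)
pp_lhs  <- ppp(X_lhs[, 1],  X_lhs[, 2],  window = win)

h <- seq(0, 0.25, by = 0.01)               # common grid of distances
mean(Fest(pp_lhs,  r = h)$km >= Gest(pp_lhs,  r = h)$km)  # near 1: regular pattern
mean(Fest(pp_rand, r = h)$km >= Gest(pp_rand, r = h)$km)  # no systematic ordering under CSR

A single replicate is noisy; the paper's Figure 1 averages over 1,000 replicates rather than a single draw.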
Let Ξ = {ξ_1, . . . , ξ_n} represent a point pattern of cardinality n and let Z(x) represent a stochastic point process defined on all x ∈ R^d. Such a process assigns a probability measure F_Z : Ξ ⊆ D → [0, 1], where D ⊂ R^d could be a continuous subset of R^d, or some discrete, countable subset of cardinality |D| = N. In the former interpretation, the point process (PP) can be viewed as assigning the probability that a point will be realized in some infinitesimal region dx about x. In the latter interpretation, one can view the PP realization
as an N-vector such that n entries take the value 1, indicating presence of some element ξ ∈ D in the point pattern, and N − n entries take the value 0, indicating absence. In either setting, the point pattern is described by the number, n, of points making up the realization, and the locations of these points, here denoted by Ξ = {ξ_1, . . . , ξ_n} [Geyer and Møller, 1994, Diggle, 2006, Lavancier and Møller, 2015].

Typical applications of PP modeling are in spatial statistics [e.g. Diggle, 2006] where the locations are coordinates in a subregion of R², but we consider more generally the location of points in a subregion of R^d. We assume simple PP models, that is, at any given location x we will only ever realize at most one point at x. We also assume stationary PP models throughout, although this is not necessary in general. The simplest PP model is the Poisson [e.g. Daley and Vere-Jones, 2003], where the probability of a point in the domain D belonging to the point pattern is given by the Poisson distribution with rate parameter given by the first order intensity function, λ_Z, defined as

λ_Z(x) = lim_{|dx| → 0} E(Z(dx)) / |dx|,

which implies the average number of points one expects to be generated within an arbitrary subregion of D under the Poisson assumption. Furthermore, due to the memoryless property of the Poisson model, the existence of a point in some arbitrary Borel-measurable subregion B of D is independent of the existence of a point in some arbitrary Borel-measurable subregion B′ of D. For this reason, the Poisson model is referred to as a point process model of complete spatial randomness.

Besides the completely random point patterns of the Poisson model, we have already seen that there are two additional possibilities: clustered point patterns and regular point patterns. These cases will necessarily require additional parameterization to capture the form of non-independence of points that give rise to clustered or regular point patterns. The straightforward extension of the intensity function formulation of the Poisson model is to introduce the so-called second order intensity function, λ_{2,Z}, defined as
One can also classify the type of PPusing these intensity functions and the first order intensities by essentially considering thecovariance properties of the process [Diggle, 2006]: λ ,Z ( x , x (cid:48) ) < λ Z ( x ) λ Z ( x (cid:48) ) ⇒ regular point pattern ,λ ,Z ( x , x (cid:48) ) = λ Z ( x ) λ Z ( x (cid:48) ) ⇒ complete spatial randomness ,λ ,Z ( x , x (cid:48) ) > λ Z ( x ) λ Z ( x (cid:48) ) ⇒ clustered point pattern.That is, a positive covariance C Z ( x , x (cid:48) ) = λ ,Z ( x , x (cid:48) ) − λ Z ( x ) λ Z ( x (cid:48) ) indicates that it isrelatively more likely for points to co-occur than under the Poisson model, which generatesclustered point patterns. Similarly, a negative covariance indicates that it is relatively lesslikely for points to co-occur than under the Poisson model, which generates regular pointpatterns.In Section 2 we will explore the properties of space-filling (LHS) designs from the perspectiveof point processes (PPs). As alluded to above, the PP models that interest us are those thatgenerate point patterns that fall in the regular class. In fact, we will establish theoreticalconnections between PPs and LHS designs via the intensity functions defined above, andthen motivate the appropriate class of PP model for GP entropy designs. In Section 3 weintroduce a particular regular point process model, the determinantal point processes (DPPs)12Lavancier et al., 2015], and show how these replusive point process models are connected tothe stationary GP model outlined in Section 1.1. We then introduce the concept of a designemulator – a PP model that assigns a probability measure to the space of possible designpoint patterns such that the classical optimal design coincides with the mode – and devisean efficient algorithm that can be used to sample entropy optimal designs for GPs fromthe emulator. Section 4 explores examples of our design emulator applied to the popularstochastic gradient descent algorithm and for sequential GP regression. We conclude inSection 5. All proofs of results are presented in the Appendix. We start by exploring the point process connection to LHS designs more deeply before movingon to focus specifically on the GP regression scenario.
In Section 2 we will explore the properties of space-filling (LHS) designs from the perspective of point processes (PPs). As alluded to above, the PP models that interest us are those that generate point patterns that fall in the regular class. In fact, we will establish theoretical connections between PPs and LHS designs via the intensity functions defined above, and then motivate the appropriate class of PP model for GP entropy designs. In Section 3 we introduce a particular regular point process model, the determinantal point processes (DPPs) [Lavancier et al., 2015], and show how these repulsive point process models are connected to the stationary GP model outlined in Section 1.1. We then introduce the concept of a design emulator – a PP model that assigns a probability measure to the space of possible design point patterns such that the classical optimal design coincides with the mode – and devise an efficient algorithm that can be used to sample entropy optimal designs for GPs from the emulator. Section 4 explores examples of our design emulator applied to the popular stochastic gradient descent algorithm and for sequential GP regression. We conclude in Section 5. All proofs of results are presented in the Appendix.

We start by exploring the point process connection to LHS designs more deeply before moving on to focus specifically on the GP regression scenario.

LHS designs [McKay et al., 1979] are geometrically motivated to select design points that spread out over the design region of interest. They are popular due to their simple construction: for an n-run design in p dimensions, split each dimension into n bins and assign a random permutation of the bin indices 1, . . . , n to each column of the design matrix such that no row contains the same bin index more than once. A design constructed according to this simple algorithm is an LHS. An example LHS is shown in Figure 1(c), where the bin labels have been noted on the top and right axes, and it is evident that LHS designs tend to spread out over the design region of interest rather than exhibiting a clustering pattern. In practice, the actual design point might be taken as the centroid of each bin or a uniformly generated location within each bin, but the bin arrangement is the key part of the procedure. Given this construction of LHS designs, a direct connection between PPs and LHS designs can be established using the intensity function definitions described in Section 1.3, as summarized in Theorem 1.

Theorem 1.
LHS designs belong to the class of simple, regular PPs satisfying λ_{2,Z}(x, x′) < λ_Z(x) λ_Z(x′) for any x, x′ ∈ R^d.

Lemma 1. [Lehmann] Two random variables, X and X′, are said to be negative quadrant dependent (NQD) if P(X ≤ x, X′ ≤ x′) ≤ P(X ≤ x) P(X′ ≤ x′). Furthermore, let U = r(X_1, . . . , X_d) and V = s(X′_1, . . . , X′_d) for independent pairs of random variables (X_i, X′_i), i = 1, . . . , d. Then
U, V are NQD if, for each i, X_i, X′_i are NQD and r, s are concordant functions in the i-th coordinate.

Here, concordant functions are functions which are either both monotone increasing or both monotone decreasing in a single argument when holding the remaining arguments fixed. Now, let us extend the definition of NQD to a design as follows.

Definition.
Let Ξ_n = {X_1, . . . , X_n} be a design with X_i = (X_{i1}, . . . , X_{id}). If X_{ij}, X_{ik} are NQD for all j ≠ k and all i = 1, . . . , n, then we say Ξ_n is an NQD design.

Based on this definition and Lemma 1, we have the following theorem, which shows that designs falling in the class of regular PPs are necessarily NQD designs.

Theorem 2.
Let Ξ_n be a design with design points X_i = (X_{i1}, . . . , X_{id}) generated from a regular PP and suppose (X_{ik}, X_{jk}) are pairwise independent for all i ≠ j. Then Ξ_n is an NQD design.

The proof of Theorem 2 relies on the definition of the intensity function as the expectation of the Bernoulli random variable Z(dx) along with Lemma 1. A lemma of Hoeffding [Lehmann, 1966] extends the necessary condition to negative covariance.

Lemma 2. [Hoeffding] Let F denote the joint distribution of U, V and let F_U, F_V denote the marginal distributions of U, V respectively. Then
Cov(U, V) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} [F(u, v) − F_U(u) F_V(v)] du dv.

From Lemma 2 and Lemma 1, we immediately see that NQD implies negative covariance for pairs of input settings. Combined with the assumption of concordant functions, these results imply that designs falling in the class of regular PPs exhibit negative covariance among the selected design points. This interpretation in fact coincides with McKay et al. [1979]'s original motivation, and subsequent proof, for LHS designs being superior from the perspective of reducing the variance of estimators. In particular, for estimators of the form T = Σ_{i=1}^{n} g(Y_i), where Y_i = h(X_{i1}, . . . , X_{id}) is assumed to be monotonic in each of its arguments, the designed inputs, McKay et al. [1979] show that LHS designs have less variability than uniform random designs for such estimators; i.e., Var(T_L) ≤ Var(T_R), where T_L is the estimator T calculated using an LHS design while T_R refers to T calculated using random sampling (a uniform random sample from the underlying distribution of input variables). In their proof, the variance reduction is due to a negative covariance term that arises from a monotone function of the design points. Theorem 2 more broadly captures this idea and connects it to the regular PP class, albeit showing that in general we only have a necessary condition for being in the regular class.

Besides the general motivation of LHS designs as a variance-reduction technique for functionals of design points, LHS designs are also popular in GP regression [Johnson et al., 1990], which suggests that optimal designs for GP regression also lie in the class of regular PPs. By considering entropy optimal designs for GPs, we obtain a more direct connection to PPs, as we now demonstrate.

Let us consider the simplest case of our GP regression model defined in Section 1.1, with a mean trend of µ(x) = 0 and noise-free observations, i.e. σ_ε² = 0. The GP model in this case is commonly used in computer experiments and in spatial statistics where a "nugget" term is not required. In this setting, the entropy optimal criterion for an n-run design, ξ_1, . . . , ξ_n, drawn from the discrete candidate set χ ⊂ R^d can be shown to reduce to [Shewry and Wynn, 1987]

L(ξ_1, . . . , ξ_n) = − det(R),    (3)

where the n × n correlation matrix R has entries R_ij = c(ξ_i, ξ_j). Johnson et al. [1990], Morris et al. [1993], and Mitchell et al. [1994] established that entropy optimal designs for GPs are asymptotically equivalent to maximin designs as the correlation becomes weaker, i.e. as ρ → 0 in c(ξ, ξ′). This is a convenient result, as typically the correlation parameter ρ is not known a priori. This asymptotic result provides a justifiable approach for designing experiments, particularly before data has been observed: one can use an initial space-filling design to approximate, or emulate, entropy optimal designs for GPs. Subsequently, the initial space-filling design can be sequentially updated to add additional design points using the entropy criterion with an updated estimate of the correlation parameter ρ given the data collected so far.
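As a rough illustration of how such designs are typically searched for in practice, the sketch below implements a simple stochastic variant of the one-at-a-time exchange idea for criterion (3). It is not the full Fedorov update (which evaluates all candidate swaps), and the helper names, candidate grid, correlation value, and iteration budget are all ours.

cor_mat <- function(X, rho) rho^(as.matrix(dist(X))^2)   # Gaussian correlation
log_det <- function(M) as.numeric(determinant(M, logarithm = TRUE)$modulus)

exchange_design <- function(cand, n, rho, iters = 5000) {
  N    <- nrow(cand)
  idx  <- sample(N, n)                                   # random initial design
  best <- log_det(cor_mat(cand[idx, ], rho))
  for (it in seq_len(iters)) {
    swap_out <- sample(n, 1)                             # design point to swap out
    swap_in  <- sample(setdiff(seq_len(N), idx), 1)      # candidate point to swap in
    trial    <- idx
    trial[swap_out] <- swap_in
    val <- log_det(cor_mat(cand[trial, ], rho))
    if (val > best) { idx <- trial; best <- val }        # maximize det(R), i.e. minimize (3)
  }
  list(design = cand[idx, ], logdet = best)
}

cand <- as.matrix(expand.grid(x1 = seq(0, 1, length = 25), x2 = seq(0, 1, length = 25)))
fit  <- exchange_design(cand, n = 10, rho = 0.04)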
To illustrate the connection between PPs, space-filling designs, and entropy optimal designs for GPs, the following motivating simulation experiment is considered. For reasons which will shortly become clear, entropy optimal designs for our simplified (stationary, isotropic) GP model will generate point patterns that are stationary and isotropic. When a PP is stationary and isotropic with constant first-order intensity λ_Z, one can use Ripley's K function [Ripley, 1976, Diggle et al., 2010] to measure spatial dependence in the point patterns, defined to be

K_Z(r) = E_Z(r) / λ_Z,

where E_Z(r) is the expected number of points within a distance of radius r of an arbitrary point. The simulation experiment proceeds by generating 1,000 replicates of entropy optimal designs of size n = 30 in p = 2 dimensions. Each of these entropy optimal designs is constructed by starting with a random initial design which is then optimized using the Fedorov Exchange Algorithm for 20,000 iterations. The candidate set of input settings, χ, is taken to be the grid of 50 ×
50 equally spaced points in [0, 1]². The assumed exponential correlation function is c(ξ, ξ′) = ρ^{||ξ − ξ′||} with ρ = 0.04, where || · || is the L₂ distance metric. Figure 2 shows the estimated K_Z function [Diggle et al., 2010] averaged over the replicates and 95% pointwise uncertainty intervals for the generated optimal designs as a function of radius r. As a reference, the K_R function for a completely random (Poisson process) point pattern is also shown as the dashed line in the figure. As the estimated K_Z function is less than the K_R function, this indicates evidence of a regular structure to the point pattern [Ripley, 1977, Dixon, 2014].

Figure 2: Estimated Ripley's K function, K̂_iso, taken as the pointwise median with corresponding 90% uncertainty interval (shown in grey) from 1,000 replicates of estimated entropy-optimal designs (dotted line represents Ripley's function under random sampling, K_pois, for comparison).

The evident connection between entropy optimal designs and a regular point pattern is relevant to our goal of emulating designs because of a connection between entropy optimal designs and a particular stochastic process model for a regular PP, which we explore in the next section.

While the criterion-based view of design introduced in Section 1 is the popular interpretation in the literature, a more formal probabilistic exposition [Kiefer et al., 1985, Müller, 2007] was previously explored. Given such a probabilistic design model, one can imagine our design problem as simply being equivalent to sampling, that is, drawing a realization of a point pattern distributed according to our model. In the earlier probabilistic exposition from the design literature, this probability model places non-zero weight only on the optimal design points. Formally, we can think of this as a conditional distribution, where the conditionality arises from the criterion function of the design, L(ξ_1, . . . , ξ_n), achieving a particular value, L*, where

L* = min_{ξ_1, . . . , ξ_n} L(ξ_1, . . . , ξ_n).

More broadly, we can cast the design problem as one of needing to define a probability model on any finite collection of points which could make up our design. Unconditionally, we can imagine a corresponding probability model for a fixed cardinality, n, of the resulting point pattern. That is, unconditionally our model will define the probability of all n-run point patterns over the (discrete) sample space χ. The optimal design is one (or a small subset, say, due to isometries) of the point patterns which collectively form the sample space. Denote the stochastic process generating these point patterns by Z, let f_Z represent the probability mass (or density) function of this process, and let J(ξ_1, . . . , ξ_n) = M(L(ξ_1, . . . , ξ_n)) for some monotone transformation M such that J is a non-negative, Lebesgue-integrable function of the inputs.
Definition: We call the probability model represented by the mass (or density) function f_Z a design emulator if f_Z ∝ J, where f_Z and J are defined on the same support χ.

Note that this notion of a design emulator is in terms of the probability representation of a design. In other words, we aim to introduce a statistical model to emulate the probability of the proposed designs. Leveraging this alternative representation, our task will be to arrive at an appropriate emulator for the probability of design points and to use this model as a means of sampling the optimal design from the stochastic PP Z without resorting to the brute-force optimization techniques that are prevalent in optimal design.

In order to construct a PP-based design emulator for our GP regression model of interest, we are motivated by the so-called determinantal PP model from the PP literature, which we introduce in the next section. This model generates point patterns that fall in the regular pattern class of point processes, as introduced in Section 2. Later, we will explore a variant of this model which will motivate a computationally cheap algorithm for emulating the optimal design from our stochastic PP model.
An increasingly popular PP model that generates regular point patterns is the determinantal point process (DPP). The DPP was introduced to the statistics literature only recently [Hough et al., 2006, Kulesza and Taskar, 2011, Lavancier et al., 2014]. It has been applied to sparse variable selection problems [Rocková et al., 2015, Mutsuki and Fumiyasu, 2016] and statistical and machine learning [e.g., Kulesza and Taskar, 2013, Kang, 2013, Affandi et al., 2014, Dupuy and Bach, 2016, Xu et al., 2016].

For entropy optimal designs of GP models, the following result on discrete, finite DPPs due to Kulesza and Taskar [2013] motivates the use of DPP models.
Lemma 1. [Kulesza and Taskar, 2013] For a determinantal point process defined over a discrete candidate set χ ⊂ R^d with the positive semi-definite kernel function K(ξ, ξ′; θ), with ξ, ξ′ ∈ χ and known kernel function parameter θ, the probability mass function for a point pattern realization of cardinality n is given by f_Z ∝ det(K_Z), where the n × n positive semi-definite matrix K_Z has entries [K_Z]_{ij} = K(ξ_i, ξ_j; θ).

Based on this result, we immediately have the following.
Corollary.
The entropy-optimal design for GP regression model (2) with σ² = 1, σ_ε² = 0, and correlation function c(·, ·; ρ) corresponds to the mode of a determinantal PP with kernel function K(ξ, ξ′; θ) ≡ c(ξ, ξ′; ρ).

These results provide an elegant connection between entropy optimal designs for GP regression and using DPP models to essentially emulate the point pattern associated with the optimal design by placing a DPP prior on the space of point patterns to which the optimal design belongs. However, on the surface, finding the mode of the DPP is no easier than the usual optimization problem associated with finding the entropy optimal design.
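On a very small candidate set the corollary can be verified by brute force: enumerate every n-subset, evaluate f_Z ∝ det(K_Z), and confirm that the most probable point pattern is precisely the design maximizing det(R), i.e. the entropy-optimal design under (3). The sketch below (grid size, n, and ρ are arbitrary choices of ours) also makes plain why brute-force enumeration does not scale beyond toy problems.

cor_mat <- function(X, rho) rho^(as.matrix(dist(X))^2)   # Gaussian kernel/correlation

cand <- as.matrix(expand.grid(seq(0, 1, length = 5), seq(0, 1, length = 5)))  # N = 25
n    <- 3
subs <- combn(nrow(cand), n)              # all choose(25, 3) = 2300 candidate designs
dets <- apply(subs, 2, function(s) det(cor_mat(cand[s, ], rho = 0.1)))

best <- subs[, which.max(dets)]
cand[best, ]    # mode of the DPP, which is also the design maximizing det(R) in (3)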
A key result, due to Hough et al. [2006], leads to the following approximation to the DPP, known as the Determinantal Projection Point Process (DPPP).
Lemma 2. [Hough et al., 2006] Suppose Z is a DPP with kernel K defined over χ and write K(ξ, ξ′) = Σ_{k=1}^{N} λ_k φ_k(ξ) φ_k^T(ξ′), where the φ_k's are orthonormal eigenvectors of K with eigenvalues λ_k (k = 1, . . . , N). Define Z̃ to be a DPPP with kernel given by K̃(ξ, ξ′) = Σ_{k=1}^{N} B_k φ_k(ξ) φ_k^T(ξ′), where the B_k's follow independent Bernoulli(λ_k / (λ_k + 1)) distributions for k = 1, . . . , N. Then Z̃ =_d Z, where =_d denotes equality in distribution.

This result shows that any DPP can be represented as the weighted combination of so-called DPPP's. Hough et al. [2006] show that this result implies a sampling algorithm where one first generates the Bernoulli random variables B_1, . . . , B_N, where the number of points in the realization is n = Σ_{i=1}^{N} B_i, and then the locations of the points are generated by (suitably orthonormalized) vectors, whose L² norm is interpreted as a discrete probability measure. In other words, sampling from the DPP is simplified by the separation of how many points make up a realization and the location of points for a realization of a given size; that is,

P(Ξ_n = {ξ_1, . . . , ξ_n}, B_1 = b_1, . . . , B_N = b_N)
  = P(Ξ_n = {ξ_1, . . . , ξ_n} | B_1 = b_1, . . . , B_N = b_N) P(B_1 = b_1, . . . , B_N = b_N),

where n = Σ_{i=1}^{N} B_i is the total number of points appearing in a particular realization.

For our purposes, the generation of point patterns of a random cardinality is not relevant; however, the approximation introduced by the DPPP gives us the tools to specify a conditional framework that eventually can be used to give an (approximate) emulator of the entropy optimal design.

Kulesza and Taskar [2011] outline the notion of a fixed-rank DPP, a DPP with a fixed sample size n. That is, the sampling of the B_i's is conditional on Σ_{i=1}^{N} B_i = n. While technically elegant, their approach is less interpretable from a statistical modeling perspective. In our approach, we recognize the distribution of B_1, . . . , B_N given Σ B_i = n as a conditional Bernoulli distribution [Chen and Liu, 1997]. The advantage of this approach is two-fold. First, we can define a clear hierarchical statistical model for sampling from a fixed-rank DPP. Second, Chen and Liu [1997] provide no less than four algorithms for sampling from this conditional Bernoulli distribution, with differing computational and memory complexity tradeoffs. This allows one to provide a more efficient algorithm for constructing entropy optimal designs.

Conditioning on Σ B_i = n, we have the following hierarchical model for the fixed-rank DPP,

P(Ξ_n = {ξ_1, . . . , ξ_n}, B_1 = b_1, . . . , B_N = b_N | Σ B_i = n)
  = P(Ξ_n = {ξ_1, . . . , ξ_n} | {B_j = 1}_{j∈S}, {B_j = 0}_{j∈{1,...,N}\S})
    × P({B_j = 1}_{j∈S}, {B_j = 0}_{j∈{1,...,N}\S} | Σ_{i=1}^{N} B_i = n),

where S is the subset of indices of {1, . . . , N} of cardinality |S| = n. Calculation of the first term comes from L²-norms of appropriate orthonormalizations of the vectors, as shown in Algorithm 1.
The conditional Bernoulli probability can be calculated sequentially as

P({B_j = 1}_{j∈S}, {B_j = 0}_{j∈{1,...,N}\S} | Σ_{i=1}^{N} B_i = n)
  = P(B_1 = b_1 | Σ_{i=1}^{N} B_i = n)
    × P(B_2 = b_2 | B_1 = b_1, Σ_{i=1}^{N} B_i = n)
    × · · ·
    × P(B_N = b_N | B_1 = b_1, . . . , B_{N−1} = b_{N−1}, Σ_{i=1}^{N} B_i = n).

In Algorithm 1, K_χ denotes the kernel matrix constructed for the candidate set χ.

Algorithm 1: Generating a fixed-rank DPP realization.
  Input: φ_1, . . . , φ_N and λ_1, . . . , λ_N from the eigendecomposition of K_χ = Σ λ_i φ_i φ_i^T
  // Draw from the conditional Bernoulli distribution
  Set S_0 = {} and j = 0
  Repeat
    Set j = j + 1
    Let r = |S_{j−1}|. With probability P(j, r) (see Appendix)
      Set S_j = S_{j−1} ∪ {j}
  Until |S_j| = n
  Set S = S_n
  // Initialize required quantities given S
  Let v(ξ) = (φ_{S[1]}(ξ), . . . , φ_{S[n]}(ξ))^T
  Let e_j be the vector of 0's except 1 in the j-th position, j = 1, . . . , n
  // Draw the point pattern
  for j in n, . . . , 1
    Sample ξ_j from P_j(ξ) = (1/j) ( ||v(ξ)||² − Σ_{k=1}^{n−j} |e_k^T v(ξ)|² )
    Orthonormalize φ_1, . . . , φ_{j−1} with respect to e_j
  // Return the drawn fixed-rank point pattern realization
  return Ξ_n = (ξ_1, . . . , ξ_n)^T

3.3 Emulating the Optimal Design

Algorithm 1 will generate fixed-rank DPP realizations with a fixed number of points, and while these points should generally exhibit a regular pattern, there is no guarantee that any particular realization would be of especially high quality in terms of the regularity of the point pattern. That is, much like any stochastic process, it is always possible to draw a "bad" realization that has low probability.

Since the entropy optimal design for GP regression corresponds to the mode of a DPP by the corollary to Lemma 1, this suggests taking

Ξ_n = arg max_{ξ_1, . . . , ξ_n} P(Ξ_n = {ξ_1, . . . , ξ_n} | {B_j = 1}_{j∈S*}, {B_j = 0}_{j∈{1,...,N}\S*}),

where

S* = arg max_S P({B_j = 1}_{j∈S}, {B_j = 0}_{j∈{1,...,N}\S} | Σ_{i=1}^{N} B_i = n),    (4)

as our (approximate) emulator of the optimal design.

Theorem 3.
The index set given by (4) is S* = {1, . . . , n}.

Algorithm 2: Drawing a design from the entropy optimal design emulator.
  Input: φ_1, . . . , φ_n from the eigendecomposition of K_χ = Σ λ_i φ_i φ_i^T
  // Initialize required quantities
  Let v(ξ) = (φ_1(ξ), . . . , φ_n(ξ))^T
  Let e_j be the vector of 0's except 1 in the j-th position
  // Draw the point pattern
  for j in n, . . . , 1
    Set ξ_j = arg max_ξ P_j(ξ) = (1/j) ( ||v(ξ)||² − Σ_{k=1}^{n−j} |e_k^T v(ξ)|² )
    Orthonormalize φ_1, . . . , φ_{j−1} with respect to e_j
  // Return design drawn from the entropy optimal design emulator
  return Ξ_n = (ξ_1, . . . , ξ_n)^T

Theorem 3 shows that selecting the set S* amounts to calculating the first n (eigenvector, eigenvalue) pairs (i.e. such that λ_1 > λ_2 > . . . > λ_n) of the N × N kernel matrix K_χ. This eliminates the computational burden of conditional Bernoulli sampling, leading to a fast algorithm for emulating the optimal design. We refer to this fast sampling algorithm for the mode of the fixed-rank DPP as our optimal design emulator of entropy designs for GP regression.

The proposed pseudo-code for drawing from the design emulator is shown in Algorithm 2. Note that the inputs to Algorithm 2 depend on the matrix K_χ having been formed with a "suitable" value of the correlation parameter ρ. As noted earlier, space-fillingness occurs as ρ → 0, so one can use a small value of ρ to construct space-filling designs using the design emulator.
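A minimal R sketch of our reading of Algorithm 2 follows: take the leading n eigenvectors of the kernel matrix over the candidate set, then repeatedly select the candidate with the largest selection weight and project the remaining eigenvectors orthogonally to the selected location. The function name draw_emulator and the example grid and ρ are ours, and the elimination step is written as the standard Hough et al. [2006] projection update rather than a literal transcription of the pseudo-code.

draw_emulator <- function(K, n) {
  eg  <- eigen(K, symmetric = TRUE)
  V   <- eg$vectors[, seq_len(n), drop = FALSE]   # leading n eigenvectors: S* = {1, ..., n}
  sel <- integer(0)
  for (j in seq_len(n)) {
    w   <- rowSums(V^2)                  # unnormalized selection weights over candidates
    i   <- which.max(w)                  # mode step: argmax rather than sampling
    sel <- c(sel, i)
    j0  <- which(abs(V[i, ]) > 1e-12)[1] # remove the selected location from the basis
    v0  <- V[, j0]
    V   <- V[, -j0, drop = FALSE]
    if (ncol(V) > 0) {
      V <- V - outer(v0, V[i, ] / v0[i]) # zero out row i of the remaining eigenvectors
      V <- qr.Q(qr(V))                   # re-orthonormalize
    }
  }
  sel                                    # indices of the selected candidate points
}

# Example mimicking Figure 3: a 21-run design on a 30 x 30 grid in [0,1]^2
cand   <- as.matrix(expand.grid(seq(0, 1, length = 30), seq(0, 1, length = 30)))
K      <- 0.01^(as.matrix(dist(cand))^2)        # isotropic Gaussian kernel, rho = 0.01
design <- cand[draw_emulator(K, n = 21), ]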
An example of the design obtained from the design emulator in 2 dimensions is shown in Figure 3, along with 2 random realizations of the fixed-rank DPP drawn according to Algorithm 1 and a design constructed using the space-filling design function cover.design of Nychka et al. [2015]. Comparisons in terms of the determinants of the correlation matrices for the resulting designs as well as their runtimes are summarized in Table 1. Since it is typically recommended to perform random restarts of cover.design to obtain a solution closer to the global optima, we report the runtime for 0 and 100 restarts. Note that for any random realization of the fixed-rank DPP there is no guarantee that the sampled points will be especially good in terms of space-fillingness; here the two samples drawn seem poor. However, the design emulator results in a design that empirically fills the space well and also produces the best criterion value. At the same time, the design emulator was the fastest of all methods (even using unoptimized R code). While running random restarts of cover.design improved the criterion, it also significantly increases the computational cost.

Table 1: Values of the determinant of the correlation matrix for designs resulting from 2 random samples of the fixed-rank DPP, 2 designs generated by cover.design using 0 and 100 random restarts, and the emulated optimal design taken as the (approximate) mode of the fixed-rank DPP. The runtimes are indicated in seconds.

Design Algorithm   random fixed-rank DPP   random fixed-rank DPP   cover.design (reps=0)   cover.design (reps=100)   design emulator
det(R(Ξ_n))        2.15e-20                2.59e-19                1.71e-20                2.79e-20                 –
Runtime (s)        312.0                   441.5                   0.521                   37.76                    –

Figure 3: Random realizations drawn from the fixed-rank DPP (left two panels), a space-filling design constructed by 100 random restarts of the cover.design function from R package fields, and the optimal design emulator constructed as the (approximate) mode of the fixed-rank DPP with isotropic Gaussian correlation function in p = 2 dimensions with correlation parameter ρ = 0.01. All are n = 21 point designs.
The design drawn from the design emulator requires specification of the correlation function parameter, ρ, of the GP. Due to the emulator's speed, it becomes feasible to sample designs for different values of this parameter, or to sequentially update designs with refined estimates of ρ as data are collected. Here we will demonstrate a batch-sequential approach in the case of an isotropic GP.

From Lemma 1, we have that f_Z ∝ det(K_Z). Suppose an initial design of size n_1 has already been selected, consisting of locations Ξ_1 = (ξ_1, . . . , ξ_{n_1}) with the corresponding kernel submatrix K_{Ξ_1} from the overall kernel matrix K_χ defined on the candidate set χ. Then,

f_Z = f_{Z\Ξ_1} × f_{Ξ_1} ∝ det(K̃_{Z\Ξ_1}) × det(K_{Ξ_1}),

where det(K̃_{Z\Ξ_1}) = det(K_{Z\Ξ_1} − k_{Ξ_1,Z\Ξ_1}^T K_{Ξ_1}^{-1} k_{Ξ_1,Z\Ξ_1}) denotes the Schur complement of K_{Ξ_1} in K_Z, where K_{Z\Ξ_1} is the submatrix of K_Z excluding the rows and columns associated with the design points Ξ_1, and k_{Ξ_1,Z\Ξ_1} is the (rectangular) cross-kernel matrix between the design points Ξ_1 and the set of points Z \ Ξ_1.

The above decomposition allows for a simple sequential updating scheme. Suppose n_a design points Ξ_a have been selected so far (perhaps in a one-shot arrangement, or perhaps as a result of some previous sequential design point selection iterations). To select n_b additional design points, say Ξ_b, one performs the following steps.

1. Construct K̃_{χ\Ξ_a}. Note that this matrix assigns probability 0 to (re)selecting any of the first n_a points.
2. Apply Algorithm 2 using the constructed kernel matrix K̃_{χ\Ξ_a}. This will sample the next n_b sequential points, Ξ_b, conditional on already having selected the first n_a points.
3. The updated design of n = n_a + n_b points can then be returned as Ξ_n = Ξ_a ∪ Ξ_b.

The sequential selection steps could then be iterated again to perform further updates, as illustrated in the sketch below.
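The three steps above can be sketched directly in R, assuming the draw_emulator function from the sketch in Section 3.3 is available. In this sketch the already-selected points are removed from the candidate set rather than retained with probability zero, which has the same effect on the batch that is drawn; all function, variable, and parameter choices are ours.

sequential_update <- function(K, sel, n_b) {
  rest <- setdiff(seq_len(nrow(K)), sel)
  # Schur complement: K[rest, rest] - K[rest, sel] %*% K[sel, sel]^{-1} %*% K[sel, rest]
  Kc <- K[rest, rest] -
    K[rest, sel, drop = FALSE] %*% solve(K[sel, sel], K[sel, rest, drop = FALSE])
  c(sel, rest[draw_emulator(Kc, n_b)])   # previous points kept, n_b new points appended
}

# Example: a 4-point space-filling start, then a batch of 4 under an updated rho
cand <- as.matrix(expand.grid(seq(0, 1, length = 20), seq(0, 1, length = 20)))
K0   <- (1e-10)^(as.matrix(dist(cand))^2)   # near space-filling initial kernel
sel  <- draw_emulator(K0, 4)
K1   <- 0.001^(as.matrix(dist(cand))^2)     # kernel re-formed with a data-informed rho
sel  <- sequential_update(K1, sel, 4)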
This conditional approach to sequential design construction is very elegant. We note two situations where application of the approach is useful, and which will be demonstrated in Section 4.2. First, consider sequential designs for GP regression. An initial design would be constructed using a small setting of ρ to emulate a space-filling design. However, in subsequent sequential design updates, data will have been collected, giving information on a data-supported value of ρ. At each iteration, the kernel matrix can be updated using the most recent point estimate of ρ, thereby resulting in sequential designs that start off as space-filling but which evolve to extract more meaningful information from the process being observed:

K_χ(·, ·; ρ_0) → K̃_{χ\Ξ_1}(·, ·; ρ_1) → K̃_{χ\Ξ_1∪Ξ_2}(·, ·; ρ_2) → · · ·

where the initial "space-filling" value ρ_0 is subsequently updated based on the data collected.

Another relevant scenario is the desire to enforce certain projection properties of designs. Typically, the desired projection property is to enforce space-fillingness or non-collapsingness in all marginal dimensions as well as in the full d-dimensional design space. For instance, it is well known that Latin hypercube designs preserve space-fillingness of the 1-dimensional marginals but do not enforce this constraint on the higher-order marginals. Generally, adding this constraint to design construction has been a challenge, both in formulating an appropriate mathematical criterion and in optimizing the resulting criterion. Recently, Joseph et al. [2015] proposed a criterion that aims to preserve the space-fillingness constraint in all marginal sub-spaces when constructing d-dimensional designs, but in general there has been little work in this area due to the computational difficulty of finding such designs.

Our sequential formulation allows one to easily enforce the non-collapsing projection property constraint – that is, to remove the possibility of design points overlapping in their marginal projections. Let

S_1 = {ξ′ ∈ χ \ Ξ | ξ′_j = ξ_j where ξ ∈ Ξ for at least one j ∈ 1, . . . , d},
S_2 = {ξ′ ∈ χ \ Ξ | ξ′_j = ξ_j ∩ ξ′_k = ξ_k where ξ ∈ Ξ for at least one pair (j, k) ∈ 1, . . . , d},
S_3 = {ξ′ ∈ χ \ Ξ | ξ′_j = ξ_j ∩ ξ′_k = ξ_k ∩ ξ′_l = ξ_l where ξ ∈ Ξ for at least one tuple (j, k, l) ∈ 1, . . . , d},

and so on, where the higher-order sets S_4, . . . , S_{d−1} are similarly defined. Then, it is clearly the case that

S_1 ⊇ S_2 ⊇ · · · ⊇ S_{d−1}.

Therefore, to find the subset of candidate points that would violate the desired projection constraint in all marginal dimensions, it is sufficient to find the set S_1 alone. Notably, this is an operation that is O(N n d) in the worst case (where N is the cardinality of χ, n is the number of design points and d is the dimension of the input space), exhibiting no combinatorial explosion with dimension. Our batch-sequential design algorithm satisfying the non-collapsing constraint is outlined in Algorithm 3.

Algorithm 3: Batch-Sequential Design Emulator with Non-Overlapping Projections.
  Let Ξ_a be the n_a existing d-dimensional design points ξ_1, . . . , ξ_{n_a}
  Let χ be the original (full) set of candidate points
  Initialize S = Ξ_a
  // Construct the set of points violating the non-collapsing
  // projection constraint of the existing design
  for i in 1, . . . , n_a
    for j in 1, . . . , d
      for k in 1, . . . , N
        if ξ_{ij} == χ_{kj} and χ_k ∉ S
          S = S ∪ χ_k
  // Calculate the conditional kernel matrix
  K̃_{χ\S} = K_{χ\S} − k_{S,χ\S}^T K_S^{-1} k_{S,χ\S}
  // Draw n_b batch-sequential points using Algorithm 2 with the
  // first n_b eigenfunctions φ_1, . . . , φ_{n_b} from the eigendecomposition
  // of kernel matrix K̃_{χ\S} = Σ λ_i φ_i φ_i^T.

To motivate the interesting possibilities of using a fast design emulator, we consider a designed variant of the popular Stochastic Gradient Descent (SGD) algorithm as well as sequential designs for GP regression.

4.1 Designed Stochastic Gradient Descent
As mentioned earlier, one motivation for space-filling designs is as a variance-reduction technique when calculating statistical estimators. To demonstrate the potential for a computationally cheap design emulator for constructing space-filling designs in a modern context, we motivate possible modern applications in the big data setting involving the stochastic gradient descent (SGD) algorithm [Bottou, 2010]. SGD is used extensively in statistical machine learning to scale model training to big data for a variety of applications including linear models, clustering, GP regression and deep neural networks [e.g. Zhang, 2004, Sculley, 2010, Dean et al., 2012, Hensman et al., 2013, Wan et al., 2013, Sutskever et al., 2014, Badrinarayanan et al., 2015]. SGD works by using small random subsets of the data, called batches, to estimate the gradient. The essential idea of SGD is to sacrifice an increase in estimator variance for computational gain so the parameter space of the model can be more efficiently explored when fitting models to big data. Due to the speed of the proposed design emulator, we can replace SGD's random subset selection with a space-filling subset selection at each iteration of the algorithm, thereby recovering some of this variance tradeoff.

A small simulation was carried out by generating observations from a 5-dimensional linear regression model (similar to the popular Friedman function [Friedman, 1991]),

Y(x) = β_0 + β_1 sin(2π x_1 x_2) + β_2 (x_3 − 0.5)² + β_3 (x_4 − 0.5)² + β_4 x_4 + β_5 x_5 + ε,    (5)

where the regression coefficients β_0, . . . , β_5 were generated as Unif(−10, 10) and the observation errors as ε ∼ N(0, σ_ε²). The subset (batch) sizes considered were batchsize ∈ {23, 43, 63, 83}, as reported in Table 2. SGD iterates over all the batches of data in random order and repeats this entire process a number of times, called epochs. We used 200 epochs in this example. The total dataset size was 50 × batchsize and each study was replicated 100 times, using randomly drawn coefficients for each replicate. To evaluate the quality of the SGD solution, we compared the ratio of the average squared error of the regression coefficients for random subsets versus the designed subsets. For example, a ratio of 2 indicates the estimation error was twice as large as that of using SGD with the design emulator.
Subset Size   MSE(β̂|Ξ_R) / MSE(β̂|Ξ)  (one column per regression coefficient)
23            1.97   1.47   1.27   2.92   1.97
43            1.62   1.04   1.45   2.72   1.67
63            1.56   0.94   2.00   2.59   1.56
83            1.43   1.20   1.32   2.38   1.44

Table 2: Ratio of MSE of parameter estimates for random subsets (Ξ_R) versus designed subsets (Ξ) when using SGD to fit the model (5). Values greater than 1.0 indicate that the random subsets gave less accurate estimates than the designed subsets.

The results summarized in Table 2 show that even in such a simple example, the resulting error is usually 1.5-3 times worse based solely on how the subsets are selected from the dataset. This demonstrates a novel application of experimental design in the modern big data modeling setting that is enabled by the computationally cheap design emulator.
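The sketch below indicates one way the designed-subset idea could be wired into a plain SGD loop for a linear model; it is not the exact study reported in Table 2. It again assumes the draw_emulator function from the Section 3.3 sketch, and the learning rate, batch construction, and helper names are ours.

sgd_linear <- function(X, y, batches, epochs = 200, lr = 0.01) {
  beta <- rep(0, ncol(X))
  for (e in seq_len(epochs)) {
    for (b in sample(batches)) {                    # visit batches in random order
      Xb    <- X[b, , drop = FALSE]
      resid <- y[b] - drop(Xb %*% beta)
      grad  <- -2 * drop(crossprod(Xb, resid)) / length(b)
      beta  <- beta - lr * grad
    }
  }
  beta
}

designed_batches <- function(X, m, rho = 0.01) {
  left <- seq_len(nrow(X)); out <- list()
  while (length(left) >= m) {
    K    <- rho^(as.matrix(dist(X[left, , drop = FALSE]))^2)
    pick <- draw_emulator(K, m)                     # space-filling subset of the remaining rows
    out[[length(out) + 1]] <- left[pick]
    left <- setdiff(left, left[pick])
  }
  out                                               # leftover rows (< m) are ignored in this sketch
}

set.seed(4)
X  <- matrix(runif(400 * 5), ncol = 5)
y  <- drop(cbind(1, X) %*% c(2, 1, -3, 0.5, 4, -1)) + rnorm(400)
bd <- designed_batches(X, m = 40)                              # designed batches
br <- split(sample(nrow(X)), rep(seq_len(10), each = 40))      # random batches for comparison
beta_designed <- sgd_linear(cbind(1, X), y, bd)
beta_random   <- sgd_linear(cbind(1, X), y, br)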
4.2 Batch-Sequential Designs in GP Regression

Our second demonstration of the proposed technique considers constructing sequential designs for GP regression using Algorithm 3. As outlined in Section 1.1, we assume a stationary GP regression model with mean zero and Gaussian correlation function with no measurement error. First, consider this model in 2 dimensions and look at batch-sequential designs constructed 3-at-a-time and 4-at-a-time, shown in Figure 4. In this figure, the design points are denoted by solid dots and the light gray circles denote available candidate points which, if selected, would negatively impact the space-fillingness of marginal projections. The black circles denote available candidate points that would not negatively impact the marginal projections.

Initial designs were constructed with n = 3 (respectively n = 4) points and sequentially updated until n = 9 + 3 (respectively n = 12 + 4). For comparison, the single-shot designs constructed using Algorithm 2 are labelled as n = 12 (respectively n = 16). The examples shown in Figure 4 show that the sequential approach appears to construct space-filling designs that end up with a similar spread of points as the single-shot design. However, as the projection property is not enforced in this sequence, the marginal projections of the sequential designs are as poor as the single-shot designs. This is easily seen by comparing the number of black circles falling on each marginal dimension for n = 9 + 3 versus n = 12 and n = 12 + 4 versus n = 16. In particular, for the 3-at-a-time case, the marginal projections of the sequential and single-shot designs are equally bad.

Figure 4: Batch sequential designs. Row 1: batch size of 3. Row 2: batch size of 4. The single-shot designs (n = 12 and n = 16) are shown in the rightmost column. Solid dots denote designs while empty circles are candidate points. The gray circles denote candidates which would negatively impact the marginal projections. These sequential designs were constructed without regard to the marginal projections.

Next, we consider batch-sequential designs that incorporate the constraint that the marginal projections should not overlap, as described in Section 3.4. The 3-at-a-time and 4-at-a-time designs are shown in Figure 5. Enforcing this constraint did not have a noticeable computational effect, but the quality of designs is noticeably improved. For instance, the n = 9 + 3 design now has only 2 settings in the marginal projections which are unoccupied by design points, as compared to the n = 12 single-shot design which has 3 and 5 settings of the input dimensions unoccupied by design points. In the 4-at-a-time designs, the n = 12 + 4 design has no unoccupied marginal projections while the single-shot n = 16 design has 2 and 6 unoccupied marginal projections. These examples demonstrate the efficacy of applying Algorithm 3 to enforce the desired projection properties.

The designs constructed in Figures 4 and 5 use a correlation parameter of ρ = 1 × 10⁻¹⁰ to essentially be constructed as space-filling designs. However, in practice a sequential design
Figure 5: Batch sequential designs with the marginal projection constraint. Row 1: batch size of 3. Row 2: batch size of 4. The single-shot designs (n = 12 and n = 16) are shown in the rightmost column. Solid dots denote design points while empty circles are candidate points. The gray circles denote candidates which would negatively impact the marginal projections. Note how the final sequential designs (n = 9 + 3 and n = 12 + 4) have few overlapping 1-d projections while the single-shot designs (n = 12 and n = 16) have many.

The designs constructed in Figures 4 and 5 use a correlation parameter of ρ = 1 × 10^-10, so that they are essentially constructed as space-filling designs. However, in practice a sequential design could benefit from improved estimates of the GP correlation parameter obtained from the data collected at each step in the sequence. Using Algorithm 3, we can construct designs which sequentially take this into account, becoming less space-filling while still preserving the desired projection properties. We consider an evolution of ρ over the 4 updates as ρ = 1 × 10^-10 → ρ = 1 × 10^-5 → ρ = 0.001 → ρ = 0.001. The resulting sequential designs are shown in Figure 6. The evolution of both design sequences demonstrates a more centralized pattern to the points as the correlation is updated to represent an observed process that is smooth and slowly varying. Yet the marginal projection property is still satisfied, with no unoccupied marginal projections for the n = 9 + 3 and n = 12 + 4 designs, as compared to the single-shot designs with n = 12 and n = 16 runs.
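To experiment with this behaviour, the sketch below selects each new batch by greedily maximizing the log-determinant (entropy) criterion of the augmented design, which, with the existing points held fixed, is equivalent to maximizing the conditional criterion given those points. It assumes the Gaussian correlation form R(x, x') = ρ^(||x - x'||^2), and the helpers corr.mat(), logdet(), and next.batch(), the greedy selection (a simple stand-in for Algorithm 3), and the grid, initial design, and ρ schedule are all illustrative; the projection screening from the previous sketch is omitted for brevity.

# Batch-sequential entropy design with an evolving correlation parameter rho.
# Assumptions: Gaussian correlation R(x,x') = rho^(||x - x'||^2) on a gridded
# candidate set; greedy log-det maximization stands in for Algorithm 3.
corr.mat <- function(A, rho) rho^(as.matrix(dist(A))^2)
logdet   <- function(M) as.numeric(determinant(M, logarithm = TRUE)$modulus)

next.batch <- function(cand.idx, design.idx, batchsize, rho, cand) {
  for (i in seq_len(batchsize)) {
    score <- sapply(cand.idx, function(j)
      logdet(corr.mat(cand[c(design.idx, j), ], rho)))
    pick <- cand.idx[which.max(score)]
    design.idx <- c(design.idx, pick)
    cand.idx <- setdiff(cand.idx, pick)
  }
  design.idx
}

# Example: an 8 x 8 candidate grid, a 3-point initial design, and 3-point
# batches added while rho evolves from space-filling to smoother settings.
g <- (1:8 - 0.5) / 8
cand <- as.matrix(expand.grid(X1 = g, X2 = g))
design.idx <- c(5, 28, 60)
for (rho in c(1e-5, 0.001, 0.001)) {
  design.idx <- next.batch(setdiff(1:nrow(cand), design.idx),
                           design.idx, 3, rho, cand)
}
cand[design.idx, ]   # a 12-run (n = 9 + 3 style) sequential design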
Figure 6: Batch sequential designs with evolving ρ. Row 1: batch size of 3. Row 2: batch size of 4. The single-shot designs (n = 12 and n = 16) are shown in the rightmost column. Solid dots denote design points while empty circles are candidate points. The gray circles denote candidates which would negatively impact the marginal projections. These sequential designs were constructed to preserve the marginal projection constraint, and assume the correlation parameter is sequentially updated as ρ = 1 × 10^-10 → ρ = 1 × 10^-5 → ρ = 0.001 → ρ = 0.001. The single-shot designs (n = 12 and n = 16) were constructed using ρ = 1 × 10^-10.

Discussion

In this article we have introduced a novel probabilistic approach to constructing optimal designs based on point processes. In particular, we considered entropy-optimal designs, which have a clear connection to the popular space-filling designs used in computer experiments and spatial statistics, and we established their connection to DPPs. Using a discrete version of this representation, we arrived at a computationally efficient algorithm to sample the mode of this PP, which corresponds to the optimal design. Although our methodology only approximately samples the mode of this PP, in practice the sampled designs attained criterion values many orders of magnitude larger than those of random DPP samples or space-filling LHS designs, which implies the approximation is of high quality. Subsequently, we extended the method to allow for sequential design construction and to incorporate a popular marginal projection property constraint without losing the computational benefits of our basic algorithm. Since our approach to enforcing such constraints amounts to specifying a conditional kernel matrix, other constraints not considered in this paper could easily be implemented using the same basic approach.
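As a rough, self-contained illustration of such criterion comparisons, the following sketch contrasts the log-determinant (entropy) criterion of a greedily constructed design with that of a random Latin hypercube sample of the same size. The greedy search, the helper crit(), the Gaussian correlation form, and the value of ρ are illustrative assumptions standing in for the algorithm described in this paper, so only the direction of the gap, and not the reported magnitudes, should be read from the output.

# Entropy (log det) criterion: greedy design on a candidate grid versus a
# random Latin hypercube sample of the same size.  Assumptions: Gaussian
# correlation R(x,x') = rho^(||x - x'||^2) and an illustrative rho.
set.seed(2)
n <- 12; d <- 2; rho <- 1e-6
g <- (1:16 - 0.5) / 16
cand <- as.matrix(expand.grid(g, g))
crit <- function(A) as.numeric(determinant(rho^(as.matrix(dist(A))^2),
                                            logarithm = TRUE)$modulus)

chosen <- sample(nrow(cand), 1)                 # greedy log-det maximization
while (length(chosen) < n) {
  pool <- setdiff(seq_len(nrow(cand)), chosen)
  score <- sapply(pool, function(j) crit(cand[c(chosen, j), ]))
  chosen <- c(chosen, pool[which.max(score)])
}

lhs <- sapply(1:d, function(k) (sample(n) - runif(n)) / n)   # one random LHS

c(greedy = crit(cand[chosen, ]), lhs = crit(lhs))
# Typically the greedy design attains a much larger (less negative) log-determinant.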
The design algorithm introduced in this paper was demonstrated on the popular Stochastic Gradient Descent (SGD) algorithm, applied to fitting a model to the Friedman test function. SGD approximates the model gradient using a small batch sampled from the entire dataset and is widely used to fit complex statistical and machine learning models; it trades reduced computation time for increased variance in the estimated gradients. We demonstrated that using PPs to select non-uniform, space-filling batches from the design space noticeably improved the performance of the algorithm.

We also demonstrated the sequential variant of our algorithm for GP regression designs that incorporate marginal space-filling projections and/or updated parameter estimates in a sequential model-fitting exercise. The designs constructed clearly show the effect of incorporating the marginal projection property constraint and the effect of updating the parameter. This allows one to sequentially update the design, starting from a purely space-filling construction when no data have yet been observed, to one which places greater focus on estimating the model as more data are collected.

Taking inspiration from earlier, and simpler, probabilistic designs in the literature [e.g. Kiefer et al., 1985, Müller, 2007], this article provides a general approach to constructing designs from a probabilistic PP perspective. While we have focused on designs for stationary GP models, our approach would also apply to non-stationary GP models specified by a closed-form correlation function. The methods outlined in this paper are available on
CRAN in the R package demu, and we aim to further develop this approach to handle more complex scenarios such as high-dimensional design construction and high-dimensional statistical learning algorithms.

Acknowledgments
Peter F. Craigmile is supported in part by the US National Science Foundation (NSF) under grants NSF-DMS-1407604 and NSF-SES-1424481, and by the National Cancer Institute of the National Institutes of Health under Award Number R21CA212308. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. C. Devon Lin's research was supported by a Discovery Grant from the Natural Sciences and Engineering Research Council of Canada.
References
Affandi, R. H., Fox, E. B., Adams, R. P. and Taskar, B. (2014) Learning the parameters of determinantal point process kernels. In ICML, 1224–1232.
Atkinson, A., Donev, A. and Tobias, R. (2007) Optimum Experimental Designs, with SAS. Oxford, UK: Oxford University Press.
Badrinarayanan, V., Kendall, A. and Cipolla, R. (2015) SegNet: A deep convolutional encoder-decoder architecture for image segmentation. arXiv preprint arXiv:1511.00561.
Banerjee, S., Gelfand, A. E., Finley, A. O. and Sang, H. (2008) Gaussian predictive process models for large spatial datasets. Journal of the Royal Statistical Society, Series B, 825–848.
Bottou, L. (2010) On-line learning for very large datasets. Proceedings of COMPSTAT'2010, 177–186.
Carnell, R. (2016) lhs: Latin Hypercube Samples. URL: https://CRAN.R-project.org/package=lhs. R package version 0.14.
Chen, R. B., Hsieh, D. N., Hung, Y. and Wang, W. (2013) Optimizing Latin hypercube designs by particle swarm. Statistics and Computing, 663–676.
Chen, S. X. and Liu, J. S. (1997) Statistical applications of the Poisson-Binomial and conditional Bernoulli distributions. Statistica Sinica, 875–892.
Cressie, N. and Johannesson, G. (2008) Fixed rank kriging for very large spatial data sets. Journal of the Royal Statistical Society, Series B, 209–226.
Cressie, N. A. C. (1993) Statistics for Spatial Data, rev. edn. New York, New York: John Wiley & Sons Inc.
Daley, D. J. and Vere-Jones, D. (2003) An Introduction to the Theory of Point Processes, Volume I: Elementary Theory and Methods. New York, New York: Springer-Verlag.
Dean, A. M., Voss, D. and Draguljic, D. (2017) Design and Analysis of Experiments, Second Edition. New York, New York: Springer-Verlag.
Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., Ranzato, M., Senior, A., Tucker, P., Yang, K., Le, Q. V. and Ng, A. Y. (2012) Large scale distributed deep networks. In Advances in Neural Information Processing Systems 25, 1232–1240.
Diggle, P. J. (2006) Spatio-temporal point processes: Methods and applications. Monographs on Statistics and Applied Probability, 1.
Diggle, P. J., Menezes, R. and Su, T. (2010) Geostatistical inference under preferential sampling. Journal of the Royal Statistical Society: Series C (Applied Statistics), 191–232.
Dixon, P. M. (2014) Ripley's K function. Wiley StatsRef: Statistics Reference Online.
Dupuy, C. and Bach, F. (2016) Learning determinantal point processes in sublinear time. arXiv preprint arXiv:1610.05925.
Fedorov, V. V. (1972) Theory of Optimal Experiments. New York, New York: Elsevier.
Friedman, J. H. (1991) Multivariate adaptive regression splines. The Annals of Statistics, 1–67.
Furrer, R., Genton, M. G. and Nychka, D. (2006) Covariance tapering for interpolation of large spatial datasets. Journal of Computational and Graphical Statistics, 502–523.
Geyer, C. J. and Møller, J. (1994) Simulation procedures and likelihood inference for spatial point processes. Scandinavian Journal of Statistics, 359–373.
Gramacy, R. B. and Apley, D. W. (2015) Local Gaussian process approximation for large computer experiments. Journal of Computational and Graphical Statistics, 561–578.
Guhaniyogi, R., Finley, A. O., Banerjee, S. and Gelfand, A. E. (2011) Adaptive Gaussian predictive process models for large spatial datasets. Environmetrics, 997–1007.
Hensman, J., Fusi, N. and Lawrence, N. D. (2013) Gaussian processes for big data. In Conference on Uncertainty in Artificial Intelligence, 282–290. URL: auai.org.
Higdon, D., Gattiker, J., Williams, B. and Rightley, M. (2008) Computer model calibration using high-dimensional output. Journal of the American Statistical Association, 570–583.
Hough, J. B., Krishnapur, M., Peres, Y. and Virag, B. (2006) Determinantal processes and independence. Probability Surveys, 206–229.
Johnson, M. E., Moore, L. M. and Ylvisaker, D. (1990) Minimax and maximin distance designs. Journal of Statistical Planning and Inference, 131–148.
Joseph, V. R., Gul, E. and Ba, S. (2015) Maximum projection designs for computer experiments. Biometrika, 371–380.
Kang, B. (2013) Fast determinantal point process sampling with application to clustering. In Advances in Neural Information Processing Systems 26, vol. 2, 2319–2327.
Katzfuss, M. (2013) Bayesian nonstationary spatial modeling for very large datasets. Environmetrics, 189–200.
Katzfuss, M. and Hammerling, D. (2017) Parallel inference for massive distributed spatial data using low-rank models. Statistics and Computing, 363–375.
Kiefer, J. C., Brown, L. D., Olkin, I. and Sacks, J. (1985) Jack Carl Kiefer Collected Papers 3: Design of Experiments. New York, New York: Springer.
Kulesza, A. and Taskar, B. (2011) k-DPPs: Fixed-size determinantal point processes. Proceedings of the 28th International Conference on Machine Learning, 1193–1200.
— (2013) Determinantal point processes for machine learning. arXiv preprint arXiv:1207.6083.
Lavancier, F. and Møller, J. (2015) Modelling aggregation on the large scale and regularity on the small scale in spatial point pattern datasets. Scandinavian Journal of Statistics, 587–609.
Lavancier, F., Møller, J. and Rubak, E. (2014) Determinantal point process models and statistical inference. Journal of the Royal Statistical Society: Series B, 853–877.
— (2015) Determinantal point process models and statistical inference. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 853–877.
Lehmann, E. L. (1966) Some concepts of dependence. The Annals of Mathematical Statistics, 1137–1153.
Linkletter, C., Bingham, D., Hengartner, N., Higdon, D. and Kenny, Q. Y. (2006) Variable selection for Gaussian process models in computer experiments. Technometrics, 478–490.
McKay, M. D., Beckman, R. J. and Conover, W. J. (1979) A comparison of three methods for selecting values of input variables in the analysis of output from a computer code. Technometrics, 55–61.
Mitchell, T., Morris, M. and Ylvisaker, D. (1994) Asymptotically optimum experimental designs for prediction of deterministic functions given derivative information. Journal of Statistical Planning and Inference, 377–389.
Morris, M., Mitchell, T. and Ylvisaker, D. (1993) Bayesian design and analysis of computer experiments: use of derivatives in surface prediction. Technometrics, 243–255.
Müller, W. G. (2007) Collecting Spatial Data: Optimum Design of Experiments for Random Fields, 3rd Edition. New York, New York: Springer.
Mutsuki, K. and Fumiyasu, K. (2016) Determinantal point process priors for Bayesian variable selection in linear regression. Statistica Sinica, 97–117.
Nguyen, H., Craigmile, P. F. and Pratola, M. T. (2019) Near-optimal designs for Gaussian process regression models. Submitted.
Nychka, D., Furrer, R., Paige, J. and Sain, S. R. (2015) fields: Tools for spatial data. R package version 9.0.
Pratola, M. T., Chipman, H. A., Gattiker, J. R., Higdon, D. M., McCulloch, R. and Rust, W. N. (2014) Parallel Bayesian additive regression trees. Journal of Computational and Graphical Statistics, 830–852.
Pratola, M. T., Sain, S. R., Bingham, D., Wiltberger, M. and Rigler, J. (2013) Fast sequential computer model calibration of complex spatial-temporal processes. Technometrics, 232–242.
Ripley, B. D. (1976) The second-order analysis of stationary point processes. Journal of Applied Probability, 255–266.
— (1977) Modelling spatial patterns. Journal of the Royal Statistical Society: Series B, 172–192.
Ročková, V., Moran, G. and George, E. I. (2015) Determinantal regularization for ensemble variable selection. Tech. rep., University of Pennsylvania, Philadelphia and 19th International Conference on Artificial Intelligence & Statistics.
Sacks, J., Welch, W., Mitchell, T. and Wynn, H. (1989) Design and analysis of computer experiments (with discussion). Statistical Science, 409–423.
Sang, H., Jun, M. and Huang, J. Z. (2011) Covariance approximation for large multivariate spatial data sets with an application to multiple climate model errors. The Annals of Applied Statistics, 2519–2548.
Sculley, D. (2010) Web-scale k-means clustering. In Proceedings of the 19th International Conference on World Wide Web, 1177–1178. ACM.
Shewry, M. C. and Wynn, H. P. (1987) Maximum entropy sampling. Journal of Applied Statistics, 165–170.
Silvey, S. D. (1980) Optimal Design: An Introduction to the Theory for Parameter Estimation. New York, New York: Chapman & Hall.
Stein, M. L. (2012) Interpolation of Spatial Data: Some Theory for Kriging. New York, New York: Springer.
Sutskever, I., Vinyals, O. and Le, Q. V. (2014) Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27, 3104–3112.
Wan, L., Zeiler, M., Zhang, S., Cun, Y. L. and Fergus, R. (2013) Regularization of neural networks using DropConnect. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), 1058–1066.
Xu, Y., Müller, P. and Telesca, D. (2016) Bayesian inference for latent biologic structure with determinantal point processes (DPP). Biometrics, 955–964.
Zhang, T. (2004) Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the Twenty-first International Conference on Machine Learning, 919–926. ACM.