Fast calculation of Gaussian Process multiple-fold cross-validation residuals and their covariances
David Ginsbourger∗ and Cédric Schärer

Department of Mathematics and Statistics, University of Berne, Switzerland

January 11, 2021

Abstract
We generalize fast Gaussian process leave-one-out formulae to multiple-fold cross-validation, highlighting in turn in broad settings the covariance structure of cross-validation residuals. The employed approach, which relies on block matrix inversion via Schur complements, is applied to both Simple and Universal Kriging frameworks. We illustrate how resulting covariances affect model diagnostics and how to properly transform residuals in the first place. Beyond that, we examine how accounting for dependency between such residuals affects cross-validation-based estimation of the scale parameter. It is found in two distinct cases, namely in scale estimation and in broader covariance parameter estimation via pseudo-likelihood, that correcting for covariances between cross-validation residuals leads back to maximum likelihood estimation or to an original variation thereof. The proposed fast calculation of Gaussian Process multiple-fold cross-validation residuals is implemented and benchmarked against a naive implementation, all in R language. Numerical experiments highlight the accuracy of our approach as well as the substantial speed-ups that it enables. It is noticeable however, as supported by a discussion on the main drivers of computational costs and by a dedicated numerical benchmark, that speed-ups steeply decline as the number of folds (say, all sharing the same size) decreases. Overall, our results enable fast multiple-fold cross-validation, have direct consequences in GP model diagnostics, and pave the way to future work on hyperparameter fitting as well as on the promising field of goal-oriented fold design.
∗Institute of Mathematical Statistics and Actuarial Science, Department of Mathematics and Statistics, University of Berne, Alpeneggstrasse 22, CH-3012 Bern, Switzerland. E-mail: [email protected]

Introduction

We focus on cross-validation in the context of Gaussian process (GP) modelling, and more specifically on adapting fast formulae for leave-one-out to multiple-fold cross-validation and on exploring how these can be efficiently exploited in topics such as model diagnostics and covariance hyperparameter fitting. One of the essential bricks of the proposed approaches turns out to be closed-form formulae for the covariance structure of cross-validation residuals. Our main results are presented both in the Simple Kriging and Universal Kriging frameworks. They contribute to a lively stream of research on cross-validation that pertains to GP modelling, and also to further related classes of statistical models.

Leave-one-out and multiple-fold cross-validation in the GP framework were tackled as early as in [6], and some further references as well as a more recent account of their use in machine learning are given in Chapter 5 of [18]. A survey of cross-validation procedures for model selection is presented in [1]. Cross-validation for selecting a model selection procedure is tackled in [23]. [10] proposed cokriging-based sequential design strategies using fast cross-validation for multi-fidelity computer codes. [7] studied model selection for GP regression by Approximation Set Coding. [13] suggested new probabilistic cross-validation estimators for Gaussian process regression. Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure were investigated in [19]. New prediction error estimators were defined in [16], where new model selection criteria in the same spirit as AIC and Mallows' $C_p$ were suggested. Leave-one-out cross-validation for Bayesian model comparison in large data was recently tackled in [11]. [9] introduced an efficient batch multiple-fold cross-validation Voronoi adaptive sampling technique for global surrogate modeling. [17] introduced a bias-corrected cross-validation estimator for correlated data.

Throughout the paper we focus on a random vector $Z = (Z_{x_1}, \ldots, Z_{x_n})^\top$ of values taken by a random field $(Z_x)_{x \in D}$ at given points $x_1, \ldots, x_n \in D$ ($n \geq 1$). $D$ is typically thought of as a compact set of the $d$-dimensional Euclidean space but can actually be any set. Furthermore, we denote random vector realizations using lower case letters, so that a typical realization of $Z$ is thus noted $z$; $z$ is silently assumed of finite norm (as holds almost surely for $Z$). We assume here that $Z_x = \mu(x) + \eta_x$, where $(\eta_x)_{x \in D}$ is a centred Gaussian random field with kernel $k : D \times D \to \mathbb{R}$ that may depend on so-called covariance hyperparameters $\psi \in \Psi$ (the set $\Psi$ being case-dependent and specified upon usage), and that the trend function $\mu(x)$ may be assumed known (case of Simple Kriging, see Section 2.1) or depending linearly on unknown trend parameters $\beta$ (see Universal Kriging in Section 2.3, including its Bayesian interpretation where $\beta$ is assumed a priori independent of the random field $\eta$). Most importantly, we denote by $K = (k(x_i, x_j))_{i,j \in \{1,\ldots,n\}}$ the covariance matrix (assumed invertible) of the Gaussian random vector $\eta = (\eta_{x_1}, \ldots, \eta_{x_n})^\top$, and by $k(x) = (k(x, x_i))_{i \in \{1,\ldots,n\}}$ the vector of covariances between $\eta_x$ and the $\eta_{x_i}$'s, where $x \in D$ is an arbitrary point of the index set.
Denoting by $\mu = (\mu(x_1), \ldots, \mu(x_n))^\top$ the vector of trend values at the observation points, we recall below the Simple Kriging equations:
$$\widehat{z}(x) = \mu(x) + k(x)^\top K^{-1}(z - \mu),$$
$$s^2(x) = k(x, x) - k(x)^\top K^{-1}k(x).$$
$\widehat{z}(x)$ and $s^2(x)$ are called Simple Kriging mean and
Simple Kriging variance, respectively, and are in turn the posterior mean and variance of $Z_x$ knowing the observations in the presently assumed Gaussian framework. Let us recall that the equations of the Simple Kriging mean and variance are also used in non-Gaussian frameworks of second-order random fields, where they respectively stand for the best linear unbiased predictor and its associated mean squared error. Let us also remark that one can always boil down in that case to Simple Kriging with a null trend by working on the centred field $(\eta_x)_{x \in D} = (Z_x - \mu(x))_{x \in D}$, and so we assume without loss of generality that $\mu = 0$ in the following quick review of fast leave-one-out formulae.

Leave-one-out (LOO) cross-validation consists in predicting at each of the observation locations $x_i$ when (virtually) removing the corresponding observation from the data set, and then comparing, for each $i \in \{1, \ldots, n\}$, the predicted value to the left-out observation, also in light of the associated prediction variance. Coming back to the formalism, we denote here by $e_i = z_i - \widehat{z}_{-i}$ the $i$th leave-one-out prediction error, where $\widehat{z}_{-i}$ stands for the Simple Kriging mean at point $x_i$ when excluding the $i$th observation from the "learning set", and by $s_{-i}$ the corresponding leave-one-out prediction standard deviation. It has been found as early as in 1983 (see Dubrule's seminal paper [6]) that the LOO prediction errors and variances can be calculated quite elegantly and efficiently by relying on matrix inverses and quantities derived thereof. Working here as in [2] in terms of covariances and denoting by $K^{ij}$ the coefficient on the $i$th row and $j$th column of the inverse covariance matrix $K^{-1}$, one obtains indeed, by using for instance the Schur complement formula of Theorem 2 with properly chosen blocks, that
$$s_{-i}^2 = (K^{ii})^{-1} \quad (i \in \{1, \ldots, n\}), \qquad (1)$$
and subsequently that
$$e_i = (K^{ii})^{-1}(K^{-1}z)_i \quad (i \in \{1, \ldots, n\}). \qquad (2)$$
In compact form, the vector of LOO errors writes
$$e = \mathrm{diag}\big((K^{ii})^{-1}\big)\,K^{-1}z. \qquad (3)$$
This elegant formula has been leveraged in [2] for hyperparameter estimation in the stationary case, namely by minimizing the squared norm of leave-one-out errors as a function of covariance parameters (excluding the "variance" parameter $\sigma^2 := \mathrm{Var}[\eta_x]$), using the compact expression
$$\|e\|^2 = z^\top K^{-1}\,\mathrm{diag}\big((K^{ii})^{-1}\big)^2\,K^{-1}z. \qquad (4)$$
In particular, numerical experiments conducted in [2] suggested that leave-one-out error minimization is more robust than maximum likelihood in the case of model misspecification, a common situation in practice.

Yet, depending on the experimental design (meaning the choice of the $x_i$'s), the errors obtained by leaving one point out at a time may be far from representative of generalization errors. While remaining in a framework at given design, multiple-fold cross-validation allows mitigating shortcomings of leave-one-out by considering situations where larger parts of the design (i.e., several points) are removed at a time.

Our primary aim in the present paper is to generalize fast leave-one-out formulae to multiple-fold cross-validation, hence facilitating their use in both contexts of model diagnostics and hyperparameter optimization. We derive such formulae and highlight in turn the covariance structure of cross-validation residuals, a crucial ingredient in the proposed diagnostics and findings pertaining to parameter fitting.
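For concreteness, the fast LOO formulae of Eqs. 1-3 can be sketched in a few lines of R; the squared-exponential kernel, the toy data, and all object names below are illustrative choices of ours, not taken from any package:

```r
# Minimal sketch of the fast LOO formulae (Eqs. 1-3) for a centred GP;
# kernel and data are illustrative assumptions.
set.seed(1)
x <- seq(0, 1, length.out = 10)
z <- sin(6 * x) + 0.1 * rnorm(10)            # toy observations (mu = 0 assumed)
K <- exp(-0.5 * outer(x, x, "-")^2 / 0.2^2)  # covariance matrix of the eta_{x_i}'s
Kinv <- solve(K)

s2_loo <- 1 / diag(Kinv)                     # Eq. (1): s_{-i}^2 = (K^{ii})^{-1}
e_loo  <- s2_loo * (Kinv %*% z)              # Eq. (3): e = diag((K^{ii})^{-1}) K^{-1} z

# Naive check: for each i, predict z_i from the other observations
e_naive <- sapply(seq_along(x), function(i)
  z[i] - K[i, -i] %*% solve(K[-i, -i], z[-i]))
max(abs(e_loo - e_naive))                    # agreement up to numerical error
```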
Furthermore, following ideas already sketched in the seminal paper of Dubrule [6], we present extensions of our results to the case of Universal Kriging. The obtained results clarify some links between cross-validation and maximum likelihood, and also open new perspectives regarding the design of folds and ultimately also of the underlying experimental design points.

Section 2 is dedicated to fast multiple-fold cross-validation. We first present in Section 2.1 our main results on the fast calculation and the joint probability distribution of cross-validation residuals in the Simple Kriging case. The latter are illustrated based on analytical test functions in Section 2.2 and then extended to Universal Kriging settings in Section 2.3. In Section 3 we present consequences of the main results in the context of model fitting. Section 3.1 focuses on graphical diagnostics via pivotal statistics for cross-validation residuals, while the two subsequent sections are devoted to cross-validation-based covariance parameter estimation. More specifically, Section 3.2 is devoted to the estimation of the scale parameter; Section 3.3 then pertains to the estimation of further hyperparameters, building upon either the norm or the (pseudo-)likelihood of multiple-fold cross-validation residuals. Section 4 mostly consists of numerical experiments illustrating the accuracy and speed-ups offered by the closed-form formulae (Section 4.1), and of some considerations regarding associated computational complexities (Section 4.2). The ultimate Section 5 is a discussion opening perspectives on how the established results could be further exploited to improve diagnostics and parameter estimation procedures in GP modelling.
Fast multiple-fold cross-validation

Throughout the following we consider vectors of strictly ordered indices from $\{1, \ldots, n\}$, and denote by $S$ the set of all such index vectors. Let us remark that $S$ is equipotent to $\mathcal{P}(\{1, \ldots, n\}) \setminus \{\emptyset\}$ and hence consists of $2^n - 1$ elements, as many as there are (non-void) subsets of $\{1, \ldots, n\}$.

As we now tackle distributional properties of cross-validation residuals, we mostly use random vector notations. We denote by $Z_{\mathbf{i}}$ the subvector of $Z$ associated with a vector of strictly ordered indices $\mathbf{i} \in S$, by $\widehat{Z}_{\mathbf{i}}$ the vector of Simple Kriging mean values at points $x_i$ ($i \in \mathbf{i}$) based on the observation of responses at the remaining points, denoted in short $Z_{-\mathbf{i}}$, and by $E_{\mathbf{i}} = Z_{\mathbf{i}} - \widehat{Z}_{\mathbf{i}}$ the Simple Kriging residual obtained when predicting at locations indexed by $\mathbf{i}$ based on observations at the remaining locations. Note that this residual is well-defined at least in an almost sure sense, and the following equalities between random vectors will be in an almost sure sense unless precised otherwise.

Finally, for any $n \times n$ matrix $M$ and $\mathbf{i}, \mathbf{j} \in S$, we denote by $M[\mathbf{i}, \mathbf{j}] = M_{\mathbf{i},\mathbf{j}}$ the block extracted from $M$ with row and column indices from the index vectors $\mathbf{i}$ and $\mathbf{j}$, respectively, and use the shortcut notation $M[\mathbf{i}]$ for $M[\mathbf{i}, \mathbf{i}] = M_{\mathbf{i},\mathbf{i}}$. Note also that while the indexing by $\mathbf{i}$ in $Z_{\mathbf{i}}$ corresponds to taking the subvector $Z[\mathbf{i}]$, things are different for $\widehat{Z}_{\mathbf{i}}$ as it does not follow from a simple extraction.

Our first result concerns the distribution of the random field of cross-validation residuals indexed by the set of possible $\mathbf{i}$'s, i.e. $(E_{\mathbf{i}})_{\mathbf{i} \in S}$. While its elements have a dimensionality that depends on $\mathbf{i}$, the random field $(E_{\mathbf{i}})_{\mathbf{i} \in S}$ has a Gaussian distribution in the sense that any concatenation of $E_{\mathbf{i}}$'s forms a Gaussian vector, as further detailed in the theorem below.

Theorem 1.
For any $\mathbf{i} \in S$, the Simple Kriging residual $E_{\mathbf{i}} = Z_{\mathbf{i}} - \widehat{Z}_{\mathbf{i}}$ obtained when predicting at locations indexed by $\mathbf{i}$ based on observations at the remaining ones writes
$$E_{\mathbf{i}} = (K^{-1}[\mathbf{i}])^{-1}(K^{-1}Z)_{\mathbf{i}}. \qquad (5)$$
Consequently, for any $q > 1$ and $\mathbf{i}_1, \ldots, \mathbf{i}_q \in S$, the $E_{\mathbf{i}_j}$ ($1 \le j \le q$) are jointly Gaussian, centred, and with covariance structure given by
$$\mathrm{Cov}(E_{\mathbf{i}}, E_{\mathbf{j}}) = (K^{-1}[\mathbf{i}])^{-1}\,K^{-1}[\mathbf{i}, \mathbf{j}]\,(K^{-1}[\mathbf{j}])^{-1} \quad (\mathbf{i}, \mathbf{j} \in S). \qquad (6)$$
In particular, for the case of folds $\mathbf{i}_1, \ldots, \mathbf{i}_q$ forming a partition of $\{1, \ldots, n\}$ and such that the concatenation of $\mathbf{i}_1, \ldots, \mathbf{i}_q$ gives $(1, \ldots, n)$, the concatenated vector of cross-validation residuals $E_{\mathbf{i}_1,\ldots,\mathbf{i}_q} := (E_{\mathbf{i}_1}^\top, \ldots, E_{\mathbf{i}_q}^\top)^\top$ has an $n$-dimensional centred Gaussian distribution with covariance matrix
$$\mathrm{Cov}(E_{\mathbf{i}_1,\ldots,\mathbf{i}_q}) = DK^{-1}D, \qquad (7)$$
where $D = \mathrm{blockdiag}\big((K^{-1}[\mathbf{i}_1])^{-1}, \ldots, (K^{-1}[\mathbf{i}_q])^{-1}\big)$.

Proof. Equation 5 follows from suitably expressing blocks of $K$'s inverse using the Schur complement approach of Equations 42 and 43, delivering respectively
$$K^{-1}[\mathbf{i}] = \big(K[\mathbf{i}] - K[\mathbf{i}, -\mathbf{i}]\,K[-\mathbf{i}]^{-1}\,K[-\mathbf{i}, \mathbf{i}]\big)^{-1}, \text{ and}$$
$$K^{-1}[\mathbf{i}, -\mathbf{i}] = -\big(K[\mathbf{i}] - K[\mathbf{i}, -\mathbf{i}]\,K[-\mathbf{i}]^{-1}\,K[-\mathbf{i}, \mathbf{i}]\big)^{-1}K[\mathbf{i}, -\mathbf{i}]\,K[-\mathbf{i}]^{-1}.$$
Considering the rows indexed by $\mathbf{i}$ of the product $K^{-1}Z$, we hence get that
$$(K^{-1}Z)_{\mathbf{i}} = K^{-1}[\mathbf{i}]\,Z_{\mathbf{i}} - K^{-1}[\mathbf{i}]\,K[\mathbf{i}, -\mathbf{i}]\,K[-\mathbf{i}]^{-1}Z_{-\mathbf{i}} = K^{-1}[\mathbf{i}]\,(Z_{\mathbf{i}} - \widehat{Z}_{\mathbf{i}}),$$
whereof $E_{\mathbf{i}} = (K^{-1}[\mathbf{i}])^{-1}(K^{-1}Z)_{\mathbf{i}}$. In order to highlight the joint Gaussianity and the covariance structure at once, let us further define $\Delta_{\mathbf{i}} = I_n[\mathbf{i}, (1, \ldots, n)]$ to be the $|\mathbf{i}| \times n$ "subsetting" matrix. We then have that for any $\mathbf{i} \in S$, $E_{\mathbf{i}} = (K^{-1}[\mathbf{i}])^{-1}\Delta_{\mathbf{i}}K^{-1}Z$, so that concatenating any finite number $q \ge 1$ of residuals $E_{\mathbf{i}_1}, \ldots, E_{\mathbf{i}_q}$ leads to a Gaussian vector, by left multiplication of $Z$ by a deterministic matrix. As for the covariance matrix of $E_{\mathbf{i}_1,\ldots,\mathbf{i}_q}$, it follows from
$$(E_{\mathbf{i}_1}^\top, \ldots, E_{\mathbf{i}_q}^\top)^\top = \begin{pmatrix}(K^{-1}[\mathbf{i}_1])^{-1}\Delta_{\mathbf{i}_1}K^{-1}Z \\ \vdots \\ (K^{-1}[\mathbf{i}_q])^{-1}\Delta_{\mathbf{i}_q}K^{-1}Z\end{pmatrix} = \mathrm{blockdiag}\big((K^{-1}[\mathbf{i}_1])^{-1}, \ldots, (K^{-1}[\mathbf{i}_q])^{-1}\big)\begin{pmatrix}\Delta_{\mathbf{i}_1} \\ \vdots \\ \Delta_{\mathbf{i}_q}\end{pmatrix}K^{-1}Z = DI_nK^{-1}Z = DK^{-1}Z,$$
hence $\mathrm{Cov}(E_{\mathbf{i}_1,\ldots,\mathbf{i}_q}) = DK^{-1}KK^{-1}D^\top = DK^{-1}D$.
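A direct transcription of Theorem 1 in R might look as follows; this is a sketch under illustrative kernel, data and fold choices, where the folds are ordered and concatenate to $(1, \ldots, n)$ so that Eq. 7 applies:

```r
# Sketch of Theorem 1: multiple-fold CV residuals (Eq. 5) and the covariance
# of their concatenation (Eq. 7); kernel, data and folds are illustrative.
set.seed(2)
n <- 12
x <- seq(0, 1, length.out = n)
z <- sin(6 * x)
K <- exp(-0.5 * outer(x, x, "-")^2 / 0.2^2)
Kinv <- solve(K)
folds <- list(1:3, 4:6, 7:9, 10:12)   # a partition concatenating to (1, ..., n)

KinvZ <- Kinv %*% z
E <- lapply(folds, function(i) solve(Kinv[i, i], KinvZ[i]))   # Eq. (5)

D <- matrix(0, n, n)                  # D = blockdiag((K^{-1}[i_j])^{-1})
for (i in folds) D[i, i] <- solve(Kinv[i, i])
covE <- D %*% Kinv %*% D              # Eq. (7): Cov of concatenated residuals

# Check the first fold against a naive held-out prediction
i <- folds[[1]]
max(abs(E[[1]] - (z[i] - K[i, -i] %*% solve(K[-i, -i], z[-i]))))
```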
Remark 1. Note that for arbitrary $\mathbf{i}_1, \ldots, \mathbf{i}_q \in S$, i.e. without imposing ordering between them or that they form a partition, we would have a similar result yet without the above simplification, i.e. $\mathrm{Cov}(E_{\mathbf{i}_1,\ldots,\mathbf{i}_q}) = D\Delta K^{-1}\Delta^\top D$ with $\Delta = (\Delta_{\mathbf{i}_1}^\top, \ldots, \Delta_{\mathbf{i}_q}^\top)^\top$. An extreme case would be to consider all possible non-empty subsets of $\{1, \ldots, n\}$, leading to $q = 2^n - 1$ and $n2^{n-1}$ lines for $\Delta$.
Remark 2. In the extreme case where $q = n$ and the $\mathbf{i}_j$'s are set to $(j)$ ($1 \le j \le n$), one recovers on the other hand the fast leave-one-out cross-validation formulae, and we obtain as a by-product the covariance matrix of leave-one-out errors
$$\mathrm{diag}\big((K^{ii})^{-1}\big)\,K^{-1}\,\mathrm{diag}\big((K^{ii})^{-1}\big), \qquad (8)$$
so that, in particular, $\mathrm{cov}(E_i, E_j) = (K^{ii})^{-1}K^{ij}(K^{jj})^{-1}$, whereof $\mathrm{corr}(E_i, E_j) = (K^{ii})^{-1/2}K^{ij}(K^{jj})^{-1/2}$, leading to a correlation matrix of LOO residuals
$$\mathrm{Corr}(E_{(1),\ldots,(n)}) = \mathrm{diag}\big((K^{ii})^{-1/2}\big)\,K^{-1}\,\mathrm{diag}\big((K^{ii})^{-1/2}\big) = D^{1/2}K^{-1}D^{1/2}. \qquad (9)$$

1-d function

As a first example, we consider a case of GP prediction for a one-dimensional test function from [22],
$$f : x \in [0, 1] \mapsto f(x) = \sin\big(30(x - 0.9)^4\big)\cos\big(2(x - 0.9)\big) + (x - 0.9)/2. \qquad (10)$$
For simplicity we consider a regular design, here a 10-point subdivision of $[0, 1]$.
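For reproducibility, the test function of Eq. 10 and the regular design can be coded as follows (the GP model itself, fitted in the paper's experiments with the km function of the R package DiceKriging, is left out of this sketch):

```r
# Test function from [22] (Eq. 10) and the 10-point regular design
f <- function(x) sin(30 * (x - 0.9)^4) * cos(2 * (x - 0.9)) + (x - 0.9) / 2
design <- seq(0, 1, length.out = 10)
observations <- f(design)
```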
Figure 1: On the upper panel, GP prediction (blue line) of the test function (black line) defined by Equation 10 based on 10 evaluations at a regular grid, and LOO cross-validation predictions (red points). Lower panel: absolute prediction errors associated with GP (black line) and LOO (red points) predictions, and GP prediction standard deviation (in blue).

By design, the considered example function presents moderate fluctuations in the first third of the domain, then a big up-and-down in the second third, and finally is relatively flat in the last third. Of course, appealing to a non-stationary GP that could capture this is legitimate in such a case (and has been done, for instance, in [22, 12]), yet we rather stick here to a baseline stationary model and start by illustrating how cross-validation highlights this heterogeneity. Looking first at the upper panel of Figure 1, we see that LOO residuals tend to take their larger magnitudes on the boundaries and around the centre of the domain. A closer look reveals that the residual at the left boundary point has a higher magnitude than the one at the right boundary, and that the location with the highest LOO residual is at the top of the central bump. The graph on the lower panel of Figure 1 represents the absolute prediction errors associated with the baseline ("basic") GP model (black line), with the LOO cross-validation (red points), as well as the prediction standard deviation coming with the basic GP model (blue line). While the GP prediction standard deviation appears to behave in a homogeneous way across the domain (it is known to depend solely on the evaluation locations and not on the responses, a phenomenon referred to as homoscedasticity in the observations), the true absolute prediction errors are concentrated on the left half of the domain, with a largest peak around the left boundary and several bumps spreading out towards the centre of the domain. Contrarily to what the LOO magnitude at the rightmost point suggests, there is no substantial prediction error in this region of the domain. In all, while LOO versus actual absolute prediction errors are not in complete agreement, LOO signals more difficulty to predict in the left domain half, something that the prediction standard deviation attached to the considered stationary GP model is by design unable to detect.

Besides this, as a by-product of Theorem 1 (in the limiting case of LOO, see Remark 2), we represent on Figure 2 the correlations between the LOO prediction residual at the leftmost location versus the LOO residuals at the other locations.
Figure 2: Correlations between the LOO residual at the leftmost point and the LOO residuals at all design points in the example displayed on Figure 1.

A first interesting thing that we would like to stress here is that the correlation with the LOO prediction residual at the second location is negative, with a value below −0.5.
While such a negative correlation between successive LOO residuals does not materialize with the actual prediction errors at these two locations, an illustration of this effect can be seen on the upper panel of Figure 1 when looking at the fourth and fifth locations. Coming back to Figure 2, after the second location the correlation goes back to positive but with a smaller magnitude, and subsequently appears to continue with a damped oscillation until stationing around zero. As we will see in Section 3, accounting for these correlations is instrumental in producing suitable Q-Q plots from cross-validation residuals. Let us now turn to a further example focusing on multiple-fold cross-validation.
We now mimic the previous example but consider, instead of our 10 previous points, a design of 20 points formed by 10 pairs of neighbouring points. First, a regular design of 10 points between $\epsilon$ and $1 - \epsilon$ is formed, where $\epsilon$ is a constant with value prescribed to 0.001 in this example.
Then, for every point of this base design, two design points are created by adding realizations of independently generated uniform random variables on $[-\epsilon, \epsilon]$.
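The paired design can be generated along the following lines (a sketch; the seed is an illustrative choice of ours):

```r
# 20-point design made of 10 pairs of close neighbours (eps = 0.001)
set.seed(42)                      # illustrative seed
eps  <- 0.001
base <- seq(eps, 1 - eps, length.out = 10)
design20 <- sort(c(base + runif(10, -eps, eps),
                   base + runif(10, -eps, eps)))
```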
Figure 3: Absolute prediction errors associated with GP (black line) and cross-validation (red points) predictions, and GP prediction standard deviation (in blue). Upper panel: LOO. Lower panel: 10-fold CV where each fold contains two neighbouring points from the example, at distances upper bounded by $2\epsilon = 2 \cdot 10^{-3}$.

As can be seen on the upper panel of Figure 3, LOO is here completely myopic to prediction errors, precisely because of the paired points. Indeed, when leaving one point out at a time, there is always a very close neighbour that enables making an accurate Kriging prediction. This somewhat extreme situation highlights how LOO (and more generally cross-validation with a poorly designed partition of input points) may lead to a bad picture of the model's generalization ability. When performing multiple-fold cross-validation with grouping of the pairs, however, the residuals are much more insightful, as illustrated on the lower panel of Figure 3. Let us further remark that Theorem 1 delivers in turn the respective covariance matrices between residuals within groups. Let us now generalize our previous results to the settings of Universal Kriging.

In the presence of a random field possessing a linear trend with known basis functions but unknown trend coefficients to be estimated within Kriging, Universal Kriging (UK) appears as a standard option. In particular, UK enables an integrated treatment of prediction uncertainty, with an inflated prediction variance compared to Simple Kriging that reflects the additional estimation of trend coefficients. Details of Universal Kriging in the Gaussian case are recalled in Appendix B. Beyond its inception in spatial statistics [14], UK has also been popularized in the design and analysis of computer experiments [21] and is implemented in several dedicated software packages (including [20, 5]). While it has been known since [6] that fast cross-validation formulae were achievable in the UK framework, we provide here a systematic approach to it that relies on block matrix calculations and generalizes the results from the previous section.

Sticking to the notation adopted in Appendix B, recall that an essential matrix arising in UK calculations is the augmented covariance matrix
$$\widetilde{K} = \begin{pmatrix} K & F \\ F^\top & 0 \end{pmatrix},$$
where $F$ is the $n \times p$ experimental/design matrix (assumed to have full column rank), with general entry $F_{i,j} = f_j(x_i)$ ($1 \le i \le n$, $1 \le j \le p$, where $p$ is the number of basis functions $f_j$). A key to our approach to fast cross-validation in the UK framework will be to apply block matrix inversion to $\widetilde{K}$. To that end, let us first consider without loss of generality the case where observation 1 is left out, and denote for brevity $K_{1,-1} = K[(1), -(1)]$, $K_{-1,1} = K[-(1), (1)]$, $K_{-1} = K[-(1), -(1)]$ (similarly for $\widetilde{K}$), and $F[-1,] = F[-(1), (1, \ldots, p)]$.
We can then write $\widetilde{K}$ in block form as follows:
$$\widetilde{K} = \begin{pmatrix} \widetilde{K}_{1,1} & \widetilde{K}_{1,-1} \\ \widetilde{K}_{-1,1} & \widetilde{K}_{-1} \end{pmatrix} = \begin{pmatrix} K_{1,1} & K_{1,-1} & f(x_1)^\top \\ K_{-1,1} & K_{-1} & F[-1,] \\ f(x_1) & F[-1,]^\top & 0_{p \times p} \end{pmatrix}, \qquad (11)$$
with $\widetilde{K}_{1,1} = K_{1,1}$, $\widetilde{K}_{1,-1} = (K_{1,-1}, f(x_1)^\top)$, and
$$\widetilde{K}_{-1} = \begin{pmatrix} K_{-1} & F[-1,] \\ F[-1,]^\top & 0_{p \times p} \end{pmatrix}.$$
From there, assuming that $F[-1,]$ is full column-ranked, the usual Schur complement block inversion formula delivers
$$\widetilde{K}^{-1} = \begin{pmatrix} V^{-1} & -V^{-1}\widetilde{K}_{1,-1}\widetilde{K}_{-1}^{-1} \\ -\widetilde{K}_{-1}^{-1}\widetilde{K}_{-1,1}V^{-1} & \text{non-represented block} \end{pmatrix}, \qquad (12)$$
where the short notation $V = \widetilde{K}_{1,1} - \widetilde{K}_{1,-1}\widetilde{K}_{-1}^{-1}\widetilde{K}_{-1,1}$ is used. A nice thing with the latter, as we show below relying on Equation 11 and further use of block inversion, is that $V$ turns out to coincide with the Universal Kriging variance (when predicting at $x_1$ based on observations at the other points), while $(\widetilde{K}^{-1}[\{1, \ldots, n\}]Z)_1$ delivers the corresponding UK prediction residual. To see the former, let us first remark that, by using the Schur complement formula with respect to the bottom right block of $\widetilde{K}_{-1}$, one obtains, with the shorthand notation $I_{\beta,-1} = F[-1,]^\top K_{-1}^{-1}F[-1,]$,
$$\widetilde{K}_{-1}^{-1} = \begin{pmatrix} K_{-1} & F[-1,] \\ F[-1,]^\top & 0_{p \times p} \end{pmatrix}^{-1} = \begin{pmatrix} K_{-1}^{-1} - K_{-1}^{-1}F[-1,]\,I_{\beta,-1}^{-1}\,F[-1,]^\top K_{-1}^{-1} & K_{-1}^{-1}F[-1,]\,I_{\beta,-1}^{-1} \\ I_{\beta,-1}^{-1}F[-1,]^\top K_{-1}^{-1} & -I_{\beta,-1}^{-1} \end{pmatrix}. \qquad (13)$$
Plugging Equation 13 into the definition of $V$ above and remembering that $\widetilde{K}_{1,-1} = (K_{1,-1}, f(x_1)^\top)$, we obtain that
$$\widetilde{K}_{1,-1}\widetilde{K}_{-1}^{-1}\widetilde{K}_{-1,1} = K_{1,-1}K_{-1}^{-1}K_{-1,1} - \big(f(x_1) - F[-1,]^\top K_{-1}^{-1}K_{-1,1}\big)^\top I_{\beta,-1}^{-1}\big(f(x_1) - F[-1,]^\top K_{-1}^{-1}K_{-1,1}\big), \qquad (14)$$
and one hence recognizes that $V = K_{1,1} - \widetilde{K}_{1,-1}\widetilde{K}_{-1}^{-1}\widetilde{K}_{-1,1}$ coincides with the UK prediction variance at $x_1$ based on evaluations at the $x_i$ ($2 \le i \le n$). As for the UK (mean) predictor, introducing $Z_{-1}$ such that $Z^\top = (Z_1, Z_{-1}^\top)$, we get
$$(\widetilde{K}^{-1}[\{1, \ldots, n\}]Z)_1 = V^{-1}Z_1 - V^{-1}\widetilde{K}_{1,-1}\widetilde{K}_{-1}^{-1}[\,,(1, \ldots, n-1)]\,Z_{-1}. \qquad (15)$$
From the first $n - 1$ columns of Equation 13, we get
$$\widetilde{K}_{-1}^{-1}[\,,(1, \ldots, n-1)]\,Z_{-1} = \begin{pmatrix} K_{-1}^{-1}Z_{-1} - K_{-1}^{-1}F[-1,]\,I_{\beta,-1}^{-1}F[-1,]^\top K_{-1}^{-1}Z_{-1} \\ I_{\beta,-1}^{-1}F[-1,]^\top K_{-1}^{-1}Z_{-1} \end{pmatrix},$$
which, after left multiplication by $\widetilde{K}_{1,-1}$ and substitution in Equation 15, delivers indeed
$$V\,(\widetilde{K}^{-1}[(1, \ldots, n)]Z)_1 = Z_1 - \widehat{Z}_1, \qquad (16)$$
where $\widehat{Z}_1$ stands here for the UK mean predictor at $x_1$ based on evaluations at the $x_i$ ($2 \le i \le n$). Coming back to multiple-fold cross-validation, let us now formulate a corollary to Theorem 1 for the case of Universal Kriging, the proof of which is not included as it can be obtained by generalizing to arbitrary folds, following Theorem 1's proof scheme, the results presented above for the case where solely the observation at the first point is left out.

Corollary 1.
For any $\mathbf{i} \in S$ such that $F[-\mathbf{i},]$ is full column-ranked and revisiting the notation $\widehat{Z}_{\mathbf{i}}$ to denote the Universal Kriging predictor (with arbitrary basis functions as discussed here and in Appendix B), the Universal Kriging residual $E_{\mathbf{i}} = Z_{\mathbf{i}} - \widehat{Z}_{\mathbf{i}}$ obtained when predicting at locations indexed by $\mathbf{i}$ based on observations at the remaining ones writes
$$E_{\mathbf{i}} = (\widetilde{K}^{-1}[\mathbf{i}])^{-1}(\widetilde{K}^{-1}[(1, \ldots, n)]Z)_{\mathbf{i}}, \qquad (17)$$
where $\widetilde{K}$ is defined in Equation 11. Consequently, for any $q > 1$ and $\mathbf{i}_1, \ldots, \mathbf{i}_q \in S$ such that the $F[-\mathbf{i}_j,]$'s are full column-ranked, the $E_{\mathbf{i}_j}$ ($1 \le j \le q$) are jointly Gaussian, centred, and with covariance structure given by
$$\mathrm{Cov}(E_{\mathbf{i}}, E_{\mathbf{j}}) = (\widetilde{K}^{-1}[\mathbf{i}])^{-1}\,\widetilde{K}^{-1}[\mathbf{i}, \mathbf{j}]\,(\widetilde{K}^{-1}[\mathbf{j}])^{-1} \quad (\mathbf{i}, \mathbf{j} \in S). \qquad (18)$$
In particular, for the case of index vectors $\mathbf{i}_1, \ldots, \mathbf{i}_q$ forming a partition of $\{1, \ldots, n\}$ and such that the concatenation of $\mathbf{i}_1, \ldots, \mathbf{i}_q$ gives $(1, \ldots, n)$, the concatenated vector of cross-validation residuals $E_{\mathbf{i}_1,\ldots,\mathbf{i}_q} := (E_{\mathbf{i}_1}^\top, \ldots, E_{\mathbf{i}_q}^\top)^\top$ has an $n$-dimensional centred Gaussian distribution with covariance matrix
$$\mathrm{Cov}(E_{\mathbf{i}_1,\ldots,\mathbf{i}_q}) = \widetilde{D}\,\widetilde{K}^{-1}[(1, \ldots, n)]\,\widetilde{D}, \qquad (19)$$
where $\widetilde{D} = \mathrm{blockdiag}\big((\widetilde{K}^{-1}[\mathbf{i}_1])^{-1}, \ldots, (\widetilde{K}^{-1}[\mathbf{i}_q])^{-1}\big)$.

We now wish to further explore the potential of the established formulae for diagnosing the quality of GP models. A first immediate consequence of Theorem 1 and Corollary 1 for model fitting is that, under the hypothesis that observations are generated from the assumed model, cross-validation residuals can be transformed via appropriate left matrix multiplications into a random vector with standard multivariate normal distribution. Assuming for simplicity that $\mathrm{Cov}(E_{\mathbf{i}_1,\ldots,\mathbf{i}_q})$ is full-ranked and denoting $t = \sum_{j=1}^q |\mathbf{i}_j|$, it follows that
$$\mathrm{Cov}(E_{\mathbf{i}_1,\ldots,\mathbf{i}_q})^{-1/2}\,E_{\mathbf{i}_1,\ldots,\mathbf{i}_q} \sim \mathcal{N}(0, I_t).$$
In fact, any matrix $A \in \mathbb{R}^{n \times n}$ such that $ADK^{-1}DA^\top = I_n$ (remaining for simplicity under the final assumptions of Theorem 1) does the job. More specifically, sticking to the final part of Theorem 1 where $\Delta = I_n$, one gets indeed with $A = K^{1/2}D^{-1}$ that
$$AE_{\mathbf{i}_1,\ldots,\mathbf{i}_q} = K^{-1/2}Z \sim \mathcal{N}(0, I_n),$$
and hence the hypothesis that the model is correct can be questioned using any standard means relying on such a pivotal quantity with multivariate Gaussian distribution (e.g., by means of a $\chi^2$ test statistic or graphical diagnostics such as Q-Q plots; see also [3] for more on the topic). Let us remark that, while $K^{-1/2}Z$ or a quantity sharing its distribution can be directly computed (see also the remark below regarding $K$'s Cholesky factor), $A = K^{1/2}D^{-1}$ too can be computed without needing to perform much additional calculation since
$$D^{-1} = \mathrm{blockdiag}\big(K^{-1}[\mathbf{i}_1], \ldots, K^{-1}[\mathbf{i}_q]\big)$$
is pre-calculated, and $K^{1/2}$ can also be replaced by a pre-calculated matrix such as the lower triangular Cholesky factor of $K$. Similar considerations apply to the UK settings of Corollary 1.

Remark 3. If $\Delta$ is not $I_n$ but still an invertible matrix, then the above can be adapted with $A = K^{1/2}\Delta^{-1}D^{-1}$. In case $\Delta$ is not invertible, analogous approaches can be employed relying on the Moore-Penrose pseudo-inverse of $D\Delta K^{-1}\Delta^\top D^\top$, where the $n$-dimensional standard normal should be replaced by a standard Gaussian on the range of $\Delta K^{-1}\Delta^\top$.
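In R, the construction of such a pivotal vector from LOO residuals can be sketched as follows; the kernel and data are illustrative choices of ours, and we note that with $L$ the lower Cholesky factor of $K$ ($K = LL^\top$), the matrix $A = L^\top D^{-1}$ indeed satisfies $A\,\mathrm{Cov}(E)\,A^\top = I_n$:

```r
# Sketch: decorrelate LOO residuals into a pivotal N(0, I_n) vector (SK case).
set.seed(3)
n <- 40
x <- seq(0, 1, length.out = n)
K <- exp(-0.5 * outer(x, x, "-")^2 / 0.1^2) + 1e-10 * diag(n)  # jitter for stability
U <- chol(K)                       # upper factor, K = t(U) %*% U
z <- t(U) %*% rnorm(n)             # a sample path: the model is well-specified
Kinv <- chol2inv(U)

e <- (1 / diag(Kinv)) * (Kinv %*% z)     # LOO residuals, Eq. (3)
u <- U %*% (diag(Kinv) * e)              # A e with A = U D^{-1}; here u = L^{-1} z
qqnorm(u); abline(0, 1)                  # u is standard normal under the model
e_std <- sqrt(diag(Kinv)) * e            # merely standardized residuals, correlated
```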
Figure 4: On the effect of accounting and correcting for correlation in Q-Q plots based on LOO residuals. Right panel: Q-Q plot against $\mathcal{N}(0, 1)$ of LOO residuals merely divided by corresponding LOO standard deviations. Left panel: Q-Q plot against $\mathcal{N}(0, 1)$ of duly transformed LOO residuals.

Coming back to our first one-dimensional example, Figure 4 represents Q-Q plots of the distribution of standardized LOO residuals (right panel) versus "transformed" LOO residuals (left panel). While these two plots possess quite similar appearances, a difference that stands out occurs at the second point from the left. Even if it does not really make sense to think of points in terms of unique locations for the left panel of Figure 4, we hypothesize that most entries do not differ much from the standardized version except for those cases where two successive locations come with LOO residuals of relatively high magnitude, which is precisely the case for the fourth and fifth locations from Figure 1, respectively with large negative and positive LOO residuals. As it turns out, the second largest negative LOO residual from the right panel of Figure 4 seems absent from the left panel, and it seems reasonable to attribute this to its decorrelation from the largest positive residual corresponding to the fifth location (appearing rightmost on the right panel).

As an additional cross-validation example with a focus on the effect of correlation on Q-Q plots, we added to the data used in the first example (evaluations at a 10-point regular subdivision of [0, 1] of the one-dimensional test function recalled in Eq. 10) a cluster of 10 additional function evaluations, equi-spaced over a short interval ending at 0.3,
and we now present the effect of this cluster on similar Q-Q plots. A brief comparison with Q-Q plots obtained on the second example (the one with 10 paired points, not further developed in this section by prioritization) is appended in Section D.1 for completeness.

We can see on Figure 5 that the function is now very well captured in the region of the added cluster. Yet what is not apparent here, but is reflected by Figure 5, is that leave-one-out predictions come with over-evaluated prediction uncertainty for a number of locations, which turn out to originate from the region of the added cluster and also from the flatter region on the right-hand side. While looking at the Q-Q plot of standardized residuals on the right panel of Figure 6 suggests some model inadequacy, accounting for and removing covariance between leave-one-out prediction errors, as illustrated in the left panel of Figure 6, suggests on the contrary that the assumptions underlying the considered GP model do not stand out as unreasonable.
Figure 5: Upper panel: GP prediction (blue line) of the test function (black line) defined by Equation 10, based on evaluations at a regular grid complemented by clustered points just below 0.3,
and LOO predictions (red points). Lower panel: absolute prediction errors associated with GP (black line) and LOO (red points) predictions, and GP prediction standard deviation (in blue).

The results obtained on this modified test case further stress the deleterious effect of neglecting the correlation between cross-validation residuals when diagnosing GP models relying notably on Q-Q plots. For completeness, we also added in Section D.2 a brief analysis of what cross-validation would result in on this third example when grouping the second to eleventh points within an index vector (and keeping the other points as individual entities like in LOO). We now turn to the topic of covariance parameter estimation relying on cross-validation residuals. As we will see below, incorporating covariances between cross-validation residuals within such approaches leads to interesting developments.
Figure 6: On the effect of accounting and correcting for correlation in Q-Q plots based on LOO residuals (second case: regular grid complemented by clustered points just below 0.3). Right panel: Q-Q plot against $\mathcal{N}(0, 1)$ of LOO residuals merely divided by corresponding LOO standard deviations. Left panel: Q-Q plot against $\mathcal{N}(0, 1)$ of duly transformed LOO residuals.
Assuming a kernel of the form $k(x, x') = \sigma^2 r(x, x')$ with unknown $\sigma^2$, the problem of estimating $\sigma^2$ by maximum likelihood versus cross-validation has been considered in earlier works, leading notably to results concerning the superiority of MLE in the well-specified case, but also to situations in the misspecified case where CV turned out to perform better than MLE [2]. Denoting by $R$ the correlation matrix associated with the observation points $x_1, \ldots, x_n$, assumed full-rank, it is well known (cf. for instance [21]) that the MLE of $\sigma^2$ can be written in closed form as
$$\widehat{\sigma}^2_{ML} = \frac{1}{n}Z^\top R^{-1}Z, \qquad (20)$$
assuming again without loss of generality a centred random field, for simplicity. This estimator is unbiased and reaches the Cramér-Rao bound, with a variance
$$\mathrm{var}\big(\widehat{\sigma}^2_{ML}\big) = \frac{2\sigma^4}{n}. \qquad (21)$$
In contrast, the leave-one-out-based estimator of $\sigma^2$ investigated in [2] reads
$$\widehat{\sigma}^2_{LOO} = \frac{1}{n}Z^\top R^{-1}\big(\mathrm{diag}(R^{-1})\big)^{-1}R^{-1}Z, \qquad (22)$$
and originates from the idea, traced back by [2] to [4], that the criterion (with notation $C_{LOO}$ inherited from [2]) defined by
$$C^{(1)}_{LOO}(\sigma^2) = \frac{1}{n}\sum_{i=1}^n \frac{(Z_i - \widehat{Z}_i)^2}{\sigma^2 c_{-i}} \qquad (23)$$
should take a value close to one, where $c_{-i} = (s_{-i})^2/\sigma^2$, leading to $\widehat{\sigma}^2_{LOO} = \frac{1}{n}\sum_{i=1}^n (Z_i - \widehat{Z}_i)^2/c_{-i}$ and ultimately to Eq. 22. This estimator turns out to be unbiased as well, but with variance
$$\mathrm{var}\big(\widehat{\sigma}^2_{LOO}\big) = \frac{2\sigma^4\,\mathrm{tr}\Big(\big(R^{-1}(\mathrm{diag}(R^{-1}))^{-1}\big)^2\Big)}{n^2}. \qquad (24)$$
Following [2], $\mathrm{tr}\big((R^{-1}(\mathrm{diag}(R^{-1}))^{-1})^2\big) \ge n$, so that
$$\mathrm{var}\big(\widehat{\sigma}^2_{LOO}\big) \ge \frac{2\sigma^4}{n} = \mathrm{var}\big(\widehat{\sigma}^2_{ML}\big). \qquad (25)$$
Yet we claim that, in order to remove undesirable effects due to the covariance between LOO residuals, it is natural to revise the criterion $C^{(1)}_{LOO}(\sigma^2)$ into
$$C^{(1)}_{\widetilde{LOO}}(\sigma^2) = \frac{1}{n}\sum_{i=1}^n\sum_{j=1}^n (Z_i - \widehat{Z}_i)\,\big[(DK^{-1}D)^{-1}\big]_{ij}\,(Z_j - \widehat{Z}_j) \qquad (26)$$
$$= \frac{1}{n\sigma^2}E^\top\,\mathrm{diag}(R^{-1})\,R\,\mathrm{diag}(R^{-1})\,E \qquad (27)$$
$$= \frac{1}{n\sigma^2}Z^\top R^{-1}Z, \qquad (28)$$
so that setting this modified criterion to 1 would plainly result in
$$\widehat{\sigma}^2_{\widetilde{LOO}} = \frac{1}{n}Z^\top R^{-1}Z = \widehat{\sigma}^2_{ML}. \qquad (29)$$
Hence the advantages found for $\widehat{\sigma}^2_{LOO}$ in [2] in the case of model misspecification come at the price of neglecting covariances, and hence possible redundancies, between leave-one-out residuals; yet attempting to address this issue by naturally accounting for those covariances within the criterion leads to the maximum likelihood estimator, known to enjoy optimality properties in the well-specified case of the considered settings but to be potentially suboptimal otherwise. We will see next that this is not the only instance where modifying LOO in order to account for residual covariances leads back to MLE.
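A quick numerical illustration of Eqs. 20 and 22 under an assumed well-specified centred model, with an illustrative correlation kernel (the corrected criterion of Eqs. 26-28 returns the ML estimator by construction, so only the two distinct estimators are computed):

```r
# Sketch: ML vs LOO-based scale estimators (Eqs. 20 and 22); correcting the
# LOO criterion for residual covariances gives back Eq. (20), cf. Eq. (29).
set.seed(4)
n <- 50
x <- seq(0, 1, length.out = n)
R <- exp(-0.5 * outer(x, x, "-")^2 / 0.1^2) + 1e-8 * diag(n)  # jitter for stability
sigma2 <- 2
z <- sqrt(sigma2) * t(chol(R)) %*% rnorm(n)
Rinv <- chol2inv(chol(R))

sigma2_ML  <- drop(crossprod(z, Rinv %*% z)) / n                                  # Eq. (20)
sigma2_LOO <- drop(crossprod(z, Rinv %*% ((1 / diag(Rinv)) * (Rinv %*% z)))) / n  # Eq. (22)
c(ML = sigma2_ML, LOO = sigma2_LOO)
```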
Cross-validation has also been used to estimate covariance parameters beyond the scale parameter $\sigma^2$ tackled in the last section. We will now denote by $\theta$ those additional parameters (and by $\psi = (\sigma^2, \theta)$ the concatenation of all covariance parameters) and review some approaches that have been used to estimate them via cross-validation. An important point compared to the last section on $\sigma^2$'s estimation is that Kriging predictors generally depend on $\theta$, so that we will now stress it by noting $\widehat{Z}(\theta)$ (recall that in the presented results one assumes noiseless observations). The approach followed by [21, 2] to estimate $\theta$ based on leave-one-out cross-validation residuals is to minimize
$$C^{(2)}_{LOO}(\theta) = \sum_{i=1}^n \big(Z_i - \widehat{Z}_i(\theta)\big)^2 = \sum_{i=1}^n E_i(\theta)^2, \qquad (30)$$
where we stress the dependence of the $E_i$'s on $\theta$. Following [2], the latter criterion can in turn be expressed in closed form as
$$C^{(2)}_{LOO}(\theta) = Z^\top R_\theta^{-1}\,\mathrm{diag}(R_\theta^{-1})^{-2}\,R_\theta^{-1}Z, \qquad (31)$$
where $R_\theta$ stands for $Z$'s correlation matrix under the correlation parameter $\theta$. Based on our results, a natural extension of this criterion to multiple-fold cross-validation can be similarly obtained. Considering $q$-fold settings such as in the final part of Theorem 1, $C^{(2)}_{LOO}$ becomes indeed
$$C^{(2)}_{CV}(\theta) = \sum_{j=1}^q \|Z_{\mathbf{i}_j} - \widehat{Z}_{\mathbf{i}_j}(\theta)\|^2 = \sum_{j=1}^q \|E_{\mathbf{i}_j}(\theta)\|^2 = \|E_{\mathbf{i}_1,\ldots,\mathbf{i}_q}(\theta)\|^2. \qquad (32)$$
Then, building upon our main theorem, we obtain that
$$C^{(2)}_{CV}(\theta) = Z^\top R_\theta^{-1}\,\bar{D}(\theta)^2\,R_\theta^{-1}Z, \qquad (33)$$
with $\bar{D}(\theta) = \mathrm{blockdiag}\big((R_\theta^{-1}[\mathbf{i}_1])^{-1}, \ldots, (R_\theta^{-1}[\mathbf{i}_q])^{-1}\big)$, generalizing indeed Eq. 31. Note that applying a sphering transformation to the CV residuals prior to taking norms in Eqs. 30 and 32 would lead to a criterion that boils down to $\theta \mapsto Z^\top R_\theta^{-1}Z$ and hence appears as one of the two building blocks of the log-likelihood criterion.

On a different note, as exposed in [18], Eq. 30 is only one among several possible ways to construct a criterion based on applying loss functions to leave-one-out residuals. While Eq. 30 corresponds to the squared error loss, one arrives for instance, by using instead the "negative log validation density loss" (after the terminology of [18], Section 5.4.2), at the "LOO log predictive probability", also sometimes called pseudo-likelihood:
$$C^{(3)}_{LOO}(\psi) = \sum_{j=1}^n \log\big(p_{Z_j | Z_{-j}}(z_j | z_{-j}; \psi)\big), \qquad (34)$$
where $p_{Z_j | Z_{-j}}(z_j | z_{-j}; \psi)$ denotes the conditional density of $Z_j$ at the value $z_j$ knowing that $Z_{-j} = z_{-j}$. Further scoring rules could be used as a basis to define further criteria relying on cross-validation residuals, as illustrated for instance with the CRPS score in [15].

Let us now focus on an extension of $C^{(3)}_{LOO}(\psi)$ to multiple-fold cross-validation, and on the exploration of some links between the resulting class of criteria and the log-likelihood. First, $C^{(3)}_{LOO}(\psi)$ is straightforwardly adapted into
$$C^{(3)}_{CV}(\psi) = \sum_{j=1}^q \log\big(p_{Z_{\mathbf{i}_j} | Z_{-\mathbf{i}_j}}(z_{\mathbf{i}_j} | z_{-\mathbf{i}_j}; \psi)\big). \qquad (35)$$
Reformulating the sum of log terms as the log of a product, we see that
$$C^{(3)}_{CV}(\psi) = \log\prod_{j=1}^q p_{Z_{\mathbf{i}_j} | Z_{-\mathbf{i}_j}}(z_{\mathbf{i}_j} | z_{-\mathbf{i}_j}; \psi) = \log\prod_{j=1}^q p_{E_{\mathbf{i}_j}}(e_{\mathbf{i}_j}; \psi), \qquad (36)$$
and so it appears that this approach amounts indeed, in some sense, to abusively assuming independence between the cross-validation residuals $E_{\mathbf{i}_j}$ ($j = 1, \ldots, q$). We now examine in more detail (still in the settings of the final part of Theorem 1) when such cross-validation residuals are independent, and show that in such case $C^{(3)}_{CV}(\psi)$ coincides with the log-likelihood criterion based on $Z$.

Corollary 2. The following propositions are equivalent: a) the cross-validation residuals $E_{\mathbf{i}_j}$ ($j = 1, \ldots, q$) are mutually independent; b) $D = K$; c) the subvectors $Z_{\mathbf{i}_j}$ ($j = 1, \ldots, q$) are mutually independent. If they hold, then $C^{(3)}_{CV}(\psi) = \log(p_Z(z; \psi))$ ($\psi \in \Psi$).

Proof. As we know from Theorem 1 that the cross-validation residuals $E_{\mathbf{i}_j}$ ($j = 1, \ldots, q$) form a Gaussian vector, they are mutually independent if and only if their cross-covariance matrices are null. Yet we also know by Eq. 6 from the same theorem that $\mathrm{Cov}(E_{\mathbf{i}}, E_{\mathbf{j}}) = (K^{-1}[\mathbf{i}])^{-1}K^{-1}[\mathbf{i}, \mathbf{j}](K^{-1}[\mathbf{j}])^{-1}$ ($\mathbf{i}, \mathbf{j} \in S$). It follows that a) is equivalent to $K^{-1}[\mathbf{i}_j, \mathbf{i}_{j'}] = 0$ for $j \ne j'$ ($j, j' \in \{1, \ldots, q\}$),
, q } ),which coincides in turn with b) . Looking finally at b) from the perspective of K being block-diagonal, we obtain similarly that b) is equivalent to c) . In thesepropositions hold, then E i j = Z i j ( j = 1 . . . q ) from Eq. 17, and thus for ψ ∈ Ψ: C (3)CV ( ψ ) = log q (cid:89) j =1 p E i j ( e i j ; ψ ) = log q (cid:89) j =1 p Z i j ( z i j ; ψ ) = log( p Z ( z ; ψ )) . Also, one could have derived it from p Z i j | Z − i j ( z i j | z − i j ; ψ ) = p Z i j ( z i j ; ψ ). We have seen in previous sections that block matrix inversion enabled us toreformulate the equations of (multiple-fold) cross-validation residuals and asso-ciated covariances in compact form, from Simple Kriging to Universal Krigingsettings. More about our R implementation can be found in Appendix (Sec-tion C). Here we briefly illustrate that our Schur-complement-based formualedo indeed reproduce results obtained by a more straighforward approach, alsokeeping and eye on computational speed-ups and by-products. The next sectionfocuses in more detail on the respective computational complexities of the twoapproaches, and also how efficiency gains evolve depending on the number offolds. Let us start by a basic LOO example in the context of Simple Kriging.We first consider Simple Kriging prediction based on 100 points of our recurrentone-dimensional example test function, to which we apply LOO cross-validationwith fast versus naive implementation. We find relative differences of the orderof 10 − and 10 − when comparing vectors of leave-one-out predictions andvariances obtained from the fast versus naive approaches, respectively (thoserelative differences consist of Eulidean norms of the differences divided by thenorm of the relevant vector using the naive approach). In the current set-up, thefast implementation typically comes with a speed-up factor of 2, a performancethat is much improved when increasing the number of observations. Alreadywith 1000 observations, we found a speed-up of around 120 (More experimentsto follow). Moving to a Universal Kriging model with quadratic trend (See18ection D.4 in Appendix for figures), we found with 100 observations a speed-up nearing 5 while relative errors were still in the same tiny orders of magnitude.With 1000 observations, we observed the same speed-up as in the case of SimpleKriging (with relative accuracies slightly increasing near 10 and 10 ).In Figure 4.1 we present the speed-up measured for LOO on 10 regular de-signs, with 100 to 1000 points equidistributed on [0 , − and 10 − , respectively)illustrating the accuracy of the fast method. design size (in hundreds) Speed−up when using fast vs naive LOO
Figure 7: Speed-up (ratio between times required to run the naive and fast methods) measured for LOO on 10 regular designs, with 100 to 1000 points equidistributed on [0, 1].
Figure 8: Relative errors on LOO means and covariances (between naive and fast methods, with the naive method as reference) measured as on Figure 7.

In order to assess the influence of implementation by-products on these timings, we also considered a less favourable variant where the covariance matrix was first rebuilt from the Cholesky factor and then factorized again, so as to be in an unfavourable situation for the fast method (the rebuild acting as a penalty, counterbalancing the fact that the employed GP modelling code, the km function of the R package DiceKriging, already delivers a Cholesky factor of the covariance matrix). The corresponding results are plotted on Figure D.3 in Appendix, where it can be seen that in this unfavourable situation speed-ups are merely divided by a factor of 2, but the immense benefit of using the fast LOO formula over the naive approach is not affected, all the more so as the design size increases.

Turning now to multiple-fold cross-validation beyond leave-one-out, we compare on Figures 9 and 10 results obtained via fast versus naive implementation in the case of 50-fold cross-validation on the same regular designs as previously. As can be seen on Figure 9, speed-ups start getting mitigated with designs nearing the highest considered design sizes, a phenomenon related to a computational trade-off that is analyzed in more detail in the next section.
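A self-contained timing sketch in the same spirit (pure linear algebra, without the km-based modelling layer used for the paper's experiments; n is kept moderate here since the naive loop scales as $O(n(n-1)^3)$):

```r
# Timing sketch: fast LOO (one factorization) vs naive LOO (one solve per point)
set.seed(5)
n <- 300
x <- seq(0, 1, length.out = n)
K <- exp(-0.5 * outer(x, x, "-")^2 / 0.1^2) + 1e-8 * diag(n)
z <- t(chol(K)) %*% rnorm(n)

t_fast <- system.time({
  Kinv   <- chol2inv(chol(K))
  e_fast <- (1 / diag(Kinv)) * (Kinv %*% z)
})["elapsed"]

t_naive <- system.time({
  e_naive <- sapply(1:n, function(i) z[i] - K[i, -i] %*% solve(K[-i, -i], z[-i]))
})["elapsed"]

c(fast = t_fast, naive = t_naive, speedup = t_naive / t_fast,
  max_diff = max(abs(e_fast - e_naive)))
```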
We now turn to a more specific examination of how our closed-form formulae affect computational speed compared to a naive implementation of multiple-fold cross-validation. To this aim, we focus on the particular case of homogeneous fold sizes and denote by $r \ge 1$ the common fold size, so that $n = r \times q$. As we will now develop, a first-order reasoning on the computational costs associated with naive versus closed-form multiple-fold cross-validation highlights the existence of two regimes depending on $q$: for large $q$ it is faster to use the closed-form formulae, yet for small $q$ one should better employ the naive approach. The rationale is as follows: in a naive implementation, the main computational cost lies in inverting $q$ covariance sub-matrices, all belonging to $\mathbb{R}^{(n-r) \times (n-r)}$. Denoting this (approximate) cost by $\kappa_{\mathrm{naive}}$ and assuming a cubic cost for matrix inversion (with a multiplicative constant $\gamma > 0$), we get
$$\kappa_{\mathrm{naive}} = q \times \gamma(n - r)^3. \qquad (37)$$
Figure 9: Speed-up (ratio between times required to run the naive and fast methods) measured for 50-fold CV on 10 regular designs, with 100 to 1000 points equidistributed on [0, 1].
Figure 10: Relative errors on CV means and covariances (between naive and fast methods, with the naive method as reference) measured as on Figure 9.

On the other hand, the closed-form approach requires one inversion of an $n \times n$ matrix and $q$ inversions of $r \times r$ matrices, leading to an approximate cost
$$\kappa_{\mathrm{close}} = \gamma(n^3 + qr^3). \qquad (38)$$
We thus obtain (at first order) that a speed-up takes place whenever
$$0 \le q \times \gamma(n - r)^3 - \gamma(n^3 + qr^3) \iff q \times (n - r)^3 \ge n^3 + qr^3 \iff (q - 1)^3 \ge q^2 + 1, \qquad (39)$$
which, as $q$ is a positive integer, can be proved to be equivalent to $q \ge 4$. Indeed, the polynomial function $q \mapsto (q - 1)^3 - (q^2 + 1)$ possesses a unique real root between 3 and 4, takes negative values for $q \in \{1, 2, 3\}$, and tends to $+\infty$ when $q \to +\infty$. As noted in the previous section, however, in our implementation the cost of the fast approach turns out to be smaller than $\gamma(n^3 + qr^3)$ since the Cholesky factor is already pre-calculated. Modelling this with a damping factor of $\alpha \in (0, 1)$ applying to the $\gamma n^3$ term, we end up with an approximate cost of $\kappa_{\mathrm{close}} = \gamma(\alpha n^3 + qr^3)$, leading via analogous calculations to a speed-up whenever $(q - 1)^3 \ge \alpha q^2 + 1$. Again, the polynomial involved possesses a single real-valued root, with a value shrinking towards 2 as $\alpha$ tends to 0.

In practice, speed-ups are already observed here from $q = 2$, which may be attributed to various reasons pertaining notably to implementation specifics and arbitrary settings used in the numerical experiments. Also, let us stress again that the reasoning done above is to be taken as an approximation at "first order", in the sense that it does not account for matrix-vector multiplication costs, storage and communication overheads, and further auxiliary operations that could influence the actual running times. We now present indeed some numerical experiments at fixed design size but varying number of folds, and monitor the variations of the measured speed-ups and of the accuracy in calculating cross-validation residual means and covariances.

For convenience, we consider here a design of size $2^{10} = 1024$, and let the number of folds decrease from $q = 1024$ to $q = 2$ by successive halvings. This amounts to folds of sizes $r$ ranging from 1 to 512, correspondingly. For each value of $q$, 50 random replications of the folds are considered; this is done by permuting the indices at random prior to arranging them in regularly constructed folds of $r$ successive elements. The resulting speed-up distributions are represented in Figure 11, as a function of $q$. With speed-up medians ranging from 1027.82 to 1.50 (means from 1429.63 to 1.28, all truncated after the second digit), we hence observed a speed-up whatever the value of $q$, yet with a substantial decrease when $q$ decreases compared to the LOO situation, as could be expected.

Additionally, relative errors on cross-validation residuals and associated variances calculated using the fast versus naive approach are represented in Figure 12, also as a function of $q$. It is remarkable that, while the relative error on the calculation of residuals increases moderately (with $r$) from a median of around $3 \cdot 10^{-14}$ to values nearing $4 \cdot 10^{-14}$, in the case of the covariances the variation is more pronounced, with an increase from approximately $2 \cdot 10^{-11}$; in all cases, though, the observed differences remain of a negligible extent.
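The first-order break-even analysis above can be checked in a few lines; here $\gamma$ is set to 1, and the $\alpha$-damped variant of the fast-approach cost is included as an option:

```r
# First-order cost model of Section 4.2 (Eqs. 37-38), illustrative only.
predicted_speedup <- function(q, n, alpha = 1) {
  r <- n / q                         # homogeneous fold size
  kappa_naive <- q * (n - r)^3       # q inversions of (n - r) x (n - r) matrices
  kappa_close <- alpha * n^3 + q * r^3   # one n x n inversion plus q small ones
  kappa_naive / kappa_close
}
# Predicted break-even at q = 4 when alpha = 1, cf. (q - 1)^3 >= q^2 + 1:
round(sapply(c(2, 3, 4, 8, 64, 1024), predicted_speedup, n = 1024), 2)
```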
Figure 11: Speed-up (ratio between times required to run the naive and fast methods) measured for $q$-fold CV, where $q$ decreases from 1024 to 2 and 50 seeds are used that affect here both model fitting and the folds.
Figure 12: Relative errors on CV means and covariances (between naive and fast methods, with the naive method as reference) measured in similar settings as on Figure 11.
Discussion
Fast leave-one-out formulae for Gaussian Process prediction that have been used for model diagnostics and hyperparameter fitting do carry over well to multiple-fold cross-validation, as our theoretical and numerical results in both Simple and Universal Kriging settings suggest. The resulting formulae were found to allow substantial speed-ups, yet with a most favourable scenario in the leave-one-out case and decreasing advantages in cases with fewer folds of larger observation subsets. A first-order analysis in terms of the involved matrix inversion complexities confirmed the observed trend, yet in most considered situations the fast formulae appeared computationally worthwhile. In addition, the established formulae enabled a closed-form calculation of the covariance structure of cross-validation residuals, and eventually correcting for an abusive assumption of independence silently underlying Q-Q plots with standardized LOO residuals.

As we established as well, the derived formulae lend themselves well to generalizing existing covariance hyperparameter estimation approaches that are based on cross-validation residuals. Looking first at scale parameter estimation in the LOO case, we found that changing a criterion underlying the scale estimation approach used in [21, 2] by correcting for covariance between LOO residuals led back to the maximum likelihood estimator of scale. On a different note, and with more general covariance hyperparameters in view, maximizing the pseudo-likelihood criterion mentioned in [18] was found, as established in Corollary 2, to coincide with MLE in the case of independent cross-validation residuals.

It is interesting to note, as a potential starting point to future work on cross-validation-based parameter estimation, that going beyond the independence assumption of Corollary 2 and accounting for the calculated residual covariances results in fact in a departure from basic maximum likelihood estimation. Still assuming indeed that $\Delta = I_n$ and replacing the product in $C^{(3)}_{CV}(\psi)$ by the joint density of cross-validation residuals, we end up with
$$C^{(3)}_{\widetilde{CV}}(\psi) = \log\big(p_{(E_{\mathbf{i}_1},\ldots,E_{\mathbf{i}_q})}((e_{\mathbf{i}_1}, \ldots, e_{\mathbf{i}_q}); \psi)\big) = \log\big(p_{DK^{-1}Z}(DK^{-1}z; \psi)\big) = \log\big(p_Z(z; \psi)\,|\det(DK^{-1})|^{-1}\big) = \log(p_Z(z; \psi)) + \log(\det(K)) - \log(\det(D)), \qquad (41)$$
and we see that maximizing $C^{(3)}_{\widetilde{CV}}(\psi)$ generally departs indeed from MLE, as $K$ and $D$ are functions of $\psi$, yet coincides with it when $\det(D) = \det(K)$. While studying this new criterion further is out of the scope of the present paper, we believe that using the full distribution of cross-validation residuals could be helpful in designing novel criteria and procedures for covariance parameter estimation and more.
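As a hedged sketch (our own, for illustration), the corrected criterion of Eq. 41 could be evaluated by reusing quantities already available from the fast CV computations; fold structure and inputs below are assumptions:

```r
# Sketch of Eq. (41): log-likelihood plus the correction log det(K) - log det(D),
# for folds partitioning {1, ..., n}.
corrected_pseudo_loglik <- function(z, K, folds) {
  n       <- length(z)
  Kinv    <- chol2inv(chol(K))
  logdetK <- as.numeric(determinant(K, logarithm = TRUE)$modulus)
  # log det(D), with D = blockdiag((K^{-1}[i_1])^{-1}, ..., (K^{-1}[i_q])^{-1})
  logdetD <- -sum(sapply(folds, function(i)
    as.numeric(determinant(Kinv[i, i, drop = FALSE], logarithm = TRUE)$modulus)))
  loglik  <- -0.5 * (n * log(2 * pi) + logdetK + drop(crossprod(z, Kinv %*% z)))
  loglik + logdetK - logdetD   # departs from MLE unless det(D) = det(K)
}
```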
For instance, beyond our generalization to multiple-fold settings of the $\ell^2$ norm of cross-validation residuals as an objective function for correlation parameter estimation, we present in Section E decorrelated versions of the resulting criteria that could be of interest in future work. Beyond this, efficient adjoint computation of gradients could be useful to speed up covariance hyperparameter estimation procedures such as those studied here, but also variations thereof relying on further scoring rules [15], not only in LOO settings but also potentially in the case of multiple-fold cross-validation.

It is important to stress that while the presented results do not come with procedures for the design of folds, they might be of interest to create objective functions for their choice. Looking at the covariance matrix between cross-validation residuals might be a relevant ingredient to such procedures. On the other hand, future research might also be concerned with choosing folds so as to deliver cross-validation residuals reflecting in some sense the generalization errors of the considered models. Of course, designing the folds is not independent of choosing the $x_i$'s, and we expect interesting challenges to arise at this interface. Also, stochastic optimization under random folds is yet another approach of interest, which could be eased thanks to fast computation of cross-validation residuals. Last but not least, as fast multiple-fold cross-validation carries over well to linear regression, and as Gaussian Process modelling shares foundations with several other paradigms, it would be desirable to generalize the presented results to broader model classes (see for instance [17], where mixed models are tackled).
Part of DG's contributions have taken place within the Swiss National Science Foundation project number 178858, and he would like to thank Athénaïs Gautier and Cédric Travelletti for stimulating discussions. Yves Deville should be warmly thanked too for insightful comments on an earlier version of this paper. Calculations were performed on UBELIX, the High Performance Computing (HPC) cluster of the University of Bern. DG would also like to acknowledge support of Idiap Research Institute, his primary affiliation during earlier stages of this work.
References

[1] S. Arlot and A. Célisse. A survey of cross-validation procedures for model selection. Statistical Surveys, 2010.

[2] F. Bachoc. Cross validation and maximum likelihood estimation of hyperparameters of Gaussian processes with model misspecification. Computational Statistics and Data Analysis, 66:55–69, 2013.

[3] L.S. Bastos and A. O'Hagan. Diagnostics for Gaussian process emulators. Technometrics, 51(4):425–438, 2009.

[4] N.A.C. Cressie. Statistics for Spatial Data. Wiley, 1993.

[5] Y. Deville, D. Ginsbourger, and O. Roustant. kergp: Gaussian Process Laboratory, 2020. R package version 0.5.1.

[6] O. Dubrule. Cross validation of kriging in a unique neighborhood. Journal of the International Association for Mathematical Geology, 15(6):687–699, 1983.

[7] B. Fischer. Model selection for Gaussian process regression by approximation set coding. Master's thesis, ETH Zürich, 2016.

[8] J. Gallier. The Schur complement and symmetric positive semidefinite (and definite) matrices. Retrieved at .

[9] A.L. Kaminsky, Y. Wang, K. Pant, W.N. Hashii, and A. Atachbarian. An efficient batch k-fold cross-validation Voronoi adaptive sampling technique for global surrogate modeling. J. Mech. Des., 143(1):011706, 2021.

[10] L. Le Gratiet, C. Cannamela, and B. Iooss. Cokriging-based sequential design strategies using fast cross-validation for multi-fidelity computer codes. Technometrics, 57:418–427, 2015.

[11] M. Magnusson, M.R. Andersen, J. Jonasson, and A. Vehtari. Leave-one-out cross-validation for Bayesian model comparison in large data. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS), 2020.

[12] S. Marmin, D. Ginsbourger, J. Baccou, and J. Liandrat. Warped Gaussian processes and derivative-based sequential design for functions with heterogeneous variations. SIAM/ASA Journal on Uncertainty Quantification, 6(3):991–1018, 2018.

[13] L. Martino, V. Laparra, and G. Camps-Valls. Probabilistic cross-validation estimators for Gaussian process regression. In , pages 823–827, 2017.

[14] G. Matheron. Le krigeage universel. Les Cahiers du Centre de Morphologie Mathématique de Fontainebleau, 1, 1969.

[15] S.J. Petit, J. Bect, S. Da Veiga, P. Feliot, and E. Vazquez. Towards new cross-validation-based estimators for Gaussian process regression: efficient adjoint computation of gradients. https://arxiv.org/pdf/2002.11543.pdf, 2020.

[16] A. Rabinowicz and S. Rosset. Assessing prediction error at interpolation and extrapolation points. Electronic Journal of Statistics, 14:272–301, 2020.

[17] A. Rabinowicz and S. Rosset. Cross-validation for correlated data. Journal of the American Statistical Association, 2020.

[18] C.E. Rasmussen and C.K.I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.

[19] D.R. Roberts, V. Bahn, S. Ciuti, M.S. Boyce, J. Elith, G. Guillera-Arroita, S. Hauenstein, J.J. Lahoz-Monfort, B. Schröder, W. Thuiller, D.I. Warton, B.A. Wintle, F. Hartig, and C.F. Dormann. Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure. Ecography, 40(8):913–929, 2017.

[20] O. Roustant, D. Ginsbourger, and Y. Deville. DiceKriging, DiceOptim: Two R packages for the analysis of computer experiments by kriging-based metamodelling and optimization. Journal of Statistical Software, 51(1):1–55, 2012.

[21] T.J. Santner, B.J. Williams, and W. Notz. The Design and Analysis of Computer Experiments. Springer, New York, 2003.

[22] Y. Xiong, W. Chen, D. Apley, and X. Ding. A non-stationary covariance-based kriging method for metamodelling in engineering design. International Journal for Numerical Methods in Engineering, 71:733–756, 2007.

[23] Y. Zhang and Y. Yang. Cross-validation for selecting a model selection procedure. Journal of Econometrics, 187(1):95–112, 2015.
A About block inversion via Schur complements
The following, partly based on [8], is a summary of standard results revolving around the celebrated notion of Schur complement.
Theorem 2.
Let $M = \begin{pmatrix} A & B \\ C & D \end{pmatrix}$ be a real $n \times n$ matrix with blocks $A, B, C, D$ of respective sizes $p \times p$, $p \times q$, $q \times p$ and $q \times q$, where $p, q \geq 1$ are such that $n = p + q$. Assuming that $D$ and $A - BD^{-1}C$ are invertible, then so is $M$, with
$$M^{-1} = \begin{pmatrix} (A - BD^{-1}C)^{-1} & -(A - BD^{-1}C)^{-1}BD^{-1} \\ -D^{-1}C(A - BD^{-1}C)^{-1} & D^{-1} + D^{-1}C(A - BD^{-1}C)^{-1}BD^{-1} \end{pmatrix}.$$
Proof. We follow the classical path consisting of solving for $z = (x\ y)^{\top}$ in the equation $Mz = w$, with $w = (u\ v)^{\top} \in \mathbb{R}^n$, where $x, u \in \mathbb{R}^p$ and $y, v \in \mathbb{R}^q$. Using the assumed invertibility of $D$, we have indeed equivalence between
$$\begin{cases} Ax + By = u \\ Cx + Dy = v \end{cases} \quad\text{and}\quad \begin{cases} y = D^{-1}(v - Cx) \\ (A - BD^{-1}C)x = u - BD^{-1}v, \end{cases}$$
whereof, using this time the assumed invertibility of $A - BD^{-1}C$,
$$\begin{cases} x = (A - BD^{-1}C)^{-1}u - (A - BD^{-1}C)^{-1}BD^{-1}v \\ y = -D^{-1}C(A - BD^{-1}C)^{-1}u + \left(D^{-1} + D^{-1}C(A - BD^{-1}C)^{-1}BD^{-1}\right)v, \end{cases}$$
resulting in the claimed expression for $M^{-1}$.
Remark 4. The block inversion formula of Theorem 2 above can be reformulated as the following product of three matrices,
$$M^{-1} = \begin{pmatrix} I & 0 \\ -D^{-1}C & I \end{pmatrix} \begin{pmatrix} (A - BD^{-1}C)^{-1} & 0 \\ 0 & D^{-1} \end{pmatrix} \begin{pmatrix} I & -BD^{-1} \\ 0 & I \end{pmatrix},$$
implying in turn the following, which actually only requires the invertibility of $D$,
$$M = \begin{pmatrix} I & BD^{-1} \\ 0 & I \end{pmatrix} \begin{pmatrix} A - BD^{-1}C & 0 \\ 0 & D \end{pmatrix} \begin{pmatrix} I & 0 \\ D^{-1}C & I \end{pmatrix},$$
and illustrates that the simultaneous invertibility of $D$ and $A - BD^{-1}C$ is actually equivalent to the invertibility of $M$ itself.

Remark 5. As classically, in Theorem 2 the inverted block is the bottom-right $D$ and correspondingly the considered Schur complement is the one of the upper-left block $A$. The inverted Schur complement $(A - BD^{-1}C)^{-1}$ hence appears as the upper-left block of $M^{-1}$ associated with $i = (1, \ldots, p)$. Yet, analogous derivations are possible for any vector of indices $i$ from $\{1, \ldots, n\}$. Reformulating a result presented in Horn and Johnson in terms of our notation, we have indeed
$$M^{-1}[i] = \left(M[i] - M[i, -i]\,M[-i]^{-1}\,M[-i, i]\right)^{-1} \quad (42)$$
and, more generally for two complementary vectors of indices $i$ and $j$,
$$M^{-1}[i, j] = -\left(M[i] - M[i, j]\,M[j]^{-1}\,M[j, i]\right)^{-1}M[i, j]\,M[j]^{-1} = -M[i]^{-1}M[i, j]\left(M[j] - M[j, i]\,M[i]^{-1}\,M[i, j]\right)^{-1}. \quad (43)$$
Remark 6. Following up on Remark 5 and using the notation of Theorem 2, we obtain in particular, when assuming that $A$ and $D - CA^{-1}B$ are invertible and considering indeed the Schur complement of $A$ instead of that of $D$, that
$$M^{-1} = \begin{pmatrix} A^{-1} + A^{-1}B(D - CA^{-1}B)^{-1}CA^{-1} & -A^{-1}B(D - CA^{-1}B)^{-1} \\ -(D - CA^{-1}B)^{-1}CA^{-1} & (D - CA^{-1}B)^{-1} \end{pmatrix}.$$
Assuming that both blocks $A, D$ and their respective Schur complements are invertible, one then retrieves the so-called binomial inverse theorem, namely
$$(D - CA^{-1}B)^{-1} = D^{-1} + D^{-1}C(A - BD^{-1}C)^{-1}BD^{-1}.$$
As a by-product, equating non-diagonal blocks further delivers that
$$(A - BD^{-1}C)^{-1}BD^{-1} = A^{-1}B(D - CA^{-1}B)^{-1}$$
and similarly,
$$D^{-1}C(A - BD^{-1}C)^{-1} = (D - CA^{-1}B)^{-1}CA^{-1}.$$
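These identities are easy to confirm numerically. The following self-contained R snippet (purely illustrative, on a randomly generated symmetric positive definite matrix) verifies the upper-left block of Theorem 2, i.e. Equation (42), as well as the binomial inverse theorem.

## Illustrative numerical check of Theorem 2 / Eq. (42) and of the binomial
## inverse theorem, on a random symmetric positive definite matrix M.
set.seed(42)
n <- 7; p <- 3
G0 <- matrix(rnorm(n * n), n)
M <- crossprod(G0) + diag(n)                 # symmetric positive definite
i <- 1:p
A <- M[i, i]; B <- M[i, -i]; C <- M[-i, i]; D <- M[-i, -i]
S <- A - B %*% solve(D) %*% C                # Schur complement appearing in Theorem 2
max(abs(solve(M)[i, i] - solve(S)))          # upper-left block of M^{-1}: ~1e-15
lhs <- solve(D - C %*% solve(A) %*% B)       # binomial inverse theorem, left-hand side
rhs <- solve(D) + solve(D) %*% C %*% solve(S) %*% B %*% solve(D)
max(abs(lhs - rhs))                          # ~1e-15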
B Basics of Universal Kriging (Gaussian case)

In Universal Kriging, the trend function $\mu(x)$ of the considered random field $(Z_x)_{x \in D}$ is assumed to be a linear combination of given basis functions $f_j$ ($1 \leq j \leq p$, $p > 0$) but with unknown trend parameters $\beta \in \mathbb{R}^p$. While in the seminal work of Matheron [14] both square-integrable and intrinsically stationary $\eta$ were considered, we solely need the first case here as we focus on the Gaussian case, i.e. on settings where
$$Z_x = \sum_{j=1}^{p} \beta_j f_j(x) + \eta_x \quad (x \in D),$$
where $(\eta_x)_{x \in D}$ is assumed centred Gaussian with known (not necessarily stationary) covariance kernel $k$, and where the coefficients $\beta_j$ ($1 \leq j \leq p$) are unknown. Universal Kriging then consists in jointly estimating $\beta$ while accounting for the spatial dependency in observations, and predicting $Z$ at locations where it was not (or noisily) observed. Like in Simple Kriging, the Universal Kriging predictor (at any given $x \in D$) is set to be a linear combination $\widehat{Z}_x = \sum_{i=1}^{n} \lambda_i Z_{x_i}$ of observations $Z_{x_i}$ (where the dependency of the "Kriging weights" $\lambda_i$ on $x$ is, as often in the literature, not stressed here for the sake of readability), with Kriging weights such that the prediction be unbiased and with minimal variance.

With a deterministic treatment of the unknown $\beta_j$'s, the residual variance amounts to the same quantity as in Simple Kriging, that is $\mathrm{Var}[Z_x - \widehat{Z}_x] = \mathrm{Var}[\eta_x - \sum_{i=1}^{n}\lambda_i\eta_{x_i}] = \lambda' K \lambda - 2\lambda' k(x) + k(x, x)$. Yet the unbiasedness condition adds $p$ linear constraints, namely $f_j(x) = \sum_{i=1}^{n}\lambda_i f_j(x_i)$ for $j \in \{1, \ldots, p\}$. Minimizing the above quadratic risk under these unbiasedness constraints can be addressed by Lagrange duality, leading to a linear problem featuring a $p$-dimensional Lagrange multiplier $\ell$:
$$\begin{pmatrix} K & F \\ F' & 0 \end{pmatrix}\begin{pmatrix} \lambda \\ \ell \end{pmatrix} = \begin{pmatrix} k(x) \\ f(x) \end{pmatrix}, \quad (44)$$
where $F = (f_j(x_i))_{1 \leq i \leq n,\, 1 \leq j \leq p} \in \mathbb{R}^{n \times p}$ and $f(x) = (f_1(x), \ldots, f_p(x))' \in \mathbb{R}^p$. From Equation (44) it can be seen that the resulting $\lambda$ writes as a sum of two contributions proportional (up to matrix multiplications) to $k(x)$ and $f(x)$, respectively. More precisely, assuming that $K$ and $F'K^{-1}F$ are invertible and applying to $\widetilde{K} = \begin{pmatrix} K & F \\ F' & 0 \end{pmatrix}$ the variant of the Schur complement inversion formula discussed in Remark 6, we get that $\widetilde{K}$ is invertible and has inverse
$$\widetilde{K}^{-1} = \begin{pmatrix} K^{-1} - K^{-1}F(F'K^{-1}F)^{-1}F'K^{-1} & K^{-1}F(F'K^{-1}F)^{-1} \\ (F'K^{-1}F)^{-1}F'K^{-1} & -(F'K^{-1}F)^{-1} \end{pmatrix}.$$
Recalling that the Universal Kriging predictor at the considered $x$ is given by $\widehat{Z}_x = \lambda' Z = Z' \lambda$, we hence find that $\widehat{Z}_x = \alpha' k(x) + \widehat{\beta}' f(x)$, where
$$\alpha = \left(K^{-1} - K^{-1}F(F'K^{-1}F)^{-1}F'K^{-1}\right)Z, \quad (45)$$
$$\widehat{\beta} = (F'K^{-1}F)^{-1}F'K^{-1}Z. \quad (46)$$
As it turns out, $\widehat{\beta}$ coincides with the Generalized Least Squares estimator of $\beta$ under the assumption of a (centred) noise term with covariance matrix $K$. Besides this, these values of $\alpha$ and $\widehat{\beta}$ are characterized by
$$\begin{pmatrix} K & F \\ F' & 0 \end{pmatrix}\begin{pmatrix} \alpha \\ \widehat{\beta} \end{pmatrix} = \begin{pmatrix} Z \\ 0 \end{pmatrix}, \quad (47)$$
as is well-known in the geostatistics literature (see [6] and references therein). Note that [6] is a seminal reference that tackled fast cross-validation formulae already in the Universal Kriging case and also tackled the case of two points left out; a numerical illustration of Equation (47) follows below.
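As a concrete check of the characterization (47), solving the bordered system numerically recovers the Generalized Least Squares estimator of Equation (46). The following self-contained R sketch (with an illustrative kernel and trend basis, not the paper's code) does so to machine precision.

## Numerical check (illustrative) that the bordered system (47) yields the
## GLS estimator (46). Kernel and trend basis below are arbitrary choices.
set.seed(1)
n <- 30; p <- 2
s <- sort(runif(n))
K <- exp(-as.matrix(dist(s)) / 0.3)                  # SPD covariance matrix
F <- cbind(1, s)                                     # constant + linear trend basis
Z <- drop(t(chol(K)) %*% rnorm(n))                   # one draw of the field values
Ktilde <- rbind(cbind(K, F), cbind(t(F), matrix(0, p, p)))
sol <- solve(Ktilde, c(Z, rep(0, p)))
alpha <- sol[1:n]; beta.hat <- sol[n + (1:p)]
beta.gls <- solve(t(F) %*% solve(K, F), t(F) %*% solve(K, Z))  # Eq. (46)
max(abs(beta.hat - drop(beta.gls)))                  # ~ machine precision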
While our formulae coincide with those presented in [6], the block formalism sheds light on the algebraic mechanisms at work and, to a certain extent, helps achieve a broader level of generality, as illustrated for instance in Section 2.3 with the relative simplicity of obtaining multiple-fold cross-validation conditional covariance matrices in the Universal Kriging framework.

C R implementation (with DiceKriging)

cv <- function(model, folds, type = "UK", trend.reestim = TRUE,
               fast = TRUE, light = FALSE) {
  if (!is.list(folds)) {
    stop("The input 'folds' must be a list of index subsets (without index redundancy within each fold)")
  }
  if (length(model@noise.var) > 0) {
    stop("At this stage, cross-validation is not implemented for noisy observations")
  }
  q <- length(folds)
  case1 <- ((type == "SK") && (!trend.reestim))
  case2 <- ((type == "UK") && (trend.reestim))
  analytic <- (case1 | case2) && fast
  X <- model@X
  y <- model@y
  T <- model@T
  n <- model@n
  yhat <- list()
  cvcov.list <- list()
  beta <- model@trend.coef
  F <- model@F
  if (!analytic) {
    C <- crossprod(T)
    for (i in 1:q) {
      I <- folds[[i]]
      F.I <- F[I, , drop = FALSE]
      y.but.I <- y[-I, ]
      F.but.I <- F[-I, ]
      C.but.I <- C[-I, -I]
      c.but.I <- C[-I, I]
      T.but.I <- chol(C.but.I)
      x <- backsolve(t(T.but.I), y.but.I, upper.tri = FALSE)
      M <- backsolve(t(T.but.I), F.but.I, upper.tri = FALSE)
      M <- as.matrix(M)
      if (trend.reestim) {
        l <- lm(x ~ M - 1)
        beta <- as.matrix(l$coef, ncol = 1)
      }
      z <- x - M %*% beta
      Tinv.c <- backsolve(t(T.but.I), c.but.I, upper.tri = FALSE)
      y.predict.complement <- t(Tinv.c) %*% z
      y.predict.trend <- F.I %*% beta
      y.predict <- y.predict.trend + y.predict.complement
      yhat[[i]] <- y.predict
      cvcov.1 <- crossprod(Tinv.c)
      totalcov <- C[I, I]
      cvcov.list[[i]] <- totalcov - cvcov.1
      if (type == "UK") {
        T.M <- chol(crossprod(M))
        ## [... remainder of the trend re-estimation covariance correction
        ##      elided in the extracted source ...]
      }
    }
  }
  ## [... analytic (fast-formula) branch and assembly of the returned object
  ##      elided in the extracted source ...]
  return(res)
}
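Assuming the complete function body (parts of which are elided above), a hypothetical usage sketch with a DiceKriging model could look as follows; the test function and fold layout are arbitrary illustrations.

## Hypothetical usage sketch for cv() above (assumes the full, non-elided body);
## the model is fitted with the DiceKriging package.
library(DiceKriging)
set.seed(0)
X <- data.frame(x = seq(0, 1, length.out = 20))
y <- sin(12 * X$x) / (1 + X$x)                       # arbitrary test function
model <- km(~x, design = X, response = y, covtype = "matern5_2")
folds <- split(1:20, rep(1:5, each = 4))             # 5 contiguous folds of size 4
res <- cv(model, folds, type = "UK", trend.reestim = TRUE, fast = TRUE)
## Leave-one-out corresponds to folds <- as.list(1:20)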
D Additional experiments

D.1 QQ-plots on the second example

[Figure 13: two normal QQ-plots (theoretical versus sample quantiles), entitled "Normal Q-Q Plot of Transformed CV Residuals" and "Normal Q-Q Plot of Standardized CV Residuals".]

Figure 13: On the effect of accounting for covariance in QQ-plots based on CV residuals for the second example (paired points). Right panel: QQ-plot against $N(0,1)$ of CV residuals merely divided by corresponding CV standard deviations. Left panel: QQ-plot against $N(0,1)$ of duly transformed CV residuals.

D.2 Multiple-fold cross-validation on the third example

[Figure 14: two panels, "Basic versus CV GP predictions" (objective function, evaluations, GP prediction, LOO predictions) and "Absolute prediction errors vs GP standard deviation" (absolute GP prediction errors, GP standard deviation, absolute LOO prediction errors).]

Figure 14: Upper panel: GP prediction (blue line) of the test function (black line) of Equation 10 based on evaluations at a regular grid complemented by clustered points between 0.[...] and 0.3, CV predictions (red points). Lower panel: absolute prediction errors associated with GP (black line) and CV (red point) predictions, and GP prediction standard deviation (in blue). Here the second to eleventh points are grouped while the others are left as singletons (so q = 10).

[Figure 15: two normal QQ-plots (theoretical versus sample quantiles), both entitled "Normal Q-Q Plot of Standardized CV Residuals".]

Figure 15: On the effect of accounting for covariance in QQ-plots based on CV residuals for the third example (one cluster). Right panel: QQ-plot against $N(0,1)$ of CV residuals merely divided by corresponding CV standard deviations. Left panel: QQ-plot against $N(0,1)$ of duly transformed CV residuals.

D.3 LOO speed-ups when unfavouring the fast method

[Figure 16: speed-up versus design size (in hundreds).]

Figure 16: Speed-up (ratio between times required to run the naive and fast methods) measured for LOO in an unfavourable situation where $K$ is re-built and its Cholesky factor is recalculated. Apart from that, the settings are the same as in Figure 4.1.

D.4 Fast versus naive LOO for Universal Kriging

[Figure 17: speed-up versus design size (in hundreds).]

Figure 17: Speed-up (ratio between times required to run the naive and fast methods) measured for LOO under Universal Kriging (with second order trend) on 10 regular designs, with 100 to 1000 points equidistributed on [0, 1].

[Figure 18: relative errors on mean and on covariance versus design size (in hundreds).]

Figure 18: Relative errors on LOO means and covariances (between naive and fast methods, with the naive method as reference) measured as on Figure D.4.
E About merely decorrelating residuals in $C^{(2)}_{CV}(\theta)$

A natural way to remedy the lack of account of correlation between CV residuals in Eq. 30 is to modify the criterion into
$$\widetilde{C}^{(2)}_{CV}(\theta) = E_{i_1,\ldots,i_q}(\theta)^{\top}\left(\mathrm{Corr}(E_{i_1,\ldots,i_q}(\theta))\right)^{-1}E_{i_1,\ldots,i_q}(\theta). \quad (48)$$
Recalling that $\mathrm{Cov}(E_{i_1,\ldots,i_q}) = DK^{-1}D$ and removing the indexing on $i_1,\ldots,i_q$ for readability, we plainly get, by noting $G = \mathrm{diag}(DK^{-1}D)$ (implicitly depending on both $\theta$ and $\sigma^2$, as does $D$), that
$$\mathrm{Corr}(E(\theta)) = G^{-1/2}\,DK^{-1}D\,G^{-1/2}, \quad (49)$$
resulting in
$$\widetilde{C}^{(2)}_{CV}(\theta) = Z^{\top}K^{-1}D\,G^{1/2}(DK^{-1}D)^{-1}G^{1/2}\,DK^{-1}Z. \quad (50)$$
Stressing the dependencies on $\theta$ overall, and noting in particular $\bar{G}_{\theta} = \sigma^{-2}G$ and $\bar{D}(\theta) = \sigma^{-2}D$, the latter can be reformulated as
$$\widetilde{C}^{(2)}_{CV}(\theta) = Z^{\top}R_{\theta}^{-1}\bar{D}(\theta)\bar{G}_{\theta}^{1/2}\left(\bar{D}(\theta)R_{\theta}^{-1}\bar{D}(\theta)\right)^{-1}\bar{G}_{\theta}^{1/2}\bar{D}(\theta)R_{\theta}^{-1}Z. \quad (51)$$
In the particular case of leave-one-out, we obtain $D = \sigma^{2}\,\mathrm{diag}(R_{\theta}^{-1})^{-1}$, whereof
$$\mathrm{Corr}(E(\theta)) = \mathrm{diag}(R_{\theta}^{-1})^{-1/2}\,R_{\theta}^{-1}\,\mathrm{diag}(R_{\theta}^{-1})^{-1/2}$$
and
$$\widetilde{C}^{(2)}_{LOO}(\theta) = Z^{\top}R_{\theta}^{-1}\,\mathrm{diag}(R_{\theta}^{-1})^{-1/2}\,R_{\theta}\,\mathrm{diag}(R_{\theta}^{-1})^{-1/2}\,R_{\theta}^{-1}Z.$$
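For concreteness, the LOO criterion above is easy to form once $R_{\theta}$ is available. The following R sketch (illustrative only: the exponential correlation kernel and the function name are assumptions, not the paper's implementation) evaluates $\widetilde{C}^{(2)}_{LOO}(\theta)$:

## Illustrative sketch (not package code): decorrelated LOO criterion of the
## last display, with an assumed exponential correlation kernel for R_theta.
C2.tilde.loo <- function(z, x, theta) {
  R <- exp(-as.matrix(dist(x)) / theta)   # correlation matrix R_theta
  Rinv <- solve(R)
  s <- 1 / sqrt(diag(Rinv))               # diag(R_theta^{-1})^{-1/2}
  u <- s * drop(Rinv %*% z)               # diag(R_theta^{-1})^{-1/2} R_theta^{-1} z
  drop(t(u) %*% R %*% u)
}
## Example call on toy data:
## C2.tilde.loo(z = rnorm(10), x = seq(0, 1, length.out = 10), theta = 0.3)

Note that, as with Equation (51), no scale parameter enters the computation, so the criterion indeed only bears on the correlation parameters $\theta$.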