Fast calculation of Gaussian Process multiple-fold cross-validation residuals and their covariances
David Ginsbourger∗ and Cédric Schärer

Department of Mathematics and Statistics, University of Berne, Switzerland

January 11, 2021

Abstract
We generalize fast Gaussian process leave-one-out formulae to multiple-fold cross-validation, highlighting in turn in broad settings the covariance structure of cross-validation residuals. The employed approach, which relies on block matrix inversion via Schur complements, is applied to both Simple and Universal Kriging frameworks. We illustrate how resulting covariances affect model diagnostics and how to properly transform residuals in the first place. Beyond that, we examine how accounting for dependency between such residuals affects cross-validation-based estimation of the scale parameter. It is found in two distinct cases, namely in scale estimation and in broader covariance parameter estimation via pseudo-likelihood, that correcting for covariances between cross-validation residuals leads back to maximum likelihood estimation or to an original variation thereof. The proposed fast calculation of Gaussian Process multiple-fold cross-validation residuals is implemented and benchmarked against a naive implementation, all in R language. Numerical experiments highlight the accuracy of our approach as well as the substantial speed-ups that it enables. It is noticeable however, as supported by a discussion on the main drivers of computational costs and by a dedicated numerical benchmark, that speed-ups steeply decline as the number of folds (say, all sharing the same size) decreases. Overall, our results enable fast multiple-fold cross-validation, have direct consequences in GP model diagnostics, and pave the way to future work on hyperparameter fitting as well as on the promising field of goal-oriented fold design.
∗Institute of Mathematical Statistics and Actuarial Science, Department of Mathematics and Statistics, University of Berne, Alpeneggstrasse 22, CH-3012 Bern, Switzerland. E-mail: [email protected]

Introduction

We focus on cross-validation in the context of Gaussian process (GP) modelling, and more specifically on adapting fast formulae for leave-one-out to multiple-fold cross-validation and on exploring how these can be efficiently exploited in topics such as model diagnostics and covariance hyperparameter fitting. One of the essential bricks of the proposed approaches turns out to be closed-form formulae for the covariance structure of cross-validation residuals. Our main results are presented both in the Simple Kriging and Universal Kriging frameworks. They contribute to a lively stream of research on cross-validation that pertains to GP modelling, and also to further related classes of statistical models.

Leave-one-out and multiple-fold cross-validation in the GP framework were tackled as early as in [6], and some further references as well as a more recent account of their use in machine learning are given in Chapter 5 of [18]. A survey of cross-validation procedures for model selection is presented in [1]. Cross-validation for selecting a model selection procedure is tackled in [23]. [10] proposed cokriging-based sequential design strategies using fast cross-validation for multi-fidelity computer codes. [7] studied model selection for GP regression by Approximation Set Coding. [13] suggested new probabilistic cross-validation estimators for Gaussian process regression. Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure were investigated in [19]. New prediction error estimators were defined in [16], where new model selection criteria in the same spirit as AIC and Mallows' $C_p$ were suggested. Leave-one-out cross-validation for Bayesian model comparison in large data was recently tackled in [11]. [9] introduced an efficient batch multiple-fold cross-validation Voronoi adaptive sampling technique for global surrogate modeling. [17] introduced a bias-corrected cross-validation estimator for correlated data.

Throughout the paper we focus on a random vector $Z = (Z_{x_1}, \ldots, Z_{x_n})^\top$ of values taken by a random field $(Z_x)_{x \in D}$ at given points $x_1, \ldots, x_n \in D$ ($n \geq 1$). $D$ is typically thought of as a compact set of the $d$-dimensional Euclidean space but can actually be any set. Furthermore, we denote random vector realizations using lower case letters, so that a typical realization of $Z$ is thus noted $z$; $z$ is silently assumed of finite norm (as holds almost surely for $Z$). We assume here that $Z_x = \mu(x) + \eta_x$, where $(\eta_x)_{x \in D}$ is a centred Gaussian random field with kernel $k : D \times D \to \mathbb{R}$ that may depend on so-called covariance hyperparameters $\psi \in \Psi$ (the set $\Psi$ being case-dependent and specified upon usage), and that the trend function $\mu(x)$ may be assumed known (case of Simple Kriging, see Section 2.1) or depending linearly on unknown trend parameters $\beta$ (see Universal Kriging in Section 2.3, including its Bayesian interpretation where $\beta$ is assumed a priori independent of the random field $\eta$). Most importantly, we denote by $K = (k(x_i, x_j))_{i,j \in \{1,\ldots,n\}}$ the covariance matrix (assumed invertible) of the Gaussian random vector $\eta = (\eta_{x_1}, \ldots, \eta_{x_n})^\top$, and by $k(x) = (k(x, x_i))_{i \in \{1,\ldots,n\}}$ the vector of covariances between $\eta_x$ and the $\eta_{x_i}$'s, where $x \in D$ is an arbitrary point of the index set.
Denoting by $\mu = (\mu(x_1), \ldots, \mu(x_n))^\top$ the vector of trend values at the observation points, we recall below the Simple Kriging equations:
$$\widehat{z}(x) = \mu(x) + k(x)^\top K^{-1}(z - \mu),$$
$$s^2(x) = k(x, x) - k(x)^\top K^{-1}k(x).$$
$\widehat{z}(x)$ and $s^2(x)$ are called Simple Kriging mean and
Simple Kriging variance, respectively, and are in turn the posterior mean and variance of $Z_x$ knowing the observations in the presently assumed Gaussian framework. Let us recall that the equations of the Simple Kriging mean and variance are also used in non-Gaussian frameworks of second-order random fields, where they respectively stand for the best linear unbiased predictor and its associated mean squared error. Let us also remark that one can always boil down in that case to Simple Kriging with a null trend by working on the centred field $(\eta_x)_{x \in D} = (Z_x - \mu(x))_{x \in D}$, and so we assume without loss of generality that $\mu = 0$ in the following quick review of fast leave-one-out formulae.

Leave-one-out (LOO) cross-validation consists in predicting at each of the observation locations $x_i$ when (virtually) removing the corresponding observation from the data set, and then comparing, for each $i \in \{1, \ldots, n\}$, the predicted value to the left-out observation, also in light of the associated prediction variance. Coming back to the formalism, we denote here by $e_i = z_i - \widehat{z}_{-i}$ the $i$th leave-one-out prediction error, where $\widehat{z}_{-i}$ stands for the Simple Kriging mean at point $x_i$ when excluding the $i$th observation from the "learning set", and by $s_{-i}$ the corresponding leave-one-out prediction standard deviation. It has been found as early as in 1983 (see Dubrule's seminal paper [6]) that the LOO prediction errors and variances can be calculated quite elegantly and efficiently by relying on matrix inverses and quantities derived thereof. Working here as in [2] in terms of covariances and denoting by $K^{ij}$ the coefficient on the $i$th row and $j$th column of the inverse covariance matrix $K^{-1}$, one obtains indeed, by using for instance the Schur complement formula of Theorem 2 with properly chosen blocks, that
$$s_{-i}^2 = (K^{ii})^{-1} \quad (i \in \{1, \ldots, n\}), \qquad (1)$$
and subsequently that
$$e_i = (K^{ii})^{-1}(K^{-1}z)_i \quad (i \in \{1, \ldots, n\}). \qquad (2)$$
In compact form, the vector of LOO errors writes
$$e = \mathrm{diag}\big((K^{ii})^{-1}\big)\,K^{-1}z. \qquad (3)$$
This elegant formula has been leveraged in [2] for hyperparameter estimation in the stationary case, namely by minimizing the squared norm of leave-one-out errors as a function of covariance parameters (excluding the "variance" parameter $\sigma^2 := \mathrm{Var}[\eta_x]$), using the compact expression
$$\|e\|^2 = z^\top K^{-1}\,\mathrm{diag}\big((K^{ii})^{-1}\big)^2\,K^{-1}z. \qquad (4)$$
In particular, numerical experiments conducted in [2] suggested that leave-one-out error minimization is more robust than maximum likelihood in the case of model misspecification, a common situation in practice.

Yet, depending on the experimental design (meaning the choice of the $x_i$'s), the errors obtained by leaving one point out at a time may be far from representative of generalization errors. While remaining in a framework at given design, multiple-fold cross-validation allows mitigating shortcomings of leave-one-out by considering situations where larger parts of the design (i.e., several points) are removed at a time.

Our primary aim in the present paper is to generalize fast leave-one-out formulae to multiple-fold cross-validation, hence facilitating their use in both contexts of model diagnostics and hyperparameter optimization. We derive such formulae and highlight in turn the covariance structure of cross-validation residuals, a crucial ingredient in the proposed diagnostics and findings pertaining to parameter fitting.
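For concreteness, the fast LOO formulae of Eqs. 1-3 can be sketched in a few lines of R; the squared-exponential kernel, the toy data, and all object names below are illustrative choices of ours, not taken from any package:

```r
# Minimal sketch of the fast LOO formulae (Eqs. 1-3) for a centred GP;
# kernel and data are illustrative assumptions.
set.seed(1)
x <- seq(0, 1, length.out = 10)
z <- sin(6 * x) + 0.1 * rnorm(10)            # toy observations (mu = 0 assumed)
K <- exp(-0.5 * outer(x, x, "-")^2 / 0.2^2)  # covariance matrix of the eta_{x_i}'s
Kinv <- solve(K)

s2_loo <- 1 / diag(Kinv)                     # Eq. (1): s_{-i}^2 = (K^{ii})^{-1}
e_loo  <- s2_loo * (Kinv %*% z)              # Eq. (3): e = diag((K^{ii})^{-1}) K^{-1} z

# Naive check: for each i, predict z_i from the other observations
e_naive <- sapply(seq_along(x), function(i)
  z[i] - K[i, -i] %*% solve(K[-i, -i], z[-i]))
max(abs(e_loo - e_naive))                    # agreement up to numerical error
```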
Furthermore, following ideas already sketched in the seminal paper of Dubrule [6], we present extensions of our results to the case of Universal Kriging. The obtained results clarify some links between cross-validation and maximum likelihood, and also open new perspectives regarding the design of folds and ultimately also of the underlying experimental design points.

Section 2 is dedicated to fast multiple-fold cross-validation. We first present in Section 2.1 our main results on the fast calculation and the joint probability distribution of cross-validation residuals in the Simple Kriging case. The latter are illustrated based on analytical test functions in Section 2.2 and then extended to Universal Kriging settings in Section 2.3. In Section 3 we present consequences of the main results in the context of model fitting. Section 3.1 focuses on graphical diagnostics via pivotal statistics for cross-validation residuals, while the two subsequent sections are devoted to cross-validation-based covariance parameter estimation. More specifically, Section 3.2 is devoted to the estimation of the scale parameter; Section 3.3 then pertains to the estimation of further hyperparameters, building upon either the norm or the (pseudo-)likelihood of multiple-fold cross-validation residuals. Section 4 mostly consists of numerical experiments illustrating the accuracy and speed-ups offered by the closed-form formulae (Section 4.1), and of some considerations regarding associated computational complexities (Section 4.2). The ultimate Section 5 is a discussion opening perspectives on how the established results could be further exploited to improve diagnostics and parameter estimation procedures in GP modelling.
Fast multiple-fold cross-validation

Throughout the following we consider vectors of strictly ordered indices from $\{1, \ldots, n\}$, and denote by $S$ the set of all such index vectors. Let us remark that $S$ is equipotent to $\mathcal{P}(\{1, \ldots, n\}) \setminus \{\emptyset\}$ and hence consists of $2^n - 1$ elements, as many as there are (non-void) subsets of $\{1, \ldots, n\}$.

As we now tackle distributional properties of cross-validation residuals, we mostly use random vector notations. We denote by $Z_{\mathbf{i}}$ the subvector of $Z$ associated with a vector of strictly ordered indices $\mathbf{i} \in S$, by $\widehat{Z}_{\mathbf{i}}$ the vector of Simple Kriging mean values at points $x_i$ ($i \in \mathbf{i}$) based on the observation of responses at the remaining points, denoted in short $Z_{-\mathbf{i}}$, and by $E_{\mathbf{i}} = Z_{\mathbf{i}} - \widehat{Z}_{\mathbf{i}}$ the Simple Kriging residual obtained when predicting at locations indexed by $\mathbf{i}$ based on observations at the remaining locations. Note that this residual is well-defined at least in an almost sure sense, and the following equalities between random vectors will be in an almost sure sense unless precised otherwise.

Finally, for any $n \times n$ matrix $M$ and $\mathbf{i}, \mathbf{j} \in S$, we denote by $M[\mathbf{i}, \mathbf{j}] = M_{\mathbf{i},\mathbf{j}}$ the block extracted from $M$ with row and column indices from the index vectors $\mathbf{i}$ and $\mathbf{j}$, respectively, and use the shortcut notation $M[\mathbf{i}]$ for $M[\mathbf{i}, \mathbf{i}] = M_{\mathbf{i},\mathbf{i}}$. Note also that while the indexing by $\mathbf{i}$ in $Z_{\mathbf{i}}$ corresponds to taking the subvector $Z[\mathbf{i}]$, things are different for $\widehat{Z}_{\mathbf{i}}$ as it does not follow from a simple extraction.

Our first result concerns the distribution of the random field of cross-validation residuals indexed by the set of possible $\mathbf{i}$'s, i.e. $(E_{\mathbf{i}})_{\mathbf{i} \in S}$. While its elements have a dimensionality that depends on $\mathbf{i}$, the random field $(E_{\mathbf{i}})_{\mathbf{i} \in S}$ has a Gaussian distribution in the sense that any concatenation of $E_{\mathbf{i}}$'s forms a Gaussian vector, as further detailed in the theorem below.

Theorem 1.
For any $\mathbf{i} \in S$, the Simple Kriging residual $E_{\mathbf{i}} = Z_{\mathbf{i}} - \widehat{Z}_{\mathbf{i}}$ obtained when predicting at locations indexed by $\mathbf{i}$ based on observations at the remaining ones writes
$$E_{\mathbf{i}} = (K^{-1}[\mathbf{i}])^{-1}(K^{-1}Z)_{\mathbf{i}}. \qquad (5)$$
Consequently, for any $q > 1$ and $\mathbf{i}_1, \ldots, \mathbf{i}_q \in S$, the $E_{\mathbf{i}_j}$ ($1 \le j \le q$) are jointly Gaussian, centred, and with covariance structure given by
$$\mathrm{Cov}(E_{\mathbf{i}}, E_{\mathbf{j}}) = (K^{-1}[\mathbf{i}])^{-1}\,K^{-1}[\mathbf{i}, \mathbf{j}]\,(K^{-1}[\mathbf{j}])^{-1} \quad (\mathbf{i}, \mathbf{j} \in S). \qquad (6)$$
In particular, for the case of folds $\mathbf{i}_1, \ldots, \mathbf{i}_q$ forming a partition of $\{1, \ldots, n\}$ and such that the concatenation of $\mathbf{i}_1, \ldots, \mathbf{i}_q$ gives $(1, \ldots, n)$, the concatenated vector of cross-validation residuals $E_{\mathbf{i}_1,\ldots,\mathbf{i}_q} := (E_{\mathbf{i}_1}^\top, \ldots, E_{\mathbf{i}_q}^\top)^\top$ has an $n$-dimensional centred Gaussian distribution with covariance matrix
$$\mathrm{Cov}(E_{\mathbf{i}_1,\ldots,\mathbf{i}_q}) = DK^{-1}D, \qquad (7)$$
where $D = \mathrm{blockdiag}\big((K^{-1}[\mathbf{i}_1])^{-1}, \ldots, (K^{-1}[\mathbf{i}_q])^{-1}\big)$.

Proof. Equation 5 follows from suitably expressing blocks of $K$'s inverse using the Schur complement approach of Equations 42 and 43, delivering respectively
$$K^{-1}[\mathbf{i}] = \big(K[\mathbf{i}] - K[\mathbf{i}, -\mathbf{i}]\,K[-\mathbf{i}]^{-1}\,K[-\mathbf{i}, \mathbf{i}]\big)^{-1}, \text{ and}$$
$$K^{-1}[\mathbf{i}, -\mathbf{i}] = -\big(K[\mathbf{i}] - K[\mathbf{i}, -\mathbf{i}]\,K[-\mathbf{i}]^{-1}\,K[-\mathbf{i}, \mathbf{i}]\big)^{-1}K[\mathbf{i}, -\mathbf{i}]\,K[-\mathbf{i}]^{-1}.$$
Considering the rows indexed by $\mathbf{i}$ of the product $K^{-1}Z$, we hence get that
$$(K^{-1}Z)_{\mathbf{i}} = K^{-1}[\mathbf{i}]\,Z_{\mathbf{i}} - K^{-1}[\mathbf{i}]\,K[\mathbf{i}, -\mathbf{i}]\,K[-\mathbf{i}]^{-1}Z_{-\mathbf{i}} = K^{-1}[\mathbf{i}]\,(Z_{\mathbf{i}} - \widehat{Z}_{\mathbf{i}}),$$
whereof $E_{\mathbf{i}} = (K^{-1}[\mathbf{i}])^{-1}(K^{-1}Z)_{\mathbf{i}}$. In order to highlight the joint Gaussianity and the covariance structure at once, let us further define $\Delta_{\mathbf{i}} = I_n[\mathbf{i}, (1, \ldots, n)]$ to be the $|\mathbf{i}| \times n$ "subsetting" matrix. We then have that for any $\mathbf{i} \in S$, $E_{\mathbf{i}} = (K^{-1}[\mathbf{i}])^{-1}\Delta_{\mathbf{i}}K^{-1}Z$, so that concatenating any finite number $q \ge 1$ of residuals $E_{\mathbf{i}_1}, \ldots, E_{\mathbf{i}_q}$ leads to a Gaussian vector, by left multiplication of $Z$ by a deterministic matrix. As for the covariance matrix of $E_{\mathbf{i}_1,\ldots,\mathbf{i}_q}$, it follows from
$$(E_{\mathbf{i}_1}^\top, \ldots, E_{\mathbf{i}_q}^\top)^\top = \begin{pmatrix}(K^{-1}[\mathbf{i}_1])^{-1}\Delta_{\mathbf{i}_1}K^{-1}Z \\ \vdots \\ (K^{-1}[\mathbf{i}_q])^{-1}\Delta_{\mathbf{i}_q}K^{-1}Z\end{pmatrix} = \mathrm{blockdiag}\big((K^{-1}[\mathbf{i}_1])^{-1}, \ldots, (K^{-1}[\mathbf{i}_q])^{-1}\big)\begin{pmatrix}\Delta_{\mathbf{i}_1} \\ \vdots \\ \Delta_{\mathbf{i}_q}\end{pmatrix}K^{-1}Z = DI_nK^{-1}Z = DK^{-1}Z,$$
hence $\mathrm{Cov}(E_{\mathbf{i}_1,\ldots,\mathbf{i}_q}) = DK^{-1}KK^{-1}D^\top = DK^{-1}D$.
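A direct transcription of Theorem 1 in R might look as follows; this is a sketch under illustrative kernel, data and fold choices, where the folds are ordered and concatenate to $(1, \ldots, n)$ so that Eq. 7 applies:

```r
# Sketch of Theorem 1: multiple-fold CV residuals (Eq. 5) and the covariance
# of their concatenation (Eq. 7); kernel, data and folds are illustrative.
set.seed(2)
n <- 12
x <- seq(0, 1, length.out = n)
z <- sin(6 * x)
K <- exp(-0.5 * outer(x, x, "-")^2 / 0.2^2)
Kinv <- solve(K)
folds <- list(1:3, 4:6, 7:9, 10:12)   # a partition concatenating to (1, ..., n)

KinvZ <- Kinv %*% z
E <- lapply(folds, function(i) solve(Kinv[i, i], KinvZ[i]))   # Eq. (5)

D <- matrix(0, n, n)                  # D = blockdiag((K^{-1}[i_j])^{-1})
for (i in folds) D[i, i] <- solve(Kinv[i, i])
covE <- D %*% Kinv %*% D              # Eq. (7): Cov of concatenated residuals

# Check the first fold against a naive held-out prediction
i <- folds[[1]]
max(abs(E[[1]] - (z[i] - K[i, -i] %*% solve(K[-i, -i], z[-i]))))
```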
Remark 1. Note that for arbitrary $\mathbf{i}_1, \ldots, \mathbf{i}_q \in S$, i.e. without imposing ordering between them or that they form a partition, we would have a similar result yet without the above simplification, i.e. $\mathrm{Cov}(E_{\mathbf{i}_1,\ldots,\mathbf{i}_q}) = D\Delta K^{-1}\Delta^\top D$ with $\Delta = (\Delta_{\mathbf{i}_1}^\top, \ldots, \Delta_{\mathbf{i}_q}^\top)^\top$. An extreme case would be to consider all possible non-empty subsets of $\{1, \ldots, n\}$, leading to $q = 2^n - 1$ and $n2^{n-1}$ lines for $\Delta$.
Remark 2. In the extreme case where $q = n$ and the $\mathbf{i}_j$'s are set to $(j)$ ($1 \le j \le n$), one recovers on the other hand the fast leave-one-out cross-validation formulae, and we obtain as a by-product the covariance matrix of leave-one-out errors
$$\mathrm{diag}\big((K^{ii})^{-1}\big)\,K^{-1}\,\mathrm{diag}\big((K^{ii})^{-1}\big), \qquad (8)$$
so that, in particular, $\mathrm{cov}(E_i, E_j) = (K^{ii})^{-1}K^{ij}(K^{jj})^{-1}$, whereof $\mathrm{corr}(E_i, E_j) = (K^{ii})^{-1/2}K^{ij}(K^{jj})^{-1/2}$, leading to a correlation matrix of LOO residuals
$$\mathrm{Corr}(E_{(1),\ldots,(n)}) = \mathrm{diag}\big((K^{ii})^{-1/2}\big)\,K^{-1}\,\mathrm{diag}\big((K^{ii})^{-1/2}\big) = D^{1/2}K^{-1}D^{1/2}. \qquad (9)$$

1-d function

As a first example, we consider a case of GP prediction for a one-dimensional test function from [22],
$$f : x \in [0, 1] \mapsto f(x) = \sin\big(30(x - 0.9)^4\big)\cos\big(2(x - 0.9)\big) + (x - 0.9)/2. \qquad (10)$$
For simplicity we consider a regular design, here a 10-point subdivision of $[0, 1]$.
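For reproducibility, the test function of Eq. 10 and the regular design can be coded as follows (the GP model itself, fitted in the paper's experiments with the km function of the R package DiceKriging, is left out of this sketch):

```r
# Test function from [22] (Eq. 10) and the 10-point regular design
f <- function(x) sin(30 * (x - 0.9)^4) * cos(2 * (x - 0.9)) + (x - 0.9) / 2
design <- seq(0, 1, length.out = 10)
observations <- f(design)
```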
Figure 1: On the upper panel, GP prediction (blue line) of the test function (black line) defined by Equation 10 based on 10 evaluations at a regular grid, and LOO cross-validation predictions (red points). Lower panel: absolute prediction errors associated with GP (black line) and LOO (red points) predictions, and GP prediction standard deviation (in blue).

By design, the considered example function presents moderate fluctuations in the first third of the domain, then a big up-and-down in the second third, and finally is relatively flat in the last third. Of course, appealing to a non-stationary GP that could capture this is legitimate in such a case (and has been done, for instance, in [22, 12]), yet we rather stick here to a baseline stationary model and start by illustrating how cross-validation highlights this heterogeneity. Looking first at the upper panel of Figure 1, we see that LOO residuals tend to take their larger magnitudes on the boundaries and around the centre of the domain. A closer look reveals that the residual at the left boundary point has a higher magnitude than the one at the right boundary, and that the location with the highest LOO residual is at the top of the central bump. The graph on the lower panel of Figure 1 represents the absolute prediction errors associated with the baseline ("basic") GP model (black line), with the LOO cross-validation (red points), as well as the prediction standard deviation coming with the basic GP model (blue line). While the GP prediction standard deviation appears to behave in a homogeneous way across the domain (it is known to depend solely on the evaluation locations and not on the responses, a phenomenon referred to as homoscedasticity in the observations), the true absolute prediction errors are concentrated on the left half of the domain, with a largest peak around the left boundary and several bumps spreading out towards the centre of the domain. Contrarily to what the LOO magnitude at the rightmost point suggests, there is no substantial prediction error in this region of the domain. In all, while LOO versus actual absolute prediction errors are not in complete agreement, LOO signals more difficulty to predict in the left domain half, something that the prediction standard deviation attached to the considered stationary GP model is by design unable to detect.

Besides this, as a by-product of Theorem 1 (in the limiting case of LOO, see Remark 2), we represent on Figure 2 the correlations between the LOO prediction residual at the leftmost location versus the LOO residuals at the other locations.
Figure 2: Correlations between the LOO residual at the leftmost point and the LOO residuals at all design points in the example displayed on Figure 1.

A first interesting thing that we would like to stress here is that the correlation with the LOO prediction residual at the second location is negative, with a value below −0.5.
While such a negative correlation between successive LOO residuals does not materialize with the actual prediction errors at these two locations, an illustration of this effect can be seen on the upper panel of Figure 1 when looking at the fourth and fifth locations. Coming back to Figure 2, after the second location the correlation goes back to positive but with a smaller magnitude, and subsequently appears to continue with a damped oscillation until stationing around zero. As we will see in Section 3, accounting for these correlations is instrumental in producing suitable Q-Q plots from cross-validation residuals. Let us now turn to a further example focusing on multiple-fold cross-validation.
We now mimic the previous example but consider, instead of our 10 previous points, a design of 20 points formed by 10 pairs of neighbouring points. First, a regular design of 10 points between $\epsilon$ and $1 - \epsilon$ is formed, where $\epsilon$ is a constant with value prescribed to 0.001 in this example.
Then, for every point of this base design, two design points are created by adding realizations of independently generated uniform random variables on $[-\epsilon, \epsilon]$.
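The paired design can be generated along the following lines (a sketch; the seed is an illustrative choice of ours):

```r
# 20-point design made of 10 pairs of close neighbours (eps = 0.001)
set.seed(42)                      # illustrative seed
eps  <- 0.001
base <- seq(eps, 1 - eps, length.out = 10)
design20 <- sort(c(base + runif(10, -eps, eps),
                   base + runif(10, -eps, eps)))
```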
Figure 3: Absolute prediction errors associated with GP (black line) and cross-validation (red points) predictions, and GP prediction standard deviation (in blue). Upper panel: LOO. Lower panel: 10-fold CV where each fold contains two neighbouring points from the example, at distances upper bounded by $2\epsilon = 2 \cdot 10^{-3}$.

As can be seen on the upper panel of Figure 3, LOO is here completely myopic to prediction errors, precisely because of the paired points. Indeed, when leaving one point out at a time, there is always a very close neighbour that enables making an accurate Kriging prediction. This somewhat extreme situation highlights how LOO (and more generally cross-validation with a poorly designed partition of input points) may lead to a bad picture of the model's generalization ability. When performing multiple-fold cross-validation with grouping of the pairs, however, the residuals are much more insightful, as illustrated on the lower panel of Figure 3. Let us further remark that Theorem 1 delivers in turn the respective covariance matrices between residuals within groups. Let us now generalize our previous results to the settings of Universal Kriging.

In the presence of a random field possessing a linear trend with known basis functions but unknown trend coefficients to be estimated within Kriging, Universal Kriging (UK) appears as a standard option. In particular, UK enables an integrated treatment of prediction uncertainty, with an inflated prediction variance compared to Simple Kriging that reflects the additional estimation of trend coefficients. Details of Universal Kriging in the Gaussian case are recalled in Appendix B. Beyond its inception in spatial statistics [14], UK has also been popularized in the design and analysis of computer experiments [21] and is implemented in several dedicated software packages (including [20, 5]). While it has been known since [6] that fast cross-validation formulae were achievable in the UK framework, we provide here a systematic approach to it that relies on block matrix calculations and generalizes the results from the previous section.

Sticking to the notation adopted in Appendix B, recall that an essential matrix arising in UK calculations is the augmented covariance matrix
$$\widetilde{K} = \begin{pmatrix} K & F \\ F^\top & 0 \end{pmatrix},$$
where $F$ is the $n \times p$ experimental/design matrix (assumed to have full column rank), with general entry $F_{i,j} = f_j(x_i)$ ($1 \le i \le n$, $1 \le j \le p$, where $p$ is the number of basis functions $f_j$). A key to our approach to fast cross-validation in the UK framework will be to apply block matrix inversion to $\widetilde{K}$. To that end, let us first consider without loss of generality the case where observation 1 is left out, and denote for brevity $K_{1,-1} = K[(1), -(1)]$, $K_{-1,1} = K[-(1), (1)]$, $K_{-1} = K[-(1), -(1)]$ (similarly for $\widetilde{K}$), and $F[-1,] = F[-(1), (1, \ldots, p)]$.
We can then write $\widetilde{K}$ in block form as follows:
$$\widetilde{K} = \begin{pmatrix} \widetilde{K}_{1,1} & \widetilde{K}_{1,-1} \\ \widetilde{K}_{-1,1} & \widetilde{K}_{-1} \end{pmatrix} = \begin{pmatrix} K_{1,1} & K_{1,-1} & f(x_1)^\top \\ K_{-1,1} & K_{-1} & F[-1,] \\ f(x_1) & F[-1,]^\top & 0_{p \times p} \end{pmatrix}, \qquad (11)$$
with $\widetilde{K}_{1,1} = K_{1,1}$, $\widetilde{K}_{1,-1} = (K_{1,-1}, f(x_1)^\top)$, and
$$\widetilde{K}_{-1} = \begin{pmatrix} K_{-1} & F[-1,] \\ F[-1,]^\top & 0_{p \times p} \end{pmatrix}.$$
From there, assuming that $F[-1,]$ is full column-ranked, the usual Schur complement block inversion formula delivers
$$\widetilde{K}^{-1} = \begin{pmatrix} V^{-1} & -V^{-1}\widetilde{K}_{1,-1}\widetilde{K}_{-1}^{-1} \\ -\widetilde{K}_{-1}^{-1}\widetilde{K}_{-1,1}V^{-1} & \text{non-represented block} \end{pmatrix}, \qquad (12)$$
where the short notation $V = \widetilde{K}_{1,1} - \widetilde{K}_{1,-1}\widetilde{K}_{-1}^{-1}\widetilde{K}_{-1,1}$ is used. A nice thing with the latter, as we show below relying on Equation 11 and further use of block inversion, is that $V$ turns out to coincide with the Universal Kriging variance (when predicting at $x_1$ based on observations at the other points), while $(\widetilde{K}^{-1}[\{1, \ldots, n\}]Z)_1$ delivers the corresponding UK prediction residual. To see the former, let us first remark that, by using the Schur complement formula with respect to the bottom right block of $\widetilde{K}_{-1}$, one obtains, with the shorthand notation $I_{\beta,-1} = F[-1,]^\top K_{-1}^{-1}F[-1,]$,
$$\widetilde{K}_{-1}^{-1} = \begin{pmatrix} K_{-1} & F[-1,] \\ F[-1,]^\top & 0_{p \times p} \end{pmatrix}^{-1} = \begin{pmatrix} K_{-1}^{-1} - K_{-1}^{-1}F[-1,]\,I_{\beta,-1}^{-1}\,F[-1,]^\top K_{-1}^{-1} & K_{-1}^{-1}F[-1,]\,I_{\beta,-1}^{-1} \\ I_{\beta,-1}^{-1}F[-1,]^\top K_{-1}^{-1} & -I_{\beta,-1}^{-1} \end{pmatrix}. \qquad (13)$$
Plugging Equation 13 into the definition of $V$ above and remembering that $\widetilde{K}_{1,-1} = (K_{1,-1}, f(x_1)^\top)$, we obtain that
$$\widetilde{K}_{1,-1}\widetilde{K}_{-1}^{-1}\widetilde{K}_{-1,1} = K_{1,-1}K_{-1}^{-1}K_{-1,1} - \big(f(x_1) - F[-1,]^\top K_{-1}^{-1}K_{-1,1}\big)^\top I_{\beta,-1}^{-1}\big(f(x_1) - F[-1,]^\top K_{-1}^{-1}K_{-1,1}\big), \qquad (14)$$
and one hence recognizes that $V = K_{1,1} - \widetilde{K}_{1,-1}\widetilde{K}_{-1}^{-1}\widetilde{K}_{-1,1}$ coincides with the UK prediction variance at $x_1$ based on evaluations at the $x_i$ ($2 \le i \le n$). As for the UK (mean) predictor, introducing $Z_{-1}$ such that $Z^\top = (Z_1, Z_{-1}^\top)$, we get
$$(\widetilde{K}^{-1}[\{1, \ldots, n\}]Z)_1 = V^{-1}Z_1 - V^{-1}\widetilde{K}_{1,-1}\widetilde{K}_{-1}^{-1}[\,,(1, \ldots, n-1)]\,Z_{-1}. \qquad (15)$$
From the first $n - 1$ columns of Equation 13, we get
$$\widetilde{K}_{-1}^{-1}[\,,(1, \ldots, n-1)]\,Z_{-1} = \begin{pmatrix} K_{-1}^{-1}Z_{-1} - K_{-1}^{-1}F[-1,]\,I_{\beta,-1}^{-1}F[-1,]^\top K_{-1}^{-1}Z_{-1} \\ I_{\beta,-1}^{-1}F[-1,]^\top K_{-1}^{-1}Z_{-1} \end{pmatrix},$$
which, after left multiplication by $\widetilde{K}_{1,-1}$ and substitution in Equation 15, delivers indeed
$$V\,(\widetilde{K}^{-1}[(1, \ldots, n)]Z)_1 = Z_1 - \widehat{Z}_1, \qquad (16)$$
where $\widehat{Z}_1$ stands here for the UK mean predictor at $x_1$ based on evaluations at the $x_i$ ($2 \le i \le n$). Coming back to multiple-fold cross-validation, let us now formulate a corollary to Theorem 1 for the case of Universal Kriging, the proof of which is not included as it can be obtained by generalizing to arbitrary folds, following Theorem 1's proof scheme, the results presented above for the case where solely the observation at the first point is left out.

Corollary 1.
For any $\mathbf{i} \in S$ such that $F[-\mathbf{i},]$ is full column-ranked and revisiting the notation $\widehat{Z}_{\mathbf{i}}$ to denote the Universal Kriging predictor (with arbitrary basis functions as discussed here and in Appendix B), the Universal Kriging residual $E_{\mathbf{i}} = Z_{\mathbf{i}} - \widehat{Z}_{\mathbf{i}}$ obtained when predicting at locations indexed by $\mathbf{i}$ based on observations at the remaining ones writes
$$E_{\mathbf{i}} = (\widetilde{K}^{-1}[\mathbf{i}])^{-1}(\widetilde{K}^{-1}[(1, \ldots, n)]Z)_{\mathbf{i}}, \qquad (17)$$
where $\widetilde{K}$ is defined in Equation 11. Consequently, for any $q > 1$ and $\mathbf{i}_1, \ldots, \mathbf{i}_q \in S$ such that the $F[-\mathbf{i}_j,]$'s are full column-ranked, the $E_{\mathbf{i}_j}$ ($1 \le j \le q$) are jointly Gaussian, centred, and with covariance structure given by
$$\mathrm{Cov}(E_{\mathbf{i}}, E_{\mathbf{j}}) = (\widetilde{K}^{-1}[\mathbf{i}])^{-1}\,\widetilde{K}^{-1}[\mathbf{i}, \mathbf{j}]\,(\widetilde{K}^{-1}[\mathbf{j}])^{-1} \quad (\mathbf{i}, \mathbf{j} \in S). \qquad (18)$$
In particular, for the case of index vectors $\mathbf{i}_1, \ldots, \mathbf{i}_q$ forming a partition of $\{1, \ldots, n\}$ and such that the concatenation of $\mathbf{i}_1, \ldots, \mathbf{i}_q$ gives $(1, \ldots, n)$, the concatenated vector of cross-validation residuals $E_{\mathbf{i}_1,\ldots,\mathbf{i}_q} := (E_{\mathbf{i}_1}^\top, \ldots, E_{\mathbf{i}_q}^\top)^\top$ has an $n$-dimensional centred Gaussian distribution with covariance matrix
$$\mathrm{Cov}(E_{\mathbf{i}_1,\ldots,\mathbf{i}_q}) = \widetilde{D}\,\widetilde{K}^{-1}[(1, \ldots, n)]\,\widetilde{D}, \qquad (19)$$
where $\widetilde{D} = \mathrm{blockdiag}\big((\widetilde{K}^{-1}[\mathbf{i}_1])^{-1}, \ldots, (\widetilde{K}^{-1}[\mathbf{i}_q])^{-1}\big)$.

We now wish to further explore the potential of the established formulae for diagnosing the quality of GP models. A first immediate consequence of Theorem 1 and Corollary 1 for model fitting is that, under the hypothesis that observations are generated from the assumed model, cross-validation residuals can be transformed via appropriate left matrix multiplications into a random vector with standard multivariate normal distribution. Assuming for simplicity that $\mathrm{Cov}(E_{\mathbf{i}_1,\ldots,\mathbf{i}_q})$ is full-ranked and denoting $t = \sum_{j=1}^q |\mathbf{i}_j|$, it follows that
$$\mathrm{Cov}(E_{\mathbf{i}_1,\ldots,\mathbf{i}_q})^{-1/2}\,E_{\mathbf{i}_1,\ldots,\mathbf{i}_q} \sim \mathcal{N}(0, I_t).$$
In fact, any matrix $A \in \mathbb{R}^{n \times n}$ such that $ADK^{-1}DA^\top = I_n$ (remaining for simplicity under the final assumptions of Theorem 1) does the job. More specifically, sticking to the final part of Theorem 1 where $\Delta = I_n$, one gets indeed with $A = K^{1/2}D^{-1}$ that
$$AE_{\mathbf{i}_1,\ldots,\mathbf{i}_q} = K^{-1/2}Z \sim \mathcal{N}(0, I_n),$$
and hence the hypothesis that the model is correct can be questioned using any standard means relying on such a pivotal quantity with multivariate Gaussian distribution (e.g., by means of a $\chi^2$ test statistic or graphical diagnostics such as Q-Q plots; see also [3] for more on the topic). Let us remark that, while $K^{-1/2}Z$ or a quantity sharing its distribution can be directly computed (see also the remark below regarding $K$'s Cholesky factor), $A = K^{1/2}D^{-1}$ too can be computed without needing to perform much additional calculation since
$$D^{-1} = \mathrm{blockdiag}\big(K^{-1}[\mathbf{i}_1], \ldots, K^{-1}[\mathbf{i}_q]\big)$$
is pre-calculated, and $K^{1/2}$ can also be replaced by a pre-calculated matrix such as the lower triangular Cholesky factor of $K$. Similar considerations apply to the UK settings of Corollary 1.

Remark 3. If $\Delta$ is not $I_n$ but still an invertible matrix, then the above can be adapted with $A = K^{1/2}\Delta^{-1}D^{-1}$. In case $\Delta$ is not invertible, analogous approaches can be employed relying on the Moore-Penrose pseudo-inverse of $D\Delta K^{-1}\Delta^\top D^\top$, where the $n$-dimensional standard normal should be replaced by a standard Gaussian on the range of $\Delta K^{-1}\Delta^\top$.
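In R, the construction of such a pivotal vector from LOO residuals can be sketched as follows; the kernel and data are illustrative choices of ours, and we note that with $L$ the lower Cholesky factor of $K$ ($K = LL^\top$), the matrix $A = L^\top D^{-1}$ indeed satisfies $A\,\mathrm{Cov}(E)\,A^\top = I_n$:

```r
# Sketch: decorrelate LOO residuals into a pivotal N(0, I_n) vector (SK case).
set.seed(3)
n <- 40
x <- seq(0, 1, length.out = n)
K <- exp(-0.5 * outer(x, x, "-")^2 / 0.1^2) + 1e-10 * diag(n)  # jitter for stability
U <- chol(K)                       # upper factor, K = t(U) %*% U
z <- t(U) %*% rnorm(n)             # a sample path: the model is well-specified
Kinv <- chol2inv(U)

e <- (1 / diag(Kinv)) * (Kinv %*% z)     # LOO residuals, Eq. (3)
u <- U %*% (diag(Kinv) * e)              # A e with A = U D^{-1}; here u = L^{-1} z
qqnorm(u); abline(0, 1)                  # u is standard normal under the model
e_std <- sqrt(diag(Kinv)) * e            # merely standardized residuals, correlated
```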
Figure 4: On the effect of accounting and correcting for correlation in Q-Q plots based on LOO residuals. Right panel: Q-Q plot against $\mathcal{N}(0, 1)$ of LOO residuals merely divided by corresponding LOO standard deviations. Left panel: Q-Q plot against $\mathcal{N}(0, 1)$ of duly transformed LOO residuals.

Coming back to our first one-dimensional example, Figure 4 represents Q-Q plots of the distribution of standardized LOO residuals (right panel) versus "transformed" LOO residuals (left panel). While these two plots possess quite similar appearances, a difference that stands out occurs at the second point from the left. Even if it does not really make sense to think of points in terms of unique locations for the left panel of Figure 4, we hypothesize that most entries do not differ much from the standardized version except for those cases where two successive locations come with LOO residuals of relatively high magnitude, which is precisely the case for the fourth and fifth locations from Figure 1, respectively with large negative and positive LOO residuals. As it turns out, the second largest negative LOO residual from the right panel of Figure 4 seems absent from the left panel, and it seems reasonable to attribute this to its decorrelation from the largest positive residual corresponding to the fifth location (appearing rightmost on the right panel).

As an additional cross-validation example with a focus on the effect of correlation on Q-Q plots, we added to the data used in the first example (evaluations at a 10-point regular subdivision of [0, 1] of the one-dimensional test function recalled in Eq. 10) a cluster of 10 additional function evaluations, equi-spaced over a short interval ending at 0.3,
and we now present the effect of this cluster on similar Q-Q plots. A brief comparison with Q-Q plots obtained on the second example (the one with 10 paired points, not further developed in this section by prioritization) is appended in Section D.1 for completeness.

We can see on Figure 5 that the function is now very well captured in the region of the added cluster. Yet what is not apparent here, but is reflected by Figure 5, is that leave-one-out predictions come with over-evaluated prediction uncertainty for a number of locations, which turn out to originate from the region of the added cluster and also from the flatter region on the right-hand side. While looking at the Q-Q plot of standardized residuals on the right panel of Figure 6 suggests some model inadequacy, accounting for and removing covariance between leave-one-out prediction errors, as illustrated in the left panel of Figure 6, suggests on the contrary that the assumptions underlying the considered GP model do not stand out as unreasonable.
Figure 5: Upper panel: GP prediction (blue line) of the test function (black line) defined by Equation 10, based on evaluations at a regular grid complemented by clustered points just below 0.3,
and LOO predictions (red points). Lower panel: absolute prediction errors associated with GP (black line) and LOO (red points) predictions, and GP prediction standard deviation (in blue).

The results obtained on this modified test case further stress the deleterious effect of neglecting the correlation between cross-validation residuals when diagnosing GP models relying notably on Q-Q plots. For completeness, we also added in Section D.2 a brief analysis of what cross-validation would result in on this third example when grouping the second to eleventh points within an index vector (and keeping the other points as individual entities like in LOO). We now turn to the topic of covariance parameter estimation relying on cross-validation residuals. As we will see below, incorporating covariances between cross-validation residuals within such approaches leads to interesting developments.
Figure 6: On the effect of accounting and correcting for correlation in Q-Q plots based on LOO residuals (second case: regular grid complemented by clustered points just below 0.3). Right panel: Q-Q plot against $\mathcal{N}(0, 1)$ of LOO residuals merely divided by corresponding LOO standard deviations. Left panel: Q-Q plot against $\mathcal{N}(0, 1)$ of duly transformed LOO residuals.
Assuming a kernel of the form $k(x, x') = \sigma^2 r(x, x')$ with unknown $\sigma^2$, the problem of estimating $\sigma^2$ by maximum likelihood versus cross-validation has been considered in earlier works, leading notably to results concerning the superiority of MLE in the well-specified case, but also to situations in the misspecified case where CV turned out to perform better than MLE [2]. Denoting by $R$ the correlation matrix associated with the observation points $x_1, \ldots, x_n$, assumed full-rank, it is well known (cf. for instance [21]) that the MLE of $\sigma^2$ can be written in closed form as
$$\widehat{\sigma}^2_{ML} = \frac{1}{n}Z^\top R^{-1}Z, \qquad (20)$$
assuming again without loss of generality a centred random field, for simplicity. This estimator is unbiased and reaches the Cramér-Rao bound, with a variance
$$\mathrm{var}\big(\widehat{\sigma}^2_{ML}\big) = \frac{2\sigma^4}{n}. \qquad (21)$$
In contrast, the leave-one-out-based estimator of $\sigma^2$ investigated in [2] reads
$$\widehat{\sigma}^2_{LOO} = \frac{1}{n}Z^\top R^{-1}\big(\mathrm{diag}(R^{-1})\big)^{-1}R^{-1}Z, \qquad (22)$$
and originates from the idea, traced back by [2] to [4], that the criterion (with notation $C_{LOO}$ inherited from [2]) defined by
$$C^{(1)}_{LOO}(\sigma^2) = \frac{1}{n}\sum_{i=1}^n \frac{(Z_i - \widehat{Z}_i)^2}{\sigma^2 c_{-i}} \qquad (23)$$
should take a value close to one, where $c_{-i} = (s_{-i})^2/\sigma^2$, leading to $\widehat{\sigma}^2_{LOO} = \frac{1}{n}\sum_{i=1}^n (Z_i - \widehat{Z}_i)^2/c_{-i}$ and ultimately to Eq. 22. This estimator turns out to be unbiased as well, but with variance
$$\mathrm{var}\big(\widehat{\sigma}^2_{LOO}\big) = \frac{2\sigma^4\,\mathrm{tr}\Big(\big(R^{-1}(\mathrm{diag}(R^{-1}))^{-1}\big)^2\Big)}{n^2}. \qquad (24)$$
Following [2], $\mathrm{tr}\big((R^{-1}(\mathrm{diag}(R^{-1}))^{-1})^2\big) \ge n$, so that
$$\mathrm{var}\big(\widehat{\sigma}^2_{LOO}\big) \ge \frac{2\sigma^4}{n} = \mathrm{var}\big(\widehat{\sigma}^2_{ML}\big). \qquad (25)$$
Yet we claim that, in order to remove undesirable effects due to the covariance between LOO residuals, it is natural to revise the criterion $C^{(1)}_{LOO}(\sigma^2)$ into
$$C^{(1)}_{\widetilde{LOO}}(\sigma^2) = \frac{1}{n}\sum_{i=1}^n\sum_{j=1}^n (Z_i - \widehat{Z}_i)\,\big[(DK^{-1}D)^{-1}\big]_{ij}\,(Z_j - \widehat{Z}_j) \qquad (26)$$
$$= \frac{1}{n\sigma^2}E^\top\,\mathrm{diag}(R^{-1})\,R\,\mathrm{diag}(R^{-1})\,E \qquad (27)$$
$$= \frac{1}{n\sigma^2}Z^\top R^{-1}Z, \qquad (28)$$
so that setting this modified criterion to 1 would plainly result in
$$\widehat{\sigma}^2_{\widetilde{LOO}} = \frac{1}{n}Z^\top R^{-1}Z = \widehat{\sigma}^2_{ML}. \qquad (29)$$
Hence the advantages found for $\widehat{\sigma}^2_{LOO}$ in [2] in the case of model misspecification come at the price of neglecting covariances, and hence possible redundancies, between leave-one-out residuals; yet attempting to address this issue by naturally accounting for those covariances within the criterion leads to the maximum likelihood estimator, known to enjoy optimality properties in the well-specified case of the considered settings but to be potentially suboptimal otherwise. We will see next that this is not the only instance where modifying LOO in order to account for residual covariances leads back to MLE.
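A quick numerical illustration of Eqs. 20 and 22 under an assumed well-specified centred model, with an illustrative correlation kernel (the corrected criterion of Eqs. 26-28 returns the ML estimator by construction, so only the two distinct estimators are computed):

```r
# Sketch: ML vs LOO-based scale estimators (Eqs. 20 and 22); correcting the
# LOO criterion for residual covariances gives back Eq. (20), cf. Eq. (29).
set.seed(4)
n <- 50
x <- seq(0, 1, length.out = n)
R <- exp(-0.5 * outer(x, x, "-")^2 / 0.1^2) + 1e-8 * diag(n)  # jitter for stability
sigma2 <- 2
z <- sqrt(sigma2) * t(chol(R)) %*% rnorm(n)
Rinv <- chol2inv(chol(R))

sigma2_ML  <- drop(crossprod(z, Rinv %*% z)) / n                                  # Eq. (20)
sigma2_LOO <- drop(crossprod(z, Rinv %*% ((1 / diag(Rinv)) * (Rinv %*% z)))) / n  # Eq. (22)
c(ML = sigma2_ML, LOO = sigma2_LOO)
```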
Cross-validation has also been used to estimate covariance parameters beyond the scale parameter $\sigma^2$ tackled in the last section. We will now denote by $\theta$ those additional parameters (and by $\psi = (\sigma^2, \theta)$ the concatenation of all covariance parameters) and review some approaches that have been used to estimate them via cross-validation. An important point compared to the last section on $\sigma^2$'s estimation is that Kriging predictors generally depend on $\theta$, so that we will now stress it by noting $\widehat{Z}(\theta)$ (recall that in the presented results one assumes noiseless observations). The approach followed by [21, 2] to estimate $\theta$ based on leave-one-out cross-validation residuals is to minimize
$$C^{(2)}_{LOO}(\theta) = \sum_{i=1}^n \big(Z_i - \widehat{Z}_i(\theta)\big)^2 = \sum_{i=1}^n E_i(\theta)^2, \qquad (30)$$
where we stress the dependence of the $E_i$'s on $\theta$. Following [2], the latter criterion can in turn be expressed in closed form as
$$C^{(2)}_{LOO}(\theta) = Z^\top R_\theta^{-1}\,\mathrm{diag}(R_\theta^{-1})^{-2}\,R_\theta^{-1}Z, \qquad (31)$$
where $R_\theta$ stands for $Z$'s correlation matrix under the correlation parameter $\theta$. Based on our results, a natural extension of this criterion to multiple-fold cross-validation can be similarly obtained. Considering $q$-fold settings such as in the final part of Theorem 1, $C^{(2)}_{LOO}$ becomes indeed
$$C^{(2)}_{CV}(\theta) = \sum_{j=1}^q \|Z_{\mathbf{i}_j} - \widehat{Z}_{\mathbf{i}_j}(\theta)\|^2 = \sum_{j=1}^q \|E_{\mathbf{i}_j}(\theta)\|^2 = \|E_{\mathbf{i}_1,\ldots,\mathbf{i}_q}(\theta)\|^2. \qquad (32)$$
Then, building upon our main theorem, we obtain that
$$C^{(2)}_{CV}(\theta) = Z^\top R_\theta^{-1}\,\bar{D}(\theta)^2\,R_\theta^{-1}Z, \qquad (33)$$
with $\bar{D}(\theta) = \mathrm{blockdiag}\big((R_\theta^{-1}[\mathbf{i}_1])^{-1}, \ldots, (R_\theta^{-1}[\mathbf{i}_q])^{-1}\big)$, generalizing indeed Eq. 31. Note that applying a sphering transformation to the CV residuals prior to taking norms in Eqs. 30 and 32 would lead to a criterion that boils down to $\theta \mapsto Z^\top R_\theta^{-1}Z$ and hence appears as one of the two building blocks of the log-likelihood criterion.

On a different note, as exposed in [18], Eq. 30 is only one among several possible ways to construct a criterion based on applying loss functions to leave-one-out residuals. While Eq. 30 corresponds to the squared error loss, one arrives for instance, by using instead the "negative log validation density loss" (after the terminology of [18], Section 5.4.2), at the "LOO log predictive probability", also sometimes called pseudo-likelihood:
$$C^{(3)}_{LOO}(\psi) = \sum_{j=1}^n \log\big(p_{Z_j | Z_{-j}}(z_j | z_{-j}; \psi)\big), \qquad (34)$$
where $p_{Z_j | Z_{-j}}(z_j | z_{-j}; \psi)$ denotes the conditional density of $Z_j$ at the value $z_j$ knowing that $Z_{-j} = z_{-j}$. Further scoring rules could be used as a basis to define further criteria relying on cross-validation residuals, as illustrated for instance with the CRPS score in [15].

Let us now focus on an extension of $C^{(3)}_{LOO}(\psi)$ to multiple-fold cross-validation, and on the exploration of some links between the resulting class of criteria and the log-likelihood. First, $C^{(3)}_{LOO}(\psi)$ is straightforwardly adapted into
$$C^{(3)}_{CV}(\psi) = \sum_{j=1}^q \log\big(p_{Z_{\mathbf{i}_j} | Z_{-\mathbf{i}_j}}(z_{\mathbf{i}_j} | z_{-\mathbf{i}_j}; \psi)\big). \qquad (35)$$
Reformulating the sum of log terms as the log of a product, we see that
$$C^{(3)}_{CV}(\psi) = \log\prod_{j=1}^q p_{Z_{\mathbf{i}_j} | Z_{-\mathbf{i}_j}}(z_{\mathbf{i}_j} | z_{-\mathbf{i}_j}; \psi) = \log\prod_{j=1}^q p_{E_{\mathbf{i}_j}}(e_{\mathbf{i}_j}; \psi), \qquad (36)$$
and so it appears that this approach amounts indeed, in some sense, to abusively assuming independence between the cross-validation residuals $E_{\mathbf{i}_j}$ ($j = 1, \ldots, q$). We now examine in more detail (still in the settings of the final part of Theorem 1) when such cross-validation residuals are independent, and show that in such case $C^{(3)}_{CV}(\psi)$ coincides with the log-likelihood criterion based on $Z$.

Corollary 2. The following propositions are equivalent: a) the cross-validation residuals $E_{\mathbf{i}_j}$ ($j = 1, \ldots, q$) are mutually independent; b) $D = K$; c) the subvectors $Z_{\mathbf{i}_j}$ ($j = 1, \ldots, q$) are mutually independent. If they hold, then $C^{(3)}_{CV}(\psi) = \log(p_Z(z; \psi))$ ($\psi \in \Psi$).

Proof. As we know from Theorem 1 that the cross-validation residuals $E_{\mathbf{i}_j}$ ($j = 1, \ldots, q$) form a Gaussian vector, they are mutually independent if and only if their cross-covariance matrices are null. Yet we also know by Eq. 6 from the same theorem that $\mathrm{Cov}(E_{\mathbf{i}}, E_{\mathbf{j}}) = (K^{-1}[\mathbf{i}])^{-1}K^{-1}[\mathbf{i}, \mathbf{j}](K^{-1}[\mathbf{j}])^{-1}$ ($\mathbf{i}, \mathbf{j} \in S$). It follows that a) is equivalent to $K^{-1}[\mathbf{i}_j, \mathbf{i}_{j'}] = 0$ for $j \ne j'$ ($j, j' \in \{1, \ldots, q\}$),
, q } ),which coincides in turn with b) . Looking finally at b) from the perspective of K being block-diagonal, we obtain similarly that b) is equivalent to c) . In thesepropositions hold, then E i j = Z i j ( j = 1 . . . q ) from Eq. 17, and thus for ψ ∈ Ψ: C (3)CV ( ψ ) = log q (cid:89) j =1 p E i j ( e i j ; ψ ) = log q (cid:89) j =1 p Z i j ( z i j ; ψ ) = log( p Z ( z ; ψ )) . Also, one could have derived it from p Z i j | Z − i j ( z i j | z − i j ; ψ ) = p Z i j ( z i j ; ψ ). We have seen in previous sections that block matrix inversion enabled us toreformulate the equations of (multiple-fold) cross-validation residuals and asso-ciated covariances in compact form, from Simple Kriging to Universal Krigingsettings. More about our R implementation can be found in Appendix (Sec-tion C). Here we briefly illustrate that our Schur-complement-based formualedo indeed reproduce results obtained by a more straighforward approach, alsokeeping and eye on computational speed-ups and by-products. The next sectionfocuses in more detail on the respective computational complexities of the twoapproaches, and also how efficiency gains evolve depending on the number offolds. Let us start by a basic LOO example in the context of Simple Kriging.We first consider Simple Kriging prediction based on 100 points of our recurrentone-dimensional example test function, to which we apply LOO cross-validationwith fast versus naive implementation. We find relative differences of the orderof 10 − and 10 − when comparing vectors of leave-one-out predictions andvariances obtained from the fast versus naive approaches, respectively (thoserelative differences consist of Eulidean norms of the differences divided by thenorm of the relevant vector using the naive approach). In the current set-up, thefast implementation typically comes with a speed-up factor of 2, a performancethat is much improved when increasing the number of observations. Alreadywith 1000 observations, we found a speed-up of around 120 (More experimentsto follow). Moving to a Universal Kriging model with quadratic trend (See18ection D.4 in Appendix for figures), we found with 100 observations a speed-up nearing 5 while relative errors were still in the same tiny orders of magnitude.With 1000 observations, we observed the same speed-up as in the case of SimpleKriging (with relative accuracies slightly increasing near 10 and 10 ).In Figure 4.1 we present the speed-up measured for LOO on 10 regular de-signs, with 100 to 1000 points equidistributed on [0 , − and 10 − , respectively)illustrating the accuracy of the fast method. design size (in hundreds) Speed−up when using fast vs naive LOO
Figure 7: Speed-up (ratio between times required to run the naive and fast methods) measured for LOO on 10 regular designs, with 100 to 1000 points equidistributed on [0, 1].
Figure 8: Relative errors on LOO means and covariances (between naive and fast methods, with the naive method as reference) measured as on Figure 7.

In order to assess the influence of implementation by-products on these timings, we also considered a less favourable variant where the covariance matrix was first rebuilt from the Cholesky factor and then factorized again, so as to be in an unfavourable situation for the fast method (the rebuild acting as a penalty, counterbalancing the fact that the employed GP modelling code, the km function of the R package DiceKriging, already delivers a Cholesky factor of the covariance matrix). The corresponding results are plotted on Figure D.3 in Appendix, where it can be seen that in this unfavourable situation speed-ups are merely divided by a factor of 2, but the immense benefit of using the fast LOO formula over the naive approach is not affected, all the more so as the design size increases.

Turning now to multiple-fold cross-validation beyond leave-one-out, we compare on Figures 9 and 10 results obtained via fast versus naive implementation in the case of 50-fold cross-validation on the same regular designs as previously. As can be seen on Figure 9, speed-ups start getting mitigated with designs nearing the highest considered design sizes, a phenomenon related to a computational trade-off that is analyzed in more detail in the next section.
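A self-contained timing sketch in the same spirit (pure linear algebra, without the km-based modelling layer used for the paper's experiments; n is kept moderate here since the naive loop scales as $O(n(n-1)^3)$):

```r
# Timing sketch: fast LOO (one factorization) vs naive LOO (one solve per point)
set.seed(5)
n <- 300
x <- seq(0, 1, length.out = n)
K <- exp(-0.5 * outer(x, x, "-")^2 / 0.1^2) + 1e-8 * diag(n)
z <- t(chol(K)) %*% rnorm(n)

t_fast <- system.time({
  Kinv   <- chol2inv(chol(K))
  e_fast <- (1 / diag(Kinv)) * (Kinv %*% z)
})["elapsed"]

t_naive <- system.time({
  e_naive <- sapply(1:n, function(i) z[i] - K[i, -i] %*% solve(K[-i, -i], z[-i]))
})["elapsed"]

c(fast = t_fast, naive = t_naive, speedup = t_naive / t_fast,
  max_diff = max(abs(e_fast - e_naive)))
```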
We now turn to a more specific examination of how our closed-form formulae affect computational speed compared to a naive implementation of multiple-fold cross-validation. To this aim, we focus on the particular case of homogeneous fold sizes and denote by $r \ge 1$ the common fold size, so that $n = r \times q$. As we will now develop, a first-order reasoning on the computational costs associated with naive versus closed-form multiple-fold cross-validation highlights the existence of two regimes depending on $q$: for large $q$ it is faster to use the closed-form formulae, yet for small $q$ one should better employ the naive approach. The rationale is as follows: in a naive implementation, the main computational cost lies in inverting $q$ covariance sub-matrices, all belonging to $\mathbb{R}^{(n-r) \times (n-r)}$. Denoting this (approximate) cost by $\kappa_{\mathrm{naive}}$ and assuming a cubic cost for matrix inversion (with a multiplicative constant $\gamma > 0$), we get
$$\kappa_{\mathrm{naive}} = q \times \gamma(n - r)^3. \qquad (37)$$
Figure 9: Speed-up (ratio between times required to run the naive and fast methods) measured for 50-fold CV on 10 regular designs, with 100 to 1000 points equidistributed on [0, 1].
Figure 10: Relative errors on CV means and covariances (between naive and fast methods, with the naive method as reference) measured as on Figure 9.

On the other hand, the closed-form approach requires one inversion of an $n \times n$ matrix and $q$ inversions of $r \times r$ matrices, leading to an approximate cost
$$\kappa_{\mathrm{close}} = \gamma(n^3 + qr^3). \qquad (38)$$
We thus obtain (at first order) that a speed-up takes place whenever
$$0 \le q \times \gamma(n - r)^3 - \gamma(n^3 + qr^3) \iff q \times (n - r)^3 \ge n^3 + qr^3 \iff (q - 1)^3 \ge q^2 + 1, \qquad (39)$$
which, as $q$ is a positive integer, can be proved to be equivalent to $q \ge 4$. Indeed, the polynomial function $q \mapsto (q - 1)^3 - (q^2 + 1)$ possesses a unique real root between 3 and 4, takes negative values for $q \in \{1, 2, 3\}$, and tends to $+\infty$ when $q \to +\infty$. As noted in the previous section, however, in our implementation the cost of the fast approach turns out to be smaller than $\gamma(n^3 + qr^3)$ since the Cholesky factor is already pre-calculated. Modelling this with a damping factor of $\alpha \in (0, 1)$ applying to the $\gamma n^3$ term, we end up with an approximate cost of $\kappa_{\mathrm{close}} = \gamma(\alpha n^3 + qr^3)$, leading via analogous calculations to a speed-up whenever $(q - 1)^3 \ge \alpha q^2 + 1$. Again, the polynomial involved possesses a single real-valued root, with a value shrinking towards 2 as $\alpha$ tends to 0.

In practice, speed-ups are already observed here from $q = 2$, which may be attributed to various reasons pertaining notably to implementation specifics and arbitrary settings used in the numerical experiments. Also, let us stress again that the reasoning done above is to be taken as an approximation at "first order", in the sense that it does not account for matrix-vector multiplication costs, storage and communication overheads, and further auxiliary operations that could influence the actual running times. We now present indeed some numerical experiments at fixed design size but varying number of folds, and monitor the variations of the measured speed-ups and of the accuracy in calculating cross-validation residual means and covariances.

For convenience, we consider here a design of size $2^{10} = 1024$, and let the number of folds decrease from $q = 1024$ to $q = 2$ by successive halvings. This amounts to folds of sizes $r$ ranging from 1 to 512, correspondingly. For each value of $q$, 50 random replications of the folds are considered; this is done by permuting the indices at random prior to arranging them in regularly constructed folds of $r$ successive elements. The resulting speed-up distributions are represented in Figure 11, as a function of $q$. With speed-up medians ranging from 1027.82 to 1.50 (means from 1429.63 to 1.28, all truncated after the second digit), we hence observed a speed-up whatever the value of $q$, yet with a substantial decrease when $q$ decreases compared to the LOO situation, as could be expected.

Additionally, relative errors on cross-validation residuals and associated variances calculated using the fast versus naive approach are represented in Figure 12, also as a function of $q$. It is remarkable that, while the relative error on the calculation of residuals increases moderately (with $r$) from a median of around $3 \cdot 10^{-14}$ to values nearing $4 \cdot 10^{-14}$, in the case of the covariances the variation is more pronounced, with an increase from approximately $2 \cdot 10^{-11}$; in all cases, though, the observed differences remain of a negligible extent.
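The first-order break-even analysis above can be checked in a few lines; here $\gamma$ is set to 1, and the $\alpha$-damped variant of the fast-approach cost is included as an option:

```r
# First-order cost model of Section 4.2 (Eqs. 37-38), illustrative only.
predicted_speedup <- function(q, n, alpha = 1) {
  r <- n / q                         # homogeneous fold size
  kappa_naive <- q * (n - r)^3       # q inversions of (n - r) x (n - r) matrices
  kappa_close <- alpha * n^3 + q * r^3   # one n x n inversion plus q small ones
  kappa_naive / kappa_close
}
# Predicted break-even at q = 4 when alpha = 1, cf. (q - 1)^3 >= q^2 + 1:
round(sapply(c(2, 3, 4, 8, 64, 1024), predicted_speedup, n = 1024), 2)
```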
Figure 11: Speed-up (ratio between times required to run the naive and fast methods) measured for $q$-fold CV, where $q$ decreases from 1024 to 2 and 50 seeds are used that affect here both model fitting and the folds.
Figure 12: Relative errors on CV means and covariances (between naive and fast methods, with the naive method as reference) measured in similar settings as on Figure 11.
Discussion
Fast leave-one-out formulae for Gaussian Process prediction that have been used for model diagnostics and hyperparameter fitting do carry over well to multiple-fold cross-validation, as our theoretical and numerical results in both Simple and Universal Kriging settings suggest. The resulting formulae were found to allow substantial speed-ups, yet with a most favourable scenario in the leave-one-out case and decreasing advantages in cases with fewer folds of larger observation subsets. A first-order analysis in terms of the involved matrix inversion complexities confirmed the observed trend, yet in most considered situations the fast formulae appeared computationally worthwhile. In addition, the established formulae enabled a closed-form calculation of the covariance structure of cross-validation residuals, and eventually correcting for an abusive assumption of independence silently underlying Q-Q plots with standardized LOO residuals.

As we established as well, the derived formulae lend themselves well to generalizing existing covariance hyperparameter estimation approaches that are based on cross-validation residuals. Looking first at scale parameter estimation in the LOO case, we found that changing a criterion underlying the scale estimation approach used in [21, 2] by correcting for covariance between LOO residuals led back to the maximum likelihood estimator of scale. On a different note, and with more general covariance hyperparameters in view, maximizing the pseudo-likelihood criterion mentioned in [18] was found, as established in Corollary 2, to coincide with MLE in the case of independent cross-validation residuals.

It is interesting to note, as a potential starting point to future work on cross-validation-based parameter estimation, that going beyond the independence assumption of Corollary 2 and accounting for the calculated residual covariances results in fact in a departure from basic maximum likelihood estimation. Still assuming indeed that $\Delta = I_n$ and replacing the product in $C^{(3)}_{CV}(\psi)$ by the joint density of cross-validation residuals, we end up with
$$C^{(3)}_{\widetilde{CV}}(\psi) = \log\big(p_{(E_{\mathbf{i}_1},\ldots,E_{\mathbf{i}_q})}((e_{\mathbf{i}_1}, \ldots, e_{\mathbf{i}_q}); \psi)\big) = \log\big(p_{DK^{-1}Z}(DK^{-1}z; \psi)\big) = \log\big(p_Z(z; \psi)\,|\det(DK^{-1})|^{-1}\big) = \log(p_Z(z; \psi)) + \log(\det(K)) - \log(\det(D)), \qquad (41)$$
and we see that maximizing $C^{(3)}_{\widetilde{CV}}(\psi)$ generally departs indeed from MLE, as $K$ and $D$ are functions of $\psi$, yet coincides with it when $\det(D) = \det(K)$. While studying this new criterion further is out of the scope of the present paper, we believe that using the full distribution of cross-validation residuals could be helpful in designing novel criteria and procedures for covariance parameter estimation and more.
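As a hedged sketch (our own, for illustration), the corrected criterion of Eq. 41 could be evaluated by reusing quantities already available from the fast CV computations; fold structure and inputs below are assumptions:

```r
# Sketch of Eq. (41): log-likelihood plus the correction log det(K) - log det(D),
# for folds partitioning {1, ..., n}.
corrected_pseudo_loglik <- function(z, K, folds) {
  n       <- length(z)
  Kinv    <- chol2inv(chol(K))
  logdetK <- as.numeric(determinant(K, logarithm = TRUE)$modulus)
  # log det(D), with D = blockdiag((K^{-1}[i_1])^{-1}, ..., (K^{-1}[i_q])^{-1})
  logdetD <- -sum(sapply(folds, function(i)
    as.numeric(determinant(Kinv[i, i, drop = FALSE], logarithm = TRUE)$modulus)))
  loglik  <- -0.5 * (n * log(2 * pi) + logdetK + drop(crossprod(z, Kinv %*% z)))
  loglik + logdetK - logdetD   # departs from MLE unless det(D) = det(K)
}
```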
For instance, beyond our generalization to multiple-fold settings of the $\ell^2$ norm of cross-validation residuals as an objective function for correlation parameter estimation, we present in Section E decorrelated versions of the resulting criteria that could be of interest in future work. Beyond this, efficient adjoint computation of gradients could be useful to speed up covariance hyperparameter estimation procedures such as those studied here, but also variations thereof relying on further scoring rules [15], not only in LOO settings but also potentially in the case of multiple-fold cross-validation.

It is important to stress that while the presented results do not come with procedures for the design of folds, they might be of interest to create objective functions for their choice. Looking at the covariance matrix between cross-validation residuals might be a relevant ingredient to such procedures. On the other hand, future research might also be concerned with choosing folds so as to deliver cross-validation residuals reflecting in some sense the generalization errors of the considered models. Of course, designing the folds is not independent of choosing the $x_i$'s, and we expect interesting challenges to arise at this interface. Also, stochastic optimization under random folds is yet another approach of interest, which could be eased thanks to fast computation of cross-validation residuals. Last but not least, as fast multiple-fold cross-validation carries over well to linear regression, and as Gaussian Process modelling shares foundations with several other paradigms, it would be desirable to generalize the presented results to broader model classes (see for instance [17], where mixed models are tackled).
Part of DG's contributions have taken place within the Swiss National Science Foundation project number 178858, and he would like to thank Athénaïs Gautier and Cédric Travelletti for stimulating discussions. Yves Deville should be warmly thanked too for insightful comments on an earlier version of this paper. Calculations were performed on UBELIX, the High Performance Computing (HPC) cluster of the University of Bern. DG would also like to acknowledge support of Idiap Research Institute, his primary affiliation during earlier stages of this work.
References

[1] S. Arlot and A. Célisse. A survey of cross-validation procedures for model selection. Statistical Surveys, 2010.

[2] F. Bachoc. Cross validation and maximum likelihood estimation of hyperparameters of Gaussian processes with model misspecification. Computational Statistics and Data Analysis, 66:55–69, 2013.

[3] L.S. Bastos and A. O'Hagan. Diagnostics for Gaussian process emulators. Technometrics, 51(4):425–438, 2009.

[4] N.A.C. Cressie. Statistics for Spatial Data. Wiley, 1993.

[5] Y. Deville, D. Ginsbourger, and O. Roustant. kergp: Gaussian Process Laboratory, 2020. R package version 0.5.1.

[6] O. Dubrule. Cross validation of kriging in a unique neighborhood. Journal of the International Association for Mathematical Geology, 15(6):687–699, 1983.

[7] B. Fischer. Model selection for Gaussian process regression by approximation set coding. Master's thesis, ETH Zürich, 2016.

[8] J. Gallier. The Schur complement and symmetric positive semidefinite (and definite) matrices. Retrieved at .

[9] A.L. Kaminsky, Y. Wang, K. Pant, W.N. Hashii, and A. Atachbarian. An efficient batch k-fold cross-validation Voronoi adaptive sampling technique for global surrogate modeling. J. Mech. Des., 143(1):011706, 2021.

[10] L. Le Gratiet, C. Cannamela, and B. Iooss. Cokriging-based sequential design strategies using fast cross-validation for multi-fidelity computer codes. Technometrics, 57:418–427, 2015.

[11] M. Magnusson, M.R. Andersen, J. Jonasson, and A. Vehtari. Leave-one-out cross-validation for Bayesian model comparison in large data. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS), 2020.

[12] S. Marmin, D. Ginsbourger, J. Baccou, and J. Liandrat. Warped Gaussian processes and derivative-based sequential design for functions with heterogeneous variations. SIAM/ASA Journal on Uncertainty Quantification, 6(3):991–1018, 2018.

[13] L. Martino, V. Laparra, and G. Camps-Valls. Probabilistic cross-validation estimators for Gaussian process regression. In , pages 823–827, 2017.

[14] G. Matheron. Le krigeage universel. Les Cahiers du Centre de Morphologie Mathématique de Fontainebleau, 1, 1969.

[15] S.J. Petit, J. Bect, S. Da Veiga, P. Feliot, and E. Vazquez. Towards new cross-validation-based estimators for Gaussian process regression: efficient adjoint computation of gradients. https://arxiv.org/pdf/2002.11543.pdf, 2020.

[16] A. Rabinowicz and S. Rosset. Assessing prediction error at interpolation and extrapolation points. Electronic Journal of Statistics, 14:272–301, 2020.

[17] A. Rabinowicz and S. Rosset. Cross-validation for correlated data. Journal of the American Statistical Association, 2020.

[18] C.E. Rasmussen and C.K.I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.

[19] D.R. Roberts, V. Bahn, S. Ciuti, M.S. Boyce, J. Elith, G. Guillera-Arroita, S. Hauenstein, J.J. Lahoz-Monfort, B. Schröder, W. Thuiller, D.I. Warton, B.A. Wintle, F. Hartig, and C.F. Dormann. Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure. Ecography, 40(8):913–929, 2017.

[20] O. Roustant, D. Ginsbourger, and Y. Deville. DiceKriging, DiceOptim: Two R packages for the analysis of computer experiments by kriging-based metamodelling and optimization. Journal of Statistical Software, 51(1):1–55, 2012.

[21] T.J. Santner, B.J. Williams, and W. Notz. The Design and Analysis of Computer Experiments. Springer, New York, 2003.

[22] Y. Xiong, W. Chen, D. Apley, and X. Ding. A non-stationary covariance-based kriging method for metamodelling in engineering design. International Journal for Numerical Methods in Engineering, 71:733–756, 2007.

[23] Y. Zhang and Y. Yang. Cross-validation for selecting a model selection procedure. Journal of Econometrics, 187(1):95–112, 2015.
A About block inversion via Schur complements
The following, partly based on [8], is a summary of standard results revolving around the celebrated notion of Schur complement.
Theorem 2.
Let $M = \begin{pmatrix} A & B \\ C & D \end{pmatrix}$ be a real $n \times n$ matrix with blocks $A, B, C, D$ of respective sizes $p \times p$, $p \times q$, $q \times p$ and $q \times q$, where $p, q \geq 1$ are such that $n = p + q$. Assuming that $D$ and $A - BD^{-1}C$ are invertible, then so is $M$, with
$$M^{-1} = \begin{pmatrix} (A - BD^{-1}C)^{-1} & -(A - BD^{-1}C)^{-1}BD^{-1} \\ -D^{-1}C(A - BD^{-1}C)^{-1} & D^{-1} + D^{-1}C(A - BD^{-1}C)^{-1}BD^{-1} \end{pmatrix}.$$
Proof. We follow the classical path consisting of solving for $z = (x\ y)^{\top}$ in the equation $Mz = w$, with $w = (u\ v)^{\top} \in \mathbb{R}^n$, where $x, u \in \mathbb{R}^p$ and $y, v \in \mathbb{R}^q$. Using the assumed invertibility of $D$, we have indeed equivalence between
$$\begin{cases} Ax + By = u \\ Cx + Dy = v \end{cases} \quad\text{and}\quad \begin{cases} y = D^{-1}(v - Cx) \\ (A - BD^{-1}C)x = u - BD^{-1}v, \end{cases}$$
whereof, using this time the assumed invertibility of $A - BD^{-1}C$,
$$\begin{cases} x = (A - BD^{-1}C)^{-1}u - (A - BD^{-1}C)^{-1}BD^{-1}v \\ y = -D^{-1}C(A - BD^{-1}C)^{-1}u + \left(D^{-1} + D^{-1}C(A - BD^{-1}C)^{-1}BD^{-1}\right)v, \end{cases}$$
resulting in the claimed expression for $M^{-1}$.
Remark 4. The block inversion formula of Theorem 2 above can be reformulated as the following product of three matrices,
$$M^{-1} = \begin{pmatrix} I & 0 \\ -D^{-1}C & I \end{pmatrix} \begin{pmatrix} (A - BD^{-1}C)^{-1} & 0 \\ 0 & D^{-1} \end{pmatrix} \begin{pmatrix} I & -BD^{-1} \\ 0 & I \end{pmatrix},$$
implying in turn the following, which actually only requires the invertibility of $D$,
$$M = \begin{pmatrix} I & BD^{-1} \\ 0 & I \end{pmatrix} \begin{pmatrix} A - BD^{-1}C & 0 \\ 0 & D \end{pmatrix} \begin{pmatrix} I & 0 \\ D^{-1}C & I \end{pmatrix},$$
and illustrates that the simultaneous invertibility of $D$ and $A - BD^{-1}C$ is actually equivalent to the invertibility of $M$ itself.

Remark 5. As classically, in Theorem 2 the inverted block is the bottom-right $D$ and correspondingly the considered Schur complement is the one of the upper-left block $A$. The inverted Schur complement $(A - BD^{-1}C)^{-1}$ hence appears as the upper-left block of $M^{-1}$ associated with $i = (1, \ldots, p)$. Yet, analogous derivations are possible for any vector of indices $i$ from $\{1, \ldots, n\}$. Reformulating a result presented in Horn and Johnson in terms of our notation, we have indeed
$$M^{-1}[i] = \left(M[i] - M[i, -i]\,M[-i]^{-1}\,M[-i, i]\right)^{-1} \quad (42)$$
and, more generally for two complementary vectors of indices $i$ and $j$,
$$M^{-1}[i, j] = -\left(M[i] - M[i, j]\,M[j]^{-1}\,M[j, i]\right)^{-1}M[i, j]\,M[j]^{-1} = -M[i]^{-1}M[i, j]\left(M[j] - M[j, i]\,M[i]^{-1}\,M[i, j]\right)^{-1}. \quad (43)$$
Remark 6. Following up on Remark 5 and using the notation of Theorem 2, we obtain in particular, when assuming that $A$ and $D - CA^{-1}B$ are invertible and considering indeed the Schur complement of $A$ instead of that of $D$, that
$$M^{-1} = \begin{pmatrix} A^{-1} + A^{-1}B(D - CA^{-1}B)^{-1}CA^{-1} & -A^{-1}B(D - CA^{-1}B)^{-1} \\ -(D - CA^{-1}B)^{-1}CA^{-1} & (D - CA^{-1}B)^{-1} \end{pmatrix}.$$
Assuming that both blocks $A, D$ and their respective Schur complements are invertible, one then retrieves the so-called binomial inverse theorem, namely
$$(D - CA^{-1}B)^{-1} = D^{-1} + D^{-1}C(A - BD^{-1}C)^{-1}BD^{-1}.$$
As a by-product, equating non-diagonal blocks further delivers that
$$(A - BD^{-1}C)^{-1}BD^{-1} = A^{-1}B(D - CA^{-1}B)^{-1}$$
and similarly,
$$D^{-1}C(A - BD^{-1}C)^{-1} = (D - CA^{-1}B)^{-1}CA^{-1}.$$
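These identities are easy to confirm numerically. The following self-contained R snippet (purely illustrative, on a randomly generated symmetric positive definite matrix) verifies the upper-left block of Theorem 2, i.e. Equation (42), as well as the binomial inverse theorem.

## Illustrative numerical check of Theorem 2 / Eq. (42) and of the binomial
## inverse theorem, on a random symmetric positive definite matrix M.
set.seed(42)
n <- 7; p <- 3
G0 <- matrix(rnorm(n * n), n)
M <- crossprod(G0) + diag(n)                 # symmetric positive definite
i <- 1:p
A <- M[i, i]; B <- M[i, -i]; C <- M[-i, i]; D <- M[-i, -i]
S <- A - B %*% solve(D) %*% C                # Schur complement appearing in Theorem 2
max(abs(solve(M)[i, i] - solve(S)))          # upper-left block of M^{-1}: ~1e-15
lhs <- solve(D - C %*% solve(A) %*% B)       # binomial inverse theorem, left-hand side
rhs <- solve(D) + solve(D) %*% C %*% solve(S) %*% B %*% solve(D)
max(abs(lhs - rhs))                          # ~1e-15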
B Basics of Universal Kriging (Gaussian case)

In Universal Kriging, the trend function $\mu(x)$ of the considered random field $(Z_x)_{x \in D}$ is assumed to be a linear combination of given basis functions $f_j$ ($1 \leq j \leq p$, $p > 0$) but with unknown trend parameters $\beta \in \mathbb{R}^p$. While in the seminal work of Matheron [14] both square-integrable and intrinsically stationary $\eta$ were considered, we solely need the first case here as we focus on the Gaussian case, i.e. on settings where
$$Z_x = \sum_{j=1}^{p} \beta_j f_j(x) + \eta_x \quad (x \in D),$$
where $(\eta_x)_{x \in D}$ is assumed centred Gaussian with known (not necessarily stationary) covariance kernel $k$, and where the coefficients $\beta_j$ ($1 \leq j \leq p$) are unknown. Universal Kriging then consists in jointly estimating $\beta$ while accounting for the spatial dependency in observations, and predicting $Z$ at locations where it was not (or noisily) observed. Like in Simple Kriging, the Universal Kriging predictor (at any given $x \in D$) is set to be a linear combination $\widehat{Z}_x = \sum_{i=1}^{n} \lambda_i Z_{x_i}$ of observations $Z_{x_i}$ (where the dependency of the "Kriging weights" $\lambda_i$ on $x$ is, as often in the literature, not stressed here for the sake of readability), with Kriging weights such that the prediction be unbiased and with minimal variance.

With a deterministic treatment of the unknown $\beta_j$'s, the residual variance amounts to the same quantity as in Simple Kriging, that is $\mathrm{Var}[Z_x - \widehat{Z}_x] = \mathrm{Var}[\eta_x - \sum_{i=1}^{n}\lambda_i\eta_{x_i}] = \lambda' K \lambda - 2\lambda' k(x) + k(x, x)$. Yet the unbiasedness condition adds $p$ linear constraints, namely $f_j(x) = \sum_{i=1}^{n}\lambda_i f_j(x_i)$ for $j \in \{1, \ldots, p\}$. Minimizing the above quadratic risk under these unbiasedness constraints can be addressed by Lagrange duality, leading to a linear problem featuring a $p$-dimensional Lagrange multiplier $\ell$:
$$\begin{pmatrix} K & F \\ F' & 0 \end{pmatrix}\begin{pmatrix} \lambda \\ \ell \end{pmatrix} = \begin{pmatrix} k(x) \\ f(x) \end{pmatrix}, \quad (44)$$
where $F = (f_j(x_i))_{1 \leq i \leq n,\, 1 \leq j \leq p} \in \mathbb{R}^{n \times p}$ and $f(x) = (f_1(x), \ldots, f_p(x))' \in \mathbb{R}^p$. From Equation (44) it can be seen that the resulting $\lambda$ writes as a sum of two contributions proportional (up to matrix multiplications) to $k(x)$ and $f(x)$, respectively. More precisely, assuming that $K$ and $F'K^{-1}F$ are invertible and applying to $\widetilde{K} = \begin{pmatrix} K & F \\ F' & 0 \end{pmatrix}$ the variant of the Schur complement inversion formula discussed in Remark 6, we get that $\widetilde{K}$ is invertible and has inverse
$$\widetilde{K}^{-1} = \begin{pmatrix} K^{-1} - K^{-1}F(F'K^{-1}F)^{-1}F'K^{-1} & K^{-1}F(F'K^{-1}F)^{-1} \\ (F'K^{-1}F)^{-1}F'K^{-1} & -(F'K^{-1}F)^{-1} \end{pmatrix}.$$
Recalling that the Universal Kriging predictor at the considered $x$ is given by $\widehat{Z}_x = \lambda' Z = Z' \lambda$, we hence find that $\widehat{Z}_x = \alpha' k(x) + \widehat{\beta}' f(x)$, where
$$\alpha = \left(K^{-1} - K^{-1}F(F'K^{-1}F)^{-1}F'K^{-1}\right)Z, \quad (45)$$
$$\widehat{\beta} = (F'K^{-1}F)^{-1}F'K^{-1}Z. \quad (46)$$
As it turns out, $\widehat{\beta}$ coincides with the Generalized Least Squares estimator of $\beta$ under the assumption of a (centred) noise term with covariance matrix $K$. Besides this, these values of $\alpha$ and $\widehat{\beta}$ are characterized by
$$\begin{pmatrix} K & F \\ F' & 0 \end{pmatrix}\begin{pmatrix} \alpha \\ \widehat{\beta} \end{pmatrix} = \begin{pmatrix} Z \\ 0 \end{pmatrix}, \quad (47)$$
as is well-known in the geostatistics literature (see [6] and references therein). Note that [6] is a seminal reference that tackled fast cross-validation formulae already in the Universal Kriging case and also tackled the case of two points left out; a numerical illustration of Equation (47) follows below.
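As a concrete check of the characterization (47), solving the bordered system numerically recovers the Generalized Least Squares estimator of Equation (46). The following self-contained R sketch (with an illustrative kernel and trend basis, not the paper's code) does so to machine precision.

## Numerical check (illustrative) that the bordered system (47) yields the
## GLS estimator (46). Kernel and trend basis below are arbitrary choices.
set.seed(1)
n <- 30; p <- 2
s <- sort(runif(n))
K <- exp(-as.matrix(dist(s)) / 0.3)                  # SPD covariance matrix
F <- cbind(1, s)                                     # constant + linear trend basis
Z <- drop(t(chol(K)) %*% rnorm(n))                   # one draw of the field values
Ktilde <- rbind(cbind(K, F), cbind(t(F), matrix(0, p, p)))
sol <- solve(Ktilde, c(Z, rep(0, p)))
alpha <- sol[1:n]; beta.hat <- sol[n + (1:p)]
beta.gls <- solve(t(F) %*% solve(K, F), t(F) %*% solve(K, Z))  # Eq. (46)
max(abs(beta.hat - drop(beta.gls)))                  # ~ machine precision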
While our formulae coincide with those presented in [6], the block formalism sheds light on the algebraic mechanisms at work and, to a certain extent, helps achieve a broader level of generality, as illustrated for instance in Section 2.3 with the relative simplicity of obtaining multiple-fold cross-validation conditional covariance matrices in the Universal Kriging framework.

C R implementation (with DiceKriging)

cv <- function(model, folds, type = "UK", trend.reestim = TRUE,
               fast = TRUE, light = FALSE) {
  if (!is.list(folds)) {
    stop("The input 'folds' must be a list of index subsets (without index redundancy within each fold)")
  }
  if (length(model@noise.var) > 0) {
    stop("At this stage, cross-validation is not implemented for noisy observations")
  }
  q <- length(folds)
  case1 <- ((type == "SK") && (!trend.reestim))
  case2 <- ((type == "UK") && (trend.reestim))
  analytic <- (case1 | case2) && fast
  X <- model@X
  y <- model@y
  T <- model@T
  n <- model@n
  yhat <- list()
  cvcov.list <- list()
  beta <- model@trend.coef
  F <- model@F
  if (!analytic) {
    C <- crossprod(T)
    for (i in 1:q) {
      I <- folds[[i]]
      F.I <- F[I, , drop = FALSE]
      y.but.I <- y[-I, ]
      F.but.I <- F[-I, ]
      C.but.I <- C[-I, -I]
      c.but.I <- C[-I, I]
      T.but.I <- chol(C.but.I)
      x <- backsolve(t(T.but.I), y.but.I, upper.tri = FALSE)
      M <- backsolve(t(T.but.I), F.but.I, upper.tri = FALSE)
      M <- as.matrix(M)
      if (trend.reestim) {
        l <- lm(x ~ M - 1)
        beta <- as.matrix(l$coef, ncol = 1)
      }
      z <- x - M %*% beta
      Tinv.c <- backsolve(t(T.but.I), c.but.I, upper.tri = FALSE)
      y.predict.complement <- t(Tinv.c) %*% z
      y.predict.trend <- F.I %*% beta
      y.predict <- y.predict.trend + y.predict.complement
      yhat[[i]] <- y.predict
      cvcov.1 <- crossprod(Tinv.c)
      totalcov <- C[I, I]
      cvcov.list[[i]] <- totalcov - cvcov.1
      if (type == "UK") {
        T.M <- chol(crossprod(M))
        ## [... remainder of the trend re-estimation covariance correction
        ##      elided in the extracted source ...]
      }
    }
  }
  ## [... analytic (fast-formula) branch and assembly of the returned object
  ##      elided in the extracted source ...]
  return(res)
}
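Assuming the complete function body (parts of which are elided above), a hypothetical usage sketch with a DiceKriging model could look as follows; the test function and fold layout are arbitrary illustrations.

## Hypothetical usage sketch for cv() above (assumes the full, non-elided body);
## the model is fitted with the DiceKriging package.
library(DiceKriging)
set.seed(0)
X <- data.frame(x = seq(0, 1, length.out = 20))
y <- sin(12 * X$x) / (1 + X$x)                       # arbitrary test function
model <- km(~x, design = X, response = y, covtype = "matern5_2")
folds <- split(1:20, rep(1:5, each = 4))             # 5 contiguous folds of size 4
res <- cv(model, folds, type = "UK", trend.reestim = TRUE, fast = TRUE)
## Leave-one-out corresponds to folds <- as.list(1:20)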
D Additional experiments

D.1 QQ-plots on the second example

[Figure 13: two normal QQ-plots (theoretical versus sample quantiles), entitled "Normal Q-Q Plot of Transformed CV Residuals" and "Normal Q-Q Plot of Standardized CV Residuals".]

Figure 13: On the effect of accounting for covariance in QQ-plots based on CV residuals for the second example (paired points). Right panel: QQ-plot against $N(0,1)$ of CV residuals merely divided by corresponding CV standard deviations. Left panel: QQ-plot against $N(0,1)$ of duly transformed CV residuals.

D.2 Multiple-fold cross-validation on the third example

[Figure 14: two panels, "Basic versus CV GP predictions" (objective function, evaluations, GP prediction, LOO predictions) and "Absolute prediction errors vs GP standard deviation" (absolute GP prediction errors, GP standard deviation, absolute LOO prediction errors).]

Figure 14: Upper panel: GP prediction (blue line) of the test function (black line) of Equation 10 based on evaluations at a regular grid complemented by clustered points between 0.[...] and 0.3, CV predictions (red points). Lower panel: absolute prediction errors associated with GP (black line) and CV (red point) predictions, and GP prediction standard deviation (in blue). Here the second to eleventh points are grouped while the others are left as singletons (so q = 10).

[Figure 15: two normal QQ-plots (theoretical versus sample quantiles), both entitled "Normal Q-Q Plot of Standardized CV Residuals".]

Figure 15: On the effect of accounting for covariance in QQ-plots based on CV residuals for the third example (one cluster). Right panel: QQ-plot against $N(0,1)$ of CV residuals merely divided by corresponding CV standard deviations. Left panel: QQ-plot against $N(0,1)$ of duly transformed CV residuals.

D.3 LOO speed-ups when unfavouring the fast method

[Figure 16: speed-up versus design size (in hundreds).]

Figure 16: Speed-up (ratio between times required to run the naive and fast methods) measured for LOO in an unfavourable situation where $K$ is re-built and its Cholesky factor is recalculated. Apart from that, the settings are the same as in Figure 4.1.

D.4 Fast versus naive LOO for Universal Kriging

[Figure 17: speed-up versus design size (in hundreds).]

Figure 17: Speed-up (ratio between times required to run the naive and fast methods) measured for LOO under Universal Kriging (with second order trend) on 10 regular designs, with 100 to 1000 points equidistributed on [0, 1].

[Figure 18: relative errors on mean and on covariance versus design size (in hundreds).]

Figure 18: Relative errors on LOO means and covariances (between naive and fast methods, with the naive method as reference) measured as on Figure D.4.
E About merely decorrelating residuals in $C^{(2)}_{CV}(\theta)$

A natural way to remedy the lack of account of correlation between CV residuals in Eq. 30 is to modify the criterion into
$$\widetilde{C}^{(2)}_{CV}(\theta) = E_{i_1,\ldots,i_q}(\theta)^{\top}\left(\mathrm{Corr}(E_{i_1,\ldots,i_q}(\theta))\right)^{-1}E_{i_1,\ldots,i_q}(\theta). \quad (48)$$
Recalling that $\mathrm{Cov}(E_{i_1,\ldots,i_q}) = DK^{-1}D$ and removing the indexing on $i_1,\ldots,i_q$ for readability, we plainly get, by noting $G = \mathrm{diag}(DK^{-1}D)$ (implicitly depending on both $\theta$ and $\sigma^2$, as does $D$), that
$$\mathrm{Corr}(E(\theta)) = G^{-1/2}\,DK^{-1}D\,G^{-1/2}, \quad (49)$$
resulting in
$$\widetilde{C}^{(2)}_{CV}(\theta) = Z^{\top}K^{-1}D\,G^{1/2}(DK^{-1}D)^{-1}G^{1/2}\,DK^{-1}Z. \quad (50)$$
Stressing the dependencies on $\theta$ overall, and noting in particular $\bar{G}_{\theta} = \sigma^{-2}G$ and $\bar{D}(\theta) = \sigma^{-2}D$, the latter can be reformulated as
$$\widetilde{C}^{(2)}_{CV}(\theta) = Z^{\top}R_{\theta}^{-1}\bar{D}(\theta)\bar{G}_{\theta}^{1/2}\left(\bar{D}(\theta)R_{\theta}^{-1}\bar{D}(\theta)\right)^{-1}\bar{G}_{\theta}^{1/2}\bar{D}(\theta)R_{\theta}^{-1}Z. \quad (51)$$
In the particular case of leave-one-out, we obtain $D = \sigma^{2}\,\mathrm{diag}(R_{\theta}^{-1})^{-1}$, whereof
$$\mathrm{Corr}(E(\theta)) = \mathrm{diag}(R_{\theta}^{-1})^{-1/2}\,R_{\theta}^{-1}\,\mathrm{diag}(R_{\theta}^{-1})^{-1/2}$$
and
$$\widetilde{C}^{(2)}_{LOO}(\theta) = Z^{\top}R_{\theta}^{-1}\,\mathrm{diag}(R_{\theta}^{-1})^{-1/2}\,R_{\theta}\,\mathrm{diag}(R_{\theta}^{-1})^{-1/2}\,R_{\theta}^{-1}Z.$$
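For concreteness, the LOO criterion above is easy to form once $R_{\theta}$ is available. The following R sketch (illustrative only: the exponential correlation kernel and the function name are assumptions, not the paper's implementation) evaluates $\widetilde{C}^{(2)}_{LOO}(\theta)$:

## Illustrative sketch (not package code): decorrelated LOO criterion of the
## last display, with an assumed exponential correlation kernel for R_theta.
C2.tilde.loo <- function(z, x, theta) {
  R <- exp(-as.matrix(dist(x)) / theta)   # correlation matrix R_theta
  Rinv <- solve(R)
  s <- 1 / sqrt(diag(Rinv))               # diag(R_theta^{-1})^{-1/2}
  u <- s * drop(Rinv %*% z)               # diag(R_theta^{-1})^{-1/2} R_theta^{-1} z
  drop(t(u) %*% R %*% u)
}
## Example call on toy data:
## C2.tilde.loo(z = rnorm(10), x = seq(0, 1, length.out = 10), theta = 0.3)

Note that, as with Equation (51), no scale parameter enters the computation, so the criterion indeed only bears on the correlation parameters $\theta$.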