Bootstrapping the Out-of-sample Predictions for Efficient and Accurate Cross-Validation
Ioannis Tsamardinos, Elissavet Greasidou, Michalis Tsagris, Giorgos Borboudakis
Abstract
Cross-Validation (CV), and out-of-sample performance-estimation protocols in general, are often employed both for (a) selecting the optimal combination of algorithms and values of hyper-parameters (called a configuration) for producing the final predictive model, and (b) estimating the predictive performance of the final model. However, the cross-validated performance of the best configuration is optimistically biased. We present an efficient bootstrap method that corrects for the bias, called Bootstrap Bias Corrected CV (BBC-CV). BBC-CV's main idea is to bootstrap the whole process of selecting the best-performing configuration on the out-of-sample predictions of each configuration, without additional training of models. In comparison to the alternatives, namely the nested cross-validation [31] and a method by Tibshirani and Tibshirani [29], BBC-CV is computationally more efficient, has smaller variance and bias, and is applicable to any metric of performance (accuracy, AUC, concordance index, mean squared error). Subsequently, we employ again the idea of bootstrapping the out-of-sample predictions to speed up the CV process. Specifically, using a bootstrap-based hypothesis test we stop training of models on new folds of statistically-significantly inferior configurations. We name the method Bootstrap Corrected with Early Dropping CV (BCED-CV), which is both efficient and provides accurate performance estimates.
Keywords performance estimation · bias correction · cross-validation · hyper-parameter optimization

⋆ Equal contribution.
Ioannis Tsamardinos, Computer Science Department, University of Crete and Gnosis Data Analysis PC. E-mail: [email protected]
Elissavet Greasidou, Computer Science Department, University of Crete and Gnosis Data Analysis PC. E-mail: [email protected]
Michalis Tsagris, Computer Science Department, University of Crete
Giorgos Borboudakis, Computer Science Department, University of Crete and Gnosis Data Analysis PC
Typically, the goals of a machine learning predictive modelling task are two: to return a high-performing predictive model for operational use and an estimate of its performance. The process often involves the following steps: (a) Tuning, where different combinations of algorithms and their hyper-parameter values (called configurations) are tried producing several models, their performance is estimated, and the best configuration is determined, (b) Production of the final model trained on all available data using the best configuration, and (c)
Performance Estimation of the final model.
Focusing first on tuning, we note that a configuration may involve combining several algorithms for every step of the learning process, such as pre-processing, transformation of variables, imputation of missing values, feature selection, and modelling. Except for rare cases, each of these algorithms accepts a number of hyper-parameters that tune its behavior. Usually, these hyper-parameters affect the sensitivity of the algorithms to detecting patterns, the bias-variance trade-off, the trade-off between model complexity and fitting of the data, or may trade off computational complexity for optimality of fitting. Examples include the maximum number of features to select in a feature selection algorithm, or the type of kernel to use in Support Vector Machine and Gaussian Process learning.
There exist several strategies guiding the order in which the different configurations are tried, from sophisticated ones such as Sequential Bayesian Optimization [26, 12] to simple grid search in the space of hyper-parameter values. However, independently of the order of production of configurations, the analyst needs to estimate the performance of the average model produced by each configuration on the given task and select the best.
The estimation methods of choice for most analysts are the out-of-sample estimation protocols, where a portion of the data training instances is hidden from the training algorithm to serve as an independent test set. The performance of several models stemming from different configurations is tried on the test set, also called the hold-out set, in order to select the best performing one. This procedure is known as the Hold-out protocol. We will refer to such a test set as a tuning set to emphasize the fact that it is employed repeatedly by all configurations for the purposes of tuning the algorithms and the hyper-parameter values of the learning pipeline. We note that while there exist approaches that do not employ out-of-sample estimation, such as using the Akaike Information Criterion (AIC) [1] of the models, the Bayesian Information Criterion (BIC) [25], and others, in this paper we focus only on out-of-sample estimation protocols.
The process of withholding a tuning set can be repeated multiple times, leading to several analysis protocol variations. The simplest one is to repeatedly withhold different, randomly-chosen tuning sets and select the one with the best average performance over all tuning sets. This protocol is called the Repeated Hold-out.
Arguably however, the most common protocol for performance estimation for relatively low sample sizes is the
K-fold Cross-Validation or simply Cross-Validation (CV). In CV the data training instances are partitioned into K approximately equal-sized subsets, each one serving as a tuning set and the remaining ones as training sets. The performance of each configuration is averaged over all tuning folds. The difference with the Repeated Hold-Out is that the process is repeated exactly K times and the tuning sets are enforced to be non-overlapping in samples (also referred to as instances or examples). The process can be repeated with different partitions of the data to folds, leading to the Repeated CV.
A final note on tuning regards its name. In statistics, the term model selection is preferred for similar purposes. The reason is that a common practice in statistical analysis is to produce several models using different configurations on all of the available data, manually examine their fitting, degrees of freedom, residuals, AIC and other metrics, and then make an informed (but manual) choice of a model. In contrast, in our experience machine learning analysts estimate the performance of each configuration and select the best configuration to employ on all data, rather than selecting the best model. Thus, in our opinion, the term tuning is more appropriate than model selection for the latter approach.
Considering now the production of the final model to deploy operationally, the most reasonable choice is arguably to train a single model using the best configuration found on all of the available data. Note that each configuration may have produced several models during CV or Repeated Hold-Out for tuning purposes, each time using a subset of the data for training. However, assuming that, on average, a configuration produces better predictive models when trained with larger sample sizes (i.e., its learning curve is monotonically increasing), it is reasonable to employ all data to train the final model and not waste any training examples (samples) for tuning or performance estimation. There may be exceptions of learning algorithms that do not abide by this assumption (the K-NN algorithm for example, see [20] for a discussion), but it is largely accepted and true for most predictive modeling algorithms.
The third step is to compute and return a performance estimate of the final model.
The cross-validated performance of the best configuration (estimated on the tuning sets) is an optimistically biased estimate of the performance of the final model. Thus, it should not be reported as the performance estimate. Particularly for small sample sizes (less than a few hundred) like the ones that are typical in molecular biology and other life sciences, and when numerous configurations are tried, the optimism could be significant. This is in our opinion a common source of methodological errors in data analysis.
The main problem of using the estimation provided on the tuning sets is that these sets have been employed repeatedly by all configurations, out of which the analyst selected the best. Thus, equivalent statistical phenomena occur as in multiple hypothesis testing. The problem was named the multiple comparisons in induction problems and was first reported in the machine learning literature by Jensen in [17]. A simple mathematical proof of the bias is as follows. Let µ_i be the average true performance (loss) of the models produced by configuration i when trained on data of size |D_train| from the given data distribution, where |D_train| is the size of the training sets. The sample estimate of µ_i on the tuning sets is m_i, and so we expect that µ_i = E(m_i) for estimations that are unbiased. Returning the estimate of the configuration with the smallest loss returns min{m_1, . . . , m_n}, where n is the number of configurations tried. On average, the estimate of the best configuration on the tuning sets is E(min{m_1, . . . , m_n}) while the estimate of the true best is min{µ_1, . . . , µ_n} = min{E(m_1), . . . , E(m_n)}. The optimism (bias) is

Bias = min{E(m_1), . . . , E(m_n)} − E(min{m_1, . . . , m_n}) ≥ 0
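This inequality is easy to verify numerically. The following minimal sketch (our own illustration; the Gaussian noise model and the numbers are arbitrary) simulates unbiased tuning-set estimates for configurations that all have the same true loss, and shows that reporting the minimum estimate is optimistic on average:

import numpy as np

rng = np.random.default_rng(0)
n_sims, n_configs = 10000, 100
true_loss = 0.30                                             # every configuration has true loss mu_i = 0.30
m = rng.normal(true_loss, 0.05, size=(n_sims, n_configs))   # noisy but unbiased tuning-set estimates m_i
print("min of E[m_i]:", m.mean(axis=0).min())                # about 0.30
print("E[min of m_i]:", m.min(axis=1).mean())                # clearly below 0.30, i.e., optimistic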
The bias of Cross-Validation when multiple configurations are tried has also been explored empirically in [30] on real datasets, and was found to be substantial for small sample sizes. To avoid reporting it, the simplest procedure is to hold out a second, untainted set, exclusively used for estimation purposes on a single model. This single model is of course the one produced with the best configuration as found on the tuning sets. This approach has been particularly advocated in the Artificial Neural Networks literature, where the performance of the current network is estimated on a validation set (equivalent to a tuning set) as a stopping criterion of training (weight updates). Thus, the validation set is employed repeatedly on different networks (models), albeit only slightly different by the weight updates of one epoch. For the final performance estimation a separate, independent test set is used. Thus, in essence, the data are partitioned to train-tuning-estimation subsets: the tuning set is employed multiple times for tuning of the configuration; then, a single model produced on the union of the train and tuning data with the best configuration is tested on the estimation subset. Generalizing this protocol so that all folds serve as tuning and estimation subsets and performance is averaged on all subsets leads to the Nested Cross-Validation (NCV) protocol [31]. The problem with NCV is that it requires O(K² · C) models to be trained, where K is the number of folds and C the number of configurations tried, resulting in large computational overheads.
The main contribution of this paper is the idea that one can bootstrap the pooled predictions of all configurations over all tuning sets (out-of-sample predictions) to achieve several goals. The first goal is to estimate the loss of the best configuration (i.e., remove the bias of cross-validating multiple configurations) without training additional models. Specifically, the (out-of-sample) predictions of all configurations are bootstrapped, i.e., selected with replacement, leading to a matrix of predictions. The configuration with the minimum loss is selected and its loss is computed on the out-samples (not selected by the bootstrap). The procedure is repeated for a few hundred bootstraps and the average loss of the selected best configuration on the out-samples is returned. Essentially, the above procedure bootstraps the strategy for selecting the best configuration and computes its average loss on the samples not selected by the bootstrap.
Bootstrapping has a relatively low computational overhead and is trivially parallelized. The computational overhead for each bootstrap iteration amounts to re-sampling the sample indexes of predictions, computing the loss on the bootstrapped predictions for each configuration, and selecting the minimum. We call the method Bootstrap Bias Corrected CV (BBC-CV). BBC-CV is empirically compared against NCV, the standard for avoiding bias, and a method by Tibshirani and Tibshirani [29] (TT from hereon) which addresses the large computational cost of NCV. BBC-CV is shown to exhibit a more accurate estimation of the bias than TT and similar to that of NCV, while it requires no training of new models, thus being as computationally efficient as TT and much faster than NCV. Bootstrapping the out-of-sample predictions can also trivially be used to compute confidence intervals for the performance estimate in addition to point estimates.
In experiments on real data, we show that the confidence intervals are accurate or somewhat conservative (i.e., have higher coverage than expected).
The second main use of bootstrapping the out-of-sample predictions is to create a hypothesis test for the hypothesis that a configuration exhibits equal performance as the currently best configuration. The test is employed in every new fold serving for tuning during CV. When the hypothesis can be rejected based on the predictions on a limited number of folds, the configuration is eliminated or dropped from further consideration and no additional models are trained on remaining folds. We combine the idea of dropping configurations with the BBC-CV method for bias correction, and get the Bootstrap Corrected with Early Dropping CV (BCED-CV). BCED-CV results in significant computational gains, typically achieving a speed-up of 2-5 (in some cases up to the theoretical maximum equal to the number of folds, in this case 10) over BBC-CV, while providing accurate estimates of performance and confidence intervals. Finally, we examine the role of repeating the procedure with different partitions to folds (Repeated BBC-CV) and show that multiple repeats improve the selection of the best configuration (tuning) and lead to better performing models. In addition, for the same number of trained models, Repeated BBC-CV leads to better performing models than NCV while having similar bias in their performance estimates.
The rest of this paper is structured as follows. In Section 2 we present and discuss widely established protocols for tuning and out-of-sample performance estimation. In Section 3 we discuss additional related work. We introduce our methods BBC-CV and BCED-CV in Sections 4 and 5, respectively, and empirically evaluate them on synthetic and real settings in Section 6. We conclude the paper in Section 7.
In this section, we present the basics of out-of-sample estimation of the performance of a learning method f and introduce the notation employed in the rest of the paper. We assume the learning method is a function that accepts as input a dataset D = {⟨x_j, y_j⟩}, j = 1, . . . , N, of pairs of training vectors x and their corresponding labels y and returns another function M(x) (a predictive model), so that f(D) = M. We can also think of D as a 2D matrix with the rows containing the examples, and the columns corresponding to features (a.k.a. variables, attributes, measured/observed quantities). It is convenient to employ the Matlab index notation on matrices to denote with D(:, i) the i-th column of D and D(i, :) the i-th row of D; similarly, D(I, i) is the vector of values in the i-th column from rows with indexes in vector I.
We also overload the notation and use f(x, D) to denote the output (predictions) of the model M trained by f on dataset D when given as input one or multiple samples x. We also denote the loss (metric of error) between the value y of a label and a corresponding prediction ŷ as l(y, ŷ). For convenience, we can also define the loss between a vector of labels y and a vector of predictions ŷ as the vector of losses between the corresponding labels and predictions:

[l(y, ŷ)]_j = l(y_j, ŷ_j)
Algorithm 1 CV(f, D = {F_1, . . . , F_K}): Basic K-Fold Cross-Validation
Input: Learning method f, data matrix D = {⟨x_j, y_j⟩}, j = 1, . . . , N, partitioned into approximately equally-sized folds F_i
Output: Model M, performance estimate L_CV, out-of-sample predictions Π on all folds
  Define D_\i ← D \ F_i
  // Obtain the indexes of each fold
  I_i ← indexes(F_i)
  // Final model trained by f on all available data
  M ← f(D)
  // Performance estimation: learn from D_\i, estimate on F_i
  L_CV ← (1/K) Σ_{i=1}^{K} l(y(I_i), f(F_i, D_\i))
  // Out-of-sample predictions are used by bias-correction methods
  Collect out-of-sample predictions Π = [f(F_1, D_\1); · · · ; f(F_K, D_\K)]
  Return ⟨M, L_CV, Π⟩

The loss function can be either the 0-1 loss for classification (i.e., one when the label and prediction differ and zero otherwise), the squared error (y − ŷ)² for regression, or any other metric. Some metrics of performance such as the AUC or the Concordance Index for survival analysis problems [16] cannot be expressed using a loss function defined on single pairs ⟨y, ŷ⟩. These metrics can only be computed on a test set containing at least 2 predictions and thus, l(y, ŷ) is defined only when y and ŷ are vectors for such metrics.
The K-fold Cross-Validation (CV) protocol is arguably the most common out-of-sample performance estimation protocol for relatively small sample sizes. It is shown in Algorithm 1. The protocol accepts a learning method f and a dataset D already partitioned into K folds F_i. The model to return is computed by applying the learning method f on all available data. To estimate the performance of this model, CV employs each fold F_i in turn as an estimation set and trains a model M_i on the remaining data (in the algorithm denoted as D_\i) using f, i.e., M_i = f(D_\i). It then computes the loss of M_i on the held-out fold F_i. The final performance estimate is the average loss over all folds. The pseudo-code in Algorithm 1, as presented, also collects and returns all out-of-sample predictions in a vector Π. This facilitates the presentation of some bias-correction methods below, which depend on them. In case no bias-correction is applied afterwards, Π can be omitted from the output arguments.
As simple and common as CV is, there are still several misconceptions about its use. First, the protocol returns f(D) learned from the full dataset D, but the losses computed are on different models, namely models trained with subsets of the data. So, CV does not estimate the loss of the specific returned model. We argue that Cross-Validation estimates the average loss of the models produced by f when trained on datasets of size |D_\i| on the distribution of D. The key conceptual point is that it is not the returned model that is being cross-validated, but the learning method f. Non-expert analysts (and students in particular) often wonder which out of the K models produced by cross-validation excluding each fold in turn should be returned. The answer is none; the model to use operationally is the one learned by f on all of D.
Notice that the size of the training sets in CV is |D_\i| = ((K − 1)/K) · |D| < |D| (e.g., for 5-fold CV, we are using only 80% of the total sample size for training in each fold). It follows that CV estimates are conservatively biased: the final model is trained on |D| samples while the estimates are produced by models trained on |D_\i| samples.
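A minimal Python sketch of Algorithm 1 may help fix ideas (our own illustration, not the authors' code; learn is assumed to be a callable that fits and returns a model exposing a predict method, and loss returns an average loss over a set of samples):

import numpy as np

def cv(learn, X, y, folds, loss):
    """Basic K-fold CV (Algorithm 1): returns the model trained on all data,
    the cross-validated loss, and the pooled out-of-sample predictions Pi."""
    n = len(y)
    preds = np.empty(n)
    fold_losses = []
    for idx in folds:                           # idx: sample indexes of fold F_i
        train = np.setdiff1d(np.arange(n), idx)
        model_i = learn(X[train], y[train])     # model trained without fold F_i
        preds[idx] = model_i.predict(X[idx])
        fold_losses.append(loss(y[idx], preds[idx]))
    L_cv = float(np.mean(fold_losses))          # average loss over the folds
    final_model = learn(X, y)                   # the model actually returned
    return final_model, L_cv, preds

# Folds can be built, e.g., as folds = np.array_split(rng.permutation(n), 10);
# stratification, as advocated above, is omitted for brevity.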
A typical major assumption is that f's models improve on any given task with increased sample size. This is a reasonable assumption to make, although not always necessarily true. If it does hold, then we expect that the returned loss estimate L of CV is conservative, i.e., on average higher than the average loss of the returned model. Exactly how conservative it will be depends on where the classifier is operating on its learning curve for this specific task, which is unknown a priori. It also depends on the number of folds K: the larger the K, the more (K − 1)/K approaches 100% and the bias disappears. When sample sizes are small or distributions are imbalanced (i.e., some classes are quite rare in the data), we expect most classifiers to quickly benefit from increased sample size, and thus CV to be more conservative.
Based on the above, one expects that leave-one-out CV (where each fold's size is 1 sample) should be the least biased. However, leave-one-out CV can collapse in the sense that it can provide extremely misleading estimates in degenerate situations (see [33], p. 151, and [19] for an extreme failure of leave-one-out CV and of the 0.632 bootstrap rule). We believe that the problem of leave-one-out CV stems from the fact that the folds may follow a totally different distribution than the distribution of the class in the original dataset: when only one example is left out, the distribution of one class in the fold is 100% and 0% for all the others. We instead advise to use a K only as large as possible to still allow the distribution of classes in each fold to be approximately similar as in the original dataset, and to impose this restriction when partitioning to folds. The latter restriction leads to what is called stratified CV and there is evidence that it leads to improved performance estimations [30].

2.1 Cross-Validation with Tuning (CVT)

A typical data analysis involves several algorithms to be combined, e.g., for transforming the data, imputing the missing values, variable selection or dimensionality reduction, and modeling. There are hundreds of choices of algorithms in the literature for each type of algorithm. In addition, each algorithm typically takes several hyper-parameter values that should be tuned by the user. We assume that the learning method f(D) is augmented to f(D, θ) to take as input a vector θ that determines which combination of algorithms to run and with what values of hyper-parameters. We call θ a configuration and refer to the process of selecting the best θ as tuning of the learning pipeline.
The simplest tuning procedure is to cross-validate f with a different configuration θ each time within a predetermined set of configurations Θ, choose the best performing configuration θ⋆ and then train a final model on all data with θ⋆. The procedure is shown in Algorithm 2. In the pseudo-code, we compute f_i as the closure of f when the configuration input parameter is grounded to the specific values in θ_i (the term closure is used in the programmatic sense to denote a function produced by another function by binding some free parameters to specific values; see also http://gafter.blogspot.gr/2007/01/definition-of-closures.html).
Algorithm 2 CVT(f, D = {F_1, . . . , F_K}, Θ): Cross-Validation with Tuning
Input: Learning method f, data matrix D = {⟨x_j, y_j⟩}, j = 1, . . . , N, partitioned into approximately equally-sized folds F_i, set of configurations Θ
Output: Model M, performance estimate L_CVT, out-of-sample predictions Π on all folds for all configurations
  for i = 1 to C = |Θ| do
    // Create a closure of f (a new function) by grounding the configuration θ_i
    f_i ← Closure(f(·, θ_i))
    ⟨M_i, L_i, Π_i⟩ ← CV(f_i, D)
  end for
  i⋆ ← arg min_i L_i
  // Final model trained by f on all available data using the best configuration
  M ← f(D, θ_i⋆)
  // Performance estimate; may be optimistic and should not be reported in general
  L_CVT ← L_i⋆
  // Out-of-sample predictions are used by bias-correction methods
  Collect all out-of-sample predictions of all configurations in one matrix Π ← [Π_1 · · · Π_C]
  Return ⟨M, L_CVT, Π⟩
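Continuing the sketch above, CVT cross-validates one closure per configuration and pools their out-of-sample predictions into a single matrix (again our own illustration; each configuration is represented simply as a learner callable, which plays the role of the closure f_i):

import numpy as np

def cvt(configurations, X, y, folds, loss):
    """CVT (Algorithm 2): cross-validate every configuration, select the best
    by its pooled out-of-sample loss, and keep the N x C prediction matrix Pi."""
    results = [cv(f_i, X, y, folds, loss) for f_i in configurations]
    losses = np.array([L for _, L, _ in results])
    Pi = np.column_stack([preds for _, _, preds in results])
    best = int(np.argmin(losses))
    # The final model of the best configuration was already trained on all data by cv()
    return results[best][0], losses[best], Pi, best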
For example, if configuration θ_i is a combination of a feature selection algorithm g and a modeling algorithm h with their respective hyper-parameter values a and b, taking the closure and grounding the hyper-parameters produces a function f_i = h(g(·, a), b), i.e., a function f_i that first applies the specified feature selection g using hyper-parameters a and uses the result to train a model using h with hyper-parameters b. The use of closures leads to a compact pseudo-code implementation of the method.
We now put together two observations already noted above: the performance estimate L_CVT of the winning configuration tends to be conservative because it is computed by models trained on only a subset of the data; at the same time, it tends to be optimistic because it is selected as the best among many tries. Which of the two trends will dominate depends on the situation and is a priori unknown. For large K and a large number of configurations tried, the training sets are almost as large as the whole dataset and the optimistic trend dominates. In general, for small sample sizes and a large number of configurations tried, L_CVT is optimistic and should not be reported as the performance estimate of the final model.

2.2 The Nested Cross-Validation (NCV) Protocol

Given the potential optimistic bias of CV when tuning takes place, other protocols have been developed, such as the Nested Cross-Validation (NCV). We could not trace who first introduced or coined the name Nested Cross-Validation, but the authors and colleagues have independently discovered it and have been using it since 2005 [27]; around the same time, Varma and Simon [31] report a bias in error estimation when using K-fold Cross-Validation, and suggest the use of the Nested K-fold Cross-Validation (NCV) protocol as an almost unbiased estimate of the true performance.
A similar method in a bioinformatics analysis was used in 2005 [27]. One early comment hinting of the method is in [24], while Witten and Frank (see [32], page 286) briefly discuss the need of treating any parameter tuning step as part of the training process when assessing performance. It is interesting to note that the earlier works on NCV appeared first in bioinformatics, where the sample size of datasets is often quite low and the effects of the bias more dramatic.
The idea of the NCV evolved as follows. Since the tuning sets have been used repeatedly for selecting the best configuration, one needs a second hold-out set exclusively for estimation of one, single, final model. However, one could repeat the process with several held-out folds and average the estimates. In other words, each fold is held out for estimation purposes each time and a CVT takes place on the remaining folds, selecting the best configuration and training on all remaining data with this best configuration to return a single model. Thus, in NCV each fold serves once for estimation and multiple times as a tuning set. Under this perspective, NCV is a generalization of a double-hold-out protocol partitioning the data to train-tuning-estimation.
Another way to view NCV is to consider tuning as part of the learning process. The result is a new learning function f′ that returns a single model, even though internally it is using CV to select the best configuration to apply to all input data. NCV simply cross-validates f′. What is this new function f′ that uses CV for tuning and returns a single model? It is actually CVT for the given learning method f and configuration set Θ. Naturally, any method that performs hyper-parameter optimization and returns a single model can be used instead of CVT as f′. The pseudo-code in Algorithm 3 clearly depicts this fact and implements NCV in essentially two lines of code using the mechanism of closures.

Algorithm 3 NCV(f, D = {F_1, . . . , F_K}, Θ): Nested Cross-Validation
Input: Learning method f, data matrix D = {⟨x_j, y_j⟩}, j = 1, . . . , N, partitioned into approximately equally-sized folds F_i, set of configurations Θ
Output: Model M, performance estimate L_NCV, out-of-sample predictions Π on all folds for all configurations
  // Create closure by grounding the f and the Θ input parameters of CVT
  f′ ← CVT(f, ·, Θ)
  // Notice: the final model is trained by f′ on all available data; the final estimate is
  // provided by basic CV (no tuning) since f′ returns a single model each time
  ⟨M, L_NCV, Π⟩ ← CV(f′, D)
  Return ⟨M, L_NCV⟩

Counting the number of models created by NCV, let us denote with C = |Θ| the number of configurations to try. To produce the final model, NCV will run CVT on all data. This will create K × C models for tuning and, once the best configuration is picked, one more model will be produced, leading to K × C + 1 models for final model production. To produce the estimate, the whole process is cross-validated, each time excluding one fold and thus leaving K − 1 folds for the internal CVT (the function f′). Overall, this leads to K × ((K − 1) × C + 1) models trained for estimation. The total count is exactly K² × C + K + 1 models, which is of course computationally expensive as it depends quadratically on the number of folds K.
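The "tuning as part of learning" view makes NCV essentially a two-liner on top of the helpers above: wrap CVT into a learner f′ and cross-validate it. A minimal sketch (our own illustration; the inner fold construction is simplified to a fixed number of folds):

import numpy as np

def ncv(configurations, X, y, outer_folds, loss, inner_k=9, seed=0):
    """NCV (Algorithm 3): basic CV of a learner that internally tunes with CVT."""
    rng = np.random.default_rng(seed)

    class TunedLearner:
        """f_prime: runs CVT on the data it receives and keeps the tuned model."""
        def __init__(self, X_in, y_in):
            inner_folds = np.array_split(rng.permutation(len(y_in)), inner_k)
            self.model, _, _, _ = cvt(configurations, X_in, y_in, inner_folds, loss)
        def predict(self, X_in):
            return self.model.predict(X_in)

    return cv(TunedLearner, X, y, outer_folds, loss)   # model, L_NCV, predictions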
2.3 The Tibshirani and Tibshirani Protocol

To reduce the computational overhead of NCV, Tibshirani and Tibshirani [29] introduced a new method for estimating and correcting for the bias of CVT without training additional models. We refer to this method as the TT and it is the first work of its kind, inspiring this work.

Algorithm 4 TT(f, D = {F_1, . . . , F_K}, Θ): Cross-Validation with Tuning, bias removal using the TT method
Input: Learning method f, data matrix D = {⟨x_j, y_j⟩}, j = 1, . . . , N, partitioned into approximately equally-sized folds F_i, set of configurations Θ
Output: Model M, performance estimate L_TT
  // Notice: the final model is the same as in CVT
  ⟨M, L_CVT, Π⟩ ← CVT(f, D, Θ)
  for k = 1 to K do
    // Compute bias estimate for fold k
    TTBias_k ← l(y(I_k), Π(I_k, j)) − min_i l(y(I_k), Π(I_k, i))
  end for
  TTBias ← (1/K) Σ_{k=1}^{K} TTBias_k
  L_TT ← L_CVT + TTBias
  Return ⟨M, L_TT⟩

The main idea of the TT method is to consider, in a sense, each fold a different dataset, serving as an independent example to estimate how much the process of selecting the best configuration out of many incurs optimism. It compares the loss of the final, selected configuration with the one selected in a given fold as an estimate of the bias of the selection process. Let I_k denote the indexes of the samples (rows) of the k-th fold F_k. Furthermore, let j denote the index of the best performing configuration (column of Π), as computed by CVT. The bias TTBias estimated by the TT method is computed as:
TTBias = (1/K) Σ_{k=1}^{K} [ l(y(I_k), Π(I_k, j)) − min_i l(y(I_k), Π(I_k, i)) ]

Note that the average of the first terms l(y(I_k), Π(I_k, j)) in the sum is the average loss of the best configuration computed by CVT, L_CVT. Thus, TTBias can be rewritten as:

TTBias = L_CVT − (1/K) Σ_{k=1}^{K} min_i l(y(I_k), Π(I_k, i))

The final performance estimate is:

L_TT = L_CVT + TTBias
The pseudo-code is presented in Algorithm 4, where it is clear that the TT does not train new models, employs the out-of-sample predictions of all models and corresponding configurations, and returns the same final model as both the CVT and the NCV. It is also clear that when the same configuration is selected in each fold as the final configuration, the bias estimate is zero.
A major disadvantage of the TT is also apparent. Observe that the bias estimate of TT obeys 0 ≤ TTBias ≤ L_CVT. Thus, the final estimate L_TT is always between L_CVT and 2·L_CVT. This can trivially lead to cases where TT over-corrects the loss or does not perform any correction at all. As an example of the former, consider the extreme case of classification, 0-1 loss and Leave-One-Out CV, where each fold contains a single instance. Then it is likely, especially if many configurations have been tried, that there is always a configuration that correctly predicts the held-out sample in each fold. Thus, in this scenario the bias estimate will be exactly equal to the loss of the selected configuration and so L_TT = 2·L_CVT. If, for example, in a multi-class classification problem the selected configuration has an estimated 0-1 loss of 70%, the TT method will adjust it to return a 140% loss estimate! If on the other hand the 0-1 loss is close to zero, almost no correction will be performed by TT. Such problems are very likely to be observed with few samples and if many configurations are tried. For reliable estimation of the bias, the TT requires relatively large folds, but it is exactly the analyses with overall small sample size that need the bias estimation the most. For the same reason, it is less reliable for performance metrics such as the AUC or the concordance index (in survival analysis) that require several predictions to be computed; estimating these metrics in small folds is totally unreliable.
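The TT correction itself needs only the pooled prediction matrix Π, the labels, and the fold indexes. A minimal sketch for point-wise loss functions (our own illustration):

import numpy as np

def tt_estimate(Pi, y, folds, loss):
    """TT method: correct the CVT estimate by the average per-fold optimism."""
    C = Pi.shape[1]
    # j: best configuration according to CVT (pooled out-of-sample loss)
    j = int(np.argmin([loss(y, Pi[:, c]) for c in range(C)]))
    loss_j = [loss(y[idx], Pi[idx, j]) for idx in folds]               # per-fold loss of j
    loss_best = [min(loss(y[idx], Pi[idx, c]) for c in range(C)) for idx in folds]
    L_cvt = float(np.mean(loss_j))
    tt_bias = float(np.mean(np.array(loss_j) - np.array(loss_best)))  # TTBias >= 0
    return L_cvt + tt_bias                                             # L_TT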
There are two additional major works that deal with performance estimation when tuning (model selection) is included. Bernau et al. [2] introduced two variants of a bias correction method as a smooth analytical alternative to NCV, the WMC and WMCS. The method is based on repeated subsampling of the original dataset and training of multiple models. It then computes the error estimate as a weighted mean of the error rates of every configuration over all the subsamples. The two variants differ in the way that the weights are calculated. Compared to NCV, the authors claim that WMC/WMCS is competitive and more stable, for the same number of trained models as the CVT. However, subsequent independent work by [7] reports problems with the method, and specifically that it provides fluctuating estimates and may over-correct the bias in some cases. It is also complicated to understand and implement.
Ding et al. [7] proposed a resampling-based inverse power law (IPL) method for bias correction and compared its performance to those of TT, NCV, and WMC/WMCS on both simulated and real datasets. The error rate of each classifier is estimated by fitting a learning curve, which is constructed from repeatedly resampling the original dataset for different sample sizes and fitting an inverse power law function. The IPL method outperforms the other methods in terms of performance estimation but, as the authors point out, it exhibits significant limitations. Firstly, it is based on the assumption that the learning curve for each classifier can be fitted well by an inverse power law. Additionally, if the sample size of the original dataset is small, the method will provide unstable estimates. Lastly, the IPL method has a higher computational cost compared to the TT and the WMC/WMCS methods.
The bootstrap [8] has been developed and applied extensively to estimate in a non-parametric way the (unknown) distribution of a statistic b_o computed for a population (dataset). The main idea of the bootstrap is to sample with replacement from the given dataset multiple times (e.g., 500), each time computing the statistic b_i, i = 1, . . . , B, on the resampled dataset. The empirical distribution of the b_i, under certain broad conditions, approaches the unknown distribution of b_o. Numerous variants have appeared for different statistical tasks and problems (see [6]).
In machine learning, for estimation purposes the idea of bootstrapping datasets has been proposed as an alternative to the CV. Specifically, to produce a performance estimate for a method f, multiple training sets are produced by bootstrap (uniform re-sampling with replacement of rows of the dataset), a model is trained and its performance is estimated on the out-of-sample examples. On average, random re-sampling with replacement results in 63.2% of the original samples being included in each bootstrap dataset and the rest serving as out-of-sample test sets. The protocol has been compared to the CV in [19], concluding that the CV is preferable. The setting we explore in this paper is different from what is described above since we examine the case where one is also tuning. A direct application of the bootstrap idea in such settings would be to substitute CVT (instead of CV) with a bootstrap version where not one but all configurations are tried on numerous bootstrap datasets, the best is selected, and its performance is estimated as the average loss on the out-of-sample predictions. Obviously, this protocol would require the training of B × C models, where B is the number of bootstraps, an unacceptably high computational overhead for B in the typical range of a few hundred to thousands.
Before proceeding with the proposed method, let us define a new important function css(Π, y), standing for configuration selection strategy, where Π is a matrix of out-of-sample predictions and y is a vector of the corresponding true labels. Recall that Π contains N rows and C columns, where N is the sample size and C is the number of configurations, so that [Π]_ij denotes the out-of-sample prediction on the i-th sample of the j-th configuration. The function css returns the index of the best-performing configuration according to some criterion. The simplest criterion, also employed in this paper, is to select the configuration with the minimum average loss:

css(Π, y) = arg min_i l(y, Π(:, i))

where we again employ the Matlab index notation Π(:, i) to denote the vector in column i of matrix Π, i.e., all pooled out-of-sample predictions of configuration i. However, by explicitly writing the selection as a new function, one can easily implement other selection criteria that consider not only the out-of-sample loss but also the complexity of the models produced by each configuration.
We propose the Bootstrap Bias Corrected CV method (BBC-CV) for efficient and accurate performance estimation. The pseudo-code is shown in Algorithm 5.
Algorithm 5 BBC-CV(f, D = {F_1, . . . , F_K}, Θ): Cross-Validation with Tuning, bias removal using the BBC method
Input: Learning method f, data matrix D = {⟨x_j, y_j⟩}, j = 1, . . . , N, partitioned into approximately equally-sized folds F_i, set of configurations Θ
Output: Model M, performance estimate L_BBC, 95% confidence interval [lb, ub]
  // Notice: the final model is the same as in CVT
  ⟨M, L_CVT, Π⟩ ← CVT(f, D, Θ)
  for b = 1 to B do
    Π_b ← sample with replacement N rows of Π
    Π_\b ← Π \ Π_b   // get samples in Π and not in Π_b
    // Apply the configuration selection method on the bootstrapped out-of-sample predictions
    i ← css(Π_b, y_b)
    // Estimate the error of the selected configuration on predictions not selected by this bootstrap
    L_b ← l(y_\b, Π_\b(:, i))
  end for
  L_BBC ← (1/B) Σ_{b=1}^{B} L_b
  // Compute 95% confidence interval; L_(k) denotes the k-th value of the L_b's in ascending order
  [lb, ub] ← [L_(0.025·B), L_(0.975·B)]
  Return ⟨M, L_BBC, [lb, ub]⟩

BBC-CV uses the out-of-sample predictions Π returned by CVT. It creates B bootstrapped matrices Π_b, b = 1, . . . , B, and the corresponding vectors of true labels y_b by sampling N rows of Π with replacement. Let Π_\b, b = 1, . . . , B, denote the matrices containing the samples in Π and not in Π_b (denoted as Π \ Π_b), and y_\b their corresponding vectors of true labels. For each bootstrap iteration b, BBC-CV: (a) applies the configuration selection strategy css(Π_b, y_b) to select the best-performing configuration i, and (b) computes the loss L_b of configuration i as L_b = l(y_\b, Π_\b(:, i)). Finally, the estimated loss L_BBC is computed as the average of L_b over all bootstrap iterations.
BBC-CV differs from the existing methods in two key points: (a) the data being bootstrapped are in the matrix Π of the pooled out-of-sample predictions computed by CVT (instead of the actual data in D), and (b) the method applied on each bootstrap sample is the configuration selection strategy css (not the learning method f). Thus, performance estimation can be applied with minimal computational overhead, as no new models need to be trained.
A few comments on the BBC-CV method now follow. First, notice that if a single configuration is always selected as best, the method will return the bootstrapped mean loss of this configuration instead of the mean loss on the original predictions. The first asymptotically approaches the second as the number of bootstrap iterations increases, and they will coincide. A single configuration may always be selected for two reasons: either only one configuration was cross-validated or one configuration dominates all others with respect to the selection criterion. In both these cases the BBC-CV estimate will approach the CVT estimate.
Second, BBC-CV simultaneously considers a bootstrap sample from all predictions of all configurations, not only the ones pertaining to a single fold each time. Thus, unlike TT, it is robust even when folds contain only one or just a few samples. For the same reason, it is also robust when the performance metric is the AUC (or a similar metric) and requires multiple predictions to be computed reliably. There is one caveat, however, with the use of BBC-CV and the AUC metric: because BBC-CV pools together predictions from different folds, and thus different models (although produced with the same configuration), the predictions in terms of scores have to be comparable (in the same scale) for use with the AUC. Finally, we note that we presented BBC in the context of K-fold CV, but the main idea of bootstrapping the pooled out-of-sample predictions of each configuration can be applied to other protocols. One such protocol is the hold-out, where essentially there is only one fold. Similarly, it may happen that an implementation of K-fold CV, to save computational time, decides to terminate after only a few folds have been employed, e.g., because the confidence intervals of performance are tight enough and there is no need to continue. We call the latter the incomplete CV protocol.
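A minimal sketch of the BBC procedure operating on the pooled prediction matrix Pi (our own illustration, not the authors' code; it assumes a per-sample loss so that css reduces to an argmin of mean losses, and it also returns the percentile confidence interval described in Section 4.1 below):

import numpy as np

def bbc_cv(Pi, y, loss, B=1000, alpha=0.05, rng=None):
    """Bootstrap Bias Corrected CV estimate from the N x C matrix Pi of pooled
    out-of-sample predictions and the vector y of true labels.
    loss(y_true, y_pred) must return an average loss over a set of samples."""
    rng = np.random.default_rng() if rng is None else rng
    n, C = Pi.shape
    boot_losses = np.empty(B)
    for b in range(B):
        in_idx = rng.integers(0, n, size=n)            # bootstrap sample (with replacement)
        out_idx = np.setdiff1d(np.arange(n), in_idx)   # samples not selected by this bootstrap
        # css: configuration with minimum loss on the bootstrapped predictions
        best = np.argmin([loss(y[in_idx], Pi[in_idx, c]) for c in range(C)])
        # loss of the selected configuration on the out-samples
        boot_losses[b] = loss(y[out_idx], Pi[out_idx, best])
    lb, ub = np.percentile(boot_losses, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return boot_losses.mean(), (lb, ub)

# Example with the 0-1 loss:
# zero_one = lambda yt, yp: np.mean(yt != yp)
# L_bbc, ci = bbc_cv(Pi, y, zero_one)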
Again, even though predictions are not available for all samples, BBC-CV can be applied to the predictions of any folds that have been employed for tuning.

4.1 Computing Confidence Intervals with the Bootstrap

The idea of bootstrapping the out-of-sample predictions can not only correct for the bias, but can also trivially be applied to provide confidence intervals of the loss. 1 − α (commonly 95%) confidence intervals for a statistic b are provided by the bootstrap procedure by computing the population of bootstrap estimates of the statistic b_1, . . . , b_B and considering an interval [lb, ub] that contains the desired percentage (1 − α) of the population [8]. The parameter 1 − α is called the confidence level of the interval. The simplest approach to compute such intervals is to consider the ordered statistics b_(1), . . . , b_(B), where b_(i) denotes the i-th value of the b's in ascending order, and take the interval [b_(α/2 · B), b_((1 − α/2) · B)], excluding a probability mass of α/2 from each tail. For α = 0.05 and B = 1000 we obtain [lb, ub] = [b_(25), b_(975)]. Other variants are possible and could be applied, although these are outside the scope of this paper. For more theoretical details on bootstrap confidence intervals and different methods for constructing them, as well as a comparison of them, see [8].

4.2 BBC-CV with Repeats

When the sample size is small, the variance of the estimation of the performance is large, even if there is no bias. This is confirmed empirically in [30] on several real datasets. A component of the variance of estimation stems from the specific random partitioning to folds. To reduce this component it is advisable to repeat the estimation protocol multiple times with several fold partitions, leading to the Repeated Cross-Validation protocol and variants.
Applying the BBC-CV method with multiple repeats is possible with the following minimal changes in the implementation: we now consider the matrix Π of the out-of-sample predictions of the models to be three dimensional, with [Π]_ijk
These hypothesesare tested for every θ that is still under consideration at the end of each fold i.e.,as soon as new out-of-sample predictions are accrued for each configuration.To perform the test, the current, pooled, out-of-sample predictions of all con-figurations still under consideration Π are employed to identify the best currentconfiguration θ o = css ( Π, y ). Subsequently, Π ’s rows are bootstrapped to creatematrices Π , . . . , Π B and corresponding label matrices y , . . . , y B . From the pop-ulation of these bootstrapped matrices the probability p θ of a given configuration θ to exhibit a worse performance than θ o is estimated as the percentage of times itsloss is higher than that of θ ’s, i.e., ˆ p θ = B { l ( y b , Π b (: , θ )) > l ( y b , Π b (: , θ o )) , b =1 , . . . , B } . If ˆ p θ > α for some significance threshold (e.g., α = 0 . θ is dropped.A few comments on the procedure above. It is a heuristic procedure mainly withfocus on computational efficiency, not statistical theoretical properties. Ideally, thenull hypothesis to test for each configuration θ would be the hypothesis that θ willbe selected as the best configuration at the end of the CVT procedure, given a finitenumber of folds remain to be considered. If this null hypothesis is rejected for a given θ , θ should be dropped. Each of these hypotheses for a given θ has to be testedin the context of all other configurations that participate in the CVT procedure. Incontrast, the heuristic procedure we provide essentially tests each hypothesis H θ in isolation. For example, it could be the case during bootstrapping, configuration θ exhibits a significant probability of a better loss than θ o (not dropped by ourprocedure), but it could be that in all of these cases, it is always dominated bysome other configuration θ ′ . Thus, the actual probability of being selected as bestin the end maybe smaller than the percentage of times it appears better than θ o .In addition, our procedure does not consider the uncertainty (variance) of theselection of the current best method θ o . Perhaps, a double bootstrap procedurewould be more appropriate in this case [23] but any such improvements wouldhave to also minimize the computational overhead to be worthwhile in practice.5.1 Related workThe idea of accelerating the learning process by specifically eliminating under-performing configurations from a finite set, early within the cross-validation pro-cedure, was introduced as early as 1994 by Maron and Moore with HoeffdingRaces [22]. At each iteration of leave-one-out CV (i.e. after the evaluation of a newtest point) the algorithm employs the Hoeffding inequality for the construction ofconfidence intervals around the current error rate estimate of each configuration.Configurations whose intervals do not overlap with those of the best-performingone, are eliminated (dropped) from further consideration. The procedure is re-peated until the confidence intervals have shrunk enough so that a definite overallbest configuration can be identified. However, many test point evaluations may berequired before a configuration can clearly be declared the winner.Following a similar approach, Zheng and Bilenko in 2013 [34] applied the con-cept of early elimination of suboptimal configurations to K-fold CV. They improveon the method by Maron and Moore by incorporating paired hypothesis tests forthe comparison of configurations for both discrete and continuous hyper-parameterspaces. 
At each iteration of CV, all current configurations are tested pairwise and those which are inferior are dropped. Then, power analysis is used to determine the number of new fold evaluations for each remaining configuration given an acceptable false negative rate.
Krueger et al. [20] in 2015 introduced the so-called Fast Cross-Validation via Sequential Testing (CVST), which uses nonparametric testing together with sequential analysis in order to choose the best performing configuration on the basis of linearly increasing subsets of data. At each step, the Friedman test [11] or Cochran's Q test [4] (for regression and classification tasks, respectively) are employed in order to detect statistically significant differences between the configurations' performances. Then, the seemingly under-performing configurations are further tested through sequential analysis to determine which of them will be discharged. Finally, an early stopping criterion is employed to further speed up the CV process. The winning configuration is the one that has the best average ranking, based on performance, in the last few iterations, specified in advance. The disadvantage of CVST is that it initially operates on smaller subsets, thus risking the early elimination of good models when the original dataset is already small.
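Returning to the bootstrap-based dropping test described in Section 5, here is a minimal sketch (our own illustration; Pi holds the predictions pooled over the folds completed so far, restricted to the configurations still under consideration):

import numpy as np

def drop_inferior(Pi, y, loss, B=1000, threshold=0.99, rng=None):
    """Return a boolean mask of configurations to KEEP after the bootstrap test.
    A configuration is dropped if, in more than `threshold` of the bootstraps,
    its loss is higher than the loss of the currently best configuration."""
    rng = np.random.default_rng() if rng is None else rng
    n, C = Pi.shape
    # Current best configuration on the pooled predictions (css)
    best = np.argmin([loss(y, Pi[:, c]) for c in range(C)])
    worse_counts = np.zeros(C)
    for _ in range(B):
        idx = rng.integers(0, n, size=n)       # bootstrap the sample indexes
        losses = np.array([loss(y[idx], Pi[idx, c]) for c in range(C)])
        worse_counts += losses > losses[best]
    p_hat = worse_counts / B                   # estimated P(theta worse than theta_o)
    return p_hat <= threshold                  # keep the configurations not dropped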
In none of the methods above is the bias of the performance estimate of a partially completed CV examined. Our approach, BCED-CV, utilizes the bootstrap correction protocol (BBC-CV) and provides an almost unbiased estimate of the performance of the returned model. In comparison to the statistical tests used in [34] and [20], the bootstrap is a general test, applicable to any type of learning task and measure of performance, and is suitable even for relatively small sample sizes. Finally, BCED-CV requires that only the value of the significance threshold α is pre-specified, while the methods in [34] and [20] have a number of hyper-parameters to be specified in advance.

We empirically evaluate the efficiency and investigate the properties of BBC-CV and BCED-CV on both controlled settings and real problems. In particular, we focus on the bias of the performance estimates of the protocols, and on computational time. We compare the results to those of three standard approaches: CVT, TT and NCV. We also examine the tuning (configuration selection) properties of BBC-CV, BCED-CV and BBC-CV with repeats, as well as the confidence intervals that these methods construct.

6.1 Simulation studies

Extensive simulation studies were conducted in order to validate BBC-CV and BCED-CV, and assess their performance. We focus on the binary classification task and use classification accuracy as the measure of performance. We examine multiple settings for varying sample size N ∈ {20, 40, 60, 80, 100, 500, 1000}, number of candidate configurations C ∈ {50, 100, 200, 300, 500, 1000, 2000}, and true performances P of the candidate configurations drawn from different Beta distributions Be(a, b) with (a, b) ∈ {(9, 6), (14, 6), (24, 6), (54, 6)}, corresponding to mean values of µ ∈ {0.6, 0.7, 0.8, 0.9}. Each setting produces a matrix of out-of-sample predictions Π as follows. First, a true performance value P_j, j = 1, . . . , C, sampled from the same Beta distribution, is assigned to each configuration c_j. Then, the sample predictions for each c_j are produced as Π_ij = 1(r_i < P_j), i = 1, . . . , N, where the r_i are random numbers sampled uniformly from (0, 1) and 1(condition) denotes the unit (indicator) function.
Then, the BBC-CV, BCED-CV, CVT, TT, and NCV protocols for tuning and performance assessment of the returned model are applied. We set the number of bootstraps to B = 1000 for the BBC-CV method, and for BCED-CV we set B = 1000 and the dropping threshold to α = 0.99.
We applied the same split of the data into K = 10 folds for all the protocols. Consequently, all of them, with the possible exception of BCED-CV, select and return the same predictive model with different estimations of its performance. The internal cross-validation loop of the NCV uses K = 9 folds. The whole procedure was repeated 500 times for each setting, leading to a total of 98,000 generated matrices of predictions, on which the protocols were applied. The results presented are the averages over the 500 repetitions.
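A minimal sketch of how such simulated prediction matrices can be generated (our own illustration; the Be(9, 6) default corresponds to the µ = 0.6 setting, and zero_one and bbc_cv refer to the helpers sketched earlier):

import numpy as np

def simulate_predictions(n, C, a=9.0, b=6.0, rng=None):
    """Generate an n x C matrix of correct/incorrect out-of-sample predictions:
    configuration j predicts sample i correctly iff r_i < P_j."""
    rng = np.random.default_rng() if rng is None else rng
    P = rng.beta(a, b, size=C)           # true accuracies of the C configurations
    r = rng.uniform(size=n)
    correct = (r[:, None] < P[None, :])  # n x C indicator matrix, Pi_ij = 1(r_i < P_j)
    return correct.astype(int), P

# Example: bias of the uncorrected CVT estimate vs. the BBC-CV estimate
# Pi, P = simulate_predictions(60, 500)
# y = np.ones(len(Pi))                   # with this encoding, "prediction == 1" means correct
# zero_one = lambda yt, yp: np.mean(yt != yp)
# best = np.argmin([zero_one(y, Pi[:, c]) for c in range(Pi.shape[1])])
# print("CVT accuracy estimate:", 1 - zero_one(y, Pi[:, best]), "true accuracy:", P[best])
# print("BBC-CV loss estimate:", bbc_cv(Pi, y, zero_one)[0])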
The bias of the estimation is computed as Bias = P̂ − P, where P̂ and P denote the estimated and the true performance of the selected configuration, respectively. A positive bias indicates a lower true performance than the one estimated by the corresponding performance estimation protocol and implies that the protocol is optimistic (i.e., overestimates the performance), whereas a negative bias indicates that the estimated performance is conservative. Ideally, the estimated bias should be 0, although a conservative estimate is also acceptable in practice.
Figure 1 shows the average estimated bias (over 500 repetitions) of the protocols under comparison, for models with average true classification accuracy µ = 0.6. Each panel corresponds to a different protocol (specified in the title) and shows the bias of its performance estimate relative to the sample size (horizontal axis) and the number of configurations tested (different plotted lines). We omit results for the rest of the tested values of µ as they are similar.
The CVT estimate of performance is optimistically biased in all settings, with the bias being as high as 0.17 points of classification accuracy.
We notice that the smaller the sample size, the more CVT overestimates the performance of the final model. However, as sample size increases, the bias of CVT tends to 0. Finally, we note that the bias of the estimate also grows as the number of models under comparison becomes greater, although the effect is relatively small in this experiment.
The behaviour of TT varies greatly for small sample sizes (N < 500), under-correcting in some settings and over-correcting in others; for larger sample sizes (N ≥ 500) its estimate is conservative.
BBC-CV provides conservative estimates, having low bias which quickly tends to zero as sample size increases. Compared to TT, it is better fitting for small sample sizes and produces more accurate estimates overall. In comparison to NCV, BBC-CV is somewhat more conservative, with a difference in the bias of 0.013 points of accuracy on average and 0.034 in the worst case (for N = 20). However, we believe that the much lower computational cost (one order of magnitude) of BBC-CV compensates for its conservatism. BCED-CV displays similar behaviour to BBC-CV, having lower bias which approaches zero faster. It is on par with NCV, having 0.005 points of accuracy more bias on average, and 0.018 in the worst case. As we show later on, BCED-CV is up to one order of magnitude faster than CVT, and consequently two orders of magnitude faster than NCV.
In summary, the proposed BBC-CV and BCED-CV methods produce almost unbiased performance estimates, and perform only slightly worse in small sample settings than the computationally expensive NCV. As expected, CVT is overly optimistic, and thus should not be used for performance estimation purposes. Finally, the use of TT is discouraged, as (a) its performance estimate varies a lot for different sample sizes and numbers of configurations, and (b) it overestimates performance for small sample sizes, which are the cases where bias correction is needed the most.
Fig. 1: Average estimated bias (over 500 repeats) of the CVT, TT, NCV, BBC-CV and BCED-CV estimates of performance for an average true classification accuracy of 60%. CVT over-estimates performance in all settings. TT's behaviour varies for sample sizes N < 500 and it is conservative for N ≥ 500. BBC-CV is conservative and close to NCV, differing by 0.013 points of accuracy on average. BCED-CV is on par with NCV.
Table 1: Datasets' characteristics. pr/nr denotes the ratio of positive to negative examples in a dataset. |D_pool| refers to the portion of the dataset (30%) from which the sub-datasets were sampled and |D_holdout| to the portion (70%) on which the true performance of a model is estimated.

Name        #samples  #features  pr/nr  |D_pool|  |D_holdout|  Source
christine   5418      1636       1      1625      3793         [14]
jasmine     2984      144        1      895       2089         [14]
philippine  5832      308        1      1749      4082         [14]
madeline    3140      259        1.01   942       2198         [14]
sylvine     5124      20         1      1537      3587         [14]
gisette     7000      5000       1      2100      4900         [15]
madelon     2600      500        1      781       1819         [15]
dexter      600       20000      1      180       420          [15]
gina        3468      970        1.03   1041      2427         [13]

6.2 Real Datasets

After examining the behaviour of BBC-CV and BCED-CV on controlled settings, we investigated their performance on real datasets. Again we focus on the binary classification task, but now we use the AUC as the metric of performance. All of the datasets utilized for the experiments come from popular data science challenges (NIPS 2003 [15], WCCI 2006 [13], ChaLearn AutoML [14]). Table 1 summarizes their characteristics. The domains of application of the ChaLearn AutoML challenge's datasets are not known; however, the organizers claim that they are diverse and were chosen to span different scientific and industrial fields. gisette [15] and gina [13] are handwritten digit recognition problems, dexter [15] is a text classification problem, and madelon [15] is an artificially constructed dataset characterized by having no single feature that is informative by itself.

The experimental set-up is similar to the one used by Tsamardinos et al. [30]. Each original dataset D was split into two stratified subsets: D_pool, which consisted of 30% of the total samples in D, and D_holdout, which consisted of the remaining 70% of the samples. For each original dataset, with the exception of dexter, D_pool was used to sample (without replacement) 20 sub-datasets for each sample size N ∈ {20, 40, 60, 80, 100, 500}. For the dexter dataset we sampled 20 sub-datasets for each N ∈ {20, 40, 60, 80, 100}, as its D_pool contains fewer than 500 samples. We thus created a total of 8 × 6 × 20 + 5 × 20 = 1060 sub-datasets. D_holdout was used to estimate the true performance of the final, selected model of each of the protocols tested.

The set Θ (i.e., the search grid) explored consists of 610 configurations. These resulted from various combinations of preprocessing, feature selection, and learning methods and different values for their hyper-parameters. The preprocessing methods included imputation, binarization (of categorical variables) and standardization (of continuous variables) and were used when they could be applied.
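The 30%/70% pool/hold-out split and the repeated stratified sub-sampling described above can be sketched as follows (a minimal illustration with names of our choosing, not the authors' code):

import numpy as np

def stratified_indices(y, fraction, rng):
    """Indices of a stratified subset containing roughly `fraction` of each class."""
    picked = []
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)
        rng.shuffle(idx)
        picked.append(idx[: max(1, int(round(fraction * len(idx))))])
    return np.concatenate(picked)

def make_sub_datasets(y, sizes=(20, 40, 60, 80, 100, 500), n_subsets=20, seed=1):
    """Split into D_pool (30%) / D_holdout (70%) and draw n_subsets stratified
    sub-datasets of each size in `sizes` from D_pool, without replacement."""
    rng = np.random.default_rng(seed)
    pool = stratified_indices(y, 0.30, rng)              # D_pool: ~30% of the samples
    holdout = np.setdiff1d(np.arange(len(y)), pool)      # D_holdout: remaining ~70%
    subs = {N: [pool[stratified_indices(y[pool], N / len(pool), rng)]
                for _ in range(n_subsets)]
            for N in sizes if N <= len(pool)}            # e.g. dexter's pool is too small for N = 500
    return pool, holdout, subs

Each sub-dataset is then cross-validated with every configuration in Θ, and the model returned by each protocol is scored on D_holdout to obtain its true performance.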
Fig. 2: Average estimated bias (over 20 sub-datasets for each original dataset) of the CVT, TT, NCV, BBC-CV and BCED-CV estimates of performance. CVT is optimistically biased for sample sizes N ≤ 100, while TT over-corrects the bias for N ≥ 80. NCV and BBC-CV both have low bias, though results vary with dataset. BCED-CV has, on average, greater bias than BBC-CV for N ≤ 100 and identical bias for N = 500.

For feature selection we used the SES algorithm [21] with two values for the significance level alpha and two values for the maximum number of selected variables k, and we also examined the case of no feature selection (i.e., a total of 5 feature selection choices). The learning algorithms utilized were Random Forests [3], SVMs [5], and LASSO [28]. For Random Forests we set numTrees = 1000 and tried three values of minLeafSize and four values of numVarToSample (multiples of √numVar, where numVar is the number of variables of the dataset). We tested SVMs with linear, polynomial and radial basis function (RBF) kernels. For their hyper-parameters we examined, wherever applicable, all combinations of two values of degree, five values of gamma and five values of cost. Finally, LASSO was tested with three values of alpha and 10 different values for lambda, which were created independently for each dataset using the glmnet library [10].

We performed tuning and performance estimation of the final model using CVT, TT, NCV, BBC-CV, BCED-CV, and BBC-CV with 10 repeats (denoted BBC-CV10) for each of the 1060 created sub-datasets, leading to more than 135 million trained models. We set B = 1000 for the BBC-CV method, B = 1000 and a = 0.99 for BCED-CV, and K = 10 stratified folds for all the protocols. The inner cross-validation loop of NCV uses K = 9 folds. It is important to remind at this point that BBC-CV selects the best configuration by estimating the performance of the models on the pooled out-of-sample predictions from all folds. For metrics such as the AUC it is possible that this approach selects a different configuration from the one that the conventional CVT procedure selects (i.e., the configuration with the maximum/minimum average performance/loss over all folds). In anecdotal experiments, we compared the two approaches in terms of the true performance of the models that they return, and found that they perform similarly. In the sections that follow we present the results of CVT with pooling of the out-of-sample predictions. For each protocol, original dataset D, and sample size N, the results are averaged over the 20 randomly sampled sub-datasets.

The bias of estimation is computed as in the simulation studies, i.e., Bias = P̂ − P, where P̂ and P denote the estimated and the true performance of the selected configuration, respectively. In Figure 2 we examine the average bias of the CVT, TT, NCV, BBC-CV, and BCED-CV estimates of performance, on all datasets, relative to sample size. We notice that the results are in agreement with those of the simulation studies. In particular, CVT is optimistically biased for sample sizes N ≤ 100 and its bias tends to zero as N increases. TT over-estimates performance for N = 20, its bias varies with dataset for N = 40, and it over-corrects the bias of CVT for larger sample sizes, with the exception of the madeline dataset (for N = 40) and the madelon dataset (for several of the larger sample sizes). NCV is slightly optimistic for the dexter and madeline datasets for N = 40, with a bias of 0.033 and 0.031 points of AUC respectively. BCED-CV has, on average, greater bias than BBC-CV for N ≤ 100; for N = 500, its bias shrinks and becomes identical to that of BBC-CV and NCV.
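The pooling-based selection and the bootstrap correction of its optimism can be sketched as follows (our reading of the procedure: in every bootstrap iteration the rows of the pooled out-of-sample prediction matrix are resampled, the best configuration is selected on the resampled rows, and it is scored on the out-of-bag rows; the authors' reference implementation may differ in details):

import numpy as np

def bbc_estimate(preds, y, metric, B=1000, rng=None):
    """preds: (N, C) pooled out-of-sample predictions, one column per configuration.
    y: (N,) true labels.  metric: callable(y_true, y_pred) -> higher-is-better score."""
    rng = rng or np.random.default_rng(0)
    N, C = preds.shape
    scores = []
    for _ in range(B):
        boot = rng.integers(0, N, size=N)                # bootstrap sample of row indices
        oob = np.setdiff1d(np.arange(N), boot)           # out-of-bag rows
        if oob.size == 0:
            continue
        in_scores = [metric(y[boot], preds[boot, j]) for j in range(C)]
        best = int(np.argmax(in_scores))                 # configuration selected on the bootstrap sample
        scores.append(metric(y[oob], preds[oob, best]))  # scored on rows not used for the selection
    return float(np.mean(scores)), scores

accuracy = lambda y_true, y_pred: float(np.mean(y_true == y_pred))  # example metric

Because selection and evaluation never use the same rows within an iteration, the averaged out-of-bag score corrects for the selection optimism that CVT exhibits.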
Fig. 3: Relative average true performance of the models returned by BCED-CV and CVT. For N ≤ 100 the loss in performance varies greatly with dataset; for N = 500, however, there is negligible to no loss in performance. If N is fairly large, BCED-CV will accelerate the CVT procedure without sacrificing the quality of the resulting model or the accuracy of its performance estimate.

We have shown that for large sample sizes (N = 500) BCED-CV provides accurate estimates of the performance of the model it returns, comparable to those of BBC-CV and NCV. How well does this model perform, though? In this section, we evaluate the effectiveness of BCED-CV in terms of its tuning (configuration selection) properties, and its efficiency in reducing the computational cost of CVT.

Figure 3 shows the relative average true performance of the models returned by the BCED-CV and CVT protocols, plotted against sample size. We remind here that for each of the 20 sub-datasets of sample size N ∈ {20, 40, 60, 80, 100, 500} sampled from D_pool, the true performance of the returned model is estimated on the D_holdout set. We notice that for N ≤ 100 the loss in performance varies greatly with dataset and is quite significant, up to 9.05% in the worst case (dexter dataset, N = 40). For N = 500, however, there is negligible to no loss in performance. Specifically, for the sylvine, philippine, madeline, christine and gina datasets there is no loss in performance when applying BCED-CV, while there is a loss of 0.44% and of less than 1% for the gisette and jasmine datasets, respectively. madelon exhibits the highest average loss (between 1% and 2%).
Fig. 4: The speed-up of BCED-CV over CVT for sample size N = 500, computed as the ratio of the number of models trained by CVT over that trained by BCED-CV. Typically, BCED-CV achieves a speed-up of 2-5, and up to 10 for the gisette dataset. Overall, using BCED-CV results in a significant speed boost, without sacrificing model quality or performance estimation.

Since the loss in performance can be substantial for small sample sizes (N ≤ 100), while for N = 500 BCED-CV incurs almost no loss in performance, we recommend a minimum of 50 out-of-sample predictions before starting to drop configurations, although a smaller number may suffice. For example, with N = 100, this would mean that dropping starts after the fifth iteration (fold). Finally, we note that dropping is mostly useful with larger sample sizes (i.e., for computationally costly scenarios), which are also the cases where BCED-CV is on par with BBC-CV and NCV in terms of tuning and performance estimation.

Next, we compare the computational cost of BCED-CV to CVT, in terms of the total number of models trained. The results for N = 500 are shown in Figure 4. We focused only on the N = 500 case, as it is the only case where both protocols produce models of comparable performance. We observe that a speed-up of 2 to 5 is typically achieved by BCED-CV. For the gisette dataset, the speed-up is very close to the theoretical maximum of this experimental setup. Overall, if the sample size is sufficiently large, using dropping is recommended to speed up CVT without a loss of performance.
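As a rough illustration (our reading, not necessarily the authors' exact test) of how the bootstrap can decide which configurations to drop after each fold: using only the out-of-sample predictions collected so far, a configuration is dropped when it matches or beats the currently best-looking configuration in fewer than a fraction 1 − a of B bootstrap resamples.

import numpy as np

def configurations_to_drop(preds, y, metric, active, a=0.99, B=1000, rng=None):
    """preds: (n, C) predictions on the n out-of-sample instances seen so far.
    active: boolean array marking configurations still being trained."""
    rng = rng or np.random.default_rng(0)
    n = preds.shape[0]
    cols = np.flatnonzero(active)
    pooled = [metric(y, preds[:, j]) for j in cols]
    best = cols[int(np.argmax(pooled))]            # currently best-looking configuration
    wins = np.zeros(preds.shape[1])
    for _ in range(B):
        boot = rng.integers(0, n, size=n)
        best_score = metric(y[boot], preds[boot, best])
        for j in cols:
            if metric(y[boot], preds[boot, j]) >= best_score:
                wins[j] += 1
    drop = active & (wins / B < 1 - a)             # drop with threshold a (0.99 in our experiments)
    drop[best] = False                             # never drop the current best
    return drop

With a = 0.99 and, as recommended above, at least 50 accumulated out-of-sample predictions, dropping only starts once the bootstrap test has a reasonable number of predictions to work with.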
Fig. 5: Relative average true performance of BBC-CV10 to BBC-CV (left), and of BBC-CV10 to NCV (right). Multiple repeats increase the performance of the returned models while maintaining the accuracy of the performance estimation. If computational time is not a limitation, it is preferable to use BBC-CV10 over NCV.

We repeated the previous experiments, running BBC-CV with 10 repeats (called BBC-CV10 hereafter). First, we compare the true performance of the models returned by BBC-CV and BBC-CV10, as well as the bias of the estimation. Ideally, using multiple repeats should result in a better-performing model, as the variance of the performance estimation (used by CVT for tuning) due to a specific choice of split of the data is reduced when multiple splits are considered. This comes at a cost of increased computational overhead, which in the case of 10 repeats is similar to that of the NCV protocol. To determine which of the approaches is preferable, we also compare the performance of the final models produced by BBC-CV10 and NCV.

Figure 5 (left) shows the relative average true performance of BBC-CV10 to BBC-CV with increasing sample size N. We notice that for N = 20 the results vary with dataset; however, for N ≥ 40, BBC-CV10 systematically returns an equally well or (in most cases) better performing model than the one that BBC-CV returns. In terms of the bias of the performance estimates of the two methods, we have found them to be similar. Similarly, Figure 5 (right) shows the comparison between BBC-CV10 and NCV. We see again that for sample size N = 20 the relative average true performance of the returned models varies with dataset. BBC-CV10 outperforms NCV for N ≥ 40, with the exception of the philippine and jasmine datasets, for which results vary with sample size. Thus, if computational time is not a limiting factor, it is still beneficial to use BBC-CV with multiple repeats instead of NCV.

To summarize, we have shown that using multiple repeats increases the quality of the resulting models while maintaining the accuracy of the performance estimation. We note that the number 10 was chosen mainly to compare BBC-CV10 to NCV on equal grounds (same number of trained models). If time permits, we recommend using as many repeats as possible, especially for low sample sizes. For larger sample sizes, usually one or a few repeats suffice.

The bootstrap-based estimation of performance allows for easy computation of confidence intervals (CIs), as described in Section 4.1. We investigated the accuracy of the CIs produced by the proposed BBC-CV, BCED-CV and BBC-CV10 protocols. For this, we computed the coverage of the CIs estimated by the protocols at a range of nominal coverage levels, defined as the ratio of the computed CIs that contain the corresponding true performances of the produced models. For a given sample size, the coverage of a CI was computed over all 20 sub-datasets and 9 datasets. To further examine the effect of multiple repeats on CIs, we computed their average width (over all 20 sub-datasets) for each dataset and different numbers of repeats (1 to 10).

Fig. 6: Coverage of the CIs (over a range of nominal coverage levels) returned by BBC-CV, BCED-CV, and BBC-CV10, defined as the ratio of the estimated CIs that contain the corresponding true performances of the produced models. The CIs are mainly conservative and become more accurate with increasing sample size and multiple repeats.

Figure 6 shows the estimated coverage of the CIs, constructed with the percentile method, relative to the expected coverage for the BBC-CV, BCED-CV, and BBC-CV10 protocols. We present results for sample sizes N = 20 (left), N = 100 (middle), and N = 500 (right). Figure 7 shows, for the same values of N and for each dataset, the average width of the CIs with increasing number of repeats.

We notice that for N = 20 the CIs produced by BBC-CV are conservative, that is, they are wider than they ought to be. As sample size increases (N ≥ 100), the CIs become more accurate, and the use of multiple repeats greatly shrinks their width and improves their calibration (i.e., their true coverage is closer to the expected one). The same holds when using dropping of under-performing configurations (BCED-CV). For N = 500 the intervals appear not to be conservative. After closer inspection, we saw that this is caused by two datasets (madeline and jasmine) for which the majority of the true performances are higher than the upper bound of the CI. We note that those datasets are the ones with the highest negative bias (see Figure 2 for N = 500), which implicitly causes the CIs to also be biased downwards, thus failing to capture performance estimates above the CI limits.
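For the CIs, the per-iteration out-of-bag scores (the `scores` list in the earlier sketch) are kept rather than averaged; the percentile intervals and the coverage check described above then look roughly as follows (a minimal sketch under those assumptions):

import numpy as np

def percentile_ci(boot_scores, coverage=0.95):
    """Two-sided percentile CI from the bootstrap performance values."""
    alpha = (1.0 - coverage) / 2.0
    lo, hi = np.quantile(boot_scores, [alpha, 1.0 - alpha])
    return lo, hi

def estimated_coverage(boot_scores_per_run, true_performances, coverage=0.95):
    """Fraction of experiments whose percentile CI contains the true performance
    of the selected model (measured on the hold-out set)."""
    hits = []
    for scores, true_perf in zip(boot_scores_per_run, true_performances):
        lo, hi = percentile_ci(scores, coverage)
        hits.append(lo <= true_perf <= hi)
    return float(np.mean(hits))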
Fig. 7: Average width (over all 20 sub-datasets) of the CIs returned by BBC-CV, BCED-CV, and BBC-CV with multiple repeats, for each dataset, with increasing number of repeats (95% CIs; panels for N = 20, 100, and 500). CIs shrink with increasing sample size and number of repeats.

In conclusion, the proposed BBC-CV method provides mainly conservative CIs of the true performance of the returned models, which become more accurate with increasing sample size. The use of multiple repeats improves the calibration of the CIs and shrinks their width for small sample sizes (less than 100). The use of 3-4 repeats seems to suffice, and further repeats provide small added value in CI estimation.

Pooling together the out-of-sample predictions during cross-validation of multiple configurations (i.e., combinations of algorithms and their hyper-parameter values that lead to a model) and employing bootstrapping techniques on them solves in a simple and general way three long-standing, important data analysis tasks: (a) removing the optimism of the performance estimation of the selected configuration, (b) estimating confidence intervals of performance, and (c) dropping from further consideration inferior configurations. While other methods have also been proposed, they lack the simplicity and the generality of applicability to all types of performance metrics. The ideas above are implemented in the method BBC-CV, tackling points (a) and (b), and BCED-CV, which also includes (c).

Simulation studies and experiments on real datasets show empirically that BBC-CV and BCED-CV outperform the alternatives (nested Cross-Validation and the TT method) by either providing more accurate, almost unbiased, conservative estimates of performance even for smaller sample sizes and/or by having a much lower computational cost (speed-up of up to 10). We examined the effect of repeatedly applying our methods on multiple fold partitions of the data, and found that we acquire better results in terms of tuning (i.e., better-performing configurations are selected) compared to a single application of BBC-CV and to NCV. Finally, in our experiments, the confidence intervals produced by bootstrapping are shown to be mainly conservative, improving with increasing sample size and multiple repeats. Future work includes a thorough evaluation of the methods on different types of learning tasks, such as regression and survival analysis (however, preliminary results have shown that they are equivalently efficient and effective).

For a practitioner, based on the results of our methods, we offer the following suggestions: first, forgo the use of the computationally expensive nested cross-validation. Instead, we suggest the use of BBC-CV for small sample sizes (e.g., fewer than 100 samples). BCED-CV could also be used in these cases to reduce the number of trained models (the savings may be negligible for such small sample sizes), but it may select a slightly sub-optimal configuration. For larger sample sizes, we advocate the use of BCED-CV, which is computationally more efficient and maintains all the benefits of BBC-CV.
We also suggest using as many repeats with different partitions into folds as computational time allows, particularly for small sample sizes, as repeats reduce the widths of the confidence intervals and lead to a better selection of the optimal configuration.
Acknowledgements
IT, MT and GB received funding from the European Research Council under the European Union's Seventh Framework Programme (FP/2007-2013) / ERC Grant Agreement n. 617393. EG received funding from the Toshiba project: "Feasibility study towards the Next Generation of statistical Text to Speech Synthesis System". We would like to thank Pavlos Charoniktakis for constructive feedback.
References
1. Akaike, H.: A new look at the statistical model identification. IEEE Transactions on Automatic Control (6), 716-723 (1974)
2. Bernau, C., Augustin, T., Boulesteix, A.L.: Correcting the optimal resampling-based error rate by estimating the error rate of wrapper algorithms. Biometrics (3), 693-702 (2013)
3. Breiman, L.: Random forests. Machine Learning (1), 5-32 (2001)
4. Cochran, W.G.: The comparison of percentages in matched samples. Biometrika (3/4), 256-266 (1950)
5. Cortes, C., Vapnik, V.: Support-vector networks. Machine Learning (3), 273-297 (1995)
6. Davison, A.C., Hinkley, D.V.: Bootstrap Methods and their Application. Cambridge University Press (1997)
7. Ding, Y., Tang, S., Liao, S.G., Jia, J., Oesterreich, S., Lin, Y., Tseng, G.C.: Bias correction for selecting the minimal-error classifier from many machine learning models. Bioinformatics (22), 3152-3158 (2014)
8. Efron, B., Tibshirani, R.J.: An Introduction to the Bootstrap. CRC Press (1993)
9. Fawcett, T.: An introduction to ROC analysis. Pattern Recognition Letters (8), 861-874 (2006)
10. Friedman, J., Hastie, T., Tibshirani, R.: Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software (1), 1-22 (2010)
11. Friedman, M.: The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association (200), 675-701 (1937)
12. Garnett, R., Osborne, M.A., Roberts, S.J.: Bayesian optimization for sensor set selection. In: Proceedings of the 9th ACM/IEEE International Conference on Information Processing in Sensor Networks, pp. 209-219 (2010)
13. Guyon, I., Alamdari, A.R.S.A., Dror, G., Buhmann, J.M.: Performance prediction challenge. In: The 2006 IEEE International Joint Conference on Neural Network Proceedings, pp. 1649-1656. IEEE (2006)
14. Guyon, I., Bennett, K., Cawley, G., Escalante, H.J., Escalera, S., Ho, T.K., Macià, N., Ray, B., Saeed, M., Statnikov, A., Viegas, E.: Design of the 2015 ChaLearn AutoML challenge. In: Proc. of IJCNN (2015)
15. Guyon, I., Gunn, S., Ben-Hur, A., Dror, G.: Result analysis of the NIPS 2003 feature selection challenge. In: Advances in Neural Information Processing Systems, pp. 545-552 (2004)
16. Harrell, F.E., Califf, R.M., Pryor, D.B., Lee, K.L., Rosati, R.A.: Evaluating the yield of medical tests. Journal of the American Medical Association (18), 2543-2546 (1982)
17. Jensen, D.D., Cohen, P.R.: Multiple comparisons in induction algorithms. Machine Learning (3), 309-338 (2000)
18. Jensen, J.L.W.V.: Sur les fonctions convexes et les inégalités entre les valeurs moyennes. Acta Mathematica (1), 175-193 (1906)
19. Kohavi, R., et al.: A study of cross-validation and bootstrap for accuracy estimation and model selection. In: IJCAI, vol. 14, pp. 1137-1145 (1995)
20. Krueger, T., Panknin, D., Braun, M.: Fast cross-validation via sequential testing. Journal of Machine Learning Research, 1103-1155 (2015)
21. Lagani, V., Athineou, G., Farcomeni, A., Tsagris, M., Tsamardinos, I.: Feature selection with the R package MXM: Discovering statistically-equivalent feature subsets. Journal of Statistical Software, to appear (2017)
22. Maron, O., Moore, A.W.: Hoeffding races: Accelerating model selection search for classification and function approximation. Advances in Neural Information Processing Systems, pp. 59-59 (1994)
23. Nankervis, J.C.: Computational algorithms for double bootstrap confidence intervals. Computational Statistics & Data Analysis (2), 461-475 (2005)
24. Salzberg, S.L.: On comparing classifiers: Pitfalls to avoid and a recommended approach. Data Mining and Knowledge Discovery (3), 317-328 (1997)
25. Schwarz, G., et al.: Estimating the dimension of a model. The Annals of Statistics (2), 461-464 (1978)
26. Snoek, J., Larochelle, H., Adams, R.P.: Practical Bayesian optimization of machine learning algorithms. In: Advances in Neural Information Processing Systems, pp. 2951-2959 (2012)
27. Statnikov, A., Aliferis, C.F., Tsamardinos, I., Hardin, D., Levy, S.: A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics (5), 631-643 (2005)
28. Tibshirani, R.: Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pp. 267-288 (1996)
29. Tibshirani, R.J., Tibshirani, R.: A bias correction for the minimum error rate in cross-validation. The Annals of Applied Statistics, pp. 822-829 (2009)
30. Tsamardinos, I., Rakhshani, A., Lagani, V.: Performance-estimation properties of cross-validation-based protocols with simultaneous hyper-parameter optimization. In: Artificial Intelligence: Methods and Applications, pp. 1-14. Springer (2014)
31. Varma, S., Simon, R.: Bias in error estimation when using cross-validation for model selection. BMC Bioinformatics 7 (2006)