Estimating the Maximum Expected Value: An Analysis of (Nested) Cross Validation and the Maximum Sample Average
Hado van Hasselt ([email protected])

March 4, 2013
Abstract
We investigate the accuracy of the two most common estimators for the maximum expected value of a general set of random variables: a generalization of the maximum sample average, and cross validation. No unbiased estimator exists and we show that it is non-trivial to select a good estimator without knowledge about the distributions of the random variables. We investigate and bound the bias and variance of the aforementioned estimators and prove consistency. The variance of cross validation can be significantly reduced, but not without risking a large bias. The bias and variance of different variants of cross validation are shown to be very problem-dependent, and a wrong choice can lead to very inaccurate estimates.
Introduction

We often need to estimate the maximum expected value of a set of random variables (RVs), when only noisy estimates for each of the variables are given. For instance, this problem arises in optimization in stochastic decision processes and in algorithmic evaluation. Formally, we consider a finite set $V = \{V_1, \ldots, V_M\}$ of $M \geq 2$ RVs with means $\mu_1, \ldots, \mu_M$ and variances $\sigma^2_1, \ldots, \sigma^2_M$. We want to find the value of $\mu_*(V)$, defined by

$$\mu_*(V) \equiv \max_i \mu_i \equiv \max_i \mathrm{E}\{V_i\}. \tag{1}$$

We assume the distribution of each $V_i$ is unknown, but a set of noisy samples $X$ is given. The question is how best to use the samples to construct an estimate $\hat\mu_*(X) \approx \mu_*(V)$. We write $\mu_*$ and $\hat\mu_*$ when $V$ and $X$ are clear from the context. Without loss of generality, we assume that we want to maximize rather than minimize.
It is easy to construct consistent estimators, but we are also interested in the quality for small sample sizes. The mean squared error (MSE) is the most common metric for the quality of an estimator, but sometimes (the sign of) the bias is more important. Unfortunately, as we discuss in Section 2, no unbiased estimators for $\mu_*$ exist.

A common estimator is the maximum estimator (ME), which constructs estimates $\hat\mu_i \approx \mu_i$ and then uses $\hat\mu_* \equiv \max_i \hat\mu_i$. When $X_i \subset X$ contains direct samples for $V_i$, and $\hat\mu_i$ is the average of $X_i$, the ME is simply the maximum sample average. The ME on average overestimates $\mu_*$. This bias has been rediscovered several times, for instance in economics Van den Steen [2004] and decision making Smith and Winkler [2006]. It can lead to overestimation of the performance of algorithms Varma and Simon [2006], Cawley and Talbot [2010], and poor policies in reinforcement learning algorithms van Hasselt [2011a,b]. It is related to over-fitting in model selection, selection bias in sample selection Heckman [1979] and feature selection Ambroise and McLachlan [2002], and the winner's curse in auctions Capen et al. [1971].

The most common alternative to avoid this bias is cross validation (CV) Larson [1931], Mosteller and Tukey [1968]. If CV is used to construct each $\hat\mu_i$, and thereafter to estimate $\mu_*$ (as described in Section 3.2), this is called nested CV or "double cross" Stone [1974]. Unfortunately, (nested) CV can lead to a large variance. Perhaps surprisingly, we show the absolute bias of CV can be larger than the bias of the ME that we are trying to prevent. However, the bias of CV is provably negative, which can be an advantage.

In this paper, we give general distribution-independent bounds for the bias and variance of the ME and CV. We present a new variant of CV and show that it is very dependent on $V$ which CV estimator is most accurate in terms of MSE. Therefore, it is non-trivial to construct accurate CV estimators without some knowledge about the distributions of $V$. We discuss why standard 10-folds CV is often not a bad choice for model selection, but show that in other settings other estimators may be much more accurate.

We now discuss two motivating examples to highlight the practical importance of this general topic.
Learning Algorithms

Many learning algorithms explicitly maximize noisy values to update their internal parameters. For instance, in reinforcement learning Sutton and Barto [1998] the goal is to find strategies that maximize a reward signal in a (sequential) decision task. Inaccurate biased estimators for $\mu_*$ can have adverse effects on the speed of learning and the strategies that are learned van Hasselt [2011a].

Evaluation of Algorithms
Most machine-learning algorithms have tunable parameters.
Internal parameters, such as the Lagrangian multipliers of a support vector machine (SVM) Vapnik [1995], are optimized by the algorithm.
Hyper-parameters, such as the choice of kernel function in an SVM, are often tuned manually or chosen with domain knowledge. Other relevant choices by the experimenter—such as which algorithms to consider and the representation of the problem—can be summarized as meta-parameters.

Typically, we evaluate a set $C$ of configurations, where each $c_i \in C$ denotes a specific algorithm with specific hyper- and meta-parameters. Often, each evaluation is noisy, due to (pseudo-)randomness in the algorithms or inherent randomness in the problem. The performance of $c_i$ is then a random variable $V_i$, and we want to construct an estimate $\hat\mu_*$ for the best performance $\mu_*$. For instance, Varma and Simon [2006] note that the ME results in overly-optimistic prediction errors and propose to use nested CV. They evaluate an SVM for various hyper-parameters on an artificial problem, with an actual error of 50%. The estimate by the ME is 41.7% and nested CV results in 54.2%. The latter estimate is a demonstration of a completely different general bias that we discuss in Section 3.2. This bias has received very little attention, although—as we will show—it is not in general smaller than the bias caused by using the ME.
Overview
In the rest of this section, we discuss related work and (notational) preliminaries. In Section 2, we discuss the bias of estimators in general. In Section 3, we discuss the properties of the ME and of CV, including bounds on their bias and variance. In Section 4 we discuss concrete settings with empirical illustrations. Section 5 contains a discussion and Section 6 concludes the paper.
The bootstrap
Efron and Tibshirani [1993] is a resampling method that can be used to estimate the bias of an estimator, in order to reduce this bias. Based on this, Tibshirani and Tibshirani [2009] propose an estimator for $\mu_*$ for model selection in classification. Inevitably—see Section 2—the resulting estimate is still biased, and it is typically more variable than the original estimate. Also specifically considering model selection for classifiers, Bernau et al. [2011] propose a smoothed version of nested CV. The resulting estimator performs similar to normal (nested) CV, which in turn is shown to typically be more accurate than the approach by Tibshirani and Tibshirani. In this paper we focus on CV and the ME, which are by far the most widely used.

The problem of estimating $\mu_*$ is related to the multi-armed bandit framework Robbins [1952], Berry and Fristedt [1985], where the goal is to find the identity of the action with the maximum expected value rather than its value. The focus in the literature on multi-armed bandits is often on how best to collect samples. In contrast, in this paper we assume that a set of samples is given. A discussion on how best to collect samples to minimize the bias or MSE is outside the scope of this paper.

(Sometimes we are more interested in the configuration that optimizes the performance than in the resulting performance, but often the performance itself is at least as important. In part, this depends on whether the focus of the research is on the algorithms or on the problem.)
Preliminaries

The measurable domain of $V_i$ is $\mathcal{X}_i$, and $f_i : \mathcal{X}_i \to \mathbb{R}$ denotes its probability density function (PDF), such that $\mu_i = \int_{\mathcal{X}_i} x f_i(x)\,dx$. For conciseness, we assume $\mathcal{X}_i = \mathbb{R}$. We assume these PDFs $f_i$ are unknown and therefore $\mu_* = \max_i \mu_i$ can not be found analytically.

We write $\hat\mu_i(X)$ for an estimator for $\mu_i$ based on a sample set $X$. Similarly, $\hat\mu_*(X)$ is an estimator for $\mu_*$. We write $\hat\mu_i$ and $\hat\mu_*$ when $X$ is clear from the context. If $X_i \subset X$ is a set of unbiased samples for $V_i$, $\hat\mu_i$ might be the sample average. In that case, $\hat\mu_i$ is unbiased for $\mu_i$. In general, $\hat\mu_i$ can be biased for $\mu_i$. As discussed in the next section, no general unbiased estimator for $\mu_*$ exists, even if all $\hat\mu_i$ are unbiased.

The following definitions will be useful below, when stating necessary and sufficient conditions for a strictly positive or a strictly negative bias. The set of optimal indices for RVs $V$ is defined as

$$O(V) \equiv \{\, i \mid \mu_i = \mu_* \,\}. \tag{2}$$

The set of maximal indices for samples $X$ is defined as

$$M(X) \equiv \left\{\, i \,\middle|\, \hat\mu_i(X) = \max_j \hat\mu_j(X) \,\right\}. \tag{3}$$

An estimator is called optimal or maximal whenever its index is optimal or maximal, respectively. Note that optimal estimators are not necessarily maximal and maximal estimators are not necessarily optimal.

The Bias of Estimators

Let $\mathbb{V}$ be a function space containing all admissible sets of $M$ RVs. We might know $\mathbb{V}$, but not the precise identity of $V \in \mathbb{V}$. For instance, $\mathbb{V}$ may be the set of all sets of $M$ normal RVs with finite moments. Let $p : \mathbb{V} \to \mathbb{R}$ be a PDF over $\mathbb{V}$. The expected MSE of an estimator $\hat\mu_*$ is equal to

$$\int_{\mathbb{V}} p(V) \int_X P(X \mid V) \left( \hat\mu_*(X) - \mu_* \right)^2 dX\, dV. \tag{4}$$

In any given concrete setting, there is a single unknown set $V$. Therefore, $p$ does not exist 'in the world'. Rather, $p$ might model our prior belief about which sets $V$ are likely in a given setting, or it might specify the $V$ for which we would like an estimator to perform well. The MSE consists of variance and bias. To reason in some generality about which estimators are good in practice, we discuss the non-existence of unbiased estimators and the direction of the bias.

Non-Existence of Unbiased Estimators

By definition, $\hat\mu_*$ is a general unbiased estimator (GUE) for $\mathbb{V}$ if and only if

$$\forall V \in \mathbb{V} : \mathrm{E}\{\hat\mu_* \mid V\} = \mu_*. \tag{5}$$

Unfortunately, for most $\mathbb{V}$ of interest no such estimator exists. For instance, Blumenthal and Cohen [1968] show no GUE exists for two normal distributions and Dhariyal et al. [1985] proved this for arbitrary $M \geq 2$. Intuitively, any estimate $\hat\mu_*$ depends smoothly on the values of the samples, whereas the real value $\mu_*$ is a piece-wise linear function with a discontinuous derivative. We can not know the location of these discontinuities without knowing the actual maximum.

Note that (5) is already false if $\mathbb{V}$ contains only a single set of variables for which $\hat\mu_*$ is biased. However, bias alone does not tell us everything, and a low bias does not necessarily imply a small expected MSE.
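As a concrete reading of definitions (2) and (3), the following sketch (ours, not from the paper) computes both index sets when each $\hat\mu_i$ is a sample average; the tolerance argument is only there to cope with floating-point ties.

```python
import numpy as np

def optimal_indices(mu, tol=1e-12):
    """O(V): indices whose true mean attains mu_star = max_i mu_i (eq. 2)."""
    mu = np.asarray(mu, dtype=float)
    return set(np.flatnonzero(mu >= mu.max() - tol).tolist())

def maximal_indices(samples, tol=1e-12):
    """M(X): indices whose sample average attains max_j muhat_j (eq. 3)."""
    muhat = np.array([np.mean(x) for x in samples])
    return set(np.flatnonzero(muhat >= muhat.max() - tol).tolist())

# Optimal estimators are not necessarily maximal, and vice versa:
mu = [1.0, 1.0, 0.5]                              # O(V) = {0, 1}
samples = [[0.9, 1.0], [1.2, 1.1], [0.4, 0.6]]    # sample averages 0.95, 1.15, 0.5
print(optimal_indices(mu), maximal_indices(samples))   # {0, 1} {1}
```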
The Direction of the Bias

In some cases, the direction of the bias is very important. Suppose we test an algorithm for various hyper-parameters and observe that the best performance is better than some baseline. If we simply use the highest test result, it can not be concluded that the algorithm can really structurally outperform the baseline for any of the specific hyper-parameters. Although this may sound trivial, it is common in practice: when we manually tune hyper- or meta-parameters on a problem and use the best result, we are using $\max_i \hat\mu_i$, which has non-negative bias. It is hard to avoid optimizing on meta-parameters: these include the very (properties of the) problem we test the algorithm on.

The practical implication of this positive bias is that the algorithm will disappoint in future evaluations on similar (real-world) problems. In contrast, if we use an estimator with non-positive bias and our estimate is higher than the baseline, we can have much more confidence that the algorithm can reach that performance consistently with a properly tuned hyper-parameter. This is similar to the considerations about overfitting in model selection, where CV is most often used. We prove below that CV indeed has non-positive bias, and can therefore avoid overestimations of $\mu_*$.

As another example, the performance of most machine-learning algorithms improves when more data is available. When the data collection is expensive it is useful to predict how an algorithm performs when more data is available, before actually collecting this data. An overestimation of the future performance can lead to a misallocation of resources, since the collected data may be less useful than predicted. An underestimation means we may be too pessimistic, and too often decide not to collect more data. Whether or not the false positives are more important than the false negatives depends crucially on specifics of the setting.

Estimators for the Maximum Expected Value
In this section, we discuss the ME and CV estimators for $\mu_*$ in detail. We bound the biases and variances of all estimators, discuss similarities and contrasts, and prove consistency. We introduce a low-variance variant of CV. We give necessary and sufficient conditions for non-zero biases to occur for all estimators, and perhaps surprisingly we show that there are settings in which the negative bias of all variants of CV is larger in size than the positive bias of the ME. All proofs are given in an appendix at the end of this paper.

The Maximum Estimator

The maximum estimator (ME) for $\mu_*$ is

$$\hat\mu^{ME}_*(X) \equiv \max_i \hat\mu_i(X), \tag{6}$$

where $\hat\mu_i$ is a (possibly biased) estimator for $\mu_i$. Because it is conceptually simple and easy to implement, the ME is often used in practice. The theorem below proves its bias is non-negative and gives necessary and sufficient conditions for a strictly positive bias. The theorem is stronger and more general than some similar earlier theorems. For instance, Smith and Winkler [2006] do not consider the possibility of multiple optimal variables, and do not discuss necessity of the conditions for a strictly positive bias.
Theorem 1. For any given set $V$, $M \geq 2$, and unbiased estimators $\hat\mu_i$ with $\mathrm{E}\{\hat\mu_i \mid V\} = \mu_i$, we have $\mathrm{E}\{\hat\mu^{ME}_* \mid V\} \geq \mu_*$, with equality if and only if all optimal indices are maximal with probability one.

Theorem 1 implies a lower bound of zero for the bias of the ME. An upper bound for arbitrary means and variances is given by Aven [1985]:

$$\mathrm{Bias}(\hat\mu^{ME}_*) \leq \sqrt{\frac{M-1}{M} \sum_{i=1}^M \mathrm{Var}(\hat\mu_i)}, \tag{7}$$

which is tight when the estimators are iid [Arnold and Groeneveld, 1979], indicating that iid variables are a worst-case setting.

We do not know of previous work that bounds the variance, which we discuss next.
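Before turning to the variance, here is a quick Monte Carlo check of bound (7) (ours, with arbitrary constants) in the iid case, which the tightness result above singles out as a worst case for the ME.

```python
import numpy as np

rng = np.random.default_rng(1)
M, N, RUNS = 5, 10, 100_000
var_muhat = 1.0 / N            # variance of each sample average (unit-variance RVs)

# iid variables with equal means 0, so mu_star = 0 and any excess is pure bias
estimates = rng.normal(0.0, 1.0, size=(RUNS, N, M)).mean(axis=1).max(axis=1)
bias = estimates.mean()

bound = np.sqrt((M - 1) / M * M * var_muhat)   # sqrt((M-1)/M * sum_i Var(muhat_i))
print(f"empirical bias of the ME: {bias:.3f}, bound (7): {bound:.3f}")
```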
Theorem 2. The variance of the ME estimator is bounded by

$$\mathrm{Var}(\hat\mu^{ME}_*) \leq \sum_{i=1}^M \mathrm{Var}(\hat\mu_i).$$

Theorem 2 and bound (7) imply that $\hat\mu^{ME}_*$ is consistent for $\mu_*$ whenever each $\hat\mu_i$ is consistent for $\mu_i$, and that $\mathrm{MSE}(\hat\mu^{ME}_*) < 2 \sum_{i=1}^M \mathrm{Var}(\hat\mu_i)$.

The Cross-Validation Estimator

In general, $\mu_*$ can be considered to be a weighted average of the means of all optimal variables: $\mu_* = \frac{1}{|O(V)|} \sum_{i=1}^M I(i \in O(V))\, \mu_i$, where $I$ is the indicator function. We do not know $O(V)$ and $\mu_i$, but with sample sets $A, B$ we can approximate these with $M(A)$ and $\hat\mu_i(B)$ to obtain

$$\mu_* \approx \hat\mu_* \equiv \frac{1}{|M(A)|} \sum_{i=1}^M I(i \in M(A))\, \hat\mu_i(B). \tag{8}$$

If $A = B = X$, this reduces to $\hat\mu^{ME}_*$. However, suppose that $A$ and $B$ are independent. This idea leads to cross-validation (CV) estimators. Of course, CV itself is not new. However, it seems to be less well-known how properties of the problem affect the accuracy and that CV can be quite biased.

We split each $X_i$ into $K$ disjoint sets $X^k_i$ and define $\hat\mu^k_i \equiv \hat\mu_i(X^k)$. For instance, $\hat\mu^k_i$ might be the sample average of $X^k_i$. We consider two different CV estimators. In both methods, for each $k \in \{1, \ldots, K\}$ we construct an argument set $\hat a^k$ and a value set $\hat v^k$.

Low-bias cross validation (LBCV) is the 'standard' CV estimator, where $K - 1$ sets are used to determine $\hat a^k$ (the model), and the remaining set is used to determine its value:

$$\hat a^k_i \equiv \hat\mu_i(X \setminus X^k) \quad \text{and} \quad \hat v^k_i \equiv \hat\mu_i(X^k) \equiv \hat\mu^k_i.$$

Low-variance cross validation (LVCV) reverses the definitions for $\hat a^k$ and $\hat v^k$:

$$\hat a^k_i \equiv \hat\mu_i(X^k) \equiv \hat\mu^k_i \quad \text{and} \quad \hat v^k_i \equiv \hat\mu_i(X \setminus X^k).$$

We do not know of any previous work that discusses this variant. However, its lower variance can sometimes result in much lower MSEs than obtained by LBCV. For both LBCV and LVCV, if $\hat\mu_i(X)$ is the sample average of $X_i$ and all samples are unbiased, then $\mathrm{E}\{\hat a^k_i\} = \mathrm{E}\{\hat v^k_i\} = \mu_i$.

For either approach, $M^k$ is the set of indices that maximize the argument vector. For LBCV this implies $M^k = M(X \setminus X^k)$ and for LVCV this implies $M^k = M(X^k)$. We find the value of these indices with the value vector, resulting in

$$\hat\mu^k_* \equiv \frac{1}{|M^k|} \sum_{i \in M^k} \hat v^k_i. \tag{9}$$

We then average over all $K$ sets:

$$\hat\mu^{CV}_* \equiv \frac{1}{K} \sum_{k=1}^K \hat\mu^k_* = \frac{1}{K} \sum_{k=1}^K \frac{1}{|M^k|} \sum_{i \in M^k} \hat v^k_i, \tag{10}$$

where either $\hat\mu^{CV}_* = \hat\mu^{LBCV}_*$ or $\hat\mu^{CV}_* = \hat\mu^{LVCV}_*$, depending on the definitions of $\hat a^k_i$ and $\hat v^k_i$.

The construction of $\hat\mu^k_*$ performs the approximation $\hat v^k_i \approx \hat v^k_{i_*} \approx \mu_{i_*} \equiv \mu_*$, where $i \in M^k$ and $i_* \in O$. The first approximation results from using $M^k$ to approximate $O$ and is the main source of bias. The second approximation results from the variance of $\hat v^k$.

For large enough $X$, $K$ can be treated as a parameter that trades off bias and variance. For LBCV larger $K$ implies less bias and more variance, while for LVCV it implies more bias and less variance. For $K = 2$, LBCV and LVCV are equivalent.
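The following is a minimal sketch of estimator (10) (ours, not from the paper), assuming each $\hat\mu_i$ is a sample average and each $X_i$ splits into $K$ equal folds; the function and argument names are our own.

```python
import numpy as np

def cv_estimate(X, K, variant="LBCV", tol=1e-12):
    """CV estimate of mu_star, following eqs. (9)-(10).

    X: array of shape (M, N), N samples per variable, N divisible by K.
    LBCV: argument set from the K-1 held-in folds, value set from the held-out fold.
    LVCV: the reverse assignment.
    """
    M, N = X.shape
    folds = X.reshape(M, K, N // K)                    # folds[i, k] is X_i^k
    estimates = np.empty(K)
    for k in range(K):
        in_fold = folds[:, k, :].mean(axis=1)          # muhat_i(X^k)
        out_folds = np.delete(folds, k, axis=1).reshape(M, -1).mean(axis=1)
        if variant == "LBCV":
            a, v = out_folds, in_fold                  # a_i^k = muhat_i(X \ X^k)
        else:                                          # LVCV reverses a and v
            a, v = in_fold, out_folds
        Mk = np.flatnonzero(a >= a.max() - tol)        # maximal indices M^k
        estimates[k] = v[Mk].mean()                    # muhat_star^k, eq. (9)
    return estimates.mean()                            # average over folds, eq. (10)
```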
If $K > 2$, LVCV is more biased but less variable than LBCV, since $M^k$ is then based on fewer samples while $\hat v^k_i$ is based on more samples. When $\forall i : |X_i| = K$ for LBCV, $\hat v^k_i$ is based on a single sample, resulting in a large variance. This variant is commonly known as leave-one-out CV. When $\forall i : |X_i| = K$ for LVCV, $\hat a^k_i$ is based on a single sample, potentially resulting in large bias due to large probabilities of selecting sub-optimal indices.

The bias of CV for $\mu_*$ has received comparatively little attention. Sometimes the bias is mentioned without explanation [Kohavi, 1995], and sometimes it is even claimed that CV is unbiased [Mannor et al., 2007]. Often, any observed bias is attributed to the fact that $\hat a^k$ can be biased when it is based on $\frac{K-1}{K}|X|$ rather than $|X|$ samples [Varma and Simon, 2006]. This can be a factor, but the bias induced by using $M^k$ for $O$ is often at least as important, as will be demonstrated below. Some confusion seems to arise from the fact that $\hat v^k_i$ is often unbiased for $\mu_i$. Unfortunately, this does not imply that $\hat\mu^{CV}_*$ is unbiased for $\mu_*$.

Next, we prove that CV estimators can have a negative bias even if $\hat a$ and $\hat v$ are unbiased, and we give necessary and sufficient conditions for a strictly negative bias.

Theorem 3. If each $\hat\mu^k_i$ is unbiased, such that $\mathrm{E}\{\hat\mu^k_i \mid V\} = \mu_i$, then $\hat\mu^{CV}_*$ is negatively biased: $\mathrm{E}\{\hat\mu^{CV}_* \mid V\} \leq \mu_*$, with a strict inequality if and only if there is a non-zero probability that any non-optimal index is maximal.

The theorem shows that $\hat\mu^{LVCV}_*$ and $\hat\mu^{LBCV}_*$ on average underestimate $\mu_*$ if and only if there is a non-zero probability that $i \in M^k(X)$ for some $i \notin O(V)$. A prominent case in which this does not hold is when all variables have the same mean, since then $i \in O(V)$ for all $i$. Interestingly, this implies that CV is unbiased when the $V_i \in V$ are iid, which is a worst case for the ME. Theorem 3 implies that the bias of CV is bounded from above by zero.
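Using cv_estimate from the sketch above, the following check (ours, with arbitrary constants) illustrates Theorem 3: with equal means the CV bias vanishes, while with distinct means both variants underestimate $\mu_*$.

```python
import numpy as np

rng = np.random.default_rng(2)

def mean_cv_estimate(mu, K, variant, runs=10_000, n=20):
    """Average CV estimate over many sample sets drawn with true means mu."""
    return np.mean([cv_estimate(rng.normal(mu, 1.0, size=(n, len(mu))).T, K, variant)
                    for _ in range(runs)])

for mu in ([0.0, 0.0, 0.0], [0.0, -0.5, -0.5]):
    for variant in ("LBCV", "LVCV"):
        bias = mean_cv_estimate(np.array(mu), K=5, variant=variant) - max(mu)
        print(f"mu = {mu}, {variant}: bias ≈ {bias:+.3f}")
# Equal means: every index is optimal, so the bias is zero (up to noise).
# Distinct means: a non-optimal index can be maximal, so both variants
# underestimate mu_star, LVCV more strongly than LBCV.
```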
We conjecture the bias is bounded from below as follows.

Conjecture 1. Let $\mathrm{E}\{\hat\mu^k_i \mid V\} = \mu_i$. Then

$$\mathrm{Bias}(\hat\mu^{CV}_*) > -\frac{1}{K} \sum_{k=1}^K \sqrt{\sum_{i=1}^M \mathrm{Var}(\hat a^k_i)}.$$

We do not prove this conjecture here in full generality, but there is a proof for $M = 2$ in the appendix. It makes intuitive sense that the bias of $\hat\mu^{CV}_*$ depends only on the variances $\mathrm{Var}(\hat a^k_i)$ if each $\hat\mu^k_i$ is unbiased. The bias of the CV estimators is unaffected by the fact that $\hat\mu^{CV}_*$ averages over $K$ estimators $\hat\mu^k_*$, but $K$ does affect the bias by regulating how many samples are used for each $\hat a^k_i$. As mentioned earlier, for LVCV a larger $K$ implies a higher bias since then $\hat a^k_i$ is more variable, while for LBCV a larger $K$ implies a lower bias since then $\hat a^k_i$ is less variable.

Although CV is known for low bias and high variance, the next theorem shows its absolute bias is not necessarily smaller than the absolute bias of the ME.
Theorem 4. There exist $V$ and $N = |X|$ such that $|\mathrm{Bias}(\hat\mu^{CV}_* \mid N)| > |\mathrm{Bias}(\hat\mu^{ME}_* \mid N)|$ for any $K$ and for any variant of CV.

Two different experiments in Section 4 prove this theorem, since there even the negative bias of leave-one-out LBCV is larger in size than the positive bias of the ME.
Theorem 5.
The variance of $\hat\mu^{LBCV}_*$ is bounded by

$$\mathrm{Var}(\hat\mu^{LBCV}_*) \leq \frac{1}{K^2} \sum_{k=1}^K \sum_{i=1}^M \mathrm{Var}(\hat\mu^k_i).$$

If each $\hat\mu^k_i$ is unbiased, the variance of LVCV is necessarily smaller than that of LBCV and the same bound applies trivially to $\hat\mu^{LVCV}_*$.

Corollary 1. If $\hat\mu_i$ is the sample average of $X_i$ and $|X^k_i| = |X_i| / K$ for all $k$, then $\mathrm{Var}(\hat\mu^{CV}_*) \leq \sum_{i=1}^M \mathrm{Var}(\hat\mu_i)$ for both LBCV and LVCV.

Conjecture 1 and Theorem 5 imply that CV is consistent if each $\hat\mu_i$ is consistent and $K$ is fixed (or slowly increasing, see also Shao [1993]), and that $\mathrm{MSE}(\hat\mu^{CV}_*) \leq 2 \sum_{i=1}^M \mathrm{Var}(\hat\mu_i)$.
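A quick numerical check of Corollary 1 (ours, with arbitrary constants), again using cv_estimate: for sample averages of unit-variance variables, $\sum_i \mathrm{Var}(\hat\mu_i) = M/N$.

```python
import numpy as np

rng = np.random.default_rng(3)
M, N, K, RUNS = 4, 20, 5, 10_000
mu = np.linspace(0.0, 0.3, M)

for variant in ("LBCV", "LVCV"):
    est = [cv_estimate(rng.normal(mu, 1.0, size=(N, M)).T, K, variant)
           for _ in range(RUNS)]
    print(f"{variant}: Var = {np.var(est):.4f}  (Corollary 1 bound: {M / N:.4f})")
```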
Empirical Illustrations

To illustrate that it is non-trivial to select an accurate estimator, we discuss some concrete examples.

Online Advertising

The framework of multi-armed bandits can be used to optimize which ad is shown on a website [Langford et al., 2008, Strehl et al., 2010]. Consider $M$ ads with unknown fixed expected returns per visitor $\mu_i$. Bandit algorithms can be used to balance exploration and exploitation to optimize the online return per visitor, which converges to $\mu_*$. However, quick accurate estimates of $\mu_*$ can be important, for instance to base future investments on. Additionally, placing any ad may induce some cost $c$, so we may want to know quickly whether $\mu_* > c$.

For simplicity, assume each ad has the same return per click, such that only the click rate matters and each $V_i$ can be modeled with a Bernoulli variable with mean $\mu_i$ and variance $(1 - \mu_i)\mu_i$.

[Figure 1: The MSE for $\hat\mu^{ME}_*$, $\hat\mu^{LBCV}_*$ and $\hat\mu^{LVCV}_*$ for different settings, averaged over 2,000 experiments. The left-most bar is always $\hat\mu^{ME}_*$. The other bars are, from left to right, leave-one-out LVCV, 10-folds LVCV, 5-folds LVCV, 2-folds CV, 5-folds LBCV, 10-folds LBCV and leave-one-out LBCV. Note that 2-folds LVCV is equivalent to 2-folds LBCV, which are therefore not shown separately.]

In our first experiment, there are $N = 100{,}000$ visitors, $M = 10$, $M = 100$ or $M = 1000$ ads, and $\forall i : \mu_i = 0.5$.
All ads are shown equally often, such that $\forall i : N_i = N/M$. Because all means are equal, Theorem 3 implies that CV estimators are unbiased; their MSE depends solely on the variance. In the second—more realistic—setting, the $M$ mean click rates are distributed evenly between 0.02 and 0.05, there are $N = 300{,}000$ visitors, and $M = 30$, $M = 300$, or $M = 3000$ ads.

The results are shown in the first four plots in Figure 1. We show the root MSE (RMSE), such that the units are percentage points. Within the RMSEs, the contributions of the bias and the variance are shown. Note that $\mathrm{MSE} = \mathrm{bias}^2 + \mathrm{variance}$, and therefore $\mathrm{RMSE} = \sqrt{\mathrm{bias}^2 + \mathrm{variance}} \neq \mathrm{bias} + \text{std dev}$. This implies that the depicted contributions of bias and variance to the RMSE are not in general exactly equal to the bias and standard deviation, but this depiction does allow us to see directly how many percentage points of error are caused by bias and by variance.

In the first setting (left plot) CV is indeed unbiased. Leave-one-out LVCV has the lowest variance of all CV methods—it is barely visible—which implies it has the smallest MSE. For $M = 1000$ ads, the huge bias of the ME causes it to overestimate the actual maximal click rate by more than 15%.

In the second setting (middle three plots), there is a clear trade-off in CV: LVCV with large $K$ has large bias and small variance, whereas LBCV with large $K$ has small bias and large variance. The bias of the CV estimators is clearly important, even though each $\hat\mu^k_i$ is unbiased. Even for leave-one-out LBCV the bias is non-negligible: for $M = 30$ its bias is larger than the bias of the ME. (Sometimes the bias of LBCV seems to increase slightly for higher $K$. These are noise-related artifacts.) Interestingly, when $M$ increases (and the number of samples per ad decreases with $M$), leave-one-out LVCV goes from being by far the least accurate for $M = 30$ to almost the most accurate for $M = 3000$. In contrast, the ME goes from being the most accurate for $M = 30$ to the least accurate for $M = 3000$. The reason is that for increasing $M$, the variables are relatively more similar to iid variables, which is a best case for LVCV and a worst case for the ME. In all three cases, 10- and 5-folds LVCV are a good choice.
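The following sketch (ours) reproduces a scaled-down version of the first setting with the cv_estimate function from the previous section; the constants are reduced to keep the run fast, so the numbers only qualitatively match Figure 1.

```python
import numpy as np

rng = np.random.default_rng(4)
M, N, K, RUNS = 10, 10_000, 5, 2_000     # scaled down from the paper's setting
mu = np.full(M, 0.5)                     # equal click rates: CV is unbiased here
mu_star = mu.max()

errors = {"ME": [], "LBCV": [], "LVCV": []}
for _ in range(RUNS):
    X = (rng.random((M, N // M)) < mu[:, None]).astype(float)   # N/M samples per ad
    errors["ME"].append(X.mean(axis=1).max() - mu_star)
    errors["LBCV"].append(cv_estimate(X, K, "LBCV") - mu_star)
    errors["LVCV"].append(cv_estimate(X, K, "LVCV") - mu_star)

for name, e in errors.items():
    e = np.asarray(e)
    print(f"{name:4s} bias {e.mean():+.4f}  RMSE {np.sqrt((e ** 2).mean()):.4f}")
```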
Polynomial Regression

We now consider a regression problem. The goal is to fit polynomials on noisy samples from a function $r(y) = 4(\sin(y) + \sin(2y))$. Let $X = \{(y, r(y) + \omega) \mid y \in Y\}$ denote a noisy data set for inputs $Y$, where $\omega$ is zero-mean Gaussian noise with variance $\sigma^2_\omega = 4$. Let $p_i$ denote a polynomial of degree $i$, of which the coefficients are fitted with least-squares on $X$.

Let $Y = \{0, 0.1, \ldots, 7.9, 8\}$ be 81 equidistant inputs. We want to maximize the negative MSE. The lowest expected MSE of fitting each $p_i$ on 81 samples and testing on an independent test set of 81 samples is obtained at 4.34 for $i = 5$, which implies $\mu_* = \mu_5 = -4.34$. We construct 1,000 independent noisy sets $X = \{(y, r(y) + \omega) \mid y \in Y\}$. For each $X$, we conduct the following experiment.

For any given $Z \subseteq X$, $\hat\mu_i$ is defined by an inner CV loop as follows. For each $z \in Z$, we fit $p_i$ on $Z \setminus \{z\}$ and test the error on $z$ to obtain an error $e_i(z)$. We average these errors to obtain $\hat\mu_i(Z) = \frac{1}{|Z|} \sum_{z \in Z} e_i(z)$.
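A sketch of this inner leave-one-out loop (ours). The data generation follows the text; the range of degrees printed at the end is an assumption for illustration, since the paper's exact set of degrees is not recoverable here.

```python
import numpy as np

def inner_cv_score(Z, degree):
    """muhat_i(Z): negative MSE of a degree-i polynomial via inner leave-one-out CV.

    Z: array of shape (n, 2) with rows (y, r(y) + noise). Each point z is
    predicted by a polynomial fitted with least-squares on Z \ {z}.
    """
    n = len(Z)
    sq_errors = np.empty(n)
    for j in range(n):
        train = np.delete(Z, j, axis=0)
        coef = np.polyfit(train[:, 0], train[:, 1], deg=degree)
        sq_errors[j] = (np.polyval(coef, Z[j, 0]) - Z[j, 1]) ** 2
    return -sq_errors.mean()            # negated: we maximize the negative MSE

rng = np.random.default_rng(5)
y = np.linspace(0.0, 8.0, 81)                                 # the 81 inputs
r = 4 * (np.sin(y) + np.sin(2 * y))
Z = np.column_stack([y, r + rng.normal(0.0, 2.0, size=81)])   # noise variance 4
print({i: round(inner_cv_score(Z, i), 2) for i in range(3, 8)})  # peaks near i = 5
```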
This implies $\hat\mu_i$ is biased, since $p_i$ is fitted on $|Z| - 1 < 81$ samples. For the ME, $\hat\mu^{ME}_* = \max_i \hat\mu_i(X)$, which means $|Z| - 1 = 80$ samples are used to fit each $p_i$. For LBCV, $\hat a^k_i = \hat\mu_i(X \setminus X^k)$, which means $|Z| = \frac{K-1}{K} \cdot 81$. For LVCV, $\hat a^k_i = \hat\mu_i(X^k)$, which means $|Z| = \frac{81}{K}$. Since $|Z|$ can then be much smaller than 81, LVCV can be significantly biased. We consider four settings of $K$, including $K = 9$ and $K = 81$. When $K = 81$, LBCV is also known as nested leave-one-out CV.

Figure 1 (right plot) shows the results. LVCV is not shown for $K = 81$ and $K = 9$: LVCV with $K = 81$ is meaningless, since one cannot fit a polynomial on a single point. The MSE for $K = 9$ is huge. In sharp contrast with the previous settings, LVCV fares poorly—even in terms of variance—and leave-one-out LBCV is the best CV estimator. However, interestingly the ME is more accurate than all CV estimators, and even the size of its (positive) bias is smaller than the size of the (negative) bias of leave-one-out LBCV.

Discussion

Our results show that it is hard to choose an estimator that is good in general. Unfortunately, the best choice in one setting can be the worst choice in another. A poorly chosen CV estimator can be far less accurate than the ME. This does not imply that we suggest using the ME; it is often very biased.

A potential advantage of CV estimators in some settings is a guaranteed non-positive bias. This can be desirable even if the estimator is less accurate. However, in our results the recommendation to always use 10-folds LBCV [Kohavi, 1995] seems unfounded. When each $\hat\mu_i$ is unbiased and especially when $M$ is large, LVCV often performs much better. On the other hand, when each estimator $\hat\mu_i$ has a bias that decreases with the number of samples, the bias of LVCV can become prohibitively large, as illustrated in the regression setting. This explains why 10-folds LBCV is often not a bad choice for model selection, as long as $M$ is fairly small and $\hat\mu_i$ is fairly biased. However, note that 5- and 10-folds LBCV were the most accurate estimator in none of our experiments.

As a general recommendation, it may be good to try both the ME and one or more CV estimators. If the estimates are close together, this indicates they are more likely to be accurate. Although the true maximum expected value will often lie between the estimate by the ME and those by CV, one should not simply average these estimates: as we have shown, the ME for instance can be very biased in some settings, and hardly biased in others. Furthermore, the potentially excessive variance of some variants of LBCV implies that in some cases its estimate may itself be an overestimation, which is why we recommend to include LVCV in the analysis.

Alternative estimators
Of course, there are possible alternatives to the estimators we discussed. First, one can consider using the maximum of some lower confidence bounds on the individual value estimates. Although this does counter the overestimation of the ME, it can not be guaranteed that this does not lead to an underestimation in its place. Furthermore, it is non-trivial to select a good confidence interval, and the resulting estimate will typically be much more variable than the ME.

Second, for model-selection there exist criteria such as AIC [Akaike, 1974] and BIC [Schwarz, 1978] that use a penalty term based on the number of parameters in the model. Obviously, such penalties are only useful when comparing homomorphic models with different numbers of parameters, and therefore do not apply to the more general setting we consider in this paper. Furthermore, the main purpose of these criteria is not to give an accurate estimate of the expected value of the best model, but to increase the probability of selecting it. These goals are related, but unfortunately not equivalent.

Finally, one can estimate belief distributions $\hat F_i$ for the location of each $\mu_i$, for instance with Bayesian inference. With these distributions, we can estimate $\mu_*$. This approach is less general, since it requires prior knowledge about $V$, but then it does seem reasonable. The probability that the maximum mean is smaller than some $x$ is equal to the probability that all means are smaller than $x$. Therefore, its CDF is $\hat F_{\max}(x) = \prod_{i=1}^M \hat F_i(x)$, which we can use to estimate $\mu_*$. The resulting Bayesian estimator (BE) is

$$\hat\mu^{BE}_* = \int_{-\infty}^{\infty} x \sum_{i=1}^M \hat f_i(x) \prod_{j \neq i} \hat F_j(x)\, dx,$$

where $\hat f_i(x) = \frac{d}{dx} \hat F_i(x)$.
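A numerical sketch of $\hat\mu^{BE}_*$ (ours), using SciPy's Beta distributions. It also reproduces the two-Bernoulli example discussed next by averaging the estimator over all possible sample outcomes.

```python
from itertools import product
from math import comb

import numpy as np
from scipy import stats
from scipy.integrate import quad

def bayes_estimate(posteriors):
    """muhat_BE = E[max_i theta_i] under independent belief distributions F_i."""
    def integrand(x):
        return x * sum(
            p.pdf(x) * np.prod([q.cdf(x) for j, q in enumerate(posteriors) if j != i])
            for i, p in enumerate(posteriors))
    return quad(integrand, 0.0, 1.0)[0]   # Beta posteriors are supported on [0, 1]

# Two Bernoulli(0.5) variables, two samples each, uniform Beta(1, 1) prior:
# weight each outcome (s successes out of 2 draws) by its probability.
expected = 0.0
for s1, s2 in product(range(3), repeat=2):
    weight = comb(2, s1) * comb(2, s2) / 16
    posteriors = [stats.beta(1 + s1, 3 - s1), stats.beta(1 + s2, 3 - s2)]
    expected += weight * bayes_estimate(posteriors)
print(f"E[muhat_BE] = {expected:.3f}")    # ≈ 0.658, i.e. a bias of ≈ +0.16
```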
To show a perhaps counter-intuitive result from this approach, we discuss a small example. Consider two Bernoulli variables. We consider all means equally likely and use a uniform prior Beta distribution, with parameters $\alpha = \beta = 1$. Suppose $\mu_1 = \mu_2 = 0.5$. We draw two samples from each variable. The expected estimate for the ME is $11/16$, for a bias of $3/16 \approx 0.19$. The posterior CDF $\hat F_i(x)$ is $1 - (1-x)^3$, $3x^2 - 2x^3$ or $x^3$, depending on how many samples for $V_i$ are equal to one. The expected value of the BE is then $\mathrm{E}\{\hat\mu^{BE}_*\} \approx 0.66$, for a bias of $\approx 0.16$: perhaps counter-intuitively, even this Bayesian estimator overestimates $\mu_*$, almost as much as the ME.

Conclusion

We analyzed the bias and variance of the two most common estimators for the maximum expected value of a set of random variables. The maximum estimator results in non-negative bias. The common alternative of cross validation (CV) has non-positive bias, which can be preferable. Unfortunately, the accuracies of different variants of CV are very dependent on the setting; an uninformed choice can result in extremely inaccurate estimates. No general rule—e.g., always use 10-fold CV—is always optimal.
Appendix
Proof of Theorem 1.
For conciseness, we leave $V$ and $X$ implicit. Let $j \in O$ be an arbitrary optimal index, and define event $A_j \equiv (j \in M)$ to be true if and only if $j$ is maximal. We can write

$$\mu_* = P(A_j)\, \mathrm{E}\{\hat\mu_j \mid A_j\} + P(\neg A_j)\, \mathrm{E}\{\hat\mu_j \mid \neg A_j\}.$$

Note that $\mathrm{E}\{\hat\mu_j \mid A_j\} = \mathrm{E}\{\hat\mu^{ME}_* \mid A_j\}$ and $\mathrm{E}\{\hat\mu_j \mid \neg A_j\} < \mathrm{E}\{\hat\mu^{ME}_* \mid \neg A_j\}$. Therefore, $\mu_* \leq \mathrm{E}\{\hat\mu^{ME}_*\}$, with equality if and only if $P(\neg A_j) = 0$ for all $j \in O$.

Proof of Theorem 2.
Let $A$ and $B$ be independent sets of RVs with $\mathrm{E}\{A_i\} = \mathrm{E}\{B_i\}$ and $\mathrm{E}\{A_i^2\} = \mathrm{E}\{B_i^2\}$. Define

$$C^{(i)} \equiv (A \setminus A_i) \cup \{B_i\} = \{A_1, \ldots, A_{i-1}, B_i, A_{i+1}, \ldots, A_M\}.$$

The Efron–Stein inequality [Efron and Stein, 1981] states that for any $g : \mathbb{R}^M \to \mathbb{R}$:

$$\mathrm{Var}(g(A)) \leq \frac{1}{2} \sum_{i=1}^M \mathrm{E}\left\{ \left( g(A) - g(C^{(i)}) \right)^2 \right\}.$$

Let $A$ and $B$ be independent instantiations of $\hat\mu$ and let $g(A) = \max_i A_i$ for any $A$. We derive

$$\mathrm{Var}(\hat\mu^{ME}_*) \leq \frac{1}{2} \sum_{i=1}^M \mathrm{E}\left\{ \left( \max_j A_j - \max_j C^{(i)}_j \right)^2 \right\} \leq \frac{1}{2} \sum_{i=1}^M \mathrm{E}\left\{ (A_i - B_i)^2 \right\} = \sum_{i=1}^M \mathrm{Var}(\hat\mu_i).$$

Proof of Theorem 3.
Let $w^k_i \equiv \mathrm{E}\{ I(i \in M^k) / |M^k| \}$. Then $\mathrm{E}\{\hat\mu^k_*\} = \sum_{i=1}^M w^k_i \mu_i \leq \mu_*$, because $M^k$ and $\hat v^k_i$ are independent, $\sum_{i=1}^M w^k_i = 1$ and $\mathrm{E}\{\hat v^k_i\} = \mu_i$. Note that $w^k_i > 0$ if and only if $P(i \in M^k) > 0$. Therefore, $\mathrm{E}\{\hat\mu^k_*\} < \mu_*$ if and only if there exists an $i \notin O$ such that $P(i \in M^k) > 0$.

Proof of Conjecture 1 for $M = 2$.

Assume without loss of generality that $\mu_1 = \mu_*$. The assumption $\mathrm{E}\{\hat\mu^k_i\} = \mu_i$ implies that $\mathrm{E}\{\hat v^k_i\} = \mu_i$. Then,

$$\mathrm{Bias}(\hat\mu^k_*) = \mathrm{E}\left\{ \frac{I(2 \in M^k)}{|M^k|} \right\} (\mu_2 - \mu_1) \geq P(2 \in M^k)(\mu_2 - \mu_1) = P(\hat a^k_2 \geq \hat a^k_1)(\mu_2 - \mu_1)$$
$$\geq \frac{\mathrm{Var}(\hat a^k_1) + \mathrm{Var}(\hat a^k_2)}{\mathrm{Var}(\hat a^k_1) + \mathrm{Var}(\hat a^k_2) + (\mu_1 - \mu_2)^2}\, (\mu_2 - \mu_1) \geq -\frac{1}{2} \sqrt{\mathrm{Var}(\hat a^k_1) + \mathrm{Var}(\hat a^k_2)},$$

where the second inequality follows from Cantelli's inequality, and the third inequality is the result of minimizing for $\mu_1 - \mu_2$. From this, it follows that for $M = 2$

$$\mathrm{Bias}(\hat\mu^{CV}_*) \geq -\frac{1}{2K} \sum_{k=1}^K \sqrt{\sum_{i=1}^2 \mathrm{Var}(\hat a^k_i)},$$

which is a factor 2 tighter than the general bound in the conjecture.

Proof of Theorem 5.
We apply definition (10) and use $\sum_{k=1}^K \hat v^k_i = \sum_{k=1}^K \hat\mu^k_i$ to derive

$$\mathrm{Var}\left( \frac{1}{K} \sum_{k=1}^K \frac{1}{|M^k|} \sum_{i \in M^k} \hat v^k_i \right) \leq \mathrm{Var}\left( \frac{1}{K} \sum_{k=1}^K \sum_{i=1}^M \hat v^k_i \right) \leq \frac{1}{K^2} \sum_{k=1}^K \sum_{i=1}^M \mathrm{Var}(\hat\mu^k_i).$$

Proof of Corollary 1.
Apply Theorem 5 with

$$\mathrm{Var}(\hat\mu^k_i) = \sigma^2_i / |X^k_i| = K \sigma^2_i / |X_i| = K\, \mathrm{Var}(\hat\mu_i).$$

References
H. Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6):716–723, 1974.
C. Ambroise and G. McLachlan. Selection bias in gene extraction on the basis of microarray gene-expression data. Proceedings of the National Academy of Sciences, 99(10):6562, 2002.
B. C. Arnold and R. A. Groeneveld. Bounds on expectations of linear systematic statistics based on dependent samples. The Annals of Statistics, 7(1):220–223, 1979.
P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235–256, 2002.
T. Aven. Upper (lower) bounds on the mean of the maximum (minimum) of a number of random variables. Journal of Applied Probability, 22(3):723–728, 1985.
C. Bernau, T. Augustin, and A.-L. Boulesteix. Correcting the optimally selected resampling-based error rate: A smooth analytical alternative to nested cross-validation, 2011.
D. A. Berry and B. Fristedt. Bandit Problems: Sequential Allocation of Experiments. Chapman and Hall, London/New York, 1985.
S. Blumenthal and A. Cohen. Estimation of the larger of two normal means. Journal of the American Statistical Association, pages 861–876, 1968.
E. Capen, R. Clapp, and T. Campbell. Bidding in high risk situations. Journal of Petroleum Technology, 23:641–653, 1971.
G. Cawley and N. Talbot. On over-fitting in model selection and subsequent selection bias in performance evaluation. The Journal of Machine Learning Research, 11:2079–2107, 2010.
I. Dhariyal, D. Sharma, and K. Krishnamoorthy. Nonexistence of unbiased estimators of ordered parameters. Statistics, 16(1):89–95, 1985.
B. Efron and C. Stein. The jackknife estimate of variance. The Annals of Statistics, 9(3):586–596, 1981.
B. Efron and R. Tibshirani. An Introduction to the Bootstrap, volume 57. Chapman & Hall/CRC, 1993.
J. Heckman. Sample selection bias as a specification error. Econometrica: Journal of the Econometric Society, pages 153–161, 1979.
R. Kohavi. A study of cross-validation and bootstrap for accuracy estimation and model selection. In C. S. Mellish, editor, Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence (IJCAI-95), pages 1137–1145, San Mateo, 1995. Morgan Kaufmann.
T. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
J. Langford, A. Strehl, and J. Wortman. Exploration scavenging. In Proceedings of the 25th International Conference on Machine Learning (ICML-08), pages 528–535. ACM, 2008.
S. Larson. The shrinkage of the coefficient of multiple correlation. Journal of Educational Psychology, 22(1):45, 1931.
S. Mannor, D. Simester, P. Sun, and J. N. Tsitsiklis. Bias and variance approximation in value function estimates. Management Science, 53(2):308–322, 2007.
F. Mosteller and J. W. Tukey. Data analysis, including statistics. In G. Lindzey and E. Aronson, editors, Handbook of Social Psychology, Vol. 2. Addison-Wesley, 1968.
H. Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527–535, 1952.
G. Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464, 1978.
J. Shao. Linear model selection by cross-validation. Journal of the American Statistical Association, 88(422):486–494, 1993.
J. E. Smith and R. L. Winkler. The optimizer's curse: Skepticism and post-decision surprise in decision analysis. Management Science, 52(3):311–322, 2006.
M. Stone. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society, 36:111–147, 1974.
A. Strehl, J. Langford, L. Li, and S. Kakade. Learning from logged implicit exploration data. In Advances in Neural Information Processing Systems, volume 23, pages 2217–2225. The MIT Press, 2010.
R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. The MIT Press, Cambridge, MA, 1998.
R. Tibshirani and R. Tibshirani. A bias correction for the minimum error rate in cross-validation. The Annals of Applied Statistics, 3(2):822–829, 2009.
E. Van den Steen. Rational overoptimism (and other biases). American Economic Review, 94(4):1141–1151, September 2004.
H. P. van Hasselt. Double Q-learning. In Advances in Neural Information Processing Systems, volume 23. The MIT Press, 2011a.
H. P. van Hasselt. Insights in Reinforcement Learning. PhD thesis, Utrecht University, 2011b.
V. N. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, 1995.
S. Varma and R. Simon. Bias in error estimation when using cross-validation for model selection. BMC Bioinformatics, 7(1):91, 2006.