From Fixed-X to Random-X Regression: Bias-Variance Decompositions, Covariance Penalties, and Prediction Error Estimation
Saharon Rosset and Ryan J. Tibshirani*

Abstract
In statistical prediction, classical approaches for model selection and model evaluation based on covariance penalties are still widely used. Most of the literature on this topic is based on what we call the "Fixed-X" assumption, where covariate values are assumed to be nonrandom. By contrast, it is often more reasonable to take a "Random-X" view, where the covariate values are independently drawn for both training and prediction. To study the applicability of covariance penalties in this setting, we propose a decomposition of Random-X prediction error in which the randomness in the covariates contributes to both the bias and variance components. This decomposition is general, but we concentrate on the fundamental case of least squares regression. We prove that in this setting the move from Fixed-X to Random-X prediction results in an increase in both bias and variance. When the covariates are normally distributed and the linear model is unbiased, all terms in this decomposition are explicitly computable, which yields an extension of Mallows' Cp that we call RCp. RCp also holds asymptotically for certain classes of nonnormal covariates. When the noise variance is unknown, plugging in the usual unbiased estimate leads to an approach that we call $\widehat{\mathrm{RCp}}$,
which is closely related to Sp (Tukey 1967) and GCV (Craven and Wahba 1978). For excess bias, we propose an estimate based on the "shortcut formula" for ordinary cross-validation (OCV), resulting in an approach we call RCp+. Theoretical arguments and numerical simulations suggest that RCp+ is typically superior to OCV, though the difference is small. We further examine the Random-X error of other popular estimators. The surprising result we get for ridge regression is that, in the heavily-regularized regime, Random-X variance is smaller than Fixed-X variance, which can lead to smaller overall Random-X error.

*The authors thank Edgar Dobriban for help in formulating and proving Theorem 3, and Felix Abramovich, Trevor Hastie, Amit Moscovich, Moni Shahar, Rob Tibshirani and Stefan Wager for useful comments.

A statistical regression model seeks to describe the relationship between a response $y \in \mathbb{R}$ and a covariate vector $x \in \mathbb{R}^p$, based on training data comprised of paired observations $(x_1, y_1), \ldots, (x_n, y_n)$. Many modern regression models are ultimately aimed at prediction: given a new covariate value $x_0$, we apply the model to predict the corresponding response value $y_0$. Inference on the prediction error of regression models is a central part of model evaluation and model selection in statistical learning (e.g., Hastie et al. 2009). A common assumption that is used in the estimation of prediction error is what we call a "Fixed-X" assumption, where the training covariate values $x_1, \ldots, x_n$ are treated as fixed, i.e., nonrandom, as are the covariate values at which predictions are to be made, which are also assumed to equal the training values. In the Fixed-X setting, the celebrated notions of optimism and degrees of freedom lead to covariance penalty approaches to estimate the prediction performance of a model (Efron, 1986, 2004; Hastie et al., 2009), extending and generalizing classical approaches like Mallows' Cp (Mallows, 1973) and AIC (Akaike, 1973).

The Fixed-X setting is one of the most common views on regression (arguably the predominant view), and it can be found at all points on the spectrum from cutting-edge research to introductory teaching in statistics. This setting combines the following two assumptions about the problem.

(i) The covariate values $x_1, \ldots, x_n$ used in training are not random (e.g., designed), and the only randomness in training is due to the responses $y_1, \ldots, y_n$.

(ii) The covariates used for prediction exactly match the training values $x_1, \ldots, x_n$, respectively, and the corresponding responses are independent copies of $y_1, \ldots, y_n$, respectively.

Relaxing assumption (i), i.e., acknowledging randomness in the training covariates $x_1, \ldots, x_n$, and taking this randomness into account when performing inference on estimated parameters and fitted models, has received a good deal of attention in the literature. But, as we see it, assumption (ii) is the critical one that needs to be relaxed in most realistic prediction setups. To emphasize this, we define two settings beyond the Fixed-X one, which we call the "Same-X" and "Random-X" settings. The Same-X setting drops assumption (i), but does not account for new covariate values at prediction time. The Random-X setting drops both assumptions, and deals with predictions at new covariate values. These will be defined more precisely in the next subsection.

We assume that the training data $(x_1, y_1), \ldots, (x_n, y_n)$ are i.i.d.
according to some joint distribution $P$. This is an innocuous assumption, and it means that we can posit a relationship for the training data,

$$y_i = f(x_i) + \epsilon_i, \quad i = 1, \ldots, n, \quad (1)$$

where $f(x) = E(y|x)$, and the expectation here is taken with respect to a draw $(x, y) \sim P$. We also assume that for $(x, y) \sim P$,

$$\epsilon = y - f(x) \ \text{is independent of} \ x, \quad (2)$$

which is less innocuous, and precludes, e.g., heteroskedasticity in the data. We let $\sigma^2 = \mathrm{Var}(y|x)$ denote the constant conditional variance. It is worth pointing out that some results in this paper can be adjusted or modified to hold when (2) is not assumed; but since other results hinge critically on (2), we find it more convenient to assume (2) up front.

For brevity, we write $Y = (y_1, \ldots, y_n) \in \mathbb{R}^n$ for the vector of training responses, and $X \in \mathbb{R}^{n \times p}$ for the matrix of training covariates with $i$th row $x_i$, $i = 1, \ldots, n$. We also write $Q$ for the marginal distribution of $x$ when $(x, y) \sim P$, and $Q^n = Q \times \cdots \times Q$ ($n$ times) for the distribution of $X$ when its $n$ rows are drawn i.i.d. from $Q$. We denote by $\tilde y_i$ an independent copy of $y_i$, i.e., an independent draw from the conditional law of $y_i | x_i$, for $i = 1, \ldots, n$, and we abbreviate $\tilde Y = (\tilde y_1, \ldots, \tilde y_n) \in \mathbb{R}^n$. These are the responses considered in the Same-X setting, defined below. We denote by $(x_0, y_0)$ an independent draw from $P$. This is the covariate-response pair evaluated in the Random-X setting, also defined below.

Now consider a model building procedure that uses the training data $(X, Y)$ to build a prediction function $\hat f_n : \mathbb{R}^p \to \mathbb{R}$. We can associate to this procedure two notions of prediction error:

$$\mathrm{ErrS} = E_{X,Y,\tilde Y}\bigg[\frac{1}{n}\sum_{i=1}^n \big(\tilde y_i - \hat f_n(x_i)\big)^2\bigg] \quad \text{and} \quad \mathrm{ErrR} = E_{X,Y,x_0,y_0}\big(y_0 - \hat f_n(x_0)\big)^2,$$

where the subscripts on the expectations highlight the random variables over which expectations are taken. (We omit subscripts when the scope of the expectation is clearly understood by the context.)
The Same-X and Random-X settings differ only in the quantity we use to measure prediction error: in Same-X, we use ErrS, and in Random-X, we use ErrR. We call ErrS the Same-X prediction error and ErrR the Random-X prediction error, though we note these are also commonly called in-sample and out-of-sample prediction error, respectively. We also note that by exchangeability,

$$\mathrm{ErrS} = E_{X,Y,\tilde y_1}\big(\tilde y_1 - \hat f_n(x_1)\big)^2.$$
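To make these two definitions concrete, here is a minimal Monte Carlo sketch (our own illustration, not from the paper) that estimates ErrS and ErrR for least squares regression; the linear mean, standard normal covariates, and all constants are arbitrary choices for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma, reps = 100, 20, 1.0, 5000
beta = np.ones(p)

errS = errR = 0.0
for _ in range(reps):
    X = rng.standard_normal((n, p))                  # training covariates (random)
    Y = X @ beta + sigma * rng.standard_normal(n)    # training responses
    beta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]  # least squares fit
    # Same-X: independent response copies at the same covariate values
    Yt = X @ beta + sigma * rng.standard_normal(n)
    errS += np.mean((Yt - X @ beta_hat) ** 2) / reps
    # Random-X: fresh draws (x0, y0) from the same joint distribution
    X0 = rng.standard_normal((200, p))
    Y0 = X0 @ beta + sigma * rng.standard_normal(200)
    errR += np.mean((Y0 - X0 @ beta_hat) ** 2) / reps

print(errS, errR)   # roughly 1.20 vs. 1.25: ErrR exceeds ErrS here
```

As developed later in the paper, in this unbiased setting the gap between the two estimates is attributable entirely to excess variance.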
Lastly, the Fixed-X setting is defined by the same model assumptions as above, but with $x_1, \ldots, x_n$ viewed as nonrandom, i.e., we assume the responses are drawn from (1), with the errors being i.i.d. We can equivalently view this as the Same-X setting, but where we condition on $x_1, \ldots, x_n$. In the Fixed-X setting, prediction error is defined by

$$\mathrm{ErrF} = E_{Y,\tilde Y}\bigg[\frac{1}{n}\sum_{i=1}^n \big(\tilde y_i - \hat f_n(x_i)\big)^2\bigg].$$

(Without $x_1, \ldots, x_n$ being random, the terms in the sum above are no longer exchangeable, and so ErrF does not simplify as ErrS did.)

From our perspective, much of the work encountered in statistical modeling takes a Fixed-X view, or when treating the covariates as random, a Same-X view. Indeed, when concerned with parameter estimates and parameter inferences in regression models, the randomness of new prediction points plays no role, and so the Same-X view seems entirely appropriate. But, when focused on prediction, the Random-X view seems more realistic as a study ground for what happens in most applications. On the other hand, while the Fixed-X view is common, the Same-X and Random-X views have not exactly been ignored, either, and several groups of researchers in statistics, but also in machine learning and econometrics, fully adopt and argue for such random covariate views. A scholarly and highly informative treatment of how randomness in the covariates affects parameter estimates and inferences in regression models is given in Buja et al. (2014, 2016). We also refer the reader to these papers for a nice review of the history of work in statistics and econometrics on random covariate models. It is also worth mentioning that in nonparametric regression theory, it is common to treat the covariates as random, e.g., the book by Gyorfi et al. (2002), and the random covariate view is the standard in what machine learning researchers call statistical learning theory, e.g., the book by Vapnik (1998). Further, a stream of recent papers in high-dimensional regression adopts a random covariate perspective, to give just a few examples: Greenshtein and Ritov (2004); Chatterjee (2013); Dicker (2013); Hsu et al. (2014); Dobriban and Wager (2015).

In discussing statistical models with random covariates, one should differentiate between what may be called the "i.i.d. pairs" model and the "signal-plus-noise" model. The former assumes i.i.d. draws $(x_i, y_i)$, $i = 1, \ldots, n$ from a common distribution $P$, or equivalently i.i.d. draws from the model (1); the latter assumes i.i.d. draws from (1), and additionally assumes (2). The additional assumption (2) is not a light one, and it does not allow for, e.g., heteroskedasticity. The books by Vapnik (1998) and Gyorfi et al. (2002) assume the i.i.d. pairs model, and do not require (2) (though their results often require a bound on the maximum of $\mathrm{Var}(y|x)$ over all $x$).

More specifically related to the focus of our paper is the seminal work of Breiman and Spector (1992), who considered Random-X prediction error mostly from an intuitive and empirical point of view. A major line of work on practical covariance penalties for Random-X prediction error in least squares regression begins with Stein (1960) and Tukey (1967), and continues onwards throughout the late 1970s and early 1980s with Hocking (1976); Thompson (1978a,b); Breiman and Freedman (1983). Some more recent contributions are found in Leeb (2008); Dicker (2013). A common theme to these works is the assumption that $(x, y)$ is jointly normal.
This is a strong assumption, and one that we avoid in our paper (though for some results we assume $x$ is marginally normal); we will discuss comparisons to these works later. Through personal communication, we are aware of work in progress by Larry Brown, Andreas Buja, and coauthors on a variant of Mallows' Cp for a setting in which covariates are random. It is our understanding that they take somewhat of a broader view than we do in our proposals RCp, $\widehat{\mathrm{RCp}}$, RCp+, each designed for a more specific scenario, but resort to asymptotics in order to do so.

Finally, we must mention that an important alternative to covariance penalties for Random-X model evaluation and selection are resampling-based techniques, like cross-validation and bootstrap methods (e.g., Efron 2004; Hastie et al. 2009). In particular, ordinary leave-one-out cross-validation or OCV evaluates a model by actually building $n$ separate prediction models, each one using $n - 1$ of the $n$ training observations (and it is hence nearly unbiased for prediction error, though for training sets of size $n - 1$ rather than $n$), albeit at a somewhat high price in terms of variance and inaccuracy (e.g., see Burman 1989; Hastie et al. 2009). Altogether, OCV is an important benchmark for comparing the results of any proposed Random-X model evaluation approach.

Consider first the Fixed-X setting, where $x_1, \ldots, x_n$ are nonrandom. Recall the well-known decomposition of Fixed-X prediction error (e.g., Hastie et al. 2009):

$$\mathrm{ErrF} = \sigma^2 + \frac{1}{n}\sum_{i=1}^n \big(E\hat f_n(x_i) - f(x_i)\big)^2 + \frac{1}{n}\sum_{i=1}^n \mathrm{Var}\big(\hat f_n(x_i)\big),$$

where the latter two terms on the right-hand side above are called the (squared) bias and variance of the estimator $\hat f_n$, respectively. In the Same-X setting, the same decomposition holds conditional on $x_1, \ldots, x_n$. Integrating out over $x_1, \ldots, x_n$, and using exchangeability, we conclude

$$\mathrm{ErrS} = \sigma^2 + \underbrace{E_X\Big(E\big(\hat f_n(x_1) \,|\, X\big) - f(x_1)\Big)^2}_{B} + \underbrace{E_X \mathrm{Var}\big(\hat f_n(x_1) \,|\, X\big)}_{V}.$$

The last two terms on the right-hand side above are integrated bias and variance terms associated with $\hat f_n$, which we denote by $B$ and $V$, respectively. Importantly, whenever the Fixed-X variance of the estimator $\hat f_n$ in question is unaffected by the form of $f(x) = E(y|x)$ (e.g., as is the case in least squares regression), then so is the integrated variance $V$.

For Random-X, we can condition on $x_1, \ldots, x_n$ and $x_0$, and then use similar arguments to yield the decomposition

$$\mathrm{ErrR} = \sigma^2 + E_{X,x_0}\Big(E\big(\hat f_n(x_0) \,|\, X, x_0\big) - f(x_0)\Big)^2 + E_{X,x_0} \mathrm{Var}\big(\hat f_n(x_0) \,|\, X, x_0\big).$$

For reasons that will become clear in what follows, it suits our purpose to rearrange this as

$$\mathrm{ErrR} = \sigma^2 + B + V \quad (3)$$

$$+ \underbrace{E_{X,x_0}\Big(E\big(\hat f_n(x_0) \,|\, X, x_0\big) - f(x_0)\Big)^2 - E_X\Big(E\big(\hat f_n(x_1) \,|\, X\big) - f(x_1)\Big)^2}_{B^+} \quad (4)$$

$$+ \underbrace{E_{X,x_0} \mathrm{Var}\big(\hat f_n(x_0) \,|\, X, x_0\big) - E_X \mathrm{Var}\big(\hat f_n(x_1) \,|\, X\big)}_{V^+}. \quad (5)$$

We call the quantities in (4), (5) the excess bias and excess variance of $\hat f_n$ ("excess" here referring to the extra amount of bias and variance that can be attributed to the randomness of $x_0$), denoted by $B^+$ and $V^+$, respectively. We note that, by construction,

$$\mathrm{ErrR} - \mathrm{ErrS} = B^+ + V^+,$$

thus, e.g., $B^+ + V^+ \geq 0$ implies that the Random-X (out-of-sample) prediction error of $\hat f_n$ is no smaller than its Same-X (in-sample) prediction error.
Moreover, as ErrS is easily estimated following standard practice for estimating ErrF, discussed next, we see that estimates of or bounds on $B^+$, $V^+$ lead to estimates of or bounds on ErrR.

Starting with the Fixed-X setting again, we recall the definition of optimism, e.g., as in Efron (1986, 2004); Hastie et al. (2009),

$$\mathrm{OptF} = E_{Y,\tilde Y}\bigg[\frac{1}{n}\sum_{i=1}^n \big(\tilde y_i - \hat f_n(x_i)\big)^2 - \frac{1}{n}\sum_{i=1}^n \big(y_i - \hat f_n(x_i)\big)^2\bigg],$$

which is the difference between prediction error and training error. Optimism can also be expressed as the following elegant sum of self-influence terms,

$$\mathrm{OptF} = \frac{2}{n}\sum_{i=1}^n \mathrm{Cov}\big(y_i, \hat f_n(x_i)\big),$$

and furthermore, under a normal regression model (i.e., the data model (1) with $\epsilon \sim N(0, \sigma^2)$) and some regularity conditions on $\hat f_n$ (i.e., continuity and almost differentiability as a function of $Y$),

$$\mathrm{OptF} = \frac{2\sigma^2}{n}\sum_{i=1}^n E\bigg[\frac{\partial \hat f_n(x_i)}{\partial y_i}\bigg],$$

which is often called Stein's formula (Stein, 1981).

Optimism is an interesting and important concept because an unbiased estimate $\widehat{\mathrm{OptF}}$ of OptF (say, from Stein's formula or direct calculation) leads to an unbiased estimate of prediction error:

$$\frac{1}{n}\sum_{i=1}^n \big(y_i - \hat f_n(x_i)\big)^2 + \widehat{\mathrm{OptF}}.$$

When $\hat f_n$ is given by the least squares regression of $Y$ on $X$ (and $X$ has full column rank), so that $\hat f_n(x_i) = x_i^T (X^T X)^{-1} X^T Y$, $i = 1, \ldots, n$, it is not hard to check that $\mathrm{OptF} = 2\sigma^2 p/n$. This is exact and hence "even better" than an unbiased estimate; plugging in this result above for $\widehat{\mathrm{OptF}}$ gives us Mallows' Cp (Mallows, 1973).

In the Same-X setting, optimism can be defined similarly, except additionally integrated over the distribution of $x_1, \ldots, x_n$,

$$\mathrm{OptS} = E_{X,Y,\tilde Y}\bigg[\frac{1}{n}\sum_{i=1}^n \big(\tilde y_i - \hat f_n(x_i)\big)^2 - \frac{1}{n}\sum_{i=1}^n \big(y_i - \hat f_n(x_i)\big)^2\bigg] = \frac{2}{n}\sum_{i=1}^n E_X \mathrm{Cov}\big(y_i, \hat f_n(x_i) \,|\, X\big).$$
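These identities are easy to verify numerically. The following sketch (ours, with arbitrary constants) checks that Same-X prediction error minus training error averages to $2\sigma^2 p/n$ for least squares:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma, reps = 100, 10, 1.0, 20000
beta = np.ones(p)

opt = 0.0
for _ in range(reps):
    X = rng.standard_normal((n, p))
    f = X @ beta
    Y = f + sigma * rng.standard_normal(n)            # training responses
    Yt = f + sigma * rng.standard_normal(n)           # independent copies
    fit = X @ np.linalg.lstsq(X, Y, rcond=None)[0]    # least squares fit
    opt += (np.mean((Yt - fit) ** 2) - np.mean((Y - fit) ** 2)) / reps

print(opt, 2 * sigma**2 * p / n)   # both close to 0.2
```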
Some simple results immediately follow.

Proposition 1.

(i) If $T(X, Y)$ is an unbiased estimator of OptF in the Fixed-X setting, for any $X$ in the support of $Q^n$, then it is also unbiased for OptS in the Same-X setting.

(ii) If OptF in the Fixed-X setting does not depend on $X$ (e.g., as is true in least squares regression), then it is equal to OptS in the Same-X setting.
Some consequences of this proposition are as follows.

• For the least squares regression estimator of $Y$ on $X$ (with $X$ having full column rank almost surely under $Q^n$), we have $\mathrm{OptF} = \mathrm{OptS} = 2\sigma^2 p/n$.

• For a linear smoother, where $\hat f_n(x_i) = s(x_i)^T Y$, $i = 1, \ldots, n$, and we denote by $S(X) \in \mathbb{R}^{n \times n}$ the matrix with rows $s(x_1), \ldots, s(x_n)$, we have (by direct calculation) $\mathrm{OptF} = 2\sigma^2 \mathrm{tr}(S(X))/n$ and $\mathrm{OptS} = 2\sigma^2 E_X[\mathrm{tr}(S(X))]/n$.

• For the lasso regression estimator of $Y$ on $X$ (with $X$ being in general position almost surely under $Q^n$), and a normal data model (i.e., the model in (1), (2) with $\epsilon \sim N(0, \sigma^2)$), Zou et al. (2007); Tibshirani and Taylor (2012); Tibshirani (2013) prove that for any value of the lasso tuning parameter $\lambda > 0$ and any $X$, the Fixed-X optimism is just $\mathrm{OptF} = 2\sigma^2 E_Y|A_\lambda(X, Y)|/n$, where $A_\lambda(X, Y)$ is the active set at the lasso solution at $\lambda$ and $|A_\lambda(X, Y)|$ is its size; therefore we also have $\mathrm{OptS} = 2\sigma^2 E_{X,Y}|A_\lambda(X, Y)|/n$.

Overall, we conclude that for the estimation of prediction error, the Same-X setting is basically identical to Fixed-X. We will see next that the situation is different for Random-X.

For the definition of Random-X optimism, we have to now integrate over all sources of uncertainty,

$$\mathrm{OptR} = E_{X,Y,x_0,y_0}\bigg[\big(y_0 - \hat f_n(x_0)\big)^2 - \frac{1}{n}\sum_{i=1}^n \big(y_i - \hat f_n(x_i)\big)^2\bigg].$$

The definitions of OptS, OptR are both given by a type of prediction error (Same-X or Random-X) minus training error, and there is just one common way to define training error. Hence, by subtracting training error from both sides in the decomposition (3), (4), (5), we obtain the relationship

$$\mathrm{OptR} = \mathrm{OptS} + B^+ + V^+, \quad (6)$$

where $B^+$, $V^+$ are the excess bias and variance as defined in (4), (5), respectively. As a consequence of our definitions, Random-X optimism is tied to Same-X optimism by the excess bias and variance terms, as in (6). The practical utility of this relationship: an unbiased estimate of Same-X optimism (which, as pointed out in the last subsection, follows straightforwardly from an unbiased estimate of Fixed-X optimism), combined with estimates of excess bias and variance, leads to an estimate of Random-X prediction error.

In this section, we examine the case when $\hat f_n$ is defined by least squares regression of $Y$ on $X$, where we assume $X$ has full column rank (or, when viewed as random, has full column rank almost surely under its marginal distribution $Q^n$).

Nonnegativity of $B^+$, $V^+$

Our first result concerns the signs of $B^+$ and $V^+$.
Theorem 1. For $\hat f_n$ the least squares regression estimator, we have both $B^+ \geq 0$ and $V^+ \geq 0$.

Proof. We prove the result separately for $V^+$ and $B^+$.

Nonnegativity of $V^+$. For a function $g : \mathbb{R}^p \to \mathbb{R}$, we will write $g(X) = (g(x_1), \ldots, g(x_n)) \in \mathbb{R}^n$, the vector whose components are given by applying $g$ to the rows of $X$. Letting $X_0 \in \mathbb{R}^{n \times p}$ be a matrix of test covariate values, whose rows are i.i.d. draws from $Q$, we note that the excess variance in (5) can be equivalently expressed as

$$V^+ = E_{X,X_0} \frac{1}{n} \mathrm{tr}\Big[\mathrm{Cov}\big(\hat f_n(X_0) \,|\, X, X_0\big)\Big] - E_X \frac{1}{n} \mathrm{tr}\Big[\mathrm{Cov}\big(\hat f_n(X) \,|\, X\big)\Big].$$

Note that the second term here is just $E_X[(\sigma^2/n)\,\mathrm{tr}(X(X^T X)^{-1}X^T)] = \sigma^2 p/n$. The first term is

$$\frac{\sigma^2}{n} \mathrm{tr}\Big(E_{X,X_0}\big[(X^T X)^{-1} X_0^T X_0\big]\Big) = \frac{\sigma^2}{n} \mathrm{tr}\Big(E\big[(X^T X)^{-1}\big]\, E[X_0^T X_0]\Big) = \frac{\sigma^2}{n} \mathrm{tr}\Big(E\big[(X^T X)^{-1}\big]\, E[X^T X]\Big), \quad (7)$$

where in the first equality we used the independence of $X$ and $X_0$, and in the second equality we used the identical distribution of $X$ and $X_0$. Now, by a result of Groves and Rothenberg (1969), we know that $E[(X^T X)^{-1}] - [E(X^T X)]^{-1}$ is positive semidefinite. Thus we have

$$\frac{\sigma^2}{n} \mathrm{tr}\Big(E\big[(X^T X)^{-1}\big]\, E[X^T X]\Big) \geq \frac{\sigma^2}{n} \mathrm{tr}\Big(\big[E(X^T X)\big]^{-1} E[X^T X]\Big) = \frac{\sigma^2 p}{n}.$$

This proves $V^+ \geq 0$.

Nonnegativity of $B^+$. This result is actually a special case of Theorem 4, and its proof follows from the proof of the latter.

An immediate consequence of this, from the relationship between Random-X and Same-X prediction error in (3), (4), (5), is the following.
Corollary 1.
For $\hat f_n$ the least squares regression estimator, we have $\mathrm{ErrR} \geq \mathrm{ErrS}$.

This simple result, that the Random-X (out-of-sample) prediction error is always larger than the Same-X (in-sample) prediction error for least squares regression, is perhaps not surprising; however, we have not been able to find it proven elsewhere in the literature at the same level of generality. We emphasize that our result only assumes (1), (2) and places no other assumptions on the distribution of errors, distribution of covariates, or the form of $f(x) = E(y|x)$.

We also note that, while this relationship may seem obvious, it is in fact not universal. Later in Section 6.2, we show that the excess variance $V^+$ in heavily-regularized ridge regression is guaranteed to be negative, and this can even lead to $\mathrm{ErrR} < \mathrm{ErrS}$.

$V^+$ for normal covariates

Beyond the nonnegativity of $B^+$, $V^+$, it is actually easy to quantify $V^+$ exactly in the case that the covariates follow a normal distribution.
Theorem 2. Assume that $Q = N(0, \Sigma)$, where $\Sigma \in \mathbb{R}^{p \times p}$ is invertible, and $p < n - 1$. Then for the least squares regression estimator,

$$V^+ = \frac{\sigma^2 p}{n} \cdot \frac{p + 1}{n - p - 1}.$$

Proof. As the rows of $X$ are i.i.d. from $N(0, \Sigma)$, we have $X^T X \sim W(\Sigma, n)$, which denotes a Wishart distribution with $n$ degrees of freedom, and so $E(X^T X) = n\Sigma$. Similarly, $(X^T X)^{-1} \sim W^{-1}(\Sigma^{-1}, n)$, denoting an inverse Wishart with $n$ degrees of freedom, and hence $E[(X^T X)^{-1}] = \Sigma^{-1}/(n - p - 1)$. Then, from (7),

$$V^+ = \frac{\sigma^2}{n} \mathrm{tr}\Big(E\big[(X^T X)^{-1}\big]\, E[X^T X]\Big) - \frac{\sigma^2 p}{n} = \frac{\sigma^2}{n} \mathrm{tr}\bigg(I_{p \times p} \frac{n}{n - p - 1}\bigg) - \frac{\sigma^2 p}{n} = \frac{\sigma^2 p}{n} \cdot \frac{p + 1}{n - p - 1},$$

completing the proof.

Interestingly, as we see, the excess variance $V^+$ does not depend on the covariance matrix $\Sigma$ in the case of normal covariates. Moreover, we stress that (as a consequence of our decomposition and definition of $B^+$, $V^+$), the above calculation does not rely on linearity of $f(x) = E(y|x)$.
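The finite-sample formula, and the fact that $\Sigma$ drops out, are straightforward to check by simulating over $X$ alone, using the trace expression (7). A sketch (ours; the nonspherical covariance below is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma2, reps = 50, 10, 1.0, 4000
A = rng.standard_normal((p, p))
Sig_half = np.linalg.cholesky(A @ A.T + np.eye(p))  # arbitrary covariance factor

E_inv = np.zeros((p, p))    # Monte Carlo estimate of E[(X^T X)^{-1}]
E_gram = np.zeros((p, p))   # Monte Carlo estimate of E[X^T X]
for _ in range(reps):
    X = rng.standard_normal((n, p)) @ Sig_half.T    # rows ~ N(0, Sigma)
    G = X.T @ X
    E_inv += np.linalg.inv(G) / reps
    E_gram += G / reps

V_plus = sigma2 / n * np.trace(E_inv @ E_gram) - sigma2 * p / n
print(V_plus, sigma2 * p / n * (p + 1) / (n - p - 1))   # both ≈ 0.056
```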
When $f(x)$ is linear, i.e., the linear model is unbiased, it is not hard to see that $B^+ = 0$, and the next result follows from (6).

Corollary 2. Assume the conditions of Theorem 2, and further, assume that $f(x) = x^T \beta$, a linear function of $x$. Then for the least squares regression estimator,

$$\mathrm{OptR} = \mathrm{OptS} + \frac{\sigma^2 p}{n} \cdot \frac{p + 1}{n - p - 1} = \frac{\sigma^2 p}{n}\bigg(2 + \frac{p + 1}{n - p - 1}\bigg).$$

For the unbiased case considered in Corollary 2, the same result can be found in previous works, in particular in Dicker (2013), where it is proven in the appendix. It is also similar to older results from Stein (1960); Tukey (1967); Hocking (1976); Thompson (1978a,b), which assume the pair $(x, y)$ is jointly normal (and thus also assume the linear model to be unbiased). We return to these older classical results in the next section. When bias is present, our decomposition is required, so that the appropriate result would still apply to $V^+$.

$V^+$ for nonnormal covariates

Using standard results from random matrix theory, the result of Theorem 2 can be generalized to an asymptotic result over a wide class of distributions.
Theorem 3. Assume that $x \sim Q$ is generated as follows: we draw $z \in \mathbb{R}^p$, having i.i.d. components $z_i \sim F$, $i = 1, \ldots, p$, where $F$ is any distribution with zero mean and unit variance, and then set $x = \Sigma^{1/2} z$, where $\Sigma \in \mathbb{R}^{p \times p}$ is positive definite and $\Sigma^{1/2}$ is its symmetric square root. Consider an asymptotic setup where $p/n \to \gamma \in (0, 1)$ as $n \to \infty$. Then

$$V^+ \to \sigma^2 \frac{\gamma^2}{1 - \gamma} \quad \text{as } n \to \infty.$$
Proof. Denote by $X_n = Z_n \Sigma^{1/2}$ the training covariate matrix, where $Z_n$ has rows $z_1, \ldots, z_n$, and we use subscripts on $X_n$, $Z_n$ to denote the dependence on $n$ in the asymptotic calculations below. Then, as in the proof of Theorem 1,

$$V^+ = \frac{\sigma^2}{n} \mathrm{tr}\Big(E\big[(X_n^T X_n)^{-1}\big]\, E[X_n^T X_n]\Big) - \frac{\sigma^2 p}{n} = \frac{\sigma^2}{n} \mathrm{tr}\Big(E\big[(Z_n^T Z_n)^{-1}\big]\, E[Z_n^T Z_n]\Big) - \frac{\sigma^2 p}{n} = \frac{\sigma^2}{n} \mathrm{tr}\Big(n E\big[(Z_n^T Z_n)^{-1}\big]\Big) - \frac{\sigma^2 p}{n}.$$

The second equality used the relationship $X_n = Z_n \Sigma^{1/2}$, and the third equality used the fact that the entries of $Z_n$ are i.i.d. with mean 0 and variance 1. This confirms that $V^+$ does not depend on the covariance matrix $\Sigma$.

By the Marchenko-Pastur theorem, the empirical distribution of the eigenvalues $\lambda_1, \ldots, \lambda_p$ of $Z_n^T Z_n / n$ converges to a fixed law, independent of $F$; more precisely, the random measure $\mu_n$, defined by

$$\mu_n(A) = \frac{1}{p} \sum_{i=1}^p 1\{\lambda_i \in A\},$$

converges weakly to the Marchenko-Pastur law $\mu$. We note that $\mu$ has density bounded away from zero when $\gamma < 1$.
As the eigenvalues of $n(Z_n^T Z_n)^{-1}$ are simply $1/\lambda_1, \ldots, 1/\lambda_p$, we also have that the random measure $\tilde\mu_n$, defined by

$$\tilde\mu_n(A) = \frac{1}{p} \sum_{i=1}^p 1\{1/\lambda_i \in A\},$$

converges to a fixed law, call it $\tilde\mu$. Denoting the mean of $\tilde\mu$ by $m$, we now have

$$V^+ = \frac{\sigma^2}{n} \mathrm{tr}\Big(n E\big[(Z_n^T Z_n)^{-1}\big]\Big) - \frac{\sigma^2 p}{n} = \frac{\sigma^2 p}{n} E\bigg[\frac{1}{p} \sum_{i=1}^p \frac{1}{\lambda_i}\bigg] - \frac{\sigma^2 p}{n} \to \sigma^2 \gamma (m - 1) \quad \text{as } n \to \infty.$$

As this same asymptotic limit, independent of $F$, must agree with the specific case in which $F = N(0, 1)$, where Theorem 2 gives the limit $\sigma^2 \gamma^2/(1 - \gamma)$, we conclude that $m - 1 = \gamma/(1 - \gamma)$, which proves the result.
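The asymptotic claim can be checked numerically via the trace expression in the proof; the sketch below (ours, with arbitrary dimensions) uses ±1 Rademacher entries for $F$, a decidedly nonnormal choice, and compares to the limit $\sigma^2\gamma^2/(1-\gamma)$:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, sigma2, reps = 1000, 400, 1.0, 200   # gamma = p/n = 0.4
gamma = p / n

acc = 0.0   # Monte Carlo estimate of E[tr((Z^T Z)^{-1})]
for _ in range(reps):
    Z = rng.choice([-1.0, 1.0], size=(n, p))   # i.i.d. Rademacher entries
    acc += np.trace(np.linalg.inv(Z.T @ Z)) / reps

V_plus = sigma2 * acc - sigma2 * p / n   # sigma^2/n * tr(n E[(Z'Z)^{-1}]) - sigma^2 p/n
print(V_plus, sigma2 * gamma**2 / (1 - gamma))   # both ≈ 0.27
```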
The next result is stated for completeness.

Corollary 3. Assume the conditions of Theorem 3, and moreover, assume that the linear model is unbiased for $n$ large enough. Then

$$\mathrm{OptR} \to \sigma^2 \gamma \, \frac{2 - \gamma}{1 - \gamma} \quad \text{as } n \to \infty.$$

It should be noted that the requirement of Theorem 3 that the covariate vector $x$ be expressible as $\Sigma^{1/2} z$ with the entries of $z$ i.i.d. is not a minor one, and limits the set of covariate distributions for which this result applies, as has been discussed in the literature on random matrix theory (e.g., El Karoui 2009). In particular, left multiplication by the square root matrix $\Sigma^{1/2}$ performs a kind of averaging operation. Consequently, the covariates $x$ can either have long-tailed distributions, or have complex dependence structures, but not both, since the averaging will mitigate any long tail of the distribution $F$. In our simulations in Section 5, we examine some settings that combine both elements, and indeed the value of $V^+$ in such settings can deviate substantially from what this theory suggests.

We maintain the setting of the last section, taking $\hat f_n$ to be the least squares regression estimator of $Y$ on $X$, where $X$ has full column rank (almost surely under its marginal distribution $Q^n$). Let us denote $\mathrm{RSS} = \|Y - \hat f_n(X)\|_2^2$, and recall Mallows' Cp (Mallows, 1973), which is defined as $\mathrm{Cp} = \mathrm{RSS}/n + 2\sigma^2 p/n$. The results in Theorems 2 and 3 lead us to define the following generalized covariance penalty criterion, which we term RCp:

$$\mathrm{RCp} = \mathrm{Cp} + V^+ = \frac{\mathrm{RSS}}{n} + \frac{\sigma^2 p}{n}\bigg(2 + \frac{p + 1}{n - p - 1}\bigg).$$
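In code, RCp is a one-line modification of Cp. A minimal sketch (the function name is ours), assuming $\sigma^2$ is known:

```python
import numpy as np

def rcp(X, Y, sigma2):
    """RCp estimate of Random-X prediction error for least squares,
    assuming known noise variance sigma2 (setting of Theorems 2 and 3)."""
    n, p = X.shape
    rss = np.sum((Y - X @ np.linalg.lstsq(X, Y, rcond=None)[0]) ** 2)
    return rss / n + sigma2 * p / n * (2 + (p + 1) / (n - p - 1))
```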
An asymptotic approximation is given by $\mathrm{RCp} \approx \mathrm{RSS}/n + \sigma^2\gamma\big(2 + \gamma/(1 - \gamma)\big)$, in a problem scaling where $p/n \to \gamma \in (0, 1)$.

Two issues can limit the practical application of RCp:

(i) The assumption that $\sigma^2$ is known. This obviously affects the use of Cp in Fixed-X situations as well, as has been noted in the literature.
(ii) The assumption of no bias.
It is critical to note here the difference from Fixed-X or Same-X situations, where OptS (i.e., Cp) is independent of the bias in the model and must only correct for the "overfitting" incurred by model fitting. In contrast, in Random-X, the existence of $B^+$, which is a component of OptR not captured by the training error, requires taking it into account in the penalty, if we hope to obtain low-bias estimates of prediction error. Moreover, it is often desirable to assume nothing about the form of the true model $f(x) = E(y|x)$, hence it seems unlikely that theoretical considerations like those presented in Theorems 2 and 3 can lead to estimates of $B^+$.

We now propose enhancements that deal with each of these problems separately.

Estimating $\sigma^2$ in unbiased least squares

Here, we assume that the linear model is unbiased, $f(x) = x^T \beta$, but the variance $\sigma^2$ of the noise in (1) is unknown. In the Fixed-X setting, it is customary to replace $\sigma^2$ in covariance penalty approaches like Cp with the unbiased estimate $\hat\sigma^2 = \mathrm{RSS}/(n - p)$. An obvious choice is to also use $\hat\sigma^2$ in place of $\sigma^2$ in RCp, leading to a generalized covariance penalty criterion we call $\widehat{\mathrm{RCp}}$:
$$\widehat{\mathrm{RCp}} = \frac{\mathrm{RSS}}{n} + \hat\sigma^2 \frac{p}{n}\bigg(2 + \frac{p + 1}{n - p - 1}\bigg) = \frac{\mathrm{RSS}\,(n - 1)}{(n - p)(n - p - 1)}.$$

An asymptotic approximation, under the scaling $p/n \to \gamma \in (0, 1)$, is $\widehat{\mathrm{RCp}} \approx \mathrm{RSS}/\big(n(1 - \gamma)^2\big)$. This penalty, as it turns out, is exactly equivalent to the Sp criterion of Tukey (1967); Sclove (1969); see also Stein (1960); Hocking (1976); Thompson (1978a,b). These authors all studied the case in which $(x, y)$ is jointly normal, and therefore the linear model is assumed correct for the full model and any submodel. The asymptotic approximation, on the other hand, is equivalent to the GCV (generalized cross-validation) criterion of Craven and Wahba (1978); Golub et al. (1979), though the motivation behind the derivation of GCV is somewhat different.

Comparing $\widehat{\mathrm{RCp}}$ to RCp as a model evaluation criterion, we can see the price of estimating $\sigma^2$ as opposed to knowing it, in their asymptotic approximations. Their expectations are similar when the linear model is true, but the variance of (the asymptotic form of) $\widehat{\mathrm{RCp}}$ is roughly $1/(1 - \gamma)^4$ times larger than that of (the asymptotic form of) RCp. So when, e.g., $\gamma = 0.5$, the price of not knowing $\sigma^2$ translates roughly into a 16-fold increase in the variance of the model evaluation metric. This is clearly demonstrated in our simulation results in the next section.
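For reference, both forms in code (a sketch; the function names are ours):

```python
import numpy as np

def rcp_hat(X, Y):
    """RCp with sigma^2 replaced by RSS/(n - p); algebraically equal to
    RSS (n-1) / ((n-p)(n-p-1)), i.e., Tukey's Sp criterion."""
    n, p = X.shape
    rss = np.sum((Y - X @ np.linalg.lstsq(X, Y, rcond=None)[0]) ** 2)
    return rss * (n - 1) / ((n - p) * (n - p - 1))

def gcv(X, Y):
    """GCV = (RSS/n) / (1 - p/n)^2, the asymptotic form of the above."""
    n, p = X.shape
    rss = np.sum((Y - X @ np.linalg.lstsq(X, Y, rcond=None)[0]) ** 2)
    return rss / n / (1 - p / n) ** 2
```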
Estimating $B^+$

Next, we move to assuming nothing about the underlying regression function $f(x) = E(y|x)$, and we examine methods that account for the resulting bias $B^+$. First we consider the behavior of $\widehat{\mathrm{RCp}}$ (or equivalently Sp) in the case that bias is present. Though this criterion was not designed to account for bias at all, we will see it still performs an inherent bias correction.
A straightforward calculation shows that in this case

$$E_{X,Y}\,\mathrm{RSS} = (n - p)\sigma^2 + nB,$$

where recall $B = E_X \big\|E(\hat f_n(X) \,|\, X) - f(X)\big\|_2^2 / n$, generally nonzero in the current setting, and thus
$$E_{X,Y}\,\widehat{\mathrm{RCp}} = \sigma^2 \frac{n - 1}{n - p - 1} + B \frac{n(n - 1)}{(n - p)(n - p - 1)} \approx \frac{\sigma^2}{1 - \gamma} + \frac{B}{(1 - \gamma)^2},$$

the last step using an asymptotic approximation, under the scaling $p/n \to \gamma \in (0, 1)$. Thus the bias component implicit in $\widehat{\mathrm{RCp}}$ is larger than the integrated Same-X bias $B$ by a factor of $1/(1 - \gamma)^2$. Put differently, $\widehat{\mathrm{RCp}}$ implicitly assumes that $B^+$ is (roughly) $1/(1 - \gamma)^2 - 1$ times $B$. We will see that $\widehat{\mathrm{RCp}}$ still provides reasonably good estimates of Random-X prediction error in biased situations in Section 5. A partial explanation is available through a connection to OCV, as discussed, e.g., in the derivation of GCV in Craven and Wahba (1978). We return to this issue in Section 8.

We describe a more principled approach to estimating the integrated Random-X bias, $B + B^+$, assuming knowledge of $\sigma^2$, and leveraging a bias estimate implicit to OCV. Recall that OCV builds $n$ models, each time leaving one observation out, applying the fitted model to that observation, and using these $n$ holdout predictions to estimate prediction error. Thus it gives us an almost-unbiased estimate of Random-X prediction error ErrR ("almost", because its training sets are all of size $n - 1$ rather than $n$). For least squares regression (and other estimators), the well-known "shortcut trick" for OCV (e.g., Wahba 1990; Hastie et al. 2009) allows us to represent the OCV residuals in terms of weighted training residuals. Write $\hat f_n^{(-i)}$ for the least squares estimator trained on all but $(x_i, y_i)$, and $h_{ii}$ for the $i$th diagonal element of $X(X^T X)^{-1} X^T$, for $i = 1, \ldots, n$. Then this trick tells us that

$$y_i - \hat f_n^{(-i)}(x_i) = \frac{y_i - \hat f_n(x_i)}{1 - h_{ii}},$$

which can be checked by applying the Sherman-Morrison update formula for relating the inverse of a matrix to the inverse of its rank-one perturbation. Hence the OCV error can be expressed as

$$\mathrm{OCV} = \frac{1}{n} \sum_{i=1}^n \Big(y_i - \hat f_n^{(-i)}(x_i)\Big)^2 = \frac{1}{n} \sum_{i=1}^n \bigg(\frac{y_i - \hat f_n(x_i)}{1 - h_{ii}}\bigg)^2.$$

Taking an expectation conditional on $X$, we find that

$$E(\mathrm{OCV} \,|\, X) = \frac{1}{n} \sum_{i=1}^n \frac{\mathrm{Var}(y_i - \hat f_n(x_i) \,|\, X)}{(1 - h_{ii})^2} + \frac{1}{n} \sum_{i=1}^n \frac{\big[f(x_i) - E(\hat f_n(x_i) \,|\, X)\big]^2}{(1 - h_{ii})^2} = \frac{\sigma^2}{n} \sum_{i=1}^n \frac{1}{1 - h_{ii}} + \frac{1}{n} \sum_{i=1}^n \frac{\big[f(x_i) - E(\hat f_n(x_i) \,|\, X)\big]^2}{(1 - h_{ii})^2}, \quad (8)$$

where the second expression uses $\mathrm{Var}(y_i - \hat f_n(x_i) \,|\, X) = (1 - h_{ii})\sigma^2$, $i = 1, \ldots, n$. The above display shows that

$$\mathrm{OCV} - \frac{\sigma^2}{n} \sum_{i=1}^n \frac{1}{1 - h_{ii}} = \frac{1}{n} \sum_{i=1}^n \frac{\big(y_i - \hat f_n(x_i)\big)^2 - (1 - h_{ii})\sigma^2}{(1 - h_{ii})^2}$$

is an almost-unbiased estimate of the integrated Random-X prediction bias, $B + B^+$ (it is almost-unbiased, due to the almost-unbiased status of OCV as an estimate of Random-X prediction error). Meanwhile, an unbiased estimate of the integrated Same-X prediction bias $B$ is

$$\frac{\mathrm{RSS}}{n} - \frac{\sigma^2(n - p)}{n} = \frac{1}{n} \sum_{i=1}^n \Big(\big(y_i - \hat f_n(x_i)\big)^2 - (1 - h_{ii})\sigma^2\Big).$$

Subtracting the last display from the second to last delivers

$$\widehat{B^+} = \frac{1}{n} \sum_{i=1}^n \Big(\big(y_i - \hat f_n(x_i)\big)^2 - (1 - h_{ii})\sigma^2\Big)\bigg(\frac{1}{(1 - h_{ii})^2} - 1\bigg),$$

an almost-unbiased estimate of the excess bias $B^+$. We now define a generalized covariance penalty criterion that we call RCp+ by adding this to RCp:

$$\mathrm{RCp}^+ = \mathrm{RCp} + \widehat{B^+} = \mathrm{OCV} - \frac{\sigma^2}{n} \sum_{i=1}^n \frac{h_{ii}}{1 - h_{ii}} + \frac{\sigma^2 p}{n}\bigg(1 + \frac{p + 1}{n - p - 1}\bigg).$$
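Everything above can be computed directly from the training residuals and the leverage values $h_{ii}$. A sketch (the function name is ours), assuming known $\sigma^2$:

```python
import numpy as np

def rcp_plus(X, Y, sigma2):
    """RCp+ = RCp + B^+ estimate, via the OCV shortcut formula."""
    n, p = X.shape
    H = X @ np.linalg.solve(X.T @ X, X.T)      # hat matrix
    h = np.diag(H)                             # leverages h_ii
    resid = Y - H @ Y                          # training residuals
    ocv = np.mean((resid / (1 - h)) ** 2)      # leave-one-out CV error (reference)
    B_plus_hat = np.mean((resid**2 - (1 - h) * sigma2) * (1 / (1 - h) ** 2 - 1))
    rcp = np.sum(resid**2) / n + sigma2 * p / n * (2 + (p + 1) / (n - p - 1))
    # rcp + B_plus_hat equals
    # ocv - sigma2/n * np.sum(h/(1-h)) + sigma2*p/n * (1 + (p+1)/(n-p-1))
    return rcp + B_plus_hat
```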
It is worth pointing out that, like RCp and $\widehat{\mathrm{RCp}}$, RCp+ assumes that we are in a setting covered by Theorem 2, or asymptotically by Theorem 3, as it takes advantage of the value of $V^+$ prescribed by these theorems.

A key question, of course, is: what have we achieved by moving from OCV to RCp+, i.e., can we explicitly show that RCp+ is preferable to OCV for estimating Random-X prediction error when its assumptions hold? We give a partial positive answer next.

RCp+ and OCV

As already discussed, OCV is an almost-unbiased estimate of Random-X prediction error (or an unbiased estimate of Random-X prediction error for the procedure in question, here least squares, applied to a training set of size $n - 1$), for any distribution of $X$. It should be emphasized that OCV has the significant advantage over RCp+ of not requiring knowledge of $\sigma^2$ or assumptions on $Q$. Assuming that $\sigma^2$ is known and $Q$ is well-behaved, we can compare the two criteria for estimating Random-X prediction error in least squares.

OCV is generally slightly conservative as an estimate of Random-X prediction error, as models trained on more observations are generally expected to be better. RCp+ does not suffer from such slight conservativeness in the variance component, relying on the integrated variance from theory, and in that regard it may already be seen as an improvement. However, we will choose to ignore this issue of conservativeness, as the difference between training on $n - 1$ and $n$ observations is clearly small when $n$ is large. Thus, we can approximate the mean squared error or MSE of each method, as an estimate of Random-X prediction error, as

$$E(\mathrm{OCV} - \mathrm{ErrR})^2 \approx \mathrm{Var}_X\big(E(\mathrm{OCV} \,|\, X)\big) + E_X\big(\mathrm{Var}(\mathrm{OCV} \,|\, X)\big),$$

$$E(\mathrm{RCp}^+ - \mathrm{ErrR})^2 \approx \mathrm{Var}_X\big(E(\mathrm{RCp}^+ \,|\, X)\big) + E_X\big(\mathrm{Var}(\mathrm{RCp}^+ \,|\, X)\big),$$

where these two approximations would be equalities if OCV and RCp+ were exactly unbiased estimates of ErrR. Note that, conditional on $X$, the difference between OCV and RCp+ is nonrandom (conditional on $X$, all diagonal entries $h_{ii}$, $i = 1, \ldots, n$, are nonrandom). Hence $E_X \mathrm{Var}(\mathrm{OCV} \,|\, X) = E_X \mathrm{Var}(\mathrm{RCp}^+ \,|\, X)$, and we are left to compare $\mathrm{Var}_X E(\mathrm{OCV} \,|\, X)$ and $\mathrm{Var}_X E(\mathrm{RCp}^+ \,|\, X)$, according to the (approximate) expansions above, to compare the MSEs of OCV and RCp+.

Denote the two terms in (8) by $v(X)$ and $b(X)$, respectively, so that $E(\mathrm{OCV} \,|\, X) = v(X) + b(X)$ can be viewed as a decomposition into variance and bias components, and note that by construction

$$\mathrm{Var}_X\big(E(\mathrm{OCV} \,|\, X)\big) = \mathrm{Var}_X\big(v(X) + b(X)\big) \quad \text{and} \quad \mathrm{Var}_X\big(E(\mathrm{RCp}^+ \,|\, X)\big) = \mathrm{Var}_X\big(b(X)\big).$$
It seems reasonable to believe that $\mathrm{Var}_X E(\mathrm{OCV} \,|\, X) \geq \mathrm{Var}_X E(\mathrm{RCp}^+ \,|\, X)$ would hold in most cases, thus RCp+ would be no worse than OCV. One situation in which this occurs is the case when the linear model is unbiased, hence $b(X) = 0$ and consequently $\mathrm{Var}_X\big(E(\mathrm{RCp}^+ \,|\, X)\big) = \mathrm{Var}_X\big(b(X)\big) = 0$. In general, $\mathrm{Var}_X E(\mathrm{OCV} \,|\, X) \geq \mathrm{Var}_X E(\mathrm{RCp}^+ \,|\, X)$ is guaranteed when $\mathrm{Cov}_X\big(v(X), b(X)\big) \geq 0$.
0. Thismeans that choices of X that give large variance tend to also give large bias, which seems reasonableto assume and indeed appears to be true in our experiments. But, this covariance depends on theunderlying mean function f ( x ) = E ( y | x ) in complicated ways, and at the moment it eludes rigorousanalysis. We empirically study the decomposition of Random-X prediction error into its various componentsfor least squares regression in different problem settings, and examine the performance of the variousmodel evaluation criteria in these settings. The only criterion which is assumption-free and shouldinvariably give unbiased estimates of Random-X prediction error is OCV (modulo the slight bias inusing n − n training observations). Thus we may consider OCV as the “gold standard”approach, and we will hold the other methods up to its standard under different conditions, eitherwhen the assumptions they use hold or are violated.Before diving into the details, here is a high-level summary of the results: RCp performs verywell in unbiased settings (when the mean is linear), but very poorly in biased ones (when the meanis nonlinear); RCp + and (cid:100) RCp peform well overall, with (cid:100)
having an advantage over RCp+, and even holding a small advantage over OCV, in essentially all settings, unbiased and biased. This is perhaps a bit surprising since $\widehat{\mathrm{RCp}}$ is not designed to account for bias, but then again, not as surprising once we recall that $\widehat{\mathrm{RCp}}$ is closely related to GCV.

We perform experiments in a total of six data-generating mechanisms, based on three different distributions $Q$ for the covariate vector $x$, and two models for $f(x) = E(y|x)$, one unbiased (linear) and the other biased (nonlinear). The three generating models for $x$ are as follows.

• Normal.
We choose $Q = N(0, \Sigma)$, where $\Sigma$ is block-diagonal, containing five blocks such that all variables in a block have pairwise correlation 0.9.

• Uniform.
We define $Q$ by taking $N(0, \Sigma)$ as above, then applying the standard normal distribution function componentwise. In other words, this can be seen as a Gaussian copula with uniform marginals.

• $t(4)$. We define $Q$ by taking $N(0, \Sigma)$ as above, then adjusting the marginal distributions appropriately; this is again a Gaussian copula, now with $t(4)$ marginals.

Note that Theorem 2 covers the normal setting (and in fact, the covariance matrix $\Sigma$ plays no role in the RCp estimate), while the uniform and $t(4)$ settings do not comply with either Theorem 2 or 3. Also, the latter two settings differ considerably in the nature of the distribution $Q$: finite support versus long tails, respectively. The two generating models for $y|x$ both use $\epsilon \sim N(0, 1)$, but differ in the specification of the mean function $f(x) = E(y|x)$, as follows.

• Unbiased.
We set $f(x) = \sum_{j=1}^p x_j$.

• Biased.
We set $f(x) = C \sum_{j=1}^p |x_j|$.

The simulations discussed in the coming subsections all use $n = 100$ training observations. In the "high-dimensional" case, we use $p = 50$ variables and $C = 0.75$, while in the "low-dimensional, extreme bias" case, we use $p = 10$ and $C = 100$. In both cases, we use a test set of $10^4$ observations to evaluate Random-X quantities like ErrR, $B^+$, $V^+$. Lastly, all figures show results averaged over 5000 repetitions.
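A sketch of these six data-generating mechanisms in code (ours; the exact standardization of the copula marginals in the paper may differ slightly, and SciPy is used for the marginal transforms):

```python
import numpy as np
from scipy.stats import norm, t

rng = np.random.default_rng(4)
n, p, C = 100, 50, 0.75   # high-dimensional case; low-dimensional: p, C = 10, 100

# Block-diagonal Sigma: five equal blocks, pairwise correlation 0.9 within a block
block = 0.9 * np.ones((p // 5, p // 5)) + 0.1 * np.eye(p // 5)
L = np.linalg.cholesky(np.kron(np.eye(5), block))

def draw_x(dist):
    Z = rng.standard_normal((n, p)) @ L.T   # N(0, Sigma); unit marginal variances
    if dist == "normal":
        return Z
    U = norm.cdf(Z)                         # Gaussian copula
    return U if dist == "uniform" else t.ppf(U, df=4)   # uniform or t(4) marginals

def draw_y(X, biased):
    f = C * np.abs(X).sum(axis=1) if biased else X.sum(axis=1)
    return f + rng.standard_normal(len(X))  # epsilon ~ N(0, 1)
```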
The components of Random-X prediction error

We empirically evaluate
$B$, $V$, $B^+$, $V^+$ for least squares regression fitted in the six settings (three for the distribution of $x$ times two for $E(y|x)$) in the high-dimensional case, with $n = 100$ and $p = 50$. The results are shown in Figure 1. We can see that the value of $V^+$ implied by Theorem 2 is extremely accurate for the normal setting, and also very accurate for the short-tailed uniform setting. However, for the $t(4)$ setting, the value of $V^+$ is quite a bit higher than what the theory implies. In terms of bias, we observe that for the biased settings the value of $B^+$ is bigger than the Same-X bias $B$, and so it must be taken into account if we hope to obtain reasonable estimates of Random-X prediction error ErrR.
Figure 1:
Decomposition of Random-X prediction error into its reducible components: Same-X bias $B$, Same-X variance $V$, excess bias $B^+$, and excess variance $V^+$, in the "high-dimensional" case with $n = 100$ and $p = 50$. Reference lines in the figure mark the theoretical Same-X variance and the Random-X variance implied by Theorem 2.

Next we compare the performance of the proposed criteria for estimating the Random-X prediction error of least squares over the six simulation settings. The results in Figures 2 and 3 correspond to the "high-dimensional" case with $n = 100$ and $p = 50$ and the "low-dimensional, extreme bias" case with $n = 100$ and $p = 10$, respectively. Displayed are the MSEs in estimating the Random-X prediction error, relative to OCV; also, the MSE for each method is broken down into squared bias and variance components.

Figure 2: MSEs of the different methods in estimating Random-X prediction error relative to OCV, in the "high-dimensional" case with $n = 100$ and $p = 50$. The plot is truncated at a relative error of 3 for clarity, but the RCp relative errors continue as high as 10 in the biased settings.

In the high-dimensional case in Figure 2, we see that for the true linear models (three leftmost scenarios), RCp has by far the lowest MSE in estimating Random-X prediction error, much better than OCV. For the normal and uniform covariate distributions, it also has no bias in estimating this error, as warranted by Theorem 2 for the normal setting. For the $t(4)$ distribution, there is already significant bias in the prediction error estimates generated by RCp, as is expected from the results in Figure 1; however, if the linear model is correct then we see RCp still has three- to five-fold lower MSE compared to all other methods. The situation changes dramatically when bias is added (three rightmost scenarios). Now, RCp is by far the worst method, failing completely to account for large $B^+$, and its relative MSE compared to OCV reaches as high as 10.

As for RCp+ and $\widehat{\mathrm{RCp}}$ in the high-dimensional case, we see that RCp+ indeed has lower error than OCV under the normal models, as argued in Section 4.4, and also in the uniform models. This is true regardless of the presence of bias. The difference, however, is small: between 0.1% and 0.7%. In these settings, we can see $\widehat{\mathrm{RCp}}$ has even lower MSE than RCp+, with no evident bias in dealing with the biased models. For the long-tailed $t(4)$ distribution, both $\widehat{\mathrm{RCp}}$ and RCp+ suffer some bias in estimating prediction error, as expected. Interestingly, in the nonlinear model with $t(4)$ covariates (rightmost scenario), $\widehat{\mathrm{RCp}}$ does suffer significant bias in estimating prediction error, as opposed to RCp+. However, this bias does not offset the increased variance due to RCp+/OCV.

Figure 3: MSEs of the different methods in estimating Random-X prediction error relative to OCV, in the "low-dimensional, extreme bias" case with $n = 100$ and $p = 10$.

In the low-dimensional case in Figure 3, many of the same conclusions apply: RCp does well if the linear model is correct, even with the long-tailed covariate distribution, but fails completely in the presence of nonlinearity. Also, RCp+ performs almost identically to OCV throughout. The most important distinction is the failure of $\widehat{\mathrm{RCp}}$ in the normal covariate, biased setting, where it suffers significant bias in estimating the prediction error (see the circled region in the plot). This demonstrates that the heuristic correction for $B^+$ employed by $\widehat{\mathrm{RCp}}$ can fail when the linear model does not hold, as opposed to RCp+ and OCV. We discuss this further in Section 8.

In this section, we examine ridge regression, which behaves similarly in some ways to least squares regression, and differently in others. In particular, like least squares, it has nonnegative excess bias, but unlike least squares, it can have negative excess variance, increasingly so for larger amounts of regularization. These results are established in the subsections below, where we study excess bias and variance separately. Throughout, we will write $\hat f_n$ for the estimator from the ridge regression of $Y$ on $X$, i.e., $\hat f_n(x) = x^T(X^T X + \lambda I)^{-1} X^T Y$, where the tuning parameter $\lambda \geq 0$ is considered fixed (with the dependence of $\hat f_n$ on $\lambda$ implicit). When $\lambda = 0$, we must assume that $X$ has full column rank (almost surely under its marginal distribution $Q^n$),
but when $\lambda > 0$, no assumption is needed on $X$.

Nonnegativity of $B^+$

We prove an extension of the excess bias result in Theorem 1 for least squares regression: the excess bias in ridge regression is also nonnegative.
Theorem 4.
For $\hat f_n$ the ridge regression estimator, we have $B^+ \geq 0$.

Proof. This result is actually itself a special case of Theorem 6; the latter is phrased in somewhat different (functional) notation, so for concreteness, we give a direct proof of the result for ridge regression here. Let $X_0 \in \mathbb{R}^{n \times p}$ be a matrix of test covariate values, with rows i.i.d. from $Q$, and let $Y_0 \in \mathbb{R}^n$ be the vector of associated test response values. Then the excess bias in (4) can be written as

$$B^+ = E_{X,X_0} \frac{1}{n}\Big\|E\big(\hat f_n(X_0) \,|\, X, X_0\big) - f(X_0)\Big\|_2^2 - E_X \frac{1}{n}\Big\|E\big(\hat f_n(X) \,|\, X\big) - f(X)\Big\|_2^2.$$

Note $\hat f_n(X_0) = X_0(X^T X + \lambda I)^{-1} X^T Y$, and by linearity, $E(\hat f_n(X_0) \,|\, X, X_0) = X_0(X^T X + \lambda I)^{-1} X^T f(X)$. Recalling the optimization problem underlying ridge regression, we thus have

$$E\big(\hat f_n(X) \,|\, X\big) = \operatorname*{argmin}_{X\beta \,\in\, \mathbb{R}^n} \|f(X) - X\beta\|_2^2 + \lambda\|\beta\|_2^2.$$

An analogous statement holds for $\hat f_n^0$, which we write to denote the result of the ridge regression of $Y_0$ on $X_0$; we have

$$E\big(\hat f_n^0(X_0) \,|\, X_0\big) = \operatorname*{argmin}_{X_0\beta \,\in\, \mathbb{R}^n} \|f(X_0) - X_0\beta\|_2^2 + \lambda\|\beta\|_2^2.$$

Now write $\beta_n = (X^T X + \lambda I)^{-1} X^T f(X)$ and $\beta_n^0 = (X_0^T X_0 + \lambda I)^{-1} X_0^T f(X_0)$ for convenience. By optimality of $X_0 \beta_n^0$ for the minimization problem in the last display,

$$\|X_0 \beta_n - f(X_0)\|_2^2 + \lambda\|\beta_n\|_2^2 \geq \|X_0 \beta_n^0 - f(X_0)\|_2^2 + \lambda\|\beta_n^0\|_2^2,$$

and taking an expectation over $X, X_0$ gives

$$E_{X,X_0}\Big[\|X_0 \beta_n - f(X_0)\|_2^2 + \lambda\|\beta_n\|_2^2\Big] \geq E_{X_0}\Big[\|X_0 \beta_n^0 - f(X_0)\|_2^2 + \lambda\|\beta_n^0\|_2^2\Big] = E_X\Big[\|X \beta_n - f(X)\|_2^2 + \lambda\|\beta_n\|_2^2\Big],$$

where in the last step we used the fact that $(X, Y)$ and $(X_0, Y_0)$ are identical in distribution. Cancelling out the common term $\lambda E_X\|\beta_n\|_2^2$ from the first and third expressions establishes the result, since $E(\hat f_n(X_0) \,|\, X, X_0) = X_0 \beta_n$ and $E(\hat f_n(X) \,|\, X) = X \beta_n$.
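As a quick numerical check (our own sketch, with arbitrary constants and a deliberately nonlinear $f$), the excess bias of ridge regression can be estimated directly from its definition by averaging the two bias terms over draws of $X$ and $X_0$:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, lam, reps = 50, 10, 5.0, 5000

Bp = 0.0
for _ in range(reps):
    X, X0 = rng.standard_normal((n, p)), rng.standard_normal((n, p))
    fX, fX0 = np.abs(X).sum(axis=1), np.abs(X0).sum(axis=1)   # a nonlinear f
    beta_n = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ fX)
    Bp += (np.mean((X0 @ beta_n - fX0) ** 2)
           - np.mean((X @ beta_n - fX) ** 2)) / reps

print(Bp)   # comes out positive, consistent with Theorem 4
```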
Negativity of $V^+$ for large $\lambda$

Here we present two complementary results on the variance side.

Proposition 2.
For $\hat f_n$ the ridge regression estimator, the integrated Random-X prediction variance,

$$V + V^+ = E_{X,x_0} \mathrm{Var}\big(\hat f_n(x_0) \,|\, X, x_0\big),$$

is a nonincreasing function of $\lambda$.

Proof. As in the proofs of Theorems 1 and 4, let $X_0 \in \mathbb{R}^{n \times p}$ be a test covariate matrix, and notice that we can write the integrated Random-X variance as

$$V + V^+ = E_{X,X_0} \frac{1}{n} \mathrm{tr}\Big[\mathrm{Cov}\big(\hat f_n(X_0) \,|\, X, X_0\big)\Big].$$

Conditional on $X, X_0$, we have

$$\frac{1}{n} \mathrm{tr}\Big[\mathrm{Cov}\big(\hat f_n(X_0) \,|\, X, X_0\big)\Big] = \frac{\sigma^2}{n} \mathrm{tr}\Big(X_0(X^T X + \lambda I)^{-1} X^T X (X^T X + \lambda I)^{-1} X_0^T\Big) = \frac{\sigma^2}{n} \mathrm{tr}\bigg(X_0^T X_0 \sum_{i=1}^p \frac{u_i u_i^T \, d_i}{(d_i + \lambda)^2}\bigg),$$

where the second equality uses an eigendecomposition $X^T X = UDU^T$, with $U \in \mathbb{R}^{p \times p}$ having orthonormal columns $u_1, \ldots, u_p$ and $D = \mathrm{diag}(d_1, \ldots, d_p)$. Taking a derivative with respect to $\lambda$, we see that

$$\frac{d}{d\lambda}\bigg(\frac{1}{n} \mathrm{tr}\Big[\mathrm{Cov}\big(\hat f_n(X_0) \,|\, X, X_0\big)\Big]\bigg) = -\frac{2\sigma^2}{n} \mathrm{tr}\bigg(X_0^T X_0 \sum_{i=1}^p \frac{u_i u_i^T \, d_i}{(d_i + \lambda)^3}\bigg) \leq 0,$$

the inequality due to the fact that $\mathrm{tr}(AB) \geq 0$ when $A, B$ are positive semidefinite matrices. Taking an expectation and switching the order of integration and differentiation
(which is possible because the integrand is a continuously differentiable function of $\lambda > 0$) gives

$$\frac{d}{d\lambda}\bigg(E_{X,X_0} \frac{1}{n} \mathrm{tr}\Big[\mathrm{Cov}\big(\hat f_n(X_0) \,|\, X, X_0\big)\Big]\bigg) = E_{X,X_0} \frac{d}{d\lambda}\bigg(\frac{1}{n} \mathrm{tr}\Big[\mathrm{Cov}\big(\hat f_n(X_0) \,|\, X, X_0\big)\Big]\bigg) \leq 0,$$

the desired result.

The proposition shows that adding regularization guarantees a decrease in variance for Random-X prediction. The same is true of the variance in Same-X prediction. However, as we show next, as the amount of regularization increases these two variances decrease at different rates, a phenomenon that manifests itself in the fact that the Random-X prediction variance is guaranteed to be smaller than the Same-X prediction variance for large enough $\lambda$.
Theorem 5. For $\hat f_n$ the ridge regression estimator, the integrated Same-X prediction variance and integrated Random-X prediction variance both approach zero as $\lambda \to \infty$. Moreover, the limit of their ratio satisfies

$$\lim_{\lambda \to \infty} \frac{E_{X,x_0} \mathrm{Var}(\hat f_n(x_0) \,|\, X, x_0)}{E_X \mathrm{Var}(\hat f_n(x_1) \,|\, X)} = \frac{\mathrm{tr}[E(X^T X)\, E(X^T X)]}{\mathrm{tr}[E(X^T X X^T X)]} \leq 1,$$

the last inequality reducing to an equality if and only if $x \sim Q$ is deterministic and has no variance.

Proof. Again, as in the proof of the last proposition as well as Theorems 1 and 4, let $X_0 \in \mathbb{R}^{n \times p}$ be a test covariate matrix, and write the integrated Same-X and Random-X prediction variances as

$$E_X \frac{1}{n} \mathrm{tr}\Big[\mathrm{Cov}\big(\hat f_n(X) \,|\, X\big)\Big] \quad \text{and} \quad E_{X,X_0} \frac{1}{n} \mathrm{tr}\Big[\mathrm{Cov}\big(\hat f_n(X_0) \,|\, X, X_0\big)\Big],$$

respectively. From the arguments in the proof of Proposition 2, letting $X^T X = UDU^T$ be an eigendecomposition with $U \in \mathbb{R}^{p \times p}$ having orthonormal columns $u_1, \ldots, u_p$ and $D = \mathrm{diag}(d_1, \ldots, d_p)$, we have

$$\lim_{\lambda \to \infty} E_{X,X_0} \frac{1}{n} \mathrm{tr}\Big[\mathrm{Cov}\big(\hat f_n(X_0) \,|\, X, X_0\big)\Big] = E_{X,X_0} \lim_{\lambda \to \infty} \frac{\sigma^2}{n} \mathrm{tr}\bigg(X_0^T X_0 \sum_{i=1}^p \frac{u_i u_i^T \, d_i}{(d_i + \lambda)^2}\bigg) = 0,$$

where we used the dominated convergence theorem to exchange the limit and the expectation (since, for $\lambda \geq 1$ say, $E_{X,X_0} \mathrm{tr}[X_0^T X_0 \sum_{i=1}^p u_i u_i^T (d_i/(d_i + \lambda)^2)] \leq E_{X,X_0} \mathrm{tr}(X_0^T X_0 \, X^T X) < \infty$). Similar arguments show that the integrated Same-X prediction variance also tends to zero.

Now we consider the limiting ratio of the integrated variances,

$$\lim_{\lambda\to\infty} \frac{E_{X,X_0}\mathrm{tr}[\mathrm{Cov}(\hat f_n(X_0) \,|\, X,X_0)]}{E_X \mathrm{tr}[\mathrm{Cov}(\hat f_n(X) \,|\, X)]} = \lim_{\lambda\to\infty} \frac{E_{X,X_0}\mathrm{tr}[\lambda^2 X_0(X^TX+\lambda I)^{-1}X^TX(X^TX+\lambda I)^{-1}X_0^T]}{E_X \mathrm{tr}[\lambda^2 X(X^TX+\lambda I)^{-1}X^TX(X^TX+\lambda I)^{-1}X^T]} = \frac{\lim_{\lambda\to\infty} E_{X,X_0}\mathrm{tr}[\lambda^2 X_0(X^TX+\lambda I)^{-1}X^TX(X^TX+\lambda I)^{-1}X_0^T]}{\lim_{\lambda\to\infty} E_X \mathrm{tr}[\lambda^2 X(X^TX+\lambda I)^{-1}X^TX(X^TX+\lambda I)^{-1}X^T]},$$

where the last step holds provided that the numerator and denominator both converge to finite nonzero limits, as will be confirmed by our arguments below. We study the numerator first. Noting that

$$\lambda^2 (X^TX+\lambda I)^{-1}X^TX(X^TX+\lambda I)^{-1} - X^TX \ \text{has eigenvalues} \ d_i\bigg(\frac{\lambda^2}{(d_i+\lambda)^2} - 1\bigg), \quad i = 1,\ldots,p,$$

we have that $\lambda^2(X^TX+\lambda I)^{-1}X^TX(X^TX+\lambda I)^{-1} \to X^TX$ as $\lambda\to\infty$, in (say) the operator norm, implying $\mathrm{tr}[\lambda^2 X_0(X^TX+\lambda I)^{-1}X^TX(X^TX+\lambda I)^{-1}X_0^T] \to \mathrm{tr}(X_0 X^TX X_0^T)$ as $\lambda\to\infty$. Hence

$$\lim_{\lambda\to\infty} E_{X,X_0}\mathrm{tr}\Big[\lambda^2 X_0(X^TX+\lambda I)^{-1}X^TX(X^TX+\lambda I)^{-1}X_0^T\Big] = E_{X,X_0}\mathrm{tr}(X_0 X^TX X_0^T) = \mathrm{tr}\big[E(X^TX)\,E(X_0^TX_0)\big] = \mathrm{tr}\big[E(X^TX)\,E(X^TX)\big].$$

Here, in the first step we applied the dominated convergence theorem as previously, in the second we used the independence of
$X$ and $X_0$, and in the last we used the identical distribution of $X$ and $X_0$. Similar arguments lead to the conclusion for the denominator:

$$\lim_{\lambda\to\infty} E_X \mathrm{tr}\Big[\lambda^2 X(X^TX+\lambda I)^{-1}X^TX(X^TX+\lambda I)^{-1}X^T\Big] = \mathrm{tr}\big[E(X^TX\,X^TX)\big],$$

and thus we have shown that

$$\lim_{\lambda\to\infty} \frac{E_{X,X_0}\mathrm{tr}[\mathrm{Cov}(\hat f_n(X_0) \,|\, X,X_0)]}{E_X \mathrm{tr}[\mathrm{Cov}(\hat f_n(X) \,|\, X)]} = \frac{\mathrm{tr}[E(X^TX)\,E(X^TX)]}{\mathrm{tr}[E(X^TX\,X^TX)]},$$

as desired. To see that the ratio on the right-hand side is at most 1, consider

$$A = E(X^TX\,X^TX) - E(X^TX)\,E(X^TX),$$

which is a symmetric matrix whose trace is

$$\mathrm{tr}(A) = \sum_{i,j=1}^p \mathrm{Var}\big((X^TX)_{ij}\big) \geq 0.$$

Furthermore, the trace is zero if and only if all summands are zero, which occurs if and only if all components of $x \sim Q$ have no variance.

In words, the theorem shows that the excess variance $V^+$ of ridge regression approaches zero as $\lambda \to \infty$, but it does so from the left (negative side) of zero. As we can have cases in which the excess bias is very small or even zero (for example, a null model like in our simulations below), we see that $\mathrm{ErrR} - \mathrm{ErrS} = V^+$ can be negative for ridge regression with a large level of regularization; this is a striking contrast to the behavior of this gap for least squares, where it is always nonnegative.

We finish by demonstrating this result empirically, using a simple simulation setup with $p = 100$ covariates drawn from $Q = N(0, I)$, and training and test sets each of size $n = 300$. The underlying regression function was $f(x) = E(y|x) = 0$, i.e., there was no signal, and the errors were also standard normal. We drew training and test data from this simulation setup, fit ridge regression estimators to the training set at various levels of $\lambda$, and calculated the ratio of the sample versions of the Random-X and Same-X integrated variances. We repeated this 100 times, and averaged the results. As shown in Figure 4, for values of $\lambda$ larger than about 250, the Random-X integrated variance is smaller than the Same-X integrated variance, and consequently the same is true of the prediction errors (as there is no signal, the Same-X and Random-X integrated biases are both zero). Also shown in the figure is the theoretical limiting ratio of the integrated variances according to Theorem 5, which in this case can be calculated from the properties of Wishart distributions to be $n^2 p/(n^2 p + np^2 + np) = n/(n + p + 1) \approx 0.75$.
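This simulation is easy to reproduce in outline: given $X$ and an independent test matrix $X_0$, both conditional variances reduce to traces, as in the proof above. A sketch (ours, with a small grid of $\lambda$ values):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, reps = 300, 100, 100
lams = [10.0, 100.0, 250.0, 1000.0]

ratios = np.zeros(len(lams))
for _ in range(reps):
    X, X0 = rng.standard_normal((n, p)), rng.standard_normal((n, p))
    G, G0 = X.T @ X, X0.T @ X0
    for j, lam in enumerate(lams):
        R = np.linalg.inv(G + lam * np.eye(p))
        K = R @ G @ R    # (X'X + lam I)^{-1} X'X (X'X + lam I)^{-1}
        # Random-X variance over Same-X variance, conditional on X, X0
        ratios[j] += np.trace(G0 @ K) / np.trace(G @ K) / reps

print(dict(zip(lams, ratios.round(3))))
# the ratio crosses below 1 near lambda = 250 and tends to n/(n+p+1) ≈ 0.748
```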
Figure 4: Ratio of Random-X integrated variance to Same-X integrated variance for ridge regression as we vary the tuning parameter $\lambda$, in a simple problem setting with $p = 100$ covariates and training and test sets each of size $n = 300$. For values of $\lambda$ larger than about 250, the ratio is smaller than 1, meaning the Random-X prediction variance is smaller than the Same-X prediction variance; as the integrated bias is zero in both settings, the same ordering also applies to the Random-X and Same-X prediction errors. The plotted curve is an average of the computed ratios over 100 repetitions, and the dashed lines (hard to see, because they are very close to the aforementioned curve) denote 95% confidence intervals over these repetitions. The red line is the theoretical limiting ratio of integrated variances due to Theorem 5, in good agreement with the simulation results.

Nonparametric regression estimators
We present a brief study of the excess bias and variance of some common nonparametric regressionestimators. In Section 8, we give a high-level discussion of the view on the gap between Random-Xand Same-X prediction errors from the perspective of empirical process theory, which is a topic thatis well-studied by researchers in nonparametric regression.
Consider an estimator $\hat f_n$ defined by the general-form functional optimization problem
$$\hat f_n = \operatorname*{argmin}_{g \in \mathcal{G}} \; \sum_{i=1}^n \big(y_i - g(x_i)\big)^2 + J(g), \qquad (9)$$
where $\mathcal{G}$ is a function class and $J$ is a roughness penalty on functions. Examples of estimators of this form include the (cubic) smoothing spline estimator in $p = 1$ dimensions, in which $\mathcal{G}$ is the space of all functions that are twice differentiable and whose second derivative is square integrable, and $J(g) = \lambda \int g''(t)^2 \, dt$; and more broadly, reproducing kernel Hilbert space or RKHS estimators (in an arbitrary dimension $p$), in which $\mathcal{G}$ is an RKHS and $J(g) = \lambda \|g\|_{\mathcal{G}}^2$ is a multiple of the corresponding squared RKHS norm. Provided that $\hat f_n$ defined by (9) is a linear smoother, which means $\hat f_n(x) = s(x)^T Y$ for a weight function $s: \mathbb{R}^p \to \mathbb{R}^n$ (that can and will generally also depend on $X$), we now show that the excess bias of $\hat f_n$ is always nonnegative. We note that this result applies to smoothing splines and RKHS estimators, since these are linear smoothers; it also covers ridge regression, and thus generalizes the result in Theorem 4.
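For readers who want to see the linear smoother condition concretely, here is a minimal sketch (ours, with ridge regression standing in for (9), since its penalty $J(g) = \lambda \|\beta\|_2^2$ for linear $g(x) = x^T \beta$ fits this form) verifying $\hat f_n(x) = s(x)^T Y$:

```python
import numpy as np

# Ridge regression written as a linear smoother f_hat(x) = s(x)^T Y.
# A minimal illustration (not from the paper); lam = 1.0 is arbitrary.
rng = np.random.default_rng(1)
n, p, lam = 50, 5, 1.0
X = rng.standard_normal((n, p))
Y = rng.standard_normal(n)

beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

def s(x):
    # weight function: depends on the training covariates X, but not on Y
    return X @ np.linalg.solve(X.T @ X + lam * np.eye(p), x)

x_new = rng.standard_normal(p)
assert np.allclose(x_new @ beta, s(x_new) @ Y)  # the two forms agree

# The smoother matrix S(X), with rows s(x_i)^T, maps Y to the fitted values:
S = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
assert np.allclose(S @ Y, X @ beta)
```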
Theorem 6. For $\hat f_n$ a linear smoother defined by a problem of the form (9), we have $B^+ \geq 0$.

Proof. Let us introduce a test covariate matrix $X_0 \in \mathbb{R}^{n \times p}$ and associated response vector $Y_0 \in \mathbb{R}^n$, and write the excess bias in (4) as
$$B^+ = \mathbb{E}_{X,X_0} \frac{1}{n} \Big\| \mathbb{E}\big(\hat f_n(X_0) \mid X, X_0\big) - f(X_0) \Big\|_2^2 - \mathbb{E}_X \frac{1}{n} \Big\| \mathbb{E}\big(\hat f_n(X) \mid X\big) - f(X) \Big\|_2^2.$$
Writing $\hat f_n(x) = s(x)^T Y$ for a weight function $s: \mathbb{R}^p \to \mathbb{R}^n$, let $S(X) \in \mathbb{R}^{n \times n}$ be the smoother matrix that has rows $s(x_1)^T, \ldots, s(x_n)^T$. Thus $\hat f_n(X) = S(X) Y$, and by linearity, $\mathbb{E}(\hat f_n(X) \mid X) = S(X) f(X)$. This in fact means that we can express $g_n = \mathbb{E}(\hat f_n \mid X)$, the function defined by $g_n(x) = s(x)^T f(X)$, as the solution of an optimization problem of the form (9),
$$g_n = \operatorname*{argmin}_{g \in \mathcal{G}} \; \|f(X) - g(X)\|_2^2 + J(g),$$
where we have rewritten the loss term in a more convenient notation. Analogously, if we denote by $\hat f_n^0$ the estimator of the form (9), but fit to the test data $X_0, Y_0$ instead of the training data $X, Y$, and $g_n^0 = \mathbb{E}(\hat f_n^0 \mid X_0)$, then
$$g_n^0 = \operatorname*{argmin}_{g \in \mathcal{G}} \; \|f(X_0) - g(X_0)\|_2^2 + J(g).$$
By virtue of optimality of $g_n^0$ for the problem in the last display, we have
$$\|g_n(X_0) - f(X_0)\|_2^2 + J(g_n) \geq \|g_n^0(X_0) - f(X_0)\|_2^2 + J(g_n^0),$$
and taking an expectation over $X, X_0$ gives
$$\mathbb{E}_{X,X_0} \Big[ \|g_n(X_0) - f(X_0)\|_2^2 + J(g_n) \Big] \geq \mathbb{E}_{X_0} \Big[ \|g_n^0(X_0) - f(X_0)\|_2^2 + J(g_n^0) \Big] = \mathbb{E}_X \Big[ \|g_n(X) - f(X)\|_2^2 + J(g_n) \Big],$$
where the equality holds because $X, X_0$ are identical in distribution. Cancelling out the common term of $\mathbb{E}_X J(g_n)$ from the first and third expressions proves the result, because $\mathbb{E}(\hat f_n(X_0) \mid X, X_0) = g_n(X_0)$ and $\mathbb{E}(\hat f_n(X) \mid X) = g_n(X)$.

k-nearest-neighbors regression

Consider $\hat f_n$ the k-nearest-neighbors or kNN regression estimator, defined by
$$\hat f_n(x) = \frac{1}{k} \sum_{i \in N_k(x)} y_i,$$
where $N_k(x)$ returns the indices of the $k$ nearest points among $x_1, \ldots, x_n$ to $x$. It is immediate that the excess variance of kNN regression is zero.
Proposition 3. For $\hat f_n$ the kNN regression estimator, we have $V^+ = 0$.

Proof. Simply compute
$$\mathrm{Var}\big(\hat f_n(x_i) \mid X\big) = \frac{\sigma^2}{k},$$
by independence of $y_1, \ldots, y_n$, and hence of the responses over the points in the nearest neighbor set $N_k(x_i)$, conditional on $X$. Similarly,
$$\mathrm{Var}\big(\hat f_n(x_0) \mid X, x_0\big) = \frac{\sigma^2}{k}.$$
Hence the Same-X and Random-X integrated variances are both $\sigma^2 / k$, and $V^+ = 0$.

On the other hand, the excess bias is not easily computable, and is not covered by Theorem 6, since kNN cannot be written as an estimator of the form (9) (though it is a linear smoother). The next result sheds some light on the nature of the excess bias.
Proposition 4.
For $\hat f_n$ the kNN regression estimator, we have
$$B^+ = B_{n,k} - \Big(1 - \frac{1}{k}\Big)^2 B_{n-1,k-1},$$
where $B_{n,k}$ denotes the integrated Random-X prediction bias of the kNN estimator fit to a training set of size $n$, and with tuning parameter (number of neighbors) $k$.

Proof. Observe
$$\Big( \mathbb{E}\big(\hat f_n(x_0) \mid X, x_0\big) - f(x_0) \Big)^2 = \bigg( \frac{1}{k} \sum_{i \in N_k(x_0)} \big(f(x_i) - f(x_0)\big) \bigg)^2,$$
and by definition, $\mathbb{E}_{X,x_0} \big[\mathbb{E}(\hat f_n(x_0) \mid X, x_0) - f(x_0)\big]^2 = B_{n,k}$. Meanwhile,
$$\Big( \mathbb{E}\big(\hat f_n(x_1) \mid X\big) - f(x_1) \Big)^2 = \bigg( \frac{1}{k} \sum_{i \in N_k(x_1)} \big(f(x_i) - f(x_1)\big) \bigg)^2 = \bigg( \frac{k-1}{k} \cdot \frac{1}{k-1} \sum_{i \in N^-_{k-1}(x_1)} \big(f(x_i) - f(x_1)\big) \bigg)^2,$$
where $N^-_{k-1}(x_1)$ gives the indices of the $k-1$ nearest points among $x_2, \ldots, x_n$ to $x_1$ (which together with the index 1 make up $N_k(x_1)$, as $x_1$ is trivially one of its own $k$ nearest neighbors, and the term for $i = 1$ vanishes from the sum). Now notice that $x_1$ plays the role of the test point $x_0$ in the last display, and therefore, $\mathbb{E}_X \big[\mathbb{E}(\hat f_n(x_1) \mid X) - f(x_1)\big]^2 = ((k-1)/k)^2 B_{n-1,k-1}$. This proves the result.

The above proposition suggests that, for moderate values of $k$, the excess bias in kNN regression is likely positive. We are comparing the integrated Random-X bias of a kNN model with $n$ training points and $k$ neighbors to that of a model with $n-1$ training points and $k-1$ neighbors. For large $n$ and moderate $k$, it seems that the former should be larger than the latter, and in addition, the factor of $(1 - 1/k)^2$ multiplying the latter term makes it even more likely that the difference $B_{n,k} - (1 - 1/k)^2 B_{n-1,k-1}$ is positive. Rephrased, using the zero excess variance result of Proposition 3: the gap in Random-X and Same-X prediction errors, $\mathrm{ErrR} - \mathrm{ErrS} = B_{n,k} - (1 - 1/k)^2 B_{n-1,k-1}$, is likely positive for large $n$ and moderate $k$. Of course, this is not a formal proof; aside from the choice of $k$, the shape of the underlying mean function $f(x) = \mathbb{E}(y \mid x)$ obviously plays an important role here too. As a concrete problem setting, we might try analyzing the Random-X bias $B_{n,k}$ for $f$ Lipschitz and a scaling for $k$ such that $k \to \infty$ but $k/n \to 0$ as $n \to \infty$, e.g., $k \asymp \sqrt{n}$, which ensures consistency of kNN. Typical analyses provide upper bounds on the kNN bias in this problem setting (e.g., see Gyorfi et al. 2002), but a more refined analysis would be needed to compare $B_{n,k}$ to $B_{n-1,k-1}$.
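As a rough numerical companion to this heuristic (entirely our construction: the Lipschitz choice $f(x) = \|x\|_1$, the distribution $Q = N(0, I_2)$, and the sample sizes are all arbitrary), both $B_{n,k}$ and $B_{n-1,k-1}$ can be estimated straight from their definitions, since neither depends on the response noise:

```python
import numpy as np

# Monte Carlo estimate of the kNN Random-X bias B_{n,k}: the expectation of
# ((1/k) sum_{i in N_k(x0)} (f(x_i) - f(x0)))^2 over training X and test x0.
# Assumed setup: f(x) = ||x||_1 (Lipschitz), Q = N(0, I_2).
rng = np.random.default_rng(2)
p = 2

def bias_nk(n, k, reps=5000):
    acc = 0.0
    for _ in range(reps):
        X = rng.standard_normal((n, p))
        x0 = rng.standard_normal(p)
        nn = np.argsort(np.linalg.norm(X - x0, axis=1))[:k]  # k nearest points
        acc += (np.abs(X[nn]).sum(axis=1).mean() - np.abs(x0).sum()) ** 2
    return acc / reps

n, k = 100, 10
B_plus = bias_nk(n, k) - (1 - 1 / k) ** 2 * bias_nk(n - 1, k - 1)
print(B_plus)  # estimates the excess bias B+; positive in this setup
```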
Discussion

We have proposed and studied a division of Random-X prediction error into components: the irreducible error $\sigma^2$, the traditional (Fixed-X or Same-X) integrated bias $B$ and integrated variance $V$ components, and our newly defined excess bias $B^+$ and excess variance $V^+$ components, such that $B + B^+$ gives the Random-X integrated bias and $V + V^+$ the Random-X integrated variance. For least squares regression, we were able to quantify $V^+$ exactly when the covariates are normal, and asymptotically when they are drawn from a linear transformation of a product distribution, leading to our definition of RCp. To account for unknown error variance $\sigma^2$, we defined $\widehat{\mathrm{RCp}}$ based on the usual plug-in estimate, which turns out to be asymptotically identical to GCV, giving this classic method a novel interpretation. To account for $B^+$ (when $\sigma^2$ is known and the distribution $Q$ of the covariates is well-behaved), we defined RCp$^+$, by leveraging a Random-X bias estimate implicit to OCV. We also briefly considered methods beyond least squares, proving that $B^+$ is nonnegative in all settings considered, while $V^+$ can become negative in the presence of heavy regularization. We reflect below on some issues surrounding our findings and possible directions for future work.
Ability of $\widehat{\mathrm{RCp}}$ to account for bias.
An intriguing phenomenon that we observe is the ability of $\widehat{\mathrm{RCp}}$/Sp and its close (asymptotic) relative GCV to deal to some extent with $B^+$ in estimating Random-X prediction error, through the inflation they perform on the squared training residuals. For GCV in particular, where recall $\mathrm{GCV} = \mathrm{RSS} / (n (1 - \gamma)^2)$, we see that this inflation takes a simple form: if the linear model is biased, then the squared bias component in each residual is inflated by $1/(1 - \gamma)^2$. Comparing this to the inflation that OCV performs, which is $1/(1 - h_{ii})^2$ on the $i$th residual, for $i = 1, \ldots, n$, we can interpret GCV as inflating the bias for each residual by some "averaged" version of the elementwise factors used by OCV. As OCV provides an almost-unbiased estimate of $B^+$ for Random-X prediction, GCV can get close when the diagonal elements $h_{ii}$, $i = 1, \ldots, n$, do not vary too wildly. When they do vary greatly, GCV can fail to account for $B^+$, as in the circled region in Figure 3.
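To make the two inflation schemes concrete, here is a short sketch (ours; the null simulation setup is arbitrary) computing OCV via its shortcut formula and GCV via its global factor, for least squares:

```python
import numpy as np

# Compare OCV and GCV for least squares via their residual-inflation forms.
# Sketch with an arbitrary simulated design; gamma = p/n here.
rng = np.random.default_rng(3)
n, p = 100, 10
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)  # null model, sigma^2 = 1

H = X @ np.linalg.solve(X.T @ X, X.T)  # hat matrix
h = np.diag(H)
res = y - H @ y

ocv = np.mean((res / (1 - h)) ** 2)              # shortcut formula for OCV
gcv = np.sum(res ** 2) / (n * (1 - p / n) ** 2)  # GCV: one "averaged" factor
print(ocv, gcv)  # close when the leverages h_ii are nearly equal
```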
Alternative bias-variance decompositions.

The integrated terms we defined are expectations of conditional bias and variance terms, where we conditioned on both the training and test covariates $X, x_0$. One could also consider other conditioning schemes, leading to different decompositions. An interesting option would be to condition on the prediction point $x_0$ only, and calculate the bias and variance unconditional on the training points $X$ before integrating, as in $\mathbb{E}_{x_0} \big(\mathbb{E}(\hat f_n(x_0) \mid x_0) - f(x_0)\big)^2$ and $\mathbb{E}_{x_0} \big(\mathrm{Var}(\hat f_n(x_0) \mid x_0)\big)$ for these alternative notions of Random-X bias and variance, respectively. It is easy to see that this would cause the bias (and thus excess bias) to decrease and the variance (and thus excess variance) to increase. However, it is not clear to us that computing or bounding such new definitions of (excess) bias and (excess) variance would be possible even for least squares regression. Investigating the tractability of this approach and any insights it might offer is an interesting topic for future study.
Alternative definitions of prediction error.

The overall goal in our work was to estimate the prediction error, defined as $\mathrm{ErrR} = \mathbb{E}_{X,Y,x_0,y_0} (y_0 - \hat f_n(x_0))^2$, the squared error integrated over all of the random variables available in training and testing. Alternative definitions have been suggested by some authors. Breiman and Spector (1992) generalized the Fixed-X setting in a manner that led them to define $\mathbb{E}_{Y,x_0,y_0} [(y_0 - \hat f_n(x_0))^2 \mid X]$ as the prediction error quantity of interest, which can be interpreted as the Random-X prediction error of a Fixed-X model. Hastie et al. (2009) emphasized the importance of the quantity $\mathbb{E}_{x_0,y_0} [(y_0 - \hat f_n(x_0))^2 \mid X, Y]$, which is the out-of-sample error of the specific model we have trained on the given training data $X, Y$. Of these two alternate definitions, the second one is more interesting in our opinion, but investigating it rigorously requires a different approach than what we have developed here.
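A small simulation sketch (ours, with an arbitrary linear truth and Gaussian covariates) illustrates the relation between the two quantities: the conditional error of Hastie et al. (2009) is a random variable depending on the training data, and its expectation over training sets is ErrR:

```python
import numpy as np

# ErrR integrates over training data (X, Y) and test point (x0, y0);
# the conditional error of Hastie et al. (2009) fixes one trained model.
# Sketch under assumed choices: linear f, least squares, sigma^2 = 1.
rng = np.random.default_rng(4)
n, p, n_test = 50, 5, 100000
beta_true = np.ones(p)

def cond_err(X, Y):
    # E[(y0 - f_hat(x0))^2 | X, Y], approximated on a large test sample
    beta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]
    X0 = rng.standard_normal((n_test, p))
    y0 = X0 @ beta_true + rng.standard_normal(n_test)
    return np.mean((y0 - X0 @ beta_hat) ** 2)

errs = []
for _ in range(200):  # each term is the conditional error of one fitted model
    X = rng.standard_normal((n, p))
    Y = X @ beta_true + rng.standard_normal(n)
    errs.append(cond_err(X, Y))

print(np.mean(errs))  # estimates ErrR; compare 1 + p/(n-p-1) ~ 1.114
```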
Alternative types of cross-validation.
Our exposition has concentrated on comparing OCV to generalized covariance penalty methods. We have not discussed other cross-validation approaches, in particular, the K-fold cross-validation (KCV) method with $K \ll n$ (e.g., $K = 5$ or 10). A supposedly well-known problem with OCV is that its estimates of prediction error have very high variance; we indeed observe this phenomenon in our simulations (and for least squares estimation, the analytical form of OCV clarifies the source of this high variance). There are some claims in the literature that KCV can have lower variance than OCV (Hastie et al. 2009, and others), and should be considered as the preferred CV variant for estimation of Random-X prediction error. Systematic investigations of this issue for least squares regression, such as Burman (1989) and Arlot and Celisse (2010), actually reach the opposite conclusion: that the high variance is further compounded by reducing $K$. Our own simulations also support this view (results not shown), therefore we do not consider KCV to be an important benchmark beyond OCV.
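The kind of comparison we have in mind can be sketched as follows (a minimal version of such a simulation, under an arbitrary null setup; not the paper's study):

```python
import numpy as np

# Variability of OCV vs 5-fold KCV estimates of prediction error for least
# squares, over repeated draws of the training data.
rng = np.random.default_rng(5)
n, p, K = 100, 10, 5

def cv_estimate(X, Y, folds):
    idx = np.arange(n) % folds  # deterministic fold assignment, for simplicity
    errs = []
    for f in range(folds):
        tr, te = idx != f, idx == f
        b = np.linalg.lstsq(X[tr], Y[tr], rcond=None)[0]
        errs.append((Y[te] - X[te] @ b) ** 2)
    return np.mean(np.concatenate(errs))

ocv, kcv = [], []
for _ in range(500):
    X = rng.standard_normal((n, p))
    Y = rng.standard_normal(n)  # null model
    ocv.append(cv_estimate(X, Y, folds=n))  # OCV = n-fold CV
    kcv.append(cv_estimate(X, Y, folds=K))

print("sd(OCV) =", np.std(ocv), " sd(KCV) =", np.std(kcv))
```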
Our analysis and simulations have focused on the accuracy of prediction error estimates provided by various approaches. We have not considered their utility for model selection, i.e., for identifying the best predictive model, which differs from model evaluation in an important way. A method can do well in the model selection task even when it is inaccurate or biased for model evaluation, as long as such inaccuracies are consistent across different models and do not affect its ability to select the better predictive model. Hence the correlation of model evaluations using the same training data across different models plays a central role in model selection performance. An investigation of the correlation between the model evaluations that each of the approaches we considered here produces is of major interest, and is left to future work.
Semi-supervised settings.
Given the important role that the marginal distribution $Q$ of $x$ plays in evaluating Random-X prediction error (as expressed, e.g., in Theorems 2 and 3), it is of interest to consider situations where, in addition to the training data, we have large quantities of additional observations with $x$ only and no response $y$. In the machine learning literature this situation is often considered under the names semi-supervised learning or transductive learning. Such data could be used, e.g., to directly estimate the excess variance from expressions like (7).
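As a sketch of how this could look for least squares (our illustration; we use the standard identity $\mathrm{Var}(\hat f_n(x_0) \mid X, x_0) = \sigma^2 x_0^T (X^T X)^{-1} x_0$ rather than the paper's expression (7) verbatim), unlabeled draws from $Q$ can be plugged in directly:

```python
import numpy as np

# Using unlabeled covariates to estimate the least squares excess variance
# V+ = sigma^2 * (E[x0^T (X^T X)^{-1} x0] - p/n). Our sketch, assuming
# sigma^2 is known; the unlabeled sample stands in for the expectation over Q.
rng = np.random.default_rng(6)
n, p, n_unlab, sigma2 = 200, 20, 10000, 1.0

X = rng.standard_normal((n, p))              # labeled training covariates
X_unlab = rng.standard_normal((n_unlab, p))  # unlabeled draws from Q

G_inv = np.linalg.inv(X.T @ X)
quad = np.einsum('ij,jk,ik->i', X_unlab, G_inv, X_unlab)  # x0^T G^{-1} x0
V_plus_hat = sigma2 * (quad.mean() - p / n)
print(V_plus_hat)  # for Gaussian Q, compare p/(n-p-1) - p/n ~ 0.0117
```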
General view from empirical process theory.

This paper was focused in large part on estimating or bounding the excess bias and variance in specific problem settings, which led to estimates or bounds on the gap in Random-X and Same-X prediction error, as $\mathrm{ErrR} - \mathrm{ErrS} = B^+ + V^+$. This gap is indeed a familiar concept to those well-versed in the theory of nonparametric regression, and roughly speaking, standard results from empirical process theory suggest that we should in general expect $\mathrm{ErrR} - \mathrm{ErrS}$ to be small, i.e., much smaller than either of $\mathrm{ErrR}$ or $\mathrm{ErrS}$ to begin with. The connection is as follows. Note that
$$\mathrm{ErrR} - \mathrm{ErrS} = \mathbb{E}_{X,Y,x_0} \big(f(x_0) - \hat f_n(x_0)\big)^2 - \mathbb{E}_{X,Y} \bigg[ \frac{1}{n} \sum_{i=1}^n \big(f(x_i) - \hat f_n(x_i)\big)^2 \bigg] = \mathbb{E}_{X,Y} \Big[ \|f - \hat f_n\|_{L_2(Q)}^2 - \|f - \hat f_n\|_{L_2(Q_n)}^2 \Big],$$
where we are using standard notation from nonparametric regression for "population" and "empirical" norms, $\|\cdot\|_{L_2(Q)}$ and $\|\cdot\|_{L_2(Q_n)}$, respectively. For an appropriate function class $\mathcal{G}$, empirical process theory can be used to control the deviations between $\|g\|_{L_2(Q)}$ and $\|g\|_{L_2(Q_n)}$, uniformly over all functions $g \in \mathcal{G}$. Such uniformity is important, because it gives us control on the difference in population and empirical norms for the (random) function $g = f - \hat f_n$ (provided of course this function lies in $\mathcal{G}$). This theory applies to finite-dimensional $\mathcal{G}$ (e.g., linear functions, which would be relevant to the case when $f$ is assumed to be linear and $\hat f_n$ is chosen to be linear), and even to infinite-dimensional classes $\mathcal{G}$, provided we have some understanding of the entropy or Rademacher complexity of $\mathcal{G}$ (e.g., this is true of Lipschitz functions, which would be relevant to the analysis of k-nearest-neighbors regression or kernel estimators).

Under appropriate conditions, we typically find $\mathbb{E}_{X,Y} \big| \|f - \hat f_n\|_{L_2(Q)}^2 - \|f - \hat f_n\|_{L_2(Q_n)}^2 \big| = O(C_n)$, where $C_n$ is the $L_2(Q)$ convergence rate of $\hat f_n$ to $f$. This is even true in an asymptotic setting in which $p$ grows with $n$ (so $C_n$ here gets replaced by $C_{n,p}$), but such high-dimensional results usually require more restrictions on the distribution $Q$ of the covariates. The takeaway message: in most cases where $\hat f_n$ is consistent with rate $C_n$, we should expect to see the gap being $\mathrm{ErrR} - \mathrm{ErrS} = O(C_n)$, whereas $\mathrm{ErrR}, \mathrm{ErrS} \geq \sigma^2$, so the difference in Same-X and Random-X prediction errors is quite small (as small as the Same-X and Random-X risk) compared to these prediction errors themselves; said differently, we should expect to see $B^+, V^+$ being of the same order as (or smaller than) $B, V$.

It is worth pointing out that several interesting aspects of our study really lie outside what can be inferred from empirical process theory. One aspect to mention is the precision of the results: in some settings we can characterize $B^+, V^+$ individually, but (as described above) empirical process theory would only provide a handle on their sum.
Moreover, for least squares regression estimators, with $p/n$ converging to a nonzero constant, we are able to characterize the exact asymptotic excess variance under some conditions on $Q$ (essentially, requiring $Q$ to be something like a rotation of a product distribution), in Theorem 3; note that this is a problem setting in which least squares is not consistent, and could not be treated by standard results from empirical process theory.

Lastly, empirical process theory tells us nothing about the sign of $\|f - \hat f_n\|_{L_2(Q)}^2 - \|f - \hat f_n\|_{L_2(Q_n)}^2$, or its expectation under $P$ (which equals $\mathrm{ErrR} - \mathrm{ErrS} = B^+ + V^+$, as described above). This 1-bit quantity is of interest to us, since it tells us if the Same-X (in-sample) prediction error is optimistic compared to the Random-X (out-of-sample) prediction error. Theorems 1, 4, 5, 6 and Propositions 3, 4 all pertain to this quantity.
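The sign of this gap is easy to observe numerically; here is a minimal sketch (ours, assuming Gaussian covariates, a linear truth, and least squares):

```python
import numpy as np

# Empirical vs population squared norms of f - f_hat for least squares:
# ||f - f_hat||^2_{L2(Qn)} (training points) vs ||f - f_hat||^2_{L2(Q)}
# (fresh points). Sketch under assumed Gaussian Q and linear f.
rng = np.random.default_rng(7)
n, p, reps = 100, 10, 2000
beta_true = np.ones(p)

gaps = []
for _ in range(reps):
    X = rng.standard_normal((n, p))
    Y = X @ beta_true + rng.standard_normal(n)
    beta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]
    X0 = rng.standard_normal((n, p))  # fresh draws approximate the Q-norm
    pop = np.mean((X0 @ (beta_true - beta_hat)) ** 2)
    emp = np.mean((X @ (beta_true - beta_hat)) ** 2)
    gaps.append(pop - emp)

print(np.mean(gaps))  # estimates ErrR - ErrS = B+ + V+; positive here
```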
References

Hirotogu Akaike. Information theory and an extension of the maximum likelihood principle. Second International Symposium on Information Theory, pages 267–281, 1973.

Sylvain Arlot and Alain Celisse. A survey of cross-validation procedures for model selection. Statistics Surveys, 4:40–79, 2010.

Leo Breiman and David Freedman. How many variables should be entered in a regression equation? Journal of the American Statistical Association, 78(381):131–136, 1983.

Leo Breiman and Philip Spector. Submodel selection and evaluation in regression. The X-random case. International Statistical Review, 60(3):291–319, 1992.

Andreas Buja, Richard Berk, Lawrence Brown, Edward George, Emil Pitkin, Mikhail Traskin, Linda Zhao, and Kai Zhang. Models as approximations, Part I: A conspiracy of nonlinearity and random regressors in linear regression. arXiv: 1404.1578, 2014.

Andreas Buja, Richard Berk, Lawrence Brown, Ed George, Arun Kumar Kuchibhotla, and Linda Zhao. Models as approximations, Part II: A general theory of model-robust regression. arXiv: 1612.03257, 2016.

Prabir Burman. A comparative study of ordinary cross-validation, v-fold cross-validation and the repeated learning-testing methods. Biometrika, 76(3):503–514, 1989.

Sourav Chatterjee. Assumptionless consistency of the lasso. arXiv: 1303.5817, 2013.

Peter Craven and Grace Wahba. Smoothing noisy data with spline functions. Numerische Mathematik, 31(4):377–403, 1978.

Lee H. Dicker. Optimal equivariant prediction for high-dimensional linear models with arbitrary predictor covariance. Electronic Journal of Statistics, 7:1806–1834, 2013.

Edgar Dobriban and Stefan Wager. High-dimensional asymptotics of prediction: Ridge regression and classification. arXiv: 1507.03003, 2015.

Bradley Efron. How biased is the apparent error rate of a prediction rule? Journal of the American Statistical Association, 81(394):461–470, 1986.

Bradley Efron. The estimation of prediction error: Covariance penalties and cross-validation. Journal of the American Statistical Association, 99(467):619–632, 2004.

Noureddine El Karoui. Concentration of measure and spectra of random matrices: Applications to correlation matrices, elliptical distributions and beyond. The Annals of Applied Probability, 19(6):2362–2405, 2009.

Gene H. Golub, Michael Heath, and Grace Wahba. Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics, 21(2):215–223, 1979.

Eitan Greenshtein and Ya'acov Ritov. Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli, 10(6):971–988, 2004.

Theodore Groves and Thomas Rothenberg. A note on the expected value of an inverse matrix. Biometrika, 56(3):690–691, 1969.

Laszlo Gyorfi, Michael Kohler, Adam Krzyzak, and Harro Walk. A Distribution-Free Theory of Nonparametric Regression. Springer, 2002.

Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer, second edition, 2009.

Ronald R. Hocking. The analysis and selection of variables in linear regression. Biometrics, 32(1):1–49, 1976.

Daniel Hsu, Sham Kakade, and Tong Zhang. Random design analysis of ridge regression. Foundations of Computational Mathematics, 14(3):569–600, 2014.

Hannes Leeb. Evaluation and selection of models for out-of-sample prediction when the sample size is small relative to the complexity of the data-generating process. Bernoulli, 14(3):661–690, 2008.

Colin Mallows. Some comments on Cp. Technometrics, 15(4):661–675, 1973.

Stanley L. Sclove. On criteria for choosing a regression equation for prediction. Technical report, Carnegie Mellon University, 1969.

Charles Stein. Multiple regression. In Ingram Olkin, editor, Contributions to Probability and Statistics: Essays in Honor of Harold Hotelling, pages 424–443. Stanford University Press, 1960.

Charles Stein. Estimation of the mean of a multivariate normal distribution. Annals of Statistics, 9(6):1135–1151, 1981.

Mary L. Thompson. Selection of variables in multiple regression: Part I. A review and evaluation. International Statistical Review, 46(1):1–19, 1978a.

Mary L. Thompson. Selection of variables in multiple regression: Part II. Chosen procedures, computations and examples. International Statistical Review, 46(2):129–146, 1978b.

Ryan J. Tibshirani. The lasso problem and uniqueness. Electronic Journal of Statistics, 7:1456–1490, 2013.

Ryan J. Tibshirani and Jonathan Taylor. Degrees of freedom in lasso problems. Annals of Statistics, 40(2):1198–1232, 2012.

John W. Tukey. Discussion of "Topics in the investigation of linear relations fitted by the method of least squares". Journal of the Royal Statistical Society, Series B, 29(1):2–52, 1967.

Sara van de Geer. Empirical Processes in M-Estimation. Cambridge University Press, 2000.

Vladimir N. Vapnik. Statistical Learning Theory. Wiley, 1998.

Grace Wahba. Spline Models for Observational Data. Society for Industrial and Applied Mathematics, 1990.

Martin J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic View. Cambridge University Press, 2017. To appear.

Hui Zou, Trevor Hastie, and Robert Tibshirani. On the "degrees of freedom" of the lasso. Annals of Statistics, 35(5):2173–2192, 2007.