Optimal model selection in heteroscedastic regression using piecewise polynomial functions
A. Saumard∗
Department of Statistics, University of Washington, Seattle, WA 98195, USA
INRIA Saclay Île-de-France, France

April 4, 2013

∗Research partly supported by the French Agence Nationale de la Recherche (ANR 2011 BS01 010 01 projet Calibration), NIAID grant 2R01 AI29168-04 and a PIMS postdoctoral fellowship.
Abstract
We consider the estimation of a regression function with random design and heteroscedastic noise in a nonparametric setting. More precisely, we address the problem of characterizing the optimal penalty when the regression function is estimated by using a penalized least-squares model selection method. In this context, we show the existence of a minimal penalty, defined to be the maximum level of penalization under which the model selection procedure totally misbehaves. The optimal penalty is shown to be twice the minimal one and to satisfy a nonasymptotic pathwise oracle inequality with leading constant almost one. Finally, the ideal penalty being unknown in general, we propose a hold-out penalization procedure and show that the latter is asymptotically optimal.
Keywords: nonparametric regression, heteroscedastic noise, random design, optimal model selection, slope heuristics, hold-out penalty.
Given a collection of models and associated estimators, two different model selection tasks can be tackled: find out the smallest true model (consistency problem), or select an estimator achieving the best performance according to some criterion, called a risk or a loss (efficiency problem). We focus on the efficiency problem, where the leading idea of penalization, which goes back to early works of Akaike [2, 3] and Mallows [33], is to perform an unbiased - or uniformly biased - estimation of the risk of the estimators. The FPE and AIC procedures proposed by Akaike respectively in [2] and [3], as well as Mallows' $C_p$ or $C_L$ [33], aim to do so by adding to the empirical risk a penalty which depends on the dimension of the models.

The first analyses of such procedures had the drawback of being fundamentally asymptotic, considering in particular that the number of models as well as their dimensions are fixed while the sample size tends to infinity. As explained for instance in Massart [34], in various statistical settings it is natural to let these quantities depend on the amount of data. Thus, pointing out the importance of Talagrand's type concentration inequalities in the nonasymptotic approach, Birgé and Massart [16, 18] and Barron, Birgé and Massart [11] have been able to build nonasymptotic oracle inequalities for penalization procedures. Their framework takes into account the complexity of the collection of models as a parameter depending on the sample size.

In an abstract risk minimization framework, which includes statistical learning problems such as classification or regression, many distribution-dependent and data-dependent penalties have been proposed, from the more general and less accurate global penalties, see Koltchinskii [27], Bartlett et al. [12], to the refined local Rademacher complexities in the case where some favorable noise conditions hold (see for instance Bartlett, Bousquet and Mendelson [13], Koltchinskii [28]). But as a price to pay for generality, the above penalties suffer from their dependence on unknown constants. These penalized procedures are very difficult to implement and calibrate in practice. Moreover, the existing risk bounds for these procedures contain very large leading constants. An alternative is given by the resampling and $V$-fold penalties of Arlot [5, 6]. These penalties are essentially resampling estimates of the difference between the empirical risk and the risk. Arlot [5, 6] proved sharp pathwise oracle inequalities for the resampling and $V$-fold penalties in the case of regression with random design and heteroscedastic noise on histogram models, and conjectured that the restriction to histograms is mainly technical and that his results can be extended to more general situations.

Model selection via penalization is not the only method which provides sharp oracle inequalities for the estimation of a nonparametric regression function. Indeed, aggregation techniques and PAC-Bayesian bounds also allow one to obtain nearly optimal constants in the oracle inequalities. Bunea et al. [21] derived some sharp oracle inequalities for different aggregation tasks by means of a single unifying procedure. However, the authors asked for a fixed design and homoscedastic Gaussian noise.
By using aggregation with exponential weights, Dalalyan and Tsybakov obtained in [25] oracle inequalities of a PAC-Bayesian flavor, with leading constant one and optimal rate of the remainder term, for the estimation of a regression function with deterministic design and homoscedastic errors. Furthermore, these authors allowed error distributions which are symmetric or $n$-divisible. PAC-Bayesian methods are systematically investigated in Catoni [23]. The work of Lecué and Mendelson [29] concerning the aggregation by empirical risk minimization of a finite family of functions seems to handle the case of a random design and heteroscedastic noise, even if this example is not explicitly developed. The oracle inequalities obtained by Lecué and Mendelson are sharp and valid with probability close to one. In particular, they are related to oracle inequalities obtained, in expectation, by Catoni in [23].

A difference between aggregation and model selection studies is that in most aggregation results, the estimators at hand are considered as deterministic functions. However, notable exceptions are the following. Leung and Barron [32] proved sharp oracle inequalities for the aggregation of projection estimators in the Gaussian sequence model. Rigollet and Tsybakov [35] recently showed sharp bounds for the aggregation of some linear estimators, including projection estimators, in a regression setting with fixed design and homoscedastic Gaussian noise. More general PAC-Bayesian type inequalities were also recently obtained by Dalalyan and Salmon [24], considering the aggregation of affine estimators in heteroscedastic regression, with Gaussian noise and fixed design.

Birgé and Massart [19] discovered, in a generalized linear Gaussian model setting, that the optimal penalty is closely related to the minimal one. An optimal penalty is a penalty which gives an oracle inequality with leading constant converging to one when the sample size tends to infinity. The minimal penalty is defined to be the maximal penalty under which the procedure totally misbehaves (in a sense to be specified below). Birgé and Massart [19] proved sharp upper and lower bounds for the minimal penalty. These authors also showed that the optimal penalty is twice the minimal one, both for small and large collections of models. These facts are called the slope heuristics. The authors also exhibited a jump in the dimension of the selected model occurring around the value of the minimal penalty, and used it to estimate the minimal penalty from the data. Taking a penalty equal to twice the previous estimate then gives a nonasymptotic quasi-optimal data-driven model selection procedure. The algorithm proposed by Birgé and Massart [19] to estimate the minimal penalty relies on the previous knowledge of the shape of the latter, which is a known function of the dimension of the models in their setting. Thus, their procedure gives a data-driven calibration of the minimal penalty.

Considering the case of Gaussian least-squares regression with unknown variance, Baraud et al. [10] have also derived lower bounds on the penalty terms for small and large collections of models. In the setting of maximum likelihood density estimation on histograms, Castellan [22] obtained a lower bound on the penalty term, in the case of small collections of models.

The slope heuristics has then been extended by Arlot and Massart [9] to a bounded regression framework, with heteroscedastic noise and random design.
The authors considered least-squares estimators on a "small" collection of histogram models. Their analysis differs from the one of Birgé and Massart [19] in an important way. Indeed, Arlot and Massart [9] did not assume a particular shape of the penalty term. As a matter of fact, the penalties considered by Birgé and Massart [19] were known functions of the dimension of the models, whereas heteroscedasticity of the noise allowed Arlot and Massart to consider situations where the shape of the penalty is not even a function of the dimension of the models. In such general cases, the authors proposed to estimate the shape of the penalty by using Arlot's resampling or $V$-fold penalties, proved to be efficient in their regression framework by Arlot [5, 6].

The approach developed in [9] is more general than the histogram case, except for some identified technical parts of the proofs, thus providing a general framework that can be applied to other problems. The authors have also identified, in the case of histograms, the minimal penalty as the mean of the empirical excess loss on each model, and the ideal penalty to be estimated as the sum of the empirical excess loss and true excess loss on each model. The slope heuristics then heavily relies on the fact that the empirical excess loss is equivalent to the true excess loss for models of reasonable dimensions.

Arlot and Massart [9] conjectured that this equivalence between the empirical and true excess losses is a quite general fact in M-estimation. A general result supporting this conjecture is the high-dimensional Wilks' phenomenon investigated by Boucheron and Massart [20] in the setting of bounded contrast minimization. The authors derive in [20] concentration inequalities for the empirical excess loss, under some margin conditions (called "noise conditions" by the authors) and when the considered model satisfies some general "complexity condition" on the first moment of the supremum of the empirical process on localized slices of variance in the loss class. The latter assumption can be explicated under suitable covering entropy conditions on the model.

Lerasle [31] proved the validity of the slope heuristics in a least-squares density estimation setting, under rather mild conditions on the considered linear models. The approach developed by the author in this framework allows sharp computations, and the empirical excess loss is shown to be exactly equal to the true excess loss. Lerasle [31] also proved, in the least-squares density estimation setting, the efficiency of Arlot's resampling penalties. Moreover, Lerasle [30] generalized the previous results to weakly dependent data. Arlot and Bach [8] recently considered the problem of selecting among linear estimators in nonparametric regression. Their framework includes model selection for linear regression, the choice of a regularization parameter in kernel ridge regression or spline smoothing, and the choice of a kernel in multiple kernel learning. In such cases, the minimal penalty is not necessarily half the optimal one, but the authors propose to estimate the unknown variance by the minimal penalty and to use it in a plug-in version of Mallows' $C_L$. The latter penalty is proved to be optimal by establishing a nonasymptotic oracle inequality with constant close to one, converging to one when the sample size tends to infinity.

In this paper, we prove the validity of the slope heuristics in the framework of bounded regression with random design and heteroscedastic noise.
This is done by considering a "small" collection of finite-dimensional linear models of piecewise polynomial functions. This setting extends the case of histograms already treated by Arlot and Massart [9]. An interesting consequence is that piecewise polynomial functions are known to have good approximation properties in Besov spaces and can lead to minimax rates of convergence, see for instance [11, 37]. As a matter of fact, histograms allow minimax procedures only on Hölder spaces.

Our validation of the slope heuristics is of asymptotic nature. However, the complexity of the collection of models as well as the dimensions of the models are not constant terms in our analysis. These quantities are indeed allowed to depend on the sample size $n$.

If the noise is homoscedastic, then the shape of the ideal penalty is known, and is linear in the dimension of the models, as in the case of Mallows' $C_p$. However, if the noise is heteroscedastic, then Arlot [7] showed that the ideal penalty is not even a function of the linear dimensions of the models. So, it is necessary to give a suitable estimator of this shape. As emphasized by Arlot [5, 6], $V$-fold and resampling penalties are good, natural candidates for this task. In this paper, we show that a hold-out penalty - which is closely related to a special case of resampling penalty - is indeed asymptotically optimal under very mild conditions on the data split. As a matter of fact, a half-and-half split leads to an optimal penalization. It is worth noticing that hold-out type procedures have also been exploited in Chapter 8 of Massart [34] as simple tools to overcome the margin adaptivity issue in classification.

The paper is organized as follows. In Section 2, we describe the statistical framework. The slope heuristics is presented in Section 3, and the hold-out penalization is considered in Section 4. The proofs are collected in Section 5.

Let us take $n$ independent observations $\xi_i = (X_i, Y_i) \in \mathcal{X} \times \mathbb{R}$ with common distribution $P$. In Sections 2.2 and 3.2-4, the feature space is the unit interval $\mathcal{X} = [0,1]$. The marginal distribution of the design $X_i$ is denoted by $P^X$. We assume that the data satisfy the following relation,
$$Y_i = s^*(X_i) + \sigma(X_i)\,\varepsilon_i, \qquad (1)$$
where $s^* \in L^2(P^X)$. Conditionally to $X_i$, the residual $\varepsilon_i$ is assumed to have zero mean and variance equal to one. The function $\sigma: \mathcal{X} \to \mathbb{R}_+$ is the unknown heteroscedastic noise level. A generic random variable with distribution $P$, independent of the sample $(\xi_1, \ldots, \xi_n)$, is denoted by $\xi = (X, Y)$.

It follows from (1) that $s^*$ is the unknown regression function of $Y$ with respect to $X$. Our aim is to estimate $s^*$ from the sample. To do so, we are given a finite collection of models $\mathcal{M}_n$, with cardinality depending on the sample size $n$. Each model $M \in \mathcal{M}_n$ is assumed to be a finite-dimensional vector space. We denote by $D_M$ the linear dimension of $M$. In the main part of this paper, we focus on models of piecewise polynomial functions, which are introduced in Section 2.2 below.

We denote by $\|s\|_2 = \left(\int_{\mathcal{X}} s^2 \, dP^X\right)^{1/2}$ the usual norm in $L^2(P^X)$ and by $s_M$ the linear projection of $s^*$ onto $M$ in the Hilbert space $(L^2(P^X), \|\cdot\|_2)$. For a function $f \in L^1(P)$, we write $P(f) = Pf = \mathbb{E}[f(\xi)]$. By setting $K: L^2(P^X) \to L^1(P)$ the least-squares contrast, defined by
$$K(s): (x,y) \mapsto (y - s(x))^2, \quad s \in L^2(P^X), \qquad (2)$$
the regression function $s^*$ satisfies
$$s^* = \arg\min_{s \in L^2(P^X)} P(K(s)). \qquad (3)$$
For the linear projections $s_M$ we get
$$s_M = \arg\min_{s \in M} P(K(s)). \qquad (4)$$
For each model $M \in \mathcal{M}_n$, we consider a least-squares estimator $s_n(M)$ (possibly non-unique), satisfying
$$s_n(M) \in \arg\min_{s \in M} \{P_n(K(s))\} = \arg\min_{s \in M} \left\{ \frac{1}{n}\sum_{i=1}^n (Y_i - s(X_i))^2 \right\},$$
where $P_n = n^{-1}\sum_{i=1}^n \delta_{\xi_i}$ is the empirical measure built from the data.

In order to avoid cumbersome notations, we will often write $Ks$ in place of $K(s)$ for the image of a suitable function $s$ by the contrast $K$. We measure the performance of the least-squares estimators by their excess loss,
$$\ell(s^*, s_n(M)) := P(Ks_n(M) - Ks^*) = \|s_n(M) - s^*\|_2^2.$$
We have the following decomposition,
$$\ell(s^*, s_n(M)) = \ell(s^*, s_M) + \ell(s_M, s_n(M)),$$
where
$$\ell(s^*, s_M) := P(Ks_M - Ks^*) = \|s_M - s^*\|_2^2 \quad \text{and} \quad \ell(s_M, s_n(M)) := P(Ks_n(M) - Ks_M) \ge 0.$$
The quantity $\ell(s^*, s_M)$ is called the bias of the model $M$ and $\ell(s_M, s_n(M))$ is the excess loss of the least-squares estimator $s_n(M)$ on the model $M$. By the Pythagorean identity, we have $\ell(s_M, s_n(M)) = \|s_n(M) - s_M\|_2^2$.

Given the collection of models $\mathcal{M}_n$, an oracle model $M^*$ is defined as a minimizer of the losses - or equivalently excess losses - of the estimators at hand,
$$M^* \in \arg\min_{M \in \mathcal{M}_n} \{\ell(s^*, s_n(M))\}. \qquad (5)$$
The associated oracle estimator $s_n(M^*)$ thus achieves the best performance in terms of excess loss among the collection $\{s_n(M);\ M \in \mathcal{M}_n\}$. The oracle model is a random quantity because it depends on the data, and it is also unknown as it depends on the distribution $P$ of the data. We propose to estimate the oracle model by a penalization procedure.

Given some known penalty pen, that is a function from $\mathcal{M}_n$ to $\mathbb{R}$, we consider the following data-dependent model, also called the selected model,
$$\widehat{M} \in \arg\min_{M \in \mathcal{M}_n} \{P_n(Ks_n(M)) + \mathrm{pen}(M)\}. \qquad (6)$$
Our aim is then to find a good penalty, such that the selected model $\widehat{M}$ satisfies an oracle inequality of the form
$$\ell(s^*, s_n(\widehat{M})) \le C \times \ell(s^*, s_n(M^*)),$$
with some positive constant $C$ as close to one as possible and with probability close to one, typically more than $1 - Ln^{-2}$ for some positive constant $L$.
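To fix ideas, here is a minimal numerical sketch - ours, not part of the original text - of the selection rule (6). Data are simulated from model (1) with a bounded, heteroscedastic noise; least-squares estimators are computed on a collection of regular histogram models (piecewise polynomial functions of degree 0, see Section 2.2); the penalty is left as a parameter. All function names (`simulate`, `fit_histogram`, `empirical_risk`, `select_model`) are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n):
    # Model (1): Y = s*(X) + sigma(X) * eps, with X uniform on [0, 1],
    # a smooth regression function and a heteroscedastic noise level.
    X = rng.uniform(0.0, 1.0, n)
    s_star = np.sin(2 * np.pi * X)
    sigma = 0.2 + 0.5 * X                            # bounded away from zero, as in (An)
    eps = rng.uniform(-np.sqrt(3), np.sqrt(3), n)    # mean 0, variance 1, bounded data as in (Ab)
    return X, s_star + sigma * eps

def fit_histogram(X, Y, D):
    # Least-squares estimator on the regular histogram model with D pieces:
    # on each cell the fit is the local mean of the responses.
    cells = np.minimum((X * D).astype(int), D - 1)
    means = np.zeros(D)
    for k in range(D):
        mask = cells == k
        means[k] = Y[mask].mean() if mask.any() else 0.0
    return cells, means

def empirical_risk(Y, cells, means):
    # P_n(K s_n(M)) = (1/n) * sum_i (Y_i - s_n(M)(X_i))^2
    return np.mean((Y - means[cells]) ** 2)

def select_model(X, Y, dims, pen):
    # Selection rule (6): minimize empirical risk + pen(M) over the collection.
    crit = []
    for D in dims:
        cells, means = fit_histogram(X, Y, D)
        crit.append(empirical_risk(Y, cells, means) + pen(D))
    return dims[int(np.argmin(crit))]

n = 2000
X, Y = simulate(n)
dims = list(range(1, 101))
# A Mallows-type penalty 2 * sigma_bar^2 * D / n with a rough noise-level proxy;
# Sections 3 and 4 discuss how to calibrate such a penalty from the data.
D_hat = select_model(X, Y, dims, pen=lambda D: 2 * 0.3 * D / n)
print("selected dimension:", D_hat)
```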
Let us take $\mathcal{X} = [0,1]$ the unit interval and $\mathcal{P}$ a finite partition of $\mathcal{X}$. For a positive integer $r$ and any $(I, j) \in \mathcal{P} \times \{0, \ldots, r\}$, we set
$$p_{I,j}: x \in \mathcal{X} \mapsto x^j \mathbf{1}_I(x).$$

Definition 1

A finite-dimensional vector space $M$ is said to be a model of piecewise polynomial functions, with respect to the finite partition $\mathcal{P}$ of $\mathcal{X} = [0,1]$ and of degrees not larger than $r \in \mathbb{N}$, if $M = \mathrm{Span}\{p_{I,j};\ (I,j) \in \mathcal{P} \times \{0, \ldots, r\}\}$.
The linear dimension of $M$ is then equal to $(r+1)|\mathcal{P}|$. Notice that models of histograms on the unit interval are exactly models of piecewise polynomial functions with degrees not larger than 0.
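As an illustration (ours, not the paper's), the following sketch builds the design matrix spanned by the basis functions $p_{I,j}$ of Definition 1 for a regular partition of $[0,1]$ and computes the corresponding least-squares estimator; the model dimension is $(r+1)|\mathcal{P}|$ as noted above.

```python
import numpy as np

def design_matrix(X, breakpoints, r):
    # Columns are the basis functions p_{I,j}(x) = x^j * 1_I(x),
    # (I, j) in P x {0, ..., r}, so the dimension is (r + 1) * |P|.
    cols = []
    for a, b in zip(breakpoints[:-1], breakpoints[1:]):
        ind = ((X >= a) & (X < b)).astype(float)
        for j in range(r + 1):
            cols.append(ind * X ** j)
    return np.column_stack(cols)

def least_squares(X, Y, breakpoints, r):
    # Least-squares estimator s_n(M): ordinary least squares on the model M.
    Phi = design_matrix(X, breakpoints, r)
    coef, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
    return lambda x: design_matrix(np.atleast_1d(x), breakpoints, r) @ coef

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, 500)
Y = np.sin(2 * np.pi * X) + 0.3 * rng.uniform(-1, 1, 500)   # bounded noise, as in (Ab)
s_n = least_squares(X, Y, breakpoints=np.linspace(0, 1, 11), r=2)
print(s_n(np.array([0.25, 0.75])))   # model dimension here: (2 + 1) * 10 = 30
```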
In [36], it is shown that models of piecewise polynomial functions have nice analytical and statistical properties. Let us recall two of them.

In Lemma 8 of [36], it is proved that if the distribution $P^X$ has a density with respect to the Lebesgue measure Leb on $\mathcal{X} = [0,1]$ which is uniformly bounded away from zero, and if the considered partition $\mathcal{P}$ is lower regular with respect to Leb - that is, there exists a positive constant $c$ such that $|\mathcal{P}| \inf_{I \in \mathcal{P}} \mathrm{Leb}(I) \ge c > 0$ - then the associated model of piecewise polynomial functions is equipped with a localized orthonormal basis in $L^2(P^X)$. For a formal definition of a localized basis, see Section 5 below. Since the pioneering work of Birgé and Massart [15, 17, 34], the property of localized basis is known to play a key role in M-estimation and model selection using vector spaces or more general sieves.

Considering models of piecewise polynomial functions on the unit interval, where the density of $P^X$ with respect to Leb is both uniformly bounded and bounded away from 0 and where the underlying partition is lower regular with respect to Leb, it is shown in Lemma 9 of [36] that the least-squares estimator $s_n(M)$ converges in sup-norm to the linear projection $s_M$ of the regression function $s^*$.

Assumptions of lower regularity of the considered partitions, as well as the existence of a uniformly bounded density of $P^X$ with respect to the Lebesgue measure on $\mathcal{X}$, will thus naturally arise when dealing with least-squares model selection using piecewise polynomial functions - see Section 3.2 below. Furthermore, the interested reader will find in Section 5 a more general version of our results, available for linear models equipped with a localized basis and where least-squares estimators converge in sup-norm to the linear projections of the regression function onto the models.

In order to clarify our approach and to highlight the connection of the present paper with the results previously established in [36], we first give a brief heuristic explanation of the major mathematical facts underlying the slope phenomenon.

We rewrite the definition of the oracle model $M^*$ given in (5). For any $M \in \mathcal{M}_n$, the excess loss $\ell(s^*, s_n(M)) = P(Ks_n(M)) - P(Ks^*)$ is the difference between the loss of the estimator $s_n(M)$ and the loss of the target $s^*$. As $P(Ks^*)$ is independent of $M$ varying in $\mathcal{M}_n$, it holds
$$M^* \in \arg\min_{M \in \mathcal{M}_n} \{P(Ks_n(M))\} = \arg\min_{M \in \mathcal{M}_n} \{P_n(Ks_n(M)) + \mathrm{pen}_{\mathrm{id}}(M)\},$$
where for all $M \in \mathcal{M}_n$,
$$\mathrm{pen}_{\mathrm{id}}(M) := P(Ks_n(M)) - P_n(Ks_n(M)).$$
The penalty function $\mathrm{pen}_{\mathrm{id}}$ is called the ideal penalty - as it allows to select the oracle - and is unknown because it depends on the distribution of the data. As pointed out by Arlot and Massart [9], the main idea of penalization in the efficiency problem is to give some sharp estimate, up to a constant, of the ideal penalty. This would yield an (asymptotically) unbiased - or uniformly biased over the collection of models $\mathcal{M}_n$ - estimation of the loss. Such a penalization would lead to a sharp oracle inequality for the selected model.

A penalty term $\mathrm{pen}_{\mathrm{opt}}$ is said to be optimal if it achieves an oracle inequality with leading constant converging to one when the sample size $n$ tends to infinity.

Concerning the estimation of the optimal penalty, Arlot and Massart [9] conjectured that the mean of the empirical excess loss $\mathbb{E}[P_n(Ks_M - Ks_n(M))]$ satisfies the following slope heuristics in a quite general M-estimation framework:

(i) If a penalty $\mathrm{pen}: \mathcal{M}_n \to \mathbb{R}_+$ is such that, for all models $M \in \mathcal{M}_n$,
$$\mathrm{pen}(M) \le (1-\delta)\,\mathbb{E}[P_n(Ks_M - Ks_n(M))]$$
with $\delta > 0$, then the dimension of the selected model $\widehat{M}$ is "very large" and the excess loss of the selected estimator $s_n(\widehat{M})$ is "much larger" than the excess loss of the oracle.

(ii) If $\mathrm{pen}(M) \approx (1+\delta)\,\mathbb{E}[P_n(Ks_M - Ks_n(M))]$ with $\delta > 0$,
then the corresponding model selection procedure satisfies an oracle inequality with a leading constant $C(\delta) < +\infty$ and the dimension of the selected model is "not too large". Moreover,
$$\mathrm{pen}_{\mathrm{opt}}(M) \approx 2\,\mathbb{E}[P_n(Ks_M - Ks_n(M))]$$
is an optimal penalty.

The mean of the empirical excess loss on $M$, when $M$ varies in $\mathcal{M}_n$, is thus conjectured to be the maximal value of penalty under which the model selection procedure totally misbehaves or, equivalently, the minimal value of penalty above which the procedure achieves an oracle inequality. It is called the minimal penalty, denoted by $\mathrm{pen}_{\min}$: for all $M \in \mathcal{M}_n$,
$$\mathrm{pen}_{\min}(M) = \mathbb{E}[P_n(Ks_M - Ks_n(M))].$$
The optimal penalty is then close to twice the minimal one,
$$\mathrm{pen}_{\mathrm{opt}} \approx 2\,\mathrm{pen}_{\min}. \qquad (7)$$
Let us now briefly explain the points (i) and (ii) above. We give in Section 3.3 precise results which validate the slope heuristics for models of piecewise polynomial functions.

If the chosen penalty is less than the minimal one, $\mathrm{pen} = (1-\delta)\,\mathrm{pen}_{\min}$ with $\delta \in [0,1]$, then for every $M \in \mathcal{M}_n$,
$$P_n(Ks_n(M)) + \mathrm{pen}(M) - P_n(Ks^*)$$
$$= P(Ks_M - Ks^*) + (P_n - P)(Ks_M - Ks^*) - P_n(Ks_M - Ks_n(M)) + \mathrm{pen}(M)$$
$$= P(Ks_M - Ks^*) + (P_n - P)(Ks_M - Ks^*) - \delta P_n(Ks_M - Ks_n(M)) + (1-\delta)\left( \mathbb{E}[P_n(Ks_M - Ks_n(M))] - P_n(Ks_M - Ks_n(M)) \right)$$
$$\approx \ell(s^*, s_M) - \delta P_n(Ks_M - Ks_n(M)).$$
In the latter identity, we neglect the difference between the empirical and true losses of the projections $s_M$, as well as the deviations of the empirical excess loss $P_n(Ks_M - Ks_n(M))$ around its mean. Indeed, as shown by Boucheron and Massart [20], the empirical excess loss satisfies a concentration inequality in a general framework, which allows to neglect the difference with its mean, at least for models that are not too small.

As the empirical excess loss is increasing and the excess loss of the projection $s_M$ is decreasing with respect to the complexity of the models, the penalized criterion is (almost) decreasing with respect to the complexity of the models, and the selected model is among the largest of the collection.

On the contrary, if the chosen penalty is greater than the minimal one, $\mathrm{pen} = (1+\delta)\,\mathrm{pen}_{\min}$ with $\delta > 0$, then for every $M \in \mathcal{M}_n$,
$$P_n(Ks_n(M)) + \mathrm{pen}(M) - P_n(Ks^*) \approx \ell(s^*, s_M) + \delta P_n(Ks_M - Ks_n(M)). \qquad (8)$$
The selected model thus achieves a trade-off between the bias of the models, which decreases with the complexity, and the empirical excess loss, which increases with the complexity of the models. The selected dimension would then be reasonable, and the trade-off between the bias and the complexity of the models is likely to give some oracle inequality.

Finally, if we take $\delta = 1$ in the latter case, $\mathrm{pen} = 2 \times \mathrm{pen}_{\min}$, and if we assume that the empirical excess loss is equivalent to the true excess loss,
$$P_n(Ks_M - Ks_n(M)) \sim P(Ks_n(M) - Ks_M), \qquad (9)$$
then according to (8) the selected model almost minimizes
$$P(Ks_M - Ks^*) + P_n(Ks_M - Ks_n(M)) \approx \ell(s^*, s_M) + P(Ks_n(M) - Ks_M) \approx \ell(s^*, s_n(M)).$$
Hence,
$$\ell(s^*, s_n(\widehat{M})) \approx \ell(s^*, s_n(M^*))$$
and the procedure is nearly optimal.

One can find in [36] some results showing that (9) is a quite general fact in least-squares regression and is in particular satisfied when considering models of piecewise polynomial functions. Thus, these results represent preliminary material for the present study, and we shall base our arguments on the results exposed in [36].
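The dimension jump exhibited by Birgé and Massart, recalled in the introduction, suggests the following calibration sketch (ours, under the simplifying assumption of a penalty shape proportional to $D_M/n$, which is only justified for homoscedastic noise): increase the constant $C$ in $\mathrm{pen}(M) = C D_M/n$ until the selected dimension drops from the largest ones, then select with twice the constant at the jump, in accordance with (7). It reuses the helpers `simulate`, `fit_histogram` and `empirical_risk` from the sketch of Section 2.

```python
import numpy as np

def selected_dimension(X, Y, dims, C):
    # Dimension selected under the penalty pen(M) = C * D_M / n.
    n = len(Y)
    crit = []
    for D in dims:
        cells, means = fit_histogram(X, Y, D)    # helper from the earlier sketch
        crit.append(empirical_risk(Y, cells, means) + C * D / n)
    return dims[int(np.argmin(crit))]

def slope_heuristics(X, Y, dims, grid):
    # Scan C over a grid and locate the largest jump of the selected
    # dimension; the constant at the jump estimates the minimal penalty,
    # and the final penalty is twice that value (slope heuristics).
    D_sel = np.array([selected_dimension(X, Y, dims, C) for C in grid])
    jump = int(np.argmax(-np.diff(D_sel)))       # biggest drop in dimension
    C_min = grid[jump + 1]
    return selected_dimension(X, Y, dims, 2 * C_min), C_min

X, Y = simulate(2000)                            # data from model (1)
dims = list(range(1, 101))
D_hat, C_min = slope_heuristics(X, Y, dims, grid=np.linspace(0.0, 2.0, 81))
print("estimated minimal constant:", C_min, "selected dimension:", D_hat)
```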
We take $\mathcal{X} = [0,1]$, endowed with the marginal distribution $P^X$, and the linear models $M \in \mathcal{M}_n$ are models of piecewise polynomial functions. We denote by $\mathcal{P}_M$ the partition of $\mathcal{X}$ underlying the model $M$.

Set of assumptions for piecewise polynomial functions: (SAPP)
(P1) there exist two positive constants $c_{\mathcal{M}}, \alpha_{\mathcal{M}}$ such that $\mathrm{Card}(\mathcal{M}_n) \le c_{\mathcal{M}} n^{\alpha_{\mathcal{M}}}$.

(P2) there exists a positive constant $A_{\mathcal{M},+}$ such that, for every $M \in \mathcal{M}_n$, $1 \le D_M \le A_{\mathcal{M},+}\, n (\ln n)^{-2} \le n$.

(P3) there exist $c_{\mathrm{rich}} > 0$, $A_{\mathrm{rich}} > 0$ and two models $M_0, M_1 \in \mathcal{M}_n$ such that $D_{M_0} \in \left[ n^{1/(\beta_+ + 1)},\ c_{\mathrm{rich}}\, n^{1/(\beta_+ + 1)} \right]$ and $D_{M_1} \ge A_{\mathrm{rich}}\, n (\ln n)^{-2}$, where $\beta_+$ is defined in (Ap$_u$).

(Ap$_u$) there exist $\beta_+ > 0$ and $C_+ > 0$ such that, for every $M \in \mathcal{M}_n$, $\ell(s^*, s_M) \le C_+ D_M^{-\beta_+}$.

(An) there exists a constant $\sigma_{\min}$ such that $\sigma(X_i) \ge \sigma_{\min} > 0$ a.s.

(Ab) there exists a positive constant $A$ that bounds the data: $|Y_i| \le A < \infty$.

(Ad$_{\mathrm{Leb}}$) $P^X$ has a density $f$ with respect to Leb satisfying, for some constants $c_{\min}$ and $c_{\max}$, $0 < c_{\min} \le f(x) \le c_{\max} < \infty$ for all $x \in [0,1]$.

(Aud) there exists $r \in \mathbb{N}^*$ such that, for all $M \in \mathcal{M}_n$, all $I \in \mathcal{P}_M$ and all $p \in M$, $\deg(p|_I) \le r$.

(Alr) a positive constant $c_{\mathcal{M},\mathrm{Leb}}$ exists such that, for all $M \in \mathcal{M}_n$, $0 < c_{\mathcal{M},\mathrm{Leb}} \le |\mathcal{P}_M| \inf_{I \in \mathcal{P}_M} \mathrm{Leb}(I) < +\infty$.
The set of assumptions (SAPP) can be divided into three groups. Firstly, assumptions (P1), (P2), (P3) and (Ap$_u$) are linked to properties of the collection of models $\mathcal{M}_n$. Secondly, assumptions (An), (Ab) and (Ad$_{\mathrm{Leb}}$) give some constraints on the general regression relation stated in (1). Thirdly, assumptions (Aud) and (Alr) specify some quantities related to the choice of the models of piecewise polynomial functions.

Assumption (P1) states that the collection of models has a "small" complexity, more precisely a polynomially increasing one with respect to the amount of data. For this kind of complexity, if one wants to design a good model selection procedure for prediction, the chosen penalty should estimate the mean of the ideal one on each model, up to a constant. Indeed, as Talagrand's type concentration inequalities for the empirical process are exponential, they allow to neglect the deviations of the quantities of interest from their mean, uniformly over the collection of models. This is not the case for large collections of models, where one has to put an extra log-factor, depending on the complexity of the collection of models, inside the penalty, see for instance [16, 11].

We assume in (P3) that the collection of models contains a model $M_0$ of reasonably large dimension and a model $M_1$ of high dimension, which is necessary since we prove the existence of a jump between high and reasonably large dimensions. One can notice that in practice, the parameter $\beta_+$, which depends on the bias of the models, is not known, and so the existence of $M_0$ is not straightforward. However, it suffices for the statistician to take at least one model per dimension lower than the chosen upper bound to ensure the existence of $M_0$ and $M_1$.

We require in (Ap$_u$) for the quality of approximation of the collection of models to be good enough in terms of the quadratic loss. More precisely, we ask for a polynomial decrease of the excess loss of the linear projections of the regression function onto the models. It is well known that piecewise polynomial functions uniformly bounded in their degrees have good approximation properties in Besov spaces. More precisely, as stated in Lemma 12 of Barron, Birgé and Massart [11], if $\mathcal{X} = [0,1]$ and the regression function $s^*$ belongs to the Besov space $B_{\alpha,p,\infty}(\mathcal{X})$ (see the definition in [11]), then taking models of piecewise polynomial functions of degree bounded by $r > \alpha - 1$, built on regular partitions of $\mathcal{X}$, and assuming that $P^X$ has a density with respect to Leb which is bounded in sup-norm, assumption (Ap$_u$) is satisfied.

Assumption (Ab) is rather restrictive, since it excludes Gaussian noise. However, the assumption of bounded noise is somehow classical when dealing with M-estimation and related procedures. Indeed, a central tool in this field is empirical process theory and, more especially, concentration inequalities for the supremum of the empirical process. We used the classical inequalities of Bousquet, and Klein and Rio, in [36]. As a matter of fact, we do not know yet if an adaptation of our proofs (including results established in [36]) by using extensions of the latter inequalities to some unbounded cases - as for instance in Adamczak's concentration inequalities [1] - would be possible.

The noise restriction stated in (An) is needed to derive our results, which are optimal at the first order. More precisely, it allows in [36] to obtain sharp lower bounds for the true and empirical excess losses on a fixed model. This assumption is also needed in the work of Arlot and Massart [9] concerning the case of histogram models. As noticed in Section 5.3 of [36], assumption (An) could be replaced by the following assumption, which states that the partitions underlying the models of piecewise polynomial functions are regular from above with respect to the Lebesgue measure on $[0,1]$:

(Aur) a positive constant $c^+_{\mathcal{M},\mathrm{Leb}}$ exists such that, for all $M \in \mathcal{M}_n$, $|\mathcal{P}_M| \sup_{I \in \mathcal{P}_M} \mathrm{Leb}(I) \le c^+_{\mathcal{M},\mathrm{Leb}}$.

Assumptions (Ad$_{\mathrm{Leb}}$), (Aud) and (Alr) imply several important properties for the models of piecewise polynomial functions, such as the existence of an orthonormal localized basis in each model or the consistency in sup-norm of least-squares estimators toward the projections of the target onto the models. See also Sections 2.2 and 5.1 for further comments about these properties.
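For concreteness, a small check (ours) of the lower-regularity constant of (Alr) and of the upper-regularity constant of (Aur) for a given partition of $[0,1]$:

```python
import numpy as np

def regularity_constants(breakpoints):
    # For the partition P with the given breakpoints, return
    # |P| * inf_I Leb(I)  (lower regularity, assumption (Alr)) and
    # |P| * sup_I Leb(I)  (upper regularity, assumption (Aur)).
    lengths = np.diff(np.asarray(breakpoints, dtype=float))
    return len(lengths) * lengths.min(), len(lengths) * lengths.max()

# A regular partition is as regular as possible: both constants equal 1.
print(regularity_constants(np.linspace(0, 1, 11)))      # (1.0, 1.0)
# A geometric partition is not lower regular uniformly in |P|.
print(regularity_constants([0, 0.5, 0.75, 0.875, 1]))   # (0.5, 2.0)
```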
We are now able to state our main results leading to the slope heuristics. They describe the behavior of thepenalization procedure defined in (6).
Theorem 2
Take a positive penalty: for all $M \in \mathcal{M}_n$, $\mathrm{pen}(M) \ge 0$. Suppose that the assumptions (SAPP) of Section 3.2 hold, and furthermore suppose that for some $A_{\mathrm{pen}} \in [0,1)$ and $A_p > 0$, the model $M_1$ of assumption (P3) satisfies
$$0 \le \mathrm{pen}(M_1) \le A_{\mathrm{pen}}\, \mathbb{E}[P_n(Ks_{M_1} - Ks_n(M_1))], \qquad (10)$$
with probability at least $1 - A_p n^{-2}$. Then there exist a positive constant $A_1$, only depending on constants in (SAPP), as well as an integer $n_0$ and a positive constant $A_2$, only depending on $A_{\mathrm{pen}}$ and on constants in (SAPP), such that, for all $n \ge n_0$, it holds with probability at least $1 - A_1 n^{-2}$,
$$D_{\widehat{M}} \ge A_2\, n (\ln n)^{-2}$$
and
$$\ell(s^*, s_n(\widehat{M})) \ge \frac{n^{\beta_+/(\beta_+ + 1)}}{(\ln n)^2} \inf_{M \in \mathcal{M}_n} \{\ell(s^*, s_n(M))\}, \qquad (11)$$
where $\beta_+ > 0$ is defined in assumption (Ap$_u$) of (SAPP).

Theorem 2 justifies the first part (i) of the slope heuristics exposed in Section 3. As a matter of fact, it shows that there exists a level such that, if the penalty is smaller than this level for one of the largest models, then the dimension of the output is among the largest dimensions of the collection, and the excess loss of the selected estimator is much larger than the excess loss of the oracle. Moreover, this level is given by the mean of the empirical excess loss of the least-squares estimator on each model. Let us also notice that the lower bound given in (11) gets larger as $\beta_+$ increases. This is due to the fact that when $\beta_+$ increases, the approximation properties of the models improve and the performance in terms of excess loss of the oracle estimator also improves.

The following theorem validates the second part of the slope heuristics.

Theorem 3
Suppose that the assumptions (SAPP) of Section 3.2 hold, and furthermore suppose that for some $\delta \in [0,1)$ and $A_p, A_r > 0$, there exists an event of probability at least $1 - A_p n^{-2}$ on which, for every model $M \in \mathcal{M}_n$ such that $D_M \ge A_{\mathcal{M},+}(\ln n)^2$, it holds
$$\left| \mathrm{pen}(M) - 2\,\mathbb{E}[P_n(Ks_M - Ks_n(M))] \right| \le \delta \left( \ell(s^*, s_M) + \mathbb{E}[P_n(Ks_M - Ks_n(M))] \right), \qquad (12)$$
together with, for every model $M \in \mathcal{M}_n$ such that $D_M \le A_{\mathcal{M},+}(\ln n)^2$,
$$|\mathrm{pen}(M)| \le A_r \left( \frac{\ell(s^*, s_M)}{\ln n} + \frac{(\ln n)^2}{n} \right). \qquad (13)$$
Then, for any $\eta \in \left(0, \beta_+/(2(\beta_+ + 1))\right)$, there exist an integer $n_0$ only depending on $\eta, \delta, \beta_+$ and on constants in (SAPP), a positive constant $A_2$ only depending on $c_{\mathcal{M}}$ given in (SAPP) and on $A_p$, two positive constants $A_3$ and $A_4$ only depending on constants in (SAPP) and on $A_r$, and a sequence
$$\theta_n \le A_3 (\ln n)^{-1/4} \qquad (14)$$
such that it holds for all $n \ge n_0$, with probability at least $1 - A_2 n^{-2}$,
$$D_{\widehat{M}} \le n^{1/(\beta_+ + 1) + \eta}$$
and
$$\ell(s^*, s_n(\widehat{M})) \le \left( \frac{1+\delta}{1-\delta} + \frac{5\theta_n}{(1-\delta)^2} \right) \ell(s^*, s_n(M^*)) + A_4 \frac{(\ln n)^2}{n}. \qquad (15)$$
Assume that in addition, the following assumption holds:

(Ap)
The bias decreases like a power of $D_M$: there exist $\beta_- \ge \beta_+ > 0$ and $C_+, C_- > 0$ such that
$$C_- D_M^{-\beta_-} \le \ell(s^*, s_M) \le C_+ D_M^{-\beta_+}.$$
Then it holds for all $n \ge n_0((\mathrm{SAPP}), C_-, \beta_-, \beta_+, \eta, \delta)$, with probability at least $1 - A_2 n^{-2}$,
$$A_{\mathcal{M},+}(\ln n)^2 \le D_{\widehat{M}} \le n^{1/(\beta_+ + 1) + \eta} \qquad (16)$$
and
$$\ell(s^*, s_n(\widehat{M})) \le \left( \frac{1+\delta}{1-\delta} + \frac{5\theta_n}{(1-\delta)^2} \right) \ell(s^*, s_n(M^*)). \qquad (17)$$

Theorem 3 states that if the penalty is close to twice the minimal one, then the selected estimator satisfies a pathwise oracle inequality with constant almost one, and so the model selection procedure is approximately optimal. Moreover, the selected model is of reasonable dimension, bounded by a power less than one of the sample size.

Condition (Ap) allows to remove the remainder term from the oracle inequality (15), by ensuring that the selected model is of dimension not too small, as stated in (16). Assumption (Ap) is the conjunction of assumption (Ap$_u$) with a polynomial lower bound on the bias of the models. On histogram models, Arlot showed in Section 8.10 of [4] that this lower bound is satisfied for non-constant $\alpha$-Hölder regression functions, $\alpha \in (0,1]$. Theorems 2 and 3 together thus validate the slope heuristics, with minimal penalty
$$\mathrm{pen}_{\min}(M) = \mathbb{E}[P_n(Ks_M - Ks_n(M))],$$
thus generalizing the results of Arlot and Massart in [9] to the case of piecewise polynomial functions.

The conditions on the penalty given in Theorems 2 and 3 can not be directly checked in practice. Indeed, they are expressed in terms of the mean of the empirical excess loss on each model, which is an unknown quantity in general. Nevertheless, in the homoscedastic case, it is easy to see that Mallows' penalty is a nonasymptotic quasi-optimal penalty. According to Theorem 3, such a penalty is given by twice the mean of the empirical excess loss. Now, using Theorem 10 of [36], we get (with an explicit control of the second order terms in the following equivalence),
$$2\,\mathbb{E}[P_n(Ks_M - Ks_n(M))] \sim \frac{1}{2}\mathcal{K}_{1,M}\frac{D_M}{n},$$
where $\mathcal{K}_{1,M} = \frac{1}{D_M}\sum_{k=1}^{D_M} \mathbb{E}\left[ \left( \psi_{1,M}(X,Y) \cdot \varphi_k(X) \right)^2 \right]$, $\psi_{1,M}(X,Y) = -2(Y - s_M(X))$ and $(\varphi_k)_{k=1}^{D_M}$ is an orthonormal basis of $(M, \|\cdot\|_2)$. By easy computations, we deduce that if the noise is homoscedastic, that is $\sigma(X) \equiv \sigma > 0$, it holds
$$\frac{1}{2}\mathcal{K}_{1,M}\frac{D_M}{n} = 2\sigma^2 \frac{D_M}{n} + \frac{2\,\mathbb{E}\left[ (s^* - s_M)^2(X) \sum_{k=1}^{D_M} \varphi_k^2(X) \right]}{n}. \qquad (18)$$
The second term on the right of identity (18) is negligible for models of interest in the conditions of Theorem 3 (thanks to Lemma 7 in [36], which implies that $\sum_{k=1}^{D_M} \varphi_k^2 \le L D_M$ for some constant $L > 0$), so that twice the mean of the empirical excess loss is equivalent to $2\sigma^2 D_M/n$, which is Mallows' classical penalty.

In the case where the noise level is homoscedastic but unknown, Mallows' penalty is only known up to a constant, the noise level, which can be estimated via the slope heuristics (for practical issues about the slope heuristics, see Baudry et al. [14]). But in the common situation where the noise level is sufficiently heteroscedastic, the shape of the ideal penalty is not linear in the dimension of the models, and not even a function of the linear dimensions. In such a case, Arlot [7] proved that any calibration of a linear penalty leads to a suboptimal procedure, which yet can achieve an oracle inequality with a leading constant more than one.

In order to achieve a nearly optimal selection procedure in the general situation, it remains to estimate the ideal penalty or, thanks to the slope heuristics, the shape of the ideal penalty. This section is devoted to this task. We propose a hold-out type penalty that automatically adapts to heteroscedasticity. Let us now detail our hold-out penalization procedure.

The ideal penalty is defined by
$$\mathrm{pen}_{\mathrm{id}}(M) := P(Ks_n(M)) - P_n(Ks_n(M)),$$
for all $M \in \mathcal{M}_n$. A natural idea is to divide the data into two groups, indexed by $I_1$ and $I_2$, satisfying $I_1 \cap I_2 = \emptyset$ and $I_1 \cup I_2 = \{1, \ldots, n\}$, and to propose the following hold-out type penalty,
$$\mathrm{pen}_{\mathrm{ho},C}(M) := C \left( P_n^{(2)}(Ks_n^{(1)}(M)) - P_n^{(1)}(Ks_n^{(1)}(M)) \right),$$
where $P_n^{(i)} = n_i^{-1}\sum_{j \in I_i} \delta_{\xi_j}$, $n_i = \mathrm{Card}(I_i)$ for $i = 1, 2$, $s_n^{(1)}(M) \in \arg\min_{s \in M} P_n^{(1)}(Ks)$ and $C > 0$. Indeed, if $n_1$ is not too small, $P_n^{(1)}(Ks_n^{(1)}(M))$ is likely to vary like $P_n(Ks_n(M))$, and $P_n^{(2)}(Ks_n^{(1)}(M))$ is, conditionally to $(\xi_j)_{j \in I_1}$, an unbiased estimate of $P(Ks_n^{(1)}(M))$, which again is likely to vary like $P(Ks_n(M))$. Moreover, we see from Theorem 10 in [36] that when the model $M$ is fixed, the quantities $P_n(Ks_M - Ks_n(M))$ and $P(Ks_n(M) - Ks_M)$ are almost inversely proportional to the sample size, so a good constant in front of the hold-out penalty should be $C_{\mathrm{opt}} = n_1/n$.

The previous observation is justified by the following theorem, where for the sake of clarity we fix $n_1 = n_2 = n/2$. For a more general version of Theorem 4, see Section 5.3.
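A minimal sketch (ours) of the hold-out penalty $\mathrm{pen}_{\mathrm{ho},C}$ with the constant $C = n_1/n$ discussed above, on the histogram models of the earlier sketches (it reuses `simulate`, `fit_histogram` and `empirical_risk` from Section 2), with a half-and-half split as in Theorem 4 below:

```python
import numpy as np

def holdout_penalty(X, Y, I1, I2, D):
    # pen_{ho,C}(M) = C * ( P_n^(2)(K s_n^(1)(M)) - P_n^(1)(K s_n^(1)(M)) )
    # with C = n_1 / n; s_n^(1)(M) is the least-squares fit on the first half.
    n = len(Y)
    cells1, means1 = fit_histogram(X[I1], Y[I1], D)
    risk1 = empirical_risk(Y[I1], cells1, means1)
    cells2 = np.minimum((X[I2] * D).astype(int), D - 1)
    risk2 = np.mean((Y[I2] - means1[cells2]) ** 2)
    return (len(I1) / n) * (risk2 - risk1)

def select_holdout(X, Y, dims):
    # Selection rule of the form (19): full-sample empirical risk plus pen_ho.
    n = len(Y)
    idx = np.random.default_rng(2).permutation(n)
    I1, I2 = idx[: n // 2], idx[n // 2 :]
    crit = []
    for D in dims:
        cells, means = fit_histogram(X, Y, D)
        crit.append(empirical_risk(Y, cells, means) + holdout_penalty(X, Y, I1, I2, D))
    return dims[int(np.argmin(crit))]

X, Y = simulate(2000)
print("hold-out selected dimension:", select_holdout(X, Y, list(range(1, 101))))
```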
We set
$$\mathrm{pen}_{\mathrm{ho}}(M) = \frac{1}{2}\left( P_n^{(2)}(Ks_n^{(1)}(M)) - P_n^{(1)}(Ks_n^{(1)}(M)) \right) \quad \text{and} \quad \widehat{M}_{1/2} \in \arg\min_{M \in \mathcal{M}_n} \left\{ P_n(Ks_n(M)) + \mathrm{pen}_{\mathrm{ho}}(M) \right\}. \qquad (19)$$

Theorem 4
Consider the procedure defined in (19), with $n_1 = n_2 = n/2$. Suppose that the assumptions (SAPP) of Section 3.2 hold. Then, for any $\eta \in \left(0, \beta_+/(2(\beta_+ + 1))\right)$, there exist an integer $n_0$ only depending on $\eta$ and on constants in (SAPP), a positive constant $A_2$ only depending on $c_{\mathcal{M}}$ given in (SAPP), two positive constants $A_3$ and $A_4$ only depending on constants in (SAPP), and a sequence $\theta_n \le A_3 (\ln n)^{-1/4}$ such that it holds for all $n \ge n_0$, with probability at least $1 - A_2 n^{-2}$,
$$D_{\widehat{M}_{1/2}} \le n^{1/(\beta_+ + 1) + \eta}$$
and
$$\ell(s^*, s_n(\widehat{M}_{1/2})) \le (1 + \theta_n)\, \ell(s^*, s_n(M^*)) + A_4 \frac{(\ln n)^2}{n}. \qquad (20)$$
Assume that in addition (Ap) holds (see Theorem 3). Then it holds for all $n \ge n_0((\mathrm{SAPP}), C_-, \beta_-, \eta)$, with probability at least $1 - A_2 n^{-2}$,
$$A_{\mathcal{M},+}(\ln n)^2 \le D_{\widehat{M}_{1/2}} \le n^{1/(\beta_+ + 1) + \eta}$$
and
$$\ell(s^*, s_n(\widehat{M}_{1/2})) \le (1 + \theta_n) \inf_{M \in \mathcal{M}_n} \{\ell(s^*, s_n(M))\}. \qquad (21)$$

Theorem 4 shows the asymptotic optimality of the hold-out penalization procedure, for a half-and-half split of the data. This is a remarkable fact compared to the classical hold-out, defined by
$$\widehat{M}_{\mathrm{ho}} \in \arg\min_{M \in \mathcal{M}_n} \left\{ P_n^{(2)}(Ks_n^{(1)}(M)) \right\}. \qquad (22)$$
Indeed, the choice $n_1 = n/2$ in (22) leads to a procedure that mimics the minimizer of $P(Ks_{n/2}(M))$, and so is close to the oracle, but for $n/2$ observations instead of $n$; this loss is avoided by hold-out penalization, as it is for the resampling and $V$-fold penalties.

Notice also that the random hold-out penalty proposed by Arlot [6] is proportional to the mean over the splits of our hold-out penalty, thus providing a "stabilization effect" in practice. This should bring some improvement compared to our unique split, at the price of an increased computational cost. However, the stabilization effect seems more difficult to study mathematically, and our results provide a first step toward the study of the more complicated resampling penalties.
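For illustration, a sketch (ours) of such a stabilized variant, averaging the hold-out penalty of the previous sketch over $B$ random half-and-half splits; the number of splits $B$ is our choice, and the helpers `holdout_penalty` and `simulate` are those defined previously.

```python
import numpy as np

def averaged_holdout_penalty(X, Y, D, B=20, seed=3):
    # Mean over B random half-and-half splits of the hold-out penalty;
    # averaging stabilizes the penalty at the price of B-fold more fits.
    rng = np.random.default_rng(seed)
    n = len(Y)
    pens = []
    for _ in range(B):
        idx = rng.permutation(n)
        pens.append(holdout_penalty(X, Y, idx[: n // 2], idx[n // 2 :], D))
    return float(np.mean(pens))

X, Y = simulate(2000)
pens = [averaged_holdout_penalty(X, Y, D) for D in (5, 20, 80)]
print("averaged hold-out penalties:", pens)
```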
We first present in Section 5.1 some "structural" properties of models, denoted (GSA), that are sufficient for our needs and that are satisfied for the models of piecewise polynomial functions considered in (SAPP). Then, in Sections 5.2 and 5.3 respectively, we prove the results stated in Sections 3.3 and 4, for (GSA) instead of (SAPP).

General set of assumptions: (GSA)
Assume (P1), (P2), (P3), (An) and (Ap$_u$) of (SAPP). Furthermore suppose that:

(Ab')
a positive constant $A$ exists, such that for all $M \in \mathcal{M}_n$, $|Y_i| \le A < \infty$ and $\|s_M\|_\infty \le A < \infty$.

(Alb) there exists a constant $r_{\mathcal{M}}$ such that for each $M \in \mathcal{M}_n$ one can find an orthonormal basis $(\varphi_k)_{k=1}^{D_M}$ satisfying, for all $(\beta_k)_{k=1}^{D_M} \in \mathbb{R}^{D_M}$,
$$\left\| \sum_{k=1}^{D_M} \beta_k \varphi_k \right\|_\infty \le r_{\mathcal{M}} \sqrt{D_M}\, |\beta|_\infty,$$
where $|\beta|_\infty = \max\{|\beta_k|;\ k \in \{1, \ldots, D_M\}\}$.

(Ac$_\infty$) a positive integer $n_1$ exists such that, for all $n \ge n_1$, there exist a positive constant $A_{\mathrm{cons}}$ and an event $\Omega_\infty$ of probability at least $1 - n^{-2-\alpha_{\mathcal{M}}}$, on which for all $M \in \mathcal{M}_n$,
$$\|s_n(M) - s_M\|_\infty \le A_{\mathrm{cons}} \sqrt{\frac{D_M \ln n}{n}}. \qquad (23)$$

Notice that the covariate space $\mathcal{X}$ is general in (GSA). Let us explain how assumptions (Ab), (Ad$_{\mathrm{Leb}}$), (Aud) and (Alr) of (SAPP) allow to recover (Ab'), (Alb) and (Ac$_\infty$) of (GSA) in the special case of models of piecewise polynomial functions.

Assumption (Ab') only differs from (Ab) by the fact that the projections of the target onto the models are uniformly bounded in sup-norm. In the general case, this is indeed not guaranteed, but considering piecewise polynomial functions uniformly bounded in their degrees, this follows from simple computations (see Section 5.3 in [36]). Then, assumption (Alb) requires the existence of a localized orthonormal basis for each model. In the case of piecewise polynomial functions, this is ensured by (Ad$_{\mathrm{Leb}}$), (Aud) and (Alr), see Lemma 8 of [36]. Finally, assumption (Ac$_\infty$) states the consistency of each estimator in sup-norm. Again, this is satisfied for models of piecewise polynomial functions under assumptions (Ad$_{\mathrm{Leb}}$), (Aud) and (Alr). This result is established in Lemma 9 of [36].

Let us now describe a set of assumptions, less restrictive than (SAPP), that allows to recover (GSA) when considering histogram models. Lemmas 5 and 6 of [36] allow to recover (GSA) from (SAH) for models of histograms.
Set of assumptions for histogram models: (SAH)

Given some linear histogram model $M \in \mathcal{M}_n$, we denote by $\mathcal{P}_M$ the associated partition of $\mathcal{X}$. Take assumptions (P1), (P2), (P3), (An), (Ab) and (Ap$_u$) from (SAPP). Assume moreover:

(Alrh) there exists a positive constant $c^h_{\mathcal{M},\mathcal{P}}$ such that, for all $M \in \mathcal{M}_n$, $0 < c^h_{\mathcal{M},\mathcal{P}} \le |\mathcal{P}_M| \inf_{I \in \mathcal{P}_M} P^X(I)$.

Theorems 2 and 3 would also be valid when replacing the set of assumptions (SAPP) by (SAH). This would lead to the (almost exact) recovery of the assumptions and results described in Theorems 2 and 3 of [9], concerning the selection of least-squares estimators among histogram models.
The following remark will be useful.
Remark 5
Since the constants in (GSA) are uniform over the collection $\mathcal{M}_n$, we deduce from Theorem 2 of [36], applied with $\alpha = 2 + \alpha_{\mathcal{M}}$ and $A_- = A_+ = A_{\mathcal{M},+}$, that if assumptions (P2), (Ab'), (An), (Alb) and (Ac$_\infty$) hold, then a positive constant $A_0$ exists, depending on $\alpha_{\mathcal{M}}$, $A_{\mathcal{M},+}$ and on the constants $A$, $\sigma_{\min}$ and $r_{\mathcal{M}}$ defined in (GSA), such that for all $M \in \mathcal{M}_n$ satisfying $0 < A_{\mathcal{M},+}(\ln n)^2 \le D_M$, by setting
$$\varepsilon_n(M) = A_0 \max\left\{ \left( \frac{\ln n}{D_M} \right)^{1/4};\ \left( \frac{D_M \ln n}{n} \right)^{1/4} \right\} \qquad (24)$$
we have, for all $n \ge n_0(A_{\mathcal{M},+}, A, A_{\mathrm{cons}}, n_1, r_{\mathcal{M}}, \sigma_{\min}, \alpha_{\mathcal{M}})$,
$$\mathbb{P}\left[ (1 - \varepsilon_n(M))\,\frac{1}{4}\frac{D_M}{n}\mathcal{K}_{1,M} \le P(Ks_n(M) - Ks_M) \le (1 + \varepsilon_n(M))\,\frac{1}{4}\frac{D_M}{n}\mathcal{K}_{1,M} \right] \ge 1 - n^{-2-\alpha_{\mathcal{M}}} \qquad (25)$$
and
$$\mathbb{P}\left[ (1 - \varepsilon_n(M))\,\frac{1}{4}\frac{D_M}{n}\mathcal{K}_{1,M} \le P_n(Ks_M - Ks_n(M)) \le (1 + \varepsilon_n(M))\,\frac{1}{4}\frac{D_M}{n}\mathcal{K}_{1,M} \right] \ge 1 - n^{-2-\alpha_{\mathcal{M}}}, \qquad (26)$$
where $\mathcal{K}_{1,M} = \frac{1}{D_M}\sum_{k=1}^{D_M} \mathbb{E}\left[ \left( \psi_{1,M}(X,Y) \cdot \varphi_k(X) \right)^2 \right]$, $\psi_{1,M}(X,Y) = -2(Y - s_M(X))$ and $(\varphi_k)_{k=1}^{D_M}$ is an orthonormal basis of $(M, \|\cdot\|_2)$. Moreover, for all $M \in \mathcal{M}_n$, we have by Theorem 3 of [36], for a positive constant $A_u$ depending on $A$, $A_{\mathrm{cons}}$, $r_{\mathcal{M}}$ and $\alpha_{\mathcal{M}}$ and for all $n \ge n_0(A_{\mathrm{cons}}, n_1)$,
$$\mathbb{P}\left[ P(Ks_n(M) - Ks_M) \ge A_u \frac{D_M \vee \ln n}{n} \right] \le n^{-2-\alpha_{\mathcal{M}}} \qquad (27)$$
and
$$\mathbb{P}\left[ P_n(Ks_M - Ks_n(M)) \ge A_u \frac{D_M \vee \ln n}{n} \right] \le n^{-2-\alpha_{\mathcal{M}}}. \qquad (28)$$
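As a numerical illustration (ours) of the equivalences (25) and (26): on a regular histogram model with uniform design, $\varphi_k = \mathbf{1}_{I_k}/\sqrt{P^X(I_k)}$, so that $\mathcal{K}_{1,M}$ reduces to the average over the cells of $4\,\mathbb{E}[(Y - s_M(X))^2 \mid X \in I_k]$; both excess losses should then concentrate around $\frac{1}{4}\mathcal{K}_{1,M} D_M/n$. The Monte Carlo proxies below are ours.

```python
import numpy as np

rng = np.random.default_rng(4)
n, D, N = 5000, 50, 400000

def sample(m):
    X = rng.uniform(0, 1, m)
    sigma = 0.2 + 0.5 * X
    Y = np.sin(2 * np.pi * X) + sigma * rng.uniform(-np.sqrt(3), np.sqrt(3), m)
    return X, Y

# Monte Carlo proxies for s_M and K_{1,M} on the regular histogram model
# with D cells (phi_k = 1_{I_k} / sqrt(P^X(I_k)), and P^X uniform here).
Xl, Yl = sample(N)
cl = np.minimum((Xl * D).astype(int), D - 1)
s_M = np.array([Yl[cl == k].mean() for k in range(D)])
K1 = np.mean([4 * np.mean((Yl[cl == k] - s_M[k]) ** 2) for k in range(D)])

# One sample of size n: empirical and (approximate) true excess losses.
X, Y = sample(n)
c = np.minimum((X * D).astype(int), D - 1)
s_n = np.array([Y[c == k].mean() if np.any(c == k) else 0.0 for k in range(D)])
emp_excess = np.mean((Y - s_M[c]) ** 2) - np.mean((Y - s_n[c]) ** 2)
true_excess = np.mean((s_n[cl] - s_M[cl]) ** 2)   # integrates against P^X
print(emp_excess, true_excess, 0.25 * K1 * D / n)
```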
Two technical lemmas are needed. In the first lemma, we evaluate the minimal penalty $\mathbb{E}[P_n(Ks_M - Ks_n(M))]$ for models of dimension not too small.

Lemma 6

Assume (P2), (Ab'), (An), (Alb) and (Ac$_\infty$) of (GSA). Then, for every model $M \in \mathcal{M}_n$ of dimension $D_M$ such that $0 < A_{\mathcal{M},+}(\ln n)^2 \le D_M$, we have for all $n \ge n_0(A_{\mathcal{M},+}, A, A_{\mathrm{cons}}, n_1, r_{\mathcal{M}}, \sigma_{\min}, \alpha_{\mathcal{M}})$,
$$\left( 1 - L_{A_{\mathcal{M},+},A,\sigma_{\min},r_{\mathcal{M}},\alpha_{\mathcal{M}}}\, \varepsilon_n(M) \right)\frac{1}{4}\frac{D_M}{n}\mathcal{K}_{1,M} \le \mathbb{E}[P_n(Ks_M - Ks_n(M))] \qquad (29)$$
$$\le \left( 1 + L_{A_{\mathcal{M},+},A,\sigma_{\min},r_{\mathcal{M}},\alpha_{\mathcal{M}}}\, \varepsilon_n(M) \right)\frac{1}{4}\frac{D_M}{n}\mathcal{K}_{1,M}, \qquad (30)$$
where $\varepsilon_n(M) = A_0 \max\left\{ \left( \frac{\ln n}{D_M} \right)^{1/4};\ \left( \frac{D_M \ln n}{n} \right)^{1/4} \right\}$ is defined in Remark 5.

Proof.
As explained in Remark 5, for all $n \ge n_0(A_{\mathcal{M},+}, A, A_{\mathrm{cons}}, n_1, r_{\mathcal{M}}, \sigma_{\min}, \alpha_{\mathcal{M}})$, we have on an event $\Omega_n(M)$ of probability at least $1 - n^{-2-\alpha_{\mathcal{M}}}$,
$$(1 - \varepsilon_n(M))\,\frac{1}{4}\frac{D_M}{n}\mathcal{K}_{1,M} \le P_n(Ks_M - Ks_n(M)) \le (1 + \varepsilon_n(M))\,\frac{1}{4}\frac{D_M}{n}\mathcal{K}_{1,M}, \qquad (31)$$
where $\varepsilon_n(M) = A_0 \max\left\{ \left( \frac{\ln n}{D_M} \right)^{1/4};\ \left( \frac{D_M \ln n}{n} \right)^{1/4} \right\}$. Moreover, as $|Y_i| \le A$ a.s. and $\|s_M\|_\infty \le A$ by (Ab'), it holds
$$0 \le P_n(Ks_M - Ks_n(M)) \le P_n(Ks_M) = \frac{1}{n}\sum_{i=1}^n (Y_i - s_M(X_i))^2 \le 4A^2, \qquad (32)$$
and as $D_M \ge 1$, we have
$$\varepsilon_n(M) = A_0 \max\left\{ \left( \frac{\ln n}{D_M} \right)^{1/4};\ \left( \frac{D_M \ln n}{n} \right)^{1/4} \right\} \ge A_0 n^{-1/4}. \qquad (33)$$
We also have
$$\mathbb{E}[P_n(Ks_M - Ks_n(M))] = \mathbb{E}\left[ P_n(Ks_M - Ks_n(M))\,\mathbf{1}_{\Omega_n(M)} \right] + \mathbb{E}\left[ P_n(Ks_M - Ks_n(M))\,\mathbf{1}_{(\Omega_n(M))^c} \right]. \qquad (34)$$
Now notice that by (An) we have $\mathcal{K}_{1,M} \ge 4\sigma_{\min}^2 > 0$. Hence, as $D_M \ge 1$, it comes from (32) and (33) that
$$0 \le \mathbb{E}\left[ P_n(Ks_M - Ks_n(M))\,\mathbf{1}_{(\Omega_n(M))^c} \right] \le 4A^2 n^{-2-\alpha_{\mathcal{M}}} \le \frac{A^2}{A_0 \sigma_{\min}^2}\, \varepsilon_n(M)\, \frac{D_M}{n}\mathcal{K}_{1,M}. \qquad (35)$$
Moreover, we have $\varepsilon_n(M) < 1$ for all $n \ge n_0(A_0, A_{\mathcal{M},+}, A_{\mathrm{cons}})$, so by (31),
$$0 < \left( 1 - n^{-2-\alpha_{\mathcal{M}}} \right)\left( 1 - \varepsilon_n(M) \right)\frac{1}{4}\frac{D_M}{n}\mathcal{K}_{1,M} \le \mathbb{E}\left[ P_n(Ks_M - Ks_n(M))\,\mathbf{1}_{\Omega_n(M)} \right] \qquad (36)$$
$$\le \left( 1 + \varepsilon_n(M) \right)\frac{1}{4}\frac{D_M}{n}\mathcal{K}_{1,M}. \qquad (37)$$
Finally, noticing that $n^{-2-\alpha_{\mathcal{M}}} \le A_0^{-1}\varepsilon_n(M)$ by (33), we use (35), (36) and (37) in (34) to conclude by straightforward computations that
$$L_{A_{\mathcal{M},+},A,\sigma_{\min},r_{\mathcal{M}},\alpha_{\mathcal{M}}} = 80 A^2 A_0^{-1}\sigma_{\min}^{-2} + 5A_0^{-1} + 1$$
is convenient in (29) and (30), as $A_0$ only depends on $\alpha_{\mathcal{M}}$, $A_{\mathcal{M},+}$, $A$, $\sigma_{\min}$ and $r_{\mathcal{M}}$. □

Lemma 7

Let $\alpha > 0$. Assume that (Ab') of
(GSA) is satisfied. Then there exists a positive constant $A_d$, depending only on $A$, $A_{\mathcal{M},+}$, $\sigma_{\min}$ and $\alpha$, such that, by setting $\bar\delta(M) = (P_n - P)(Ks_M - Ks^*)$, we have for all $M \in \mathcal{M}_n$,
$$\mathbb{P}\left[ |\bar\delta(M)| \ge A_d\left( \sqrt{\ell(s^*, s_M)\frac{\ln n}{n}} + \frac{\ln n}{n} \right) \right] \le 2n^{-\alpha}. \qquad (38)$$
If moreover, assumptions (P2), (An), (Alb) and (Ac$_\infty$) of (GSA) hold, then for all $M \in \mathcal{M}_n$ such that $A_{\mathcal{M},+}(\ln n)^2 \le D_M$ and for all $n \ge n_0(A_{\mathcal{M},+}, A, A_{\mathrm{cons}}, n_1, r_{\mathcal{M}}, \sigma_{\min}, \alpha)$, we have
$$\mathbb{P}\left[ |\bar\delta(M)| \ge \frac{\ell(s^*, s_M)}{\sqrt{D_M}} + A_d \frac{\ln n}{\sqrt{D_M}}\,\mathbb{E}[p_2(M)] \right] \le 2n^{-\alpha}, \qquad (39)$$
where $p_2(M) := P_n(Ks_M - Ks_n(M)) \ge 0$.

Proof.
We set
$$A_d = \max\left\{ 4A\sqrt{\alpha};\ 8A^2\alpha;\ 8A^2\alpha\sqrt{A_{\mathcal{M},+}}\,\sigma_{\min}^{-2} + 16A^2\alpha A_{\mathcal{M},+}\,\sigma_{\min}^{-2} \right\}. \qquad (40)$$
Since by (Ab') we have $|Y| \le A$ a.s. and $\|s_M\|_\infty \le A$, it holds $\|s^*\|_\infty = \|\mathbb{E}[Y \mid X]\|_\infty \le A$, and so $\|s_M - s^*\|_\infty \le 2A$. Next, we apply Bernstein's inequality (see Proposition 2.9 of [34]) to
$$\bar\delta(M) = (P_n - P)(Ks_M - Ks^*).$$
Notice that
$$K(s_M)(x,y) - K(s^*)(x,y) = (s_M(x) - s^*(x))(s_M(x) + s^*(x) - 2y),$$
hence $\|Ks_M - Ks^*\|_\infty \le 8A^2$. Moreover, as $\mathbb{E}[Y - s^*(X) \mid X] = 0$ and $\mathbb{E}[(Y - s^*(X))^2 \mid X] \le (2A)^2 = 4A^2$, we have
$$\mathbb{E}\left[ (Ks_M(X,Y) - Ks^*(X,Y))^2 \right] = \mathbb{E}\left[ \left( -2(Y - s^*(X)) + (s_M(X) - s^*(X)) \right)^2 (s_M(X) - s^*(X))^2 \right] \le 8A^2\,\mathbb{E}\left[ (s_M(X) - s^*(X))^2 \right] = 8A^2\,\ell(s^*, s_M),$$
and therefore, by Bernstein's inequality, we have for all $x > 0$,
$$\mathbb{P}\left[ |\bar\delta(M)| \ge \sqrt{\frac{16A^2\,\ell(s^*, s_M)\,x}{n}} + \frac{8A^2 x}{n} \right] \le 2\exp(-x).$$
By taking $x = \alpha \ln n$, we then have
$$\mathbb{P}\left[ |\bar\delta(M)| \ge 4A\sqrt{\frac{\alpha\,\ell(s^*, s_M)\ln n}{n}} + \frac{8A^2\alpha \ln n}{n} \right] \le 2n^{-\alpha}, \qquad (41)$$
which gives the first part of Lemma 7 for $A_d$ given in (40). Now, using the fact that $2\sqrt{ab} \le a\eta + b\eta^{-1}$ for all $\eta > 0$, and applying it in (41) with $a = \ell(s^*, s_M)$, $b = 4A^2\alpha\frac{\ln n}{n}$ and $\eta = D_M^{-1/2}$, we obtain
$$\mathbb{P}\left[ |\bar\delta(M)| \ge \frac{\ell(s^*, s_M)}{\sqrt{D_M}} + \left( 4\sqrt{D_M} + 8 \right)\frac{A^2\alpha \ln n}{n} \right] \le 2n^{-\alpha}. \qquad (42)$$
Then, for a model $M \in \mathcal{M}_n$ such that $A_{\mathcal{M},+}(\ln n)^2 \le D_M$, we apply Lemma 6 and by (29), it holds for all $n \ge n_0(A_{\mathcal{M},+}, A, A_{\mathrm{cons}}, n_1, r_{\mathcal{M}}, \sigma_{\min}, \alpha_{\mathcal{M}})$,
$$\left( 1 - L_{A_{\mathcal{M},+},A,\sigma_{\min},r_{\mathcal{M}},\alpha_{\mathcal{M}}}\,\varepsilon_n(M) \right)\frac{1}{4}\frac{D_M}{n}\mathcal{K}_{1,M} \le \mathbb{E}[p_2(M)], \qquad (43)$$
where $\varepsilon_n(M) = A_0 \max\left\{ \left( \frac{\ln n}{D_M} \right)^{1/4};\ \left( \frac{D_M \ln n}{n} \right)^{1/4} \right\}$. Moreover, as $D_M \le A_{\mathcal{M},+} n(\ln n)^{-2}$ by (P2) and $A_{\mathcal{M},+}(\ln n)^2 \le D_M$, we deduce that for all $n \ge n_0(A_{\mathcal{M},+}, A, A_{\mathrm{cons}}, r_{\mathcal{M}}, \sigma_{\min}, \alpha_{\mathcal{M}})$, $L_{A_{\mathcal{M},+},A,\sigma_{\min},r_{\mathcal{M}},\alpha_{\mathcal{M}}}\,\varepsilon_n(M) \le 1/2$. Since $\mathcal{K}_{1,M} \ge 4\sigma_{\min}^2 > 0$ by (An), we have by (43), $\mathbb{E}[p_2(M)] \ge \frac{\sigma_{\min}^2}{2}\frac{D_M}{n}$ for all $n \ge n_0(A_{\mathcal{M},+}, A, A_{\mathrm{cons}}, n_1, r_{\mathcal{M}}, \sigma_{\min}, \alpha_{\mathcal{M}})$. This allows, using (42), to conclude the proof for the value of $A_d$ given in (40) by simple computations. □

In order to avoid cumbersome notations in the proofs of Theorems 2 and 3, when generic constants $L$ and $n_0$ depend only on constants defined in the general set of assumptions stated in Section 5.1, we write $L_{(\mathrm{GSA})}$ and $n_0((\mathrm{GSA}))$. The values of these constants may change from line to line.
Proof of Theorem 3.
From the definition of the selected model $\widehat{M}$ given in (6), $\widehat{M}$ minimizes
$$\mathrm{crit}(M) := P_n(Ks_n(M)) + \mathrm{pen}(M) \qquad (44)$$
over the models $M \in \mathcal{M}_n$. Hence, $\widehat{M}$ also minimizes
$$\mathrm{crit}'(M) := \mathrm{crit}(M) - P_n(Ks^*) \qquad (45)$$
over the collection $\mathcal{M}_n$. Let us write
$$\ell(s^*, s_n(M)) = P(Ks_n(M) - Ks^*) = P_n(Ks_n(M)) + P_n(Ks_M - Ks_n(M)) + (P_n - P)(Ks^* - Ks_M) + P(Ks_n(M) - Ks_M) - P_n(Ks^*).$$
By setting
$$p_1(M) = P(Ks_n(M) - Ks_M), \quad p_2(M) = P_n(Ks_M - Ks_n(M)), \quad \bar\delta(M) = (P_n - P)(Ks_M - Ks^*)$$
and
$$\mathrm{pen}'_{\mathrm{id}}(M) = p_1(M) + p_2(M) - \bar\delta(M),$$
we have
$$\ell(s^*, s_n(M)) = P_n(Ks_n(M)) + p_1(M) + p_2(M) - \bar\delta(M) - P_n(Ks^*) \qquad (46)$$
and by (45),
$$\mathrm{crit}'(M) = \ell(s^*, s_n(M)) + \left( \mathrm{pen}(M) - \mathrm{pen}'_{\mathrm{id}}(M) \right). \qquad (47)$$
As $\widehat{M}$ minimizes $\mathrm{crit}'$ over $\mathcal{M}_n$, it is therefore sufficient by (47) to control $\mathrm{pen}(M) - \mathrm{pen}'_{\mathrm{id}}(M)$ - or equivalently $\mathrm{crit}'(M)$ - in terms of the excess loss $\ell(s^*, s_n(M))$, for every $M \in \mathcal{M}_n$, in order to derive oracle inequalities.

Let $\Omega_n$ be the event on which:

• for all models $M \in \mathcal{M}_n$ of dimension $D_M$ such that $A_{\mathcal{M},+}(\ln n)^2 \le D_M$, (12) holds together with
$$|p_1(M) - \mathbb{E}[p_1(M)]| \le L_{(\mathrm{GSA})}\,\varepsilon_n(M)\,\mathbb{E}[p_1(M)], \qquad (48)$$
$$|p_2(M) - \mathbb{E}[p_2(M)]| \le L_{(\mathrm{GSA})}\,\varepsilon_n(M)\,\mathbb{E}[p_2(M)], \qquad (49)$$
$$|\bar\delta(M)| \le \frac{\ell(s^*, s_M)}{\sqrt{D_M}} + L_{(\mathrm{GSA})}\frac{\ln n}{\sqrt{D_M}}\,\mathbb{E}[p_2(M)], \qquad (50)$$
$$|\bar\delta(M)| \le L_{(\mathrm{GSA})}\left( \sqrt{\ell(s^*, s_M)\frac{\ln n}{n}} + \frac{\ln n}{n} \right); \qquad (51)$$

• for all models $M \in \mathcal{M}_n$ of dimension $D_M$ such that $D_M \le A_{\mathcal{M},+}(\ln n)^2$, (13) holds together with
$$|\bar\delta(M)| \le L_{(\mathrm{GSA})}\left( \sqrt{\ell(s^*, s_M)\frac{\ln n}{n}} + \frac{\ln n}{n} \right), \qquad (52)$$
$$p_2(M) \le L_{(\mathrm{GSA})}\frac{D_M \vee \ln n}{n} \le L_{(\mathrm{GSA})}\frac{(\ln n)^2}{n}, \qquad (53)$$
$$p_1(M) \le L_{(\mathrm{GSA})}\frac{D_M \vee \ln n}{n} \le L_{(\mathrm{GSA})}\frac{(\ln n)^2}{n}. \qquad (54)$$

By (25), (26), (27) and (28) in Remark 5, Lemma 6, Lemma 7 applied with $\alpha = 2 + \alpha_{\mathcal{M}}$, and since (12) and (13) hold with probability at least $1 - A_p n^{-2}$, we get for all $n \ge n_0((\mathrm{GSA}))$,
$$\mathbb{P}(\Omega_n) \ge 1 - A_p n^{-2} - \sum_{M \in \mathcal{M}_n} L\, n^{-2-\alpha_{\mathcal{M}}} \ge 1 - L_{A_p, c_{\mathcal{M}}}\, n^{-2}.$$

Control of the criterion crit′ for models of dimension not too small: we consider models $M \in \mathcal{M}_n$ such that $A_{\mathcal{M},+}(\ln n)^2 \le D_M$. Notice that (50) implies by (24) that, for all such models and for all $n \ge n_0((\mathrm{GSA}))$,
$$|\bar\delta(M)| \le L_{(\mathrm{GSA})}\left( \frac{(\ln n)^3}{D_M^2} \right)^{1/4}\mathbb{E}[\ell(s^*, s_M) + p_2(M)] \le L_{(\mathrm{GSA})}\,\varepsilon_n(M)\,\mathbb{E}[\ell(s^*, s_M) + p_2(M)],$$
so that on $\Omega_n$ we have, for all models $M \in \mathcal{M}_n$ such that $A_{\mathcal{M},+}(\ln n)^2 \le D_M$,
$$|\mathrm{pen}'_{\mathrm{id}}(M) - \mathrm{pen}(M)| \le |p_1(M) + p_2(M) - \mathrm{pen}(M)| + |\bar\delta(M)|$$
$$\le |p_1(M) + p_2(M) - 2\mathbb{E}[p_2(M)]| + \left( L_{(\mathrm{GSA})}\varepsilon_n(M) + \delta \right)\mathbb{E}[\ell(s^*, s_M) + p_2(M)]$$
$$\le \left( \delta + L_{(\mathrm{GSA})}\varepsilon_n(M) \right)\mathbb{E}[\ell(s^*, s_M) + p_2(M)]. \qquad (55)$$
Now notice that using (P2) in (24) gives, for all models $M \in \mathcal{M}_n$ such that $A_{\mathcal{M},+}(\ln n)^2 \le D_M$ and for all $n \ge n_0((\mathrm{GSA}))$, $0 < L_{(\mathrm{GSA})}\varepsilon_n(M) \le 1/2$. As $\ell(s^*, s_n(M)) = \ell(s^*, s_M) + p_1(M)$, we thus have on $\Omega_n$, for all $n \ge n_0((\mathrm{GSA}))$,
$$0 \le \mathbb{E}[\ell(s^*, s_M) + p_2(M)] \le \ell(s^*, s_n(M)) + |p_1(M) - \mathbb{E}[p_1(M)]| \le \ell(s^*, s_n(M)) + \frac{L_{(\mathrm{GSA})}\varepsilon_n(M)}{1 - L_{(\mathrm{GSA})}\varepsilon_n(M)}\, p_1(M) \quad \text{by (48)}$$
$$\le \left( 1 + \frac{L_{(\mathrm{GSA})}\varepsilon_n(M)}{1 - L_{(\mathrm{GSA})}\varepsilon_n(M)} \right)\ell(s^*, s_n(M)) \le \left( 1 + L_{(\mathrm{GSA})}\varepsilon_n(M) \right)\ell(s^*, s_n(M)). \qquad (56)$$
Hence, using (56) in (55), we have on $\Omega_n$, for all models $M \in \mathcal{M}_n$ such that $A_{\mathcal{M},+}(\ln n)^2 \le D_M$ and for all $n \ge n_0((\mathrm{GSA}))$,
$$|\mathrm{pen}'_{\mathrm{id}}(M) - \mathrm{pen}(M)| \le \left( \delta + L_{(\mathrm{GSA})}\varepsilon_n(M) \right)\ell(s^*, s_n(M)). \qquad (57)$$
Consequently, for all models $M \in \mathcal{M}_n$ such that $A_{\mathcal{M},+}(\ln n)^2 \le D_M$ and for all $n \ge n_0((\mathrm{GSA}))$, it holds on $\Omega_n$, using (47) and (57),
$$\left( 1 - \delta - L_{(\mathrm{GSA})}\varepsilon_n(M) \right)\ell(s^*, s_n(M)) \le \mathrm{crit}'(M) \le \left( 1 + \delta + L_{(\mathrm{GSA})}\varepsilon_n(M) \right)\ell(s^*, s_n(M)). \qquad (58)$$

Control of the criterion crit′ for models of small dimension: we consider models $M \in \mathcal{M}_n$ such that $D_M \le A_{\mathcal{M},+}(\ln n)^2$. By (13), (52), (53) and (54), it holds on $\Omega_n$, for any $\tau > 0$ and all models $M \in \mathcal{M}_n$ such that $D_M \le A_{\mathcal{M},+}(\ln n)^2$,
$$|\mathrm{pen}'_{\mathrm{id}}(M) - \mathrm{pen}(M)| \le p_1(M) + p_2(M) + |\mathrm{pen}(M)| + |\bar\delta(M)|$$
$$\le L_{(\mathrm{GSA})}\frac{(\ln n)^2}{n} + A_r\frac{\ell(s^*, s_M)}{\ln n} + A_r\frac{(\ln n)^2}{n} + L_{(\mathrm{GSA})}\left( \sqrt{\ell(s^*, s_M)\frac{\ln n}{n}} + \frac{\ln n}{n} \right)$$
$$\le L_{(\mathrm{GSA}),A_r}\left( \frac{(\ln n)^2}{n} + \frac{\ell(s^*, s_M)}{\ln n} \right) + \tau\,\ell(s^*, s_M) + \left( \tau^{-1} + 1 \right)L_{(\mathrm{GSA})}\frac{\ln n}{n}$$
$$\le L_{(\mathrm{GSA}),A_r}\left( \frac{(\ln n)^2}{n} + \frac{\ell(s^*, s_M)}{\ln n} \right) + \tau\,\ell(s^*, s_n(M)) + \left( \tau^{-1} + 1 \right)L_{(\mathrm{GSA})}\frac{\ln n}{n}. \qquad (59)$$
Hence, by taking $\tau = (\ln n)^{-1}$ in (59), we get that for all $M \in \mathcal{M}_n$ such that $D_M \le A_{\mathcal{M},+}(\ln n)^2$, it holds on $\Omega_n$,
$$|\mathrm{pen}'_{\mathrm{id}}(M) - \mathrm{pen}(M)| \le L_{(\mathrm{GSA}),A_r}\left( \frac{\ell(s^*, s_n(M))}{\ln n} + \frac{(\ln n)^2}{n} \right). \qquad (60)$$
Moreover, by (47) and (60), we have on the event $\Omega_n$, for all $M \in \mathcal{M}_n$ such that $D_M \le A_{\mathcal{M},+}(\ln n)^2$,
$$\left( 1 - L_{(\mathrm{GSA}),A_r}(\ln n)^{-1} \right)\ell(s^*, s_n(M)) - L_{(\mathrm{GSA}),A_r}\frac{(\ln n)^2}{n} \le \mathrm{crit}'(M) \qquad (61)$$
$$\le \left( 1 + L_{(\mathrm{GSA}),A_r}(\ln n)^{-1} \right)\ell(s^*, s_n(M)) + L_{(\mathrm{GSA}),A_r}\frac{(\ln n)^2}{n}. \qquad (62)$$

Oracle inequalities:
Recall that by the definition given in (5), an oracle model satisfies
$$M^* \in \arg\min_{M \in \mathcal{M}_n}\{\ell(s^*, s_n(M))\}. \qquad (63)$$
By Lemmas 8 and 9 below, we control on $\Omega_n$ the dimensions of the selected model $\widehat{M}$ and of the oracle model $M^*$. More precisely, by (75) and (77), we have on $\Omega_n$, for any $\eta \in \left(0, \beta_+/(2(\beta_+ + 1))\right)$ and for all $n \ge n_0((\mathrm{GSA}), \eta, \delta)$,
$$D_{\widehat{M}} \le n^{1/(\beta_+ + 1) + \eta}, \qquad (64)$$
$$D_{M^*} \le n^{1/(\beta_+ + 1) + \eta}. \qquad (65)$$
Now, from (64) we distinguish two cases in order to control $\mathrm{crit}'(\widehat{M})$. If $A_{\mathcal{M},+}(\ln n)^2 \le D_{\widehat{M}} \le n^{1/(\beta_+ + 1) + \eta}$, we get by (58), for all $n \ge n_0((\mathrm{GSA}))$,
$$\mathrm{crit}'(\widehat{M}) \ge \left( 1 - \delta - L_{(\mathrm{GSA})}\varepsilon_n(\widehat{M}) \right)\ell(s^*, s_n(\widehat{M})). \qquad (66)$$
Otherwise, if $D_{\widehat{M}} \le A_{\mathcal{M},+}(\ln n)^2$, we get by (61),
$$\left( 1 - L_{(\mathrm{GSA}),A_r}(\ln n)^{-1} \right)\ell(s^*, s_n(\widehat{M})) - L_{(\mathrm{GSA}),A_r}\frac{(\ln n)^2}{n} \le \mathrm{crit}'(\widehat{M}). \qquad (67)$$
Let us denote
$$\mathcal{S}_n = \left\{ M \in \mathcal{M}_n;\ A_{\mathcal{M},+}(\ln n)^2 \le D_M \le n^{1/(\beta_+ + 1) + \eta} \right\}.$$
In all cases, we have by (66) and (67), for all $n \ge n_0((\mathrm{GSA}))$,
$$\mathrm{crit}'(\widehat{M}) \ge \left( 1 - \delta - L_{(\mathrm{GSA}),A_r}\left( (\ln n)^{-1} + \sup_{M \in \mathcal{S}_n}\varepsilon_n(M) \right) \right)\ell(s^*, s_n(\widehat{M})) - L_{(\mathrm{GSA}),A_r}\frac{(\ln n)^2}{n}. \qquad (68)$$
Similarly, from (65) we distinguish two cases in order to control $\mathrm{crit}'(M^*)$. If $A_{\mathcal{M},+}(\ln n)^2 \le D_{M^*} \le n^{1/(\beta_+ + 1) + \eta}$, we get by (58), for all $n \ge n_0((\mathrm{GSA}))$,
$$\mathrm{crit}'(M^*) \le \left( 1 + \delta + L_{(\mathrm{GSA})}\varepsilon_n(M^*) \right)\ell(s^*, s_n(M^*)). \qquad (69)$$
Otherwise, if $D_{M^*} \le A_{\mathcal{M},+}(\ln n)^2$, we get by (62),
$$\mathrm{crit}'(M^*) \le \left( 1 + L_{(\mathrm{GSA}),A_r}(\ln n)^{-1} \right)\ell(s^*, s_n(M^*)) + L_{(\mathrm{GSA}),A_r}\frac{(\ln n)^2}{n}. \qquad (70)$$
In all cases, we deduce from (69) and (70) that for all $n \ge n_0((\mathrm{GSA}), \delta)$,
$$\mathrm{crit}'(M^*) \le \left( 1 + \delta + L_{(\mathrm{GSA}),A_r}\left( (\ln n)^{-1} + \sup_{M \in \mathcal{S}_n}\varepsilon_n(M) \right) \right)\ell(s^*, s_n(M^*)) + L_{(\mathrm{GSA}),A_r}\frac{(\ln n)^2}{n}. \qquad (71)$$
Hence, by setting
$$\theta_n = L_{(\mathrm{GSA}),A_r}\left( (\ln n)^{-1} + \sup_{M \in \mathcal{S}_n}\varepsilon_n(M) \right),$$
we have by (24), for all $n \ge n_0((\mathrm{GSA}), \eta, \delta)$, $\theta_n \le L_{(\mathrm{GSA}),A_r}(\ln n)^{-1/4}$ and $\theta_n < (1-\delta)/2$. Since $\mathrm{crit}'(\widehat{M}) \le \mathrm{crit}'(M^*)$, we deduce, using the fact that $(1-x)^{-1} \le 1 + 2x$ for all $x \in [0, 1/2)$, that for all $n \ge n_0((\mathrm{GSA}), \eta, \delta)$, it holds on $\Omega_n$,
$$\ell(s^*, s_n(\widehat{M})) \le \left( \frac{1 + \delta + \theta_n}{1 - \delta - \theta_n} \right)\ell(s^*, s_n(M^*)) + \frac{L_{(\mathrm{GSA}),A_r}}{1 - \delta - \theta_n}\frac{(\ln n)^2}{n}$$
$$\le \left( \frac{1+\delta}{1-\delta} + \frac{5\theta_n}{(1-\delta)^2} \right)\ell(s^*, s_n(M^*)) + L_{(\mathrm{GSA}),A_r}\frac{(\ln n)^2}{n}. \qquad (72)$$
Inequality (15) is now proved.

It remains to prove the second part of Theorem 3. We assume that assumption (Ap) holds. From Lemmas 8 and 9, we have that for any $\eta \in \left(0, \beta_+/(2(\beta_+ + 1))\right)$ and for all $n \ge n_0((\mathrm{GSA}), C_-, \beta_-, \eta, \delta)$, it holds on $\Omega_n$,
$$A_{\mathcal{M},+}(\ln n)^2 \le D_{\widehat{M}} \le n^{1/(\beta_+ + 1) + \eta}, \qquad (73)$$
$$A_{\mathcal{M},+}(\ln n)^2 \le D_{M^*} \le n^{1/(\beta_+ + 1) + \eta}. \qquad (74)$$
Now, using (66) and (69), by the same kind of computations leading to (72), we deduce that it holds on $\Omega_n$, for all $n \ge n_0((\mathrm{GSA}), C_-, \beta_-, \eta, \delta)$,
$$\ell(s^*, s_n(\widehat{M})) \le \left( \frac{1 + \delta + \theta_n}{1 - \delta - \theta_n} \right)\ell(s^*, s_n(M^*)) \le \left( \frac{1+\delta}{1-\delta} + \frac{5\theta_n}{(1-\delta)^2} \right)\ell(s^*, s_n(M^*)).$$
Thus inequality (17) is proved and Theorem 3 follows. □

Lemma 8 (Control of the dimension of the selected model)

Assume that
Lemma 8 (Control on the dimension of the selected model). Assume that (GSA) holds. Let η ∈ (0, β₊/(1+β₊)). If n ≥ n₀((GSA), η, δ), then, on the event Ω_n defined in the proof of Theorem 3, we have

D_{M̂} ≤ n^{1/(1+β₊)+η}.   (75)

If moreover (Ap) holds, then for all n ≥ n₀((GSA), C₋, β₋, η, δ) we get, on the event Ω_n,

A_{M,+}(ln n)² ≤ D_{M̂} ≤ n^{1/(1+β₊)+η}.   (76)

Lemma 9 (Control on the dimension of oracle models). Assume that (GSA) holds. Let η ∈ (0, β₊/(1+β₊)). If n ≥ n₀((GSA), η), then, on the event Ω_n defined in the proof of Theorem 3, we have

D_{M_*} ≤ n^{1/(1+β₊)+η}.   (77)

If moreover (Ap) holds, then for all n ≥ n₀((GSA), C₋, β₋, η) we get, on the event Ω_n,

A_{M,+}(ln n)² ≤ D_{M_*} ≤ n^{1/(1+β₊)+η}.   (78)

Proof of Lemma 8. Recall that M̂ minimizes

crit′(M) = crit(M) − P_n(Ks_*) = ℓ(s_*, s_M) − p₁(M) + δ̄(M) + pen(M)   (79)

over the models M ∈ M_n.
1. Lower bound on crit′(M) for small models in the case where (Ap) holds: let M ∈ M_n be such that D_M < A_{M,+}(ln n)². By (13) and (79), it holds

crit′(M) ≥ (1 − A_r (ln n)^{−1}) ℓ(s_*, s_M) − p₁(M) + δ̄(M) − A_r (ln n)²/n.

We then have on Ω_n:
ℓ(s_*, s_M) ≥ C₋ A_{M,+}^{−β₋} (ln n)^{−2β₋}   by (Ap),
p₁(M) ≤ L_{(GSA)} (ln n)²/n   from (53),
δ̄(M) ≥ −L_{(GSA)} ( √(ℓ(s_*, s_M) ln n / n) + ln n / n )   from (52).

Since, by (Ab′), we have 0 ≤ ℓ(s_*, s_M) ≤ 4A², we deduce that for all n ≥ n₀((GSA), C₋, β₋, A_r),

crit′(M) ≥ (C₋/2) A_{M,+}^{−β₋} (ln n)^{−2β₋}.   (80)

2. Lower bound for large models: let M ∈ M_n be such that D_M ≥ n^{1/(1+β₊)+η}. From (12) and (49) we have on Ω_n, for all n ≥ n₀(A_{M,+}),

pen(M) − p₁(M) ≥ E[p(M)] − (δ + L_{(GSA)} ε_n(M)) ( ℓ(s_*, s_M) + E[p(M)] ).

Using (P2) and the fact that D_M ≥ n^{1/(1+β₊)+η} in (24), we deduce that for all n ≥ n₀((GSA), η, δ, β₊), L_{(GSA)} ε_n(M) ≤ (1−δ)/4, and as, by (An), K_{1,M} ≥ σ_min, we also deduce from Lemma 6 that for all n ≥ n₀((GSA), η), E[p(M)] ≥ σ²_min D_M/(2n). Consequently, it holds for all n ≥ n₀((GSA), η, δ, β₊),

pen(M) − p₁(M) ≥ σ²_min (1−δ) D_M/(4n) − C₊ D_M^{−β₊} ≥ (1−δ) L_{(GSA)} n^{−β₊/(1+β₊)+η}.   (81)

From (51), it holds on Ω_n,

δ̄(M) ≥ −L_{(GSA)} ( √(ℓ(s_*, s_M) ln n / n) + ln n/n ) ≥ −L_{(GSA)} ( n^{−β₊/(2(1+β₊))} √(ln n / n) + ln n / n ).   (82)

Hence, we deduce from (79), (81) and (82) that we have on Ω_n, for all n ≥ n₀((GSA), η, δ, β₊),

crit′(M) ≥ (1−δ) L_{(GSA)} n^{−β₊/(1+β₊)+η}.   (83)

3. A better model exists for crit′(M): from (P3), there exists M₀ ∈ M_n such that n^{1/(1+β₊)} ≤ D_{M₀} ≤ c_rich n^{1/(1+β₊)}. Then, for all n ≥ n₀((GSA), η),

A_{M,+}(ln n)² ≤ n^{1/(1+β₊)} ≤ D_{M₀} ≤ c_rich n^{1/(1+β₊)} ≤ n^{1/(1+β₊)+η}.

Using (Ap_u),

ℓ(s_*, s_{M₀}) ≤ C₊ n^{−β₊/(1+β₊)}.   (84)

By (50), we have on Ω_n, for all n ≥ n₀((GSA), η),

|δ̄(M₀)| ≤ ℓ(s_*, s_{M₀})/√(D_{M₀}) + L_{(GSA)} (ln n/√(D_{M₀})) E[p(M₀)] ≤ L_{(GSA)} n^{−(2β₊+1)/(2(1+β₊))} ln n   (85)

and, by (12),

pen(M₀) ≤ L_{(GSA)} ( ℓ(s_*, s_{M₀}) + E[p(M₀)] ) ≤ L_{(GSA)} n^{−β₊/(1+β₊)}.

Consequently, we have on Ω_n, for all n ≥ n₀((GSA), η),

crit′(M₀) ≤ ℓ(s_*, s_{M₀}) + |δ̄(M₀)| + pen(M₀) ≤ L_{(GSA)} n^{−β₊/(1+β₊)}.   (86)

To conclude, notice that the upper bound (86) is smaller than the lower bound given in (83) for all n ≥ n₀((GSA), η, δ). Hence, points 2 and 3 above yield inequality (75). Moreover, the upper bound (86) is smaller than the lower bounds given in (80), derived by using (Ap), and in (83), for all n ≥ n₀((GSA), C₋, β₋, η, δ). This thus gives (76), and Lemma 8 is proved. □

Proof of Lemma 9. By definition, M_* minimizes

ℓ(s_*, s_n(M)) = ℓ(s_*, s_M) + p₂(M)

over the models M ∈ M_n.
1. Lower bound on ℓ(s_*, s_n(M)) for small models: let M ∈ M_n be such that D_M < A_{M,+}(ln n)². In this case we have

ℓ(s_*, s_n(M)) ≥ ℓ(s_*, s_M) ≥ C₋ A_{M,+}^{−β₋} (ln n)^{−2β₋}   by (Ap).   (87)

2. Lower bound on ℓ(s_*, s_n(M)) for large models: let M ∈ M_n be such that D_M ≥ n^{1/(1+β₊)+η}. From (48) we get on Ω_n,

p₂(M) ≥ (1 − L_{(GSA)} ε_n(M)) E[p₂(M)].

Using (P2) and the fact that D_M ≥ n^{1/(1+β₊)+η} in (24), we deduce that for all n ≥ n₀((GSA), η), L_{(GSA)} ε_n(M) ≤ 1/2, and as, by (An), K_{1,M} ≥ σ_min, we also deduce from Lemma 6 that for all n ≥ n₀((GSA), η), E[p₂(M)] ≥ σ²_min D_M/(2n). Consequently, it holds for all n ≥ n₀((GSA), η), on the event Ω_n,

ℓ(s_*, s_n(M)) ≥ p₂(M) ≥ σ²_min D_M/(4n) ≥ (σ²_min/4) n^{−β₊/(1+β₊)+η}.   (88)

3. A better model exists for ℓ(s_*, s_n(M)): from (P3), there exists M₀ ∈ M_n such that n^{1/(1+β₊)} ≤ D_{M₀} ≤ c_rich n^{1/(1+β₊)}. Moreover, for all n ≥ n₀((GSA), η),

A_{M,+}(ln n)² ≤ n^{1/(1+β₊)} ≤ D_{M₀} ≤ c_rich n^{1/(1+β₊)} ≤ n^{1/(1+β₊)+η}.

Using (Ap_u), ℓ(s_*, s_{M₀}) ≤ C₊ n^{−β₊/(1+β₊)}, and by (48),

p₂(M₀) ≤ (1 + L_{(GSA)} ε_n(M₀)) E[p₂(M₀)].

Hence, as K_{1,M} ≤ A by (Ab′) and as, by (24), for all n ≥ n₀(GSA) it holds ε_n(M₀) ≤ 1, we deduce from Lemma 6 that for all n ≥ n₀(GSA), on the event Ω_n,

p₂(M₀) ≤ L_{(GSA)} D_{M₀}/n ≤ L_{(GSA)} n^{−β₊/(1+β₊)}.

Consequently, on Ω_n, for all n ≥ n₀((GSA), η),

ℓ(s_*, s_n(M₀)) = ℓ(s_*, s_{M₀}) + p₂(M₀) ≤ L_{(GSA)} n^{−β₊/(1+β₊)}.   (89)

The upper bound (89) is smaller than the lower bound (88) for all n ≥ n₀((GSA), η), and this gives (77). If (Ap) holds, then the upper bound (89) is smaller than the lower bounds (87) and (88) for all n ≥ n₀((GSA), C₋, β₋, η), which proves (78) and concludes the proof of Lemma 9. □
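The order n^{1/(1+β₊)} appearing in Lemmas 8 and 9 is the classical bias–variance trade-off. Heuristically, with a bias of order C D^{−β} as in (Ap_u) and a variance term of order σ²D/n as in Lemma 6, the oracle dimension solves

\[
\frac{\mathrm{d}}{\mathrm{d}D}\left(C\,D^{-\beta}+\frac{\sigma^{2}D}{n}\right)=0
\quad\Longleftrightarrow\quad
D_{*}=\left(\frac{\beta C n}{\sigma^{2}}\right)^{\frac{1}{1+\beta}}\asymp n^{\frac{1}{1+\beta}},
\]

so that dimensions much larger than this order are excluded by the variance term and much smaller ones by the bias term, which is precisely the structure of the two lower bounds in each of the proofs above.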
Proof of Theorem 2. As in the proof of Theorem 3, we consider the event Ω′_n, of probability at least 1 − L_{c_M,A_p} n^{−2} for all n ≥ n₀(GSA), on which (10) holds and:

• For all models M ∈ M_n of dimension D_M such that A_{M,+}(ln n)² ≤ D_M,

|p₂(M) − E[p₂(M)]| ≤ L_{(GSA)} ε_n(M) E[p₂(M)],   (90)
|p₁(M) − E[p₁(M)]| ≤ L_{(GSA)} ε_n(M) E[p₁(M)].   (91)

• For all models M ∈ M_n with D_M ≤ A_{M,+}(ln n)²,

p₁(M) ≤ L_{(GSA)} (ln n)²/n.   (92)

• For every M ∈ M_n,

|δ̄(M)| ≤ L_{(GSA)} ( √(ℓ(s_*, s_M) ln n / n) + ln n / n ).   (93)

Let d ∈ (0,1) to be chosen later.
Lower bound on D_{M̂}. Let us recall that M̂ minimizes

crit′(M) = crit(M) − P_n(Ks_*) = ℓ(s_*, s_M) − p₁(M) + δ̄(M) + pen(M).   (94)

1. Lower bound on crit′(M) for "small" models: assume that M ∈ M_n and D_M ≤ d A_rich n (ln n)^{−2}. We have

ℓ(s_*, s_M) + pen(M) ≥ 0   (95)

and, since ℓ(s_*, s_M) ≤ 4A² by (Ab′), we get on Ω′_n, for all n ≥ n₀((GSA), d),

δ̄(M) ≥ −L_{(GSA)} ( √(ℓ(s_*, s_M) ln n/n) + ln n/n ) ≥ −L_{(GSA)} √(ln n/n) ≥ −d·A·A_rich (ln n)^{−2}.   (96)

Then, if D_M ≥ A_{M,+}(ln n)², as K_{1,M} ≤ A by (Ab′) and as, by (24), for all n ≥ n₀(GSA) it holds L_{(GSA)} ε_n(M) ≤ 1, we deduce from (91) and Lemma 6 that for all n ≥ n₀((GSA), d),

p₁(M) ≤ 2 E[p₁(M)] ≤ A D_M/n ≤ d·A·A_rich (ln n)^{−2}.

Whenever D_M ≤ A_{M,+}(ln n)², (92) gives that, for all n ≥ n₀((GSA), d), on the event Ω′_n,

p₁(M) ≤ L_{(GSA)} (ln n)²/n ≤ d·A·A_rich (ln n)^{−2}.

Hence, we have checked that for all n ≥ n₀((GSA), d), on the event Ω′_n,

−p₁(M) ≥ −d·A·A_rich (ln n)^{−2},   (97)

and finally, by using (95), (96) and (97) in (94), we deduce that on Ω′_n, for all n ≥ n₀((GSA), d),

crit′(M) ≥ −2d·A·A_rich (ln n)^{−2}.   (98)

2. There exists a better model for crit′(M). By (P3), for all n ≥ n₀(A_{M,+}, A_rich) a model M₁ ∈ M_n exists such that A_{M,+}(ln n)² ≤ A_rich n (ln n)^{−2} ≤ D_{M₁}. We then have on Ω′_n:

ℓ(s_*, s_{M₁}) ≤ C₊ A_rich^{−β₊} (ln n)^{2β₊} n^{−β₊}   by (Ap_u),
p₁(M₁) ≥ (1 − L_{(GSA)} ε_n(M₁)) E[p₁(M₁)]   by (91),
pen(M₁) ≤ A_pen E[p₁(M₁)]   by (10),
|δ̄(M₁)| ≤ L_{(GSA)} √(ln(n)/n)   by (93) and (Ab′),

and therefore

crit′(M₁) ≤ −(1 − A_pen − L_{(GSA)} ε_n(M₁)) E[p₁(M₁)] + L_{(GSA)} √(ln n/n) + C₊ A_rich^{−β₊} (ln n)^{2β₊} n^{−β₊}.   (99)

Hence, as 1 − A_pen > 0 and as, by (24), (An) and Lemma 6, it holds for all n ≥ n₀((GSA), A_pen),

L_{(GSA)} ε_n(M₁) ≤ (1 − A_pen)/2   and   E[p₁(M₁)] ≥ σ²_min D_{M₁}/(2n) ≥ σ²_min A_rich (ln n)^{−2}/2,

we deduce from (99) that on Ω′_n, for all n ≥ n₀((GSA), A_pen),

crit′(M₁) ≤ −(1/4)(1 − A_pen) σ²_min A_rich (ln n)^{−2}.   (100)

Now, by taking

0 < d = ( (1 − A_pen) σ²_min / (16A) ) ∧ (1/2) < 1,   (101)

we get on Ω′_n, for all n ≥ n₀((GSA), A_pen), for all M ∈ M_n such that D_M ≤ d A_rich n (ln n)^{−2},

crit′(M₁) < crit′(M),

and so

D_{M̂} > d A_rich n (ln n)^{−2}.   (102)

Excess loss of s_n(M̂). We take d with the value given in (101). First notice that, for all n ≥ n₀(A_{M,+}, A_rich, d), we have d A_rich n (ln n)^{−2} ≥ A_{M,+}(ln n)². Hence, for all M ∈ M_n such that D_M ≥ d A_rich n (ln n)^{−2}, by (24), (P2), (An) and Lemma 6, it holds on Ω′_n, for all n ≥ n₀((GSA), A_pen), using (90),

ℓ(s_*, s_n(M)) ≥ p₂(M) ≥ σ²_min D_M/(4n) ≥ (d σ²_min A_rich/4) (ln n)^{−2}.

By (102), we thus get that on Ω′_n, for all n ≥ n₀((GSA), A_pen),

ℓ(s_*, s_n(M̂)) ≥ (d σ²_min A_rich/4) (ln n)^{−2}.   (103)

Moreover, the model M₀ defined in (P3) satisfies, for all n ≥ n₀(GSA),

A_{M,+}(ln n)² ≤ n^{1/(1+β₊)} ≤ D_{M₀} ≤ c_rich n^{1/(1+β₊)},

and so, using (Ap_u), ℓ(s_*, s_{M₀}) ≤ C₊ n^{−β₊/(1+β₊)}. In addition, by (48),

p₂(M₀) ≤ (1 + L_{(GSA)} ε_n(M₀)) E[p₂(M₀)].

Hence, as K_{1,M} ≤ A by (Ab′) and as, by (24), for all n ≥ n₀(GSA) it holds ε_n(M₀) ≤ 1, we deduce from Lemma 6 that for all n ≥ n₀(GSA),

p₂(M₀) ≤ L_{(GSA)} D_{M₀}/n ≤ L_{(GSA)} n^{−β₊/(1+β₊)}.

Consequently, for all n ≥ n₀(GSA),

ℓ(s_*, s_n(M₀)) ≤ L_{(GSA)} n^{−β₊/(1+β₊)},   (104)

and the ratio between the two bounds (103) and (104) is larger than L n^{β₊/(1+β₊)} (ln n)^{−2} for all n ≥ n₀(L_{(GSA)}, A_pen), which yields (11). □

Theorem 4 is a straightforward consequence of the following result, which is proved below.
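To see why the threshold at A_pen = 1 in Theorem 2 is a genuine phase transition, one can caricature crit′(M) deterministically: on Ω′_n it behaves like the bias C D_M^{−β₊} plus the drift (A_pen − 1)·σ²D_M/n coming from pen(M) − p₁(M). The toy computation below is ours and purely schematic; it only illustrates how the selected dimension jumps to the top of the collection as soon as A_pen < 1.

```python
import numpy as np

n, C, beta, sigma2 = 10**4, 1.0, 2.0, 0.25
D = np.arange(1, n // 10)          # dimensions available in the collection

def selected_dim(A_pen):
    # caricature of crit'(M): bias + (penalty - empirical fluctuation)
    crit = C * D ** (-beta) + (A_pen - 1.0) * sigma2 * D / n
    return D[np.argmin(crit)]

for A_pen in (0.5, 0.9, 1.5, 2.0):
    print(A_pen, selected_dim(A_pen))
# A_pen < 1: the criterion eventually decreases in D, so the largest
# available dimension wins -- overfitting, as in (102).
# A_pen > 1: the minimizer stays at the bias-variance order
# (beta*C*n/((A_pen-1)*sigma2))**(1/(1+beta)), a few tens here.
```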
Theorem 10. Assume that (GSA) holds. With the notations of Section 4, assume moreover that there exists c ∈ (0,1) such that cn ≤ n₁ < n, and that τ ∈ (1,2] satisfies n(ln n)^τ/D_M ≤ n₂ ≤ (1−c)n for all M ∈ M_n such that A_{M,+}(ln n)² ≤ D_M ≤ A_{M,+} n(ln n)^{−2}; take n₂ = (1−c)n if D_M ≤ A_{M,+}(ln n)². Define, for all M ∈ M_n,

pen_ho(M) = (n₁/n) ( P_{n₂}(Ks_{n₁}(M)) − P_{n₁}(Ks_{n₁}(M)) ).

Then, for any η ∈ (0, β₊/(1+β₊)), there exist an integer n₀ depending on c, η and on constants in (GSA), a positive constant A₁ only depending on c_M given in (GSA), two positive constants A₂ and A₃ only depending on constants in (GSA) and a sequence

θ_n ≤ A₂ (ln n)^{−(1/4) ∧ ((τ−1)/2)}

such that it holds, for all n ≥ n₀((GSA), c, η), with probability at least 1 − A₁ n^{−2},

D_{M̂_{n₁}} ≤ n^{η+1/(1+β₊)}   and   ℓ(s_*, s_n(M̂_{n₁})) ≤ (1 + θ_n) ℓ(s_*, s_n(M_*)) + A₃ (ln n)²/n.   (105)

Assume that, in addition, (Ap) holds (see Theorem 3). Then it holds, for all n ≥ n₀((GSA), C₋, β₋, η, c), with probability at least 1 − A₁ n^{−2},

A_{M,+}(ln n)² ≤ D_{M̂_{n₁}} ≤ n^{η+1/(1+β₊)}   and   ℓ(s_*, s_n(M̂_{n₁})) ≤ (1 + θ_n) inf_{M∈M_n} { ℓ(s_*, s_n(M)) }.   (106)
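Concretely, the hold-out penalty of Theorem 10 is straightforward to compute. The following Python sketch is a schematic implementation on simulated data (histogram models again; the function names and simulation parameters are ours), with the n₁/n normalization read off from the definition of pen_ho above.

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_histogram(x, y, D):
    """Regressogram on a regular partition of [0,1] into D bins."""
    bins = np.minimum((x * D).astype(int), D - 1)
    means = np.zeros(D)
    for k in range(D):
        sel = bins == k
        if sel.any():
            means[k] = y[sel].mean()
    return lambda t: means[np.minimum((t * D).astype(int), D - 1)]

n = 2000
x = rng.uniform(0, 1, n)
y = np.sin(4 * np.pi * x) + (0.3 + 0.4 * x) * rng.normal(size=n)

n1 = int(0.75 * n); n2 = n - n1                  # split sizes n_1 and n_2
x1, y1, x2, y2 = x[:n1], y[:n1], x[n1:], y[n1:]

def pen_ho(D):
    s1 = fit_histogram(x1, y1, D)                # s_{n_1}(M), trained on the first subsample
    train = np.mean((y1 - s1(x1)) ** 2)          # P_{n_1}(K s_{n_1}(M))
    test = np.mean((y2 - s1(x2)) ** 2)           # P_{n_2}(K s_{n_1}(M))
    return (n1 / n) * (test - train)

dims = np.arange(1, 150)
crit = []
for D in dims:
    s = fit_histogram(x, y, D)                   # final estimator uses all n points
    crit.append(np.mean((y - s(x)) ** 2) + pen_ho(D))
print("selected dimension:", dims[np.argmin(crit)])
```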
Lemma 11. Assume that (GSA) holds. Let c ∈ (0,1), τ ∈ (1,2] and (n₁, n₂) ∈ (N*)². We assume that cn ≤ n₁ < n and set n₂ = n − n₁. Then there exists L = L_{(GSA),c} > 0 such that for all M ∈ M_n satisfying D_M ≥ A_{M,+}(ln n)² and for all n ≥ n₀((GSA), c), it holds

P( |P_{n₂}(Ks_{n₁}(M) − Ks_M) − P(Ks_{n₁}(M) − Ks_M)| ≥ L √( (D_M ∨ ln n)(ln n)((ln n)² + n₂) ) / (n₂ √n₁) ) ≤ L n^{−2−α_M}.   (107)

Now, let us assume that n(ln n)^τ/D_M ≤ n₂ ≤ (1−c)n if A_{M,+}(ln n)² ≤ D_M ≤ A_{M,+} n(ln n)^{−2}, and that n₂ = (1−c)n if D_M ≤ A_{M,+}(ln n)². If A_{M,+}(ln n)² ≤ D_M, then by setting

ε_{2,n}(M) = L n √( ln n ((ln n)² + n₂) ) / ( n₂ √(n₁ D_M) ) ≤ L (ln n)^{−(τ−1)/2},   (108)

we have for all n ≥ n₀((GSA), c),

P( |P_{n₂}(Ks_{n₁}(M) − Ks_M) − P(Ks_{n₁}(M) − Ks_M)| ≥ ε_{2,n}(M) E[p(M)] ) ≤ L n^{−2−α_M}.   (109)

If D_M ≤ A_{M,+}(ln n)², we obtain

P( |P_{n₂}(Ks_{n₁}(M) − Ks_M) − P(Ks_{n₁}(M) − Ks_M)| ≥ L (ln n)²/n ) ≤ L n^{−2−α_M}.   (110)
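For the reader's convenience, we recall the form of Bernstein's inequality used in the proof below (Corollary 2.10 in [34], stated here in a standard formulation): if Z₁, …, Z_m are i.i.d. and f is a function with ‖f‖_∞ ≤ b and E[f²(Z₁)] ≤ v, then for all x > 0,

\[
\mathbb{P}\left(\left|\frac{1}{m}\sum_{i=1}^{m} f(Z_i)-\mathbb{E}\big[f(Z_1)\big]\right|\ge x\right)
\le 2\exp\left(-\frac{m\,x^{2}}{2\left(v+\frac{b\,x}{3}\right)}\right).
\]

In the proof it is applied with m = n₂, conditionally on the first subsample, to f = Ks_{n₁}(M) − Ks_M.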
Proof. By Bernstein's inequality (see Corollary 2.10 in [34]), applied conditionally on the first subsample (ξ_j)_{j∈I₁} to the sum of the variables (Ks_{n₁}(M) − Ks_M)(ξ_i), i ∈ I₂, we get that for all x > 0,

P( |P_{n₂}(Ks_{n₁}(M) − Ks_M) − P(Ks_{n₁}(M) − Ks_M)| ≥ x | (ξ_j)_{j∈I₁} ) ≤ 2 exp( −n₂ x² / (2(v + bx/3)) ),   (111)

where

v = E_ξ[ (Ks_{n₁}(M)(ξ) − Ks_M(ξ))² ]   and   b = ‖Ks_{n₁}(M) − Ks_M‖_∞.

We have

v = E_{(X,Y)}[ (2(Y − s_M(X)) − s_{n₁}(M)(X) + s_M(X))² (s_{n₁}(M)(X) − s_M(X))² ]
 ≤ (4A + ‖s_{n₁}(M) − s_M‖_∞)² E_X[ (s_{n₁}(M)(X) − s_M(X))² ]
 = (4A + ‖s_{n₁}(M) − s_M‖_∞)² P(Ks_{n₁}(M) − Ks_M)   (112)

and

b = ‖ (2(Y − s_M(X)) − s_{n₁}(M)(X) + s_M(X)) (s_{n₁}(M)(X) − s_M(X)) ‖_∞ ≤ 4A ‖s_{n₁}(M) − s_M‖_∞ + ‖s_{n₁}(M) − s_M‖²_∞.   (113)

Now, we set Ω_v = { v ≤ L_v (D_M ∨ ln n)/n₁ } and Ω_b = { b ≤ L_b √(D_M ln n / n₁) }. By integrating (111), it comes, for all x > 0,

P( |P_{n₂}(Ks_{n₁}(M) − Ks_M) − P(Ks_{n₁}(M) − Ks_M)| ≥ x )
 ≤ E[ 2 exp( −n₂ x²/(2(v + bx/3)) ) 1_{Ω_v ∩ Ω_b} ] + 2P(Ω_v^c) + 2P(Ω_b^c)
 ≤ 2 exp( −n₂ x² / ( 2( L_v (D_M ∨ ln n)/n₁ + L_b x √(D_M ln n/n₁)/3 ) ) ) + 2P(Ω_v^c) + 2P(Ω_b^c).

From assumption (Ac_∞) and inequality (27), it is possible to choose L_v and L_b, depending among other constants on c, such that for all n ≥ n₀((GSA), c), 2P(Ω_v^c) + 2P(Ω_b^c) ≤ 10 n^{−2−α_M}. Thus, we get, for some L > 0 and all x > 0,

P( |P_{n₂}(Ks_{n₁}(M) − Ks_M) − P(Ks_{n₁}(M) − Ks_M)| ≥ x ) ≤ 2 exp( −n₂ x² / ( L( (D_M ∨ ln n)/n₁ + x √(D_M ln n/n₁) ) ) ) + 10 n^{−2−α_M}.   (114)

By taking x = √( Lα ln n (D_M ∨ ln n)( Lα(ln n)² + 4n₂ ) ) / (n₂ √n₁) with α = 2 + α_M, we get

P( |P_{n₂}(Ks_{n₁}(M) − Ks_M) − P(Ks_{n₁}(M) − Ks_M)| ≥ L √( (D_M ∨ ln n)(ln n)((ln n)² + n₂) ) / (n₂ √n₁) ) ≤ L n^{−2−α_M},

where L > 0 only depends on constants in (GSA) and on c. Inequalities (109) and (110) then follow from simple calculations. □

Remark 12. It is easy to see that, by using the assumption of consistency in sup-norm for a fixed model, stated as (H5) in [36], instead of (Ac_∞), and by using Theorem 4 of [36] instead of inequality (27), the results established in Lemma 11 are valid with probability bounds proportional to n^{−α}, for any α > 0 (in Lemma 11, we only derive the case α = 2 + α_M for convenience).
Proof of Theorem 10. We set pen₂(M) = pen_ho(M) − (n₁/n)( P_{n₂}(Ks_*) − P_{n₁}(Ks_*) ). It is worth noting that P_{n₂}(Ks_*) − P_{n₁}(Ks_*) is a quantity independent of M, when M varies in M_n. Hence, the procedure defined by pen₂ selects the same model as the hold-out procedure defined by pen_ho. It will be convenient for our analysis to consider pen₂ instead of pen_ho. As a matter of fact, we derive Theorem 10 as a corollary of Theorem 3 applied with pen ≡ pen₂, through the use of Lemma 11.

We get, for all M ∈ M_n,

pen₂(M) = (n₁/n)( P_{n₂}(Ks_{n₁}(M) − Ks_*) − P_{n₁}(Ks_{n₁}(M) − Ks_*) )
 = (n₁/n)( P_{n₂}(Ks_{n₁}(M) − Ks_M) − P_{n₁}(Ks_{n₁}(M) − Ks_M) ) + (n₁/n)( (P_{n₂} − P)(Ks_M − Ks_*) − (P_{n₁} − P)(Ks_M − Ks_*) )
 = (n₁/n)( p₂,n₂(M) + p₁,n₁(M) + δ̄_{n₂}(M) − δ̄_{n₁}(M) ),

where

p₂,n₂(M) = P_{n₂}(Ks_{n₁}(M) − Ks_M),   p₁,n₁(M) = P_{n₁}(Ks_M − Ks_{n₁}(M)),   δ̄_{n_i}(M) = (P_{n_i} − P)(Ks_M − Ks_*).

Let Ω̃_n be the event on which:

• For all models M ∈ M_n of dimension D_M such that A_{M,+}(ln n)² ≤ D_M, it holds

|p₂(M) − E[p₂(M)]| ≤ L_{(GSA)} ε_n(M) E[p₂(M)],   (115)
|p₁(M) − E[p₁(M)]| ≤ L_{(GSA)} ε_n(M) E[p₁(M)],   (116)

together with

|p₂,n₂(M) − (n/n₁) E[p₂(M)]| ≤ L_{(GSA),c} [ ε_{2,n}(M) + ε_n(M) ] E[p(M)],   (117)
|p₁,n₁(M) − (n/n₁) E[p₁(M)]| ≤ L_{(GSA),c} ε_n(M) E[p(M)],   (118)
|δ̄_{n₁}(M)| ≤ ℓ(s_*, s_M)/√(D_M) + L_{(GSA),c} (ln n/√(D_M)) E[p(M)],   (119)
|δ̄_{n₂}(M)| ≤ L_{(GSA)} ( √( ℓ(s_*, s_M) ln n / n₂ ) + ln n / n₂ ).   (120)

• For all models M ∈ M_n of dimension D_M such that D_M ≤ A_{M,+}(ln n)², it holds

|δ̄_{n₁}(M)| ≤ L_{(GSA),c} ( √( ℓ(s_*, s_M) ln n / n ) + ln n / n ),   (121)
|δ̄_{n₂}(M)| ≤ L_{(GSA),c} ( √( ℓ(s_*, s_M) ln n / n ) + ln n / n ),   (122)
p₁,n₁(M) ≤ L_{(GSA),c} (D_M ∨ ln n)/n ≤ L_{(GSA),c} (ln n)²/n,   (123)
p₂,n₂(M) ≤ L_{(GSA),c} ( (ln n)²/n + (D_M ∨ ln n)/n ) ≤ L_{(GSA),c} (ln n)²/n.   (124)

By (25), (26), (27) and (28) in Remark 5, Lemma 6 and Lemma 11, we get, for all n ≥ n₀((GSA), c),

P(Ω̃_n) ≥ 1 − A_p n^{−2} − L Σ_{M∈M_n} n^{−2−α_M} ≥ 1 − L_{A_p,c_M} n^{−2}.

We first consider models M ∈ M_n such that A_{M,+}(ln n)² ≤ D_M. Notice that (119) implies, by (24), that for all such models,

|δ̄_{n₁}(M)| ≤ L_{(GSA),c} ( (ln n)²/D_M · ln n/D_M )^{1/4} ( ℓ(s_*, s_M) + E[p(M)] ) ≤ L_{(GSA),c} ε_n(M) ( ℓ(s_*, s_M) + E[p(M)] ).

In addition, from (120), Lemma 6 and the fact that n(ln n)^τ/D_M ≤ n₂, we get, for all n ≥ n₀(GSA),

|δ̄_{n₂}(M)| ≤ L_{(GSA)} ( √( ℓ(s_*, s_M) ln n / n₂ ) + ln n / n₂ ) ≤ L_{(GSA)} ( ℓ(s_*, s_M)(ln n)^{−(τ−1)/2} + (D_M/n)(ln n)^{−(τ−1)/2} ) ≤ L_{(GSA)} (ln n)^{(1−τ)/2} ( ℓ(s_*, s_M) + E[p(M)] ).

We deduce that on Ω̃_n we have, for all models M ∈ M_n such that A_{M,+}(ln n)² ≤ D_M and for all n ≥ n₀(GSA),

|pen₂(M) − E[p₁(M) + p₂(M)]|
 ≤ (n₁/n)( |p₂,n₂(M) − (n/n₁) E[p₂(M)]| + |p₁,n₁(M) − (n/n₁) E[p₁(M)]| ) + |δ̄_{n₁}(M)| + |δ̄_{n₂}(M)|
 ≤ L_{(GSA),c} ( ε_{2,n}(M) + ε_n(M) + (ln n)^{(1−τ)/2} ) ( ℓ(s_*, s_M) + E[p(M)] ).   (125)

Hence, inequality (12) of Theorem 3 is satisfied on Ω̃_n by taking

δ = L_{(GSA),c} ( ε_{2,n}(M) + ε_n(M) + (ln n)^{(1−τ)/2} ).

Moreover, we have δ ∈ [0,1) for all n ≥ n₀((GSA), c, τ).

Let us now consider models M ∈ M_n such that D_M ≤ A_{M,+}(ln n)². By (121), (122), (123) and (124), we have on Ω̃_n,

|pen₂(M)| = (n₁/n) |p₂,n₂(M) + p₁,n₁(M) + δ̄_{n₂}(M) − δ̄_{n₁}(M)|
 ≤ L_{(GSA),c} ( √( ℓ(s_*, s_M) ln n / n ) + (ln n)²/n )
 ≤ L_{(GSA),c} ( ℓ(s_*, s_M)/(ln n) + (ln n)²/n ).   (126)

Inequality (126) implies that inequality (13) of Theorem 3 is satisfied with A_r = L_{(GSA),c}. From (125) and (126), we can thus apply Theorem 3 with A_p = L_{A_p,c_M}, and this gives Theorem 10 with

θ_n = L_{(GSA),c} ( (ln n)^{−1} + (ln n)^{(1−τ)/2} + sup{ ε_n(M) + ε_{2,n}(M) : M ∈ M_n, A_{M,+}(ln n)² ≤ D_M ≤ n^{η+1/(1+β₊)} } ). □
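Note that θ_n, and hence the leading constant in (105) and (106), converges to one only at a logarithmic rate. As a quick numerical illustration (ours), the dominating factor (ln n)^{−(1/4) ∧ ((τ−1)/2)} of Theorem 10 decays as follows.

```python
import numpy as np

tau = 1.5                                   # any tau in (1, 2]; tau = 1.5 gives exponent -1/4
expo = -min(0.25, (tau - 1) / 2)
for n in (10**3, 10**4, 10**5, 10**6):
    print(n, round(np.log(n) ** expo, 3))   # from about 0.62 down to about 0.52: very slow decay
```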
Acknowledgements

I am deeply grateful to Prof. Jon A. Wellner and Prof. Pascal Massart for their valuable support. I also warmly thank Prof. Wellner for helping me to improve my English throughout the text. Finally, I gratefully thank the associate editors and anonymous referees for their comments and suggestions, which greatly improved the quality of the paper.
References

[1] R. Adamczak. A tail inequality for suprema of unbounded empirical processes with applications to Markov chains. Electron. J. Probab., 13:1000–1034, 2008.
[2] H. Akaike. Statistical predictor identification. Ann. Inst. Statist. Math., 22:203–217, 1970.
[3] H. Akaike. Information theory and an extension of the maximum likelihood principle. In Second International Symposium on Information Theory (Tsahkadsor, 1971), pages 267–281. Akadémiai Kiadó, Budapest, 1973.
[4] S. Arlot. Resampling and Model Selection. PhD thesis, University Paris-Sud 11, December 2007. oai:tel.archives-ouvertes.fr:tel-00198803 v1.
[5] S. Arlot. V-fold cross-validation improved: V-fold penalization, February 2008. arXiv:0802.0566v2.
[6] S. Arlot. Model selection by resampling penalization. Electron. J. Stat., 3:557–624, 2009.
[7] S. Arlot. Choosing a penalty for model selection in heteroscedastic regression, June 2010. arXiv:0812.3141.
[8] S. Arlot and F. Bach. Data-driven calibration of linear estimators with minimal penalties. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 46–54, 2009.
[9] S. Arlot and P. Massart. Data-driven calibration of penalties for least-squares regression. J. Mach. Learn. Res., 10:245–279 (electronic), 2009.
[10] Y. Baraud, C. Giraud, and S. Huet. Gaussian model selection with an unknown variance. Ann. Statist., 37(2):630–672, 2009.
[11] A. Barron, L. Birgé, and P. Massart. Risk bounds for model selection via penalization. Probab. Theory Related Fields, 113(3):301–413, 1999.
[12] P. L. Bartlett, S. Boucheron, and G. Lugosi. Model selection and error estimation. Machine Learning, 48:85–113, 2002.
[13] P. L. Bartlett, O. Bousquet, and S. Mendelson. Local Rademacher complexities. Ann. Statist., 33(4):1497–1537, 2005.
[14] J.-P. Baudry, C. Maugis, and B. Michel. Slope heuristics: overview and implementation. Stat. Comput., 22(2):455–470, 2012.
[15] L. Birgé and P. Massart. Rates of convergence for minimum contrast estimators. Probab. Theory Related Fields, 97:113–150, 1993.
[16] L. Birgé and P. Massart. From model selection to adaptive estimation. In Festschrift for Lucien Le Cam, pages 55–87. Springer, New York, 1997.
[17] L. Birgé and P. Massart. Minimum contrast estimators on sieves: exponential bounds and rates of convergence. Bernoulli, 4(3):329–375, 1998.
[18] L. Birgé and P. Massart. Gaussian model selection. J. Eur. Math. Soc., 3(3):203–268, 2001.
[19] L. Birgé and P. Massart. Minimal penalties for Gaussian model selection. Probab. Theory Related Fields, 138(1-2):33–73, 2007.
[20] S. Boucheron and P. Massart. A high-dimensional Wilks phenomenon. Probab. Theory Related Fields, 150(3-4):405–433, 2011.
[21] F. Bunea, A. B. Tsybakov, and M. H. Wegkamp. Aggregation for Gaussian regression. Ann. Statist., 35(4):1674–1697, 2007.
[22] G. Castellan. Modified Akaike's criterion for histogram density estimation. Technical report ♯, 1999.
[23] O. Catoni. Statistical learning theory and stochastic optimization, volume 1851 of Lecture Notes in Mathematics. Springer-Verlag, Berlin, 2004. Lecture notes from the 31st Summer School on Probability Theory held in Saint-Flour, July 8–25, 2001.
[24] A. S. Dalalyan and J. Salmon. Sharp oracle inequalities for aggregation of affine estimators. Ann. Statist., 40(4):2327–2355, 2012.
[25] A. S. Dalalyan and A. B. Tsybakov. Aggregation by exponential weighting and sharp oracle inequalities. In Learning theory, volume 4539 of Lecture Notes in Comput. Sci., pages 97–111. Springer, Berlin, 2007.
[26] B. Efron. Estimating the error rate of a prediction rule: improvement on cross-validation. J. Amer. Statist. Assoc., 78(382):316–331, 1983.
[27] V. Koltchinskii. Rademacher penalties and structural risk minimization. IEEE Trans. Inform. Theory, 47(5):1902–1914, 2001.
[28] V. Koltchinskii. Local Rademacher complexities and oracle inequalities in risk minimization. Ann. Statist., 34(6):2593–2656, 2006.
[29] G. Lecué and S. Mendelson. Aggregation via empirical risk minimization. Probab. Theory Related Fields, 145(3-4):591–613, 2009.
[30] M. Lerasle. Optimal model selection for density estimation of stationary data under various mixing conditions. Ann. Statist., 39(4):1852–1877, 2011.
[31] M. Lerasle. Optimal model selection in density estimation. Ann. Inst. Henri Poincaré Probab. Stat., 48(3):884–908, 2012.
[32] G. Leung and A. R. Barron. Information theory and mixing least-squares regressions. IEEE Trans. Inform. Theory, 52(8):3396–3410, 2006.
[33] C. L. Mallows. Some comments on C_p. Technometrics, 15:661–675, 1973.
[34] P. Massart. Concentration inequalities and model selection, volume 1896 of Lecture Notes in Mathematics. Springer, Berlin, 2007. Lectures from the 33rd Summer School on Probability Theory held in Saint-Flour, July 6–23, 2003. With a foreword by Jean Picard.
[35] P. Rigollet and A. B. Tsybakov. Sparse estimation by exponential weighting. Statistical Science, 27(4):558–575, 2012.
[36] A. Saumard. Optimal upper and lower bounds for the true and empirical excess risks in heteroscedastic least-squares regression. Electron. J. Statist., 6(1-2):579–655, 2012.
[37] A. B. Tsybakov.