High-dimensional Gaussian model selection on a Gaussian design
INSTITUT NATIONAL DE RECHERCHE EN INFORMATIQUE ET EN AUTOMATIQUE
Nicolas Verzelen
N° 6616, version 2. Initial version August 2008, revised version April 2009.
Centre de recherche INRIA Saclay - Île-de-France, Parc Orsay Université, 4 rue Jacques Monod, 91893 Orsay Cedex
Thème COG, Systèmes cognitifs. Équipe-Projet Select. Rapport de recherche n° 6616, version 2, 57 pages.

Abstract: We consider the problem of estimating the conditional mean of a real Gaussian variable $Y=\sum_{i=1}^p \theta_i X_i+\epsilon$, where the vector of covariates $(X_i)_{1\leq i\leq p}$ follows a joint Gaussian distribution. This issue often occurs when one aims at estimating the graph or the distribution of a Gaussian graphical model. We introduce a general model selection procedure based on the minimization of a penalized least-squares type criterion. It handles a variety of problems, such as ordered and complete variable selection, allows one to incorporate some prior knowledge on the model, and applies when the number of covariates $p$ is larger than the number of observations $n$. Moreover, it is shown to achieve a non-asymptotic oracle inequality independently of the correlation structure of the covariates. We also exhibit various minimax rates of estimation in the considered framework and hence derive adaptivity properties of our procedure.

Key words: model selection, linear regression, oracle inequalities, Gaussian graphical models, minimax rates of estimation

∗ Laboratoire de Mathématiques UMR 8628, Université Paris-Sud, 91405 Orsay
† INRIA Saclay, Projet SELECT, Université Paris-Sud, 91405 Orsay

1 Introduction

1.1 Regression model

We consider the following regression model:
$$Y = X\theta + \epsilon, \qquad (1)$$
where $\theta$ is an unknown vector of $\mathbb{R}^p$. The row vector $X := (X_i)_{1\leq i\leq p}$ follows a real zero-mean Gaussian distribution with non-singular covariance matrix $\Sigma$, and $\epsilon$ is a real zero-mean Gaussian random variable independent of $X$ with variance $\sigma^2$. The variance of $\epsilon$ corresponds to the conditional variance of $Y$ given $X$, $\operatorname{Var}(Y|X)$. In the sequel, the parameters $\theta$, $\Sigma$, and $\sigma^2$ are considered unknown.

Suppose we are given $n$ i.i.d. replications of the vector $(Y,X)$. We respectively write $\mathbf{Y}$ and $\mathbf{X}$ for the vector of the $n$ observations of $Y$ and the $n\times p$ matrix of observations of $X$. In the present work, we propose a new procedure to estimate the vector $\theta$ when the matrix $\Sigma$ and the variance $\sigma^2$ are both unknown. This corresponds to estimating the conditional expectation of the variable $Y$ given the random vector $X$. Besides, we want to handle the difficult case of high-dimensional data, i.e. the number of covariates $p$ is possibly much larger than $n$. This estimation problem is equivalent to building a suitable predictor of $Y$ given the covariates $(X_i)_{1\leq i\leq p}$. Classically, we shall use the mean-squared prediction error to assess the quality of the estimation. For any $(\theta_1,\theta_2)\in(\mathbb{R}^p)^2$, it is defined by
$$l(\theta_1,\theta_2) := \mathbb{E}\left[(X\theta_1 - X\theta_2)^2\right]. \qquad (2)$$
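To make the setting concrete, here is a minimal R sketch (ours, not part of the report) that draws $n$ i.i.d. replications of model (1) and evaluates the loss (2). The covariance matrix and all numerical values below are arbitrary illustrations; since $X$ is centered with covariance $\Sigma$, the loss (2) reduces to the quadratic form $(\theta_1-\theta_2)^T\Sigma(\theta_1-\theta_2)$.

```r
## Minimal sketch (assumed setup, not the author's code): simulate model (1).
set.seed(1)
n <- 30; p <- 20
Sigma <- 0.5^abs(outer(1:p, 1:p, "-"))   # illustrative covariance of X
theta <- c(2, 1, 0.5, rep(0, p - 3))     # illustrative true regression vector
sigma <- 1                               # conditional standard deviation

X <- matrix(rnorm(n * p), n, p) %*% chol(Sigma)   # n x p Gaussian design
Y <- drop(X %*% theta + rnorm(n, sd = sigma))     # response, model (1)

## Prediction loss (2): since X is centered with covariance Sigma,
## l(theta1, theta2) = (theta1 - theta2)' Sigma (theta1 - theta2).
loss <- function(theta1, theta2)
  drop(t(theta1 - theta2) %*% Sigma %*% (theta1 - theta2))
```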
1.2 Applications to Gaussian graphical models (GGM)

Estimation in the regression model (1) is mainly motivated by the study of Gaussian graphical models (GGM). Let $Z$ be a Gaussian random vector indexed by the elements of a finite set $\Gamma$. The vector $Z$ is a GGM with respect to an undirected graph $G=(\Gamma,E)$ if, for any couple $(i,j)$ which is not contained in the edge set $E$, $Z_i$ and $Z_j$ are independent given the remaining variables. See Lauritzen [23] for definitions and main properties of GGM. Estimating the neighborhood of a given point $i\in\Gamma$ is equivalent to estimating the support of the regression of $Z_i$ with respect to the covariates $(Z_j)_{j\in\Gamma\setminus\{i\}}$. Meinshausen and Bühlmann [26] have taken this point of view in order to estimate the graph of a GGM. Similarly, we can apply the model selection procedure we shall introduce in this paper to estimate the support of the regression and therefore the graph $G$ of a GGM.

Interest in these models has grown since they allow the description of the dependence structure of high-dimensional data. As such, they are widely used in spatial statistics [16, 27] or probabilistic expert systems [15]. More recently, they have been applied to the analysis of microarray data. The challenge is to infer the network regulating the expression of the genes using only a small sample of data; see for instance Schäfer and Strimmer [29] or Wille et al. [39]. This has motivated the search for new estimation procedures to handle the linear regression model (1) with Gaussian random design. Finally, let us mention that the model (1) is also of interest when estimating the distribution of directed graphical models or, more generally, the joint distribution of a large Gaussian random vector. Estimating the joint distribution of a Gaussian vector $(Z_i)_{1\leq i\leq p}$ indeed amounts to estimating the conditional expectation and variance of $Z_i$ given $(Z_j)_{1\leq j\leq i-1}$ for any $1\leq i\leq p$.

1.3 General oracle inequalities

Estimation of high-dimensional Gaussian linear models has attracted a lot of attention. Various procedures have been proposed to perform the estimation of $\theta$ when $p>n$. The challenge at hand is to design estimators that are both computationally feasible and provably efficient. The Lasso estimator was introduced by Tibshirani [33]. Meinshausen and Bühlmann [26] have shown that this estimator is consistent under a neighborhood stability condition. These convergence results were refined in the works of Zhao and Yu [40], Bunea et al. [11], Bickel et al. [5], and Candès and Plan [12] in a slightly different framework. Candès and Tao [13] have also introduced the Dantzig-selector procedure, which performs similarly to $l_1$ penalization methods. In the more specific context of GGM, Bühlmann and Kalisch [21] have analyzed the PC algorithm and have proven its consistency when the GGM satisfies a faithfulness assumption. All these methods share an attractive computational efficiency, and most of them are proven to converge at the optimal rate when the covariates are nearly independent.
However, they also share two main drawbacks. First, the $l_1$ estimators are known to behave poorly when the covariates are highly correlated, and even for some covariance structures with small correlation (see e.g. [12]). Similarly, the PC algorithm is not consistent if the faithfulness assumption is not fulfilled. Second, these procedures do not allow one to integrate biological or physical prior knowledge. Let us provide two examples. Biologists sometimes have a strong preconception of the underlying biological network thanks to previous experiments. For instance, Sachs et al. [28] have produced multivariate flow cytometry data in order to study a human T-cell signaling pathway. Since this pathway has important medical implications, it has already been extensively studied and a network is conventionally accepted (see [28]). For this particular example, it could be more interesting to check whether some interactions were forgotten or some unnecessary interactions were added in the model than to perform a complete graph estimation. Moreover, in some situations the covariates have a temporal or spatial interpretation. In such a case, it is natural to introduce an order between the covariates, by assuming that a covariate which is close (in space or time) to the response $Y$ is more likely to be significant. Hence, an ordered variable selection method is here possibly more relevant than the complete variable selection methods previously mentioned.

Let us emphasize the main differences between our estimation setting and related studies in the literature. Birgé and Massart [8] consider model selection in a fixed design setting with known variance. Bunea et al. [10] also suppose that the variance is known. They do consider a random design setting, but they assume that the regression functions are bounded (Assumption A.2 in their paper), which is not the case here. Moreover, they obtain risk bounds with respect to the empirical norm $\|\mathbf{X}(\widehat{\theta}-\theta)\|_n$ and not the integrated loss $l(\cdot,\cdot)$; here, $\|\cdot\|_n$ refers to the canonical norm in $\mathbb{R}^n$ divided by $\sqrt{n}$. As mentioned earlier, our objective is to infer the conditional expectation of $Y$ given $X$; hence, it is more significant to assess the risk with respect to the loss $l(\cdot,\cdot)$. Baraud et al. [4] consider fixed design regression but do not assume that the variance is known.

Our objective is twofold. First, we introduce a general model selection procedure that is very flexible and allows one to integrate any prior knowledge on the regression. We prove non-asymptotic oracle inequalities that hold without any assumption on the correlation structure of the covariates. Second, we obtain non-asymptotic rates of estimation for the model (1) that help us derive adaptivity properties of our criterion.

In the sequel, a model $m$ stands for a subset of $\{1,\dots,p\}$. We write $d_m$ for the size of $m$, whereas the linear space $S_m$ refers to the set of vectors $\theta\in\mathbb{R}^p$ whose components outside $m$ equal zero. If $d_m$ is smaller than $n$, then we define $\widehat{\theta}_m$ as the least-squares estimator of $\theta$ over $S_m$. In the sequel, $\Pi_m$ stands for the projection of $\mathbb{R}^n$ onto the space generated by $(\mathbf{X}_i)_{i\in m}$; hence, we have the relation $\mathbf{X}\widehat{\theta}_m = \Pi_m\mathbf{Y}$. Since the covariance matrix $\Sigma$ is non-singular, the rank of $\Pi_m$ is almost surely $d_m$. Given a collection $\mathcal{M}$ of models, our purpose is to select a model $\widehat{m}\in\mathcal{M}$ that exhibits a risk as small as possible with respect to the prediction loss function $l(\cdot,\cdot)$ defined in (2).
The model $m^*$ that minimizes the risk $\mathbb{E}[l(\widehat{\theta}_m,\theta)]$ over the whole collection $\mathcal{M}$ is called an oracle. Hence, we want to perform as well as the oracle $\widehat{\theta}_{m^*}$. However, we do not have access to $m^*$, as it requires the knowledge of the true vector $\theta$. A classical way to estimate a good model $\widehat{m}$ is penalization with respect to the complexity of the models. In the sequel, we select the model $\widehat{m}$ as
$$\widehat{m} := \arg\min_{m\in\mathcal{M}} \mathrm{Crit}(m) := \arg\min_{m\in\mathcal{M}} \|\mathbf{Y}-\Pi_m\mathbf{Y}\|_n^2\,\left[1+\mathrm{pen}(m)\right], \qquad (3)$$
where $\mathrm{pen}(\cdot)$ is a positive function defined on $\mathcal{M}$. Observe that $\mathrm{Crit}(m)$ is the sum of the least-squares error $\|\mathbf{Y}-\Pi_m\mathbf{Y}\|_n^2$ and a penalty term $\mathrm{pen}(m)$ rescaled by the least-squares error, in order to cope with the fact that the conditional variance $\sigma^2$ is unknown. We make precise in Section 2 the heuristics underlying this model selection criterion. Baraud et al. [4] have extensively studied this penalization method in the fixed design Gaussian regression framework with unknown variance. In their introduction, they explain how one may retrieve classical criteria like AIC [2], BIC [30], and FPE [1] by choosing a suitable penalty function $\mathrm{pen}(\cdot)$.

This model selection procedure is really flexible through the choices of the collection $\mathcal{M}$ and of the penalty function $\mathrm{pen}(\cdot)$. Indeed, we may perform complete variable selection by taking the collection of subsets of $\{1,\dots,p\}$ whose size is smaller than some integer $d$. Otherwise, by taking a nested collection of models, one performs ordered variable selection. We give more details in Sections 2 and 3. If one has some prior idea of the true model $m_0$, then one could consider only the collection of models that are close in some sense to $m_0$. Moreover, one may also give a Bayesian flavor to the penalty function $\mathrm{pen}(\cdot)$ and hence specify some prior knowledge on the model.

First, we state a non-asymptotic oracle inequality when the complexity of the collection $\mathcal{M}$ is small and for penalty functions $\mathrm{pen}(m)$ that are larger than $K d_m/(n-d_m)$ with $K>1$. Then, we prove that the FPE criterion of Akaike [1], which corresponds to the choice $K=2$, achieves an asymptotically exact oracle inequality for the special case of ordered variable selection. For the sake of completeness, we prove that choosing $K$ smaller than one leads to terrible performances. In Section 3.2, we consider general collections of models $\mathcal{M}$. By introducing new penalties that take into account the complexity of $\mathcal{M}$, as in [9], we are able to state a non-asymptotic oracle inequality. In particular, we consider the problem of complete variable selection. In Section 3.4, we define penalties based on a prior distribution on $\mathcal{M}$ and derive the corresponding risk bounds.

Interestingly, these rates of convergence do not depend on the covariance matrix $\Sigma$ of the covariates, whereas known results on the Lasso or the Dantzig selector rely on assumptions on $\Sigma$, as discussed in Section 3.2. We illustrate in Section 5 on simulated examples that for some covariance matrices $\Sigma$ the Lasso performs poorly whereas our method still behaves well. Besides, our penalization method does not require the knowledge of the conditional variance $\sigma^2$. In contrast, the Lasso and the Dantzig selector are constructed for known variance.
Since $\sigma^2$ is unknown, one either has to estimate it or has to use a cross-validation method in order to calibrate the penalty. In both cases, there is room for improvement in the practical calibration of these estimators. However, our model selection procedure suffers from a computational cost that depends linearly on the size of the collection $\mathcal{M}$. For instance, the complete variable selection problem is NP-hard, which makes it intractable when $p$ becomes too large (i.e. more than 50). In contrast, our criterion applies for arbitrary $p$ when considering ordered variable selection, since the size of $\mathcal{M}$ is then linear in $n$. We shall mention in the discussion some possible extensions that we hope can cope with these computational issues.

In a simultaneous and independent work, Giraud [19] applies an analogous procedure to estimate the graph of a GGM. Using slightly different techniques, he obtains non-asymptotic results that are complementary to ours. However, he performs an unnecessary thresholding to derive an upper bound of the risk. Moreover, he does not consider the case of nested collections of models as we do in Section 3.1. Finally, he does not derive minimax rates of estimation.

1.4 Minimax rates of estimation

In order to assess the optimality of our procedure, we investigate in Section 4 the minimax rates of estimation for ordered and complete variable selection. For ordered variable selection, we compute the minimax rate of estimation over ellipsoids, which is analogous to the rate obtained in the fixed design framework. We deduce that our penalized estimator is adaptive to the collection of ellipsoids independently of the covariance matrix $\Sigma$. For complete variable selection, we prove that the minimax rate of estimation of vectors $\theta$ with at most $k$ non-zero components is of order $k\log(p/k)/n$ when the covariates are independent. This is again coherent with the situation observed in the fixed design setting. The estimator $\widetilde{\theta}$ defined for the complete variable selection problem is then shown to be adaptive to any sparse vector $\theta$. Moreover, it seems that the minimax rates may become faster when the matrix $\Sigma$ is far from the identity. We investigate this phenomenon in Section 4.2. All these minimax rates of estimation are, to our knowledge, new in Gaussian random design regression. Tsybakov [35] has derived minimax rates of estimation in a general random design regression setup, but his results do not apply in our setting, as explained in Section 4.2.

1.5 Organization of the paper and some notations

In Section 2, we make precise our estimation procedure and explain the heuristics underlying the penalization method. The main results are stated in Section 3. In Section 4, we derive the different minimax rates of estimation and assess the adaptivity of the penalized estimator $\widehat{\theta}_{\widehat{m}}$. We perform a simulation study and compare the behaviour of our estimator with the Lasso and the adaptive Lasso in Section 5. Section 6 contains a final discussion and some extensions, whereas the proofs are postponed to Section 7.

Throughout the paper, $\|\cdot\|_n^2$ stands for the square of the canonical norm in $\mathbb{R}^n$ divided by $n$. For any vector $\mathbf{Z}$ of size $n$, we recall that $\Pi_m\mathbf{Z}$ denotes the orthogonal projection of $\mathbf{Z}$ onto the space generated by $(\mathbf{X}_i)_{i\in m}$. The notation $X_m$ stands for $(X_i)_{i\in m}$, and $\mathbf{X}_m$ represents the $n\times d_m$ matrix of the $n$ observations of $X_m$.
For the sake of simplicity, we write $\widetilde{\theta}$ for the penalized estimator $\widehat{\theta}_{\widehat{m}}$. For any $x>0$, $\lfloor x\rfloor$ is the largest integer smaller than $x$ and $\lceil x\rceil$ is the smallest integer larger than $x$. Finally, $L, L_1, L_2, \dots$ denote universal constants that may vary from line to line; the notation $L(\cdot)$ specifies the dependency on some quantities.

2 Estimation procedure

Given a collection of models $\mathcal{M}$ and a penalty $\mathrm{pen}:\mathcal{M}\to\mathbb{R}^+$, the estimator $\widetilde{\theta}$ is computed as follows.

Model selection procedure:
1. Compute $\widehat{\theta}_m = \arg\min_{\theta'\in S_m} \|\mathbf{Y}-\mathbf{X}\theta'\|_n^2$ for all models $m\in\mathcal{M}$.
2. Compute $\widehat{m} := \arg\min_{m\in\mathcal{M}} \|\mathbf{Y}-\mathbf{X}\widehat{\theta}_m\|_n^2\,[1+\mathrm{pen}(m)]$.
3. Set $\widetilde{\theta} := \widehat{\theta}_{\widehat{m}}$.

The choice of the collection $\mathcal{M}$ and of the penalty function $\mathrm{pen}(\cdot)$ depends on the problem under study. In what follows, we provide some preliminary results for the parametric estimators $\widehat{\theta}_m$ and give a heuristic explanation of our penalization method. For any vector $\theta'$ in $\mathbb{R}^p$, we define the mean-squared error $\gamma(\cdot)$ and its empirical counterpart $\gamma_n(\cdot)$ as
$$\gamma(\theta') := \mathbb{E}_\theta\left[(Y-X\theta')^2\right] \quad\text{and}\quad \gamma_n(\theta') := \|\mathbf{Y}-\mathbf{X}\theta'\|_n^2. \qquad (4)$$
The function $\gamma(\cdot)$ is closely connected to the loss function $l(\cdot,\cdot)$ through the relation $l(\beta,\theta)=\gamma(\beta)-\gamma(\theta)$.

Given a model $m$ of size strictly smaller than $n$, we refer to $\theta_m$ as the unique minimizer of $\gamma(\cdot)$ over the subset $S_m$. It then follows that $\mathbb{E}(Y|X_m)=\sum_{i\in m}(\theta_m)_i X_i$ and that $\gamma(\theta_m)$ is the conditional variance of $Y$ given $X_m$. The least-squares estimator $\widehat{\theta}_m$ is, in turn, the minimizer of $\gamma_n(\cdot)$ over the space $S_m$:
$$\widehat{\theta}_m := \arg\min_{\theta'\in S_m} \gamma_n(\theta') \quad \text{a.s.}$$
It is almost surely uniquely defined, since $\Sigma$ is assumed to be non-singular and since $d_m<n$. Besides, $\gamma_n(\widehat{\theta}_m)$ equals $\|\mathbf{Y}-\Pi_m\mathbf{Y}\|_n^2$. Let us derive two simple properties of $\widehat{\theta}_m$ that will give us some hints for performing model selection.

Lemma 2.1. For any model $m$ whose dimension is smaller than $n-1$, the expected mean-squared error and the expected least-squares error of $\widehat{\theta}_m$ respectively equal
$$\mathbb{E}\left[\gamma(\widehat{\theta}_m)\right] = \left[l(\theta_m,\theta)+\sigma^2\right]\left(1+\frac{d_m}{n-d_m-1}\right), \qquad (5)$$
$$\mathbb{E}\left[\gamma_n(\widehat{\theta}_m)\right] = \left[l(\theta_m,\theta)+\sigma^2\right]\left(1-\frac{d_m}{n}\right). \qquad (6)$$

The proof is postponed to the Appendix. From Equation (5), we derive a bias-variance decomposition of the risk of the estimator $\widehat{\theta}_m$:
$$\mathbb{E}\left[l(\widehat{\theta}_m,\theta)\right] = l(\theta_m,\theta) + \left[\sigma^2+l(\theta_m,\theta)\right]\frac{d_m}{n-d_m-1}.$$
Hence, $\widehat{\theta}_m$ converges to $\theta_m$ in probability when $n$ goes to infinity. Contrary to the fixed design regression framework, the variance term $[\sigma^2+l(\theta_m,\theta)]\,d_m/(n-d_m-1)$ depends on the bias term $l(\theta_m,\theta)$. Besides, this variance term does not necessarily increase when the dimension of the model increases.

Let us now explain the idea underlying our model selection procedure. We aim at choosing a model $\widehat{m}$ that nearly minimizes the mean-squared error $\gamma(\widehat{\theta}_m)$. Since we have access neither to $\gamma(\widehat{\theta}_m)$ nor to the bias $l(\theta_m,\theta)$, we perform an unbiased estimation of the risk, as done by Mallows [24] in the fixed design framework:
$$\gamma\big(\widehat{\theta}_m\big) \approx \gamma_n\big(\widehat{\theta}_m\big) + \mathbb{E}\left[\gamma\big(\widehat{\theta}_m\big)-\gamma_n\big(\widehat{\theta}_m\big)\right] \approx \gamma_n\big(\widehat{\theta}_m\big) + \mathbb{E}\left[\gamma_n\big(\widehat{\theta}_m\big)\right]\frac{d_m}{n-d_m}\left[2+\frac{d_m+1}{n-d_m-1}\right] \approx \gamma_n\big(\widehat{\theta}_m\big)\left[1+\frac{d_m}{n-d_m}\left(2+\frac{d_m+1}{n-d_m-1}\right)\right]. \qquad (7)$$
By Lemma 2.1, these approximations are in fact equalities in expectation. Since the last expression only depends on the data, we may compute its minimizer over the collection $\mathcal{M}$.
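The three steps above translate directly into code. The sketch below is ours (the function name `crit.select` is made up for illustration, and `Y`, `X` are as in the simulation sketch of Section 1.1): it implements criterion (3) for an arbitrary list of models, together with the penalty suggested by the heuristic (7) and the FPE penalty.

```r
## Our sketch of steps 1-3: 'models' is a list of index subsets,
## 'pen' a function returning pen(m) for a model m.
crit.select <- function(Y, X, models, pen) {
  n <- nrow(X)
  rss <- sapply(models, function(m) {            # step 1: least squares on S_m
    if (length(m) == 0) sum(Y^2) / n             # empty model: no projection
    else sum(lm.fit(X[, m, drop = FALSE], Y)$residuals^2) / n
  })
  crit <- rss * (1 + sapply(models, pen))        # step 2: criterion (3)
  models[[which.min(crit)]]                      # step 3: selected model
}

## Penalty from the heuristic (7), and the FPE penalty 2 d_m / (n - d_m).
pen.heur <- function(m, n) { d <- length(m); d / (n - d) * (2 + (d + 1) / (n - d - 1)) }
pen.fpe  <- function(m, n) { d <- length(m); 2 * d / (n - d) }
```

For instance, `crit.select(Y, X, models, function(m) pen.fpe(m, nrow(X)))` returns the model minimizing the FPE criterion over the given collection.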
This approximation is effective, and minimizing (7) provides a good estimator $\widetilde{\theta}$ when the size of the collection $\mathcal{M}$ is moderate, as stated in Theorem 3.1. We recall that $\|\mathbf{Y}-\Pi_m\mathbf{Y}\|_n^2$ equals $\gamma_n(\widehat{\theta}_m)$. Hence, the previous heuristics lead to the choice of penalty $\mathrm{pen}(m) = \frac{d_m}{n-d_m}\left(2+\frac{d_m+1}{n-d_m-1}\right)$ in our criterion (3), whereas the FPE criterion corresponds to $\mathrm{pen}(m)=2\frac{d_m}{n-d_m}$. These two penalties are equivalent when the dimension $d_m$ is small compared to $n$. In Theorem 3.1, we explain why these criteria allow one to derive approximate oracle inequalities when there is a small number of models. However, when the size of the collection $\mathcal{M}$ increases, we need to design other penalties that take into account the complexity of the collection $\mathcal{M}$ (see Section 3.2).

3 Oracle inequalities

3.1 A small number of models

In this section, we restrict ourselves to the situation where the collection of models $\mathcal{M}$ contains only a small number of models, as defined in [9], Sect. 3.1.2.

$(\mathcal{H}_{Pol})$: for each $d\geq 1$, the number of models $m\in\mathcal{M}$ such that $d_m=d$ grows at most polynomially with respect to $d$. In other words, there exist $\alpha$ and $\beta$ such that for any $d\geq 1$, $\mathrm{Card}(\{m\in\mathcal{M},\,d_m=d\})\leq \alpha d^\beta$.

$(\mathcal{H}_\eta)$: the dimension $d_m$ of every model $m$ in $\mathcal{M}$ is smaller than $\eta n$. Moreover, the number of observations $n$ is larger than a numerical constant times $1/(1-\eta)$.

Assumption $(\mathcal{H}_{Pol})$ states that there is at most a polynomial number of models of any given dimension. It includes in particular the problem of ordered variable selection, on which we focus in this section. Let us introduce the collection of models relevant for this issue. For any positive number $i$ smaller than or equal to $p$, we define the model $m_i := \{1,\dots,i\}$ and the nested collection $\mathcal{M}_i := \{m_0, m_1, \dots, m_i\}$, where $m_0$ refers to the empty model. Any collection $\mathcal{M}_i$ satisfies $(\mathcal{H}_{Pol})$ with $\beta=0$ and $\alpha=1$.

Theorem 3.1. Let $\eta$ be any positive number smaller than one. Assume that the collection $\mathcal{M}$ satisfies $(\mathcal{H}_{Pol})$ and $(\mathcal{H}_\eta)$. If the penalty $\mathrm{pen}(\cdot)$ is lower bounded as follows,
$$\mathrm{pen}(m) \geq K\frac{d_m}{n-d_m} \quad\text{for all } m\in\mathcal{M} \text{ and some } K>1, \qquad (8)$$
then
$$\mathbb{E}\left[l(\widetilde{\theta},\theta)\right] \leq L(K,\eta)\inf_{m\in\mathcal{M}}\left[l(\theta_m,\theta)+\frac{n-d_m}{n}\,\mathrm{pen}(m)\left[\sigma^2+l(\theta_m,\theta)\right]\right] + \tau_n, \qquad (9)$$
where the error term $\tau_n$ is defined as
$$\tau_n = \tau_n[\operatorname{Var}(Y),K,\eta,\alpha,\beta] := L(K,\eta,\alpha,\beta)\left[\frac{\sigma^2}{n}+n^\beta \operatorname{Var}(Y)\exp\left(-nL'(K,\eta)\right)\right],$$
and $L'(K,\eta)$ is positive.

The theorem applies for any $n$ and any $p$, and there is no hidden dependency on $n$ or $p$ in the constants. Besides, observe that the theorem does not depend at all on the covariance matrix $\Sigma$ of the covariates. If we choose the penalty $\mathrm{pen}(m)=K\frac{d_m}{n-d_m}$, we obtain an approximate oracle inequality,
$$\mathbb{E}\left[l(\widetilde{\theta},\theta)\right] \leq L(K,\eta)\inf_{m\in\mathcal{M}}\mathbb{E}\left[l(\widehat{\theta}_m,\theta)\right] + \tau_n[\operatorname{Var}(Y),K,\eta,\alpha,\beta],$$
thanks to Lemma 2.1. The term $n^\beta\operatorname{Var}(Y)\exp[-nL'(K,\eta)]$ converges exponentially fast to 0 when $n$ goes to infinity and is therefore considered negligible. One interesting feature of this oracle inequality is that it allows us to consider models of dimension as close to $n$ as we want, provided that $n$ is large enough.
This will not be possible in the next section, when handling more complex collections of models.

Having stated that $\widetilde{\theta}$ performs almost as well as the oracle, one may wonder whether it is possible to perform exactly as well as the oracle. In the next proposition, we prove that under an additional assumption the estimator $\widetilde{\theta}$ with $K=2$ satisfies an asymptotically exact oracle inequality. We state the result for the problem of ordered variable selection. Let us assume for a moment that the set of covariates is infinite, i.e. $p=+\infty$. In this setting, we define the subset $\Theta$ of sequences $\theta=(\theta_i)_{i\geq 1}$ such that $\langle X,\theta\rangle$ converges in $L^2$. In the following proposition, we assume that $\theta\in\Theta$.

Definition 3.1. Let $s$ and $R$ be two positive numbers. We define the so-called ellipsoid $\mathcal{E}'_s(R)$ as
$$\mathcal{E}'_s(R) := \left\{(\theta_i)_{i\geq 1}\in\Theta,\ \sum_{i=1}^{+\infty} l\left(\theta_{m_{i-1}},\theta_{m_i}\right) i^{2s} \leq R^2\sigma^2\right\}.$$
In Section 4.1, we explain why we call this set $\mathcal{E}'_s(R)$ an ellipsoid.

Proposition 3.2. Assume that there exist $s$, $s'>s$, and $R$ such that $\theta\in\mathcal{E}'_s(R)$ and such that, for any positive number $R'$, $\theta\notin\mathcal{E}'_{s'}(R')$. We consider the collection $\mathcal{M}_{\lfloor n/2\rfloor}$ and the penalty $\mathrm{pen}(m)=2\frac{d_m}{n-d_m}$. Then there exist a constant $L(s,R)$ and a sequence $\tau(n)$ converging to zero at infinity such that, with probability at least $1-L(s,R)\frac{\log n}{n}$,
$$l(\widetilde{\theta},\theta) \leq [1+\tau(n)]\inf_{m\in\mathcal{M}_{\lfloor n/2\rfloor}} l(\widehat{\theta}_m,\theta). \qquad (10)$$

Admittedly, we let $n$ go to infinity in this proposition, but we are still in a high-dimensional setting, since $p=+\infty$ and since the size of the collection $\mathcal{M}_{\lfloor n/2\rfloor}$ goes to infinity with $n$. Let us briefly discuss the assumption on $\theta$. Roughly speaking, it ensures that the oracle model has a dimension not too close to zero (larger than $\log(n)$) and small compared to $n$ (smaller than $n/\log n$). Notice that it is classical to assume that the bias is non-zero for every model $m$ when proving the asymptotic optimality of Mallows' $C_p$ (cf. Shibata [31] and Birgé and Massart [9]). Here, we make a stronger assumption because the bound (10) holds in probability and because the design is Gaussian. Moreover, our stronger assumption has already been made by Stone [32] and Arlot [3]. We refer to Arlot [3], Sect. 4.1, for a more complete discussion of this assumption.

The choice of the collection $\mathcal{M}_{\lfloor n/2\rfloor}$ is arbitrary, and one can extend the result to many collections that satisfy $(\mathcal{H}_{Pol})$ and $(\mathcal{H}_\eta)$. As mentioned in Section 2, the penalty $\mathrm{pen}(m)=2\frac{d_m}{n-d_m}$ corresponds to the FPE model selection procedure. In conclusion, the choice of the FPE criterion turns out to be asymptotically optimal when the complexity of $\mathcal{M}$ is small.

We now underline that the condition $K>1$ in Theorem 3.1 is almost necessary: choosing $K$ smaller than one yields terrible statistical performances.

Proposition 3.3. Suppose that $p$ is larger than $n/2$. Consider the collection $\mathcal{M}_{\lfloor n/2\rfloor}$ and assume that, for some $\nu>0$,
$$\mathrm{pen}(m) = (1-\nu)\frac{d_m}{n-d_m}, \qquad (11)$$
for any model $m\in\mathcal{M}_{\lfloor n/2\rfloor}$. Then, given $\delta\in(0,1)$, there exists some $n_0(\nu,\delta)$ only depending on $\nu$ and $\delta$ such that, for $n\geq n_0(\nu,\delta)$,
$$\mathbb{P}_\theta\left[d_{\widehat{m}}\geq n/4\right]\geq 1-\delta \quad\text{and}\quad \mathbb{E}\left[l(\widetilde{\theta},\theta)\right]\geq l\left(\theta_{m_{\lfloor n/2\rfloor}},\theta\right)+L(\delta,\nu)\,\sigma^2.$$

If one chooses too small a penalty, then the dimension $d_{\widehat{m}}$ of the selected model is huge and the penalized estimator $\widetilde{\theta}$ performs poorly. The hypothesis $p\geq n/2$ is needed for defining the collection $\mathcal{M}_{\lfloor n/2\rfloor}$. Once again, the choice of the collection $\mathcal{M}_{\lfloor n/2\rfloor}$ is rather arbitrary, and the result of Proposition 3.3 still holds for collections $\mathcal{M}$ which satisfy $(\mathcal{H}_{Pol})$ and $(\mathcal{H}_\eta)$ and contain at least one model of large dimension. Theorem 3.1 and Proposition 3.3 tell us that $\frac{d_m}{n-d_m}$ is the minimal penalty. In practice, we advise choosing $K$ between 2 and 3. Admittedly, $K=2$ is asymptotically optimal by Proposition 3.2; nevertheless, we have observed in simulations that $K=3$ gives slightly better results when $n$ is small. For ordered variable selection, we suggest taking the collection $\mathcal{M}_{\lfloor n/2\rfloor}$, as in the sketch below.
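Concretely (our sketch, reusing the hypothetical `crit.select` helper from Section 2), ordered variable selection over the nested collection $\mathcal{M}_{\lfloor n/2\rfloor}$ with the penalty $K\,d_m/(n-d_m)$ reads:

```r
## Ordered variable selection over M_{floor(n/2)} with pen(m) = K d_m/(n - d_m);
## K = 2 recovers FPE. (Our sketch, with Y, X from the earlier snippets.)
n <- nrow(X); p <- ncol(X)
ordered.models <- lapply(0:floor(n / 2), seq_len)   # m_0 (empty), m_1, ..., m_{floor(n/2)}
K <- 2
m.hat <- crit.select(Y, X, ordered.models,
                     function(m) K * length(m) / (n - length(m)))
theta.tilde <- numeric(p)
if (length(m.hat) > 0)
  theta.tilde[m.hat] <- lm.fit(X[, m.hat, drop = FALSE], Y)$coefficients
```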
3.2 A general model selection theorem

In this section, we study the performance of the penalized estimator $\widetilde{\theta}$ for general collections $\mathcal{M}$. Classically, we need to penalize the models more heavily, incorporating the complexity of the collection. As a special case, we shall consider the problem of complete variable selection. This is why we define the collections $\mathcal{M}_p^d$ that consist of all subsets of $\{1,\dots,p\}$ of size less than or equal to $d$.

Definition 3.2. Given a collection $\mathcal{M}$, we define the function $H(\cdot)$ by
$$H(d) := \frac{1}{d}\log\left[\mathrm{Card}\left(\{m\in\mathcal{M},\,d_m=d\}\right)\right], \quad\text{for any integer } d\geq 1.$$
This function measures the complexity of the collection $\mathcal{M}$. For the collection $\mathcal{M}_p^d$, $H(k)$ is upper bounded by $\log(ep/k)$ for any $k\leq d$ (see Eq. (4.10) in [25]). Contrary to the situation encountered in ordered variable selection, we are not able to consider models of arbitrary dimension, and we shall make the following assumption.

$(\mathcal{H}_{K,\eta})$: Given $K>1$ and $\eta>0$, the collection $\mathcal{M}$ and the number $\eta$ satisfy
$$\forall m\in\mathcal{M},\quad \left[1+\sqrt{2H(d_m)}\right]^2\frac{d_m}{n-d_m} \leq \eta < \eta(K), \qquad (12)$$
where $\eta(K)$ is an explicit positive function of $K$. The function $\eta(K)$ increases when $K$ is larger than one; besides, $\eta(K)$ converges to one when $K$ converges to infinity. We do not claim that the expression of $\eta(K)$ is optimal; we are more interested in its behavior when $K$ is large.

Theorem 3.4. Let $K>1$ and let $\eta<\eta(K)$. Assume that $n$ is larger than some quantity $n_0(K)$ depending only on $K$ and that the collection $\mathcal{M}$ satisfies $(\mathcal{H}_{K,\eta})$. If the penalty $\mathrm{pen}(\cdot)$ is lower bounded as follows,
$$\mathrm{pen}(m) \geq K\frac{d_m}{n-d_m}\left(1+\sqrt{2H(d_m)}\right)^2 \quad\text{for any } m\in\mathcal{M}, \qquad (13)$$
then
$$\mathbb{E}\left[l(\widetilde{\theta},\theta)\right] \leq L(K,\eta)\inf_{m\in\mathcal{M}}\left\{l(\theta_m,\theta)+\frac{n-d_m}{n}\,\mathrm{pen}(m)\left[\sigma^2+l(\theta_m,\theta)\right]\right\} + \tau_n, \qquad (14)$$
where $\tau_n$ is defined as
$$\tau_n = \tau_n[\operatorname{Var}(Y),K,\eta] := \frac{\sigma^2 L(K,\eta)}{n} + L(K,\eta)\,n^{5/2}\operatorname{Var}(Y)\exp\left[-nL'(K,\eta)\right],$$
and $L'(K,\eta)$ is positive.

This theorem provides an oracle-type inequality of the same kind as the one obtained in the Gaussian sequence framework by Birgé and Massart [8]. The risk of the penalized estimator $\widetilde{\theta}$ almost achieves the infimum of the risks, plus a penalty term depending on the function $H(\cdot)$. As in Theorem 3.1, the error term $\tau_n[\operatorname{Var}(Y),K,\eta]$ depends on $\theta$, but this part goes exponentially fast to 0 with $n$.

Comments:
- As for Theorem 3.1, the result holds for arbitrarily large $p$ as long as $n$ is larger than the quantity $n_0(K)$ (which is independent of $p$). There is no hidden dependency on $p$ except in the complexity function $H(\cdot)$ and Assumption $(\mathcal{H}_{K,\eta})$, which we discuss below for the particular case of complete variable selection.
Moreover, one may easily check Assumption $(\mathcal{H}_{K,\eta})$, since it only depends on the collection $\mathcal{M}$ and not on some unknown quantity.
- This result (as well as that of Theorem 3.1) does not depend at all on the covariance matrix $\Sigma$ of the covariates.
- The penalty introduced in this theorem only depends on the collection $\mathcal{M}$ and on a number $K>1$. Hence, performing the procedure does not require any knowledge of $\sigma^2$, $\Sigma$, or $\theta$. We give hints at the end of the section for choosing the constant $K$.
- Observe that Theorem 3.1 is not just a corollary of Theorem 3.4. If we apply Theorem 3.4 to the problem of ordered selection, then the maximal size of the models has to be smaller than $n\,\eta(K)/[1+\eta(K)]$, which depends on $K$ and is always smaller than $n/2$. In contrast, Theorem 3.1 handles models of size up to $n-1$.

3.3 Application to complete variable selection

Let us now restate Theorem 3.4 for the particular issue of complete variable selection. Consider $K>1$, $\eta<\eta(K)$, and $d>0$ such that $\mathcal{M}_p^d$ satisfies Assumption $(\mathcal{H}_{K,\eta})$. If we take, for any model $m\in\mathcal{M}_p^d$, the penalty term
$$\mathrm{pen}(m) = K\frac{d_m}{n-d_m}\left(1+\sqrt{2\log\left(\frac{ep}{d_m}\right)}\right)^2, \qquad (15)$$
then we get
$$\mathbb{E}\left[l(\widetilde{\theta},\theta)\right] \leq L(K,\eta)\inf_{m\in\mathcal{M}_p^d}\left\{l(\theta_m,\theta)+\frac{d_m}{n}\log\left(\frac{ep}{d_m}\right)\sigma^2\right\} + \tau_n[\operatorname{Var}(Y),K,\eta].$$
We shall prove in Section 4.2 that the term $\log(p/d_m)$ is unavoidable and that the obtained estimator is optimal from a minimax point of view. If the true parameter $\theta$ belongs to some unknown model $m_0$, then the rate of estimation of $\widetilde{\theta}$ is of order $\frac{d_{m_0}}{n}\log\left(\frac{p}{d_{m_0}}\right)\sigma^2$.
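In code, the penalty (15) is straightforward; the sketch below is ours, reuses the hypothetical `crit.select` helper from Section 2, and uses the bound $H(d)\leq\log(ep/d)$ stated after Definition 3.2. The size of $\mathcal{M}_p^d$, namely $\sum_{k\leq d}\binom{p}{k}$, is what makes complete variable selection expensive, so $d$ is kept very small here.

```r
## Penalty (15) for complete variable selection, using H(d) <= log(e p / d),
## and the collection M_p^d of all subsets of size <= d. (Our sketch.)
pen.complete <- function(m, n, p, K = 1.1) {
  d <- length(m)
  if (d == 0) return(0)
  K * d / (n - d) * (1 + sqrt(2 * log(exp(1) * p / d)))^2
}
d.max <- 3   # the collection holds sum_k choose(p, k) models: keep d small
complete.models <- c(list(integer(0)),
                     unlist(lapply(1:d.max,
                                   function(d) combn(p, d, simplify = FALSE)),
                            recursive = FALSE))
m.hat <- crit.select(Y, X, complete.models, function(m) pen.complete(m, n, p))
```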
Let us compare our result with other procedures.
- The oracle-type inequalities look similar to the ones obtained by Birgé and Massart [8], Bunea et al. [10], and Baraud et al. [4]. However, Birgé and Massart and Bunea et al. assume that the variance $\sigma^2$ is known, and Birgé and Massart and Baraud et al. only consider a fixed design setting. Bunea et al. do allow the design to be random, but they assume that the regression functions are bounded (Assumption A.2 in their paper), which is not the case here; moreover, they only get risk bounds with respect to the empirical norm $\|\cdot\|_n$ and not the integrated loss $l(\cdot,\cdot)$.
- As mentioned previously, our oracle inequality holds for any covariance matrix $\Sigma$. In contrast, Lasso and Dantzig selector estimators have been shown to satisfy oracle inequalities under assumptions on the empirical design $\mathbf{X}$. In [13], Candès and Tao indeed assume that the singular values of $\mathbf{X}$ restricted to any subset of size proportional to the sparsity of $\theta$ are bounded away from zero. Bickel et al. [5] introduce an extension of this condition that is proved to work for both the Lasso and the Dantzig selector. In a recent work [12], Candès and Plan state that if the empirical correlation between the covariates is smaller than $L(\log p)^{-1}$, then the Lasso satisfies an oracle inequality in a majority of cases. Their condition is in fact almost necessary: on the one hand, they give examples of some low-correlation situations where the Lasso performs poorly; on the other hand, they prove that the Lasso fails to work well if the correlation between the covariates is larger than $L(\log p)^{-1}$. Admittedly, Candès and Plan consider the loss function $\|\mathbf{X}\widehat{\theta}-\mathbf{X}\theta\|_n^2$, whereas we use the integrated loss $l(\widehat{\theta},\theta)$, but this does not really change the impact of their result; we refer to their paper for further details. The main point is that for some correlation structures our procedure still works well, whereas the Lasso and Dantzig selector procedures perform poorly. In many problems, such as GGM estimation, the correlation between the covariates may be high, and even the relaxed assumptions of Candès and Plan may not be fulfilled. In Section 5, we illustrate this phenomenon by comparing our procedure with the Lasso on numerical examples with independent and with highly correlated covariates.
- Suppose that the covariates are independent and that $\theta$ belongs to some model $m$; the rate of convergence of the Lasso is then of order $\frac{d_m}{n}\log(p)\sigma^2$, whereas ours is $\frac{d_m}{n}\log(p/d_m)\sigma^2$. Consider the case where $p$ and $d_m$ are of the same order whereas $n$ is large: our model selection procedure then outperforms the Lasso by a $\log(p)$ factor, even though the covariates are independent.
- Let us restate Assumption $(\mathcal{H}_{K,\eta})$ for the particular collection $\mathcal{M}_p^d$. Given some $K>1$ and some $\eta<\eta(K)$, the collection $\mathcal{M}_p^d$ satisfies $(\mathcal{H}_{K,\eta})$ if
$$d\left[1+\sqrt{2\log\left(\frac{ep}{d}\right)}\right]^2 \leq \eta\, n. \qquad (16)$$
If $p$ is much larger than $n$, the dimension $d$ of the largest model has to be smaller than the order $\eta n/(2\log p)$. Candès and Plan state a similar condition for the Lasso, and we believe that this condition is unimprovable. Indeed, Wainwright states in Th. 2 of [38] a result going in this direction: it is impossible to reliably estimate the support of a $k$-sparse vector $\theta$ if $n$ is smaller than the order $k\log(p/k)$. If $\log(p)$ is larger than $n$, then we cannot apply Theorem 3.4; this ultra-high dimensional setting is also not handled by the theory for the Lasso and the Dantzig selector. Finally, if $p$ is of the same order as $n$, then Condition (16) is satisfied for dimensions $d$ of the same order as $n$. Hence, our method works well even when the sparsity is of the same order as $n$, which is not the case for the Lasso or the Dantzig selector.

Let us discuss the practical choice of $d$ and $K$ for complete variable selection. From numerical studies, we advise taking $d$ up to the order of $n/[2\log(p/n \vee 2)]$, capped at $p$, even if this quantity is slightly larger than what is ensured by the theory. The practical choice of $K$ depends on the aim of the study: if one aims at minimizing the risk, $K=1.1$ gives rather good results, while a larger $K$, like 1.5 or 2, yields a more conservative procedure and consequently a lower FDR. We compare these values of $K$ on simulated examples in Section 5.

3.4 Penalties based on a prior distribution

The penalty defined in Theorem 3.4 depends on the models only through their cardinality. However, the methodology developed in the proof easily extends to the case where the user has some prior knowledge of the relevant models. Let $\pi_{\mathcal{M}}$ be a prior probability measure on the collection $\mathcal{M}$. For any non-empty model $m\in\mathcal{M}$, we define $l_m$ by
$$l_m := -\frac{\log\left(\pi_{\mathcal{M}}(m)\right)}{d_m}.$$
By convention, we set $l_\emptyset$ to 0. We define in the next proposition penalty functions based on the quantity $l_m$ that allow us to obtain non-asymptotic oracle inequalities.

Assumption $(\mathcal{H}^l_{K,\eta})$: Given $K>1$ and $\eta>0$, the collection $\mathcal{M}$, the numbers $l_m$, and the number $\eta$ satisfy
$$\forall m\in\mathcal{M},\quad \left[1+\sqrt{2\,l_m}\right]^2\frac{d_m}{n-d_m} \leq \eta < \eta(K), \qquad (17)$$
where $\eta(K)$ is defined as in $(\mathcal{H}_{K,\eta})$.

Proposition 3.5. Let $K>1$ and let $\eta<\eta(K)$. Assume that $n\geq n_0(K)$ and that Assumption $(\mathcal{H}^l_{K,\eta})$ is fulfilled. If the penalty $\mathrm{pen}(\cdot)$
is lower bounded as follows,
$$\mathrm{pen}(m) \geq K\frac{d_m}{n-d_m}\left(1+\sqrt{2\,l_m}\right)^2 \quad\text{for any } m\in\mathcal{M}\setminus\{\emptyset\}, \qquad (18)$$
then
$$\mathbb{E}\left[l(\widetilde{\theta},\theta)\right] \leq L(K,\eta)\inf_{m\in\mathcal{M}}\left\{l(\theta_m,\theta)+\frac{n-d_m}{n}\,\mathrm{pen}(m)\left[\sigma^2+l(\theta_m,\theta)\right]\right\} + \tau_n, \qquad (19)$$
where $L(K,\eta)$ and $\tau_n$ are the same as in Theorem 3.4.

Comments:
- In this proposition, the penalty (18) as well as the risk bound (19) depend on the prior distribution $\pi_{\mathcal{M}}$. In fact, the bound (19) means that $\widetilde{\theta}$ achieves the trade-off between the bias and some prior weight, which is of order $-\log[\pi_{\mathcal{M}}(m)]\left[\sigma^2+l(\theta_m,\theta)\right]/n$. This emphasizes that $\widetilde{\theta}$ favours models with a high prior probability. Similar risk bounds are obtained in the fixed design regression framework by Birgé and Massart [7].
- Although the proofs of Proposition 3.5 and Theorem 3.4 are very similar, Proposition 3.5 does not imply the theorem.
- Roughly speaking, Assumption $(\mathcal{H}^l_{K,\eta})$ requires that the prior probability $\pi_{\mathcal{M}}(m)$ is not exponentially small with respect to $n$.
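As an illustration (ours, not from the report), a prior $\pi_{\mathcal{M}}$ enters the procedure only through the weights $l_m$. The sketch below implements the penalty (18); the prior `prior.prob`, which downweights models far from a conjectured support `m0`, is a made-up example and is left unnormalized for simplicity.

```r
## Penalty (18) built from a prior pi.M on the collection (our sketch).
## 'prior.prob' is any function returning pi.M(m) (up to normalization here).
pen.prior <- function(m, n, prior.prob, K = 1.1) {
  d <- length(m)
  if (d == 0) return(0)
  l.m <- -log(prior.prob(m)) / d
  K * d / (n - d) * (1 + sqrt(2 * l.m))^2
}
m0 <- 1:3   # hypothetical conjectured support
prior.prob <- function(m)   # made-up prior: decays with the symmetric difference to m0
  exp(-length(union(setdiff(m, m0), setdiff(m0, m))))
```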
4 Minimax lower bounds and adaptivity

Throughout this section, we emphasize the dependency of the expectations $\mathbb{E}(\cdot)$ and the probabilities $\mathbb{P}(\cdot)$ on $\theta$ by writing $\mathbb{E}_\theta$ and $\mathbb{P}_\theta$. We have stated in Section 3 that the penalized estimator $\widetilde{\theta}$ performs almost as well as the best of the estimators $\widehat{\theta}_m$. We now want to compare the risk of $\widetilde{\theta}$ with the risk of any other possible estimator $\widehat{\theta}$. There is no hope of making a pointwise comparison with an arbitrary estimator; therefore, we classically consider the maximal risk over suitable subsets $\Theta^*$ of $\mathbb{R}^p$. The minimax risk over the set $\Theta^*$ is given by $\inf_{\widehat{\theta}}\sup_{\theta\in\Theta^*}\mathbb{E}_\theta[l(\widehat{\theta},\theta)]$, where the infimum is taken over all possible estimators $\widehat{\theta}$ of $\theta$. The estimator $\widetilde{\theta}$ is then said to be approximately minimax with respect to the set $\Theta^*$ if the ratio
$$\frac{\sup_{\theta\in\Theta^*}\mathbb{E}_\theta\left[l(\widetilde{\theta},\theta)\right]}{\inf_{\widehat{\theta}}\sup_{\theta\in\Theta^*}\mathbb{E}_\theta\left[l(\widehat{\theta},\theta)\right]}$$
is smaller than a constant that does not depend on $\sigma^2$, $n$, or $p$. The minimax rates of estimation were extensively studied in the fixed design Gaussian regression framework, and we refer for instance to [8] for a detailed discussion. In this section, we apply a classical methodology known as Fano's Lemma in order to derive minimax rates of estimation for ordered and complete variable selection. Then, we deduce adaptivity properties of the penalized estimator $\widetilde{\theta}$.

4.1 Adaptivity with respect to ellipsoids

In this section, we prove that the estimator $\widetilde{\theta}$ introduced in Section 3.1 to perform ordered variable selection is adaptive to a large class of ellipsoids.

Definition 4.1. For any non-increasing sequence $(a_i)_{1\leq i\leq p+1}$ such that $a_1=1$ and $a_{p+1}=0$, and any $R>0$, we define the ellipsoid $\mathcal{E}_a(R)$ by
$$\mathcal{E}_a(R) := \left\{\theta\in\mathbb{R}^p,\ \sum_{i=1}^p \frac{l\left(\theta_{m_{i-1}},\theta_{m_i}\right)}{a_i^2} \leq R^2\right\}.$$

This definition is very similar to the notion of ellipsoids introduced in [36]. Let us explain why we call this set an ellipsoid. Assume for one moment that the $(X_i)_{1\leq i\leq p}$ are independent and identically distributed with variance one. In this case, the term $l(\theta_{m_{i-1}},\theta_{m_i})$ equals $\theta_i^2$, and the definition of $\mathcal{E}_a(R)$ translates into
$$\mathcal{E}_a(R) = \left\{\theta\in\mathbb{R}^p,\ \sum_{i=1}^p \frac{\theta_i^2}{a_i^2} \leq R^2\right\},$$
which precisely corresponds to a classical definition of an ellipsoid. If the $(X_i)_{1\leq i\leq p}$ are not i.i.d. with unit variance, it is always possible to create a sequence $(X'_i)$ of i.i.d. standard Gaussian variables by orthonormalizing the $X_i$ using the Gram-Schmidt process. If we call $\theta'$ the vector in $\mathbb{R}^p$ such that $X\theta = X'\theta'$, then it holds that $l(\theta_{m_{i-1}},\theta_{m_i}) = \theta_i'^2$. We can then express $\mathcal{E}_a(R)$ using the coordinates of $\theta'$ as previously:
$$\mathcal{E}_a(R) = \left\{\theta\in\mathbb{R}^p,\ \sum_{i=1}^p \frac{\theta_i'^2}{a_i^2} \leq R^2\right\}.$$
The main advantage of this definition is that it does not directly depend on the covariance of $(X_i)_{1\leq i\leq p}$.

Proposition 4.1. For any sequence $(a_i)_{1\leq i\leq p+1}$ as above and any positive number $R$, the minimax rate of estimation over the ellipsoid $\mathcal{E}_a(R)$ is lower bounded by
$$\inf_{\widehat{\theta}}\sup_{\theta\in\mathcal{E}_a(R)}\mathbb{E}_\theta\left[l(\widehat{\theta},\theta)\right] \geq L\sup_{1\leq i\leq p}\left[a_i^2R^2 \wedge \frac{\sigma^2 i}{n}\right]. \qquad (20)$$

This result is analogous to the lower bounds obtained in the fixed design regression framework (see e.g. [25], Th. 4.9). Hence, the estimator $\widetilde{\theta}$ built in Section 3.1 is adaptive to a large class of ellipsoids.

Corollary 4.2. Assume that $n$ is large enough. We consider the penalized estimator $\widetilde{\theta}$ with the collection $\mathcal{M}_{\lfloor n/2\rfloor}$ and the penalty $\mathrm{pen}(m)=K\frac{d_m}{n-d_m}$, $K>1$. Let $\mathcal{E}_a(R)$ be an ellipsoid whose radius $R$ satisfies $\sigma^2/n \leq R^2 \leq \sigma^2 n^\beta$ for some $\beta>0$. Then $\widetilde{\theta}$ is approximately minimax on $\mathcal{E}_a(R)$,
$$\sup_{\theta\in\mathcal{E}_a(R)}\mathbb{E}_\theta\left[l(\widetilde{\theta},\theta)\right] \leq L(K,\beta)\inf_{\widehat{\theta}}\sup_{\theta\in\mathcal{E}_a(R)}\mathbb{E}_\theta\left[l(\widehat{\theta},\theta)\right],$$
if either $n\geq 2p$ or $a_{\lfloor n/2\rfloor+1}^2 R^2 \leq \sigma^2/2$.

In the fixed design framework, one may build estimators that are adaptive to any ellipsoid satisfying $R^2\geq\sigma^2/n$, so that the ellipsoid is not degenerate (see e.g. [25], Sect. 4.3.3). In our setting, when $p$ is small the estimator $\widetilde{\theta}$ is adaptive to all ellipsoids that have a moderate radius, $\sigma^2/n \leq R^2 \leq \sigma^2 n^\beta$. The technical condition $R^2\leq\sigma^2 n^\beta$ is not really restrictive; it comes from the term $n^\beta\, l(0_p,\theta)\exp(-nL(K))$ in Theorem 3.1, which goes exponentially fast to 0 with $n$. When $p$ is larger, $\widetilde{\theta}$ is adaptive to the ellipsoids that also satisfy $a_{\lfloor n/2\rfloor+1}^2 R^2 \leq \sigma^2/2$. In other words, we require that the ellipsoid be well approximated by the space $S_{m_{\lfloor n/2\rfloor}}$ of vectors $\theta$ whose support is included in $\{1,\dots,\lfloor n/2\rfloor\}$. If this condition is not fulfilled, the estimator $\widetilde{\theta}$ is not proved to be minimax on $\mathcal{E}_a(R)$; for such situations, we believe on the one hand that the estimator $\widetilde{\theta}$ should be refined and on the other hand that our lower bounds are not sharp. Finally, the collection $\mathcal{M}_{\lfloor n/2\rfloor}$ may be replaced by any $\mathcal{M}_{\lfloor n\eta\rfloor}$ in Corollary 4.2.
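To make (20) concrete, consider (this worked computation is ours) polynomially decreasing weights $a_i=i^{-s}$ with $s>0$. The two terms in the supremum are balanced at the dimension $i_*$ where the squared bias $a_i^2R^2$ meets the variance $\sigma^2 i/n$:
$$R^2 i_*^{-2s} = \frac{\sigma^2 i_*}{n} \quad\Longleftrightarrow\quad i_* = \left(\frac{nR^2}{\sigma^2}\right)^{\frac{1}{2s+1}},$$
so that, provided $1\leq i_*\leq p$,
$$\sup_{1\leq i\leq p}\left[a_i^2R^2 \wedge \frac{\sigma^2 i}{n}\right] \asymp \frac{\sigma^2 i_*}{n} = R^{\frac{2}{2s+1}}\left(\frac{\sigma^2}{n}\right)^{\frac{2s}{2s+1}},$$
which is the classical nonparametric rate over Sobolev-type ellipsoids.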
For any r > , we denote Θ[ k, p ]( r ) the subset of Θ[ k, p ] su h thatany omponent of θ is smaller than r in absolute value.First, we derive a lower bound for the minimax rates of estimation when the ovariates areindependent. Then, we prove the estimator e θ de(cid:28)ned with some olle tion M dp and the penalty(15) is adaptive to any sparse ve tor θ . Finally, we investigate the minimax rates of estimation for orrelated ovariates.Proposition 4.3. Assume that the ovariates X i are independent and have a unit varian e. Forany k ≤ p and any radius r > , inf b θ sup θ ∈ Θ[ k,p ]( r ) E θ h l ( b θ, θ ) i ≥ Lk " r ∧ σ (cid:0) pk (cid:1) n . (21)Thanks to Theorem 3.4, we derive the minimax rate of estimation over Θ[ k, p ] .Corollary 4.4. Consider K > , β > , and η < η ( K ) . Assume that n ≥ n ( K ) and that the ovariates X i are independent and have a unit varian e. Let d be a positive integer su h that M dp satis(cid:28)es ( H K,η ) . The penalized estimator e θ de(cid:28)ned with the olle tion M dp and the penalty (15) isadaptive minimax over the sets Θ[ k, p ]( n β )sup θ ∈ Θ[ k,p ] E θ h l ( e θ, θ ) i ≤ L ( K, β, η ) inf b θ sup θ ∈ Θ[ k,p ]( n β ) E θ h l ( b θ, θ ) i , for any k smaller than d .Hen e, the minimax rates of estimation over Θ[ k, p ]( n β ) is of order k log ( epk ) n , whi h is similarto the rates obtained in the (cid:28)xed design regression framework. As in previous Se tion, we restri tourselves to a radius r in Θ[ k, p ]( r ) smaller than n β be ause of the term τ n ( Var ( Y ) , K, η ) whi h de-pends on l (0 p , θ ) but goes exponentially fast to 0 when n goes to in(cid:28)nity. Let us interpret Corollary4.4 with regard to Condition (16). If p is of the same order as n , the estimator e θ is simultaneouslyminimax over all sets Θ[ k, p ]( n β ) when k is smaller than a onstant times n . If p is mu h largerRR n° 66168 Verzelenthan n , the estimator e θ is simultaneously minimax over all sets Θ[ k, p ]( n β ) with k smaller than Ln/ log( p ) . We onje ture that the minimax rate of estimation is larger than k log( p/k ) /n when k be omes larger than n/ log p . Let us mention that Tsybakov [35℄ has proved general minimaxlower bounds for aggregation in Gaussian random design regression. However, his result does notapply in our Gaussian design setting setting sin e he assumes that the density of the ovariates X i is lower bounded by a onstant µ .We have proved that the estimator e θ is adaptive to an unknown sparsity when the ovariatesare independent. The performan e of e θ exhibited in Theorem 3.4 do not depend on the ovari-an e matrix Σ . Hen e, the minimax rates of estimation on Θ[ k, p ] is smaller or equal to the order k log( p/k ) /n for any dependen e between the ovarian e. One may then wonder whether the mini-max rate of estimation over Θ[ k, p ] is not faster when the ovariates are orrelated. We are unableto derive the minimax rates for a general ovarian e matrix Σ . This is why we restri t ourselvesto parti ular examples of orrelation stru tures. Let us (cid:28)rst onsider a pathologi al situation: As-sume that X , . . . , X k are independent and that X k +1 , . . . , X p are all equal to X . Admittedly, the ovarian e matrix Σ is hen eforth non invertible. In the dis ussion, we mention that Theorems 3.1and 3.4 easily extend when Σ is non-invertible if we take into a ount that the estimators b θ m and b m are non-ne essarily uniquely de(cid:28)ned. We may derive from Lemma 2.1 that the estimator b θ { ,...,k } a hieves the rate k/n over θ [ k, p ]( n β ) . 
Conversely, the parametric rate $k/n$ is optimal. However, the estimator $\widetilde{\theta}$ defined with the collection $\mathcal{M}_p^k$ and the penalty (15) only achieves the rate $k\log(p/k)/n$. Hence, $\widetilde{\theta}$ is not minimax over $\Theta[k,p]$ for this particular covariance matrix, and the minimax rate is degenerate. This emergence of faster rates for correlated covariates also occurs for testing problems in the model (1), as stated in [36], Sect. 4.3. This is why we provide sufficient conditions on $\Sigma$ under which the minimax rate of estimation remains of the same order as in the independent case. In the following proposition, $\|\cdot\|$ refers to the canonical norm in $\mathbb{R}^p$.

Proposition 4.5. Let $\Psi$ denote the correlation matrix of the covariates $(X_i)_{1\leq i\leq p}$. Let $k$ be a positive number smaller than $p/2$ and let $\delta>0$. Assume that
$$(1-\delta)\|\theta\|^2 \leq \theta^*\Psi\theta \leq (1+\delta)\|\theta\|^2, \qquad (22)$$
for all $\theta\in\mathbb{R}^p$ with at most $2k$ non-zero components. Then the minimax rate of estimation over $\Theta[k,p](r)$ is lower bounded as follows:
$$\inf_{\widehat{\theta}}\sup_{\theta\in\Theta[k,p](r)}\mathbb{E}_\theta\left[l(\widehat{\theta},\theta)\right] \geq L(1-\delta)\,k\left[r^2 \wedge \sigma^2\,\frac{1+\log(p/k)}{(1+\delta)n}\right].$$

Assumption (22) corresponds to the $\delta$-Restricted Isometry Property introduced by Candès and Tao [14]. Under such a condition, the minimax rate of estimation is the same as in the independent case, up to a constant depending on $\delta$, and the estimator $\widetilde{\theta}$ defined in Corollary 4.4 is still approximately minimax over such sets $\Theta[k,p]$.

However, the $\delta$-Restricted Isometry Property is quite restrictive and does not seem to be necessary for the minimax rate of estimation to stay of order $k\log(p/k)/n$. Besides, in many situations this condition is not fulfilled. Assume for instance that the random vector $X$ is a Gaussian graphical model with respect to a given sparse graph. We expect the correlation between two covariates to be large if they are neighbors in the graph and small if they are far apart (w.r.t. the graph distance). This is why we derive lower bounds on the rate of estimation for correlation matrices often used to model stationary processes.

Proposition 4.6. Let $X_1,\dots,X_p$ form a stationary process on the one-dimensional torus. More precisely, the correlation between $X_i$ and $X_j$ is a function of $|i-j|_p$, where $|\cdot|_p$ refers to the toroidal distance defined by $|i-j|_p := |i-j| \wedge (p-|i-j|)$. Let $\Psi_1(\omega)$ and $\Psi_2(t)$ respectively refer to the correlation matrices of $X$ such that
$$\mathrm{corr}(X_i,X_j) := \exp(-\omega|i-j|_p) \text{ with } \omega>0, \quad\text{and}\quad \mathrm{corr}(X_i,X_j) := (1+|i-j|_p)^{-t} \text{ with } t>0.$$
Then the minimax rates of estimation are lower bounded as follows:
$$\inf_{\widehat{\theta}}\sup_{\theta\in\Theta[k,p]}\mathbb{E}_{\theta,\Psi_1(\omega)}\left[l(\widehat{\theta},\theta)\right] \geq L\,\frac{k\sigma^2}{n}\left[1+\log\left(\frac{\left\lfloor p\,\lceil\log(4k)/\omega\rceil^{-1}\right\rfloor}{k}\right)\right],$$
if $k$ is smaller than $p/\lceil\log(4k)/\omega\rceil$, and
$$\inf_{\widehat{\theta}}\sup_{\theta\in\Theta[k,p]}\mathbb{E}_{\theta,\Psi_2(t)}\left[l(\widehat{\theta},\theta)\right] \geq L\,\frac{k\sigma^2}{n}\left[1+\log\left(\frac{\left\lfloor p\,\lceil(4k)^{1/t}\rceil^{-1}\right\rfloor}{k}\right)\right],$$
if $k$ is smaller than $p/\lceil(4k)^{1/t}\rceil$.

In the proof of the proposition, we justify that the correlations considered are well-defined, at least when $p$ is odd. Let us mention that these correlation models are quite classical for modelling the correlation of time series (see e.g. [20]). If the range $\omega$ is larger than $1/p^{\gamma}$ or if the range $t$ is larger than $\gamma$ for some $\gamma<1$, the lower bounds are of order $\sigma^2\frac{k}{n}(1+\log(p/k))$. As a consequence, for any of these correlation models the minimax rate of estimation is of the same order as the minimax rate of estimation for independent covariates. This means that the estimator $\widetilde{\theta}$ defined in Corollary 4.4 is rate-optimal for these correlation matrices.
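The exponential correlation $\Psi_1(\omega)$ of Proposition 4.6 is easy to generate numerically. The sketch below is ours: it builds the matrix and checks positive semi-definiteness for an odd $p$ (the proposition guarantees validity at least in that case); the values $p=21$ and $\omega=0.5$ are arbitrary.

```r
## Correlation matrix Psi_1(omega) of Proposition 4.6 (our sketch):
## corr(X_i, X_j) = exp(-omega |i - j|_p), |.|_p the toroidal distance.
toroidal.dist <- function(i, j, p) pmin(abs(i - j), p - abs(i - j))
Psi1 <- function(p, omega) exp(-omega * outer(1:p, 1:p, toroidal.dist, p = p))

## numerical check of positive semi-definiteness for an odd p:
min(eigen(Psi1(21, omega = 0.5), symmetric = TRUE, only.values = TRUE)$values)
```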
In conclusion, the estimator $\widetilde{\theta}$ defined in Corollary 4.4 may not be adaptive to the covariance matrix $\Sigma$, but rather achieves the minimax rate over all covariance matrices $\Sigma$:
$$\sup_{\Sigma\succeq 0}\ \sup_{\theta\in\Theta[k,p](n^\beta)}\mathbb{E}_\theta\left[l(\widetilde{\theta},\theta)\right] \leq L(K,\beta,\eta)\inf_{\widehat{\theta}}\sup_{\Sigma\succeq 0}\ \sup_{\theta\in\Theta[k,p](n^\beta)}\mathbb{E}_\theta\left[l(\widehat{\theta},\theta)\right].$$
Nevertheless, the result makes sense if one considers GGMs, since the resulting covariance matrices are typically far from the independent case.

5 Numerical study

In this section, we carry out a small simulation study to evaluate the performance of our estimator $\widetilde{\theta}$. As pointed out earlier, an interesting feature of our criterion lies in its flexibility; however, we restrict ourselves here to the variable selection problem. Indeed, this allows us to assess the efficiency of our procedure against the Lasso [34] and the adaptive Lasso proposed by Zou [41]. Even if these two procedures assume that the conditional variance $\sigma^2$ is known, they give good results in practice, and the comparison with our method is of interest. The calculations are made with R, $p=20$, and $\sigma=1$. The number of observations $n$ equals 15, 20, or 30.

5.1 Simulation design

We perform two simulation experiments.
1. First simulation experiment: the covariance matrix $\Sigma_1$ is the identity matrix; this corresponds to the situation where the covariates are all independent. The vector $\theta_1$ has all its components equal to zero except the first three.
2. Second simulation experiment: let $A$ be the $p\times p$ matrix whose first three rows $a_1$, $a_2$, $a_3$ are chosen, following the simulation experiments of [4], so that the first two covariates are highly correlated, and whose row $a_j$ corresponds to the $j$-th canonical vector of $\mathbb{R}^p$ for $4\leq j\leq p$. We then take the covariance matrix $\Sigma_2=A^*A$ and the vector $\theta_2=(40,0,\dots,0)$.

For each sample, we estimate $\theta$ with our procedure, the Lasso, and the adaptive Lasso. For our procedure, we use the collection $\mathcal{M}_p^d$ with a maximal dimension $d$ that increases with $n$; the choice of smaller collections for $n=15$ and $n=20$ is due to Condition (16). We take the penalty (15) with $K=1.1$, $1.5$, and $2$. For the Lasso and adaptive Lasso procedures, we first normalize the covariates $(X_i)$. Here, $\sigma\sqrt{2\log p}$ would be a good choice for the parameter $\lambda$ of the Lasso; however, we do not have access to $\sigma^2$. Hence, we use an estimate of the variance, $\widehat{\operatorname{Var}}(Y)$, which is a (possibly inaccurate) upper bound of $\sigma^2$. This is why we choose the parameter $\lambda$ of the Lasso between $0.1\sqrt{2\log(p)\widehat{\operatorname{Var}}(Y)}$ and $\sqrt{2\log(p)\widehat{\operatorname{Var}}(Y)}$ by leave-one-out cross-validation. The number $0.1$ is rather arbitrary; in practice, the performance of the Lasso does not really depend on this number, as long as it is neither too small nor too close to one. For the adaptive Lasso procedure, the parameters $\gamma$ and $\lambda$ are also chosen by leave-one-out cross-validation: $\gamma$ can take three values ($0.5$, $1$, and $2$), and $\lambda$ varies between $0.1\sqrt{2\log(p)\widehat{\operatorname{Var}}(Y)}$ and $\sqrt{2\log(p)\widehat{\operatorname{Var}}(Y)}$.

We evaluate the risk ratio
$$\mathrm{ratio.Risk} := \frac{\mathbb{E}\left[l(\widehat{\theta},\theta)\right]}{\inf_{m\in\mathcal{M}_p^d}\mathbb{E}\left[l(\widehat{\theta}_m,\theta)\right]},$$
as well as the power and the FDR, on the basis of the simulations. Here, the power corresponds to the fraction of non-zero components of $\theta$ estimated as non-zero by the estimator $\widehat{\theta}$, while the FDR is the ratio of the false discoveries over the total number of discoveries:
$$\mathrm{Power} := \mathbb{E}\left[\frac{\mathrm{Card}\left(\{i:\,\theta_i\neq 0 \text{ and } \widehat{\theta}_i\neq 0\}\right)}{\mathrm{Card}\left(\{i:\,\theta_i\neq 0\}\right)}\right] \quad\text{and}\quad \mathrm{FDR} := \mathbb{E}\left[\frac{\mathrm{Card}\left(\{i:\,\theta_i= 0 \text{ and } \widehat{\theta}_i\neq 0\}\right)}{\mathrm{Card}\left(\{i:\,\widehat{\theta}_i\neq 0\}\right)}\right].$$
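For completeness, here is how the two criteria above can be computed from a single estimate in R (our sketch, with the convention that the FDR of an empty selection is zero):

```r
## Power and FDR of an estimate theta.hat of theta (our sketch).
power.fdr <- function(theta.hat, theta) {
  sel  <- theta.hat != 0   # discoveries
  true <- theta     != 0   # truly non-zero components
  c(power = sum(sel & true) / sum(true),
    fdr   = if (any(sel)) sum(sel & !true) / sum(sel) else 0)
}
```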
Table 1: Our procedure with $K=1.1$, $1.5$, and $2$, and the Lasso and adaptive Lasso procedures: estimates and confidence intervals of the risk ratio (ratio.Risk), the power, and the FDR, for each estimator and for $n=15$, $20$, and $30$, when $p=20$, $\Sigma=\Sigma_1$, and $\theta=\theta_1$.
03 0 . ± .
02 6 . ± . . ± .
02 0 . ± . K = 1 . . ± . . ± .
03 0 . ± .
02 5 . ± . . ± .
02 0 . ± . K = 2 5 . ± . . ± .
03 0 . ± .
02 5 . ± . . ± .
02 0 . ± . Lasso . ± . . ± .
01 0 . ± .
01 16 . ± . . ± .
01 0 . ± . A. Lasso . ± . . ± .
01 0 . ± .
02 20 . ± . . ± .
01 0 . ± . n = 30 Estimator ratio.Risk Power FDR K = 1 . . ± . . ± .
02 0 . ± . K = 1 . . ± . . ± .
01 0 . ± . K = 2 3 . ± . . ± .
01 0 . ± . Lasso . ± . . ± .
01 0 . ± . A. Lasso . ± . . ± .
Table 2: Our procedure with the three values of $K$, and the Lasso and adaptive Lasso procedures: estimation and confidence interval of the risk ratio (ratio.Risk), the power, and the FDR when $p=20$, $(\Sigma,\theta)=(\Sigma_2,\theta_2)$, and $n=15$, $20$, and $30$.

5.2 Results

The results of the first simulation experiment are given in Table 1. We observe that the five estimators perform more or less similarly, as expected from the theory. The results of the second simulation study are reported in Table 2. Clearly, the Lasso and adaptive Lasso procedures are not consistent in this situation, since their power is close to zero and their FDR is close to one. Consequently, their risk ratio is quite large, and the adaptive Lasso even seems unstable. In contrast, our method exhibits a large power and a reasonable FDR.

In the two studies, choosing a larger $K$ reduces the power of the estimator but also decreases the FDR. It seems that a value of $K$ close to one yields a good risk ratio, whereas $K=2$ gives a better control of the FDR. Contrary to the parameter $\lambda$ for the Lasso, we do not need an ad hoc method such as cross-validation to calibrate $K$. The second example is certainly quite pathological, but it illustrates that our estimator $\tilde\theta$ performs well even when the Lasso does not provide an accurate estimation. The good behavior of our method illustrates the strength of Theorem 3.4, which does not depend on the correlation of the explanatory variables.

6 Discussion and concluding remarks

Until now, we have assumed that the covariance matrix $\Sigma$ of the covariates is non-singular. If $\Sigma$ is singular, the estimators $\hat\theta_m$ and the model $\hat m$ are not necessarily uniquely defined. However, upon defining $\hat\theta_m$ as one of the minimizers of $\gamma_n(\theta')$ over $S_m$, one may readily extend the oracle inequalities stated in Theorems 3.1 and 3.4.

Let us recall the main features of our method. We have defined a model selection criterion that satisfies oracle inequalities regardless of the correlation between the covariates and regardless of the collection of models. Hence, the estimator $\tilde\theta$ achieves nice adaptive properties for ordered variable selection as well as for complete variable selection. Besides, one can easily combine this method with prior knowledge on the model by choosing a proper collection $\mathcal{M}$ or by modulating the penalty $\operatorname{pen}(\cdot)$. Moreover, we may easily calibrate the penalty even when $\sigma^2$ is unknown, whereas the Lasso-type procedures require a cross-validation strategy to choose the parameter $\lambda$. The compensation for these nice properties is a computational cost that depends linearly on the size of $\mathcal{M}$. Hence, the complete variable selection problem is NP-hard, which makes it intractable when $p$ becomes too large (i.e., more than 50). In contrast, our criterion applies for arbitrary $p$ when considering ordered variable selection, since the size of $\mathcal{M}$ is then linear in $n$. In situations where one has good prior knowledge of the true model, the collection $\mathcal{M}$ is not too large and our criterion is also fast to compute, even for large $p$.

For complete variable selection, Lasso-type procedures are computationally feasible even when $p$ is large, and they achieve oracle inequalities under assumptions on the covariance structure. However, there are both theoretical and practical problems with these estimators. On the one hand, they are known to perform poorly for some covariance structures. On the other hand, there is some room for improvement in the practical calibration of the Lasso, especially when $\sigma^2$ is unknown.
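The next paragraph outlines a hybrid strategy; as a rough illustration, the following R sketch implements one version of it, assuming the glmnet package (which the paper does not use) and reusing X, Y, n, and K from the earlier sketch. It lets the Lasso path propose candidate supports and then ranks them with the penalized criterion; it is a sketch under these assumptions, not the paper's procedure.

    ## Candidate subcollection from the Lasso regularization path
    library(glmnet)
    path <- glmnet(X, Y, intercept = FALSE)
    B <- as.matrix(path$beta)                 # p x (number of lambdas)
    supports <- unique(lapply(seq_len(ncol(B)), function(j) which(B[, j] != 0)))
    supports <- Filter(function(m) length(m) > 0 && length(m) < n / 2, supports)

    ## Rank the candidate supports with the penalized criterion
    crit <- sapply(supports, function(m) {
      res <- lm.fit(X[, m, drop = FALSE], Y)$residuals
      mean(res^2) * (1 + K * length(m) / (n - length(m)))
    })
    mhat <- supports[[which.min(crit)]]
    mhat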
In a future work, we would like to combine the strength of our method with these computationally fast algorithms. The problem at hand is to design a fast data-driven method that picks a subcollection $\widehat{\mathcal{M}}$ of reasonable size. Afterwards, one applies our procedure to $\widehat{\mathcal{M}}$ instead of $\mathcal{M}$. A direction that needs further investigation is taking for $\widehat{\mathcal{M}}$ all the subsets of the regularization path given by the Lasso.

7 Proofs

7.1 Some notations and probabilistic tools

First, let us define the random variable $\epsilon_m$ by
\[
Y = X\theta_m + \epsilon_m + \epsilon \quad\text{a.s.} \tag{23}
\]
By definition of $\theta_m$, $\epsilon_m$ follows a normal distribution and is independent of $\epsilon$ and of $X_m$. Hence, the variance of $\epsilon_m$ equals $l(\theta_m,\theta)$. The vectors $\boldsymbol{\epsilon}$ and $\boldsymbol{\epsilon}_m$ refer to the $n$ samples of $\epsilon$ and $\epsilon_m$. For any model $m$ and any vector $Z$ of size $n$, $\Pi_m^{\perp}Z$ stands for $Z-\Pi_m Z$. For any subset $m$ of $\{1,\dots,p\}$, $\Sigma_m$ denotes the covariance matrix of the vector $X_m$. Moreover, we define the row vector $Z_m := X_m\sqrt{\Sigma_m^{-1}}$ in order to deal with standard Gaussian vectors. Similarly to the matrix $\mathbf{X}_m$, the $n\times d_m$ matrix $\mathbf{Z}_m$ stands for the $n$ observations of $Z_m$. The notation $\langle\cdot,\cdot\rangle_n$ refers to the empirical inner product associated with the norm $\|\cdot\|_n$. Lastly, $\varphi_{\max}(A)$ denotes the largest eigenvalue (in absolute value) of a symmetric square matrix $A$.

We shall extensively use the explicit expression of $\hat\theta_m$:
\[
\mathbf{X}\hat\theta_m = \mathbf{X}_m(\mathbf{X}_m^*\mathbf{X}_m)^{-1}\mathbf{X}_m^*\mathbf{Y}. \tag{24}
\]
Let us state a first lemma that gives the expressions of $\gamma_n(\hat\theta_m)$, $\gamma(\hat\theta_m)$, and the loss $l(\hat\theta_m,\theta_m)$.

Lemma 7.1. For any model $m$ of size smaller than $n$,
\begin{align}
\gamma_n(\hat\theta_m) &= \|\Pi_m^{\perp}(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_m)\|_n^2, \tag{25}\\
\gamma(\hat\theta_m) &= \sigma^2 + l(\theta_m,\theta) + l(\hat\theta_m,\theta_m), \tag{26}\\
l(\hat\theta_m,\theta_m) &= (\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_m)^*\mathbf{Z}_m(\mathbf{Z}_m^*\mathbf{Z}_m)^{-2}\mathbf{Z}_m^*(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_m). \tag{27}
\end{align}
The proof is postponed to the Appendix.

We now introduce the main probabilistic tools used throughout the proofs. First, we need to bound the deviations of $\chi^2$ random variables.

Lemma 7.2. For any integer $d>0$ and any positive number $x$,
\[
\mathbb{P}\big(\chi^2(d)\le d-2\sqrt{dx}\big)\le\exp(-x), \qquad
\mathbb{P}\big(\chi^2(d)\ge d+2\sqrt{dx}+2x\big)\le\exp(-x).
\]
These bounds are classical and are shown by applying the Laplace method. We refer to Lemma 1 in [22] for more details. Moreover, we state a refined bound for the lower deviations of a $\chi^2$ distribution.

Lemma 7.3. For any integer $d>0$ and any positive number $x$,
\[
\mathbb{P}\left(\chi^2(d)\le d\left[\Big(1-\delta_d-2\sqrt{x/d}\Big)\vee 0\right]\right)\le\exp(-x),
\]
where
\[
\delta_d := \sqrt{2\pi/d}+\exp(-d/2). \tag{28}
\]
The proof is postponed to the Appendix. Finally, we shall bound the largest eigenvalue of standard Wishart matrices and standard inverse Wishart matrices. The following deviation inequality is taken from Theorem 2.13 in [17].

Lemma 7.4. Let $\mathbf{Z}^*\mathbf{Z}$ be a standard Wishart matrix of parameters $(n,d)$ with $n>d$. For any positive number $x$,
\[
\mathbb{P}\left(\varphi_{\max}\big[(\mathbf{Z}^*\mathbf{Z})^{-1}\big]\ge n^{-1}\Big[\Big(1-\sqrt{d/n}-x\Big)\vee 0\Big]^{-2}\right)\le\exp(-nx^2/2),
\]
and
\[
\mathbb{P}\left(\varphi_{\max}\big(\mathbf{Z}^*\mathbf{Z}\big)\ge n\Big(1+\sqrt{d/n}+x\Big)^{2}\right)\le\exp(-nx^2/2).
\]

7.2 Proof of Theorem 3.1

Proof of Theorem 3.1. Let $m$ be an arbitrary model in the collection $\mathcal{M}$. By definition of $\hat m$, we know that
\[
\gamma_n(\tilde\theta)\big[1+\operatorname{pen}(\hat m)\big]\le\gamma_n(\theta_m)\big[1+\operatorname{pen}(m)\big].
\]
Subtracting $\gamma(\theta)$ from both sides of this inequality yields
\[
l(\tilde\theta,\theta)\le l(\theta_m,\theta)+\gamma_n(\theta_m)\operatorname{pen}(m)+\bar\gamma_n(\theta_m)-\gamma_n(\tilde\theta)\operatorname{pen}(\hat m)-\bar\gamma_n(\tilde\theta), \tag{29}
\]
where $\bar\gamma_n(\cdot):=\gamma_n(\cdot)-\gamma(\cdot)$. The proof is based on the concentration of the term $-\bar\gamma_n(\tilde\theta)$.
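(As a quick aside before the main argument: the χ² deviation bounds of Lemma 7.2, which the proof invokes repeatedly, are easy to check numerically. The following R snippet — illustrative only, not part of the proof — compares the empirical tail probabilities with the bound exp(−x).)

    ## Monte Carlo sanity check of Lemma 7.2 for d = 50
    set.seed(2)
    d <- 50; N <- 1e5
    z <- rchisq(N, df = d)
    for (x in c(0.5, 1, 2, 4)) {
      up  <- mean(z >= d + 2 * sqrt(d * x) + 2 * x)  # upper deviation
      low <- mean(z <= d - 2 * sqrt(d * x))          # lower deviation
      cat(sprintf("x = %.1f: upper %.4f, lower %.4f, bound %.4f\n",
                  x, up, low, exp(-x)))
    }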
More precisely, we shall prove that with overwhelming probability this quantity is of the same order as the penalty term $\gamma_n(\tilde\theta)\operatorname{pen}(\hat m)$.

Let $\kappa_1$ and $\kappa_2$ be two positive numbers smaller than one that we shall fix later. For any model $m'\in\mathcal{M}$, we introduce the random variables $A_{m'}$ and $B_{m'}$ as
\begin{align}
A_{m'} &:= \kappa_1+1-\frac{\|\Pi_{m'}^{\perp}\boldsymbol{\epsilon}_{m'}\|_n^2}{l(\theta_{m'},\theta)}+\kappa_2\,n\varphi_{\max}\big[(\mathbf{Z}_{m'}^*\mathbf{Z}_{m'})^{-1}\big]\frac{\|\Pi_{m'}(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_{m'})\|_n^2}{l(\theta_{m'},\theta)+\sigma^2}-K\frac{d_{m'}}{n-d_{m'}}\frac{\|\Pi_{m'}^{\perp}(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_{m'})\|_n^2}{l(\theta_{m'},\theta)+\sigma^2}, \tag{30}\\
B_{m'} &:= \kappa_1^{-1}\frac{\langle\Pi_{m'}^{\perp}\boldsymbol{\epsilon},\ \Pi_{m'}^{\perp}\boldsymbol{\epsilon}_{m'}\rangle_n^2}{\sigma^2\,l(\theta_{m'},\theta)}+\frac{\|\Pi_{m'}\boldsymbol{\epsilon}\|_n^2}{\sigma^2}+\kappa_2\,n\varphi_{\max}\big[(\mathbf{Z}_{m'}^*\mathbf{Z}_{m'})^{-1}\big]\frac{\|\Pi_{m'}(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_{m'})\|_n^2}{l(\theta_{m'},\theta)+\sigma^2}-K\frac{d_{m'}}{n-d_{m'}}\frac{\|\Pi_{m'}^{\perp}(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_{m'})\|_n^2}{l(\theta_{m'},\theta)+\sigma^2}. \tag{31}
\end{align}
We recall that the notations $\boldsymbol{\epsilon}_m$, $\mathbf{Z}_m$, $\langle\cdot,\cdot\rangle_n$, and $\varphi_{\max}(\cdot)$ are defined in Section 7.1. We may upper bound the expression $-\bar\gamma_n(\tilde\theta)-\gamma_n(\tilde\theta)\operatorname{pen}(\hat m)$ in terms of $A_{\hat m}$ and $B_{\hat m}$ as follows.

Lemma 7.5. Almost surely, it holds that
\[
-\bar\gamma_n(\tilde\theta)-\gamma_n(\tilde\theta)\operatorname{pen}(\hat m)-\sigma^2+\|\boldsymbol{\epsilon}\|_n^2 \le l(\tilde\theta,\theta)\big[A_{\hat m}\vee(1-\kappa_2)\big]+\sigma^2 B_{\hat m}. \tag{32}
\]
Let us set the constants
\[
\kappa_1 := \frac14 \qquad\text{and}\qquad \kappa_2 := \left[\frac{K-1}{8}\big(1-\sqrt{\eta}\big)^{2}\right]\wedge\frac12. \tag{33}
\]
We do not claim that this choice is optimal, but we are not really concerned about the constants for this result. The core of this proof consists in showing that, with overwhelming probability, the variable $A_{\hat m}$ is smaller than $7/8$ and $B_{\hat m}$ is smaller than a constant over $n$.

Lemma 7.6. The event $\Omega_1$ defined as
\[
\Omega_1 := \Big\{A_{\hat m}\le\tfrac78\Big\}\bigcap\Big\{\kappa_2\,n\varphi_{\max}\big[(\mathbf{Z}_{\hat m}^*\mathbf{Z}_{\hat m})^{-1}\big]\le\tfrac{K-1}{2}\Big\}
\]
satisfies $\mathbb{P}(\Omega_1^c)\le L\operatorname{Card}(\mathcal{M})\exp[-nL'(K,\eta)]$, where $L'(K,\eta)$ is positive.

Lemma 7.7. There exists an event $\Omega_2$ of probability larger than $1-\exp(-nL)$ with $L>0$ such that
\[
\mathbb{E}\big[B_{\hat m}\mathbf{1}_{\Omega_1\cap\Omega_2}\big]\le\frac{L(K,\eta,\alpha,\beta)}{n}.
\]
Gathering the upper bound (29) and Lemmas 7.5, 7.6, and 7.7, we conclude that
\[
\mathbb{E}\Big[l(\tilde\theta,\theta)\mathbf{1}_{\Omega_1\cap\Omega_2}\Big(\kappa_2\wedge\tfrac18\Big)\Big]\le l(\theta_m,\theta)+\mathbb{E}\big[\gamma_n(\theta_m)\operatorname{pen}(m)\big]+\sigma^2\frac{L(K,\eta,\alpha,\beta)}{n}+\mathbb{E}\big[\mathbf{1}_{\Omega_1\cap\Omega_2}\big(\bar\gamma_n(\theta_m)+\sigma^2-\|\boldsymbol{\epsilon}\|_n^2\big)\big].
\]
As the expectation of the random variable $\bar\gamma_n(\theta_m)+\sigma^2-\|\boldsymbol{\epsilon}\|_n^2$ is zero, it holds that
\begin{align*}
\mathbb{E}\big[\mathbf{1}_{\Omega_1\cap\Omega_2}\big(\bar\gamma_n(\theta_m)+\sigma^2-\|\boldsymbol{\epsilon}\|_n^2\big)\big] &= -\mathbb{E}\big[\mathbf{1}_{\Omega_1^c\cup\Omega_2^c}\big(\bar\gamma_n(\theta_m)+\sigma^2-\|\boldsymbol{\epsilon}\|_n^2\big)\big]\\
&\le \sqrt{\mathbb{P}(\Omega_1^c)+\mathbb{P}(\Omega_2^c)}\left[\sqrt{\mathbb{E}\big[\|\boldsymbol{\epsilon}_m\|_n^2-l(\theta_m,\theta)\big]^2}+2\sqrt{\mathbb{E}\big[\langle\boldsymbol{\epsilon},\boldsymbol{\epsilon}_m\rangle_n^2\big]}\right]\\
&\le \sqrt{\mathbb{P}(\Omega_1^c)+\mathbb{P}(\Omega_2^c)}\ \sqrt{\frac{L}{n}}\Big[l(\theta_m,\theta)+\sigma\sqrt{l(\theta_m,\theta)}\Big].
\end{align*}
The probabilities $\mathbb{P}(\Omega_1^c)$ and $\mathbb{P}(\Omega_2^c)$ converge to $0$ at an exponential rate with respect to $n$. Hence, by taking the infimum over all the models $m\in\mathcal{M}$, we obtain
\[
\mathbb{E}\big[l(\tilde\theta,\theta)\mathbf{1}_{\Omega_1\cap\Omega_2}\big]\le L(K,\eta)\inf_{m\in\mathcal{M}}\big[l(\theta_m,\theta)+\big(\sigma^2+l(\theta_m,\theta)\big)\operatorname{pen}(m)\big]+\frac{L(K,\eta,\alpha,\beta)\,\sigma^2}{n}+L(K,\eta)\sqrt{\frac{\operatorname{Card}(\mathcal{M})}{n}}\big[\sigma^2+l(0_p,\theta)\big]\exp[-nL(K,\eta)], \tag{34}
\]
with $L(K,\eta)>0$. In order to conclude, we need to control the loss of the estimator $\tilde\theta$ on the event of small probability $\Omega_1^c\cup\Omega_2^c$. Thanks to the following lemma, we may upper bound the $r$-th risk of the estimators $\hat\theta_m$.

Proposition 7.8. For any model $m$ and any integer $r\ge 2$ such that $n-d_m-2r+1>0$,
\[
\mathbb{E}\big[l(\hat\theta_m,\theta_m)^r\big]^{1/r}\le\frac{L\,r\,d_m}{n}\big[\sigma^2+l(\theta_m,\theta)\big].
\]
The proof is postponed to Section 7.4. We derive from this bound a strong control on $\mathbb{E}\big[l(\tilde\theta,\theta)\mathbf{1}_{\Omega_1^c\cup\Omega_2^c}\big]$.

Lemma 7.9.
\[
\mathbb{E}\big[l(\tilde\theta,\theta)\mathbf{1}_{\Omega_1^c\cup\Omega_2^c}\big]\le L(K,\eta)\,n\operatorname{Card}(\mathcal{M})\operatorname{Var}(Y)\exp[-nL'(K,\eta)], \tag{35}
\]
where $L'(K,\eta)$ is positive.

By Assumptions $(H_{Pol})$ and $(H_\eta)$, the cardinality of the collection $\mathcal{M}$ is smaller than $\alpha n^{\beta}$. We gather the upper bounds (34) and (35), and so we conclude.

Proof of Lemma 7.5. Thanks to Lemma 7.1, we decompose $\bar\gamma_n(\tilde\theta)$ as
\[
\bar\gamma_n(\tilde\theta)=\|\Pi_{\hat m}^{\perp}(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_{\hat m})\|_n^2-\sigma^2-l(\theta_{\hat m},\theta)-(1-\kappa_2)\,l(\tilde\theta,\theta_{\hat m})-\kappa_2\,(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_{\hat m})^*\mathbf{Z}_{\hat m}(\mathbf{Z}_{\hat m}^*\mathbf{Z}_{\hat m})^{-2}\mathbf{Z}_{\hat m}^*(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_{\hat m}).
\]
Since $2ab\le\kappa a^2+\kappa^{-1}b^2$ for any $\kappa>0$, it holds that
\begin{align*}
-\|\Pi_{\hat m}^{\perp}(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_{\hat m})\|_n^2+\|\boldsymbol{\epsilon}\|_n^2 &= \|\Pi_{\hat m}\boldsymbol{\epsilon}\|_n^2-\|\Pi_{\hat m}^{\perp}\boldsymbol{\epsilon}_{\hat m}\|_n^2-2\langle\Pi_{\hat m}^{\perp}\boldsymbol{\epsilon},\ \Pi_{\hat m}^{\perp}\boldsymbol{\epsilon}_{\hat m}\rangle_n\\
&\le \sigma^2\left[\kappa_1^{-1}\frac{\langle\Pi_{\hat m}^{\perp}\boldsymbol{\epsilon},\Pi_{\hat m}^{\perp}\boldsymbol{\epsilon}_{\hat m}\rangle_n^2}{\sigma^2\,l(\theta_{\hat m},\theta)}+\frac{\|\Pi_{\hat m}\boldsymbol{\epsilon}\|_n^2}{\sigma^2}\right]+l(\theta_{\hat m},\theta)\left[\kappa_1-\frac{\|\Pi_{\hat m}^{\perp}\boldsymbol{\epsilon}_{\hat m}\|_n^2}{l(\theta_{\hat m},\theta)}\right].
\end{align*}
Besides, we upper bound expression (27) for $l(\tilde\theta,\theta_{\hat m})$ using the largest eigenvalue of $(\mathbf{Z}_{\hat m}^*\mathbf{Z}_{\hat m})^{-1}$:
\[
(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_{\hat m})^*\mathbf{Z}_{\hat m}(\mathbf{Z}_{\hat m}^*\mathbf{Z}_{\hat m})^{-2}\mathbf{Z}_{\hat m}^*(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_{\hat m})\le\varphi_{\max}\big[(\mathbf{Z}_{\hat m}^*\mathbf{Z}_{\hat m})^{-1}\big]\,(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_{\hat m})^*\mathbf{Z}_{\hat m}(\mathbf{Z}_{\hat m}^*\mathbf{Z}_{\hat m})^{-1}\mathbf{Z}_{\hat m}^*(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_{\hat m})\le\big[\sigma^2+l(\theta_{\hat m},\theta)\big]\,n\varphi_{\max}\big[(\mathbf{Z}_{\hat m}^*\mathbf{Z}_{\hat m})^{-1}\big]\frac{\|\Pi_{\hat m}(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_{\hat m})\|_n^2}{\sigma^2+l(\theta_{\hat m},\theta)}. \tag{36}
\]
Thanks to Assumption (8), we upper bound the penalty term as follows:
\[
-\gamma_n(\tilde\theta)\operatorname{pen}(\hat m)\le-\big[\sigma^2+l(\theta_{\hat m},\theta)\big]\frac{\|\Pi_{\hat m}^{\perp}(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_{\hat m})\|_n^2}{\sigma^2+l(\theta_{\hat m},\theta)}\,K\frac{d_{\hat m}}{n-d_{\hat m}}.
\]
By gathering the four last identities, we get
\[
-\bar\gamma_n(\tilde\theta)-\gamma_n(\tilde\theta)\operatorname{pen}(\hat m)-\sigma^2+\|\boldsymbol{\epsilon}\|_n^2\le l(\tilde\theta,\theta)\big[A_{\hat m}\vee(1-\kappa_2)\big]+\sigma^2 B_{\hat m},
\]
since $l(\tilde\theta,\theta)$ decomposes into the sum $l(\tilde\theta,\theta_{\hat m})+l(\theta_{\hat m},\theta)$.

Proof of Lemma 7.6. We recall that for any model $m\in\mathcal{M}$,
\[
A_m := \frac54-\frac{\|\Pi_m^{\perp}\boldsymbol{\epsilon}_m\|_n^2}{l(\theta_m,\theta)}+\kappa_2\,n\varphi_{\max}\big[(\mathbf{Z}_m^*\mathbf{Z}_m)^{-1}\big]\frac{\|\Pi_m(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_m)\|_n^2}{l(\theta_m,\theta)+\sigma^2}-K\frac{d_m}{n-d_m}\frac{\|\Pi_m^{\perp}(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_m)\|_n^2}{l(\theta_m,\theta)+\sigma^2}.
\]
In order to control the variable $A_{\hat m}$, we shall simultaneously bound the deviations of the four random variables involved in any variable $A_m$.

Since $\mathbf{X}_m$ is independent of $\boldsymbol{\epsilon}_m/\sqrt{l(\theta_m,\theta)}$, and since $\boldsymbol{\epsilon}_m/\sqrt{l(\theta_m,\theta)}$ is a standard Gaussian vector of size $n$, the random variable $n\|\Pi_m^{\perp}\boldsymbol{\epsilon}_m\|_n^2/l(\theta_m,\theta)$ follows a $\chi^2$ distribution with $n-d_m$ degrees of freedom conditionally on $\mathbf{X}_m$. As this distribution does not depend on $\mathbf{X}_m$, $n\|\Pi_m^{\perp}\boldsymbol{\epsilon}_m\|_n^2/l(\theta_m,\theta)$ follows a $\chi^2$ distribution with $n-d_m$ degrees of freedom. Similarly, the random variables $n\|\Pi_m(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_m)\|_n^2/[l(\theta_m,\theta)+\sigma^2]$ and $n\|\Pi_m^{\perp}(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_m)\|_n^2/[l(\theta_m,\theta)+\sigma^2]$ follow $\chi^2$ distributions with respectively $d_m$ and $n-d_m$ degrees of freedom. Besides, the matrix $\mathbf{Z}_m^*\mathbf{Z}_m$ follows a standard Wishart distribution with parameters $(n,d_m)$.

Let $x$ be a positive number that we shall fix later. By Lemmas 7.2 and 7.4, there exists an event $\Omega'$ of large probability, $\mathbb{P}(\Omega'^c)\le L\exp(-nx)\operatorname{Card}(\mathcal{M})$, such that conditionally on $\Omega'$,
\begin{align}
\frac{\|\Pi_m^{\perp}\boldsymbol{\epsilon}_m\|_n^2}{l(\theta_m,\theta)} &\ge \frac{n-d_m}{n}-2\sqrt{\frac{(n-d_m)x}{n^2}}\,\sqrt{n}, \tag{37}\\
\frac{\|\Pi_m(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_m)\|_n^2}{\sigma^2+l(\theta_m,\theta)} &\le \frac{d_m}{n}+2\sqrt{\frac{d_m x}{n}}+2x, \tag{38}\\
\frac{\|\Pi_m^{\perp}(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_m)\|_n^2}{\sigma^2+l(\theta_m,\theta)} &\ge \frac{n-d_m}{n}-2\sqrt{\frac{(n-d_m)x}{n}}, \tag{39}\\
\varphi_{\max}\big[(\mathbf{Z}_m^*\mathbf{Z}_m)^{-1}\big] &\le n^{-1}\left[\Big(1-\sqrt{\frac{d_m}{n}}-\sqrt{2x}\Big)\vee 0\right]^{-2}, \tag{40}
\end{align}
for every model $m\in\mathcal{M}$. Let us prove that for a suitable choice of the number $x$, $A_{\hat m}\mathbf{1}_{\Omega'}$ is smaller than $7/8$. First, we constrain $n\kappa_2\varphi_{\max}\big[(\mathbf{Z}_{\hat m}^*\mathbf{Z}_{\hat m})^{-1}\big]$ to be smaller than $(K-1)/2$ on the event $\Omega'$.
By (40), it holds that
\[
n\varphi_{\max}\big[(\mathbf{Z}_{\hat m}^*\mathbf{Z}_{\hat m})^{-1}\big]\le\Big[\big(1-\sqrt{\eta}-\sqrt{2x}\big)\vee 0\Big]^{-2}.
\]
Constraining $x$ to be smaller than $(1-\sqrt{\eta})^2/8$ ensures that the largest eigenvalue of $(\mathbf{Z}_{\hat m}^*\mathbf{Z}_{\hat m})^{-1}$ satisfies
\[
n\varphi_{\max}\big[(\mathbf{Z}_{\hat m}^*\mathbf{Z}_{\hat m})^{-1}\big]\le 4\big(1-\sqrt{\eta}\big)^{-2}.
\]
By definition (33) of $\kappa_2$, it follows that $n\kappa_2\varphi_{\max}\big[(\mathbf{Z}_{\hat m}^*\mathbf{Z}_{\hat m})^{-1}\big]\le(K-1)/2$. Applying the inequality $2ab\le\delta a^2+\delta^{-1}b^2$ to the bounds (37), (38), and (39) yields
\begin{align*}
-\frac{\|\Pi_{\hat m}^{\perp}\boldsymbol{\epsilon}_{\hat m}\|_n^2}{l(\theta_{\hat m},\theta)} &\le -\frac12+\frac{d_{\hat m}}{n}+2x,\\
\kappa_2\,n\varphi_{\max}\big[(\mathbf{Z}_{\hat m}^*\mathbf{Z}_{\hat m})^{-1}\big]\frac{\|\Pi_{\hat m}(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_{\hat m})\|_n^2}{\sigma^2+l(\theta_{\hat m},\theta)} &\le \frac{K-1}{2}\left[\frac{2d_{\hat m}}{n}+3x\right],\\
-K\frac{d_{\hat m}}{n-d_{\hat m}}\frac{\|\Pi_{\hat m}^{\perp}(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_{\hat m})\|_n^2}{\sigma^2+l(\theta_{\hat m},\theta)} &\le -K\frac{d_{\hat m}}{n}+x\,\frac{2K\eta}{1-\eta}.
\end{align*}
Gathering these three inequalities, we get
\[
A_{\hat m}\mathbf{1}_{\Omega'}\le\frac34+x\left[2+\frac{3(K-1)}{2}+\frac{2K\eta}{1-\eta}\right].
\]
If we set
\[
x := \left[8\Big(2+\frac{3(K-1)}{2}+\frac{2K\eta}{1-\eta}\Big)\right]^{-1}\wedge\frac{(1-\sqrt{\eta})^2}{8},
\]
then $A_{\hat m}\mathbf{1}_{\Omega'}$ is smaller than $7/8$, and the result follows.

Proof of Lemma 7.7. We shall simultaneously bound the deviations of the random variables involved in the definition of $B_m$ for all models $m\in\mathcal{M}$. Let us first define the random variable $E_m$ as
\[
E_m := \kappa_1^{-1}\frac{\langle\Pi_m^{\perp}\boldsymbol{\epsilon},\ \Pi_m^{\perp}\boldsymbol{\epsilon}_m\rangle_n^2}{\sigma^2\,l(\theta_m,\theta)}+\frac{\|\Pi_m\boldsymbol{\epsilon}\|_n^2}{\sigma^2}.
\]
Factorizing by the norm of $\boldsymbol{\epsilon}$, we get
\[
E_m\le\kappa_1^{-1}\frac{\|\boldsymbol{\epsilon}\|_n^2}{\sigma^2}\,\frac{\Big\langle\frac{\Pi_m^{\perp}\boldsymbol{\epsilon}}{\|\Pi_m^{\perp}\boldsymbol{\epsilon}\|_n},\ \Pi_m^{\perp}\boldsymbol{\epsilon}_m\Big\rangle_n^2}{l(\theta_m,\theta)}+\frac{\|\Pi_m\boldsymbol{\epsilon}\|_n^2}{\sigma^2}. \tag{41}
\]
The variable $n\|\boldsymbol{\epsilon}\|_n^2/\sigma^2$ follows a $\chi^2$ distribution with $n$ degrees of freedom. By Lemma 7.2, there exists an event $\Omega_2$ of probability larger than $1-\exp(-n/16)$ such that $\|\boldsymbol{\epsilon}\|_n^2/\sigma^2$ is smaller than $2$. As $\kappa_1^{-1}=4$, we obtain
\[
E_m\mathbf{1}_{\Omega_2}\le 8\,\frac{\Big\langle\frac{\Pi_m^{\perp}\boldsymbol{\epsilon}}{\|\Pi_m^{\perp}\boldsymbol{\epsilon}\|_n},\ \Pi_m^{\perp}\boldsymbol{\epsilon}_m\Big\rangle_n^2}{l(\theta_m,\theta)}+\frac{\|\Pi_m\boldsymbol{\epsilon}\|_n^2}{\sigma^2}.
\]
Since $\boldsymbol{\epsilon}$, $\boldsymbol{\epsilon}_m$, and $\mathbf{X}_m$ are independent, it holds that, conditionally on $\mathbf{X}_m$ and $\boldsymbol{\epsilon}$,
\[
n\,\frac{\Big\langle\frac{\Pi_m^{\perp}\boldsymbol{\epsilon}}{\|\Pi_m^{\perp}\boldsymbol{\epsilon}\|_n},\ \Pi_m^{\perp}\boldsymbol{\epsilon}_m\Big\rangle_n^2}{l(\theta_m,\theta)}\sim\chi^2(1).
\]
Since this distribution depends neither on $\mathbf{X}_m$ nor on $\boldsymbol{\epsilon}$, this random variable follows a $\chi^2$ distribution with one degree of freedom. Besides, it is independent of the variable $\|\Pi_m\boldsymbol{\epsilon}\|_n^2/\sigma^2$. Arguing as previously, we work out the distribution $n\|\Pi_m\boldsymbol{\epsilon}\|_n^2/\sigma^2\sim\chi^2(d_m)$. Consequently, the variable $E_m\mathbf{1}_{\Omega_2}$ is upper bounded by a random variable that follows the distribution of
\[
\frac{8}{n}T_1+\frac1n T_2,
\]
where $T_1$ and $T_2$ are two independent $\chi^2$ random variables with respectively $1$ and $d_m$ degrees of freedom. Moreover, the random variables $n\|\Pi_m(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_m)\|_n^2/[l(\theta_m,\theta)+\sigma^2]$ and $n\|\Pi_m^{\perp}(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_m)\|_n^2/[l(\theta_m,\theta)+\sigma^2]$ respectively follow $\chi^2$ distributions with $d_m$ and $n-d_m$ degrees of freedom.

Let us bound the deviations of the random variables $E_m\mathbf{1}_{\Omega_2}$, $\frac{\|\Pi_m(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_m)\|_n^2}{l(\theta_m,\theta)+\sigma^2}$, and $\frac{\|\Pi_m^{\perp}(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_m)\|_n^2}{l(\theta_m,\theta)+\sigma^2}$ for any model $m\in\mathcal{M}$. We apply Lemma 1 in [22] for $E_m\mathbf{1}_{\Omega_2}$ and Lemma 7.2 for the two remaining random variables. Hence, for any $x>0$, there exists an event $F(x)$ of large probability,
\[
\mathbb{P}[F(x)^c]\le e^{-x}\sum_{m\in\mathcal{M}}\Big(e^{-\xi_1 d_m}+e^{-\xi_2 d_m}+e^{-\xi_3 d_m}\Big)\le e^{-x}\left[\alpha+\sum_{d=1}^{+\infty}d^{\beta}\big(e^{-\xi_1 d}+e^{-\xi_2 d}+e^{-\xi_3 d}\big)\right],
\]
such that conditionally on $F(x)$, for all models $m\in\mathcal{M}$,
\begin{align*}
E_m\mathbf{1}_{\Omega_2} &\le \frac{d_m+8}{n}+\frac2n\sqrt{[d_m+8](\xi_1 d_m+x)}+16\,\frac{\xi_1 d_m+x}{n},\\
\frac{\|\Pi_m(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_m)\|_n^2}{l(\theta_m,\theta)+\sigma^2} &\le \frac1n\Big(d_m+2\sqrt{d_m[d_m\xi_2+x]}+2(d_m\xi_2+x)\Big),\\
-\frac{Kd_m}{n-d_m}\frac{\|\Pi_m^{\perp}(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_m)\|_n^2}{\sigma^2+l(\theta_m,\theta)} &\le -\frac{Kd_m}{n(n-d_m)}\Big(n-d_m-2\sqrt{(n-d_m)(\xi_3 d_m+x)}\Big).
\end{align*}
We shall fix the positive constants $\xi_1$, $\xi_2$, and $\xi_3$ later. Let us apply extensively the inequality $2ab\le\tau a^2+\tau^{-1}b^2$. Hence, conditionally on $F(x)$, the model $\hat m$ satisfies
\begin{align*}
E_{\hat m}\mathbf{1}_{\Omega_2} &\le \frac{d_{\hat m}}{n}\big[1+2\sqrt{\xi_1}+17\xi_1+\tau_1\big]+\frac{x}{n}\big[17+\tau_1^{-1}\big]+\frac{72}{n},\\
\frac{\|\Pi_{\hat m}(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_{\hat m})\|_n^2}{l(\theta_{\hat m},\theta)+\sigma^2} &\le \frac{d_{\hat m}}{n}\big[1+2\sqrt{\xi_2}+2\xi_2+\tau_2\big]+\frac{x}{n}\big[2+\tau_2^{-1}\big],\\
-\frac{Kd_{\hat m}}{n-d_{\hat m}}\frac{\|\Pi_{\hat m}^{\perp}(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_{\hat m})\|_n^2}{\sigma^2+l(\theta_{\hat m},\theta)} &\le -K\frac{d_{\hat m}}{n}\left[1-2\sqrt{\frac{\xi_3 d_{\hat m}}{n-d_{\hat m}}}-\tau_3\right]+\frac{Kx}{n}\,\tau_3^{-1}\frac{d_{\hat m}}{n-d_{\hat m}}.
\end{align*}
By Lemma 7.6, we know that conditionally on $\Omega_1$, $\kappa_2\,n\varphi_{\max}\big[(\mathbf{Z}_{\hat m}^*\mathbf{Z}_{\hat m})^{-1}\big]$ is smaller than $(K-1)/2$. By Assumption $(H_\eta)$, the ratio $\frac{d_{\hat m}}{n-d_{\hat m}}$ is smaller than $\frac{\eta}{1-\eta}$. Gathering these inequalities, we upper bound $B_{\hat m}$ on the event $\Omega_1\cap\Omega_2\cap F(x)$:
\[
B_{\hat m}\le\frac{d_{\hat m}}{n}\,U_1+\frac{x}{n}\,U_2+\frac{72}{n},
\]
where $U_1$ and $U_2$ are defined as
\begin{align*}
U_1 &:= 1+2\sqrt{\xi_1}+17\xi_1+\tau_1+\frac{K-1}{2}\big[1+2\sqrt{\xi_2}+2\xi_2+\tau_2\big]-K\left[1-2\sqrt{\xi_3}\sqrt{\frac{\eta}{1-\eta}}-\tau_3\right],\\
U_2 &:= 17+\tau_1^{-1}+\frac{K-1}{2}\big[2+\tau_2^{-1}\big]+K\tau_3^{-1}\frac{\eta}{1-\eta}.
\end{align*}
Looking closely at $U_1$, one observes that it is the sum of the quantity $1+\frac{K-1}{2}-K=-\frac{K-1}{2}$ and an expression that we can make arbitrarily small by choosing the positive constants $\xi_1$, $\xi_2$, $\xi_3$, $\tau_1$, $\tau_2$, and $\tau_3$ small enough. Consequently, there exists a suitable choice of these constants, only depending on $K$ and $\eta$, that constrains the quantity $U_1$ to be non-positive. It follows that for any $x>0$, with probability larger than $1-e^{-x}L(K,\eta,\alpha,\beta)$,
\[
B_{\hat m}\mathbf{1}_{\Omega_1\cap\Omega_2}\le\frac{x}{n}\,L(K,\eta)+\frac{L'(K,\eta)}{n}.
\]
Integrating this upper bound over $x>0$, we conclude that $\mathbb{E}[B_{\hat m}\mathbf{1}_{\Omega_1\cap\Omega_2}]\le L(K,\eta,\alpha,\beta)/n$.

Proof of Lemma 7.9. We perform a very crude upper bound by controlling the sum of the risks of every estimator $\hat\theta_m$:
\[
\mathbb{E}\big[l(\tilde\theta,\theta)\mathbf{1}_{\Omega_1^c\cup\Omega_2^c}\big]\le\sqrt{\mathbb{P}(\Omega_1^c)+\mathbb{P}(\Omega_2^c)}\ \sqrt{\sum_{m\in\mathcal{M}}\mathbb{E}\big[l(\hat\theta_m,\theta)^2\big]}.
\]
As $l(\hat\theta_m,\theta)=l(\theta_m,\theta)+l(\hat\theta_m,\theta_m)$ for any model $m\in\mathcal{M}$, it follows that
\[
\mathbb{E}\big[l(\hat\theta_m,\theta)^2\big]\le 2\Big\{l(\theta_m,\theta)^2+\mathbb{E}\big[l(\hat\theta_m,\theta_m)^2\big]\Big\}.
\]
For any model $m\in\mathcal{M}$, it holds that $n-d_m-3\ge(1-\eta)n-3$, which is positive by Assumption $(H_\eta)$. Hence, we may apply Proposition 7.8 with $r=2$ to all models $m\in\mathcal{M}$:
\[
\mathbb{E}\big[l(\hat\theta_m,\theta_m)^2\big]\le L\Big[\frac{d_m}{n}\big(\sigma^2+l(\theta_m,\theta)\big)\Big]^2\le L\operatorname{Var}(Y)^2,
\]
since for any model $m$, $d_m\le n$ and $\sigma^2+l(\theta_m,\theta)\le\operatorname{Var}(Y)$. By summing this bound over all models $m\in\mathcal{M}$ and applying Lemmas 7.6 and 7.7, we get
\[
\mathbb{E}\big[l(\tilde\theta,\theta)\mathbf{1}_{\Omega_1^c\cup\Omega_2^c}\big]\le n\operatorname{Card}(\mathcal{M})\,L(K,\eta)\operatorname{Var}(Y)\exp[-nL'(K,\eta)],
\]
where $L'(K,\eta)$ is positive.

7.3 Proof of Theorem 3.4 and Proposition 3.5

Proof of Theorem 3.4. This proof follows the same approach as the one of Theorem 3.1. We shall only emphasize the differences with the previous proof. The bound (29) still holds. Let us respectively define the three constants $\kappa_1$, $\kappa_2$, and $\nu(K)$ as
\[
\kappa_1 := \sqrt{\frac{2}{K+2}}\,\big[1-\sqrt{\eta}-\nu(K)\big],\qquad
\kappa_2 := \frac{K-1}{2}\big[1-\sqrt{\eta}\big]\big[1-\sqrt{\eta}-\nu(K)\big]^2\wedge\frac12,\qquad
\nu(K) := \Big(\frac{2}{K+2}\Big)^{1/4}\wedge\left[1-\Big(\frac{2}{K+2}\Big)^{1/4}\right].
\]
We also introduce the random variables $A_{m'}$ and $B_{m'}$ for any model $m'\in\mathcal{M}$:
\begin{align*}
A_{m'} &:= \kappa_1+1-\frac{\|\Pi_{m'}^{\perp}\boldsymbol{\epsilon}_{m'}\|_n^2}{l(\theta_{m'},\theta)}+\kappa_2\,n\varphi_{\max}\big[(\mathbf{Z}_{m'}^*\mathbf{Z}_{m'})^{-1}\big]\frac{\|\Pi_{m'}(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_{m'})\|_n^2}{l(\theta_{m'},\theta)+\sigma^2}-K\big[1+\sqrt{2H(d_{m'})}\big]^2\frac{d_{m'}}{n-d_{m'}}\frac{\|\Pi_{m'}^{\perp}(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_{m'})\|_n^2}{l(\theta_{m'},\theta)+\sigma^2},\\
B_{m'} &:= \kappa_1^{-1}\frac{\langle\Pi_{m'}^{\perp}\boldsymbol{\epsilon},\ \Pi_{m'}^{\perp}\boldsymbol{\epsilon}_{m'}\rangle_n^2}{\sigma^2\,l(\theta_{m'},\theta)}+\frac{\|\Pi_{m'}\boldsymbol{\epsilon}\|_n^2}{\sigma^2}+\kappa_2\,n\varphi_{\max}\big[(\mathbf{Z}_{m'}^*\mathbf{Z}_{m'})^{-1}\big]\frac{\|\Pi_{m'}(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_{m'})\|_n^2}{l(\theta_{m'},\theta)+\sigma^2}-K\frac{d_{m'}}{n-d_{m'}}\big[1+\sqrt{2H(d_{m'})}\big]^2\frac{\|\Pi_{m'}^{\perp}(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_{m'})\|_n^2}{l(\theta_{m'},\theta)+\sigma^2}.
\end{align*}
The bound given in Lemma 7.5 clearly extends to
\[
-\bar\gamma_n(\tilde\theta)-\gamma_n(\tilde\theta)\operatorname{pen}(\hat m)-\sigma^2+\|\boldsymbol{\epsilon}\|_n^2\le l(\tilde\theta,\theta)\big[A_{\hat m}\vee(1-\kappa_2)\big]+\sigma^2 B_{\hat m}.
\]
As previously, we control the variable $A_{\hat m}$ on an event $\Omega_1$ of large probability and take the expectation of $B_{\hat m}$ on an event $\Omega_1\cap\Omega_2$ of large probability.

Lemma 7.10. Let $\Omega_1$ be the event
\[
\Omega_1 := \big\{A_{\hat m}\le s(K,\eta)\big\}\bigcap\Big\{\kappa_2\,n\varphi_{\max}\big[(\mathbf{Z}_{\hat m}^*\mathbf{Z}_{\hat m})^{-1}\big]\le\frac{K-1}{2}\big(1-\sqrt{\eta}-\nu(K)\big)^2\Big\},
\]
where $s(K,\eta)$ is a function smaller than one. Then $\mathbb{P}(\Omega_1^c)\le L(K)\,n\exp[-nL'(K,\eta)]$ with $L'(K,\eta)>0$. The function $s(K,\eta)$ is given explicitly in the proof of Lemma 7.10.

Lemma 7.11. Let us assume that $n$ is larger than some quantity $n_0(K)$. Then there exists an event $\Omega_2$ of probability larger than $1-\exp[-nL(K,\eta)]$, where $L(K,\eta)>0$, such that
\[
\mathbb{E}\big[B_{\hat m}\mathbf{1}_{\Omega_1\cap\Omega_2}\big]\le\frac{L(K,\eta)}{n}.
\]
Gathering inequalities (29) and (32) and Lemmas 7.10 and 7.11, we obtain, as in the previous proof,
\[
\mathbb{E}\big[l(\tilde\theta,\theta)\mathbf{1}_{\Omega_1\cap\Omega_2}\big]\le L(K,\eta)\inf_{m\in\mathcal{M}}\big[l(\theta_m,\theta)+\big(\sigma^2+l(\theta_m,\theta)\big)\operatorname{pen}(m)\big]+L'(K,\eta)\left[\frac{\sigma^2}{n}+\big(\sigma^2+l(0_p,\theta)\big)\,n\exp[-nL''(K,\eta)]\right]. \tag{42}
\]
Afterwards, we control the loss of the estimator $\tilde\theta$ on the event of small probability $\Omega_1^c\cup\Omega_2^c$.

Lemma 7.12. If $n$ is larger than some quantity $n_0(K)$,
\[
\mathbb{E}\big[l(\tilde\theta,\theta)\mathbf{1}_{\Omega_1^c\cup\Omega_2^c}\big]\le n^{3/2}\big(\sigma^2+l(0_p,\theta)\big)\,L(K,\eta)\exp[-nL'(K,\eta)],
\]
where $L'(K,\eta)$ is positive. Gathering this last bound with (42) enables us to conclude.

Proof of Lemma 7.10. This proof is analogous to the proof of Lemma 7.6, except that we change the weights in the concentration inequalities in order to take into account the complexity of the collection of models. Let $x$ be a positive number that we shall fix later. Applying Lemmas 7.2, 7.3, and 7.4 ensures that there exists an event $\Omega'$ such that
\[
\mathbb{P}(\Omega'^c)\le L\exp(-nx)\sum_{m\in\mathcal{M}}\exp[-d_m H(d_m)],
\]
and such that, for all models $m\in\mathcal{M}$,
\begin{align}
\frac{\|\Pi_m^{\perp}\boldsymbol{\epsilon}_m\|_n^2}{l(\theta_m,\theta)} &\ge \frac{n-d_m}{n}\left[\left(1-\delta_{n-d_m}-2\sqrt{\frac{d_m H(d_m)}{n-d_m}}-2\sqrt{\frac{xn}{n-d_m}}\right)\vee 0\right], \tag{43}\\
\frac{\|\Pi_m(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_m)\|_n^2}{\sigma^2+l(\theta_m,\theta)} &\le \frac{d_m}{n}\Big[1+\sqrt{2H(d_m)}+2H(d_m)\Big]+3x, \tag{44}\\
\frac{\|\Pi_m^{\perp}(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_m)\|_n^2}{\sigma^2+l(\theta_m,\theta)} &\ge \frac{n-d_m}{n}\left[\left(1-\delta_{n-d_m}-2\sqrt{\frac{d_m H(d_m)}{n-d_m}}-2\sqrt{\frac{xn}{n-d_m}}\right)\vee 0\right], \tag{45}
\end{align}
\[
n\varphi_{\max}\big[(\mathbf{Z}_m^*\mathbf{Z}_m)^{-1}\big]\le\left[\left(1-\big(1+\sqrt{2H(d_m)}\big)\sqrt{\frac{d_m}{n}}-\sqrt{2x}\right)\vee 0\right]^{-2}.
\]
We recall that $\delta_d$ is defined in (28). Besides, it holds that
\[
\mathbb{P}(\Omega'^c)\le L\exp[-nx]\sum_{d=0}^{n}\operatorname{Card}\big[\{m\in\mathcal{M},\ d_m=d\}\big]\exp[-dH(d)]\le L\,n\exp[-nx].
\]
By Assumption $(H_{K,\eta})$, the expression $\big(1+\sqrt{2H(d_m)}\big)\sqrt{d_m/n}$ is bounded by $\sqrt{\eta}$. Hence, conditionally on $\Omega'$,
\[
n\varphi_{\max}\big[(\mathbf{Z}_{\hat m}^*\mathbf{Z}_{\hat m})^{-1}\big]\le\Big[\big(1-\sqrt{\eta}-\sqrt{2x}\big)\vee 0\Big]^{-2}.
\]
Constraining $x$ to be small enough (as a function of $K$ and $\eta$) ensures that
\[
n\kappa_2\varphi_{\max}\big[(\mathbf{Z}_{\hat m}^*\mathbf{Z}_{\hat m})^{-1}\big]\mathbf{1}_{\Omega'}\le\frac{K-1}{2}\big(1-\sqrt{\eta}-\nu(K)\big)^2.
\]
By Assumption $(H_{K,\eta})$, the dimension of any model $m\in\mathcal{M}$ is smaller than $n/2$. If $n$ is larger than some quantity depending only on $K$, then $\delta_{n/2}$ is smaller than $\nu(K)$. Let us first assume that this is the case. We recall that $\nu(K)$ is defined at the beginning of the proof of Theorem 3.4. Since $\nu(K)\le 1-\sqrt{\eta}$, inequality (43) becomes
\[
\frac{\|\Pi_{\hat m}^{\perp}\boldsymbol{\epsilon}_{\hat m}\|_n^2}{l(\theta_{\hat m},\theta)}\ge\Big(1-\frac{d_{\hat m}}{n}\Big)\big[1-\nu(K)-\sqrt{\eta}\big]^2-2\sqrt{x}.
\]
Bounding analogously the remaining terms of $A_{\hat m}$, we get
\[
A_{\hat m}\le\kappa_1+1-\big[1-\sqrt{\eta}-\delta_{n/2}\big]^2+\frac{d_{\hat m}}{n}\big(1-\sqrt{\eta}-\delta_{n/2}\big)^2\,U_1+\sqrt{x}\,U_2+x\,U_3,
\]
where $U_1$, $U_2$, and $U_3$ are respectively defined as
\begin{align*}
U_1 &:= -K\big[1+\sqrt{2H(d_{\hat m})}\big]^2+1+\frac{K-1}{2}\big[1+\sqrt{2H(d_{\hat m})}\big]^2\ \le\ 0,\\
U_2 &:= 2\big[1+\sqrt{2K\eta}\big],\\
U_3 &:= \frac{K-1}{2}\big[1-\sqrt{\eta}-\nu(K)\big]^2.
\end{align*}
Since $U_1$ is non-positive, we obtain an upper bound on $A_{\hat m}$ that no longer depends on $\hat m$. By Assumption $(H_{K,\eta})$, we know that $\eta<\big(1-\nu(K)-(\tfrac{2}{K+2})^{1/4}\big)^2$. Hence, coming back to the definition of $\kappa_1$ allows us to prove that $\kappa_1$ is strictly smaller than $[1-\sqrt{\eta}-\nu(K)]^2$. Setting
\[
x := \left[\frac{\big[1-\sqrt{\eta}-\nu(K)\big]^2-\kappa_1}{2U_2}\right]^2\wedge\frac{\big[1-\sqrt{\eta}-\nu(K)\big]^2-\kappa_1}{2U_3}\wedge\big(1-\sqrt{\eta}\big)^2 L(K),
\]
we get
\[
A_{\hat m}\le 1-\frac12\Big[\big(1-\sqrt{\eta}-\nu(K)\big)^2-\kappa_1\Big]\ <\ 1
\]
on the event $\Omega'$. In order to take into account the case $\delta_{n/2}\ge\nu(K)$, we only have to choose a large constant $L(K)$ in the upper bound of $\mathbb{P}(\Omega_1^c)$.

Proof of Lemma 7.11. Once again, the sketch of the proof closely follows that of Lemma 7.7. Let us consider the random variables $E_m$ defined as
\[
E_m := \kappa_1^{-1}\frac{\langle\Pi_m^{\perp}\boldsymbol{\epsilon},\ \Pi_m^{\perp}\boldsymbol{\epsilon}_m\rangle_n^2}{\sigma^2\,l(\theta_m,\theta)}+\frac{\|\Pi_m\boldsymbol{\epsilon}\|_n^2}{\sigma^2}.
\]
Since $n\|\boldsymbol{\epsilon}\|_n^2/\sigma^2$ follows a $\chi^2$ distribution with $n$ degrees of freedom, there exists an event $\Omega_2$ of probability larger than $1-\exp[-nL(K)]$, with $L(K)>0$, such that $\|\boldsymbol{\epsilon}\|_n^2/\sigma^2$ is smaller than $2$ on $\Omega_2$; we recall that $\kappa_1^{-1}=\sqrt{(K+2)/2}\,\big[1-\sqrt{\eta}-\nu(K)\big]^{-1}$. We shall simultaneously upper bound the deviations of the random variables $E_m$, $\frac{\|\Pi_m(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_m)\|_n^2}{l(\theta_m,\theta)+\sigma^2}$, and $\frac{\|\Pi_m^{\perp}(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_m)\|_n^2}{\sigma^2+l(\theta_m,\theta)}$. Let $\xi$ be some positive constant that we shall fix later. For any $x>0$, we define an event $F(x)$ such that, conditionally on $F(x)\cap\Omega_2$, for any model $m\in\mathcal{M}$,
\begin{align*}
E_m &\le \frac{d_m+2\kappa_1^{-2}}{n}+\frac2n\sqrt{\big[d_m+2\kappa_1^{-2}\big]\big[d_m\big(\xi+H(d_m)\big)+x\big]}+2\kappa_1^{-2}\,\frac{\xi d_m+d_m H(d_m)+x}{n},\\
\frac{\|\Pi_m(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_m)\|_n^2}{l(\theta_m,\theta)+\sigma^2} &\le \frac1n\Big[d_m+2\sqrt{d_m\big[d_m\big(\xi+H(d_m)\big)+x\big]}+2\big[d_m\big(\xi+H(d_m)\big)+x\big]\Big],\\
\frac{\|\Pi_m^{\perp}\boldsymbol{\epsilon}_m+\boldsymbol{\epsilon}\|_n^2}{\sigma^2+l(\theta_m,\theta)} &\ge \frac{n-d_m}{n}\left[\left(1-\delta_{n-d_m}-\sqrt{\frac{2d_m\big(1+2H(d_m)\big)}{n-d_m}}-\sqrt{\frac{2x}{n-d_m}}\right)\vee 0\right].
\end{align*}
Then the probability of $F(x)^c$ satisfies
\[
\mathbb{P}[F(x)^c]\le e^{-x}\sum_{m\in\mathcal{M}}\exp[-d_m H(d_m)]\Big(e^{-\xi d_m}+2e^{-d_m/2}\Big)\le e^{-x}\left(\frac{1}{1-e^{-\xi}}+\frac{2}{1-e^{-1/2}}\right).
\]
Let us expand the three deviation bounds thanks to the inequality $2ab\le\tau a^2+\tau^{-1}b^2$:
\[
E_m \le \frac{d_m}{n}\big(1+\sqrt{2H(d_m)}\big)^2\Big[2\kappa_1^{-2}+2\sqrt{\xi}+2\kappa_1^{-2}\xi+\tau_1\xi+\tau_1\Big]+\frac{x}{n}\big[2\kappa_1^{-2}+\tau_1^{-1}+\tau_1\big]+\frac{2\kappa_1^{-2}}{n}\big[1+\tau_1^{-1}\kappa_1^{-2}\big].
\]
Similarly, we get
\[
\frac{\|\Pi_m(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_m)\|_n^2}{l(\theta_m,\theta)+\sigma^2}\le\frac{d_m}{n}\big[1+\sqrt{2H(d_m)}\big]^2\big(1+L\xi+\tau_2\big)+\frac{5x}{n}.
\]
If $n$ is larger than some quantity $n_0(K)$, then $\delta_{n/2}$ is smaller than $\nu(K)$. Applying Assumption $(H_{K,\eta})$, we get
\begin{align*}
-K\frac{d_m}{n-d_m}\big[1+\sqrt{2H(d_m)}\big]^2\frac{\|\Pi_m^{\perp}(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_m)\|_n^2}{l(\theta_m,\theta)+\sigma^2}
&\le -K\frac{d_m}{n}\big[1+\sqrt{2H(d_m)}\big]^2\left[\Big(1-\sqrt{\eta}-\nu(K)-\sqrt{\frac{2x}{n-d_m}}\Big)\vee 0\right]\\
&\le -K\frac{d_m}{n}\big[1+\sqrt{2H(d_m)}\big]^2\Big[\big(1-\sqrt{\eta}-\nu(K)\big)-\tau_3\Big]+2K\eta\,\tau_3^{-1}\frac{x}{n}.
\end{align*}
Let us combine these three bounds with the definitions of $B_m$, $\kappa_1$, and $\kappa_2$. Hence, conditionally on the event $\Omega_1\cap\Omega_2\cap F(x)$,
\[
B_{\hat m}\le\frac{d_{\hat m}}{n}\big[1+\sqrt{2H(d_{\hat m})}\big]^2\,U_4+\frac{x}{n}\,U_5+\frac{L(K,\eta)}{n}\,U_6, \tag{46}
\]
where
\begin{align*}
U_4 &:= -K\big(1-\sqrt{\eta}-\nu(K)\big)+K\tau_3+2\sqrt{\xi}+2\kappa_1^{-2}\xi+\tau_1\xi+\tau_1+L(K,\eta)\tau_2,\\
U_5 &:= \tau_1^{-1}+\tau_1+L(K,\eta)\big(1+\tau_3^{-1}\big),\\
U_6 &:= 1+\tau_1^{-1}.
\end{align*}
Since $K>1$, there exists a suitable choice of the constants $\xi$, $\tau_1$, $\tau_2$, and $\tau_3$, depending only on $K$ and $\eta$, that constrains $U_4$ to be non-positive. Hence, conditionally on the event $\Omega_1\cap\Omega_2\cap F(x)$,
\[
B_{\hat m}\le\frac{L(K,\eta)}{n}+L'(K,\eta)\,\frac{x}{n}.
\]
Since $\mathbb{P}[F(x)^c]\le e^{-x}L(K,\eta)$, we conclude by integrating the last expression with respect to $x$.

Proof of Lemma 7.12. As in the ordered selection case, we apply the Cauchy–Schwarz inequality:
\[
\mathbb{E}\big[l(\tilde\theta,\theta)\mathbf{1}_{\Omega_1^c\cup\Omega_2^c}\big]\le\sqrt{\mathbb{P}(\Omega_1^c)+\mathbb{P}(\Omega_2^c)}\ \sqrt{\mathbb{E}\big[l(\tilde\theta,\theta)^2\big]}.
\]
However, there are too many models to bound efficiently the risk of $\tilde\theta$ by the sum of the risks of the estimators $\hat\theta_m$. This is why we use here Hölder's inequality:
\begin{align}
\mathbb{E}\big[l(\tilde\theta,\theta)\mathbf{1}_{\Omega_1^c\cup\Omega_2^c}\big] &\le L(K)\sqrt{n}\exp[-nL(K,\eta)]\ \sqrt{\mathbb{E}\Big[\sum_{\substack{m\in\mathcal{M}\\ m=\hat m}} l(\hat\theta_m,\theta)^2\Big]}\nonumber\\
&\le L(K)\sqrt{n}\exp[-nL(K,\eta)]\ \sqrt{\sum_{m\in\mathcal{M}}\mathbb{P}(m=\hat m)^{1/u}\ \mathbb{E}\big[l(\hat\theta_m,\theta)^{2v}\big]^{1/v}}, \tag{47}
\end{align}
where $v:=\lfloor n/4\rfloor$ and $u:=v/(v-1)$. We assume here that $n$ is larger than $8$. For any model $m\in\mathcal{M}$, the loss $l(\hat\theta_m,\theta)$ decomposes into the sum $l(\theta_m,\theta)+l(\hat\theta_m,\theta_m)$. Hence, we obtain the following upper bound by applying Minkowski's inequality:
\[
\mathbb{E}\big[l(\hat\theta_m,\theta)^{2v}\big]^{1/(2v)}\le l(\theta_m,\theta)+\mathbb{E}\big[l(\hat\theta_m,\theta_m)^{2v}\big]^{1/(2v)}\le\operatorname{Var}(Y)+\mathbb{E}\big[l(\hat\theta_m,\theta_m)^{2v}\big]^{1/(2v)}. \tag{48}
\]
We shall upper bound this last term thanks to Proposition 7.8. Since $v$ is smaller than $n/4$ and since $d_m$ is smaller than $n/2$, it follows that for any model $m\in\mathcal{M}$, $n-d_m-2v+1$ is positive, and
\[
\mathbb{E}\big[l(\hat\theta_m,\theta_m)^{2v}\big]^{1/(2v)}\le\frac{2vL\,d_m}{n}\big(\sigma^2+l(\theta_m,\theta)\big)\le 2vL\operatorname{Var}(Y), \tag{49}
\]
since $d_m\le n$ and $\sigma^2+l(\theta_m,\theta)\le\operatorname{Var}(Y)$. Gathering the upper bounds (47), (48), and (49), we get
\[
\mathbb{E}\big[l(\tilde\theta,\theta)\mathbf{1}_{\Omega_1^c\cup\Omega_2^c}\big]\le L(K)\sqrt{n}\exp[-nL'(K,\eta)]\,\big[\operatorname{Var}(Y)+2vL\operatorname{Var}(Y)\big]\ \sqrt{\sum_{m\in\mathcal{M}}\mathbb{P}(m=\hat m)^{1/u}}.
\]
Since the sum over $m\in\mathcal{M}$ of $\mathbb{P}(m=\hat m)$ is one, the last term of the previous expression is maximized when every $\mathbb{P}(m=\hat m)$ equals $\operatorname{Card}(\mathcal{M})^{-1}$. Hence,
\[
\mathbb{E}\big[l(\tilde\theta,\theta)\mathbf{1}_{\Omega_1^c\cup\Omega_2^c}\big]\le n^{3/2}\operatorname{Var}(Y)\,L(K,\eta)\operatorname{Card}(\mathcal{M})^{1/(2v)}\exp[-nL'(K,\eta)],
\]
where $L'(K,\eta)$ is positive. Let us now bound the cardinality of the collection $\mathcal{M}$. We recall that the dimension of any model $m\in\mathcal{M}$ is assumed to be smaller than $n/2$ by $(H_{K,\eta})$. Besides, for any $d\in\{1,\dots,n/2\}$, there are fewer than $\exp(dH(d))$ models of dimension $d$. Hence,
\[
\log\operatorname{Card}(\mathcal{M})\le\log(n)+\sup_{d=1,\dots,n/2}dH(d).
\]
By Assumption $(H_{K,\eta})$, $dH(d)$ is smaller than $n/2$. Thus $\log\operatorname{Card}(\mathcal{M})\le\log(n)+n/2$, and it follows that $\operatorname{Card}(\mathcal{M})^{1/(2v)}$ is smaller than a universal constant provided that $n$ is larger than $8$. All in all, we get
\[
\mathbb{E}\big[l(\tilde\theta,\theta)\mathbf{1}_{\Omega_1^c\cup\Omega_2^c}\big]\le n^{3/2}\operatorname{Var}(Y)\,L(K,\eta)\exp[-nL'(K,\eta)],
\]
where $L'(K,\eta)$ is positive.

Proof of Proposition 3.5. We apply the same arguments as in the proof of Theorem 3.4, except that we replace $H(d_m)$ by $l_m$:
\begin{align*}
A_{m'} &:= \kappa_1+1-\frac{\|\Pi_{m'}^{\perp}\boldsymbol{\epsilon}_{m'}\|_n^2}{l(\theta_{m'},\theta)}+\kappa_2\,n\varphi_{\max}\big[(\mathbf{Z}_{m'}^*\mathbf{Z}_{m'})^{-1}\big]\frac{\|\Pi_{m'}(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_{m'})\|_n^2}{l(\theta_{m'},\theta)+\sigma^2}-K\big[1+\sqrt{2l_{m'}}\big]^2\frac{d_{m'}}{n-d_{m'}}\frac{\|\Pi_{m'}^{\perp}(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_{m'})\|_n^2}{l(\theta_{m'},\theta)+\sigma^2},\\
B_{m'} &:= \kappa_1^{-1}\frac{\langle\Pi_{m'}^{\perp}\boldsymbol{\epsilon},\ \Pi_{m'}^{\perp}\boldsymbol{\epsilon}_{m'}\rangle_n^2}{\sigma^2\,l(\theta_{m'},\theta)}+\frac{\|\Pi_{m'}\boldsymbol{\epsilon}\|_n^2}{\sigma^2}+\kappa_2\,n\varphi_{\max}\big[(\mathbf{Z}_{m'}^*\mathbf{Z}_{m'})^{-1}\big]\frac{\|\Pi_{m'}(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_{m'})\|_n^2}{l(\theta_{m'},\theta)+\sigma^2}-K\frac{d_{m'}}{n-d_{m'}}\big[1+\sqrt{2l_{m'}}\big]^2\frac{\|\Pi_{m'}^{\perp}(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_{m'})\|_n^2}{l(\theta_{m'},\theta)+\sigma^2}.
\end{align*}
In fact, Lemmas 7.10, 7.11, and 7.12 are still valid for this penalty.
The previous proofs of these three lemmas depend on the quantity $H(d_m)$ only through the following properties: $H(d_m)$ satisfies Assumption $(H_{K,\eta})$, and $\sum_{m\in\mathcal{M},\,d_m=d}\exp(-dH(d_m))\le 1$. Under the assumptions of Proposition 3.5, $l_m$ satisfies the corresponding Assumption $(H^l_{K,\eta})$ and is such that $\sum_{m\in\mathcal{M},\,d_m=d}\exp(-d\,l_m)\le 1$. Hence, the proofs of these lemmas remain valid in this setting if we replace $H(d_m)$ by $l_m$.

There is only one small difference, at the end of the proof of Lemma 7.12, when bounding $\log\operatorname{Card}(\mathcal{M})$. By definition of $l_m$,
\[
\operatorname{Card}(\mathcal{M})-1\le\sup_{m\in\mathcal{M}\setminus\{\emptyset\}}\exp(d_m l_m).
\]
Hence,
\[
\log\operatorname{Card}(\mathcal{M})\le 1+\sup_{m\in\mathcal{M}\setminus\{\emptyset\}}d_m l_m,
\]
which is smaller than $1+n/2$ by Assumption $(H^l_{K,\eta})$. Hence, the upper bound shown in the proof of Lemma 7.12 is still valid.

7.4 Proof of Proposition 7.8

Proof of Proposition 7.8. Let $m$ be a subset of $\{1,\dots,p\}$. Thanks to (27), we know that
\[
l(\hat\theta_m,\theta_m)=(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_m)^*\mathbf{Z}_m(\mathbf{Z}_m^*\mathbf{Z}_m)^{-2}\mathbf{Z}_m^*(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_m).
\]
Applying the Cauchy–Schwarz inequality, we decompose the $r$-th moment of the loss of $\hat\theta_m$ into two terms:
\[
\mathbb{E}\big[l(\hat\theta_m,\theta_m)^r\big]^{1/r}\le\mathbb{E}\Big[\big\|(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_m)(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_m)^*\big\|_F^{2r}\Big]^{1/(2r)}\ \mathbb{E}\Big[\Big(\operatorname{tr}\big[(\mathbf{Z}_m^*\mathbf{Z}_m)^{-1}\big]\Big)^{2r}\Big]^{1/(2r)}, \tag{50}
\]
by independence of $\boldsymbol{\epsilon}$, $\boldsymbol{\epsilon}_m$, and $\mathbf{Z}_m$. Here, $\|\cdot\|_F$ stands for the Frobenius norm in the space of square matrices. We shall successively upper bound the two terms involved in (50). First,
\[
\big\|(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_m)(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_m)^*\big\|_F^{r}=\Big[\sum_{1\le i,j\le n}(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_m)^2[i]\,(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_m)^2[j]\Big]^{r/2}.
\]
This last expression corresponds to the $L^{r/2}$ norm of a Gaussian chaos of order 4. By Theorem 3.2.10 in [18], such chaoses satisfy a Khintchine–Kahane type inequality:

Lemma 7.13. For all $d\in\mathbb{N}^*$ there exists a constant $L_d\in(0,\infty)$ such that, if $X$ is a Gaussian chaos of order $d$ with values in any normed space $F$ with norm $\|\cdot\|$, and if $1<s<q<\infty$, then
\[
\big(\mathbb{E}\|X\|^q\big)^{1/q}\le L_d\Big(\frac{q-1}{s-1}\Big)^{d/2}\big(\mathbb{E}\|X\|^s\big)^{1/s}.
\]
Let us assume that $r$ is larger than four. Applying the last lemma with $d=4$, $q=r/2$, and $s=2$ yields
\[
\mathbb{E}\Big[\big\|(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_m)(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_m)^*\big\|_F^{r}\Big]^{2/r}\le L\,(r/2-1)^2\ \mathbb{E}\Big[\big\|(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_m)(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_m)^*\big\|_F^{2}\Big].
\]
By standard Gaussian properties, we compute the fourth moment of this chaos and obtain
\[
\mathbb{E}\Big[\big\|(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_m)(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_m)^*\big\|_F^{2}\Big]\le L\,n^2\big[\sigma^2+l(\theta_m,\theta)\big]^2.
\]
Hence, we get the upper bound
\[
\mathbb{E}\Big[\big\|(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_m)(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_m)^*\big\|_F^{r}\Big]^{1/r}\le L\,(r-2)\,n\big[\sigma^2+l(\theta_m,\theta)\big]. \tag{51}
\]
Straightforward computations allow us to extend this bound to $r=2$ and $r=3$.

Let us turn to bounding the second term of (50). Since the eigenvalues of the matrix $(\mathbf{Z}_m^*\mathbf{Z}_m)^{-1}$ are almost surely non-negative, it follows that
\[
\operatorname{tr}\big[(\mathbf{Z}_m^*\mathbf{Z}_m)^{-2}\big]\le\Big(\operatorname{tr}\big[(\mathbf{Z}_m^*\mathbf{Z}_m)^{-1}\big]\Big)^2.
\]
Consequently, we shall upper bound the $r$-th moment of the trace of an inverse standard Wishart matrix. For any couple of matrices $A$ and $B$ respectively of size $p_1\times q_1$ and $p_2\times q_2$, we define the Kronecker product matrix $A\otimes B$ as the matrix of size $p_1p_2\times q_1q_2$ that satisfies
\[
A\otimes B\,\big[i_1+p_1(i_2-1);\ j_1+q_1(j_2-1)\big]=A[i_2;j_2]\,B[i_1;j_1],
\]
for any $1\le i_1\le p_1$, $1\le i_2\le p_2$, $1\le j_1\le q_1$, $1\le j_2\le q_2$. For any square matrix $A$, $\otimes^k A$ refers to the $k$-th power of $A$ with respect to the Kronecker product. Since $\operatorname{tr}(A)^k=\operatorname{tr}\big(\otimes^k A\big)$ for any square matrix $A$, we obtain
\[
\mathbb{E}\big[\operatorname{tr}(\mathbf{Z}_m^*\mathbf{Z}_m)^{-1}\big]^k=\mathbb{E}\big[\operatorname{tr}\big(\otimes^k(\mathbf{Z}_m^*\mathbf{Z}_m)^{-1}\big)\big]=\operatorname{tr}\big[\mathbb{E}\big(\otimes^k(\mathbf{Z}_m^*\mathbf{Z}_m)^{-1}\big)\big]\le\sqrt{d_m^{\,k}}\ \Big\|\mathbb{E}\big[\otimes^k(\mathbf{Z}_m^*\mathbf{Z}_m)^{-1}\big]\Big\|_F,
\]
thanks to the Cauchy–Schwarz inequality.
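For instance, the Kronecker trace identity used in the last display can be checked directly for $k=2$ from the indexing convention above (a small worked verification, not part of the original text):
\[
\operatorname{tr}(A\otimes A)=\sum_{i_1=1}^{d}\sum_{i_2=1}^{d}(A\otimes A)\big[i_1+d(i_2-1);\,i_1+d(i_2-1)\big]=\sum_{i_1=1}^{d}\sum_{i_2=1}^{d}A[i_2;i_2]\,A[i_1;i_1]=\big(\operatorname{tr}A\big)^2 .
\]
The general case $\operatorname{tr}(A)^k=\operatorname{tr}(\otimes^k A)$ follows by the same computation, since the diagonal entries of $\otimes^k A$ are exactly the $k$-fold products of diagonal entries of $A$.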
In Equation (4.2) of [37], von Rosen has characterized recursively the expectation of $\otimes^k(\mathbf{Z}_m^*\mathbf{Z}_m)^{-1}$ as long as $n-d_m-2k-1$ is positive:
\[
\operatorname{vec}\Big(\mathbb{E}\big[\otimes^{k+1}(\mathbf{Z}_m^*\mathbf{Z}_m)^{-1}\big]\Big)=A(n,d_m,k)^{-1}\operatorname{vec}\Big(\mathbb{E}\big[\otimes^{k}(\mathbf{Z}_m^*\mathbf{Z}_m)^{-1}\big]\otimes I\Big), \tag{52}
\]
where "vec" refers to the vectorized version of the matrix; see Section 2 of [37] for more details about this definition. Here $A(n,d_m,k)$ is a symmetric matrix of size $d_m^{k+1}\times d_m^{k+1}$, which depends only on $n$, $d_m$, and $k$, and is known to be diagonally dominant. More precisely, any diagonal element of $A(n,d_m,k)$ is greater than or equal to one plus the corresponding row sum of the absolute values of the off-diagonal elements. Hence, the matrix $A$ is invertible and its smallest eigenvalue is larger than or equal to one. Consequently, $\varphi_{\max}(A^{-1})$ is smaller than or equal to one. It then follows from (52) that
\begin{align*}
\Big\|\mathbb{E}\big[\otimes^{k+1}(\mathbf{Z}_m^*\mathbf{Z}_m)^{-1}\big]\Big\|_F &= \Big\|\operatorname{vec}\Big(\mathbb{E}\big[\otimes^{k+1}(\mathbf{Z}_m^*\mathbf{Z}_m)^{-1}\big]\Big)\Big\|_F\\
&\le \varphi_{\max}(A^{-1})\,\Big\|\operatorname{vec}\Big(\mathbb{E}\big[\otimes^{k}(\mathbf{Z}_m^*\mathbf{Z}_m)^{-1}\big]\otimes I\Big)\Big\|_F\le\sqrt{d_m}\,\Big\|\mathbb{E}\big[\otimes^{k}(\mathbf{Z}_m^*\mathbf{Z}_m)^{-1}\big]\Big\|_F.
\end{align*}
By induction, we obtain
\[
\mathbb{E}\big[\operatorname{tr}(\mathbf{Z}_m^*\mathbf{Z}_m)^{-1}\big]^r\le\Big(\frac{L\,d_m}{n}\Big)^r, \tag{53}
\]
if $n-d_m-2r+1>0$. Combining the upper bounds (51) and (53) enables us to conclude:
\[
\mathbb{E}\big[l(\hat\theta_m,\theta_m)^r\big]^{1/r}\le\frac{L\,r\,d_m}{n}\big(\sigma^2+l(\theta_m,\theta)\big).
\]

Let $m^{**}$ be the model that minimizes the loss function $l(\hat\theta_m,\theta)$:
\[
m^{**} := \arg\inf_{m\in\mathcal{M}_{\lfloor n/2\rfloor}} l(\hat\theta_m,\theta).
\]
It is almost surely uniquely defined. Contrary to the oracle $m^*$, the model $m^{**}$ is random. By definition of $\hat m$, we derive that
\[
l(\tilde\theta,\theta)\le l(\hat\theta_{m^{**}},\theta)+\gamma_n(\hat\theta_{m^{**}})\operatorname{pen}(m^{**})+\bar\gamma_n(\hat\theta_{m^{**}})-\gamma_n(\tilde\theta)\operatorname{pen}(\hat m)-\bar\gamma_n(\tilde\theta), \tag{54}
\]
where $\bar\gamma_n$ is defined in the proof of Theorem 3.1. The proof divides into two parts. First, we state that on an event $\Omega_6$ of large probability, the dimensions of $\hat m$ and of $m^{**}$ are moderate. Afterwards, we prove that on another event of large probability, $\Omega_6\cap\Omega_7\cap\Omega_8$, the ratio $l(\tilde\theta,\theta)/l(\hat\theta_{m^{**}},\theta)$ is close to one.

Lemma 7.14. Let us define the event $\Omega_6$ as
\[
\Omega_6 := \Big\{\log^2(n)<d_{m^{**}}<\frac{n}{\log n}\ \text{ and }\ \log^2(n)<d_{\hat m}<\frac{n}{\log n}\Big\}.
\]
The event $\Omega_6$ is achieved with large probability: $\mathbb{P}(\Omega_6)\ge 1-\frac{L(R,s)}{n}$.

Lemma 7.15. There exists an event $\Omega_7$ of probability larger than $1-\frac{L\log n}{n}$ such that
\[
\Big[-\bar\gamma_n(\tilde\theta)-\gamma_n(\tilde\theta)\operatorname{pen}(\hat m)-\sigma^2+\|\boldsymbol{\epsilon}\|_n^2\Big]\mathbf{1}_{\Omega_6\cap\Omega_7}\le l(\tilde\theta,\theta)\,\tau_1(n),
\]
where $\tau_1(n)$ is a positive sequence converging to zero as $n$ goes to infinity.

Lemma 7.16. There exists an event $\Omega_8$ of probability larger than $1-\frac{L\log n}{n}$ such that
\[
\Big[\bar\gamma_n(\hat\theta_{m^{**}})+\gamma_n(\hat\theta_{m^{**}})\operatorname{pen}(m^{**})+\sigma^2-\|\boldsymbol{\epsilon}\|_n^2\Big]\mathbf{1}_{\Omega_6\cap\Omega_8}\le l(\hat\theta_{m^{**}},\theta)\,\tau_2(n),
\]
where $\tau_2(n)$ is a positive sequence converging to zero as $n$ goes to infinity.

Gathering these three lemmas, we derive from the upper bound (54) the inequality
\[
\frac{l(\tilde\theta,\theta)}{l(\hat\theta_{m^{**}},\theta)}\,\mathbf{1}_{\Omega_6\cap\Omega_7\cap\Omega_8}\le\frac{1+\tau_2(n)}{1-\tau_1(n)},
\]
which allows us to conclude.

Proof of Lemma 7.14. Let us consider the model $m_{R,s}$ defined by $d_{m_{R,s}}:=\lfloor(nR^2)^{1/(1+2s)}\rfloor$. If $n$ is larger than some quantity $L(R,s)$, then $d_{m_{R,s}}$ is smaller than $n/2$, and $m_{R,s}$ therefore belongs to the collection $\mathcal{M}_{\lfloor n/2\rfloor}$.
We shall prove that, outside an event of small probability, the loss $l(\hat\theta_{m_{R,s}},\theta)$ is smaller than the loss $l(\hat\theta_m,\theta)$ of every model $m\in\mathcal{M}_{\lfloor n/2\rfloor}$ whose dimension is smaller than $\log^2(n)$ or larger than $n/\log n$. Hence, the model $m^{**}$ satisfies $\log^2(n)<d_{m^{**}}<n/\log n$ with large probability.

First, we need to upper bound the loss $l(\hat\theta_{m_{R,s}},\theta)$. Since $l(\hat\theta_{m_{R,s}},\theta)=l(\theta_{m_{R,s}},\theta)+l(\hat\theta_{m_{R,s}},\theta_{m_{R,s}})$, it comes down to upper bounding both the bias term and the variance term. Since $\theta$ belongs to $\mathcal{E}'_s(R)$,
\[
l(\theta_{m_{R,s}},\theta)=\sum_{i>d_{m_{R,s}}}l(\theta_{m_{i-1}},\theta_{m_i})\le(d_{m_{R,s}}+1)^{-2s}\sum_{i>d_{m_{R,s}}}l(\theta_{m_{i-1}},\theta_{m_i})\,i^{2s}\le\sigma^2\,R^{2/(1+2s)}\,n^{-2s/(1+2s)}. \tag{55}
\]
Then, we bound the variance term $l(\hat\theta_{m_{R,s}},\theta_{m_{R,s}})$ thanks to (36), as in the proof of Lemma 7.5:
\[
l(\hat\theta_{m_{R,s}},\theta_{m_{R,s}})\le\big[\sigma^2+l(\theta_{m_{R,s}},\theta)\big]\,\varphi_{\max}\Big[n(\mathbf{Z}_{m_{R,s}}^*\mathbf{Z}_{m_{R,s}})^{-1}\Big]\,\frac{\big\|\Pi_{m_{R,s}}(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_{m_{R,s}})\big\|_n^2}{\sigma^2+l(\theta_{m_{R,s}},\theta)}.
\]
The two random variables involved in this last expression respectively follow (up to a factor $n$) the distribution of the largest eigenvalue of an inverse Wishart matrix with parameters $(n,d_{m_{R,s}})$ and a $\chi^2$ distribution with $d_{m_{R,s}}$ degrees of freedom. Thanks to Lemmas 7.2 and 7.4, we prove that, outside an event of probability smaller than $L(R,s)\exp[-L'(R,s)\,n^{1/(1+2s)}]$ with $L'(R,s)>0$,
\[
l(\hat\theta_{m_{R,s}},\theta_{m_{R,s}})\le L\big[\sigma^2+l(\theta_{m_{R,s}},\theta)\big]\,\frac{d_{m_{R,s}}}{n},
\]
if $n$ is large enough. Gathering this last upper bound with (55) yields
\[
l(\hat\theta_{m_{R,s}},\theta)\le\sigma^2\Big(R^{2/(1+2s)}\,n^{-2s/(1+2s)}+4R^{2/(1+2s)}\,n^{-2s/(1+2s)}\Big)\le\sigma^2\,C(R,s)\,n^{-2s/(1+2s)}, \tag{56}
\]
where $C(R,s)$ is a constant that depends only on $R$ and $s$.

Let us prove that the bias term of any model of dimension smaller than $\log^2(n)$ is larger than (56) if $n$ is large enough. Obviously, we only have to consider the model of dimension $\lfloor\log^2(n)\rfloor$. Assume that there exists an infinite increasing sequence of integers $u_n$ satisfying
\[
\sum_{i>\log^2(u_n)}l(\theta_{m_{i-1}},\theta_{m_i})\le C(R,s)\,(u_{n+1})^{-2s/(1+2s)}. \tag{57}
\]
Then, the sequence $(v_n)$ defined by $v_n:=\log^2(u_n)$ satisfies
\[
\sum_{i>v_n}l(\theta_{m_{i-1}},\theta_{m_i})\le C(R,s)\exp\left[-\sqrt{v_{n+1}}\,\frac{2s}{1+2s}\right].
\]
Let us consider a subsequence of $(v_n)$ such that $\lfloor v_n\rfloor$ is strictly increasing. For the sake of simplicity we still call it $v_n$. It follows that
\[
\sum_{i=\lfloor v_0\rfloor+1}^{+\infty}l(\theta_{m_{i-1}},\theta_{m_i})\,i^{2s'}=\sum_{n=0}^{+\infty}\ \sum_{i=\lfloor v_n\rfloor+1}^{\lfloor v_{n+1}\rfloor}l(\theta_{m_{i-1}},\theta_{m_i})\,i^{2s'}\le C(R,s)\sum_{n=0}^{+\infty}\lfloor v_{n+1}\rfloor^{2s'}\exp\left[-\sqrt{\lfloor v_{n+1}\rfloor}\,\frac{2s}{1+2s}\right]<\infty,
\]
and $\theta$ therefore belongs to some ellipsoid $\mathcal{E}_{s'}(R')$. This contradicts the assumption that $\theta$ does not belong to any ellipsoid $\mathcal{E}_{s'}(R')$. As a consequence, only a finite number of integers $u_n$ satisfy Condition (57). For $n$ large enough, the bias term of any model of dimension less than $\log^2(n)$ is therefore larger than the loss $l(\hat\theta_{m_{R,s}},\theta)$ with overwhelming probability.

Let us turn to the models of dimension larger than $n/\log n$. We shall prove that, with large probability, for any model $m$ of dimension larger than $n/\log n$, the variance term $l(\hat\theta_m,\theta_m)$ is larger than the order $\sigma^2/\log n$. For any model $m\in\mathcal{M}_{\lfloor n/2\rfloor}$,
\[
l(\hat\theta_m,\theta_m)\ge\frac{n\sigma^2}{\varphi_{\max}(\mathbf{Z}_m^*\mathbf{Z}_m)}\,\frac{\|\Pi_m(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_m)\|_n^2}{\sigma^2+l(\theta_m,\theta)}.
\]
The two random variables involved in this expression respectively follow (up to a factor $n$) a Wishart distribution with parameters $(n,d_m)$ and a $\chi^2$ distribution with $d_m$ degrees of freedom. Again, we apply Lemmas 7.2 and 7.4 to control the deviations of these random variables. Hence, outside an event of probability smaller than $L(\xi)\exp[-n\xi/\log n]$,
\[
l(\hat\theta_m,\theta_m)\ge\sigma^2\left(1+\sqrt{\frac{d_m}{n}}+\sqrt{\frac{\xi d_m}{n}}\right)^{-2}\frac{d_m}{n}\big(1-2\sqrt{\xi}\big),
\]
for any model $m$ of dimension larger than $n/\log n$. For any model $m\in\mathcal{M}_{\lfloor n/2\rfloor}$, the ratio $d_m/n$ is smaller than $1/2$. As a consequence, we get
\[
l(\hat\theta_m,\theta_m)\ge\frac{\sigma^2}{\log n}\big(1-2\sqrt{\xi}\big)\Big(1+\sqrt{1/2}+\sqrt{\xi/2}\Big)^{-2}.
\]
Choosing for instance $\xi=1/16$ ensures that, for $n$ large enough, the loss $l(\hat\theta_m,\theta_m)$ is larger than $l(\hat\theta_{m_{R,s}},\theta)$ for every model $m$ of dimension larger than $n/\log n$, outside an event of probability smaller than $L\exp[-Ln/\log n]+L(R,s)\exp[-L'(R,s)\,n^{1/(1+2s)}]$ with $L'(R,s)>0$.

Let us now turn to the selected model $\hat m$. We shall prove that, outside an event of small probability,
\[
\gamma_n(\hat\theta_{m_{R,s}})\big[1+\operatorname{pen}(m_{R,s})\big]\le\gamma_n(\hat\theta_m)\big[1+\operatorname{pen}(m)\big], \tag{58}
\]
for all models $m$ of dimension smaller than $\log^2 n$ or larger than $n/\log n$. We first consider the models of dimension smaller than $\log^2(n)$. For any model $m\in\mathcal{M}_{\lfloor n/2\rfloor}$, $\gamma_n(\hat\theta_m)\,n/[\sigma^2+l(\theta_m,\theta)]$ follows a $\chi^2$ distribution with $n-d_m$ degrees of freedom. Again, we apply Lemma 7.2. Hence, with probability larger than $1-e/[n(e-1)]$, the following bound holds for any model $m$ of dimension smaller than $\log^2(n)$:
\begin{align*}
\gamma_n(\hat\theta_m)\big[1+\operatorname{pen}(m)\big] &\ge \sigma^2\Big[1+\frac{l(\theta_m,\theta)}{\sigma^2}\Big]\Big(1+\frac{d_m}{n-d_m}\Big)\left[\frac{n-d_m}{n}-2\frac{\sqrt{(n-d_m)(d_m+2\log(n))}}{n}\right]\\
&\ge \sigma^2\Big[1+\frac{l(\theta_m,\theta)}{\sigma^2}\Big]\Big(1+\frac{d_m}{n}\Big)\left[1-2\sqrt{\frac{d_m+2\log(n)}{n-d_m}}\right]\ \ge\ \sigma^2\Big[1+\frac{l(\theta_m,\theta)}{\sigma^2}\Big]\left[1-\frac{L\log(n)}{\sqrt n}\right],
\end{align*}
for $n$ large enough. Besides, outside an event of probability smaller than $2/n$,
\begin{align*}
\gamma_n(\hat\theta_{m_{R,s}})\big[1+\operatorname{pen}(m_{R,s})\big] &\le \sigma^2\Big[1+\frac{l(\theta_{m_{R,s}},\theta)}{\sigma^2}\Big]\Big(1+\frac{d_{m_{R,s}}}{n-d_{m_{R,s}}}\Big)\left[\frac{n-d_{m_{R,s}}}{n}+2\frac{\sqrt{(n-d_{m_{R,s}})\,2\log n}}{n}+\frac{4\log n}{n}\right]\\
&\le \sigma^2\Big[1+\frac{l(\theta_{m_{R,s}},\theta)}{\sigma^2}\Big]\Big(1+\frac{d_{m_{R,s}}}{n}\Big)\left[1+\frac{2\sqrt{2\log n}}{\sqrt{n-d_{m_{R,s}}}}+\frac{4\log n}{n-d_{m_{R,s}}}\right].
\end{align*}
For $n$ large enough, $d_{m_{R,s}}$ is smaller than $n/2$, and the last upper bound becomes
\[
\gamma_n(\hat\theta_{m_{R,s}})\big[1+\operatorname{pen}(m_{R,s})\big]\le\sigma^2\Big[1+C(R,s)\,n^{-2s/(1+2s)}\Big]\Big(1+\frac{L\log(n)}{\sqrt n}\Big).
\]
Hence, $\gamma_n(\hat\theta_{m_{R,s}})[1+\operatorname{pen}(m_{R,s})]\le\gamma_n(\hat\theta_m)[1+\operatorname{pen}(m)]$ as soon as
\[
\frac{l(\theta_{m_{\lfloor\log^2 n\rfloor}},\theta)}{\sigma^2}\ge C(R,s)\,n^{-2s/(1+2s)}\,\frac{1+L\log(n)/\sqrt n}{1-L\log(n)/\sqrt n}+\frac{14\log(n)}{\sqrt n}.
\]
As previously, this inequality always holds except for a finite number of $n$, since $\theta$ does not belong to any ellipsoid $\mathcal{E}_{s'}(R')$. Thus, outside an event of probability smaller than $L/n$, $d_{\hat m}$ is larger than $\log^2 n$.

Let us now turn to the models of large dimension. Inequality (58) holds if the quantity
\[
\|\boldsymbol{\epsilon}\|_n^2\left(\frac{d_{m_{R,s}}}{n-d_{m_{R,s}}}-\frac{d_m}{n-d_m}\right)+\|\Pi_m\boldsymbol{\epsilon}\|_n^2\left(1+\frac{d_m}{n-d_m}\right)+\big\langle\Pi_{m_{R,s}}^{\perp}\boldsymbol{\epsilon}_{m_{R,s}},\ \Pi_{m_{R,s}}^{\perp}\big(\boldsymbol{\epsilon}+2\boldsymbol{\epsilon}_{m_{R,s}}\big)\big\rangle_n\left(1+\frac{d_{m_{R,s}}}{n-d_{m_{R,s}}}\right) \tag{59}
\]
is non-positive.
The three following bounds hold outside an event of probability smaller than $L(\xi)/n$:
\begin{align*}
\|\boldsymbol{\epsilon}\|_n^2 &\ge \sigma^2\left(1-2\sqrt{\frac{2\log n}{n}}\right),\\
\|\Pi_m\boldsymbol{\epsilon}\|_n^2 &\le (1+\xi)\,\sigma^2\,\frac{d_m}{n},\qquad\text{for all models } m \text{ of dimension } d_m>\frac{n}{\log n},\\
\big\langle\Pi_{m_{R,s}}^{\perp}\boldsymbol{\epsilon}_{m_{R,s}},\ \Pi_{m_{R,s}}^{\perp}\big(\boldsymbol{\epsilon}+2\boldsymbol{\epsilon}_{m_{R,s}}\big)\big\rangle_n &\le 2\,l(\theta_{m_{R,s}},\theta)\left[\frac{n-d_{m_{R,s}}}{n}+4\frac{\sqrt{(n-d_{m_{R,s}})\log n}}{n}+\frac{4\log n}{n}\right]+4\sqrt{l(\theta_{m_{R,s}},\theta)\,\sigma^2}\ \frac{\sqrt{(n-d_{m_{R,s}})\log n}}{n}.
\end{align*}
Gathering these three inequalities, we upper bound (59) by
\[
\sigma^2\frac{d_m}{n-d_m}\left[-1+L\sqrt{\frac{\log n}{n}}+(1+\xi)\Big(\frac12+\frac{d_m}{n}\Big)\right]+2\sigma^2\frac{d_{m_{R,s}}}{n-d_{m_{R,s}}}+\sigma^2 L\Big(\frac{d_{m_{R,s}}}{n}\Big)\left[\frac{l(\theta_{m_{R,s}},\theta)}{\sigma^2}+\sqrt{\frac{l(\theta_{m_{R,s}},\theta)}{\sigma^2}}\right]\sqrt{\frac{\log n}{n-d_{m_{R,s}}}}.
\]
The dimension of any model $m\in\mathcal{M}_{\lfloor n/2\rfloor}$ is assumed to be smaller than $n/2$, and the dimensions of the models $m$ considered here are larger than $n/\log n$. For $\xi$ small enough and $n$ large enough, the previous expression is therefore upper bounded by
\[
\sigma^2\frac{d_m}{n-d_m}\left[\frac34(1+\xi)-1+L\sqrt{\frac{\log n}{n}}\right]+L\sigma^2\left[R^{2/(1+2s)}n^{-2s/(1+2s)}+R^{4/(1+2s)}n^{-4s/(1+2s)}\right]. \tag{60}
\]
For $n$ large enough, this last quantity is clearly non-positive.

All in all, we have proved that, for $n$ large enough, outside an event of probability smaller than $\frac{L(R,s)}{n}$, it holds that $\log^2(n)<d_{m^{**}}<\frac{n}{\log n}$ and $\log^2(n)<d_{\hat m}<\frac{n}{\log n}$.

Proof of Lemma 7.15. Arguing as in the proof of Theorem 3.1, we upper bound
\[
-\bar\gamma_n(\tilde\theta)-\gamma_n(\tilde\theta)\operatorname{pen}(\hat m)-\sigma^2+\|\boldsymbol{\epsilon}\|_n^2\le l(\theta_{\hat m},\theta)\,A_{\hat m}+\sigma^2 B_{\hat m}+\big(1-\kappa_2(n)\big)\,l(\tilde\theta,\theta_{\hat m}), \tag{61}
\]
where $A_{\hat m}$ and $B_{\hat m}$ are respectively defined in (30) and (31). We shall fix the quantities $\kappa_1(n)$ and $\kappa_2(n)$ later. Besides, we define and bound the quantity $E_{\hat m}$ as in (41). Applying Lemmas 7.2 and 7.4 and arguing as in the proofs of Lemmas 7.6 and 7.7, there exists an event $\Omega_7$ of large probability,
\[
\mathbb{P}(\Omega_7^c)\le\exp[-n/8]+5\sum_{d=\log^2(n)}^{n/\log n}\exp\Big[-\frac{d}{\log n}\Big]\le\exp[-n/8]+\frac{5\log n}{n\,(1-1/\log n)},
\]
such that, conditionally on $\Omega_6\cap\Omega_7$,
\begin{align*}
\frac{\|\Pi_{\hat m}^{\perp}\boldsymbol{\epsilon}_{\hat m}\|_n^2}{l(\theta_{\hat m},\theta)} &\ge \frac{n-d_{\hat m}}{n}-2\frac{\sqrt{(n-d_{\hat m})\,d_{\hat m}/\log n}}{n},\\
\frac{\|\Pi_{\hat m}(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_{\hat m})\|_n^2}{\sigma^2+l(\theta_{\hat m},\theta)} &\le \frac{d_{\hat m}}{n}+2\frac{\sqrt{d_{\hat m}}}{n}\sqrt{\frac{d_{\hat m}}{\log n}}+4\frac{d_{\hat m}}{n\log n},\\
\frac{\|\Pi_{\hat m}^{\perp}(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_{\hat m})\|_n^2}{\sigma^2+l(\theta_{\hat m},\theta)} &\ge \frac{n-d_{\hat m}}{n}-2\frac{\sqrt{(n-d_{\hat m})\,d_{\hat m}/\log n}}{n},\\
n\varphi_{\max}\big[(\mathbf{Z}_{\hat m}^*\mathbf{Z}_{\hat m})^{-1}\big] &\le \left[\left(1-\Big(1+\sqrt{\frac{2}{\log n}}\Big)\sqrt{\frac{d_{\hat m}}{n}}\right)\vee 0\right]^{-2},\\
\frac{\|\boldsymbol{\epsilon}\|_n^2}{\sigma^2} &\le 2,\\
E_{\hat m} &\le \frac{d_{\hat m}+2\kappa_1^{-2}(n)}{n}+\frac2n\sqrt{\big[d_{\hat m}+2\kappa_1^{-2}(n)\big]\frac{d_{\hat m}}{\log n}}+8\kappa_1^{-2}(n)\,\frac{d_{\hat m}}{n\log n}.
\end{align*}
Gathering these six upper bounds, we are able to upper bound $A_{\hat m}$ and $B_{\hat m}$:
\begin{align*}
A_{\hat m} &\le \kappa_1(n)+L\sqrt{\frac{d_{\hat m}}{n\log n}}+\frac{d_{\hat m}}{n}-L'\sqrt{\frac{d_{\hat m}}{(n-d_{\hat m})\log n}}+\kappa_2(n)\,\frac{1+L/\sqrt{\log n}}{\big[1-(1+\sqrt{2/\log n})\sqrt{d_{\hat m}/n}\big]^{2}},\\
B_{\hat m} &\le \frac{d_{\hat m}}{n}-L\sqrt{\frac{d_{\hat m}}{(n-d_{\hat m})\log n}}+\kappa_2(n)\,\frac{1+L/\sqrt{\log n}}{\big[1-(1+\sqrt{2/\log n})\sqrt{d_{\hat m}/n}\big]^{2}}+L\frac{d_{\hat m}}{n}\left[\frac{\kappa_1^{-2}(n)}{d_{\hat m}}+\frac{\kappa_1^{-2}(n)}{\log n}+\frac{1}{\sqrt{\log n}}+\frac{\kappa_1^{-2}(n)}{\sqrt{d_{\hat m}\log n}}\right].
\end{align*}
Conditionally on the event $\Omega_6$, the dimension of $\hat m$ is moderate. Setting $\kappa_1(n):=1/\log n$, we get
\begin{align*}
A_{\hat m} &\le \frac{L}{\log n}+\frac{d_{\hat m}}{n}-\frac{L'}{\log n}+\kappa_2(n)\Big(1+\frac{L}{\sqrt{\log n}}\Big)\left[1-\frac{L''}{\sqrt{\log n}}\right]^{-2},\\
B_{\hat m} &\le \frac{d_{\hat m}}{n}-\frac{L}{\log n}+\kappa_2(n)\Big(1+\frac{L}{\sqrt{\log n}}\Big)\left[1-\frac{L'}{\sqrt{\log n}}\right]^{-2}+\frac{L}{\sqrt{\log n}}.
\end{align*}
Hence, there exists a sequence $\kappa_2(n)$ converging to one such that, conditionally on $\Omega_6\cap\Omega_7$, $B_{\hat m}$ is non-positive and $A_{\hat m}$ is bounded by $L/\log n$ when $n$ is large enough. Coming back to inequality (61) yields
\[
\Big[-\bar\gamma_n(\tilde\theta)-\gamma_n(\tilde\theta)\operatorname{pen}(\hat m)-\sigma^2+\|\boldsymbol{\epsilon}\|_n^2\Big]\mathbf{1}_{\Omega_6\cap\Omega_7}\le l(\tilde\theta,\theta)\left[\frac{L}{\log n}\vee\big(1-\kappa_2(n)\big)\right],
\]
which concludes the proof.

Proof of Lemma 7.16. We follow an approach similar to the previous proof:
\[
\bar\gamma_n(\hat\theta_{m^{**}})+\gamma_n(\hat\theta_{m^{**}})\operatorname{pen}(m^{**})+\sigma^2-\|\boldsymbol{\epsilon}\|_n^2\le C_{m^{**}}\,l(\theta_{m^{**}},\theta)+D_{m^{**}}\,\sigma^2+\kappa_2(n)\,l(\hat\theta_{m^{**}},\theta_{m^{**}}), \tag{62}
\]
where, for any model $m'\in\mathcal{M}_{\lfloor n/2\rfloor}$, $C_{m'}$ and $D_{m'}$ are respectively defined as
\begin{align*}
C_{m'} &:= \kappa_1(n)+\frac{\|\Pi_{m'}^{\perp}\boldsymbol{\epsilon}_{m'}\|_n^2}{l(\theta_{m'},\theta)}-1-\frac{d_{m'}}{n-d_{m'}}\frac{\|\Pi_{m'}^{\perp}(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_{m'})\|_n^2}{l(\theta_{m'},\theta)+\sigma^2}-\big(1+\kappa_2(n)\big)\frac{n}{\varphi_{\max}(\mathbf{Z}_{m'}^*\mathbf{Z}_{m'})}\frac{\|\Pi_{m'}(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_{m'})\|_n^2}{l(\theta_{m'},\theta)+\sigma^2},\\
D_{m'} &:= \kappa_1^{-1}(n)\frac{\langle\Pi_{m'}^{\perp}\boldsymbol{\epsilon},\ \Pi_{m'}^{\perp}\boldsymbol{\epsilon}_{m'}\rangle_n^2}{\sigma^2\,l(\theta_{m'},\theta)}-\frac{\|\Pi_{m'}\boldsymbol{\epsilon}\|_n^2}{\sigma^2}-\big(1+\kappa_2(n)\big)\frac{n}{\varphi_{\max}(\mathbf{Z}_{m'}^*\mathbf{Z}_{m'})}\frac{\|\Pi_{m'}(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_{m'})\|_n^2}{l(\theta_{m'},\theta)+\sigma^2}+2\frac{d_{m'}}{n-d_{m'}}\frac{\|\Pi_{m'}^{\perp}(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_{m'})\|_n^2}{l(\theta_{m'},\theta)+\sigma^2}.
\end{align*}
We fix $\kappa_1(n)=1/\log n$, whereas $\kappa_2(n)$ will be fixed later. Arguing as in the proof of Lemma 7.15, there exists an event $\Omega_8$ of large probability,
\[
\mathbb{P}(\Omega_8^c)\le\exp[-n/8]+5\sum_{d=\log^2(n)}^{n/\log n}\exp\Big[-\frac{d}{\log n}\Big]\le\exp[-n/8]+\frac{5\log n}{n\,(1-1/\log(n))},
\]
such that, conditionally on $\Omega_6\cap\Omega_8$, the two following bounds hold:
\begin{align*}
C_{m^{**}} &\le \frac{L}{\log n}+\frac{d_{m^{**}}}{n}\frac{L}{\log n}-\big(1+\kappa_2(n)\big)\Big(1+L\sqrt{\frac{2}{\log n}}\Big)^{-2}\left[1-\frac{L'}{\sqrt{\log n}}\right],\\
D_{m^{**}} &\le \frac{d_{m^{**}}}{n}\frac{L}{\log n}+\frac{L}{\sqrt{\log n}}-\big(1+\kappa_2(n)\big)\Big(1+L\sqrt{\frac{2}{\log n}}\Big)^{-2}\left[1-\frac{L'}{\sqrt{\log n}}\right],
\end{align*}
if $n$ is large. The main difference with the proof of Lemma 7.15 lies in the fact that we now control the largest eigenvalue of $\mathbf{Z}_{m^{**}}^*\mathbf{Z}_{m^{**}}$ thanks to the second result of Lemma 7.4. There exists a sequence $\kappa_2(n)$ converging to $0$ such that, conditionally on $\Omega_6\cap\Omega_8$, $D_{m^{**}}$ is non-positive and $C_{m^{**}}$ is bounded by $L/\log n$ when $n$ is large. Coming back to (62) yields
\[
\Big[\bar\gamma_n(\hat\theta_{m^{**}})+\gamma_n(\hat\theta_{m^{**}})\operatorname{pen}(m^{**})+\sigma^2-\|\boldsymbol{\epsilon}\|_n^2\Big]\mathbf{1}_{\Omega_6\cap\Omega_8}\le l(\hat\theta_{m^{**}},\theta)\left[\frac{L}{\log n}\vee\kappa_2(n)\right],
\]
which concludes the proof.

7.6 Proof of Proposition 3.3

Proof of Proposition 3.3. The approach is similar to the proof of Proposition 1 in [9]. For any model $m\in\mathcal{M}_{\lfloor n/2\rfloor}$, let us define
\[
\Delta(m,m_{\lfloor n/2\rfloor}) := \gamma_n\big(\hat\theta_{m_{\lfloor n/2\rfloor}}\big)\big[1+\operatorname{pen}(m_{\lfloor n/2\rfloor})\big]-\gamma_n\big(\hat\theta_m\big)\big[1+\operatorname{pen}(m)\big].
\]
We shall prove that, with large probability, the quantity $\Delta(m,m_{\lfloor n/2\rfloor})$ is negative for every model $m$ of dimension smaller than $n/4$. Hence, with large probability, $d_{\hat m}$ will be larger than $n/4$. Let us fix a model $m$ of dimension smaller than $n/4$.

First, we use expression (25) to lower bound $\gamma_n(\hat\theta_m)$:
\begin{align*}
\gamma_n(\hat\theta_m) &= \big\|\Pi_m^{\perp}\big(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_{m_{\lfloor n/2\rfloor}}\big)\big\|_n^2+\big\|\Pi_m^{\perp}\big(\boldsymbol{\epsilon}_m-\boldsymbol{\epsilon}_{m_{\lfloor n/2\rfloor}}\big)\big\|_n^2+2\big\langle\Pi_m^{\perp}\big(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_{m_{\lfloor n/2\rfloor}}\big),\ \Pi_m^{\perp}\big(\boldsymbol{\epsilon}_m-\boldsymbol{\epsilon}_{m_{\lfloor n/2\rfloor}}\big)\big\rangle_n\\
&\ge \big\|\Pi_m^{\perp}\big(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_{m_{\lfloor n/2\rfloor}}\big)\big\|_n^2-\Big\langle\Pi_m^{\perp}\big(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_{m_{\lfloor n/2\rfloor}}\big),\ \frac{\Pi_m^{\perp}\big(\boldsymbol{\epsilon}_m-\boldsymbol{\epsilon}_{m_{\lfloor n/2\rfloor}}\big)}{\big\|\Pi_m^{\perp}\big(\boldsymbol{\epsilon}_m-\boldsymbol{\epsilon}_{m_{\lfloor n/2\rfloor}}\big)\big\|_n}\Big\rangle_n^2,
\end{align*}
since $2ab\ge-a^2-b^2$ for any numbers $a$ and $b$. Hence, we may upper bound $\Delta(m,m_{\lfloor n/2\rfloor})$ by
\begin{align}
\Delta(m,m_{\lfloor n/2\rfloor}) &\le \big\|\Pi_{m_{\lfloor n/2\rfloor}}^{\perp}\big(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_{m_{\lfloor n/2\rfloor}}\big)\big\|_n^2\big[\operatorname{pen}(m_{\lfloor n/2\rfloor})-\operatorname{pen}(m)\big]-\Big\|\big[\Pi_m^{\perp}-\Pi_{m_{\lfloor n/2\rfloor}}^{\perp}\big]\big(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_{m_{\lfloor n/2\rfloor}}\big)\Big\|_n^2\big[1+\operatorname{pen}(m)\big]\nonumber\\
&\quad +\Big\langle\Pi_m^{\perp}\big(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_{m_{\lfloor n/2\rfloor}}\big),\ \frac{\Pi_m^{\perp}\big(\boldsymbol{\epsilon}_m-\boldsymbol{\epsilon}_{m_{\lfloor n/2\rfloor}}\big)}{\big\|\Pi_m^{\perp}\big(\boldsymbol{\epsilon}_m-\boldsymbol{\epsilon}_{m_{\lfloor n/2\rfloor}}\big)\big\|_n}\Big\rangle_n^2\big[1+\operatorname{pen}(m)\big]. \tag{63}
\end{align}
Arguing as in the proof of Lemma 2.1, we observe that $\big\|\Pi_{m_{\lfloor n/2\rfloor}}^{\perp}\big(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_{m_{\lfloor n/2\rfloor}}\big)\big\|_n^2\,n/\big[\sigma^2+l(\theta_{m_{\lfloor n/2\rfloor}},\theta)\big]$ follows a $\chi^2$ distribution with $n-\lfloor n/2\rfloor$ degrees of freedom. Analogously, the random variable $\big\|\big[\Pi_m^{\perp}-\Pi_{m_{\lfloor n/2\rfloor}}^{\perp}\big]\big(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_{m_{\lfloor n/2\rfloor}}\big)\big\|_n^2\,n/\big[\sigma^2+l(\theta_{m_{\lfloor n/2\rfloor}},\theta)\big]$ follows a $\chi^2$ distribution with $d_{m_{\lfloor n/2\rfloor}}-d_m$ degrees of freedom. Let us turn to the distribution of the third term. Coming back to the definition of $\boldsymbol{\epsilon}_m$, we observe that
\[
\boldsymbol{\epsilon}_m-\boldsymbol{\epsilon}_{m_{\lfloor n/2\rfloor}}=\mathbf{Y}-\mathbf{X}\theta_m-\big(\mathbf{Y}-\mathbf{X}\theta_{m_{\lfloor n/2\rfloor}}\big)=\mathbf{X}\big(\theta_{m_{\lfloor n/2\rfloor}}-\theta_m\big).
\]
Hence, $\boldsymbol{\epsilon}_m-\boldsymbol{\epsilon}_{m_{\lfloor n/2\rfloor}}$ is both independent of $\mathbf{X}_m$ and of $\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_{m_{\lfloor n/2\rfloor}}$. Consequently, by conditioning and deconditioning, we conclude that the random variable defined in (63) follows (up to a factor $\big[\sigma^2+l(\theta_{m_{\lfloor n/2\rfloor}},\theta)\big]/n$) a $\chi^2$ distribution with one degree of freedom.

Once again, we apply Lemma 7.2 and the classical deviation bound $\mathbb{P}\big(|\mathcal{N}(0,1)|\ge\sqrt{2x}\big)\le e^{-x}$. Let $x$ be some positive number smaller than one that we shall fix later.
There exists an event $\Omega_x$ of probability larger than $1-\exp(-nx/4)-\exp[-(n/4)(1-x)]-e^{-nx}$ such that, for any model of dimension smaller than $n/4$,
\[
\frac{\Delta(m,m_{\lfloor n/2\rfloor})}{\sigma^2+l(\theta_{m_{\lfloor n/2\rfloor}},\theta)}\le\Big(\frac{n-\lfloor n/2\rfloor}{n}\Big)\big(1+2\sqrt{x}+2x\big)\big[\operatorname{pen}(m_{\lfloor n/2\rfloor})-\operatorname{pen}(m)\big]-\frac{\lfloor n/2\rfloor-d_m}{n}\big(1-2\sqrt{x}-x\big)\big[1+\operatorname{pen}(m)\big].
\]
We now replace the penalty terms by their values thanks to Assumption (11). Conditionally on $\Omega_x$, we obtain
\[
\frac{\Delta(m,m_{\lfloor n/2\rfloor})}{\sigma^2+l(\theta_{m_{\lfloor n/2\rfloor}},\theta)}\le\frac{\lfloor n/2\rfloor-d_m}{n}\left\{(1-\nu)\big(\sqrt{x}+x\big)\,L\left[1+\frac{d_m}{n-d_m}\right]-\nu\big(1-2\sqrt{x}-x\big)\right\}.
\]
Since the dimension of the model $m$ is smaller than $n/4$, $\frac{d_m}{n-d_m}$ is smaller than $1/3$. Hence, the last upper bound becomes
\[
\frac{\Delta(m,m_{\lfloor n/2\rfloor})}{\sigma^2+l(\theta_{m_{\lfloor n/2\rfloor}},\theta)}\le\frac{\lfloor n/2\rfloor-d_m}{n}\left\{\frac{16}{3}(1-\nu)\big(\sqrt{x}+x\big)-\nu\big(1-2\sqrt{x}-x\big)\right\}.
\]
There exists some $x(\nu)$ such that, conditionally on $\Omega_{x(\nu)}$, $\Delta(m,m_{\lfloor n/2\rfloor})$ is negative for any model $m$ of dimension smaller than $n/4$. Since $\mathbb{P}(\Omega_{x(\nu)}^c)$ goes exponentially fast to $0$ with $n$, there exists some $n_0(\nu,\delta)$ such that, for any $n$ larger than $n_0(\nu,\delta)$, $\mathbb{P}(\Omega_{x(\nu)}^c)$ is smaller than $\delta$. We have proved that, with probability larger than $1-\delta$, the dimension of $\hat m$ is larger than $n/4$.

Let us simultaneously lower bound the loss $l(\hat\theta_m,\theta_m)$ for every model $m\in\mathcal{M}$ of dimension larger than $n/4$. In the sequel, $\succeq$ means "stochastically larger than". Thanks to (27), we stochastically lower bound $l(\hat\theta_m,\theta_m)$:
\[
l(\hat\theta_m,\theta_m)\ge\frac{n\,\|\Pi_m(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_m)\|_n^2}{\varphi_{\max}(\mathbf{Z}_m^*\mathbf{Z}_m)}\ \succeq\ \frac{n\,\|\Pi_m\boldsymbol{\epsilon}\|_n^2}{\varphi_{\max}(\mathbf{Z}_m^*\mathbf{Z}_m)},
\]
where $\mathbf{Z}_m^*\mathbf{Z}_m$ follows a standard Wishart distribution with parameters $(n,d_m)$. Applying Lemmas 7.2 and 7.4 in order to simultaneously lower bound the losses $l(\hat\theta_m,\theta_m)$, we find an event $\Omega'$ of probability larger than $1-L e^{-nL'}$ such that
\[
l(\hat\theta_m,\theta_m)\mathbf{1}_{\Omega'}\ge\left(1+2\sqrt{\frac{d_m}{n}}\right)^{-2}\frac{d_m}{2n}\,\sigma^2\ \ge\ L\,\frac{d_m}{n}\,\sigma^2,
\]
for any model $m\in\mathcal{M}$ of dimension larger than $n/4$. On the event $\Omega_{x(\nu)}$, the dimension $d_{\hat m}$ is larger than $n/4$. As a consequence, $l(\tilde\theta,\theta_{\hat m})\mathbf{1}_{\Omega'\cap\Omega_{x(\nu)}}\ge L\sigma^2$. All in all, we obtain
\[
\mathbb{E}\big[l(\tilde\theta,\theta)\big]\ge l(\theta_{m_{\lfloor n/2\rfloor}},\theta)+\mathbb{E}\big[\mathbf{1}_{\Omega'\cap\Omega_{x(\nu)}}\,l(\tilde\theta,\theta_{\hat m})\big]\ge l(\theta_{m_{\lfloor n/2\rfloor}},\theta)+\big[1-\mathbb{P}(\Omega_{x(\nu)}^c)-\mathbb{P}(\Omega'^c)\big]L\sigma^2\ge l(\theta_{m_{\lfloor n/2\rfloor}},\theta)+L(\delta,\nu)\,\sigma^2,
\]
if $n$ is larger than some $n_0(\nu,\delta)$.

7.7 Proofs of the minimax lower bounds

All these minimax lower bounds are based on Birgé's version of Fano's lemma [6].

Lemma 7.17 (Birgé's lemma). Let $(\Theta,d)$ be some pseudo-metric space and $\{\mathbb{P}_\theta,\ \theta\in\Theta\}$ be some statistical model. Let $\kappa_B$ denote some absolute constant smaller than one. Then, for any estimator $\hat\theta$ and any finite subset $\bar\Theta$ of $\Theta$, setting $\delta=\min_{\theta,\theta'\in\bar\Theta,\,\theta\neq\theta'}d(\theta,\theta')$, provided that $\max_{\theta,\theta'\in\bar\Theta}\mathcal{K}(\mathbb{P}_\theta,\mathbb{P}_{\theta'})\le\kappa_B\log\operatorname{Card}(\bar\Theta)$, the following lower bound holds for every $p\ge 1$:
\[
\sup_{\theta\in\bar\Theta}\mathbb{E}_\theta\big[d^p(\hat\theta,\theta)\big]\ge 2^{-p}\delta^p(1-\kappa_B).
\]
First, we compute the Kullback–Leibler divergence between the distributions $\mathbb{P}_\theta$ and $\mathbb{P}_{\theta'}$:
\[
\mathcal{K}(\mathbb{P}_\theta;\mathbb{P}_{\theta'})=\mathcal{K}\big(\mathbb{P}_\theta(X);\mathbb{P}_{\theta'}(X)\big)+\mathbb{E}_\theta\big[\mathcal{K}\big(\mathbb{P}_\theta(Y|X);\mathbb{P}_{\theta'}(Y|X)\big)\,\big|\,X\big].
\]
The two marginal distributions $\mathbb{P}_\theta(X)$ and $\mathbb{P}_{\theta'}(X)$ are equal. The conditional distributions $\mathbb{P}_\theta(Y|X)$ and $\mathbb{P}_{\theta'}(Y|X)$ are Gaussian with variance $\sigma^2$ and with means respectively equal to $X\theta$ and $X\theta'$. Hence, the conditional Kullback–Leibler divergence equals
\[
\mathcal{K}\big(\mathbb{P}_\theta(Y|X);\mathbb{P}_{\theta'}(Y|X)\big)=\frac{\big[X(\theta-\theta')\big]^2}{2\sigma^2}.
\]
Reintegrating with respect to $X$ yields
\[
\mathcal{K}(\mathbb{P}_\theta;\mathbb{P}_{\theta'})=\frac{l(\theta',\theta)}{2\sigma^2}\qquad\text{and}\qquad\mathcal{K}\big(\mathbb{P}_\theta^{\otimes n};\mathbb{P}_{\theta'}^{\otimes n}\big)=n\,\frac{l(\theta',\theta)}{2\sigma^2}. \tag{64}
\]
Proof of Proposition 4.1. First, we need a lower bound on the minimax rate of estimation over a subspace of dimension $D$.

Lemma 7.18. Let $D$ be some positive number smaller than $p$ and let $r$ be some arbitrary positive number. Let $S_D$ be the set of vectors in $\mathbb{R}^p$ whose support is included in $\{1,\dots,D\}$. Then, for any estimator $\hat\theta$ of $\theta$,
\[
\sup_{\theta\in S_D,\ l(0_p,\theta)\le Dr^2}\mathbb{E}_\theta\big[l(\hat\theta,\theta)\big]\ge L\,D\Big[r^2\wedge\frac{\sigma^2}{n}\Big]. \tag{65}
\]
Let us fix some $D\in\{1,\dots,p\}$. Consider the set $\Theta_D:=\big\{\theta\in S_D,\ l(0_p,\theta)\le a_D^2R^2\big\}$. Since the $a_j$'s are non-increasing, it holds that
\[
\sum_{i=1}^{p}\frac{l(\theta_{m_{i-1}},\theta_{m_i})}{a_i^2}\le\sum_{i=1}^{D}\frac{l(\theta_{m_{i-1}},\theta_{m_i})}{a_D^2}\le\frac{l(0_p,\theta)}{a_D^2}\le R^2,
\]
for any $\theta\in\Theta_D$.
Proof of Proposition 4.1. First, we need a lower bound on the minimax rate of estimation over a subspace of dimension $D$.

Lemma 7.18. Let $D$ be some positive number smaller than $p$ and let $r$ be some arbitrary positive number. Let $S_D$ be the set of vectors in $\mathbb{R}^p$ whose support is included in $\{1, \ldots, D\}$. Then, for any estimator $\widehat{\theta}$ of $\theta$,
\[
\sup_{\theta \in S_D,\ l(0_p, \theta) \leq D r^2} \mathbb{E}_\theta\bigl[l(\widehat{\theta}, \theta)\bigr] \geq L\,D \left[r^2 \wedge \frac{\sigma^2}{n}\right]. \tag{65}
\]

Let us fix some $D \in \{1, \ldots, p\}$. Consider the set $\Theta_D := \bigl\{\theta \in S_D,\ l(0_p, \theta) \leq a_D^2 R^2\bigr\}$. Since the $a_j$'s are non-increasing, it holds that
\[
\sum_{i=1}^{p} \frac{l(\theta_{m_{i-1}}, \theta_{m_i})}{a_i^2} \leq \sum_{i=1}^{D} \frac{l(\theta_{m_{i-1}}, \theta_{m_i})}{a_D^2} \leq \frac{l(0_p, \theta)}{a_D^2} \leq R^2,
\]
for any $\theta \in \Theta_D$. Hence, $\Theta_D$ is included in $\mathcal{E}_a(R)$. Applying Lemma 7.18 with $r^2 = a_D^2 R^2 / D$, we get
\[
\inf_{\widehat{\theta}} \sup_{\theta \in \mathcal{E}_a(R)} \mathbb{E}_\theta\bigl[l(\widehat{\theta}, \theta)\bigr] \geq L\,D\left[\frac{a_D^2 R^2}{D} \wedge \frac{\sigma^2}{n}\right] \geq L\left[a_D^2 R^2 \wedge \frac{D\,\sigma^2}{n}\right].
\]
Taking the supremum over $D$ in $\{1, \ldots, p\}$ enables us to conclude.

Proof of Lemma 7.18. Let us first assume that $\Sigma = I_p$. Consider the hypercube $\mathcal{C}_D(r) := \{0, r\}^D \times \{0\}^{p-D}$. Thanks to (64), we upper bound the Kullback-Leibler divergence between the distributions $\mathbb{P}_\theta$ and $\mathbb{P}_{\theta'}$:
\[
\mathcal{K}\bigl(\mathbb{P}_\theta^{\otimes n}; \mathbb{P}_{\theta'}^{\otimes n}\bigr) \leq \frac{n\,D\,r^2}{2\sigma^2},
\]
where $\theta$ and $\theta'$ belong to $\mathcal{C}_D(r)$. Then, we apply Varshamov-Gilbert's lemma (e.g. Lemma 4.7 in [25]) to the set $\mathcal{C}_D(r)$.

Lemma 7.19 (Varshamov-Gilbert's lemma). Let $\{0,1\}^D$ be equipped with the Hamming distance $d_H$. There exists some subset $\overline{\Theta}$ of $\{0,1\}^D$ with the following properties: $d_H(\theta, \theta') > D/4$ for every $(\theta, \theta') \in \overline{\Theta}^2$ with $\theta \neq \theta'$, and $\log|\overline{\Theta}| \geq D/8$.

Combining Lemma 7.17 with the set $\overline{\Theta}$ defined in the last lemma yields
\[
\inf_{\widehat{\theta}} \sup_{\theta \in \mathcal{C}_D(r)} \mathbb{E}_\theta\bigl[d_H(\widehat{\theta}, \theta)\bigr] \geq L\,D,
\]
provided that $n D r^2/(2\sigma^2) \leq \kappa D/8$. Coming back to the loss function $l(\cdot, \cdot)$ yields
\[
\inf_{\widehat{\theta}} \sup_{\theta \in \mathcal{C}_D(r)} \mathbb{E}_\theta\bigl[l(\widehat{\theta}, \theta)\bigr] \geq L\,D\,r^2, \quad \text{if } r^2 \leq L\,\frac{\sigma^2}{n}.
\]
Finally, we get
\[
\inf_{\widehat{\theta}} \sup_{\theta \in S_D,\ l(0_p, \theta) \leq D r^2} \mathbb{E}_\theta\bigl[l(\widehat{\theta}, \theta)\bigr] \geq L\,D\left[r^2 \wedge \frac{\sigma^2}{n}\right].
\]
If we no longer assume that the covariance matrix $\Sigma$ is the identity, we orthogonalize the sequence $(X_i)$ thanks to the Gram-Schmidt process. Applying the previous argument to this new sequence of covariates allows us to conclude.

Proof of Corollary 4.2. This result follows from the upper bound on the risk of $\widetilde{\theta}$ in Theorem 3.1 and from the minimax lower bound of Proposition 4.1. Let $\mathcal{E}_a(R)$ be an ellipsoid satisfying $\sigma^2/n \leq R^2 \leq n^{\beta}\sigma^2$; then $l(0_p, \theta)$ is smaller than $n^{\beta}\sigma^2$. By Theorem 3.1, the estimator $\widetilde{\theta}$ defined with the collection $\mathcal{M}_{\lfloor n/2\rfloor \wedge p}$ and $\mathrm{pen}(m) = K\,d_m/(n - d_m)$ satisfies
\[
\mathbb{E}_\theta\bigl[l(\widetilde{\theta}, \theta)\bigr] \leq L(K) \inf_{1 \leq i \leq \lfloor n/2\rfloor \wedge p}\left\{ l(\theta_{m_i}, \theta) + K\,\frac{i}{n - i}\,\bigl[\sigma^2 + l(\theta_{m_i}, \theta)\bigr]\right\} + L(K, \beta)\,\frac{\sigma^2}{n} \leq L(K, \beta) \inf_{1 \leq i \leq \lfloor n/2\rfloor \wedge p}\left[ l(\theta_{m_i}, \theta) + \frac{i}{n}\,\sigma^2\right].
\]
If $\theta$ belongs to $\mathcal{E}_a(R)$, then
\[
l(\theta_{m_i}, \theta) \leq a_{i+1}^2 \sum_{j=i+1}^{p} \frac{l(\theta_{m_j}, \theta_{m_{j-1}})}{a_j^2} \leq R^2\,a_{i+1}^2,
\]
since the $(a_i)$'s are non-increasing. It follows that
\[
\mathbb{E}_\theta\bigl[l(\widetilde{\theta}, \theta)\bigr] \leq L(K, \beta) \inf_{1 \leq i \leq \lfloor n/2\rfloor \wedge p}\left[ R^2\,a_{i+1}^2 + \frac{i}{n}\,\sigma^2\right]. \tag{66}
\]
Let us define
\[
i^* := \sup\left\{1 \leq i \leq p,\ R^2 a_i^2 \geq \sigma^2\,\frac{i}{n}\right\},
\]
with the convention $\sup \emptyset = 0$. Since $R^2 \geq \sigma^2/n$, $i^*$ is larger than or equal to one. By Proposition 4.1, the minimax rate of estimation is lower bounded as follows:
\[
\inf_{\widehat{\theta}} \sup_{\theta \in \mathcal{E}_a(R)} \mathbb{E}_\theta\bigl[l(\widehat{\theta}, \theta)\bigr] \geq L\left[a_{i^*+1}^2 R^2 \vee \frac{\sigma^2\,i^*}{n}\right] \geq L\left[a_{i^*+1}^2 R^2 + \frac{\sigma^2\,i^*}{n}\right].
\]
If either $p \leq n$ or $a_{\lfloor n/2\rfloor + 1}^2 R^2 \leq \sigma^2/2$, then $i^*$ is smaller than or equal to $\lfloor n/2\rfloor \wedge p$ and we obtain, thanks to (66), that
\[
\mathbb{E}_\theta\bigl[l(\widetilde{\theta}, \theta)\bigr] \leq L(K, \beta)\left[a_{i^*+1}^2 R^2 + \frac{\sigma^2\,i^*}{n}\right] \leq L(K, \beta) \inf_{\widehat{\theta}} \sup_{\theta \in \mathcal{E}_a(R)} \mathbb{E}\bigl[l(\widehat{\theta}, \theta)\bigr].
\]
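The packing set of Lemma 7.19 can be produced by a brute-force greedy scan for small $D$. The sketch below is an illustration, not the lemma's proof: it collects points of $\{0,1\}^D$ that are pairwise more than $D/4$ apart in Hamming distance and compares the resulting cardinality with the guaranteed $e^{D/8}$; the value of $D$ is an arbitrary small choice.

```python
import numpy as np
from itertools import product

# Brute-force greedy sketch of the Varshamov-Gilbert packing (Lemma 7.19): keep the
# points of {0,1}^D that are pairwise more than D/4 apart in Hamming distance.
# Exhaustive enumeration, so this is only meant for small, illustrative D.
D = 12
packing = []
for point in product((0, 1), repeat=D):
    v = np.array(point)
    if all(np.sum(v != w) > D / 4 for w in packing):
        packing.append(v)

# The lemma guarantees a packing of cardinality at least exp(D/8).
print(len(packing), np.exp(D / 8))
```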
Proof of Proposition 4.3. First, we use (64) to upper bound the Kullback-Leibler divergence between the distributions corresponding to parameters $\theta$ and $\theta'$ in the set $\Theta[k, p](r)$:
\[
\mathcal{K}\bigl(\mathbb{P}_\theta^{\otimes n}; \mathbb{P}_{\theta'}^{\otimes n}\bigr) \leq \frac{n\,k\,r^2}{\sigma^2},
\]
since the covariates are i.i.d. standard Gaussian variables. Let us state a combinatorial argument due to Birgé and Massart [7].

Lemma 7.20. Let $\{0,1\}^p$ be equipped with the Hamming distance $d_H$ and, given $1 \leq k \leq p/2$, define $\{0,1\}^p_k := \{x \in \{0,1\}^p :\ d_H(0, x) = k\}$. There exists some subset $\overline{\Theta}$ of $\{0,1\}^p_k$ with the following properties: $d_H(\theta, \theta') > k/2$ for every $(\theta, \theta') \in \overline{\Theta}^2$ with $\theta \neq \theta'$, and $\log|\overline{\Theta}| \geq \frac{k}{5}\log\frac{p}{k}$.

Suppose first that $k$ is smaller than $p/4$. Applying Lemma 7.17 with the Hamming distance $d_H$ and the set $r\overline{\Theta}$ introduced in Lemma 7.20 yields
\[
\inf_{\widehat{\theta}} \sup_{\theta \in \Theta[k,p](r)} \mathbb{E}_\theta\bigl[d_H(\widehat{\theta}, \theta)\bigr] \geq L\,k,
\]
provided that
\[
\frac{n\,k\,r^2}{\sigma^2} \leq \frac{k}{10}\,\log\frac{p}{k}. \tag{67}
\]
Since the covariates $X_i$ are independent and of variance 1, this lower bound in Hamming distance translates into
\[
\inf_{\widehat{\theta}} \sup_{\theta \in \Theta[k,p](r)} \mathbb{E}_\theta\bigl[l(\widehat{\theta}, \theta)\bigr] \geq L\,k\,r^2,
\]
provided that (67) holds. All in all, we obtain
\[
\inf_{\widehat{\theta}} \sup_{\theta \in \Theta[k,p](r)} \mathbb{E}_\theta\bigl[l(\widehat{\theta}, \theta)\bigr] \geq L\,k\left(r^2 \wedge \frac{\log(p/k)}{n}\,\sigma^2\right).
\]
Since $p/k$ is larger than $4$, we obtain the desired lower bound by changing the constant $L$:
\[
\inf_{\widehat{\theta}} \sup_{\theta \in \Theta[k,p](r)} \mathbb{E}_\theta\bigl[l(\widehat{\theta}, \theta)\bigr] \geq L\,k\left(r^2 \wedge \frac{1 + \log(p/k)}{n}\,\sigma^2\right).
\]
If $p/k$ is smaller than $4$, we know from the proof of Lemma 7.18 that
\[
\inf_{\widehat{\theta}} \sup_{\theta \in \mathcal{C}_k(r)} \mathbb{E}_\theta\bigl[l(\widehat{\theta}, \theta)\bigr] \geq L\,k\left(r^2 \wedge \frac{\sigma^2}{n}\right).
\]
We conclude by observing that $\log(p/k)$ is then smaller than $\log(4)$ and that $\mathcal{C}_k(r)$ is included in $\Theta[k, p](r)$.
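The same greedy idea illustrates the constant-weight packing of Lemma 7.20. In the sketch below, the parameters are illustrative only and the bound $(k/5)\log(p/k)$ mirrors the statement as reconstructed above; this is a demonstration, not the combinatorial proof of [7].

```python
import numpy as np
from itertools import combinations

# Greedy sketch of the Birgé-Massart packing on the constant-weight set {0,1}^p_k:
# keep the k-sparse binary vectors that are pairwise more than k/2 apart in
# Hamming distance. Exhaustive, so only for small, illustrative (p, k).
p, k = 16, 4
packing = []
for support in combinations(range(p), k):
    v = np.zeros(p, dtype=int)
    v[list(support)] = 1
    if all(np.sum(v != w) > k / 2 for w in packing):
        packing.append(v)

print(len(packing), np.exp(k / 5 * np.log(p / k)))  # |Theta| exceeds (p/k)^(k/5)
```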
Proof of Proposition 4.5. Assume first that the covariates $(X_i)$ have unit variance; if this is not the case, one only has to rescale them. By Condition (22), the Kullback-Leibler divergence between the distributions corresponding to parameters $\theta$ and $\theta'$ in the set $\Theta[k, p](r)$ satisfies
\[
\mathcal{K}\bigl(\mathbb{P}_\theta^{\otimes n}; \mathbb{P}_{\theta'}^{\otimes n}\bigr) \leq (1 + \delta)\,\frac{n\,k\,r^2}{\sigma^2}.
\]
We recall that $\|\cdot\|$ refers to the canonical norm in $\mathbb{R}^p$. Arguing as in the proof of Proposition 4.3, we lower bound the risk of any estimator $\widehat{\theta}$ with respect to the loss function $\|\cdot\|^2$:
\[
\inf_{\widehat{\theta}} \sup_{\theta \in \Theta[k,p](r)} \mathbb{E}_\theta\bigl[\|\widehat{\theta} - \theta\|^2\bigr] \geq L\,k\left(r^2 \wedge \frac{1 + \log(p/k)}{(1+\delta)\,n}\,\sigma^2\right).
\]
Applying Assumption (22) again allows us to obtain the desired lower bound on the risk:
\[
\inf_{\widehat{\theta}} \sup_{\theta \in \Theta[k,p](r)} \mathbb{E}_\theta\bigl[l(\widehat{\theta}, \theta)\bigr] \geq L\,k\,(1 - \delta)\left(r^2 \wedge \frac{1 + \log(p/k)}{(1+\delta)\,n}\,\sigma^2\right).
\]

Proof of Proposition 4.6. In short, we exhibit a subset $\Phi \subset \{1, \ldots, p\}$ of covariates whose correlation matrix satisfies a $1/2$-Restricted Isometry Property of size $k$; we then apply Proposition 4.5 to this subset of covariates.

We first consider the correlation matrix $\Psi_1(\omega)$. Let us pick a maximal subset $\Phi \subset \{1, \ldots, p\}$ of points that are $\lceil \log(4k)/\omega\rceil$-spaced with respect to the toroidal distance. Hence, the cardinality of $\Phi$ is $\lfloor p\,\lceil \log(4k)/\omega\rceil^{-1}\rfloor$. Assume that $k$ is smaller than this quantity. We call $C$ the correlation matrix of the covariates that belong to $\Phi$. Clearly, for any $(i, j) \in \Phi^2$ with $i \neq j$, it holds that $|C(i,j)| \leq 1/(4k)$. Hence, any submatrix of $C$ of size $k$ is diagonally dominant, the sum of the absolute values of the non-diagonal elements of each row being smaller than $1/4$. The eigenvalues of any size-$k$ submatrix of $C$ therefore lie between $3/4$ and $5/4$, so that $C$ satisfies a $1/2$-Restricted Isometry Property of size $k$. Consequently, we may apply Proposition 4.5 with the subset of covariates $\Phi$, and the result follows. The second case is handled similarly.

Definition of the correlations

Let us now justify why these correlations are well-defined when $p$ is an odd integer. We shall prove that the matrices $\Psi_1(\omega)$ and $\Psi_2(t)$ are non-negative. Observe that these two matrices are symmetric and circulant: there exists a family of numbers $(a_k)_{0 \leq k \leq p-1}$ such that
\[
\Psi_1(\omega)[i, j] = a_{(i - j) \bmod p} \quad \text{for any } 1 \leq i, j \leq p.
\]
Such matrices are known to be jointly diagonalizable in the same basis, and their eigenvalues correspond to the discrete Fourier transform of $(a_k)$. More precisely, the eigenvalues $(\lambda_l)_{1 \leq l \leq p}$ are expressed as
\[
\lambda_l := \sum_{k=0}^{p-1} \exp\left(\frac{2 i \pi k l}{p}\right) a_k. \tag{68}
\]
We refer to [27], Sect. 2.6.2, for more details.
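Formula (68) can be evaluated directly with a fast Fourier transform. The sketch below uses illustrative parameters and assumes the two correlation families $a_k = \exp(-\omega\,(k \wedge (p-k)))$ and $a_k = [1 + (k \wedge (p-k))]^{-t}$ described next; it checks numerically that all eigenvalues are non-negative for an odd $p$.

```python
import numpy as np

p, omega, t = 101, 0.3, 1.5      # illustrative odd p and decay parameters
k = np.arange(p)
dist = np.minimum(k, p - k)      # toroidal distance k ∧ (p − k)

for a in (np.exp(-omega * dist),           # exponential correlation (CASE 1 below)
          (1.0 + dist) ** (-t)):           # polynomial correlation (CASE 2 below)
    # Eigenvalues of a symmetric circulant matrix = discrete Fourier transform of
    # its first row, cf. (68); they are real since (a_k) is symmetric.
    lam = np.fft.fft(a).real
    print(lam.min() >= -1e-10)   # True: the matrix is non-negative, hence a correlation
```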
In the first example, $a_k$ equals $\exp(-\omega\,(k \wedge (p - k)))$, whereas it equals $[1 + (k \wedge (p - k))]^{-t}$ in the second example.

CASE 1: Using expression (68), one can compute $\lambda_l$:
\begin{align*}
\lambda_l &= -1 + 2\sum_{k=0}^{(p-1)/2} \cos\left(\frac{2\pi k l}{p}\right) \exp(-k\omega) = -1 + 2\,\mathrm{Re}\sum_{k=0}^{(p-1)/2} \exp\left[k\left(\frac{2 i \pi l}{p} - \omega\right)\right] \\
&= -1 + 2\,\mathrm{Re}\,\frac{1 - e^{-\omega\frac{p+1}{2}}\,(-1)^l\, e^{\frac{i\pi l}{p}}}{1 - e^{-\omega + \frac{2 i \pi l}{p}}} = -1 + 2\,\frac{1 - e^{-\omega}\cos\left(\frac{2\pi l}{p}\right) + e^{-\omega\frac{p+1}{2}}\,(-1)^l \cos\left(\frac{\pi l}{p}\right)\bigl(e^{-\omega} - 1\bigr)}{1 - 2 e^{-\omega}\cos\left(\frac{2\pi l}{p}\right) + e^{-2\omega}}.
\end{align*}
Hence, we obtain that
\[
\lambda_l \geq 0 \iff 1 - e^{-2\omega} + 2\,e^{-\omega\frac{p+1}{2}}\,(-1)^l \cos\left(\frac{\pi l}{p}\right)\bigl(e^{-\omega} - 1\bigr) \geq 0.
\]
It is sufficient to prove that
\[
1 - e^{-2\omega} + 2\,e^{-\omega\frac{p+3}{2}} - 2\,e^{-\omega\frac{p+1}{2}} \geq 0.
\]
This last expression is non-negative when $\omega$ equals zero and is increasing with respect to $\omega$. We conclude that $\lambda_l$ is non-negative for any $1 \leq l \leq p$. The matrix $\Psi_1(\omega)$ is therefore non-negative and defines a correlation.

CASE 2: Let us prove that the corresponding eigenvalues $\lambda_l$ are non-negative. Here,
\[
\lambda_l = -1 + 2\sum_{k=0}^{(p-1)/2} \cos\left(\frac{2\pi k l}{p}\right)(k + 1)^{-t}.
\]
Using the identity
\[
(k + 1)^{-t} = \frac{1}{\Gamma(t)} \int_0^{\infty} e^{-r(k+1)}\, r^{t-1}\, dr,
\]
we decompose $\lambda_l$ into a sum of integrals:
\[
\lambda_l = \frac{1}{\Gamma(t)} \int_0^{\infty} r^{t-1}\, e^{-r} \left[-1 + 2\sum_{k=0}^{(p-1)/2} \cos\left(\frac{2\pi k l}{p}\right) e^{-r k}\right] dr.
\]
The term inside the brackets corresponds to the eigenvalue $\lambda_l$ for an exponential correlation with parameter $r$ (CASE 1). This expression is therefore non-negative for any $r > 0$. In conclusion, the matrix $\Psi_2(t)$ is non-negative and the correlation is well-defined.

Appendix

Proof of Lemma 7.1. We recall that $\gamma_n(\widehat{\theta}_m) = \|\mathbf{Y} - \Pi_m \mathbf{Y}\|_n^2$. Thanks to the definition (23) of $\epsilon$ and $\epsilon_m$, we obtain the first result. Let us turn to the mean squared error $\gamma(\widehat{\theta}_m)$. In the following computation, $\widehat{\theta}_m$ is considered as fixed and we only use the fact that $\widehat{\theta}_m$ belongs to $S_m$. By definition,
\[
\gamma(\widehat{\theta}_m) = \mathbb{E}_{Y,X}\bigl[(Y - X\widehat{\theta}_m)^2\bigr] = \sigma^2 + \mathbb{E}_X\bigl[\bigl(X(\theta - \widehat{\theta}_m)\bigr)^2\bigr] = \sigma^2 + l(\theta_m, \theta) + l(\widehat{\theta}_m, \theta_m),
\]
since $\theta_m$ is the orthogonal projection of $\theta$ with respect to the inner product associated with the loss $l(\cdot, \cdot)$. We then derive that
\[
l(\widehat{\theta}_m, \theta_m) = \mathbb{E}_{X_m}\bigl[\bigl(X_m(\theta_m - \widehat{\theta}_m)\bigr)^2\bigr] = (\theta_m - \widehat{\theta}_m)^*\, \Sigma_m\, (\theta_m - \widehat{\theta}_m).
\]
Since $\widehat{\theta}_m$ is the least-squares estimator of $\theta_m$, it follows from (23) that
\[
l(\widehat{\theta}_m, \theta_m) = (\epsilon + \epsilon_m)^*\, \mathbf{X}_m (\mathbf{X}_m^* \mathbf{X}_m)^{-1}\, \Sigma_m\, (\mathbf{X}_m^* \mathbf{X}_m)^{-1} \mathbf{X}_m^*\, (\epsilon + \epsilon_m).
\]
Replacing $\mathbf{X}_m$ by $\mathbf{Z}_m \sqrt{\Sigma_m}$, we therefore obtain
\[
l(\widehat{\theta}_m, \theta_m) = (\epsilon + \epsilon_m)^*\, \mathbf{Z}_m (\mathbf{Z}_m^* \mathbf{Z}_m)^{-2}\, \mathbf{Z}_m^*\, (\epsilon + \epsilon_m).
\]

Proof of Lemma 2.1. Thanks to Equation (25), we know that $\gamma_n(\widehat{\theta}_m) = \|\Pi_m^{\perp}(\epsilon + \epsilon_m)\|_n^2$. The variance of $\epsilon + \epsilon_m$ is $\sigma^2 + l(\theta_m, \theta)$. Since $\epsilon + \epsilon_m$ is independent of $\mathbf{X}_m$, the quantity $n\,\gamma_n(\widehat{\theta}_m)/[\sigma^2 + l(\theta_m, \theta)]$ follows a $\chi^2$ distribution with $n - d_m$ degrees of freedom, and the first result follows.

Let us turn to the expectation of $\gamma(\widehat{\theta}_m)$. By (26) and the arguments of the proof of Lemma 7.1, $\gamma(\widehat{\theta}_m)$ equals
\[
\gamma(\widehat{\theta}_m) = \sigma^2 + l(\theta_m, \theta) + (\epsilon + \epsilon_m)^*\, \mathbf{Z}_m (\mathbf{Z}_m^* \mathbf{Z}_m)^{-2}\, \mathbf{Z}_m^*\, (\epsilon + \epsilon_m).
\]
Since $\epsilon + \epsilon_m$ and $\mathbf{X}_m$ are independent, one may integrate with respect to $\epsilon + \epsilon_m$:
\[
\mathbb{E}\bigl[\gamma(\widehat{\theta}_m)\bigr] = \bigl[\sigma^2 + l(\theta_m, \theta)\bigr]\left\{1 + \mathbb{E}\bigl[\mathrm{tr}\bigl((\mathbf{Z}_m^* \mathbf{Z}_m)^{-1}\bigr)\bigr]\right\},
\]
where the last term is the expectation of the trace of an inverse standard Wishart matrix with parameters $(n, d_m)$. Thanks to [37], we know that it equals $d_m/(n - d_m - 1)$.
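The inverse Wishart trace formula borrowed from [37] is easy to check by simulation. The following minimal sketch (illustrative parameters, not from the paper) estimates $\mathbb{E}[\mathrm{tr}((\mathbf{Z}^*\mathbf{Z})^{-1})]$ for an $n \times d$ standard Gaussian matrix $\mathbf{Z}$ and compares it with $d/(n - d - 1)$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, reps = 50, 10, 20_000   # illustrative Wishart parameters (n, d)

# Monte Carlo estimate of E[tr((Z'Z)^{-1})] for an n x d standard Gaussian matrix Z,
# i.e. the expected trace of an inverse standard Wishart matrix with parameters (n, d).
traces = []
for _ in range(reps):
    Z = rng.standard_normal((n, d))
    traces.append(np.trace(np.linalg.inv(Z.T @ Z)))

print(np.mean(traces), d / (n - d - 1))  # von Rosen's formula gives d/(n - d - 1)
```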
Proof of Lemma 7.3. The random variable $\sqrt{\chi^2(d)}$ may be interpreted as a Lipschitz function with constant 1 on $\mathbb{R}^d$ equipped with the standard Gaussian measure. Hence, we may apply the Gaussian concentration theorem (see e.g. [25], Th. 3.4): for any $x > 0$,
\[
\mathbb{P}\left(\sqrt{\chi^2(d)} \leq \mathbb{E}\left[\sqrt{\chi^2(d)}\right] - \sqrt{2x}\right) \leq \exp(-x). \tag{69}
\]
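A quick simulation of (69) is possible with illustrative parameters; in the sketch below the expectation is replaced by its empirical counterpart, and the point is only to confirm that the left deviation probability stays below $e^{-x}$.

```python
import numpy as np

rng = np.random.default_rng(3)
d, x, reps = 30, 2.0, 400_000   # illustrative degrees of freedom and deviation level

# Empirical check of (69): the norm of a standard Gaussian vector (a 1-Lipschitz
# function of the vector) falls sqrt(2x) below its mean with probability <= exp(-x).
samples = np.sqrt(rng.chisquare(d, size=reps))
lhs = (samples <= samples.mean() - np.sqrt(2 * x)).mean()
print(lhs, np.exp(-x))   # lhs should not exceed exp(-x)
```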
In order to conclude, we need to lower bound $\mathbb{E}[\sqrt{\chi^2(d)}]$. Let us introduce the variable $Z := 1 - \sqrt{\chi^2(d)/d}$. By definition, $Z$ is smaller than or equal to one, and we upper bound $\mathbb{E}(Z)$ as
\[
\mathbb{E}(Z) \leq \int_0^1 \mathbb{P}(Z \geq t)\, dt.
\]
Let us upper bound $\mathbb{P}(Z \geq t)$ for any $0 \leq t \leq 1$ by applying Lemma 7.2:
\[
\mathbb{P}(Z \geq t) = \mathbb{P}\bigl(\chi^2(d) \leq d\,[1 - t]^2\bigr) = \mathbb{P}\left(\chi^2(d) \leq d - 2\sqrt{d}\,\frac{\sqrt{d}\,t\,(2 - t)}{2}\right) \leq \exp\left(-\frac{d\,t^2 (2 - t)^2}{4}\right) \leq \exp\left(-\frac{d\,t^2}{4}\right),
\]
since $2 - t \geq 1$. Gathering this upper bound with the previous inequality yields
\[
\mathbb{E}(Z) \leq \int_0^{\infty} \exp\left(-\frac{d\,t^2}{4}\right) dt = \sqrt{\frac{\pi}{d}}.
\]
Thus, we obtain
\[
\mathbb{E}\left[\sqrt{\chi^2(d)}\right] = \sqrt{d}\,\bigl(1 - \mathbb{E}(Z)\bigr) \geq \sqrt{d} - \sqrt{\pi}.
\]
Combining this lower bound with (69) allows us to conclude.
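The lower bound obtained above, $\mathbb{E}[\sqrt{\chi^2(d)}] \geq \sqrt{d} - \sqrt{\pi}$ as reconstructed here, can also be checked by simulation with a few illustrative values of $d$:

```python
import numpy as np

rng = np.random.default_rng(2)
for d in (5, 20, 100):
    # Monte Carlo estimate of E[sqrt(chi2(d))] versus the bound sqrt(d) - sqrt(pi).
    samples = np.sqrt(rng.chisquare(d, size=500_000))
    print(d, samples.mean(), np.sqrt(d) - np.sqrt(np.pi))
```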
Acknowledgements

I gratefully thank Pascal Massart for many fruitful discussions. I would also like to thank the referee for suggestions that led to an improvement of the paper.

References

[1] H. Akaike. Statistical predictor identification. Ann. Inst. Statist. Math., 22:203–217, 1970.
[2] H. Akaike. A new look at the statistical model identification. IEEE Trans. Automatic Control, AC-19:716–723, 1974. System identification and time-series analysis.
[3] S. Arlot. Model selection by resampling penalization, 2008. oai:hal.archives-ouvertes.fr:hal-00262478_v1.
[4] Y. Baraud, C. Giraud, and S. Huet. Gaussian model selection with an unknown variance. Ann. Statist., 37(2):630–672, 2009.
[5] P. Bickel, Y. Ritov, and A. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. Ann. Statist. (to appear), 2009.
[6] L. Birgé. A new lower bound for multiple hypothesis testing. IEEE Trans. Inform. Theory, 51(4):1611–1615, 2005.
[7] L. Birgé and P. Massart. Minimum contrast estimators on sieves: exponential bounds and rates of convergence. Bernoulli, 4(3):329–375, 1998.
[8] L. Birgé and P. Massart. Gaussian model selection. J. Eur. Math. Soc. (JEMS), 3(3):203–268, 2001.
[9] L. Birgé and P. Massart. Minimal penalties for Gaussian model selection. Probab. Theory Related Fields, 138(1-2):33–73, 2007.
[10] F. Bunea, A. Tsybakov, and M. Wegkamp. Aggregation for Gaussian regression. Ann. Statist., 35(4):1674–1697, 2007.
[11] F. Bunea, A. Tsybakov, and M. Wegkamp. Sparsity oracle inequalities for the Lasso. Electron. J. Stat., 1:169–194 (electronic), 2007.
[12] E. Candès and Y. Plan. Near-ideal model selection by $\ell_1$ minimization. Ann. Statist. (to appear), 2009.
[13] E. Candès and T. Tao. The Dantzig selector: statistical estimation when p is much larger than n. Ann. Statist., 35(6):2313–2351, 2007.
[14] E. J. Candès and T. Tao. Decoding by linear programming. IEEE Trans. Inform. Theory, 51(12):4203–4215, 2005.
[15] R. G. Cowell, A. P. Dawid, S. L. Lauritzen, and D. J. Spiegelhalter. Probabilistic Networks and Expert Systems. Statistics for Engineering and Information Science. Springer-Verlag, New York, 1999.
[16] N. A. C. Cressie. Statistics for Spatial Data. Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics. John Wiley & Sons Inc., New York, 1993. Revised reprint of the 1991 edition, a Wiley-Interscience publication.
[17] K. R. Davidson and S. J. Szarek. Local operator theory, random matrices and Banach spaces. In Handbook of the Geometry of Banach Spaces, Vol. I, pages 317–366. North-Holland, Amsterdam, 2001.
[18] V. H. de la Peña and E. Giné. Decoupling. Probability and its Applications (New York). Springer-Verlag, New York, 1999. From dependence to independence, randomly stopped processes, U-statistics and processes, martingales and beyond.
[19] C. Giraud. Estimation of Gaussian graphs by model selection. Electron. J. Stat., 2:542–563, 2008.
[20] T. Gneiting. Power-law correlations, related models for long-range dependence and their simulation. J. Appl. Probab., 37(4):1104–1109, 2000.
[21] M. Kalisch and P. Bühlmann. Estimating high-dimensional directed acyclic graphs with the PC-algorithm. J. Mach. Learn. Res., 8:613–636, 2007.
[22] B. Laurent and P. Massart. Adaptive estimation of a quadratic functional by model selection. Ann. Statist., 28(5):1302–1338, 2000.
[23] S. L. Lauritzen. Graphical Models, volume 17 of Oxford Statistical Science Series. The Clarendon Press, Oxford University Press, New York, 1996. Oxford Science Publications.
[24] C. L. Mallows. Some comments on $C_p$. Technometrics, 15:661–675, 1973.
[25] P. Massart. Concentration Inequalities and Model Selection, volume 1896 of Lecture Notes in Mathematics. Springer, Berlin, 2007. Lectures from the 33rd Summer School on Probability Theory held in Saint-Flour, July 6–23, 2003, with a foreword by Jean Picard.
[26] N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the lasso. Ann. Statist., 34(3):1436–1462, 2006.
[27] H. Rue and L. Held. Gaussian Markov Random Fields: Theory and Applications, volume 104 of Monographs on Statistics and Applied Probability. Chapman & Hall/CRC, London, 2005.
[28] K. Sachs, O. Perez, D. Pe'er, D. A. Lauffenburger, and G. P. Nolan. Causal protein-signaling networks derived from multiparameter single-cell data. Science, 308:523–529, 2005.
[29] J. Schäfer and K. Strimmer. An empirical Bayes approach to inferring large-scale gene association networks. Bioinformatics, 21:754–764, 2005.
[30] G. Schwarz. Estimating the dimension of a model. Ann. Statist., 6(2):461–464, 1978.
[31] R. Shibata. An optimal selection of regression variables. Biometrika, 68(1):45–54, 1981.
[32] C. Stone. An asymptotically optimal histogram selection rule. In Proceedings of the Berkeley Conference in Honor of Jerzy Neyman and Jack Kiefer, Vol. II (Berkeley, Calif., 1983), Wadsworth Statist./Probab. Ser., pages 513–520, Belmont, CA, 1985. Wadsworth.
[33] R. Tibshirani. Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B, 58(1):267–288, 1996.
[34] R. Tibshirani. Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B, 58(1):267–288, 1996.
[35] A. Tsybakov. Optimal rates of aggregation. In 16th Annual Conference on Learning Theory, volume 2777, pages 303–313. Springer-Verlag, 2003.
[36] N. Verzelen and F. Villers. Goodness-of-fit tests for high-dimensional Gaussian linear models. Ann. Statist. (to appear), 2009.
[37] D. von Rosen. Moments for the inverted Wishart distribution. Scand. J. Statist., 15(2):97–109, 1988.
[38] M. J. Wainwright. Information-theoretic limits on sparsity recovery in the high-dimensional and noisy setting. Technical Report 725, Department of Statistics, UC Berkeley, 2007.
[39] A. Wille, P. Zimmermann, E. Vranová, A. Fürholz, O. Laule, S. Bleuler, L. Hennig, A. Prelić, P. von Rohr, L. Thiele, E. Zitzler, W. Gruissem, and P. Bühlmann. Sparse graphical Gaussian modelling of the isoprenoid gene network in Arabidopsis thaliana. Genome Biology, 5(11), 2004.
[40] P. Zhao and B. Yu. On model selection consistency of Lasso. J. Mach. Learn. Res., 7:2541–2563, 2006.
[41] H. Zou. The adaptive lasso and its oracle properties. J. Amer. Statist. Assoc., 101(476):1418–1429, 2006.