aa r X i v : . [ m a t h . S T ] J u l Optimal model selection in density estimation
Matthieu Lerasle ∗ Abstract
We build penalized least-squares estimators using the slope heuristic and re-sampling penalties. We prove oracle inequalities for the selected estimator withleading constant asymptotically equal to 1. We compare the practical perfor-mances of these methods in a short simulation study.
Key words:
Density estimation, optimal model selection, resampling methods, slopeheuristic.
The aim of model selection is to construct data-driven criteria to select a model among agiven list. The history of statistical model selection goes back at least to Akaike [1], [2] andMallows [18]. They proposed to select among a collection of parametric models the onewhich minimizes an empirical loss plus some penalty term proportional to the dimensionof the model. Birg´e & Massart [8] and Barron, Birg´e & Massart [6] generalized thisapproach, making in particular the link between model selection and adaptive estimation.They proved that previous methods, in particular cross-validation (see Rudemo [20]) andhard thresholding (see Donoho et.al. [12]) can be viewed as penalization methods. Morerecently, Birg´e & Massart [9], Arlot & Massart [5] and Arlot [4], (see also [3]) arised theproblem of optimal efficient model selection. Basically, the aim is to select an estimatorsatisfying an oracle inequality with leading constant asymptotically equal to 1. Theyobtained such procedures thanks to a sharp estimator of the ideal penalty pen id . We willbe interested in two natural ideas, that are used in practice to evaluate pen id and provedto be efficient in other frameworks. The first one is the slope heuristic. It was introducedin Birg´e & Massart [9] in Gaussian regression and developed in Arlot & Massart [5] ina M -estimation framework. It allows to optimize the choice of a leading constant inthe penalty term, provided that we know the shape of pen id . The other one is Efron’sresampling heuristic. The basic idea comes from Efron [14] and was used by Fromont [15]in the classification framework. Then, Arlot [4] made the link with ideal penalties anddeveloped the general procedure. Up to our knowledge, these methods have only beentheoretically validated in regression frameworks. We propose here to prove their efficiencyin density estimation. Let us now explain more precisely our context. In this paper, we define and study efficient penalized least-squares estimators in the den-sity estimation framework when the error is measured with the L -loss. We observe n ∗ IME-USP, granted by FAPESP Processo 2009/09494-0 X , ..., X n , defined on a probability space (Ω , A , P ), valued in ameasurable space ( X , X ), with common law P . We assume that a measure µ on ( X , X ) isgiven and we denote by L ( µ ) the Hilbert space of square integrable real valued functionsdefined on X . L ( µ ) is endowed with its classical scalar product, defined for all t, t ′ in L ( µ ) by < t, t ′ > = Z X t ( x ) t ′ ( x ) dµ ( x )and the associated L -norm k . k , defined for all t in L ( µ ) by k t k = √ < t, t > . Theparameter of interest is the density s of P with respect to µ , we assume that it belongsto L ( µ ). The risk of an estimator ˆ s of s is measured with the L -loss, that is k s − ˆ s k ,which is random when ˆ s is. s minimizes the integrated quadratic contrast P Q ( t ), where Q : L ( µ ) → L ( P ) is definedfor all t in L ( µ ) by Q ( t ) = k t k − t . Hence, density estimation is a problem of M -estimation. These problems are classically solved in two steps. First, we choose a ”model” S m that should be close to the parameter s , which means that inf t ∈ S m k s − t k is ”small”.Then, we minimize over S m the empirical version of the integrated contrast, that is, wechoose ˆ s m ∈ arg min t ∈ S m P n Q ( t ) . (1)This last minimization can be computationaly untractable for general sets S m , leading tountractable procedures in practice. 
However, it can be easily solved when S m is a linearsubspace of L ( µ ) since, for all orthonormal basis ( ψ λ ) λ ∈ m ,ˆ s m = X λ ∈ m ( P n ψ λ ) ψ λ . (2)Thus, we will always assume that a model is a linear subspace in L ( µ ). The risk of theleast-squares estimator ˆ s m defined in (1) is then decomposed in two terms, called bias andvariance, thanks to Pythagoras relation. Let s m be the orthogonal projection of s onto S m , k s − ˆ s m k = k s − s m k + k s m − ˆ s m k . The statistician should choose a space S m realizing a trade-off between those terms. S m must be sufficiently “large” to ensure a small bias k s − s m k , but not too much, for thevariance k s m − ˆ s m k not to explose. The best trade-off depends on unknown propertiesof s , since the bias is unknown, and on the behavior of the empirical minimizer ˆ s m inthe space S m . Classically, S m is a parametric space and the dimension d m of S m as alinear space is used to give upper bounds on D m = n E (cid:0) k s m − ˆ s m k (cid:1) . This approachis validated in regular models under the assumption that the support of s is a knowncompact, as mentioned in section 3. However, this definition can fail dramatically becausethere exist simple models (histograms with a small dimension d m ) where D m is verylarge, and infinite dimensional models where D m is easily upper bounded. This issue isextensively discussed in Birg´e [7]. Birg´e chooses to keep the dimension d m of S m as acomplexity measure and build new estimators that achieve better risk bounds than theempirical minimizer. His procedures are unfortunatly untractable for the practical userbecause he can only prove the existence of his estimators. Even his bounds on the riskare only interesting theoretically because they involve constants which are not optimal.We will not take this point of view here and our estimator will always be the empiricalminimizer, mainly because it can easily be computed, see (2). We will focus on the quantity D m /n and introduce a general Assumption (namely Assumption [V] ) that allows to work2ndifferently with D m /n or with the actual risk k s m − ˆ s m k . We will also provide andstudy an estimator of D m /n based on the resampling heuristic.We insist here on the fact that, unlike classical methods, we will not use in this paperstrong extra assumptions on s , like k s k ∞ < ∞ or assume that s is compactly supported. Recall that the choice of an optimal model S m is impossible without strong assumptionson s , for example a precise information on its regularity. However, under less restrictivehypotheses, we can build a countable collection of models ( S m ) m ∈M n , growing with thenumber of observations, such that the best estimator in the associated collection (ˆ s m ) m ∈M n realizes an optimal trade-off, see for example Birg´e & Massart [8] and Barron, Birg´e &Massart [6]. The aim is then to build an estimator ˆ m such that our final estimator, ˜ s = ˆ s ˆ m behaves almost as well as any model m o in the set of oracles M ∗ n = { m o ∈ M n , k ˆ s m o − s k = inf m ∈M n k ˆ s m − s k } . This is the problem of model selection. More precisely, we want that ˜ s satisfies an oracleinequality defined in general as follows. Definition: (Trajectorial oracle inequality) Let ( p n ) n ∈ N be a summable sequence and let ( C n ) n ∈ N and ( R m,n ) n ∈ N be sequences of positive real numbers. 
The estimator ˜ s = ˆ s ˆ m satisfies a trajectorial oracle inequality T O ( C n , ( R m,n ) m ∈M n , p n ) if ∀ n ∈ N ∗ , P (cid:18) k ˜ s − s k > C n inf m ∈M n (cid:8) k s − ˆ s m k + R m,n (cid:9)(cid:19) ≤ p n . (3) When ˜ s satisfies an oracle inequality, C n is called the leading constant. In this paper, we are interested in the problem of optimal model selection defined asfollows.
Definition: (Optimal model selection) We say that ˜ s is optimal or that the procedureof selection ( X , ..., X n ) ˆ m is optimal when ˜ s satisfies a trajectorial oracle inequality T O (1 + r n , ( R m,n ) m ∈M n , p n ) with r n → and for all n in N ∗ and m in M n R m,n = 0 . Inorder to simplify the notations, when ˜ s is optimal we will say that ˜ s satisfies an optimaloracle inequality OT O ( r n , p n ) . In order to build ˆ m , we remark that, for all m in M n , k s − ˆ s m k = k ˆ s m k − P ˆ s m + k s k = P n Q (ˆ s m ) + 2 ν n (ˆ s m ) + k s k , (4)where ν n = P n − P is the centered empirical process. An oracle minimizes k s − ˆ s m k and thus P n Q (ˆ s m ) + 2 ν n (ˆ s m ). As we want to imitate the oracle, we will design a mappen : M n → R + and chooseˆ m ∈ arg min m ∈M n P n Q (ˆ s m ) + pen( m ) , ˜ s = ˆ s ˆ m . (5)It is clear that the ideal penalty is pen id ( m ) = 2 ν n (ˆ s m ). For all m in M n , for all orthonor-mal basis ( ψ λ ) λ ∈ m , ˆ s m = P λ ∈ m ( P n ψ λ ) ψ λ and s m = P λ ∈ m ( P ψ λ ) ψ λ , thus ν n (ˆ s m − s m ) = ν n X λ ∈ m ( ν n ψ λ ) ψ λ ! = X λ ∈ m ( ν n ψ λ ) = k ˆ s m − s m k . m in M n p ( m ) = ν n (ˆ s m − s m ) = k ˆ s m − s m k . From (4), for all m in M n , k s − ˜ s k = k ˜ s k − P ˜ s + k s k = k ˜ s k − P n ˜ s + 2 ν n ˜ s + k s k ≤ P n Q (ˆ s m ) + pen( m ) + (2 ν n (˜ s ) − pen( ˆ m )) + k s k = k s − ˆ s m k + (pen( m ) − ν n (ˆ s m )) + (2 ν n (˜ s ) − pen( ˆ m ))Hence, for all m in M n , k s − ˜ s k ≤ k s − ˆ s m k + (pen( m ) − p ( m )) + (2 p ( ˆ m ) − pen( ˆ m )) + 2 ν n ( s ˆ m − s m ) . (6)Let us define, for all c , c >
0, the function f c ,c : R + → R + , x (cid:26) c x − c x − x < /c + ∞ if x ≥ /c . (7)It comes from inequality (6) that ˜ s satisfies an oracle inequality OT O ( f , ( ǫ n ) , p n ) as soonas, with probability larger than 1 − p n ∀ m ∈ M n | p ( m ) − pen( m ) |k s − ˆ s m k ≤ ǫ n and (8) ∀ ( m, m ′ ) ∈ M n , ν n ( s m ′ − s m ) k s − ˆ s m ′ k + k s − ˆ s m k ≤ ǫ n . (9)Inequality (9) does not depend on our choice of penalty, we will check that it can easilybe satisfied in classical collections of models. In order to obtain inequality (8), we use twomethods, defined in M -estimation, but studied only on some regression frameworks. The first one is refered as the ”slope heuristic”. The idea has been introduced by Birg´e& Massart [9] in the Gaussian regression framework and developed in a general algorithmby Arlot & Massart [5]. This heuristic states that there exist a sequence (∆ m ) m ∈M n anda constant K min satisfying the following properties,1. when pen( m ) < K min ∆ m , then ∆ ˆ m is too large, typically ∆ ˆ m ≥ C max m ∈M n ∆ m ,2. when pen( m ) ≃ ( K min + δ )∆ m for some δ >
0, then ∆ ˆ m is much smaller,3. when pen( m ) ≃ K min ∆ m , the selected estimator is optimal.Thanks to the third point, when ∆ m and K min are known, this heuristic says that thepenalty pen( m ) = 2 K min ∆ m selects an optimal estimator. When ∆ m only is known, thefirst and the second point can be used to calibrate K min in practice, as shown by thefollowing algorithm (see Arlot & Massart [5]): Slope algorithm
For all
K >
0, compute the selected model ˆ m ( K ) given by (5) with the penalty pen( m ) = K ∆ m and the associated complexity ∆ ˆ m ( K ) .Find the constant K min such that ∆ ˆ m ( K ) is large when K < K min , and ”much smaller”4hen
K > K min .Take the final ˆ m = ˆ m (2 K min ).We will justify the slope heuristic in the density estimation framework for ∆ m = E ( k s m − ˆ s m k ) = D m /n and K min = 1. In general, D m is unknown and has to be estimated, wepropose a resampling estimator and prove that it can be used without extra assumptionsto obtain optimal results. Data-driven penalties have already been used in density estimation in particular cross-validation methods as in Stone [21], Rudemo [20] or Celisse [11]. We are interested herein the resampling penalties introduced by Arlot [4]. Let ( W , ..., W n ) be a resamplingscheme, i.e. a vector of random variables independent of X, X , ..., X n and exchangeable,that is, for all permutations τ of (1 , ..., n ),( W , ..., W n ) has the same law as ( W τ (1) , ..., W τ ( n ) ) . Hereafter, we denote by ¯ W n = P ni =1 W i /n and by E W and L W respectively the expectationand the law conditionally to the data X, X , ..., X n . Let P Wn = P ni =1 W i δ X i /n , ν Wn = P Wn − ¯ W n P n be the resampled empirical processes. Arlot’s procedure is based on the resamplingheurististic of Efron (see Efron [13]), which states that the law of a functional F ( P, P n )is close to its resampled counterpart, that is the conditional law L W ( C W F ( ¯ W n P n , P Wn )). C W is a renormalizing constant that depends only on the resampling scheme and on F .Following this heuristic, Arlot defines as a penalty the resampling estimate of the idealpenalty 2 D m /n , that is pen( m ) = 2 C W E W ( ν Wn (ˆ s Wm )) , (10)where ˆ s Wm minimizes P Wn Q ( t ) over S m . We prove concentration inequalities for pen( m )and deduce that pen( m ) provides an optimal procedure.The paper is organized as follows. In Section 2, we state our main results, we prove theefficiency of the slope algorithm and the resampling penalties.In Section 3, we compute the rates of convergence in the oracle inequalities using classicalcollections of models. Section 4 is devoted to a short simulation study where we comparedifferent methods in practice. The proofs are postponed to Section 5. Section 6 is anAppendix where we add some probabilistic material, we prove a concentration inequalityfor Z , where Z = sup t ∈ B ν n ( t ) and B is symmetric. We deduce a simple concentrationinequality for U -statistics of order 2 that extends a previous result by Houdr´e & Reynaud-Bouret [16]. Hereafter, we will denote by c , C , K , κ , L, α , with various subscripts some constants thatmay vary from line to line.
Take an orthonormal basis ( ψ λ ) λ ∈ m of S m . Easy algebra leads to s m = X λ ∈ m ( P ψ λ ) ψ λ , ˆ s m = X λ ∈ m ( P n ψ λ ) ψ λ , thus k s m − ˆ s m k = X λ ∈ m ( ν n ( ψ λ )) . s m is an unbiased estimator of s m andpen id ( m ) = 2 ν n (ˆ s m ) = 2 ν n (ˆ s m − s m ) + 2 ν n ( s m ) = 2 k s m − ˆ s m k + 2 ν n ( s m ) . For all m, m ′ in M n , let p ( m ) = k s m − ˆ s m k = X λ ∈ m ( ν n ( ψ λ )) , δ ( m, m ′ ) = 2 ν n ( s m − s m ′ ) . (11)From (6), for all m in M n , k s − ˜ s k ≤ k s − ˆ s m k + (pen( m ) − p ( m )) + (2 p ( ˆ m ) − pen( ˆ m )) + δ ( ˆ m, m ) . (12)In this section, we are interested in the concentration of p ( m ) around E ( p ( m )) = D m /n .Let us first remark that, for all m in M n , p ( m ) is the supremum of the centered empiricalprocess over the ellipsoid B m = { t ∈ S m , k t k ≤ } . From Cauchy-Schwarz inequality, forall real numbers ( b λ ) λ ∈ m , X λ ∈ m b λ = sup P a λ ≤ X λ ∈ m a λ b λ ! . (13)We apply this inequality with b λ = ν n ( ψ λ ). We obtain, since the system ( ψ λ ) λ ∈ m isorthonormal, X λ ∈ m ( ν n ( ψ λ )) = sup P a λ ≤ X λ ∈ m a λ ν n ( ψ λ ) ! = sup P a λ ≤ ν n X λ ∈ m a λ ψ λ !! = sup t ∈ B m ( ν n ( t )) . Hence, p ( m ) is bounded by a Talagrand’s concentration inequality (see Talagrand [22]).This inequality involves D m = n E (cid:0) k ˆ s m − s m k (cid:1) and the constants e m = 1 n sup t ∈ B m k t k ∞ and v m = sup t ∈ B m Var( t ( X )) . (14)More precisely, the following proposition holds: Proposition 2.1
Let
X, X , ..., X n be iid random variables with common density s withrespect to a probability measure µ . Assume that s belongs to L ( µ ) and let S m be alinear subspace in L ( µ ) . Let s m and ˆ s m be respectively the orthogonal projection and theprojection estimator of s onto S m . Let p ( m ) = k s m − ˆ s m k , D m = n E ( p ( m )) and let v m , e m be the constants defined in (14). Then, for all x > , P p ( m ) − D m n > D / m ( e m x ) / + 0 . p D m v m x + 0 . v m x + e m x n ! ≤ e − x/ (15) P D m n − p ( m ) > . D / m ( e m x ) / + 1 . p D m v m x + 4 . e m x n ! ≤ . e − x/ (16) Comments :
From (12), for all m in M n , k s − ˜ s k ≤ k s − ˆ s m k + (cid:18) pen( m ) − D m n (cid:19) + 2 (cid:18) D m n − p ( m ) (cid:19) +2 (cid:18) p ( ˆ m ) − D ˆ m n (cid:19) + (cid:18) D ˆ m n − pen( ˆ m ) (cid:19) + δ ( ˆ m, m ) . (17)6t appears from (17) that we can obtain oracle inequalities with a penalty of order 2 D m /n if, uniformly over m, m ′ in M n , p ( m ) − D m n << k s − ˆ s m k and δ ( m ′ , m ) << k s − ˆ s m k + k s − ˆ s m ′ k . Proposition 2.1 proves that the first part holds with large probability for all m in M n such that e m ∨ v m << n E ( k s − ˆ s m k ). Actually, the other part also holds under the samekind of assumption. For all m , m ′ in M n , let D m = n E (cid:0) k s m − ˆ s m k (cid:1) , R m n = E (cid:0) k s − ˆ s m k (cid:1) = k s − s m k + D m n ,v m,m ′ = sup t ∈ S m + S m ′ , k t k≤ Var( t ( X )) , e m,m ′ = 1 n sup t ∈ S m + S m ′ , k t k≤ k t k ∞ . For all k ∈ N , let M kn = { m ∈ M n , R m ∈ [ k, k + 1) } . For all n in N , for all k > k ′ > γ ≥
0, let [ k ] be the integer part of k and let l n,γ ( k, k ′ ) = ln(1 + Card( M [ k ] n )) + ln(1 + Card( M [ k ′ ] n )) + ln(( k + 1)( k ′ + 1)) + (ln n ) γ (18) Assumption [V] : There exist γ > and a sequence ( ǫ n ) n ∈ N , with ǫ n → such that, forall n in N , sup ( k,k ′ ) ∈ ( N ∗ ) sup ( m,m ′ ) ∈M kn ×M k ′ n v m,m ′ R m ∨ R m ′ ! ∨ e m,m ′ R m ∨ R m ′ l n,γ ( k, k ′ ) ≤ ǫ n . [BR] There exist two sequences ( h ∗ n ) n ∈ N ∗ and ( h on ) n ∈ N ∗ with ( h on ∨ h ∗ n ) → as n → ∞ such that, for all n in N ∗ , for all m o ∈ arg min m ∈M n R m and all m ∗ ∈ arg max m ∈M n D m , R m o D m ∗ ≤ h on , n k s − s m ∗ k D m ∗ ≤ h ∗ n . Comments: • Assumption [V] ensures that the fluctuations of the ideal penalty are uniformlysmall compared to the risk of the estimator ˆ s m . Note that for all k, k ′ , l n,γ ( k, k ′ ) ≥ (ln n ) γ , thus, Assumption [V] holds only in typical non parametric situations where R n = inf m ∈M n R m → ∞ as n → ∞ . • The slope heuristic states that the complexity ∆ ˆ m of the selected estimator is toolarge when the penalty term is too small. A minimal assumption for this heuristicto hold with ∆ m = D m would be that there exists a sequence ( θ n ) n ∈ N ∗ with θ n → n → ∞ such that, for all n in N ∗ , for all m o ∈ arg min m ∈M n E (cid:0) k s − ˆ s m k (cid:1) andall m ∗ ∈ arg max m ∈M n E (cid:0) k s m − ˆ s m k (cid:1) , D m o ≤ θ n D m ∗ . Assumption [BR] is slightly stronger but will always hold in the examples (seeSection 3). 7n order to have an idea of the rates R n , ǫ n , h ∗ n , h on and θ n , let us briefly consider the verysimple following example: Example HR:
We assume that s is supported in [0 ,
1] and that ( S m ) m ∈M n is the collectionof the regular histograms on [0 , d m = 1 , ..., n pieces. We will see in Section 3.2that D m ∼ d m asymptotically, hence D m ∗ ≃ n . Moreover, we assume that s is H¨olderianand not constant so that there exist positive constants c l , c u , α l , α u such that, for all m in M n , see for example Arlot [4], c l d − α l m ≤ k s − s m k ≤ c u d − α u m . In Section 3.2, we prove that this assumption implies [V] with ǫ n ≤ C ln( n ) n − / (8 α l +4) .Moreover, there exists a constant C > R m o ≤ inf m ∈M n ( c u nd − α u m + d m ) ≤ Cn − / (2 α u +1) , thus R m o /D ∗ m ≤ Cn / (2 α u +1) − = Cn − α u / (2 α u +1) . Since there exists C > n k s − s m ∗ k /D m ∗ ≤ Cd − α u m ∗ = Cn − α u , [BR] holds with h on = Cn − α u / (2 α u +1) and h ∗ n = Cn − α u .Other examples can be found in Birg´e & Massart [8], see also Section 3. Let us now turn to the slope heuristic presented in Section 1.2.1.
Theorem 2.2 (Minimal penalty) Let M n be a collection of models satisfying [V] and [BR] and let ǫ ∗ n = ǫ n ∨ h ∗ n .Assume that there exists < δ n < such that ≤ pen ( m ) ≤ (1 − δ n ) D m /n . Let ˆ m, ˜ s bethe random variables defined in (5) and let c n = δ n − ǫ ∗ n ǫ n . There exists a constant
C > such that, P (cid:18) D ˆ m ≥ c n D m ∗ , k s − ˜ s k ≥ c n h on inf m ∈M n k s − ˆ s m k (cid:19) ≥ − Ce − (ln n ) γ . (19) Comments:
Assume that pen( m ) ≤ (1 − δ ) D m /n , then, inequality (19) proves that anoracle inequality can not be obtained since c n /h on → ∞ . Moreover, D ˆ m ≥ cD m ∗ is as largeas possible. This proves point 1 of the slope heuristic. Theorem 2.3
Let M n be a collection of models satisfying Assumption [V] . Assume thatthere exist δ + ≥ δ − > − and ≤ p ′ < such that, with probability at least − p ′ , D m n + δ − R m n ≤ pen ( m ) ≤ D m n + δ + R m n . Let ˆ m, ˜ s be the random variables defined in (5) and let C n ( δ − , δ + ) = (cid:18) δ − − ǫ n δ + + 26 ǫ n ∨ (cid:19) − . There exists a constant
C > such that, with probability larger than − p ′ − Ce − (ln n ) γ , D ˆ m ≤ C n ( δ − , δ + ) R m o , k s − ˜ s k ≤ C n ( δ − , δ + ) inf m ∈M n k s − ˆ s m k . (20)8 omments : • Assume that pen( m ) = KD m /n with K >
1, then inequality (20) ensures that D ˆ m ≤ C n ( K, K ) R m o . Hence, D ˆ m jumps from D m ∗ (Theorem 2.2) to R m o (20) whenpen( m ) is around D m /n , which is much smaller thanks to Assumption [BR] . Thisproves point 2 of the slope heuristic. • Point 3 of this heuristic comes from inequality (20) applied with small δ − and δ + .The rate of convergence of the leading constant to 1 is then given by the supremumbetween δ − , δ + and ǫ n . • The condition on the penalty has the same form as the one given in Arlot & Massart[5]. It comes from the fact that we do not know D m /n in many cases, therefore, ithas to be estimated. We propose two alternatives to solve this issue. In Section 2.4,we give a resampling estimator of D m . It can be used for all collection of modelssatisfying [V] and its error of approximation is upper bounded by ǫ n R m /n . ThusTheorem 2.3 holds with ( δ − ∨ δ + ) ≤ Cǫ n . In Section 3.2, we will also see that, inregular models, we can use d m instead of D m and the error is upper bounded by CR m /R m o , thus Theorem 2.3 holds with ( δ − ∨ δ + ) ≤ C/R m o << ǫ n , p ′ = 0. Inboth cases, we deduce from Theorem 2.3 that the estimator ˜ s given by the slopealgorithm achieves an optimal oracle inequality OT O ( κǫ n , Ce − (ln n ) γ ). In Example HR , for example, we obtain ǫ n = Cn − / (8 α l +4) ln n . Optimal model selection is possible in density estimation provided that we have a sharpestimation of D m = n E (cid:0) sup t ∈ B m ( ν n ( t )) (cid:1) . We propose an estimator of this quantitybased on the resampling heuristic. The model selection algorithm that we deduce is thesame as the resampling penalization procedure introduced by Arlot [4]. Let F be a fixedfunctional. Efron’s heuristic states that the law L ( F ( ν n )) is close to the conditional law L W ( C W F ( ν Wn )), where C W is a normalizing constant depending only on the resamplingscheme and the functional F . Let P Wn = P ni =1 W i δ X i /n and ν Wn = P Wn − ¯ W n P n . Theresampling estimator of D m is D Wm = nC W E W (cid:0) sup t ∈ B m ( ν Wn ( t )) (cid:1) and the resamplingpenalty associated is pen( m ) = 2 D Wm /n . Actually, the following result describes theconcentration of D Wm around its mean D m and around np ( m ). Proposition 2.4
Let ( W , ..., W n ) be a resampling scheme, let S m be a linear space, B m = { t ∈ S m , k t k ≤ } , p ( m ) = sup t ∈ B m ( ν n ( t )) , D m = n E ( p ( m )) and let D Wm be the resam-pling estimator of D m based on ( W , ..., W n ) , that is D Wm = nC W E W (cid:0) sup t ∈ B m ( ν Wn ( t )) (cid:1) ,where v W = Var ( W − ¯ W n ) and C W = ( v W ) − .Then, for all m in M n , E ( D Wm ) = D m . Moreover, let e m , v m be the quantities defined in(14). For all x > , on an event of probability larger than − . e − x , D Wm − D m ≤ p e m D m x + e m (cid:18) x . x ) n − (cid:19) + 9 D / m ( e m x ) / + 7 . p v m D m xn − . (21) D Wm − D m ≥ − p e m D m x − e m (cid:18) x . x ) n − (cid:19) − . D / m ( e m x ) / + 3 p v m D m x + 3 v m xn − . (22)9 or all x > , P p ( m ) − D Wm n > . D / m ( e m x ) / + 3 p v m D m x + 3 v m x + e m (19 . x ) n − ! ≤ e − x (23) P D Wm n − p ( m ) ≤ D / m ( e m x ) / + 7 . p v m D m x + e m (40 . x ) n − ! ≤ . e − x . (24) Remark
The concentration of the resampling estimator involves the same quantities as the concen-tration of p ( m ), thus, it can be used to estimate the ideal penalty in the slope heuristic’salgorithm presented in the previous section without extra assumptions on the collection M n . Proposition 2.4 and Theorem 2.3 prove that this resampling penalty leads to anefficient model selection procedure. However, we do not need to use the slope heuristic inour framework to obtain an optimal model selection procedure as shown by the followingtheorem. Theorem 2.5
Let X , ..., X n be i.i.d random variables with common density s . Let M n bea collection of models satisfying Assumption [V] . Let W , ..., W n be a resampling scheme,let ¯ W n = P ni =1 W i /n , v W = Var ( W − ¯ W n ) and C W = 2( v W ) − . Let ˜ s be the penalizedleast-squares estimator defined in (5) withpen ( m ) = C W E W (cid:18) sup t ∈ B m ( ν Wn ( t )) (cid:19) . Then, there exists a constant
C > such that P (cid:18) k s − ˜ s k ≤ (1 + 100 ǫ n ) inf m ∈M n k s − ˆ s m k (cid:19) ≥ − Ce − (ln n ) γ . (25) Comments :
The main advantage of this results is that the penalty term is alwaystotally computable. Unlike the penalties derived from the slope heuristic, it does notdepend on an arbitrary choice of a constant K min made by the observer, that may behard to detect in practice (see the paper of Alot & Massart [5] for an extensive discussionon this important issue). However, C W is only optimal asymptotically. It is sometimesuseful to overpenalize a little in order to improve the non-asymptotic performances of ourprocedures (see Massart [19]) and the slope heuristic can be used to do it in an optimalway (see our short simulation study in Section 4). The regularization of the bootstrap phenomenon (see Arlot [3, 4] and the referencestherein) states that the resampling estimator C W E W ( F ( ν Wn )) of a functional F ( ν n ) con-centrates around its mean better than F ( ν n ). This phenomenon can be justified with ourprevious results for our functional F . Recall that we have proven in Proposition 2.1 that,for all x >
0, with probability larger than 1 − . e − x/ , (cid:12)(cid:12)(cid:12)(cid:12) p ( m ) − D m n (cid:12)(cid:12)(cid:12)(cid:12) ≤ . D / m ( e m x ) / + 1 . p D m v m x + 0 . v m x + 4 . e m x n . In Example HR , we have the following upper bounds D m ≤ d m , e m ≤ d m n , v m ≤ c k s k p d m . C such that, for all x > P | np ( m ) − D m | > Cd m r x √ n + (cid:18) x √ n (cid:19) !! ≤ . e − x/ . (26)On the other hand, it comes from Inequalities (21) and (22), that, for all x >
0, on anevent of probability larger than 1 − . e − x/ , (cid:12)(cid:12) D Wm − D m (cid:12)(cid:12) ≤ p . e m D m x + e m (cid:18) x
15 + 4 . x n − (cid:19) + 1 . D / m ( e m x ) / + 1 . p v m D m x + 0 . v m xn − . Thus, there exists a constant C such that, for all x > P (cid:18)(cid:12)(cid:12) D Wm − D m (cid:12)(cid:12) > Cd m (cid:18)r xn + (cid:16) xn (cid:17) (cid:19)(cid:19) ≤ . e − x/ . The concentration of D Wm is then much better than the one of np ( m ). This implies that D Wm is an estimator of D m rather than an estimator of np ( m ). Thus, the resamplingpenalty can be used when D m /n is a good penalty for example, under [V] . When D m /n is known to underpenalize (see the examples in Barron, Birg´e & Massart [6]), there is nochance that D Wm /n can work. The aim of this section is to show that [V] can be derived from a more classical hypothesisin two classical collections of models: the histograms and Fourier spaces. We derive therates ǫ n under this new hypothesis. As mentioned in Section 2.2, Assumption [V] can only hold if there exists γ > R n (ln n ) − γ → ∞ as n → ∞ , where R n = inf m ∈M n R m . In our example, we will make thefollowing Assumption that ensures that this condition is always satisfied. [BR] (Bounds on the Risk) There exist constants C u > , α u > , γ > , and a sequence ( θ n ) n ∈ N with θ n → ∞ as n → ∞ such that, for all n in N ∗ , for all m in M n θ n (ln n ) γ ≤ R n ≤ R m ≤ C u n α u . Comments:
Assumption [BR] holds with θ n = Cn α for the collection of regular his-tograms of example HR , provided that s is an H¨olderian, non constant and compactlysupported function (see for example Arlot [3]). It is also a classical result of minimax the-ory that there exist functions in Sobolev spaces satisfying this kind of Assumption when M n is the collection of Fourier spaces that we will introduce below.We want to check that these collections satisfy Assumption [V] , i.e. that there exists γ > ( k,k ′ ) ∈ ( N ∗ ) sup ( m,m ′ ) ∈M kn ×M k ′ n v m,m ′ R m ∨ R m ′ ! ∨ e m,m ′ R m ∨ R m ′ l n,γ ( k, k ′ ) ≤ ǫ n . m ∈ M n , R m ≤ C u n α u , thus for all k > C u n α u , Card( M kn ) = 0. In particular,we can assume in the previous supremum that k ≤ C u n α u and k ′ ≤ C u n α u . Hence, thereexists a constant κ > k )(1 + k ′ )] ≤ κ ln n . We also add the followingassumption that ensures that there exists a constant κ > k ∈ N ,ln(1 + Card ( M kn )) ≤ κ ln n . [PC] (Polynomial collection) There exist constants c M ≥ , α M ≥ , such that, for all n in N , Card( M n ) ≤ c M n α M . Under Assumptions [BR] and [PC] , there exists a constant κ > γ > n ≥
3, sup ( k,k ′ ) ∈ ( N ∗ ) sup ( m,m ′ ) ∈M kn ×M k ′ n v m,m ′ R m ∨ R m ′ ! ∨ e m,m ′ R m ∨ R m ′ l n,γ ( k, k ′ ) ≤ sup ( m,m ′ ) ∈ ( M n ) v m,m ′ R m ∨ R m ′ ! ∨ e m,m ′ R m ∨ R m ′ κ (ln n ) γ . Let ( X , X ) be a measurable space. Let ( P m ) m ∈M n be a collection of measurable partitions P m = ( I λ ) λ ∈ m of subsets of X such that, for all m ∈ M n , for all λ ∈ m , 0 < µ ( I λ ) < ∞ .Let m in M n , the set S m of histograms associated to P m is the set of functions whichare constant on each I λ , λ ∈ m . S m is a linear space. Setting, for all λ ∈ m , ψ λ =( p µ ( I λ )) − I λ , the functions ( ψ λ ) λ ∈ m form an orthonormal basis of S m .Let us recall that, for all m in M n , D m = X λ ∈ m Var( ψ λ ( X )) = X λ ∈ m P ( ψ λ ) − ( P ψ λ ) = X λ ∈ m P ( X ∈ I λ ) µ ( I λ ) − k s m k . (27)Moreover, from Cauchy-Schwarz inequality, for all x in X , for all m , m ′ in M n sup t ∈ B m,m ′ t ( x ) ≤ X λ ∈ m ∪ m ′ ψ λ ( x ) , thus e m,m ′ = 1 n sup λ ∈ m ∪ m ′ µ ( I λ ) . (28)Finally, it is easy to check that, for all m , m ′ in M n v m,m ′ = sup λ ∈ m ∪ m ′ Var( ψ λ ( X )) = sup λ ∈ m ∪ m ′ P ( X ∈ I λ )(1 − P ( X ∈ I λ )) µ ( I λ ) . (29)We will consider two particular types of histograms. Example 1 [Reg] : µ -regular histograms. For all m in M n , P m is a partition of X and there exist a family ( d m ) m ∈M n bounded by n and two constants c rh , C rh such that, for all m in M n , for all λ ∈ M n , c rh d m ≤ µ ( I λ ) ≤ C rh d m . The typical example here is the collection described in Example HR . Example 2 [Ada]: Adapted histograms. here exist positive constants c r , C ah such that, for all m in M n , for all λ ∈ M n , µ ( I λ ) ≥ c r n − and P ( X ∈ I λ ) µ ( I λ ) ≤ C ah . [Ada] is typically satisfied when s is bounded on X . Remark that the models satisfying [Ada] have finite dimension d m ≤ Cn since1 ≥ X λ ∈ m P ( X ∈ I λ ) ≥ C ah X λ ∈ m µ ( I λ ) ≥ C ah c r d m n − . The example [Reg] .It comes from equations (27, 28, 29) and Assumption [Reg] that C − rh d m − k s m k ≤ D m ≤ c − rh d m − k s m k .e m,m ′ ≤ c − rh d m ∨ d m ′ n , v m,m ′ ≤ sup t ∈ B m,m ′ k t k ∞ k t kk s k ≤ c − / rh k s k p d m ∨ d m ′ . Thus e m,m ′ R m ∨ R m ′ ≤ C rh c − rh ( R m ∨ R m ′ ) + k s k n ( R m ∨ R m ′ ) ≤ Cn − . If D m ∨ D m ′ ≤ θ n (ln n ) γ , v m,m ′ R m ∨ R m ′ ≤ q C rh c − rh p ( D m ∨ D m ′ ) + k s k R m o ≤ Cθ n (ln n ) γ . If D m ∨ D m ′ ≥ θ n (ln n ) γ , v m,m ′ R m ∨ R m ′ ≤ q C rh c − rh p ( D m ∨ D m ′ ) + k s k D m ∨ D m ′ ≤ Cθ n (ln n ) γ . There exists κ > θ n (ln n ) γ ≤ κn since for all m in M n , R m ≤ n k s − s m k + c − rh d m ≤ ( k s k + c − rh ) n . Hence Assumption [V] holds with γ given in Assumption [BR] and ǫ n = Cθ − / n . The example [Ada] .It comes from inequalities (28), (29) and Assumption [Ada] that, for all m and m ′ in M n e m,m ′ ≤ c − r and v m,m ′ ≤ C ah . Thus, there exists a constant κ > m an m ′ in M n ,sup ( m,m ′ ) ∈ ( M n ) v m,m ′ R m ∨ R m ′ ! ∨ e m,m ′ R m ∨ R m ′ ≤ κθ n (ln n ) γ . Therefore Assumption [V] holds also with γ given in Assumption [BR] and ǫ n = κθ − / n .13 .3 Fourier spaces In this section, we assume that s is supported in [0 , ψ : [0 , → R , x k ∈ N ∗ , we define the functions ψ ,k : [0 , → R , x
7→ √ πkx ) , ψ ,k : [0 , → R , x
7→ √ πkx ) . For all j in N ∗ , let m j = { } ∪ { ( i, k ) , i = 1 , , k = 1 , ..., j } and M n = { m j , j = 1 , ..., n } . For all m in M n , let S m be the space spanned by the family ( ψ λ ) λ ∈ m . ( ψ λ ) λ ∈ m is anorthonormal basis of S m and for all j in 1 , ..., n , d m j = 2 j + 1.Let j in 1 , ...n , for all x in [0 , X λ ∈ m j ψ λ ( x ) = 1 + 2 j X k =1 cos (2 πkx ) + sin (2 πkx ) = 1 + 2 j = d m j . Hence, for all m in M n , D m = P X λ ∈ m j ψ λ − k s m k = d m − k s m k . (30)It is also clear that, for all m , m ′ in M n , e m,m ′ = d m ∨ d m ′ n , v m,m ′ ≤ k s k p d m ∨ d m ′ . (31)The collection of Fourier spaces of dimension d m ≤ n satisfies Assumption [PC] , and thequantities D m e m,m ′ and v m,m ′ satisfy the same inequalities as in the collection [Reg] ,therefore, [V] comes also in this collection from [BR] . We have obtained the followingcorollary of Theorem 2.5. Corollary 3.1
Let M n be either a collection of histograms satisfying Assumptions [PC]-[Reg] or [PC]-[Ada] or the collection of Fourier spaces of dimension d m ≤ n . Assumethat s satisfies Assumption [BR] for some γ > and θ n → ∞ . Then, there exist constants κ > and C > such that the estimator ˜ s selected by a resampling penalty satisfies P (cid:18) k s − ˜ s k ≤ (1 + κθ − / n ) inf m ∈M n k s − ˆ s m k (cid:19) ≥ − Ce − (ln n ) γ . Comment:
Assumption [BR] is hard to check in practice. We mentioned that it holdsin Example HR provided that s is H¨olderian, non constant and compactly supported (seeArlot [4]). It is also classical to build functions satisfying [BR] with the Fourier spaces inorder to prove that the oracle reaches the minimax rate of convergence over some Sobolevballs, see for example Birg´e & Massart [8], Barron, Birg´e & Massart [6] or Massart [19].In these cases, there exist c > α > θ n ≥ cn α . In more general situations,we can use the same trick as Arlot [4] and use our main theorem only for the modelswith dimension d m ≥ (ln n ) γ , they satisfy [BR] with θ n = (ln n ) , at least when n issufficiently large, because k s k + R m ≥ k s k + D m ≥ cd m ≥ c (ln n ) (ln n ) γ . With our concentration inequalities, we can control easily the risk of the models withdimension d m ≤ (ln n ) γ by κ (ln n ) γ/ with probability larger than 1 − Ce − (ln n ) γ and we can then deduce the following corollary.14 orollary 3.2 Let M n be either a collection of histograms satisfying Assumptions [PC]-[Reg] or [PC]-[Ada] or the collection of Fourier spaces of dimension d m ≤ n . Thereexist constants κ > , η > γ/ and C > such that the estimator ˜ s selected by aresampling penalty satisfies P (cid:18) k s − ˜ s k ≤ (1 + κ (ln n ) − ) (cid:18) inf m ∈M n k s − ˆ s m k + (ln n ) η n (cid:19)(cid:19) ≥ − Ce − (ln n ) γ . We propose in this section to show the practical performances of the slope algorithm andthe resampling penalties on two examples. We estimate the density s ( x ) = 34 x − / [0 , ( x )and we compare the three following methods.1. The first one is the slope heuristic applied with the linear dimension d m of themodels. We observe two main behaviors of d ˆ m ( K ) with respect to K . Most of thetimes, we only observe one jump, as in Figure 1, and we find K min easily.Figure 1: Classical behavior of K d ˆ m ( K ) We also observe more difficult situations as the one of Figure 2 below, where wecan see several jumps. In these cases, as prescribed in the regression framework byArlot & Massart [5], we choose the constant K min realizing the maximal jump of d ˆ m ( K ) . Arlot & Massart [5] also proposed to select K min as the minimal K suchthat d ˆ m ( K ) ≤ d m ∗ (ln n ) − , but they obtained worse performances of the selectedestimator in their simulations.We justify this method only for collection of models where d m ≃ KD m for someconstant K . We will see that it gives really good performances when this conditionis satisfied.2. The second method is the resampling based penalization algorithm of Theorem 2.5.Note here that all the resampling penalties D Wm /n can be easily computed, withoutany Monte Carlo approximations. Actually, for all resampling scheme, D Wm n = 1 n X λ ∈ m P n ψ λ − n ( n − n X i = j =1 ψ λ ( X i ) ψ λ ( X j ) . D m . However, in nonasymptotic situations, it may be usefull to overpenalize a little bit in order to improvethe leading constants in the oracle inequality (in Theorem 2.3, imagine that 46 ǫ n isvery close to 1).3. In a third method, we propose therefore to use the slope algorithm applied with acomplexity D Wm . By this way, we hope to overpenalize a little bit the resamplingpenalty when it is necessary. In the first example, we consider the collection of regular histograms described in example HR and we observe n = 100 data. In this example, we saw that D Wm ≃ D m ≃ d m . We canactually verify in Figure 2 that these quantities almost coincide for the selected model. 
R e s a m p li ng e s t i m a t i on o f D m D i m en s i on o f t he s e l e c t ed m ode l Constant K Constant K
Figure 2: Comparison of d m and D Wm on the selected modelWe compute N = 1000 times the oracle constant c = k s − ˜ s k / (inf m ∈M n k s − ˆ s m k ) forthe 3 methods. We put in the following array the mean, the median and the 0 . q . of these quantities.method mean of the N constants c median q . slope + d m .
56 2 .
30 10 . .
43 2 .
52 15 . .
57 2 .
21 10 . d m ≃ D Wm , the slope algorithm leads to the sameresults when applied with d m or with D Wm . Although we have an explicite formula tocompute the resampling penalties, the computation time is much longer if we use D Wm .Therefore, we clearly recommand to use the slope algorithm with d m for regular collectionsof model, as regular histograms or Fourier spaces described in Section 3.3. In the next example, we want to show that the linear dimension shall not be used ingeneral. Let us consider a slightly more complicated collection. Let k, J , J , n be four16on null integers satisfying k ≤ n , J ≤ k , J ≤ n − k . We denote by S k,J ,J ,n the linearspace of histograms on the following partition. (cid:26)(cid:20) l kJ n , ( l + 1) kJ n (cid:20) , l = 0 , ..., J − (cid:27) ∪ (cid:26)(cid:20) kn + l − k/nJ , kn + ( l + 1) 1 − k/nJ (cid:20) , l = 0 , ...J − (cid:27) . Let n ∈ N ∗ and let M n = { ( k, J , J ) ∈ ( N ∗ ) ; k ≤ n, J ≤ k, J ≤ n − k } . It is clearthat Card( M n ) ≤ n . The oracle of this collection is better than the previous one sincethe regular histograms belongs to ( S m,n ) m ∈M n . It is easy to check that the dimension of S k,J ,J ,n is equal to J + J and that D k,J ,J ,n is equal to ( nJ /k ) F ( k/n ) + ( nJ / ( n − k ))(1 − F ( k/n )) − k s k,J ,J ,n k /n , where F is the distribution function of the observations.Hence, there is no constant K o such that K o d k,J ,J ,n ≃ D k,J ,J ,n as in the previousexample. Figure 3 let us see this fact on the selected model. constant K d i m e n s i on o f t h e s e l ec t e d m od e l constant K r e s a m p li ng e s ti m a ti on o f D m Figure 3: Comparison of d m and D Wm on the selected modelWe also compute N = 1000 times the oracle constant c = k s − ˜ s k / (inf m ∈M n k s − ˆ s m k )for the 3 methods, taking n = 100 observations each time. The results are summarized infollowing array. method mean of the N constants c median q . slope + d m .
30 7 .
01 19 . .
11 5 .
08 13 . .
33 4 .
04 12 . d m . This is due to the fact that d m is not proportional to D m here. The resampling based penalty 2 D Wm /n is much betterand, as in the regular case, it is well improved by the slope algorithm. Therefore, forgeneral collections of models where we do not know an optimal shape of the ideal penalty,we recommand to apply the slope algorithm with a complexity equal to D Wm .17 Proofs
It is a straightforward application of Corollary 6.6 in the appendix.
Before giving the proofs of the main theorems, we state and prove some important technicallemmas that we will use repeatedly all along the proofs. Let us recall here the mainnotations. For all m , m ′ in M n , p ( m ) = k s m − ˆ s m k , D m = n E ( p ( m )) = n E (cid:0) k ˆ s m − s m k (cid:1) R m = n E (cid:0) k s − ˆ s m k (cid:1) = n k s − s m k + D m , δ ( m, m ′ ) = ν n ( s m − s m ′ ) . For all n ∈ N ∗ , k > , k ′ > , γ > , , let [ k ] be the integer part of k and let l n,γ ( k, k ′ ) = ln((1 + Card( M [k]n ))(1 + Card( M [k ′ ]n ))) + ln((1 + k )(1 + k ′ )) + (ln n ) γ . Recall that Assumption [V] implies that, for all m, m ′ in M n , v m,m ′ l n,γ ( R m , R m ′ ) ≤ ǫ n ( R m ∨ R m ′ ) ,e m,m ′ ( l n,γ ( R m , R m ′ )) ≤ ǫ n ( R m ∨ R m ′ ) . (32)Let us prove a simple result Lemma 5.1
For all
K > , Σ( K ) = X k ∈ N X m ∈M kn e − K [ln(1+Card( M kn ))+ln(1+ k )] < ∞ . (33) For all m in M n , let l m = l n,γ ( R m , R m ) , then, for all K > / √ , X m ∈M n e − K l m = Σ(2 K ) e − K (ln n ) γ . (34) For all m , m ′ in M n , let l m,m ′ = l n,γ ( R m , R m ′ ) , then, for all K > , X ( m,m ′ ) ∈ ( M n ) e − K l m,m ′ = (Σ( K )) e − K (ln n ) γ . (35) Proof :
Inequality (33) comes from the fact that, when
K > ∀ k ∈ N , X m ∈M kn e − K [ln(1+Card( M kn ))] ≤ , and X k ∈ N ∗ e − K ln k < ∞ . For all integer k such that M kn = ∅ , for all m in M kn , l m ≥ M kn )) + ln(1 + k )] + (ln n ) γ , thus, for all K > / √
2, it comes from (33) that X m ∈M n e − K l m ≤ e − K (ln n ) γ X k ∈ N X m ∈M kn e − K [ln(1+Card( M kn ))+ln(1+ k )] ≤ Σ(2 K ) e − K (ln n ) γ . k, k ′ ) such that M kn × M k ′ n = ∅ , l m,m ′ ≥ ln(1 + Card( M kn )) + ln(1 + k ) + ln(1 + Card( M k ′ n )) + ln(1 + k ′ ) + (ln n ) γ . Thus, from (33), X ( m,m ′ ) ∈ ( M n ) e − K l m,m ′ = X k ∈ N X m ∈M kn e − K [ln(1+Card( M kn ))+ln(1+ k )] e − K (ln n ) γ . Lemma 5.2
Let M n be a collection of models satisfying Assumption [V] . We considerthe following events. Ω δ = (cid:26) ∀ ( m, m ′ ) ∈ M n , δ ( m, m ′ ) ≤ ǫ n R m ∨ R m ′ n (cid:27) Ω p = \ m ∈M n (cid:26)(cid:26) p ( m ) − D m n ≤ ǫ n R m n (cid:27) ∩ (cid:26) p ( m ) − D m n ≥ − ǫ n R m n (cid:27)(cid:27) and Ω T = Ω δ ∩ Ω p . Then there exists a constant C > such that P (Ω cδ ) ≤ Ce − (ln n ) γ , P (Ω cp ) ≤ Ce − (ln n ) γ , P (Ω cT ) ≤ Ce − (ln n ) γ . Proof :
Let
K > u = s m − s m ′ , S = S m + S m ′ , L = id , x = K l n,γ ( R m , R m ′ ). For all η >
0, for all m, m ′ in M n , on an event of probability larger than 1 − e − K l n,γ ( R m ,R m ′ ) , δ ( m, m ′ ) ≤ η k s m − s m ′ k + 2 v m,m ′ K l n,γ ( R m , R m ′ ) + e m,m ′ ( K l n,γ ( R m , R m ′ )) / ηn . (36)From [V] , for all m , m ′ in M n ,2 v m,m ′ K l n,γ ( R m , R m ′ )) + e m,m ′ ( K l n,γ ( R m , R m ′ )) ≤ (cid:18) Kǫ n ) + ( Kǫ n ) (cid:19) R m ∨ R m ′ n . Moreover, for all m, m ′ in M n , k s m − s m ′ k ≤ k s − s m k + k s − s m ′ k ) ≤ R m + R m ′ ) ≤ R m ∨ R m ′ ) . Let e n ( K ) = p ( Kǫ n ) + ( Kǫ n ) /
18. In (36) we take η = e n ( K ) and we obtain P (cid:18) δ ( m, m ′ ) > e n ( K ) R m ∨ R m ′ n (cid:19) ≤ e − Kl n,γ ( R m ,R m ′ ) . (37)From (35), for all K > P (cid:18) ∀ ( m, m ′ ) ∈ M n , δ ( m, m ′ ) > e n ( K ) R m ∨ R m ′ n (cid:19) ≤ (Σ( K )) e − K (ln n ) . Let K = 1 . n sufficiently large so that K ǫ n / ≤
1, then 4 e n ( K ) ≤ ǫ n . Hence,the first conclusion of Lemma 5.2 holds for sufficiently large n , it holds in general, providedthat we increase the constant C if necessary.19e apply Assumption [V] (see (32)) with m = m ′ , let l m = l n,γ ( R m , R m ), for all K > n such that 4 . Kǫ n ) ≤ D / m ( e m ( K l m ) ) / + 0 . p D m v m K l m + 0 . v m K l m + e m ( K l m ) n ≤ (1 . Kǫ n + 0 . Kǫ n ) + ( Kǫ n ) ) R m n ≤ Kǫ n R m n . . D / m ( e m ( K l m ) ) / + 1 . p D m v m ( K l m ) + 4 . e m ( K l m ) n ≤ (3 . Kǫ n + 4 . Kǫ n ) ) R m n ≤ Kǫ n R m n . It comes then from Proposition 2.1 applied with x = K l m that, for all m in M n P (cid:18) p ( m ) − D m n > Kǫ n R m n (cid:19) ≤ e − K l m . Thus, from (34), for all
K > √
10, and for all n sufficiently large, P (cid:18) ∀ m ∈ M n , p ( m ) − D m n > Kǫ n R m n (cid:19) ≤ Σ( K / e − K (ln n ) γ . We use the same arguments to prove that P (cid:18) ∀ m ∈ M n , p ( m ) − D m n < Kǫ n R m n (cid:19) ≤ Σ( K / e − K (ln n ) γ . Fixe K = √ .
5, then for all n sufficiently large , the conclusion of Lemma 5.2 holds. Itholds in general provided that we increase the constant C if necessary. Lemma 5.3
Let ( ψ λ ) λ ∈ Λ be an orthonormal system in L ( µ ) and let L be a linear func-tional defined on L ( µ ) . Let p (Λ) = P λ ∈ Λ ( ν n ( L ( ψ λ ))) . Let ( W , ..., W n ) be a resamplingscheme, let ¯ W n = P ni =1 W i /n and let v W = Var ( W − ¯ W n ) . Let D W Λ = n ( v W ) − X λ ∈ Λ E W (cid:0) ( ν Wn ( L ( ψ λ ))) (cid:1) ,T = P λ ∈ Λ ( L ( ψ λ ) − P L ( ψ λ )) , D = P T and U = 1 n ( n − n X i = j =1 X λ ∈ Λ ( L ( ψ λ )( X i ) − P L ( ψ λ ))( L ( ψ λ )( X j ) − P L ( ψ λ )) . then p (Λ) = 1 n P n T + n − n U, D W Λ = P n T − U, p (Λ) − D W Λ n = U, E ( D W Λ ) = D, D W Λ − D = ν n T − U. roof : It is easy to check that p (Λ) = X λ ∈ Λ ( 1 n n X i =1 L ( ψ λ )( X i ) − P L ( ψ λ )) = 1 n n X i =1 ( L ( ψ λ )( X i ) − P L ( ψ λ )) + 1 n n X i = j =1 X λ ∈ Λ ( L ( ψ λ )( X i ) − P L ( ψ λ ))( L ( ψ λ )( X j ) − P L ( ψ λ ))= 1 n P n T + n − n U. Recall that ν Wn = P Wn − ¯ W n P n . For all λ in Λ, since P ni =1 ( W i − ¯ W n ) = 0, ν Wn ( L ( ψ λ )) = 1 n n X i =1 ( W i − ¯ W n ) L ( ψ λ )( X i )= 1 n n X i =1 ( W i − ¯ W n )( L ( ψ λ )( X i ) − P L ( ψ λ )) . Thus, if E i,j = E (cid:0) ( W i − ¯ W n )( W j − ¯ W n ) (cid:1) /v W , D W Λ = n ( v W ) − X λ ∈ Λ E W n n X i =1 ( W i − ¯ W n )( L ( ψ λ )( X i ) − P L ( ψ λ )) ! = 1 n n X i =1 E (cid:0) ( W i − ¯ W n ) (cid:1) v W ( L ( ψ λ )( X i ) − P L ( ψ λ )) +1 n n X i = j =1 X λ ∈ Λ E i,j ( L ( ψ λ )( X i ) − P L ( ψ λ ))( L ( ψ λ )( X j ) − P L ( ψ λ )) . Since the weights are exchangeable, for all i = 1 , .., n , E (( W i − ¯ W n ) ) = Var( W − ¯ W n ) = v W and for all i = j = 1 , ..., n , v W E i,j = E (cid:0) ( W i − ¯ W n )( W j − ¯ W n ) (cid:1) = E (cid:0) ( W − ¯ W n )( W − ¯ W n ) (cid:1) . Moreover, since P ni =1 ( W i − ¯ W n ) = 0,0 = E n X i =1 ( W i − ¯ W n ) ! = n X i =1 E (cid:0) ( W i − ¯ W n ) (cid:1) + n X i = j =1 v W E i,j = n E (( W − ¯ W n ) ) + n ( n − E (cid:0) ( W − ¯ W n )( W − ¯ W n ) (cid:1) . Hence, for all i = j = 1 , ..., n , E i,j = − / ( n − D W Λ = P n T − U. The last inequalities of Lemma 5.3 follow from the fact that E ( U ) = 0. Finally, p (Λ) − D W Λ n = 1 n P n T + n − n U − (cid:18) n P n T − n U (cid:19) = U. emma 5.4 Let Ω u = \ m ∈M n (cid:26) D Wm n − p ( m ) ≤ ǫ n R m n (cid:27) Ω l = \ m ∈M n (cid:26) D Wm n − p ( m ) ≥ − ǫ n R m n (cid:27) and ˜Ω p = Ω u ∩ Ω l . There exists a constant C > such that P ( ˜Ω cp ) ≤ Ce − (ln n ) γ . Proof :
From Assumption [V] applied with m = m ′ , (see (32)), if l m = l n,γ ( R m , R m ), for all K > D / m ( e m ( K l m ) ) / ≤ Kǫ n R m , p v m D m ( K l m ) ≤ Kǫ n R m ,v m ( K l m ) ≤ ( Kǫ n ) R m , e m ( Kl m ) ≤ ( Kǫ n ) R m . We apply Proposition 2.4 with x = K l m and we obtain P (cid:18) D Wm n − p ( m ) > (cid:0) . Kǫ n + 3( Kǫ n ) + (19 . ( Kǫ n ) (cid:1) R m n − (cid:19) ≤ e − K l m . Thus, for all
K > / ( √ e n ( K ) = n (cid:0) . Kǫ n + 3( Kǫ n ) + (19 . ( Kǫ n ) (cid:1) / ( n − P (cid:18) ∀ m ∈ M n , D Wm n − p ( m ) > e n ( K ) R m n (cid:19) ≤ K ) e − K (ln n ) γ . Take K = 8 / .
31 and n ≥
10 sufficiently large to ensure that 3 K ǫ n + (19 . K ǫ n ≤ e n ( K ) ≤
109 (8 ǫ n + ǫ n ) ≤ ǫ n . We deduce that, for sufficiently large n , P (Ω cu ) ≤ K ) e − K (ln n ) γ . We also apply Proposition 2.4 with x = K l m , and we use the same arguments to provethat, for K = 16 / .
61, for all n ≥
10 sufficiently large to ensure that (40 . K ǫ n ≤ P (cid:18) ∀ m ∈ M n , D Wm n − p ( m ) < − ǫ n R m n (cid:19) ≤ . K ) e − K (ln n ) γ . Hence, the conclusion of Lemma 5.4 holds for sufficiently large n . It holds in general,provided that we increase the constant C if necessary. If c n <
0, there is nothing to prove. We can then assume that c n ≥
0, this implies inparticular that 28 ǫ n ≤ δ n < . We use the notations of Lemma 5.2. From Lemma 5.2, the inequalities (19) will be provedif, on Ω T , D ˆ m ≥ c n D m ∗ and k s − ˜ s k ≥ c n h on inf m ∈M n k s − ˆ s m k . m o ∈ arg min m ∈M n R m , ˆ m minimizes over M n the following criterion.Crit( m ) = P n Q (ˆ s m ) + pen( m ) + k s k + 2 ν n ( s m o )= k s − s m k − p ( m ) + δ ( m o , m ) + pen( m ) . Recall that 0 ≤ pen( m ) ≤ (1 − δ n ) D m /n . On Ω T , for all m in M n , since R m ≥ R m o ,Crit( m ) ≥ k s − s m k − D m n − ǫ n R m n ≥ − (1 + 16 ǫ n ) D m n . Crit( m ) ≤ k s − s m k + 26 ǫ n R m n − δ n D m n = (1 + 26 ǫ n ) k s − s m k − ( δ n − ǫ n ) D m n . When D m ≤ c n D m ∗ ,(1 + 16 ǫ n ) D m ≤ D m ∗ (cid:18) ( δ n − ǫ n ) − (1 + 26 ǫ n ) n k s − s m ∗ k D m ∗ (cid:19) . Thus Crit( m ) ≥ Crit( m ∗ ). This implies that D ˆ m ≥ c n D m ∗ .Moreover, on Ω T , we also have, for all m in M n k s − ˜ s k = R ˆ m n + (cid:18) p ( ˆ m ) − D ˆ m n (cid:19) ≥ (1 − ǫ n ) R ˆ m n , and inf m ∈M n k s − ˆ s m k ≤ inf m ∈M n R m n (1 + 10 ǫ n ) ≤ R m o n (1 + 10 ǫ n ) . Thus k s − ˜ s k ≥ (1 − ǫ n ) R ˆ m n ≥ (1 − ǫ n ) D ˆ m n ≥ (1 − ǫ n ) c n D m ∗ n ≥ c n − ǫ n h on R m o n ≥ c n h on − ǫ n ǫ n inf m ∈M n k s − ˆ s m k . We conclude the proof, saying that ǫ n ≤ /
28 implies that (1 − ǫ n )(1+ 10 ǫ n ) − ≥ / ≥ / . If δ − − ǫ n < −
1, there is nothing to prove, hence, we can assume in the following that δ − − ǫ n > − T introduced in Lemma 5.2. LetΩ pen = \ m ∈M n (cid:26) D m n + δ − R m n ≤ pen( m ) ≤ D m n + δ + R m n (cid:27) , Ω = Ω T ∩ Ω pen and m o ∈ arg min m ∈M n R m . Recall that P (Ω pen ) ≥ − p ′ and that, ˆ m minimizes over m the following criterion.Crit( m ) = P n Q (ˆ s m ) + pen( m ) + k s k + 2 ν n ( s m o )= k s − s m k − p ( m ) + δ ( m o , m ) + pen( m ) . m in M n , since R m ≥ R m o ,Crit( m ) ≥ (1 + δ − ) R m n + (cid:18) D m n − p ( m ) (cid:19) − ǫ n R m n ≥ (1 + δ − − ǫ n ) k s − s m k + (1 + δ − − ǫ n ) D m n ≥ (1 + δ − − ǫ n ) D m n Crit( m ) ≤ (1 + δ + + 26 ǫ n ) R m n . If D m > C n ( δ − , δ + ) R m o ,(1 + δ − − ǫ n ) D m > (1 + δ + + 26 ǫ n ) R m o , Thus Crit( m ) > Crit( m o ), hence D ˆ m ≤ C n ( δ − , δ + ) R m o .Moreover, from (6), for all m in M n k s − ˜ s k ≤ k s − ˆ s m k + (pen( m ) − p ( m )) + (2 p ( ˆ m ) − pen( ˆ m )) + δ ( ˆ m, m ) ≤ k s − ˆ s m k + 2 (cid:18) D m n − p ( m ) (cid:19) + ( δ + + 6 ǫ n ) R m n +2 (cid:18) p ( ˆ m ) − D ˆ m n (cid:19) + ( − δ − + 6 ǫ n ) R ˆ m n ≤ k s − ˆ s m k + (46 ǫ n + δ + ) R m n + (26 ǫ n − δ − ) R ˆ m n . For all m in M n , on Ω T , k s − ˆ s m k = R m n + (cid:18) p ( m ) − D m n (cid:19) ≥ (1 − ǫ n ) R m n . Hence, for all m ∈ M n , k s − ˜ s k ≤ k s − ˆ s m k (cid:18) ǫ n + δ + − ǫ n (cid:19) + 26 ǫ n − δ − − ǫ n k s − ˜ s k . This concludes the proof of Proposition 2.3.
We apply Lemma 5.3 with L = id and Λ = m . By definition of p ( m ) and D Wm , p ( m ) − D Wm n = 1 n ( n − n X i = j =1 X λ ∈ m ( ψ λ ( X i ) − P ψ λ )( ψ λ ( X j ) − P ψ λ ) . Thus, from Lemma 6.7 in the appendix, for all x > P p ( m ) − D Wm n > . D / m ( e m x ) / + 3 p v m D m x + 3 v m x + e m (19 . x ) n − ! ≤ e − x . P D Wm n − p ( m ) > D / m ( e m x ) / + 7 . p v m D m x + e m (40 . x ) n − ! ≤ . e − x . m in M n , the function T m = P λ ∈ m ( ψ λ − P ψ λ ) and the random variable U m = 1 n ( n − n X i = j =1 X λ ∈ m ( ψ λ ( X i ) − P ψ λ )( ψ λ ( X j ) − P ψ λ ) . We apply Lemma 5.3 with L = id , we obtain D Wm − D m = ν n ( T m ) − U m . From Bernstein’s inequality (see Proposition 6.3), for all x > ξ in {− , } , P ξν n ( T m ) > r T m ( X )) xn + k T m k ∞ x n ! ≤ e − x . From Cauchy-Schwarz inequality, T m = sup t ∈ B m ( t − P t ) , thus k T m k ∞ /n = 4 e m andVar( T m ( X )) /n ≤ k T m k ∞ P T m /n = 4 e m D m , therefore, for all x > ξ in {− , } , P (cid:18) ξν n ( T m ) > p e m D m x + 4 e m x (cid:19) ≤ e − x . Moreover, from Lemma 6.7 in the appendix, for all x > P U m > . D / m ( e m x ) / + 3 p v m D m x + 3 v m x + e m (19 . x ) n − ! ≤ e − x . P U m < − D / m ( e m x ) / + 7 . p v m D m x + e m (40 . x ) n − ! ≤ . e − x . We deduce that, for all x >
0, with probability larger than 1 − . e − x , D Wm − D m ≤ p e m D m x + e m (cid:18) x . x ) n − (cid:19) + 9 D / m ( e m x ) / + 7 . p v m D m xn − . Moreover, for all x >
0, on an event of probability larger than 1 − e − x , D Wm − D m ≥ − p e m D m x − e m (cid:18) x . x ) n − (cid:19) − . D / m ( e m x ) / + 3 p v m D m x + 3 v m xn − . Recall that P (Ω cT ) ≤ Ce − (ln n ) γ , and that, on Ω T , ∀ m ∈ M n , (1 − ǫ n ) R m n ≤ k s − ˆ s m k , ∀ m, m ′ ∈ M n , δ ( m, m ′ ) ≤ ǫ n R m ∨ R m ′ n . p be the event defined in Lemma 5.4 and let Ω = ˜Ω p ∩ Ω T , from Lemma 5.2, P (Ω c ) ≤ Ce − (ln n ) γ . Recall that pen( m ) = 2 D Wm /n . On Ω, from (6), for all n such that20 ǫ n <
1, for all m in M n , k s − ˜ s k ≤ k s − ˆ s m k + 26 ǫ n R m n + 16 ǫ n R ˆ m n ≤ k s − ˆ s m k + 26 ǫ n − ǫ n k s − ˆ s m k + 16 ǫ n − ǫ n k s − ˜ s k . Hence, for all n such that 20 ǫ n <
1, on Ω,(1 − ǫ n ) k s − ˜ s k ≤ (1 + 6 ǫ n ) inf m ∈M n k s − ˆ s m k . For all n such that 42 / (1 − ǫ n ) < k s − ˜ s k ≤ (cid:18) ǫ n − ǫ n (cid:19) inf m ∈M n k s − ˆ s m k ≤ (1 + 100 ǫ n ) inf m ∈M n k s − ˆ s m k . Hence (25) holds for sufficiently large n , it holds in general provided that we enlarge theconstant C if necessary.. In this Section, we state and prove some technical lemmas that are useful in the proofs.The main tool is the first Lemma based on Bousquet’s version of Talagrand’s inequality.It is a concentration inequality for the square of the supremum of the empirical processover a uniformly bounded class of functions. Recall first Bousquet’s [10] and Klein & Rio[17] versions of Talagrand’s inequality.
Theorem 6.1 (Bousquet’s bound) Let X , ..., X n be i.i.d. random variables valued in ameasurable space ( X , X ) and let S be a class of real valued functions bounded by b . Let v = sup t ∈ S Var ( t ( X )) and let Z = sup t ∈ S ν n t . Then ∀ x > , P Z > E ( Z ) + r n ( v + 2 b E ( Z )) x + bx n ! ≤ e − x . Theorem 6.2 (Klein & Rio’s bound) Let X , ..., X n be i.i.d. random variables valued ina measurable space ( X , X ) and let S be a class of real valued functions bounded by b . Let v = sup t ∈ S Var ( t ( X )) and let Z = sup t ∈ S ν n t . Then ∀ x > , P Z < E ( Z ) − r n ( v + 2 b E ( Z )) x − bx n ! ≤ e − x . Let us now also recall Bernstein’s inequality.
Proposition 6.3
Bernstein’s inequalityLet X , ..., X n be iid random variables valued in a measurable space ( X, X ) and let t be ameasurable real valued function. Then, for all x > , P ν n ( t ) > r Var ( t ( X )) xn + k t k ∞ x n ! ≤ e − x .
26e derive from these bounds the following useful corollary. Hereafter, S denotes a symetricclass of real valued functions upper bounded by b , v = sup t ∈ S Var( t ( X )), Z = sup t ∈ S ν n t , n E ( Z ) = D . Since S is symetric, we always have Z ≥ Corollary 6.4
Let S be a symetric class of real valued functions upper bounded by b , v = sup t ∈ S Var ( t ( X )) , Z = sup t ∈ S ν n t , n E ( Z ) = D , e b = b /n and nE m = 225 e b + (cid:16) . √ π (cid:17) √ v D + √ D / e / b , then E ( Z Z ≥ E ( Z ) ) ≤ ( E ( Z )) P ( Z ≥ E ( Z )) + E m . (38) In particular, ( E ( Z )) ≤ E ( Z ) ≤ ( E ( Z )) + E m . (39) Proof :
We have E ( Z Z ≥ E ( Z ) ) = Z ∞ P ( Z Z ≥ E ( Z ) > x ) dx = Z ∞ P ( Z Z ≥ E ( Z ) > √ x ) dx = ( E ( Z )) P ( Z ≥ E ( Z )) + Z ∞ ( E ( Z )) P ( Z > √ x ) dx Take x = ( E ( Z ) + p v + 2 b E ( Z )) y/n + by/ (3 n )) in the previous integral, from Bous-quet’s version of Talagrand’s inequality, E ( Z Z ≥ E ( Z ) ) ≤ E ( Z ) r n ( v + 2 b E ( Z )) Z ∞ e − y √ y dy + 2 v + 14 b E ( Z ) / n Z ∞ e − y dy + bn r n ( v + 2 b E ( Z )) Z ∞ e − y √ ydy + 2 b n Z ∞ ye − y dy. Classical computations lead to Z ∞ e − y √ y dy = 2 Z ∞ e − y √ ydy = √ π, Z ∞ e − y dy = Z ∞ ye − y dy = 1 . Therefore, if e b = b /n , using repeatedly the inequalities a α b − α ≤ αa + (1 − α ) b (40)and √ a + b ≤ √ a + √ b , we obtain, for all η > √ ne b E ( Z ) ≤ e b η + 2 η e / b ( √ n E ( Z )) / , ( √ n E ( Z )) / e / b ≤ η e / b ( √ n E ( Z )) / + 2 e b √ η . Thus E ( Z Z ≥ E ( Z ) ) ≤ (cid:18) v + 29 e b + v √ πe b (cid:19) n + √ π p √ n E ( Z ) ( e b ) / n + (cid:18) √ e b + v √ π (cid:19) √ n E ( Z ) n + 2 √ π ( √ n E ( Z )) / ( e b ) / n ≤ η √ π ! v n + r πn v E ( Z ) +
29 + √ π η + 2 √ π √ η + 149 η ! e b n + (cid:18) η (cid:18) √ π (cid:19) + 2 √ π (cid:19) ( √ n E ( Z )) / ( e b ) / n . η = 0 . E ( Z Z ≥ E ( Z ) ) ≤ . v n + 15 e b n + √ πv √ n E ( Z ) n + √
15 ( √ n E ( Z )) / ( e b ) / n . Finally, we use Cauchy-Schwarz inequality to obtain that √ n E ( Z ) ≤ ( n E ( Z )) / =( D ) / . Since v ≤ D , we get (38).We deduce from this result the following concentration inequalities for Z Corollary 6.5
We deduce from this result the following concentration inequalities for $Z^2$.

Corollary 6.5 Let $\tilde b = b^2/n$. We have, for all $x > 0$,
\[
P\left(Z^2 - \frac Dn > \frac{D^{3/4}(\tilde b(19x)^2)^{1/4} + 3\sqrt{Dvx} + 3vx + \tilde b(19x)^2}{n}\right) \le e^{-x}.
\]
Moreover, for all $x > 0$, with probability larger than $1 - e^{-x}$,
\[
\frac Dn - Z^2 \le \frac{D^{3/4}\tilde b^{1/4}\left(\sqrt{15} + 4.13\sqrt x\right) + \sqrt{vD}\left(4.61 + 3\sqrt x\right) + 225\tilde b\left(6.1x^2 + 1\right)}{n}. \quad (41)
\]

Proof:
From Bousquet's version of Talagrand's inequality and from $(E(Z))^2 \le E(Z^2) = D/n$, we obtain that, for all $x > 0$, with probability larger than $1 - e^{-x}$, $Z^2 - D/n$ is not larger than
\[
\frac{4D^{3/4}(\tilde bx^2)^{1/4} + \sqrt D\left(\frac{14}3\sqrt{\tilde b}\,x + 2\sqrt{2vx}\right) + \frac43 D^{1/4}(\tilde bx^2)^{3/4} + 3vx + \tilde bx^2}{n}.
\]
We use repeatedly the inequality $a^\alpha b^{1-\alpha} \le \alpha a + (1-\alpha)b$ to obtain that, with probability at least $1 - e^{-x}$, $Z^2 - D/n$ is not larger than
\[
\frac{\left(4 + \frac{32\eta}9\right)D^{3/4}(\tilde bx^2)^{1/4} + 2\sqrt2\sqrt{Dvx} + 3vx + \left(3 + \frac{14}{9\eta^2} + \frac8{9\eta}\right)\tilde bx^2}{n}.
\]
For $\eta = 0.07$, this gives, with probability at least $1 - e^{-x}$,
\[
Z^2 - \frac Dn \le \frac{D^{3/4}(\tilde b(19x)^2)^{1/4} + 2\sqrt2\sqrt{Dvx} + 3vx + \tilde b(19x)^2}{n},
\]
and the first inequality follows since $2\sqrt2 \le 3$. For the second one, we use Klein & Rio's version of Talagrand's inequality to obtain, for all $x > 0$ such that $r(x) = \sqrt{2(v + 2bE(Z))x/n} + 8bx/(3n) < E(Z)$,
\[
P\left(Z^2 < (E(Z) - r(x))^2\right) \le e^{-x}.
\]
We have $(E(Z) - r(x))^2 = (E(Z))^2 - 2E(Z)r(x) + r(x)^2 \ge (E(Z))^2 - 2E(Z)r(x)$, thus
\[
P\left(Z^2 < (E(Z))^2 - 2E(Z)r(x)\right) \le e^{-x}.
\]
From the previous corollary, $(E(Z))^2 \ge E(Z^2) - E_m$, thus
\[
P\left(Z^2 < E(Z^2) - E_m - 2E(Z)r(x)\right) \le e^{-x}.
\]
(When $r(x) \ge E(Z)$, the resulting bound holds trivially, since then $D/n \le (E(Z))^2 + E_m \le 2E(Z)r(x) + E_m$ and $Z^2 \ge 0$.) In order to conclude the proof of Corollary 6.5, just remark that
\[
2E(Z)r(x) \le \frac{4D^{3/4}(\tilde bx^2)^{1/4} + 3\sqrt{Dvx} + \frac{16}3\sqrt{D\tilde b}\,x}{n} \le \frac{\left(4 + \frac{32\eta}9\right)D^{3/4}(\tilde bx^2)^{1/4} + 3\sqrt{Dvx} + \frac{16}{9\eta^2}\tilde bx^2}{n}.
\]
For $\eta = 0.036$, this gives (41).
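To see the concentration of $Z^2$ around $D/n$ in a concrete case, one can take for $B$ the unit ball of the span of a histogram basis, for which $Z^2 = \sum_\lambda \nu_n(\psi_\lambda)^2$ (this identity is established in the proof of Lemma 6.7 below). The following sketch is ours, not part of the paper (assuming Python with numpy; the basis and all parameters are arbitrary choices).

import numpy as np

rng = np.random.default_rng(2)

# Z^2 = sum_l nu_n(psi_l)^2 for the histogram basis psi_l = sqrt(m) 1_{I_l}
# on uniform [0, 1] data; bin counts of an i.i.d. uniform sample are
# multinomial, and D = sum_l Var(psi_l(X_1)) = m - 1.
n, m, reps = 500, 20, 20_000
counts = rng.multinomial(n, np.full(m, 1.0 / m), size=reps)
nu = np.sqrt(m) * (counts / n - 1.0 / m)       # nu_n(psi_l), shape (reps, m)
Z2 = (nu ** 2).sum(axis=1)

D = m - 1
print(f"n * E(Z^2) ~= {n * Z2.mean():.2f} (theory: D = {D})")
print(f"P(Z^2 - D/n > 2*D/n) ~= {(Z2 - D / n > 2 * D / n).mean():.4f}")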
We deduce the following concentration inequalities for $Z^2$ around its mean.

Corollary 6.6 For all $x > 0$,
\[
P\left(Z^2 - \frac Dn > \frac{D^{3/4}(\tilde b(19x)^2)^{1/4} + 3\sqrt{Dvx} + 3vx + \tilde b(19x)^2}{n}\right) \le e^{-x},
\]
\[
P\left(Z^2 - \frac Dn < -\frac{8.1\,D^{3/4}(\tilde bx^2)^{1/4} + 7.61\sqrt{vDx} + \tilde b(40x)^2}{n}\right) \le e\,e^{-x}.
\]

Proof: The first inequality is the first part of Corollary 6.5. In order to obtain the second one, we remark that it is trivial when $x \le 1$, since then $e\,e^{-x} \ge 1$. Thus we only have to use (41) for $x > 1$, where $1 \le \sqrt x \le x \le x^2$, so that $\sqrt{15} + 4.13\sqrt x \le 8.1\sqrt x$, $4.61 + 3\sqrt x \le 7.61\sqrt x$ and $225(6.1x^2 + 1) \le 1600x^2 = (40x)^2$.

Let us now derive concentration inequalities for degenerate $U$-statistics of order 2. The following result generalizes a previous inequality due to Houdré & Reynaud-Bouret [16] to random variables taking values in a measurable space.
Lemma 6.7 Let $X, X_1, \ldots, X_n$ be i.i.d. random variables taking values in a measurable space $(\mathbb X, \mathcal X)$ with common law $P$. Let $\mu$ be a measure on $(\mathbb X, \mathcal X)$ and let $(t_\lambda)_{\lambda\in\Lambda}$ be a set of functions in $L^2(\mu)$. Let
\[
B = \left\{t = \sum_{\lambda\in\Lambda}a_\lambda t_\lambda,\ \sum_{\lambda\in\Lambda}a_\lambda^2 \le 1\right\},\qquad D = E\left(\sup_{t\in B}(t(X) - Pt)^2\right),
\]
$v = \sup_{t\in B}\mathrm{Var}(t(X))$, $b = \sup_{t\in B}\|t\|_\infty$ and $\tilde b = b^2/n$. Let
\[
U = \frac1{n(n-1)}\sum_{i\ne j}\sum_{\lambda\in\Lambda}(t_\lambda(X_i) - Pt_\lambda)(t_\lambda(X_j) - Pt_\lambda).
\]
Then the following inequalities hold: for all $x > 0$,
\[
P\left(U > \frac{5.31\,D^{3/4}(\tilde bx^2)^{1/4} + 3\sqrt{vDx} + 3vx + \tilde b(19.1x)^2}{n-1}\right) \le 2e^{-x}, \quad (42)
\]
\[
P\left(U < -\frac{9.1\,D^{3/4}(\tilde bx^2)^{1/4} + 7.61\sqrt{vDx} + \tilde b(40.1x)^2}{n-1}\right) \le (1+e)e^{-x}. \quad (43)
\]

Proof:
Remark that, from the Cauchy-Schwarz inequality,
\[
\sup_{t\in B}(\nu_n(t))^2 = \sup_{\sum_\lambda a_\lambda^2\le1}\left(\sum_{\lambda\in\Lambda}a_\lambda\nu_n(t_\lambda)\right)^2 = \sum_{\lambda\in\Lambda}(\nu_n(t_\lambda))^2.
\]
For all $x$ in $\mathbb X$, from the Cauchy-Schwarz inequality again,
\[
\sup_{t\in B}(t(x) - Pt)^2 = \sum_{\lambda\in\Lambda}(t_\lambda(x) - Pt_\lambda)^2;
\]
in particular, $D = \sum_{\lambda\in\Lambda}\mathrm{Var}(t_\lambda(X))$. Moreover, easy algebra leads to
\[
\sum_{\lambda\in\Lambda}(\nu_n(t_\lambda))^2 = \frac1{n^2}\sum_{i=1}^n\sum_{\lambda\in\Lambda}(t_\lambda(X_i) - Pt_\lambda)^2 + \frac1{n^2}\sum_{i\ne j}\sum_{\lambda\in\Lambda}(t_\lambda(X_i) - Pt_\lambda)(t_\lambda(X_j) - Pt_\lambda) = \frac1nP_n\left(\sum_{\lambda\in\Lambda}(t_\lambda - Pt_\lambda)^2\right) + \frac{n-1}nU.
\]
Let $Z^2 = \sup_{t\in B}(\nu_n(t))^2$ and $T_\Lambda = \sum_{\lambda\in\Lambda}(t_\lambda - Pt_\lambda)^2$, so that $E(Z^2) = E\left(\frac1nP_nT_\Lambda\right) = \frac Dn$. Hence
\[
U = \frac n{n-1}\left(Z^2 - E(Z^2) - \frac{\nu_n(T_\Lambda)}n\right).
\]
From Corollary 6.6, for all $x > 0$,
\[
P\left(Z^2 - \frac Dn > \frac{D^{3/4}(\tilde b(19x)^2)^{1/4} + 3\sqrt{vDx} + 3vx + \tilde b(19x)^2}{n}\right) \le e^{-x},
\]
\[
P\left(Z^2 - \frac Dn < -\frac{8.1\,D^{3/4}(\tilde bx^2)^{1/4} + 7.61\sqrt{vDx} + \tilde b(40x)^2}{n}\right) \le e\,e^{-x}.
\]
Moreover, since $\mathrm{Var}(T_\Lambda(X)) \le b^2E(T_\Lambda(X)) = n\tilde bD$ and $\|T_\Lambda\|_\infty \le b^2 = n\tilde b$, Bernstein's inequality gives, for all $x > 0$,
\[
P\left(-\nu_n(T_\Lambda) > \sqrt{2D\tilde bx} + \frac{\tilde bx}3\right) \le e^{-x},\qquad P\left(\nu_n(T_\Lambda) > \sqrt{2D\tilde bx} + \frac{\tilde bx}3\right) \le e^{-x}.
\]
We apply inequality (40) with $a = D^{3/4}(\tilde bx^2)^{1/4}$, $b = \tilde b\sqrt x$ and $\alpha = 2/3$, which gives $\sqrt{D\tilde bx} = a^{2/3}b^{1/3} \le \frac23a + \frac13b$, hence, for all $x > 0$,
\[
P\left(\pm\nu_n(T_\Lambda) > \frac{2\sqrt2}3D^{3/4}(\tilde bx^2)^{1/4} + \tilde b\,\frac{x + \sqrt2\sqrt x}3\right) \le e^{-x}.
\]
Therefore, for all $x > 0$,
\[
P\left(U > \frac{D^{3/4}(\tilde b(19x)^2)^{1/4} + \frac{2\sqrt2}3D^{3/4}(\tilde bx^2)^{1/4} + 3\sqrt{vDx} + 3vx + \tilde b\left((19x)^2 + \frac{x + \sqrt2\sqrt x}3\right)}{n-1}\right) \le 2e^{-x},
\]
\[
P\left(U < -\frac{\left(8.1 + \frac{2\sqrt2}3\right)D^{3/4}(\tilde bx^2)^{1/4} + 7.61\sqrt{vDx} + \tilde b\left((40x)^2 + \frac{x + \sqrt2\sqrt x}3\right)}{n-1}\right) \le (1+e)e^{-x}.
\]
These inequalities are trivial when $x < 1$, so we only have to use them for $x \ge 1$, where $\sqrt x \le x \le x^2$; then $D^{3/4}(\tilde b(19x)^2)^{1/4} + \frac{2\sqrt2}3D^{3/4}(\tilde bx^2)^{1/4} \le 5.31\,D^{3/4}(\tilde bx^2)^{1/4}$, $8.1 + \frac{2\sqrt2}3 \le 9.1$, $(19x)^2 + (x + \sqrt2\sqrt x)/3 \le (19.1x)^2$ and $(40x)^2 + (x + \sqrt2\sqrt x)/3 \le (40.1x)^2$, which gives (42) and (43).
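The decomposition of $U$ used in this proof is easy to verify numerically. The following sketch is ours (assuming Python with numpy; the histogram basis and all parameters are arbitrary choices): it computes $U$ from its definition and from the identity $U = \frac n{n-1}\left(Z^2 - D/n - \nu_n(T_\Lambda)/n\right)$, and the two values coincide up to floating-point error.

import numpy as np

rng = np.random.default_rng(3)

# Degenerate U-statistic of Lemma 6.7 for the histogram basis, computed
# both from its definition and from the decomposition used in the proof.
n, m = 300, 10
X = rng.uniform(0.0, 1.0, n)
bins = np.minimum((X * m).astype(int), m - 1)
psi = np.sqrt(m) * (np.arange(m)[:, None] == bins[None, :])  # psi_l(X_i)
centered = psi - np.sqrt(m) / m          # t_l(X_i) - P t_l, with P t_l = 1/sqrt(m)

S = centered.sum(axis=1)                 # sum_i (t_l(X_i) - P t_l)
U = (S @ S - (centered ** 2).sum()) / (n * (n - 1))

Z2 = ((centered.mean(axis=1)) ** 2).sum()  # sum_l nu_n(t_l)^2
D = m - 1                                  # sum_l Var(t_l(X))
T = (centered ** 2).sum(axis=0)            # T_Lambda(X_i)
nu_T = T.mean() - D                        # nu_n(T_Lambda), since P T_Lambda = D
U_alt = n / (n - 1) * (Z2 - D / n - nu_T / n)
print(f"U = {U:.6e}, via decomposition = {U_alt:.6e}")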
Lemma 6.8 Let $X, X_1, \ldots, X_n$ be i.i.d. random variables taking values in a measurable space $(\mathbb X, \mathcal X)$ with common law $P$. Let $\mu$ be a measure on $(\mathbb X, \mathcal X)$ and let $(\psi_\lambda)_{\lambda\in\Lambda}$ be an orthonormal system in $L^2(\mu)$. Let $L$ be a linear operator on $L^2(\mu)$ and let
\[
B = \left\{t = \sum_{\lambda\in\Lambda}a_\lambda L(\psi_\lambda),\ \sum_{\lambda\in\Lambda}a_\lambda^2 \le 1\right\},
\]
$v = \sup_{t\in B}\mathrm{Var}(t(X_1))$, $b = \sup_{t\in B}\|t\|_\infty$ and $\tilde b = b^2/n$. Let $u$ be a function in $S$, the linear space spanned by the functions $(\psi_\lambda)_{\lambda\in\Lambda}$, and let $\eta > 0$. Then the following inequality holds:
\[
\forall x > 0,\quad P\left(\nu_n(L(u)) > \eta\|u\|^2 + \frac{2vx + \tilde bx^2}{\eta n}\right) \le e^{-x}. \quad (44)
\]

Proof:
From Bernstein's inequality, for all $x > 0$,
\[
P\left(\nu_n(L(u)) > \sqrt{\frac{2\mathrm{Var}(L(u)(X_1))x}{n}} + \frac{\|L(u)\|_\infty x}{3n}\right) \le e^{-x}.
\]
Since $t = L(u/\|u\|)$ belongs to $B$,
\[
\sqrt{\frac{2\mathrm{Var}(L(u)(X_1))x}{n}} + \frac{\|L(u)\|_\infty x}{3n} = \|u\|\left(\sqrt{\frac{2\mathrm{Var}(t(X_1))x}{n}} + \frac{\|t\|_\infty x}{3n}\right) \le \eta\|u\|^2 + \frac1{4\eta}\left(\sqrt{\frac{2vx}{n}} + \frac{bx}{3n}\right)^2.
\]
We conclude the proof using the inequality $(a + b)^2 \le 2a^2 + 2b^2$, which gives
\[
\frac1{4\eta}\left(\sqrt{\frac{2vx}{n}} + \frac{bx}{3n}\right)^2 \le \frac{vx + \tilde bx^2/18}{\eta n} \le \frac{2vx + \tilde bx^2}{\eta n}.
\]

References

[1] H. Akaike. Statistical predictor identification. Ann. Inst. Statist. Math., 22:203–217, 1970.
[2] H. Akaike. Information theory and an extension of the maximum likelihood principle. In Second International Symposium on Information Theory (Tsahkadsor, 1971), pages 267–281. Akadémiai Kiadó, Budapest, 1973.
[3] S. Arlot. Resampling and model selection. PhD thesis, Université Paris-Sud 11, 2007.
[4] S. Arlot. Model selection by resampling penalization. Electron. J. Statist., 3:557–624, 2009.
[5] S. Arlot and P. Massart. Data-driven calibration of penalties for least-squares regression. Journal of Machine Learning Research, 10:245–279, 2009.
[6] A. Barron, L. Birgé, and P. Massart. Risk bounds for model selection via penalization. Probab. Theory Related Fields, 113(3):301–413, 1999.
[7] L. Birgé. Model selection for density estimation with L2-loss. Preprint, 2008.
[8] L. Birgé and P. Massart. From model selection to adaptive estimation. In Festschrift for Lucien Le Cam, pages 55–87. Springer, New York, 1997.
[9] L. Birgé and P. Massart. Minimal penalties for Gaussian model selection. Probab. Theory Related Fields, 138(1-2):33–73, 2007.
[10] O. Bousquet. A Bennett concentration inequality and its application to suprema of empirical processes. C. R. Math. Acad. Sci. Paris, 334(6):495–500, 2002.
[11] A. Célisse. Density estimation via cross-validation: model selection point of view. Preprint, arXiv:0811.0802, 2008.
[12] D. L. Donoho, I. M. Johnstone, G. Kerkyacharian, and D. Picard. Density estimation by wavelet thresholding. Ann. Statist., 24(2):508–539, 1996.
[13] B. Efron. Bootstrap methods: another look at the jackknife. Ann. Statist., 7(1):1–26, 1979.
[14] B. Efron. Estimating the error rate of a prediction rule: improvement on cross-validation. J. Amer. Statist. Assoc., 78(382):316–331, 1983.
[15] M. Fromont. Model selection by bootstrap penalization for classification. Machine Learning, 66(2-3):165–207, 2007.
[16] C. Houdré and P. Reynaud-Bouret. Exponential inequalities, with constants, for U-statistics of order two. In Stochastic inequalities and applications, volume 56 of Progr. Probab., pages 55–69. Birkhäuser, Basel, 2003.
[17] T. Klein and E. Rio. Concentration around the mean for maxima of empirical processes. Ann. Probab., 33(3):1060–1077, 2005.
[18] C. L. Mallows. Some comments on C_p. Technometrics, 15:661–675, 1973.
[19] P. Massart. Concentration inequalities and model selection, volume 1896 of Lecture Notes in Mathematics. Springer, Berlin, 2007. Lectures from the 33rd Summer School on Probability Theory held in Saint-Flour, July 6–23, 2003, with a foreword by Jean Picard.
[20] M. Rudemo. Empirical choice of histograms and kernel density estimators. Scand. J. Statist., 9(2):65–78, 1982.
[21] M. Stone. Cross-validatory choice and assessment of statistical predictions. J. Roy. Statist. Soc. Ser. B, 36:111–147, 1974. With discussion and a reply by the author.
[22] M. Talagrand. New concentration inequalities in product spaces. Invent. Math., 126(3):505–563, 1996.