`local' vs. `global' parameters -- breaking the gaussian complexity barrier
Shahar Mendelson∗

September 4, 2018
Abstract
We show that if $F$ is a convex class of functions that is $L$-subgaussian, the error rate of learning problems generated by independent noise is equivalent to a fixed point determined by 'local' covering estimates of the class, rather than by the gaussian averages. To that end, we establish new sharp upper and lower estimates on the error rate for such problems.

The focus of this article is on the question of prediction. Given a class of functions $F$ defined on a probability space $(\Omega,\mu)$ and an unknown target random variable $Y$, one would like to identify an element of $F$ whose 'predictive capabilities' are (almost) the best possible in the class. The notion of 'best' is measured via the point-wise cost of predicting $f(x)$ instead of $y$, and the best function in the class is the one that minimizes the average cost. Here, we will consider the squared loss: the cost of predicting $f(x)$ rather than $y$ is $(f(x)-y)^2$, and if $X$ is distributed according to $\mu$, the goal is to identify
\[
f^* = \operatorname*{argmin}_{f \in F} \mathbb{E}(f(X)-Y)^2 = \operatorname*{argmin}_{f \in F} \|f-Y\|_{L_2}^2,
\]
where the expectation is taken with respect to the joint distribution of $X$ and $Y$ on the product space $\Omega \times \mathbb{R}$.

∗Department of Mathematics, Technion, I.I.T, Haifa 32000, Israel. email: [email protected]
Supported in part by the Mathematical Sciences Institute, The Australian National University, Canberra, ACT 2601, Australia. Additional support was given by the Israel Science Foundation grant 900/10.

One is given a sample $(X_i,Y_i)_{i=1}^N$, selected according to the $N$-product of the joint distribution of $X$ and $Y$, and using this data, one must select some (random) $f \in F$.

Definition 1.1
Given a sample size $N$ and a class $F$ defined on $(\Omega,\mu)$, a learning procedure is a map $\Psi : (\Omega \times \mathbb{R})^N \to F$. For a set $\mathcal{Y}$ of admissible targets, $\Psi$ performs with confidence $1-\delta$ and accuracy $\mathcal{E}_p$ if for every $Y \in \mathcal{Y}$, and setting $\tilde f = \Psi((X_i,Y_i)_{i=1}^N)$,
\[
\mathbb{E}\bigl((\tilde f(X) - Y)^2 \,\big|\, (X_i,Y_i)_{i=1}^N\bigr) \le \mathbb{E}(f^*(X)-Y)^2 + \mathcal{E}_p
\]
with probability at least $1-\delta$ relative to the $N$-product of the joint distribution of $X$ and $Y$.

The accuracy (or error) $\mathcal{E}_p$ is a function of $F$, $N$ and $\delta$, and may depend on some features of the target $Y$ as well, for example, its norm in some $L_q$ space.

A fundamental problem in Learning Theory is to identify the features of the underlying class $F$ and of the set of admissible targets $\mathcal{Y}$ that govern $\mathcal{E}_p$; in particular, the way $\mathcal{E}_p$ scales with the sample size $N$ (the so-called error rate). This question has been studied extensively, and we refer the reader to the manuscripts [2, 9, 6, 17, 3, 10, 11] for more information on its history and on some more recent progress.

Here, the aim is to obtain matching upper and lower bounds on $\mathcal{E}_p$ that hold for any reasonable class $F$, at least under some assumptions which we will now outline.

It is well understood that the ability to predict is quantified by various complexity parameters of the underlying class. Frequently, one encounters parameters that are based on various gaussian and empirical/multiplier processes indexed by 'localizations' of $F$ (see, e.g., [7]), and any hope of obtaining matching bounds on $\mathcal{E}_p$ must be based on sharp estimates on these processes. Unfortunately, the analysis of empirical/multiplier processes is, in general, highly nontrivial.
Moreover, and unlike gaussian processes, there is no clear path that leads to sharp bounds on empirical processes, and even when upper estimates are available, they are often loose and lead to suboptimal bounds on $\mathcal{E}_p$.

The one generic example in which a more satisfactory theory of empirical/multiplier processes is known is when the indexing class is $L$-subgaussian.

Definition 1.2 A class $F \subset L_2(\mu)$ is $L$-subgaussian with respect to the measure $\mu$ if for every $p \ge 2$ and every $f,h \in F \cup \{0\}$,
\[
\|f-h\|_{L_p(\mu)} \le L\sqrt{p}\,\|f-h\|_{L_2(\mu)},
\]
and if the canonical gaussian process $\{G_f : f \in F\}$ is bounded (see the book [4] for a detailed survey on gaussian processes).

More facts on subgaussian classes may be found in [8, 18, 4, 14, 7]. For our purposes, the main feature of subgaussian classes is that the empirical and multiplier processes that govern $\mathcal{E}_p$ may be bounded from above using properties of the canonical gaussian process indexed by the class, giving one some hope of obtaining sharp estimates. Because of that feature, we will focus in what follows on subgaussian classes.

Despite their importance, complexity parameters are not the entire story when it comes to $\mathcal{E}_p$. For example, it is possible to construct a class consisting of just two functions, $\{f_0, f_1\}$, but if the target $Y$ is a $1/\sqrt{N}$-perturbation of the midpoint $(f_0+f_1)/2$, no learning procedure can perform with an error that is better than $c/\sqrt{N}$ having been given a sample of cardinality $N$ (see, e.g., [1]). Thus, rather than being solely determined by the complexity of the underlying class, there is an additional geometric requirement on $F$ and $\mathcal{Y}$, which is there to ensure that all the admissible targets in $\mathcal{Y}$ are located in a favourable position relative to $F$ (see [13] for more details). One may show that if $F \subset L_2(\mu)$ is compact and convex, any target $Y \in L_2$ is in a favourable position relative to $F$. Therefore, to remove possible geometric obstructions, we will assume that $F \subset L_2(\mu)$ is compact and convex.

Finally, for a reason that will become clear later, we will not study a general class of admissible targets $\mathcal{Y}$, but rather consider targets of the form $Y = f_0(X) + W$ for some $f_0 \in F$ and $W$ that is orthogonal to $\operatorname{span}(F)$ (e.g., a mean-zero $W \in L_2$ that is independent of $X$ is a 'legal' choice).

With all these assumptions in place, let us formulate the question we would like to study:

Question 1.3
Let $F \subset L_2(\mu)$ be a compact, convex class that is $L$-subgaussian with respect to $\mu$. Given targets of the form $Y = f_0(X) + W$ as above, find matching upper and lower bounds (up to constants) on $\mathcal{E}_p$.

Let us recall the following standard definitions.

Definition 1.4
Let $F \subset L_2(\mu)$. Set $F - h = \{f-h : f \in F\}$ and $F - F = \{f-h : f,h \in F\}$. Denote by $\operatorname{star}(F) = \{\lambda f : f \in F,\ 0 \le \lambda \le 1\}$ the star-shaped hull of $F$ with $0$; $F$ is star-shaped around $0$ if $\operatorname{star}(F) = F$.

Let $\{G_f : f \in F\}$ be the canonical gaussian process indexed by $F$ and set
\[
\mathbb{E}\|G\|_F = \sup\Bigl\{\mathbb{E}\sup_{f \in F'} G_f : F' \subset F,\ F' \text{ is finite}\Bigr\}.
\]
Finally, let $D$ be the unit ball in $L_2(\mu)$.

The best known bounds on $\mathcal{E}_p$ in the subgaussian context have been established in [7] and are based on two fixed points:

Definition 1.5
For $\kappa_1, \kappa_2 > 0$, set
\[
r_M(\kappa_1, f) = \inf\bigl\{s > 0 : \mathbb{E}\|G\|_{(F-f) \cap sD} \le \kappa_1 s^2 \sqrt{N}\bigr\} \tag{1.1}
\]
and
\[
r_Q(\kappa_2, f) = \inf\bigl\{s > 0 : \mathbb{E}\|G\|_{(F-f) \cap sD} \le \kappa_2 s \sqrt{N}\bigr\}. \tag{1.2}
\]
Put $r_M(\kappa_1) = \sup_{f \in F} r_M(\kappa_1, f)$ and $r_Q(\kappa_2) = \sup_{f \in F} r_Q(\kappa_2, f)$.

In the context of the problem we are interested in, one has the following:
Theorem 1.6 [7] For every $L \ge 1$ there exist constants $c_0$, $c_1$ and $c_2$ that depend only on $L$ for which the following holds. Let $F \subset L_2(\mu)$ be a compact, convex, $L$-subgaussian class of functions, set $Y = f_0(X) + W$ and assume that for every $p \ge 2$, $\|W\|_{L_p} \le L\sqrt{p}\,\|W\|_{L_2}$. There is a learning procedure (empirical risk minimization performed in $F$) for which, if
\[
r \ge \max\bigl\{r_M(c_0/\|W\|_{L_2}),\ r_Q(c_1)\bigr\},
\]
then with probability at least $1 - 2\exp\bigl(-c_2 N \min\{1,\ r^2/\|W\|_{L_2}^2\}\bigr)$, the error of the procedure is at most $\mathcal{E}_p \le r^2$.

The lower estimates from [7] involve two parameters that play a role analogous to that of $r_M$ and $r_Q$, and that are based on the notion of packing numbers.

Definition 1.7
Let $E$ be a normed space and set $B$ to be its unit ball. Let $M(A, rB)$ be the cardinality of a maximal $r$-separated subset of $A$ with respect to the given norm, that is, the cardinality of the largest subset $(a_i)_{i=1}^m \subset A$ for which $\|a_i - a_j\| \ge r$ for every $i \ne j$.

Definition 1.8
For $\eta_1, \eta_2 > 0$ set
\[
\gamma_M(\eta_1, f) = \inf\bigl\{s > 0 : \log M\bigl((F-f) \cap sD, (s/4)D\bigr) \le \eta_1^2 s^2 N\bigr\}
\]
and
\[
\gamma_Q(\eta_2, f) = \inf\bigl\{s > 0 : \log M\bigl((F-f) \cap sD, (s/4)D\bigr) \le \eta_2^2 N\bigr\}.
\]
Put $\gamma_M(\eta_1) = \sup_{f \in F} \gamma_M(\eta_1, f)$, and $\gamma_Q(\eta_2) = \sup_{f \in F} \gamma_Q(\eta_2, f)$.

Theorem 1.9 [7] There exist absolute constants $c_0$ and $c_1$ for which the following holds. Let $F$ be a class of functions, set $W$ to be a centred normal random variable and for every $f \in F$ put $Y_f = f(X) + W$. If $\Psi$ is a learning procedure that performs for every target $Y_f$ with confidence at least $3/4$, then there is some $Y_f$ for which
\[
\mathcal{E}_p \ge c_0 \gamma_M^2\bigl(c_1/\|W\|_{L_2}\bigr).
\]

Remark 1.10
One should note that a lower bound that is based on $\gamma_Q$ was not known.

The connection between the two types of parameters is Sudakov's inequality (see, e.g., [8]): there is an absolute constant $c$ for which, for every $H \subset L_2(\mu)$,
\[
c \sup_{\varepsilon > 0} \varepsilon \log^{1/2} M(H, \varepsilon D) \le \mathbb{E}\|G\|_H.
\]
To see the connection, assume that for every $f \in F$, $\mathbb{E}\|G\|_{(F-f) \cap rD} \le \kappa r^2 \sqrt{N}$, which means that $r_M(\kappa) \le r$. Applying Sudakov's inequality to $H = (F-f) \cap rD$ and for the choice of $\varepsilon = r/4$,
\[
c(r/4) \log^{1/2} M\bigl((F-f) \cap rD, (r/4)D\bigr) \le \mathbb{E}\|G\|_{(F-f) \cap rD} \le \kappa r^2 \sqrt{N};
\]
hence, $\gamma_M(c_1\kappa) \le r$. A similar observation is true for $r_Q$ and $\gamma_Q$, which shows that $\gamma_M$ and $\gamma_Q$ are intrinsically smaller than $r_M$ and $r_Q$ respectively, for the right choice of constants.

The starting point of this article is the fact that the gap between these upper and lower estimates on $\mathcal{E}_p$ is more than a mere technicality.

The core issue is that the parameters $r_M$ and $r_Q$ are 'global' in nature, whereas $\gamma_M$ and $\gamma_Q$ are 'local'. Indeed, although $(F-f) \cap rD$ is a localized set, $\mathbb{E}\|G\|_{(F-f) \cap rD}$ is not determined solely by the effects of a 'level' that is proportional to $r$. For example, it is straightforward to construct examples in which $\mathbb{E}\|G\|_{(F-f) \cap rD} \ge c r^2\sqrt{N}$ because of a very large, $\rho$-separated subset of $(F-f) \cap rD$, for $\rho$ that is much smaller than $r$. Thus, even if $r_M$ or $r_Q$ are of order $r$, this need not be 'exhibited' by $(F-f) \cap rD$ at a scale that is proportional to $r$. In contrast, $\gamma_M$ and $\gamma_Q$ are 'local': the degree of separation is proportional to the diameter of the separated set, and the fixed point indicates that $(F-f) \cap rD$ is truly 'rich' at a scale that is proportional to $r$.

As noted in [7], the upper and lower estimates coincide when the 'local' and 'global' parameters are equivalent, but that is not a typical situation; in the generic case, there is a gap between the two. An example of that fact will be presented in Section 5.

Given that there is a gap between the two sets of parameters, one must face the obvious question: which of the two captures $\mathcal{E}_p$? Is it the 'global' pair, $r_Q$ and $r_M$, or the 'local' one of $\gamma_Q$ and $\gamma_M$?

Our main result is that the 'local' parameters are the right answer, at least in the setup outlined above. To that end, we shall improve the upper bound in Theorem 1.6 and add the missing component in Theorem 1.9.

Theorem 1.11
For every $L > 1$ and $q > 2$ there are constants $c_0, \ldots, c_7$ that depend only on $q$ and $L$ for which the following holds. Let $F \subset L_2(\mu)$ be a compact, convex, $L$-subgaussian class of functions with respect to $\mu$. There is a learning procedure $\Psi : (\Omega \times \mathbb{R})^N \to F$ for which, if $Y = f_0(X) + W$ for $f_0 \in F$ and $W \in L_q$ that is orthogonal to $\operatorname{span}(F)$, then with probability at least
\[
1 - 2\exp\bigl(-c_0 N \min\{1,\ \gamma_M^2(c_1/\|W\|_{L_q})\}\bigr) - c_2\frac{\log^q N}{N^{(q/2)-1}},
\]
\[
\mathcal{E}_p \le c_3\max\Bigl\{\gamma_M^2\bigl(c_4/\|W\|_{L_q}\bigr),\ \gamma_Q^2(c_5)\Bigr\} + r_Q^2(c_6)\exp\bigl(-c_7\exp(N)\bigr).
\]

The term $r_Q^2(c_6)\exp(-c_7\exp(N))$ is almost certainly an artifact of the proof, but in any case, it is significantly smaller than the dominating term in any reasonable example.

To complement Theorem 1.11 we obtain the following lower bound.

Theorem 1.12 There exist absolute constants $c_0$ and $c_1$ for which the following holds. Let $F \subset L_2(\mu)$ be a convex, centrally-symmetric class of functions and let $\Psi$ be any learning procedure that performs with confidence $3/4$ for any target of the form $Y = f(X) + W$ for some $f \in F$ and $W \in L_2$ that is orthogonal to $\operatorname{span}(F)$.

• For any $W \in L_2$ that is orthogonal to $\operatorname{span}(F)$, there is some $f \in F$ for which, for $Y = f(X) + W$,
\[
\mathcal{E}_p \ge c_0\gamma_Q^2(c_1).
\]
• If $W$ is a centred, normal random variable that is independent of $X$, there is some $f \in F$ for which, for $Y = f(X) + W$,
\[
\mathcal{E}_p \ge c_0\gamma_M^2\bigl(c_1/\|W\|_{L_2}\bigr).
\]

An outcome of Theorem 1.11 and Theorem 1.12 is that if $W$ is a centred gaussian random variable that is independent of $X$, then for any convex, centrally-symmetric, $L$-subgaussian class $F$, the upper and lower estimates match (up to the parasitic and negligible term $r_Q^2(c_6)\exp(-c_7\exp(N))$ in the upper bound): when considering targets of the form $Y = f(X) + W$ for $f \in F$,
\[
\mathcal{E}_p \sim \max\bigl\{\gamma_Q^2(c_1),\ \gamma_M^2(c_2/\|W\|_{L_2})\bigr\}.
\]
The second part of Theorem 1.12 follows from Theorem 1.9.
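None of the code in this article appears in the paper itself; purely as a rough numerical illustration of the 'local' parameters, the sketch below approximates a $\gamma_M$-type fixed point for a hypothetical finite class of points in Euclidean space (the class, the grid of levels, $\eta$ and $N$ are all illustrative choices, and the greedy construction only approximates the packing number $M$):

```python
import numpy as np

def packing_number(points, r):
    """Size of a greedily-built maximal r-separated subset. It lower-bounds
    M(A, rB) and, by maximality, the chosen centres also form an r-cover."""
    centers = []
    for p in points:
        if all(np.linalg.norm(p - c) >= r for c in centers):
            centers.append(p)
    return len(centers)

def gamma_M(points, f0, eta, N, scales):
    """First level s on an increasing grid of scales with
    log M((F - f0) ∩ sD, (s/4)D) <= eta^2 * s^2 * N  (the gamma_M condition)."""
    shifted = points - f0
    norms = np.linalg.norm(shifted, axis=1)
    for s in scales:
        local = shifted[norms <= s]          # the localized set (F - f0) ∩ sD
        if np.log(max(packing_number(local, s / 4), 1)) <= eta**2 * s**2 * N:
            return float(s)
    return float(scales[-1])

rng = np.random.default_rng(0)
F = rng.normal(size=(500, 5))                # hypothetical finite 'class' in R^5
f0 = F[0]
scales = np.linspace(0.05, 5.0, 60)
print(gamma_M(F, f0, eta=0.2, N=100, scales=scales))
```

For a star-shaped class the defining condition is monotone in $s$ (as discussed after Lemma 2.2), which is what justifies scanning an increasing grid.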
We have chosen to present a new proof of that fact, a proof we believe is both instructive and less restrictive than existing proofs. The first part of Theorem 1.12 is, to the best of our knowledge, new.

Let us mention that if $F$ happens to be convex and centrally symmetric (i.e., if $f \in F$ then $-f \in F$), what is essentially the 'richest' shift of $F$ is the $0$-shift. Indeed, since $F - F = 2F$, it is evident that for every $f \in F$,
\[
(F - f) \cap rD \subset (F - F) \cap rD = 2\bigl(F \cap (r/2)D\bigr).
\]
This makes one's life much simpler when studying lower bounds, as it gives an obvious choice of where to look. Indeed, the 'richest' part of $F$ is the hardest part for a learning procedure to deal with, and that part is a neighbourhood of $0$.

1.1 The idea of the proof of the upper bound

The proof of the upper bound is based on the following decomposition of the squared excess loss: let $Y$ be the unknown target and set $f^* = \operatorname*{argmin}_{f \in F}\|f - Y\|_{L_2}$. For every $f \in F$, let $\ell_f(X,Y) = (f(X)-Y)^2$ and set
\[
\mathcal{L}^F_f(X,Y) = (\ell_f - \ell_{f^*})(X,Y) = (f(X)-Y)^2 - (f^*(X)-Y)^2 = 2(f^*(X)-Y)(f-f^*)(X) + (f-f^*)^2(X). \tag{1.3}
\]
Let $P_N h = N^{-1}\sum_{i=1}^N h(X_i,Y_i)$ and set
\[
\hat f = \operatorname*{argmin}_{f \in F} P_N\ell_f = \operatorname*{argmin}_{f \in F} P_N\mathcal{L}^F_f
\]
to be the empirical minimizer in $F$. The learning procedure that assigns to every sample $(X_i,Y_i)_{i=1}^N$ the empirical minimizer in $F$ is called Empirical Risk Minimization (ERM).

Clearly, $\mathcal{L}^F_{f^*} = 0$, and thus, for every sample $(X_i,Y_i)_{i=1}^N$, $P_N\mathcal{L}^F_{\hat f} \le 0$, implying that members of the random set $\{f \in F : P_N\mathcal{L}^F_f > 0\}$ cannot be empirical minimizers. One way of identifying that set is via the decomposition (1.3): assume that $(X_i,Y_i)_{i=1}^N$ is a sample for which, if $\|f-f^*\|_{L_2} \ge r$, one has
\[
\frac1N\sum_{i=1}^N (f-f^*)^2(X_i) \ge \kappa\|f-f^*\|_{L_2}^2, \tag{1.4}
\]
and
\[
\Bigl|\frac2N\sum_{i=1}^N (f^*(X_i)-Y_i)(f-f^*)(X_i) - 2\mathbb{E}(f^*(X)-Y)(f-f^*)(X)\Bigr| \le \frac{\kappa}{2}\|f-f^*\|_{L_2}^2. \tag{1.5}
\]
Since $F$ is compact and convex, by properties of the metric projection onto a closed convex set in an inner product space,
\[
\mathbb{E}(f^*(X)-Y)(f-f^*)(X) \ge 0 \quad \text{for every } f \in F. \tag{1.6}
\]
Therefore, setting $\xi = f^*(X)-Y$ and $\xi_i = f^*(X_i)-Y_i$,
\[
P_N\mathcal{L}^F_f \ge \frac1N\sum_{i=1}^N(f-f^*)^2(X_i) - \Bigl|\frac2N\sum_{i=1}^N\xi_i(f-f^*)(X_i) - 2\mathbb{E}\xi(f-f^*)(X)\Bigr| + 2\mathbb{E}\xi(f-f^*)(X) \ge \bigl(\kappa - \kappa/2\bigr)\|f-f^*\|_{L_2}^2 > 0
\]
for every $f \in F$ that satisfies $\|f-f^*\|_{L_2} \ge r$. Thus, if (1.4) and (1.5) hold for the sample $(X_i,Y_i)_{i=1}^N$, then
\[
\{f \in F : \|f-f^*\|_{L_2} \ge r\} \subset \{f \in F : P_N\mathcal{L}^F_f > 0\},
\]
implying that $\|\hat f - f^*\|_{L_2} < r$.

This argument has been used in [10] and was then extended in [11], showing that $\mathbb{E}(\mathcal{L}^F_{\hat f}\mid(X_i,Y_i)_{i=1}^N) \le r^2$, which is the type of result one is looking for.

This method of proof leads to the complexity parameters $r_Q$ and $r_M$: the former controls the quadratic component (1.4) and the latter the multiplier component (1.5). The 'global' nature of $r_Q$ and $r_M$, i.e., the fact that the two depend on the gaussian oscillation $\mathbb{E}\|G\|_{(F-f)\cap rD}$, cannot be helped: the oscillations of the quadratic and multiplier processes are highly affected by the 'richness' of $F$ around $f^*$ at every 'level'.

A rather obvious idea for improving the upper estimate is 'erasing' all the fine structure of $F$, for example, by replacing $F$ with an appropriate separated subset. The difficulty in such an approach is that the geometry of a separated set is problematic, and (1.6) will no longer be true for an arbitrary target $Y$. This is why we only consider targets of the form $f_0(X)+W$ for $f_0 \in F$ and $W$ that is orthogonal to $\operatorname{span}(F)$. For such targets, a version of (1.6) happens to be true even if $F$ is replaced by a separated set.

The path we will take in proving the upper bound is as follows:

• Choose a 'correct' level $r$ using the parameters $\gamma_M$ and $\gamma_Q$ for well-chosen constants $\eta_1$ and $\eta_2$ that depend only on $q$ and $L$.
• Replace $F$ by $V$, a maximal $r$-separated subset of $F$ with respect to the $L_2(\mu)$ norm, and study ERM in $V$. To that end, set $v^* = \operatorname*{argmin}_{v \in V}\|v - Y\|_{L_2}$ and observe that by the orthogonality of $W$ to $\operatorname{span}(F)$, for every $v \in V$,
\[
|\mathbb{E}(v^*(X)-Y)(v-v^*)(X)| = |\mathbb{E}(v^*-f^*)(X)\,(v-v^*)(X)| \le r\|v-v^*\|_{L_2}.
\]
Hence, every $v \in V$ satisfies
\[
P_N\mathcal{L}^V_v \ge \frac1N\sum_{i=1}^N(v-v^*)^2(X_i) - \Bigl|\frac2N\sum_{i=1}^N(v^*(X_i)-Y_i)(v-v^*)(X_i) - 2\mathbb{E}(v^*(X)-Y)(v-v^*)(X)\Bigr| - 2r\|v-v^*\|_{L_2}.
\]
• Next, one may study the corresponding quadratic and multiplier processes indexed by localizations of $V$ and show that with high probability, if $\|v-v^*\|_{L_2} \ge c_0 r$ then $P_N\mathcal{L}^V_v > 0$. Thus, ERM performed in $V$ produces $\hat v$ for which $\|\hat v - v^*\|_{L_2} \le c_0 r$.

• It is possible to show that on the same event, $\|\hat v - f^*\|_{L_2} \le c_1 r$. And, using the orthogonality of $W$ to $\operatorname{span}(F)$ once again,
\[
\mathbb{E}\bigl(\mathcal{L}^F_{\hat v}\mid(X_i,Y_i)_{i=1}^N\bigr) \le c_2 r^2,
\]
as required.

Let us begin with some notation. Throughout, absolute constants are denoted by $c, c_1, \ldots$ etc. Their value may change from line to line. $c(\alpha)$ is a constant that depends only on the parameter $\alpha$. We use $\kappa_0, \kappa_1, \eta_1, \eta_2$ etc. to denote fixed constants whose value remains unchanged throughout the article.

In what follows, we will, at times, abuse notation and not specify the probability space on which each random variable is defined. For example, $\|f-Y\|_{L_2}^2 = \mathbb{E}(f(X)-Y)^2$ and integration is with respect to the joint distribution of $X$ and $Y$, while $\|f-f_1\|_{L_2}^2 = \mathbb{E}(f-f_1)^2(X)$, in which case integration is with respect to $\mu$.

Next, let us turn to the notions of cover and covering numbers.

Definition 2.1
Let $B$ be a unit ball of a norm. Set $N(A,B)$ to be the minimal number of centres $a_1, \ldots, a_n \in A$ for which $A \subset \bigcup_{i=1}^n (a_i + B)$. $(a_i)_{i=1}^n$ is called a cover of $A$ with respect to $B$. An $r$-cover is a cover with respect to the set $rB$.

It is standard to verify that if $a_1, \ldots, a_m$ is a maximal $1$-separated subset with respect to $B$ then it is also a cover with respect to $B$. Indeed, the maximality of the separated set implies that every point $a \in A$ has some $a_i$ for which $\|a - a_i\| \le 1$, i.e., $a \in a_i + B$. Therefore, $N(A,B) \le M(A,B)$. In the reverse direction, if $a_1, \ldots, a_n$ is a cover with respect to $B$, then each one of the balls $a_i + B$ contains at most one point of any $2$-separated set. Thus, $M(A, 2B) \le N(A,B)$.

The following lemma is straightforward but it plays a crucial part in what follows.
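Before turning to that lemma, the two comparisons just recorded, $N(A,B) \le M(A,B)$ and $M(A,2B) \le N(A,B)$, are easy to sanity-check numerically; the finite random point set below is an illustrative stand-in for $A$ and is not taken from the paper:

```python
import numpy as np

def maximal_separated(points, r):
    """Greedily build a maximal r-separated subset of a finite point set."""
    centers = []
    for p in points:
        if all(np.linalg.norm(p - c) >= r for c in centers):
            centers.append(np.asarray(p))
    return centers

def is_cover(points, centers, r):
    """Check that every point lies within distance r of some centre."""
    return all(min(np.linalg.norm(p - c) for c in centers) <= r for p in points)

rng = np.random.default_rng(1)
A = rng.uniform(-1, 1, size=(300, 3))
r = 0.5
P = maximal_separated(A, r)           # r-separated, so |P| <= M(A, rB)
assert is_cover(A, P, r)              # maximality makes P an r-cover as well
# any 2r-separated set has at most one point per ball of any r-cover:
assert len(maximal_separated(A, 2 * r)) <= len(P)
print(len(P))
```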
Lemma 2.2
Let $T \subset W \subset L_2(\mu)$. For $s > r > 0$, set
\[
\phi(s,r) = \sup_{w \in W} N\bigl(T \cap (w+sD), rD\bigr).
\]
Then:

1. $\phi(s,r) \le \phi(s, s/2)\cdot\phi(s/2, r)$.

2. If $T$ and $W$ are star-shaped around $0$ then $\log\phi(s,r) \le c\log(2s/r)\cdot\log\phi(4r,r)$ for a suitable absolute constant $c$.

Proof.
Fix $w \in W$ and let $t_1, \ldots, t_{N_0} \in T \cap (w+sD)$ be centres of a minimal $s/2$-cover of $T \cap (w+sD)$. For every $1 \le i \le N_0$,
\[
T \cap (w+sD) \cap \bigl(t_i + (s/2)D\bigr) \subset T \cap \bigl(t_i + (s/2)D\bigr),
\]
and $N(T \cap (t_i + (s/2)D), rD) \le \phi(s/2, r)$, because $t_i \in T \subset W$. Therefore,
\[
\sup_{w \in W} N\bigl(T \cap (w+sD), rD\bigr) \le \sup_{w \in W} N\bigl(T \cap (w+sD), (s/2)D\bigr)\cdot\phi(s/2, r).
\]
Turning to the second part of the claim, assume that $T$ and $W$ are star-shaped around $0$. Let $w \in W$, set $t_1, \ldots, t_m$ to be a maximal $s/2$-separated subset of $T \cap (w+sD)$ with respect to the $L_2(\mu)$ norm and put $y_i = (r/s)t_i$. Since $T$ is star-shaped around $0$, $y_i \in T$ and $(y_i)_{i=1}^m$ is an $r/2$-separated subset of $(r/s)w + rD$. For the same reason, $(r/s)w \in W$, and
\[
M\bigl(T \cap (w+sD), (s/2)D\bigr) \le \sup_{v \in W} M\bigl(T \cap (v+rD), (r/2)D\bigr).
\]
Using the standard connection between packing numbers and covering numbers, applying this estimate at scale $2r$ and taking the supremum over $w$,
\[
\phi(s, s/2) = \sup_{w \in W} N\bigl(T \cap (w+sD), (s/2)D\bigr) \le \sup_{w \in W} M\bigl(T \cap (w+sD), (s/2)D\bigr) \le \sup_{w \in W} M\bigl(T \cap (w+2rD), rD\bigr).
\]
Iterating the first part of the claim, it follows that
\[
\log\phi(s,r) \le c\log(2s/r)\cdot\sup_{w \in W}\log M\bigl(T \cap (w+4rD), 2rD\bigr) \le c\log(2s/r)\cdot\sup_{w \in W}\log N\bigl(T \cap (w+4rD), rD\bigr) \le c\log(2s/r)\cdot\log\phi(4r,r),
\]
as claimed.

Before we turn to the proof of the upper bound, let us revisit the complexity parameters in question. Since $F$ is a convex class, $F-f$ is star-shaped around $0$; hence, if $s > r$,
\[
\log M\bigl((F-f) \cap sD, (s/4)D\bigr) \le \log M\bigl((F-f) \cap rD, (r/4)D\bigr).
\]
In particular, if $\gamma_M(\eta_1, f) < r$ then
\[
\log M\bigl((F-f) \cap sD, (s/4)D\bigr) \le \eta_1^2 N r^2 \le \eta_1^2 N s^2,
\]
implying that $\gamma_M(\eta_1, f) < s$ as well.

This simple argument shows that if $r < \gamma_M(\eta_1, f)$ then
\[
\log M\bigl((F-f) \cap rD, (r/4)D\bigr) \ge \eta_1^2 N r^2,
\]
while if $r > \gamma_M(\eta_1, f)$, the reverse inequality holds. A similar assertion holds for $\gamma_Q$, $r_M$ and $r_Q$; the rather standard proof of these facts, which is almost identical to the argument used above, is omitted.

Let $F \subset L_2(\mu)$ be a compact, convex class of functions. Fix $r > 0$ and set $V$ to be a maximal $r$-separated subset of $F$.

Note that for every $v \in V$, $F_v = F - v$ is star-shaped around $0$, and $\operatorname{star}(V-v) \subset F - v$. Using the notation of Lemma 2.2, let $T = W = F_v$; then for $s > r$,
\[
\log N\bigl((\operatorname{star}(V-v)) \cap sD, rD\bigr) \le \log N(F_v \cap sD, rD) \le \sup_{x \in F}\log N\bigl(F_v \cap (x-v+sD), rD\bigr) \le c\log(s/r)\sup_{x \in F}\log N\bigl(F_v \cap (x-v+4rD), rD\bigr) = c\log(s/r)\sup_{x \in F}\log N\bigl(F \cap (x+4rD), rD\bigr).
\]
Clearly, $F \cap (x+4rD) \subset ((F-x) \cap 4rD) + x$, implying that
\[
\log N\bigl((\operatorname{star}(V-v)) \cap sD, rD\bigr) \le c\log(s/r)\cdot\sup_{x \in F}\log N\bigl((F-x) \cap 4rD, rD\bigr). \tag{3.1}
\]
Moreover, the same estimate holds for $(V-v) \cap sD$, and since $V-v$ is $r$-separated,
\[
\log|(V-v) \cap sD| = \log M\bigl((V-v) \cap sD, rD\bigr) \le \log N\bigl((V-v) \cap sD, (r/2)D\bigr) \le \log N\bigl(F_v \cap sD, (r/2)D\bigr) \le c\log(s/r)\cdot\sup_{x \in F}\log N\bigl((F-x) \cap 2rD, (r/2)D\bigr) \le c\log(s/r)\cdot\sup_{x \in F}\log M\bigl((F-x) \cap rD, (r/4)D\bigr), \tag{3.2}
\]
where the last step combines the packing-covering comparison with the regularity observed above.

With that in mind, fix constants $\eta_1$, $\eta_2$, $\kappa_1$ and $\kappa_2$ that will be specified later, and for that choice of constants, let $r > 0$ satisfy
\[
\sup_{x \in F}\log M\bigl((F-x) \cap rD, (r/4)D\bigr) \le \max\bigl\{\eta_1^2 N r^2,\ \eta_2^2 N\bigr\}, \tag{3.3}
\]
and $r \ge r_Q(\kappa_1)\exp(-\kappa_2\exp(N))$; that is,
\[
r \ge \max\bigl\{\gamma_M(\eta_1),\ \gamma_Q(\eta_2),\ r_Q(\kappa_1)\exp(-\kappa_2\exp(N))\bigr\}.
\]
Let $V$ be a maximal $r$-separated subset of $F$ with respect to the $L_2(\mu)$ norm. Following the path outlined earlier, the idea is to study ERM in $V$, given the data $(X_i,Y_i)_{i=1}^N$ for $Y = f_0(X) + W$. To that end, one must control the multiplier and quadratic components in the decomposition of the squared loss relative to $V$: if $v^* = \operatorname*{argmin}_{v \in V}\|v(X)-Y\|_{L_2}$,
\[
\mathcal{L}^V_v(X,Y) = (v(X)-Y)^2 - (v^*(X)-Y)^2 = 2(v^*(X)-Y)(v-v^*)(X) + (v-v^*)^2(X).
\]
Let us begin with the multiplier component:
Lemma 3.1
Fix $0 < \theta < 1$, $L > 1$ and $q > 2$. There exist constants $c_0$, $c_1$ and $c_2$ that depend only on $L$ and $q$ and for which the following holds. Let $F$ be a convex, $L$-subgaussian class, let $\xi \in L_q$ for some $q > 2$ and put $\eta_1 = c_0\theta/\|\xi\|_{L_q}$. Then, for every $v^* \in V$, with probability at least
\[
1 - c_1\frac{\log^q N}{N^{(q/2)-1}} - 2\exp\bigl(-c_2\eta_1^2 r^2 N\bigr),
\]
for every $v \in \{v \in V : \|v-v^*\|_{L_2} \ge r\}$,
\[
\Bigl|\frac1N\sum_{i=1}^N\xi_i\frac{(v-v^*)}{\|v-v^*\|_{L_2}^2}(X_i) - \mathbb{E}\xi\frac{(v-v^*)}{\|v-v^*\|_{L_2}^2}\Bigr| \le \theta.
\]
The proof of Lemma 3.1 is based on the following fact from [12].
Theorem 3.2
For $L > 1$ and $q > 2$ there exist constants $c_0$, $c_1$ and $c_2$ that depend only on $L$ and $q$ for which the following holds. Let $\xi \in L_q$, set $H$ to be an $L$-subgaussian class and denote $d_H = \sup_{h \in H}\|h\|_{L_2}$. For $w, u \ge 8$, with probability at least
\[
1 - c_0 w^{-q} N^{-((q/2)-1)}\log^q N - 2\exp\Bigl(-c_1 u^2\Bigl(\frac{\mathbb{E}\|G\|_H}{L d_H}\Bigr)^2\Bigr),
\]
\[
\sup_{h \in H}\Bigl|\frac1N\sum_{i=1}^N\xi_i h(X_i) - \mathbb{E}\xi h\Bigr| \le c_2 L w u\|\xi\|_{L_q}\frac{\mathbb{E}\|G\|_H}{\sqrt N}.
\]
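The $\|\xi\|_{L_q}\,\mathbb{E}\|G\|_H/\sqrt N$ scaling in Theorem 3.2 can be probed in a toy case; everything below, the finite class of normalized linear functionals, the gaussian noise, and the trial counts, is an illustrative assumption and not part of the paper:

```python
import numpy as np

rng = np.random.default_rng(2)

def multiplier_sup(N, n_funcs=50, d=10, trials=200):
    """Monte Carlo average of sup_h |N^{-1} sum_i xi_i h(X_i) - E xi h|
    over a finite class of linear functionals h_a(x) = <a, x> on R^d,
    with X standard gaussian and xi an independent mean-zero noise
    (so E xi h = 0 for every h in the class)."""
    A = rng.normal(size=(n_funcs, d))
    A /= np.linalg.norm(A, axis=1, keepdims=True)   # normalize so d_H = 1
    sups = []
    for _ in range(trials):
        X = rng.normal(size=(N, d))
        xi = rng.normal(size=N)                     # independent of X
        emp = (A @ X.T) @ xi / N                    # N^{-1} sum_i xi_i h(X_i)
        sups.append(np.abs(emp).max())
    return float(np.mean(sups))

# the supremum should decay roughly like 1/sqrt(N) as N grows:
print(multiplier_sup(100), multiplier_sup(400))
```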
Proof of Lemma 3.1.
The proof consists of two parts: first, controlling the process indexed by $\{f \in F : \|f-v^*\|_{L_2} \ge s\}$ where $s = (3/2)r_M(\eta_1, v^*)$, and then treating the process indexed by $\{v \in V : r \le \|v-v^*\|_{L_2} \le s\}$. Clearly, without loss of generality one may assume that $r \le r_M(\eta_1, v^*)$.

By the regularity of $r_M$ and since $s > r_M(\eta_1, v^*)$,
\[
\mathbb{E}\|G\|_{(F-v^*) \cap sD} \le \eta_1\sqrt N s^2.
\]
Moreover, $(F-v^*) \cap (s/2)D \subset (F-v^*) \cap sD$, and since $s/2 \le r_M(\eta_1, v^*)$, the regularity of $r_M$ implies that
\[
\mathbb{E}\|G\|_{(F-v^*) \cap sD} \ge \eta_1\sqrt N s^2/4.
\]
Therefore, applying Theorem 3.2 to the set $H = (F-v^*) \cap sD$, there are constants $c_0$, $c_1$ and $c_2$ that depend only on $q$ and $L$ for which, with probability at least
\[
1 - c_0 N^{-((q/2)-1)}\log^q N - 2\exp\bigl(-c_1\eta_1^2 s^2 N\bigr),
\]
if $f \in F$ and $\|f-v^*\|_{L_2} \le s$,
\[
\Bigl|\frac1N\sum_{i=1}^N\xi_i(f-v^*)(X_i) - \mathbb{E}\xi(f-v^*)\Bigr| \le c_2 L\|\xi\|_{L_q}\eta_1 s^2 = (*).
\]
$(*) \le \theta s^2$ if $\eta_1 \le \theta/(c_2 L\|\xi\|_{L_q})$, and for such a choice, if $\|f-v^*\|_{L_2} = s$ then
\[
\Bigl|\frac1N\sum_{i=1}^N\xi_i(f-v^*)(X_i) - \mathbb{E}\xi(f-v^*)\Bigr| \le \theta\|f-v^*\|_{L_2}^2; \tag{3.4}
\]
since $F-v^*$ is star-shaped around $0$, (3.4) holds on the same event for every $f \in F$ for which $\|f-v^*\|_{L_2} \ge s$.

Next, one has to control the process indexed by $\{v \in V : r \le \|v-v^*\|_{L_2} < s\}$. Set $j_0 = \lceil\log_2(s/r)\rceil$, fix $s_j = 2^j r$ for $0 \le j \le j_0$ and let $V_j = \operatorname{star}\bigl((V-v^*) \cap s_j D\bigr)$. By Theorem 3.2, on an event $A_j$, for every $h \in V_j$,
\[
\Bigl|\frac1N\sum_{i=1}^N\xi_i h(X_i) - \mathbb{E}\xi h\Bigr| \le c(L,q)w_j u_j\|\xi\|_{L_q}\frac{\mathbb{E}\|G\|_{V_j}}{\sqrt N} = (**)_j.
\]
The aim is to ensure that $(**)_j \le \theta s_j^2/4$ while $A_j$ is of high enough probability. Indeed, on $A_j$, if $v \in V$ and $s_j/2 \le \|v-v^*\|_{L_2} \le s_j$,
\[
\Bigl|\frac1N\sum_{i=1}^N\xi_i(v-v^*)(X_i) - \mathbb{E}\xi(v-v^*)\Bigr| \le \theta\|v-v^*\|_{L_2}^2.
\]
To that end, let $w_j = \sqrt j$, recall that $d_V = \sup_{v \in V}\|v\|_{L_2}$ and thus $d_{V_j} = s_j = 2^j r$. Put
\[
u_j = \max\Bigl\{8,\ \frac{\sqrt N\theta}{c\|\xi\|_{L_q}}\cdot\frac{2^j r}{\sqrt j}\cdot\frac{d_{V_j}}{\mathbb{E}\|G\|_{V_j}}\Bigr\}
\]
and consider two cases: first, if $u_j > 8$, then $(**)_j \le \theta s_j^2/4$ and
\[
Pr(A_j) \ge 1 - c_0\frac{\log^q N}{j^{q/2}N^{(q/2)-1}} - 2\exp\Bigl(-c_1(q,L)\frac{N\theta^2 4^j r^2}{j\|\xi\|_{L_q}^2}\Bigr).
\]
Alternatively, if $u_j = 8$, then
\[
u_j^2\Bigl(\frac{\mathbb{E}\|G\|_{V_j}}{d_{V_j}}\Bigr)^2 \ge c(q,L)\frac{N\theta^2 4^j r^2}{j\|\xi\|_{L_q}^2}.
\]
Also, by (3.2), $V_j$ has at most $|(V-v^*) \cap s_j D|$ extreme points. Since
\[
\log|(V-v^*) \cap s_j D| \le c\log(s_j/r)\,\log M\bigl(F_{v^*} \cap rD, (r/4)D\bigr) \le c\log(s_j/r)\,\eta_1^2 N r^2,
\]
by standard properties of gaussian processes,
\[
\mathbb{E}\|G\|_{V_j} \le c_0 d_{V_j}\cdot\log^{1/2}|(V-v^*) \cap s_j D| \le c_0 s_j\log^{1/2}(s_j/r)\,\eta_1\sqrt N r \le c_1\eta_1\sqrt N r\sqrt j\,s_j.
\]
Hence, there are constants $c_2$ and $c_3$ that depend only on $q$ and $L$ for which
\[
\sup_{h \in V_j}\Bigl|\frac1N\sum_{i=1}^N\xi_i h(X_i) - \mathbb{E}\xi h\Bigr| \le c_2\|\xi\|_{L_q}\sqrt j\cdot\eta_1 r\sqrt j\,s_j = c_2\eta_1\|\xi\|_{L_q}\,j\,r\,s_j \le \theta s_j^2/4,
\]
provided that $\eta_1 \le c_3\theta/\|\xi\|_{L_q}$ (using $j \le 2^j$ and $s_j = 2^j r$). Therefore, in both cases, there are constants $c_4$ and $c_5$ that depend only on $q$ and $L$, and with probability at least
\[
1 - c_4\frac{\log^q N}{j^{q/2}N^{(q/2)-1}} - 2\exp\bigl(-c_5\eta_1^2 r^2 N\cdot 4^j/j\bigr),
\]
\[
\sup_{h \in V_j}\Bigl|\frac1N\sum_{i=1}^N\xi_i h(X_i) - \mathbb{E}\xi h\Bigr| \le \theta s_j^2/4.
\]
The claim follows by applying the union bound to this estimate for $0 \le j \le j_0$.

Next, let us turn to the infimum of the quadratic process
\[
\inf_{\{v \in V : \|v-v^*\|_{L_2} \ge c_0 r\}}\frac1N\sum_{i=1}^N\Bigl(\frac{(v-v^*)}{\|v-v^*\|_{L_2}}\Bigr)^2(X_i), \tag{3.5}
\]
where $r$ was selected in (3.3) for a well-chosen $\eta_2$ and where $c_0$ is a suitable constant.

Lemma 3.3
For every $L > 1$ there exist constants $c_0$, $c_1$ and $c_2$ that depend only on $L$ for which the following holds. For every $v^* \in V$, with probability at least $1 - 2\exp(-c_0 N)$, if $v \in V$ and $\|v-v^*\|_{L_2} \ge c_1 r$ then
\[
\frac1N\sum_{i=1}^N(v-v^*)^2(X_i) \ge c_2\|v-v^*\|_{L_2}^2.
\]

The proof follows the same two-stage path as that of Lemma 3.1: first 'large distances' in $F$, i.e., when $f \in F$ satisfies $\|f-v^*\|_{L_2} \ge (3/2)r_Q(\eta_2) \equiv s$; and then 'small distances' in $V$, that is, $v \in V$ for which $r \le \|v-v^*\|_{L_2} \le s$ (again, one may assume that $r < r_Q(\eta_2)$).

For the constant $\eta_2$ (yet to be specified), one has

• $\mathbb{E}\|G\|_{(F-v^*) \cap sD} \le \eta_2\sqrt N s$,

• for every $2r < t < s$,
\[
\log N\bigl((\operatorname{star}(V-v^*)) \cap sD, tD\bigr) \le c\log(2s/t)\cdot\eta_2^2 N, \quad\text{and}\quad \log|(\operatorname{star}(V-v^*)) \cap sD| \le c\log(2s/r)\cdot\eta_2^2 N.
\]
The required lower bound on the infimum of the quadratic process (3.5) is based on estimates from [10] and [11], which will be formulated under the subgaussian assumption, rather than using the original (and much weaker) small-ball condition.
Theorem 3.4
For every $L > 1$ there are constants $\kappa_0$, $\kappa_1$ and $\kappa_2$ that depend only on $L$ for which the following holds. Let $H$ be an $L$-subgaussian class that is star-shaped around zero. Set $H_\rho = H \cap \rho D$ and fix $\rho$ for which
\[
\mathbb{E}\|G\|_{H_\rho} \le \kappa_0\sqrt N\rho.
\]
Then, with probability at least $1 - 2\exp(-\kappa_1 N)$,
\[
\inf_{\{h \in H : \|h\|_{L_2} \ge \rho\}}\frac1N\sum_{i=1}^N\Bigl(\frac{h(X_i)}{\|h\|_{L_2}}\Bigr)^2 \ge \kappa_2.
\]
We will apply Theorem 3.4 to the class $H = (F-v^*) \cap sD$ (large distances) and then to $V_j = \operatorname{star}\bigl((V-v^*) \cap s_j D\bigr)$ for $s_j = 2^j r$ (small distances).

Lemma 3.5
There exist absolute constants $c_0$ and $c_1$ for which the following holds. For every $s > \rho \ge c_0 r$,
\[
\mathbb{E}\|G\|_{V_j \cap \rho D} \le c_1\eta_2\sqrt N\bigl(\rho\log^{1/2}(2s_j/\rho) + r\log^{1/2}(2s/r)\bigr).
\]
In particular, setting $\rho = s_j/2$, for $\eta_2 = c\kappa_0$ one has
\[
\mathbb{E}\|G\|_{V_j \cap (s_j/2)D} \le \kappa_0\sqrt N(s_j/2).
\]

Proof. Fix $\rho < s_j$ and note that by Dudley's entropy integral bound (see, e.g., [8, 18]),
\[
\mathbb{E}\|G\|_{V_j \cap \rho D} \le c\int_0^\rho\log^{1/2}N(V_j \cap \rho D, tD)\,dt \le c\int_0^r\log^{1/2}N(V_j \cap \rho D, tD)\,dt + c\int_r^\rho\log^{1/2}N(V_j \cap \rho D, tD)\,dt.
\]
Applying (3.1), and since
\[
V_j = \operatorname{star}\bigl((V-v^*) \cap s_j D\bigr) \subset (\operatorname{star}(V-v^*)) \cap s_j D,
\]
it follows that for $r < t < \rho$,
\[
\log N(V_j \cap \rho D, tD) \le \log N\bigl((\operatorname{star}(V-v^*)) \cap \rho D, tD\bigr) \le c\log(2\rho/t)\cdot\sup_{x \in F}\log N\bigl((F-x) \cap 4tD, tD\bigr) \le c\log(2\rho/t)\cdot\eta_2^2 N.
\]
Moreover, by (3.2),
\[
\log|(V-v^*) \cap s_j D| \le c\log(2s_j/r)\cdot\eta_2^2 N = (*).
\]
Hence, $V_j$ is the union of at most $\exp((*))$ 'intervals' of the form $[0, v-v^*]$, and for $t \le r$,
\[
\log N(V_j \cap \rho D, tD) \le c\bigl(\eta_2^2 N\log(2s_j/r) + \log(2\rho/t)\bigr).
\]
Now the first part of the claim follows from integration, and the second part is an immediate outcome of the first.
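As a purely illustrative aside (not from the paper), Dudley's entropy integral can be evaluated numerically for a toy finite set and compared with a Monte Carlo estimate of the gaussian supremum it dominates; the Riemann grid, trial count and the random set below are all hypothetical choices:

```python
import numpy as np

rng = np.random.default_rng(3)

def cover_size(T, t):
    """Size of a greedy maximal t-separated subset; by maximality it is a
    t-cover, so it upper-bounds the covering number N(T, tB)."""
    centers = []
    for p in T:
        if all(np.linalg.norm(p - c) >= t for c in centers):
            centers.append(p)
    return len(centers)

def entropy_integral(T, levels=20):
    """Riemann sum for  ∫ sqrt(log N(T, t)) dt  over t up to the diameter."""
    diam = 2 * max(np.linalg.norm(p) for p in T)
    ts = np.linspace(diam / levels, diam, levels)
    vals = [np.sqrt(np.log(max(cover_size(T, t), 1))) for t in ts]
    return float(sum(vals) * (ts[1] - ts[0]))

def gauss_sup(T, trials=2000):
    """Monte Carlo estimate of E sup_{t in T} <g, t>, g standard gaussian."""
    T = np.asarray(T)
    G = rng.normal(size=(trials, T.shape[1]))
    return float((G @ T.T).max(axis=1).mean())

T = rng.normal(size=(200, 8))
print(gauss_sup(T), entropy_integral(T))
```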
Proof of Lemma 3.3.
Combining Theorem 3.4 and Lemma 3.5 for $\eta_2 = c\kappa_0$, it follows that with probability at least $1 - 2\exp(-\kappa_1 N)$, if $v \in V$ and $s_j/2 \le \|v-v^*\|_{L_2} \le s_j$, then
\[
\frac1N\sum_{i=1}^N(v-v^*)^2(X_i) \ge \kappa_2\|v-v^*\|_{L_2}^2. \tag{3.6}
\]
Repeating this argument for $s_j = 2^j r$, $0 \le j \le j_0$, and then applying it to the set $F_{v^*} \cap sD$ for $s = (3/2)r_Q(\eta_2)$, it follows that if $\log_2(s/r) \le \exp(\kappa_1 N/2)$ then with probability at least $1 - 2\exp(-\kappa_1 N/2)$, (3.6) holds for every $v \in V$ that satisfies $\|v-v^*\|_{L_2} \ge c_1 r$.

With all the ingredients in place, we may now conclude the proof of the upper estimate.

Fix $f_0 \in F$ and set $Y = f_0(X) + W$ for $W \in L_q$ that is orthogonal to $\operatorname{span}(F)$. Let $r$, $V$ and $v^*$ be as above. Clearly, for every $v \in V$,
\[
\|v-Y\|_{L_2}^2 = \|W\|_{L_2}^2 + \|v-f_0\|_{L_2}^2, \tag{3.7}
\]
and thus $\|v^*-f_0\|_{L_2} \le r$. Moreover, for every $v \in V$, $\mathbb{E}W\cdot(v-v^*)(X) = 0$ and
\[
|\mathbb{E}(v^*(X)-Y)(v-v^*)(X)| = |\mathbb{E}(v^*-f_0)(X)\cdot(v-v^*)(X)| \le \|v^*-f_0\|_{L_2}\cdot\|v-v^*\|_{L_2} \le r\|v-v^*\|_{L_2}.
\]
By Lemma 3.3, with probability at least $1 - 2\exp(-\kappa_1 N/2)$, if $v \in V$ and $\|v-v^*\|_{L_2} \ge c_0(L)r$, then
\[
\frac1N\sum_{i=1}^N(v-v^*)^2(X_i) \ge \kappa_2\|v-v^*\|_{L_2}^2.
\]
Using the notation of Lemma 3.1, set $\theta = \kappa_2/4$ and $\eta_1 = c_1(q,L)\theta/\|W\|_{L_q}$. Hence, there are constants $c_2$ and $c_3$ that depend only on $q$ and $L$, for which, with probability at least
\[
1 - c_2\frac{\log^q N}{N^{(q/2)-1}} - 2\exp\bigl(-c_3\eta_1^2 r^2 N\bigr),
\]
for every $v \in V$ with $\|v-v^*\|_{L_2} \ge r$,
\[
\Bigl|\frac1N\sum_{i=1}^N(v^*(X_i)-Y_i)(v-v^*)(X_i) - \mathbb{E}(v^*(X)-Y)(v-v^*)(X)\Bigr| \le \frac{\kappa_2}{4}\|v-v^*\|_{L_2}^2.
\]
On the intersection of the two events and for a constant $c_4 = c_4(q,L)$, if $\|v-v^*\|_{L_2} \ge c_4 r$ then
\[
P_N\mathcal{L}^V_v = \frac1N\sum_{i=1}^N(v-v^*)^2(X_i) + \frac2N\sum_{i=1}^N(v^*(X_i)-Y_i)(v-v^*)(X_i) \ge \frac1N\sum_{i=1}^N(v-v^*)^2(X_i) - 2|\mathbb{E}(v^*(X)-Y)(v-v^*)(X)| - 2\Bigl|\frac1N\sum_{i=1}^N(v^*(X_i)-Y_i)(v-v^*)(X_i) - \mathbb{E}(v^*(X)-Y)(v-v^*)(X)\Bigr| \ge \kappa_2\|v-v^*\|_{L_2}^2 - 2r\|v-v^*\|_{L_2} - (\kappa_2/2)\|v-v^*\|_{L_2}^2 \ge (\kappa_2/4)\|v-v^*\|_{L_2}^2 > 0.
\]
Hence, the empirical minimizer $\hat v \in V$ satisfies
\[
\|\hat v - v^*\|_{L_2} \le c_4 r.
\]
And, since $W$ is orthogonal to $\operatorname{span}(F)$,
\[
\mathbb{E}\bigl(\mathcal{L}^F_{\hat v}\mid(X_i,Y_i)_{i=1}^N\bigr) = \|\hat v - Y\|_{L_2}^2 - \|f_0-Y\|_{L_2}^2 = \|\hat v - f_0 - W\|_{L_2}^2 - \|W\|_{L_2}^2 = \|\hat v - f_0\|_{L_2}^2 \le \bigl(\|\hat v - v^*\|_{L_2} + \|v^* - f_0\|_{L_2}\bigr)^2 \le (1+c_4)^2 r^2.
\]
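The procedure just analysed, ERM over a maximal $r$-separated subset $V$ of the class rather than over $F$ itself, can be mimicked end-to-end in a toy model; the finite class, the uniform measure on a grid of points, the noise level and all sizes below are hypothetical stand-ins and not the paper's construction:

```python
import numpy as np

rng = np.random.default_rng(4)

def separated_net(F, r):
    """Maximal r-separated subset of the finite class F (rows = functions,
    represented by their values on a fixed grid; L2(mu) = scaled Euclidean)."""
    V = []
    for f in F:
        if all(np.linalg.norm(f - v) / np.sqrt(len(f)) >= r for v in V):
            V.append(f)
    return np.array(V)

def erm(H, X_idx, y):
    """Empirical risk minimization over a finite class H: each row h is a
    vector of function values, h[X_idx] its values on the sample points."""
    risks = ((H[:, X_idx] - y) ** 2).mean(axis=1)
    return H[np.argmin(risks)]

# Toy model: functions are value-vectors on m points, mu uniform on them.
m, n_funcs, N, sigma = 200, 400, 50, 0.5
F = np.cumsum(rng.normal(size=(n_funcs, m)), axis=1) / np.sqrt(m)
f_star = F[7]
X_idx = rng.integers(0, m, size=N)               # X_1,...,X_N ~ mu
y = f_star[X_idx] + sigma * rng.normal(size=N)   # Y = f*(X) + W

V = separated_net(F, r=0.5)                      # ERM in the net, as above
f_hat = erm(V, X_idx, y)
err = np.linalg.norm(f_hat - f_star) ** 2 / m    # ||f_hat - f*||_{L2(mu)}^2
print(len(V), err)
```

Since $V$ is a maximal $r$-separated subset it is also an $r$-cover, so some element of $V$ is within $r$ of $f^*$ and the error of the net-based minimizer stays of order $r^2$ plus the stochastic terms, in the spirit of the bound above.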
The lower estimates presented below are based on a volumetric argument. The idea is that if a learning procedure is 'too successful', a well-separated subset of $F$ induces a well-separated subset of $\mathbb{R}^N$ (a set that depends on $X_1,...,X_N$). However, because of a volumetric constraint, there is not 'enough room' for such a separated set to exist, leading to a contradiction. The notions of volume are different in the two estimates: one is based on the Lebesgue measure, while the other is determined by the choice of the 'noise' $W$, which is, in our case, gaussian.

Definition 4.1
Let $F$ be a class of functions and assume that $X = (x_1,...,x_N) \in \Omega^N$. For every $f \in F$, set
$$K(f,X) = \{h \in F : h(x_i) = f(x_i) \ \text{for every } 1 \le i \le N\}.$$
The set $K(f,X)$ is called the version space of $F$ associated with $f$ and $X$.

In other words, $K(f,X)$ consists of all the functions in $F$ that agree with $f$ on $X$. Naturally, in the context of learning, $X$ is a random sample $(X_i)_{i=1}^N$, selected according to the underlying measure $\mu$.

The diameter of the version space is a reasonable choice for a lower bound on the performance of any learning procedure: if $Y_i = f(X_i) + W_i$, a learning procedure cannot distinguish between $f$ and any other function in the version space associated with $f$ and $(X_i)_{i=1}^N$. Hence, the largest typical diameter of a version space should be a lower estimate on the performance of any learning procedure, as the following well-known fact shows (see, e.g., [7]).

Theorem 4.2 Given a random variable $W$, for every $f \in F$ set $Y^f = f(X) + W$. If $\Psi$ is a learning procedure, then
$$\sup_{f \in F} Pr\Bigl(\bigl\|\Psi\bigl((Y_i^f, X_i)_{i=1}^N\bigr) - f\bigr\|_{L_2(\mu)} \ge \frac{1}{2}\,\mathrm{diam}\bigl(K(f,X), L_2(\mu)\bigr)\Bigr) \ge \frac{1}{2},$$
where the probability is relative to the product measure endowed on $(\Omega\times\mathbb{R})^N$ by the $N$-product of the joint distribution of $X$ and $W$.

Clearly, if $W$ is orthogonal to $\mathrm{span}(F)$, then for every $h \in F$ and every target $Y^f$, $\mathbb{E}\mathcal{L}_h^f = \|h - f\|_{L_2}^2$. Thus, the largest typical diameter of a version space $K(f,X)$ is a lower bound on $\mathcal{E}_p$ for the set of admissible targets $\mathcal{Y} = \{f(X) + W : f \in F\}$.

This leads to the following question:

Question 4.3
Given a class $F$ defined on a probability space $(\Omega,\mu)$, $f \in F$ and $X = (x_1,...,x_N) \in \Omega^N$, find a lower estimate on $\mathrm{diam}(K(f,X), L_2(\mu))$.

One situation in which Question 4.3 is of independent interest is when $T \subset \mathbb{R}^n$ is a convex body (i.e., a convex, centrally-symmetric set with a nonempty interior) and $F = \{\langle t, \cdot\rangle : t \in T\}$ is the class of linear functionals associated with $T$. For every $x_1,...,x_N \in \mathbb{R}^n$ set $X = (x_1,...,x_N)$, and let $\Gamma_X = \sum_{i=1}^N \langle x_i, \cdot\rangle e_i$ be the matrix whose rows are $x_1,...,x_N$. Thus,
$$K(0,X) = \ker(\Gamma_X) \cap T.$$
If $\mu$ is an isotropic, $L$-subgaussian measure on $\mathbb{R}^n$, one may show that with probability at least $1 - 2\exp(-c_1 N)$,
$$\mathrm{diam}(K(0,X), L_2(\mu)) \le r_Q(c_2(L)) \eqno(4.1)$$
(see [14]). This extends the celebrated result of Pajor and Tomczak-Jaegermann [15, 16], that (4.1) holds for the Haar measure on $S^{n-1}$ (and thus, also for the gaussian measure on $\mathbb{R}^n$).

It turns out that (4.1) is not far from optimal:
Theorem 4.4 There exists an absolute constant $c$ for which the following holds. Let $F \subset L_2(\mu)$ be a convex and centrally-symmetric set. If
$$\log M\bigl(F \cap rD, (r/4)D\bigr) \ge cN,$$
then for every $X = (x_1,...,x_N)$, $\mathrm{diam}(K(0,X), L_2(\mu)) \ge r/8$.

Since $F$ is convex and centrally-symmetric, $F - F = 2F$ and $0 \in F$. Therefore,
$$M\bigl(F \cap rD, (r/4)D\bigr) \le \sup_{x \in F} M\bigl((F - x) \cap 2rD, (r/4)D\bigr) \le M\bigl((F - F) \cap 2rD, (r/4)D\bigr) = M\bigl(F \cap rD, (r/8)D\bigr).$$
Hence, Theorem 4.4 shows that if $\gamma_Q(c_1) > r$, then for every $X = (x_1,...,x_N)$, $\mathrm{diam}(K(0,X), L_2(\mu)) \ge r/8$. In particular, for every $W \in L_2$ that is orthogonal to $\mathrm{span}(F)$, the best possible error rate in $F$ that holds for every target $Y^f = f(X) + W$ is at least $(\gamma_Q(c_1)/8)^2 \ge c_2\gamma_Q^2(c_1)$.

Proof.
Let $f_1,...,f_m$ be $r/2$-separated points in $F \cap rD$. Set $A_i = \frac{1}{2}f_i + \frac{1}{32}(F \cap rD)$, and observe that $A_i \subset F \cap rD$. Also, for every $h \in A_i$, $\|(f_i/2) - h\|_{L_2} \le r/16$; therefore, if $h_i \in A_i$ and $h_\ell \in A_\ell$, then $\|h_i - h_\ell\|_{L_2} \ge r/8$.

Fix $X = (x_1,...,x_N)$ and for $A \subset F$ set
$$P_X(A) = \bigl\{(h(x_i))_{i=1}^N : h \in A\bigr\} \subset \mathbb{R}^N,$$
the coordinate projection of $A$ associated with $X$. Clearly, for every $1 \le i \le m$,
$$P_X(A_i) = \frac{1}{2}(f_i(x_j))_{j=1}^N + \frac{1}{32}P_X(F \cap rD). \eqno(4.2)$$
Consider two possibilities. First, if there are $i \ne \ell$ for which $P_X(A_i) \cap P_X(A_\ell) \ne \emptyset$, there are $h_i \in A_i$ and $h_\ell \in A_\ell$ that satisfy $h_i - h_\ell \in K(0,X)$, thus showing that $\mathrm{diam}(K(0,X), L_2(\mu)) \ge r/8$.

Otherwise, the sets $P_X(A_i)$ are disjoint subsets of $P_X(F \cap rD)$. And, setting $T = P_X(F \cap rD)$, (4.2) implies that $M(T, T/32) \ge m$. Since $T$ is a convex, centrally symmetric subset of $\mathbb{R}^N$, a standard volumetric argument shows that $M(T, T/32) \le \exp(cN)$ for a suitable absolute constant $c$. Thus, if $m > \exp(cN)$, $\mathrm{diam}(K(0,X), L_2) \ge r/8$, as claimed.

The final result of this section is the 'noise-dependent' lower bound.
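Before stating it, we note that the volumetric estimate invoked in the proof of Theorem 4.4 – that $M(T, T/32) \le \exp(cN)$ for a convex, centrally-symmetric $T \subset \mathbb{R}^N$ – admits a short derivation (the constant below is what this choice of scale yields; the paper does not specify $c$). If $t_1,\dots,t_m \in T$ are centres of pairwise disjoint translates $t_i + \frac{1}{32}T$, then, by convexity,

```latex
\bigcup_{i=1}^{m}\Bigl(t_i+\tfrac{1}{32}T\Bigr)\ \subset\ T+\tfrac{1}{32}T=\tfrac{33}{32}T,
\qquad\text{hence}\qquad
m\,\mathrm{vol}\Bigl(\tfrac{1}{32}T\Bigr)\le\mathrm{vol}\Bigl(\tfrac{33}{32}T\Bigr),
\quad\text{i.e.}\quad
m\le\Bigl(\tfrac{33}{32}\Bigr)^{N}\cdot 32^{N}=33^{N}.
```

Thus $\log M(T, T/32) \le N\log 33$.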
Theorem 4.5
There exist absolute constants $c_1$ and $c_2$ for which the following holds. Let $F \subset L_2(\mu)$ be a convex, centrally-symmetric class of functions, set $W$ to be a centred normal random variable that is independent of $X$, and for every $f \in F$, put $Y^f = f(X) + W$. If $\Psi$ is a learning procedure that performs with confidence of at least $7/8$ for every $Y^f$, there is some $Y^f$ for which
$$\mathcal{E}_p \ge c_1\gamma_M^2\Bigl(\frac{c_2}{\|W\|_{L_2}}\Bigr).$$

Stronger versions of Theorem 4.5 (without the assumption that $F$ is convex and centrally-symmetric) may be proved in several different ways: using information-theoretic tools (see Theorem 2.5 in [17]), or, alternatively, by applying the gaussian isoperimetric inequality as in [7]. Both these arguments are rather restrictive, because they rely on rather special properties of the noise.

Although the proof we present below is also for gaussian noise, the argument is less restrictive and may be extended to other choices of noise (e.g., when $W$ is log-concave rather than gaussian). The argument is essentially the same as Talagrand's proof of the dual-Sudakov inequality [8], and as such is volumetric in nature: obtaining a lower bound on the measure of a shift of a centrally-symmetric set in terms of the Euclidean norm of the shift.

Lemma 4.6
Let $A \subset \mathbb{R}^N$ be centrally symmetric and let $z \in \mathbb{R}^N$. If $\nu$ is the centred gaussian measure on $\mathbb{R}^N$ with covariance $\sigma^2 I_N$ and $|\cdot|$ denotes the Euclidean norm on $\mathbb{R}^N$, then
$$\nu(z + A) \ge \exp\Bigl(-\frac{|z|^2}{2\sigma^2}\Bigr)\nu(A).$$

Proof.
A change of variables shows that
$$\nu(z+A) = \frac{1}{(2\pi\sigma^2)^{N/2}}\int_{z+A}\exp\Bigl(-\frac{|x|^2}{2\sigma^2}\Bigr)dx = \frac{1}{(2\pi\sigma^2)^{N/2}}\int_A\exp\Bigl(-\frac{|t+z|^2}{2\sigma^2}\Bigr)dt$$
$$= \exp\Bigl(-\frac{|z|^2}{2\sigma^2}\Bigr)\cdot\frac{1}{(2\pi\sigma^2)^{N/2}}\int_A\exp\Bigl(-\frac{\langle z,t\rangle}{\sigma^2}\Bigr)\exp\Bigl(-\frac{|t|^2}{2\sigma^2}\Bigr)dt = (*).$$
Let $\mathbb{E}_{\nu|A}$ be the expectation with respect to the gaussian measure $\nu$, conditioned on $A$. Thus,
$$(*) = \exp\Bigl(-\frac{|z|^2}{2\sigma^2}\Bigr)\nu(A)\cdot\mathbb{E}_{\nu|A}\exp\Bigl(-\frac{\langle z,t\rangle}{\sigma^2}\Bigr).$$
Since $A$ is symmetric, $\mathbb{E}_{\nu|A}\langle z,t\rangle = 0$, and by Jensen's inequality
$$(*) \ge \exp\Bigl(-\frac{|z|^2}{2\sigma^2}\Bigr)\nu(A).$$

Proof of Theorem 4.5. Let $\Psi$ be a learning procedure that performs with accuracy $\mathcal{E}_p$ for every target $Y^f = f(X) + W$, for $f \in F$ and $W \sim N(0,\sigma^2)$ that is independent of $X$. Note that for the target $Y^f$, the true minimizer in $F$ is $f^* = f$, and for every $h \in F$,
$$\mathbb{E}\mathcal{L}_h = \mathbb{E}(h(X) - Y^f)^2 - \mathbb{E}(f^*(X) - Y^f)^2 = \|h - f\|_{L_2}^2.$$
Thus, if $\tau = (x_i,y_i)_{i=1}^N \in (\Omega\times\mathbb{R})^N$ is a sample on which $\Psi$ performs with accuracy $\mathcal{E}_p$ relative to the target $Y^f$, then $\|\Psi(\tau) - f^*\|_{L_2}^2 \le \mathcal{E}_p$.

Let $(f_j)_{j=1}^m$ be a subset of $F \cap rD$ that is $r/2$-separated in $L_2(\mu)$, for $(r/2)^2 = 9\mathcal{E}_p$, and fix $X = (x_1,...,x_N) \in \Omega^N$. For every $1 \le j \le m$, put
$$A_j(X) = \bigl\{(w_i)_{i=1}^N : \Psi\bigl((x_i, f_j(x_i)+w_i)_{i=1}^N\bigr) \in f_j + \sqrt{\mathcal{E}_p}\,D\bigr\} \subset \mathbb{R}^N,$$
i.e., $A_j(X)$ consists of all the vectors $(w_i)_{i=1}^N \in \mathbb{R}^N$ for which, upon receiving the data $(x_i, f_j(x_i)+w_i)_{i=1}^N$, $\Psi$ selects a point whose $L_2$ distance to $f_j$ is at most $\sqrt{\mathcal{E}_p}$.

Let $\nu$ be the centred gaussian measure on $\mathbb{R}^N$ with covariance $\sigma^2 I_N$. Since $W$ is a centred gaussian random variable with variance $\sigma^2$, $(w_i)_{i=1}^N$ is distributed according to $\nu$, and since it is independent of $X$, if $\Psi$ performs with accuracy $\mathcal{E}_p$ and with probability at least $7/8$, it is evident that
$$\mu^N\otimes\nu\bigl(\bigl\{(x_i,w_i)_{i=1}^N : \Psi\bigl((x_i, f_j(x_i)+w_i)_{i=1}^N\bigr) \in f_j + \sqrt{\mathcal{E}_p}\,D\bigr\}\bigr) = \mu^N\otimes\nu\bigl(\bigl\{(x_i,w_i)_{i=1}^N : (w_i)_{i=1}^N \in A_j(X)\bigr\}\bigr) \ge 7/8.$$
A standard Fubini argument shows that there is an event $C_j \subset \Omega^N$ of $\mu^N$-probability at least $1/2$, such that for every $X = (x_i)_{i=1}^N \in C_j$, $\nu(A_j(X)) \ge 3/4$. If $X \in C_j$ then, by the symmetry of $\nu$, $\nu(-A_j(X)) \ge 3/4$, and the centrally-symmetric set $A_j(X)\cap -A_j(X) \subset A_j(X)$ satisfies
$$\nu\bigl(A_j(X)\cap -A_j(X)\bigr) \ge 1/2.$$
Let $z_j = (f_j(x_i))_{i=1}^N$. If $X \in C_j \cap C_\ell$, the sets $z_j + A_j(X)$ and $z_\ell + A_\ell(X)$ are disjoint, because $\Psi$ maps $z_j + A_j(X)$ to a $\sqrt{\mathcal{E}_p}$-neighbourhood of $f_j$ and $z_\ell + A_\ell(X)$ to a $\sqrt{\mathcal{E}_p}$-neighbourhood of $f_\ell$ – but $\|f_j - f_\ell\|_{L_2} \ge r/2 = 6\sqrt{\mathcal{E}_p}$. Therefore,
$$\sum_{j=1}^m \mathbb{1}_{C_j}(X)\,\nu\bigl(z_j + (A_j(X)\cap -A_j(X))\bigr) \le 1, \qquad\text{hence}\qquad \sum_{j=1}^m \mathbb{E}_X\,\mathbb{1}_{C_j}(X)\,\nu\bigl(z_j + (A_j(X)\cap -A_j(X))\bigr) \le 1,$$
and all that remains is to control $\mathbb{E}_X\,\mathbb{1}_{C_j}(X)\,\nu(z_j + (A_j(X)\cap -A_j(X)))$ from below.

Applying Lemma 4.6,
$$\nu\bigl(z_j + (A_j(X)\cap -A_j(X))\bigr) \ge \exp\Bigl(-\frac{|z_j|^2}{2\sigma^2}\Bigr)\nu\bigl(A_j(X)\cap -A_j(X)\bigr) = \exp\Bigl(-\frac{1}{2\sigma^2}\sum_{i=1}^N f_j^2(x_i)\Bigr)\cdot\nu\bigl(A_j(X)\cap -A_j(X)\bigr).$$
By Chebyshev's inequality, and recalling that $\|f_j\|_{L_2} \le r$,
$$\mu^N\Bigl(\sum_{i=1}^N f_j^2(X_i) \le c_3 N r^2\Bigr) \ge 3/4$$
for a suitable absolute constant $c_3$ and every $1 \le j \le m$. Thus, on an event of $\mu^N$-measure at least $1/4$, $X \in C_j$, $\nu(A_j(X)\cap -A_j(X)) \ge 1/2$ and $\sum_{i=1}^N f_j^2(X_i) \le c_3 N r^2$; therefore,
$$\mathbb{E}_X\,\mathbb{1}_{C_j}(X)\,\nu\bigl(z_j + (A_j(X)\cap -A_j(X))\bigr) \ge c_4\exp\Bigl(-\frac{c_3 N r^2}{2\sigma^2}\Bigr).$$
Hence, $\log m \le c_5 N r^2/\sigma^2$, i.e., $\log M(F\cap rD, (r/4)D) \le (c_5/\sigma^2)Nr^2$, implying that $\mathcal{E}_p \ge c_6\gamma_M^2(c_7/\sigma)$.

We begin this section with an example of 'natural' sets for which there is a true gap between the two sets of parameters: $r_Q/r_M$ and $\gamma_Q/\gamma_M$.

Let $T \subset \mathbb{R}^n$ be a convex body in $\mathbb{R}^n$ (i.e., a convex, centrally-symmetric set with a nonempty interior), put $F = \{\langle t,\cdot\rangle : t \in T\}$, the class of linear functionals associated with $T$, and set $\mu$ to be the gaussian measure on $\mathbb{R}^n$. It is straightforward to verify that for every r >
0, the pair $(F \cap rD, L_2(\mu))$ is isometric to $(T \cap rB_2^n, \ell_2^n)$, where $B_2^n$ is the Euclidean unit ball in $\mathbb{R}^n$. Let $1 \le p < 2$, and set $T = B_p^n$, the unit ball in $\ell_p^n = (\mathbb{R}^n, \|\cdot\|_{\ell_p})$. One may show (see [7]) that when $p = 1$, $r_M$ and $\gamma_M$ are equivalent, as are $r_Q$ and $\gamma_Q$. However, such an equivalence is no longer true for $1 < p < 2$ (provided that $p \ge 1 + 1/\log n$ – otherwise, $\ell_p^n$ is equivalent to $\ell_1^n$).

To see how the gap between the 'global' and 'local' parameters is exhibited in $B_p^n$ for $1 < p < 2$, let $x = (x_i)_{i=1}^n \in B_p^n$ and set $(x_i^*)_{i=1}^n$ to be the non-increasing rearrangement of $(|x_i|)_{i=1}^n$; thus, $x_i^* \le i^{-1/p}$. Recall the well-known fact (see, e.g., [5]) that $\mathbb{E}\|G\|_{B_p^n\cap rB_2^n}$ is equivalent to
$$c(p)\begin{cases} n^{1-1/p} & \text{if } r \ge c_1(p)\,n^{-(1/p-1/2)},\\ rn^{1/2} & \text{if } r \le c_1(p)\,n^{-(1/p-1/2)}.\end{cases}$$
Thus, if $N \le n^{2/p}$,
$$r_M \sim \frac{n^{1/2-1/(2p)}}{N^{1/4}} \ge c(p)\,n^{-(1/p-1/2)}.$$

Let us consider the case in which $1 > r \gg c_1(p)\,n^{-(1/p-1/2)}$. Set $\ell = (1/r)^{2p/(2-p)}$ and observe that
$$B_p^n \cap rB_2^n \subset \bigl\{x \in \mathbb{R}^n : x_i^* \le r/i^{1/2} \ \text{if } i \le \ell, \ \text{and } x_i^* \le 1/i^{1/p} \ \text{if } i > \ell\bigr\}.$$
Clearly, for a well-chosen constant $c_2$ one has $\sum_{i \ge c_2\ell} i^{-2/p} \le r^2/4$, and hence
$$B_p^n \cap rB_2^n \subset \bigcup_{|I| = c_2\ell}\bigl(c_3 r B_{2,\infty}^I + B_{p,\infty}^{I^c}\bigr),$$
where $B_{q,\infty}^I$ is the unit ball in $\mathbb{R}^I$ endowed with the weak $\ell_{q,\infty}$ norm (recall that for $x \in \mathbb{R}^n$, $\|x\|_{q,\infty} \le A$ if and only if $\sup_{i\ge 1} i^{1/q}x_i^* \le A$). In particular, if $|I| = c_2\ell$, $B_{p,\infty}^{I^c} \subset (r/2)B_2^{I^c}$, and the impact of those 'small' coordinates on Euclidean distances is negligible:
$$M\bigl(B_p^n\cap rB_2^n,\ rB_2^n\bigr) \le M\Bigl(\bigcup_{|I|=c_2\ell} c_3 r B_{2,\infty}^I,\ (r/2)B_2^n\Bigr) \le \binom{n}{c_2\ell}c_4^{c_2\ell}.$$
Hence, separation at scale $r$ occurs only because of the largest $\sim\ell$ coordinates of the vectors involved.

On the other hand, the contribution of those 'large' coordinates to $\mathbb{E}\|G\|_{B_p^n\cap rB_2^n}$ is equally negligible. Indeed, if $T = \bigcup_{|I|=m}\alpha rB_2^I$ for some $m \le n/2$ and $\alpha \ge 1$, it is standard to verify that
$$\mathbb{E}\sup_{t\in T}\sum_{i=1}^n g_i t_i \le c\,\alpha r m^{1/2}\log^{1/2}\Bigl(\frac{en}{m}\Bigr) = (*).$$
Setting $m = c_2\ell = c_2(1/r)^{2p/(2-p)}$, and since $r \gg c_1(p)\,n^{-(1/p-1/2)}$, it follows that
$$(*) \ll n^{1-1/p} \sim \mathbb{E}\|G\|_{B_p^n\cap rB_2^n}.$$
Thus, as long as $r$ is significantly larger than $c_1(p)\,n^{-(1/p-1/2)}$, the gaussian average of the intersection body $B_p^n\cap rB_2^n$ originates from the 'small coordinates' in the monotone rearrangement, and in particular, from vectors whose Euclidean norm is significantly smaller than $r$. Such vectors are 'invisible' to $\gamma_M$, which is why $\gamma_M$ is much smaller than $r_M$.

Fixed points are encountered frequently in the Empirical Processes and Statistics literature, almost always with the same goal: obtaining 'relative' upper bounds on various empirical processes. To obtain such bounds, one has to compare the oscillation (i.e., the behaviour of the process indexed by $(F-F)\cap rD$) with some function of $r$.

One usually obtains upper bounds on the oscillation via a symmetrization argument, leading to a sample-dependent Bernoulli process. Thus, the standard outcome is a fixed point equation, linking an entropy integral relative to the random $L_2$ metric generated by the sample $X_1,...,X_N$ with the desired function of $r$ (see [18] for numerous examples).

Still within the realm of entropy integrals, it is possible to impose additional structure on the problem, which allows one to replace the empirical (random) $L_2$ metrics with the global $L_2(\mu)$ metric.
For example, a fixed point equation with the same normalization as $r_M$ may be found in [2], where the setup allows the transition between the random metric and the deterministic one – but the 'philosophy' of the proof is the same: it is based on an entropy integral.

Since the entropy integral is only an upper estimate on the supremum of the empirical process in question – regardless of the underlying assumptions – it is often loose. Therefore, one would like to find a general argument that bypasses the whole mechanism of entropy integrals.

As a first step, and because it is natural to expect that the empirical processes in question converge to a gaussian limit, one may try a 'gaussian'-based fixed point, which relies on $\mathbb{E}\|G\|_{(F-F)\cap rD}$ rather than on an entropy integral bound. And, indeed, under a subgaussian assumption, the results of [7] lead to the gaussian-based $r_M$ and $r_Q$.

Our results show that $r_M$ and $r_Q$ are not the end of the story and can be improved – at least for the special learning problems we consider. The 'right' fixed points should involve the smaller local entropy estimates rather than the oscillation of the gaussian process.

One fixed point that seems closer in nature to $\gamma_M$ than to $r_M$ may be found in the celebrated work of Yang and Barron [19], though a closer inspection shows that this impression is inaccurate.

Comparing [19] to our results is somewhat unnatural because the setup in [19] is completely different: a function class consisting of uniformly bounded functions and an independent gaussian noise, both of which are crucial to the proof (see Section 3.2 in [19]).
Also, the upper estimate in [19] is an existence result for a 'good' procedure – rather than a specific choice of a procedure; the estimate holds in expectation and not with high probability; and it does not tend to zero with the 'noise level' of the problem.

All these differences are significant, but are still not a conclusive indication that the nature of the complexity parameter in [19] is different from ours. That indication is the key assumption behind the results in [19]: that the underlying class is 'large', in the sense that
$$\liminf_{\varepsilon\to 0}\frac{\log M(F, (\varepsilon/2)D)}{\log M(F, \varepsilon D)} > 1. \eqno(5.1)$$
One should note that this assumption immediately excludes all the modern high-dimensional problems involving classes indexed by subsets of $\mathbb{R}^n$. Indeed, for any convex subset of $\mathbb{R}^n$, the liminf above is $1$ rather than strictly greater than $1$.

Equation (5.1) has two significant implications:

• The entropies of $F$ and of $F\cap rD$ are equivalent, which means that one may replace the local sets $F\cap rD$ with $F$ in the definition of the fixed points. This makes the proof of the upper bound simpler.

• It essentially restricts the setup to classes that have polynomial entropy, which is a considerably narrower scenario. Indeed, for the sake of brevity, let us ignore cases in which
$$\limsup_{\varepsilon\to 0}\frac{\log M(F, (\varepsilon/2)D)}{\log M(F, \varepsilon D)} = L$$
for $L \ge 4$ (if $L > 4$, the gaussian process $\{G_f : f \in F\}$ is not bounded and the class $F$ is not subgaussian, while if $L = 4$, an entropy estimate is not enough to determine whether the gaussian process is bounded, and thus a more subtle analysis is required). When $L < 4$, and since the entropy of $F\cap rD$ is equivalent to the entropy of $F$, it follows that there are $0 < q_1 \le q_2 < \infty$ for which, for every $\varepsilon \le R$ small enough,
$$\Bigl(\frac{R}{\varepsilon}\Bigr)^{q_1} \lesssim \log M(F\cap RD, \varepsilon D) \lesssim \Bigl(\frac{R}{\varepsilon}\Bigr)^{q_2}.$$
Using Dudley's entropy integral for the upper bound and Sudakov's minoration for the lower one, it is straightforward to verify that
$$\mathbb{E}\|G\|_{F\cap RD} \sim_{q_1,q_2} R\log^{1/2}M\bigl(F\cap RD, (R/2)D\bigr).$$
Therefore, the 'global' parameters $r_M$ and $r_Q$ are equivalent to the local ones $\gamma_M$ and $\gamma_Q$; in fact, the 'local' and 'global' parameters are even equivalent to the ones defined via the entropy integral. Thus, the typical situation in [19] is very different from the problems studied here – mainly because of (5.1).

References

[1] Martin Anthony and Peter L. Bartlett.
Neural network learning: theoretical foundations. Cambridge University Press, Cambridge, 1999.

[2] Lucien Birgé and Pascal Massart. Rates of convergence for minimum contrast estimators. Probab. Theory Related Fields, 97(1-2):113–150, 1993.

[3] Peter Bühlmann and Sara A. van de Geer. Statistics for high-dimensional data. Springer Series in Statistics. Springer, Heidelberg, 2011. Methods, theory and applications.

[4] R. M. Dudley. Uniform central limit theorems, volume 63 of Cambridge Studies in Advanced Mathematics. Cambridge University Press, Cambridge, 1999.

[5] Yehoram Gordon, Alexandre E. Litvak, Shahar Mendelson, and Alain Pajor. Gaussian averages of interpolated bodies and applications to approximate reconstruction. J. Approx. Theory, 149(1):59–73, 2007.

[6] Vladimir Koltchinskii. Oracle inequalities in empirical risk minimization and sparse recovery problems, volume 2033 of Lecture Notes in Mathematics. Springer, Heidelberg, 2011. Lectures from the 38th Probability Summer School held in Saint-Flour, 2008.

[7] Guillaume Lecué and Shahar Mendelson. Learning subgaussian classes: Upper and minimax bounds. Technical report, CNRS, Ecole polytechnique and Technion, 2013.

[8] Michel Ledoux and Michel Talagrand. Probability in Banach spaces, volume 23 of Ergebnisse der Mathematik und ihrer Grenzgebiete (3). Springer-Verlag, Berlin, 1991. Isoperimetry and processes.

[9] Pascal Massart. Concentration inequalities and model selection, volume 1896 of Lecture Notes in Mathematics. Springer, Berlin, 2007. Lectures from the 33rd Summer School on Probability Theory held in Saint-Flour, July 6–23, 2003. With a foreword by Jean Picard.

[10] Shahar Mendelson. Learning without concentration. Journal of the ACM. To appear.

[11] Shahar Mendelson. Learning without concentration for general loss functions. Arxiv: http://arxiv.org/abs/1410.3192.

[12] Shahar Mendelson. Upper bounds on product and multiplier empirical processes. Arxiv: http://arxiv.org/abs/1410.8003.

[13] Shahar Mendelson. Obtaining fast error rates in nonconvex situations. J. Complexity, 24(3):380–397, 2008.

[14] Shahar Mendelson, Alain Pajor, and Nicole Tomczak-Jaegermann. Reconstruction and subgaussian operators in asymptotic geometric analysis. Geom. Funct. Anal., 17(4):1248–1282, 2007.

[15] A. Pajor and N. Tomczak-Jaegermann. Nombres de Gel'fand et sections euclidiennes de grande dimension. In Séminaire d'Analyse Fonctionnelle 1984/1985, volume 26 of Publ. Math. Univ. Paris VII, pages 37–47. Univ. Paris VII, Paris, 1986.

[16] Alain Pajor and Nicole Tomczak-Jaegermann. Subspaces of small codimension of finite-dimensional Banach spaces. Proc. Amer. Math. Soc., 97(4):637–642, 1986.

[17] Alexandre B. Tsybakov. Introduction to nonparametric estimation. Springer Series in Statistics. Springer, New York, 2009. Revised and extended from the 2004 French original, translated by Vladimir Zaiats.

[18] Aad W. van der Vaart and Jon A. Wellner. Weak convergence and empirical processes. Springer Series in Statistics. Springer-Verlag, New York, 1996. With applications to statistics.

[19] Yuhong Yang and Andrew Barron. Information-theoretic determination of minimax rates of convergence. Ann. Statist., 27(5):1564–1599, 1999.