On the Minimal Error of Empirical Risk Minimization
Gil Kur (MIT)    Alexander Rakhlin (MIT)
Abstract
We study the minimal error of the Empirical Risk Minimization (ERM) procedure in the task of regression, both in the random and the fixed design settings. Our sharp lower bounds shed light on the possibility (or impossibility) of adapting to simplicity of the model generating the data. In the fixed design setting, we show that the error is governed by the global complexity of the entire class. In contrast, in random design, ERM may only adapt to simpler models if the local neighborhoods around the regression function are nearly as complex as the class itself, a somewhat counter-intuitive conclusion. We provide sharp lower bounds on the performance of ERM for both Donsker and non-Donsker classes. We also discuss our results through the lens of recent studies on interpolation in overparameterized models.
An increasing number of machine learning applications employ flexible overparameterized models to fit the training data. Theoretical analysis of such 'overfitted' solutions has been a recent focus of the learning community. It is conjectured that the use of large overparameterized neural networks makes the loss landscape amenable to optimization through local search methods, such as stochastic gradient descent. It is also hypothesized that implicit regularization, arising from the choice of the optimization algorithm and the neural network architecture, mitigates the large complexity and ensures that the 'overfitted' solutions generalize.

Suppose a 'simple' class $\mathcal H$ of models captures the relationship between the covariates $X$ and the response variable $Y$. Inspired by the use of overparameterized models, we may take a much larger class $\mathcal F \supset \mathcal H$ for computational or other purposes (such as lack of an explicit description of $\mathcal H$) and minimize training loss over this larger class. It is natural to ask whether the learning procedure can adapt to the fact that data comes from a simple model $f^* \in \mathcal H$, in the sense that the prediction error depends on the statistical complexity of $\mathcal H$ rather than $\mathcal F$. We do have positive examples of this type: the least squares solution (that is, empirical risk minimization with respect to square loss) over the class $\mathcal F$ of all convex functions on a convex compact subset of $\mathbb{R}^d$ (with $d \le 4$) automatically enjoys the faster "parametric" rate $\tilde O(k/n)$ of convergence to the true regression function $f^* \in \mathcal H$ if $f^*$ is a piece-wise linear convex function with $k$ pieces. This rate should be contrasted with the slow non-parametric rate $\Theta(n^{-4/(d+4)})$ when the true regression function is 'complex' and cannot be approximated well by a piece-wise linear convex function.

How generic is this phenomenon of automatic adaptivity of empirical minimizers to simplicity of the true model?
An affirmative answer would lend credibility to the practice of taking large models, whereas a negative answer would necessitate the study of conditions that can make such adaptivity possible.

This paper studies the fundamental limits of adaptivity of empirical risk minimization (ERM) in the setting of nonparametric regression (or, prediction with square loss and a well-specified model), in both random and fixed design. In contrast with the standard minimax approach to lower bounds, which may hide the true performance of ERM on simple models, we focus on lower bounds that hold for any (rather than the worst-case) regression function in a given class. In the fixed design setting, we show that—informally speaking—for rich classes $\mathcal F$, dependence on the global statistical complexity of the class is unavoidable, as it controls the error of ERM for any true regression function $f^*$, no matter how 'simple' it is. In contrast, in the random design case, the situation is more subtle. Somewhat counter-intuitively, we show that for rich classes $\mathcal F$, adaptation to the simplicity of $f^*$ may only be possible if the local neighborhood of $f^*$ in $\mathcal F$ is nearly as rich as the class $\mathcal F$ itself. This finding can be viewed through the lens of recent results on interpolation (Belkin et al., 2019, 2018; Bartlett et al., 2020; Liang et al., 2020b). In these papers, the solutions can be seen as 'simple-plus-spiky' (Wyner et al., 2017), with the spikes responsible for fitting the training data without affecting the error with respect to the population. Since in these models there are enough degrees of freedom to fit any noisy data, the effective function classes have rich local neighborhoods. In such cases, it is still possible that 'overfitting' to the training data does not result in a large out-of-sample error.
Conversely, we show that—again, informally speaking—if $f^*$ is embedded in a local neighborhood in $\mathcal F$ with low complexity, the empirical minimizer will necessarily be attracted to a solution far away from $f^*$ with respect to the out-of-sample loss. This finding initially appeared counter-intuitive to the authors.

We now present the formal model. Let $\mathcal F$ be a convex class of real-valued functions on some domain $\mathcal X$. We aim to recover $f^* \in \mathcal F$ based on $n$ samples $Y_i = f^*(X_i) + \xi_i$, $i = 1,\dots,n$, under the assumption $f^* \in \mathcal F$ and $\xi_1,\dots,\xi_n \overset{\text{i.i.d.}}{\sim} N(0,1)$. In the random design setting, $X_1,\dots,X_n \overset{\text{i.i.d.}}{\sim} P$, where $P$ is some unknown distribution on $\mathcal X$, while in the fixed design setting $X_1,\dots,X_n$ are some fixed points in $\mathcal X$.

The Least Squares Estimator, or ERM with respect to square loss, is defined as

\[
\hat f_n = \Psi\Big( \operatorname{argmin}_{f \in \mathcal F} \sum_{i=1}^n (Y_i - f(X_i))^2 \Big), \tag{1}
\]

where $\Psi$ is a function that selects a particular solution from the set of possible minimizers (for example, a minimal norm solution).

One of the most important questions regarding ERM is its statistical performance as compared to other estimators, defined as maps from $\{(X_i, Y_i)\}_{i=1}^n$ to $\mathcal F$ (or to $\mathbb{R}^{\mathcal X}$ for improper methods). While there are multiple ways of measuring statistical performance, perhaps the most popular is the minimax risk (Tsybakov, 2003), defined in the random design case for any estimator $\bar f_n$ as

\[
R(\bar f_n, \mathcal F, P) := \sup_{f^* \in \mathcal F}\, \mathbb{E}_{x,\xi} \int \big( \bar f_n((X_1, Y_1),\dots,(X_n, Y_n)) - f^* \big)^2\, dP,
\]

where $\mathbb{E}_{x,\xi}$ denotes expectation over the training data and the integral represents the expected out-of-sample performance with respect to $P$. One can also write this measure of performance as the excess square loss

\[
\sup_{f^* \in \mathcal F}\, \mathbb{E}_{x,\xi}\Big[ \mathbb{E}_{(X,Y)} (\bar f_n(X) - Y)^2 - \mathbb{E}_{(X,Y)} (f^*(X) - Y)^2 \Big].
\]

We say that the ERM $\hat f_n$ is minimax optimal if for all $n \ge 1$,

\[
R(\hat f_n, \mathcal F, P) \lesssim \inf_{\bar f_n} R(\bar f_n, \mathcal F, P),
\]

where $\lesssim$ denotes less than or equal up to a constant that only depends on $P, \mathcal F$.
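To make Eq. (1) concrete, here is a minimal sketch of computing the least squares ERM for one convex class; the class (linear functions with coefficients in the Euclidean unit ball), the data, and the optimizer settings are all illustrative assumptions, not the paper's setting.

```python
import numpy as np

# Toy instance of Eq. (1): the convex class F = {x -> <theta, x> : ||theta||_2 <= 1},
# for which least squares ERM is a convex constrained problem solvable by
# projected gradient descent. All choices below are illustrative assumptions.
rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
theta_star = np.array([0.6, 0.0, 0.0, 0.0, 0.0])  # a 'simple' regression function in F
y = X @ theta_star + rng.normal(size=n)           # Y_i = f*(X_i) + xi_i, xi_i ~ N(0,1)

def erm_projected_gd(X, y, steps=2000, lr=0.01):
    """Minimize n^{-1} sum_i (Y_i - <theta, X_i>)^2 over the ball ||theta||_2 <= 1."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = 2.0 / len(y) * X.T @ (X @ theta - y)  # gradient of the empirical risk
        theta -= lr * grad
        norm = np.linalg.norm(theta)
        if norm > 1.0:                               # project back onto the constraint set
            theta /= norm
    return theta

theta_hat = erm_projected_gd(X, y)
train_loss = float(np.mean((y - X @ theta_hat) ** 2))
```

In this toy instance the selection map $\Psi$ of Eq. (1) plays no role: with $n > d$ and a full-rank design, the objective is strictly convex, so the minimizer is unique.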
The quantity $\inf_{\bar f_n} R(\bar f_n, \mathcal F, P)$ is known as the minimax rate for $(\mathcal F, P)$. In the fixed design setting, the risk measure is defined in an analogous way, except that instead of drawing $n$ i.i.d. points from $P$, we consider a sequence of measures supported uniformly on $n$ points.

Clearly, the definitions of the risk and of minimax optimality measure "the worst case scenario" of a given estimator, and may hide the true statistical performance of the ERM in real-life applications (cf. (Bellec, 2017)). For example, as mentioned in the introduction, if $f^*$ is known to belong to a smaller class $\mathcal H$, the relevant quantity is

\[
R_{\mathcal H}(\hat f_n, \mathcal F, P) := \sup_{f^* \in \mathcal H}\, \mathbb{E} \int (\hat f_n - f^*)^2\, dP,
\]

where $\hat f_n$ is still defined over $\mathcal F$, due to computational or other considerations. As an example, consider linear regression in $\mathbb{R}^d$ when the true coefficient vector is sparse, i.e. supported on $k \ll d$ coordinates. Then, due to computational considerations, it is standard to replace the original problem of minimizing square loss over sparse vectors in $\mathbb{R}^d$ by minimization over a larger $\ell_1$ ball in $\mathbb{R}^d$ (the Lasso procedure).

The second example was already briefly mentioned in the introduction, and we expand on it here. Let $\mathcal F_d$ be the family of convex $1$-Lipschitz functions on $\mathcal X = [0,1]^d$, and let $P = \mathrm{Unif}(\mathcal X)$. The subset $\mathcal H_d$ (of 'simple' functions) is the set of $1$-Lipschitz $k$-affine piece-wise linear functions with $k = \Theta(1)$. It is well known that ERM over $\mathcal H_d$ is NP-hard, since the problem is highly non-convex; moreover, even estimating the number of pieces is computationally hard (cf. the recent paper Ghosh et al. (2019) for more details). In contrast, ERM over $\mathcal F_d$ can be computed efficiently (Ghosh et al., 2019). While the minimax rate for $(\mathcal F_d, P_d)$ is $\Theta(n^{-4/(d+4)})$ (Dudley, 1999; Bronshtein, 1976), it was proved recently in (Kur et al., 2020b) that the risk of ERM is $\tilde\Theta_d(\max\{n^{-2/d}, n^{-4/(d+4)}\})$, which is minimax-suboptimal when $d \ge 5$.
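As a sanity check on these exponents (a direct computation, not a claim from the paper), the crossover between the two terms in the ERM rate occurs exactly at $d = 4$:

```latex
% Comparing the exponents in \tilde{\Theta}_d\big(\max\{n^{-2/d},\, n^{-4/(d+4)}\}\big):
\frac{2}{d} \;\ge\; \frac{4}{d+4}
\;\iff\; 2(d+4) \ge 4d
\;\iff\; d \le 4.
% For d <= 4 the smaller exponent is 4/(d+4), so the minimax term n^{-4/(d+4)}
% dominates and ERM is rate-optimal; for d >= 5 the slower term n^{-2/d}
% dominates, and ERM is minimax-suboptimal.
```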
Furthermore, it was shown in (Han and Wellner, 2016; Feng et al., 2018) that

\[
R_{\mathcal H_d}(\hat f_n, \mathcal F_d, P_d) \;\le\; \tilde O\big(\max\{n^{-4/d},\, n^{-1}\}\big), \tag{2}
\]

which is significantly smaller than both the risk of ERM and the minimax rate. When the ERM (or MLE) satisfies such improved bounds, we say that it exhibits adaptation (cf. (Feng et al., 2018; Kim et al., 2018; Samworth, 2018; Han et al., 2019; Kur et al., 2020b)).

In this paper we answer the two following questions: Does there exist a uniform lower bound on the minimal error

\[
\inf_{f^* \in \mathcal F}\, \mathbb{E}_{x,\xi} \int (\hat f_n - f^*)^2\, dQ
\]

of ERM $\hat f_n$, where $Q$ is either the fixed or the random design measure? Does the richness of the entire class $\mathcal F$ affect the minimal error, or is there a more refined notion of complexity that governs its behavior?

We start with definitions. For $n$ points $x^n := \{x_1,\dots,x_n\}$ in $\mathcal X$ and $\mathcal G \subseteq \mathcal F$, we define the Gaussian averages of $\mathcal G$ as

\[
\widehat{\mathcal W}(\mathcal G) := \mathbb{E}_\xi \sup_{f \in \mathcal G}\; \frac{1}{n}\sum_{i=1}^n \xi_i f(x_i), \qquad \mathcal W(\mathcal G) := \mathbb{E}\, \widehat{\mathcal W}(\mathcal G).
\]

For a measure $Q$ on $\mathcal X$ and $f : \mathcal X \to \mathbb{R}$, we denote by $\|f\|_Q$ the $L_2(Q)$ norm of $f$. Finally, for any $Q$, $f \in \mathcal F$, and $r \ge 0$, we denote by $B_Q(f, r) := \{g \in \mathcal F : \|g - f\|_Q \le r\}$ the intersection of the $L_2(Q)$ ball around $f$ and the class $\mathcal F$.

We now state our sharp lower bound for the fixed design error, for simplicity of exposition under the assumption of uniform boundedness of $\mathcal F$ (the general statement is given below in Lemma 3.1).

Corollary 3.1.
Let $P_n$ be the empirical measure on some $n$ points in $\mathcal X$, and assume $\mathcal F \subseteq [-1,1]^{\mathcal X}$ is convex. Then the minimal error of ERM over $\mathcal F$ satisfies

\[
\inf_{f^* \in \mathcal F}\, \mathbb{E}_\xi \int (\hat f_n - f^*)^2\, dP_n \;\ge\; c\,\big(\widehat{\mathcal W}^2(\mathcal F) - Cn^{-1}\big),
\]

where $c \in (0,1)$ and $C \in (1,\infty)$ are absolute constants.

When $\mathcal F$ is uniformly bounded (say, by $1$), a classical result in non-parametric statistics (van de Geer, 2000) and our theorem imply that

\[
c\,\big(\widehat{\mathcal W}^2(\mathcal F) - Cn^{-1}\big) \;\le\; \inf_{f^* \in \mathcal F}\, \mathbb{E}_\xi \int (\hat f_n - f^*)^2\, dP_n \;\le\; \underbrace{\sup_{f^* \in \mathcal F}\, \mathbb{E}_\xi \int (\hat f_n - f^*)^2\, dP_n}_{=\, R(\hat f_n, \mathcal F, P_n)} \;\le\; C_1\, \widehat{\mathcal W}(\mathcal F).
\]

Moreover, both of these bounds are tight, in the sense that they can be attained on certain families of functions, up to constants (cf. Birgé et al. (1998); Han et al. (2019)). Therefore, we conclude that in the fixed design case, both the minimax risk and the minimal error of the ERM depend on the entire Gaussian complexity of $\mathcal F$ (when it is convex and uniformly bounded). In particular, for the case of convex regression, Corollary 3.1 recovers the rate in (2) (up to logarithmic factors) for the fixed design case, since with high probability the global complexity $\widehat{\mathcal W}(\mathcal F)$ is of the order $\max\{n^{-2/d}, n^{-1/2}\}$.

We now turn to the random design setting, which is significantly more subtle. Before stating the result, we describe a direct proof strategy that fails. This approach would attempt to pass from the fixed design lower bound to the random design lower bound by relating the population and empirical norms $\|\cdot\|_P$ and $\|\cdot\|_{P_n}$, uniformly over the class. A statement of this type (which may be called "upper isometry," in contrast with the "lower isometry" studied, for instance, in Mendelson (2014)) could be derived under additional assumptions on the geometry of $(\mathcal F, P)$, such as a small-ball condition (Mendelson, 2014), Koltchinskii–Pollard entropy (Rakhlin et al., 2017), or an $\epsilon$-covering with respect to the sup-norm (van de Geer, 2000).
To the best of our knowledge, such upper-isometry statements can at best read

\[
\|f - g\|_P^2 \;\ge\; \|f - g\|_{P_n}^2 - C\cdot \mathcal W(\mathcal F) \qquad \forall f, g \in \mathcal F,
\]

where $C \ge 1$. Since $\mathcal W(\mathcal F)$ is larger than $\mathcal W^2(\mathcal F)$, the lower bound on the fixed design error, this technique does not appear to work.

Moreover, a uniform lower bound of order $\mathcal W^2(\mathcal F)$ in random design cannot be true in general. For instance, it was shown in a string of recent works (Liang et al., 2020a; Belkin et al., 2019; Bartlett et al., 2020; Tsigler and Bartlett, 2020) that it is possible to completely interpolate $Y_1,\dots,Y_n$ (i.e. achieve zero empirical error) and still have a small generalization error (of order $n^{-c}$, for some $c \in (0,1)$), and even be minimax optimal (with an appropriate function $\Psi$ in Eq. (1)). In these examples, because of the ability to interpolate any data, we know that $\mathcal W(B_n(f^*, 1)) = \Theta(1)$; therefore, the lower bound from the fixed design case cannot always hold in random design.

The last paragraph motivates the need to consider additional properties of the model $\mathcal F$ and the underlying distribution $P$. With the interpolation examples in mind, we might hope that the relation between the global complexity of the class and the complexity of local neighborhoods around the regression function $f^*$ may play a role in determining rates of convergence of ERM. To this end, for every $n$ and $f^* \in \mathcal F$, we define the following notion of complexity:

\[
t_{n,P}(f^*, \mathcal F) := \max\big\{ t \in \mathbb{R}_+ : \mathcal W(B_P(f^*, t)) \le l_\xi\, \mathcal W(B_n(f^*, 1)) \big\}, \tag{3}
\]

where $l_\xi \in (0,1)$ is a small absolute constant that will be chosen in the proofs.
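As a toy numerical illustration of definition (3) — with every ingredient (the finite class, the constant $l_\xi$, the identification of design and population measure) an illustrative assumption — one can Monte Carlo the local Gaussian averages $\mathcal W(B_P(f^*, t))$ over a grid of radii $t$ and read off the largest radius at which the local ball is still much poorer than the whole class:

```python
import numpy as np

# Toy illustration of Eq. (3): for a finite class evaluated on a grid of points,
# estimate the Gaussian averages of balls around f* by Monte Carlo and locate
# the largest radius t whose ball has complexity at most l_xi times the global
# one. The class and the value of l_xi are illustrative assumptions.
n = 200
x = np.linspace(0.0, 1.0, n)
f_star = np.zeros(n)

# Rows are functions evaluated at x: scaled copies of a few shapes (f* included).
shapes = np.stack([x - 0.5, np.sin(2 * np.pi * x), np.cos(2 * np.pi * x)])
F = np.concatenate([s * shapes for s in np.linspace(-1.0, 1.0, 41)])

def gauss_avg(values, n_mc=3000, seed=1):
    """Monte Carlo estimate of E_xi sup_f (1/n) <xi, f> over the rows of `values`."""
    if len(values) == 0:
        return 0.0
    xi = np.random.default_rng(seed).normal(size=(n_mc, values.shape[1]))
    return float(np.mean(np.max(xi @ values.T / values.shape[1], axis=1)))

norms = np.sqrt(np.mean((F - f_star) ** 2, axis=1))   # L2 distances to f*
w_global = gauss_avg(F - f_star)

l_xi = 0.25
radii = np.linspace(0.0, 1.0, 51)
w_local = np.array([gauss_avg((F - f_star)[norms <= t]) for t in radii])
t_np = float(radii[w_local <= l_xi * w_global].max())  # proxy for t_{n,P}(f*, F)
```

Reusing the same noise seed for every radius couples the estimates, so the estimated map $t \mapsto \mathcal W(B(f^*, t))$ is monotone by construction, exactly as the definition requires.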
We remark that under the additional assumption that $\mathcal F$ is uniformly bounded by $1$, we have

\[
2^{-1}\mathcal W(\mathcal F) \;\le\; \mathcal W(B_n(f^*, 1)) \;\le\; \mathcal W(\mathcal F),
\]

and thus we can replace the term on the right-hand side of (3) with the global Gaussian average $\mathcal W(\mathcal F)$, at the cost of adjusting the constant $l_\xi$.

The quantity $t_{n,P}(f^*, \mathcal F)$ is the maximal radius of the population ball around $f^*$ whose Gaussian complexity is still only a small constant fraction of that of the entire class (in the uniformly bounded case), or of that of a ball of constant radius within the class. As we show next, local richness beyond this radius is necessary in order to avoid the rate being dominated by the global complexity of $\mathcal F$. In the aforementioned interpolation examples we have both $t_{n,P}(f^*, \mathcal F) = O(n^{-c})$ and $\mathcal W(B_n(f^*, 1)) = \Theta(1)$. The last two relations must be true for any $f^* \in \mathcal F$ for which ERM attains a perfect fit to the data and yet a small generalization error of order $n^{-c}$.

We now state the main result of this paper for the random design setting, under the additional assumption that $\mathcal F$ is uniformly bounded. Remarkably, $t_{n,P}(f^*, \mathcal F)$ is the only additional quantity that we need to consider for a uniform lower bound over a general family $\mathcal F$. Specifically, we prove the following:

Theorem 3.1.
Let $\mathcal F$ be a convex class of functions uniformly bounded by one. Then for large enough $n$, the minimal error of ERM over $\mathcal F$ is lower bounded as

\[
\inf_{f^* \in \mathcal F}\; \frac{\mathbb{E}_{x,\xi} \int (\hat f_n - f^*)^2\, dP}{\min\{\mathcal W^2(\mathcal F),\; t^2_{n,P}(f^*, \mathcal F)\}} \;\ge\; c,
\]

where $c \in (0,1)$.

Remark 1.
Notably, Theorem 3.1 holds under only convexity and uniform boundedness assumptions on the class $\mathcal F$. Furthermore, one can easily design a convex uniformly bounded family and an $f^* \in \mathcal F$ such that the ERM attains an error of order $t^2_{n,P}(f^*, \mathcal F) \ll \mathcal W^2(\mathcal F)$ for all $n$ that are large enough (for completeness, see Section B.1). Therefore, under no additional assumptions on $\mathcal F$, the above lower bound is sharp up to absolute constants.

An almost immediate corollary of this theorem is the following key insight into the behavior of the ERM procedure in the random design setting:
Corollary 3.2.
Let $\mathcal F$ be convex and uniformly bounded by $1$. For any $f^* \in \mathcal F$ such that

\[
\underbrace{\mathbb{E}_{x,\xi} \int (\hat f_n - f^*)^2\, dP}_{=:\, \mathcal E(f^*)} \;\ll\; \mathcal W^2(\mathcal F),
\]

there must exist some $t(f^*) \le C\,\mathcal E(f^*)^{1/2}$ such that $\mathcal W(B_P(f^*, t(f^*))) = \Theta(\mathcal W(\mathcal F))$, where $C \in (1,\infty)$ is some absolute constant.

Informally speaking, if ERM learns some $f^* \in \mathcal F$ at a rate faster than $\mathcal W^2(\mathcal F)$, then the local complexity of a population ball centered at $f^*$ with a very small radius must be as rich as the entire complexity of $\mathcal F$. A more prescriptive recipe for guaranteeing such fast rates is an interesting direction of further work.

The lower bounds stated thus far assumed little about the geometry of the class $\mathcal F$ beyond convexity and global and local Gaussian averages. Under additional assumptions on the behavior of the entropy $\log N(\epsilon, \mathcal F, P)$ (defined as the logarithm of the smallest number of balls with respect to $L_2(P)$ of radius $\epsilon$ sufficient to cover $\mathcal F$) or the entropy with bracketing $\log N_{[\,]}(\epsilon, \mathcal F, P)$ (defined as the logarithm of the smallest number of brackets $l_i, u_i \in \mathcal F$ such that $l_i \le u_i$, $\|l_i - u_i\|_P \le \epsilon$, and $\mathcal F$ is contained in the union of the brackets), we can provide specific upper bounds on the Gaussian averages via chaining and other techniques. In particular, we say that a convex uniformly bounded $\mathcal F$ is $P$-Donsker if $\log N_{[\,]}(\epsilon, \mathcal F, P) \sim \epsilon^{-\alpha}$ with $\alpha \in (0,2)$, or if $\mathcal F$ is parametric with $\log N_{[\,]}(\epsilon, \mathcal F, P) \sim v \log(1/\epsilon)$ for some 'dimension' $v$. In the seminal work of Birgé and Massart (1993) it was shown that for any $P$-Donsker class, the ERM is minimax optimal, i.e.

\[
R(\hat f_n, \mathcal F, P) \;\sim\; \inf_{\bar f_n} R(\bar f_n, \mathcal F, P) \;\sim\; n^{-2/(2+\alpha)}.
\]

Note that for $\alpha \in (0,2)$ we have $\mathcal W(\mathcal F) \sim n^{-1/2}$. The next result shows that without further assumptions we cannot learn any function in a convex uniformly bounded $P$-Donsker class faster than the parametric rate. We assume that $\mathcal F$ is non-degenerate and contains at least two functions $f_1, f_2$ such that $\|f_1 - f_2\|_P \ge 1/2$.

Corollary 3.3.
Let $\mathcal F$ be a convex uniformly bounded $P$-Donsker class, and let $X_1,\dots,X_n \sim P$ i.i.d. Then

\[
n^{-1} \;\lesssim\; \inf_{f^* \in \mathcal F}\, \mathbb{E} \int (\hat f_n - f^*)^2\, dP_n
\qquad\text{and}\qquad
n^{-1} \;\lesssim\; \inf_{f^* \in \mathcal F}\, \mathbb{E} \int (\hat f_n - f^*)^2\, dP.
\]

This lower bound is sharp: there are classical $P$-Donsker classes, such as the convex regression example mentioned in the introduction and Section 2, where ERM can attain a parametric rate (up to logarithmic factors) when optimizing over all convex Lipschitz functions, but only for $d \le 4$, which puts us in the Donsker regime.

For non-Donsker classes, i.e. when $\alpha > 2$, the ERM procedure may not be optimal. One can show that

\[
n^{-2/(2+\alpha)} \;\lesssim\; R(\hat f_n, \mathcal F, P) \;\lesssim\; n^{-1/\alpha},
\]

and both of these bounds can be tight, up to logarithmic factors. Furthermore, one can show that $\mathcal W(\mathcal F) \asymp n^{-1/\alpha}$, again up to logarithmic factors. Our next corollary shows that in this regime, the fixed-design error is at least of the order $\mathcal W(\mathcal F)$, i.e. it is impossible to learn at a parametric rate in the non-Donsker regime.

Corollary 3.4.
Let $\mathcal F$ be a convex uniformly bounded non-$P$-Donsker class, and let $X_1,\dots,X_n \sim P$ i.i.d. Then the following holds:

\[
n^{-1/\alpha} \;\lesssim\; \mathcal W(\mathcal F) \;\lesssim\; \inf_{f^* \in \mathcal F}\, \mathbb{E} \int (\hat f_n - f^*)^2\, dP_n.
\]

The proofs of these two corollaries appear in the appendix.
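The Donsker-regime behavior $\mathcal W(\mathcal F) \sim n^{-1/2}$ quoted above can be seen from Dudley's entropy-integral bound; the following is a standard chaining computation (with absolute constants suppressed), not an argument specific to this paper:

```latex
\mathcal{W}(\mathcal{F})
\;\lesssim\; \frac{1}{\sqrt{n}} \int_0^1 \sqrt{\log N_{[\,]}(\epsilon, \mathcal{F}, P)}\; d\epsilon
\;\sim\; \frac{1}{\sqrt{n}} \int_0^1 \epsilon^{-\alpha/2}\, d\epsilon
\;=\; \frac{1}{(1-\alpha/2)\sqrt{n}}, \qquad \alpha \in (0,2).
% The entropy integral converges precisely when \alpha < 2. For \alpha > 2 it
% diverges, and chaining truncated at scale \delta gives
%   \mathcal{W}(\mathcal{F}) \lesssim \delta + n^{-1/2}\,\delta^{1-\alpha/2},
% optimized at \delta = n^{-1/\alpha}, i.e.
%   \mathcal{W}(\mathcal{F}) \asymp n^{-1/\alpha}  (up to logarithmic factors),
% matching the non-Donsker rate quoted above.
```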
Remark 2.
Due to the geometry of general non-Donsker classes, the same lower bound may not hold in the random design case. However, in all the examples in the literature (Han and Wellner, 2016; Feng et al., 2018; Kim et al., 2018; Han et al., 2019; Kur et al., 2020b) that study the adaptivity of ERM in non-Donsker families (such as convex functions when $d \ge 5$, or isotonic functions when $d \ge 3$), the term $t_{n,P}(f^*, \mathcal F)$ of Theorem 3.1 is significantly larger than $\mathcal W(\mathcal F)$. As a consequence, one may use Theorem 3.1 to show that the bound in Eq. (2) is tight up to logarithmic factors.

In this section, we state the general lower bound for fixed design. In comparison to its consequence, Corollary 3.1, the version below captures the complexity of local neighborhoods around regression functions that are close to $f^*$. Note that this lemma holds for any convex family (and not necessarily a uniformly bounded one).

Lemma 3.1.
Let $\mathcal F$ be a convex family of functions, let $x_1,\dots,x_n \in \mathcal X$ be some $n$ points, and let $P_n := n^{-1}\sum_{i=1}^n \delta_{x_i}$. For all $f^* \in \mathcal F$ define

\[
r(f^*) := \operatorname{argmax}_{r \ge 0}\; \widehat{\mathcal W}(B_n(f^*, r)) - r^2/2 \tag{4}
\]

and

\[
L_x(f^*) := \max_{g \in B_n(f^*, 1),\, t \ge 0} \left( \frac{\big(\widehat{\mathcal W}(B_n(g, t)) - \widehat{\mathcal W}(B_n(f^*, r(f^*))) - Cn^{-1}\big)_+}{\|g - f^*\|_{P_n} + t} \right)^2,
\]

where $C \in (1,\infty)$ is some absolute constant and $(a)_+ := \max\{a, 0\}$. Then the following lower bound holds:

\[
\mathbb{E}_\xi \int (\hat f_n - f^*)^2\, dP_n \;\ge\; \max\Big\{ 4^{-1}\big(\widehat{\mathcal W}(B_n(f^*, 1)) - Cn^{-1/2}\big)_+^2,\; L_x(f^*) \Big\}.
\]

Convexity of $\mathcal F$ and the uniform boundedness assumption of Corollary 3.1 imply that $\widehat{\mathcal W}(\mathcal F) \le \widehat{\mathcal W}(B_n(f^*, 2)) \le 2\,\widehat{\mathcal W}(B_n(f^*, 1))$, so Corollary 3.1 indeed follows from this lemma.

Remark 3.
The second term in our lower bound may be significantly larger than $\widehat{\mathcal W}^2(\mathcal F)$. For example, the second term may be of order $\widehat{\mathcal W}(\mathcal F)$ in several non-Donsker families that appear in (Birgé and Massart, 1993; Kur et al., 2020a; Birgé, 2006). We also remark that this constant is tight (up to $o_n(1)$).

The rest of this paper is devoted to proofs. While the fixed design lower bound follows a rather simple argument, the corresponding lower bound in the random design case is more subtle. In particular, we employ a particular version of Talagrand's inequality that, in our regime, provides control of certain empirical processes, where the more commonly used versions (including Bousquet's inequality) result in vacuous estimates.
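Before turning to the proofs, the localization radius $r(f^*)$ of Eq. (4) can be visualized numerically: for a finite toy class, one can Monte Carlo the map $r \mapsto \widehat{\mathcal W}(B_n(f^*, r)) - r^2/2$ on a grid and take the maximizer. The class, grid, and Monte Carlo sizes below are illustrative assumptions.

```python
import numpy as np

# Toy illustration of Eq. (4): r(f*) = argmax_{r >= 0} W_hat(B_n(f*, r)) - r^2/2,
# approximated on a grid for a finite class on n fixed design points.
rng_seed = 3
n = 100
x = np.linspace(0.0, 1.0, n)
f_star = np.zeros(n)

shapes = np.stack([x, 1.0 - x, np.sin(2 * np.pi * x)])
F = np.concatenate([s * shapes for s in np.linspace(-1.0, 1.0, 21)])

def w_hat_ball(r, n_mc=2000):
    """Monte Carlo estimate of W_hat(B_n(f*, r)) = E sup_{||f - f*||_n <= r} <xi, f - f*>_n."""
    diffs = F - f_star
    ball = diffs[np.sqrt(np.mean(diffs ** 2, axis=1)) <= r]
    if len(ball) == 0:
        return 0.0
    xi = np.random.default_rng(rng_seed).normal(size=(n_mc, n))
    return float(np.mean(np.max(xi @ ball.T / n, axis=1)))

grid = np.linspace(0.0, 2.0, 41)
objective = np.array([w_hat_ball(r) - r ** 2 / 2 for r in grid])
r_hat = float(grid[np.argmax(objective)])
```

Since $f^*$ itself lies in every ball, the objective at $r = 0$ is exactly zero, so the maximizer always achieves a non-negative value, mirroring the fact that $\widehat{\mathcal W}(B_n(f^*, r(f^*))) \ge r(f^*)^2/2$ used in the proofs below.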
Notation
Throughout this section, $c, c_1, c_2 \in (0,1)$ and $C, C_1, C_2 \in (1,\infty)$ are absolute constants that may change from line to line. Also $S_1, s_1, S_2, s_2$ are absolute constants, but we use this notation to emphasize that we have some freedom to control their size. We also use the notation $C(c_1, C_1)$ to mean that the constant depends on $c_1, C_1$.

To recap, we assume that $\mathcal F$ is a convex family of functions, $Y_i = f^*(x_i) + \xi_i$, where $\xi_i \sim N(0,1)$ i.i.d., $x_1,\dots,x_n \in \mathcal X$, and $f^* \in \mathcal F$. We write $\|f\|_n = \|f\|_{P_n}$ and $\langle f, g\rangle_n = \int fg\, dP_n$. With slight abuse of notation, we write $\langle \xi, f\rangle_n = n^{-1}\sum_{i=1}^n \xi_i f(x_i)$ for $\xi := (\xi_1,\dots,\xi_n)$. We also abbreviate $B_n(f^*, t) := B_{P_n}(f^*, t)$, the $L_2(P_n)$ ball with respect to the empirical measure $P_n$. Recall the definition of $r(f^*)$ in (4). The following lemma was proven in (Chatterjee, 2014):

Lemma 4.1. [(Chatterjee, 2014, Thm 1.1)] The following holds under the above assumptions:

\[
\Pr\Big( \big|\, \|\hat f_n - f^*\|_n - r(f^*) \,\big| \ge t \Big) \le
\begin{cases}
3\exp(-nt^2) & t \ge r(f^*), \\
3\exp(-nt\, r(f^*)) & 0 \le t \le r(f^*).
\end{cases} \tag{5}
\]

Moreover, for each $t \ge 0$ the following holds:

\[
\Pr\Big( \big| \langle \hat f_n - f^*, \xi\rangle_n - \widehat{\mathcal W}(B_n(f^*, r(f^*))) \big| \ge t \cdot r(f^*) \Big) \le
\begin{cases}
3\exp(-nt^2) & t \ge r(f^*), \\
3\exp(-nt\, r(f^*)) & 0 \le t \le r(f^*).
\end{cases} \tag{6}
\]

Also, we state a simple corollary that follows from this lemma (cf. (Boucheron et al., 2013), (Chatterjee, 2014, Thm 1.2)).

Corollary 4.1.
The following two bounds hold:

\[
\mathbb{E}\big| \langle \hat f_n - f^*, \xi\rangle_n - \widehat{\mathcal W}(B_n(f^*, r(f^*))) \big| \le C \max\big\{ r(f^*)^{1/2} n^{-1/2},\; n^{-1} \big\},
\]

and

\[
\mathbb{E}\big|\, \|\hat f_n - f^*\|_n - r(f^*) \,\big| \le C \max\big\{ r(f^*)^{1/2} n^{-1/2},\; n^{-1} \big\}.
\]

Proof of Lemma 3.1.
For brevity, denote $\hat r := r(f^*)$, where $r(f^*)$ is defined in Eq. (4). Define $g_\xi := \operatorname{argmax}_{h \in B_n(g, t)} \langle h - g, \xi\rangle_n$. Optimality of $\hat f_n$ and convexity of $\mathcal F$ imply that $\langle \nabla_f \|f - Y\|_n^2 \big|_{f = \hat f_n},\, g - \hat f_n\rangle_n \ge 0$ for any $g \in \mathcal F$. In particular, for $g = g_\xi$ this implies

\[
0 \ge \mathbb{E}\langle \xi + f^* - \hat f_n,\; g_\xi - \hat f_n\rangle_n,
\]

where the expectation is over $\xi$, conditionally on $x_1,\dots,x_n$. For any $g \in \mathcal F$, we may write the right-hand side as

\[
\mathbb{E}\langle \xi + f^* - \hat f_n,\; g_\xi - g + g - f^* + f^* - \hat f_n\rangle_n
= \widehat{\mathcal W}(B_n(g, t)) - \mathbb{E}\big[\langle \xi, \hat f_n - f^*\rangle_n + \langle \hat f_n - f^*, g_\xi - g\rangle_n + \langle \hat f_n - f^*, g - f^*\rangle_n\big] + \mathbb{E}\|\hat f_n - f^*\|_n^2,
\]

where we used the definition of $g_\xi$ and the fact that $\mathbb{E}\langle \xi, g - f^*\rangle_n = 0$. Using Corollary 4.1, we obtain a further lower bound of

\[
\widehat{\mathcal W}(B_n(g, t)) - \widehat{\mathcal W}(B_n(f^*, \hat r)) - \mathbb{E}\big[\langle \hat f_n - f^*, g_\xi - g\rangle_n + \langle \hat f_n - f^*, g - f^*\rangle_n\big] + \hat r^2 - C\hat r^{3/2} n^{-1/2} - Cn^{-1}
\]
\[
\ge\; \widehat{\mathcal W}(B_n(g, t)) - \widehat{\mathcal W}(B_n(f^*, \hat r)) - \mathbb{E}\big[\langle \hat f_n - f^*, g_\xi - g\rangle_n + \langle \hat f_n - f^*, g - f^*\rangle_n\big] + \hat r^2/2 - C_1 n^{-1}. \tag{7}
\]

To verify the last inequality, observe that $\hat r^2/2 \ge C\hat r^{3/2} n^{-1/2}$ when $\hat r \ge C_1 n^{-1}$ for $C_1$ that is large enough; on the other hand, if $\hat r \le C_1 n^{-1}$, the $Cn^{-1}$ term is dominant for $C$ large enough. Since $\langle \hat f_n - f^*, g - f^*\rangle_n \le \|\hat f_n - f^*\|_n \|g - f^*\|_n$ and $\langle \hat f_n - f^*, g_\xi - g\rangle_n \le t\, \|\hat f_n - f^*\|_n$, we conclude that

\[
0 \;\ge\; \widehat{\mathcal W}(B_n(g, t)) - \widehat{\mathcal W}(B_n(f^*, \hat r)) - \mathbb{E}\big[\|\hat f_n - f^*\|_{P_n}\big]\big(\|g - f^*\|_{P_n} + t\big) - Cn^{-1}. \tag{8}
\]

By re-arranging the terms and using Jensen's inequality, we have

\[
\mathbb{E}\|\hat f_n - f^*\|_n^2 \;\ge\; \big(\mathbb{E}\|\hat f_n - f^*\|_n\big)^2 \;\ge\; \left( \frac{\big(\widehat{\mathcal W}(B_n(g, t)) - \widehat{\mathcal W}(B_n(f^*, \hat r)) - Cn^{-1}\big)_+}{t + \|f^* - g\|_n} \right)^2,
\]

where $(a)_+ = \max\{a, 0\}$. Since $L_x(f^*)$ in the statement of the Lemma is non-negative, the lower bound of $L_x(f^*)$ follows. Now, for the first part of the lower bound, we have to consider two cases.
The first case is when $\widehat{\mathcal W}(B_n(f^*, \hat r)) \ge 2^{-1}\widehat{\mathcal W}(B_n(f^*, 1))$, and then we have

\[
\mathbb{E}\|\hat f_n - f^*\|_n^2 = \mathbb{E}\|\xi\|_n^2 \cdot \mathbb{E}\|\hat f_n - f^*\|_n^2 \ge \big(\mathbb{E}\langle \xi, \hat f_n - f^*\rangle_n\big)^2 \ge \big(2^{-1}\widehat{\mathcal W}(B_n(f^*, 1)) - C\hat r^{1/2} n^{-1/2}\big)_+^2 \ge \big(2^{-1}\widehat{\mathcal W}(B_n(f^*, 1)) - C_1 n^{-1/2}\big)_+^2,
\]

where we used the Cauchy–Schwarz inequality (together with $\mathbb{E}\|\xi\|_n^2 = 1$) and Corollary 4.1. In the other case, we use Eq. (8) with $g = f^*$ and $t = 1$:

\[
\mathbb{E}\|\hat f_n - f^*\|_n^2 \ge \big(\widehat{\mathcal W}(B_n(f^*, 1)) - \widehat{\mathcal W}(B_n(f^*, \hat r)) - Cn^{-1}\big)_+^2 \ge \big(2^{-1}\widehat{\mathcal W}(B_n(f^*, 1)) - Cn^{-1}\big)_+^2,
\]

concluding the proof.

Throughout the proof of Theorem 3.1, $P_n$ denotes the random empirical measure of $X_1,\dots,X_n$. Denote $\hat r := \operatorname{argmax}_{r \ge 0}\, \widehat{\mathcal W}(B_n(f^*, r)) - r^2/2$, with the hat emphasizing the dependence on $x^n = (X_1,\dots,X_n)$. We adopt the notation $\|\cdot\|_n$, $\langle\cdot,\cdot\rangle_n$, $B_n$ of the previous section for the norm and the inner product with respect to $P_n$, and the $L_2(P_n)$ ball. Recall that we assumed that $\mathcal F$ is not degenerate: $\mathcal W(\mathcal F) \ge c/\sqrt{n}$ for some $c \in (0,1)$; see the proof of Lemma A.3 for further details.

Proof of Theorem 3.1. Denote

\[
t^* := \min\big\{ t_{n,P}(f^*, \mathcal F),\; s_1\sqrt{\mathcal W(\mathcal F)},\; s_2\, \mathcal W(\mathcal F) \big\}, \tag{9}
\]

where $s_1, s_2 \in (0,1)$ are small enough absolute constants that will be defined in the proof, and $t_{n,P}(f^*, \mathcal F)$ is defined in Eq. (3). Denote by $\mathcal M$ a maximal $6\sqrt{\mathcal W(\mathcal F)}$-separated set of $\mathcal F$ with respect to $L_2(P)$, and let

\[
M = M\big(6\sqrt{\mathcal W(\mathcal F)},\, \mathcal F,\, P\big) \tag{10}
\]

denote its size. For a constant $K \in (1, \infty)$, let $E_0$ denote the high-probability event that is defined by the intersection of the events of Lemma A.4 and Lemma A.2:

\[
E_0 := \Big\{ x^n : \sup_{f, g \in \mathcal F} \big| \|f - g\|_n^2 - \|f - g\|_P^2 \big| \le \mathcal W(\mathcal F),\;\;
\sup_{h \in B_P(f^*, t^*),\, g \in \mathcal M} \big| \langle g - f^*, h - f^*\rangle_n - \mathbb{E}[(g - f^*)(h - f^*)] \big| \le (8K)^{-1}\mathcal W(\mathcal F) \Big\}. \tag{11}
\]

Further, define the events

\[
E_1 = \big\{ x^n : K^{-1}\mathcal W(\mathcal F) \le \widehat{\mathcal W}(\mathcal F) \le K\mathcal W(\mathcal F) + Cn^{-1/2} \big\}, \qquad
E_2 = \big\{ x^n : \widehat{\mathcal W}(B_P(f^*, t^*)) \le K\, \mathcal W(B_P(f^*, t^*)) + C\, \mathcal W(\mathcal F)^{1/2} n^{-1/2} \big\},
\]
\[
E = E_0 \cap E_1 \cap E_2. \tag{12}
\]

Lemma A.3, proved in the appendix, shows that the event $E$ holds with probability at least $0.99$.
Note that under the event $E$, $\mathcal M$ is also a $2\sqrt{\mathcal W(\mathcal F)}$-separated set with respect to the random empirical measure $P_n$. Hence, we may apply Sudakov's minoration (Lemma A.5) with $\epsilon = 2\sqrt{\mathcal W(\mathcal F)}$ and the empirical measure $P_n$ defined on any $x^n \in E$:

\[
c\sqrt{\mathcal W(\mathcal F) \cdot \frac{\log M}{n}} \;\le\; \widehat{\mathcal W}(\mathcal F) \;\le\; K\mathcal W(\mathcal F) + Cn^{-1/2} \;\le\; C_1 K\, \mathcal W(\mathcal F), \tag{13}
\]

where in the last inequality we used the assumption that $\mathcal W(\mathcal F) \ge c\, n^{-1/2}$, and $C_1 \ge 1$ is defined to be large enough to satisfy the last inequality. Hence, the last equation implies that

\[
M \le \exp\big(C_2 K^2 n\, \mathcal W(\mathcal F)\big). \tag{14}
\]

First, recall the definition of $t_{n,P}(f^*, \mathcal F)$, where in Lemma A.2 (which appears in the supplementary material) we set $l_\xi = (512K^3)^{-1}$. Recall Eq. (9), where in Lemma A.2 we set $s_1 = c(K, K_1, C_1)$, and the three constants $K, K_1, C_1$ follow from Sudakov's minoration lemma, Talagrand's inequality, and Adamczak's bound. We define $s_2 := (16K)^{-1}$. Define the event

\[
A = \big\{ (\xi, x^n) : \hat f_n \in B_P(f^*, t^*) \big\} \tag{15}
\]

and, for any $x^n$, define the conditional event

\[
A(x^n) = \big\{ \xi : \hat f_n \in B_P(f^*, t^*) \big\}. \tag{16}
\]

Assume by way of contradiction that $\Pr_{x,\xi}(A) > 0.95$. Then, using the averaging principle (Fubini) and the fact that $\Pr(E) \ge 0.99$, we can find an event $E' \subseteq E$ that has probability at least $0.9$ (when $n$ is large enough) such that $\forall x^n \in E'$, $\Pr_\xi(A(x^n)) \ge 0.9$. We will show that for every $x^n \in E'$,

\[
K\,\mathcal W(B_P(f^*, t^*)) + l_\xi\, \mathcal W(\mathcal F) \;\ge\; \widehat{\mathcal W}(B_n(f^*, \hat r)). \tag{17}
\]

First, recall that $t^* \le s_1\sqrt{\mathcal W(\mathcal F)}$, and therefore under the event $E$ we have

\[
\sup_{h \in B_P(f^*, t^*)} \|h - f^*\|_n \le 2\sqrt{\mathcal W(\mathcal F)}. \tag{18}
\]

Now, for each $x^n \in E' \subseteq E$, the map $\xi \mapsto \sup_{h \in B_P(f^*, t^*)} \langle \xi, h - f^*\rangle_n$ is Lipschitz with constant at most $\sup_{h \in B_P(f^*, t^*)} n^{-1/2}\|h - f^*\|_n \le C\sqrt{\mathcal W(\mathcal F)}\, n^{-1/2}$ by (18), and thus by Lipschitz concentration (Lemma A.8), conditionally on $x^n$,

\[
\Pr_\xi\Big( \Big| \sup_{h \in B_P(f^*, t^*)} \langle \xi, h - f^*\rangle_n - \widehat{\mathcal W}(B_P(f^*, t^*)) \Big| \ge \epsilon \Big) \le 2\exp\big(-Cn\, \mathcal W(\mathcal F)^{-1}\epsilon^2\big)
\]

for some absolute constant $C$.
By setting $\epsilon = C(n^{-1}\mathcal W(\mathcal F))^{1/2}$ in the last equation, we may define the event

\[
A_1(x^n) = \Big\{ \xi : \Big| \sup_{h \in B_P(f^*, t^*)} \langle \xi, h - f^*\rangle_n - \widehat{\mathcal W}(B_P(f^*, t^*)) \Big| \le C\sqrt{\mathcal W(\mathcal F)}\, n^{-1/2} \Big\} \cap A(x^n),
\]

which holds with probability at least $0.85$ (over $\xi$) for any $x^n \in E'$. Before defining the next event, observe that

\[
\hat r = \operatorname{argmax}_{r \ge 0}\; \widehat{\mathcal W}(B_n(f^*, r)) - r^2/2 \;\le\; 2\sqrt{\widehat{\mathcal W}(\mathcal F)},
\]

according to Lemma 4.1 and the fact that for $r_n := 2\sqrt{\widehat{\mathcal W}(\mathcal F)}$ we have $\widehat{\mathcal W}(B_n(f^*, r_n)) - r_n^2/2 \le 0$. As we already argued in (13), for any $x^n \in E$ we have that $\widehat{\mathcal W}(\mathcal F) \le C_1 K\, \mathcal W(\mathcal F)$ for some absolute constant $C_1$, and thus

\[
\forall x^n \in E', \qquad \hat r \le C\sqrt{K\, \mathcal W(\mathcal F)}. \tag{19}
\]

Now, from Eq. (6) in Lemma 4.1, for $C$ large enough, the event

\[
\big\{ \xi : |\langle \xi, \hat f_n - f^*\rangle_n - \widehat{\mathcal W}(B_n(f^*, \hat r))| \le C\big(n^{-1/2}\hat r^{1/2} + n^{-1}\big) \big\}
\]

holds with probability at least $0.99$, and thus, in view of (19), for all $x^n \in E'$, the event

\[
A_2(x^n) = \big\{ \xi : |\langle \xi, \hat f_n - f^*\rangle_n - \widehat{\mathcal W}(B_n(f^*, \hat r))| \le C(K\mathcal W(\mathcal F))^{1/4} n^{-1/2} \big\} \cap A_1(x^n)
\]

holds with probability at least $0.8$ over $\xi$. We are now ready to prove Eq. (17), using the fact that $A_2(x^n)$ is not empty for each $x^n \in E'$. To this end, fix $x^n \in E'$ and $\xi \in A_2(x^n)$. First, by the definition of $E_2$, we have

\[
K\, \mathcal W(B_P(f^*, t^*)) \;\ge\; \widehat{\mathcal W}(B_P(f^*, t^*)) - C\sqrt{\mathcal W(\mathcal F)}\, n^{-1/2},
\]

which can be further lower bounded, by the definition of $A_1(x^n)$, by

\[
\sup_{h \in B_P(f^*, t^*)} \langle \xi, h - f^*\rangle_n - C\sqrt{\mathcal W(\mathcal F)}\, n^{-1/2}.
\]

Since $\xi \in A(x^n) \supseteq A_2(x^n)$, the above expression is further lower bounded by $\langle \xi, \hat f_n - f^*\rangle_n - C\sqrt{\mathcal W(\mathcal F)}\, n^{-1/2}$, which, under the assumption of $\xi \in A_2(x^n)$, is lower bounded by

\[
\widehat{\mathcal W}(B_n(f^*, \hat r)) - C\sqrt{\mathcal W(\mathcal F)}\, n^{-1/2} - C(K\mathcal W(\mathcal F))^{1/4} n^{-1/2}.
\]

When $n$ is large enough, the above estimate is lower bounded by $\widehat{\mathcal W}(B_n(f^*, \hat r)) - l_\xi\, \mathcal W(\mathcal F)$. To see this, observe that under the assumption of $\mathcal W(\mathcal F) \ge c/\sqrt{n}$, both $\sqrt{\mathcal W(\mathcal F)}\, n^{-1/2} = o_n(\mathcal W(\mathcal F))$ and $\mathcal W(\mathcal F)^{1/4} n^{-1/2} = o_n(\mathcal W(\mathcal F))$. Therefore, we proved that Eq.
(17) holds, namely that $K\,\mathcal W(B_P(f^*, t^*)) + l_\xi\, \mathcal W(\mathcal F) \ge \widehat{\mathcal W}(B_n(f^*, \hat r))$ for all $x^n \in E'$. Using the definition of $t^* \le t_{n,P}(f^*, \mathcal F)$ and of $l_\xi = (512K^3)^{-1}$, we have $\mathcal W(B_P(f^*, t^*)) \le l_\xi\, \mathcal W(\mathcal F) \le 2^{-9}K^{-3}\mathcal W(\mathcal F)$, and thus for any $x^n \in E'$,

\[
2^{-8}K^{-2}\mathcal W(\mathcal F) \;\ge\; \widehat{\mathcal W}(B_n(f^*, \hat r)). \tag{20}
\]

By Lemma A.1 and (20), for any $x^n \in E'$,

\[
0 \;\ge\; \big(2^{-1}K^{-1} - 2^{-8}K^{-2}\big)\mathcal W(\mathcal F) - \mathbb{E}_\xi \max_{g \in \mathcal M} \langle g - f^*, \hat f_n - f^*\rangle_n - \sqrt{\mathcal W(\mathcal F)}\, \hat r,
\]

and since $x^n \in E' \subseteq E_0$, we also have

\[
0 \;\ge\; \big(2^{-1}K^{-1} - 2^{-8}K^{-2} - 2^{-3}K^{-1}\big)\mathcal W(\mathcal F) - \sup_{h \in B_P(f^*, t^*),\, g \in \mathcal M} \int (g - f^*)(h - f^*)\, dP - \sqrt{\mathcal W(\mathcal F)}\, \hat r
\;\ge\; (4K)^{-1}\mathcal W(\mathcal F) - 2t^* - \sqrt{\mathcal W(\mathcal F)}\, \hat r, \tag{21}
\]

where we used the Cauchy–Schwarz inequality, the fact that $\mathcal F \subset [-1,1]^{\mathcal X}$, and the definition of $l_\xi$.

If $\sqrt{\mathcal W(\mathcal F)}\, \hat r < (8K)^{-1}\mathcal W(\mathcal F)$, then the last equation implies that $s_2\mathcal W(\mathcal F) = (16K)^{-1}\mathcal W(\mathcal F) < t^*$. However, this inequality contradicts the definition of $t^*$, and thus cannot hold for any $x^n \in E'$. In the other case, we assume that $\sqrt{\mathcal W(\mathcal F)}\, \hat r \ge (8K)^{-1}\mathcal W(\mathcal F)$, or equivalently, $\hat r \ge (8K)^{-1}\sqrt{\mathcal W(\mathcal F)}$. Now, from Lemma 4.1 one can see that the maximizing value $\hat r$ ensures $\widehat{\mathcal W}(B_n(f^*, \hat r)) - 2^{-1}\hat r^2 \ge 0$, and hence

\[
\widehat{\mathcal W}(B_n(f^*, \hat r)) \;\ge\; 2^{-1}\hat r^2 \;\ge\; 2^{-7}K^{-2}\mathcal W(\mathcal F).
\]

Therefore, under the event $E'$ and by Eq. (17),

\[
2Kl_\xi\, \mathcal W(\mathcal F) \;\ge\; K\,\mathcal W(B_P(f^*, t^*)) + l_\xi\, \mathcal W(\mathcal F) \;\ge\; \widehat{\mathcal W}(B_n(f^*, \hat r)) \;\ge\; 2^{-7}K^{-2}\mathcal W(\mathcal F).
\]

Once again, we have a contradiction for any $x^n \in E'$, since $2Kl_\xi = 2^{-8}K^{-2} < 2^{-7}K^{-2}$.

Therefore, we showed that Eq. (21) cannot hold under the event $E'$, i.e. the set $E'$ is empty. This contradicts our earlier conclusion that $\Pr(E') \ge 0.9$, which was made under the assumption that the event $A$ has probability at least $0.95$. Hence, we conclude that $\Pr(A) \le 0.95$, or, equivalently, with probability at least $0.05$, $\hat f_n \notin B_P(f^*, t^*)$. Therefore, we must have that

\[
\mathbb{E} \int (\hat f_n - f^*)^2\, dP \;\ge\; 0.05\, (t^*)^2 \;=\; 0.05\, \min\big\{ t_{n,P}(f^*, \mathcal F),\; s_1\sqrt{\mathcal W(\mathcal F)},\; s_2\mathcal W(\mathcal F) \big\}^2 \;\ge\; c \cdot \min\big\{ t^2_{n,P}(f^*, \mathcal F),\; \mathcal W^2(\mathcal F) \big\},
\]
where in the last inequality we used the fact that $W(\mathcal F) \le W([-1,1]^X) \le \mathbb E|\xi_1| \le \sqrt{\mathbb E\,\xi_1^2} = 1$. The theorem follows.
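As a side note, the Gaussian complexity $W(\mathcal F)$ that drives these bounds can be estimated by Monte Carlo. The sketch below is our illustration, not part of the paper: it estimates $\mathbb E\sup_{f\in\mathcal F}|n^{-1}\sum_i \xi_i f(X_i)|$ for a toy finite class of functions bounded by one, and checks the crude bound $W(\mathcal F)\le \mathbb E|\xi_1|\le 1$ used in the last step, together with the decay of the width as $n$ grows.

```python
import numpy as np

# Monte Carlo sketch (ours, not from the paper) of the empirical Gaussian
# complexity W(F) = E sup_{f in F} |n^{-1} sum_i xi_i f(X_i)| for a toy
# finite class F of functions bounded by 1 on [-1, 1].
rng = np.random.default_rng(0)
F = [np.sin, np.cos, np.sign, lambda x: x]  # toy class, |f| <= 1 on [-1, 1]

def gaussian_width(n, trials=2000):
    X = rng.uniform(-1.0, 1.0, size=(trials, n))   # design points
    xi = rng.standard_normal((trials, n))          # Gaussian noise
    # For each trial, sup over the finite class of |n^{-1} sum xi_i f(X_i)|.
    vals = np.stack([np.abs((xi * f(X)).mean(axis=1)) for f in F])
    return vals.max(axis=0).mean()

w100, w400 = gaussian_width(100), gaussian_width(400)
print(f"W_100 ~= {w100:.4f}, W_400 ~= {w400:.4f}")
assert w100 <= 1.0   # crude bound W(F) <= E|xi_1| <= 1, as in the proof above
assert w400 < w100   # for a finite class the width decays with n
```

For a finite class the width decays at the parametric $n^{-1/2}$ rate, which is why the two estimates differ by roughly a factor of two.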
Acknowledgements
We acknowledge support from the NSF through award DMS-2031883 and from the Simons Foundation through Award 814639 for the Collaboration on the Theoretical Foundations of Deep Learning. We further acknowledge support from the NSF through grant DMS-1953181 and the ONR through grants N00014-20-1-2336 and N00014-20-1-2394.
References
Radoslaw Adamczak. A tail inequality for suprema of unbounded empirical processes with applications to Markov chains. Electronic Journal of Probability, 13:1000–1034, 2008.
Peter L. Bartlett, Philip M. Long, Gábor Lugosi, and Alexander Tsigler. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117(48):30063–30070, 2020.
Mikhail Belkin, Daniel Hsu, and Partha Mitra. Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate. arXiv preprint arXiv:1806.05161, 2018.
Mikhail Belkin, Alexander Rakhlin, and Alexandre B. Tsybakov. Does data interpolation contradict statistical optimality? In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1611–1619. PMLR, 2019.
Pierre C. Bellec. Optimistic lower bounds for convex regularized least-squares. arXiv preprint arXiv:1703.01332, 2017.
Lucien Birgé. Model selection via testing: an alternative to (penalized) maximum likelihood estimators. In Annales de l'IHP Probabilités et statistiques, volume 42, pages 273–325, 2006.
Lucien Birgé and Pascal Massart. Rates of convergence for minimum contrast estimators. Probability Theory and Related Fields, 97(1-2):113–150, 1993.
Lucien Birgé, Pascal Massart, et al. Minimum contrast estimators on sieves: exponential bounds and rates of convergence. Bernoulli, 4(3):329–375, 1998.
Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.
E. M. Bronshtein. ε-entropy of convex sets and functions. Siberian Mathematical Journal, 17(3):393–398, 1976.
Sourav Chatterjee. A new perspective on least squares under convex constraint. The Annals of Statistics, 42(6):2340–2381, 2014.
Richard M. Dudley. Uniform Central Limit Theorems. Number 63. Cambridge University Press, 1999.
Oliver Y. Feng, Adityanand Guntuboyina, Arlene K. H. Kim, and Richard J. Samworth. Adaptation in multivariate log-concave density estimation. arXiv preprint arXiv:1812.11634, 2018.
Avishek Ghosh, Ashwin Pananjady, Adityanand Guntuboyina, and Kannan Ramchandran. Max-affine regression: Provable, tractable, and near-optimal statistical estimation. arXiv preprint arXiv:1906.09255, 2019.
Evarist Giné and Richard Nickl. Mathematical Foundations of Infinite-Dimensional Statistical Models. Number 40. Cambridge University Press, 2016.
Qiyang Han and Jon A. Wellner. Multivariate convex regression: global risk bounds and adaptation. arXiv preprint arXiv:1601.06844, 2016.
Qiyang Han, Tengyao Wang, Sabyasachi Chatterjee, Richard J. Samworth, et al. Isotonic regression in general dimensions. The Annals of Statistics, 47(5):2440–2471, 2019.
Arlene K. H. Kim, Adityanand Guntuboyina, Richard J. Samworth, et al. Adaptation in log-concave density estimation. The Annals of Statistics, 46(5):2279–2306, 2018.
Vladimir Koltchinskii. Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems: École d'Été de Probabilités de Saint-Flour XXXVIII-2008, volume 2033. Springer Science & Business Media, 2011.
Gil Kur, Fuchang Gao, Adityanand Guntuboyina, and Bodhisattva Sen. Convex regression in multidimensions: Suboptimality of least squares estimators. arXiv preprint arXiv:2006.02044, 2020a.
Gil Kur, Alexander Rakhlin, and Adityanand Guntuboyina. On suboptimality of least squares with application to estimation of convex bodies. arXiv preprint arXiv:2006.04046, 2020b.
Tengyuan Liang, Alexander Rakhlin, and Xiyu Zhai. On the multiple descent of minimum-norm interpolants and restricted lower isometry of kernels. In Conference on Learning Theory, pages 2683–2711. PMLR, 2020a.
Tengyuan Liang, Alexander Rakhlin, et al. Just interpolate: Kernel "ridgeless" regression can generalize. Annals of Statistics, 48(3):1329–1347, 2020b.
Shahar Mendelson. Learning without concentration. In Conference on Learning Theory, pages 25–39, 2014.
Gilles Pisier. Some applications of the metric entropy condition to harmonic analysis. In Banach Spaces, Harmonic Analysis, and Probability Theory, pages 123–154. Springer, 1983.
Alexander Rakhlin, Karthik Sridharan, Alexandre B. Tsybakov, et al. Empirical entropy, minimax regret and minimax risk. Bernoulli, 23(2):789–824, 2017.
Richard J. Samworth. Recent progress in log-concave density estimation. Statistical Science, 33(4):493–509, 2018.
Alexander Tsigler and Peter L. Bartlett. Benign overfitting in ridge regression. arXiv preprint arXiv:2009.14286, 2020.
Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer, 2003.
Sara A. van de Geer. Empirical Processes in M-estimation, volume 6. Cambridge University Press, 2000.
Abraham J. Wyner, Matthew Olson, Justin Bleich, and David Mease. Explaining the success of AdaBoost and random forests as interpolating classifiers. The Journal of Machine Learning Research, 18(1):1558–1590, 2017.
Lemmas
Lemma A.1.
Under the event $E$ in (12), and for $n$ that is large enough, the following holds:
$$\mathbb E_\xi\max_{g\in M}\langle g-f_0,\,\hat f_n-f_0\rangle_n \ \ge\ (2K)^{-1}W(\mathcal F) - \widehat W(B_n(f_0,\hat r)) - \sqrt{W(\mathcal F)}\,\hat r, \tag{22}$$
where $K$ is defined in Eq. (12), and the set $M$ is defined in Eq. (10).

Lemma A.2.
Let $X_1,\dots,X_n\sim$ i.i.d. $P$. Then the following holds with probability at least $1 - K_2\exp(-n\,W(\mathcal F))$:
$$\sup_{g\in M,\ h\in B_P(f_0,t^*)}\Big|\langle h-f_0,\,g-f_0\rangle_n - \int_X (h-f_0)(g-f_0)\,dP\Big| \ \le\ (8K)^{-1}W(\mathcal F),$$
where $M$ is defined in Eq. (10), $t^*$ is defined in Eq. (9), and $K, K_2$ are defined in Eq. (12) and Lemma A.6.

Lemma A.3.
The event $E$ defined in Eq. (12) holds with constant probability.

A.1 Auxiliary Lemmas
Lemma A.4. [(Koltchinskii, 2011, pgs. 25–26)] Let $\mathcal F\subseteq[-1,1]^X$ be a family of functions. Then, with probability at least $1 - 2\exp(-c\,n\,W(\mathcal F))$,
$$\forall f,g\in\mathcal F\qquad \big|\,\|f-g\|_n - \|f-g\|_P\,\big| \ \le\ W(\mathcal F), \qquad\text{and}\qquad \|\hat f_n - f_0\|_n \ \le\ W(\mathcal F).$$

Lemma A.5 (Sudakov's minoration lemma). Let $\mathcal H\subset[-1,1]^X$. There exists a constant $c$ such that for any $P_n$,
$$c\,\sup_{\epsilon\ge 0}\ \epsilon\,\sqrt{\frac{\log M(\epsilon,\mathcal H,P_n)}{n}} \ \le\ \widehat W(\mathcal H),$$
where $M(\epsilon,\mathcal H,P_n)$ denotes the size of the largest $\epsilon$-separated set in $\mathcal H$ with respect to $L_2(P_n)$.

The next two lemmas appear in (Koltchinskii, 2011, pgs. 24–25) and (Adamczak, 2008).

Lemma A.6 (Talagrand's inequality). Let $X_1,\dots,X_n\sim$ i.i.d. $P$, and let $\mathcal H\subseteq[-U,U]^X$ be a family of functions. Let $Z = \sup_{f\in\mathcal H}|n^{-1}\sum_{i=1}^n f(X_i) - \mathbb E f|$. Then there exists an absolute constant $K_2\ge 1$ such that for any $s\ge 0$,
$$\Pr\big(|Z - \mathbb E Z| \ge s\big) \ \le\ K_2\exp\Big(-K_2^{-1}\,U^{-1}\log\Big(1+\frac{sU}{V}\Big)\,ns\Big),$$
where $V = \sup_{f\in\mathcal H}\int f^2\, dP$.

Lemma A.7 (Adamczak's inequality). Let $\mathcal G$ be a centred family of functions supported on $D$, and let $Q$ be some distribution on $D$. Let $Z = \sup_{g\in\mathcal G}|n^{-1}\sum_{i=1}^n g(X_i)|$. Assume that there exists an envelope function $G$ such that $|g(x)|\le G(x)$ for all $g\in\mathcal G$, $x\in D$. Then the following holds for all $t\ge 1$:
$$K^{-1}\Big(\mathbb E Z - V\sqrt{\frac tn} - \frac{\big\|\max_{1\le i\le n} G(X_i)\big\|_{\psi_1}\, t}{n}\Big) \ \le\ Z \ \le\ K\Big(\mathbb E Z + V\sqrt{\frac tn} + \frac{\big\|\max_{1\le i\le n} G(X_i)\big\|_{\psi_1}\, t}{n}\Big),$$
where $V := \sup_{g\in\mathcal G}\int g^2\, dQ$, $K\in(1,\infty)$ is some universal constant, and $\|\cdot\|_{\psi_1}$ is the Orlicz norm.

Lemma A.8 (Lipschitz Concentration). Let $\xi_1,\dots,\xi_n\sim$ i.i.d. $N(0,1)$, and let $f:\mathbb R^n\to\mathbb R$ be an $L$-Lipschitz function with respect to $\|\cdot\|_2$. Then, for all $\epsilon>0$, $\Pr(|f - \mathbb E f|\ge\epsilon)\le 2\exp(-c\,\epsilon^2 L^{-2})$.

Proofs
Proof of Lemma A.1.
We invoke the lower bound of Eq. (7) with $g = f_0$ and $t = 2$, implying
$$0 \ \ge\ \widehat W(B_n(f_0,2)) - \widehat W(B_n(f_0,\hat r)) - \mathbb E\langle \hat f_n-f_0,\,g_\xi-f_0\rangle_n - Cn^{-1} \ =\ \widehat W(\mathcal F) - \widehat W(B_n(f_0,\hat r)) - \mathbb E\langle \hat f_n-f_0,\ g_\xi-\Pi(g_\xi)+\Pi(g_\xi)-f_0\rangle_n - Cn^{-1}, \tag{23}$$
where $\Pi(g_\xi) := \operatorname*{argmin}_{g\in M}\|g_\xi - g\|_{P_n}$, and the equality follows from the fact that for $\mathcal F\subseteq[-1,1]^X$ we have $B_n(f_0,2) = \mathcal F$.

Now, recall that $M$ is a maximal $\sqrt{W(\mathcal F)}$-separated set with respect to $L_2(P)$, and therefore also a $\sqrt{W(\mathcal F)}$-net with respect to $L_2(P)$. Therefore, under the event $E$ it is also a $\sqrt{W(\mathcal F)}$-net with respect to $L_2(P_n)$, and, in particular, $\|\Pi(g_\xi) - g_\xi\|_{P_n} \le \sqrt{W(\mathcal F)}$. Hence, we can rewrite (23) as
$$\mathbb E_\xi\max_{g\in M}\langle \hat f_n-f_0,\,g-f_0\rangle_n \ \ge\ \widehat W(\mathcal F) - \widehat W(B_n(f_0,\hat r)) - \mathbb E_\xi\langle \hat f_n-f_0,\,g_\xi-\Pi(g_\xi)\rangle_n - Cn^{-1} \ \ge\ K^{-1}W(\mathcal F) - \widehat W(B_n(f_0,\hat r)) - \sqrt{W(\mathcal F)}\,\mathbb E_\xi\|f_0-\hat f_n\|_{P_n} - Cn^{-1}.$$
Now, we proceed by using the first part of Corollary 4.1 and the assumption of lying in $E$. The last expression is lower-bounded by
$$K^{-1}W(\mathcal F) - \widehat W(B_n(f_0,\hat r)) - \sqrt{W(\mathcal F)}\,\big(\hat r + C\hat r^{1/2}n^{-1/2}\big). \tag{24}$$
According to (19), under the event $E$ we have $\hat r \le C_1\sqrt{K\,W(\mathcal F)}$ for some constant $C_1$. Thus the expression in Eq. (24) is further lower-bounded by
$$K^{-1}W(\mathcal F) - \widehat W(B_n(f_0,\hat r)) - \sqrt{W(\mathcal F)}\,\hat r - C(K\,W(\mathcal F))^{3/4}n^{-1/2} - Cn^{-1} \ \ge\ (2K)^{-1}W(\mathcal F) - \widehat W(B_n(f_0,\hat r)) - \sqrt{W(\mathcal F)}\,\hat r,$$
where the last inequality holds when $n$ is large enough. To see this, recall that $W(\mathcal F)\ge c/\sqrt n$; under this assumption both $n^{-1} = o_n(W(\mathcal F))$ and $W(\mathcal F)^{3/4}n^{-1/2} = o_n(W(\mathcal F))$ hold. Therefore, the lemma follows.

Proof of Lemma A.2.
First, denote $\|P_n - P\|_{\mathcal H} := \sup_{h\in\mathcal H}|n^{-1}\sum_{i=1}^n h(X_i) - \mathbb E h|$, and for each $g_i\in M$, define $\mathcal G_i = \{(h-f_0)(g_i-f_0) : h\in B_P(f_0,t^*)\}$. By Talagrand's inequality (Lemma A.6), the following holds for any $u\ge 0$:
$$\Pr\Big(\big|\|P_n-P\|_{\mathcal G_i} - \mathbb E\|P_n-P\|_{\mathcal G_i}\big|\ge u\Big) \ \le\ K_2\exp\Big(-nK_2^{-1}\log\Big(1+\frac{u}{4s^2 W(\mathcal F)}\Big)\,u\Big),$$
where we used the fact that $V \le \sup_{g}\|g-f_0\|_\infty^2\, t^{*2} \le 4s^2 W(\mathcal F)$. Now, we set $u = (16K)^{-1}W(\mathcal F)$ in the last equation:
$$\Pr\Big(\big|\|P_n-P\|_{\mathcal G_i} - \mathbb E\|P_n-P\|_{\mathcal G_i}\big|\ge (16K)^{-1}W(\mathcal F)\Big) \ \le\ K_2\exp\big(-n(16K\cdot K_2)^{-1}W(\mathcal F)\log\big(1+4^{-1}(16K)^{-1}s^{-2}\big)\big).$$
Next, we aim to take a union bound over $M$, and recall that $\log|M| \le CK_1\, nW(\mathcal F) \le C(K_1)\,nW(\mathcal F)$, for some absolute constant that does not depend on $s$. Therefore, we may choose
$$s := c(K, K_2, C), \tag{25}$$
where $c(K, K_2, C)$ is a constant that satisfies the following:
$$\Pr\Big(\big|\|P_n-P\|_{\mathcal G_i} - \mathbb E\|P_n-P\|_{\mathcal G_i}\big|\ge (16K)^{-1}W(\mathcal F)\Big) \ \le\ K_2\exp(-2CK_1\, nW(\mathcal F)).$$
Consequently,
$$\Pr\Big(\exists\, 1\le i\le |M| : \big|\|P_n-P\|_{\mathcal G_i} - \mathbb E\|P_n-P\|_{\mathcal G_i}\big|\ge (16K)^{-1}W(\mathcal F)\Big) \ \le\ |M|\,K_2\exp(-2CK_1\, nW(\mathcal F)) \ \le\ K_2\exp(-CK_1\, nW(\mathcal F)) \ \le\ K_2\exp(-nW(\mathcal F)).$$
We conclude that with probability at least $1 - K_2\exp(-nW(\mathcal F))$ the following holds for $\mathcal G := \{(h-f_0)(g-f_0) : g\in M,\ h\in B_P(f_0,t^*)\}$:
$$\|P_n - P\|_{\mathcal G} \ \le\ \max_{1\le i\le |M|}\mathbb E\|P_n-P\|_{\mathcal G_i} + (16K)^{-1}W(\mathcal F). \tag{26}$$
The lemma will follow as soon as we show that $\max_{1\le i\le |M|}\mathbb E\|P_n-P\|_{\mathcal G_i} \le (16K)^{-1}W(\mathcal F)$.

In order to prove the last inequality, we first apply the symmetrization lemma (cf. (Koltchinskii, 2011, p. 20)) and majorize the resulting Rademacher averages by a constant multiple of the Gaussian averages:
$$\mathbb E\|P_n-P\|_{\mathcal G_i} \ \le\ C\,W(\mathcal G_i), \tag{27}$$
where we used the fact that $0\in\mathcal G_i$. Next, since $\|g_i - f_0\|_\infty \le 2$, a standard contraction argument (e.g., (Giné and Nickl, 2016, Theorem 3.1.17)) gives
$$\mathbb E_\xi\sup_{h\in B_P(f_0,t^*)}\ n^{-1}\sum_{k=1}^n (h-f_0)(g_i-f_0)(X_k)\,\xi_k \ \le\ 2\,\mathbb E_\xi\sup_{h\in B_P(f_0,t^*)}\ n^{-1}\sum_{k=1}^n (h-f_0)(X_k)\,\xi_k.$$
Then, by taking expectation over $X_1,\dots,X_n$ in the last equation and by Eq. (27), we conclude
$$\mathbb E\|P_n-P\|_{\mathcal G_i} \ \le\ C\,W(B_P(f_0,t^*)) \ \le\ C\ell\,W(\mathcal F) \ \le\ (16K)^{-1}W(\mathcal F),$$
where we set $\ell = (256K)^{-1}$. Then, by Eq. (26) and the last equation, the claim follows.

Proof of Lemma A.3.
It is enough to show that $E_1, E_2$ hold with constant probability for $n$ large enough. First, we prove this claim for $E_1$.

We aim to apply Adamczak's bound for the concentration of suprema of unbounded empirical processes (Lemma A.7). For this purpose, define the family of functions $\mathcal G := \{y f(x) : y\in\mathbb R,\ f\in\mathcal F - f_0\}$ and the distribution $Q = P\otimes N(0,1)$. Note that $\mathcal F\subseteq[-1,1]^X$ and $\xi$ is Gaussian. Therefore, by Pisier's inequality (cf. Pisier (1983), (Adamczak, 2008, Eq. 13)), we have
$$\Big\|\max_{1\le i\le n}|\xi_i f(X_i)|\Big\|_{\psi_1} \ \le\ C\log(n)\,\max_{1\le i\le n}\big\|\xi_i f(X_i)\big\|_{\psi_1} \ \le\ C\log(n).$$
By Adamczak's bound (Lemma A.7),
$$K^{-1}\,\mathbb E_{x,\xi}\sup_{f\in\mathcal F-f_0}\Big|\frac1n\sum_{i=1}^n f(X_i)\xi_i\Big| - \frac{10}{\sqrt n} - \frac{C\log(n)}{n} \ \le\ \sup_{f\in\mathcal F-f_0}\Big|\frac1n\sum_{i=1}^n f(X_i)\xi_i\Big| \ \le\ K\,\mathbb E_{x,\xi}\sup_{f\in\mathcal F-f_0}\Big|\frac1n\sum_{i=1}^n f(X_i)\xi_i\Big| + \frac{10}{\sqrt n} + \frac{C\log(n)}{n}, \tag{28}$$
with constant probability over both $X_1,\dots,X_n$ and $\xi$.

Now, using the averaging principle, for $n$ large enough, we can find an event $E_3$ (that depends only on $X_1,\dots,X_n$) that holds with constant probability, such that for any fixed $x^n\in E_3$ there exists an event $A(x^n)$ of constant probability (over $\xi$) on which Eq. (28) holds. For each $x^n\in E_3$, Lemma A.8 (with Lipschitz constant $\sup_{f\in\mathcal F}\|f-f_0\|_n\le 2$) implies that the middle term in (28) is, with high probability, within $Cn^{-1/2}$ of its expectation (with respect to $\xi$). Therefore, we have for all $x^n\in E_3$:
$$K^{-1}\,\mathbb E_{x,\xi}\sup_{f\in\mathcal F-f_0}\Big|\frac1n\sum_{i=1}^n f(X_i)\xi_i\Big| - \frac{C}{\sqrt n} \ \le\ \mathbb E_\xi\sup_{f\in\mathcal F-f_0}\Big|\frac1n\sum_{i=1}^n f(X_i)\xi_i\Big| \ \le\ K\,\mathbb E_{x,\xi}\sup_{f\in\mathcal F-f_0}\Big|\frac1n\sum_{i=1}^n f(X_i)\xi_i\Big| + \frac{C}{\sqrt n}.$$
Finally, since $0\in\mathcal F - f_0$, we have
$$\mathbb E_\xi\sup_{f\in\mathcal F-f_0}\frac1n\sum_{i=1}^n f(X_i)\xi_i \ \le\ \mathbb E_\xi\sup_{f\in\mathcal F-f_0}\Big|\frac1n\sum_{i=1}^n f(X_i)\xi_i\Big| \ \le\ 2\,\mathbb E_\xi\sup_{f\in\mathcal F-f_0}\frac1n\sum_{i=1}^n f(X_i)\xi_i.$$
Hence, the last two equations imply that when $W(\mathcal F)\ge C_1 n^{-1/2}$, for $C_1$ that is large enough, the claim follows for $E_1$.
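As an aside, the $\log(n)$ factor supplied by Pisier's inequality above tracks the slow growth of the maximum of $n$ Gaussians: $\mathbb E\max_{i\le n}|\xi_i|$ grows like $\sqrt{2\log n}$. A small simulation (our illustration, not the paper's) confirms this rate:

```python
import numpy as np

# Illustration (ours, not from the paper) of the growth rate behind Pisier's
# maximal inequality: for xi_1,...,xi_n i.i.d. N(0,1), E max_i |xi_i| grows
# like sqrt(2 log n), up to lower-order corrections.
rng = np.random.default_rng(1)
trials = 2000
for n in (10, 100, 1000):
    emp = np.abs(rng.standard_normal((trials, n))).max(axis=1).mean()
    ratio = emp / np.sqrt(2 * np.log(n))
    print(f"n={n}: E max |xi| ~= {emp:.3f}, ratio to sqrt(2 log n) = {ratio:.3f}")
    assert 0.8 <= ratio <= 1.4   # the ratio stabilizes near 1 as n grows
```

This is why the envelope term in (28) only costs a logarithmic factor rather than a polynomial one.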
To handle the remaining case of $W(\mathcal F)\le C_1 n^{-1/2}$, recall that we assumed that our class is not degenerate (i.e., it contains two functions $f_1, f_2$ with $\|f_1-f_2\|_P$ bounded below by a constant $c_0$). Then, it is easy to see that with constant probability it holds that
$$\widehat W(\mathcal F - f_0) \ \ge\ \widehat W\big(\{0,\ f_1-f_0,\ f_2-f_0\}\big) \ \ge\ \mathbb E\max\{n^{-1/2} c_0\, g,\ 0\} \ \ge\ c\cdot n^{-1/2} \ \ge\ c\cdot C_1^{-1}\, W(\mathcal F),$$
where $g\sim N(0,1)$. Therefore, for some $K^{-1} = c(K_1, c)$, the claim follows for $E_1$.

Next, we handle $E_2$. By using the definition of $B_P(f_0,t^*)$, and considerations similar to those that led to Eq. (28), we have
$$\sup_{f\in B_P(f_0,t^*)-f_0}\Big|\frac1n\sum_{i=1}^n f(X_i)\xi_i\Big| \ \le\ K\,\mathbb E_{x,\xi}\sup_{f\in B_P(f_0,t^*)-f_0}\Big|\frac1n\sum_{i=1}^n f(X_i)\xi_i\Big| + \frac{10\,t^*}{\sqrt n} + \frac{C\log(n)}{n}, \tag{29}$$
with constant probability over both $X_1,\dots,X_n$ and $\xi$.

As above, for $n$ large enough, we can find an event $E_4\subseteq E_3$ (where $E_3$ is defined in Eq. (11)) of constant probability (over $X_1,\dots,X_n$), such that for any $x^n\in E_4$ there exists an event $A(x^n)$ of constant probability (over $\xi$) on which (29) holds. Then, similarly to the case of $E_1$, we employ Lipschitz concentration for the middle term in (29), for each $x^n\in E_4$. To estimate the Lipschitz constant, recall that under $E_3$ (more precisely, under the event of Lemma A.4), we also have
$$\|f - f_0\|_n \ \le\ s\sqrt{W(\mathcal F)} + 10\,W(\mathcal F) \ \le\ C\sqrt{W(\mathcal F)}$$
for all $f\in B_P(f_0,t^*)$, under the choice of $t^*$ in (9). Then, using the fact that $A(x^n)$ holds with constant probability, Lemma A.8 (with Lipschitz constant $\sup_{f\in B_P(f_0,t^*)}\|f-f_0\|_{P_n}\le C\sqrt{W(\mathcal F)}$) implies that for each $x^n\in E_4$, the middle term in (29) is within an additive factor of $C\sqrt{W(\mathcal F)}\,n^{-1/2}$ of its expectation over $\xi$. Namely, we have for all $x^n\in E_4$:
$$\mathbb E_\xi\sup_{f\in B_P(f_0,t^*)-f_0}\Big|\frac1n\sum_{i=1}^n f(X_i)\xi_i\Big| \ \le\ K\,\mathbb E_{x,\xi}\sup_{f\in B_P(f_0,t^*)-f_0}\Big|\frac1n\sum_{i=1}^n f(X_i)\xi_i\Big| + C\sqrt{\frac{W(\mathcal F)}{n}},$$
where we used the fact that $t^*\le s\sqrt{W(\mathcal F)}$. The claim for $E_2$ follows by considerations similar to those used earlier.

Proof of Corollary 3.3.
For any $P$-Donsker class we have, with constant probability (van de Geer, 2000, Chap. 5),
$$\widehat W(\mathcal F) \ \sim\ W(\mathcal F) \ \sim\ n^{-1/2}.$$
Then, by Corollary 3.1, we have that
$$\mathbb E\int(\hat f_n - f_0)^2\, dP_n \ \gtrsim\ n^{-1/2}.$$
In order to prove the second part of the bound, we apply Theorem 3.1:
$$\mathbb E\int(\hat f_n - f_0)^2\, dP \ \gtrsim\ \max\big\{n^{-1/2},\ t_{n,P}(f_0,\mathcal F)\big\}.$$
The corollary will follow if we show a matching lower bound on $t_{n,P}$ for any $f_0\in\mathcal F$. To see this, we use (van de Geer, 2000, Thm 5.11), which shows that for all $t\ge 0$,
$$W(B_P(f_0,t)) \ \lesssim\ n^{-1/2}\int_0^t u^{-\alpha/2}\, du \ \lesssim\ t^{\frac{2-\alpha}{2}}\, n^{-1/2}.$$
Since $\alpha\in(0,2)$, the right-hand side is increasing in $t$; therefore, if $W(B_P(f_0,t^*))\gtrsim W(\mathcal F)\gtrsim n^{-1/2}$, then $t^*$ is bounded below by a constant. Hence $t_{n,P}(f_0,\mathcal F)$ is of at least the desired order, and the claim follows.

Proof of Corollary 3.4.
For any non-$P$-Donsker class we have, with constant probability (van de Geer, 2000, Chap. 5),
$$n^{-1/\alpha} \ \lesssim\ \widehat W(\mathcal F) \ \sim\ W(\mathcal F) \ \lesssim\ n^{-1/\alpha}.$$
Then, by Corollary 3.1, we have that
$$\mathbb E\int(\hat f_n - f_0)^2\, dP_n \ \gtrsim\ n^{-1/\alpha},$$
and the claim follows.

B.1 An example of the tightness of Theorem 3.1 (a sketch)
Let $P$ be the uniform distribution on $[0,1]$, and denote by $I(x_i, l_i)$ an interval with center $x_i$ and length $l_i$. For each $m\ge 1$ we define
$$\mathcal F_m := \Big\{ m^{-1/2}\sum_{i=1}^m \epsilon_i\, \mathbf 1_{I(x_i, l_m)}\ :\ x_1,\dots,x_m \text{ s.t. } I(x_j, l_m)\cap I(x_k, l_m) = \emptyset \text{ for all } 1\le j\ne k\le m,\ (\epsilon_1,\dots,\epsilon_m)\in\{-1,1\}^m\Big\},$$
for a suitably small interval length $l_m$. Now, we define $\mathcal F := \operatorname{conv}\{0,\ \{\mathcal F_m\}_{m=1}^\infty\}$. Clearly, this family is uniformly bounded by one. Also, we assume that $f_0 = 0$.

Using a classical fact about the spacings of uniform samples, with probability at least $1 - n^{-1}$,
$$\min_{1\le i\ne j\le n}|X_j - X_i| \ \ge\ c\cdot(n^2\log n)^{-1};$$
denote this event by $A$. Clearly, for each $x^n\in A$, we can find a function $f_\xi\in\mathcal F_n$ (that depends on $x^n$ as well) such that
$$\langle f_\xi, \xi\rangle_n \ =\ n^{-1/2}\cdot n^{-1}\sum_{i=1}^n |\xi_i|. \tag{30}$$
Also, note that under the event $A$, $\hat f_n\notin\{\mathcal F_m\}_{m=n+1}^\infty$. Therefore, one can easily show that $\widehat W(B_n(f_0, n^{-1/4}))\sim n^{-1/2}$.

Now, denote $C(n) := C(n\log(n))^{1/2}$ for $C$ that is large enough. Note that in any $\mathcal F_m$ with $C(n)\le m\le n$, we can only place $m$ intervals of length at most $(c/2)\cdot(n^2\log(n))^{-1}$. Therefore, under the event $A$, each of these intervals contains at most one point. Hence, we have that
$$\max_{f_m\in\mathcal F_m}\langle f_m, \xi\rangle_n \ =\ m^{-1/2}\, n^{-1}\max_{S\subseteq[n],\,|S|=m}\sum_{i\in S}|\xi_i|.$$
For $C(n)\le m\le n$, one can easily show by standard concentration inequalities that
$$\mathbb E\max_{f_m\in\mathcal F_m}\langle f_m, \xi\rangle_n \ \lesssim\ m^{-1/2}(m/n) + m^{-1/2}(m/n)\sqrt{\log(n/m)} \ \lesssim\ m^{-1/2}(m/n)\sqrt{\log(n/m)}. \tag{31}$$
In the remaining case of $m\le C(n)$, using some standard arguments, it can be shown that with probability at least $1-n^{-1}$ (over $X_1,\dots,X_n$) the following holds:
$$\mathbb E_\xi\sup_{f_m\in\mathcal F_m}\langle f_m, \xi\rangle_n \ \sim\ \mathbb E_{x,\xi}\sup_{f_m\in\mathcal F_m}\langle f_m, \xi\rangle_n \ \ll\ n^{-1/2}. \tag{32}$$
By using Eqs. (30), (31), (32), one can show that with high probability $\hat f_n\in B_n(f_0, Cn^{-1/4})$ for some $C\ge 1$, and also $W(\mathcal F)\sim n^{-1/2}$. Therefore, one can conclude that
$$\mathbb E\int(\hat f_n - f_0)^2\, dP_n \ \sim\ n^{-1/2} \ \sim\ W(\mathcal F), \qquad\text{while}\qquad \mathbb E\int(\hat f_n - f_0)^2\, dP \ \ll\ W(\mathcal F) \ \sim\ n^{-1/2}.$$
Finally, it is easy to see that $t_{n,P}(f_0,\mathcal F)$ is bounded below by this smaller order, and therefore, by using the last equation,
$$\mathbb E\int(\hat f_n - f_0)^2\, dP \ \sim\ t_{n,P}(f_0,\mathcal F).$$
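The spacing fact invoked at the start of Appendix B.1 can also be checked numerically: the minimal gap between $n$ i.i.d. uniform points on $[0,1]$ is of order $n^{-2}$, so intervals of much smaller length centered at the sample points are pairwise disjoint with high probability. A quick simulation (our illustration, not the paper's):

```python
import numpy as np

# Quick check (ours, not from the paper) of the spacing fact used in
# Appendix B.1: for n i.i.d. Uniform[0,1] points, the minimal gap between
# adjacent order statistics is of order n^{-2}.
rng = np.random.default_rng(2)
trials = 500
for n in (50, 200):
    pts = np.sort(rng.random((trials, n)), axis=1)
    gaps = np.diff(pts, axis=1).min(axis=1)   # minimal gap per trial
    scaled = gaps * n**2                      # rescaled gap should be O(1)
    print(f"n={n}: mean minimal gap * n^2 = {scaled.mean():.3f}")
    assert 0.2 <= scaled.mean() <= 5.0
```

In particular, intervals of length well below $n^{-2}$ (up to logarithmic factors) around the sample points can each contain at most one point, which is exactly what the construction of $\mathcal F_m$ exploits.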