Benign overfitting without concentration
January 5, 2021
Zong Shang
[email protected], College of Computer Science and Technology, Jilin University, China.

Abstract
We obtain a sufficient condition for benign overfitting in the linear regression problem. Our result does not rely on a concentration argument but on a small-ball assumption, and therefore it can hold in the heavy-tailed case. The basic idea is to establish a coordinate small-ball estimate in terms of effective rank, so that we can calibrate the balance between the epsilon-net and the exponential probability. Our result indicates that benign overfitting does not depend on concentration properties of the input vector. Finally, we discuss potential difficulties for benign overfitting beyond the linear model, and give a benign overfitting result without truncated effective rank.
1 Introduction

In recent years there has been tremendous interest in the generalization properties of statistical models that interpolate the input data. Classical learning theory suggests that a predictor that fits the input data perfectly will suffer from the noise and therefore will not generalize well. To overcome this problem, regularization and penalized learning procedures such as the LASSO were studied to weaken the effect of the noise and avoid overfitting. However, empirical experiments indicate that overfitting may nevertheless perform well. Why can overfitting perform well? In which cases can overfitting perform well? These questions motivated a series of works in this field.

The original motivation comes from the Deep Learning community, which empirically revealed that overfitted Deep Neural Networks can still generalize well, see [31]. This counter-intuitive phenomenon also appears for linear regression and kernel ridge regression, see [3] and [28]. It is believed that investigating the benign overfitting phenomenon in the linear regression case will benefit the study of the more complex Deep Neural Network case.

The cornerstone work [2] presented a minimax bound on the generalization error of overfitted linear regression. Their result is stated in terms of effective ranks, which measure the tail behavior of the eigenvalues of the covariance matrix and will be defined below.

In this paper we consider linear regression problems in $\mathbb{R}^p$. We are given a dataset $D_N=\{(X_i,Y_i)\}_{i=1}^N$ with $Y_i=\langle X_i,\alpha^*\rangle+\xi_i$, where $\alpha^*\in\mathbb{R}^p$ is an unknown vector, $(X_i)_{i=1}^N$ are i.i.d. copies of $X$, and the $\xi_i$ are unpredictable i.i.d. centered sub-gaussian noise variables, independent of $X$. Because we will compare linear regression with more general function classes later, we often write $f_\alpha(\cdot)$ for $\langle\cdot,\alpha\rangle$, and $F_A$ for the set of all $f_\alpha$ with $\alpha\in A$.

Assume that the random vector $X\in\mathbb{R}^p$ satisfies the weak small-ball assumption with parameters $(L,\kappa_0)$, which will be defined in Definition 2.1, and denote its covariance matrix by $\Sigma$. Define the design matrix $\mathbb{X}$ whose $N$ rows are $X_i^T$, and denote the response vector $Y=(Y_1,\dots,Y_N)$.

When $p>N$, least-squares estimators can interpolate $D_N$. Denote the one with the smallest $\ell_2$ norm by $\hat\alpha$, that is,
$$\hat\alpha=\mathbb{X}^{\dagger}Y,$$
where $\mathbb{X}^{\dagger}$ is the Moore–Penrose pseudo-inverse of $\mathbb{X}$. Denote by $H_N\subset\mathbb{R}^p$ the set
$$H_N=\{\alpha\in\mathbb{R}^p:\ \mathbb{X}\alpha=Y\},$$
which we call the interpolation space. Then
$$\hat\alpha=\operatorname*{argmin}_{\alpha\in H_N}\|\alpha\|_{\ell_2}.$$
We assume that $\operatorname{rank}(\Sigma)>N$, so that $\hat\alpha$ exists almost surely.

Our loss function is the squared loss $\ell(t)=t^2$, and the loss of $\alpha$ is $\ell_\alpha=\ell(\langle X,\alpha\rangle-Y)$. The empirical excess risk is then
$$P_N\mathcal{L}_\alpha=P_N(\ell_\alpha-\ell_{\alpha^*}).$$
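For concreteness, the following NumPy sketch (with arbitrary illustrative dimensions, spectrum and noise level, none of which are the ones analysed later) builds the minimum-$\ell_2$-norm interpolator $\hat\alpha=\mathbb{X}^{\dagger}Y$ and checks that it interpolates the data and has the smallest $\ell_2$ norm among all solutions of $\mathbb{X}\alpha=Y$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 200                                   # overparametrized regime: p > N
lam = 1.0 / np.arange(1, p + 1)                  # an arbitrary decaying spectrum (assumption)
X = rng.standard_normal((N, p)) * np.sqrt(lam)   # rows X_i^T, Cov(X) = diag(lam)
alpha_star = rng.standard_normal(p) / np.sqrt(p)
xi = 0.1 * rng.standard_normal(N)                # centered sub-gaussian noise
Y = X @ alpha_star + xi

alpha_hat = np.linalg.pinv(X) @ Y                # minimum-l2-norm interpolator: alpha_hat = X^dagger Y

print(np.allclose(X @ alpha_hat, Y))             # interpolation: X alpha_hat = Y
# any other interpolant alpha_hat + v, with v in ker(X), has a larger l2 norm
v = rng.standard_normal(p)
v = v - np.linalg.pinv(X) @ (X @ v)              # project v onto ker(X)
print(np.linalg.norm(alpha_hat) <= np.linalg.norm(alpha_hat + v))
```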
Benign overfitting depends on the effective ranks of $\Sigma$. If $A\in\mathbb{R}^{p\times p}$ is a symmetric matrix, denote by $\lambda_1(A)\ge\dots\ge\lambda_p(A)$ its eigenvalues and by $s_1(A)\ge\dots\ge s_p(A)$ its singular values; if there is no ambiguity we write $s_i$ instead of $s_i(A)$. [2] defined two effective ranks:
$$r_k(\Sigma)=\frac{\sum_{i>k}\lambda_i(\Sigma)}{\lambda_{k+1}(\Sigma)},\qquad R_k(\Sigma)=\frac{\bigl(\sum_{i>k}\lambda_i(\Sigma)\bigr)^2}{\sum_{i>k}\lambda_i^2(\Sigma)}.\tag{1.1}$$
$R_k(\Sigma)$ is a truncated version of the stable rank that occurs in Asymptotic Geometric Analysis, see [30], [25] and [22] for a comprehensive review. The stable rank, denoted $\operatorname{srank}_q(A)$, is defined by
$$\operatorname{srank}_q(A)=\left(\frac{\|A\|_{S_2}}{\|A\|_{S_q}}\right)^{\frac{2q}{q-2}},$$
where $\|\cdot\|_{S_q}$ is the $q$-Schatten norm, that is, $\|A\|_{S_q}=\bigl(\sum_{i=1}^p s_i^q(A)\bigr)^{1/q}$. When $q=4$ and $A=\Sigma^{1/2}$,
$$\operatorname{srank}_4(\Sigma^{1/2})=\left(\frac{\|\Sigma^{1/2}\|_{S_2}}{\|\Sigma^{1/2}\|_{S_4}}\right)^{4}=\frac{\bigl(\sum_{i=1}^p s_i^2(\Sigma^{1/2})\bigr)^2}{\sum_{i=1}^p s_i^4(\Sigma^{1/2})}=\frac{\bigl(\sum_{i=1}^p\lambda_i(\Sigma)\bigr)^2}{\sum_{i=1}^p\lambda_i^2(\Sigma)}.$$
It can be seen that $R_k(\Sigma)$ is the truncated version of $\operatorname{srank}_4(\Sigma^{1/2})$.

Apart from this, $r_k(\Sigma)$ is a truncated version of the usual "effective rank" $\operatorname{tr}(\Sigma)/\lambda_1(\Sigma)$ from the statistical literature, see [12], [26].

In fact, our result will be stated in terms of $R_k(\Sigma)$ instead of $r_k(\Sigma)$, which is the usual choice in most past work. This does not matter much, because the two effective ranks are closely related; we refer the reader to Appendix A.6 in [2] for a comprehensive review.

For the sake of simplicity, we define some extra notation. It has no special meaning, but it will make our formulas clearer. Denote
$$R_k^1(\Sigma):=\left(1-\sqrt{\frac{k}{\operatorname{srank}_4(\Sigma^{1/2})}}\right)^{2}R_k(\Sigma).\tag{1.2}$$
When $k=0$ we have $R_k^1(\Sigma)=\operatorname{srank}_4(\Sigma^{1/2})$. Denote
$$\bar R_k(\Sigma):=\frac{4p-k}{c_5\,p\,R_k^1(\Sigma)}\;2^{-\frac{c_5\,p}{4p-k}R_k^1(\Sigma)},$$
where $c_5<1$ is the absolute constant appearing in Lemma 4.2. Denote by $\|A\|_{\mathrm{op}}$ the operator norm of $A$ and by $\|\alpha\|_A$ the norm $\sqrt{\alpha^TA\alpha}$.

We use $S(r)$ to denote the sphere of radius $r$ in $\mathbb{R}^p$ with respect to $\|\cdot\|_{\ell_2}$ and $B(r)$ the corresponding ball; $S_A(r)$ and $B_A(r)$ are the sphere and ball with respect to $\|\cdot\|_A$. $(\varepsilon_i)_{i=1}^N$ are i.i.d. Bernoulli random variables. $D$ denotes the unit ball with respect to the $L_2$ distance. If $F\subset L_2(\mu)$, let $\{G_f:f\in F\}$ be the canonical gaussian process indexed by $F$, and set
$$\mathbb{E}\|G\|_F:=\sup\Bigl\{\mathbb{E}\sup_{f\in F'}G_f:\ F'\subset F,\ F'\ \text{finite}\Bigr\}.$$
Denote
$$\tilde\Lambda_{s_0,u}(F)=\inf\sup_{f\in F}\sum_{s\ge s_0}2^{s/2}\|f-\pi_sf\|_{(u2^s)},\qquad u\ge1,\ s_0\ge0,$$
where the infimum is taken over all admissible sequences $(F_s)_{s\ge0}$ and $\pi_sf$ is the nearest point in $F_s$ to $f$ with respect to $\|\cdot\|_{(u2^s)}$. Here $\|\cdot\|_{(p)}=\sup_{q\le p}\|\cdot\|_{L_q}/\sqrt q$. An admissible sequence is a sequence of partitions of $F$ such that $|F_s|\le2^{2^s}$ and $|F_0|=1$, cf. [21]. Denote by $d_q(F)$ the diameter of $F$ with respect to $\|\cdot\|_{L_q}$, by $\|\cdot\|_{\psi_2}$ the sub-gaussian norm, and by $[N]$ the set $\{1,2,\dots,N\}$. $C,c,c_0,c_1,c_2,\dots$ denote absolute constants.

Section 2 contains some preliminary knowledge. Section 3 contains our main result, Theorem 3.1; its proof is decomposed into two parts, which are postponed to Sections 4 and 5. These two sections contain the estimation error and the prediction error of the interpolation procedure in the linear regression case. In Section 6 we discuss why it is difficult to obtain an oracle inequality beyond linear regression, and we give a benign overfitting result without truncated effective rank by using the Dvoretzky–Milman Theorem from Asymptotic Geometric Analysis.
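To fix ideas, here is a small numerical sketch (the spectrum is an arbitrary assumption made only for illustration) of the two effective ranks in Equation 1.1 and of the stable rank $\operatorname{srank}_4(\Sigma^{1/2})$; it also checks that $R_0(\Sigma)$ coincides with the untruncated stable rank.

```python
import numpy as np

lam = np.exp(-np.arange(200)) + 1e-3                 # illustrative spectrum of Sigma (assumption)

def r_k(lam, k):
    # r_k(Sigma) = sum_{i>k} lambda_i / lambda_{k+1}
    return lam[k:].sum() / lam[k]

def R_k(lam, k):
    # R_k(Sigma) = (sum_{i>k} lambda_i)^2 / sum_{i>k} lambda_i^2
    tail = lam[k:]
    return tail.sum() ** 2 / (tail ** 2).sum()

def srank4_sqrt(lam):
    # srank_4(Sigma^{1/2}) = (||Sigma^{1/2}||_{S_2} / ||Sigma^{1/2}||_{S_4})^4
    return lam.sum() ** 2 / (lam ** 2).sum()

print(r_k(lam, 10), R_k(lam, 10))
print(np.isclose(R_k(lam, 0), srank4_sqrt(lam)))     # R_0(Sigma) equals srank_4(Sigma^{1/2})
```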
2 Preliminaries
In this section we introduce the preliminary techniques used to formulate and prove our main results. More precisely, we introduce the localization method, which yields an oracle inequality, and the small-ball method, which provides a lower bound on the smallest singular value of the design matrix.
2.1 Localization method

To get an oracle inequality there are, in general, two approaches. The first is the isomorphic method, which uses an isomorphism between the empirical and the actual structures to derive an oracle inequality. The other is the localization method. In this work we use the localization method.

Let $F$ be a statistical model and let $f^*\in F$ be an oracle. The localization method uses an $L_2$-ball centered at $f^*$ with radius $r$ to localize the model $F$; this allows us to study the statistical properties of a learning procedure $\hat f$ on this small ball (the localizing ball need not be taken with respect to the $L_2$ distance, though our choice here is the $L_2$ distance; we refer the interested reader to [7]). More precisely, the radius $r$ captures an upper bound on the estimation error $\|f-f^*\|_{L_2}$ for all $f\in F\cap rD$. Therefore, if we can find an upper bound on $r$, we find the estimation error of the learning procedure $\hat f$. Analogously to [8], the localized set in this paper is
$$F_{H_{r,\rho}}=\{\langle\cdot,\alpha\rangle:\ \alpha\in H_{r,\rho}\},\qquad H_{r,\rho}:=B(\rho)\cap B_\Sigma(r),$$
where $\rho$ is an upper bound on the estimation error, studied in Section 4, and $r$ is an upper bound on the prediction error, studied in Section 5. Obtaining the prediction risk is based on the estimation risk: we obtain the estimation risk by studying the minimum-$\ell_2$ interpolation procedure, and we obtain the prediction error by the localization method built on top of it.

The optimal level of $r$, denoted $r^*$, is carefully chosen via fixed points called complexity parameters. In classical statistical learning theory there are two commonly used complexity parameters, the multiplier complexity $r_M$ and the quadratic complexity $r_Q$; we refer the reader to [20] for a comprehensive view. The quadratic complexity is defined as follows:
$$r_{Q,1}(F,\zeta_1)=\operatorname*{arginf}_{r>0}\Bigl\{\mathbb{E}\|G\|_{(F-f^*)\cap rD}\le\zeta_1 r\sqrt N\Bigr\},$$
and
$$r_{Q,2}(F,\zeta_2)=\operatorname*{arginf}_{r>0}\Biggl\{\mathbb{E}\sup_{w\in(F-f^*)\cap rD}\Bigl|\frac{1}{\sqrt N}\sum_{i=1}^N\varepsilon_iw(X_i)\Bigr|\le\zeta_2 r\sqrt N\Biggr\},$$
where $\zeta_1,\zeta_2$ are absolute constants. Then $r_Q(\zeta_1,\zeta_2)=\max\{r_{Q,1}(F,\zeta_1),r_{Q,2}(F,\zeta_2)\}$ is an intrinsic parameter; that is, $r_Q$ does not rely on the noise $\xi$ but only on $F$. This parameter measures the ability of $F$ to estimate the target function $f^*$. (A numerical sketch of such a fixed-point computation is given below.)

The multiplier complexity $r_M$ is defined as follows:
$$\phi_N(r):=\sup_{w\in(F-f^*)\cap rD}\Bigl|\frac{1}{\sqrt N}\sum_{i=1}^N\varepsilon_i\xi_iw(X_i)\Bigr|,\qquad r_{M,1}(\kappa,\delta)=\inf\Bigl\{r>0:\ \mathbb{P}\bigl\{\phi_N(r)\le r^2\kappa\sqrt N\bigr\}\ge1-\delta\Bigr\},$$
and
$$r_{M,2}(\kappa)=\inf\Bigl\{r>0:\ \sup_{w\in(F-f^*)\cap rD}\frac{\|\xi w(X)\|_{L_2}}{\sqrt N}\le\kappa r^2\Bigr\},$$
where $\kappa$ is an absolute constant. Then $r_M(\kappa,\delta)=r_{M,1}+r_{M,2}$ is called the multiplier complexity; it measures the interplay between the noise $\xi$ and the function class $F$.

Classical learning theory employs $r_M$ to measure the ability of $F$ to absorb the noise $\xi$. However, this parameter does not make sense in the interpolation case: the interpolant $\hat f$ incurs no loss on $r_M$ because it interpolates $(X_i,Y_i)_{i=1}^N$ perfectly, so $r_M=0$. On the other hand, since $\hat f$ interpolates $(X_i,Y_i)_{i=1}^N$, it is affected by the noise $\xi$, so $r_Q$ is no longer an intrinsic parameter: $r_Q$ depends on $\xi$ implicitly, because $\hat f$ has to estimate both the signal and the noise. It is this that causes the biggest difference between the interpolation case and classical learning theory. Therefore, our complexity parameter is a variant of the quadratic complexity, which will be defined in Equation 3.5.

The localization method employs complexity parameters to provide the radius of the localized set. However, we still need to show that the interpolant $\hat f$ lies in this set with high probability.
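To make the role of such fixed points concrete, the following sketch numerically locates the smallest $r$ at which a complexity proxy drops below $\zeta r\sqrt N$, in the spirit of $r_{Q,1}$; the spectrum, the value of $\zeta$ and the use of the ellipsoid width bound from Equation 2.3 below are illustrative assumptions, not the exact objects used in the proofs.

```python
import numpy as np

def width_proxy(r, lam, rho):
    # Gaussian-width proxy for the localized ellipsoid, cf. (2.3):
    # E||G|| <= sqrt( sum_i min{ lambda_i * rho^2 , r^2 } )
    return np.sqrt(np.minimum(lam * rho ** 2, r ** 2).sum())

def quadratic_fixed_point(lam, rho, N, zeta=0.1):
    # smallest r > 0 on a grid with width_proxy(r) <= zeta * r * sqrt(N)
    grid = np.logspace(-6, 2, 2000)
    widths = np.array([width_proxy(r, lam, rho) for r in grid])
    ok = grid[widths <= zeta * grid * np.sqrt(N)]
    return ok.min() if ok.size else np.inf

lam = np.exp(-np.arange(500)) + 1e-4     # illustrative spectrum (assumption)
print(quadratic_fixed_point(lam, rho=1.0, N=100))
```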
This last step is guaranteed by an exclusion argument: if $f\in F$ is to be an interpolant, its empirical excess risk must be below a fixed level with high probability. To see this, we first lower bound $\inf_{f\in F}P_N\mathcal{L}_f$.

There are two ways to decompose the empirical excess risk into quadratic and multiplier components. The first is:
$$\inf_{f\in F}P_N\mathcal{L}_f=\inf_{f\in F}\frac1N\sum_{i=1}^N\bigl(f(X_i)-Y_i\bigr)^2-\bigl(f^*(X_i)-Y_i\bigr)^2\ \ge\ \inf_{f\in F}\Bigl\{\frac1N\sum_{i=1}^N(f-f^*)^2(X_i)\Bigr\}-2\sup_{f\in F}\Bigl\{\frac1N\sum_{i=1}^N\xi_i(f-f^*)(X_i)\Bigr\}=:\inf_{f\in F}P_NQ_{f-f^*}-2\sup_{f\in F}P_NM_{f-f^*}.\tag{2.1}$$
This decomposition needs a lower bound on the quadratic component $P_NQ_{f-f^*}$, which is provided by the small-ball method. We will use this approach in subsection 6.2 to obtain a sufficient condition for benign overfitting without truncated effective rank.

We now turn to the second decomposition. Our localized statistical model $F=F_{H_{r,\rho}}$ is a class of linear functionals on $\mathbb{R}^p$. Recall that the optimal choice of $r$ is denoted $r^*$. For $\alpha$ with $\|\Sigma^{1/2}(\alpha-\alpha^*)\|_{\ell_2}=r\ge r^*$, set $\theta=r^*/r\in(0,1]$ and $\alpha'=\alpha^*+\theta(\alpha-\alpha^*)$, so that $\|\Sigma^{1/2}(\alpha'-\alpha^*)\|_{\ell_2}=r^*$ and $\|\alpha'-\alpha^*\|_{\ell_2}\le\theta\rho$. Denote
$$Q_{r,\rho}=\sup_{\alpha-\alpha^*\in H_{r,\rho}}\Bigl|\frac1N\sum_{i=1}^N\langle X_i,\alpha-\alpha^*\rangle^2-\mathbb{E}\langle X,\alpha-\alpha^*\rangle^2\Bigr|,\qquad M_{r,\rho}=\sup_{\alpha-\alpha^*\in H_{r,\rho}}\Bigl|\frac1N\sum_{i=1}^N\xi_i\langle X_i,\alpha-\alpha^*\rangle\Bigr|.$$
Then
$$\inf_{f\in F}P_N\mathcal{L}_f=\inf_{f\in F}\frac1N\sum_{i=1}^N\bigl(f(X_i)-Y_i\bigr)^2-\bigl(f^*(X_i)-Y_i\bigr)^2\ \ge\ \theta^{-2}\bigl((r^*)^2-Q_{r^*,\rho}\bigr)-2\theta^{-1}M_{r^*,\rho}.\tag{2.2}$$
Suppose $(\xi_i)_{i=1}^N$ are i.i.d. sub-gaussian random variables. Then, by Bernstein's inequality, with probability at least $1-\exp(-N/2)$
we have
$$\frac1N\sum_{i=1}^N\xi_i^2\ \ge\ \frac12\|\xi\|_{\psi_2}^2.$$
Because the interpolation procedure $\hat f$ interpolates all the inputs $(X_i,Y_i)_{i=1}^N$, its empirical excess risk is determined by the noise:
$$P_N\mathcal{L}_{\hat f}=P_N\bigl(\ell_{\hat f}-\ell_{f^*}\bigr)=-P_N\ell_{f^*}=-P_N\xi^2,$$
so with probability at least $1-\exp(-N/2)$, $P_N\mathcal{L}_{\hat f}\le-\|\xi\|_{\psi_2}^2/2$. That is to say, if $f$ is to be an interpolant, it must satisfy this upper bound; otherwise $f$ is excluded, because it has little probability of being an interpolant.

This upper bound differs from the non-interpolation setting, where the corresponding level is 0. It is smaller than 0 because the interpolation procedure is more restrictive (in the sense that the interpolation space is smaller than the version space) than non-interpolation procedures such as ERM; the smaller upper bound excludes more functions.

Therefore, we only need to upper bound the multiplier and quadratic processes in Equation 2.2 so that the lower bound on the empirical excess risk over all $f\in F_{H_{r,\rho}}$ is greater than $-\|\xi\|_{\psi_2}^2/2$ whenever $r$ is greater than a fixed level $r^*$. Then functions in $F_{H_{r,\rho}}$ with $r>r^*$ are excluded from being interpolants with high probability; hence, with high probability, the interpolant lies in $H_{r^*,\rho}$ and we can upper bound its prediction risk by $r^*$. We employ the following upper bounds on the multiplier and quadratic processes from [21].
Lemma 2.1 (multiplier process, [21]). There exist absolute constants $c_1$ and $c_2$ for which the following holds. If $\xi\in L_{\psi_2}$, then for every $u,w\ge8$,
with probability at least $1-2\exp(-c_1u^22^{s_0})-2\exp(-c_2Nw^2)$,
$$\sup_{f\in F}\Bigl|\frac{1}{\sqrt N}\sum_{i=1}^N\bigl(\xi_if(X_i)-\mathbb{E}\xi f\bigr)\Bigr|\ \le\ c\,u\,w\,\|\xi\|_{\psi_2}\,\tilde\Lambda_{s_0,u}(F),$$
where $c$ is an absolute constant.

Lemma 2.2 (quadratic process, [21]). There exists a constant $c(q)$ that depends only on $q$ for which the following holds. With probability at least $1-2\exp(-c_1u^22^{s_0})$,
$$\sup_{f\in F}\Bigl|\frac1N\sum_{i=1}^N\bigl(f^2(X_i)-\mathbb{E}f^2\bigr)\Bigr|\ \le\ \frac{c(q)}{N}\Bigl(u^2\tilde\Lambda^2_{s_0,u}(F)+u\sqrt N\,d_q(F)\,\tilde\Lambda_{s_0,u}(F)\Bigr).$$
In particular, if $F$ is a sub-gaussian class, then with probability at least $1-2\exp(-c_1N)$,
$$\sup_{f\in F}\Bigl|\frac1N\sum_{i=1}^N\bigl(f^2(X_i)-\mathbb{E}f^2\bigr)\Bigr|\ \le\ cL\Bigl(\frac{d_2(F)\,\mathbb{E}\|G\|_F}{\sqrt N}+\frac{(\mathbb{E}\|G\|_F)^2}{N}\Bigr).$$
Set $2^{s_0}=k_F$, where $k_F=(\mathbb{E}\|G\|_F/d_2(F))^2$ is the Dvoretzky–Milman dimension of $F$; we refer the reader to [1] for a comprehensive view. For the Euclidean ball in $\mathbb{R}^p$ the Dvoretzky–Milman dimension satisfies $k\sim p$, see e.g. Theorem 5.4.1 in [1]. The quantity $\tilde\Lambda(F)$ is called the $\Lambda$-complexity of $F$; it is a generalization of the Gaussian complexity in that it only requires $F$ to have finitely many moments, instead of moments of every order. In particular, when $F$ happens to be a sub-gaussian class, $\tilde\Lambda(F)$ is equivalent to $\mathbb{E}\|G\|_F$. We refer the reader to [21] for a comprehensive review.

In this paper our function class consists of linear functionals on $\mathbb{R}^p$, in fact indexed by ellipsoids, since we do not assume the random vector $X$ is isotropic. Given the covariance matrix $\Sigma$ of $X$, Lemma 5 in [8] yields
$$\mathbb{E}\|G\|_{F_{H_{r,\rho}}}\ \le\ \sqrt2\,\sqrt{\sum_{i=1}^p\min\{\lambda_i(\Sigma)\rho^2,\,r^2\}}.\tag{2.3}$$
However, estimating $\tilde\Lambda(H_{r,\rho})$ is non-trivial unless $H_{r,\rho}$ is a sub-gaussian class. Note that the deviation in Lemma 2.2 is neither optimal nor user-friendly, in the sense that the deviation parameter $u$ is coupled with the complexity parameter $\mathbb{E}\|G\|_F$. In fact, the upper bound on the quadratic process given by [9] has the optimal deviation when $F$ is a sub-gaussian class, but this does not fit our heavy-tailed setup, and it is non-trivial to obtain an upper bound on the quadratic process in the heavy-tailed case. When $F$ is sub-gaussian we can omit the parameter $r_2^*$ defined in Section 3 and get a better bound on $\|\Gamma\|$ in subsection 4.2, which yields a more precise result; that proof is omitted. To make our result uniform over the heavy-tailed and sub-gaussian cases, we employ the bound from [21], even though it does not yield an optimal bound.

2.2 The small-ball method

To deal with the heavy-tailed case we employ the small-ball method, a crucial argument in Asymptotic Geometric Analysis, see [1]. The small-ball method in statistical learning theory was first developed in [13]. It can be viewed as a kind of Paley–Zygmund argument: it assumes the random vector is sufficiently spread out, so that it has many large coordinates. We refer the reader to [20] and [22] for a comprehensive view.

The classical small-ball assumption is a lower bound on the tail of a random function, namely
$$\mathbb{P}\bigl\{|f(X)|\ge\kappa\,\|f\|_{L_2}\bigr\}\ \ge\ \theta,$$
which can be verified by the Paley–Zygmund inequality, see Lemma 3.1 in [11], under a suitable norm equivalence condition. In this paper the small-ball method is used to obtain a lower bound on the smallest singular value of the design matrix. In statistical learning theory, the small-ball assumption leads to a lower bound on the quadratic component in Equation 2.1, and this is what provides the lower bound on the smallest singular value.
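The following quick simulation (with illustrative distributions that are not part of the proofs) estimates the classical small-ball probability for a heavy-tailed marginal: even without sub-gaussian concentration, $\mathbb{P}\{|\langle X,t\rangle|\ge\kappa\|\langle X,t\rangle\|_{L_2}\}$ stays bounded away from zero.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200_000, 20
t = rng.standard_normal(p)
t /= np.linalg.norm(t)
kappa = 0.5

samples = {
    "gaussian": rng.standard_normal((n, p)),
    "student_t(3)": rng.standard_t(df=3, size=(n, p)),   # heavy tails: no sub-gaussian concentration
}
for name, sample in samples.items():
    z = sample @ t
    l2 = np.sqrt(np.mean(z ** 2))
    theta = np.mean(np.abs(z) >= kappa * l2)              # empirical small-ball probability
    print(name, round(theta, 3))
```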
Since we do not require the coordinates of the input vector $X$ to be independent, we need a small-ball method without independence. Fortunately, the independence assumption is relaxed in [22], and the corresponding definition of the small-ball assumption is as follows.

Definition 2.1 ([22]). The random vector $X\in\mathbb{R}^p$ satisfies the weak small-ball assumption (wSBA) with constants $L,\kappa_0$ if for every $1\le k\le p-1$,
every $k$-dimensional subspace $F$ and every $z\in\mathbb{R}^p$,
$$\mathbb{P}\bigl\{\|P_FX-z\|_{\ell_2}\le\kappa_0\sqrt k\bigr\}\ \le\ (L\kappa_0)^k,$$
where $P_F$ is the orthogonal projection onto the subspace $F$.

There are many examples of random vectors $X$ satisfying the wSBA; we refer the reader to Appendix A in [22] for a comprehensive view.
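The sketch below (purely illustrative) estimates the probability appearing in Definition 2.1 for a random $k$-dimensional subspace and $z=0$, comparing a Gaussian vector with a heavy-tailed one; it only simulates the projected small-ball probability, not the bound $(L\kappa_0)^k$ itself.

```python
import numpy as np

rng = np.random.default_rng(2)
p, k, n = 50, 10, 100_000
F = np.linalg.qr(rng.standard_normal((p, k)))[0]    # ONB of a random k-dimensional subspace F
kappa0 = 0.3

samples = {
    "gaussian": rng.standard_normal((n, p)),
    "student_t(3)": rng.standard_t(df=3, size=(n, p)),
}
for name, sample in samples.items():
    proj = sample @ F                                # coordinates of P_F X in the chosen ONB of F
    prob = np.mean(np.linalg.norm(proj, axis=1) <= kappa0 * np.sqrt(k))
    print(name, prob)                                # estimate of P{ ||P_F X||_2 <= kappa0 sqrt(k) }
```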
3 Main Result

In this section we formulate our main result, Theorem 3.1. Before doing so, we state our final assumption and define several parameters.

First, we need the following assumption: there are constants $\delta_1>0$ and $\delta_2>0$ such that
$$\Bigl(\frac1p\sum_{i=1}^p\bigl\|\Sigma^{1/2}e_i\bigr\|_{\ell_2}^{\delta_1}\Bigr)^{1/\delta_1}\ \ge\ \delta_2\,\sqrt{\frac{\operatorname{tr}(\Sigma)}{p}},\tag{3.1}$$
where $(e_i)_{i=1}^p$ is an orthonormal basis of $\mathbb{R}^p$. This assumption is used to select a proper subset $\sigma\subset[p]$ (proper in the sense of a uniform lower bound on the inner products) with $|\sigma|\ge c_0p$, where $c_0$ depends on $\delta_1$ and $\delta_2$. The assumption is not restrictive; see subsection 3.1 for an example.

Second, we define three parameters. Define $k^*$ as the smallest integer $k$ such that
$$p\log\Biggl(\frac{C}{\varepsilon}\sqrt{\frac{p}{\operatorname{tr}(\Sigma)}\Bigl(\frac{d_q(D)\tilde\Lambda(D)}{\sqrt p}+\frac{\tilde\Lambda^2(D)}{p}+\lambda_1(\Sigma)\Bigr)}\ \bigl(c_0(1-\bar R_k(\Sigma))\bigr)^{-1/2}\Biggr)\ \le\ \frac{N\bigl(c_0\bar R_k(\Sigma)+1-c_0\bigr)}{2}+\log(c_0p),\tag{3.2}$$
where the minimum of the empty set is defined as $\infty$. Denote by $\nu$ the gap between the two sides of Equation 3.2 at $k=k^*$:
$$\nu:=\frac{N\bigl(c_0\bar R_{k^*}(\Sigma)+1-c_0\bigr)}{2}+\log(c_0p)-p\log\Biggl(\frac{C}{\varepsilon}\sqrt{\frac{p}{\operatorname{tr}(\Sigma)}\Bigl(\frac{d_q(D)\tilde\Lambda(D)}{\sqrt p}+\frac{\tilde\Lambda^2(D)}{p}+\lambda_1(\Sigma)\Bigr)}\ \bigl(c_0(1-\bar R_{k^*}(\Sigma))\bigr)^{-1/2}\Biggr).\tag{3.3}$$
The parameter $k^*$ is the level at which the two sides of Equation 3.2 balance.

Denote
$$\rho=\|\alpha^*\|_{\ell_2}+\frac2\varepsilon\sqrt{\frac{2}{c_0(1-c_1)}}\,\sqrt{\frac{p}{\operatorname{tr}(\Sigma)}}\,\|\xi\|_{\psi_2},\tag{3.4}$$
where $\varepsilon\in(0,1)$ is a constant; $\rho$ will be the upper bound on the estimation error. Denote
$$r_1^*:=\operatorname*{arginf}_{r>0}\bigl\{\tilde\Lambda(H_{r,\rho})\le\zeta_1\sqrt p\,r\bigr\},\qquad r_2^*:=\operatorname*{arginf}_{r>0}\bigl\{d_q(F_{H_{r,\rho}})\le\zeta_2r\bigr\},\qquad r^*:=r_1^*+r_2^*,\tag{3.5}$$
where $\zeta_1,\zeta_2$ are absolute constants. In particular, when $H_{r,\rho}$ is a sub-gaussian class this definition reduces to that of [8]; $r^*$ will be the upper bound on the prediction risk.

Now we can formulate our main result.

Theorem 3.1. Suppose $X=\Sigma^{1/2}Z\in\mathbb{R}^p$ is a random vector, where $Z$ is an isotropic random vector that satisfies the wSBA with constants $L,\kappa_0$, and $\Sigma$ satisfies Equation 3.1. Let $(X_i)_{i=1}^N$ be i.i.d. copies of $X$, forming the rows of a random matrix $\mathbb{X}$. Let $\hat\alpha$ be the interpolation solution on $(X_i,Y_i)_{i=1}^N$, where $Y_i=\langle X_i,\alpha^*\rangle+\xi_i$ and $(\xi_i)_{i=1}^N$ are i.i.d. sub-gaussian random variables. Then there exists an absolute constant $c$ such that, with probability at least $1-\exp(-\nu)-\exp(-cN)$,
$$\|\hat\alpha-\alpha^*\|_{\ell_2}\le\rho,\qquad\bigl\|\Sigma^{1/2}(\hat\alpha-\alpha^*)\bigr\|_{\ell_2}\le r^*,$$
where $\rho$, $\nu$ and $r^*$ are defined in Equations 3.4, 3.3 and 3.5.

3.1 An example

Consider a simple example, also considered in [2], [8] and [28]. When $X$ is a sub-gaussian random vector we have $\tilde\Lambda(F)\sim\mathbb{E}\|G\|_F$ at once; therefore, with probability at least $1-\exp(-cN)$,
$$\|\Gamma\|\ \lesssim\ \sqrt{N\lambda_1(\Sigma)}.$$
Consider the concrete case where $\varepsilon=o(1)$ and
$$\lambda_k(\Sigma)=e^{-k}+\varepsilon\quad\text{for every }k,\qquad\log(1/\varepsilon)<N,\qquad p=cN\log(1/\varepsilon).$$
If $p\varepsilon=\omega(1)$, then $\operatorname{tr}(\Sigma)=\Theta(p\varepsilon+1)$ and $\frac{d_q(D)\tilde\Lambda(D)}{\sqrt p}+\frac{\tilde\Lambda^2(D)}{p}+\lambda_1(\Sigma)=O(1)$.

To choose $k$ so that Equation 3.2 holds, we bound $\bar R_k(\Sigma)$, and for that we first bound $R_k(\Sigma)$ from below:
$$R_k(\Sigma)=\Theta\Bigl(\frac{(e^{-k}+p\varepsilon)^2}{e^{-2k}+p\varepsilon^2}\Bigr)=\Theta(p)\qquad\text{for }k=\log(1/\varepsilon)<p.$$
Next we estimate $\operatorname{srank}_4(\Sigma^{1/2})=\Theta(p)$, so $R^1_k(\Sigma)=\Theta(p)$ and therefore $\bar R_k(\Sigma)=\Theta(2^{-p})=\Theta(\varepsilon^{cN})$. Dividing both sides of Equation 3.2 by $p$, the left-hand side is of order $-\log\bigl(c_0(1-\varepsilon^{cN})\bigr)$, while the right-hand side contributes $\frac{N}{2p}\bigl(c_0\varepsilon^{cN}+1-c_0\bigr)$ together with $\frac{\log(c_0p)}{p}=\Theta\bigl(\frac{\log N}{N}\bigr)$; since $\varepsilon=o(1)$, Equation 3.2 therefore holds for $N$ large enough. Finally, Equation 3.1 holds for a suitable choice of $\delta_1$ and $\delta_2$ (depending on $\varepsilon$).
Accordingly, in this example we may take $\nu$ of order
$$p\Bigl(\frac{\log N}{N}-\varepsilon^{cN}+\log\bigl(1-\varepsilon^{cN}\bigr)\Bigr).$$
Next we estimate $r^*$. Since $F_{H_{r,\rho}}$ is a sub-gaussian class, $r_2^*=0$ and
$$r_1^*=\operatorname*{arginf}_{r>0}\Bigl\{\sqrt{\sum_{i=1}^p\min\{r^2,\lambda_i(\Sigma)\rho^2\}}\le\zeta_1\sqrt p\,r\Bigr\},$$
so $r^*=r_1^*\le\frac{2}{\sqrt{\zeta_1}}\|\alpha^*\|_{\ell_2}\sqrt{\operatorname{tr}(\Sigma)/p}$. Therefore, by Theorem 3.1, with probability at least $1-\exp(-\nu)-\exp(-cN)$ we have
$$\|\hat\alpha-\alpha^*\|_{\ell_2}\ \le\ \|\alpha^*\|_{\ell_2}+C\|\xi\|_{\psi_2}\sqrt{\frac{p}{p\varepsilon+1}}\ \le\ 2\|\alpha^*\|_{\ell_2},$$
$$\bigl\|\Sigma^{1/2}(\hat\alpha-\alpha^*)\bigr\|_{\ell_2}^2\ \le\ \frac{4}{\zeta_1}\|\alpha^*\|_{\ell_2}^2\,\frac{p\varepsilon+1}{p}\ =\ \frac{4}{\zeta_1}\|\alpha^*\|_{\ell_2}^2\Bigl(\varepsilon+\frac{1}{c\log(1/\varepsilon)N}\Bigr),$$
provided the signal-to-noise ratio $\|\alpha^*\|_{\ell_2}/\|\xi\|_{\psi_2}$ is at least of order $\sqrt{p/(p\varepsilon+1)}$.

4 Estimation error

In this section we obtain a high-probability upper bound on $\|\hat\alpha-\alpha^*\|_{\ell_2}$. We have
$$\hat\alpha=\mathbb{X}^\dagger Y=\mathbb{X}^\dagger\mathbb{X}\alpha^*+\mathbb{X}^\dagger\xi.$$
Therefore,
$$\|\hat\alpha-\alpha^*\|_{\ell_2}\ \le\ \bigl\|(\mathbb{X}^\dagger\mathbb{X}-I)\alpha^*\bigr\|_{\ell_2}+\bigl\|\mathbb{X}^\dagger\xi\bigr\|_{\ell_2}\ \le\ \|\alpha^*\|_{\ell_2}+\bigl\|\mathbb{X}^\dagger\bigr\|_{\mathrm{op}}\|\xi\|_{\ell_2}.\tag{4.1}$$
For $\|\xi\|_{\ell_2}$, Bernstein's inequality gives $\|\xi\|_{\ell_2}\le\sqrt{2N}\,\|\xi\|_{\psi_2}$ with probability at least $1-\exp(-N)$. To upper bound $\|\mathbb{X}^\dagger\|_{\mathrm{op}}$ we need a high-probability lower bound on the smallest singular value of $\mathbb{X}$.

Lemma 4.1. Suppose $X=\Sigma^{1/2}Z\in\mathbb{R}^p$ is a random vector, where $Z$ is an isotropic random vector that satisfies the wSBA with constants $L,\kappa_0$, and $\Sigma$ satisfies Equation 3.1. Let $(X_i)_{i=1}^N$ be i.i.d. copies of $X$, forming the rows of a random matrix $\mathbb{X}$. Then there exists a constant $c_1<1$ such that the smallest singular value of $\mathbb{X}$ satisfies
$$s_{\min}(\mathbb{X})\ \ge\ \varepsilon\sqrt{\frac{c_0(1-c_1)}{2}}\cdot\sqrt N\,\sqrt{\frac{\operatorname{tr}(\Sigma)}{2p}}\ \gtrsim\ \sqrt{\frac{N\operatorname{tr}(\Sigma)}{p}},\qquad\forall\,\varepsilon\in(0,1),$$
with probability at least $1-\exp(-\nu)-\exp(-cN)$, where $c$ is an absolute constant.

With the help of Lemma 4.1 we arrive at the estimation error.

Theorem 4.1 (Estimation error). Suppose $X=\Sigma^{1/2}Z$, where $Z$ is a random vector satisfying the wSBA with parameters $L,\kappa_0$ and $\Sigma$ satisfies Equation 3.1. Let $(X_i)_{i=1}^N$ be i.i.d. copies of $X$, let $Y_i=\langle X_i,\alpha^*\rangle+\xi_i$ with $(\xi_i)_{i=1}^N$ i.i.d. sub-gaussian random variables, and let $Y=(Y_i)_{i=1}^N$. Let $\mathbb{X}$ be the random matrix with rows $X_i^T$ and $\hat\alpha=\mathbb{X}^\dagger Y$. For $\nu$ defined in Equation 3.3, there exists a constant $c$ such that, with probability at least $1-\exp(-cN)-\exp(-\nu)$,
$$\|\hat\alpha-\alpha^*\|_{\ell_2}\ \le\ \|\alpha^*\|_{\ell_2}+\frac2\varepsilon\sqrt{\frac{2}{c_0(1-c_1)}}\,\sqrt{\frac{p}{\operatorname{tr}(\Sigma)}}\,\|\xi\|_{\psi_2},\qquad\forall\,\varepsilon\in(0,1),$$
and in particular $\|\hat\alpha-\alpha^*\|_{\ell_2}\le\|\alpha^*\|_{\ell_2}+c_2\|\xi\|_{\psi_2}\sqrt{p/\operatorname{tr}(\Sigma)}$.

The proof is immediate from Equation 4.1 and Lemma 4.1.

Theorem 4.1 can be compared with Theorem 3 in [8]: their estimation error is stated in terms of the effective rank $r_k(\Sigma)$, while our bound depends only on $\operatorname{tr}(\Sigma)/p$. This is because our lower bound on the smallest singular value is given by the average eigenvalue instead of an effective rank. We believe that, by choosing $c_0$ appropriately, the smallest singular value could also be controlled in terms of an effective rank, though deriving such a bound is not necessary for this work.

In the remainder of this section we prove Lemma 4.1. The outline is as follows. First, we establish a coordinate small-ball estimate in terms of effective ranks (Theorem 4.2). Second, we prove a uniform lower bound on $\|\mathbb{X}t\|_{\ell_2}$ over an epsilon-net of $S^{p-1}$. Finally, we lower bound the smallest singular value by combining this minimal $\ell_2$ norm with the maximal operator norm.

4.1 A coordinate small-ball estimate

In this subsection we prove the following theorem.

Theorem 4.2 (Coordinate small-ball estimate in terms of effective ranks).
If the random vector $X=\Sigma^{1/2}Z\in\mathbb{R}^p$ satisfies the wSBA with constants $(L,\kappa_0)$ and $(e_i)_{i=1}^p$ is an orthonormal basis of $\mathbb{R}^p$ satisfying Equation 3.1, then for every $\varepsilon\in(0,1)$,
$$\mathbb{P}\Biggl(\Bigl|\Bigl\{i\le p:\ \bigl|\bigl\langle\Sigma^{1/2}Z,e_i\bigr\rangle\bigr|\ \ge\ \varepsilon\sqrt{\frac{\operatorname{tr}(\Sigma)}{2p}}\Bigr\}\Bigr|\ \le\ c_0p\Biggr)\ \le\ \bar R_k(\Sigma)=\frac{4p-k}{c_5\,p\,R^1_k(\Sigma)}\;2^{-\frac{c_5\,p}{4p-k}R^1_k(\Sigma)},\tag{4.2}$$
where $c_0$ depends on $L$ and $\kappa_0$.

This is a simple modification of the argument in [22]. We divide the proof of Theorem 4.2 into three steps.

First, we select a proper subset $\sigma\subset[p]$. This can be done by a probabilistic-combinatorics argument. Let $u_i$ be a random vector uniformly distributed on the given orthonormal basis $\{e_i\}_{i=1}^p$, and set indicators $(\mathbb{1}_i)_{i=1}^p$ with $\mathbb{1}_i=1$ if $\|\Sigma^{1/2}u_i\|_{\ell_2}\ge\sqrt{\operatorname{tr}(\Sigma)/(2p)}$ and $\mathbb{1}_i=0$ otherwise. Then
$$\mathbb{E}\Bigl[\sum_{i=1}^p\mathbb{1}_i\Bigr]=\sum_{i=1}^p\mathbb{P}\Bigl(\bigl\|\Sigma^{1/2}u_i\bigr\|_{\ell_2}\ge\sqrt{\frac{\operatorname{tr}(\Sigma)}{2p}}\Bigr).$$
By Equation 3.1 and the Paley–Zygmund inequality (see e.g. Lemma 3.1 in [11]) the right-hand side is at least $c_0(\delta_1,\delta_2)\,p$. Therefore there exists a subset $\sigma\subset[p]$ of cardinality at least $c_0p$ such that, for all $i\in\sigma$,
$$\bigl\|\Sigma^{1/2}e_i\bigr\|_{\ell_2}\ \ge\ \sqrt{\frac{\operatorname{tr}(\Sigma)}{2p}}.\tag{4.3}$$
Second, following [22], the index set $\sigma$ is decomposed into $\ell$ coordinate blocks by means of the restricted invertibility theorem.

Lemma 4.2 ([22]). Assume that Equation 3.1 holds and set $k_4=\operatorname{srank}_4(\Sigma^{1/2})$. Then for any $\lambda\in(0,1)$ there are blocks $(\sigma_j)_{j=1}^\ell\subset\sigma$ such that:
- for every $1\le j\le\ell$, $|\sigma_j|\ge c_5\lambda^2k_4$, and $\sum_{j=1}^\ell|\sigma_j|\ge c_0p/2$;
- $\bigl\|\bigl((\Sigma^{1/2})^*P^*_{\sigma_j}\bigr)^{-1}\bigr\|_{S_\infty}\le\frac{c_6}{1-\lambda}\sqrt{\frac{2p}{\operatorname{tr}(\Sigma)}}$.

The cardinalities of the blocks are controlled by lower bounding $\operatorname{srank}_4(\Sigma^{1/2})$.

Lemma 4.3. For every $0\le k\le p-1$ and $\Sigma\in\mathbb{R}^{p\times p}$,
$$\operatorname{srank}_4(\Sigma^{1/2})\ \ge\ \frac{p}{4p-k}\left(1-\sqrt{\frac{k}{\operatorname{srank}_4(\Sigma^{1/2})}}\right)^2R_k(\Sigma)\ =\ \frac{p}{4p-k}\,R^1_k(\Sigma).$$

Proof. The proof has two parts. First, we lower bound $\sum_{i>k}s_i^2(\Sigma^{1/2})$. By Ky Fan's maximal principle, see e.g. Lemma 8.1.8 in [18] or Chapter 3 in [5],
$$\sum_{i=1}^{p-r}s_i^2(\Sigma^{1/2})\ \ge\ \operatorname{tr}\bigl(\Sigma(I_p-P)\bigr),$$
where $I_p-P$ is an orthogonal projection of rank $p-r$; this provides a lower bound on the sum of the largest $p-r$ eigenvalues of $\Sigma$. Set $p-r=k$, so that $\operatorname{rank}(P)=p-k$. Since $\sum_{i>k}s_i^2(\Sigma^{1/2})=\|\Sigma^{1/2}\|_{S_2}^2-\sum_{i=1}^ks_i^2(\Sigma^{1/2})$, it follows that
$$\sum_{i>k}s_i^2(\Sigma^{1/2})\ \ge\ \bigl\|\Sigma^{1/2}\bigr\|_{S_2}^2-\operatorname{tr}\bigl(\Sigma(I_p-P)\bigr).\tag{4.4}$$
We just need to bound $\operatorname{tr}(\Sigma(I_p-P))$ in terms of $\|\Sigma^{1/2}\|_{S_2}$. Considering $\Sigma$ and $\Sigma P$ separately, we have the identity
$$\bigl\|P\Sigma^{1/2}\bigr\|_{S_2}^2=\operatorname{tr}(\Sigma)-\operatorname{tr}\bigl(\Sigma(I_p-P)\bigr),\qquad\text{so}\qquad\operatorname{tr}\bigl(\Sigma(I_p-P)\bigr)=\bigl\|\Sigma^{1/2}\bigr\|_{S_2}^2-\bigl\|P\Sigma^{1/2}\bigr\|_{S_2}^2.\tag{4.5}$$
Substituting Equation 4.5 into Equation 4.4, we only need to control $\|P\Sigma^{1/2}\|_{S_2}^2$. By the definition of $\|\cdot\|_{S_2}$ and the properties of the Frobenius norm,
$$\bigl\|P\Sigma^{1/2}\bigr\|_{S_2}^2=\bigl\|\Sigma^{1/2}\bigr\|_{S_2}^2-\bigl\|P_C\Sigma^{1/2}\bigr\|_{S_2}^2,$$
where $P_C$ is the complementary projector of $P$. Recall that $\operatorname{rank}(P)=p-k$, so $P$ can be taken to pick $p-k$ rows of $\Sigma^{1/2}$; then $P_C$ picks the remaining $k$ rows of $\Sigma^{1/2}$, each of which has $\ell_2$-norm at least $\|\Sigma^{1/2}\|_{S_2}/(2\sqrt p)$ by Equation 4.3.
Therefore we have
$$\sum_{i>k}s_i^2(\Sigma^{1/2})\ \ge\ \Bigl(1-\frac{k}{4p}\Bigr)\bigl\|\Sigma^{1/2}\bigr\|_{S_2}^2,$$
and immediately
$$\sum_{i=1}^p\lambda_i(\Sigma)\ \le\ \Bigl(1-\frac{k}{4p}\Bigr)^{-1}\sum_{i>k}s_i^2(\Sigma^{1/2})\ =\ \Bigl(1-\frac{k}{4p}\Bigr)^{-1}\sum_{i>k}\lambda_i(\Sigma).\tag{4.6}$$
Second, we upper bound $\|\Sigma\|_{S_1}$. By Hölder's inequality,
$$\|\Sigma\|_{S_1}=\sum_{i=1}^ks_i(\Sigma)+\sum_{i>k}s_i(\Sigma)\ \le\ \sqrt k\,\|\Sigma\|_{S_2}+\sum_{i>k}s_i(\Sigma).$$
So we have
$$\sum_{i=1}^p\lambda_i(\Sigma)=\|\Sigma\|_{S_1}\ \le\ \left(1-\sqrt{\frac{k}{\operatorname{srank}_4(\Sigma^{1/2})}}\right)^{-1}\sum_{i>k}\lambda_i(\Sigma),\tag{4.7}$$
by the definition of the stable rank. Combining Equations 4.7 and 4.6 proves Lemma 4.3.

The rest of the proof of Theorem 4.2 is based on the following lemma.

Lemma 4.4. If the random vector $X$ satisfies the wSBA with constants $(L,\kappa_0)$, $(e_i)_{i=1}^p$ is an orthonormal basis of $\mathbb{R}^p$, and $\Sigma^{1/2}:\mathbb{R}^p\to\mathbb{R}^p$ satisfies Equation 3.1, then for every $\varepsilon\in(0,1)$,
$$\mathbb{P}\Biggl(\Bigl|\Bigl\{i\le p:\ \bigl|\bigl\langle\Sigma^{1/2}Z,e_i\bigr\rangle\bigr|\ \ge\ \varepsilon\sqrt{\frac{\operatorname{tr}(\Sigma)}{2p}}\Bigr\}\Bigr|\ \le\ c_0p\Biggr)\ \le\ \sum_{j\le\ell}\bigl(eL\varepsilon c_5^{1/2}\bigr)^{|\sigma_j|/2},$$
where $\sigma_j$, $c_5$ and $\ell$ are as in Lemma 4.2. This lemma is not stated explicitly in [22].

Using it, we can prove Theorem 4.2.

Proof of Theorem 4.2.
Theorem 4.2 follows by combining Lemma 4.4 with Lemma 4.3 and Lemma 4.2.
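As an illustration of the quantity controlled by Theorem 4.2, the following simulation (assumed diagonal covariance and a heavy-tailed isotropic $Z$) computes the fraction of coordinates $i$ with $|\langle\Sigma^{1/2}Z,e_i\rangle|\ge\varepsilon\sqrt{\operatorname{tr}(\Sigma)/(2p)}$; in line with the theorem, this fraction is typically a constant proportion of $p$.

```python
import numpy as np

rng = np.random.default_rng(3)
p, n = 400, 2000
lam = np.exp(-np.arange(p) / 50.0) + 1e-3            # illustrative spectrum with a flat tail (assumption)
eps = 0.1
threshold = eps * np.sqrt(lam.sum() / (2 * p))       # eps * sqrt(tr(Sigma) / (2p))

# heavy-tailed isotropic Z: independent Student-t(3) coordinates rescaled to unit variance
Z = rng.standard_t(df=3, size=(n, p)) / np.sqrt(3.0)
coords = np.abs(Z * np.sqrt(lam))                    # |<Sigma^{1/2} Z, e_i>| for Sigma = diag(lam)
frac_large = np.mean(coords >= threshold, axis=1)    # per-sample fraction of "large" coordinates

print(np.quantile(frac_large, [0.01, 0.5]))          # typically a constant fraction of the p coordinates
```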
4.2 Proof of Lemma 4.1

In this subsection we carry out steps 2 and 3: we build an epsilon-net on $S^{p-1}$, obtain a uniform lower bound for the smallest singular value on the net, and then extend it to the whole sphere $S^{p-1}$.

Proof of Lemma 4.1.
Fix a random vector $X_j\in\mathbb{R}^p$ and consider the $p$ unit vectors $(e_i)_{i=1}^p$ forming an orthonormal basis of $\mathbb{R}^p$. By Theorem 4.2, with probability at least $1-\bar R_k(\Sigma)$ there exists a subset $\sigma_j$ of cardinality at least $c_0p$ such that for all $e_i\in\sigma_j$,
$$\bigl|\bigl\langle\Sigma^{1/2}Z_j,e_i\bigr\rangle\bigr|\ \ge\ \varepsilon\sqrt{\frac{\operatorname{tr}(\Sigma)}{2p}}.\tag{4.8}$$
Thus each $X_j$ admits such a subset $\sigma_j\subset(e_i)_{i=1}^p$ of cardinality at least $c_0p$ with probability at least $1-\bar R_k(\Sigma)$. Pick $t\in(e_i)_{i=1}^p$ at random. If $t\in\sigma_j$, then Equation 4.8 holds, that is, we have a lower bound on the inner product.

Denote $\mathbb{1}_j:=\mathbb{1}\{t\notin\sigma_j\}$; then $\mathbb{E}\mathbb{1}_j\le1-c_0\bigl(1-\bar R_k(\Sigma)\bigr)=c_0\bar R_k(\Sigma)+1-c_0$. By Bernstein's inequality, with probability at least $1-\exp\bigl(-\min\{t_0^2,t_0\}N\bigr)$,
$$\frac1N\sum_{j=1}^N\mathbb{1}_j\ \le\ \mathbb{E}\mathbb{1}_j+t_0\ \le\ \bigl(c_0\bar R_k(\Sigma)+1-c_0\bigr)+t_0.$$
Setting $t_0=\bigl(c_0\bar R_k(\Sigma)+1-c_0\bigr)/2$,
with probability at least $1-\exp\bigl(-N\bigl(c_0\bar R_k(\Sigma)+1-c_0\bigr)/2\bigr)$ we have $\sum_{j=1}^N\mathbb{1}_j\le\frac32N\bigl(c_0\bar R_k(\Sigma)+1-c_0\bigr)$;
that is to say,
$$\sqrt{\sum_{j=1}^N\bigl\langle\Sigma^{1/2}Z_j,u\bigr\rangle^2}\ \ge\ \varepsilon\sqrt{\frac{c_0\bigl(1-\bar R_k(\Sigma)\bigr)}{2}N}\;\sqrt{\frac{\operatorname{tr}(\Sigma)}{2p}},\qquad\forall\,u\in\sigma_j.$$
Now build an $\eta$-net $\Gamma_\eta$ of $B^p$ and set
$$\eta=\frac{\varepsilon}{2\|\Gamma\|}\sqrt{\frac{c_0\bigl(1-\bar R_k(\Sigma)\bigr)}{2}}\,\sqrt{\frac{N\operatorname{tr}(\Sigma)}{2p}}.$$
Since $\log|\Gamma_\eta|\le p\log(1+2/\eta)$, we have
$$\log|\Gamma_\eta|\ \le\ p\log\Biggl(1+\frac{4\|\Gamma\|}{\varepsilon}\,\bigl(c_0(1-\bar R_k(\Sigma))\bigr)^{-1/2}\sqrt{\frac{4p}{N\operatorname{tr}(\Sigma)}}\Biggr).$$
We just need to ensure
$$p\log\Biggl(\frac{C\,\|\Gamma\|}{\varepsilon}\,\bigl(c_0(1-\bar R_k(\Sigma))\bigr)^{-1/2}\sqrt{\frac{p}{N\operatorname{tr}(\Sigma)}}\Biggr)\ \le\ \frac{N\bigl(c_0\bar R_k(\Sigma)+1-c_0\bigr)}{2}+\log(c_0p)\tag{4.9}$$
by choosing $k$ wisely. Indeed, the selection of the direction $t$ has to be repeated at most $\lceil|\Gamma_\eta|/(c_0p)\rceil$ times, so to make Equation 4.8 hold uniformly over all elements of $\Gamma_\eta$ we must pay a factor $\log\bigl(\lceil|\Gamma_\eta|/(c_0p)\rceil\bigr)$ inside the exponential term $\exp\bigl(-N\bigl(c_0\bar R_k(\Sigma)+1-c_0\bigr)/2\bigr)$
and make sure the resulting probability does not exceed 1.

For an arbitrary $t\in S^{p-1}$, choosing $u\in\Gamma_\eta$ with $\|t-u\|_{\ell_2}\le\eta$ gives $\|\mathbb{X}t\|_{\ell_2}\ge\|\mathbb{X}u\|_{\ell_2}-\eta\|\Gamma\|$, and the choice of $\eta$ makes the second term of smaller order than the first. For the upper bound on $\|\Gamma\|$, the operator norm of the design matrix, we use Lemma 2.2: with probability at least $1-\exp(-c_2N)$,
$$\|\Gamma\|=\sqrt{\max_{t\in S^{p-1}}\sum_{i=1}^N\langle\Gamma_{\cdot,i},t\rangle^2}\ \le\ \sqrt N\,\sqrt{C\Bigl(\frac{d_q(D)\tilde\Lambda(D)}{\sqrt{k_D}}+\frac{\tilde\Lambda^2(D)}{k_D}\Bigr)+\lambda_1(\Sigma)}.$$
By choosing $k=k^*$ as defined in Equation 3.2, the probability of the event above can be lower bounded by $1-\exp(-\nu)$, where $\nu$ is defined in Equation 3.3. In summary, with probability at least $1-\exp(-c_2N)-\exp(-\nu)$,
$$s_{\min}(\mathbb{X})\ \ge\ \varepsilon\sqrt{\frac{c_0(1-c_1)}{2}}\,\sqrt N\,\sqrt{\frac{\operatorname{tr}(\Sigma)}{2p}}.$$
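A Monte Carlo illustration (with an assumed spectrum and sample sizes) of the behaviour guaranteed by Lemma 4.1: the smallest singular value of the $N\times p$ design $\mathbb{X}=Z\Sigma^{1/2}$ is of order $\sqrt{N\operatorname{tr}(\Sigma)/p}$ once $p\gg N$, even for heavy-tailed rows.

```python
import numpy as np

rng = np.random.default_rng(4)
N, p, reps = 50, 2000, 20
lam = np.exp(-np.arange(p) / 400.0) + 1e-2       # illustrative spectrum with a heavy eigenvalue tail
rate = np.sqrt(N * lam.sum() / p)                # sqrt(N tr(Sigma) / p), the rate in Lemma 4.1

ratios = []
for _ in range(reps):
    Z = rng.standard_t(df=5, size=(N, p)) / np.sqrt(5 / 3)   # heavy-tailed isotropic rows
    X = Z * np.sqrt(lam)                                     # rows X_i^T = Z_i^T Sigma^{1/2}, Sigma = diag(lam)
    s_min = np.linalg.svd(X, compute_uv=False)[-1]           # smallest of the N singular values
    ratios.append(s_min / rate)

print(np.mean(ratios), np.min(ratios))           # stays bounded away from zero across repetitions
```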
5 Prediction error

In this section we obtain an upper bound on the prediction error, based on the upper bound on the estimation risk, using the localization method introduced in subsection 2.1.
Theorem 5.1 (Prediction error). If the random vector $X=\Sigma^{1/2}Z\in\mathbb{R}^p$ satisfies the wSBA with constants $(L,\kappa_0)$ and $\Sigma$ satisfies Equation 3.1, then with probability at least $1-\exp(-\nu)-2\exp(-cN)$ the prediction error satisfies
$$\bigl\|\Sigma^{1/2}(\hat\alpha-\alpha^*)\bigr\|_{\ell_2}\ \le\ r^*,$$
where $c$ is an absolute constant.

The proof is a localization argument: we show that $\hat\alpha-\alpha^*$ lies in a localized set with respect to $\|\cdot\|_\Sigma$. First, we need a localization lemma from [8].

Lemma 5.1. With probability at least $1-\exp(-N/2)$, $P_N\mathcal{L}_{\hat\alpha}\le-\|\xi\|_{\psi_2}^2/2$.
Moreover, for any $r$, let $\Omega_{r,\rho}$ denote the event
$$\Omega_{r,\rho}=\Bigl\{\text{for every }\alpha\in\mathbb{R}^p\text{ with }\alpha-\alpha^*\in B(\rho)\setminus B_\Sigma(r):\ P_N\mathcal{L}_\alpha>-\tfrac12\|\xi\|_{\psi_2}^2\Bigr\}.$$
On the event
$$\Omega_{r,\rho}\cap\bigl\{\hat\alpha-\alpha^*\in B(\rho)\bigr\}\cap\Bigl\{P_N\mathcal{L}_{\hat\alpha}\le-\tfrac12\|\xi\|_{\psi_2}^2\Bigr\},\tag{5.1}$$
the prediction risk is bounded by $r$, that is,
$$\bigl\|\Sigma^{1/2}(\hat\alpha-\alpha^*)\bigr\|_{\ell_2}\ \le\ r.$$
Lemma 5.1 reduces the upper bound on the prediction risk to Equation 5.1. The probability of the event $\{\hat\alpha-\alpha^*\in B(\rho)\}$ is controlled by the estimation error, see Theorem 4.1. Recalling that $H_{r,\rho}=B(\rho)\cap B_\Sigma(r)$, it remains to prove that the event
$$\inf_{\alpha:\ \alpha-\alpha^*\in H_{r,\rho}}P_N\mathcal{L}_\alpha\ >\ -\tfrac12\|\xi\|_{\psi_2}^2$$
holds with high probability for $r\ge r^*$.

We now lower bound $P_N\mathcal{L}_\alpha$ through the upper bounds on the quadratic and multiplier processes, according to Equation 2.2. First, the quadratic process: by Lemma 2.2, with probability at least $1-\exp(-cN)$,
$$Q_{r,\rho}\ \lesssim\ \frac{d_q(H_{r,\rho})\,\tilde\Lambda(H_{r,\rho})}{\sqrt p}+\frac{\tilde\Lambda^2(H_{r,\rho})}{p}.\tag{5.2}$$
Second, the multiplier component: by Lemma 2.1, with probability at least $1-2\exp(-cN)$,
$$M_{r,\rho}\ \lesssim\ \|\xi\|_{\psi_2}\,\frac{\tilde\Lambda(H_{r,\rho})}{\sqrt p},\tag{5.3}$$
since $\xi$ is centered and independent of $X$. Therefore, when $r\ge r^*$ we have, at the level $r^*$,
$$d_q(H_{r^*,\rho})\ \le\ \zeta_2r^*,\qquad\tilde\Lambda(H_{r^*,\rho})\ \le\ \zeta_1\sqrt p\,r^*.$$

Proof of Theorem 5.1.
Let $\alpha\in\alpha^*+H_{r,\rho}$ and recall that $r=\|\Sigma^{1/2}(\alpha-\alpha^*)\|_{\ell_2}$. If $r\ge r^*$, substituting Equations 5.3 and 5.2 into Equation 2.2 gives
$$\inf_{\alpha\in\alpha^*+H_{r,\rho}}P_N\mathcal{L}_\alpha\ \ge\ (r^*)^2\bigl(\theta^{-2}-c\,\zeta_1\zeta_2\,\theta^{-2}-c\,\zeta_1^2\,\theta^{-2}-c\,\zeta_1\,\theta^{-1}\bigr)-\tfrac12\|\xi\|_{\psi_2}^2.$$
Choosing $\zeta_1,\zeta_2$ small enough, the right-hand side is greater than $-\tfrac12\|\xi\|_{\psi_2}^2$. By Lemma 5.1, Theorem 5.1 is proved.

6 Discussion

In this section we discuss two aspects. First, we discuss why it is so difficult to investigate benign overfitting beyond the linear model. Second, we discuss a benign overfitting result without truncated effective rank.
6.1 Beyond the linear model

In this subsection we imagine that the statistical model $F$ is the affine hull of sub-classes $(F_j)_{j=1}^d$; that is, for every $f\in F$ there exist $f_j\in F_j$ and $\alpha_j\in\mathbb{R}$ such that $f=\sum_{j=1}^d\alpha_jf_j$. Denote $\alpha=(\alpha_1,\dots,\alpha_d)\in\mathbb{R}^d$. Of course this is not the problem we deal with in this paper, but considering such a general case helps to understand the role of $\alpha$ and the difficulty of generalizing benign overfitting beyond the linear model.

Even in this simple "additive model" case, benign overfitting is much more difficult. First, $\hat f$ interpolates $(X_i,Y_i)_{i=1}^N$, but the $\hat f_j$ need not interpolate them; in fact they may differ a lot. For instance,
$\hat f(x)=-x+x=\hat f_1(x)+\hat f_2(x)$ interpolates the point $(10,0)$, but $\hat f_1(10)=-\hat f_2(10)$. It is therefore a difficult task to derive an oracle inequality by studying the $F_j$.

If we minimize $\|\alpha\|_{\ell_2}$ as in the linear case, the minimization of $\|\alpha\|_{\ell_2}$ under the interpolation constraint $\sum_{j=1}^d\alpha_jf_j(X_i)=Y_i$ for all $i=1,\dots,N$ can again be solved by the Moore–Penrose inverse. Condition on $(f_j)_{j=1}^d$ and denote
$$\Gamma=\begin{pmatrix}f_1(X_1)&f_2(X_1)&\cdots&f_d(X_1)\\f_1(X_2)&f_2(X_2)&\cdots&f_d(X_2)\\\vdots&\vdots&\ddots&\vdots\\f_1(X_N)&f_2(X_N)&\cdots&f_d(X_N)\end{pmatrix}\in\mathbb{R}^{N\times d}.$$
Denote $f^*=(f^*(X_1),\dots,f^*(X_N))$ and $\xi=(\xi_1,\dots,\xi_N)$. The interpolation condition is then equivalent to
$$\text{minimize }\|\alpha\|_{\ell_2}\quad\text{s.t.}\quad\Gamma\alpha=Y.$$
We assume that an $\alpha$ satisfying the interpolation condition always exists. Using the Moore–Penrose inverse,
$$\hat\alpha=\Gamma^\dagger Y=\Gamma^\dagger f^*+\Gamma^\dagger\xi.$$
Therefore, to establish an upper bound on $\|\hat\alpha\|_{\ell_2}$ we need a lower bound on the smallest singular value of $\Gamma$. However, as we saw in Lemma 4.1, the smallest singular value increases as $N$ increases, so $\|\Gamma^\dagger f^*\|_{\ell_2}$ decreases. This phenomenon is called "signal bleed" in [24]: the influence of the signal $f^*$ declines, so that minimizing $\|\hat\alpha\|_{\ell_2}$ cannot reflect the properties of the true signal unless some unrealistic restrictions are imposed. Therefore $f^*$ should balance $\Gamma^\dagger$ as $N$ increases in order to avoid signal bleed. This is exactly what happens in linear regression, where $\Gamma=\mathbb{X}$; this explains why we work with the linear model.

6.2 Benign overfitting without truncated effective rank

We now try to establish benign overfitting without truncated effective rank, relying instead on the stable rank $r_0(\Sigma)=\operatorname{tr}(\Sigma)/\lambda_1(\Sigma)$, see Equation 1.1. Recall that a linear model on $T\subset\mathbb{R}^p$ is $F_T=\{\langle\cdot,t\rangle:\ t\in T\}$. Let $\sigma=(X_1,\dots,X_N)$; then the projection of $F_T$ by $\sigma$ is a random linear image of $T$, namely $P_\sigma F_T=\mathbb{X}T$, where $P_\sigma(f_t)=(\langle X_i,t\rangle)_{i=1}^N$. We need a lower bound on the smallest singular value of $\mathbb{X}$ to derive an upper bound on the estimation error, and a lower bound on the quadratic component in Equation 2.1. Fortunately, both can be obtained from the Dvoretzky–Milman Theorem, see [1] or [19]. The Dvoretzky–Milman Theorem holds for rather heavy-tailed random vectors, but for the sake of simplicity we assume $(g_i)_{i=1}^N$ are i.i.d. gaussian random vectors in $\mathbb{R}^p$.

Lemma 6.1. There exist absolute constants $c_1,c_2$ such that the following holds. If $0<\delta<1$,
$$N\ \le\ c_1\frac{\delta^2}{\log(1/\delta)}\,r_0(\Sigma),\tag{6.1}$$
and $\Gamma=\sum_{i=1}^N\langle g_i,\cdot\rangle e_i$, where $(e_i)_{i=1}^N$ is an orthonormal basis of $\mathbb{R}^N$, then with probability at least $1-2\exp\bigl(-c_2r_0(\Sigma)\delta^2/\log(1/\delta)\bigr)$,
$$(1-\delta)\sqrt{\operatorname{tr}(\Sigma)}\,B^N\ \subset\ \Gamma\bigl(\Sigma^{1/2}B^p\bigr).$$
Take $\delta=1/2$; then $\frac12\sqrt{\operatorname{tr}(\Sigma)}\,B^N\subset\Gamma(\Sigma^{1/2}B^p)$, so
$$s_{\min}\bigl(\Gamma\Sigma^{1/2}\bigr)=\min_{t\in S^{p-1}}\bigl\|\Gamma\Sigma^{1/2}t\bigr\|_{\ell_2}\ \ge\ \frac12\sqrt{\operatorname{tr}(\Sigma)}$$
holds with probability at least $1-2\exp(-cr_0(\Sigma))$. Therefore, with probability at least $1-2\exp(-c_1r_0(\Sigma))-2\exp(-c_2N)$,
$$\|\hat\alpha-\alpha^*\|_{\ell_2}\ \le\ \|\alpha^*\|_{\ell_2}+c\,\|\xi\|_{\psi_2}\sqrt{\frac{N}{\operatorname{tr}(\Sigma)}}.$$
As for the prediction risk: when $r\ge r^*$,
$$\inf_{\alpha\in\alpha^*+H_{r,\rho}}P_N\mathcal{L}_\alpha\ \ge\ r^2\Bigl(\frac{\operatorname{tr}(\Sigma)}{4N}-\zeta_1\Bigr)-\tfrac12\|\xi\|_{\psi_2}^2\ >\ -\tfrac12\|\xi\|_{\psi_2}^2.$$
From here on the proof is the same as that of Theorem 5.1, and the details are omitted. Note that $r_0(\Sigma)\le p$ and $p=cN\log(1/\varepsilon)$, so we can choose $c_1$ and $\delta$ appropriately to adapt this to the example discussed in subsection 3.1.

In summary, although interpolation learning suffers from having to estimate both the noise $\xi$ and the signal $\alpha^*$, it still generalizes well provided the smallest singular value of $\mathbb{X}$ is large enough to absorb the level of the noise, $\sqrt N\|\xi\|_{\psi_2}$, see Equation 4.1. The smallest singular value is what weakens the influence of the noise.
To make the smallest singular value large enough, the sample size must satisfy an upper bound that depends on the covariance of the input vector. This threshold balances the rate of exponential decay (obtained by a concentration or small-ball argument) against the metric entropy (given by the net argument); it therefore depends on the dimension $p$, the sample size $N$ and the covariance $\Sigma$. If we fix the relationship between $p$ and $N$ (as in the example of subsection 3.1), we need $\Sigma$ to have a large trace (or at least a heavy tail of eigenvalues), which is the key to benign overfitting. Note that in this interpretation there is no restriction on the concentration properties of the input vector $X$, only on its small-ball behaviour: $X$ should be fully spread out on its margins. It is this spreading that absorbs the noise $\xi$, and it is this that makes the minimum-$\ell_2$ linear interpolant fit the heavy-tailed case. Finally, we believe that our result could easily be modified to the "informative outlier" framework, cf. [7], to obtain a result with a "robust flavor" for both the computer science and the statistics communities.

References

[1] Artstein-Avidan, S., Giannopoulos, A. and Milman, V. D. (2015). Asymptotic Geometric Analysis, Part I. American Mathematical Society, Providence. MR3331351.
[2] Bartlett, P. L., Long, P. M., Lugosi, G. and Tsigler, A. (2020). Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, April 2020, 201907378.

[3] Belkin, M., Ma, S. and Mandal, S. (2018). To understand deep learning we need to understand kernel learning. Proceedings of the 35th International Conference on Machine Learning (ICML 2018).

[4] Belkin, M., Rakhlin, A. and Tsybakov, A. B. (2019). Does data interpolation contradict statistical optimality? AISTATS 2019.

[5] Bhatia, R. (1997). Matrix Analysis. Springer-Verlag, New York. MR1477662.

[6] Boucheron, S., Lugosi, G. and Massart, P. (2013). Concentration Inequalities: A Nonasymptotic Theory of Independence. 1st ed., Oxford University Press. MR3185193.

[7] Chinot, G., Lecué, G. and Lerasle, M. (2020). Statistical learning with Lipschitz and convex loss functions. Probability Theory and Related Fields, 897–940. MR4087486.

[8] Chinot, G. and Lerasle, M. (2020). Benign overfitting in the large deviation regime. arXiv preprint arXiv:2003.05838.

[9] Dirksen, S. (2015). Tail bounds via generic chaining. Electronic Journal of Probability. MR3354613.

[10] Hastie, T., et al. (2019). Surprises in high-dimensional ridgeless least squares interpolation. arXiv preprint arXiv:1903.08560.

[11] Kallenberg, O. (2002). Foundations of Modern Probability. 2nd ed., Springer-Verlag, New York. MR1876169.

[12] Koltchinskii, V. and Lounici, K. (2017). Concentration inequalities and moment bounds for sample covariance operators. Bernoulli, 110–133. MR3556768.

[13] Koltchinskii, V. and Mendelson, S. (2015). Bounding the smallest singular value of a random matrix without concentration. Int. Math. Res. Not. IMRN, 12991–13008. MR3431642.

[14] Liang, T. and Rakhlin, A. (2020). Just interpolate: kernel "ridgeless" regression can generalize. Annals of Statistics. MR4124325.

[15] Liang, T., Rakhlin, A. and Zhai, X. (2020). On the multiple descent of minimum-norm interpolants and restricted lower isometry of kernels. Conference on Learning Theory (COLT), 2020.

[16] Rakhlin, A. and Zhai, X. (2019). Consistency of interpolation with Laplace kernels is a high-dimensional phenomenon. Conference on Learning Theory (COLT), 2019.

[17] Mei, S. and Montanari, A. (2019). The generalization error of random features regression: precise asymptotics and the double descent curve. Submitted to Communications on Pure and Applied Mathematics.

[18] Størmer, E. (2013). Positive Linear Maps of Operator Algebras. Springer Monographs in Mathematics. MR3012443.

[19] Mendelson, S. (2016a). Dvoretzky type theorems for subgaussian coordinate projections. J. Theoret. Probab., 1644–1660. MR3571258.

[20] Mendelson, S. (2016b). Learning without concentration for general loss functions. Probability Theory and Related Fields, 459–502. MR3800838.

[21] Mendelson, S. (2016c). Upper bounds on product and multiplier empirical processes. Stochastic Processes and their Applications, 3652–3680. MR3565471.

[22] Mendelson, S. and Paouris, G. (2019). Stable recovery and the coordinate small-ball behaviour of random vectors. arXiv preprint arXiv:1904.08532.

[23] Meyer, C. D. (2000). Matrix Analysis and Applied Linear Algebra. Society for Industrial and Applied Mathematics (SIAM). MR1777382.

[24] Muthukumar, V., Vodrahalli, K., Subramanian, V. and Sahai, A. (2020). Harmless interpolation of noisy data in regression. IEEE Journal on Selected Areas in Information Theory.

[25] Naor, A. and Youssef, P. (2017). Restricted invertibility revisited. In: A Journey Through Discrete Mathematics, Springer, Cham, 657–691. MR3726618.

[26] Rudelson, M. and Vershynin, R. (2007). Sampling from large matrices: an approach through geometric functional analysis. Journal of the ACM, Art. 21, 19 pp. MR2351844.

[27] Talagrand, M. (2014). Upper and Lower Bounds for Stochastic Processes: Modern Methods and Classical Problems. Springer Science & Business Media. MR3184689.

[28] Tsigler, A. and Bartlett, P. (2020). Benign overfitting in ridge regression. arXiv preprint arXiv:2009.14286.

[29] van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer Series in Statistics. MR1385671.

[30] Vershynin, R. (2018). High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press, New York. MR3837109.

[31] Zhang, C., et al. (2016). Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530.