Failures of model-dependent generalization bounds for least-norm interpolation
Peter L. Bartlett
UC Berkeley
[email protected]

Philip M. Long
[email protected]
Abstract
We consider bounds on the generalization performance of the least-norm linear regressor, in the over-parameterized regime where it can interpolate the data. We describe a sense in which any generalization bound of a type that is commonly proved in statistical learning theory must sometimes be very loose when applied to analyze the least-norm interpolant. In particular, for a variety of natural joint distributions on training examples, any valid generalization bound that depends only on the output of the learning algorithm, the number of training examples, and the confidence parameter, and that satisfies a mild condition (substantially weaker than monotonicity in sample size), must sometimes be very loose: it can be bounded below by a constant when the true excess risk goes to zero.
1 Introduction

Deep learning methodology has revealed some striking deficiencies of classical statistical learning theory: large neural networks, trained to zero empirical risk on noisy training data, have good predictive accuracy on independent test data. These methods are overfitting (that is, fitting to the training data better than the noise should allow), but the overfitting is benign (that is, prediction performance is good). It is an important open problem to understand why this is possible.

The presence of noise is key to why the success of interpolating algorithms is mysterious. Generalization of algorithms that produce a perfect fit in the absence of noise has been studied for decades (see [Haussler, 1992] and its references). A number of recent papers have provided generalization bounds for interpolating algorithms in the absence of noise, either for deep networks or in abstract frameworks motivated by deep networks [Li and Liang, 2018, Arora et al., 2019, Cao and Gu, 2019, Feldman, 2020]. The generalization bounds in these papers either do not hold or become vacuous in the presence of noise: Assumption A1 in [Li and Liang, 2018] rules out noisy data; the data-dependent bound in Arora et al. [2019, Theorem 5.1] becomes vacuous when independent noise is added to the y_i; adding a constant level of independent noise to the y_i in [Cao and Gu, 2019, Theorem 3.3] gives an upper bound on excess risk that is at least a constant; and the analysis in [Feldman, 2020] concerns the noise-free case.

There has also been progress on bounding the gap between average loss on the training set and expected loss on independent test data, based on uniform convergence arguments that bound the complexity of classes of real-valued functions computed by deep networks. For instance, the results in [Bartlett, 1998] for sigmoid nonlinearities rely on ℓ_1-norm bounds on parameters throughout the network, and those in [Bartlett et al., 2017] for ReLUs rely on spectral norm bounds on the weight matrices throughout the network (see also the refinements in [Bartlett and Mendelson, 2002, Neyshabur et al., 2015, Bartlett et al., 2017, Golowich et al., 2018, Long and Sedghi, 2019]). These bounds involve distribution-dependent function classes, since they depend on some measure of the complexity of the output model that may be expected to be small for natural training data. For instance, if some training method gives weight matrices that all have small spectral norm, the bound in [Bartlett et al., 2017] will imply that the gap between empirical risk and predictive accuracy will be small. But while it is possible for these bounds to be small for networks that trade off fit to the data with complexity in some way, it is not clear that a network that interpolates noisy data could ever have small values of these complexity measures. This raises the question: are there any good data-dependent bounds for interpolating networks?

Zhang et al. [2017] claimed, based on empirical evidence, that conventional learning-theoretic tools are useless for deep networks, but they considered the case of a fixed class of functions defined, for example, as the set of functions that can be computed by a neural network, or that can be reached by stochastic gradient descent with some training data, no matter how unlikely. These observations illustrate the need to consider distribution-dependent notions of complexity in understanding the generalization performance of deep networks.
The study of such distribution-dependent complexity notions has a long history in nonparametric statistics, where it is central to the problem of model selection [see, for example, Bartlett et al., 2002, and its references]; uniform convergence analysis over a level in a complexity hierarchy is part of a standard outline for analyzing model selection methods.

Nagarajan and Kolter [2019] provided an example of a scenario where, with high probability, an algorithm generalizes well, but two-sided uniform convergence fails for any hypothesis space that is likely to contain the algorithm's output. Their analysis takes an important step in allowing distribution-dependent notions of complexity, but only rules out the application of a specific set of tools: uniform convergence over a model class of the absolute differences between expectations and sample averages. Indeed, in their proof, the failure is an under-estimation of the accuracy of a model: a model has good predictive accuracy, but performs poorly on a sample (one obtained as a transformed but equally likely version of the sample that was used to train the model). However, in applying uniform convergence tools to show good performance of an algorithm, uniform bounds are only needed to show that bad models are unlikely to perform well on the training data. So if one wishes to provide bounds that guarantee that an algorithm has poor predictive accuracy, Nagarajan and Kolter [2019] provided an example where uniform convergence tools will not suffice. In contrast, we are concerned with understanding what tools can provide guarantees of good predictive accuracy of interpolating algorithms.

In this paper, motivated by the phenomenon of benign overfitting in deep networks, we consider a simpler setting where the phenomenon occurs, that of linear regression. (Lemma 5.2 of [Negrea et al., 2020] adapts the construction of [Nagarajan and Kolter, 2019] to show a similar failure of uniform convergence in this context, and similarly cannot shed light on tools that can or cannot guarantee good predictive accuracy.) We study the minimum norm linear interpolant. Earlier work [Bartlett et al., 2020] provides tight upper and lower bounds on the excess risk of this interpolating prediction rule under suitable conditions on the probability distribution generating the data, showing that benign overfitting depends on the pattern of eigenvalues of the population covariance (and there is already a rich literature on related questions [Liang and Rakhlin, 2020, Belkin et al., 2019b,a, Hastie et al., 2019a,b, Negrea et al., 2020, Dereziński et al., 2019, Li et al., 2020, Tsigler and Bartlett, 2020]). These risk bounds involve fine-grained properties of the distribution. Is this knowledge necessary? Is it instead possible to obtain data-dependent bounds for interpolating prediction rules? Already the proof in [Bartlett et al., 2020] provides some clues that this might be difficult: when benign overfitting occurs, the eigenvalues of the empirical covariance are a very poor estimate of the true covariance eigenvalues; all but a small fraction (the largest eigenvalues) are within a constant factor of each other.
In this paper, we show that in these settings there cannot be good risk bounds based on data-dependent function classes, in a strong sense: for linear regression with the minimum norm prediction rule, any "bounded-antimonotonic" model-dependent error bound that is valid for a sufficiently broad set of probability distributions must be loose (too large by an additive constant) for some (rather innocuous) probability distribution. The bounded-antimonotonic condition formalizes the mild requirement that the bound does not degrade very rapidly with additional data. Aside from this constraint, our result applies for any bound that is determined as a function of the output of the learning algorithm, the number of training examples, and the confidence parameter. This function could depend on the level in a hierarchy of models where the output of the algorithm lies. Our result applies whether the bound is obtained by uniform convergence over a level in the hierarchy, or in some other way.

The intuition behind our result is that benign overfitting can only occur when the test distribution has a vanishing overlap with the training data. Indeed, interpolating the data in the training sample guarantees that the conditional expectation of the prediction rule's loss on the training points that occur once must be at least the noise level. Using a Poissonization method, we show that a situation where the training sample forms a significant fraction of the support of the distribution is essentially indistinguishable from a benign overfitting situation where the training sample has measure zero. Since we want a data-dependent bound to be valid in both cases, it must be loose in the second case.

2 Definitions

We consider prediction problems with patterns x ∈ ℓ_2 and labels y ∈ R, where ℓ_2 is the space of square-summable sequences of real numbers. In fact, all probability distributions that we consider have support restricted to a finite-dimensional subspace of ℓ_2, which we identify with R^d for an appropriate d. For a joint distribution P over R^d × R and a hypothesis h : R^d → R, define the risk of h to be

R_P(h) = E_{(x,y)∼P}[ (y − h(x))² ].

Let R*_P be the minimum of R_P over measurable functions.

For any positive integer k, a distribution D over R^k is sub-Gaussian with parameter σ if, for any u ∈ R^k,

E_{x∼D}[ exp(u · (x − E x)) ] ≤ exp( ‖u‖² σ² / 2 ).

A joint distribution P over R^d × R has unit scale if (X_1, ..., X_d, Y) ∼ P is sub-Gaussian with parameter 1. It is innocuous if
• it is unit scale,
• the marginal on (X_1, ..., X_d) is Gaussian, and
• the conditional distribution of Y given (X_1, ..., X_d) is continuous.

A sample is a finite multiset of elements of R^d × R. A least-norm interpolation algorithm takes as input a sample, and outputs θ ∈ R^d that minimizes ‖θ‖ subject to

Σ_i (θ · x_i − y_i)² = min_{θ̂ ∈ R^d} Σ_i (θ̂ · x_i − y_i)².

We will refer both to the parameter vector θ output by the least-norm interpolation algorithm and to the function x ↦ θ · x parameterized by θ as the least-norm interpolant.

A function ε(h, n, δ) mapping a hypothesis h, a sample size n and a confidence δ to a positive real number is a uniform model-dependent bound for unit-scale distributions if, for all unit-scale joint distributions P and all sample sizes n, with probability at least 1 − δ over the random choice of S ∼ P^n, for the least-norm interpolant h of S, we have

R_P(h) − R*_P ≤ ε(h, n, δ).

The bound ε is c-bounded antimonotonic for c ≥ 1 if, for all h, δ, n_1 and n_2, if n_2/2 ≤ n_1 ≤ n_2 then ε(h, n_2, δ) ≤ c ε(h, n_1, δ).
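For concreteness, here is a minimal numerical sketch (ours, not from the paper) of the least-norm interpolation algorithm just defined; the dimensions, noise level, and helper name are illustrative assumptions. The Moore-Penrose pseudoinverse returns the minimum-norm solution among all least-squares solutions, which matches the definition above.

import numpy as np

def least_norm_interpolant(X, y):
    """Return theta minimizing ||theta|| among the minimizers of sum_i (theta . x_i - y_i)^2."""
    # The pseudoinverse gives the minimum-norm least-squares solution; when
    # n < d and X has full row rank, X @ theta = y exactly (interpolation).
    return np.linalg.pinv(X) @ y

# Illustrative over-parameterized example: n = 5 noisy examples, d = 50.
rng = np.random.default_rng(0)
n, d = 5, 50
X = rng.normal(size=(n, d))
theta_star = np.zeros(d)
theta_star[0] = 1.0
y = X @ theta_star + 0.5 * rng.normal(size=n)  # noisy labels
theta = least_norm_interpolant(X, y)
print(np.max(np.abs(X @ theta - y)))           # ~0: the noisy data are interpolated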
This enforces that the bound cannot get too much worse too quickly with more data. If ε(h, ·, δ) is monotone decreasing for all h and δ, then it is 1-bounded antimonotonic.

A set B ⊆ N is β-dense if

liminf_{N→∞} |B ∩ {1, ..., N}| / N ≥ β.

Say that B ⊆ N is strongly β-dense beyond n_0 if, for all s ∈ N such that s² ≥ n_0,

|B ∩ {s², ..., (s+1)² − 1}| / (2s + 1) ≥ β.

(Notice that if a set is strongly β-dense beyond n_0, then it is β-dense.)
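As a quick illustration of these density notions (our example, not the paper's): the set of natural numbers not divisible by 10 occupies a fraction of roughly 0.9 of every bin {s², ..., (s+1)² − 1} once s is moderately large, so it is strongly β-dense beyond a small n_0 for any β a little below 0.9.

def bin_density(in_B, s):
    """Fraction of the bin {s^2, ..., (s+1)^2 - 1}, which has 2s + 1 elements,
    that belongs to the set B, given as a membership predicate."""
    members = sum(in_B(r) for r in range(s * s, (s + 1) * (s + 1)))
    return members / (2 * s + 1)

for s in (5, 20, 100):
    print(s, bin_density(lambda r: r % 10 != 0, s))  # tends to 0.9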
3 Main results

The following is our main result.

Theorem 1  There are innocuous distributions P_1, P_2, ..., such that, if ε is a bounded-antimonotonic, uniform model-dependent bound for mean-zero unit-scale distributions, then there are constants c_1, c_2, c_3, c_4, c_5 > 0 such that, for all 0 < δ < c_1, the least-norm interpolant h satisfies, for all large enough n,

Pr_{S∼P_n^n}[ R_{P_n}(h) − R*_{P_n} ≤ c_2 √(log(1/δ)/n) ] ≥ 1 − δ,

but nonetheless the set of n such that

Pr_{S∼P_n^n}[ ε(h, n, δ) > c_3 ] ≥ 1/2

is strongly (1 − c_4 δ)-dense beyond c_5 log(1/δ).

This is a consequence of a more general theorem, which describes a variety of cases in which the success of the least-norm interpolant is not reflected in the bound ε.
Theorem 2  For any covariance matrices Σ_1, Σ_2, ... with ‖Σ_s‖ ≤ 1/2 for all s, and any unit-length θ*_1, θ*_2, ..., there are innocuous P_1, P_2, ... such that the following holds. If ε is a bounded-antimonotonic, uniform model-dependent bound for unit-scale distributions, then there are positive constants c_1, c_2, c_3, c_4 such that, for each n, defining s = ⌊√n⌋, the marginal of P_n is N(0, Σ_s), the conditional mean of Y given X is E[Y | X = x] = θ*_s · x, and, for all 0 < δ < c_1, the set of n such that

Pr_{S∼P_n^n}[ ε(h, n, δ) > c_2 ] ≥ 1/2

is strongly (1 − c_3 δ)-dense beyond c_4 log(1/δ).

Theorem 1 is proved by applying Theorem 2 with covariance matrices for which the least-norm interpolant converges especially rapidly. Theorem 2 could similarly be applied when the convergence is slower. The least-norm interpolant converges at a variety of rates, depending on the eigenvalue decays of the spectra of Σ_1, Σ_2, ... [Bartlett et al., 2020].
4 Proofs

First, we will prove Theorem 2, and then use it to prove Theorem 1.

4.1 Proof of Theorem 2
Our proof uses the following lemma [Birch, 1963] (see also [Feller, 1968, p. 216] and [Batu et al., 2000, 2013]), which has become known as the "Poissonization lemma". We use Poi(λ) to denote the Poisson distribution with mean λ: for t ∼ Poi(λ) and k ≥ 0,

Pr[t = k] = λ^k e^{−λ} / k!.
Lemma 3  If, for t ∼ Poi(n), you throw t balls independently and uniformly at random into m bins, then
• the numbers of balls falling into the bins are mutually independent, and
• the number of balls falling in each bin is distributed as Poi(n/m).

For each n, our proof uses three distributions: D_n, Q_n and P_n. The first, D_n, is used to define Q_n and P_n. The distribution Q_n is defined so that, for any unit-scale D_n, the least-norm interpolant performs poorly on Q_n. The distribution P_n is defined so that, when the least-norm interpolant performs well on D_n, it also performs well on P_n. Crucially, the least-norm interpolants that arise from Q_n and P_n are closely related.

For each n, for arbitrary d, the marginal of D_n on x ∈ R^d is N(0, Σ_s), where s = ⌊√n⌋. For (X, Y) ∼ D_n, for each x ∈ R^d, the distribution of Y given X = x is N(θ*_s · x, 1/4). Note that (x_1, ..., x_d, y) is Gaussian and any 1-dimensional projection has variance at most 1, so that D_n is a unit-scale distribution.

For an absolute constant positive integer b, we get Q_n from D_n through the following steps.
1. Sample (x_1, y_1), ..., (x_{bn}, y_{bn}) ∼ D_n^{bn}.
2. Define Q_n on R^d × R so that its marginal on R^d is uniform on U = {x_1, ..., x_{bn}} and its conditional distribution of Y | X is the same as that of D_n.
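The way Lemma 3 is used below can be checked directly by simulation (a sanity check we add here; b = 2 and c = 2 ln 2 are the values chosen later in the proof): conditioned on Q_n, a sample of t ∼ Poi(cn) draws from Q_n hits the bn support points independently, with each count distributed as Poi(c/b).

import numpy as np

rng = np.random.default_rng(1)
b, n, trials = 2, 200, 10_000
c = 2 * np.log(2)
counts = np.empty((trials, b * n), dtype=int)
for i in range(trials):
    t = rng.poisson(c * n)                       # t ~ Poi(cn) draws from Q_n
    # each draw lands on one of the bn support points uniformly at random
    counts[i] = np.bincount(rng.integers(0, b * n, size=t), minlength=b * n)

# Each support point's count should look like Poi(c/b): mean = variance = c/b,
# and counts at distinct points should be (empirically) uncorrelated.
print(c / b, counts[:, 0].mean(), counts[:, 0].var())
print(np.corrcoef(counts[:, 0], counts[:, 1])[0, 1])  # ~0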
Definition 4  For a sample S = ((x_1, y_1), ..., (x_n, y_n)), the compression of S, denoted by C(S), is defined to be C(S) = ((u_1, v_1), ..., (u_k, v_k)), where u_1, ..., u_k are the unique elements of {x_1, ..., x_n} and, for each i, v_i is the average of {y_j : 1 ≤ j ≤ n, x_j = u_i}.

For the least-norm interpolation algorithm A, for any pair S and S′ of samples such that C(S) = C(S′), we have A(S) = A(S′). (This is true because the least-norm interpolant A(S) is uniquely determined by the equality constraints specified by the compression C(S).)

We can show that the least-norm interpolant is bad for Q_n by only considering the points in the support of Q_n that the algorithm sees exactly once.

Lemma 5  For any constant c > 0, there are constants c_1, c_2 > 0 such that, for all sufficiently large n, almost surely for Q_n chosen randomly as described above, if t is chosen randomly according to Poi(cn) and S consists of t random draws from Q_n, then with probability at least 1 − e^{−c_1 n} over t and S,

E_{(x,y)∼Q_n}[ (A(C(S))(x) − y)² ] − E_{(x,y)∼Q_n}[ (f*(x) − y)² ] ≥ c_2,

where f* is the regression function for D_n (and hence also for Q_n).
Proof:  Recall that U = {x_1, ..., x_{bn}} is the support of the marginal of Q_n on the independent variables. With probability 1, U has cardinality bn. Define h = A(C(S)). If some x ∈ U appears exactly once in S, then h(x) is a sample from the distribution of y given x under D_n. Thus, for such an x, the expected quadratic loss of h(x) on a test point at x is the expected squared difference between two independent samples from this distribution, which is twice its variance, i.e. twice the expected loss of f*, which is 2 × 1/4. On any x ∈ U, whether or not it was seen exactly once in S, by definition, f*(x) minimizes the expected loss given x.

Lemma 3 shows that, conditioned on the random choice of Q_n, the numbers of times the various x in U appear in S are mutually independent, and the probability that a given x ∈ U is seen exactly once in S is (c/b) exp(−c/b) ≥ (c/b) e^{−c}. Applying a Chernoff bound (see, for example, Theorem 4.5 in Mitzenmacher and Upfal [2005]), the probability that fewer than c e^{−c} n/2 members of U are seen exactly once in S is at most e^{−c_1 n} for an absolute constant c_1. Thus, if U_1 is the (random) subset of points in U that were seen exactly once, we have

E_{Q_n}[ (h(X) − Y)² ] − E_{Q_n}[ (f*(X) − Y)² ]
  = Σ_{x∈U} E_{Q_n}[ ((h(X) − Y)² − (f*(X) − Y)²) 1_{X=x} ]
  ≥ Σ_{x∈U_1} (1/(bn)) E[ (f*(X) − Y)² | X = x ]
  = |U_1| / (4bn),

where the inequality holds because each term is nonnegative and, for x ∈ U_1, the expected value of (h(X) − Y)² given X = x is twice that of (f*(X) − Y)². Since, with probability 1 − e^{−c_1 n}, |U_1| ≥ c e^{−c} n/2, this completes the proof.
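The compression of Definition 4, and the invariance A(S) = A(C(S)) noted above, are easy to check numerically; below is a minimal sketch (our illustration, with hypothetical helper names), again using the pseudoinverse for the least-norm interpolant.

import numpy as np

def compress(X, y):
    """C(S) of Definition 4: one copy of each distinct x, paired with the
    average of the y values that were observed with it."""
    U, inv = np.unique(X, axis=0, return_inverse=True)
    v = np.array([y[inv == i].mean() for i in range(len(U))])
    return U, v

rng = np.random.default_rng(2)
X = rng.normal(size=(4, 10))
S_X = np.vstack([X, X[0]])             # the first point appears twice in S
S_y = rng.normal(size=5)
U, v = compress(S_X, S_y)
theta_S = np.linalg.pinv(S_X) @ S_y    # least-norm interpolant of S
theta_C = np.linalg.pinv(U) @ v        # least-norm interpolant of C(S)
print(np.allclose(theta_S, theta_C))   # True: A(S) = A(C(S))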
Definition 6  Define P_n as follows.
1. Set the marginal distribution of P_n on R^d to be the same as that of D_n.
2. To generate Y given X = x for (X, Y) ∼ P_n, first sample a random variable Z whose distribution is obtained by conditioning a draw from a Poisson with mean c/b on the event that it is at least 1; then sample Z values V_1, ..., V_Z from the conditional distribution D_n(Y | X = x), and set Y = (1/Z) Σ_{j=1}^{Z} V_j.

Note that, since D_n has a density, x_1, ..., x_r are almost surely distinct, and hence S drawn from P_n^r has C(S) = S almost surely.
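A sampler for the conditional distribution in Definition 6 might look as follows (a sketch under the assumptions of this section, where D_n(Y | X = x) is N(θ*_s · x, 1/4); the function name and parameters are ours). The rejection loop implements the conditioning on Z ≥ 1, and averaging the Z draws preserves the conditional mean θ*_s · x while only decreasing the conditional variance.

import numpy as np

def sample_y_given_x(x, theta_star, c, b, rng):
    """Draw Y | X = x under P_n: the average of Z draws from D_n(Y | X = x),
    where Z ~ Poi(c/b) conditioned on Z >= 1."""
    z = 0
    while z == 0:                    # rejection sampling for the conditioning
        z = rng.poisson(c / b)
    v = rng.normal(theta_star @ x, 0.5, size=z)  # std 1/2, i.e. variance 1/4
    return v.mean()

rng = np.random.default_rng(3)
theta_star = np.array([1.0, 0.0, 0.0])
x = np.array([0.3, -1.2, 0.5])
ys = [sample_y_given_x(x, theta_star, 2 * np.log(2), 2, rng) for _ in range(10_000)]
print(np.mean(ys))                   # ~0.3 = theta_star . x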
The following lemma implies that the bounds for P_n tend to be as big as those for Q_n.

Lemma 7  Define Q_n as above and let 𝒬_n be the resulting distribution over the random choice of Q_n. Suppose P_n is defined as in Definition 6. Let c > 0 be an arbitrary constant. Choose S randomly by choosing t ∼ Poi(cn), Q_n ∼ 𝒬_n and S ∼ Q_n^t. Choose T by choosing r ∼ B(bn, 1 − exp(−c/b)) and T ∼ P_n^r. Then C(S) and T have the same distribution. In particular, for any function ψ of the least-norm interpolant h, a sample size r, and a confidence parameter δ > 0, we have

E_{t∼Poi(cn), Q_n∼𝒬_n}[ E_{S∼Q_n^t}[ ψ(h(S), |C(S)|, δ) ] ] = E_{r∼B(bn, 1−exp(−c/b))}[ E_{T∼P_n^r}[ ψ(h(T), r, δ) ] ].

Proof:  Let 𝒞 be the probability distribution over training sets obtained by picking Q_n from 𝒬_n, picking t from Poi(cn), picking S from Q_n^t and compressing it. Let C = C(S) be a random draw from 𝒞. Let n_C be the number of examples in C.

We claim that n_C is distributed as B(bn, 1 − exp(−c/b)). Conditioned on Q_n, and recalling that U is the support of Q_n, Lemma 3 implies that, for each x ∈ U, the probability that x is not seen is the probability, under a Poisson with mean c/b, of drawing a 0. Thus, the probability that x is seen is 1 − exp(−c/b). Since the numbers of times different x are seen in S are independent, the number seen is distributed as B(bn, 1 − exp(−c/b)).

Now, for each x ∈ U, the event that it is in C(S) is the same as the event that it appears at least once in S. Thus, conditioned on the event that x appears in S, the number of y values that are used to compute the corresponding v value in C(S) is distributed as a Poisson with mean c/b, conditioned on having a value at least 1.

Let D_{n,X} be the marginal distribution of D_n on the x's. If we make bn independent draws from D_{n,X}, and then independently reject some of these examples, to get n_C draws, the resulting n_C examples are independent. (We could first randomly decide the number n_C of examples to keep, and then draw those independently from D_{n,X}, and we would have the same distribution.)

The last two paragraphs together, along with the definition of Q_n, imply that the distribution over T obtained by sampling r from B(bn, 1 − e^{−c/b}) and T from P_n^r is the same as the distribution over C obtained by sampling Q_n from 𝒬_n, t from Poi(cn), then sampling S from Q_n^t and compressing it. Thus, the distributions of T and C(S) are the same, and hence the distributions of (h(T), |T|) and (h(S), |C(S)|) are the same, because h(S) = h(C(S)).
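One concrete consequence of Lemma 7, that |C(S)| is distributed as B(bn, 1 − exp(−c/b)), can be checked by simulation (our sanity check; the parameter values are arbitrary):

import numpy as np

rng = np.random.default_rng(4)
b, n, trials = 2, 500, 20_000
c = 2 * np.log(2)
sizes = np.empty(trials, dtype=int)
for i in range(trials):
    t = rng.poisson(c * n)                  # t ~ Poi(cn) draws from Q_n
    seen = rng.integers(0, b * n, size=t)   # uniform over the bn support points
    sizes[i] = len(np.unique(seen))         # |C(S)| = number of distinct points

p = 1 - np.exp(-c / b)                      # here p = 1/2, so E|C(S)| = n
print(sizes.mean(), b * n * p)              # ~500 and 500
print(sizes.var(), b * n * p * (1 - p))     # ~250, the Binomial variance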
We will use the following bound on a tail of the Poisson distribution.

Lemma 8 ([Canonne, 2017])  For any λ, α > 0,

Pr_{r∼Poi(λ)}[ r ≥ (1 + α)λ ] ≤ exp( −α²λ / (2(1 + α)) ).

Armed with these tools, we are ready to prove Theorem 2.

We will think of the natural numbers as being divided into bins [1², 2²), [2², 3²), [3², 4²), .... Let us focus our attention on one bin {s′², ..., (s′ + 1)² − 1}. Let n denote the center of the bin, n = s′² + s′, so that s′ ∼ √n. For a constant c_2 > 0 and any δ > 0, Lemma 7 implies

E_{r∼B(bn, 1−e^{−c/b})}[ Pr_{T∼P_n^r}[ ε(h(T), r, δ) ≤ c_2 ] ] = E_{t∼Poi(cn), Q_n∼𝒬_n}[ Pr_{S∼Q_n^t}[ ε(h(S), |C(S)|, δ) ≤ c_2 ] ].   (1)

Suppose that ε is B′-bounded-antimonotonic. Fix B such that B ≥ B′². Then

E_{t∼Poi(cn), Q_n∼𝒬_n}[ Pr_{S∼Q_n^t}[ ε(h, |C(S)|, δ) ≤ c_2 ] ]
  ≤ E_{t∼Poi(cn), Q_n∼𝒬_n}[ Pr_{S∼Q_n^t}[ R_{Q_n}(h) − R*_{Q_n} > B ε(h, |C(S)|, δ) ] ]
  + E_{t∼Poi(cn), Q_n∼𝒬_n}[ Pr_{S∼Q_n^t}[ R_{Q_n}(h) − R*_{Q_n} ≤ c_2 B ] ].   (2)

For each sample size t and each Q_n,

Pr_{S∼Q_n^t}[ R_{Q_n}(h) − R*_{Q_n} > B ε(h, |C(S)|, δ) ]
  ≤ Pr_{S∼Q_n^t}[ R_{Q_n}(h) − R*_{Q_n} > ε(h, t, δ) ] + Pr_{S∼Q_n^t}[ B ε(h, |C(S)|, δ) < ε(h, t, δ) ]
  ≤ δ + Pr_{S∼Q_n^t}[ |C(S)| < t/4 ],

since ε is a valid uniform model-dependent bound, and since, when |C(S)| ≥ t/4, two applications of the B′-bounded-antimonotonic property give ε(h, t, δ) ≤ B′² ε(h, |C(S)|, δ) ≤ B ε(h, |C(S)|, δ). Now, by a union bound, for some constant c_5 > 0,

E_{t∼Poi(cn), Q_n∼𝒬_n}[ Pr_{S∼Q_n^t}[ |C(S)| < t/4 ] ]
  ≤ E_{t∼Poi(cn), Q_n∼𝒬_n}[ Pr_{S∼Q_n^t}[ |C(S)| < c_5 n ] ] + Pr_{t∼Poi(cn)}[ t/4 ≥ c_5 n ]
  = Pr_{Z∼B(bn, 1−e^{−c/b})}[ Z < c_5 n ] + Pr_{t∼Poi(cn)}[ t ≥ 4 c_5 n ] ≤ 2δ,   (3)

where the last inequality follows from a Chernoff bound and from Lemma 8 with α = 4c_5/c − 1, provided n = Ω(log(1/δ)) and provided we can choose c_5 to satisfy c/4 < c_5 < b(1 − e^{−c/b}). Our choice of b and c, specified below, will ensure this. In that case, we have that

E_{t∼Poi(cn), Q_n∼𝒬_n}[ Pr_{S∼Q_n^t}[ ε(h, |C(S)|, δ) ≤ c_2 ] ] ≤ 3δ + E_{t∼Poi(cn), Q_n∼𝒬_n}[ Pr_{S∼Q_n^t}[ R_{Q_n}(h) − R*_{Q_n} ≤ c_2 B ] ].

Applying Lemma 5 to bound the last term, if n is large enough and c_2 is chosen so that c_2 B is smaller than the constant c_2 arising from Lemma 5, then

E_{t∼Poi(cn), Q_n∼𝒬_n}[ Pr_{S∼Q_n^t}[ ε(h, |C(S)|, δ) ≤ c_2 ] ] ≤ 4δ.

Returning to (1), we get

E_{r∼B(bn, 1−e^{−c/b})}[ Pr_{T∼P_n^r}[ ε(h, r, δ) ≤ c_2 ] ] ≤ 4δ.   (4)

Let us now focus on the case that b = 2 and c = 2 ln 2, so that

E_{r∼B(bn, 1−e^{−c/b})}[ r ] = (1 − e^{−c/b}) bn = n.

(And note that c/4 < b(1 − e^{−c/b}), as required for (3).) Chebyshev's inequality implies

Pr_{r∼B(bn, 1−e^{−c/b})}[ r ∉ [n − s′, n + s′] ] ≤ c_6

for an absolute positive constant c_6. Returning now to (4), Markov's inequality implies

Pr_{r∼B(bn, 1−e^{−c/b})}[ Pr_{T∼P_n^r}[ ε(h, r, δ) ≤ c_2 ] > 1/2 ] ≤ 8δ.   (5)

Further, it is known [Slud, 1977, Box et al., 1978] that there is an absolute constant c_7 such that, for all large enough n and all r_0 ∈ [n − s′, n + s′],

Pr_{r∼B(bn, 1−e^{−c/b})}[ r = r_0 ] ≥ c_7/√n.

Combining this with (5), and recalling that s′ is Θ(√n), we get

|{ r ∈ [n − s′, n + s′] : Pr_{T∼P_n^r}[ ε(h, r, δ) ≤ c_2 ] > 1/2 }| / (2s′ + 1) ≤ c_8 δ

for an absolute constant c_8 > 0. Since [n − s′, n + s′] = {s′², ..., s′² + 2s′} is exactly the bin we fixed, and since, for all r in this bin, ⌊√r⌋ = s′ and hence P_r = P_n, this completes the proof.

4.2 Proof of Theorem 1

The proof of Theorem 1 uses D_1, D_2, ... that are defined as in Section 4.1, with additional constraints. Fix an arbitrary constant K ≥ 1. For each n, the joint distribution D_n on (x, y)-pairs is defined as follows. Let s = ⌊√n⌋, N = s², and d = N². Let θ*_s be an arbitrary unit-length vector.
Let Σ_s be an arbitrary covariance matrix with two distinct eigenvalues: 1/2, with multiplicity K, and 1/(2d), with multiplicity d − K. The marginal of D_n on x is then N(0, Σ_s). As before, for each x ∈ R^d, the distribution of y given x is N(θ*_s · x, 1/4). Once again, since (x, y) is Gaussian, ‖Σ_s‖ ≤ 1/2, and the variance of y is at most 3/4, each D_n is innocuous.

We then derive P_1, P_2, ... from D_1, D_2, ... exactly as in the proof of Theorem 2.

The following bound can be obtained through direct application of the results in [Bartlett et al., 2020]. The details are given in Appendix A.
Lemma 9  There is a constant c such that, for all 0 < δ < 1/2 and all large enough n, with probability at least 1 − δ, for S ∼ P_n^n, the least-norm interpolant h satisfies

R_{P_n}(h) − R*_{P_n} ≤ c √(log(1/δ)/n).

Combining this with Theorem 2 completes the proof.
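A small simulation in the spirit of Lemma 9 (our illustration, not part of the proof: we take θ*_s to be the first standard basis vector, use the spiked covariance of Section 4.2, and pick small sample sizes and trial counts) shows the excess risk of the least-norm interpolant shrinking as n grows, even though the training data are noisy and are fit exactly.

import numpy as np

rng = np.random.default_rng(5)

def mean_excess_risk(n, K=1, trials=10):
    s = int(np.sqrt(n))
    d = (s ** 2) ** 2                                 # d = N^2 with N = s^2
    lam = np.concatenate([np.full(K, 0.5), np.full(d - K, 0.5 / d)])
    theta_star = np.zeros(d)
    theta_star[0] = 1.0                               # unit length
    risks = []
    for _ in range(trials):
        X = rng.normal(size=(n, d)) * np.sqrt(lam)    # rows ~ N(0, Sigma_s)
        y = X @ theta_star + rng.normal(0.0, 0.5, size=n)  # noise variance 1/4
        theta = np.linalg.pinv(X) @ y                 # least-norm interpolant
        diff = theta - theta_star
        # excess risk R(h) - R* = (theta - theta*)' Sigma (theta - theta*)
        risks.append(float(np.sum(lam * diff ** 2)))
    return np.mean(risks)

for n in (16, 36, 64, 100):
    print(n, mean_excess_risk(n))                     # typically decreases with n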
Acknowledgements
We thank Vaishnavh Nagarajan and Zico Kolter for helpful comments on an earlier draft of this paper, and Dan Roy for calling our attention to Lemma 5.2 of [Negrea et al., 2020].
References
Sanjeev Arora, Simon S. Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In ICML, volume 97, pages 322–332, 2019.

P. L. Bartlett. The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44(2):525–536, 1998.

P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.

Peter Bartlett, Dylan Foster, and Matus Telgarsky. Spectrally-normalized margin bounds for neural networks. In NIPS, pages 6240–6249, 2017.

Peter L. Bartlett, Stéphane Boucheron, and Gábor Lugosi. Model selection and error estimation. Machine Learning, 48(1-3):85–113, 2002.

Peter L. Bartlett, Philip M. Long, Gábor Lugosi, and Alexander Tsigler. Benign overfitting in linear regression. PNAS, 2020.

Tugkan Batu, Lance Fortnow, Ronitt Rubinfeld, Warren D. Smith, and Patrick White. Testing that distributions are close. In Proceedings 41st Annual Symposium on Foundations of Computer Science, pages 259–269. IEEE, 2000.

Tugkan Batu, Lance Fortnow, Ronitt Rubinfeld, Warren D. Smith, and Patrick White. Testing closeness of discrete distributions. J. ACM, 60(1):4:1–4:25, 2013. doi: 10.1145/2432622.2432626. URL https://doi.org/10.1145/2432622.2432626.

Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias-variance trade-off. PNAS, 116(32):15849–15854, 2019a.

Mikhail Belkin, Alexander Rakhlin, and Alexandre B. Tsybakov. Does data interpolation contradict statistical optimality? In AISTATS, pages 1611–1619, 2019b.

M. W. Birch. Maximum likelihood in three-way contingency tables. Journal of the Royal Statistical Society: Series B (Methodological), 25(1):220–233, 1963.

George E. P. Box, William H. Hunter, Stuart Hunter, et al. Statistics for Experimenters, volume 664. John Wiley and Sons, New York, 1978.

Clément L. Canonne. A short note on Poisson tail bounds, 2017.

Yuan Cao and Quanquan Gu. Generalization bounds of stochastic gradient descent for wide and deep neural networks. In Advances in Neural Information Processing Systems, pages 10835–10845, 2019.

Michał Dereziński, Feynman Liang, and Michael W. Mahoney. Exact expressions for double descent and implicit regularization via surrogate random design. arXiv preprint arXiv:1912.04533, 2019.

Vitaly Feldman. Does learning require memorization? A short tale about a long tail. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pages 954–959, 2020.

William Feller. An Introduction to Probability Theory and Its Applications, volume 1. John Wiley & Sons, 1968.

Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent sample complexity of neural networks. In COLT, volume 75, pages 297–299, 2018.

Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J. Tibshirani. Surprises in high-dimensional ridgeless least squares interpolation. arXiv preprint arXiv:1903.08560, 2019a.

Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J. Tibshirani. Surprises in high-dimensional ridgeless least squares interpolation. Technical Report 1903.08560 [math.ST], arXiv, 2019b. URL https://arxiv.org/abs/1903.08560.

David Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100(1):78–150, 1992.

Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. In NeurIPS, pages 8157–8166, 2018.

Zhu Li, Weijie Su, and Dino Sejdinovic. Benign overfitting and noisy features. arXiv preprint arXiv:2008.02901, 2020.

Tengyuan Liang and Alexander Rakhlin. Just interpolate: Kernel "Ridgeless" regression can generalize. The Annals of Statistics, 48(3):1329–1347, 2020.

Philip M. Long and Hanie Sedghi. Generalization bounds for deep convolutional neural networks. In ICLR, 2019.

Michael Mitzenmacher and Eli Upfal. Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press, 2005.

Vaishnavh Nagarajan and J. Zico Kolter. Uniform convergence may be unable to explain generalization in deep learning. In NeurIPS, pages 11611–11622, 2019.

Jeffrey Negrea, Gintare Karolina Dziugaite, and Daniel M. Roy. In defense of uniform convergence: Generalization via derandomization with an application to interpolating predictors. In ICML, 2020.

Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in neural networks. In COLT, pages 1376–1401, 2015.

Eric V. Slud. Distribution inequalities for the binomial law. The Annals of Probability, pages 404–412, 1977.

A. Tsigler and P. L. Bartlett. Benign overfitting in ridge regression, 2020.

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In ICLR, 2017. URL https://arxiv.org/abs/1611.03530.

A Proof of Lemma 9
The lemma follows from Theorem 1 of [Bartlett et al., 2020]; before showing how to apply it, let us first restate a special case of the theorem for easy reference.
A.1 A useful upper bound
The special case concerns the least-norm interpolant applied to training data (x_1, y_1), ..., (x_n, y_n) drawn from a joint distribution P over (x, y) pairs. The marginal distribution of x is N(0, Σ). There is a unit-length θ* such that, for all x, the conditional distribution of y given x has mean θ* · x, is sub-Gaussian with parameter 1/2, and has variance at most 1/4.

We will apply an upper bound in terms of the eigenvalues λ_1 ≥ λ_2 ≥ ... of Σ. The bound is in terms of two notions of the effective rank of the tail of this spectrum:

r_k(Σ) = (Σ_{i>k} λ_i) / λ_{k+1},   R_k(Σ) = (Σ_{i>k} λ_i)² / (Σ_{i>k} λ_i²).

The rank of Σ is assumed to be greater than n.
Lemma 10  There are b, c, c_1 > 0 for which the following holds. For all n, P and Σ defined as above, write k* = min{ k ≥ 0 : r_k(Σ) ≥ bn }. Suppose that δ < 1 with log(1/δ) < n/c_1. If k* < n/c_1, then, with probability at least 1 − δ, the least-norm interpolant h satisfies

R_P(h) − R*_P ≤ c ( max{ √(r_0(Σ)/n), r_0(Σ)/n, √(log(1/δ)/n) } + log(1/δ) ( k*/n + n/R_{k*}(Σ) ) ).

A.2 The proof

To prove Lemma 9, we need to show that P_n satisfies the requirements on P in Lemma 10, and to evaluate the effective ranks r_k and R_k of P_n's covariance Σ_s. Define α = 1/d, so that the eigenvalues of Σ_s are 1/2, with multiplicity K, and α/2, with multiplicity d − K. For any k < K, we have

r_k = ( (K − k)/2 + (d − K)α/2 ) / (1/2) = K − k + (d − K)α

and

R_k = ( K − k + (d − K)α )² / ( K − k + (d − K)α² ).

For k ≥ K, r_k = R_k = d − k. In particular, since K is a constant, r_0 is bounded by a constant. Since d grows faster than n, for large enough n, k* := min{ k : r_k ≥ bn } = K, so R_{k*} = d − K = Ω(n²).

Each sample from the distribution of Y given X = x has a mean of θ*_s · x, is sub-Gaussian with parameter at most 1/2, and has variance at most 1/4 (because increasing Z only decreases the variance of Y). Evaluating the bound of Lemma 10 on P_n^n, with r_0 = O(1), k* = K and R_{k*} = Ω(n²), gives R_{P_n}(h) − R*_{P_n} ≤ c √(log(1/δ)/n) for all large enough n, completing the proof of Lemma 9.
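The effective-rank computations above are easy to check numerically (our sketch; sizes kept modest so the arrays stay small):

import numpy as np

def effective_ranks(lam, k):
    """r_k = (sum_{i>k} lam_i) / lam_{k+1} and
    R_k = (sum_{i>k} lam_i)^2 / sum_{i>k} lam_i^2,
    for eigenvalues lam sorted in nonincreasing order."""
    tail = lam[k:]                    # lam[k] is lambda_{k+1} in 0-based indexing
    return tail.sum() / tail[0], tail.sum() ** 2 / (tail ** 2).sum()

K, n = 3, 100
s = int(np.sqrt(n))
d = (s ** 2) ** 2                     # d = N^2 with N = s^2
alpha = 1.0 / d
lam = np.concatenate([np.full(K, 0.5), np.full(d - K, alpha / 2)])

r0, _ = effective_ranks(lam, 0)
rK, RK = effective_ranks(lam, K)
print(r0)              # ~K + 1: bounded by a constant
print(rK, RK, d - K)   # both equal d - K = Omega(n^2), so k* = K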