BENIGN OVERFITTING IN RIDGE REGRESSION
A PREPRINT
Alexander Tsigler
Department of Statistics, University of California, Berkeley
[email protected]
Peter L. Bartlett
Departments of Statistics and Computer Science, University of California, Berkeley
[email protected]
October 1, 2020

ABSTRACT
Classical learning theory suggests that strong regularization is needed to learn a class with large complexity. This intuition is in contrast with the modern practice of machine learning, in particular learning neural networks, where the number of parameters often exceeds the number of data points. It has been observed empirically that such overparametrized models can show good generalization performance even if trained with vanishing or negative regularization. The aim of this work is to understand theoretically how this effect can occur, by studying the setting of ridge regression. We provide non-asymptotic generalization bounds for overparametrized ridge regression that depend on the arbitrary covariance structure of the data, and show that those bounds are tight for a range of regularization parameter values. To our knowledge this is the first work that studies overparametrized ridge regression in such a general setting. We identify when small or negative regularization is sufficient for obtaining small generalization error. On the technical side, our bounds only require the data vectors to be i.i.d. sub-gaussian, while most previous work assumes independence of the components of those vectors.
The bias-variance tradeoff is well known in statistics and machine learning. The classical theory suggests that large models overfit the data, and one needs to impose significant regularization to make them generalize. This intuition is, however, in contrast with the empirical study of modern machine learning techniques. It was repeatedly observed that even models with enough capacity to exactly interpolate the data can generalize with only a very small amount of regularization, or with no regularization at all [2, 19]. In some cases, the best value of the regularizer can be zero [9] or even negative [8] for such models.

The aim of this paper is to provide a theoretical understanding of the above mentioned phenomenon. We suggest to consider perhaps the simplest statistical model: linear regression with ridge regularization and i.i.d. noisy observations. The setting and assumptions are fully described in Section 2. We derive non-asymptotic upper bounds for the prediction mean squared error (MSE), which depend on the spectrum of the covariance operator Σ as well as the value of the optimal parameter θ*. Under some additional assumptions we provide matching lower bounds. Our theoretical results are in accordance with the empirical observations mentioned above: we show that small regularization can indeed provide good generalization performance if, for example, the eigenvalues of the covariance operator decay fast. Moreover, if there is a sharp drop in the sequence of those eigenvalues, followed by a plateau, then negative values of regularization may be beneficial if the signal-to-noise ratio (SNR) is high enough.

Motivated by the empirical success of overparametrized models, there has recently been a flurry of work aimed at understanding theoretically whether the corresponding effects can be seen in overparametrized linear regression.
A large body of that work deals with the ridgeless regime [1, 3, 4, 5, 7, 10, 13, 14, 18, 19], also known as the "interpolating solution," with zero in-sample risk.

Theoretical study of ridge regularization is, however, a much broader area. As we mentioned before, classical intuition suggests that one should regularize if the model is large, so in general ridge regularization is a classical topic, and it is impossible to mention all the work that was done in this direction. The purpose of our work, however, is to study cases where small or negative regularization is sufficient, that is, outside the classical regime, but beyond the specific case of interpolation.

In the following we mention some recent work on ridge regularization and point out the differences with our approach. Dobriban and Wager [6] analyze the predictive risk of ridge regression in the asymptotic setting, Thrampoulidis, Oymak and Hassibi [15] suggest a methodology for obtaining asymptotic performance guarantees for ridge regression with gaussian design, Mitra [12] provides asymptotic expressions for ridge regression for a special data generating model that they call Misparametrized Sparse Regression, and Mei and Montanari [11] derive asymptotics of the generalization error for ridge regression on random features. All those works are in the asymptotic setting, with the ambient dimension and the number of data points going to infinity while their ratio goes to a constant. Moreover, each of those works imposes an assumption of independence of the coordinates of the data vectors, or even requires a specific distribution of those vectors (e.g., gaussian). Our results are non-asymptotic in nature, and cover cases that don't satisfy the above mentioned assumptions. For example, our results apply if the dimension is infinite and the eigenvalues of the covariance operator have fast decay, and we show in Section 4 that small regularization can be sufficient to obtain small generalization error in that case. Moreover, we do not require any assumptions apart from sub-gaussianity to state our main result, the first part of Theorem 1. To our knowledge this is the first result of this kind that applies with such general assumptions on the structure of the data.

A result in a similar framework was obtained by Kobak, Lomond and Sanchez [8], where it was proved that negative ridge regularization can be beneficial in a spiked covariance model with one spike. Our results are in accordance with that observation. In Corollary 8, we show that negative regularization can also be beneficial for spiked covariance models when the number of spikes is small compared to the sample size.

The closest to our work is Bartlett et al. [1], where it was shown that the variance term for ridgeless overparametrized regression can be small if and only if the tail of the sequence of eigenvalues of the covariance operator has large effective rank. We strictly generalize those results. We obtain similar bounds for the variance term for the case of ridge regression with zero regularization, extend these to the full range of levels of regularization, and provide novel bounds for the bias term. The bias term is of independent interest, as it is closely related to high-dimensional PCA for a spiked covariance model. Our assumptions on the covariance structure are more general than the usual assumptions of the spiked covariance model. Our argument is also novel, and doesn't require the assumption of independence of the coordinates of the data vectors, which was needed in [1].
We only need sub-gaussianity and a small-ball condition. For example, our results apply to many kernel regression settings, such as polynomial kernels with compactly supported data.
We study ridge regression with independent sub-gaussian observations. Throughout the whole paper we make the following assumptions:

• $X \in \mathbb{R}^{n \times p}$ is a random design matrix with i.i.d. centered rows;
• the covariance matrix of a row of $X$ is $\Sigma = \mathrm{diag}(\lambda_1, \dots, \lambda_p)$ with $\lambda_1 \geq \dots \geq \lambda_p$ (note that we can assume that the covariance is diagonal without loss of generality, as we can always achieve this by a change of basis);
• the rows of $X\Sigma^{-1/2}$ are sub-gaussian vectors with sub-gaussian norm at most $\sigma_x$;
• $y = X\theta^* + \varepsilon$ is the response vector, where $\theta^* \in \mathbb{R}^p$ is some unknown vector, and $\varepsilon$ is noise that is independent of $X$;
• the components of $\varepsilon$ are independent and have sub-gaussian norms bounded by $\sigma_\varepsilon$.

Recall that a random variable $Z$ is sub-gaussian if it has a finite sub-gaussian norm $\|Z\|_{\psi_2} := \inf\{t > 0 : \mathbb{E}\exp(Z^2/t^2) \leq 2\}$, and that the sub-gaussian norm of a random vector $Z$ is $\|Z\|_{\psi_2} := \sup_{s \neq 0} \big\|\langle s, Z\rangle / \|s\|\big\|_{\psi_2}$.

In the following, we write $a \lesssim b$ if there exists an absolute constant $c$ s.t. $a \leq cb$. We write $a \lesssim_{\sigma_x} b$ if there exists a constant $c_x$ that only depends on $\sigma_x$ s.t. $a \leq c_x b$. For any positive integer $m$, we denote the $m \times m$ identity matrix as $I_m$, and for any positive semidefinite (PSD) matrix $M \in \mathbb{R}^{m \times m}$ and any $x \in \mathbb{R}^m$ we denote $\|x\|_M := \sqrt{x^\top M x}$. We denote the eigenvalues of $M$ in decreasing order by $\mu_1(M) \geq \dots \geq \mu_m(M)$.

For any non-negative integer $k$ we introduce the following notation:

• For any matrix $M \in \mathbb{R}^{n \times p}$ denote $M_{0:k}$ to be the matrix comprised of the first $k$ columns of $M$, and $M_{k:\infty}$ to be the matrix comprised of the rest of the columns of $M$.
• For any vector $\eta \in \mathbb{R}^p$ denote $\eta_{0:k}$ to be the vector comprised of the first $k$ components of $\eta$, and $\eta_{k:\infty}$ to be the vector comprised of the rest of the coordinates of $\eta$.
• Denote $\Sigma_{0:k} = \mathrm{diag}(\lambda_1, \dots, \lambda_k)$ and $\Sigma_{k:\infty} = \mathrm{diag}(\lambda_{k+1}, \lambda_{k+2}, \dots)$.
• Denote $A_k := X_{k:\infty}X_{k:\infty}^\top + \lambda I_n$.
• Denote $A_{-k} := X_{0:k-1}X_{0:k-1}^\top + X_{k:\infty}X_{k:\infty}^\top + \lambda I_n$.

Denote the ridge estimator as
\[
\hat\theta = \hat\theta(\lambda, y) = \operatorname*{argmin}_{\theta} \left\{ \|X\theta - y\|^2 + \lambda\|\theta\|^2 \right\} = X^\top(\lambda I_n + XX^\top)^{-1}y,
\]
where we assume that the matrix $\lambda I_n + XX^\top$ is non-degenerate. The representation in the last line is derived in Appendix A.

We are interested in evaluating the MSE of the ridge estimator. For a new independent observation $x$ (a row $x$ with the same distribution as a row of the matrix $X$, independent of $X$ and $\varepsilon$), we write for the prediction MSE
\[
\mathbb{E}\left[ \left( x(\hat\theta - \theta^*) \right)^2 \,\middle|\, X, \varepsilon \right] = \|\hat\theta - \theta^*\|_\Sigma^2 = \|\theta^* - X^\top(\lambda I_n + XX^\top)^{-1}(X\theta^* + \varepsilon)\|_\Sigma^2
\lesssim \|(I_p - X^\top(\lambda I_n + XX^\top)^{-1}X)\theta^*\|_\Sigma^2 + \|X^\top(\lambda I_n + XX^\top)^{-1}\varepsilon\|_\Sigma^2,
\]
so we denote
\[
B := \|(I_p - X^\top(\lambda I_n + XX^\top)^{-1}X)\theta^*\|_\Sigma^2, \qquad
V := \|X^\top(\lambda I_n + XX^\top)^{-1}\varepsilon\|_\Sigma^2
\]
as the bias and variance terms.

The following theorem is the main result of this paper. It consists of two parts: the first part gives explicit bounds in terms of the eigenvalues of the covariance matrix. The second part is more general, but the bound is given in terms of the smallest and largest eigenvalues of the random matrix $A_k$. We will use that second part in Section 4.2 to obtain bounds for the case of negative ridge regularization. The proof of Theorem 1 can be found in Section C.1 of the appendix.
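To make the setting concrete, here is a small numerical sketch (ours, not from the paper) that simulates the model above with a Gaussian design, computes the ridge estimator in its $n \times n$ form, and evaluates the prediction MSE together with the bias and variance terms $B$ and $V$; the particular spectrum, signal $\theta^*$, and value of $\lambda$ are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 100, 1000                       # overparametrized regime: p >> n
lam = 1e-3                             # ridge parameter (lambda); illustrative value
eigs = 1.0 / np.arange(1, p + 1) ** 2  # eigenvalues of Sigma (illustrative fast decay)
theta_star = rng.normal(size=p) / np.arange(1, p + 1)  # illustrative true parameter

# Rows of X are i.i.d. N(0, Sigma) with Sigma = diag(eigs) (a sub-gaussian design)
X = rng.normal(size=(n, p)) * np.sqrt(eigs)
eps = rng.normal(size=n)               # noise, sigma_eps = 1
y = X @ theta_star + eps

# Ridge estimator in its n x n form: theta_hat = X^T (lam I_n + X X^T)^{-1} y
G = lam * np.eye(n) + X @ X.T
theta_hat = X.T @ np.linalg.solve(G, y)

# Prediction MSE  E[(x(theta_hat - theta*))^2 | X, eps] = ||theta_hat - theta*||_Sigma^2
mse = np.sum(eigs * (theta_hat - theta_star) ** 2)

# Bias / variance split used in the text (mse <= 2 * (B + V))
bias_vec = theta_star - X.T @ np.linalg.solve(G, X @ theta_star)
var_vec = X.T @ np.linalg.solve(G, eps)
B = np.sum(eigs * bias_vec ** 2)
V = np.sum(eigs * var_vec ** 2)
print(f"MSE = {mse:.4f},  B = {B:.4f},  V = {V:.4f}")
```

The $n \times n$ solve is the representation used throughout the paper; it remains meaningful for small or moderately negative $\lambda$ as long as $\lambda I_n + XX^\top$ stays positive definite.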
Theorem 1. Suppose $\lambda \geq 0$ and it is also known that for some $\delta < 1 - e^{-n/c_x}$, with probability at least $1 - \delta$ the condition number of $A_k$ is at most $L$. Then with probability at least $1 - \delta - e^{-t/c_x}$,
\[
\frac{B}{L^2} \lesssim_{\sigma_x} \|\theta^*_{k:\infty}\|^2_{\Sigma_{k:\infty}} + \|\theta^*_{0:k}\|^2_{\Sigma_{0:k}^{-1}} \left( \frac{\lambda + \sum_{i>k}\lambda_i}{n} \right)^2,
\qquad
\frac{V}{\sigma_\varepsilon^2 t L^2} \lesssim_{\sigma_x} \frac{k}{n} + \frac{n \sum_{i>k}\lambda_i^2}{\left( \lambda + \sum_{i>k}\lambda_i \right)^2},
\]
and
\[
\frac{\lambda + \sum_{i>k}\lambda_i}{n\lambda_{k+1}} \geq \frac{1}{c_x L}.
\]
More generally, there exists a constant $c_x$, which only depends on $\sigma_x$, s.t. for any $\lambda \in \mathbb{R}$, $k \leq n/c_x$ and $t \in (1, n/c_x)$, with probability at least $1 - e^{-t/c_x}$, if the matrix $A_k$ is PD,
\[
B \lesssim_{\sigma_x} \|\theta^*_{k:\infty}\|^2_{\Sigma_{k:\infty}} \left( \frac{\mu_1(A_k^{-1})^2}{\mu_n(A_k^{-1})^2} + \mu_1(A_k^{-1})^2\left( n^2\lambda_{k+1}^2 + n\sum_{i>k}\lambda_i^2 \right) \right)
+ \|\theta^*_{0:k}\|^2_{\Sigma_{0:k}^{-1}} \left( \frac{1}{n^2\mu_n(A_k^{-1})^2} + \frac{\mu_1(A_k^{-1})^2}{\mu_n(A_k^{-1})^2} \left( \lambda_{k+1}^2 + \frac{1}{n}\sum_{i>k}\lambda_i^2 \right) \right),
\]
\[
\frac{V}{\sigma_\varepsilon^2 t} \lesssim_{\sigma_x} \frac{\mu_1(A_k^{-1})^2}{\mu_n(A_k^{-1})^2}\,\frac{k}{n} + n\,\mu_1(A_k^{-1})^2 \sum_{i>k}\lambda_i^2.
\]

The result of Theorem 1 depends on the arbitrary $k$. One way to interpret the results is through a generalization of the effective ranks that appear for the interpolation case in [1]. Let's call the highest-variance $k$ components of the covariance the "spiked part," and the rest of the components the "tail." Denote for any $k$
\[
\rho_k = \frac{1}{n\lambda_{k+1}}\left(\lambda + \sum_{i>k}\lambda_i\right). \tag{1}
\]
Note that for $\lambda = 0$ this is the usual effective rank (as it's defined in [1]) divided by $n$. Denote another effective rank of the tail: $R_k = \left(\lambda + \sum_{i>k}\lambda_i\right)^2 / \sum_{i>k}\lambda_i^2$. Then the bounds for bias and variance become
\[
\frac{B}{L^2} \lesssim_{\sigma_x} \|\theta^*_{k:\infty}\|^2_{\Sigma_{k:\infty}} + \|\theta^*_{0:k}\|^2_{\Sigma_{0:k}^{-1}}\,\lambda_{k+1}^2\rho_k^2,
\qquad
\frac{V}{\sigma_\varepsilon^2 t L^2} \lesssim_{\sigma_x} \frac{k}{n} + \frac{n}{R_k}.
\]
The following section suggests that to choose $k$ (the size of the spiked part) optimally, one needs to find such $k$ for which $\rho_k$ is of the order of a constant, which only depends on $\sigma_x$. If such $k$ doesn't exist, one should take the smallest $k$ for which $\rho_k$ is larger than a constant.

So the variance term becomes smaller as the rank of the tail increases and the dimension of the spiked part decreases. For the bias term, on the contrary, the components of the tail don't get estimated at all (all the energy $\|\theta^*_{k:\infty}\|^2_{\Sigma_{k:\infty}}$ from that part of the signal goes into the error). When it comes to the spiked part, its estimation error is lower the higher its variance is and the lower the energy of the tail is.

There is a second interpretation, emphasizing the role of $\lambda$. Section C.3 of the appendix shows that if $k$ is chosen as suggested above, then one can write (up to constant multipliers that depend only on $\sigma_x$):
\[
\frac{B}{L^2} \lesssim_{\sigma_x} \sum_i \lambda_i |\theta^*_i|^2 \left( \frac{\rho_k\lambda_{k+1}}{\rho_k\lambda_{k+1} + \lambda_i} \right)^2,
\qquad
\frac{V}{\sigma_\varepsilon^2 t L^2} \lesssim_{\sigma_x} \frac{1}{n}\sum_i \left( \frac{\lambda_i}{\rho_k\lambda_{k+1} + \lambda_i} \right)^2.
\]
So we see that there is a mixture coefficient affecting each term: $\rho_k\lambda_{k+1}$ gives the weight of the bias in the $i$-th component, while $\lambda_i$ gives the weight of the variance. It can be shown that if $\rho_k$ is larger than a constant, then $k$ effectively does not change with $\lambda$, so $\rho_k\lambda_{k+1} = \frac{1}{n}\left(\lambda + \sum_{i>k}\lambda_i\right)$ depends affinely on $\lambda$. This shows how increasing the regularization parameter $\lambda$ changes the weights of the bias and variance terms in each component.

We further discuss assumptions and applications of Theorem 1 in Section 4.
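As a quick illustration of these quantities (our own sketch, not part of the paper), the following Python snippet computes $\rho_k$ and $R_k$ for a given spectrum, picks $k$ as the smallest index with $\rho_k$ above a threshold $b$ (the rule suggested above), and evaluates the resulting up-to-constants bound expressions; the spectrum, the signal, and the threshold are illustrative assumptions.

```python
import numpy as np

def effective_ranks(eigs, lam, n):
    """rho_k and R_k for every split size k = 0, ..., p-1.

    With 0-indexed eigenvalues, the tail for split size k is eigs[k:],
    so lambda_{k+1} of the text corresponds to eigs[k] here.
    """
    eigs = np.asarray(eigs, dtype=float)
    tail = np.cumsum(eigs[::-1])[::-1]              # tail[k] = sum of eigs[k:]
    tail_sq = np.cumsum((eigs ** 2)[::-1])[::-1]    # tail_sq[k] = sum of eigs[k:]**2
    rho = (lam + tail) / (n * eigs)
    R = (lam + tail) ** 2 / tail_sq
    return rho, R

def choose_k(rho, b):
    """Smallest k with rho_k > b; None if no such k exists."""
    idx = np.flatnonzero(rho > b)
    return int(idx[0]) if idx.size else None

# Illustrative spectrum: a fast-decaying head followed by a long flat tail.
n = 200
eigs = np.concatenate([np.exp(-np.arange(10)), np.full(5000, 1e-4)])
theta = 1.0 / np.arange(1, eigs.size + 1)           # illustrative theta*

rho, R = effective_ranks(eigs, lam=0.0, n=n)
k = choose_k(rho, b=1.0)

# Up-to-constants bound expressions of Theorem 1 (note lambda_{k+1} * rho_k = (lam + tail)/n)
B_bound = np.sum(eigs[k:] * theta[k:] ** 2) \
        + np.sum(theta[:k] ** 2 / eigs[:k]) * (eigs[k] * rho[k]) ** 2
V_bound = k / n + n / R[k]
print(f"chosen k = {k},  bias bound ~ {B_bound:.3g},  variance bound ~ {V_bound:.3g}")
```

In this illustrative spectrum the rule lands on a $k$ near the end of the fast-decaying head, which is where the tail starts to behave like a flat, high-effective-rank block.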
The following lower bounds for the bias and variance terms can be expressed succinctly using our modified effective rank $\rho_k$.

Lemma 2. Suppose that $\lambda \geq 0$, the components of the rows of $X$ are independent, and the components of the noise vector $\varepsilon$ have unit variance. Then for some absolute constant $c$, for any $t, k$ s.t. $t > c$ and $k + 2\sigma_x^2 t + \sqrt{kt}\,\sigma_x^2 < n/2$, w.p. at least $1 - e^{-t/c}$,
\[
V \geq \frac{c}{n}\sum_{i=1}^{\infty} \min\left\{ 1, \frac{\lambda_i}{\sigma_x^2\lambda_{k+1}(\rho_k + 2)} \right\}^2.
\]
Lemma 3.
For arbitrary $\bar\theta \in \mathbb{R}^p$ consider the following prior distribution on $\theta^*$: $\theta^*$ is obtained from $\bar\theta$ by randomly flipping the signs of all its coordinates. Suppose also that $\lambda \geq 0$ and it is known for some $k, \delta, L$ that for any $j > k$, w.p. at least $1 - \delta$,
\[
\mu_n(A_{-j}) \geq \frac{1}{L}\left(\lambda + \sum_{i>k}\lambda_i\right).
\]
Then for some absolute constant $c$, for any non-negative $t < \frac{n}{2\sigma_x^2}$, w.p. at least $1 - \delta - e^{-t/c}$,
\[
\mathbb{E}_{\theta^*} B \geq \sum_i \frac{\lambda_i\bar\theta_i^2}{\left(1 + \frac{\lambda_i L}{\lambda_{k+1}\rho_k}\right)^2}.
\]

It may be unclear at first glance whether these lower bounds can match the upper bounds from Theorem 1. We shall see that it actually is the case under assumptions that are not much more restrictive than those of the first part of Theorem 1.

Since both upper and lower bounds are stated for arbitrary $k$, we should first decide which $k$ we want to choose. It turns out that to understand which $k$ to pick one can look at the values of $\rho_k$. We already know from the last statement of Theorem 1 that the ability to upper bound the condition number of $A_k$ by a constant $L$ implies a lower bound on $\rho_k$ that is also a constant (depending only on $\sigma_x$ and $L$). At the same time, the following lemma shows that if $\rho_k$ is not upper bounded by some large constant, then we can shift $k$ to be the first number for which $\rho_k$ is larger than that large constant, and still be able to control the condition number of $A_k$ by a (different) constant with high probability.

Lemma 4 (Controlling the condition number for some $k \in (k^*, n)$ means controlling it for $k^*$). Suppose $\lambda \geq 0$ and that it is known that for some $\delta, L > 0$ and some $k < n$, with probability at least $1 - \delta$ the condition number of the matrix $A_k$ is at most $L$. If $\lambda + \sum_i \lambda_i \geq Kn\lambda_1$ for some $K > 1$, then for some absolute constant $c$, for any $t \in (0, n)$, with probability at least $1 - \delta - e^{-t/c}$,
\[
\frac{(1 - t\sigma_x^2/n)(K-1)}{LK}\left(\lambda + \sum_i\lambda_i\right) \leq \mu_n(A_0) \leq \mu_1(A_0) \leq c\sigma_x^2\,\frac{K+2}{K}\left(\lambda + \sum_i\lambda_i\right).
\]

The ability to choose such a $k$ (and the corresponding $\rho_k$) justifies the conditions of the next theorem.

Theorem 5 (The lower bound is the same as the upper bound). Denote
\[
B_1 := \sum_i \frac{\lambda_i|\theta^*_i|^2}{\left(1 + \frac{\lambda_i}{\lambda_{k+1}\rho_k}\right)^2}, \qquad
B_2 := \|\theta^*_{k:\infty}\|^2_{\Sigma_{k:\infty}} + \|\theta^*_{0:k}\|^2_{\Sigma_{0:k}^{-1}}\left(\frac{\lambda + \sum_{i>k}\lambda_i}{n}\right)^2,
\]
\[
V_1 := \frac{1}{n}\sum_i \min\left\{1, \frac{\lambda_i}{\lambda_{k+1}(\rho_k+2)}\right\}^2, \qquad
V_2 := \frac{k}{n} + \frac{n\sum_{i>k}\lambda_i^2}{\left(\lambda + \sum_{i>k}\lambda_i\right)^2}.
\]
Suppose $\rho_k \in (a, b)$ for some $b > a > 0$. Then
\[
1 \leq \frac{B_2}{B_1} \leq \max\left\{(1+b)^2, (1+a^{-1})^2\right\}, \qquad
1 \leq \frac{V_2}{V_1} \leq \max\left\{(2+b)^2, (1+2a^{-1})^2\right\}.
\]
Alternatively, if $k = \min\{l : \rho_l > b\}$ and $b > 1/n$, then
\[
1 \leq \frac{B_2}{B_1} \leq \max\left\{(1+b)^2, (1+b^{-1})^2\right\}, \qquad
1 \leq \frac{V_2}{V_1} \leq \max\left\{(2+b)^2, (1+2b^{-1})^2\right\}.
\]

The first part of Theorem 1 may seem peculiar: it is not immediately clear whether one would be able to control the condition number of $A_k$, and what one would require to do that. One possible answer is rather straightforward: the condition number of $A_k$ is small if $\lambda$ is large enough. For example, one could just choose $\lambda$ to be such that $\lambda \gtrsim_{\sigma_x} \|X_{k:\infty}X_{k:\infty}^\top\|$ with high probability. On the other hand, the matrix $X_{k:\infty}X_{k:\infty}^\top$ may be well conditioned with high probability on its own, in which case one could choose small or even negative values of $\lambda$. The following lemma, whose proof is given in Appendix I, provides the high probability bound on $\|X_{k:\infty}X_{k:\infty}^\top\|$. Moreover, it shows that one can also control the lowest eigenvalue under a small-ball assumption.
Lemma 6.
For some absolute constant $c$, for any $t > 0$, with probability at least $1 - e^{-t/c}$,
\[
\mu_{\max}(A_k) \leq \lambda + c\sigma_x^2\left(\lambda_{k+1}(t+n) + \sum_{i>k}\lambda_i\right).
\]
If it's additionally known that for some $\delta, L > 0$, w.p. at least $1 - \delta$, $\|X_{1,k:\infty}\|^2 \geq \mathbb{E}\|X_{1,k:\infty}\|^2/L$, then w.p. at least $1 - n\delta - e^{-t/c}$,
\[
\mu_{\min}(A_k) \geq \lambda + \frac{1}{L}\sum_{i>k}\lambda_i - c\sigma_x^2\sqrt{(t+n)\lambda_{k+1}\left(\lambda_{k+1}(t+n) + \sum_{i>k}\lambda_i\right)}.
\]

For example, if $\sum_{i>k}\lambda_i > c_{x,L}\,n\lambda_{k+1}$ for a large enough constant $c_{x,L}$, which depends only on $\sigma_x$ and $L$, then for $t \lesssim_{\sigma_x} n$ both the upper and lower bounds from Lemma 6 match up to a constant factor even for zero $\lambda$. This can be observed by consecutively plugging in $\sum_{i>k}\lambda_i^2 \leq \lambda_{k+1}\sum_{i>k}\lambda_i$ and $\lambda_{k+1}n \leq c_{x,L}^{-1}\sum_{i>k}\lambda_i$, and then choosing $c_{x,L}$ large enough.

To demonstrate the applications of Theorem 1 we consider three different regimes, which are the focus of the following theorem. In the first regime, $\sum_{i>k}\lambda_i \ll n\lambda_{k+1}$ for all $k$, therefore one can only control the condition number of $A_k$ by choosing large $\lambda$. In the second case, $\sum_{i>k}\lambda_i \geq c_x n\lambda_{k+1}$ for some large constant $c_x$, and therefore one can control all the eigenvalues of $A_k$ up to a constant factor even for vanishing $\lambda$. In this case, adding small positive regularization has no effect. Moreover, if $c_x$ from the second case is extremely large, then the matrix $X_{k:\infty}X_{k:\infty}^\top$ concentrates around a properly scaled identity. In this (third) case one can qualitatively change the bound by choosing negative $\lambda$, decreasing the bias without significantly increasing the variance.
Theorem 7. There exists a large constant $c_x$ that only depends on $\sigma_x$ s.t.

1. If $n\lambda_{k+1} \gtrsim_{\sigma_x} \sum_{i>k}\lambda_i$ for some $k < n/c_x$, then for $\lambda = n\lambda_{k+1}$ and for any $t \in (c_x, n/c_x)$, with probability at least $1 - e^{-t/c_x}$,
\[
B \lesssim_{\sigma_x} \|\theta^*_{k:\infty}\|^2_{\Sigma_{k:\infty}} + \lambda_{k+1}^2\|\theta^*_{0:k}\|^2_{\Sigma_{0:k}^{-1}}, \qquad
\frac{V}{\sigma_\varepsilon^2 t} \lesssim_{\sigma_x} \frac{k}{n} + \frac{\sum_{i>k}\lambda_i^2}{n\lambda_{k+1}^2}.
\]

2. Suppose the components of the data vectors are independent and $\sum_{i>k}\lambda_i \geq c_x n\lambda_{k+1}$ for some $k < n/c_x$.

(a) For any non-negative $\lambda < \sum_{i>k}\lambda_i$, for any $t \in (c_x, n/c_x)$, with probability at least $1 - e^{-t/c_x}$,
\[
B \lesssim_{\sigma_x} \|\theta^*_{k:\infty}\|^2_{\Sigma_{k:\infty}} + \|\theta^*_{0:k}\|^2_{\Sigma_{0:k}^{-1}}\left(\frac{\sum_{i>k}\lambda_i}{n}\right)^2, \qquad
\frac{V}{\sigma_\varepsilon^2 t} \lesssim_{\sigma_x} \frac{k}{n} + \frac{n\sum_{i>k}\lambda_i^2}{\left(\sum_{i>k}\lambda_i\right)^2}.
\]

(b) For $\xi > c_x$ and $\lambda = -\sum_{i>k}\lambda_i + \xi\left(n\lambda_{k+1} + \sqrt{n\sum_{i>k}\lambda_i^2}\right)$, for any $t \in (c_x, n/c_x)$, with probability at least $1 - e^{-t/c_x}$,
\[
B \lesssim_{\sigma_x} \|\theta^*_{k:\infty}\|^2_{\Sigma_{k:\infty}} + \|\theta^*_{0:k}\|^2_{\Sigma_{0:k}^{-1}}\,\frac{\xi^2}{n}\left(n\lambda_{k+1}^2 + \sum_{i>k}\lambda_i^2\right), \qquad
\frac{V}{\sigma_\varepsilon^2 t} \lesssim_{\sigma_x} \frac{k}{n} + \frac{\sum_{i>k}\lambda_i^2}{\xi^2\left(n\lambda_{k+1}^2 + \sum_{i>k}\lambda_i^2\right)}.
\]
Proof. 1. By Lemma 6, with probability at least $1 - e^{-t/c}$,
\[
\|A_k\| \leq \lambda + c\sigma_x^2\left(\lambda_{k+1}(t+n) + \sum_{i>k}\lambda_i\right).
\]
On the other hand, $\mu_{\min}(A_k) \geq \lambda$. Plugging in $\lambda = n\lambda_{k+1} \gtrsim_{\sigma_x} \sum_{i>k}\lambda_i$, we obtain that with probability at least $1 - e^{-t/c}$ the condition number of $A_k$ is at most a constant $L$ that only depends on $\sigma_x$, and $\rho_k$ is of constant order (depending only on $\sigma_x$). The first part of Theorem 1 gives the desired result.

2. Applying the corresponding lemma from [1], we obtain that with probability at least $1 - e^{-n/c_x}$ the condition number of $A_k$ is bounded by a constant that only depends on $\sigma_x$. The first part of Theorem 1 gives the first part of the claim.

To obtain the second claim, we use the following result, which was shown in the proof of that lemma in [1]: with probability at least $1 - e^{-n/c_x}$,
\[
\left\|X_{k:\infty}X_{k:\infty}^\top - I_n\sum_{i>k}\lambda_i\right\| \lesssim_{\sigma_x} \lambda_{k+1}n + \sqrt{n\sum_{i>k}\lambda_i^2},
\]
which yields that for $\xi > c_x$ and $\lambda = -\lambda_{k+1}p + \xi\lambda_{k+1}\sqrt{np}$, all the eigenvalues of $A_k$ are equal to $\xi\lambda_{k+1}\sqrt{np}$ up to a multiplicative constant that only depends on $\sigma_x$. The second part of Theorem 1 gives the desired result.

The following corollary presents the consequences for three specific examples of covariance operators and regularization parameters that illustrate the three regimes of Theorem 7.
Corollary 8.

1. Suppose $\lambda_i = e^{-\gamma i}$ for some $\gamma \gtrsim_{\sigma_x} \frac{1}{n}$ and $1 \leq i < \infty$. Then for any $k$, for $\lambda = ne^{-\gamma(k+1)}$,
\[
B \lesssim_{\sigma_x} \sum_{i>k}|\theta^*_i|^2 e^{-\gamma i} + \sum_{i=1}^k |\theta^*_i|^2 e^{-\gamma(2(k+1)-i)}, \qquad
\frac{V}{\sigma_\varepsilon^2 t} \lesssim_{\sigma_x} \frac{k}{n} + \frac{1}{n(1 - e^{-2\gamma})}.
\]

2. Suppose the components of the data vectors are independent, and there exist $k < n/c_x$ and $p > nc_x$ s.t.
\[
\lambda_i = \lambda_1 \ \text{for}\ i \leq k, \qquad \lambda_i = \lambda_{k+1} \ \text{for}\ k < i \leq p.
\]
Then:

(a) If $\lambda = 0$, then for any $t \in (c_x, n/c_x)$, with probability at least $1 - e^{-t/c_x}$,
\[
B \lesssim_{\sigma_x} \|\theta^*_{k:\infty}\|^2\lambda_{k+1} + \|\theta^*_{0:k}\|^2\,\frac{\lambda_{k+1}^2 p^2}{\lambda_1 n^2}, \qquad
\frac{V}{\sigma_\varepsilon^2 t} \lesssim_{\sigma_x} \frac{k}{n} + \frac{n}{p}.
\]

(b) For $\xi > c_x$ and $\lambda = -\lambda_{k+1}p + \xi\lambda_{k+1}\sqrt{np}$, for any $t \in (c_x, n/c_x)$, with probability at least $1 - e^{-t/c_x}$,
\[
B \lesssim_{\sigma_x} \|\theta^*_{k:\infty}\|^2\lambda_{k+1} + \|\theta^*_{0:k}\|^2\,\frac{\xi^2\lambda_{k+1}^2 p}{\lambda_1 n}, \qquad
\frac{V}{\sigma_\varepsilon^2 t} \lesssim_{\sigma_x} \frac{k}{n} + \frac{1}{\xi^2}.
\]
Proof. The statements follow directly from the corresponding parts of Theorem 7.

As we see from Corollary 8, the regularization can be extremely small compared to the energy (the $\ell_2$-norm of the signal) and still provide small generalization error. For example, if we take $\gamma \ll 1$ in part 1 of Corollary 8, for $\lambda \approx ne^{-\gamma k}$ we get $B \lesssim_{\sigma_x} \|\theta^*\|_\infty^2 \frac{e^{-\gamma k}}{\gamma}$ and $\frac{V}{\sigma_\varepsilon^2 t} \lesssim_{\sigma_x} \frac{k}{n} + \frac{1}{n\gamma}$. If we further substitute $k = n^{2/3}$ and $\gamma = n^{-1/3}$, we obtain vanishing (with large $n$) generalization error with regularization that decays as $ne^{-n^{1/3}}$, which is exponentially fast. Moreover, in the second part of Corollary 8, we see that for a particular spiked covariance, taking negative regularization may perform better than zero regularization if the dimension is high enough. Indeed, the bound in Corollary 8, Part 2(b) becomes the same as in Part 2(a) if we plug in $\xi = \sqrt{p/n}$. However, if $\sqrt{p/n}$ is large (e.g., $\frac{n}{p} \ll \frac{k}{n}$), then taking $\xi = c_x\sqrt{n/k}$ would not change the variance term (up to a constant multiplier that only depends on $\sigma_x$), but would significantly decrease the second part of the bias term, which dominates when the noise is sufficiently small. Note that Theorem 5 implies that the bound in Part 2(b) is tight for $\lambda$ large enough (so that $\rho_{k-1} < b < \rho_k$), which means that the upper bound for the case of negative regularization may be smaller than the lower bound for the case of zero regularization.
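To see this effect numerically, here is a small Monte Carlo sketch (ours, with a Gaussian design and illustrative parameter values; the constants are not the ones from the theorem) comparing zero and negative regularization in the spiked covariance model of Corollary 8, Part 2, in a high-SNR configuration where the bias dominates.

```python
import numpy as np

rng = np.random.default_rng(1)

# Spiked covariance of Corollary 8, part 2: k large eigenvalues, a long flat tail.
n, k, p = 100, 5, 2000
lam_head, lam_tail = 1000.0, 1.0
eigs = np.concatenate([np.full(k, lam_head), np.full(p - k, lam_tail)])
theta_star = np.concatenate([np.ones(k), np.zeros(p - k)])  # signal lives on the spikes
sigma_eps = 1.0                                             # high SNR relative to the spikes

def ridge_mse(lam, trials=20):
    """Average prediction MSE ||theta_hat - theta*||_Sigma^2 over independent draws."""
    out = []
    for _ in range(trials):
        X = rng.normal(size=(n, p)) * np.sqrt(eigs)
        y = X @ theta_star + sigma_eps * rng.normal(size=n)
        theta_hat = X.T @ np.linalg.solve(lam * np.eye(n) + X @ X.T, y)
        out.append(np.sum(eigs * (theta_hat - theta_star) ** 2))
    return float(np.mean(out))

xi = 3.0
lam_neg = -lam_tail * p + xi * lam_tail * np.sqrt(n * p)    # the negative lambda of part 2(b)
print("MSE at lambda = 0      :", round(ridge_mse(0.0), 3))
print(f"MSE at lambda = {lam_neg:.0f}   :", round(ridge_mse(lam_neg), 3))
```

With these illustrative numbers the averaged error at the negative $\lambda$ of Part 2(b) comes out noticeably below the error at $\lambda = 0$, driven by the reduction of the bias term; increasing the noise level in this sketch shifts the balance back towards $\lambda = 0$.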
We obtained non-asymptotic generalization bounds for ridge regression, depending on the covariance structure of the data, and relying on much weaker assumptions than previous work. Our bound for the bias term is novel even in the ridgeless case, and it shows that the energy of the true signal that corresponds to the tail goes directly into the error, while the energy that corresponds to the spiked part is reciprocal to the corresponding term of the error. Our results imply that for the case of a rapidly decaying sequence of the eigenvalues of the covariance operator, very small regularization may be sufficient to achieve vanishing generalization error. Moreover, we showed that negative regularization can achieve a similar effect, and can even be beneficial compared to zero regularization for a slowly decaying sequence of eigenvalues, e.g., in the standard spiked covariance setting.

Despite the fact that the second part of Theorem 1 is applicable in a quite general situation, we don't expect that bound to be sharp if there is no possibility to control the condition number of $A_k$. For example, that is the case if the eigenvalues of the covariance operator decay quickly and the regularization is zero. Another possible direction of further work is removing the assumption of sub-gaussianity, which would, for example, allow the results to be applied to a larger class of kernel regression problems.

References

[1] Peter L. Bartlett, Philip M. Long, Gábor Lugosi, and Alexander Tsigler. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 2020.
[2] Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019.
[3] Mikhail Belkin, Daniel Hsu, and Ji Xu. Two models of double descent for weak features. arXiv, abs/1903.07571, 2019.
[4] Koby Bibas, Yaniv Fogel, and Meir Feder. A new look at an old problem: A universal learning approach to linear regression. Pages 2304–2308, 07 2019.
[5] Michał Dereziński, Feynman Liang, and Michael Mahoney. Exact expressions for double descent and implicit regularization via surrogate random design. 12 2019.
[6] Edgar Dobriban and Stefan Wager. High-dimensional asymptotics of prediction: Ridge regression and classification. arXiv: Statistics Theory, pages 247–279, 2015.
[7] Trevor J. Hastie, Andrea Montanari, Saharon Rosset, and Ryan J. Tibshirani. Surprises in high-dimensional ridgeless least squares interpolation. arXiv, abs/1903.08560, 2019.
[8] Dmitry Kobak, Jonathan Lomond, and Benoit Sanchez. Optimal ridge penalty for real-world high-dimensional data can be zero or negative due to the implicit ridge regularization. arXiv: Statistics Theory, 2020.
[9] Tengyuan Liang and Alexander Rakhlin. Just interpolate: Kernel "ridgeless" regression can generalize. arXiv, abs/1808.00387, 2018.
[10] Tengyuan Liang, Alexander Rakhlin, and Xiyu Zhai. On the risk of minimum-norm interpolants and restricted lower isometry of kernels. arXiv, abs/1908.10292, 2019.
[11] Song Mei and Andrea Montanari. The generalization error of random features regression: Precise asymptotics and double descent curve, 2019.
[12] Partha P. Mitra. Understanding overfitting peaks in generalization error: Analytical risk curves for l2 and l1 penalized interpolation. arXiv, abs/1906.03667, 2019.
[13] Vidya Muthukumar, Kailas Vodrahalli, and Anant Sahai. Harmless interpolation of noisy data in regression. Pages 2299–2303, 2019.
[14] Preetum Nakkiran. More data can hurt for linear regression: Sample-wise double descent. arXiv, abs/1912.07242, 2019.
[15] Christos Thrampoulidis, Samet Oymak, and Babak Hassibi. Regularized linear regression: A precise analysis of the estimation error. In Peter Grünwald, Elad Hazan, and Satyen Kale, editors, Proceedings of The 28th Conference on Learning Theory, volume 40 of Proceedings of Machine Learning Research, pages 1683–1709, Paris, France, 03–06 Jul 2015. PMLR.
[16] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices, pages 210–268. Cambridge University Press, 2012.
[17] Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2018.
[18] Ji Xu and Daniel J Hsu. On the number of variables to use in principal component regression. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 5094–5103. Curran Associates, Inc., 2019.
[19] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. CoRR, abs/1611.03530, 2016.
A Ridge regression
Recall that:

• $X \in \mathbb{R}^{n \times p}$ is a random matrix with i.i.d. centered rows;
• the covariance matrix of a row of $X$ is $\Sigma = \mathrm{diag}(\lambda_1, \dots, \lambda_p)$;
• the rows of $X\Sigma^{-1/2}$ are sub-gaussian with sub-gaussian norm at most $\sigma_x$;
• $y = X\theta^* + \varepsilon$ is the response vector, where $\theta^* \in \mathbb{R}^p$ is some unknown vector, and $\varepsilon$ is noise;
• the components of $\varepsilon$ are independent and have sub-gaussian norms bounded by $\sigma_\varepsilon$.

We are interested in evaluating the MSE of the ridge estimator
\[
\hat\theta = \hat\theta(\lambda, y) = \operatorname*{argmin}_{\theta}\left\{\|X\theta - y\|^2 + \lambda\|\theta\|^2\right\} = \left(\lambda I_p + X^\top X\right)^{-1}X^\top y.
\]
Applying the Sherman–Morrison–Woodbury identity yields
\[
\left(\lambda I_p + X^\top X\right)^{-1} = \lambda^{-1}I_p - \lambda^{-2}X^\top(I_n + \lambda^{-1}XX^\top)^{-1}X.
\]
So,
\[
\left(\lambda I_p + X^\top X\right)^{-1}X^\top = \lambda^{-1}X^\top - \lambda^{-2}X^\top(I_n + \lambda^{-1}XX^\top)^{-1}XX^\top
= \lambda^{-1}X^\top - \lambda^{-1}X^\top(I_n + \lambda^{-1}XX^\top)^{-1}(\lambda^{-1}XX^\top + I_n - I_n)
= \lambda^{-1}X^\top(I_n + \lambda^{-1}XX^\top)^{-1}.
\]
Finally,
\[
\hat\theta = X^\top(\lambda I_n + XX^\top)^{-1}y.
\]
Since we have $y = X\theta^* + \varepsilon$ (best misspecified prediction plus noise), the excess MSE can be bounded as
\[
\|\hat\theta - \theta^*\|_\Sigma^2 \lesssim \|(I_p - X^\top(\lambda I_n + XX^\top)^{-1}X)\theta^*\|_\Sigma^2 + \|X^\top(\lambda I_n + XX^\top)^{-1}\varepsilon\|_\Sigma^2.
\]
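The equality of the $p \times p$ and $n \times n$ representations derived above is easy to check numerically; the following sketch (ours, with an arbitrary Gaussian design) verifies it for $\lambda > 0$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, lam = 30, 200, 0.5
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Primal form: (lam I_p + X^T X)^{-1} X^T y   (a p x p solve)
theta_primal = np.linalg.solve(lam * np.eye(p) + X.T @ X, X.T @ y)
# Dual form derived above: X^T (lam I_n + X X^T)^{-1} y   (an n x n solve)
theta_dual = X.T @ np.linalg.solve(lam * np.eye(n) + X @ X.T, y)

print(np.allclose(theta_primal, theta_dual))  # True: the two representations agree
```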
B Splitting the coordinates and the corresponding notation

Let's also fix some $k < n$. We will split the coordinates into two groups: the first $k$ components and the rest of the components.

Consider integers $a, b$ from $0$ to $\infty$.

• For any matrix $M \in \mathbb{R}^{n \times p}$ denote $M_{a:b}$ to be the matrix comprised of the columns of $M$ from the $(a+1)$-st to the $b$-th.
• For any vector $\eta \in \mathbb{R}^p$ denote $\eta_{a:b}$ to be the vector comprised of the components of $\eta$ from the $(a+1)$-st to the $b$-th.
• Denote $\Sigma_{0:k} = \mathrm{diag}(\lambda_1, \dots, \lambda_k)$, and $\Sigma_{k:\infty} = \mathrm{diag}(\lambda_{k+1}, \lambda_{k+2}, \dots)$.

C Main results
C.1 Upper bound on the prediction MSE

Theorem 9.
There exists (large) constants c x , C x , ˜ C x , which only depend on σ x s.t. if1. t ∈ (1 , n/c x ) , PREPRINT - O
1, 2020 √ k + √ t ≤ √ n/c x then k ˆ θ ( λ, y ) − θ ∗ k ≤ B + V , and with probability at least − e − t/c x BC x ≤k θ ∗ k : ∞ k k : ∞ µ ( A − k ) µ n ( A − k ) + µ ( A − k ) n λ k +1 + n X i>k λ i !! + k θ ∗ k k − k n µ n ( A − k ) + µ ( A − k ) µ n ( A − k ) λ k +1 + 1 n X i>k λ i !! VC x σ ε t ≤ µ ( A − k ) µ n ( A − k ) kn + nµ ( A − k ) X i>k λ i . ( B stands for the contribution of the bias term, and V — for the variance).Moreover, if it is also known that for some δ < − e − n/c x with probability at least − δ the condition number of A k is at most L , then with probability at least − δ − e − t/c x B ˜ C x L ≤k θ ∗ k : ∞ k k : ∞ + k θ ∗ k k − k (cid:18) λ + P i>k λ i n (cid:19) V ˜ C x σ ε tL ≤ kn + n P i>k λ i (cid:0) λ + P i>k λ i (cid:1) , and λ + P i>k λ i nλ k +1 ≥ L ˜ C x Proof.
Combining Lemmas 12 and 13 we obtain the following bound k ˆ θ ( λ, y ) − θ ∗ k . k θ ∗ k : ∞ k k : ∞ + µ ( A − k ) µ n ( A − k ) µ (cid:16) Σ − / k X ⊤ k X k Σ − / k (cid:17) µ k (cid:16) Σ − / k X ⊤ k X k Σ − / k (cid:17) k X k : ∞ θ ∗ k : ∞ k + k θ ∗ k k − k µ n ( A − k ) µ k (cid:16) Σ − / k X ⊤ k X k Σ − / k (cid:17) + k X k : ∞ Σ k : ∞ X ⊤ k : ∞ k µ ( A − ) k X k : ∞ θ ∗ k : ∞ k + k X k : ∞ Σ k : ∞ X ⊤ k : ∞ k µ ( A − k ) µ n ( A − k ) µ (Σ − / k X ⊤ k X k Σ − / k ) µ k (Σ − / k X ⊤ k X k Σ − / k ) k Σ − / k θ ∗ k k + ε ⊤ A − k X k Σ − k X ⊤ k A − k εµ n ( A − k ) µ k (cid:16) Σ − / k X ⊤ k X k Σ − / k (cid:17) + ε ⊤ A − X k : ∞ Σ k : ∞ X ⊤ k : ∞ A − ε, where the first 4 terms correspond to the bias, and the last two — to the variance.We proceed with bounding all the random quantities besides µ ( A − k ) and µ n ( A − k ) . • The matrix X k Σ − / k ∈ R k × n has n i.i.d. columns with isotropic sub-gaussian distribution in R k . Bytheorem 5.39 in [16] for some constants c ′ x , C ′ x (which only depend on σ x ) for every t > s.t. √ n − PREPRINT - O
1, 2020 C ′ x √ k − √ t > with probability − − c ′ x t ) µ k (cid:16) Σ − / k X ⊤ k X k Σ − / k (cid:17) ≥ (cid:16) √ n − C ′ x √ k − √ t (cid:17) (2) µ (cid:16) Σ − / k X ⊤ k X k Σ − / k (cid:17) ≤ (cid:16) √ n + C ′ x √ k + √ t (cid:17) . (3)• Since ε is centered sub-gaussian vector with i.i.d. components, which is independent of A − k X k Σ k X ⊤ k A − k and A − X k : ∞ Σ k : ∞ X ⊤ k : ∞ A − , we have by Lemma 18 that for any t > with proba-bility at least − e − c t , ε ⊤ A − k X k Σ − k X ⊤ k A − k ε ≤ C σ ε t tr (cid:0) A − k X k Σ − k X ⊤ k A − k (cid:1) ≤ C σ ε tµ ( A − k ) tr (cid:0) X k Σ − k X ⊤ k (cid:1) ,ε ⊤ A − X k : ∞ Σ k : ∞ X ⊤ k : ∞ A − ε ≤ C σ ε t tr (cid:0) A − X k : ∞ Σ k : ∞ X ⊤ k : ∞ A − (cid:1) ≤ C σ ε tµ ( A − k ) tr (cid:0) X k : ∞ Σ k : ∞ X ⊤ k : ∞ (cid:1) , where we used that µ ( A − k ) ≥ µ ( A − ) Next, tr (cid:0) X k Σ − k X ⊤ k (cid:1) is the sum of squared norms of n i.i.d. isotropic vectors in R k , and tr (cid:0) X k : ∞ Σ k : ∞ X ⊤ k : ∞ (cid:1) is the sum of squared norms of n i.i.d. sub-gaussian vectors with covariance Σ k : ∞ .By Lemma 17 with probability at least − e − c t tr (cid:0) X k Σ − k X ⊤ k (cid:1) ≤ ( n + tσ x ) k tr (cid:0) X k : ∞ Σ k : ∞ X ⊤ k : ∞ (cid:1) ≤ ( n + tσ x ) X i>k λ i . • By Lemma 20, with probability at least − e − t/c k X k : ∞ Σ k : ∞ X ⊤ k : ∞ k ≤ c σ x λ k +1 ( t + n ) + X i>k λ i ! . • The vector X k : ∞ θ ∗ k : ∞ / k θ ∗ k : ∞ k Σ k : ∞ has n i.i.d. centered components with unit variances and sub-gaussiannorms at most σ x . Treating those components as sub-gaussian vectors in R , we can apply Lemma 17 to getthat for any t ∈ (0 , n ) with probability at least − e − c t k X k : ∞ θ ∗ k : ∞ k ≤ ( n + tσ x ) k θ ∗ k : ∞ k k : ∞ . Combining the above bounds gives the first assertion of the theorem.For the second assertion, note that by Lemma 21 for some absolute constant c with probability at least − δ − − c t ) n − tσ x nL λ + X i>k λ i ! ≤ µ n ( A k ) ≤ µ ( A k ) ≤ ( n + tσ x ) Ln λ + X i>k λ i ! , Plugging that into the first assertion gives BC ′′ x L ≤k θ ∗ k : ∞ k k : ∞ n λ k +1 + n P i>k λ i (cid:0) λ + P i>k λ i (cid:1) ! (4) + k θ ∗ k k − k (cid:18) λ + P i>k λ i n (cid:19) + λ k +1 + 1 n X i>k λ i !! (5) VC ′′ x σ ε tL ≤ kn + n P i>k λ i (cid:0) λ + P i>k λ i (cid:1) , (6)(7)12 PREPRINT - O
1, 2020for some C ′′ x , which only depends on σ x .However, recall that Lemma 21 also says that δ < − e − c s implies λ + P i>k λ i nλ k +1 ≥ L · − sσ x /n sσ x /n . Since s < n/c x by assumption, taking c x large enough allows to absorb the quantity − sσ x /n sσ x /n into ˜ C x .Moreover, X i>k λ i ≤ λ k X i>k λ i ≤ Ln λ + X i>k λ i ! sσ x /n − sσ x /n . Plugging it all into 4 and adjusting ˜ C x gives the result. C.2 Upper bound matches the lower bound
In the next theorem we show that the upper bound given in Theorem 1 matches the lower bounds from Lemmas 14and 15 if we choose suitable k . Note that by Lemmas 21 and 4, being able to control the condition number of A k ′ forsome k ′ < n implies that we can choose suitable k . Theorem 10 (The lower bound is the same as the upper bound) . Denote B := X i λ i | θ ∗ i | (cid:16) λ i λ k +1 ρ k (cid:17) B := k θ ∗ k : ∞ k k : ∞ + k θ ∗ k k − k (cid:18) λ + P i>k λ i n (cid:19) ,V := 1 n X i min (cid:26) , λ i λ k +1 ( ρ k + 2) (cid:27) ,V := kn + n P i>k λ i (cid:0) λ + P i>k λ i (cid:1) . Suppose ρ k ∈ ( a, b ) for some b > a > . Then min (cid:8) (1 + b ) − , (1 + a − ) − (cid:9) ≤ B / B ≤ (cid:8) (2 + b ) − , (1 + 2 a − ) − (cid:9) ≤ V / V ≤ Alternatively, if k = min { l : ρ l > b } and b > /n then min (cid:8) (1 + b ) − , (1 + b − ) − (cid:9) ≤ B / B ≤ (cid:8) (2 + b ) − , (1 + 2 b − ) − (cid:9) ≤ V / V ≤ Proof.
First of all, we represent k θ ∗ k : ∞ k k : ∞ + k θ ∗ k k − k (cid:18) λ + P i>k λ i n (cid:19) = X i (cid:18) { i ≤ k } | θ ∗ i | ρ k λ k +1 λ i + { i > k } λ i | θ ∗ i | (cid:19) kn + n P i>k λ i (cid:0) λ + P i>k λ i (cid:1) = X i (cid:18) { i ≤ k } n + { i > k } λ i nλ k +1 ρ k (cid:19) In the following we will bound the ratio of the sums from the statement of the theorem by bounding the ratios of thecorresponding terms.• First case: ρ k ∈ ( a, b ) . 13 PREPRINT - O
1, 2020 – Bias term:* i ≤ k : λ i | θ ∗ i | (cid:16) λ i λ k +1 ρ k (cid:17) : | θ ∗ i | ρ k λ k +1 λ i = λ i ρ k λ k +1 (cid:16) λ i λ k +1 ρ k (cid:17) = (cid:18) λ k +1 ρ k λ i (cid:19) − ∈ (cid:0) (1 + b ) − , (cid:1) * i > k : λ i | θ ∗ i | (cid:16) λ i λ k +1 ρ k (cid:17) : λ i | θ ∗ i | = (cid:18) λ i λ k +1 ρ k (cid:19) − ∈ (cid:0) (1 + a − ) − , (cid:1) – Variance term:* i ≤ k : n min (cid:26) , λ i λ k +1 ( ρ k + 2) (cid:27) : 1 n ∈ (cid:0) (2 + b ) − , (cid:3) * i > k : n min (cid:26) , λ i λ k +1 ( ρ k + 2) (cid:27) : λ i nλ k +1 ρ k = λ i λ k +1 ( ρ k + 2) : λ i λ k +1 ρ k = ρ k ( ρ k + 2) ∈ (cid:0) (1 + 2 a − ) − , (cid:1) • Second case: k = min { l : ρ l > b } . In this case we have ρ k ≥ b,λ k + nλ k +1 ρ k nλ k = λ + λ k + P i>k λ i nλ k = ρ k − < b, ∀ i ≤ k : λ i ≥ λ k ≥ nλ k +1 ρ k nb − λ k +1 ρ k b ≥ λ k +1 ρ k b . The rest of the computation is analogous to the previous case: – Bias term: 14
1, 2020* i ≤ k : λ i | θ ∗ i | (cid:16) λ i λ k +1 ρ k (cid:17) : | θ ∗ i | ρ k λ k +1 λ i = λ i ρ k λ k +1 (cid:16) λ i λ k +1 ρ k (cid:17) = (cid:18) λ k +1 ρ k λ i (cid:19) − ∈ (cid:2) (1 + b ) − , (cid:1) * i > k : λ i | θ ∗ i | (cid:16) λ i λ k +1 ρ k (cid:17) : λ i | θ ∗ i | = (cid:18) λ i λ k +1 ρ k (cid:19) − ∈ (cid:2) (1 + b − ) − , (cid:1) – Variance term:* i ≤ k : n min (cid:26) , λ i λ k +1 ( ρ k + 2) (cid:27) : 1 n ∈ (cid:20) λ k +1 ρ k /b λ k +1 ( ρ k + 2) , (cid:21) ⊆ (cid:20) b ( b + 2) b , (cid:21) = (cid:2) ( b + 2) − , (cid:3) * i > k : n min (cid:26) , λ i λ k +1 ( ρ k + 2) (cid:27) : λ i nλ k +1 ρ k = λ i λ k +1 ( ρ k + 2) : λ i λ k +1 ρ k = ρ k ( ρ k + 2) ∈ (cid:2) (1 + 2 b − ) − , (cid:3) C.3 Discussion of the main results
Our bounds for bias and variance are correspondingly k θ ∗ k : ∞ k k : ∞ + k θ ∗ k k − k (cid:18) λ + P i>k λ i n (cid:19) ,kn + n P i>k λ i (cid:0) λ + P i>k λ i (cid:1) . PREPRINT - O
1, 2020First of all, we represent (recall that ρ k = nλ k +1 (cid:0) λ + P i>k λ i (cid:1) ) k θ ∗ k : ∞ k k : ∞ + k θ ∗ k k − k (cid:18) λ + P i>k λ i n (cid:19) = X i (cid:18) { i ≤ k } | θ ∗ i | ρ k λ k +1 λ i + { i > k } λ i | θ ∗ i | (cid:19) = X i λ i | θ ∗ i | (cid:18) { i ≤ k } ρ k λ k +1 λ i + { i > k } (cid:19) kn + n P i>k λ i (cid:0) λ + P i>k λ i (cid:1) = X i (cid:18) { i ≤ k } n + { i > k } λ i nλ k +1 ρ k (cid:19) = X i n (cid:18) { i ≤ k } + { i > k } λ i λ k +1 ρ k (cid:19) Suppose that either ρ k ∈ ( a, b ) or k = min { κ : ρ κ > b } . Therefore, the expressions above can be rewritten (up to aconstant multiplier) as B ≈ X i λ i | θ ∗ i | min (cid:26) ρ k λ k +1 λ i , (cid:27) V ≈ n X i min (cid:26) , λ i λ k +1 ρ k (cid:27) or B ≈ X i λ i | θ ∗ i | ρ k λ k +1 ρ k λ k +1 + λ i V ≈ n X i λ i λ k +1 ρ k + λ i So we see that ρ k λ k is the weight of bias in each component, while the weight of variance in the i -th component is λ i .Increasing λ increases ρ k . Moreover, decreasing k as λ increases is neglectable (up to a constant multiplier): indeed ρ k λ k = n λ + n P i>k λ i . Suppose ρ k ≥ a , then increasing k by ∆ k means multiplying that quantity by at least − a ∆ k/n , which is lower bounded by a constant if ∆ k ≪ n .Increasing sample size decreases ρ k λ k : indeed, the larger n the larger k is. Therefore, increasing the sample sizeincreases the weight of variance, but decreases the variance itself. Theorem 11.
There exists a large constant c x that only depends on σ x s.t.1. If nλ k +1 & σ x P i>k λ i for some k < n/c x , then for λ = nλ k +1 for any t ∈ ( c x , n/c x ) with probability atleast − e − t/c x B . σ x k θ ∗ k : ∞ k k : ∞ + λ k k θ ∗ k k − k Vσ ε t . σ x kn + P i>k λ i nλ k ,
2. Suppose the components of the data vectors are independent. If k = min { κ : ρ κ > c x } and k < n/c x thenfor any non-negative λ < P i>k λ i for any t ∈ ( c x , n/c x ) with probability at least − e − t/c x B . σ x k θ ∗ k : ∞ k k : ∞ + k θ ∗ k k − k (cid:18) P i>k λ i n (cid:19) ,Vσ ε t . σ x kn + n P i>k λ i (cid:0)P i>k λ i (cid:1) . PREPRINT - O
3. Suppose the components of the data vectors are independent, and there exist such k < n/c x and p > nc x s.t. λ i = λ k +1 for k < i ≤ pλ i = 0 for i > p Then for ξ > c x and λ = − λ k +1 p + ξλ k +1 √ np for any t ∈ ( c x , n/c x ) with probability at least − e − t/c x B . σ x k θ ∗ k : ∞ k k : ∞ + k θ ∗ k k − k ξ λ k +1 pnVσ ε t . σ x kn + 1 ξ . Proof.
1. By Lemma 20 with probability at least − e − t/c k A k k ≤ λ + cσ x λ k +1 ( t + n ) + X i>k λ i ! . On the other hand, µ min ( A k ) ≥ λ. Plugging in λ = nλ k +1 & σ x P i>k λ i , we obtain that with probability atleast − e − t/c the condition number of A k is at most a constant L that only depends on σ x , and ρ k ≈ σ x .The first part of Theorem 1 gives the desired result.2. Applying Lemma from [1] we obtain that, with probability at least − e − n/c x the condition number of A k is bounded by a constant that only depends on σ x . The first part of Theorem 1 gives the desired result.3. Applying theorem 5.39 in [16] we see that for some constants c ′ x , C ′ x (which only depend on σ x ) for every t > s.t. √ p − k − C ′ x √ n − √ t > with probability − − c ′ x t ) µ n ( A k − λI n ) ≥ λ k +1 (cid:16)p p − k − C ′ x √ n − √ t (cid:17) µ ( A k − λI n ) ≤ λ k +1 (cid:16)p p − k + C ′ x √ k + √ t (cid:17) . which yields that for ξ > c x and λ = − λ k +1 p + ξλ k +1 √ np all the eigenvalues of A k are equal to ξλ k +1 √ np up to a multiplicative constant, that only depends on σ x . The second part of Theorem 1 gives the desiredresult. D Deriving a useful identity
We have ˆ θ ( λ, y ) ⊤ = (cid:2) ˆ θ ( λ, y ) ⊤ k , ˆ θ ( λ, y ) ⊤ k : ∞ (cid:3) . The goal of this section is to show ˆ θ ( λ, y ) k + X ⊤ k A − k X k ˆ θ ( λ, y ) k = X ⊤ k A − k y, (8)where A k = A k ( λ ) = λI n + X k : ∞ X ⊤ k : ∞ . D.1 Derivation in ridgeless case
In the ridgeless case we are simply dealing with projections, and ˆ θ is the minimum norm interpolating solution. Notethat ˆ θ k : ∞ is also the minimum norm solution to the equation X k : ∞ θ k : ∞ = y − X k ˆ θ k , where θ k : ∞ is the variable.Thus, we can write ˆ θ k : ∞ = X ⊤ k : ∞ (cid:0) X k : ∞ X ⊤ k : ∞ (cid:1) − (cid:16) y − X k ˆ θ k (cid:17) . Now we need to minimize the norm in ˆ θ k (our choice of ˆ θ k : ∞ already makes solution interpolating): we need tominimize the norm of the following vector v ( θ k ) = h θ ⊤ k , ( y − X k θ k ) ⊤ (cid:0) X k : ∞ X ⊤ k : ∞ (cid:1) − X k : ∞ i PREPRINT - O
1, 2020As θ k varies, this vector swipes the affine subspace of our Hilbert space. ˆ θ k gives the minimal norm vector if andonly if for any additional vector η k we have v (ˆ θ k ) ⊥ v (ˆ θ k + η k ) − v (ˆ θ k ) . This is equivalent to writing ∀ η k (cid:28)h ˆ θ ⊤ k , (cid:16) y − X k ˆ θ k (cid:17) ⊤ (cid:0) X k : ∞ X ⊤ k : ∞ (cid:1) − X k : ∞ i , h η ⊤ k , − η ⊤ k X ⊤ k (cid:0) X k : ∞ X ⊤ k : ∞ (cid:1) − X k : ∞ i(cid:29) = 0 , or ˆ θ ⊤ k − (cid:16) y − X k ˆ θ k (cid:17) ⊤ (cid:0) X k : ∞ X ⊤ k : ∞ (cid:1) − X k = 0 . ˆ θ k + X ⊤ k A − k X k ˆ θ k = X ⊤ k A − k y, where we replaced X k : ∞ X ⊤ k : ∞ =: A k D.2 Checking for the case of non-vanishing regularization
So, now we define $A_k = A_k(\lambda) = \lambda I_n + X_{k:\infty}X_{k:\infty}^\top$, and we want to prove that $\hat\theta_{0:k} + X_{0:k}^\top A_k^{-1}X_{0:k}\hat\theta_{0:k} = X_{0:k}^\top A_k^{-1}y$, where $\hat\theta_{0:k}$ corresponds to the solution of the ridge regularized problem from Section A:
\[
\hat\theta = X^\top(\lambda I_n + XX^\top)^{-1}y, \qquad \hat\theta_{0:k} = X_{0:k}^\top(A_k + X_{0:k}X_{0:k}^\top)^{-1}y.
\]
Plugging into the identity yields
\[
\hat\theta_{0:k} + X_{0:k}^\top A_k^{-1}X_{0:k}\hat\theta_{0:k}
= X_{0:k}^\top(A_k + X_{0:k}X_{0:k}^\top)^{-1}y + X_{0:k}^\top A_k^{-1}X_{0:k}X_{0:k}^\top(A_k + X_{0:k}X_{0:k}^\top)^{-1}y
= X_{0:k}^\top A_k^{-1}(A_k + X_{0:k}X_{0:k}^\top)(A_k + X_{0:k}X_{0:k}^\top)^{-1}y
= X_{0:k}^\top A_k^{-1}y.
\]
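The identity just verified algebraically can also be checked numerically; the sketch below (ours, Gaussian design and arbitrary illustrative sizes) confirms it for a positive $\lambda$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, k, lam = 40, 300, 7, 0.3
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

theta_hat = X.T @ np.linalg.solve(lam * np.eye(n) + X @ X.T, y)  # full ridge solution
theta_k = theta_hat[:k]                                          # its first k coordinates

Xk, Xtail = X[:, :k], X[:, k:]
Ak = lam * np.eye(n) + Xtail @ Xtail.T                           # A_k = lam I_n + X_{k:inf} X_{k:inf}^T

lhs = theta_k + Xk.T @ np.linalg.solve(Ak, Xk @ theta_k)
rhs = Xk.T @ np.linalg.solve(Ak, y)
print(np.allclose(lhs, rhs))                                     # True: identity (8) holds
```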
E Variance

The variance term is $\|\hat\theta(\lambda, \varepsilon)\|_\Sigma^2 = \|X^\top(\lambda I_n + XX^\top)^{-1}\varepsilon\|_\Sigma^2$. In this section we prove the following
Lemma 12. k X ⊤ ( λI n + XX ⊤ ) − ε k ≤ ε ⊤ A − k X k Σ − k X ⊤ k A − k εµ n ( A − k ) µ k (cid:16) Σ − / k X ⊤ k X k Σ − / k (cid:17) + ε ⊤ A − X k : ∞ Σ k : ∞ X ⊤ k : ∞ A − ε. E.1 First k components It was shown in section D, that the following identity holds (c.f. 8): ˆ θ ( λ, ε ) k + X ⊤ k A − k X k ˆ θ ( λ, ε ) k = X ⊤ k A − k ε Thus, multiplying the identity by ˆ θ ( λ, ε ) k from the left, we have ˆ θ ( λ, ε ) ⊤ k X ⊤ k A − k X k ˆ θ ( λ, ε ) k ≤ k ˆ θ ( λ, ε ) k k + ˆ θ ( λ, ε ) ⊤ k X ⊤ k A − k X k ˆ θ ( λ, ε ) k = ˆ θ ( λ, ε ) k X ⊤ k A − k ε. The leftmost expression is a quadratic form in ˆ θ ( λ, ε ) k , and the rightmost is linear. We use these expressions tobound the norm of ˆ θ ( λ, ε ) k in the norm, that corresponds to the covariance matrix of the data.First, we extract that norm from the quadratic part ˆ θ ( λ, ε ) ⊤ k X ⊤ k A − k X k ˆ θ ( λ, ε ) k ≥ µ n ( A − k )ˆ θ ( λ, ε ) ⊤ k X ⊤ k X k ˆ θ ( λ, ε ) k ≥k ˆ θ ( λ, ε ) k k k µ n ( A − k ) µ k (cid:16) Σ − / k X ⊤ k X k Σ − / k (cid:17) , where Σ k = diag( λ , . . . , λ k ) , and µ l denotes the l -th largest eigenvalue of a matrix.18 PREPRINT - O
1, 2020Now we can write k ˆ θ ( λ, ε ) k k k µ n ( A − k ) µ k (cid:16) Σ − / k X ⊤ k X k Σ − / k (cid:17) ≤ k ˆ θ ( λ, ε ) k k Σ k (cid:13)(cid:13)(cid:13) Σ − / k X ⊤ k A − k ε (cid:13)(cid:13)(cid:13) , k ˆ θ ( λ, ε ) k k k ≤ ε ⊤ A − k X k Σ − k X ⊤ k A − k εµ n ( A − k ) µ k (cid:16) Σ − / k X ⊤ k X k Σ − / k (cid:17) . E.2 Components starting from k + 1 -st The rest of the variance term is (cid:13)(cid:13)(cid:13) Σ / k : ∞ X ⊤ k : ∞ A − ε (cid:13)(cid:13)(cid:13) = ε ⊤ A − X k : ∞ Σ k : ∞ X ⊤ k : ∞ A − ε. F Bias
The Bias term is given by k θ ∗ − ˆ θ ( λ, Xθ ∗ ) k . In this section we prove the following Lemma 13 (Bias term) . k θ ∗ − ˆ θ ( λ, Xθ ∗ ) k . k θ ∗ k : ∞ k k : ∞ + µ ( A − k ) µ n ( A − k ) µ (cid:16) Σ − / k X ⊤ k X k Σ − / k (cid:17) µ k (cid:16) Σ − / k X ⊤ k X k Σ − / k (cid:17) k X k : ∞ θ ∗ k : ∞ k + k θ ∗ k k − k µ n ( A − k ) µ k (cid:16) Σ − / k X ⊤ k X k Σ − / k (cid:17) + k X k : ∞ Σ k : ∞ X ⊤ k : ∞ k µ ( A − ) k X k : ∞ θ ∗ k : ∞ k + k X k : ∞ Σ k : ∞ X ⊤ k : ∞ k µ ( A − k ) µ n ( A − k ) µ (Σ − / k X ⊤ k X k Σ − / k ) µ k (Σ − / k X ⊤ k X k Σ − / k ) k Σ − / k θ ∗ k k . F.1 First k components We need to bound k θ ∗ k − ˆ θ k ( λ, Xθ ∗ ) k k . By section D, in particular identity 8 we have ˆ θ ( λ, Xθ ∗ ) k + X ⊤ k A − k X k ˆ θ ( λ, Xθ ∗ ) k = X ⊤ k A − k Xθ ∗ . Denote ζ := ˆ θ ( λ, Xθ ∗ ) − θ ∗ . We can rewrite the equation above as ζ k + X ⊤ k A − k X k ζ k = X ⊤ k A − k X k : ∞ θ ∗ k : ∞ − θ ∗ k . Multiplying both sides by ζ ⊤ k from the left and using that ζ ⊤ k ζ k = k ζ k k ≥ we obtain ζ ⊤ k X ⊤ k A − k X k ζ k ≤ ζ ⊤ k X ⊤ k A − k X k : ∞ θ ∗ k : ∞ − ζ ⊤ k θ ∗ k . This equation can be rewritten as ζ ⊤ k Σ / k Σ − / k X ⊤ k A − k X k Σ − / k Σ / k ζ k ≤ ζ ⊤ k Σ / k Σ − / k X ⊤ k A − k X k : ∞ θ ∗ k : ∞ − ζ ⊤ k Σ / k Σ − / k θ ∗ k . Thus k ζ k k k µ n ( A − k ) µ k (cid:16) Σ − / k X ⊤ k X k Σ − / k (cid:17) ≤ k ζ k k Σ k µ ( A − k ) r µ (cid:16) Σ − / k X ⊤ k X k Σ − / k (cid:17) k X k : ∞ θ ∗ k : ∞ k + k ζ k k Σ k k θ ∗ k k Σ − k . PREPRINT - O
1, 2020 k ζ k k Σ k ≤ µ ( A − k ) µ n ( A − k ) µ (cid:16) Σ − / k X ⊤ k X k Σ − / k (cid:17) / µ k (cid:16) Σ − / k X ⊤ k X k Σ − / k (cid:17) k X k : ∞ θ ∗ k : ∞ k + k θ ∗ k k Σ − k µ n ( A − k ) µ k (cid:16) Σ − / k X ⊤ k X k Σ − / k (cid:17) F.2 The rest of the components
Recall that the full bias term is k ( I p − X ⊤ ( λI n + XX ⊤ ) − X ) θ ∗ k .Recall that A = λI n + XX ⊤ . The contribution of the components of ζ , starting from the k + 1 st can be bounded asfollows: k θ ∗ k : ∞ − X ⊤ k : ∞ A − Xθ ∗ k k : ∞ . k θ ∗ k : ∞ k Σ k : ∞ + k X ⊤ k : ∞ A − X k : ∞ θ ∗ k : ∞ k k : ∞ + k X ⊤ k : ∞ A − X k θ ∗ k k k : ∞ First of all, let’s deal with the second term: k X ⊤ k : ∞ A − X k : ∞ θ ∗ k : ∞ k k : ∞ = k Σ / k : ∞ X ⊤ k : ∞ A − X k : ∞ θ ∗ k : ∞ k ≤k X k : ∞ Σ k : ∞ X ⊤ k : ∞ k µ ( A − ) k X k : ∞ θ ∗ k : ∞ k . Now, let’s deal with the last term. Note that A = A k + X k X ⊤ k . By Sherman–Morrison–Woodbury formula A − X k =( A − k + X k X ⊤ k ) − X k = (cid:16) A − k − A − k X k (cid:0) I k + X ⊤ k A − k X k (cid:1) − X T k A − k (cid:17) X k = A − k X k (cid:16) I n − (cid:0) I k + X ⊤ k A − k X k (cid:1) − X T k A − k X k (cid:17) = A − k X k (cid:16) I n − (cid:0) I k + X ⊤ k A − k X k (cid:1) − (cid:0) I k + X T k A − k X k − I k (cid:1)(cid:17) = A − k X k (cid:0) I k + X ⊤ k A − k X k (cid:1) − Thus, k X ⊤ k : ∞ A − X k θ ∗ k k k : ∞ = k X ⊤ k : ∞ A − k X k (cid:0) I k + X ⊤ k A − k X k (cid:1) − θ ∗ k k k : ∞ = k Σ / k : ∞ X ⊤ k : ∞ A − k X k Σ − / k (cid:16) Σ − k + Σ − / k X ⊤ k A − k X k Σ − / k (cid:17) − Σ − / k θ ∗ k k ≤k X k : ∞ Σ k : ∞ X ⊤ k : ∞ k µ ( A − k ) µ (Σ − / k X ⊤ k X k Σ − / k ) µ k (Σ − / k X ⊤ k A − k X k Σ − / k ) k Σ − / k θ ∗ k k ≤k X k : ∞ Σ k : ∞ X ⊤ k : ∞ k µ ( A − k ) µ n ( A − k ) µ (Σ − / k X ⊤ k X k Σ − / k ) µ k (Σ − / k X ⊤ k X k Σ − / k ) k Σ − / k θ ∗ k k G Lower bounds
G.1 Variance term
The argument for the variance term is verbatim the same as the argument from "Benign overfitting in linear regression" [1].
Lemma 14.
Suppose that components of vectors X i, ∗ are independent, and the components of the noise vector ε haveunit variance.. Then for some absolute constant c for any t, k s.t. t > c and k + 2 σ x t + √ ktσ x < n/ w.p. at least − e − t/c V := E ε k X ⊤ ( λI n + XX ⊤ ) − ε k ≥ cn X i =1 min (cid:26) , λ i σ x λ k +1 ( ρ k + 2) (cid:27) . PREPRINT - O
Proof.
The variance term can be written as V = E ε k X ⊤ ( λI n + XX ⊤ ) − ε k = tr (cid:0) Σ X ⊤ A − X (cid:1) = ∞ X i =1 λ i z ⊤ i A − − i z i (1 + λ i z ⊤ i A − − i z i ) . First of all, by Cauchy-Schwartz we have k z i k · z ⊤ i A − − i z i ≥ ( z ⊤ i A − − i z i ) . Thus,
V ≥ ∞ X i =1 k z i k (cid:0) λ i z ⊤ i A − − i z i ) − (cid:1) Take some k < n . By Lemma 20, for some absolute constant c with probability at least − e − t/c k A k k ≤ c σ x λ k +1 ( t + n ) + λ + X i>k λ i ! . Since A − i − A k has rank at most k , the lowest n − k eigenvalues of A − i don’t exceed k A k k . Denote P i,k to be aprojector on the linear space spanned by the first k eigenvectors of A − i . Then z ⊤ i A − − i z i ≥ k ( I − P i,k ) z i k k A k k − = (cid:0) k z i k − k P i,k z i k (cid:1) k A k k − . Since z i is independent of P i,k , by theorem Theorem 6.2.1 (Hanson-Wright inequality) in [17], for some absoluteconstant c for any t > P (cid:8)(cid:12)(cid:12) k P i,k z i k − E z i k P i,k z i k (cid:12)(cid:12) ≥ t (cid:9) ≤ − c − min ( t σ x k P i,k k F , tσ x k P i,k k )! . Since P i,k is an orthogonal projector of rank k , k P i,k k F = k, k P i,k k = 1 , E z i k P i,k z i k = tr( P i,k ) = k. Thus, w.p.at least − e − t/c (cid:12)(cid:12) k P i,k z i k − k (cid:12)(cid:12) ≤ σ x max( √ kt, t ) ≤ ( t + √ kt ) σ x Next, by Lemma 17 for some constant c and any t ∈ (0 , n ) w.p. at least − e − t/c n − tσ x ≤ k z i k ≤ n + tσ x . We see that for some constants c , c for any t ∈ ( c , n ) w.p. at least − e − t/c z ⊤ i A − − i z i ≥ n − k − tσ x − √ ktσ x c σ x (cid:0) λ k +1 ( t + n ) + λ + P i>k λ i (cid:1) . We can simplify the expression by moving to the case when k, t ≪ n . In particular, for some constant c for any k, t s.t. t > c and k + 2 σ x t + √ ktσ x < n/ w.p. at least − e − t/c k z i k ∈ [ n/ , n ] , ( z ⊤ i A − − i z i ) − ≤ c σ x λ k +1 + 1 n λ + X i>k λ i !! = c σ x λ k +1 ( ρ k + 2) , where ρ k := nλ k +1 (cid:0) λ + P i>k λ i (cid:1) . Thus, on the same event k z i k (cid:0) λ i z ⊤ i A − − i z i ) − (cid:1) ≥ n (cid:16) c σ x λ k +1 λ i ( ρ k + 2) (cid:17) Finally, by Lemma 9 from "Benign overfitting...", for some constant c for any k, t s.t. t > c and k +2 σ x t + √ ktσ x CTOBER 1, 2020 G.2 Bias termLemma 15. For arbitrary ¯ θ ∈ R p consider the following prior distribution on θ ∗ : θ ∗ is obtained from ¯ θ randomlyflipping signs of all it’s coordinates. Recall that the bias term is B := θ ∗ (cid:16)(cid:0) λI p + X ⊤ X (cid:1) − X ⊤ X − I p (cid:17) Σ (cid:16)(cid:0) λI p + X ⊤ X (cid:1) − X ⊤ X − I p (cid:17) θ ∗ Suppose also that it is known for some k, δ, L that for any j > k w.p. at least − δ µ n ( A − j ) ≥ L (cid:0) λ + P i>k λ i (cid:1) . Then for some absolute constant c for any non-negative t < n σ x w.p. at least − δ − e − t/c E θ ∗ B ≥ X i λ i ¯ θ i (cid:16) λ i Lλ k +1 ρ k (cid:17) . Proof. Applying Sherman-Morrison-Woodbury yields (cid:0) λI p + X ⊤ X (cid:1) − = λ − I p − λ − X ⊤ ( I n + λ − XX ⊤ ) − X So, (cid:0) λI p + X ⊤ X (cid:1) − X ⊤ X − I p = (cid:0) λI p + X ⊤ X (cid:1) − ( λI p + X ⊤ X − λI p ) − I p = − λ (cid:0) λI p + X ⊤ X (cid:1) − = I p − λ − X ⊤ ( I n + λ − XX ⊤ ) − X = I p − X ⊤ ( λI n + XX ⊤ ) − X. Thus, the bias term becomes θ ∗ (cid:0) I p − X ⊤ ( λI n + XX ⊤ ) − X (cid:1) Σ (cid:0) I p − X ⊤ ( λI n + XX ⊤ ) − X (cid:1) θ ∗ and taking expectation over the prior kills all the off-diagonal elements, so E θ ∗ B = X i (cid:0)(cid:0) I p − X ⊤ ( λI n + XX ⊤ ) − X (cid:1) Σ (cid:0) I p − X ⊤ ( λI n + XX ⊤ ) − X (cid:1)(cid:1) i,i ¯ θ i . Let’s compute the diagonal elements of the matrix (cid:0) I p − X ⊤ ( λI n + XX ⊤ ) − X (cid:1) Σ (cid:0) I p − X ⊤ ( λI n + XX ⊤ ) − X (cid:1) (e.g. the i -th diagonal element is equal to the bias term for the case when θ ∗ = e i — the i -th vector of standardorthonormal basis). 
Note that the i -th row of I p − X ⊤ ( λI n + XX ⊤ ) − X is equal to e i − √ λ i z ⊤ i ( λI n + XX ⊤ ) − X, so the i -th diagonal element of the initial matrix is given by λ i (cid:0) − λ i z ⊤ i A − z i (cid:1) + X j = i λ i λ j ( z ⊤ i A − z j ) , where A = λI n + P pi =0 λ i z i z ⊤ i . Denote also A − i := A − λ i z i z ⊤ i , A − i, − j := A − λ i z i z ⊤ i − λ j z j z ⊤ j . First, − λ i z ⊤ i (cid:0) A − i + λ i z i z ⊤ i (cid:1) − z i =1 − λ i z ⊤ i (cid:0) A − − i − λ i A − − i z i (1 + z ⊤ i A − − i z i ) − z ⊤ i A − − i (cid:1) z i =1 − λ i z ⊤ i A − − i z i + (cid:0) λ i z ⊤ i A − − i z i (cid:1) λ i z ⊤ i A − − i z i = 11 + λ i z ⊤ i A − − i z i . So the diagonal element becomes λ i (1 + λ i z ⊤ i A − − i z i ) + X j = i λ i λ j ( z ⊤ i A − z j ) ≥ λ i (1 + λ i z ⊤ i A − − i z i ) , PREPRINT - O CTOBER 1, 2020and thus E θ ∗ B ≥ X i λ i ¯ θ i (1 + λ i z ⊤ i A − − i z i ) . Let’s bound λ i ¯ θ i (1+ λ i z ⊤ i A − − i z i ) from below with high probability. By our assumptions, for any i with probability at least − δ µ n ( A − i ) ≥ L λ + X j>k λ j Next, λ i (1 + λ i z ⊤ i A − − i z i ) ≥ λ i (1 + λ i µ n ( A − i ) − k z i k ) and by Lemma 17 for some absolute constant c for any t ∈ (0 , n ) w.p. at least − e − t/c we have k z i k ≤ n − tσ x ≤ n/ , where the last transition is true if additionally t ≤ n/ (2 σ x ) . Recall that ρ k := λ + P j>k λ j nλ k +1 . We obtain for any t ∈ (cid:0) , n/ (2 σ x ) (cid:1) w.p. at least − δ − e − t/c λ i ¯ θ i (1 + λ i z ⊤ i A − − i z i ) ≥ λ i ¯ θ i (cid:16) λ i Lλ k +1 ρ k (cid:17) Finally, since all the terms are non-negative, Lemma 9 from "Benign overfitting . . . " gives the result. H Concentration inequalities Lemma 16 (Non-standard norms of sub-gaussian vectors ) . Suppose V is a sub-gaussian vector in R p with k V k ψ ≤ σ . Consider Σ = diag( λ , . . . , λ p ) for some positive non-decreasing sequence { λ i } pi =1 . Then for some absoluteconstant c for any t > P ( k Σ / V k > cσ tλ + X i λ i !) ≤ e − t/c . Proof. The argument consists of two parts: first, we obtain the bound, which only works well in the case when all λ i are approximately the same. Next, we split the sequence { λ i } into pieces with approximately equal values within eachpiece, and obtain the final result by applying the first part of the argument to each piece. First part: Consider a / -net { u j } mj =1 on S p − , such that m ≤ p . We have k Σ / V k ≤ p λ max j h V, u j i ≤ p λ max j h V, u j i . Since the random variable h V, u j i is σ -sub-gaussian, it also holds for any t > and some absolute constant c that P ( |h V, u j i| > t ) ≤ e − ct /σ . P (4 λ h V, u j i > λ tσ ) ≤ e − ct . By multiplicity correction, we obtain P (cid:18) k Σ / V k > λ σ t + 4 σ λ log 9 c p (cid:19) ≤ e − ct . We see that the random variable (cid:16) k Σ / V k − σ λ log 9 c p (cid:17) + has sub-exponential norm bounded by Cσ λ . Second part: Now, instead of applying the result that we’ve just obtained to the whole vector V , let’s split it in thefollowing way: define the sub-sequence { i j } in such that i = 1 , and for any l ≥ i l +1 = min { i : λ i < λ i l / } .Denote V l to be a sub-vector of V , comprised of components from i l -th to i l +1 − -th. Let Σ l = diag( λ i l , . . . , λ i l +1 − ) . PREPRINT - O CTOBER 1, 2020Then by the initial argument, the random variable (cid:16) k Σ / l V l k − σ λ il log 9 c ( i l +1 − i l ) (cid:17) + has sub-exponential normbounded by Cσ λ i l . 
Since each next λ i l is at most a half of the previous, we obtain that the sum (over l ) of thoserandom variables has sub-exponential norm at most Cσ λ . Combining it with the fact that i l +1 − X i = i l λ i ≥ ( i l +1 − i l ) λ i l +1 − ≥ ( i l +1 − i l ) λ i l +1 / , we obtain that for some absolute constants c , c , . . . for any t > e − c t ≥ P (X l (cid:16) k Σ / l V l k − c σ λ i l ( i l +1 − i l ) (cid:17) > c σ λ t ) ≥ P ( k Σ / V k ≥ c σ X i λ i + c σ λ t ) Lemma 17 (Concentration of the sum of squared norms) . Suppose Z ∈ R n × p is a matrix with independent isotropicsub-gaussian rows with k Z i, ∗ k ψ ≤ σ . Consider Σ = diag( λ , . . . , λ p ) for some positive non-decreasing sequence { λ i } pi =1 . Then for some absolute constant c and any t ∈ (0 , n ) with probability at least − − ct )( n − tσ ) X i>k λ i ≤ n X i =1 k Σ / k : ∞ Z i,k : ∞ k ≤ ( n + tσ ) X i>k λ i . Proof. Since { Z i,k : ∞ } ni =1 are independent, isotropic and sub-gaussian, k Σ / k : ∞ Z i,k : ∞ k are independent sub-exponential r.v’s with expectation P i>k λ i and sub-exponential norms bounded by c σ P i>k λ i . Applying Bern-stein’s inequality gives P (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n n X i =1 k Σ / k : ∞ Z i,k : ∞ k − X i>k λ i (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≥ tσ X i>k λ i ! ≤ (cid:0) − c min( t, t ) n (cid:1) Changing t → t/n gives the result. Lemma 18 (Weakened Hanson-Wright inequality) . Suppose A ∈ R n × n is a (random) psd matrix and ε ∈ R n is acentered vector whose components { ε i } ni =1 are independent and have sub-gaussian norm at most σ . Then for someabsolute constants c, C and any t > with probability at least − e − t/c ε ⊤ Aε ≤ Cσ t tr( A ) Proof. By theorem Theorem 6.2.1 (Hanson-Wright inequality) in [17], for some absolute constant c for any t > P A (cid:8) | ε ⊤ Aε − E ε ⊤ Aε | ≥ t (cid:9) ≤ (cid:18) − c min (cid:26) t k A k F σ , tσ k A k (cid:27)(cid:19) , where P A denotes conditional probability given A .Since for any i E ε i = 0 , and Var( ε i ) . σ , and A is psd we have E ε ⊤ Aε ≤ c σ tr( A ) . Moreover, since k A k F ≤ tr( A ) and k A k ≤ tr( A ) , we obtain P A (cid:8) ε ⊤ Aε > σ ( c + t ) tr( A ) (cid:9) ≤ {− c min( t, t } ) . Restricting to t > and adjusting the constants gives the result (note that since the RHS doesn’t depend on A , we cansubstitute P A with P ). 24 PREPRINT - O CTOBER 1, 2020 I Controlling the singular values Lemma 19 (Bound on the norm of non-diagonal part of a Gram matrix) . Suppose { Z i } ni =1 is a sequence of indepen-dent sub-gaussian vectors in R p with k Z i k ψ ≤ σ . Consider Σ = diag( λ , . . . , λ p ) for some positive non-decreasingsequence { λ i } pi =1 . Denote X to be the matrix with rows (cid:8) Z ⊤ i Σ / (cid:9) ni =1 and A = XX ⊤ . Denote also ˚ A to be thematrix A with zeroed out diagonal elements: ˚ A i,j = (1 − δ i,j ) A i,j .Then for some absolute constant c for any t > with probability at least − e − t/c k ˚ A k ≤ cσ vuut ( t + n ) λ ( t + n ) + X i λ i ! . Proof. We follow the lines of decoupling argument from [16]. Consider a / -net { u j } mj =1 on S n − s.t. m ≤ n .Then k ˚ A k ≤ j | u ⊤ j ˚ Au j | (take v to be eigenvector of ˚ A with largest (by absolute value) eigenvalue µ , for some j |h u j , v i| > / , so | u ⊤ j ˚ Au j | ≥ | µ | − µ ).Denote k -th coordinate of u j as u j [ k ] . Note that u ⊤ j ˚ Au j = 4 E T X k ∈ T l u j [ k ] u j [ l ] ˚ A [ k, l ] , where expectation is taken over a uniformly chosen random subset T of { , . . . 
I Controlling the singular values

Lemma 19 (Bound on the norm of the non-diagonal part of a Gram matrix). Suppose $\{Z_i\}_{i=1}^n$ is a sequence of independent sub-gaussian vectors in $\mathbb{R}^p$ with $\|Z_i\|_{\psi_2} \leq \sigma$. Consider $\Sigma = \mathrm{diag}(\lambda_1,\dots,\lambda_p)$ for some positive non-increasing sequence $\{\lambda_i\}_{i=1}^p$. Denote by $X$ the matrix with rows $\{Z_i^\top\Sigma^{1/2}\}_{i=1}^n$ and $A = XX^\top$. Denote also by $\mathring A$ the matrix $A$ with zeroed-out diagonal elements: $\mathring A_{i,j} = (1-\delta_{i,j})A_{i,j}$. Then for some absolute constant $c$ and any $t>0$, with probability at least $1 - e^{-t/c}$,
$$\|\mathring A\| \leq c\sigma^2\sqrt{(t+n)\left(\lambda_1^2(t+n) + \sum_i\lambda_i^2\right)}.$$

Proof. We follow the lines of the decoupling argument from [16]. Consider a $1/4$-net $\{u_j\}_{j=1}^m$ on $S^{n-1}$ s.t. $m \leq 9^n$. Then $\|\mathring A\| \leq 2\max_j|u_j^\top\mathring A u_j|$ (take $v$ to be a unit eigenvector of $\mathring A$ corresponding to the eigenvalue $\mu$ of largest absolute value; for some $j$, $\langle u_j, v\rangle^2 \geq 3/4$, and since the cross term vanishes, $|u_j^\top\mathring A u_j| \geq \langle u_j, v\rangle^2|\mu| - (1 - \langle u_j, v\rangle^2)|\mu| \geq |\mu|/2$). Denote the $k$-th coordinate of $u_j$ as $u_j[k]$. Note that
$$u_j^\top\mathring A u_j = 4\,\mathbb{E}_T\sum_{k\in T,\ l\notin T}u_j[k]\,u_j[l]\,\mathring A[k,l],$$
where the expectation is taken over a uniformly chosen random subset $T$ of $\{1,\dots,n\}$. Thus,
$$|u_j^\top\mathring A u_j| \leq 4\max_T\left|\sum_{k\in T,\ l\notin T}u_j[k]\,u_j[l]\,\mathring A[k,l]\right| = 4\max_T\left|\left\langle\sum_{k\in T}u_j[k]X_{k,*},\ \sum_{l\notin T}u_j[l]X_{l,*}\right\rangle\right|.$$
Fix $j$. Note that since $u_j$ is from the sphere, $\{X_{i,*}\}_{i=1}^n$ are independent, and $k, l$ live in disjoint subsets, the vectors $U^\top := \sum_{k\in T}u_j[k]X_{k,*}\Sigma^{-1/2}$ and $V^\top := \sum_{l\notin T}u_j[l]X_{l,*}\Sigma^{-1/2}$ are independent sub-gaussian with sub-gaussian norms bounded by $C\sigma$ for some constant $C$. First, that means that for some constant $c_1$ we have
$$\mathbb{P}\left\{\left|\left\langle\Sigma^{1/2}U, \Sigma^{1/2}V\right\rangle\right| \geq t\sigma\|\Sigma V\|\right\} \leq 2e^{-c_1 t^2}.$$
Second, by Lemma 16, for some constant $c_2$ and any $t>0$,
$$\mathbb{P}\left\{\|\Sigma V\| \geq c_2\sigma\sqrt{\lambda_1^2 t + \sum_i\lambda_i^2}\right\} \leq e^{-t/c_2}.$$
We obtain that for some constant $c$ and any $t>0$, with probability at least $1 - e^{-t/c}$,
$$\left|\left\langle\Sigma^{1/2}U, \Sigma^{1/2}V\right\rangle\right| < c\sigma^2\sqrt{t\left(\lambda_1^2 t + \sum_i\lambda_i^2\right)}.$$
Finally, making a multiplicity correction for all $j$ (there are at most $9^n$ of them) and all subsets $T$ (at most $2^n$), we obtain that for some constant $c$, with probability at least $1 - e^{-t/c}$,
$$\|\mathring A\| \leq c\sigma^2\sqrt{(t+n)\left(\lambda_1^2(t+n) + \sum_i\lambda_i^2\right)}.$$

Lemma 20. In the setting of Lemma 19, for some absolute constant $c$ and any $t>0$, with probability at least $1 - e^{-t/c}$,
$$\|A\| \leq c\sigma^2\left(\lambda_1(t+n) + \sum_i\lambda_i\right).$$
If it is additionally known that for some $\delta, L > 0$, w.p. at least $1-\delta$, $\|X_{1,k:\infty}\| \geq \sqrt{\mathbb{E}\|X_{1,k:\infty}\|^2/L}$, then w.p. at least $1 - n\delta - e^{-t/c}$,
$$\mu_{\min}(A_k) \geq \lambda + \frac{1}{L}\sum_{i>k}\lambda_i - c\sigma^2\sqrt{(t+n)\left(\lambda_{k+1}^2(t+n) + \sum_{i>k}\lambda_i^2\right)}. \qquad (9)$$

Proof. Note that $\|A\| \leq \max_i\|X_{i,*}\|^2 + \|\mathring A\|$. Combining Lemma 16 (with a multiplicity correction) and Lemma 19 gives, with probability at least $1 - e^{-t/c}$,
$$\|A\| \leq c_1\sigma^2\left((t + c_2\log n)\lambda_1 + \sum_i\lambda_i + \sqrt{(t+n)\left(\lambda_1^2(t+n) + \sum_i\lambda_i^2\right)}\right).$$
Now note that
$$(t + c_2\log n)\lambda_1 \lesssim \sqrt{(t+n)\left(\lambda_1^2(t+n) + \sum_i\lambda_i^2\right)} \leq \sqrt{\lambda_1^2(t+n)^2 + \lambda_1(t+n)\sum_i\lambda_i} \lesssim \lambda_1(t+n) + \sum_i\lambda_i,$$
where we used $\sqrt{a^2 + ab} \lesssim a + b$ in the last transition. Removing the dominated (up to a constant multiplier) terms gives the result. The second statement of the Lemma follows from combining Lemma 19 with the assumed high-probability lower bound on $\|X_{1,k:\infty}\|$, applied to each row with a multiplicity correction.
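The following sketch illustrates the scale of the quantities in Lemmas 19 and 20 for Gaussian data (the dimensions and spectrum are arbitrary illustrative choices, and the absolute constant $c$ is not tracked): when the tail $\sum_i\lambda_i$ dominates $n\lambda_1$, the off-diagonal part of the Gram matrix is of the order $\sigma^2\sqrt{n(\lambda_1^2 n + \sum_i\lambda_i^2)}$, which is small compared to the typical diagonal entry, so the smallest eigenvalue stays of the order $\sum_i\lambda_i$:

import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 20000
lambdas = 1.0 / (1.0 + np.arange(p) / p)     # slowly varying spectrum: sum_i lambda_i >> n * lambda_1
Z = rng.standard_normal((n, p))
X = Z * np.sqrt(lambdas)                     # rows are Sigma^{1/2} z_i
A = X @ X.T

A_offdiag = A - np.diag(np.diag(A))
offdiag_norm = np.linalg.norm(A_offdiag, 2)                               # operator norm of the off-diagonal part
bound_scale = np.sqrt(n * (lambdas[0] ** 2 * n + np.sum(lambdas ** 2)))   # the Lemma 19 scale (constant dropped)
diag_scale = lambdas.sum()                                                # E||X_{i,*}||^2, the typical diagonal entry

# offdiag_norm is within a modest constant factor of bound_scale and is small
# compared to diag_scale, so mu_min(A) remains of the order sum_i lambda_i.
print(offdiag_norm, bound_scale, diag_scale, np.linalg.eigvalsh(A).min())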
Lemma 21. Suppose $Z\in\mathbb{R}^{n\times p}$ is a matrix with independent isotropic sub-gaussian rows with $\|Z_{i,*}\|_{\psi_2} \leq \sigma$. Consider $\Sigma = \mathrm{diag}(\lambda_1,\dots,\lambda_p)$ for some positive non-increasing sequence $\{\lambda_i\}_{i=1}^p$. Denote $A_k = \lambda I_n + \sum_{i>k}\lambda_i Z_{*,i}Z_{*,i}^\top$ for some $\lambda \geq 0$. Suppose that it is known that for some $\delta, L > 0$ and some $k < n$, with probability at least $1-\delta$ the condition number of the matrix $A_k$ is at most $L$. Then for some absolute constant $c$ and any $t \in (0,n)$, with probability at least $1 - \delta - 2\exp(-ct)$,
$$\frac{n - t\sigma^2}{nL}\left(\lambda + \sum_{i>k}\lambda_i\right) \leq \mu_n(A_k) \leq \mu_1(A_k) \leq \frac{(n + t\sigma^2)L}{n}\left(\lambda + \sum_{i>k}\lambda_i\right).$$
Moreover, if $\delta < 1 - 2e^{-cs}$ for some $s\in(0,n)$, then
$$\frac{\lambda + \sum_{i>k}\lambda_i}{n\lambda_{k+1}} \geq \frac{1}{L}\cdot\frac{1 - s\sigma^2/n}{1 + s\sigma^2/n}.$$

Proof. First of all, note that the sum of the eigenvalues of $A_k$ is equal to
$$\mathrm{tr}(A_k) = \lambda n + \sum_{i=1}^n\|\Sigma_{k:\infty}^{1/2}Z_{i,k:\infty}\|^2.$$
By Lemma 17,
$$(n - t\sigma^2)\sum_{i>k}\lambda_i \leq \sum_{i=1}^n\|\Sigma_{k:\infty}^{1/2}Z_{i,k:\infty}\|^2 \leq (n + t\sigma^2)\sum_{i>k}\lambda_i.$$
Now we know that with probability at least $1 - \delta - 2\exp(-c_1 t)$ the following two conditions hold:
$$\mu_1(A_k) \leq L\,\mu_n(A_k), \qquad n\lambda + (n - t\sigma^2)\sum_{i>k}\lambda_i \leq \sum_{i=1}^n\mu_i(A_k) \leq n\lambda + (n + t\sigma^2)\sum_{i>k}\lambda_i.$$
Thus, with probability at least $1 - \delta - 2\exp(-c_1 t)$,
$$\frac{\lambda}{L} + \frac{n - t\sigma^2}{nL}\sum_{i>k}\lambda_i \leq \mu_n(A_k) \leq \mu_1(A_k) \leq \lambda L + \frac{(n + t\sigma^2)L}{n}\sum_{i>k}\lambda_i,$$
which gives the first assertion of the Lemma. Moreover, note that $\mu_1(A_k) \geq \lambda_{k+1}\|Z_{*,k+1}\|^2$. By Lemma 17, for some $c_2$ and any $t\in(0,n)$, w.p. at least $1 - e^{-c_2 t}$ we have $\|Z_{*,k+1}\|^2 \geq n - t\sigma^2$, which means that if $1 - \delta - e^{-c_1 t} - e^{-c_2 t} > 0$, then with positive probability
$$\frac{(n + t\sigma^2)L}{n}\left(\lambda + \sum_{i>k}\lambda_i\right) \geq \lambda_{k+1}(n - t\sigma^2).$$
Taking $c = \min(c_1, c_2)$ we see that if $\delta < 1 - 2e^{-ct}$, then
$$\frac{\lambda + \sum_{i>k}\lambda_i}{n\lambda_{k+1}} \geq \frac{1}{L}\cdot\frac{1 - t\sigma^2/n}{1 + t\sigma^2/n}.$$

Lemma 22 (Controlling the condition number for some $k \in (k^*, n)$ means controlling it for $k^*$). In the setting of Lemma 21, if $\lambda + \sum_i\lambda_i \geq Kn\lambda_1$ for some $K > 1$, then for some absolute constant $c$ and any $t\in(0,n)$, with probability at least $1 - \delta - e^{-t/c}$,
$$\frac{(1 - t\sigma^2/n)(K - 1)}{LK}\left(\lambda + \sum_i\lambda_i\right) \leq \mu_n(A_0) \leq \mu_1(A_0) \leq c\sigma^2\,\frac{K + 2}{K}\left(\lambda + \sum_i\lambda_i\right).$$

Proof. First, by Lemma 21, with probability at least $1 - \delta - 2\exp(-c_1 t)$,
$$\frac{n - t\sigma^2}{nL}\left(\lambda + \sum_{i>k}\lambda_i\right) \leq \mu_n(A_k) \leq \mu_1(A_k) \leq \frac{(n + t\sigma^2)L}{n}\left(\lambda + \sum_{i>k}\lambda_i\right).$$
Next, by Lemma 20 we know that with probability at least $1 - e^{-t/c}$,
$$\mu_1(A_0) \leq c_1\sigma^2\left(\lambda_1(t+n) + \sum_i\lambda_i\right) + \lambda.$$
Recall the condition $\lambda + \sum_i\lambda_i \geq Kn\lambda_1$. Using this, we obtain
$$\lambda + \sum_{i>k}\lambda_i \geq \lambda + \sum_i\lambda_i - n\lambda_1 \geq \frac{K - 1}{K}\left(\lambda + \sum_i\lambda_i\right).$$
Thus, with probability at least $1 - \delta - e^{-t/c}$,
$$\mu_n(A_0) \geq \mu_n(A_k) \geq \frac{1 - t\sigma^2/n}{L}\left(\lambda + \sum_{i>k}\lambda_i\right) \geq \frac{(1 - t\sigma^2/n)(K - 1)}{LK}\left(\lambda + \sum_i\lambda_i\right).$$
Since the rows of $Z$ are isotropic, $\sigma \gtrsim 1$, which allows us to write
$$\sigma^2\left(\lambda_1(t+n) + \sum_i\lambda_i\right) + \lambda \lesssim \sigma^2\left(\lambda_1(t+n) + \sum_i\lambda_i + \lambda\right).$$
Finally,
$$\lambda_1(t+n) + \sum_i\lambda_i + \lambda \leq 2\lambda_1 n + \sum_i\lambda_i + \lambda \leq \frac{K + 2}{K}\left(\sum_i\lambda_i + \lambda\right),$$
which, combined with the bound on $\mu_1(A_0)$ above, gives the upper bound and completes the proof.
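To see the eigenvalue sandwich of Lemma 21 numerically, one can take a spectrum with a few spikes followed by a flat tail of large effective rank (all values below are arbitrary illustrative choices, and $L$ is measured empirically rather than assumed): after removing the spiked directions, the Gram matrix $A_k$ is well conditioned and all of its eigenvalues are of the order $\lambda + \sum_{i>k}\lambda_i$.

import numpy as np

rng = np.random.default_rng(3)
n, p, k, lam = 100, 20000, 2, 0.0
lambdas = np.full(p, 1e-3)
lambdas[:k] = [50.0, 10.0]                   # two spikes followed by a flat tail

Z = rng.standard_normal((n, p))
X_tail = Z[:, k:] * np.sqrt(lambdas[k:])
A_k = lam * np.eye(n) + X_tail @ X_tail.T    # A_k = lambda*I_n + sum_{i>k} lambda_i Z_{*,i} Z_{*,i}^T

eigs = np.linalg.eigvalsh(A_k)
target = lam + lambdas[k:].sum()             # lambda + sum_{i>k} lambda_i
print(eigs.min() / target, eigs.max() / target)  # both ratios are of constant order
print(eigs.max() / eigs.min())                   # empirical condition number, the role of L

With $L$ of constant order, the "Moreover" part of Lemma 21 then bounds $\rho_k = (\lambda + \sum_{i>k}\lambda_i)/(n\lambda_{k+1})$ from below by roughly $1/L$, which is how this control enters the main bounds.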