Asymptotics of Ridge(less) Regression under General Source Condition
Dominic Richards, Jaouad Mourtada, Lorenzo Rosasco
[email protected], {jaouad.mourtada,lorenzo.rosasco}@unige.it
Department of Statistics, University of Oxford, 24-29 St Giles', Oxford, OX1 3LB
MaLGa Center, Università degli Studi di Genova, Genova, Italy
Istituto Italiano di Tecnologia, Via Morego 30, Genoa 16163, Italy
Massachusetts Institute of Technology, Cambridge, MA 02139, USA
June 12, 2020
Abstract
We analyze the prediction performance of ridge and ridgeless regression when both the number and the dimension of the data go to infinity. In particular, we consider a general setting introducing prior assumptions characterizing "easy" and "hard" learning problems. In this setting, we show that ridgeless (zero regularisation) regression is optimal for easy problems with a high signal-to-noise ratio. Furthermore, we show that additional descents in the ridgeless bias and variance learning curve can occur beyond the interpolating threshold, verifying recent empirical observations. More generally, we show how a variety of learning curves are possible depending on the problem at hand. From a technical point of view, characterising the influence of prior assumptions requires extending previous applications of random matrix theory to study ridge regression.
Introduction

Understanding the generalisation properties of Artificial Deep Neural Networks (ANN) has recently motivated a number of statistical questions. These models perform well in practice despite perfectly fitting (interpolating) the data, a property that seems at odds with classical statistical theory [49]. This has motivated the investigation of the generalisation performance of methods that achieve zero training error (interpolators) [32, 9, 11, 10, 8] and, in the context of linear least squares, of the unique least norm solution to which gradient descent converges [22, 5, 37, 8, 21, 38, 20, 39]. Overparameterised linear models, where the number of variables exceeds the number of points, are arguably the simplest and most natural setting where interpolation can be studied. Moreover, in certain regimes ANN can be approximated by suitable linear models [24, 17, 18, 2, 13].

The learning curve (test error versus model capacity) of interpolators has been shown to exhibit a characteristic "Double Descent" [1, 7] shape, where the test error decreases after peaking at the "interpolating" threshold, that is, the model capacity required to interpolate the data. The regime beyond this threshold naturally captures the settings of ANN [49], and has thus motivated its investigation [36, 44, 39]. Indeed, for least squares regression, sharp characterisations of a double descent curve have been obtained for the least norm interpolating solution in the case of isotropic or auto-regressive covariates [22, 8] and random features [36].

For least squares regression, the structure of the features and data can naturally influence generalisation performance. This can be argued to arise also in the case of ANN where, for instance, inductive biases can be encoded in the network architecture, e.g. convolution layers for image classification [29, 30]. In contrast, least squares models investigated beyond the interpolation threshold have focused on cases where the ground truth parameter is symmetric in nature [16, 22, 5], without a natural notion of the estimation problem's difficulty. This has left open the natural question of what characteristics the learning curve exhibits beyond the interpolating threshold when the features and data are drawn from more structured distributions, such as lower-dimensional spaces.

In this work we investigate the performance of ridge regression, and its ridgeless limit, assuming the data is generated from a noisy linear model with a structured regression parameter. This structure is encoded through a general function analogous to the source condition used in kernel regression and inverse problems, see e.g. [35, 6]. The function is applied to the spectrum of the population covariance of the covariates, and represents how well the true regression parameter is aligned to the variation in the covariates. We then study the test error of the ridge regression estimator in a high-dimensional asymptotic regime where the number of samples and the ambient dimension go to infinity in proportion to one another. The limits of the resulting quantities are then characterised by utilising tools from asymptotic Random Matrix Theory [3, 31, 16, 22], with results specifically developed to characterise the influence of the source condition.
This provides a more general framework for studying the limiting test error of ridge regression, characterised by the signal-to-noise ratio, the regularisation, the overparameterisation, and now, the structure of the parameter through the source condition.

We then instantiate our general framework and results on a stylized structure, allowing us to study model misspecification and its effect on prediction error. Specifically, we consider a population covariance with two types of Eigenvectors: strong features, associated with a common large Eigenvalue (hence favoured by the ridge estimator), as well as weak features, with a common smaller Eigenvalue. This model is an idealization of a realistic structure for distributions, with some parts of the signal (associated for instance to high smoothness, or low-frequency components) easier to estimate than other, higher-frequency components. The use of source conditions allows us to study situations where the true coefficients exhibit either faster or slower decay than implicitly postulated by the ridge estimator, a form of model misspecification which affects predictive performance. This encodes the difficulty of the problem, and allows us to distinguish between "easy" and "hard" learning problems.

We now summarise the primary contributions of this work.

• Asymptotic Prediction Error under General Source Condition.
An asymptotic characterisation of the test error under a general source condition on the ground truth is provided. This required characterising the limit of certain trace quantities, and provides a richer framework for investigating the performance of ridge regression. (Theorem 1)

• Zero Ridge Regularisation Optimal for Easy Problems with High SNR.
In the "easy", overparameterised and high signal-to-noise ratio (SNR) case, we show that the optimal regularisation choice is zero. Previously, for least squares regression with an isotropic prior, the optimal regularisation choice was zero only in the limit of infinite signal-to-noise ratio [14, 16]. (Section 3.1)

Our analysis of the strong and weak features model also provides asymptotic characterisations of a number of phenomena recently observed within the literature. That is, adding noisy weak features performs implicit regularisation and can recover the performance of optimally tuned regression restricted to the strong features [28]. Also, we show an additional peak occurring in the learning curve beyond the interpolation threshold for the ridgeless bias and variance [39]. These particular insights are presented in Sections 3.2 and 3.3, respectively.

Let us now describe the remainder of this work. Section 1.1 covers the related literature. Section 2 describes the setting and provides the general theorem. Section 3 formally introduces the strong and weak features model, and presents the aforementioned insights. Section 4 gives the conclusion.
Related Literature

Due to the large number of works investigating interpolating methods as well as double descent, we next focus on works that consider the asymptotic regime.
High-Dimensional Statistics.
Random matrix theory has found numerous applications in high-dimensional statistics [48, 19]. In particular, asymptotic random matrix theory has been leveraged to study the predictive performance of ridge regression under a well-specified linear model with an isotropic prior on the parameter, for identity population covariance [27, 26, 14, 47] and then general population covariance [16]. More recently, [33] considered the limiting test error of the least norm predictor under the spiked covariance model [25] where both a subset of Eigenvalues and the ratio of dimension to samples diverge to infinity. They show the bias is bounded by the norm of the ground truth projected on the Eigenvectors associated to the subset of large Eigenvalues. In contrast to these works, our work follows the kernel regression and inverse problems literature [6], by adding structural assumptions on the parameter through the variation of its coefficients along the covariance basis.
Double Descent for Least Squares.
While interpolating predictors (which perfectly fit training data) are classically expected to be sensitive to noise and exhibit poor out-of-sample performance, empirical observations about the behaviour of artificial neural networks [49] challenged this received wisdom. This surprising phenomenon, where interpolators can generalize, has first been shown for some local averaging estimators [11, 9], kernel "ridgeless" regression [32], and linear regression, where [5] characterised conditions on the covariance structure under which ridgeless estimation has small variance. A "double descent" phenomenon for interpolating predictors, where test error can decrease past the interpolation threshold, has been suggested by [7]. This double descent curve has been established in the context of asymptotic least squares [22, 36, 8, 20, 38, 39]. The work [22] considers either isotropic or auto-regressive features, while [36] considers Random Features constructed from a non-linear function applied to the product of isotropic covariates and a random matrix. Meanwhile, the works [37, 20, 38] consider recovery guarantees under sparsity assumptions on the parameter, with [20] showing a peak in the test error when the number of samples equals the sparsity of the true predictor. The work [38] considers recovery properties of interpolators in the non-asymptotic regime. In contrast to these works, we make a direct connection between the population covariance and the ground truth parameter. Finally, [39] recently gave empirical evidence showing additional peaks in the test error occur beyond the interpolation threshold when the covariance is misaligned with the ground truth predictor. These empirical observations are verified by the theory we develop in this paper.
Dense Regression with General Source Condition
In this section we formally introduce the setting as well as the main theorem. Section 2.1 introduces the linear regression setting. Section 2.2 introduces the functionals that arise from asymptotic random matrix theory. Section 2.3 then presents the main theorem.
We start by introducing the linear regression setting and the general source condition.
Linear Regression.
We consider prediction in a random-design linear regression setting with Gaussian covariates. Let $\beta_\star \in \mathbb{R}^d$ denote the true regression parameter, $\Sigma \in \mathbb{R}^{d\times d}$ the population covariance, and $\sigma > 0$ the noise level (which can be normalised to $\sigma = 1$). We consider an i.i.d. dataset $\{(x_i, y_i)\}_{1 \le i \le n}$ such that for $i = 1, \dots, n$,

$$y_i = \langle \beta_\star, x_i \rangle + \sigma \epsilon_i, \qquad x_i \sim \mathcal{N}(0, \Sigma), \qquad \mathbb{E}[\epsilon_i \mid x_i] = 0, \quad \mathbb{E}[\epsilon_i^2 \mid x_i] = 1. \tag{1}$$

In what follows, we let $Y = (y_1, \dots, y_n)$, $\epsilon = (\epsilon_1, \dots, \epsilon_n) \in \mathbb{R}^n$, as well as the design matrix $X \in \mathbb{R}^{n\times d}$ with rows $x_i$. Given the $n$ samples, the objective is to derive an estimator $\widehat\beta \in \mathbb{R}^d$ that minimises the error of predicting a new response. For a fixed parameter $\beta_\star$, the test risk is then $R(\beta) = \mathbb{E}[(\langle x, \beta\rangle - y)^2] = \|\Sigma^{1/2}(\beta - \beta_\star)\|^2 + \sigma^2$, where the expectation is with respect to a new sample drawn according to (1). We consider ridge regression [23, 46], defined for $\lambda > 0$ by

$$\widehat\beta_\lambda := \Big(\frac{X^\top X}{n} + \lambda I\Big)^{-1}\frac{X^\top Y}{n} = \operatorname*{arg\,min}_{\beta \in \mathbb{R}^d}\Big\{\frac{1}{n}\sum_{i=1}^n (y_i - \langle \beta, x_i\rangle)^2 + \lambda\|\beta\|^2\Big\}. \tag{2}$$
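For concreteness, the estimator (2) and the test risk can be computed directly. The following is a minimal numerical sketch (not from the paper); the names `ridge` and `test_risk` and all parameter values are our own illustrative choices.

```python
import numpy as np

def ridge(X, y, lam):
    # Ridge estimator of (2): (X^T X / n + lam * I)^{-1} X^T y / n.
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

def test_risk(beta, beta_star, Sigma, sigma2):
    # R(beta) = ||Sigma^{1/2}(beta - beta_star)||^2 + sigma^2.
    diff = beta - beta_star
    return diff @ Sigma @ diff + sigma2

rng = np.random.default_rng(0)
n, d, sigma2 = 200, 100, 0.25
Sigma = np.diag(np.linspace(2.0, 0.5, d))       # toy diagonal population covariance
beta_star = rng.normal(size=d) / np.sqrt(d)     # toy parameter
X = rng.normal(size=(n, d)) @ np.sqrt(Sigma)    # Sigma is diagonal, so elementwise sqrt is Sigma^{1/2}
y = X @ beta_star + np.sqrt(sigma2) * rng.normal(size=n)
print(test_risk(ridge(X, y, 0.1), beta_star, Sigma, sigma2))
```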
Source Condition. We consider an average-case analysis where the parameter $\beta_\star$ is random, sampled with covariance encoded by a source function $\Phi : \mathbb{R}_+ \to \mathbb{R}_+$, which describes how the coefficients of $\beta_\star$ vary along the Eigenvectors of $\Sigma$. Specifically, denote by $\{(\tau_j, v_j)\}_{1\le j\le d}$ the Eigenvalue-Eigenvector pairs of $\Sigma$, ordered so that $\tau_1 \ge \tau_2 \ge \cdots \ge \tau_d \ge 0$.
Let $\Phi(\Sigma) = \sum_{i=1}^d \Phi(\tau_i)\, v_i v_i^\top$. For $r > 0$ (normalising $r = 1$ up to a change of $\Phi$), the parameter $\beta_\star$ is such that

$$\mathbb{E}[\beta_\star] = 0, \qquad \mathbb{E}[\beta_\star \beta_\star^\top] = \frac{r^2}{d}\,\Phi(\Sigma). \tag{3}$$

For estimators linear in $Y$ (such as ridge regression), the expected risk only depends on the first two moments of the prior on $\beta_\star$, hence one can assume a Gaussian prior $\beta_\star \sim \mathcal{N}(0, r^2\Phi(\Sigma)/d)$. Under prior (3), $\Phi(\Sigma)^{-1/2}\beta_\star$ has isotropic covariance $r^2 I/d$, so that $\mathbb{E}\|\Phi(\Sigma)^{-1/2}\beta_\star\|^2 = r^2$. This means that the coordinate $\beta_j := \langle \beta_\star, v_j\rangle$ of $\beta_\star$ in the $j$-th direction has standard deviation $r\sqrt{\Phi(\tau_j)/d}$. We note that, as $d \to \infty$, $\beta_\star$ has a "dense" high-dimensional structure, where the number of its components grows with $d$, while their magnitude decreases proportionally. This prior is an average-case, high-dimensional analogue of the standard source condition considered in inverse problems and nonparametric regression [35, 6], which describes the behaviour of the coefficients of $\beta_\star$ along the Eigenvector basis of $\Sigma$. In the special case $\Phi(x) = x^\alpha$, one has $\mathbb{E}\|\Sigma^{-\alpha/2}\beta_\star\|^2 = r^2$. For a Gaussian prior, $\Sigma^{-\alpha/2}\beta_\star \sim \mathcal{N}(0, r^2 I/d)$, which is rotation invariant with squared norm distributed as $r^2\chi^2_d/d$ (converging to $r^2$ as $d \to \infty$), hence "close" to the uniform distribution on the sphere of radius $r$.

Easy and Hard Problems. The case of a constant source function ($\Phi(x) \equiv 1$) corresponds to the isotropic prior implicitly postulated by the ridge estimator (2), which is then optimal in terms of average risk (see Remark 1). The influence of $\Phi$ can be understood in terms of the average signal strength in eigen-directions of $\Sigma$. Specifically, let $v_j$ be an Eigenvector of $\Sigma$ with associated Eigenvalue $\tau_j$. Then, given $\beta_\star$, the signal strength in direction $v_j$ (namely, the contribution of this direction to the signal) is $\mathbb{E}_x\langle \langle \beta_\star, v_j\rangle v_j, x\rangle^2 = \tau_j\langle \beta_\star, v_j\rangle^2$, whose expectation over $\beta_\star$ is proportional to $\tau_j\Phi(\tau_j)$. When $\Phi$ is increasing, the strength along direction $v_j$ decays faster as $\tau_j$ decreases than postulated by the ridge estimator. In this sense, the problem is lower-dimensional, and hence "easier" than for constant $\Phi$; likewise, a decreasing $\Phi$ is associated to a slower decay of coefficients, and therefore a "harder", higher-dimensional problem. While our results do not require this restriction, it is natural to consider functions $\Phi$ such that $\tau \mapsto \tau\Phi(\tau)$ is non-decreasing, so that principal components (with larger Eigenvalue) carry more signal on average; otherwise, the norm used by the ridge estimator favours the wrong directions. In this respect, the hardest prior is obtained for $\Phi(\tau) = \tau^{-1}$, corresponding to the isotropic prior in the prediction norm induced by $\Sigma$: for this un-informative prior, all directions have the same signal strength. Finally, note that in the standard nonparametric setting of reproducing kernel Hilbert spaces, source conditions are related to the smoothness of the regression function [45].

As $\beta_\star$ is random, we study the expected performance of the ridge estimator against the ground truth, i.e. the expected test error $\mathbb{E}_{\epsilon,\beta_\star}[R(\widehat\beta_\lambda) - R(\beta_\star)] = \mathbb{E}_{\epsilon,\beta_\star}[\|\Sigma^{1/2}(\widehat\beta_\lambda - \beta_\star)\|^2]$, where the expectation is with respect to the parameter $\beta_\star$ and the noise $\epsilon$ in the $n$ samples.
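As an illustration of the prior (3), one can sample $\beta_\star$ in the Eigenbasis of $\Sigma$ and check the normalisation $\mathbb{E}\|\Phi(\Sigma)^{-1/2}\beta_\star\|^2 = r^2$. The snippet below is a hedged sketch; the Eigenvalues and the choice $\Phi(x) = x$ are toy assumptions of ours.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r2 = 2000, 1.0
taus = np.linspace(2.0, 0.5, d)      # toy Eigenvalues of Sigma, in decreasing order
Phi = lambda t: t                    # an "easy" source function, Phi(x) = x

# Coordinate of beta_star along v_j has standard deviation r * sqrt(Phi(tau_j) / d).
beta_coords = rng.normal(size=d) * np.sqrt(r2 * Phi(taus) / d)

# ||Phi(Sigma)^{-1/2} beta_star||^2 concentrates around r^2 as d grows.
print(beta_coords @ (beta_coords / Phi(taus)))   # approximately 1.0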
Remark 1 (Oracle Estimator). The best linear (in $Y$) estimator in terms of average risk can be described explicitly. It corresponds to the Bayes-optimal estimator under the prior $\mathcal{N}(0, r^2\Phi(\Sigma)/d)$ on $\beta_\star$, which writes

$$\widetilde\beta = \Big(\frac{X^\top X}{n} + \frac{\sigma^2 d}{r^2 n}\,\Phi(\Sigma)^{-1}\Big)^{-1}\frac{X^\top Y}{n}. \tag{4}$$
This estimator requires knowledge of $\Sigma$ and $r^2\Phi$. In the special case of an isotropic prior with $\Phi \equiv 1$, the oracle estimator is the ridge estimator (2) with $\lambda = (\sigma^2 d)/(r^2 n)$.
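A sketch of the oracle estimator (4), assuming knowledge of $\Sigma$ and $\Phi$; the helper name `oracle_estimator` is our own, and forming $\Phi(\Sigma)^{-1}$ through an eigendecomposition is one of several possible routes.

```python
import numpy as np

def oracle_estimator(X, y, Sigma, Phi, sigma2, r2):
    # Oracle (4): (X^T X/n + (sigma^2 d)/(r^2 n) * Phi(Sigma)^{-1})^{-1} X^T y / n.
    n, d = X.shape
    tau, V = np.linalg.eigh(Sigma)         # Sigma = V diag(tau) V^T
    Phi_inv = (V / Phi(tau)) @ V.T         # Phi(Sigma)^{-1}, assuming Phi(tau) > 0
    A = X.T @ X / n + (sigma2 * d) / (r2 * n) * Phi_inv
    return np.linalg.solve(A, X.T @ y / n)
```

For $\Phi \equiv 1$ this reduces to the ridge estimator (2) with $\lambda = \sigma^2 d/(r^2 n)$, matching Remark 1.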
Let us now describe the considered asymptotic regime, as well as quantities and notions from random matrix theory that appear in the analysis.

High-Dimensional Asymptotics.
We study the performance of the ridge estimator $\widehat\beta_\lambda$ under high-dimensional asymptotics [27, 26, 14, 16, 47, 3], where the number of samples and the dimension go to infinity, $n, d \to \infty$, proportionally with $d/n \to \gamma \in (0,\infty)$. This setting enables a precise characterisation of the risk, beyond the classical regime where $n \to \infty$ with a fixed true distribution. The ratio $\gamma = d/n$ plays a key role: a value of $\gamma > 1$ corresponds to a problem with more dimensions than samples. Note that, for fixed $n$, varying $\gamma$ changes $d$ and hence the underlying distribution. Hence, $\gamma$ should not be interpreted as a degree of overparameterisation; rather, it quantifies the sample size relative to the dimension of the problem.

Random Matrix Theory. Following standard assumptions [31, 16], assume the spectral distribution of the covariance $\Sigma$ converges almost surely to a probability distribution $H$ supported on $[h_1, h_2]$ for $0 < h_1 \le h_2 < \infty$. Specifically, denoting the cumulative distribution function of the population covariance Eigenvalues by $H_d(\tau) = \frac{1}{d}\sum_{i=1}^d \mathbb{1}_{[\tau_i,\infty)}(\tau)$, we have $H_d(\tau) \to H(\tau)$ almost surely as $d \to \infty$.

A key quantity utilised within the analysis is the Stieltjes transform of the empirical spectral distribution, defined for $z \in \mathbb{C}\setminus\mathbb{R}_+$ as $\widetilde m(z) := d^{-1}\operatorname{Tr}\big((\frac{X^\top X}{n} - zI)^{-1}\big)$. Under appropriate assumptions on the covariates $x$ (see for instance [16]) it is known that, as $n, d \to \infty$, the Stieltjes transform of the empirical covariance $\widetilde m(z)$ converges almost surely to a Stieltjes transform $m(z)$ that satisfies the following stationary point equation:

$$m(z) = \int \frac{1}{\tau(1 - \gamma(1 + z m(z))) - z}\, dH(\tau). \tag{5}$$

In the case of an isotropic covariance $\Sigma = I$, where the limiting spectral distribution is a point mass at one, the above equation can be solved for $m(z)$, which is then the Stieltjes transform of the Marchenko-Pastur distribution [34]. For more general spectral densities, the stationary point equation (5) may not be as easily solved algebraically, but can still yield insights into the limiting properties of the quantities that arise. One tool that we will use extensively to gain insights into quantities that depend on $m(z)$ is its companion transform $v(z)$, which is the Stieltjes transform of the limiting spectral distribution of $n^{-1}XX^\top$. It is related to $m(z)$ through the equality $\gamma(m(z) + 1/z) = v(z) + 1/z$ for all $z \in \mathbb{C}\setminus\mathbb{R}_+$. Finally, introduce the $\Phi$-weighted Stieltjes transform

$$\Theta_\Phi(z) = \int \frac{\Phi(\tau)}{\tau(1 - \gamma(1 + z m(z))) - z}\, dH(\tau) \quad \text{for all } z \in \mathbb{C}\setminus\mathbb{R}_+,$$

which is the limit of the trace quantity $d^{-1}\operatorname{Tr}\big(\Phi(\Sigma)(\frac{X^\top X}{n} - zI)^{-1}\big)$ [31]. Let us now state the main theorem of this work, which provides the limit of the ridge regression risk.
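When $H$ is atomic, the fixed point (5) and the companion transform can be evaluated numerically. Below is a hedged sketch with toy parameters of our own: it solves the Silverstein equation for $v(-\lambda)$ by bracketing the positive root on $(0, 1/\lambda)$, recovers $m(-\lambda)$ from $\gamma(m(z) + 1/z) = v(z) + 1/z$, and checks that $m(-\lambda)$ satisfies (5).

```python
import numpy as np
from scipy.optimize import brentq

gamma, lam = 2.0, 0.1
rhos = np.array([1.0, 0.1])    # atoms of H (toy values)
psis = np.array([0.5, 0.5])    # their weights

# Silverstein equation at z = -lam: -1/v = -lam - gamma * sum_i psi_i rho_i / (1 + rho_i v).
F = lambda v: 1.0 / v - lam - gamma * np.sum(psis * rhos / (1.0 + rhos * v))
v = brentq(F, 1e-12, 1.0 / lam)        # v(-lam) lies in (0, 1/lam)

# Companion relation gamma * (m(z) + 1/z) = v(z) + 1/z, evaluated at z = -lam.
m = (v + (gamma - 1.0) / lam) / gamma

# Check (5): m = int dH(tau) / (tau * (1 - gamma * (1 - lam * m)) + lam).
rhs = np.sum(psis / (rhos * (1.0 - gamma * (1.0 - lam * m)) + lam))
print(m, rhs)                          # the two should agree
```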
Theorem 1
Consider the setting described in Sections 2.1 and 2.2. Suppose $\Phi$ is a real-valued bounded function defined on $[h_1, h_2]$ with finitely many points of discontinuity, and let $v'(z) = \partial v(z)/\partial z$. If $n, d \to \infty$ with $d/n \to \gamma \in (0,\infty)$, then almost surely

$$\mathbb{E}_{\epsilon,\beta_\star}[R(\widehat\beta_\lambda) - R(\beta_\star)] \to \underbrace{\sigma^2\Big(\frac{v'(-\lambda)}{v(-\lambda)^2} - 1\Big)}_{\text{Variance}} + \underbrace{r^2\,\frac{\Theta_\Phi(-\lambda) + \lambda\frac{\partial\Theta_\Phi(-\lambda)}{\partial\lambda}}{v(-\lambda)^2}}_{\text{Bias}}.$$

The above theorem characterises the expected test error of the ridge estimator when the sample size and dimension go to infinity, $n, d \to \infty$ with $d/n \to \gamma \in (0,\infty)$, and $\beta_\star$ is distributed as (3). The asymptotic risk in Theorem 1 is characterised by the relative sample size $\gamma$, the limiting spectral distribution $H$, and the source function $\Phi$ (normalising $\sigma = r = 1$). This provides a general form for studying the asymptotic test error of ridge regression in a dense high-dimensional setting. The source condition affects the limiting bias; to evaluate it we are required to study the limit of the trace quantity $d^{-1}\operatorname{Tr}\big(\Sigma(\frac{X^\top X}{n} - zI)^{-1}\Phi(\Sigma)(\frac{X^\top X}{n} - zI)^{-1}\big)$, which is achieved utilising techniques from both [12] and [31] (key steps in the proof of Lemma 1, Appendix C). The variance term in Theorem 1 aligns with that seen previously in [16], as the structure of $\beta_\star$ only influences the bias.
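Theorem 1 can be evaluated numerically for an atomic spectral law $H = \sum_i \psi_i \delta_{\rho_i}$, for which $\Theta_\Phi(-\lambda) + \lambda\,\partial_\lambda\Theta_\Phi(-\lambda) = \sum_i \psi_i\Phi(\rho_i)\rho_i v'(-\lambda)/(1+\rho_i v(-\lambda))^2$ (the per-atom computation is carried out in Appendix B.4). The sketch below is our own illustration; the function name and parameter values are assumptions.

```python
import numpy as np
from scipy.optimize import brentq

def limiting_risk(lam, gamma, rhos, psis, Phi, sigma2=1.0, r2=1.0):
    # Companion transform v(-lam): positive root of the Silverstein equation.
    F = lambda v: 1.0 / v - lam - gamma * np.sum(psis * rhos / (1.0 + rhos * v))
    v = brentq(F, 1e-12, 1.0 / lam)
    # v'(-lam) from the differentiated Silverstein equation.
    vp = 1.0 / (1.0 / v**2 - gamma * np.sum(psis * rhos**2 / (1.0 + rhos * v)**2))
    variance = sigma2 * (vp / v**2 - 1.0)
    bias = r2 * np.sum(psis * Phi(rhos) * rhos * vp / (1.0 + rhos * v)**2) / v**2
    return variance + bias

rhos, psis = np.array([1.0, 0.1]), np.array([0.5, 0.5])
for Phi in (lambda x: np.ones_like(x), lambda x: x, lambda x: 1.0 / x):
    print(limiting_risk(0.1, gamma=2.0, rhos=rhos, psis=psis, Phi=Phi))
```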
We now give some examples of the asymptotic expected risk in Theorem 1 for three different structures of $\beta_\star$, namely $\Phi(x) = 1$ (isotropic), $\Phi(x) = x$ (an easier case) and $\Phi(x) = x^{-1}$ (a harder case).

Corollary 1. Consider the setting of Theorem 1. If $n, d \to \infty$ with $d/n \to \gamma$, then almost surely

$$\mathbb{E}_{\epsilon,\beta_\star}[R(\widehat\beta_\lambda) - R(\beta_\star)] \to \sigma^2\Big(\frac{v'(-\lambda)}{v(-\lambda)^2} - 1\Big) + \begin{cases} \dfrac{r^2}{\gamma}\Big(\dfrac{v'(-\lambda)}{v(-\lambda)^4} - \dfrac{1}{v(-\lambda)^2}\Big) & \text{if } \Phi(x) = x, \\[2mm] \dfrac{r^2}{\gamma}\,\dfrac{v(-\lambda) - \lambda v'(-\lambda)}{v(-\lambda)^2} & \text{if } \Phi(x) = 1, \\[2mm] \dfrac{r^2}{\gamma}\Big(\dfrac{2\lambda v'(-\lambda)}{v(-\lambda)} + (\gamma-1)\dfrac{v'(-\lambda)}{v(-\lambda)^2} - 1\Big) & \text{if } \Phi(x) = 1/x. \end{cases}$$

The three choices of source function $\Phi$ in Corollary 1 are cases where the functional for the asymptotic bias in Theorem 1 can be expressed in terms of the companion transform and its first derivative. The expression in the case $\Phi(x) = 1$ was previously investigated in [16], while for $\Phi(x) = x$ the bias aligns with quantities previously studied in [12], and thus can simply be plugged in. For $\Phi(x) = 1/x$, we show how algebraic manipulations similar to the $\Phi(x) = x$ case allow $\Theta_\Phi(z)$ to be simplified. Finally, for $\Phi(x) = 1$ the bias and variance can be brought together and simplified, yielding the optimal regularisation choice $\lambda = \sigma^2\gamma/r^2$ [16]; see also Remark 1. As noted in Section 2.1, $\Phi(x) = x^{-1}$ corresponds to the hardest case, with no favoured direction, while $\Phi(x) = x$ corresponds to an "easier" case with faster coefficient decay.

Strong and Weak Features Model

In this section we consider a simple covariance structure, the strong and weak features model. Let $U_1 \in \mathbb{R}^{d_1\times d}$ and $U_2 \in \mathbb{R}^{d_2\times d}$ be two matrices such that $d_1 + d_2 = d$ and the collection of their rows forms an orthonormal basis of $\mathbb{R}^d$. The covariance considered is then, for $\rho_1, \rho_2 > 0$,

$$\Sigma = \rho_1 U_1^\top U_1 + \rho_2 U_2^\top U_2. \tag{6}$$

Unless stated otherwise, we adopt the convention that the Eigenvalues are ordered $\rho_1 > \rho_2$. Naturally, we call elements of the span of the rows of $U_1$ strong features, since they are associated to the dominant Eigenvalue $\rho_1$; similarly, $U_2$ is associated to the weak features. The sizes of $U_1, U_2$ go to infinity, $d_1, d_2 \to \infty$, with the sample size $n \to \infty$ and $d_i/d \to \psi_i \in (0,1)$, so that $\psi_1 + \psi_2 = 1$. The limiting spectral measure of $\Sigma$ in this case is then atomic: $dH(\tau) = \psi_1\delta_{\rho_1} + \psi_2\delta_{\rho_2}$.

The parameter $\beta_\star$ then has covariance $\mathbb{E}[\beta_\star\beta_\star^\top] = \frac{r^2}{d}\big(\phi_1 U_1^\top U_1 + \phi_2 U_2^\top U_2\big)$, where $\phi_1, \phi_2$ are the coefficients for each type of feature, and the source condition is $\Phi(x) = \phi_1\mathbb{1}_{x=\rho_1} + \phi_2\mathbb{1}_{x=\rho_2}$. The coefficients $\phi_1, \phi_2$ encode the composition of the ground truth in terms of strong and weak features, and thus the difficulty of the estimation problem. The case $\phi_1 = \phi_2$ corresponds to the isotropic prior, while $\phi_1 > \phi_2$ corresponds to faster decay and hence an "easier" problem. In particular, as $\phi_1/\phi_2$ increases, $\beta_\star$ has faster decay and the problem becomes "easier", since the ground truth is increasingly made of strong features. We then say that the problem is easy if $\phi_1/\phi_2 \ge 1$, and hard when $\phi_1/\phi_2 < 1$.

Under the model just introduced, Theorem 1 gives us the following asymptotic characterisation of the expected test risk in terms of the companion transform, as $n, d \to \infty$:

$$\mathbb{E}_{\epsilon,\beta_\star}[R(\widehat\beta_\lambda) - R(\beta_\star)] \to \underbrace{\sigma^2\Big(\frac{v'(-\lambda)}{v(-\lambda)^2} - 1\Big)}_{\text{Variance}} + \underbrace{r^2\sum_{i=1}^2 \frac{\phi_i\psi_i\rho_i\, v'(-\lambda)}{v(-\lambda)^2(\rho_i v(-\lambda) + 1)^2}}_{\text{Bias}}. \tag{7}$$

We now investigate the above limit, primarily in the overparameterised regime $\gamma > 1$. The insights are then summarised in the following sections. Section 3.1 shows that zero regularisation can be optimal in some situations. Section 3.2 shows how noisy weak features can be added and used as a form of regularisation similar to ridge regression. Section 3.3 presents findings related to the ridgeless bias and variance.
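For this model the companion transform solves the cubic derived in Appendix B.4, so (7) can be evaluated by a polynomial root-find rather than a generic fixed-point solve. A sketch under our own naming and toy parameter choices:

```python
import numpy as np

def v_companion(lam, gamma, rho1, rho2, psi1):
    # v(-lam) is the real root in (0, 1/lam) of the cubic from Appendix B.4, at t = -lam.
    t = -lam
    coeffs = [t * rho1 * rho2,
              t * (rho1 + rho2) + (1.0 - gamma) * rho1 * rho2,
              t + rho1 + rho2 - gamma * (psi1 * rho1 + (1.0 - psi1) * rho2),
              1.0]
    roots = np.roots(coeffs)
    real = roots[np.abs(roots.imag) < 1e-9].real
    return real[(real > 0.0) & (real < 1.0 / lam)][0]   # Stieltjes branch

def risk_sw(lam, gamma, rho1, rho2, psi1, phi1, phi2, sigma2=1.0, r2=1.0):
    # Limiting risk (7) for the strong and weak features model.
    v = v_companion(lam, gamma, rho1, rho2, psi1)
    vp = 1.0 / (1.0 / v**2 - gamma * (psi1 * rho1**2 / (1 + rho1 * v)**2
                                      + (1 - psi1) * rho2**2 / (1 + rho2 * v)**2))
    variance = sigma2 * (vp / v**2 - 1.0)
    bias = r2 * vp / v**2 * (phi1 * psi1 * rho1 / (1 + rho1 * v)**2
                             + phi2 * (1 - psi1) * rho2 / (1 + rho2 * v)**2)
    return variance + bias

print(risk_sw(lam=0.1, gamma=3.5, rho1=1.0, rho2=0.1, psi1=0.5, phi1=3.0, phi2=1.0))
```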
Optimal Regularisation for Easy Problems

In this section, we investigate how the true regression function, namely the parameter $\beta_\star$ (through the source condition), affects the optimal ridge regularisation. Here we consider the easy case; the hard case is then investigated in Appendix A.1. Figure 1 plots the performance of optimally tuned ridge regression (Left) and the optimal choice of regularisation parameter (Right) against (a monotonic transform of) the Eigenvalue ratio $\rho_2/\rho_1$, for coefficient ratios $\phi_1 \ge \phi_2$.

As shown in the right plot of Figure 1, for a fixed distribution of $X$ (characterised by $\psi_1, \rho_1, \rho_2$) and sample size (characterised by $\gamma$), as the ratio $\phi_1/\phi_2$ increases (that is, the signal concentrates more on strong features), the optimal regularisation decreases. Remarkably, if the ratio $\phi_1/\phi_2$ is large enough, the optimal ridge regularisation parameter $\lambda$ can be 0, corresponding to ridgeless interpolation.

Figure 1: Left: limiting test error for optimally tuned ridge regression as described by (7). Right: optimal regularisation computed numerically using the theory. Both: quantities plotted against the Eigenvalue ratio $\rho_1/(\rho_1 + \rho_2)$, for $\phi_1/\phi_2 \in \{1.0, 1.67, 3.0\}$. Problem parameters were $\mathbb{E}[\langle x, \beta_\star\rangle^2] = \rho_1\phi_1\psi_1 + \rho_2\phi_2\psi_2 = 1$, $\mathbb{E}[\|\beta_\star\|^2] = r^2(\phi_1\psi_1 + \phi_2\psi_2) = r^2 = 1$, $\gamma = 3.5$ and $\psi_1 = 0.5$. Left: dashed lines indicate simulations with 40 replications, noise $\epsilon$ from a standard Gaussian, and covariance $\Sigma$ diagonal with $\rho_1$ on the first $d_1$ coordinates and $\rho_2$ on the remaining $d_2$. (Evaluating the companion transform requires solving a polynomial since the limiting measure is atomic, see for instance [15]; the polynomial in our case can be solved efficiently as it is at most of order 3.)

Comparison with the Isotropic Model. In the case of a parameter $\beta_\star$ drawn from an isotropic prior ($\Phi \equiv 1$), the optimal regularisation is $\lambda = (\sigma^2 d)/(r^2 n)$ (see Remark 1, as well as [16, 22]). This parameter is always positive, and is inversely proportional to the signal-to-noise ratio $r^2/\sigma^2$. Studying the influence of $\beta_\star$ through general $\phi_1, \phi_2$ shows that (1) optimal regularisation also depends on the coefficient decay of $\beta_\star$; (2) optimal regularisation can be equal to $\lambda = 0$, which interpolates the training data. Finally, let us note that the optimal estimator of Remark 1 (with oracle knowledge of $\Sigma, \Phi$) does not interpolate; hence, the optimality of interpolation among the family of ridge estimators arises from a form of "prior misspecification". We believe this phenomenon extends beyond the specific case of ridge estimators.
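The optimal regularisation in Figure 1 can be reproduced in spirit by minimising the limiting risk (7) over $\lambda$. The sketch below re-derives the limiting risk inline via the Silverstein fixed point; $\sigma^2 = 0.25$ and the Eigenvalues are toy stand-ins of ours, not the figure's exact values.

```python
import numpy as np
from scipy.optimize import brentq, minimize_scalar

def risk_sw(lam, gamma, rho1, rho2, psi1, phi1, phi2, sigma2, r2=1.0):
    F = lambda v: 1/v - lam - gamma*(psi1*rho1/(1+rho1*v) + (1-psi1)*rho2/(1+rho2*v))
    v = brentq(F, 1e-12, 1.0/lam)          # companion transform v(-lam)
    vp = 1.0/(1.0/v**2 - gamma*(psi1*rho1**2/(1+rho1*v)**2 + (1-psi1)*rho2**2/(1+rho2*v)**2))
    bias = r2*vp/v**2 * (phi1*psi1*rho1/(1+rho1*v)**2 + phi2*(1-psi1)*rho2/(1+rho2*v)**2)
    return sigma2*(vp/v**2 - 1.0) + bias

gamma, psi1, sigma2 = 3.5, 0.5, 0.25       # toy values in the spirit of Figure 1
for ratio in (1.0, 1.67, 3.0):             # phi1/phi2, as in the figure legend
    res = minimize_scalar(lambda l: risk_sw(l, gamma, 1.0, 0.1, psi1, ratio, 1.0, sigma2),
                          bounds=(1e-6, 2.0), method="bounded")
    print(ratio, res.x)    # an argmin pinned at the lower bound suggests ridgeless is optimal
```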
Noisy Feature Regularisation

In this section we consider the special case where the weak features are pure noise variables, namely $\phi_2 = 0$, while their dimension is large. Such noisy weak features can be artificially introduced into the dataset, to induce an overparameterised problem. We then refer to this technique as Noisy Feature Regularisation, and note it corresponds to the design matrix augmentation in [28]. Looking to Figure 2, the ridgeless test error is then plotted against the Eigenvalue ratio $\rho_2/\rho_1$ (Left) and the number of weak features with the tuned Eigenvalue ratio (Right).
Observe (right plot) that as we increase the number of weak features (as encoded by $1/\psi_1$) and tune the Eigenvalue $\rho_2$, the performance converges to that of optimally tuned ridge regression with the strong features only. The left plot then shows the "regularisation path" as a function of the Eigenvalue ratio $\rho_2/\rho_1$ for several numbers of weak features $1/\psi_1$. We repeated this experiment on the real dataset SUSY [4] with Random Fourier Features [41]; the test error is plotted in Figure 5 in Appendix A.2.
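A minimal simulation of Noisy Feature Regularisation (toy parameters of our own; discarding the weak-feature coefficients at test time mirrors the "Left" protocol of Figure 5): pure-noise columns of variance $\rho_2$ are appended to the design, the min-norm least squares solution is computed, and only the strong-feature coefficients are used for prediction.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d1, sigma = 200, 100, 0.5
beta1 = rng.normal(size=d1) / np.sqrt(d1)       # signal on strong features only (phi2 = 0)

def noisy_feature_test_error(d2, rho2, n_test=5000):
    X1 = rng.normal(size=(n, d1))
    y = X1 @ beta1 + sigma * rng.normal(size=n)
    X = np.hstack([X1, np.sqrt(rho2) * rng.normal(size=(n, d2))])  # append noise features
    bhat = np.linalg.pinv(X) @ y                # min-norm least squares (ridgeless)
    X1_test = rng.normal(size=(n_test, d1))
    y_test = X1_test @ beta1 + sigma * rng.normal(size=n_test)
    return np.mean((X1_test @ bhat[:d1] - y_test) ** 2)  # discard weak coefficients

for rho2 in (0.01, 0.05, 0.2, 1.0):             # rho2 acts like a tunable regulariser
    print(rho2, noisy_feature_test_error(d2=400, rho2=rho2))
```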
Weak Features Can Implicitly Regularise. The results in Sections 3.1 and 3.2 suggest that weak features can implicitly regularise when the ground truth is associated to a subset of stronger features. Specifically, Section 3.1 demonstrated how this can occur passively in an easy learning problem, with the weak features providing sufficient stability that zero ridge regularisation can be the optimal choice. (Zero regularisation has also been shown to be optimal for Random Features regression with a high signal-to-noise ratio [36]; for ridge regression, the work [28] numerically estimated the derivative of the test risk with respect to $\lambda$ under a spiked covariance model and found that the derivative could be positive, suggesting zero regularisation.) Meanwhile, in this section we demonstrated an active approach, where weak features can purposely be added to a model and tuned similarly to ridge regularisation.

Figure 2: Ridgeless test error for the strong and weak features model ($r = \sigma = 1$) against the Eigenvalue ratio $\rho_2/\rho_1$ (Left) and the size of the noisy bulk $1/\psi_1 = d/d_1$ (Right). Solid lines show theory computed using $v(0)$ with $\gamma\psi_1 = d_1/n = 1$; dashed lines are simulations with 20 replications. Solid grey horizontal line: performance of optimally tuned ridge regression with the strong features only.
Ridgeless Bias and Variance

In this section we investigate how the ridgeless bias and variance depend on the ratio of dimension to sample size $\gamma$. Conveniently, the companion transform $v(0)$ takes a closed form in this case; see equation (15) in Appendix B.4.1. Looking to Figure 3, the ridgeless bias and variance are plotted against the ratio of dimension to sample size $\gamma$.

Note that an additional peak in the ridgeless bias and variance is observed beyond the interpolation threshold. This has only recently been empirically observed for the test error [39]; as such, these plots now theoretically verify this phenomenon. The location of the peaks naturally depends on the number of strong and weak features as well as the ambient dimension, as denoted by the vertical lines. Specifically, the peak occurs in the ridgeless bias for the "hard" setting when the number of samples and the number of strong features are equal, $n = d_1$. Meanwhile, a peak occurs in the ridgeless variance when the number of samples and the number of strong features are equal, $n = d_1$, and the Eigenvalue ratio is large, $\rho_1 \gg \rho_2$. This demonstrates that learning curves beyond the interpolation threshold can have different characteristics due to the interplay between the covariate structure and the underlying data. We conjecture this arises due to instabilities of the Moore-Penrose pseudo-inverse of the design matrix, similar to the isotropic setting [8].

Figure 3: Ridgeless bias and variance for the strong and weak features model plotted against the relative dimension $\gamma = d/n$, for different Eigenvalue ratios $\rho_1/\rho_2$ and coefficient ratios $\phi_1/\phi_2$. Solid lines give theory computed using $v(0)$ with $\psi_1 = 0.35$, $\mathbb{E}[\langle x, \beta_\star\rangle^2] = \rho_1\phi_1\psi_1 + \rho_2\phi_2\psi_2 = 1$ and $\mathbb{E}[\|\beta_\star\|^2] = r^2(\phi_1\psi_1 + \phi_2\psi_2) = 1$. Dashed lines are simulations with 20 replications.

Conclusion

In this work, we introduced a general framework for studying ridge regression in a high-dimensional regime. We characterised the limiting risk of ridge regression in terms of the dimension to sample size ratio, the spectrum of the population covariance, and the coefficients of the true regression parameter along the covariance basis. This extends prior work [14, 16] that considered an isotropic ground truth parameter. Our extension enables the study of "prior misspecification", where the signal strength may decrease faster or slower than postulated by the ridge estimator, and its effect on ideal regularisation.

We instantiated this general framework on a simple structure, with strong and weak features. In this case, we deduced that in some situations "ridgeless" regression with zero regularisation can be optimal among all ridge regression estimators. This occurs when the signal-to-noise ratio is large and when strong features (with large Eigenvalue of the covariance matrix) carry sufficiently more signal than weak ones. The latter condition corresponds to an "easy" or "lower-dimensional" problem, where ridge tends to over-penalise along strong features. This phenomenon does not occur for isotropic priors, where optimal regularisation is always strictly positive. Finally, we discussed noisy weak features, which act as a form of regularisation, and concluded by showing that additional peaks in the ridgeless bias and variance can occur for our model.

Moving forward, it would be natural to consider non-Gaussian covariates. Other structures for the ground truth and the data generating process can be investigated through Theorem 1 by considering different source functions $\Phi$ and population Eigenvalue distributions. The tradeoff between prediction and estimation error exhibited by [16] in the isotropic case can also be explored with a general source $\Phi$.
Acknowledgements

D.R. is supported by the EPSRC and MRC through the OxWaSP CDT programme (EP/L016710/1). Part of this work has been carried out at the Machine Learning Genoa (MaLGa) center, Università di Genova (IT). L.R. acknowledges the financial support of the European Research Council (grant SLING 819789), the AFOSR projects FA9550-17-1-0390 and BAA-AFRL-AFOSR-2016-0007 (European Office of Aerospace Research and Development), and the EU H2020-MSCA-RISE project NoMADS - DLV-777826.
References

[1] Madhu S Advani and Andrew M Saxe. High-dimensional dynamics of generalization error in neural networks. arXiv preprint arXiv:1710.03667, 2017.
[2] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. In Proceedings of the 36th International Conference on Machine Learning, volume 97, pages 242–252, 2019.
[3] Zhidong Bai and Jack W Silverstein. Spectral analysis of large dimensional random matrices, volume 20. Springer, 2010.
[4] Pierre Baldi, Peter Sadowski, and Daniel Whiteson. Searching for exotic particles in high-energy physics with deep learning. Nature Communications, 5:4308, 2014.
[5] Peter L. Bartlett, Philip M. Long, Gábor Lugosi, and Alexander Tsigler. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 2020.
[6] Frank Bauer, Sergei Pereverzev, and Lorenzo Rosasco. On regularization algorithms in learning theory. Journal of Complexity, 23(1):52–72, 2007.
[7] Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019.
[8] Mikhail Belkin, Daniel Hsu, and Ji Xu. Two models of double descent for weak features. arXiv preprint arXiv:1903.07571, 2019.
[9] Mikhail Belkin, Daniel J Hsu, and Partha Mitra. Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate. In Advances in Neural Information Processing Systems, pages 2300–2311, 2018.
[10] Mikhail Belkin, Siyuan Ma, and Soumik Mandal. To understand deep learning we need to understand kernel learning. arXiv preprint arXiv:1802.01396, 2018.
[11] Mikhail Belkin, Alexander Rakhlin, and Alexandre B. Tsybakov. Does data interpolation contradict statistical optimality? In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1611–1619, 2019.
[12] Lin S Chen, Debashis Paul, Ross L Prentice, and Pei Wang. A regularized Hotelling's T² test for pathway analysis in proteomic studies. Journal of the American Statistical Association, 106(496):1345–1360, 2011.
[13] Lenaic Chizat, Edouard Oyallon, and Francis Bach. On lazy training in differentiable programming. In Advances in Neural Information Processing Systems, pages 2933–2943, 2019.
[14] Lee H. Dicker. Ridge regression and asymptotic minimax estimation over spheres of growing dimension. Bernoulli, 22(1):1–37, 2016.
[15] Edgar Dobriban. Efficient computation of limit spectra of sample covariance matrices. Random Matrices: Theory and Applications, 4(04):1550019, 2015.
[16] Edgar Dobriban and Stefan Wager. High-dimensional asymptotics of prediction: Ridge regression and classification. The Annals of Statistics, 46(1):247–279, 2018.
[17] Simon Du, Jason Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. In Proceedings of the 36th International Conference on Machine Learning, volume 97, pages 1675–1685, 2019.
[18] Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. International Conference on Learning Representations (ICLR), 2019.
[19] Noureddine El Karoui. Random matrices and high-dimensional statistics: beyond covariance matrices. In Proceedings of the International Congress of Mathematicians, volume 4, pages 2875–2894, Rio de Janeiro, 2018.
[20] Cédric Gerbelot, Alia Abbara, and Florent Krzakala. Asymptotic errors for convex penalized linear regression beyond Gaussian matrices. arXiv preprint arXiv:2002.04372, 2020.
[21] Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Linearized two-layers neural networks in high dimension. arXiv preprint arXiv:1904.12191, 2019.
[22] Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J Tibshirani. Surprises in high-dimensional ridgeless least squares interpolation. arXiv preprint arXiv:1903.08560, 2019.
[23] Arthur E Hoerl and Robert W Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.
[24] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, pages 8571–8580, 2018.
[25] Iain M Johnstone. On the distribution of the largest eigenvalue in principal components analysis. Annals of Statistics, pages 295–327, 2001.
[26] Noureddine El Karoui. Asymptotic behavior of unregularized and ridge-regularized high-dimensional robust regression estimators: rigorous results. arXiv preprint arXiv:1311.2445, 2013.
[27] Noureddine El Karoui and Holger Kösters. Geometric sensitivity of random matrix results: consequences for shrinkage estimators of covariance and related statistical methods. arXiv preprint arXiv:1105.1404, 2011.
[28] Dmitry Kobak, Jonathan Lomond, and Benoit Sanchez. Implicit ridge regularization provided by the minimum-norm least squares estimator when n ≤ p. arXiv preprint arXiv:1805.10939, 2018.
[29] Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne Hubbard, and Lawrence D Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.
[30] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[31] Olivier Ledoit and Sandrine Péché. Eigenvectors of some large sample covariance matrix ensembles. Probability Theory and Related Fields, 151(1-2):233–264, 2011.
[32] Tengyuan Liang and Alexander Rakhlin. Just interpolate: Kernel "ridgeless" regression can generalize. arXiv preprint arXiv:1808.00387, 2018.
[33] Yasaman Mahdaviyeh and Zacharie Naulet. Asymptotic risk of least squares minimum norm estimator under the spike covariance model. arXiv preprint arXiv:1912.13421, 2019.
[34] Vladimir A Marčenko and Leonid Andreevich Pastur. Distribution of eigenvalues for some sets of random matrices. Mathematics of the USSR-Sbornik, 1(4):457, 1967.
[35] Peter Mathé and Sergei V Pereverzev. Geometry of linear ill-posed problems in variable Hilbert scales. Inverse Problems, 19(3):789, 2003.
[36] Song Mei and Andrea Montanari. The generalization error of random features regression: Precise asymptotics and double descent curve. arXiv preprint arXiv:1908.05355, 2019.
[37] Partha P Mitra. Understanding overfitting peaks in generalization error: Analytical risk curves for ℓ2 and ℓ1 penalized interpolation. arXiv preprint arXiv:1906.03667, 2019.
[38] Vidya Muthukumar, Kailas Vodrahalli, Vignesh Subramanian, and Anant Sahai. Harmless interpolation of noisy data in regression. IEEE Journal on Selected Areas in Information Theory, 2020.
[39] Preetum Nakkiran, Prayaag Venkat, Sham Kakade, and Tengyu Ma. Optimal regularization can mitigate double descent. arXiv preprint arXiv:2003.01897, 2020.
[40] Debashis Paul. Asymptotics of sample eigenstructure for a large dimensional spiked covariance model. Statistica Sinica, pages 1617–1642, 2007.
[41] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pages 1177–1184, 2008.
[42] Jack W Silverstein and Sang-Il Choi. Analysis of the limiting spectral distribution of large dimensional random matrices. Journal of Multivariate Analysis, 54(2):295–309, 1995.
[43] Jack W Silverstein and Patrick L Combettes. Signal detection via spectral theory of large dimensional random matrices. IEEE Transactions on Signal Processing, 40(8):2100–2105, 1992.
[44] S Spigler, M Geiger, S d'Ascoli, L Sagun, G Biroli, and M Wyart. A jamming transition from under- to over-parametrization affects generalization in deep learning. Journal of Physics A: Mathematical and Theoretical, 52(47):474001, 2019.
[45] Ingo Steinwart, Don Hush, and Clint Scovel. Optimal rates for regularized least squares regression. In Proceedings of the 22nd Annual Conference on Learning Theory (COLT), pages 79–93, 2009.
[46] Andrey N. Tikhonov. Solution of incorrectly formulated problems and the regularization method. Soviet Mathematics Doklady, 4:1035–1038, 1963.
[47] Antonia M Tulino and Sergio Verdú. Random matrix theory and wireless communications. Foundations and Trends in Communications and Information Theory, 1(1):1–182, 2004.
[48] Jianfeng Yao, Shurong Zheng, and ZD Bai. Sample covariance matrices and high-dimensional data analysis. Cambridge University Press, Cambridge, 2015.
[49] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.

Contents
A Additional Material - Strong and Weak Features Model
    A.1 Insights for Hard Problems
    A.2 Additional Plots for Noisy Weak Feature Regularisation
B Proofs for Ridge Regression
    B.1 Preliminaries
    B.2 Proof of Theorem 1
    B.3 Proof of Corollary 1
        B.3.1 Case: Φ(x) = x
        B.3.2 Case: Φ(x) = 1
        B.3.3 Case: Φ(x) = 1/x
    B.4 Strong and Weak Features Model
        B.4.1 Ridgeless Limit
C Proof of Lemma 1
    C.1 Showing δ → 0

A Additional Material - Strong and Weak Features Model
In this section we provide additional material related to the strong and weak features model introduced within the main body of the manuscript. Section A.1 presents insights for the hard learning setting, covering a case not considered within the main body of the manuscript. Section A.2 provides plots related to applying noisy weak feature regularisation to real data.

A.1 Insights for Hard Problems
In this section we discuss insights related to the setting of Section 3.1, but for the case of hard problems, that is, the case when $\phi_1/\phi_2 < 1$. Looking to Figure 4, we see plots similar to those in Section 3.1 but for choices of weights $\phi_1/\phi_2 < 1$; the test error can still decrease with the Eigenvalue ratio $\rho_1/(\rho_1+\rho_2)$. We believe this is due to the characteristic of ridge regression of "suppressing" smaller Eigenvalues, in this case $\rho_2$, improving performance for sufficiently large $\rho_1$ even though $\phi_1 < \phi_2$. Intuitively, this is due to the contribution to the signal taking the form $\mathbb{E}[\langle x, \beta_\star\rangle^2] = \rho_1\psi_1\phi_1 + \rho_2\psi_2\phi_2$, and thus, when $\psi_1 = \psi_2 = 0.5$ and $\rho_1\phi_1 > \rho_2\phi_2$, ridge regression can still perform well since it suppresses the small contribution to the signal $\rho_2\phi_2$. Looking to the right plot of Figure 4, we observe that the optimal choice of regularisation initially increases with the Eigenvalue ratio $\rho_1/(\rho_1+\rho_2)$. One explanation is that the estimated coefficients associated to the strong features are inflated in order to explain the signal coming from the weak features, and thus, for prediction, ought to be corrected through regularisation.

Figure 4: Matching plot to Figure 1, but for the hard regime where $\phi_1/\phi_2 < 1$ (here $\phi_1/\phi_2 \in \{0.25, 0.54, 1.0\}$, with $\gamma = 3.5$ and $\psi_1 = 0.5$).

A.2 Additional Plots for Noisy Weak Feature Regularisation
In this section we present additional plots associated to Section 3.2. In particular, Figure 5 presents noisy weak feature regularisation applied to a real-world example.
B Proofs for Ridge Regression
In this section we provide the calculations associated to ridge regression. Section B.1 provides some preliminary calculations. Section B.2 gives the proof of Theorem 1. Section B.3 provides the proof of Corollary 1. Section B.4 provides the calculations associated to the strong and weak features model.
Figure 5: Ridgeless test error performance on the SUSY dataset with $d_1 = 300$ Random Fourier Features (RFF) and $n = 200$ samples, plotted against the regularisation ($\lambda$ for ridge regression, $\rho_2$ for the noisy bulk). Weak features constructed from standard Gaussian random variables scaled by $\rho_2$. Predictions made by: Left: extracting the known signal coordinates from the estimated coefficients (throwing away the weak features); Right: sampling weak features for the test data. Error bars from 100 (Left) and 10 (Right) replications over RFF. Responses are 0 or 1, therefore predictions were an indicator function on whether the predicted response was greater than 1/2. Red line: performance of ridge regression with the strong features only, plotted against regularisation $\lambda$, error bars from 100 replications.

B.1 Preliminaries
We begin by introducing some useful properties of the Stieltjes transform as well as its companion transform. Firstly, we know the companion transform satisfies the Silverstein equation [43, 42]

$$-\frac{1}{v(z)} = z - \gamma\int \frac{\tau}{1 + \tau v(z)}\, dH(\tau). \tag{8}$$

For $z \in \mathbb{C}\setminus\mathbb{R}_+$, the companion transform $v(z)$ is the unique solution to the Silverstein equation such that the sign of the imaginary part is preserved, $\operatorname{sign}(\operatorname{Im}(v(z))) = \operatorname{sign}(\operatorname{Im}(z))$ (with $v(z) > 0$ for real $z < 0$). The above can then be differentiated with respect to $z$ to obtain a formula for $v'(z)$ in terms of $v(z)$:

$$\frac{\partial v(z)}{\partial z} = \Big(\frac{1}{v(z)^2} - \gamma\int \frac{\tau^2}{(1 + \tau v(z))^2}\, dH(\tau)\Big)^{-1}.$$

Meanwhile, from the equality $\gamma(m(z) + 1/z) = v(z) + 1/z$ we note that we have the following equalities:

$$1 - \gamma(1 - \lambda m(-\lambda)) = \lambda v(-\lambda), \tag{9}$$
$$1 - \lambda m(-\lambda) = \gamma^{-1}(1 - \lambda v(-\lambda)),$$
$$m(-\lambda) - \lambda m'(-\lambda) = \gamma^{-1}(v(-\lambda) - \lambda v'(-\lambda)),$$

which we will readily use to simplify and rewrite a number of the limiting functions.
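The identities in (9) and the derivative formula can be sanity-checked numerically. The sketch below (with a toy atomic spectral law of our own) compares the closed-form $v'(-\lambda)$ against a finite difference and verifies $1 - \gamma(1 - \lambda m(-\lambda)) = \lambda v(-\lambda)$.

```python
import numpy as np
from scipy.optimize import brentq

gamma = 2.0
rhos, psis = np.array([1.0, 0.1]), np.array([0.5, 0.5])   # toy atomic H

def v_of(lam):
    F = lambda v: 1.0 / v - lam - gamma * np.sum(psis * rhos / (1.0 + rhos * v))
    return brentq(F, 1e-12, 1.0 / lam)

lam, h = 0.3, 1e-6
v = v_of(lam)
vp_closed = 1.0 / (1.0 / v**2 - gamma * np.sum(psis * rhos**2 / (1.0 + rhos * v)**2))
vp_fd = (v_of(lam - h) - v_of(lam + h)) / (2 * h)   # v'(-lam) = -(d/dlam) v(-lam)
print(vp_closed, vp_fd)                              # the derivative formula

m = (v + (gamma - 1.0) / lam) / gamma                # from gamma(m + 1/z) = v + 1/z
print(1.0 - gamma * (1.0 - lam * m), lam * v)        # first identity in (9)
```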
B.2 Proof of Theorem 1

We begin with the decomposition into bias and variance terms following [16]. The difference for the ridge parameter can be written

$$\widehat\beta_\lambda - \beta_\star = -\lambda\Big(\frac{X^\top X}{n} + \lambda I\Big)^{-1}\beta_\star + \sigma\Big(\frac{X^\top X}{n} + \lambda I\Big)^{-1}\frac{X^\top\epsilon}{n},$$

and thus, taking the expectation with respect to the noise $\epsilon$ in the observations,

$$\mathbb{E}_\epsilon[R(\widehat\beta_\lambda)] - R(\beta_\star) = \mathbb{E}_\epsilon\big[\|\Sigma^{1/2}(\widehat\beta_\lambda - \beta_\star)\|^2\big] = \mathbb{E}_\epsilon\big[\|\Sigma^{1/2}(\widehat\beta_\lambda - \mathbb{E}_\epsilon[\widehat\beta_\lambda])\|^2\big] + \|\Sigma^{1/2}(\mathbb{E}_\epsilon[\widehat\beta_\lambda] - \beta_\star)\|^2$$
$$= \frac{\sigma^2}{n}\operatorname{Tr}\Big(\Big(\frac{X^\top X}{n}+\lambda I\Big)^{-1}\Sigma\Big(\frac{X^\top X}{n}+\lambda I\Big)^{-1}\frac{X^\top X}{n}\Big) + \lambda^2(\beta_\star)^\top\Big(\frac{X^\top X}{n}+\lambda I\Big)^{-1}\Sigma\Big(\frac{X^\top X}{n}+\lambda I\Big)^{-1}\beta_\star.$$

Taking the expectation with respect to $\beta_\star$, we arrive at

$$\mathbb{E}_{\beta_\star}[\mathbb{E}_\epsilon[R(\widehat\beta_\lambda)] - R(\beta_\star)] = \sigma^2\gamma\,\frac{1}{d}\operatorname{Tr}\Big(\Big(\frac{X^\top X}{n}+\lambda I\Big)^{-1}\Sigma\Big) - \lambda\sigma^2\gamma\,\frac{1}{d}\operatorname{Tr}\Big(\Big(\frac{X^\top X}{n}+\lambda I\Big)^{-2}\Sigma\Big) + \lambda^2\,\frac{r^2}{d}\operatorname{Tr}\Big(\Big(\frac{X^\top X}{n}+\lambda I\Big)^{-1}\Sigma\Big(\frac{X^\top X}{n}+\lambda I\Big)^{-1}\Phi(\Sigma)\Big).$$

It is now a matter of showing the asymptotic almost sure convergence of the following three functionals:

$$\frac{1}{d}\operatorname{Tr}\Big(\Big(\frac{X^\top X}{n}+\lambda I\Big)^{-1}\Sigma\Big), \qquad \frac{1}{d}\operatorname{Tr}\Big(\Big(\frac{X^\top X}{n}+\lambda I\Big)^{-2}\Sigma\Big) \qquad \text{and} \qquad \frac{1}{d}\operatorname{Tr}\Big(\Big(\frac{X^\top X}{n}+\lambda I\Big)^{-1}\Sigma\Big(\frac{X^\top X}{n}+\lambda I\Big)^{-1}\Phi(\Sigma)\Big).$$
The limit of the first trace quantity comes directly from [31], meanwhile the limit of the second trace quantity is proven in [16]. The third trace quantity depends upon the source condition $\Phi$, and computing its limit is one of the main technical contributions of this work. The limits for these objects are summarised within the following lemma, the proof of which provides the key steps for computing the limit involving the source function.
Lemma 1

Under the assumptions of Theorem 1, for any $\lambda > 0$ we have, almost surely as $n, d \to \infty$ with $d/n \to \gamma$:

$$\frac{1}{d}\operatorname{Tr}\Big(\Big(\frac{X^\top X}{n}+\lambda I\Big)^{-1}\Sigma\Big) \to \frac{1 - \lambda m(-\lambda)}{1 - \gamma(1 - \lambda m(-\lambda))}, \tag{10}$$

$$\frac{1}{d}\operatorname{Tr}\Big(\Big(\frac{X^\top X}{n}+\lambda I\Big)^{-2}\Sigma\Big) \to \frac{m(-\lambda) - \lambda m'(-\lambda)}{\big(1 - \gamma(1 - \lambda m(-\lambda))\big)^2}, \tag{11}$$

$$\frac{1}{d}\operatorname{Tr}\Big(\Big(\frac{X^\top X}{n}+\lambda I\Big)^{-1}\Sigma\Big(\frac{X^\top X}{n}+\lambda I\Big)^{-1}\Phi(\Sigma)\Big) \to \frac{\Theta_\Phi(-\lambda) + \lambda\frac{\partial\Theta_\Phi(-\lambda)}{\partial\lambda}}{\big(1 - \gamma(1 - \lambda m(-\lambda))\big)^2}. \tag{12}$$

The result of Theorem 1 is arrived at by plugging in the above limits and noting, from the definition of the companion transform $v$, that $1 - \gamma(1 - \lambda m(-\lambda)) = \lambda v(-\lambda)$, $1 - \lambda m(-\lambda) = \gamma^{-1}(1 - \lambda v(-\lambda))$ and, taking derivatives, $m(-\lambda) - \lambda m'(-\lambda) = \gamma^{-1}(v(-\lambda) - \lambda v'(-\lambda))$. The proof of Lemma 1, which is the key technical step in the proof of Theorem 1, is provided in Appendix C.

B.3 Proof of Corollary 1
In this section we provide the proof of Corollary 1. It is broken into three parts, associated to the three cases $\Phi(x) = x$, $\Phi(x) = 1$ and $\Phi(x) = 1/x$.

B.3.1 Case: Φ(x) = x

The purpose of this section is to demonstrate, in the case $\Phi(x) = x$, how the functional $\Theta_\Phi(-\lambda) + \lambda\frac{\partial\Theta_\Phi(-\lambda)}{\partial\lambda}$ can be written in terms of the Stieltjes transform $m(z)$. For this particular choice of $\Phi$ the asymptotics were calculated in [12], see also Lemma 7.9 in [16]; we therefore repeat this calculation for completeness. Now, in this case we have

$$\Theta_\Phi(z) = \int \frac{\tau}{\tau(1 - \gamma(1 + zm(z))) - z}\, dH(\tau).$$

Following the steps at the start of the proof of Lemma 2.2 in [31], consider

$$1 + zm(z) = \int \Big(1 + \frac{z}{\tau(1 - \gamma(1 + zm(z))) - z}\Big)\, dH(\tau) = \int \frac{\tau(1 - \gamma(1 + zm(z)))}{\tau(1 - \gamma(1 + zm(z))) - z}\, dH(\tau) = (1 - \gamma(1 + zm(z)))\,\Theta_\Phi(z).$$

Solving for $\Theta_\Phi(z)$ we have

$$\Theta_\Phi(z) = \frac{1 + zm(z)}{1 - \gamma(1 + zm(z))} = \frac{1}{\gamma}\Big(\frac{1}{1 - \gamma(1 + zm(z))} - 1\Big).$$

Picking $z = -\lambda$ and differentiating with respect to $\lambda$ we get

$$\frac{\partial\Theta_\Phi(-\lambda)}{\partial\lambda} = -\frac{m(-\lambda) - \lambda m'(-\lambda)}{\big(1 - \gamma(1 - \lambda m(-\lambda))\big)^2},$$

so that

$$\Theta_\Phi(-\lambda) + \lambda\frac{\partial\Theta_\Phi(-\lambda)}{\partial\lambda} = \frac{1 - \lambda m(-\lambda)}{1 - \gamma(1 - \lambda m(-\lambda))} - \lambda\,\frac{m(-\lambda) - \lambda m'(-\lambda)}{\big(1 - \gamma(1 - \lambda m(-\lambda))\big)^2} = \frac{\gamma^{-1}(1 - \lambda v(-\lambda))}{\lambda v(-\lambda)} - \lambda\,\frac{\gamma^{-1}(v(-\lambda) - \lambda v'(-\lambda))}{(\lambda v(-\lambda))^2} = \frac{1}{\gamma}\Big(\frac{v'(-\lambda)}{v(-\lambda)^2} - 1\Big),$$

where on the second equality we used (9). Dividing by $v(-\lambda)^2$ then yields the bias term presented in Corollary 1.

B.3.2 Case: Φ(x) = 1

The functional of interest in this case aligns with that calculated within [16], which we include below for completeness. In particular we have $\Theta_\Phi(-\lambda) = m(-\lambda)$, and as such we get

$$\Theta_\Phi(-\lambda) + \lambda\frac{\partial\Theta_\Phi(-\lambda)}{\partial\lambda} = m(-\lambda) - \lambda m'(-\lambda) = \gamma^{-1}\big(v(-\lambda) - \lambda v'(-\lambda)\big),$$

where on the second equality we used (9). Dividing by $v(-\lambda)^2$ as well as adding the asymptotic variance we get, from Theorem 1, the limit as $n, d \to \infty$:

$$\mathbb{E}_{\beta_\star}[\mathbb{E}_\epsilon[R(\widehat\beta_\lambda)] - R(\beta_\star)] \to \sigma^2\Big(\frac{v'(-\lambda)}{v(-\lambda)^2} - 1\Big) + \frac{r^2}{\gamma v(-\lambda)} - \frac{r^2\lambda v'(-\lambda)}{\gamma v(-\lambda)^2}.$$

B.3.3 Case: Φ(x) = 1/x

The functional in the case $\Phi(x) = 1/x$ takes the form

$$\Theta_\Phi(z) = \int \frac{\tau^{-1}}{\tau(1 - \gamma(1 + zm(z))) - z}\, dH(\tau).$$

Observe that we have

$$\int \tau^{-1}\, dH(\tau) + z\,\Theta_\Phi(z) = \int \tau^{-1}\Big(1 + \frac{z}{\tau(1 - \gamma(1 + zm(z))) - z}\Big)\, dH(\tau) = (1 - \gamma(1 + zm(z)))\int \frac{1}{\tau(1 - \gamma(1 + zm(z))) - z}\, dH(\tau) = (1 - \gamma(1 + zm(z)))\, m(z).$$

Solving for $\Theta_\Phi(z)$ and plugging in the definition of the companion transform $v(z)$ (using the general-$z$ form of (9), $1 - \gamma(1 + zm(z)) = -z v(z)$, and $\gamma m(z) = v(z) + (1-\gamma)/z$), we arrive at

$$\Theta_\Phi(z) = \frac{1}{z}\Big((1 - \gamma(1 + zm(z)))m(z) - \int \tau^{-1}\, dH(\tau)\Big) = -\frac{v(z)^2}{\gamma} - \frac{v(z)}{z}\Big(\frac{1}{\gamma} - 1\Big) - \frac{1}{z}\int \tau^{-1}\, dH(\tau).$$

Fixing $z = -\lambda$, the quantity of interest then has the form

$$\Theta_\Phi(-\lambda) = -\frac{v(-\lambda)^2}{\gamma} + \frac{v(-\lambda)}{\lambda}\Big(\frac{1}{\gamma} - 1\Big) + \frac{1}{\lambda}\int \tau^{-1}\, dH(\tau),$$

which when differentiated with respect to $\lambda$ yields

$$\frac{\partial\Theta_\Phi(-\lambda)}{\partial\lambda} = \frac{2 v(-\lambda) v'(-\lambda)}{\gamma} - \frac{1}{\lambda}\Big(\frac{1}{\gamma} - 1\Big)\Big(v'(-\lambda) + \frac{v(-\lambda)}{\lambda}\Big) - \frac{1}{\lambda^2}\int \tau^{-1}\, dH(\tau).$$

Multiplying the above by $\lambda$ and adding $\Theta_\Phi(-\lambda)$ brings us to

$$\Theta_\Phi(-\lambda) + \lambda\frac{\partial\Theta_\Phi(-\lambda)}{\partial\lambda} = \frac{2\lambda v(-\lambda) v'(-\lambda)}{\gamma} - \frac{v(-\lambda)^2}{\gamma} - \Big(\frac{1}{\gamma} - 1\Big) v'(-\lambda).$$

Dividing the above by $v(-\lambda)^2$ and adding the limiting variance yields, from Theorem 1, the limit as $n, d \to \infty$:

$$\mathbb{E}_{\beta_\star}[\mathbb{E}_\epsilon[R(\widehat\beta_\lambda)] - R(\beta_\star)] \to \sigma^2\Big(\frac{v'(-\lambda)}{v(-\lambda)^2} - 1\Big) + \frac{2 r^2\lambda v'(-\lambda)}{\gamma v(-\lambda)} + \frac{r^2(\gamma - 1)}{\gamma}\,\frac{v'(-\lambda)}{v(-\lambda)^2} - \frac{r^2}{\gamma}.$$

B.4 Strong and Weak Features Model
This section presents the calculations associated to the strong and weak features model. We begin by giving the stationary point equation for the companion transform $v(t)$, after which we explicitly compute the limiting risk with the particular choice of $\Phi(x)$ in this case. Section B.4.1 thereafter gives an explicit form for the companion transform in the ridgeless limit.

We begin by recalling that the limiting spectrum of the covariance $\Sigma$ for the strong and weak features model is $dH(\tau) = \psi_1\delta_{\rho_1} + \psi_2\delta_{\rho_2}$. Recall we have $\psi_1 + \psi_2 = 1$, therefore we simply write $\psi_2 = 1 - \psi_1$. Using the Silverstein equation (8), the companion transform must satisfy

$$-\frac{1}{v(t)} = t - \gamma\Big(\frac{\psi_1\rho_1}{1 + \rho_1 v(t)} + \frac{(1-\psi_1)\rho_2}{1 + \rho_2 v(t)}\Big), \tag{13}$$

meanwhile the derivative must satisfy

$$\frac{1}{v(t)^2} = \frac{1}{v'(t)} + \gamma\Big(\frac{\psi_1\rho_1^2}{(1 + \rho_1 v(t))^2} + \frac{(1-\psi_1)\rho_2^2}{(1 + \rho_2 v(t))^2}\Big) \implies v'(t) = \Big(\frac{1}{v(t)^2} - \gamma\Big(\frac{\psi_1\rho_1^2}{(1 + \rho_1 v(t))^2} + \frac{(1-\psi_1)\rho_2^2}{(1 + \rho_2 v(t))^2}\Big)\Big)^{-1}. \tag{14}$$

As such, given $v(t)$ we can compute the derivative. Rearranging (13) and denoting $v(t) = v$, the companion transform evaluated at $t$ satisfies

$$0 = (1+\rho_1 v)(1+\rho_2 v) + tv(1+\rho_1 v)(1+\rho_2 v) - \gamma\psi_1\rho_1 v(1+\rho_2 v) - \gamma(1-\psi_1)\rho_2 v(1+\rho_1 v)$$
$$= t\rho_1\rho_2 v^3 + \big(t(\rho_1+\rho_2) + (1-\gamma)\rho_1\rho_2\big)v^2 + \big(t + \rho_1 + \rho_2 - \gamma\psi_1\rho_1 - \gamma(1-\psi_1)\rho_2\big)v + 1.$$

This cubic can then be solved computationally for different choices of $t$. In the case of the ridgeless limit $t \to 0$ with $\gamma > 1$, the above simplifies to a quadratic which can be solved, as shown in Section B.4.1.

Now, recall that in the strong and weak features model the structure of the ground truth $\beta_\star$ is such that $\Phi(x) = \phi_1\mathbb{1}_{x=\rho_1} + \phi_2\mathbb{1}_{x=\rho_2}$. To compute the limiting risk, specifically the bias, we must then evaluate $\Theta_\Phi(-\lambda) + \lambda\frac{\partial\Theta_\Phi(-\lambda)}{\partial\lambda}$. To this end, plugging $\Phi(x)$ into the definition of $\Theta_\Phi(z)$:

$$\Theta_\Phi(z) = \int \frac{\Phi(\tau)}{\tau(1-\gamma(1+zm(z))) - z}\, dH(\tau) = \frac{\phi_1\psi_1}{\rho_1(1-\gamma(1+zm(z))) - z} + \frac{\phi_2(1-\psi_1)}{\rho_2(1-\gamma(1+zm(z))) - z} = \frac{\phi_1\psi_1}{-z(1+\rho_1 v(z))} + \frac{\phi_2(1-\psi_1)}{-z(1+\rho_2 v(z))},$$

where on the last equality we used (9) to rewrite the above in terms of the companion transform. Plugging in the regularisation parameter $z = -\lambda$, we then get

$$\Theta_\Phi(-\lambda) = \frac{\phi_1\psi_1}{\lambda(1+\rho_1 v(-\lambda))} + \frac{\phi_2(1-\psi_1)}{\lambda(1+\rho_2 v(-\lambda))}.$$

To the end of computing $\frac{\partial\Theta_\Phi(-\lambda)}{\partial\lambda}$, we can differentiate the above to get

$$\frac{\partial\Theta_\Phi(-\lambda)}{\partial\lambda} = -\phi_1\psi_1\,\frac{1 + \rho_1 v(-\lambda) - \lambda\rho_1 v'(-\lambda)}{\big(\lambda(1+\rho_1 v(-\lambda))\big)^2} - \phi_2(1-\psi_1)\,\frac{1 + \rho_2 v(-\lambda) - \lambda\rho_2 v'(-\lambda)}{\big(\lambda(1+\rho_2 v(-\lambda))\big)^2},$$

which yields

$$\Theta_\Phi(-\lambda) + \lambda\frac{\partial\Theta_\Phi(-\lambda)}{\partial\lambda} = \frac{\phi_1\psi_1\,\rho_1 v'(-\lambda)}{(\rho_1 v(-\lambda) + 1)^2} + \frac{\phi_2(1-\psi_1)\,\rho_2 v'(-\lambda)}{(\rho_2 v(-\lambda) + 1)^2},$$

as required. The final form for the limiting risk is then

$$\lim_{n,d\to\infty} \mathbb{E}_{\beta_\star}[\mathbb{E}_\epsilon[R(\widehat\beta_\lambda)] - R(\beta_\star)] = \sigma^2\Big(\frac{v'(-\lambda)}{v(-\lambda)^2} - 1\Big) + r^2\sum_{i=1}^2 \frac{\phi_i\psi_i\,\rho_i v'(-\lambda)}{v(-\lambda)^2(\rho_i v(-\lambda) + 1)^2}.$$

B.4.1 Ridgeless Limit
B.4.1 Ridgeless Limit

To consider the ridgeless limit $t \to 0^-$ of $v(t)$, some care must be taken about which regime, $\gamma < 1$ or $\gamma > 1$, we are in.

Underparameterised $\gamma < 1$: as $t \to 0^-$ we have $-t\, v(t) \to 1 - \gamma$.

Overparameterised $\gamma > 1$: as $t \to 0^-$ we have $v(t) \to v(0)$. From the dominated convergence theorem we can take the limit in the Silverstein equation (13) to arrive at the quadratic
$$0 = (1+\rho_1 v)(1+\rho_2 v) - \gamma\psi_1\rho_1 v(1+\rho_2 v) - \gamma(1-\psi_1)\rho_2 v(1+\rho_1 v) = (1-\gamma)\rho_1\rho_2\, v^2 + \big(\rho_1+\rho_2 - \gamma\psi_1\rho_1 - \gamma(1-\psi_1)\rho_2\big) v + 1.$$
Solving for $v$ with the quadratic formula immediately gives
$$v(0) = \frac{-\big(\rho_1+\rho_2-\gamma\psi_1\rho_1-\gamma(1-\psi_1)\rho_2\big) - \sqrt{\big(\rho_1+\rho_2-\gamma\psi_1\rho_1-\gamma(1-\psi_1)\rho_2\big)^2 - 4(1-\gamma)\rho_1\rho_2}}{2(1-\gamma)\rho_1\rho_2}. \qquad (15)$$
Recall from [42] that $v(-\lambda)$ is non-negative for $\lambda > 0$; as such we take the sign above which yields a non-negative quantity. Noting that we focus on the regime where $\gamma > 1$, so that the denominator $2(1-\gamma)\rho_1\rho_2$ is negative, for the above to be non-negative we require the numerator to be negative, and thus we take the negative sign in front of the square root.
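For concreteness, the following sketch (illustrative values, in the overparameterised regime $\gamma > 1$) evaluates (15) and checks it against the small-$\lambda$ fixed point of (13):

    import numpy as np

    gam = 2.0                                        # overparameterised regime, gam > 1
    rho1, rho2, psi1 = 5.0, 1.0, 0.3
    psi2 = 1.0 - psi1

    b = rho1 + rho2 - gam * psi1 * rho1 - gam * psi2 * rho2
    a = (1.0 - gam) * rho1 * rho2                    # negative when gam > 1
    v0 = (-b - np.sqrt(b**2 - 4.0 * a)) / (2.0 * a)  # (15), taking the negative sign

    # Sanity check: the fixed point of (13) at small lam approaches v(0).
    lam, v = 1e-6, 1.0
    for _ in range(5000):
        v = 1.0 / (lam + gam * (psi1 * rho1 / (1 + rho1 * v) + psi2 * rho2 / (1 + rho2 * v)))
    print(v0, v)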
C Proof of Lemma 1
In this section we provide the proof for Lemma 1. We recall that the limits (10) and (11) have been computed previously. In particular, Lemma 2.2 of [31] (the roles of $d, n$ are swapped in their work, and thus one must swap $\gamma$ with $1/\gamma$) shows
$$\frac{1}{d}\mathrm{Tr}\Big(\Big(\frac{X^\top X}{n} + \lambda I\Big)^{-1}\Sigma\Big) \to \frac{1}{\gamma}\Big(\frac{1}{1-\gamma(1-\lambda m(-\lambda))} - 1\Big).$$
Meanwhile, Lemma 7.4 of [16] shows
$$\frac{1}{d}\mathrm{Tr}\Big(\Big(\frac{X^\top X}{n} + \lambda I\Big)^{-2}\Sigma\Big) \to \frac{m(-\lambda) - \lambda m'(-\lambda)}{\big(1-\gamma(1-\lambda m(-\lambda))\big)^2}.$$
This leaves us to show the limit (12), for which we build upon the techniques of [31] as well as [12]. We begin with a decomposition. Recall that since the covariates are multivariate Gaussians, they can be rewritten as $X = Z\Sigma^{1/2}$ where $Z \in \mathbb{R}^{n\times d}$ is a matrix of independent standard normal random variables. For $i = 1, \dots, n$ the associated row of $X$ is then denoted $X_i = Z_i\Sigma^{1/2}$. As such, $X^\top X = \sum_{i=1}^n X_i^\top X_i = \sum_{i=1}^n \Sigma^{1/2} Z_i^\top Z_i \Sigma^{1/2}$. Let us then define $R_i(z) = \big(\frac{X^\top X}{n} - \frac{X_i^\top X_i}{n} - zI\big)^{-1}$. Using the Sherman–Morrison formula we then get
$$R(z) = \Big(\frac{X^\top X}{n} - zI\Big)^{-1} = R_i(z) - \frac{\frac{1}{n} R_i(z)\Sigma^{1/2} Z_i^\top Z_i \Sigma^{1/2} R_i(z)}{1 + \frac{1}{n} Z_i\Sigma^{1/2} R_i(z) \Sigma^{1/2} Z_i^\top}. \qquad (16)$$
Moreover we have
$$\frac{1}{n}\sum_{i=1}^n \Sigma^{1/2} Z_i^\top Z_i \Sigma^{1/2} R(z) = \frac{X^\top X}{n} R(z) = \Big(\frac{X^\top X}{n} - zI\Big) R(z) + z R(z) = I + z R(z).$$
Multiplying the above on the left by $\Phi(\Sigma) R(z)$, taking the trace and dividing by $d$ yields
$$\frac{1}{d}\mathrm{Tr}\big(\Phi(\Sigma) R(z)\big) + \frac{z}{d}\mathrm{Tr}\big(\Phi(\Sigma) R(z)^2\big) = \frac{1}{d}\sum_{i=1}^n \frac{1}{n} Z_i\Sigma^{1/2} R(z)\Phi(\Sigma) R(z)\Sigma^{1/2} Z_i^\top = \frac{1}{d}\sum_{i=1}^n \frac{\frac{1}{n} Z_i\Sigma^{1/2} R_i(z)\Phi(\Sigma) R_i(z)\Sigma^{1/2} Z_i^\top}{\big(1 + \frac{1}{n} Z_i\Sigma^{1/2} R_i(z)\Sigma^{1/2} Z_i^\top\big)^2},$$
where for $i = 1, \dots, n$ we have plugged (16) twice into $R(z)$ to get
$$Z_i\Sigma^{1/2} R(z)\Phi(\Sigma) R(z)\Sigma^{1/2} Z_i^\top = Z_i\Sigma^{1/2} R_i(z)\Phi(\Sigma) R(z)\Sigma^{1/2} Z_i^\top - \frac{\frac{1}{n} Z_i\Sigma^{1/2} R_i(z)\Sigma^{1/2} Z_i^\top\; Z_i\Sigma^{1/2} R_i(z)\Phi(\Sigma) R(z)\Sigma^{1/2} Z_i^\top}{1 + \frac{1}{n} Z_i\Sigma^{1/2} R_i(z)\Sigma^{1/2} Z_i^\top} = \frac{Z_i\Sigma^{1/2} R_i(z)\Phi(\Sigma) R(z)\Sigma^{1/2} Z_i^\top}{1 + \frac{1}{n} Z_i\Sigma^{1/2} R_i(z)\Sigma^{1/2} Z_i^\top}$$
$$= \frac{1}{1 + \frac{1}{n} Z_i\Sigma^{1/2} R_i(z)\Sigma^{1/2} Z_i^\top}\Big[Z_i\Sigma^{1/2} R_i(z)\Phi(\Sigma) R_i(z)\Sigma^{1/2} Z_i^\top - \frac{\frac{1}{n} Z_i\Sigma^{1/2} R_i(z)\Phi(\Sigma) R_i(z)\Sigma^{1/2} Z_i^\top\; Z_i\Sigma^{1/2} R_i(z)\Sigma^{1/2} Z_i^\top}{1 + \frac{1}{n} Z_i\Sigma^{1/2} R_i(z)\Sigma^{1/2} Z_i^\top}\Big] = \frac{Z_i\Sigma^{1/2} R_i(z)\Phi(\Sigma) R_i(z)\Sigma^{1/2} Z_i^\top}{\big(1 + \frac{1}{n} Z_i\Sigma^{1/2} R_i(z)\Sigma^{1/2} Z_i^\top\big)^2}.$$
Choosing $z = -\lambda$ we then have that
$$\frac{1}{d}\mathrm{Tr}\big(\Phi(\Sigma) R(-\lambda)\big) - \frac{\lambda}{d}\mathrm{Tr}\big(\Phi(\Sigma) R(-\lambda)^2\big) = \frac{1}{d}\sum_{i=1}^n \frac{\frac{1}{n}\mathrm{Tr}\big(\Sigma^{1/2} R(-\lambda)\Phi(\Sigma) R(-\lambda)\Sigma^{1/2}\big)}{\big(1 + \frac{1}{n}\mathrm{Tr}(\Sigma R(-\lambda))\big)^2} + \delta \qquad (17)$$
where the error term $\delta = \delta_1 + \delta_2 + \delta_3 + \delta_4$ is such that
$$\delta_1 = \frac{1}{d}\sum_{i=1}^n \frac{\frac{1}{n}\mathrm{Tr}\big(\Sigma^{1/2} R_i(-\lambda)\Phi(\Sigma) R_i(-\lambda)\Sigma^{1/2}\big) - \frac{1}{n}\mathrm{Tr}\big(\Sigma^{1/2} R(-\lambda)\Phi(\Sigma) R(-\lambda)\Sigma^{1/2}\big)}{\big(1 + \frac{1}{n}\mathrm{Tr}(\Sigma R(-\lambda))\big)^2},$$
$$\delta_2 = \frac{1}{d}\sum_{i=1}^n \frac{1}{n}\mathrm{Tr}\big(\Sigma^{1/2} R_i(-\lambda)\Phi(\Sigma) R_i(-\lambda)\Sigma^{1/2}\big)\Big(\frac{1}{\big(1 + \frac{1}{n}\mathrm{Tr}(\Sigma R_i(-\lambda))\big)^2} - \frac{1}{\big(1 + \frac{1}{n}\mathrm{Tr}(\Sigma R(-\lambda))\big)^2}\Big),$$
$$\delta_3 = \frac{1}{d}\sum_{i=1}^n \frac{1}{n}\mathrm{Tr}\big(\Sigma^{1/2} R_i(-\lambda)\Phi(\Sigma) R_i(-\lambda)\Sigma^{1/2}\big)\Big(\frac{1}{\big(1 + \frac{1}{n} Z_i\Sigma^{1/2} R_i(-\lambda)\Sigma^{1/2} Z_i^\top\big)^2} - \frac{1}{\big(1 + \frac{1}{n}\mathrm{Tr}(\Sigma R_i(-\lambda))\big)^2}\Big),$$
$$\delta_4 = \frac{1}{d}\sum_{i=1}^n \frac{\frac{1}{n} Z_i\Sigma^{1/2} R_i(-\lambda)\Phi(\Sigma) R_i(-\lambda)\Sigma^{1/2} Z_i^\top - \frac{1}{n}\mathrm{Tr}\big(\Sigma^{1/2} R_i(-\lambda)\Phi(\Sigma) R_i(-\lambda)\Sigma^{1/2}\big)}{\big(1 + \frac{1}{n} Z_i\Sigma^{1/2} R_i(-\lambda)\Sigma^{1/2} Z_i^\top\big)^2}.$$
As shown in Section C.1, the error terms $|\delta_1|, |\delta_2|, |\delta_3|, |\delta_4| \to 0$ almost surely as $n, d \to \infty$. It is now a matter of computing the limits of the remaining terms. As discussed previously, the limit of $\frac{1}{d}\mathrm{Tr}(\Sigma R(-\lambda))$ is known from [31]. From the same work it is also known that
$$\frac{1}{d}\mathrm{Tr}\big(\Phi(\Sigma) R(-\lambda)\big) \to \Theta_\Phi(-\lambda). \qquad (18)$$
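As an aside, the limit (18) is easy to probe numerically. The sketch below (added for illustration, with values chosen by us) takes a diagonal $\Sigma$ with the two-bulk spectrum of Section B.4, in which case $\Theta_\Phi(-\lambda) = \sum_i \phi_i\psi_i/\big(\lambda(1+\rho_i v(-\lambda))\big)$, and compares the empirical trace functional with its limit:

    import numpy as np

    rng = np.random.default_rng(1)
    n, d = 500, 1000
    gam, lam = d / n, 0.5
    rho1, rho2, psi1 = 5.0, 1.0, 0.3
    phi1, phi2 = 2.0, 0.5                        # illustrative values of Phi on the two bulks
    d1 = int(psi1 * d)

    evals = np.concatenate([np.full(d1, rho1), np.full(d - d1, rho2)])
    phis = np.concatenate([np.full(d1, phi1), np.full(d - d1, phi2)])

    X = rng.normal(size=(n, d)) * np.sqrt(evals)          # X = Z Sigma^{1/2}, Sigma diagonal
    R = np.linalg.inv(X.T @ X / n + lam * np.eye(d))      # resolvent R(-lam)
    empirical = np.trace(phis[:, None] * R) / d           # (1/d) Tr(Phi(Sigma) R(-lam))

    v = 1.0                                               # v(-lam) from (13)
    for _ in range(5000):
        v = 1.0 / (lam + gam * (psi1 * rho1 / (1 + rho1 * v)
                                + (1 - psi1) * rho2 / (1 + rho2 * v)))
    theory = (phi1 * psi1 / (lam * (1 + rho1 * v))
              + phi2 * (1 - psi1) / (lam * (1 + rho2 * v)))
    print(empirical, theory)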
That leaves us to compute the limit of $\frac{1}{d}\mathrm{Tr}\big(\Phi(\Sigma) R(-\lambda)^2\big)$. If we write $f_d(\lambda) = \frac{1}{d}\mathrm{Tr}\big(\Phi(\Sigma) R(-\lambda)\big)$, then note the derivative with respect to $\lambda$ is $f_d'(\lambda) = -\frac{1}{d}\mathrm{Tr}\big(\Phi(\Sigma) R(-\lambda)^2\big)$. We wish to now study the limit of $f_d'(\lambda)$ through the limit of $f_d(\lambda)$. To do so we will follow the steps in [16], which will require some definitions and the following theorem.

Let $D$ be a domain, i.e. a connected open set of $\mathbb{C}$. A function $f : D \to \mathbb{C}$ is called analytic on $D$ if it is differentiable as a function of the complex variable $z$ on $D$. The following key theorem, sometimes known as Vitali's theorem, ensures that the derivatives of converging analytic functions also converge.

Theorem 2 (Lemma 2.14 in [3]) Let $f_1, f_2, \dots$ be analytic on the domain $D$, satisfying $|f_n(z)| \le M$ for every $n$ and $z$ in $D$. Suppose that there is an analytic function $f$ on $D$ such that $f_n(z) \to f(z)$ for all $z \in D$. Then it also holds that $f_n'(z) \to f'(z)$ for all $z \in D$.

Now we have from [31]
$$f_d(\lambda) \to \int \Phi(\tau)\,\frac{1}{\tau(1-\gamma(1-\lambda m(-\lambda))) + \lambda}\, dH(\tau)$$
for all $\lambda \in S := \{u + iv : v \neq 0, \text{ or } v = 0,\, u > 0\}$. Checking the conditions of Theorem 2, we have that $f_d(\lambda)$ is an analytic function of $\lambda$ on $S$ and is bounded, $|f_d(\lambda)| \le \|\Phi(\Sigma)\|/\lambda$. To apply Theorem 2 it suffices to show that the limit $\Theta_\Phi(-\lambda)$ is analytic. To this end we invoke Morera's theorem, which states that if
$$\oint_\gamma \Theta_\Phi(-\lambda)\, d\lambda = 0$$
for any closed curve $\gamma$ in the region $S$, then $\Theta_\Phi(-\lambda)$ is analytic. We see this is the case by applying Fubini's theorem as follows:
$$\oint_\gamma \Theta_\Phi(-\lambda)\, d\lambda = \oint_\gamma \int \Phi(\tau)\,\frac{1}{\tau(1-\gamma(1-\lambda m(-\lambda))) + \lambda}\, dH(\tau)\, d\lambda = \int \Phi(\tau)\underbrace{\oint_\gamma \frac{d\lambda}{\tau(1-\gamma(1-\lambda m(-\lambda))) + \lambda}}_{=0}\, dH(\tau) = 0,$$
noting that the inner integral is zero from Cauchy's theorem, as $\frac{1}{\tau(1-\gamma(1-\lambda m(-\lambda))) + \lambda}$ is an analytic function of $\lambda$ in $S$ for any $\tau \in [h_1, h_2]$. By Theorem 2 we have that
$$-\frac{1}{d}\mathrm{Tr}\big(\Phi(\Sigma) R(-\lambda)^2\big) = f_d'(\lambda) \to \frac{\partial \Theta_\Phi(-\lambda)}{\partial \lambda}. \qquad (19)$$
The final limit (12) is arrived at by considering the limit as $d, n \to \infty$ of (17): specifically, using the fact that $\delta \to 0$ and bringing together (18), (19) and (10), noting that (10) is applied to the square of
$$1 + \frac{1}{n}\mathrm{Tr}\big(\Sigma R(-\lambda)\big) = 1 + \frac{d}{n}\cdot\frac{1}{d}\mathrm{Tr}\big(\Sigma R(-\lambda)\big) \to 1 + \Big(\frac{1}{1-\gamma(1-\lambda m(-\lambda))} - 1\Big) = \frac{1}{1-\gamma(1-\lambda m(-\lambda))}.$$

C.1 Showing $\delta \to 0$

To analyse these quantities we introduce the following concentration inequality, obtained from Lemma A.2 of [40] with $\delta = 1/3$.
Lemma 2
Suppose $y$ is a $d$-dimensional Gaussian random vector, $y \sim N(0, I)$, and $C \in \mathbb{R}^{d\times d}$ is a symmetric matrix such that $\|C\| \le L$. Then, for all $0 < t < L$ and an absolute constant $c > 0$,
$$P\Big(\frac{1}{d}\big|y C y^\top - \mathrm{Tr}(C)\big| > t\Big) \le \exp\Big\{-\frac{c\, d t^2}{L^2}\Big\}.$$
Furthermore, we will use the fact that the maximal eigenvalues are upper bounded, $\|R(-\lambda)\| \le \frac{1}{\lambda}$ and $\max_{1\le i\le n}\|R_i(-\lambda)\| \le \frac{1}{\lambda}$. We proceed to show that each of the errors $\delta_1, \delta_2, \delta_3, \delta_4$ converges to zero almost surely.

Begin with $\delta_1$. For $i = 1, \dots, n$, by adding and subtracting $\mathrm{Tr}\big(\Sigma^{1/2} R(-\lambda)\Phi(\Sigma) R_i(-\lambda)\Sigma^{1/2}\big)$ we can decompose
$$\mathrm{Tr}\big(\Sigma^{1/2} R_i(-\lambda)\Phi(\Sigma) R_i(-\lambda)\Sigma^{1/2}\big) - \mathrm{Tr}\big(\Sigma^{1/2} R(-\lambda)\Phi(\Sigma) R(-\lambda)\Sigma^{1/2}\big) = \mathrm{Tr}\big((R_i(-\lambda) - R(-\lambda))\Phi(\Sigma) R_i(-\lambda)\Sigma\big) + \mathrm{Tr}\big(\Sigma R(-\lambda)\Phi(\Sigma)(R_i(-\lambda) - R(-\lambda))\big).$$
Using (16) and letting $A = \Phi(\Sigma) R_i(-\lambda)\Sigma$, we then get
$$\frac{1}{n}\big|\mathrm{Tr}\big((R_i(-\lambda) - R(-\lambda)) A\big)\big| = \Big|\frac{\frac{1}{n^2} Z_i\Sigma^{1/2} R_i(-\lambda) A R_i(-\lambda)\Sigma^{1/2} Z_i^\top}{1 + \frac{1}{n} Z_i\Sigma^{1/2} R_i(-\lambda)\Sigma^{1/2} Z_i^\top}\Big| \qquad (20)$$
$$\le \frac{\|A\|}{n}\,\Big|\frac{\frac{1}{n} Z_i\Sigma^{1/2} R_i(-\lambda)^2\Sigma^{1/2} Z_i^\top}{1 + \frac{1}{n} Z_i\Sigma^{1/2} R_i(-\lambda)\Sigma^{1/2} Z_i^\top}\Big| \le \frac{\|A\|}{n}\sup_x \Big|\frac{x R_i(-\lambda)^2 x^\top}{1 + x R_i(-\lambda) x^\top}\Big| \le \frac{\|A\|}{n}\sup_x \Big|\frac{x R_i(-\lambda)^2 x^\top}{x R_i(-\lambda) x^\top}\Big| \le \frac{\|A\|}{n}\,\|R_i(-\lambda)\| \le \frac{\|A\|}{\lambda n} \le \frac{\|\Phi(\Sigma)\|\|\Sigma\|}{\lambda^2 n}.$$
An identical calculation with $A = \Phi(\Sigma) R(-\lambda)\Sigma$ yields the same bound. This then yields, with the lower bound $\big(1 + \frac{1}{n}\mathrm{Tr}(\Sigma^{1/2} R(-\lambda)\Sigma^{1/2})\big)^2 \ge 1$,
$$|\delta_1| \le \frac{n}{d}\cdot\frac{2\|\Phi(\Sigma)\|\|\Sigma\|}{\lambda^2 n},$$
and as such $\delta_1$ goes to zero as $n, d \to \infty$ with $d/n \to \gamma$.

Now consider the term $\delta_2$. Note that for two positive numbers $a, b \ge 0$,
$$\frac{1}{(1+a)^2} - \frac{1}{(1+b)^2} = \frac{(1+b)^2 - (1+a)^2}{(1+a)^2(1+b)^2} = \frac{b^2 + 2b - a^2 - 2a}{(1+a)^2(1+b)^2} = \frac{b(b-a) + a(b-a) + 2(b-a)}{(1+a)^2(1+b)^2} = (b-a)\,\frac{(b+1) + (a+1)}{(1+a)^2(1+b)^2} = (b-a)\Big(\frac{1}{(1+a)^2(1+b)} + \frac{1}{(1+a)(1+b)^2}\Big),$$
and as such $\big|(1+a)^{-2} - (1+b)^{-2}\big| \le 2|b-a|$. Using this with $a = \frac{1}{n}\mathrm{Tr}(\Sigma^{1/2} R_i(-\lambda)\Sigma^{1/2})$ and $b = \frac{1}{n}\mathrm{Tr}(\Sigma^{1/2} R(-\lambda)\Sigma^{1/2})$, which are both non-negative, allows us to upper bound
$$\Big|\frac{1}{\big(1 + \frac{1}{n}\mathrm{Tr}(\Sigma R_i(-\lambda))\big)^2} - \frac{1}{\big(1 + \frac{1}{n}\mathrm{Tr}(\Sigma R(-\lambda))\big)^2}\Big| \le \frac{2}{n}\big|\mathrm{Tr}(\Sigma R_i(-\lambda)) - \mathrm{Tr}(\Sigma R(-\lambda))\big| \le \frac{2\|\Sigma\|}{\lambda n},$$
where for the final inequality we used the argument (20) with $A = \Sigma$. Now, since the eigenvalues in the following trace are non-negative, we can upper bound
$$\frac{1}{d}\big|\mathrm{Tr}\big(\Sigma^{1/2} R_i(-\lambda)\Phi(\Sigma) R_i(-\lambda)\Sigma^{1/2}\big)\big| \le \big\|\Sigma^{1/2} R_i(-\lambda)\Phi(\Sigma) R_i(-\lambda)\Sigma^{1/2}\big\| \le \|\Sigma^{1/2}\|^2\|\Phi(\Sigma)\|\|R_i(-\lambda)\|^2 \le \frac{\|\Sigma\|\|\Phi(\Sigma)\|}{\lambda^2}. \qquad (21)$$
Combining these two facts yields the upper bound
$$|\delta_2| \le \frac{2\|\Sigma\|^2\|\Phi(\Sigma)\|}{\lambda^3 n},$$
which goes to zero as $n \to \infty$.

We now proceed to bound $\delta_3$ and $\delta_4$. With the bound on the trace (21) as well as the bound $|(1+a)^{-2} - (1+b)^{-2}| \le 2|b-a|$, we arrive at the bound for $\delta_3$:
$$|\delta_3| \le \frac{2\|\Sigma\|\|\Phi(\Sigma)\|}{\lambda^2}\max_{1\le i\le n}\frac{1}{n}\Big|Z_i\Sigma^{1/2} R_i(-\lambda)\Sigma^{1/2} Z_i^\top - \mathrm{Tr}\big(\Sigma R_i(-\lambda)\big)\Big|,$$
where the maximum converges to zero almost surely by the same argument given below for $\delta_4$, with $\Phi(\Sigma)$ replaced by the identity. Meanwhile, using that $1 + \frac{1}{n} Z_i\Sigma^{1/2} R_i(-\lambda)\Sigma^{1/2} Z_i^\top \ge 1$, for $\delta_4$ we get
$$|\delta_4| \le \max_{1\le i\le n}\frac{1}{d}\Big|Z_i\Sigma^{1/2} R_i(-\lambda)\Phi(\Sigma) R_i(-\lambda)\Sigma^{1/2} Z_i^\top - \mathrm{Tr}\big(\Sigma^{1/2} R_i(-\lambda)\Phi(\Sigma) R_i(-\lambda)\Sigma^{1/2}\big)\Big|.$$
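As a brief aside, Lemma 2 is a standard Gaussian quadratic-form concentration bound, and the $1/\sqrt{d}$ decay of the normalised deviation is easy to see empirically. In the following minimal sketch (added by us), the symmetric matrix $C$ is an arbitrary example with operator norm of order one, not the resolvent-based matrix used in the proof:

    import numpy as np

    rng = np.random.default_rng(2)
    for d in (100, 400, 1600):
        A = rng.normal(size=(d, d)) / np.sqrt(d)
        C = (A + A.T) / 2.0                          # symmetric, operator norm O(1)
        y = rng.normal(size=d)
        print(d, abs(y @ C @ y - np.trace(C)) / d)   # decays like 1/sqrt(d)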
We now show that $\max_{1\le i\le n}\frac{1}{d}\big|Z_i\Sigma^{1/2} R_i(-\lambda)\Phi(\Sigma) R_i(-\lambda)\Sigma^{1/2} Z_i^\top - \mathrm{Tr}\big(\Sigma^{1/2} R_i(-\lambda)\Phi(\Sigma) R_i(-\lambda)\Sigma^{1/2}\big)\big|$ converges to zero almost surely. Observe that, since we have the upper bound on the largest eigenvalue, using Lemma 2 (noting that $Z_i$ is independent of $R_i(-\lambda)$) as well as a union bound over $1 \le i \le n$, we have for $0 < t < \frac{\|\Sigma\|\|\Phi(\Sigma)\|}{\lambda^2}$:
$$P\Big(\max_{1\le i\le n}\frac{1}{d}\Big|Z_i\Sigma^{1/2} R_i(-\lambda)\Phi(\Sigma) R_i(-\lambda)\Sigma^{1/2} Z_i^\top - \mathrm{Tr}\big(\Sigma^{1/2} R_i(-\lambda)\Phi(\Sigma) R_i(-\lambda)\Sigma^{1/2}\big)\Big| \ge t\Big) \le \exp\Big\{-\frac{c\, d t^2 \lambda^4}{\|\Sigma\|^2\|\Phi(\Sigma)\|^2} + \log(n)\Big\}. \qquad (22)$$
Let $V_{n,d} := \max_{1\le i\le n}\frac{1}{d}\big|Z_i\Sigma^{1/2} R_i(-\lambda)\Phi(\Sigma) R_i(-\lambda)\Sigma^{1/2} Z_i^\top - \mathrm{Tr}\big(\Sigma^{1/2} R_i(-\lambda)\Phi(\Sigma) R_i(-\lambda)\Sigma^{1/2}\big)\big|$ and, for any $t > 0$, let $E_{n,d}(t)$ denote the event $\{V_{n,d} \ge t\}$, where $d = d_n$. Then, if $d = d_n$ satisfies $d_n/n \to \gamma$,
$$P\big(E_{n,d}(t)\big) \le n\exp\Big\{-\frac{c\, d t^2 \lambda^4}{\|\Sigma\|^2\|\Phi(\Sigma)\|^2}\Big\} \le n\exp\Big\{-\frac{c\,\gamma n t^2 \lambda^4}{2\|\Sigma\|^2\|\Phi(\Sigma)\|^2}\Big\},$$
where the last inequality holds for $n$ large enough that $d/n \ge \gamma/2$. Hence,
$$\sum_{n=1}^\infty P\big(E_{n,d_n}(t)\big) < +\infty,$$
so that, by the Borel–Cantelli lemma, almost surely, $V_{n,d} \ge t$ only holds for a finite number of values of $n$. This implies that, almost surely, $\limsup_{n\to\infty} V_{n,d} \le t$. Note that this is true for every $t > 0$; taking $t = 1/k$ and a union bound over $k \ge 1$, we conclude that $\lim_{n\to\infty} V_{n,d} = 0$ almost surely, i.e. $V_{n,d} \to 0$ almost surely.
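To close, the almost-sure convergence $V_{n,d} \to 0$ can also be observed empirically. In the sketch below (added for illustration), we take $\Sigma = \Phi(\Sigma) = I$ so that the relevant matrix is $R_i(-\lambda)^2$, and compute the leave-one-out resolvents efficiently via the rank-one (Sherman–Morrison) update (16); the sizes and parameter values are illustrative:

    import numpy as np

    rng = np.random.default_rng(3)
    lam = 0.5
    for n in (50, 100, 200, 400):
        d = 2 * n                                      # gamma = 2; Sigma = Phi(Sigma) = I
        Z = rng.normal(size=(n, d))
        R = np.linalg.inv(Z.T @ Z / n + lam * np.eye(d))
        devs = []
        for i in range(n):
            x = Z[i]
            Rx = R @ x
            Ri = R + np.outer(Rx, Rx) / (n - x @ Rx)   # leave-one-out resolvent via (16)
            Rix = Ri @ x
            devs.append(abs(Rix @ Rix - np.sum(Ri * Ri)) / d)   # |x R_i^2 x - Tr(R_i^2)|/d
        print(n, max(devs))                            # V_{n,d} shrinks as n grows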