Canonical thresholding for non-sparse high-dimensional linear regression
Igor Silin and Jianqing Fan
Department of Operations Research and Financial Engineering, Princeton University
Abstract
We consider a high-dimensional linear regression problem. Unlike many papers on the topic, we do not require sparsity of the regression coefficients; instead, our main structural assumption is a decay of the eigenvalues of the covariance matrix of the data. We propose a new family of estimators, called the canonical thresholding estimators, which pick the largest regression coefficients in the canonical form. The estimators admit an explicit form and can be linked to LASSO and Principal Component Regression (PCR). A theoretical analysis for both fixed design and random design settings is provided. The obtained bounds on the mean squared error and the prediction error of a specific estimator from the family allow us to state clearly sufficient conditions on the decay of eigenvalues to ensure convergence. In addition, we promote the use of relative errors, strongly linked with the out-of-sample R². The study of these relative errors leads to a new concept of joint effective dimension, which incorporates the covariance of the data and the regression coefficients simultaneously, and describes the complexity of a linear regression problem. Numerical simulations confirm good performance of the proposed estimators compared to previously developed methods.
Keywords: High-dimensional linear regression; covariance eigenvalues decay; thresholding; relative errors; principal component regression.

∗ Research supported by ONR grant N00014-19-1-2120, NSF grant DMS-1662139, and NIH grant 2R01-GM072611-14. E-mail: [email protected], [email protected]

1 Introduction and Setup
Consider the standard linear regression model
\[ y = x^\top \beta + \varepsilon, \]
where $x \in \mathbb{R}^d$ is a vector of covariates, $\beta \in \mathbb{R}^d$ is a vector of coefficients, $\varepsilon \in \mathbb{R}$ is a noise term, and $y \in \mathbb{R}$ is a response. Suppose we observe $n$ pairs $\{(x_i, y_i)\}_{i=1}^n$ from this model with the assumption that the underlying noise terms $\{\varepsilon_i\}_{i=1}^n$ are i.i.d. random variables with mean 0. In matrix notation, introducing
\[ Y = (y_1, \dots, y_n)^\top \in \mathbb{R}^n, \qquad X = (x_1, \dots, x_n)^\top \in \mathbb{R}^{n \times d}, \qquad \varepsilon = (\varepsilon_1, \dots, \varepsilon_n)^\top \in \mathbb{R}^n, \]
we rewrite our model as
\[ Y = X\beta + \varepsilon. \tag{1.1} \]
Define the covariance matrix of the data $\hat\Sigma \stackrel{\mathrm{def}}{=} n^{-1} \sum_{i=1}^n x_i x_i^\top = n^{-1} X^\top X \in \mathbb{R}^{d \times d}$. Our goal is to estimate the unknown $\beta$ and analyze the quality of estimation in two different settings:

• Fixed design. That means the vectors of covariates $\{x_i\}_{i=1}^n$ are deterministic (without loss of generality we assume $\sum_{i=1}^n x_i = 0$). A standard way to measure the error of an estimator $\tilde\beta$ in this case is the mean squared error (MSE):
\[ \mathrm{MSE}(\tilde\beta) \stackrel{\mathrm{def}}{=} \frac{1}{n} \sum_{i=1}^n (x_i^\top \tilde\beta - x_i^\top \beta)^2 = \frac{1}{n} \|X\tilde\beta - X\beta\|^2 = (\tilde\beta - \beta)^\top \hat\Sigma (\tilde\beta - \beta). \]
This differs from the prediction error for the fixed design by an amount of $\mathbb{E}[\varepsilon^2]$ (independent of the model) and reflects the model error in the prediction for this case.

• Random design. In this scenario the vectors of covariates $\{x_i\}_{i=1}^n$ come independently from some unknown distribution with mean zero (for simplicity) and covariance matrix $\Sigma \stackrel{\mathrm{def}}{=} \mathbb{E}[xx^\top] \in \mathbb{R}^{d \times d}$. We are interested in the performance of an estimator $\tilde\beta$ measured by the expected prediction error (PE):
\[ \mathrm{PE}(\tilde\beta) \stackrel{\mathrm{def}}{=} \mathbb{E}\big[(x^\top \tilde\beta - x^\top \beta)^2\big] = (\tilde\beta - \beta)^\top \Sigma (\tilde\beta - \beta). \]
This quantity also differs from the prediction error for random design by $\mathbb{E}[\varepsilon^2]$ and equals the excess risk $\mathrm{PE}(\tilde\beta) = \mathbb{E}[(y - x^\top \tilde\beta)^2] - \mathbb{E}[(y - x^\top \beta)^2]$.

In the sequel we refer to these quantities simply as the (absolute) MSE and PE. The reason we give two names is to differentiate their statistical behaviors in high dimension and to avoid confusion in various discussions. We will also motivate and analyze the relative errors $\mathrm{MSE}(\tilde\beta)/\mathrm{MSE}(0)$ and $\mathrm{PE}(\tilde\beta)/\mathrm{PE}(0)$.
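To make the two error measures concrete, the following minimal NumPy sketch simulates model (1.1) and evaluates MSE and PE exactly as defined above; the diagonal covariance with polynomial eigenvalue decay and the trivial estimator $\tilde\beta = 0$ are placeholder choices for illustration, not part of the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 200, 500, 1.0

# Illustrative covariance with decaying eigenvalues (an assumption of this sketch).
eigvals = np.arange(1, d + 1) ** -1.0
Sigma = np.diag(eigvals)
beta = rng.normal(size=d) / np.sqrt(d)

# Random-design data from model (1.1): rows of X have covariance Sigma.
X = rng.normal(size=(n, d)) * np.sqrt(eigvals)
eps = sigma * rng.normal(size=n)
Y = X @ beta + eps

beta_tilde = np.zeros(d)                      # the trivial estimator, as a placeholder

# Fixed-design error: (beta_tilde - beta)^T Sigma_hat (beta_tilde - beta).
Sigma_hat = X.T @ X / n
mse = (beta_tilde - beta) @ Sigma_hat @ (beta_tilde - beta)

# Random-design prediction error: (beta_tilde - beta)^T Sigma (beta_tilde - beta).
pe = (beta_tilde - beta) @ Sigma @ (beta_tilde - beta)
print(mse, pe)
```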
Surprisingly, the relative errors, appearing naturally and being well-motivated, have not gained much attention in the literature. As we will see, the importance of the relative errors arises as a high-dimensional effect.

Being a fundamental statistical problem, high-dimensional linear regression has been approached in various ways. Probably the simplest method is Principal Component Regression (PCR). The idea is to reduce the dimension first via Principal Component Analysis (PCA) (Pearson, 1901), and then use several leading principal components as covariates to construct the least squares estimator. This approach relies heavily on a very strong assumption that the response depends on just a few leading principal components of the data. Various examples were provided where PCR performs poorly, see Jolliffe (1982). Another related idea, the use of supervised principal components, was proposed by Bair et al. (2006).

Over the past two decades, the main approach to tackling the high dimensionality of the problem has been the sparsity assumption on $\beta$, which is reasonable for many real-world applications. This has given rise to such model selection procedures as LASSO (Tibshirani, 1996), SCAD (Fan and Li, 2001), Least Angle Regression (Efron et al., 2004), the Dantzig selector (Candes and Tao, 2007), and SLOPE (Bogdan et al., 2015). The list of papers devoted to these methods is too long to be presented here, so we just mention some of them: Greenshtein and Ritov (2004); Paul et al. (2008); Bickel, Ritov and Tsybakov (2009); Dalalyan, Hebiri and Lederer (2017); Bellec, Lecué and Tsybakov (2018). Typically, a theoretical analysis of such procedures requires assumptions on the design like the restricted isometry property (RIP), the restricted eigenvalue (RE) condition, or incoherence. These assumptions are needed to make sure that the correlations among subsets of features are small. See van de Geer and Bühlmann (2009) for an overview of conditions used in the theoretical analysis of sparse linear regression. We also refer to Chapters 3–5 of Fan et al. (2020) for an overview of existing methods and theoretical results for high-dimensional linear regression under sparsity.

The methods from the last two paragraphs were developed (partially) due to a belief that the unconstrained least squares estimator is hopeless in high dimensions. Recent papers by Bartlett et al. (2020) and later Chinot and Lerasle (2020) have shown that the minimum $\ell_2$-norm least squares estimator
\[ \tilde\beta^{\mathrm{LS}} \stackrel{\mathrm{def}}{=} (X^\top X)^{+} X^\top Y \]
(where $(X^\top X)^{+}$ is the generalized inverse of the matrix $X^\top X$) can generalize well (i.e. have small absolute PE) even while interpolating the training data; they call this phenomenon "benign overfitting". To deliver convergence of $\mathrm{PE}(\tilde\beta^{\mathrm{LS}})$ to zero they require quite specific conditions on $\Sigma$: the decay of its eigenvalues should be fast, but not too fast. These requirements are quantified by two notions of effective rank of $\Sigma$. A closely related paper by Hastie et al. (2019) also studies $\tilde\beta^{\mathrm{LS}}$, but in the regime $p/n \to \gamma$. They are interested in the dependence of $\mathrm{PE}(\tilde\beta^{\mathrm{LS}})$ on $\gamma$, and focus on the case $\lambda_{\min}(\Sigma) \ge c > 0$, e.g. considering isotropic and equicorrelation covariances. One more work on the topic is Belkin, Hsu and Xu (2019), where the authors try to mathematically explain the double descent phenomenon in several different models.

Going beyond linear regression, one basic idea for approaching a general (nonlinear) regression problem $y = f(x) + \varepsilon$ is to decompose the regression function $f(x) \approx \sum_{j=1}^D \beta_j \psi_j(x) = \psi(x)^\top \beta$ over a Fourier basis, a wavelet basis, or a basis of eigenfunctions in a Reproducing Kernel Hilbert Space (RKHS), denoted here by $\psi_1(\cdot), \dots, \psi_D(\cdot)$. This reduces the nonlinear regression problem to a linear one (potentially very high-dimensional), and allows the existing methods to be applied. Though we do not pursue the analysis of nonlinear regression in our work, this setting provides an excellent motivation for the main structural assumptions we make in our results. One of them is fast decay of the eigenvalues of $\Sigma$ (or $\hat\Sigma$). For instance, we require that the effective rank
\[ r_{\mathrm{eff}}[\Sigma] \stackrel{\mathrm{def}}{=} \frac{\mathrm{Tr}[\Sigma]}{\|\Sigma\|} \qquad \bigg(\text{or } r_{\mathrm{eff}}[\hat\Sigma] \stackrel{\mathrm{def}}{=} \frac{\mathrm{Tr}[\hat\Sigma]}{\|\hat\Sigma\|}\bigg) \]
can be well-controlled. The spectral decay has been observed in real-world datasets (e.g. MNIST, see Figure 5 in Liang and Rakhlin (2020); financial data in Zumbach (2009), Figure 5; economics data in Fan, Ke and Wang (2020), Figure 5), which makes our assumption reasonable. The importance of eigenvalue decay (not only of the covariance, but of general kernel matrices) is highlighted in Liang and Rakhlin (2020), where such conditions on the spectral decay are called "favorable data geometry". Moreover, Ma and Belkin (2017) and Belkin (2018) analyze the super-polynomial decay of eigenvalues of smooth kernel matrices. Going even further, in the deep learning literature, neural tangent kernels also exhibit spectral decay, as shown by Bietti and Mairal (2019), among others. However, fast eigenvalue decay is not the only motivation behind our work; another structural assumption that can make our results meaningful is a fast decay of the regression coefficients in the eigenbasis. This is a very well-understood condition as well: it is well known that Fourier coefficients decay at a polynomial rate, where the degree depends on the smoothness of the underlying regression function. In addition, the decay of coefficients in RKHS was studied by Belkin (2018).

With these structural assumptions, the idea behind our family of estimators $\hat\beta$ is quite natural: in some eigendirections (e.g. the ones that correspond to small eigenvalues of $\Sigma$) we do not gain much by estimating the associated coefficient, so it makes sense to estimate only those components that allow us to significantly reduce the error; specifically, we use thresholding to cut off the components associated with the insignificant directions. When applied to nonlinear regression with a wavelet basis, one estimator from the proposed family coincides with the soft thresholding approach studied in the series of papers by Donoho and Johnstone (1994); Donoho (1995); Donoho and Johnstone (1995); Donoho et al. (1995); Donoho and Johnstone (1998), among others. We highlight that we will not require sparsity of $\beta$ or any restrictive conditions on the design.

Let us summarize some motivations behind our work:

• Our methods can be viewed as an attempt to fix PCR by relaxing its restrictive assumptions. Instead of working with the first several principal components, our estimators automatically screen for the most important principal components, not necessarily the leading ones.

• Remarkably, the procedures that we propose are a modification of LASSO, so one can view this work as an attempt to extend LASSO to non-sparse high-dimensional linear regression.

• Though the papers by Bartlett et al. (2020) and Chinot and Lerasle (2020) do not advocate the use of interpolating estimators but rather justify why overfitting may not be harmful (a very relevant question in modern deep learning research), we aim to show that there is no necessity to give up in-sample denoising quality to get good bounds on the prediction error. In fact, our numerical results show that our method is better than the least squares estimator in various situations.

Main contributions of this paper are:
• We propose a new method for high-dimensional linear regression, called Natural Canonical Thresholding (NCT), in Section 2. The connection of this approach to LASSO and PCR is discussed in Section 2.1 and Section 2.2. In Section 2.3 we extend the suggested procedure and present a richer family of estimators, called Generalized Canonical Thresholding (GCT). Our estimators $\hat\beta$ are given via an explicit expression and do not require any optimization. Though each estimator from the family has one hyperparameter, it can be tuned in an efficient way, as shown in Section 4.

• We provide theoretical guarantees for the NCT estimator in the fixed design and random design settings in Section 3. The presented bounds have a two-fold meaning:

– For the absolute errors $\mathrm{MSE}(\hat\beta)$ and $\mathrm{PE}(\hat\beta)$, studied in Section 3.1 and Section 3.2, we state explicit sufficient conditions of the form "the eigenvalues of $\hat\Sigma$ or $\Sigma$ decay fast enough" to ensure convergence in high dimensions. No conditions on $\beta$ are imposed in this case.

– For the relative errors $\mathrm{MSE}(\hat\beta)/\mathrm{MSE}(0)$ and $\mathrm{PE}(\hat\beta)/\mathrm{PE}(0)$, motivated in Section 3.3.1, our bounds factorize into the newly defined notion of the joint effective dimension, the signal-to-noise ratio, and a vanishing factor. To get good rates for the relative errors in high dimensions it is not enough to assume fast decay of the eigenvalues of $\hat\Sigma$ or $\Sigma$ alone, and we need to impose conditions on $\Sigma$ and $\beta$ together (Section 3.3.2), which is reflected by the joint effective dimension that we analyze (Section 3.3.3).

Our analysis also applies to PCR, which can also be regarded as a regularization on the canonical regression coefficients by truncating high-index components to zero. Theoretical analysis of the GCT estimator is not as insightful; however, we still present and discuss a bound on the absolute error $\mathrm{MSE}(\hat\beta)$ (Section 3.4).

• Numerical experiments conducted in Section 5 confirm the good performance of NCT and especially GCT, in comparison with other existing methods.

The main proofs and the additional proofs are postponed to Section 6 and Section 7, respectively. We conclude this section by defining some notation used throughout the work.

For a positive integer $k$, we write $[k]$ as shorthand for the set $\{1, 2, \dots, k\}$. We use $O_{k \times l}$ for the $k \times l$ matrix of zeros and $I_k$ for the identity matrix of size $k \times k$. For a vector $a = [a_1, \dots, a_k]^\top \in \mathbb{R}^k$ and $q > 0$, the standard $\ell_q$-(pseudo)norm in $\mathbb{R}^k$ is $\|a\|_q \stackrel{\mathrm{def}}{=} \big(\sum_{j=1}^k |a_j|^q\big)^{1/q}$. We use the following convention for the $\ell_0$-pseudonorm: $\|a\|_0 = \|a\|_0^0 = \sum_{j=1}^k \mathbb{1}\{a_j \ne 0\}$. Also, the $\ell_\infty$-norm is $\|a\|_\infty = \max_{j\in[k]} |a_j|$. For a matrix $A$, let $\|A\|$ be the spectral norm (the largest singular value), $\mathrm{rank}[A]$ be the rank, and (if $A$ is square) $\mathrm{Tr}[A]$ be the trace.

For sequences $a_n$ and $b_n$ the relation $a_n \lesssim b_n$ means that there exists an absolute constant $C$ such that $a_n \le C b_n$ for all $n$, while $a_n \asymp b_n$ means that $a_n \lesssim b_n$ and $b_n \lesssim a_n$. By $c, C, C'$ we denote absolute constants which may differ from place to place.

Throughout the work, $\beta$ stands for the true vector of regression coefficients in the model (1.1), $\hat\beta$ stands for our NCT or GCT estimators proposed in the next section, and a generic estimator is denoted as $\tilde\beta$.

2 The canonical thresholding estimators

Let $r \stackrel{\mathrm{def}}{=} \mathrm{rank}[\hat\Sigma]$. Typically, $r = \min(n, d)$. Consider the SVD of the data matrix $X$ (scaled by $n^{-1/2}$):
\[ \frac{1}{\sqrt n} X = \hat V \hat\Lambda^{1/2} \hat U^\top, \]
where $\hat\Lambda^{1/2} = \mathrm{diag}\big(\hat\lambda_1^{1/2}, \dots, \hat\lambda_r^{1/2}\big) \in \mathbb{R}^{r\times r}$ is a diagonal matrix consisting of the non-zero singular values of $n^{-1/2}X$ in non-increasing order, the columns of $\hat V \in \mathbb{R}^{n\times r}$ are the left singular vectors of $X$, and the columns of $\hat U = [\hat u_1, \dots, \hat u_r] \in \mathbb{R}^{d\times r}$ are the right singular vectors of $X$. Alternatively, it is also convenient to think of the eigendecomposition of $\hat\Sigma$:
\[ \hat\Sigma = \hat U \hat\Lambda \hat U^\top, \]
where now the diagonal entries $\hat\lambda_1, \dots, \hat\lambda_r$ of $\hat\Lambda$ are interpreted as the non-zero eigenvalues of $\hat\Sigma$ in non-increasing order and the columns $\hat u_1, \dots, \hat u_r$ of $\hat U$ are the corresponding eigenvectors of $\hat\Sigma$. Similarly, in what follows we will actively use the eigendecomposition of $\Sigma$:
\[ \Sigma = U \Lambda U^\top, \]
where $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_d) \in \mathbb{R}^{d\times d}$ is a diagonal matrix consisting of the eigenvalues of $\Sigma$ in non-increasing order, and $U = [u_1, \dots, u_d] \in \mathbb{R}^{d\times d}$ consists of the corresponding eigenvectors.
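The SVD and the eigendecomposition above carry the same information. As a quick sanity check (illustrative code, not from the paper), the squared singular values of $X/\sqrt n$ are the non-zero eigenvalues $\hat\lambda_j$ of $\hat\Sigma$, and the right singular vectors are its eigenvectors:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 200
X = rng.normal(size=(n, d))

# SVD of X / sqrt(n): X/sqrt(n) = V_hat @ diag(s) @ U_hat.T
V_hat, s, Ut = np.linalg.svd(X / np.sqrt(n), full_matrices=False)

# Eigendecomposition of Sigma_hat = X^T X / n.
Sigma_hat = X.T @ X / n
eigvals = np.linalg.eigvalsh(Sigma_hat)[::-1][: len(s)]   # non-zero eigenvalues, decreasing

print(np.allclose(s ** 2, eigvals))                       # lambda_hat_j = s_j ** 2
print(np.allclose(Sigma_hat @ Ut[0], s[0] ** 2 * Ut[0]))  # u_hat_1 is an eigenvector
```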
We introduce the following definition, which will be used extensively throughout the work.

Definition 2.1. Rewrite the linear regression model $Y = X\beta + \varepsilon$ as
\[ Y = (X \hat U \hat\Lambda^{-1/2})(\hat\Lambda^{1/2}\hat U^\top \beta) + \varepsilon = Z\theta + \varepsilon. \]
We call this the canonical form of the linear regression model. Here $Z \stackrel{\mathrm{def}}{=} X\hat U\hat\Lambda^{-1/2} \in \mathbb{R}^{n\times r}$ is the standardized design matrix and $\theta \stackrel{\mathrm{def}}{=} \hat\Lambda^{1/2}\hat U^\top\beta \in \mathbb{R}^r$ is the new vector of coefficients, called the canonical regression coefficient vector, or simply the canonical coefficients.

Note that the standardized design coincides with the scaled left singular vectors, $Z = n^{1/2}\hat V$, and satisfies the orthonormality constraint $n^{-1}Z^\top Z = I_r$. Hence, the least-squares estimator for the canonical parameter is $\tilde\theta^{\mathrm{LS}} = n^{-1}Z^\top Y = n^{-1}\hat\Lambda^{-1/2}\hat U^\top X^\top Y$. As in Fan (1996), we further regularize the estimated canonical coefficient vector by either thresholding or truncation (setting higher indices to zero), depending on whether $\theta$ is approximately sparse or concentrated on the leading principal components. Transforming the canonical parameter back to the original domain leads to the canonical thresholding estimator or the principal component regression estimator, as further elaborated below. Our work pushes forward the interactions between the canonical parameters and the design matrix.

More specifically, in the canonical domain our estimator looks like
\[ \hat\theta \stackrel{\mathrm{def}}{=} \mathrm{SOFT}_\tau\big[\tilde\theta^{\mathrm{LS}}\big], \]
which in the original domain brings us to the Natural Canonical Thresholding (NCT) estimator of $\beta$, defined as
\[ \hat\beta \stackrel{\mathrm{def}}{=} \hat U\hat\Lambda^{-1/2}\,\mathrm{SOFT}_\tau\bigg[\frac{\hat\Lambda^{-1/2}\hat U^\top X^\top Y}{n}\bigg], \tag{2.1} \]
where $\mathrm{SOFT}_\tau[z] \stackrel{\mathrm{def}}{=} z\cdot\max(1 - \tau/|z|,\, 0)$ is the soft thresholding function applied component-wise and $\tau \ge 0$ is the thresholding level. To see why the estimator makes sense, plug $Y \approx Z\theta$ in, use the eigendecomposition of $\hat\Sigma$, and get $\hat\theta \approx \mathrm{SOFT}_\tau[\theta]$.
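A minimal NumPy transcription of the NCT estimator (2.1) may help fix ideas: compute the SVD of $X/\sqrt n$, soft-threshold the canonical least-squares coefficients, and map back. This is an illustrative sketch (the rank cutoff 1e-12 is an arbitrary numerical tolerance), not an optimized implementation.

```python
import numpy as np

def soft(z, tau):
    """Component-wise soft thresholding SOFT_tau[z] = z * max(1 - tau/|z|, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def nct(X, Y, tau):
    """Natural Canonical Thresholding estimator (2.1)."""
    n = X.shape[0]
    # SVD of X / sqrt(n): columns of U_hat are eigenvectors of Sigma_hat,
    # and s**2 are its non-zero eigenvalues lambda_hat_j.
    _, s, Vt = np.linalg.svd(X / np.sqrt(n), full_matrices=False)
    U_hat, lam_sqrt = Vt.T, s
    keep = lam_sqrt > 1e-12                           # drop numerically zero directions
    U_hat, lam_sqrt = U_hat[:, keep], lam_sqrt[keep]

    theta_ls = (U_hat.T @ (X.T @ Y / n)) / lam_sqrt   # canonical LS coefficients
    theta_hat = soft(theta_ls, tau)                   # thresholded canonical coefficients
    return U_hat @ (theta_hat / lam_sqrt)             # back to the original domain
```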
Due to the structure of our error (e.g. in the fixed design case)
\[ \mathrm{MSE}(\hat\beta) = \|\hat\Lambda^{1/2}\hat U^\top\hat\beta - \hat\Lambda^{1/2}\hat U^\top\beta\|^2 = \|\hat\theta - \theta\|^2, \]
and since we assume eigenvalue decay, it is likely that some components $\theta_j$ do not play a role, so their estimation by $\hat\theta_j$ is not that important. Hence, it is reasonable to cut such insignificant components off, and this is exactly what the thresholding does. It reduces the variance of the estimator, while not increasing the bias by too much.

We also mention that when $\tau = 0$, our estimator reduces to the minimum $\ell_2$-norm least squares solution $\hat\theta = \tilde\theta^{\mathrm{LS}}$ and
\[ \hat\beta = \tilde\beta^{\mathrm{LS}} = \hat\Sigma^{+}\,\frac{X^\top Y}{n} = (X^\top X)^{+}X^\top Y \]
(unbiased or slightly biased, but with large variance), while $\tau = +\infty$ corresponds to the trivial solution $\hat\theta = 0$ and $\hat\beta = 0$ (very biased, but with zero variance).

2.1 Relation to LASSO

Recall that the standard LASSO estimator is a solution of the following optimization problem:
\[ \tilde\beta^{\mathrm{LASSO}} \in \arg\min_{\beta'\in\mathbb{R}^d}\bigg\{\frac{1}{2n}\|Y - X\beta'\|^2 + \tau\|\beta'\|_1\bigg\}. \]
In practice one usually standardizes the columns of $X$ so that they are on the same scale and the coefficients corresponding to different covariates are penalized equally. Now imagine that we standardize our $X$ in the canonical manner, as in Definition 2.1. If we run LASSO for the vector of coefficients $\theta$, then the solution is expressed via soft thresholding:
\[ \tilde\theta^{\mathrm{LASSO}} = \arg\min_{\theta' \in \mathbb{R}^r} \bigg\{ \frac{1}{2n}\|Y - Z\theta'\|^2 + \tau \|\theta'\|_1 \bigg\} = \mathrm{SOFT}_\tau\bigg[\frac{Z^\top Y}{n}\bigg], \]
which is exactly our estimator $\hat\theta$ in the canonical domain. Going back to $\hat\beta = \hat U \hat\Lambda^{-1/2} \tilde\theta^{\mathrm{LASSO}}$, we recover the NCT estimator (2.1). The solution is the soft thresholding on the canonical regression coefficients. This is why we call the method canonical thresholding.

We also note that the NCT estimator can be represented as the solution of the optimization problem
\[ \hat\beta = \arg\min_{\beta' \in \mathbb{R}^d} \bigg\{ \frac{1}{2n}\|Y - X\beta'\|^2 + \tau \|\hat\Lambda^{1/2}\hat U^\top \beta'\|_1 \bigg\}. \]
Our estimator is nothing more than LASSO penalized on the canonical regression coefficients.
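The equivalence can be checked numerically. The sketch below uses scikit-learn's Lasso, whose objective matches the $\tfrac{1}{2n}$ least-squares scaling written above (that scaling is our reconstruction; with a different normalization the penalty level would differ by a constant factor), and compares the fitted coefficients with direct soft thresholding of $Z^\top Y/n$ on hypothetical data.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, d, tau = 100, 300, 0.05
X = rng.normal(size=(n, d)) * (np.arange(1, d + 1) ** -0.5)   # toy design, decaying scales
Y = X @ (rng.normal(size=d) / np.sqrt(d)) + 0.5 * rng.normal(size=n)

# Canonical form: Z = sqrt(n) * V_hat, so that Z^T Z / n = I_r.
V_hat, s, Vt = np.linalg.svd(X / np.sqrt(n), full_matrices=False)
Z = np.sqrt(n) * V_hat

# LASSO on the canonical coefficients: (1/(2n))||Y - Z theta||^2 + tau ||theta||_1.
theta_lasso = Lasso(alpha=tau, fit_intercept=False).fit(Z, Y).coef_

# Closed form: soft thresholding of Z^T Y / n.
z = Z.T @ Y / n
theta_soft = np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

print(np.max(np.abs(theta_lasso - theta_soft)))   # ~0 up to solver tolerance
```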
2.2 Relation to PCR

Principal Component Regression (PCR) approaches the high dimensionality of the problem by taking only $m$ ($m < r = \min(d, n)$) leading principal components of the original data. The new design matrix becomes
\[ Z_m = X \hat U_{\le m} \hat\Lambda_{\le m}^{-1/2} \in \mathbb{R}^{n \times m}, \]
where $\hat U_{\le m} \in \mathbb{R}^{d \times m}$ consists of the first $m$ columns of $\hat U$ and $\hat\Lambda_{\le m} \in \mathbb{R}^{m \times m}$ is the $m \times m$ leading principal submatrix of $\hat\Lambda$. The new regression problem $Y = Z_m \theta_m + \varepsilon$ is solved via least squares, yielding the solution $\tilde\theta^{\mathrm{LS}}_m = n^{-1} Z_m^\top Y \in \mathbb{R}^m$, and thus
\[ \tilde\beta^{\mathrm{PCR}} \stackrel{\mathrm{def}}{=} \hat U_{\le m} \hat\Lambda_{\le m}^{-1/2} \tilde\theta^{\mathrm{LS}}_m \in \mathbb{R}^d. \]
Note that $\tilde\theta^{\mathrm{LS}}_m$ is essentially the first $m$ components of $\tilde\theta^{\mathrm{LS}}$, and we can express
\[ \tilde\beta^{\mathrm{PCR}} = \hat U \hat\Lambda^{-1/2} \, \mathrm{ZERO}_m\bigg[ \frac{\hat\Lambda^{-1/2}\hat U^\top X^\top Y}{n} \bigg], \]
where $\mathrm{ZERO}_m[z] = [z_1, \dots, z_m, 0, \dots, 0]^\top$ is the operator zeroing out all the components of a vector $z \in \mathbb{R}^r$ except the first $m$. The similarity of the PCR estimator to the NCT estimator (2.1) is now clear. While PCR blindly selects the coefficients corresponding to the first $m$ principal components, our procedure screens for the most important principal directions, which may be different from the leading ones, and keeps only those with significant contribution exceeding $\tau$. However, if there is a strong prior indicating that the canonical coefficients spike at the leading principal directions, then of course PCR should also be a suitable procedure, and NCT will simply mimic its behavior, at some small cost.

2.3 Generalized canonical thresholding

PCR focuses only on the estimators on the principal component directions, while NCT does not have any preferences. We now propose a family of canonical thresholding estimators to bridge these two extremes, progressively putting more preference on the principal directions. In addition, we also generalize the thresholding function.

First, the soft thresholding can be replaced by generalized thresholding rules (see e.g. Definition 9.3 in Fan et al. (2020)), introduced for completeness in the following definition.
Definition 2.2. The function $T_\tau: \mathbb{R} \to \mathbb{R}$ is called a generalized thresholding function if
(i) $|T_\tau[z]| \le c|z'|$ for all $z, z'$ satisfying $|z - z'| \le \tau/2$ and some constant $c$;
(ii) $|T_\tau[z] - z| \le \tau$ for all $z \in \mathbb{R}$.
The parameter $\tau$ is called the thresholding level.

Second, if there is some prior that the spiked canonical coefficients are more likely to be in the lower principal components rather than in the higher ones, we can introduce additional multiplicative weights, equal to the eigenvalues raised to a non-negative power $\varphi/2$, under thresholding to reflect this preference. This is equivalent to applying a larger thresholding to higher principal components, with $\varphi$ controlling the degree. Implementing this strategy, we propose the following more general family of estimators, parameterized by $\varphi \ge 0$:
\[ \hat\theta \stackrel{\mathrm{def}}{=} \hat\Lambda^{-\varphi/2}\, T_\tau\big[\hat\Lambda^{\varphi/2}\tilde\theta^{\mathrm{LS}}\big] \]
in the canonical domain, or
\[ \hat\beta \stackrel{\mathrm{def}}{=} \hat U\hat\Lambda^{-(1+\varphi)/2}\, T_\tau\bigg[\frac{\hat\Lambda^{-(1-\varphi)/2}\hat U^\top X^\top Y}{n}\bigg] \tag{2.2} \]
in the original domain. Here $T_\tau[z]$ is a generalized thresholding function from Definition 2.2 applied component-wise and $\tau \ge 0$ is the thresholding level. We call these the Generalized Canonical Thresholding (GCT) estimators.

When $\varphi = 0$ and the soft thresholding function is used, GCT corresponds to the NCT estimator (2.1). The intuition behind GCT is somewhat similar to NCT: the estimators automatically screen the most significant principal components. However, the choice of $\varphi$ allows us to put different importance on the eigenvalue $\hat\lambda_j$ and on $\hat u_j^\top\beta$ when deciding whether to threshold a principal component $j$ or not. While in NCT this importance is calibrated in accordance with the scaling appearing in the decomposition of the absolute MSE (this justifies the word "Natural" in the name), GCT with $\varphi > 0$ puts more emphasis on the components with large eigenvalues, reflecting a prior that the canonical coefficients concentrate on the leading principal components; in the extreme case only the leading components survive thresholding. This leads to PCR. We introduce GCT to better accommodate this situation.

It turns out that the theoretical results for GCT are not as nice and insightful as for NCT. However, in practice we observed that GCT may behave much better in some scenarios, as will be shown later in the corresponding section. In principle, one can even tune $\varphi$ in addition to tuning $\tau$ via cross-validation, which might further enhance the practical utility of the procedure.
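For illustration, the NCT sketch above extends to the GCT family (2.2) in a few lines. The weighting by $\hat\lambda_j^{\varphi/2}$ follows the display above; the hard-thresholding option is one admissible generalized thresholding rule in the sense of Definition 2.2, included here only as an example.

```python
import numpy as np

def gct(X, Y, tau, phi=0.0, rule="soft"):
    """Generalized Canonical Thresholding estimator (2.2); phi = 0 with the
    soft rule recovers the NCT estimator (2.1)."""
    n = X.shape[0]
    _, s, Vt = np.linalg.svd(X / np.sqrt(n), full_matrices=False)
    U_hat, lam_sqrt = Vt.T, s
    keep = lam_sqrt > 1e-12
    U_hat, lam_sqrt = U_hat[:, keep], lam_sqrt[keep]

    theta_ls = (U_hat.T @ (X.T @ Y / n)) / lam_sqrt    # canonical LS coefficients
    w = lam_sqrt ** phi                                 # weights lambda_hat_j ** (phi/2)
    z = w * theta_ls
    if rule == "soft":
        z_thr = np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)
    else:                                               # hard thresholding, another valid T_tau
        z_thr = z * (np.abs(z) > tau)
    theta_hat = z_thr / w                               # back to the canonical scale
    return U_hat @ (theta_hat / lam_sqrt)               # back to the original domain
```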
3 Theoretical guarantees

The first condition needed for our theoretical analysis is the following assumption on the noise, which will be used in both the fixed design and random design settings.

Assumption 3.1 (Sub-Weibull noise). The noise vector $\varepsilon$ is independent of $X$ and is a jointly sub-Weibull random vector with parameter $0 < \alpha \le 2$ (see Kuchibhotla and Chakrabortty (2018)). That is, there exists $\sigma < \infty$ such that
\[ \sup_{\|w\|=1} \|w^\top \varepsilon\|_{\psi_\alpha} \le \sigma, \]
where $\|\cdot\|_{\psi_\alpha}$ is the Orlicz norm for $\psi_\alpha(x) = e^{x^\alpha} - 1$. The following tail bound takes place: $\mathbb{P}\big[|w^\top \varepsilon| \ge t\big] \le 2\exp\big(-(t/\sigma)^\alpha\big)$ for all $w$ with $\|w\| = 1$ and all $t > 0$.

This allows us to go slightly beyond sub-Gaussian and sub-exponential tails. For i.i.d. sub-Gaussian noise, $\sigma$ coincides (up to a multiplicative constant) with the variance of a single $\varepsilon_i$. So, $\sigma$ can be interpreted as the magnitude of the noise.

Define the signal-to-noise ratio of the linear regression problem in the fixed design setting as
\[ \mathrm{SNR} \stackrel{\mathrm{def}}{=} \bigg( \frac{n^{-1}\sum_{i=1}^n (x_i^\top\beta)^2}{\sigma^2} \bigg)^{1/2} = \bigg( \frac{\beta^\top \hat\Sigma \beta}{\sigma^2} \bigg)^{1/2} = \frac{\|\theta\|}{\sigma}, \]
while for the random design we use
\[ \mathrm{SNR} \stackrel{\mathrm{def}}{=} \bigg( \frac{\mathbb{E}[(x^\top\beta)^2]}{\sigma^2} \bigg)^{1/2} = \bigg( \frac{\beta^\top \Sigma \beta}{\sigma^2} \bigg)^{1/2} = \frac{\|\Lambda^{1/2} U^\top \beta\|}{\sigma}. \]
We always assume $\mathrm{SNR} > 0$. Recall the canonical coefficient vector $\theta = \hat\Lambda^{1/2}\hat U^\top \beta$ (its analogue $\Lambda^{1/2} U^\top\beta$ plays the same role for the random design). Its normalized version $\theta/\|\theta\|$ has the first $k$ components $\theta_{\le k}/\|\theta\|$, where $\theta_{\le k} = \hat\Lambda_{\le k}^{1/2}\hat U_{\le k}^\top \beta$. Here, as usual, $\hat\Lambda_{\le k}^{1/2} \in \mathbb{R}^{k\times k}$ is the leading principal submatrix of $\hat\Lambda^{1/2}$ (containing the square roots of the first $k$ eigenvalues of $\hat\Sigma$ on the diagonal) and $\hat U_{\le k} \in \mathbb{R}^{d \times k}$ is the matrix consisting of the first $k$ columns of $\hat U$ (which are the $k$ leading eigenvectors of $\hat\Sigma$). Measuring the normalized first $k$ components $\theta_{\le k}/\|\theta\|$ in the $\ell_q$-(pseudo)norm gives
\[ D^{\mathrm{eff}}_{q,k}(\hat\Sigma, \beta) \stackrel{\mathrm{def}}{=} \frac{\|\theta_{\le k}\|_q^q}{\|\theta\|^q}. \]
We call this quantity the joint effective dimension of order $q$ up to index $k$ of $\hat\Sigma$ and $\beta$. Note that when $q = 2$, it measures the proportion of $\theta$ explained by $\theta_{\le k}$; when $q = 0$, it counts the sparsity among $\theta_{\le k}$. A similar quantity can be defined for the random design setting:
\[ D^{\mathrm{eff}}_{q,k}(\Sigma, \beta) \stackrel{\mathrm{def}}{=} \frac{\|\Lambda_{\le k}^{1/2} U_{\le k}^\top \beta\|_q^q}{\|\Lambda^{1/2} U^\top \beta\|^q}. \]
It turns out that this joint effective dimension plays a crucial role in our bounds for the NCT estimator (2.1).

For shortness, we introduce the following quantity that will appear regularly throughout the section:
\[ \rho \stackrel{\mathrm{def}}{=} \frac{2}{\sqrt n} \big(\log(2d/\delta)\big)^{1/\alpha}, \tag{3.1} \]
where $\delta$ comes from the statements "with probability $1 - \delta$ ...". The thresholding level $\tau$ for both NCT and GCT will be expressed in terms of $\rho$.

3.1 Fixed design

We first provide a simple guarantee on the mean squared error $\mathrm{MSE}(\hat\beta)$ of the NCT estimator (2.1).

Theorem 3.1. Suppose Assumption 3.1 is satisfied. Take $\tau = \sigma\rho$ with $\rho = \frac{2}{\sqrt n}(\log(2d/\delta))^{1/\alpha}$. Then, with probability $1 - \delta$, the NCT estimator $\hat\beta$ from (2.1) with thresholding at level $\tau$ satisfies
\[ \mathrm{MSE}(\hat\beta) \lesssim \inf_{q \in [0,2]} \big\{ \|\theta\|_q^q \, (\sigma\rho)^{2-q} \big\} \tag{3.2} \]
\[ = \mathrm{MSE}(0) \inf_{q \in [0,2]} D^{\mathrm{eff}}_{q,d}(\hat\Sigma, \beta) \bigg[ \mathrm{SNR}^{-2}\, \frac{4\big(\log(2d/\delta)\big)^{2/\alpha}}{n} \bigg]^{1 - q/2}. \tag{3.3} \]

The proof of this result almost repeats the classical proof for hard and soft thresholding with an orthonormal design.

Remark 3.1 (Choice of $\tau$). The choice of $\tau$ in the above theorem depends on the noise magnitude $\sigma$, the probability $\delta$, and the quantity $\alpha$, but this is not a significant problem. Later in Section 4 we will show how to tune $\tau$ using an effective cross-validation procedure.

Though the bounds (3.2) and (3.3) coincide, the way we state them reflects two different messages. For the first one, if we take $q = 1$ and apply the inequality $\|\theta\|_1 \le \|\hat\Sigma\|^{1/2}\|\beta\|\, r_{\mathrm{eff}}[\hat\Sigma]^{1/2}$ (which follows from the Cauchy–Schwarz inequality), then from the bound (3.2) we get
\[ \mathrm{MSE}(\hat\beta) \lesssim \sigma \|\hat\Sigma\|^{1/2} \|\beta\| \sqrt{\frac{r_{\mathrm{eff}}[\hat\Sigma]\, (\log(2d/\delta))^{2/\alpha}}{n}} \]
with high probability. This means that if one is interested in the absolute error $\mathrm{MSE}(\hat\beta)$, then in the moderate noise case $\sigma \le C\|\hat\Sigma\|^{1/2}\|\beta\|$, essentially $r_{\mathrm{eff}}[\hat\Sigma]/n = o(1)$ is enough to guarantee $\mathrm{MSE}(\hat\beta) = \|\hat\Sigma\|\|\beta\|^2 \cdot o(1)$, omitting logarithmic terms. No additional assumptions on $\beta$ are required, and there is no need to worry about the joint effective dimension and the signal-to-noise ratio from the bound (3.3) in this situation.

However, the bound (3.3) is useful for better understanding the structure of the error. Taking into account that the main motivation in our work is the decay of eigenvalues of $\hat\Sigma$ or $\Sigma$, it may easily be the case that even the trivial estimator $\tilde\beta = 0$ has small error $\mathrm{MSE}(0) = \beta^\top\hat\Sigma\beta$. Hence, it makes sense to care more about the relative error $\mathrm{MSE}(\tilde\beta)/\mathrm{MSE}(0)$. In this case, the joint effective dimension and the signal-to-noise ratio control the upper bound on the relative error. We will get back to the analysis of the relative error and the joint effective dimension after we state results for the random design case.
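To see how the pieces of the bound interact, the sketch below computes the canonical coefficients $\theta$, the threshold $\tau = \sigma\rho$ from (3.1), the right-hand side of (3.2) over a grid of $q$, and the joint effective dimension at the minimizing $q$; the polynomial decays of $\lambda_j$ and of the coefficients are illustrative assumptions, not requirements of the theorem.

```python
import numpy as np

n, d, sigma, delta, alpha = 200, 1000, 0.5, 0.05, 1.0
rng = np.random.default_rng(2)

# Illustrative design with polynomial eigenvalue and coefficient decay (assumptions of this sketch).
lam = np.arange(1, d + 1) ** -1.5
X = rng.normal(size=(n, d)) * np.sqrt(lam)
beta = (np.arange(1, d + 1) ** -1.0) * rng.choice([-1, 1], size=d)

# Canonical coefficients theta = Lambda_hat^{1/2} U_hat^T beta.
_, s, Vt = np.linalg.svd(X / np.sqrt(n), full_matrices=False)
theta = s * (Vt @ beta)

rho = 2 / np.sqrt(n) * np.log(2 * d / delta) ** (1 / alpha)   # quantity (3.1)
tau = sigma * rho

# Right-hand side of (3.2): inf_q ||theta||_q^q (sigma*rho)^{2-q}, together with
# the joint effective dimension D^eff_{q,d}(Sigma_hat, beta) at the minimizing q.
qs = np.linspace(1e-3, 2.0, 200)
bound = [np.sum(np.abs(theta) ** q) * tau ** (2 - q) for q in qs]
q_best = qs[int(np.argmin(bound))]
d_eff = np.sum(np.abs(theta) ** q_best) / np.sum(theta ** 2) ** (q_best / 2)
print(q_best, min(bound), d_eff)
```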
3.2 Random design

In addition to the noise assumption, to study the performance of the NCT estimator in the random design setting we need to impose a couple more conditions on the distribution of the covariates.

Assumption 3.2 (Sub-Gaussian covariates). The scaled generic random vector of covariates $\Sigma^{-1/2}x$ is sub-Gaussian.

Assumption 3.3 (Convex decay of eigenvalues). There exists a convex decreasing function $\lambda(\cdot)$ such that the eigenvalues of $\Sigma$ satisfy $\lambda_j = \lambda(j)$ for $j \in [d]$.

The previous assumption is technical and we impose it in our main result just for concreteness. Later, in Remark 3.8, we mention how our result can be modified if this assumption does not hold.

One more assumption is needed just to make the rates more friendly-looking. If it is not satisfied, our result below will be meaningless, so there is no loss of generality in this condition.

Assumption 3.4 (Technical conditions). The effective rank satisfies $r_{\mathrm{eff}}[\Sigma] \le n$. Also, whenever we say "with probability $1-\delta$ ...", we suppose that the quantity
\[ \epsilon = \epsilon_{n,d,\delta} \stackrel{\mathrm{def}}{=} \sqrt{\frac{\log(d/\delta)}{n}} \]
satisfies $\epsilon \le c$ for a properly chosen implicit absolute constant $c > 0$ (this constant comes from the proof).

In addition to the assumptions above, in the sequel we adopt the convention $\lambda_k = 0$ for all $k > d$. Now we are ready to present the following result.

Theorem 3.2. Suppose Assumptions 3.1–3.4 are fulfilled. Recall $\epsilon = \epsilon_{n,d,\delta} \stackrel{\mathrm{def}}{=} \sqrt{\log(d/\delta)/n}$ and define $k^* \stackrel{\mathrm{def}}{=} \big(\epsilon^2 \log(1/\epsilon)\big)^{-1/3}$ (essentially, $k^* \asymp n^{1/3}$ up to a logarithmic term). Then:

(i) With probability $1-\delta$, the NCT estimator $\hat\beta$ from (2.1) with thresholding at level $\tau_1 = \sigma\rho = \frac{2\sigma}{\sqrt n}(\log(2d/\delta))^{1/\alpha}$ satisfies
\[ \mathrm{PE}(\hat\beta) \lesssim \inf_{q\in[0,2]} \big\{ \|\Lambda_{\le k}^{1/2} U_{\le k}^\top \beta\|_q^q\, (\sigma\rho)^{2-q} \big\} + \|\Sigma\|\|\beta\|^2 \Bigg( \frac{\lambda_k}{\lambda_1} + \sqrt{\frac{r_{\mathrm{eff}}[\Sigma] + \log(1/\delta)}{n}} + \epsilon \sum_{j=1}^k \frac{\lambda_j (1 + \epsilon j)}{\lambda_1} \Bigg) \]
for all $1 \le k \le k^*$.

(ii) With probability $1-\delta$, the NCT estimator $\hat\beta$ from (2.1) with thresholding at level
\[ \tau_2 = \sigma\rho + C\,\|\Sigma\|^{1/2}\|\beta\|\, \epsilon^{1/2} \max_{j\in[k]} \bigg( \frac{\lambda_j(1+\epsilon j)}{\lambda_1} \bigg)^{1/2} \]
for some absolute constant $C$ satisfies
\[ \mathrm{PE}(\hat\beta) \lesssim \inf_{q\in[0,2]} \big\{ \|\Lambda_{\le k}^{1/2} U_{\le k}^\top \beta\|_q^q\, \tau_2^{2-q} \big\} + \|\Sigma\|\|\beta\|^2 \Bigg( \frac{\lambda_k}{\lambda_1} + \sqrt{\frac{r_{\mathrm{eff}}[\Sigma] + \log(1/\delta)}{n}} \Bigg) \]
for all $1 \le k \le k^*$.

Several comments are in order.
Remark 3.2 (Why two bounds?) . We present two different bounds in (i) and (ii) for differentthresholds τ and τ , because they behave differently for different eigenvalue regimes. The boundfrom (i) outperforms the bound from (ii) in a wide variety of regimes (e.g. in polynomial andsuperpolynomial decay scenario), however there are cases when the bound (ii) can be better (e.g.specific cases in Factor Model regime). emark 3.3 (Meaning of terms) . We call the term inf q ∈ [0 , (cid:8) (cid:107) Λ ≤ k U (cid:62)≤ k β (cid:107) qq τ − q (cid:9) in the boundsof Theorem 3.2 (i) and (ii) the “main term”, since it is almost the same as what we had inTheorem 3.1 for the fixed design. The other terms in these bounds are referred to as “additionalterms” as they appear only in the random design case. Allowing ≤ k ≤ k ∗ provides atradeoff, as some of the terms increase with growing k , while others decrease. In what followswe are typically interested in k = k ∗ just for concreteness. The meaning of different parts ofthe “additional terms” is the following. The parts including r eff [ Σ ] are the payment for thecovariance matrix estimation. The part λ k ∗ /λ appears due to the difficulty of control of theempirical eigenvectors beyond k ∗ -th. The parts with (cid:80) k ∗ j =1 λ j (1 + (cid:15)j ) /λ and max j ∈ [ k ∗ ] λ j (1 + (cid:15)j ) /λ are the payment for the control of the sample eigenvalues and eigenvectors up to k ∗ . Remark 3.4 (Moderate noise: simplifications and sufficient conditions for convergence) . Con-sider the moderate noise situation σ ≤ C (cid:107) Σ (cid:107)(cid:107) β (cid:107) . In this case the “additional terms” becomedominating: simply taking k = k ∗ and q = 1 , applying (cid:107) Λ ≤ k U (cid:62)≤ k β (cid:107) ≤ (cid:107) Σ (cid:107) / (cid:107) β (cid:107) r eff [ Σ ] / and plugging in τ and τ makes the “main term” negligible. Omitting logarithmic terms, thebound (i) reduces to PE ( (cid:98) β ) (cid:46) (cid:107) Σ (cid:107)(cid:107) β (cid:107) (cid:32) λ k ∗ λ + (cid:114) r eff [ Σ ] n + (cid:15) k ∗ (cid:88) j =1 λ j (1 + (cid:15) j ) λ (cid:33) , (3.4) while the bound (ii) reduces to PE ( (cid:98) β ) (cid:46) (cid:107) Σ (cid:107)(cid:107) β (cid:107) (cid:32) λ k ∗ λ + (cid:18) r eff [ Σ ] n (cid:19) / max j ∈ [ k ∗ ] (cid:18) λ j (1 + j / √ n ) λ (cid:19) / (cid:33) with high probability.From here we can easily deduce sufficient conditions to ensure the convergence of the absoluteerror PE ( (cid:98) β ) = (cid:107) Σ (cid:107)(cid:107) β (cid:107) · o (1) as n → ∞ without any conditions on β . In particular, λ n = o (1) , r eff [ Σ ] = o ( n ) , and (cid:80) k ∗ j =1 λ j (1+ j / √ n ) λ = o ( n / ) is enough (again, up to logarithmicfactors). Essentially, these sufficient conditions require the decay of eigenvalues to be fastenough (but in a more sophisticated fashion than for the MSE). Remark 3.5 (Moderate noise: further simplifications in specific examples) . Continuing themoderate noise situation, for the sake of exposition, let us consider several specific examples ofeigenvalue regimes and illustrate how the bound from Theorem 3.2 (i) simplifies. As above, weomit logarithmic terms. • (Polynomial decay): If λ j (cid:46) j − a with a ≥ or d (cid:46) n (3 − a ) / (3 − a ) , it is easy to verify thatthe bracket factor of (3.4) is dominated by λ k ∗ /λ + n − / (again ignoring logarithmicterms) and with high probability PE ( (cid:98) β ) (cid:46) (cid:107) Σ (cid:107)(cid:107) β (cid:107) n min( a/ , / . 
n particular, when a = 1 (the boundary that has a good control of r eff [ Σ ] in high dimen-sions), we have with high probability PE ( (cid:98) β ) (cid:46) (cid:107) Σ (cid:107)(cid:107) β (cid:107) n / . When a ≥ / , we have with high probability PE ( (cid:98) β ) (cid:46) (cid:107) Σ (cid:107)(cid:107) β (cid:107) n / . • (Factor Model regime) If λ (cid:16) . . . (cid:16) λ m (cid:16) d , λ m +1 (cid:16) . . . (cid:16) λ d (cid:16) for some m (cid:46) k ∗ ,then taking k = m + 1 yields with high probability PE ( (cid:98) β ) (cid:46) (cid:107) Σ (cid:107)(cid:107) β (cid:107) (cid:18) d + m √ n + m n (cid:19) . Remark 3.6 (Large noise) . In the large noise case, when σ (cid:29) (cid:107) Σ (cid:107)(cid:107) β (cid:107) , the “main term”dominates. Similarly to the fixed design case, we can factorize the “main term” into the errorof the trivial estimator PE (0) = β (cid:62) Σ β , the joint effective dimension and the signal-to-noiseratio: the “main term” from (i) becomes PE (0) inf q ∈ [0 , (cid:40) D eff q,k ∗ ( Σ , β ) (cid:34) SNR − (log(2 d/δ )) /α n (cid:35) − q/ (cid:41) , and the “main term” from (ii) can be rewritten as PE (0) inf q ∈ [0 , (cid:40) D eff q,k ∗ ( Σ , β ) (cid:34) SNR − (log(2 d/δ )) /α n ++ (cid:107) Σ (cid:107)(cid:107) β (cid:107) β (cid:62) Σ β (cid:15) max j ∈ [ k ∗ ] (cid:18) λ j (1 + (cid:15)j ) λ (cid:19) (cid:35) − q/ (cid:41) . In this regime, the relative error PE ( (cid:98) β ) / PE (0) is essentially controlled by the joint effectivedimension and the signal-to-noise ratio. More detailed analysis of the joint effective dimension D eff q,k ( Σ , β ) is conducted in the next section. Remark 3.7 (Comparison with the least squares) . It is straightforward to notice that the fasterdecay of eigenvalues, the better bound we obtain. This contrasts the min norm least squaresestimator considered in Bartlett et al. (2020); Chinot and Lerasle (2020), where the decay isrequired to be not too fast. It also reveals the benefits of the thresholding even in such a situation.
Remark 3.8 (Relaxing Assumption 3.3) . Assumption 3.3 can be avoided. If one defines k ∗ as k ∗ = min ( (cid:15) log(1 /(cid:15) )) − / , max j ∈ [ d ] (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) d (cid:88) l =1 l (cid:54) = j λ l | λ j − λ l | + λ j min( λ j − − λ j , λ j − λ j +1 ) ≤ (cid:15) , then the same conclusion as in Theorem 3.2 is true with a slightly different rate.
To compare the natural canonical thresholding with the canonical truncation of higher-index components, i.e. PCR, we state the next proposition.

Proposition 3.3. Assume the conditions of Theorem 3.2 hold and let $k^*$ be defined in the same way. Then, with probability $1-\delta$, the PCR estimator $\tilde\beta^{\mathrm{PCR}}$ with the number of leading principal components set to $m \lesssim k^*$ satisfies
\[ \mathrm{PE}(\tilde\beta^{\mathrm{PCR}}) \lesssim \|\Sigma\|\|\beta\|^2 \Bigg( \frac{\lambda_m}{\lambda_1} + \sqrt{\frac{r_{\mathrm{eff}}[\Sigma] + \log(1/\delta)}{n}} \Bigg) + \frac{\sigma^2 m}{n} \big(\log(2m/\delta)\big)^{2/\alpha}. \]

We omit the proof of this result, since it essentially uses the same techniques and follows the same strategy as the proof of Theorem 3.2. Note that in the moderate noise scenario the rate essentially coincides with what we obtained for the NCT estimator in Remark 3.4. The adaptivity of our estimator $\hat\beta$ comes into play in the large noise case: the "main term" in the bounds on $\mathrm{PE}(\hat\beta)$ is better than $\sigma^2 m/n$ in situations where $U_{\le k}^\top\beta$ is approximately sparse.

3.3 Relative errors

So far we were able to establish some sufficient conditions for convergence of the absolute errors
MSE ( (cid:98) β ) and PE ( (cid:98) β ) of the NCT estimator without any assumptions on β by simply taking q = 1 (Remark 3.4). The analysis of the relative errors for fixed design and (in large noisecase) random design requires more careful study of D eff q,k ( Σ , β ). Let us motivate why relativeerrors MSE ( (cid:98) β ) / MSE (0) and PE ( (cid:98) β ) / PE (0) might be of interest in the first place. One reason behind studying the relative errors was already mentioned previously. Note that ifthere is no relation between (cid:98) Σ and β , meaning that (cid:98) U (cid:62) β is a “random” vector, then we canexpect (cid:107) (cid:98) U (cid:62) β (cid:107) ∞ (cid:16) (cid:107) β (cid:107) (cid:112) log( d ) /d . In this case, the trivial estimator (cid:101) β = 0 achieves error MSE (0) = β (cid:62) (cid:98) Σ β ≤ Tr [ (cid:98) Σ ] (cid:107) (cid:98) U (cid:62) β (cid:107) ∞ (cid:16) (cid:107) (cid:98) Σ (cid:107)(cid:107) β (cid:107) r eff [ (cid:98) Σ ] log( d ) d . As long as the eigenvalues of (cid:98) Σ decay fast, even the trivial estimator gives error close to zeroin high dimensions. Here we should highlight that this effect does not appear in low dimensions(and even in high-dimensional but isotropic situations), where the absolute and relative errorsare just a multiplicative constant apart. (Same reasoning works for the PE.) Hence, it is notsatisfactory for us to show that the absolute error of our estimator goes to zero with growingsample size and dimension. We would like to get more meaningful conclusions from our results,which would confirm that the proposed estimator does better than the trivial estimator. Thisnaturally leads to the relative errors.Another motivation comes from the way statisticians evaluate and compare estimators inpractical applications. In particular, a widely used performance measure is the coefficient of etermination , or simply R . For instance, the in-sample version for an estimator (cid:101) β is definedas R in ( (cid:101) β ) def = 1 − (cid:80) ni =1 ( y i − x (cid:62) i (cid:101) β ) (cid:80) ni =1 y i . The larger this quantity is, the better method we have; its largest possible value is 1, and thevalue of 0 indicates that the estimator does no better than the trivial estimator. Via heuristicarguments we can see R in ( (cid:101) β ) ≈ − MSE ( (cid:101) β ) MSE (0) . Thus, maximizing the coefficient of determination is approximately equivalent to minimizingthe relative error. Similar intuition applies to the out-of-sample R out and the relative predictionerror PE ( (cid:101) β ) / PE (0). Note that in applications it is often the case that even small but positive R (e.g. 0 .
05) can be considered a success. Therefore, the hope to have
MSE ( (cid:101) β ) / MSE (0) or PE ( (cid:101) β ) / PE (0) converging to 0 might be too optimistic in some situations. Having these relativeerrors smaller than 1 already means that the procedure is able to extract some useful signalfrom the data. Prior to describing properties of D eff q,k ( Σ , β ), let us show why imposing conditions on the designalone, or imposing conditions on β alone can be not enough to establish convergence of therelative errors. It is easier to do for the fixed design case, so let us focus on this setting for now.For a given design matrix XXX and an estimator (cid:101) β we can construct another estimator (cid:101) θ = (cid:98) Λ (cid:98) U (cid:62) (cid:101) β . Therefore, ( (cid:101) β − β ) (cid:62) (cid:98) Σ ( (cid:101) β − β ) β (cid:62) (cid:98) Σ β = (cid:107) (cid:101) θ − θ (cid:107) (cid:107) θ (cid:107) , and hence, inf (cid:101) β sup β ∈ R d MSE ( (cid:101) β ) MSE (0) = inf (cid:101) θ sup θ ∈ R r (cid:107) (cid:101) θ − θ (cid:107) (cid:107) θ (cid:107) . The minimax rate for the relative MSE coincides with the minimax rate for the relative esti-mation error in the right-hand side, which has nothing to do with (cid:98) Σ . This demonstrates thatgetting a good rate is hopeless in high dimension assuming only fast decay of eigenvalues of (cid:98) Σ .On the other hand, we might impose strong conditions on β , such as sparsity, in whichcase one could expect even MSE ( (cid:101) β ) (cid:16) σ s/n (up to a logarithmic factor) for some appropriateestimator (cid:101) β , where s measures the degree of sparsity. However, as we mentioned previously,if (cid:98) Σ and β are not related, and the eigenvalues of (cid:98) Σ decay fast, we might have MSE (0) (cid:16)(cid:107) (cid:98) Σ (cid:107)(cid:107) β (cid:107) /d (again up to logarithmic factors) for the trivial estimator. This implies that thereis no much hope in getting vanishing relative error MSE ( (cid:101) β ) / MSE (0) in high dimensions. Thatis why it seems natural that our bound on the relative error depends on the joint effectivedimension, that takes into account not only decay of eigenvalues or only assumptions on β ,but the joint structure of (cid:98) Σ and β . 17 .3.3 Joint effective dimension Now, once we supported the appearance of the joint effective dimension, let us mention its basicproperties. (For concreteness we choose the random design setting and consider D eff q,k ( Σ , β ),though the following ideas apply to D eff q,d ( (cid:98) Σ , β ) appearing in the fixed design case.) • D eff q,k ( Σ , β ) ≤ k . • D eff q,k ( Σ , β ) is decreasing in q and increasing in k . • D eff ,k ( Σ , β ) ≤ D eff ,d ( Σ , β ) = 1. • D eff ,k ( Σ , β ) = (cid:107) Λ ≤ k U (cid:62)≤ k β (cid:107) is essentially the sparsity of U (cid:62)≤ k β .For now let us assume SNR ≥ c > D eff q ( Σ , β ) only. Recall that the “mainterm” in the relative bounds looks likeinf q ∈ [0 , (cid:40) D eff q,k ( Σ , β ) ζ − q/ n (cid:41) where ζ n is some vanishing rate. Hence, the properties above reveal a tradeoff in the main term: • When q is large, i.e. closer to 2, it is easier to control D eff q,k ( Σ , β ); however, the vanishingrate ζ n is raised to a small power, making the convergence slow. • When q is small, i.e. 
closer to 0, it is more difficult to control D eff q,k ( Σ , β ); in contrast, ζ n is raised to a large power potentially enabling fast convergence rate.So, the bound allows to find largest q for which D eff q,k ( Σ , β ) can be bounded in dimension-free(or the dependence on d is not that severe) and sample size-free manner to facilitate fasterconvergence rate.Some scenarios where D eff q,k ( Σ , β ) can be bounded more explicitly (for some q <
2) arediscussed below: • Sparsity of U T ≤ k β . Denote s def = (cid:107) U (cid:62)≤ k β (cid:107) to be the sparsity level. Then, as alreadymentioned previously, D eff ,k ( Σ , β ) = s . • Approximate sparsity of Λ ≤ k U T ≤ k β . Suppose there exists a small set J ⊆ [ k ] (of size |J | = s ) of significant components, so that the rest of the components satisfy λ / j | u (cid:62) j β | ≤ cd (cid:107) ΛU (cid:62) β (cid:107) for all j / ∈ J . Then D eff ,k ( Σ , β ) ≤ s + c . • Polynomial decay: λ j (cid:16) j − a , | u (cid:62) j β | (cid:16) j − b , where a ≥
0. We have several cases: – If a + 2 b ≥ q ≥ / ( a + 2 b ), then D eff q,k ( Σ , β ) (cid:16) – If a + 2 b ≥ q < / ( a + 2 b ), then D eff q,k ( Σ , β ) (cid:16) k − ( a +2 b ) q/ .18 a+2b . . . . . q d . d . d . d . d . constgrows with d Figure 1: Dependence of D eff q,d ( Σ , β ) on d . – If a + 2 b <
1, then D eff q,k ( Σ , β ) (cid:16) k − ( a +2 b ) q/ /d (1 − a − b ) q/ .Here we omitted logarithmic factors. To better understand the dependence of D eff q,d ( Σ , β )on d in specific case k = d , in Figure 1 we depict this dependence in ( a + 2 b ) – q axes. In the green region D eff q,d ( Σ , β ) does not grow with d , while in the yellow region D eff q,d ( Σ , β ) grows with d polynomially, and the contours of constant power are illustratedwith different colors. Since the “main term” conveniently decomposes into several factors, among which the mostmysterious one – the joint effective dimension – was discussed above, the analysis of the relativeerrors of the NCT estimator in the fixed design and (for large noise case) random design ispretty much complete. It is intriguing though, what happens to the prediction error bounds inmoderate noise case: recall that in this scenario the “main term” is absorbed by the “additionalterm”, which doesn’t have a structure allowing a direct analysis of the bounds on the relativeerror PE ( (cid:98) β ) / PE (0). It is not clear whether they can be stated in a way that will providebetter understanding of the relative error. Instead, we can take a look at the particular caseof polynomial decay of eigenvalues and regression coefficients: λ j (cid:16) j − a , | u (cid:62) j β | (cid:16) j − b . Aftertedious calculations, one may express the bounds from Theorem 3.1 and Theorem 3.2 (i), (ii)in terms of d, n, a, b only. It turns out, that in this scenario the bound from Theorem 3.2 (ii)is always worse than the bound from Theorem 3.2 (i), so we exclude it from consideration.In Figure 2 we plot the contours of constant convergence rate on a – b plane for the boundsfrom Theorem 3.1 and Theorem 3.2 (i). Different colors of the contours correspond to differentrates. The background color describes the assumptions on d and n that we make in differentregions: in light green zones d can be much larger than n (though this is not necessary), inlight yellow zones d and n are allowed to be of the same order, i.e. d < Cn for some C (that19an be larger that 1), and in grey zone the rates do not go to zero unless d is significantlysmaller than n . We again disregard the logarithmic terms. Now we move to a brief study of the theoretical guarantees for the GCT estimator (2.2). Theresults are not that insightful as the ones for the NCT estimator, and we focus on the fixeddesign setting only. However, once we state the MSE bound, the theory for the random designcan be developed in the same way as for the NCT estimator.
Theorem 3.4.
Suppose Assumption 3.1 is satisfied. Let ϕ ≥ . Take τ = σρ with ρ = √ n (log(2 d/δ )) /α . Then, with probability − δ , the GCT estimator (cid:98) β from (2.2) with parameter ϕ and thresholding at level τ satisfies MSE ( (cid:98) β ) (cid:46) r (cid:88) j =1 min (cid:32) (cid:98) λ ϕ/ (cid:98) λ ϕ/ j σρ, | θ j | (cid:33) . The obtained bound may be difficult to comprehend, but we state it in the most general form tomake sure it is tight and applicable in wide range of scenarios. When θ j = 0 for j ≥ m , it hasno estimation errors beyond the first m principal components, adapting very well to focusingonly on the low dimensions estimation like PCR. If, in addition, (cid:98) λ / (cid:98) λ m is bounded, we have MSE ( (cid:98) β ) (cid:46) m (cid:88) j =1 min ( σρ, | θ j | ) , which is not much larger than the MSE of PCR. It can even be much smaller than PCR when | θ j | are small for many j . In general, GCT outperforms NCT when | θ j | decays fast enough.The next corollary allows to make sure that the bound is essentially dimension-free – asfor Theorem 3.1, the rate can be expressed in terms of the effective rank r eff [ (cid:98) Σ ], while thedependence on the dimension appears only under logarithm. Corollary 3.5.
Under assumptions of Theorem 3.4, with probability − δ , the GCT estimator (cid:98) β from (2.2) with parameter ϕ ≥ satisfies MSE ( (cid:98) β ) (cid:46) (cid:16) (cid:107) (cid:98) Σ (cid:107)(cid:107) β (cid:107) + σ (cid:107) (cid:98) Σ (cid:107) / (cid:107) β (cid:107) r eff [ (cid:98) Σ ] / (cid:17) (log(2 d/δ )) / (2+ ϕ ) α n / (2+ ϕ ) . Note that when ϕ = 0 we essentially recover the rate obtained after Theorem 3.1. The ratedeteriorates when ϕ is far from 0, and this is explainable: the GCT procedure significantlydeviates from the natural one, leading to a worse bound in the worst case, i.e. when the onlyassumption is the control of effective rank, and spikiness of canonical coefficients is not justified. Remark 3.9 (Simplifications in specific cases) . In several specific cases the rate from Theo-rem 3.4 can be made much more transparent. We omit logarithmic terms. . . . a . . . . b n − . n − . n − . n − . n − . n − . n − . n − . n − . n − . d (cid:29) nd < Cn no convergence (a) Bound on the relative mean squared error MSE ( (cid:98) β ) / MSE (0) from Theorem 3.1; . . . a . . . . b n − . n − . n − . n − . n − . n − . n − . n − . n − . d (cid:29) nd < Cn no convergence (b) Bound on the relative prediction error PE ( (cid:98) β ) / PE (0) from Theorem 3.2 (i); Figure 2: Rates for the relative errors of the NCT estimator in polynomial decay scenario.21 (Polynomial decay) If (cid:98) λ j (cid:16) j − a and | u (cid:62) j β | (cid:16) j − b for a ≥ , a + 2 b ≥ , then with highprobability MSE ( (cid:98) β ) (cid:46) (cid:18) σ n (cid:19) a +2 b − a ( ϕ +1)+2 b . • (Sparsity) If there exists a set J of size |J | = s such that u (cid:62) j β = 0 for j / ∈ J and (cid:98) λ max J ≥ c (cid:98) λ for c > , then with high probability MSE ( (cid:98) β ) (cid:46) sσ n . • (Approximate sparsity) If there exists a set J of size |J | = s such that (cid:98) λ / j | u (cid:62) j β | ≤ c (cid:107) θ (cid:107) /d for j / ∈ J and (cid:98) λ max J ≥ c (cid:98) λ for c > , then with high probability MSE ( (cid:98) β ) (cid:46) sσ n + c (cid:107) θ (cid:107) d . • (Factor Model regime) If λ (cid:16) . . . (cid:16) λ m (cid:16) d , λ m +1 (cid:16) . . . (cid:16) λ d (cid:16) for some m , thenwith high probability MSE ( (cid:98) β ) (cid:46) mσ n + (cid:107) β (cid:107) d . Therefore, the rate from Theorem 3.4 can adapt well to these specific structures despite deteri-orating rate of Corollary 3.5, which is only an upper bound. τ To start with, we focus on the case when a good value of τ is somehow known, and analyzethe computational complexity of the GCT estimators. In particular, this includes the NCT es-timator. The computation of SVD of XXX (specifically, (cid:98) Λ and (cid:98) U ) takes O ( dn min( d, n )) time.Once we have SVD, computing the matrix A def = (cid:98) U (cid:98) Λ − − ϕ and the vector b def = (cid:98) Λ − ϕ (cid:98) U (cid:62) XXX (cid:62)
YYY n needed before the generalized thresholding takes O ( dn ) time. Obtaining (cid:98) β for already com-puted A and b takes O ( d min( d, n )) time. Therefore, the total computational time of ourprocedure is O ( dn min( d, n )). The computation is as fast as the SVD of the design matrix.Note that computational complexity of the LASSO is O ( n min( d, n ) ), when we computeits solution path via a modification of Least Angle Regression, see Efron et al. (2004). τ Our approach requires to tune the hyperparameter τ . Whatever τ is, we anyway have to com-pute A and b . This already takes O ( dn min( d, n )) time. Applying the generalized threshold-ing and combining the result into the vector (cid:98) β takes O ( d min( d, n )) time. This means thatwe can try n different values of τ “for free” — the computational complexity will be still of22he same order as computing (cid:98) β for a single value of τ . But we can go even further, if we focuson the GCT estimators with the soft or hard thresholding.Notice that varying τ continuously from 0 to + ∞ , we still can get only (min( d, n ) + 1)different solutions (cid:98) β for given XXX , YYY because we threshold a vector of size min( d, n ). Therefore,we can compute the whole solution path for τ from 0 to + ∞ . However, we are not thatinterested in the solution path, since we do not expect to get coefficients entering the picture oneby one as in LASSO for sparse regression. Instead, this can be useful for k -fold cross validation,where we will have at most k (min( d, n ) + 1) “interesting” values of τ giving different solutions (cid:98) β . In total, this implies that we need O ( dn min( d, n )) + k (min( d, n ) + 1) · O ( d min( d, n )) = O ( dn min( d, n ) + kd min( d, n ) )operations to find the best τ (providing smallest cross-validation error). In practice, we typ-ically use k = 5 or k = 10, i.e. of constant order, which leads to the total computationalcomplexity of O ( dn min( d, n )) for our optimally tuned estimator – same as for a single value of τ . Leave-one-out cross validation takes slightly more computations, namely, O ( dn min( d, n ) ).Similar “free tuning” property holds for LASSO (we again refer to Efron et al. (2004)).However, for example the ridge regression does not posses this nice properties: different valuesof regularization parameter will lead to different estimators, and one has to “guess” a discreteset of values to be tried. We compare the following methods: • “ NCT ”: Natural Canonical Thresholding estimator (2.1) with efficient hyperparametertuning by 10-fold CV. • “ GCT ”: Generalized Canonical Thresholding estimator (2.2) with ϕ = 1, the soft thresh-olding, and with efficient hyperparameter tuning by 10-fold CV. • “ OLS ”: Ordinary Least Squares. When d > n , the min norm solution is considered. • “ PCR ”: Principal Component Regression. The number of PCs is chosen by 10-fold CV. • “ Ridge ”: Ridge regression with 10-fold CV (default implementation from R-package glm-net ). • “ LASSO ”: LASSO with 10-fold CV (default implementation from R-package glmnet ).We fix n = 200, SNR = 10 or
We fix $n = 200$ and SNR $= 10$ or SNR $= 1$, and focus on how the relative errors
$\mathrm{MSE}(\widehat\beta)/\mathrm{MSE}(0)$ and $\mathrm{PE}(\widehat\beta)/\mathrm{PE}(0)$ of these methods behave as the dimension $d$ grows. The covariates $x_1, \dots, x_n \sim N(0, \Sigma)$, where $\Sigma$ depends on the eigenvalue scenario, and the noise vector $\varepsilon \sim N(0, \sigma^2 I_n)$, where $\sigma^2$ is chosen to ensure SNR $= 10$ or
SNR $= 1$ for the given $\Sigma$ and $\beta$. Without loss of generality we take $\Sigma$ to be a diagonal matrix, or equivalently $U = I_d$. The results are presented in Figures 3–6. Figure 3 and Figure 4 correspond to SNR $= 10$; Figure 5 and Figure 6 correspond to
SNR $= 1$. Figures 3 and 5 cover the scenarios of polynomial decay of the eigenvalues $\lambda_j = j^{-a}$ and of the coefficients $u_j^\top\beta = j^{-b}$, with (a) $a = 2$, $b = 2$; (b) $a = 1$, $b = 0.\ldots$; (c) $a = 0.\ldots$, $b = 1$; while Figures 4 and 6 also consider polynomial decay of the eigenvalues $\lambda_j = j^{-a}$ but $U^\top\beta$ is different: (a) $a = 1$; $u_j^\top\beta = 1$ for $j = 1, \dots,$
$10$ and $0$ otherwise; (b) $a = 0.\ldots$; $u_j^\top\beta = 1$ for $10$ randomly chosen $j \in \{d-\ldots, \dots, d\}$, and the remaining components are i.i.d. $N(0, d^{-\ldots})$; (c) $a = 0.\ldots$; $U^\top\beta \sim N(0, I_d)$. In each scenario, for each method and each dimension we run the corresponding experiment 100 times and plot the median errors. We notice that the NCT estimator (among some others) suffers in some settings around $d = n$. This is the so-called "interpolation threshold": when the dimension exceeds the number of data points, a model has enough features to interpolate the training points. The behavior around this point and the associated "double descent" phenomenon have been an active area of research over the last couple of years; we do not focus on this in our work. Otherwise, the plots show that in the presented settings the proposed procedure performs quite well compared to the other methods. In particular, the consistent performance of GCT suggests the benefit of varying the thresholding to better adapt to scenarios with different priors. However, it is worth mentioning that the other methods also perform unexpectedly well in a variety of settings, even though previous theoretical results for them do not predict such performance. This may motivate a more thorough study of classical linear regression methods in the high-dimensional setting under different structural assumptions.
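For concreteness, a scenario of the kind described above can be generated and scored as in the following sketch (again our own illustrative code, not the simulation code behind the figures; the relative errors are computed directly from their definitions, and any estimator, for example the GCT sketch above, can be plugged in):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, a, b, snr = 200, 500, 2.0, 2.0, 10.0

lam = np.arange(1, d + 1, dtype=float) ** -a     # eigenvalues lambda_j = j^{-a}
beta = np.arange(1, d + 1, dtype=float) ** -b    # coefficients u_j^T beta = j^{-b} (U = I_d)
signal_var = float(lam @ beta ** 2)              # beta^T Sigma beta
sigma = np.sqrt(signal_var / snr)                # noise level matching the target SNR

X = rng.standard_normal((n, d)) * np.sqrt(lam)   # rows x_i ~ N(0, diag(lam))
Y = X @ beta + sigma * rng.standard_normal(n)
Sigma_hat = X.T @ X / n

def relative_errors(beta_hat):
    """Relative MSE (fixed design) and relative PE (random design) of beta_hat."""
    diff = beta_hat - beta
    rel_mse = (diff @ Sigma_hat @ diff) / (beta @ Sigma_hat @ beta)
    rel_pe = (lam @ diff ** 2) / signal_var
    return rel_mse, rel_pe

print(relative_errors(np.zeros(d)))              # the zero estimator gives (1.0, 1.0)
```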
[Figure 3 appears here: panels (a)–(c), each showing $\mathrm{MSE}(\widehat\beta)/\mathrm{MSE}(0)$ (left) and $\mathrm{PE}(\widehat\beta)/\mathrm{PE}(0)$ (right) against the dimension $d$ ranging from 50 to 700, with curves for NCT, GCT, OLS, PCR, Ridge and LASSO.]
Figure 3: The relative errors $\mathrm{MSE}(\widehat\beta)/\mathrm{MSE}(0)$ (left) and $\mathrm{PE}(\widehat\beta)/\mathrm{PE}(0)$ (right) for different estimators with $n = 200$, SNR $= 10$. Polynomial decay of eigenvalues and coefficients in the eigenbasis: $\lambda_j = j^{-a}$, $u_j^\top\beta = j^{-b}$; panels (a) $a = 2$, $b = 2$; (b) $a = 1$, $b = 0.\ldots$; (c) $a = 0.\ldots$, $b = 1$.
[Figure 4 appears here: panels (a)–(c) in the same format as Figure 3.]
Figure 4: The relative errors $\mathrm{MSE}(\widehat\beta)/\mathrm{MSE}(0)$ (left) and $\mathrm{PE}(\widehat\beta)/\mathrm{PE}(0)$ (right) for different estimators with $n = 200$, SNR $= 10$. Polynomial decay of eigenvalues $\lambda_j = j^{-a}$ and different regimes of coefficients in the eigenbasis $u_j^\top\beta$; panels (a) $a = 1$, $u_j^\top\beta = 1$ for $j = 1, \dots, 10$ and $0$ otherwise; (b) $a = 0.\ldots$, $u_j^\top\beta = 1$ for 10 randomly chosen $j \in \{d-\ldots, \dots, d\}$, and the remaining components are i.i.d. $N(0, d^{-\ldots})$; (c) $a = 0.\ldots$, $U^\top\beta \sim N(0, I_d)$.
[Figure 5 appears here: panels (a)–(c) in the same format as Figure 3.]
Figure 5: The relative errors $\mathrm{MSE}(\widehat\beta)/\mathrm{MSE}(0)$ (left) and $\mathrm{PE}(\widehat\beta)/\mathrm{PE}(0)$ (right) for different estimators with $n = 200$, SNR $= 1$. Polynomial decay of eigenvalues and coefficients in the eigenbasis: $\lambda_j = j^{-a}$, $u_j^\top\beta = j^{-b}$; panels (a) $a = 2$, $b = 2$; (b) $a = 1$, $b = 0.\ldots$; (c) $a = 0.\ldots$, $b = 1$.
[Figure 6 appears here: panels (a)–(c) in the same format as Figure 4.]
Figure 6: The relative errors $\mathrm{MSE}(\widehat\beta)/\mathrm{MSE}(0)$ (left) and $\mathrm{PE}(\widehat\beta)/\mathrm{PE}(0)$ (right) for different estimators with $n = 200$, SNR $= 1$. Polynomial decay of eigenvalues $\lambda_j = j^{-a}$ and different regimes of coefficients in the eigenbasis $u_j^\top\beta$; panels (a) $a = 1$, $u_j^\top\beta = 1$ for $j = 1, \dots, 10$ and $0$ otherwise; (b) $a = 0.\ldots$, $u_j^\top\beta = 1$ for 10 randomly chosen $j \in \{d-\ldots, \dots, d\}$, and the remaining components are i.i.d. $N(0, d^{-\ldots})$; (c) $a = 0.\ldots$, $U^\top\beta \sim N(0, I_d)$.

Main proofs
We start with the following lemma, which allows us to bound the properly scaled noise vector in the $\ell_\infty$-norm. The lemma deals with the fixed and random design settings simultaneously. Lemma 6.1.
Suppose Assumption 3.1 is fulfilled. Let $\rho$ be as in (3.1). Define the event
$$
\Omega_0 = \bigl\{ \|\xi\|_\infty \le \sigma\rho \bigr\} \quad\text{with}\quad \xi \stackrel{\mathrm{def}}{=} \frac{\boldsymbol{Z}^\top\varepsilon}{n} = \widehat\Lambda^{-1/2}\widehat U^\top\frac{\boldsymbol{X}^\top\varepsilon}{n} .
$$
Then $\mathbb{P}[\Omega_0] \ge 1 - \delta$.
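As a quick illustration of the lemma in the Gaussian special case $\alpha = 2$ (our own numerical check, not part of the argument), the following experiment draws the noise repeatedly for one fixed design and compares the observed quantiles of $\|\xi\|_\infty$ with the $\sigma\sqrt{\log r}/\sqrt{n}$ scaling that a Gaussian union bound suggests; the constant is only indicative.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, sigma, reps, delta = 200, 500, 1.0, 200, 0.05

lam = np.arange(1, d + 1, dtype=float) ** -2.0
X = rng.standard_normal((n, d)) * np.sqrt(lam)          # one fixed design
_, s, vh = np.linalg.svd(X / np.sqrt(n), full_matrices=False)
M = np.diag(1.0 / s) @ vh @ X.T / n                      # xi = M @ eps = Lambda_hat^{-1/2} U_hat^T X^T eps / n

# each row of M has Euclidean norm 1/sqrt(n), so xi_j | X ~ N(0, sigma^2 / n)
sup_norms = [np.max(np.abs(M @ (sigma * rng.standard_normal(n)))) for _ in range(reps)]

r = len(s)
benchmark = sigma * np.sqrt(2.0 * np.log(2.0 * r / delta) / n)   # Gaussian union-bound scale
print(np.quantile(sup_norms, 1.0 - delta), benchmark)            # observed quantile vs benchmark
```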
Proof of Theorem 3.1

Using $\widehat\Sigma = \widehat U\widehat\Lambda\widehat U^\top$ we write for $\widehat\beta$ from (2.1)
$$
\mathrm{MSE}(\widehat\beta) = (\widehat\beta - \beta)^\top\widehat\Sigma(\widehat\beta - \beta) = \bigl\| \widehat\Lambda^{1/2}\widehat U^\top\widehat\beta - \widehat\Lambda^{1/2}\widehat U^\top\beta \bigr\|^2 .
$$
Now we plug in our estimator $\widehat\beta$ and $\boldsymbol{Y} = \boldsymbol{X}\beta + \varepsilon$ to get
$$
\mathrm{MSE}(\widehat\beta)
= \Bigl\| \mathrm{SOFT}_\tau\Bigl[ \widehat\Lambda^{-1/2}\widehat U^\top\frac{\boldsymbol{X}^\top\boldsymbol{Y}}{n} \Bigr] - \widehat\Lambda^{1/2}\widehat U^\top\beta \Bigr\|^2
= \Bigl\| \mathrm{SOFT}_\tau\Bigl[ \widehat\Lambda^{1/2}\widehat U^\top\beta + \widehat\Lambda^{-1/2}\widehat U^\top\frac{\boldsymbol{X}^\top\varepsilon}{n} \Bigr] - \widehat\Lambda^{1/2}\widehat U^\top\beta \Bigr\|^2
= \bigl\| \mathrm{SOFT}_\tau[\theta + \xi] - \theta \bigr\|^2 ,
$$
where we recall the canonical coefficients $\theta = \widehat\Lambda^{1/2}\widehat U^\top\beta$ from Definition 2.1 and $\xi$ from Lemma 6.1. From now on, the proof basically repeats the classical derivation for the soft and hard thresholding. Let us analyze the $j$-th component on the event $\Omega_0$ from Lemma 6.1, which has probability at least $1 - \delta$.
• If $|\theta_j + \xi_j| > \tau$, then $|\theta_j| \ge \tau - |\xi_j| \ge \tau/2$ and
$$
\bigl| \mathrm{SOFT}_\tau[\theta_j + \xi_j] - \theta_j \bigr| = |\theta_j + \xi_j \pm \tau - \theta_j| = |\xi_j \pm \tau| \le |\xi_j| + \tau \le \tfrac{3}{2}\tau \le 3\min\bigl(\tau, |\theta_j|\bigr),
$$
where $\pm$ means that we take either $+$ or $-$ depending on the sign of $(\theta_j + \xi_j)$, but this does not play any role.
• If $|\theta_j + \xi_j| \le \tau$, then $|\theta_j| \le \tau + |\xi_j| \le \tfrac{3}{2}\tau$ and
$$
\bigl| \mathrm{SOFT}_\tau[\theta_j + \xi_j] - \theta_j \bigr| = |0 - \theta_j| = |\theta_j| \le 3\min\bigl(\tau, |\theta_j|\bigr).
$$
Note that for any $0 \le q \le 2$,
$$
\min\bigl(\tau, |\theta_j|\bigr) \le \tau^{1 - q/2}\,|\theta_j|^{q/2}
$$
(we use the convention $0^0 = 0$). Thus, on $\Omega_0$,
$$
\mathrm{MSE}(\widehat\beta) = \bigl\| \mathrm{SOFT}_\tau[\theta + \xi] - \theta \bigr\|^2 \le 9\sum_{j=1}^{r}\min\bigl(\tau, |\theta_j|\bigr)^2 \le 9\,\tau^{2-q}\sum_{j=1}^{r}|\theta_j|^{q} = 9\,\tau^{2-q}\,\|\theta\|_q^q ,
$$
using the convention $\|\cdot\|_0^0 = \|\cdot\|_0$. Taking the infimum over $q \in [0, 2]$, dividing by $\beta^\top\widehat\Sigma\beta = \mathrm{MSE}(0)$, and recalling the definitions of SNR and $D^{\mathrm{eff}}_{q,d}(\widehat\Sigma, \beta)$, we conclude the proof.
Proof of Theorem 3.2

To begin with, we state the following well-known result on the concentration of the sample covariance around the true covariance in terms of the effective rank; see Koltchinskii and Lounici (2017), Theorem 9, and also Vershynin (2018), Theorem 9.2.4 and Exercise 9.2.5.
Lemma 6.2.
Suppose Assumption 3.2 is fulfilled. Then, with probability at least $1 - \delta$,
$$
\|\widehat\Sigma - \Sigma\| \le C\,\|\Sigma\|\left( \sqrt{\frac{r_{\mathrm{eff}}[\Sigma] + \log(1/\delta)}{n}} + \frac{r_{\mathrm{eff}}[\Sigma] + \log(1/\delta)}{n} \right).
$$
Using Assumption 3.4 we can keep only the first term in the bound above. Let $\Omega_1$ be the event on which this bound holds. Our main tool for proving the main result is the beautiful work by Jirak and Wahl (2018), which develops tight relative perturbation bounds for the eigenvalues and eigenvectors of a covariance matrix. Let us describe the framework of that paper. By Assumption 3.3 we consider the case of simple eigenvalues of $\Sigma$. The following quantities play an important role: the relative rank
$$
r_j(\Sigma) \stackrel{\mathrm{def}}{=} \sum_{\substack{l=1 \\ l \neq j}}^{d} \frac{\lambda_l}{|\lambda_j - \lambda_l|} + \frac{\lambda_j}{\min(\lambda_{j-1} - \lambda_j,\, \lambda_j - \lambda_{j+1})} \quad\text{for } j \in [d]
$$
(here $\lambda_0 = +\infty$ and $\lambda_{d+1} = 0$ for convenience) and the entries of $\Sigma^{-1/2}(\widehat\Sigma - \Sigma)\Sigma^{-1/2}$,
$$
\eta_{ll'} \stackrel{\mathrm{def}}{=} \frac{u_l^\top(\widehat\Sigma - \Sigma)u_{l'}}{\sqrt{\lambda_l\lambda_{l'}}} \quad\text{for } l, l' \in [d].
$$
Relative perturbation bounds for the $j$-th eigenvalue and eigenvector hold under the condition that there exists $x$ such that $|\eta_{ll'}| \le x$ for all $l, l' \in [d]$ and $r_j(\Sigma) \le 1/(3x)$. The following lemma helps to control the first condition.
Lemma 6.3.
Suppose Assumption 3.2 and Assumption 3.4 hold. Then, with probability at least $1 - \delta$,
$$
\max_{l, l' \in [d]} |\eta_{ll'}| \le \epsilon, \quad\text{where}\quad \epsilon = \epsilon_{n,d,\delta} \stackrel{\mathrm{def}}{=} C\sqrt{\frac{\log(d/\delta)}{n}}
$$
for some constant $C$. Define $\Omega_2$ to be the event on which the inequality from the previous lemma holds. So, the relative perturbation bounds hold true on $\Omega_2$ for those indices $j$ for which $r_j(\Sigma) \le 1/(3\epsilon)$. We would like to have this property for as many indices as possible. Under Assumption 3.3 we have (see Jirak and Wahl (2018), inequalities (3.30); Jirak (2016), Lemma 7.13; Cardot, Mas and Sarda (2007), Lemma 6.1)
$$
r_j(\Sigma) \lesssim \sum_{\substack{l=1 \\ l \neq j}}^{d} \frac{\lambda_l}{|\lambda_j - \lambda_l|} \le C\, j\log(j).
$$
Note that with $k^* \stackrel{\mathrm{def}}{=} (\epsilon^2\log^2(1/\epsilon))^{-1/3}$ we indeed have $r_j(\Sigma) \le 1/(3\epsilon)$ for all $j \in [k^*]$ due to Assumption 3.4. Hence, the following relative perturbation bounds from Jirak and Wahl (2018) hold true. Lemma 6.4.
For all $j \in [k^*]$, on $\Omega_2$ it holds that
$$
|\widehat\lambda_j - \lambda_j| \le C\epsilon\,\lambda_j
\qquad\text{and}\qquad
\|\widehat u_j - u_j\| \le C\epsilon\sqrt{\sum_{\substack{l=1 \\ l\neq j}}^{d} \frac{\lambda_j\lambda_l}{(\lambda_j - \lambda_l)^2}} .
$$
Furthermore, for all $j \in [k^*]$ and $l \in [d]$, $l \neq j$, on $\Omega_2$ it holds that
$$
|\widehat u_j^\top u_l| \le C\epsilon\,\frac{\sqrt{\lambda_j\lambda_l}}{|\lambda_j - \lambda_l|} .
$$
Note that the bounds from the previous lemma apply even for larger indices $j$, up to order $(\epsilon\log(1/\epsilon))^{-1}$ (that is, of order $n^{1/2}$), while we restrict $k^*$ to be of order $n^{1/3}$. Later in the proof it will become clear how this specific $k^*$ arises. Now we are ready to proceed to the main part of the proof.

Proof of Theorem 3.2.
We prove the theorem for k = k ∗ , and it will be clear that the sameproof works with any k < k ∗ . The proof for part (i) and part (ii) coincides up to the last step.Denote τ (cid:48) to be the general thresholding level, which is τ for part (i) and τ for part(ii). Usingthe definition of (cid:98) β given in (2.1), the eigendecompositions Σ = UΛ U (cid:62) , (cid:98) Σ = (cid:98) U (cid:98) Λ (cid:98) U (cid:62) and themodel YYY = XXX β + ε , write the prediction error as PE ( (cid:98) β ) = ( (cid:98) β − β ) (cid:62) Σ ( (cid:98) β − β )= (cid:107) ΛU (cid:62) (cid:98) U (cid:98) Λ − SOFT τ (cid:48) (cid:20) (cid:98) Λ − (cid:98) U (cid:62) XXX (cid:62)
YYY n (cid:21) − ΛU (cid:62) β (cid:107) = (cid:107) ΛU (cid:62) (cid:98) U (cid:98) Λ − SOFT τ (cid:48) [ (cid:98) Λ (cid:98) U (cid:62) β + ξ ] − ΛU (cid:62) β (cid:107) with ξ from Lemma 6.1. Let us add and subtract ΛU (cid:62) (cid:98) U (cid:98) U (cid:62) β inside the norm and apply (cid:107) a + b (cid:107) ≤ (cid:107) a (cid:107) + 2 (cid:107) b (cid:107) : PE ( (cid:98) β ) ≤ (cid:107) ΛU (cid:62) β − ΛU (cid:62) (cid:98) U (cid:98) U (cid:62) β (cid:107) ++ 2 (cid:107) ΛU (cid:62) (cid:98) U (cid:98) Λ − SOFT τ (cid:48) (cid:104) (cid:98) Λ (cid:98) U (cid:62) β + ξ (cid:105) − ΛU (cid:62) (cid:98) U (cid:98) U (cid:62) β (cid:107) =: I + I .
31e first deal with I : I (cid:107) ΛU (cid:62) β − ΛU (cid:62) (cid:98) U (cid:98) U (cid:62) β (cid:107) = β (cid:62) ( I d − (cid:98) U (cid:98) U (cid:62) ) Σ ( I d − (cid:98) U (cid:98) U (cid:62) ) β = β (cid:62) ( I d − (cid:98) U (cid:98) U (cid:62) ) ( Σ − (cid:98) Σ ) ( I d − (cid:98) U (cid:98) U (cid:62) ) β ≤ (cid:107) (cid:98) Σ − Σ (cid:107) (cid:107) ( I d − (cid:98) U (cid:98) U (cid:62) ) β (cid:107) ≤ (cid:107) (cid:98) Σ − Σ (cid:107) (cid:107) β (cid:107) ≤ C (cid:107) Σ (cid:107)(cid:107) β (cid:107) (cid:114) r eff [ Σ ] + log(1 /δ ) n , where the last inequality holds on Ω due to Lemma 6.2 and Assumption 3.4.Next, we focus on I . We will decompose it into two parts: one will correspond to the first k ∗ eigenvectors and eigenvalues, while the other will correspond to the rest ( r − k ∗ ). Let ussplit (cid:98) U = [ (cid:98) U ≤ k ∗ (cid:98) U >k ∗ ] and (cid:98) Λ = (cid:34) (cid:98) Λ ≤ k ∗ O k ∗ × ( r − k ∗ ) O ( r − k ∗ ) × k ∗ (cid:98) Λ >k ∗ (cid:35) , where (cid:98) Λ ≤ k ∗ ∈ R k ∗ × k ∗ , (cid:98) U ≤ k ∗ ∈ R d × k ∗ correspond to the first k ∗ eigenvalues and eigenvectors,while (cid:98) Λ >k ∗ ∈ R ( r − k ∗ ) × ( r − k ∗ ) , (cid:98) U >k ∗ ∈ R d × ( r − k ∗ ) correspond to the rest. Also let ξ = (cid:2) ξ (cid:62)≤ k ∗ ξ (cid:62) >k ∗ (cid:3) (cid:62) with ξ ≤ k ∗ ∈ R k ∗ and ξ >k ∗ ∈ R r − k ∗ . Then (cid:98) U (cid:98) Λ − SOFT τ (cid:48) (cid:104) (cid:98) Λ (cid:98) U (cid:62) β + ξ (cid:105) − (cid:98) U (cid:98) U (cid:62) β == (cid:98) U ≤ k ∗ (cid:98) Λ − ≤ k ∗ SOFT τ (cid:48) (cid:104) (cid:98) Λ ≤ k ∗ (cid:98) U (cid:62)≤ k ∗ β + ξ ≤ k ∗ (cid:105) − (cid:98) U ≤ k ∗ (cid:98) U (cid:62)≤ k ∗ β + (cid:98) U >k ∗ (cid:98) Λ − >k ∗ SOFT τ (cid:48) (cid:104) (cid:98) Λ >k ∗ (cid:98) U (cid:62) >k ∗ β + ξ >k ∗ (cid:105) − (cid:98) U >k ∗ (cid:98) U (cid:62) >k ∗ β . Again applying (cid:107) a + b (cid:107) ≤ (cid:107) a (cid:107) + 2 (cid:107) b (cid:107) we obtain I (cid:107) ΛU (cid:62) (cid:98) U (cid:98) Λ − SOFT τ (cid:48) (cid:104) (cid:98) Λ (cid:98) U (cid:62) β + ξ (cid:105) − ΛU (cid:62) (cid:98) U (cid:98) U (cid:62) β (cid:107) ≤≤ (cid:107) ΛU (cid:62) (cid:98) U ≤ k ∗ (cid:98) Λ − ≤ k ∗ SOFT τ (cid:48) (cid:104) (cid:98) Λ ≤ k ∗ (cid:98) U (cid:62)≤ k ∗ β + ξ ≤ k ∗ (cid:105) − ΛU (cid:62) (cid:98) U ≤ k ∗ (cid:98) U (cid:62)≤ k ∗ β (cid:107) + 2 (cid:107) ΛU (cid:62) (cid:98) U >k ∗ (cid:98) Λ − >k ∗ SOFT τ (cid:48) (cid:104) (cid:98) Λ >k ∗ (cid:98) U (cid:62) >k ∗ β + ξ >k ∗ (cid:105) − ΛU (cid:62) (cid:98) U >k ∗ (cid:98) U (cid:62) >k ∗ β (cid:107) =: I + I . So, to upper bound I we will upper bound I and I separately.Consider I . Denote γ def = (cid:98) Λ − >k ∗ SOFT τ (cid:48) (cid:104) (cid:98) Λ >k ∗ (cid:98) U (cid:62) >k ∗ β + ξ >k ∗ (cid:105) − (cid:98) U (cid:62) >k ∗ β ∈ R r − k ∗ . Let us analyze j -th component γ j , for j ∈ [ r − k ∗ ], using the definition of SOFT τ (cid:48) [ · ]. We havetwo cases: • If | (cid:98) λ / j + k ∗ (cid:98) u (cid:62) j + k ∗ β + ξ j + k ∗ | > τ (cid:48) , then γ j = (cid:98) λ − / j + k ∗ ( ξ j + k ∗ ± τ (cid:48) ) (the actual sign will play norole). Since by Lemma 6.1 on Ω (cid:98) λ / j + k ∗ | (cid:98) u (cid:62) j + k ∗ β | ≥ τ (cid:48) − | ξ j + k ∗ | ≥ τ (cid:48) / , we have | γ j | ≤ (cid:98) λ − / j + k ∗ ( | ξ j + k ∗ | + τ (cid:48) ) ≤ (cid:98) λ − / j + k ∗ · τ (cid:48) ≤ | (cid:98) u (cid:62) j + k ∗ β | . 
If | (cid:98) λ / j + k ∗ (cid:98) u (cid:62) j + k ∗ β + ξ j + k ∗ | ≤ τ (cid:48) , then γ j = − (cid:98) u (cid:62) j + k ∗ β , and we directly get | γ j | = | (cid:98) u (cid:62) j + k ∗ β | . In any case, | γ j | ≤ | (cid:98) u (cid:62) j + k ∗ β | for all j ∈ [ r − k ∗ ], and therefore (cid:107) γ (cid:107) ≤ (cid:107) β (cid:107) on Ω . Hence, I (cid:107) ΛU (cid:62) (cid:98) U >k ∗ γ (cid:107) = γ (cid:62) (cid:98) U (cid:62) >k ∗ Σ (cid:98) U >k ∗ γ = γ (cid:62) (cid:98) U (cid:62) >k ∗ ( Σ − (cid:98) Σ ) (cid:98) U >k ∗ γ + γ (cid:62) (cid:98) U (cid:62) >k ∗ (cid:98) Σ (cid:98) U >k ∗ γ ≤ (cid:107) (cid:98) Σ − Σ (cid:107)(cid:107) (cid:98) U >k ∗ γ (cid:107) + γ (cid:62) (cid:98) Λ >k ∗ γ ≤ (cid:107) (cid:98) Σ − Σ (cid:107)(cid:107) γ (cid:107) + (cid:98) λ k ∗ +1 (cid:107) γ (cid:107) . We bound the first term on the right-hand side on Ω by Lemma 6.2, and for the second termon Ω holds (cid:98) λ k ∗ +1 ≤ (cid:98) λ k ∗ ≤ (1 + C(cid:15) ) λ k ∗ ≤ C (cid:48) λ k ∗ due to Lemma 6.4 and Assumption 3.4. Takinginto account (cid:107) γ (cid:107) ≤ C (cid:107) β (cid:107) on Ω , we get on Ω ∩ Ω ∩ Ω I ≤ C (cid:107) Σ (cid:107)(cid:107) β (cid:107) (cid:32)(cid:114) r eff [ Σ ] + log(1 /δ ) n + λ k ∗ λ (cid:33) . Finally, it is left to bound I . Denote (cid:98) ω def = SOFT τ (cid:48) (cid:104) (cid:98) Λ ≤ k ∗ (cid:98) U (cid:62)≤ k ∗ β + ξ ≤ k ∗ (cid:105) − (cid:98) Λ ≤ k ∗ (cid:98) U (cid:62)≤ k ∗ β ∈ R k ∗ . Then, I (cid:107) ΛU (cid:62) (cid:98) U ≤ k ∗ (cid:98) Λ − ≤ k ∗ (cid:98) ω (cid:107) ≤ (cid:107) ΛU (cid:62) (cid:98) U ≤ k ∗ (cid:98) Λ − ≤ k ∗ (cid:107) (cid:107) (cid:98) ω (cid:107) . An upper bound on (cid:107) ΛU (cid:62) (cid:98) U ≤ k ∗ (cid:98) Λ − ≤ k ∗ (cid:107) is provided in the next lemma. Lemma 6.5.
Suppose Assumption 3.2 holds. Then on $\Omega_2$ it holds that $\bigl\|\Lambda^{1/2}U^\top\widehat U_{\le k^*}\widehat\Lambda_{\le k^*}^{-1/2}\bigr\| \le C$. Remark 6.1.
The previous lemma is the only place where we use $k^* = (\epsilon^2\log^2(1/\epsilon))^{-1/3}$. The rest of the proof would go through if $k^*$ were defined as $(\epsilon\log(1/\epsilon))^{-1}$. Remark 6.2.
Interestingly, a matrix closely related to $\Lambda^{1/2}U^\top\widehat U_{\le k^*}\widehat\Lambda_{\le k^*}^{-1/2}$ also appears in Bartlett et al. (2020). The main difficulty of their proof is to find regimes of eigenvalues such that, for $C$ defined as
$$
C \stackrel{\mathrm{def}}{=} (\boldsymbol{X}\boldsymbol{X}^\top)^{-1}\boldsymbol{X}\,\Sigma\,\boldsymbol{X}^\top(\boldsymbol{X}\boldsymbol{X}^\top)^{-1},
$$
one has $\mathrm{Tr}[C] = o(1)$ as $n\to\infty$. Using the SVD $n^{-1/2}\boldsymbol{X} = \widehat V\widehat\Lambda^{1/2}\widehat U^\top$ one can show
$$
\mathrm{Tr}[C] = \frac1n\,\mathrm{Tr}\bigl[\widehat\Lambda^{-1/2}\widehat U^\top U\Lambda U^\top\widehat U\widehat\Lambda^{-1/2}\bigr],
$$
while in Lemma 6.5 we essentially upper bound the operator norm of the somewhat simpler (in the sense that we truncate the sample eigenvalues and eigenvectors beyond the $k^*$-th) matrix $\widehat\Lambda_{\le k^*}^{-1/2}\widehat U_{\le k^*}^\top U\Lambda U^\top\widehat U_{\le k^*}\widehat\Lambda_{\le k^*}^{-1/2}$. The latter task turns out to be much easier and does not require specific regimes of eigenvalues, unlike the former one.
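The trace identity above is purely linear-algebraic and is easy to verify numerically. The following small check (our own illustration, with the exponents read as $\widehat\Lambda^{-1/2}$ on each side, an arbitrary Gaussian design and a diagonal $\Sigma$) computes both sides for $n \le d$:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 50, 120
lam = 1.0 / np.arange(1, d + 1, dtype=float) ** 2
Sigma = np.diag(lam)                                  # true covariance, U = I_d
X = rng.standard_normal((n, d)) * np.sqrt(lam)        # rows x_i ~ N(0, Sigma)

# left-hand side: Tr[(X X^T)^{-1} X Sigma X^T (X X^T)^{-1}]
G_inv = np.linalg.inv(X @ X.T)
lhs = np.trace(G_inv @ X @ Sigma @ X.T @ G_inv)

# right-hand side via the SVD X / sqrt(n) = V_hat Lambda_hat^{1/2} U_hat^T
_, s, vh = np.linalg.svd(X / np.sqrt(n), full_matrices=False)
U_hat = vh.T                                          # sample eigenvectors, d x n
inner = np.diag(1.0 / s) @ U_hat.T @ Sigma @ U_hat @ np.diag(1.0 / s)
rhs = np.trace(inner) / n

print(lhs, rhs)                                       # the two values agree up to rounding error
```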
33o deal with (cid:107) (cid:98) ω (cid:107) , we first state the following lemma which has two parts, one of whichwill help to conclude the proof of claim (i), and the other one will be useful for claim (ii). Lemma 6.6. On Ω holds(i) (cid:107) (cid:98) Λ ≤ k ∗ (cid:98) U (cid:62)≤ k ∗ β − Λ ≤ k ∗ U (cid:62)≤ k ∗ β (cid:107) ≤ C (cid:107) Σ (cid:107)(cid:107) β (cid:107) (cid:15) (cid:32) r eff [ Σ ] + (cid:15) k ∗ (cid:88) j =1 λ j j λ (cid:33) . (ii) (cid:107) (cid:98) Λ ≤ k ∗ (cid:98) U (cid:62)≤ k ∗ β − Λ ≤ k ∗ U (cid:62)≤ k ∗ β (cid:107) ∞ ≤ τ ∞ def = C (cid:107) Σ (cid:107) / (cid:107) β (cid:107) (cid:15) / max j ∈ [ k ∗ ] (cid:18) λ j (1 + (cid:15)j ) λ (cid:19) / . So, to deal with part (ii) of Theorem 3.2, we notice that our thresholding level τ (cid:48) = τ = τ + τ ∞ ,where τ from Lemma 6.1 is responsible for the noise and τ ∞ from Lemma 6.6 is responsible forthe estimation of the eigenvalues and eigenvectors. Now we analyze the components (cid:98) ω j of (cid:98) ω for j ∈ [ k ∗ ]. • If | (cid:98) λ / j (cid:98) u (cid:62) j β + ξ j | > τ , then (cid:98) ω j = ξ j ± τ (the sign again doesn’t matter). Moreover, due toLemma 6.1 and Lemma 6.6 (ii) on Ω ∩ Ω | λ / j u (cid:62) j β | ≥ τ − | ξ j | − | (cid:98) λ / j (cid:98) u (cid:62) j β − λ / j u (cid:62) j β | ≥ τ − τ − τ ∞ τ . Hence, | (cid:98) ω j | ≤ τ , | λ / j u (cid:62) j β | ) . • If | (cid:98) λ / j (cid:98) u (cid:62) j β + ξ j | ≤ τ , then (cid:98) ω j = − (cid:98) λ / j (cid:98) u (cid:62) j β . Furthermore, again by Lemma 6.1 andLemma 6.6 (ii) on Ω ∩ Ω | λ / j u (cid:62) j β | ≤ τ + | ξ j | + | (cid:98) λ / j (cid:98) u (cid:62) j β − λ / j u (cid:62) j β | ≤ τ + τ τ ∞ τ . Thus, | (cid:98) ω j | ≤ τ , | λ / j u (cid:62) j β | ) . In both cases, | (cid:98) ω j | ≤ τ , | λ / j u (cid:62) j β | ), and based on the same derivation as in the proof ofTheorem 3.1, we obtain on Ω ∩ Ω (cid:107) (cid:98) ω (cid:107) ≤ q ∈ [0 , (cid:8) τ − q (cid:107) Λ ≤ k ∗ U (cid:62)≤ k ∗ β (cid:107) qq (cid:9) . For part (i) we act slightly differently. Now τ (cid:48) = τ . We decompose (cid:98) ω = ω + ∆ ω , where ω def = SOFT τ (cid:2) Λ ≤ k ∗ U (cid:62)≤ k ∗ β + ξ ≤ k ∗ (cid:3) − Λ ≤ k ∗ U (cid:62)≤ k ∗ β ∈ R k ∗ .
34y Lemma 6.6 (i) it is easy to bound on Ω (cid:107) ∆ ω (cid:107) = (cid:107) (cid:98) ω − ω (cid:107) ≤ C (cid:107) Σ (cid:107)(cid:107) β (cid:107) (cid:15) (cid:32) r eff [ Σ ] + (cid:15) k ∗ (cid:88) j =1 λ j j λ (cid:33) . The norm (cid:107) ω (cid:107) with thresholding at level τ can be bounded as in the proof of Theorem 3.1:on Ω (cid:107) ω (cid:107) ≤ q ∈ [0 , (cid:8) τ − q (cid:107) Λ ≤ k ∗ U (cid:62)≤ k ∗ β (cid:107) qq (cid:9) , which, together with the bound on (cid:107) ∆ ω (cid:107) , gives bound on (cid:107) (cid:98) ω (cid:107) on Ω ∩ Ω .Putting all the bounds for I and I (in particular, for I and I ) together on the intersectionof high probability events Ω ∩ Ω ∩ Ω , adjusting δ → δ/ − δ , we conclude the proof. Proof of Theorem 3.4
Similarly to the proof of Theorem 3.1, we write for (cid:98) β from (2.2) MSE ( (cid:98) β ) = (cid:13)(cid:13)(cid:13) (cid:98) Λ − ϕ T τ (cid:104) (cid:98) Λ ϕ ( θ + ξ ) (cid:105) − θ (cid:13)(cid:13)(cid:13) = r (cid:88) j =1 (cid:12)(cid:12)(cid:12)(cid:98) λ − ϕ/ j T τ [ (cid:98) λ ϕ/ j ( θ j + ξ j )] − θ j (cid:12)(cid:12)(cid:12) . For each individual term we apply the following two bounds. On one hand, (cid:12)(cid:12)(cid:12)(cid:98) λ − ϕ/ j T τ [ (cid:98) λ ϕ/ j ( θ j + ξ j )] − θ j (cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:98) λ − ϕ/ j (cid:16) T τ [ (cid:98) λ ϕ/ j ( θ j + ξ j )] − (cid:98) λ ϕ/ j ( θ j + ξ j ) (cid:17) + ξ j (cid:12)(cid:12)(cid:12) ≤ (cid:98) λ − ϕ/ j τ + | ξ j | ≤ (cid:32) (cid:98) λ ϕ/ (cid:98) λ ϕ/ j + 12 (cid:33) σρ ≤ (cid:98) λ ϕ/ (cid:98) λ ϕ/ j σρ, where the first inequality uses property (ii) from Definition 2.2 and the triangle inequality, andthe second inequality holds on Ω by Lemma 6.1. On the other hand, (cid:12)(cid:12)(cid:12)(cid:98) λ − ϕ/ j T τ [ (cid:98) λ ϕ/ j ( θ j + ξ j )] − θ j (cid:12)(cid:12)(cid:12) ≤ (cid:98) λ − ϕ/ j (cid:12)(cid:12)(cid:12) T τ [ (cid:98) λ ϕ/ j ( θ j + ξ j )] (cid:12)(cid:12)(cid:12) + | θ j |≤ (cid:98) λ − ϕ/ j · c | (cid:98) λ ϕ/ j θ j | + | θ j | = ( c + 1) | θ j | , where the second inequality is due to property (i) from Definition 2.2 applied to z = (cid:98) λ ϕ/ j ( θ j + ξ j )and z (cid:48) = (cid:98) λ ϕ/ j θ j satisfying | z − z (cid:48) | = (cid:98) λ ϕ/ j | ξ j | ≤ τ / by Lemma 6.1. Thus, on Ω MSE ( (cid:98) β ) = (cid:13)(cid:13)(cid:13) (cid:98) Λ − ϕ T τ (cid:104) (cid:98) Λ ϕ ( θ + ξ ) (cid:105) − θ (cid:13)(cid:13)(cid:13) (cid:46) r (cid:88) j =1 min (cid:32) (cid:98) λ ϕ/ (cid:98) λ ϕ/ j σρ, | θ j | (cid:33) . Proof of Corollary 3.5
We need to upper bound the right-hand side of the inequality obtained in Theorem 3.4. Let usdefine the following auxiliary set: P def = (cid:110) j ∈ [ r ] (cid:12)(cid:12)(cid:12) (cid:98) λ j ≥ (cid:98) λ ρ / (2+ ϕ ) (cid:111) .
35e first bound (cid:88) j / ∈P min (cid:32) (cid:98) λ ϕ/ (cid:98) λ ϕ/ j σρ, | θ j | (cid:33) ≤ (cid:88) j / ∈P | θ j | = (cid:88) j / ∈P (cid:98) λ j ( (cid:98) u (cid:62) j β ) ≤ (cid:98) λ (cid:107) β (cid:107) ρ / (2+ ϕ ) , where we used only the definition of P (more specifically its complement). Then, we bound (cid:88) j ∈P min (cid:32) (cid:98) λ ϕ/ (cid:98) λ ϕ/ j σρ, | θ j | (cid:33) ≤ (cid:88) j ∈P (cid:98) λ ϕ/ (cid:98) λ ϕ/ j σρ | θ j | ≤ σρ − ϕ · ϕ +1 (cid:88) j ∈P | θ j | ≤ σρ / (2+ ϕ ) (cid:107) θ (cid:107) , where we again used the definition of P . Now we apply (cid:107) θ (cid:107) ≤ (cid:107) (cid:98) Σ (cid:107) / (cid:107) β (cid:107) r eff [ (cid:98) Σ ] / , andadding the above two inequalities yields the desired statement. Proof of Lemma 6.1.
It is straightforward to verify that ξ is also a sub-Weibull random vectorconditionally on XXX : sup (cid:107) w (cid:107) =1 (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) w (cid:62) (cid:98) Λ − (cid:98) U (cid:62) XXX (cid:62) ε n (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ψ α ≤ √ n sup (cid:107) w (cid:107) =1 (cid:107) w (cid:62) ε (cid:107) ψ α ≤ σ √ n , where the first inequality holds given XXX since (cid:107)
XXX (cid:98) U (cid:98) Λ − w / √ n (cid:107) = 1, and the last inequality isdue to Assumption 3.1. Then, taking e , . . . , e r (the standard basis in R r ) we have P (cid:2) | e (cid:62) j ξ | ≥ t (cid:12)(cid:12) XXX (cid:3) ≤ (cid:0) − (cid:0) t √ n/σ (cid:1) α (cid:1) for j ∈ [ d ] . Applying the union bound and plugging in t = σρ/
2, we get the desired.
Proof of Lemma 6.3.
The proof is pretty standard and can be found in numerous papers. Fix l, l (cid:48) ∈ [ d ]. We have η ll (cid:48) = u (cid:62) l ( (cid:98) Σ − Σ ) u l (cid:48) √ λ l λ l (cid:48) = u (cid:62) l ( Σ − / (cid:98) ΣΣ − / − I d ) u l (cid:48) = 1 n n (cid:88) i =1 (cid:8) ( u (cid:62) l Σ − / x i ) · ( u (cid:62) l (cid:48) Σ − / x i ) − E (cid:2) ( u (cid:62) l Σ − / x i ) · ( u (cid:62) l (cid:48) Σ − / x i ) (cid:3)(cid:9) . Since by Assumption 3.2 Σ − / x i is sub-Gaussian for i ∈ [ n ], then by Lemma 2.7.7 of Vershynin(2018) ( u (cid:62) l Σ − / x i ) · ( u (cid:62) l (cid:48) Σ − / x i ) is sub-Exponential and by Exercise 2.7.10 of Vershynin (2018)its centered version is also sub-Exponential for i ∈ [ n ]. Bernstein’s inequality (e.g. Corollary2.8.3 of Vershynin (2018)) applied to this centered random variables implies P [ | η ll (cid:48) | ≥ t ] ≤ (cid:18) − cn min (cid:18) t C , tC (cid:19)(cid:19) . By union bound, P (cid:20) max l,l (cid:48) ∈ [ d ] | η ll (cid:48) | ≥ t (cid:21) ≤ d exp (cid:18) − cn min (cid:18) t C , tC (cid:19)(cid:19) . t = C (cid:112) log(2 d /δ ) /n for some other properly chosen C and using Assumption 3.4 tomake sure t /c ≤ t/c , we conclude the proof. Proof of Lemma 6.4.
The first part follows from Corollary 2 of Jirak and Wahl (2018), and thesecond part follows from Lemma 4 of Jirak and Wahl (2018). Conditions (2.1) and | η ll (cid:48) | ≤ x for all l, l (cid:48) ∈ [ d ] are satisfied since we work on Ω from Lemma 6.3 and consider j ∈ [ k ∗ ] withproperly defined k ∗ . Proof of Lemma 6.5.
We first apply inequalities (cid:107) ΛU (cid:62) (cid:98) U ≤ k ∗ (cid:98) Λ − ≤ k ∗ (cid:107) = (cid:107) ΛU (cid:62) (cid:98) U ≤ k ∗ Λ − ≤ k ∗ Λ ≤ k ∗ (cid:98) Λ − ≤ k ∗ (cid:107) ≤ (cid:107) ΛU (cid:62) (cid:98) U ≤ k ∗ Λ − ≤ k ∗ (cid:107)(cid:107) Λ ≤ k ∗ (cid:98) Λ − ≤ k ∗ (cid:107)≤ (cid:32)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ΛU (cid:62) (cid:98) U ≤ k ∗ Λ − ≤ k ∗ − (cid:34) I k ∗ O ( d − k ∗ ) × k ∗ (cid:35)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) + 1 (cid:33) (cid:107) Λ ≤ k ∗ (cid:98) Λ − ≤ k ∗ (cid:107) . The spectral norm of Λ ≤ k ∗ (cid:98) Λ − ≤ k ∗ is easy to control: (cid:107) Λ ≤ k ∗ (cid:98) Λ − ≤ k ∗ (cid:107) = (cid:32) max j ∈ [ k ∗ ] λ j (cid:98) λ j (cid:33) / ≤ − C(cid:15) ) / ≤ , where the first inequality is due to Lemma 6.4 on Ω and in the last inequality we assumedthat (cid:15) is small enough by Assumption 3.4.Next, let us focus on ΛU (cid:62) (cid:98) U ≤ k ∗ Λ − ≤ k ∗ − [ I k ∗ O k ∗ × ( d − k ∗ ) ] (cid:62) , which we denote by H for shortness.Denote its columns as h , . . . , h k ∗ . We can bound (cid:96) -norm of each column, using Lemma 6.4,as (cid:107) h j (cid:107) = d (cid:88) l =1 l (cid:54) = j (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) λ / l u (cid:62) l (cid:98) u j λ / j (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + | u (cid:62) j (cid:98) u j − | = d (cid:88) l =1 l (cid:54) = j (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) λ / l u (cid:62) l (cid:98) u j λ / j (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + 12 (cid:107) (cid:98) u j − u (cid:62) j (cid:107) ≤ (cid:15) d (cid:88) l =1 l (cid:54) = j λ / l λ / j λ / j λ / l | λ j − λ l | + C(cid:15) d (cid:88) l =1 l (cid:54) = j λ j λ l ( λ j − λ l ) on Ω . Applying Jirak and Wahl (2018), inequalities (3.30), or Jirak (2016), Lemma 7.13,together with Assumption 3.3, we get (cid:107) h j (cid:107) ≤ (cid:15) j log( j ) + C(cid:15) j . Since j ≤ k ∗ and (cid:15)k ∗ ≤ C , we have (cid:107) h j (cid:107) ≤ C(cid:15) j log( k ∗ ) for all j ∈ [ k ∗ ] . Finally, we have for the Frobenius norm (cid:107) H (cid:107) F on Ω (cid:107) H (cid:107) F = k ∗ (cid:88) j =1 (cid:107) h j (cid:107) ≤ C k ∗ (cid:88) j =1 (cid:15) j log( k ∗ ) = C(cid:15) k ∗ log( k ∗ ) ≤ C (cid:48) , where we used the definition of k ∗ . The inequality between the spectral and the Frobeniusnorms completes the proof. 37 roof of Lemma 6.6. Fix arbitrary j ∈ [ k ∗ ]. We have the following chain of inequalities: | (cid:98) λ / j (cid:98) u (cid:62) j β − λ / j u (cid:62) j β | ≤ (cid:107) (cid:98) λ / j (cid:98) u j − λ / j u j (cid:107) (cid:107) β (cid:107) ≤ (cid:107) (cid:98) λ / j (cid:98) u j − λ / j (cid:98) u j + λ / j (cid:98) u j − λ / j u j (cid:107) (cid:107) β (cid:107) ≤ (cid:16)(cid:12)(cid:12)(cid:12)(cid:98) λ / j − λ / j (cid:12)(cid:12)(cid:12) (cid:107) (cid:98) u j (cid:107) + λ / j (cid:107) (cid:98) u j − u j (cid:107) (cid:17) (cid:107) β (cid:107) ≤ (cid:16) | (cid:98) λ j − λ j | / + λ / j (cid:107) (cid:98) u j − u j (cid:107) (cid:17) (cid:107) β (cid:107) . Applying Lemma 6.4 we have on Ω | (cid:98) λ / j (cid:98) u (cid:62) j β − λ / j u (cid:62) j β | ≤ C (cid:107) β (cid:107) λ / j (cid:15) / + λ / j (cid:15) (cid:118)(cid:117)(cid:117)(cid:117)(cid:116) d (cid:88) l =1 l (cid:54) = j λ j λ l ( λ j − λ l ) ≤ C (cid:107) β (cid:107) (cid:16) λ / j (cid:15) / + λ / j (cid:15) j (cid:17) , where in the second inequality we used Jirak and Wahl (2018), inequalities (3.30), or Jirak(2016), Lemma 7.13, together with Assumption 3.3. 
Taking the maximum over $j \in [k^*]$ we obtain claim (ii), and squaring and summing over $j \in [k^*]$ we obtain claim (i).

References
Bair, E. , Hastie, T. , Paul, D. and
Tibshirani, R. (2006). Prediction by supervisedprincipal components.
J. Amer. Statist. Assoc. , , 473, 119–137. Bartlett, P. , Long, P. , Lugosi, G. and
Tsigler, A. (2020). Benign overfitting in linearregression.
Proc. Natl. Acad. Sci. USA . Belkin, M. (2018). Approximation beats concentration? An approximation view on inferencewith smooth radial kernels.
Proc. Mach. Learn. Res. , , 1–18. Belkin, M. , Hsu, D. and
Xu, J. (2019). Two models of double descent for weak features.
ArXiv:1903.07571 . Bellec, P. , Lecu´e, G. and
Tsybakov, A. (2018). SLOPE meets Lasso: improved oraclebounds and optimality.
Ann. Statist. , , 6B, 3603–3642. Bickel, P. , Ritov, Y. and
Tsybakov, A. (2009). Simultaneous analysis of Lasso andDantzig selector.
Ann. Statist. , , 4, 1705–1732. Bietti, A. and
Mairal, J. (2019). On the inductive bias of Neural Tangent Kernels.
Advances in Neural Information Processing Systems, 12893–12904.
Bogdan, M. , van den Berg, E. , Sabatti, C. , Su, W. and
Candes, E. (2015). SLOPE –adaptive variable selection via convex optimization.
Ann. Appl. Stat. , , 3, 1103–1140. Cardot, H. , Mas, A. and
Sarda, P. (2007). CLT in functional linear regression models.
Probab. Theory Related Fields, , 325–361. Candes, E. and Tao, T. (2007). The Dantzig selector: statistical estimation when p is much larger than n. Ann. Statist., , 6, 2313–2351. Chinot, G. and
Lerasle, M. (2020). Benign overfitting in the large deviation regime.
ArXiv:2003.05838 . Dalalyan, A. , Hebiri, M. and
Lederer, J. (2017). On the prediction performance of theLasso.
Bernoulli , , 1, 552–581. Donoho, D. L. and
Johnstone, I. M. (1994). Ideal spatial adaptation by wavelet shrinkage.
Biometrika , , 425–455. Donoho, D. L. (1995). De-noising by soft-thresholding.
IEEE Trans. Inform. Theory , , 3,613–627. Donoho, D. L. and
Johnstone, I. M. (1995). Adapting to unknown smoothness via waveletshrinkage.
J. Amer. Statist. Assoc. , , 1200–1224. Donoho, D. L. , Johnstone, I. M. , Kerkyacharian, G. and
Picard, D. (1995). Wavelet shrinkage: asymptopia?
J. R. Stat. Soc. Ser. B. Stat. Methodol. , , 301–369. Donoho, D. L. and
Johnstone, I. M. (1998). Minimax estimation via wavelet shrinkage.
Ann. Statist. , ,879–921. Efron, B. , Hastie, T. , Johnstone, I. M. and
Tibshirani, R. (2004). Least angle regression.
Ann. Statist. , , 2, 407–499. Fan, J. (1996). Test of significance based on wavelet thresholding and Neyman’s truncation
J.Amer. Statist. Assoc. , , 434, 674–688. Fan, J. and
Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracleproperties.
J. Amer. Statist. Assoc. , , 456, 1348–1360. Fan, J. , Li, R. , Zhang, C.-H. , Zou, H. (2020).
Statistical foundations of data science. CRC Press.
Fan, J. , Ke, Y. and
Wang, K. (2020). Factor-adjusted regularized model selection.
J. Econometrics, , 71–85. Fan, J., Wang, W. and
Yao, J. (2017). Sufficient forecasting using factor models.
J. Econometrics, , 292–306. Greenshtein, E. and
Ritov, Y. (2004). Persistence in high-dimensional linear predictorselection and the virtue of overparametrization.
Bernoulli , , 6, 971–988. Hastie, T. , Montanari, A. , Rosset, S. and
Tibshirani, R. (2019). Surprises in high-dimensional ridgeless least squares interpolation.
ArXiv:1903.08560. Jirak, M. (2016). Optimal eigen expansions and uniform bounds.
Probab. Theory Related Fields, , 753–799. Jirak, M. and
Wahl, M. (2018). Relative perturbation bounds with applications to empiricalcovariance operators.
ArXiv:1802.02869 . Jolliffe, I. (1982). A note on the use of principal components in regression.
J. R. Stat. Soc. Ser. C. Appl. Stat., , 3, 300–303. Koltchinskii, V. and
Lounici, K. (2017). Concentration inequalities and moment boundsfor sample covariance operators.
Bernoulli , , 1, 110–133. Kuchibhotla, A. K. and
Chakrabortty, A. (2018). Moving beyond sub-Gaussianity in high-dimensional statistics: applications in covariance estimation and linear regression.
ArXiv:1804.02605 . Liang, T. and
Rakhlin, A. (2020). Just interpolate: kernel ”ridgeless” regression can gen-eralize.
Ann. Statist. , to appear.
Ma, S. and
Belkin, M. (2017). Diving into the shallows: a computational perspective onlarge-scale shallow learning.
Advances in Neural Information Processing Systems , 3781–3790.
Paul, D. , Bair, E. , Hastie, T. and
Tibshirani, R. (2008). “Preconditioning” for featureselection and regression in high-dimensional problems.
Ann. Statist. , , 4, 1595–1618. Pearson, K. (1901). On lines and planes of closest fit to systems of points in space.
The London, Edinburgh and Dublin Philosophical Magazine and Journal of Science, , 559–572. Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso.
J. R. Stat. Soc. Ser. B. Stat. Methodol., , 1, 267–288. van de Geer, S. and Bühlmann, P. (2009). On the conditions used to prove oracle results for the Lasso.
Electron. J. Stat., , 1360–1392. Vershynin, R. (2018). High-dimensional probability. An introduction with applications in data science.
Cambridge Series in Statistical and Probabilistic Mathematics . Zumbach, G. (2009). The empirical properties of large covariance matrices.
RiskMetrics Journal, 9.