Canonical thresholding for non-sparse high-dimensional linear regression
Igor Silin and Jianqing Fan
Department of Operations Research and Financial Engineering, Princeton University
Abstract
We consider a high-dimensional linear regression problem. Unlike many papers on the topic, we do not require sparsity of the regression coefficients; instead, our main structural assumption is a decay of the eigenvalues of the covariance matrix of the data. We propose a new family of estimators, called the canonical thresholding estimators, which pick the largest regression coefficients in the canonical form. The estimators admit an explicit form and can be linked to LASSO and Principal Component Regression (PCR). A theoretical analysis for both fixed design and random design settings is provided. The obtained bounds on the mean squared error and the prediction error of a specific estimator from the family allow us to state clearly sufficient conditions on the decay of eigenvalues to ensure convergence. In addition, we promote the use of relative errors, strongly linked with the out-of-sample R². The study of these relative errors leads to a new concept of joint effective dimension, which incorporates the covariance of the data and the regression coefficients simultaneously, and describes the complexity of a linear regression problem. Numerical simulations confirm good performance of the proposed estimators compared to previously developed methods.
Keywords: High-dimensional linear regression; covariance eigenvalues decay; thresholding; relative errors; principal component regression.

∗ Research supported by ONR grant N00014-19-1-2120, NSF grant DMS-1662139, and NIH grant 2R01-GM072611-14. E-mail: [email protected], [email protected]

1 Introduction and Setup
Consider the standard linear regression model
\[ y = x^\top \beta + \varepsilon, \]
where $x \in \mathbb{R}^d$ is a vector of covariates, $\beta \in \mathbb{R}^d$ is a vector of coefficients, $\varepsilon \in \mathbb{R}$ is a noise term, and $y \in \mathbb{R}$ is a response. Suppose we observe $n$ pairs $\{(x_i, y_i)\}_{i=1}^n$ from this model with the assumption that the underlying noise terms $\{\varepsilon_i\}_{i=1}^n$ are i.i.d. random variables with mean 0. In matrix notation, introducing
\[ Y = (y_1, \dots, y_n)^\top \in \mathbb{R}^n, \qquad X = (x_1, \dots, x_n)^\top \in \mathbb{R}^{n \times d}, \qquad \varepsilon = (\varepsilon_1, \dots, \varepsilon_n)^\top \in \mathbb{R}^n, \]
we rewrite our model as
\[ Y = X\beta + \varepsilon. \tag{1.1} \]
Define the covariance matrix of the data $\hat\Sigma \stackrel{\mathrm{def}}{=} n^{-1} \sum_{i=1}^n x_i x_i^\top = n^{-1} X^\top X \in \mathbb{R}^{d \times d}$. Our goal is to estimate the unknown $\beta$ and analyze the quality of estimation in two different settings:

• Fixed design. That means the vectors of covariates $\{x_i\}_{i=1}^n$ are deterministic (without loss of generality we assume $\sum_{i=1}^n x_i = 0$). A standard way to measure the error of an estimator $\tilde\beta$ in this case is the mean squared error (MSE):
\[ \mathrm{MSE}(\tilde\beta) \stackrel{\mathrm{def}}{=} \frac{1}{n} \sum_{i=1}^n (x_i^\top \tilde\beta - x_i^\top \beta)^2 = \frac{1}{n} \|X\tilde\beta - X\beta\|^2 = (\tilde\beta - \beta)^\top \hat\Sigma (\tilde\beta - \beta). \]
This differs from the prediction error for the fixed design by an amount of $\mathbb{E}[\varepsilon^2]$ (independent of the model) and reflects the model error in the prediction for this case.

• Random design. In this scenario the vectors of covariates $\{x_i\}_{i=1}^n$ come independently from some unknown distribution with mean zero (for simplicity) and covariance matrix $\Sigma \stackrel{\mathrm{def}}{=} \mathbb{E}[xx^\top] \in \mathbb{R}^{d \times d}$. We are interested in the performance of an estimator $\tilde\beta$ measured by the expected prediction error (PE):
\[ \mathrm{PE}(\tilde\beta) \stackrel{\mathrm{def}}{=} \mathbb{E}\big[(x^\top \tilde\beta - x^\top \beta)^2\big] = (\tilde\beta - \beta)^\top \Sigma (\tilde\beta - \beta). \]
This quantity also differs from the prediction error for random design by $\mathbb{E}[\varepsilon^2]$ and equals the excess risk $\mathrm{PE}(\tilde\beta) = \mathbb{E}[(y - x^\top \tilde\beta)^2] - \mathbb{E}[(y - x^\top \beta)^2]$.

In the sequel we refer to these quantities simply as the (absolute) MSE and PE. The reason we give two names is to differentiate their statistical behaviors in high dimension and to avoid confusion in various discussions. We will also motivate and analyze the relative errors $\mathrm{MSE}(\tilde\beta)/\mathrm{MSE}(0)$ and $\mathrm{PE}(\tilde\beta)/\mathrm{PE}(0)$.
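To make the two error measures concrete, the following minimal NumPy sketch simulates model (1.1) and evaluates MSE and PE exactly as defined above; the diagonal covariance with polynomial eigenvalue decay and the trivial estimator $\tilde\beta = 0$ are placeholder choices for illustration, not part of the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 200, 500, 1.0

# Illustrative covariance with decaying eigenvalues (an assumption of this sketch).
eigvals = np.arange(1, d + 1) ** -1.0
Sigma = np.diag(eigvals)
beta = rng.normal(size=d) / np.sqrt(d)

# Random-design data from model (1.1): rows of X have covariance Sigma.
X = rng.normal(size=(n, d)) * np.sqrt(eigvals)
eps = sigma * rng.normal(size=n)
Y = X @ beta + eps

beta_tilde = np.zeros(d)                      # the trivial estimator, as a placeholder

# Fixed-design error: (beta_tilde - beta)^T Sigma_hat (beta_tilde - beta).
Sigma_hat = X.T @ X / n
mse = (beta_tilde - beta) @ Sigma_hat @ (beta_tilde - beta)

# Random-design prediction error: (beta_tilde - beta)^T Sigma (beta_tilde - beta).
pe = (beta_tilde - beta) @ Sigma @ (beta_tilde - beta)
print(mse, pe)
```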
Surprisingly, the relative errors, appearing naturally and being well-motivated, have not gained much attention in the literature. As we will see, the importance of the relative errors arises as a high-dimensional effect.

Being a fundamental statistical problem, high-dimensional linear regression has been approached in various ways. Probably the simplest method is Principal Component Regression (PCR). The idea is to reduce the dimension first via Principal Component Analysis (PCA) (Pearson, 1901), and then use several leading principal components as covariates to construct the least squares estimator. This approach relies heavily on a very strong assumption that the response depends on just a few leading principal components of the data. Various examples were provided where PCR performs poorly, see Jolliffe (1982). Another related idea, the use of supervised principal components, was proposed by Bair et al. (2006).

Over the past two decades, the main approach to tackling the high dimensionality of the problem has been the sparsity assumption on $\beta$, which is reasonable for many real-world applications. This has given rise to such model selection procedures as LASSO (Tibshirani, 1996), SCAD (Fan and Li, 2001), Least Angle Regression (Efron et al., 2004), the Dantzig selector (Candes and Tao, 2007), and SLOPE (Bogdan et al., 2015). The list of papers devoted to these methods is too long to be presented here, so we just mention some of them: Greenshtein and Ritov (2004); Paul et al. (2008); Bickel, Ritov and Tsybakov (2009); Dalalyan, Hebiri and Lederer (2017); Bellec, Lecué and Tsybakov (2018). Typically, a theoretical analysis of such procedures requires assumptions on the design like the restricted isometry property (RIP), the restricted eigenvalue (RE) condition, or incoherence. These assumptions are needed to make sure that the correlations among subsets of features are small. See van de Geer and Bühlmann (2009) for an overview of conditions used in the theoretical analysis of sparse linear regression. We also refer to Chapters 3–5 of Fan et al. (2020) for an overview of existing methods and theoretical results for high-dimensional linear regression under sparsity.

The methods from the last two paragraphs were developed (partially) due to a belief that the unconstrained least squares estimator is hopeless in high dimensions. Recent papers by Bartlett et al. (2020) and later Chinot and Lerasle (2020) have shown that the minimum $\ell_2$-norm least squares estimator
\[ \tilde\beta^{\mathrm{LS}} \stackrel{\mathrm{def}}{=} (X^\top X)^{+} X^\top Y \]
(where $(X^\top X)^{+}$ is the generalized inverse of the matrix $X^\top X$) can generalize well (i.e. have small absolute PE) even while interpolating the training data; they call this phenomenon "benign overfitting". To deliver convergence of $\mathrm{PE}(\tilde\beta^{\mathrm{LS}})$ to zero they require quite specific conditions on $\Sigma$: the decay of its eigenvalues should be fast, but not too fast. These requirements are quantified by two notions of effective rank of $\Sigma$. A closely related paper by Hastie et al. (2019) also studies $\tilde\beta^{\mathrm{LS}}$, but in the regime $p/n \to \gamma$. They are interested in the dependence of $\mathrm{PE}(\tilde\beta^{\mathrm{LS}})$ on $\gamma$, and focus on the case $\lambda_{\min}(\Sigma) \ge c > 0$, e.g. considering isotropic and equicorrelation covariances. One more work on the topic is Belkin, Hsu and Xu (2019), where the authors try to mathematically explain the double descent phenomenon in several different models.

Going beyond linear regression, one basic idea for approaching a general (nonlinear) regression problem $y = f(x) + \varepsilon$ is to decompose the regression function $f(x) \approx \sum_{j=1}^D \beta_j \psi_j(x) = \psi(x)^\top \beta$ over a Fourier basis, a wavelet basis, or a basis of eigenfunctions in a Reproducing Kernel Hilbert Space (RKHS), denoted here by $\psi_1(\cdot), \dots, \psi_D(\cdot)$. This reduces the nonlinear regression problem to a linear one (potentially very high-dimensional), and allows the existing methods to be applied. Though we do not pursue the analysis of nonlinear regression in our work, this setting provides an excellent motivation for the main structural assumptions we make in our results. One of them is fast decay of the eigenvalues of $\Sigma$ (or $\hat\Sigma$). For instance, we require that the effective rank
\[ r_{\mathrm{eff}}[\Sigma] \stackrel{\mathrm{def}}{=} \frac{\mathrm{Tr}[\Sigma]}{\|\Sigma\|} \qquad \bigg(\text{or } r_{\mathrm{eff}}[\hat\Sigma] \stackrel{\mathrm{def}}{=} \frac{\mathrm{Tr}[\hat\Sigma]}{\|\hat\Sigma\|}\bigg) \]
can be well-controlled. The spectral decay has been observed in real-world datasets (e.g. MNIST, see Figure 5 in Liang and Rakhlin (2020); financial data in Zumbach (2009), Figure 5; economics data in Fan, Ke and Wang (2020), Figure 5), which makes our assumption reasonable. The importance of eigenvalue decay (not only of the covariance, but of general kernel matrices) is highlighted in Liang and Rakhlin (2020), where such conditions on the spectral decay are called "favorable data geometry". Moreover, Ma and Belkin (2017) and Belkin (2018) analyze the super-polynomial decay of eigenvalues of smooth kernel matrices. Going even further, in the deep learning literature, neural tangent kernels also exhibit spectral decay, as shown by Bietti and Mairal (2019), among others. However, fast eigenvalue decay is not the only motivation behind our work; another structural assumption that can make our results meaningful is a fast decay of the regression coefficients in the eigenbasis. This is a very well-understood condition as well: it is well known that Fourier coefficients decay at a polynomial rate, where the degree depends on the smoothness of the underlying regression function. In addition, the decay of coefficients in RKHS was studied by Belkin (2018).

With these structural assumptions, the idea behind our family of estimators $\hat\beta$ is quite natural: in some eigendirections (e.g. the ones that correspond to small eigenvalues of $\Sigma$) we do not gain much by estimating the associated coefficient, so it makes sense to estimate only those components that allow us to significantly reduce the error; specifically, we use thresholding to cut off the components associated with the insignificant directions. When applied to nonlinear regression with a wavelet basis, one estimator from the proposed family coincides with the soft thresholding approach studied in the series of papers by Donoho and Johnstone (1994); Donoho (1995); Donoho and Johnstone (1995); Donoho et al. (1995); Donoho and Johnstone (1998), among others. We highlight that we will not require sparsity of $\beta$ or any restrictive conditions on the design.

Let us summarize some motivations behind our work:

• Our methods can be viewed as an attempt to fix PCR by relaxing its restrictive assumptions. Instead of working with the first several principal components, our estimators automatically screen for the most important principal components, not necessarily the leading ones.

• Remarkably, the procedures that we propose are a modification of LASSO, so one can view this work as an attempt to extend LASSO to non-sparse high-dimensional linear regression.

• Though the papers by Bartlett et al. (2020) and Chinot and Lerasle (2020) do not advocate the use of interpolating estimators but rather justify why overfitting may not be harmful (a very relevant question in modern deep learning research), we aim to show that there is no necessity to give up in-sample denoising quality to get good bounds on the prediction error. In fact, our numerical results show that our method is better than the least squares estimator in various situations.

Main contributions of this paper are:
• We propose a new method for high-dimensional linear regression, called Natural Canonical Thresholding (NCT), in Section 2. The connection of this approach to LASSO and PCR is discussed in Section 2.1 and Section 2.2. In Section 2.3 we extend the suggested procedure and present a richer family of estimators, called Generalized Canonical Thresholding (GCT). Our estimators $\hat\beta$ are given via an explicit expression and do not require any optimization. Though each estimator from the family has one hyperparameter, it can be tuned in an efficient way, as shown in Section 4.

• We provide theoretical guarantees for the NCT estimator in the fixed design and random design settings in Section 3. The presented bounds have a two-fold meaning:

– For the absolute errors $\mathrm{MSE}(\hat\beta)$ and $\mathrm{PE}(\hat\beta)$, studied in Section 3.1 and Section 3.2, we state explicit sufficient conditions of the form "the eigenvalues of $\hat\Sigma$ or $\Sigma$ decay fast enough" to ensure convergence in high dimensions. No conditions on $\beta$ are imposed in this case.

– For the relative errors $\mathrm{MSE}(\hat\beta)/\mathrm{MSE}(0)$ and $\mathrm{PE}(\hat\beta)/\mathrm{PE}(0)$, motivated in Section 3.3.1, our bounds factorize into the newly defined notion of the joint effective dimension, the signal-to-noise ratio, and a vanishing factor. To get good rates for the relative errors in high dimensions it is not enough to assume fast decay of the eigenvalues of $\hat\Sigma$ or $\Sigma$ alone, and we need to impose conditions on $\Sigma$ and $\beta$ together (Section 3.3.2), which is reflected by the joint effective dimension that we analyze (Section 3.3.3).

Our analysis also applies to PCR, which can also be regarded as a regularization on the canonical regression coefficients by truncating high-index components to zero. Theoretical analysis of the GCT estimator is not as insightful; however, we still present and discuss a bound on the absolute error $\mathrm{MSE}(\hat\beta)$ (Section 3.4).

• Numerical experiments conducted in Section 5 confirm the good performance of NCT and especially GCT, in comparison with other existing methods.

The main proofs and the additional proofs are postponed to Section 6 and Section 7, respectively. We conclude this section by defining some notation used throughout the work.

For a positive integer $k$, we write $[k]$ as shorthand for the set $\{1, 2, \dots, k\}$. We use $O_{k \times l}$ for the $k \times l$ matrix of zeros and $I_k$ for the identity matrix of size $k \times k$. For a vector $a = [a_1, \dots, a_k]^\top \in \mathbb{R}^k$ and $q > 0$, the standard $\ell_q$-(pseudo)norm in $\mathbb{R}^k$ is $\|a\|_q \stackrel{\mathrm{def}}{=} \big(\sum_{j=1}^k |a_j|^q\big)^{1/q}$. We use the following convention for the $\ell_0$-pseudonorm: $\|a\|_0 = \|a\|_0^0 = \sum_{j=1}^k \mathbb{1}\{a_j \ne 0\}$. Also, the $\ell_\infty$-norm is $\|a\|_\infty = \max_{j\in[k]} |a_j|$. For a matrix $A$, let $\|A\|$ be the spectral norm (the largest singular value), $\mathrm{rank}[A]$ be the rank, and (if $A$ is square) $\mathrm{Tr}[A]$ be the trace.

For sequences $a_n$ and $b_n$ the relation $a_n \lesssim b_n$ means that there exists an absolute constant $C$ such that $a_n \le C b_n$ for all $n$, while $a_n \asymp b_n$ means that $a_n \lesssim b_n$ and $b_n \lesssim a_n$. By $c, C, C'$ we denote absolute constants which may differ from place to place.

Throughout the work, $\beta$ stands for the true vector of regression coefficients in the model (1.1), $\hat\beta$ stands for our NCT or GCT estimators proposed in the next section, and a generic estimator is denoted as $\tilde\beta$.

2 The canonical thresholding estimators

Let $r \stackrel{\mathrm{def}}{=} \mathrm{rank}[\hat\Sigma]$. Typically, $r = \min(n, d)$. Consider the SVD of the data matrix $X$ (scaled by $n^{-1/2}$):
\[ \frac{1}{\sqrt n} X = \hat V \hat\Lambda^{1/2} \hat U^\top, \]
where $\hat\Lambda^{1/2} = \mathrm{diag}\big(\hat\lambda_1^{1/2}, \dots, \hat\lambda_r^{1/2}\big) \in \mathbb{R}^{r\times r}$ is a diagonal matrix consisting of the non-zero singular values of $n^{-1/2}X$ in non-increasing order, the columns of $\hat V \in \mathbb{R}^{n\times r}$ are the left singular vectors of $X$, and the columns of $\hat U = [\hat u_1, \dots, \hat u_r] \in \mathbb{R}^{d\times r}$ are the right singular vectors of $X$. Alternatively, it is also convenient to think of the eigendecomposition of $\hat\Sigma$:
\[ \hat\Sigma = \hat U \hat\Lambda \hat U^\top, \]
where now the diagonal entries $\hat\lambda_1, \dots, \hat\lambda_r$ of $\hat\Lambda$ are interpreted as the non-zero eigenvalues of $\hat\Sigma$ in non-increasing order and the columns $\hat u_1, \dots, \hat u_r$ of $\hat U$ are the corresponding eigenvectors of $\hat\Sigma$. Similarly, in what follows we will actively use the eigendecomposition of $\Sigma$:
\[ \Sigma = U \Lambda U^\top, \]
where $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_d) \in \mathbb{R}^{d\times d}$ is a diagonal matrix consisting of the eigenvalues of $\Sigma$ in non-increasing order, and $U = [u_1, \dots, u_d] \in \mathbb{R}^{d\times d}$ consists of the corresponding eigenvectors.
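The SVD and the eigendecomposition above carry the same information. As a quick sanity check (illustrative code, not from the paper), the squared singular values of $X/\sqrt n$ are the non-zero eigenvalues $\hat\lambda_j$ of $\hat\Sigma$, and the right singular vectors are its eigenvectors:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 200
X = rng.normal(size=(n, d))

# SVD of X / sqrt(n): X/sqrt(n) = V_hat @ diag(s) @ U_hat.T
V_hat, s, Ut = np.linalg.svd(X / np.sqrt(n), full_matrices=False)

# Eigendecomposition of Sigma_hat = X^T X / n.
Sigma_hat = X.T @ X / n
eigvals = np.linalg.eigvalsh(Sigma_hat)[::-1][: len(s)]   # non-zero eigenvalues, decreasing

print(np.allclose(s ** 2, eigvals))                       # lambda_hat_j = s_j ** 2
print(np.allclose(Sigma_hat @ Ut[0], s[0] ** 2 * Ut[0]))  # u_hat_1 is an eigenvector
```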
We introduce the following definition, which will be used extensively throughout the work.

Definition 2.1. Rewrite the linear regression model $Y = X\beta + \varepsilon$ as
\[ Y = (X \hat U \hat\Lambda^{-1/2})(\hat\Lambda^{1/2}\hat U^\top \beta) + \varepsilon = Z\theta + \varepsilon. \]
We call this the canonical form of the linear regression model. Here $Z \stackrel{\mathrm{def}}{=} X\hat U\hat\Lambda^{-1/2} \in \mathbb{R}^{n\times r}$ is the standardized design matrix and $\theta \stackrel{\mathrm{def}}{=} \hat\Lambda^{1/2}\hat U^\top\beta \in \mathbb{R}^r$ is the new vector of coefficients, called the canonical regression coefficient vector, or simply the canonical coefficients.

Note that the standardized design coincides with the scaled left singular vectors, $Z = n^{1/2}\hat V$, and satisfies the orthonormality constraint $n^{-1}Z^\top Z = I_r$. Hence, the least-squares estimator for the canonical parameter is $\tilde\theta^{\mathrm{LS}} = n^{-1}Z^\top Y = n^{-1}\hat\Lambda^{-1/2}\hat U^\top X^\top Y$. As in Fan (1996), we further regularize the estimated canonical coefficient vector by either thresholding or truncation (setting higher indices to zero), depending on whether $\theta$ is approximately sparse or concentrated on the leading principal components. Transforming the canonical parameter back to the original domain leads to the canonical thresholding estimator or the principal component regression estimator, as further elaborated below. Our work pushes forward the interactions between the canonical parameters and the design matrix.

More specifically, in the canonical domain our estimator looks like
\[ \hat\theta \stackrel{\mathrm{def}}{=} \mathrm{SOFT}_\tau\big[\tilde\theta^{\mathrm{LS}}\big], \]
which in the original domain brings us to the Natural Canonical Thresholding (NCT) estimator of $\beta$, defined as
\[ \hat\beta \stackrel{\mathrm{def}}{=} \hat U\hat\Lambda^{-1/2}\,\mathrm{SOFT}_\tau\bigg[\frac{\hat\Lambda^{-1/2}\hat U^\top X^\top Y}{n}\bigg], \tag{2.1} \]
where $\mathrm{SOFT}_\tau[z] \stackrel{\mathrm{def}}{=} z\cdot\max(1 - \tau/|z|,\, 0)$ is the soft thresholding function applied component-wise and $\tau \ge 0$ is the thresholding level. To see why the estimator makes sense, plug $Y \approx Z\theta$ in, use the eigendecomposition of $\hat\Sigma$, and get $\hat\theta \approx \mathrm{SOFT}_\tau[\theta]$.
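A minimal NumPy transcription of the NCT estimator (2.1) may help fix ideas: compute the SVD of $X/\sqrt n$, soft-threshold the canonical least-squares coefficients, and map back. This is an illustrative sketch (the rank cutoff 1e-12 is an arbitrary numerical tolerance), not an optimized implementation.

```python
import numpy as np

def soft(z, tau):
    """Component-wise soft thresholding SOFT_tau[z] = z * max(1 - tau/|z|, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def nct(X, Y, tau):
    """Natural Canonical Thresholding estimator (2.1)."""
    n = X.shape[0]
    # SVD of X / sqrt(n): columns of U_hat are eigenvectors of Sigma_hat,
    # and s**2 are its non-zero eigenvalues lambda_hat_j.
    _, s, Vt = np.linalg.svd(X / np.sqrt(n), full_matrices=False)
    U_hat, lam_sqrt = Vt.T, s
    keep = lam_sqrt > 1e-12                           # drop numerically zero directions
    U_hat, lam_sqrt = U_hat[:, keep], lam_sqrt[keep]

    theta_ls = (U_hat.T @ (X.T @ Y / n)) / lam_sqrt   # canonical LS coefficients
    theta_hat = soft(theta_ls, tau)                   # thresholded canonical coefficients
    return U_hat @ (theta_hat / lam_sqrt)             # back to the original domain
```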
Due to the structure of our error (e.g. in the fixed design case)
\[ \mathrm{MSE}(\hat\beta) = \|\hat\Lambda^{1/2}\hat U^\top\hat\beta - \hat\Lambda^{1/2}\hat U^\top\beta\|^2 = \|\hat\theta - \theta\|^2, \]
and since we assume eigenvalue decay, it is likely that some components $\theta_j$ do not play a role, so their estimation by $\hat\theta_j$ is not that important. Hence, it is reasonable to cut such insignificant components off, and this is exactly what the thresholding does. It reduces the variance of the estimator, while not increasing the bias by too much.

We also mention that when $\tau = 0$, our estimator reduces to the minimum $\ell_2$-norm least squares solution $\hat\theta = \tilde\theta^{\mathrm{LS}}$ and
\[ \hat\beta = \tilde\beta^{\mathrm{LS}} = \hat\Sigma^{+}\,\frac{X^\top Y}{n} = (X^\top X)^{+}X^\top Y \]
(unbiased or slightly biased, but with large variance), while $\tau = +\infty$ corresponds to the trivial solution $\hat\theta = 0$ and $\hat\beta = 0$ (very biased, but with zero variance).

2.1 Relation to LASSO

Recall that the standard LASSO estimator is a solution of the following optimization problem:
\[ \tilde\beta^{\mathrm{LASSO}} \in \arg\min_{\beta'\in\mathbb{R}^d}\bigg\{\frac{1}{2n}\|Y - X\beta'\|^2 + \tau\|\beta'\|_1\bigg\}. \]
In practice one usually standardizes the columns of $X$ so that they are on the same scale and the coefficients corresponding to different covariates are penalized equally. Now imagine that we standardize our $X$ in the canonical manner, as in Definition 2.1. If we run LASSO for the vector of coefficients $\theta$, then the solution is expressed via soft thresholding:
\[ \tilde\theta^{\mathrm{LASSO}} = \arg\min_{\theta' \in \mathbb{R}^r} \bigg\{ \frac{1}{2n}\|Y - Z\theta'\|^2 + \tau \|\theta'\|_1 \bigg\} = \mathrm{SOFT}_\tau\bigg[\frac{Z^\top Y}{n}\bigg], \]
which is exactly our estimator $\hat\theta$ in the canonical domain. Going back to $\hat\beta = \hat U \hat\Lambda^{-1/2} \tilde\theta^{\mathrm{LASSO}}$, we recover the NCT estimator (2.1). The solution is the soft thresholding on the canonical regression coefficients. This is why we call the method canonical thresholding.

We also note that the NCT estimator can be represented as the solution of the optimization problem
\[ \hat\beta = \arg\min_{\beta' \in \mathbb{R}^d} \bigg\{ \frac{1}{2n}\|Y - X\beta'\|^2 + \tau \|\hat\Lambda^{1/2}\hat U^\top \beta'\|_1 \bigg\}. \]
Our estimator is nothing more than LASSO penalized on the canonical regression coefficients.
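The equivalence can be checked numerically. The sketch below uses scikit-learn's Lasso, whose objective matches the $\tfrac{1}{2n}$ least-squares scaling written above (that scaling is our reconstruction; with a different normalization the penalty level would differ by a constant factor), and compares the fitted coefficients with direct soft thresholding of $Z^\top Y/n$ on hypothetical data.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, d, tau = 100, 300, 0.05
X = rng.normal(size=(n, d)) * (np.arange(1, d + 1) ** -0.5)   # toy design, decaying scales
Y = X @ (rng.normal(size=d) / np.sqrt(d)) + 0.5 * rng.normal(size=n)

# Canonical form: Z = sqrt(n) * V_hat, so that Z^T Z / n = I_r.
V_hat, s, Vt = np.linalg.svd(X / np.sqrt(n), full_matrices=False)
Z = np.sqrt(n) * V_hat

# LASSO on the canonical coefficients: (1/(2n))||Y - Z theta||^2 + tau ||theta||_1.
theta_lasso = Lasso(alpha=tau, fit_intercept=False).fit(Z, Y).coef_

# Closed form: soft thresholding of Z^T Y / n.
z = Z.T @ Y / n
theta_soft = np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

print(np.max(np.abs(theta_lasso - theta_soft)))   # ~0 up to solver tolerance
```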
2.2 Relation to PCR

Principal Component Regression (PCR) approaches the high dimensionality of the problem by taking only $m$ ($m < r = \min(d, n)$) leading principal components of the original data. The new design matrix becomes
\[ Z_m = X \hat U_{\le m} \hat\Lambda_{\le m}^{-1/2} \in \mathbb{R}^{n \times m}, \]
where $\hat U_{\le m} \in \mathbb{R}^{d \times m}$ consists of the first $m$ columns of $\hat U$ and $\hat\Lambda_{\le m} \in \mathbb{R}^{m \times m}$ is the $m \times m$ leading principal submatrix of $\hat\Lambda$. The new regression problem $Y = Z_m \theta_m + \varepsilon$ is solved via least squares, yielding the solution $\tilde\theta^{\mathrm{LS}}_m = n^{-1} Z_m^\top Y \in \mathbb{R}^m$, and thus
\[ \tilde\beta^{\mathrm{PCR}} \stackrel{\mathrm{def}}{=} \hat U_{\le m} \hat\Lambda_{\le m}^{-1/2} \tilde\theta^{\mathrm{LS}}_m \in \mathbb{R}^d. \]
Note that $\tilde\theta^{\mathrm{LS}}_m$ is essentially the first $m$ components of $\tilde\theta^{\mathrm{LS}}$, and we can express
\[ \tilde\beta^{\mathrm{PCR}} = \hat U \hat\Lambda^{-1/2} \, \mathrm{ZERO}_m\bigg[ \frac{\hat\Lambda^{-1/2}\hat U^\top X^\top Y}{n} \bigg], \]
where $\mathrm{ZERO}_m[z] = [z_1, \dots, z_m, 0, \dots, 0]^\top$ is the operator zeroing out all the components of a vector $z \in \mathbb{R}^r$ except the first $m$. The similarity of the PCR estimator to the NCT estimator (2.1) is now clear. While PCR blindly selects the coefficients corresponding to the first $m$ principal components, our procedure screens for the most important principal directions, which may be different from the leading ones, and keeps only those with significant contribution exceeding $\tau$. However, if there is a strong prior indicating that the canonical coefficients spike at the leading principal directions, then of course PCR should also be a suitable procedure, and NCT will simply mimic its behavior, at some small cost.

2.3 Generalized canonical thresholding

PCR focuses only on the estimators on the principal component directions, while NCT does not have any preferences. We now propose a family of canonical thresholding estimators to bridge these two extremes, progressively putting more preference on the principal directions. In addition, we also generalize the thresholding function.

First, the soft thresholding can be replaced by generalized thresholding rules (see e.g. Definition 9.3 in Fan et al. (2020)), introduced for completeness in the following definition.
Definition 2.2. The function $T_\tau: \mathbb{R} \to \mathbb{R}$ is called a generalized thresholding function if
(i) $|T_\tau[z]| \le c|z'|$ for all $z, z'$ satisfying $|z - z'| \le \tau/2$ and some constant $c$;
(ii) $|T_\tau[z] - z| \le \tau$ for all $z \in \mathbb{R}$.
The parameter $\tau$ is called the thresholding level.

Second, if there is some prior that the spiked canonical coefficients are more likely to be in the lower principal components rather than in the higher ones, we can introduce additional multiplicative weights, equal to the eigenvalues raised to a non-negative power $\varphi/2$, under thresholding to reflect this preference. This is equivalent to applying a larger thresholding to higher principal components, with $\varphi$ controlling the degree. Implementing this strategy, we propose the following more general family of estimators, parameterized by $\varphi \ge 0$:
\[ \hat\theta \stackrel{\mathrm{def}}{=} \hat\Lambda^{-\varphi/2}\, T_\tau\big[\hat\Lambda^{\varphi/2}\tilde\theta^{\mathrm{LS}}\big] \]
in the canonical domain, or
\[ \hat\beta \stackrel{\mathrm{def}}{=} \hat U\hat\Lambda^{-(1+\varphi)/2}\, T_\tau\bigg[\frac{\hat\Lambda^{-(1-\varphi)/2}\hat U^\top X^\top Y}{n}\bigg] \tag{2.2} \]
in the original domain. Here $T_\tau[z]$ is a generalized thresholding function from Definition 2.2 applied component-wise and $\tau \ge 0$ is the thresholding level. We call these the Generalized Canonical Thresholding (GCT) estimators.

When $\varphi = 0$ and the soft thresholding function is used, GCT corresponds to the NCT estimator (2.1). The intuition behind GCT is somewhat similar to NCT: the estimators automatically screen the most significant principal components. However, the choice of $\varphi$ allows us to put different importance on the eigenvalue $\hat\lambda_j$ and on $\hat u_j^\top\beta$ when deciding whether to threshold a principal component $j$ or not. While in NCT this importance is calibrated in accordance with the scaling appearing in the decomposition of the absolute MSE (this justifies the word "Natural" in the name), GCT with $\varphi > 0$ puts more emphasis on the components with large eigenvalues, reflecting a prior that the canonical coefficients concentrate on the leading principal components; in the extreme case only the leading components survive thresholding. This leads to PCR. We introduce GCT to better accommodate this situation.

It turns out that the theoretical results for GCT are not as nice and insightful as for NCT. However, in practice we observed that GCT may behave much better in some scenarios, as will be shown later in the corresponding section. In principle, one can even tune $\varphi$ in addition to tuning $\tau$ via cross-validation, which might further enhance the practical utility of the procedure.
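For illustration, the NCT sketch above extends to the GCT family (2.2) in a few lines. The weighting by $\hat\lambda_j^{\varphi/2}$ follows the display above; the hard-thresholding option is one admissible generalized thresholding rule in the sense of Definition 2.2, included here only as an example.

```python
import numpy as np

def gct(X, Y, tau, phi=0.0, rule="soft"):
    """Generalized Canonical Thresholding estimator (2.2); phi = 0 with the
    soft rule recovers the NCT estimator (2.1)."""
    n = X.shape[0]
    _, s, Vt = np.linalg.svd(X / np.sqrt(n), full_matrices=False)
    U_hat, lam_sqrt = Vt.T, s
    keep = lam_sqrt > 1e-12
    U_hat, lam_sqrt = U_hat[:, keep], lam_sqrt[keep]

    theta_ls = (U_hat.T @ (X.T @ Y / n)) / lam_sqrt    # canonical LS coefficients
    w = lam_sqrt ** phi                                 # weights lambda_hat_j ** (phi/2)
    z = w * theta_ls
    if rule == "soft":
        z_thr = np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)
    else:                                               # hard thresholding, another valid T_tau
        z_thr = z * (np.abs(z) > tau)
    theta_hat = z_thr / w                               # back to the canonical scale
    return U_hat @ (theta_hat / lam_sqrt)               # back to the original domain
```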
3 Theoretical guarantees

The first condition needed for our theoretical analysis is the following assumption on the noise, which will be used in both the fixed design and random design settings.

Assumption 3.1 (Sub-Weibull noise). The noise vector $\varepsilon$ is independent of $X$ and is a jointly sub-Weibull random vector with parameter $0 < \alpha \le 2$ (see Kuchibhotla and Chakrabortty (2018)). That is, there exists $\sigma < \infty$ such that
\[ \sup_{\|w\|=1} \|w^\top \varepsilon\|_{\psi_\alpha} \le \sigma, \]
where $\|\cdot\|_{\psi_\alpha}$ is the Orlicz norm for $\psi_\alpha(x) = e^{x^\alpha} - 1$. The following tail bound takes place: $\mathbb{P}\big[|w^\top \varepsilon| \ge t\big] \le 2\exp\big(-(t/\sigma)^\alpha\big)$ for all $w$ with $\|w\| = 1$ and all $t > 0$.

This allows us to go slightly beyond sub-Gaussian and sub-exponential tails. For i.i.d. sub-Gaussian noise, $\sigma$ coincides (up to a multiplicative constant) with the variance of a single $\varepsilon_i$. So, $\sigma$ can be interpreted as the magnitude of the noise.

Define the signal-to-noise ratio of the linear regression problem in the fixed design setting as
\[ \mathrm{SNR} \stackrel{\mathrm{def}}{=} \bigg( \frac{n^{-1}\sum_{i=1}^n (x_i^\top\beta)^2}{\sigma^2} \bigg)^{1/2} = \bigg( \frac{\beta^\top \hat\Sigma \beta}{\sigma^2} \bigg)^{1/2} = \frac{\|\theta\|}{\sigma}, \]
while for the random design we use
\[ \mathrm{SNR} \stackrel{\mathrm{def}}{=} \bigg( \frac{\mathbb{E}[(x^\top\beta)^2]}{\sigma^2} \bigg)^{1/2} = \bigg( \frac{\beta^\top \Sigma \beta}{\sigma^2} \bigg)^{1/2} = \frac{\|\Lambda^{1/2} U^\top \beta\|}{\sigma}. \]
We always assume $\mathrm{SNR} > 0$. Recall the canonical coefficient vector $\theta = \hat\Lambda^{1/2}\hat U^\top \beta$ (its analogue $\Lambda^{1/2} U^\top\beta$ plays the same role for the random design). Its normalized version $\theta/\|\theta\|$ has the first $k$ components $\theta_{\le k}/\|\theta\|$, where $\theta_{\le k} = \hat\Lambda_{\le k}^{1/2}\hat U_{\le k}^\top \beta$. Here, as usual, $\hat\Lambda_{\le k}^{1/2} \in \mathbb{R}^{k\times k}$ is the leading principal submatrix of $\hat\Lambda^{1/2}$ (containing the square roots of the first $k$ eigenvalues of $\hat\Sigma$ on the diagonal) and $\hat U_{\le k} \in \mathbb{R}^{d \times k}$ is the matrix consisting of the first $k$ columns of $\hat U$ (which are the $k$ leading eigenvectors of $\hat\Sigma$). Measuring the normalized first $k$ components $\theta_{\le k}/\|\theta\|$ in the $\ell_q$-(pseudo)norm gives
\[ D^{\mathrm{eff}}_{q,k}(\hat\Sigma, \beta) \stackrel{\mathrm{def}}{=} \frac{\|\theta_{\le k}\|_q^q}{\|\theta\|^q}. \]
We call this quantity the joint effective dimension of order $q$ up to index $k$ of $\hat\Sigma$ and $\beta$. Note that when $q = 2$, it measures the proportion of $\theta$ explained by $\theta_{\le k}$; when $q = 0$, it counts the sparsity among $\theta_{\le k}$. A similar quantity can be defined for the random design setting:
\[ D^{\mathrm{eff}}_{q,k}(\Sigma, \beta) \stackrel{\mathrm{def}}{=} \frac{\|\Lambda_{\le k}^{1/2} U_{\le k}^\top \beta\|_q^q}{\|\Lambda^{1/2} U^\top \beta\|^q}. \]
It turns out that this joint effective dimension plays a crucial role in our bounds for the NCT estimator (2.1).

For shortness, we introduce the following quantity that will appear regularly throughout the section:
\[ \rho \stackrel{\mathrm{def}}{=} \frac{2}{\sqrt n} \big(\log(2d/\delta)\big)^{1/\alpha}, \tag{3.1} \]
where $\delta$ comes from the statements "with probability $1 - \delta$ ...". The thresholding level $\tau$ for both NCT and GCT will be expressed in terms of $\rho$.

3.1 Fixed design

We first provide a simple guarantee on the mean squared error $\mathrm{MSE}(\hat\beta)$ of the NCT estimator (2.1).

Theorem 3.1. Suppose Assumption 3.1 is satisfied. Take $\tau = \sigma\rho$ with $\rho = \frac{2}{\sqrt n}(\log(2d/\delta))^{1/\alpha}$. Then, with probability $1 - \delta$, the NCT estimator $\hat\beta$ from (2.1) with thresholding at level $\tau$ satisfies
\[ \mathrm{MSE}(\hat\beta) \lesssim \inf_{q \in [0,2]} \big\{ \|\theta\|_q^q \, (\sigma\rho)^{2-q} \big\} \tag{3.2} \]
\[ = \mathrm{MSE}(0) \inf_{q \in [0,2]} D^{\mathrm{eff}}_{q,d}(\hat\Sigma, \beta) \bigg[ \mathrm{SNR}^{-2}\, \frac{4\big(\log(2d/\delta)\big)^{2/\alpha}}{n} \bigg]^{1 - q/2}. \tag{3.3} \]

The proof of this result almost repeats the classical proof for hard and soft thresholding with an orthonormal design.

Remark 3.1 (Choice of $\tau$). The choice of $\tau$ in the above theorem depends on the noise magnitude $\sigma$, the probability $\delta$, and the quantity $\alpha$, but this is not a significant problem. Later in Section 4 we will show how to tune $\tau$ using an effective cross-validation procedure.

Though the bounds (3.2) and (3.3) coincide, the way we state them reflects two different messages. For the first one, if we take $q = 1$ and apply the inequality $\|\theta\|_1 \le \|\hat\Sigma\|^{1/2}\|\beta\|\, r_{\mathrm{eff}}[\hat\Sigma]^{1/2}$ (which follows from the Cauchy–Schwarz inequality), then from the bound (3.2) we get
\[ \mathrm{MSE}(\hat\beta) \lesssim \sigma \|\hat\Sigma\|^{1/2} \|\beta\| \sqrt{\frac{r_{\mathrm{eff}}[\hat\Sigma]\, (\log(2d/\delta))^{2/\alpha}}{n}} \]
with high probability. This means that if one is interested in the absolute error $\mathrm{MSE}(\hat\beta)$, then in the moderate noise case $\sigma \le C\|\hat\Sigma\|^{1/2}\|\beta\|$, essentially $r_{\mathrm{eff}}[\hat\Sigma]/n = o(1)$ is enough to guarantee $\mathrm{MSE}(\hat\beta) = \|\hat\Sigma\|\|\beta\|^2 \cdot o(1)$, omitting logarithmic terms. No additional assumptions on $\beta$ are required, and there is no need to worry about the joint effective dimension and the signal-to-noise ratio from the bound (3.3) in this situation.

However, the bound (3.3) is useful for better understanding the structure of the error. Taking into account that the main motivation in our work is the decay of eigenvalues of $\hat\Sigma$ or $\Sigma$, it may easily be the case that even the trivial estimator $\tilde\beta = 0$ has small error $\mathrm{MSE}(0) = \beta^\top\hat\Sigma\beta$. Hence, it makes sense to care more about the relative error $\mathrm{MSE}(\tilde\beta)/\mathrm{MSE}(0)$. In this case, the joint effective dimension and the signal-to-noise ratio control the upper bound on the relative error. We will get back to the analysis of the relative error and the joint effective dimension after we state results for the random design case.
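To see how the pieces of the bound interact, the sketch below computes the canonical coefficients $\theta$, the threshold $\tau = \sigma\rho$ from (3.1), the right-hand side of (3.2) over a grid of $q$, and the joint effective dimension at the minimizing $q$; the polynomial decays of $\lambda_j$ and of the coefficients are illustrative assumptions, not requirements of the theorem.

```python
import numpy as np

n, d, sigma, delta, alpha = 200, 1000, 0.5, 0.05, 1.0
rng = np.random.default_rng(2)

# Illustrative design with polynomial eigenvalue and coefficient decay (assumptions of this sketch).
lam = np.arange(1, d + 1) ** -1.5
X = rng.normal(size=(n, d)) * np.sqrt(lam)
beta = (np.arange(1, d + 1) ** -1.0) * rng.choice([-1, 1], size=d)

# Canonical coefficients theta = Lambda_hat^{1/2} U_hat^T beta.
_, s, Vt = np.linalg.svd(X / np.sqrt(n), full_matrices=False)
theta = s * (Vt @ beta)

rho = 2 / np.sqrt(n) * np.log(2 * d / delta) ** (1 / alpha)   # quantity (3.1)
tau = sigma * rho

# Right-hand side of (3.2): inf_q ||theta||_q^q (sigma*rho)^{2-q}, together with
# the joint effective dimension D^eff_{q,d}(Sigma_hat, beta) at the minimizing q.
qs = np.linspace(1e-3, 2.0, 200)
bound = [np.sum(np.abs(theta) ** q) * tau ** (2 - q) for q in qs]
q_best = qs[int(np.argmin(bound))]
d_eff = np.sum(np.abs(theta) ** q_best) / np.sum(theta ** 2) ** (q_best / 2)
print(q_best, min(bound), d_eff)
```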
3.2 Random design

In addition to the noise assumption, to study the performance of the NCT estimator in the random design setting we need to impose a couple more conditions on the distribution of the covariates.

Assumption 3.2 (Sub-Gaussian covariates). The scaled generic random vector of covariates $\Sigma^{-1/2}x$ is sub-Gaussian.

Assumption 3.3 (Convex decay of eigenvalues). There exists a convex decreasing function $\lambda(\cdot)$ such that the eigenvalues of $\Sigma$ satisfy $\lambda_j = \lambda(j)$ for $j \in [d]$.

The previous assumption is technical and we impose it in our main result just for concreteness. Later, in Remark 3.8, we mention how our result can be modified if this assumption does not hold.

One more assumption is needed just to make the rates more friendly-looking. If it is not satisfied, our result below will be meaningless, so there is no loss of generality in this condition.

Assumption 3.4 (Technical conditions). The effective rank satisfies $r_{\mathrm{eff}}[\Sigma] \le n$. Also, whenever we say "with probability $1-\delta$ ...", we suppose that the quantity
\[ \epsilon = \epsilon_{n,d,\delta} \stackrel{\mathrm{def}}{=} \sqrt{\frac{\log(d/\delta)}{n}} \]
satisfies $\epsilon \le c$ for a properly chosen implicit absolute constant $c > 0$ (this constant comes from the proof).

In addition to the assumptions above, in the sequel we adopt the convention $\lambda_k = 0$ for all $k > d$. Now we are ready to present the following result.

Theorem 3.2. Suppose Assumptions 3.1–3.4 are fulfilled. Recall $\epsilon = \epsilon_{n,d,\delta} \stackrel{\mathrm{def}}{=} \sqrt{\log(d/\delta)/n}$ and define $k^* \stackrel{\mathrm{def}}{=} \big(\epsilon^2 \log(1/\epsilon)\big)^{-1/3}$ (essentially, $k^* \asymp n^{1/3}$ up to a logarithmic term). Then:

(i) With probability $1-\delta$, the NCT estimator $\hat\beta$ from (2.1) with thresholding at level $\tau_1 = \sigma\rho = \frac{2\sigma}{\sqrt n}(\log(2d/\delta))^{1/\alpha}$ satisfies
\[ \mathrm{PE}(\hat\beta) \lesssim \inf_{q\in[0,2]} \big\{ \|\Lambda_{\le k}^{1/2} U_{\le k}^\top \beta\|_q^q\, (\sigma\rho)^{2-q} \big\} + \|\Sigma\|\|\beta\|^2 \Bigg( \frac{\lambda_k}{\lambda_1} + \sqrt{\frac{r_{\mathrm{eff}}[\Sigma] + \log(1/\delta)}{n}} + \epsilon \sum_{j=1}^k \frac{\lambda_j (1 + \epsilon j)}{\lambda_1} \Bigg) \]
for all $1 \le k \le k^*$.

(ii) With probability $1-\delta$, the NCT estimator $\hat\beta$ from (2.1) with thresholding at level
\[ \tau_2 = \sigma\rho + C\,\|\Sigma\|^{1/2}\|\beta\|\, \epsilon^{1/2} \max_{j\in[k]} \bigg( \frac{\lambda_j(1+\epsilon j)}{\lambda_1} \bigg)^{1/2} \]
for some absolute constant $C$ satisfies
\[ \mathrm{PE}(\hat\beta) \lesssim \inf_{q\in[0,2]} \big\{ \|\Lambda_{\le k}^{1/2} U_{\le k}^\top \beta\|_q^q\, \tau_2^{2-q} \big\} + \|\Sigma\|\|\beta\|^2 \Bigg( \frac{\lambda_k}{\lambda_1} + \sqrt{\frac{r_{\mathrm{eff}}[\Sigma] + \log(1/\delta)}{n}} \Bigg) \]
for all $1 \le k \le k^*$.

Several comments are in order.
Remark 3.2 (Why two bounds?) . We present two different bounds in (i) and (ii) for differentthresholds τ and τ , because they behave differently for different eigenvalue regimes. The boundfrom (i) outperforms the bound from (ii) in a wide variety of regimes (e.g. in polynomial andsuperpolynomial decay scenario), however there are cases when the bound (ii) can be better (e.g.specific cases in Factor Model regime). emark 3.3 (Meaning of terms) . We call the term inf q ∈ [0 , (cid:8) (cid:107) Λ ≤ k U (cid:62)≤ k β (cid:107) qq τ − q (cid:9) in the boundsof Theorem 3.2 (i) and (ii) the “main term”, since it is almost the same as what we had inTheorem 3.1 for the fixed design. The other terms in these bounds are referred to as “additionalterms” as they appear only in the random design case. Allowing ≤ k ≤ k ∗ provides atradeoff, as some of the terms increase with growing k , while others decrease. In what followswe are typically interested in k = k ∗ just for concreteness. The meaning of different parts ofthe “additional terms” is the following. The parts including r eff [ Σ ] are the payment for thecovariance matrix estimation. The part λ k ∗ /λ appears due to the difficulty of control of theempirical eigenvectors beyond k ∗ -th. The parts with (cid:80) k ∗ j =1 λ j (1 + (cid:15)j ) /λ and max j ∈ [ k ∗ ] λ j (1 + (cid:15)j ) /λ are the payment for the control of the sample eigenvalues and eigenvectors up to k ∗ . Remark 3.4 (Moderate noise: simplifications and sufficient conditions for convergence) . Con-sider the moderate noise situation σ ≤ C (cid:107) Σ (cid:107)(cid:107) β (cid:107) . In this case the “additional terms” becomedominating: simply taking k = k ∗ and q = 1 , applying (cid:107) Λ ≤ k U (cid:62)≤ k β (cid:107) ≤ (cid:107) Σ (cid:107) / (cid:107) β (cid:107) r eff [ Σ ] / and plugging in τ and τ makes the “main term” negligible. Omitting logarithmic terms, thebound (i) reduces to PE ( (cid:98) β ) (cid:46) (cid:107) Σ (cid:107)(cid:107) β (cid:107) (cid:32) λ k ∗ λ + (cid:114) r eff [ Σ ] n + (cid:15) k ∗ (cid:88) j =1 λ j (1 + (cid:15) j ) λ (cid:33) , (3.4) while the bound (ii) reduces to PE ( (cid:98) β ) (cid:46) (cid:107) Σ (cid:107)(cid:107) β (cid:107) (cid:32) λ k ∗ λ + (cid:18) r eff [ Σ ] n (cid:19) / max j ∈ [ k ∗ ] (cid:18) λ j (1 + j / √ n ) λ (cid:19) / (cid:33) with high probability.From here we can easily deduce sufficient conditions to ensure the convergence of the absoluteerror PE ( (cid:98) β ) = (cid:107) Σ (cid:107)(cid:107) β (cid:107) · o (1) as n → ∞ without any conditions on β . In particular, λ n = o (1) , r eff [ Σ ] = o ( n ) , and (cid:80) k ∗ j =1 λ j (1+ j / √ n ) λ = o ( n / ) is enough (again, up to logarithmicfactors). Essentially, these sufficient conditions require the decay of eigenvalues to be fastenough (but in a more sophisticated fashion than for the MSE). Remark 3.5 (Moderate noise: further simplifications in specific examples) . Continuing themoderate noise situation, for the sake of exposition, let us consider several specific examples ofeigenvalue regimes and illustrate how the bound from Theorem 3.2 (i) simplifies. As above, weomit logarithmic terms. • (Polynomial decay): If λ j (cid:46) j − a with a ≥ or d (cid:46) n (3 − a ) / (3 − a ) , it is easy to verify thatthe bracket factor of (3.4) is dominated by λ k ∗ /λ + n − / (again ignoring logarithmicterms) and with high probability PE ( (cid:98) β ) (cid:46) (cid:107) Σ (cid:107)(cid:107) β (cid:107) n min( a/ , / . 
n particular, when a = 1 (the boundary that has a good control of r eff [ Σ ] in high dimen-sions), we have with high probability PE ( (cid:98) β ) (cid:46) (cid:107) Σ (cid:107)(cid:107) β (cid:107) n / . When a ≥ / , we have with high probability PE ( (cid:98) β ) (cid:46) (cid:107) Σ (cid:107)(cid:107) β (cid:107) n / . • (Factor Model regime) If λ (cid:16) . . . (cid:16) λ m (cid:16) d , λ m +1 (cid:16) . . . (cid:16) λ d (cid:16) for some m (cid:46) k ∗ ,then taking k = m + 1 yields with high probability PE ( (cid:98) β ) (cid:46) (cid:107) Σ (cid:107)(cid:107) β (cid:107) (cid:18) d + m √ n + m n (cid:19) . Remark 3.6 (Large noise) . In the large noise case, when σ (cid:29) (cid:107) Σ (cid:107)(cid:107) β (cid:107) , the “main term”dominates. Similarly to the fixed design case, we can factorize the “main term” into the errorof the trivial estimator PE (0) = β (cid:62) Σ β , the joint effective dimension and the signal-to-noiseratio: the “main term” from (i) becomes PE (0) inf q ∈ [0 , (cid:40) D eff q,k ∗ ( Σ , β ) (cid:34) SNR − (log(2 d/δ )) /α n (cid:35) − q/ (cid:41) , and the “main term” from (ii) can be rewritten as PE (0) inf q ∈ [0 , (cid:40) D eff q,k ∗ ( Σ , β ) (cid:34) SNR − (log(2 d/δ )) /α n ++ (cid:107) Σ (cid:107)(cid:107) β (cid:107) β (cid:62) Σ β (cid:15) max j ∈ [ k ∗ ] (cid:18) λ j (1 + (cid:15)j ) λ (cid:19) (cid:35) − q/ (cid:41) . In this regime, the relative error PE ( (cid:98) β ) / PE (0) is essentially controlled by the joint effectivedimension and the signal-to-noise ratio. More detailed analysis of the joint effective dimension D eff q,k ( Σ , β ) is conducted in the next section. Remark 3.7 (Comparison with the least squares) . It is straightforward to notice that the fasterdecay of eigenvalues, the better bound we obtain. This contrasts the min norm least squaresestimator considered in Bartlett et al. (2020); Chinot and Lerasle (2020), where the decay isrequired to be not too fast. It also reveals the benefits of the thresholding even in such a situation.
Remark 3.8 (Relaxing Assumption 3.3) . Assumption 3.3 can be avoided. If one defines k ∗ as k ∗ = min ( (cid:15) log(1 /(cid:15) )) − / , max j ∈ [ d ] (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) d (cid:88) l =1 l (cid:54) = j λ l | λ j − λ l | + λ j min( λ j − − λ j , λ j − λ j +1 ) ≤ (cid:15) , then the same conclusion as in Theorem 3.2 is true with a slightly different rate.
To compare the natural canonical thresholding with the canonical truncation of higher-index components, i.e. PCR, we state the next proposition.

Proposition 3.3. Assume the conditions of Theorem 3.2 hold and let $k^*$ be defined in the same way. Then, with probability $1-\delta$, the PCR estimator $\tilde\beta^{\mathrm{PCR}}$ with the number of leading principal components set to $m \lesssim k^*$ satisfies
\[ \mathrm{PE}(\tilde\beta^{\mathrm{PCR}}) \lesssim \|\Sigma\|\|\beta\|^2 \Bigg( \frac{\lambda_m}{\lambda_1} + \sqrt{\frac{r_{\mathrm{eff}}[\Sigma] + \log(1/\delta)}{n}} \Bigg) + \frac{\sigma^2 m}{n} \big(\log(2m/\delta)\big)^{2/\alpha}. \]

We omit the proof of this result, since it essentially uses the same techniques and follows the same strategy as the proof of Theorem 3.2. Note that in the moderate noise scenario the rate essentially coincides with what we obtained for the NCT estimator in Remark 3.4. The adaptivity of our estimator $\hat\beta$ comes into play in the large noise case: the "main term" in the bounds on $\mathrm{PE}(\hat\beta)$ is better than $\sigma^2 m/n$ in situations where $U_{\le k}^\top\beta$ is approximately sparse.

3.3 Relative errors

So far we were able to establish some sufficient conditions for convergence of the absolute errors
MSE ( (cid:98) β ) and PE ( (cid:98) β ) of the NCT estimator without any assumptions on β by simply taking q = 1 (Remark 3.4). The analysis of the relative errors for fixed design and (in large noisecase) random design requires more careful study of D eff q,k ( Σ , β ). Let us motivate why relativeerrors MSE ( (cid:98) β ) / MSE (0) and PE ( (cid:98) β ) / PE (0) might be of interest in the first place. One reason behind studying the relative errors was already mentioned previously. Note that ifthere is no relation between (cid:98) Σ and β , meaning that (cid:98) U (cid:62) β is a “random” vector, then we canexpect (cid:107) (cid:98) U (cid:62) β (cid:107) ∞ (cid:16) (cid:107) β (cid:107) (cid:112) log( d ) /d . In this case, the trivial estimator (cid:101) β = 0 achieves error MSE (0) = β (cid:62) (cid:98) Σ β ≤ Tr [ (cid:98) Σ ] (cid:107) (cid:98) U (cid:62) β (cid:107) ∞ (cid:16) (cid:107) (cid:98) Σ (cid:107)(cid:107) β (cid:107) r eff [ (cid:98) Σ ] log( d ) d . As long as the eigenvalues of (cid:98) Σ decay fast, even the trivial estimator gives error close to zeroin high dimensions. Here we should highlight that this effect does not appear in low dimensions(and even in high-dimensional but isotropic situations), where the absolute and relative errorsare just a multiplicative constant apart. (Same reasoning works for the PE.) Hence, it is notsatisfactory for us to show that the absolute error of our estimator goes to zero with growingsample size and dimension. We would like to get more meaningful conclusions from our results,which would confirm that the proposed estimator does better than the trivial estimator. Thisnaturally leads to the relative errors.Another motivation comes from the way statisticians evaluate and compare estimators inpractical applications. In particular, a widely used performance measure is the coefficient of etermination , or simply R . For instance, the in-sample version for an estimator (cid:101) β is definedas R in ( (cid:101) β ) def = 1 − (cid:80) ni =1 ( y i − x (cid:62) i (cid:101) β ) (cid:80) ni =1 y i . The larger this quantity is, the better method we have; its largest possible value is 1, and thevalue of 0 indicates that the estimator does no better than the trivial estimator. Via heuristicarguments we can see R in ( (cid:101) β ) ≈ − MSE ( (cid:101) β ) MSE (0) . Thus, maximizing the coefficient of determination is approximately equivalent to minimizingthe relative error. Similar intuition applies to the out-of-sample R out and the relative predictionerror PE ( (cid:101) β ) / PE (0). Note that in applications it is often the case that even small but positive R (e.g. 0 .
05) can be considered a success. Therefore, the hope to have
MSE ( (cid:101) β ) / MSE (0) or PE ( (cid:101) β ) / PE (0) converging to 0 might be too optimistic in some situations. Having these relativeerrors smaller than 1 already means that the procedure is able to extract some useful signalfrom the data. Prior to describing properties of D eff q,k ( Σ , β ), let us show why imposing conditions on the designalone, or imposing conditions on β alone can be not enough to establish convergence of therelative errors. It is easier to do for the fixed design case, so let us focus on this setting for now.For a given design matrix XXX and an estimator (cid:101) β we can construct another estimator (cid:101) θ = (cid:98) Λ (cid:98) U (cid:62) (cid:101) β . Therefore, ( (cid:101) β − β ) (cid:62) (cid:98) Σ ( (cid:101) β − β ) β (cid:62) (cid:98) Σ β = (cid:107) (cid:101) θ − θ (cid:107) (cid:107) θ (cid:107) , and hence, inf (cid:101) β sup β ∈ R d MSE ( (cid:101) β ) MSE (0) = inf (cid:101) θ sup θ ∈ R r (cid:107) (cid:101) θ − θ (cid:107) (cid:107) θ (cid:107) . The minimax rate for the relative MSE coincides with the minimax rate for the relative esti-mation error in the right-hand side, which has nothing to do with (cid:98) Σ . This demonstrates thatgetting a good rate is hopeless in high dimension assuming only fast decay of eigenvalues of (cid:98) Σ .On the other hand, we might impose strong conditions on β , such as sparsity, in whichcase one could expect even MSE ( (cid:101) β ) (cid:16) σ s/n (up to a logarithmic factor) for some appropriateestimator (cid:101) β , where s measures the degree of sparsity. However, as we mentioned previously,if (cid:98) Σ and β are not related, and the eigenvalues of (cid:98) Σ decay fast, we might have MSE (0) (cid:16)(cid:107) (cid:98) Σ (cid:107)(cid:107) β (cid:107) /d (again up to logarithmic factors) for the trivial estimator. This implies that thereis no much hope in getting vanishing relative error MSE ( (cid:101) β ) / MSE (0) in high dimensions. Thatis why it seems natural that our bound on the relative error depends on the joint effectivedimension, that takes into account not only decay of eigenvalues or only assumptions on β ,but the joint structure of (cid:98) Σ and β . 17 .3.3 Joint effective dimension Now, once we supported the appearance of the joint effective dimension, let us mention its basicproperties. (For concreteness we choose the random design setting and consider D eff q,k ( Σ , β ),though the following ideas apply to D eff q,d ( (cid:98) Σ , β ) appearing in the fixed design case.) • D eff q,k ( Σ , β ) ≤ k . • D eff q,k ( Σ , β ) is decreasing in q and increasing in k . • D eff ,k ( Σ , β ) ≤ D eff ,d ( Σ , β ) = 1. • D eff ,k ( Σ , β ) = (cid:107) Λ ≤ k U (cid:62)≤ k β (cid:107) is essentially the sparsity of U (cid:62)≤ k β .For now let us assume SNR ≥ c > D eff q ( Σ , β ) only. Recall that the “mainterm” in the relative bounds looks likeinf q ∈ [0 , (cid:40) D eff q,k ( Σ , β ) ζ − q/ n (cid:41) where ζ n is some vanishing rate. Hence, the properties above reveal a tradeoff in the main term: • When q is large, i.e. closer to 2, it is easier to control D eff q,k ( Σ , β ); however, the vanishingrate ζ n is raised to a small power, making the convergence slow. • When q is small, i.e. 
closer to 0, it is more difficult to control D eff q,k ( Σ , β ); in contrast, ζ n is raised to a large power potentially enabling fast convergence rate.So, the bound allows to find largest q for which D eff q,k ( Σ , β ) can be bounded in dimension-free(or the dependence on d is not that severe) and sample size-free manner to facilitate fasterconvergence rate.Some scenarios where D eff q,k ( Σ , β ) can be bounded more explicitly (for some q <
2) arediscussed below: • Sparsity of U T ≤ k β . Denote s def = (cid:107) U (cid:62)≤ k β (cid:107) to be the sparsity level. Then, as alreadymentioned previously, D eff ,k ( Σ , β ) = s . • Approximate sparsity of Λ ≤ k U T ≤ k β . Suppose there exists a small set J ⊆ [ k ] (of size |J | = s ) of significant components, so that the rest of the components satisfy λ / j | u (cid:62) j β | ≤ cd (cid:107) ΛU (cid:62) β (cid:107) for all j / ∈ J . Then D eff ,k ( Σ , β ) ≤ s + c . • Polynomial decay: λ j (cid:16) j − a , | u (cid:62) j β | (cid:16) j − b , where a ≥
0. We have several cases: – If a + 2 b ≥ q ≥ / ( a + 2 b ), then D eff q,k ( Σ , β ) (cid:16) – If a + 2 b ≥ q < / ( a + 2 b ), then D eff q,k ( Σ , β ) (cid:16) k − ( a +2 b ) q/ .18 a+2b . . . . . q d . d . d . d . d . constgrows with d Figure 1: Dependence of D eff q,d ( Σ , β ) on d . – If a + 2 b <
1, then D eff q,k ( Σ , β ) (cid:16) k − ( a +2 b ) q/ /d (1 − a − b ) q/ .Here we omitted logarithmic factors. To better understand the dependence of D eff q,d ( Σ , β )on d in specific case k = d , in Figure 1 we depict this dependence in ( a + 2 b ) – q axes. In the green region D eff q,d ( Σ , β ) does not grow with d , while in the yellow region D eff q,d ( Σ , β ) grows with d polynomially, and the contours of constant power are illustratedwith different colors. Since the “main term” conveniently decomposes into several factors, among which the mostmysterious one – the joint effective dimension – was discussed above, the analysis of the relativeerrors of the NCT estimator in the fixed design and (for large noise case) random design ispretty much complete. It is intriguing though, what happens to the prediction error bounds inmoderate noise case: recall that in this scenario the “main term” is absorbed by the “additionalterm”, which doesn’t have a structure allowing a direct analysis of the bounds on the relativeerror PE ( (cid:98) β ) / PE (0). It is not clear whether they can be stated in a way that will providebetter understanding of the relative error. Instead, we can take a look at the particular caseof polynomial decay of eigenvalues and regression coefficients: λ j (cid:16) j − a , | u (cid:62) j β | (cid:16) j − b . Aftertedious calculations, one may express the bounds from Theorem 3.1 and Theorem 3.2 (i), (ii)in terms of d, n, a, b only. It turns out, that in this scenario the bound from Theorem 3.2 (ii)is always worse than the bound from Theorem 3.2 (i), so we exclude it from consideration.In Figure 2 we plot the contours of constant convergence rate on a – b plane for the boundsfrom Theorem 3.1 and Theorem 3.2 (i). Different colors of the contours correspond to differentrates. The background color describes the assumptions on d and n that we make in differentregions: in light green zones d can be much larger than n (though this is not necessary), inlight yellow zones d and n are allowed to be of the same order, i.e. d < Cn for some C (that19an be larger that 1), and in grey zone the rates do not go to zero unless d is significantlysmaller than n . We again disregard the logarithmic terms. Now we move to a brief study of the theoretical guarantees for the GCT estimator (2.2). Theresults are not that insightful as the ones for the NCT estimator, and we focus on the fixeddesign setting only. However, once we state the MSE bound, the theory for the random designcan be developed in the same way as for the NCT estimator.
Theorem 3.4.
Suppose Assumption 3.1 is satisfied. Let ϕ ≥ . Take τ = σρ with ρ = √ n (log(2 d/δ )) /α . Then, with probability − δ , the GCT estimator (cid:98) β from (2.2) with parameter ϕ and thresholding at level τ satisfies MSE ( (cid:98) β ) (cid:46) r (cid:88) j =1 min (cid:32) (cid:98) λ ϕ/ (cid:98) λ ϕ/ j σρ, | θ j | (cid:33) . The obtained bound may be difficult to comprehend, but we state it in the most general form tomake sure it is tight and applicable in wide range of scenarios. When θ j = 0 for j ≥ m , it hasno estimation errors beyond the first m principal components, adapting very well to focusingonly on the low dimensions estimation like PCR. If, in addition, (cid:98) λ / (cid:98) λ m is bounded, we have MSE ( (cid:98) β ) (cid:46) m (cid:88) j =1 min ( σρ, | θ j | ) , which is not much larger than the MSE of PCR. It can even be much smaller than PCR when | θ j | are small for many j . In general, GCT outperforms NCT when | θ j | decays fast enough.The next corollary allows to make sure that the bound is essentially dimension-free – asfor Theorem 3.1, the rate can be expressed in terms of the effective rank r eff [ (cid:98) Σ ], while thedependence on the dimension appears only under logarithm. Corollary 3.5.
Under assumptions of Theorem 3.4, with probability − δ , the GCT estimator (cid:98) β from (2.2) with parameter ϕ ≥ satisfies MSE ( (cid:98) β ) (cid:46) (cid:16) (cid:107) (cid:98) Σ (cid:107)(cid:107) β (cid:107) + σ (cid:107) (cid:98) Σ (cid:107) / (cid:107) β (cid:107) r eff [ (cid:98) Σ ] / (cid:17) (log(2 d/δ )) / (2+ ϕ ) α n / (2+ ϕ ) . Note that when ϕ = 0 we essentially recover the rate obtained after Theorem 3.1. The ratedeteriorates when ϕ is far from 0, and this is explainable: the GCT procedure significantlydeviates from the natural one, leading to a worse bound in the worst case, i.e. when the onlyassumption is the control of effective rank, and spikiness of canonical coefficients is not justified. Remark 3.9 (Simplifications in specific cases) . In several specific cases the rate from Theo-rem 3.4 can be made much more transparent. We omit logarithmic terms. . . . a . . . . b n − . n − . n − . n − . n − . n − . n − . n − . n − . n − . d (cid:29) nd < Cn no convergence (a) Bound on the relative mean squared error MSE ( (cid:98) β ) / MSE (0) from Theorem 3.1; . . . a . . . . b n − . n − . n − . n − . n − . n − . n − . n − . n − . d (cid:29) nd < Cn no convergence (b) Bound on the relative prediction error PE ( (cid:98) β ) / PE (0) from Theorem 3.2 (i); Figure 2: Rates for the relative errors of the NCT estimator in polynomial decay scenario.21 (Polynomial decay) If (cid:98) λ j (cid:16) j − a and | u (cid:62) j β | (cid:16) j − b for a ≥ , a + 2 b ≥ , then with highprobability MSE ( (cid:98) β ) (cid:46) (cid:18) σ n (cid:19) a +2 b − a ( ϕ +1)+2 b . • (Sparsity) If there exists a set J of size |J | = s such that u (cid:62) j β = 0 for j / ∈ J and (cid:98) λ max J ≥ c (cid:98) λ for c > , then with high probability MSE ( (cid:98) β ) (cid:46) sσ n . • (Approximate sparsity) If there exists a set J of size |J | = s such that (cid:98) λ / j | u (cid:62) j β | ≤ c (cid:107) θ (cid:107) /d for j / ∈ J and (cid:98) λ max J ≥ c (cid:98) λ for c > , then with high probability MSE ( (cid:98) β ) (cid:46) sσ n + c (cid:107) θ (cid:107) d . • (Factor Model regime) If λ (cid:16) . . . (cid:16) λ m (cid:16) d , λ m +1 (cid:16) . . . (cid:16) λ d (cid:16) for some m , thenwith high probability MSE ( (cid:98) β ) (cid:46) mσ n + (cid:107) β (cid:107) d . Therefore, the rate from Theorem 3.4 can adapt well to these specific structures despite deteri-orating rate of Corollary 3.5, which is only an upper bound. τ To start with, we focus on the case when a good value of τ is somehow known, and analyzethe computational complexity of the GCT estimators. In particular, this includes the NCT es-timator. The computation of SVD of XXX (specifically, (cid:98) Λ and (cid:98) U ) takes O ( dn min( d, n )) time.Once we have SVD, computing the matrix A def = (cid:98) U (cid:98) Λ − − ϕ and the vector b def = (cid:98) Λ − ϕ (cid:98) U (cid:62) XXX (cid:62)
YYY n needed before the generalized thresholding takes O ( dn ) time. Obtaining (cid:98) β for already com-puted A and b takes O ( d min( d, n )) time. Therefore, the total computational time of ourprocedure is O ( dn min( d, n )). The computation is as fast as the SVD of the design matrix.Note that computational complexity of the LASSO is O ( n min( d, n ) ), when we computeits solution path via a modification of Least Angle Regression, see Efron et al. (2004). τ Our approach requires to tune the hyperparameter τ . Whatever τ is, we anyway have to com-pute A and b . This already takes O ( dn min( d, n )) time. Applying the generalized threshold-ing and combining the result into the vector (cid:98) β takes O ( d min( d, n )) time. This means thatwe can try n different values of τ “for free” — the computational complexity will be still of22he same order as computing (cid:98) β for a single value of τ . But we can go even further, if we focuson the GCT estimators with the soft or hard thresholding.Notice that varying τ continuously from 0 to + ∞ , we still can get only (min( d, n ) + 1)different solutions (cid:98) β for given XXX , YYY because we threshold a vector of size min( d, n ). Therefore,we can compute the whole solution path for τ from 0 to + ∞ . However, we are not thatinterested in the solution path, since we do not expect to get coefficients entering the picture oneby one as in LASSO for sparse regression. Instead, this can be useful for k -fold cross validation,where we will have at most k (min( d, n ) + 1) “interesting” values of τ giving different solutions (cid:98) β . In total, this implies that we need O ( dn min( d, n )) + k (min( d, n ) + 1) · O ( d min( d, n )) = O ( dn min( d, n ) + kd min( d, n ) )operations to find the best τ (providing smallest cross-validation error). In practice, we typ-ically use k = 5 or k = 10, i.e. of constant order, which leads to the total computationalcomplexity of O ( dn min( d, n )) for our optimally tuned estimator – same as for a single value of τ . Leave-one-out cross validation takes slightly more computations, namely, O ( dn min( d, n ) ).Similar “free tuning” property holds for LASSO (we again refer to Efron et al. (2004)).However, for example the ridge regression does not posses this nice properties: different valuesof regularization parameter will lead to different estimators, and one has to “guess” a discreteset of values to be tried. We compare the following methods: • “ NCT ”: Natural Canonical Thresholding estimator (2.1) with efficient hyperparametertuning by 10-fold CV. • “ GCT ”: Generalized Canonical Thresholding estimator (2.2) with ϕ = 1, the soft thresh-olding, and with efficient hyperparameter tuning by 10-fold CV. • “ OLS ”: Ordinary Least Squares. When d > n , the min norm solution is considered. • “ PCR ”: Principal Component Regression. The number of PCs is chosen by 10-fold CV. • “ Ridge ”: Ridge regression with 10-fold CV (default implementation from R-package glm-net ). • “ LASSO ”: LASSO with 10-fold CV (default implementation from R-package glmnet ).We fix n = 200, SNR = 10 or
We fix $n = 200$ and SNR $= 10$ or SNR $= 1$, and focus on how the relative errors
$\mathrm{MSE}(\widehat\beta)/\mathrm{MSE}(0)$ and $\mathrm{PE}(\widehat\beta)/\mathrm{PE}(0)$ of these methods behave as the dimension $d$ grows. The covariates $x_1, \dots, x_n \sim N(0, \Sigma)$, where $\Sigma$ depends on the eigenvalue scenario, and the noise vector $\varepsilon \sim N(0, \sigma^2 I_n)$, where $\sigma^2$ is chosen to ensure SNR $= 10$ or
SNR $= 1$ for the given $\Sigma$ and $\beta$. Without loss of generality we take $\Sigma$ to be a diagonal matrix, or equivalently $U = I_d$. The results are presented in Figures 3–6. Figure 3 and Figure 4 correspond to SNR $= 10$; Figure 5 and Figure 6 correspond to
SNR $= 1$. Figures 3 and 5 cover the scenarios of polynomial decay of the eigenvalues $\lambda_j = j^{-a}$ and of the coefficients $u_j^\top\beta = j^{-b}$, with (a) $a = 2$, $b = 2$; (b) $a = 1$, $b = 0.\ldots$; (c) $a = 0.\ldots$, $b = 1$; while Figures 4 and 6 also consider polynomial decay of the eigenvalues $\lambda_j = j^{-a}$ but $U^\top\beta$ is different: (a) $a = 1$; $u_j^\top\beta = 1$ for $j = 1, \dots,$
$10$ and $0$ otherwise; (b) $a = 0.\ldots$; $u_j^\top\beta = 1$ for $10$ randomly chosen $j \in \{d-\ldots, \dots, d\}$, and the remaining components are i.i.d. $N(0, d^{-\ldots})$; (c) $a = 0.\ldots$; $U^\top\beta \sim N(0, I_d)$. In each scenario, for each method and each dimension we run the corresponding experiment 100 times and plot the median errors. We notice that the NCT estimator (among some others) suffers in some settings around $d = n$. This is the so-called "interpolation threshold": when the dimension exceeds the number of data points, a model has enough features to interpolate the training points. The behavior around this point and the associated "double descent" phenomenon have been an active area of research over the last couple of years; we do not focus on this in our work. Otherwise, the plots show that in the presented settings the proposed procedure performs quite well compared to the other methods. In particular, the consistent performance of GCT suggests the benefit of varying the thresholding to better adapt to scenarios with different priors. However, it is worth mentioning that the other methods also perform unexpectedly well in a variety of settings, even though previous theoretical results for them do not predict such performance. This may motivate a more thorough study of classical linear regression methods in the high-dimensional setting under different structural assumptions.
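For concreteness, a scenario of the kind described above can be generated and scored as in the following sketch (again our own illustrative code, not the simulation code behind the figures; the relative errors are computed directly from their definitions, and any estimator, for example the GCT sketch above, can be plugged in):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, a, b, snr = 200, 500, 2.0, 2.0, 10.0

lam = np.arange(1, d + 1, dtype=float) ** -a     # eigenvalues lambda_j = j^{-a}
beta = np.arange(1, d + 1, dtype=float) ** -b    # coefficients u_j^T beta = j^{-b} (U = I_d)
signal_var = float(lam @ beta ** 2)              # beta^T Sigma beta
sigma = np.sqrt(signal_var / snr)                # noise level matching the target SNR

X = rng.standard_normal((n, d)) * np.sqrt(lam)   # rows x_i ~ N(0, diag(lam))
Y = X @ beta + sigma * rng.standard_normal(n)
Sigma_hat = X.T @ X / n

def relative_errors(beta_hat):
    """Relative MSE (fixed design) and relative PE (random design) of beta_hat."""
    diff = beta_hat - beta
    rel_mse = (diff @ Sigma_hat @ diff) / (beta @ Sigma_hat @ beta)
    rel_pe = (lam @ diff ** 2) / signal_var
    return rel_mse, rel_pe

print(relative_errors(np.zeros(d)))              # the zero estimator gives (1.0, 1.0)
```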
[Figure 3 appears here: panels (a)–(c), each showing $\mathrm{MSE}(\widehat\beta)/\mathrm{MSE}(0)$ (left) and $\mathrm{PE}(\widehat\beta)/\mathrm{PE}(0)$ (right) against the dimension $d$ ranging from 50 to 700, with curves for NCT, GCT, OLS, PCR, Ridge and LASSO.]
Figure 3: The relative errors $\mathrm{MSE}(\widehat\beta)/\mathrm{MSE}(0)$ (left) and $\mathrm{PE}(\widehat\beta)/\mathrm{PE}(0)$ (right) for different estimators with $n = 200$, SNR $= 10$. Polynomial decay of eigenvalues and coefficients in the eigenbasis: $\lambda_j = j^{-a}$, $u_j^\top\beta = j^{-b}$; panels (a) $a = 2$, $b = 2$; (b) $a = 1$, $b = 0.\ldots$; (c) $a = 0.\ldots$, $b = 1$.
[Figure 4 appears here: panels (a)–(c) in the same format as Figure 3.]
Figure 4: The relative errors $\mathrm{MSE}(\widehat\beta)/\mathrm{MSE}(0)$ (left) and $\mathrm{PE}(\widehat\beta)/\mathrm{PE}(0)$ (right) for different estimators with $n = 200$, SNR $= 10$. Polynomial decay of eigenvalues $\lambda_j = j^{-a}$ and different regimes of coefficients in the eigenbasis $u_j^\top\beta$; panels (a) $a = 1$, $u_j^\top\beta = 1$ for $j = 1, \dots, 10$ and $0$ otherwise; (b) $a = 0.\ldots$, $u_j^\top\beta = 1$ for 10 randomly chosen $j \in \{d-\ldots, \dots, d\}$, and the remaining components are i.i.d. $N(0, d^{-\ldots})$; (c) $a = 0.\ldots$, $U^\top\beta \sim N(0, I_d)$.
[Figure 5 appears here: panels (a)–(c) in the same format as Figure 3.]
Figure 5: The relative errors $\mathrm{MSE}(\widehat\beta)/\mathrm{MSE}(0)$ (left) and $\mathrm{PE}(\widehat\beta)/\mathrm{PE}(0)$ (right) for different estimators with $n = 200$, SNR $= 1$. Polynomial decay of eigenvalues and coefficients in the eigenbasis: $\lambda_j = j^{-a}$, $u_j^\top\beta = j^{-b}$; panels (a) $a = 2$, $b = 2$; (b) $a = 1$, $b = 0.\ldots$; (c) $a = 0.\ldots$, $b = 1$.
[Figure 6 appears here: panels (a)–(c) in the same format as Figure 4.]
Figure 6: The relative errors $\mathrm{MSE}(\widehat\beta)/\mathrm{MSE}(0)$ (left) and $\mathrm{PE}(\widehat\beta)/\mathrm{PE}(0)$ (right) for different estimators with $n = 200$, SNR $= 1$. Polynomial decay of eigenvalues $\lambda_j = j^{-a}$ and different regimes of coefficients in the eigenbasis $u_j^\top\beta$; panels (a) $a = 1$, $u_j^\top\beta = 1$ for $j = 1, \dots, 10$ and $0$ otherwise; (b) $a = 0.\ldots$, $u_j^\top\beta = 1$ for 10 randomly chosen $j \in \{d-\ldots, \dots, d\}$, and the remaining components are i.i.d. $N(0, d^{-\ldots})$; (c) $a = 0.\ldots$, $U^\top\beta \sim N(0, I_d)$.

Main proofs
We start with the following lemma, which allows us to bound the properly scaled noise vector in the $\ell_\infty$-norm. The lemma deals with the fixed and random design settings simultaneously. Lemma 6.1.
Suppose Assumption 3.1 is fulfilled. Let $\rho$ be as in (3.1). Define the event
$$
\Omega_0 = \bigl\{ \|\xi\|_\infty \le \sigma\rho \bigr\} \quad\text{with}\quad \xi \stackrel{\mathrm{def}}{=} \frac{\boldsymbol{Z}^\top\varepsilon}{n} = \widehat\Lambda^{-1/2}\widehat U^\top\frac{\boldsymbol{X}^\top\varepsilon}{n} .
$$
Then $\mathbb{P}[\Omega_0] \ge 1 - \delta$.
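As a quick illustration of the lemma in the Gaussian special case $\alpha = 2$ (our own numerical check, not part of the argument), the following experiment draws the noise repeatedly for one fixed design and compares the observed quantiles of $\|\xi\|_\infty$ with the $\sigma\sqrt{\log r}/\sqrt{n}$ scaling that a Gaussian union bound suggests; the constant is only indicative.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, sigma, reps, delta = 200, 500, 1.0, 200, 0.05

lam = np.arange(1, d + 1, dtype=float) ** -2.0
X = rng.standard_normal((n, d)) * np.sqrt(lam)          # one fixed design
_, s, vh = np.linalg.svd(X / np.sqrt(n), full_matrices=False)
M = np.diag(1.0 / s) @ vh @ X.T / n                      # xi = M @ eps = Lambda_hat^{-1/2} U_hat^T X^T eps / n

# each row of M has Euclidean norm 1/sqrt(n), so xi_j | X ~ N(0, sigma^2 / n)
sup_norms = [np.max(np.abs(M @ (sigma * rng.standard_normal(n)))) for _ in range(reps)]

r = len(s)
benchmark = sigma * np.sqrt(2.0 * np.log(2.0 * r / delta) / n)   # Gaussian union-bound scale
print(np.quantile(sup_norms, 1.0 - delta), benchmark)            # observed quantile vs benchmark
```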
Proof of Theorem 3.1

Using $\widehat\Sigma = \widehat U\widehat\Lambda\widehat U^\top$ we write for $\widehat\beta$ from (2.1)
$$
\mathrm{MSE}(\widehat\beta) = (\widehat\beta - \beta)^\top\widehat\Sigma(\widehat\beta - \beta) = \bigl\| \widehat\Lambda^{1/2}\widehat U^\top\widehat\beta - \widehat\Lambda^{1/2}\widehat U^\top\beta \bigr\|^2 .
$$
Now we plug in our estimator $\widehat\beta$ and $\boldsymbol{Y} = \boldsymbol{X}\beta + \varepsilon$ to get
$$
\mathrm{MSE}(\widehat\beta)
= \Bigl\| \mathrm{SOFT}_\tau\Bigl[ \widehat\Lambda^{-1/2}\widehat U^\top\frac{\boldsymbol{X}^\top\boldsymbol{Y}}{n} \Bigr] - \widehat\Lambda^{1/2}\widehat U^\top\beta \Bigr\|^2
= \Bigl\| \mathrm{SOFT}_\tau\Bigl[ \widehat\Lambda^{1/2}\widehat U^\top\beta + \widehat\Lambda^{-1/2}\widehat U^\top\frac{\boldsymbol{X}^\top\varepsilon}{n} \Bigr] - \widehat\Lambda^{1/2}\widehat U^\top\beta \Bigr\|^2
= \bigl\| \mathrm{SOFT}_\tau[\theta + \xi] - \theta \bigr\|^2 ,
$$
where we recall the canonical coefficients $\theta = \widehat\Lambda^{1/2}\widehat U^\top\beta$ from Definition 2.1 and $\xi$ from Lemma 6.1. From now on, the proof basically repeats the classical derivation for the soft and hard thresholding. Let us analyze the $j$-th component on the event $\Omega_0$ from Lemma 6.1, which has probability at least $1 - \delta$.
• If $|\theta_j + \xi_j| > \tau$, then $|\theta_j| \ge \tau - |\xi_j| \ge \tau/2$ and
$$
\bigl| \mathrm{SOFT}_\tau[\theta_j + \xi_j] - \theta_j \bigr| = |\theta_j + \xi_j \pm \tau - \theta_j| = |\xi_j \pm \tau| \le |\xi_j| + \tau \le \tfrac{3}{2}\tau \le 3\min\bigl(\tau, |\theta_j|\bigr),
$$
where $\pm$ means that we take either $+$ or $-$ depending on the sign of $(\theta_j + \xi_j)$, but this does not play any role.
• If $|\theta_j + \xi_j| \le \tau$, then $|\theta_j| \le \tau + |\xi_j| \le \tfrac{3}{2}\tau$ and
$$
\bigl| \mathrm{SOFT}_\tau[\theta_j + \xi_j] - \theta_j \bigr| = |0 - \theta_j| = |\theta_j| \le 3\min\bigl(\tau, |\theta_j|\bigr).
$$
Note that for any $0 \le q \le 2$,
$$
\min\bigl(\tau, |\theta_j|\bigr) \le \tau^{1 - q/2}\,|\theta_j|^{q/2}
$$
(we use the convention $0^0 = 0$). Thus, on $\Omega_0$,
$$
\mathrm{MSE}(\widehat\beta) = \bigl\| \mathrm{SOFT}_\tau[\theta + \xi] - \theta \bigr\|^2 \le 9\sum_{j=1}^{r}\min\bigl(\tau, |\theta_j|\bigr)^2 \le 9\,\tau^{2-q}\sum_{j=1}^{r}|\theta_j|^{q} = 9\,\tau^{2-q}\,\|\theta\|_q^q ,
$$
using the convention $\|\cdot\|_0^0 = \|\cdot\|_0$. Taking the infimum over $q \in [0, 2]$, dividing by $\beta^\top\widehat\Sigma\beta = \mathrm{MSE}(0)$, and recalling the definitions of SNR and $D^{\mathrm{eff}}_{q,d}(\widehat\Sigma, \beta)$, we conclude the proof.
Proof of Theorem 3.2

To begin with, we state the following well-known result on the concentration of the sample covariance around the true covariance in terms of the effective rank; see Koltchinskii and Lounici (2017), Theorem 9, and also Vershynin (2018), Theorem 9.2.4 and Exercise 9.2.5.
Lemma 6.2.
Suppose Assumption 3.2 is fulfilled. Then, with probability at least $1 - \delta$,
$$
\|\widehat\Sigma - \Sigma\| \le C\,\|\Sigma\|\left( \sqrt{\frac{r_{\mathrm{eff}}[\Sigma] + \log(1/\delta)}{n}} + \frac{r_{\mathrm{eff}}[\Sigma] + \log(1/\delta)}{n} \right).
$$
Using Assumption 3.4 we can keep only the first term in the bound above. Let $\Omega_1$ be the event on which this bound holds. Our main tool for proving the main result is the beautiful work by Jirak and Wahl (2018), which develops tight relative perturbation bounds for the eigenvalues and eigenvectors of a covariance matrix. Let us describe the framework of that paper. By Assumption 3.3 we consider the case of simple eigenvalues of $\Sigma$. The following quantities play an important role: the relative rank
$$
r_j(\Sigma) \stackrel{\mathrm{def}}{=} \sum_{\substack{l=1 \\ l \neq j}}^{d} \frac{\lambda_l}{|\lambda_j - \lambda_l|} + \frac{\lambda_j}{\min(\lambda_{j-1} - \lambda_j,\, \lambda_j - \lambda_{j+1})} \quad\text{for } j \in [d]
$$
(here $\lambda_0 = +\infty$ and $\lambda_{d+1} = 0$ for convenience) and the entries of $\Sigma^{-1/2}(\widehat\Sigma - \Sigma)\Sigma^{-1/2}$,
$$
\eta_{ll'} \stackrel{\mathrm{def}}{=} \frac{u_l^\top(\widehat\Sigma - \Sigma)u_{l'}}{\sqrt{\lambda_l\lambda_{l'}}} \quad\text{for } l, l' \in [d].
$$
Relative perturbation bounds for the $j$-th eigenvalue and eigenvector hold under the condition that there exists $x$ such that $|\eta_{ll'}| \le x$ for all $l, l' \in [d]$ and $r_j(\Sigma) \le 1/(3x)$. The following lemma helps to control the first condition.
Lemma 6.3.
Suppose Assumption 3.2 and Assumption 3.4 hold. Then, with probability at least $1 - \delta$,
$$
\max_{l, l' \in [d]} |\eta_{ll'}| \le \epsilon, \quad\text{where}\quad \epsilon = \epsilon_{n,d,\delta} \stackrel{\mathrm{def}}{=} C\sqrt{\frac{\log(d/\delta)}{n}}
$$
for some constant $C$. Define $\Omega_2$ to be the event on which the inequality from the previous lemma holds. So, the relative perturbation bounds hold true on $\Omega_2$ for those indices $j$ for which $r_j(\Sigma) \le 1/(3\epsilon)$. We would like to have this property for as many indices as possible. Under Assumption 3.3 we have (see Jirak and Wahl (2018), inequalities (3.30); Jirak (2016), Lemma 7.13; Cardot, Mas and Sarda (2007), Lemma 6.1)
$$
r_j(\Sigma) \lesssim \sum_{\substack{l=1 \\ l \neq j}}^{d} \frac{\lambda_l}{|\lambda_j - \lambda_l|} \le C\, j\log(j).
$$
Note that with $k^* \stackrel{\mathrm{def}}{=} (\epsilon^2\log^2(1/\epsilon))^{-1/3}$ we indeed have $r_j(\Sigma) \le 1/(3\epsilon)$ for all $j \in [k^*]$ due to Assumption 3.4. Hence, the following relative perturbation bounds from Jirak and Wahl (2018) hold true. Lemma 6.4.
For all $j \in [k^*]$, on $\Omega_2$ it holds that
$$
|\widehat\lambda_j - \lambda_j| \le C\epsilon\,\lambda_j
\qquad\text{and}\qquad
\|\widehat u_j - u_j\| \le C\epsilon\sqrt{\sum_{\substack{l=1 \\ l\neq j}}^{d} \frac{\lambda_j\lambda_l}{(\lambda_j - \lambda_l)^2}} .
$$
Furthermore, for all $j \in [k^*]$ and $l \in [d]$, $l \neq j$, on $\Omega_2$ it holds that
$$
|\widehat u_j^\top u_l| \le C\epsilon\,\frac{\sqrt{\lambda_j\lambda_l}}{|\lambda_j - \lambda_l|} .
$$
Note that the bounds from the previous lemma apply even for larger indices $j$, up to order $(\epsilon\log(1/\epsilon))^{-1}$ (that is, of order $n^{1/2}$), while we restrict $k^*$ to be of order $n^{1/3}$. Later in the proof it will become clear how this specific $k^*$ arises. Now we are ready to proceed to the main part of the proof.

Proof of Theorem 3.2.
We prove the theorem for k = k ∗ , and it will be clear that the sameproof works with any k < k ∗ . The proof for part (i) and part (ii) coincides up to the last step.Denote τ (cid:48) to be the general thresholding level, which is τ for part (i) and τ for part(ii). Usingthe definition of (cid:98) β given in (2.1), the eigendecompositions Σ = UΛ U (cid:62) , (cid:98) Σ = (cid:98) U (cid:98) Λ (cid:98) U (cid:62) and themodel YYY = XXX β + ε , write the prediction error as PE ( (cid:98) β ) = ( (cid:98) β − β ) (cid:62) Σ ( (cid:98) β − β )= (cid:107) ΛU (cid:62) (cid:98) U (cid:98) Λ − SOFT τ (cid:48) (cid:20) (cid:98) Λ − (cid:98) U (cid:62) XXX (cid:62)
YYY n (cid:21) − ΛU (cid:62) β (cid:107) = (cid:107) ΛU (cid:62) (cid:98) U (cid:98) Λ − SOFT τ (cid:48) [ (cid:98) Λ (cid:98) U (cid:62) β + ξ ] − ΛU (cid:62) β (cid:107) with ξ from Lemma 6.1. Let us add and subtract ΛU (cid:62) (cid:98) U (cid:98) U (cid:62) β inside the norm and apply (cid:107) a + b (cid:107) ≤ (cid:107) a (cid:107) + 2 (cid:107) b (cid:107) : PE ( (cid:98) β ) ≤ (cid:107) ΛU (cid:62) β − ΛU (cid:62) (cid:98) U (cid:98) U (cid:62) β (cid:107) ++ 2 (cid:107) ΛU (cid:62) (cid:98) U (cid:98) Λ − SOFT τ (cid:48) (cid:104) (cid:98) Λ (cid:98) U (cid:62) β + ξ (cid:105) − ΛU (cid:62) (cid:98) U (cid:98) U (cid:62) β (cid:107) =: I + I .
31e first deal with I : I (cid:107) ΛU (cid:62) β − ΛU (cid:62) (cid:98) U (cid:98) U (cid:62) β (cid:107) = β (cid:62) ( I d − (cid:98) U (cid:98) U (cid:62) ) Σ ( I d − (cid:98) U (cid:98) U (cid:62) ) β = β (cid:62) ( I d − (cid:98) U (cid:98) U (cid:62) ) ( Σ − (cid:98) Σ ) ( I d − (cid:98) U (cid:98) U (cid:62) ) β ≤ (cid:107) (cid:98) Σ − Σ (cid:107) (cid:107) ( I d − (cid:98) U (cid:98) U (cid:62) ) β (cid:107) ≤ (cid:107) (cid:98) Σ − Σ (cid:107) (cid:107) β (cid:107) ≤ C (cid:107) Σ (cid:107)(cid:107) β (cid:107) (cid:114) r eff [ Σ ] + log(1 /δ ) n , where the last inequality holds on Ω due to Lemma 6.2 and Assumption 3.4.Next, we focus on I . We will decompose it into two parts: one will correspond to the first k ∗ eigenvectors and eigenvalues, while the other will correspond to the rest ( r − k ∗ ). Let ussplit (cid:98) U = [ (cid:98) U ≤ k ∗ (cid:98) U >k ∗ ] and (cid:98) Λ = (cid:34) (cid:98) Λ ≤ k ∗ O k ∗ × ( r − k ∗ ) O ( r − k ∗ ) × k ∗ (cid:98) Λ >k ∗ (cid:35) , where (cid:98) Λ ≤ k ∗ ∈ R k ∗ × k ∗ , (cid:98) U ≤ k ∗ ∈ R d × k ∗ correspond to the first k ∗ eigenvalues and eigenvectors,while (cid:98) Λ >k ∗ ∈ R ( r − k ∗ ) × ( r − k ∗ ) , (cid:98) U >k ∗ ∈ R d × ( r − k ∗ ) correspond to the rest. Also let ξ = (cid:2) ξ (cid:62)≤ k ∗ ξ (cid:62) >k ∗ (cid:3) (cid:62) with ξ ≤ k ∗ ∈ R k ∗ and ξ >k ∗ ∈ R r − k ∗ . Then (cid:98) U (cid:98) Λ − SOFT τ (cid:48) (cid:104) (cid:98) Λ (cid:98) U (cid:62) β + ξ (cid:105) − (cid:98) U (cid:98) U (cid:62) β == (cid:98) U ≤ k ∗ (cid:98) Λ − ≤ k ∗ SOFT τ (cid:48) (cid:104) (cid:98) Λ ≤ k ∗ (cid:98) U (cid:62)≤ k ∗ β + ξ ≤ k ∗ (cid:105) − (cid:98) U ≤ k ∗ (cid:98) U (cid:62)≤ k ∗ β + (cid:98) U >k ∗ (cid:98) Λ − >k ∗ SOFT τ (cid:48) (cid:104) (cid:98) Λ >k ∗ (cid:98) U (cid:62) >k ∗ β + ξ >k ∗ (cid:105) − (cid:98) U >k ∗ (cid:98) U (cid:62) >k ∗ β . Again applying (cid:107) a + b (cid:107) ≤ (cid:107) a (cid:107) + 2 (cid:107) b (cid:107) we obtain I (cid:107) ΛU (cid:62) (cid:98) U (cid:98) Λ − SOFT τ (cid:48) (cid:104) (cid:98) Λ (cid:98) U (cid:62) β + ξ (cid:105) − ΛU (cid:62) (cid:98) U (cid:98) U (cid:62) β (cid:107) ≤≤ (cid:107) ΛU (cid:62) (cid:98) U ≤ k ∗ (cid:98) Λ − ≤ k ∗ SOFT τ (cid:48) (cid:104) (cid:98) Λ ≤ k ∗ (cid:98) U (cid:62)≤ k ∗ β + ξ ≤ k ∗ (cid:105) − ΛU (cid:62) (cid:98) U ≤ k ∗ (cid:98) U (cid:62)≤ k ∗ β (cid:107) + 2 (cid:107) ΛU (cid:62) (cid:98) U >k ∗ (cid:98) Λ − >k ∗ SOFT τ (cid:48) (cid:104) (cid:98) Λ >k ∗ (cid:98) U (cid:62) >k ∗ β + ξ >k ∗ (cid:105) − ΛU (cid:62) (cid:98) U >k ∗ (cid:98) U (cid:62) >k ∗ β (cid:107) =: I + I . So, to upper bound I we will upper bound I and I separately.Consider I . Denote γ def = (cid:98) Λ − >k ∗ SOFT τ (cid:48) (cid:104) (cid:98) Λ >k ∗ (cid:98) U (cid:62) >k ∗ β + ξ >k ∗ (cid:105) − (cid:98) U (cid:62) >k ∗ β ∈ R r − k ∗ . Let us analyze j -th component γ j , for j ∈ [ r − k ∗ ], using the definition of SOFT τ (cid:48) [ · ]. We havetwo cases: • If | (cid:98) λ / j + k ∗ (cid:98) u (cid:62) j + k ∗ β + ξ j + k ∗ | > τ (cid:48) , then γ j = (cid:98) λ − / j + k ∗ ( ξ j + k ∗ ± τ (cid:48) ) (the actual sign will play norole). Since by Lemma 6.1 on Ω (cid:98) λ / j + k ∗ | (cid:98) u (cid:62) j + k ∗ β | ≥ τ (cid:48) − | ξ j + k ∗ | ≥ τ (cid:48) / , we have | γ j | ≤ (cid:98) λ − / j + k ∗ ( | ξ j + k ∗ | + τ (cid:48) ) ≤ (cid:98) λ − / j + k ∗ · τ (cid:48) ≤ | (cid:98) u (cid:62) j + k ∗ β | . 
If | (cid:98) λ / j + k ∗ (cid:98) u (cid:62) j + k ∗ β + ξ j + k ∗ | ≤ τ (cid:48) , then γ j = − (cid:98) u (cid:62) j + k ∗ β , and we directly get | γ j | = | (cid:98) u (cid:62) j + k ∗ β | . In any case, | γ j | ≤ | (cid:98) u (cid:62) j + k ∗ β | for all j ∈ [ r − k ∗ ], and therefore (cid:107) γ (cid:107) ≤ (cid:107) β (cid:107) on Ω . Hence, I (cid:107) ΛU (cid:62) (cid:98) U >k ∗ γ (cid:107) = γ (cid:62) (cid:98) U (cid:62) >k ∗ Σ (cid:98) U >k ∗ γ = γ (cid:62) (cid:98) U (cid:62) >k ∗ ( Σ − (cid:98) Σ ) (cid:98) U >k ∗ γ + γ (cid:62) (cid:98) U (cid:62) >k ∗ (cid:98) Σ (cid:98) U >k ∗ γ ≤ (cid:107) (cid:98) Σ − Σ (cid:107)(cid:107) (cid:98) U >k ∗ γ (cid:107) + γ (cid:62) (cid:98) Λ >k ∗ γ ≤ (cid:107) (cid:98) Σ − Σ (cid:107)(cid:107) γ (cid:107) + (cid:98) λ k ∗ +1 (cid:107) γ (cid:107) . We bound the first term on the right-hand side on Ω by Lemma 6.2, and for the second termon Ω holds (cid:98) λ k ∗ +1 ≤ (cid:98) λ k ∗ ≤ (1 + C(cid:15) ) λ k ∗ ≤ C (cid:48) λ k ∗ due to Lemma 6.4 and Assumption 3.4. Takinginto account (cid:107) γ (cid:107) ≤ C (cid:107) β (cid:107) on Ω , we get on Ω ∩ Ω ∩ Ω I ≤ C (cid:107) Σ (cid:107)(cid:107) β (cid:107) (cid:32)(cid:114) r eff [ Σ ] + log(1 /δ ) n + λ k ∗ λ (cid:33) . Finally, it is left to bound I . Denote (cid:98) ω def = SOFT τ (cid:48) (cid:104) (cid:98) Λ ≤ k ∗ (cid:98) U (cid:62)≤ k ∗ β + ξ ≤ k ∗ (cid:105) − (cid:98) Λ ≤ k ∗ (cid:98) U (cid:62)≤ k ∗ β ∈ R k ∗ . Then, I (cid:107) ΛU (cid:62) (cid:98) U ≤ k ∗ (cid:98) Λ − ≤ k ∗ (cid:98) ω (cid:107) ≤ (cid:107) ΛU (cid:62) (cid:98) U ≤ k ∗ (cid:98) Λ − ≤ k ∗ (cid:107) (cid:107) (cid:98) ω (cid:107) . An upper bound on (cid:107) ΛU (cid:62) (cid:98) U ≤ k ∗ (cid:98) Λ − ≤ k ∗ (cid:107) is provided in the next lemma. Lemma 6.5.
Suppose Assumption 3.2 holds. Then on $\Omega_2$ it holds that $\bigl\|\Lambda^{1/2}U^\top\widehat U_{\le k^*}\widehat\Lambda_{\le k^*}^{-1/2}\bigr\| \le C$. Remark 6.1.
The previous lemma is the only place where we use $k^* = (\epsilon^2\log^2(1/\epsilon))^{-1/3}$. The rest of the proof would go through if $k^*$ were defined as $(\epsilon\log(1/\epsilon))^{-1}$. Remark 6.2.
Interestingly, a matrix closely related to $\Lambda^{1/2}U^\top\widehat U_{\le k^*}\widehat\Lambda_{\le k^*}^{-1/2}$ also appears in Bartlett et al. (2020). The main difficulty of their proof is to find regimes of eigenvalues such that, for $C$ defined as
$$
C \stackrel{\mathrm{def}}{=} (\boldsymbol{X}\boldsymbol{X}^\top)^{-1}\boldsymbol{X}\,\Sigma\,\boldsymbol{X}^\top(\boldsymbol{X}\boldsymbol{X}^\top)^{-1},
$$
one has $\mathrm{Tr}[C] = o(1)$ as $n\to\infty$. Using the SVD $n^{-1/2}\boldsymbol{X} = \widehat V\widehat\Lambda^{1/2}\widehat U^\top$ one can show
$$
\mathrm{Tr}[C] = \frac1n\,\mathrm{Tr}\bigl[\widehat\Lambda^{-1/2}\widehat U^\top U\Lambda U^\top\widehat U\widehat\Lambda^{-1/2}\bigr],
$$
while in Lemma 6.5 we essentially upper bound the operator norm of the somewhat simpler (in the sense that we truncate the sample eigenvalues and eigenvectors beyond the $k^*$-th) matrix $\widehat\Lambda_{\le k^*}^{-1/2}\widehat U_{\le k^*}^\top U\Lambda U^\top\widehat U_{\le k^*}\widehat\Lambda_{\le k^*}^{-1/2}$. The latter task turns out to be much easier and does not require specific regimes of eigenvalues, unlike the former one.
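The trace identity above is purely linear-algebraic and is easy to verify numerically. The following small check (our own illustration, with the exponents read as $\widehat\Lambda^{-1/2}$ on each side, an arbitrary Gaussian design and a diagonal $\Sigma$) computes both sides for $n \le d$:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 50, 120
lam = 1.0 / np.arange(1, d + 1, dtype=float) ** 2
Sigma = np.diag(lam)                                  # true covariance, U = I_d
X = rng.standard_normal((n, d)) * np.sqrt(lam)        # rows x_i ~ N(0, Sigma)

# left-hand side: Tr[(X X^T)^{-1} X Sigma X^T (X X^T)^{-1}]
G_inv = np.linalg.inv(X @ X.T)
lhs = np.trace(G_inv @ X @ Sigma @ X.T @ G_inv)

# right-hand side via the SVD X / sqrt(n) = V_hat Lambda_hat^{1/2} U_hat^T
_, s, vh = np.linalg.svd(X / np.sqrt(n), full_matrices=False)
U_hat = vh.T                                          # sample eigenvectors, d x n
inner = np.diag(1.0 / s) @ U_hat.T @ Sigma @ U_hat @ np.diag(1.0 / s)
rhs = np.trace(inner) / n

print(lhs, rhs)                                       # the two values agree up to rounding error
```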
33o deal with (cid:107) (cid:98) ω (cid:107) , we first state the following lemma which has two parts, one of whichwill help to conclude the proof of claim (i), and the other one will be useful for claim (ii). Lemma 6.6. On Ω holds(i) (cid:107) (cid:98) Λ ≤ k ∗ (cid:98) U (cid:62)≤ k ∗ β − Λ ≤ k ∗ U (cid:62)≤ k ∗ β (cid:107) ≤ C (cid:107) Σ (cid:107)(cid:107) β (cid:107) (cid:15) (cid:32) r eff [ Σ ] + (cid:15) k ∗ (cid:88) j =1 λ j j λ (cid:33) . (ii) (cid:107) (cid:98) Λ ≤ k ∗ (cid:98) U (cid:62)≤ k ∗ β − Λ ≤ k ∗ U (cid:62)≤ k ∗ β (cid:107) ∞ ≤ τ ∞ def = C (cid:107) Σ (cid:107) / (cid:107) β (cid:107) (cid:15) / max j ∈ [ k ∗ ] (cid:18) λ j (1 + (cid:15)j ) λ (cid:19) / . So, to deal with part (ii) of Theorem 3.2, we notice that our thresholding level τ (cid:48) = τ = τ + τ ∞ ,where τ from Lemma 6.1 is responsible for the noise and τ ∞ from Lemma 6.6 is responsible forthe estimation of the eigenvalues and eigenvectors. Now we analyze the components (cid:98) ω j of (cid:98) ω for j ∈ [ k ∗ ]. • If | (cid:98) λ / j (cid:98) u (cid:62) j β + ξ j | > τ , then (cid:98) ω j = ξ j ± τ (the sign again doesn’t matter). Moreover, due toLemma 6.1 and Lemma 6.6 (ii) on Ω ∩ Ω | λ / j u (cid:62) j β | ≥ τ − | ξ j | − | (cid:98) λ / j (cid:98) u (cid:62) j β − λ / j u (cid:62) j β | ≥ τ − τ − τ ∞ τ . Hence, | (cid:98) ω j | ≤ τ , | λ / j u (cid:62) j β | ) . • If | (cid:98) λ / j (cid:98) u (cid:62) j β + ξ j | ≤ τ , then (cid:98) ω j = − (cid:98) λ / j (cid:98) u (cid:62) j β . Furthermore, again by Lemma 6.1 andLemma 6.6 (ii) on Ω ∩ Ω | λ / j u (cid:62) j β | ≤ τ + | ξ j | + | (cid:98) λ / j (cid:98) u (cid:62) j β − λ / j u (cid:62) j β | ≤ τ + τ τ ∞ τ . Thus, | (cid:98) ω j | ≤ τ , | λ / j u (cid:62) j β | ) . In both cases, | (cid:98) ω j | ≤ τ , | λ / j u (cid:62) j β | ), and based on the same derivation as in the proof ofTheorem 3.1, we obtain on Ω ∩ Ω (cid:107) (cid:98) ω (cid:107) ≤ q ∈ [0 , (cid:8) τ − q (cid:107) Λ ≤ k ∗ U (cid:62)≤ k ∗ β (cid:107) qq (cid:9) . For part (i) we act slightly differently. Now τ (cid:48) = τ . We decompose (cid:98) ω = ω + ∆ ω , where ω def = SOFT τ (cid:2) Λ ≤ k ∗ U (cid:62)≤ k ∗ β + ξ ≤ k ∗ (cid:3) − Λ ≤ k ∗ U (cid:62)≤ k ∗ β ∈ R k ∗ .
34y Lemma 6.6 (i) it is easy to bound on Ω (cid:107) ∆ ω (cid:107) = (cid:107) (cid:98) ω − ω (cid:107) ≤ C (cid:107) Σ (cid:107)(cid:107) β (cid:107) (cid:15) (cid:32) r eff [ Σ ] + (cid:15) k ∗ (cid:88) j =1 λ j j λ (cid:33) . The norm (cid:107) ω (cid:107) with thresholding at level τ can be bounded as in the proof of Theorem 3.1:on Ω (cid:107) ω (cid:107) ≤ q ∈ [0 , (cid:8) τ − q (cid:107) Λ ≤ k ∗ U (cid:62)≤ k ∗ β (cid:107) qq (cid:9) , which, together with the bound on (cid:107) ∆ ω (cid:107) , gives bound on (cid:107) (cid:98) ω (cid:107) on Ω ∩ Ω .Putting all the bounds for I and I (in particular, for I and I ) together on the intersectionof high probability events Ω ∩ Ω ∩ Ω , adjusting δ → δ/ − δ , we conclude the proof. Proof of Theorem 3.4
Similarly to the proof of Theorem 3.1, we write for (cid:98) β from (2.2) MSE ( (cid:98) β ) = (cid:13)(cid:13)(cid:13) (cid:98) Λ − ϕ T τ (cid:104) (cid:98) Λ ϕ ( θ + ξ ) (cid:105) − θ (cid:13)(cid:13)(cid:13) = r (cid:88) j =1 (cid:12)(cid:12)(cid:12)(cid:98) λ − ϕ/ j T τ [ (cid:98) λ ϕ/ j ( θ j + ξ j )] − θ j (cid:12)(cid:12)(cid:12) . For each individual term we apply the following two bounds. On one hand, (cid:12)(cid:12)(cid:12)(cid:98) λ − ϕ/ j T τ [ (cid:98) λ ϕ/ j ( θ j + ξ j )] − θ j (cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:98) λ − ϕ/ j (cid:16) T τ [ (cid:98) λ ϕ/ j ( θ j + ξ j )] − (cid:98) λ ϕ/ j ( θ j + ξ j ) (cid:17) + ξ j (cid:12)(cid:12)(cid:12) ≤ (cid:98) λ − ϕ/ j τ + | ξ j | ≤ (cid:32) (cid:98) λ ϕ/ (cid:98) λ ϕ/ j + 12 (cid:33) σρ ≤ (cid:98) λ ϕ/ (cid:98) λ ϕ/ j σρ, where the first inequality uses property (ii) from Definition 2.2 and the triangle inequality, andthe second inequality holds on Ω by Lemma 6.1. On the other hand, (cid:12)(cid:12)(cid:12)(cid:98) λ − ϕ/ j T τ [ (cid:98) λ ϕ/ j ( θ j + ξ j )] − θ j (cid:12)(cid:12)(cid:12) ≤ (cid:98) λ − ϕ/ j (cid:12)(cid:12)(cid:12) T τ [ (cid:98) λ ϕ/ j ( θ j + ξ j )] (cid:12)(cid:12)(cid:12) + | θ j |≤ (cid:98) λ − ϕ/ j · c | (cid:98) λ ϕ/ j θ j | + | θ j | = ( c + 1) | θ j | , where the second inequality is due to property (i) from Definition 2.2 applied to z = (cid:98) λ ϕ/ j ( θ j + ξ j )and z (cid:48) = (cid:98) λ ϕ/ j θ j satisfying | z − z (cid:48) | = (cid:98) λ ϕ/ j | ξ j | ≤ τ / by Lemma 6.1. Thus, on Ω MSE ( (cid:98) β ) = (cid:13)(cid:13)(cid:13) (cid:98) Λ − ϕ T τ (cid:104) (cid:98) Λ ϕ ( θ + ξ ) (cid:105) − θ (cid:13)(cid:13)(cid:13) (cid:46) r (cid:88) j =1 min (cid:32) (cid:98) λ ϕ/ (cid:98) λ ϕ/ j σρ, | θ j | (cid:33) . Proof of Corollary 3.5
We need to upper bound the right-hand side of the inequality obtained in Theorem 3.4. Let usdefine the following auxiliary set: P def = (cid:110) j ∈ [ r ] (cid:12)(cid:12)(cid:12) (cid:98) λ j ≥ (cid:98) λ ρ / (2+ ϕ ) (cid:111) .
35e first bound (cid:88) j / ∈P min (cid:32) (cid:98) λ ϕ/ (cid:98) λ ϕ/ j σρ, | θ j | (cid:33) ≤ (cid:88) j / ∈P | θ j | = (cid:88) j / ∈P (cid:98) λ j ( (cid:98) u (cid:62) j β ) ≤ (cid:98) λ (cid:107) β (cid:107) ρ / (2+ ϕ ) , where we used only the definition of P (more specifically its complement). Then, we bound (cid:88) j ∈P min (cid:32) (cid:98) λ ϕ/ (cid:98) λ ϕ/ j σρ, | θ j | (cid:33) ≤ (cid:88) j ∈P (cid:98) λ ϕ/ (cid:98) λ ϕ/ j σρ | θ j | ≤ σρ − ϕ · ϕ +1 (cid:88) j ∈P | θ j | ≤ σρ / (2+ ϕ ) (cid:107) θ (cid:107) , where we again used the definition of P . Now we apply (cid:107) θ (cid:107) ≤ (cid:107) (cid:98) Σ (cid:107) / (cid:107) β (cid:107) r eff [ (cid:98) Σ ] / , andadding the above two inequalities yields the desired statement. Proof of Lemma 6.1.
It is straightforward to verify that ξ is also a sub-Weibull random vectorconditionally on XXX : sup (cid:107) w (cid:107) =1 (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) w (cid:62) (cid:98) Λ − (cid:98) U (cid:62) XXX (cid:62) ε n (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ψ α ≤ √ n sup (cid:107) w (cid:107) =1 (cid:107) w (cid:62) ε (cid:107) ψ α ≤ σ √ n , where the first inequality holds given XXX since (cid:107)
XXX (cid:98) U (cid:98) Λ − w / √ n (cid:107) = 1, and the last inequality isdue to Assumption 3.1. Then, taking e , . . . , e r (the standard basis in R r ) we have P (cid:2) | e (cid:62) j ξ | ≥ t (cid:12)(cid:12) XXX (cid:3) ≤ (cid:0) − (cid:0) t √ n/σ (cid:1) α (cid:1) for j ∈ [ d ] . Applying the union bound and plugging in t = σρ/
2, we get the desired.
Proof of Lemma 6.3.
The proof is pretty standard and can be found in numerous papers. Fix l, l (cid:48) ∈ [ d ]. We have η ll (cid:48) = u (cid:62) l ( (cid:98) Σ − Σ ) u l (cid:48) √ λ l λ l (cid:48) = u (cid:62) l ( Σ − / (cid:98) ΣΣ − / − I d ) u l (cid:48) = 1 n n (cid:88) i =1 (cid:8) ( u (cid:62) l Σ − / x i ) · ( u (cid:62) l (cid:48) Σ − / x i ) − E (cid:2) ( u (cid:62) l Σ − / x i ) · ( u (cid:62) l (cid:48) Σ − / x i ) (cid:3)(cid:9) . Since by Assumption 3.2 Σ − / x i is sub-Gaussian for i ∈ [ n ], then by Lemma 2.7.7 of Vershynin(2018) ( u (cid:62) l Σ − / x i ) · ( u (cid:62) l (cid:48) Σ − / x i ) is sub-Exponential and by Exercise 2.7.10 of Vershynin (2018)its centered version is also sub-Exponential for i ∈ [ n ]. Bernstein’s inequality (e.g. Corollary2.8.3 of Vershynin (2018)) applied to this centered random variables implies P [ | η ll (cid:48) | ≥ t ] ≤ (cid:18) − cn min (cid:18) t C , tC (cid:19)(cid:19) . By union bound, P (cid:20) max l,l (cid:48) ∈ [ d ] | η ll (cid:48) | ≥ t (cid:21) ≤ d exp (cid:18) − cn min (cid:18) t C , tC (cid:19)(cid:19) . t = C (cid:112) log(2 d /δ ) /n for some other properly chosen C and using Assumption 3.4 tomake sure t /c ≤ t/c , we conclude the proof. Proof of Lemma 6.4.
The first part follows from Corollary 2 of Jirak and Wahl (2018), and thesecond part follows from Lemma 4 of Jirak and Wahl (2018). Conditions (2.1) and | η ll (cid:48) | ≤ x for all l, l (cid:48) ∈ [ d ] are satisfied since we work on Ω from Lemma 6.3 and consider j ∈ [ k ∗ ] withproperly defined k ∗ . Proof of Lemma 6.5.
We first apply inequalities (cid:107) ΛU (cid:62) (cid:98) U ≤ k ∗ (cid:98) Λ − ≤ k ∗ (cid:107) = (cid:107) ΛU (cid:62) (cid:98) U ≤ k ∗ Λ − ≤ k ∗ Λ ≤ k ∗ (cid:98) Λ − ≤ k ∗ (cid:107) ≤ (cid:107) ΛU (cid:62) (cid:98) U ≤ k ∗ Λ − ≤ k ∗ (cid:107)(cid:107) Λ ≤ k ∗ (cid:98) Λ − ≤ k ∗ (cid:107)≤ (cid:32)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ΛU (cid:62) (cid:98) U ≤ k ∗ Λ − ≤ k ∗ − (cid:34) I k ∗ O ( d − k ∗ ) × k ∗ (cid:35)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) + 1 (cid:33) (cid:107) Λ ≤ k ∗ (cid:98) Λ − ≤ k ∗ (cid:107) . The spectral norm of Λ ≤ k ∗ (cid:98) Λ − ≤ k ∗ is easy to control: (cid:107) Λ ≤ k ∗ (cid:98) Λ − ≤ k ∗ (cid:107) = (cid:32) max j ∈ [ k ∗ ] λ j (cid:98) λ j (cid:33) / ≤ − C(cid:15) ) / ≤ , where the first inequality is due to Lemma 6.4 on Ω and in the last inequality we assumedthat (cid:15) is small enough by Assumption 3.4.Next, let us focus on ΛU (cid:62) (cid:98) U ≤ k ∗ Λ − ≤ k ∗ − [ I k ∗ O k ∗ × ( d − k ∗ ) ] (cid:62) , which we denote by H for shortness.Denote its columns as h , . . . , h k ∗ . We can bound (cid:96) -norm of each column, using Lemma 6.4,as (cid:107) h j (cid:107) = d (cid:88) l =1 l (cid:54) = j (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) λ / l u (cid:62) l (cid:98) u j λ / j (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + | u (cid:62) j (cid:98) u j − | = d (cid:88) l =1 l (cid:54) = j (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) λ / l u (cid:62) l (cid:98) u j λ / j (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + 12 (cid:107) (cid:98) u j − u (cid:62) j (cid:107) ≤ (cid:15) d (cid:88) l =1 l (cid:54) = j λ / l λ / j λ / j λ / l | λ j − λ l | + C(cid:15) d (cid:88) l =1 l (cid:54) = j λ j λ l ( λ j − λ l ) on Ω . Applying Jirak and Wahl (2018), inequalities (3.30), or Jirak (2016), Lemma 7.13,together with Assumption 3.3, we get (cid:107) h j (cid:107) ≤ (cid:15) j log( j ) + C(cid:15) j . Since j ≤ k ∗ and (cid:15)k ∗ ≤ C , we have (cid:107) h j (cid:107) ≤ C(cid:15) j log( k ∗ ) for all j ∈ [ k ∗ ] . Finally, we have for the Frobenius norm (cid:107) H (cid:107) F on Ω (cid:107) H (cid:107) F = k ∗ (cid:88) j =1 (cid:107) h j (cid:107) ≤ C k ∗ (cid:88) j =1 (cid:15) j log( k ∗ ) = C(cid:15) k ∗ log( k ∗ ) ≤ C (cid:48) , where we used the definition of k ∗ . The inequality between the spectral and the Frobeniusnorms completes the proof. 37 roof of Lemma 6.6. Fix arbitrary j ∈ [ k ∗ ]. We have the following chain of inequalities: | (cid:98) λ / j (cid:98) u (cid:62) j β − λ / j u (cid:62) j β | ≤ (cid:107) (cid:98) λ / j (cid:98) u j − λ / j u j (cid:107) (cid:107) β (cid:107) ≤ (cid:107) (cid:98) λ / j (cid:98) u j − λ / j (cid:98) u j + λ / j (cid:98) u j − λ / j u j (cid:107) (cid:107) β (cid:107) ≤ (cid:16)(cid:12)(cid:12)(cid:12)(cid:98) λ / j − λ / j (cid:12)(cid:12)(cid:12) (cid:107) (cid:98) u j (cid:107) + λ / j (cid:107) (cid:98) u j − u j (cid:107) (cid:17) (cid:107) β (cid:107) ≤ (cid:16) | (cid:98) λ j − λ j | / + λ / j (cid:107) (cid:98) u j − u j (cid:107) (cid:17) (cid:107) β (cid:107) . Applying Lemma 6.4 we have on Ω | (cid:98) λ / j (cid:98) u (cid:62) j β − λ / j u (cid:62) j β | ≤ C (cid:107) β (cid:107) λ / j (cid:15) / + λ / j (cid:15) (cid:118)(cid:117)(cid:117)(cid:117)(cid:116) d (cid:88) l =1 l (cid:54) = j λ j λ l ( λ j − λ l ) ≤ C (cid:107) β (cid:107) (cid:16) λ / j (cid:15) / + λ / j (cid:15) j (cid:17) , where in the second inequality we used Jirak and Wahl (2018), inequalities (3.30), or Jirak(2016), Lemma 7.13, together with Assumption 3.3. 
Taking the maximum over $j \in [k^*]$ we obtain claim (ii), and squaring and summing over $j \in [k^*]$ we obtain claim (i).

References
Bair, E. , Hastie, T. , Paul, D. and
Tibshirani, R. (2006). Prediction by supervisedprincipal components.
J. Amer. Statist. Assoc. , , 473, 119–137. Bartlett, P. , Long, P. , Lugosi, G. and
Tsigler, A. (2020). Benign overfitting in linearregression.
Proc. Natl. Acad. Sci. USA . Belkin, M. (2018). Approximation beats concentration? An approximation view on inferencewith smooth radial kernels.
Proc. Mach. Learn. Res. , , 1–18. Belkin, M. , Hsu, D. and
Xu, J. (2019). Two models of double descent for weak features.
ArXiv:1903.07571 . Bellec, P. , Lecu´e, G. and
Tsybakov, A. (2018). SLOPE meets Lasso: improved oraclebounds and optimality.
Ann. Statist. , , 6B, 3603–3642. Bickel, P. , Ritov, Y. and
Tsybakov, A. (2009). Simultaneous analysis of Lasso andDantzig selector.
Ann. Statist. , , 4, 1705–1732. Bietti, A. and
Mairal, J. (2019). On the inductive bias of Neural Tangent Kernels.
Advances in Neural Information Processing Systems, 12893–12904.
Bogdan, M. , van den Berg, E. , Sabatti, C. , Su, W. and
Candes, E. (2015). SLOPE –adaptive variable selection via convex optimization.
Ann. Appl. Stat. , , 3, 1103–1140. Cardot, H. , Mas, A. and
Sarda, P. (2007). CLT in functional linear regression models.
Probab. Theory Related Fields, , 325–361. Candes, E. and Tao, T. (2007). The Dantzig selector: statistical estimation when p is much larger than n. Ann. Statist., , 6, 2313–2351. Chinot, G. and
Lerasle, M. (2020). Benign overfitting in the large deviation regime.
ArXiv:2003.05838 . Dalalyan, A. , Hebiri, M. and
Lederer, J. (2017). On the prediction performance of theLasso.
Bernoulli , , 1, 552–581. Donoho, D. L. and
Johnstone, I. M. (1994). Ideal spatial adaptation by wavelet shrinkage.
Biometrika , , 425–455. Donoho, D. L. (1995). De-noising by soft-thresholding.
IEEE Trans. Inform. Theory , , 3,613–627. Donoho, D. L. and
Johnstone, I. M. (1995). Adapting to unknown smoothness via waveletshrinkage.
J. Amer. Statist. Assoc. , , 1200–1224. Donoho, D. L. , Johnstone, I. M. , Kerkyacharian, G. and
Picard, D. (1995). Wavelet shrinkage: asymptopia?
J. R. Stat. Soc. Ser. B. Stat. Methodol. , , 301–369. Donoho, D. L. and
Johnstone, I. M. (1998). Minimax estimation via wavelet shrinkage.
Ann. Statist. , ,879–921. Efron, B. , Hastie, T. , Johnstone, I. M. and
Tibshirani, R. (2004). Least angle regression.
Ann. Statist. , , 2, 407–499. Fan, J. (1996). Test of significance based on wavelet thresholding and Neyman’s truncation
J.Amer. Statist. Assoc. , , 434, 674–688. Fan, J. and
Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracleproperties.
J. Amer. Statist. Assoc. , , 456, 1348–1360. Fan, J. , Li, R. , Zhang, C.-H. , Zou, H. (2020).
Statistical foundations of data science. CRC Press.
Fan, J. , Ke, Y. and
Wang, K. (2020). Factor-adjusted regularized model selection.
J. Econometrics, , 71–85. Fan, J., Wang, W. and
Yao, J. (2017). Sufficient forecasting using factor models.
J. Econometrics, , 292–306. Greenshtein, E. and
Ritov, Y. (2004). Persistence in high-dimensional linear predictorselection and the virtue of overparametrization.
Bernoulli , , 6, 971–988. Hastie, T. , Montanari, A. , Rosset, S. and
Tibshirani, R. (2019). Surprises in high-dimensional ridgeless least squares interpolation.
ArXiv:1903.08560. Jirak, M. (2016). Optimal eigen expansions and uniform bounds.
Probab. Theory Related Fields, , 753–799. Jirak, M. and
Wahl, M. (2018). Relative perturbation bounds with applications to empiricalcovariance operators.
ArXiv:1802.02869 . Jolliffe, I. (1982). A note on the use of principal components in regression.
J. R. Stat. Soc. Ser. C. Appl. Stat., , 3, 300–303. Koltchinskii, V. and
Lounici, K. (2017). Concentration inequalities and moment boundsfor sample covariance operators.
Bernoulli , , 1, 110–133. Kuchibhotla, A. K. and
Chakrabortty, A. (2018). Moving beyond sub-Gaussianity in high-dimensional statistics: applications in covariance estimation and linear regression.
ArXiv:1804.02605 . Liang, T. and
Rakhlin, A. (2020). Just interpolate: kernel ”ridgeless” regression can gen-eralize.
Ann. Statist. , to appear.
Ma, S. and
Belkin, M. (2017). Diving into the shallows: a computational perspective onlarge-scale shallow learning.
Advances in Neural Information Processing Systems , 3781–3790.
Paul, D. , Bair, E. , Hastie, T. and
Tibshirani, R. (2008). “Preconditioning” for featureselection and regression in high-dimensional problems.
Ann. Statist. , , 4, 1595–1618. Pearson, K. (1901). On lines and planes of closest fit to systems of points in space.
The London, Edinburgh and Dublin Philosophical Magazine and Journal of Science, , 559–572. Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso.
J. R. Stat. Soc. Ser. B. Stat. Methodol., , 1, 267–288. van de Geer, S. and Bühlmann, P. (2009). On the conditions used to prove oracle results for the Lasso.
Electron. J. Stat., , 1360–1392. Vershynin, R. (2018). High-dimensional probability. An introduction with applications in data science.
Cambridge Series in Statistical and Probabilistic Mathematics . Zumbach, G. (2009). The empirical properties of large covariance matrices.
RiskMetrics Journal, 9.