TTaming the heavy-tailed features by shrinkage and clipping
Ziwei Zhu
Department of Operations Research and Financial EngineeringPrinceton University
Abstract
In this paper, we consider the generalized linear models (GLM) with heavy-tailedfeatures and corruptions. Besides clipping the response, we propose to shrink thefeature vector by its (cid:96) -norm under the low dimensional regime and clip each entryof the feature vector in the high-dimensional regime. Under bounded fourth momentassumptions, we show that the MLE based on shrunk or clipped data enjoys nearlythe minimax optimal rate with exponential deviation bound. Simulations demonstratesignificant improvement in statistical performance by feature shrinkage and clipping inlinear regression with heavy-tailed noise and logistic regression with noisy labels. Wealso apply shrinkage to deep features of MNIST images and find that classifiers trainedby shrunk deep features are fairly robust to noisy labels: it achieves . testing errorin the presence of mislabeled data. Heavy-tailed data abound in modern data analytics. For instance, financial log-returnsand macroeconomic variables usually exhibit heavy tails (Cont (2001)). In genomic study,microarray data are always wildly fluctuated (Liu et al. (2003), Purdom et al. (2005)). In deeplearning, features learned by deep neural nets are generated via highly nonlinear transformationof the original data and thus have no guarantee of exponential-tailed distribution. These real-world cases contradict the common sub-Gaussian or sub-exponential conditions in statisticsliterature. A series of questions thus arise: with heavy-tailed data, can we still achieve goodstatistical properties of the previous standard estimators or testing statistics? If not, isthere a solution to overcome heavy-tailed corruptions and achieve equally well statisticalperformance as with exponential-tailed data?To answer these questions, perhaps the easiest statistical problem to start with is themean estimation problem. It turns out surprisingly, as first pointed out by Catoni (2012),that the empirical mean is far from optimal when data has only a few finite moments. InCatoni (2012), the author instead proposed a novel M-estimator for the population mean1 a r X i v : . [ s t a t . M E ] O c t -2 - x y MLE Boundary
True Boundary -5 - - x y MLE Boundary
True Boundary -5 - - x y Old MLE Boundary
True Boundary
New MLE Boundary (a) Gaussian Features (b) Student’s t Features(c) Shrunk Student’s t FeaturesFigure 1: Logistic Regression with mislabeled data based on different featuresand revealed its sub-Gaussian concentration behavior around the true mean under merelybounded second moment assumptions. The correspondent score function is constructed togrow in a logarithmic fashion when deviation is large, thereby being insensitive to outliersand delivering a robust M-estimator. Later on Devroye et al. (2016), Hsu and Sabato (2016)and Minsker (2017+) established a similar sub-Gaussian concentration property for themedian-of-means estimator (Nemirovsky et al. (1982)). Particularly, Hsu and Sabato (2016)and Minsker (2017+) consider the median-of-means approach under general metric spaces.Beyond the mean estimation problem, robust loss minimization and the median-of-meansapproach are proved to be successful under a great variety of problem setups with heavy-tailed data , e.g., covariance matrix or general matrix estimation (Minsker (2017+); Fanet al. (2017)), empirical loss minimization (Brownlees et al. (2015); Hsu and Sabato (2016)),low-dimensional regression and high-dimensional sparse linear regression (Fan et al. (2017);Sun et al. (2017)), low-rank matrix recovery (Fan et al. (2017)) and so forth.Despite heated research on statistics with heavy-tailed data, very few has studied the effectof heavy-tailed features or designs in general regression problems. It still remains unclearwhether widely spread features or designs are blessings or curses to statistical efficiency. Thismotivates us to consider both heavy-tailed designs and responses under a new variant ofthe generalized linear models (GLM) called corrupted GLM (CGLM), which imposes extrarandom corruptions on the response from the traditional GLM. The introduction of thisadditional corruption is to address the limited expression capacity of the traditional GLM,which can only generate exponential-tailed responses. The CGLM enjoys much broader modelcapacity and embraces a myriad of important real-world problems, e.g., linear regression withheavy-tailed noise, classification with mislabeled data, etc.One main message of our paper is that heavy-tailed features will drastically aggravate theinduced corruptions in the CGLM and downgrade the statistical efficiency of the standardapproach. To shed light on this point, in Figure 1 we contrast the performance of thestandard MLE under the logistic regression model when the features are light-tailed and2eavy-tailed. As we can observe, when data points are wildly spread as illustrated in Panel(b), the boundary derived from MLE deviates far from the true boundary. When we haveGaussian features, however, Panel (a) shows nearly perfect alignment between the MLEboundary and the true boundary. The reason is that outliers, especially those mislabeled,have huge influence on the log-likelihood and can thus easily drive the MLE boundary awayfrom the truth.To tame the wild behavior of heavy-tailed features, we propose to appropriately shrink orclip the features before calculating the M-estimator. Given feature vectors { x i ∈ R d } ni =1 , athreshold value τ and certain norm (cid:107) · (cid:107) , the shrunk features { (cid:101) x si } ni =1 are defined as: (cid:101) x si = min( (cid:107) x i (cid:107) , τ ) · x i (cid:107) x i (cid:107) . In other words, we restrict (cid:107) (cid:101) x si (cid:107) below the level τ . The clipped features { (cid:101) x ci } ni =1 are definedas (cid:101) x cij = min( | x ij | , τ ) · x ij / | x ij | , for any ≤ j ≤ d. For ease of notations, we will drop the superscript "s" or "c" unless it is not clear whethershrinkage and clipping is applied. We will illustrate both theoretically and numerically thatthis data shrinkage and clipping trade little bias for dramatic variance reduction such that thecorrespondent MLE achieves (nearly) the minimax optimal statistical rate. Panels (b) and(c) of Figure 1 compare the performance of MLE under Logistic regression based on originalheavy-tailed features and shrunk features. As we can clearly see, after feature shrinkage, thenew MLE boundary becomes much more aligned with the true boundary than the original one.The advantage of feature shrinkage is tremendous here, since it drags back the mislabeledoutliers and significantly mitigates their perturbation on the log-likelihood. Note that similarideas have been explored to overcome adversarial corruptions on features. For example,Chen et al. (2013) uses the trimmed inner product to robustify standard high-dimensionalregression toolkits and establishes strong statistical guarantees while allowing certain fractionof samples to be arbitrarily corrupted. Feng et al. (2014) proposes to throw out samples withlarge magnitude to avoid adversarial feature corruptions in logistic regression and binaryclassification problems. The major difference between our work and theirs is that (1) we donot assume any corruption on the features; all the corruptions here are imposed on responses;(2) our features are heavy-tailed per se, while in Chen et al. (2013) and Feng et al. (2014) thegenuine designs are sub-Gaussian. As clearly seen, our focus is heavy-tailedness, rather thancorruptions, of features in regression problems.We organize our paper as follows. In Section 2, we introduce in detail the model we workwith, i.e., the corrupted GLM. In Section 3, we elucidate the specific feature shrinkage andclipping methods and present our main theoretical results. Under the low-dimensional regime,we prove that the MLE based on (cid:96) -norm shrunk features and clipped responses enjoys thesame optimal statistical error rate as the standard MLE with sub-Gaussian features andresponse up to a √ log n factor. For high-dimensional models, we show that the (cid:96) -regularizedMLE based on appropriately clipped feature and responses achieve exactly the minimaxoptimal rate. One technical contribution that is worth emphasis is that we provide rigorous3ustification of the (restricted) strong convexity of the negative likelihood based on shrunkor clipped features. In Section 4, we illustrate the numerical superiority of our proposedestimators over the standard MLE under both low-dimensional and high-dimensional regimes.We investigate two important problem setups: the linear regression with heavy-tailed noiseand binary logistic regression with mislabeled data. Besides, we implement our methods onthe MNIST dataset and discover that shrinkage on deep features improves the predictionpower of logistic classifiers when noisy labels occur. In this paper we consider the corrupted generalized linear model (CGLM). Recall the definitionof the standard GLM with the canonical link. Suppose we have n samples { ( y i , x i ) } ni =1 , where y i is the response and x i is the feature vector. Under the GLM with the canonical link, theresponse follows the distribution f n ( y ; X , β ∗ ) = n (cid:89) i =1 f ( y i ; η ∗ i ) = n (cid:89) i =1 (cid:26) c ( y i ) exp (cid:20) y i η ∗ i − b ( η ∗ i ) φ (cid:21)(cid:27) , (2.1)where y = ( y , · · · , y n ) T , X = ( x , · · · , x n ) T , β ∗ ∈ R d is a regression vector, η ∗ i = x Ti β ∗ and φ > is the dispersion parameter. The negative log-likelihood corresponding to (2.1) is given,up to an affine transformation, by (cid:96) n ( β ) = 1 n n (cid:88) i =1 − y i x Ti β + b ( x Ti β ) = 1 n n (cid:88) i =1 − y i η i + b ( η i ) = 1 n n (cid:88) i =1 (cid:96) i ( β ) , (2.2)and the gradient and Hessian of (cid:96) n ( β ) are respectively ∇ (cid:96) n ( β ) = − n n (cid:88) i =1 ( y i − b (cid:48) ( x Ti β ∗ )) x i (2.3)and ∇ (cid:96) n ( β ) = 1 n n (cid:88) i =1 b (cid:48)(cid:48) ( x Ti β ∗ ) x i x Ti , (2.4)For ease of notations, we write the empirical hessian ∇ (cid:96) n ( β ) as H n ( β ) and E ( b (cid:48)(cid:48) ( x Ti β ) x i x Ti ) as H ( β ) .Now we consider further an extra random corruption on the response y i . Suppose we canonly observe corrupted responses z i = y i + (cid:15) i , where (cid:15) i is a random noise. We emphasize herethat the introduction of (cid:15) i significantly improves the flexibility of the original GLM. The newmodel now embraces many more real-world problems with complex structures, e.g., the linearregression model with heavy-tailed noise, the logistic regression with mislabeled samples andso forth. 4o handle the heavy-tailed features and corruptions, we propose to shrink or clip the data { ( z i , x i ) } ni =1 first and feed them to the log-likelihood (2.2) to calculate MLE. More rigorously,our negative log-likelihood is evaluated on the shrunk data { (cid:101) z i , (cid:101) x i } ni =1 as follows. (cid:101) (cid:96) n ( β ) := 1 n n (cid:88) i =1 − (cid:101) z i (cid:101) x Ti β + b ( (cid:101) x Ti β ) . (2.5)We denote the hessian matrix of (cid:101) (cid:96) n ( β ) by (cid:101) H n ( β ) and its population version E (cid:101) H n ( β ) by (cid:101) H ( β ) .In the next section, we will elucidate the specific shrinkage and clipping methods in both thelow-dimensional and high-dimensional regimes and explicitly derive the statistical error ratesof the MLE based on (cid:101) (cid:96) n ( β ) . We collect all the notations here for convenience of presentation. We use regular lettersfor scalars, bold regular letters for vectors and bold capital letters for matrices. Denotethe d -dimensional Euclidean unit sphere by S d − . Denote the Euclidean and (cid:96) -norm ballwith the center β ∗ and radius r by B ( β ∗ , r ) and B ( β ∗ , r ) respectively. We write the set { , · · · , d } as [ d ] . For two scalar sequences { a n } n ≥ and { b n } n ≥ , we say a n (cid:16) b n if thereexist two universal constants C and C such that C b n ≤ a n ≤ C b n for all n ≥ . We use (cid:107) v (cid:107) , (cid:107) v (cid:107) and (cid:107) v (cid:107) to denote the Euclidean norm, (cid:96) norm and (cid:96) -norm of v respectively.Particularly, recall that (cid:107) x i (cid:107) := ( d (cid:80) j =1 x ij ) . We use (cid:107) A (cid:107) op and (cid:107) A (cid:107) max to denote the operatornorm and elementwise max-norm of A respectively and use λ min ( A ) to denote the minimumeigenvalue of A . For any β ∗ ∈ R d and any differential map f : R d → R , define the first-orderTaylor remainder of f ( β ) at β = β ∗ to be δf ( β ; β ∗ ) := f ( β ) − f ( β ∗ ) − ∇ f ( β ∗ ) T ( β − β ∗ ) . From now on, we refer to some quantities as constants if they are independent of thesample size n , the dimension d , and the sparsity s of β ∗ in the high-dimensional regime. The standard MLE estimator is defined as (cid:98) β := argmin β ∈ R d (cid:96) n ( β ) , where (cid:96) n ( · ) is characterizedas in (2.2). It is widely known that under genuine GLM with bounded features, (cid:98) β enjoys (cid:112) d/n -consistency to the true parameter β ∗ in terms of the Euclidean norm. However, whenthe feature vectors have only bounded moments, there is no guarantee of (cid:112) d/n -consistencyany more, let alone further perturbation (cid:15) i on the response. To reduce the disruption due to5eavy-tailed data, we apply the following (cid:96) − norm shrinkage to feature vectors: (cid:101) x i := min( (cid:107) x i (cid:107) , τ ) (cid:107) x i (cid:107) · x i and also clipping to the response: (cid:101) z i := min( | z i | , τ ) before MLE calculation, where τ and τ are predetermined thresholds. Clipping on theresponse is natural; when | z i | is abnormally large, clipping reduces its magnitude to preventpotential wild corruptions by (cid:15) i . Here we explain more on why we shrink features in terms ofthe (cid:96) -norm rather than other norms. The (cid:96) -norm shrinkage has been proven to be successfulin low-dimensional covariance estimation in Fan et al. (2017). Theorem 6 therein shows thatwhen data have only bounded fourth moment, the (cid:96) -norm shrinkage sample covariance enjoysoperator-norm convergence rate of order O P ( (cid:112) d log d/n ) to the population covariance matrix.This inspires us to apply similar (cid:96) -norm shrinkage to heavy-tailed features to ensure thatthe empirical hessian (cid:101) H n ( β ) is close to its population version H ( β ) and thus well-behaved.We shall rigorize this argument in Lemma 1 later.After data shrinkage and clipping, we minimize the negative log-likelihood based on thenew data { (cid:101) z i , (cid:101) x i } ni =1 with respect to β to derive the M-estimator. Specifically, define (cid:101) (cid:96) n ( β ) := 1 n n (cid:88) i =1 − (cid:101) z i (cid:101) x Ti β + b ( (cid:101) x Ti β ) and we choose (cid:101) β := argmin β ∈ R d (cid:101) (cid:96) n ( β ) to estimate β ∗ . In the sequel, we will show that (cid:107) (cid:101) β − β ∗ (cid:107) = O P ( (cid:112) d log n/n ) with exponential exception probability. One crucial step in thisstatistical error analysis is to establish a uniform strong convexity of (cid:101) (cid:96) ( β ) over β ∈ B ( β ∗ , r ) up to some small tolerance term, where r > is small. The following lemma rigorouslyjustifies this point. Lemma 1.
Suppose the following conditions hold: (1) b (cid:48)(cid:48) ( x Ti β ∗ ) ≤ M < ∞ always holdsand ∀ ω > , ∃ m ( ω ) > such that b (cid:48)(cid:48) ( η ) ≥ m ( ω ) > for | η | ≤ ω ; (2) E x i = , λ min ( E x i x Ti ) ≥ κ > and E ( v T x i ) ≤ R < ∞ for all v ∈ S d − ; (3) (cid:107) β ∗ (cid:107) ≤ L < ∞ .Choose the shrinkage threshold τ (cid:16) ( n/ log n ) . For any < r < and t > , as long as (cid:112) d log n/n is sufficiently small, it holds with probability at least − − t ) that for all ∆ ∈ R d such that (cid:107) ∆ (cid:107) ≤ r , δ (cid:101) (cid:96) n ( β ∗ + ∆ ; β ∗ ) ≥ κ (cid:107) ∆ (cid:107) − Cr (cid:0)(cid:114) tn + (cid:114) dn (cid:1) , where κ and C are some constants. emark 1. Here we explain the conditions for Lemma 1. Condition (1) assumes that theresponse from the GLM has bounded variance and is non-degenerate when η is bounded. Notehere that we do not assume a uniform lower bound of b (cid:48)(cid:48) ( η ) . m ( ω ) is allowed to decayto zero as ω → ∞ . Condition (2) says that the population covariance matrix of the designvector x i is positive definite and x i has bounded fourth moment. Condition (3) is natural;it holds if we have var ( x Ti β ∗ ) < ∞ and λ min (E x i x Ti ) ≥ κ > . Note that the ordinary leastsquare (OLS) estimator has been shown to enjoy consistency under similar bounded fourthmoment conditions (Hsu et al. (2012), Audibert et al. (2011), Oliveira (2016)). Theorem1 later establishes similar results for CGLM that embraces a much broader family of lossfunctions. Remark 2.
When deriving the statistical rate of (cid:101) β in Theorem 1 later, we will let the radiusof the local neighborhood r decay to zero so that the tolerance term r ( (cid:112) t/n + (cid:112) d/n ) isnegligible. Given Lemma 1, we are now ready to derive the statistical error rate (cid:107) (cid:101) β − β ∗ (cid:107) . Theorem 1.
Suppose the conditions of Lemma 1 hold. We further assume that (1) E z i ≤ M < ∞ ; (2) (cid:107) E [ (cid:15) i x i ] (cid:107) ≤ M (cid:112) d/n for some constant M . Choose τ , τ (cid:16) ( n/ log n ) .There exists a constant C > such that for any ξ > , P (cid:16) (cid:107) (cid:101) β − β ∗ (cid:107) ≥ Cξ (cid:114) d log nn (cid:17) ≤ n − ξ . Remark 3.
Condition 1 requires merely bounded fourth moments of the response from CGLM.Condition 2 requires the additional corruption to be nearly uncorrelated with the design, whichis trivially satisfied if E( (cid:15) i | x i ) = 0 . Sometimes the covariance between (cid:15) i and x i does not vanish as n and d grow. For example,in binary logistic regression with a random corruption (cid:15) i , suppose P ( (cid:15) i = − | y i = 1) = p, P ( (cid:15) i = 0 | y i = 1) = 1 − p, P ( (cid:15) i = 1 | y i = 0) = p, P ( (cid:15) i = 0 | y i = 0) = 1 − p. (3.1)where p < . . In other words, we flip the genuine label y i with probability p . Then we have E (cid:15) i x i = E ( (cid:15) i x i · { y i =0 } ) + E ( (cid:15) i x i · { y i =1 } ) = p E ( x i (1 { y i =0 } − { y i =1 } )) = 2 p E ( x i · { y i =0 } ) . The last equality holds because we assume E x i = . Therefore, E (cid:15) i x i ∝ p and if p does notdecay, neither will E (cid:15) i x i . Natarajan et al. (2013) solves this noisy label problem throughminimizing weighted negative log-likelihood (cid:98) β w := argmin β ∈ R d n n (cid:88) i =1 (cid:96) w ( x i , z i ; β ) = argmin β ∈ R d n n (cid:88) i =1 (1 − p ) (cid:96) ( x i , z i ; β ) − p · (cid:96) ( x i , − z i ; β )1 − p . (3.2)7emma 1 therein shows that E (cid:15) i (cid:96) w ( x i , z i ) = (cid:96) ( x i , y i ) . This implies that when the samplesize is sufficiently large, minimizing the weighted negative log-likelihood above is similarto minimizing the negative log-likelihood with true labels. In the presence of heavy-tailedfeatures, however, we propose to replace x i with the (cid:96) -norm shrunk feature (cid:101) x i , i.e., we recruit (cid:101) β w := argmin β ∈ R d n n (cid:88) i =1 (cid:96) w ( (cid:101) x i , z i ; β ) = 1 n n (cid:88) i =1 (1 − p ) (cid:96) ( (cid:101) x i , z i ; β ) − p · (cid:96) ( (cid:101) x i , − z i ; β )1 − p . (3.3)to estimate the regression vector β ∗ . The following corollary establishes the statistical errorrate of (cid:101) β w with exponential deviation bound. Corollary 1.
Under the logistic regression with a random corruption (cid:15) i satisfying (3.1) ,choose τ (cid:16) ( n/ log n ) . Under the conditions of Lemma 1, it holds for some constant C andany ξ > such that as long as (cid:112) d log d/n is sufficiently small, P ( (cid:107) (cid:101) β w − β ∗ (cid:107) ≥ Cξ (cid:114) d log nn ) ≤ n − ξ . Remark 4.
Here we do not need to truncate the response by τ , because in logistic regressionthe response is always bounded. In this section, we consider the regime where the dimension d grows much faster than thesample size n . Recall that the standard (cid:96) -regularized MLE of the regression vector β ∗ underthe GLM is (cid:98) β := argmin β ∈ R d n n (cid:88) i =1 (cid:0) − y i x Ti β + b ( x Ti β ) (cid:1) + λ (cid:107) β (cid:107) , (3.4)where ( y i , x i ) comes from the genuine GLM and λ > is the tuning parameter. Negahban et al.(2012) shows that (cid:107) (cid:98) β − β ∗ (cid:107) = O P ( (cid:112) s log d/n ) under GLM when { x i } ni =1 are sub-Gaussian.However, in the presence of heavy-tailed features x i and corruptions (cid:15) i , the statistical accuracyof (cid:98) β might deteriorate if we directly evaluate the log-likelihood (3.4) on { ( z i , x i ) } ni =1 . In thissection, we aim to develop a robust (cid:96) -regularized MLE for the regression vector β ∗ . Definethe clipped feature vector (cid:101) x i such that for any j ∈ [ d ] , (cid:101) x ij := min( | x ij | , τ ) x ij / | x ij | and clipped response (cid:101) z i = min( | z i | , τ ) z i / | z i | . We propose to evaluate the negative log-likelihood on the truncated data. (cid:101) (cid:96) n ( β ) = 1 n n (cid:88) i =1 − (cid:101) z i (cid:101) x Ti β + b ( (cid:101) x Ti β ) . (cid:101) (cid:96) n ( β ) by (cid:101) H n ( β ) and E ( (cid:101) H n ( β )) by (cid:101) H ( β ) . We studythe following M -estimator as the robust estimator of β ∗ . (cid:101) β = argmin β ∈ R d (cid:101) (cid:96) n ( β ) + λ (cid:107) β (cid:107) . For
S ⊂ [ d ] and |S| = s , define the restricted cone C ( S ) := { v ∈ R d : (cid:107) v S c (cid:107) ≤ (cid:107) v S (cid:107) } .By Lemma 1 in Negahban et al. (2012), when λ > (cid:107)∇ (cid:101) (cid:96) n ( β ) (cid:107) max , (cid:101) β − β ∗ ∈ C ( S ) , whichis a crucial property that gives rise to statistical consistency of (cid:101) β under high-dimensionalregimes. Therefore, in the following we first present a lemma that characterizes the order of (cid:107)∇ β (cid:101) (cid:96) n ( β ∗ ) (cid:107) max . Lemma 2.
Under the following conditions: (1) b (cid:48)(cid:48) ( x Ti β ∗ ) ≤ M < ∞ always holds and ∀ ω > , ∃ m ( ω ) > such that b (cid:48)(cid:48) ( η ) ≥ m ( ω ) > for | η | ≤ ω ; (2) E x ij = 0 , E x ij x ik ≤ R < ∞ for all ≤ j, k ≤ d ; (3) E z i ≤ M and E (cid:15) i ≤ M ; (4) (cid:107) β ∗ (cid:107) ≤ L < ∞ ; (5) | E (cid:15) i x ij | ≤ M / √ n for some universal constant M < ∞ and all ≤ j ≤ d . With τ , τ (cid:16) ( n/ log d ) , it holdsfor some constant C and all ξ > that P (cid:0) (cid:107)∇ (cid:101) (cid:96) ( β ∗ ) (cid:107) max ≥ Cξ (cid:114) log dn (cid:1) ≤ d − ξ . Remark 5.
In this lemma we show that (cid:107)∇ (cid:101) (cid:96) n ( β ∗ ) (cid:107) max = O P ( (cid:112) log d/n ) . In the sequel wewill choose λ (cid:16) (cid:112) log d/n to achieve the minimax optimal rate for (cid:101) β . Another requirement for statistical guarantee of (cid:101) β is the restricted strong convexity (RSC)of (cid:101) (cid:96) n , which is first formulated in Negahban et al. (2012). RSC ensures that (cid:101) (cid:96) n ( β ) is “not tooflat”, so that if | (cid:101) (cid:96) n ( (cid:101) β ) − (cid:101) (cid:96) n ( β ∗ ) | is small, then (cid:101) β and β ∗ are close. In high-dimensional sparselinear regression, RSC is implied by the restricted eigenvalue (RE) condition (Bickel et al.(2009), van de Geer (2007), etc.), a widely studied and acknowledged condition for statisticalerror analysis of the Lasso estimator.Unlike the quadratic loss in linear regression, the negative log-likelihood (cid:101) (cid:96) n ( β ) has itshessian matrix (cid:101) H n ( β ) depend on β , which creates technical difficulty of verifying its RSC.Generally speaking, RSC condition does not hold uniformly over all β ∈ R d . This motivatesus to study localized RSC (LRSC), i.e., RSC with β constrained within a small neighborhoodof β ∗ . This idea was first explored by Fan et al. (2017+) and Sun et al. (2017). Specifically,we say a loss function L ( β ) satisfies LRSC( β ∗ , r, S , κ, τ L ) if for all ∆ ∈ C ( S ) ∩ B ( , r ) , δ L ( β ∗ + ∆ ; β ∗ ) ≥ κ (cid:107) ∆ (cid:107) − τ L , where τ L is a small tolerance term. Later we will see that this localized version of RSC sufficesfor establishing the statistical rate of (cid:101) β . The following lemma verifies the LRSC of (cid:101) (cid:96) n ( β ) . Lemma 3.
Suppose the conditions of Lemma 2 hold. Let S be the true support of β ∗ with |S| = s . Assume that for any v ∈ R d such that v ∈ C ( S ) and (cid:107) v (cid:107) = 1 , < κ ≤ l l l l l l log n E u c li dean S t a t i s t i c a l E rr o r l l l l l l lll t2.1 Noise, Clippedt2.1 Noise, Standardt4.1 Noise, Clippedt4.1 Noise, StandardGaussian Noise, ClippedGaussian Noise, Standard l l l l l l l . . . . . . log n E u c li dean S t a t i s t i c a l E rr o r l l l l l l lll t2.1 Noise, Clippedt2.1 Noise, Standardt4.1 Noise, Clippedt4.1 Noise, StandardGaussian Noise, ClippedGaussian Noise, Standard Standard Gaussian Features t . FeaturesFigure 2: High dimensional sparse linear regression with light-tailed features (left) andheavy-tailed features (right) v T E ( x i x Ti ) v ≤ κ < ∞ . Set τ (cid:16) ( n/ log d ) . For any < r < and t > , as long as s log d/n is sufficiently small, it holds with probability at least − − t ) that for all ∆ ∈ R d such that (cid:107) ∆ (cid:107) ≤ r and ∆ ∈ C ( S ) , δ (cid:101) (cid:96) n ( β ∗ + ∆ ; β ∗ ) ≥ κ (cid:107) ∆ (cid:107) − C r (cid:0)(cid:114) tn + (cid:114) s log dn (cid:1) , where κ and C are some constants. Remark 6.
This lemma is the high-dimensional analogue of Lemma 1. Similarly, in thesequel we will let r converge to zero when analyzing the statistical rate of (cid:101) β so that thetolerance term r ( (cid:112) t/n + (cid:112) s log d/n ) is negligible. The lemma above together with Lemma 2 underpins the statistical guarantee of (cid:101) β asfollows. Theorem 2.
Under the assumptions of Lemma 2 and 3, choose λ = 2 Cξ (cid:112) log d/n and τ , τ (cid:16) ( n/ log d ) , where ξ and C are the same constants as in Lemma 2. It holds for someconstant C that P (cid:16) (cid:107) (cid:101) β − β ∗ (cid:107) ≥ C ξ (cid:114) s log dn (cid:17) ≤ d − ξ . We first consider the high-dimensional sparse linear model y i = x Ti β ∗ + (cid:15) i . We set d = 1000 , n = 100 , , , , , and β ∗ = (1 , , , , , , · · · , T . Recall that in the10 . . . . . . . log n E u c li dean S t a t i s t i c a l E rr o r t2.1 Features, Shrinkage t2.1 Features, Standard t4.1 Features, Shrinkage t4.1 Features, Standard Gaussian Features, Shrinkage
Gaussian Features, Standard . . . . log n E u c li dean S t a t i s t i c a l E rr o r t2.1 Features, Clip t2.1 Features, Standard t4.1 Features, Clip t4.1 Features, Standard Gaussian Features, Clip
Gaussian Features, Standard
Low Dimensions High DimensionsFigure 3: Statistical Error of MLE based on minimizing (cid:101) (cid:96) wn ( β ) with mislabeled datahigh-dimensional regime, we propose elementwise clipping on both the heavy-tailed featuresand responses. In Figure 2, we compare estimation error of the (cid:96) -regularized least squaresestimators based on clipped data and original data under standard Gaussian features and t . features respectively. All feature vectors { x i } ni =1 are i.i.d. generated and within each x i , alldimensions { x ij } dj =1 are i.i.d. { (cid:15) i } ni =1 are i.i.d. noises that are independent to the featuresand we adjust the magnitude of the noise such that SD ( (cid:15) i ) = 5 regardless of the distributionit conforms to. The clipping threshold levels τ , τ and the regularization parameter λ areselected by cross-validation. The plot is based on , independent Monte Carlo simulations.From Figure 2, we first observe that under both light-tailed and heavy-tailed features, theheavier tail (cid:15) i has, the more data clipping improves the statistical accuracy. More importantly,the benefit from data clipping is much more significant in the presence of heavy-tailed features,which justifies our conjecture and theories. In this subsection we consider the logistic regression with mislabeled data as characterized by(3.1). We minimize the weighted negative log-likelihood to derive (cid:98) β w and (cid:101) β w as describedin (3.2) and (3.3) to estimate the regression vector β ∗ and compare their performances. Allthe samples are independently generated and all dimensions of features are independent toeach other as well. The tuning parameters λ and τ are chosen based on cross validation. Weinvestigate both the low-dimensional and high-dimensional regimes. • In the low-dimensional regime, we set d = 10 , n = 100 , , , , , , , β ∗ = (0 . , · · · , . (cid:124) (cid:123)(cid:122) (cid:125) , − . , · · · , − . (cid:124) (cid:123)(cid:122) (cid:125) ) , p = 0 . . The left panel of Figure 3 compares (cid:107) (cid:98) β w − β ∗ (cid:107) and (cid:107) (cid:101) β w − β ∗ (cid:107) under t . , t . and Gaussian features. We can observe that (cid:101) β w significantly outperforms (cid:98) β w under t . and t . features, and they perform equally11ell when features are Gaussian. This perfectly matches our intuition and supports ourtheories. • In the high-dimensional regime, we apply elementwise clipping to x i to derive (cid:101) β w . Weset d = 100 , n = 50 , , , , , , , β ∗ = (1 , , − , , · · · , , p = 0 . .As shown in the right panel of Figure 3, (cid:101) β w enjoys sharper statistical accuracy than (cid:98) β w under all the three feature distributions. The outstanding performance of (cid:101) β w underthe Gaussian feature scenario is particularly surprising. We conjecture that featureclipping here downsizes (cid:107)∇ (cid:101) (cid:96) wn ( β ∗ ) (cid:107) max and thus leads to more effective regularization. We extract deep features of all images of the digits and in the MNIST dataset through apre-trained convolutional neural network, which has . testing error in recognizing to .Readers can refer to Deep MNIST Tutorial ∗ by Google for the details of the architecture ofthe neural nets. We aim to use the extracted deep features of images to classify ’s and ’swith artificial mislabelling. We randomly flip the true labels with probability p and minimizethe weighted negative log-likelihood (cid:101) (cid:96) wn ( β ) = (1 /n ) n (cid:80) i =1 (cid:96) w ( (cid:101) x i , z i ; β ) as characterized in (3.3)to estimate the regression vector. We repeat the procedure for times and compare theaverage performance of the resulting MLE based on original deep features { x i } ni =1 and shrunkfeatures { x i / (cid:107) x i (cid:107) · min( (cid:107) x i (cid:107) , . } . The result is presented in Table 1. We discover thatfeature shrinkage robustifies the MLE so well such that the performance is insensitive to themislabelling probability.Mislabel Prob Original Features Shrunk Features
0% 0 .
5% 0 . .
0% 0 . .
4% 0 . .
1% 0 . Table 1: Testing classification error using original deep features and shrunk deep features
References
Audibert, J.-Y. , Catoni, O. et al. (2011). Robust linear least squares regression.
TheAnnals of Statistics Bickel, P. J. , Ritov, Y. and
Tsybakov, A. B. (2009). Simultaneous analysis of lassoand dantzig selector.
The Annals of Statistics ∗ oucheron, S. , Lugosi, G. and
Massart, P. (2013).
Concentration inequalities: Anonasymptotic theory of independence . OUP Oxford.
Brownlees, C. , Joly, E. and
Lugosi, G. (2015). Empirical risk minimization for heavy-tailed losses.
The Annals of Statistics Catoni, O. (2012). Challenging the empirical mean and empirical variance: a deviationstudy.
Annales de l’Institut Henri Poincaré Chen, Y. , Caramanis, C. and
Mannor, S. (2013). Robust sparse regression underadversarial corruption. In
International Conference on Machine Learning . Cont, R. (2001). Empirical properties of asset returns: stylized facts and statistical issues.
Quantitative Finance . Devroye, L. , Lerasle, M. , Lugosi, G. , Oliveira, R. I. et al. (2016). Sub-gaussianmean estimators.
The Annals of Statistics Fan, J. , Li, Q. and
Wang, Y. (2017). Robust estimation of high-dimensional mean regression.
Journal of Royal Statistical Society, Series B Fan, J. , Liu, H. , Sun, Q. and
Zhang, T. (2017+). I-lamm for sparse learning: Simultaneouscontrol of algorithmic complexity and statistical error.
The Annals of Statistics, to appear.
Fan, J. , Wang, W. and
Zhu, Z. (2017). A shrinkage principle for heavy-tailed data:High-dimensional robust low-rank matrix recovery. arXiv preprint arXiv:1603.08315 . Feng, J. , Xu, H. , Mannor, S. and
Yan, S. (2014). Robust logistic regression andclassification. In
Advances in Neural Information Processing Systems . Hsu, D. , Kakade, S. M. and
Zhang, T. (2012). Random design analysis of ridge regression.In
Conference on Learning Theory . Hsu, D. and
Sabato, S. (2016). Loss minimization and parameter estimation with heavytails.
Journal of Machine Learning Research Ledoux, M. and
Talagrand, M. (2013).
Probability in Banach Spaces: isoperimetry andprocesses . Springer Science & Business Media.
Liu, L. , Hawkins, D. M. , Ghosh, S. and
Young, S. S. (2003). Robust singular valuedecomposition analysis of microarray data.
Proceedings of the National Academy of Sciences
Massart, P. (2000). About the constants in talagrand’s concentration inequalities forempirical processes.
The Annals of Probability Minsker, S. (2017+). Sub-gaussian estimators of the mean of a random matrix withheavy-tailed entries.
The Annals of Statistics, to appear.13 atarajan, N. , Dhillon, I. S. , Ravikumar, P. K. and
Tewari, A. (2013). Learningwith noisy labels. In
Advances in Neural Information Processing Systems . Negahban, S. , Yu, B. , Wainwright, M. J. and
Ravikumar, P. K. (2012). A unifiedframework for high-dimensional analysis of m -estimators with decomposable regularizers. Statistical Science Nemirovsky, A.-S. , Yudin, D.-B. and
Dawson, E.-R. (1982). Problem complexity andmethod efficiency in optimization.
SIAM Review Oliveira, R. I. (2016). The lower tail of random quadratic forms with applications toordinary least squares.
Probability Theory and Related Fields
Purdom, E. , Holmes, S. P. et al. (2005). Error distribution for gene expression data.
Statistical Applications in Genetics and Molecular Biology Sun, Q. , Zhou, W. and
Fan, J. (2017). Adaptive huber regression: Optimality and phasetransition.
Preprint . van de Geer, S. (2007). The deterministic lasso. Seminar für Statistik, EidgenössischeTechnische Hochschule (ETH) Zürich. Vershynin, R. (2010). Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027 . Lemma 4.
Suppose E ( v T x i ) ≤ R for any v ∈ S d − . Define the (cid:96) -norm shrunk samples (cid:101) x i := min( (cid:107) x i (cid:107) , τ ) (cid:107) x i (cid:107) · x i , where τ is a threshold value. Then we have the following:1. (cid:107) (cid:101) x i (cid:101) x Ti − E (cid:101) x i (cid:101) x Ti (cid:107) op ≤ (cid:107) (cid:101) x i (cid:107) + √ R ≤ √ dτ + √ R ; (cid:107) E (( (cid:101) x i (cid:101) x Ti − E (cid:101) x i (cid:101) x Ti ) T ( (cid:101) x i (cid:101) x Ti − E (cid:101) x i (cid:101) x Ti )) (cid:107) op ≤ R ( d + 1);
3. For all ξ > , P (cid:16) (cid:107) (cid:101) Σ n ( τ ) − Σ (cid:107) op ≥ ξ (cid:113) Rd log nn (cid:17) ≤ n − Cξ , where τ (cid:16) (cid:0) nR/ (log n ) (cid:1) / and C is a universal constant.Proof. This result is from Fan et al. (2017). For convenience of adapting the lemma to othersettings, we present its proof here. Notice that (cid:107) (cid:101) x i (cid:101) x Ti − E (cid:101) x i (cid:101) x Ti (cid:107) op ≤ (cid:107) (cid:101) x i (cid:101) x Ti (cid:107) op + (cid:107) E (cid:101) x i (cid:101) x Ti (cid:107) op = (cid:107) (cid:101) x i (cid:107) + √ R ≤ √ dτ + √ R. (5.1)14lso for any v ∈ S d − , we have E ( v T (cid:101) x i (cid:101) x Ti (cid:101) x i (cid:101) x Ti v ) = E ( (cid:107) (cid:101) x i (cid:107) ( v T (cid:101) x i ) ) ≤ E ( (cid:107) x i (cid:107) ( v T x i ) )= d (cid:88) j =1 E ( x ij ( v T x i ) ) ≤ d (cid:88) j =1 (cid:113) E ( x ij ) E ( v T x i ) ≤ Rd Then it follows that (cid:107) E (cid:101) x i (cid:101) x Ti (cid:101) x i (cid:101) x Ti (cid:107) op ≤ Rd . Since (cid:107) ( E (cid:101) x i (cid:101) x Ti ) T E ( (cid:101) x i (cid:101) x Ti ) (cid:107) op ≤ R , (cid:107) E (( (cid:101) x i (cid:101) x Ti − E (cid:101) x i (cid:101) x Ti ) T ( (cid:101) x i (cid:101) x Ti − E (cid:101) x i (cid:101) x Ti )) (cid:107) op ≤ R ( d + 1) . (5.2)By the matrix Bernstein’s inequality (Theorem 5.29 in Vershynin (2010)), we have for someconstant c , P (cid:16) (cid:107) n n (cid:88) i =1 (cid:101) x i (cid:101) x Ti − E (cid:101) x i (cid:101) x Ti (cid:107) op > t (cid:17) ≤ d exp (cid:16) − c (cid:0) nt R ( d + 1) ∧ nt √ dτ + √ R (cid:1)(cid:17) . (5.3)For any v ∈ S d − , it holds that E ( v T ( x i x Ti ) v · {(cid:107) x i (cid:107) ≥ τ } ) ≤ (cid:112) E( v T x i ) P ( (cid:107) x i (cid:107) > τ ) ≤ (cid:114) R dτ = R √ dτ . (5.4)Therefore we have (cid:107) E ( x i x Ti − (cid:101) x i (cid:101) x Ti ) (cid:107) op ≤ R √ d/τ . (5.5)Choose τ (cid:16) ( nR/ log d ) and substitute t with ξ (cid:112) Rd log n/n . Then we reach the finalconclusion by combining the concentration bound and bias bound. Proof of Lemma 1.
Define a contraction function φ ( x ; θ ) = x · {| x |≤ θ } + ( x − θ ) · { θ
Construct an intermediate estimator (cid:101) β η between (cid:101) β and β ∗ : (cid:101) β η = β ∗ + η ( (cid:101) β − β ∗ ) , where η = 1 if (cid:107) (cid:101) β − β ∗ (cid:107) ≤ r and η = r/ (cid:107) (cid:101) β − β ∗ (cid:107) if (cid:107) (cid:101) β − β ∗ (cid:107) > r . Write (cid:101) β η − β ∗ as (cid:101) ∆ η .By Lemma 1, it holds with probability at least − − t ) that κ (cid:107) (cid:101) ∆ η (cid:107) − Cr (cid:0)(cid:114) tn + (cid:114) dn (cid:1) ≤ δ (cid:101) (cid:96) n ( (cid:101) β η ; β ∗ ) ≤ −∇ (cid:101) (cid:96) n ( β ∗ ) T (cid:101) ∆ η ≤ (cid:107)∇ (cid:101) (cid:96) n ( β ∗ ) (cid:107) · (cid:107) (cid:101) ∆ η (cid:107) , which further implies that (cid:107) (cid:101) ∆ η (cid:107) ≤ (cid:107)∇ (cid:101) (cid:96) n ( β ∗ ) (cid:107) κ + (cid:114) c r κ · (cid:16) tn (cid:17) + (cid:114) c r κ · (cid:16) dn (cid:17) . (5.9)Now we derive the rate of (cid:107)∇ (cid:101) (cid:96) n ( β ∗ ) (cid:107) . ∇ (cid:101) (cid:96) n ( β ∗ ) = 1 n n (cid:88) i =1 ( (cid:101) z i − b (cid:48) ( (cid:101) x Ti β ∗ )) (cid:101) x i = 1 n n (cid:88) i =1 (cid:101) z i (cid:101) x i − E (cid:101) z i (cid:101) x i (cid:124) (cid:123)(cid:122) (cid:125) T + E ( (cid:101) z i − b (cid:48) ( (cid:101) x Ti β ∗ )) (cid:101) x i (cid:124) (cid:123)(cid:122) (cid:125) T + 1 n n (cid:88) i =1 b (cid:48) ( (cid:101) x Ti β ∗ ) (cid:101) x i − E ( b (cid:48) ( (cid:101) x Ti β ∗ ) (cid:101) x i ) (cid:124) (cid:123)(cid:122) (cid:125) T . (5.10)where x i is between x i and (cid:101) x i by the mean value theorem. In the following we will bound T , T and T respectively. Bound for T : Define the Hermitian dilation matrix (cid:101) Z i := (cid:101) z i · (cid:18) (cid:101) x Ti (cid:101) x i (cid:19) Note that (cid:107) E (cid:101) Z i (cid:107) op = (cid:107) E (cid:104)(cid:101) z i · (cid:18) (cid:101) x Ti (cid:101) x i T (cid:101) x i (cid:101) x Ti (cid:19)(cid:105) (cid:107) op = max( E ( (cid:101) z i (cid:101) x Ti (cid:101) x i ) , (cid:107) E ( (cid:101) z i (cid:101) x i (cid:101) x Ti ) (cid:107) op ) For any j ∈ [ d ] , E (cid:0)(cid:101) z i · (cid:101) x ij (cid:1) ≤ (cid:113) E z i · E x ij ≤ (cid:112) M R, E [ (cid:101) z i · (cid:101) x Ti (cid:101) x i ] ≤ d √ M R . In addition, for any v ∈ R d such that (cid:107) v (cid:107) = 1 , E (cid:0)(cid:101) z i ( v T (cid:101) x i ) (cid:1) ≤ (cid:112) M R. We thus have (cid:107) E (cid:101) Z i (cid:107) op ≤ d √ M R . In addition, (cid:107) E (cid:101) Z i (cid:107) op = E ( (cid:101) z i · (cid:107) (cid:101) x i (cid:107) ) ≤ (cid:112) E z i E (cid:107) x i (cid:107) ≤√ d ( M R ) , which further implies that (cid:107) E ( (cid:101) Z i − E (cid:101) Z i ) (cid:107) op ≤ ( d + 1) √ M R . Also notice thatsince (cid:107) (cid:101) x i (cid:107) ≤ τ and (cid:101) z i ≤ τ , (cid:107) (cid:101) Z i (cid:107) op ≤ d · τ τ . By the matrix Bernstein’s inequality, P (cid:0) (cid:107) n n (cid:88) i =1 (cid:101) Z i − E (cid:101) Z i (cid:107) op ≥ t (cid:1) ≤ d · exp (cid:16) − c min (cid:0) nt ( d + 1) √ M R , ntd τ τ (cid:1)(cid:17) . Given that (cid:107) T (cid:107) = 2 (cid:107) n − n (cid:80) i =1 (cid:101) Z i − E (cid:101) Z i (cid:107) op , it thus holds that P (cid:16) (cid:107) T (cid:107) ≥ t (cid:17) ≤ d · exp (cid:16) − c min (cid:0) nt ( d + 1) √ M R , ntd τ τ (cid:1)(cid:17) . (5.11) Bound for T : We decompose T as follows: (cid:107) T (cid:107) ≤ (cid:107) E ( (cid:101) z i − z i ) (cid:101) x i (cid:107) (cid:124) (cid:123)(cid:122) (cid:125) T + (cid:107) E ( z i − y i ) (cid:101) x i (cid:107) (cid:124) (cid:123)(cid:122) (cid:125) T + (cid:107) E ( y i − b (cid:48) ( x Ti β ∗ )) (cid:101) x i (cid:107) (cid:124) (cid:123)(cid:122) (cid:125) T + (cid:107) E ( b (cid:48) ( x Ti β ∗ ) − b (cid:48) ( (cid:101) x Ti β ∗ )) (cid:101) x i (cid:107) (cid:124) (cid:123)(cid:122) (cid:125) T . Now we work on { T i } i =1 one by one. For any v ∈ R d such that (cid:107) v (cid:107) = 1 , | E ( (cid:101) z i − z i )( v T (cid:101) x i ) | ≤ E ( | z i | ( v T x i ) · {| z i | >τ } ) ≤ (cid:113) E ( z i ( v T x i ) ) · P ( | z i | > τ ) . ≤ ( M R ) · √ M τ , thus we have (cid:107) T (cid:107) ≤ M R /τ . Again, for any v ∈ R d such that (cid:107) v (cid:107) = 1 , since (cid:107) E (cid:15) i x i (cid:107) ≤ M (cid:112) d/n , E [ (cid:15) i ( (cid:101) x Ti v )] = E [ (cid:15) i (( (cid:101) x i − x i ) T v )] + E [ (cid:15) i ( x Ti v )] ≤ E [ (cid:15) i | x Ti v | · {(cid:107) x i (cid:107) ≥ τ } ] + M (cid:114) dn ≤ (cid:113) E ( (cid:15) i ( x Ti v )) · P (cid:0) (cid:107) x i (cid:107) ≥ τ (cid:1) + M (cid:114) dn ≤ ( M R ) · √ dRτ + M (cid:114) dn . Therefore (cid:107) T (cid:107) ≤ ( M R ) √ dR/τ + M (cid:112) d/n . For T , since E [ y i − b (cid:48) ( x Ti β ∗ ) | x i ] = 0 , T = 0 . Finally we bound T . For any v ∈ R d such that (cid:107) v (cid:107) = 1 , (cid:107) T (cid:107) ≤ M E( β ∗ T ( x i − (cid:101) x i ))( v T (cid:101) x i ) ≤ M E[( β ∗ T x i )( v T x i ) · {(cid:107) x i (cid:107) ≥ τ } ] ≤ M (cid:113) E( β ∗ T x i ) ( v T x i ) · P (cid:0) (cid:107) x i (cid:107) ≥ τ (cid:1) ≤ M L √ dR/τ .
18o summarize here, we have (cid:107) T (cid:107) ≤ ( M R ) (cid:0) √ M τ + √ dRτ (cid:1) + M L √ dRτ + M (cid:114) dn . (5.12) Bound for T : We apply a similar proof strategy as in the bound for T . Define thefollowing Hermitian dilation matrix: (cid:101) X i := b (cid:48) ( (cid:101) x Ti β ∗ ) · (cid:18) (cid:101) x Ti (cid:101) x i (cid:19) . First, (cid:107) E (cid:101) X i (cid:107) op = max( E ( b (cid:48) ( (cid:101) x Ti β ∗ ) (cid:101) x Ti (cid:101) x i ) , (cid:107) E b (cid:48) ( (cid:101) x Ti β ∗ ) (cid:101) x i (cid:101) x Ti (cid:107) op ) . Write | b (cid:48) (1) | as b . For any j ∈ [ d ] , E (cid:0) b (cid:48) ( (cid:101) x Ti β ∗ ) · (cid:101) x ij (cid:1) ≤ E [( b + M | (cid:101) x Ti β ∗ − | ) (cid:101) x ij ] ≤ E [(( b + M ) + M ( (cid:101) x Ti β ∗ ) ) (cid:101) x ij ] ≤ M R (cid:107) β ∗ (cid:107) + 2( b + M ) √ R =: V, so E [ b (cid:48) ( (cid:101) x Ti β ∗ ) · (cid:101) x Ti (cid:101) x i ] ≤ dV . In addition, for any v ∈ R d such that (cid:107) v (cid:107) = 1 , E (cid:0) b (cid:48) ( (cid:101) x Ti β ∗ ) ( v T (cid:101) x i ) (cid:1) ≤ E (( b + M | (cid:101) x Ti β ∗ − | ) ( v T (cid:101) x i ) ) ≤ V. We thus have (cid:107) E (cid:101) X i (cid:107) op ≤ dV . In addition, (cid:107) E (cid:101) X i (cid:107) op = E ( b (cid:48) ( (cid:101) x Ti β ∗ ) ·(cid:107) (cid:101) x i (cid:107) ) ≤ (cid:112) E b (cid:48) ( (cid:101) x Ti β ∗ ) E (cid:107) (cid:101) x i (cid:107) ≤√ dV , which further implies that (cid:107) E ( (cid:101) X i − E (cid:101) X i ) (cid:107) op ≤ ( d + √ d ) V . Also notice that (cid:107) (cid:101) X i (cid:107) op ≤ (( b + M ) + M (cid:107) β ∗ (cid:107) · d τ ) d τ . By the matrix Bernstein’s inequality, P (cid:0) (cid:107) n n (cid:88) i =1 (cid:101) X i − E (cid:101) X i (cid:107) op ≥ t (cid:1) ≤ d · exp (cid:16) − c min (cid:0) nt ( d + √ d ) V , nt ( b + M + M (cid:107) β ∗ (cid:107) · d τ ) d τ (cid:1)(cid:17) . Given that (cid:107) T (cid:107) = 2 (cid:107) n − n (cid:80) i =1 (cid:101) X i − E (cid:101) X i (cid:107) op , it thus holds that P (cid:16) (cid:107) T (cid:107) ≥ t (cid:17) ≤ d · exp (cid:16) − c min (cid:0) nt ( d + √ d ) V , nt ( b + M + M (cid:107) β ∗ (cid:107) · d τ ) d τ (cid:1)(cid:17) . (5.13)Finally, choose τ , τ (cid:16) ( n/ log n ) . Combining (5.11), (5.12) and (5.13) delivers that forsome constant C any ξ > , P (cid:0) (cid:107)∇ (cid:101) (cid:96) n ( β ∗ ) (cid:107) ≥ C ξ (cid:114) d log nn (cid:1) ≤ n − ξ . (5.14)Choose t = ξ log n and let r be larger than the RHS of (5.9). When d/n is sufficiently smalland n is sufficiently large, we can obtain that r ≥ C ξ (cid:114) d log nn =: r , C is a constant. Choose r = r . Then by (5.9), (cid:107) ∆ η (cid:107) ≤ r and thus (cid:101) ∆ = (cid:101) ∆ η .Finally, we reach the conclusion that P (cid:16) (cid:107) (cid:101) ∆ (cid:107) ≥ C ξ (cid:114) d log nn (cid:17) ≤ n − ξ + 2 n − ξ ≤ n − ξ . Proof of Corollary 1.
The proof strategy is nearly the same as that for deriving Theorem1, so we provide a roadmap here and do not dive into great details. For ease of notation,write n − n (cid:80) i =1 (cid:96) w ( (cid:101) x i , z i ; β ) as (cid:101) (cid:96) w ( β ) and denote the hessian matrix of (cid:101) (cid:96) wn ( β ) by (cid:101) H wn ( β ) . Since (cid:101) H wn ( β ) = ∇ (cid:101) (cid:96) n ( β ) = (cid:101) H n ( β ) , we can directly obtain the uniform strong convexity of (cid:101) H wn ( β ) from Lemma 1. In addition, ∇ β (cid:101) (cid:96) wn ( β ∗ ) = 1 − p − p · n n (cid:88) i =1 ( b (cid:48) ( (cid:101) x Ti β ∗ ) − z i ) (cid:101) x i (cid:124) (cid:123)(cid:122) (cid:125) T − p − p · n n (cid:88) i =1 ( b (cid:48) ( (cid:101) x Ti β ∗ ) − (1 − z i )) (cid:101) x i (cid:124) (cid:123)(cid:122) (cid:125) T = 1 − p − p ( T − E T ) − p − p ( T − E T ) + 1 − p − p E T − p − p E T = 1 − p − p ( T − E T ) − p − p ( T − E T ) + E ( b (cid:48) ( (cid:101) x Ti β ∗ ) − y i ) (cid:101) x i . Since | b (cid:48) ( (cid:101) x Ti β ∗ ) − z i | ≤ and | b (cid:48) ( (cid:101) x Ti β ∗ ) − (1 − z i ) | ≤ , following the bound for T in Theorem1, we will obtain P ( (cid:107) − p − p ( T − E T ) − p − p ( T − E T ) (cid:107) ≥ c ξ (cid:114) d log nn ) ≤ n − ξ , where c > depends on R and p and ξ > . In addition, following the bound for T and T in Theorem 1, we shall obtain (cid:107) E ( b (cid:48) ( (cid:101) x Ti β ∗ ) − y i ) (cid:101) x i (cid:107) ≤ M L √ dRτ ≤ c M (cid:114) dR log nn . where c > is a constant. Therefore, for some constant c depending on R, p, M , R , wehave P ( (cid:107)∇ β (cid:101) (cid:96) wn ( β ∗ ) (cid:107) ≥ c ξ (cid:114) d log nn ) ≤ n − ξ . Combining this with the uniform strong convexity of (cid:101) H wn ( β ) delivers the final conclusion.20 roof of Lemma 2. According to (2.3), [ ∇ β (cid:101) (cid:96) ( β ∗ )] j = ( b (cid:48) ( (cid:101) x Ti β ∗ ) − (cid:101) z i ) (cid:101) x ij . Then we have (cid:12)(cid:12) n n (cid:88) i =1 ( b (cid:48) ( (cid:101) x Ti β ∗ ) − (cid:101) z i ) (cid:101) x ij (cid:12)(cid:12) ≤ (cid:12)(cid:12) n n (cid:88) i =1 b (cid:48) ( (cid:101) x Ti β ∗ ) (cid:101) x ij − E b (cid:48) ( (cid:101) x Ti β ∗ ) (cid:101) x ij (cid:12)(cid:12)(cid:124) (cid:123)(cid:122) (cid:125) T + | E ( b (cid:48) ( (cid:101) x Ti β ∗ ) − (cid:101) z i ) (cid:101) x ij | (cid:124) (cid:123)(cid:122) (cid:125) T + (cid:12)(cid:12) n n (cid:88) i =1 (cid:101) z i (cid:101) x ij − E (cid:101) z i (cid:101) x ij (cid:12)(cid:12)(cid:124) (cid:123)(cid:122) (cid:125) T . We start with the upper bound of T . By the Mean Value Theorem, for any i ∈ [ n ] , thereexists ξ i between and (cid:101) x Ti β ∗ such that b (cid:48) ( (cid:101) x Ti β ∗ ) = b (cid:48) (1) + b (cid:48)(cid:48) ( ξ i ) · ( (cid:101) x Ti β ∗ − . Therefore wehave T ≤ (cid:12)(cid:12) n n (cid:88) i =1 b (cid:48) (1) (cid:101) x ij − E ( b (cid:48) (1) (cid:101) x ij ) (cid:12)(cid:12) + (cid:12)(cid:12) n n (cid:88) i =1 b (cid:48)(cid:48) ( ξ i ) (cid:101) x ij ( (cid:101) x Ti β ∗ − − E ( b (cid:48)(cid:48) ( ξ i ) (cid:101) x ij ( (cid:101) x Ti β ∗ − (cid:12)(cid:12) ≤ (cid:12)(cid:12) n n (cid:88) i =1 b (cid:48) (1) (cid:101) x ij − E ( b (cid:48) (1) (cid:101) x ij ) (cid:12)(cid:12) + d (cid:88) k =1 | β ∗ k | · (cid:12)(cid:12) n n (cid:88) i =1 b (cid:48)(cid:48) ( ξ i ) (cid:101) x ij (cid:101) x ik − E b (cid:48)(cid:48) ( ξ i ) (cid:101) x ij (cid:101) x ik (cid:12)(cid:12) + (cid:12)(cid:12) n n (cid:88) i =1 b (cid:48)(cid:48) ( ξ i ) (cid:101) x ij − E ( b (cid:48)(cid:48) ( ξ i ) (cid:101) x ij ) (cid:12)(cid:12) . Since var ( (cid:101) x ij ) ≤ √ R and | (cid:101) x ij | ≤ τ , an application of Bernstein’s inequality (Theorem 2.10in Boucheron et al. (2013)) yields that P (cid:16) | n n (cid:88) i =1 b (cid:48) (1) (cid:101) x ij − E ( b (cid:48) (1) (cid:101) x ij ) | ≥ | b (cid:48) (1) | (cid:0)(cid:115) √ R · tn + c τ tn (cid:1)(cid:17) ≤ − t ) , where c > is some universal constant. In addition, b (cid:48)(cid:48) ( ξ i ) (cid:101) x ij (cid:101) x ik ≤ M τ and var ( b (cid:48)(cid:48) ( ξ i ) (cid:101) x ij (cid:101) x ik ) ≤ E( b (cid:48)(cid:48) ( ξ i ) (cid:101) x ij (cid:101) x ik ) ≤ M R . Again by Bernstein’s inequality, P (cid:16)(cid:12)(cid:12) n n (cid:88) i =1 b (cid:48)(cid:48) ( ξ i ) (cid:101) x ij (cid:101) x ik − E ( b (cid:48)(cid:48) ( ξ i ) (cid:101) x ij (cid:101) x ik ) (cid:12)(cid:12) ≥ (cid:114) M Rtn + c M τ tn (cid:17) ≤ − t ) . Similarly, P (cid:16)(cid:12)(cid:12) n n (cid:88) i =1 b (cid:48)(cid:48) ( ξ i ) (cid:101) x ij − E ( b (cid:48)(cid:48) ( ξ i ) (cid:101) x ij ) (cid:12)(cid:12) ≥ (cid:115) M √ Rtn + M τ tn (cid:17) ≤ − t ) . Combining the above three inequalities delivers that P (cid:16) T ≥ | b (cid:48) (1) | (cid:0)(cid:115) √ R · tn + c τ tn + (cid:114) M Rtn + c M τ tn + (cid:115) M √ Rtn + M τ tn (cid:17) ≤ − t ) . (5.15)21ow we bound T . T = E [( z i − (cid:101) z i ) (cid:101) x ij ] + E (cid:15) i (cid:101) x ij + E [( b (cid:48) ( x Ti β ∗ ) − b (cid:48) ( (cid:101) x Ti β ∗ )) (cid:101) x ij ] ≤ E [ | z i (cid:101) x ij | · {| z i |≥ τ } ] + E ( (cid:15) i x ij ) + E (cid:15) i ( x ij − (cid:101) x ij ) + M · d (cid:88) k =1 | β ∗ k | · E | (cid:101) x ik ( (cid:101) x ij − x ij ) |≤ ( M R ) · √ M τ + M √ n + ( M R ) τ + M M · √ Rτ . (5.16)Finally we bound T . Note that | (cid:101) z i (cid:101) x ij | ≤ τ τ , var ( (cid:101) x ij (cid:101) z i ) ≤ E | (cid:101) z i (cid:101) x ij | ≤ √ M R . According tothe Bernstein’s inequality, P (cid:16) | T | ≥ (cid:115) t √ M Rn + c τ τ tn (cid:17) ≤ − t ) . (5.17)Choose τ , τ (cid:16) ( n/ log d ) . Combining (5.15), (5.16) and (5.17) delivers that for someconstant C > that depends on M, R, { M i } i =1 , b (cid:48) (1) and any ξ > , P (cid:0) | [ ∇ β (cid:101) (cid:96) ( β ∗ )] j | ≥ C ξ (cid:114) log dn (cid:1) ≤ d − ξ . Then by the union bound for all j ∈ [ d ] , it holds that P (cid:0) max j ∈ [ d ] [ |∇ β (cid:101) (cid:96) ( β ∗ )] j | ≥ C ξ (cid:114) log dn (cid:1) ≤ d − ξ . Proof of Lemma 3.
The proof strategy is quite similar to that for Lemma 1, except thatwe need to take advantage of the restricted cone C ( S ) that ∆ lies in. First of all, for any ≤ j, k ≤ d , | E ( (cid:101) x ij (cid:101) x ik − x ij x ik ) | ≤ (cid:113) E ( x ij x ik ) · ( P ( | x ij | ≥ τ ) + P ( | x ik | ≥ τ )) ≤ √ Rτ . We thus have (cid:107) E [ x i x Ti − (cid:101) x i (cid:101) x Ti ] (cid:107) max ≤ √ Rτ ≤ CR (cid:114) dn , (5.18)where C > is some constant. Again, define a contraction function φ ( x ; θ ) = x · {| x |≤ θ } + ( x − θ ) · { θ
According to Lemma 1 in Negahban et al. (2012), as long as λ ≥ (cid:107)∇ (cid:101) (cid:96) n ( β ) (cid:107) max , (cid:101) ∆ ∈ C ( S ) . We construct an intermediate estimator (cid:101) β η between (cid:101) β and β ∗ : (cid:101) β η = β ∗ + η ( (cid:101) β − β ∗ ) , where η = 1 if (cid:107) (cid:101) β − β ∗ (cid:107) ≤ r and η = r/ (cid:107) (cid:101) β − β ∗ (cid:107) if (cid:107) (cid:101) β − β ∗ (cid:107) > r . Choose λ = 2 Cξ (cid:112) log d/n ,where C and ξ are the same as in Lemma 2. By Lemmas 2 and 3, it holds with probabilityat least − − t ) , κ (cid:107) (cid:101) ∆ η (cid:107) − C r (cid:0)(cid:114) tn + (cid:114) s log dn (cid:1) ≤ δ (cid:101) (cid:96) n ( (cid:101) β η ; β ∗ ) ≤ −∇ (cid:101) (cid:96) n ( β ∗ ) T (cid:101) ∆ η ≤ (cid:107)∇ (cid:101) (cid:96) n ( β ∗ ) (cid:107) max · (cid:107) (cid:101) ∆ η (cid:107) ≤ (cid:107)∇ (cid:101) (cid:96) n ( β ∗ ) (cid:107) max · (cid:107) [ (cid:101) ∆ η ] S (cid:107) ≤ √ s (cid:107)∇ (cid:101) (cid:96) n ( β ∗ ) (cid:107) max · (cid:107) (cid:101) ∆ η (cid:107) . (5.22)24ome algebra delivers that (cid:107) (cid:101) ∆ η (cid:107) ≤ √ s (cid:107)∇ (cid:101) (cid:96) n ( β ∗ ) (cid:107) max κ + r (cid:115) C κ (cid:0)(cid:114) tn + (cid:114) s log dn (cid:1) . (5.23)Choose t = ξ log d above. Let r be greater than the RHS of the inequality above. Forsufficiently sufficiently small s log d/n , we have r ≥ √ s (cid:107)∇ (cid:101) (cid:96) n ( β ∗ ) (cid:107) max /κ . Define r :=5 √ s (cid:107)∇ (cid:101) (cid:96) n ( β ∗ ) (cid:107) max /κ and choose r = r . Therefore, (cid:107) (cid:101) ∆ η (cid:107) ≤ r and (cid:101) ∆∆
According to Lemma 1 in Negahban et al. (2012), as long as λ ≥ (cid:107)∇ (cid:101) (cid:96) n ( β ) (cid:107) max , (cid:101) ∆ ∈ C ( S ) . We construct an intermediate estimator (cid:101) β η between (cid:101) β and β ∗ : (cid:101) β η = β ∗ + η ( (cid:101) β − β ∗ ) , where η = 1 if (cid:107) (cid:101) β − β ∗ (cid:107) ≤ r and η = r/ (cid:107) (cid:101) β − β ∗ (cid:107) if (cid:107) (cid:101) β − β ∗ (cid:107) > r . Choose λ = 2 Cξ (cid:112) log d/n ,where C and ξ are the same as in Lemma 2. By Lemmas 2 and 3, it holds with probabilityat least − − t ) , κ (cid:107) (cid:101) ∆ η (cid:107) − C r (cid:0)(cid:114) tn + (cid:114) s log dn (cid:1) ≤ δ (cid:101) (cid:96) n ( (cid:101) β η ; β ∗ ) ≤ −∇ (cid:101) (cid:96) n ( β ∗ ) T (cid:101) ∆ η ≤ (cid:107)∇ (cid:101) (cid:96) n ( β ∗ ) (cid:107) max · (cid:107) (cid:101) ∆ η (cid:107) ≤ (cid:107)∇ (cid:101) (cid:96) n ( β ∗ ) (cid:107) max · (cid:107) [ (cid:101) ∆ η ] S (cid:107) ≤ √ s (cid:107)∇ (cid:101) (cid:96) n ( β ∗ ) (cid:107) max · (cid:107) (cid:101) ∆ η (cid:107) . (5.22)24ome algebra delivers that (cid:107) (cid:101) ∆ η (cid:107) ≤ √ s (cid:107)∇ (cid:101) (cid:96) n ( β ∗ ) (cid:107) max κ + r (cid:115) C κ (cid:0)(cid:114) tn + (cid:114) s log dn (cid:1) . (5.23)Choose t = ξ log d above. Let r be greater than the RHS of the inequality above. Forsufficiently sufficiently small s log d/n , we have r ≥ √ s (cid:107)∇ (cid:101) (cid:96) n ( β ∗ ) (cid:107) max /κ . Define r :=5 √ s (cid:107)∇ (cid:101) (cid:96) n ( β ∗ ) (cid:107) max /κ and choose r = r . Therefore, (cid:107) (cid:101) ∆ η (cid:107) ≤ r and (cid:101) ∆∆ η = (cid:101) ∆∆