Out-of-sample error estimate for robust M-estimators with convex penalty
Pierre C Bellec
Rutgers University. Research partially supported by NSF Grants DMS-1811976 and DMS-1945428.

Abstract:
A generic out-of-sample error estimate is proposed for robust M-estimators regularized with a convex penalty in high-dimensional linear regression where $(X, y)$ is observed and $p, n$ are of the same order. If $\psi$ is the derivative of the robust data-fitting loss $\rho$, the estimate depends on the observed data only through the quantities $\hat\psi = \psi(y - X\hat\beta)$, $X^\top\hat\psi$ and the derivatives $(\partial/\partial y)\hat\psi$ and $(\partial/\partial y)X\hat\beta$ for fixed $X$. The out-of-sample error estimate enjoys a relative error of order $n^{-1/2}$ in a linear model with Gaussian covariates and independent noise, either non-asymptotically when $p/n \le \gamma$ or asymptotically in the high-dimensional asymptotic regime $p/n \to \gamma' \in (0, \infty)$. General differentiable loss functions $\rho$ are allowed provided that $\psi = \rho'$ is 1-Lipschitz. The validity of the out-of-sample error estimate holds either under a strong convexity assumption, or for the $\ell_1$-penalized Huber M-estimator if the number of corrupted observations and the sparsity of the true $\beta$ are bounded from above by $s_* n$ for some small enough constant $s_* \in (0, 1)$ independent of $n, p$. For the square loss and in the absence of corruption in the response, the results additionally yield $n^{-1/2}$-consistent estimates of the noise variance and of the generalization error. This generalizes, to arbitrary convex penalty, estimates that were previously known for the Lasso.
1. Introduction
Consider a linear model
(1.1) $y = X\beta + \varepsilon$
where $X \in \mathbb{R}^{n\times p}$ has iid $N(0, \Sigma)$ rows and $\varepsilon \in \mathbb{R}^n$ is a noise vector independent of $X$. The entries of $\varepsilon$ may be heavy-tailed, for instance with infinite second moment, or follow Huber's gross-errors contamination model with $\varepsilon_i$ iid with cumulative distribution function (cdf) $F(u) = (1-q)\,\mathbb{P}(N(0, \sigma^2) \le u) + q\,G(u)$, where $q \in [0, 1]$ is the proportion of corrupted entries and $G$ is an arbitrary cdf chosen by an adversary. Since the seminal work of Huber [Hub64], a standard approach to gain robustness against corruption of $\varepsilon$ is based on robust loss functions $\rho : \mathbb{R} \to [0, +\infty)$ used to construct M-estimators $\hat\beta$ by minimization of optimization problems of the form
(1.2) $\hat\beta \in \arg\min_{b\in\mathbb{R}^p} \frac1n \sum_{i=1}^n \rho(y_i - x_i^\top b)$
where $(x_i)_{i=1,...,n}$ are the rows of $X$. Robustness against corruption of the above estimator typically requires the convex loss $\rho$ to grow linearly at $\pm\infty$, and a well-studied example is the Huber loss $\rho_H(u) = \min(u^2/2, |u| - 1/2)$.

As we are interested in the high-dimensional regime where $p$ is potentially larger than $n$, we also allow for convex penalty functions to leverage structure in the signal $\beta$ and fight the curse of dimensionality. The central object of the present paper is thus a penalized robust M-estimator of the form
(1.3) $\hat\beta(y, X) \in \arg\min_{b\in\mathbb{R}^p} \Bigl(\frac1n \sum_{i=1}^n \rho(y_i - x_i^\top b) + g(b)\Bigr)$
where $\rho : \mathbb{R} \to \mathbb{R}$ is a convex differentiable loss function and $g : \mathbb{R}^p \to \mathbb{R}$ is a convex penalty. We may write simply $\hat\beta$ for $\hat\beta(y, X)$ if the context is clear.

The main contribution of the present paper is the introduction of a generic out-of-sample error estimate for penalized M-estimators of the form (1.3). Here, the out-of-sample error refers to the random quantity
(1.4) $\|\Sigma^{1/2}(\hat\beta - \beta)\|^2 = E[((\hat\beta - \beta)^\top x_{\mathrm{new}})^2 \mid (X, y)]$
where $x_{\mathrm{new}}$ is independent of the data $(X, y)$ with the same distribution as any row of $X$. Our goal is to develop such an out-of-sample error estimate for $\hat\beta$ in (1.3) with little or no assumption on the robust loss $\rho$ and the convex penalty $g$, in order to allow broad choices of $(\rho, g)$ by practitioners. Our goal is also to allow for non-isotropic design with $\Sigma \ne I_p$.

We consider the high-dimensional regime where $p$ and $n$ are of the same order. The results of the present paper are non-asymptotic and assume that $p/n \le \gamma \in (0, \infty)$ for some fixed constant $\gamma$ independent of $n, p$. Although non-asymptotic, these results are applicable in the regime where $n$ and $p$ diverge such that
(1.5) $p/n \to \gamma' \in (0, \infty)$,
simply by considering a constant $\gamma > \gamma'$. The analysis of the performance of convex estimators in the asymptotic regime (1.5) has received considerable attention in the last few years in the statistics, machine learning, electrical engineering and statistical physics communities. Most results available in the $p/n \to \gamma'$ literature regarding M-estimators are either based on Approximate Message Passing (AMP) [BM12, DM16, Bra15, WWM17, CM19, GAK20] following the pioneering work [DMM09] in compressed sensing problems, on leave-one-out methods
[EKBB+13, BBEKY13, Kar13, EK18], or on Gordon's Gaussian min-max theorem (GMT) [Sto13, TAH15, TAH18, MM18]. The goal of these techniques is to summarize the performance and behavior of the M-estimator $\hat\beta$ by a system of nonlinear equations with up to six unknown scalars (e.g., the system of
[EKBB+13] with unknowns $(r, c)$ for unregularized robust M-estimators, the system with unknowns $(\tau, \beta)$ of [MM18, Proposition 3.1] for the Lasso, which dates back to [BM12], the system with unknowns $(\tau, \lambda)$ of [CM19, Section 4] for permutation-invariant penalties, or recently the system with six unknowns $(\alpha, \sigma, \gamma, \theta, \tau, r)$ of [SAH19] in regularized logistic regression). Solving these nonlinear equations provides information about the risk $\|\hat\beta - \beta\|^2$, and in certain cases asymptotic normality results for the coefficients of $\hat\beta$ after a bias correction, see, e.g., [CM19, Proposition 4.3(iii)]. These systems of nonlinear equations depend on a prior on the true coefficient vector $\beta$, and knowledge of the prior is required to compute the solutions. For Ridge regression, results can be obtained using random matrix theory tools such as the Stieltjes transform and limiting spectral distributions of certain random matrices
[D+16, DW+18], without the need of a prior on $\beta$. Additionally, most of the aforementioned works require isotropic design ($\Sigma = I_p$), although there are notable exceptions for specific examples: isotropy can be relaxed for Ridge regularization [DW+18]. In the present paper, arbitrary covariance $\Sigma \ne I_p$ is allowed without additional complexity.

Contributions
We assume throughout the paper that $\rho$ is differentiable and denote by $\psi : \mathbb{R}\to\mathbb{R}$ the derivative of $\rho$. We also assume that $\psi$ is absolutely continuous and denote by $\psi'$ its derivative. The functions $\psi, \psi'$ act componentwise when applied to vectors, for instance $\psi(y - X\hat\beta) = (\psi(y_i - x_i^\top\hat\beta))_{i=1,...,n}$. Our contributions are summarized below.

• A novel data-driven estimate of the out-of-sample error (1.4) is introduced. The estimate depends on the data only through $\hat\psi \stackrel{\rm def}{=} \psi(y - X\hat\beta)$, the vector $\Sigma^{-1/2}X^\top\hat\psi$ and the derivatives of $y \mapsto \hat\beta$ and $y \mapsto \psi(y - X\hat\beta)$ for fixed $X$. For certain choices of $(\rho, g)$ these derivatives have closed forms; for instance for the $\ell_1$-penalized Huber M-estimator, when $\rho$ is the Huber loss, the estimator $\hat R$ of the out-of-sample error (1.4) is
$\hat R = (|\hat I| - |\hat S|)^{-2}\bigl\{\|\psi(y - X\hat\beta)\|^2(2|\hat S| - p) + \|\Sigma^{-1/2}X^\top\psi(y - X\hat\beta)\|^2\bigr\}$
where $\hat S = \{j\in[p] : \hat\beta_j \ne 0\}$ and $\hat I = \{i\in[n] : \psi'(y_i - x_i^\top\hat\beta) > 0\}$ is the set of inliers. For general choices of $(\rho, g)$, the derivatives can be approximated by a Monte Carlo scheme.

• The estimate is valid under mild assumptions, namely: $\psi$ is 1-Lipschitz, $p/n \le \gamma$ for some constant $\gamma$ independent of $n, p$, and either (i) the penalty function $g$ is $\mu$-strongly convex, (ii) the loss $\rho$ is strongly convex and $\gamma < 1$, or (iii) $\hat\beta$ is the $\ell_1$-penalized Huber M-estimator together with an additional assumption on the fraction of corrupted observations and the sparsity of $\beta$. No isotropy assumption is required on $\Sigma$, and no prior distribution is required on $\beta$.

• The proof arguments provide new avenues to study M-estimators in the regime $p/n \to \gamma'$. The results rely on novel moment inequalities that let us directly bound the quantities of interest. These new techniques do not overlap with arguments typically used to analyse M-estimators when $p/n \to \gamma'$, such as leave-one-out approaches
[EKBB+13, Kar13, EK18], Approximate Message Passing (AMP) [BM12, DM16, Bra15, WWM17, CM19, GAK20], or Gordon's Gaussian Min-Max Theorem (GMT) [Sto13, TAH15, TAH18, MM18].

• In the special case of the square loss, our estimate of the out-of-sample error coincides with previous estimates known for the Ordinary Least-Squares [L+08] and for the null estimator $\hat\beta = 0$ [Dic14]. Our results can be seen as a broad generalization of these estimates to (a) arbitrary covariance, (b) general loss functions, including robust losses, and (c) general convex penalties. For the square loss, our results also yield generic estimates for the noise level and the generalization error $E[(x_{\mathrm{new}}^\top\hat\beta - Y_{\mathrm{new}})^2 \mid (X, y)]$.
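To make the first bullet concrete, the following is a minimal NumPy sketch (ours, not the paper's code) that fits the $\ell_1$-penalized Huber M-estimator (1.3) by plain proximal gradient descent and then evaluates the displayed estimate $\hat R$. The Huber scale `delta`, the step-size rule and the iteration count are illustrative assumptions, not the tuned choices of Assumption 2.3(iii) below.

```python
import numpy as np

def huber_psi(u, delta):
    # psi = rho' for the scaled Huber loss: psi(u) = clip(u, -delta, delta)
    return np.clip(u, -delta, delta)

def huber_lasso(X, y, lam, delta, n_iter=2000):
    # Proximal gradient (ISTA) for (1/n) sum_i rho(y_i - x_i'b) + lam*||b||_1.
    # Since psi is 1-Lipschitz, the gradient of the data-fitting term is
    # (||X||_op^2 / n)-Lipschitz, so a constant step n / ||X||_op^2 is safe.
    n, p = X.shape
    step = n / np.linalg.norm(X, ord=2) ** 2
    b = np.zeros(p)
    for _ in range(n_iter):
        b = b + step * (X.T @ huber_psi(y - X @ b, delta)) / n  # gradient step
        b = np.sign(b) * np.maximum(np.abs(b) - step * lam, 0)  # soft-threshold
    return b

def r_hat(X, y, b, delta, Sigma):
    # Out-of-sample error estimate for the l1-penalized Huber M-estimator:
    # R_hat = (|I| - |S|)^{-2} {||psi||^2 (2|S| - p) + ||Sigma^{-1/2} X' psi||^2}
    n, p = X.shape
    psi = huber_psi(y - X @ b, delta)
    I = np.sum(np.abs(y - X @ b) < delta)   # inliers: psi'(y_i - x_i'b) > 0
    S = np.sum(b != 0)                      # active covariates
    v = X.T @ psi
    return (psi @ psi * (2 * S - p) + v @ np.linalg.solve(Sigma, v)) / (I - S) ** 2
```

A typical usage would be `b = huber_lasso(X, y, lam=0.1, delta=1.0)` followed by `r_hat(X, y, b, 1.0, np.eye(p))`; the values of `lam` and `delta` here are placeholders.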
2. Main result
Throughout, $\hat\beta$ is the estimator (1.3) with loss $\rho : \mathbb{R}\to\mathbb{R}$ and penalty $g : \mathbb{R}^p\to\mathbb{R}$. The goal of this section is to develop a generic estimator $\hat R$ for the out-of-sample error $\|\Sigma^{1/2}(\hat\beta - \beta)\|^2$. Our main result holds under the following assumptions.

Assumption 2.1 (Loss function). The loss $\rho$ is convex and differentiable, and $\psi = \rho'$ is 1-Lipschitz with derivative $\psi'$ where the derivative exists.

This allows for a large class of robust loss functions, including the Huber loss $\rho_H(u) = \min(u^2/2, |u| - 1/2)$ and smoothed versions of $\rho_H$, cf. Tables 1 and 2 below for some concrete examples. Since $\psi$ is 1-Lipschitz, $\psi'$ exists almost everywhere thanks to Rademacher's theorem. Loss functions typically require a scaling parameter that depends on the noise level to obtain satisfactory risk bounds, see [DT19] and the references therein. For instance we consider in the main result below the loss
(2.1) $\rho(u) = \Lambda_*^2\,\rho_H(\Lambda_*^{-1}u)$
where $\rho_H$ is the Huber loss and $\Lambda_* > 0$ is a scaling parameter. Since for the Huber loss $\psi_H = \rho_H'$ is 1-Lipschitz, $\psi(u) = \rho'(u) = \Lambda_*\psi_H(\Lambda_*^{-1}u)$ is also 1-Lipschitz. In short, scaling a given loss with a tuning parameter $\Lambda_*$ as in (2.1) does not change the Lipschitz constant of the first derivative of $\rho$, and the above assumption does not prevent using a scaling parameter $\Lambda_*$. Additionally, if the desired loss is such that $\psi$ is $L$-Lipschitz for some constant $L \ne 1$, one may replace $(\rho, g)$ by $(L^{-1}\rho, L^{-1}g)$ to obtain a 1-Lipschitz loss without changing the value of $\hat\beta$ in (1.3).

Assumption 2.2 (Probability distribution). The rows of $X$ are iid $N(0, \Sigma)$ with $\Sigma$ invertible, $\varepsilon$ is independent of $X$, and $(X, y)$ has continuous distribution.

The Gaussian assumption is admittedly the strongest assumption required in this work. However arbitrary covariance $\Sigma$ is allowed, while a large body of related literature requires $\Sigma$ proportional to the identity, see for instance [BEM13, Bra15, CM19, SAH19]. Allowing arbitrary $\Sigma$ together with general penalty functions is made possible by developing new techniques that are of a different nature than this previous literature; see the proof sketch in Section 2.9 for an overview. We require that $(X, y)$ has continuous distribution in order to ensure that derivatives of certain Lipschitz functions of $(y, X)$ exist with probability one, again by Rademacher's theorem. If $(X, y)$ does not have continuous distribution, one can always replace $y$ with $\tilde y = y + a\tilde z$ where $a$ is very small and $\tilde z \sim N(0, I_n)$ is sampled independently of $(\varepsilon, X)$. Hence the continuous distribution assumption is a mild technicality.

Assumption 2.3 (Penalty). Assume either one of the following:
(i) $p/n \le \gamma \in (0, +\infty)$ and the penalty $g$ is $\mu$-strongly convex ($\mu > 0$) with respect to $\Sigma$, in the sense that for any $b, b' \in \mathbb{R}^p$, $d \in \partial g(b)$ and $d' \in \partial g(b')$, the inequality $(d - d')^\top(b - b') \ge \mu\|\Sigma^{1/2}(b - b')\|^2$ holds.
(ii) The penalty $g$ is only assumed convex, $p/n \le \gamma < 1$, and $\rho$ is $\mu_\rho$-strongly convex ($\mu_\rho > 0$) in the sense that $(u - s)(\psi(u) - \psi(s)) \ge \mu_\rho(u - s)^2$ for all $u, s \in \mathbb{R}$.
(iii) For constants $\varphi \ge 1$, $\gamma > 0$, $\eta \in (0, 1)$ independent of $n, p$, assume $\operatorname{diag}(\Sigma) = I_p$, $p/n \le \gamma \in (0, \infty)$ and $\phi_{\max}(\Sigma)/\phi_{\min}(\Sigma) \le \varphi$.
The loss is $\rho(u) = n\lambda_*^2\,\rho_H((\sqrt n\,\lambda_*)^{-1}u)$ for $\rho_H$ the Huber loss $\rho_H(u) = \min(u^2/2, |u| - 1/2)$, and the penalty is $g(b) = \lambda\|b\|_1$, where $\lambda_*, \lambda > 0$ are tuning parameters. Furthermore $s_* > 0$ is a small enough constant depending on $\{\varphi, \gamma, \eta\}$ only such that at least $\lceil n(1 - s_*) + \|\beta\|_0\rceil$ coordinates of $\varepsilon$ are iid $N(0, \sigma^2)$. The tuning parameters are set as $\lambda_* = \lambda = \sigma\eta^{-1}\bigl(1 + (2\log(\gamma/s_*))^{1/2}\bigr)n^{-1/2}$.

Here and throughout the paper $\gamma, \mu, \mu_\rho, \varphi, \eta \ge 0$ are constants independent of $n, p$. Strong convexity of the penalty (i.e., (i) above) or strong convexity of the loss (i.e., (ii) above) can be found in numerous other works on regularized M-estimators [DM16, CM19, XMRH19, among others]. In our setting, strong convexity simplifies the analysis as it grants existence of the derivatives of $\hat\beta$ with respect to $(y, X)$ "for free", as we will see in Section 5.1. Assumption (iii) above relaxes strong convexity entirely, by instead assuming a specific choice for $(\rho, g)$, the Huber loss and $\ell_1$ penalty, together with an upper bound on the sparsity of $\beta$ and the number of corrupted components of $\varepsilon$. Indeed, at least $\lceil n(1 - s_*) + \|\beta\|_0\rceil$ components of $\varepsilon$ being iid $N(0, \sigma^2)$ is equivalent to the existence of a set $O_* \subset [n]$ with $|O_*| + \|\beta\|_0 \le s_* n$ and $(\varepsilon_i)_{i\in[n]\setminus O_*}$ being iid $N(0, \sigma^2)$. Here, the uncorrupted observations are indexed in $[n]\setminus O_*$ and the corrupted ones are those indexed in $O_*$. Assumption 2.3(iii) provides a non-trivial example for which our result holds without strong convexity of either the loss or the penalty, provided that the corruption is not too strong and the penalty $g$ (here the $\ell_1$ norm) is well suited to the structure of $\beta$ (here, the sparsity).

Derivatives of $\hat\psi, \hat\beta$ at the observed data

Throughout the paper, we view the functions
(2.2) $\hat\beta : \mathbb{R}^n\times\mathbb{R}^{n\times p}\to\mathbb{R}^p$, $(y, X)\mapsto\hat\beta(y, X)$ in (1.3), and $\hat\psi : \mathbb{R}^n\times\mathbb{R}^{n\times p}\to\mathbb{R}^n$, $(y, X)\mapsto\hat\psi(y, X) = \psi(y - X\hat\beta(y, X))$
as functions of $(y, X)$, though we may drop the dependence in $(y, X)$ and write simply $\hat\beta$ or $\hat\psi$ if the context is clear. Here, recall that $\psi$ acts componentwise on the residuals $y - X\hat\beta$, so that $\psi(y - X\hat\beta)\in\mathbb{R}^n$ has components $\psi(y_i - x_i^\top\hat\beta)_{i=1,...,n}$. The hats in the functions $\hat\beta$ and $\hat\psi$ above emphasize that they are data-driven quantities, and since they are functions of $(y, X)$, the directional derivatives of $\hat\beta$ and $\hat\psi$ at the observed data $(y, X)$ are also data-driven quantities, for instance
$\frac{\partial}{\partial y_i}\hat\beta(y, X) = \frac{d}{dt}\hat\beta(y + t e_i, X)\Big|_{t=0}$.
Provided that they exist, the derivatives can be computed approximately by finite differences or other numerical methods; a Monte Carlo scheme to compute the required derivatives is given in Section 2.7. We thus assume that the Jacobians
(2.3) $\frac{\partial\hat\psi}{\partial y}(y, X) = \Bigl(\frac{\partial\hat\psi_i}{\partial y_l}(y, X)\Bigr)_{(i,l)\in[n]\times[n]}$, $\quad\frac{\partial\hat\beta}{\partial y}(y, X) = \Bigl(\frac{\partial\hat\beta_j}{\partial y_l}(y, X)\Bigr)_{(j,l)\in[p]\times[n]}$
are available. Section 5.1 will make clear that the existence of such partial derivatives is granted, under our assumptions, for almost every $(y, X)\in\mathbb{R}^n\times\mathbb{R}^{n\times p}$ by Rademacher's theorem (cf. Proposition 5.3 below).
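For intuition, the directional derivative displayed above can be approximated by a one-sided finite difference. The sketch below is our illustration; `fit_beta` is a hypothetical solver returning $\hat\beta(\cdot, X)$ for fixed $X$, e.g. the `huber_lasso` routine sketched earlier.

```python
import numpy as np

def dbeta_dyi(fit_beta, y, i, t=1e-6):
    # Column i of the Jacobian (d/dy) beta_hat at the observed y, for fixed X:
    # (d/dy_i) beta_hat(y, X) ~ [beta_hat(y + t*e_i, X) - beta_hat(y, X)] / t
    e_i = np.zeros(y.shape[0])
    e_i[i] = 1.0
    return (fit_beta(y + t * e_i) - fit_beta(y)) / t
```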
For brevity, and when it is clear from context, we will drop the dependence in $(y, X)$ from the notation, so that the above Jacobians $(\partial/\partial y)\hat\psi\in\mathbb{R}^{n\times n}$, $(\partial/\partial y)\hat\beta\in\mathbb{R}^{p\times n}$ as well as their entries $(\partial/\partial y_l)\hat\psi_i$ and $(\partial/\partial y_l)\hat\beta_j$ are implicitly taken at the currently observed data $(y, X)$. Throughout the paper we denote by
(2.4) $h = \hat\beta - \beta\in\mathbb{R}^p$
the error vector, so that the out-of-sample error that we wish to estimate is $\|\Sigma^{1/2}h\|^2$. Finally, define the $n\times n$ square matrices $\operatorname{diag}(\psi')$ and $P$ by
(2.5) $\operatorname{diag}(\psi') = \operatorname{diag}(\{\psi'(y_i - x_i^\top\hat\beta)\}_{i\in[n]})$,
(2.6) $P = \operatorname{diag}(\{I\{\psi'(y_i - x_i^\top\hat\beta) > 0\}\}_{i\in[n]})$,
where $I\{u > 0\} = 1$ if $u > 0$ and 0 otherwise. For robust losses such as the Huber loss, which are such that $\psi'(u) = 0$ for $u\notin(-a, a)$, multiplication by the diagonal matrix $P$ selects the inliers $\hat I = \{i\in[n] : \psi'(y_i - x_i^\top\hat\beta) > 0\}$ and removes the outliers $\hat O = \{i\in[n] : \psi'(y_i - x_i^\top\hat\beta) = 0\}$.

Equipped with the above notation for the $n\times n$ matrices $\operatorname{diag}(\psi')$, $P$ and the Jacobians $(\partial/\partial y)\hat\psi$ and $(\partial/\partial y)\hat\beta$ at the observed data $(y, X)$, we are ready to state the main result of the paper.

Theorem 2.1.
Let $\hat\beta$ be the M-estimator (1.3) and assume that Assumptions 2.1 to 2.3 hold for all $n, p$ as $n, p\to+\infty$. Then almost surely
$\Bigl|\Bigl(\frac1n\operatorname{Tr}\frac{\partial\hat\psi}{\partial y}\Bigr)^2\|\Sigma^{1/2}h\|^2 - \frac1{n^2}\Bigl\{\|\hat\psi\|^2\Bigl(2\operatorname{Tr}\Bigl[PX\frac{\partial\hat\beta}{\partial y}\Bigr] - p\Bigr) + \|\Sigma^{-1/2}X^\top\hat\psi\|^2\Bigr\}\Bigr| = |V_*|\bigl(\|\hat\psi\|^2/n + \|\Sigma^{1/2}h\|^2\bigr)n^{-1/2}$
for some random variable $V_*$ with $E[I\{\Omega\}|V_*|]\le C(\gamma, \mu, \mu_\rho, \eta, \varphi)$ for a constant depending on $\{\gamma, \mu, \mu_\rho, \varphi, \eta\}$ only and an event $\Omega$ with $\mathbb{P}(\Omega)\to1$.

Additionally, under Assumption 2.3(ii) we have $\mu_\rho(1 - \gamma)\le\frac1n\operatorname{Tr}[(\partial/\partial y)\hat\psi]\le1$ almost surely, and under Assumption 2.3(iii) we have $1 - d_*\le\frac1n\operatorname{Tr}[(\partial/\partial y)\hat\psi]\le1$ in $\Omega$ for some constant $d_*\in(0, 1)$ independent of $n, p$.

Finally, almost surely in $\Omega$, $\hat\psi\ne0$ implies that $(\partial/\partial y)\hat\beta(y, X) = [(\partial/\partial y)\hat\beta(y, X)]P$, so that $\|\hat\psi\|^2\operatorname{Tr}[PX(\partial/\partial y)\hat\beta]$ is equal to $\|\hat\psi\|^2\operatorname{Tr}[X(\partial/\partial y)\hat\beta]$, i.e., multiplication by $P$ can be omitted.

The proof is given in Section 8. Recall that the target of estimation is the out-of-sample error $\|\Sigma^{1/2}h\|^2$, which appears on the left of the first line, multiplied by the observable multiplicative factor $(\frac1n\operatorname{Tr}[(\partial/\partial y)\hat\psi])^2$. Thus, when the right-hand side is negligible and $\frac1n\operatorname{Tr}[(\partial/\partial y)\hat\psi]$ is bounded away from 0, this suggests to use the quantity
(2.7) $\hat R = \Bigl(\operatorname{Tr}\frac{\partial\hat\psi}{\partial y}\Bigr)^{-2}\Bigl\{\|\hat\psi\|^2\Bigl(2\operatorname{Tr}\Bigl[PX\frac{\partial\hat\beta}{\partial y}\Bigr] - p\Bigr) + \|\Sigma^{-1/2}X^\top\hat\psi\|^2\Bigr\}$
to estimate $\|\Sigma^{1/2}h\|^2$. In the regime of interest here with $p/n\to\gamma'$, the risk $\|\Sigma^{1/2}h\|^2$ is typically of the order of a constant, see
[EKBB+13, EK18, DM16, BM12, TAH18, CM19] among others. When $\|\hat\psi\|^2/n$ is also of the order of a constant, the right-hand side in Theorem 2.1 is of order $O(n^{-1/2})$, which is negligible compared to $(\frac1n\operatorname{Tr}[(\partial/\partial y)\hat\psi])^2\|\Sigma^{1/2}h\|^2$ when the multiplicative factor $\frac1n\operatorname{Tr}[(\partial/\partial y)\hat\psi]$ is bounded away from 0.

Range of the multiplicative factors in $\hat R$

Our results involve the multiplicative factors
$\operatorname{Tr}[(\partial/\partial y)\hat\psi]$ and $\operatorname{Tr}[P(\partial/\partial y)X\hat\beta]$. The following result provides the possible range for these quantities.

Proposition 2.2.
Assume that $\rho$ is convex differentiable and that $\psi = \rho'$ is 1-Lipschitz. For every fixed $X\in\mathbb{R}^{n\times p}$ the following holds.
• For almost every $y$, the map $y\mapsto\hat\psi = \psi(y - X\hat\beta)$ is Frechet differentiable at $y$, and the Jacobian $(\partial/\partial y)\hat\psi\in\mathbb{R}^{n\times n}$ is symmetric positive semi-definite with operator norm at most one, so that $\operatorname{Tr}[(\partial/\partial y)\hat\psi]\in[0, n]$.
• If additionally $y\mapsto X\hat\beta(y, X)$ is Lipschitz in an open set $U\subset\mathbb{R}^n$, then $\operatorname{Tr}[P(\partial/\partial y)X\hat\beta]\le|\hat I|$ almost everywhere in $U$, where $\hat I = \{i\in[n] : \psi'(y_i - x_i^\top\hat\beta) > 0\}$ is the set of inliers.

The proof of Proposition 2.2 is given in Appendix C.1. Proposition 2.2 provides information about the nature of the matrices $(\partial/\partial y)\hat\psi$ and $P(\partial/\partial y)X\hat\beta$ and their traces.

$\hat R$ for certain examples of loss functions

As a first illustration of the above result, consider the square loss $\rho(u) = u^2/2$. As we will detail in Section 3, devoted to the square loss, we have $P = I_n$ (so that all observations are inliers) as well as $(\partial/\partial y)\hat\psi = I_n - X(\partial/\partial y)\hat\beta$, and the inequality of Theorem 2.1 becomes
$\bigl(1 - \widehat{\mathrm{df}}/n\bigr)^2\bigl|\|\Sigma^{1/2}h\|^2 - \hat R\bigr| \le |V_*|\,n^{-1/2}\bigl(\|\hat\psi\|^2/n + \|\Sigma^{1/2}h\|^2\bigr)$,
where $\hat\psi = y - X\hat\beta$ denotes the residuals, $\widehat{\mathrm{df}} = \operatorname{Tr}[(\partial/\partial y)X\hat\beta]$ is the effective number of parameters or effective degrees of freedom of $\hat\beta$ that dates back to [Ste81], and $\hat R$ becomes
$\hat R = (n - \widehat{\mathrm{df}})^{-2}\bigl\{\|\hat\psi\|^2(2\widehat{\mathrm{df}} - p) + \|\Sigma^{-1/2}X^\top\hat\psi\|^2\bigr\}$.
This estimator of the out-of-sample error for the square loss was known only for two specific penalty functions $g$. The first is $g = 0$
[L+08], in which case $\hat\beta$ is the Ordinary Least-Squares estimator and $\widehat{\mathrm{df}} = p$. The second is $g(b) = \lambda\|b\|_1$ [BEM13, MM18], in which case $\hat\beta$ is the Lasso and $\widehat{\mathrm{df}} = |\{j\in[p] : \hat\beta_j\ne0\}|$. For $g$ not proportional to the $\ell_1$-norm, the above result is to our knowledge novel, even restricted to the square loss. As we detail in Section 3, the algebraic nature of the square loss leads to additional results for noise level estimation and adaptive estimation of the generalization error (here, adaptive means without knowledge of $\Sigma$).

To our knowledge, the estimate $\hat R$ for general loss functions ($\rho$ different from the square loss) is new. The above result is also of a different nature than most results of the literature concerning the performance of M-estimators in the regime $p/n\to\gamma'$. These works characterize the asymptotic limit in probability of the out-of-sample error $\|\Sigma^{1/2}h\|^2$ by solving systems of nonlinear equations of several scalar unknowns, and these equations depend on a prior on the distribution of $\beta$ (cf. the discussion after (1.5) and the references therein). Here, the out-of-sample error is estimated with data-driven quantities satisfying a non-asymptotic error bound, and no prior distribution is assumed on $\beta$.

As a second illustration of the above result, consider the Huber loss
(2.8) $\rho_H(u) = u^2/2$ for $|u|\le1$ and $\rho_H(u) = |u| - 1/2$ for $|u| > 1$.
Then the non-zero diagonal elements of $P$ correspond exactly to the observations $i\in[n]$ such that $y_i - x_i^\top\hat\beta$ falls in the range where $\rho_H$ is quadratic. If $\hat I = \{i\in[n] : \psi'(y_i - x_i^\top\hat\beta) > 0\}$ denotes the set of inliers and $\hat I$ is constant in a neighborhood of $(y, X)$, then $\widetilde{\mathrm{df}} = \operatorname{Tr}[P(\partial/\partial y)X\hat\beta]$ is the divergence of the vector field $\mathbb{R}^{\hat I}\to\mathbb{R}^{\hat I}$ given by $(y_i)_{i\in\hat I}\mapsto(x_i^\top\hat\beta)_{i\in\hat I}$, i.e., $\widetilde{\mathrm{df}} = \sum_{i\in\hat I}x_i^\top(\partial/\partial y_i)\hat\beta$. This can be interpreted as the effective degrees of freedom of $\hat\beta$ restricted to the inliers. Similarly, $\operatorname{Tr}[(\partial/\partial y)\hat\psi] = \operatorname{Tr}[\operatorname{diag}(\psi')(I_n - (\partial/\partial y)X\hat\beta)] = |\hat I| - \widetilde{\mathrm{df}}$ by the chain rule (cf. (5.4) below), and the out-of-sample estimate $\hat R$ becomes
$\hat R = \bigl(|\hat I| - \widetilde{\mathrm{df}}\bigr)^{-2}\bigl\{\|\hat\psi\|^2(2\widetilde{\mathrm{df}} - p) + \|\Sigma^{-1/2}X^\top\hat\psi\|^2\bigr\}$.
This mimics the estimate available for the square loss, with $\widehat{\mathrm{df}}$ replaced by $\widetilde{\mathrm{df}}$ and the sample size $n$ replaced by the number of inliers $|\hat I|$. If a scaled version of the Huber loss is used, i.e., with loss $\rho(u) = \Lambda_*^2\rho_H(\Lambda_*^{-1}u)$, then the previous display still holds.

For loss functions such that $\psi'$ is not valued in $\{0, 1\}$, the estimate $\hat R$ departs significantly from the above simpler estimates available for the square and Huber losses. For instance, consider the symmetric loss $\rho_2$ defined in Table 1 or the symmetric loss $\rho_3$ in Table 2, both of which can be seen as smooth versions of the Huber loss, also displayed in Table 1. For the Huber loss, $\psi_H'$ is a discontinuous step function, while $\psi_2', \psi_3'$ are smooth approximations, sometimes referred to and implemented as the smoothstep functions [Wik20]. The labels 2 and 3 correspond to the degree of smoothness of the corresponding loss, and smoother piecewise polynomial approximations can be obtained [Wik20].
Table 1: Huber loss $\rho_H(u)$ and its derivatives, as well as its smoothed version $\rho_2(u)$ and its derivatives (all losses are symmetric; the formulae below are for $u\ge0$). In the plots of the original figure, the loss $\rho$ is shown in brown, $\psi = \rho'$ in red and $\psi'$ in blue.

    For rho_H:   u in [0, 1]:  psi_H'(u) = 1      psi_H(u) = u    rho_H(u) = u^2/2
                 u in [1, oo): psi_H'(u) = 0      psi_H(u) = 1    rho_H(u) = u - 1/2

    For rho_2:   u in [0, 1]:  psi_2'(u) = 1      psi_2(u) = u                     rho_2(u) = u^2/2
                 u in [1, 2]:  psi_2'(u) = 2 - u  psi_2(u) = -1/2 + 2u - u^2/2     rho_2(u) = 1/6 - u/2 + u^2 - u^3/6
                 u in [2, oo): psi_2'(u) = 0      psi_2(u) = 3/2                   rho_2(u) = -7/6 + 3u/2

Table 2: Smooth robust loss $\rho_3(u)$ and its derivatives for $u\ge0$.

    u in [0, 1]:  psi_3'(u) = 1                        psi_3(u) = u                      rho_3(u) = u^2/2
    u in [1, 2]:  psi_3'(u) = 2u^3 - 9u^2 + 12u - 4    psi_3(u) = u + (u-1)^3 (u-3)/2    rho_3(u) = u^5/10 - 3u^4/4 + 2u^3 - 2u^2 + 3u/2 - 7/20
    u in [2, oo): psi_3'(u) = 0                        psi_3(u) = 3/2                    rho_3(u) = 37/20 + 3(u-2)/2

For $\rho = \rho_2$ or $\rho = \rho_3$, the identity $\widetilde{\mathrm{df}} = \operatorname{Tr}[P(\partial/\partial y)X\hat\beta] = \sum_{i\in\hat I}x_i^\top(\partial/\partial y_i)\hat\beta$ still holds, where $\hat I = \{i\in[n] : \psi'(y_i - x_i^\top\hat\beta) > 0\}$ is again the set of inliers. However, the multiplicative factor $\frac1n\operatorname{Tr}[(\partial/\partial y)\hat\psi]$ satisfies, by the chain rule (5.4) below,
$\frac1n\operatorname{Tr}\Bigl[\frac{\partial\hat\psi}{\partial y}\Bigr] = \frac1n\operatorname{Tr}\Bigl[\operatorname{diag}(\psi')\Bigl(I_n - X\frac{\partial\hat\beta}{\partial y}\Bigr)\Bigr] = \frac1n\sum_{i\in\hat I}\psi'(y_i - x_i^\top\hat\beta)\Bigl\{1 - x_i^\top\frac{\partial\hat\beta}{\partial y_i}\Bigr\}$.
Unlike the case of the Huber loss, here
$\operatorname{Tr}[\partial\hat\psi/\partial y]$ does not depend only on $|\hat I|$ and $\widetilde{\mathrm{df}}$, due to the weights $\psi'(y_i - x_i^\top\hat\beta)$ that vary continuously in $[0, 1]$ among inliers for $\psi' = \psi_2'$ or $\psi' = \psi_3'$ in Tables 1 and 2.

Other smooth robust loss functions include $\rho_{\sqrt\cdot}(u) = \sqrt{1 + u^2}$ with derivative $\psi_{\sqrt\cdot}(u) = u/\sqrt{1 + u^2}$, although in this case $P = I_n$ as all observations satisfy $\psi_{\sqrt\cdot}'(y_i - x_i^\top\hat\beta) > 0$.

We emphasize that the above result does not provide guarantees against all forms of corruption in the data, and $\hat R$ may produce incorrect inferences (or be undefined) in certain cases where the multiplicative factor $\frac1n\operatorname{Tr}[(\partial/\partial y)\hat\psi]$ is too small or equal to 0. First, recall that $\frac1n\operatorname{Tr}[(\partial/\partial y)\hat\psi]\in[0, 1]$ by Proposition 2.2. To exhibit situations for which $\frac1n\operatorname{Tr}[(\partial/\partial y)\hat\psi]$ is too close to 0, note that by the chain rule (5.4) below we have $\frac1n\operatorname{Tr}[(\partial/\partial y)\hat\psi] = \frac1n\operatorname{Tr}[\operatorname{diag}(\psi')(I_n - X(\partial/\partial y)\hat\beta)]$. Hence the above multiplicative factor is equal to 0 when $\operatorname{diag}(\psi') = 0$, i.e., $\psi'(y_i - x_i^\top\hat\beta) = 0$ for all observations $i = 1, ..., n$: if all observations are classified as outliers by the minimization problem (1.3), then $\hat R$ is undefined and cannot be used. On the other hand, the relationship
(2.9) $\Bigl(\frac1n\operatorname{Tr}[(\partial/\partial y)\hat\psi]\Bigr)^2\Bigl|\|\Sigma^{1/2}h\|^2 - \hat R\Bigr| \le |V_*|\,n^{-1/2}\bigl(\|\Sigma^{1/2}h\|^2 + \|\hat\psi\|^2/n\bigr)$
always holds, which suggests that $\frac1n\operatorname{Tr}[(\partial/\partial y)\hat\psi]$ must be bounded away from 0 in order to obtain reasonable upper bounds on $\bigl|\|\Sigma^{1/2}h\|^2 - \hat R\bigr|$. If the loss is strongly convex and $\gamma < 1$ as in Assumption 2.3(ii), or under Assumption 2.3(iii) for the Huber M-estimator with $\ell_1$ penalty, the factor $\frac1n\operatorname{Tr}[(\partial/\partial y)\hat\psi]$ is bounded away from 0, as noted in the second claim of Theorem 2.1. However, $\frac1n\operatorname{Tr}[(\partial/\partial y)\hat\psi]$ is not necessarily bounded away from 0 under Assumption 2.3(i): indeed it is easy to construct an example where $\psi'(y_i - x_i^\top\hat\beta) = 0$ for all $i\in[n]$ with high probability, for instance for the Huber loss $\rho = \rho_H$ defined in (2.8) with penalty $g(b) = K\|b - a\|^2$ for some large $K$ and some vector $a\in\mathbb{R}^p$ at large distance $\|a - \beta\|$. This highlights the fact that the above result does not provide estimation guarantees against all forms of corruption under Assumption 2.3(i) without further assumptions: if the corruption is so strong that all observations are outliers and $\operatorname{Tr}[(\partial/\partial y)\hat\psi] = 0$, then the inequality of Theorem 2.1 is unusable to estimate or bound from above the out-of-sample error. Finally, to obtain a confidence interval for $\|\Sigma^{1/2}h\|^2$, the previous display (2.9) implies
(2.10) $\Bigl|\|\Sigma^{1/2}h\|^2 - \hat R\Bigr| \le \frac{|V_*|\,n^{-1/2}\bigl(\hat R + \|\hat\psi\|^2/n\bigr)}{\bigl[(\frac1n\operatorname{Tr}[(\partial/\partial y)\hat\psi])^2 - |V_*|\,n^{-1/2}\bigr]_+}$
almost surely, where $u_+ = \max(0, u)$ for any $u$ is used in the denominator. Since the expectation of $I\{\Omega\}|V_*|$ is bounded from above by a constant depending on $\gamma, \mu, \mu_\rho, \varphi, \eta$ only, Markov's inequality applied to $I\{\Omega\}|V_*|$ combined with the previous display and $\mathbb{P}(\Omega)\to1$ yields a confidence interval for the out-of-sample error $\|\Sigma^{1/2}h\|^2$.
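Before moving on, here is a quick numerical companion to Tables 1 and 2 (our sketch, not the paper's code): an implementation of the smoothed Huber loss $\rho_2$ and its derivatives, together with a finite-difference check that $\psi_2 = \rho_2'$ and that $\psi_2$ is 1-Lipschitz, as required by Assumption 2.1.

```python
import numpy as np

def psi2_prime(u):
    a = np.abs(u)
    return np.where(a <= 1, 1.0, np.where(a <= 2, 2.0 - a, 0.0))

def psi2(u):
    a = np.abs(u)
    vals = np.where(a <= 1, a, np.where(a <= 2, -0.5 + 2*a - a**2/2, 1.5))
    return np.sign(u) * vals          # rho_2 is symmetric, so psi_2 is odd

def rho2(u):
    a = np.abs(u)
    return np.where(a <= 1, a**2/2,
                    np.where(a <= 2, 1/6 - a/2 + a**2 - a**3/6, -7/6 + 1.5*a))

# Check rho_2' = psi_2 by central finite differences, and |psi_2'| <= 1.
u = np.linspace(-4, 4, 1001)
eps = 1e-6
assert np.allclose((rho2(u + eps) - rho2(u - eps)) / (2*eps), psi2(u), atol=1e-4)
assert np.max(np.abs(psi2_prime(u))) <= 1.0
```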
Closed form expressions for certain $(\rho, g)$

The multiplicative factors
$\operatorname{Tr}[(\partial/\partial y)\hat\psi]$ and $\operatorname{Tr}[P(\partial/\partial y)X\hat\beta]$ have explicit closed-form expressions for particular choices of $(\rho, g)$. We now provide such examples, while the next section provides a general method to approximate these quantities for arbitrary $(\rho, g)$.

Proposition 2.3.
Assume that $\psi$ is 1-Lipschitz and consider an Elastic-Net penalty of the form $g(b) = \mu\|b\|^2/2 + \lambda\|b\|_1$ for $\mu > 0$, $\lambda\ge0$. For almost every $(y, X)$, the map $y\mapsto X\hat\beta$ is differentiable at $y$ and
$(\partial/\partial y)X\hat\beta = X_{\hat S}\bigl(X_{\hat S}^\top\operatorname{diag}(\psi')X_{\hat S} + n\mu I_{|\hat S|}\bigr)^{-1}X_{\hat S}^\top\operatorname{diag}(\psi')$,
where $\hat S = \{j\in[p] : \hat\beta_j\ne0\}$ and $X_{\hat S}$ is the submatrix of $X$ obtained by keeping only the columns indexed in $\hat S$, as well as
$(\partial/\partial y)\hat\psi = \operatorname{diag}(\psi')^{1/2}\bigl[I_n - \operatorname{diag}(\psi')^{1/2}X_{\hat S}\bigl(X_{\hat S}^\top\operatorname{diag}(\psi')X_{\hat S} + n\mu I_{|\hat S|}\bigr)^{-1}X_{\hat S}^\top\operatorname{diag}(\psi')^{1/2}\bigr]\operatorname{diag}(\psi')^{1/2}$.

The proof is given in Appendix C.2, and our main result Theorem 2.1 does not depend on the above proposition. For the Elastic-Net penalty, the factors
$\operatorname{Tr}[P(\partial/\partial y)X\hat\beta]$ and $\operatorname{Tr}[(\partial/\partial y)\hat\psi]$ appearing in the out-of-sample estimate $\hat R$ thus have reasonably tractable forms and can be computed efficiently by inverting a matrix of size $|\hat S|$ once the robust Elastic-Net estimate $\hat\beta$ has been computed. The above expressions for general robust loss functions are closely related to the formula for $(\partial/\partial y)X\hat\beta$ known for the Elastic-Net with square loss [TT12, Equation (28)] [BZ18b, Section 3.5.3], the only difference being several multiplications by the diagonal matrix $\operatorname{diag}(\psi')$ defined in (2.5).

For the Huber loss with $\ell_1$-penalty, these multiplicative factors are even simpler, as shown in the following proposition. We keep using the notation $\hat I = \{i\in[n] : \psi'(y_i - x_i^\top\hat\beta) > 0\}$ for the set of inliers (the set of outliers being $[n]\setminus\hat I$), and $\hat S = \{j\in[p] : \hat\beta_j\ne0\}$ for the set of active covariates.

Proposition 2.4.
Let $\rho(u) = n\lambda_*^2\rho_H((\sqrt n\,\lambda_*)^{-1}u)$ where $\rho_H$ is the Huber loss, and let $g(b) = \lambda\|b\|_1$ be the penalty, where $\lambda_*, \lambda > 0$. For almost every $(y, X)$, the functions $y\mapsto\hat I$, $y\mapsto\hat S$ and $y\mapsto P$ are constant in a neighborhood of $y$, and $\hat Q \stackrel{\rm def}{=} P(\partial/\partial y)X\hat\beta$ is the orthogonal projection onto the column span of $\operatorname{diag}(\psi')X_{\hat S}$. Furthermore $(\partial/\partial y)\hat\psi = \operatorname{diag}(\psi') - \hat Q$ and the multiplicative factors appearing in $\hat R$ are given by
$\operatorname{Tr}[P(\partial/\partial y)X\hat\beta] = |\hat S|$, $\quad\operatorname{Tr}[(\partial/\partial y)\hat\psi] = |\hat I| - |\hat S| \ge 0$.

The proof is given in Appendix C.2. Proposition 2.4 implies that for the Huber loss with $\ell_1$-penalty, the out-of-sample error estimate $\hat R$ becomes simply
(2.11) $\hat R = (|\hat I| - |\hat S|)^{-2}\bigl\{\|\hat\psi\|^2(2|\hat S| - p) + \|\Sigma^{-1/2}X^\top\hat\psi\|^2\bigr\}$.
For the square loss and identity covariance, the above estimate was known [BEM13, MM18] with $|\hat I|$ replaced by $n$. The extension of this estimate to general loss functions is rather natural: the sample size should be replaced by the number of observed inliers $|\hat I|$.

Propositions 2.3 and 2.4 provide, for specific examples, closed-form expressions for the multiplicative factors $\operatorname{Tr}[(\partial/\partial y)\hat\psi]$ and $\operatorname{Tr}[P(\partial/\partial y)X\hat\beta]$ appearing in $\hat R$. Closed-form expressions can also be obtained for other penalty functions, such as the Group-Lasso penalty, again by differentiating the KKT conditions as explained in [BZ19b] for the square loss.

Monte Carlo approximation of the multiplicative factors in $\hat R$ for arbitrary loss and penalty $(\rho, g)$

For general penalty functions, however, no closed form is available. Still, it is possible to approximate the multiplicative factors appearing in $\hat R$ using the following Monte Carlo scheme. Since $\operatorname{Tr}[(\partial/\partial y)\hat\psi]$ and $\operatorname{Tr}[P(\partial/\partial y)X\hat\beta]$ are the divergences of the vector fields $y\mapsto\hat\psi$ and $(y_i)_{i\in\hat I}\mapsto(x_i^\top\hat\beta)_{i\in\hat I}$ respectively, we can use the following Monte Carlo approximation of the divergence of a vector field, which was suggested at least as early as [MMB16], and for which accuracy guarantees are proved in [BZ18b]. Let $F : \mathbb{R}^n\to\mathbb{R}^n$ be a vector field, and let $z_k$, $k = 1, ..., m$ be iid standard normal random vectors in $\mathbb{R}^n$. Then for some small scale parameter $a > 0$, we approximate the divergence of $F$ at a point $y\in\mathbb{R}^n$ by
$\widehat{\operatorname{div}}\,F(y) = \frac1m\sum_{k=1}^m a^{-1}z_k^\top\bigl[F(y + a z_k) - F(y)\bigr]$.
Computing the quantities $F(y + a z_k)$ at the perturbed response vector $y + a z_k$ for $F(y) = X\hat\beta$ or $F(y) = \psi(y - X\hat\beta)$ requires the computation of the M-estimator $\hat\beta(y + a z_k, X)$ at the perturbed response.
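A minimal sketch of this divergence approximation follows (our illustration; `fit_beta` is a hypothetical solver returning $\hat\beta(\cdot, X)$, and the values of `m` and `a` are arbitrary).

```python
import numpy as np

def mc_divergence(F, y, m=100, a=1e-2, rng=None):
    # Monte Carlo divergence estimate:
    # div F(y) ~ (1/m) * sum_k z_k' [F(y + a*z_k) - F(y)] / a
    rng = np.random.default_rng(0) if rng is None else rng
    Fy = F(y)
    acc = 0.0
    for _ in range(m):
        z = rng.standard_normal(y.shape[0])
        acc += z @ (F(y + a * z) - Fy) / a
    return acc / m

# Tr[(d/dy) psi_hat] and Tr[P (d/dy) X beta_hat] as divergences (P is the
# diagonal 0/1 inlier matrix, assumed locally constant in y):
# tr_dpsi = mc_divergence(lambda yy: psi(yy - X @ fit_beta(yy)), y)
# tr_PXb  = mc_divergence(lambda yy: P @ (X @ fit_beta(yy)), y)
```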
If $\hat\beta(y, X)$ has already been computed as a solution to (1.3) by an iterative algorithm, one can use $\hat\beta(y, X)$ as a starting point of the iterative algorithm to compute $\hat\beta(y + a z_k, X)$ efficiently, since for small $a > 0$ and by continuity, $\hat\beta(y, X)$ should provide a good initialization. We refer to [BZ18b] for an analysis of the accuracy of this approximation.

Hence, even in situations where no closed-form expressions for the Jacobians $(\partial/\partial y)\hat\psi$ and $(\partial/\partial y)X\hat\beta$ are available, the estimate $\hat R$ of the out-of-sample error of the M-estimator $\hat\beta$ can be used by replacing the divergences $\operatorname{Tr}[(\partial/\partial y)\hat\psi]$ and $\operatorname{Tr}[P(\partial/\partial y)X\hat\beta]$ by their Monte Carlo approximations.

Simulation study

We illustrate the above result with a short simulation study. For given tuning parameters $\lambda, \lambda_* > 0$, the M-estimator $\hat\beta$ is the Huber Lasso estimator (1.3) with loss $\rho$ and penalty $g$ given in Proposition 2.4, and the estimate $\hat R$ is given by (2.11). We set $n = 1001$, $p = 1000$, $\Sigma = I_p$, the true $\beta$ has 100 nonzero coefficients all equal to $p^{-1/2}$, and the components of $\varepsilon$ are iid with $t$-distribution with 2 degrees of freedom (so that the variance of each component does not exist). Define the grids $\Lambda = \{0.1\,n^{-1/2}(1.5)^k, k = 0, ..., 8\}$ and $\Lambda_* = \{0.1\,n^{-1/2}(1.5)^k, k = 0, ..., 8\}$. For each $(\lambda, \lambda_*)$ in the discrete grid $\Lambda\times\Lambda_*$, the estimator $\hat R$, its target $\|\Sigma^{1/2}(\hat\beta - \beta)\|^2$ and the relative error $|1 - \hat R/\|\Sigma^{1/2}(\hat\beta - \beta)\|^2|$ are reported, over 100 repetitions, in the boxplots in Figure 1. Figure 1 also provides a heatmap of the averages of $\hat R$ and $\|\Sigma^{1/2}(\hat\beta - \beta)\|^2$ over the same 100 repetitions.

The plots show that the estimate $\hat R$ accurately estimates $\|\Sigma^{1/2}(\hat\beta - \beta)\|^2$ across the grid $\Lambda\times\Lambda_*$, with the exception of the lowest value of the Huber loss parameter $\lambda_*$ coupled with the two lowest values of the penalty parameter $\lambda$, as seen on the left of the top-left boxplot in Figure 1. These low values of $(\lambda, \lambda_*)$ lead to small values of $(|\hat I| - |\hat S|)^2$ in the denominator of (2.11). This provides additional evidence that $\hat R$ should not be trusted for low values of $\operatorname{Tr}[(\partial/\partial y)\hat\psi]/n$ (cf. the discussion around (2.10)). These inaccurate estimations for low values of $(\lambda, \lambda_*)$ do not contradict the theoretical results, as the inequality of Theorem 2.1 bounds from above $(\operatorname{Tr}[(\partial/\partial y)\hat\psi]/n)^2\,|\hat R - \|\Sigma^{1/2}h\|^2| = (|\hat I|/n - |\hat S|/n)^2\,|\hat R - \|\Sigma^{1/2}h\|^2|$, and upper bounds on $|\hat R - \|\Sigma^{1/2}h\|^2|$ are not guaranteed by Theorem 2.1 when $(|\hat I|/n - |\hat S|/n)^2$ is close to 0. Furthermore, Figure 1 shows that the estimate $\hat R$ is accurate for $(\lambda, \lambda_*)$ smaller than the values required in Assumption 2.3(iii). This suggests that the validity of $\hat R$ may hold for broader choices of loss and penalty than those allowed by Assumption 2.3.

Proof sketch

Preliminaries for the proofs are twofold. First, several Lipschitz properties are derived, to make sure that the derivatives used in the proofs exist almost surely.
This is done in Section 5.1. Second, without loss of generality we may assume that $\Sigma = I_p$, replacing if necessary $(X, \beta, \hat\beta, g)$ by $(X^*, \beta^*, \hat\beta^*, g^*)$ as follows:
(2.12) $X \rightsquigarrow X^* = X\Sigma^{-1/2}$, $\quad g(\cdot) \rightsquigarrow g^*(\cdot) = g(\Sigma^{-1/2}(\cdot))$, $\quad\hat\beta \rightsquigarrow \hat\beta^* = \Sigma^{1/2}\hat\beta$, $\quad\beta \rightsquigarrow \beta^* = \Sigma^{1/2}\beta$.
This change of variable leaves the quantities $\{y, P, \hat\psi, X\hat\beta, \|\Sigma^{-1/2}X^\top\hat\psi\|, \|\Sigma^{1/2}h\|\}$ unchanged, so that Theorem 2.1 holds for general $\Sigma$ if it holds for $\Sigma = I_p$ after the change of variable in (2.12).

Next, throughout the proof, as in the rest of the paper, we consider the scaled version of $\hat\psi$ and the error vector $h$ given by
(2.13) $r = n^{-1/2}\hat\psi = n^{-1/2}\psi(y - X\hat\beta)$, $\quad h = \hat\beta - \beta$,
so that $\|r\|$ and $\|h\|$ are of the same order. Define also the quantities
(2.14) $V_1 \stackrel{\rm def}{=} \dfrac{\frac1n\|X^\top r\|^2 - \frac{\|r\|^2}{n}\bigl(p - \operatorname{Tr}[P\frac{\partial X\hat\beta}{\partial y}]\bigr) + \frac1{n^{3/2}}\operatorname{Tr}[\frac{\partial\hat\psi}{\partial y}]\,r^\top Xh}{(\|h\|^2 + \|r\|^2)\,n^{-1/2}}$,
(2.15) $V_2 \stackrel{\rm def}{=} \dfrac{\frac pn\|r\|^2 - \frac1n\|X^\top r\|^2 - \bigl(\frac1n\operatorname{Tr}[\frac{\partial\hat\psi}{\partial y}]\bigr)^2\|h\|^2 - \frac2{n^{3/2}}\operatorname{Tr}[\frac{\partial\hat\psi}{\partial y}]\,r^\top Xh}{(\|r\|^2 + \|h\|^2)\,n^{-1/2}}$.
By simple algebra, cancelling the rightmost terms, we find
$V_* \stackrel{\rm def}{=} 2V_1 + V_2 = \dfrac{\frac1n\|X^\top r\|^2 + \frac{\|r\|^2}{n}\bigl(2\operatorname{Tr}[P\frac{\partial X\hat\beta}{\partial y}] - p\bigr) - \bigl(\frac1n\operatorname{Tr}[\frac{\partial\hat\psi}{\partial y}]\bigr)^2\|h\|^2}{(\|r\|^2 + \|h\|^2)\,n^{-1/2}}$,
so that the first claim of Theorem 2.1 follows if $E[I\{\Omega\}|V_1|] + E[I\{\Omega\}|V_2|] \le C(\gamma, \mu, \mu_\rho, \varphi, \eta)$ for some constant independent of $n, p$ and some event $\Omega$ with $\mathbb{P}(\Omega)\to1$.

At this point the main ingredients of the proof are threefold. First, the bound $E[I\{\Omega\}|V_1|]\le C(\gamma, \mu, \mu_\rho, \varphi, \eta)$ is obtained in Proposition 6.4 by applying the second moment identity of [BZ18b] in the form
(2.16) $E\Bigl[\Bigl(\rho^\top X\eta - \sum_{i=1}^n\sum_{j=1}^p\frac{\partial(\rho_i\eta_j)}{\partial x_{ij}}\Bigr)^2\Bigr] = E\Bigl[\|\rho\|^2\|\eta\|^2 + \sum_{j=1}^p\sum_{k=1}^p\sum_{i=1}^n\sum_{l=1}^n\frac{\partial(\rho_i\eta_j)}{\partial x_{lk}}\frac{\partial(\rho_l\eta_k)}{\partial x_{ij}}\Bigr] \le E\Bigl[\|\rho\|^2\|\eta\|^2 + \sum_{j=1}^p\sum_{k=1}^p\sum_{i=1}^n\sum_{l=1}^n\Bigl(\frac{\partial(\rho_l\eta_k)}{\partial x_{ij}}\Bigr)^2\Bigr]$
where $X$ has iid $N(0, 1)$ entries and $\eta : \mathbb{R}^{n\times p}\to\mathbb{R}^p$, $\rho : \mathbb{R}^{n\times p}\to\mathbb{R}^n$ are well-chosen functions of $X$; in our setting
(2.17) $\eta(X) = (\|r\|^2 + \|h\|^2)^{-1/2}\,n^{-1/2}X^\top r$ $\quad$ and $\quad$ $\rho(X) = (\|r\|^2 + \|h\|^2)^{-1/2}\,r$
Fig 1: Boxplots over 100 repetitions of the out-of-sample error $\|\Sigma^{1/2}(\hat\beta - \beta)\|^2$ for the Huber Lasso with parameters $(\lambda, \lambda_*)$, the estimate $\hat R$ in (2.11), and the relative error $|1 - \hat R/\|\Sigma^{1/2}(\hat\beta - \beta)\|^2|$. The heatmap below displays the average over the same 100 repetitions of $\hat R$ (left) and $\|\Sigma^{1/2}(\hat\beta - \beta)\|^2$ (right). The experiment is described in Section 2.8. (Boxplot panels: one per Huber loss parameter $\lambda_*\in\{0.0032, 0.0047, 0.0071, 0.0107, 0.016, 0.024, 0.036, 0.054, 0.081\}$, with the $\ell_1$-penalty parameter $\lambda$ on the x-axis and the out-of-sample error on the y-axis; legend: estimate, true, relative error; heatmap axes: $\lambda_*$ versus $\lambda$.)

with $r, h$ defined in (2.13). Proposition 6.3 explains how to evaluate and bound from above quantities of the form (2.16). The result of [BZ18b], which covers the case $p = 1$, $\tilde h = 1$, is recalled in Proposition 6.1.

The second ingredient is the bound $E[I\{\Omega\}|V_2|]\le C(\gamma, \mu, \mu_\rho, \varphi, \eta)$, which is obtained by developing a novel probabilistic inequality of the form
(2.18) $E\Bigl|p\|\rho\|^2 - \sum_{j=1}^p\Bigl(\rho^\top Xe_j - \sum_{i=1}^n\frac{\partial\rho_i}{\partial x_{ij}}\Bigr)^2\Bigr| \le C\Bigl\{1 + \Bigl(E\sum_{i=1}^n\sum_{j=1}^p\Bigl\|\frac{\partial\rho}{\partial x_{ij}}\Bigr\|^2\Bigr)^{1/2}\Bigr\}\sqrt p + C\,E\sum_{i=1}^n\sum_{j=1}^p\Bigl\|\frac{\partial\rho}{\partial x_{ij}}\Bigr\|^2$
for functions $\rho : \mathbb{R}^{n\times p}\to\mathbb{R}^n$ with $\|\rho\|\le1$ almost surely, where $C > 0$ is an absolute constant. Inequality (2.18) is derived in Theorem 7.1. The upper bound on $E[I\{\Omega\}|V_2|]$ is formally proved as a corollary in Proposition 7.3 by taking $\rho(X) = r(\|r\|^2 + \|h\|^2)^{-1/2}$ with $r, h$ in (2.13). To our knowledge, inequality (2.18) is novel. In the simplest case, if $\rho$ is constant with $\|\rho\| = 1$, then (2.18) reduces to $E|\chi_p^2 - p|\le C\sqrt p$, and the dependence in $\sqrt p$ is optimal. The flexibility of inequality (2.18) is that its left-hand side is provably of order $\sqrt p$, as in the case of $E|\chi_p^2 - p|$, as long as the derivatives of $\rho$ do not vary too much, in the sense that $E\sum_{i=1}^n\sum_{j=1}^p\|(\partial/\partial x_{ij})\rho\|^2\le C_2$ for some constant independent of $n, p$. This holds for instance for all $(C_2/n)^{1/2}$-Lipschitz functions $\rho : \mathbb{R}^{n\times p}\to\mathbb{R}^n$. In this case, a right-hand side of order $\sqrt p$ in (2.18) would be expected if the $p$ terms
$A_j = \|\rho\|^2 - \Bigl(\rho^\top Xe_j - \sum_{i=1}^n\frac{\partial\rho_i}{\partial x_{ij}}\Bigr)^2$
were mean-zero and independent, thanks to $E[(\sum_{j=1}^p A_j)^2] = \sum_{j=1}^p E[A_j^2]$ by independence. The surprising feature of (2.18) is that such a bound of order $\sqrt p$ holds despite the intricate, nonlinear dependence between $A_j$ and the $p - 1$ other terms $(A_k)_{k\ne j}$ through the $(C_2/n)^{1/2}$-Lipschitz function $\rho$ and its partial derivatives.

Upper bounds on (2.16) and inequalities of the form (2.18) involve derivatives with respect to the entries of $X$. It might thus be surprising at this point that Theorem 2.1 and the estimate $\hat R$ in (2.7) involve the derivatives of $(\hat\psi, \hat\beta)$ with respect to $y$ only, and no derivatives with respect to the entries of $X$. The third major ingredient of the proof is to provide gradient identities between the derivatives of $\hat\beta, \hat\psi$ with respect to $y$ and those with respect to $X$, by identifying certain perturbations of the data $(y, X)$ that leave $\hat\beta$ or $\hat\psi$ unchanged.
For instance, Corollary 5.7 shows that $\hat\psi$ stays the same and $\hat\beta$ is still a solution of the optimization problem (1.3) if the observed data $(y, X)$ is replaced by $(y + \hat\beta_j v, X + ve_j^\top)$ for any canonical basis vector $e_j\in\mathbb{R}^p$ and any direction $v\in\mathbb{R}^n$ with $v^\top\hat\psi = 0$. If $\hat\psi(y, X)$ and $\hat\beta(y, X)$ are Frechet differentiable with respect to $(y, X)$, then this provides identities between the partial derivatives with respect to $y$ and the partial derivatives with respect to the entries of $X$. Further relationships between the partial derivatives with respect to $y$ and the partial derivatives with respect to $X$ are derived in Section 5.3, including the equality
(2.19) $\psi(y_i - x_i^\top\hat\beta)\,PX\frac{\partial\hat\beta}{\partial y_l}(y, X) = PX\Bigl[(x_l^\top\hat\beta)\frac{\partial\hat\beta}{\partial y_i}(y, X) + \sum_{k=1}^p x_{lk}\frac{\partial\hat\beta}{\partial x_{ik}}(y, X)\Bigr]\psi'(y_l - x_l^\top\hat\beta)$
for all $l\in[n]$, $i\in[n]$, which is particularly useful for the purpose of Theorem 2.1. These relationships on the derivatives are non-random in nature: they hold as long as $\hat\beta, \hat\psi$ are Frechet differentiable at $(y, X)$, and for instance do not require that $X$ be normally distributed.

Proof sketch in a simple example with explicit derivatives
We illustrate how to combine (2.16)-(2.18) in a simple setting where explicit formulae for the derivatives are available. In this section only, and for simplicity, consider the square loss $\rho(u) = u^2/2$, the $\ell_1$ penalty $g(b) = \lambda\|b\|_1$, identity covariance $\Sigma = I_p$ and $p/n\le\gamma < 1$, which ensures that $\phi_{\min}(X^\top X) > 0$ almost surely. Also, assume throughout that $\varepsilon$ is fixed and deterministic. In such a setting, the derivatives with respect to $X$ of the map $X\mapsto\psi(X)$, where $\psi(X) = \varepsilon - X(\hat\beta(\varepsilon + X\beta, X) - \beta)$, are given almost surely by
(2.20) $\frac{\partial\psi}{\partial x_{ij}}(X) = -\bigl[X_{\hat S}(X_{\hat S}^\top X_{\hat S})^\dagger e_j\psi^\top + h_j(I_n - H)\bigr]e_i$ $\quad$ with $\quad H \stackrel{\rm def}{=} X_{\hat S}(X_{\hat S}^\top X_{\hat S})^\dagger X_{\hat S}^\top$,
where $e_i\in\mathbb{R}^n$, $e_j\in\mathbb{R}^p$ are canonical basis vectors and $X_{\hat S}$ is the matrix $X$ with the columns indexed in the complement of $\hat S = \{j\in[p] : \hat\beta_j\ne0\}$ replaced by zeros; see [BZ19a] or [BZ19b, Lemma 3.5] where (2.20) is derived. In this setting, the factors $\operatorname{Tr}[(\partial/\partial y)\hat\psi(y, X)]$ and $\operatorname{Tr}[(\partial/\partial y)X\hat\beta(y, X)]$ that appear in Theorem 2.1 are given almost surely by $\operatorname{Tr}[I_n - H] = n - |\hat S|$ and $\operatorname{Tr}[H] = |\hat S|$ respectively (see, e.g., [Tib13] or [BZ18b, Section 3.4] among others). Hence the approximation obtained in Theorem 2.1 reads
(2.21) $\|h\|^2(n - |\hat S|)^2 \approx \|\psi\|^2(2|\hat S| - p) + \|X^\top\psi\|^2$.
Applying (2.16) to $\rho = \psi$ and $\eta = X^\top\psi$, the quantity on the left-hand side of (2.16) is
$\mathrm{Rem}_{(2.16)} \stackrel{\rm def}{=} \rho^\top X\eta - \sum_{i=1}^n\sum_{j=1}^p\frac{\partial(\rho_i\eta_j)}{\partial x_{ij}} = \|X^\top\psi\|^2 - p\|\psi\|^2 - \sum_{i=1}^n\sum_{j=1}^p e_j^\top X^\top\frac{\partial(\psi\psi_i)}{\partial x_{ij}}$
by using the product rule. Let $\mathrm{Rem}_{(2.18)} = p\|\rho\|^2 - \sum_{j=1}^p\bigl(\rho^\top Xe_j - \sum_{i=1}^n\frac{\partial\rho_i}{\partial x_{ij}}\bigr)^2$ be the quantity inside the absolute values in the left-hand side of (2.18). By expanding the square in $\mathrm{Rem}_{(2.18)}$ and using the product rule of differentiation, we find
$\mathrm{Rem}_{(2.18)} + 2\,\mathrm{Rem}_{(2.16)} = -\sum_{j=1}^p\Bigl(\sum_{i=1}^n\frac{\partial\rho_i}{\partial x_{ij}}\Bigr)^2 + 2\sum_{j=1}^p e_j^\top X^\top\psi\sum_{i=1}^n\frac{\partial\psi_i}{\partial x_{ij}} - p\|\psi\|^2 + \|X^\top\psi\|^2 - 2\sum_{i=1}^n\sum_{j=1}^p e_j^\top X^\top\frac{\partial(\psi\psi_i)}{\partial x_{ij}} = -\sum_{j=1}^p\Bigl(\sum_{i=1}^n\frac{\partial\rho_i}{\partial x_{ij}}\Bigr)^2 - p\|\psi\|^2 + \|X^\top\psi\|^2 - 2\sum_{i=1}^n\psi_i\sum_{j=1}^p e_j^\top X^\top\frac{\partial\psi}{\partial x_{ij}}$
thanks to cancellations for $p\|\psi\|^2$, $\|X^\top\psi\|^2$ and the cross-terms when expanding the square. We now use the explicit formulae (2.20) for $(\partial/\partial x_{ij})\psi$ in the rightmost term, which equals
$2\|\psi\|^2\operatorname{Tr}[X_{\hat S}^\top X_{\hat S}(X_{\hat S}^\top X_{\hat S})^\dagger] + 2h^\top X^\top(I_n - H)\psi = 2\|\psi\|^2|\hat S| + \mathrm{Rem}_1$,
where $\mathrm{Rem}_1 \stackrel{\rm def}{=} 2h^\top X^\top(I_n - H)\psi$ is negligible compared to $p\|\psi\|^2$ and $\|X^\top\psi\|^2$ thanks to $\|I_n - H\|_{op}\le1$. This shows that
$\mathrm{Rem}_{(2.18)} + 2\,\mathrm{Rem}_{(2.16)} + \sum_{j=1}^p\Bigl(\sum_{i=1}^n\frac{\partial\rho_i}{\partial x_{ij}}\Bigr)^2 = \|\psi\|^2(2|\hat S| - p) + \|X^\top\psi\|^2 + \mathrm{Rem}_1$.
Finally, for the left-hand side, $\sum_{i=1}^n(\partial/\partial x_{ij})\rho_i = -h_j(n - |\hat S|) - \psi^\top X_{\hat S}(X_{\hat S}^\top X_{\hat S})^\dagger e_j$. With
$\mathrm{Rem}_2 \stackrel{\rm def}{=} 2(n - |\hat S|)\,\psi^\top X_{\hat S}(X_{\hat S}^\top X_{\hat S})^\dagger h + \|(X_{\hat S}^\top X_{\hat S})^\dagger X_{\hat S}^\top\psi\|^2$,
this proves that
$\mathrm{Rem}_{(2.18)} + 2\,\mathrm{Rem}_{(2.16)} + \mathrm{Rem}_2 + \|h\|^2(n - |\hat S|)^2 = \|\psi\|^2(2|\hat S| - p) + \|X^\top\psi\|^2 + \mathrm{Rem}_1$,
so that the desired approximation (2.21) holds when all remainder terms are negligible. Bounding from above $\mathrm{Rem}_1, \mathrm{Rem}_2$ can be readily achieved using that $\|I_n - H\|_{op}\vee\|H\|_{op}\le1$ and that the eigenvalues of $X_{\hat S}^\top X_{\hat S}$ are bounded away from 0. For $\mathrm{Rem}_{(2.18)}$ and $\mathrm{Rem}_{(2.16)}$, bounding from above the right-hand sides of (2.18) and (2.16) can be achieved using the explicit formula (2.20) and the integrability of the smallest eigenvalue of $X^\top X/n$, i.e., $E[\phi_{\min}(X^\top X/n)^{-k}]\le C(k, \gamma)$ for any integer $k > 0$ when $p/n\le\gamma < 1$. A complication that was overlooked in the above derivation is that (2.18) formally requires $\|\rho\|\le1$; this is the reason why in the formal proof of Theorem 2.1, inequality (2.18) is applied to $\rho$ given in (2.17).

In the general case, for the formal proof of Theorem 2.1 given in Section 8, explicit formulae such as (2.20) are not available. We instead resort to relationships such as (2.19) to link derivatives with respect to $X$ to the derivatives of $\hat\psi$ and $\hat\beta$ with respect to $y$. Some other technicalities are necessary to generalize the above argument to a general loss $\rho$ and a general convex penalty $g$: for instance some derivatives do not exist or cannot be controlled on some events of negligible probability, and we resort to the formulation in Corollary 7.2 to overcome these issues. These technicalities aside, the argument of the previous paragraph outlines the key algebra behind the proof of Theorem 2.1.
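The approximation (2.21) is easy to check numerically for the Lasso, for which $\widehat{\mathrm{df}} = |\hat S|$. The sketch below is our illustration with arbitrary simulation parameters; it uses scikit-learn's Lasso, which minimizes $\|y - Xb\|^2/(2n) + \alpha\|b\|_1$, matching (3.1) with $g = \alpha\|\cdot\|_1$.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, s = 1000, 500, 20
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:s] = 1 / np.sqrt(s)
y = X @ beta + rng.standard_normal(n)

lasso = Lasso(alpha=0.1, fit_intercept=False, max_iter=50000).fit(X, y)
b = lasso.coef_
psi = y - X @ b                  # psi = residuals for the square loss
S = np.count_nonzero(b)          # df-hat = |S-hat| for the Lasso
h = b - beta

lhs = np.sum(h**2) * (n - S)**2
rhs = np.sum(psi**2) * (2*S - p) + np.sum((X.T @ psi)**2)
print(lhs, rhs)                  # the two sides are close, per (2.21)
```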
3. Square loss
Throughout this section $\rho(u) = u^2/2$ in (1.3), so that $\hat\beta$ is the solution
(3.1) $\hat\beta(y, X) = \arg\min_{b\in\mathbb{R}^p}\bigl(\|Xb - y\|^2/(2n) + g(b)\bigr)$
for some convex penalty $g : \mathbb{R}^p\to\mathbb{R}$. Here $\psi(u) = \rho'(u) = u$, and $\hat\psi = \psi(y - X\hat\beta)$ is simply $\hat\psi = y - X\hat\beta$, the vector of residuals of the estimator $\hat\beta$.

For the square loss, the multiplicative factors $\operatorname{Tr}[(\partial/\partial y)\hat\psi]$ and $\operatorname{Tr}[P(\partial/\partial y)X\hat\beta]$ of the previous section have simple expressions that only depend on the effective degrees of freedom of $\hat\beta$ defined by
(3.2) $\widehat{\mathrm{df}} = \operatorname{Tr}\bigl[(\partial/\partial y)X\hat\beta(y, X)\bigr]$.
For any design matrix $X$, the Jacobian $(\partial/\partial y)X\hat\beta(y, X)$ is symmetric with eigenvalues in $[0, 1]$ for almost every $y$ (cf. [BZ19b, Proposition J.1]), so that $0\le\widehat{\mathrm{df}}\le n$ holds. For the square loss, the matrices in (2.5) are simply $\operatorname{diag}(\psi') = P = I_n$ and the multiplicative factors of the previous section are given by
$\operatorname{Tr}[(\partial/\partial y)\hat\psi] = n - \widehat{\mathrm{df}}$, $\quad\operatorname{Tr}[P(\partial/\partial y)X\hat\beta] = \widehat{\mathrm{df}}$.
The quantity $\widehat{\mathrm{df}}$ in (3.2) was introduced in [Ste81] where Stein's Unbiased Risk Estimate (SURE) was developed, showing that
(3.3) $E[\|X(\hat\beta - \beta)\|^2] = E[\widehat{\mathrm{SURE}}]$, $\quad$ where $\quad\widehat{\mathrm{SURE}} = \|y - X\hat\beta\|^2 + 2\sigma^2\widehat{\mathrm{df}} - \sigma^2 n$,
when $\varepsilon\sim N(0, \sigma^2 I_n)$, under mild differentiability and integrability assumptions. Numerous works followed with the goal to characterize the quantity $\widehat{\mathrm{df}}$ for estimators of interest, see for instance
[ZHT07, TT12, Kat09, DKF+13] for the Lasso and the Elastic-Net, and [VDP+] among others. Bounds on $|\|X(\hat\beta - \beta)\|^2 - \widehat{\mathrm{SURE}}|$ holding with high probability, as well as an unbiased estimate of $(\|X(\hat\beta - \beta)\|^2 - \widehat{\mathrm{SURE}})^2$, are also available [BZ18b]. A surprise of the present paper is that for general penalty functions, $\widehat{\mathrm{df}}$ is not only useful to estimate the in-sample error in (3.3), but also the out-of-sample error $\|\Sigma^{1/2}(\hat\beta - \beta)\|^2$.

Estimation targets: Noise level and generalization error
The simple algebraic structure of the square loss allows us to provide generic estimators of the noise level $\sigma^2$ and of the generalization error $\sigma^2 + \|\Sigma^{1/2}h\|^2$, assuming that the components of $\varepsilon$ are iid with mean zero and variance $\sigma^2$. The quantity $\sigma^2 + \|\Sigma^{1/2}h\|^2$ can be seen as the generalization error, since $\sigma^2 + \|\Sigma^{1/2}h\|^2 = E[(x_{\mathrm{new}}^\top\hat\beta - Y_{\mathrm{new}})^2\mid(y, X)]$ where $(x_{\mathrm{new}}^\top, Y_{\mathrm{new}})$ is independent of $(X, y)$ with the same distribution as any row of $(X, y)\in\mathbb{R}^{n\times(p+1)}$.

When the components of $\varepsilon$ are assumed iid, mean-zero with variance $\sigma^2$, the convergence $\|\varepsilon\|^2/n\to\sigma^2$ holds almost surely by the law of large numbers, and $|\|\varepsilon\|^2/n - \sigma^2| = O_P(n^{-1/2})$ by the central limit theorem if the fourth moment of the entries of $\varepsilon$ is uniformly bounded as $n, p\to+\infty$. We may thus consider the estimation targets
$\sigma_*^2 \stackrel{\rm def}{=} \|\varepsilon\|^2/n$ $\quad$ and $\quad\sigma_*^2 + \|\Sigma^{1/2}h\|^2$
for the noise level and the generalization error, respectively. Results for $\sigma^2$ and $\sigma^2 + \|\Sigma^{1/2}h\|^2$ can be deduced up to an extra additive error term of order $|\|\varepsilon\|^2/n - \sigma^2|$, which converges to 0 almost surely and satisfies $|\|\varepsilon\|^2/n - \sigma^2| = O_P(n^{-1/2})$ under the uniformly bounded fourth moment assumption on the components of $\varepsilon$.

Assumption 3.1 (Penalty). The loss is $\rho(u) = u^2/2$ and either one of the following holds:
(i) $p/n\le\gamma\in(0, +\infty)$, $\mu > 0$ and the penalty $g$ is $\mu$-strongly convex with respect to $\Sigma$, in the sense that for any $b, b'\in\mathbb{R}^p$, $d\in\partial g(b)$ and $d'\in\partial g(b')$, the inequality $(d - d')^\top(b - b')\ge\mu\|\Sigma^{1/2}(b - b')\|^2$ holds.
(ii) $p/n\le\gamma < 1$ and the penalty $g$ is only assumed convex.
(iii) For constants $\gamma, \varphi, \eta$ independent of $n, p$, assume that $\operatorname{diag}(\Sigma) = I_p$, $p/n\le\gamma\in(0, \infty)$ and $\phi_{\max}(\Sigma)/\phi_{\min}(\Sigma)\le\varphi$. The penalty is $g(b) = \lambda\|b\|_1$ where $\lambda > 0$ is a tuning parameter. Furthermore, $\varepsilon\sim N(0, \sigma^2 I_n)$ and $s_* > 0$ is a small enough constant depending on $\{\varphi, \gamma, \eta\}$ only such that $\|\beta\|_0\le s_* n$. The tuning parameter is set as $\lambda = \sigma\eta^{-1}\bigl(1 + (2\log(\gamma/s_*))^{1/2}\bigr)n^{-1/2}$.
Here $\gamma, \mu, \varphi, \eta\ge0$ are constants independent of $n, p$.

Theorem 3.1.
Theorem 3.1. Let Assumptions 2.2 and 3.1 be fulfilled. Then almost surely

(3.4)  $\bigl|\bigl(1 - \hat{\mathrm{df}}/n\bigr)^2\{\|\Sigma^{1/2}h\|^2 + \sigma_*^2\} - \|\hat\psi\|^2/n\bigr| \le \mathrm{RHS}$,

$\bigl|\bigl(1 - \hat{\mathrm{df}}/n\bigr)^2\|\Sigma^{1/2}h\|^2 - n^{-2}\{\|\hat\psi\|^2(2\hat{\mathrm{df}} - p) + \|\Sigma^{-1/2}X^\top\hat\psi\|^2\}\bigr| \le \mathrm{RHS}$,

$\bigl|\bigl(1 - \hat{\mathrm{df}}/n\bigr)^2\sigma_*^2 - n^{-2}\{\|\hat\psi\|^2(n - (2\hat{\mathrm{df}} - p)) - \|\Sigma^{-1/2}X^\top\hat\psi\|^2\}\bigr| \le \mathrm{RHS}$

where
$\mathrm{RHS} = V_* n^{-1/2}\bigl(\|\hat\psi\|^2/n + \|\Sigma^{1/2}h\|^2 + \sigma_*^2\bigr)$ and $V_*$ is a non-negative random variable with $\mathbb{E}[I\{\Omega\}V_*] \le C(\gamma, \mu, \varphi, \eta)$ for some event $\Omega$ with $P(\Omega) \to 1$. Furthermore $0 \le (1 - \hat{\mathrm{df}}/n)^{-1} \le C(\gamma, \mu, \varphi, \eta)$ holds in $\Omega$.

The proof is given in Appendix B. Theorem 3.1 provides consistent estimates of the out-of-sample error $\|\Sigma^{1/2}h\|^2$, the generalization error $\|\Sigma^{1/2}h\|^2 + \sigma_*^2$, and the noise level $\sigma_*^2 = \|\varepsilon\|^2/n$. More precisely, define the estimators of the generalization error, of the out-of-sample error and of the noise variance respectively by

$\hat R_{\mathrm{Gen}} = (n - \hat{\mathrm{df}})^{-2}\|\hat\psi\|^2 n$,
$\hat R = (n - \hat{\mathrm{df}})^{-2}\{\|\hat\psi\|^2(2\hat{\mathrm{df}} - p) + \|\Sigma^{-1/2}X^\top\hat\psi\|^2\}$,
$\hat\sigma^2 = (n - \hat{\mathrm{df}})^{-2}\{\|\hat\psi\|^2(n - (2\hat{\mathrm{df}} - p)) - \|\Sigma^{-1/2}X^\top\hat\psi\|^2\}$.

The above estimates have been derived for the unregularized Ordinary Least-Squares in [L+...] and for $\hat\beta = 0$ in [Dic14]. Apart from these works and the specific Lasso penalty, to our knowledge the above estimates for a general convex penalty $g$ are new, so that Theorem 3.1 considerably extends the scope of applications of the estimates $\hat R_{\mathrm{Gen}}$, $\hat R$ and $\hat\sigma^2$. The estimate $\hat R_{\mathrm{Gen}}$ of the generalization error is of particular interest as it does not require the knowledge of $\Sigma$, and can be used to choose the estimator with the smallest estimated generalization error among a collection of convex regularized least-squares estimators of the form (3.1). Since $\hat R_{\mathrm{Gen}}$ estimates the risk for the actual sample size $n$, this provides a favorable alternative to $K$-fold cross-validation, which provides estimates of the risk corresponding to the biased sample size $(1 - 1/K)n$. In defense of cross-validation, which is known to successfully tune parameters in practice for arbitrary data distributions, the above estimates are valid when the rows of $X$ are iid $N(0, \Sigma)$, and it is unclear whether their validity extends to non-Gaussian correlated designs.

With the above notation, the following asymptotic corollary of Theorem 3.1 holds. The proof is given in Appendix B.
Corollary 3.2. For some fixed value of $\gamma > 0$, $\mu \ge 0$, consider a sequence of regression problems and penalties with $n, p \to +\infty$ such that for each $n, p$, the setting and assumptions of Theorem 3.1 are fulfilled. Then

$\bigl|\{\|\Sigma^{1/2}h\|^2 + \sigma_*^2\} - \hat R_{\mathrm{Gen}}\bigr| \le O_P(n^{-1/2})\,\hat R_{\mathrm{Gen}}$,
$\bigl|\|\Sigma^{1/2}h\|^2 - \hat R\bigr| \le O_P(n^{-1/2})\,\hat R_{\mathrm{Gen}}$,
$\bigl|\sigma_*^2 - \hat\sigma^2\bigr| \le O_P(n^{-1/2})\,\hat R_{\mathrm{Gen}}$.

Consequently, for the generalization error, $\hat R_{\mathrm{Gen}}/\{\|\Sigma^{1/2}h\|^2 + \sigma_*^2\} \to 1$ in probability.
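The three estimators above are directly computable from $(y, X, \hat\beta)$ once $\hat{\mathrm{df}}$ is available. The sketch below, not part of the paper, evaluates them for the Lasso with $\Sigma = I_p$ (so that $\Sigma^{-1/2} = I_p$); the active-set formula for $\hat{\mathrm{df}}$, the tuning parameter and the problem sizes are illustrative assumptions.

```python
# Sketch of R_Gen, R and sigma^2_hat from Theorem 3.1 for the Lasso,
# assuming Sigma = I_p; df_hat is the active-set size for this penalty.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p, sigma = 500, 250, 1.0
X = rng.standard_normal((n, p))
beta = np.zeros(p); beta[:10] = 0.5
eps = sigma * rng.standard_normal(n)
y = X @ beta + eps

beta_hat = Lasso(alpha=0.1, fit_intercept=False).fit(X, y).coef_
psi_hat = y - X @ beta_hat                 # residuals (square loss)
df_hat = np.count_nonzero(beta_hat)
h = beta_hat - beta

r_gen = n * (psi_hat @ psi_hat) / (n - df_hat) ** 2
r_oos = ((psi_hat @ psi_hat) * (2 * df_hat - p)
         + np.sum((X.T @ psi_hat) ** 2)) / (n - df_hat) ** 2
sig2 = ((psi_hat @ psi_hat) * (n - (2 * df_hat - p))
        - np.sum((X.T @ psi_hat) ** 2)) / (n - df_hat) ** 2
print(f"R_Gen={r_gen:.3f}  target={h @ h + eps @ eps / n:.3f}")
print(f"R={r_oos:.3f}      target={h @ h:.3f}")
print(f"sigma2_hat={sig2:.3f}  target={eps @ eps / n:.3f}")
```

Comparing each estimate to its target illustrates Corollary 3.2: the relative errors shrink as $n, p$ grow proportionally.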
4. Notation and conventions for the proofs
As defined in (2.2), we consider the functions $\hat\psi = \hat\psi(y, X) = \psi(y - X\hat\beta)$ and $\hat\beta = \hat\beta(y, X)$ as functions of $(y, X) \in \mathbb{R}^n \times \mathbb{R}^{n\times p}$. At a point $(y, X)$ where these functions are Frechet differentiable, the statistician has access to the Jacobians and partial derivatives in (2.3). These two functions, $\hat\psi$ and $\hat\beta$, are the only functions of $(y, X)$ that we will consider; the hats in $\hat\psi, \hat\beta$ emphasize that these are functions of $(y, X)$.

In the proofs, we will also consider functions of $X$ only, such as

(4.1)  $\psi = \psi(y - X\hat\beta)$, $\quad r = n^{-1/2}\psi = n^{-1/2}\psi(y - X\hat\beta)$, $\quad h = \hat\beta - \beta$,

valued in $\mathbb{R}^n$, $\mathbb{R}^n$ and $\mathbb{R}^p$ respectively. Formally, $\psi: \mathbb{R}^{n\times p} \to \mathbb{R}^n$, $r: \mathbb{R}^{n\times p} \to \mathbb{R}^n$, $h: \mathbb{R}^{n\times p} \to \mathbb{R}^p$, and we view the functions in (4.1) as functions of $X$ only while the noise $\varepsilon$ is fixed. We may write for example $r = r(X)$ to recall that convention. We will denote their partial derivatives by $(\partial/\partial x_{ij})$.

To illustrate the above definitions, the function $\psi = \psi(X)$ is related to $\hat\psi = \hat\psi(y, X)$ by $\psi = \hat\psi(X\beta + \varepsilon, X)$, so that if $\hat\psi$ is Frechet differentiable at $(y, X)$ we have

(4.2)  $\sqrt n\,(\partial/\partial x_{ij})r(X) = (\partial/\partial x_{ij})\psi(X) = [(\partial/\partial x_{ij})\hat\psi(y, X) + \beta_j(\partial/\partial y_i)\hat\psi(y, X)] = [(\partial/\partial x_{ij})\hat\psi(y, X) + \hat\beta_j(\partial/\partial y_i)\hat\psi(y, X) - h_j(\partial/\partial y_i)\hat\psi(y, X)]$.

We will apply some versions of Stein's formula to the functions of $X$ in (4.1), conditionally on $\varepsilon$. The derivatives with respect to $X$ of the functions in (4.1) that appear in these Stein formulae bring in the derivatives of $\hat\psi, \hat\beta$ by (4.2). Existence of the Frechet derivatives is proved in Proposition 5.3 below.

For the results on the square loss, we will also consider functions $\mathbb{R}^n \to \mathbb{R}^n$. If $g: \mathbb{R}^n \to \mathbb{R}^n$ is a function of $z$, we denote by $\nabla g$ the gradient of $g$ where $g$ is Frechet differentiable, so that $(\nabla g)_{ik} = \partial g_k/\partial z_i$ and $(\nabla g)^\top$ is the Jacobian of $g$. The notation $\nabla$ will be used only for vector fields, i.e., functions $\mathbb{R}^n \to \mathbb{R}^n$ (or $\Omega \to \mathbb{R}^n$ for some $\Omega \subset \mathbb{R}^n$) for some $n$. The notation $\nabla$ will be avoided for functions that take matrices in $\mathbb{R}^{n\times p}$ as input, such as those in the two previous paragraphs.

Let $I_d$ be the identity matrix of size $d \times d$. For any $p \ge 1$, let $[p]$ be the set $\{1, ..., p\}$. Let $\|\cdot\|$ be the Euclidean norm and $\|\cdot\|_q$ the $\ell_q$ norm of vectors. Let $\|\cdot\|_{\mathrm{op}}$ be the operator norm (largest singular value) and $\|\cdot\|_F$ the Frobenius norm. If $M$ is positive semi-definite we also use the notation $\varphi_{\max}(M)$ and $\varphi_{\min}(M)$ for the largest and smallest eigenvalues. For any event $\Omega$, denote by $I\{\Omega\}$ its indicator function. For $a \in \mathbb{R}$, $a_+ = \max(0, a)$. Throughout the paper, we use $C_1, C_2, ...$ to denote positive absolute constants, $C_1(\gamma), C_2(\gamma), ...$ to denote constants that depend on $\gamma$ only, and for instance $C(\gamma, \mu, \mu_g, \varphi, \eta)$ to denote a constant that depends on $\{\gamma, \mu, \mu_g, \varphi, \eta\}$ only.

Canonical basis vectors are denoted by $(e_i)_{i=1,...,n}$ or $(e_l)_{l=1,...,n}$ for the canonical basis in $\mathbb{R}^n$, and by $(e_j)_{j=1,...,p}$ or $(e_k)_{k=1,...,p}$ for the canonical basis vectors in $\mathbb{R}^p$. Indices $i$ and $l$ are used to loop or sum over $[n] = \{1, ..., n\}$ only, while indices $j$ and $k$ are used to loop or sum over $[p] = \{1, ..., p\}$ only.
This lets us use the notation $e_i, e_l, e_j, e_k$ for canonical basis vectors in $\mathbb{R}^n$ or $\mathbb{R}^p$, as the index reveals without ambiguity whether the canonical basis vector lies in $\mathbb{R}^n$ or in $\mathbb{R}^p$.
5. Derivatives of M-estimators
Proposition 5.1. Let $\rho$ be a loss function such that $\psi = \rho'$ is $L$-Lipschitz. Then for any fixed design matrix $X \in \mathbb{R}^{n\times p}$, the mapping $y \mapsto \psi(y - X\hat\beta)$ is $L$-Lipschitz.

Proof. Let $y, \tilde y \in \mathbb{R}^n$ be two response vectors, $\hat\beta = \hat\beta(y, X)$, $\tilde\beta = \hat\beta(\tilde y, X)$ and $\psi = \psi(y - X\hat\beta)$, $\tilde\psi = \psi(\tilde y - X\tilde\beta)$. The KKT conditions read $X^\top\psi \in n\partial g(\hat\beta)$, $X^\top\tilde\psi \in n\partial g(\tilde\beta)$, where $\partial g(\hat\beta)$ denotes the subdifferential of $g$ at $\hat\beta$. Multiplying by $\hat\beta - \tilde\beta$ and taking the difference of the two KKT conditions above, we find

(5.1)  $n(\partial g(\hat\beta) - \partial g(\tilde\beta))^\top(\hat\beta - \tilde\beta) + [\{y - X\hat\beta\} - \{\tilde y - X\tilde\beta\}]^\top(\psi - \tilde\psi) \ni [X(\hat\beta - \tilde\beta)]^\top(\psi - \tilde\psi) + [\{y - X\hat\beta\} - \{\tilde y - X\tilde\beta\}]^\top(\psi - \tilde\psi) = [y - \tilde y]^\top(\psi - \tilde\psi)$.

By the monotonicity of the subdifferential, $(\partial g(\hat\beta) - \partial g(\tilde\beta))^\top(\hat\beta - \tilde\beta) \subset [0, \infty)$. We now lower bound the second term in the first line, one component $i = 1, ..., n$ at a time. Since $\psi: \mathbb{R} \to \mathbb{R}$ is nondecreasing and $L$-Lipschitz, $\psi(u) - \psi(v) \le L(u - v)$ holds for any $u > v$, as well as $(\psi(u) - \psi(v))^2 \le L(u - v)(\psi(u) - \psi(v))$ since $\psi(u) - \psi(v) \ge 0$ by monotonicity. Applying this inequality for each $i$ to $u = y_i - x_i^\top\hat\beta$ and $v = \tilde y_i - x_i^\top\tilde\beta$, we obtain

$L^{-1}\|\psi - \tilde\psi\|^2 \le [\{y - X\hat\beta\} - \{\tilde y - X\tilde\beta\}]^\top(\psi - \tilde\psi) \le [y - \tilde y]^\top(\psi - \tilde\psi)$.

The Cauchy-Schwarz inequality completes the proof.

Proposition 5.1 generalizes the result of [BT17] to general loss functions. The following proposition uses a variant of (5.1) to derive Lipschitz properties with respect to $(y, X)$.
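Before stating it, here is a quick numerical sanity check of Proposition 5.1, not part of the paper: for the Huber loss, whose derivative $\psi$ is 1-Lipschitz, with a ridge penalty, the residual map $y \mapsto \psi(y - X\hat\beta(y, X))$ should be 1-Lipschitz. The threshold-1 Huber loss, the ridge penalty, the problem sizes and the BFGS solve are all illustrative assumptions.

```python
# Numerical check of Proposition 5.1: the map y -> psi(y - X beta_hat(y))
# is 1-Lipschitz for the Huber loss (psi is 1-Lipschitz) with ridge penalty.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
n, p, lam = 50, 20, 0.5
X = rng.standard_normal((n, p))

def rho(u):                      # Huber loss with threshold 1
    return np.where(np.abs(u) <= 1, u**2 / 2, np.abs(u) - 0.5)

def psi(u):                      # derivative of rho, 1-Lipschitz
    return np.clip(u, -1.0, 1.0)

def beta_hat(y):                 # M-estimator (1.3) via a generic solver
    obj = lambda b: rho(y - X @ b).mean() + lam * b @ b / 2
    return minimize(obj, np.zeros(p), method="BFGS").x

y = rng.standard_normal(n)
y_tilde = y + 0.1 * rng.standard_normal(n)
lhs = np.linalg.norm(psi(y - X @ beta_hat(y)) - psi(y_tilde - X @ beta_hat(y_tilde)))
print(lhs, "<=", np.linalg.norm(y - y_tilde))   # Lipschitz bound of Prop. 5.1
```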
Proposition 5.2. Assume that $\psi$ is $L$-Lipschitz. Let $(y, X) \in \mathbb{R}^n \times \mathbb{R}^{n\times p}$ and $(\tilde y, \tilde X) \in \mathbb{R}^n \times \mathbb{R}^{n\times p}$ be fixed. Let $\hat\beta = \hat\beta(y, X)$ be the estimator in (1.3) with observed data $(y, X)$ and let $\tilde\beta = \arg\min_{b\in\mathbb{R}^p} \frac{1}{n}\sum_{i=1}^n \rho(\tilde y_i - e_i^\top\tilde X b) + g(b)$, i.e., the same M-estimator as (1.3) with the data $(y, X)$ replaced by $(\tilde y, \tilde X)$. Set $\psi = \psi(y - X\hat\beta)$ as well as $\tilde\psi = \psi(\tilde y - \tilde X\tilde\beta)$. Then

(5.2)  $n\mu\|\Sigma^{1/2}(\hat\beta - \tilde\beta)\|^2 + \max\bigl(L^{-1}\|\tilde\psi - \psi\|^2,\ \mu_\rho\|y - X\hat\beta - \{\tilde y - \tilde X\tilde\beta\}\|^2\bigr) \le (\tilde\beta - \hat\beta)^\top(\tilde X - X)^\top\psi + (\tilde y + (X - \tilde X)\hat\beta - y)^\top(\tilde\psi - \psi)$.

Consequently:

(i) The map $(y, X) \mapsto (\hat\beta(y, X), \hat\psi(y, X))$ is Lipschitz on every compact subset of $\mathbb{R}^n \times \mathbb{R}^{n\times p}$ if $\mu > 0$ as in Assumption 2.3(i).

(ii) The map $(y, X) \mapsto (\hat\beta(y, X), \hat\psi(y, X))$ is Lipschitz on every compact subset of $\{(y, X) \in \mathbb{R}^n \times \mathbb{R}^{n\times p} : \varphi_{\min}(\Sigma^{-1/2}X^\top X\Sigma^{-1/2}) > 0\}$ if $\mu_\rho > 0$ as in Assumption 2.3(ii).

(iii) If $\tilde X = X$ and $\tilde y = y$, we must have $\psi = \tilde\psi$. This means that if $\hat\beta$ and $\tilde\beta$ are two distinct solutions of the optimization problem (1.3), then $\psi(y - X\hat\beta) = \psi(y - X\tilde\beta)$ must hold.

Proof. The KKT conditions for $\hat\beta$ and $\tilde\beta$ read $\tilde X^\top\tilde\psi \in n\partial g(\tilde\beta)$ and $X^\top\psi \in n\partial g(\hat\beta)$. If $D_g = (\hat\beta - \tilde\beta)^\top(\partial g(\hat\beta) - \partial g(\tilde\beta))$ then

(5.3)  $nD_g + (\tilde y - \tilde X\tilde\beta - y + X\hat\beta)^\top(\tilde\psi - \psi) \ni (\tilde\beta - \hat\beta)^\top(\tilde X^\top\tilde\psi - X^\top\psi) + (\tilde y - \tilde X\tilde\beta - y + X\hat\beta)^\top(\tilde\psi - \psi) = (\tilde\beta - \hat\beta)^\top(\tilde X - X)^\top\psi + (\tilde y + (X - \tilde X)\hat\beta - y)^\top(\tilde\psi - \psi)$.

Since for any reals $u > s$ the inequality $(u - s)(\psi(u) - \psi(s)) \ge L^{-1}(\psi(u) - \psi(s))^2$ holds when $\psi$ is $L$-Lipschitz and non-decreasing, the first line is bounded from below by $n\inf D_g + L^{-1}\|\psi - \tilde\psi\|^2$. If Assumption 2.3(ii) is satisfied for some $\mu_\rho > 0$, we also have $(u - s)(\psi(u) - \psi(s)) \ge \mu_\rho(u - s)^2$, so that the first line is bounded from below by $\mu_\rho\|y - X\hat\beta - \{\tilde y - \tilde X\tilde\beta\}\|^2$. We also have $\inf D_g \ge \mu\|\Sigma^{1/2}(\tilde\beta - \hat\beta)\|^2$ by monotonicity of the subdifferential and strong convexity of $g$ with respect to $\Sigma$. This proves (5.2).

For (i), by bounding from above the right hand side of (5.2), we find

$\min(\mu, L^{-1})\bigl(\|\Sigma^{1/2}(\hat\beta - \tilde\beta)\|^2 + \|\psi - \tilde\psi\|^2/n\bigr)^{1/2} \le n^{-1/2}\|y - \tilde y\| + n^{-1/2}\|(X - \tilde X)\Sigma^{-1/2}\|_{\mathrm{op}}\bigl(n^{-1/2}\|\psi\| + \|\Sigma^{1/2}\hat\beta\|\bigr)$.

By taking a fixed $X_0$, e.g. $X_0 = 0_{n\times p}$, this implies that the supremum $S(K) = \sup_{(\tilde y, \tilde X)\in K}(\|\Sigma^{1/2}\tilde\beta\| + n^{-1/2}\|\tilde\psi\|)$ is finite for every compact $K$.
If $(y, X), (\tilde y, \tilde X) \in K$, the right hand side is bounded from above by $n^{-1/2}\|y - \tilde y\| + n^{-1/2}\|(X - \tilde X)\Sigma^{-1/2}\|_{\mathrm{op}} S(K)$, which proves that the map is Lipschitz on $K$.

For (ii), we use that on the left hand side of (5.2),

$\|y - X\hat\beta - \{\tilde y - \tilde X\tilde\beta\}\|^2 = \|\tilde X(\hat\beta - \tilde\beta)\|^2 + 2[\tilde X(-\hat\beta + \tilde\beta)]^\top[y - \tilde y + (\tilde X - X)\hat\beta] + \|y - \tilde y + (\tilde X - X)\hat\beta\|^2$.

Combined with (5.2), this implies that

$\min\bigl\{\mu_\rho\varphi_{\min}(n^{-1}\Sigma^{-1/2}\tilde X^\top\tilde X\Sigma^{-1/2}), L^{-1}\bigr\}\bigl(n\|\Sigma^{1/2}(\hat\beta - \tilde\beta)\|^2 + \|\psi - \tilde\psi\|^2\bigr) \le (\tilde\beta - \hat\beta)^\top(\tilde X - X)^\top\psi + (\tilde y + (X - \tilde X)\hat\beta - y)^\top(\tilde\psi - \psi) - 2\mu_\rho[\tilde X(-\hat\beta + \tilde\beta)]^\top[y - \tilde y + (\tilde X - X)\hat\beta]$.

The same argument as in (i) applies on every compact where the singular values of $n^{-1/2}\tilde X\Sigma^{-1/2}$ are bounded away from $0$ and $+\infty$. Finally, (iii) directly follows from (5.2).
Proposition 5.3. Assume that $\psi = \rho'$ is $M$-Lipschitz. Let Assumption 2.3(i) or Assumption 2.3(ii) be fulfilled. Then the maps $\hat\psi, \hat\beta$ in (2.2) are Frechet differentiable at $(y, X)$ for almost every $(y, X)$. Furthermore the chain rule

(5.4)  $(\partial/\partial y)\hat\psi(y, X) = \mathrm{diag}(\psi')[I_n - (\partial/\partial y)X\hat\beta(y, X)]$

with $\mathrm{diag}(\psi')$ defined in (2.5) holds for almost every $(y, X)$. If instead Assumption 2.3(iii) holds, then there exists an open set $U \subset \mathbb{R}^n \times \mathbb{R}^{n\times p}$ with $P((y, X) \in U) \to 1$ such that the maps $\hat\psi, \hat\beta$ in (2.2) are Frechet differentiable at $(y, X)$ for almost every $(y, X) \in U$, and (5.4) holds almost everywhere in $U$.

Proof. For Lebesgue almost every $(y, X)$, provided that either Assumption 2.3(i) or (ii) holds, there exists an open set $U' \ni (y, X)$ and a compact $K$ with $U' \subset K \subset U_0$, where $U_0 = \{(y, X) : \varphi_{\min}(\Sigma^{-1/2}X^\top X\Sigma^{-1/2}) + \mu > 0\}$. By Proposition 5.2, the maps $\hat\psi$ and $\hat\beta$ in (2.2) are Lipschitz on $K$, so that the Frechet derivatives of $\hat\beta$ and $\hat\psi$ exist almost everywhere by Rademacher's theorem.

The validity of the chain rule (5.4) boils down to the chain rule for $\psi \circ u_i$ where $u_i: U' \to \mathbb{R}$, $(y, X) \mapsto u_i(y, X) = y_i - x_i^\top\hat\beta(y, X)$, for all $i \in [n]$. Since $\psi: \mathbb{R} \to \mathbb{R}$ is Lipschitz and $u_i$ is Lipschitz in $U'$, [Zie89, Theorem 2.1.11] implies that $(\partial/\partial y_l)\hat\psi_i(y, X) = \psi'(u_i(y, X))(\partial/\partial y_l)u_i(y, X)$ almost everywhere in $U'$. (This version of the chain rule is straightforward at points $(y, X)$ where $\psi'(u_i(y, X))$ and $(\partial/\partial y_l)u_i(y, X)$ both exist, as well as at points $(y, X)$ where $(\partial/\partial y_l)u_i(y, X) = 0$, thanks to $t^{-1}|\hat\psi_i(y + te_l, X) - \hat\psi_i(y, X)| \le M t^{-1}|u_i(y + te_l, X) - u_i(y, X)|$, in which case $(\partial/\partial y_l)\hat\psi_i(y, X) = 0$ and $\psi'(u_i(y, X))$ need not exist. The non-trivial part is to prove that the set $\{(y, X) \in U' : \psi'(u_i(y, X))$ fails to exist and $(\partial/\partial y_l)u_i(y, X) \ne 0\}$ has Lebesgue measure 0.)

Under Assumption 2.3(iii), we set $U = \{(y, X) : (y - X\beta, X) \in \Omega_L\}$ where $\Omega_L$ is the open set defined in Proposition D.1. By Proposition D.1(i), the maps (2.2) are Lipschitz in every compact subset of $U$ and the conclusion follows by the same argument.

In this subsection, $\varepsilon$ is fixed and we consider functions of $X \in \mathbb{R}^{n\times p}$ as defined in the following lemma.
Lemma 5.4. Let $\varepsilon \in \mathbb{R}^n$ be fixed and let $X, \tilde X$ be two design matrices. Define $\hat\beta = \hat\beta(X\beta + \varepsilon, X)$ and $\tilde\beta = \hat\beta(\tilde X\beta + \varepsilon, \tilde X)$, $\psi = \psi(\varepsilon + X\beta - X\hat\beta)$ and $\tilde\psi = \psi(\varepsilon + \tilde X\beta - \tilde X\tilde\beta)$, as well as $r = n^{-1/2}\psi$ and $\tilde r = n^{-1/2}\tilde\psi$, $h = \hat\beta - \beta$ and $\tilde h = \tilde\beta - \beta$. Let also $D = (\|r\|^2 + \|\Sigma^{1/2}h\|^2)^{1/2}$ and $\tilde D = (\|\tilde r\|^2 + \|\Sigma^{1/2}\tilde h\|^2)^{1/2}$. If for some constant $L_*$ and some open set $U \subset \mathbb{R}^{n\times p}$, for every $\{X, \tilde X\} \subset U$,

(5.5)  $\bigl(\|\Sigma^{1/2}(h - \tilde h)\|^2 + \|r - \tilde r\|^2\bigr)^{1/2} \le n^{-1/2}\|(X - \tilde X)\Sigma^{-1/2}\|_{\mathrm{op}}\, L_*\, \bigl(\|r\|^2 + \|\Sigma^{1/2}h\|^2\bigr)^{1/2}$

holds, then we also have the Lipschitz properties

(5.6)  $\|\Sigma^{1/2}(hD^{-1} - \tilde h\tilde D^{-1})\| \le 2n^{-1/2}\|(X - \tilde X)\Sigma^{-1/2}\|_{\mathrm{op}} L_*$,
(5.7)  $\|rD^{-1} - \tilde r\tilde D^{-1}\| \le 2n^{-1/2}\|(X - \tilde X)\Sigma^{-1/2}\|_{\mathrm{op}} L_*$,
(5.8)  $n^{-1/2}\|\Sigma^{-1/2}(X^\top rD^{-1} - \tilde X^\top\tilde r\tilde D^{-1})\| \le n^{-1/2}\|(X - \tilde X)\Sigma^{-1/2}\|_{\mathrm{op}}\bigl(1 + 2L_*\|n^{-1/2}\tilde X\Sigma^{-1/2}\|_{\mathrm{op}}\bigr)$,
(5.9)  $|D^{-1} - \tilde D^{-1}| \le n^{-1/2}\|(X - \tilde X)\Sigma^{-1/2}\|_{\mathrm{op}} L_*\, \tilde D^{-1}$.

Our proof relies in major ways on the Lipschitz property (5.5). The constant $L_*$ and the set $U$ are given for Assumption 2.3(i) in Proposition 5.5 below, and for Assumption 2.3(ii) in Proposition 5.6.
Proof of Lemma 5.4. Assume that $\Sigma = I_p$ without loss of generality, by performing the variable change (2.12) if necessary. By the triangle inequality, $\|hD^{-1} - \tilde h\tilde D^{-1}\| \le \|\tilde h\||D^{-1} - \tilde D^{-1}| + D^{-1}\|h - \tilde h\|$. Then $D^{-1}\|h - \tilde h\| \le n^{-1/2}\|X - \tilde X\|_{\mathrm{op}} L_*$ for the second term by (5.5). For the first term, $\|\tilde h\||D^{-1} - \tilde D^{-1}| \le D^{-1}|D - \tilde D| \le D^{-1}(\|r - \tilde r\|^2 + \|h - \tilde h\|^2)^{1/2}$ by the triangle inequality, and another application of (5.5) provides (5.6). The exact same argument provides (5.7), since the roles of $h$ and $r$ are symmetric in (5.5). For (5.8), we use

$\|X^\top rD^{-1} - \tilde X^\top\tilde r\tilde D^{-1}\| \le \|X - \tilde X\|_{\mathrm{op}}(\|r\|D^{-1}) + \|\tilde X\|_{\mathrm{op}}\|rD^{-1} - \tilde r\tilde D^{-1}\|$.

Combined with (5.7) and $\|r\|D^{-1} \le 1$, this provides (5.8). For the fourth inequality, by the triangle inequality $|D^{-1} - \tilde D^{-1}| \le (D\tilde D)^{-1}|D - \tilde D| \le n^{-1/2}\|X - \tilde X\|_{\mathrm{op}} L_*\tilde D^{-1}$ thanks to (5.5).
Proposition 5.5. Let Assumption 2.1 and Assumption 2.3(i) be fulfilled. Consider the notation of Lemma 5.4 for $X, \psi, r, h$ and $\tilde X, \tilde\psi, \tilde r, \tilde h$. Then by (5.2) we have

(5.10)  $\mu\|\Sigma^{1/2}(h - \tilde h)\|^2 + \|r - \tilde r\|^2 = \mu\|\Sigma^{1/2}(\hat\beta - \tilde\beta)\|^2 + \|\psi - \tilde\psi\|^2/n \le \bigl[(\tilde h - h)^\top(\tilde X - X)^\top\psi + h^\top(X - \tilde X)^\top(\tilde\psi - \psi)\bigr]/n \le n^{-1/2}\|(X - \tilde X)\Sigma^{-1/2}\|_{\mathrm{op}}\bigl(\|\Sigma^{1/2}(\tilde h - h)\|^2 + \|\tilde r - r\|^2\bigr)^{1/2}\bigl(\|r\|^2 + \|\Sigma^{1/2}h\|^2\bigr)^{1/2}$.

Hence (5.5) holds for $L_* = \max(\mu^{-1}, 1)$ and all $X, \tilde X \in \mathbb{R}^{n\times p}$.

Proof. This follows by (5.2) with $\tilde y = \varepsilon + \tilde X\beta$ and $y = \varepsilon + X\beta$. The last inequality in (5.10) is due to the Cauchy-Schwarz inequality. Inequality (5.5) with the given $L_*$ is obtained by dividing by $(\|r - \tilde r\|^2 + \|\Sigma^{1/2}(h - \tilde h)\|^2)^{1/2}$.
Proposition 5.6. Let Assumption 2.1 and Assumption 2.3(ii) be fulfilled. Consider the notation of Lemma 5.4 for $X, \psi, r, h$ and $\tilde X, \tilde\psi, \tilde r, \tilde h$. Then

$\min\bigl(1, \mu_\rho\varphi_{\min}(\Sigma^{-1/2}\tilde X^\top\tilde X\Sigma^{-1/2}/n)\bigr)\max\bigl(\|r - \tilde r\|^2, \|\Sigma^{1/2}(h - \tilde h)\|^2\bigr) \le n^{-1/2}\|(X - \tilde X)\Sigma^{-1/2}\|_{\mathrm{op}}\bigl[\|\Sigma^{1/2}(\tilde h - h)\| \vee \|\tilde r - r\|\bigr]\bigl[\|r\| + \|\Sigma^{1/2}h\|\bigl(1 + 2\mu_\rho\|n^{-1/2}\tilde X\Sigma^{-1/2}\|_{\mathrm{op}}\bigr)\bigr]$.

Hence (5.5) holds with $L_* = \max\bigl(1, \mu_\rho^{-1}((1 - \sqrt\gamma)/2)^{-2}, 2\mu_\rho(2 + \sqrt\gamma)\bigr)$ for all $X, \tilde X \in U$, where $U = \{X \in \mathbb{R}^{n\times p} : (1 - \sqrt\gamma)/2 \le \varphi_{\min}(X\Sigma^{-1/2}n^{-1/2}) \le \varphi_{\max}(X\Sigma^{-1/2}n^{-1/2}) \le 2 + \sqrt\gamma\}$.
Proof. By (5.2) we have

$\max\bigl(\|r - \tilde r\|^2, \mu_\rho\|Xh - \tilde X\tilde h\|^2/n\bigr) \le \bigl[(\tilde h - h)^\top(\tilde X - X)^\top\psi - h^\top(X - \tilde X)^\top(\tilde\psi - \psi)\bigr]/n$.

We also have $\|Xh - \tilde X\tilde h\|^2 = \|\tilde X(h - \tilde h)\|^2 + 2[\tilde X(h - \tilde h)]^\top(X - \tilde X)h + \|(X - \tilde X)h\|^2$, so that the previous display implies

$\max\bigl(\|r - \tilde r\|^2, \mu_\rho\|\tilde X(h - \tilde h)\|^2/n\bigr) \le \bigl[(\tilde h - h)^\top(\tilde X - X)^\top\psi - h^\top(X - \tilde X)^\top(\tilde\psi - \psi) - 2\mu_\rho[\tilde X(h - \tilde h)]^\top(X - \tilde X)h\bigr]/n$

and the desired conclusion holds by the Cauchy-Schwarz inequality and properties of the operator norm.
Corollary 5.7. Let the setting and assumptions of Proposition 5.2 be fulfilled. If $\tilde y = y + \hat\beta_j v$ and $\tilde X = X + ve_j^\top$ for any direction $v \in \mathbb{R}^n$ with $v^\top\psi = 0$ and index $j \in [p]$, then $\tilde\psi = \psi$ and the solution $\hat\beta$ of the optimization problem (1.3) is also a solution of the same optimization problem with $(y, X)$ replaced by $(\tilde y, \tilde X)$. If additionally $\mu > 0$, then $\tilde\beta = \hat\beta$ must hold.

Proof. The right hand side of (5.2) is 0 for the given $\tilde y - y$ and $\tilde X - X$. This proves that $\tilde\psi = \psi$. Furthermore, the KKT conditions for $\hat\beta$ read $X^\top\psi \in n\partial g(\hat\beta)$, and we have $X^\top\psi = \tilde X^\top\psi$ and $\psi = \psi(y - X\hat\beta) = \psi(\tilde y - \tilde X\hat\beta)$. This implies that $\tilde X^\top\psi(\tilde y - \tilde X\hat\beta) \in n\partial g(\hat\beta)$, so that $\hat\beta$ is a solution to the optimization problem with data $(\tilde y, \tilde X)$, even if $\mu = 0$. The claim $\hat\beta = \tilde\beta$ for $\mu > 0$ follows by unicity of the minimizer of strongly convex functions.
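The invariance of Corollary 5.7 can be verified numerically in the simplest strongly convex case, the square loss with a ridge penalty, where $\hat\beta$ has the closed form $(X^\top X/n + \lambda I_p)^{-1}X^\top y/n$. This sketch, with illustrative sizes and tuning, is not part of the paper:

```python
# Numerical illustration of Corollary 5.7: square loss + ridge (mu > 0).
# Perturbing (y, X) along a direction v with v^T psi = 0 as in the
# corollary leaves beta_hat exactly unchanged.
import numpy as np

rng = np.random.default_rng(3)
n, p, lam, j = 60, 30, 1.0, 4
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

def beta_hat(y, X):
    return np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)

b = beta_hat(y, X)
psi = y - X @ b                                # residuals, psi(u) = u
v = rng.standard_normal(n)
v -= (v @ psi) / (psi @ psi) * psi             # project so that v^T psi = 0
y_t = y + b[j] * v                             # perturbed data of Corollary 5.7
X_t = X + np.outer(v, np.eye(p)[j])
print(np.max(np.abs(beta_hat(y_t, X_t) - b)))  # ~1e-15: beta_hat is unchanged
```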
Proposition 5.8. Let $\hat\beta$ be the M-estimator (1.3) with convex loss $\rho$ and $\mu$-strongly convex penalty $g$. For a given value of $(y, X)$, assume that $\hat\beta$ is Frechet differentiable at $(y, X)$, let $\hat\beta = \hat\beta(y, X)$ and $\mathrm{diag}(\psi') = \mathrm{diag}(\psi'_1, ..., \psi'_n)$ where $\psi'_i = \psi'(y_i - x_i^\top\hat\beta)$ for each $i = 1, ..., n$. Let $\psi = \psi(y - X\hat\beta)$. Then for every $U \in \mathbb{R}^{n\times n}$,

$b(t) = \hat\beta\bigl(y + tU\mathrm{diag}(\psi')X\hat\beta - tU^\top\psi,\ X + tU\mathrm{diag}(\psi')X\bigr)$

satisfies $\mathrm{diag}(\psi')X\frac{d}{dt}b(t)\bigl|_{t=0} = 0$ and $\mu\frac{d}{dt}b(t)\bigl|_{t=0} = 0$ when $\psi$ is continuously differentiable. If $\psi$ is only 1-Lipschitz (but not necessarily everywhere differentiable) and additionally the function $(y, X) \mapsto \hat\beta(y, X)$ is Lipschitz in some bounded open set $U$, then $\mathrm{diag}(\psi')X\frac{d}{dt}b(t)\bigl|_{t=0} = 0$ still holds for almost every $(y, X) \in U$.

Proof. For $t \in \mathbb{R}$, let $y(t) = y + tU\mathrm{diag}(\psi')X\hat\beta - tU^\top\psi$ and $X(t) = X + tU\mathrm{diag}(\psi')X$, so that $b(t) = \hat\beta(y(t), X(t))$. Set $\psi(t) = \psi(y(t) - X(t)b(t))$ where as usual $\psi = \rho'$ acts componentwise. The KKT conditions at $t$ and $0$ read $X(t)^\top\psi(t) \in n\partial g(b(t))$, $X^\top\psi(0) \in n\partial g(\hat\beta)$. By multiplying the difference of these KKT conditions by $(b(t) - \hat\beta)$ and using the strong convexity of $g$ with respect to $\Sigma$, we find

$n\mu\|\Sigma^{1/2}(b(t) - \hat\beta)\|^2 \le (b(t) - \hat\beta)^\top\bigl[X(t)^\top\psi(t) - X^\top\psi\bigr]$

or equivalently

$n\mu\|\Sigma^{1/2}(b(t) - \hat\beta)\|^2 + (b(t) - \hat\beta)^\top X^\top\mathrm{diag}(\psi')X(b(t) - \hat\beta) \le (b(t) - \hat\beta)^\top\bigl[X(t)^\top\psi(t) - X^\top\psi + X^\top\mathrm{diag}(\psi')X(b(t) - \hat\beta)\bigr]$.

Let $\dot\psi \in \mathbb{R}^n$, $\dot b \in \mathbb{R}^p$ and $\dot X = U\mathrm{diag}(\psi')X$ be the derivatives at 0, uniquely defined by $\psi(t) - \psi = t\dot\psi + o(t)$, $b(t) - \hat\beta = t\dot b + o(t)$ and $X(t) - X = t\dot X + o(t)$. Then $\dot\psi = \mathrm{diag}(\psi')[-U^\top\psi - X\dot b]$ by the chain rule if $\psi$ is continuously differentiable, and $\frac{d}{dt}X(t)^\top\psi(t)\bigl|_{t=0} = X^\top\dot\psi + \dot X^\top\psi$. This implies that the right hand side of the previous display is $o(t^2)$, as all terms proportional to $t$ and $t^2$ cancel out. Dividing by $t^2$ and taking the limit as $t \to 0$ proves the claim.

The second claim, where $\psi$ is only assumed to be 1-Lipschitz, follows from [Zie89, Theorem 2.1.11], which proves the validity of the chain rule almost everywhere in $U$ for compositions of the form $\psi \circ u$ where $\psi: \mathbb{R} \to \mathbb{R}$ and $u: U \to \mathbb{R}$ are Lipschitz. See also the discussion in the proof of (5.4).

We now collect a useful identity as a consequence of Proposition 5.8. First, note that $\mathrm{diag}(\psi')X\frac{d}{dt}b(t)\bigl|_{t=0} = 0$ if and only if $PX\frac{d}{dt}b(t)\bigl|_{t=0} = 0$, because $P$ and $\mathrm{diag}(\psi')$ are diagonal with the same nullspace.
For every $i, l \in [n]$, with $U = e_ie_l^\top$ for two canonical vectors in $\mathbb{R}^n$, and using the notation $\psi_i = \psi(y_i - x_i^\top\hat\beta)$ and $\psi'_l = \psi'(y_l - x_l^\top\hat\beta)$ for brevity, we find

$\psi_i PX\frac{\partial\hat\beta}{\partial y_l}(y, X) = PX\Bigl[(x_l^\top\hat\beta)\frac{\partial\hat\beta}{\partial y_i}(y, X) + \sum_{k=1}^p x_{lk}\frac{\partial\hat\beta}{\partial x_{ik}}(y, X)\Bigr]\psi'_l = \sum_{k=1}^p PX\Bigl[\frac{\partial\hat\beta}{\partial y_i}(y, X)\hat\beta_k + \frac{\partial\hat\beta}{\partial x_{ik}}(y, X)\Bigr]e_k^\top X^\top\mathrm{diag}(\psi')e_l$.

Multiplying to the left by $e_l^\top$ and summing over $l \in \{1, ..., n\}$, we get

$\psi_i\,\mathrm{Tr}\Bigl[PX\frac{\partial\hat\beta}{\partial y}(y, X)\Bigr] = \sum_{k=1}^p \mathrm{Tr}\Bigl[e_k^\top X^\top\mathrm{diag}(\psi')X\Bigl[\frac{\partial\hat\beta}{\partial y_i}(y, X)\hat\beta_k + \frac{\partial\hat\beta}{\partial x_{ik}}(y, X)\Bigr]\Bigr]$

where we used $\mathrm{Tr}[UV] = \mathrm{Tr}[VU]$ and $\mathrm{diag}(\psi')P = \mathrm{diag}(\psi')$. In the right hand side, since the quantity inside the trace is a scalar, the trace sign can be omitted. Finally, we relate the right hand side to the derivatives of $\hat\psi(y, X)$. By the chain rule (5.4) we have

$(\partial/\partial y_i)\hat\psi = \mathrm{diag}(\psi')[e_i - X(\partial/\partial y_i)\hat\beta]$, $\quad (\partial/\partial x_{ik})\hat\psi = \mathrm{diag}(\psi')[-e_i\hat\beta_k - X(\partial/\partial x_{ik})\hat\beta]$,

so that $\hat\beta_k(\partial/\partial y_i)\hat\psi + (\partial/\partial x_{ik})\hat\psi = -\mathrm{diag}(\psi')X[\hat\beta_k(\partial/\partial y_i)\hat\beta + (\partial/\partial x_{ik})\hat\beta]$, and the previous display can be rewritten as

(5.11)  $\psi_i\,\mathrm{Tr}\Bigl[PX\frac{\partial\hat\beta}{\partial y}(y, X)\Bigr] = -\sum_{k=1}^p e_k^\top X^\top\Bigl[\hat\beta_k\frac{\partial\hat\psi}{\partial y_i}(y, X) + \frac{\partial\hat\psi}{\partial x_{ik}}(y, X)\Bigr]$.
Proposition 5.9. Let $\varepsilon \in \mathbb{R}^n$ be a constant vector. Consider a minimizer $\hat\beta$ of (1.3), assume that the function $\hat\psi: (y, X) \mapsto \psi(y - X\hat\beta(y, X))$ is Frechet differentiable at $(y, X)$ and denote its partial derivatives by $(\partial/\partial x_{ij})$ and $(\partial/\partial y_i)$. We also consider the function $r: X \mapsto n^{-1/2}\hat\psi(\varepsilon + X\beta, X)$.

(ii) For every $j \in [p]$, the quantity $\delta_j \stackrel{\mathrm{def}}{=} n^{-1/2}\mathrm{Tr}[\frac{\partial\hat\psi}{\partial y}]h_j + \sum_{i=1}^n\frac{\partial r_i}{\partial x_{ij}}(X)$ satisfies

(5.12)  $|\delta_j| \le \frac{|h_j|}{\sqrt n} + \max_{u\in\mathbb{R}^n, \|u\|=1}\Bigl\|\sum_{i=1}^n u_i\frac{\partial r}{\partial x_{ij}}(X)\Bigr\|$.

(iii) Consequently, by the triangle inequality,

(5.13)  $\sum_{j=1}^p \delta_j^2 \le \frac{2\|h\|^2}{n} + 2\sum_{j=1}^p\sum_{i=1}^n\Bigl\|\frac{\partial r}{\partial x_{ij}}(X)\Bigr\|^2$.

Proof. (ii) Since $r$ is a function of $X$ given by $r(X) = n^{-1/2}\hat\psi(\varepsilon + X\beta, X)$, taking the directional derivative of $r$ in a direction $ve_j^\top$ with $v^\top(y - X\hat\beta) = 0$ (i.e., by considering $X + tve_j^\top$ and letting $t \to 0$) gives

$h_j\bigl[\tfrac{\partial\hat\psi}{\partial y}\bigr]v + \sqrt n\sum_{i=1}^n v_i\frac{\partial r}{\partial x_{ij}}(X) \stackrel{(i)}{=} h_j\bigl[\tfrac{\partial\hat\psi}{\partial y}\bigr]v + \sum_{i=1}^n v_i\Bigl[\beta_j\frac{\partial\hat\psi}{\partial y_i}(y, X) + \frac{\partial\hat\psi}{\partial x_{ij}}(y, X)\Bigr] \stackrel{(ii)}{=} \sum_{i=1}^n v_i\Bigl[\hat\beta_j\frac{\partial\hat\psi}{\partial y_i}(y, X) + \frac{\partial\hat\psi}{\partial x_{ij}}(y, X)\Bigr] \stackrel{(iii)}{=} \frac{d}{dt}\hat\psi(y + tv\hat\beta_j, X + tve_j^\top)\Bigl|_{t=0} \stackrel{(iv)}{=} 0$

where (i) follows by the chain rule (4.2), (ii) uses $[\frac{\partial\hat\psi}{\partial y}]v = \sum_{i=1}^n v_i\frac{\partial\hat\psi}{\partial y_i}$ and $h_j + \beta_j = \hat\beta_j$, (iii) follows by the relationship between directional and Frechet derivatives, and (iv) by Corollary 5.7. This implies that the matrix $M^{(j)} \in \mathbb{R}^{n\times n}$ with entries $M^{(j)}_{ki} = \sqrt n(\partial/\partial x_{ij})r_k(X)$ satisfies $M^{(j)}v + h_j[\frac{\partial\hat\psi}{\partial y}]v = 0$ for all $v$ with $v^\top r = 0$. Hence the nullspace of the matrix $M^{(j)} + h_j\frac{\partial\hat\psi}{\partial y}$ contains the hyperplane $\{v \in \mathbb{R}^n : v^\top r = 0\}$, and $M^{(j)} + h_j\frac{\partial\hat\psi}{\partial y}$ has rank at most one, so that

$\bigl|\mathrm{Tr}\bigl[M^{(j)} + h_j\tfrac{\partial\hat\psi}{\partial y}\bigr]\bigr| \le \bigl\|M^{(j)} + h_j\tfrac{\partial\hat\psi}{\partial y}\bigr\|_{\mathrm{op}} \le \|M^{(j)}\|_{\mathrm{op}} + \bigl\|\tfrac{\partial\hat\psi}{\partial y}\bigr\|_{\mathrm{op}}|h_j|$.

Since $\frac{\partial\hat\psi}{\partial y}$ has operator norm at most one by Proposition 5.1 when $\psi$ is 1-Lipschitz, we obtain (5.12).

(iii) For (5.13) we sum the squares of (5.12) and use $(a + b)^2 \le 2a^2 + 2b^2$. For the second term, for any $u^{(1)}, ..., u^{(p)} \in \mathbb{R}^n$, each with norm one,

$\sum_{j=1}^p\Bigl\|\sum_{i=1}^n u^{(j)}_i\frac{\partial r}{\partial x_{ij}}\Bigr\|^2 = \sum_{j=1}^p\sum_{l=1}^n\Bigl[\sum_{i=1}^n u^{(j)}_i\frac{\partial r_l}{\partial x_{ij}}\Bigr]^2 \le \sum_{j=1}^p\sum_{l=1}^n\sum_{i=1}^n\Bigl(\frac{\partial r_l}{\partial x_{ij}}\Bigr)^2$

by the Cauchy-Schwarz inequality, where we omitted the dependence on $X$ of the partial derivatives of $r$ for brevity. This proves (5.13).
6. Inequalities for functions of standard multivariate normals

Proposition 6.1 ([BZ18b]). If $y = \mu + \varepsilon$ with $\varepsilon \sim N(0, \sigma^2 I_n)$ and $f: \mathbb{R}^n \to \mathbb{R}^n$ has weakly differentiable components, then

(6.1)  $\mathbb{E}[(\varepsilon^\top f(y) - \sigma^2\,\mathrm{div} f(y))^2] = \sigma^2\mathbb{E}[\|f(y)\|^2] + \sigma^4\mathbb{E}[\mathrm{Tr}\bigl((\nabla f(y))^2\bigr)] \le \sigma^2\mathbb{E}[\|f(y)\|^2] + \sigma^4\mathbb{E}[\|\nabla f(y)\|_F^2]$
(6.2)  $\le \sigma^2\|\mathbb{E}[f(y)]\|^2 + 2\sigma^4\mathbb{E}[\|\nabla f(y)\|_F^2]$,

provided that the last line is finite. If $\mathbb{E}[f(y)] = 0$ then

(6.3)  $\mathbb{E}[(\varepsilon^\top f(y) - \sigma^2\,\mathrm{div} f(y))^2] \le 2\sigma^4\mathbb{E}[\|\nabla f(y)\|_F^2]$.

The first equality in (6.1) is the identity studied in [BZ18b]; the first inequality follows by the Cauchy-Schwarz inequality. The second inequality is a consequence of the Gaussian Poincaré inequality [BLM13, Theorem 3.20] applied to each component of $f$. The variant (6.5) below will be useful.
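As a quick illustration, not part of the paper, of the identity in the first line of (6.1), the following Monte Carlo sketch compares both sides for a smooth componentwise field $f(y) = \tanh(y)$, for which $\nabla f(y)$ is diagonal so that $\mathrm{Tr}[(\nabla f(y))^2] = \sum_i f_i'(y_i)^2$; the values of $n$, $\sigma$, $\mu$ and the choice of $f$ are assumptions made for the demonstration.

```python
# Monte Carlo check of the second-order Stein identity in (6.1)
# for f(y) = tanh(y) applied componentwise (diagonal Jacobian).
import numpy as np

rng = np.random.default_rng(4)
n, sigma, mu, reps = 5, 1.3, 2.0, 200_000
eps = sigma * rng.standard_normal((reps, n))
y = mu + eps
f = np.tanh(y)
fp = 1 - f ** 2                                   # f_i'(y_i)
lhs = np.mean((np.sum(eps * f, 1) - sigma**2 * np.sum(fp, 1)) ** 2)
rhs = np.mean(sigma**2 * np.sum(f * f, 1) + sigma**4 * np.sum(fp * fp, 1))
print(lhs, "~", rhs)   # the two averages agree up to Monte Carlo error
```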
Proposition 6.2. Let $f, \varepsilon, y$ be as in Proposition 6.1. Then there exist random variables $Z, T, \tilde T$ with $Z \sim N(0, 1)$ and $\mathbb{E}[\tilde T^2] \vee \mathbb{E}[T^2] \le 1$ such that

(6.4)  $|\varepsilon^\top f(y) - \sigma^2\,\mathrm{div} f(y)| \le \sigma|Z|\|\mathbb{E}[f(y)]\| + \sqrt2\,\sigma^2|T|\,\mathbb{E}[\|\nabla f(y)\|_F^2]^{1/2}$
(6.5)  $\le \sigma|Z|\|f(y)\| + \sigma^2(2|T| + \sqrt2|Z\tilde T|)\,\mathbb{E}[\|\nabla f(y)\|_F^2]^{1/2}$.
Proof. Define $F(y) = f(y) - \mathbb{E}[f(y)]$, $Z = \sigma^{-1}\varepsilon^\top\mathbb{E}[f(y)]/\|\mathbb{E}[f(y)]\| \sim N(0, 1)$ and

$T = \bigl(\varepsilon^\top f(y) - \sigma^2\,\mathrm{div} f(y) - Z\sigma\|\mathbb{E} f(y)\|\bigr)\big/\bigl(\sqrt2\,\sigma^2\mathbb{E}[\|\nabla f(y)\|_F^2]^{1/2}\bigr)$.

Since $\mathbb{E}[F(y)] = 0$, by (6.3) applied to $F$ we have

$2\sigma^4\mathbb{E}[\|\nabla f(y)\|_F^2]\,\mathbb{E}[T^2] = \mathbb{E}[(\varepsilon^\top F(y) - \sigma^2\,\mathrm{div} F(y))^2] \le 2\sigma^4\mathbb{E}[\|\nabla F(y)\|_F^2] = 2\sigma^4\mathbb{E}[\|\nabla f(y)\|_F^2]$

so that $\mathbb{E}[T^2] \le 1$. Next, let $\tilde T = [\sqrt2\,\sigma\mathbb{E}[\|\nabla f(y)\|_F^2]^{1/2}]^{-1}\bigl|\|\mathbb{E}[f(y)]\| - \|f(y)\|\bigr|$, which satisfies $\mathbb{E}[\tilde T^2] \le 1$ by the Gaussian Poincaré inequality. By construction of $T$ and $\tilde T$, we obtain (6.5).

Inequalities for functions of $n \times p$ independent standard normals

Proposition 6.3. If $X = (x_{ij}) \in \mathbb{R}^{n\times p}$ is a matrix with iid $N(0, 1)$ entries and $\eta: \mathbb{R}^{n\times p} \to \mathbb{R}^p$, $\rho: \mathbb{R}^{n\times p} \to \mathbb{R}^n$ are two vector valued functions, with weakly differentiable components $\eta_1, ..., \eta_p$ and $\rho_1, ..., \rho_n$, then

$\mathbb{E}\Bigl[\Bigl(\rho^\top X\eta - \sum_{i=1}^n\sum_{j=1}^p\frac{\partial(\rho_i\eta_j)}{\partial x_{ij}}\Bigr)^2\Bigr] = \mathbb{E}\bigl[\|\rho\|^2\|\eta\|^2\bigr] + \mathbb{E}\sum_{j=1}^p\sum_{k=1}^p\sum_{i=1}^n\sum_{l=1}^n\frac{\partial(\rho_i\eta_j)}{\partial x_{lk}}\frac{\partial(\rho_l\eta_k)}{\partial x_{ij}}$

(6.6)  $\le \mathbb{E}\bigl[\|\rho\|^2\|\eta\|^2\bigr] + \mathbb{E}\sum_{j=1}^p\sum_{k=1}^p\sum_{i=1}^n\sum_{l=1}^n\Bigl(\frac{\partial(\rho_i\eta_j)}{\partial x_{lk}}\Bigr)^2 \le \mathbb{E}\bigl[\|\rho\|^2\|\eta\|^2\bigr] + \mathbb{E}\sum_{k=1}^p\sum_{l=1}^n\Bigl[2\|\eta\|^2\Bigl\|\frac{\partial\rho}{\partial x_{lk}}\Bigr\|^2 + 2\|\rho\|^2\Bigl\|\frac{\partial\eta}{\partial x_{lk}}\Bigr\|^2\Bigr]$

provided that (6.6) is finite, where for brevity we write $\rho = \rho(X)$, $\eta = \eta(X)$, and similarly for the partial derivatives (i.e., omitting the dependence on $X$).

Furthermore, if for some open set $U \subset \mathbb{R}^{n\times p}$ we have (i) $\max(\|\rho\|, \|\eta\|) \le K$ on $U$, (ii) $\rho$ is $L$-Lipschitz on $U$ and (iii) $\eta$ is $M$-Lipschitz on $U$, then

(6.7)  $\mathbb{E}\Bigl[I\{X \in U\}\Bigl(\rho^\top X\eta - \sum_{i=1}^n\sum_{j=1}^p\frac{\partial(\rho_i\eta_j)}{\partial x_{ij}}\Bigr)^2\Bigr] \le K^4 + 2K^2(nL^2 + pM^2)$.
Proof. The first inequality is inequality (6.1) applied by vectorizing $X$ into a standard normal vector of size $np$, applied to the function $\mathbb{R}^{np} \to \mathbb{R}^{np}$ whose $(i, j)$-th component is $\rho_i\eta_j$.

For the second claim, by Kirszbraun's theorem, there exists an $L$-Lipschitz function $\rho_0: \mathbb{R}^{n\times p} \to \mathbb{R}^n$ such that $\rho_0 = \rho$ on $U$, and an $M$-Lipschitz function $\eta_0: \mathbb{R}^{n\times p} \to \mathbb{R}^p$ such that $\eta_0 = \eta$ on $U$. By projecting $\rho_0, \eta_0$ onto the Euclidean ball of radius $K$ if necessary, we may assume without loss of generality that $\max\{\|\rho_0\|, \|\eta_0\|\} \le K$ on $\mathbb{R}^{n\times p}$. Since $U$ is open, the derivatives of $\rho$ and $\rho_0$ (resp. $\eta$ and $\eta_0$) coincide on $U$, so that the left hand side of the second claim is equal to the same expression with $\rho$ replaced by $\rho_0$ and $\eta$ replaced by $\eta_0$. Since the Hilbert-Schmidt norm of the Jacobian of an $L$-Lipschitz function valued in $\mathbb{R}^n$ is at most $L\sqrt n$ (resp. the Hilbert-Schmidt norm of the Jacobian of an $M$-Lipschitz function valued in $\mathbb{R}^p$ is at most $M\sqrt p$), the claim is proved.
Proposition 6.4. Let Assumptions 2.1 and 2.2 be fulfilled with $\Sigma = I_p$ and $p/n \le \gamma$. Additionally, assume that (5.5) holds for all $X, \tilde X \in U \subset \mathbb{R}^{n\times p}$, where $U$ is an open set possibly depending on $\varepsilon$ and $L_* > 0$ a constant, and assume that $\sup_{X\in U}\|n^{-1/2}X\Sigma^{-1/2}\|_{\mathrm{op}} \le 2 + \sqrt\gamma$. Then

(2.14)  $V_1 \stackrel{\mathrm{def}}{=} \frac{n^{-1}\|X^\top r\|^2 - n^{-1}\|r\|^2\bigl(p - \mathrm{Tr}[P\frac{\partial X\hat\beta}{\partial y}]\bigr) + n^{-3/2}\,\mathrm{Tr}[\frac{\partial\hat\psi}{\partial y}]\,r^\top Xh}{(\|h\|^2 + \|r\|^2)\,n^{-1/2}}$

satisfies $\mathbb{E}[I\{X \in U\}V_1^2] \le C(\gamma, L_*)$.

The set $U$ and constant $L_*$ are given in Proposition 5.5 under Assumption 2.3(i) and in Proposition 5.6 under Assumption 2.3(ii). Note that in the case of the square loss, $\mathrm{diag}(\psi') = I_n$ and $\frac{\partial\hat\psi}{\partial y} = I_n - \frac{\partial X\hat\beta}{\partial y}$, so that $V_1$ is equal to

(6.8)  $V_1 \stackrel{\mathrm{def}}{=} \frac{n^{-1}\|X^\top r\|^2 - \|r\|^2\frac{p - \hat{\mathrm{df}}}{n} + (1 - \frac{\hat{\mathrm{df}}}{n})n^{-1/2}r^\top Xh}{(\|h\|^2 + \|r\|^2)\,n^{-1/2}}$.
Proof. We apply Proposition 6.3 to $\rho = rD^{-1}$, $\eta = X^\top r\, n^{-1/2}D^{-1}$ where $D = (\|r\|^2 + \|h\|^2)^{1/2}$. Thanks to Lemma 5.4, the functions $\rho: \mathbb{R}^{n\times p} \to \mathbb{R}^n$ and $\eta: \mathbb{R}^{n\times p} \to \mathbb{R}^p$ are Lipschitz on the open set $U$ with Lipschitz constants $L = 2n^{-1/2}L_*$ and $M = n^{-1/2}(1 + 2L_*(2 + \sqrt\gamma))$. Furthermore, on $U$ the inequality $\max\{\|\rho\|, \|\eta\|\} \le K$ holds for $K = 2 + \sqrt\gamma$, so that the right hand side of (6.7) is bounded from above by $C(\gamma, L_*)$.

We now compute the divergence term that appears in the left hand side of (6.7), i.e.,

(6.9)  $\frac{1}{\sqrt n}\sum_{i=1}^n\sum_{j=1}^p\frac{\partial(D^{-2}r_i e_j^\top X^\top r)}{\partial x_{ij}} = \frac{1}{\sqrt n}\sum_{i=1}^n\sum_{j=1}^p \underbrace{\frac{r_i}{D^2}\frac{\partial(e_j^\top X^\top r)}{\partial x_{ij}}}_{\mathrm{(i)}} + \underbrace{\frac{e_j^\top X^\top r}{D^2}\frac{\partial r_i}{\partial x_{ij}}}_{\mathrm{(ii)}} + \underbrace{r_i e_j^\top X^\top r\,\frac{\partial D^{-2}}{\partial x_{ij}}}_{\mathrm{(iii)}}$.

For (ii), with $\delta \in \mathbb{R}^p$ the random vector studied in (5.12), with components $\delta_j = \mathrm{Tr}[\frac{\partial\hat\psi}{\partial y}]h_j/\sqrt n + \sum_{i=1}^n\frac{\partial r_i}{\partial x_{ij}}$, we find

$\frac{1}{\sqrt n}\sum_{j=1}^p\sum_{i=1}^n\frac{e_j^\top X^\top r}{D^2}\frac{\partial r_i}{\partial x_{ij}} = -\mathrm{Tr}\Bigl[\frac{\partial\hat\psi}{\partial y}\Bigr]\frac{h^\top X^\top r}{nD^2} + \frac{\delta^\top X^\top r}{\sqrt n\, D^2}$.

Above in the right hand side, the first term appears in $V_1$ as desired, while we will show below that the second term is negligible. We now focus on term (i). By the product rule $\frac{\partial}{\partial x_{ij}}(e_j^\top X^\top r) = r_i + e_j^\top X^\top\frac{\partial r}{\partial x_{ij}}$, so that

(6.10)  $\sum_{i=1}^n\frac{r_i}{\sqrt n D^2}\sum_{j=1}^p\frac{\partial(e_j^\top X^\top r)}{\partial x_{ij}} = \frac{p\|r\|^2}{\sqrt n D^2} + \sum_{i=1}^n\frac{r_i}{\sqrt n D^2}\sum_{j=1}^p e_j^\top X^\top\frac{\partial r}{\partial x_{ij}}$.

By the chain rule (4.2) we obtain almost surely

$(6.10) = \frac{p\|r\|^2}{\sqrt n D^2} - \frac{r^\top[\frac{\partial\hat\psi}{\partial y}]Xh}{nD^2} + \frac{1}{nD^2}\sum_{i=1}^n r_i\sum_{j=1}^p e_j^\top X^\top\Bigl[\frac{\partial\hat\psi}{\partial x_{ij}} + \hat\beta_j\frac{\partial\hat\psi}{\partial y_i}\Bigr] = \frac{p\|r\|^2}{\sqrt n D^2} - \frac{r^\top[\frac{\partial\hat\psi}{\partial y}]Xh}{nD^2} - \frac{1}{nD^2}\sum_{i=1}^n r_i\psi_i\,\mathrm{Tr}\Bigl[PX\frac{\partial\hat\beta}{\partial y}\Bigr]$

thanks to (5.11) for the last equality. For term (iii), we have

$\mathrm{Rem}_{\mathrm{(iii)}} \stackrel{\mathrm{def}}{=} \sum_{i=1}^n\sum_{j=1}^p\frac{r_i e_j^\top X^\top r}{\sqrt n}\frac{\partial(D^{-2})}{\partial x_{ij}} = 2D^{-1}\sum_{i=1}^n\sum_{j=1}^p\frac{r_i e_j^\top X^\top r}{\sqrt n}\frac{\partial(D^{-1})}{\partial x_{ij}}$.

By (5.9), the Jacobian of the map $X \mapsto D^{-1}$ has operator norm at most $n^{-1/2}L_*D^{-1}$. Hence the absolute value of the previous display is bounded from above as

(6.11)  $|\mathrm{Rem}_{\mathrm{(iii)}}| \le 2L_* n^{-1}D^{-2}\Bigl(\sum_{i=1}^n\sum_{j=1}^p r_i^2(e_j^\top X^\top r)^2\Bigr)^{1/2} = 2L_* n^{-1}D^{-2}\|r\|\|X^\top r\| \le 2L_* n^{-1}\|X\|_{\mathrm{op}}$.

In summary, we have shown that

$(6.9) = \frac{p - \mathrm{Tr}[PX(\partial/\partial y)\hat\beta]}{\sqrt n D^2}\|r\|^2 - \frac{\mathrm{Tr}[\frac{\partial\hat\psi}{\partial y}]}{nD^2}r^\top Xh + \mathrm{Rem} + \mathrm{Rem}_{\mathrm{(iii)}}$

where $\mathrm{Rem} = D^{-2}\bigl(-n^{-1}r^\top[\frac{\partial\hat\psi}{\partial y}]Xh + n^{-1/2}\delta^\top X^\top r\bigr)$. To conclude, it remains to show that $\mathrm{Rem}$ and $\mathrm{Rem}_{\mathrm{(iii)}}$ are negligible. Using that $\|\frac{\partial\hat\psi}{\partial y}\|_{\mathrm{op}} \le 1$ by Proposition 5.1, as well as the bound (5.13) on $\|\delta\|$, almost surely we have

$|\mathrm{Rem}| \le n^{-1}\|X\|_{\mathrm{op}} + \|n^{-1/2}X\|_{\mathrm{op}}\|\delta\|/D \le n^{-1}\|X\|_{\mathrm{op}} + \|n^{-1/2}X\|_{\mathrm{op}}\Bigl(\sqrt2\, n^{-1/2} + \sqrt2\Bigl(\sum_{j=1}^p\sum_{i=1}^n\bigl\|\tfrac{\partial r}{\partial x_{ij}}(X)\bigr\|^2\Bigr)^{1/2}\big/D\Bigr)$.

By (5.5), the Jacobian of the map $X \mapsto r(X)$ has operator norm at most $n^{-1/2}L_*D$, so that its Hilbert-Schmidt norm, which appears in the previous display, is at most $L_*D$. Hence $|\mathrm{Rem}| \le n^{-1}\|X\|_{\mathrm{op}} + \|n^{-1/2}X\|_{\mathrm{op}}(\sqrt2 n^{-1/2} + \sqrt2 L_*)$, which is uniformly bounded from above by a constant $C(\gamma, L_*)$ in $U$. The same conclusion holds for $\mathrm{Rem}_{\mathrm{(iii)}}$ thanks to (6.11).

7. $\chi^2$-type bounds under dependence

The main result of this section is the following.
Theorem 7.1.
Assume that $X$ has iid $N(0, 1)$ entries, that $r: \mathbb{R}^{n\times p} \to \mathbb{R}^n$ is locally Lipschitz and that $\|r\| \le 1$ almost surely. Then

$\mathbb{E}\Bigl|p\|r\|^2 - \sum_{j=1}^p\Bigl(r^\top Xe_j - \sum_{i=1}^n\frac{\partial r_i}{\partial x_{ij}}\Bigr)^2\Bigr| \le C\Bigl\{\mathbb{E}\sum_{i=1}^n\sum_{j=1}^p\Bigl\|\frac{\partial r}{\partial x_{ij}}\Bigr\|^2\Bigr\}^{1/2}\sqrt p + C\,\mathbb{E}\sum_{i=1}^n\sum_{j=1}^p\Bigl\|\frac{\partial r}{\partial x_{ij}}\Bigr\|^2$.

The proof of Theorem 7.1 is given in Section 7.3. We first provide consequences of this result for regularized M-estimators.

Corollary 7.2. (i) If $\rho: U \to \mathbb{R}^n$ is $L$-Lipschitz on an open set $U \subset \mathbb{R}^{n\times p}$ and $\|\rho\| \le 1$ on $U$, then

$\mathbb{E}\Bigl[I\{X \in U\}\Bigl|p\|\rho\|^2 - \sum_{j=1}^p\Bigl(\rho^\top Xe_j - \sum_{i=1}^n\frac{\partial\rho_i}{\partial x_{ij}}\Bigr)^2\Bigr|\Bigr] \le C\{\sqrt n L\}\sqrt p + CnL^2$.

(ii) If additionally $a = (a_j)_{j\in[p]}$ is random, then

$\mathbb{E}\Bigl[I\{X \in U\}\Bigl|p\|\rho\|^2 - \sum_{j=1}^p\bigl(\rho^\top Xe_j - a_j\bigr)^2\Bigr|\Bigr] \le C\bigl(\{\sqrt\Xi\}\sqrt p + \Xi\bigr)$, where $\Xi = nL^2 + \mathbb{E}\bigl[I\{X \in U\}\sum_{j=1}^p\bigl(a_j - \sum_{i=1}^n\frac{\partial\rho_i}{\partial x_{ij}}\bigr)^2\bigr]$.

Proof. By Kirszbraun's theorem, there exists an $L$-Lipschitz function $r: \mathbb{R}^{n\times p} \to \mathbb{R}^n$ such that $\rho = r$ on $U$. By projecting $r$ onto the unit ball if necessary (i.e., replacing $r$ by $\pi \circ r$ where $\pi: \mathbb{R}^n \to \mathbb{R}^n$ is the projection onto the unit ball), we may assume without loss of generality that $\|r\| \le 1$ on $\mathbb{R}^{n\times p}$. Since $U$ is open, the derivatives of $\rho$ and $r$ also coincide on $U$, so that the left hand side of the first claim is equal to the same expression with $\rho$ replaced by $r$. By applying Theorem 7.1 to $r$, it is enough to bound from above $\mathbb{E}[\sum_{i=1}^n\sum_{j=1}^p\|(\partial/\partial x_{ij})r\|^2]$, which is the squared Hilbert-Schmidt (Frobenius) norm of the Jacobian of $r$ (by vectorizing the input space of $r$, this Jacobian is a matrix in $\mathbb{R}^{n\times(np)}$). Since the Hilbert-Schmidt norm of the Jacobian of an $L$-Lipschitz function valued in $\mathbb{R}^n$ is at most $L\sqrt n$, this proves the first claim.

For the second claim, if $c, b \in \mathbb{R}^p$, we have on $U$

$\bigl||p\|\rho\|^2 - \|b\|^2| - |p\|\rho\|^2 - \|c\|^2|\bigr| \le |(b - c)^\top(b + c)| \le \|b - c\|\|b + c\| \le \|b - c\|\bigl\{\sqrt{|\|b\|^2 - p\|\rho\|^2|} + \sqrt{|\|c\|^2 - p\|\rho\|^2|} + 2\sqrt p\bigr\}$

thanks to $\|\rho\| \le 1$ on $U$. Using $xy \le x^2/2 + y^2/2$, we eventually obtain

$(1/2)\bigl|p\|\rho\|^2 - \|b\|^2\bigr| \le (3/2)\bigl|p\|\rho\|^2 - \|c\|^2\bigr| + \|b - c\|^2 + 2\|b - c\|\sqrt p$.

We obtain the desired result by setting $b_j = e_j^\top X^\top\rho - a_j$ and $c_j = e_j^\top X^\top\rho - \sum_{i=1}^n(\partial/\partial x_{ij})\rho_i$, multiplying by the indicator $I\{X \in U\}$ and taking expectations.
Proposition 7.3. Let Assumptions 2.1 and 2.2 be fulfilled with $\Sigma = I_p$ and $p/n \le \gamma$. Additionally, assume that (5.5) holds for all $X, \tilde X \in U \subset \mathbb{R}^{n\times p}$, where $U$ is an open set possibly depending on $\varepsilon$ and $L_* > 0$ a constant. Then

$V_2 = \frac{1}{(\|r\|^2 + \|h\|^2)\sqrt n}\Bigl[p\|r\|^2 - \sum_{j=1}^p\Bigl(r^\top Xe_j + n^{-1/2}\mathrm{Tr}\Bigl[\frac{\partial\hat\psi}{\partial y}\Bigr]h_j\Bigr)^2\Bigr]$

satisfies $\mathbb{E}[I\{X \in U\}|V_2|] \le C(\gamma, L_*)$ for some constant depending only on $\{\gamma, L_*\}$. Furthermore, by expanding the square,

(2.15)  $V_2 = \frac{pn^{-1}\|r\|^2 - n^{-1}\|X^\top r\|^2 - \bigl(n^{-1}\mathrm{Tr}[\frac{\partial\hat\psi}{\partial y}]\bigr)^2\|h\|^2 - 2\bigl(n^{-1}\mathrm{Tr}[\frac{\partial\hat\psi}{\partial y}]\bigr)n^{-1/2}r^\top Xh}{(\|r\|^2 + \|h\|^2)\,n^{-1/2}}$.

Note that for the square loss, $V_2$ is equal to

(7.1)  $V_2 = \frac{pn^{-1}\|r\|^2 - n^{-1}\|X^\top r\|^2 - (1 - \frac{\hat{\mathrm{df}}}{n})^2\|h\|^2 - 2(1 - \frac{\hat{\mathrm{df}}}{n})n^{-1/2}r^\top Xh}{(\|r\|^2 + \|h\|^2)\,n^{-1/2}}$.
Proof. We apply Corollary 7.2 to $\rho = D^{-1}r$ where $D = (\|h\|^2 + \|r\|^2)^{1/2}$ and $a_j = -D^{-1}n^{-1/2}\mathrm{Tr}[\frac{\partial\hat\psi}{\partial y}]h_j$, so that the left hand side of Corollary 7.2(ii) equals $\mathbb{E}[I\{X \in U\}\sqrt n|V_2|]$. The function $\rho: X \mapsto D^{-1}r$ is $L$-Lipschitz with $L = 2n^{-1/2}L_*$ by (5.7), so that the quantity $\Xi$ in Corollary 7.2 is bounded from above as follows:

$\Xi \le 4L_*^2 + \mathbb{E}\Bigl[I\{X \in U\}\sum_{j=1}^p\Bigl(D^{-1}n^{-1/2}\mathrm{Tr}\Bigl[\frac{\partial\hat\psi}{\partial y}\Bigr]h_j + \sum_{i=1}^n\frac{\partial\rho_i}{\partial x_{ij}}\Bigr)^2\Bigr] = 4L_*^2 + \mathbb{E}\Bigl[I\{X \in U\}\sum_{j=1}^p\Bigl(D^{-1}\delta_j + \sum_{i=1}^n r_i\frac{\partial(D^{-1})}{\partial x_{ij}}\Bigr)^2\Bigr] \le 4L_*^2 + \mathbb{E}\Bigl[I\{X \in U\}\sum_{j=1}^p 2\bigl(D^{-1}\delta_j\bigr)^2 + 2\Bigl(\sum_{i=1}^n r_i\frac{\partial(D^{-1})}{\partial x_{ij}}\Bigr)^2\Bigr]$

where $\delta_j = n^{-1/2}\mathrm{Tr}[\frac{\partial\hat\psi}{\partial y}]h_j + \sum_{i=1}^n(\partial/\partial x_{ij})r_i$ is the quantity defined in (5.12). By (5.13) and $\|h\| \le D$, we have $\sum_{j=1}^p D^{-2}\delta_j^2 \le 2/n + 2D^{-2}\sum_{i=1}^n\sum_{j=1}^p\|(\partial/\partial x_{ij})r\|^2$. Thanks to (5.5), the operator norm of the Jacobian of $X \mapsto r$ is bounded from above by $L_*n^{-1/2}D$, so that $D^{-2}\sum_{i=1}^n\sum_{j=1}^p\|(\partial/\partial x_{ij})r\|^2 \le L_*^2$, by viewing the Jacobian of $r$ as a matrix of size $n \times (np)$ of rank at most $n$. This shows that $\sum_{j=1}^p D^{-2}\delta_j^2 \le 2/n + 2L_*^2$. Furthermore, by (5.9), the operator norm of the Jacobian of $D^{-1}$ is bounded from above by $L_*n^{-1/2}D^{-1}$, and the Hilbert-Schmidt norm of this Jacobian is at most $L_*n^{-1/2}D^{-1}$ as well, since $D^{-1}$ is scalar valued. Hence by the Cauchy-Schwarz inequality, $\sum_{j=1}^p(\sum_{i=1}^n r_i\frac{\partial(D^{-1})}{\partial x_{ij}})^2 \le \|r\|^2\sum_{j=1}^p\sum_{i=1}^n(\frac{\partial(D^{-1})}{\partial x_{ij}})^2 \le L_*^2\|r\|^2 n^{-1}D^{-2} \le L_*^2 n^{-1}$. In summary, $\Xi$ is bounded from above by $4L_*^2 + 4/n + 4L_*^2 + 2L_*^2/n$ and the proof is complete.

The remaining subsections give a proof of Theorem 7.1. In the proof of Theorem 7.1, we need to control the correlation between the two mean-zero random variables

(7.2)  $(z_j^\top f(z_k))^2 - \|f(z_k)\|^2$ and $(z_k^\top h(z_j))^2 - \|h(z_j)\|^2$

where $z_j, z_k$ are independent standard normal random vectors and $f, h$ are functions $\mathbb{R}^n \to \mathbb{R}^n$. If $f, h$ are constant, then the correlation between these two random variables is 0 by independence. If $f, h$ are non-constant, the following lemma gives an exact formula and an upper bound for the correlation of the two random variables in (7.2).
Lemma 7.4. Let $z_j, z_k$ be independent $N(0, I_n)$ random vectors. Let $f, h: \mathbb{R}^n \to \mathbb{R}^n$ be deterministic with weakly differentiable components, and define the random matrices $A, B \in \mathbb{R}^{n\times n}$ by

$A = \bigl(z_j^\top f(z_k)I_n + f(z_k)z_j^\top\bigr)\nabla f(z_k)^\top$, $\quad B = \bigl(z_k^\top h(z_j)I_n + h(z_j)z_k^\top\bigr)\nabla h(z_j)^\top$.

Assume that

(7.3)  $\mathbb{E}[\|f(z_k)\|^4] + \mathbb{E}[\|h(z_j)\|^4] + \mathbb{E}[\|A\|_F^2] + \mathbb{E}[\|B\|_F^2] < +\infty$.

Then the equality

(7.4)  $\mathbb{E}\bigl[\{(z_j^\top f(z_k))^2 - \|f(z_k)\|^2\}\{(z_k^\top h(z_j))^2 - \|h(z_j)\|^2\}\bigr] = \mathbb{E}[\mathrm{Tr}\{AB\}]$

holds. Furthermore,

$\mathbb{E}|\mathrm{Tr}[AB]| \le 4\,\mathbb{E}\bigl[\|\nabla f(z_k)\|_F^2\|f(z_k)\|^2\bigr]^{1/2}\mathbb{E}\bigl[\|\nabla h(z_j)\|_F^2\|h(z_j)\|^2\bigr]^{1/2}$.
Proof. Define $z \in \mathbb{R}^{2n}$ with $z = (z_j^\top, z_k^\top)^\top$ as well as $F, H: \mathbb{R}^{2n} \to \mathbb{R}^{2n}$ by $F(z_j, z_k) = (f(z_k)f(z_k)^\top z_j, 0_{\mathbb{R}^n})$ and $H(z_j, z_k) = (0_{\mathbb{R}^n}, h(z_j)h(z_j)^\top z_k)$, where with some abuse of notation we denote by $(u, v) \in \mathbb{R}^{2n}$ the vertical concatenation of two column vectors $u \in \mathbb{R}^n$ and $v \in \mathbb{R}^n$. The Jacobians of $F, H$ are the $2n \times 2n$ matrices

$\nabla F(z)^\top = \begin{pmatrix} f(z_k)f(z_k)^\top & A \\ 0_{n\times n} & 0_{n\times n} \end{pmatrix}$, $\quad \nabla H(z)^\top = \begin{pmatrix} 0_{n\times n} & 0_{n\times n} \\ B & h(z_j)h(z_j)^\top \end{pmatrix}$.

Thus the left hand side in (7.4) can be rewritten as $\mathbb{E}[(z^\top F(z) - \mathrm{div} F(z))(z^\top H(z) - \mathrm{div} H(z))]$ with $F, H$ weakly differentiable and $\mathbb{E}[\|F(z)\|^2 + \|H(z)\|^2 + \|\nabla H(z)\|_F^2 + \|\nabla F(z)\|_F^2] < +\infty$ thanks to (7.3). The last display is equal to $\mathbb{E}[F(z)^\top H(z) + \mathrm{Tr}\{\nabla F(z)\nabla H(z)\}]$ by Section 2.2 in [BZ18b]. Here $F(z)^\top H(z) = 0$ always holds by construction of $F, H$, and the matrix product by block gives $\mathrm{Tr}\{\nabla H(z)\nabla F(z)\} = \mathrm{Tr}\{AB\}$.

Next, by the Cauchy-Schwarz inequality we have $\mathbb{E}|\mathrm{Tr}[AB]| \le \mathbb{E}[\|A\|_F\|B\|_F] \le \mathbb{E}[\|A\|_F^2]^{1/2}\mathbb{E}[\|B\|_F^2]^{1/2}$. By definition of $A$ and properties of the operator norm, $\|A\|_F \le \|\nabla f(z_k)\|_F|z_j^\top f(z_k)| + \|f(z_k)\|\|\nabla f(z_k)^\top z_j\|$. By the triangle inequality and independence we find

$\mathbb{E}[\|A\|_F^2]^{1/2} \le \mathbb{E}\bigl[\|\nabla f(z_k)\|_F^2(z_j^\top f(z_k))^2\bigr]^{1/2} + \mathbb{E}\bigl[\|\nabla f(z_k)^\top z_j\|^2\|f(z_k)\|^2\bigr]^{1/2} = \mathbb{E}\bigl[\|\nabla f(z_k)\|_F^2\|f(z_k)\|^2\bigr]^{1/2} + \mathbb{E}\bigl[\|\nabla f(z_k)\|_F^2\|f(z_k)\|^2\bigr]^{1/2}$

thanks to $\mathbb{E}[z_jz_j^\top \mid z_k] = I_n$ and $\mathbb{E}[\|\nabla f(z_k)^\top z_j\|^2 \mid z_k] = \mathrm{Tr}(\nabla f(z_k)^\top\nabla f(z_k)\,\mathbb{E}[z_jz_j^\top \mid z_k])$. Similarly, $\mathbb{E}[\|B\|_F^2]^{1/2} \le 2\,\mathbb{E}[\|\nabla h(z_j)\|_F^2\|h(z_j)\|^2]^{1/2}$, which completes the proof.
Proof of Theorem 7.1. Let $z_j = Xe_j$. Let also $\xi_j = z_j^\top(r - \mathbb{E}_j[r]) - d_j$ where $\mathbb{E}_j[\cdot]$ is the conditional expectation $\mathbb{E}_j[\cdot] = \mathbb{E}[\cdot \mid X_{-j}]$ and $d_j = \sum_{i=1}^n\frac{\partial r_i}{\partial x_{ij}}$, so that $\mathbb{E}_j[\xi_j] = 0$ by Stein's formula. Writing $z_j^\top r - d_j = \xi_j + z_j^\top\mathbb{E}_j[r]$ and expanding the square, we find

$W = \sum_{j=1}^p\bigl(z_j^\top r - d_j\bigr)^2 - p\|r\|^2 = \sum_{j=1}^p\bigl(\xi_j + z_j^\top\mathbb{E}_j[r]\bigr)^2 - p\|r\|^2 = \sum_{j=1}^p\Bigl[\underbrace{\xi_j^2}_{(i)} + \underbrace{2\xi_j z_j^\top\mathbb{E}_j[r]}_{(ii)} + \underbrace{(z_j^\top\mathbb{E}_j[r])^2 - \|\mathbb{E}_j[r]\|^2}_{(iii)} + \underbrace{\|\mathbb{E}_j[r]\|^2 - \|r\|^2}_{(iv)}\Bigr]$.

This decomposition gives rise to 4 terms that we bound separately.

(i) $\mathbb{E}\sum_{j=1}^p\xi_j^2 \le 2\mathbb{E}\sum_{j=1}^p\sum_{i=1}^n\|\frac{\partial r}{\partial x_{ij}}\|^2$ by (6.3), applied with respect to $z_j$ conditionally on $X_{-j}$ for each $j \in [p]$.

(ii) Since $\mathbb{E}[(z_j^\top\mathbb{E}_j[r])^2] = \mathbb{E}[\|\mathbb{E}_j[r]\|^2] \le 1$ by Jensen's inequality, the Cauchy-Schwarz inequality $\mathbb{E}[\sum_{j=1}^p a_jb_j] \le \mathbb{E}[\sum_{j=1}^p a_j^2]^{1/2}\mathbb{E}[\sum_{j=1}^p b_j^2]^{1/2}$ yields

$\mathbb{E}\Bigl[\sum_{j=1}^p|\xi_j z_j^\top\mathbb{E}_j[r]|\Bigr] \le \sqrt p\,\mathbb{E}\Bigl[\sum_{j=1}^p\xi_j^2\Bigr]^{1/2} \le \sqrt p\Bigl(2\mathbb{E}\sum_{j=1}^p\sum_{i=1}^n\Bigl\|\frac{\partial r}{\partial x_{ij}}\Bigr\|^2\Bigr)^{1/2}$

where the second inequality follows from the inequality derived for term (i).

(iii) For the third term, set $\chi_j = (z_j^\top\mathbb{E}_j[r])^2 - \|\mathbb{E}_j[r]\|^2$ and note that $\mathbb{E}[(\sum_{j=1}^p\chi_j)^2] = \mathbb{E}\sum_{j=1}^p\sum_{k=1}^p\chi_j\chi_k$. For the diagonal terms, $\mathbb{E}\sum_{j=1}^p\chi_j^2 = 2\mathbb{E}\sum_{j=1}^p\|\mathbb{E}_j[r]\|^4 \le 2p$, because $z_j \sim N(0, I_n)$ is independent of $\mathbb{E}_j[r]$ and $\mathbb{E}[(Z^2 - s)^2] = 2s^2$ if $Z \sim N(0, s)$.

For the non-diagonal terms we compute $\mathbb{E}[\chi_j\chi_k]$ using Lemma 7.4 with $f^{(j,k)}(z_k) = \mathbb{E}_j[r]$ and $h^{(j,k)}(z_j) = \mathbb{E}_k[r]$, conditionally on $(z_l)_{l\notin\{j,k\}}$. Thanks to $\|r\| \le 1$ this gives

$\mathbb{E}[\chi_j\chi_k] \le C\,\mathbb{E}\sum_{i=1}^n\Bigl(\Bigl\|\frac{\partial\mathbb{E}_k[r]}{\partial x_{ij}}\Bigr\|^2 + \Bigl\|\frac{\partial\mathbb{E}_j[r]}{\partial x_{ik}}\Bigr\|^2\Bigr) \le C\,\mathbb{E}\sum_{i=1}^n\Bigl(\mathbb{E}_k\Bigl[\Bigl\|\frac{\partial r}{\partial x_{ij}}\Bigr\|^2\Bigr] + \mathbb{E}_j\Bigl[\Bigl\|\frac{\partial r}{\partial x_{ik}}\Bigr\|^2\Bigr]\Bigr)$,

where the second inequality follows by dominated convergence for the conditional expectation (i.e., $(\partial/\partial x_{ij})\mathbb{E}_k[r] = \mathbb{E}_k[(\partial/\partial x_{ij})r]$ almost surely) and Jensen's inequality. Finally, summing over all pairs $j \ne k$, we find

$\sum_{j=1}^p\sum_{k=1, k\ne j}^p\mathbb{E}[\chi_j\chi_k] \le Cp\,\mathbb{E}\sum_{j=1}^p\sum_{i=1}^n\Bigl\|\frac{\partial r}{\partial x_{ij}}\Bigr\|^2$.
(iv) For the last term, using $|\|\mathbb{E}_j[r]\|^2 - \|r\|^2| \le \|\mathbb{E}_j[r] - r\|\|\mathbb{E}_j[r] + r\|$ and the Cauchy-Schwarz inequality, we find

$\mathbb{E}\Bigl|\sum_{j=1}^p\|\mathbb{E}_j[r]\|^2 - \|r\|^2\Bigr| \le \mathbb{E}\Bigl[\sum_{j=1}^p\|\mathbb{E}_j[r] - r\|^2\Bigr]^{1/2}\mathbb{E}\Bigl[\sum_{j=1}^p\|\mathbb{E}_j[r] + r\|^2\Bigr]^{1/2} \le \mathbb{E}\Bigl[\sum_{j=1}^p\sum_{i=1}^n\Bigl\|\frac{\partial r}{\partial x_{ij}}\Bigr\|^2\Bigr]^{1/2}\,2\sqrt p$

where the second inequality follows from $\|r\| \le 1$ and the Gaussian Poincaré inequality [BLM13, Theorem 3.20] with respect to $z_j = Xe_j$, conditionally on $(z_k)_{k\in[p]\setminus\{j\}}$, which gives $\mathbb{E}_j[(r_l - \mathbb{E}_j[r_l])^2] \le \mathbb{E}_j\sum_{i=1}^n(\frac{\partial r_l}{\partial x_{ij}})^2$ for every $l = 1, ..., n$.
8. Proofs of the main result: General loss function
Proof of Theorem 2.1.
The proof follows the steps described in Section 2.9, with the change of variable (2.12) to reduce the problem to $X$ with iid $N(0, 1)$ entries, and with the definitions of $V_1, V_2$ in (2.14)-(2.15), so that $V_* \stackrel{\mathrm{def}}{=} 2|V_1| + |V_2|$. Proposition 6.4 yields the desired bound on $\mathbb{E}[I\{X \in U\}|V_1|]$ and Proposition 7.3 the desired bound on $\mathbb{E}[I\{X \in U\}|V_2|]$, provided that, under each assumption, we can prove that

(5.5)  $\bigl(\|\Sigma^{1/2}(h - \tilde h)\|^2 + \|r - \tilde r\|^2\bigr)^{1/2} \le n^{-1/2}\|(X - \tilde X)\Sigma^{-1/2}\|_{\mathrm{op}}\,L_*\,\bigl(\|r\|^2 + \|\Sigma^{1/2}h\|^2\bigr)^{1/2}$

holds for some constant $L_* = C(\gamma, \mu, \mu_g, \varphi, \eta)$ and all $X, \tilde X \in U$, for some open set $U$ possibly depending on $\varepsilon$. All notation in (5.5) is defined in Lemma 5.4. We set $\Omega = \{X \in U\}$, and the existence of the open set $U$ and of the constant $L_*$ is established under each assumption as follows.

(i) Under Assumption 2.3(i), the constant $L_*$ is given in Proposition 5.5 and we define the open set $U$ as $U = \{X \in \mathbb{R}^{n\times p} : \|n^{-1/2}X\Sigma^{-1/2}\|_{\mathrm{op}} < 2 + \sqrt\gamma\}$. Then $P(X \notin U) \le e^{-n/2}$ and $P(X \in U) \to 1$ by [DS01, Theorem II.13].

(ii) Under Assumption 2.3(ii), $U$ and $L_*$ are given in Proposition 5.6. Here, $\gamma < 1$ and $P(X \notin U) \le e^{-n/2} + e^{-(1-\sqrt\gamma)^2 n/8}$, again by [DS01, Theorem II.13], so that $P(\Omega) \to 1$ holds.

(iii) Under Assumption 2.3(iii) for the Huber loss with $\ell_1$ penalty, choose any $d_* \in (0, 1)$ small enough such that both (D.17) and (D.4) hold, and define $s_*$ as a function of $d_*, \gamma, \varphi, \eta$ by Lemma D.2 (more specifically (D.8)). The set $U$ and the constant $L_*$ are given by combining Proposition D.1 and Lemma D.2. Specifically we set $U = \{X : (\varepsilon, X) \in \Omega_L\}$ for the open set $\Omega_L$ in Proposition D.1. If $\rho$ is the square loss, we use Lemma D.3 instead of Lemma D.2.

The constant $L_*$ thus only depends on $\{\gamma, \mu, \mu_g, \varphi, \eta\}$, so that $\mathbb{E}[I\{X \in U\}(|V_1| + |V_2|)] \le C(\gamma, \mu, \mu_g, \varphi, \eta)$ by Propositions 7.3 and 6.4.

For the second claim, the inequality $n^{-1}\mathrm{Tr}[(\partial/\partial y)\hat\psi] \le 1$ holds by Proposition 5.1 with $L = 1$, and it remains to prove the lower bounds on $n^{-1}\mathrm{Tr}[(\partial/\partial y)\hat\psi]$. Under Assumption 2.3(ii), for a fixed $X$ and any $y, \tilde y$ with $(\tilde y - y)^\top X = 0_{1\times p}$, we have by (5.1) and by strong convexity of $\rho$, with the notation of Proposition 5.1,

$\mu_\rho\|y - X\hat\beta - \tilde y + X\tilde\beta\|^2 \le [y - \tilde y]^\top[\psi - \tilde\psi]$.

Since $(\tilde y - y)^\top X$ is zero, this can be equivalently rewritten as

$\mu_\rho\bigl(\|y - \tilde y\|^2 + \|X(\tilde\beta - \hat\beta)\|^2\bigr) \le [y - \tilde y]^\top[\psi - \tilde\psi]$, provided that $(\tilde y - y)^\top X = 0_{1\times p}$.

The map $y \mapsto \hat\psi(y, X)$ is Frechet differentiable at $y$ for almost every $y$, and $(\partial/\partial y)\hat\psi$ is positive semi-definite by Proposition 2.2. Let $(u_1, ..., u_{n-p})$ be an orthonormal family of vectors of size $n$ in the complement of the column space of $X$.
For almost every y , since the divergence of aFrechet differentiable vector field can be computed in any orthonormal basis, Tr (cid:104) ( ∂/∂ y ) (cid:98) ψ (cid:105) ≥ n − p (cid:88) k =1 u (cid:62) k ddt (cid:98) ψ ( y + t u k , X ) (cid:12)(cid:12)(cid:12) t =0 ≥ ( n − p ) µ ρ ≥ n (1 − γ ) µ ρ where the second inequality follows by setting (cid:101) y = y + t u k in the previous display.It remains to prove n Tr[( ∂/∂ y ) (cid:98) ψ ] ≥ − d ∗ in Ω in the case where Assumption 2.3(iii) holds.For the d ∗ chosen in the list item (iii) two paragraphs above, we have (cid:107) (cid:98) β (cid:107) + (cid:107) (cid:98) θ (cid:107) ≤ d ∗ n in Ω by Lemma D.2 with the notation (cid:98) θ defined in (C.1). By Proposition 2.4 we have Tr[( ∂/∂ y ) (cid:98) ψ ] = | ˆ I |−| ˆ S | where | ˆ I | = n −(cid:107) (cid:98) θ (cid:107) so that (cid:107) (cid:98) β (cid:107) + (cid:107) (cid:98) θ (cid:107) ≤ d ∗ n can be rewritten Tr[( ∂/∂ y ) (cid:98) ψ ] ≥ n − d ∗ n as desired.Finally, we show that ( ∂/∂ y ) (cid:98) β ( y , X ) = [( ∂/∂ y ) (cid:98) β ( y , X )] P almost surely in Ω for the matrix P in (2.5). This equality is equivalent to the following: for all i ∈ [ n ] such that the i -th diagonalelement ψ (cid:48) ( y i − x (cid:62) i (cid:98) β ) of diag ( ψ (cid:48) ) is equal to 0, equality ( ∂/∂y i ) (cid:98) β ( y , X ) = 0 holds. If µ > as in Assumption 2.3(i), Proposition 5.8 with U (cid:62) = e i (cid:107) (cid:98) ψ ( y , X ) (cid:107) − (cid:98) ψ ( y , X ) (cid:62) yields ddt (cid:98) β ( y − t e i , X ) (cid:12)(cid:12) t =0 = when ψ (cid:48) ( y i − x (cid:62) i (cid:98) β ) = 0 thanks to U diag ( ψ (cid:48) ) = . Under Assumption 2.3(ii),there is nothing to prove as diag ( ψ (cid:48) ) is always invertible and ψ (cid:48) ( y i − x (cid:62) i (cid:98) β ) > for all i ∈ [ n ] . ellec/Out-of-sample error estimate for robust M-estimators with convex penalty Under Assumption 2.3)(iii), the discrete sets ˆ S = supp( (cid:98) β ) and ˆ I = { i ∈ [ n ] : ψ (cid:48) ( y i − x (cid:62) i (cid:98) β ) > } are locally constant by the reasoning in Proposition 2.4, and the matrix diag ( ψ (cid:48) ) / X ˆ S is of rank | ˆ S | by the argument in the proof of Proposition D.1. Proposition 5.8 with U (cid:62) = e i (cid:107) (cid:98) ψ ( y , X ) (cid:107) − (cid:98) ψ ( y , X ) (cid:62) again yields ddt (cid:98) β ( y − t e i , X ) ˆ S (cid:12)(cid:12) t =0 = when ψ (cid:48) ( y i − x (cid:62) i (cid:98) β ) = 0 , while ddt (cid:98) β ( y − t e i , X ) ˆ S c (cid:12)(cid:12) t =0 = holds as well because ˆ S is locally constant. Appendix A: Rotational invariance of regularized least-squares
The crux of the following proposition is that if X has iid N ( , Σ ) rows then X has the samedistribution as RX where R ∈ O ( n ) is a random rotation such that Ru is uniformly distributedon the sphere for any deterministic u with (cid:107) u (cid:107) = 1 . Consequently if (cid:98) β is a penalized M -estimator with square loss as in (3.1) then (cid:98) β ( ε + Xβ , X ) = (cid:98) β ( Rε + RXβ , RX ) because (cid:107) R ( y − Xb ) (cid:107) = (cid:107) y − Xb (cid:107) for all b ∈ R p . Thus the distribution of (cid:98) β is unchanged if the noise ε is replaced by (cid:101) ε = (cid:107) ε (cid:107) v where v is uniformly distributed on the sphere of radius , as longas ( v , ε , X ) are mutually independent. Proposition A.1.
Let (cid:98) β in (3.1) , ˆ df in (3.2) and let r = n − / ( y − X (cid:98) β ) . Assume that y hascontinuous distribution with respect to the Lebesgue measure. Assume that X has iid N ( , Σ ) rows and that ε is independent of X . Then V def = (cid:107) r (cid:107) + n − r (cid:62) Xh − n (cid:107) ε (cid:107) (1 − ˆ df n ) (cid:0) (cid:107) ε (cid:107) /n + (cid:107) ε (cid:107) n − (cid:107) r (cid:107) (cid:1) n − (A.1) satisfies E [ | V | ] ≤ C . Since y = ε + Xβ , the assumption that y has continuous distribution is satisfied when β (cid:54) = or ε has continuous distribution. Proof.
We consider ε deterministic without loss of generality (if necessary, by replacing allprobability and expectation signs in the following argument by the conditional probability andconditional expectation given ε ). Note that V can be rewritten as V = n ε (cid:62) ( y − X (cid:98) β ) − (cid:107) ε (cid:107) (1 − ˆ df /n ) (cid:107) ε (cid:107) + (cid:107) ε (cid:107)(cid:107) y − X (cid:98) β (cid:107) . Throughout, let R ∈ O ( n ) ⊂ R n × n be a random rotation (i.e., R satisfies R (cid:62) R = I n ) suchthat Rv is uniformly distributed on the sphere S n − for every v ∈ S n − , and such that R isindependent of X . Then, write ε (cid:62) ( y − X (cid:98) β ) = ( Rε ) (cid:62) ( Rε − RX ( (cid:98) β ( y , X ) − β )) . Let g = Rε (cid:107) ε (cid:107) − ρ where ρ ∼ χ n is independent of ( ε , X , R ) . Then g ∼ N ( , I n ) and ρ = (cid:107) g (cid:107) . Since X has iid N ( , Σ ) rows, by rotational invariance of the normal distribution, (cid:102) X = RX has also iid N ( , Σ ) rows and (cid:102) X is independent of R , and ( ε , (cid:102) X ) is independent of g (thisindependence is the key). By definition of (cid:98) β , since R preserves the Euclidean norm we have (cid:98) β ( y , X ) = (cid:98) β ( ε + Xβ , X ) = (cid:98) β ( Rε + RXβ , RX ) = (cid:98) β ( (cid:107) ε (cid:107)(cid:107) g (cid:107) − g + (cid:102) Xβ , (cid:102) X ) . It follows that (cid:107) g (cid:107)(cid:107) ε (cid:107) − ε (cid:62) ( ε − X ( (cid:98) β ( y , X ) − β )) ellec/Out-of-sample error estimate for robust M-estimators with convex penalty = g (cid:62) ( Rε − (cid:102) X ( (cid:98) β ( (cid:107) ε (cid:107)(cid:107) g (cid:107) − g + (cid:102) Xβ , (cid:102) X ) − β )= g (cid:62) ( f ◦ h ( g )) (A.2)where f ( z ) = z − (cid:102) X ( (cid:98) β ( z + (cid:102) Xβ , (cid:102) X ) − β ) and h ( g ) = (cid:107) ε (cid:107)(cid:107) g (cid:107) − g + (cid:102) Xβ are two vectors fields R n → R n . With ( (cid:102) X , ε ) being fixed, the function f is 1-Lipschitz [BT17] and Frechet differentiablealmost everywhere by Rademacher’s theorem. At a point z where f is Frechet differentiable, thepartial derivatives of f with respect to z = ( z , ..., z n ) (cid:62) are given by ∂f∂ z ( z ) = I n − (cid:102) X ∂ (cid:98) β ∂ y ( z , (cid:102) X ) and the partial derivatives of h are given by by ( ∂/∂ g ) h ( g ) = (cid:107) ε (cid:107)(cid:107) g (cid:107) − ( I n − gg (cid:62) (cid:107) g (cid:107) − ) for g (cid:54) = . Since h ( g ) = Ry , the random vector h ( g ) has continuous distribution with respect tothe Lebesgue measure and f is Frechet differentiable at h ( g ) with probability one. Hence withprobability one, if J = ∂f∂ z ( h ( g )) , the Jacobian of f ◦ g is obtained by the chain rule(A.3) ∂ ( f ◦ h ) ∂ g ( g ) = J ∂h∂ g ( g ) = J (cid:16) I n − gg (cid:62) (cid:107) g (cid:107) (cid:17) (cid:107) ε (cid:107)(cid:107) g (cid:107) − . Since f is 1-Lipschitz, it follows that the operator norm of the Jacobian of f ◦ h is boundedby (cid:107) ε (cid:107)(cid:107) g (cid:107) − , and the Frobenius norm of the Jacobian of f ◦ h is bounded by √ n (cid:107) ε (cid:107)(cid:107) g (cid:107) − .Inequality (6.5) from Proposition 6.2 applied to the vector field f ◦ h yields that for somerandom variables Z, T as in Proposition A.1, we have | g (cid:62) { f ◦ h ( g ) } − Tr[ ∇ ( f ◦ h )( g )] | ≤ | Z | (cid:107) f ◦ h ( g ) (cid:107) + ( | Z ˜ T | + 2 | T | ) (cid:112) (cid:107) ε (cid:107) n E [ (cid:107) g (cid:107) − ] . 
Using (A.2), the identity E [ (cid:107) g (cid:107) − ] = 1 / ( n − and (A.3) and (cid:107) f ( (cid:107) g (cid:107) − (cid:107) ε (cid:107) g ) (cid:107) = (cid:107) y − X (cid:98) β ( y , X ) (cid:107) , the previous display can be rewritten | ε (cid:62) ( y − X (cid:98) β ) − (cid:107) ε (cid:107) (cid:107) g (cid:107) − Tr[ J ( I n − gg (cid:62) (cid:107) g (cid:107) − )] |≤ (cid:107) g (cid:107) − (cid:107) ε (cid:107) (cid:16) | Z | (cid:107) y − X (cid:98) β ( y , X ) (cid:107) + ( | Z ˜ T | + 2 | T | ) (cid:112) (cid:107) ε (cid:107) n/ ( n − (cid:17) . Since J has operator norm at most one, we have Tr[ J ] ≤ n as well as | Tr[ J ] /n − (cid:107) g (cid:107) − Tr[ J ( I n − gg (cid:62) (cid:107) g (cid:107) − )] | ≤ (cid:107) g (cid:107) − + |(cid:107) g (cid:107) − − /n | Tr J ≤ (cid:107) g (cid:107) − + | n (cid:107) g (cid:107) − − | = (cid:107) g (cid:107) − (1 + | n − (cid:107) g (cid:107) | ) . Assume that
Tr[ J ] = ( n − ˆ df ) always holds—this will be proved in the next paragraph. Combiningthe two previous displays yields that almost surely | V | ≤ n (cid:107) g (cid:107) (cid:16) | Z | + ( | Z ˜ T | + 2 | T | ) (cid:114) nn − (cid:17) + n / | n − (cid:107) g (cid:107) |(cid:107) g (cid:107) and the right hand side has expectation at most C by Hölder’s inequality, properties of the χ n distribution, E [ T ] ∨ E [ ˜ T ] ≤ and Z ∼ N (0 , .It remains to show that Tr[ J ] = ( n − ˆ df ) , i.e., to explain the relationship between the Jacobian J of f at h ( g ) and the Jacobian of y (cid:55)→ X (cid:98) β ( y , X ) . Since equality (cid:98) β ( y , X ) = (cid:98) β ( Ry , RX ) alwaysholds, by differentiation with respect to y we obtain, where the directional derivatives exist, that ∂ (cid:98) β j ∂y i ( y , X ) = n (cid:88) l =1 R li ∂ (cid:98) β j ∂y l ( Ry , RX ) i.e., ∂ (cid:98) β j ∂ y ( y , X ) = ∂ (cid:98) β j ∂ y ( Ry , RX ) R , ellec/Out-of-sample error estimate for robust M-estimators with convex penalty where ∂ β j ∂ y ( y , X ) and ∂ β j ∂ y ( Ry , RX ) are row vectors. Since (cid:102) X = RX and f ( z ) = (cid:102) X (cid:98) β ( z , (cid:102) X ) ,this implies that X ∂ (cid:98) β ∂ y ( y , X ) = R − (cid:102) X ∂ (cid:98) β ∂ y ( Ry , RX ) R and n − ˆ df = Tr[ I n − X ( ∂/∂ y ) (cid:98) β ( y , X )] = Tr[( ∂/∂ z ) f ( h ( g ))] using that for any matrix M , Tr[ R − M R ] = Tr M . Appendix B: Proofs of the main result: Square loss
Proof of Theorem 3.1.
Without loss of generality we may assume that Σ = I p as explained inSection 2.9. We set Ω = { X ∈ U } ∩ {(cid:107) X Σ − n − (cid:107) op ≤ √ γ + 2 } for the same open set U as forgeneral loss in the proof of Theorem 2.1 in Section 8. Since P ( (cid:107) X Σ − n − (cid:107) op ≤ √ γ + 2) → by [DS01, Theorem II.13], this proves P (Ω) → .We first prove the first inequality in (3.4). Define V , V , V by V def = n (cid:107) X (cid:62) r (cid:107) − (cid:107) r (cid:107) p − ˆ df n + (1 − ˆ df n ) n − r (cid:62) Xh ( (cid:107) h (cid:107) + (cid:107) r (cid:107) ) n − , (6.8) V = pn (cid:107) r (cid:107) − n (cid:107) X (cid:62) r (cid:107) − (1 − ˆ df n ) (cid:107) h (cid:107) − − ˆ df n ) n − r (cid:62) Xh ( (cid:107) r (cid:107) + (cid:107) h (cid:107) ) n − , (7.1) V def = (cid:107) r (cid:107) + n − r (cid:62) Xh − n (cid:107) ε (cid:107) (1 − ˆ df n ) (cid:0) (cid:107) ε (cid:107) /n + (cid:107) ε (cid:107) n − (cid:107) r (cid:107) (cid:1) n − . (A.1)Now let LHS A = (cid:0) − ˆ df /n (cid:1) (cid:8) (cid:107) Σ h (cid:107) + σ ∗ (cid:9) − (cid:107) (cid:98) ψ (cid:107) /n be the quantity appearing in the lefthand side of (3.4). Then we have − LHS A = ( V + V ) n − ( (cid:107) h (cid:107) + (cid:107) r (cid:107) ) + V n − (1 − ˆ df /n )( (cid:107) ε (cid:107) /n + (cid:107) ε (cid:107) n − (cid:107) r (cid:107) ) . It follows by the triangle inequality and | − ˆ df /n | ≤ that | LHS A | ≤ n − (cid:0) | V | + | V | + 2 | V | (cid:1) ( (cid:107) h (cid:107) + (cid:107) r (cid:107) + (cid:107) ε (cid:107) /n ) . Now let
LHS B = (cid:0) − ˆ df /n (cid:1) (cid:107) Σ h (cid:107) − n − (cid:8) (cid:107) (cid:98) ψ (cid:107) (2 ˆ df − p ) + (cid:107) Σ − X (cid:62) (cid:98) ψ (cid:107) (cid:9) be the quantityfrom the left hand side of the second line of (3.4). Then similarly to the proof of Theorem 2.1we have − LHS B = (2 V + V )( (cid:107) h (cid:107) + (cid:107) r (cid:107) ) n − , | LHS B | ≤ (2 | V | + | V | )( (cid:107) h (cid:107) + (cid:107) r (cid:107) ) n − . Finally, for
LHS C = (cid:0) − ˆ df /n (cid:1) σ ∗ − n − (cid:8) (cid:107) (cid:98) ψ (cid:107) ( n − (2 ˆ df − p )) − (cid:107) Σ − X (cid:62) (cid:98) ψ (cid:107) (cid:9) we have thedecomposition LHS C = LHS A − LHS B so that | LHS C | ≤ (3 | V | + 2 | V | + 2 | V | )( (cid:107) h (cid:107) + (cid:107) r (cid:107) + (cid:107) ε (cid:107) /n ) n − holds. To conclude we set V ∗ = 3 | V | + 2 | V | + 2 | V | so that each inequality in (3.4)holds. Propositions 7.3, 6.4 and A.1 show that E [ I { Ω }| V ∗ | ] ≤ C ( µ, γ, ϕ, η ) .It remains to prove that (1 − ˆ df /n ) − is bounded from above uniformly in Ω .(i) If µ > (i.e., the penalty is strongly convex with respect to Σ ), in Ω we have (cid:107) X Σ − (cid:107) op ≤√ n (2 + √ γ ) . In this event, the map y (cid:55)→ X (cid:98) β ( y , X ) is L -Lipschitz with L = (1 + µ (2 + ellec/Out-of-sample error estimate for robust M-estimators with convex penalty √ γ ) − ) − < . Indeed if X is fixed and y , (cid:101) y are two response vectors, multiplying by ( (cid:98) β − (cid:101) β ) the KKT conditions X (cid:62) ( y − X (cid:98) β ) ∈ ∂g ( (cid:98) β ) and X (cid:62) ( (cid:101) y − X (cid:101) β ) ∈ ∂g ( (cid:101) β ) and takingthe difference, we find n ( ∂g ( (cid:98) β ) − ∂g ( (cid:101) β )) (cid:62) ( (cid:98) β − (cid:101) β ) + (cid:107) X ( (cid:98) β − (cid:101) β ) (cid:107) (cid:51) ( y − (cid:101) y ) (cid:62) X ( (cid:98) β − (cid:101) β ) . (B.1)Since the infimum of ( ∂g ( (cid:98) β ) − ∂g ( (cid:101) β )) (cid:62) ( (cid:98) β − (cid:101) β ) is at least µ (cid:107) Σ ( (cid:98) β − (cid:101) β ) (cid:107) by strong convexityof g , this proves that ( µ (2 + √ γ ) − + 1) (cid:107) X ( (cid:98) β − (cid:101) β ) (cid:107) ≤ ( y − (cid:101) y ) (cid:62) X ( (cid:98) β − (cid:101) β ) in Ω . Thus theoperator norm of ( ∂/∂ h ) X (cid:98) β is bounded by L = ( µ (2 + √ γ ) − + 1) − < . It follows that ˆ df = Tr[( ∂/∂ h ) X (cid:98) β ] ≤ nL and (1 − ˆ df /n ) − ≤ (1 − L ) − .(ii) If γ < then ˆ df /n ≤ γ and (1 − ˆ df /n ) − ≤ (1 − γ ) − because ˆ df is the trace of a matrixwith operator norm at most one [BT17] and rank at most p .(iii) Under Assumption 3.1(iii) for the Lasso, recall that s ∗ is defined as follows: First picksome d ∗ ∈ (0 , small enough to satisfy (D.17) and then define s ∗ as a function of d ∗ , η, ϕ .Then (cid:107) (cid:98) β (cid:107) ≤ d ∗ n holds in Ω by Lemma D.3 so that (1 − ˆ df /n ) − ≤ (1 − d ∗ ) − . Proof of Corollary 3.2.
Recall that (1 − ˆ df /n ) − ≤ C ∗ for some constant C ∗ depending only on { γ, µ, ϕ, η } . The first inequality in (3.4) thus yields on Ω C − ∗ | ˆ R Gen − ( (cid:107) Σ h (cid:107) + σ ∗ ) | ≤ V ∗ n − ( (cid:107) (cid:98) ψ (cid:107) /n + (cid:107) Σ h (cid:107) + σ ∗ ) , ( C − ∗ − V ∗ n − ) | ˆ R Gen − ( (cid:107) Σ h (cid:107) + σ ∗ ) | ≤ V ∗ n − ( (cid:107) (cid:98) ψ (cid:107) /n + ˆ R Gen ) . Since ( (cid:107) (cid:98) ψ (cid:107) /n ) = (1 − ˆ df /n ) ˆ R Gen ≤ ˆ R Gen this implies | ˆ R Gen − ( (cid:107) Σ h (cid:107) + σ ∗ ) | / ˆ R Gen ≤ | V ∗ | n − C − ∗ − | V ∗ | n − = O P ( n − ) thanks to E [ I { Ω }| V ∗ | ] ≤ C ( γ, µ, ϕ, η ) . The two other inequalities are obtained similarly. Appendix C: Proofs of auxiliary results
C.1. Proof of some properties of the Jacobian ( ∂/∂y ) (cid:98) ψ We will prove the following proposition that encompasses the claims of Proposition 2.2.
Proposition C.1.
Assume that ρ is convex differentiable and that ψ = ρ (cid:48) is 1-Lipschitz. Then ρ ( u ) = min v ∈ R { ( u − v ) / h ( v ) } for some convex function h . Consider (C.1) ( (cid:98) b , (cid:98) θ ) ∈ arg min b ∈ R p , θ ∈ R n (cid:107) y − Xb − θ (cid:107) / (2 n ) + g ( b ) + n (cid:88) i =1 h ( θ i ) /n. Then for every solution (cid:98) β to the optimization problem (1.3) , there exists a solution ( (cid:98) b , (cid:98) θ ) to theoptimization problem (C.1) such that (cid:98) β = (cid:98) b and ψ ( y − X (cid:98) β ) = y − X (cid:98) b − (cid:98) θ . Furthermore forevery fixed X ,(i) For almost every y , the map y (cid:55)→ (cid:98) ψ = ψ ( y − X (cid:98) β ) is Frechet differentiable at y ,(ii) For almost every y , the Jacobian ( ∂/∂ y ) (cid:98) ψ ∈ R n × n is symmetric positive semi-definite withoperator norm at most one and consequently Tr[( ∂/∂ y ) (cid:98) ψ ] ∈ [0 , n ] , ellec/Out-of-sample error estimate for robust M-estimators with convex penalty (iii) If additionally y (cid:55)→ X (cid:98) β ( y , X ) is Lipschitz in a neighborhood U ⊂ R n then Tr[ P ( ∂/∂ y ) X (cid:98) β ] ≤ | ˆ I | almost everywhere in U where ˆ I = { i ∈ [ n ] : ψ (cid:48) ( y i − x (cid:62) i (cid:98) β ) > } is the set of inliers.Proof of Proposition 2.2. If ψ = ρ (cid:48) is 1-Lipschitz then f ( u ) = u / − ρ is convex and 1-smooth (inthe sense that f (cid:48) is again 1-Lipschitz), so that its Fenchel conjugate f ∗ ( v ) = max u ∈ R { uv − f ( u ) } is 1-strongly convex (in the sense that v (cid:55)→ f ∗ ( v ) − v / is convex). Let h ( v ) = f ∗ ( v ) − v / .For this choice of h , we have thanks to f ∗∗ = f min v ∈ R (cid:8) ( u − v ) h ( v ) (cid:9) = u − max v ∈ R { uv − f ∗ ( v ) } = u − (cid:0) u − ρ ( u ) (cid:1) = ρ ( u ) . If ρ is the Huber loss (2.8) this construction was already well studied and in this case h ( v ) = | v | ,see for instance [DM16, Section 6] or [DT19] and the references therein.Next consider the M -estimator with square loss and design matrix [ X | I n ] ∈ R n × ( p + n ) definedby (C.1). The KKT conditions are given by(C.2) X (cid:62) ( y − X (cid:98) b − (cid:98) θ ) ∈ n∂g ( (cid:98) b ) , y i − x (cid:62) i (cid:98) b − (cid:98) θ i ∈ ∂h ( (cid:98) θ i ) , i ∈ [ n ] where ∂g and ∂h denote the subdifferentials of g and h . That is, ( (cid:98) b , (cid:98) θ ) is solution to (C.1) ifand only if (C.2) holds. We claim that one solution of the optimization problem (C.1) is givenby ( (cid:98) b , (cid:98) θ ) = ( (cid:98) β , y − X (cid:98) β − ψ ( y − X (cid:98) β )) where (cid:98) β is any solution in (1.3). Indeed, the first partin (C.2) holds by the optimality conditions X (cid:62) ψ ( y − X (cid:98) β ) ∈ n∂g ( (cid:98) β ) of (cid:98) β as a solution to theoptimization problem (1.3); it remains to check that y i − x (cid:62) i (cid:98) b − (cid:98) θ i ∈ ∂h ( (cid:98) θ i ) holds for all i ∈ [ n ] ,or equivalently that(C.3) ψ ( y i − x (cid:62) i (cid:98) β ) ∈ ∂h ( y i − x (cid:62) i (cid:98) β − ψ ( y i − x (cid:62) i (cid:98) β )) by definition of (cid:98) θ i . By additivity of the subdifferential, v + ∂h ( v ) = ∂f ∗ ( v ) . Furthermore u ∈ ∂f ∗ ( v ) if and only if f ∗∗ ( u ) + f ∗ ( v ) = uv by property of the Fenchel conjugate, where herewe have f ∗∗ ( u ) = f ( u ) = u / − ρ ( u ) since here f is convex and finite valued. We also have v ∈ ∂f ( u ) iff f ( u ) + f ∗ ( v ) = uv , and here ∂f ( u ) = { u − ψ ( u ) } is a singleton. 
Combining thesepieces together, for any u, v ∈ R we find v = u − ψ ( u ) iff v ∈ ∂f ( u ) iff f ( u ) + f ∗ ( v ) = uv iff f ∗∗ ( u ) + f ∗ ( v ) = uv iff u ∈ ∂f ∗ ( v ) iff u − v ∈ ∂h ( v ) . Hence taking u = y i − x (cid:62) i (cid:98) β and v = u − ψ ( u ) , the previous sentence implies that ψ ( u ) ∈ ∂h ( u − ψ ( u )) and the previous display (C.3) must hold for all i ∈ [ n ] . This proves that the given ( (cid:98) b , (cid:98) θ ) is solution to (C.1).By [BZ19b, Proposition J.1] applied to ( (cid:98) b , (cid:98) θ ) with design matrix [ X | I n ] , the map y (cid:55)→ y − X (cid:98) b − (cid:98) θ is 1-Lipschitz on R n , and for almost every y ∈ R n this map has symmetric positivesemi-definite Jacobian. Since y − X (cid:98) b − (cid:98) θ = ψ ( y − X (cid:98) β ) , this proves (i) and (ii).For (iii), let S = ( ∂/∂ y ) (cid:98) ψ . By the chain rule [Zie89, Theorem 2.1.11] we have S = diag ( ψ (cid:48) )( I n − ( ∂/∂ y ) X (cid:98) β ) where S is symmetric with eigenvalues in [0 , by (ii). It followsthat P ( ∂/∂ y ) X (cid:98) β = P − diag ( ψ (cid:48) ) † S where diag ( ψ (cid:48) ) † is the Moore-Penrose pseudo inverse ofdiag ( ψ (cid:48) ) . Since Tr[ diag ( ψ (cid:48) ) † S ] = Tr[( diag ( ψ (cid:48) ) † ) S ( diag ( ψ (cid:48) ) † ) ] ≥ because the matrix insidethe latter trace is positive semi-definite, the claim is proved. ellec/Out-of-sample error estimate for robust M-estimators with convex penalty C.2. Elastic-Net penalty and Huber Lasso
Proof of Proposition 2.3.
The KKT conditions read X (cid:62) (cid:98) ψ − nµ (cid:98) β ∈ nλ∂ (cid:107) (cid:98) β (cid:107) where ∂ (cid:107) b (cid:107) denotes the sub-differential of the (cid:96) norm at b ∈ R p . We first prove that the KKT conditionshold strictly with probability one, in the sense that P ( ∀ j ∈ [ p ] , j / ∈ ˆ S implies e (cid:62) j X (cid:62) (cid:98) ψ ∈ ( − nλ, nλ )) = 1 . Let j be fixed and let (cid:98) α be the solution to the same optimization problem as (cid:98) β , with theadditional constraint that the j -th coordinate is always set to 0. Then { j (cid:54)∈ ˆ S } = { (cid:98) α = (cid:98) β } asthe solution of each optimization problem is unique thanks to µ > . Let X − j be X with j -th column removed. The conditional distribution of Xe j given ( X − j , y ) is continuousbecause ( X , y ) has continuous distribution. Hence e (cid:62) j X (cid:62) ψ ( y − X (cid:98) α ) also has continuousdistribution conditionally on ( X − j , y ) when ψ ( y − X (cid:98) α ) (cid:54) = , so that P ( e (cid:62) j X (cid:62) ψ ( y − X (cid:98) α ) ∈{− λn, λn }| X − j , y ) = 0 because a continuous distribution has no atom. The unconditionalprobability is also 0 by the tower property. This shows that P ( j (cid:54)∈ ˆ S and e (cid:62) j X (cid:62) (cid:98) ψ ∈{− nλ, nλ } ) = 0 for all j . The union bound over all j ∈ [ p ] proves that the KKT conditionshold strictly with probability one, as desired. The maps ( y , X ) (cid:55)→ (cid:98) β and ( y , X ) (cid:55)→ (cid:98) ψ are Lipschitz continuous on every compact byProposition 5.2(i) as Σ is invertible. At a point ( y , X ) where the KKT conditions hold strictly,the KKT conditions stay strict and ˆ S stay the same in a neighborhood of ( y , X ) because thecontinuity of ( y , X ) (cid:55)→ e (cid:62) j X (cid:62) (cid:98) ψ − nµ (cid:98) β j ensure that e (cid:62) j X (cid:62) (cid:98) ψ − nµ (cid:98) β j stay bounded away from {− nλ, nλ } for every j ∈ [ p ] not in the active set at ( y , X ) . Furthermore, by (5.2) there existsan open set U ⊂ R n × R n × p with U (cid:51) ( y , X ) such that the maps ( y , X ) (cid:55)→ (cid:98) β and ( y , X ) (cid:55)→ (cid:98) ψ are Lipschitz in U , and the chain rule (5.4) yields ( ∂/∂ y ) (cid:98) ψ = diag ( ψ (cid:48) )( I n − X ( ∂/∂ y ) (cid:98) β ) foralmost every ( y , X ) ∈ U . In a neighborhood of a point ( y , X ) where the KKT conditionshold strictly and where the aforementioned chain rule holds, since ˆ S is locally constant we have ( ∂/∂ y ) (cid:98) β ˆ S c = ˆ S c × [ n ] as well as X (cid:62) ˆ S diag ( ψ (cid:48) )[ I n − X ( ∂/∂ y ) (cid:98) β ] − nµ ( ∂/∂ y ) (cid:98) β = ˆ S × [ n ] . By simple algebra, this implies ( ∂/∂ y ) (cid:98) β ˆ S = ( X (cid:62) ˆ S diag ( ψ (cid:48) ) X ˆ S + µn I | ˆ S | ) − X (cid:62) ˆ S diag ( ψ (cid:48) ) and thedesired expressions for ( ∂/∂ y ) X (cid:98) β and ( ∂/∂ y ) (cid:98) ψ . Proof of Proposition 2.4 .
For the Huber loss with (cid:96) -penalty, the M-estimator (cid:98) β satisfies ( (cid:98) β , (cid:98) θ ) = arg min ( b , θ ) ∈ R p × R n (cid:107) Xb + κ θ − y (cid:107) / (2 n ) + λ ( (cid:107) b (cid:107) + (cid:107) θ (cid:107) ) where κ > is some constant, see e.g. [DT19] and the references therein or Proposition C.1 with h ( · ) proportional to | · | for the Huber loss. Let β = ( (cid:98) β , (cid:98) θ ) . Then β is a Lasso solution with data ( y , X ) where the design matrix is X = [ X | κ I n ] ∈ R n × ( n + p ) .In this paragraph, we show that if X has continuous distribution then X satisfiesAssumption 3.1 of [BZ18b] with probability one. That assumption requires that for any ( δ j ) j ∈ [ p + n ] {− , +1 } p + n and any columns c j , ..., c j n +1 of X with j < ... < j n +1 , the matrix(C.4) (cid:18) c j . . . c j n +1 δ j . . . δ j n +1 (cid:19) ∈ R ( n +1) × ( n +1) Similar arguments to prove that the KKT conditions hold strictly are used in [BZ18b, Proposition 3.9] forthe Lasso or [BZ19b, Lemma L.1] for the Group-Lasso. The above argument is provided for completeness. ellec/Out-of-sample error estimate for robust M-estimators with convex penalty has rank n + 1 . We reorder the columns so that any column of the form c p + i , i ∈ [ n ] is the i -thcolumn after reordering, and note that c p + i = κ e i . Then there exists a value of X ∈ R n × p suchthat the above matrix, after reordering the columns, is equal to (cid:18) κ I n | n × δ k . . . δ k n | δ k n +1 (cid:19) for some permutation ( k , ..., k n +1 ) of ( j , ..., j n +1 ) . Since the previous display has nonzerodeterminant κ n δ k n +1 , the determinant of matrix (C.4), viewed as a polynomial of the coefficientsof X , is a non-zero polynomial. Since non-zero polynomials have a zero-set of Lebesgue measure0 [hl], this proves that (C.4) is rank n + 1 with probability one.Hence with probability one, by Proposition 3.9 in [BZ18b], the solution β ∈ R n + p is unique, (cid:107) β (cid:107) ≤ n and the KKT conditions of the optimization problem of β hold strictly almosteverywhere in ( y , X ) (see [Tib13] for related results). This shows that the sets ˆ S and ˆ I , viewed asa function of y while X is fixed, are constant in a neighborhood of y for almost every ( y , X ) . Nowthe set of { i ∈ [ n ] : (cid:98) θ i (cid:54) = 0 } exactly correspond to the outliers { i ∈ [ n ] : ψ (cid:48) ( y i − x (cid:62) i (cid:98) β ) = 0 } = [ n ] \ ˆ I and (cid:107) β (cid:107) ≤ n holds if and only if | ˆ S | + ( n − | ˆ I | ) ≤ n . This proves that | ˆ S | ≤ | ˆ I | almost surely.Furthermore, almost surely in ( y , X ) , the derivative of y (cid:55)→ Xβ = X (cid:98) β + κ (cid:98) θ exists and isequal to the orthogonal projection onto the linear span of { e i , i ∈ [ n ] \ ˆ I } ∪ { Xe j , j ∈ ˆ S } . Weconstruct an orthonormal basis of this linear span as follows: First by considering the vectors { e i , i ∈ [ n ] \ ˆ I } and then completing by a basis ( u k ) k ∈ ˆ S of the orthogonal complement of { e i , i ∈ [ n ] \ ˆ I } . Note that this orthogonal complement is exactly the column span of P X ˆ S .The orthogonal projection onto the linear span of { e i , i ∈ [ n ] \ ˆ I } ∪ { Xe j , j ∈ ˆ S } is thus ( ∂/∂ y ) Xβ = (cid:80) i ∈ [ n ] \ ˆ I e i e (cid:62) i + (cid:80) k ∈ ˆ S u k u (cid:62) k . 
Since P = diag ( ψ (cid:48) ) is constant in a neighborhoodof y and P zeros out all rows corresponding to outliers,diag ( ψ (cid:48) )( ∂/∂ y ) X (cid:98) β = ( ∂/∂ y ) diag ( ψ (cid:48) ) X (cid:98) β = ( ∂/∂ y ) diag ( ψ (cid:48) ) Xβ = diag ( ψ (cid:48) )( ∂/∂ y ) Xβ = (cid:80) k ∈ ˆ S u k u (cid:62) k which is exactly the orthogonal projection (cid:98) Q defined in the proposition, as desired. The almostsure identity ( ∂/∂ y ) (cid:98) ψ = diag ( ψ (cid:48) ) − (cid:98) Q is obtained by the chain rule: Here ψ is differentiable at y i − x (cid:62) i (cid:98) β for all i ∈ [ n ] with probability one since the fact that the KKT conditions of β holdstrictly imply that no y i − x (cid:62) i (cid:98) β is a kink of ψ . Appendix D: Huber Lasso: Proofs
This section provides the necessary lemmas to prove the main result under Assumption 2.3(iii),i.e., when the penalty is the (cid:96) norm and a scaled Huber loss is used as the loss function: fortuning parameters λ ∗ , λ ,(D.1) (cid:98) β = arg min b ∈ R p (cid:16) n (cid:88) i =1 λ ∗ ρ H (cid:16) ( √ nλ ∗ ) − ( y i − x (cid:62) i b ) (cid:17) + λ (cid:107) b (cid:107) (cid:17) where ρ H is the Huber loss (2.8). We let ˆ O = { i ∈ [ n ] : ψ (cid:48) ( y i − x (cid:62) i (cid:98) β ) = 0 } be the set of outliers.To control the sparsity and number of outliers of the M -estimator with Huber loss and (cid:96) penalty (D.1), the following equivalent definition of the estimator will be useful. The M -estimator (cid:98) β ∈ R p is equal to the first p components of the solution ( (cid:98) β , (cid:98) θ ) ∈ R p + n of the ellec/Out-of-sample error estimate for robust M-estimators with convex penalty optimization problem(D.2) ( (cid:98) β , (cid:98) θ ) = arg min ( b , θ ) ∈ R p + n (cid:107) Xb + √ n θ − y (cid:107) / (2 n ) + λ ∗ (cid:107) θ (cid:107) + λ (cid:107) β (cid:107) . This representation of the Huber Lasso (cid:98) β is well known in the study of M -estimators based onthe Huber loss, cf. [DM16, Section 6] or [DT19] and the references therein. When λ = λ ∗ , (D.2)reduces to a Lasso optimization problem in R p + n with design matrix [ X |√ n I n ] ∈ R n × ( p + n ) andresponse y .Assumption 2.3(iii) requires that (cid:98) n (1 − s ∗ ) + (cid:107) β (cid:107) (cid:99) components of ε are iid N (0 , σ ) . In thiscase, as explained in [DT19] we rewrite y as(D.3) y = Xβ + √ n θ ∗ + σ z where (cid:107) θ ∗ (cid:107) ≤ (cid:98) s ∗ n (cid:99) − (cid:107) β (cid:107) and z ∼ N ( , I n ) . The non-zero components of the unknownvector θ ∗ represent the contaminated responses and θ ∗ is not independent of z . The sparsityof the unknown regression vector in the above linear model with design matrix X = [ X |√ n I n ] is (cid:107) β (cid:107) + (cid:107) θ ∗ (cid:107) ≤ s ∗ n , and the support of (cid:98) θ is exactly the set of outliers ˆ O = { i ∈ [ n ] : ψ (cid:48) ( y i − x (cid:62) i (cid:98) β ) = 0 } . Lemma D.2 below shows that (cid:107) (cid:98) β (cid:107) + (cid:107) (cid:98) θ (cid:107) can be controlled with highprobability when s ∗ ∈ (0 , is a small enough constant. In order to achieve this, given that(D.2) is a Lasso problem, we may leverage existing results (such as Lemma D.6 below) tocontrol the sparsity of the Lasso. This requires a control on the sparse condition number of X = [ X , √ n I n ] (cf. Lemma D.5 below) and a control of the noise in the linear model (D.3) (cf.Lemma D.4 below). Proposition D.1 (Lipschitz properties, Huber Lasso) . Consider R n × R n × p as the underlyingprobability space for ( ε , X ) equipped with the Lebesgue sigma algebra. Let (cid:98) β the Huber Lassoestimate in (D.1) . We allow λ ∗ = + ∞ in which case (cid:98) β reduces to the Lasso with square lossand ˆ O = ∅ . Assume that for some event Ω ∗ we have Ω ∗ ⊂ {| ˆ O | + (cid:107) (cid:98) β (cid:107) ≤ d ∗ n } with P (Ω ∗ ) → and Ω ∗ ⊂ R n × R n × p is open, where (D.4) (1 − (cid:112) d ∗ ) − (cid:112) d ∗ log( e ( γ + 1) / (2 d ∗ )) > for some d ∗ ∈ (0 , independent of n, p . Then there exists an open subset Ω L ⊂ R n × R n × p with P (( ε , X ) ∈ Ω L ) → such that the following holds.(i) Let ( ε , X ) and ( (cid:101) ε , (cid:102) X ) ∈ R n × R n × p . Set y = Xβ + ε , (cid:101) y = (cid:102) Xβ + (cid:101) ε and define (cid:98) β , (cid:101) β , ψ , (cid:101) ψ as in Proposition 5.2. 
If { ( ε , X ) , ( (cid:101) ε , (cid:102) X ) } ⊂ Ω L then (cid:107) ψ − (cid:101) ψ (cid:107) ∨ ( √ n (cid:107) Σ ( (cid:98) β − (cid:101) β ) (cid:107) ) ≤ (cid:2) (cid:107) y − (cid:101) y (cid:107) + (cid:107) ( X − (cid:102) X ) Σ − (cid:107) op ( n − (cid:107) ψ (cid:107) + (cid:107) Σ (cid:98) β (cid:107) ) (cid:3) M ( d ∗ , γ ) for some constant M ( d ∗ , γ ) depending on { d ∗ , γ } only.(ii) Let ε , (cid:101) ε ∈ R n with ε = (cid:101) ε , (cid:102) X , X ∈ R n × p , and let (cid:98) β , (cid:101) β , ψ , (cid:101) ψ , r , (cid:101) r be the correspondingquantities as in Lemma 5.4. Let also y = Xβ + ε , (cid:101) y = (cid:102) Xβ + ε . If { ( ε , X ) , ( ε , (cid:102) X ) } ⊂ Ω L then (D.5) (cid:107) Σ ( (cid:98) β − (cid:101) β ) (cid:107) ∨ (cid:107) r − (cid:101) r (cid:107) ≤ n − (cid:107) ( X − (cid:102) X ) Σ − (cid:107) op ( (cid:107) Σ h (cid:107) + (cid:107) r (cid:107) ) L ( d ∗ , γ ) for some constant L ( d ∗ , γ ) depending on { d ∗ , γ } only.Proof. Since D g ⊂ [0 , + ∞ ) in (5.3) and using the argument leading to (5.2) with L = 1 , wehave (cid:107) ψ − (cid:101) ψ (cid:107) ≤ Ξ where Ξ def = ( (cid:101) β − (cid:98) β ) (cid:62) ( (cid:102) X − X ) (cid:62) ψ + ( (cid:101) y + ( X − (cid:102) X ) (cid:98) β − y ) (cid:62) ( (cid:101) ψ − ψ ) ellec/Out-of-sample error estimate for robust M-estimators with convex penalty as well as ( (cid:102) X (cid:101) β − (cid:101) y + y − X (cid:98) β ) (cid:62) ( ψ − (cid:101) ψ ) ≤ Ξ . The left-hand side can be rewritten (cid:80) ni =1 [ ψ ( u i ) − ψ (˜ u i )]( u i − ˜ u i ) with u i = y i − x (cid:62) i (cid:98) β and ˜ u i = ˜ y i − (cid:101) x (cid:62) i (cid:101) β where (cid:101) x i = (cid:102) X (cid:62) e i . Each term in the sum is non-negative by convexity of ρ andmonotonicity of the subdifferential. For all terms in the sum such that both ψ (cid:48) ( u i ) = 1 and ψ (cid:48) (˜ u i ) = 1 hold, we have ψ (cid:48) ( v ) = 1 for all v ∈ [ u i , ˜ u i ] so that [ ψ ( u i ) − ψ (˜ u i )]( u i − ˜ u i ) = ( u i − ˜ u i ) by the fundamental theorem of calculus. If Q ∈ R n × n is the diagonal projector matrix Q = diag ( ψ (cid:48) ( u i ) ψ (cid:48) (˜ u i )) i =1 ,...,n , this implies (cid:107) Q [ (cid:102) X (cid:101) β − (cid:101) y − X (cid:98) β + y ] (cid:107) ≤ Ξ which can be rewritten, byexpanding the square in the left-hand side, as (cid:107) Q (cid:102) X ( (cid:101) β − (cid:98) β ) (cid:107) + (cid:107) Q [( (cid:102) X − X ) (cid:98) β − (cid:101) y + y ] (cid:107) ≤ Ξ − (cid:101) β − (cid:98) β ) (cid:62) (cid:102) X (cid:62) Q [( (cid:102) X − X ) (cid:98) β − (cid:101) y + y ] The next step is to bound (cid:107) Q (cid:102) X ( (cid:101) β − (cid:98) β ) (cid:107) from below by nτ (cid:107) Σ ( (cid:101) β − (cid:98) β ) (cid:107) for a constant τ to bespecified. It will be useful to consider the two sets ˆ A, ˆ B defined by ˆ B = supp( (cid:98) β ) ∪ supp( (cid:101) β ) ⊂ [ p ] and ˆ A = { i ∈ [ n ] : ψ (cid:48) ( y i − x (cid:62) i (cid:98) β ) = ψ (cid:48) (˜ y i − (cid:101) x (cid:62) i (cid:101) β ) = 1 } , which satisfies | ˆ B | + | [ n ] \ ˆ A | ≤ d ∗ n when { ( ε , X ) , ( (cid:101) ε , (cid:102) X ) } ⊂ Ω ∗ . Let ˇ φ def = min A ⊂ [ n ] ,B ⊂ [ p ]: | B | +( n −| A | ) ≤ d ∗ n (cid:104) min u ∈ R p :supp( u ) ⊂ B, u (cid:54) = (cid:16) n (cid:88) i ∈ A ( x (cid:62) i u ) (cid:107) Σ / u (cid:107) (cid:17)(cid:105) . 
If B ⊂ [ p ] , A ⊂ [ n ] are fixed, by [DS01, Theorem II.13] applied to submatrix of the Gaussianmatrix X Σ − multiplied on the left by an orthogonal projection of rank | A | and multiplied tothe right by an orthogonal projection or rank | B | , we find P (cid:104) min u ∈ R p :supp( u ) ⊂ B, u (cid:54) = (cid:16) (cid:88) i ∈ A ( x (cid:62) i u ) (cid:107) Σ u (cid:107) (cid:17) > (cid:112) | A | − (cid:112) | B | − √ x (cid:105) ≥ − e − x for all x > . In the right hand side inside the probability sign, using | B | + ( n − | A | ) ≤ d ∗ n wefind (cid:112) | A | − (cid:112) | B | = ( | A | − | B | ) / ( (cid:112) | A | + (cid:112) | B | ) ≥ √ n (1 − d ∗ ) / (1 + (cid:112) d ∗ ) = √ n (1 − (cid:112) d ∗ ) using the loose bounds | A | ≤ n and | B | ≤ d ∗ n to bound the denominator. There are (cid:0) n + p (cid:98) d ∗ n (cid:99) (cid:1) pairs of sets ( A, B ) with A ⊂ [ n ] , B ⊂ [ p ] and | A | + | B | ≤ d ∗ n . Hence by the union bound, P ( ˇ φ > (1 − (cid:112) d ∗ ) − (cid:112) d ∗ log( e ( γ + 1) / (2 d ∗ )) + 2 x/n ) ≥ − e − x where we used the classical bound log (cid:0) qd (cid:1) ≤ d log eqd . Finally, with τ defined by τ = (1 −√ d ∗ ) − (cid:112) d ∗ log( e ( γ + 1) / (2 d ∗ )) , using (cid:112) a + 2 x/n ≤ √ a + (cid:112) x/n and setting x = n τ wefind P ( ˇ φ > τ ) ≥ − e − nτ / . Since τ > by assumption (D.4), this implies that ˇ φ is boundedfrom 0 with probability approaching one. Finally, P ( (cid:107) X Σ − (cid:107) op < √ n (2 + √ γ )) ≥ e − n/ byanother application of [DS01, Theorem II.13] so that the event Ω L = Ω ∗ ∩ { ˇ φ > τ } ∩ {(cid:107) X Σ − (cid:107) op < √ n (2 + √ γ ) } satisfies P (Ω L ) → . Furthermore, Ω L viewed as a subset of R n × R n × p is open as a finiteintersection of open sets. In summary, we have proved that if { ( ε , X ) , ( (cid:101) ε , (cid:102) X ) } ⊂ Ω L then(D.6) (cid:107) ψ − (cid:101) ψ (cid:107) ∨ ( nτ (cid:107) Σ ( (cid:98) β − (cid:101) β ) (cid:107) ) ≤ (cid:12)(cid:12) Ξ (cid:12)(cid:12) + (cid:12)(cid:12) (cid:101) β − (cid:98) β ) (cid:62) (cid:102) X (cid:62) Q [( (cid:102) X − X ) (cid:98) β − (cid:101) y + y ] (cid:12)(cid:12) . ellec/Out-of-sample error estimate for robust M-estimators with convex penalty To prove (i), by simple algebra to bound the right hand side, we find (cid:107) ψ − (cid:101) ψ (cid:107) ∨ ( nτ (cid:107) Σ ( (cid:98) β − (cid:101) β ) (cid:107) ) ≤(cid:107) y − (cid:101) y (cid:107) (cid:0) (cid:107) ψ − (cid:101) ψ (cid:107) + 2 √ n (2 + √ γ ) (cid:107) Σ ( (cid:98) β − (cid:101) β ) (cid:107) (cid:1) + (cid:107) ( X − (cid:102) X ) Σ − (cid:107) op (cid:8) (cid:107) Σ ( (cid:98) β − (cid:101) β ) (cid:107) ( (cid:107) ψ (cid:107) + 2 √ n (2 + √ γ ) (cid:107) Σ (cid:98) β (cid:107) ) + (cid:107) (cid:101) ψ − ψ (cid:107)(cid:107) Σ (cid:98) β (cid:107) (cid:9) . Dividing by the square root of the left-hand side provides the desired inequality.To prove (ii), if ( ε , X ) , ( ε , (cid:102) X ) ∈ Ω L , using ε = (cid:101) ε to bound the right hand side in (D.6) yields (cid:107) ψ − (cid:101) ψ (cid:107) ∨ ( nτ (cid:107) Σ ( (cid:98) β − (cid:101) β ) (cid:107) ) ≤ (cid:107) ( X − (cid:102) X ) Σ − (cid:107) op (cid:8) (cid:107) Σ h (cid:107)(cid:107) ψ − (cid:101) ψ (cid:107) + (cid:107) Σ ( (cid:98) β − (cid:101) β ) (cid:107) ( (cid:107) ψ (cid:107) + 2 √ n (2 + √ γ ) (cid:107) Σ h (cid:107) ) (cid:9) . Dividing both sides by max {√ nτ (cid:107) Σ ( (cid:98) β − (cid:101) β ) (cid:107) , (cid:107) ψ − (cid:101) ψ (cid:107)} we obtain (D.5). Lemma D.2 (Sparsity of the Huber Lasso) . 
For any ϕ ≥ , constants η ∈ (0 , and d ∗ ∈ (0 , satisfying (D.17) below, there exists a sparsity proportion s ∗ = s ∗ ( d ∗ , γ, ϕ, η ) ∈ (0 , d ∗ ) such thatif (D.7) φ max ( Σ ) φ min ( Σ ) ≤ ϕ, (cid:107) β (cid:107) + (cid:107) θ ∗ (cid:107) ≤ s ∗ n, λ ∗ = λ = ση √ n (cid:16) (cid:114) γ + 1 s ∗ (cid:17) then there exists and event Ω ∗ of probability approaching one such that Ω ∗ ⊂ {(cid:107) (cid:98) β (cid:107) + (cid:107) (cid:98) θ (cid:107) ≤ d ∗ n } .If the probability space for ( ε , X ) is chosen as R n × R n × p equipped with the Lebesgue sigmaalgebra, then Ω ∗ ⊂ R n × R n × p can be chosen as an open set.Proof. Since we can view ( (cid:98) β , (cid:98) θ ) as the solution of a Lasso problem, we can use existing resultson the Lasso, such as Lemma D.6, to control the sparsity (cid:107) (cid:98) β (cid:107) + (cid:107) (cid:98) θ (cid:107) . We apply Lemma D.6with ¯ p = n + p , X = [ X , √ n I n ] ∈ R n × ( n + p ) , b = ( β , θ ∗ ) ∈ R n + p and y = y as explained in thediscussion following Lemma D.6. We have to verify that (D.25), (D.26), (D.27) and (D.28) allhold for certain m , η , λ .Let the constants { c ∗ , d ∗ } be as in Lemma D.5 and let Ω ∗ be defined in Lemma D.5. We nowdefine s ∗ ∈ (0 , d ∗ ) by(D.8) s ∗ = d ∗ (2(1 − η ) ) − (1 + η ) { ϕc ∗ − } + 1 . Let ¯ A ⊂ [ n ] , ¯ B ⊂ [ p ] be two sets that will be specified later and define ¯ S as the disjoint union ¯ S = ¯ B ∪ { i + n, i ∈ ¯ A } . With m def = d ∗ n − | ¯ S | , the condition ¯ S ≤ s ∗ n implies(D.9) | ¯ S | ≤ d ∗ n (2(1 − η ) ) − (1 + η ) { ϕc ∗ − } + 1 or equivalently(D.10) | ¯ S | ≤ − η ) ( d ∗ n − | ¯ S | )(1 + η ) { ϕc ∗ − } . By Lemma D.5, in Ω ∗ inequality | ¯ S | ≤ s ∗ n with m = d ∗ n − | ¯ S | thus implies (D.27). It remainsto define the sets ¯ A, ¯ B , prove that | ¯ S | = | ¯ A | + | ¯ B | indeed satisfies | ¯ S | ≤ s ∗ n , and prove that(D.25), (D.26), (D.28) all hold. Let B = supp( β ) ∪ B where B = { j ∈ [ p ] : | e (cid:62) j X (cid:62) σ z | /n ≥ ηλ } ,A = supp( θ ∗ ) ∪ A where A = { i ∈ [ n ] : | σz i | / √ n ≥ ηλ } ellec/Out-of-sample error estimate for robust M-estimators with convex penalty so that (D.26) and (D.25) can be rewritten ¯ B ⊃ B and ¯ A ⊃ A . In the event (D.13), ( | A | + | B | )( t + 1) = ( | A | + | B | ) σ − η λ n < (cid:0) √ s ∗ n + t (cid:112) | A | + | B | (cid:1) . where t + 1 = σ − ηλ √ n by definition of λ in (D.7) and the definition of t in Lemma D.4.Inequality | A | + | B | < s ∗ n must hold, otherwise the previous display leads to ( | A | + | B | )( t +1) being strictly smaller than itself. Since (cid:107) β (cid:107) + (cid:107) θ ∗ (cid:107) ≤ s ∗ n , inequality | A | + | B | < s ∗ n thusholds in the event (D.13). Next we define the sets ¯ A ⊃ A and ¯ B ⊃ B by adding elementsarbitrarily until | ¯ A | + | ¯ B | ∈ [ s ∗ n, s ∗ n ) . For (D.28), again in event (D.13) we have thanks to | ¯ A | + | ¯ B | ≥ s ∗ n that (cid:8) (cid:107) σ z ¯ A (cid:107) /n + (cid:107) X (cid:62) ¯ B σ z (cid:107) /n (cid:9) ≤ σn − (cid:2) √ s ∗ n + t (cid:0) | ¯ A | + | ¯ B | (cid:1) (cid:3) ≤ ηλ (cid:0) | ¯ A | + | ¯ B | (cid:1) . Hence conditions (D.25), (D.26), (D.27) and (D.28) all hold, and (cid:107) (cid:98) β (cid:107) + (cid:107) (cid:98) θ (cid:107) ≤ | ¯ A | + | ¯ B | + | supp( (cid:98) θ ) \ ¯ A | + | supp( (cid:98) β ) \ ¯ B | ≤ | ¯ A | + | ¯ B | + m ≤ d ∗ n. 
The previous display holds in the intersection of Ω ∗ from Lemma D.5 and the event (D.13).Both these events are open when viewed as subsets of R n × R n × p so that Ω ∗ def = Ω ∗ ∩ { (D.13) } is open as well (with some abuse of notation, { (D.13) } here denotes the high-probability eventinside the probability sign in (D.13)).A similar result can be proved for the Lasso with square loss using Lemma D.6 directly. Theproof is simpler than the previous proof as there is no need to control the number of outliers. Lemma D.3 (Control of the sparsity of the Lasso) . Assume that ε ∼ N ( , σ I n ) in the linearmodel (1.1) . For any ϕ ≥ , constants η ∈ (0 , and d ∗ ∈ (0 , satisfying (D.17) below, thereexists a sparsity proportion s ∗ = s ∗ ( d ∗ , γ, ϕ, η ) ∈ (0 , d ∗ ) such that if (D.11) φ max ( Σ ) φ min ( Σ ) ≤ ϕ, (cid:107) β (cid:107) ≤ s ∗ n, λ = ση √ n (cid:16) (cid:114) γs ∗ (cid:17) then there exists and event Ω ∗ of probability approaching one such that Ω ∗ ⊂ {(cid:107) (cid:98) β (cid:107) ≤ d ∗ n } for (cid:98) β the Lasso, i.e., defined as (3.1) with g ( b ) = λ (cid:107) b (cid:107) . If the probability space for ( ε , X ) is chosenas R n × R n × p equipped with the Lebesgue sigma algebra, then Ω ∗ ⊂ R n × R n × p can be chosen asan open set.Proof. We use the same argument as the proof Lemma D.2 in the simpler setting with nooutliers. Here, set ε = σ z so that z ∼ N ( , I n ) by assumption. We apply Lemma D.6 with ¯ p = p , X = X b = β and y = y . We have to verify that (D.22), (D.23), (D.24) all hold forcertain m , η , λ and set ¯ S ⊂ [ p ] .Let the constants { c ∗ , d ∗ } be as in Lemma D.5 and let Ω ∗ be defined in Lemma D.5. Wenow define s ∗ ∈ (0 , d ∗ ) by (D.8). With m def = d ∗ n − | ¯ S | , the condition ¯ S ≤ s ∗ n implies (D.9)or equivalently (D.10) by simple algebra. By Lemma D.5, in Ω ∗ inequality | ¯ S | ≤ s ∗ n with m = d ∗ n − | ¯ S | thus implies (D.23). It remains to prove that ¯ S indeed satisfies | ¯ S | ≤ s ∗ n , andprove that (D.22), (D.23) and (D.24) all hold. Let S = supp( β ) ∪ S where S = { j ∈ [ p ] : | e (cid:62) j X (cid:62) σ z | /n ≥ ηλ } , so that (D.22) can be rewritten ¯ S ⊃ S . In the event (D.13) with A = ∅ ,since (cid:107) β (cid:107) ≤ s ∗ n we have | S | ( t + 1) = | S | σ − η λ n < ( √ s ∗ n + t (cid:112) | S | ) . where t + 1 = σ − ηλ √ n by definition of λ in (D.11) and the definition of t in Lemma D.4.Inequality | S | < s ∗ n must hold, otherwise the previous display leads to | S | ( t +1) being strictly ellec/Out-of-sample error estimate for robust M-estimators with convex penalty smaller than itself. Hence | S | ≤ | S | + (cid:107) β (cid:107) < s ∗ n must hold in the event (D.13). Next we definethe set ¯ S ⊃ S by adding elements arbitrarily until | ¯ S | ∈ [ s ∗ n, s ∗ n ) . For (D.24), again in event(D.13) we have thanks to | ¯ S | ≥ s ∗ n that {(cid:107) X (cid:62) ¯ S σ z (cid:107) /n } ≤ σn − (cid:2) √ s ∗ n + t | ¯ S | (cid:3) ≤ ηλ | ¯ S | . Hence conditions (D.22), (D.23), (D.24) all hold, and (cid:107) (cid:98) β (cid:107) ≤ | ¯ S | + | supp( (cid:98) θ ) \ ¯ A | ≤ | ¯ S | + m ≤ d ∗ n. The previous display holds in the intersection of Ω ∗ from Lemma D.5 and the event (D.13). Boththese events are open when viewed as subsets of R n × R n × p so that Ω ∗ def = Ω ∗ ∩ { (D.13) } is openas well. Lemma D.4 (Control of the noise in the linear model (D.3)) . Let p/n ≤ γ , let Σ ∈ R p × p withdiagonal entries equal to 1 and let ϕ ≥ φ max ( Σ ) /φ min ( Σ ) be an upper bound on the conditionnumber. 
Let z ∼ N ( , I n ) , G ∈ R n × p with iid N (0 , entries and assume that z and G areindependent. For some fixed t define W = (cid:16) (cid:88) i ∈ [ n ] ( | z i | − t ) + (cid:88) j ∈ [ p ] ( | e (cid:62) j Σ G (cid:62) z (cid:107) z (cid:107) − | − t ) (cid:17) . If t = (cid:112) γ ) /s ∗ ) for some s ∗ ∈ (0 , then E [ W ] ≤ s ∗ n and for all u > , (D.12) P ( W ≥ √ s ∗∗ n ) ≤ e − nC ( γ,s ∗ ) / (2 ϕ ) where s ∗∗ depends on { γ, s ∗ } only and is such that s ∗∗ < s ∗ , and where C ( γ, s ∗ ) > dependson { γ, s ∗ } only. For the same value of t , we have as n, p → + ∞ (D.13) P (cid:104) ∩ A ⊂ [ n ] ,B ⊂ [ p ] (cid:110) {(cid:107) z A (cid:107) + n (cid:107) ( Σ G (cid:62) z ) B (cid:107) } < √ s ∗ n + t (cid:112) | A | + | B | (cid:111)(cid:105) → . Proof.
Let g = G (cid:62) z (cid:107) z (cid:107) − ∈ R p . Then g is independent of z , for instance because theconditional distribution of g given z is always N ( , I p ) . Define f ( z , g ) = (cid:16) (cid:88) i ∈ [ n ] ( | z i | − t ) + (cid:88) j ∈ [ p ] ( | e (cid:62) j Σ g | − t ) (cid:17) . If (cid:101) z ∈ R n , (cid:101) g ∈ R p then by the triangle inequality for the Euclidean norm and using | ( | a | − t ) + − ( | b | − t ) + | ≤ | a − b | for all a, b ∈ R , we find | f ( z , g ) − f ( (cid:101) z , (cid:101) g ) | ≤ (cid:8) (cid:107) z − (cid:101) z (cid:107) + (cid:107) Σ ( g − (cid:101) g ) (cid:107) (cid:9) ≤ (1 ∨ (cid:107) Σ (cid:107) op ) (cid:8) (cid:107) z − (cid:101) z (cid:107) + (cid:107) g − (cid:101) g (cid:107) (cid:9) . Hence f is a (1 ∨ (cid:107) Σ (cid:107) op ) -Lipschitz function of iid standard normal random variables. Sincediag ( Σ ) = I p we have ∨ (cid:107) Σ (cid:107) op = (cid:107) Σ (cid:107) op ≤ ϕ . By the concentration of Lipschitz functions ofstandard normal random variables [BLM13, Theorem 5.6], this implies that(D.14) P ( f ( z , g ) ≥ E [ f ( z , g )] + u √ n ) ≤ exp( − nu ϕ ) . It remains to bound the expectation of f ( z , g ) . Note that Z j = e (cid:62) j Σ G (cid:62) z / (cid:107) z (cid:107) has N (0 , distribution thanks to Σ jj = 1 for all diagonal entries. By [BZ18b, Lemma G.1], the bound E [( | Z | − t ) ] ≤ e − t / / (( t + 2) √ πt ) holds for Z ∼ N (0 , . For t = (cid:112) γ ) /s ∗ ) this implies E [( | Z | − t ) ] ≤ (1 + γ ) − s ∗ / (( t + 2) √ πt ) . Since E [ f ( z , g ) ] is the sum of n + p terms of the form E [( | Z | − t ) ] and n + p ≤ (1 + γ ) n , inequality E [ f ( z , g ) ] ≤ ns ∗ / (( t + ellec/Out-of-sample error estimate for robust M-estimators with convex penalty √ πt ) holds. Now choose u > in (D.14) as a constant depending on { γ, s ∗ } only definedby u = √ s ∗ − { s ∗ / (cid:0) ( t + 2) (cid:112) πt (cid:1) } and define s ∗∗ by √ s ∗∗ = √ s ∗ − u so that E [ f ( z , g ) ] + u ≤ √ s ∗∗ n . This proves (D.12).For (D.13), standard concentration bounds on the χ n distribution yields P ( (cid:107) z (cid:107) < n + √ n ) ≥ − /n , and(D.15) n − ≤ (cid:107) z (cid:107) − (1 + (cid:112) n ) /n ) holds in this event. On the intersection of this event with the complement of (D.12), if ( | v | − t ) + denotes the vector with components max( | v i | − t, for any vector v , we find {(cid:107) z A (cid:107) + n (cid:107) ( Σ G (cid:62) z ) B (cid:107) } < t (cid:112) | A | + | B | + {(cid:107) (cid:0) ( | z | − t ) + (cid:1) A (cid:107) + (cid:107) (cid:0) ( | Σ G (cid:62) z n − | − t ) + (cid:1) B (cid:107) } ≤ t (cid:112) | A | + | B | + (1 + (cid:112) n ) /n ) f ( z , g ) ≤ t (cid:112) | A | + | B | + (1 + (cid:112) n ) /n ) √ s ∗∗ n. where the first inequality is due to | x | ≤ ( | x | − t ) + + t and the triangle inequality, the secondfollows from (D.15) and the third thanks to the complement of (D.12). Since s ∗∗ , s ∗ are bothindependent of n, p and s ∗∗ < s ∗ , inequality (1 + (cid:112) n ) /n ) √ s ∗∗ n ≤ √ s ∗ n holds for n largeenough. Lemma D.5 (Sparse condition number for X ∈ R n × ( p + n ) ) . Let ¯ X = [ X |√ n I n ] ∈ R n × ( p + n ) and d ∈ [ n + p ] . 
Then for S ( d ) = { ( b , θ ) : b ∈ R p , θ ∈ R n , (cid:107) b (cid:107) + (cid:107) θ (cid:107) ≤ d, (cid:107) Σ b (cid:107) + (cid:107) θ (cid:107) = 1 } we have (D.16) max ( b , θ ) ∈ S ( d ) (cid:107) Xb + √ n θ (cid:107) min ( b , θ ) ∈ S ( d ) (cid:107) Xb + √ n θ (cid:107) ≤ ˆΦ + 1 + (( ˆΦ − + 4 ˆ φ ) ˆΦ − + 1 − (( ˆΦ − − + 4 ˆ φ ) where ˆ φ = max b ∈ R p , θ ∈ R n : (cid:107) b (cid:107) + (cid:107) θ (cid:107) ≤ d | θ (cid:62) Xb | n − (cid:107) θ (cid:107)(cid:107) Σ b (cid:107) , ˆΦ + = max b ∈ R p : (cid:107) b (cid:107) ≤ d (cid:107) Xb (cid:107) n − (cid:107) Σ b (cid:107) , ˆΦ − = min b ∈ R p : (cid:107) b (cid:107) ≤ d (cid:107) Xb (cid:107) n − (cid:107) Σ b (cid:107) . Furthermore, for any d ∗ > satisfying (D.17) (2 d ∗ log( e (1 + γ ) /d ∗ )) + 2 (cid:112) d ∗ < (cid:112) / − , the constants c, c ∗ defined by c = (2 d ∗ log( e (1 + γ ) /d ∗ )) + 2 (cid:112) d ∗ , c < (cid:112) / − ,c ∗ = (1 + c ) + (cid:112) ((1 + c ) − + 4 c (1 − c ) + (cid:112) ((1 − c ) − + 4 c (D.18) are such that for any d ≤ d ∗ n , event Ω ∗ = { ˆ φ < c, Φ + < (1 + c ) , Φ − < (1 − c ) } hasprobability approaching one, and in Ω ∗ the right hand side of (D.16) is bounded from aboveby c ∗ . Consequently, in Ω ∗ , (D.19) max v ∈ R n + p : (cid:107) v (cid:107) =1 , (cid:107) v (cid:107) ≤ d ∗ n (cid:107) ¯ Xv (cid:107) min v ∈ R n + p : (cid:107) v (cid:107) =1 , (cid:107) v (cid:107) ≤ d ∗ n (cid:107) ¯ Xv (cid:107) ≤ c ∗ max(1 , φ max ( Σ ))min(1 , φ min ( Σ )) . ellec/Out-of-sample error estimate for robust M-estimators with convex penalty Proof.
Let ( b , θ ) ∈ S ( d ) . By expanding the square (cid:107) Xb + √ n θ (cid:107) /n = (cid:107) Xb (cid:107) /n + (cid:107) θ (cid:107) − θ (cid:62) Xb n − ≥ ˆΦ − (cid:107) Σ b (cid:107) + (cid:107) θ (cid:107) − (cid:107) θ (cid:107)(cid:107) Σ b (cid:107) ˆ φ = (cid:16) (cid:107) Σ b (cid:107) (cid:107) θ (cid:107) (cid:17) (cid:18) ˆΦ − − ˆ φ − ˆ φ (cid:19) (cid:18) (cid:107) Σ b (cid:107)(cid:107) θ (cid:107) (cid:19) (D.20)For the upper sparse eigenvalue, we have similarly (cid:107) Xb + √ n θ (cid:107) n ≤ (cid:16) (cid:107) Σ b (cid:107) (cid:107) θ (cid:107) (cid:17) (cid:18) ˆΦ ˆ φ ˆ φ (cid:19) (cid:18) (cid:107) Σ b (cid:107)(cid:107) θ (cid:107) (cid:19) . (D.21)Hence the left hand side of (D.16) is bounded from above by the ratio of the maximal singularvalue of the 2 by 2 matrix in (D.21) and the minimal singular value of the 2 by 2 matrix in(D.20). Since the minimal eigenvalue of ( a bb ) is ( a + 1 − (cid:112) ( a − + 4 b ) and the maximaleigenvalue is ( a + 1 + (cid:112) ( a − + 4 b ) , which can be checked for instance with the pythoncode import sympya, b = sympy.symbols(’a, b’, real=True)sympy.Matrix([ [a, -b], [-b, 1] ]).eigenvals() This proves (D.16).Let t = (cid:112) d ∗ log( e (1 + γ ) /d ∗ ) . By Theorem II.13 in [DS01], for any ( A, B ) with B ⊂ [ p ] , A ⊂ [ n ] with | A | + | B | ≤ d , the event (cid:110) max b ∈ R p : supp( b )= B n − (cid:107) Xb (cid:107)(cid:107) Σ b (cid:107) ≥ (cid:112) | B | /n + t (cid:111) has probability at most P ( N (0 , > t √ n ) , and the same holds for the event (cid:110) min b ∈ R p : supp( b )= B n − (cid:107) Xb (cid:107)(cid:107) Σ b (cid:107) ≤ − (cid:112) | B | /n − t (cid:111) . Again by Theorem II.13 in [DS01], the event (cid:110) max θ ∈ R n , b ∈ R p : supp( b )= B, supp( θ )= A n − | θ (cid:62) Xb |(cid:107) θ (cid:107)(cid:107) Σ b (cid:107) > (cid:112) | A | /n + (cid:112) | B | /n + t (cid:111) has probability at most P ( N (0 , > t √ n ) . Hence by the union bound and a standard upperbound on the Gaussian tail, the union of the above three events has probability at most e − t n/ / ( t √ πn ) . There are (cid:0) n + pd (cid:1) pairs ( A, B ) of subsets A ⊂ [ n ] , B ⊂ [ p ] with | A | + | B | = d ,so by the union bound, the union of all events as above over all pairs ( A, B ) with | A | + | B | = d has probability at most (cid:18) n + pd (cid:19) exp( − t n/ t √ πn ≤ e n (cid:0) dn log e ( n + p ) d − t (cid:1) t √ πn ≤ t √ πn → using the classical bound log (cid:0) qd (cid:1) ≤ d log eqd on binomial coefficients, the definition of t as well as d ≤ d ∗ n . By definition of c in (D.18), this proves that Ω ∗ has probability approaching one. The ellec/Out-of-sample error estimate for robust M-estimators with convex penalty fact that c ∗ bounds from above the right hand side of (D.16) follows follows by applying thebounds in Ω ∗ on ˆΦ + , ˆΦ i , ˆ φ to bound the numerator and denominator. Finally, the bound (D.19)on the d -sparse condition number of ¯ X = [ X |√ n I n ] is obtained using ( (cid:107) θ (cid:107) + (cid:107) b (cid:107) ) min(1 , φ min ( Σ )) ≤ (cid:107) θ (cid:107) + (cid:107) Σ b (cid:107) ≤ ( (cid:107) θ (cid:107) + (cid:107) b (cid:107) ) max(1 , φ max ( Σ )) . Bound on the false positives of the Lasso
Lemma D.6 (Proposition 7.4 in [BZ18a]–deterministic result) . Let n, ¯ p ≥ be integers and m ∈ [¯ p ] . Let X ∈ R n × ¯ p , y ∈ R n , b ∈ R p and define the Lasso estimate (cid:98) b = arg min b ∈ R ¯ p (cid:107) Xb − y (cid:107) / (2 n ) + λ (cid:107) b (cid:107) . Let η ∈ (0 , , λ > . Let ¯ S ⊂ [¯ p ] be a set with (D.22) ¯ S ⊃ supp( b ) ∪ { j ∈ [¯ p ] : | e (cid:62) j X (cid:62) ( Xb − y ) | /n ≥ ηλ } . Define the Sparse Riecz Condition (SRC) by (D.23) | ¯ S | < − η ) m (1 + η ) max B ⊂ [¯ p ]: | B \ ¯ S |≤ m { φ cond ( X (cid:62) B X B ) − } holds where φ cond ( S ) = φ max ( S ) /φ min ( S ) is the condition number of any positive semi-definitematrix S . Consider also the condition (cid:107) ¯ X (cid:62) ¯ S ( y − Xb ) (cid:107) /n ≤ η λ | ¯ S | . (D.24) If (D.22) , (D.23) and (D.24) hold simultaneously then | supp( (cid:98) b ) \ ¯ S | ≤ m . In our problem with X = [ X |√ n I n ] ∈ R n × ( n + p ) , b = ( β , θ ∗ ) ∈ R n + p , y = y = Xβ + √ n θ ∗ + σ z and ¯ p = n + p , the above deterministic Lemma can be rewritten as follows. Consider, somesets ¯ B ⊂ [ p ] , ¯ A ⊂ [ n ] , the conditions ¯ B ⊃ supp( β ) ∪ { j ∈ [ p ] : | e (cid:62) j X (cid:62) σ z | /n ≥ ηλ } , (D.25) ¯ A ⊃ supp( θ ∗ ) ∪ { i ∈ [ n ] : | σz i | / √ n ≥ ηλ } , (D.26) | ¯ S | = | ¯ A | + | ¯ B | < − η ) m (1 + η ) max B ⊂ [¯ p ]: | B \ ¯ S |≤ m { φ cond ( X (cid:62) B X B ) − } , (D.27) (cid:107) σ z ¯ A (cid:107) /n + (cid:107) X (cid:62) ¯ B σ z (cid:107) /n ≤ η λ ( | ¯ A | + | ¯ B | ) (D.28)where ¯ S = ¯ B ∪ { i + n, i ∈ ¯ A } . If (D.25), (D.26), (D.27) and (D.28) all hold and ( (cid:98) β , (cid:98) θ ) is thesolution of (D.2) with λ = λ ∗ then | supp( (cid:98) β ) \ ¯ B | + | supp( (cid:98) θ ) \ ¯ A | ≤ m . References [BBEKY13] Derek Bean, Peter J Bickel, Noureddine El Karoui, and Bin Yu,
References

[BBEKY13] Derek Bean, Peter J. Bickel, Noureddine El Karoui, and Bin Yu, Optimal M-estimation in high-dimensional regression, Proceedings of the National Academy of Sciences (2013), no. 36, 14563–14568.
[BEM13] Mohsen Bayati, Murat A. Erdogdu, and Andrea Montanari, Estimating Lasso risk and noise level, Advances in Neural Information Processing Systems, 2013, pp. 944–952.
[BLM13] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart, Concentration inequalities: A nonasymptotic theory of independence, Oxford University Press, 2013.
[BM12] Mohsen Bayati and Andrea Montanari, The Lasso risk for Gaussian matrices, IEEE Transactions on Information Theory (2012), no. 4, 1997–2017.
[Bra15] Jelena Bradic, Robustness in sparse linear models: relative efficiency based on robust approximate message passing, Electronic Journal of Statistics (2015), no. 2.
[BT17] Pierre C. Bellec and Alexandre B. Tsybakov, Bounds on the prediction error of penalized least squares estimators with convex penalty, Modern Problems of Stochastic Analysis and Statistics: Selected Contributions in Honor of Valentin Konakov, Springer, 2017.
[BZ18a] Pierre C. Bellec and Cun-Hui Zhang, De-biasing the Lasso with degrees-of-freedom adjustment, preprint (2018).
[BZ18b] Pierre C. Bellec and Cun-Hui Zhang, Second order Stein: SURE for SURE and other applications in high-dimensional inference, The Annals of Statistics, accepted, to appear (2018).
[BZ19a] Pierre C. Bellec and Cun-Hui Zhang, De-biasing the Lasso with degrees-of-freedom adjustment, arXiv:1902.08885 (2019).
[BZ19b] Pierre C. Bellec and Cun-Hui Zhang, Second order Poincaré inequalities and de-biasing arbitrary convex regularizers when p/n → γ, arXiv:1912.11943 (2019).
[CLS19] Xi Chen, Qihang Lin, and Bodhisattva Sen, On degrees of freedom of projection estimators with applications to multivariate nonparametric regression, Journal of the American Statistical Association (2019), 1–30.
[CM19] Michael Celentano and Andrea Montanari, Fundamental barriers to high-dimensional regression with convex penalties, arXiv preprint arXiv:1903.10603 (2019).
[CMW20] Michael Celentano, Andrea Montanari, and Yuting Wei, The Lasso with general Gaussian designs with applications to hypothesis testing, arXiv preprint arXiv:2007.13716 (2020).
[D+16] Lee H. Dicker et al., Ridge regression and asymptotic minimax estimation over spheres of growing dimension, Bernoulli (2016), no. 1, 1–37.
[Dic14] Lee H. Dicker, Variance estimation in high-dimensional linear models, Biometrika (2014), no. 2, 269–284.
[DKF+13] Charles Dossal, Maher Kachour, M. J. Fadili, Gabriel Peyré, and Christophe Chesneau, The degrees of freedom of the Lasso for general design matrix, Statistica Sinica (2013), 809–828.
[DM16] David Donoho and Andrea Montanari, High dimensional robust M-estimation: Asymptotic variance via approximate message passing, Probability Theory and Related Fields (2016), no. 3-4, 935–969.
[DMM09] David L. Donoho, Arian Maleki, and Andrea Montanari, Message-passing algorithms for compressed sensing, Proceedings of the National Academy of Sciences (2009), no. 45, 18914–18919.
[DS01] Kenneth R. Davidson and Stanislaw J. Szarek, Local operator theory, random matrices and Banach spaces, Handbook of the Geometry of Banach Spaces, Vol. 1 (2001), 317–366.
[DT19] Arnak Dalalyan and Philip Thompson, Outlier-robust estimation of a sparse linear model using $\ell_1$-penalized Huber's M-estimator, Advances in Neural Information Processing Systems, 2019, pp. 13188–13198.
[DW+18] Edgar Dobriban, Stefan Wager, et al., High-dimensional asymptotics of prediction: Ridge regression and classification, The Annals of Statistics (2018), no. 1, 247–279.
[EK18] Noureddine El Karoui, On the impact of predictor geometry on the performance on high-dimensional ridge-regularized generalized robust regression estimators, Probability Theory and Related Fields (2018), no. 1-2, 95–175.
[EKBB+13] Noureddine El Karoui, Derek Bean, Peter J. Bickel, Chinghway Lim, and Bin Yu, On robust regression with high-dimensional predictors, Proceedings of the National Academy of Sciences (2013), no. 36, 14557–14562.
[GAK20] Cédric Gerbelot, Alia Abbara, and Florent Krzakala, Asymptotic errors for convex penalized linear regression beyond Gaussian matrices, arXiv preprint arXiv:2002.04372 (2020).
[H+64] Peter J. Huber et al., Robust estimation of a location parameter, The Annals of Mathematical Statistics (1964), no. 1, 73–101.
[hl] Math Lover (https://math.stackexchange.com/users/366404/math lover), The Lebesgue measure of zero set of a polynomial function is zero, Mathematics Stack Exchange, https://math.stackexchange.com/q/1920302 (version: 2016-09-09).
[Kar13] Noureddine El Karoui, Asymptotic behavior of unregularized and ridge-regularized high-dimensional robust regression estimators: rigorous results, arXiv preprint arXiv:1311.2445 (2013).
[Kat09] Kengo Kato, On the degrees of freedom in shrinkage estimation, Journal of Multivariate Analysis (2009), no. 7, 1338–1352.
[L+08] Hannes Leeb et al., Evaluation and selection of models for out-of-sample prediction when the sample size is small relative to the complexity of the data-generating process, Bernoulli (2008), no. 3, 661–690.
[Min20] Kentaro Minami, Degrees of freedom in submodular regularization: A computational perspective of Stein's unbiased risk estimate, Journal of Multivariate Analysis (2020), 104546.
[MM18] Léo Miolane and Andrea Montanari, The distribution of the Lasso: Uniform control over sparse balls and adaptive parameter tuning, arXiv preprint arXiv:1811.01212 (2018).
[MMB16] Christopher A. Metzler, Arian Maleki, and Richard G. Baraniuk, From denoising to compressed sensing, IEEE Transactions on Information Theory (2016), no. 9, 5117–5144.
[SAH19] Fariborz Salehi, Ehsan Abbasi, and Babak Hassibi, The impact of regularization on high-dimensional logistic regression, Advances in Neural Information Processing Systems, 2019, pp. 12005–12015.
[Ste81] Charles M. Stein, Estimation of the mean of a multivariate normal distribution, The Annals of Statistics (1981), 1135–1151.
[Sto13] Mihailo Stojnic, A framework to characterize performance of Lasso algorithms, arXiv preprint arXiv:1303.7291 (2013).
[TAH15] Christos Thrampoulidis, Ehsan Abbasi, and Babak Hassibi, Lasso with non-linear measurements is equivalent to one with linear measurements, Advances in Neural Information Processing Systems, 2015, pp. 3420–3428.
[TAH18] Christos Thrampoulidis, Ehsan Abbasi, and Babak Hassibi, Precise error analysis of regularized M-estimators in high dimensions, IEEE Transactions on Information Theory (2018), no. 8, 5592–5628.
[Tib13] Ryan J. Tibshirani, The Lasso problem and uniqueness, Electronic Journal of Statistics (2013), 1456–1490.
[TT12] Ryan J. Tibshirani and Jonathan Taylor, Degrees of freedom in Lasso problems, The Annals of Statistics (2012), no. 2, 1198–1232.
[VDP+12] Samuel Vaiter, Charles Deledalle, Gabriel Peyré, Jalal Fadili, and Charles Dossal, The degrees of freedom of the group Lasso, arXiv preprint arXiv:1205.1481 (2012).
[Wik20] Wikipedia, Smoothstep — Wikipedia, the free encyclopedia, http://en.wikipedia.org/w/index.php?title=Smoothstep&oldid=958070398, 2020 [Online; accessed 26-July-2020].
[WWM17] Shuaiwen Wang, Haolei Weng, and Arian Maleki, Which bridge estimator is optimal for variable selection?, arXiv preprint arXiv:1705.08617 (2017).
[XMRH19] Ji Xu, Arian Maleki, Kamiar Rahnama Rad, and Daniel Hsu, Consistent risk estimation in high-dimensional linear regression, arXiv preprint arXiv:1902.01753 (2019).
[ZHT07] Hui Zou, Trevor Hastie, and Robert Tibshirani, On the “degrees of freedom” of the Lasso, The Annals of Statistics (2007), no. 5, 2173–2192.
[Zie89] William P. Ziemer, Weakly differentiable functions: Sobolev spaces and functions of bounded variation, vol. 120, Springer-Verlag New York, 1989.
[ZSC20] Qian Zhao, Pragya Sur, and Emmanuel J. Candès, The asymptotic distribution of the MLE in high-dimensional logistic models: Arbitrary covariance, arXiv preprint arXiv:2001.09351 (2020).