A Statistical Learning Assessment of Huber Regression
aa r X i v : . [ m a t h . S T ] S e p A Statistical Learning Assessment of Huber Regression
Yunlong Feng and Qiang Wu Department of Mathematics and Statistics, University at Albany Department of Mathematical Sciences, Middle Tennessee State University
Abstract
As one of the triumphs and milestones of robust statistics, Huber regression plays an importantrole in robust inference and estimation. It has also been finding a great variety of applicationsin machine learning. In a parametric setup, it has been extensively studied. However, in thestatistical learning context where a function is typically learned in a nonparametric way, there isstill a lack of theoretical understanding of how Huber regression estimators learn the conditionalmean function and why it works in the absence of light-tailed noise assumptions. To addressthese fundamental questions, this paper conducts an assessment of Huber regression from astatistical learning viewpoint. First, we show that the usual risk consistency property of Huberregression estimators, which is usually pursued in machine learning, cannot guarantee theirlearnability in mean regression. Second, we argue that Huber regression should be implementedin an adaptive way to perform mean regression, implying that one needs to tune the scaleparameter in accordance with the sample size and the moment condition of the noise. Third,with an adaptive choice of the scale parameter, we demonstrate that Huber regression estimatorscan be asymptotic mean regression calibrated under (1 + ǫ ) -moment conditions ( ǫ > ) on theconditional distribution. Last but not least, under the same moment conditions, we establishalmost sure convergence rates for Huber regression estimators. Note that the (1 + ǫ ) -momentconditions accommodate the special case where the response variable possesses infinite varianceand so the established convergence rates justify the robustness feature of Huber regressionestimators. In the above senses, the present study provides a systematic statistical learningassessment of Huber regression estimators and justifies their merits in terms of robustness froma theoretical viewpoint. In this paper, we are concerned with the robust regression problem where one aims at seeking afunctional relation between input and output when the response variable may be heavy-tailed [14, 19,16, 10]. In such scenarios, the traditionally frequently used least squares regression paradigms maynot work well due to the amplification of the least squares loss to large residuals. As an alternative,the Huber loss was proposed in the seminal work [12] in the context of robust estimation of locationparameters. The Huber loss and the theoretical findings in location parameter estimation thenapplied and carried over to robust regression problems. The regression paradigm that is associatedwith the Huber loss is termed as
Huber regression and the resulting estimator is termed as the
Huber regression estimator . The introduction of Huber regression led to the development of varioussubsequent M-estimators and fostered the development of robust statistics into a discipline.Denoting X as the input variable that takes values in a compact metric space X ⊂ R d and Y theresponse variable taking values in Y ⊂ R , given i.i.d observations z = { ( x i , y i ) } ni =1 , in the context of1arametric regression, the Huber regression estimator f z ,σ ( X ) = X ⊤ ˆ β is learned from the followingempirical risk minimization (ERM) scheme f z ,σ := arg min f ∈H n n X i =1 ℓ σ ( y i − f ( x i )) , (1)where H is the function space from X to R consisting of linear functions of the form f ( x ) = x ⊤ β and ℓ σ is the well-known Huber loss defined by ℓ σ ( t ) = ( t , if | t | ≤ σ, σ | t | − σ , otherwise . (2)Assuming that the conditional mean function f ⋆ ( X ) = E ( Y | X ) can be parametrically represented as f ⋆ ( X ) = X ⊤ β ⋆ and the noise Y − f ⋆ ( X ) is zero mean when conditioned on X , asymptotic propertiesof ˆ β and its convergence to β ⋆ have been extensively studied in the literature of parametric statistics.An incomplete list of related literature includes [13, 25, 17, 11, 16, 14, 19, 10, 15] and many referencestherein. Note that in the aforementioned studies, the scale parameter σ in the Huber loss is set tobe fixed and chosen according to the asymptotic efficiency rule. In a high-dimensional setting,Huber regression with a fixed scale parameter, however, may not be able to learn β ⋆ when thenoise is asymmetric, as argued recently in [22, 8]. There the authors proposed to choose the scaleparameter by relating it to the dimension of the input space, the moment condition of the noisedistribution, and the sample size so that one may debias the resulting regression estimator and thescale parameter can play a trade-off role between bias and robustness.In a nonparametric statistical learning context where functions in H , in general, do not admitparametric representations, theoretical investigations of Huber regression estimators are still sparsethough they have been applied extensively into various applications where robustness is a concern.To proceed with our discussion, denote H as a compact subset of the space C ( X ) of continuousfunctions on X , ρ the underlying unknown distribution over X × Y , and R σ ( f ) the generalizationerror of f : X → R defined by R σ ( f ) = E ℓ σ ( Y − f ( X )) , where the expectation is taken jointly with respect to X and Y . Recall that the objective is tolearn the conditional mean function f ⋆ ( X ) = E ( Y | X ) robustly. Existing studies in the literatureof statistical learning theory remind that Huber regression estimators are R σ -risk consistent, i.e., R σ ( f z ,σ ) → min f ∈H R σ ( f ) as n → ∞ . To see this, one simply notes that the Huber loss is Lipschitzcontinuous on R and so the usual learning theory arguments may apply [2, 9, 18, 5, 6]. However,such a risk consistency property neither says anything about the role that the scale parameter σ plays nor implies the convergence of the regression estimator f z ,σ to the mean regression function f ⋆ . In the present study, we aim to conduct a statistical learning assessment of the Huber regressionestimator f z ,σ . More specifically, we pursue answers to the following fundamental questions: • Question 1 : Whether R σ -risk consistency implies the convergence of f z ,σ to f ⋆ ? • Question 2 : What is the role that σ plays when learning f ⋆ through ERM (1)? • Question 3 : How to develop exponential-type fast convergence rates of f z ,σ to f ⋆ ? • Question 4 : How to justify the learnability of f z ,σ in the absence of light-tail noise?2nswers to these questions represent our main contributions. In particular, if R σ -risk consistencyimplies the convergence of f z ,σ to f ⋆ , we say that Huber regression (1) is mean regression calibrated.We show that Huber regression is generally not mean regression calibrated for any fixed scaleparameter σ. Instead, it should be implemented in an adaptive way in order to perform meanregression, where the adaptiveness refers to the dependence of the scale parameter on the samplesize and the moment condition. We also show that the scale parameter needs to diverge in accordancewith the sample size to ensure that the Huber regression estimator f z ,σ learns the mean regressionfunction f ⋆ , which we term as the asymptotic mean regression calibration property. Furthermore,such an asymptotic mean regression calibration property can be established under (1 + ǫ ) -momentconditions ( ǫ > ) on the conditional distribution. This is a rather weak condition as it admits thecase where the conditional distribution possesses infinite variance. To develop fast exponential-typeconvergence rates, we establish a relaxed Bernstein condition. The idea is to bound the secondmoment of associated random variables by using their first moment and an additional bias termthat diminishes towards when the sample size tends to infinity. These preparations allow us toestablish fast exponential-type convergence rates for f z ,σ . Interestingly, but not surprisingly, it isshown that σ plays a trade-off role between bias and learnability, and the convergence rates of f z ,σ depend on the order of the imposed moment conditions.The rest of this paper is organized as follows. In Section 2, we argue that risk consistency isinsufficient in guaranteeing learnability and so does not necessarily imply the convergence of Huberregression estimators to the mean regression function. In Section 3, we demonstrate that Huberregression is asymptotically mean regression calibrated if the scale parameter is chosen in a divergingmanner in accordance to the sample size and the moment condition. Some efforts are then madein Section 4 to develop fast exponential-type convergence rates by relaxing the standard Bernsteincondition in learning theory. In Section 5, we establish fast convergence rates for Huber regressionestimators under weak moment conditions. Proofs of Theorems are collected in Section 6. Thepaper is concluded in Section 7. Notation and Convention . Throughout this paper, we assume that f ⋆ is bounded and H ⊂ C ( X ) is uniformly bounded and denote M = max {k f ⋆ k ∞ , sup f ∈H k f k ∞ } . Denoting ρ X as the marginaldistribution of ρ on X , then k · k ,ρ defines the L -norm induced by ρ X . The notation a . b denotesthe fact that there exists an absolute positive constant c such that a ≤ cb . For any t ∈ R , let t + = max(0 , t ) . All proofs are deferred to the appendix.
In this section, we shall make efforts to answer Question 1 listed in the introduction by arguing thatrisk consistency is insufficient in guaranteeing learnability of the Huber regression estimator, whereby risk consistency we refer to the convergence of R σ ( f z ,σ ) to min f ∈H R σ ( f ) while learnability refersto the convergence of f z ,σ to f ⋆ .Following existing studies on empirical risk minimization schemes induced by convex loss func-tions, it is easy to deduce that f z ,σ is R σ -risk consistent for any fixed σ value. Moreover, undercertain mild assumptions, probabilistic convergence rates may also be established. To see this, notethat the deduction of the risk consistency property of Huber regression estimators as well as theirconvergence rates involves the following set of random variables G H = n ξ f | ξ f := ℓ σ ( y − f ( x )) − ℓ σ ( y − f H ,σ ( x )) , f ∈ H , ( x, y ) ∈ X × Y o , f H ,σ = arg min f ∈H R σ ( f ) is the population version of f z ,σ . Notice that the Huber loss ℓ σ in(2) is Lipschitz continuous on R with Lipschitz constant σ . Therefore, the random variables in G H and their variances can be uniformly upper bounded by constants involving σ . Applying learningtheory arguments and concentration inequalities to G H , under mild assumptions, convergence ratescan be derived. However, due to the dependence of f z ,σ on the scale parameter σ , it may possessmuch flexibility and can be quite different with different choices of the σ values. Consequently, the R σ -risk consistency property as well as the convergence rates of R σ ( f z ,σ ) − min f ∈H R σ ( f ) may notbe informative and may not indicate the learnability of f z ,σ in learning f ⋆ even if H is perfectlychosen such that f ⋆ ∈ H . . . . . . . . . . Figure 1: The bottom black curve with square marks gives the conditional mean function. The topblue curve represents the learned Huber regression estimator with σ = 0 . .To illustrate this phenomenon numerically, consider a toy example with the model Y = 2 sin( πX ) + (1 + 2 X ) ε, where X follows a uniform distribution on [0 , and ε ∼ . N (0 , . ) + 0 . N (0 , . ) . It is apparentthat the noise distribution admits zero mean and is skewed. Simple calculation shows that forthis regression model, the conditional mean function f ⋆ ( X ) = 2 sin( πX ) . In this experiment, wevisualize the function f z ,σ and compare it with f ⋆ . We choose the hypothesis space H as a ball ofthe reproducing kernel Hilbert space associated with the Gaussian kernel K ( x i , x j ) = exp {−k x i − x j k /h } . Both the kernel bandwidth h and the radius of the ball are tuned via cross-validationunder the least absolute deviation error criterion while the scale parameter σ in the Huber loss is setto be fixed with σ = 0 . . A set of independent observations are sampled from the above regressionmodel and are used as the training data. Then f z ,σ is plotted in Figure 1. The conditional meanfunction is also plotted for comparison. As discussed earlier, due to the Lipschitz continuity of theHuber loss, with the choice of σ = 0 . , the risk consistency can be guaranteed. However, from theplots in Figure 1, clearly, f z ,σ does not approach the conditional mean function.The fact that R σ -risk consistency of Huber regression estimators cannot guarantee their ability4o learn the conditional mean function can be further justified through the following example. Let M be the space of measurable functions from X to R and define f σ := arg min f ∈M R σ ( f ) . (3)Intuitively, f σ can be regarded as the best Huber regression estimator learned in an ideal case whereinfinite observations are available and the hypothesis space is perfectly selected. Example 1.
Consider the Huber regression problem where one aims to learn the conditional meanfunction f ⋆ from the following homoscedastic regression model Y = f ⋆ ( X ) + ε, where ε is the zero-mean noise variable with density p ε ( t ) = e − ( t + ) , if t ≥ − ,e t + ) , if t < − . Then there exists a constant c with c = 0 such that f σ ( x ) = f ⋆ ( x ) + c for all x ∈ X . As a result, if f σ ∈ H and R σ ( f z ,σ ) converges to min f ∈H R σ ( f ) as n → ∞ with large probability, then f z ,σ doesnot converge to f ⋆ with large probability.Proof. Recalling the definition of f σ in (3), for any x ∈ X , we can re-express it as follows f σ ( x ) = arg min ν ∈ R ˆ R ℓ σ ( t − ν ) p Y | X = x ( t )d t = arg min ν ∈ R ˆ R ℓ σ ( t − ν ) p ε ( t − f ⋆ ( x ))d t = arg min ν ∈ R ˆ R ℓ σ ( t − ν ) p ε ( t − f ⋆ ( x ))d t = arg min ν ∈ R ˆ R ℓ σ ( u − ( ν − f ⋆ ( x ))) p ε ( u )d u. Therefore, for any x ∈ X , we have f σ ( x ) − f ⋆ ( x ) = arg min ν ∈ R ˆ R ℓ σ ( u − ν ) p ε ( u )d u. The assumption that the noise ε is independent of x tells us that arg min ν ∈ R ˆ R ℓ σ ( u − ν ) p ε ( u )d u (4)is a unique constant for all x ∈ X . To prove the first part of the assertion, we only need to showthat is not a solution to the above minimization problem.5rom the definition of the Huber loss (2), we know that ˆ R ℓ σ ( u − ν ) p ε ( u )d u = ˆ R ℓ σ ( u − ν ) p ε ( u )d u + ˆ R ( u − ν ) p ε ( u )d u − ˆ R ( u − ν ) p ε ( u )d u = ˆ R ( u − ν ) p ε ( u )d u + ˆ | u − ν |≥ σ (2 σ | u − v | − σ ) p ε ( u )d u − ˆ | u − ν |≥ σ ( u − v ) p ε ( u )d u = ˆ R ( u − ν ) p ε ( u )d u − ˆ | u − ν |≥ σ ( | u − ν | − σ ) p ε ( u )d u = ˆ R ( u − ν ) p ε ( u )d u − ˆ u − ν ≥ σ ( u − ν − σ ) p ε ( u )d u − ˆ u − ν ≤− σ ( u − ν + σ ) p ε ( u )d u. Therefore, we have d( ´ R ℓ σ ( u − ν ) p ε ( u )d u )d ν = − ˆ R ( u − ν ) p ε ( u )d u + 2 ˆ + ∞ ν + σ ( u − ν − σ ) p ε ( u )d u + 2 ˆ ν − σ −∞ ( u − ν + σ ) p ε ( u )d u. The zero-mean noise assumption tells us that d( ´ R ℓ σ ( u − ν ) p ε ( u )d u )d ν (cid:12)(cid:12)(cid:12) ν =0 = 2 ˆ + ∞ σ ( u − σ ) p ε ( u )d u + 2 ˆ − σ −∞ ( u + σ ) p ε ( u )d u. If σ ≥ , then we have ˆ + ∞ σ ( u − σ ) p ε ( u )d u + 2 ˆ − σ −∞ ( u + σ ) p ε ( u )d u = ˆ + ∞ σ ( u − σ ) e − ( u + ) d u + 2 ˆ − σ −∞ ( u + σ ) e u + ) d u = e − σ − − e − σ > , where we used the fact that g ( a ) = a − e a is positive for a ∈ (0 , e ) and < e − − σ ≤ e − < e . If < σ < , then we have ˆ + ∞ σ ( u − σ ) p ε ( u )d u + 2 ˆ − σ −∞ ( u + σ ) p ε ( u )d u = ˆ + ∞ σ ( u − σ ) e − ( u + ) d u + ˆ − σ − ( u + σ ) e − ( u + ) d u + 2 ˆ − −∞ ( u + σ ) e u + ) d u = 2 σ + e − σ − − e σ − > , g ( σ ) = 2 σ + e − σ − − e σ − has g (0) = 0 and is strictlyincreasing.Therefore, d( ´ R ℓ σ ( u − ν ) p ε ( u )d u )d ν (cid:12)(cid:12)(cid:12) ν =0 is non-zero for fixed σ > , which implies that is not asolution to the minimization problem (4). This proves the first claim in Example 1.To prove the divergence of f z ,σ to f ⋆ for any fixed σ , first note that, by the convexity of theHuber loss, f σ is the unique minimizer of R σ ( f ) . Since f σ = f ⋆ + c = f ⋆ , we have R σ ( f σ ) = R σ ( f ⋆ ) . If f σ ∈ H and R σ ( f z ,σ ) converges to min f ∈H R σ ( f ) = R σ ( f σ ) with large probability, then thereexists some large enough N such that for any n > N , |R σ ( f z ,σ ) − R σ ( f σ ) | ≤ |R σ ( f σ ) − R σ ( f ⋆ ) | holds with large probability. By the Lipschitz property of the Huber loss, we have k f z ,σ − f ⋆ k ,ρ ≥ σ |R σ ( f z ,σ ) − R σ ( f ⋆ ) |≥ σ (cid:16) |R σ ( f σ ) − R σ ( f ⋆ ) | − |R σ ( f z ,σ ) − R σ ( f σ ) | (cid:17) ≥ σ |R σ ( f σ ) − R σ ( f ⋆ ) | . This proves that f z ,σ does not converge to f ⋆ for any fixed σ. Example 1 tells us that in some scenarios, f z ,σ may not converge to the conditional mean functionwhich one aims to learn even if infinite samples are given and the hypothesis space is perfectly chosen.This is due to the inherent bias brought by the integrated scale parameter in the Huber loss whenpursuing robustness. Continuing our discussion at the beginning of this section, the general answerto Question 1 is that, R σ -risk consistency cannot guarantee its learnability as the gain in robustnessmay entail a bias. That is, in general, the Huber regression scheme (1) may not be mean regressioncalibrated. To address this problem and to learn the conditional mean function f ⋆ through Huberregression, one needs to tune the scale parameter σ to reduce the bias and learn in an adaptive way,as argued in the next section. In this section, we shall show that, in a distribution-free setup, with properly selected scale parameter σ , Huber regression can be asymptotically mean regression calibrated, meaning that risk consistencyimplies the convergence of f z ,σ to the conditional mean function f ⋆ when σ → ∞ . Recall that in the context of regression learning, one of the central concerns is the convergence of thelearned empirical target function to the unknown truth function of interest, that is, the conditionalmean function f ⋆ in this study. While the distance between the empirical target function and f ⋆ is not directly accessible, one settles for bounding the excess generalization error. The underlyingphilosophy is that the generalization error of a learning machine can be approximated by usingits empirical counterpart and the excess generalization error can be bounded via learning theory7rguments. As mentioned in the introduction, the regression estimator is called mean regressioncalibrated if the convergence of the excess generalization error towards implies the convergence ofthe empirical target function to the conditional mean function [20].Translating into the context of Huber regression, one is concerned with whether the convergenceof the excess generalization error R σ ( f z ,σ ) − R σ ( f σ ) towards implies the convergence of f z ,σ to f ⋆ . However, the numerical experiment and the counterexample in the preceding section suggesta negative answer and demonstrate that the desired mean regression calibration property may,in general, not be true. This conflicts with our intuition and common understanding that theHuber loss can serve as a robust alternate of the least squares loss when the scale parameter ischosen sufficiently large. To bypass this problem, in what follows, noticing our interest in learningthe conditional mean function f ⋆ , we turn to investigate the relation between the convergence of R σ ( f z ,σ ) to R σ ( f ⋆ ) and the convergence of f z ,σ to f ⋆ . More specifically, we shall show that undermild conditions, the convergence of R σ ( f z ,σ ) to R σ ( f ⋆ ) does imply the convergence of f z ,σ to f ⋆ when σ → ∞ . This justifies the mean regression calibration property in an asymptotic sense. We now look into the asymptotic mean calibration property by establishing a comparison theorem.For regression estimators that are produced by empirical risk minimization schemes with convexloss functions, some efforts on investigating their mean regression calibration properties have beenmade in the literature; see e.g., [20]. For Huber regression estimators, it was concluded that theyare mean regression calibrated if the response variable is upper bounded or the conditional noisevariables ε | X admit symmetric probability density functions. Recall that one of the most prominentmerits of Huber regression estimators lies in that they can perform mean regression in the absenceof light-noise assumptions. In this sense, boundedness or symmetry constraints on the noise variableshould be considered as stringent ones. In this study, we are seeking to assess the Huber regressionestimator f z ,σ and investigate its mean regression calibration properties without resorting to light-tail distributional assumptions on the conditional distribution or on the noise. To this end, weintroduce the following weak moment condition. Assumption 1.
There exists a constant ǫ > such that E | Y | ǫ < + ∞ . The moment condition in Assumption 1 is rather weak in the sense that it admits the case wherethe response variable Y possesses infinite variance. The same comment condition also applies to thedistributions of the conditional random variable Y | X and the conditional noise variable ε | X underthe additive data generating model, implying that heavy-tailed noise is allowed.As discussed earlier, without further distributional assumptions on the noise variable, f z ,σ is,in general, biased and its population version f σ may be different from f ⋆ almost everywhere on X .However, such a bias can be upper bounded and may decrease with the increase of the σ values.Results in this regard are stated in the following theorem under the above (1 + ǫ ) -moment condition. Theorem 1.
Let σ > max { M, } . Under Assumption 1, there exists an absolute constant c ǫ > independent of σ such that for any measurable function f : X → R with k f k ∞ ≤ M , we have (cid:12)(cid:12)(cid:12) [ R σ ( f ) − R σ ( f ⋆ )] − k f − f ⋆ k ,ρ (cid:12)(cid:12)(cid:12) ≤ c ǫ σ ǫ . (5)Theorem 1 states that for any bounded measurable function f , under the (1 + ǫ ) -moment con-ditions, the gap between R σ ( f ) − R σ ( f ⋆ ) and k f − f ⋆ k ,ρ is up to O ( σ − ǫ ) .8onsequently, with a sufficiently large σ value or sufficiently light-tailed noise, this gap couldbe sufficiently small. As a special case, let us consider the presence of Gaussian or sub-Gaussiannoise where the moment condition holds for arbitrarily large ǫ values. In this scenario, the gapbetween the above two quantities can be arbitrarily small. These findings remind us that in orderto debias the Huber regression estimator, one may relate the σ value to the sample size n . In otherwords, from an asymptotic viewpoint, with diverging σ values, according to Theorem 1, Huberregression is asymptotically mean regression calibrated. Following this spirit, we shall proceed withthe assessment based on diverging σ values by deriving their convergence rates to the conditionalmean function f ⋆ . From our previous discussions, in order to derive convergence rates for f z ,σ , one needs to bound theexcess generalization error R σ ( f z ,σ ) −R σ ( f ⋆ ) which essentially requires us to deal with the followingset of random variables F H := n ξ | ξ ( x, y ) = ℓ σ ( y − f ( x )) − ℓ σ ( y − f ⋆ ( x )) , f ∈ H , ( x, y ) ∈ X × Y o . The existing studies in learning theory remind us that it is crucial to establish the so called
Bernsteincondition , i.e., bounding the second moment of ξ ∈ F H by using its first moment. We will show thatwhile the standard Bernstein condition does not hold, one can relax it to develop fast convergencerates for R σ ( f z ,σ ) − R σ ( f ⋆ ) . Let us start with recapping the Bernstein condition in learning theory. Originally introduced in [2] in the context of empirical risk minimization, the standard Bernsteincondition can be restated as follows: a set F of random variables is said to satisfy the ( β, B ) -Bernstein condition with < β ≤ and B > if for any f ∈ F , it holds that E f ≤ B ( E ( f )) β . Inother words, the second moment of the random variable (and so the variance) can be upper boundedby its first moment. Later, the standard Bernstein condition was generalized and extended intovarious other Bernstein-like conditions for analyzing learning algorithms in different contexts; seee.g., [21, 23, 4]. It turns out that the Bernstein condition and its variants play an important rolein establishing fast convergence rates for learning algorithms of interest because they provide tightupper bounds for the variance of the random variables induced by the resulting estimators.In the context of Huber regression, a Bernstein-like condition is also desired in order to establishfast convergence rates for the excess generalization error R σ ( f z ,σ ) − R σ ( f ⋆ ) . However, as shown inthe preceding section, without further distributional restrictions to the noise variable, f ⋆ may not bethe optimal hypothesis that minimizes R σ ( f ) over the measurable function space M . Consequently, R σ ( f ) − R σ ( f ⋆ ) is not necessarily positive. As a result, the usual Bernstein condition, namely, E ξ ≤ B ( E ξ ) β for ξ ∈ F H with constants B > and < β ≤ , may not hold. This brings barriersto the development of fast convergence rates for R σ ( f ) − R σ ( f ⋆ ) . To circumvent this problem, inour study, we shall establish a relaxed Bernstein condition, which takes the form E ξ ≤ B ( E ξ ) β + g ( σ ) , with B > , < β ≤ , and g a nonnegative function of σ . A motivating observation for establishingsuch a relaxed Bernstein condition is that the gap between E ( ξ ) and k f − f ⋆ k ,ρ can be upperbounded by O ( σ − ǫ ) for any f ∈ H , as stated in Theorem 1.9 .2 A Relaxed Bernstein Condition To establish our relaxed Bernstein condition, we prove the following variance bound for ξ ∈ F H . Theorem 2.
Let Assumption 1 hold and σ > max { M, } . For any measurable function f : X → R with k f k ∞ ≤ M , the random variable ξ ( x, y ) = ℓ σ ( y − f ( x )) − ℓ σ ( y − f ⋆ ( x )) satisfies E ξ ≤ c k f − f ⋆ k ǫ − ǫ +1 ,ρ + c σ − ǫ , where c and c are absolute positive constants independent of σ or f . Recall the results in Theorem 1 that states k f − f ⋆ k ,ρ ≤ E ξ + c ǫ σ − ǫ . This in connection withTheorem 2 immediately yields the following relaxed Bernstein condition: E ξ ≤ c ( E ξ ) ( ǫ − ǫ +1 + c ( c ǫ σ − ǫ ) ( ǫ − ǫ +1 + c σ − ǫ . This relaxed Bernstein condition will be crucial in establishing error bounds and fast exponential-type convergence rates for the Huber regression estimator f z ,σ . In this section, we present fast exponential-type convergence rates for the Huber regression estimatorunder the (1 + ǫ ) -moment condition in Assumption 1. Specifically, we are interested in boundingthe L ρ X -distance between f z ,σ and f ⋆ .To state the result, we introduce the following capacity assumption. For any η > , let N ( H , η ) denote the covering number of H by the balls of radius η in C ( X ) , that is, N ( H , η ) = min k ∈ N : there exist f j ∈ H , j = 1 , . . . , k such that H ⊂ k [ j =1 B ( f j , η ) , where B ( f j , η ) = { f ∈ C ( X ) : k f − f j k ∞ < η } . Our capacity condition is stated as follows.
Assumption 2.
There exist positive constants q and c such that log N ( H , η ) ≤ cη − q , ∀ η > . Generalization error bounds in terms of the covering number argument under Assumption 2 istypical in statistical learning theory; see e.g., [1, 6, 20] and references therein.To state our results on the convergence rates, we introduce a new function f H , which is definedas f H = arg min f ∈H k f − f ⋆ k ,ρ . The function f H is the optimal function in H that one may expect in approximating the truthfunction f ⋆ . The distance k f H − f ⋆ k ,ρ can be regarded as the approximation error when workingwith the hypothesis space H and so corresponds to the bias caused by the choice of the hypothesisspace H . 10 heorem 3. Suppose that Assumptions 1 and 2 hold and let σ > max { M, } . Let f z ,σ be producedby (1) . For any < δ < , with probability at least − δ , it holds that k f z ,σ − f ⋆ k ,ρ . k f H − f ⋆ k ,ρ + log(2 /δ )Ψ( n, ǫ, σ ) , where Ψ( n, ǫ, σ ) := σ ǫ + σn / ( q +1) , if < ǫ ≤ , σ ǫ + (cid:18) σ q + 2 ǫ ǫ n (cid:19) / ( q +1) , if ǫ > . The proof of Theorem 3 is based on a ratio probability inequality and standard learning theoryargument [7, 24, 1, 6, 20], where results established in Theorems 1 and 2 play a crucial role.The error bound in Theorem 3 involves three components: the approximation error due to theimperfect choice of the hypothesis space H , the inherent bias caused by the integrated parameter σ, and the sample error. In practice, the hypothesis space could be chosen by structural riskminimization so that the approximation error decreases to a tolerably small level [9, 24]. The valueof σ affects both the inherent bias and the sample error. The best choice depends on the samplesize, the moment condition, and the capacity of the hypothesis space. To see this, consider a specialcase when f ⋆ ∈ H so that the approximation error k f H − f ⋆ k ,ρ disappears. With properly chosen σ values, we immediately obtain the following convergence rates. Corollary 4.
Under the assumptions of Theorem 3, let f ⋆ ∈ H and σ be chosen as σ = n Φ( ǫ,q ) with Φ( ǫ, q ) = ǫ )(1 + q ) , if < ǫ ≤ , ǫq (1 + ǫ ) + ǫ ( ǫ + 3) , if ǫ > . For any < δ < , with probability at least − δ , we have k f z ,σ − f ⋆ k ,ρ . log(2 /δ ) n − ǫ Φ( ǫ,q ) . According to Corollary 4, with properly chosen diverging σ values, we obtain exponential-typeconvergence rates for f z ,σ . In particular, if the noise variable is bounded or sub-Gaussian and hencethe moment condition in Assumption 1 holds for any ǫ > , one can select an arbitrarily large ǫ to obtain convergence rates of order arbitrarily close to O ( n − q ) . As a comparison, recall that forleast square estimators convergence rates of order O ( n − q/ ) can usually be established; see e.g.,[3, 20, 1] and references therein. Moreover, note that with weaker moment conditions, i.e., smaller ǫ values, one gets slower convergence rates for f z ,σ , indicating increased sacrifice for robustness. Thiscoincides with our intuitive understanding of robust regression estimators. On the other hand, if f ⋆ is smooth enough and one selects a smooth hypothesis space (such as the reproducing kernelHilbert spaces induced by radial basis kernels or neural networks with smooth activate functions),then q → and the difference between Huber regression and least square method could be minimal,indicating less sacrifice necessary for learning smooth functions. Finally, we stress that we obtainexponential convergence rates even for the case when < ǫ < where the distribution of theconditional variable Y | X does not possess finite variance and a least square based estimator cannoteven be defined. This further explains the robustness of Huber regression estimators.11 Proofs of Theorems
Let f : X → R with k f k ∞ ≤ M be a measurable function. For any σ > max { M, } , we denotethe two events I Y and II Y as follows I Y := n y : | y | ≥ σ o , and II Y := n y : | y | < σ o . Noticing that ˆ X ˆ Y [( y − f ( x )) − ( y − f ⋆ ( x )) ]d ρ ( y | x )d ρ X ( x ) = k f − f ⋆ k ,ρ , we have (cid:12)(cid:12)(cid:12) [ R σ ( f ) − R σ ( f ⋆ )] − k f − f ⋆ k ,ρ (cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12) ˆ X ˆ Y [ ℓ σ ( y − f ( x )) − ℓ σ ( y − f ⋆ ( x ))] − [( y − f ( x )) − ( y − f ⋆ ( x )) ]d ρ ( y | x )d ρ X ( x ) (cid:12)(cid:12)(cid:12) ≤ (cid:12)(cid:12)(cid:12) ˆ X ˆ I Y S II Y [ ℓ σ ( y − f ( x )) − ℓ σ ( y − f ⋆ ( x ))] d ρ ( y | x )d ρ X ( x ) (cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12) ˆ X ˆ I Y S II Y [( y − f ( x )) − ( y − f ⋆ ( x )) ]d ρ ( y | x )d ρ X ( x ) (cid:12)(cid:12)(cid:12) . For any ( x, y ) ∈ X × II Y , since σ > max { M, } , we see that | y − f ( x ) | ≤ | y | + k f k ∞ < σ, and | y − f ⋆ ( x ) | ≤ | y | + k f ⋆ k ∞ < σ. Consequently, for any ( x, y ) ∈ X × II Y , we have [ ℓ σ ( y − f ( x )) − ℓ σ ( y − f ⋆ ( x ))] − [( y − f ( x )) − ( y − f ⋆ ( x )) ] = 0 , and hence (cid:12)(cid:12)(cid:12) [ R σ ( f ) − R σ ( f ⋆ )] − k f − f ⋆ k ,ρ (cid:12)(cid:12)(cid:12) ≤ (cid:12)(cid:12)(cid:12) ˆ X ˆ I Y [ ℓ σ ( y − f ( x )) − ℓ σ ( y − f ⋆ ( x ))] d ρ ( y | x )d ρ X ( x ) (cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12) ˆ X ˆ I Y [( y − f ( x )) − ( y − f ⋆ ( x )) ]d ρ ( y | x )d ρ X ( x ) (cid:12)(cid:12)(cid:12) . (6)12ecall that the Huber loss (2) is Lipschitz continuous with Lipschitz constant σ . The first term ofthe right-hand side of equation (6) can be upper bounded as follows (cid:12)(cid:12)(cid:12) ˆ X ˆ I Y [ ℓ σ ( y − f ( x )) − ℓ σ ( y − f ⋆ ( x ))] d ρ ( y | x )d ρ X ( x ) (cid:12)(cid:12)(cid:12) ≤ σ (cid:12)(cid:12)(cid:12) ˆ X ˆ I Y | f ( x ) − f ⋆ ( x ) | d ρ ( y | x )d ρ X ( x ) (cid:12)(cid:12)(cid:12) ≤ σ k f − f ⋆ k ∞ Pr(I Y ) . The quantity
Pr(I Y ) can be bounded by applying Markov’s inequality which yields Pr(I Y ) ≤ ǫ E (cid:0) | Y | ǫ (cid:1) σ ǫ . (7)Therefore, we have ˆ X ˆ I Y [ ℓ σ ( y − f ( x )) − ℓ σ ( y − f ⋆ ( x ))] d ρ ( y | x )d ρ X ( x ) ≤ ǫ k f − f ⋆ k ∞ E (cid:0) | Y | ǫ (cid:1) σ ǫ . (8)The second term in the right-hand side of equation (6) can be upper bounded as follows (cid:12)(cid:12)(cid:12) ˆ X ˆ I Y [( y − f ( x )) − ( y − f ⋆ ( x )) ]d ρ ( y | x )d ρ X ( x ) (cid:12)(cid:12)(cid:12) ≤ k f − f ⋆ k ∞ ˆ X ˆ I Y | y − f ( x ) − f ⋆ ( x ) | d ρ ( y | x )d ρ X ( x ) ≤ k f − f ⋆ k ∞ ˆ X ˆ I Y (2 | y | + k f ⋆ k ∞ + k f k ∞ )d ρ ( y | x )d ρ X ( x ) ≤ k f − f ⋆ k ∞ (cid:18) ˆ I Y | y | d ρ ( y ) + ( k f ⋆ k ∞ + k f k ∞ ) Pr(I Y ) (cid:19) . By applying Hölder inequality and recalling the estimate in (7), we have ˆ I Y | y | d ρ ( y ) ≤ (cid:0) Pr(I Y ) (cid:1) ǫ ǫ (cid:0) E ( | Y | ǫ ) (cid:1) ǫ ≤ ǫ E (cid:0) | Y | ǫ (cid:1) σ ǫ . As a result, we conclude that (cid:12)(cid:12)(cid:12) ˆ X ˆ I Y [( y − f ( x )) − ( y − f ⋆ ( x )) ]d ρ ( y | x )d ρ X ( x ) (cid:12)(cid:12)(cid:12) ≤ ǫ k f − f ⋆ k ∞ E (cid:0) | Y | ǫ (cid:1) σ ǫ + 2 ǫ ( k f k ∞ + k f ⋆ k ∞ ) E (cid:0) | Y | ǫ (cid:1) σ ǫ . (9)From (8) and (9), (cid:12)(cid:12) [ R σ ( f ) − R σ ( f ⋆ )] − k f − f ⋆ k ,ρ (cid:12)(cid:12) can be upper bounded by (2 ǫ + 2 ǫ ) k f − f ⋆ k ∞ E (cid:0) | Y | ǫ (cid:1) σ ǫ + 2 ǫ ( k f k ∞ + k f ⋆ k ∞ ) E (cid:0) | Y | ǫ (cid:1) σ ǫ . Therefore, the desired estimate (5) holds with c ǫ = 2 ǫ ( M + 1) E (cid:0) | Y | ǫ (cid:1) . This completes theproof of Theorem 1. 13 .2 Proof of Theorem 2 Let f : X → R with k f k ∞ ≤ M be a measurable function. For any σ > max { M, } , we againconsider the following two events I Y := n y : | y | ≥ σ o , and II Y := n y : | y | < σ o . Based on the above notation, it is obvious to see the following decomposition E ξ = ˆ X ×Y ( ℓ σ ( y − f ( x )) − ℓ σ ( y − f ⋆ ( x ))) d ρ ( x, y )= ˆ X × I Y ( ℓ σ ( y − f ( x )) − ℓ σ ( y − f ⋆ ( x ))) d ρ ( x, y )+ ˆ X × II Y ( ℓ σ ( y − f ( x )) − ℓ σ ( y − f ⋆ ( x ))) d ρ ( x, y ):= Q + Q . The first term Q can be easily bounded by applying the Lipschitz continuity property of theHuber loss (2) and Markov’s inequality: Q ≤ σ ˆ X × I Y (cid:0) f ( x ) − f ⋆ ( x )) d ρ ( x, y ) ≤ M σ Pr(I Y ) ≤ M E | Y | ǫ σ − ǫ . To bound the second term Q , noticing that for any ( x, y ) ∈ X × II Y , since σ > max { M, } , wehave | y − f ( x ) | ≤ | y | + k f k ∞ < σ, and | y − f ⋆ ( x ) | ≤ | y | + k f ⋆ k ∞ < σ. By the definition of the Huber loss ℓ σ , for any ( x, y ) ∈ X × II Y , we have ℓ σ ( y − f ( x )) − ℓ σ ( y − f ⋆ ( x )) = ( y − f ( x )) − ( y − f ⋆ ( x )) . Therefore, Q = ˆ X × II Y (cid:0) ( y − f ( x )) − ( y − f ⋆ ( x )) (cid:1) d ρ ( x, y )= ˆ X ˆ II Y ( f ( x ) − f ⋆ ( x )) (2 y − f ( x ) − f ⋆ ( x )) d ρ ( y | x )d ρ X ( x ) . If ǫ > , applying Hölder’s inequality, we obtain Q ≤ ˆ X ˆ Y ( f ( x ) − f ⋆ ( x )) (2 y − f ( x ) − f ⋆ ( x )) d ρ ( y | x )d ρ X ( x ) ≤ k f − f ⋆ k ǫ ∞ E (cid:16) | f ( x ) − f ⋆ ( x ) | ǫ − ǫ (2 | y | + 2 M ) (cid:17) ≤ M + 1) ( E | Y | ǫ + M + 1) k f − f ⋆ k ǫ − ǫ ,ρ . < ǫ ≤ , we have the following estimate Q ≤ M ˆ X ˆ II Y ( | y | ǫ | y | − ǫ + M )d ρ ( y | x )d ρ X ( x ) ≤ M ( E | Y | ǫ + M ǫ ) σ − ǫ . Combining the above estimates for Q and Q , we come to the conclusion that E ξ ≤ c k f − f ⋆ k ǫ − ǫ +1 ,ρ + c σ − ǫ , with c = 64( M + 1) ( E | Y | ǫ + M + 1) and c = 48 M ( E | Y | ǫ + M ǫ ) + 16 M E | Y | ǫ . Thiscompletes the proof of Theorem 2. We first prove a ratio inequality in Subsection 6.3.1 which plays an important role in the proofof Theorem 3. The detailed proof will then be given in Subsection 6.3.2. To proceed, for anymeasurable function f : X → R , we denote R σ z ( f ) = 1 n n X i =1 ℓ σ ( y i − f ( x i )) , and recall the notation f H ,σ = arg min f ∈H R σ ( f ) . Let σ > max { M, } . Under Assumptions 1 and 2, for any γ > c ǫ σ ǫ , we have Pr ( sup f ∈H (cid:12)(cid:12) [ R σ ( f ) − R σ ( f ⋆ )] − [ R σ z ( f ) − R σ z ( f ⋆ )] (cid:12)(cid:12)p R σ ( f ) − R σ ( f ⋆ ) + 2 γ > √ γ ) ≤ N (cid:16) H , γ σ (cid:17) e − Θ( n,γ,σ ) , where Θ( n, γ, σ ) = − nγc ′ σ , if < ǫ ≤ , − nγc ′ σ ǫǫ +1 , if ǫ > , with c ′ and c ′ being two positive constants independent of n , γ , or σ that will be explicitly specifiedin the proof. To prove Proposition 5, we need the following Bernstein concentration inequality that is fre-quently employed in the literature of learning theory.
Lemma 6.
Let ξ be a random variable on a probability space Z with variance σ ⋆ satisfying | ξ − E ξ | ≤ M ξ almost surely for some constant M ξ and for all z ∈ Z . Then for all λ > , Pr ( n n X i =1 ξ ( z i ) − E ξ ≥ λ ) ≤ exp ( − nλ σ ⋆ + M ξ λ ) ) . roof of Proposition 5. Recall that F H denotes the following set of random variables F H = n ξ | ξ ( x, y ) = ℓ σ ( y − f ( x )) − ℓ σ ( y − f ⋆ ( x )) , f ∈ H , ( x, y ) ∈ X × Y o . For each ξ ∈ F H , by the fact that the Huber loss (2) is Lipschitz continuous with Lipschitz constant σ , we have k ξ k ∞ ≤ σ k f − f ⋆ k ∞ ≤ M σ and k ξ − E ξ k ∞ ≤ σ k f − f ⋆ k ∞ ≤ M σ.
According to Theorem 2, we know that E ξ ≤ c k f − f ⋆ k ǫ − ǫ +1 ,ρ + c σ − ǫ . By Assumption 2, we know that there exist a finite positive integer J = N ( H , γ σ ) and { f j } Jj =1 ⊂H such that B ( f j , η ) , j = 1 , . . . , J form a γ σ -cover of H . We next show that for each j = 1 , · · · , J ,it holds that Pr ( (cid:12)(cid:12) [ R σ ( f j ) − R σ ( f ⋆ )] − [ R σ z ( f j ) − R σ z ( f ⋆ )] (cid:12)(cid:12)p R σ ( f j ) − R σ ( f ⋆ ) + 2 γ > √ γ ) ≤ e − Θ( n,γ,σ ) (10)for γ > c ǫ σ ǫ . To see this, we apply the Bernstein inequality in Lemma 6 to the following randomvariables ξ j ( x, y ) = ℓ σ ( y − f j ( x )) − ℓ σ ( y − f ⋆ ( x )) , ( x, y ) ∈ X × Y , j = 1 , . . . , J, and obtain Pr (cid:26)(cid:12)(cid:12) [ R σ ( f j ) − R σ ( f ⋆ )] − [ R σ z ( f j ) − R σ z ( f ⋆ )] (cid:12)(cid:12) > √ γ q R σ ( f j ) − R σ ( f ⋆ ) + 2 γ (cid:27) ≤ Pr (cid:8)(cid:12)(cid:12) [ R σ ( f j ) − R σ ( f ⋆ )] − [ R σ z ( f j ) − R σ z ( f ⋆ )] (cid:12)(cid:12) > µ j √ γ (cid:9) ≤ exp − nγµ j (8 M/ c + c ) √ γµ j σ + σ − ǫ + k f j − f ⋆ k ǫ − ǫ +1 ,ρ ! , (11)where µ j := R σ ( f j ) − R σ ( f ⋆ ) + 2 γ . Since γ > c ǫ σ ǫ , by Theorem 1, we have for j = 1 , . . . , J , µ j = R σ ( f j ) − R σ ( f ⋆ ) + 2 γ> R σ ( f j ) − R σ ( f ⋆ ) + c ǫ σ − ǫ + γ ≥ k f j − f ⋆ k ,ρ + γ ≥ γ. (12)We proceed with the proof by considering the two cases when < ǫ ≤ and when ǫ > . If < ǫ ≤ , by the assumption σ > and (12), we have nγµ j (8 M/ c + c ) √ γµ j σ + σ − ǫ + k f j − f ⋆ k ǫ − ǫ +1 ,ρ ! > nγµ j M/ c + c ) (cid:0) √ γµ j σ + σ − ǫ (cid:1) > nγc ′ σ , c ′ = 2(1 + c − ǫ )(8 M/ c + c ) . If ǫ > , note that (12) implies µ j > k f j − f ⋆ k ,ρ and µ j > γ > c ǫ σ − ǫ . We have nγµ j (8 M/ c + c ) √ γµ j σ + σ − ǫ + k f j − f ⋆ k ǫ − ǫ +1 ,ρ ! = nγµ j (8 M/ c + c ) (cid:18) √ γµ j σ + σ − ǫ + k f j − f ⋆ k ǫ − ǫ +1 ,ρ (cid:19) > nγµ j (8 M/ c + c ) (cid:18) √ γµ j σ + σ − ǫ + µ ǫ − ǫ +1 j (cid:19) > nγc ′ σ ǫǫ +1 , where c ′ = (8 M/ c + c ) (cid:18) c − ǫ + c − ǫ +1 ǫ (cid:19) . Combining the above estimates for two cases andrecalling (11), we thus have proved the result in (10).Denote the two events A and, respectively, as A = ( sup f ∈H (cid:12)(cid:12) [ R σ ( f ) − R σ ( f ⋆ )] − [ R σ z ( f ) − R σ z ( f ⋆ )] (cid:12)(cid:12)p R σ ( f ) − R σ ( f ⋆ ) + 2 γ ≤ √ γ ) and B = J \ j =1 ( (cid:12)(cid:12) [ R σ ( f j ) − R σ ( f ⋆ )] − [ R σ z ( f j ) − R σ z ( f ⋆ )] (cid:12)(cid:12)p R σ ( f j ) − R σ ( f ⋆ ) + 2 γ ≤ √ γ ) . We next prove B ⊂ A. To this end, we assume that the event B occurs, that is, for all j = 1 , . . . , J , (cid:12)(cid:12) [ R σ ( f j ) − R σ ( f ⋆ )] − [ R σ z ( f j ) − R σ z ( f ⋆ )] (cid:12)(cid:12) ≤ q γ ( R σ ( f j ) − R σ ( f ⋆ ) + 2 γ ) . (13)Recall B ( f j , η ) , j = 1 , . . . , J, is a γ σ -cover of H . For every f ∈ H , there exists f j ∈ H such that k f − f j k ∞ ≤ γ σ . Since γ > c ǫ σ − ǫ , we have R σ ( f ) − R σ ( f ⋆ ) + 2 γ ≥ k f − f ⋆ k ,ρ − c ǫ σ − ǫ + 2 γ ≥ γ. (14)Therefore, by the Lipschitz continuity of the Huber loss, for each j = 1 , . . . , J , we have |R σ ( f ) − R σ ( f j ) | ≤ σ k f − f j k ∞ ≤ γ ≤ p γ ( R σ ( f ) − R σ ( f ⋆ ) + 2 γ ) , (15)and |R σ z ( f ) − R σ z ( f j ) | ≤ σ k f − f j k ∞ ≤ γ p γ ( R σ ( f ) − R σ ( f ⋆ ) + 2 γ ) . (16)By (15) and (14), we also have R σ ( f j ) − R σ ( f ⋆ ) + 2 γ = (cid:16) R σ ( f j ) − R σ ( f ) (cid:17) + R σ ( f ) − R σ ( f ⋆ ) + 2 γ ≤ p γ ( R σ ( f ) − R σ ( f ⋆ ) + 2 γ ) + R σ ( f ) − R σ ( f ⋆ ) + 2 γ ≤ R σ ( f ) − R σ ( f ⋆ ) + 2 γ ) . (17)17ombining the estimates (15), (16), (17) with the assumption (13), we obtain (cid:12)(cid:12) [ R σ ( f ) − R σ ( f ⋆ )] − [ R σ z ( f ) − R σ z ( f ⋆ )] (cid:12)(cid:12) ≤ |R σ ( f ) − R σ ( f j ) | + (cid:12)(cid:12) R σ ( f j ) − R σ ( f ⋆ )] − [ R σ z ( f j ) − R σ z ( f ⋆ )] (cid:12)(cid:12) + |R σ z ( f j ) − R σ z ( f ) |≤ p γ ( R σ ( f ) − R σ ( f ⋆ ) + 2 γ ) + q γ ( R σ ( f j ) − R σ ( f ⋆ ) + 2 γ ) ≤ p γ ( R σ ( f ) − R σ ( f ⋆ ) + 2 γ ) . (18)Since (18) holds for every f ∈ H , we have proved B ⊂ A or equivalently A c ⊂ B c . This togetherwith (10) leads to Pr ( sup f ∈H (cid:12)(cid:12) [ R σ ( f ) − R σ ( f ⋆ )] − [ R σ z ( f ) − R σ z ( f ⋆ )] (cid:12)(cid:12)p R σ ( f ) − R σ ( f ⋆ ) + 2 γ > √ γ ) = Pr( A c ) ≤ Pr( B c ) ≤ J X j =1 Pr ( (cid:12)(cid:12) [ R σ ( f j ) − R σ ( f ⋆ )] − [ R σ z ( f j ) − R σ z ( f ⋆ )] (cid:12)(cid:12)p R σ ( f j ) − R σ ( f ⋆ ) + 2 γ > √ γ ) ≤ N (cid:0) H , γσ − / (cid:1) e − Θ( n,γ,σ ) . This completes the proof of Proposition 5.
We first prove that for any < δ < , with probability at least − δ/ , we have [ R σ ( f z ,σ ) − R σ ( f ⋆ )] − [ R σ z ( f z ,σ ) − R σ z ( f ⋆ )] −
12 [ R σ ( f z ,σ ) − R σ ( f ⋆ )] ≤ γ , where γ is given by γ := O (cid:16) σ ǫ + log (cid:0) δ (cid:1) σn / ( q +1) (cid:17) , if < ǫ ≤ , O σ ǫ + log (cid:0) δ (cid:1) (cid:18) σ q + 2 ǫ ǫ n (cid:19) / ( q +1) ! , if ǫ > . (19)Note that Proposition 5 implies that, for any γ > c ǫ σ − ǫ , sup f ∈H (cid:12)(cid:12) [ R σ ( f ) − R σ ( f ⋆ )] − [ R σ z ( f ) − R σ z ( f ⋆ )] (cid:12)(cid:12)p R σ ( f ) − R σ ( f ⋆ ) + 2 γ < √ γ (20)holds with probability at least − N (cid:16) H , γ σ (cid:17) e − Θ( n,γ,σ ) . We know from Assumption 2 that N (cid:16) H , γ σ (cid:17) . exp { q cσ q /γ q } . − exp { q cσ q γ − q − Θ( n, γ, σ ) } . Forany < δ < , let exp { q cσ q γ − q − Θ( n, γ, σ ) } = δ/ , or equivalently q cσ q γ − q − Θ( n, γ, σ ) = log( δ/ . The equation has a unique positive solution γ ∗ satisfying γ ⋆ . log (cid:0) δ (cid:1) (cid:16) σn / ( q +1) (cid:17) , if ǫ ≤ , log (cid:0) δ (cid:1) (cid:18) σ q + 2 ǫ ǫ n (cid:19) / ( q +1) , if ǫ > . Choose γ = c ǫ σ − ǫ + γ ⋆ . Then γ satisfies the condition (19) and for any < δ < , with probabilityat least − δ/ , it holds that sup f ∈H (cid:12)(cid:12) [ R σ ( f ) − R σ ( f ⋆ )] − [ R σ z ( f ) − R σ z ( f ⋆ )] (cid:12)(cid:12)p R σ ( f ) − R σ ( f ⋆ ) + 2 γ ≤ √ γ , which immediately yields [ R σ ( f z ,σ ) − R σ ( f ⋆ )] − [ R σ z ( f z ,σ ) − R σ z ( f ⋆ )] ≤ √ γ q R σ ( f z ,σ ) − R σ ( f ⋆ ) + 2 γ ≤ (cid:16) R σ ( f z ,σ ) − R σ ( f ⋆ ) (cid:17) + 9 γ . (21)By a similar procedure we can prove that for any < δ < , with probability at least − δ/ , itholds that [ R σ z ( f H ,σ ) − R σ z ( f ⋆ )] − [ R σ ( f H ,σ ) − R σ ( f ⋆ )] ≤ (cid:16) R σ ( f H ,σ ) − R σ ( f ⋆ ) (cid:17) + 9 γ . This in connection with the fact that R σ ( f H ,σ ) ≤ R σ ( f H ) and Theorem 1 implies that for any < δ < , with probability at least − δ/ , we have [ R σ z ( f H ,σ ) − R σ z ( f ⋆ )] − [ R σ ( f H ,σ ) − R σ ( f ⋆ )] ≤ k f H − f ⋆ k ,ρ + 10 γ . (22)Combining the two estimates in (21) and (22), we come to the conclusion that for any < δ < ,with probability at least − δ , it holds that [ R σ ( f z ,σ ) − R σ ( f H ,σ )] − [ R σ z ( f z ,σ ) − R σ z ( f H ,σ )] ≤ (cid:2) R σ ( f z ,σ ) − R σ ( f ⋆ ) (cid:3) + 12 k f H − f ⋆ k ,ρ + 19 γ . (23)On the other hand, from the definitions of f H ,σ , f H , and f z ,σ , we have R σ ( f z ,σ ) − R σ ( f ⋆ )=[ R σ ( f z ,σ ) − R σ ( f H ,σ )] + [ R σ ( f H ,σ ) − R σ ( f ⋆ )] ≤ [ R σ ( f z ,σ ) − R σ ( f H ,σ )] − [ R σ z ( f z ,σ ) − R σ z ( f H ,σ )] + R σ ( f H ) − R σ ( f ⋆ ) ≤ [ R σ ( f z ,σ ) − R σ ( f H ,σ )] − [ R σ z ( f z ,σ ) − R σ z ( f H ,σ )] + k f H − f ⋆ k ,ρ + c ǫ σ − ǫ , R σ z ( f z ,σ ) ≤ R σ z ( f H ,σ ) , and R σ ( f H ,σ ) ≤ R σ ( f H ) . By (23), we know that for any < δ < , with probability at least − δ , it holds that R σ ( f z ,σ ) − R σ ( f ⋆ ) ≤ (cid:16) R σ ( f z ,σ ) − R σ ( f ⋆ ) (cid:17) + 32 k f H − f ∗ k ,ρ + 20 γ , which implies that R σ ( f z ,σ ) − R σ ( f ⋆ ) ≤ k f H − f ∗ k ,ρ + 40 γ holds with probability at least − δ . By Theorem 1 again, we conclude that for any < δ < ,with probability at least − δ , it holds that k f z ,σ − f ⋆ k ,ρ . k f H − f ⋆ k ,ρ + γ . Recalling the definition of Ψ and noticing that γ . log( δ/ , we complete the proof of Theorem3. In this paper, we studied the Huber regression problem by investigating the empirical risk mini-mization scheme induced by the Huber loss. In a statistical learning setup, our study answeredthe four fundamental questions raised in the introduction: the R σ -risk consistency is insufficient inensuring their convergence to the mean regression function; the scale parameter σ plays a trade-off role in bias and learnability; fast exponential-type convergence rates can be established under (1 + ǫ ) -moment conditions ( ǫ > ) by relaxing the standard Bernstein condition and allowing someadditional small bias term; the merit of Huber regression in terms of the robustness can be reflectedby its learnability under the (1 + ǫ ) -moment conditions which are considered to be weak conditionsin that heavy-tailed noise can be accommodated in regression problems. Moreover, it was shownthat with higher moment conditions being imposed, one can obtain faster convergence rates. Inthe above senses, we conducted a complete and systematic statistical learning assessment of Huberregression estimators.We remark that in the present study a general hypothesis space H is considered. In practice,the implementation of learning with Huber regression requires to specify a particular hypothesisspace. It can be a reproducing kernel Hilbert space, a neural network, or other families of functions.Functions in such a hypothesis space are generally not uniformly bounded. Regularization couldbe used to restrict the searching region of the Huber regression scheme and consequently controlsthe capacity of the working hypothesis space. The techniques developed in this study may still beapplicable to assessing the regularized Huber regression schemes. Additionally, the development ofthese techniques for assessing Huber regression estimators may also shed light on the analysis ofother robust regression schemes. Acknowledgement
This work was partially supported by the Simons Foundation Collaboration Grant eferences [1] M. Anthony and P. L. Bartlett.
Neural Network Learning: Theoretical Foundations . CambridgeUniversity Press, 2009.[2] P. L. Bartlett and S. Mendelson. Empirical minimization.
Probability Theory and RelatedFields , 135(3):311–334, 2006.[3] F. Bauer, S. Pereverzev, and L. Rosasco. On regularization algorithms in learning theory.
Journal of Complexity , 23(1):52–72, 2007.[4] G. Chinot, G. Lecué, and M. Lerasle. Robust statistical learning with Lipschitz and convexloss functions.
Probability Theory and Related Fields , pages 1–44, 2019.[5] A. Christmann and I. Steinwart. Consistency and robustness of kernel-based regression inconvex risk minimization.
Bernoulli , 13(3):799–819, 2007.[6] F. Cucker and D. X. Zhou.
Learning Theory: an Approximation Theory Viewpoint , volume 24.Cambridge University Press, 2007.[7] L. Devroye, L. Györfi, and G. Lugosi.
A Probabilistic Theory of Pattern Recognition , volume 31.Springer Science & Business Media, 2013.[8] J. Fan, Y. Guo, and B. Jiang. Adaptive huber regression on markov-dependent data.
StochasticProcesses and their Applications , 2019.[9] J. Friedman, T. Hastie, and R. Tibshirani.
The Elements of Statistical Learning . SpringerSeries in Statistics New York, 2001.[10] F. R. Hampel, E. M. Ronchetti, P. J. Rousseeuw, and W. A. Stahel.
Robust Statistics: theApproach Based on Influence Functions , volume 196. John Wiley & Sons, 2011.[11] X. He and Q.-M. Shao. A general Bahadur representation of M-estimators and its application tolinear regression with nonstochastic designs.
The Annals of Statistics , 24(6):2608–2630, 1996.[12] P. J. Huber. Robust estimation of a location parameter.
The Annals of Mathematical Statistics ,35(1):73–101, 1964.[13] P. J. Huber. Robust regression: asymptotics, conjectures and Monte Carlo.
The Annals ofStatistics , 1(5):799–821, 1973.[14] P. J. Huber and E. Ronchetti.
Robust Statistics . Wiley, 2009.[15] P.-L. Loh. Statistical consistency and asymptotic normality for high-dimensional robust M -estimators. The Annals of Statistics , 45(2):866–896, 2017.[16] R. Maronna, D. Martin, and V. Yohai.
Robust Statistics: Theory and Methods . John Wiley &Sons, Chichester, 2006.[17] S. Portnoy. Asymptotic behavior of M -estimators of p regression parameters when p /n islarge. I. Consistency. The Annals of Statistics , 12(4):1298–1309, 1984.[18] L. Rosasco, E. D. Vito, A. Caponnetto, M. Piana, and A. Verri. Are loss functions all thesame?
Neural Computation , 16(5):1063–1076, 2004.2119] P. J. Rousseeuw and A. M. Leroy.
Robust Regression and Outlier Detection , volume 589. JohnWiley & Sons, 2005.[20] I. Steinwart and A. Christmann.
Support Vector Machines . Springer, New York, 2008.[21] I. Steinwart and C. Scovel. Fast rates for support vector machines using gaussian kernels.
TheAnnals of Statistics , 35(2):575–607, 2007.[22] Q. Sun, W.-X. Zhou, and J. Fan. Adaptive Huber regression.
Journal of the American StatisticalAssociation , 115(529):254–265, 2019.[23] T. Van Erven, P. D. Grünwald, N. A. Mehta, M. D. Reid, and R. C. Williamson. Fast rates instatistical and online learning.
The Journal of Machine Learning Research , 16(1):1793–1861,2015.[24] V. Vapnik.