A Framework of Learning Through Empirical Gain Maximization
aa r X i v : . [ c s . L G ] S e p A Framework of Learning through Empirical Gain Maximization
Yunlong Feng and Qiang Wu Department of Mathematics and Statistics, University at Albany Department of Mathematical Sciences, Middle Tennessee State University
Abstract
We develop in this paper a framework of empirical gain maximization (EGM) to address therobust regression problem where heavy-tailed noise or outliers may present in the response vari-able. The idea of EGM is to approximate the density function of the noise distribution insteadof approximating the truth function directly as usual. Unlike the classical maximum likelihoodestimation that encourages equal importance of all observations and could be problematic inthe presence of abnormal observations, EGM schemes can be interpreted from a minimum dis-tance estimation viewpoint and allow the ignorance of those observations. Furthermore, it isshown that several well-known robust nonconvex regression paradigms, such as Tukey regressionand truncated least square regression, can be reformulated into this new framework. We thendevelop a learning theory for EGM, by means of which a unified analysis can be conductedfor these well-established but not fully-understood regression approaches. Resulting from thenew framework, a novel interpretation of existing bounded nonconvex loss functions can beconcluded. Within this new framework, the two seemingly irrelevant terminologies, the well-known Tukey’s biweight loss for robust regression and the triweight kernel for nonparametricsmoothing, are closely related. More precisely, it is shown that the Tukey’s biweight loss canbe derived from the triweight kernel. Similarly, other frequently employed bounded nonconvexloss functions in machine learning such as the truncated square loss, the Geman-McClure loss,and the exponential squared loss can also be reformulated from certain smoothing kernels instatistics. In addition, the new framework enables us to devise new bounded nonconvex lossfunctions for robust learning.
In this paper, we are concerned with robust regression problems where conditional distributions mayhave heavier-than-Gaussian tails or be contaminated by outliers. In machine learning and statistics,regression procedures are typically carried out through Empirical Risk Minimization (ERM) or itsvariants. Denote X as the input variable taking values in X ⊂ R d and Y the output variable takingvalues in R . Given n i.i.d observations z = { ( x i , y i ) } ni =1 and choosing a hypothesis space H , ERMcan be formulated as f z = arg min f ∈H n n X i =1 ℓ ( y i − f ( x i )) , (1)1here ℓ : R → R + is a loss function that measures the point-wise goodness-of-fit when using f ( x ) topredict y . Several frequently employed loss functions for regression with continuous output includethe square loss, the absolute deviation loss, as well as the check loss. The resulting ERM schemeslearn, respectively, the conditional mean, the conditional median, and the conditional quantilefunction. Under the additive noise model Y = f ⋆ ( X ) + ε, (2)where ε denotes the noise variable, these ERM schemes can also be deduced from maximum likeli-hood estimation (MLE) by assuming a Gaussian, Laplace, or asymmetry Laplace prior distributionof the noise variable, respectively. In this sense, these ERM schemes are essentially MLE-based,though in practice the prior distributional assumptions may not be imposed explicitly or may evenbe abandoned. To deal with robust regression problems, various M-estimators such as the Tukeyregression estimator and the truncated least square estimator, which can be viewed as generalizedmaximum likelihood estimators, are proposed in robust statistics. Carrying over from robust esti-mation of location parameters in parametric statistics, the idea of M-estimation is to implementMLE based on heavy-tailed noise distributions. In the machine learning context, the nonparametriccounterparts of these M-estimators are also frequently used for robust prediction, especially in thearea of computer vision.While maximum likelihood estimates are efficient and often lead to effective regression estimators,their disadvantages are also obvious. To explain, let us consider again the n -size sample z generatedby the regression model (2). Denoting p ε | X as the density of the noise variable ε conditioned on X , the likelihood of drawing the sample z given the parameter “ f ” from the “parameter space” H is L ( z ; f ) = Q ni =1 p ε | xi ( y i − f ( x i )) . In MLE, one aims at seeking the optimal “parameter” f z in H such that this likelihood is maximized, i.e., f z := arg max f ∈H n Y i =1 p ε | xi ( y i − f ( x i )) . (3)In other words, the purpose of searching for the optimal “parameter” f z is achieved by maximizingthe product of the likelihood functions of all the observations. Taking a logarithm operation tothe product leads to the log-likelihood log L ( z ; f ) and further brings us ERM schemes induced bythe square loss and the least absolute deviation loss when the noise is assumed to be Gaussian andLaplace, respectively. MLE is known to be (asymptotically) more efficient than other estimatorswhen the noise distribution is correctly specified. However, in the machine learning context, thepremise that the noise distribution is known a priori is seldom the case and more often than not,is not realistic. On the other hand, one of the shortcomings of MLE lies in that it is sensitive toabnormal observations. 
This could be understood intuitively from the formula (3), where the maxi-mization of the product obviously encourages equal importance of all the observations. Specifically,the maximization does not allow the zeroness of the likelihood of any observation no matter whetherthe observation is abnormal or not. This could be problematic in practice where the acquired dataare frequently contaminated. To address the non-robustness problem of MLE based methodologies,tremendous efforts have been made in the literature of statistics, machine learning, as well as otherdata science areas. 2 .1 Problem Formulation Our study in this paper starts from a family of robust learning approaches of the form f z = arg max f ∈H n n X i =1 p ε | xi ( y i − f ( x i )) , (4)where H is the hypothesis space and p ε | X the density of the noise ε | X . Since p ε | X is unknownin practice, we consider its surrogate p σ by ignoring the dependence of ε on X and the normingconstant. Then, the scheme can be generalized to the form f z ,σ = arg max f ∈H n n X i =1 p σ ( y i − f ( x i )) , (5)where we call p σ : R → R + a gain function if it is unimodal, attains its peak value at , and isintegrable. Correspondingly, the scheme (5) is termed as Empirical Gain Maximization (EGM) , asystematic investigation of which will be the main focus of this paper.
Our development of the EGM framework and the introduction of gain function are inspired by thefollowing motivating scenarios that find tremendous real-world applications in robust estimationacross numerous data science fields.
Motivating Scenario I: Tukey Regression.
The well-known Tukey regression based on Tukey’sbiweight loss, proposed in [3], can be equivalently formulated as an EGM scheme (5) associatedwith the gain function p σ ( t ) = (cid:18) − t σ (cid:19) I {| t |≤ σ } , which we call triweight gain function. Tukey regression has found numerous applications in agreat variety of data science areas where robustness is a concern; see e.g., [26, 42, 7, 52, 6, 4,8, 14]. Motivating Scenario II: Truncated Least Square.
The truncated least square regression canbe equivalently formulated as an EGM scheme associated with the
Epanechnikov gain function p σ ( t ) = (cid:18) − t σ (cid:19) I {| t |≤ σ } . The related studies and applications of truncated least squares regression, to name a few, canbe found in [29, 59, 33, 36, 39]. 3 otivating Scenario III: Geman-McClure Regression.
Introduced in [25] for image analysis,the Geman-McClure regression can be formulated as an EGM scheme with the
Cauchy gain function p σ ( t ) = σ σ + t . It has been applied extensively especially to the area of robust computer vision, see e.g.,[5, 16, 43, 57, 58, 2, 47, 9, 34].
Motivating Scenario IV: Maximum Correntropy.
The maximum correntropy criterion [40,44], motivated by maximizing the information gain measured by correntropy between theinput variable X and output variable Y , is equivalent to an EGM scheme with the Gaussiangain function p σ ( t ) = exp (cid:18) − t σ (cid:19) . This gain function was also proposed as a goodness-of-fit measurement in various contextsof robust estimation under different terminologies such as the Welsch’s loss [17], the invertedGaussian loss [35], the exponential squared loss [54], and the reflected normal loss [49]. Itstheoretical properties were recently investigated in [24, 23, 21, 20] from a statistical learningviewpoint.Further details of the above gain functions and their correspondences to bounded nonconvexloss functions will be discussed in Section 3 below. What are common behind the above four robustlearning schemes are that (1) all of them can be naturally formulated into the EGM framework (5)with the associated gain function either the kernel of a common probability distribution or a commonsmoothing kernel; (2) they are all introduced in order to pursue robustness in estimation procedureswith a scale parameter σ controlling the robustness; (3) they all have extremely wide applicationsin robust estimation problems from science to technology. However, in a distribution-free setup,their learning performance, especially the relationship between their learnability, robustness, andthe scale parameter, has not been well assessed. In this sense, they are well-established learningschemes but not fully-understood ones; and (4) they were usually treated as M-estimators andinterpreted from an ERM viewpoint or further traced to MLE. However, such interpretations, onone hand, cannot reveal the working mechanism of the resulting estimators and so cannot fullyexplain their robustness, on the other hand, cannot help discover new learning schemes of the samekind. By introducing gain functions and developing an EGM framework, the objectives of this study arenot solely to develop new robust learning schemes but rather to (1) present a unified analysis ofthe several well-known robust regression estimators and study their performance from a learningtheory viewpoint; (2) pursue novel insights and interpretations of these estimators, which could helpexplain their robustness merits; (3) develop a learning theory framework as well as novel machineries4or analyzing robust regression schemes that fall into the same vein; and (4) devise and explore morenew robust regression estimators that are of the same kind. It turns out that our newly developedframework can fulfill these objectives effectively. Our main contributions are summarized as follows: • Motivated by several widely used robust regression schemes, we introduce gain function as analternate measurement of goodness-of-fit in regression and develop an EGM framework. It isshown that maximizing empirical gain in regression can be viewed as maximizing the summa-tion of the likelihood functions. Unlike the product of likelihood functions in MLE, summationdoes not encourage the equal importance of all observations but allows even the zeroness ofthe likelihoods of abnormal ones. This observation may better explain the robustness of theseregression schemes. 
• Interestingly, the newly developed framework subsumes a bunch of well-known robust regres-sion schemes, especially those listed in the motivating scenarios, such as the celebrated Tukeyregression, the classic truncated least square regression, and the widely employed Geman-McClure regression in computer vision. We stress here that under the new framework, Tukeyregression estimators can be obtained via EGM induced by triweight gain function, whichcomes from the triweight kernel; truncated least square estimators can be deduced from EGMassociated with Epanechnikov gain function, which comes from the Epanechnikov kernel; andGeman-McClure regression estimators can be derived from EGM with Cauchy gain function,which results from the Cauchy kernel. Such findings provide us an alternative understandingof these robust regression schemes. That is, these M-estimators may be interpreted morenaturally as minimum distance estimators. • We conduct a unified learning theory analysis of EGM schemes. In the learning theory lit-erature, convex regression schemes have been extensively studied. However, studies and as-sessments of nonconvex ones, such as Tukey regression, truncated least square regression, andGeman-McClure regression, are still sparse though they have been extensively applied. Ourstudy provides a unified learning theory assessment of these robust regression schemes. Morespecifically, we consider two different setups, namely, learning without noise misspecificationand distribution-free learning. We show that when learning without noise misspecification,EGM estimators are regression calibrated while in a distribution-free setup, under certainconditions, EGM estimators can learn the underlying truth function under weak momentconditions. • The present study brings us novel insights into existing bounded nonconvex loss functions.Furthermore, the correspondence between bounded nonconvex losses and gain functions allowus to introduce more new bounded nonconvex losses that enjoy similar properties. For instance,a generalized Tukey’s loss may be formulated to cater to further needs in robust regression aswell as classification problems.
The roadmap of this paper is as follows. In Section 2, we provide some first look at EGM by com-paring it with ERM and interpreting it from a minimum distance estimation viewpoint. Section 35xposes the way of defining gain functions and formulating EGM. Gain functions are exampledand categorized in this section. In particular, the correspondence between some representativebounded nonconvex losses and gain functions is also illustrated here. In Section 4, we look intoEGM schemes by investigating fundamental questions in EGM based learning and conducting aunified learning theory analysis. We assess the learning performance of EGM estimators in differentscenarios. Specifically, as an instantiation, we show that one can directly apply the developed the-ory to those well-established robust learning schemes mentioned in the motivating scenarios. Somefurther insights and perspectives are provided in Section 5. The paper is concluded in Section 6.Intermediate lemmas and proofs of the theorems are provided in the appendix.Throughout this paper, k · k ,ρ denotes the L -norm induced by the marginal distribution ρ X . C ( X ) denote the space of bounded continuous functions on X . p Y | X stands for the conditionaldensity of Y conditioned on X and p Y | X = x , or p Y | x for brevity, represents the conditional density of Y conditioned on X = x . For a set S (or an event), I S represents an indicator function that takesthe value on S (or when S occurs) and otherwise. We also denote a . b if a ≤ cb for someabsolute constant c > . In this section, at an intuitive level, we present a first look at EGM by comparing it with theclassical ERM. To this end, we first investigate situations where ERM based regression schemes failand then look into EGM by interpreting it from a minimum distance estimation (MDE) as well asa maximum likelihood estimation (MLE) viewpoint.
As stated earlier, ERM based regression schemes can be traced to the MLE framework where onemaximizes the product of the likelihoods of all the observations. That is, f z = arg max f ∈H n n Y i =1 p ε | xi ( y i − f ( x i )) . (6)In (6), for any f ∈ H , maximizing the product of the likelihood functions encourages the equalimportance of all the residuals y i − f ( x i ) , i = 1 , · · · , n , and does not tolerate the zeroness ofany likelihood no matter whether the residual is caused by an abnormal observation or not. Incontrast, in EGM regression schemes, one maximizes the summation of the likelihood functions ofthe residuals. That is, f z = arg max f ∈H n n X i =1 p ε | xi ( y i − f ( x i )) . (7)Intuitively, maximizing the summation instead of the product allows small values or even zeronessof likelihood of certain residuals and so may significantly reduce the impact of those abnormal ones.Therefore, EGM may outperform ERM in terms of robustness to abnormal observations.6 .2 EGM: When MLE Meets MDE To proceed with our comparison, for any measurable function f : X → R , we denote E f as therandom variable defined by the residual between Y and f ( X ) , i.e., E f = Y − f ( X ) . Then, for anyfixed f and any realization of X , say x , the density function of E f | x can be obtained by translatingthat of ε | x horizontally f ⋆ ( x ) − f ( x ) units. Consequently, the density of E f | x can be expressed as p Ef | x ( t ) = p ε | x ( t + f ( x ) − f ⋆ ( x )) . Similarly, we also have p ε | x ( t ) = p Ef | x ( t + f ⋆ ( x ) − f ( x )) . Moreover, reminded by [19], we know that p Ef ( t ) = ˆ X p ε | x ( t + f ( x ) − f ⋆ ( x ))d ρ X ( x ) defines the density of the random variable E f , and p ε ( t ) = ˆ X p Ef | x ( t + f ⋆ ( x ) − f ( x ))d ρ X ( x ) defines a density of the random variable ε .Continuing our discussion in the introduction, the population version of the M-estimators result-ing from the ERM scheme (1) and so the MLE (3) can be equivalently cast as c f H = arg max f ∈H E log p ε | X ( Y − f ( X )) . Simple computations show that one also has c f H = arg min f ∈H KL (cid:16) p ε | X ( Y − f ( X )) , p Ef | X ( Y − f ( X )) (cid:17) , where KL ( p , p ) denotes the KL-divergence between the two distributions p and p and measurestheir discrepancy.On the other hand, recall that the EGM estimator (5) originates from (4), which can be treatedas an M-estimator. Therefore, the EGM scheme can still be interpreted from an MLE viewpoint.Meanwhile, note that the population counterpart of (4) is f H = arg max f ∈H E p ε | X ( Y − f ( X )) . For any measurable functions f , we define the integrated squared density-based distance, dist ( p Ef , p ε ) ,between p Ef and p ε as dist ( p Ef , p ε ) := s ˆ X ˆ + ∞−∞ ( p Ef | x ( t ) − p ε | x ( t )) d t d ρ X ( x ) .
7t is easy to see that the above distance between p Ef and p ε defines a metric between p Ef and p ε .In particular, if f equals f ⋆ on X , then we have dist ( p Ef , p ε ) = 0 . Notice that the data-generatingmodel Y = f ⋆ ( X ) + ε defines a location family, which implies that f H = arg min f ∈H dist ( p Ef , p ε ) . Therefore, EGM can be interpreted both from an MLE viewpoint and from a minimum distanceestimation viewpoint. To be brief, when MLE meets MDE, EGM estimators come into sight.Because of such dual interpretations, they may possess built-in merits such as robustness deliveredby MDE and fast convergence rates inherited from MLE, as we shall explore later.
In this section, we first detail the reasoning process that leads to EGM (5) and also the definitionof gain functions. More gain functions will be exampled and categorized and their correspondenceswith bounded nonconvex losses will also be discussed.
While the initial scheme (4) seems to be promising due to its merits in terms of robustness. However,one may encounter barriers in its implementation, the primary one of which is tractability. This isbecause in practice p ε is unknown in advance. Therefore, further efforts are needed to address thisproblem and make it practically applicable. One way of tackling this problem, inspired by existingstudies in learning theory, is to find a tractable relaxation, which is the gain function p σ defined inSection 1.1. Listed below are several possible strategies that may be adopted for this purpose. Assuming the Density.
Recall that when MLE is used to estimate the conditional mean, theGaussianity of the noise is usually assumed which leads to the least squares regression. IfLaplace noise is assumed, then one comes to the least absolute deviation estimation whichapproaches the conditional median in regression. When moving attention to EGM, followingthe same way, one may also assume that the noise variable ε obeys a certain law such asthe Gaussian or Laplace. In particular, with Laplace noise, we then have p σ ( y i − f ( x i )) =exp ( −| y i − f ( x i ) | /σ ) . Under mild conditions, it can be shown that the resulting EGM estima-tor approaches the conditional median or the conditional mode function with properly chosen σ values. Similarly, with Gaussian noise, then one gets p σ ( y i − f ( x i )) = exp (cid:0) − ( y i − f ( x i )) / σ (cid:1) and the resulting EGM scheme can be equivalently formulated as the maximum correntropycriterion stated in Motivating Scenario IV.We arrive at the EGM scheme (5) from the scheme (4) by choosing p σ which is an assumeddensity and serves as a surrogate of the true noise density. However, we remark that, similarto MLE, EGM may still be practically effective even if the assumed noise density deviatesfrom the ground truth, as shown in Section 4 below.8 earning the Density. Noticing that in (4), for any fixed f in H , we maximize the summationof the values of the unknown density p ε at n points y i − f ( x i ) , i = 1 , · · · , n . When a learningmachine f is utilized, the observations y i − f ( x i ) , i = 1 , · · · , n , can be treated as realizationsof the unknown noise variable defined by the residual Y − f ⋆ ( X ) . Therefore, for each i ,one could estimate the point-wise density p ε ( y i − f ( x i )) by using observations y j − f ( x j ) , j = 1 , · · · , i − , i + 1 , · · · , n , by means of the Parzen window density estimator. Explicitly,let K σ be a smoothing kernel with the bandwidth parameter σ , under mild conditions, onehas the following estimate of p ε that serves as its relaxation p σ ( y i − f ( x i )) = 1 n − n X j =1 j = i K σ ( y i − f ( x i ) − y j + f ( x j )) . The resulting EGM scheme can be formulated as f z ,σ = arg max f ∈H n ( n − n X i =1 n X j =1 j = i K σ ( y i − f ( x i ) − y j + f ( x j )) . Canonical smoothing kernels include Gaussian kernel, Laplace kernel, Epanechnikov kernel,uniform kernel, triangle kernel, etc. Interestingly, the above learning scheme is essentiallyequivalent to the one induced by the minimum error entropy algorithm [18, 44] that wasrecently investigated from a learning theory viewpoint in [30, 19, 31, 32, 27].
Approximating the Density.
Another way of dealing with the unknown density p ε is that onemay directly approximate this density function. To this end, we recall that ε := Y − f ⋆ ( X ) is aone-dimensional random variable. As a mild assumption, one may assume that p ε is continuouson R . From the knowledge of approximation theory, we know that one can approximate thisone-dimensional continuous function p ε arbitrarily well by using certain basis functions on R .As an example, one may use the convex combination of the one-dimensional Gaussian kernel,which leads to p σ ( y i − f ( x i )) = K X j =1 w j exp − ( y i − f ( x i )) σ j ! , where the coefficients w j are positive constants such that P Kj =1 w j = 1 , K ≥ a positiveinteger, and σ j > for j = 1 , · · · , K . Then, the resulting EGM is f z ,σ = arg max f ∈H n n X i =1 K X j =1 w j exp − ( y i − f ( x i )) σ j ! . In particular, if K = 1 , it reduces Gaussian EGM. We note that similar ideas have beeninvestigated for robust learning, see e.g., [10].In addition to the above-mentioned approaches to finding tractable relaxations of p ε , one mayalso use smoothing kernels from statistics since each smoothing kernel defines a density. By stretch-ing or compressing a smoothing kernel vertically or horizontally, one may approximate the unknowndensity p ε . For illustration, we will provide more examples in the next subsection. It would be in-teresting to explore further techniques for finding such a relaxation.9 .2 Gain Function: Formal Definition and More Examples With the preparations above, we are now ready to introduce a formal definition of gain functionswhich leads to the general EGM formulation (5).
Definition 1 (Gain function) . A function p σ : R → R + with a parameter σ > is said to be a gainfunction if there exists a generating function φ : R → R + such that p σ ( t ) = φ ( t/σ ) for any t ∈ R and the following conditions are satisfied: (1) < ´ + ∞−∞ φ ( t )d t < + ∞ ; and (2) φ is non-decreasing on ( −∞ , and non-increasing on [0 , + ∞ ) . The gain functions p σ is introduced as a surrogate of p ε . The scale parameter σ is used tostretch or compress the function φ so as to approximate the density. According to the definition,gain functions attain their peak value at the point . An intuitive explanation of this restriction isthat one gains the most if a learning machine f fits y exactly at the point x . Remark 2.
The terminology “gain function" in Definition 1 has been used in various disciplines.For instance, in game theory, gain function is better known as “pay-off function". It is a functiondefined on the set of situations in a game, the values of which are a numerical description of theutility of a player or of a team of players in a given situation. In economics, gain function is betterknown as “utility function". It is a function that measures preferences over a set of goods andservices. Its value represents the satisfaction that consumers receive for choosing and consuming aproduct or service. In the present study, gain function is not referred to as the ones in game theoryor economics, though it may be related to those terminologies. The introduction of gain function inthe present study is directly inspired by the studies in [55, 56] for robust statistical estimation.
Following Definition 1 and the discussions in Section 3.1, one can immediately write out a varietyof gain functions.
Example 1 (Triweight gain function) . When taking the triweight kernel as the gain function, wehave the triweight gain function p σ ( t ) = (cid:18) − t σ (cid:19) I {| t |≤ σ } . Example 2 (Epanechnikov gain function) . When taking the Epanechnikov kernel as a gain function,we come to Epanechnikov gain function p σ ( t ) = (cid:18) − t σ (cid:19) I {| t |≤ σ } . Example 3 (Cauchy gain function) . When considering the kernel of a Cauchy distribution with thelocation parameter , we obtain the Cauchy gain function p σ ( t ) = σ σ + t . xample 4 (Gaussian gain function) . When considering the kernel of a standard Gaussian distri-bution, we have the Gaussian gain function p σ ( t ) = exp (cid:18) − t σ (cid:19) . Example 5 (Laplace gain function) . Considering the kernel of a Laplace distribution with thelocation parameter , we have the Laplace gain function p σ ( t ) = exp (cid:18) − | t | σ (cid:19) . Example 6 (Cosine gain function) . Using the cosine kernel as a gain function leads to the Cosinegain function p σ ( t ) = cos (cid:18) πt σ (cid:19) I {| t |≤ σ } . Example 7 (Uniform gain function) . Using the uniform kernel as a gain function gives the uniformgain function p σ ( t ) = 12 σ I {| t |≤ σ } . As shown above, a variety of gain functions can be introduced in various ways for different purposes.For instance, Gaussian and Cauchy gain functions may be employed in EGM to learn the condi-tional mean function in regression, while by means of the Laplace gain function one may learn theconditional median function. In this part, we make efforts to categorize gain functions by definingtype α gain functions and strongly mean-calibrated gain functions. Definition 3.
A gain function p σ is said to be of type α if there exist two nonnegative constants α and c such that p σ ( t ) = p σ (0) − c (cid:18) | t | σ (cid:19) α + R α (cid:18) | t | σ (cid:19) , where the remainder term R α (cid:16) | t | σ (cid:17) = o (cid:16) | t | α σ α (cid:17) as | t | σ → . In particular, p σ is said to be of exacttype α if R α (cid:16) | t | σ (cid:17) = 0 for any | t | σ ≤ . It is easy to verify that • the triweight, Cauchy, Gaussian, and Cosine gain functions in Examples 1, 3, 4, and 6 are oftype ; • the Epanechnikov gain function in Example 2 is of exact type ; • the Laplace gain function in Example 5 is of type ; and11 ain Function Mean-Calibration ψ ( t ) L L Triweight Strong (1 − t ) I {| t |≤ } √ Epanechnikov Exact (1 − t ) I {| t |≤ } Cauchy Strong t √ Gaussian Strong e − t/ e − / Cosine Strong cos (cid:16) π √ t (cid:17) I {| t |≤ } π π Table 1: Further Categorizations of Type Gain Functions in Examples 1-7 • the uniform gain function in Example 7 is of exact type .Intuitively, a type gain function may be used for mean regression while a type gain functionmay be employed for median regression. As mean regression will be the main focus in what follows,among type gain functions, we are particularly interested in strongly mean-calibrated ones as wellas exactly mean-calibrated ones defined below. Definition 4.
A gain function p σ is said to be strongly mean-calibrated if there exist a representingfunction ψ : R + → R + and absolute positive constants L and L such that p σ ( t ) := ψ ( t /σ ) andthe following two conditions hold (1) ψ ( t ) is L -Lipschitz w.r.t. t on R ; and (2) ψ ′ (0) < and ψ ′ ( t ) exists and is L -Lipschitz on [0 , .In particular, a strongly mean-calibrated gain function p σ is said to be exactly mean-calibrated if ψ ′ ( t ) is a constant function on (0 , . As per the above definition, the type gain functions listed in Examples 1-4 and 6 can be furthercategorized as in Table 1. Clearly, not all type gain functions are strongly mean-calibrated. Thefollowing proposition further reveals the relations between strongly mean-calibrated and (exact)type gain functions. Proposition 5.
A strongly mean-calibrated gain function must be of type , and an exactly mean-calibrated gain function must be of exact type .Proof. Let p σ be a strongly mean-calibrated gain function. According to Definition 4, there existsa representing function ψ such that p σ ( t ) = ψ ( t /σ ) . Applying the mean value theorem, we knowthat for any < u < , it holds that ψ ( u ) − ψ (0) = ψ ′ ( ξ ) u, where < ξ < u . To show that p σ is of type , it suffices to prove that there exists a positiveconstant c such that ψ ′ ( ξ ) u + cu = o ( u ) , as u → . c = − ψ ′ (0) > and recall that ψ ′ ( u ) is L -Lipschitz on [0 , . Replacing u with t /σ , we have proved that p σ is of type . The conclusion that an exactly mean-calibratedgain function must be of exact type is obvious.As we shall see later, EGM schemes induced by strongly mean-calibrated gain functions areasymptotically mean calibrated in regression and their sharp error bounds can be established, whichexplain the name of such gain functions. It should be remarked that the conditions for strongly mean-calibrated gain functions are sufficient ones to ensure the (asymptotic) mean calibration propertiesof the resulting EGM estimators and to establish fast convergence rates. In fact, such conditions canbe relaxed to much weaker ones if one is only interested in regression consistency. Likewise, one canalso further categorize type gain functions and investigate their behaviors in median regression,which is beyond the scope of the present study. In recent years, bounded nonconvex loss functions are playing more and more important rolesin machine learning applications especially in computer vision as it is commonly accepted thatthe boundedness of a loss function is essential for outlier resistance, see e.g., [41, 48]. Severalcanonical bounded nonconvex losses that are frequently employed in robust estimation problemsinclude truncated square loss, Tukey’s biweight loss, Geman-McClure loss, exponential squared loss,and Andrews loss. Interestingly, within the EGM framework, these bounded nonconvex losses canbe naturally interpreted as gain functions. Such correspondences are detailed as follows and arealso summarized in Table 2.
Example 1 ′ (Tukey’s biweight loss) . The well-known Tukey’s biweight loss for robust estimationwas introduced in [53] and is defined as ℓ σ ( t ) = σ (cid:20) − (cid:16) − t σ (cid:17) (cid:21) , if | t | ≤ σ, σ , otherwise . It can be deduced from the triweight gain function in Example 1 and leads to the triweight EGM inMotivating Scenario I.
Example 2 ′ (Truncated square loss) . The truncated square loss proposed in [29] is given as ℓ σ ( t ) = min { t , σ } . It is also known as the skipped mean loss or Talwar loss. The truncated square loss can be translatedfrom the Epanechnikov gain function in Example 2 and leads to the Epanechnikov EGM in MotivatingScenario II.
Example 3 ′ (Geman-McClure loss) . The Geman-McClure loss proposed in [25] is defined as follows ℓ σ ( t ) = t σ + t . Clearly, it can be derived from the Cauchy gain function in Example 3 and leads to the Geman-McClure regression in Motivating Scenario III. ounded Nonconvex Loss Gain Function Related Examples Tukey’s biweight loss Triweight 1 and 1 ′ Truncated square loss Epanechnikov 2 and 2 ′ Geman-McClure loss Cauchy 3 and 3 ′ Exponential squared loss Gaussian 4 and 4 ′ Exponential absolute loss Laplace 5 and 5 ′ Andrews loss Cosine 6 and 6 ′ Box loss Uniform 7 and 7 ′ Table 2: Correspondence between Gain Functions and Existing Bounded Nonconvex Losses
Example 4 ′ (Exponential squared loss) . The exponential squared loss is defined as ℓ σ ( t ) = σ (cid:18) − exp (cid:18) − t σ (cid:19)(cid:19) . It can be derived from the Gaussian gain function in Example 4 and leads to the Gaussian EGM inMotivating Scenario IV.
Example 5 ′ (Exponential absolute loss) . The exponential absolute loss studied in [38] and [11] isdefined as ℓ σ ( t ) = 1 − exp (cid:18) − | t | σ (cid:19) . It can be derived from the Laplace gain function in Example 5.
Example 6 ′ (Andrews loss) . Andrews loss was introduced in [1] and has been applied in robustestimation in statistics and machine learning. It takes the form ℓ σ ( t ) = ( σ (1 − cos( πt σ )) , if | t | ≤ σ,σ , otherwise , and can be derived from the Cosine gain function in Example 6. Example 7 ′ (Box loss) . Box loss takes the following form ℓ σ ( t ) = ( , if | t | ≤ σ, , otherwise . Corresponding to the uniform gain function in Example 7, this loss function was employed to performmodal regression in [37] and the resulting EGM scheme also gives the maximum consensus problemin computer vision, see e.g., [12, 13].
In this section, we take a sober look at EGM. To this end, we first propose several fundamentalquestions that are raised when assessing EGM from a learning theory viewpoint. We then assess theperformance of EGM estimators in two different setups, namely, the distribution-free setup and thesetup where the noise distribution is correctly specified. We also conduct case studies by applyingour theoretical results to the motivating scenarios.
Recalling that in the ERM scheme (1), in order to assess the out-of-sample prediction ability of f z ,one evaluates the excess generalization error E ℓ ( Y − f z ( X )) − E ℓ ( Y − f M , ℓ ( X )) , where the expectation is taken jointly w.r.t. X and Y and f M , ℓ = arg min f ∈M E ℓ ( Y − f ( X )) serves as the oracle of the ERM scheme induced by the loss function ℓ . When the loss function ℓ is chosen as the square loss, one has f M , ℓ = f ⋆ with f ⋆ being the conditional mean function. Inparticular, in this case, one also has the following relation k f z − f ⋆ k ,ρ = E ℓ ( Y − f z ( X )) − E ℓ ( Y − f ⋆ ( X )) . When ℓ is a general convex loss, under certain noise assumptions, one may still be able to charac-terize the oracle f M , ℓ and further show that the oracle is the underlying truth f ⋆ . Moreover, theconvergence E ℓ ( Y − f z ( X )) → E ℓ ( Y − f ⋆ ( X )) may also imply the convergence of k f z − f ⋆ k ,ρ . In thestatistical learning literature, the related studies that address the above concerns for ERM schemesinduced by convex losses have been conducted extensively and theoretical frameworks have beenwell developed, see e.g., [15, 51].However, for EGM schemes, the story becomes more complicated due to the nonconcavity ofgain functions and the involvement of the parameter σ . For any measurable function f : X → R ,we denote its generalization gain associated with the gain function p σ as G σ ( f ) = E p σ ( Y − f ( X )) G σ, z ( f ) = 1 n n X i =1 p σ ( y i − f ( x i )) as its empirical gain . Moreover, we denote the quantity G σ ( f M ,σ ) −G σ ( f ) as the excess generalizationgain of f , where f M ,σ := arg max f ∈M G σ ( f ) serves as the oracle of EGM induced by the gain function p σ . With the above notations, the following fundamental questions regarding theoretical assessmentsof EGM arise naturally: Question 1.
Which gain function p σ should one choose ?This question is somewhat similar to the one that was proposed in the context of ERM,namely, which loss function one should choose for ERM, and was investigated in [45, 50] insome scenarios. It is generally accepted that each loss function has its own merits in learningand the choice of the loss function for ERM may rest upon the learning task confronted.For instance, the square loss may be chosen if one is interested in mean regression; the leastabsolute deviation loss may be preferred in performing median regression; while the check lossmay be a good option for quantile regression. Likewise, in the context of EGM, while a generalanswer to this question is not obtainable, the choice of the gain function may also need to bediscussed case-by-case. For instance, the Gaussian gain function may be adopted for robustmean regression; the Laplace gain function may be used to perform median regression robustly;while the asymmetry Laplace gain function may be utilized for robust quantile regression. Question 2.
What is the oracle f M ,σ ?Clearly, the oracle f M ,σ is defined in association with the parameter σ . Different σ values maylead to different oracles, which together with the non-concavity of EGM, promotes barriersto the characterization of f M ,σ . It would be interesting to give a full characterization of f M ,σ under various circumstances. For instance, regarding Gaussian EGM, some efforts were madetowards this direction in [22]. However, due to the dependence on the parameter σ , the oracle f M ,σ may be far from the underlying truth function f ⋆ that one intends to approach and somay not be much informative. In particular, characterizing the oracle f M ,σ and its relationwith f ⋆ may be much involved. This same situation also arises when seeking an answer to thefollowing fundamental question. Question 3.
How to bound the excess generalization gain G σ ( f M ,σ ) −G σ ( f z ,σ ) ? Whether the conver-gence of the excess generalization gain G σ ( f M ,σ ) − G σ ( f z ,σ ) towards implies the convergenceof k f z ,σ − f ⋆ k ,ρ ?Following the clue of learning theory studies on ERM, the above questions also arise naturally.However, as mentioned above, due to the introduction of the parameter σ , the oracle f M ,σ maydrift away from f ⋆ . In this case, bounding the excess generalization gain G σ ( f M ,σ ) − G σ ( f z ,σ ) may again not be much informative and its convergence may not imply the closeness between f z ,σ and f ⋆ . In fact, following the study in [50], under some stringent assumptions on thenoise variable ε , one may conclude that f M ,σ is the same as f ⋆ . However, given that in the16achine learning context distribution-free learning is preferred, we prefer not to impose suchassumptions on the noise.In what follows, we shall make efforts to address Questions 2 and 3 above. To this end, recallthat the purpose of EGM is to learn the truth function f ⋆ . Though, the target hypothesis f M ,σ mayvary due to different choices of the σ values. Therefore, what really matters here is the locationfunction f ⋆ rather than the target hypothesis f M ,σ . This inspires us to directly take f ⋆ as thetarget hypothesis and redefine the excess generalization gain of f z ,σ as G σ ( f ⋆ ) − G σ ( f z ,σ ) . Withthis redefinition, our main concerns in EGM based learning are then switched to the followingones: (1) Whether the excess generalization gain G σ ( f ⋆ ) − G σ ( f z ,σ ) decays to zero? (2) Whether G σ ( f ⋆ ) → G σ ( f z ,σ ) implies f z ,σ → f ⋆ ? We first investigate learning performance of EGM estimators in a distribution-free setup, wheredistributional assumptions on the noise are absent while certain moment conditions may be imposed.To this end, we first introduce two assumptions, one on the capacity of H and the other on the tailbehavior of the distribution of Y . Assumption 1.
H ⊂ C ( X ) and there exist positive constants q and c such that log N ( H , η ) ≤ cη − q , ∀ η > , where the covering number N ( H , η ) is defined as the minimal k ∈ N such that there exist k balls in C ( X ) with centers in H and radius η covering H . Assumption 2.
There exists some ǫ > such that E | Y | ǫ < + ∞ . Assumption 1 is typical in learning theory and is introduced here to control the complexity ofthe hypothesis space H . Assumption 2 is a weak assumption on the distribution of the responsevariable. Note that under the boundedness assumption of f ⋆ , the finiteness of the (1 + ǫ ) momentcondition on Y is equivalent to the finiteness of that of the noise Y − f ⋆ ( X ) . It is rather weak asit admits the case where the noise has infinite variance.Our first result for EGM is concerned with its mean regression calibration property, namely,whether G σ ( f z ,σ ) → G σ ( f ⋆ ) implies f z ,σ → f ⋆ . While a general answer to this question maybe negative, the following theorem tells us that some weak form of regression calibration can beobtained with an adaptive selection of σ values. Theorem 6.
Let f ⋆ = E ( Y | X ) be bounded by M . Let Assumption 2 hold, σ ≥ max { M, } , and p σ be a strongly mean-calibrated gain function. For any bounded measurable function f : X → R with k f k ∞ ≤ M , it holds that (cid:12)(cid:12) σ [ G σ ( f ⋆ ) − G σ ( f )] − c k f − f ⋆ k ,ρ (cid:12)(cid:12) ≤ c ǫ σ − θ ǫ , where c = − ψ ′ (0) > , θ ǫ = min { ǫ, } , and c ǫ is a positive constant that is independent of f and will be given explicitly in the proof. Moreover, if p σ is exactly mean-calibrated, then the aboveinequality holds with θ ǫ = ǫ .
17s Theorem 6 applies to f z ,σ , we say that f z ,σ is asymptotically mean calibrated . That is,when σ is adjusted according to the sample size n and its value diverges, the bias term between G σ ( f ⋆ ) − G σ ( f z ,σ ) and k f z ,σ − f ⋆ k ,ρ shrinks to , yielding the calibration property.We next establish error bounds and convergence rates of f z ,σ . In particular, we consider twocases, namely, when the gain function p σ is strongly mean-calibrated and when p σ is exactly mean-calibrated, respectively. To this end, we introduce f H = arg min f ∈H k f − f ⋆ k ,ρ to characterize the approximation ability of the tuple ( H , ρ, L ρ X ) to learn f ⋆ . When p σ is stronglymean-calibrated, the established error bounds and convergence rates are as follows. Theorem 7.
Let Assumptions 1 and 2 hold and σ > max { M, } . Let f z ,σ be produced by EGM (5) associated with a strong mean-calibrated gain function p σ . For any < δ < , with probabilityat least − δ , it holds that k f z ,σ − f ⋆ k ,ρ . k f H − f ⋆ k ,ρ + log(2 /δ )Ψ ( n, ǫ, σ ) , (8) where Ψ ( n, ǫ, σ ) := σ ǫ + σn / ( q +1) , if < ǫ ≤ , σ ǫ + (cid:18) σ q + 2 ǫ ǫ n (cid:19) / ( q +1) , if < ǫ < , σ + (cid:18) σ q + 41+ ǫ n (cid:19) / ( q +1) , if ≤ ǫ < , σ + σn / ( q +1) , if ǫ ≥ . With properly chosen σ values, an immediate corollary is as follows. Corollary 8.
Under the assumptions of Theorem 7, let f ⋆ ∈ H and σ be chosen as σ := n ϑ ( ǫ,q ) ,where ϑ ( ǫ, q ) = q +1)( ǫ +1) , if < ǫ ≤ , ǫ (1+ ǫ )( ǫ + q + qǫ )+2 ǫ , if < ǫ < , ǫ (2+3 q )(1+ ǫ )+ ǫ , if ≤ ǫ < , q +1) , if ǫ ≥ . Then for any < δ < , with probability at least − δ , it holds that k f z ,σ − f ⋆ k ,ρ . log(2 /δ ) n − θ ǫ ϑ ( ǫ,q ) , (9) where θ ǫ = min { ǫ, } . When p σ is exactly mean-calibrated, improved error bounds and rates can be established.18 heorem 9. Under the assumptions of Theorem 7, we further assume that p σ is exactly mean-calibrated. Then for any < δ < , with probability at least − δ , it holds that k f z ,σ − f ⋆ k ,ρ . k f H − f ⋆ k ,ρ + log(2 /δ )Ψ ( n, ǫ, σ ) , (10) where Ψ ( n, ǫ, σ ) := σ ǫ + σn / ( q +1) , if < ǫ ≤ , σ ǫ + (cid:18) σ q + 2 ǫ ǫ n (cid:19) / ( q +1) , if ǫ > . Corollary 10.
Under the assumptions of Theorem 7, let f ⋆ ∈ H and σ be chosen as σ := n ϑ ( ǫ,q ) ,where ϑ ( ǫ, q ) = q +1)( ǫ +1) , if < ǫ ≤ , ǫ ( q + ǫ + qǫ )(1+ ǫ )+2 ǫ , if ǫ > . Then for any < δ < , with probability at least − δ , it holds that k f z ,σ − f ⋆ k ,ρ . log(2 /δ ) n − ǫϑ ( ǫ,q ) . (11)Proofs of Theorems 7 and 9 are deferred to the appendix. Results in Corollaries 8 and 10 areimmediate from the two theorems and so their proofs are omitted. Several remarks on the theoreticalresults are in order here. • First, under the (1 + ǫ ) moment condition, exponential type convergence rates for f z ,σ areestablished by diverging σ values. These results demonstrate that EGM estimators can dealwith regression problems in the presence of heavy-tailed noise since when < ǫ < , the noise Y − f ⋆ ( X ) does not even admit finite variance. • Second, these error bounds and convergence rates explicitly tell how the scale parameter σ in EGM influences the learnability of f z ,σ . Such an influence is weakened when ǫ goes larger,which coincides with our intuitive understanding. • Third, when p σ is strongly mean-calibrated, the tail of the noise distribution is sufficiently light,and when functions in H is smooth enough, asymptotic convergence rates of type O ( n − / ) canbe obtained, suggesting the existence of a bottleneck phenomenon in learning f ⋆ . While when p σ is exactly mean-calibrated, such asymptotic convergence rates can be up to O ( n − ) . Thesefindings and comparisons indicate the advantages of exactly mean-calibrated gain functionsover strongly mean-calibrated ones. • Fourth, these theoretical results can be immediately applied to the regression schemes in themotivating scenarios in Section 1.2. Such applications bring us novel results that deepen ourunderstanding of these well-established but not fully-understood robust regression approaches.The instantiations and applications of the above theorems and corollaries will be detailed inSection 4.4. 19 .3 Learning through EGM without Misspecification
It has been well understood that MLEs are asymptotically optimal when the likelihood functionis correctly specified. Likewise, in the context of EGM, we also have a look at the case when thenoise distribution is correctly specified and the gain function p σ results from the kernel of such adistribution. Theorem 11.
Assume that the distribution of the noise Y − f ⋆ ( X ) is symmetric and is independentof X . Let p σ be the kernel of such a distribution and be symmetric and square integrable. Then f ⋆ is a global maximizer of the gain functional G σ ( f ) and there exists an absolute constant C σ > suchfor any bounded measurable function f : X → R , we have C σ k f − f ⋆ k ,ρ ≤ G σ ( f ⋆ ) − G σ ( f ) . (12) If, in addition, p σ is strongly mean-calibrated, then there exists an absolute constant C ′ σ > suchthat G σ ( f ⋆ ) − G σ ( f ) ≤ C ′ σ k f − f ⋆ k ,ρ . (13)We remark that correctly specifying the noise distribution could be a stringent and impracticalrequirement in real-world problems. However, the performance of EGM in this ideal situationhelps better understand EGM schemes from a theoretical perspective. Theorem 11 indicates thatwhen σ is specified correctly so that the gain function is the kernel of the noise distribution, theresulting EGM scheme is f ⋆ -regression calibrated, that is, G σ ( f z ,σ ) → G σ ( f ⋆ ) implies f z ,σ → f ⋆ when n → ∞ . Note that in (12), f ⋆ denotes the underlying truth function and is not necessarilythe conditional mean function, but could be broadly any location function such as the conditionalmedian function or the conditional mode function. Moreover, the theorem also indicates thatwhen the noise distribution is correctly specified, EGMs induced by strongly mean-calibrated gainfunctions are essentially equivalent to the ERM induced by the square loss while at the same time,the former ones are capable of robust regression in the absence of light-tailed noise as have beenillustrated in Theorems 7 and 9.One may proceed with the establishment of error bounds and convergence rates of EGM bymeans of similar learning theory arguments and by recalling the regression calibration propertydeveloped in Theorem 11. In particular, it could be also shown that faster convergence rates areobtainable due to the equivalence of strongly mean-calibrated gain functions and the square loss inthis case. Details are omitted due to their great similarity to the proofs of Theorems 7 and 9. The generality of the EGM framework allows us to consider specific cases by choosing specific gainfunctions. When the conditional mean function is of interest, we investigate above the performanceof EGM estimators associated with the gain functions that are strongly mean-calibrated or exactlymean-calibrated. The usefulness of the above-established theoretical results lies in that they can bedirectly applied to existing well-established but yet not fully-understood robust regression schemes20 egression Method Gain Function Error Bounds and Rates
Tukey Regression Triweight (8) and (9)Truncated Least Square Epanechnikov (10) and (11)Geman-McClure Regression Cauchy (8) and (9)Maximum Correntropy Gaussian (8) and (9)
Table 3: Applications of EGM Framework and Theory to Motivating Scenarios I–IV and provide a statistical learning assessment on them. For instance, applying these results to thefour regression schemes mentioned in the motivating scenarios in Section 1.2, we immediately obtaintheir error bounds and convergences rates, which are listed in Table 3. While one may also applythe theoretical results to other robust regression schemes, further exploration of the applications ofthe new framework and the theoretical results will be left for future research.
In this section, we provide further insights and perspectives by showing that the newly developedEGM framework enables us to devise more new bounded nonconvex loss functions. As furthercomparisons with ERM, we also stress that, in addition to the minimum distance estimation inter-pretation, the adaptiveness and the boundedness of gain functions differentiate EGM from ERM.
As examplified in Section 3.4, the EGM framework allows us to translate bounded nonconvex lossesinto gain functions. Such correspondences also allow us to reformulate gain functions into boundednonconvex losses. Noticing the richness and versatility of gain functions, various new boundednonconvex losses can be obtained. Here we example an interesting instantiation of the idea byintroducing generalized Tukey’s loss ℓ σ ( t ) = ( − (cid:16) − | t | m σ m (cid:17) n , if | t | ≤ σ, , if | t | > σ. The two power indices m , n control the smoothness of the loss function and with larger m and n values, the loss function becomes more smooth. In particular, the loss function turns to be moreand more insensitive at the vicinity of t = 0 when the m and n values become larger and larger.The generalized Tukey’s loss can be derived from the gain function p σ ( t ) = (1 − | t | m /σ m ) n and itsintroduction is inspired by the fact that when m = 2 , n = 3 , it reduces to the Tukey’s biweight lossand when m = 2 , n = 1 , it gives the truncated square loss. One may explore other choices of m and n values, which leads to the following new bounded nonconvex losses:21 ricube loss from the tricube gain function. The tricube loss is defined as ℓ σ ( t ) = − (cid:16) − | t | σ (cid:17) , if | t | ≤ σ, , if | t | > σ. It can be reformulated from the following Tricube gain function p σ ( t ) = (cid:18) − | t | σ (cid:19) I {| t |≤ σ } , which results from the tricube smoothing kernel [46]. Quartic loss from the quartic gain function.
The quartic loss is defined as ℓ σ ( t ) = − (cid:16) − t σ (cid:17) , if | t | ≤ σ, , if | t | > σ. It can be derived from the quartic gain function p σ ( t ) = (cid:18) − t σ (cid:19) I {| t |≤ σ } , which comes from the quartic smoothing kernel. Truncated absolute deviation loss from the triangle gain function.
The truncated absolutedeviation loss is defined as ℓ σ ( t ) = ( | t | , if | t | ≤ σ,σ, if | t | > σ. It can be derived from the following triangular gain function p σ ( t ) = (cid:18) − | t | σ (cid:19) I {| t |≤ σ } , which results from the triangular smoothing kernel.In addition, by hinging and translating the generalized Tukey’s loss, one can also obtain asmoothened approximate of the − loss for binary-valued regression, which may be of independentinterest for robust classification. Further to our comparisons of EGM with ERM, we now rethink, in addition to the minimumdistance estimation interpretation of EGM, what else makes the differences between the two typesof learning schemes. 22 . . . . . . . . Figure 1: In the above two panels, the dotted red curve with square marks denotes the conditional mode function f MO for the regression model (14). The dotted black curve with plus marks gives the conditional mean function f ME . The dotted blue curve with ⊗ marks represents the learned Gaussian EGM estimator f z ,σ . Recall that a gain function can be translated into a bounded nonconvex loss and vice versa. Asis commonly accepted, the boundedness of a loss function is essential in dealing with outliers in theresponse variable. On the other hand, EGM is association with a gain function p σ which serves asa surrogate of p ε and contains an integrated scale parameter σ . The introduction of this parameterprovides flexibility and adaptiveness in learning.In fact, apart from the boundedness of a gain function, it is its adaptiveness brought by theparameter σ that distinguishes EGM from ERM. This could be further illustrated by using thefollowing toy example on Gaussian EGM, where we consider the regression model y = f ⋆ ( x ) + κ ( x ) ε, (14)where x ∼ U (0 , , f ⋆ ( x ) = 2 sin( πx ) , and κ ( x ) = 1 + 2 x . The noise variable is distributed as ε ∼ . N ( − , . ) + 0 . N (1 , . ) . With simple computations, we know that the conditionalmean function is f ME ( x ) = 2 sin( πx ) , and the conditional mode function is approximately f MO ( x ) =2 sin( πx ) + 1 + 2 x .In our experiment, observations are drawn from the above data-generating model for trainingand the size of the test set is also set to . The reconstructed curve is plotted in Fig. 1, in whichthe conditional mean function f ME and the conditional mode function f MO are also plotted. In ourexperiment, the hypothesis space H is chosen as a subset of a reproducing kernel Hilbert spaceinduced by a Gaussian kernel, the bandwidth of which is selected through cross-validation. For thescale parameter σ in the gain function, we set σ = 0 . in the left panel of Fig. 1 and σ = 10 inthe right panel. In the two panels, the dotted blue curves with ⊗ marks are the learned GaussianEGM estimators. Clearly, from the experiments, we see that with different choices, the GaussianEGM estimators can approach different location functions. In fact, through numerical experiments,one can further show that in the presence of Cauchy noise, Gaussian EGM can still learn thetruth function as demonstrated theoretically earlier. These empirical findings together with ourtheoretical results suggest that Gaussian EGM may possess more adaptiveness than ERM.23 Conclusion
In this paper, a framework of learning through empirical gain maximization was developed to dealwith robust regression problems. The development of such a framework was inspired by several well-established but yet not fully-understood regression schemes such as Tukey regression and Geman-McClure regression. Unlike ERM that can be traced to the framework of maximum likelihoodestimation, empirical gain maximization can be interpreted from a minimum distance estimationviewpoint and thus may possess built-in robustness. To measure point-wise goodness-of-fit in re-gression problems, gain function was introduced. A list of gain functions was exampled and alsocarefully categorized. Interestingly, we showed that a variety of existing representative robust lossfunctions such as Tukey’s biweight loss, the truncated squared loss, and the Geman-McClure losscan be reformulated as special cases of gain functions. A unified learning theory analysis was con-ducted to assess the performance of empirical gain maximization schemes in regression problems.The developed new framework and the conducted unified analysis not only help us better under-stand the existing non-convex robust regression schemes but also bring us new bounded nonconvexloss functions of the same kind.
Appendix: Lemmas and Collected Proofs
In this appendix, we provide intermediate lemmas and detailed proofs of Theorems 6, 7, 9, and 11. Recall that for any bounded measurable function $f: \mathcal{X} \to \mathbb{R}$, $G_\sigma(f)$ and $G_{\sigma,\mathbf{z}}(f)$ denote the generalization gain and the empirical gain of $f$, respectively:
$$G_\sigma(f) = \mathbb{E}\,p_\sigma(Y - f(X)) \quad \text{and} \quad G_{\sigma,\mathbf{z}}(f) = \frac{1}{n}\sum_{i=1}^n p_\sigma(y_i - f(x_i)).$$
For simplification of the analysis, we introduce the scaled generalization gain
$$\widetilde{G}_\sigma(f) = \sigma^2 G_\sigma(f) = \mathbb{E}\big[\sigma^2 p_\sigma(Y - f(X))\big]$$
and correspondingly the scaled empirical gain
$$\widetilde{G}_{\sigma,\mathbf{z}}(f) = \sigma^2 G_{\sigma,\mathbf{z}}(f) = \frac{1}{n}\sum_{i=1}^n \sigma^2 p_\sigma(y_i - f(x_i)).$$
We further denote $f_{\mathcal{H},\sigma}$ as the population version of $f_{\mathbf{z},\sigma}$ in $\mathcal{H}$, namely, $f_{\mathcal{H},\sigma} = \arg\max_{f\in\mathcal{H}} G_\sigma(f)$. We first provide two lemmas that will be used in our proofs.
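As a concrete illustration of these definitions (our own sketch, using the Gaussian gain $p_\sigma(t) = \exp(-t^2/(2\sigma^2))$, i.e., the representing form $p_\sigma(t) = \psi(t^2/\sigma^2)$ with $\psi(u) = e^{-u/2}$, as the running example), the empirical gain and its scaled version can be computed directly from the residuals of a candidate fit:

```python
import numpy as np

sigma = 2.0

def p_sigma(t):
    """Gaussian gain function p_sigma(t) = exp(-t^2 / (2 sigma^2))."""
    return np.exp(-t**2 / (2.0 * sigma**2))

def empirical_gain(y, fx):
    """G_{sigma,z}(f) = (1/n) * sum_i p_sigma(y_i - f(x_i))."""
    return np.mean(p_sigma(y - fx))

rng = np.random.default_rng(1)
y = rng.standard_t(df=2, size=1000)   # heavy-tailed responses around 0
fx = np.zeros_like(y)                 # a candidate fit f(x_i) = 0
g = empirical_gain(y, fx)
print(g, sigma**2 * g)                # empirical gain and its scaled version
```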
A.1 Lemmas

Lemma 12.
Let $\xi$ be a random variable on a probability space $\mathcal{Z}$ having variance $v$ and satisfying $|\xi - \mathbb{E}\xi| \le B$ almost surely. Then for all $\varepsilon > 0$,
$$\Pr\left\{\left|\frac{1}{n}\sum_{i=1}^n \xi(z_i) - \mathbb{E}\xi\right| > \varepsilon\right\} \le 2\exp\left(-\frac{n\varepsilon^2}{2\big(v + \frac{1}{3}B\varepsilon\big)}\right).$$
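As a quick numerical sanity check of this one-sided Bernstein-type bound (a sketch with hypothetical parameters, not part of the original analysis), one can compare the simulated deviation frequency of a bounded variable against the bound:

```python
import numpy as np

# xi uniform on [0, 1]: E xi = 1/2, variance v = 1/12, |xi - E xi| <= B = 1/2.
rng = np.random.default_rng(0)
n, eps, trials = 200, 0.05, 100_000
v, B = 1.0 / 12.0, 0.5

means = rng.random((trials, n)).mean(axis=1)
empirical = np.mean(np.abs(means - 0.5) > eps)
bernstein = 2.0 * np.exp(-n * eps**2 / (2.0 * (v + B * eps / 3.0)))
print(empirical, bernstein)  # the empirical frequency should not exceed the bound
```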
If a gain function $p_\sigma(t)$ is strongly mean-calibrated, then its representing function $\psi(t)$ is $L$-Lipschitz with $L = \max\big(L_2 + c_1, \frac{L_1}{2}\big)$.

Proof.
For any $t_1, t_2 \in \mathbb{R}_+$, if both $t_1 \ge 1$ and $t_2 \ge 1$, then by the fact that $\psi(t)$ is $L_1$-Lipschitz w.r.t. $\sqrt{t}$, we have
$$|\psi(t_1) - \psi(t_2)| \le L_1\big|\sqrt{t_1} - \sqrt{t_2}\big| = \frac{L_1|t_1 - t_2|}{\sqrt{t_1} + \sqrt{t_2}} \le \frac{L_1}{2}|t_1 - t_2|.$$
If both $t_1 \le 1$ and $t_2 \le 1$, since $\psi'$ exists and is $L_2$-Lipschitz continuous on $[0, 1)$, we have for all $t \le 1$,
$$|\psi'(t)| \le |\psi'(t) - \psi'(0)| + |\psi'(0)| \le L_2|t| + c_1 \le L_2 + c_1.$$
Therefore,
$$|\psi(t_1) - \psi(t_2)| = \left|\int_{t_1}^{t_2}\psi'(t)\,\mathrm{d}t\right| \le (L_2 + c_1)|t_1 - t_2|.$$
If $t_1 < 1$ and $t_2 \ge 1$, then
$$|\psi(t_1) - \psi(t_2)| \le |\psi(t_1) - \psi(1)| + |\psi(1) - \psi(t_2)| \le (L_2 + c_1)(1 - t_1) + \frac{L_1}{2}(t_2 - 1) \le \max\Big(L_2 + c_1, \frac{L_1}{2}\Big)(t_2 - t_1).$$
Combining all three cases, we obtain the desired conclusion.
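To see Lemma 13 in action (our own example, not from the original text), take the Gaussian gain with representing function $\psi(u) = e^{-u/2}$. Then $L_1 = e^{-1/2}$ (the maximal slope of $e^{-s^2/2}$), $L_2 = 1/4$ (since $|\psi''| \le 1/4$ on $[0,1)$), and $c_1 = -\psi'(0) = 1/2$, so the lemma predicts the Lipschitz constant $L = \max(L_2 + c_1, L_1/2) = 3/4$, which a finite-difference check confirms is not exceeded:

```python
import numpy as np

# Constants for psi(u) = exp(-u/2): assumed example, see the lead-in above.
L1, L2, c1 = np.exp(-0.5), 0.25, 0.5
L = max(L2 + c1, L1 / 2.0)                  # Lipschitz constant from Lemma 13

u = np.linspace(0.0, 10.0, 20001)
psi = np.exp(-u / 2.0)
slopes = np.abs(np.diff(psi) / np.diff(u))  # finite-difference slope estimates
print(slopes.max(), "<=", L)                # observed: 0.5 <= 0.75
```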
Lemma 14.
Let Assumption 2 hold with some $\epsilon > 0$. Let $\sigma > \max\{M, 1\}$ and $p_\sigma$ be a strongly mean-calibrated gain function. For any measurable function $f: \mathcal{X} \to \mathbb{R}$ with $\|f\|_\infty \le M$, we denote
$$\xi(X, Y) := \sigma^2 p_\sigma(Y - f^\star(X)) - \sigma^2 p_\sigma(Y - f(X)).$$
Then
$$\mathbb{E}\xi^2 \le \begin{cases} c_3\,\sigma^{1-\epsilon}, & \text{if } 0 < \epsilon \le 1, \\ c_4\,\|f - f^\star\|_{2,\rho}^{\frac{2\epsilon - 2}{1+\epsilon}}, & \text{if } \epsilon > 1, \end{cases}$$
where $c_3$ and $c_4$ are absolute positive constants independent of $\sigma$ or $f$ and will be given explicitly in the proof.

Proof. Since $\psi(t)$ is $L_1$-Lipschitz w.r.t. $\sqrt{t}$, we have
$$\big|\sigma^2 p_\sigma(Y - f^\star(X)) - \sigma^2 p_\sigma(Y - f(X))\big| \le \sigma^2 L_1\left|\frac{|Y - f^\star(X)|}{\sigma} - \frac{|Y - f(X)|}{\sigma}\right| \le L_1\sigma\,|f(X) - f^\star(X)| \le 2L_1 M\sigma. \qquad (15)$$
Lemma 13 tells us that $\psi(t)$ is $L$-Lipschitz w.r.t. $t$ and further implies
$$\begin{aligned} \big|\sigma^2 p_\sigma(Y - f^\star(X)) - \sigma^2 p_\sigma(Y - f(X))\big| &\le \sigma^2 L\left|\Big(\frac{Y - f^\star(X)}{\sigma}\Big)^2 - \Big(\frac{Y - f(X)}{\sigma}\Big)^2\right| = L\big|(Y - f(X))^2 - (Y - f^\star(X))^2\big| \\ &= L\,|f(X) - f^\star(X)|\,|2Y - f(X) - f^\star(X)| \\ &\le 2L\,|f(X) - f^\star(X)|(|Y| + M) \le 4LM(|Y| + M). \end{aligned} \qquad (16)$$
Therefore, when $0 < \epsilon \le 1$, by (15) and (16), we have
$$\mathbb{E}\xi^2 \le (2ML_1\sigma)^{1-\epsilon}(4LM)^{1+\epsilon}\,\mathbb{E}(|Y| + M)^{1+\epsilon} \le c_3\,\sigma^{1-\epsilon},$$
where $c_3 = 2^{3+2\epsilon}L_1^{1-\epsilon}L^{1+\epsilon}M^2\big(\mathbb{E}|Y|^{1+\epsilon} + M^{1+\epsilon}\big)$. When $\epsilon > 1$, by (16) and the Hölder inequality, we obtain
$$\mathbb{E}\xi^2 \le (2L)^2\,\mathbb{E}\big[|f(X) - f^\star(X)|^2(|Y| + M)^2\big] \le 4L^2\|f - f^\star\|_\infty^{\frac{4}{1+\epsilon}}\,\mathbb{E}\Big(|f(X) - f^\star(X)|^{\frac{2\epsilon-2}{1+\epsilon}}(|Y| + M)^2\Big) \le c_4\,\|f - f^\star\|_{2,\rho}^{\frac{2\epsilon-2}{1+\epsilon}},$$
where $c_4 = 8L^2(2M)^{\frac{4}{1+\epsilon}}\big((\mathbb{E}|Y|^{1+\epsilon})^{\frac{2}{1+\epsilon}} + M^2\big)$. This completes the proof of Lemma 14.
A.2 Proof of Theorem 6
Proof.
For any $\sigma > \max\{M, 1\}$, let $\Omega = \{|Y| > \sigma\}$ and $\Omega^c$ be its complement. By the Markov inequality, we have
$$\Pr(\Omega) \le \frac{\mathbb{E}|Y|^{1+\epsilon}}{\sigma^{1+\epsilon}}. \qquad (17)$$
Recalling the identity $\|f - f^\star\|_{2,\rho}^2 = \mathbb{E}\big[(Y - f(X))^2 - (Y - f^\star(X))^2\big]$, we can write
$$\begin{aligned} &\Big|\big[\widetilde{G}_\sigma(f^\star) - \widetilde{G}_\sigma(f)\big] - c_1\|f - f^\star\|_{2,\rho}^2\Big| \\ &\quad = \Big|\mathbb{E}\Big(\big[\sigma^2 p_\sigma(Y - f^\star(X)) - \sigma^2 p_\sigma(Y - f(X))\big] - c_1\big[(Y - f(X))^2 - (Y - f^\star(X))^2\big]\Big)\Big| \\ &\quad \le \mathbb{E}\Big(\Big|\big[\sigma^2 p_\sigma(Y - f^\star(X)) - \sigma^2 p_\sigma(Y - f(X))\big] - c_1\big[(Y - f(X))^2 - (Y - f^\star(X))^2\big]\Big|\,\mathbb{I}_\Omega\Big) \\ &\quad\quad + \mathbb{E}\Big(\Big|\big[\sigma^2 p_\sigma(Y - f^\star(X)) - \sigma^2 p_\sigma(Y - f(X))\big] - c_1\big[(Y - f(X))^2 - (Y - f^\star(X))^2\big]\Big|\,\mathbb{I}_{\Omega^c}\Big) := Q_1 + Q_2. \end{aligned}$$
By (16) and (17) we have
$$\begin{aligned} Q_1 &\le (L + c_1)\,\mathbb{E}\big[\big|(Y - f(X))^2 - (Y - f^\star(X))^2\big|\,\mathbb{I}_\Omega\big] \le 4M(L + c_1)\,\mathbb{E}\big[(|Y| + M)\mathbb{I}_\Omega\big] \\ &\le 4M(L + c_1)\Big(\big(\mathbb{E}|Y|^{1+\epsilon}\big)^{\frac{1}{1+\epsilon}}\big(\Pr(\Omega)\big)^{\frac{\epsilon}{1+\epsilon}} + M\,\Pr(\Omega)\Big) \le 4M(L + c_1)\big(\mathbb{E}|Y|^{1+\epsilon}\big)\Big(\sigma^{-\epsilon} + M\sigma^{-(1+\epsilon)}\Big) \\ &\le 6M(L + c_1)\big(\mathbb{E}|Y|^{1+\epsilon}\big)\sigma^{-\epsilon}. \end{aligned}$$
In order to bound $Q_2$, we denote $F_\sigma(t) := -\sigma^2 p_\sigma(t) - c_1 t^2$. Then
$$Q_2 = \mathbb{E}\big[\big|F_\sigma(Y - f(X)) - F_\sigma(Y - f^\star(X))\big|\,\mathbb{I}_{\Omega^c}\big].$$
By the mean value theorem, we have
$$\big|F_\sigma(Y - f(X)) - F_\sigma(Y - f^\star(X))\big| = |F'_\sigma(a)|\,|f(X) - f^\star(X)| \le 2M|F'_\sigma(a)|,$$
where $a$ lies between $Y - f(X)$ and $Y - f^\star(X)$ and hence $|a| \le \max(|Y - f(X)|, |Y - f^\star(X)|) \le |Y| + M$. By the facts $p_\sigma(t) = \psi(t^2/\sigma^2)$ and $c_1 = -\psi'(0)$, we have
$$F'_\sigma(a) = -\sigma^2 p'_\sigma(a) - 2ac_1 = 2a\Big(\psi'(0) - \psi'\big(a^2/\sigma^2\big)\Big).$$
Recalling that $\psi'(t)$ is $L_2$-Lipschitz continuous on $[0, 1)$, we have
$$|F'_\sigma(a)| \le \frac{2L_2|a|^3}{\sigma^2} \le \frac{2L_2(|Y| + M)^3}{\sigma^2}.$$
Therefore, if $\epsilon \ge 2$, we have
$$Q_2 \le 16ML_2\big(\mathbb{E}|Y|^3 + M^3\big)\sigma^{-2}.$$
If $0 < \epsilon < 2$, by $|Y| \le \sigma$ on $\Omega^c$, we obtain
$$Q_2 \le 16ML_2\Big(\sigma^{2-\epsilon}\,\mathbb{E}|Y|^{1+\epsilon} + M^3\Big)\sigma^{-2} \le 16ML_2\big(\mathbb{E}|Y|^{1+\epsilon} + M^3\big)\sigma^{-\epsilon}.$$
Combining the estimates for $Q_1$ and $Q_2$, we have
$$\Big|\big[\widetilde{G}_\sigma(f^\star) - \widetilde{G}_\sigma(f)\big] - c_1\|f - f^\star\|_{2,\rho}^2\Big| \le c_\epsilon\,\sigma^{-\theta_\epsilon},$$
where $\theta_\epsilon = \min(\epsilon, 2)$ and $c_\epsilon = 6M(L + c_1)\mathbb{E}|Y|^{1+\epsilon} + 16ML_2\big(\mathbb{E}|Y|^{\min(1+\epsilon, 3)} + M^3\big)$. This proves the desired conclusion when $p_\sigma$ is a strongly mean-calibrated gain function.

If $p_\sigma(t) = \psi(t^2/\sigma^2)$ is exactly mean-calibrated, i.e., $\psi'(t)$ is constant on $(0, 1)$, then $\psi'(t) \equiv \psi'(0) = -c_1$ on $[0, 1)$ and hence $\psi(t) = \psi(0) - c_1 t$ there, which implies that $F_\sigma(t)$ is constant. Therefore, $Q_2 = 0$ and we have
$$\Big|\big[\widetilde{G}_\sigma(f^\star) - \widetilde{G}_\sigma(f)\big] - c_1\|f - f^\star\|_{2,\rho}^2\Big| \le Q_1 \le c'_\epsilon\,\sigma^{-\epsilon},$$
where $c'_\epsilon = 6M(L + c_1)\mathbb{E}|Y|^{1+\epsilon} < c_\epsilon$. This completes the proof of Theorem 6.

A.3 Proof of Theorem 7
Step 1: We first prove that, under Assumption 2, there are two absolute constants $c_5$ and $c_6$ (to be defined explicitly later) such that, for any $\gamma \ge c_\epsilon\sigma^{-\theta_\epsilon}$ and $f \in \mathcal{H}$, there holds
$$\Pr\left\{\frac{\big|[\widetilde{G}_\sigma(f^\star) - \widetilde{G}_\sigma(f)] - [\widetilde{G}_{\sigma,\mathbf{z}}(f^\star) - \widetilde{G}_{\sigma,\mathbf{z}}(f)]\big|}{\big(\widetilde{G}_\sigma(f^\star) - \widetilde{G}_\sigma(f) + 2\gamma\big)^{1-\zeta_\epsilon}} > \gamma^{\zeta_\epsilon}\right\} \le 2e^{-\Theta(n,\gamma,\sigma)}, \qquad (18)$$
where
$$\zeta_\epsilon = \begin{cases} \frac{1}{2}, & \text{if } 0 < \epsilon \le 1, \\ \frac{\epsilon}{1+\epsilon}, & \text{if } \epsilon > 1, \end{cases} \qquad \text{and} \qquad \Theta(n,\gamma,\sigma) = \begin{cases} \dfrac{n\gamma}{c_5\sigma}, & \text{if } 0 < \epsilon \le 1, \\[2mm] \dfrac{n\gamma}{c_6\big(\sigma + \sigma^{2\theta_\epsilon/(1+\epsilon)}\big)}, & \text{if } \epsilon > 1. \end{cases}$$
To this end, for any $f \in \mathcal{H}$, consider $\xi = \sigma^2 p_\sigma(Y - f^\star(X)) - \sigma^2 p_\sigma(Y - f(X))$. By (15), we know $|\xi| \le 2ML_1\sigma$. Hence $|\xi - \mathbb{E}\xi| \le 4ML_1\sigma$. By Lemma 14,
$$\mathrm{var}(\xi) \le \mathbb{E}\xi^2 \le \begin{cases} c_3\,\sigma^{1-\epsilon}, & \text{if } 0 < \epsilon \le 1, \\ c_4\,\|f - f^\star\|_{2,\rho}^{\frac{2\epsilon-2}{1+\epsilon}}, & \text{if } \epsilon > 1. \end{cases}$$
By Theorem 6, when $\gamma \ge c_\epsilon\sigma^{-\theta_\epsilon}$, we have
$$\mathbb{E}\xi + 2\gamma = \widetilde{G}_\sigma(f^\star) - \widetilde{G}_\sigma(f) + 2\gamma \ge \widetilde{G}_\sigma(f^\star) - \widetilde{G}_\sigma(f) + c_\epsilon\sigma^{-\theta_\epsilon} + \gamma \ge c_1\|f - f^\star\|_{2,\rho}^2 + \gamma \ge \gamma. \qquad (19)$$
Therefore, if $0 < \epsilon \le 1$, by Lemma 12, we have
$$\begin{aligned} &\Pr\left\{\frac{\big|[\widetilde{G}_\sigma(f^\star) - \widetilde{G}_\sigma(f)] - [\widetilde{G}_{\sigma,\mathbf{z}}(f^\star) - \widetilde{G}_{\sigma,\mathbf{z}}(f)]\big|}{\sqrt{\widetilde{G}_\sigma(f^\star) - \widetilde{G}_\sigma(f) + 2\gamma}} > \sqrt{\gamma}\right\} \\ &\quad\le 2\exp\left\{-\frac{n\gamma\big(\widetilde{G}_\sigma(f^\star) - \widetilde{G}_\sigma(f) + 2\gamma\big)}{2c_3\sigma^{1-\epsilon} + \frac{8}{3}ML_1\sigma\sqrt{\gamma\big(\widetilde{G}_\sigma(f^\star) - \widetilde{G}_\sigma(f) + 2\gamma\big)}}\right\} \\ &\quad\le 2\exp\left\{-\frac{n\gamma\big(\widetilde{G}_\sigma(f^\star) - \widetilde{G}_\sigma(f) + 2\gamma\big)}{2c_3\sigma\gamma/c_\epsilon + \frac{8}{3}ML_1\sigma\big(\widetilde{G}_\sigma(f^\star) - \widetilde{G}_\sigma(f) + 2\gamma\big)}\right\} \le 2\exp\left\{-\frac{n\gamma}{c_5\sigma}\right\}, \end{aligned}$$
where $c_5 = 2c_3/c_\epsilon + \frac{8}{3}ML_1$. If $\epsilon > 1$, abbreviating $\mathcal{A} := \widetilde{G}_\sigma(f^\star) - \widetilde{G}_\sigma(f) + 2\gamma$, we have
$$\begin{aligned} \Pr\left\{\frac{\big|[\widetilde{G}_\sigma(f^\star) - \widetilde{G}_\sigma(f)] - [\widetilde{G}_{\sigma,\mathbf{z}}(f^\star) - \widetilde{G}_{\sigma,\mathbf{z}}(f)]\big|}{\mathcal{A}^{1-\zeta_\epsilon}} > \gamma^{\zeta_\epsilon}\right\} &\le 2\exp\left\{-\frac{n\gamma^{2\zeta_\epsilon}\mathcal{A}^{2-2\zeta_\epsilon}}{2c_4\|f - f^\star\|_{2,\rho}^{\frac{2\epsilon-2}{1+\epsilon}} + \frac{8}{3}ML_1\sigma\,\gamma^{\zeta_\epsilon}\mathcal{A}^{1-\zeta_\epsilon}}\right\} \\ &\le 2\exp\left\{-\frac{n\gamma^{2\zeta_\epsilon}\mathcal{A}^{2-2\zeta_\epsilon}}{\Big(2c_4 c_1^{-\frac{\epsilon-1}{1+\epsilon}}\gamma^{-\frac{2}{1+\epsilon}} + \frac{8}{3}ML_1\sigma\Big)\gamma^{2\zeta_\epsilon - 1}\mathcal{A}^{2-2\zeta_\epsilon}}\right\} \\ &\le 2\exp\left\{-\frac{n\gamma}{c_6\big(\sigma + \sigma^{2\theta_\epsilon/(1+\epsilon)}\big)}\right\}, \end{aligned}$$
where $c_6 := \max\Big(2c_4 c_1^{-\frac{\epsilon-1}{1+\epsilon}} c_\epsilon^{-\frac{2}{1+\epsilon}},\ \frac{8}{3}ML_1\Big)$, the second inequality uses $c_1\|f - f^\star\|_{2,\rho}^2 \le \mathcal{A}$ and $\gamma \le \mathcal{A}$ (both from (19)), and the last inequality is due to $\gamma^{-\frac{2}{1+\epsilon}} \le c_\epsilon^{-\frac{2}{1+\epsilon}}\sigma^{2\theta_\epsilon/(1+\epsilon)}$. This proves (18).
Step 2: We next show that, under Assumption 2, the uniform concentration inequality
$$\Pr\left\{\sup_{f\in\mathcal{H}}\frac{\big|[\widetilde{G}_\sigma(f^\star) - \widetilde{G}_\sigma(f)] - [\widetilde{G}_{\sigma,\mathbf{z}}(f^\star) - \widetilde{G}_{\sigma,\mathbf{z}}(f)]\big|}{\big(\widetilde{G}_\sigma(f^\star) - \widetilde{G}_\sigma(f) + 2\gamma\big)^{1-\zeta_\epsilon}} > \gamma^{\zeta_\epsilon}\right\} \le 2\,\mathcal{N}\Big(\mathcal{H}, \frac{\gamma}{4L_1\sigma}\Big)e^{-\Theta(n,\gamma,\sigma)} \qquad (20)$$
holds for $\gamma \ge c_\epsilon\sigma^{-\theta_\epsilon}$.
To see this, denote $J = \mathcal{N}\big(\mathcal{H}, \frac{\gamma}{4L_1\sigma}\big)$ and let $\{f_j\}_{j=1}^J \subset \mathcal{H}$ be a $\frac{\gamma}{4L_1\sigma}$-cover of $\mathcal{H}$, so that for every $f \in \mathcal{H}$ there exists some $1 \le j \le J$ with $\|f - f_j\|_\infty \le \frac{\gamma}{4L_1\sigma}$. Notice that the $L_1$-Lipschitz property of $\psi(t)$ w.r.t. $\sqrt{t}$ implies that $\sigma^2 p_\sigma(t)$ is $L_1\sigma$-Lipschitz. This in combination with (19) implies
$$|\widetilde{G}_\sigma(f) - \widetilde{G}_\sigma(f_j)| \le \sigma L_1\|f - f_j\|_\infty \le \frac{\gamma}{4} \le \frac{1}{4}\gamma^{\zeta_\epsilon}\big(\widetilde{G}_\sigma(f^\star) - \widetilde{G}_\sigma(f) + 2\gamma\big)^{1-\zeta_\epsilon}$$
and
$$|\widetilde{G}_{\sigma,\mathbf{z}}(f) - \widetilde{G}_{\sigma,\mathbf{z}}(f_j)| \le \sigma L_1\|f - f_j\|_\infty \le \frac{\gamma}{4} \le \frac{1}{4}\gamma^{\zeta_\epsilon}\big(\widetilde{G}_\sigma(f^\star) - \widetilde{G}_\sigma(f) + 2\gamma\big)^{1-\zeta_\epsilon}.$$
If
$$\big|[\widetilde{G}_\sigma(f^\star) - \widetilde{G}_\sigma(f)] - [\widetilde{G}_{\sigma,\mathbf{z}}(f^\star) - \widetilde{G}_{\sigma,\mathbf{z}}(f)]\big| > \gamma^{\zeta_\epsilon}\big(\widetilde{G}_\sigma(f^\star) - \widetilde{G}_\sigma(f) + 2\gamma\big)^{1-\zeta_\epsilon}$$
holds for some $f \in \mathcal{H}$, then for the $f_j$ in the cover with $\|f - f_j\|_\infty \le \frac{\gamma}{4L_1\sigma}$ we have
$$\begin{aligned} \big|[\widetilde{G}_\sigma(f^\star) - \widetilde{G}_\sigma(f_j)] - [\widetilde{G}_{\sigma,\mathbf{z}}(f^\star) - \widetilde{G}_{\sigma,\mathbf{z}}(f_j)]\big| &\ge \big|[\widetilde{G}_\sigma(f^\star) - \widetilde{G}_\sigma(f)] - [\widetilde{G}_{\sigma,\mathbf{z}}(f^\star) - \widetilde{G}_{\sigma,\mathbf{z}}(f)]\big| \\ &\qquad - |\widetilde{G}_\sigma(f) - \widetilde{G}_\sigma(f_j)| - |\widetilde{G}_{\sigma,\mathbf{z}}(f) - \widetilde{G}_{\sigma,\mathbf{z}}(f_j)| \\ &> \frac{1}{2}\gamma^{\zeta_\epsilon}\big(\widetilde{G}_\sigma(f^\star) - \widetilde{G}_\sigma(f) + 2\gamma\big)^{1-\zeta_\epsilon} \ge \frac{1}{4}\gamma^{\zeta_\epsilon}\big(\widetilde{G}_\sigma(f^\star) - \widetilde{G}_\sigma(f_j) + 2\gamma\big)^{1-\zeta_\epsilon}, \end{aligned}$$
where the last inequality used the estimate
$$\widetilde{G}_\sigma(f^\star) - \widetilde{G}_\sigma(f_j) + 2\gamma = \big(\widetilde{G}_\sigma(f^\star) - \widetilde{G}_\sigma(f) + 2\gamma\big) + \big(\widetilde{G}_\sigma(f) - \widetilde{G}_\sigma(f_j)\big) \le \big(\widetilde{G}_\sigma(f^\star) - \widetilde{G}_\sigma(f) + 2\gamma\big) + \gamma \le 2\big(\widetilde{G}_\sigma(f^\star) - \widetilde{G}_\sigma(f) + 2\gamma\big).$$
This proves
$$\left\{\sup_{f\in\mathcal{H}}\frac{\big|[\widetilde{G}_\sigma(f^\star) - \widetilde{G}_\sigma(f)] - [\widetilde{G}_{\sigma,\mathbf{z}}(f^\star) - \widetilde{G}_{\sigma,\mathbf{z}}(f)]\big|}{\big(\widetilde{G}_\sigma(f^\star) - \widetilde{G}_\sigma(f) + 2\gamma\big)^{1-\zeta_\epsilon}} > \gamma^{\zeta_\epsilon}\right\} \subset \bigcup_{j=1}^J\left\{\frac{\big|[\widetilde{G}_\sigma(f^\star) - \widetilde{G}_\sigma(f_j)] - [\widetilde{G}_{\sigma,\mathbf{z}}(f^\star) - \widetilde{G}_{\sigma,\mathbf{z}}(f_j)]\big|}{\big(\widetilde{G}_\sigma(f^\star) - \widetilde{G}_\sigma(f_j) + 2\gamma\big)^{1-\zeta_\epsilon}} > \frac{1}{4}\gamma^{\zeta_\epsilon}\right\}$$
and the desired uniform concentration inequality (20) follows immediately from (18) applied to each $f_j$ (with the threshold $\gamma^{\zeta_\epsilon}$ replaced by $\gamma^{\zeta_\epsilon}/4$, which only alters the absolute constants $c_5$ and $c_6$ in $\Theta(n,\gamma,\sigma)$).

Step 3: When Assumption 1 holds, the uniform concentration inequality (20) becomes
$$\Pr\left\{\sup_{f\in\mathcal{H}}\frac{\big|[\widetilde{G}_\sigma(f^\star) - \widetilde{G}_\sigma(f)] - [\widetilde{G}_{\sigma,\mathbf{z}}(f^\star) - \widetilde{G}_{\sigma,\mathbf{z}}(f)]\big|}{\big(\widetilde{G}_\sigma(f^\star) - \widetilde{G}_\sigma(f) + 2\gamma\big)^{1-\zeta_\epsilon}} > \gamma^{\zeta_\epsilon}\right\} \le 2\exp\left\{\frac{c\,(4L_1\sigma)^q}{\gamma^q} - \Theta(n,\gamma,\sigma)\right\}.$$
For $0 < \delta < 1$, let
$$2\exp\left\{\frac{c\,(4L_1\sigma)^q}{\gamma^q} - \Theta(n,\gamma,\sigma)\right\} = \delta, \quad \text{or equivalently} \quad \Theta(n,\gamma,\sigma) - \frac{c\,(4L_1\sigma)^q}{\gamma^q} - \log(2/\delta) = 0.$$
By Lemma 7.2 in [15], the equation has a unique positive solution $\gamma^\star$ that satisfies
$$\gamma^\star \lesssim \begin{cases} \log\big(\tfrac{2}{\delta}\big)\,\sigma n^{-1/(q+1)}, & \text{if } 0 < \epsilon \le 1, \\[1mm] \log\big(\tfrac{2}{\delta}\big)\Big(\dfrac{\sigma^{q+1} + \sigma^{q + 2\theta_\epsilon/(1+\epsilon)}}{n}\Big)^{1/(q+1)}, & \text{if } \epsilon > 1. \end{cases}$$
Let $\gamma_1 = \max\big(c_\epsilon\sigma^{-\theta_\epsilon}, \gamma^\star\big)$. The uniform concentration inequality tells that
$$\big|[\widetilde{G}_\sigma(f^\star) - \widetilde{G}_\sigma(f)] - [\widetilde{G}_{\sigma,\mathbf{z}}(f^\star) - \widetilde{G}_{\sigma,\mathbf{z}}(f)]\big| \le \gamma_1^{\zeta_\epsilon}\big(\widetilde{G}_\sigma(f^\star) - \widetilde{G}_\sigma(f) + 2\gamma_1\big)^{1-\zeta_\epsilon}$$
holds for all $f \in \mathcal{H}$ with probability at least $1 - \delta$. Applying Young's inequality,
$$\big|[\widetilde{G}_\sigma(f^\star) - \widetilde{G}_\sigma(f)] - [\widetilde{G}_{\sigma,\mathbf{z}}(f^\star) - \widetilde{G}_{\sigma,\mathbf{z}}(f)]\big| - \frac{1}{2}\big[\widetilde{G}_\sigma(f^\star) - \widetilde{G}_\sigma(f)\big] \lesssim \gamma_1 \qquad (21)$$
holds for all $f \in \mathcal{H}$ with probability at least $1 - \delta$. Applying (21) particularly to $f_{\mathbf{z},\sigma}$ and $f_{\mathcal{H},\sigma}$, we have
$$\big[\widetilde{G}_\sigma(f^\star) - \widetilde{G}_\sigma(f_{\mathbf{z},\sigma})\big] - \big[\widetilde{G}_{\sigma,\mathbf{z}}(f^\star) - \widetilde{G}_{\sigma,\mathbf{z}}(f_{\mathbf{z},\sigma})\big] - \frac{1}{2}\big[\widetilde{G}_\sigma(f^\star) - \widetilde{G}_\sigma(f_{\mathbf{z},\sigma})\big] \lesssim \gamma_1 \qquad (22)$$
and
$$\big[\widetilde{G}_{\sigma,\mathbf{z}}(f^\star) - \widetilde{G}_{\sigma,\mathbf{z}}(f_{\mathcal{H},\sigma})\big] - \big[\widetilde{G}_\sigma(f^\star) - \widetilde{G}_\sigma(f_{\mathcal{H},\sigma})\big] - \frac{1}{2}\big[\widetilde{G}_\sigma(f^\star) - \widetilde{G}_\sigma(f_{\mathcal{H},\sigma})\big] \lesssim \gamma_1 \qquad (23)$$
with probability at least $1 - \delta$.

Step 4: By the definition of $f_{\mathbf{z},\sigma}$ we know $\widetilde{G}_{\sigma,\mathbf{z}}(f_{\mathbf{z},\sigma}) \ge \widetilde{G}_{\sigma,\mathbf{z}}(f_{\mathcal{H},\sigma})$. Therefore,
$$\begin{aligned} \widetilde{G}_\sigma(f^\star) - \widetilde{G}_\sigma(f_{\mathbf{z},\sigma}) &= \big[\widetilde{G}_\sigma(f^\star) - \widetilde{G}_\sigma(f_{\mathcal{H},\sigma})\big] + \big[\widetilde{G}_\sigma(f_{\mathcal{H},\sigma}) - \widetilde{G}_\sigma(f_{\mathbf{z},\sigma})\big] \\ &\le \big[\widetilde{G}_\sigma(f^\star) - \widetilde{G}_\sigma(f_{\mathcal{H},\sigma})\big] + \big[\widetilde{G}_\sigma(f_{\mathcal{H},\sigma}) - \widetilde{G}_\sigma(f_{\mathbf{z},\sigma})\big] - \big[\widetilde{G}_{\sigma,\mathbf{z}}(f_{\mathcal{H},\sigma}) - \widetilde{G}_{\sigma,\mathbf{z}}(f_{\mathbf{z},\sigma})\big]. \qquad (24) \end{aligned}$$
Combining (24), (22), and (23), we obtain that, for any $0 < \delta < 1$, with probability at least $1 - \delta$,
$$\frac{1}{2}\big[\widetilde{G}_\sigma(f^\star) - \widetilde{G}_\sigma(f_{\mathbf{z},\sigma})\big] - \frac{3}{2}\big[\widetilde{G}_\sigma(f^\star) - \widetilde{G}_\sigma(f_{\mathcal{H},\sigma})\big] \lesssim \gamma_1,$$
which implies
$$\widetilde{G}_\sigma(f^\star) - \widetilde{G}_\sigma(f_{\mathbf{z},\sigma}) \lesssim \big[\widetilde{G}_\sigma(f^\star) - \widetilde{G}_\sigma(f_{\mathcal{H},\sigma})\big] + \gamma_1.$$
By the definition of $f_{\mathcal{H},\sigma}$, we have $\widetilde{G}_\sigma(f_{\mathcal{H},\sigma}) \ge \widetilde{G}_\sigma(f_{\mathcal{H}})$. Therefore,
$$\widetilde{G}_\sigma(f^\star) - \widetilde{G}_\sigma(f_{\mathbf{z},\sigma}) \lesssim \big[\widetilde{G}_\sigma(f^\star) - \widetilde{G}_\sigma(f_{\mathcal{H}})\big] + \gamma_1.$$
By Theorem 6, we obtain for any $0 < \delta < 1$, with probability at least $1 - \delta$,
$$\|f_{\mathbf{z},\sigma} - f^\star\|_{2,\rho}^2 \lesssim \widetilde{G}_\sigma(f^\star) - \widetilde{G}_\sigma(f_{\mathbf{z},\sigma}) + c_\epsilon\sigma^{-\theta_\epsilon} \lesssim \widetilde{G}_\sigma(f^\star) - \widetilde{G}_\sigma(f_{\mathcal{H}}) + \gamma_1 \lesssim \|f_{\mathcal{H}} - f^\star\|_{2,\rho}^2 + \gamma_1.$$
The proof is completed by noting that $\gamma_1 \lesssim \log(2/\delta)\,\Psi(n, \epsilon, \sigma)$.

A.4 Proof of Theorem 9
Theorem 9 can be proved analogously to Theorem 7. We omit the details.
A.5 Proof of Theorem 11
Proof.
Since $p_\sigma(t)$ is the kernel of the distribution of $Y - f^\star(X)$, there exists a constant $c_\sigma > 0$ such that
$$\int_{-\infty}^{+\infty}c_\sigma p_\sigma(t)\,\mathrm{d}t = 1. \qquad (25)$$
Further, by the assumption that $Y - f^\star(X)$ is symmetric and independent of $X$, we know $p_\sigma(t)$ is an even function and the density of $Y|x$ is $p_{Y|x}(y) = c_\sigma p_\sigma(y - f^\star(x))$ for all $x \in \mathcal{X}$. Consider the function
$$V(s) = \int_{-\infty}^{+\infty}p_\sigma(t - s)\,p_\sigma(t)\,\mathrm{d}t.$$
Then we see that
$$\begin{aligned} G_\sigma(f) &= \int_{\mathcal{X}}\int_{-\infty}^{+\infty}p_\sigma(y - f(x))\,c_\sigma p_\sigma(y - f^\star(x))\,\mathrm{d}y\,\mathrm{d}\rho_X(x) \\ &= c_\sigma\int_{\mathcal{X}}\int_{-\infty}^{+\infty}p_\sigma\big(t - [f(x) - f^\star(x)]\big)\,p_\sigma(t)\,\mathrm{d}t\,\mathrm{d}\rho_X(x) = c_\sigma\int_{\mathcal{X}}V\big(f(x) - f^\star(x)\big)\,\mathrm{d}\rho_X(x). \end{aligned}$$
Let $\widehat{p}_\sigma$ denote the Fourier transform of $p_\sigma$, defined by
$$\widehat{p}_\sigma(\xi) = \int_{-\infty}^{+\infty}p_\sigma(t)\,e^{i\xi t}\,\mathrm{d}t.$$
Since $p_\sigma$ is even, $\widehat{p}_\sigma$ must be real. The Plancherel theorem tells us that
$$V(s) = \frac{1}{2\pi}\int_{-\infty}^{+\infty}\big(\widehat{p}_\sigma(\xi)\big)^2 e^{i\xi s}\,\mathrm{d}\xi = \frac{1}{2\pi}\int_{-\infty}^{+\infty}\big(\widehat{p}_\sigma(\xi)\big)^2\cos(\xi s)\,\mathrm{d}\xi.$$
It is obvious that $V$ achieves its maximum when $s = 0$, which implies that $f^\star$ is a global maximizer of $G_\sigma(f)$. Next let us write
$$\begin{aligned} G_\sigma(f^\star) - G_\sigma(f) &= c_\sigma\int_{\mathcal{X}}\big[V(0) - V(f(x) - f^\star(x))\big]\,\mathrm{d}\rho_X(x) \\ &= \frac{c_\sigma}{2\pi}\int_{\mathcal{X}}\int_{-\infty}^{+\infty}\big(\widehat{p}_\sigma(\xi)\big)^2\Big(1 - \cos\big(\xi(f(x) - f^\star(x))\big)\Big)\,\mathrm{d}\xi\,\mathrm{d}\rho_X(x) \\ &= \frac{c_\sigma}{\pi}\int_{\mathcal{X}}\int_{-\infty}^{+\infty}\big(\widehat{p}_\sigma(\xi)\big)^2\sin^2\Big(\frac{\xi(f(x) - f^\star(x))}{2}\Big)\,\mathrm{d}\xi\,\mathrm{d}\rho_X(x). \end{aligned}$$
For any $x \in \mathcal{X}$, $|f(x) - f^\star(x)| \le 2M$. When $|\xi| \le \frac{\pi}{2M}$, from Jordan's inequality,
$$\sin^2\Big(\frac{\xi(f(x) - f^\star(x))}{2}\Big) \ge \Big(\frac{\xi(f(x) - f^\star(x))}{\pi}\Big)^2.$$
As a result,
$$G_\sigma(f^\star) - G_\sigma(f) \ge \frac{c_\sigma}{\pi^3}\int_{\mathcal{X}}\int_{-\frac{\pi}{2M}}^{\frac{\pi}{2M}}\xi^2\big(\widehat{p}_\sigma(\xi)\big)^2\big(f(x) - f^\star(x)\big)^2\,\mathrm{d}\xi\,\mathrm{d}\rho_X(x) = C_\sigma\int_{\mathcal{X}}\big(f(x) - f^\star(x)\big)^2\,\mathrm{d}\rho_X(x),$$
where
$$C_\sigma = \frac{c_\sigma}{\pi^3}\int_{-\frac{\pi}{2M}}^{\frac{\pi}{2M}}\xi^2\big(\widehat{p}_\sigma(\xi)\big)^2\,\mathrm{d}\xi.$$
Note that (25) implies $\widehat{p}_\sigma(0) = 1/c_\sigma > 0$. This in combination with the continuity of $\widehat{p}_\sigma$ tells $C_\sigma > 0$. This proves (12).

When $p_\sigma$ is strongly mean-calibrated, the monotonicity of $\psi$ and the $L_1$-Lipschitz property of $\psi(t)$ w.r.t. $\sqrt{t}$ imply that $p'_\sigma(t)$ exists almost everywhere and is odd, non-positive on $(0, +\infty)$, and bounded. Note further
$$\int_{-\infty}^{+\infty}|p'_\sigma(t)|\,\mathrm{d}t = -2\int_0^{+\infty}p'_\sigma(t)\,\mathrm{d}t = 2p_\sigma(0).$$
We obtain that
$$V'(s) = -\int_{-\infty}^{+\infty}p'_\sigma(t - s)\,p_\sigma(t)\,\mathrm{d}t = -\int_{-\infty}^{+\infty}p'_\sigma(t)\,p_\sigma(t + s)\,\mathrm{d}t$$
is $\frac{2p_\sigma(0)L_1}{\sigma}$-Lipschitz and $V'(0) = 0$. Therefore, for each $x \in \mathcal{X}$, there exists a number $a_x$ lying between $0$ and $f(x) - f^\star(x)$ such that
$$V(0) - V\big(f(x) - f^\star(x)\big) = -V'(a_x)\big(f(x) - f^\star(x)\big) = \big(V'(0) - V'(a_x)\big)\big(f(x) - f^\star(x)\big) \le \frac{2p_\sigma(0)L_1}{\sigma}|a_x|\,|f(x) - f^\star(x)| \le \frac{2p_\sigma(0)L_1}{\sigma}\,|f(x) - f^\star(x)|^2.$$
This implies the assertion in (13) with $C'_\sigma = \frac{2c_\sigma p_\sigma(0)L_1}{\sigma}$.

Acknowledgement

This work was partially supported by the Simons Foundation Collaboration Grant. The authors can be reached at [email protected] and [email protected], respectively. The two authors made equal contributions to this paper and are listed alphabetically.
References

[1] David F. Andrews. A robust method for multiple linear regression. Technometrics, 16(4):523–531, 1974.
[2] Leah Bar, Nahum Kiryati, and Nir Sochen. Image deblurring in the presence of impulsive noise. International Journal of Computer Vision, 70(3):279–298, 2006.
[3] Albert E. Beaton and John W. Tukey. The fitting of power series, meaning polynomials, illustrated on band-spectroscopic data. Technometrics, 16(2):147–185, 1974.
[4] Vasileios Belagiannis, Christian Rupprecht, Gustavo Carneiro, and Nassir Navab. Robust optimization for deep regression. In ICCV, 2015.
[5] Michael J. Black and Anand Rangarajan. On the unification of line processes, outlier rejection, and robust statistics with applications in early vision. International Journal of Computer Vision, 19(1):57–91, 1996.
[6] Maria Caterina Bramati and Christophe Croux. Robust estimators for the fixed effects panel data model. The Econometrics Journal, 10(3):521–540, 2007.
[7] Ali Can, Charles V. Stewart, and Badrinath Roysam. Robust hierarchical algorithm for constructing a mosaic from images of the curved human retina. In CVPR, 1999.
[8] Le Chang, Steven Roberts, and Alan Welsh. Robust lasso regression using Tukey's biweight criterion. Technometrics, 60(1):36–47, 2018.
[9] Avishek Chatterjee and Venu Madhav Govindu. Robust relative rotation averaging. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):958–972, 2017.
[10] Badong Chen, Xin Wang, Na Lu, Shiyuan Wang, Jiuwen Cao, and Jing Qin. Mixture correntropy for robust learning. Pattern Recognition, 79:318–327, 2018.
[11] Yanbo Chen, Jin Ma, Pu Zhang, Feng Liu, and Shengwei Mei. Robust state estimator based on maximum exponential absolute value. IEEE Transactions on Smart Grid, 8(4):1537–1544, 2015.
[12] Tat-Jun Chin, Zhipeng Cai, and Frank Neumann. Robust fitting in computer vision: Easy or hard? International Journal of Computer Vision, pages 1–13, 2019.
[13] Tat-Jun Chin and David Suter. The maximum consensus problem: recent algorithmic advances. Synthesis Lectures on Computer Vision, 7(2):1–194, 2017.
[14] Kenneth L. Clarkson, Ruosong Wang, and David P. Woodruff. Dimensionality reduction for Tukey regression. arXiv preprint arXiv:1905.05376, 2019.
[15] Felipe Cucker and Ding-Xuan Zhou. Learning Theory: An Approximation Theory Viewpoint. Cambridge University Press, 2007.
[16] Fernando De la Torre, Shaogang Gong, and Stephen McKenna. View-based adaptive affine tracking. In European Conference on Computer Vision, pages 828–842. Springer, 1998.
[17] John E. Dennis Jr and Roy E. Welsch. Techniques for nonlinear least squares and robust regression. Communications in Statistics—Simulation and Computation, 7(4):345–359, 1978.
[18] Deniz Erdogmus and Jose C. Principe. Comparison of entropy and mean square error criteria in adaptive system training using higher order statistics. In Proceedings of the 7th International Conference on Independent Component Analysis and Signal Separation, pages 75–90. Berlin: Springer-Verlag, 2000.
[19] Jun Fan, Ting Hu, Qiang Wu, and Ding-Xuan Zhou. Consistency analysis of an empirical minimum error entropy algorithm. Applied and Computational Harmonic Analysis, 41(1):164–189, 2016.
[20] Yunlong Feng. New insights into learning with correntropy based regression. Neural Computation, in press, 2020.
[21] Yunlong Feng, Jun Fan, and Johan A.K. Suykens. A statistical learning approach to modal regression. Journal of Machine Learning Research, 21(2):1–35, 2020.
[22] Yunlong Feng, Xiaolin Huang, Lei Shi, Yuning Yang, and Johan A.K. Suykens. Learning with the maximum correntropy criterion induced losses for regression. Journal of Machine Learning Research, 16:993–1034, 2015.
[23] Yunlong Feng and Qiang Wu. Learning under (1+ǫ)-moment conditions. Applied and Computational Harmonic Analysis, 49(2):495–520, 2020.
[24] Yunlong Feng and Yiming Ying. Learning with correntropy-induced losses for regression with mixture of symmetric stable noise. Applied and Computational Harmonic Analysis, 48(2):795–810, 2020.
[25] Stuart Geman and Donald E. McClure. Bayesian image analysis: An application to single photon emission tomography. In Proceedings of the American Statistical Association, pages 12–18, 1985.
[26] Peter J. Green. Iteratively reweighted least squares for maximum likelihood estimation, and some robust and resistant alternatives. Journal of the Royal Statistical Society: Series B, 46(2):149–170, 1984.
[27] Xin Guo, Ting Hu, and Qiang Wu. Distributed minimum error entropy algorithms. Journal of Machine Learning Research, 21(126):1–31, 2020.
[28] Frank R. Hampel. The influence curve and its role in robust estimation. Journal of the American Statistical Association, 69(346):383–393, 1974.
[29] Melvin J. Hinich and Prem P. Talwar. A simple method for robust regression. Journal of the American Statistical Association, 70(349):113–119, 1975.
[30] Ting Hu, Jun Fan, Qiang Wu, and Ding-Xuan Zhou. Learning theory approach to minimum error entropy criterion. Journal of Machine Learning Research, 14:377–397, 2013.
[31] Ting Hu, Qiang Wu, and Ding-Xuan Zhou. Convergence of gradient descent for minimum error entropy principle in linear regression. IEEE Transactions on Signal Processing, 64(24):6571–6579, 2016.
[32] Ting Hu, Qiang Wu, and Ding-Xuan Zhou. Distributed kernel gradient descent algorithm for minimum error entropy principle. Applied and Computational Harmonic Analysis, to appear, 2019.
[33] Daiki Ikami, Toshihiko Yamasaki, and Kiyoharu Aizawa. Fast and robust estimation for unit-norm constrained linear fitting problems. In CVPR, 2018.
[34] Wen Jiang, Nikos Kolotouros, Georgios Pavlakos, Xiaowei Zhou, and Kostas Daniilidis. Coherent reconstruction of multiple humans from a single image. In CVPR, 2020.
[35] Konrad Paul Körding and Daniel M. Wolpert. The loss function of sensorimotor learning. Proceedings of the National Academy of Sciences, 101(26):9839–9842, 2004.
[36] Fabien Lauer. On the exact minimization of saturated loss functions for robust regression and subspace estimation. Pattern Recognition Letters, 112:317–323, 2018.
[37] Myoung-Jae Lee. Mode regression. Journal of Econometrics, 42(3):337–349, 1989.
[38] Thomas Leonard and John S. J. Hsu. Bayesian Methods: An Analysis for Statisticians and Interdisciplinary Researchers. Cambridge University Press, 2001.
[39] Tzu-Ying Liu and Hui Jiang. Minimizing sum of truncated convex functions and its applications. Journal of Computational and Graphical Statistics, 28(1):1–10, 2019.
[40] Weifeng Liu, Puskal P. Pokharel, and José C. Príncipe. Correntropy: properties and applications in non-Gaussian signal processing. IEEE Transactions on Signal Processing, 55(11):5286–5298, 2007.
[41] Philip M. Long and Rocco A. Servedio. Random classification noise defeats all convex potential boosters. Machine Learning, 78(3):287–304, 2010.
[42] Peter Meer, Doron Mintz, Azriel Rosenfeld, and Dong Yoon Kim. Robust regression methods for computer vision: A review. International Journal of Computer Vision, 6(1):59–70, 1991.
[43] Christophoros Nikou, Fabrice Heitz, and Jean-Paul Armspach. Robust registration of dissimilar single and multimodal images. In ECCV, 1998.
[44] José C. Príncipe. Information Theoretic Learning: Renyi's Entropy and Kernel Perspectives. Springer Science & Business Media, 2010.
[45] Lorenzo Rosasco, Ernesto De Vito, Andrea Caponnetto, Michele Piana, and Alessandro Verri. Are loss functions all the same? Neural Computation, 16(5):1063–1076, 2004.
[46] David W. Scott. Multivariate Density Estimation: Theory, Practice, and Visualization. John Wiley & Sons, 2015.
[47] Sohil Atul Shah and Vladlen Koltun. Robust continuous clustering. Proceedings of the National Academy of Sciences, 114(37):9814–9819, 2017.
[48] Yiyuan She and Art B. Owen. Outlier detection using nonconvex penalized regression. Journal of the American Statistical Association, 106(494):626–639, 2011.
[49] Fred A. Spiring. The reflected normal loss function. Canadian Journal of Statistics, 21(3):321–330, 1993.
[50] Ingo Steinwart. How to compare different loss functions and their risks. Constructive Approximation, 26(2):225–287, 2007.
[51] Ingo Steinwart and Andreas Christmann. Support Vector Machines. Springer, New York, 2008.
[52] Charles V. Stewart, Kishore Bubna, and Amitha Perera. Estimating model parameters and boundaries by minimizing a joint, robust objective function. In CVPR, 1999.
[53] John W. Tukey. A survey of sampling from contaminated distributions. Contributions to Probability and Statistics, 2:448–485, 1960.
[54] Xueqin Wang, Yunlu Jiang, Mian Huang, and Heping Zhang. Robust variable selection with exponential squared loss. Journal of the American Statistical Association, 108(502):632–643, 2013.
[55] Lionel Weiss. Estimation with a Gaussian gain function. Statistics & Decisions, supplement issue:47–59, 1984.
[56] Lionel Weiss. Estimating normal means with symmetric gain functions. Statistics & Probability Letters, 6(1):7–9, 1987.
[57] Yaser Yacoob and Larry Davis. Tracking rigid motion using a compact-structure constraint. In ICCV, 1999.
[58] Yaser Yacoob and Larry S. Davis. Learned models for estimation of rigid and articulated human motion from stationary or moving camera. International Journal of Computer Vision, 36(1):5–30, 2000.
[59] Jim J. Yang and John W. Van Ness. Breakdown points for redescending M-estimates of location.