Distribution-Free Robust Linear Regression
Jaouad Mourtada ∗ Tomas Vaškevičius † Nikita Zhivotovskiy ‡ February 26, 2021
Abstract
We study random design linear regression with no assumptions on the distribution of the covariates and with a heavy-tailed response variable. When learning without assumptions on the covariates, we establish boundedness of the conditional second moment of the response variable as a necessary and sufficient condition for achieving the deviation-optimal excess risk rate of convergence. In particular, combining the ideas of truncated least squares, median-of-means procedures and aggregation theory, we construct a non-linear estimator achieving excess risk of order $d/n$ with the optimal sub-exponential tail. While the existing approaches to learning linear classes under heavy-tailed distributions focus on proper estimators, we highlight that the improperness of our estimator is necessary for attaining non-trivial guarantees in the distribution-free setting considered in this work. Finally, as a byproduct of our analysis, we prove an optimal version of the classical bound for the truncated least squares estimator due to Györfi, Kohler, Krzyzak, and Walk.
Keywords.
Least squares, random design linear regression, robust estimation, improper learning, median-of-means tournaments
In a typical random design regression setup, one is given access to n input-output pairs ( X i , Y i ) ∈ R d × R sampled i.i.d. from some unknown distribution P . We call any function g : R d → R a predictor and measure its quality via the expected mean squared error R ( g ) = E ( g ( X ) − Y ) , alsocalled risk . Based on the observed sample S n = ( X i , Y i ) ni =1 , we aim to construct a predictor b g such that its risk R ( b g ) is small. Since the risk is relative to the problem difficulty, it is customaryto compare it with the best possible risk achievable via some reference class of functions ofinterest, in this work taken to be the set of all linear functions F lin = {h w, ·i : w ∈ R d } .Henceforth, we focus on the random variable E ( b g ) , called the excess risk , defined as E ( b g ) = R ( b g ) − inf g ∈F lin R ( g ) . (1)We assume without loss of generality that the infimum in the above equation is attained by somelinear function h w ∗ , ·i , where w ∗ ∈ R d . Also, we remark that the randomness of the excess risk(1) comes from the fact that b g depends on the random sample S n . In this paper, we study non-asymptotic bounds on E ( b g ) (both in-expectation and high-probability) for appropriate choicesof b g to be specified later.Among the most widely studied statistical estimators b g are variants of linear least squaresalgorithm that return a linear predictor b g ∈ F lin minimizing a suitably regularized empirical ∗ CREST, ENSAE, Institut Polytechnique de Paris, France, [email protected] † Department of Statistics, University of Oxford, United Kingdom, [email protected] ‡ Department of Mathematics, ETH Zürich, Switzerland, [email protected] isk functional R n ( g ) = n P ni =1 ( g ( X i ) − Y i ) . Estimators based on empirical risk minimization(ERM) are known to achieve the optimal d/n excess risk rate in expectation under well-behavedcovariates X and assuming that the noise random variable ξ = Y − h w ∗ , X i is independentof X (see, for example, [9, 11, 58]). The work of Oliveira [61] highlights that the usual sub-Gaussian assumption on the distribution of X can be significantly relaxed in the context oflinear regression. For example, an L – L norm equivalence of the form (cid:0) E h X, w i (cid:1) / κ (cid:0) E h X, w i (cid:1) / , for all w ∈ R d , (2)for some constant κ > is sufficient to achieve the d/n excess risk rate under additional assump-tions on the noise. Indeed, [61] shows that under (2), we have a high-probability control overthe lower tail of the sample covariance matrix, used in the analysis of linear least squares. Thismoment equivalence assumption and its variations have become standard tools in the recentliterature on robust linear regression (see, for example, [41, 32, 14, 36, 45, 17, 62]). However,as several authors have recently pointed out, the kurtosis constant κ satisfying the inequality(2) may introduce implicit dependence on the dimension d (for instance, for linear aggregationover histogram bases), leading to suboptimal bounds [14, 61, 41, 65]. In fact, in some casesthis behavior is inherent to empirical risk minimization, which has recently been shown by thesecond and the third authors of this paper [73] to incur suboptimal excess risk rates even ina favorable setup where both X and Y are almost surely bounded. Thus, a natural questionemerges, of how much further can the assumption (2) be relaxed and, consequently, what wouldbe a corresponding minimal assumption on the distribution of response variable Y ? 
It is notclear a priori whether non-trivial excess risk guarantees are possible at all without imposing anyassumptions on X .To better understand our goals, we now turn to a recent body of work initiated by Catoni [13]concerning the design and analysis of statistical estimators robust to heavy-tailed distributions.The ERM strategy is known to fail in this setting due to its sensitivity to the large fluctuationsand the presence of atypical samples arising from heavy-tailed distributions. Thus, differenttechniques and procedures are required to handle such distributions. Following [45] we call theexcess risk E ( b g ) the accuracy of an estimator b g ; the confidence of b g for an error rate of ε is equalto P ( E ( b g ) ε ) . The line of work concerning heavy-tailed distributions aims to design proceduresexhibiting optimal accuracy/confidence trade-off under minimal distributional assumptions. Inthe context of linear regression, the optimal trade-off is usually achieved via the bounds on E ( b g ) of order ( d + log(1 /δ )) /n that hold with probability at least − δ ; in particular, such boundsmatch the performance of ERM for sub-Gaussian distributions. Using either the PAC-Bayesiantruncations [14] or the median-of-means tournaments [45], it has been shown that the optimalaccuracy/confidence trade-off can be achieved under the L – L moment equivalence assumption(2) together with some additional assumptions on the noise variable ξ = Y −h w ∗ , X i . We remarkthat the existing procedures for heavy-tailed linear regression focus on outputting a predictorthat belongs to the class F lin . However, as we shall shortly explain, any such proper procedurefails in our distribution-free setting. We can now formulate the question studied in this paper.Is it possible to predict as well as the best linear predictor in F lin without any assumptionson the distribution of the covariates X , while maintaining the optimal accuracy/confidencetrade-off? If so, what is the minimal assumption on the response variable Y allowing this?Somewhat independently of the above literature that focuses on robustness to heavy-tails,there exist two results in the literature that provide non-asymptotic excess risk guaranteeswithout any assumptions on X , albeit only in expectation. We remark that both of these results2ely on improper statistical estimators, meaning that they both output predictors outside of theclass F lin . Of course, once all the assumptions on X are dropped, the conditional distribution of Y given X consisting of a probability kernel ( P Y | X = x ) x ∈ R d needs to be restricted. We now statethe only assumption considered in our work; note that it is satisfied when Y is bounded, butalso allows this variable to possess heavy tails (including infinite (2 + ε ) -th moment for ε > ). Assumption 1.
The conditional distribution of $Y$ given $X$ satisfies, for some $m > 0$,
$$\sup_{x\in\mathbb{R}^d} \mathbb{E}\big[Y^2 \mid X = x\big] \le m^2.$$

The first result not involving any explicit restrictions on the distribution of $X$ is a classical bound for the truncated linear least squares estimator $\hat g_m$ due to Györfi, Kohler, Krzyzak, and Walk [28, Theorem 11.3] (we defer the exact definition to Section 3). Under Assumption 1, their result states that
$$\mathbb{E} R(\hat g_m) - \inf_{g\in\mathcal F_{\mathrm{lin}}} R(g) \le \frac{c\,m^2 d(\log n + 1)}{n} + 7\Big(\inf_{f\in\mathcal F_{\mathrm{lin}}} R(f) - R(f_{\mathrm{reg}})\Big). \qquad (3)$$
Here the expectation is taken with respect to the random sample $S_n$, $c > 0$ is an absolute constant and $f_{\mathrm{reg}}$ is the regression function given by $f_{\mathrm{reg}}(x) = \mathbb{E}[Y \mid X = x]$. The bound (3) is a standard benchmark for several communities. Applications of this result are known in mathematical finance [79], optimal control [8] and variance reduction [26, 27]; there are known improvements of this result under different assumptions [19, 20].

The second bound does not depend on the distribution of $X$ and is due to Forster and Warmuth [24]; this estimator originates in the online learning literature and is obtained via a modification of the renowned non-linear Vovk-Azoury-Warmuth forecaster [75, 6]. The Forster-Warmuth estimator, denoted by $\hat g_{\mathrm{FW}}$, satisfies the following expected excess risk bound:
$$\mathbb{E} R(\hat g_{\mathrm{FW}}) - \inf_{g\in\mathcal F_{\mathrm{lin}}} R(g) \lesssim \frac{\|Y\|_{L_\infty}^2\, d}{n}. \qquad (4)$$
Of course, the assumption $\|Y\|_{L_\infty} \le m$ is stronger than Assumption 1. However, an inspection of the proof in [24] shows that Assumption 1 suffices to obtain the above in-expectation performance of this algorithm with $\|Y\|_{L_\infty}$ replaced by $m$.

We are now ready to present informal statements of our main findings. In our first result, we prove that the term $\inf_{f\in\mathcal F_{\mathrm{lin}}} R(f) - R(f_{\mathrm{reg}})$ as well as the excess $\log n$ factor appearing in the bound (3) for the truncated linear least squares estimator can be removed.

Theorem A (Informal). Suppose that Assumption 1 holds and let $\hat g_m$ denote the truncated least squares estimator of Györfi, Kohler, Krzyzak, and Walk. Then, we have
$$\mathbb{E} R(\hat g_m) - \inf_{g\in\mathcal F_{\mathrm{lin}}} R(g) \lesssim \frac{m^2 d}{n}.$$
Moreover, Assumption 1 ensures the same guarantee for the Forster-Warmuth estimator $\hat g_{\mathrm{FW}}$.

One may notice that even though the bound of Theorem A scales as $d/n$, the usual dependence on the variance of the noise variable, as in, for example, [9], is replaced by the dependence on $m^2$. It can be shown (see Proposition 2) that if only Assumption 1 holds, then the dependence on $m^2$ is unavoidable in general, even if the problem is noise-free so that the variance of the noise is equal to zero. Moreover, if we only impose Assumption 1, then any statistical estimator that selects predictors from $\mathcal F_{\mathrm{lin}}$ (such an estimator is called proper) is bound to fail. This fact can be established using the recent result of Shamir [66, Theorem 3], and it remains true even when $d = 1$ and the response variable $Y$ is bounded almost surely. This observation separates our setup from the existing literature, where only proper estimators are studied for convex classes such as $\mathcal F_{\mathrm{lin}}$ even in the heavy-tailed scenarios (see, for example, [14, 45, 51, 52]).

The bounds of Theorem A certify that the expected excess risk goes down to zero under Assumption 1. However, as discussed above, we aim to provide high-probability upper bounds on the excess risk with logarithmic dependence on the confidence level $\delta$.
It is not unreasonableto expect that the in-expectation guarantees of either b g m or b g FW transfer to analogous high-probability bounds, at least whenever Y is bounded almost surely. Our second result shows thatthis is, unfortunately, not the case. Indeed, both algorithms fail to achieve high probability upperbounds in a strong sense: they both fail with constant probability. It is possible, since neither b g m nor b g FW belong to the linear class F lin , and the random variable R ( b g ) − inf g ∈F lin R ( g ) cantake negative values. Consequently, Markov’s inequality cannot be applied to obtain deviationinequalities as m d/n · /δ for a given confidence level δ . Theorem B (Informal) . Let b g denote either b g m or b g FW . There exist universal constants p ∈ (0 , , c > such that the following holds. For any d > and m > , there exists a distribution P satisfying k Y k L ∞ ( P ) m such that, with probability at least p , R ( b g ) − inf g ∈F lin R ( g ) > c m . The above theorem brings forth the question of whether achieving high-probability guaran-tees in our distribution-free setting is possible. Indeed, all known high-probability guaranteeson linear aggregation problems impose some restrictions on X . We show that there is, in fact,a procedure that achieves an optimal excess risk bound (up to a logarithmic factor) with asub-exponential tail. The following theorem is the main positive result of our paper. Theorem C (Informal) . Suppose that Assumption 1 holds. There exists an absolute constant c > such that the following holds. For any confidence level δ ∈ (0 , , there exists an improperestimator b g (depending on δ and m ) such that P (cid:18) R ( b g ) − inf g ∈F lin R ( g ) c m ( d log( n/d ) + log(1 /δ )) n (cid:19) > − δ. The above theorem demonstrates that robust learning of linear classes is possible with norestrictions on the distribution of X and under weak assumptions on the conditional distributionsof the response variable Y that allow for modelling heavy tails settings. Moreover, we show inSection 6 that Assumption 1 is necessary to obtain any non-trivial guarantee in our setup.The estimator of Theorem C naturally leverages the ideas of the analysis of truncated linearfunctions [28, Chapter 11], skeleton estimators [21, Section 28.3], the deviation optimal modelselection aggregation procedures [2, 40, 51], min-max estimators [5, 39], and the median-of-meanstournaments [45]. An extended discussion is deferred to Section 5. • In Section 2, we discuss known results on distribution-free learning of linear classes.• In Section 3, we show that the classical bound of Györfi, Kohler, Krzyzak, and Walk [28,Theorem 11.3] for the truncated linear least squares estimator can be improved to achievethe optimal m d/n bound in expectation.4 In Section 4, we establish that the truncated least squares and Forster-Warmuth estimatorsare both deviation-suboptimal. In particular, we construct a distribution with almostsurely bounded response variable Y , under which both estimators incur an excess risk oforder m with constant probability.• Section 5 is split into three parts. In Section 5.1, we consider a simplified setting witha known covariance structure. Combining Tsybakov’s projection estimator [67] with therobust mean estimator of Lugosi and Mendelson [43], we construct an estimator attainingthe optimal rate d/n with the optimal dependence on the confidence parameter. 
In Sec-tion 5.2, we drop the simplifying assumption of known covariance structure and presentour main positive result – a distribution-free deviation-optimal estimator robust to heavy-tailed responses. In Section 5.3, we discuss possible extensions of this result. In particular,we show that a simplification to our linear regression procedure yields an estimator withdeviation-optimal rates for heavy-tailed model selection aggregation under Assumption 1.• Section 6 is devoted to establishing the necessity of Assumption 1. We show, in particular,that if E [ Y | X ] is unbounded, no estimator can achieve non-trivial excess risk guarantees.In addition, we establish that the dependence on m in our upper bounds is unavoidable.• Section 7 contains deferred proofs of lemmas appearing in the previous sections. Analysis of least squares estimators.
The most standard approach to regression problemsis the least squares principle, where one selects the predictor achieving the best fit to datawithin some predefined class of functions. A large body of work is devoted to analyzing andobtaining guarantees on its performance, in its most classical form, relying on the fact that theempirical risk provides a good approximation of its population counterpart. This is typicallyestablished when the underlying distribution is sufficiently well-behaved (for instance, boundedor light-tailed), using tools from empirical process theory. For this point of view to statisticallearning, we refer to the standard textbooks [70, 48, 38, 76].A recent line of research has established that empirical minimization can perform well undersignificantly weaker assumptions. Our starting point is the work of Oliveira [61], where in thecontext of linear regression the usual sub-Gaussian assumption on X is replaced by a signifi-cantly weaker L – L moment equivalence assumption (2). In particular, such an assumptiondoes not even require the existence of any moments of X higher than the fourth. Variationsof this assumption have became the standard tool in the recent literature on linear regression[41, 32, 14, 45, 36, 58, 17, 62]. The seminal work of Mendelson [50] introduced a more gen-eral condition, called the small-ball assumption. In most of the above-mentioned papers, theanalysis is performed for empirical risk minimization, which usually does not lead to the opti-mal accuracy/confidence trade-off. The papers [4, 5] provide the optimal confidence for ERM,albeit under stronger moment equivalence assumptions than that of (2). The L – L momentequivalence is also important in the robust covariance estimation problem [14, 55].It has been recently observed that the absolute constants involved in the moment equivalenceand the small-ball assumptions can behave badly in some cases. First, Saumard [65] shows thatthe small-ball condition is unsuitable for some important classes leading to suboptimal perfor-mance of ERM. Further, the work [14] (see also the discussion in [61] and [41]) discusses thatthe kurtosis constant κ in the moment assumptions similar to (2) can depend on the dimensionand affect the bounds negatively. The recent paper [73] shows this suboptimal behavior in thecontext of linear regression, even in a favorable setup where both X and Y are bounded. Thereis a growing interest in further relaxing these assumptions and refining the underlying methods565, 15, 53, 52, 18, 58, 54]. In particular, the works [15, 54] replace moment equivalence assump-tions by the bounds on the L p moments for p > . This is closer to the setting we are aimingfor in this paper. Robustness to heavy-tailed distributions.
In a broad sense, robustness encompasses thestudy and design of statistical estimation procedures exhibiting certain stability properties underthe existence of “outlier” points in the observed sample. For a classical perspective on robustness,originating from the work of Tukey [68] and building on the ideas of contaminated models,influence functions and breakdown points, we refer to the standard books [29, 33, 64].In contrast to the classical perspective, our work falls within the recent body of work initi-ated by Catoni [13], where the term robustness is to be understood specifically as robustness toheavy-tailed distributions (rather than, for example, adversarial contamination of the sample).The starting point of this direction is the question of mean estimation, where informally, oneaims to construct statistical estimators performing as well as the sample mean does for Gaussiansamples, all while making as weak distributional assumptions as possible. Several ways of con-structing such estimators (called sub-Gaussian estimators) have been proposed in the literature.The most widespread approach is based on the median-of-means estimators, which appear firstindependently in [60, 34, 1] and were further developed in the works of [56, 22, 44, 46]. Othertechniques include the Catoni’s estimator and its extensions [13, 15] or the trimmed means[47]. We refer to the survey [43] for further details and references. For a complementary surveyfocusing on the computational aspects see [23].The central ideas behind the robust mean estimation found their applications in many relatedproblems such as regression [32, 10, 18, 45, 39, 18, 57, 52], covariance estimation [14, 55] andclustering [10, 37]. In the context of linear regression, the first works showing the optimalaccuracy/confidence trade-off under weak assumptions are attributed to Audibert and Catoni[4, 5] and were further extended in [14, 15]; these papers are based on PAC-Bayesian truncations.
Distribution-free linear regression.
Distribution-free non-asymptotic excess risk boundstake their roots in the PAC-learning framework [72, 69], where historically the binary loss isstudied the most. Because of its boundedness, excess risk bounds in such setups can be ob-tained without any assumptions on the distribution of ( X, Y ) . In the context of non-parametricregression with the squared loss, only asymptotic consistency results are possible under trulyminimal assumptions on the underlying distribution (see the book [28]). In fact, the standardnotions of universal consistency [28, Section 1.6 and Chapter 10] involve only the assumption E Y < ∞ and no assumptions on the distribution of X . The distribution-free nature of thisassumption is one of our motivations. A notable non-asymptotic result in this direction is [28,Theorem 11.3], where an inexact oracle inequality (3) is proved without any explicit assumptionson the distribution of X .Another direction originates from the online learning literature (see [16] for background onthis topic). For instance, when both X and Y are bounded, the renowned Vovk-Azoury-Warmuthforecaster [75, 6] can be used to provide excess risk bounds of order d/n in our setup even whenthe aforementioned moment equivalence constants behave badly with respect to the dimension.This observation has been recently explored in [73]. For linear regression, the Forster-Warmuthalgorithm [24], which is in turn a modification of the Vovk-Azoury-Warmuth forecaster, leadsto the only known exact oracle inequality without imposing any assumptions on X .6 .3 Notation We now set the notation. We let P = P ( X,Y ) be the joint distribution on R d × R (with d > ) ofa random pair ( X, Y ) . The joint distribution P itself can be decomposed into two components,namely the marginal distribution P X of X (a distribution on R d ), as well as the conditionaldistribution of Y given X , consisting of a (measurable) probability kernel ( P Y | X = x ) x ∈ R d , wherefor x ∈ R d , P Y | X = x is a distribution on R .For a real random variable Z and p > , we denote k Z k L p = E [ | Z | p ] /p , while for a measurablefunction f : R d → R , we set k f k L p = k f k L p ( P X ) = k f ( X ) k L p .The risk of a measurable function f : R d → R is by definition R ( f ) = E ( f ( X ) − Y ) = k f ( X ) − Y k L . It is known that the risk is minimized by the regression function f reg given by f reg ( x ) = E [ Y | X = x ] = R R yP Y | X = x (d y ) .Absolute constants are denoted by c, c , . . . and may change from line to line. For a realsquare matrix A let Tr( A ) denote its trace, k A k op its operator norm, A T its transpose and A † its Moore–Penrose inverse. In what follows, h· , ·i denotes the canonical inner product in R d and k·k stands for the Euclidean norm. For any two functions (or random variables) f, g the symbol f . g (or g & f ) means that there is an absolute constant c such that f cg on the entiredomain. For a pair of symmetric matrices A, B , the symbol A B means that B − A is positivesemi-definite.We consider the class F lin = {h w, ·i : w ∈ R d } of linear functions. Throughout, our assump-tions will imply that R (0) = E Y is finite (regardless of P X ); hence, so is the minimal risk in F lin , namely inf f ∈F lin R ( f ) is finite. In this case, for f ∈ F lin given by f ( x ) = h w, x i its risk R ( f ) is finite if and only if kh w, X ik L < + ∞ , and the set of such w ∈ R d is a subspace of R d , which coincides with R d itself if and only if E k X k < + ∞ . 
When the latter conditionholds, one can define the covariance of X as Cov( X ) = E ( X − E X )( X − E X ) T and the Grammatrix of X as Σ = E XX T ; the minimizers f of the risk in F lin are then the functions h w, ·i ,where w are solutions of the equation Σ w = E [ Y X ] . The last quantity is well-defined since E | Y |k X k k Y k L E [ k X k ] / .Given the observed sample S n = ( X i , Y i ) ni =1 , the aim is to construct a predictor (usuallycalled an estimator) b g such that its risk R ( b g ) is small. A learning procedure is a measurablefunction mapping a sample in ( R d × R ) n to a measurable function R d → R . In what follows,we avoid measurability issues and use a standard convention that all events appearing in theprobabilistic statements are measurable. Given a sample S n = ( X i , Y i ) ni =1 , we usually write b g for the function b g ( S n ) . Finally, we remark that since the sample S n is random, the function b g = b g ( S n ) is also random and so is R ( b g ) . In this section, we set the context for the rest of this work, by reviewing relevant existingresults on distribution-free linear prediction, and framing them in our setting (through minormodifications). We remark that the bounds we are about to discuss hold in expectation, whereaswe will also be concerned with high-probability guarantees. As will be seen in Section 4, thedistinction between the two is not innocuous, as existing procedures achieving distribution-freeexpected excess risk bounds do not possess matching guarantees in deviation.
Limitations of proper estimators.
Recall that in the context of our work, a learning procedure is called proper if it always returns an element of the class $\mathcal F_{\mathrm{lin}}$ (that is, a linear function); otherwise, it is called improper or non-linear. The importance of considering improper estimators stems from a fundamental limitation of proper procedures in our distribution-free setting. Specifically, it follows from the work of Shamir [66, Theorem 3] that for any proper estimator $\hat g_{\mathrm{proper}}$, there exists a distribution of $(X, Y)$ with the response variable $Y$ almost surely bounded by $m$, for which
$$\mathbb{E} R(\hat g_{\mathrm{proper}}) - \inf_{g\in\mathcal F_{\mathrm{lin}}} R(g) \gtrsim m^2. \qquad (5)$$
Thus, even when the response is bounded, no proper learning procedure can improve (up to universal constants) over the risk trivially achieved by the zero function, without some restrictions on the distribution of covariates. As discussed in the introduction, this negative result already rules out many procedures introduced and analyzed in the statistical learning and robust estimation literature, including empirical risk minimization and refinements thereof.

Learning with known covariance structure.
We now discuss a simplified setting, in which guarantees can be obtained quite directly. Specifically, assume that the covariance structure of the distribution $P_X$, namely, the map $w \mapsto \mathbb{E}\langle w, X\rangle^2$ (which can take infinite values), is known. As noted in Section 1.3, we can restrict our attention to the linear subspace where the above map takes finite values. Thus, we may assume without loss of generality that the covariance matrix $\Sigma = \mathbb{E} XX^{\mathsf T}$ exists. In addition, up to restricting to the orthogonal complement of the null space $\{w \in \mathbb{R}^d : \Sigma w = 0\}$, we may assume in what follows that the covariance matrix $\Sigma$ is invertible. Hence, the unique minimizer of the risk $R(f)$ in $\mathcal F_{\mathrm{lin}}$ is given by $f^* = \langle w^*, \cdot\rangle$, where $w^* = \Sigma^{-1}\mathbb{E}[YX]$. In addition, the excess risk of any linear function $f = \langle w, \cdot\rangle$ is given by the following identity:
$$R(f) - \inf_{g\in\mathcal F_{\mathrm{lin}}} R(g) = \big\|\Sigma^{1/2}(w - w^*)\big\|^2 \qquad (6)$$
(expanding the square, the cross term vanishes since $\Sigma w^* = \mathbb{E}[YX]$).

The key simplification provided by the knowledge of $\Sigma$ is that random-design linear regression reduces to multivariate mean estimation. To see this, consider the change of variables $\theta = \Sigma^{1/2} w$ and notice that the excess risk (6) is then equal to $\|\theta - \theta^*\|^2$, where $\theta^* = \mathbb{E} U$ for $U = Y\Sigma^{-1/2}X$. Using $\Sigma$, an i.i.d. sample $(X_i, Y_i)_{i=1}^n$ can be turned into an i.i.d. sample $(U_i)_{i=1}^n$, with $U_i = Y_i\Sigma^{-1/2}X_i$ distributed as $U$. One can thus estimate $\mathbb{E} U$ by the sample mean $\frac{1}{n}\sum_{i=1}^n U_i$. This leads to the projection estimator for our original problem, defined as $\hat g_{\mathrm{proj}}(x) = \langle\hat w, x\rangle$ where
$$\hat w = \Sigma^{-1}\cdot\frac{1}{n}\sum_{i=1}^n Y_iX_i. \qquad (7)$$
Under Assumption 1, we have
$$\mathbb{E} UU^{\mathsf T} = \mathbb{E}\big[\mathbb{E}[Y^2 \mid X]\,\Sigma^{-1/2}XX^{\mathsf T}\Sigma^{-1/2}\big] \preceq m^2 I_d,$$
and in particular $\mathrm{Tr}(\mathrm{Cov}(U)) \le m^2 d$ and $\|\mathrm{Cov}(U)\|_{\mathrm{op}} \le m^2$. Applying the first inequality to the empirical mean estimator of $\theta^* = \mathbb{E} U$ leads to the following guarantee for the projection estimator, which corresponds up to minor changes in assumptions to the result of Tsybakov [67, Theorem 4]:
$$\mathbb{E} R(\hat g_{\mathrm{proj}}) - \inf_{g\in\mathcal F_{\mathrm{lin}}} R(g) \le \frac{m^2 d}{n}.$$
It is worth noting that there is no contradiction between the lower bound (5) and the above upper bound. Indeed, the projection estimator, while proper, relies on the a priori knowledge of $\Sigma$, which is unavailable in the typical statistical learning setting. This implies in particular that the knowledge of $\Sigma$ is sufficient to avoid the previous failure of proper procedures. In this work, the simplified setting with known covariance serves as a benchmark that we aim to match in the general case, where nothing is known a priori about the distribution of $X$. (Specifically, [67] assumes that the noise $Y - f_{\mathrm{reg}}(X)$ is independent of $X$, but the same proof applies when replacing this assumption by the conditional moment bound of Assumption 1.)

Upper bounds in expectation via non-linear predictors.

As mentioned in the introduction, and with the exception of the aforementioned known covariance setting, there are two known results stating non-trivial in-expectation guarantees without restrictions on the distribution of $X$. These guarantees are achieved, respectively, by the truncated linear least squares estimator $\hat g_m$ and the Forster-Warmuth estimator $\hat g_{\mathrm{FW}}$, which we now define formally. First, consider the linear least squares estimator $\hat g_{\mathrm{erm}} = \arg\min_{g\in\mathcal F_{\mathrm{lin}}}\hat R_n(g) = \langle\hat w_{\mathrm{erm}}, \cdot\rangle$, where
$$\hat w_{\mathrm{erm}} = \Big(\sum_{i=1}^n X_iX_i^{\mathsf T}\Big)^{\dagger}\Big(\sum_{i=1}^n Y_iX_i\Big) = \hat\Sigma_n^{\dagger}\cdot\frac{1}{n}\sum_{i=1}^n Y_iX_i, \qquad \text{with } \hat\Sigma_n = \frac{1}{n}\sum_{i=1}^n X_iX_i^{\mathsf T}.$$
Given a threshold $m > 0$, the truncated least squares estimator $\hat g_m$ returns the prediction of the linear function $\langle\hat w_{\mathrm{erm}}, \cdot\rangle$, truncated to $[-m, m]$. That is,
$$\hat g_m(x) = \max\big(-m, \min(m, \langle\hat w_{\mathrm{erm}}, x\rangle)\big). \qquad (8)$$
We now turn to the Forster-Warmuth estimator. Given the sample $S_n = (X_i, Y_i)_{i=1}^n$, define the leverage score of a point $x \in \mathbb{R}^d$ by $h_n(x) = \langle(n\hat\Sigma_n + xx^{\mathsf T})^{\dagger}x, x\rangle$. The Forster-Warmuth estimator is then defined by reweighing predictions of $\langle\hat w_{\mathrm{erm}}, \cdot\rangle$ by a function of the statistical leverage of the input point $x$:
$$\hat g_{\mathrm{FW}}(x) = \big(1 - h_n(x)\big)\cdot\langle\hat w_{\mathrm{erm}}, x\rangle. \qquad (9)$$
Recall the guarantees on the risk of these procedures, stated in the introduction. Specifically, under Assumption 1, the truncated least squares estimator satisfies the oracle inequality (3), while the Forster-Warmuth estimator achieves the excess risk bound (4). Note that both procedures are improper, as they introduce non-linearities in the prediction function, either through truncation or through the leverage correction.

As discussed in the recent works [58, 73], the risk of the least squares procedure is large when leverage scores are uneven and correlate with the noise. While this configuration is ruled out under distributional assumptions such as moment equivalences, it can actually happen even under boundedness constraints, leading to poor performance [73]. Both non-linearities partially mitigate the shortcomings of the least squares estimator, by adjusting its predictions at high-leverage points, which are the most unstable and lead to large errors. These corrections allow these procedures to achieve in-expectation bounds, even for unfavorable distributions on which ordinary least squares fails.

As discussed at the end of Section 2, the non-linearities introduced by the truncated least squares and Forster-Warmuth estimators aim to mitigate the instability of ERM predictions at high-leverage points. The more sophisticated Forster-Warmuth procedure (which relies on an explicit leverage correction), however, leads to a better excess risk guarantee. Indeed, the risk guarantee of $\hat g_m$ takes the form of an inexact oracle inequality, suffering from the approximation error term $\inf_{f\in\mathcal F_{\mathrm{lin}}} R(f) - R(f_{\mathrm{reg}})$. This type of guarantee only ensures that the procedure approaches the performance of the best linear function in the nearly well-specified case, where the true regression function is almost linear. While reasonable in low-dimensional nonparametric estimation [28] (with appropriate linear spaces), such an assumption is generally restrictive in high-dimensional problems and it is not satisfied in our setting. Unfortunately, the proof technique employed in [28] can only yield inexact oracle inequalities, and hence, no straightforward modification of their argument can match the guarantee (4) of $\hat g_{\mathrm{FW}}$.

A natural question remains of whether the gap between the existing in-expectation performance guarantees given by (3) and (4) is intrinsic to the estimators $\hat g_m$ and $\hat g_{\mathrm{FW}}$, or whether it is a byproduct of a suboptimal analysis of the performance of the simpler procedure $\hat g_m$. In the theorem below, we show that the truncated least squares estimator indeed matches the statistical performance of the Forster-Warmuth algorithm. Our proof is based on a leave-one-out argument akin to the one used to prove the upper bound (4) in [24, Section 3]. We remark that leave-one-out arguments have a long history; see the references [72, Chapter 6] and [31].
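To make the two definitions concrete, here is a minimal numpy sketch of the estimators (8) and (9); the function names, the pseudo-inverse implementation and the toy data below are our own illustrative choices rather than code from the paper.

```python
import numpy as np

def fit_least_squares(X, Y):
    """Minimal-norm least squares: w_erm = (sum_i X_i X_i^T)^+ (sum_i Y_i X_i)."""
    n = X.shape[0]
    Sigma_hat = X.T @ X / n                       # empirical Gram matrix
    w_erm = np.linalg.pinv(X.T @ X) @ (X.T @ Y)   # Moore-Penrose pseudo-inverse
    return w_erm, Sigma_hat

def predict_truncated_ls(w_erm, x, m):
    """Truncated least squares (8): clip the linear prediction to [-m, m]."""
    return np.clip(x @ w_erm, -m, m)

def predict_forster_warmuth(w_erm, Sigma_hat, n, x):
    """Forster-Warmuth estimator (9): shrink the prediction by the leverage h_n(x)."""
    h = x @ np.linalg.pinv(n * Sigma_hat + np.outer(x, x)) @ x
    return (1.0 - h) * (x @ w_erm)

# Toy usage with heavy-tailed noise.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
Y = X @ np.ones(5) + rng.standard_t(df=2, size=200)
w, S = fit_least_squares(X, Y)
print(predict_truncated_ls(w, X[0], m=10.0), predict_forster_warmuth(w, S, len(Y), X[0]))
```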
Theorem 1. Suppose that Assumption 1 holds and let $\hat g_m$ denote the truncated least squares estimator (8). Then, we have
$$\mathbb{E} R(\hat g_m) - \inf_{f\in\mathcal F_{\mathrm{lin}}} R(f) \le \frac{8\,m^2 d}{n+1}.$$

Proof.
To simplify the presentation, we introduce additional notation. Let S n +1 = ( X i , Y i ) n +1 i =1 denote an i.i.d. sample of size n + 1 . For any j ∈ { , . . . , n + 1 } , let S ( j ) n +1 = ( X i , Y i ) n +1 i =1 ,i = j bethe dataset obtained by removing the j -th sample. On the sample S n +1 (respectively S ( j ) n +1 ), wedefine the minimal norm empirical risk minimizer e g (respectively e g ( j ) ) and its truncated variant e g m (respectively e g ( j ) m ).Since S n +1 is an i.i.d. sample, for every j ∈ { , . . . , n + 1 } , S ( j ) n +1 has the same distributionas S n = S ( n +1) n +1 (so that e g ( j ) m has the same distribution as b g m = e g ( n +1) m ), and is independent of Z j = ( X j , Y j ) . This implies that the expected excess risk of b g m can be bounded as follows: E E ( b g m ) = E S n +1 (cid:0)e g ( n +1) m ( X n +1 ) − Y n +1 (cid:1) − inf g ∈F lin E Z n +1 ( g ( X n +1 ) − Y n +1 ) = E S n +1 (cid:20) n + 1 n +1 X j =1 (cid:16)e g ( j ) m ( X j ) − Y j (cid:17) (cid:21) − inf g ∈F lin E S n +1 (cid:20) n + 1 n +1 X j =1 ( g ( X j ) − Y j ) (cid:21) E S n +1 (cid:20) n + 1 n +1 X j =1 (cid:16)e g ( j ) m ( X j ) − Y j (cid:17) − ( e g ( X j ) − Y j ) (cid:21) , (10)where the last line follows from the definition of e g . Now, define the leverage h j of the point X j among X , . . . , X n +1 by h j = (cid:28)(cid:18) n +1 X i =1 X i X T i (cid:19) † X j , X j (cid:29) ∈ [0 , . An explicit computation—postponed to the end of the proof—shows that for every j , e g ( X j ) = (1 − h j ) e g ( j ) ( X j ) + h j Y j . (11)Plugging (11) into the bound (10), we obtain E E ( b g m ) E (cid:20) n + 1 n +1 X j =1 (cid:16)e g ( j ) m ( X j ) − Y j (cid:17) − (1 − h j ) (cid:16)e g ( j ) ( X j ) − Y j (cid:17) (cid:21) . (12)By Assumption 1 and Jensen’s inequality we have sup x ∈ R d | f reg ( x ) | m . It follows that ( e g ( j ) m ( X j ) − f reg ( X j )) ( e g ( j ) ( X j ) − f reg ( X j )) , so that E h (1 − h j ) (cid:0)e g ( j ) ( X j ) − Y j (cid:1) (cid:12)(cid:12) S ( j ) n +1 , X j i = (1 − h j ) (cid:16) ( e g ( j ) ( X j ) − f reg ( X j )) + E (cid:2) ( f reg ( X j ) − Y j ) | S ( j ) n +1 , X j (cid:3)(cid:17) > E h (1 − h j ) (cid:0)e g ( j ) m ( X j ) − Y j (cid:1) (cid:12)(cid:12) S ( j ) n +1 , X j i . E E ( b g m ) E (cid:20) n + 1 n +1 X j =1 (cid:0)e g ( j ) m ( X j ) − Y j (cid:1) − (1 − h j ) (cid:0)e g ( j ) m ( X j ) − Y j (cid:1) (cid:21) E (cid:20) n + 1 n +1 X j =1 h j (cid:0)e g ( j ) m ( X j ) − Y j (cid:1) (cid:21) m E (cid:20) n + 1 n +1 X j =1 h j (cid:21) m dn + 1 , where the penultimate step follows via Jensen’s inequality combined with Assumption 1 and thelast step follows from the bound P n +1 j =1 h j = Tr (cid:2)(cid:0) P n +1 i =1 X i X T i (cid:1) † (cid:0) P n +1 i =1 X i X T i (cid:1)(cid:3) d .We now conclude by showing the identity (11). First, define e Σ = n +1 X i =1 X i X T i , e Σ ( j ) = e Σ − X j X T j , b = n +1 X i =1 Y i X i , and b ( j ) = b − Y j X j , so that e g ( X j ) = h e Σ † b, X j i , e g ( j ) ( X j ) = h (cid:0)e Σ ( j ) (cid:1) † b ( j ) , X j i , and h j = h e Σ † X j , X j i . Note that (11) is an identity, and up to restricting to the linear span of ( X , . . . , X n +1 ) we mayassume that e Σ is invertible. 
In addition, if X j does not belong to the linear span of ( X i ) n +1 i =1 ,i = j ,namely, if e Σ ( j ) is singular, then it can be shown that h j = 1 and e g ( X j ) = Y j (since e g minimizesthe empirical risk on S n +1 , and g ( X j ) can be set freely without affecting the other predictions),so that (11) holds. Therefore, we may assume that e Σ ( j ) is invertible. Using the definition andthe Sherman-Morrison formula, as h j ∈ [0 , , we obtain e g ( j ) ( X j ) = (cid:28)(cid:18)e Σ − + e Σ − X j X T j e Σ − − h j (cid:19) ( b − Y j X j ) , X j (cid:29) = e g ( X j ) + h j − h j e g ( X j ) − h j Y j − h j − h j Y j = 11 − h j e g ( X j ) − h j − h j Y j ; rearranging the last equality yields (11), concluding the proof. As discussed in Section 2, Assumption 1 suffices to ensure that the Forster-Warmuth estimator[24] achieves an expected excess risk bound of order m d/n irrespective of the distribution of X . Our results established in Section 3 demonstrate the same conclusion for the truncated leastsquares estimator of [28, Theorem 11.3]. In addition to the guarantees in expectation, high-probability or tail bounds are desirable, as they provide a control on the probability of failureof the estimator. The following theorem shows that in fact, none of the two procedures satisfymeaningful high-probability guarantees, in a rather strong sense.11 heorem 2. Fix the dimension d = 1 . There exist absolute constants c > and n > suchthat the following holds. For any n > n , there is a distribution P = P ( n ) of ( X, Y ) such thatif b g is either the truncated least squares estimator (8) or the Forster-Warmuth estimator (9) ,computed on an i.i.d. sample S n , then P (cid:16) R ( b g ) − inf g ∈F lin R ( g ) > c m (cid:17) > c . Note that under Assumption 1, the trivial, identically function has risk at most E Y m .Theorem 2 states that, with constant probability, the truncated least squares and the Forster-Warmuth estimators incur a constant excess risk of the same order. At the first sight, thisproperty may seem incompatible with expected excess risk bounds of order d/n . However, oneshould keep in mind that the estimators in question are improper (returning predictors outsideof the class F lin ), so that the excess risk may well take negative values; the expected excessrisk remains small due to the fact that positive and negative values essentially compensate inexpectation, regardless of the distribution.A related phenomenon was observed in the context of model selection-type aggregation byAudibert [2], who showed that the (improper) progressive mixture rule [78, 12], known to achievefast rates in expectation, exhibits slow rates in deviation. In our context the failure in deviationis even stronger, as the excess risk is of constant order, rather than exhibiting slow rates. Proof.
For any $n \ge n_0$, let $P = P^{(n)}$ be the distribution of $(X, Y)$ satisfying
$$(X, Y) = \begin{cases} (1, m) & \text{with probability } 1 - 1/n; \\ (\sqrt{n}, 0) & \text{with probability } 1/n. \end{cases}$$
By homogeneity, we may assume that $m = 1$. For any $w \in \mathbb{R}$, set $g_w(x) = w\cdot x$. We have
$$R(g_w) = \Big(1 - \frac{1}{n}\Big)(w - 1)^2 + \frac{1}{n}\big(w\sqrt{n}\big)^2 = \Big(1 - \frac{1}{n}\Big)(w - 1)^2 + w^2.$$
It follows that the risk of the best linear predictor is equal to
$$\inf_{w\in\mathbb{R}} R(g_w) = \frac{1 - 1/n}{2 - 1/n}. \qquad (13)$$
In addition, let $K = K_n$ denote the number of indices $i = 1, \ldots, n$ such that $X_i = \sqrt{n}$. The empirical risk writes
$$\hat R_n(g_w) = \Big(1 - \frac{K}{n}\Big)(w - 1)^2 + Kw^2,$$
and so
$$\hat w_{\mathrm{erm}} = \arg\min_{w\in\mathbb{R}} \hat R_n(g_w) = \frac{1 - K/n}{K + 1 - K/n}.$$
In particular, $\hat w_{\mathrm{erm}} \le 1/(K+1)$. Now, note that if $\hat g$ denotes either the truncated least squares (8) or the Forster-Warmuth estimator (9), then $\hat g(1) \le \hat w_{\mathrm{erm}} \le 1/(K+1)$, and thus, denoting the sample $(X_i, Y_i)_{i=1}^n$ by $S_n$, we have
$$R(\hat g) \ge \mathbb{E}\big[(\hat g(X) - Y)^2\,\mathbf{1}(X = 1) \mid S_n\big] \ge \Big(1 - \frac{1}{n}\Big)\cdot\Big(\frac{K}{K+1}\Big)^2. \qquad (14)$$
Thus, under the event $E_n = \{K_n \ge 4\}$, it follows from (13) and (14) that for $n \ge 16$,
$$R(\hat g) - \inf_{g\in\mathcal F_{\mathrm{lin}}} R(g) \ge \Big(1 - \frac{1}{n}\Big)\cdot\Big(\frac{K}{K+1}\Big)^2 - \frac{1}{2} \ge \Big(1 - \frac{1}{16}\Big)\cdot\frac{16}{25} - \frac{1}{2} = \frac{1}{10}.$$
Finally, since $K_n$ follows the binomial distribution $\mathrm{Bin}(n, 1/n)$, the probability $\mathbb{P}(E_n)$ is positive for $n \ge 16$. Further, since $K_n$ converges in distribution to the Poisson distribution $\mathrm{Poi}(1)$ as $n \to \infty$, $\mathbb{P}(E_n) \to \mathbb{P}(\tilde K \ge 4) > 0$ with $\tilde K \sim \mathrm{Poi}(1)$, so that setting $p_0 = \inf_{n\ge 16}\mathbb{P}(E_n)$, we have $p_0 > 0$. This concludes the proof with $c_0 = \min(p_0, 1/10)$ and $n_0 = 16$.
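The construction above is easy to probe numerically. The following small Monte Carlo script (our own illustration, with our own choices of $n$, the failure threshold and the random seed) samples from the two-point distribution $P^{(n)}$ with $m = 1$ and estimates the probability that the truncated least squares estimator incurs a constant excess risk; in line with Theorem 2, this probability stays of constant order as $n$ grows.

```python
import numpy as np

def excess_risk_truncated_ls(n, rng, m=1.0):
    # One sample from P^(n): X = sqrt(n), Y = 0 with prob. 1/n, else X = 1, Y = m.
    heavy = rng.random(n) < 1.0 / n
    X = np.where(heavy, np.sqrt(n), 1.0)
    Y = np.where(heavy, 0.0, m)
    w_erm = (X @ Y) / (X @ X)                          # one-dimensional least squares
    g1 = np.clip(w_erm * 1.0, -m, m)                   # truncated prediction at x = 1
    g_sqrt_n = np.clip(w_erm * np.sqrt(n), -m, m)      # truncated prediction at x = sqrt(n)
    risk = (1 - 1 / n) * (g1 - m) ** 2 + (1 / n) * g_sqrt_n ** 2
    best_linear_risk = m ** 2 * (1 - 1 / n) / (2 - 1 / n)   # display (13)
    return risk - best_linear_risk

rng = np.random.default_rng(0)
n, reps = 1000, 2000
excess = np.array([excess_risk_truncated_ls(n, rng) for _ in range(reps)])
print("P(excess risk > 0.05) approx.", (excess > 0.05).mean())
```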
5  An optimal robust estimator in the high-probability regime

In this section we present our main positive result. We show that there is an estimator achieving an optimal accuracy and sub-exponential tails for the linear class $\mathcal F_{\mathrm{lin}}$ under Assumption 1. We first consider a simplified setup where the covariance structure of $X$ is known.

Following the discussion on the learning model with known covariance structure in Section 2, we assume in this section that
$\Sigma = \mathbb{E} XX^{\mathsf T}$ exists, is invertible and is also known. Recall the definition of Tsybakov's projection estimator $\hat g_{\mathrm{proj}}$ (7). Since this estimator always returns a linear predictor, its excess risk is non-negative and we may apply Markov's inequality to show that for any $\delta \in (0, 1)$, it holds that
$$\mathbb{P}\bigg( R(\hat g_{\mathrm{proj}}) - \inf_{g\in\mathcal F_{\mathrm{lin}}} R(g) \le \frac{m^2 d}{n\delta} \bigg) \ge 1 - \delta.$$
An argument similar to the one used in [13, Proposition 6.2] can be used to show that this bound is essentially the best we can hope for from the projection estimator, even when $|Y| \le m$ almost surely.

Fortunately, there is a way to modify this estimator and obtain a guarantee with sub-exponential tails. The result of Lugosi and Mendelson [46, Theorem 1] shows that for any $\delta \in (0, 1)$, there exists an estimator $\hat\mu_\delta : (\mathbb{R}^d)^n \to \mathbb{R}^d$ such that, for any sequence $U_1, \ldots, U_n$ of i.i.d. random vectors in $\mathbb{R}^d$ with mean $\mu$ and covariance matrix $\tilde\Sigma$, $\hat\mu_\delta = \hat\mu_\delta(U_1, \ldots, U_n)$ satisfies
$$\mathbb{P}\Bigg( \|\hat\mu_\delta - \mu\| \le c\,\sqrt{\frac{\mathrm{Tr}(\tilde\Sigma) + \|\tilde\Sigma\|_{\mathrm{op}}\log(1/\delta)}{n}} \Bigg) \ge 1 - \delta, \qquad (15)$$
where $c > 0$ is an absolute constant. Now, introduce the robust projection estimator
$$\tilde w = \Sigma^{-1/2}\cdot\hat\mu_\delta\big(Y_1\Sigma^{-1/2}X_1, \ldots, Y_n\Sigma^{-1/2}X_n\big), \qquad (16)$$
and consider the following result.

Proposition 1.
There is an absolute constant $c > 0$ such that the following is true. Suppose that Assumption 1 holds. Then, the robust projection estimator $\hat g = \langle\tilde w, \cdot\rangle$ (which is a proper estimator) defined in (16) satisfies
$$\mathbb{P}\bigg( R(\hat g) - \inf_{g\in\mathcal F_{\mathrm{lin}}} R(g) \le \frac{c\,m^2\big(d + \log(1/\delta)\big)}{n} \bigg) \ge 1 - \delta.$$
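As an illustration of the construction (16), here is a numpy sketch of the robust projection estimator. The exact sub-Gaussian mean estimator $\hat\mu_\delta$ of Lugosi and Mendelson is not reproduced; a simple geometric median-of-means (a few Weiszfeld iterations over block means) is used as a stand-in, and the number of blocks is a heuristic choice of ours.

```python
import numpy as np

def geometric_median_of_means(U, k, n_iter=50):
    """Stand-in robust mean estimator: average within k blocks, then take the
    geometric median of the block means (Weiszfeld iterations)."""
    blocks = np.array_split(U, k)
    means = np.array([b.mean(axis=0) for b in blocks])
    mu = means.mean(axis=0)
    for _ in range(n_iter):
        w = 1.0 / np.maximum(np.linalg.norm(means - mu, axis=1), 1e-12)
        mu = (means * w[:, None]).sum(axis=0) / w.sum()
    return mu

def robust_projection_estimator(X, Y, Sigma, delta):
    """Robust projection estimator (16): whiten, robustly estimate E[U], un-whiten."""
    eigval, eigvec = np.linalg.eigh(Sigma)
    Sigma_inv_sqrt = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T
    U = Y[:, None] * (X @ Sigma_inv_sqrt)             # rows: U_i = Y_i * Sigma^{-1/2} X_i
    k = max(1, min(len(Y), int(np.ceil(8 * np.log(1.0 / delta)))))  # heuristic block count
    mu_hat = geometric_median_of_means(U, k)
    return Sigma_inv_sqrt @ mu_hat                    # w~; predict with x -> <w~, x>
```

By the identity (6), the excess risk of the resulting proper predictor equals $\|\hat\mu_\delta - \mathbb{E}U\|^2$, which is exactly the quantity controlled by (15).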
Proof. We have shown in Section 2 that, under Assumption 1,
$\mathrm{Cov}(Y\Sigma^{-1/2}X) \preceq m^2 I_d$. Combining the deviation bound (15) and the definition (16) with the identity (6), we finish the proof.

The above result serves as a benchmark for the performance that we aim to establish in the more realistic setting where the covariance matrix $\Sigma$ is unknown. This is achieved in the next section.

5.2  Deviation-optimal estimator robust to heavy tails

The theorem below is the main positive result of our paper. It demonstrates that Assumption 1 is a sufficient condition for the existence of linear regression estimators satisfying an excess risk deviation inequality with logarithmic dependence on the confidence parameter. In Section 6, we show that Assumption 1 is also necessary.
Theorem 3.
There is an absolute constant $c > 0$ such that the following holds. Assume that $n \ge d$. Suppose that Assumption 1 holds and fix any $\delta \in (0, 1)$. Then, there exists an estimator $\hat g$ depending on $\delta$ and $m$ such that the following holds:
$$\mathbb{P}\bigg( R(\hat g) - \inf_{g\in\mathcal F_{\mathrm{lin}}} R(g) \le \frac{c\,m^2\big(d\log(n/d) + \log(1/\delta)\big)}{n} \bigg) \ge 1 - \delta.$$

Before presenting our estimator, we briefly comment on the above theorem. First, in contrast to existing work on robust linear regression, our estimator $\hat g$ is improper, even though the underlying linear class is convex. Second, unlike our previous results presented in this paper, the bound of Theorem 3 is not specific to the linear class. In particular, our proof extends without changes to the family of VC-subgraph classes (see [25, Definition 3.6.8]). Some recent results in the robust statistics literature apply to more general classes of functions, including non-parametric classes (see, for example, [45, 52, 18]). However, as discussed in Section 1.2, such results are only known to be valid under some assumptions on $P_X$. Extending our results to more general classes presents some challenges; we discuss them in more detail in Section 5.3. Finally, we note that our estimator depends on the value of $m$. This assumption simplifies the analysis and is standard in similar contexts (see, for example, [28, Theorem 11.3] and [54]).

We now introduce some additional notation needed to define our estimator. For any $\varepsilon > 0$ and any class of real-valued functions $\mathcal G$, let $\mathcal G_\varepsilon$ denote the smallest $\varepsilon$-net of $\mathcal G$ with respect to the empirical $L_1$ distance $\frac{1}{n}\sum_{i=1}^n |f(X_i) - g(X_i)|$. We only consider $\varepsilon$-nets that are subsets of $\mathcal G$. For the standard definition of an $\varepsilon$-net we refer to [74, Section 4.2]. Assume that we have a sample $S = (X_i, Y_i)_{i=1}^{3n}$ of size $3n$ and denote $S_1 = (X_i, Y_i)_{i=1}^{n}$, $S_2 = (X_i, Y_i)_{i=n+1}^{2n}$ and $S_3 = (X_i, Y_i)_{i=2n+1}^{3n}$. Fix any $k \le n$, and assume without loss of generality that $n/k$ is an integer. Split the set $\{1, \ldots, n\}$ into $k$ blocks $I_1, \ldots, I_k$ of equal size such that $I_j = \{(j-1)(n/k) + 1, \ldots, j(n/k)\}$. Fix any function $\ell : \mathbb{R}^d \times \mathbb{R} \to \mathbb{R}$, any sample $S'$ of size $n$, and denote the $i$-th element of $S'$ by $Z_i = (X_i, Y_i)$. The median-of-means estimator (see also [43, Section 2.1], [60]) is defined as follows:
$$\mathrm{MOM}^k_{S'}(\ell) = \mathrm{Median}\bigg( \frac{k}{n}\sum_{i\in I_1}\ell(Z_i), \ \ldots, \ \frac{k}{n}\sum_{i\in I_k}\ell(Z_i) \bigg).$$
Finally, for any predictor $f : \mathbb{R}^d \to \mathbb{R}$, denote the associated loss function by $\ell_f(Z_i) = (f(X_i) - Y_i)^2$. We are now ready to present our estimator.
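A direct transcription of the $\mathrm{MOM}^k_{S'}$ functional is given below; the helper names are ours, and trimming the last few points so that the blocks have equal size is a convenience rather than part of the definition.

```python
import numpy as np

def mom(values, k):
    """Median-of-means MOM^k: average the values within k equal-size blocks,
    then return the median of the k block averages."""
    values = np.asarray(values, dtype=float)
    usable = (len(values) // k) * k
    blocks = np.array_split(values[:usable], k)
    return float(np.median([b.mean() for b in blocks]))

def mom_loss_difference(f, g, X, Y, k):
    """MOM^k of the loss difference l_f - l_g for the squared loss."""
    return mom((f(X) - Y) ** 2 - (g(X) - Y) ** 2, k)
```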
The estimator of Theorem 3

1. Split the sample $S$ of size $3n$ into three equal parts $S_1$, $S_2$ and $S_3$ as defined above. Use the value $m$ to construct the truncated class $\mathcal F = \{ f_m : f \in \mathcal F_{\mathrm{lin}} \}$, where recall that $f_m$ denotes the truncation of a function $f$ (see (8)).

2. Fix $\varepsilon = \frac{md}{n}$. Using the first sample $S_1$, construct an $\varepsilon$-net of $\mathcal F$ with respect to the empirical $L_1$ distance and denote it by $\mathcal F_\varepsilon$.

3. Let $c_1, c_2 > 0$ be some specifically chosen absolute constants. Fix the number of blocks $k = \lceil c_1 d(\log(n/d) + \log(1/\delta)) \rceil$ and set $\alpha = c_2\sqrt{\frac{m^2(d\log(n/d) + \log(1/\delta))}{n}}$. Using the second sample $S_2$, define a random subset of $\mathcal F_\varepsilon$ as follows:
$$\hat{\mathcal F} = \Bigg\{ f \in \mathcal F_\varepsilon : \forall g \in \mathcal F_\varepsilon,\ \mathrm{MOM}^k_{S_2}(\ell_f - \ell_g) \le \sqrt{2}\,\alpha\sqrt{\frac{1}{n}\sum_{X_i\in S_1}\big(f(X_i) - g(X_i)\big)^2} + \alpha^2 \Bigg\}.$$
4. Define the set $\hat{\mathcal F}^+$ consisting of all the mid-points of $\hat{\mathcal F}$, that is, $\hat{\mathcal F}^+ = (\hat{\mathcal F} + \hat{\mathcal F})/2$. Using the third sample $S_3$, define our estimator $\hat g$ as
$$\hat g = \arg\min_{g\in\hat{\mathcal F}^+}\ \max_{f\in\hat{\mathcal F}^+}\ \mathrm{MOM}^k_{S_3}(\ell_g - \ell_f).$$
5. Return $\hat g$.

Our estimator is motivated by a combination of several seemingly disconnected ideas in the literature. The truncation step is inspired by the analysis in [28, Chapter 11], with the difference that we use the truncation as a preliminary step, rather than as a post-processing of the ERM prediction (see Theorem 1). The second step replaces the original class by an empirical $L_1$ $\varepsilon$-net of the truncated class. In many situations, such a construction leads to suboptimal results. However, since we work with a particular parametric class, this step does not affect the resulting performance. The use of the $\varepsilon$-net $\mathcal F_\varepsilon$ is needed for technical reasons; we explain the technical aspects in detail in Section 5.3. Our third step is inspired by the median-of-means tournaments introduced in [45]. The main difference with the latter work is that our truncated class is now non-convex, and to obtain the correct rates of convergence, we need to adapt the arguments used in the model selection aggregation literature. This motivates our fourth step, which can be seen as an adaptation of the star algorithm [2] and the two-step aggregation procedure developed in [40, 51] to our specific heavy-tailed setting, combined with the idea of a min-max formulation of robust estimators [5, 39]. We remark that the idea of combining model selection aggregation techniques with median-of-means tournaments has also recently appeared in [52], but under different assumptions. As we mentioned, the key distinction therein is that the suggested learning procedure collapses to a proper estimator for convex classes of functions, such as $\mathcal F_{\mathrm{lin}}$ considered in our work; as discussed in Section 2, for such procedures some restrictions on the distribution of covariates are required to obtain performance bounds.

The rest of this section is devoted to proving Theorem 3. First, the truncation at the level $m$ can only make the risk smaller whenever Assumption 1 is satisfied. Indeed, this follows from the identity
$$R(g) = \mathbb{E}\big(g(X) - f_{\mathrm{reg}}(X)\big)^2 + \mathbb{E}\big(f_{\mathrm{reg}}(X) - Y\big)^2,$$
and the fact that $f_{\mathrm{reg}}$ is absolutely bounded by $m$. Therefore, we may focus on bounding $R(\hat g) - \inf_{g\in\mathcal F} R(g)$. We will now state and comment on some technical lemmas that will be used in our proof; their proofs are deferred to Section 7.
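For readers who prefer code, the following compact, non-optimized sketch walks through the five steps on data. It is not the construction analysed in the paper: the dictionary of candidate linear functions, the greedy net extraction, the default constants `c1` and `c2` and all function names are our own heuristic stand-ins; only the truncation, the MOM-based filtering on $S_2$, the mid-point class and the min-max selection on $S_3$ mirror the procedure above.

```python
import numpy as np

def mom(values, k):
    usable = (len(values) // k) * k
    return float(np.median([b.mean() for b in np.array_split(values[:usable], k)]))

def build_candidates(X1, Y1, n_cand=200, rng=None):
    """Heuristic stand-in for steps 1-2: linear fits on random subsamples of S1."""
    rng = rng or np.random.default_rng(0)
    n, d = X1.shape
    ws = []
    for _ in range(n_cand):
        idx = rng.choice(n, size=max(d, n // 2), replace=True)
        ws.append(np.linalg.pinv(X1[idx].T @ X1[idx]) @ (X1[idx].T @ Y1[idx]))
    return np.array(ws)

def greedy_l1_net(ws, X1, m, eps):
    """Greedy empirical-L1 eps-net over the truncated predictions on S1."""
    P = np.clip(ws @ X1.T, -m, m)
    kept = []
    for i in range(len(ws)):
        if all(np.mean(np.abs(P[i] - P[j])) > eps for j in kept):
            kept.append(i)
    return ws[kept]

def estimator_of_theorem_3(X, Y, m, delta, c1=1.0, c2=1.0, rng=None):
    n = len(Y) // 3
    (X1, Y1), (X2, Y2), (X3, Y3) = [(X[i*n:(i+1)*n], Y[i*n:(i+1)*n]) for i in range(3)]
    d = X1.shape[1]
    eps = m * d / n
    k = int(np.clip(np.ceil(c1 * d * (np.log(n / d) + np.log(1 / delta))), 1, n))
    alpha = c2 * np.sqrt(m**2 * (d * np.log(n / d) + np.log(1 / delta)) / n)
    ws = greedy_l1_net(build_candidates(X1, Y1, rng=rng), X1, m, eps)       # steps 1-2
    P1, P2, P3 = (np.clip(ws @ Z.T, -m, m) for Z in (X1, X2, X3))
    L2 = (P2 - Y2) ** 2
    def passes_filter(i):                                                   # step 3
        return all(mom(L2[i] - L2[j], k)
                   <= np.sqrt(2) * alpha * np.sqrt(np.mean((P1[i] - P1[j]) ** 2)) + alpha**2
                   for j in range(len(ws)))
    F_hat = [i for i in range(len(ws)) if passes_filter(i)] or [0]
    # Step 4: min-max MOM selection on S3 over all mid-points of F_hat.
    pairs = [(i, j) for i in F_hat for j in F_hat]
    L3 = [((P3[i] + P3[j]) / 2 - Y3) ** 2 for (i, j) in pairs]
    worst = [max(mom(L3[a] - L3[b], k) for b in range(len(pairs))) for a in range(len(pairs))]
    i_star, j_star = pairs[int(np.argmin(worst))]
    wi, wj = ws[i_star], ws[j_star]
    def predict(x):                                                         # step 5
        # Improper output: the mid-point of two truncated linear predictors.
        return 0.5 * (np.clip(x @ wi, -m, m) + np.clip(x @ wj, -m, m))
    return predict
```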
Next, we provide a uniform deviation bound on the $L_1$ distances between the elements of $\mathcal F$.

Lemma 1. Assume that $n \ge d$. There is a constant $c > 0$ such that for any $\delta \in (0, 1)$, simultaneously for all $f, g \in \mathcal F$, with probability at least $1 - \delta$, it holds that
$$\mathbb{E}|f(X) - g(X)| \le \frac{2}{n}\sum_{i=1}^n |f(X_i) - g(X_i)| + c\bigg(\frac{md\log(n/d) + m\log(3/\delta)}{n}\bigg).$$

To simplify the statements of the lemmas to follow, for any finite class $\mathcal G$ and for any confidence parameter $\delta \in (0, 1)$, define
$$\alpha(\mathcal G, \delta) = 32\sqrt{\frac{m^2\big(\log(2|\mathcal G|) + \log(4/\delta)\big)}{n}}, \qquad (17)$$
where the sample size $n$ and the value $m$ (of Assumption 1) will always be clear from the context. The next technical lemma provides basic concentration properties of the median-of-means estimators, the proof of which follows from a combination of a uniform Bernstein inequality and a median-of-means deviation inequality for mean estimation [43, Theorem 2].

Lemma 2.
Suppose that Assumption 1 holds and let $S_n = (X_i, Y_i)_{i=1}^n$ denote an i.i.d. sample. Let $\mathcal G$ be any finite class of functions whose absolute value is bounded by $m$. Fix any $\delta \in (0, 1)$, let $k = \lceil \log(|\mathcal G|/\delta)\rceil$ and let $\alpha$ denote any upper bound on $\alpha(\mathcal G, \delta)$ defined in (17). Then, with probability at least $1 - \delta$, the following inequalities hold simultaneously for any $f, g \in \mathcal G$:
$$\big| R(f) - R(g) - \mathrm{MOM}^k_{S_n}(\ell_f - \ell_g) \big| \le \alpha\sqrt{\mathbb{E}\big(f(X) - g(X)\big)^2},$$
$$\big| R(f) - R(g) - \mathrm{MOM}^k_{S_n}(\ell_f - \ell_g) \big| \le \sqrt{2}\,\alpha\sqrt{\frac{1}{n}\sum_{i=1}^n \big(f(X_i) - g(X_i)\big)^2} + \alpha^2,$$
$$\frac{1}{n}\sum_{i=1}^n \big(f(X_i) - g(X_i)\big)^2 \le 2\,\mathbb{E}\big(f(X) - g(X)\big)^2 + \alpha^2.$$

For any class $\mathcal G$, define its $L_2$ diameter by
$$D(\mathcal G) = \sup_{f,g\in\mathcal G}\sqrt{\mathbb{E}\big(f(X) - g(X)\big)^2}.$$
As a corollary of the above lemma, we are able to derive some basic properties of the random set $\hat{\mathcal F}$. In particular, we show that with high probability the set $\hat{\mathcal F}$ contains the population risk minimizer over the $\varepsilon$-net $\mathcal F_\varepsilon$. At the same time, we establish a uniform Bernstein-type bound on the excess risk of the elements of $\hat{\mathcal F}$, with the role of the variance term played by $D(\hat{\mathcal F})$.

Lemma 3.
Suppose that Assumption 1 holds and let $S_n = (X_i, Y_i)_{i=1}^n$ denote an i.i.d. sample. Let $\mathcal G$ be any finite class of functions whose absolute value is bounded by $m$. Fix any $\delta \in (0, 1)$, let $k = \lceil \log(|\mathcal G|/\delta)\rceil$ and let $\alpha$ denote any upper bound on $\alpha(\mathcal G, \delta)$ defined in (17). Define the random subset of $\mathcal G$:
$$\hat{\mathcal G} = \Bigg\{ f \in \mathcal G : \text{for every } g \in \mathcal G,\ \mathrm{MOM}^k_{S_n}(\ell_f - \ell_g) \le \sqrt{2}\,\alpha\sqrt{\frac{1}{n}\sum_{i=1}^n \big(f(X_i) - g(X_i)\big)^2} + \alpha^2 \Bigg\}.$$
Then, the following two conditions hold simultaneously, with probability at least $1 - \delta$:

1. The function $g^* = \arg\min_{g\in\mathcal G} R(g)$ belongs to the class $\hat{\mathcal G}$.
2. For any $f \in \hat{\mathcal G}$, we have $R(f) - R(g^*) \le 4\alpha D(\hat{\mathcal G}) + 5\alpha^2$.

Finally, we prove an excess risk bound for the min-max estimator in terms of the $L_2$ diameter of the set over which the estimator is computed. The intuitive implications of the lemma below are the following. First, if $D(\hat{\mathcal F})$ is of order $1/\sqrt{n}$, the lemma below immediately yields the fast rate of convergence for our estimator $\hat g$. If, on the other hand, the diameter $D(\hat{\mathcal F})$ is much larger than $1/\sqrt{n}$, then we can exploit the curvature of the quadratic loss and the gain in the approximation error (due to considering the larger class $\hat{\mathcal F}^+$ instead of $\hat{\mathcal F}$) to prove the desired rate of convergence.

Lemma 4.
Suppose that Assumption 1 holds and let $S_n = (X_i, Y_i)_{i=1}^n$ denote an i.i.d. sample. Let $\mathcal G$ be any finite class of functions whose absolute value is bounded by $m$. Fix any $\delta \in (0, 1)$, let $k = \lceil \log(|\mathcal G|/\delta)\rceil$ and let $\alpha$ denote any upper bound on $\alpha(\mathcal G, \delta)$ defined in (17). Let $\hat g$ be any estimator satisfying
$$\hat g \in \arg\min_{g\in\mathcal G}\ \max_{f\in\mathcal G}\ \mathrm{MOM}^k_{S_n}(\ell_g - \ell_f).$$
Let $g^* \in \arg\min_{g\in\mathcal G} R(g)$. Then, with probability at least $1 - \delta$ it holds that
$$R(\hat g) \le R(g^*) + 2\alpha D(\mathcal G).$$
We are now ready to prove Theorem 3.
Proof of Theorem 3.
Our proof is split into two parts. First, we approximate the truncatedlinear class F with a finite class, namely, an empirical L ε -net constructed using the first thirdof the dataset denoted by S . Then, conditionally on S , we show that our estimator b g achievesthe optimal rate of model selection aggregation over the finite class F ε , in spite of the lack ofassumptions on the covariates and the presence of heavy-tailed labels. Finally, we note that ifthe number of median-of-means blocks k is equal to ( i.e. , n . d (log( n/d ) + log(1 /δ )) ), thenwe may simply output the function which satisfies the desired bound for such sample sizes.Thus, in what follows we assume that n & d (log( n/d ) + log(1 /δ )) . The approximation step.
Recall that F ε is an empirical L ε -net of the truncated linearclass F constructed using the sample S . Let f ∗ = arg min f ∈F R ( f ) and let f ∗ ε be any elementof F ε minimizing the empirical L distance to f ∗ , that is, we have n X X i ∈ S | f ∗ ( X i ) − f ∗ ε ( X i ) | ε. (18)Let E denote the event of Lemma 1 applied with respect to the sample S (that contains n points) with the choice of the confidence parameter set to δ/ (thus, P ( E ) > − δ/ ). It follows17hat on the event E we have R ( f ∗ ε ) − R ( f ∗ )= 2 E Y ( f ∗ ( X ) − f ∗ ε ( X )) + E ( f ∗ ε ( X ) − f ∗ ( X ) ) E ( E [ Y | X ]( f ∗ ( X ) − f ∗ ε ( X ))) + 2 m E | f ∗ ε ( X ) − f ∗ ( X ) | ( since | f ∗ ε ( X ) + f ∗ ( X ) | m ) E ( p E [ Y | X ] | f ∗ ( X ) − f ∗ ε ( X ) | ) + 2 m E | f ∗ ε ( X ) − f ∗ ( X ) | ( by Jensen’s inequality ) m E | f ∗ ε ( X ) − f ∗ ( X ) | ( by Assumption 1 ) mε + 4 mc (cid:18) md log( n/d ) + m log(9 /δ ) n (cid:19) ( by (18) and Lemma 1 ) c (cid:18) m d log( n/d ) + m log(9 /δ ) n (cid:19) , ( by the definition of ε ) where c is an absolute constant. Observe that on the event E , any estimator b g satisfies R ( b g ) − R ( f ∗ ) R ( b g ) − min f ∈F ε R ( f ) + R ( f ∗ ε ) − R ( f ∗ ) R ( b g ) − min f ∈F ε R ( f ) + 12 c (cid:18) m d log( n/d ) + m log(9 /δ ) n (cid:19) . From this point onward, we work on the event E . It thus remains to prove that with probability − δ/ , the estimator b g computed using the remaining n points split into samples S and S satisfies R ( b g ) − min f ∈F ε R ( f ) . m d log( n/d ) + m log(1 /δ ) n . (19)Since F ε is a finite class of functions, we now turn to the aggregation part of this proof. The aggregation step.
By the L covering number bound stated in [28, Theorem 9.4, The-orem 9.5], which also holds for the empirical L distances, we have (see the proof of Lemma 1) log |F ε | . d log meε . d log( n/d ) . Note that | b F + | and | b F | are simultaneously upper bounded by |F ε | . For an arbitrary finiteclass G , recall the definition of α ( G , δ ) stated in (17). It follows that there exists some absoluteconstant c > such that α defined below satisfies max (cid:16) α ( b F , δ/ , α ( b F + , δ/ (cid:17) α = c r m d log( n/d ) + m log(1 /δ ) n . (20)Thus, α defined above will be used in the applications of Lemmas 2, 3 and 4 to follow.Let E be the event of Lemma 3 applied for the set b F with confidence parameter δ/ . Inparticular, on the event E we have arg min f ∈F ε R ( f ) ∈ b F , and for any f ∈ b F it holds that R ( f ) min f ∈F ε R ( f ) + 4 α D ( b F ) + 5 α . (21)Conditionally on the sample S , let the set b F defined in the third step of our algorithm befixed. Denote g ∗ = arg min g ∈ b F + R ( f ) , where recall that b F + = ( b F + b F ) / . Observe that the L diameters of b F and b F + are equal, that is D ( b F + ) = D ( b F ) . Let E be the event of Lemma 418pplied to the third part of our sample S and the finite class b F + with the confidence parameterset to δ/ . Thus, on E our estimator b g satisfies: R ( b g ) R ( g ∗ ) + 2 α D ( b F ) . (22)Now choose any g, h ∈ b F such that p E ( g ( X ) − h ( X )) > D ( b F ) / (such a choice always existsby definition of the diameter). Since ( g + h ) / ∈ b F + , the parallelogram identity yields R ( g ∗ ) R (( g + h ) / R ( g ) + 12 R ( h ) − E ( g ( X ) − h ( X )) R ( g ) + 12 R ( h ) − D ( b F ) . (23)On the event E , applying (21) for the functions g and h we obtain R ( g ) + 12 R ( h ) min f ∈F ε R ( f ) + 4 α D ( b F ) + 5 α . Combining the above with equations (22) and (23) we have R ( b g ) − min f ∈F ε R ( f ) α D ( b F ) + 5 α − D ( b F ) α , where the last step follows by maximizing the quadratic equation with respect to D ( b F ) . Pluggingin the definition of α (see (20)) we obtain the desired inequality (19). The proof is complete bytaking the union bound over the events E , E and E defined above. We begin by noting that Theorem 3 holds not only for linear classes, but more generally, forVC-subgraph classes (without any changes to our argument presented in the previous section).Indeed, the structure of the underlying function class only enters our proof though the controlon the empirical covering numbers of its truncated elements; sharp bounds for such coveringnumbers are available in [28, Theorem 9.4]. As a special case, our analysis covers finite classesand hence, provides new results for the problem of model selection aggregation, where a learneris tasked with constructing a predictor as good as the best one in a given finite class (also calleddictionary) of functions [59, 67]. It is arguably the most straightforward problem manifesting sta-tistical separation between proper and improper learning algorithms (see, for instance, [12, 35]).Procedures based on exponential weighting were shown to attain optimal rates in expectation[77, 78, 12, 3], yet they later appeared to be deviation suboptimal [2], in close similarity to ourresults presented in Section 4.We can now formulate the following result, which from the statistical point of view, gener-alizes the best known results for the problem of model selection aggregation [2, 40].
Theorem 4.
There is an absolute constant $c > 0$ such that the following holds. Grant Assumption 1, fix any $\delta \in (0, 1)$, and let $\mathcal{F}$ be a finite class of possibly unbounded functions. Then there exists an estimator $\widehat{g}$, depending on $\delta$ and $m$, such that
$$\mathbb{P}\left(R(\widehat{g}) - \min_{g \in \mathcal{F}} R(g) \le \frac{c\, m^2\big(\log|\mathcal{F}| + \log(1/\delta)\big)}{n}\right) \ge 1 - \delta.$$

Proof. The aggregation algorithm is the same as the estimator of Theorem 3, with only two differences. First, we skip the step with the $\varepsilon$-net discretization of the truncated class $\mathcal{F}$. The second difference is that the number of blocks in the median-of-means estimators is of order $\log(|\mathcal{F}|/\delta)$ and, similarly, the parameter $\alpha$ is redefined to be of order $\sqrt{m^2(\log|\mathcal{F}| + \log(1/\delta))/n}$. The proof follows via the "aggregation step" part of the proof of Theorem 3.

Concerning aggregation with a heavy-tailed response variable, the above result can be compared with the bounds of Audibert [3] and Juditsky, Rigollet and Tsybakov [35]. Assuming that the functions in $\mathcal{F}$ are bounded in absolute value by a constant and that $\mathbb{E}|Y|^s \le m_s$ for some $s > 2$ and $m_s > 0$, they prove an in-expectation bound on $\mathbb{E}R(\widetilde{f}) - \min_{f \in \mathcal{F}} R(f)$ for some estimator $\widetilde{f}$, with a rate of convergence slower than $1/n$. In contrast, in Theorem 4 we do not assume the boundedness of $\mathcal{F}$, but require that the conditional second moment of $Y$ is bounded. As a result, we provide a deviation bound with the $1/n$ rate of convergence and logarithmic dependence on the confidence parameter $\delta$. We emphasize again that, due to the necessity of improperness for optimal model selection aggregation, in-expectation results are not easily transferable to deviation bounds; the in-expectation guarantees of [35, 3] are in fact obtained for variants of the progressive mixture or mirror averaging rule, which was shown by Audibert [2] to exhibit suboptimal deviations. Finally, an argument of Section 6 shows the necessity of Assumption 1 in our distribution-free setting for model selection aggregation.

Further extensions of Theorem 3, in particular beyond VC-subgraph classes, present technical challenges. First, obtaining distribution-free empirical covering number guarantees for truncations of general classes (as done for $\mathcal{F}$ in our case) might be a non-trivial task. Second, it is well known (see the discussion in [63]) that, even when only bounded functions are considered, replacing the original function class by its empirical $\varepsilon$-net (as done via the function class $\mathcal{F}_\varepsilon$ in our algorithm) usually renders the recovery of the correct excess risk rates impossible. This in turn leads to the final and most technical problem: if $\mathcal{F}$ is not replaced by $\mathcal{F}_\varepsilon$, there are no known ways to obtain an analog of the concentration Lemma 2 while only imposing Assumption 1.

To expand on the last point, an analog of Lemma 2 for general classes can be approached via the analysis of suprema of localized quadratic and multiplier processes (see [52] for related arguments); specifically, the supremum of the localized multiplier process $\mathbb{E}\sup_{h \in \mathcal{H}_r}\big(\sum_{i=1}^n \varepsilon_i Y_i h(X_i)\big)$ is difficult to control for general classes under our assumptions (here $\mathcal{H}_r$ denotes localized subsets of the class $\mathcal{F} - \mathcal{F}$; see the proof of Lemma 1 for more details). However, even if the response variable $Y$ is independent of $X$, the application of the multiplier inequality [71, Lemma 2.9.1], standard in this context, introduces a dependence on the moment $\|Y\|_{2,1} = \int_0^\infty \sqrt{\mathbb{P}(|Y| > t)}\,\mathrm{d}t$ in the resulting bounds, instead of the desired moment $\mathbb{E}Y^2$ that we obtain in Lemma 2 for finite classes.
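The following standard example illustrates the gap between these two quantities. Suppose the survival function of the response satisfies $\mathbb{P}(|Y| > t) = 1$ for $t \le t_0$ and $\mathbb{P}(|Y| > t) = (t\log t)^{-2}$ for $t > t_0$, where $t_0 > 1$ solves $t_0\log t_0 = 1$. Then
$$\mathbb{E}Y^2 = \int_0^\infty 2t\,\mathbb{P}(|Y| > t)\,\mathrm{d}t = t_0^2 + 2\int_{t_0}^\infty \frac{\mathrm{d}t}{t\log^2 t} = t_0^2 + \frac{2}{\log t_0} < \infty, \qquad \|Y\|_{2,1} = t_0 + \int_{t_0}^\infty \frac{\mathrm{d}t}{t\log t} = \infty,$$
so a bound stated in terms of $\|Y\|_{2,1}$ is vacuous for such a response, even though its second moment, and hence the quantity appearing in Lemma 2, is finite.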
It is known that the dependence on the $\|\cdot\|_{2,1}$ norm is unavoidable in some cases [42]. More importantly, we refer to the recent work [30], which discusses how the multiplier inequality can lead to suboptimal rates (see [30, Section 2.3.1] for more details).

The statistical guarantees obtained in the previous sections hold under no assumptions on the distribution of $X$ and under Assumption 1 on the conditional distribution of $Y$ given $X$. In this section, we show that Assumption 1 is necessary to obtain non-trivial guarantees on the excess risk without restrictions on $P_X$, and that our risk bounds are unimprovable in a precise sense.

Proposition 2.
Fix any $n \ge 1$, $\delta \in (e^{-n}, 1)$, and any measurable function $f : \mathbb{R} \to \mathbb{R}$ satisfying $f(0) = 0$ and $\sup_{x \in \mathbb{R}} |f(x)| \ge 1$. Then there exists a distribution $P_X$ of $X$ such that for any estimator $\widehat{g}$ (possibly improper and $P_X$-dependent), setting $Y = f_{\mathrm{reg}}(X)$ (where $f_{\mathrm{reg}} \in \{f, -f\}$), the following three conditions hold:

• there exists $w^* \in \mathbb{R}$ such that $R(g_{w^*}) = 0$;

• $\mathbb{E}Y^2 \le 1$;

• denoting $\|f_{\mathrm{reg}}\|_\infty = \sup_{x \in \mathbb{R}} |f_{\mathrm{reg}}(x)| = \|f\|_\infty \in [1, +\infty]$, we have
$$\mathbb{P}\left(R(\widehat{g}) > \min\left(\|f_{\mathrm{reg}}\|_\infty^2 \cdot \frac{\log(1/\delta)}{4n},\ 1\right)\right) \ge \delta.$$

Before providing the proof, let us comment on the implications of the above result. First, notice that if the conditional second moment bound $\mathbb{E}[Y^2 \mid X] \le m^2$ of Assumption 1 is relaxed to the weaker unconditional bound $\mathbb{E}Y^2 \le m^2$, then (taking $\delta$ to be a fixed constant and any $f$ such that $\|f\|_\infty \gtrsim \sqrt{n}$) the worst-case excess risk of any estimator $\widehat{g}$ is lower bounded by an absolute constant with constant probability, matching, up to constants, the risk of at most one that is trivially achieved by the identically zero function. Second, without Assumption 1 our upper bounds cannot be improved even in the "realizable" case where a perfect predictor belongs to the linear class $\mathcal{F}_{\mathrm{lin}}$ (that is, when $R(g) = 0$ for some $g \in \mathcal{F}_{\mathrm{lin}}$). Finally, when $Y = f_{\mathrm{reg}}(X)$, the worst-case dependence on $f_{\mathrm{reg}}$ can be no better than $\|f_{\mathrm{reg}}\|_\infty^2$, as demonstrated in the last part of the above proposition. The dependence on $m^2$ in our upper bounds is thus unavoidable, recalling that $m = \|f_{\mathrm{reg}}\|_\infty$ whenever $Y = f_{\mathrm{reg}}(X)$.

We remark that Proposition 2 is stated in dimension $d = 1$ for simplicity only. The same lower bound construction can be used for general dimension $d$ (assuming, for example, that $f$ is continuous, and imposing $|f_{\mathrm{reg}}| \le |f|$), allowing us to replace the $\log(1/\delta)$ term by $d + \log(1/\delta)$.
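Before turning to the proof, the mechanism behind Proposition 2 is easy to visualize numerically. The sketch below is an illustrative simplification (a fixed sign for $f_{\mathrm{reg}}$ and a zero prediction on uninformative samples, rather than the full two-hypothesis argument): it samples the two-point design used in the proof and checks that, with probability at least $\delta$, the observed sample carries no information about the regression function.

```python
import numpy as np

rng = np.random.default_rng(1)
n, delta = 50, 0.1
p = 1 - delta ** (1 / n)            # chosen so that (1 - p)^n = delta, cf. (24)
x0, w_star = 1.0, 3.0               # Y = w_star * X, hence the best linear predictor has risk 0

trials, degenerate = 20_000, 0
for _ in range(trials):
    X = np.where(rng.random(n) < p, x0, 0.0)
    if not X.any():                 # all covariates equal zero: the sample is uninformative
        degenerate += 1

# On an uninformative sample any estimator must guess the value at x0; an adversarial
# choice of the sign of f_reg then forces excess risk at least p * (w_star * x0)^2.
print("fraction of uninformative samples:", degenerate / trials, ">= delta =", delta)
print("forced excess risk on those samples:", p * (w_star * x0) ** 2,
      "  lower-bound scale ||f||^2 log(1/delta)/(4n):",
      w_star ** 2 * np.log(1 / delta) / (4 * n))
```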
Proof. Let $p \in (0, 1)$ be such that $(1 - p)^n = \delta$; using that $1 - e^{-u} \ge (1 - e^{-1})u \ge u/2$ for $u = \log(1/\delta)/n \in [0, 1]$, we have
$$p = 1 - \delta^{1/n} \ge \frac{\log(1/\delta)}{2n}. \tag{24}$$
Let $x_0 \in \mathbb{R}\setminus\{0\}$ be such that $|f(x_0)|$ is larger than $\min\big(\|f\|_\infty/\sqrt{2},\ 1/\sqrt{p}\big)$ and let $p_0 = \min\big(p,\ 1/f(x_0)^2\big)$. Fix the distribution of the covariates $P_X$ as follows:
$$X = \begin{cases} 0 & \text{with probability } 1 - p_0, \\ x_0 & \text{with probability } p_0. \end{cases}$$
Up to replacing $f$ by $-f$, assume that $f(x_0) > 0$. For $\varepsilon \in \{-1, 1\}$, let $P_\varepsilon$ denote the joint distribution of the random pair $(X, \varepsilon f(X))$ (where the marginal distribution of $X$ is given by $P_X$ defined above), and let $R_\varepsilon$ denote the risk functional associated with the distribution $P_\varepsilon$. Note that $P_\varepsilon$ satisfies the first condition of the proposition with $w^* = \varepsilon f(x_0)/x_0$. Also, the second condition holds since $\mathbb{E}Y^2 = p_0 f(x_0)^2 \le 1$.

We now turn to proving the third condition of the proposition. Let $\widehat{g}$ be an arbitrary procedure, possibly improper and depending on $P_X$. Let $S_0 = \big((0,0), \ldots, (0,0)\big)$ denote a sample of $n$ points all equal to $(0,0)$. Since the quadratic loss function is convex, we may assume without loss of generality that $\widehat{g}$ is a deterministic procedure, and we let $g : \mathbb{R} \to \mathbb{R}$ denote the output of $\widehat{g}$ on the sample $S_0$, that is, $g = \widehat{g}(S_0)$. By symmetry of the problem, assume that $g(x_0) \le 0$ and fix the distribution $P$ of $(X, Y)$ to be $P_1$ (if $g(x_0) > 0$, we may fix $P = P_{-1}$ instead). Consider the event $E = \{X_1 = \cdots = X_n = 0\}$ and note that $\mathbb{P}(E) = (1 - p_0)^n \ge (1 - p)^n = \delta$. Since $f(0) = 0$, on the event $E$ the observed sample is $S_0$, so that by (24) we have
$$R(\widehat{g}) \ge \mathbb{E}\big[(g(X) - Y)^2 \mathbf{1}(X = x_0)\big] = p_0\big(g(x_0) - f(x_0)\big)^2 \ge p_0 f(x_0)^2 = \min\big(p f(x_0)^2,\ 1\big) \ge \min\left(\frac{p\|f\|_\infty^2}{2},\ 1\right) \ge \min\left(\|f_{\mathrm{reg}}\|_\infty^2 \cdot \frac{\log(1/\delta)}{4n},\ 1\right),$$
which completes our proof.

This section contains the proofs of the lemmas appearing in Section 5. Note that rescaling the response $Y$ by $1/m$ affects the excess risk by a multiplicative factor $1/m^2$. Thus, without loss of generality, in all the proofs of this section we may assume that Assumption 1 holds with $m = 1$.

Proof of Lemma 1. The proof of this lemma is based on a combination of the classical localization argument via empirical Rademacher complexities of [7] and the covering number bounds for truncated VC-subgraph classes due to [28]. First, define the star-hull of $|\mathcal{F} - \mathcal{F}| = \{|f - g| : f, g \in \mathcal{F}\}$ by $\mathcal{H}$, and for $r > 0$ define its localized subsets $\mathcal{H}_r$:
$$\mathcal{H} = \Big\{\beta|f - g| : \beta \in [0, 1],\ f, g \in \mathcal{F}\Big\}, \qquad \mathcal{H}_r = \Big\{h \in \mathcal{H} : \frac{1}{n}\sum_{i=1}^n |h(X_i)| \le r\Big\}.$$
Let $\widehat{\psi}_n : [0, \infty) \to \mathbb{R}$ denote any sub-root function with unique positive fixed point $\widehat{r}^*$ (that is, a positive solution of the equation $\widehat{\psi}_n(\widehat{r}^*) = \widehat{r}^*$; see [7, Definition 3.1, Lemma 3.2]). Suppose that $\widehat{\psi}_n$ satisfies the following inequality for any $r \ge \widehat{r}^*$:
$$\frac{1}{n}\,\mathbb{E}_{\varepsilon_1, \ldots, \varepsilon_n}\left(\sup_{h \in \mathcal{H}_r}\sum_{i=1}^n \varepsilon_i h(X_i)\right) + \frac{\log(3/\delta)}{n} \lesssim \widehat{\psi}_n(r), \tag{25}$$
where $\varepsilon_1, \ldots, \varepsilon_n$ is a sequence of i.i.d. Rademacher random variables. Notice that for any $r > 0$ and any $h \in \mathcal{H}_r$ we have $\sup_x |h(x)| \le 2$ and $\mathbb{E}\,h(X)^2 \le 2\,\mathbb{E}\,h(X)$. Hence, by the first part of [7, Theorem 4.1], with probability at least $1 - \delta$ the following holds simultaneously for all $f, g \in \mathcal{F}$:
$$\mathbb{E}|f(X) - g(X)| \le \frac{2}{n}\sum_{i=1}^n |f(X_i) - g(X_i)| + c\left(\widehat{r}^* + \frac{\log(3/\delta)}{n}\right), \tag{26}$$
where $c > 0$ is some universal constant.

In the rest of the proof we show that a suitable value of $\widehat{r}^*$ can be obtained by upper bounding the empirical Rademacher complexity term via Dudley's entropy integral.
To do so, we first need to obtain an upper bound on the covering numbers of the class $\mathcal{H}$ with respect to the empirical $L_2$ distance, defined between any $h, h' \in \mathcal{H}$ by $\sqrt{\frac{1}{n}\sum_{i=1}^n (h(X_i) - h'(X_i))^2}$. In what follows, for any class $\mathcal{G}$ and any $\gamma > 0$, an empirical $L_2$ $\gamma$-net of $\mathcal{G}$ will be denoted by $\mathcal{N}(\mathcal{G}, \gamma) \subseteq \mathcal{G}$. Thus, the covering number of $\mathcal{G}$ with respect to the empirical $L_2$ distance at scale $\gamma$ is at most $|\mathcal{N}(\mathcal{G}, \gamma)|$.

Since $\mathcal{H}$ is a star-hull of the class $|\mathcal{F} - \mathcal{F}|$, it follows from [49, Lemma 4.5] that for any $\gamma > 0$ we have
$$|\mathcal{N}(\mathcal{H}, \gamma)| \le \big|\mathcal{N}(\mathcal{F} - \mathcal{F}, \gamma/2)\big| \cdot \frac{2}{\gamma}. \tag{27}$$
Further, noting that the Minkowski sum of two $\gamma/4$-covers of $\mathcal{F}$ forms a $\gamma/2$-cover of $\mathcal{F} - \mathcal{F}$, it follows that
$$\big|\mathcal{N}(\mathcal{F} - \mathcal{F}, \gamma/2)\big| \le \big|\mathcal{N}(\mathcal{F}, \gamma/4)\big|^2. \tag{28}$$
Let $\mathcal{F}^+ = \{x \mapsto \max(0, f(x)) : f \in \mathcal{F}\}$ and $\mathcal{F}^- = \{x \mapsto \min(0, f(x)) : f \in \mathcal{F}\}$. By the same argument, it holds that
$$\big|\mathcal{N}(\mathcal{F}, \gamma/4)\big| \le \big|\mathcal{N}(\mathcal{F}^+, \gamma/8)\big| \cdot \big|\mathcal{N}(\mathcal{F}^-, \gamma/8)\big|. \tag{29}$$
Finally, plugging the upper bounds on the covering numbers of $\mathcal{F}^+$ and $\mathcal{F}^-$ due to [28, Theorem 9.4, Theorem 9.5] into the chain of inequalities (27), (28) and (29) yields
$$\log|\mathcal{N}(\mathcal{H}, \gamma)| \lesssim d\log(e/\gamma).$$
Plugging the above inequality into Dudley's entropy integral upper bound on Rademacher complexities [25, Theorem 3.5.1], we obtain
$$\frac{1}{n}\,\mathbb{E}_{\varepsilon_1, \ldots, \varepsilon_n}\left(\sup_{h \in \mathcal{H}_r}\sum_{i=1}^n \varepsilon_i h(X_i)\right) \lesssim \frac{1}{\sqrt{n}}\int_0^{\sqrt{2r}} \sqrt{d\log(e/\gamma)}\,\mathrm{d}\gamma \lesssim \sqrt{\frac{rd}{n}\log(e/r)},$$
using that every $h \in \mathcal{H}_r$ satisfies $\frac{1}{n}\sum_{i=1}^n h(X_i)^2 \le \frac{2}{n}\sum_{i=1}^n |h(X_i)| \le 2r$. One can then choose a sub-root function $\widehat{\psi}_n$ dominating the right-hand side (plus the $\log(3/\delta)/n$ term of (25)), whose fixed point satisfies $\widehat{r}^* \lesssim (d\log(n/d) + \log(3/\delta))/n$; combined with (26), this completes the proof of Lemma 1.

The inequality (33) completes the proof of the third inequality of Lemma 2. Finally, the second inequality appearing in the statement of that lemma is implied (on the event of the first and third inequalities) by plugging (32) into (31), together with the inequality $\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$, valid for any $a, b \ge 0$. The proof of Lemma 2 is thus complete.

Proof of Lemma 3. Let $E$ denote the event of Lemma 2 (thus, $\mathbb{P}(E) \ge 1 - \delta$). By the definition of $g^*$, for any $g \in \mathcal{G}$ we have $R(g^*) - R(g) \le 0$. Hence, on the event $E$ it holds simultaneously for all $g \in \mathcal{G}$ that
$$\mathrm{MOM}^{k}_{S_n}(\ell_{g^*} - \ell_g) \le R(g^*) - R(g) + \big|R(g^*) - R(g) - \mathrm{MOM}^{k}_{S_n}(\ell_{g^*} - \ell_g)\big| \le \sqrt{2}\,\alpha\sqrt{\frac{1}{n}\sum_{i=1}^n \big(g^*(X_i) - g(X_i)\big)^2} + \alpha^2.$$
In particular, on the event $E$ the function $g^* \in \widehat{\mathcal{G}}$, which completes the first part of the proof. We now turn to proving the second part of this lemma. Since $g^* \in \widehat{\mathcal{G}}$, by the definition of $\widehat{\mathcal{G}}$, for any $g \in \widehat{\mathcal{G}}$ we have
$$\mathrm{MOM}^{k}_{S_n}(\ell_g - \ell_{g^*}) \le \sqrt{2}\,\alpha\sqrt{\frac{1}{n}\sum_{i=1}^n \big(g(X_i) - g^*(X_i)\big)^2} + \alpha^2.$$
On the event $E$, by the third inequality of Lemma 2, for any $g \in \widehat{\mathcal{G}}$ it holds that
$$R(g) - R(g^*) \le \big|R(g) - R(g^*) - \mathrm{MOM}^{k}_{S_n}(\ell_g - \ell_{g^*})\big| + \mathrm{MOM}^{k}_{S_n}(\ell_g - \ell_{g^*}) \le 2\sqrt{2}\,\alpha\sqrt{\frac{1}{n}\sum_{i=1}^n \big(g(X_i) - g^*(X_i)\big)^2} + 2\alpha^2 \le 4\alpha\sqrt{\mathbb{E}\big(g(X) - g^*(X)\big)^2} + 5\alpha^2.$$
By the definition of the $L_2$ diameter of the class $\widehat{\mathcal{G}}$ and the fact that $g^*, g \in \widehat{\mathcal{G}}$, it follows that $\sqrt{\mathbb{E}(g(X) - g^*(X))^2} \le D(\widehat{\mathcal{G}})$, and hence our proof is complete.

Proof of Lemma 4. First observe that
$$R(\widehat{g}) = R(g^*) + \Big(R(\widehat{g}) - R(g^*) - \mathrm{MOM}^{k}_{S_n}(\ell_{\widehat{g}} - \ell_{g^*})\Big) + \mathrm{MOM}^{k}_{S_n}(\ell_{\widehat{g}} - \ell_{g^*}) \le R(g^*) + \sup_{g \in \mathcal{G}}\Big|R(g) - R(g^*) - \mathrm{MOM}^{k}_{S_n}(\ell_g - \ell_{g^*})\Big| + \mathrm{MOM}^{k}_{S_n}(\ell_{\widehat{g}} - \ell_{g^*}) \le R(g^*) + \alpha D(\mathcal{G}) + \mathrm{MOM}^{k}_{S_n}(\ell_{\widehat{g}} - \ell_{g^*}), \tag{34}$$
where the last line follows via an application of Lemma 2. Further, notice that by the definition of $\widehat{g}$ we have
$$\mathrm{MOM}^{k}_{S_n}(\ell_{\widehat{g}} - \ell_{g^*}) \le \max_{g \in \widehat{\mathcal{G}}} \mathrm{MOM}^{k}_{S_n}(\ell_{\widehat{g}} - \ell_g) \le \max_{g \in \widehat{\mathcal{G}}} \mathrm{MOM}^{k}_{S_n}(\ell_{g^*} - \ell_g).$$
At the same time, on the event of Lemma 2, for all $g \in \mathcal{G}$ we have
$$\mathrm{MOM}^{k}_{S_n}(\ell_{g^*} - \ell_g) \le R(g^*) - R(g) + \alpha D(\mathcal{G}) \le \alpha D(\mathcal{G}).$$
Combining the above inequality with (34) concludes our proof.
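The median-of-means comparisons analyzed in Lemmas 2–4 are straightforward to implement. The following is a minimal, self-contained sketch of a tournament of this kind over a finite dictionary (confidence set on one part of the sample, midpoint enlargement, selection on another part); the thresholds, constants and helper names are illustrative assumptions, not the exact procedure specified in Section 5.

```python
import numpy as np

def mom(values, k):
    """Median-of-means of `values` over k (roughly) equal-size blocks."""
    blocks = np.array_split(values, k)
    return float(np.median([b.mean() for b in blocks]))

def mom_loss_diff(f_vals, g_vals, y, k):
    """MOM estimate of E[(f(X)-Y)^2 - (g(X)-Y)^2]."""
    return mom((f_vals - y) ** 2 - (g_vals - y) ** 2, k)

def mom_tournament(F2, y2, F3, y3, k, alpha):
    """Schematic MOM tournament over a finite dictionary of N predictors.

    F2, F3: arrays of shape (N, n2) and (N, n3) holding the dictionary
    evaluated on the samples S2 and S3; y2, y3: the corresponding responses.
    Returns the values on S3 of the selected (possibly improper) predictor.
    """
    N = F2.shape[0]
    # Confidence set on S2: keep predictors that never lose a MOM comparison
    # by more than an (illustrative) distance-dependent threshold.
    keep = []
    for i in range(N):
        ok = all(
            mom_loss_diff(F2[i], F2[j], y2, k)
            <= alpha * np.sqrt(np.mean((F2[i] - F2[j]) ** 2)) + alpha ** 2
            for j in range(N)
        )
        if ok:
            keep.append(i)
    if not keep:                      # guard for this illustration only
        keep = list(range(N))
    # Midpoint enlargement: averages of pairs of kept predictors; this is the
    # step that may take the final estimator outside the dictionary.
    cands = [F3[i] for i in keep]
    cands += [(F3[i] + F3[j]) / 2 for i in keep for j in keep if i < j]
    # Selection on S3: smallest worst-case MOM comparison against the kept set.
    scores = [max(mom_loss_diff(c, F3[j], y3, k) for j in keep) for c in cands]
    return cands[int(np.argmin(scores))]
```

In a usage of this sketch, one would evaluate the dictionary on the two sample parts and call `mom_tournament` with the number of blocks $k$ and the parameter $\alpha$ chosen as in the analysis (cf. (20) and the proof of Theorem 4); the midpoint step is precisely what makes the resulting procedure improper.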
Acknowledgements. We would like to thank Manfred Warmuth for several useful discussions.

References

[1] N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. Journal of Computer and System Sciences, 58(1):137–147, 1999.
[2] J.-Y. Audibert. Progressive mixture rules are deviation suboptimal. In Advances in Neural Information Processing Systems 20, pages 41–48, 2008.
[3] J.-Y. Audibert. Fast learning rates in statistical inference through aggregation. The Annals of Statistics, 37(4):1591–1646, 2009.
[4] J.-Y. Audibert and O. Catoni. Linear regression through PAC-Bayesian truncation. arXiv preprint arXiv:1010.0072, 2010.
[5] J.-Y. Audibert and O. Catoni. Robust linear least squares regression. The Annals of Statistics, 39(5):2766–2794, 2011.
[6] K. S. Azoury and M. K. Warmuth. Relative loss bounds for on-line density estimation with the exponential family of distributions. Machine Learning, 43(3):211–246, 2001.
[7] P. L. Bartlett, O. Bousquet, and S. Mendelson. Local Rademacher complexities. The Annals of Statistics, 33(4):1497–1537, 2005.
[8] D. Belomestny and J. Schoenmakers. Advanced Simulation-Based Methods for Optimal Stopping and Control: With Applications in Finance. Springer, 2018.
[9] L. Breiman and D. Freedman. How many variables should be entered in a regression equation? Journal of the American Statistical Association, 78(381):131–136, 1983.
[10] C. Brownlees, E. Joly, and G. Lugosi. Empirical risk minimization for heavy-tailed losses. The Annals of Statistics, 43(6):2507–2536, 2015.
[11] A. Caponnetto and E. De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007.
[12] O. Catoni. Statistical Learning Theory and Stochastic Optimization: Ecole d'Eté de Probabilités de Saint-Flour XXXI - 2001, volume 1851 of Lecture Notes in Mathematics. Springer-Verlag Berlin Heidelberg, 2004.
[13] O. Catoni. Challenging the empirical mean and empirical variance: a deviation study. Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, 48(4):1148–1185, 2012.
[14] O. Catoni. PAC-Bayesian bounds for the Gram matrix and least squares regression with a random design. arXiv preprint arXiv:1603.05229, 2016.
[15] O. Catoni and I. Giulini. Dimension-free PAC-Bayesian bounds for matrices, vectors, and linear least squares regression. arXiv preprint arXiv:1712.02747, 2017.
[16] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
[17] Y. Cherapanamjeri, E. Aras, N. Tripuraneni, M. I. Jordan, N. Flammarion, and P. L. Bartlett. Optimal robust linear regression in nearly linear time. arXiv preprint arXiv:2007.08137, 2020.
[18] G. Chinot, G. Lecué, and M. Lerasle. Robust statistical learning with Lipschitz and convex loss functions. Probability Theory and Related Fields, pages 1–44, 2019.
[19] A. Cohen, M. A. Davenport, and D. Leviatan. On the stability and accuracy of least squares approximations. Foundations of Computational Mathematics, 13(5):819–834, 2013.
[20] F. Comte and V. Genon-Catalot. Regression function estimation on non compact support in an heteroscedastic model. Metrika, 83(1):93–128, 2020.
[21] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition, volume 31. Springer Science & Business Media, 2013.
[22] L. Devroye, M. Lerasle, G. Lugosi, and R. I. Oliveira. Sub-Gaussian mean estimators. The Annals of Statistics, 44(6):2695–2725, 2016.
[23] I. Diakonikolas and D. M. Kane. Recent advances in algorithmic high-dimensional robust statistics. arXiv preprint arXiv:1911.05911, 2019.
[24] J. Forster and M. K. Warmuth. Relative expected instantaneous loss bounds. Journal of Computer and System Sciences, 64(1):76–102, 2002.
[25] E. Giné and R. Nickl. Mathematical Foundations of Infinite-Dimensional Statistical Models, volume 40. Cambridge University Press, 2016.
[26] E. Gobet. Monte-Carlo Methods and Stochastic Processes: From Linear to Non-linear. CRC Press, 2016.
[27] E. Gobet and P. Turkedjiev. Adaptive importance sampling in least-squares Monte Carlo algorithms for backward stochastic differential equations. Stochastic Processes and their Applications, 127(4):1171–1203, 2017.
[28] L. Györfi, M. Kohler, A. Krzyzak, and H. Walk. A Distribution-free Theory of Nonparametric Regression. Springer Science & Business Media, 2002.
[29] F. R. Hampel, P. J. Rousseeuw, E. M. Ronchetti, and W. A. Stahel. Robust Statistics: The Approach Based on Influence Functions. Wiley, 1980.
[30] Q. Han and J. A. Wellner. Convergence rates of least squares regression estimators with heavy-tailed errors. The Annals of Statistics, 47(4):2286–2319, 2019.
[31] D. Haussler, N. Littlestone, and M. K. Warmuth. Predicting {0, 1}-functions on randomly drawn points. Information and Computation, 115(2):248–292, 1994.
[32] D. Hsu and S. Sabato. Loss minimization and parameter estimation with heavy tails. Journal of Machine Learning Research, 17(18):1–40, 2016.
[33] P. J. Huber. Robust Statistics. Wiley Series in Probability and Mathematical Statistics, 1981.
[34] M. R. Jerrum, L. G. Valiant, and V. V. Vazirani. Random generation of combinatorial structures from a uniform distribution. Theoretical Computer Science, 43:169–188, 1986.
[35] A. Juditsky, P. Rigollet, and A. B. Tsybakov. Learning by mirror averaging. The Annals of Statistics, 36(5):2183–2206, 2008.
[36] A. Klivans, P. K. Kothari, and R. Meka. Efficient algorithms for outlier-robust regression. In (Extended abstract) Proceedings of the 31st Conference On Learning Theory, volume 75, pages 1420–1430, 2018.
[37] Y. Klochkov, A. Kroshnin, and N. Zhivotovskiy. Robust k-means clustering for distributions with two moments. The Annals of Statistics (forthcoming), 2020.
[38] V. Koltchinskii. Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems: Ecole d'Eté de Probabilités de Saint-Flour XXXVIII-2008, volume 2033. Springer Science & Business Media, 2011.
[39] G. Lecué and M. Lerasle. Robust machine learning by median-of-means: theory and practice. The Annals of Statistics, 48(2):906–931, 2020.
[40] G. Lecué and S. Mendelson. Aggregation via empirical risk minimization. Probability Theory and Related Fields, 145(3-4):591–613, 2009.
[41] G. Lecué and S. Mendelson. Performance of empirical risk minimization in linear aggregation. Bernoulli, 22(3):1520–1534, 2016.
[42] M. Ledoux and M. Talagrand. Conditions d'intégrabilité pour les multiplicateurs dans le TLC banachique. The Annals of Probability, pages 916–921, 1986.
[43] G. Lugosi and S. Mendelson. Mean estimation and regression under heavy-tailed distributions: A survey. Foundations of Computational Mathematics, 19(5):1145–1190, 2019.
[44] G. Lugosi and S. Mendelson. Near-optimal mean estimators with respect to general norms. Probability Theory and Related Fields, 175:957–973, 2019.
[45] G. Lugosi and S. Mendelson. Risk minimization by median-of-means tournaments. Journal of the European Mathematical Society, 22(3):925–965, 2019.
[46] G. Lugosi and S. Mendelson. Sub-Gaussian estimators of the mean of a random vector. The Annals of Statistics, 47(2):783–794, 2019.
[47] G. Lugosi and S. Mendelson. Robust multivariate mean estimation: The optimality of trimmed mean. The Annals of Statistics, 49(1):393–410, 2021.
[48] P. Massart. Concentration Inequalities and Model Selection: Ecole d'Eté de Probabilités de Saint-Flour XXXIII - 2003. Lecture Notes in Mathematics. Springer-Verlag Berlin Heidelberg, 2007.
[49] S. Mendelson. Improving the sample complexity using global data. IEEE Transactions on Information Theory, 48(7):1977–1991, 2002.
[50] S. Mendelson. Learning without concentration. Journal of the ACM, 62(3):21, 2015.
[51] S. Mendelson. On aggregation for heavy-tailed classes. Probability Theory and Related Fields, 168(3):641–674, 2017.
[52] S. Mendelson. An unrestricted learning procedure. Journal of the ACM, 66(6):1–42, 2019.
[53] S. Mendelson. Extending the scope of the small-ball method. Studia Mathematica, pages 1–21, 2020.
[54] S. Mendelson. Learning bounded subsets of $L_p$. arXiv preprint arXiv:2002.01182, 2020.
[55] S. Mendelson and N. Zhivotovskiy. Robust covariance estimation under $L_4$–$L_2$ norm equivalence. The Annals of Statistics, 48(3):1648–1664, 2020.
[56] S. Minsker. Geometric median and robust estimation in Banach spaces. Bernoulli, 21(4):2308–2335, 2015.
[57] S. Minsker and T. Mathieu. Excess risk bounds in robust empirical risk minimization. arXiv preprint arXiv:1910.07485, 2019.
[58] J. Mourtada. Exact minimax risk for linear least squares, and the lower tail of sample covariance matrices. arXiv preprint arXiv:1912.10754, 2019.
[59] A. Nemirovski. Topics in non-parametric statistics. Ecole d'Eté de Probabilités de Saint-Flour, 28:85, 2000.
[60] A. Nemirovsky and D. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley, New York, 1983.
[61] R. Oliveira. The lower tail of random quadratic forms with applications to ordinary least squares. Probability Theory and Related Fields, 166(3-4):1175–1194, 2016.
[62] A. Pensia, V. Jog, and P.-L. Loh. Robust regression with covariate filtering: Heavy tails and adversarial contamination. arXiv preprint arXiv:2009.12976, 2020.
[63] A. Rakhlin, K. Sridharan, and A. B. Tsybakov. Empirical entropy, minimax regret and minimax risk. Bernoulli, 23(2):789–824, 2017.
[64] P. J. Rousseeuw and A. M. Leroy. Robust Regression and Outlier Detection, volume 589. John Wiley & Sons, 2005.
[65] A. Saumard. On optimality of empirical risk minimization in linear aggregation. Bernoulli, 24(3):2176–2203, 2018.
[66] O. Shamir. The sample complexity of learning linear predictors with the squared loss. Journal of Machine Learning Research, 16(108):3475–3486, 2015.
[67] A. B. Tsybakov. Optimal rates of aggregation. In Learning Theory and Kernel Machines, volume 2777 of Lecture Notes in Artificial Intelligence, pages 303–313. Springer Berlin Heidelberg, 2003.
[68] J. W. Tukey. A survey of sampling from contaminated distributions. Contributions to Probability and Statistics, pages 448–485, 1960.
[69] L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.
[70] S. Van de Geer. Empirical Processes in M-estimation, volume 6. Cambridge University Press, 2000.
[71] A. W. Van Der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes. Springer-Verlag, New York, 1996.
[72] V. Vapnik and A. Chervonenkis. Theory of Pattern Recognition. Nauka, Moscow, 1974.
[73] T. Vaškevičius and N. Zhivotovskiy. Suboptimality of constrained least squares and improvements via non-linear predictors. arXiv preprint arXiv:2009.09304, 2020.
[74] R. Vershynin. High-Dimensional Probability: An Introduction with Applications, volume 47 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2016.
[75] V. Vovk. Competitive on-line statistics. International Statistical Review, 69(2):213–248, 2001.
[76] M. J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press, 2019.
[77] Y. Yang. Combining different procedures for adaptive regression. Journal of Multivariate Analysis, 74(1):135–161, 2000.
[78] Y. Yang. Mixing strategies for density estimation. The Annals of Statistics, 28(1):75–87, 2000.
[79] D. Z. Zanger. Quantitative error estimates for a least-squares Monte Carlo algorithm for American option pricing. Finance and Stochastics, 17(3):503–534, 2013.