Robust descent using smoothed multiplicative noise
Matthew J. Holland*, Osaka University, Yamada-oka 2-8, Suita, Osaka, Japan
Abstract
To improve the off-sample generalization of classical procedures minimizing the empirical risk under potentially heavy-tailed data, new robust learning algorithms have been proposed in recent years, with generalized median-of-means strategies being particularly salient. These procedures enjoy performance guarantees in the form of sharp risk bounds under weak moment assumptions on the underlying loss, but typically suffer from a large computational overhead and substantial bias when the data happens to be sub-Gaussian, limiting their utility. In this work, we propose a novel robust gradient descent procedure which makes use of a smoothed multiplicative noise applied directly to observations before constructing a sum of soft-truncated gradient coordinates. We show that the procedure has competitive theoretical guarantees, with the major advantage of a simple implementation that does not require an iterative sub-routine for robustification. Empirical tests reinforce the theory, showing more efficient generalization over a much wider class of data distributions.
The risk minimization model of learning is ubiquitous in machine learning, and it effectively captures the key facets of any effective learning algorithm: we must have reliable statistical inference procedures, and practical implementations of these procedures. The model is formulated using the expected loss, or risk, $R(w) := \mathbb{E}\, l(w; z)$, induced by a loss $l$, where $w$ is the parameter (vector, function, set, etc.) to be learned, and expectation is taken with respect to $z$. In practice, all we are given is data $z_1, \ldots, z_n$, and based on this the algorithm outputs some candidate $\hat{w}$. If $R(\hat{w})$ is small with high confidence over the random sample, this provides some evidence for good generalization, subject to the assumptions placed on the underlying distribution. The statistical side is important because the risk $R$ is always unknown, and the implementation is important since the only $\hat{w}$ we ever have in practice is one we can actually compute given finite data, time, and memory.

The vast majority of popular algorithms used today can be viewed as different implementations of empirical risk minimization (ERM), which admits any minimizer of $n^{-1}\sum_{i=1}^{n} l(\cdot; z_i)$. From an algorithmic perspective, ERM is ambiguous; there are countless ways to implement the ERM procedure, and important work in recent years has highlighted the fact that a tremendous gap exists between the quality of good and bad ERM solutions [8], for tasks as simple as multi-class pattern recognition [7], let alone tasks with unbounded losses. Furthermore, even tried-and-true implementations such as ERM by gradient descent (ERM-GD) only have appealing guarantees when the data is distributed sharply around the mean in a sub-Gaussian sense, as demonstrated in important work by Lin and Rosasco [15]. These facts are important because ERM is ubiquitous in modern learning algorithms, and heavy-tailed data is by no means exceptional [9].

* Please direct correspondence to [email protected].
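To fix ideas, the ERM-GD baseline just discussed can be sketched in a few lines for squared-loss linear regression; each update direction is simply the sample mean of the per-example loss gradients. This is our own illustrative code, not from the paper:

```python
import numpy as np

def erm_gd(X, y, steps=500, lr=0.1):
    """Plain ERM by gradient descent for squared-loss linear regression:
    the update uses the sample mean of the per-example loss gradients."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        resid = X @ w - y                # shape (n,)
        grad = 2.0 * X.T @ resid / n     # mean of 2*(<w,x_i> - y_i) x_i
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
w_true = np.array([1.0, -1.0, 2.0])
y = X @ w_true + 0.1 * rng.normal(size=500)  # sub-Gaussian noise: ERM's home turf
w_hat = erm_gd(X, y)
```

Under light-tailed noise as above, this baseline performs very well; the question addressed by the works cited below is what happens when the noise is heavy-tailed.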
Furthermore, these works suggest that procedures which have been designed to deal with finite samples of heavy-tailed data may be much more efficient than traditional ERM-based approaches, and indeed the theoretical promise of robust learning algorithms is being studied rigorously [13, 17, 18, 19].

Review of related work
Here we review the technical literature most closely related to our work. The canonical benchmark to be compared against is ERM-GD, for which Lin and Rosasco [15] in pathbreaking work provide generalization guarantees under sub-Gaussian data. There are naturally two points of interest: (1) How do competing algorithms perform in settings where ERM is optimal? (2) What about robustness to settings in which ERM is sub-optimal?

Many interesting robust learning algorithms have been studied in the past few years. One important procedure is from Brownlees et al. [1], based on fundamental results due to Catoni [3]. The basic idea is to minimize an M-estimator of the risk, namely
$$\hat{w} = \arg\min_{w} \hat{l}(w), \qquad \hat{l}(w) = \arg\min_{\theta \in \mathbb{R}} \sum_{i=1}^{n} \rho\left(l(w; z_i) - \theta\right).$$
While the statistical guarantees are near-optimal under weak assumptions on the data, and the proxy loss $\hat{l}$ can be computed accurately by an iterative procedure, its definition is implicit, and leads to rather significant computational roadblocks. Even if $l$ and $R$ are convex, the proxy loss need not be, and the non-linear optimization required by this method can be both unstable and costly in high dimensions.

Another important body of work looks at generalization of the classical "median of means" technique to higher dimensions. From Minsker [20] and Hsu and Sabato [10], the core idea is to partition the data into $k$ disjoint subsets $\mathcal{D}_1 \cup \cdots \cup \mathcal{D}_k = \{1, 2, \ldots, n\}$, obtain ERM solutions on each subset, and then robustly aggregate these solutions such that poor candidates are effectively ignored. For example, using the geometric median approach to aggregation, we have
$$\hat{w} = \arg\min_{w} \sum_{m=1}^{k} \|w - \tilde{w}_m\|, \qquad \tilde{w}_m = \arg\min_{w} \sum_{i \in \mathcal{D}_m} l(w; z_i), \quad m = 1, \ldots, k.$$
These robust aggregation methods can be implemented [27], and have appealing formal properties. An application of this technique to construct a robust loss was very recently proposed by Lecué et al. [14].
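The geometric median above has no closed form, but is typically computed with the classical Weiszfeld fixed-point scheme. A minimal sketch (our own illustrative code; `cands` plays the role of the subset solutions $\tilde{w}_m$):

```python
import numpy as np

def geometric_median(points, tol=1e-8, max_iter=200):
    """Weiszfeld iteration for argmin_u sum_m ||u - points[m]||,
    where points has one candidate vector per row."""
    u = points.mean(axis=0)
    u_new = u
    for _ in range(max_iter):
        d = np.maximum(np.linalg.norm(points - u, axis=1), 1e-12)
        w = 1.0 / d                           # inverse-distance weights
        u_new = (w[:, None] * points).sum(axis=0) / w.sum()
        if np.linalg.norm(u_new - u) < tol:
            break
        u = u_new
    return u_new

# Aggregating k = 12 "subset solutions", a few of which are wildly off.
rng = np.random.default_rng(1)
cands = rng.normal(loc=1.0, scale=0.1, size=(12, 4))  # good candidates near (1,...,1)
cands[:3] += 50.0                                     # three corrupted candidates
agg = geometric_median(cands)
```

Because the objective penalizes distance only linearly, the few corrupted candidates cannot drag the aggregate far from the majority cluster, which is exactly the robustness property exploited by the cited works.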
The main limitation of all these approaches is practical: when sample size $n$ is small relative to the number of parameters to be determined, very few subsets can be created, and significant error due to bias occurs; conversely, when $n$ is large enough to make many candidates, cheaper and less sophisticated methods often suffice. Furthermore, in the case of Lecué et al. [14], where an expensive sub-routine must be run at every iteration, the computational overhead is substantial.

Also in the recent literature, interesting work has begun to appear looking at "robust gradient descent" algorithms, which is to say steepest descent procedures which utilize a robust estimate of the gradient vector of the risk [5, 6, 24]. The basic idea is as follows. Assuming partial derivatives exist, and writing $g(w) := (\partial_1 R(w), \ldots, \partial_d R(w))$ for the risk gradient, we could iteratively minimize the risk by the following update:
$$w^*_{(t+1)} = w^*_{(t)} - \alpha_{(t)}\, g(w^*_{(t)}). \tag{1}$$
Naturally, this procedure is idealized, since the underlying distribution is never known in practice, meaning $R$ is always unknown. As such, we must approximate this objective function and optimize it with incomplete information. In taking a steepest descent approach, all that is required is an accurate approximation of $g$. Instead of first approximating $R$ and then using that approximation to infer $g$, computational resources are better spent approximating $g$ directly with some data-dependent $\hat{g}$ constructed using the loss gradients $l'(w; z_1), \ldots, l'(w; z_n)$, and plugging this into the iterative update, as
$$\hat{w}_{(t+1)} = \hat{w}_{(t)} - \alpha_{(t)}\, \hat{g}(\hat{w}_{(t)}). \tag{2}$$
Once again here, the median-of-means idea pops up in the literature, with Prasad et al. [24] using a robust aggregation of empirical mean estimates of the gradient.
That is, after partitioning the data into $k$ subsets as before, the estimate vector $\hat{g}$ is constructed as
$$\hat{g}(w) = \arg\min_{u} \sum_{m=1}^{k} \|u - \tilde{g}_m(w)\|, \qquad \tilde{g}_m(w) = \frac{1}{|\mathcal{D}_m|} \sum_{i \in \mathcal{D}_m} l'(w; z_i), \quad m = 1, \ldots, k,$$
and substituted within the gradient update (2). While conceptually a very appealing new proposition, there are a number of issues to be overcome. Their theoretical guarantees have statistical error terms which depend on $\sqrt{d}$ rather than $d$, where $d$ is the dimension of the gradient, but the same error terms also grow with the number of iterations, leading to very slow error rates when the number of iterations grows with $n$, as is required for $\varepsilon$-good performance (see Remark 11). Furthermore, computing $\hat{g}$ via a geometric median sub-routine introduces the exact same overhead and bias issues as the procedures of Hsu and Sabato [10] just discussed, only that this time these costs are incurred at each step of the gradient descent procedure, and thus these costs and errors accumulate, and can propagate over time. Iterative approximations at each update take time and are typically distribution-dependent, while fast approximations leave a major gap between the estimators studied in theory and those used in practice.

Our contributions
To address the limitations of both ERM-GD and the robust alternatives discussed above, we take an approach that allows us to obtain a robust gradient estimate directly, removing the need for iterative approximations, without losing the theoretical guarantees. In this paper, we provide both theoretical and empirical evidence that, using the proposed procedure, paying a small price in terms of bias and computational overhead is worth it when done correctly, leading to a large payout in terms of distributional robustness. Key contributions are as follows:

• A practical learning algorithm which can closely mimic ERM-GD when ERM is optimal, but which performs far better under heavy-tailed data when ERM deteriorates.

• Finite-sample risk bounds that hold with high probability for the proposed procedure, under weak moment assumptions on the distribution of the loss gradient.

• We demonstrate the ease of use and flexibility of our procedure in a series of experiments, testing performance using both controlled simulations and real-world datasets, and compared with numerous standard competitors.

Figure 1:
Illustration of the key elements of our proposed algorithm. From the far left, points represent loss gradient coordinates evaluated at different observations in our sample. To each point, we consider multiplication by Gaussian noise centered at 1. This noise is smoothed out by integration over the noise distribution, and applied to each gradient coordinate to generate a robust update direction.
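The pipeline in Figure 1 is easy to imitate numerically. The toy sketch below (our own illustration, using $\tanh$ as a stand-in soft truncation and brute-force averaging over noise draws in place of exact integration) multiplies each observation by Gaussian noise centered at 1, truncates, and rescales; a single gross outlier barely moves the result, while it destroys the sample mean:

```python
import numpy as np

rng = np.random.default_rng(2)

def smoothed_estimate(x, s, beta, n_noise=2000):
    """Monte-Carlo version of the smoothed estimator: average of
    (s/n) * sum_i psi(x_i (1 + eps_i) / s) over noise draws
    eps ~ N(0, 1/beta), with psi = tanh for illustration."""
    eps = rng.normal(scale=1.0 / np.sqrt(beta), size=(n_noise, len(x)))
    return (s / len(x)) * np.tanh(x * (1.0 + eps) / s).sum(axis=1).mean()

x = rng.normal(loc=3.0, size=500)   # well-behaved data, mean 3
x[0] = 1e6                          # one gross outlier
s = np.sqrt(len(x) * 10.0)          # scale on the order of sqrt(n * v)
est = smoothed_estimate(x, s=s, beta=4.0)
naive = x.mean()
```

The clean points pass through $\tanh$ almost linearly at this scale, so `est` stays near 3, while the outlier is clipped to contribute at most $\pm s/n$; `naive` is pulled to roughly 2000 by the single corrupted point.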
Content overview
Key concepts and basic technical ideas underlying the proposed algorithm are introduced in section 2. We fill in the details and provide performance guarantees via theoretical analysis in section 3, and reinforce our formal argument with a variety of empirical tests in section 4. Finally, concluding remarks and a look ahead are given in section 5.
Our proposed procedure can be derived in a few simple steps. Let us begin with a one-dimensional example, in which for a random variable $x$ we try to estimate $\mathbb{E}\,x$ based on a sample $x_1, \ldots, x_n$. Our problem of interest is the setting in which the underlying distribution may be heavy-tailed, but it also may not be, and this information is not available to the learner a priori.

Scaling and truncation
The first step involves a very primitive technique for ensuring the bias is small under well-behaved data, all while constraining the impact of outlying points. We re-scale, apply a soft truncation $\psi$, and then put the truncated arithmetic mean back on the original scale, namely
$$\frac{s}{n}\sum_{i=1}^{n} \psi\!\left(\frac{x_i}{s}\right) \approx \mathbb{E}\,x.$$
Here $\psi$ should have the symmetry of an odd function ($\psi(-u) = -\psi(u)$), be non-decreasing on $\mathbb{R}$, with a slope of $\psi'(u) \to 1$ as $u \to$
0, and be bounded on $\mathbb{R}$. A simple example is the hyperbolic tangent function, $\tanh(u)$, but we shall consider other examples shortly. If the scale $s > 0$ is set such that $|x_i|/s$ is near zero for all but errant observations, the non-deviant terms will enter the arithmetic mean approximately unchanged, while the deviant points will have a disproportionately small impact.

Noise multiplication
The second step involves applying multiplicative noise, albeit with a rather unique purpose. Let $\epsilon_1, \ldots, \epsilon_n$ be independent random noise, generated from a common distribution $\epsilon \sim \nu$ with $\mathbb{E}\,\epsilon = 0$. We multiply each datum by $1 + \epsilon_i$, and then pass each modified datum $x_i(1 + \epsilon_i) = x_i + x_i\epsilon_i$ through the truncation function as above, yielding
$$\tilde{x}(\epsilon) = \frac{s}{n}\sum_{i=1}^{n} \psi\!\left(\frac{x_i + \epsilon_i x_i}{s}\right).$$
The rationale: a random push in the wrong direction is simply absorbed by the bounded $\psi$, while a push in the right direction could earn an additional valid point for the estimator.

Noise smoothing
In the third and final step, we smooth out the multiplicative noise by taking the expectation of this estimator with respect to the noise distribution. This smoothed version of the estimator, still a random variable dependent on the original sample, is the final estimator of interest, defined
$$\hat{x} := \mathbb{E}_\nu\, \tilde{x}(\epsilon) = \frac{s}{n}\sum_{i=1}^{n} \int \psi\!\left(\frac{x_i + \epsilon_i x_i}{s}\right) d\nu(\epsilon_i). \tag{3}$$
Computationally, in order to obtain $\hat{x}$ to approximate $\mathbb{E}\,x$, we will not actually have to generate the $\epsilon_i$ and multiply the $x_i$ by $(1 + \epsilon_i)$; instead, we will have to evaluate the integral.

Computational matters
Before we move to the high-dimensional setting of interest, how can we actually compute this $\hat{x}$? Numerical integration is not appealing, as the overhead would be too much for a sub-routine to be repeated many times. Naturally, the computational approach will depend on the noise distribution $\nu$ and the truncation function $\psi$. Using recent results in the statistical literature from Catoni and Giulini [4], if we set the truncation function to be
$$\psi(u) := \begin{cases} u - u^3/6, & -\sqrt{2} \leq u \leq \sqrt{2} \\ 2\sqrt{2}/3, & u > \sqrt{2} \\ -2\sqrt{2}/3, & u < -\sqrt{2} \end{cases} \tag{4}$$
and the noise distribution to be $\nu = N(0, 1/\beta)$, then the integral of interest can be given in an explicit form that is simple to compute, requiring no numerical integration or approximation. Written in a general form with shift parameter $a \in \mathbb{R}$ and scale parameter $b >$
0, we can express the integral as
$$\mathbb{E}_\nu\, \psi\!\left(a + b\sqrt{\beta}\,\epsilon\right) = a\left(1 - \frac{a^2}{6} - \frac{b^2}{2}\right) + C(a, b), \tag{5}$$
where $C(a, b)$ is a correction term that is complicated to write, but extremely simple to implement (see Appendix A.3 for the exact form).

Proposed learning algorithm
Let us now return to the high-dimensional setting of interest. At any candidate $w$, we can evaluate the losses $l(w; z_i)$ and gradients $l'(w; z_i)$ for all points $i = 1, \ldots, n$. The heart of our proposal: apply the sub-routine specified in (3) to each coordinate of the loss gradients, which can be computed directly using (5), and plug the resulting "robust gradient estimate" into the usual first-order update (2). Pseudocode for the proposed procedure is provided in Algorithm 1. All operations on vectors in the pseudocode are element-wise, e.g., $u^2 = (u_1^2, \ldots, u_d^2)$, $|u| = (|u_1|, \ldots, |u_d|)$, $u/v = (u_1/v_1, \ldots, u_d/v_d)$, and so forth. For readability, we abbreviate $l'_i(w) := l'(w; z_i)$.

Algorithm 1: Outline of robust gradient descent learning algorithm.
inputs: $\hat{w}_{(0)}$, $T > 0$
for $t = 0, 1, \ldots, T-1$ do
  Scale set to minimize the error bound, via Remark 6 and Lemma 8:
  $$s^{(t)} = \sqrt{\frac{n\, v^{(t)}}{2\log(2d\delta^{-1})}}, \quad \text{where } v^{(t)} \geq \mathbb{E}_\mu\, l'_i(\hat{w}_{(t)})^2.$$
  Corrected gradient estimate, via (3) and (5):
  $$\hat{g}^{(t)} = \frac{1}{n}\sum_{i=1}^{n} l'_i(\hat{w}_{(t)})\left(1 - \frac{l'_i(\hat{w}_{(t)})^2}{2\beta\,(s^{(t)})^2} - \frac{l'_i(\hat{w}_{(t)})^2}{6\,(s^{(t)})^2}\right) + \frac{s^{(t)}}{n}\sum_{i=1}^{n} C\!\left(\frac{l'_i(\hat{w}_{(t)})}{s^{(t)}},\; \frac{|l'_i(\hat{w}_{(t)})|}{s^{(t)}\sqrt{\beta}}\right)$$
  Plug in to gradient-based update (2):
  $$\hat{w}_{(t+1)} = \hat{w}_{(t)} - \alpha_{(t)}\,\hat{g}^{(t)}$$
end for
return: $\hat{w}_{(T)}$

As a simple example of the guarantees that are available for this procedure, assuming just finite variance of the gradients, and setting $\alpha_{(t)} = \alpha$ for simplicity, we have that
$$R(\hat{w}_{(T)}) - R^* \leq O\left(\frac{d\left(\log(d\delta^{-1}) + d\log(\Delta n)\right)}{n}\right) + O\left((1 - \alpha\gamma)^T\right)$$
with probability at least $1 - \delta$ over the random draw of the sample, where $d$ is the dimension of the space the gradient lives in, $\Delta$ is the diameter of $\mathcal{W}$, and the constant $\gamma$ depends only on $R(\cdot)$.
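To make the structure of Algorithm 1 concrete, here is a compact sketch for squared-loss linear regression. This is our own illustrative code, not the paper's implementation: the smoothed integral (3) is evaluated by Gauss-Hermite quadrature rather than the closed form (5), and the scale and moment-bound settings are simple stand-ins in the spirit of Lemma 5:

```python
import numpy as np

SQRT2 = np.sqrt(2.0)
NODES, WEIGHTS = np.polynomial.hermite.hermgauss(60)

def psi(u):
    """Catoni-Giulini soft truncation: u - u^3/6 on [-sqrt2, sqrt2],
    constant +/- 2*sqrt(2)/3 outside that interval."""
    out = u - u**3 / 6.0
    out = np.where(u > SQRT2, 2.0 * SQRT2 / 3.0, out)
    out = np.where(u < -SQRT2, -2.0 * SQRT2 / 3.0, out)
    return out

def smoothed_mean(g, s, beta):
    """One coordinate of (3): (s/n) sum_i E_eps psi(g_i (1+eps)/s),
    eps ~ N(0, 1/beta), via Gauss-Hermite quadrature."""
    eps = SQRT2 * NODES / np.sqrt(beta)          # quadrature points for eps
    vals = psi(np.outer(g, 1.0 + eps) / s)       # shape (n, n_nodes)
    return (s / len(g)) * (vals @ WEIGHTS).sum() / np.sqrt(np.pi)

def robust_gd(X, y, T=200, alpha=0.1, beta=4.0, delta=0.05):
    """Sketch of the robust gradient descent loop of Algorithm 1."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(T):
        grads = 2.0 * (X @ w - y)[:, None] * X   # per-example loss gradients
        v = np.mean(grads**2, axis=0) / 2.0      # crude moment bound v^(t)
        s = np.sqrt(n * np.maximum(v, 1e-12) / (2.0 * np.log(1.0 / delta)))
        g_hat = np.array([smoothed_mean(grads[:, j], s[j], beta)
                          for j in range(d)])
        w = w - alpha * g_hat
    return w

rng = np.random.default_rng(3)
n, d = 600, 3
X = rng.normal(size=(n, d))
w_star = np.array([1.0, -2.0, 0.5])
y = X @ w_star + rng.standard_t(df=2.2, size=n)  # heavy-tailed noise
w_hat = robust_gd(X, y)
```

Note that, unlike the median-of-means alternatives, the inner loop has no iterative sub-routine: each update is a fixed number of element-wise operations per coordinate and data point.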
Theoretically, these results are competitive with the existing state-of-the-art methods cited in the previous section, but with the computational benefits of zero computational error, direct computability, and the fact that per-step computation time is independent of the underlying distribution.

In this section, we carry out some formal analysis of the generalization performance of Algorithm 1. More concretely, we provide guarantees in the form of high-probability upper bounds on the excess risk achieved by the proposed procedure, given a finite sample of $n$ observations and a finite budget of $T$ iterations. We start with a general sketch, followed by a precise description of key conditions, representative results, and subsequent discussion. All detailed proofs are given in Appendix A.2.

Sketch of the argument
Our approach can be broken down into three straightforward steps:
1. Obtain pointwise error bounds for $\hat{g}(w) \approx g(w)$.
2. Extend Step 1 to obtain error bounds uniform in $w \in \mathcal{W}$.
3. Control the distance of $\hat{w}_{(t)}$ from the minimizer at each step.

For the first step, we can leverage new technical results from Catoni and Giulini [4] to obtain strong guarantees for a novel estimator of the risk gradient, evaluated at an arbitrary, albeit pre-fixed, parameter $w \in \mathcal{W}$. With this result established, since Algorithm 1 updates iteratively, and the sequence of parameters $(\hat{w}_{(t)})$ is data dependent, bounds which hold for all possible contingencies are required. Since $\mathcal{W}$ has an infinite number of elements, naive union bounds are useless. However, a rather typical covering number argument offers an appealing solution. As long as a finite number of balls of radius $\varepsilon$ can cover $\mathcal{W}$, then we can discretize the space: every $w$ is $\varepsilon$-close to at least one ball center, and by the previous step we have strong pointwise guarantees that we can apply to each of the ball centers. Finally, while in practice any learning machine has no choice but to use the approximate update (2), when the risk function is convex, we can show that the deviation from the ideal procedure (1) can be tightly controlled. Indeed, under strong convexity, the distance between $\hat{w}_{(t)}$ and a minimizer of $R(\cdot)$ can be controlled by a sharp statistical error term, plus an optimization error equivalent to that of running the ideal gradient descent procedure (1).

Notation
Here we organize the key notation used in the remainder of our theoretical analysis and the associated proofs (some are re-statements of definitions above). The observable loss is $l : \mathcal{W} \times \mathcal{Z} \to \mathbb{R}$, where $\mathcal{W}$ is the model from which the learning machine can select parameters $w \in \mathcal{W}$, and $\mathcal{Z}$ is the space housing the data sample $z_1, \ldots, z_n$. The data distribution is denoted $z \sim \mu$, and the noise distribution featured in our algorithm is $\epsilon \sim \nu$. The risk to be minimized is $R(w) := \mathbb{E}_\mu\, l(w; z)$. The risk and loss gradients are respectively $g$ and $l'$. Estimates of $g$ based on observations of $l'$ are denoted $\hat{g}$. We frequently use $\mathbb{P}$ to denote a generic probability measure, typically the product measure induced by the sample, which should be clear from the context. Unless specified otherwise, $\|\cdot\|$ shall denote the usual $\ell_2$ norm on Euclidean space. For integer $k >$
0, write $[k] := \{1, \ldots, k\}$.

Assumptions with examples
No algorithm can achieve arbitrarily good performance across all possible distributions [25]. In order to obtain meaningful theoretical results, we must place conditions on the underlying distribution, as well as on the model and objective functions used. We give concrete examples to illustrate that these assumptions are reasonable, and that they include scenarios that allow for both sub-Gaussian and heavy-tailed data.

A0. $\mathcal{W}$ is a closed, convex subset of $\mathbb{R}^d$, with diameter $\Delta := \sup\{\|u - v\| : u, v \in \mathcal{W}\} < \infty$.
A1. The loss function $l(\cdot; z)$ is $\lambda$-smooth on $\mathcal{W}$.
A2. $R(\cdot)$ is $\lambda$-smooth, and continuously differentiable on $\mathcal{W}$.
A3. There exists $w^* \in \mathcal{W}$ at which $g(w^*) = 0$.
A4. $R(\cdot)$ is $\kappa$-strongly convex on $\mathcal{W}$.
A5. There exists $v < \infty$ such that $\mathbb{E}_\mu (l'_j(w; z))^2 \leq v$, for all $w \in \mathcal{W}$, $j \in [d]$.

Of these assumptions, assuredly A0 is simplest: any ball (here in the $\ell_2$ norm) with finite radius will suffice, though far more exotic examples are assuredly possible. The remaining assumptions require some checking, but hold under very weak assumptions on the underlying distribution, as the following examples show.

Example 1. Consider the linear regression model $y = \langle w^*, x \rangle + \eta$, where $x$ is almost surely bounded (say $\mathbb{P}\{\|x\| \leq c\} = 1$), but the noise $\eta$ can have any distribution we desire. Consider the squared loss $l(w; z) = (\langle w, x \rangle - y)^2$, and observe that for any $w_1, w_2 \in \mathcal{W}$, we have $l'(w_1; z) - l'(w_2; z) = 2(\langle w_1 - w_2, x \rangle)\,x$, so
$$\|l'(w_1; z) - l'(w_2; z)\| \leq 2\|x\|^2\|w_1 - w_2\| \leq 2c^2\|w_1 - w_2\|.$$
Thus we have smoothness with $\lambda = 2c^2$, satisfying A1.

Example 2. Consider a similar setup as in Example 1, but instead of requiring $x$ to be bounded, weaken the assumption to $\mathbb{E}\|x\|^2 < \infty$. Taking the derivative under the integral, we have $g(w) = \mathbb{E}\, l'(w; z) = 2\,\mathbb{E}(\langle w - w^*, x \rangle - \eta)\,x$. Clearly, $g(w^*) = 0$, satisfying A3. Furthermore, it follows that
$$g(w_1) - g(w_2) = \mathbb{E}\left(l'(w_1; z) - l'(w_2; z)\right) = 2\,\mathbb{E}(\langle w_1 - w_2, x \rangle)\,x.$$
We thus have $\|g(w_1) - g(w_2)\| \leq 2\,\mathbb{E}\|x\|^2\,\|w_1 - w_2\|$, meaning smoothness of the risk holds with $\lambda = 2\,\mathbb{E}\|x\|^2$, satisfying A2.

Example 3. Again consider a setting similar to Examples 1-2, but with the added assumptions that $\mathbb{E}\,x = 0$, that the noise $\eta$ and input $x$ are independent, and that the components of $x = (x_1, \ldots, x_d)$ are independent of each other. Some straightforward algebra shows that
$$\mathbb{E}(l'_j(w; z))^2 = 4\left(\mathbb{E}\,x_j^2\langle w - w^*, x\rangle^2 + \mathbb{E}\,\eta^2\,\mathbb{E}\,x_j^2\right) \leq 4\left(\|w - w^*\|^2\,\mathbb{E}\,x_j^2\|x\|^2 + \mathbb{E}\,\eta^2\,\mathbb{E}\,x_j^2\right).$$
It follows that as long as the noise $\eta$ has finite variance ($\mathbb{E}\,\eta^2 < \infty$), and all inputs have finite fourth moments ($\mathbb{E}\,x_j^4 < \infty$), then using assumption A0, we get
$$\mathbb{E}(l'_j(w; z))^2 \leq 4\left(\Delta^2\,\mathbb{E}\,x_j^2\|x\|^2 + \mathbb{E}\,\eta^2\,\mathbb{E}\,x_j^2\right) < \infty.$$
This holds for all $w \in \mathcal{W}$, satisfying A5.

Example 4. Consider the same setup as Example 3. Since $\mathbb{E}\,x_j\eta = (\mathbb{E}\,x_j)(\mathbb{E}\,\eta) = 0$ for each $j \in [d]$, it follows that the risk induced by the squared loss under this model takes a convenient quadratic form,
$$R(w) = \mathbb{E}\, l(w; z) = (w - w^*)^T A (w - w^*) + b,$$
with $A = \mathbb{E}\,xx^T$ and $b = \mathbb{E}\,\eta^2$. For concreteness, say all the components of $x$ have variance $\mathbb{E}\,x_j^2 = \sigma^2$, recalling that $\mathbb{E}\,x_j = 0$ by assumption. Then the Hessian matrix of $R(\cdot)$ is $\nabla^2 R(w) = 2\,\mathbb{E}\,xx^T = 2\sigma^2 I_d$, for all $w \in \mathcal{W}$. For any $w_1, w_2 \in \mathcal{W}$, taking an exact Taylor expansion, we have that
$$R(w_1) = R(w_2) + \langle g(w_2), w_1 - w_2 \rangle + \frac{1}{2}\langle w_1 - w_2, \nabla^2 R(u)(w_1 - w_2)\rangle$$
for some appropriate $u$ on the line segment between $w_1$ and $w_2$. Since the Hessian is positive definite with factor $2\sigma^2$, the last term on the right-hand side can be no smaller than $\|w_1 - w_2\|^2 \cdot 2\sigma^2/$
2. This implies a lower bound,
$$R(w_1) \geq R(w_2) + \langle g(w_2), w_1 - w_2 \rangle + \sigma^2\|w_1 - w_2\|^2.$$
The exact same inequality holds for any choice of $w_1$ and $w_2$. This is precisely the definition of strong convexity of $R(\cdot)$ given in (13), with convexity parameter $\kappa = 2\sigma^2$, satisfying A4. In the analysis that follows, A0-A5 are assumed to hold.

Analysis of Algorithm 1 with discussion
Here we consider the learning performance of the proposed procedure given by Algorithm 1. Almost every step of the procedure is given explicitly, save for the means of setting the moment bound $v^{(t)}$ (see section 4), and the step size setting of $\alpha_{(t)}$. In the subsequent analysis of section 3, we shall specify exact settings of $\alpha_{(t)}$, and show how these settings impact the final guarantees that can be made.

We begin with a general fact showing that the sub-routine used to estimate each element of the risk gradient has sharp guarantees under weak assumptions.

Lemma 5 (Pointwise accuracy). Consider data $x_1, \ldots, x_n$, with distribution $x \sim \mu$. Assume finite second moments, and a known upper bound $\mathbb{E}_\mu x^2 \leq v < \infty$. With probability no less than $1 - \delta$, the estimator $\hat{x}$ defined in (3), scaled using $s = \sqrt{nv/(2\log(\delta^{-1}))}$, satisfies
$$|\hat{x} - \mathbb{E}_\mu x| \leq \sqrt{\frac{2v\log(\delta^{-1})}{n}} + \sqrt{\frac{v}{n}}.$$

Remark 6. Note that the scale setting $s$ grows with $n$ and the second moment of the underlying distribution. This can be derived from an exponential tail bound on the deviations of this estimator, namely we have that for any choice of $s > 0$,
$$\mathbb{P}\left\{|\hat{x} - \mathbb{E}_\mu x| \leq \frac{v}{2s} + \frac{s\log(\delta^{-1})}{n} + \sqrt{\frac{v}{n}}\right\} \geq 1 - \delta.$$
Choosing $s$ to minimize this upper bound yields the final result given in the lemma. Having a scale large relative to the second moments ensures the estimator behaves similarly regardless of the underlying scale of the data.
Note that it also increases with sample size $n$: this is intuitive since, in most cases, given a larger sample, we can allow the estimator to be more sensitive to outliers, earning a reduction in bias.

Of critical importance is that Lemma 5 only assumes finite variance, nothing more. Higher-order moments may be infinite or undefined, and the results still hold. This means the results hold for both Gaussian-like well-behaved data, and heavy-tailed data which are prone to errant observations. Next, we show that this estimator has a natural continuity property.

Lemma 7 (Estimator is Lipschitz). Considering the estimator $\hat{x}$ defined in (3) as a function of the data $\boldsymbol{x} := (x_1, \ldots, x_n) \in \mathbb{R}^n$, it satisfies the following Lipschitz property:
$$|\hat{x}(\boldsymbol{x}_1) - \hat{x}(\boldsymbol{x}_2)| \leq \frac{c_\nu}{n}\|\boldsymbol{x}_1 - \boldsymbol{x}_2\|, \quad \text{for all } \boldsymbol{x}_1, \boldsymbol{x}_2 \in \mathbb{R}^n,$$
where the factor $c_\nu$ takes the form
$$c_\nu = 1 - \Phi(-\sqrt{\beta}) + \frac{1}{\sqrt{2\pi\beta}}\exp\left(-\frac{\beta}{2}\right),$$
where $\Phi(u) := \mathbb{P}\{N(0,1) \leq u\}$ is the cumulative distribution function of the standard Normal distribution.

At each step in an iterative procedure, we have some candidate $w$, at which we can evaluate the loss $l(w; z_i)$ and/or the gradient $l'(w; z_i)$ over some or all data points $i \in [n]$. In traditional ERM-GD, one simply uses the empirical mean of the loss gradients to approximate $g(w)$. In our proposed robust gradient descent procedure, instead of just doing summation, we feed the loss gradients as data into the robust procedure (3), highlighted in Lemma 5. Running this sub-routine for each dimension results in a novel estimator $\hat{g}(w)$ of the risk gradient $g(w)$, to be plugged into (2), constructing a novel steepest descent update. Since the candidate $w$ at any step will depend on the random draw of the data set $z_1, \ldots, z_n$, upper bounds on the estimation error must be uniform in $w \in \mathcal{W}$ in order to capture all contingencies.
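As an aside on the constant in Lemma 7: one way to read $c_\nu$ is as the mean positive part $\mathbb{E}(1+\epsilon)_+$ under $\epsilon \sim N(0, 1/\beta)$, for which the standard Gaussian identity gives $1 - \Phi(-\sqrt{\beta}) + e^{-\beta/2}/\sqrt{2\pi\beta}$. This reading and the quick Monte-Carlo check below are our own illustration:

```python
import numpy as np
from math import erf, exp, pi, sqrt

def c_nu(beta):
    """Closed form for E[(1 + eps)_+] with eps ~ N(0, 1/beta)."""
    Phi = lambda u: 0.5 * (1.0 + erf(u / sqrt(2.0)))
    return 1.0 - Phi(-sqrt(beta)) + exp(-beta / 2.0) / sqrt(2.0 * pi * beta)

rng = np.random.default_rng(4)
beta = 2.0
eps = rng.normal(scale=1.0 / np.sqrt(beta), size=1_000_000)
mc = np.maximum(1.0 + eps, 0.0).mean()   # Monte-Carlo estimate of E[(1+eps)_+]
```

For large $\beta$ the noise concentrates tightly around zero and $c_\nu \to 1$, matching the intuition that gentler noise gives an estimator that is no more sensitive to any single datum than a plain truncated mean.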
More explicitly, we require for some bound $0 < \varepsilon < \infty$ that
$$\mathbb{P}\left\{\max_{t \leq T}\|\hat{g}(\hat{w}_{(t)}) - g(\hat{w}_{(t)})\| \leq \varepsilon\right\} \geq 1 - \delta. \tag{6}$$
Using the following lemma, we can show that such a bound does exist, and its form can be readily characterized.

Lemma 8 (Uniform accuracy of gradient estimates). Consider the risk gradient approximation $\hat{g} = (\hat{g}_1, \ldots, \hat{g}_d)$, defined at $w$ (for $j \in [d]$) by
$$\hat{g}_j(w) := \frac{s_j}{n}\sum_{i=1}^{n}\int \psi\!\left(\frac{l'_j(w; z_i)(1 + \epsilon_i)}{s_j}\right) d\nu(\epsilon_i), \quad \text{scaled as } s_j = \sqrt{\frac{n v_j}{2\log(2d\delta^{-1})}}, \tag{7}$$
with $v_j$ any valid bound satisfying $v_j \geq \mathbb{E}_\mu |l'_j(w; z)|^2$, for all $w \in \mathcal{W}$. Then, with probability no less than $1 - \delta$, for any choice of $w \in \mathcal{W}$, we have that
$$\|\hat{g}(w) - g(w)\| \leq \frac{\tilde{\varepsilon}}{\sqrt{n}},$$
where, writing $V := \max_{j \in [d]} v_j$, the error $\tilde{\varepsilon}$ is
$$\tilde{\varepsilon} := \lambda(1 + c_\nu\sqrt{d}) + \sqrt{2dV\left(\log(2d\delta^{-1}) + d\log(3\Delta\sqrt{n}/\sqrt{V})\right)}.$$
We now have that (6) is satisfied by our underlying routine, as just proved in Lemma 8. The last remaining task is to disentangle the underlying optimization problem (minimization of the unknown $R(\cdot)$) from the statistical estimation problem (approximating $g$ with $\hat{g}$), in order to control the distance between the output of Algorithm 1 after $T$ iterations, denoted $\hat{w}_{(T)}$, and the minimizer $w^*$ of $R(\cdot)$.

Lemma 9 (Distance control). Consider the general approximate GD update (2), and assume that (6) holds with bound $0 < \varepsilon < \infty$. Then, with probability no less than $1 - \delta$, the following statements hold.

1. Setting $\alpha_{(t)} = \alpha/\gamma$, with $0 < \alpha < 1$, we have
$$\|\hat{w}_{(T)} - w^*\| \leq (1 - \alpha)^{T/2}\|\hat{w}_{(0)} - w^*\| + \frac{2\varepsilon}{\gamma}.$$
2. Setting $\alpha_{(t)} = 1/((2 + t)\gamma)$, we have
$$\|\hat{w}_{(T)} - w^*\| \leq \frac{2}{\sqrt{T + 2}}\,\|\hat{w}_{(0)} - w^*\| + \frac{2\varepsilon}{\gamma}.$$

Our preparatory lemmas are now complete, and we can finally focus on the risk itself. We are considering bounds on the excess risk, namely the difference between the risk achieved by our procedure, $R(\hat{w}_{(T)})$, and $R^* := \inf\{R(w) : w \in \mathcal{W}\} = R(w^*)$, the best possible performance using $\mathcal{W}$.

Theorem 10 (Excess risk bounds, fixed step size). Write $\hat{w}_{(T)}$ for the output of Algorithm 1 after $T$ iterations, assuming step size $\alpha_{(t)} = \alpha/\gamma$, and moment bounds $v^{(t)} \leq V$ for all $t$. It follows that
$$R(\hat{w}_{(T)}) - R^* \leq (1 - \alpha)^T\lambda\|\hat{w}_{(0)} - w^*\|^2 + \frac{4\lambda\tilde{\varepsilon}^2}{\kappa^2 n} = O\left((1 - \alpha)^T\right) + O\left(\frac{d\left(\log(d\delta^{-1}) + d\log(\Delta n)\right)}{n}\right)$$
with probability no less than $1 - \delta$ over the random draw of the sample $z_1, \ldots, z_n$, where $\tilde{\varepsilon}$ and $V$ are as defined in Lemma 8.

An immediate observation is that if we let $T$ scale with $n$ such that $T \to \infty$ as $n \to \infty$, we have convergence in probability, as for any $\varepsilon >$
0, it holds that
$$\lim_{n \to \infty} \mathbb{P}\left\{R(\hat{w}_{(T)}) - R^* > \varepsilon\right\} = 0,$$
and indeed, to get arbitrarily good performance at confidence $1 - \delta$, one requires on the order of $d\log(\delta^{-1})$ observations. In terms of learning efficiency, the rate of convergence becomes more important than the simple fact of consistency. In the remark below, we compare our results with closely related works in the literature.

Remark
11 (Comparison with other RGD). Among recent work in the literature, conceptually the closest works to ours are those proposing and analyzing novel "robust gradient descent" algorithms. As with ours, these procedures look to replace the traditional sample mean-based risk gradient estimate with a more robust approximation, within the context of a first-order optimization scheme. As we mentioned in section 1, two particularly closely related works are Chen et al. [6] and Prasad et al. [24], both using a generalized median-of-means strategy. On the computational side, our Algorithm 1 requires only a fixed number of basic operations, applied to each coordinate and each data point; we have no iterative sub-routines here. This provides an advantage over median-of-means procedures, which require iterative approximations of the geometric median at each update in the main loop. Theoretically, while the setting of Chen et al. [6] is that of parallel computing with robustness to failures, their formal guarantees have essentially the same dependence on $d$ and $n$ as our bounds in Theorem 10, under comparable assumptions. On the other hand, Prasad et al. [24], using the same tactic as Chen et al. [6], provide new formal guarantees with better dependence on the dimension, but at the cost of a new $T$ factor in the statistical error term. More concretely, we consider their Theorem 8, in which they provide error bounds on a robust linear regression procedure. Consider assumptions as in our Example 3, where $z = (x, y)$ and $y = \langle w^*, x \rangle + \eta$, and write $(\tilde{w}_{(t)})$ for the sequential output of their procedure. Under such assumptions, they assert $(1 - \delta)$-probability bounds of the form
$$\|\tilde{w}_{(T)} - w^*\| \leq O\left(a^T\right) + O\left(\frac{\sigma}{1 - a}\sqrt{\frac{T d\log(\delta^{-1})}{n/k}}\right)$$
for a constant $0 < a <$
1, where $k$ is the number of partitions made, and $\mathbb{E}\,xx^T = \sigma^2 I_d$. The reason for this form is as follows. Using their Lemma 2, based on error bounds from Minsker [20], one gets a pointwise bound on the gradient estimate error. In their Theorem 1 proof, they take a union over all algorithm iterations, and conclude with bounds on $\|\tilde{w}_{(T)} - w^*\|$ taking the form just stated. A naive approach re-using the same data over each step $t = 0, 1, \ldots, T$ of the algorithm means the loss gradient observations are no longer independent, making Lemma 11 invalid. To get around this, the authors split the original $n$ independent observations into $T$ disjoint subsets to be used for their proposed sub-routine (involving a further $k$-partition). Union bounds over $T$ steps are now perfectly valid, but the data size at each step becomes $n/(Tk)$, and even $T = \Theta(\sqrt{n})$ leads to a very slow rate of $O(n^{-1/4})$. Our bounds have an extra $d$ factor, but are free of $T$ in the statistical error, while also maintaining $O(n^{-1/2})$ rates.

Remark
12 (Projected RGD). One implicit assumption in our analysis above is that $\hat{w}_{(t)} \in \mathcal{W}$ for all steps of Algorithm 1. To enforce this, running a projection sub-routine after the parameter update step is sufficient, and all theoretical guarantees hold as-is. To see this, note that the update becomes
$$\hat{w}_{(t+1)} = \pi_{\mathcal{W}}\left(\hat{w}_{(t)} - \alpha_{(t)}\hat{g}^{(t)}(\hat{w}_{(t)})\right), \tag{8}$$
where $\pi_{\mathcal{W}}(v) := \arg\min_{u \in \mathcal{W}}\|u - v\|$. Under A0, this projection is well-defined [16, Sec. 3.12, Thm. 3.12]. With this fact in hand, it follows that $\|\pi_{\mathcal{W}}(u) - \pi_{\mathcal{W}}(v)\| \leq \|u - v\|$ for any choice of $u, v$. Again via A0, since $w^* \in \mathcal{W}$ and $\pi_{\mathcal{W}}(w^*) = w^*$, we have
$$\left\|\pi_{\mathcal{W}}\left(\hat{w}_{(t)} - \alpha_{(t)}\hat{g}^{(t)}(\hat{w}_{(t)})\right) - w^*\right\| \leq \left\|\hat{w}_{(t)} - \alpha_{(t)}\hat{g}^{(t)}(\hat{w}_{(t)}) - w^*\right\|,$$
implying that Lemma 9 applies as-is to Algorithm 1 modified using projection onto $\mathcal{W}$ as in (8), thereby extending all subsequent results based on it.

One would expect that, with robust estimates of the risk gradient over a wide variety of distributions, the updates of Algorithm 1 should have small variance given enough observations. The following result shows that this is true, with the procedure stabilizing to the best level available under the given sample as the procedure closes in on a valid solution.

Theorem 13 (Control of update variance). Run Algorithm 1 under the same assumptions as Theorem 10, except with step size $\alpha_{(t)}$ left arbitrary. Then, for any step $t \geq 0$, taking expectation with respect to the sample $\{z_i\}_{i=1}^n$ conditioned on $\hat{w}_{(t)}$, we have
$$\mathbb{E}\|\hat{w}_{(t+1)} - \hat{w}_{(t)}\| \leq \alpha_{(t)}\left(\frac{d}{\sqrt{\pi\beta}}\sqrt{\frac{V}{n}}\left(1 - \Phi(-\sqrt{d})\right) + \sqrt{\frac{Vd}{n\pi}}\,e^{-d/2} + \|g(\hat{w}_{(t)})\|\right).$$
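Before moving to the experiments, note that the projection map $\pi_{\mathcal{W}}$ of Remark 12 is trivial to realize when $\mathcal{W}$ is a norm ball, the simplest example satisfying A0. A small sketch (helper names are ours):

```python
import numpy as np

def project_ball(v, radius, center=None):
    """Euclidean projection onto {u : ||u - center|| <= radius}, i.e.
    the map pi_W used in the projected update (8) for a ball W."""
    c = np.zeros_like(v) if center is None else center
    diff = v - c
    norm = np.linalg.norm(diff)
    if norm <= radius:
        return v.copy()          # already inside W: projection is identity
    return c + diff * (radius / norm)

w = np.array([3.0, 4.0])
w_proj = project_ball(w, radius=1.0)   # rescaled onto the unit sphere
```

The projected update is then simply `w = project_ball(w - lr * g_hat, radius)` applied after each gradient step; since the projection is 1-Lipschitz, the distance-control argument of Lemma 9 carries over unchanged.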
In the numerical experiments that follow, our primary goal is to elucidate the relationship between factors of the learning task (e.g., sample size, model dimension, initial value, underlying data distribution) and the performance of the robust gradient descent procedure proposed in Algorithm 1. We are interested in how these factors impact algorithm behavior in an absolute sense, as well as in performance relative to well-known competitors.

We consider three basic types of experiments. First, we develop a risk minimization task based on noisy function (and thus noisy gradient) observations. These controlled simulations let us carefully examine how different factors influence performance over time. Next, we consider a regression task under a wide variety of noise distributions, which lets us examine the true utility of competing algorithms in a scenario where the data may or may not be heavy-tailed. Finally, we use real-world benchmark data sets to evaluate performance on classification tasks.

Controlled tests
Experimental setup
We begin with a “noisy convex risk minimization” task, designed as follows. The risk function takes a quadratic form, R(w) = ⟨Σw, w⟩/2 + ⟨w, u⟩ + c, where Σ ∈ ℝ^(d×d), u ∈ ℝ^d, and c ∈ ℝ are constants set in advance. The learning task is to find a minimizer of R(·) without direct access to R, given only n random function observations r_1, ..., r_n, with each r_i: ℝ^d → ℝ mapping from parameter space to a numerical penalty. These data are generated independently from a common distribution, and are centered at the true risk, namely E r(w) = R(w) for all w ∈ ℝ^d. More concretely, we generate r_i(w) = ( ⟨w* − w, x_i⟩ + ε_i )²/2, i ∈ [n], with x and ε independent. The true minimum is denoted w*, and Σ = E xxᵀ. The inputs x have a d-dimensional Gaussian distribution with all components uncorrelated. This means that Σ is positive definite, and R is strongly convex.

We make use of three metrics for evaluating performance: average excess empirical risk (averaging over r_1, ..., r_n), average excess risk (computed using the true R), and variance of the risk. The latter two are computed by averaging over trials, where each trial means a new independent random sample. In all tests, we conduct 250 trials.

Regarding the methods tested, we run three representative procedures. First is the idealized gradient descent procedure (1), denoted oracle, which is possible here since R is designed by us. Second, as a de facto standard for most machine learning algorithms, we use ERM-GD, written erm; here the update direction is simply the sample mean of the loss gradients. Finally, we compare our Algorithm 1, written rgdmult, against these two procedures. Variance bounds v^(t) are computed using the simplest possible procedure, namely the empirical mean of the second moments of ∇l(ŵ^(t); z), divided by two.

Impact of heavy-tailed noise
Our first inquiry is a basic proof of concept: are there natural problem settings under which using rgdmult over ERM-GD is advantageous? And how does rgdmult perform in settings where ERM-GD is known to be effectively optimal, as it is under Gaussian noise [15, Appendix C]? As a baseline, we start with Gaussian noise (mean 0, standard deviation 20), and then consider centered log-Normal noise (log-location 0, log-scale 1.75) as a representative example of asymmetric, heavy-tailed data. Performance results are given in Figure 2.

We see that when ERM-GD is known to be strong, both algorithms perform almost identically. In contrast, under heavy-tailed data, rgdmult is far superior in terms of both the solution found and the stability over the random draw of the sample. While there is clearly some overfitting relative to the oracle procedure, rgdmult departs from the oracle at a much slower rate than ERM-GD, a desirable property. Moving forward, we look more systematically at how different experimental settings lead to different performance, all else kept constant.
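The controlled task above is easy to reproduce. The sketch below, a hypothetical re-implementation with dimension, true minimizer, and noise scales chosen purely for illustration, generates the noisy function observations r_i(w) = (⟨w* − w, x_i⟩ + ε_i)²/2 and compares their average against the true risk:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 2, 500
w_star = np.array([1.0, -0.5])         # true minimizer (illustrative)
X = rng.normal(size=(n, d))            # Gaussian inputs, uncorrelated coords
# Heavy-tailed noise: centered log-Normal (log-location 0, log-scale 1.75).
eps = rng.lognormal(mean=0.0, sigma=1.75, size=n)
eps -= np.exp(1.75**2 / 2)             # subtract the log-Normal mean to center

def r(w):
    """Noisy function observations r_i(w); their mean estimates R(w)."""
    return 0.5 * (X @ (w_star - w) + eps) ** 2

def R(w):
    """True risk for Sigma = E[xx^T] = I: a strongly convex quadratic."""
    var_eps = (np.exp(1.75**2) - 1.0) * np.exp(1.75**2)
    return 0.5 * float((w_star - w) @ (w_star - w)) + 0.5 * var_eps

w = np.zeros(d)
print(r(w).mean(), R(w))  # the empirical average fluctuates around the true risk
```

Because the centered log-Normal noise here has very large variance, the gap between the empirical average and R(w) can be substantial for moderate n, which is precisely the regime where robust gradient estimates pay off.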
Impact of initialization
Having fixed the underlying distribution and the number of observations n, here we look at the impact of the initial guess ŵ^(0). We consider three initializations of the form w* + Unif[−Δ, Δ], with Δ = (Δ_1, ..., Δ_d) and values ranging over Δ_j ∈ {2.5, 5.0, 10.0}, j ∈ [d]. Larger values of Δ_j correspond to a potentially worse initialization. Results are displayed in Figure 3.

Several trends are clear. First, in the case where ERM-GD is essentially optimal, rgdmult matches it. Furthermore, when the data is heavy-tailed, the proposed procedure is much more robust to a sub-par initial guess. While a bad start can lead to serious long-run performance issues for ERM-GD, we see that rgdmult can effectively recover.
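The initialization scheme above is trivial to reproduce; a minimal sketch (with a placeholder w*, and the Δ values matching Figure 3):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 2
w_star = np.zeros(d)  # placeholder true minimizer, for illustration only
for delta in (2.5, 5.0, 10.0):
    # draw w^(0) = w* + Unif[-delta, delta], coordinate-wise
    w0 = w_star + rng.uniform(-delta, delta, size=d)
    print(delta, w0)
```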
Figure 2:
Performance metrics as a function of iterative updates. Top row: Normal noise. Bottom row: log-Normal noise. Settings: n = 500, d = 2, fixed step size α^(t).

Impact of distribution
In our risk minimization task construction, we take advantage of the fact that very distinct loss distributions can still lead to precisely the same risk function. Here we examine performance as the underlying distribution changes; since all changes are purely statistical in nature, the oracle procedure is unaffected, and only ERM-GD and rgdmult are impacted. We consider six settings: three for Gaussian noise and three for log-Normal noise, with location and scale parameters chosen so that the noise variance increases from the low setting to the high setting (denoted low, med, and high in Figure 4). Under Gaussian noise, rgdmult is able to match ERM-GD in all settings. Under log-Normal noise, the performance of ERM falls rather sharply, and we see a gap that widens as the variance grows. Guarantees of good performance over a wide class of distributions, without any prior knowledge of the underlying distribution, are the key take-away of the results culminating in Theorem 10, and they are clearly reinforced by these empirical results, as well as those in subsequent sub-sections.

Impact of sample size
As a direct measure of learning efficiency, we investigate how algorithm performance changes with the sample size n, with the dimension fixed. Figure 5 shows the accuracy of rgdmult and ERM-GD in tests just like those given above. Initial values are common to all methods, and n ranges over {10, 40, 160, 640}.

As is natural, both procedures see monotonic performance improvement over increasing
Figure 3:
Performance over iterations, under strong/poor initialization. Here del refers to Δ_j. Top row: Normal noise. Bottom row: log-Normal noise. Settings: n = 500, d = 2, fixed step size α^(t).

n. More salient is the strength of rgdmult under heavy-tailed observations, particularly when data is limited, giving clear evidence of better learning efficiency, in the sense of realizing better generalization in less time, with less data.

Impact of dimension
The number of parameters the learning algorithm must determine, here denoted by the dimension d, has a major impact on the overall difficulty of the learning task, and on what sample sizes should be considered "small." Fixing n, we let d range over {2, 8, 32, 128}, and investigate how each algorithm performs with access to progressively less sufficient information. For our Algorithm 1, we set the variance bound v^(t) to the empirical second moments of ∇l(ŵ^(t); z), multiplied by 1/√d. Figure 6 gives the results for these tests.

We see that with larger d, since n is fixed, both non-oracle routines become less efficient and need more time to converge. As in previous tests, the key difference appears under heavy tails, where Algorithm 1 is superior to ERM-GD for all d settings. While the ERM procedure saturates rather quickly, our procedure keeps improving over more iterations.

Comparison with robust loss minimizer
In section 1, we cited the important work of Brownlees et al. [1], which chiefly considered theoretical analysis of a robust learning procedure that minimizes a robust objective, in contrast to our use of a robust update direction. Our proposed procedure enjoys essentially the same theoretical guarantees, and we have claimed that it is more practical; here we attempt to verify this claim empirically. Denote the method of Brownlees et al. [1] by bjl. To implement their approach, which does not specify any
Figure 4:
Performance over iterations, under varying noise intensities. Here low, med, and high refer to the three noise distribution settings described in the main text. Settings: n = 500, d = 2, fixed step size α^(t).

particular algorithmic technique, we implement bjl using the non-linear conjugate gradient method of Polak and Ribière [23]. This is available as part of the optimize module of the SciPy scientific computation library, as fmin_cg, used here with default parameter settings. We believe that using this standard first-order solver makes for a fair comparison between bjl and our Algorithm 1, again denoted rgdmult, and again with the variance bound v^(t) set to the empirical second moments of ∇l(ŵ^(t); z), multiplied by 1/√d. For our routine, we fix the number of iterations to T = 30 for all settings. We compute the time required for computation using the Python time module. Multiple independent trials of each learning task (analogous to those previous) are carried out, with the median time over trials (for each d setting) used as the final time record. We consider settings of d = 2, 4, 8, 16, 32,
64. These times, along with performance results, are given in Figure 7.

We first observe that in low dimensions, and with data subject to Gaussian noise, the performance of both methods is similar, a reassuring fact considering their conceptual closeness. Moving to higher dimensions, however, and especially under heavy-tailed noise, we see that rgdmult achieves better performance in much less time. Note that bjl is optimizing a completely different objective function, which explains the deviation in excess empirical risk. There are certainly other ways of implementing bjl, but there is no way of circumventing the optimization of an explicitly-defined objective, which may not be convex. Our proposed rgdmult offers a more practical alternative, which still enjoys the same statistical guarantees.
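To make the comparison concrete, the sketch below shows the kind of objective-minimization baseline we have in mind. The quadratic objective here is a placeholder (the actual robust objective of bjl is an M-estimation criterion not reproduced in this section), but the call into SciPy's fmin_cg is the interface used:

```python
import numpy as np
from scipy.optimize import fmin_cg

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + rng.normal(size=n)

def objective(w):
    """Placeholder empirical objective; bjl would use a robustified version."""
    return 0.5 * np.mean((X @ w - y) ** 2)

def gradient(w):
    return X.T @ (X @ w - y) / n

w_hat = fmin_cg(objective, x0=np.zeros(d), fprime=gradient, disp=False)
print(np.linalg.norm(w_hat - w_true))  # small estimation error on this easy task
```

Swapping in a non-convex robust objective leaves the solver call unchanged, which is exactly why run time and solution quality can degrade with d: the whole optimization must be re-run on an explicitly-defined surrogate.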
Figure 5:
Performance over iterations, under different sample sizes. Settings: d = 2, fixed step size α^(t).

Regression application
For our next class of experiments, we look at a more general regression task, under a diverse collection of data distributions. We then compare Algorithm 1 with well-known procedures specialized to regression, both classical and recent. In each experimental condition, and for each trial, we generate n observations of the form y_i = x_iᵀ w* + ε_i, i ∈ [n], for training. Each condition is defined by the setting of (n, d) and the distribution µ. Throughout, the inputs x are generated from a d-dimensional Gaussian distribution, with each coordinate independent of the others. As such, setting µ amounts to setting the distribution of the noise ε. We consider several families of distributions, each with 15 distinct parameter settings, or "noise levels," chosen such that the standard deviation of ε increases in a roughly linear fashion as we move from level 1 (lowest) to level 15 (highest).

A range of signal/noise ratios can be captured by controlling the norm of the vector w* ∈ ℝ^d determining the model. For each trial, we generate w* randomly as follows: from a fixed reference sequence w_1, w_2, ..., we sample indices i_1, ..., i_d ∈ [d_0] uniformly, with d_0 = 500, and set w* = (w_{i_1}, ..., w_{i_d}). The signal-to-noise ratio SN_µ = ‖w*‖²₂ / var_µ(ε) then varies over a wide range across trials. Here we consider four noise families: log-logistic (denoted llog in figures), log-Normal (lnorm), Normal (norm), and symmetric triangular (tri_s). Many more are considered in Appendix B; even with just these four, we have representative distributions with both bounded and unbounded sub-Gaussian noise, and heavy-tailed data both with and without finite higher-order moments.

Here we do not compute the risk R exactly, but rather use off-sample prediction error as the key metric for evaluating performance. This is computed as excess root mean squared
Figure 6:
Performance over iterations, under increasing dimension. Settings: n = 500, fixed step size α^(t).

error (RMSE) computed on an independent testing set. Performance is averaged over independent trials. For each condition and trial, a test set of m independent observations is generated identically to the n-sized training set that precedes testing. All competing methods use common samples for training and testing, for each condition and trial. In the kth trial, each algorithm outputs an estimate ŵ(k). Using RMSE to approximate the ℓ₂-risk, we compute e_k(ŵ) := ( m⁻¹ Σ_{i=1}^m ( ŵᵀ x_{k,i} − y_{k,i} )² )^(1/2), outputting the prediction error as the excess error e_k(ŵ(k)) − e_k(w*(k)), averaged over K trials. In all experiments, we have K = 250 and m = 1000.

We consider several methods against which we compare the proposed Algorithm 1. As classical choices, we have ordinary least squares (ERM under the squared error, ols) and least absolute deviations (ERM under the absolute error, lad). For more recent methods, as described in section 1, we consider the robust regression routines of Minsker [20] (geomed) and Hsu and Sabato [10] (hs). In the former, we partition the data, obtain the ols solution on each subset, and aggregate these candidates using the geometric median in the ℓ₂ norm [27]; the number of partitions is set to ⌊n/(2d)⌋, with a floor of at least one partition. In the latter, we used source code published online by the authors. To compare our Algorithm 1 with these routines, we initialize rgdmult to the analytical ols solution, with step size α^(t) = 0.
01 for all iterations, and a fixed confidence level δ. Variance bounds v^(t) are set to the empirical second moments of ∇l(ŵ^(t); z), divided by 2. The total number of iterations is constrained by a fixed budget: we allow 40n gradient evaluations in total. Representative results are provided in Figure 8.

Figure 7: Comparison of our robust gradient-based approach with the robust objective-based approach. Top: Normal noise. Bottom: log-Normal noise. Performance is given as a function of d, the number of parameters to optimize, on a log scale. Settings: n = 500, fixed step size α^(t).

To begin, we fix d, and look at performance over n settings (first row of Figure 8). Regardless of distribution, rgdmult is seen to provide highly competitive performance; whereas other methods perform well on some distributions and very poorly on others, a high level of generalization ability is uniformly maintained by the proposed procedure. We see that ols is strong under sub-Gaussian data (norm and tri_s), while it deteriorates under heavy-tailed data. The more robust methods tend to perform better than ols on heavy-tailed data, but clearly suffer from biased estimates under sub-Gaussian data. These results illustrate how rgdmult realizes the best of both worlds, paying a tolerable price in bias for large payouts in terms of robustness to outliers.

In the second row of Figure 8, we examine performance over noise levels. It is encouraging that even with a pre-fixed step size and budget (since n is fixed over all noise levels), the strong performance of rgdmult holds over very diverse settings.

Finally, in the third row of Figure 8, the ratio of n to d is fixed, and we see if and how performance changes as d increases. For all distributions, the performance of rgdmult is essentially constant over d when n scales with d, which is what we would hope considering the risk bounds of Theorem 10. Certain competing methods show more sensitivity to the absolute number of free parameters, particularly in the case of heavy-tailed data with asymmetric distributions.

As our final class of numerical experiments, we shift our focus to classification tasks, and this time make use of real-world data sets, to be described in detail below. All methods use a common model, here multi-class logistic regression.
If the number of classes is C, and we have F input features, then the dimension of the model is d = (C − 1)F.
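The dimension bookkeeping and the (convex, differentiable) multi-class logistic loss can be sketched as follows; this is a generic softmax implementation written for illustration, not the authors' code:

```python
import numpy as np

def logistic_loss_grad(W, x, y, C):
    """Multi-class logistic (softmax) loss and gradient for one example.
    W has shape (C - 1, F): the last class is the reference with score 0,
    so the total parameter dimension is d = (C - 1) * F."""
    scores = np.concatenate([W @ x, [0.0]])   # reference class score fixed at 0
    scores -= scores.max()                    # numerical stability
    p = np.exp(scores) / np.exp(scores).sum()
    loss = -np.log(p[y])
    onehot = np.zeros(C)
    onehot[y] = 1.0
    grad = np.outer((p - onehot)[:-1], x)     # shape (C - 1, F)
    return loss, grad

C, F = 2, 54                                  # the cover type setup
W = np.zeros((C - 1, F))                      # d = (C - 1) * F = 54
loss, grad = logistic_loss_grad(W, np.ones(F), 1, C)
print(loss)  # log(2) at the zero initialization of a binary model
```

Since the per-example gradient is a (C − 1) × F array, the per-coordinate robustification of Algorithm 1 applies coordinate-wise to its flattened entries.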
Figure 8:
Top: Prediction error over sample sizes n ≥ 12, with d = 5 and noise level 8. Middle: Prediction error over noise levels, with n = 30, d = 5. Bottom: Prediction error over dimensions 5 ≤ d ≤ 40, with the ratio n/d = 6 fixed and noise level 8. Each column corresponds to a distinct noise family.
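The geometric-median aggregation used by the geomed baseline above is straightforward to implement; below is a minimal sketch via the standard Weiszfeld fixed-point iteration (the tolerance and iteration cap are illustrative choices, not taken from the text):

```python
import numpy as np

def geometric_median(points, tol=1e-8, max_iter=200):
    """Weiszfeld iteration: minimize sum_i ||m - points[i]|| over m."""
    m = points.mean(axis=0)  # initialize at the coordinate-wise mean
    for _ in range(max_iter):
        dists = np.linalg.norm(points - m, axis=1)
        dists = np.maximum(dists, 1e-12)     # guard against division by zero
        weights = 1.0 / dists
        m_new = (weights[:, None] * points).sum(axis=0) / weights.sum()
        if np.linalg.norm(m_new - m) < tol:
            break
        m = m_new
    return m

# Aggregating per-subset OLS candidates: one wild candidate barely moves
# the geometric median, while it would drag the coordinate-wise mean far off.
candidates = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [50.0, -50.0]])
print(geometric_median(candidates))  # close to (1, 1), unlike the mean
```

This robustness of the aggregate to a minority of bad candidates is what underpins the sub-Gaussian guarantees of median-of-means procedures, at the cost of the bias visible in the sub-Gaussian columns of Figure 8.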
A basic property of this model is that the loss function is convex in the parameters, with gradients that exist, placing the model firmly within our realm of interest. Furthermore, for all of these tests we add a squared ℓ₂-norm regularization term a‖w‖² to the loss, where a varies depending on the dataset. Once again, each algorithm is given a fixed budget, this time of 20n gradient evaluations, where n is the size of the available training set, which again depends on the dataset (details below).

Here we give results for two well-known benchmark data sets: the forest cover type dataset from the UCI repository, and the protein homology dataset used in a previous KDD Cup. For each dataset, we execute 10 independent trials, with training/testing subsets randomly sampled without replacement as described shortly. For all datasets, we normalize input features to the unit interval [0,
1] in a per-feature fashion. For the cover type dataset (available at http://archive.ics.uci.edu/ml/datasets/Covertype), we consider binary classification of the second type against all other types. With C = 2 and F = 54, we have d = 54, and we take n = 4d. The protein homology dataset has highly unbalanced labels, with only 1296 positive labels out of over 145,000 examples. We balance the training and testing data, randomly selecting 296 positive examples and the same number of negative examples, yielding a test set of 592 points. As for the training set, we use all positive examples not used for testing (1000 points each time), plus a random selection of 1000 negatively labeled examples, so n = 2000. With C = 2 and F = 74, the dimension is d = 74.

Variance bounds v^(t) are set to k times the empirical mean of the second moments of ∇l(ŵ^(t); z), with k ranging over a grid of eight values. Furthermore, for the high-dimensional datasets, we consider a mini-batch approach in terms of a random selection of which parameters to robustly update: at each iteration, we randomly choose a small number of indices (at most d), run Algorithm 1 for the resulting sub-vector, and use the sample mean for the remaining coordinates. We compare our proposed algorithm with stochastic gradient descent (SGD), and with the stochastic variance-reduced gradient descent (SVRG) proposed by Johnson and Zhang [11]. For each method, a grid of seven pre-fixed step sizes is tested. SGD uses mini-batches of size 1, as does the SVRG inner loop, and the SVRG inner loop length is a fixed fraction of n.

Figure 9: Test error (misclassification rate) over budget spent, as measured by gradient computations, for the top two performers within each method class (denoted *_1 and *_2). Each plot corresponds to a distinct dataset.

Results are given in Figure 9, where "top two" refers to performance as measured by the median test error over the last five iterations. In general, the proposed procedure is clearly competitive with the best settings of these popular routines, and in the case of the smaller data set (protein homology, right-most plot), we see a significant improvement over the competitors. Assuredly, our mini-batch implementation of Algorithm 1 is merely a nascent application, but strong performance under even a very naive setup is promising in terms of developing even stronger procedures for real-world data.
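As a rough illustration of the coordinate-subset variant just described, the sketch below robustly updates a random subset of coordinates and falls back on the sample mean elsewhere. The robust per-coordinate routine here is a simple Catoni-type truncated mean, a stand-in for the full smoothed estimator of Algorithm 1, and the subset size and scale s are illustrative:

```python
import numpy as np

def truncated_mean(v, s):
    """Catoni-type soft-truncated mean of one gradient coordinate.
    psi(u) = sign(u) * log(1 + |u| + u**2 / 2) is one standard choice."""
    u = v / s
    psi = np.sign(u) * np.log1p(np.abs(u) + 0.5 * u ** 2)
    return s * psi.mean()

def robust_update_direction(G, num_robust, s, rng):
    """G: (n, d) array of per-example gradients. Robustly estimate a random
    subset of coordinates; use the plain sample mean for the rest."""
    n, d = G.shape
    g = G.mean(axis=0)
    idx = rng.choice(d, size=min(num_robust, d), replace=False)
    for j in idx:
        g[j] = truncated_mean(G[:, j], s)
    return g

rng = np.random.default_rng(0)
G = rng.normal(size=(1000, 4))
G[0, 2] = 1e5                        # one corrupted gradient entry
print(robust_update_direction(G, num_robust=4, s=10.0, rng=rng))
```

On heavy-tailed or corrupted per-example gradients, the truncated coordinates are far less sensitive to outliers than the raw mean, at the cost of a small bias controlled by the scale s.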
We introduced and analyzed a novel machine learning algorithm, with a solid theoretical grounding based on firm statistical principles, and with the added benefits of a simple implementation, very few parameters to set, and a fast, non-iterative robustification procedure that does not throw away any data, but which is also not overly sensitive to errant observations. Based on the strong theoretical guarantees and appealing empirical performance, it appears that our approach of paying the price of a small bias for the reward of more distributionally robust gradient estimates is sound as a methodology, realizing better performance using fewer computational resources (data, time).

Looking ahead, we are particularly interested in moving beyond per-coordinate robustification, and considering operations that treat the loss gradient vectors themselves as atomic units. The per-coordinate technique is easy to implement, and its theoretical analysis is more straightforward, but the resulting risk bounds have an extra d factor that should be removable with more sophisticated procedures. Indeed, the high-dimensional mean estimation discussed by Catoni and Giulini [4] includes such a vector estimator, but unfortunately there is no way to actually compute the estimator they analyze. Bridging this gap is an important next step, from the perspective of both learning theory and machine learning practice.

A Technical appendix
A.1 Preliminaries
Consider two probability measures P and Q on a measurable space (X, A). We say that Q is absolutely continuous with respect to P, written Q ≪ P, whenever P(A) = 0 implies Q(A) = 0 for all A ∈ A. The Radon-Nikodym theorem guarantees that there exists a P-measurable function g ≥ 0 such that

Q(A) = ∫_A g dP, for all A ∈ A.

Furthermore, this g is unique in the sense that if another f exists satisfying the above equality, then f = g almost everywhere [P]. It is common to call this function g the Radon-Nikodym derivative of Q with respect to P, written dQ/dP. The relative entropy, or Kullback-Leibler divergence, between two probability measures P and Q on measurable space (X, A) is defined as

K(P; Q) := −∫ log(dQ/dP) dP, if Q ≪ P; +∞, else.    (9)

The key property of the ψ truncation function utilized by Catoni and Giulini [4], defined in (4), is that for all u ∈ ℝ, we have

−log( 1 − u + u²/2 ) ≤ ψ(u) ≤ log( 1 + u + u²/2 ).    (10)

Let f: ℝ^d → ℝ be a continuously differentiable, convex, λ-smooth function. Then

f(u) − f(v) ≤ (λ/2)‖u − v‖² + ⟨∇f(v), u − v⟩    (11)

(1/(2λ)) ‖∇f(u) − ∇f(v)‖² ≤ f(u) − f(v) − ⟨∇f(v), u − v⟩    (12)

for all u, v ∈ ℝ^d.

Terminology. For a function F: W → ℝ, we say that F is λ-Lipschitz if for all w₁, w₂ ∈ W we have |F(w₁) − F(w₂)| ≤ λ‖w₁ − w₂‖. If F is differentiable, and the derivative w ↦ ∇F(w) is λ-Lipschitz, then we say that F is λ-smooth. If F is a convex function on a convex set W, then we say F is κ-strongly convex if for all w₁, w₂ ∈ W,

F(w₁) − F(w₂) ≥ ⟨∇F(w₂), w₁ − w₂⟩ + (κ/2)‖w₁ − w₂‖².    (13)

This definition can be made for any valid normed space, but we shall assume W ⊆ ℝ^d throughout, and use the Euclidean norm. If there exists w* ∈ W such that ∇F(w*) = 0, then it follows that w* is the unique minimum of F on W.

A.2 Proofs of results in the main text
Proof of Lemma 5.
Let P(ℝ) denote the set of all probability measures on ℝ, with an appropriate σ-field tacitly assumed. Consider any two measures ν, ν₀ ∈ P(ℝ), and a ν₀-measurable function h: ℝ → ℝ. Catoni [2, pp. 159-160] proves that a Legendre transform of the mapping ν ↦ K(ν; ν₀) takes the form of a cumulant generating function, namely

sup_ν ( ∫ h(u) dν(u) − K(ν; ν₀) ) = log ∫ exp(h(u)) dν₀(u),    (14)

where the supremum is taken over ν ∈ P(ℝ). This identity is a technical tool, and the choices of h and ν₀ are parameters that can be adjusted to fit the application.

In actually setting these parameters, we adapt the general argument of Catoni and Giulini [4] to our setting. Recalling the estimator (3), we start with a quasi-average of the points x_1, ..., x_n, modified by some data-sensitive additive noise and passed through a truncation function; the expectation of this sum is then taken over the noise distribution. The ν in the definition of (3) will correspond to the ν here, and thus to reflect the whole estimator within (14), it makes sense to include the data-dependent sum in our choice of h. Note that the summands in the estimator definition,

ψ( (x_i + ε_i x_i)/s ), i ∈ [n],

depend on two random quantities, namely the data x_i and the artificial noise ε_i. Since s > 0, write

f(ε, x) := ψ( (x + εx)/s ), ε, x ∈ ℝ.

Note that by the definition of ψ in (4), the function f: ℝ² → ℝ is measurable and bounded. With this cleaner notation, let us now set

h(ε) = Σ_{i=1}^n f(ε, x_i) − c(ε),

where c(ε) is a term to be determined shortly. Plugging this in to (14) yields the following quantity:

B := sup_ν ( ∫ h(ε) dν(ε) − K(ν; ν₀) ) = log ∫ exp( Σ_{i=1}^n f(ε, x_i) − c(ε) ) dν₀(ε).
Exponentiating B and then taking expectation with respect to the sample, we have

E_µ exp(B) = E_µ ∫ ( exp( Σ_{i=1}^n f(ε, x_i) ) / exp(c(ε)) ) dν₀(ε) = ∫ ( Π_{i=1}^n E_µ exp(f(ε, x_i)) / exp(c(ε)) ) dν₀(ε).

The first equality comes from simple log/exp manipulations, and the second from taking the integration over the sample inside the integration with respect to ν₀, valid via Fubini's theorem. It will be useful to have E_µ exp(B) ≤
1. This can be achieved easily by setting

c(ε) = n log E_µ exp(f(ε, x)),

which yields

E_µ exp(B) = ∫ ( Π_{i=1}^n E_µ exp(f(ε, x_i)) / ( E_µ exp(f(ε, x)) )^n ) dν₀(ε) = 1.    (15)

With this preparation done, we can derive the high-probability upper bound of interest:

P{ B ≥ log(δ⁻¹) } = P{ exp(B) ≥ 1/δ } = E_µ I{ δ exp(B) ≥ 1 } ≤ E_µ δ exp(B) = δ.

The inequality follows immediately since δ exp(B) ≥
0, and the final equality holds due to (15). Note that since our setting of c(ε) is such that c(·) is ν₀-measurable (via measurability of f), the resulting h(·) is indeed ν₀-measurable, as required. Of importance here is the fact that

sup_ν ( ∫ h(ε) dν(ε) − K(ν; ν₀) ) ≤ log(δ⁻¹)    (16)

with probability no less than 1 − δ, noting that the event is uniform in ν. Using (14) once again, and dividing both sides by n, we have that with high probability, for any choice of ν, we can bound this generic empirical mean as follows:

(1/n) Σ_{i=1}^n ∫ f(ε, x_i) dν(ε) ≤ ∫ log E_µ exp(f(ε, x)) dν(ε) + ( K(ν; ν₀) + log(δ⁻¹) ) / n.    (17)

Bridging the gap between these preparatory facts and the estimator of interest is now easy; since the noise terms ε_1, ..., ε_n are assumed to be independent copies of ε ∼ ν, it follows immediately that

x̂ = (s/n) Σ_{i=1}^n ∫ ψ( (x_i + ε_i x_i)/s ) dν(ε_i) = (s/n) Σ_{i=1}^n ∫ f(ε, x_i) dν(ε).

That is to say, on the high-probability event, uniformly in the choice of ν, we have

x̂ ≤ s ∫ log E_µ exp( ψ( x(1+ε)/s ) ) dν(ε) + (s/n)( K(ν; ν₀) + log(δ⁻¹) ).    (18)

Let us work step by step through each of the terms in this upper bound.

Starting with the first term, recall the definition of the truncation function ψ given in (4), and in particular the logarithmic upper/lower bounds given in (10). These bounds are convenient because they give us polynomial bounds when passing ψ through exp(·), which is precisely what occurs in (18) above. To get the first term of (18) into a more useful form, we can bound it as

∫ log E_µ exp( ψ( x(1+ε)/s ) ) dν(ε) ≤ ∫ log( 1 + (1+ε) E_µ x / s + (1+ε)² E_µ x² / (2s²) ) dν(ε)
  ≤ ∫ ( (1+ε) E_µ x / s + (1+ε)² E_µ x² / (2s²) ) dν(ε)
  = E_ν(1+ε) E_µ x / s + E_ν(1+ε)² E_µ x² / (2s²)
  = E_µ x / s + ( E_µ x² / (2s²) ) ( 1/β + 1 ).

The first inequality follows from (10), and the second from the fact that log(1+u) ≤ u for all u > −1. As for the final equality, note that with ε ∼ ν = N(0, β⁻¹), it follows immediately that

E_ν(1+ε)² = 1/β + ( E_ν(1+ε) )² = 1/β + 1.

Moving on to the second term, evaluating K(ν; ν₀) depends completely on how we define the pre-fixed ν₀. One approach is to set ν₀ such that the KL divergence is easily computed; for example, ν₀ = N(1, β⁻¹). In this case, simple computations show that

K(ν; ν₀) = ∫_{−∞}^{∞} log( exp(−βu²/2) / exp(−β(u−1)²/2) ) √(β/(2π)) exp(−βu²/2) du = (β/2) ∫_{−∞}^{∞} (1 − 2u) √(β/(2π)) exp(−βu²/2) du = β/2.

With this computation done, an upper bound is complete, taking the form

x̂ ≤ E_µ x + ( E_µ x² / (2s) ) ( 1/β + 1 ) + (s/n)( β/2 + log(δ⁻¹) ).    (19)

Optimizing this upper bound with respect to s > 0, we have

s = ( (1/β + 1) n E_µ x² / ( 2( β/2 + log(δ⁻¹) ) ) )^(1/2),

and with respect to β > 0, we have

β = √( n E_µ x² ) / s.    (20)

Another approach of interest is as follows: consider setting ν₀ = ν, causing the KL term to vanish, with β → ∞ then being the optimal strategy from the perspective of the upper bound. It then remains to see how the correction term C(a, b) is computed in the limit. It is unclear whether this is worth the effort, but it is conceptually interesting.

Plugging β in to the setting of s yields

s = √( n E_µ x² / ( 2 log(δ⁻¹) ) ).    (21)

With this setting of s, the upper bound (19) can be cleaned up to the form

x̂ ≤ E_µ x + √( 2 E_µ x² log(δ⁻¹) / n ) + √( E_µ x² / n ).

To get lower bounds on x̂ − E_µ x, we can equivalently seek upper bounds on (−1)x̂ + E_µ x. This can be done easily via

(−1) x̂ ≤ s ∫ log E_µ exp( −ψ( x(1+ε)/s ) ) dν(ε) + (s/n)( K(ν; ν₀) + log(δ⁻¹) ).    (22)

Only the first term on the right-hand side differs from before. Note that by the lower bound of (10), we have

log E_µ exp( −ψ( x(1+ε)/s ) ) ≤ (−1)(1+ε) E_µ x / s + (1+ε)² E_µ x² / (2s²).

The rest plays out analogously to the upper bound, yielding

(−1) x̂ ≤ (−1) E_µ x + √( 2 E_µ x² log(δ⁻¹) / n ) + √( E_µ x² / n ),

which implies, as desired,

x̂ − E_µ x ≥ −( √( 2 E_µ x² log(δ⁻¹) / n ) + √( E_µ x² / n ) ).    (23)

Since −ψ(u) = ψ(−u), both of these settings can be interpreted as different settings of the distribution of the noise factor: (1+ε) in the upper bound case, and −(1+ε) in the lower bound case, both with ε ∼ ν. Since the inequality (16) is uniform in the distribution of this noise, both bounds hold on the same event, which has probability no less than 1 − δ. We may thus conclude that with probability at least 1 − δ over the random draw of the sample x_1, ..., x_n, the estimator x̂ satisfies

| x̂ − E_µ x | ≤ √( 2 E_µ x² log(δ⁻¹) / n ) + √( E_µ x² / n ).

In practice, since E_µ x² will typically be unknown, this factor can be replaced by any valid upper bound v ≥ E_µ x². The only impact on the final bound is that the unknown E_µ x² factors are replaced by the known v, concluding the proof.

Proof of Lemma 7.
Consider two data sets, the original $x_1,\ldots,x_n$ and a perturbed version $x_1',\ldots,x_n'$. For clean notation, organize these into vectors $x = (x_1,\ldots,x_n)$ and $x' = (x_1',\ldots,x_n')$. Taking the difference between the estimator evaluated on these distinct data sets, we have
$$\hat{x}(x) - \hat{x}(x') = \int \frac{s}{n}\sum_{i=1}^{n}\left(\psi\left(\frac{(1+\epsilon)x_i}{s}\right) - \psi\left(\frac{(1+\epsilon)x_i'}{s}\right)\right)d\nu(\epsilon) \le \int \frac{s}{n}\sum_{i=1}^{n}\left|\frac{1+\epsilon}{s}\right| \left|x_i - x_i'\right| d\nu(\epsilon) = \frac{E_\nu|1+\epsilon|}{n}\sum_{i=1}^{n}\left|x_i - x_i'\right| = \frac{E_\nu|1+\epsilon|}{n}\|x - x'\|_1.$$
The first equality follows by linearity and the definition of the estimators. The subsequent inequality follows from the 1-Lipschitz property of $\psi$ defined in (4), which is that for all $u,v \in \mathbb{R}$, we have $|\psi(u) - \psi(v)| \le |u-v|$.

Evaluating $E_\nu|1+\epsilon|$ is straightforward under the assumption that $\epsilon \sim \nu = N(0, 1/\beta)$, since the random variable $|1+\epsilon|$ follows a Folded Normal distribution. More generally, if $X \sim N(a, b^2)$, then $Y = |X|$ follows a Folded Normal distribution, with expected value
$$E Y = a\left(1 - 2\Phi\left(-\frac{a}{b}\right)\right) + b\sqrt{\frac{2}{\pi}}\exp\left(-\frac{a^2}{2b^2}\right).$$
Since in our case we have $a = 1$ and $b^2 = 1/\beta$, it follows that
$$E_\nu|1+\epsilon| = 1 - 2\Phi\left(-\sqrt{\beta}\right) + \sqrt{\frac{2}{\beta\pi}}\exp\left(-\frac{\beta}{2}\right).$$
Reflecting this factor in the above inequalities concludes the proof.
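The Folded Normal mean formula used here can be sanity-checked numerically; the following sketch (an illustration with an arbitrary test value of $\beta$, not part of the paper) compares the closed form against direct numerical integration of $E|X|$:

```python
import math

import numpy as np

def Phi(x):
    """Standard Normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def folded_mean(mu, sigma):
    """Closed-form E|X| for X ~ N(mu, sigma^2) (Folded Normal mean)."""
    return (mu * (1.0 - 2.0 * Phi(-mu / sigma))
            + sigma * math.sqrt(2.0 / math.pi) * math.exp(-mu**2 / (2.0 * sigma**2)))

beta = 4.0
mu, sigma = 1.0, 1.0 / math.sqrt(beta)   # 1 + eps with eps ~ N(0, 1/beta)

# Direct numerical integration of E|X| = \int |x| N(x; mu, sigma^2) dx
# over a wide grid (the tails beyond 12 sigma are negligible).
x = np.linspace(mu - 12 * sigma, mu + 12 * sigma, 400001)
pdf = np.exp(-(x - mu) ** 2 / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))
f = np.abs(x) * pdf
numeric = float(np.sum(f[:-1] + f[1:]) * (x[1] - x[0]) / 2.0)  # trapezoid rule

print(folded_mean(mu, sigma), numeric)  # the two values should agree closely
```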
Proof of Lemma 8.
In order to obtain bounds that hold uniformly over the choice of $w$, we adopt a rather standard strategy utilizing covering numbers of $\mathcal{W}$. Using assumption A0, since $\mathcal{W}$ is closed and bounded, the Heine–Borel theorem implies that $\mathcal{W}$ is compact. This means the number of balls of radius $\varepsilon$ required to cover $\mathcal{W}$ (denoted $N_\varepsilon$) is bounded above as
$$N_\varepsilon \le \left(\frac{3\Delta}{2\varepsilon}\right)^{d}. \tag{24}$$
This is a basic property of covering numbers for compact subsets of Euclidean space [12]. Denote the centers of this $\varepsilon$-net by $\{\tilde{w}_1,\ldots,\tilde{w}_{N_\varepsilon}\}$. Given an arbitrary $w \in \mathcal{W}$ and center $\tilde{w} \in \{\tilde{w}_1,\ldots,\tilde{w}_{N_\varepsilon}\}$, we break the quantity to be controlled into three error terms, each to be tackled separately, as
$$\|\hat{g}(w) - g(w)\| \le \|\hat{g}(w) - \hat{g}(\tilde{w})\| + \|g(w) - g(\tilde{w})\| + \|\hat{g}(\tilde{w}) - g(\tilde{w})\|. \tag{25}$$
Let us start with the first term, $\|\hat{g}(w) - \hat{g}(\tilde{w})\|$. Using Lemma 7, we have
$$\|\hat{g}(w) - \hat{g}(\tilde{w})\|^2 \le \sum_{j=1}^{d}\left(\frac{c_\nu}{n}\sum_{i=1}^{n}\left|l_j'(w;z_i) - l_j'(\tilde{w};z_i)\right|\right)^2 \le \sum_{j=1}^{d}\left(c_\nu\lambda\|w-\tilde{w}\|\right)^2 = d\,c_\nu^2\lambda^2\|w-\tilde{w}\|^2,$$
that is,
$$\|\hat{g}(w) - \hat{g}(\tilde{w})\| \le c_\nu\lambda\sqrt{d}\,\|w-\tilde{w}\|. \tag{26}$$
Moving on to the second error term in the upper bound, this follows easily by smoothness of the risk (via A2), namely a Lipschitz property of the risk gradient. It immediately follows that
$$\|g(w) - g(\tilde{w})\| \le \lambda\|w-\tilde{w}\| \tag{27}$$
for any choice of $w \in \mathcal{W}$ and $\varepsilon$-ball center $\tilde{w}$.

Finally, for the third error term: given any center $\tilde{w}$, as long as $E_\mu\, l_j'(\tilde{w};z)^2 < \infty$, we can apply Lemma 5, implying
$$|\hat{g}_j(w) - g_j(w)| \le \varepsilon_j := \sqrt{\frac{2 v_j \log(\delta^{-1})}{n}} + \sqrt{\frac{v_j}{n}}$$
with probability at least $1-\delta$, where $v_j > E_\mu |l_j'(w;z)|^2$ is used in the setting of $s_j$, in accordance with Lemma 5. For any pre-fixed $w$, then for any $\varepsilon > 0$,
$$P\left\{\|\hat{g}(w) - g(w)\| > \varepsilon\right\} = P\left\{\|\hat{g}(w) - g(w)\|^2 > \varepsilon^2\right\} \le \sum_{j=1}^{d} P\left\{|\hat{g}_j(w) - g_j(w)| > \frac{\varepsilon}{\sqrt{d}}\right\}.$$
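The coordinate-wise reduction above rests on the elementary fact that $\|v\| \le \sqrt{d}\max_j |v_j|$ for any $v \in \mathbb{R}^d$, so $\|v\| > \varepsilon$ forces $|v_j| > \varepsilon/\sqrt{d}$ for at least one $j$. A quick numerical sanity check (a sketch, not part of the paper's experiments):

```python
import numpy as np

# If ||v|| > eps, then some coordinate satisfies |v_j| > eps/sqrt(d),
# equivalently ||v|| <= sqrt(d) * max_j |v_j|.  Check on random vectors.
rng = np.random.default_rng(0)
d = 8
for _ in range(1000):
    v = rng.standard_normal(d) * rng.exponential()  # random direction and scale
    assert np.linalg.norm(v) <= np.sqrt(d) * np.max(np.abs(v)) + 1e-12
print("ok")
```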
Using the $\varepsilon_j$ just defined and taking the maximum over $j \in [d]$, it follows that
$$P\left\{\|\hat{g}(w) - g(w)\| > \left(\max_k \varepsilon_k\right)\sqrt{d}\right\} \le \sum_{j=1}^{d} P\left\{|\hat{g}_j(w) - g_j(w)| > \max_k \varepsilon_k\right\} \le \sum_{j=1}^{d} P\left\{|\hat{g}_j(w) - g_j(w)| > \varepsilon_j\right\} \le d\delta.$$
Note that the second inequality follows immediately from $\varepsilon_j \le \max_k \varepsilon_k$ and monotonicity of probability measures. Writing $V := \max_j v_j$, it follows immediately that fixing any $w \in \mathcal{W}$, the nearest center $\tilde{w} := \tilde{w}(w)$ can be determined, and the event
$$E(\tilde{w}) := \left\{\|\hat{g}(\tilde{w}) - g(\tilde{w})\| > \sqrt{\frac{2Vd\log(d\delta^{-1})}{n}} + \sqrt{\frac{Vd}{n}}\right\}$$
has probability no greater than $\delta$. The whole reason for utilizing an $\varepsilon$-cover of $\mathcal{W}$ in the first place is to avoid having to take a supremum over $w \in \mathcal{W}$, which spoils union bounds, and instead to simply take a maximum over a finite number of centers. The critical fact for our purposes is that
$$\sup_{w\in\mathcal{W}}\|\hat{g}(\tilde{w}(w)) - g(\tilde{w}(w))\| = \max_{k\in[N_\varepsilon]}\|\hat{g}(\tilde{w}_k) - g(\tilde{w}_k)\|,$$
and thus it is enough that $E(\cdot)$ holds for none of the centers of our $\varepsilon$-cover. In other words, we need only be on the event
$$E_+ := \bigcap_{k\in[N_\varepsilon]} E(\tilde{w}_k)^{c},$$
which, taking a union bound, occurs with probability no less than $1 - \delta N_\varepsilon$. To get a $1-\delta$ guarantee, simply pay the price of an extra logarithmic factor in the upper bound; that is to say, we equivalently have
$$\|\hat{g}(\tilde{w}(w)) - g(\tilde{w}(w))\| \le \sqrt{\frac{2Vd\log(dN_\varepsilon\delta^{-1})}{n}} + \sqrt{\frac{Vd}{n}} \tag{28}$$
with probability no less than $1-\delta$, uniformly in the choice of $w \in \mathcal{W}$.

Taking these intermediate results together, we can form a useful uniform upper bound on (25), taking the form
$$\sup_{w\in\mathcal{W}}\|\hat{g}(w) - g(w)\| \le \sup_{w\in\mathcal{W}}\left(c_\nu\lambda\sqrt{d}\,\|w-\tilde{w}\| + \lambda\|w-\tilde{w}\| + \|\hat{g}(\tilde{w}) - g(\tilde{w})\|\right) \le c_\nu\lambda\sqrt{d}\,\varepsilon + \lambda\varepsilon + \max_{k\in[N_\varepsilon]}\|\hat{g}(\tilde{w}_k) - g(\tilde{w}_k)\| \le \lambda\varepsilon\left(1 + c_\nu\sqrt{d}\right) + \sqrt{\frac{2Vd\log(dN_\varepsilon\delta^{-1})}{n}} + \sqrt{\frac{Vd}{n}}$$
with probability no less than $1-\delta$ over the random draw of the sample.
The bounds on the first, second, and third terms in the original upper bound come from (26), (27), and (28) respectively, with the $\varepsilon$ factors following immediately from the definition of an $\varepsilon$-cover. To obtain the desired result, simply bound $N_\varepsilon$ as in (24), and set $\varepsilon = 1/\sqrt{n}$, yielding updates to two of the terms, as
$$\lambda\varepsilon\left(1 + c_\nu\sqrt{d}\right) = \frac{\lambda\left(1 + c_\nu\sqrt{d}\right)}{\sqrt{n}}, \qquad \sqrt{\frac{2Vd\log(dN_\varepsilon\delta^{-1})}{n}} \le \sqrt{\frac{2Vd\left(\log(d\delta^{-1}) + d\log(3\Delta\sqrt{n}/2)\right)}{n}},$$
which, when plugged into the bound just obtained, concludes the proof.

Proof of Lemma 9.
Given $\hat{w}^{(t)}$, running the approximate update (2), we have
$$\|\hat{w}^{(t+1)} - w^{*}\| = \|\hat{w}^{(t)} - \alpha^{(t)}\hat{g}(\hat{w}^{(t)}) - w^{*}\| \le \|\hat{w}^{(t)} - \alpha^{(t)}g(\hat{w}^{(t)}) - w^{*}\| + \alpha^{(t)}\|\hat{g}(\hat{w}^{(t)}) - g(\hat{w}^{(t)})\|.$$
The first term looks at the distance from the target given an optimal update, using $g$. Using the $\kappa$-strong convexity of $R$, via Nesterov [22, Thm. 2.1.15] it follows that
$$\|\hat{w}^{(t)} - \alpha^{(t)}g(\hat{w}^{(t)}) - w^{*}\|^{2} \le \left(1 - \frac{2\alpha^{(t)}\kappa\lambda}{\kappa+\lambda}\right)\|\hat{w}^{(t)} - w^{*}\|^{2}.$$
Writing $\gamma := 2\kappa\lambda/(\kappa+\lambda)$, the coefficient becomes $(1 - \alpha^{(t)}\gamma)$.

To control the second term simply requires unfolding the recursion. By hypothesis, we can leverage (6) to bound the statistical estimation error by $\varepsilon$ for every step, all on the same $1-\delta$ event. Write $a^{(t)} := \sqrt{1 - \alpha^{(t)}\gamma}$. Unfolding the recursion, on the good event, we have
$$\|\hat{w}^{(t+1)} - w^{*}\| \le \|\hat{w}^{(0)} - w^{*}\|\prod_{k=0}^{t} a^{(k)} + \varepsilon\left(\alpha^{(t)} + \sum_{k=0}^{t-1}\alpha^{(k)}\prod_{l=k+1}^{t} a^{(l)}\right). \tag{29}$$
In the case of $\alpha^{(t)} = \alpha/\gamma$, things are very simple. We have $a^{(t)} = a := \sqrt{1-\alpha}$ for all $t$. The above inequality simplifies to
$$\|\hat{w}^{(t+1)} - w^{*}\| \le \|\hat{w}^{(0)} - w^{*}\|\,a^{t+1} + \frac{\varepsilon\alpha}{\gamma}\left(1 + a + \cdots + a^{t}\right) = \|\hat{w}^{(0)} - w^{*}\|\,a^{t+1} + \frac{\varepsilon\alpha}{\gamma}\frac{(1 - a^{t+1})}{(1-a)}.$$
To clean up the second summand in (29),
$$\frac{\alpha\varepsilon}{\gamma}\frac{(1 - a^{t+1})}{1-a} \le \frac{\alpha\varepsilon}{\gamma}\frac{(1+a)}{(1-a)(1+a)} = \frac{\alpha\varepsilon}{\gamma}\frac{\left(1 + \sqrt{1-\alpha}\right)}{\alpha} \le \frac{2\varepsilon}{\gamma}.$$
This gives us the first statement as desired. For the case of $\alpha^{(t)} = 1/((2+t)\gamma)$, things are only slightly more complicated. First observe that
$$\prod_{m=2}^{M}\left(1 - \frac{1}{m}\right) = \prod_{m=2}^{M}\frac{m-1}{m} = \frac{1}{M}, \tag{30}$$
where the last equality follows by simply cancelling terms. We can now handle the first summand in (29) as
$$\left(\prod_{k=0}^{t} a^{(k)}\right)^{2} = \prod_{k=0}^{t}\left(1 - \alpha^{(k)}\gamma\right) = \prod_{k=0}^{t}\left(1 - \frac{1}{2+k}\right) = \prod_{k=2}^{t+2}\left(1 - \frac{1}{k}\right) = \frac{1}{t+2},$$
where the final equality uses (30). As for the second summand in (29), first note that for any $k \ge 1$, we have
$$\frac{\alpha^{(k)}}{\alpha^{(k-1)}\left(a^{(k)}\right)^{2}} = \frac{2+k-1}{2+k}\left(1 - \frac{1}{2+k}\right)^{-1} = 1.$$
Then recalling the second term on the right-hand side of (29), consider any two consecutive summands within the parentheses, say
$$\alpha^{(k)}\,a^{(k+1)}\cdots a^{(t)} \quad\text{and}\quad \alpha^{(k-1)}\,a^{(k)}\cdots a^{(t)} \tag{31}$$
for any $1 \le k < t$. Dividing the first term by the second term, note that almost all the factors cancel, yielding
$$\frac{\alpha^{(k)}\,a^{(k+1)}\cdots a^{(t)}}{\alpha^{(k-1)}\,a^{(k)}\cdots a^{(t)}} = \frac{\alpha^{(k)}}{\alpha^{(k-1)}a^{(k)}} = 1,$$
by the identity just established. It follows that all terms inside the parentheses next to $\varepsilon$ are identical, and indeed equal to $\alpha^{(t)}$, which is to say
$$\varepsilon\left(\alpha^{(t)} + \sum_{k=0}^{t-1}\alpha^{(k)}\prod_{l=k+1}^{t} a^{(l)}\right) = (t+1)\,\alpha^{(t)}\varepsilon = \frac{(t+1)\,\varepsilon}{(t+2)\gamma} \le \frac{\varepsilon}{\gamma}.$$
Plugging these two new forms into the original inequality (29) yields our second desired result, and concludes the proof.
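The fixed-step-size case above can be sanity-checked numerically; the following sketch (with arbitrary test constants $\alpha$, $\gamma$, $\varepsilon$, $D$, not taken from the paper's experiments) iterates the scalar recursion and confirms both the geometric-series form and the simplified $2\varepsilon/\gamma$ bound:

```python
import math

# Iterate d_{t+1} = a * d_t + (alpha/gamma) * eps with a = sqrt(1 - alpha),
# and compare against the closed forms derived in the proof of Lemma 9.
alpha, gamma, eps, D = 0.25, 2.0, 0.1, 5.0
a = math.sqrt(1.0 - alpha)

dist = D  # initial distance ||w^(0) - w*||
for t in range(200):
    dist = a * dist + (alpha / gamma) * eps
    # Geometric-series form: D a^{t+1} + (eps*alpha/gamma)(1-a^{t+1})/(1-a)
    closed = D * a ** (t + 1) + (eps * alpha / gamma) * (1 - a ** (t + 1)) / (1 - a)
    assert dist <= closed + 1e-9
    # Simplified bound: D a^{t+1} + 2*eps/gamma
    assert closed <= D * a ** (t + 1) + 2 * eps / gamma + 1e-12

print("final distance:", dist)
```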
Proof of Theorem 10.
Using the strong convexity of $R$ (via A4) and (11), it follows that
$$R(\hat{w}^{(T)}) - R^{*} \le \frac{\lambda}{2}\|\hat{w}^{(T)} - w^{*}\|^{2} \le \lambda(1-\alpha)^{T}\|\hat{w}^{(0)} - w^{*}\|^{2} + \frac{4\lambda\varepsilon^{2}}{\gamma^{2}}.$$
The latter inequality holds by direct application of Lemma 9 under fixed step size, followed by the elementary fact $(a+b)^2 \le 2(a^2+b^2)$. The particular value of $\varepsilon$ under which Lemma 9 is valid (i.e., under which (6) holds) is given by Lemma 8 as $\tilde{\varepsilon}$. Setting $\varepsilon = \tilde{\varepsilon}$ yields the desired result.

Proof of Theorem 13.
We construct an upper bound using
$$E\|\hat{w}^{(t+1)} - \hat{w}^{(t)}\| = \alpha^{(t)}\,E\|\hat{g}(\hat{w}^{(t)})\| \le \alpha^{(t)}\,E\left(\|\hat{g}(\hat{w}^{(t)}) - g(\hat{w}^{(t)})\| + \|g(\hat{w}^{(t)})\|\right) \le \alpha^{(t)}\left(\left(E\|\hat{g}(\hat{w}^{(t)}) - g(\hat{w}^{(t)})\|^{2}\right)^{1/2} + \|g(\hat{w}^{(t)})\|\right),$$
where the last step uses Jensen's inequality. Now, if we condition on $\hat{w}^{(t)}$, by assumption the loss gradients $l'(\hat{w}^{(t)};z_1),\ldots,l'(\hat{w}^{(t)};z_n)$ are iid. With independence, just as in the proof of Lemma 8, we have
$$\|\hat{g}(\hat{w}^{(t)}) - g(\hat{w}^{(t)})\| \le \sqrt{\frac{2Vd\log(d\delta^{-1})}{n}} + \sqrt{\frac{Vd}{n}}$$
with probability no less than $1-\delta$. Setting the right-hand side of this inequality to $\varepsilon$ and solving for $\delta$, we have exponential tails of the form
$$P\left\{\|\hat{g}(\hat{w}^{(t)}) - g(\hat{w}^{(t)})\| > \varepsilon\right\} \le d\exp\left(-\frac{(\varepsilon - a)^{2}}{b^{2}}\right)$$
with constants defined $a := \sqrt{Vd/n}$, $b := \sqrt{2Vd/n}$.
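The moment-control argument that follows relies on the layer-cake identity $E X^2 = \int_0^\infty P\{X^2 > t\}\,dt$ for nonnegative $X$; a minimal numeric check of that identity, using an Exponential(1) variable for which both sides equal 2 (a sketch, not from the paper):

```python
import numpy as np

# For X ~ Exponential(1): E[X^2] = 2 and P{X > u} = exp(-u), so
#   \int_0^\infty P{X^2 > t} dt = \int_0^\infty exp(-sqrt(t)) dt = 2.
t = np.linspace(0.0, 400.0, 2000001)
tail = np.exp(-np.sqrt(t))  # P{X^2 > t} = P{X > sqrt(t)}
integral = float(np.sum(tail[:-1] + tail[1:]) * (t[1] - t[0]) / 2.0)  # trapezoid
print(integral)  # should be close to 2
```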
Controlling moments using exponential tails can be done as follows. For a random variable $X \in L_p$ with $p \ge 1$, recall the classic inequality
$$E|X|^{p} = \int_{0}^{\infty} P\left\{|X|^{p} > t\right\} dt.$$
Here we take $X = \|\hat{g}(\hat{w}^{(t)}) - g(\hat{w}^{(t)})\|$, with $p = 2$. It follows that
$$E|X|^{2} = \int_{0}^{\infty} P\left\{|X|^{2} > u\right\} du = \int_{0}^{\infty} P\left\{X > \sqrt{u}\right\} du = \int_{0}^{\infty} 2u\,P\left\{X > u\right\} du \le 2d\int_{0}^{\infty}\exp\left(-\frac{(u-a)^{2}}{b^{2}}\right) u\,du.$$
The third equality uses a substitution of variables, and the inequality at the end uses the exponential tail inequality given above. This integral is, up to scaling, the expectation of the Normal distribution $N(a, b^{2}/2)$ taken over just the positive half-line. A simple upper bound can be constructed by
$$\int_{0}^{\infty}\exp\left(-\frac{(u-a)^{2}}{b^{2}}\right) u\,du = \int_{-\infty}^{\infty} I\{u \ge 0\}\exp\left(-\frac{(u-a)^{2}}{b^{2}}\right) u\,du \le \int_{-\infty}^{\infty}\exp\left(-\frac{(u-a)^{2}}{b^{2}}\right) |u|\,du,$$
easily recognized (after rescaling by $1/\sqrt{\pi b^{2}}$) as the expectation of a Folded Normal random variable, induced by $N(a, b^{2}/2)$. Recalling the proof of Lemma 7, the expected value of this Folded Normal random variable is
$$\int_{-\infty}^{\infty}\frac{1}{\sqrt{\pi b^{2}}}\exp\left(-\frac{(u-a)^{2}}{b^{2}}\right)|u|\,du = a\left(1 - 2\Phi\left(-\frac{\sqrt{2}\,a}{b}\right)\right) + \frac{b}{\sqrt{\pi}}\exp\left(-\frac{a^{2}}{b^{2}}\right).$$
Taking into account the normalization factor, our upper bound takes the form
$$E|X|^{2} \le 2d\sqrt{\pi b^{2}}\left(a\left(1 - 2\Phi\left(-\frac{\sqrt{2}\,a}{b}\right)\right) + \frac{b}{\sqrt{\pi}}\exp\left(-\frac{a^{2}}{b^{2}}\right)\right).$$
Plugging this in for $X = \|\hat{g}(\hat{w}^{(t)}) - g(\hat{w}^{(t)})\|$ in the upper bound constructed at the start of this proof yields the desired result.

A.3 Computation
From Catoni and Giulini [4, Lemma 3.2], it follows that the correction term $C(a,b)$ used in (5) can be computed in closed form. First, some preparatory definitions to keep notation clean:
$$V_{-} := \frac{\sqrt{2} - a}{b}, \qquad V_{+} := \frac{\sqrt{2} + a}{b},$$
$$F_{-} := \Phi(-V_{-}), \qquad F_{+} := \Phi(-V_{+}),$$
$$E_{-} := \exp\left(-\frac{V_{-}^{2}}{2}\right), \qquad E_{+} := \exp\left(-\frac{V_{+}^{2}}{2}\right).$$
As seen in other parts of the text, $\Phi$ denotes the standard Normal CDF. With these atomic elements defined to keep things a bit cleaner, we break the final quantity into five terms to be summed:
$$T_{1} = \frac{2\sqrt{2}}{3}\left(F_{-} - F_{+}\right)$$
$$T_{2} = \left(a - \frac{a^{3}}{6}\right)\left(1 - F_{-} - F_{+}\right)$$
$$T_{3} = \frac{b}{\sqrt{2\pi}}\left(1 - \frac{a^{2}}{2}\right)\left(E_{+} - E_{-}\right)$$
$$T_{4} = \frac{ab^{2}}{2}\left(F_{+} + F_{-} - 1 + \frac{1}{\sqrt{2\pi}}\left(V_{+}E_{+} + V_{-}E_{-}\right)\right)$$
$$T_{5} = \frac{b^{3}}{6\sqrt{2\pi}}\left(\left(2 + V_{-}^{2}\right)E_{-} - \left(2 + V_{+}^{2}\right)E_{+}\right).$$
With these terms in hand, the final computation is just summation, as
$$C(a,b) = T_{1} + T_{2} + T_{3} + T_{4} + T_{5}.$$

B Additional test results
Due to space restrictions and overall readability, we did not include all empirical test results in the tests of Section 4. Here we provide results for all of the noise distribution families considered in the second class of numerical experiments, namely the regression application considered at the end of Section 4.1. The following distribution families are considered: Arcsine (asin), Beta Prime (bpri), Chi-squared (chisq), Exponential (exp), Exponential-Logarithmic (explog), Fisher's F (f), Fréchet (frec), Gamma (gamma), Gompertz (gomp), Gumbel (gum), Hyperbolic Secant (hsec), Laplace (lap), Log-Logistic (llog), Log-Normal (lnorm), Logistic (lgst), Maxwell (maxw), Pareto (pareto), Rayleigh (rayl), Semi-circle (scir), Student's t (t), Triangle (asymmetric tri_a, symmetric tri_s), U-Power (upwr), Wald (wald), Weibull (weibull).

The content of this section is as follows:
• Figures 10–11: performance as a function of sample size n.
• Figures 12–13: performance over noise levels, with fixed n and d.
• Figures 14–15: performance as a function of d, with fixed n/d ratio and noise level.
Figure 10:
Prediction error over sample sizes n (starting from n = 12), with d = 5 and noise level = 8. Each plot corresponds to a distinct noise distribution. Methods compared: HS-linreg, LAD-linreg, Minsker-linreg, OLS-linreg, RGD-linreg.
Figure 11:
Prediction error over sample sizes n (starting from n = 12), with d = 5 and noise level = 8. Each plot corresponds to a distinct noise distribution.
Figure 12:
Prediction error over noise levels, for n = 30, d = 5. Each plot corresponds to a distinct noise distribution.
Figure 13:
Prediction error over noise levels, for n = 30, d = 5. Each plot corresponds to a distinct noise distribution.

Figure 14:
Prediction error over dimensions 5 ≤ d ≤
40, with ratio n/d = 6 fixed, and noise level = 8. Each plot corresponds to a distinct noise distribution.

Figure 15:
Prediction error over dimensions 5 ≤ d ≤
40, with ratio n/d = 6 fixed, and noise level = 8. Each plot corresponds to a distinct noise distribution.

References

[1] Brownlees, C., Joly, E., and Lugosi, G. (2015). Empirical risk minimization for heavy-tailed losses. Annals of Statistics, 43(6):2507–2536.

[2] Catoni, O. (2004).
Statistical Learning Theory and Stochastic Optimization: École d'Été de Probabilités de Saint-Flour XXXI-2001, volume 1851 of Lecture Notes in Mathematics. Springer.

[3] Catoni, O. (2012). Challenging the empirical mean and empirical variance: a deviation study. Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, 48(4):1148–1185.

[4] Catoni, O. and Giulini, I. (2017). Dimension-free PAC-Bayesian bounds for matrices, vectors, and linear least squares regression. arXiv preprint arXiv:1712.02747.

[5] Chen, Y., Su, L., and Xu, J. (2017a). Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. arXiv preprint arXiv:1705.05491.

[6] Chen, Y., Su, L., and Xu, J. (2017b). Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 1(2):44.

[7] Daniely, A. and Shalev-Shwartz, S. (2014). Optimal learners for multiclass problems. In Proceedings of the 27th Conference on Learning Theory (COLT), volume 35 of Proceedings of Machine Learning Research, pages 287–316.

[8] Feldman, V. (2016). Generalization of ERM in stochastic convex optimization: The dimension strikes back. In Advances in Neural Information Processing Systems 29, pages 3576–3584.

[9] Finkenstädt, B. and Rootzén, H., editors (2003). Extreme Values in Finance, Telecommunications, and the Environment. CRC Press.

[10] Hsu, D. and Sabato, S. (2016). Loss minimization and parameter estimation with heavy tails. Journal of Machine Learning Research, 17(18):1–40.

[11] Johnson, R. and Zhang, T. (2013). Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems 26, pages 315–323.

[12] Kolmogorov, A. N. (1993). ε-entropy and ε-capacity of sets in functional spaces. In Shiryayev, A. N., editor, Selected Works of A. N. Kolmogorov, Volume III: Information Theory and the Theory of Algorithms, pages 86–170. Springer.

[13] Lecué, G. and Lerasle, M. (2017). Learning from MOM's principles. arXiv preprint arXiv:1701.01961.

[14] Lecué, G., Lerasle, M., and Mathieu, T. (2018). Robust classification via MOM minimization. arXiv preprint arXiv:1808.03106.

[15] Lin, J. and Rosasco, L. (2016). Optimal learning for multi-pass stochastic gradient methods. In Advances in Neural Information Processing Systems 29, pages 4556–4564.

[16] Luenberger, D. G. (1969). Optimization by Vector Space Methods. John Wiley & Sons.

[17] Lugosi, G. and Mendelson, S. (2016). Risk minimization by median-of-means tournaments. arXiv preprint arXiv:1608.00757.

[18] Lugosi, G. and Mendelson, S. (2017a). Regularization, sparse recovery, and median-of-means tournaments. arXiv preprint arXiv:1701.04112.

[19] Lugosi, G. and Mendelson, S. (2017b). Sub-Gaussian estimators of the mean of a random vector. arXiv preprint arXiv:1702.00482.

[20] Minsker, S. (2015). Geometric median and robust estimation in Banach spaces. Bernoulli, 21(4):2308–2335.

[21] Nalisnick, E., Anandkumar, A., and Smyth, P. (2015). A scale mixture perspective of multiplicative noise in neural networks. arXiv preprint arXiv:1506.03208.

[22] Nesterov, Y. (2004). Introductory Lectures on Convex Optimization: A Basic Course. Springer.

[23] Nocedal, J. and Wright, S. (1999). Numerical Optimization. Springer Series in Operations Research. Springer.

[24] Prasad, A., Suggala, A. S., Balakrishnan, S., and Ravikumar, P. (2018). Robust estimation via robust gradient estimation. arXiv preprint arXiv:1802.06485.

[25] Shalev-Shwartz, S. and Ben-David, S. (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.

[26] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958.

[27] Vardi, Y. and Zhang, C.-H. (2000). The multivariate L₁-median and associated data depth. Proceedings of the National Academy of Sciences, 97(4):1423–1426.