Adversarial robust weighted Huber regression
Takeyuki Sasai
Department of Statistical Science, The Graduate University for Advanced Studies, SOKENDAI, Tokyo, Japan. e-mail: [email protected]
Hironori Fujisawa
The Institute of Statistical Mathematics, Tokyo, Japan. Department of Statistical Science, The Graduate University for Advanced Studies, SOKENDAI, Tokyo, Japan. Center for Advanced Integrated Intelligence Research, RIKEN, Tokyo, Japan. e-mail: [email protected]
Abstract:
We propose a novel method to estimate the coefficients of linear regression when outputs and inputs are contaminated by malicious outliers. Our method consists of two steps: (i) compute appropriate weights $\{\hat w_i\}_{i=1}^n$ such that the weighted sample mean of the regression covariates robustly estimates the population mean of the regression covariate, and (ii) run Huber regression using $\{\hat w_i\}_{i=1}^n$. When (a) the regression covariates are a sequence of i.i.d. random vectors drawn from a sub-Gaussian distribution with unknown mean and known identity covariance and (b) the absolute moment of the random noise is finite, our method attains a faster convergence rate than Diakonikolas, Kong and Stewart (2019) and Cherapanamjeri et al. (2020). Furthermore, our result is minimax optimal up to a constant factor. When (a) the regression covariates are a sequence of i.i.d. random vectors drawn from a heavy-tailed distribution with unknown mean and bounded kurtosis and (b) the absolute moment of the random noise is finite, our method attains a convergence rate which is minimax optimal up to a constant factor.
MSC 2010 subject classifications:
Keywords and phrases:
Linear regression, Robustness, Convergence rate, Huber loss.
∗ This work was supported in part by JSPS KAKENHI Grant Number 17K00065.
1. Introduction
Learning from data contaminated by malicious outliers has been an important topic in robust statistics. Classically, Huber (1981) presented important notions such as Huber's contamination model. Recently, Chen, Gao and Ren (2018) and Gao (2020) derived minimax lower bounds for parameter estimation in Huber's contamination model. They also proposed parameter estimation methods using Tukey's depth (Tukey 1975); the estimation errors match the lower bounds up to constant factors (Mizera et al. 2002). Elsener and van de Geer (2018) and Loh (2017) studied non-convex M-estimators. Lugosi and Mendelson (2019) and Lugosi and Mendelson (2020) considered the median-of-means tournament (Jerrum, Valiant and Vazirani 1986, Nemirovsky and Yudin 1983). However, the methods proposed by Chen, Gao and Ren (2018), Gao (2020), Elsener and van de Geer (2018), Loh (2017), and Lugosi and Mendelson (2019, 2020) possibly require exponential computational complexity to achieve fast convergence rates. On the other hand, Lai, Rao and Vempala (2016) and Diakonikolas et al. (2018) proposed methods which robustly estimate parameters with polynomial time complexity, and since then many computationally efficient robust estimators have been proposed by Diakonikolas et al. (2017), Karmalkar and Price (2018), Kothari, Steinhardt and Steurer (2018), Diakonikolas et al. (2019), Cheng, Diakonikolas and Ge (2019), Dong, Hopkins and Li (2019), Cheng et al. (2019), Diakonikolas, Kane and Pensia (2020), Prasad et al. (2020), Hopkins, Li and Zhang (2020), and Cheng et al. (2020).

In the present paper, we study the linear regression model
\[
y_i = x_i^\top \beta^* + \xi_i, \quad i = 1, \dots, n, \qquad (1.1)
\]
where $x_i \in \mathbb{R}^d$ and $\{\xi_i\}_{i=1}^n$ is a sequence of random noise which is independent of $\{x_i\}_{i=1}^n$. We allow an adversary to pick $o$ samples from $(y_i, x_i)_{i=1}^n$ and replace them with arbitrary values. Let $I_o$ be the index set of the replaced samples and $I_G = \{1, \dots, n\} \setminus I_o$. Then, the model contaminated by adversarial samples is
\[
y_i = (x_i + \varrho_i)^\top \beta^* + \xi_i + \sqrt{n}\,\theta_i, \quad i = 1, \dots, n, \qquad (1.2)
\]
where $\varrho_i = (0, \dots, 0)^\top$ and $\theta_i = 0$ for $i \in I_G$. We note that, because we allow the adversary to pick the $o$ samples from $(y_i, x_i)_{i=1}^n$ arbitrarily, $(y_i + \sqrt{n}\theta_i, x_i + \varrho_i)_{i=1}^n$ follows a strong contamination model (Diakonikolas and Kane 2019), not Huber's contamination model. Furthermore, (1.2) can be expressed as
\[
y_i = X_i^\top \beta^* + \xi_i + \sqrt{n}\,\theta_i, \quad i = 1, \dots, n, \qquad (1.3)
\]
where $X_i = x_i + \varrho_i$. Some works (Diakonikolas, Kong and Stewart 2019, Cherapanamjeri et al. 2020, Pensia, Jog and Loh 2020, Bakshi and Prasad 2020) consider the model (1.3) and proposed methods to estimate $\beta^*$ with polynomial computational complexity. Diakonikolas, Kong and Stewart (2019) and Cherapanamjeri et al. (2020) dealt with the case where $\{x_i\}_{i=1}^n$ and $\{\xi_i\}_{i=1}^n$ are Gaussian with possibly non-identity covariance and sub-Gaussian with identity covariance, respectively. Cherapanamjeri et al. (2020), Pensia, Jog and Loh (2020), and Bakshi and Prasad (2020) considered the case where $\{x_i\}_{i=1}^n$ and $\{\xi_i\}_{i=1}^n$ obey heavy-tailed distributions.

Our first result is given in the following theorem; for a precise statement, see Theorem 4.1. Let $\hat\beta$ be the minimizer of the Huber loss function (2.2) with some weights related to outliers. The estimator $\hat\beta$ can be computed with polynomial computational complexity because of the convexity of (2.2).
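As a concrete illustration of the contamination model (1.2)-(1.3), the following small simulation sketch draws clean data from (1.1) and then lets an adversary overwrite $o = \lfloor \varepsilon n \rfloor$ of the pairs $(y_i, X_i)$. The function name, the Student-$t$ noise and the particular corruption pattern are illustrative choices for this sketch only; the model itself allows the replaced pairs to take arbitrary values.

\begin{verbatim}
import numpy as np

def generate_contaminated_data(n=1000, d=20, eps=0.05, sigma=1.0, seed=0):
    """Sample (y_i, X_i) from model (1.3); o = eps*n pairs are replaced adversarially."""
    rng = np.random.default_rng(seed)
    beta_star = rng.normal(size=d)
    x = rng.normal(size=(n, d))                  # clean covariates, identity covariance
    xi = sigma * rng.standard_t(df=5, size=n)    # noise with bounded absolute moment
    y = x @ beta_star + xi                       # uncontaminated responses, model (1.1)

    X, o = x.copy(), int(eps * n)                # o adversarial samples
    idx_out = rng.choice(n, size=o, replace=False)   # the index set I_o
    X[idx_out] = 10.0 * rng.normal(size=(o, d))  # one possible attack on the covariates
    y[idx_out] = -50.0 + rng.normal(size=o)      # and on the responses
    return y, X, beta_star, idx_out
\end{verbatim}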
Theorem 1.1. Suppose that $\{x_i\}_{i=1}^n$ is a sequence of i.i.d. random vectors drawn from a sub-Gaussian distribution with unknown mean $\mu$ and identity covariance, and that $\{\xi_i\}_{i=1}^n$ is a sequence of i.i.d. random variables drawn from a distribution whose absolute moment is bounded by $\sigma$. In addition, we assume some conditions. Then, with high probability, we have
\[
\|\hat\beta - \beta^*\|_2 = O\left(\varepsilon\sqrt{\log\frac{1}{\varepsilon}} + \sqrt{\frac{d\log d}{n}}\right). \qquad (1.4)
\]
When the fraction of outliers $\varepsilon$ is a constant and $n$ is sufficiently large, we have
\[
\|\hat\beta - \beta^*\|_2 = O\left(\varepsilon\sqrt{\log\frac{1}{\varepsilon}}\right). \qquad (1.5)
\]
Our convergence rate is faster than those of Diakonikolas, Kong and Stewart (2019) and Cherapanamjeri et al. (2020), because their convergence rates are $O(\varepsilon\log(1/\varepsilon))$. Furthermore, according to Theorem D.3 of Cherapanamjeri et al. (2020), our result is minimax optimal up to a constant factor.

Our second result is given in the following theorem; for a precise statement, see Theorem B.1.

Theorem 1.2.
Suppose that $\{x_i\}_{i=1}^n$ is a sequence of i.i.d. random vectors drawn from a distribution with unknown mean $\mu$, bounded known covariance and bounded kurtosis. Suppose $\{\xi_i\}_{i=1}^n$ is a sequence of i.i.d. random variables drawn from a distribution whose absolute moment is bounded by $\sigma$. In addition, we assume some conditions. Then, with high probability, we have
\[
\|\hat\beta - \beta^*\|_2 = O\left(\sqrt{\varepsilon} + \sqrt{\frac{d\log d}{n}}\right). \qquad (1.6)
\]
When the fraction of outliers $\varepsilon$ is a constant and $n$ is sufficiently large, we have
\[
\|\hat\beta - \beta^*\|_2 = O(\sqrt{\varepsilon}). \qquad (1.7)
\]
Our convergence rate is the same as those of Cherapanamjeri et al. (2020), Pensia, Jog and Loh (2020), and Bakshi and Prasad (2020) up to constant factors. Furthermore, according to Theorem D.4 of Cherapanamjeri et al. (2020), our result is minimax optimal up to a constant factor.

In Section 2, we provide our estimation method. In Section 3, we state our main theorem in deterministic form (Theorem 3.1). In Section 4, we confirm that the conditions in Theorem 3.1 are satisfied with high probability under the assumptions in Theorem 4.1. In the Appendix, we prove the inequalities used in Section 4 and state Theorem 1.2 precisely, together with its proof.
2. Our method
To estimate β ∗ in (1.3), we propose the following algorithm (Algorithm 1). Algorithm 1
TWO STEP WEIGHTED HUBER REGRESSION
Input: $\{y_i, X_i\}_{i=1}^n$, $\varepsilon$ and the tuning parameter $\lambda_o$. Output: $\hat\beta$.
1: $\{\hat w_i\}_{i=1}^n \leftarrow$ ROBUST-WEIGHT($\{y_i, X_i\}_{i=1}^n, \varepsilon$)
2: $\hat\beta \leftarrow$ WEIGHTED-HUBER-REGRESSION($\{y_i, X_i, \hat w_i\}_{i=1}^n, \varepsilon, \lambda_o$)

We require ROBUST-WEIGHT to compute a weight vector $\hat w = (\hat w_1, \dots, \hat w_n)$ satisfying (3.1)-(3.5). For example, we can use Algorithm 1 of Cheng, Diakonikolas and Ge (2019) as ROBUST-WEIGHT. That algorithm was proposed to robustly estimate the mean $\mu = \mathbb{E}[x_i]$ from $\{X_i\}_{i=1}^n$.

WEIGHTED-HUBER-REGRESSION is a parameter estimation algorithm using the Huber loss with $\{n\hat w_i X_i\}_{i=1}^n$ instead of $\{X_i\}_{i=1}^n$. In WEIGHTED-HUBER-REGRESSION, we consider the following optimization problem:
\[
(\hat\theta, \hat\beta) = \mathop{\mathrm{argmin}}_{\theta \in \mathbb{R}^n,\, \beta \in \mathbb{R}^d} \sum_{i=1}^n \frac{1}{2n}\left( y_i - n\hat w_i (X_i - \mu_{\hat w})^\top \beta - \sqrt{n}\,\theta_i \right)^2 + \lambda_o \|\theta\|_1, \qquad (2.1)
\]
where $\mu_{\hat w} = \sum_{i=1}^n \hat w_i X_i$. From She and Owen (2011), after optimizing (2.1) with respect to $\theta$, we have
\[
\hat\beta = \mathop{\mathrm{argmin}}_{\beta \in \mathbb{R}^d} \sum_{i=1}^n \lambda_o^2\, H\!\left( \frac{y_i - n\hat w_i (X_i - \mu_{\hat w})^\top \beta}{\lambda_o \sqrt{n}} \right), \qquad (2.2)
\]
where $H(t)$ is the Huber loss function
\[
H(t) =
\begin{cases}
|t| - 1/2 & (|t| > 1), \\
t^2/2 & (|t| \le 1).
\end{cases}
\]
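The following sketch shows one way WEIGHTED-HUBER-REGRESSION could be implemented by minimizing (2.2) directly with a generic smooth optimizer. The use of scipy.optimize.minimize, the L-BFGS-B method and the zero initialization are implementation choices for this sketch and are not prescribed by our analysis; only the objective (2.2) itself comes from the text.

\begin{verbatim}
import numpy as np
from scipy.optimize import minimize

def huber(t):
    """Huber loss H(t) = t^2/2 for |t| <= 1 and |t| - 1/2 otherwise."""
    return np.where(np.abs(t) <= 1.0, 0.5 * t ** 2, np.abs(t) - 0.5)

def weighted_huber_regression(y, X, w, lam_o):
    """Minimize the weighted Huber objective (2.2) over beta."""
    n, d = X.shape
    mu_w = w @ X                                # weighted covariate mean mu_w
    Z = n * w[:, None] * (X - mu_w)             # rows n * w_i * (X_i - mu_w)

    def objective(beta):
        t = (y - Z @ beta) / (lam_o * np.sqrt(n))
        return lam_o ** 2 * np.sum(huber(t))

    res = minimize(objective, x0=np.zeros(d), method="L-BFGS-B")
    return res.x
\end{verbatim}

With uniform weights $w_i = 1/n$ this reduces to ordinary (unweighted) Huber regression on the centered covariates; the weights returned by ROBUST-WEIGHT instead downweight samples whose covariates look atypical.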
3. Deterministic argument
Let
\[
h(t) = \frac{d}{dt}H(t) =
\begin{cases}
t & (|t| \le 1), \\
\mathrm{sign}(t) & (|t| > 1),
\end{cases}
\qquad
r_i(v) = \frac{y_i - n\hat w_i (x_i - \mu_{\hat w})^\top v}{\lambda_o \sqrt{n}}, \qquad
R_i(v) = \frac{y_i - n\hat w_i (X_i - \mu_{\hat w})^\top v}{\lambda_o \sqrt{n}}
\]
for $v \in \mathbb{R}^d$. We state the main theorem in a deterministic form.

Theorem 3.1.
Consider the optimization problem (2.2). Let $\delta_{\beta_\eta} = \beta_\eta - \beta^* = \eta(\hat\beta - \beta^*)$ for some $\eta \in [0, 1]$. Suppose that
\[
\left| \sum_{i=1}^n \frac{\lambda_o}{\sqrt{n}} \hat w_i\, h\!\left(\frac{\xi_i}{\lambda_o\sqrt{n}}\right) (x_i - \mu_{\hat w})^\top \delta_{\beta_\eta} \right| \le c_3 \|\delta_{\beta_\eta}\|_2, \qquad (3.1)
\]
\[
\left| \sum_{i \in I_o} \frac{\lambda_o}{\sqrt{n}} \hat w_i\, h(r_i(\beta_\eta)) (x_i - \mu_{\hat w})^\top \delta_{\beta_\eta} \right| \le c_3 \|\delta_{\beta_\eta}\|_2, \qquad (3.2)
\]
\[
\left| \sum_{i \in I_o} \frac{\lambda_o}{\sqrt{n}} \hat w_i\, h(R_i(\beta_\eta)) (X_i - \mu_{\hat w})^\top \delta_{\beta_\eta} \right| \le c_3 \|\delta_{\beta_\eta}\|_2, \qquad (3.3)
\]
\[
\sum_{i=1}^n \frac{\lambda_o}{\sqrt{n}} \hat w_i \left( -h(r_i(\beta_\eta)) + h(r_i(\beta^*)) \right) (x_i - \mu_{\hat w})^\top \delta_{\beta_\eta} \ge c_2 \|\delta_{\beta_\eta}\|_2^2 - c_1 \|\delta_{\beta_\eta}\|_2 - c_4, \qquad (3.4)
\]
where $c_1$, $c_2$, $c_3$ and $c_4$ are some positive numbers, and suppose
\[
\frac{c_1 + 3c_3 + \sqrt{c_2 c_4}}{c_2} < r_0 \qquad (3.5)
\]
for some positive number $r_0$. Then, we have $\|\beta^* - \hat\beta\|_2 \le r_0$.

Proof of Theorem 3.1.
In Section 4, the conditions (3.1)-(3.5) are shown to be satisfied with high probability under the conditions in our main theorem (Theorem 4.1). Consequently, our main theorem is proved.
For any fixed $r_0 > 0$, we define $B := \{\beta : \|\beta^* - \beta\|_2 \le r_0\}$. We prove $\hat\beta \in B$ by assuming $\hat\beta \notin B$ and deriving a contradiction. For $\hat\beta \notin B$, we can find some $\eta \in [0,$
1] such that k δ β η k = r . (3.6) asai and Fujisawa/Weighted Huber regression Let Q ′ ( η ) = λ o √ n ˆ w i n X i =1 ( − h ( r i ( β η )) + h ( r i ( β ∗ )))( X i − µ ˆ w ) ⊤ δ ˆ β , where δ ˆ β = ˆ β − β ∗ . From the proof of Lemma F.2. of Fan et al. (2018), we have ηQ ′ ( η ) ≤ ηQ ′ (1) and this means η n X i =1 λ o √ n ˆ w i ( − h ( R i ( β η )) + h ( R i ( β ∗ ))) ( X i − µ ˆ w ) ⊤ δ ˆ β ≤ η n X i =1 λ o √ n ˆ w i (cid:16) − h ( R i ( ˆ β )) + h ( R i ( β ∗ )) (cid:17) ( X i − µ ˆ w ) ⊤ δ ˆ β (3.7)and we have η n X i =1 λ o √ n ˆ w i ( − h ( R i ( β η )) + h ( R i ( β ∗ ))) ( X i − µ ˆ w ) ⊤ δ ˆ β ( a ) = η n X i =1 λ o √ n ˆ w i h ( R i ( β ∗ ))( X i − µ ˆ w ) ⊤ δ ˆ β ( b ) = n X i =1 λ o √ n ˆ w i h ( R i ( β ∗ ))( X i − µ ˆ w ) ⊤ δ β η , (3.8)where (a) follows from the fact that ˆ β is a optimal solution of (2.2) and (b)follows from the definition of δ β η .From (3.7) and (3.8), we have n X i =1 λ o √ n ˆ w i ( − h ( R i ( β η )) + h ( R i ( β ∗ ))) ( X i − µ ˆ w ) ⊤ δ β η ≤ n X i =1 λ o √ n ˆ w i h ( R i ( β ∗ ))( X i − µ ˆ w ) ⊤ δ β η . (3.9) asai and Fujisawa/Weighted Huber regression The left-hand side of (3.9) can be decomposed as n X i =1 λ o √ n ˆ w i ( − h ( R i ( β η )) + h ( R i ( β ∗ ))) ( X i − µ ˆ w ) ⊤ δ β η = X i ∈ I o λ o √ n ˆ w i ( − h ( R i ( β η )) + h ( r i ( β ∗ ))) ( X i − µ ˆ w ) ⊤ δ β η + X i ∈ I G λ o √ n ˆ w i ( − h ( R i ( β η )) + h ( R i ( β ∗ ))) ( X i − µ ˆ w ) ⊤ δ β η = X i ∈ I o λ o √ n ˆ w i ( − h ( R i ( β η )) + h ( r i ( β ∗ ))) ( X i − µ ˆ w ) ⊤ δ β η + X i ∈ I G λ o √ n ˆ w i ( − h ( r i ( β η )) + h ( r i ( β ∗ ))) ( x i − µ ˆ w ) ⊤ δ β η = n X i =1 λ o √ n ˆ w i ( − h ( r i ( β η )) + h ( r i ( β ∗ ))) ( x i − µ ˆ w ) ⊤ δ β η + X i ∈ I o λ o √ n ˆ w i ( − h ( R i ( β η )) + h ( r i ( β ∗ ))) ( X i − µ ˆ w ) ⊤ δ β η − X i ∈ I o λ o √ n ˆ w i ( − h ( r i ( β η )) + h ( r i ( β ∗ ))) ( x i − µ ˆ w ) ⊤ δ β η . (3.10)The right-hand side of (3.9) can be decomposed as n X i =1 λ o √ n ˆ w i h ( R i ( β ∗ ))( X i − µ ˆ w ) ⊤ δ β η = X i ∈ I o λ o √ n ˆ w i h ( R i ( β ∗ ))( X i − µ ˆ w ) ⊤ δ β η + X i ∈ I G λ o √ n ˆ w i h ( r i ( β ∗ ))( x i − µ ˆ w ) ⊤ δ β η = X i ∈ I o λ o √ n ˆ w i h ( R i ( β ∗ ))( X i − µ ˆ w ) ⊤ δ β η + X i ∈ I G λ o √ n ˆ w i h (cid:18) ξ i λ o √ n (cid:19) ( x i − µ ˆ w ) ⊤ δ β η = X i ∈ I o λ o √ n ˆ w i h ( R i ( β ∗ ))( X i − µ ˆ w ) ⊤ δ β η − X i ∈ I o λ o √ n ˆ w i h (cid:18) ξ i λ o √ n (cid:19) ( x i − µ ˆ w ) ⊤ δ β η + n X i =1 λ o √ n ˆ w i h (cid:18) ξ i λ o √ n (cid:19) ( x i − µ ˆ w ) ⊤ δ β η . (3.11) asai and Fujisawa/Weighted Huber regression From (3.9) - (3.11), we have n X i =1 λ o √ n ˆ w i ( − h ( r i ( β η )) + h ( r i ( β ∗ ))) ( x i − µ ˆ w ) ⊤ δ β η ≤ n X i =1 λ o √ n ˆ w i h (cid:18) ξ i λ o √ n (cid:19) ( x i − µ ˆ w ) ⊤ δ β η − X i ∈ I o λ o √ n ˆ w i h ( r i ( β η ))( x i − µ ˆ w ) ⊤ δ β η + X i ∈ I o λ o √ n ˆ w i h ( R i ( β η ))( X i − µ ˆ w ) ⊤ δ β η ≤ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 λ o √ n ˆ w i h (cid:18) ξ i λ o √ n (cid:19) ( x i − µ ˆ w ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)X i ∈ I o λ o √ n ˆ w i h ( r i ( β η ))( x i − µ ˆ w ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)X i ∈ I o λ o √ n ˆ w i h ( R i ( β η ))( X i − µ ˆ w ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) . (3.12)We evaluate each term of (3.12). 
From (3.4), the left hand side of (3.12) isevaluated as n X i =1 λ o √ n ˆ w i ( − h ( r i ( β η )) + h ( r i ( β ∗ ))) ( x i − µ ˆ w ) ⊤ δ β η ≥ c k δ β η k − c k δ β η k − c From (3.1) - (3.3), the right hand side of (3.12) is evaluated as (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 λ o √ n ˆ w i h (cid:18) ξ i λ o √ n (cid:19) ( x i − µ ˆ w ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)X i ∈ I o λ o √ n ˆ w i h ( r i ( β η ))( x i − µ ˆ w ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)X i ∈ I o λ o √ n ˆ w i h ( R i ( β η ))( X i − µ ˆ w ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ c k δ β η k . Consequently, we have c k δ β η k − c k δ β η k − c ≤ c k δ β η k and k δ β k ≤ c + 3 c + p ( c + 3 c ) + 4 c c c ≤ c + 3 c + √ c c c ≤ r . This contradicts k δ β η k = r . Consequently, we have ˆ β ∈ B and k δ β k ≤k ˆ β − β ∗ k ≤ r .
4. Stochastic argument in case of Gaussian design
In this section, we state our main theorem in a stochastic form, assuming randomness in $\{x_i\}_{i=1}^n$ and $\{\xi_i\}_{i=1}^n$. In the following subsection, we confirm that the conditions in Theorem 3.1 are satisfied under the conditions in Theorem 4.1. Let
\[
r_o = \varepsilon\sqrt{\log\frac{1}{\varepsilon}}, \qquad r_d = \sqrt{\frac{d\log d}{n}}, \qquad r'_d = \sqrt{\frac{d}{n}}.
\]

Theorem 4.1.
Consider the optimization problem (2.2). Suppose that $\{x_i\}_{i=1}^n$ is a sequence of i.i.d. random vectors drawn from a sub-Gaussian distribution with unknown mean $\mu$ and identity covariance, and that $\{\xi_i\}_{i=1}^n$ is a sequence of i.i.d. random variables drawn from a distribution whose absolute moment is bounded by $\sigma$. Suppose that $n$ is sufficiently large so that $n = \Omega(d\log d)$ holds and that $\varepsilon$ is sufficiently small. Suppose $\delta$ is sufficiently small.

Let $\lambda_o\sqrt{n} = c_{\max}$ with $c_{\max} = \max\left( m\sigma,\ \sqrt{C_{\mathrm{cov}}}\, m,\ C_{\mathrm{CDG}} \right)$, where $C_{\mathrm{CDG}}$ is defined in Algorithm 2, $m$ is some positive constant such that $\mathbb{E}[((x_i - \mu)^\top v)^4] \le m \left( \mathbb{E}[((x_i - \mu)^\top v)^2] \right)^2$ for any $v \in \mathbb{R}^d$, and $C_{\mathrm{cov}}$ is some positive constant such that
\[
\mathbb{E}\left\| \frac{1}{n}\sum_{i=1}^n (x_i - \mu)(x_i - \mu)^\top \right\|_{\mathrm{op}} \le C_{\mathrm{cov}}.
\]
Let $\hat\beta$ be an optimal solution of (2.2). Then, with probability at least $1 - \delta$, we have
\[
\|\hat\beta - \beta^*\|_2 = O(r_o + r_d).
\]

In the remaining part of this section, we assume that $\{x_i\}_{i=1}^n$ is a sequence of i.i.d. random vectors drawn from a sub-Gaussian distribution with unknown mean $\mu$ and identity covariance, and that $\{\xi_i\}_{i=1}^n$ is a sequence of i.i.d. random variables drawn from a distribution whose absolute moment is bounded by $\sigma$.

For sub-Gaussian design, we use Algorithm 1 of Cheng, Diakonikolas and Ge (2019) as ROBUST-WEIGHT. The weights $\{\hat w_i\}_{i=1}^n$ can be computed by that algorithm from $\{X_i\}_{i=1}^n$ and $\varepsilon$ with polynomial computational complexity, and the algorithm guarantees that $\mu_{\hat w} - \mu$ is close to $0$ in the $\ell_2$ norm with high probability. We briefly introduce Algorithm 1 of Cheng, Diakonikolas and Ge (2019). Let
\[
\Delta_{n,\varepsilon} = \left\{ w \in \mathbb{R}^n : \sum_{i=1}^n w_i = 1 \text{ and } 0 \le w_i \le \frac{1}{(1-\varepsilon)n} \text{ for all } i \right\}. \qquad (4.1)
\]
First, we state the primal-dual SDP used in Algorithm 1 of Cheng, Diakonikolas and Ge (2019). The primal SDP has the following form:
\[
\text{minimize } \lambda_{\max}\left( \sum_{i=1}^n w_i (X_i - \nu)(X_i - \nu)^\top \right) \text{ subject to } w \in \Delta_{n,\varepsilon}, \qquad (4.2)
\]
where $\lambda_{\max}(M)$ is the maximum eigenvalue of $M$ and $\nu \in \mathbb{R}^d$ is a fixed vector. The dual SDP has the following form:
\[
\text{maximize the average of the smallest } (1-\varepsilon) \text{ fraction of } \left\{ (X_i - \nu)^\top M (X_i - \nu) \right\}_{i=1}^n \text{ subject to } M \succeq 0,\ \mathrm{tr}(M) \le 1. \qquad (4.3)
\]
Using the primal-dual SDP above, Algorithm 1 of Cheng, Diakonikolas and Ge (2019) estimates $\mu$ robustly by the following algorithm.

Algorithm 2
Robust Mean Estimation for Known Covariance sub-Gaussian
Require: $\{X_i\}_{i=1}^n \subset \mathbb{R}^d$ and a sufficiently small contamination fraction $\varepsilon > 0$.
Ensure: $\hat w \in \mathbb{R}^n$ such that, with probability at least $1 - \delta$, $\|\mu_{\hat w} - \mu\|_2 \le C_{\mathrm{CDG}}(r_o + r'_d)$, where $C_{\mathrm{CDG}}$ is some positive constant depending on $\delta$, when Lemma 4.1 holds.
1: Let $\nu \in \mathbb{R}^d$ be the coordinate-wise median of $\{X_i\}_{i=1}^n$.
2: For $i = 1$ to $O(\log d)$:
3: Use Proposition 4.1 of Cheng, Diakonikolas and Ge (2019) to compute either (i) a good solution $w \in \mathbb{R}^n$ of the primal SDP (4.2) with parameters $\nu$ and $2\varepsilon$, or (ii) a good solution $M \in \mathbb{R}^{d\times d}$ of the dual SDP (4.3) with parameters $\nu$ and $\varepsilon$.
4: If the objective value of (4.2) is at most $1 + C_{\mathrm{CDG}}(\varepsilon\log(1/\varepsilon) + r'_d)$ (Lemma 4.1), return the weight vector $\hat w = (\hat w_1, \dots, \hat w_n)^\top = w$.
5: Else, move $\nu$ closer to $\mu$ using the top eigenvector of $M$.

First, we state some properties of sub-Gaussian random vectors.
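Before doing so, we note that the sketch below is not Algorithm 2: the actual algorithm relies on the primal-dual SDP machinery and the guarantees of Cheng, Diakonikolas and Ge (2019). The code is only a much simpler spectral reweighting heuristic in the same spirit (repeatedly downweight points with a large component along the top eigenvector of the weighted covariance), included to convey the role that ROBUST-WEIGHT plays in Algorithm 1; the weight update, the termination threshold and all names are our own illustrative choices.

\begin{verbatim}
import numpy as np

def robust_weight_sketch(X, eps, n_iter=50):
    """Toy spectral reweighting; illustrates the goal of ROBUST-WEIGHT, not Algorithm 2."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)
    cap = 1.0 / ((1.0 - eps) * n)                   # keep w inside Delta_{n,eps}
    for _ in range(n_iter):
        mu_w = w @ X
        C = (w[:, None] * (X - mu_w)).T @ (X - mu_w)    # weighted covariance
        eigvals, eigvecs = np.linalg.eigh(C)
        lam, v = eigvals[-1], eigvecs[:, -1]
        if lam <= 1.0 + 10.0 * eps * np.log(1.0 / eps):  # crude analogue of the termination test
            break
        tau = ((X - mu_w) @ v) ** 2                 # outlier scores along the top direction
        w = w * (1.0 - tau / tau.max())             # downweight the most extreme points
        w = np.minimum(w / w.sum(), cap)            # renormalize and respect the weight cap
        w = w / w.sum()
    return w
\end{verbatim}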
Lemma 4.1 (Adopted from Lemmas 4.3 and 4.4 of Diakonikolas et al. (2018)) . Suppose n is sufficiently large so that n = O ( d ) and ε is sufficiently small. Then,for any w ∈ ∆ N,ε , we have (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) X i ∈ I G w i ( x i − µ ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ C g ( r o + r ′ d ) , (4.4) λ max X i ∈ I G w i ( x i − µ )( x i − µ ) ⊤ − I ! ≤ C g ( ε log(1 /ε ) + r ′ d ) , (4.5) where C g is some positive constant depending on δ , with probability at least − δ . Proposition 4.1 (Adopted from Theorem 4 of Koltchinskii and Lounici (2017)) . Assume that (cid:8) x i ∈ R d (cid:9) ni =1 is a sequence with i.i.d. random matrices drawn fromsub-Gaussian distribution with mean µ and assume d/n ≤ . Then, we have E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n X i =1 ( x i − µ )( x i − µ ) ⊤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op ≤ c gcov , (4.6) where c gcov is some positive constant. Let I | m | be a set of index such that the number of the elements of I | m | is m . asai and Fujisawa/Weighted Huber regression Corollary 4.1 (Corollary of Proposition A.3) . Suppose that < ε < / and δ ∈ (0 , / holds. For any vector u = ( u , · · · , u n ) ∈ R n , suppose that any set I | m | such that ≤ m ≤ o , we have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) X i ∈ I | m | u i ( x i − µ ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ C τ s X i ∈ I | m | u i k δ β η k (cid:18) (1 + p log(1 /δ )) + p d log d + r m log nm (cid:19) , (4.7) where C τ is some positive constant, with probability at least − δ . Proposition 4.2 (Bernstein inequality for Huber loss) . Suppose that n is suf-ficiently large so that n = O ( d log d ) . We have n (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X i =1 ( x i − µ ) h (cid:18) ξ i λ o √ n (cid:19)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ ≤ C Bδ r log dn , (4.8) where C Bδ is some positive constant depending on δ , with probability at least − δ . Let { α i } ni =1 be a series of Rademacher random variables. Proposition 4.3.
Suppose that n is sufficiently large so that n = O ( d log d ) .We have n (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X i =1 ( x i − µ ) α i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ C Rδ r ′ d (4.9) with probability at least − δ . Proposition 4.4 (Adopted from exercise 4.6.3 of Vershynin (2018)) . Supposethat n is sufficiently large so that n = O ( d ) . For any vector v ∈ S d − , we have n E " n X i =1 (( x i − µ ) ⊤ v ) ≥ . (4.10) Proposition 4.5.
Let R ( r ) = (cid:8) β ∈ R d | k δ β k = k β − β ∗ k = r (cid:9) ( r ≤ and assume β η ∈ R ( r ) . Assume that (cid:8) x i ∈ R d (cid:9) ni =1 is a sequence with i.i.d.random matrices satisfying E [(( x i − µ ) ⊤ v ) ] ≤ m (cid:0) E [(( x i − µ ) ⊤ v ) ] (cid:1) for any v ∈ S d − , E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n X i =1 ( x i − µ )( x i − µ ) ⊤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op ≤ C cov , k µ − µ ˆ w k ≤ C µ , asai and Fujisawa/Weighted Huber regression where m , C cov and C µ are some positive constants. Assume that { ξ i } ni =1 is a se-quence with i.i.d. random variables drawn from a distribution whose absolute mo-ment is bounded by σ . Suppose λ o √ n ≥ c max = (cid:16) σ m , p C cov m , C µ (cid:17) .Then, with probability at least − δ , we have λ o n X i =1 − h ξ i − ˆ w i n ( x i − µ ˆ w ) ⊤ δ β η λ o √ n ! + h (cid:18) ξ i λ o √ n (cid:19)! ˆ w i √ n ( x i − µ ˆ w ) ⊤ δ β η λ o ≥ a n n X i =1 E (cid:2) (( x i − µ ) ⊤ δ β η ) (cid:3) − a k δ β η k − a + B + C, where a = 18 ,a = λ o √ n E "(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X i =1 α i ( x i − µ ) n (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) + λ o q E [ k ( x i − µ ) k ] p /δ ) + λ o √ n k µ − µ ˆ w k ,a = λ o /δ ) ,B = X i ∈ I G ∩ I ˆ w< + X i ∈ I o (cid:18) − h (cid:18) ξ i λ o √ n − ˆ w i √ n ( x i − µ ˆ w ) ⊤ δ β λ o (cid:19) + h (cid:18) ξ i λ o √ n (cid:19)(cid:19) ˆ w i √ n ( x i − µ ˆ w ) ⊤ δ β η λ o ,C = X i ∈ I G ∩ I ˆ w< + X i ∈ I o (cid:18) − h (cid:18) ξ i λ o √ n − ( x i − µ ˆ w ) ⊤ δ β λ o √ n (cid:19) + h (cid:18) ξ i λ o √ n (cid:19)(cid:19) ( x i − µ ˆ w ) ⊤ δ β λ o √ n . Next, we state some properties obtained by Algorithm 2. We note that theproof of Lemmas 3.1 - 3.3, Proposition 4.1 and Theorem 1.2 of Cheng, Di-akonikolas and Ge (2019) is valid even if we replace the order of δ , δ and β inSection 3 of Cheng, Diakonikolas and Ge (2019) as δ = ε p log(1 /ε ) + r ′ d , δ = ε log(1 /ε ) + r ′ d , β = p ε log(1 /ε ) + r ′ d and consequently, we have Propositions 4.6, 4.7 and 4.8. Proposition 4.6 (Lemma 3.2 of Cheng, Diakonikolas and Ge (2019)) . Supposethat < ε < / , (4.4) and (4.5) hold. We have k µ ˆ w − µ k ≤ C CDG ( r o + r ′ d ) . (4.11) when Algorithm 2 succeeds . Proposition 4.7 (From termination condition of Algorithm 2) . Suppose that < ε < / , (4.4) and (4.5) hold. We have λ max n X i =1 ˆ w i ( X i − ν )( X i − ν ) ⊤ ! ≤ C CDG ( ε log(1 /ε ) + r ′ d ) (4.12) for some vector ν ∈ R d in (4.3) when Algorithm 2 succeeds. asai and Fujisawa/Weighted Huber regression About ν in Proposition 4.7, we have the following Lemma Proposition 4.8 (From termination condition of Algorithm 2 and Lemma 3.1of Cheng, Diakonikolas and Ge (2019)) . Suppose that < ε < / , (4.4) and (4.5) hold. We have k ν − µ k ≤ C CDG ( p ε log(1 /ε ) + r ′ d ) (4.13) for ν in Proposition 4.7 when Algorithm 2 succeeds . Let I w < and I w ≥ be the sets of the index such that w i < n and w i ≥ n ,respectively. Lemma 4.2.
For w i ∈ ∆ n,ε , we have | I w < | ≤ o .Proof. We assume | I w < | > o , and then we derive a contradiction. From theconstraint about w i , we have 0 ≤ w i ≤ − ε ) n and n X i =1 w i = X i ∈ I w< w i + X i ∈ I w ≥ w i ≤ | I < | × n + ( n − | I < | ) × − ε ) n = 2 o × n + ( | I < | − o ) × n + ( n − o ) × − ε ) n + (2 o − | I < | ) × − ε ) n = 2 o × n + ( n − o ) × − ε ) n + ( | I < | − o ) × (cid:18) n − − ε ) n (cid:19) < o × n + ( n − o ) × − ε ) n = 1 − ε ≤ . This is a contradiction to the constraint about { w i } ni =1 , P ni =1 w i = 1. In this section, we confirm (3.1) - (3.4) under the conditions in Theorem 4.1.We note that when (4.4) and (4.5) hold, Algorithm 2 succeeds (Algorithm 2computes { ˆ w i } ni =1 such that (4.11) is satisfied) with probability at least (1 − δ ) from the proof of Theorem 1.2 in Cheng, Diakonikolas and Ge (2019). In lightof this fact, Lemmas 4.1 and 4.2, Corollary 4.1, Propositions 4.1 - 4.8 hold withprobability at least (1 − δ ) under the conditions in Theorem 4.1. We also notethat from Proposition 4.6 and assumptions in Theorem 4.1, we have r o , r d ≤ C µ = 2 C CDG , where C µ is defined in Proposition A.5. Fromthe definition of λ o , conditions assumed in Theorem 4.1 and Proposition 4.1, wesee that Proposition 4.5 holds.In Section 4.2, to omit to state “with probability at least”, we assume thatinequalities in Lemmas 4.1 and 4.2, Corollary 4.1, Propositions 4.1 - 4.8 holdand Algorithm 2 succeeds. asai and Fujisawa/Weighted Huber regression Proposition 4.9 (Confirmation of (3.3)) . We have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)X i ∈ I o ˆ w i h ( R i ( β η ))( X i − µ ˆ w ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ C cond ( r o + r d ) b k δ β η k , where C cond = (cid:16) p C g + 2 p C CDG + √ C CDG + 2 p C CDG C g + 2 √ C CDG (cid:17) . Proof.
From X i − µ ˆ w = ( X i − ν ) + ( ν − µ ) + ( µ − µ ˆ w ), we have X i ∈ I o ˆ w i h ( R i ( β η ))( X i − µ ˆ w ) ⊤ δ β η = X i ∈ I o ˆ w i h ( R i ( β η ))( X i − ν ) ⊤ δ β η + X i ∈ I o ˆ w i h ( R i ( β η ))( ν − µ ) ⊤ δ β η + X i ∈ I o ˆ w i h ( R i ( β η ))( µ − µ ˆ w ) ⊤ δ β η . (4.14)Let M i = ( X i − ν )( X i − ν ) ⊤ . For the first term of the R.H.S of (4.14), wehave (X i ∈ I o ˆ w i h ( R i ( β η ))( X i − ν ) ⊤ δ β η ) a ) ≤ X i ∈ I o ˆ w i X i ∈ I o ˆ w i | ( X i − ν ) ⊤ δ β η | b ) ≤ ε X i ∈ I o ˆ w i | ( X i − ν ) ⊤ δ β η | = 2 εδ ⊤ β η X i ∈ I o ˆ w i M i δ β η = 2 ε δ ⊤ β η n X i =1 ˆ w i M i δ β η − δ ⊤ β η X i ∈ I G ˆ w i M i δ β η ! ≤ ελ max ( M ) k δ β η k − εδ ⊤ β η X i ∈ I G ˆ w i M i δ β η ! (4.15)where (a) follows from ˆ w i = √ ˆ w i √ ˆ w i and 0 ≤ h ( R i ( β η )) ≤
1, (b) follows fromthe constraint of ˆ w i and P i ∈ I o ˆ w i ≤ o (1 − ε ) n ≤ ε . For the last term of R.H.S of asai and Fujisawa/Weighted Huber regression (4.15), we have δ ⊤ β η X i ∈ I G ˆ w i M i δ β η = δ ⊤ β η X i ∈ I G ˆ w i ( x i − ν )( x i − ν ) ⊤ δ β η = δ ⊤ β η X i ∈ I G ˆ w i ( x i − µ − ν + µ )( x i − µ − ν + µ ) ⊤ δ β η = δ ⊤ β η X i ∈ I G ˆ w i ( x i − µ )( x i − µ ) ⊤ δ β η + δ ⊤ β η X i ∈ I G ˆ w i ( µ − ν )( µ − ν ) ⊤ δ β η + 2 δ ⊤ β η X i ∈ I G ˆ w i ( x i − µ )( µ − ν ) ⊤ δ β η ( a ) ≥ (1 − C g ( ε log(1 /ε ) + r ′ d )) k δ β η k + δ ⊤ β η X i ∈ I G ˆ w i ( µ − ν )( µ − ν ) ⊤ δ β η + 2 δ ⊤ β η X i ∈ I G ˆ w i ( x i − µ )( µ − ν ) ⊤ δ β η , (4.16)where (a) follows from (4.5). From (4.15) and (4.16), Proposition 4.7, Lemmas ?? and 4.3, we have12 ε (X i ∈ I o ˆ w i h ( R i ( δ β η ))( X i − ν ) ⊤ δ β η ) ≤ λ max ( M ) k δ β η k − k δ β η k + C g ( ε log(1 /ε ) + r ′ d ) k δ β η k − δ ⊤ β η X i ∈ I G ˆ w i ( µ − ν )( µ − ν ) ⊤ δ β η − δ ⊤ β η X i ∈ I G ˆ w i ( x i − µ )( µ − ν ) ⊤ δ β η ≤ λ max ( M − I ) k δ β η k + C g ( ε log(1 /ε ) + r ′ d ) k δ β η k + (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) δ ⊤ β η X i ∈ I G ˆ w i ( µ − ν )( µ − ν ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + 2 (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) δ ⊤ β η X i ∈ I G ˆ w i ( x i − µ )( µ − ν ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ C CDG ( ε log(1 /ε ) + r ′ d ) k δ β η k + C g ( ε log(1 /ε ) + r ′ d ) k δ β η k + C CDG ( p ε log(1 /ε ) + r d ) k δ β η k + 2 C CDG C g ( p ε log(1 /ε ) + r d ) k δ β η k a ) ≤ C CDG ( ε log(1 /ε ) + r d ) k δ β η k + C g ( ε log(1 /ε ) + r d ) k δ β η k + C CDG ( p ε log(1 /ε ) + r d ) k δ β η k + 2 C CDG C g ( p ε log(1 /ε ) + r d ) k δ β η k , asai and Fujisawa/Weighted Huber regression where (a) follows from r ′ d ≤ r d . From triangular inequality, we have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)X i ∈ I o ˆ w i h ( R i ( δ β η ))( X i − ν ) ⊤ δ β (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ p C CDG ε ( ε log(1 /ε ) + r d ) + q C g ε ( ε log(1 /ε ) + r d ) k δ β η k + q C CDG ε ( p ε log(1 /ε ) + r d ) k δ β η k + q C CDG C g ε ( p ε log(1 /ε ) + r d ) k δ β η k ≤ ( p C g + p C CDG )( r o + √ εr d ) k δ β η k + √ C CDG ( r o + √ εr d ) k δ β η k + 2 p C CDG C g ( r o + √ εr d ) k δ β η k a ) ≤ ( p C g + p C CDG (2 r o + r d ) k δ β η k + √ C CDG ( r o + r d ) k δ β η k + 2 p C CDG C g ( r o + r d ) k δ β η k ≤ (cid:16) p C g + 2 p C CDG + √ C CDG + 2 p C CDG C g (cid:17) ( r o + r d ) k δ β η k , (4.17)where (a) follows from ε < √ ab ≤ a + b for positive number a, b .For the second term of the R.H.S of (4.14), we have (X i ∈ I o ˆ w i h ( R i ( β η ))( ν − µ ) ⊤ δ β η ) ≤ X i ∈ I o ˆ w i X i ∈ I o ˆ w i | ( ν − µ ) ⊤ δ β η | a ) ≤ ε X i ∈ I o ˆ w i | ( ν − µ ) ⊤ δ β η | = 2 εδ ⊤ β η X i ∈ I o ˆ w i ( ν − µ )( ν − µ ) ⊤ δ β η ( b ) ≤ ε k ν − µ k k δ β η k c ) ≤ C CDG ε ( p ε log(1 /ε ) + r ′ d ) k δ β η k ≤ C CDG ε ( p ε log(1 /ε ) + r d ) k δ β η k , where (a) follows from 0 ≤ ˆ w i ≤ − ε ) n ≤ n and P i ∈ I o ˆ w i ≤ ε , (b) follows from P i ∈ I o ˆ w i ≤ P ni =1 ˆ w i = 1 and (c) follows from Proposition 4.8. Consequently,we have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)X i ∈ I o ˆ w i h ( R i ( β η ))( ν − µ ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ √ C CDG √ ε ( p ε log(1 /ε ) + r d ) k δ β η k ≤ √ C CDG ( r o + √ εr d ) k δ β η k ≤ √ C CDG ( r o + r d ) k δ β η k . 
(4.18) asai and Fujisawa/Weighted Huber regression For the last term of the R.H.S of (4.14), we have (X i ∈ I o ˆ w i h ( R i ( β η ))( µ − µ ˆ w ) ⊤ δ β η ) ≤ X i ∈ I o ˆ w i X i ∈ I o ˆ w i | ( µ − µ ˆ w ) ⊤ δ β η | a ) ≤ ε X i ∈ I o ˆ w i | ( µ − µ ˆ w ) ⊤ δ β η | = 2 δ ⊤ β η ε X i ∈ I o ˆ w i ( µ − µ ˆ w )( µ − µ ˆ w ) ⊤ δ β η ≤ δ ⊤ β η X i ∈ I o ˆ w i ( µ − µ ˆ w )( µ − µ ˆ w ) ⊤ δ β η ( b ) ≤ δ ⊤ β η ( µ − µ ˆ w )( µ − µ ˆ w ) ⊤ δ β η = 2 k µ − µ ˆ w k k δ β η k c ) ≤ C CDG ( r o + r ′ d ) k δ β η k ≤ C CDG ( r o + r d ) k δ β η k , where (a) follows from 0 ≤ ˆ w i ≤ − ε ) n ≤ n , (b) follows from P i ∈ I o ˆ w i ≤ P ni =1 ˆ w i = 1, (c) follows from Proposition 4.6 and P i ∈ I o ˆ w i ≤ ε . Consequently,we have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)X i ∈ I o ˆ w i h ( R i ( β η ))( µ − µ ˆ w ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ √ C CDG ( r o + r d ) k δ β η k . (4.19)From (4.17), (4.18) and (4.19), we have X i ∈ I o ˆ w i h ( R i ( β η ))( X i − µ ˆ w ) ⊤ δ β η ≤ (cid:16) p C g + 2 p C CDG + √ C CDG + 2 p C CDG C g + 2 √ C CDG (cid:17) ( r o + r d ) k δ β η k . Lemma 4.3.
We have δ ⊤ β η X i ∈ I G ˆ w i ( x i − µ )( µ − ν ) ⊤ δ β η ≤ C g C CDG ( p ε log(1 /ε ) + r d ) k δ β η k . asai and Fujisawa/Weighted Huber regression Proof. δ ⊤ β η X i ∈ I G ˆ w i ( x i − µ )( µ − ν ) ⊤ δ β η ≤ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) δ ⊤ β η X i ∈ I G ˆ w i ( x i − µ )( µ − ν ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) δ ⊤ β η X i ∈ I G ˆ w i ( x i − µ ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:12)(cid:12) ( µ − ν ) ⊤ δ β η (cid:12)(cid:12) ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) X i ∈ I G ˆ w i ( x i − µ ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) k µ − ν k k δ β k a ) ≤ C g ( ε p log(1 /ε ) + r ′ d ) C CDG ( p ε log(1 /ε ) + r ′ d ) k δ β η k ≤ C g C CDG ( p ε log(1 /ε ) + r ′ d ) k δ β η k ≤ C g C CDG ( p ε log(1 /ε ) + r d ) k δ β η k , where (a) follows from (4.4) and Proposition 4.8. Proposition 4.10 (Confirmation of (3.1)) . We have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 ˆ w i h (cid:18) ξ i λ o √ n (cid:19) ( x i − µ ˆ w ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ C δ ( r o + r d ) k δ β η k , where C δ is some positive constant depending on δ .Proof. From triangular inequality, we have λ o √ n (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 ˆ w i h (cid:18) ξ i λ o √ n (cid:19) ( x i − µ ˆ w ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ( a ) ≤ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 ˆ w i h (cid:18) ξ i λ o √ n (cid:19) ( µ − µ ˆ w ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 ˆ w i h (cid:18) ξ i λ o √ n (cid:19) ( x i − µ ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ( b ) ≤ √ C CDG ( r o + r d ) k δ β η k + (cid:16) C τ p log(1 /δ ) + C Bδ + 2 C τ (cid:17) ( r o + r d ) k δ β η k = (cid:16) √ C CDG + C τ p log(1 /δ ) + C Bδ + 2 C τ (cid:17) ( r o + r d ) k δ β η k = C δ ( r o + r d ) k δ β η k , where (a) follows from the triangular inequality and (b) follows from Lemmas4.4 and 4.5. Lemma 4.4.
We have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 ˆ w i h (cid:18) ξ i λ o √ n (cid:19) ( µ − µ ˆ w ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ √ C CDG ( r o + r d ) k δ β η k . asai and Fujisawa/Weighted Huber regression Proof. ( n X i =1 ˆ w i h (cid:18) ξ i λ o √ n (cid:19) ( µ − µ ˆ w ) ⊤ δ β η ) a ) ≤ n X i =1 ˆ w i n X i =1 ˆ w i | ( µ − µ ˆ w ) ⊤ δ β η | b ) ≤ n X i =1 ˆ w i | ( µ − µ ˆ w ) ⊤ δ β η | = 2 δ ⊤ β η n X i =1 ˆ w i ( µ − µ ˆ w )( µ − µ ˆ w ) ⊤ δ β η ( c ) ≤ k µ − µ ˆ w k k δ β η k d ) ≤ C CDG ( r o + r ′ d ) k δ β η k ≤ C CDG ( r o + r d ) k δ β η k , where (a) follows from ˆ w i = √ ˆ w i √ ˆ w i and 0 ≤ h ( R i ( β η )) ≤
1, (b) followsfrom 0 ≤ ˆ w i ≤ − ε ) n ≤ n and P i ∈ I o ˆ w i ≤ ε , (c) follows from P i ∈ I o ˆ w i ≤ P ni =1 ˆ w i = 1 and (d) follows from Proposition 4.6. Then, we have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 ˆ w i h (cid:18) ξ i λ o √ n (cid:19) ( µ − µ ˆ w ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ √ C CDG ( r o + r d ) k δ β η k . Lemma 4.5.
We have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 ˆ w i h (cid:18) ξ i λ o √ n (cid:19) ( x i − µ ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:16) C τ p log(1 /δ ) + C Bδ + 2 C τ (cid:17) ( r o + r d ) k δ β η k . Proof.
We have n X i =1 ˆ w i h (cid:18) ξ i λ o √ n (cid:19) ( x i − µ ) ⊤ δ β η ( a ) ≤ n max I | n − o | X i ∈ I | n − o | h (cid:18) ξ i λ o √ n (cid:19) ( x i − µ ) ⊤ δ β η = 1 n n X i =1 h (cid:18) ξ i λ o √ n (cid:19) ( x i − µ ) ⊤ δ β η − n min I | o | X i ∈ I | o | h (cid:18) ξ i λ o √ n (cid:19) ( x i − µ ) ⊤ δ β η ≤ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n n X i =1 h (cid:18) ξ i λ o √ n (cid:19) ( x i − µ ) ⊤ δ βη (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i ∈ I o h (cid:18) ξ i λ o √ n (cid:19) ( x i − µ ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) , where (a) follows from the claim (iii) of Lemma 1 of Dalalyan and Minasyan asai and Fujisawa/Weighted Huber regression (2020). From Proposition 4.2, we have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 n h (cid:18) ξ i λ o √ n (cid:19) ( x i − µ ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X i =1 n h (cid:18) ξ i λ o √ n (cid:19) ( x i − µ ) ⊤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ k δ β η k ≤ C Bδ r log dn k δ β η k ≤ C Bδ r d k δ β η k and, we have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)X i ∈ I o h (cid:18) ξ i λ o √ n (cid:19) ( x i − µ ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ( a ) ≤ C τ vuutX i ∈ I o h (cid:18) ξ i λ o √ n (cid:19) k δ β η k (cid:16) (1 + p log(1 /δ )) + p d log d + √ o p log(1 /ε ) (cid:17) ( b ) ≤ C τ √ o k δ β η k (cid:16) (1 + p log(1 /δ )) + p d log d + √ o p log(1 /ε ) (cid:17) , where (a) follows from Corollary 4.1 and (b) follows from − ≤ h (cid:16) ξ i λ o √ n (cid:17) ≤ n X i =1 ˆ w i h (cid:18) ξ i λ o √ n (cid:19) ( x i − µ ) ⊤ δ β η ≤ C Bδ r d + C τ r on (1 + p log(1 /δ )) √ n + r d + √ ε p log(1 /ε ) !! k δ β η k a ) ≤ C τ (1 + p log(1 /δ )) √ n + ( C Bδ + C τ ) r d + C τ r o ! k δ β η k ≤ (cid:18) C τ (1 + p log(1 /δ )) d log dn + ( C Bδ + C τ ) r d + C τ r o (cid:19) k δ β η k b ) ≤ (cid:16) C τ p log(1 /δ ) + C Bδ + 2 C τ (cid:17) ( r o + r d ) k δ β η k , where (a) follows from o/n ≤ r d ≤ Proposition 4.11 (Confirmation of (3.2)) . We have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)X i ∈ I o ˆ w i h ( r i ( β η ))( x i − µ ˆ w ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ C δ ( r o + r d ) k δ β η k , where C δ is some positive constant depending on δ . asai and Fujisawa/Weighted Huber regression Proof.
We have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)X i ∈ I o ˆ w i h ( r i ( β η ))( x i − µ ˆ w ) ⊤ δ β (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ( a ) ≤ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)X i ∈ I o ˆ w i h ( r i ( β η ))( µ − µ ˆ w ) ⊤ δ β (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)X i ∈ I o ˆ w i h ( r i ( β η ))( x i − µ ) ⊤ δ β (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ( b ) ≤ √ C CDG ( r o + r d ) k δ β η k + 2 C τ (cid:16) √ p log(1 /δ ) (cid:17) ( r o + r d ) k δ β η k = (cid:16) C CDG + 2 C τ (cid:16) p log(1 /δ ) (cid:17)(cid:17) ( r o + r d ) k δ β η k = C δ ( r o + r d ) k δ β η k , where (a) follows from the triangular inequality and (b) follows from Lemmas4.6 and 4.7. Lemma 4.6.
We have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)X i ∈ I o ˆ w i h ( r i ( β η ))( x i − µ ) ⊤ δ β (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ √ C CDG ( r o + r d ) k δ β η k . Proof. (X i ∈ I o ˆ w i h (cid:18) ξ i λ o √ n (cid:19) ( µ − µ ˆ w ) ⊤ δ β η ) a ) ≤ X i ∈ I o ˆ w i X i ∈ I o ˆ w i | ( µ − µ ˆ w ) ⊤ δ β η | b ) ≤ ε X i ∈ I o ˆ w i | ( µ − µ ˆ w ) ⊤ δ β η | = 2 εδ ⊤ β η X i ∈ I o ˆ w i ( µ − µ ˆ w )( µ − µ ˆ w ) ⊤ δ β η ≤ δ ⊤ β η X i ∈ I o ˆ w i ( µ − µ ˆ w )( µ − µ ˆ w ) ⊤ δ β η ( c ) ≤ k µ − µ ˆ w k k δ β η k d ) ≤ C CDG ( r o + r ′ d ) k δ β η k d ) ≤ C CDG ( r o + r d ) k δ β η k , where (a) follows from ˆ w i = √ ˆ w i √ ˆ w i and 0 ≤ h ( R i ( β η )) ≤
1, (b) followsfrom 0 ≤ ˆ w i ≤ − ε ) n ≤ n and P i ∈ I o ˆ w i ≤ ε , (c) follows from P i ∈ I o ˆ w i ≤ P ni =1 ˆ w i = 1, (d) follows from Proposition 4.6. Lemma 4.7.
We have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)X i ∈ I o ˆ w i h ( r i ( β η ))( x i − µ ) ⊤ δ β (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ C τ (cid:16) p log(1 /δ ) (cid:17) ( r o + r d ) k δ β η k . asai and Fujisawa/Weighted Huber regression Proof.
We have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)X i ∈ I o ˆ w i h ( r i ( β η ))( x i − µ ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ( a ) ≤ C τ sX i ∈ I o ˆ w i h ( r i ( β η )) (cid:18) (1 + p log(1 /δ )) + p d log d + √ o r log no (cid:19) k δ β η k b ) ≤ C τ √ on (cid:18) (1 + p log(1 /δ )) + p d log d + √ o r log no (cid:19) k δ β η k = 2 C τ (1 + p log(1 /δ )) √ n r on + r on r d + r o ! k δ β η k c ) ≤ C τ (1 + p log(1 /δ )) √ n + r d + r o ! k δ β η k ≤ C τ (1 + p log(1 /δ )) r d log dn + r d + r o ! k δ β η k ≤ C τ (cid:16) p log(1 /δ ) (cid:17) ( r o + r d ) k δ β η k , where (a) follows from Corollary 4.1, (b) follow from 0 ≤ ˆ w i ≤ − ε ) n ≤ n and P i ∈ I o ˆ w i ≤ ε and − ≤ h ( r i ( β η )) ≤ on ≤ Proposition 4.12 (Confirmation of (3.4)) . We have λ o n X i =1 − h ξ i − ˆ w i n ( x i − µ ˆ w ) ⊤ δ β η λ o √ n ! + h (cid:18) ξ i λ o √ n (cid:19)! ˆ w i √ n ( x i − µ ˆ w ) ⊤ δ β η λ o ≥ k δ β η k − λ o √ nC cond ( r o + r d ) k δ β η k − /δ ) λ o , where C cond = 3 √ C CDG + C δ + C δ + C δ and C δ is some positive constantdepending on δ and C δ and C δ are defined in Lemmas 4.8 and 4.9, respectively.Proof. From Proposition A.5, λ o n X i =1 (cid:18) − h (cid:18) ξ i − ˆ w i n ( x i − µ ˆ w ) ⊤ δ β λ o √ n (cid:19) + h (cid:18) ξ i λ o √ n (cid:19)(cid:19) ˆ w i √ n ( x i − µ ˆ w ) ⊤ δ β η λ o ≥ a n E " n X i =1 (( x i − µ ) ⊤ δ β η ) − a k δ β η k − a + B + C, (4.20) asai and Fujisawa/Weighted Huber regression where a = 18 ,a = λ o √ n E "(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X i =1 α i ( x i − µ ) n (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) + λ o q E [ k ( x i − µ ) k ] r δ + λ o √ n k µ − µ ˆ w k ,a = λ o δB = λ o √ n X i ∈ I G ∩ I ˆ w< + X i ∈ I o (cid:18) − h (cid:18) ξ i λ o √ n − ˆ w i √ n ( x i − µ ˆ w ) ⊤ δ β λ o (cid:19) + h (cid:18) ξ i λ o √ n (cid:19)(cid:19) ˆ w i ( x i − µ ˆ w ) ⊤ δ β η C = λ o √ n X i ∈ I G ∩ I ˆ w< + X i ∈ I o (cid:18) − h (cid:18) ξ i λ o √ n − ( x i − µ ˆ w ) ⊤ δ β λ o √ n (cid:19) + h (cid:18) ξ i λ o √ n (cid:19)(cid:19) ( x i − µ ˆ w ) ⊤ δ β η . We evaluate each term of (4.20) and then we prove the proposition. FromProposition 4.4, we have a n E " n X i =1 (( x i − µ ) ⊤ δ β η ) ≥ k δ β η k . From Propositions 4.3 and 4.6, we have a ≤ λ o √ n C Rδ r dn + r δ r dn + C CDG
16 ( r o + r ′ d ) ! ≤ λ o √ n C Rδ r dn + r δ r dn + C CDG
16 ( r o + r d ) ! ≤ λ o √ n (cid:18) ( C Rδ + p /δ )) r d + C CDG
16 ( r o + r d ) (cid:19) ≤ λ o √ nC δ ( r o + r d ) . The terms B and C are bounded above from Lemmas 4.8 and 4.9. Combiningthe bounds, (4.20) can be bounded from below as follows: λ o √ n n X i =1 (cid:18) − h (cid:18) ξ i − ˆ w i n ( x i − µ ˆ w ) ⊤ δ β λ o √ n (cid:19) + h (cid:18) ξ i λ o √ n (cid:19)(cid:19) ˆ w i ( x i − µ ˆ w ) ⊤ δ β η ≥ k δ β η k − λ o √ nC δ ( r o + r d ) − /δ ) λ o − λ o √ n (cid:16) √ C CDG + C δ + C δ (cid:17) ( r o + r d ) k δ β η k ≥ k δ β η k − λ o √ n (cid:16) √ C CDG + C δ + C δ + C δ (cid:17) ( r o + r d ) k δ β η k − /δ ) λ o . The following two lemmas are used in the proof of Proposition 4.12. asai and Fujisawa/Weighted Huber regression Lemma 4.8.
We have X i ∈ I G ∩ I ˆ w< + X i ∈ I o − h ξ i λ o √ n − ˆ w i √ n ( x i − µ ˆ w ) ⊤ δ β η λ o ! + h (cid:18) ξ i λ o √ n (cid:19)! ˆ w i ( x i − µ ˆ w ) ⊤ δ β η ≤ ( √ C CDG + C δ )( r o + r d ) k δ β η k , where C δ is some positive constant depending on δ .Proof. Let h i = − h ξ i λ o √ n − ˆ w i √ n ( x i − µ ˆ w ) ⊤ δ β η λ o ! + h (cid:18) ξ i λ o √ n (cid:19) . We have two properties of { ˆ w i } ni =1 , as described on (4.21) and (4.22). From0 ≤ ˆ w i ≤ n for i ∈ I w < , − ≤ h i ≤ X i ∈ I G ∩ I ˆ w< ( h i ˆ w i ) ≤ X i ∈ I ˆ w< ( h i ˆ w i ) ≤ on , X i ∈ I o ( h i ˆ w i ) ≤ on and we have vuuut X i ∈ I G ∩ I ˆ w< + X i ∈ I o ( h i ˆ w i ) ≤ √ √ on . (4.21)From 0 ≤ ˆ w i ≤ − ε ) n and Lemma 4.2, we have X i ∈ I G ∩ I ˆ w< ˆ w i ≤ X i ∈ I ˆ w< ˆ w i ≤ o (1 − ε ) n , X i ∈ I o ˆ w i ≤ o (1 − ε ) n and from that ε is small constant, we have − ε ) ≤ X i ∈ I G ∩ I ˆ w< + X i ∈ I o ˆ w i ≤ o (1 − ε ) n ≤ ε. (4.22)From x i − µ ˆ w = x i − µ + µ − µ ˆ w , we have X i ∈ I G ∩ I ˆ w< + X i ∈ I o − h ξ i λ o √ n − ˆ w i √ n ( x i − µ ˆ w ) ⊤ δ β η λ o ! + h (cid:18) ξ i λ o √ n (cid:19)! ˆ w i ( x i − µ ˆ w ) ⊤ δ β η = X i ∈ I G ∩ I ˆ w< + X i ∈ I o h i ˆ w i ( x i − µ ) ⊤ δ β η + X i ∈ I G ∩ I ˆ w< + X i ∈ I o h i ˆ w i ( µ − µ ˆ w ) ⊤ δ β η (4.23) asai and Fujisawa/Weighted Huber regression For the first term of the R.H.S of (4.23), we have X i ∈ I G ∩ I ˆ w< + X i ∈ I o h i ˆ w i ( x i − µ ) ⊤ δ β η ( a ) ≤ C τ vuuut X i ∈ I G ∩ I ˆ w< + X i ∈ I o h i ˆ w i (cid:18) (1 + p log(1 /δ )) + p d log d + √ o r log no (cid:19) k δ β η k b ) ≤ √ C τ √ on (cid:18) (1 + p log(1 /δ )) + p d log d + √ o r log no (cid:19) k δ β η k c ) ≤ √ C τ (1 + p log(1 /δ )) √ n + r d + r o ! k δ β η k ≤ √ C τ (1 + p log(1 /δ )) r d log dn + r d + r o ! k δ β η k = √ C τ (cid:16) (2 + p log(1 /δ )) r d + r o (cid:17) k δ β η k ≤ C δ ( r o + r d ) k δ β η k , where (a) follows from Corollary 4.1, (b) follows from (4.21) and (c) follows from o/n ≤ X i ∈ I G ∩ I ˆ w< + X i ∈ I o h i ˆ w i ( µ − µ ˆ w ) ⊤ δ β ≤ X i ∈ I G ∩ I ˆ w< + X i ∈ I o ˆ w i X i ∈ I G ∩ I ˆ w< + X i ∈ I o ˆ w i | ( µ − µ ˆ w ) ⊤ δ β η | a ) ≤ ε X i ∈ I G ∩ I ˆ w< + X i ∈ I o ˆ w i | ( µ − µ ˆ w ) ⊤ δ β η | = 6 εδ ⊤ β η X i ∈ I G ∩ I ˆ w< + X i ∈ I o ˆ w i ( µ − µ ˆ w )( µ − µ ˆ w ) ⊤ δ β η ≤ εδ ⊤ β η n X i =1 ˆ w i ( µ − µ ˆ w )( µ − µ ˆ w ) ⊤ δ β η ( b ) ≤ C CDG ( r o + r d ) k δ β η k , where (a) follows from (4.22) and (b) follows from Proposition 4.6, and we have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) X i ∈ I G ∩ I ˆ w< + X i ∈ I o h i ˆ w i ( µ − µ ˆ w ) ⊤ δ β (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ √ C CDG ( r o + r d ) k δ β η k . Consequently, we have X G ∈ I G ∩ I ˆ w< + X i ∈ I o h i ˆ w i ( x i − µ ˆ w ) ⊤ δ β η ≤ C δ ( r o + r d ) k δ β η k + √ C CDG ( r o + r d ) k δ β η k . asai and Fujisawa/Weighted Huber regression Lemma 4.9.
We have λ o √ n X G ∈ I G ∩ I ˆ w< + X i ∈ I o − h ξ i λ o √ n − ( x i − µ ˆ w ) ⊤ δ β η λ o √ n ! + h (cid:18) ξ i λ o √ n (cid:19)! ( x i − µ ˆ w ) ⊤ δ β η ≤ λ o √ n (cid:16) √ C CDG + C δ (cid:17) ( r o + r d ) k δ β η k , where C δ is some positive constant depending on δ . Because the proof of Lemma 4.9 is almost the same of the proof of Lemma4.8, we omit the proof of Lemma 4.9.
From Proposition 4.9, 4.10, 4.11 and 4.12, conditions (3.1), (3.2), (3.3) and (3.4)are satisfied by c = λ o √ n (cid:16) C δ + C δ + 2 √ C CDG + C cond (cid:17) ( r o + r d ) ,c = 116 ,c = λ o √ nC cond ( r o + r d ) ,c = 5 λ o log(1 /δ ) = 5 c n with probability at least (1 − δ ) .We note that, from the definition of λ o and Proposition 4.1, λ o √ n = c max ,which is a constant with C cov = c gcov and C cond and C cond are also constant.Consequently, we have6 c + c + √ c c c ≤ λ o √ n (cid:18) C δ + C δ + 2 √ C CDG + C cond + C cond (cid:19) ( r o + r d ) + c max r
516 1 √ n ! ≤ λ o √ n (cid:18) C δ + C δ + 2 √ C CDG + C cond + C cond (cid:19) ( r o + r d ) + c max r r d ! ≤ λ o √ n (cid:18) C δ + C δ + 2 √ C CDG + C cond + C cond (cid:19) + c max r ! ( r o + r d ) . The proof of Theorem 4.1 is complete.
References
Bakshi, A. and Prasad, A. (2020). Robust linear regression: Optimal rates in polynomial time. arXiv preprint arXiv:2007.01394.
Bellec, P. C. (2019). Localized Gaussian width of M-convex hulls with applications to Lasso and convex aggregation. Bernoulli.
Boucheron, S., Lugosi, G. and Massart, P. (2013). Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press.
Chen, M., Gao, C. and Ren, Z. (2018). Robust covariance and scatter matrix estimation under Huber's contamination model. The Annals of Statistics.
Cheng, Y., Diakonikolas, I. and Ge, R. (2019). High-dimensional robust mean estimation in nearly-linear time. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms.
Cheng, Y., Diakonikolas, I., Ge, R. and Woodruff, D. (2019). Faster algorithms for high-dimensional robust covariance estimation. arXiv preprint arXiv:1906.04661.
Cheng, Y., Diakonikolas, I., Ge, R. and Soltanolkotabi, M. (2020). High-dimensional robust mean estimation via gradient descent. In Proceedings of the 37th International Conference on Machine Learning. Proceedings of Machine Learning Research.
Cherapanamjeri, Y., Aras, E., Tripuraneni, N., Jordan, M. I., Flammarion, N. and Bartlett, P. L. (2020). Optimal robust linear regression in nearly linear time. arXiv preprint arXiv:2007.08137.
Dalalyan, A. S. and Minasyan, A. (2020). All-in-one robust estimator of the Gaussian mean. arXiv preprint arXiv:2002.01432.
Dalalyan, A. and Thompson, P. (2019). Outlier-robust estimation of a sparse linear model using ℓ1-penalized Huber's M-estimator. In Advances in Neural Information Processing Systems 32, 13188-13198. Curran Associates, Inc.
Diakonikolas, I. and Kane, D. M. (2019). Recent advances in algorithmic high-dimensional robust statistics. arXiv preprint arXiv:1911.05911.
Diakonikolas, I., Kane, D. M. and Pensia, A. (2020). Outlier robust mean estimation with subgaussian rates via stability. arXiv preprint arXiv:2007.15618.
Diakonikolas, I., Kong, W. and Stewart, A. (2019). Efficient algorithms and lower bounds for robust linear regression. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms.
Diakonikolas, I., Kamath, G., Kane, D. M., Li, J., Moitra, A. and Stewart, A. (2017). Being robust (in high dimensions) can be practical. In Proceedings of the 34th International Conference on Machine Learning, Volume 70.
Diakonikolas, I., Kamath, G., Kane, D. M., Li, J., Moitra, A. and Stewart, A. (2018). Robustly learning a Gaussian: Getting optimal error, efficiently. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms.
Diakonikolas, I., Kamath, G., Kane, D., Li, J., Moitra, A. and Stewart, A. (2019). Robust estimators in high dimensions without the computational intractability. SIAM Journal on Computing.
Dong, Y., Hopkins, S. and Li, J. (2019). Quantum entropy scoring for fast robust mean estimation and improved outlier detection. In Advances in Neural Information Processing Systems.
Elsener, A. and van de Geer, S. (2018). Sharp oracle inequalities for stationary points of nonconvex penalized M-estimators. IEEE Transactions on Information Theory.
Fan, J., Liu, H., Sun, Q. and Zhang, T. (2018). I-LAMM for sparse learning: Simultaneous control of algorithmic complexity and statistical error. Annals of Statistics.
Gao, C. (2020). Robust regression via multivariate regression depth. Bernoulli.
Hopkins, S., Li, J. and Zhang, F. (2020). Robust and heavy-tailed mean estimation made simple, via regret minimization. Advances in Neural Information Processing Systems.
Huber, P. J. (1981). Robust Statistics. John Wiley & Sons.
Jerrum, M. R., Valiant, L. G. and Vazirani, V. V. (1986). Random generation of combinatorial structures from a uniform distribution. Theoretical Computer Science.
Karmalkar, S. and Price, E. (2018). Compressed sensing with adversarial sparse noise via l1 regression. arXiv preprint arXiv:1809.08055.
Koltchinskii, V. and Lounici, K. (2017). Concentration inequalities and moment bounds for sample covariance operators. Bernoulli.
Kothari, P. K., Steinhardt, J. and Steurer, D. (2018). Robust moment estimation and improved clustering via sum of squares. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing.
Lai, K. A., Rao, A. B. and Vempala, S. (2016). Agnostic estimation of mean and covariance. In Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on.
Laurent, B. and Massart, P. (2000). Adaptive estimation of a quadratic functional by model selection. The Annals of Statistics.
Loh, P.-L. (2017). Statistical consistency and asymptotic normality for high-dimensional robust M-estimators. The Annals of Statistics.
Lugosi, G. and Mendelson, S. (2019). Sub-Gaussian estimators of the mean of a random vector. The Annals of Statistics.
Lugosi, G. and Mendelson, S. (2020). Risk minimization by median-of-means tournaments. J. Eur. Math. Soc.
Massart, P. (2000). About the constants in Talagrand's concentration inequalities for empirical processes. The Annals of Probability.
Minsker, S. and Wei, X. (2020). Robust modifications of U-statistics and applications to covariance estimation problems. Bernoulli.
Mizera, I. et al. (2002). On depth and deep points: a calculus. The Annals of Statistics.
Nemirovsky, A. S. and Yudin, D. B. (1983). Problem Complexity and Method Efficiency in Optimization.
Pensia, A., Jog, V. and Loh, P.-L. (2020). Robust regression with covariate filtering: Heavy tails and adversarial contamination. arXiv preprint arXiv:2009.12976.
Pisier, G. (2016). Subgaussian sequences in probability and Fourier analysis. arXiv preprint arXiv:1607.01053.
Prasad, A., Suggala, A. S., Balakrishnan, S., Ravikumar, P. et al. (2020). Robust estimation via robust gradient estimation. Journal of the Royal Statistical Society Series B.
Rivasplata, O. (2012). Subgaussian random variables: An expository note.
She, Y. and Owen, A. B. (2011). Outlier detection using nonconvex penalized regression. Journal of the American Statistical Association.
Tukey, J. W. (1975). Mathematics and the picturing of data. In Proceedings of the International Congress of Mathematicians, Vancouver, 1975.
Vershynin, R. (2010). Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027.
Vershynin, R. (2018). High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press.

Appendix A: Proofs of Propositions and Lemmas in Section 4
In this section, we assume that $\{x_i\}_{i=1}^n$ is a sequence of i.i.d. random vectors drawn from a sub-Gaussian distribution with mean $\mu$ and identity covariance, except in Proposition A.5. We also assume that $\{\xi_i\}_{i=1}^n$ is a sequence of i.i.d. random variables drawn from a distribution whose absolute moment is bounded by $\sigma$. The proof of Lemma 4.1 is almost the same as the proof of Lemma 4.3 of Diakonikolas et al. (2018), and we omit the proof of (4.4), as in Diakonikolas et al. (2018), because the proof of (4.4) is almost the same as that of (4.5). Before the proof of Lemma 4.1, we introduce the following concentration inequality.

Lemma A.1.
For some positive constants
A, B , we have (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n X i =1 x i x ⊤ i − I (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ t for probability at least − (cid:0) Ad − Bn min( t, t ) (cid:1) . Lemma A.2 (Lemma 4.1 in Section 4) . Suppose n is sufficiently large so that n = O ( d ) and ε is sufficiently small. Then, for w ∈ ∆ N,ε , we have λ max X i ∈ I G w i ( x i − µ )( x i − µ ) ⊤ − I ! ≤ C g ( ε log(1 /ε ) + r ′ d ) , (A.1) where C g is some positive constant depending on δ , with probability at least − δ .Proof. For any J ⊂ { , · · · , n } , Let w J ∈ R n be the vector which is given by w Ji = 1 / | J | for i ∈ J and w Ji = 0 otherwise. By convexity, it is sufficient to show asai and Fujisawa/Weighted Huber regression that for any J such that | J | = (1 − ε ) n , P " n X i =1 w Ji x i x ⊤ i − (1 − ε ) I ≥ C g ( ε log(1 /ε ) + r ′ d ) ≤ δ. For any fixed w J , we have n X i =1 w Ji x i x ⊤ i − I = 1(1 − ε ) n n X i =1 x i x ⊤ i − − ε I − − ε ) n X i/ ∈ J x i x ⊤ i − (cid:18) − ε − (cid:19) I ! . From triangular inequality, (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X i =1 w Ji x i x ⊤ i − (1 − ε ) I (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) − ε ) n n X i =1 x i x ⊤ i − − ε I (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) + (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) − ε ) n X i/ ∈ J x i x ⊤ i − (cid:18) − ε − (cid:19) I (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) . From Lemma A.1, we have (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) − ε ) n n X i =1 x i x ⊤ i − − ε I (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ C g r ′ d with probability at least 1 − δ for sufficiently large n so that (log(2 /δ ) + log 4 + Ad ) /Bn < J ⊂ { , · · · , n } such that | J | = (1 − ε ) n and sufficiently large positiveconstant C g , let E ( J ) be the event that (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) − ε ) n X i/ ∈ J x i x ⊤ i − (cid:18) − ε − (cid:19) I (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) > C g ε log(1 /ε ) ⇒ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) εn X i/ ∈ J x i x ⊤ i − I (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) > − εε C g ε log(1 /ε ) . For sufficiently small ε , we have − εε ε log(1 /ε ) = Ω(log(1 /ε )) > (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) εn X i/ ∈ J x i x ⊤ i − I (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) > − εε C g ε log(1 /ε ) . with probability at most 4 exp( Ad − Bεn − εε ε log(1 /ε )). Let B ( ε ) be the binaryentropy function. We have P (cid:2)(cid:0) ∪ J : | J | =(1 − ε ) n E ( J ) (cid:1) c (cid:3) ( a ) ≤ (cid:18) log (cid:18) nεn (cid:19) + Ad − Bεn − εε C g ε log(1 /ε ) (cid:19) ( b ) ≤ (cid:18) nB ( ε ) + Ad − Bεn − εε C g ε log(1 /ε ) (cid:19) ( c ) ≤ (cid:18) εn (cid:18) O (log(1 /ε ) − B − εε C g ε log(1 /ε ) (cid:19) + Ad (cid:19) ( d ) ≤ − εn/ Ad ) ( e ) ≤ δ/ , asai and Fujisawa/Weighted Huber regression where (a) follows by a union bound over all sets J of size (1 − ε ) n , (b) followsfrom log (cid:0) nεn (cid:1) ≤ εH ( ε ), (c) follows from ε is sufficiently small because H ( ε ) = O ( ε log(1 /ε )) as ε →
0, (d) follows from C g is sufficiently large constant, (e)follows from n is sufficiently large.Here, we give a Bernstein concentration inequality. Theorem A.1 (Bernstein concentration inequality) . Let { W i } ni =1 be a sequencewith i.i.d random variables. We assume that n X i =1 E[ W i ] ≤ v, n X i =1 E[( W i ) k + ] ≤ k !2 vc k − for i = 1 , · · · n and for k ∈ N such that k ≥ . Then, we have P "(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 ( W i − E[ W i ]) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ √ vt + ct ≥ − e − t for any t > . Let C g = C g + C g and the proof is complete. Proposition A.1 (Proposition 4.2 in Section 4) . Suppose that n is sufficientlylarge so that n = ( d log d ) . We have n (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X i =1 ( x i − µ ) h (cid:18) ξ i λ o √ n (cid:19)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ ≤ C Bδ r log dn , where C Bδ is some positive constant depending on δ , with probability at least − δ .Proof. We have E h ( x i − µ ) j h (cid:16) ξ i λ o √ n (cid:17)i = 0 and we see n X i =1 E ( x i − µ ) j h (cid:16) ξ i λ o √ n (cid:17) n = 1 n n X i =1 E[( x i − µ ) j ]E " h (cid:18) ξ j λ o √ n (cid:19) ≤ n n X i =1 E[( x i − µ ) j ] ≤ n . From Proposition 3.2. of Rivasplata (2012), we can show that the absolute k ( ≥
3) th moment of ( x − µ ) j is bounded above, as follows: n X i =1 E (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ( x i − µ ) j h (cid:16) ξ i λ o √ n (cid:17) n (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) k ≤ n k n X i =1 E[ | ( x − µ ) j | k ] ≤ k !2 (cid:18) n (cid:19) (cid:18) n (cid:19) k − . From Theorem A.1 with t = log( d/δ ), v = c = 1 /n , we haveP (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 ( x i − µ ) j h (cid:16) ξ i λ o √ n (cid:17) n (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ r d/δ ) n + log( d/δ ) n ≥ − δd asai and Fujisawa/Weighted Huber regression and for sufficiently large constant C Bδ depending on δ , we haveP (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 ( x i − µ ) j h (cid:16) ξ i λ o √ n (cid:17) n (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ C Bδ r log dn ≥ − δd . Hence, P (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X i =1 ( x i − µ ) h (cid:16) ξ i λ o √ n (cid:17) n (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ ≤ C Bδ r log dn = P sup j (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 ( x i − µ ) j h (cid:16) ξ i λ o √ n (cid:17) n (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ C Bδ r log dn = 1 − P sup j (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 ( x i − µ ) j h (cid:16) ξ i λ o √ n (cid:17) n (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) > C Bδ r log dn = 1 − P [ j (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 ( x i − µ ) j h (cid:16) ξ i λ o √ n (cid:17) n (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) > C Bδ r log dn ≥ − d X j =1 P (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 ( x i − µ ) j h (cid:16) ξ i λ o √ n (cid:17) n (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) > C Bδ r log dn ≥ − ( δ/d ) d = 1 − δ. Proposition A.2 (Proposition 4.3 in Section 4) . Suppose that n is sufficientlylarge so that n = ( d log d ) . We have n (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X i =1 ( x i − µ ) α i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ C Rδ r dn , where C Rδ is some positive constant depending on δ , with probability at least − δ .Proof. From α i = 1, we note that k P ni =1 ( x i − µ ) k is χ -square distributionwith nd degree of freedom and from Lemma Laurent and Massart (2000), wehave (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X i =1 ( x i − µ ) α i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ nd + 2 p nd log(1 /δ ) + 2 log(1 /δ )with probability at least (1 − δ ). For sufficiently large constant C Rδ dependingon δ , we have 1 n (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X i =1 ( x i − µ ) α i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ C Rδ r dn asai and Fujisawa/Weighted Huber regression with probability at least 1 − δ .We omit the proof of Proposition 4.3 because the proof is almost the sameas Proposition A.1.The following Proposition in used to prove Corollary 4.1. Proposition A.3 (Adopted from Proposition 4 of Dalalyan and Thompson(2019)) . Suppose that < ε < / holds. For m -sparse vector u such that ≤ m ≤ o and δ ∈ (0 , / , we have n X i =1 u i ( x i − µ ) ⊤ v ≤ C τ k u k k v k (cid:16) (1 + p log(1 /δ )) + p log d + p m log( n/m ) (cid:17) , where C τ is some positive constant, with probability at least (1 − δ ) .Proof. Let X = ( x − µ ) ⊤ ...( x n − µ ) ⊤ Suppose that { z i } ni =1 is a sequence with i.i.d. random vectors drawn fromGaussian distribution with mean 0 and identity covariance and Z = z ⊤ ... 
z ⊤ n and To prove Proposition 4 of Dalalyan and Thompson (2019), Lemma 4 ofDalalyan and Thompson (2019) is used and the lemma is the following: whensup [ u,v ] ∈ V u ⊤ Zv ≤ G ( V ) + G ( V ) + p log(1 /δ ) , (A.2)with probability at least 1 − δ , where G ( · ) is the Gaussian width, V = S d − , V = S n − and V = V × V .To weaken the Gaussian assumption to sub-Gaussian, we confirm that whetherthe following two properties of Gaussian use in the proof of Lemma 4 of Dalalyanand Thompson (2019) also hold for sub-Gaussian.(i) Gordon’s inequality,(ii) Gaussian concentration.For (i), we can use Theorem 4.2 of Pisier (2016) instead of Gordon’s inequality.For (ii), we can use Theorem 8.5.5 of Vershynin (2018) instead of Gaussianconcentration. To compute the Talagrand’s γ -Functional of V , which is definedby Definition 8.5.1 in Vershynin (2018), we can use Corollary 8.6.2. of Vershynin asai and Fujisawa/Weighted Huber regression (2018). From the discussion above, when { x i } ni =1 is a sequence with i.i.d. randomvectors drawn from sub-Gaussian distribution with identity covariance, we havesup [ u,v ] ∈ V n X i =1 u i ( x i − µ ) ⊤ v = sup [ u,v ] ∈ V u ⊤ Xv ≤ τ (cid:16) G ( V ) + G ( V ) + p log(1 /δ ) (cid:17) , (A.3)where τ is some positive constant, with probability at least 1 − δ .From Lemma 4 of Dalalyan and Thompson (2019) and (A.3), we have n X i =1 u i ( x i − µ ) ⊤ v ≤ τ k u k k v k (cid:16) √ . p log(81 /δ )) + 1 . p d (cid:17) + 1 . τ k v k G ( k u k B n ∩ | u k B n )(A.4)with probability at least (1 − δ ). Further more, we have G ( k u k B n ∩ | u k B n ) ≤ p n k u k a ) ≤ q max (1 , log(8 en k u k / k u k )) ( b ) ≤ √ e √ m k u k p n/m )where (a) follows from Proposition 1 of Bellec (2019) and (b) follows from Re-mark 4 of Dalalyan and Thompson (2019). We note that Proposition 1 of Bellec(2019) is proved for Gaussian, however, to replace Gaussian to sub-Gaussian inthe proof of (4) in Bellec (2019), we can confirm that (a) holds.Let η m = q n/m )log( n/m ) and q n/m )log( n/m ) ≤ C η , where C η is a numericalconstant because 1 ≤ m ≤ o and 0 < ε < / n X i =1 u i ( x i − µ ) ⊤ v ≤ τ k u k k v k (cid:16) √ . p log(81 /δ )) + 1 . p d + 4 . √ e √ m p n/m ) (cid:17) = τ k u k k v k (cid:16) √ . p log(81 /δ )) + 1 . p d + 4 . η m √ e p m log( n/m ) (cid:17) ≤ τ k u k k v k (cid:16) √ . p log(81 /δ )) + 1 . p d + 4 . C η √ e p m log( n/m ) (cid:17) ≤ C τ k u k k v k (cid:16) (1 + p log(1 /δ )) + p log d + p m log( n/m ) (cid:17) (A.5)with probability at least (1 − δ ), where C τ is some positive constant. Corollary A.1 (Corollary 4.1 in Section 4) . Suppose that < ε < / holds.For any vector u = ( u , · · · , u n ) ∈ R n and any set I | m | such that ≤ m ≤ o and δ ∈ (0 , / , we have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) X i ∈ I | m | u i ( x i − µ ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ C τ s X i ∈ I | m | u i k δ β η k (cid:18) (1 + p log(1 /δ )) + p d log d + √ m r log nm (cid:19) , where C τ is some positive constant, with probability at least (1 − δ ) . asai and Fujisawa/Weighted Huber regression Proof.
Let u ′ = ( u ′ , · · · , u ′ n ) be a m -sparse vector such that (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) X i ∈ I | m | u i ( x i − µ ) ⊤ β β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 u ′ i ( x i − µ ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) . From Proposition A.3, we have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 u ′ i ( x i − µ ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ C τ k u ′ k k δ β η k (cid:18) (1 + p log(1 /δ )) + p d log d + √ m r log nm (cid:19) = C τ s X i ∈ I | m | u i k δ β η k (cid:18) (1 + p log(1 /δ )) + p d log d + √ m r log nm (cid:19) . Proposition A.4 (Proposition 4.4 in Section 4) . Suppose that n is sufficientlylarge. For any vector v ∈ S d − , we have n E " n X i =1 (( x i − µ ) ⊤ v ) ≥ . Proof.
From Exercise 4.6.3 of Vershynin (2018), we have E vuut n X i =1 (( x i − µ ) ⊤ v ) ≥ √ n − C G √ d. where C G is some positive constant. From Jensen’s inequality, we have vuut E " n X i =1 (( x i − µ ) ⊤ v ) ≥ √ n − C G √ d. For sufficiently large n , we have √ n − C G √ d > n E " n X i =1 (( x i − µ ) ⊤ v ) ≥ n ( √ n − C G √ d ) = 1 − C G r dn + C G dn , where C G is some positive constant. Consequently, for sufficiently large n , wehave 1 n E " n X i =1 (( x i − µ ) ⊤ v ) ≥ − C G r dn + C G dn ≥ . The following Lemma is used to prove Proposition 4.5. asai and Fujisawa/Weighted Huber regression Lemma A.3.
For differentiable function f ( x ) , we denote its derivative f ′ ( x ) .For any differentiable and convex function f ( x ) , we have ( f ′ ( a ) − f ′ ( b ))( a − b ) ≥ . Proof.
From the definition of the convexity, we have f ( a ) − f ( b ) ≥ f ′ ( b )( a − b ) and f ( b ) − f ( a ) ≥ f ′ ( a )( b − a ) . From the above inequalities, we have0 ≥ f ′ ( b )( a − b ) + f ′ ( a )( b − a ) = ( f ′ ( b ) − f ′ ( a ))( a − b ) ⇒ ≤ ( f ′ ( a ) − f ′ ( b ))( a − b ) . Proposition A.5 (Proposition 4.5 in Section 4) . Let R ( r ) = (cid:8) β ∈ R d | k δ β k = k β − β ∗ k = r (cid:9) ( r ≤ and assume β η ∈ R ( r ) . Assume that (cid:8) x i ∈ R d (cid:9) ni =1 is a sequence with i.i.d.random matrices satisfying E [(( x i − µ ) ⊤ v ) ] ≤ m (cid:0) E [(( x i − µ ) ⊤ v ) ] (cid:1) for any v ∈ S d − , (A.6) E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n X i =1 ( x i − µ )( x i − µ ) ⊤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op ≤ C cov , (A.7) k µ − µ ˆ w k ≤ C µ , (A.8) where m , C cov and C µ are some positive constants. Suppose λ o √ n ≥ c max = (cid:16) σ m , p C cov m , C µ (cid:17) .Then, with probability at least − δ , we have λ o n X i =1 − h ξ i − ˆ w i n ( x i − µ ˆ w ) ⊤ δ β η λ o √ n ! + h (cid:18) ξ i λ o √ n (cid:19)! ˆ w i √ n ( x i − µ ˆ w ) ⊤ δ β η λ o ≥ a n n X i =1 E (cid:2) (( x i − µ ) ⊤ δ β η ) (cid:3) − a k δ β η k − a + B + C, (A.9) asai and Fujisawa/Weighted Huber regression where a = 18 ,a = λ o √ n E "(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X i =1 α i ( x i − µ ) n (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) + λ o q E [ k ( x i − µ ) k ] p /δ ) + λ o √ n k µ − µ ˆ w k ,a = λ o /δ ) ,B = X i ∈ I G ∩ I ˆ w< + X i ∈ I o (cid:18) − h (cid:18) ξ i λ o √ n − ˆ w i √ n ( x i − µ ˆ w ) ⊤ δ β λ o (cid:19) + h (cid:18) ξ i λ o √ n (cid:19)(cid:19) ˆ w i √ n ( x i − µ ˆ w ) ⊤ δ β η λ o ,C = X i ∈ I G ∩ I ˆ w< + X i ∈ I o (cid:18) − h (cid:18) ξ i λ o √ n − ( x i − µ ˆ w ) ⊤ δ β λ o √ n (cid:19) + h (cid:18) ξ i λ o √ n (cid:19)(cid:19) ( x i − µ ˆ w ) ⊤ δ β λ o √ n . Proof.
Let u ′ i = ( x i − µ ) ⊤ δ β η λ o √ n , u ˆ w i = ˆ w i √ n ( x i − µ ˆ w ) ⊤ δ β η λ o , u ′ ˆ w i = ( x i − µ ˆ w ) ⊤ δ β η λ o √ n , v i = ξ i λ o √ n . The left-hand side of (A.9) divided by λ o can be expressed as n X i =1 − h ξ i − ˆ w i n ( x i − µ ˆ w ) ⊤ δ β η λ o √ n ! + h (cid:18) ξ i λ o √ n (cid:19)! ˆ w i √ n ( x i − µ ˆ w ) ⊤ δ β η λ o = X i ∈ I G ∩ I ˆ w ≥ + X i ∈ I G ∩ I ˆ w< + X i ∈ I o ( − h ( v i − u ˆ w i ) + h ( v i )) u ˆ w i = A + B − C, where A = X i ∈ I G ∩ I ˆ w ≥ ( − h ( v i − u ˆ w i ) + h ( v i )) u ˆ w i + X i ∈ I G ∩ I ˆ w< + X i ∈ I o (cid:0) − h (cid:0) v i − u ′ ˆ w i (cid:1) + h ( v i ) (cid:1) u ′ ˆ w i ,B = X i ∈ I G ∩ I ˆ w< + X i ∈ I o ( − h ( v i − u ˆ w i ) + h ( v i )) u ˆ w i ,C = X i ∈ I G ∩ I ˆ w< + X i ∈ I o (cid:0) − h (cid:0) v i − u ′ ˆ w i (cid:1) + h ( v i ) (cid:1) u ′ ˆ w i . Let I E ˆ wi , I E ′ ˆ wi , I E ′≥ i , I E ′
132 + 12 λ o √ n k µ ˆ w − µ k k δ β η k c ) ≤
132 + C µ λ o √ n ( d ) ≤ ≤ , where (a) follows from triangular inequality, (b) follows from i ∈ E ′≥ i andH¨older’s inequality, (c) follows from (A.8) and k δ β η k ≤ λ o . From similar argument, we have I E ′ ˆ wi ⊃ I E ′
132 + 12 λ o √ n k µ ˆ w − µ k k δ β η k (cid:19) ( d ) ≤ (cid:18)
132 + C µ λ o √ n (cid:19) ( e ) ≤ , asai and Fujisawa/Weighted Huber regression where (a) follows from triangular inequality, (b) follows from ˆ w i ≤ − ε ) n ≤ i ∈ E ′≥ i and H¨older’s inequality, (d) follows from (A.8) and k δ β η k ≤ λ o .We have X i ∈ I G ∩ I w ≥ ( − h ( v i − u ˆ w i ) + h ( v i )) u ˆ w i ≥ X i ∈ I G ∩ I ˆ w ≥ ( − h ( v i − u ˆ w i ) + h ( v i )) u ˆ w i I E ˆ wi ( a ) = X i ∈ I G ∩ I ˆ w ≥ u w i I E ˆ wi ( b ) ≥ X i ∈ I G ∩ I ˆ w ≥ u ′ w i I E ˆ wi ( c ) ≥ X i ∈ I G ∩ I ˆ w ≥ u ′ w i I E ′ i ≥ , (A.12)where (a) follows from the definition of I E ˆ wi and (b) follows from i ∈ I G ∩ I ˆ w ≥ ,(c) follows from (A.10) and X i ∈ I G ∩ I ˆ w< + X i ∈ I o (cid:0) − h (cid:0) v i − u ′ ˆ w i (cid:1) + h ( v i ) (cid:1) u ′ ˆ w i ≥ X i ∈ I G ∩ I ˆ w< + X i ∈ I o (cid:0) − h (cid:0) v i − u ′ ˆ w i (cid:1) + h ( v i ) (cid:1) u ′ ˆ w i I E ′ ˆ wi ( a ) = X i ∈ I G ∩ I ˆ w< + X i ∈ I o u ′ w i I E ′ ˆ wi ( b ) ≥ X i ∈ I G ∩ I ˆ w< + X i ∈ I o u ′ w i I E ′ i < , (A.13)where (a) follows from the definition of I E ′ ˆ wi and (b) follows from (A.11). Con-sequently, from (A.12) and (A.13), we have A ≥ n X i =1 u ′ w i I E ′ i . From the convexity of the Huber loss and Lemma A.3, we have X i ∈ I G ∩ I w ≥ ( − h ( v i − u ˆ w i ) + h ( v i )) u ˆ w i ≥ X i ∈ I G ∩ I ˆ w ≥ ( − h ( v i − u ˆ w i ) + h ( v i )) u ˆ w i I E ˆ wi , (A.14) X i ∈ I G ∩ I ˆ w< + X i ∈ I o (cid:0) − h (cid:0) v i − u ′ ˆ w i (cid:1) + h ( v i ) (cid:1) u ′ ˆ w i ≥ X i ∈ I G ∩ I ˆ w< + X i ∈ I o (cid:0) − h (cid:0) v i − u ′ ˆ w i (cid:1) + h ( v i ) (cid:1) u ′ ˆ w i I E ′ ˆ wi . (A.15) asai and Fujisawa/Weighted Huber regression and we have u ′ w i I E ′ i = ( x i − µ ˆ w ) ⊤ δ β η λ o √ n ! I E ′ i = 14 λ o n (cid:0) ( x i − µ ˆ w ) ⊤ δ β η (cid:1) I E ′ i = 14 λ o n (cid:0) ( x i − µ + µ − µ ˆ w ) ⊤ δ β η (cid:1) I E ′ i ≥ λ o n (cid:16)(cid:0) ( x i − µ ) ⊤ δ β η (cid:1) + (cid:0) ( µ − µ ˆ w ) ⊤ δ β η (cid:1) − | ( x i − µ ) ⊤ δ β η ( µ − µ ˆ w ) ⊤ δ β η | (cid:17) I E ′ i ≥ λ o n (cid:16)(cid:0) ( x i − µ ) ⊤ δ β η (cid:1) − | ( x i − µ ) ⊤ δ β η ( µ − µ ˆ w ) ⊤ δ β η | (cid:17) I E ′ i ( a ) ≥ λ o n (cid:16)(cid:0) ( x i − µ ) ⊤ δ β η (cid:1) − | ( x i − µ ) ⊤ δ β η |k µ − µ ˆ w k k δ β η k (cid:17) I E ′ i ( b ) ≥ λ o n (cid:18)(cid:0) ( x i − µ ) ⊤ δ β η (cid:1) − λ o √ n k µ − µ ˆ w k k δ β η k (cid:19) I E ′ i (A.16)where (a) follows from H¨older’s inequality and (b) follows from the definition ofI E ′ i . From (A.16), we have A ≥ n X i =1 u ′ w i I E ′ i ≥ n X i =1 λ o n (cid:18)(cid:0) ( x i − µ ) ⊤ δ β η (cid:1) − λ o √ n k µ − µ ˆ w k k δ β η k (cid:19) I E ′ i ≥ n X i =1 (cid:0) ( x i − µ ) ⊤ δ β η (cid:1) λ o n I E ′ i − λ o √ n k µ − µ ˆ w k λ o k δ β η k ≥ n X i =1 ϕ ( u ′ i ) ψ ( v ′ i ) − λ o √ n k µ − µ ˆ w k λ o k δ β η k . (A.17)Define the functions ϕ ( u ) = u if | v | ≤ / u − / if 1 / ≤ u ≤ / u + 1 / if − / ≤ u ≤ − /
40 if | u | > / ψ ( v ) = I ( | v | ≤ / . (A.18)and we note that ϕ ( u ′ i ) ψ ( v ′ i ) ≤ ϕ ( u ′ i ) ≤ max ( x i − µ ) ⊤ δ β η λ o n , ! (A.19)and define f ( β ) = n X i =1 f i ( β ) = n X i =1 ϕ ( u ′ i ) ψ ( v ′ i ) . asai and Fujisawa/Weighted Huber regression To bound f ( β ) from bellow, consider the supremum of a random process indexedby R ( r ): ∆ := sup β ∈R ( r ) | f ( β ) − E f ( β ) | . (A.20)From (A.6), we have E (cid:2) (( x i − µ ) ⊤ δ β η ) (cid:3) ≤ m (cid:0) E (cid:2) (( x i − µ ) ⊤ δ β η ) (cid:3)(cid:1) . (A.21)and we have E (cid:2) (( x i − µ ) ⊤ δ β η ) (cid:3) ≤ m (cid:0) E (cid:2) (( x i − µ ) ⊤ δ β η ) (cid:3)(cid:1) ≤ m (cid:0) E (cid:2) k ( x i − µ ) k k δ β η k (cid:3)(cid:1) a ) ≤ m (cid:0) E (cid:2) k ( x i − µ ) k (cid:3)(cid:1) (A.22)where (a) follows from k δ β η k = r ≤ x i − µ ) ⊤ δ β η ) = h ( x i − µ )( x i − µ ) ⊤ , δ β η δ ⊤ β η i (A.23)and from (A.23), we have1 n n X i =1 E (cid:2) (( x i − µ ) ⊤ δ β η ) (cid:3) = 1 n n X i =1 E h h ( x i − µ )( x i − µ ) ⊤ , δ β η δ ⊤ β η i i ( a ) = E "* n n X i =1 ( x i − µ )( x i − µ ) ⊤ , δ β η δ ⊤ β η + ( b ) ≤ E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n X i =1 ( x i − µ )( x i − µ ) ⊤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op k δ β η δ ⊤ β η k ∗ ( c ) = E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n X i =1 ( x i − µ )( x i − µ ) ⊤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op k δ β η k ( d ) ≤ E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n X i =1 ( x i − µ )( x i − µ ) ⊤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op ( e ) ≤ C cov , (A.24)where (a) follows from the linearity of expectation, (b) follows from H¨older’sinequality, (c) follows from trace norm of δ β η δ ⊤ β η is the sum of the diagonalcomponent of δ β η δ ⊤ β η because δ β η δ ⊤ β η is positive semi-definite, (d) follows from k δ β η k = k δ β η k and r ≤ E ′ i , we have E " n X i =1 f i ( β ) = n X i =1 E [ f i ( β )] ≥ D − E − F, (A.25) asai and Fujisawa/Weighted Huber regression where D = n X i =1 E [ u ′ i ] = n X i =1 E (cid:2) (( x i − µ ) ⊤ δ β η ) (cid:3) λ o n , E = n X i =1 E (cid:20) u ′ i I (cid:18) | u ′ i | ≥ (cid:19)(cid:21) , F = n X i =1 E (cid:20) u ′ i I (cid:18) | v ′ i | ≥ (cid:19)(cid:21) . We evaluate the right-hand side of (A.25) at each term. First, we have E = n X i =1 E (cid:20) u ′ i I (cid:18) | u ′ i | ≥ (cid:19)(cid:21) ( a ) ≤ n X i =1 q E [ u ′ i ] s E (cid:20) I (cid:18) | u ′ i | ≥ (cid:19)(cid:21) ( b ) = n X i =1 q E [ u ′ i ] s P (cid:20)(cid:18) | u ′ i | ≥ (cid:19)(cid:21) ( c ) ≤ n X i =1 λ o n q E (cid:2) (( x i − µ ) ⊤ δ β η ) (cid:3)q E (cid:2) (( x i − µ ) ⊤ δ β η ) (cid:3) ( c ) ≤ n X i =1 m λ o n E (cid:2) (( x i − µ ) ⊤ δ β η ) (cid:3) E (cid:2) (( x i − µ ) ⊤ δ β η ) (cid:3) ( d ) ≤ C cov λ o n n X i =1 E (cid:2) (( x i − µ ) ⊤ δ β η ) (cid:3) E (cid:2) (( x i − µ ) ⊤ δ β η ) (cid:3) ( e ) ≤ λ o E (cid:2) (( x i − µ ) ⊤ δ β η ) (cid:3) where (a) follows from H¨older’s inequality, (b) follows from the relation betweenindicator function and expectation, (c) follows from Markov’s inequality, (d)follows from the definition of λ o and (e) follows from (A.24). 
asai and Fujisawa/Weighted Huber regression Second, we have F = n X i =1 E (cid:20) u ′ i I (cid:18) | v ′ i | ≥ (cid:19)(cid:21) ( a ) ≤ n X i =1 q E [ u ′ i ] s E (cid:20) I (cid:18) | v ′ i | ≥ (cid:19)(cid:21) ( b ) ≤ n X i =1 q E [ u ′ i ] s P (cid:20) | v ′ i | ≥ (cid:21) ( c ) ≤ n X i =1 s λ o √ n λ o n q E (cid:2) (( x i − µ ) ⊤ δ β η ) (cid:3)p E [ | ξ i | ] ( d ) ≤ n X i =1 s λ o √ n σ λ o n m E (cid:2) (( x i − µ ) ⊤ δ β η ) (cid:3) ( e ) ≤ n X i =1 λ o n E (cid:2) (( x i − µ ) ⊤ δ β η ) (cid:3) = 116 λ o E (cid:2) (( x i − µ ) ⊤ δ β η ) (cid:3) where (a) follows from H¨older’s inequality, (b) follows from relation betweenindicator function and expectation, (c) follows from Markov’s inequality, (d)follows from (A.21) and E [ | ξ i | ] ≤ σ and (e) follows from the definition of λ o .Consequently, we have E [ f ( β )] = D − E − F ≥ nλ o n X i =1 E (cid:2) (( x i − µ ) ⊤ δ β η ) (cid:3) (A.26)and from (A.17), we have A ≥ nλ o n X i =1 E (cid:2) (( x i − µ ) ⊤ δ β η ) (cid:3) − λ o √ n k µ − µ ˆ w k λ o k δ β η k − ∆ . (A.27)Next we evaluate the stochastic term ∆ defined in (A.20). From (A.19) andTheorem 3 of Massart (2000), with probability at least 1 − e x , we have∆ ≤ E [∆] + σ f √ x + 18 . x ≤ E [∆] + σ f √ x + 5 x (A.28)where σ f = sup β ∈B P ni =1 E (cid:2) ( f i ( β ) − E [ f i ( β )]) (cid:3) . About σ f , we have E (cid:2) ( f i ( β ) − E [ f i ( β )]) (cid:3) ≤ E (cid:2) f i ( β ) (cid:3) . From (A.22), we have E (cid:2) (( x i − µ ) ⊤ δ β η ) (cid:3) ≤ E (cid:2) k x i − µ k k δ β η k (cid:3) ≤ E (cid:2) k x i − µ k (cid:3) r asai and Fujisawa/Weighted Huber regression and from (A.19) and E (cid:2) f i ( β ) (cid:3) ≤ E " (( x i − µ ) ⊤ δ β η ) λ o n ≤ E (cid:2) k x i − µ k (cid:3) λ o n r and then σ f ≤ p E [ k x i − µ k ]2 λ o r . Combining this and (A.28), we have∆ ≤ E [∆] + p E [ k x i − µ k ]2 λ o r √ x + 5 x. (A.29)From Symmetrization inequality (Lemma 11.4 of Boucheron, Lugosi and Mas-sart (2013)), we have E [∆] ≤ E (cid:2) sup β ∈B | G β | (cid:3) where G L := n X i =1 α i ϕ ( u ′ i ) ψ ( v ′ i ) , and { α i } is a sequence of Rademacher random variables (i.e., P ( α i = 1) = P ( α i = −
1) = 1 / { x i , ξ i } ni =1 . We denote E ∗ as aconditional expectation of { α i } ni =1 given { x i , ξ i } ni =1 . From contraction principal(Theorem 6.7.1 of Vershynin (2018)), we have E ∗ " sup β ∈B (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 α i ϕ ( u ′ i ) ψ ( v ′ i ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ max ≤ i ≤ n { ψ ( v ′ i ) } E ∗ " sup β ∈B (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 α i ϕ ( u ′ i ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = E ∗ " sup β ∈B (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 α i ϕ ( u ′ i ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) and from the basic property of the expectation, we have E " sup β ∈B (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 α i ϕ ( u ′ i ) ψ ( v ′ i ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ E " sup β ∈B (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 α i ϕ ( u ′ i ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) . Since ϕ is 1-Lipschitz and ϕ (0) = 0, from contraction principal (Theorem 11.6in Boucheron, Lugosi and Massart (2013)), we have E " sup β ∈B (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 α i ϕ ( u ′ i ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ E " sup β ∈B (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 α i u ′ i (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ √ n λ o E "(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X i =1 α i ( x i − µ ) n (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) k δ β η k = √ n λ o E "(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X i =1 α i ( x i − µ ) n (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) r . Consequently, we have2 E [∆] ≤ √ nλ o E "(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X i =1 α i ( x i − µ ) n (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) r . (A.30)Combining (A.30) with (A.29) and (A.27), with probability at least 1 − e − x , wehave n X i =1 λ o − h ξ i + ˆ w i n ( x i − µ ) ⊤ δ β η λ o √ n ! − h (cid:18) ξ i λ o √ n (cid:19)! ˆ w i √ n ( x i − µ ) ⊤ δ β η λ o ≥ a E (cid:2) (( x i − µ ) ⊤ δ β η ) (cid:3) − a k δ β η k − a + B + C asai and Fujisawa/Weighted Huber regression where a = 18 , a = λ o √ n E "(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X i =1 α i ( x i − µ ) n (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) + λ o q E [ k ( x i − µ ) k ] √ x + λ o √ n k µ − µ ˆ w k ,a = λ o xB = X i ∈ I G ∩ I ˆ w< + X i ∈ I o − h ξ i λ o √ n − ˆ w i √ n ( x i − µ ˆ w ) ⊤ δ β η λ o ! + h (cid:18) ξ i λ o √ n (cid:19)! ˆ w i √ n ( x i − µ ˆ w ) ⊤ δ β η λ o C = X i ∈ I G ∩ I ˆ w< + X i ∈ I o − h ξ i λ o √ n − ( x i − µ ˆ w ) ⊤ δ β η λ o √ n ! + h (cid:18) ξ i λ o √ n (cid:19)! ( x i − µ ˆ w ) ⊤ δ β η λ o √ n . Let δ = e − x and the proof is complete. Appendix B: Stochastic argument in case of heavy-tailed design
In this section, we state our main theorem in stochastic form assuming ran-domness in { x i } ni =1 and { ξ i } ni =1 . In the following subsection, we confirm theconditions in Theorem 3.1 are satisfied under the conditions in Theorem B.1. Theorem B.1.
Consider the optimization problem (2.2) . Suppose that { x i } ni =1 is a sequence with i.i.d. random vectors drawn from a distribution whose un-known covariance is Σ (cid:22) σ c I with unknown mean µ and { ξ i } ni =1 is a sequencewith i.i.d. random variables drawn from a distribution whose absolute moment isbounded by σ . Suppose that n is sufficiently large so n = O ( d log d ) and supposethat ε is sufficiently small. Let λ o √ n = c max with c max = max (cid:16) m σ , p C cov m , C CDG (cid:17) and m is some positive constant such that E [( x i − µ ) ⊤ v ) ] ≤ m (cid:0) E [(( x i − v ) ⊤ v ) ] (cid:1) (B.1) and C cov is some positive constant such that E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n X i =1 ( x i − µ )( x i − µ ) ⊤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ C cov . Let ˆ β is a optimal solution of (2.2) . Then, with probability at least (1 − δ ) ,we have k ˆ β − β ∗ k = O (cid:0) √ ε + r d (cid:1) . In the remaining part of Section B.1, we assume that { x i } ni =1 is a sequencewith i.i.d. random vectors drawn from a distribution whose unknown covarianceis Σ (cid:22) σ c I with unknown mean µ and { ξ i } ni =1 is a sequence with i.i.d. randomvariables drawn from a distribution whose absolute moment is bounded by σ . asai and Fujisawa/Weighted Huber regression B.1. ROBUST-WEIGHT for bounded covariance distributions
For heavy tailed design, we use Algorithm 2 of Cheng, Diakonikolas and Ge(2019) as ROBUST-WEIGHT. As in Algorithm 1 of Cheng, Diakonikolas andGe (2019) , The weights { ˆ w i } ni =1 can be computed by Algorithm 2 of Cheng,Diakonikolas and Ge (2019) from { X i } ni =1 and ε with polynomial time compu-tational complexity. Algorithm 2 of Cheng, Diakonikolas and Ge (2019) showedthat µ ˆ w − µ is close to 0 in the l norm with high probability.Algorithm 2 of Cheng, Diakonikolas and Ge (2019) use the same formulationof the primal-dual SDP as Algorithm 1 of Cheng, Diakonikolas and Ge (2019)and Algorithm 2 of Cheng, Diakonikolas and Ge (2019) estimates µ by thefollowing algorithm: Algorithm 3
Robust Mean Estimation for Bounded Covariance Distributions
Require: { X i } ni =1 ∈ R d with 0 < ε < / Ensure: ˆ w ∈ R n such that, with probability at least (1 − δ ) , k µ ˆ w − µ k ≤ C CDG ( √ ε + r d ),where C CDG is some positive constant depending on δ when Lemma B.1 holds.Let ν ∈ R d be the coordinate-wise median of { X i } ni =1 For i = 1 to O (log d )Use Proposition 5.5 of Cheng, Diakonikolas and Ge (2019) to compute either (i) A good solution w ∈ R n for the primal SDP (4.2) with parameters ν and 2 ε ; or (ii) A good solution M ∈ R d × d for the dual SDP (4.3) with parameters ν and ε if the objective value of w in SDP (4.2) is at most C CDG (Lemma B.1) return the weighted vector ˆ w (Lemma 5.3 of Cheng, Diakonikolas and Ge (2019)) else Move ν closer to µ using the top eigenvector of M (Lemma 5.4 of Cheng, Diakonikolasand Ge (2019)). First, we state some properties of the bounded covariance random vector.
Lemma B.1 (Adopted from the proof of Lemma A.18 of Diakonikolas et al.(2017) and the proof of Lemma 4.3 Diakonikolas et al. (2018)) . For any w ∈ δ n,ε ,we have (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) X i ∈ I G w i ( x i − µ ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ C ht ( √ ε + r d ) , (B.2) λ max X i ∈ I G w i ( x i − µ )( x i − µ ) ⊤ ! ≤ C ht , (B.3) where C ht is some positive constant, with probability at least − δ . We note that Lemma A.18 in Diakonikolas et al. (2017) is valid even if ǫ ofDiakonikolas et al. (2017) is replaced by δ . Because combining Lemma A.18 inDiakonikolas et al. (2017) and the proof of Lemma A.2, we have the proof ofLemma B.1 and we omit the detailed proof of Lemma B.1. asai and Fujisawa/Weighted Huber regression Proposition B.1 (Adopted from Lemma 4.1 of Minsker and Wei (2020)) . As-sume that { x i } ni =1 is a sequence with i.i.d. random matrices drawn from a dis-tribution with mean µ and whose covariance Σ (cid:22) σ c I and { x i } ni =1 satisfy E [(( x i − µ ) ⊤ v ) ] ≤ m (cid:0) E [(( x i − µ ) ⊤ v ) ] (cid:1) , (B.4) for any v ∈ R d . Assume d/n ≤ . Then, we have E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n X i =1 ( x i − µ )( x i − µ ) ⊤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ c htcov , where c htcov = m σ c .Proof. From Lemma 4.1 of Minsker and Wei (2020), we have E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n X i =1 ( x i − µ )( x i − µ ) ⊤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ m tr(Σ) k Σ k op n ( a ) ≤ m σ c dn ( b ) ≤ c htcov . Proposition B.2.
We have (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X i =1 ˆ w i h (cid:18) ξ i λ o √ n (cid:19) ( x i − µ ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ C htδ r dn , where C htδ = 2 /δ , with probability at least − δ .Proof. From ˆ w i h (cid:16) ξ i λ o √ n (cid:17) ≤ n × ≤ n , we have E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X i =1 ˆ w i h (cid:18) ξ i λ o √ n (cid:19) ( x i − µ ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ d X j =1 n ≤ dn . From Markov’s inequality, we have P " n (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X i =1 ˆ w i h (cid:18) ξ i λ o √ n (cid:19) ( x i − µ ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ δ r dn ≥ − δ. Proposition B.3.
Let { α i } ni =1 be a series of Rademacher random veriables.Then, we have n (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X i =1 ( x i − µ ) α i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ C htδ r dn . asai and Fujisawa/Weighted Huber regression Proof.
From α i = 1, we have E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X i =1 ( x i − µ ) α i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ d X j =1 n ≤ nd. From Markov’s inequality, we have P " n (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X i =1 ( x i − µ ) α i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ δ r dn ≥ − δ. Proposition B.4 (Adopted from (5.31) of Vershynin (2010)) . Suppose that n is sufficiently large so that n = O ( d log d ) . For any vector v ∈ S d − , we have n E " n X i =1 (( x i − µ ) ⊤ v ) ≥ . We omit the proof of Proposition B.4 because the proof is almost the sameof the proof of Proposition 4.4.
Proposition B.5 ((5.31) of Vershynin (2010)) . Suppose that n is sufficientlylarge so that n = O ( d log d ) . For any vector v ∈ S d − , we have vuut n X i =1 (( x i − µ ) ⊤ v ) ≤ √ n + C HT δ p d log d, where C HT δ is a constant depending on δ , with probability at least − δ . Next, we state some properties of Algorithm 3.We note that the proof of Lemmas 5.2, 5.3, 5.4, Proposition 5.5 and Theorem1.3 of Cheng, Diakonikolas and Ge (2019) is valid even if we replace the orderof δ in Section 5 of Cheng, Diakonikolas and Ge (2019) as δ = √ ε + r ′ d andconsequently, we have Propositions B.6, B.7 and B.8. Proposition B.6.
Suppose that < ε < / , (B.2) and (B.3) hold. We have k µ ˆ w − µ k ≤ C CDG √ ε. (B.5) when Algorithm 3 succeeds . Proposition B.7.
Suppose that < ε < / , (B.2) and (B.3) hold. We have λ max n X i =1 ˆ w i ( X i − ν )( X i − ν ) ⊤ ! ≤ C CDG for some vector in (4.2) ν when Algorithm 3 succeeds. asai and Fujisawa/Weighted Huber regression About ν in Proposition B.7, we have the following Lemma Proposition B.8.
Suppose that < ε < / , (B.2) and (B.3) hold. We have k ν − µ k ≤ C CDG for ν in Proposition B.7 when Algorithm 3 succeeds . B.2. Confirmation of the conditions
In this section, we confirm (3.1) - (3.4) under the condition used in Theorem B.1.We note that when (B.2) and (B.3) hold, Algorithm 3 succeeds (Algorithm 3computes { ˆ w i } ni =1 such that (B.5) is satisfied) with probability at least (1 − δ ) .In light of this fact, Lemmas B.1 and 4.2, Propositions B.1 - B.8 and A.5 holdwith probability at least (1 − δ ) under the conditions used in Theorem B.1.We also note that from Proposition B.6 and assumption assumed in TheoremB.1, we have r o , r d ≤
1, and we can set C µ = 2 C CDG , where C µ is defined inProposition A.5. From the definition of λ o , conditions assumed in Theorem B.1and Proposition B.1, we see that Proposition A.5 holds.In Section B.2, to omit to state “with probability at least”, we assume thatinequalities in Lemmas B.1 and 4.2, Propositions B.1 - B.8 and A.5 hold andAlgorithm 3 succeeds. Proposition B.9 (Confirmation of (3.3)) . We have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)X i ∈ I o ˆ w i h ( R i ( β η ))( X i − µ ˆ w ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ C htcond ( √ ε + r d ) k δ β η k , where C htcond = (cid:18)q C CDG + C ht + C CDG + 2 C ht C CDG + 2 √ C CDG (cid:19) . Proof.
From X i − µ ˆ w = ( X i − ν ) + ( ν − µ ) + ( µ − µ ˆ w ), we have X i ∈ I o ˆ w i h ( R i ( β η ))( X i − µ ˆ w ) ⊤ δ β η = X i ∈ I o ˆ w i h ( R i ( β η ))( X i − ν ) ⊤ δ β η + X i ∈ I o ˆ w i h ( R i ( β η ))( ν − µ ) ⊤ δ β η + X i ∈ I o ˆ w i h ( R i ( β η ))( µ − µ ˆ w ) ⊤ δ β η . (B.6)Let M i = ( X i − ν )( X i − ν ) ⊤ . For the first term of the R.H.S of (B.6), we asai and Fujisawa/Weighted Huber regression have (X i ∈ I o ˆ w i h ( R i ( β η ))( X i − ν ) ⊤ δ β η ) a ) ≤ X i ∈ I o ˆ w i X i ∈ I o ˆ w i | ( X i − ν ) ⊤ δ β η | b ) ≤ ε X i ∈ I o ˆ w i | ( X i − ν ) ⊤ δ β η | = 2 εδ ⊤ β η X i ∈ I o ˆ w i M i δ β η = 2 ε δ ⊤ β η n X i =1 ˆ w i M i δ β η − δ ⊤ β η X i ∈ I G ˆ w i M i δ β η ! = 2 ελ max ( M ) k δ β η k − εδ ⊤ β η X i ∈ I G ˆ w i M i δ β η ! , (B.7)where (a) follows from ˆ w i = √ ˆ w i √ ˆ w i and 0 ≤ h ( R i ( β η )) ≤
1, (b) follows from0 ≤ ˆ w i ≤ − ε ) n ≤ n and P i ∈ I o ˆ w i ≤ ε . For the last term of R.H.S of (B.7),we have δ ⊤ β η X i ∈ I G ˆ w i M i δ β η = δ ⊤ β η X i ∈ I G ˆ w i ( x i − ν )( x i − ν ) ⊤ δ β η = δ ⊤ β η X i ∈ I G ˆ w i ( x i − µ − ν + µ )( x i − µ − ν + µ ) ⊤ δ β η = δ ⊤ β η X i ∈ I G ˆ w i ( x i − µ )( x i − µ ) ⊤ δ β η + δ ⊤ β η X i ∈ I G ˆ w i ( µ − ν )( µ − ν ) ⊤ δ β η + 2 δ ⊤ β η X i ∈ I G ˆ w i ( x i − µ )( µ − ν ) ⊤ δ β η ( a ) ≥ k δ β η k + δ ⊤ β η X i ∈ I G ˆ w i ( µ − ν )( µ − ν ) ⊤ δ β η + 2 δ ⊤ β η X i ∈ I G ˆ w i ( x i − µ )( µ − ν ) ⊤ δ β η , (B.8)where (a) follows from (B.3). From (B.7) and (B.8), we have1 ε (X i ∈ I o ˆ w i h ( R i ( δ β η ))( X i − ν ) ⊤ δ β ) ≤ λ max ( M ) k δ β η k + C ht k δ β η k − δ ⊤ β η X i ∈ I G ˆ w i ( µ − ν )( µ − ν ) ⊤ δ β η − δ ⊤ β η X i ∈ I G ˆ w i ( x i − µ )( µ − ν ) ⊤ δ β η ≤ λ max ( M ) k δ β η k + C ht k δ β η k + (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) δ ⊤ β η X i ∈ I G ˆ w i ( µ − ν )( µ − ν ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + 2 (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) δ ⊤ β η X i ∈ I G ˆ w i ( x i − µ )( µ − ν ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ( a ) ≤ (cid:0) C CDG + C ht + C CDG + 2 C ht C CDG (cid:1) k δ β η k , asai and Fujisawa/Weighted Huber regression where (a) follows from Propositions 4.7 and B.8 and Lemma B.2.For the second term of the R.H.S of (B.6), we have (X i ∈ I o ˆ w i h ( R i ( β η ))( ν − µ ) ⊤ δ β η ) ≤ X i ∈ I o ˆ w i X i ∈ I o ˆ w i | ( ν − µ ) ⊤ δ β η | a ) ≤ ε X i ∈ I o ˆ w i | ( ν − µ ) ⊤ δ β η | = 2 δ ⊤ β η ε X i ∈ I o ˆ w i ( ν − µ )( ν − µ ) ⊤ δ β η ( b ) ≤ C CDG ε k δ β η k , where (a) follows from 0 ≤ ˆ w i ≤ − ε ) n ≤ n and P i ∈ I o ˆ w i ≤ ε and (b) followsfrom Proposition B.8 and P i ∈ I o ˆ w i ≤ P ni =1 ˆ w i = 1. For the last term of theR.H.S of (B.6), we have (X i ∈ I o ˆ w i h ( R i ( β η ))( µ − µ ˆ w ) ⊤ δ β η ) ≤ X i ∈ I o ˆ w i X i ∈ I o ˆ w i | ( µ − µ ˆ w ) ⊤ δ β η | a ) ≤ ε X i ∈ I o ˆ w i | ( µ − µ ˆ w ) ⊤ δ β η | = 2 δ ⊤ β η ε X i ∈ I o ˆ w i ( µ − µ ˆ w )( µ − µ ˆ w ) ⊤ δ β η ( b ) ≤ C CDG ε √ ε k δ β η k ≤ C CDG ε k δ β η k , where (a) follows from 0 ≤ ˆ w i ≤ − ε ) n ≤ n and P i ∈ I o ˆ w i ≤ ε and (b) followsfrom Proposition B.6 and P i ∈ I o ˆ w i ≤ P ni =1 ˆ w i = 1.Consequently, we have X i ∈ I o ˆ w i h ( R i ( β η ))( X i − µ ˆ w ) ⊤ δ β η ≤ (cid:18)q C CDG + C ht + C CDG + 2 C ht C CDG + 2 √ C CDG (cid:19) √ ε k δ β η k ≤ (cid:18)q C CDG + C ht + C CDG + 2 C ht C CDG + 2 √ C CDG (cid:19) ( √ ε + r d ) k δ β η k . Lemma B.2 are used in the proof of Proposition B.9.
Lemma B.2.
We have δ ⊤ β η X i ∈ I G ˆ w i ( x i − µ )( µ − ν ) ⊤ δ β η ≤ C ht C CDG k δ β η k . asai and Fujisawa/Weighted Huber regression Proof. δ ⊤ β η X i ∈ I G ˆ w i ( x i − µ )( µ − ν ) ⊤ δ β η ≤ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) δ ⊤ β η X i ∈ I G ˆ w i ( x i − µ )( µ − ν ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) δ ⊤ β η X i ∈ I G ˆ w i ( x i − µ ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:12)(cid:12) ( µ − ν ) ⊤ δ β η (cid:12)(cid:12) ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) X i ∈ I G ˆ w i ( x i − µ ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) k µ − ν k k δ β k a ) ≤ C ht ( √ ε + r d ) C CDG k δ β η k b ) ≤ C ht C CDG k δ β η k , where (a) follows from (B.2) and (B.8) and (b) follows from √ ε ≤ r d ≤ Proposition B.10 (Confirmation of (3.1)) . We have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 ˆ w i h (cid:18) ξ i λ o √ n (cid:19) ( x i − µ ˆ w ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ C ht δ √ ε k δ β η k , where C ht δ is some positive constant depending on δ .Proof. λ o √ n (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 ˆ w i h (cid:18) ξ i λ o √ n (cid:19) ( x i − µ ˆ w ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ( a ) ≤ λ o √ n (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 ˆ w i h (cid:18) ξ i λ o √ n (cid:19) ( x i − µ ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + λ o √ n (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 ˆ w i h (cid:18) ξ i λ o √ n (cid:19) ( µ − µ ˆ w ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ( b ) ≤ λ o √ n (cid:0) C htδ r d k δ β η k + 2 C CDG ( √ ε + r d ) k δ β η k (cid:1) ≤ λ o √ n (cid:16) C htδ + √ C CDG (cid:17) ( √ ε + r d | δ β η k = λ o √ nC ht δ ( √ ε + r d ) k δ β η k , where (a) follows from triangular inequality and (b) follows from Lemmas B.3and B.4. Lemma B.3.
We have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 ˆ w i h (cid:18) ξ i λ o √ n (cid:19) ( x i − µ ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ C htδ r d k δ β η k . Proof.
We have n X i =1 ˆ w i h (cid:18) ξ i λ o √ n (cid:19) ( x i − µ ) ⊤ δ β η ( a ) ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X i =1 ˆ w i h (cid:18) ξ i λ o √ n (cid:19) ( x i − µ ) ⊤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) k δ β η k b ) ≤ C htδ r dn k δ β η k ≤ C htδ r d k δ β η k asai and Fujisawa/Weighted Huber regression where (a) follows from H¨older’s inequality and (b) follows from PropositionB.2. Lemma B.4.
We have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 ˆ w i h (cid:18) ξ i λ o √ n (cid:19) ( µ ˆ w − µ ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ √ C CDG ( √ ε + r d ) k δ β η k . Proof.
We have ( n X i =1 ˆ w i h (cid:18) ξ i λ o √ n (cid:19) ( µ − µ ˆ w ) ⊤ δ β η ) ≤ n X i =1 ˆ w i n X i =1 ˆ w i | ( µ − µ ˆ w ) ⊤ δ β η | a ) ≤ n X i =1 ˆ w i | ( µ − µ ˆ w ) ⊤ δ β η | = 2 δ ⊤ β η n X i =1 ˆ w i ( µ − µ ˆ w )( µ − µ ˆ w ) ⊤ δ β η ( b ) ≤ C CDG ( ε + r d ) k δ β η k , where (a) follows from 0 ≤ ˆ w i ≤ − ε ) n ≤ n and P ni =1 ˆ w i ≤ P i ∈ I o ˆ w i ≤ P ni =1 ˆ w i = 1. From triangular inequality,we have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 ˆ w i h (cid:18) ξ i λ o √ n (cid:19) ( µ − µ ˆ w ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ √ C CDG ( √ ε + r d ) k δ β η k . Proposition B.11 (Confirmation of (3.2)) . We have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)X i ∈ I o ˆ w i h ( r i ( β η ))( x i − µ ˆ w ) ⊤ δ β (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ C ht δ ( √ ε + r d ) k δ β η k , where C ht δ is some positive constant depending on δ .Proof. We have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)X i ∈ I o ˆ w i h ( r i ( β η ))( x i − µ ˆ w ) ⊤ δ β (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ( a ) ≤ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)X i ∈ I o ˆ w i h ( r i ( β η ))( x i − µ ) ⊤ δ β (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)X i ∈ I o ˆ w i h ( r i ( β η ))( µ − µ ˆ w ) ⊤ δ β (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ( b ) ≤ C HT δ )( √ ε + r d ) k δ β η k + √ C CDG ( √ ε + r d ) k δ β η k = (2(1 + C HT δ ) + √ C CDG )( √ ε + r d ) k δ β η k = C ht δ ( √ ε + r d ) k δ β η k , asai and Fujisawa/Weighted Huber regression where (a) follows from triangular inequality and (b) follows from Lemmas B.5and B.6. Lemma B.5.
We have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)X i ∈ I o ˆ w i h ( r i ( β η ))( x i − µ ) ⊤ δ β (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ C HT δ )( √ ε + r d ) k δ β η k . Proof.
We have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)X i ∈ I o ˆ w i h ( r i ( β η ))( x i − µ ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ( a ) ≤ sX i ∈ I o ˆ w i h ( r i ( β η )) (cid:16) √ n + C HT δ p d log d (cid:17) k δ β η k b ) ≤ √ on (cid:16) √ n + C HT δ p d log d (cid:17) k δ β η k c ) ≤ (cid:0) √ ε + C HT δ r d (cid:1) k δ β η k ≤ C HT δ )( √ ε + r d ) k δ β η k , where (a) follows from H¨older’s inequality and Proposition B.5, (b) follow fromˆ w i ≤ n (1 − ε ) ≤ n and − ≤ h ( r i ( β η )) ≤ on ≤ Lemma B.6.
We have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)X i ∈ I o ˆ w i h ( r i ( β η ))( µ ˆ w − µ ) ⊤ δ β (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ √ C CDG ( √ ε + r d ) k δ β η k . Proof.
We have (X i ∈ I o ˆ w i h (cid:18) ξ i λ o √ n (cid:19) ( µ − µ ˆ w ) ⊤ δ β η ) ≤ X i ∈ I o ˆ w i X i ∈ I o ˆ w i | ( µ − µ ˆ w ) ⊤ δ β η | a ) ≤ ε X i ∈ I o ˆ w i | ( µ − µ ˆ w ) ⊤ δ β η | = 2 εδ ⊤ β η X i ∈ I o ˆ w i ( µ − µ ˆ w )( µ − µ ˆ w ) ⊤ δ β η ≤ δ ⊤ β η X i ∈ I o ˆ w i ( µ − µ ˆ w )( µ − µ ˆ w ) ⊤ δ β η ( b ) ≤ C CDG ( √ ε + r d ) k δ β η k , where (a) follows from 0 ≤ ˆ w i ≤ − ε ) n ≤ n and P i ∈ I o ˆ w i ≤ ε and (b) followsfrom Proposition B.6 and P i ∈ I o ˆ w i ≤ P ni =1 ˆ w i = 1. asai and Fujisawa/Weighted Huber regression Proposition B.12 (Confirmation of (3.4)) . We have λ o n X i =1 (cid:18) − h (cid:18) ξ i − ˆ w i n ( x i − µ ˆ w ) ⊤ δ β λ o √ n (cid:19) + h (cid:18) ξ i λ o √ n (cid:19)(cid:19) ˆ w i √ n ( x i − µ ˆ w ) ⊤ δ β η λ o ≥ k δ β η k − λ o √ nC htcond √ ε k δ β η k − /δ ) λ o , where C htcond = C htδ + p /δ ) + √ C CDG + 2 √ C HT δ ) .Proof. From Proposition A.5 λ o n X i =1 (cid:18) − h (cid:18) ξ i − ˆ w i n ( x i − µ ˆ w ) ⊤ δ β λ o √ n (cid:19) + h (cid:18) ξ i λ o √ n (cid:19)(cid:19) ˆ w i √ n ( x i − µ ˆ w ) ⊤ δ β η λ o (B.9) ≥ a n E n X i =1 (cid:2) (( x i − µ ) ⊤ δ β η ) (cid:3) − a k δ β η k − a + B + C, (B.10)where a = 18 ,a = λ o √ n E "(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X i =1 α i ( x i − µ ) n (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) + λ o q E [ k ( x i − µ ) k ] r δ + λ o √ n k µ − µ ˆ w k ,a = λ o δB = λ o √ n X i ∈ I G ∩ I ˆ w< + X i ∈ I o (cid:18) − h (cid:18) ξ i λ o √ n − ˆ w i √ n ( x i − µ ˆ w ) ⊤ δ β λ o (cid:19) + h (cid:18) ξ i λ o √ n (cid:19)(cid:19) ˆ w i ( x i − µ ˆ w ) ⊤ δ β η C = λ o √ n X i ∈ I G ∩ I ˆ w< + X i ∈ I o (cid:18) − h (cid:18) ξ i λ o √ n − ( x i − µ ˆ w ) ⊤ δ β λ o √ n (cid:19) + h (cid:18) ξ i λ o √ n (cid:19)(cid:19) ( x i − µ ˆ w ) ⊤ δ β η . From Proposition B.4, we have1 n E " n X i =1 (( x i − µ ) ⊤ δ β η ) ≥ k δ β η k . From Proposition B.3 and (B.6), we have a ≤ λ o √ n C htδ r dn + r δ r dn + C CDG √ ε ! ≤ λ o √ n ( C htδ + p /δ )) r dn + C CDG √ ε ! ≤ λ o √ n (cid:18) ( C htδ + p /δ )) r d + C CDG √ ε (cid:19) ≤ λ o √ n (cid:18) C htδ + p /δ ) + C CDG (cid:19) ( √ ε + r d ) asai and Fujisawa/Weighted Huber regression Combining the above inequalities and Lemmas B.7 and B.8, we have λ o √ n n X i =1 (cid:18) − h (cid:18) ξ i − ˆ w i n ( x i − µ ˆ w ) ⊤ δ β λ o √ n (cid:19) + h (cid:18) ξ i λ o √ n (cid:19)(cid:19) ˆ w i ( x i − µ ˆ w ) ⊤ δ β η ≥ k δ β η k − λ o √ n (cid:18) C htδ + p /δ ) + C CDG (cid:19) ( √ ε + r d ) k δ β η k −
20 log(1 /δ ) λ o − λ o √ n (cid:16) √ C HT δ ) + √ C CDG (cid:17) ( √ ε + r d ) k δ β η k ≥ k δ β η k − λ o √ n C htδ + p /δ ) + 1 + 32 √ C CDG + 2 √ C HT δ ) ! ( √ ε + r d ) k δ β η k −
20 log(1 /δ ) λ o . The following two lemmas are used in the proof of Proposition B.12.
Lemma B.7.
We have X i ∈ I G ∩ I ˆ w< + X i ∈ I o − h ξ i λ o √ n − ˆ w i √ n ( x i − µ ˆ w ) ⊤ δ β η λ o ! + h (cid:18) ξ i λ o √ n (cid:19)! ˆ w i ( x i − µ ˆ w ) ⊤ δ β η ≤ (cid:16) √ C HT δ ) + √ C CDG (cid:17) ( √ ε + r d ) k δ β η k . Proof.
Let h i = − h ξ i λ o √ n − ˆ w i √ n ( x i − µ ˆ w ) ⊤ δ β η λ o ! + h (cid:18) ξ i λ o √ n (cid:19) . From 0 ≤ ˆ w i ≤ n , − ≤ h i ≤ X i ∈ I G ∩ I ˆ w< ( h i ˆ w i ) ≤ on , X i ∈ I o ( h i ˆ w i ) ≤ on and X i ∈ I G ∩ I ˆ w< + X i ∈ I o ( h i ˆ w i ) ≤ on . (B.11)In addition we note that X i ∈ I G ∩ I ˆ w< ˆ w i ≤ o (1 − ε ) n ≤ ε, X i ∈ I o ˆ w i ≤ o (1 − ε ) n ≤ ε and X i ∈ I G ∩ I ˆ w< + X i ∈ I o ˆ w i ≤ ε. (B.12) asai and Fujisawa/Weighted Huber regression From x i − µ ˆ w = x i − µ + µ − µ ˆ w , we have X i ∈ I G ∩ I ˆ w< + X i ∈ I o − h ξ i λ o √ n − ˆ w i √ n ( x i − µ ˆ w ) ⊤ δ β η λ o ! + h (cid:18) ξ i λ o √ n (cid:19)! ˆ w i ( x i − µ ˆ w ) ⊤ δ β η = X i ∈ I G ∩ I ˆ w< + X i ∈ I o h i ˆ w i ( x i − µ ) ⊤ δ β η + X i ∈ I G ∩ I ˆ w< + X i ∈ I o h i ˆ w i ( µ − µ ˆ w ) ⊤ δ β η (B.13)For the first term of the R.H.S of (B.13), we have X i ∈ I G ∩ I ˆ w< + X i ∈ I o h i ˆ w i ( x i − µ ) ⊤ δ β η ( a ) ≤ vuuut X i ∈ I G ∩ I ˆ w< + X i ∈ I o h i ˆ w i (cid:16) √ n + C HT δ p d log d (cid:17) k δ β η k b ) ≤ √ √ on (cid:16) √ n + C HT δ p d log d (cid:17) k δ β η k c ) ≤ √ (cid:0) √ ε + C HT δ r d (cid:1) k δ β η k ≤ √ C HT δ ) ( √ ε + r d ) k δ β η k where (a) follows from H¨older’s inequality and Proposition B.5, (b) follows from(B.11) and (c) follows from on ≤ X i ∈ I G ∩ I ˆ w< + X i ∈ I o h i ˆ w i ( µ − µ ˆ w ) ⊤ δ β ≤ X i ∈ I G ∩ I ˆ w< + X i ∈ I o ˆ w i X i ∈ I G ∩ I ˆ w< + X i ∈ I o ˆ w i | ( µ − µ ˆ w ) ⊤ δ β η | a ) ≤ ε X i ∈ I G ∩ I ˆ w< + X i ∈ I o ˆ w i | ( µ − µ ˆ w ) ⊤ δ β η | = 6 εδ ⊤ β η X i ∈ I G ∩ I ˆ w< + X i ∈ I o ˆ w i ( µ − µ ˆ w )( µ − µ ˆ w ) ⊤ δ β η ≤ δ ⊤ β η X i ∈ I G ∩ I ˆ w< + X i ∈ I o ˆ w i ( µ − µ ˆ w )( µ − µ ˆ w ) ⊤ δ β η ≤ δ ⊤ β η n X i =1 ˆ w i ( µ − µ ˆ w )( µ − µ ˆ w ) ⊤ δ β η ( b ) ≤ C CDG ( √ ε + r d ) k δ β η k , where (a) follows from (B.12) and (b) follows from Proposition B.6 and P i ∈ I o ˆ w i ≤ P ni =1 ˆ w i = 1. asai and Fujisawa/Weighted Huber regression Consequently, we have X G ∈ I G ∩ I ˆ w< + X i ∈ I o h i ˆ w i ( x i − µ ˆ w ) ⊤ δ β η ≤ √ C HT δ ) ( √ ε + r d ) k δ β η k + √ C CDG ( √ ε + r d ) k δ β η k = (cid:16) √ C HT δ ) + √ C CDG (cid:17) ( √ ε + r d ) k δ β η k . Lemma B.8.
We have λ o √ n X G ∈ I G ∩ I ˆ w< + X i ∈ I o − h ξ i λ o √ n − ( x i − µ ˆ w ) ⊤ δ β η λ o √ n ! + h (cid:18) ξ i λ o √ n (cid:19)! ( x i − µ ˆ w ) ⊤ δ β η ≤ (cid:16) √ C HT δ ) + √ C CDG (cid:17) ( √ ε + r d ) k δ β η k . Because the proof of Lemma B.8 is almost the same of the proof of LemmaB.7, we omit the proof of Lemma B.8.
B.2.1. Proof of Theorem B.1
From Proposition B.9, B.10, B.11 and B.12, conditions (3.1), (3.2), (3.3) and(3.4) are satisfied by c = λ o √ n (cid:0) C ht δ + C ht δ + C htcond (cid:1) ( √ ε + r d ) ,c = 116 ,c = λ o √ nC htcond ( √ ε + r d ) ,c = 5 λ o log(1 /δ ) = 5 c n with probability at least (1 − δ ) .We note that, from Proposition B.1 and the definition of λ o , λ o √ n = c max ,which is a constant with C cov = c htcov and C cond and C cond are also constant.Consequently, we have3 c + c + √ c c c ≤ λ o √ n (cid:18) C ht δ + C ht δ + C htcond + C htcond (cid:19) ( √ ε + r d ) + c max r
516 1 √ n ! ≤ λ o √ n (cid:18) C ht δ + C ht δ + C htcond + C htcond (cid:19) ( √ ε + r d ) + c max r r d ! ≤ λ o √ n (cid:18) C ht δ + C ht δ + C htcond + C htcond (cid:19) + c max r ! ( √ ε + r d ) ..