Adversarial robust weighted Huber regression
Takeyuki Sasai
Department of Statistical Science, The Graduate University for Advanced Studies, SOKENDAI, Tokyo, Japan. e-mail: [email protected]
Hironori Fujisawa
The Institute of Statistical Mathematics, Tokyo, Japan. Department of Statistical Science, The Graduate University for Advanced Studies, SOKENDAI, Tokyo, Japan. Center for Advanced Integrated Intelligence Research, RIKEN, Tokyo, Japan. e-mail: [email protected]
Abstract:
We propose a novel method to estimate the coefficients of linear regression when outputs and inputs are contaminated by malicious outliers. Our method consists of two steps: (i) compute appropriate weights $\{\hat w_i\}_{i=1}^n$ such that the weighted sample mean of the regression covariates robustly estimates the population mean of the regression covariate, and (ii) run Huber regression using $\{\hat w_i\}_{i=1}^n$. When (a) the regression covariates are a sequence of i.i.d. random vectors drawn from a sub-Gaussian distribution with unknown mean and known identity covariance and (b) the absolute moment of the random noise is finite, our method attains a faster convergence rate than Diakonikolas, Kong and Stewart (2019) and Cherapanamjeri et al. (2020). Furthermore, our result is minimax optimal up to a constant factor. When (a) the regression covariates are a sequence of i.i.d. random vectors drawn from a heavy-tailed distribution with unknown mean and bounded kurtosis and (b) the absolute moment of the random noise is finite, our method attains a convergence rate which is minimax optimal up to a constant factor.
MSC 2010 subject classifications:
Keywords and phrases:
Linear regression, Robustness, Convergence rate, Huber loss.
∗ This work was supported in part by JSPS KAKENHI Grant Number 17K00065.
1. Introduction
Learning from data contaminated by malicious outliers has been an important topic in robust statistics. Classically, Huber (1981) presented important notions such as Huber's contamination model. Recently, Chen, Gao and Ren (2018) and Gao (2020) derived minimax lower bounds for parameter estimation in Huber's contamination model. They also proposed parameter estimation methods using Tukey's depth (Tukey 1975); the estimation errors match the lower bounds up to constant factors (Mizera et al. 2002). Elsener and van de Geer (2018) and Loh (2017) studied non-convex M-estimators. Lugosi and Mendelson (2019) and Lugosi and Mendelson (2020) considered the median-of-means tournament (Jerrum, Valiant and Vazirani 1986, Nemirovsky and Yudin 1983). However, the methods proposed by Chen, Gao and Ren (2018), Gao (2020), Elsener and van de Geer (2018), Loh (2017), and Lugosi and Mendelson (2019, 2020) possibly require exponential computational complexity to achieve fast convergence rates. On the other hand, Lai, Rao and Vempala (2016) and Diakonikolas et al. (2018) proposed methods which robustly estimate parameters with polynomial time complexity, and since then many computationally efficient robust estimators have been proposed by Diakonikolas et al. (2017), Karmalkar and Price (2018), Kothari, Steinhardt and Steurer (2018), Diakonikolas et al. (2019), Cheng, Diakonikolas and Ge (2019), Dong, Hopkins and Li (2019), Cheng et al. (2019), Diakonikolas, Kane and Pensia (2020), Prasad et al. (2020), Hopkins, Li and Zhang (2020), and Cheng et al. (2020).

In the present paper, we study the linear regression model
\[
y_i = x_i^\top \beta^* + \xi_i, \quad i = 1, \dots, n, \qquad (1.1)
\]
where $x_i \in \mathbb{R}^d$ and $\{\xi_i\}_{i=1}^n$ is a sequence of random noise which is independent of $\{x_i\}_{i=1}^n$. We allow an adversary to pick $o$ samples from $(y_i, x_i)_{i=1}^n$ and replace them with arbitrary values. Let $I_o$ be the index set of the replaced samples and $I_G = \{1, \dots, n\} \setminus I_o$. Then, the model contaminated by adversarial samples is
\[
y_i = (x_i + \varrho_i)^\top \beta^* + \xi_i + \sqrt{n}\,\theta_i, \quad i = 1, \dots, n, \qquad (1.2)
\]
where $\varrho_i = (0, \dots, 0)^\top$ and $\theta_i = 0$ for $i \in I_G$. We note that, because we allow the adversary to pick the $o$ samples from $(y_i, x_i)_{i=1}^n$ arbitrarily, $(y_i + \sqrt{n}\theta_i, x_i + \varrho_i)_{i=1}^n$ follows a strong contamination model (Diakonikolas and Kane 2019), not Huber's contamination model. Furthermore, (1.2) can be expressed as
\[
y_i = X_i^\top \beta^* + \xi_i + \sqrt{n}\,\theta_i, \quad i = 1, \dots, n, \qquad (1.3)
\]
where $X_i = x_i + \varrho_i$. Some works (Diakonikolas, Kong and Stewart 2019, Cherapanamjeri et al. 2020, Pensia, Jog and Loh 2020, Bakshi and Prasad 2020) consider the model (1.3) and proposed methods to estimate $\beta^*$ with polynomial computational complexity. Diakonikolas, Kong and Stewart (2019) and Cherapanamjeri et al. (2020) dealt with the case where $\{x_i\}_{i=1}^n$ and $\{\xi_i\}_{i=1}^n$ are Gaussian with possibly non-identity covariance and sub-Gaussian with identity covariance, respectively. Cherapanamjeri et al. (2020), Pensia, Jog and Loh (2020), and Bakshi and Prasad (2020) considered the case where $\{x_i\}_{i=1}^n$ and $\{\xi_i\}_{i=1}^n$ obey heavy-tailed distributions.

Our first result is given in the following theorem; for a precise statement, see Theorem 4.1. Let $\hat\beta$ be the minimizer of the Huber loss function (2.2) with some weights related to outliers. The estimator $\hat\beta$ can be computed with polynomial computational complexity because of the convexity of (2.2).
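As a concrete illustration of the contamination model (1.2)-(1.3), the following small simulation sketch draws clean data from (1.1) and then lets an adversary overwrite $o = \lfloor \varepsilon n \rfloor$ of the pairs $(y_i, X_i)$. The function name, the Student-$t$ noise and the particular corruption pattern are illustrative choices for this sketch only; the model itself allows the replaced pairs to take arbitrary values.

\begin{verbatim}
import numpy as np

def generate_contaminated_data(n=1000, d=20, eps=0.05, sigma=1.0, seed=0):
    """Sample (y_i, X_i) from model (1.3); o = eps*n pairs are replaced adversarially."""
    rng = np.random.default_rng(seed)
    beta_star = rng.normal(size=d)
    x = rng.normal(size=(n, d))                  # clean covariates, identity covariance
    xi = sigma * rng.standard_t(df=5, size=n)    # noise with bounded absolute moment
    y = x @ beta_star + xi                       # uncontaminated responses, model (1.1)

    X, o = x.copy(), int(eps * n)                # o adversarial samples
    idx_out = rng.choice(n, size=o, replace=False)   # the index set I_o
    X[idx_out] = 10.0 * rng.normal(size=(o, d))  # one possible attack on the covariates
    y[idx_out] = -50.0 + rng.normal(size=o)      # and on the responses
    return y, X, beta_star, idx_out
\end{verbatim}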
Theorem 1.1. Suppose that $\{x_i\}_{i=1}^n$ is a sequence of i.i.d. random vectors drawn from a sub-Gaussian distribution with unknown mean $\mu$ and identity covariance, and that $\{\xi_i\}_{i=1}^n$ is a sequence of i.i.d. random variables drawn from a distribution whose absolute moment is bounded by $\sigma$. In addition, we assume some conditions. Then, with high probability, we have
\[
\|\hat\beta - \beta^*\|_2 = O\left(\varepsilon\sqrt{\log\frac{1}{\varepsilon}} + \sqrt{\frac{d\log d}{n}}\right). \qquad (1.4)
\]
When the fraction of outliers $\varepsilon$ is a constant and $n$ is sufficiently large, we have
\[
\|\hat\beta - \beta^*\|_2 = O\left(\varepsilon\sqrt{\log\frac{1}{\varepsilon}}\right). \qquad (1.5)
\]
Our convergence rate is faster than those of Diakonikolas, Kong and Stewart (2019) and Cherapanamjeri et al. (2020), because their convergence rates are $O(\varepsilon\log(1/\varepsilon))$. Furthermore, according to Theorem D.3 of Cherapanamjeri et al. (2020), our result is minimax optimal up to a constant factor.

Our second result is given in the following theorem; for a precise statement, see Theorem B.1.

Theorem 1.2.
Suppose that $\{x_i\}_{i=1}^n$ is a sequence of i.i.d. random vectors drawn from a distribution with unknown mean $\mu$, bounded known covariance and bounded kurtosis. Suppose $\{\xi_i\}_{i=1}^n$ is a sequence of i.i.d. random variables drawn from a distribution whose absolute moment is bounded by $\sigma$. In addition, we assume some conditions. Then, with high probability, we have
\[
\|\hat\beta - \beta^*\|_2 = O\left(\sqrt{\varepsilon} + \sqrt{\frac{d\log d}{n}}\right). \qquad (1.6)
\]
When the fraction of outliers $\varepsilon$ is a constant and $n$ is sufficiently large, we have
\[
\|\hat\beta - \beta^*\|_2 = O(\sqrt{\varepsilon}). \qquad (1.7)
\]
Our convergence rate is the same as those of Cherapanamjeri et al. (2020), Pensia, Jog and Loh (2020), and Bakshi and Prasad (2020) up to constant factors. Furthermore, according to Theorem D.4 of Cherapanamjeri et al. (2020), our result is minimax optimal up to a constant factor.

In Section 2, we provide our estimation method. In Section 3, we state our main theorem in deterministic form (Theorem 3.1). In Section 4, we confirm that the conditions in Theorem 3.1 are satisfied with high probability under the assumptions in Theorem 4.1. In the Appendix, we prove the inequalities used in Section 4 and state Theorem 1.2 precisely, together with its proof.
2. Our method
To estimate β ∗ in (1.3), we propose the following algorithm (Algorithm 1). Algorithm 1
TWO STEP WEIGHTED HUBER REGRESSION
Input: $\{y_i, X_i\}_{i=1}^n$, $\varepsilon$ and the tuning parameter $\lambda_o$. Output: $\hat\beta$.
1: $\{\hat w_i\}_{i=1}^n \leftarrow$ ROBUST-WEIGHT($\{y_i, X_i\}_{i=1}^n, \varepsilon$)
2: $\hat\beta \leftarrow$ WEIGHTED-HUBER-REGRESSION($\{y_i, X_i, \hat w_i\}_{i=1}^n, \varepsilon, \lambda_o$)

We require ROBUST-WEIGHT to compute a weight vector $\hat w = (\hat w_1, \dots, \hat w_n)$ satisfying (3.1)-(3.5). For example, we can use Algorithm 1 of Cheng, Diakonikolas and Ge (2019) as ROBUST-WEIGHT. That algorithm was proposed to robustly estimate the mean $\mu = \mathbb{E}[x_i]$ from $\{X_i\}_{i=1}^n$.

WEIGHTED-HUBER-REGRESSION is a parameter estimation algorithm using the Huber loss with $\{n\hat w_i X_i\}_{i=1}^n$ instead of $\{X_i\}_{i=1}^n$. In WEIGHTED-HUBER-REGRESSION, we consider the following optimization problem:
\[
(\hat\theta, \hat\beta) = \mathop{\mathrm{argmin}}_{\theta \in \mathbb{R}^n,\, \beta \in \mathbb{R}^d} \sum_{i=1}^n \frac{1}{2n}\left( y_i - n\hat w_i (X_i - \mu_{\hat w})^\top \beta - \sqrt{n}\,\theta_i \right)^2 + \lambda_o \|\theta\|_1, \qquad (2.1)
\]
where $\mu_{\hat w} = \sum_{i=1}^n \hat w_i X_i$. From She and Owen (2011), after optimizing (2.1) with respect to $\theta$, we have
\[
\hat\beta = \mathop{\mathrm{argmin}}_{\beta \in \mathbb{R}^d} \sum_{i=1}^n \lambda_o^2\, H\!\left( \frac{y_i - n\hat w_i (X_i - \mu_{\hat w})^\top \beta}{\lambda_o \sqrt{n}} \right), \qquad (2.2)
\]
where $H(t)$ is the Huber loss function
\[
H(t) =
\begin{cases}
|t| - 1/2 & (|t| > 1), \\
t^2/2 & (|t| \le 1).
\end{cases}
\]
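The following sketch shows one way WEIGHTED-HUBER-REGRESSION could be implemented by minimizing (2.2) directly with a generic smooth optimizer. The use of scipy.optimize.minimize, the L-BFGS-B method and the zero initialization are implementation choices for this sketch and are not prescribed by our analysis; only the objective (2.2) itself comes from the text.

\begin{verbatim}
import numpy as np
from scipy.optimize import minimize

def huber(t):
    """Huber loss H(t) = t^2/2 for |t| <= 1 and |t| - 1/2 otherwise."""
    return np.where(np.abs(t) <= 1.0, 0.5 * t ** 2, np.abs(t) - 0.5)

def weighted_huber_regression(y, X, w, lam_o):
    """Minimize the weighted Huber objective (2.2) over beta."""
    n, d = X.shape
    mu_w = w @ X                                # weighted covariate mean mu_w
    Z = n * w[:, None] * (X - mu_w)             # rows n * w_i * (X_i - mu_w)

    def objective(beta):
        t = (y - Z @ beta) / (lam_o * np.sqrt(n))
        return lam_o ** 2 * np.sum(huber(t))

    res = minimize(objective, x0=np.zeros(d), method="L-BFGS-B")
    return res.x
\end{verbatim}

With uniform weights $w_i = 1/n$ this reduces to ordinary (unweighted) Huber regression on the centered covariates; the weights returned by ROBUST-WEIGHT instead downweight samples whose covariates look atypical.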
3. Deterministic argument
Let
\[
h(t) = \frac{d}{dt}H(t) =
\begin{cases}
t & (|t| \le 1), \\
\mathrm{sign}(t) & (|t| > 1),
\end{cases}
\qquad
r_i(v) = \frac{y_i - n\hat w_i (x_i - \mu_{\hat w})^\top v}{\lambda_o \sqrt{n}}, \qquad
R_i(v) = \frac{y_i - n\hat w_i (X_i - \mu_{\hat w})^\top v}{\lambda_o \sqrt{n}}
\]
for $v \in \mathbb{R}^d$. We state the main theorem in a deterministic form.

Theorem 3.1.
Consider the optimization problem (2.2). Let $\delta_{\beta_\eta} = \beta_\eta - \beta^* = \eta(\hat\beta - \beta^*)$ for some $\eta \in [0, 1]$. Suppose that
\[
\left| \sum_{i=1}^n \frac{\lambda_o}{\sqrt{n}} \hat w_i\, h\!\left(\frac{\xi_i}{\lambda_o\sqrt{n}}\right) (x_i - \mu_{\hat w})^\top \delta_{\beta_\eta} \right| \le c_3 \|\delta_{\beta_\eta}\|_2, \qquad (3.1)
\]
\[
\left| \sum_{i \in I_o} \frac{\lambda_o}{\sqrt{n}} \hat w_i\, h(r_i(\beta_\eta)) (x_i - \mu_{\hat w})^\top \delta_{\beta_\eta} \right| \le c_3 \|\delta_{\beta_\eta}\|_2, \qquad (3.2)
\]
\[
\left| \sum_{i \in I_o} \frac{\lambda_o}{\sqrt{n}} \hat w_i\, h(R_i(\beta_\eta)) (X_i - \mu_{\hat w})^\top \delta_{\beta_\eta} \right| \le c_3 \|\delta_{\beta_\eta}\|_2, \qquad (3.3)
\]
\[
\sum_{i=1}^n \frac{\lambda_o}{\sqrt{n}} \hat w_i \left( -h(r_i(\beta_\eta)) + h(r_i(\beta^*)) \right) (x_i - \mu_{\hat w})^\top \delta_{\beta_\eta} \ge c_2 \|\delta_{\beta_\eta}\|_2^2 - c_1 \|\delta_{\beta_\eta}\|_2 - c_4, \qquad (3.4)
\]
where $c_1$, $c_2$, $c_3$ and $c_4$ are some positive numbers, and suppose
\[
\frac{c_1 + 3c_3 + \sqrt{c_2 c_4}}{c_2} < r_0 \qquad (3.5)
\]
for some positive number $r_0$. Then, we have $\|\beta^* - \hat\beta\|_2 \le r_0$.

Proof of Theorem 3.1.
In Section 4, the conditions (3.1)-(3.5) are shown to be satisfied with high probability under the conditions in our main theorem (Theorem 4.1). Consequently, our main theorem is proved.
For any fixed $r_0 > 0$, we define $B := \{\beta : \|\beta^* - \beta\|_2 \le r_0\}$. We prove $\hat\beta \in B$ by assuming $\hat\beta \notin B$ and deriving a contradiction. For $\hat\beta \notin B$, we can find some $\eta \in [0,$
1] such that k δ β η k = r . (3.6) asai and Fujisawa/Weighted Huber regression Let Q ′ ( η ) = λ o √ n ˆ w i n X i =1 ( − h ( r i ( β η )) + h ( r i ( β ∗ )))( X i − µ ˆ w ) ⊤ δ ˆ β , where δ ˆ β = ˆ β − β ∗ . From the proof of Lemma F.2. of Fan et al. (2018), we have ηQ ′ ( η ) ≤ ηQ ′ (1) and this means η n X i =1 λ o √ n ˆ w i ( − h ( R i ( β η )) + h ( R i ( β ∗ ))) ( X i − µ ˆ w ) ⊤ δ ˆ β ≤ η n X i =1 λ o √ n ˆ w i (cid:16) − h ( R i ( ˆ β )) + h ( R i ( β ∗ )) (cid:17) ( X i − µ ˆ w ) ⊤ δ ˆ β (3.7)and we have η n X i =1 λ o √ n ˆ w i ( − h ( R i ( β η )) + h ( R i ( β ∗ ))) ( X i − µ ˆ w ) ⊤ δ ˆ β ( a ) = η n X i =1 λ o √ n ˆ w i h ( R i ( β ∗ ))( X i − µ ˆ w ) ⊤ δ ˆ β ( b ) = n X i =1 λ o √ n ˆ w i h ( R i ( β ∗ ))( X i − µ ˆ w ) ⊤ δ β η , (3.8)where (a) follows from the fact that ˆ β is a optimal solution of (2.2) and (b)follows from the definition of δ β η .From (3.7) and (3.8), we have n X i =1 λ o √ n ˆ w i ( − h ( R i ( β η )) + h ( R i ( β ∗ ))) ( X i − µ ˆ w ) ⊤ δ β η ≤ n X i =1 λ o √ n ˆ w i h ( R i ( β ∗ ))( X i − µ ˆ w ) ⊤ δ β η . (3.9) asai and Fujisawa/Weighted Huber regression The left-hand side of (3.9) can be decomposed as n X i =1 λ o √ n ˆ w i ( − h ( R i ( β η )) + h ( R i ( β ∗ ))) ( X i − µ ˆ w ) ⊤ δ β η = X i ∈ I o λ o √ n ˆ w i ( − h ( R i ( β η )) + h ( r i ( β ∗ ))) ( X i − µ ˆ w ) ⊤ δ β η + X i ∈ I G λ o √ n ˆ w i ( − h ( R i ( β η )) + h ( R i ( β ∗ ))) ( X i − µ ˆ w ) ⊤ δ β η = X i ∈ I o λ o √ n ˆ w i ( − h ( R i ( β η )) + h ( r i ( β ∗ ))) ( X i − µ ˆ w ) ⊤ δ β η + X i ∈ I G λ o √ n ˆ w i ( − h ( r i ( β η )) + h ( r i ( β ∗ ))) ( x i − µ ˆ w ) ⊤ δ β η = n X i =1 λ o √ n ˆ w i ( − h ( r i ( β η )) + h ( r i ( β ∗ ))) ( x i − µ ˆ w ) ⊤ δ β η + X i ∈ I o λ o √ n ˆ w i ( − h ( R i ( β η )) + h ( r i ( β ∗ ))) ( X i − µ ˆ w ) ⊤ δ β η − X i ∈ I o λ o √ n ˆ w i ( − h ( r i ( β η )) + h ( r i ( β ∗ ))) ( x i − µ ˆ w ) ⊤ δ β η . (3.10)The right-hand side of (3.9) can be decomposed as n X i =1 λ o √ n ˆ w i h ( R i ( β ∗ ))( X i − µ ˆ w ) ⊤ δ β η = X i ∈ I o λ o √ n ˆ w i h ( R i ( β ∗ ))( X i − µ ˆ w ) ⊤ δ β η + X i ∈ I G λ o √ n ˆ w i h ( r i ( β ∗ ))( x i − µ ˆ w ) ⊤ δ β η = X i ∈ I o λ o √ n ˆ w i h ( R i ( β ∗ ))( X i − µ ˆ w ) ⊤ δ β η + X i ∈ I G λ o √ n ˆ w i h (cid:18) ξ i λ o √ n (cid:19) ( x i − µ ˆ w ) ⊤ δ β η = X i ∈ I o λ o √ n ˆ w i h ( R i ( β ∗ ))( X i − µ ˆ w ) ⊤ δ β η − X i ∈ I o λ o √ n ˆ w i h (cid:18) ξ i λ o √ n (cid:19) ( x i − µ ˆ w ) ⊤ δ β η + n X i =1 λ o √ n ˆ w i h (cid:18) ξ i λ o √ n (cid:19) ( x i − µ ˆ w ) ⊤ δ β η . (3.11) asai and Fujisawa/Weighted Huber regression From (3.9) - (3.11), we have n X i =1 λ o √ n ˆ w i ( − h ( r i ( β η )) + h ( r i ( β ∗ ))) ( x i − µ ˆ w ) ⊤ δ β η ≤ n X i =1 λ o √ n ˆ w i h (cid:18) ξ i λ o √ n (cid:19) ( x i − µ ˆ w ) ⊤ δ β η − X i ∈ I o λ o √ n ˆ w i h ( r i ( β η ))( x i − µ ˆ w ) ⊤ δ β η + X i ∈ I o λ o √ n ˆ w i h ( R i ( β η ))( X i − µ ˆ w ) ⊤ δ β η ≤ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 λ o √ n ˆ w i h (cid:18) ξ i λ o √ n (cid:19) ( x i − µ ˆ w ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)X i ∈ I o λ o √ n ˆ w i h ( r i ( β η ))( x i − µ ˆ w ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)X i ∈ I o λ o √ n ˆ w i h ( R i ( β η ))( X i − µ ˆ w ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) . (3.12)We evaluate each term of (3.12). 
From (3.4), the left hand side of (3.12) isevaluated as n X i =1 λ o √ n ˆ w i ( − h ( r i ( β η )) + h ( r i ( β ∗ ))) ( x i − µ ˆ w ) ⊤ δ β η ≥ c k δ β η k − c k δ β η k − c From (3.1) - (3.3), the right hand side of (3.12) is evaluated as (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 λ o √ n ˆ w i h (cid:18) ξ i λ o √ n (cid:19) ( x i − µ ˆ w ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)X i ∈ I o λ o √ n ˆ w i h ( r i ( β η ))( x i − µ ˆ w ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)X i ∈ I o λ o √ n ˆ w i h ( R i ( β η ))( X i − µ ˆ w ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ c k δ β η k . Consequently, we have c k δ β η k − c k δ β η k − c ≤ c k δ β η k and k δ β k ≤ c + 3 c + p ( c + 3 c ) + 4 c c c ≤ c + 3 c + √ c c c ≤ r . This contradicts k δ β η k = r . Consequently, we have ˆ β ∈ B and k δ β k ≤k ˆ β − β ∗ k ≤ r .
4. Stochastic argument in case of Gaussian design
In this section, we state our main theorem in a stochastic form, assuming randomness in $\{x_i\}_{i=1}^n$ and $\{\xi_i\}_{i=1}^n$. In the following subsection, we confirm that the conditions in Theorem 3.1 are satisfied under the conditions in Theorem 4.1. Let
\[
r_o = \varepsilon\sqrt{\log\frac{1}{\varepsilon}}, \qquad r_d = \sqrt{\frac{d\log d}{n}}, \qquad r'_d = \sqrt{\frac{d}{n}}.
\]

Theorem 4.1.
Consider the optimization problem (2.2). Suppose that $\{x_i\}_{i=1}^n$ is a sequence of i.i.d. random vectors drawn from a sub-Gaussian distribution with unknown mean $\mu$ and identity covariance, and that $\{\xi_i\}_{i=1}^n$ is a sequence of i.i.d. random variables drawn from a distribution whose absolute moment is bounded by $\sigma$. Suppose that $n$ is sufficiently large so that $n = \Omega(d\log d)$ holds and that $\varepsilon$ is sufficiently small. Suppose $\delta$ is sufficiently small.

Let $\lambda_o\sqrt{n} = c_{\max}$ with $c_{\max} = \max\left( m\sigma,\ \sqrt{C_{\mathrm{cov}}}\, m,\ C_{\mathrm{CDG}} \right)$, where $C_{\mathrm{CDG}}$ is defined in Algorithm 2, $m$ is some positive constant such that $\mathbb{E}[((x_i - \mu)^\top v)^4] \le m \left( \mathbb{E}[((x_i - \mu)^\top v)^2] \right)^2$ for any $v \in \mathbb{R}^d$, and $C_{\mathrm{cov}}$ is some positive constant such that
\[
\mathbb{E}\left\| \frac{1}{n}\sum_{i=1}^n (x_i - \mu)(x_i - \mu)^\top \right\|_{\mathrm{op}} \le C_{\mathrm{cov}}.
\]
Let $\hat\beta$ be an optimal solution of (2.2). Then, with probability at least $1 - \delta$, we have
\[
\|\hat\beta - \beta^*\|_2 = O(r_o + r_d).
\]

In the remaining part of this section, we assume that $\{x_i\}_{i=1}^n$ is a sequence of i.i.d. random vectors drawn from a sub-Gaussian distribution with unknown mean $\mu$ and identity covariance, and that $\{\xi_i\}_{i=1}^n$ is a sequence of i.i.d. random variables drawn from a distribution whose absolute moment is bounded by $\sigma$.

For sub-Gaussian design, we use Algorithm 1 of Cheng, Diakonikolas and Ge (2019) as ROBUST-WEIGHT. The weights $\{\hat w_i\}_{i=1}^n$ can be computed by that algorithm from $\{X_i\}_{i=1}^n$ and $\varepsilon$ with polynomial computational complexity, and the algorithm guarantees that $\mu_{\hat w} - \mu$ is close to $0$ in the $\ell_2$ norm with high probability. We briefly introduce Algorithm 1 of Cheng, Diakonikolas and Ge (2019). Let
\[
\Delta_{n,\varepsilon} = \left\{ w \in \mathbb{R}^n : \sum_{i=1}^n w_i = 1 \text{ and } 0 \le w_i \le \frac{1}{(1-\varepsilon)n} \text{ for all } i \right\}. \qquad (4.1)
\]
First, we state the primal-dual SDP used in Algorithm 1 of Cheng, Diakonikolas and Ge (2019). The primal SDP has the following form:
\[
\text{minimize } \lambda_{\max}\left( \sum_{i=1}^n w_i (X_i - \nu)(X_i - \nu)^\top \right) \text{ subject to } w \in \Delta_{n,\varepsilon}, \qquad (4.2)
\]
where $\lambda_{\max}(M)$ is the maximum eigenvalue of $M$ and $\nu \in \mathbb{R}^d$ is a fixed vector. The dual SDP has the following form:
\[
\text{maximize the average of the smallest } (1-\varepsilon) \text{ fraction of } \left\{ (X_i - \nu)^\top M (X_i - \nu) \right\}_{i=1}^n \text{ subject to } M \succeq 0,\ \mathrm{tr}(M) \le 1. \qquad (4.3)
\]
Using the primal-dual SDP above, Algorithm 1 of Cheng, Diakonikolas and Ge (2019) estimates $\mu$ robustly by the following algorithm.

Algorithm 2
Robust Mean Estimation for Known Covariance sub-Gaussian
Require: $\{X_i\}_{i=1}^n \subset \mathbb{R}^d$ and a sufficiently small contamination fraction $\varepsilon > 0$.
Ensure: $\hat w \in \mathbb{R}^n$ such that, with probability at least $1 - \delta$, $\|\mu_{\hat w} - \mu\|_2 \le C_{\mathrm{CDG}}(r_o + r'_d)$, where $C_{\mathrm{CDG}}$ is some positive constant depending on $\delta$, when Lemma 4.1 holds.
1: Let $\nu \in \mathbb{R}^d$ be the coordinate-wise median of $\{X_i\}_{i=1}^n$.
2: For $i = 1$ to $O(\log d)$:
3: Use Proposition 4.1 of Cheng, Diakonikolas and Ge (2019) to compute either (i) a good solution $w \in \mathbb{R}^n$ of the primal SDP (4.2) with parameters $\nu$ and $2\varepsilon$, or (ii) a good solution $M \in \mathbb{R}^{d\times d}$ of the dual SDP (4.3) with parameters $\nu$ and $\varepsilon$.
4: If the objective value of (4.2) is at most $1 + C_{\mathrm{CDG}}(\varepsilon\log(1/\varepsilon) + r'_d)$ (Lemma 4.1), return the weight vector $\hat w = (\hat w_1, \dots, \hat w_n)^\top = w$.
5: Else, move $\nu$ closer to $\mu$ using the top eigenvector of $M$.

First, we state some properties of sub-Gaussian random vectors.
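Before doing so, we note that the sketch below is not Algorithm 2: the actual algorithm relies on the primal-dual SDP machinery and the guarantees of Cheng, Diakonikolas and Ge (2019). The code is only a much simpler spectral reweighting heuristic in the same spirit (repeatedly downweight points with a large component along the top eigenvector of the weighted covariance), included to convey the role that ROBUST-WEIGHT plays in Algorithm 1; the weight update, the termination threshold and all names are our own illustrative choices.

\begin{verbatim}
import numpy as np

def robust_weight_sketch(X, eps, n_iter=50):
    """Toy spectral reweighting; illustrates the goal of ROBUST-WEIGHT, not Algorithm 2."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)
    cap = 1.0 / ((1.0 - eps) * n)                   # keep w inside Delta_{n,eps}
    for _ in range(n_iter):
        mu_w = w @ X
        C = (w[:, None] * (X - mu_w)).T @ (X - mu_w)    # weighted covariance
        eigvals, eigvecs = np.linalg.eigh(C)
        lam, v = eigvals[-1], eigvecs[:, -1]
        if lam <= 1.0 + 10.0 * eps * np.log(1.0 / eps):  # crude analogue of the termination test
            break
        tau = ((X - mu_w) @ v) ** 2                 # outlier scores along the top direction
        w = w * (1.0 - tau / tau.max())             # downweight the most extreme points
        w = np.minimum(w / w.sum(), cap)            # renormalize and respect the weight cap
        w = w / w.sum()
    return w
\end{verbatim}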
Lemma 4.1 (Adopted from Lemmas 4.3 and 4.4 of Diakonikolas et al. (2018)) . Suppose n is sufficiently large so that n = O ( d ) and ε is sufficiently small. Then,for any w ∈ ∆ N,ε , we have (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) X i ∈ I G w i ( x i − µ ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ C g ( r o + r ′ d ) , (4.4) λ max X i ∈ I G w i ( x i − µ )( x i − µ ) ⊤ − I ! ≤ C g ( ε log(1 /ε ) + r ′ d ) , (4.5) where C g is some positive constant depending on δ , with probability at least − δ . Proposition 4.1 (Adopted from Theorem 4 of Koltchinskii and Lounici (2017)) . Assume that (cid:8) x i ∈ R d (cid:9) ni =1 is a sequence with i.i.d. random matrices drawn fromsub-Gaussian distribution with mean µ and assume d/n ≤ . Then, we have E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n X i =1 ( x i − µ )( x i − µ ) ⊤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op ≤ c gcov , (4.6) where c gcov is some positive constant. Let I | m | be a set of index such that the number of the elements of I | m | is m . asai and Fujisawa/Weighted Huber regression Corollary 4.1 (Corollary of Proposition A.3) . Suppose that < ε < / and δ ∈ (0 , / holds. For any vector u = ( u , · · · , u n ) ∈ R n , suppose that any set I | m | such that ≤ m ≤ o , we have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) X i ∈ I | m | u i ( x i − µ ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ C τ s X i ∈ I | m | u i k δ β η k (cid:18) (1 + p log(1 /δ )) + p d log d + r m log nm (cid:19) , (4.7) where C τ is some positive constant, with probability at least − δ . Proposition 4.2 (Bernstein inequality for Huber loss) . Suppose that n is suf-ficiently large so that n = O ( d log d ) . We have n (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X i =1 ( x i − µ ) h (cid:18) ξ i λ o √ n (cid:19)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ ≤ C Bδ r log dn , (4.8) where C Bδ is some positive constant depending on δ , with probability at least − δ . Let { α i } ni =1 be a series of Rademacher random variables. Proposition 4.3.
Suppose that n is sufficiently large so that n = O ( d log d ) .We have n (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X i =1 ( x i − µ ) α i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ C Rδ r ′ d (4.9) with probability at least − δ . Proposition 4.4 (Adopted from exercise 4.6.3 of Vershynin (2018)) . Supposethat n is sufficiently large so that n = O ( d ) . For any vector v ∈ S d − , we have n E " n X i =1 (( x i − µ ) ⊤ v ) ≥ . (4.10) Proposition 4.5.
Let R ( r ) = (cid:8) β ∈ R d | k δ β k = k β − β ∗ k = r (cid:9) ( r ≤ and assume β η ∈ R ( r ) . Assume that (cid:8) x i ∈ R d (cid:9) ni =1 is a sequence with i.i.d.random matrices satisfying E [(( x i − µ ) ⊤ v ) ] ≤ m (cid:0) E [(( x i − µ ) ⊤ v ) ] (cid:1) for any v ∈ S d − , E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n X i =1 ( x i − µ )( x i − µ ) ⊤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op ≤ C cov , k µ − µ ˆ w k ≤ C µ , asai and Fujisawa/Weighted Huber regression where m , C cov and C µ are some positive constants. Assume that { ξ i } ni =1 is a se-quence with i.i.d. random variables drawn from a distribution whose absolute mo-ment is bounded by σ . Suppose λ o √ n ≥ c max = (cid:16) σ m , p C cov m , C µ (cid:17) .Then, with probability at least − δ , we have λ o n X i =1 − h ξ i − ˆ w i n ( x i − µ ˆ w ) ⊤ δ β η λ o √ n ! + h (cid:18) ξ i λ o √ n (cid:19)! ˆ w i √ n ( x i − µ ˆ w ) ⊤ δ β η λ o ≥ a n n X i =1 E (cid:2) (( x i − µ ) ⊤ δ β η ) (cid:3) − a k δ β η k − a + B + C, where a = 18 ,a = λ o √ n E "(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X i =1 α i ( x i − µ ) n (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) + λ o q E [ k ( x i − µ ) k ] p /δ ) + λ o √ n k µ − µ ˆ w k ,a = λ o /δ ) ,B = X i ∈ I G ∩ I ˆ w< + X i ∈ I o (cid:18) − h (cid:18) ξ i λ o √ n − ˆ w i √ n ( x i − µ ˆ w ) ⊤ δ β λ o (cid:19) + h (cid:18) ξ i λ o √ n (cid:19)(cid:19) ˆ w i √ n ( x i − µ ˆ w ) ⊤ δ β η λ o ,C = X i ∈ I G ∩ I ˆ w< + X i ∈ I o (cid:18) − h (cid:18) ξ i λ o √ n − ( x i − µ ˆ w ) ⊤ δ β λ o √ n (cid:19) + h (cid:18) ξ i λ o √ n (cid:19)(cid:19) ( x i − µ ˆ w ) ⊤ δ β λ o √ n . Next, we state some properties obtained by Algorithm 2. We note that theproof of Lemmas 3.1 - 3.3, Proposition 4.1 and Theorem 1.2 of Cheng, Di-akonikolas and Ge (2019) is valid even if we replace the order of δ , δ and β inSection 3 of Cheng, Diakonikolas and Ge (2019) as δ = ε p log(1 /ε ) + r ′ d , δ = ε log(1 /ε ) + r ′ d , β = p ε log(1 /ε ) + r ′ d and consequently, we have Propositions 4.6, 4.7 and 4.8. Proposition 4.6 (Lemma 3.2 of Cheng, Diakonikolas and Ge (2019)) . Supposethat < ε < / , (4.4) and (4.5) hold. We have k µ ˆ w − µ k ≤ C CDG ( r o + r ′ d ) . (4.11) when Algorithm 2 succeeds . Proposition 4.7 (From termination condition of Algorithm 2) . Suppose that < ε < / , (4.4) and (4.5) hold. We have λ max n X i =1 ˆ w i ( X i − ν )( X i − ν ) ⊤ ! ≤ C CDG ( ε log(1 /ε ) + r ′ d ) (4.12) for some vector ν ∈ R d in (4.3) when Algorithm 2 succeeds. asai and Fujisawa/Weighted Huber regression About ν in Proposition 4.7, we have the following Lemma Proposition 4.8 (From termination condition of Algorithm 2 and Lemma 3.1of Cheng, Diakonikolas and Ge (2019)) . Suppose that < ε < / , (4.4) and (4.5) hold. We have k ν − µ k ≤ C CDG ( p ε log(1 /ε ) + r ′ d ) (4.13) for ν in Proposition 4.7 when Algorithm 2 succeeds . Let I w < and I w ≥ be the sets of the index such that w i < n and w i ≥ n ,respectively. Lemma 4.2.
For w i ∈ ∆ n,ε , we have | I w < | ≤ o .Proof. We assume | I w < | > o , and then we derive a contradiction. From theconstraint about w i , we have 0 ≤ w i ≤ − ε ) n and n X i =1 w i = X i ∈ I w< w i + X i ∈ I w ≥ w i ≤ | I < | × n + ( n − | I < | ) × − ε ) n = 2 o × n + ( | I < | − o ) × n + ( n − o ) × − ε ) n + (2 o − | I < | ) × − ε ) n = 2 o × n + ( n − o ) × − ε ) n + ( | I < | − o ) × (cid:18) n − − ε ) n (cid:19) < o × n + ( n − o ) × − ε ) n = 1 − ε ≤ . This is a contradiction to the constraint about { w i } ni =1 , P ni =1 w i = 1. In this section, we confirm (3.1) - (3.4) under the conditions in Theorem 4.1.We note that when (4.4) and (4.5) hold, Algorithm 2 succeeds (Algorithm 2computes { ˆ w i } ni =1 such that (4.11) is satisfied) with probability at least (1 − δ ) from the proof of Theorem 1.2 in Cheng, Diakonikolas and Ge (2019). In lightof this fact, Lemmas 4.1 and 4.2, Corollary 4.1, Propositions 4.1 - 4.8 hold withprobability at least (1 − δ ) under the conditions in Theorem 4.1. We also notethat from Proposition 4.6 and assumptions in Theorem 4.1, we have r o , r d ≤ C µ = 2 C CDG , where C µ is defined in Proposition A.5. Fromthe definition of λ o , conditions assumed in Theorem 4.1 and Proposition 4.1, wesee that Proposition 4.5 holds.In Section 4.2, to omit to state “with probability at least”, we assume thatinequalities in Lemmas 4.1 and 4.2, Corollary 4.1, Propositions 4.1 - 4.8 holdand Algorithm 2 succeeds. asai and Fujisawa/Weighted Huber regression Proposition 4.9 (Confirmation of (3.3)) . We have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)X i ∈ I o ˆ w i h ( R i ( β η ))( X i − µ ˆ w ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ C cond ( r o + r d ) b k δ β η k , where C cond = (cid:16) p C g + 2 p C CDG + √ C CDG + 2 p C CDG C g + 2 √ C CDG (cid:17) . Proof.
From X i − µ ˆ w = ( X i − ν ) + ( ν − µ ) + ( µ − µ ˆ w ), we have X i ∈ I o ˆ w i h ( R i ( β η ))( X i − µ ˆ w ) ⊤ δ β η = X i ∈ I o ˆ w i h ( R i ( β η ))( X i − ν ) ⊤ δ β η + X i ∈ I o ˆ w i h ( R i ( β η ))( ν − µ ) ⊤ δ β η + X i ∈ I o ˆ w i h ( R i ( β η ))( µ − µ ˆ w ) ⊤ δ β η . (4.14)Let M i = ( X i − ν )( X i − ν ) ⊤ . For the first term of the R.H.S of (4.14), wehave (X i ∈ I o ˆ w i h ( R i ( β η ))( X i − ν ) ⊤ δ β η ) a ) ≤ X i ∈ I o ˆ w i X i ∈ I o ˆ w i | ( X i − ν ) ⊤ δ β η | b ) ≤ ε X i ∈ I o ˆ w i | ( X i − ν ) ⊤ δ β η | = 2 εδ ⊤ β η X i ∈ I o ˆ w i M i δ β η = 2 ε δ ⊤ β η n X i =1 ˆ w i M i δ β η − δ ⊤ β η X i ∈ I G ˆ w i M i δ β η ! ≤ ελ max ( M ) k δ β η k − εδ ⊤ β η X i ∈ I G ˆ w i M i δ β η ! (4.15)where (a) follows from ˆ w i = √ ˆ w i √ ˆ w i and 0 ≤ h ( R i ( β η )) ≤
1, (b) follows fromthe constraint of ˆ w i and P i ∈ I o ˆ w i ≤ o (1 − ε ) n ≤ ε . For the last term of R.H.S of asai and Fujisawa/Weighted Huber regression (4.15), we have δ ⊤ β η X i ∈ I G ˆ w i M i δ β η = δ ⊤ β η X i ∈ I G ˆ w i ( x i − ν )( x i − ν ) ⊤ δ β η = δ ⊤ β η X i ∈ I G ˆ w i ( x i − µ − ν + µ )( x i − µ − ν + µ ) ⊤ δ β η = δ ⊤ β η X i ∈ I G ˆ w i ( x i − µ )( x i − µ ) ⊤ δ β η + δ ⊤ β η X i ∈ I G ˆ w i ( µ − ν )( µ − ν ) ⊤ δ β η + 2 δ ⊤ β η X i ∈ I G ˆ w i ( x i − µ )( µ − ν ) ⊤ δ β η ( a ) ≥ (1 − C g ( ε log(1 /ε ) + r ′ d )) k δ β η k + δ ⊤ β η X i ∈ I G ˆ w i ( µ − ν )( µ − ν ) ⊤ δ β η + 2 δ ⊤ β η X i ∈ I G ˆ w i ( x i − µ )( µ − ν ) ⊤ δ β η , (4.16)where (a) follows from (4.5). From (4.15) and (4.16), Proposition 4.7, Lemmas ?? and 4.3, we have12 ε (X i ∈ I o ˆ w i h ( R i ( δ β η ))( X i − ν ) ⊤ δ β η ) ≤ λ max ( M ) k δ β η k − k δ β η k + C g ( ε log(1 /ε ) + r ′ d ) k δ β η k − δ ⊤ β η X i ∈ I G ˆ w i ( µ − ν )( µ − ν ) ⊤ δ β η − δ ⊤ β η X i ∈ I G ˆ w i ( x i − µ )( µ − ν ) ⊤ δ β η ≤ λ max ( M − I ) k δ β η k + C g ( ε log(1 /ε ) + r ′ d ) k δ β η k + (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) δ ⊤ β η X i ∈ I G ˆ w i ( µ − ν )( µ − ν ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + 2 (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) δ ⊤ β η X i ∈ I G ˆ w i ( x i − µ )( µ − ν ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ C CDG ( ε log(1 /ε ) + r ′ d ) k δ β η k + C g ( ε log(1 /ε ) + r ′ d ) k δ β η k + C CDG ( p ε log(1 /ε ) + r d ) k δ β η k + 2 C CDG C g ( p ε log(1 /ε ) + r d ) k δ β η k a ) ≤ C CDG ( ε log(1 /ε ) + r d ) k δ β η k + C g ( ε log(1 /ε ) + r d ) k δ β η k + C CDG ( p ε log(1 /ε ) + r d ) k δ β η k + 2 C CDG C g ( p ε log(1 /ε ) + r d ) k δ β η k , asai and Fujisawa/Weighted Huber regression where (a) follows from r ′ d ≤ r d . From triangular inequality, we have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)X i ∈ I o ˆ w i h ( R i ( δ β η ))( X i − ν ) ⊤ δ β (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ p C CDG ε ( ε log(1 /ε ) + r d ) + q C g ε ( ε log(1 /ε ) + r d ) k δ β η k + q C CDG ε ( p ε log(1 /ε ) + r d ) k δ β η k + q C CDG C g ε ( p ε log(1 /ε ) + r d ) k δ β η k ≤ ( p C g + p C CDG )( r o + √ εr d ) k δ β η k + √ C CDG ( r o + √ εr d ) k δ β η k + 2 p C CDG C g ( r o + √ εr d ) k δ β η k a ) ≤ ( p C g + p C CDG (2 r o + r d ) k δ β η k + √ C CDG ( r o + r d ) k δ β η k + 2 p C CDG C g ( r o + r d ) k δ β η k ≤ (cid:16) p C g + 2 p C CDG + √ C CDG + 2 p C CDG C g (cid:17) ( r o + r d ) k δ β η k , (4.17)where (a) follows from ε < √ ab ≤ a + b for positive number a, b .For the second term of the R.H.S of (4.14), we have (X i ∈ I o ˆ w i h ( R i ( β η ))( ν − µ ) ⊤ δ β η ) ≤ X i ∈ I o ˆ w i X i ∈ I o ˆ w i | ( ν − µ ) ⊤ δ β η | a ) ≤ ε X i ∈ I o ˆ w i | ( ν − µ ) ⊤ δ β η | = 2 εδ ⊤ β η X i ∈ I o ˆ w i ( ν − µ )( ν − µ ) ⊤ δ β η ( b ) ≤ ε k ν − µ k k δ β η k c ) ≤ C CDG ε ( p ε log(1 /ε ) + r ′ d ) k δ β η k ≤ C CDG ε ( p ε log(1 /ε ) + r d ) k δ β η k , where (a) follows from 0 ≤ ˆ w i ≤ − ε ) n ≤ n and P i ∈ I o ˆ w i ≤ ε , (b) follows from P i ∈ I o ˆ w i ≤ P ni =1 ˆ w i = 1 and (c) follows from Proposition 4.8. Consequently,we have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)X i ∈ I o ˆ w i h ( R i ( β η ))( ν − µ ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ √ C CDG √ ε ( p ε log(1 /ε ) + r d ) k δ β η k ≤ √ C CDG ( r o + √ εr d ) k δ β η k ≤ √ C CDG ( r o + r d ) k δ β η k . 
(4.18) asai and Fujisawa/Weighted Huber regression For the last term of the R.H.S of (4.14), we have (X i ∈ I o ˆ w i h ( R i ( β η ))( µ − µ ˆ w ) ⊤ δ β η ) ≤ X i ∈ I o ˆ w i X i ∈ I o ˆ w i | ( µ − µ ˆ w ) ⊤ δ β η | a ) ≤ ε X i ∈ I o ˆ w i | ( µ − µ ˆ w ) ⊤ δ β η | = 2 δ ⊤ β η ε X i ∈ I o ˆ w i ( µ − µ ˆ w )( µ − µ ˆ w ) ⊤ δ β η ≤ δ ⊤ β η X i ∈ I o ˆ w i ( µ − µ ˆ w )( µ − µ ˆ w ) ⊤ δ β η ( b ) ≤ δ ⊤ β η ( µ − µ ˆ w )( µ − µ ˆ w ) ⊤ δ β η = 2 k µ − µ ˆ w k k δ β η k c ) ≤ C CDG ( r o + r ′ d ) k δ β η k ≤ C CDG ( r o + r d ) k δ β η k , where (a) follows from 0 ≤ ˆ w i ≤ − ε ) n ≤ n , (b) follows from P i ∈ I o ˆ w i ≤ P ni =1 ˆ w i = 1, (c) follows from Proposition 4.6 and P i ∈ I o ˆ w i ≤ ε . Consequently,we have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)X i ∈ I o ˆ w i h ( R i ( β η ))( µ − µ ˆ w ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ √ C CDG ( r o + r d ) k δ β η k . (4.19)From (4.17), (4.18) and (4.19), we have X i ∈ I o ˆ w i h ( R i ( β η ))( X i − µ ˆ w ) ⊤ δ β η ≤ (cid:16) p C g + 2 p C CDG + √ C CDG + 2 p C CDG C g + 2 √ C CDG (cid:17) ( r o + r d ) k δ β η k . Lemma 4.3.
We have δ ⊤ β η X i ∈ I G ˆ w i ( x i − µ )( µ − ν ) ⊤ δ β η ≤ C g C CDG ( p ε log(1 /ε ) + r d ) k δ β η k . asai and Fujisawa/Weighted Huber regression Proof. δ ⊤ β η X i ∈ I G ˆ w i ( x i − µ )( µ − ν ) ⊤ δ β η ≤ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) δ ⊤ β η X i ∈ I G ˆ w i ( x i − µ )( µ − ν ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) δ ⊤ β η X i ∈ I G ˆ w i ( x i − µ ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:12)(cid:12) ( µ − ν ) ⊤ δ β η (cid:12)(cid:12) ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) X i ∈ I G ˆ w i ( x i − µ ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) k µ − ν k k δ β k a ) ≤ C g ( ε p log(1 /ε ) + r ′ d ) C CDG ( p ε log(1 /ε ) + r ′ d ) k δ β η k ≤ C g C CDG ( p ε log(1 /ε ) + r ′ d ) k δ β η k ≤ C g C CDG ( p ε log(1 /ε ) + r d ) k δ β η k , where (a) follows from (4.4) and Proposition 4.8. Proposition 4.10 (Confirmation of (3.1)) . We have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 ˆ w i h (cid:18) ξ i λ o √ n (cid:19) ( x i − µ ˆ w ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ C δ ( r o + r d ) k δ β η k , where C δ is some positive constant depending on δ .Proof. From triangular inequality, we have λ o √ n (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 ˆ w i h (cid:18) ξ i λ o √ n (cid:19) ( x i − µ ˆ w ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ( a ) ≤ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 ˆ w i h (cid:18) ξ i λ o √ n (cid:19) ( µ − µ ˆ w ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 ˆ w i h (cid:18) ξ i λ o √ n (cid:19) ( x i − µ ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ( b ) ≤ √ C CDG ( r o + r d ) k δ β η k + (cid:16) C τ p log(1 /δ ) + C Bδ + 2 C τ (cid:17) ( r o + r d ) k δ β η k = (cid:16) √ C CDG + C τ p log(1 /δ ) + C Bδ + 2 C τ (cid:17) ( r o + r d ) k δ β η k = C δ ( r o + r d ) k δ β η k , where (a) follows from the triangular inequality and (b) follows from Lemmas4.4 and 4.5. Lemma 4.4.
We have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 ˆ w i h (cid:18) ξ i λ o √ n (cid:19) ( µ − µ ˆ w ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ √ C CDG ( r o + r d ) k δ β η k . asai and Fujisawa/Weighted Huber regression Proof. ( n X i =1 ˆ w i h (cid:18) ξ i λ o √ n (cid:19) ( µ − µ ˆ w ) ⊤ δ β η ) a ) ≤ n X i =1 ˆ w i n X i =1 ˆ w i | ( µ − µ ˆ w ) ⊤ δ β η | b ) ≤ n X i =1 ˆ w i | ( µ − µ ˆ w ) ⊤ δ β η | = 2 δ ⊤ β η n X i =1 ˆ w i ( µ − µ ˆ w )( µ − µ ˆ w ) ⊤ δ β η ( c ) ≤ k µ − µ ˆ w k k δ β η k d ) ≤ C CDG ( r o + r ′ d ) k δ β η k ≤ C CDG ( r o + r d ) k δ β η k , where (a) follows from ˆ w i = √ ˆ w i √ ˆ w i and 0 ≤ h ( R i ( β η )) ≤
1, (b) followsfrom 0 ≤ ˆ w i ≤ − ε ) n ≤ n and P i ∈ I o ˆ w i ≤ ε , (c) follows from P i ∈ I o ˆ w i ≤ P ni =1 ˆ w i = 1 and (d) follows from Proposition 4.6. Then, we have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 ˆ w i h (cid:18) ξ i λ o √ n (cid:19) ( µ − µ ˆ w ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ √ C CDG ( r o + r d ) k δ β η k . Lemma 4.5.
We have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 ˆ w i h (cid:18) ξ i λ o √ n (cid:19) ( x i − µ ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:16) C τ p log(1 /δ ) + C Bδ + 2 C τ (cid:17) ( r o + r d ) k δ β η k . Proof.
We have n X i =1 ˆ w i h (cid:18) ξ i λ o √ n (cid:19) ( x i − µ ) ⊤ δ β η ( a ) ≤ n max I | n − o | X i ∈ I | n − o | h (cid:18) ξ i λ o √ n (cid:19) ( x i − µ ) ⊤ δ β η = 1 n n X i =1 h (cid:18) ξ i λ o √ n (cid:19) ( x i − µ ) ⊤ δ β η − n min I | o | X i ∈ I | o | h (cid:18) ξ i λ o √ n (cid:19) ( x i − µ ) ⊤ δ β η ≤ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n n X i =1 h (cid:18) ξ i λ o √ n (cid:19) ( x i − µ ) ⊤ δ βη (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i ∈ I o h (cid:18) ξ i λ o √ n (cid:19) ( x i − µ ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) , where (a) follows from the claim (iii) of Lemma 1 of Dalalyan and Minasyan asai and Fujisawa/Weighted Huber regression (2020). From Proposition 4.2, we have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 n h (cid:18) ξ i λ o √ n (cid:19) ( x i − µ ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X i =1 n h (cid:18) ξ i λ o √ n (cid:19) ( x i − µ ) ⊤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ k δ β η k ≤ C Bδ r log dn k δ β η k ≤ C Bδ r d k δ β η k and, we have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)X i ∈ I o h (cid:18) ξ i λ o √ n (cid:19) ( x i − µ ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ( a ) ≤ C τ vuutX i ∈ I o h (cid:18) ξ i λ o √ n (cid:19) k δ β η k (cid:16) (1 + p log(1 /δ )) + p d log d + √ o p log(1 /ε ) (cid:17) ( b ) ≤ C τ √ o k δ β η k (cid:16) (1 + p log(1 /δ )) + p d log d + √ o p log(1 /ε ) (cid:17) , where (a) follows from Corollary 4.1 and (b) follows from − ≤ h (cid:16) ξ i λ o √ n (cid:17) ≤ n X i =1 ˆ w i h (cid:18) ξ i λ o √ n (cid:19) ( x i − µ ) ⊤ δ β η ≤ C Bδ r d + C τ r on (1 + p log(1 /δ )) √ n + r d + √ ε p log(1 /ε ) !! k δ β η k a ) ≤ C τ (1 + p log(1 /δ )) √ n + ( C Bδ + C τ ) r d + C τ r o ! k δ β η k ≤ (cid:18) C τ (1 + p log(1 /δ )) d log dn + ( C Bδ + C τ ) r d + C τ r o (cid:19) k δ β η k b ) ≤ (cid:16) C τ p log(1 /δ ) + C Bδ + 2 C τ (cid:17) ( r o + r d ) k δ β η k , where (a) follows from o/n ≤ r d ≤ Proposition 4.11 (Confirmation of (3.2)) . We have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)X i ∈ I o ˆ w i h ( r i ( β η ))( x i − µ ˆ w ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ C δ ( r o + r d ) k δ β η k , where C δ is some positive constant depending on δ . asai and Fujisawa/Weighted Huber regression Proof.
We have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)X i ∈ I o ˆ w i h ( r i ( β η ))( x i − µ ˆ w ) ⊤ δ β (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ( a ) ≤ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)X i ∈ I o ˆ w i h ( r i ( β η ))( µ − µ ˆ w ) ⊤ δ β (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)X i ∈ I o ˆ w i h ( r i ( β η ))( x i − µ ) ⊤ δ β (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ( b ) ≤ √ C CDG ( r o + r d ) k δ β η k + 2 C τ (cid:16) √ p log(1 /δ ) (cid:17) ( r o + r d ) k δ β η k = (cid:16) C CDG + 2 C τ (cid:16) p log(1 /δ ) (cid:17)(cid:17) ( r o + r d ) k δ β η k = C δ ( r o + r d ) k δ β η k , where (a) follows from the triangular inequality and (b) follows from Lemmas4.6 and 4.7. Lemma 4.6.
We have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)X i ∈ I o ˆ w i h ( r i ( β η ))( x i − µ ) ⊤ δ β (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ √ C CDG ( r o + r d ) k δ β η k . Proof. (X i ∈ I o ˆ w i h (cid:18) ξ i λ o √ n (cid:19) ( µ − µ ˆ w ) ⊤ δ β η ) a ) ≤ X i ∈ I o ˆ w i X i ∈ I o ˆ w i | ( µ − µ ˆ w ) ⊤ δ β η | b ) ≤ ε X i ∈ I o ˆ w i | ( µ − µ ˆ w ) ⊤ δ β η | = 2 εδ ⊤ β η X i ∈ I o ˆ w i ( µ − µ ˆ w )( µ − µ ˆ w ) ⊤ δ β η ≤ δ ⊤ β η X i ∈ I o ˆ w i ( µ − µ ˆ w )( µ − µ ˆ w ) ⊤ δ β η ( c ) ≤ k µ − µ ˆ w k k δ β η k d ) ≤ C CDG ( r o + r ′ d ) k δ β η k d ) ≤ C CDG ( r o + r d ) k δ β η k , where (a) follows from ˆ w i = √ ˆ w i √ ˆ w i and 0 ≤ h ( R i ( β η )) ≤
1, (b) followsfrom 0 ≤ ˆ w i ≤ − ε ) n ≤ n and P i ∈ I o ˆ w i ≤ ε , (c) follows from P i ∈ I o ˆ w i ≤ P ni =1 ˆ w i = 1, (d) follows from Proposition 4.6. Lemma 4.7.
We have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)X i ∈ I o ˆ w i h ( r i ( β η ))( x i − µ ) ⊤ δ β (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ C τ (cid:16) p log(1 /δ ) (cid:17) ( r o + r d ) k δ β η k . asai and Fujisawa/Weighted Huber regression Proof.
We have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)X i ∈ I o ˆ w i h ( r i ( β η ))( x i − µ ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ( a ) ≤ C τ sX i ∈ I o ˆ w i h ( r i ( β η )) (cid:18) (1 + p log(1 /δ )) + p d log d + √ o r log no (cid:19) k δ β η k b ) ≤ C τ √ on (cid:18) (1 + p log(1 /δ )) + p d log d + √ o r log no (cid:19) k δ β η k = 2 C τ (1 + p log(1 /δ )) √ n r on + r on r d + r o ! k δ β η k c ) ≤ C τ (1 + p log(1 /δ )) √ n + r d + r o ! k δ β η k ≤ C τ (1 + p log(1 /δ )) r d log dn + r d + r o ! k δ β η k ≤ C τ (cid:16) p log(1 /δ ) (cid:17) ( r o + r d ) k δ β η k , where (a) follows from Corollary 4.1, (b) follow from 0 ≤ ˆ w i ≤ − ε ) n ≤ n and P i ∈ I o ˆ w i ≤ ε and − ≤ h ( r i ( β η )) ≤ on ≤ Proposition 4.12 (Confirmation of (3.4)) . We have λ o n X i =1 − h ξ i − ˆ w i n ( x i − µ ˆ w ) ⊤ δ β η λ o √ n ! + h (cid:18) ξ i λ o √ n (cid:19)! ˆ w i √ n ( x i − µ ˆ w ) ⊤ δ β η λ o ≥ k δ β η k − λ o √ nC cond ( r o + r d ) k δ β η k − /δ ) λ o , where C cond = 3 √ C CDG + C δ + C δ + C δ and C δ is some positive constantdepending on δ and C δ and C δ are defined in Lemmas 4.8 and 4.9, respectively.Proof. From Proposition A.5, λ o n X i =1 (cid:18) − h (cid:18) ξ i − ˆ w i n ( x i − µ ˆ w ) ⊤ δ β λ o √ n (cid:19) + h (cid:18) ξ i λ o √ n (cid:19)(cid:19) ˆ w i √ n ( x i − µ ˆ w ) ⊤ δ β η λ o ≥ a n E " n X i =1 (( x i − µ ) ⊤ δ β η ) − a k δ β η k − a + B + C, (4.20) asai and Fujisawa/Weighted Huber regression where a = 18 ,a = λ o √ n E "(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X i =1 α i ( x i − µ ) n (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) + λ o q E [ k ( x i − µ ) k ] r δ + λ o √ n k µ − µ ˆ w k ,a = λ o δB = λ o √ n X i ∈ I G ∩ I ˆ w< + X i ∈ I o (cid:18) − h (cid:18) ξ i λ o √ n − ˆ w i √ n ( x i − µ ˆ w ) ⊤ δ β λ o (cid:19) + h (cid:18) ξ i λ o √ n (cid:19)(cid:19) ˆ w i ( x i − µ ˆ w ) ⊤ δ β η C = λ o √ n X i ∈ I G ∩ I ˆ w< + X i ∈ I o (cid:18) − h (cid:18) ξ i λ o √ n − ( x i − µ ˆ w ) ⊤ δ β λ o √ n (cid:19) + h (cid:18) ξ i λ o √ n (cid:19)(cid:19) ( x i − µ ˆ w ) ⊤ δ β η . We evaluate each term of (4.20) and then we prove the proposition. FromProposition 4.4, we have a n E " n X i =1 (( x i − µ ) ⊤ δ β η ) ≥ k δ β η k . From Propositions 4.3 and 4.6, we have a ≤ λ o √ n C Rδ r dn + r δ r dn + C CDG
16 ( r o + r ′ d ) ! ≤ λ o √ n C Rδ r dn + r δ r dn + C CDG
16 ( r o + r d ) ! ≤ λ o √ n (cid:18) ( C Rδ + p /δ )) r d + C CDG
16 ( r o + r d ) (cid:19) ≤ λ o √ nC δ ( r o + r d ) . The terms B and C are bounded above from Lemmas 4.8 and 4.9. Combiningthe bounds, (4.20) can be bounded from below as follows: λ o √ n n X i =1 (cid:18) − h (cid:18) ξ i − ˆ w i n ( x i − µ ˆ w ) ⊤ δ β λ o √ n (cid:19) + h (cid:18) ξ i λ o √ n (cid:19)(cid:19) ˆ w i ( x i − µ ˆ w ) ⊤ δ β η ≥ k δ β η k − λ o √ nC δ ( r o + r d ) − /δ ) λ o − λ o √ n (cid:16) √ C CDG + C δ + C δ (cid:17) ( r o + r d ) k δ β η k ≥ k δ β η k − λ o √ n (cid:16) √ C CDG + C δ + C δ + C δ (cid:17) ( r o + r d ) k δ β η k − /δ ) λ o . The following two lemmas are used in the proof of Proposition 4.12. asai and Fujisawa/Weighted Huber regression Lemma 4.8.
We have X i ∈ I G ∩ I ˆ w< + X i ∈ I o − h ξ i λ o √ n − ˆ w i √ n ( x i − µ ˆ w ) ⊤ δ β η λ o ! + h (cid:18) ξ i λ o √ n (cid:19)! ˆ w i ( x i − µ ˆ w ) ⊤ δ β η ≤ ( √ C CDG + C δ )( r o + r d ) k δ β η k , where C δ is some positive constant depending on δ .Proof. Let h i = − h ξ i λ o √ n − ˆ w i √ n ( x i − µ ˆ w ) ⊤ δ β η λ o ! + h (cid:18) ξ i λ o √ n (cid:19) . We have two properties of { ˆ w i } ni =1 , as described on (4.21) and (4.22). From0 ≤ ˆ w i ≤ n for i ∈ I w < , − ≤ h i ≤ X i ∈ I G ∩ I ˆ w< ( h i ˆ w i ) ≤ X i ∈ I ˆ w< ( h i ˆ w i ) ≤ on , X i ∈ I o ( h i ˆ w i ) ≤ on and we have vuuut X i ∈ I G ∩ I ˆ w< + X i ∈ I o ( h i ˆ w i ) ≤ √ √ on . (4.21)From 0 ≤ ˆ w i ≤ − ε ) n and Lemma 4.2, we have X i ∈ I G ∩ I ˆ w< ˆ w i ≤ X i ∈ I ˆ w< ˆ w i ≤ o (1 − ε ) n , X i ∈ I o ˆ w i ≤ o (1 − ε ) n and from that ε is small constant, we have − ε ) ≤ X i ∈ I G ∩ I ˆ w< + X i ∈ I o ˆ w i ≤ o (1 − ε ) n ≤ ε. (4.22)From x i − µ ˆ w = x i − µ + µ − µ ˆ w , we have X i ∈ I G ∩ I ˆ w< + X i ∈ I o − h ξ i λ o √ n − ˆ w i √ n ( x i − µ ˆ w ) ⊤ δ β η λ o ! + h (cid:18) ξ i λ o √ n (cid:19)! ˆ w i ( x i − µ ˆ w ) ⊤ δ β η = X i ∈ I G ∩ I ˆ w< + X i ∈ I o h i ˆ w i ( x i − µ ) ⊤ δ β η + X i ∈ I G ∩ I ˆ w< + X i ∈ I o h i ˆ w i ( µ − µ ˆ w ) ⊤ δ β η (4.23) asai and Fujisawa/Weighted Huber regression For the first term of the R.H.S of (4.23), we have X i ∈ I G ∩ I ˆ w< + X i ∈ I o h i ˆ w i ( x i − µ ) ⊤ δ β η ( a ) ≤ C τ vuuut X i ∈ I G ∩ I ˆ w< + X i ∈ I o h i ˆ w i (cid:18) (1 + p log(1 /δ )) + p d log d + √ o r log no (cid:19) k δ β η k b ) ≤ √ C τ √ on (cid:18) (1 + p log(1 /δ )) + p d log d + √ o r log no (cid:19) k δ β η k c ) ≤ √ C τ (1 + p log(1 /δ )) √ n + r d + r o ! k δ β η k ≤ √ C τ (1 + p log(1 /δ )) r d log dn + r d + r o ! k δ β η k = √ C τ (cid:16) (2 + p log(1 /δ )) r d + r o (cid:17) k δ β η k ≤ C δ ( r o + r d ) k δ β η k , where (a) follows from Corollary 4.1, (b) follows from (4.21) and (c) follows from o/n ≤ X i ∈ I G ∩ I ˆ w< + X i ∈ I o h i ˆ w i ( µ − µ ˆ w ) ⊤ δ β ≤ X i ∈ I G ∩ I ˆ w< + X i ∈ I o ˆ w i X i ∈ I G ∩ I ˆ w< + X i ∈ I o ˆ w i | ( µ − µ ˆ w ) ⊤ δ β η | a ) ≤ ε X i ∈ I G ∩ I ˆ w< + X i ∈ I o ˆ w i | ( µ − µ ˆ w ) ⊤ δ β η | = 6 εδ ⊤ β η X i ∈ I G ∩ I ˆ w< + X i ∈ I o ˆ w i ( µ − µ ˆ w )( µ − µ ˆ w ) ⊤ δ β η ≤ εδ ⊤ β η n X i =1 ˆ w i ( µ − µ ˆ w )( µ − µ ˆ w ) ⊤ δ β η ( b ) ≤ C CDG ( r o + r d ) k δ β η k , where (a) follows from (4.22) and (b) follows from Proposition 4.6, and we have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) X i ∈ I G ∩ I ˆ w< + X i ∈ I o h i ˆ w i ( µ − µ ˆ w ) ⊤ δ β (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ √ C CDG ( r o + r d ) k δ β η k . Consequently, we have X G ∈ I G ∩ I ˆ w< + X i ∈ I o h i ˆ w i ( x i − µ ˆ w ) ⊤ δ β η ≤ C δ ( r o + r d ) k δ β η k + √ C CDG ( r o + r d ) k δ β η k . asai and Fujisawa/Weighted Huber regression Lemma 4.9.
We have λ o √ n X G ∈ I G ∩ I ˆ w< + X i ∈ I o − h ξ i λ o √ n − ( x i − µ ˆ w ) ⊤ δ β η λ o √ n ! + h (cid:18) ξ i λ o √ n (cid:19)! ( x i − µ ˆ w ) ⊤ δ β η ≤ λ o √ n (cid:16) √ C CDG + C δ (cid:17) ( r o + r d ) k δ β η k , where C δ is some positive constant depending on δ . Because the proof of Lemma 4.9 is almost the same of the proof of Lemma4.8, we omit the proof of Lemma 4.9.
From Proposition 4.9, 4.10, 4.11 and 4.12, conditions (3.1), (3.2), (3.3) and (3.4)are satisfied by c = λ o √ n (cid:16) C δ + C δ + 2 √ C CDG + C cond (cid:17) ( r o + r d ) ,c = 116 ,c = λ o √ nC cond ( r o + r d ) ,c = 5 λ o log(1 /δ ) = 5 c n with probability at least (1 − δ ) .We note that, from the definition of λ o and Proposition 4.1, λ o √ n = c max ,which is a constant with C cov = c gcov and C cond and C cond are also constant.Consequently, we have6 c + c + √ c c c ≤ λ o √ n (cid:18) C δ + C δ + 2 √ C CDG + C cond + C cond (cid:19) ( r o + r d ) + c max r
516 1 √ n ! ≤ λ o √ n (cid:18) C δ + C δ + 2 √ C CDG + C cond + C cond (cid:19) ( r o + r d ) + c max r r d ! ≤ λ o √ n (cid:18) C δ + C δ + 2 √ C CDG + C cond + C cond (cid:19) + c max r ! ( r o + r d ) . The proof of Theorem 4.1 is complete.
References
Bakshi, A. and Prasad, A. (2020). Robust linear regression: Optimal rates in polynomial time. arXiv preprint arXiv:2007.01394.
Bellec, P. C. (2019). Localized Gaussian width of M-convex hulls with applications to Lasso and convex aggregation. Bernoulli.
Boucheron, S., Lugosi, G. and Massart, P. (2013). Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press.
Chen, M., Gao, C. and Ren, Z. (2018). Robust covariance and scatter matrix estimation under Huber's contamination model. The Annals of Statistics.
Cheng, Y., Diakonikolas, I. and Ge, R. (2019). High-dimensional robust mean estimation in nearly-linear time. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms.
Cheng, Y., Diakonikolas, I., Ge, R. and Woodruff, D. (2019). Faster algorithms for high-dimensional robust covariance estimation. arXiv preprint arXiv:1906.04661.
Cheng, Y., Diakonikolas, I., Ge, R. and Soltanolkotabi, M. (2020). High-dimensional robust mean estimation via gradient descent. In Proceedings of the 37th International Conference on Machine Learning. Proceedings of Machine Learning Research.
Cherapanamjeri, Y., Aras, E., Tripuraneni, N., Jordan, M. I., Flammarion, N. and Bartlett, P. L. (2020). Optimal robust linear regression in nearly linear time. arXiv preprint arXiv:2007.08137.
Dalalyan, A. S. and Minasyan, A. (2020). All-in-one robust estimator of the Gaussian mean. arXiv preprint arXiv:2002.01432.
Dalalyan, A. and Thompson, P. (2019). Outlier-robust estimation of a sparse linear model using ℓ1-penalized Huber's M-estimator. In Advances in Neural Information Processing Systems 32, 13188-13198. Curran Associates, Inc.
Diakonikolas, I. and Kane, D. M. (2019). Recent advances in algorithmic high-dimensional robust statistics. arXiv preprint arXiv:1911.05911.
Diakonikolas, I., Kane, D. M. and Pensia, A. (2020). Outlier robust mean estimation with subgaussian rates via stability. arXiv preprint arXiv:2007.15618.
Diakonikolas, I., Kong, W. and Stewart, A. (2019). Efficient algorithms and lower bounds for robust linear regression. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms.
Diakonikolas, I., Kamath, G., Kane, D. M., Li, J., Moitra, A. and Stewart, A. (2017). Being robust (in high dimensions) can be practical. In Proceedings of the 34th International Conference on Machine Learning, Volume 70.
Diakonikolas, I., Kamath, G., Kane, D. M., Li, J., Moitra, A. and Stewart, A. (2018). Robustly learning a Gaussian: Getting optimal error, efficiently. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms.
Diakonikolas, I., Kamath, G., Kane, D., Li, J., Moitra, A. and Stewart, A. (2019). Robust estimators in high dimensions without the computational intractability. SIAM Journal on Computing.
Dong, Y., Hopkins, S. and Li, J. (2019). Quantum entropy scoring for fast robust mean estimation and improved outlier detection. In Advances in Neural Information Processing Systems.
Elsener, A. and van de Geer, S. (2018). Sharp oracle inequalities for stationary points of nonconvex penalized M-estimators. IEEE Transactions on Information Theory.
Fan, J., Liu, H., Sun, Q. and Zhang, T. (2018). I-LAMM for sparse learning: Simultaneous control of algorithmic complexity and statistical error. Annals of Statistics.
Gao, C. (2020). Robust regression via multivariate regression depth. Bernoulli.
Hopkins, S., Li, J. and Zhang, F. (2020). Robust and heavy-tailed mean estimation made simple, via regret minimization. Advances in Neural Information Processing Systems.
Huber, P. J. (1981). Robust Statistics. John Wiley & Sons.
Jerrum, M. R., Valiant, L. G. and Vazirani, V. V. (1986). Random generation of combinatorial structures from a uniform distribution. Theoretical Computer Science.
Karmalkar, S. and Price, E. (2018). Compressed sensing with adversarial sparse noise via l1 regression. arXiv preprint arXiv:1809.08055.
Koltchinskii, V. and Lounici, K. (2017). Concentration inequalities and moment bounds for sample covariance operators. Bernoulli.
Kothari, P. K., Steinhardt, J. and Steurer, D. (2018). Robust moment estimation and improved clustering via sum of squares. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing.
Lai, K. A., Rao, A. B. and Vempala, S. (2016). Agnostic estimation of mean and covariance. In Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on.
Laurent, B. and Massart, P. (2000). Adaptive estimation of a quadratic functional by model selection. The Annals of Statistics.
Loh, P.-L. (2017). Statistical consistency and asymptotic normality for high-dimensional robust M-estimators. The Annals of Statistics.
Lugosi, G. and Mendelson, S. (2019). Sub-Gaussian estimators of the mean of a random vector. The Annals of Statistics.
Lugosi, G. and Mendelson, S. (2020). Risk minimization by median-of-means tournaments. J. Eur. Math. Soc.
Massart, P. (2000). About the constants in Talagrand's concentration inequalities for empirical processes. The Annals of Probability.
Minsker, S. and Wei, X. (2020). Robust modifications of U-statistics and applications to covariance estimation problems. Bernoulli.
Mizera, I. et al. (2002). On depth and deep points: a calculus. The Annals of Statistics.
Nemirovsky, A. S. and Yudin, D. B. (1983). Problem Complexity and Method Efficiency in Optimization.
Pensia, A., Jog, V. and Loh, P.-L. (2020). Robust regression with covariate filtering: Heavy tails and adversarial contamination. arXiv preprint arXiv:2009.12976.
Pisier, G. (2016). Subgaussian sequences in probability and Fourier analysis. arXiv preprint arXiv:1607.01053.
Prasad, A., Suggala, A. S., Balakrishnan, S., Ravikumar, P. et al. (2020). Robust estimation via robust gradient estimation. Journal of the Royal Statistical Society Series B.
Rivasplata, O. (2012). Subgaussian random variables: An expository note.
She, Y. and Owen, A. B. (2011). Outlier detection using nonconvex penalized regression. Journal of the American Statistical Association.
Tukey, J. W. (1975). Mathematics and the picturing of data. In Proceedings of the International Congress of Mathematicians, Vancouver, 1975.
Vershynin, R. (2010). Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027.
Vershynin, R. (2018). High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press.

Appendix A: Proofs of Propositions and Lemmas in Section 4
In this section, we assume that $\{x_i\}_{i=1}^n$ is a sequence of i.i.d. random vectors drawn from a sub-Gaussian distribution with mean $\mu$ and identity covariance, except in Proposition A.5. We also assume that $\{\xi_i\}_{i=1}^n$ is a sequence of i.i.d. random variables drawn from a distribution whose absolute moment is bounded by $\sigma$. The proof of Lemma 4.1 is almost the same as the proof of Lemma 4.3 of Diakonikolas et al. (2018), and we omit the proof of (4.4), as in Diakonikolas et al. (2018), because the proof of (4.4) is almost the same as that of (4.5). Before the proof of Lemma 4.1, we introduce the following concentration inequality.

Lemma A.1.
For some positive constants
A, B , we have (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n X i =1 x i x ⊤ i − I (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ t for probability at least − (cid:0) Ad − Bn min( t, t ) (cid:1) . Lemma A.2 (Lemma 4.1 in Section 4) . Suppose n is sufficiently large so that n = O ( d ) and ε is sufficiently small. Then, for w ∈ ∆ N,ε , we have λ max X i ∈ I G w i ( x i − µ )( x i − µ ) ⊤ − I ! ≤ C g ( ε log(1 /ε ) + r ′ d ) , (A.1) where C g is some positive constant depending on δ , with probability at least − δ .Proof. For any J ⊂ { , · · · , n } , Let w J ∈ R n be the vector which is given by w Ji = 1 / | J | for i ∈ J and w Ji = 0 otherwise. By convexity, it is sufficient to show asai and Fujisawa/Weighted Huber regression that for any J such that | J | = (1 − ε ) n , P " n X i =1 w Ji x i x ⊤ i − (1 − ε ) I ≥ C g ( ε log(1 /ε ) + r ′ d ) ≤ δ. For any fixed w J , we have n X i =1 w Ji x i x ⊤ i − I = 1(1 − ε ) n n X i =1 x i x ⊤ i − − ε I − − ε ) n X i/ ∈ J x i x ⊤ i − (cid:18) − ε − (cid:19) I ! . From triangular inequality, (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X i =1 w Ji x i x ⊤ i − (1 − ε ) I (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) − ε ) n n X i =1 x i x ⊤ i − − ε I (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) + (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) − ε ) n X i/ ∈ J x i x ⊤ i − (cid:18) − ε − (cid:19) I (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) . From Lemma A.1, we have (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) − ε ) n n X i =1 x i x ⊤ i − − ε I (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ C g r ′ d with probability at least 1 − δ for sufficiently large n so that (log(2 /δ ) + log 4 + Ad ) /Bn < J ⊂ { , · · · , n } such that | J | = (1 − ε ) n and sufficiently large positiveconstant C g , let E ( J ) be the event that (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) − ε ) n X i/ ∈ J x i x ⊤ i − (cid:18) − ε − (cid:19) I (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) > C g ε log(1 /ε ) ⇒ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) εn X i/ ∈ J x i x ⊤ i − I (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) > − εε C g ε log(1 /ε ) . For sufficiently small ε , we have − εε ε log(1 /ε ) = Ω(log(1 /ε )) > (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) εn X i/ ∈ J x i x ⊤ i − I (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) > − εε C g ε log(1 /ε ) . with probability at most 4 exp( Ad − Bεn − εε ε log(1 /ε )). Let B ( ε ) be the binaryentropy function. We have P (cid:2)(cid:0) ∪ J : | J | =(1 − ε ) n E ( J ) (cid:1) c (cid:3) ( a ) ≤ (cid:18) log (cid:18) nεn (cid:19) + Ad − Bεn − εε C g ε log(1 /ε ) (cid:19) ( b ) ≤ (cid:18) nB ( ε ) + Ad − Bεn − εε C g ε log(1 /ε ) (cid:19) ( c ) ≤ (cid:18) εn (cid:18) O (log(1 /ε ) − B − εε C g ε log(1 /ε ) (cid:19) + Ad (cid:19) ( d ) ≤ − εn/ Ad ) ( e ) ≤ δ/ , asai and Fujisawa/Weighted Huber regression where (a) follows by a union bound over all sets J of size (1 − ε ) n , (b) followsfrom log (cid:0) nεn (cid:1) ≤ εH ( ε ), (c) follows from ε is sufficiently small because H ( ε ) = O ( ε log(1 /ε )) as ε →
0, (d) follows from C g is sufficiently large constant, (e)follows from n is sufficiently large.Here, we give a Bernstein concentration inequality. Theorem A.1 (Bernstein concentration inequality) . Let { W i } ni =1 be a sequencewith i.i.d random variables. We assume that n X i =1 E[ W i ] ≤ v, n X i =1 E[( W i ) k + ] ≤ k !2 vc k − for i = 1 , · · · n and for k ∈ N such that k ≥ . Then, we have P "(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 ( W i − E[ W i ]) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ √ vt + ct ≥ − e − t for any t > . Let C g = C g + C g and the proof is complete. Proposition A.1 (Proposition 4.2 in Section 4) . Suppose that n is sufficientlylarge so that n = ( d log d ) . We have n (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X i =1 ( x i − µ ) h (cid:18) ξ i λ o √ n (cid:19)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ ≤ C Bδ r log dn , where C Bδ is some positive constant depending on δ , with probability at least − δ .Proof. We have E h ( x i − µ ) j h (cid:16) ξ i λ o √ n (cid:17)i = 0 and we see n X i =1 E ( x i − µ ) j h (cid:16) ξ i λ o √ n (cid:17) n = 1 n n X i =1 E[( x i − µ ) j ]E " h (cid:18) ξ j λ o √ n (cid:19) ≤ n n X i =1 E[( x i − µ ) j ] ≤ n . From Proposition 3.2. of Rivasplata (2012), we can show that the absolute k ( ≥
3) th moment of ( x − µ ) j is bounded above, as follows: n X i =1 E (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ( x i − µ ) j h (cid:16) ξ i λ o √ n (cid:17) n (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) k ≤ n k n X i =1 E[ | ( x − µ ) j | k ] ≤ k !2 (cid:18) n (cid:19) (cid:18) n (cid:19) k − . From Theorem A.1 with t = log( d/δ ), v = c = 1 /n , we haveP (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 ( x i − µ ) j h (cid:16) ξ i λ o √ n (cid:17) n (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ r d/δ ) n + log( d/δ ) n ≥ − δd asai and Fujisawa/Weighted Huber regression and for sufficiently large constant C Bδ depending on δ , we haveP (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 ( x i − µ ) j h (cid:16) ξ i λ o √ n (cid:17) n (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ C Bδ r log dn ≥ − δd . Hence, P (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X i =1 ( x i − µ ) h (cid:16) ξ i λ o √ n (cid:17) n (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ ≤ C Bδ r log dn = P sup j (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 ( x i − µ ) j h (cid:16) ξ i λ o √ n (cid:17) n (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ C Bδ r log dn = 1 − P sup j (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 ( x i − µ ) j h (cid:16) ξ i λ o √ n (cid:17) n (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) > C Bδ r log dn = 1 − P [ j (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 ( x i − µ ) j h (cid:16) ξ i λ o √ n (cid:17) n (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) > C Bδ r log dn ≥ − d X j =1 P (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 ( x i − µ ) j h (cid:16) ξ i λ o √ n (cid:17) n (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) > C Bδ r log dn ≥ − ( δ/d ) d = 1 − δ. Proposition A.2 (Proposition 4.3 in Section 4) . Suppose that n is sufficientlylarge so that n = ( d log d ) . We have n (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X i =1 ( x i − µ ) α i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ C Rδ r dn , where C Rδ is some positive constant depending on δ , with probability at least − δ .Proof. From α i = 1, we note that k P ni =1 ( x i − µ ) k is χ -square distributionwith nd degree of freedom and from Lemma Laurent and Massart (2000), wehave (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X i =1 ( x i − µ ) α i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ nd + 2 p nd log(1 /δ ) + 2 log(1 /δ )with probability at least (1 − δ ). For sufficiently large constant C Rδ dependingon δ , we have 1 n (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X i =1 ( x i − µ ) α i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ C Rδ r dn asai and Fujisawa/Weighted Huber regression with probability at least 1 − δ .We omit the proof of Proposition 4.3 because the proof is almost the sameas Proposition A.1.The following Proposition in used to prove Corollary 4.1. Proposition A.3 (Adopted from Proposition 4 of Dalalyan and Thompson(2019)) . Suppose that < ε < / holds. For m -sparse vector u such that ≤ m ≤ o and δ ∈ (0 , / , we have n X i =1 u i ( x i − µ ) ⊤ v ≤ C τ k u k k v k (cid:16) (1 + p log(1 /δ )) + p log d + p m log( n/m ) (cid:17) , where C τ is some positive constant, with probability at least (1 − δ ) .Proof. Let X = ( x − µ ) ⊤ ...( x n − µ ) ⊤ Suppose that { z i } ni =1 is a sequence with i.i.d. random vectors drawn fromGaussian distribution with mean 0 and identity covariance and Z = z ⊤ ... 
z ⊤ n and To prove Proposition 4 of Dalalyan and Thompson (2019), Lemma 4 ofDalalyan and Thompson (2019) is used and the lemma is the following: whensup [ u,v ] ∈ V u ⊤ Zv ≤ G ( V ) + G ( V ) + p log(1 /δ ) , (A.2)with probability at least 1 − δ , where G ( · ) is the Gaussian width, V = S d − , V = S n − and V = V × V .To weaken the Gaussian assumption to sub-Gaussian, we confirm that whetherthe following two properties of Gaussian use in the proof of Lemma 4 of Dalalyanand Thompson (2019) also hold for sub-Gaussian.(i) Gordon’s inequality,(ii) Gaussian concentration.For (i), we can use Theorem 4.2 of Pisier (2016) instead of Gordon’s inequality.For (ii), we can use Theorem 8.5.5 of Vershynin (2018) instead of Gaussianconcentration. To compute the Talagrand’s γ -Functional of V , which is definedby Definition 8.5.1 in Vershynin (2018), we can use Corollary 8.6.2. of Vershynin asai and Fujisawa/Weighted Huber regression (2018). From the discussion above, when { x i } ni =1 is a sequence with i.i.d. randomvectors drawn from sub-Gaussian distribution with identity covariance, we havesup [ u,v ] ∈ V n X i =1 u i ( x i − µ ) ⊤ v = sup [ u,v ] ∈ V u ⊤ Xv ≤ τ (cid:16) G ( V ) + G ( V ) + p log(1 /δ ) (cid:17) , (A.3)where τ is some positive constant, with probability at least 1 − δ .From Lemma 4 of Dalalyan and Thompson (2019) and (A.3), we have n X i =1 u i ( x i − µ ) ⊤ v ≤ τ k u k k v k (cid:16) √ . p log(81 /δ )) + 1 . p d (cid:17) + 1 . τ k v k G ( k u k B n ∩ | u k B n )(A.4)with probability at least (1 − δ ). Further more, we have G ( k u k B n ∩ | u k B n ) ≤ p n k u k a ) ≤ q max (1 , log(8 en k u k / k u k )) ( b ) ≤ √ e √ m k u k p n/m )where (a) follows from Proposition 1 of Bellec (2019) and (b) follows from Re-mark 4 of Dalalyan and Thompson (2019). We note that Proposition 1 of Bellec(2019) is proved for Gaussian, however, to replace Gaussian to sub-Gaussian inthe proof of (4) in Bellec (2019), we can confirm that (a) holds.Let η m = q n/m )log( n/m ) and q n/m )log( n/m ) ≤ C η , where C η is a numericalconstant because 1 ≤ m ≤ o and 0 < ε < / n X i =1 u i ( x i − µ ) ⊤ v ≤ τ k u k k v k (cid:16) √ . p log(81 /δ )) + 1 . p d + 4 . √ e √ m p n/m ) (cid:17) = τ k u k k v k (cid:16) √ . p log(81 /δ )) + 1 . p d + 4 . η m √ e p m log( n/m ) (cid:17) ≤ τ k u k k v k (cid:16) √ . p log(81 /δ )) + 1 . p d + 4 . C η √ e p m log( n/m ) (cid:17) ≤ C τ k u k k v k (cid:16) (1 + p log(1 /δ )) + p log d + p m log( n/m ) (cid:17) (A.5)with probability at least (1 − δ ), where C τ is some positive constant. Corollary A.1 (Corollary 4.1 in Section 4) . Suppose that < ε < / holds.For any vector u = ( u , · · · , u n ) ∈ R n and any set I | m | such that ≤ m ≤ o and δ ∈ (0 , / , we have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) X i ∈ I | m | u i ( x i − µ ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ C τ s X i ∈ I | m | u i k δ β η k (cid:18) (1 + p log(1 /δ )) + p d log d + √ m r log nm (cid:19) , where C τ is some positive constant, with probability at least (1 − δ ) . asai and Fujisawa/Weighted Huber regression Proof.
Let u ′ = ( u ′ , · · · , u ′ n ) be a m -sparse vector such that (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) X i ∈ I | m | u i ( x i − µ ) ⊤ β β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 u ′ i ( x i − µ ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) . From Proposition A.3, we have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 u ′ i ( x i − µ ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ C τ k u ′ k k δ β η k (cid:18) (1 + p log(1 /δ )) + p d log d + √ m r log nm (cid:19) = C τ s X i ∈ I | m | u i k δ β η k (cid:18) (1 + p log(1 /δ )) + p d log d + √ m r log nm (cid:19) . Proposition A.4 (Proposition 4.4 in Section 4) . Suppose that n is sufficientlylarge. For any vector v ∈ S d − , we have n E " n X i =1 (( x i − µ ) ⊤ v ) ≥ . Proof.
From Exercise 4.6.3 of Vershynin (2018), we have E vuut n X i =1 (( x i − µ ) ⊤ v ) ≥ √ n − C G √ d. where C G is some positive constant. From Jensen’s inequality, we have vuut E " n X i =1 (( x i − µ ) ⊤ v ) ≥ √ n − C G √ d. For sufficiently large n , we have √ n − C G √ d > n E " n X i =1 (( x i − µ ) ⊤ v ) ≥ n ( √ n − C G √ d ) = 1 − C G r dn + C G dn , where C G is some positive constant. Consequently, for sufficiently large n , wehave 1 n E " n X i =1 (( x i − µ ) ⊤ v ) ≥ − C G r dn + C G dn ≥ . The following Lemma is used to prove Proposition 4.5. asai and Fujisawa/Weighted Huber regression Lemma A.3.
For differentiable function f ( x ) , we denote its derivative f ′ ( x ) .For any differentiable and convex function f ( x ) , we have ( f ′ ( a ) − f ′ ( b ))( a − b ) ≥ . Proof.
From the definition of the convexity, we have f ( a ) − f ( b ) ≥ f ′ ( b )( a − b ) and f ( b ) − f ( a ) ≥ f ′ ( a )( b − a ) . From the above inequalities, we have0 ≥ f ′ ( b )( a − b ) + f ′ ( a )( b − a ) = ( f ′ ( b ) − f ′ ( a ))( a − b ) ⇒ ≤ ( f ′ ( a ) − f ′ ( b ))( a − b ) . Proposition A.5 (Proposition 4.5 in Section 4) . Let R ( r ) = (cid:8) β ∈ R d | k δ β k = k β − β ∗ k = r (cid:9) ( r ≤ and assume β η ∈ R ( r ) . Assume that (cid:8) x i ∈ R d (cid:9) ni =1 is a sequence with i.i.d.random matrices satisfying E [(( x i − µ ) ⊤ v ) ] ≤ m (cid:0) E [(( x i − µ ) ⊤ v ) ] (cid:1) for any v ∈ S d − , (A.6) E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n X i =1 ( x i − µ )( x i − µ ) ⊤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op ≤ C cov , (A.7) k µ − µ ˆ w k ≤ C µ , (A.8) where m , C cov and C µ are some positive constants. Suppose λ o √ n ≥ c max = (cid:16) σ m , p C cov m , C µ (cid:17) .Then, with probability at least − δ , we have λ o n X i =1 − h ξ i − ˆ w i n ( x i − µ ˆ w ) ⊤ δ β η λ o √ n ! + h (cid:18) ξ i λ o √ n (cid:19)! ˆ w i √ n ( x i − µ ˆ w ) ⊤ δ β η λ o ≥ a n n X i =1 E (cid:2) (( x i − µ ) ⊤ δ β η ) (cid:3) − a k δ β η k − a + B + C, (A.9) asai and Fujisawa/Weighted Huber regression where a = 18 ,a = λ o √ n E "(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X i =1 α i ( x i − µ ) n (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) + λ o q E [ k ( x i − µ ) k ] p /δ ) + λ o √ n k µ − µ ˆ w k ,a = λ o /δ ) ,B = X i ∈ I G ∩ I ˆ w< + X i ∈ I o (cid:18) − h (cid:18) ξ i λ o √ n − ˆ w i √ n ( x i − µ ˆ w ) ⊤ δ β λ o (cid:19) + h (cid:18) ξ i λ o √ n (cid:19)(cid:19) ˆ w i √ n ( x i − µ ˆ w ) ⊤ δ β η λ o ,C = X i ∈ I G ∩ I ˆ w< + X i ∈ I o (cid:18) − h (cid:18) ξ i λ o √ n − ( x i − µ ˆ w ) ⊤ δ β λ o √ n (cid:19) + h (cid:18) ξ i λ o √ n (cid:19)(cid:19) ( x i − µ ˆ w ) ⊤ δ β λ o √ n . Proof.
Let u ′ i = ( x i − µ ) ⊤ δ β η λ o √ n , u ˆ w i = ˆ w i √ n ( x i − µ ˆ w ) ⊤ δ β η λ o , u ′ ˆ w i = ( x i − µ ˆ w ) ⊤ δ β η λ o √ n , v i = ξ i λ o √ n . The left-hand side of (A.9) divided by λ o can be expressed as n X i =1 − h ξ i − ˆ w i n ( x i − µ ˆ w ) ⊤ δ β η λ o √ n ! + h (cid:18) ξ i λ o √ n (cid:19)! ˆ w i √ n ( x i − µ ˆ w ) ⊤ δ β η λ o = X i ∈ I G ∩ I ˆ w ≥ + X i ∈ I G ∩ I ˆ w< + X i ∈ I o ( − h ( v i − u ˆ w i ) + h ( v i )) u ˆ w i = A + B − C, where A = X i ∈ I G ∩ I ˆ w ≥ ( − h ( v i − u ˆ w i ) + h ( v i )) u ˆ w i + X i ∈ I G ∩ I ˆ w< + X i ∈ I o (cid:0) − h (cid:0) v i − u ′ ˆ w i (cid:1) + h ( v i ) (cid:1) u ′ ˆ w i ,B = X i ∈ I G ∩ I ˆ w< + X i ∈ I o ( − h ( v i − u ˆ w i ) + h ( v i )) u ˆ w i ,C = X i ∈ I G ∩ I ˆ w< + X i ∈ I o (cid:0) − h (cid:0) v i − u ′ ˆ w i (cid:1) + h ( v i ) (cid:1) u ′ ˆ w i . Let I E ˆ wi , I E ′ ˆ wi , I E ′≥ i , I E ′
132 + 12 λ o √ n k µ ˆ w − µ k k δ β η k c ) ≤
132 + C µ λ o √ n ( d ) ≤ ≤ , where (a) follows from triangular inequality, (b) follows from i ∈ E ′≥ i andH¨older’s inequality, (c) follows from (A.8) and k δ β η k ≤ λ o . From similar argument, we have I E ′ ˆ wi ⊃ I E ′
132 + 12 λ o √ n k µ ˆ w − µ k k δ β η k (cid:19) ( d ) ≤ (cid:18)
132 + C µ λ o √ n (cid:19) ( e ) ≤ , asai and Fujisawa/Weighted Huber regression where (a) follows from triangular inequality, (b) follows from ˆ w i ≤ − ε ) n ≤ i ∈ E ′≥ i and H¨older’s inequality, (d) follows from (A.8) and k δ β η k ≤ λ o .We have X i ∈ I G ∩ I w ≥ ( − h ( v i − u ˆ w i ) + h ( v i )) u ˆ w i ≥ X i ∈ I G ∩ I ˆ w ≥ ( − h ( v i − u ˆ w i ) + h ( v i )) u ˆ w i I E ˆ wi ( a ) = X i ∈ I G ∩ I ˆ w ≥ u w i I E ˆ wi ( b ) ≥ X i ∈ I G ∩ I ˆ w ≥ u ′ w i I E ˆ wi ( c ) ≥ X i ∈ I G ∩ I ˆ w ≥ u ′ w i I E ′ i ≥ , (A.12)where (a) follows from the definition of I E ˆ wi and (b) follows from i ∈ I G ∩ I ˆ w ≥ ,(c) follows from (A.10) and X i ∈ I G ∩ I ˆ w< + X i ∈ I o (cid:0) − h (cid:0) v i − u ′ ˆ w i (cid:1) + h ( v i ) (cid:1) u ′ ˆ w i ≥ X i ∈ I G ∩ I ˆ w< + X i ∈ I o (cid:0) − h (cid:0) v i − u ′ ˆ w i (cid:1) + h ( v i ) (cid:1) u ′ ˆ w i I E ′ ˆ wi ( a ) = X i ∈ I G ∩ I ˆ w< + X i ∈ I o u ′ w i I E ′ ˆ wi ( b ) ≥ X i ∈ I G ∩ I ˆ w< + X i ∈ I o u ′ w i I E ′ i < , (A.13)where (a) follows from the definition of I E ′ ˆ wi and (b) follows from (A.11). Con-sequently, from (A.12) and (A.13), we have A ≥ n X i =1 u ′ w i I E ′ i . From the convexity of the Huber loss and Lemma A.3, we have X i ∈ I G ∩ I w ≥ ( − h ( v i − u ˆ w i ) + h ( v i )) u ˆ w i ≥ X i ∈ I G ∩ I ˆ w ≥ ( − h ( v i − u ˆ w i ) + h ( v i )) u ˆ w i I E ˆ wi , (A.14) X i ∈ I G ∩ I ˆ w< + X i ∈ I o (cid:0) − h (cid:0) v i − u ′ ˆ w i (cid:1) + h ( v i ) (cid:1) u ′ ˆ w i ≥ X i ∈ I G ∩ I ˆ w< + X i ∈ I o (cid:0) − h (cid:0) v i − u ′ ˆ w i (cid:1) + h ( v i ) (cid:1) u ′ ˆ w i I E ′ ˆ wi . (A.15) asai and Fujisawa/Weighted Huber regression and we have u ′ w i I E ′ i = ( x i − µ ˆ w ) ⊤ δ β η λ o √ n ! I E ′ i = 14 λ o n (cid:0) ( x i − µ ˆ w ) ⊤ δ β η (cid:1) I E ′ i = 14 λ o n (cid:0) ( x i − µ + µ − µ ˆ w ) ⊤ δ β η (cid:1) I E ′ i ≥ λ o n (cid:16)(cid:0) ( x i − µ ) ⊤ δ β η (cid:1) + (cid:0) ( µ − µ ˆ w ) ⊤ δ β η (cid:1) − | ( x i − µ ) ⊤ δ β η ( µ − µ ˆ w ) ⊤ δ β η | (cid:17) I E ′ i ≥ λ o n (cid:16)(cid:0) ( x i − µ ) ⊤ δ β η (cid:1) − | ( x i − µ ) ⊤ δ β η ( µ − µ ˆ w ) ⊤ δ β η | (cid:17) I E ′ i ( a ) ≥ λ o n (cid:16)(cid:0) ( x i − µ ) ⊤ δ β η (cid:1) − | ( x i − µ ) ⊤ δ β η |k µ − µ ˆ w k k δ β η k (cid:17) I E ′ i ( b ) ≥ λ o n (cid:18)(cid:0) ( x i − µ ) ⊤ δ β η (cid:1) − λ o √ n k µ − µ ˆ w k k δ β η k (cid:19) I E ′ i (A.16)where (a) follows from H¨older’s inequality and (b) follows from the definition ofI E ′ i . From (A.16), we have A ≥ n X i =1 u ′ w i I E ′ i ≥ n X i =1 λ o n (cid:18)(cid:0) ( x i − µ ) ⊤ δ β η (cid:1) − λ o √ n k µ − µ ˆ w k k δ β η k (cid:19) I E ′ i ≥ n X i =1 (cid:0) ( x i − µ ) ⊤ δ β η (cid:1) λ o n I E ′ i − λ o √ n k µ − µ ˆ w k λ o k δ β η k ≥ n X i =1 ϕ ( u ′ i ) ψ ( v ′ i ) − λ o √ n k µ − µ ˆ w k λ o k δ β η k . (A.17)Define the functions ϕ ( u ) = u if | v | ≤ / u − / if 1 / ≤ u ≤ / u + 1 / if − / ≤ u ≤ − /
40 if | u | > / ψ ( v ) = I ( | v | ≤ / . (A.18)and we note that ϕ ( u ′ i ) ψ ( v ′ i ) ≤ ϕ ( u ′ i ) ≤ max ( x i − µ ) ⊤ δ β η λ o n , ! (A.19)and define f ( β ) = n X i =1 f i ( β ) = n X i =1 ϕ ( u ′ i ) ψ ( v ′ i ) . asai and Fujisawa/Weighted Huber regression To bound f ( β ) from bellow, consider the supremum of a random process indexedby R ( r ): ∆ := sup β ∈R ( r ) | f ( β ) − E f ( β ) | . (A.20)From (A.6), we have E (cid:2) (( x i − µ ) ⊤ δ β η ) (cid:3) ≤ m (cid:0) E (cid:2) (( x i − µ ) ⊤ δ β η ) (cid:3)(cid:1) . (A.21)and we have E (cid:2) (( x i − µ ) ⊤ δ β η ) (cid:3) ≤ m (cid:0) E (cid:2) (( x i − µ ) ⊤ δ β η ) (cid:3)(cid:1) ≤ m (cid:0) E (cid:2) k ( x i − µ ) k k δ β η k (cid:3)(cid:1) a ) ≤ m (cid:0) E (cid:2) k ( x i − µ ) k (cid:3)(cid:1) (A.22)where (a) follows from k δ β η k = r ≤ x i − µ ) ⊤ δ β η ) = h ( x i − µ )( x i − µ ) ⊤ , δ β η δ ⊤ β η i (A.23)and from (A.23), we have1 n n X i =1 E (cid:2) (( x i − µ ) ⊤ δ β η ) (cid:3) = 1 n n X i =1 E h h ( x i − µ )( x i − µ ) ⊤ , δ β η δ ⊤ β η i i ( a ) = E "* n n X i =1 ( x i − µ )( x i − µ ) ⊤ , δ β η δ ⊤ β η + ( b ) ≤ E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n X i =1 ( x i − µ )( x i − µ ) ⊤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op k δ β η δ ⊤ β η k ∗ ( c ) = E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n X i =1 ( x i − µ )( x i − µ ) ⊤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op k δ β η k ( d ) ≤ E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n X i =1 ( x i − µ )( x i − µ ) ⊤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op ( e ) ≤ C cov , (A.24)where (a) follows from the linearity of expectation, (b) follows from H¨older’sinequality, (c) follows from trace norm of δ β η δ ⊤ β η is the sum of the diagonalcomponent of δ β η δ ⊤ β η because δ β η δ ⊤ β η is positive semi-definite, (d) follows from k δ β η k = k δ β η k and r ≤ E ′ i , we have E " n X i =1 f i ( β ) = n X i =1 E [ f i ( β )] ≥ D − E − F, (A.25) asai and Fujisawa/Weighted Huber regression where D = n X i =1 E [ u ′ i ] = n X i =1 E (cid:2) (( x i − µ ) ⊤ δ β η ) (cid:3) λ o n , E = n X i =1 E (cid:20) u ′ i I (cid:18) | u ′ i | ≥ (cid:19)(cid:21) , F = n X i =1 E (cid:20) u ′ i I (cid:18) | v ′ i | ≥ (cid:19)(cid:21) . We evaluate the right-hand side of (A.25) at each term. First, we have E = n X i =1 E (cid:20) u ′ i I (cid:18) | u ′ i | ≥ (cid:19)(cid:21) ( a ) ≤ n X i =1 q E [ u ′ i ] s E (cid:20) I (cid:18) | u ′ i | ≥ (cid:19)(cid:21) ( b ) = n X i =1 q E [ u ′ i ] s P (cid:20)(cid:18) | u ′ i | ≥ (cid:19)(cid:21) ( c ) ≤ n X i =1 λ o n q E (cid:2) (( x i − µ ) ⊤ δ β η ) (cid:3)q E (cid:2) (( x i − µ ) ⊤ δ β η ) (cid:3) ( c ) ≤ n X i =1 m λ o n E (cid:2) (( x i − µ ) ⊤ δ β η ) (cid:3) E (cid:2) (( x i − µ ) ⊤ δ β η ) (cid:3) ( d ) ≤ C cov λ o n n X i =1 E (cid:2) (( x i − µ ) ⊤ δ β η ) (cid:3) E (cid:2) (( x i − µ ) ⊤ δ β η ) (cid:3) ( e ) ≤ λ o E (cid:2) (( x i − µ ) ⊤ δ β η ) (cid:3) where (a) follows from H¨older’s inequality, (b) follows from the relation betweenindicator function and expectation, (c) follows from Markov’s inequality, (d)follows from the definition of λ o and (e) follows from (A.24). 
asai and Fujisawa/Weighted Huber regression Second, we have F = n X i =1 E (cid:20) u ′ i I (cid:18) | v ′ i | ≥ (cid:19)(cid:21) ( a ) ≤ n X i =1 q E [ u ′ i ] s E (cid:20) I (cid:18) | v ′ i | ≥ (cid:19)(cid:21) ( b ) ≤ n X i =1 q E [ u ′ i ] s P (cid:20) | v ′ i | ≥ (cid:21) ( c ) ≤ n X i =1 s λ o √ n λ o n q E (cid:2) (( x i − µ ) ⊤ δ β η ) (cid:3)p E [ | ξ i | ] ( d ) ≤ n X i =1 s λ o √ n σ λ o n m E (cid:2) (( x i − µ ) ⊤ δ β η ) (cid:3) ( e ) ≤ n X i =1 λ o n E (cid:2) (( x i − µ ) ⊤ δ β η ) (cid:3) = 116 λ o E (cid:2) (( x i − µ ) ⊤ δ β η ) (cid:3) where (a) follows from H¨older’s inequality, (b) follows from relation betweenindicator function and expectation, (c) follows from Markov’s inequality, (d)follows from (A.21) and E [ | ξ i | ] ≤ σ and (e) follows from the definition of λ o .Consequently, we have E [ f ( β )] = D − E − F ≥ nλ o n X i =1 E (cid:2) (( x i − µ ) ⊤ δ β η ) (cid:3) (A.26)and from (A.17), we have A ≥ nλ o n X i =1 E (cid:2) (( x i − µ ) ⊤ δ β η ) (cid:3) − λ o √ n k µ − µ ˆ w k λ o k δ β η k − ∆ . (A.27)Next we evaluate the stochastic term ∆ defined in (A.20). From (A.19) andTheorem 3 of Massart (2000), with probability at least 1 − e x , we have∆ ≤ E [∆] + σ f √ x + 18 . x ≤ E [∆] + σ f √ x + 5 x (A.28)where σ f = sup β ∈B P ni =1 E (cid:2) ( f i ( β ) − E [ f i ( β )]) (cid:3) . About σ f , we have E (cid:2) ( f i ( β ) − E [ f i ( β )]) (cid:3) ≤ E (cid:2) f i ( β ) (cid:3) . From (A.22), we have E (cid:2) (( x i − µ ) ⊤ δ β η ) (cid:3) ≤ E (cid:2) k x i − µ k k δ β η k (cid:3) ≤ E (cid:2) k x i − µ k (cid:3) r asai and Fujisawa/Weighted Huber regression and from (A.19) and E (cid:2) f i ( β ) (cid:3) ≤ E " (( x i − µ ) ⊤ δ β η ) λ o n ≤ E (cid:2) k x i − µ k (cid:3) λ o n r and then σ f ≤ p E [ k x i − µ k ]2 λ o r . Combining this and (A.28), we have∆ ≤ E [∆] + p E [ k x i − µ k ]2 λ o r √ x + 5 x. (A.29)From Symmetrization inequality (Lemma 11.4 of Boucheron, Lugosi and Mas-sart (2013)), we have E [∆] ≤ E (cid:2) sup β ∈B | G β | (cid:3) where G L := n X i =1 α i ϕ ( u ′ i ) ψ ( v ′ i ) , and { α i } is a sequence of Rademacher random variables (i.e., P ( α i = 1) = P ( α i = −
1) = 1 / { x i , ξ i } ni =1 . We denote E ∗ as aconditional expectation of { α i } ni =1 given { x i , ξ i } ni =1 . From contraction principal(Theorem 6.7.1 of Vershynin (2018)), we have E ∗ " sup β ∈B (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 α i ϕ ( u ′ i ) ψ ( v ′ i ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ max ≤ i ≤ n { ψ ( v ′ i ) } E ∗ " sup β ∈B (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 α i ϕ ( u ′ i ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = E ∗ " sup β ∈B (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 α i ϕ ( u ′ i ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) and from the basic property of the expectation, we have E " sup β ∈B (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 α i ϕ ( u ′ i ) ψ ( v ′ i ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ E " sup β ∈B (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 α i ϕ ( u ′ i ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) . Since ϕ is 1-Lipschitz and ϕ (0) = 0, from contraction principal (Theorem 11.6in Boucheron, Lugosi and Massart (2013)), we have E " sup β ∈B (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 α i ϕ ( u ′ i ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ E " sup β ∈B (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 α i u ′ i (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ √ n λ o E "(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X i =1 α i ( x i − µ ) n (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) k δ β η k = √ n λ o E "(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X i =1 α i ( x i − µ ) n (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) r . Consequently, we have2 E [∆] ≤ √ nλ o E "(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X i =1 α i ( x i − µ ) n (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) r . (A.30)Combining (A.30) with (A.29) and (A.27), with probability at least 1 − e − x , wehave n X i =1 λ o − h ξ i + ˆ w i n ( x i − µ ) ⊤ δ β η λ o √ n ! − h (cid:18) ξ i λ o √ n (cid:19)! ˆ w i √ n ( x i − µ ) ⊤ δ β η λ o ≥ a E (cid:2) (( x i − µ ) ⊤ δ β η ) (cid:3) − a k δ β η k − a + B + C asai and Fujisawa/Weighted Huber regression where a = 18 , a = λ o √ n E "(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X i =1 α i ( x i − µ ) n (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) + λ o q E [ k ( x i − µ ) k ] √ x + λ o √ n k µ − µ ˆ w k ,a = λ o xB = X i ∈ I G ∩ I ˆ w< + X i ∈ I o − h ξ i λ o √ n − ˆ w i √ n ( x i − µ ˆ w ) ⊤ δ β η λ o ! + h (cid:18) ξ i λ o √ n (cid:19)! ˆ w i √ n ( x i − µ ˆ w ) ⊤ δ β η λ o C = X i ∈ I G ∩ I ˆ w< + X i ∈ I o − h ξ i λ o √ n − ( x i − µ ˆ w ) ⊤ δ β η λ o √ n ! + h (cid:18) ξ i λ o √ n (cid:19)! ( x i − µ ˆ w ) ⊤ δ β η λ o √ n . Let δ = e − x and the proof is complete. Appendix B: Stochastic argument in case of heavy-tailed design
In this section, we state our main theorem in stochastic form assuming ran-domness in { x i } ni =1 and { ξ i } ni =1 . In the following subsection, we confirm theconditions in Theorem 3.1 are satisfied under the conditions in Theorem B.1. Theorem B.1.
Consider the optimization problem (2.2) . Suppose that { x i } ni =1 is a sequence with i.i.d. random vectors drawn from a distribution whose un-known covariance is Σ (cid:22) σ c I with unknown mean µ and { ξ i } ni =1 is a sequencewith i.i.d. random variables drawn from a distribution whose absolute moment isbounded by σ . Suppose that n is sufficiently large so n = O ( d log d ) and supposethat ε is sufficiently small. Let λ o √ n = c max with c max = max (cid:16) m σ , p C cov m , C CDG (cid:17) and m is some positive constant such that E [( x i − µ ) ⊤ v ) ] ≤ m (cid:0) E [(( x i − v ) ⊤ v ) ] (cid:1) (B.1) and C cov is some positive constant such that E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n X i =1 ( x i − µ )( x i − µ ) ⊤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ C cov . Let ˆ β is a optimal solution of (2.2) . Then, with probability at least (1 − δ ) ,we have k ˆ β − β ∗ k = O (cid:0) √ ε + r d (cid:1) . In the remaining part of Section B.1, we assume that { x i } ni =1 is a sequencewith i.i.d. random vectors drawn from a distribution whose unknown covarianceis Σ (cid:22) σ c I with unknown mean µ and { ξ i } ni =1 is a sequence with i.i.d. randomvariables drawn from a distribution whose absolute moment is bounded by σ . asai and Fujisawa/Weighted Huber regression B.1. ROBUST-WEIGHT for bounded covariance distributions
For heavy tailed design, we use Algorithm 2 of Cheng, Diakonikolas and Ge(2019) as ROBUST-WEIGHT. As in Algorithm 1 of Cheng, Diakonikolas andGe (2019) , The weights { ˆ w i } ni =1 can be computed by Algorithm 2 of Cheng,Diakonikolas and Ge (2019) from { X i } ni =1 and ε with polynomial time compu-tational complexity. Algorithm 2 of Cheng, Diakonikolas and Ge (2019) showedthat µ ˆ w − µ is close to 0 in the l norm with high probability.Algorithm 2 of Cheng, Diakonikolas and Ge (2019) use the same formulationof the primal-dual SDP as Algorithm 1 of Cheng, Diakonikolas and Ge (2019)and Algorithm 2 of Cheng, Diakonikolas and Ge (2019) estimates µ by thefollowing algorithm: Algorithm 3
Robust Mean Estimation for Bounded Covariance Distributions
Require: { X i } ni =1 ∈ R d with 0 < ε < / Ensure: ˆ w ∈ R n such that, with probability at least (1 − δ ) , k µ ˆ w − µ k ≤ C CDG ( √ ε + r d ),where C CDG is some positive constant depending on δ when Lemma B.1 holds.Let ν ∈ R d be the coordinate-wise median of { X i } ni =1 For i = 1 to O (log d )Use Proposition 5.5 of Cheng, Diakonikolas and Ge (2019) to compute either (i) A good solution w ∈ R n for the primal SDP (4.2) with parameters ν and 2 ε ; or (ii) A good solution M ∈ R d × d for the dual SDP (4.3) with parameters ν and ε if the objective value of w in SDP (4.2) is at most C CDG (Lemma B.1) return the weighted vector ˆ w (Lemma 5.3 of Cheng, Diakonikolas and Ge (2019)) else Move ν closer to µ using the top eigenvector of M (Lemma 5.4 of Cheng, Diakonikolasand Ge (2019)). First, we state some properties of the bounded covariance random vector.
Lemma B.1 (Adopted from the proof of Lemma A.18 of Diakonikolas et al.(2017) and the proof of Lemma 4.3 Diakonikolas et al. (2018)) . For any w ∈ δ n,ε ,we have (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) X i ∈ I G w i ( x i − µ ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ C ht ( √ ε + r d ) , (B.2) λ max X i ∈ I G w i ( x i − µ )( x i − µ ) ⊤ ! ≤ C ht , (B.3) where C ht is some positive constant, with probability at least − δ . We note that Lemma A.18 in Diakonikolas et al. (2017) is valid even if ǫ ofDiakonikolas et al. (2017) is replaced by δ . Because combining Lemma A.18 inDiakonikolas et al. (2017) and the proof of Lemma A.2, we have the proof ofLemma B.1 and we omit the detailed proof of Lemma B.1. asai and Fujisawa/Weighted Huber regression Proposition B.1 (Adopted from Lemma 4.1 of Minsker and Wei (2020)) . As-sume that { x i } ni =1 is a sequence with i.i.d. random matrices drawn from a dis-tribution with mean µ and whose covariance Σ (cid:22) σ c I and { x i } ni =1 satisfy E [(( x i − µ ) ⊤ v ) ] ≤ m (cid:0) E [(( x i − µ ) ⊤ v ) ] (cid:1) , (B.4) for any v ∈ R d . Assume d/n ≤ . Then, we have E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n X i =1 ( x i − µ )( x i − µ ) ⊤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ c htcov , where c htcov = m σ c .Proof. From Lemma 4.1 of Minsker and Wei (2020), we have E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n X i =1 ( x i − µ )( x i − µ ) ⊤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ m tr(Σ) k Σ k op n ( a ) ≤ m σ c dn ( b ) ≤ c htcov . Proposition B.2.
We have (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X i =1 ˆ w i h (cid:18) ξ i λ o √ n (cid:19) ( x i − µ ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ C htδ r dn , where C htδ = 2 /δ , with probability at least − δ .Proof. From ˆ w i h (cid:16) ξ i λ o √ n (cid:17) ≤ n × ≤ n , we have E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X i =1 ˆ w i h (cid:18) ξ i λ o √ n (cid:19) ( x i − µ ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ d X j =1 n ≤ dn . From Markov’s inequality, we have P " n (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X i =1 ˆ w i h (cid:18) ξ i λ o √ n (cid:19) ( x i − µ ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ δ r dn ≥ − δ. Proposition B.3.
Let { α i } ni =1 be a series of Rademacher random veriables.Then, we have n (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X i =1 ( x i − µ ) α i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ C htδ r dn . asai and Fujisawa/Weighted Huber regression Proof.
From α i = 1, we have E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X i =1 ( x i − µ ) α i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ d X j =1 n ≤ nd. From Markov’s inequality, we have P " n (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X i =1 ( x i − µ ) α i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ δ r dn ≥ − δ. Proposition B.4 (Adopted from (5.31) of Vershynin (2010)) . Suppose that n is sufficiently large so that n = O ( d log d ) . For any vector v ∈ S d − , we have n E " n X i =1 (( x i − µ ) ⊤ v ) ≥ . We omit the proof of Proposition B.4 because the proof is almost the sameof the proof of Proposition 4.4.
Proposition B.5 ((5.31) of Vershynin (2010)) . Suppose that n is sufficientlylarge so that n = O ( d log d ) . For any vector v ∈ S d − , we have vuut n X i =1 (( x i − µ ) ⊤ v ) ≤ √ n + C HT δ p d log d, where C HT δ is a constant depending on δ , with probability at least − δ . Next, we state some properties of Algorithm 3.We note that the proof of Lemmas 5.2, 5.3, 5.4, Proposition 5.5 and Theorem1.3 of Cheng, Diakonikolas and Ge (2019) is valid even if we replace the orderof δ in Section 5 of Cheng, Diakonikolas and Ge (2019) as δ = √ ε + r ′ d andconsequently, we have Propositions B.6, B.7 and B.8. Proposition B.6.
Suppose that < ε < / , (B.2) and (B.3) hold. We have k µ ˆ w − µ k ≤ C CDG √ ε. (B.5) when Algorithm 3 succeeds . Proposition B.7.
Suppose that < ε < / , (B.2) and (B.3) hold. We have λ max n X i =1 ˆ w i ( X i − ν )( X i − ν ) ⊤ ! ≤ C CDG for some vector in (4.2) ν when Algorithm 3 succeeds. asai and Fujisawa/Weighted Huber regression About ν in Proposition B.7, we have the following Lemma Proposition B.8.
Suppose that < ε < / , (B.2) and (B.3) hold. We have k ν − µ k ≤ C CDG for ν in Proposition B.7 when Algorithm 3 succeeds . B.2. Confirmation of the conditions
In this section, we confirm (3.1) - (3.4) under the condition used in Theorem B.1.We note that when (B.2) and (B.3) hold, Algorithm 3 succeeds (Algorithm 3computes { ˆ w i } ni =1 such that (B.5) is satisfied) with probability at least (1 − δ ) .In light of this fact, Lemmas B.1 and 4.2, Propositions B.1 - B.8 and A.5 holdwith probability at least (1 − δ ) under the conditions used in Theorem B.1.We also note that from Proposition B.6 and assumption assumed in TheoremB.1, we have r o , r d ≤
1, and we can set C µ = 2 C CDG , where C µ is defined inProposition A.5. From the definition of λ o , conditions assumed in Theorem B.1and Proposition B.1, we see that Proposition A.5 holds.In Section B.2, to omit to state “with probability at least”, we assume thatinequalities in Lemmas B.1 and 4.2, Propositions B.1 - B.8 and A.5 hold andAlgorithm 3 succeeds. Proposition B.9 (Confirmation of (3.3)) . We have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)X i ∈ I o ˆ w i h ( R i ( β η ))( X i − µ ˆ w ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ C htcond ( √ ε + r d ) k δ β η k , where C htcond = (cid:18)q C CDG + C ht + C CDG + 2 C ht C CDG + 2 √ C CDG (cid:19) . Proof.
From X i − µ ˆ w = ( X i − ν ) + ( ν − µ ) + ( µ − µ ˆ w ), we have X i ∈ I o ˆ w i h ( R i ( β η ))( X i − µ ˆ w ) ⊤ δ β η = X i ∈ I o ˆ w i h ( R i ( β η ))( X i − ν ) ⊤ δ β η + X i ∈ I o ˆ w i h ( R i ( β η ))( ν − µ ) ⊤ δ β η + X i ∈ I o ˆ w i h ( R i ( β η ))( µ − µ ˆ w ) ⊤ δ β η . (B.6)Let M i = ( X i − ν )( X i − ν ) ⊤ . For the first term of the R.H.S of (B.6), we asai and Fujisawa/Weighted Huber regression have (X i ∈ I o ˆ w i h ( R i ( β η ))( X i − ν ) ⊤ δ β η ) a ) ≤ X i ∈ I o ˆ w i X i ∈ I o ˆ w i | ( X i − ν ) ⊤ δ β η | b ) ≤ ε X i ∈ I o ˆ w i | ( X i − ν ) ⊤ δ β η | = 2 εδ ⊤ β η X i ∈ I o ˆ w i M i δ β η = 2 ε δ ⊤ β η n X i =1 ˆ w i M i δ β η − δ ⊤ β η X i ∈ I G ˆ w i M i δ β η ! = 2 ελ max ( M ) k δ β η k − εδ ⊤ β η X i ∈ I G ˆ w i M i δ β η ! , (B.7)where (a) follows from ˆ w i = √ ˆ w i √ ˆ w i and 0 ≤ h ( R i ( β η )) ≤
1, (b) follows from0 ≤ ˆ w i ≤ − ε ) n ≤ n and P i ∈ I o ˆ w i ≤ ε . For the last term of R.H.S of (B.7),we have δ ⊤ β η X i ∈ I G ˆ w i M i δ β η = δ ⊤ β η X i ∈ I G ˆ w i ( x i − ν )( x i − ν ) ⊤ δ β η = δ ⊤ β η X i ∈ I G ˆ w i ( x i − µ − ν + µ )( x i − µ − ν + µ ) ⊤ δ β η = δ ⊤ β η X i ∈ I G ˆ w i ( x i − µ )( x i − µ ) ⊤ δ β η + δ ⊤ β η X i ∈ I G ˆ w i ( µ − ν )( µ − ν ) ⊤ δ β η + 2 δ ⊤ β η X i ∈ I G ˆ w i ( x i − µ )( µ − ν ) ⊤ δ β η ( a ) ≥ k δ β η k + δ ⊤ β η X i ∈ I G ˆ w i ( µ − ν )( µ − ν ) ⊤ δ β η + 2 δ ⊤ β η X i ∈ I G ˆ w i ( x i − µ )( µ − ν ) ⊤ δ β η , (B.8)where (a) follows from (B.3). From (B.7) and (B.8), we have1 ε (X i ∈ I o ˆ w i h ( R i ( δ β η ))( X i − ν ) ⊤ δ β ) ≤ λ max ( M ) k δ β η k + C ht k δ β η k − δ ⊤ β η X i ∈ I G ˆ w i ( µ − ν )( µ − ν ) ⊤ δ β η − δ ⊤ β η X i ∈ I G ˆ w i ( x i − µ )( µ − ν ) ⊤ δ β η ≤ λ max ( M ) k δ β η k + C ht k δ β η k + (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) δ ⊤ β η X i ∈ I G ˆ w i ( µ − ν )( µ − ν ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + 2 (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) δ ⊤ β η X i ∈ I G ˆ w i ( x i − µ )( µ − ν ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ( a ) ≤ (cid:0) C CDG + C ht + C CDG + 2 C ht C CDG (cid:1) k δ β η k , asai and Fujisawa/Weighted Huber regression where (a) follows from Propositions 4.7 and B.8 and Lemma B.2.For the second term of the R.H.S of (B.6), we have (X i ∈ I o ˆ w i h ( R i ( β η ))( ν − µ ) ⊤ δ β η ) ≤ X i ∈ I o ˆ w i X i ∈ I o ˆ w i | ( ν − µ ) ⊤ δ β η | a ) ≤ ε X i ∈ I o ˆ w i | ( ν − µ ) ⊤ δ β η | = 2 δ ⊤ β η ε X i ∈ I o ˆ w i ( ν − µ )( ν − µ ) ⊤ δ β η ( b ) ≤ C CDG ε k δ β η k , where (a) follows from 0 ≤ ˆ w i ≤ − ε ) n ≤ n and P i ∈ I o ˆ w i ≤ ε and (b) followsfrom Proposition B.8 and P i ∈ I o ˆ w i ≤ P ni =1 ˆ w i = 1. For the last term of theR.H.S of (B.6), we have (X i ∈ I o ˆ w i h ( R i ( β η ))( µ − µ ˆ w ) ⊤ δ β η ) ≤ X i ∈ I o ˆ w i X i ∈ I o ˆ w i | ( µ − µ ˆ w ) ⊤ δ β η | a ) ≤ ε X i ∈ I o ˆ w i | ( µ − µ ˆ w ) ⊤ δ β η | = 2 δ ⊤ β η ε X i ∈ I o ˆ w i ( µ − µ ˆ w )( µ − µ ˆ w ) ⊤ δ β η ( b ) ≤ C CDG ε √ ε k δ β η k ≤ C CDG ε k δ β η k , where (a) follows from 0 ≤ ˆ w i ≤ − ε ) n ≤ n and P i ∈ I o ˆ w i ≤ ε and (b) followsfrom Proposition B.6 and P i ∈ I o ˆ w i ≤ P ni =1 ˆ w i = 1.Consequently, we have X i ∈ I o ˆ w i h ( R i ( β η ))( X i − µ ˆ w ) ⊤ δ β η ≤ (cid:18)q C CDG + C ht + C CDG + 2 C ht C CDG + 2 √ C CDG (cid:19) √ ε k δ β η k ≤ (cid:18)q C CDG + C ht + C CDG + 2 C ht C CDG + 2 √ C CDG (cid:19) ( √ ε + r d ) k δ β η k . Lemma B.2 are used in the proof of Proposition B.9.
Lemma B.2.
We have δ ⊤ β η X i ∈ I G ˆ w i ( x i − µ )( µ − ν ) ⊤ δ β η ≤ C ht C CDG k δ β η k . asai and Fujisawa/Weighted Huber regression Proof. δ ⊤ β η X i ∈ I G ˆ w i ( x i − µ )( µ − ν ) ⊤ δ β η ≤ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) δ ⊤ β η X i ∈ I G ˆ w i ( x i − µ )( µ − ν ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) δ ⊤ β η X i ∈ I G ˆ w i ( x i − µ ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:12)(cid:12) ( µ − ν ) ⊤ δ β η (cid:12)(cid:12) ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) X i ∈ I G ˆ w i ( x i − µ ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) k µ − ν k k δ β k a ) ≤ C ht ( √ ε + r d ) C CDG k δ β η k b ) ≤ C ht C CDG k δ β η k , where (a) follows from (B.2) and (B.8) and (b) follows from √ ε ≤ r d ≤ Proposition B.10 (Confirmation of (3.1)) . We have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 ˆ w i h (cid:18) ξ i λ o √ n (cid:19) ( x i − µ ˆ w ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ C ht δ √ ε k δ β η k , where C ht δ is some positive constant depending on δ .Proof. λ o √ n (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 ˆ w i h (cid:18) ξ i λ o √ n (cid:19) ( x i − µ ˆ w ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ( a ) ≤ λ o √ n (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 ˆ w i h (cid:18) ξ i λ o √ n (cid:19) ( x i − µ ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + λ o √ n (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 ˆ w i h (cid:18) ξ i λ o √ n (cid:19) ( µ − µ ˆ w ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ( b ) ≤ λ o √ n (cid:0) C htδ r d k δ β η k + 2 C CDG ( √ ε + r d ) k δ β η k (cid:1) ≤ λ o √ n (cid:16) C htδ + √ C CDG (cid:17) ( √ ε + r d | δ β η k = λ o √ nC ht δ ( √ ε + r d ) k δ β η k , where (a) follows from triangular inequality and (b) follows from Lemmas B.3and B.4. Lemma B.3.
We have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 ˆ w i h (cid:18) ξ i λ o √ n (cid:19) ( x i − µ ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ C htδ r d k δ β η k . Proof.
We have n X i =1 ˆ w i h (cid:18) ξ i λ o √ n (cid:19) ( x i − µ ) ⊤ δ β η ( a ) ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X i =1 ˆ w i h (cid:18) ξ i λ o √ n (cid:19) ( x i − µ ) ⊤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) k δ β η k b ) ≤ C htδ r dn k δ β η k ≤ C htδ r d k δ β η k asai and Fujisawa/Weighted Huber regression where (a) follows from H¨older’s inequality and (b) follows from PropositionB.2. Lemma B.4.
We have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 ˆ w i h (cid:18) ξ i λ o √ n (cid:19) ( µ ˆ w − µ ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ √ C CDG ( √ ε + r d ) k δ β η k . Proof.
We have ( n X i =1 ˆ w i h (cid:18) ξ i λ o √ n (cid:19) ( µ − µ ˆ w ) ⊤ δ β η ) ≤ n X i =1 ˆ w i n X i =1 ˆ w i | ( µ − µ ˆ w ) ⊤ δ β η | a ) ≤ n X i =1 ˆ w i | ( µ − µ ˆ w ) ⊤ δ β η | = 2 δ ⊤ β η n X i =1 ˆ w i ( µ − µ ˆ w )( µ − µ ˆ w ) ⊤ δ β η ( b ) ≤ C CDG ( ε + r d ) k δ β η k , where (a) follows from 0 ≤ ˆ w i ≤ − ε ) n ≤ n and P ni =1 ˆ w i ≤ P i ∈ I o ˆ w i ≤ P ni =1 ˆ w i = 1. From triangular inequality,we have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 ˆ w i h (cid:18) ξ i λ o √ n (cid:19) ( µ − µ ˆ w ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ √ C CDG ( √ ε + r d ) k δ β η k . Proposition B.11 (Confirmation of (3.2)) . We have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)X i ∈ I o ˆ w i h ( r i ( β η ))( x i − µ ˆ w ) ⊤ δ β (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ C ht δ ( √ ε + r d ) k δ β η k , where C ht δ is some positive constant depending on δ .Proof. We have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)X i ∈ I o ˆ w i h ( r i ( β η ))( x i − µ ˆ w ) ⊤ δ β (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ( a ) ≤ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)X i ∈ I o ˆ w i h ( r i ( β η ))( x i − µ ) ⊤ δ β (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)X i ∈ I o ˆ w i h ( r i ( β η ))( µ − µ ˆ w ) ⊤ δ β (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ( b ) ≤ C HT δ )( √ ε + r d ) k δ β η k + √ C CDG ( √ ε + r d ) k δ β η k = (2(1 + C HT δ ) + √ C CDG )( √ ε + r d ) k δ β η k = C ht δ ( √ ε + r d ) k δ β η k , asai and Fujisawa/Weighted Huber regression where (a) follows from triangular inequality and (b) follows from Lemmas B.5and B.6. Lemma B.5.
We have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)X i ∈ I o ˆ w i h ( r i ( β η ))( x i − µ ) ⊤ δ β (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ C HT δ )( √ ε + r d ) k δ β η k . Proof.
We have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)X i ∈ I o ˆ w i h ( r i ( β η ))( x i − µ ) ⊤ δ β η (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ( a ) ≤ sX i ∈ I o ˆ w i h ( r i ( β η )) (cid:16) √ n + C HT δ p d log d (cid:17) k δ β η k b ) ≤ √ on (cid:16) √ n + C HT δ p d log d (cid:17) k δ β η k c ) ≤ (cid:0) √ ε + C HT δ r d (cid:1) k δ β η k ≤ C HT δ )( √ ε + r d ) k δ β η k , where (a) follows from H¨older’s inequality and Proposition B.5, (b) follow fromˆ w i ≤ n (1 − ε ) ≤ n and − ≤ h ( r i ( β η )) ≤ on ≤ Lemma B.6.
We have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)X i ∈ I o ˆ w i h ( r i ( β η ))( µ ˆ w − µ ) ⊤ δ β (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ √ C CDG ( √ ε + r d ) k δ β η k . Proof.
We have (X i ∈ I o ˆ w i h (cid:18) ξ i λ o √ n (cid:19) ( µ − µ ˆ w ) ⊤ δ β η ) ≤ X i ∈ I o ˆ w i X i ∈ I o ˆ w i | ( µ − µ ˆ w ) ⊤ δ β η | a ) ≤ ε X i ∈ I o ˆ w i | ( µ − µ ˆ w ) ⊤ δ β η | = 2 εδ ⊤ β η X i ∈ I o ˆ w i ( µ − µ ˆ w )( µ − µ ˆ w ) ⊤ δ β η ≤ δ ⊤ β η X i ∈ I o ˆ w i ( µ − µ ˆ w )( µ − µ ˆ w ) ⊤ δ β η ( b ) ≤ C CDG ( √ ε + r d ) k δ β η k , where (a) follows from 0 ≤ ˆ w i ≤ − ε ) n ≤ n and P i ∈ I o ˆ w i ≤ ε and (b) followsfrom Proposition B.6 and P i ∈ I o ˆ w i ≤ P ni =1 ˆ w i = 1. asai and Fujisawa/Weighted Huber regression Proposition B.12 (Confirmation of (3.4)) . We have λ o n X i =1 (cid:18) − h (cid:18) ξ i − ˆ w i n ( x i − µ ˆ w ) ⊤ δ β λ o √ n (cid:19) + h (cid:18) ξ i λ o √ n (cid:19)(cid:19) ˆ w i √ n ( x i − µ ˆ w ) ⊤ δ β η λ o ≥ k δ β η k − λ o √ nC htcond √ ε k δ β η k − /δ ) λ o , where C htcond = C htδ + p /δ ) + √ C CDG + 2 √ C HT δ ) .Proof. From Proposition A.5 λ o n X i =1 (cid:18) − h (cid:18) ξ i − ˆ w i n ( x i − µ ˆ w ) ⊤ δ β λ o √ n (cid:19) + h (cid:18) ξ i λ o √ n (cid:19)(cid:19) ˆ w i √ n ( x i − µ ˆ w ) ⊤ δ β η λ o (B.9) ≥ a n E n X i =1 (cid:2) (( x i − µ ) ⊤ δ β η ) (cid:3) − a k δ β η k − a + B + C, (B.10)where a = 18 ,a = λ o √ n E "(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X i =1 α i ( x i − µ ) n (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) + λ o q E [ k ( x i − µ ) k ] r δ + λ o √ n k µ − µ ˆ w k ,a = λ o δB = λ o √ n X i ∈ I G ∩ I ˆ w< + X i ∈ I o (cid:18) − h (cid:18) ξ i λ o √ n − ˆ w i √ n ( x i − µ ˆ w ) ⊤ δ β λ o (cid:19) + h (cid:18) ξ i λ o √ n (cid:19)(cid:19) ˆ w i ( x i − µ ˆ w ) ⊤ δ β η C = λ o √ n X i ∈ I G ∩ I ˆ w< + X i ∈ I o (cid:18) − h (cid:18) ξ i λ o √ n − ( x i − µ ˆ w ) ⊤ δ β λ o √ n (cid:19) + h (cid:18) ξ i λ o √ n (cid:19)(cid:19) ( x i − µ ˆ w ) ⊤ δ β η . From Proposition B.4, we have1 n E " n X i =1 (( x i − µ ) ⊤ δ β η ) ≥ k δ β η k . From Proposition B.3 and (B.6), we have a ≤ λ o √ n C htδ r dn + r δ r dn + C CDG √ ε ! ≤ λ o √ n ( C htδ + p /δ )) r dn + C CDG √ ε ! ≤ λ o √ n (cid:18) ( C htδ + p /δ )) r d + C CDG √ ε (cid:19) ≤ λ o √ n (cid:18) C htδ + p /δ ) + C CDG (cid:19) ( √ ε + r d ) asai and Fujisawa/Weighted Huber regression Combining the above inequalities and Lemmas B.7 and B.8, we have λ o √ n n X i =1 (cid:18) − h (cid:18) ξ i − ˆ w i n ( x i − µ ˆ w ) ⊤ δ β λ o √ n (cid:19) + h (cid:18) ξ i λ o √ n (cid:19)(cid:19) ˆ w i ( x i − µ ˆ w ) ⊤ δ β η ≥ k δ β η k − λ o √ n (cid:18) C htδ + p /δ ) + C CDG (cid:19) ( √ ε + r d ) k δ β η k −
20 log(1 /δ ) λ o − λ o √ n (cid:16) √ C HT δ ) + √ C CDG (cid:17) ( √ ε + r d ) k δ β η k ≥ k δ β η k − λ o √ n C htδ + p /δ ) + 1 + 32 √ C CDG + 2 √ C HT δ ) ! ( √ ε + r d ) k δ β η k −
20 log(1 /δ ) λ o . The following two lemmas are used in the proof of Proposition B.12.
Lemma B.7.
We have X i ∈ I G ∩ I ˆ w< + X i ∈ I o − h ξ i λ o √ n − ˆ w i √ n ( x i − µ ˆ w ) ⊤ δ β η λ o ! + h (cid:18) ξ i λ o √ n (cid:19)! ˆ w i ( x i − µ ˆ w ) ⊤ δ β η ≤ (cid:16) √ C HT δ ) + √ C CDG (cid:17) ( √ ε + r d ) k δ β η k . Proof.
Let h i = − h ξ i λ o √ n − ˆ w i √ n ( x i − µ ˆ w ) ⊤ δ β η λ o ! + h (cid:18) ξ i λ o √ n (cid:19) . From 0 ≤ ˆ w i ≤ n , − ≤ h i ≤ X i ∈ I G ∩ I ˆ w< ( h i ˆ w i ) ≤ on , X i ∈ I o ( h i ˆ w i ) ≤ on and X i ∈ I G ∩ I ˆ w< + X i ∈ I o ( h i ˆ w i ) ≤ on . (B.11)In addition we note that X i ∈ I G ∩ I ˆ w< ˆ w i ≤ o (1 − ε ) n ≤ ε, X i ∈ I o ˆ w i ≤ o (1 − ε ) n ≤ ε and X i ∈ I G ∩ I ˆ w< + X i ∈ I o ˆ w i ≤ ε. (B.12) asai and Fujisawa/Weighted Huber regression From x i − µ ˆ w = x i − µ + µ − µ ˆ w , we have X i ∈ I G ∩ I ˆ w< + X i ∈ I o − h ξ i λ o √ n − ˆ w i √ n ( x i − µ ˆ w ) ⊤ δ β η λ o ! + h (cid:18) ξ i λ o √ n (cid:19)! ˆ w i ( x i − µ ˆ w ) ⊤ δ β η = X i ∈ I G ∩ I ˆ w< + X i ∈ I o h i ˆ w i ( x i − µ ) ⊤ δ β η + X i ∈ I G ∩ I ˆ w< + X i ∈ I o h i ˆ w i ( µ − µ ˆ w ) ⊤ δ β η (B.13)For the first term of the R.H.S of (B.13), we have X i ∈ I G ∩ I ˆ w< + X i ∈ I o h i ˆ w i ( x i − µ ) ⊤ δ β η ( a ) ≤ vuuut X i ∈ I G ∩ I ˆ w< + X i ∈ I o h i ˆ w i (cid:16) √ n + C HT δ p d log d (cid:17) k δ β η k b ) ≤ √ √ on (cid:16) √ n + C HT δ p d log d (cid:17) k δ β η k c ) ≤ √ (cid:0) √ ε + C HT δ r d (cid:1) k δ β η k ≤ √ C HT δ ) ( √ ε + r d ) k δ β η k where (a) follows from H¨older’s inequality and Proposition B.5, (b) follows from(B.11) and (c) follows from on ≤ X i ∈ I G ∩ I ˆ w< + X i ∈ I o h i ˆ w i ( µ − µ ˆ w ) ⊤ δ β ≤ X i ∈ I G ∩ I ˆ w< + X i ∈ I o ˆ w i X i ∈ I G ∩ I ˆ w< + X i ∈ I o ˆ w i | ( µ − µ ˆ w ) ⊤ δ β η | a ) ≤ ε X i ∈ I G ∩ I ˆ w< + X i ∈ I o ˆ w i | ( µ − µ ˆ w ) ⊤ δ β η | = 6 εδ ⊤ β η X i ∈ I G ∩ I ˆ w< + X i ∈ I o ˆ w i ( µ − µ ˆ w )( µ − µ ˆ w ) ⊤ δ β η ≤ δ ⊤ β η X i ∈ I G ∩ I ˆ w< + X i ∈ I o ˆ w i ( µ − µ ˆ w )( µ − µ ˆ w ) ⊤ δ β η ≤ δ ⊤ β η n X i =1 ˆ w i ( µ − µ ˆ w )( µ − µ ˆ w ) ⊤ δ β η ( b ) ≤ C CDG ( √ ε + r d ) k δ β η k , where (a) follows from (B.12) and (b) follows from Proposition B.6 and P i ∈ I o ˆ w i ≤ P ni =1 ˆ w i = 1. asai and Fujisawa/Weighted Huber regression Consequently, we have X G ∈ I G ∩ I ˆ w< + X i ∈ I o h i ˆ w i ( x i − µ ˆ w ) ⊤ δ β η ≤ √ C HT δ ) ( √ ε + r d ) k δ β η k + √ C CDG ( √ ε + r d ) k δ β η k = (cid:16) √ C HT δ ) + √ C CDG (cid:17) ( √ ε + r d ) k δ β η k . Lemma B.8.
We have λ o √ n X G ∈ I G ∩ I ˆ w< + X i ∈ I o − h ξ i λ o √ n − ( x i − µ ˆ w ) ⊤ δ β η λ o √ n ! + h (cid:18) ξ i λ o √ n (cid:19)! ( x i − µ ˆ w ) ⊤ δ β η ≤ (cid:16) √ C HT δ ) + √ C CDG (cid:17) ( √ ε + r d ) k δ β η k . Because the proof of Lemma B.8 is almost the same of the proof of LemmaB.7, we omit the proof of Lemma B.8.
B.2.1. Proof of Theorem B.1
From Proposition B.9, B.10, B.11 and B.12, conditions (3.1), (3.2), (3.3) and(3.4) are satisfied by c = λ o √ n (cid:0) C ht δ + C ht δ + C htcond (cid:1) ( √ ε + r d ) ,c = 116 ,c = λ o √ nC htcond ( √ ε + r d ) ,c = 5 λ o log(1 /δ ) = 5 c n with probability at least (1 − δ ) .We note that, from Proposition B.1 and the definition of λ o , λ o √ n = c max ,which is a constant with C cov = c htcov and C cond and C cond are also constant.Consequently, we have3 c + c + √ c c c ≤ λ o √ n (cid:18) C ht δ + C ht δ + C htcond + C htcond (cid:19) ( √ ε + r d ) + c max r
516 1 √ n ! ≤ λ o √ n (cid:18) C ht δ + C ht δ + C htcond + C htcond (cid:19) ( √ ε + r d ) + c max r r d ! ≤ λ o √ n (cid:18) C ht δ + C ht δ + C htcond + C htcond (cid:19) + c max r ! ( √ ε + r d ) ..