Robust Mean Estimation in High Dimensions via Global Outlier Pursuit
Aditya Deshmukh∗, Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, Urbana, IL 61801, [email protected]
Jing Liu∗, Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, Urbana, IL 61801, [email protected]
Venugopal V. Veeravalli, Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, Urbana, IL 61801, [email protected]

∗ Equal contribution
February 18, 2021
Abstract
We study the robust mean estimation problem in high dimensions, where less than half of the datapoints can be arbitrarily corrupted. Motivated by compressive sensing, we formulate the robust mean estimation problem as the minimization of the $\ell_0$-`norm' of an outlier indicator vector, under a second moment constraint on the datapoints. We further relax the $\ell_0$-`norm' to the $\ell_p$-norm ($0 < p \le 1$) in the objective and prove that the global minima for each of these objectives are order-optimal for the robust mean estimation problem. Then we propose a computationally tractable iterative $\ell_p$-minimization and hard thresholding algorithm that outputs an order-optimal robust estimate of the population mean. Both synthetic and real data experiments demonstrate that the proposed algorithm outperforms state-of-the-art robust mean estimation methods. The source code will be made available at GitHub.

1 Introduction

Robust mean estimation in high dimensions has received considerable interest recently, and has found applications in areas such as data analysis (e.g., spectral data in astronomy [1]), outlier detection [2, 3, 4] and distributed machine learning [5, 6, 7]. Classical robust mean estimation methods such as the coordinate-wise median and the geometric median have error bounds that scale with the dimension of the data [8], which results in poor performance in the high-dimensional regime. A notable exception is Tukey's Median [9], which has an error bound that is independent of the dimension when the fraction of outliers is less than a threshold [10, 11]. However, the computational complexity of Tukey's Median algorithm is exponential in the dimension.

A number of recent papers have proposed polynomial-time algorithms that have dimension-independent error bounds under certain distributional assumptions (e.g., bounded covariance or concentration properties). For a recent comprehensive survey on robust mean estimation, we refer the interested reader to [12]. One of the first such algorithms is Iterative Filtering [13, 14, 15], in which one finds the top eigenvector of the sample covariance matrix, removes (or down-weights) the points with large projection scores on that eigenvector, and then repeats this procedure on the rest of the points until the top eigenvalue is small. However, as discussed in [4], the drawback of this approach is that it only looks at one direction/eigenvector at a time, and the outliers may not exhibit unusual bias in only one direction or lie in a single cluster. Figure 1 illustrates an example for which Iterative Filtering might have poor empirical performance. In this figure, the inlier datapoints in blue are randomly generated from the standard Gaussian distribution in (high) dimension $d$, and therefore their $\ell_2$-distances to the origin are roughly $\sqrt{d}$ (see, e.g., Theorem 3.1 of [16]). There are two clusters of outliers in red, and their $\ell_2$-distances to the origin are also roughly $\sqrt{d}$. If there is only one cluster of outliers, Iterative Filtering can effectively identify them; however, in this example, this method may remove many inlier points and perform suboptimally.

Figure 1: Illustration of two clusters of outliers (red points). The inlier points (blue) are drawn from the standard Gaussian distribution in high dimension $d$. Both the outliers and inliers are at roughly $\sqrt{d}$ distance from the origin.

There are interesting connections between existing methods for robust mean estimation and those used in compressive sensing.
The Iterative Filtering algorithm has similarities to the greedy Matching Pursuit type of compressive sensing algorithm [17]. In the latter algorithm, one finds the single column of the sensing matrix $A$ that has the largest correlation with the measurements $b$, removes that column and its contribution from $b$, and repeats this procedure on the remaining columns of $A$. Dong et al. [4] proposed a new scoring criterion for finding outliers, in which one looks at multiple directions associated with large eigenvalues of the sample covariance matrix in every iteration of the algorithm. Interestingly, this approach is conceptually similar to Iterative Thresholding techniques in compressive sensing (e.g., Iterative Hard Thresholding [18] or Hard Thresholding Pursuit [19]), in which one simultaneously finds multiple columns of the matrix $A$ that are more likely to contribute to $b$. Although this type of approach is also greedy, it is more accurate than the Matching Pursuit technique in practice.

A common assumption in the robust mean estimation problem is that the fraction of corrupted datapoints is small. In this paper, we explicitly use this information through the introduction of an outlier indicator vector whose $\ell_0$-`norm' we minimize under a second moment constraint on the datapoints. This new formulation enables us to leverage well-studied compressive sensing techniques to solve the robust mean estimation problem.

We consider the setting in which the distribution of the datapoints before corruption has bounded covariance, as is commonly assumed in many recent works (e.g., [14, 4, 20, 21]). In particular, in [20], the authors propose to minimize the spectral norm of the weighted sample covariance matrix and use the knowledge of the outlier fraction $\alpha$ to constrain the weights. Along this line, two very recent works [22, 23] show that any approximate stationary point of the objective in [20] gives a near-optimal solution. In contrast, our objective is designed to minimize the sparsity of an outlier indicator vector, and we show that any sparse enough solution is nearly optimal.

There is another line of related work on mean estimation for heavy-tailed distributions; see, e.g., the recent survey article [24] and the references therein. The connection between robust mean estimation and heavy-tailed mean estimation is discussed in [25].

Contributions

• At a fundamental level, a contribution of this paper is the formulation of the robust mean estimation problem as minimizing the $\ell_0$-`norm' of the proposed outlier indicator vector, under a second moment constraint on the datapoints. In addition, order-optimal estimation error guarantees and a near-optimal breakdown point are shown for this objective.
• Another contribution is in relaxing the $\ell_0$ objective to $\ell_p$ ($0 < p \le 1$) as in compressive sensing, and, more importantly, establishing corresponding order-optimal estimation error guarantees.
• Motivated by the proposed $\ell_0$ and $\ell_p$ objectives and their theoretical justifications, we propose a computationally tractable iterative $\ell_p$ ($0 < p \le 1$) minimization and hard thresholding algorithm, and establish the order optimality of the algorithm. Empirical studies show that the proposed algorithms significantly outperform state-of-the-art methods in robust mean estimation.

2 Proposed optimization problem
We begin by defining what we mean by a corrupted sample of datapoints.
Definition 1. ($\alpha$-corrupted sample [4]) Let $P$ be a distribution on $\mathbb{R}^d$ with unknown mean $\mu$, and let $\tilde{y}_1, \ldots, \tilde{y}_n$ be independent and identically distributed (i.i.d.) datapoints drawn from $P$. These datapoints are then modified by an adversary who can inspect all the datapoints, remove $\alpha n$ of them, and replace them with arbitrary vectors in $\mathbb{R}^d$. We then obtain an $\alpha$-corrupted sample, denoted as $y_1, \ldots, y_n$.

Throughout the rest of the paper, we adhere to the notation given above: we represent a datapoint before corruption as $\tilde{y}_i$, and after corruption as $y_i$. There are other types of contamination one can consider, e.g., Huber's $\epsilon$-contamination model [26]. The contamination model described in Definition 1 is the strongest in the sense that the adversary is not oblivious to the original datapoints, and can replace any subset of $\alpha n$ datapoints with any vectors in $\mathbb{R}^d$. We refer the reader to [12] for a more detailed discussion on contamination models.

Our primary goal is to robustly estimate the average of the datapoints before corruption, given an $\alpha$-corrupted sample. A key insight that was exploited in previous work on the problem is that if the outliers in an $\alpha$-corrupted sample (of large size) shift the average of the datapoints before corruption by $\Omega(\xi)$ in a direction $\nu$, then the variance of the projected sample along $\nu$ is shifted by $\Omega(\xi^2/\alpha)$. Thus, intuitively, it suffices to find a large subset of the $\alpha$-corrupted sample whose sample covariance matrix has bounded spectral norm. In order for such a subset to exist, and for the mean of this large subset to be close to the true mean, we need some form of concentration of the datapoints (before corruption) around the mean of their distribution. A second moment condition is sufficient to guarantee this, and this assumption is also used in previous works. In particular, assuming that the spectral norm of the covariance matrix $\Sigma$ is bounded, we show in the Appendix (see (24)) that given an $\alpha$-corrupted sample of size $n \ge \frac{ed}{\epsilon\delta^2 c_0'}\log\left(\frac{d}{\delta}\right)$, there exists an index subset $\mathcal{I}$ such that the events

$\mathcal{E}_1 = \{|\mathcal{I}| \ge (1-\epsilon)n\}$,
$\mathcal{E}_2 = \left\{\lambda_{\max}\Big(\sum_{i\in\mathcal{I}} (\tilde{y}_i - \mu)(\tilde{y}_i - \mu)^\top\Big) \le c_1\sigma^2 n\right\}$,

hold with probability at least $1 - 2\delta$, i.e., $P(\mathcal{E}_1 \cap \mathcal{E}_2) \ge 1 - 2\delta$ [see also [27, Proposition B.1] and [20]]. Note that $\epsilon \in (0,1)$ dictates the trade-off between the size of the index set $\mathcal{I}$ and the minimum number of samples required. The constant $c_1$ is a tuning parameter, $\sigma^2$ is a known upper bound on the spectral norm of the covariance matrix of the distribution of the datapoints before corruption, and $c_0' = c_1\min\{c_1\log c_1 + 1 - c_1,\ 1\}$.

Based on this motivation, we propose an $\ell_0$-minimization problem to find the largest subset whose sample covariance matrix exhibits bounded spectral norm. We first introduce an outlier indicator vector $h$: for the $i$-th datapoint, $h_i$ indicates whether it is an outlier ($h_i = 1$) or not ($h_i = 0$). Given an $\alpha$-corrupted sample of size $n$, we propose the following optimization problem:

$\min_{h,x} \|h\|_0$ s.t. $h_i \in \{0,1\}\ \forall i$, $\quad \lambda_{\max}\Big(\sum_{i=1}^n (1-h_i)(y_i - x)(y_i - x)^\top\Big) \le c_1\sigma^2 n$.  (1)

We further relax the problem to the following:

$\min_{h,x} \|h\|_0$ s.t. $0 \le h_i \le 1\ \forall i$, $\quad \lambda_{\max}\Big(\sum_{i=1}^n (1-h_i)(y_i - x)(y_i - x)^\top\Big) \le c_1\sigma^2 n$.  (2)

Note that any globally optimal solution of (1) is also a globally optimal solution of (2).
We show in Theorem 1 that any sparse enough feasible solution, including the global optimum of (2), achieves order-optimality. However, the above $\ell_0$ objective is not computationally tractable. Motivated by compressive sensing, we further relax the $\ell_0$-`norm' to the $\ell_p$-norm ($0 < p \le 1$), which leads to the following optimization problem:

$\min_{h,x} \|h\|_p$ s.t. $0 \le h_i \le 1\ \forall i$, $\quad \lambda_{\max}\Big(\sum_{i=1}^n (1-h_i)(y_i - x)(y_i - x)^\top\Big) \le c_1\sigma^2 n$.  (3)

We show in Theorem 2 that even in this case any `good' feasible solution, including the global optimum, is order-optimal.

We now provide theoretical guarantees for the estimator given by the solution of the optimization problem (2). Assume that $\epsilon$ is fixed; it controls the trade-off between the sample size $n$ and the size of the set $\mathcal{I}$, as discussed previously. We show that given an $\alpha$-corrupted sample of size $\Omega\left(\frac{d\log d}{\epsilon}\right)$, with high probability, the $\ell_2$-norm of the estimator's error is bounded by $O(\sigma\sqrt{\alpha+\epsilon})$. We formalize this in the following theorem. It is well known that an information-theoretic lower bound on the $\ell_2$-norm of any estimator's error $\|\hat{x}-\mu\|_2$ is $\Omega(\sigma\sqrt{\alpha})$ (for completeness, we provide a proof in Lemma 2 in the Appendix). Thus, our estimator is order-optimal in terms of the error in estimating $\mu$.
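For concreteness, the following sketch shows how the second-moment constraint shared by problems (1)-(3) can be evaluated numerically for a candidate pair $(h, x)$. The function name and the use of NumPy are our own illustration and are not part of the paper; the objectives (2) and (3) then minimize $\|h\|_0$ or $\|h\|_p$ over the pairs that pass this check.

```python
import numpy as np

def spectral_constraint_satisfied(Y, h, x, c1, sigma2):
    """Check lambda_max( sum_i (1 - h_i)(y_i - x)(y_i - x)^T ) <= c1 * sigma2 * n,
    the constraint shared by problems (1)-(3).

    Y: (n, d) array of (possibly corrupted) datapoints; h: (n,) vector with
    entries in [0, 1]; x: (d,) candidate mean; sigma2: assumed bound on ||Sigma||_2.
    """
    n = Y.shape[0]
    R = Y - x                                   # residuals y_i - x
    M = R.T @ ((1.0 - h)[:, None] * R)          # weighted scatter matrix
    return np.linalg.eigvalsh(M).max() <= c1 * sigma2 * n
```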
Theorem 1. Let $P$ be a distribution on $\mathbb{R}^d$ with unknown mean $\mu$ and unknown covariance matrix $\Sigma \preceq \sigma^2 I$. Let $0 < \epsilon < 1/4$, $0 < \delta < 1/2$ and $c_1 > 1$ be fixed. Let $0 < \alpha < \frac{1}{2} - 2\epsilon$. Given an $\alpha$-fraction corrupted set of $n \ge \frac{ed}{\epsilon\delta^2 c_0'}\log\left(\frac{d}{\delta}\right)$ datapoints from $P$, let

$\mathcal{H} = \left\{ (h, x) : \|h\|_0 \le (\alpha+\epsilon)n;\ x = \frac{\sum_{\{i: h_i=0\}} y_i}{|\{i: h_i = 0\}|} \right\}$,  (4)

where $c_0' = c_1\min\{c_1\log c_1 + 1 - c_1,\ 1\}$ and $\alpha' = \alpha + \epsilon$. Then the following holds with probability at least $1 - 4\delta$:

1. Any feasible solution $(\hat{h}, \hat{x})$ of (2) such that $(\hat{h}, \hat{x}) \in \mathcal{H}$ satisfies

$\|\hat{x} - \mu\|_2 \le c_1\sigma\sqrt{\frac{\alpha'}{1-\epsilon}} + 2c_1\sigma\sqrt{\frac{\epsilon}{1-\alpha'}} + \sigma\sqrt{\epsilon\delta}\left(1 + 2\sqrt{\frac{c_0'}{e\log(d/\delta)}}\right)$  (5)
2. A global optimum of (2) lies in $\mathcal{H}$.

The proof of Theorem 1 is deferred to the Appendix. A high-level sketch is as follows. We use the property of resilience [21]. Informally, a set $S$ is said to be resilient if the average of any large subset of $S$ is close to the average of $S$. We show that the set $\{y_i : \hat{h}_i = 0\}$, and the set of datapoints before corruption that were within a distance of $\sigma\sqrt{\frac{d}{\epsilon\delta}}$ from $\mu$, are both resilient and large enough (of sizes at least $(1-\alpha')n$ and $(1-\epsilon)n$, respectively). Therefore they must have a large intersection. Utilizing the resilience property, one can show that the distance between the sample averages of these two sets is $O(\sigma\sqrt{\alpha'})$. Note that the average of the first set $\{y_i : \hat{h}_i = 0\}$ is $\hat{x}$, and using concentration bounds we can show that the distance between the average of the second set and $\mu$ is $O(\sigma\sqrt{\epsilon})$. Thus we establish that $\|\hat{x}-\mu\|_2$ is upper bounded as given in (5).
Remark 1. The breakdown point of the estimator in Theorem 1 is nearly the maximal possible value of $1/2$ (as $\epsilon \to 0$ and $n \to \infty$); that is, the estimator can tolerate any corruption level $\alpha < 1/2$, assuming the number of samples $n$ satisfies the lower bound.
Remark 2. The proof of Theorem 1 shows that, as long as we find a feasible solution $\hat{h}$ that is sparse enough, i.e., $\|\hat{h}\|_0 \le (\alpha+\epsilon)n$, the average of the estimated inliers, $\frac{\sum_{\{i:\hat{h}_i=0\}} y_i}{|\{i:\hat{h}_i=0\}|}$, is close to the true mean. It is not necessary to reach the global optimum of the objective (2).

We now provide a similar order-optimal error guarantee for the solution of the optimization problem in (3).
Theorem 2. Let $P$ be a distribution on $\mathbb{R}^d$ with unknown mean $\mu$ and unknown covariance matrix $\Sigma \preceq \sigma^2 I$. Let $0 < p \le 1$, $0 < \epsilon < 1/4$, $0 < \delta < 1/2$ and $c_1 > 1$ be fixed. Let $0 < \alpha < \frac{1}{2} - \epsilon$. Given an $\alpha$-fraction corrupted set of $n \ge \frac{ed}{\epsilon\delta^2 c_0'}\log\left(\frac{d}{\delta}\right)$ datapoints from $P$, let

$\mathcal{H}' = \left\{ (h, x) : \|h\|_p \le (\alpha' n)^{1/p};\ x = \frac{\sum_{i=1}^n (1-h_i) y_i}{\sum_{i=1}^n (1-h_i)} \right\}$,  (6)

where $c_0' = c_1\min\{c_1\log c_1 + 1 - c_1,\ 1\}$ and $\alpha' = \alpha + \epsilon$. Then the following holds with probability at least $1 - 4\delta$:

1. Any feasible solution $(\hat{h}, \hat{x})$ of (3) such that $(\hat{h}, \hat{x}) \in \mathcal{H}'$ satisfies

$\|\hat{x} - \mu\|_2 \le \left[\sqrt{\frac{c_1\alpha'}{1-\alpha'}} + 2c_1\sqrt{\frac{\alpha'}{1-\epsilon}} + \sqrt{\epsilon\delta}\left(1 + 2\sqrt{\frac{c_0'}{e\log(d/\delta)}}\right)\right]\frac{1-\alpha'}{1-2\alpha'}\,\sigma$  (7)
2. A global optimum of (3) lies in $\mathcal{H}'$.

The proof is deferred to the Appendix. The high-level idea is as follows: $\|\hat{h}\|_p \le (\alpha'n)^{1/p}$ ensures that the total weight $w_i = 1 - \hat{h}_i$ on the corrupted datapoints, as well as on the datapoints that are far away from $\mu$ (those whose distance to $\mu$ is more than $\sigma\sqrt{\frac{d}{\epsilon\delta}}$), is sufficiently small. We follow the approach in Lemma 5.2 of [20]. We establish bounds on the absolute value of the sum of the weighted inner products $w_i\left\langle y_i - \hat{x},\ \frac{\mu-\hat{x}}{\|\mu-\hat{x}\|_2}\right\rangle$ over the aforementioned datapoints. Specifically, using concentration bounds, we lower bound it by an affine function of $\|\hat{x}-\mu\|_2$ and upper bound it by a linear function of $\sqrt{\lambda_{\max}\left(\sum_{i=1}^n w_i(y_i-\hat{x})(y_i-\hat{x})^\top\right)}$. This helps us upper bound $\|\hat{x}-\mu\|_2$ as given in (7).
Remark 3. From Lemma 6 in the Appendix, we know that given any feasible solution of (3) with $\|\hat{h}\|_p \le (\alpha'n)^{1/p}$, the pair $\left(\hat{h},\ \frac{\sum_{i=1}^n(1-\hat{h}_i)y_i}{\sum_{i=1}^n(1-\hat{h}_i)}\right)$ is also a feasible solution, and therefore it lies in the set $\mathcal{H}'$ defined in (6). Theorem 2 further shows that this weighted average of the datapoints, $\frac{\sum_{i=1}^n(1-\hat{h}_i)y_i}{\sum_{i=1}^n(1-\hat{h}_i)}$, is close to the true mean. Again, we note that it is not necessary to reach the global optimum of the objective (3); we only need to find a feasible solution of (3) whose $\ell_p$-norm is small enough.

3 $\ell_p$ minimization and thresholding

Motivated by the $\ell_p$ objective and its theoretical guarantee, we propose an iterative $\ell_p$ minimization algorithm. The algorithm alternates between updating the outlier indicator vector $h$ via minimizing its $\ell_p$-norm and updating the estimated mean $x$; it is detailed in Algorithm 1. When updating the estimated mean $x$ in Step 2 of Algorithm 1, we add an option to threshold the $h_i$ by $\tau$, so that one can use the weighted average of the estimated `reliable' datapoints (i.e., those for which $h_i \approx 0$) to estimate $x$. This is motivated by the analysis of the original $\ell_0$ objective in Theorem 1, where the average of the estimated `reliable' datapoints, $\frac{\sum_{\{i:\hat{h}_i=0\}} y_i}{|\{i:\hat{h}_i=0\}|}$, is close to the true mean as long as the outlier indicator vector $\hat{h}$ is sparse enough. With this intuitive updating rule in Step 2, Algorithm 1 has the following order-optimal guarantee.
Theorem 3. Let $P$ be a distribution on $\mathbb{R}^d$ with unknown mean $\mu$ and unknown covariance matrix $\Sigma \preceq \sigma^2 I$. Let $0 < \delta \le 1$, $0 < \tau \le 1$, $0 < \epsilon < f(\tau)$ and $c_1 > 1$ be fixed. Let $0 \le \alpha < f(\tau) - \epsilon$. Given an $\alpha$-fraction corrupted set of $n \ge \frac{ed}{\epsilon\delta^2 c_0'}\log\left(\frac{d}{\delta}\right)$ datapoints from $P$ (with $n$ also exceeding an absolute constant times $\log(d/\delta)$, as needed for the coordinate-wise median initialization in Lemma 3 in the Appendix), with probability at least $1 - 5\delta$ the iterates of Algorithm 1 satisfy

$\|x^{(t)} - \mu\|_2 \le \sigma\left(c_2^{(0)}\gamma^t + \frac{\beta}{1-\gamma}\right)$  (8)

Algorithm 1 Robust Mean Estimation via $\ell_p$ Minimization and Thresholding
Inputs:
1) An $\alpha$-corrupted set of datapoints $\{y_i\}_{i=1}^n \subset \mathbb{R}^d$ generated by a distribution whose covariance matrix satisfies $\Sigma \preceq \sigma^2 I$.
2) Corruption level: $\alpha < 1/2$.
3) Upper bound on the spectral norm of $\Sigma$: $\sigma^2$.
4) Threshold: $0 < \tau \le 1$ and $\epsilon > 0$ such that $f(\tau) > \alpha' = \alpha + \epsilon$, where $f(\tau)$ is defined in (10).
5) Error probability tolerance level: $0 < \delta < 1$.
6) Set $c_1 > 1$.
7) Set $0 < p \le 1$ in $\ell_p$.
Initialize: 1) $x^{(0)}$ as the coordinate-wise median of $\{y_i\}_{i=1}^n$. 2) $c_2^{(0)} = 3\sqrt{d}$. 3) Iteration number $t = 0$.
while $t < T = 1 + \frac{0.5\log d + \log(3\sigma)}{\log(\tau - \alpha') - \log\left(\tau(\alpha' + \sqrt{\alpha'})\right)}$ do
Step 1: Given $x^{(t)}$, update $h$: $h^{(t)} \in \arg\min_h \|h\|_p$, s.t. $0 \le h_i \le 1\ \forall i$, $\lambda_{\max}\left(\sum_{i=1}^n (1-h_i)(y_i - x^{(t)})(y_i - x^{(t)})^\top\right) \le \left(c_1 + (c_2^{(t)})^2 + 2c_2^{(t)} c_3\right) n\sigma^2$, where $c_3$ is defined in (14).
Step 2: Given $h^{(t)}$, update $x$: $x^{(t+1)} = \frac{\sum_{i=1}^n (1-h_i^{(t)})\mathbb{1}\{h_i^{(t)} \le \tau\}\, y_i}{\sum_{i=1}^n (1-h_i^{(t)})\mathbb{1}\{h_i^{(t)} \le \tau\}}$.
$c_2^{(t+1)} = \gamma c_2^{(t)} + \beta$, where $\gamma$ and $\beta$ are defined in (12) and (13).
$t = t + 1$.
end while
Output: $x^{(T)}$

where $c_2^{(0)}$ is defined in Algorithm 1 and

$\alpha' = \alpha + \epsilon$  (9)
$f(\tau) = \frac{2g(\tau) + 1 - \sqrt{4g(\tau) + 1}}{2g(\tau)^2}$  (10)
$g(\tau) = 1 + \frac{1}{\tau}$  (11)
$\gamma = \frac{\tau(\alpha' + \sqrt{\alpha'})}{\tau - \alpha'}$  (12)
$\beta = \frac{\tau c_1\sqrt{\alpha'}}{\tau - \alpha'} + 2c_1\sqrt{\frac{\alpha'}{\tau(1-\epsilon)}} + c_3$  (13)
$c_3 = \sqrt{\epsilon\delta}\left(1 + 2\sqrt{\frac{c_0'}{e\log(d/\delta)}}\right)$.  (14)

The output of Algorithm 1 at the end of $T = 1 + \frac{0.5\log d + \log(3\sigma)}{|\log\gamma|} = O\left(\frac{\log d}{|\log\alpha'|}\right)$ iterations is order-optimal:

$\|x^{(T)} - \mu\|_2 \le \sigma\left(\gamma + \frac{\beta}{1-\gamma}\right) = O(\sigma\sqrt{\alpha'})$.  (15)

The proof is deferred to the Appendix, but we briefly discuss the design of the algorithm and the high-level approach. We use induction to show that $\|x^{(t)} - \mu\|_2 \le c_2^{(t)}\sigma$. We show in the Appendix that the coordinate-wise median satisfies $\|x^{(0)} - \mu\|_2 \le c_2^{(0)}\sigma = 3\sigma\sqrt{d}$ with high probability. Firstly, observe that in Step 1 of Algorithm 1, the constraint on the spectral norm of the weighted `covariance' matrix is $\left(c_1 + (c_2^{(t)})^2 + 2c_2^{(t)}c_3\right)\sigma^2 n$ instead of $c_1\sigma^2 n$ as in (3). This ensures, with high probability, that the optimization problem in Step 1 has a feasible solution, and that the optimal solution satisfies $\|h^{(t)}\|_p \le (\alpha'n)^{1/p}$. Secondly, we exploit the boundedness of $\|h^{(t)}\|_p$ to show that $x^{(t+1)}$ in Step 2 moves closer to $\mu$ than $x^{(t)}$. Specifically, we show that $\|x^{(t+1)} - \mu\|_2 \le \gamma\|x^{(t)} - \mu\|_2 + \beta\sigma \le (\gamma c_2^{(t)} + \beta)\sigma = c_2^{(t+1)}\sigma$. From the proof we can see that it is not necessary to reach the global optimum in Step 1; we only need to find a feasible solution whose $\ell_p$-norm is small enough.
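To make the overall structure of Algorithm 1 concrete, here is a simplified sketch of its alternating iterations. Step 1 below is a greedy stand-in for the $\ell_p$-minimization subproblem (the paper solves it as a packing SDP for $p=1$, or by iterative re-weighting for $0<p<1$), and the relaxed constraint level $\left(c_1 + (c_2^{(t)})^2 + 2c_2^{(t)}c_3\right)n\sigma^2$ is simplified to $c_1\sigma^2 n$; all function names, parameters and default values are our own and are only illustrative.

```python
import numpy as np

def robust_mean_lp_sketch(Y, alpha_prime, c1, sigma2, tau=0.3, T=20):
    """Simplified sketch of the alternating structure of Algorithm 1.

    Step 1 below is a greedy stand-in for the l_p-minimization subproblem (the
    paper solves it as a packing SDP for p = 1, or via iterative re-weighting for
    0 < p < 1). Step 2 is the thresholded weighted average from the paper.
    """
    n, d = Y.shape
    x = np.median(Y, axis=0)                        # coordinate-wise median initialization
    budget = int(np.ceil(alpha_prime * n))          # at most alpha' * n points may be flagged
    for _ in range(T):
        # Step 1 (heuristic stand-in): repeatedly flag the point with the largest
        # score along the top eigenvector of the weighted scatter matrix until the
        # spectral-norm constraint is met or the budget is exhausted.
        h = np.zeros(n)
        R = Y - x
        for _ in range(budget):
            M = R.T @ ((1.0 - h)[:, None] * R)
            lam, V = np.linalg.eigh(M)
            if lam[-1] <= c1 * sigma2 * n:
                break
            scores = (R @ V[:, -1]) ** 2 * (1.0 - h)
            h[np.argmax(scores)] = 1.0
        # Step 2: thresholded weighted average of the estimated inliers.
        w = (1.0 - h) * (h <= tau)
        x = (w[:, None] * Y).sum(axis=0) / w.sum()
    return x
```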
Remark 4. Observe that in Theorems 1, 2 and 3, $\epsilon$ controls the error tolerance level. Also, the lower bound on the required number of datapoints is $\Omega\left(\frac{d\log d}{\epsilon}\right)$, which is independent of the corruption level $\alpha$. Previous works (see, e.g., [13, 14, 20]) do not consider a tolerance level, and in these works the lower bound on the required number of datapoints is inversely proportional to the corruption level $\alpha$, which blows up as $\alpha \to 0$. Moreover, $\alpha$ is typically unknown in practice. Specifying $\epsilon$ to control the estimator's error helps us remove the dependence of the number of datapoints required on the fraction of corruption $\alpha$. Note that we can recover the order-optimal results in the form given in the previous works by setting $\epsilon = O(\alpha)$ in Theorems 1, 2 and 3.
Remark 5. The results of Theorems 1, 2 and 3 can be easily extended to establish the estimators' closeness to the average of the datapoints before corruption, $\tilde{\mu} = \frac{1}{n}\sum_{i=1}^n \tilde{y}_i$, using the fact that $\tilde{\mu}$ is close to $\mu$, which is shown in the Appendix (see (30)). We obtain the following extension to the above theorems with the same probability guarantees:

$\|\hat{x} - \tilde{\mu}\|_2 \le \|\hat{x} - \mu\|_2 + \sigma\sqrt{\frac{c_0'\epsilon\delta}{e\log(d/\delta)}}$.  (16)

Moreover, it can also be shown that the estimators are close to the average of the inliers that are at most a distance of $\sigma\sqrt{\frac{d}{\epsilon\delta}}$ from $\mu$.
Remark 6. The value of $\alpha$ can be replaced by an upper bound on $\alpha$ if the true value of the corruption level $\alpha$ is not known. The initialization $c_2^{(0)} = 3\sqrt{d}$ can be replaced by a smaller value as long as it is possible to guarantee $\|x^{(0)} - \mu\|_2 \le c_2^{(0)}\sigma$ with high probability.

When we set $p = 1$ in the objective $\|h\|_p$ in Step 1 of Algorithm 1, the resulting problem is convex, and can be reformulated as the following packing SDP [28], with $w_i \triangleq 1 - h_i$ and $e_i$ being the $i$-th standard basis vector in $\mathbb{R}^n$. The details can be found in the Appendix.

$\max_{w} \mathbf{1}^\top w$ s.t. $w_i \ge 0\ \forall i$,  (17)
$\sum_{i=1}^n w_i \begin{bmatrix} e_i e_i^\top & 0 \\ 0 & (y_i - x)(y_i - x)^\top \end{bmatrix} \preceq \begin{bmatrix} I_{n\times n} & 0 \\ 0 & c_1 n\sigma^2 I_{d\times d} \end{bmatrix}$

When $0 < p < 1$, the equivalent objective function $\|h\|_p^p = \sum_i h_i^p$ is concave, not convex, so it may be difficult to find its global minimum. Nevertheless, we can iteratively construct and minimize a tight upper bound on this objective function via iterative re-weighted $\ell_2$ [29, 30] or $\ell_1$ [31] techniques from compressive sensing. It is well known in compressive sensing that such iterative re-weighted approaches often perform better than plain $\ell_1$ minimization [31, 29].

Theorem 3 guarantees that the total number of iterations of Algorithm 1 required to achieve optimality is upper bounded by $O(\log d)$. In each iteration, the computational complexity of Step 2 is $O(nd)$. In Step 1, if we use $\ell_1$, we can solve the resulting packing SDP (17) to precision $1 - O(\epsilon)$ in $\tilde{O}(nd/\epsilon)$ parallelizable work using positive SDP solvers [32] (the notation $\tilde{O}(m)$ hides the poly-log factors: $\tilde{O}(m) = O(m\,\mathrm{polylog}(m))$). If we use $\ell_p$ with $0 < p < 1$ in Step 1, we iteratively construct and minimize a tight upper bound on the $\ell_p$ objective via iterative re-weighted $\ell_2$ [29, 30] or iterative re-weighted $\ell_1$ techniques [31]. (We observe that iterative re-weighted $\ell_1$ achieves better empirical performance; we run fewer than 10 re-weighted iterations in our implementation.) Minimizing the resulting re-weighted $\ell_1$ objective can also be done very efficiently to precision $1 - O(\epsilon)$ by formulating it as a packing SDP (see Appendix), with computational complexity $\tilde{O}(nd/\epsilon)$ [32]. If we use iterative re-weighted $\ell_2$, minimizing the resulting weighted $\ell_2$ objective is an SDP-constrained least squares problem, whose computational complexity is in general polynomial in both $d$ and $n$. We will explore more efficient solutions for this objective in future work.
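As an illustration of the $p = 1$ case, the following sketch poses Step 1 directly with CVXPY, using the spectral-norm form of the constraint, which is equivalent to the packing SDP (17). This is a generic convex-programming formulation rather than the specialized positive-SDP solver of [32]; the function name and parameters are our own, and the solve relies on an SDP-capable backend such as SCS.

```python
import numpy as np
import cvxpy as cp

def solve_step1_l1(Y, x, c1, sigma2):
    """p = 1 version of Step 1: maximize 1^T w (i.e. minimize ||h||_1 with h = 1 - w)
    over 0 <= w <= 1, subject to
        lambda_max( sum_i w_i (y_i - x)(y_i - x)^T ) <= c1 * sigma2 * n,
    which is the constraint encoded by the packing SDP (17)."""
    n, d = Y.shape
    R = Y - x
    w = cp.Variable(n)
    scatter = R.T @ cp.diag(w) @ R                 # sum_i w_i (y_i - x)(y_i - x)^T
    scatter = (scatter + scatter.T) / 2            # enforce symmetry for the atom below
    constraints = [w >= 0, w <= 1,
                   cp.lambda_max(scatter) <= c1 * sigma2 * n]
    cp.Problem(cp.Maximize(cp.sum(w)), constraints).solve(solver=cp.SCS)
    return 1.0 - w.value                           # outlier indicator h
```

This formulation is only meant to make the constraint structure explicit; for large $n$ and $d$, the packing-SDP solvers discussed above are the scalable option.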
4 Experiments

In this section, we present empirical results on the performance of the proposed methods and compare with the following state-of-the-art high-dimensional robust mean estimation methods: Iterative Filtering [14], the method proposed in [8] (denoted as LRV), the method in [20] (denoted as CDG), and Quantum Entropy Scoring (QUE) [4], which scores the outliers based on multiple directions. We fix p = 0. for the proposed $\ell_p$ method. In Algorithm 1, we set the threshold τ = 0. , δ = 1/ , $c_1$ = 1. , ε = α/ , and we initialize $c_2^{(0)}$ as the $\ell_2$ error of the coordinate-wise median relative to the true mean. We carefully tune the parameters in the compared methods. For evaluation, we define the recovery error as the $\ell_2$ distance of the estimated mean to the oracle solution, i.e., the average of the datapoints before corruption. We use a similar experiment setting as in [4].

The dimension of the data is $d$, and the number of datapoints is $n$. There are two clusters of outliers, and their $\ell_2$ distances to the true mean are similar to those of the inlier points. The inlier datapoints are randomly generated from the standard Gaussian distribution with zero mean. For the outliers, half of them are set to $[\sqrt{d/2},\ \sqrt{d/2},\ 0,\ \ldots,\ 0]^\top$, and the other half are set to $[\sqrt{d/2},\ -\sqrt{d/2},\ 0,\ \ldots,\ 0]^\top$, so that their $\ell_2$ distances to the true mean $[0, \ldots, 0]^\top$ are all $\sqrt{d}$, similar to those of the inlier points (a data-generation sketch for this setting is given after Table 1). We vary the total fraction $\alpha$ of the outliers and report the average recovery error of each method over 10 trials in Table 1 with $d = 100$, $n = 1000$. The proposed $\ell_0$ and $\ell_p$ methods show significant improvements over the competing methods, and the $\ell_p$ method performs the best.

Table 1: Recovery error of each method under different fractions $\alpha$ of the outlier points ($d = 100$, $n = 1000$)
$\alpha$    Iter Filter    QUE    LRV    CDG    $\ell_0$    $\ell_p$
10% 0.124 0.429 0.367 0.064
20% 0.131 0.492 0.659 0.084
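A minimal sketch of the synthetic data generation described above; the oracle mean returned here is the average of the datapoints before corruption, and the function name and parameter defaults are ours.

```python
import numpy as np

def make_corrupted_sample(n=1000, d=100, alpha=0.2, seed=0):
    """Synthetic setting from the experiments section: standard-Gaussian inliers,
    with an alpha fraction replaced by two outlier clusters at
    [sqrt(d/2), +/- sqrt(d/2), 0, ..., 0], i.e. at l2-distance sqrt(d) from the
    true mean 0. Returns the corrupted sample and the oracle mean."""
    rng = np.random.default_rng(seed)
    clean = rng.standard_normal((n, d))          # datapoints before corruption
    oracle_mean = clean.mean(axis=0)
    Y = clean.copy()
    idx = rng.choice(n, size=int(alpha * n), replace=False)
    half = len(idx) // 2
    Y[idx] = 0.0
    Y[idx, 0] = np.sqrt(d / 2.0)
    Y[idx[:half], 1] = np.sqrt(d / 2.0)          # first outlier cluster
    Y[idx[half:], 1] = -np.sqrt(d / 2.0)         # second outlier cluster
    return Y, oracle_mean
```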
We also tested the performance of each method for different numbers of datapoints. The dimension of the data is fixed to 100, and the fraction of corrupted points is fixed to 20%. We vary the number of datapoints from 100 to 1000 and report the average recovery error for each method over 100 trials in Table 2 (a usage sketch combining the earlier code examples is given after Table 2). The performance of all methods improves as the number of datapoints increases. Again, the proposed methods consistently perform better than the other methods.

Table 2: Recovery error of each method w.r.t. different numbers of samples ($d = 100$, $\alpha = 0.2$)
$n$    $\ell_0$    $\ell_p$
100 0.493 1.547 1.423 0.316
200 0.313 1.038 1.084 0.198
500 0.186 0.680 0.794 0.148
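As a usage illustration only, the two sketches above can be combined to reproduce the shape of this experiment; the parameter values passed below (e.g., $\alpha' = 0.22$, $c_1 = 1.5$, 10 trials) are placeholders and are not the paper's settings.

```python
import numpy as np

# Assumes make_corrupted_sample and robust_mean_lp_sketch from the earlier sketches.
for n in [100, 200, 500, 1000]:
    errs = []
    for trial in range(10):
        Y, oracle = make_corrupted_sample(n=n, d=100, alpha=0.2, seed=trial)
        x_hat = robust_mean_lp_sketch(Y, alpha_prime=0.22, c1=1.5, sigma2=1.0)
        errs.append(np.linalg.norm(x_hat - oracle))
    print(n, float(np.mean(errs)))
```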
Here we use a dataset of real face images to test the effectiveness of the robust mean estimation methods. The average face of particular regions or certain groups of people is useful for many social and psychological studies [33]. We use 100 frontal human face images from the Brazilian face database (https://fei.edu.br/~cet/facedatabase.html) as inliers. For the outliers, we choose 15 face images of cats and dogs from the CIFAR10 [34] database. In order to be able to run all of the methods, the images are downsampled to 18 ×
15 pixels, so the dimension of each datapoint is 270. The oracle solution is the average of the 100 human faces. Table 3 reports the recovery error, i.e., the $\ell_2$ distance of the estimated mean to the oracle solution, for each method. The proposed methods achieve smaller recovery error than the state-of-the-art methods. The sample inlier and outlier images, as well as the estimated mean for each method, can be found in the Appendix.

Table 3: Recovery error of the mean face by each method
Sample average    Iter Filter    LRV    CDG    $\ell_0$    $\ell_p$
141    63    83    81    38    46
5 Conclusion

We formulated the robust mean estimation problem as the minimization of the $\ell_0$-`norm' of the introduced outlier indicator vector, under a second moment constraint on the datapoints. We further relaxed the $\ell_0$ objective to $\ell_p$ ($0 < p \le 1$) and theoretically justified the new objective. We then proposed a computationally tractable iterative $\ell_p$ ($0 < p \le 1$) minimization and hard thresholding algorithm, which significantly outperforms state-of-the-art robust mean estimation methods and is order-optimal. In the empirical studies, we observed strong numerical evidence that $\ell_p$ ($0 < p \le 1$) leads to sparse solutions; theoretically justifying this phenomenon is also of interest.

References

[1] R. A. Maronna and R. H. Zamar, “Robust estimates of location and dispersion for high-dimensional datasets,”
Technometrics , vol. 44, no. 4, pp. 307–317, 2002.[2] P. J. Huber,
Robust statistics . Springer, 2011.[3] R. A. Maronna, R. D. Martin, V. J. Yohai, and M. Salibián-Barrera,
Robust statistics: theory and methods(with R) . Wiley, 2018.[4] Y. Dong, S. Hopkins, and J. Li, “Quantum entropy scoring for fast robust mean estimationand improved outlier detection,” in
Advances in Neural Information Processing Systems 32 .Curran Associates, Inc., 2019, pp. 6067–6077. [Online]. Available: http://papers.nips.cc/paper/8839-quantum-entropy-scoring-for-fast-robust-mean-estimation-and-improved-outlier-detection.pdf[5] Y. Chen, L. Su, and J. Xu, “Distributed statistical machine learning in adversarial settings: Byzantinegradient descent,” in
Proc. ACM Measurement and Analysis of Computing Systems , vol. 1, no. 2. ACMNew York, NY, USA, 2017, pp. 1–25.[6] D. Yin, Y. Chen, R. Kannan, and P. Bartlett, “Byzantine-robust distributed learning: Towards optimalstatistical rates,” in
International Conference on Machine Learning , 2018, pp. 5650–5659.[7] S. Bubeck, N. Cesa-Bianchi, and G. Lugosi, “Bandits with heavy tail,”
IEEE Transactions on InformationTheory , vol. 59, no. 11, pp. 7711–7717, 2013.[8] K. A. Lai, A. B. Rao, and S. Vempala, “Agnostic estimation of mean and covariance,” in , 2016, pp. 665–674.[9] J. W. Tukey, “Mathematics and the picturing of data,” in
Proceedings of the International Congress ofMathematicians, Vancouver, 1975 , vol. 2, 1975, pp. 523–531.[10] D. L. Donoho, M. Gasko et al. , “Breakdown properties of location estimates based on halfspace depth andprojected outlyingness,”
The Annals of Statistics, vol. 20, no. 4, pp. 1803–1827, 1992. [11] B. Zhu, J. Jiao, and J. Steinhardt, “When does the Tukey median work?” arXiv preprint arXiv:2001.07805, 2020. [12] I. Diakonikolas and D. M. Kane, “Recent advances in algorithmic high-dimensional robust statistics,” arXiv preprint arXiv:1911.05911, 2019. [13] I. Diakonikolas, G. Kamath, D. M. Kane, J. Li, A. Moitra, and A. Stewart, “Robust estimators in high dimensions without the computational intractability,” in , 2016, pp. 655–664. [14] I. Diakonikolas, G. Kamath, D. M. Kane, J. Li, A. Moitra, and A. Stewart, “Being robust (in high dimensions) can be practical,” in
Proceedings of the 34th International Conference on Machine Learning-Volume 70 ,2017, pp. 999–1008.[15] J. Steinhardt, “Robust learning: Information theory and algorithms,” Ph.D. dissertation, Stanford University,2018.[16] S. Adams, “High-dimensional probability lecture notes,” 2020.[17] S. Mallat and Z. Zhang, “Matching pursuits with time-frequency dictionaries,”
IEEE Trans. Signal Process. ,vol. 41, pp. 3397–3415, 1993.[18] T. Blumensath and M. E. Davies, “Iterative hard thresholding for compressed sensing,”
Applied andcomputational harmonic analysis , vol. 27, no. 3, pp. 265–274, 2009.[19] S. Foucart, “Hard thresholding pursuit: an algorithm for compressive sensing,”
SIAM Journal on NumericalAnalysis , vol. 49, no. 6, pp. 2543–2563, 2011.[20] Y. Cheng, I. Diakonikolas, and R. Ge, “High-dimensional robust mean estimation in nearly-linear time,” in
Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms , ser. SODA ’19. USA:Society for Industrial and Applied Mathematics, 2019, p. 2755–2771.[21] J. Steinhardt, M. Charikar, and G. Valiant, “Resilience: A criterion for learning in the presence of arbitraryoutliers,” arXiv preprint arXiv:1703.04940 , 2017.[22] Y. Cheng, I. Diakonikolas, R. Ge, and M. Soltanolkotabi, “High-dimensional robust mean estimation viagradient descent,” arXiv preprint arXiv:2005.01378 , 2020.[23] B. Zhu, J. Jiao, and J. Steinhardt, “Robust estimation via generalized quasi-gradients,” arXiv preprintarXiv:2005.14073 , 2020.[24] G. Lugosi and S. Mendelson, “Mean estimation and regression under heavy-tailed distributions–a survey,” arXiv preprint arXiv:1906.04280 , 2019.[25] S. B. Hopkins, J. Li, and F. Zhang, “Robust and heavy-tailed mean estimation made simple, via regretminimization,” in
Advances in Neural Information Processing Systems 33 , 2020.[26] P. J. Huber, “Robust estimation of a location parameter,”
Ann. Math. Statist. , vol. 35, no. 1, pp. 73–101, 031964. [Online]. Available: https://doi.org/10.1214/aoms/1177703732[27] M. Charikar, J. Steinhardt, and G. Valiant, “Learning from untrusted data,” in
Proceedings of the 49th AnnualACM SIGACT Symposium on Theory of Computing , ser. STOC 2017. New York, NY, USA: Associationfor Computing Machinery, 2017, p. 47–60. [Online]. Available: https://doi.org/10.1145/3055399.3055491[28] G. Iyengar, D. J. Phillips, and C. Stein, “Approximation algorithms for semidefinite packing problemswith applications to maxcut and graph coloring,” in
International Conference on Integer Programming andCombinatorial Optimization . Springer, 2005, pp. 152–166.[29] R. Chartrand and W. Yin, “Iteratively reweighted algorithms for compressive sensing,” in , 2008, pp. 3869–3872.[30] I. F. Gorodnitsky and B. D. Rao, “Sparse signal reconstruction from limited data using focuss: a re-weightedminimum norm algorithm,”
IEEE Trans. Signal Process. , vol. 45, no. 3, pp. 600–616, Mar. 1997.[31] E. J. Candes, M. B. Wakin, and S. P. Boyd, “Enhancing sparsity by reweighted l1 minimization,”
Journal ofFourier analysis and applications , vol. 14, no. 5-6, pp. 877–905, 2008.[32] Z. Allen-Zhu, Y. T. Lee, and L. Orecchia, “Using optimization to obtain a width-independent, parallel,simpler, and faster positive sdp solver,” in
Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms, 2016. [33] A. C. Little, B. C. Jones, and L. M. DeBruine, “Facial attractiveness: evolutionary based research,”
Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 366, no. 1571, pp. 1638–1659, 2011. [34] A. Krizhevsky et al., “Learning multiple layers of features from tiny images,” 2009.
Appendix
We introduce the following parameters, which control the minimum number of datapoints needed, the error, and the confidence level. Let $0 < \epsilon < 1$, $\delta > 0$ and $c_1 > 1$ be fixed. Let $c_0' = c_1\min\{c_1\log c_1 + 1 - c_1,\ 1\}$. Let $S = \{\tilde{y}_1, \ldots, \tilde{y}_n\}$ be a set of $n \ge \frac{ed}{\epsilon\delta^2 c_0'}\log\left(\frac{d}{\delta}\right)$ datapoints drawn from a distribution $P$ with mean $\mu$ and covariance matrix $\Sigma \preceq \sigma^2 I$. We now define $G$ as the set of datapoints which are less than $\sigma\sqrt{\frac{d}{\epsilon\delta}}$ distance away from $\mu$:

$\mathcal{I} = \left\{ i : \|\tilde{y}_i - \mu\|_2 \le \sigma\sqrt{\frac{d}{\epsilon\delta}} \right\}$  (18)
$G = \{\tilde{y}_i : i \in \mathcal{I}\}$.  (19)

It follows from Lemma 4 that for the event
$\mathcal{E}_1 = \{|\mathcal{I}| \ge n - \epsilon n\}$,  (20)
$\Pr(\mathcal{E}_1) \ge 1 - \delta$.  (21)
Let $\mathcal{E}_2$ be the event
$\mathcal{E}_2 = \left\{ \lambda_{\max}\Big(\sum_{i\in\mathcal{I}} (\tilde{y}_i - \mu)(\tilde{y}_i - \mu)^\top\Big) \le c_1\sigma^2 n \right\}$.  (22)
It follows from Lemma 5 that
$\Pr(\mathcal{E}_2) \ge 1 - \delta$.  (23)
Thus, we have that
$\Pr(\mathcal{E}_1 \cap \mathcal{E}_2) \ge 1 - 2\delta$.  (24)

For analysis purposes, we consider the far-away uncorrupted datapoints $S \setminus G$ as outliers as well. Let $\{y_1, \ldots, y_n\}$ be an $\alpha$-corrupted version of $S$. Let $h^*$ be such that $h_i^* = 1$ for the outliers (both far-away uncorrupted datapoints and corrupted datapoints), and $h_i^* = 0$ for the rest of the uncorrupted datapoints, i.e.,
$h_i^* = \begin{cases} 1, & \text{if } y_i \ne \tilde{y}_i \text{ or } \tilde{y}_i \in S\setminus G \\ 0, & \text{otherwise} \end{cases}$  (25)
Let the set of inliers be given by $G^*$:
$\mathcal{I}^* = \{i : h_i^* = 0\}$  (26)
$G^* = \{y_i : i \in \mathcal{I}^*\} = \{\tilde{y}_i : i \in \mathcal{I}^*\}$  (27)
Note that $\mathcal{I}^* \subseteq \mathcal{I}$ and $G^* \subseteq G$. Since $(\tilde{y}_i - \mu)(\tilde{y}_i - \mu)^\top$ is PSD, we must have
$\lambda_{\max}\Big(\sum_{i=1}^n (1 - h_i^*)(y_i - \mu)(y_i - \mu)^\top\Big) \le \lambda_{\max}\Big(\sum_{i\in\mathcal{I}} (\tilde{y}_i - \mu)(\tilde{y}_i - \mu)^\top\Big)$.
This implies that
$\left\{ \lambda_{\max}\Big(\sum_{i=1}^n (1 - h_i^*)(y_i - \mu)(y_i - \mu)^\top\Big) \le c_1\sigma^2 n \right\} \supseteq \mathcal{E}_2$.  (28)
Then, we have
$\Pr\left\{ \lambda_{\max}\Big(\sum_{i=1}^n (1 - h_i^*)(y_i - \mu)(y_i - \mu)^\top\Big) \le c_1\sigma^2 n \right\} \ge \Pr(\mathcal{E}_2) \ge 1 - \delta$.  (29)
Our intended solution is to have $h_i = 0$ for the inlier points and $h_i = 1$ for the outlier points.

Table 4: Description of variables
$\mu$ : Mean (expected value) of the population distribution
$\tilde{\mu}$ : Average of all datapoints before corruption
$\bar{x}$ : Average of vectors in $G$, the set of datapoints within $\sigma\sqrt{d/(\epsilon\delta)}$ of $\mu$
$\bar{x}^*$ : Average of vectors in $G^*$, the set of inliers within $G$

We now introduce some more events (cf. [14, Lemma A.18]):
$\mathcal{E}_3 = \left\{ \left\|\frac{1}{n}\sum_{i=1}^n (\tilde{y}_i - \mu)\right\|_2 \le \sigma\sqrt{\frac{c_0'\epsilon\delta}{e\log(d/\delta)}} \right\}$  (30)
$\mathcal{E}_4 = \left\{ \left\|\frac{1}{n}\sum_{i=1}^n (z_i - \mathbb{E}[z])\right\|_2 \le \sigma\sqrt{\frac{c_0'\epsilon\delta}{e\log(d/\delta)}} \right\}$,  (31)
where $z_i = (\tilde{y}_i - \mu)\,\mathbb{1}\left\{\|\tilde{y}_i - \mu\|_2 > \sigma\sqrt{\frac{d}{\epsilon\delta}}\right\}$. From Lemma 4, we get that if $n \ge \frac{ed}{\epsilon\delta^2 c_0'}\log\left(\frac{d}{\delta}\right)$,
$\Pr(\mathcal{E}_3) \ge 1 - \delta$, and $\Pr(\mathcal{E}_4) \ge 1 - \delta$.  (32)
Let $\mathcal{E}$ be the event given by
$\mathcal{E} = \mathcal{E}_1 \cap \mathcal{E}_2 \cap \mathcal{E}_3 \cap \mathcal{E}_4$.  (33)

Definition 2. (Resilience [21]) A set of points $y_1, \ldots, y_m$ lying in $\mathbb{R}^d$ is $(\sigma_1, \beta)$-resilient in $\ell_2$-norm around a point $x$ if, for all its subsets $T$ of size at least $(1-\beta)m$, we have $\left\| \frac{1}{|T|}\sum_{y_i\in T} y_i - x \right\|_2 \le \sigma_1$.

Lemma 1. ([21, Section 1.3]) For a set of datapoints $S \triangleq \{y_i\}$, let $\bar{x} = \frac{1}{|S|}\sum_{y_i\in S} y_i$.
If $\lambda_{\max}\left(\frac{1}{|S|}\sum_{y_i \in S}(y_i-\bar{x})(y_i-\bar{x})^\top\right) \le \sigma_1^2$, then the set $S$ is $\left(2\sigma_1\sqrt{\beta},\, \beta\right)$-resilient in $\ell_2$-norm around $\bar{x}$ for all $\beta < 0.5$.
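The following numerical illustration of Definition 2 and Lemma 1 searches over a few extreme subsets of a bounded-covariance point set and compares the worst subset-mean deviation found with the $2\sigma_1\sqrt{\beta}$ bound. It is an empirical check under our own choice of parameters (the subset search is a heuristic probe of the worst case), not a proof.

```python
import numpy as np

rng = np.random.default_rng(1)
m, d, beta = 2000, 20, 0.1
S = rng.standard_normal((m, d))
x_bar = S.mean(axis=0)
R = S - x_bar
evals, evecs = np.linalg.eigh(R.T @ R / m)
sigma1 = np.sqrt(evals[-1])                      # sqrt of lambda_max of the centered scatter

k = int((1 - beta) * m)                          # subsets of size (1 - beta) m
dirs = [evecs[:, -1]] + [v / np.linalg.norm(v) for v in rng.standard_normal((20, d))]
worst = 0.0
for v in dirs:
    keep = np.argsort(R @ v)[-k:]                # drop the beta*m points with smallest projection on v
    worst = max(worst, np.linalg.norm(S[keep].mean(axis=0) - x_bar))

print(f"largest subset-mean deviation found: {worst:.3f}")
print(f"Lemma 1 bound 2*sigma1*sqrt(beta)  : {2 * sigma1 * np.sqrt(beta):.3f}")
```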
Lemma 2. Let $\mathcal{P}_d$ be the set of all probability distributions on $\mathbb{R}^d$ such that the spectral norm of the covariance matrix is upper bounded by $\sigma^2$. Given an estimator $\hat{x}$, there exists a distribution $P \in \mathcal{P}_d$ with mean, say $\mu$, and a corruption procedure such that, given an $\alpha$-corrupted sample with the datapoints (before corruption) generated through $P$,

$\|\hat{x} - \mu\|_2 = \Omega(\sigma\sqrt{\alpha})$.  (34)

Proof.
Fix $0 < \alpha \le 1/2$. Consider the following one-dimensional distributions. Let $P_1$ be the distribution with its entire mass on $\{0\}$, and let $P_2$ be the distribution with support $\{0, \sigma/\sqrt{\alpha}\}$ and $\alpha$ probability mass on $\sigma/\sqrt{\alpha}$. Note that the variances of both distributions are upper bounded by $\sigma^2$, and so $P_1, P_2 \in \mathcal{P}_1$. Let $N$ be the distribution with its entire mass on $\{\sigma/\sqrt{\alpha}\}$. Clearly we have

$(1-\alpha)P_1 + \alpha N = P_2$.  (35)

Suppose an adversary corrupts the datapoints in the following way: if the datapoints are generated through $P_1$, then corrupt them using $N$, so that the resultant distribution is $(1-\alpha)P_1 + \alpha N$, and if the datapoints are generated through $P_2$, then leave them uncorrupted. Note that the distance between the means of $P_1$ and $P_2$ is $\sigma\sqrt{\alpha}$. It follows that the distance of any estimator from at least one of the means of $P_1$ or $P_2$ would be $\Omega(\sigma\sqrt{\alpha})$. The case for $d > 1$ easily follows.

Lemma 3.
Let P be a distribution on R d with mean µ and covariance matrix Σ (cid:22) σ I . Let α ≤ / . Given an α -fraction corrupted set of n datapoints from P , the coordinate-wise median of the corrupted set, ˆ x , satisfieswith probability at least − d exp( − n/ that (cid:107) ˆ x − µ (cid:107) ≤ σ √ d. (36)13 roof. We first show that with high probability the error in each dimension is bounded by σ . Fix a coordinate,and let ˜ y i , y i , µ and ˆ x be the component of ˜ y i , y i , µ and ˆ x respectively in that coordinate. By Markov’sinequality, we have P ( | ˜ y i − µ i | ≥ σ ) ≤ / . (37)Let b i = 1 {| ˜ y i − µ i | ≥ σ } . By Chernoff’s inequality, we obtain P (cid:32) n (cid:88) i =1 b i ≥ n/ (cid:33) ≤ exp (cid:18) − (5 / n / (cid:19) ≤ exp( − n/ . (38)Thus with high probability more than three-fourth of the datapoints satisfy | y i − µ i | ≤ σ , which impliesthat even if α ≤ / fraction of datapoints are corrupted, we would have | ˆ x − µ | ≤ σ. (39)Applying union bound, we get that with probability at least − d exp( − n/ , the error in each dimension isbounded by σ and hence (cid:107) ˆ x − µ (cid:107) ≤ σ √ d holds. Lemma 4.
Let < (cid:15) ≤ , < δ ≤ , c (cid:48) > , and n ≥ ed(cid:15)δ c (cid:48) log (cid:0) dδ (cid:1) . Let E , E and E be the events as describedin (20) , (30) and (31) . Then, Pr( E ) ≥ − δ, Pr( E ) ≥ − δ, and Pr( E ) ≥ − δ, Proof.
By Markov’s inequality we have
Pr( | G c | > (cid:15)n ) ≤ E [ | G c | ] (cid:15)n (40) = E (cid:20) n (cid:80) i =1 (cid:26) (cid:107) ˜ y i − µ (cid:107) > σ (cid:113) d(cid:15)δ (cid:27)(cid:21) (cid:15)n (41) = Pr (cid:18) (cid:107) ˜ y − µ (cid:107) > σ (cid:113) d(cid:15)δ (cid:19) (cid:15) . (42)Applying Markov’s inequality again, we have Pr (cid:32) (cid:107) ˜ y − µ (cid:107) > σ (cid:114) d(cid:15)δ (cid:33) ≤ (cid:15)δE (cid:2) (cid:107) ˜ y − µ (cid:107) (cid:3) σ d (43) = (cid:15)δ Tr( E [( ˜ y − µ )( ˜ y − µ ) (cid:62) ]) σ d (44) ≤ (cid:15)δσ dσ d (45) = (cid:15)δ. (46)Thus, we get Pr( | G c | > (cid:15)n ) ≤ δ (47) Pr( | G | ≥ (1 − (cid:15) ) n ) ≥ − δ. (48)14his proves the result for E . Applying Markov’s inequality again, we obtain Pr (cid:32)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n (cid:88) i =1 ( ˜ y i − µ ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ σ (cid:115) c (cid:48) (cid:15)δe log( d/δ ) (cid:33) ≤ E (cid:34)(cid:13)(cid:13)(cid:13)(cid:13) n n (cid:80) i =1 ( ˜ y i − µ ) (cid:13)(cid:13)(cid:13)(cid:13) (cid:35) c (cid:48) (cid:15)δσ e log( d/δ ) (49) = e log( d/δ ) c (cid:48) (cid:15)δσ . d (cid:88) k =1 E (cid:2) (˜ µ k − µ k ) (cid:3) (50) ≤ e log( d/δ ) c (cid:48) (cid:15)δσ . dσ n (51) = δ. (52)This proves the result for E . By similar reasoning, the result for E follows. Lemma 5.
Let < (cid:15) ≤ , < δ ≤ , c > , c (cid:48) = [ c min { c log c + 1 − c , } ] and n ≥ ed(cid:15)δ c (cid:48) log (cid:0) dδ (cid:1) . Let E be the event described in (22) . Then Pr( E ) ≥ − δ. Proof.
We adopt the approach in [14, Lemma A.18 (iv)]. Lemma A.19 from [14] states that the following: Let { X i } ni =1 be d x d positve semi-definite random matrices such that λ max ( X i ) ≤ L almost surely for all i . Let S = n (cid:80) i =1 X i and M = λ max ( E [ S ]) . Then, for any θ > , E [ λ max ( S )] ≤ ( e θ − M/θ + L log( d ) /θ, (53)and for any ζ > , Pr( λ max ( S ) ≥ (1 + ζ ) M ) ≤ d (cid:18) e ζ (1 + ζ ) ζ (cid:19) M/L . (54)We apply this result by assigning X i = ( ˜ y i − µ )( ˜ y i − µ ) (cid:62) (cid:26) (cid:107) ˜ y i − µ (cid:107) ≤ σ (cid:113) d(cid:15)δ (cid:27) . Note that λ max ( X i ) ≤ L = σ d(cid:15)δ for all i ∈ [ n ] , and M ≤ nλ max ( E [ X ]) ≤ nσ . We consider two mutually exclusive cases:1) Suppose that M < e − δc σ n . Applying (53) with θ = 1 , we obtain E [ λ max ( S )] ≤ ( e − M + L log d. (55)Applying Markov’s inequality, we obtain Pr( λ max ( S ) ≥ c σ n ) ≤ E [ λ max ( S )] c σ n (56) ≤ ( e − δc σ nec σ n + σ d log d(cid:15)δc σ n (57) ≤ ( e − δe + δe (58) = δ. (59)The inequality in (57) follows from the assumption that M < e − δc σ n and the inequality in (58) follows fromthe fact that n ≥ ed(cid:15)δ c (cid:48) log (cid:0) dδ (cid:1) ≥ ed log d(cid:15)δ c .2) Suppose that M ≥ e − δc σ n . Applying (54) with ζ = c − , we obtain Pr( λ max ( S ) ≥ c σ n ) ≤ Pr( λ max ( S ) ≥ c M ) (60) ≤ d (cid:32) e c − ( c ) c (cid:33) δc σ ne . (cid:15)δσ d (61) ≤ δ. (62)The inequality in (60) follows from the fact that M ≤ nσ , the inequality in (62) follows from the fact that e ζ < (1 + ζ ) ζ for any ζ > , and the fact that n ≥ ed(cid:15)δ c (cid:48) log (cid:0) dδ (cid:1) ≥ ed log ( dδ ) (cid:15)δ c ( c log( c )+1 − c ) .15 emma 6. Given a set of points y i ∈ R d , i = 1 , . . . , n , then for any w ∈ R n we have x w (cid:44) n (cid:80) i =1 w i y i (cid:107) w (cid:107) ∈ arg min x λ max (cid:32) n (cid:88) i =1 w i ( y i − x )( y i − x ) (cid:62) (cid:33) (63) Proof.
We have min x λ max (cid:32) n (cid:88) i =1 w i ( y i − x )( y i − x ) (cid:62) (cid:33) = min x max ν : (cid:107) ν (cid:107) =1 n (cid:88) i =1 w i (cid:104) y i − x , ν (cid:105) (64) ≥ max ν : (cid:107) ν (cid:107) =1 min x n (cid:88) i =1 w i (cid:104) y i − x , ν (cid:105) (65) = max ν : (cid:107) ν (cid:107) =1 n (cid:88) i =1 w i (cid:104) y i − x w , ν (cid:105) (66) = λ max (cid:32) n (cid:88) i =1 w i ( y i − x w )( y i − x w ) (cid:62) (cid:33) . (67)The equality (66) follows from the fact that the minimum in the RHS of (65) is attained at x w = n (cid:80) i =1 w i y i (cid:107) w (cid:107) .Consequently, (63) holds. Lemma 7.
Let < (cid:15) < and n ≥ ed(cid:15)δ c (cid:48) log (cid:0) dδ (cid:1) . Suppose (cid:107) x − µ (cid:107) ≤ c σ . Then on event E defined in (33) , h ∗ satisfies λ max (cid:32) n (cid:88) i =1 (1 − h ∗ i )( y i − x )( y i − x ) (cid:62) (cid:33) ≤ ( c + c + 2 c c ) σ n, (68) where c = √ (cid:15)δ (cid:16) (cid:113) c (cid:48) e log( d/δ ) (cid:17) . roof. Let I and I ∗ be the sets defined in (18) and (26). We have λ max (cid:32) n (cid:88) i =1 (1 − h ∗ i )( y i − x )( y i − x ) (cid:62) (cid:33) (69) = λ max (cid:32) (cid:88) i ∈ I ∗ ( ˜ y i − x )( ˜ y i − x ) (cid:62) (cid:33) (70) ≤ λ max (cid:32)(cid:88) i ∈ I ( ˜ y i − x )( ˜ y i − x ) (cid:62) (cid:33) (71) = λ max (cid:32)(cid:88) i ∈ I ( ˜ y i − µ + µ − x )( ˜ y i − µ + µ − x ) (cid:62) (cid:33) (72) ≤ λ max (cid:32)(cid:88) i ∈ I ( ˜ y i − µ )( ˜ y i − µ ) (cid:62) (cid:33) + λ max (cid:32)(cid:88) i ∈ I ( x − µ )( x − µ ) (cid:62) (cid:33) + λ max (cid:32)(cid:88) i ∈ I ( ˜ y i − µ )( µ − x ) (cid:62) (cid:33) + λ max (cid:32)(cid:88) i ∈ I ( µ − x )( ˜ y i − µ ) (cid:62) (cid:33) (73) = λ max (cid:32)(cid:88) i ∈ I ( ˜ y i − µ )( ˜ y i − µ ) (cid:62) (cid:33) + | I | λ max (cid:0) ( x − µ )( x − µ ) (cid:62) (cid:1) + | I | λ max (cid:32)(cid:34) | I | (cid:88) i ∈ I ( ˜ y i − µ ) (cid:35) ( µ − x ) (cid:62) (cid:33) + | I | λ max ( µ − x ) (cid:34) | I | (cid:88) i ∈ I ( ˜ y i − µ ) (cid:35) (cid:62) (74) ≤ λ max (cid:32)(cid:88) i ∈ I ( ˜ y i − µ )( ˜ y i − µ ) (cid:62) (cid:33) + | I |(cid:107) x − µ (cid:107) + 2 | I |(cid:107) x − µ (cid:107) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) | I | (cid:88) i ∈ I ( ˜ y i − µ ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (75) ≤ c σ n + c σ n + 2 c c σ n (76) =( c + c + 2 c c ) σ n. (77)The last inequality follows from the definition of E in (22) and Lemma 8. Lemma 8.
Let < (cid:15) < and n ≥ ed(cid:15)δ c (cid:48) log (cid:0) dδ (cid:1) . Let ˜ y , . . . , ˜ y n be i.i.d. datapoints drawn from a distributionwith mean µ and covariance matrix Σ (cid:52) σ I . Let G be the set defined in (19) . Let ¯ x be the average of datapointsin G . Then the following holds on the event E ∩ E ∩ E , where the events are defined in (20) , (30) and (31) : (cid:107) ¯ x − µ (cid:107) ≤ σ √ (cid:15)δ (cid:32) (cid:115) c (cid:48) e log( d/δ ) (cid:33) . (78) Proof.
Note that 17 (cid:13)(cid:13)(cid:13) | G | n ( ¯ x − µ ) (cid:13)(cid:13)(cid:13)(cid:13) (79) = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n (cid:88) i =1 ( ˜ y i − µ ) − n n (cid:88) i =1 ( ˜ y i − µ ) (cid:40) (cid:107) ˜ y i − µ (cid:107) > σ (cid:114) d(cid:15)δ (cid:41)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (80) ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n (cid:88) i =1 ( ˜ y i − µ ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) + (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n (cid:88) i =1 z i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (81) ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n (cid:88) i =1 ( ˜ y i − µ ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) + (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n (cid:88) i =1 ( z i − E [ z ]) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) + (cid:107) E [ z ] (cid:107) , (82)where z i = ( ˜ y i − µ ) (cid:26) (cid:107) ˜ y i − µ (cid:107) > σ (cid:113) d(cid:15)δ (cid:27) .The last term is upper bounded as follows, (cid:107) E [ z ] (cid:107) = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) E (cid:34) ( ˜ y − µ ) (cid:40) (cid:107) ˜ y − µ (cid:107) > σ (cid:114) d(cid:15)δ (cid:41)(cid:35)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (83) = max (cid:107) v (cid:107) =1 v (cid:62) E (cid:34) ( ˜ y − µ ) (cid:40) (cid:107) ˜ y ! − µ (cid:107) > σ (cid:114) d(cid:15)δ (cid:41)(cid:35) (84) = max (cid:107) v (cid:107) =1 E (cid:34) v (cid:62) ( ˜ y − µ ) (cid:40) (cid:107) ˜ y − µ (cid:107) > σ (cid:114) d(cid:15)δ (cid:41)(cid:35) (85) (a) ≤ max (cid:107) v (cid:107) =1 (cid:118)(cid:117)(cid:117)(cid:116) E [ v (cid:62) ( ˜ y − µ )] P (cid:32) (cid:107) ˜ y − µ (cid:107) > σ (cid:114) d(cid:15)δ (cid:33) (86) = (cid:118)(cid:117)(cid:117)(cid:116) λ max (Σ) P (cid:32) (cid:107) ˜ y − µ (cid:107) > σ (cid:114) d(cid:15)δ (cid:33) (87) (b) ≤ √ σ (cid:15)δ (88) = σ √ (cid:15)δ. (89)The inequality (a) follows from Cauchy-Schwarz inequality, and (b) follows from Markov’s inequality.From (82), (21), (32), and (89), we get that on the event E ∩ E ∩ E , (cid:107) ¯ x − µ (cid:107) ≤ σ √ (cid:15)δ (cid:32) (cid:115) c (cid:48) e log( d/δ ) (cid:33) . (90) Lemma 9.
Let P be a distribution on R d with mean µ and covariance matrix (cid:22) σ I . Let (cid:15) > , α > and η > be such that α + (cid:15) + η < / . Given an α -fraction corrupted set of n ≥ ed(cid:15)δ c (cid:48) log (cid:0) dδ (cid:1) datapoints from P .Let I , G , I ∗ and G ∗ be the sets defined in (18) , (19) , (26) and (27) . Let ¯ x be the average of elements of G .Then it holds on the event E ∩ E that sup w ∈ ∆ [ n ] ,η (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (cid:88) i ∈ I ∗ w i ( y i − ¯ x ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ c σ (cid:114) α + (cid:15) + η − (cid:15) , (91) where, ∆ S,η = { w ∈ R n : (cid:107) w (cid:107) = 1 , ≤ w i ≤ − η ) | S | , ∀ i ∈ S, and w i = 0 , ∀ i / ∈ S } (92) = conv (cid:26) w ∈ R n : w i = 1 { i ∈ T }| T | , where T ⊆ S, with | T | = (1 − η ) | S | (cid:27) . (93)18 roof. On the event E ∩ E , we have that G is (cid:16) c σ (cid:113) β − (cid:15) , β (cid:17) -resilient for any β < (this follows easily fromLemma 1 and the definitions of E , E ∈ and G ). We show that this implies that on event E ∩ E , ∀ β < , sup w ∈ ∆ I ,β (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:88) i ∈ I w i ˜ y i − ¯ x (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ c σ (cid:114) β − (cid:15) . (94)We prove (94) as follows. Let β < / be fixed. We define w S for S ⊆ [ n ] as: w S,i = (cid:40) − β ) | I | , i ∈ S , i / ∈ S. (95)Since G is (cid:16) c σ (cid:113) β − (cid:15) , β (cid:17) -resilient on event E ∩E , we have from Definition 2, that ∀ S ⊆ I with | S | = (1 − β ) | I | , (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:88) i ∈ I w S,i ˜ y i − ¯ x (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ c σ (cid:114) β − (cid:15) . (96)Observe that ∆ I ,β is a convex hull: ∆ I,β = conv { w S : S ⊆ I , | S | = (1 − β ) | I |} . (97)Consider some w ∈ ∆ I ,β . From (97), we get that ∃ λ S ≥ such that w = (cid:88) S ⊆ I , | S | =(1 − β ) | I | λ S w S , and (cid:88) S ⊆ I , | S | =(1 − β ) | I | λ S = 1 . (98)Hence, (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:88) i ∈ I w i ˜ y i − ¯ x (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:88) i ∈ I (cid:88) S ⊆ I , | S | =(1 − β ) | I | λ S w S,i ˜ y i − ¯ x (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (99) = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (cid:88) S ⊆ I , | S | =(1 − β ) | I | λ S (cid:34)(cid:88) i ∈ I w S,i ˜ y i − ¯ x (cid:35)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (100) ≤ (cid:88) S ⊆ I , | S | =(1 − β ) | I | λ S (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:88) i ∈ I w S,i ˜ y i − ¯ x (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (101) ≤ (cid:88) S ⊆ I , | S | =(1 − β ) | I | λ S c σ (cid:114) β − (cid:15) (102) = 2 c σ (cid:114) β − (cid:15) . (103)Since this holds for any w ∈ ∆ I,β , we conclude that (94) holds. On event E ∩ E , we also have that | I ∗ | ≥ (1 − α − (cid:15) ) n . Now consider some w ∈ ∆ [ n ] ,η . We define ˜ w ∈ R n as follows ˜ w i = (cid:40) w i (cid:80) j ∈ I ∗ w j , i ∈ I ∗ , i / ∈ I ∗ . (104)19e have (cid:107) ˜ w (cid:107) = (cid:80) i ∈ I ∗ ˜ w i = (cid:80) i ∈ I ˜ w i = 1 . Observe that for any i ∈ I ∗ , ˜ w i = w i (cid:80) j ∈ I ∗ w j (105) = w i − (cid:80) j / ∈ I ∗ w j (106) ≤ − η ) n − ( α + (cid:15) ) n (1 − η ) n (107) = 1(1 − α − (cid:15) − η ) n (108) ≤ − α − (cid:15) − η ) | I | . (109)Hence, we get that ˜ w ∈ ∆ I ,α + (cid:15) + η . 
Since ∀ i ∈ I ∗ , y i = ˜ y i , we obtain (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (cid:88) i ∈ I ∗ w i ( y i − ¯ x ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) = (cid:88) j ∈ I ∗ w j (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (cid:88) i ∈ I ∗ ˜ w i ( ˜ y i − ¯ x ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (110) ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:88) i ∈ I ˜ w i ( ˜ y i − ¯ x ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (111) = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:88) i ∈ I ˜ w i ˜ y i − ¯ x (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (112) ≤ c σ (cid:114) α + (cid:15) + η − (cid:15) . (113)Since this holds for any w ∈ ∆ [ n ] ,η , we get that on event E ∩ E , sup w ∈ ∆ [ n ] ,η (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (cid:88) i ∈ I ∗ w i ( y i − ¯ x ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ c σ (cid:114) α + (cid:15) + η − (cid:15) . (114) Lemma 10.
Let < τ ≤ . Suppose h ∈ R n such that ∀ i, ≤ h i ≤ , and (cid:107) h (cid:107) ≤ αn for some α ∈ [0 , .Then n (cid:88) i =1 (1 − h i )1 { h i ≤ τ } ≥ (cid:16) − ατ (cid:17) n. (115) Proof.
We first show that n (cid:80) i =1 { h i > τ } ≤ αnτ . Observe that αn ≥ n (cid:88) i =1 h i = n (cid:88) i =1 h i { h i ≤ τ } + n (cid:88) i =1 h i { h i > τ } (116) ≥ τ n (cid:88) i =1 { h i > τ } . (117)Hence, we have n (cid:88) i =1 { h i > τ } ≤ αnτ . (118)20onsequently, we obtain n (cid:88) i =1 (1 − h i )1 { h i ≤ τ } = n (cid:88) i =1 (1 − h i ) − n (cid:88) i =1 (1 − h i )1 { h i > τ } (119) ≥ n (cid:88) i =1 (1 − h i ) − (1 − τ ) n (cid:88) i =1 { h i > τ } (120) ≥ (1 − α ) n − (1 − τ ) αnτ (121) = (cid:16) − ατ (cid:17) n. (122) Proof.
First, note that for any globally optimal solution of (2), by setting all its non-zero h i to be 1, we canalways get corresponding feasible and globally optimal ( h opt , x opt ) with h opt i ∈ { , } and x opt = (cid:80) { i : h opt i =0 } y i |{ i : h opt i =0 }| (i.e., x opt is the average of the y i ’s corresponding to h opt i = 0 ), and the objective value remains unchanged.Consider h ∗ as defined in (25). Let α (cid:48) (cid:44) (cid:15) + α . Note that E = {| I | ≥ (1 − (cid:15) ) n } ⊆ { n − (cid:107) h ∗ (cid:107) ≥ (1 − α (cid:48) ) n } = {(cid:107) h ∗ (cid:107) ≤ α (cid:48) n } . It follows from (21) and (29) that Pr( E ∩ E ) ≥ − δ .Now, on the event E ∩ E , we have that G is (cid:16) c σ (cid:113) β − (cid:15) , β (cid:17) -resilient around ¯ x for any β < (this followseasily from Lemma 1 and the definitions of E , E and G ) and that | G | ≥ (1 − (cid:15) ) n . Also, on the event E itfollows from (28) that λ max (cid:32) n (cid:88) i =1 (1 − h ∗ i )( y i − ¯ x ∗ )( y i − ¯ x ∗ ) (cid:62) (cid:33) ≤ λ max (cid:32) n (cid:88) i =1 (1 − h ∗ i )( y i − µ )( y i − µ ) (cid:62) (cid:33) ≤ c σ n. This implies that on event E , ( h ∗ , ¯ x ∗ ) is feasible. Since ( h opt , x opt ) is globally optimal, and ( h ∗ , ¯ x ∗ ) is feasible,we have (cid:107) h opt (cid:107) ≤ (cid:107) h ∗ (cid:107) ≤ α (cid:48) n . Thus n − (cid:107) h opt (cid:107) ≥ n − α (cid:48) n . Note that λ max (cid:32) n (cid:88) i =1 (1 − h opt i )( y i − x opt )( y i − x opt ) (cid:62) (cid:33) ≤ c σ n. (123)Normalizing (123) leads to λ max (cid:32) n − (cid:107) h opt (cid:107) n (cid:88) i =1 (1 − h opt i )( y i − x opt )( y i − x opt ) (cid:62) (cid:33) ≤ nc σ n − α (cid:48) n = c σ − α (cid:48) . (124)Since h opt i ∈ { , } , (124) implies that the set S opt (cid:44) { y i | h opt i = 0 } is also (cid:16) c σ (cid:113) β − α (cid:48) , β (cid:17) -resilient in (cid:96) -norm around its sample average x opt for all β < . by Lemma 1. We also have | S opt | ≥ (1 − α (cid:48) ) n .Let T (cid:44) G ∩ S opt , and set β = α (cid:48) − (cid:15) and β = (cid:15) − α (cid:48) . Since (cid:15) < / and α < − (cid:15) , we have β < . and β < . . One can verify that | T | ≥ (1 − β ) | G | and | T | ≥ (1 − β ) | S opt | . Then, from the property of resiliencein Definition 2, we have (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) | T | (cid:88) y i ∈ T y i − ¯ x (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ c σ (cid:114) β − (cid:15) and (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) | T | (cid:88) y i ∈ T y i − x opt (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ c σ (cid:114) β − α (cid:48) . By the triangle inequality, we obtain (cid:107) ¯ x − x opt (cid:107) ≤ c σ √ α (cid:48) − (cid:15) + 2 c σ √ (cid:15) − α (cid:48) . E defined in (33), using Lemma 8 and applying triangle inequality, we obtain thatwith probability at least − δ (cid:107) µ − x opt (cid:107) ≤ c σ √ α (cid:48) − (cid:15) + 2 c σ √ (cid:15) − α (cid:48) + σ √ (cid:15)δ (cid:32) (cid:115) c (cid:48) e log( d/δ ) (cid:33) . For any feasible solution of (2) with (cid:107) ˆ h (cid:107) ≤ α (cid:48) n , following similar proof as above, we have the same errorbounds for (cid:13)(cid:13)(cid:13) ¯ x − (cid:80) { i :ˆ hi =0 } y i |{ i :ˆ h i =0 }| (cid:13)(cid:13)(cid:13) and (cid:13)(cid:13)(cid:13) µ − (cid:80) { i :ˆ hi =0 } y i |{ i :ˆ h i =0 }| (cid:13)(cid:13)(cid:13) as above. Proof.
Proof.
Let $(\hat{h}, \hat{x}) \in \mathcal{H}'$ be a feasible solution to (3) with some $0 < p \le 1$. We have
\begin{equation}
\|\hat{h}\|_p \le (\alpha' n)^{1/p}. \tag{125}
\end{equation}
Since $0 \le \hat{h}_i \le 1$ for all $i$, we have $\hat{h}_i \le \hat{h}_i^p$, and hence
\begin{equation}
\left[\sum_{i=1}^n \hat{h}_i\right]^{1/p} \le \left[\sum_{i=1}^n \hat{h}_i^p\right]^{1/p} \le (\alpha' n)^{1/p}. \tag{126}
\end{equation}
This implies the following:
\begin{align}
\|\hat{h}\|_1 &\le \alpha' n, \tag{127}\\
\|1 - \hat{h}\|_1 &\ge (1-\alpha')n. \tag{128}
\end{align}
Let $w = \frac{1-\hat{h}}{\|1-\hat{h}\|_1}$. Observe that $w \in \Delta_{[n],\alpha'}$, where $\Delta_{S,\eta}$ is defined in (92). Now we follow the proof strategy in Lemma 5.2 of [20]. Let $x = \sum_{i=1}^n w_i y_i$ and $\nu = \frac{x-\mu}{\|x-\mu\|}$. Let $I^*$ be the set defined in (26). Note that on the event $E$, we have $|I^*| \ge (1-\alpha')n$. Now, on the event $E$,
\begin{align}
&\left|\sum_{i\notin I^*} w_i \langle y_i - x, \nu\rangle\right| \tag{129}\\
&\ge \left|\sum_{i\notin I^*} w_i \langle y_i - \mu, \nu\rangle\right| - \left(\sum_{i\notin I^*} w_i\right)\left|\langle \mu - x, \nu\rangle\right| \tag{130}\\
&\ge \left|\sum_{i=1}^n w_i \langle y_i - \mu, \nu\rangle\right| - \left|\sum_{i\in I^*} w_i \langle y_i - \mu, \nu\rangle\right| - \frac{\alpha'}{1-\alpha'}\|x - \mu\| \tag{131}\\
&\ge \|x-\mu\| - \Big\|\sum_{i\in I^*} w_i (y_i - \mu)\Big\| - \frac{\alpha'}{1-\alpha'}\|x-\mu\| \tag{132}\\
&\ge \left(1 - \frac{\alpha'}{1-\alpha'}\right)\|x-\mu\| - \Big\|\sum_{i\in I^*} w_i (y_i - \bar{x})\Big\| - \Big\|\sum_{i\in I^*} w_i (\bar{x} - \mu)\Big\| \tag{133}\\
&\ge \left(1 - \frac{\alpha'}{1-\alpha'}\right)\|x-\mu\| - \Big\|\sum_{i\in I^*} w_i (y_i - \bar{x})\Big\| - \|\bar{x} - \mu\| \tag{134}\\
&\ge \left(1 - \frac{\alpha'}{1-\alpha'}\right)\|x-\mu\| - 2c_1\sigma\sqrt{\frac{\alpha'}{1-\epsilon}} - \sigma\sqrt{\epsilon\delta}\,\sqrt{c' e \log(d/\delta)}. \tag{135}
\end{align}
Note that the last inequality follows from Lemmas 8 and 9. Using the Cauchy--Schwarz inequality, we get
\begin{align}
\left(\sum_{i\notin I^*} w_i\right)\left(\sum_{i\notin I^*} w_i \langle y_i - x, \nu\rangle^2\right) &\ge \left(\sum_{i\notin I^*} w_i \langle y_i - x, \nu\rangle\right)^2 \tag{136}\\
\frac{\alpha'}{1-\alpha'}\,\lambda_{\max}\left(\sum_{i=1}^n w_i (y_i - x)(y_i - x)^\top\right) &\ge \left(\sum_{i\notin I^*} w_i \langle y_i - x, \nu\rangle\right)^2 \tag{137}\\
\frac{\sigma\sqrt{\alpha' c_1}}{1-\alpha'} &\ge \left|\sum_{i\notin I^*} w_i \langle y_i - x, \nu\rangle\right|, \tag{138}
\end{align}
where (138) uses $\lambda_{\max}\big(\sum_{i=1}^n w_i (y_i - x)(y_i - x)^\top\big) \le \lambda_{\max}\big(\sum_{i=1}^n w_i (y_i - \hat{x})(y_i - \hat{x})^\top\big) \le \frac{c_1\sigma^2}{1-\alpha'}$, which follows from the feasibility of $(\hat{h},\hat{x})$ and (127)--(128). Thus, we obtain
\begin{equation}
\|x - \mu\| \le \frac{\sigma}{1 - \frac{\alpha'}{1-\alpha'}}\left[\frac{\sqrt{\alpha' c_1}}{1-\alpha'} + 2c_1\sqrt{\frac{\alpha'}{1-\epsilon}} + \sqrt{\epsilon\delta}\,\sqrt{c' e \log(d/\delta)}\right]. \tag{139}
\end{equation}
Let $(h^{\mathrm{opt}}, x^{\mathrm{opt}})$ be an optimal solution to (3). From Lemma 6 we have that $\left(h^{\mathrm{opt}},\, \frac{\sum_{i=1}^n (1-h^{\mathrm{opt}}_i) y_i}{\sum_{i=1}^n (1-h^{\mathrm{opt}}_i)}\right)$ is also an optimal solution. Note that on the event $E$, $(h^*, \mu)$ is a feasible solution of (3). Hence,
\begin{equation}
\|h^{\mathrm{opt}}\|_p \le \|h^*\|_p \le (\alpha' n)^{1/p}. \tag{140}
\end{equation}
This implies
\begin{equation}
\left(h^{\mathrm{opt}},\, \frac{\sum_{i=1}^n (1-h^{\mathrm{opt}}_i) y_i}{\sum_{i=1}^n (1-h^{\mathrm{opt}}_i)}\right) \in \mathcal{H}'. \tag{141}
\end{equation}
The result follows from the fact that $\Pr(E) \ge 1 - \delta$.
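The only fact used between (125) and (127) is that $t \le t^p$ for $t \in [0,1]$ and $0 < p \le 1$, so an $\ell_p$-ball constraint of radius $(\alpha' n)^{1/p}$ forces $\|\hat{h}\|_1 \le \alpha' n$. A small numerical check of this implication (illustrative only, not from the paper):

```python
import numpy as np

# For h in [0,1]^n with ||h||_p <= (a*n)^(1/p), verify ||h||_1 <= a*n.
rng = np.random.default_rng(2)
n, p, a = 500, 0.5, 0.1
for _ in range(1000):
    h = rng.uniform(0, 1, n)
    h *= min(1.0, (a * n) ** (1 / p) / np.linalg.norm(h, ord=p))  # rescale into the l_p ball
    assert np.sum(h ** p) <= a * n + 1e-9   # ||h||_p^p <= a*n
    assert np.sum(h) <= a * n + 1e-9        # hence ||h||_1 <= a*n
```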
Proof.
Suppose $\|x^{(0)} - \mu\| \le c_2\sigma$. Let $\tilde{h}$ be the optimal solution to
\begin{align}
\min_{h'}\;& \|h'\|_p \tag{142}\\
\text{s.t. }& \lambda_{\max}\left(\sum_{i=1}^n (1-h'_i)(y_i - x^{(0)})(y_i - x^{(0)})^\top\right) \le \big(c_1 + c_2^2 + 2c_2\sqrt{c_1}\big)\, n\sigma^2, \tag{143}\\
& 0 \le h'_i \le 1,\ \forall i \in [n]. \tag{144}
\end{align}
From Lemma 7, we have that on the event $E$, $h^*$ is a feasible solution to the above optimization problem. Hence,
\begin{equation}
\|\tilde{h}\|_p \le \|h^*\|_p \le (\alpha' n)^{1/p}. \tag{145}
\end{equation}
Since $0 \le \tilde{h}_i \le 1$ for all $i$, we have
\begin{equation}
\left[\sum_{i=1}^n \tilde{h}_i\right]^{1/p} \le \left[\sum_{i=1}^n \tilde{h}_i^p\right]^{1/p} \le (\alpha' n)^{1/p}. \tag{146}
\end{equation}
This implies
\begin{equation}
\|\tilde{h}\|_1 \le \alpha' n. \tag{147}
\end{equation}
Let $w$ be such that $w_i = \frac{(1-\tilde{h}_i)\mathbb{1}\{\tilde{h}_i \le \tau\}}{\sum_{j=1}^n (1-\tilde{h}_j)\mathbb{1}\{\tilde{h}_j \le \tau\}}$. By Lemma 10, we have that $w \in \Delta_{[n],\,\alpha'/\tau}$, where $\Delta_{S,\eta}$ is defined in (92). Now we follow the proof of Theorem 2. Let $x^{(1)} = \sum_{i=1}^n w_i y_i$ and $\nu = \frac{x^{(1)}-\mu}{\|x^{(1)}-\mu\|}$. Let $I^*$ be the set defined in (26). Note that on the event $E$, we have $|I^*| \ge (1-\alpha')n$. Now, on the event $E$,
\begin{align}
&\left|\sum_{i\notin I^*} w_i \big\langle y_i - x^{(0)}, \nu\big\rangle\right| \tag{148}\\
&\ge \left|\sum_{i\notin I^*} w_i \langle y_i - \mu, \nu\rangle\right| - \left(\sum_{i\notin I^*} w_i\right)\big|\big\langle \mu - x^{(0)}, \nu\big\rangle\big| \tag{149}\\
&\ge \left|\sum_{i=1}^n w_i \langle y_i - \mu, \nu\rangle\right| - \left|\sum_{i\in I^*} w_i \langle y_i - \mu, \nu\rangle\right| - \frac{\alpha'\tau}{\tau-\alpha'}\,\|x^{(0)} - \mu\| \tag{150}\\
&\ge \|x^{(1)}-\mu\| - \Big\|\sum_{i\in I^*} w_i (y_i - \mu)\Big\| - \frac{\alpha'\tau\,\|x^{(0)} - \mu\|}{\tau-\alpha'} \tag{151}\\
&\ge \|x^{(1)}-\mu\| - \Big\|\sum_{i\in I^*} w_i (y_i - \bar{x})\Big\| - \Big\|\sum_{i\in I^*} w_i (\bar{x} - \mu)\Big\| - \frac{\alpha'\tau\,\|x^{(0)} - \mu\|}{\tau-\alpha'} \tag{152}\\
&\ge \|x^{(1)}-\mu\| - \Big\|\sum_{i\in I^*} w_i (y_i - \bar{x})\Big\| - \|\bar{x} - \mu\| - \frac{\alpha'\tau\,\|x^{(0)} - \mu\|}{\tau-\alpha'} \tag{153}\\
&\ge \|x^{(1)}-\mu\| - 2c_1\sigma\sqrt{\frac{\alpha'}{(1-\tau)(1-\epsilon)}} - c_0\sigma - \frac{\alpha'\tau\, c_2\sigma}{\tau-\alpha'}. \tag{154}
\end{align}
Note that the last inequality (154) follows from Lemmas 8 and 9. Using the Cauchy--Schwarz inequality, we get
\begin{align}
\left(\sum_{i\notin I^*} w_i\right)\left(\sum_{i\notin I^*} w_i \big\langle y_i - x^{(0)}, \nu\big\rangle^2\right) &\ge \left(\sum_{i\notin I^*} w_i \big\langle y_i - x^{(0)}, \nu\big\rangle\right)^2 \tag{155}\\
\frac{\alpha'\tau}{\tau-\alpha'}\,\lambda_{\max}\left(\sum_{i=1}^n w_i (y_i - x^{(0)})(y_i - x^{(0)})^\top\right) &\ge \left(\sum_{i\notin I^*} w_i \big\langle y_i - x^{(0)}, \nu\big\rangle\right)^2 \tag{156}\\
\frac{\alpha'\tau}{\tau-\alpha'}\cdot\frac{\lambda_{\max}\left(\sum_{i=1}^n (1-\tilde{h}_i)(y_i - x^{(0)})(y_i - x^{(0)})^\top\right)}{\sum_{i=1}^n (1-\tilde{h}_i)\mathbb{1}\{\tilde{h}_i \le \tau\}} &\ge \left(\sum_{i\notin I^*} w_i \big\langle y_i - x^{(0)}, \nu\big\rangle\right)^2 \tag{157}\\
\sigma\sqrt{\frac{\alpha'\tau^2\big(c_1 + c_2^2 + 2c_2\sqrt{c_1}\big)}{(\tau-\alpha')^2}} &\ge \left|\sum_{i\notin I^*} w_i \big\langle y_i - x^{(0)}, \nu\big\rangle\right|, \tag{158}
\end{align}
where (157) uses $w_i \le (1-\tilde{h}_i)\big/\sum_{j=1}^n (1-\tilde{h}_j)\mathbb{1}\{\tilde{h}_j \le \tau\}$, and (158) follows from (143) and Lemma 10. Thus, we obtain from (154) and (158) that
\begin{align}
\|x^{(1)} - \mu\| &\le \sigma\left[\frac{\tau\sqrt{\alpha'\big(c_1 + c_2^2 + 2c_2\sqrt{c_1}\big)}}{\tau-\alpha'} + 2c_1\sqrt{\frac{\alpha'}{(1-\tau)(1-\epsilon)}} + c_0 + \frac{\alpha'\tau c_2}{\tau-\alpha'}\right] \tag{159}\\
&\le \sigma\left[\frac{(c_1 + c_2)\tau\sqrt{\alpha'}}{\tau-\alpha'} + 2c_1\sqrt{\frac{\alpha'}{(1-\tau)(1-\epsilon)}} + c_0 + \frac{\alpha'\tau c_2}{\tau-\alpha'}\right] \tag{160}\\
&\le \sigma(\gamma c_2 + \beta), \tag{161}
\end{align}
where $\gamma$ and $\beta$ are the constants defined in the statement of the theorem. It is easy to check that $\gamma < 1 \iff \alpha'' < f(\tau) \implies \alpha'' < \tau$. By extending the above reasoning to $t$ iterations of Algorithm 1, we get that if $\|x^{(0)} - \mu\| \le c_2^{(0)}\sigma$, then
\begin{align}
\|x^{(t)} - \mu\| &\le \sigma\big(\gamma c_2^{(t-1)} + \beta\big) \tag{162}\\
&= \sigma\left[\gamma^t c_2^{(0)} + \frac{1-\gamma^t}{1-\gamma}\,\beta\right] \tag{163}\\
&\le \sigma\left[\gamma^t c_2^{(0)} + \frac{\beta}{1-\gamma}\right]. \tag{164}
\end{align}
Let $x^{(0)}$ be the coordinate-wise median of the corrupted sample. Note that if $n \ge 20\log(d/\delta)$, then by Lemma 3 we have $\|x^{(0)} - \mu\| \le \sigma\sqrt{d}$ with probability $1-\delta$. The result follows from this and the fact that $\Pr(E) \ge 1 - \delta$.

$\ell_1$ objective via Packing SDP

Consider the following problem:
\begin{align}
\min_{h}\;& \|h\|_1 \tag{165}\\
\text{s.t. }& 0 \le h_i \le 1,\ \forall i,\nonumber\\
& \lambda_{\max}\left(\sum_{i=1}^n (1-h_i)(y_i - x)(y_i - x)^\top\right) \le c n\sigma^2.\nonumber
\end{align}
Define the vector $w$ with $w_i \triangleq 1 - h_i$. Since $0 \le h_i \le 1$, we have $0 \le w_i \le 1$. Further, $\|h\|_1 = \sum_{i=1}^n h_i = \sum_{i=1}^n (1 - w_i) = n - \sum_{i=1}^n w_i = n - \mathbf{1}^\top w$. Therefore, solving (165) is equivalent to solving the following:
\begin{align}
\max_{w}\;& \mathbf{1}^\top w \tag{166}\\
\text{s.t. }& 0 \le w_i \le 1,\ \forall i,\nonumber\\
& \lambda_{\max}\left(\sum_{i=1}^n w_i (y_i - x)(y_i - x)^\top\right) \le c n\sigma^2.\nonumber
\end{align}
Then, we rewrite the constraints $0 \le w_i \le 1,\ \forall i$ as $0 \le w_i$ and $\sum_i w_i e_i e_i^\top \preceq I_{n\times n}$, where $e_i$ is the $i$-th standard basis vector in $\mathbb{R}^n$. This establishes the equivalence between (166) and (17).

$\ell_p$ via iterative re-weighted $\ell_1$

Consider the $\ell_p$ objective ($0 < p < 1$) in Step 1 of Algorithm 1. We have the following equivalent objective:
\begin{align}
\min_{h}\;& \|h\|_p^p \tag{167}\\
\text{s.t. }& 0 \le h_i \le 1,\ \forall i,\nonumber\\
& \lambda_{\max}\left(\sum_{i=1}^n (1-h_i)(y_i - x)(y_i - x)^\top\right) \le c n\sigma^2.\nonumber
\end{align}
Note that $\|h\|_p^p = \sum_{i=1}^n h_i^p$. Suppose we employ the iterative re-weighted $\ell_1$ technique [29, 30]. Then, at the $(k+1)$-th inner iteration, we construct a tight upper bound on $\|h\|_p^p$ at $h^{(k)}$ as
\begin{equation}
\sum_{i=1}^n \left[\big(h_i^{(k)}\big)^p + p\big(h_i^{(k)}\big)^{p-1}\big(h_i - h_i^{(k)}\big)\right]. \tag{168}
\end{equation}
We minimize this upper bound:
\begin{align}
\min_{h}\;& \sum_{i=1}^n \big(h_i^{(k)}\big)^{p-1} h_i \tag{169}\\
\text{s.t. }& 0 \le h_i \le 1,\ \forall i,\nonumber\\
& \lambda_{\max}\left(\sum_{i=1}^n (1-h_i)(y_i - x)(y_i - x)^\top\right) \le c n\sigma^2.\nonumber
\end{align}
Defining $u_i = \big(h_i^{(k)}\big)^{p-1}$, the objective in (169) becomes $\sum_{i=1}^n u_i h_i$. Define the vector $w$ with $w_i \triangleq 1 - h_i$. Since $0 \le h_i \le 1$, we have $0 \le w_i \le 1$. Further, $\sum_{i=1}^n u_i h_i = \sum_{i=1}^n u_i(1 - w_i) = \sum_{i=1}^n (u_i - u_i w_i)$. So, solving (169) is equivalent to solving the following:
\begin{align}
\min_{w}\;& \sum_{i=1}^n (u_i - u_i w_i) \tag{170}\\
\text{s.t. }& 0 \le w_i \le 1,\ \forall i,\nonumber\\
& \lambda_{\max}\left(\sum_{i=1}^n w_i (y_i - x)(y_i - x)^\top\right) \le c n\sigma^2.\nonumber
\end{align}
Further, define the vector $z$ with $z_i \triangleq u_i w_i$. Then solving (170) is equivalent to solving the following:
\begin{align}
\min_{z}\;& \|u - z\| \tag{171}\\
\text{s.t. }& 0 \le z_i \le u_i,\ \forall i,\nonumber\\
& \lambda_{\max}\left(\sum_{i=1}^n z_i \big[(y_i - x)(y_i - x)^\top / u_i\big]\right) \le c n\sigma^2.\nonumber
\end{align}
Then, we rewrite the constraints $0 \le z_i \le u_i,\ \forall i$ as $0 \le z_i$ and $\sum_{i=1}^n z_i e_i e_i^\top \preceq \mathrm{diag}(u)$, where $e_i$ is the $i$-th standard basis vector in $\mathbb{R}^n$. Finally, we can turn (171) into the following least squares problem with semidefinite cone constraints:
\begin{align}
\min_{z}\;& \|u - z\| \tag{172}\\
\text{s.t. }& z_i \ge 0,\ \forall i,\nonumber\\
& \sum_{i=1}^n z_i \begin{bmatrix} e_i e_i^\top & \\ & (y_i - x)(y_i - x)^\top / u_i \end{bmatrix} \preceq \begin{bmatrix} \mathrm{diag}(u) & \\ & c n\sigma^2 I_{d\times d} \end{bmatrix}.\nonumber
\end{align}
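To make the re-weighted scheme above concrete, the following is a minimal sketch (not the authors' released code) of the outer majorize-minimize loop. The subroutine `solve_weighted_l1` is a hypothetical placeholder that should return a minimizer of the weighted-$\ell_1$ subproblem (169) for given weights; one possible CVXPY prototype for it is given after the packing-SDP reformulation below. The small floor `eps` (to avoid raising zero to a negative power) and the number of passes are implementation choices assumed here, not taken from the paper.

```python
import numpy as np

def reweighted_lp(Y, x, p, c, sigma, solve_weighted_l1, num_iters=10, eps=1e-6):
    """Sketch of the iterative re-weighted l1 scheme for the l_p objective (167).

    Each pass linearizes ||h||_p^p at the current iterate as in (168) and then
    minimizes the resulting weighted-l1 objective (169) via `solve_weighted_l1`,
    which is assumed to return an h in [0,1]^n satisfying the spectral constraint.
    """
    n = Y.shape[0]
    h = np.full(n, 0.5)                      # arbitrary interior initialization
    for _ in range(num_iters):
        # weights u_i = (h_i^(k))^(p-1); the constant factor p from (168) is
        # dropped since it does not change the minimizer
        u = np.maximum(h, eps) ** (p - 1.0)
        h = solve_weighted_l1(Y, x, u, c, sigma)
    return h
```

In Algorithm 1, each such inner solve would be followed by hard thresholding of $\tilde{h}$ at $\tau$ and a weighted mean update of $x$, as in the proof above.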
Weighted $\ell_1$ objective via Packing SDP

Consider the $\ell_p$ objective ($0 < p < 1$) in Step 1 of Algorithm 1 (see objective (167)). If we employ the iterative re-weighted $\ell_1$ approach [31, 29], we need to solve the following problem:
\begin{align}
\min_{h}\;& \sum_{i=1}^n u_i h_i \tag{173}\\
\text{s.t. }& 0 \le h_i \le 1,\ \forall i,\nonumber\\
& \lambda_{\max}\left(\sum_{i=1}^n (1-h_i)(y_i - x)(y_i - x)^\top\right) \le c n\sigma^2,\nonumber
\end{align}
where $u_i$ is the weight on the corresponding $h_i$. Define the vector $w$ with $w_i \triangleq 1 - h_i$. Since $0 \le h_i \le 1$, we have $0 \le w_i \le 1$. Further, $\sum_{i=1}^n u_i h_i = \sum_{i=1}^n u_i(1 - w_i) = \sum_{i=1}^n u_i - \sum_{i=1}^n u_i w_i$. So, solving (173) is equivalent to solving the following:
\begin{align}
\max_{w}\;& u^\top w \tag{174}\\
\text{s.t. }& 0 \le w_i \le 1,\ \forall i,\nonumber\\
& \lambda_{\max}\left(\sum_{i=1}^n w_i (y_i - x)(y_i - x)^\top\right) \le c n\sigma^2.\nonumber
\end{align}
Then, we rewrite the constraints $0 \le w_i \le 1,\ \forall i$ as $0 \le w_i$ and $\sum_i w_i e_i e_i^\top \preceq I_{n\times n}$, where $e_i$ is the $i$-th standard basis vector in $\mathbb{R}^n$. Finally, we can turn it into the following Packing SDP:
\begin{align}
\max_{w}\;& u^\top w \tag{175}\\
\text{s.t. }& w_i \ge 0,\ \forall i,\nonumber\\
& \sum_{i=1}^n w_i \begin{bmatrix} e_i e_i^\top & \\ & (y_i - x)(y_i - x)^\top \end{bmatrix} \preceq \begin{bmatrix} I_{n\times n} & \\ & c n\sigma^2 I_{d\times d} \end{bmatrix}.\nonumber
\end{align}
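For prototyping, the subproblem (174) can be handed directly to a generic conic solver. The sketch below is an illustration under the stated assumptions, not the authors' implementation (which would use a specialized packing-SDP solver for large $n$ and $d$): it uses CVXPY with the SCS solver, encodes $\lambda_{\max}\big(\sum_i w_i (y_i - x)(y_i - x)^\top\big) \le c n\sigma^2$ as a semidefinite constraint, and its name and arguments are chosen to match the loop sketched earlier.

```python
import numpy as np
import cvxpy as cp

def solve_weighted_l1(Y, x, u, c, sigma):
    """Prototype for (174): maximize u^T w over 0 <= w <= 1 subject to
    sum_i w_i (y_i - x)(y_i - x)^T <= c*n*sigma^2 * I in the semidefinite order.
    Returns h = 1 - w, an (approximate) minimizer of the weighted-l1 objective (173)."""
    n, d = Y.shape
    R = Y - x                                    # rows are y_i - x
    w = cp.Variable(n, nonneg=True)
    # Affine d x d expression  sum_i w_i (y_i - x)(y_i - x)^T
    M = sum(w[i] * np.outer(R[i], R[i]) for i in range(n))
    constraints = [w <= 1, M << c * n * sigma**2 * np.eye(d)]
    cp.Problem(cp.Maximize(u @ w), constraints).solve(solver=cp.SCS)
    return 1.0 - w.value
```

Setting all $u_i = 1$ recovers the unweighted subproblem (166).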
Corrupted image dataset

We use real face images to test the effectiveness of the robust mean estimation methods. The average face of particular regions or certain groups of people is useful for many social and psychological studies [33]. Here we use 100 frontal human face images from the Brazilian face database (https://fei.edu.br/~cet/facedatabase.html) as inliers. For the outliers, we choose 15 face images of cats and dogs from CIFAR10 [34]. In order to run the CDG method [20], we scale the images to 18 × 15 pixels, so the dimension of each datapoint is 270. Fig. 2 and Fig. 3 show sample inlier and outlier images. Fig. 4 shows the oracle solution (the average of the 100 inlier human faces) and the mean estimated by each method, together with their $\ell_2$ distances to the oracle solution. The proposed $\ell_0$ and $\ell_p$ methods achieve smaller recovery error than the state-of-the-art methods. The mean faces estimated by the proposed methods also look visually similar to the oracle solution, which illustrates the efficacy of the proposed $\ell_0$ and $\ell_p$ methods.

Figure 2: Sample inlier human face images.
Figure 3: Sample outlier cat and dog face images from CIFAR10.
Figure 4: Reconstructed mean face and its recovery error for each method.

Higher-dimensional experiments

In this subsection, we test the performance of Iterative Filtering, QUE, LRV, and the proposed $\ell_0$ method in even higher dimensions than in Section 4.1; specifically, $d = 1000$ and $n = 5000$. Table 5 shows the average recovery error of each method with respect to the fraction $\alpha$ of outlier points. It is evident that the proposed $\ell_0$ method performs much better than the current state-of-the-art methods.

Table 5: Recovery error of each method under different fractions $\alpha$ of outlier points ($d = 1000$, $n = 5000$).

  $\alpha$   Iter Filter   QUE     LRV     $\ell_0$
  10%        0.165         0.653   0.363
  20%        0.175         0.692   0.751
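The recovery error reported in Table 5 (and in Fig. 4) is the $\ell_2$ distance between the estimated mean and the reference mean. The toy snippet below only illustrates this metric; the dimensions match Table 5, but the corruption model is an assumption for illustration, and the non-robust sample mean is used as the estimator rather than any of the methods above.

```python
import numpy as np

def recovery_error(mu_hat, mu_true):
    """l2 recovery error reported in the experiments."""
    return float(np.linalg.norm(mu_hat - mu_true))

rng = np.random.default_rng(0)
d, n, alpha = 1000, 5000, 0.10
mu = np.zeros(d)
Y = rng.standard_normal((n, d))          # clean points around mu = 0
k = int(alpha * n)
Y[:k] += 5.0                             # a crude cluster of outliers (illustrative only)
print(recovery_error(Y.mean(axis=0), mu))  # error of the non-robust sample mean
```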