On the robustness to adversarial corruption and to heavy-tailed data of the Stahel-Donoho median of means
Jules Depersin and Guillaume Lecué, email: [email protected], email: [email protected], ENSAE, IPParis. 5, avenue Henry Le Chatelier, 91120 Palaiseau, France.
Abstract
We consider median of means (MOM) versions of the Stahel-Donoho outlyingness (SDO) [63, 21] and of the Median Absolute Deviation (MAD) [28] functions to construct subgaussian estimators of a mean vector under adversarial contamination and heavy-tailed data. We develop a single analysis of the MOM version of the SDO which covers all cases ranging from the Gaussian case to the L_2 case. It is based on isomorphic and almost isometric properties of the MOM versions of SDO and MAD. This analysis also covers cases where the mean does not even exist but a location parameter does; in those cases we still recover the same subgaussian rates and the same price for adversarial contamination even though there is not even a first moment. These properties are achieved by the classical SDO median and are therefore the first non-asymptotic statistical bounds on the Stahel-Donoho median, complementing the √n-consistency [54] and asymptotic normality [71] of the Stahel-Donoho estimators. We also show that the MOM version of MAD can be used to construct an estimator of the covariance matrix under only an L_2-moment assumption, or of a scale parameter if a second moment does not exist.

AMS subject classification:
Keywords:
Robustness, adversarial contamination, heavy-tailed data, depth, d-dimensional median.

Robust estimation of a mean vector has witnessed an important renewal during the last decade. Two communities have looked at this problem, each from its own perspective. In the statistics literature, several works have considered the problem of robustness with respect to heavy-tailed data. The aim here is to construct an estimator achieving statistical bounds with the same confidence as if all the data were i.i.d. Gaussian, even though the data at hand are only assumed to have a second moment. Such estimators are called subgaussian estimators; they are said to be robust to heavy-tailed data. The first seminal result showing the existence of such an estimator may be found in [6]. It is also shown in [6] that the empirical mean does not achieve this goal: the rate achieved by the empirical mean in the L_2 setup cannot be better than σ√(1/(δN)) with probability at least 1 − δ, whereas it is of the order of σ√(log(1/δ)/N) when the data are i.i.d. Gaussian. The rate σ√(log(1/δ)/N) is called the subgaussian rate for the mean estimation problem in the one-dimensional case. This rate was first achieved in [6] by an M-estimator with a specific score function. It was then achieved using a median-of-means principle in several works such as [17, 4], and extended to the d-dimensional case in many other works [51, 52, 8, 30, 10, 15, 39, 12] since then. For the mean estimation problem in R^d, most of the results have been given w.r.t. the Euclidean ℓ_2^d distance. There is however no statistical justification for this choice other than that the ℓ_2^d metric is the most natural Hilbert metric on R^d, so it seems natural to use it as a way to measure the statistical performance of an estimator of a d-dimensional vector.
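To make the one-dimensional median-of-means principle concrete, here is a minimal numpy sketch (an illustration of ours, not code from the paper): split the sample into K equal-size blocks, average each block, and return the median of the K block means.

```python
import numpy as np

def median_of_means(x, K):
    """1-D median-of-means: split x into K equal-size blocks,
    average each block, return the median of the K block means."""
    x = np.asarray(x, dtype=float)
    m = len(x) // K                        # block size (remainder dropped)
    block_means = x[:m * K].reshape(K, m).mean(axis=1)
    return float(np.median(block_means))

rng = np.random.default_rng(0)
# Heavy-tailed sample: Student t with 2.5 degrees of freedom (finite
# variance, heavy tails), shifted to have mean 5.
x = 5.0 + rng.standard_t(df=2.5, size=10_000)
print(median_of_means(x, K=50))   # close to 5
```

Over repeated draws, the deviations of `median_of_means` exhibit the subgaussian behavior discussed above, while the empirical mean occasionally suffers large errors caused by a few extreme observations.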
The resulting confidence sets therefore have the form µ̂ + r*_{N,δ} B_2^d, where µ̂ is an estimator, B_2^d = {x ∈ R^d : ‖x‖_2 ≤ 1} is the unit Euclidean ball and r*_{N,δ} is the rate of convergence w.r.t. ℓ_2^d achieved by µ̂ with confidence 1 − δ. When estimating w.r.t. the ℓ_2^d metric, confidence sets are therefore ℓ_2^d-balls. One may wonder if these confidence sets are the best from a statistical point of view, for instance, the ones with smallest volume for a fixed confidence 1 − δ. To answer this type of question, one usually goes back to the ideal i.i.d. Gaussian case and uses results obtained in that framework as benchmark results. We may also consider this model to design optimal benchmark confidence sets, which could be used to define a more appealing estimation metric for a mean vector in R^d. Let us now see what are the "best" (in some sense given later) confidence sets in the i.i.d. Gaussian case: let X_1, ..., X_N be i.i.d. distributed like N(µ, Σ), where µ ∈ R^d is the mean and Σ is a symmetric positive definite matrix (we assume here that Σ is invertible). The MLE is the empirical mean X̄_N and √N(X̄_N − µ) ∼ N(0, Σ). The latter result holds asymptotically if the data are only assumed to be in L_2, thanks to the CLT. The key observation here is that Σ is the inverse of the Fisher information in this model, and thus there is no regular asymptotically normal M-estimator that can estimate the mean with an asymptotic covariance matrix better than Σ. Moreover, level sets of the standard Gaussian density function are Euclidean balls B_2^d centered at zero. As a consequence, the best confidence sets for µ with confidence 1 − δ are ellipsoids Σ^{1/2}B_2^d centered at the estimator, with radius given by the quantile of order 1 − δ of a chi-square variable with d degrees of freedom. This type of confidence region can equivalently be written as an estimation result for µ with respect to the norm x ∈ R^d → ‖Σ^{-1/2}x‖_2.
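The distributional fact behind these ellipsoidal confidence sets is that N‖Σ^{-1/2}(X̄_N − µ)‖_2² follows a chi-square distribution with d degrees of freedom. A quick Monte Carlo check (ours, with arbitrary illustrative parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, reps = 3, 50, 4000
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 0.5]])
L = np.linalg.cholesky(Sigma)
Linv = np.linalg.inv(L)          # x -> Linv @ x has the same norm as Sigma^{-1/2} x

stats = []
for _ in range(reps):
    X = rng.standard_normal((N, d)) @ L.T     # N i.i.d. draws from N(0, Sigma)
    xbar = X.mean(axis=0)
    stats.append(N * np.sum((Linv @ xbar) ** 2))
print(np.mean(stats))   # ≈ d = 3, the mean of a chi-square with d degrees of freedom
```

Here `Linv` is the inverse Cholesky factor rather than the symmetric square root, but ‖L^{-1}x‖_2² = x^⊤Σ^{-1}x = ‖Σ^{-1/2}x‖_2², so the statistic is the same.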
It follows that the best metric (that is, the one leading to minimal volume confidence sets for a given confidence in the benchmark i.i.d. Gaussian case) is the norm ‖Σ^{-1/2}·‖_2, whose unit ball is the ellipsoid Σ^{1/2}B_2^d. Regarding our robust mean estimation problem, the two next natural questions are the following: is it possible to construct robust mean estimators w.r.t. the ‖Σ^{-1/2}·‖_2 metric? And what is the best convergence rate one can hope for? In the literature [50, 16], one may find estimators which can estimate in a robust way a mean vector w.r.t. any metric of the type u ∈ R^d → ‖u‖_S = sup_{v∈S} ⟨v, u⟩ where S ⊂ R^d. In particular, for S = Σ^{-1/2}B_2^d, this metric coincides with the one we want to use, i.e. ‖Σ^{-1/2}·‖_2. It has also been proved that the optimal deviation minimax rate (the one obtained in the benchmark i.i.d. Gaussian case) for the mean estimation problem with respect to ‖·‖_S is given by (see [16])

ℓ*(Σ^{1/2}S)/√N + sup_{v∈S} ‖Σ^{1/2}v‖_2 √(log(1/δ)/N).   (1)

For instance, for S = B_2^d, that is for ‖·‖_S = ‖·‖_2, the latter rate is the classical √(Tr(Σ)/N) + √(‖Σ‖_op log(1/δ)/N) rate. The case that is interesting to us is when ‖·‖_S = ‖Σ^{-1/2}·‖_2, that is for S = Σ^{-1/2}B_2^d. In that case, the subgaussian rate is

√(d/N) + √(log(1/δ)/N).   (2)

This is the rate we will try to reach from an adversarially corrupted and heavy-tailed dataset. We will also have to take into account the price for corruption. There are indeed known information-theoretic lower bounds showing that there is no statistic that can do better than (|O|/N)^α, where α ∈ [1/2, 1] is some exponent depending on properties of the good data. For instance, α = 1 for Gaussian variables and α = 1/2 for L_2 variables. However, we will see that the best possible cost |O|/N (i.e. for α = 1) can be achieved even for variables which do not have a first moment, as long as the cdfs of all one-dimensional projections of the centered and normalized data are regular enough. Unfortunately, all estimators known to achieve the subgaussian rate in (2) (the Le Cam test estimator in [50], the minmax MOM estimator with loss function ℓ(x, u) = ‖Σ^{-1/2}(u − x)‖_2^2 from [43] or the Fenchel-Legendre estimators from [16]) use the set S in their construction. This is something we cannot do here because S = Σ^{-1/2}B_2^d depends on Σ, which is unknown in general. One therefore has to consider other types of estimators than the ones cited above. In this work, we will do it thanks to a notion of depth/outlyingness introduced at the beginning of the 80's which, unlike the last cited estimators, uses a normalization by a robust estimation of the scale. There are several ways to measure how "deep" a vector is with respect to a cloud of points, see for instance the half-space depth of Tukey [64, 58], the simplicial depth [44, 46], the Mahalanobis depth or the projection depth [45]. Taking a point with maximal depth is usually seen as a way to define a median in R^d (see Radon points [2] or Fermat points [27]). There are therefore several ways to define a median of a cloud of points in R^d. One depth has received particular attention both in theory and in practice and is known as the Stahel-Donoho outlyingness (SDO) [63, 23]. It can be used to construct estimators of multivariate location and scatter known as the Stahel-Donoho estimators (SDE), which were the first equivariant estimators with a high breakdown point.
The aim of this work is to show that this notion of depth can be used to construct estimators of a mean vector in R^d which are robust to adversarial contamination and to heavy-tailed data with respect to ‖Σ^{-1/2}·‖_2. Let us now define this notion of depth and recall some of its properties. There is a common approach to many notions of depth for a general d-dimensional set of vectors: first, a definition of depth in R is given and second, this notion is extended to R^d simply by applying this one-dimensional definition to the set of one-dimensional projections of the data in all directions v ∈ R^d (or all v ∈ S for some subset S ⊂ R^d) and then by taking the supremum over all v ∈ R^d (or v ∈ S). This approach is based on the idea that if a point in R^d is an outlier then there must be some direction v such that it is a (univariate) outlier when projected onto that direction. The SDO of z ∈ R with respect to a dataset {a_1, ..., a_K} in R is defined as

SDO(z; {a_1, ..., a_K}) = |z − Med(a_k)| / Med(|a_k − Med(a_k)|).   (3)

(We speak here about depth instead of outlyingness: these two concepts express the same notion but in reverse order.) The SDO of a point in R^d is obtained by using the previous definition for all one-dimensional projections of the data and by taking the supremum over all directions: for any ν ∈ R^d and a dataset {Z_1, ..., Z_K} in R^d, we set

SDO(ν; {Z_1, ..., Z_K}) = sup_{v∈R^d} SDO(⟨ν, v⟩; {⟨Z_1, v⟩, ..., ⟨Z_K, v⟩}) = sup_{v∈R^d} |⟨ν, v⟩ − Med(⟨Z_k, v⟩)| / Med(|⟨Z_k, v⟩ − Med(⟨Z_k, v⟩)|).   (4)

A natural way to define a median of the Z_k's is obtained by taking a point with minimal outlyingness (i.e. maximal depth):

µ̂^SDO ∈ argmin_{µ∈R^d} SDO(µ; {Z_1, ..., Z_K}).
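The supremum in (4) cannot be computed exactly in general; a standard workaround (used by the randomized algorithms discussed below) is to maximise over a finite set of random directions. A small numpy sketch of (3) and of this approximation of (4) (an illustrative implementation of ours, not the paper's):

```python
import numpy as np

def sdo_1d(z, a):
    """One-dimensional SDO (3): |z - Med(a)| / MAD(a)."""
    med = np.median(a)
    return abs(z - med) / np.median(np.abs(a - med))

def sdo(nu, Z, n_dir=1000, rng=None):
    """Approximate d-dimensional SDO (4) of the point nu w.r.t. the rows
    of Z, maximising the 1-D SDO over n_dir random unit directions
    (the exact supremum over all v is not computable in general)."""
    rng = np.random.default_rng(rng)
    V = rng.standard_normal((n_dir, Z.shape[1]))
    V /= np.linalg.norm(V, axis=1, keepdims=True)   # unit directions
    P = Z @ V.T                                     # data projections, (K, n_dir)
    med = np.median(P, axis=0)
    mad = np.median(np.abs(P - med), axis=0)
    return float(np.max(np.abs(V @ nu - med) / mad))
```

On a dataset that is symmetric about the origin, the origin has zero outlyingness in every direction, so `sdo` returns 0 there.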
However, µ̂^SDO is not the most usual choice to estimate some location of the Z_k's when they are assumed to follow a statistical model. The Stahel-Donoho location estimator is rather defined as a convex combination of the data:

µ̂_K^SDE = (Σ_{k=1}^K w_k Z_k) / (Σ_{k=1}^K w_k)   (5)

where the weights are some function of the outlyingness of the data, i.e. w_k = w(SDO(Z_k)) for some (decreasing) weight function w: R_+ → R_+. The weights can also be used to estimate the scatter of the points by

Σ̂_K^SDE = (Σ_k w_k (Z_k − µ̂_K^SDE)(Z_k − µ̂_K^SDE)^⊤) / (Σ_k w_k).   (6)

Note that there is a more general definition of SDO than the one considered in (3), with general (one-dimensional) definitions of location and scale statistics; in (3), we used the median Med(a_k) and the Median Absolute Deviation (MAD) Med(|a_k − Med(a_k)|) for these statistics [29]. As mentioned previously, several results on the Stahel-Donoho Estimator (SDE) have been established during the last forty years. They are affine equivariant, meaning that for any affine transformation x ∈ R^d → Ax + b of the dataset by a nonsingular matrix A ∈ R^{d×d} and a vector b ∈ R^d, the location estimator µ̂_K^SDE follows the same transformation and the scatter estimator Σ̂_K^SDE is transformed via M ∈ R^{d×d} → A M A^⊤. SDE have been proved to have a finite-sample breakdown point [22], which is the "smallest amount of contamination necessary to upset an estimator entirely" from [24] in [21]. In [65], it is proved that the SDE with MAD replaced by the average of the k_1-th and k_2-th smallest absolute deviations about the median Med(a_k), for k_1 = [(K − d + 1)/2] and k_2 = [(K − d + 2)/2], achieves the best possible finite-sample replacement breakdown point [(K − d + 1)/2]/K (this result holds when the weight function w is continuous and there is an absolute constant c_0 such that w(r) ≤ c_0 and w(r) ≤ c_0/r² for all r > 0). The SDE have been proved to be √n-consistent in [54]: if the Z_k's are i.i.d. then √K((µ̂_K^SDE, Σ̂_K^SDE) − (t_0, V_0)) is bounded in probability as K → +∞, where t_0 and V_0 are some location and scatter parameters of the distribution of Z. This result holds when the weight function w is such that |w(r) − w(r')| ≤ γ min(1, 1/min(r, r')) |r − r'| for all r, r' ∈ R_+, and when for all v ∈ R^d the cumulative distribution function (cdf) of ⟨Z, v⟩, denoted by F_v, satisfies the following assumption: there exist some absolute constants c_1 > 0 and c_2 > 0 such that for all |ǫ| ≤ c_1,

|F_v(Med(F_v) + ǫ) − F_v(Med(F_v))| ≥ c_2|ǫ| and |F_v(Med(F_v) ± σ_v + ǫ) − F_v(Med(F_v) ± σ_v)| ≥ c_2|ǫ|   (7)

where Med(F_v) = inf(x ∈ R : F_v(x) ≥ 1/2) is the median of F_v and σ_v = Med(G_v), where G_v is the cumulative distribution function of the random variable |⟨Z, v⟩ − Med(⟨Z, v⟩)|, so that σ_v = MAD(⟨Z, v⟩) := Med(|⟨Z, v⟩ − Med(⟨Z, v⟩)|). A typical situation mentioned in [54] where (7) holds is when the cdf F: R^d → [0, 1] of Z is such that F = (1 − η)F_1 + ηF*, where η < 1/2, F* is any cdf and F_1 is such that there exist c_1 > 0 and c_2 > 0 such that for all v ∈ R^d, ⟨Z, v⟩ has a density denoted by f_v satisfying f_v(t) ≥ c_2 for all t ∈ [Med(F_v) ± c_1] ∪ [Med(F_v) − σ_v ± c_1] ∪ [Med(F_v) + σ_v ± c_1]. According to [54], the latter holds when F_1 is spherical with positive density in a neighborhood of 0 and of σ_1 e_1, where e_1 = (1, 0, ..., 0) ∈ R^d. We will come back later to these conditions since we will encounter similar assumptions in our analysis. Finally, asymptotic normality of SDE location estimators has been obtained in [71] in great generality for the location and scatter estimators as well as for the weight function, including the median and MAD estimators as in (3) and the projection depth obtained for the weight function w: r ∈ R_+ → 1/(1 + r). From a stochastic point of view, asymptotic results for µ̂_K^SDE hold when the cdf F is elliptically symmetric around µ, which means that there exists a symmetric positive definite matrix Σ such that for all v ∈ S^{d−1} := {v ∈ R^d : ‖v‖_2 = 1}, ⟨Σ^{-1/2}(Z − µ), v⟩ has the same distribution as ⟨Σ^{-1/2}(Z − µ), e_1⟩, which is a univariate symmetric variable with density function f. In that case, asymptotic normality was obtained when f(0)f(σ_0) > 0, where σ_0 = MAD(⟨Σ^{-1/2}(Z − µ), e_1⟩). Again, we will meet this type of condition in our analysis. On the practical side, SDEs have been used a lot in practice and implementations in various languages such as R exist; that is one reason why the study of the SDO may be useful, maybe more than some other notions of depth. In the original paper [63], the author proposes a randomized algorithm where the supremum over all directions v ∈ R^d is approximated by subsampling directions orthogonal to d − 1 randomly chosen points in the dataset. Other strategies mixing random and deterministic directions have been proposed, for instance in [60].
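For illustration, the weighted estimators (5) and (6) can be sketched with the random-direction approximation of the outlyingness; the weight function w(r) = 1/(1 + r) mentioned above is used here as one simple decreasing choice (a sketch of ours with illustrative defaults, not the paper's algorithm):

```python
import numpy as np

def sde(Z, n_dir=500, rng=None):
    """Stahel-Donoho location (5) and scatter (6) estimates, with the
    SDO of each data point approximated over random unit directions and
    the decreasing weight function w(r) = 1/(1+r)."""
    rng = np.random.default_rng(rng)
    K, d = Z.shape
    V = rng.standard_normal((n_dir, d))
    V /= np.linalg.norm(V, axis=1, keepdims=True)
    P = Z @ V.T                                     # (K, n_dir) projections
    med = np.median(P, axis=0)
    mad = np.median(np.abs(P - med), axis=0)
    out = np.max(np.abs(P - med) / mad, axis=1)     # SDO of each data point
    w = 1.0 / (1.0 + out)                           # weights w_k
    mu = (w[:, None] * Z).sum(axis=0) / w.sum()     # location estimate (5)
    C = Z - mu
    Sigma = (w[:, None] * C).T @ C / w.sum()        # scatter estimate (6)
    return mu, Sigma
```

Because the weights depend on the data only through the (affine-invariant) outlyingness, the exact-supremum version of this construction inherits the affine equivariance discussed next.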
Several adaptations and extensions of this algorithm may be found in [14] for an extension to an arbitrary kernel space, or in [67, 66] for a "cell-wise weights" extension of the SDO where each coordinate of each data point receives its own weight. However, only very little is known on the theoretical computational side. In Section 5 of [25], an algorithm running in time O(K^{d+1} log K) is mentioned, but its time complexity makes this approach impractical for dimensions larger than 5. There is, to our knowledge, no theoretical result of any kind on the convergence of an approximate algorithm for the computation of the SDO of a point in R^d that could be used in practice. As mentioned already in [25], "some sort of computational breakthrough is necessary to make the estimators, as defined here, really practical". This still looks to be the case. We will however not discuss this issue in the present work and leave this question open. The aim of this work is to construct mean vector estimators robust to adversarial outliers and heavy-tailed data achieving the deviation-minimax subgaussian rate from (2) with respect to the metric ‖Σ^{-1/2}·‖_2. On our way to this goal, we complement the results on the √n-consistency and the asymptotic normality of SDE by deriving the first non-asymptotic convergence rate for the original SDO median (as well as its median-of-means version) as a robust mean estimator in R^d under the following assumption.

Assumption 1. [Adversarial contamination and L_2 inliers] There exist N random vectors (X̃_i)_{i=1}^N in R^d which are independent with mean µ and covariance matrix Σ. The N random vectors (X̃_i)_{i=1}^N are first given to an "adversary" who is allowed to modify up to |O| of these vectors. This modification does not have to follow any rule. Then, the "adversary" gives back the modified dataset (X_i)_{i=1}^N to the statistician.
Hence, the statistician receives an "adversarially" contaminated dataset of N vectors in R^d which can be partitioned into two groups: the modified data (X_i)_{i∈O}, which can be seen as outliers, and the "good data" or inliers (X_i)_{i∈I} such that for all i ∈ I, X_i = X̃_i. Of course, the statistician does not know which data have been modified, so the partition O ∪ I = {1, ..., N} is unknown to the statistician.

The contamination model defined in Assumption 1 covers the Huber ǫ-contamination model from [31] and also the O ∪ I contamination framework from [37]. It has been popularized by the Computer Science community, in particular in [19, 20]. In the adversarial contamination model from Assumption 1, the set O can depend arbitrarily on the initial data (X̃_i)_{i=1}^N; the corrupted data (X_i)_{i∈O} can have any arbitrary dependence structure; and the informative data (X_i)_{i∈I} may also be correlated (this is the case, in general, when the |O| data X̃_i with largest ℓ_2^d-norm are modified by the adversary). In the setup defined by Assumption 1, we will use the SDO as one of our building blocks to achieve our goal, as well as the median-of-means principle [59, 1, 32]. This principle has been extensively used during the last decades, in particular for the problem of robust mean estimation [42, 18, 56, 50, 49, 52, 16, 30, 10]. The starting point of a MOM estimator is to choose an integer K ∈ [N], split the dataset into K equal-size blocks B_1 ⊔ ··· ⊔ B_K = [N] (w.l.o.g. we assume that N can be divided by K) and construct K empirical means X̄_k = |B_k|^{-1} Σ_{i∈B_k} X_i, one over each block.
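The block construction just described is straightforward (a small helper of ours; it assumes, as in the text, that K divides N):

```python
import numpy as np

def bucketed_means(X, K):
    """Split the N rows of X into K equal-size blocks B_1, ..., B_K and
    return the K empirical block means (assumes K divides N)."""
    N, d = X.shape
    return X.reshape(K, N // K, d).mean(axis=1)
```

Note that the overall empirical mean of the block means equals the empirical mean of the data, so bucketing loses nothing in the clean case while making the median step below meaningful.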
The Stahel-Donoho median-of-means that will be used to achieve the subgaussian rate (2) with respect to ‖Σ^{-1/2}·‖_2 in the adversarial and heavy-tailed setup from Assumption 1 is

µ̂_{MOM,K}^SDO ∈ argmin_{µ∈R^d} sup_{‖v‖_2=1} |⟨µ, v⟩ − Med(⟨X̄_k, v⟩)| / Med(|⟨X̄_k, v⟩ − Med(⟨X̄_k, v⟩)|).

Unlike recently introduced robust mean estimators, µ̂_{MOM,K}^SDO uses a robust scatter estimator for normalization. Here it is a MOM version of MAD which is used to construct µ̂_{MOM,K}^SDO, i.e. v → Med(|⟨X̄_k, v⟩ − Med(⟨X̄_k, v⟩)|). We will show that this normalization plays a central role in the analysis when one wants results w.r.t. the ‖Σ^{-1/2}·‖_2-norm. But beyond this observation, we will show that MAD and its MOM version satisfy isomorphic and almost-isometric properties that can be used for other tasks, such as constructing an estimator of the covariance under only an L_2 assumption as in Section 4 below.

The paper is organized as follows. In the next section, we consider the case where the good data have a Gaussian distribution and the dataset has been adversarially corrupted. In that case, there is no need to construct bucketed means, and the original Stahel-Donoho median is proved to achieve the subgaussian rate (2). Section 3 considers the general adversarially corrupted and heavy-tailed framework from Assumption 1, where the MOM version of the SDO is proved to achieve the subgaussian rate. We also exhibit in this section a family of cdfs, denoted here by (H_{N,K,v} : v ∈ S^{d−1}), which plays a key role in our analysis. In particular, when the behavior of these functions around 0 is similar to the one described above in (7), then the same result as in the Gaussian case can be obtained, and this may hold without any more than 2 moments (see Section 3.3).
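A runnable sketch of µ̂_{MOM,K}^SDO (ours, with two simplifying shortcuts: random directions instead of the supremum over all v, and the argmin over R^d replaced by an argmin over the block means themselves):

```python
import numpy as np

def sdo_mom(X, K, n_dir=500, rng=None):
    """Approximate Stahel-Donoho median-of-means: bucket the data into K
    block means, score candidate points by their SDO w.r.t. the block
    means over random directions, and return the deepest block mean."""
    rng = np.random.default_rng(rng)
    N, d = X.shape
    B = X.reshape(K, N // K, d).mean(axis=1)        # block means (K divides N)
    V = rng.standard_normal((n_dir, d))
    V /= np.linalg.norm(V, axis=1, keepdims=True)
    P = B @ V.T
    med = np.median(P, axis=0)
    mad = np.median(np.abs(P - med), axis=0)        # MOM version of MAD
    out = np.max(np.abs(P - med) / mad, axis=1)     # SDO_K of each block mean
    return B[np.argmin(out)]

rng = np.random.default_rng(1)
X = 3.0 + rng.standard_normal((2000, 2))            # inliers with mean (3, 3)
X[:100] = 1e6                                       # 5% gross outliers
print(sdo_mom(X, K=40, rng=1))                      # close to (3, 3)
```

In this toy run the outliers occupy whole blocks, which at most two of the K = 40 block means can reflect; the theory below handles the harder case where the adversary spreads up to ǫK corruptions arbitrarily, and the random-direction step is only a heuristic surrogate for the supremum.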
In Section 4, we show how to use the MOM version of MAD to construct an estimator of the covariance matrix under an L_2-moment assumption only. In Section 5, we explore the properties of the family of functions (H_{N,K,v} : v ∈ S^{d−1}). A conclusion and open questions are provided in Section 6, followed by the proofs of all the results in Section 7.

Notations.
We denote by x ∈ R^d → ‖x‖_2 = (Σ_j x_j²)^{1/2} the Euclidean norm, with associated unit sphere S^{d−1} and ball B_2^d. We also denote by g ∼ N(0, 1) a standard one-dimensional Gaussian variable and its associated standard Gaussian cdf by Φ: t ∈ R → P[g ≤ t] = ∫_{−∞}^t φ(u) du, where φ: u ∈ R → (2π)^{−1/2} exp(−u²/2) is the one-dimensional Gaussian density function. We also set H_G: t → 1 − Φ(t) and W_G: p ∈ (0, 1) → H_G^{−1}(p), the inverse function of H_G, so that W_G(p) = Φ^{−1}(1 − p).

In this section, we prove that the original SDO median achieves the (non-asymptotic) subgaussian rate (2) when the dataset may have been corrupted by an adversary and when the good data have a Gaussian distribution; our main model assumption is the following.
Assumption 2. [Adversarial contamination and Gaussian inliers] There exist N i.i.d. Gaussian vectors (G_i)_{i=1}^N in R^d with mean µ and (unknown) covariance matrix Σ. We assume that Σ is invertible. The N random vectors (G_i)_{i=1}^N are first given to an "adversary" who is allowed to modify up to |O| of these vectors. This modification does not have to follow any rule. Then, the "adversary" gives the modified dataset (X_i)_{i=1}^N to the statistician.

We use the Gaussian case as a benchmark for the more involved heavy-tailed situation, which requires bucketing the data and some assumption on the distribution of the good data. When the "good" data are Gaussian, there is no need to bucket the data, and the elliptical symmetry of Gaussian variables simplifies the analysis. The mean estimator we use in this section is therefore the median of the original Stahel-Donoho outlyingness function
SDO: µ ∈ R^d → sup_{v∈R^d} |⟨µ, v⟩ − Med(⟨X_i, v⟩)| / Med(|⟨X_i, v⟩ − Med(⟨X_i, v⟩)|)   (8)

and the associated median is a point minimizing this outlyingness function:

µ̂^SDO ∈ argmin_{µ∈R^d} SDO(µ).

Our main result in the adversarial corruption setup with Gaussian inliers is the following:
Theorem 1.
There are absolute constants c_0, c_1 and c_2 such that the following holds. We assume that the adversarial contamination with Gaussian inliers model (Assumption 2) holds with a number of adversarial outliers denoted by |O|. Let 0 < ǫ < Φ^{−1}(3/4)/c_0. We assume that |O| ≤ ǫN and N ≥ c_1 ǫ^{−2} d. For all 0 < u < c_2 ǫ² N, with probability at least 1 − 2 exp(−u),

‖Σ^{−1/2}(µ̂^SDO − µ)‖_2 ≤ c(ǫ) ( C √((d + 1)/N) + √(u/N) + |O|/N ).

Let us first remark that if N < d then the N data X_1, ..., X_N cannot span the entire space R^d, and so there exists a nonzero vector v ∈ R^d which is orthogonal to all the data points. Hence, MAD(v) := Med(|⟨X_i, v⟩ − Med(⟨X_i, v⟩)|) = 0 a.s., and so SDO(µ) = +∞ for all µ ∈ R^d. Therefore, assuming that N ≥ d is a minimal assumption when we work with the SDO function. We also note that the factor Φ^{−1}(3/4) is sometimes used as a renormalization factor in the definition of the Stahel-Donoho outlyingness function [62].

Theorem 1 shows that the SD median µ̂^SDO is robust to adversarial contamination up to a proportion ǫ of N, and that the rate achieved remains the same as if there were no contamination when |O| ≲ √N min(√u, √d). If we put this result beside the finite-sample replacement breakdown point (RBP) achieved by the SDE with a slight modification of the MAD in the denominator, as recalled in the Introduction, we see that the orders of magnitude are the same: the SDE and µ̂^SDO can both handle a proportion of N adversarial outliers. The constant in the RBP (which is close to 1/2 when N >> d) is certainly better than the one obtained in Theorem 1, but the latter theorem shows that the estimator still achieves the deviation minimax rate (2) even with up to ǫN outliers, whereas the RBP can only ensure that the estimator does not go to infinity; the RBP does not guarantee any statistical convergence rate after data corruption, unlike Theorem 1. The rate of convergence obtained in Theorem 1 has also been obtained by several other procedures. For instance, it has been proved that the Tukey median achieves this rate in [8] when the covariance is proportional to the identity and for the Huber-contamination setup. The same bound was also obtained by a polynomial-time algorithm in [11] when the covariance matrix Σ is known.

The proof of Theorem 1 (which may be found in Section 7) is based on two isomorphic principles for the MAD and SDO functions. We will extend these two properties to the MOM versions of MAD and SDO in the next section. For the moment, let us recall their definitions and state these two properties, which are interesting beyond the proof of Theorem 1. The normalization factor in the SDO function (8) is called the MAD (median absolute deviation) [28]:

MAD: v ∈ R^d → Med(|⟨X_i, v⟩ − Med(⟨X_i, v⟩)|).
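The Gaussian calibration constant Φ^{−1}(3/4) ≈ 0.6745 behind the MAD can be checked with the standard library (a sanity check of ours): for g ∼ N(0, 1), |g − Med(g)| = |g| is below t with probability 2Φ(t) − 1, which equals 1/2 exactly at t = Φ^{−1}(3/4).

```python
from statistics import NormalDist

# MAD of a standard Gaussian: the median of |g| is the t solving
# 2*Phi(t) - 1 = 1/2, i.e. t = Phi^{-1}(3/4).
c = NormalDist().inv_cdf(0.75)
print(c)   # 0.674489...
```

Dividing a Gaussian MAD by this constant therefore turns it into a consistent estimator of the standard deviation, which is exactly the role it plays in the scale bounds below.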
It plays a key role in getting estimation results w.r.t. the ‖Σ^{−1/2}·‖_2 norm while Σ is unknown. However, this normalization factor requires some more work than the analysis of classical robust estimators that are only focused on the estimation of the mean. Indeed, MAD(v) is actually a robust estimator of the scale of ⟨G, v⟩, which is Φ^{−1}(3/4)‖Σ^{1/2}v‖_2 (note that if g ∼ N(0, 1) then MAD(g) = Φ^{−1}(3/4), and that µ is a location parameter), showing that these properties actually have more to do with elliptical symmetry than they have to do with concentration.

Proposition 1.
There are absolute constants c_0, c_1 and c_2 such that the following holds. Let 0 < ǫ < Φ^{−1}(3/4)/c_0. We assume that the adversarial model with Gaussian inliers (Assumption 2) holds with a number of adversarial outliers |O| ≤ ǫN. We assume that N ≥ c_1 ǫ^{−2} d. With probability at least 1 − exp(−c_2 ǫ² N), for all v ∈ R^d,

(Φ^{−1}(3/4) − c_0 ǫ) ‖Σ^{1/2}v‖_2 ≤ MAD(v) ≤ (Φ^{−1}(3/4) + c_0 ǫ) ‖Σ^{1/2}v‖_2.

Moreover, for all 0 < u < c_2 ǫ² N, with probability at least 1 − 2 exp(−u), for all v ∈ R^d: if ‖Σ^{−1/2}(v − µ)‖_2 ≥ r*, then

‖Σ^{−1/2}(v − µ)‖_2 (Φ^{−1}(3/4) + c_0 ǫ)^{−1} ≤ SDO(v) ≤ ‖Σ^{−1/2}(v − µ)‖_2 (Φ^{−1}(3/4) − c_0 ǫ)^{−1},

and if ‖Σ^{−1/2}(v − µ)‖_2 ≤ r*, then SDO(v) ≤ r* (Φ^{−1}(3/4) − c_0 ǫ)^{−1}, where Φ^{−1} is the quantile function of the standard Gaussian cdf and r* = C(√((d + 1)/N) + √(u/N) + |O|/N) is the subgaussian rate from (2) with the additive adversarial contamination term |O|/N.

These isomorphic properties of the MAD and SDO functions, uniformly over R^d, imply the robustness and subgaussian properties of the SDO median in Theorem 1. Similar results for other depths may be found in the literature on robust mean estimation, such as the isomorphic property of the Tukey depth proved in [8].

The L_2 case

In this section, we no longer assume that the good data follow a Gaussian distribution; we only assume that they have a second moment (and the dataset may still be contaminated by an adversary following Assumption 1). Nevertheless, even though we are in the heavy-tailed setup with adversarially corrupted data, we still want to achieve the subgaussian rate for the ‖Σ^{−1/2}·‖_2-norm. To achieve such a result, the median-of-means principle has been proved to perform well. We will therefore use this principle together with the Stahel-Donoho concept of outlyingness, and we now introduce an estimator constructed according to these two principles. Let K ∈ [N] be the number of blocks and let X̄_k = (1/|B_k|) Σ_{i∈B_k} X_i, k ∈ [K], be the bucketed means. The outlyingness/depth of a point µ ∈ R^d is measured with respect to the bucketed means:

SDO_K(µ) = sup_{v∈R^d} |⟨µ, v⟩ − Med(⟨X̄_k, v⟩)| / Med(|⟨X̄_k, v⟩ − Med(⟨X̄_k, v⟩)|)

and the Stahel-Donoho median of means is defined as

µ̂_{MOM,K}^SDO ∈ argmin_{µ∈R^d} SDO_K(µ).

As for the Gaussian case, the isomorphic and nearly-isometric properties of
SDO_K and its denominator, called MOMAD_K, play a key role in our analysis. MOMAD_K is a median-of-means version of the Median Absolute Deviation function; we denote it MOMAD for Median Of Means Absolute Deviation:

MOMAD_K: v ∈ R^d → Med(|⟨X̄_k, v⟩ − Med(⟨X̄_k, v⟩)|).   (9)

In the next section, we study metric properties of MOMAD_K and of SDO_K that will be useful for our analysis of µ̂_{MOM,K}^SDO. Then, we will turn to the statistical bounds obtained for the median µ̂_{MOM,K}^SDO in the general heavy-tailed L_2 setup in Section 3.2, and then we will study some extra regularity assumptions on the cdfs (H_{N,K,v} : v ∈ S^{d−1}) at 0 that allow us to get better rates in Section 3.3.

Metric properties of MOMAD_K and SDO_K

In this section, we show that the MOM versions of the SDO and MAD operators (called SDO_K and MOMAD_K) satisfy isomorphic and almost-isometric properties, as MAD and SDO do in Proposition 1, that hold under an L_2-moment assumption only. We introduce two families of functions which play a central role in our analysis. They involve the non-corrupted random variables X̃_i, i ∈ [N] (and not the corrupted data X_i, i ∈ [N]).

Definition 1.
For all v ∈ S^{d−1},

H_v := H_{N,K,v}: r ∈ R → P[ √(K/N) Σ_{i=1}^{N/K} ⟨Σ^{−1/2}(X̃_i − µ), v⟩ ≥ r ]  and  W_v := W_{N,K,v}: p ∈ (0, 1) → H_v^{(−1)}(p),   (10)

where H_v^{(−1)}(p) = max(r ∈ R : H_v(r) ≥ p) is the generalized inverse of H_v.

As already observed in the proof of the √n-consistency of SDE from [54], as well as of its asymptotic normality in [71], the behavior of the cdfs of the one-dimensional projections at the median and at the two 1/4 and 3/4 quantiles plays a key role; here it is quantified through the quantile functions (W_v : v ∈ S^{d−1}).

Assumption 3.
There exist some 0 < ǫ < 1/16 and some absolute constants 0 < ϕ_l(ǫ) < ϕ_u(ǫ) such that for all v ∈ S^{d−1},

max( W_v(1/4 − 2ǫ) − W_v(1/2 + 2ǫ), W_v(1/2 − 2ǫ) − W_v(3/4 + 2ǫ) ) ≤ ϕ_u(ǫ)

and

min( W_v(1/4 + 2ǫ) − W_v(1/2 − 2ǫ), W_v(1/2 + 2ǫ) − W_v(3/4 − 2ǫ) ) ≥ ϕ_l(ǫ).

Assumption 3 is a pretty weak assumption since, intuitively, it only requires that the distributions of the centered, variance-one real-valued random variables ⟨Σ^{−1/2}(X̃_i − µ), v⟩ have their 1/4, 1/2 and 3/4 quantiles well separated, uniformly in v ∈ S^{d−1}. For instance, in the Gaussian case, Assumption 3 holds with ϕ_u(ǫ) = Φ^{−1}(3/4) + c_0 ǫ and ϕ_l(ǫ) = Φ^{−1}(3/4) − c_0 ǫ for some absolute constant c_0 and for all 0 < ǫ < 1/10 (where we recall that Φ: t → P[g ≤ t] with g ∼ N(0, 1)).

Proposition 2. There are absolute constants c_0, c_1 and c_2 such that the following holds. We assume that Assumption 3 holds for some 0 < ǫ < 1/16 and constants ϕ_l(ǫ) and ϕ_u(ǫ). We assume that the adversarial contamination with L_2 inliers model from Assumption 1 holds with a number of adversarial outliers |O| ≤ ǫK. We assume that K ≥ c_1 ǫ^{−2} d. With probability at least 1 − exp(−c_2 ǫ² K), for all v ∈ R^d,

ϕ_l(ǫ) √(K/N) ‖Σ^{1/2}v‖_2 ≤ MOMAD_K(v) ≤ ϕ_u(ǫ) √(K/N) ‖Σ^{1/2}v‖_2.

Proposition 2 shows that
$MOMAD_K$ is equivalent to $v \to \sqrt{K/N}\,\|\Sigma^{1/2} v\|_2$ up to the two constants $\varphi_u(\epsilon)$ and $\varphi_l(\epsilon)$. We will be interested in two situations regarding these constants. The first one is when their ratio is upper bounded by some absolute constant: there exists some absolute constant $c_0$ such that
$$\frac{\varphi_u(\epsilon)}{\varphi_l(\epsilon)} \le c_0. \qquad (11)$$
This condition will be enough to obtain robust optimal subgaussian bounds for $\hat\mu^{SDO}_{MOM,K}$ in the two following theorems. If condition (11) holds, we say that $MOMAD_K$ is isomorphic to $v \to \sqrt{K/N}\,\|\Sigma^{1/2} v\|_2$. The second condition, which will be of interest to us when we estimate $\Sigma$ using $MOMAD_K$ in Section 4, is when the two constants $\varphi_u(\epsilon)$ and $\varphi_l(\epsilon)$ can be made arbitrarily close to the same constant by taking $\epsilon$ small enough, that is, when there exist some absolute constants $\varphi_0$ and $c_0 > 0$ such that for all $0 < \epsilon < \varphi_0/c_0$,
$$\varphi_l(\epsilon) = \varphi_0 - c_0\epsilon \quad \text{and} \quad \varphi_u(\epsilon) = \varphi_0 + c_0\epsilon. \qquad (12)$$
In that case, we speak of an almost-isometric property of $MOMAD_K$. The latter condition is stronger than the isomorphic property, but it allows us to solve a harder problem. In Section 5, we provide several examples where these conditions hold, as well as other properties of the family of cdfs $(H_v : v \in S^{d-1})$, even when there is not even a first moment.

We finish this section with an isomorphic result for $SDO_K$. The rate of convergence appears in this result: it is the level $r^*$ above which $SDO_K$ is isomorphic to $\nu \in \mathbb{R}^d \to \|\Sigma^{-1/2}(\nu - \mu)\|_2/\sqrt{K/N}$. One can define it as a solution to
$$C_0\Big(\sqrt{\frac{d+1}{K}} + \sqrt{\frac{u}{K}}\Big) + \sup_{\|v\|_2 = 1} H_{N,K,v}(r^*) + \frac{|\mathcal{O}|}{K} < \frac{1}{2} \qquad (13)$$
where $u$ is a confidence parameter and $C_0$ a constant appearing in (28).

Proposition 3. There are absolute constants $c_0, c_1$ and $c_2$ such that the following holds. We assume that Assumption 3 holds for some $0 < \epsilon < 1/8$ and constants $\varphi_l(\epsilon)$ and $\varphi_u(\epsilon)$. We assume that the adversarial contamination with $L_2$ inliers model from Assumption 1 holds with a number of adversarial outliers denoted by $|\mathcal{O}|$. We assume that $|\mathcal{O}| \le \epsilon K$ and $K \ge c_0\epsilon^{-2} d$. Let $u > 0$ and $r^*$ be such that (13) holds. Then, with probability at least $1 - \exp(-u) - \exp(-c_1\epsilon^2 K)$, for all $\nu \in \mathbb{R}^d$: if $\|\Sigma^{-1/2}(\nu - \mu)\|_2 \ge \sqrt{K/N}\, r^*$ then
$$\frac{\big\|\Sigma^{-1/2}(\nu - \mu)\big\|_2}{\varphi_u(\epsilon)\sqrt{K/N}} \le SDO_K(\nu) \le \frac{\big\|\Sigma^{-1/2}(\nu - \mu)\big\|_2}{\varphi_l(\epsilon)\sqrt{K/N}},$$
and if $\|\Sigma^{-1/2}(\nu - \mu)\|_2 \le \sqrt{K/N}\, r^*$ then $SDO_K(\nu) \le (3/\varphi_l(\epsilon))\, r^*$.

Proposition 3 may be seen as a MOM version, holding in the heavy-tailed case, of the Proposition 1 obtained in the Gaussian case. Such an extension from the Gaussian case to the $L_2$ heavy-tailed case is made possible only thanks to the median-of-means principle and the use of the bucketed means instead of the data themselves. However, we will identify situations where conditions (11) and (13), with an optimal choice of rate $r^*$ (that is, for the subgaussian rate (2)), hold for $K = N$ even when a first moment does not exist. In that case, one can get a contamination price down to $|\mathcal{O}|/N$ instead of the information-theoretic lower bound in the general $L_2$ case given by $\sqrt{|\mathcal{O}|/N}$ (see Section 3.3). We start with the general $L_2$ case and then we will consider an extra assumption that allows for such better bounds.

3.2 The general $L_2$ case

Unlike in Section 2 or Section 3.3 below, where we demand that for all $v \in S^{d-1}$ and for values of $0 < r < c_0$ the deviation function $H_{N,K,v}(r)$ is less than $1/2 - c_1 r$, here we simply use Markov's inequality to control the function $H_{N,K,v}$ around $0$.
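The one-sided refinement of Markov's inequality used in this section (Cantelli's inequality) can be checked numerically. The short simulation below (ours, not from the paper) uses an illustrative standardized Student $t_3$ variable, a heavy-tailed centered variance-one choice, and verifies that $P[Z \ge r] \le 1/(1+r^2)$:

```python
import numpy as np

# Numerical check (ours) of the Cantelli refinement of Markov's inequality:
# if Z is centered with unit variance, then P[Z >= r] <= 1/(1 + r^2).
# The standardized Student t_3 below is an illustrative heavy-tailed choice.
rng = np.random.default_rng(0)

df = 3.0
z = rng.standard_t(df, size=1_000_000)
z /= np.sqrt(df / (df - 2.0))            # Var(t_df) = df/(df-2); standardize

for r in (0.5, 1.0, 2.0, 4.0):
    empirical = float(np.mean(z >= r))
    bound = 1.0 / (1.0 + r**2)
    assert empirical <= bound, (r, empirical, bound)
    print(f"r={r:3.1f}  P[Z>=r]={empirical:.4f}  1/(1+r^2)={bound:.4f}")
```

The bound is loose for such a specific distribution, but it is the best available under a second moment alone, which is exactly the regime of this section.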
The price we pay by using this approach is that we will no longer have estimation results for the SDO MOM over $K$ blocks holding for all deviation parameters $u$ up to $K$, but only for $u \sim K$. The other price we pay here is the adversarial contamination cost, which will be of the order of $\sqrt{|\mathcal{O}|/N}$ whereas, as proved by Theorem 4 below, it can be improved up to $|\mathcal{O}|/N$ (as in the Gaussian case from Theorem 1). We will be able to achieve this result thanks to an extra regularity assumption on the cdfs $H_v$ of all one-dimensional projections around $0$ (see Assumption 4 below). But, for the moment, we do not grant this type of assumption in this section and obtain a general result under only the existence of a second moment as well as Assumption 3. Subgaussian rates can be derived out of this result when condition (11) holds (we refer to Section 5, where this condition is studied).

In this section, the bound we use is simply the one deduced from Markov's inequality, that is, for all $r > 0$ and $K \in [N]$:
$$H_{N,K,v}(r) = P\Big[\big(N/K\big)^{-1/2}\sum_{i=1}^{N/K}\big\langle\Sigma^{-1/2}(\tilde X_i - \mu), v\big\rangle \ge r\Big] \le \frac{1}{1 + r^2}. \qquad (14)$$
(Note that we used a slight modification of Markov's inequality: if $Z$ is a centered variance-one real-valued random variable then $P[Z \ge r] = \min_{a \in \mathbb{R}} P[Z + a \ge r + a] \le (1 + r^2)^{-1}$.) Our main result in the general $L_2$ setup will follow from this bound and a general result stated in Section 7. It is now stated in the following theorem.

Theorem 2.
There are absolute constants $c_0, c_1$ and $c_2$ such that the following holds. We assume that Assumption 3 holds for some $0 < \epsilon < 1/8$. We assume that the adversarial contamination with $L_2$ inliers model from Assumption 1 holds with a number of adversarial outliers $|\mathcal{O}| \le c_0\epsilon K$. We assume that $K \ge c_1\epsilon^{-2} d$. With probability at least $1 - \exp(-c_2\epsilon^2 K)$,
$$\big\|\Sigma^{-1/2}(\hat\mu^{SDO}_{MOM,K} - \mu)\big\|_2 \le \frac{\varphi_u(\epsilon)}{\varphi_l(\epsilon)}\sqrt{\frac{K}{N}}.$$
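To fix ideas, here is a small numerical sketch (ours) of the estimator behind this result: bucketed means, an outlyingness normalized by $MOMAD_K$, and the deepest point as the estimate. Two simplifications are our own assumptions and not part of the theory: the supremum over $S^{d-1}$ is approximated by a finite set of random directions, and the minimization is restricted to the bucketed means themselves.

```python
import numpy as np

rng = np.random.default_rng(1)

def momad(bucket_means, v):
    # MOMAD_K(v) = Med_k | <Xbar_k, v> - Med_k <Xbar_k, v> |, cf. eq. (9)
    proj = bucket_means @ v
    return np.median(np.abs(proj - np.median(proj)))

def sdo(nu, bucket_means, directions):
    # Approximate SDO_K(nu): max over a *finite* direction set (our
    # simplification of the supremum over the sphere) of
    #   |<nu, v> - Med_k <Xbar_k, v>| / MOMAD_K(v).
    out = 0.0
    for v in directions:
        proj = bucket_means @ v
        num = abs(float(nu @ v) - float(np.median(proj)))
        out = max(out, num / momad(bucket_means, v))
    return out

def sdo_mom_median(X, K, n_dir=100):
    # Deepest bucketed mean w.r.t. the approximate MOM-SDO outlyingness.
    N, d = X.shape
    bucket_means = X[: (N // K) * K].reshape(K, -1, d).mean(axis=1)
    directions = rng.normal(size=(n_dir, d))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    depths = [sdo(b, bucket_means, directions) for b in bucket_means]
    return bucket_means[int(np.argmin(depths))]

# Heavy-tailed (Student t_3) inliers around mu, plus planted corruption
# (sizes and corruption pattern are illustrative choices of ours).
N, d, K = 3000, 5, 100
mu = np.ones(d)
X = mu + rng.standard_t(3.0, size=(N, d))
X[:30] = 1e6                       # corrupts one whole bucket
mu_hat = sdo_mom_median(X, K)
assert np.linalg.norm(mu_hat - mu) < 1.5
print(np.round(mu_hat, 2))
```

Despite the corrupted bucket and the heavy tails, the deepest bucketed mean stays close to $\mu$; medians over buckets absorb the contamination, in line with the theorem.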
The rate of convergence in Theorem 2 can be written like the ones in Theorem 1 and Theorem 4 below, where three terms appear: complexity, deviation and price for adversarial corruption. Indeed, one should notice here that the deviation probability in Theorem 2 is fixed, equal to $1 - \exp(-c_2\epsilon^2 K)$, because we had to take the deviation parameter $u$ equal to $K$ because of the approach based on Markov's inequality (14). It is, however, equivalent to replace $\sqrt{K/N}$ by $\sqrt{d/(\epsilon^2 N)} + \sqrt{u/N} + \sqrt{|\mathcal{O}|/(\epsilon N)}$ for $u = K$, since the two quantities are equivalent under the assumptions of Theorem 2. In that case, one may recognize the complexity term $\sqrt{d/N}$, the deviation term $\sqrt{u/N}$ as well as the price for adversarial corruption $\sqrt{|\mathcal{O}|/N}$. In particular, we see that the price we pay for the corruption is of the order of $\sqrt{|\mathcal{O}|/N}$, which is larger than the $|\mathcal{O}|/N$ term in the Gaussian case from Theorem 1 and is the worst case of Theorem 4 below. Indeed, in Theorem 2 we did not exploit any property other than the existence of a second moment, whereas Theorems 1 and 4 exploit some regularity assumption around $0$ of the family of functions $H_{N,K,v}, v \in S^{d-1}$.

Adaptation to $K$ via Lepski's method. It follows from Theorem 2 that $\hat\mu^{SDO}_{MOM,K}$ is an estimator which depends on the deviation parameter. Therefore, we need to construct an adaptive-in-$K$ version of this estimator to disentangle the estimator from the deviation parameter. The classical way to do it is via Lepski's method [40, 41]. Usually, the price we pay to make this approach work is some extra knowledge on $\Sigma$, such as its trace and operator norm. But here, for the SDO-type estimator we are using, together with the isomorphic property of $SDO_K$, we only need knowledge of $\varphi_u(\epsilon)$ and $\varphi_l(\epsilon)$. Let us now construct this adaptive scheme: the number of blocks is chosen via
$$\hat K = \min\Big(K \in [N] : SDO_k\big(\hat\mu^{SDO}_{MOM,K} - \hat\mu^{SDO}_{MOM,k}\big) \le \max\Big(\varphi_l(\epsilon), \frac{\varphi_u(\epsilon)}{\varphi_l(\epsilon)}\Big)\sqrt{\frac{K}{k}},\ \forall k \in \{K, \cdots, N\}\Big). \qquad (15)$$

Theorem 3.
There are absolute constants $c_0, c_1$ and $c_2$ such that the following holds. We assume that Assumption 3 holds for some $0 < \epsilon < 1/8$ and all $K \in [N]$. We assume that the adversarial contamination with $L_2$ inliers model from Assumption 1 holds with a number of adversarial outliers denoted by $|\mathcal{O}|$. Then, for all $K \ge \max(c_0\epsilon^{-2} d, c_1|\mathcal{O}|)$, with probability at least $1 - \exp(-c_2\epsilon^2 K)$,
$$\big\|\Sigma^{-1/2}(\hat\mu^{SDO}_{MOM,\hat K} - \mu)\big\|_2 \le \frac{\varphi_u(\epsilon)}{\varphi_l(\epsilon)}\sqrt{\frac{K}{N}},$$
where $\hat K$ is the adaptive choice of the number of blocks from (15).

3.3 The $L_2$ case under an extra regularity condition around $0$ of the $H_v$'s

In this section, we obtain estimation bounds for the MOM version of the SDO median in the adversarial corruption with heavy-tailed $L_2$ inliers model under an extra assumption on the regularity at $0$ of the family of functions $H_v, v \in S^{d-1}$, which is stated now.

Assumption 4.
There exist some absolute constants $c_0, c_1 > 0$ and $c_2 > 0$ such that for all $v \in S^{d-1}$ and all $(2C_0/c_0)\sqrt{(d+1)/K} \le r \le c_1$,
$$H_{N,K,v}(r) = H_v(r) := P\Big[\big(N/K\big)^{-1/2}\sum_{i=1}^{N/K}\big\langle\Sigma^{-1/2}(\tilde X_i - \mu), v\big\rangle \ge r\Big] \le \frac{1}{2} - c_2 r.$$

This assumption is about the behavior around the origin of the cdf of all one-dimensional projections of the random vectors $(N/K)^{-1/2}\sum_{i=1}^{N/K}\Sigma^{-1/2}(\tilde X_i - \mu)$, where the $\tilde X_i$ are the non-corrupted data. The term $1/2 - c_2 r$ in the bound above is the behavior of cdfs that are regular at $0$, such as in the Gaussian case (see Section 5 for more details and more examples).

Our main result in the adversarial corruption and heavy-tailed $L_2$ model under Assumption 4 is the following theorem. The proof may be found in Section 7.

Theorem 4.
There are absolute constants $c_0, c_1, c_2$ and $c_3$ such that the following holds. We assume that Assumption 3 holds for some $0 < \epsilon < 1/8$ and that Assumption 4 holds as well. We assume that the adversarial contamination with $L_2$ inliers model from Assumption 1 holds with a number of adversarial outliers $|\mathcal{O}| \le c_0\epsilon K$. We assume that $K \ge c_1\epsilon^{-2} d$. For all $0 < u \le c_2\epsilon^2 K$, with probability at least $1 - \exp(-u)$,
$$\big\|\Sigma^{-1/2}(\hat\mu^{SDO}_{MOM,K} - \mu)\big\|_2 \le c_3\frac{\varphi_u(\epsilon)}{\varphi_l(\epsilon)}\Big(\sqrt{\frac{d}{N}} + \sqrt{\frac{u}{N}} + \frac{|\mathcal{O}|}{\sqrt{NK}}\Big). \qquad (16)$$

We recover the optimal subgaussian rate (2) in Theorem 4 when, for some $0 < \epsilon < 1/8$, condition (11) holds and $|\mathcal{O}| \lesssim \sqrt{Kd}$. The term $|\mathcal{O}|/\sqrt{KN}$ appearing in the convergence rate of Theorem 4 is the price we pay for the adversarial contamination. It is between $\sqrt{|\mathcal{O}|/N}$ when $K \sim |\mathcal{O}|$ and $|\mathcal{O}|/N$ when $K \sim N$. Usually, when the inliers are only in $L_2$, the information-theoretic lower bound is known to be of the order of $\sqrt{|\mathcal{O}|/N}$ and not $|\mathcal{O}|/N$. We get a better rate in Theorem 4 thanks to Assumption 4, which uses in a more efficient way the regularity of the $H_v$ functions at $0$.

Unlike typical results in the MOM literature, except for the one obtained in [55], the deviation rate in Theorem 4 is $1 - \exp(-u)$ for all $u \lesssim K$; in particular, it does not have to depend on the parameter $K$. As a consequence, the estimator $\hat\mu^{SDO}_{MOM,K}$ does not depend on the deviation parameter. Usually, results for MOM estimators constructed on $K$ blocks are given with probability at least $1 - \exp(-c_0 K)$, and then a Lepski method is used to construct an adaptive-in-$K$ procedure. This is not the case here, nor is it in the Gaussian case in Section 2 (as it was in the previous section). This is again because Assumption 4 uses more efficiently the behavior of $H_{N,K,v}$ around $0$.

4 Estimation of $\Sigma$ under an $L_2$-moment assumption with MOMAD

In this section, we show that it is possible to estimate the covariance matrix $\Sigma$ using the MOMAD estimator. In particular, given that the isomorphic property of MOMAD holds under Assumption 3, which does not require more than an $L_2$ moment, we show that it is possible to estimate $\Sigma$ without requiring more than two moments, that is, just under the assumption that $\Sigma$ exists. This differs from approaches based on the empirical covariance matrix, where at best an $L_{2+\delta}$-moment assumption for some positive $\delta$ is granted for the estimation of the covariance matrix [47, 5, 48].

We show that for the estimation of $\Sigma$ via the MOMAD, the properties of $\varphi_l(\epsilon)$ and $\varphi_u(\epsilon)$ introduced in Assumption 3 play a key role. Let us first have a look at these quantities in the Gaussian case.
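As a numerical preview of the covariance estimator constructed below in this section, the sketch that follows (ours, not from the paper) works in the simplest uncorrupted Gaussian setting with $K = N$, where $MOMAD_N$ reduces to the classical MAD and the scatter factor is $\varphi_0 = \Phi^{-1}(3/4) \approx 0.6745$; the dimension, sample size and $\Sigma$ are illustrative choices.

```python
import numpy as np

# Numerical preview (ours) of the polarization-identity covariance estimator
# of this section, in the uncorrupted Gaussian case with K = N, where
# MOMAD_N reduces to the classical MAD and phi_0 = Phi^{-1}(3/4) ~ 0.6745.
# Dimension, sample size and Sigma below are illustrative choices.
rng = np.random.default_rng(4)

d, N = 3, 400_000
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])
X = rng.multivariate_normal(np.zeros(d), Sigma, size=N)

def mad(v):
    # MAD of the one-dimensional projections <X_i, v>
    p = X @ v
    return np.median(np.abs(p - np.median(p)))

E = np.eye(d)
S_hat = np.array([[(mad(E[i] + E[j]) ** 2 - mad(E[i] - E[j]) ** 2) / 4.0
                   for j in range(d)] for i in range(d)])

phi0 = 0.674489750196082                 # Phi^{-1}(3/4)
assert np.max(np.abs(S_hat / phi0**2 - Sigma)) < 0.05
print(np.round(S_hat / phi0**2, 2))
```

Each entry is recovered from one-dimensional MADs only, which is why the construction survives with no moment assumption beyond the existence of $\Sigma$.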
In that case, there are some absolute constants $\varphi_0$ and $c_0 > 0$ such that for all $0 < \epsilon < \varphi_0/c_0$,
$$\varphi_l(\epsilon) = \varphi_0 - c_0\epsilon \quad \text{and} \quad \varphi_u(\epsilon) = \varphi_0 + c_0\epsilon \qquad (17)$$
where $\varphi_0 = \Phi^{-1}(3/4)$ (see Section 5 or the proof of Proposition 1 for more details). This latter result holds in the Gaussian case first because the two interquartile intervals have the same length, $\Phi^{-1}(1/2) - \Phi^{-1}(1/4) = \Phi^{-1}(3/4) - \Phi^{-1}(1/2) = \varphi_0$, and, second, because the Gaussian density function is uniformly lower bounded by an absolute positive constant locally around the two $1/4$ and $3/4$ quantiles $\Phi^{-1}(1/4)$ and $\Phi^{-1}(3/4)$ as well as around the median $\Phi^{-1}(1/2) = 0$. If this last condition were not true at some $q \in \{W(1/4), W(1/2), W(3/4)\}$, where $W = W_{N,K,v}$ for some direction $v \in S^{d-1}$, then there would be some plateau of the cdf $r \in \mathbb{R} \to 1 - H_{N,K,v}(r)$ starting at $q$, and thus there would be a constant-factor gap between $W(\ell/4 - 2\epsilon)$ and $W(\ell/4 + 2\epsilon)$ for some $\ell \in \{1, 2, 3\}$. In that case, there would be some absolute constants $c_1 > c_0 > 0$ such that $|\varphi_u(\epsilon) - \varphi_l(\epsilon)| \ge c_0$ for all $0 < \epsilon \le c_1$. In particular, we would only have an isomorphic property for the MOMAD, and thus it is not clear how to estimate $\Sigma$ using MOMAD at a better rate than a constant rate. Typical values of $\varphi_0$ in (17) will be $\varphi_0 = W(1/4) - W(1/2) = W(1/2) - W(3/4)$ when these quantities do not depend on $v \in S^{d-1}$; this will hold, in particular, under a spherical symmetry assumption on the $\Sigma^{-1/2}(\tilde X_i - \mu)$ (see Section 5 for a more formal statement).

Assumption 5.
For the same choice of $K$ as in Assumption 3, where $\epsilon \to \varphi_l(\epsilon), \varphi_u(\epsilon)$ are defined, there are absolute constants $\varphi_0, c_0 > 0$ and $c_1 > 0$ such that for all $v \in S^{d-1}$ and all $0 < \epsilon < c_1$, $\varphi_l(\epsilon) = \varphi_0 - c_0\epsilon$ and $\varphi_u(\epsilon) = \varphi_0 + c_0\epsilon$.

Let us now turn to the construction of an estimator of the covariance matrix $\Sigma$ using MOMAD under Assumption 5 (as well as Assumption 3). Because of the constant factor $\varphi_0$ in Assumption 5, we will provide an estimator of the scatter matrix $\varphi_0^2\Sigma$ (according to [53], a scatter matrix is any matrix proportional to the covariance matrix). It follows from Proposition 2 that $MOMAD_K$ is isomorphic to $v \in \mathbb{R}^d \to \varphi_0\sqrt{K/N}\,\|\Sigma^{1/2} v\|_2$ and that under Assumption 5 it becomes an almost isometry, that is, with probability at least $1 - \exp(-c_1\epsilon^2 K)$, for all $v \in \mathbb{R}^d$,
$$\Big|MOMAD_K(v) - \varphi_0\sqrt{\frac{K}{N}}\,\big\|\Sigma^{1/2} v\big\|_2\Big| \le c_0\epsilon\sqrt{\frac{K}{N}}\,\big\|\Sigma^{1/2} v\big\|_2 \qquad (18)$$
as long as $|\mathcal{O}| \le \epsilon K$ and $K \ge c_2\epsilon^{-2} d$. In the Gaussian case and other spherical cases as in Section 5, this almost-isometric property holds for $K = N$ (and $MOMAD_N = MAD$) and any $0 < \epsilon < 1/4$: it follows from Proposition 1 that with probability at least $1 - \exp(-c_0\epsilon^2 N)$, for all $v \in \mathbb{R}^d$,
$$\Big|MAD(v) - \Phi^{-1}(3/4)\,\big\|\Sigma^{1/2} v\big\|_2\Big| \le c_1\epsilon\,\big\|\Sigma^{1/2} v\big\|_2. \qquad (19)$$
We may use (18) to estimate directly the entries of $\Sigma$, following an idea from [26]. Let $(e_j)_{j=1}^d$ denote the canonical basis of $\mathbb{R}^d$. We have, for all $i, j \in [d]$,
$$4\Sigma_{ij} = 4\langle e_i, \Sigma e_j\rangle = \big\|\Sigma^{1/2}(e_i + e_j)\big\|_2^2 - \big\|\Sigma^{1/2}(e_i - e_j)\big\|_2^2.$$
As a consequence, a natural estimator of $\varphi_0^2\Sigma$ based on $MOMAD_K$ is the matrix $\hat\Sigma$ whose entries are defined for all $i, j \in [d]$ by
$$\hat\Sigma_{ij} = \frac{N}{4K}\Big(MOMAD_K(e_i + e_j)^2 - MOMAD_K(e_i - e_j)^2\Big).$$
Note that $\hat\Sigma$ is symmetric but it may not be positive semi-definite. To overcome this issue, a projection method has been introduced in [48], which may be used for $\hat\Sigma$ as well. Our main statistical bound for $\hat\Sigma$ is the following.

Proposition 4.
Assume that Assumption 1 holds. Let $K \in [N]$, $\varphi_l$ and $\varphi_u$ be such that Assumption 3 and Assumption 5 hold. Then, for all $0 < \epsilon < c_0$ such that $|\mathcal{O}| \le \epsilon K$ and $K \ge c_1\epsilon^{-2} d$, with probability at least $1 - \exp(-c_2\epsilon^2 K)$,
$$\max_{i,j \in [d]}\Big|\frac{\varphi_0^2\Sigma_{ij} - \hat\Sigma_{ij}}{\Sigma_{ii} + \Sigma_{jj}}\Big| \le \sup_{\|u\|_2 = \|v\|_2 = 1}\Big|\frac{\big\langle u, (\varphi_0^2\Sigma - \hat\Sigma)v\big\rangle}{\sum_i (|u_i| + |v_i|)\Sigma_{ii}}\Big| \le c_2\,\epsilon\,(c_0\epsilon + \varphi_0).$$

In particular, if one can choose $K = N$ so that Assumption 3 and Assumption 5 hold (for instance, in the Gaussian case or for other spherical variables as in Section 5), then the $MOMAD_N$ estimator becomes the classical MAD one, and for $\epsilon = c_0\sqrt{d/N}$ we have that with probability at least $1 - \exp(-c_1 d)$,
$$\max_{i,j \in [d]}\Big|\frac{\varphi_0^2\Sigma_{ij} - \hat\Sigma_{ij}}{\Sigma_{ii} + \Sigma_{jj}}\Big| \le \sup_{\|u\|_2 = \|v\|_2 = 1}\Big|\frac{\big\langle u, (\varphi_0^2\Sigma - \hat\Sigma)v\big\rangle}{\sum_i (|u_i| + |v_i|)\Sigma_{ii}}\Big| \le c_2\sqrt{\frac{d}{N}},$$
as long as $|\mathcal{O}| \le c_3 d$.

5 Study of the $H_{N,K,v}, v \in S^{d-1}$ functions

The functions $H_{N,K,v}, v \in S^{d-1}$ play a key role in our analysis. Their behavior in a neighborhood of their $1/4, 1/2$ and $3/4$ quantiles drives the isomorphic and almost-isometric properties of the $MOMAD_K$ and $SDO_K$ functions, and so the statistical performance of the Stahel-Donoho median and median of means. Their behavior around $0$ also drives the improved rates obtained under Assumption 4. From our perspective, it is of the utmost importance to understand the behavior of these functions at these particular points.

Let us first settle the properties of the $H_{N,K,v}$ functions desirable for our analysis. We set $Z_i = \Sigma^{-1/2}(\tilde X_i - \mu)$ for all $i \in [N]$, so that the $Z_i$'s are independent centered isotropic vectors in $\mathbb{R}^d$, and $n = N/K$.
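The behavior of these functions near $0$ can also be explored by simulation. The sketch below (ours, not from the paper) estimates $H_{n,v}(r) = P[n^{-1/2}\sum_{i=1}^n\langle Z_i, v\rangle \ge r]$ by Monte Carlo for an illustrative heavy-tailed isotropic choice of $Z$ (independent standardized Student $t_3$ coordinates) and checks a linear decay $H(r) \le 1/2 - c\,r$ near $0$ with an illustrative constant $c = 0.2$:

```python
import numpy as np

# Monte Carlo sketch (ours) of H_{n,v}(r) = P[ n^{-1/2} sum_i <Z_i, v> >= r ]
# for an illustrative heavy-tailed isotropic Z (independent standardized t_3
# coordinates), checking a near-zero linear decay H(r) <= 1/2 - c*r
# with an illustrative constant c = 0.2.
rng = np.random.default_rng(3)
d, n, reps = 5, 8, 50_000

v = rng.normal(size=d)
v /= np.linalg.norm(v)

Z = rng.standard_t(3.0, size=(reps, n, d)) / np.sqrt(3.0)  # isotropic, var 1
S = (Z @ v).sum(axis=1) / np.sqrt(n)                       # n^{-1/2} sum <Z_i, v>

c = 0.2
for r in (0.1, 0.3, 0.5):
    H = float(np.mean(S >= r))
    assert H <= 0.5 - c * r, (r, H)
    print(f"r={r:.1f}  H(r)={H:.3f}  1/2 - c*r={0.5 - c*r:.3f}")
```

Both mechanisms discussed next, projection onto one-dimensional subspaces and averaging over $n$ variables, push such projected sums toward the Gaussian behavior seen in this experiment.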
We want to identify conditions on the distributions of the $Z_i$'s such that:

• for Assumption 4: there exist some absolute constants $c_0, c_1 > 0$ such that for all $v \in S^{d-1}$ and all $0 < r < c_0$,
$$H_{n,v}(r) := P\Big[\frac{1}{\sqrt{n}}\sum_{i=1}^n \langle Z_i, v\rangle \ge r\Big] \le \frac{1}{2} - c_1 r; \qquad (20)$$

• for Assumption 3: there exists some absolute constant $c_0 > 0$ such that for all $0 < \epsilon < 1/8$, $\varphi_l(\epsilon)$ and $\varphi_u(\epsilon)$ exist and are such that
$$\frac{\varphi_u(\epsilon)}{\varphi_l(\epsilon)} \le c_0, \qquad (21)$$
or there are absolute constants $\varphi_0$ and $c_0 > 0$ such that for all $0 < \epsilon < \varphi_0/c_0$,
$$\varphi_l(\epsilon) = \varphi_0 - c_0\epsilon \quad \text{and} \quad \varphi_u(\epsilon) = \varphi_0 + c_0\epsilon, \qquad (22)$$
which are respectively Condition (11) (ensuring an isomorphic property of $MOMAD_K$ and $SDO_K$ as well as optimal subgaussian rates for the SD median and median of means) and Condition (17) (ensuring an almost-isometric property of $MOMAD_K$ as well as estimation properties for $\hat\Sigma$ in Section 4).

Let us first study the Gaussian case, which is our benchmark situation. We will then study other cases where the family of functions $H_{N,K,v}, v \in S^{d-1}$ satisfies these conditions.

The Gaussian case.
We recall that $\Phi : t \in \mathbb{R} \to P[g \le t] = \int_{-\infty}^t \phi(u)\,du$, where $\phi : u \in \mathbb{R} \to (2\pi)^{-1/2}\exp(-u^2/2)$ is the Gaussian density function. We also denote $H_G : t \to 1 - \Phi(t)$ and $W_G : p \in (0,1) \to H_G^{(-1)}(p)$, the inverse function of $H_G$, so that $W_G(p) = \Phi^{-1}(1-p)$. It follows from the mean value theorem that for all $t, \epsilon \in \mathbb{R}$, $|H_G(t+\epsilon) - H_G(t)| \le \max(\phi(t), \phi(t+\epsilon))\,\epsilon$, so that around $0$ we have, for all $c_0 > 0$ and $0 < r < c_0$, $H_G(r) \le 1/2 - \phi(c_0)\,r$. As a consequence, (20) holds in the Gaussian case, for instance with $c_0 = 1$ and $c_1 = \phi(1)$. Let us now look at the two other conditions in the Gaussian case. From the mean value theorem, we have for all $p \in [1/2, 1)$ and $\epsilon \ge 0$ such that $p + \epsilon \in [1/2, 1)$, $\epsilon/\phi(W_G(p)) \le W_G(p) - W_G(p+\epsilon) \le \epsilon/\phi(W_G(p+\epsilon))$, and for all $p \in (0, 1/2]$ and $\epsilon \ge 0$ such that $p - \epsilon \in (0, 1/2]$, $\epsilon/\phi(W_G(p)) \le W_G(p-\epsilon) - W_G(p) \le \epsilon/\phi(W_G(p-\epsilon))$. We conclude that there are absolute constants $c_0, c_1 > 0$ such that for all $0 \le \epsilon \le c_0$,
$$\varphi_u(\epsilon) = \Phi^{-1}(3/4) + 2\epsilon\Big(\frac{1}{\phi(\Phi^{-1}(3/4))} + \frac{1}{\phi(0)}\Big) + c_1\epsilon^2 \quad \text{and} \quad \varphi_l(\epsilon) = \Phi^{-1}(3/4) - 2\epsilon\Big(\frac{1}{\phi(\Phi^{-1}(3/4))} + \frac{1}{\phi(0)}\Big) - c_1\epsilon^2.$$
So both conditions (20) and (22) hold with $\varphi_0 = \Phi^{-1}(3/4)$ in the Gaussian case; the values of $\phi$ at the $1/4, 1/2$ and $3/4$ quantiles, that is, $\phi(\Phi^{-1}(3/4)) = \phi(\Phi^{-1}(1/4))$ and $\phi(0)$ (here we used that $\Phi^{-1}(1/2) = 0$), play a key role.

In the following, we identify situations where the $H_{N,K,v}, v \in S^{d-1}$ functions and their pseudo-inverses mimic the $H_G$ and $W_G$ functions from the Gaussian case. There are at least two reasons for this to happen: the first one is that we are projecting random vectors living in $\mathbb{R}^d$ onto one-dimensional subspaces; the second one is that we are averaging random variables having a second moment. We explore these two observations in the two following paragraphs.

One-dimensional projections and elliptically contoured distributions. The fact that the $H_v$ functions deal only with one-dimensional marginals makes these functions likely to behave as in the Gaussian case, since one-dimensional projections of sufficiently spherically symmetric random vectors in $\mathbb{R}^d$ are expected to behave like one-dimensional Gaussian variables, and this phenomenon is even more accentuated when $d$ is large (this is one particular situation where a large dimension $d$ may help in statistics). Indeed, one may have in mind an observation, sometimes attributed to H. Poincaré, that the density function of the one-dimensional projection $\langle\sqrt{d}\,U, e_1\rangle$, where $\sqrt{d}\,U$ is uniformly distributed over $\sqrt{d}\,S^{d-1}$ and $(e_j)_{j=1}^d$ is the canonical basis of $\mathbb{R}^d$, converges to the density of a $\mathcal{N}(0,1)$ when $d \to \infty$ (see page 16 in [38] or Chapter 4 in [3]). One may also have in mind that there are directions, such as $v = (1/\sqrt{d}, \ldots, 1/\sqrt{d})$, which mix the coordinates of $\Sigma^{-1/2}(\tilde X - \mu)$ when projected onto $v$ and therefore may have the tendency to mimic a standard Gaussian variable because of the CLT. Note that all these observations hold for $N = K$, that is, even for $n = 1$: because of the one-dimensional projections, we may not even have to average the $Z_i$'s to mimic the Gaussian case. Therefore, Theorem 5 can be extended beyond the Gaussian case when this phenomenon occurs.

Let us now consider an example of elliptically contoured distributions where this happens to be true. Our aim is to show that Condition (11) and Assumption 4 (and so Theorem 4) may hold for $K = N$ (i.e. $n = 1$) even when the $\tilde X_i$'s do not have a first moment.

We assume that the $\tilde X_i$'s are i.i.d. and that $\Sigma^{-1/2}(\tilde X_1 - \mu)$ has a spherically symmetric distribution; in that case, $\tilde X_1 - \mu$ is sometimes said to have an elliptically contoured distribution. Then, there exists a non-negative random variable $R$ such that $\Sigma^{-1/2}(\tilde X_1 - \mu)$ is distributed according to $RU$, where $U$ is uniformly distributed on $S^{d-1}$ and is independent of $R$ (see Chapter 4 in [3]). In that case, all the $\langle\Sigma^{-1/2}(\tilde X_1 - \mu), v\rangle$ for $v \in S^{d-1}$ have the same distribution as $\langle\Sigma^{-1/2}(\tilde X_1 - \mu), e_1\rangle$ (where $(e_j)_{j=1}^d$ is the canonical basis of $\mathbb{R}^d$), which is distributed according to $R\langle U, e_1\rangle$. Now, using that $\langle U, e_1\rangle$ is absolutely continuous w.r.t. the Lebesgue measure with density function given by
$$t \in \mathbb{R} \to C_d\big(1 - t^2\big)^{(d-3)/2}\, I(|t| \le 1) \quad \text{where} \quad C_d = \Big(\int_{-1}^1 (1 - t^2)^{(d-3)/2}\,dt\Big)^{-1} = \frac{\Gamma(d/2)}{\Gamma((d-1)/2)\sqrt{\pi}}$$
and $\Gamma$ is the Gamma function, we can deduce that (even for $K = N$) $H_v$ is independent of $v \in S^{d-1}$ and is such that for all $r \ge 0$, $H_v(-r) = 1 - H_v(r)$ and
$$H_v(r) = H(r) := C_d\int_0^1 P[R \ge r/x]\,\big(1 - x^2\big)^{(d-3)/2}\,dx.$$
In particular, we recover that $H(0) = 1/2$ since $R \ge 0$. Let us now consider a discrete $R$: $R$ takes values $r_1 < r_2 < \cdots$ with $\alpha_j = P[R = r_j]$ for all $j \in \mathbb{N}^*$, so that for all $q > 0$, $\mathbb{E}R^q = \sum_j r_j^q\alpha_j$, which may be infinite even for $q = 1$ (that is, when there is not even a first moment). For this example, we have for all $r \ge 0$,
$$H(r) = C_d\sum_{j=1}^\infty \alpha_j\int_{r/r_j}^1\big(1 - x^2\big)^{(d-3)/2}\,dx\; I(r \le r_j).$$
In particular, $H$ is differentiable, and $R\langle U, e_1\rangle$ is absolutely continuous w.r.t. the Lebesgue measure with a density function given by
$$f : r \in \mathbb{R} \to -H'(r) = C_d\sum_{j=1}^\infty \frac{\alpha_j}{r_j}\Big[1 - \Big(\frac{r}{r_j}\Big)^2\Big]^{(d-3)/2} I(r \le r_j).$$
In particular, for $r_\infty = \lim_{j\to\infty} r_j$, $H$ is strictly decreasing on $[0, r_\infty)$ from $H(0) = 1/2$ to $H(r_\infty) = 0$, and beyond $r_\infty$ it is constant equal to $0$. Therefore, for all $v \in S^{d-1}$, the generalized inverse $W_v$ of $H_v$ is independent of $v$ and it is the inverse of $H$: for all $p \in (0, 1/2]$ there is a unique element $W(p)(= W_v(p))$ in $[0, r_\infty)$ such that $H(W(p)) = p$, and $W(1-p) = -W(p)$.

Now, let us choose $r_j = 2^j C_d$ and $\alpha_j = 2^{-j}$ for all $j \in \mathbb{N}^*$. We also assume $d \ge 3$ (so that the density above is bounded). Then $\mathbb{E}R = \sum_j r_j\alpha_j = +\infty$, and so the mean and covariance matrix do not exist. Nevertheless, one may still assume that there exist $\mu \in \mathbb{R}^d$ and $\Sigma \in \mathbb{R}^{d\times d}$ definite positive such that $\Sigma^{-1/2}(\tilde X - \mu)$ is spherically symmetric (without $\mu$ having to be a mean vector and $\Sigma$ a covariance matrix). Then, Theorem 4 still applies.

Let us first check Condition (20). We have $H(0) = 1/2$ and, using $(1-x^2)^{(d-3)/2} \ge 1 - \frac{d-3}{2}x^2$, for all $0 \le r \le C_d/\sqrt{d-3}$,
$$f(r) \ge \sum_{j=1}^\infty 4^{-j}\Big[1 - \frac{d-3}{2}\Big(\frac{r}{2^j C_d}\Big)^2\Big] \ge \frac{1}{4}. \qquad (23)$$
Moreover, $C_d$ is of order $\sqrt{d}$, hence (23) holds on an interval around $0$ of absolute constant length, which is, according to the mean value theorem, enough to show that Condition (20) holds (see (25) below for more details).

Let us now check conditions (21) and (22). It follows from Proposition 5 below that it is enough to lower bound the density function $f$ in a neighborhood of $p$ for $p \in \{W(1/4), W(1/2), W(3/4)\}$ and to check that $W(1/4) - W(3/4)$ is an absolute constant. But, given that $W(1/2) = 0$, that (23) holds, that $f$ is symmetric about $0$ and that $W(1/4) = -W(3/4)$, it is enough to check that $f(q) \ge c_0$ for all $q \in [W(1/4) - 2\epsilon, W(1/4) + 2\epsilon]$ for some $0 < \epsilon < c_1$ and that $W(1/4)$ is an absolute constant. We first have to locate $W(1/4)$, which is the unique solution $r$ of $H(r) = 1/4$. We see that $f$ is symmetric and unimodal, with maximal value at $0$ given by $f(0) = \sum_{j\ge 1} 4^{-j} = 1/3$, and lower bounded by an absolute constant on an interval $[0, c_2]$ of absolute constant length; hence $W(1/4)$ lies in an interval with absolute constant endpoints, and $f(q) \ge c_3$ for all $q$ in a fixed neighborhood of $W(1/4)$. As a consequence, we can take $\varphi_u(\epsilon) = W(1/4) + c_4\epsilon$ and $\varphi_l(\epsilon) = W(1/4) - c_4\epsilon$ for all $0 < \epsilon < 1/16$. In that case, $MOMAD_K$ is an almost isometry, and we can state a result like Theorem 1 where $\Phi^{-1}(3/4)$ is replaced by $W(1/4)$, and where $\mu$ and $\Sigma$ are no longer the mean and covariance matrix, since they do not exist, but 'location' and 'scale' parameters defined so that the $\Sigma^{-1/2}(\tilde X_i - \mu)$ are spherically symmetric. As a consequence, the phenomenon underlying the Gaussian case from Section 2 has nothing to do with concentration; it is rather about elliptical symmetry.

Gaussian approximation.
In cases where there is some lack of spherical symmetry of $\Sigma^{-1/2}(\tilde X - \mu)$, one may study the $H_v$ functions for a smaller number $K$ of blocks, so that $n = N/K$ may be large enough to see some averaging effect. In that case, and because Gaussian variables satisfy all the properties we need, it is tempting to use a Gaussian approximation result such as a Berry-Esseen bound (see [61, 9, 7]) to approximate the $H_v$ functions by $1 - \Phi$ for $n = N/K$ large enough. This strategy has been used several times in the works of Minsker and co-authors on median-of-means and Catoni-type estimators (see for instance [55, 57]).

For instance, when for all $v \in S^{d-1}$ the $\langle\Sigma^{-1/2}(\tilde X_i - \mu), v\rangle, i \in [n]$, are (independent, centered and variance-one) real-valued random variables in $L_{2+\delta}$ such that $\|\langle\Sigma^{-1/2}(\tilde X_i - \mu), v\rangle\|_{L_{2+\delta}} \le \kappa$ (uniformly in $v \in S^{d-1}$) for some $\delta > 0$, there exists $c_0 > 0$ such that for all $v \in S^{d-1}$ and all $r \in \mathbb{R}$,
$$\big|H_{N,K,v}(r) - P[g \ge r]\big| \le \frac{c_0\,\kappa^{2+\delta}}{n^{\delta/2}} := c_n. \qquad (24)$$
It follows that for all $p \in (0,1)$ and $\epsilon \in \mathbb{R}$ satisfying $p + \epsilon \in (0,1)$,
$$\Phi^{-1}(1 - p - \epsilon - c_n) \le W(p + \epsilon) \le \Phi^{-1}(1 - p - \epsilon + c_n).$$
In particular, for all $0 < \epsilon < 1/16$, if $n$ is large enough so that $c_n \le \epsilon$, then one can take $\varphi_u(\epsilon) = \Phi^{-1}(3/4) + c_0\epsilon$ and $\varphi_l(\epsilon) = \Phi^{-1}(3/4) - c_0\epsilon$, so that the ratio $\varphi_u(\epsilon)/\varphi_l(\epsilon)$ is bounded by an absolute constant; in that case, the $MOMAD_K$ and $SDO_K$ are isomorphisms (see Proposition 2) and we recover a subgaussian rate in Theorem 2.

However, a Gaussian approximation result such as the one in (24) is not enough for Assumption 4. Indeed, it follows from (24) that for all $0 \le r \le c_0$, $H_v(r) \le H_G(r) + c_n \le 1/2 - c_1 r + c_n$ for some absolute constants $c_0, c_1 > 0$. It appears that our analysis used to prove Theorem 4 does not withstand this extra error term $c_n$ compared with Assumption 4. Gaussian approximation does not help in this case: indeed, Assumption 4 is rather about the existence of a uniform lower bound around $0$ for the density functions of the one-dimensional projections $\langle n^{-1/2}\sum_i Z_i, v\rangle$ we are considering now.

Beyond the Gaussian behavior.
In the two preceding paragraphs, we identified situations where the $n^{-1/2}\sum_{i=1}^n\langle Z_i, v\rangle$ for $v \in S^{d-1}$ behave like Gaussian variables. We saw that this may be the case because we are considering one-dimensional projections of $d$-dimensional vectors and/or because we are taking empirical means over $n$ variables. But the properties we are looking for in the $H_{n,v}, v \in S^{d-1}$ functions (see (20), (21) and (22)) all deal only with their behavior around $3$ (or $4$, when the median is not $0$) points. So only the behavior of these functions at these points plays a role, and there is no need to mimic the Gaussian case for all values of $r$ in $\mathbb{R}$. We now state a general result going in this direction. In particular, we recover the conditions from [54] and [71] recalled in the Introduction section.

Let us assume that the $n^{-1/2}\sum_{i=1}^n\langle Z_i, v\rangle$ for $v \in S^{d-1}$ are absolutely continuous w.r.t. the Lebesgue measure with a density function denoted by $f_v$. By the mean value theorem, we have for all $r \ge 0$, all $p \in (0,1)$ and all $\epsilon \ge 0$ such that $p + \epsilon \in (0,1)$,
$$H_v(r) \le H_v(0) - \min_{0 \le t \le r} f_v(t)\, r \quad \text{and} \quad \frac{\epsilon}{\max_{q \in [p, p+\epsilon]} f_v(W_v(q))} \le W_v(p) - W_v(p + \epsilon) \le \frac{\epsilon}{\min_{q \in [p, p+\epsilon]} f_v(W_v(q))}. \qquad (25)$$
In particular, the values of the density functions $f_v, v \in S^{d-1}$, at $0$, $W(1/4)$, $W(1/2)$ and $W(3/4)$ drive the quality of the inequalities in (25) and, as noted in previous works on the Stahel-Donoho outlyingness function, are enough to ensure all the conditions we need on $H_v$ and $W_v$, recalled in (20), (21) and (22).

Proposition 5.
Let $K \in [N]$ be such that $N/K \in \mathbb{N}$. We assume that the original non-corrupted data $\tilde X_i, i \in [N]$, are independent and that there exist $\mu \in \mathbb{R}^d$ and $\Sigma \in \mathbb{R}^{d\times d}$ definite positive so that for all $v \in S^{d-1}$ the $\sqrt{K/N}\sum_{i=1}^{N/K}\langle\Sigma^{-1/2}(\tilde X_i - \mu), v\rangle$ are absolutely continuous real-valued random variables with a density denoted by $f_v$. If there exist $0 < \epsilon < 1/8$ and $c_0 > 0$ such that for all $v \in S^{d-1}$, all $p \in \{W_v(1/4), W_v(1/2), W_v(3/4)\}$ and all $q \in [p - 2\epsilon, p + 2\epsilon]$, $f_v(q) \ge c_0$, then, for $I_v = \max(W_v(1/4) - W_v(1/2), W_v(1/2) - W_v(3/4))$, Assumption 3 holds with
$$\varphi_u(\epsilon) = \max_{v \in S^{d-1}}\big(I_v + 4\epsilon/c_0\big) \quad \text{and} \quad \varphi_l(\epsilon) = \min_{v \in S^{d-1}}\big(I_v - 4\epsilon/c_0\big).$$
We also have
$$\frac{\varphi_u(\epsilon)}{\varphi_l(\epsilon)} \le \frac{\max_{v \in S^{d-1}} I_v}{\min_{v \in S^{d-1}} I_v}\Big(1 + \frac{16\epsilon}{c_0\min_{v \in S^{d-1}} I_v}\Big) \quad \text{when} \quad \epsilon \le \frac{c_0\min_{v \in S^{d-1}} I_v}{8}.$$
Moreover, if $(c_0/4)\max_v I_v < 1/2$ and $\min_v I_v \ge c_1$ for some absolute constant $c_1 > 0$, then condition (21) holds (and so we recover the optimal subgaussian rates in Theorem 2 and Theorem 3); and if $I_v := \varphi_0$ for all $v \in S^{d-1}$, then condition (22) holds and so does Proposition 4. If for all $v \in S^{d-1}$, $H_v(0) \le 1/2$ and there are absolute constants $c_0 > 0$ and $c_1 > 0$ so that $f_v(r) \ge c_1$ for all $0 < r < c_0$, then Assumption 4 holds (that is, (20) holds) and so does Theorem 4.

Note that in Proposition 5, $\mu$ and $\Sigma$ do not have to be the mean and covariance matrix of the $\tilde X_i$'s. In that case, $\mu$ and $\Sigma$ are sometimes called location and scale parameters, and so Theorem 4 still applies to the estimation of location, robust to adversarial contamination and heavy tails, even in situations where there is not even a first moment. Proposition 5 gives an alternative to Gaussian approximation, which does not, in general, allow one to check Assumption 4 because of the residual terms in Esseen or Berry-Esseen type inequalities. The assumptions in Proposition 5 all grant that the density functions $f_v$ are locally lower bounded around the 'critical' $1/4$, $1/2$ and $3/4$ quantiles; this covers the $K = N$ case, as for elliptically contoured distributions.

6 Conclusion

We showed that it is possible to estimate a mean vector in $\mathbb{R}^d$ w.r.t. the metric $\|\Sigma^{-1/2}\cdot\|_2$ even though $\Sigma$ is unknown, the data set is corrupted by an adversary and the data are heavy-tailed. The rates obtained are the (deviation) minimax ones in the ideal i.i.d. Gaussian case. The estimator used to achieve this rate is a deepest point with respect to a median-of-means version of the Stahel-Donoho outlyingness functional. When the data are spherical enough, there is no need to bucket the data, and then the estimator is using the classical Stahel-Donoho outlyingness.
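The weighted variants discussed below can also be sketched numerically. The following toy simulation (ours, with illustrative sizes, direction set and corruption pattern) implements the median-outlyingness weighting of bucketed means:

```python
import numpy as np

# Sketch (ours) of a weighted Stahel-Donoho-type MOM estimator: bucketed
# means are weighted by the indicator that their (approximate) outlyingness
# is below the median outlyingness over buckets.  The finite direction set
# replacing the supremum over the sphere is our simplification.
rng = np.random.default_rng(5)

N, d, K, n_dir = 2000, 4, 100, 50
mu = np.full(d, 2.0)
X = mu + rng.standard_t(3.0, size=(N, d))
X[:20] = -1e5                                   # planted corruption

xbar = X.reshape(K, N // K, d).mean(axis=1)     # bucketed means
V = rng.normal(size=(n_dir, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)

P = xbar @ V.T                                  # (K, n_dir) projections
med = np.median(P, axis=0)
momad = np.median(np.abs(P - med), axis=0)      # MOMAD_K(v), per direction
depth = np.max(np.abs(P - med) / momad, axis=1) # approx. outlyingness of each mean

w = depth <= np.median(depth)                   # 0/1 weights via median cutoff
mu_tilde = xbar[w].mean(axis=0)                 # weighted estimate of mu
assert np.linalg.norm(mu_tilde - mu) < 1.0
print(np.round(mu_tilde, 2))
```

The corrupted bucket receives a huge outlyingness and weight zero, so the weighted mean of the remaining bucketed means stays close to $\mu$.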
Our analysis shows that the two cases can be handled using the same methodology and that the family of cdfs $(H_{N,K,v} : v \in \mathcal{S}^{d-1})$ plays a key role in this analysis, in particular their behavior around $0$, the median and the $1/4$ and $3/4$ quantile levels.

a) One may use the isomorphic or almost-isometry properties of $MOMAD_K$ and $SDO_K$ to study the Stahel-Donoho estimator (SDE) or a median-of-means version of the SDE defined as
$$\tilde\mu^{SDE}_{MOM,K} = \frac{\sum_{k=1}^K \hat w_k \bar X_k}{\sum_{k=1}^K \hat w_k} \quad (26)$$
where $(\hat w_k)_{k=1}^K$ are non-negative weights such that $\hat w_k$ depends on the outlyingness of the $k$-th bucketed mean $\bar X_k$. For instance,
$$\hat w_k = I\big(SDO_K(\bar X_k) \le \hat\alpha_K\big) \quad \text{where} \quad \hat\alpha_K = \mathrm{Med}\big(SDO_K(\bar X_k) : k \in [K]\big). \quad (27)$$

b) Similarly, the isomorphic or almost-isometry properties of $MOMAD_K$ and $SDO_K$ may also be used to study the properties of a MOM version of the SDE of the covariance matrix:
$$\hat\Sigma = \frac{2}{K}\sum_{k=1}^{K} \hat w_k (\bar X_k - \tilde\mu^{SDE}_{MOM,K})(\bar X_k - \tilde\mu^{SDE}_{MOM,K})^\top.$$
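The weighting scheme (26)-(27) can be illustrated as follows; the helper name is ours and the toy outlyingness values are hypothetical, assumed to have been precomputed by some approximation of $SDO_K$.

```python
import statistics

def weighted_mom_sde(bucketed_means, sdo_values):
    """MOM version of the Stahel-Donoho estimator as in (26)-(27): hard 0/1
    weights keep the bucketed means whose outlyingness is at most the median
    outlyingness, then average the kept means."""
    alpha_hat = statistics.median(sdo_values)                 # alpha_K hat in (27)
    w = [1.0 if s <= alpha_hat else 0.0 for s in sdo_values]  # w_k hat
    total = sum(w)
    d = len(bucketed_means[0])
    return tuple(sum(wk * x[j] for wk, x in zip(w, bucketed_means)) / total
                 for j in range(d))

# Toy example: four well-behaved bucketed means and one corrupted one whose
# (hypothetical, precomputed) outlyingness is large.
bmeans = [(0.1, -0.2), (0.0, 0.1), (-0.1, 0.0), (0.2, 0.1), (100.0, 100.0)]
sdo_vals = [0.5, 0.4, 0.6, 0.5, 50.0]
estimate = weighted_mom_sde(bmeans, sdo_vals)
```

By construction at least half of the bucketed means receive weight one, and the corrupted bucket, having outlyingness above the median outlyingness, is discarded.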
From a computational point of view, it is still an open question to construct an approximate solution to the SDO. The original or MOM version of the Stahel-Donoho median could be approximated via a robust gradient descent algorithm such as the ones introduced in [10, 15, 39], with some extra normalization step required by the MAD denominator. We expect such an algorithm to be more efficient than the classical weighted SDE because we expect only $\log d$ iterations of a robust gradient descent algorithm to be needed to achieve a subgaussian estimator, whereas the SDE would require approximating the $K$ depths $SDO_K(\bar X_k), k \in [K]$, and should therefore require more computational time (note that, in practice, the SDE has been reported to be more efficient than the deepest datum, that is, the data point $\bar X_k$ with the smallest $SDO_K(\bar X_k)$, but the SDE was not compared with an approximate solution of $\hat\mu^{SDO}_{MOM,K}$).

In this section, we provide the proofs of all the results from the preceding sections. The only complexity measure we use in this work is the Vapnik-Chervonenkis (VC) dimension [69, 70] of a class $F$ of Boolean functions, i.e. of functions from $\mathbb{R}^d$ to $\{0, 1\}$ in our case. We recall that $VC(F)$ is the maximal integer $n$ such that there exist $x_1, \ldots, x_n \in \mathbb{R}^d$ for which the set $\{(f(x_1), \cdots, f(x_n)) : f \in F\}$ is of maximal cardinality, that is $2^n$. The only VC dimension we will use is the one of the set of all indicators of affine half-spaces in $\mathbb{R}^d$:
$$VC\big(\{x \in \mathbb{R}^d \mapsto I(\langle x, v \rangle \ge r) : v \in \mathbb{R}^d, r \in \mathbb{R}\}\big) = d + 1$$
(see Example 2.6.1 in [68]). The main technical tool (see Chapter 3 in [36]) we will be using is the following one: let $Y_1, \ldots, Y_n$ be independent random vectors in $\mathbb{R}^d$; there exists an absolute constant $C$ such that for all $u > 0$, with probability at least $1 - \exp(-u)$,
$$\sup_{f \in F} \frac{1}{n}\left|\sum_{i=1}^n \big(f(Y_i) - \mathbb{E} f(Y_i)\big)\right| \le C\left(\sqrt{\frac{VC(F)}{n}} + \sqrt{\frac{u}{n}}\right). \quad (28)$$
We recall that for all $v \in \mathcal{S}^{d-1}$, $K \in [N]$ and $r > 0$,
$$H_{N,K,v}(r) = \mathbb{P}\left[\sqrt{\frac{K}{N}}\sum_{i=1}^{N/K} \langle \Sigma^{-1/2}(\tilde X_i - \mu), v \rangle \ge r\right].$$
The rate of convergence we will obtain is any $r^*$ satisfying
$$C\left(\sqrt{\frac{d+1}{K}} + \sqrt{\frac{u}{K}}\right) + \sup_{\|v\|_2 = 1} H_{N,K,v}(r^*) + \frac{|\mathcal{O}|}{K} < \frac{1}{2} \quad (29)$$
where $C$ is the constant from (28), for some choice of $K$ and $u$ specified in each result depending on the set of assumptions.

We first prove Proposition 2 -- the proof of Proposition 1 is a straightforward application of Proposition 2.
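To make condition (29) concrete, the following sketch evaluates its left-hand side when $\sup_v H_{N,K,v}$ is replaced by the Markov-type bound $1/(1+r^2)$ available under a second-moment assumption (as used in the proof of Theorem 2 below). The constant $C = 1$ and the sample values of $K$, $d$, $|\mathcal{O}|$ and $u$ are illustrative only, not the constants of the paper.

```python
import math

def rate_condition(r_star, K, d, n_outliers, u, C=1.0):
    """Left-hand side of condition (29), with sup_v H_{N,K,v}(r) replaced by
    the Markov-type tail bound 1/(1+r^2) (valid under a second moment).
    The constant C is illustrative, not the one from (28)."""
    vc_term = C * (math.sqrt((d + 1) / K) + math.sqrt(u / K))
    tail_term = 1.0 / (1.0 + r_star ** 2)
    return vc_term + tail_term + n_outliers / K

def smallest_valid_r(K, d, n_outliers, u, C=1.0, grid=None):
    """Smallest r* on a grid for which (29) holds, i.e. the LHS is < 1/2."""
    grid = grid or [0.1 * i for i in range(1, 200)]
    for r in grid:
        if rate_condition(r, K, d, n_outliers, u, C) < 0.5:
            return r
    return None

# With K large compared to d, u and |O|, a constant r* (here r* = 2) works.
lhs_at_two = rate_condition(2.0, K=10000, d=50, n_outliers=100, u=100)
r_min = smallest_valid_r(K=10000, d=50, n_outliers=100, u=100)
```

This reflects why a constant $r^*$ suffices under an $L_2$ assumption: once $K \gtrsim d + u + |\mathcal{O}|$, the VC and contamination terms are small and the tail term $1/(1+r^2)$ alone decides (29).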
Proof of Proposition 2.
We first observe that, by renormalization, it is enough to show that for all $v \in \mathcal{S}^{d-1}$,
$$\varphi_l(\epsilon)\sqrt{\frac{K}{N}} \le \mathrm{Med}\big(|\langle \Sigma^{-1/2}(\bar X_k - \mu), v \rangle - \mathrm{Med}(\langle \Sigma^{-1/2}(\bar X_k - \mu), v \rangle)|\big) \le \varphi_u(\epsilon)\sqrt{\frac{K}{N}}. \quad (30)$$
Moreover, for all $i \in [N]$, $\Sigma^{-1/2}(\tilde X_i - \mu)$ has mean zero and covariance $I_d$. Hence, without loss of generality, we assume that $\mu = 0$ and $\Sigma = I_d$.

The strategy we use to prove (30) is the following one. Let $K$ real numbers $a_1, \ldots, a_K$ be given and denote by $a_{(1)} \le \cdots \le a_{(K)}$ the non-decreasing rearrangement of the $(a_k)_k$ (this is the rearrangement of the $a_k$'s and not of their absolute values). To prove a result like $\varphi_l(\epsilon) \le \mathrm{Med}(|a_k - \mathrm{Med}(a_k)|) \le \varphi_u(\epsilon)$, it is enough to show that $\varphi_l(\epsilon) \le a_{(3(K+1)/4)} - a_{((K+1)/2)} \le \varphi_u(\epsilon)$ and $\varphi_l(\epsilon) \le a_{((K+1)/2)} - a_{((K+1)/4)} \le \varphi_u(\epsilon)$. As a consequence, to prove a result like (30), we should study the rearrangement (the two quartiles and the median) of the $\langle \bar X_k, v \rangle, k \in [K]$, uniformly over all $v \in \mathcal{S}^{d-1}$. But $|\mathcal{O}|$ elements among the $X_i$'s come from the adversary and we do not have any control on their behavior. We therefore have to consider the worst possible case, which is when $|\mathcal{O}|$ bucketed means $\bar X_k$ are each corrupted by one outlier from $\{X_i : i \in \mathcal{O}\}$. However, one may check that if we change $|\mathcal{O}|$ points in a set $\{a_k : k \in [K]\}$ to get a new set $\{A_k : k \in [K]\}$, then $\varphi_l(\epsilon) \le a_{(3(K+1)/4)} - a_{((K+1)/2)} \le \varphi_u(\epsilon)$ will be true if we show that $\varphi_l(\epsilon) \le A_{(3(K+1)/4 - |\mathcal{O}|)} - A_{((K+1)/2 + |\mathcal{O}|)}$ and $A_{(3(K+1)/4 + |\mathcal{O}|)} - A_{((K+1)/2 - |\mathcal{O}|)} \le \varphi_u(\epsilon)$ -- and a similar observation holds for the other pair of order statistics. It is therefore enough to study the rearrangement of the non-corrupted bucketed means $(\langle \tilde X_k, v \rangle : k \in [K])$ projected on all one-dimensional directions, uniformly over these directions, to deduce the result from (30) on the corrupted bucketed means $\bar X_k$.

We denote by $\tilde X_k, k \in [K]$ the bucketed means of the original (non-corrupted) dataset, i.e. $\tilde X_k = (1/|B_k|)\sum_{i \in B_k} \tilde X_i$ for $k \in [K]$.
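The stability of order statistics under the modification of $|\mathcal{O}|$ points invoked above can be checked numerically; the following sketch (our own helper, not from the paper) verifies that changing $m$ out of $K$ values moves every order statistic by at most $m$ ranks.

```python
import random

def order_stat_sandwich(a, A):
    """Check the rearrangement stability used in the proof strategy: if A
    differs from a in at most m coordinates, then the j-th order statistic
    of A lies between the (j-m)-th and (j+m)-th order statistics of a."""
    m = sum(1 for x, y in zip(a, A) if x != y)
    sa, sA = sorted(a), sorted(A)
    K = len(a)
    for j in range(K):
        lo = sa[j - m] if j - m >= 0 else float("-inf")
        hi = sa[j + m] if j + m < K else float("inf")
        if not (lo <= sA[j] <= hi):
            return False
    return True

random.seed(1)
K, m = 101, 7
a = [random.gauss(0.0, 1.0) for _ in range(K)]
A = list(a)
for i in random.sample(range(K), m):     # an adversary moves m of the K values
    A[i] = random.choice([-1e9, 1e9])
stable = order_stat_sandwich(a, A)
```

The reason is a counting argument: $A$ has at least $j$ values at most its $j$-th order statistic, and at least $j - m$ of them are untouched values of $a$; this is exactly why the quartiles and the median of the bucketed means can be controlled despite the corruption.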
To prove (30), we first study the rearrangements of the vectors $(\langle \tilde X_k, v \rangle)_{k \in [K]}$ uniformly over all $v \in \mathcal{S}^{d-1}$. We will then deal with the adversarial corruption to get (30).

We introduce the following supremum of an empirical process:
$$Z = \sup_{\ell \in [K-1]} \sup_{\|v\|_2 = 1} \left|\frac{1}{K}\sum_{k=1}^K I\left(\langle \tilde X_k, v \rangle \ge \frac{W_v(\ell/K)}{\sqrt{N/K}}\right) - \mathbb{P}\left[\langle \tilde X_k, v \rangle \ge \frac{W_v(\ell/K)}{\sqrt{N/K}}\right]\right|$$
where $W_v$ has been defined in Definition 1. It follows from (28) that for all $u > 0$, with probability at least $1 - \exp(-u)$, $Z \le C(\sqrt{(d+1)/K} + \sqrt{u/K})$ (note that even though the function $W_v$ depends on $v$, the Boolean function $x \mapsto I(\langle x, v \rangle \ge W_v(\ell/K))$ is still the indicator of an affine half-space of $\mathbb{R}^d$ for all $v \in \mathbb{R}^d$ and all $\ell \in [K-1]$, so the VC dimension of $\{x \mapsto I(\langle x, v \rangle \ge W_v(\ell/K)) : v \in \mathbb{R}^d, \ell \in [K-1]\}$ is less than or equal to $d + 1$). As a consequence, for $0 < \epsilon < 1/4$, if $K \ge (2C)^2(d+1)\epsilon^{-2}$ then, with probability at least $1 - \exp(-\epsilon^2 K/(2C)^2)$, $Z \le \epsilon$. Let us denote by $\Omega_\epsilon$ the event on which $Z \le \epsilon$; we proved that $\mathbb{P}[\Omega_\epsilon] \ge 1 - \exp(-\epsilon^2 K/(2C)^2)$.

Let us place ourselves on the event $\Omega_\epsilon$ up to the end of the proof. Since for all $v \in \mathcal{S}^{d-1}$,
$$\mathbb{P}\left[\langle \tilde X_k, v \rangle \ge \frac{W_v(\ell/K)}{\sqrt{N/K}}\right] = H_v(W_v(\ell/K)) = \frac{\ell}{K}$$
(by left continuity of $H_v$ we have $H_v(W_v(p)) = p$ for all $p \in (0,1)$), we obtain, for all $\ell \in [K]$ and $v \in \mathcal{S}^{d-1}$, that
$$\left|\left\{k \in [K] : \langle \tilde X_k, v \rangle \ge \frac{W_v(\ell/K)}{\sqrt{N/K}}\right\}\right| \in [\ell - \epsilon K, \ell + \epsilon K]. \quad (31)$$
This last result on the uniform-in-$v \in \mathcal{S}^{d-1}$ rearrangement of $(\langle \tilde X_k, v \rangle)_k$ will be used to get the desired result on the rearrangement of $(\langle \bar X_k, v \rangle)_k$ (uniformly in $v$). To go from the $\tilde X_k$'s to the $\bar X_k$'s, we now have to deal with the adversarial corruption.

Since there are $|\mathcal{O}|$ original data that may have been modified by the adversary, in the worst case $|\mathcal{O}|$ bucketed means $\tilde X_k$ may be considered as corrupted and so, from the above cardinality estimation result (31), we may only certify (on $\Omega_\epsilon$) that
$$\left|\left\{k \in [K] : \langle \bar X_k, v \rangle \ge \frac{W_v(\ell/K)}{\sqrt{N/K}}\right\}\right| \in [\ell - \epsilon K - |\mathcal{O}|, \ell + \epsilon K + |\mathcal{O}|] \subset [\ell - 2\epsilon K, \ell + 2\epsilon K]$$
on the $K$ bucketed means $\bar X_k$ constructed from the adversarially corrupted dataset $\{X_i : i \in [N]\}$. We used here the assumption that $|\mathcal{O}| \le \epsilon K$.
It follows from the latter result that if we denote by $q^{1/4}_{K,v}$ the $1/4$-empirical quantile of $(\langle \bar X_k, v \rangle : k \in [K])$, by $q^{1/2}_{K,v}$ its median and by $q^{3/4}_{K,v}$ its $3/4$-empirical quantile, then on the event $\Omega_\epsilon$,
$$\sqrt{\frac{K}{N}}\, W_v\left(\frac{3}{4} + 2\epsilon\right) \le q^{1/4}_{K,v} \le \sqrt{\frac{K}{N}}\, W_v\left(\frac{3}{4} - 2\epsilon\right);$$
$$\sqrt{\frac{K}{N}}\, W_v\left(\frac{1}{2} + 2\epsilon\right) \le q^{1/2}_{K,v} \le \sqrt{\frac{K}{N}}\, W_v\left(\frac{1}{2} - 2\epsilon\right) \quad \text{and} \quad \sqrt{\frac{K}{N}}\, W_v\left(\frac{1}{4} + 2\epsilon\right) \le q^{3/4}_{K,v} \le \sqrt{\frac{K}{N}}\, W_v\left(\frac{1}{4} - 2\epsilon\right).$$
It follows from these inequalities that on the event $\Omega_\epsilon$, we have for all $v \in \mathcal{S}^{d-1}$,
$$\mathrm{Med}\big(|\langle \bar X_k, v \rangle - \mathrm{Med}(\langle \bar X_k, v \rangle)|\big) \le \sqrt{\frac{K}{N}} \max\left(W_v\left(\frac{1}{4} - 2\epsilon\right) - W_v\left(\frac{1}{2} + 2\epsilon\right),\; W_v\left(\frac{1}{2} - 2\epsilon\right) - W_v\left(\frac{3}{4} + 2\epsilon\right)\right)$$
and
$$\mathrm{Med}\big(|\langle \bar X_k, v \rangle - \mathrm{Med}(\langle \bar X_k, v \rangle)|\big) \ge \sqrt{\frac{K}{N}} \min\left(W_v\left(\frac{1}{4} + 2\epsilon\right) - W_v\left(\frac{1}{2} - 2\epsilon\right),\; W_v\left(\frac{1}{2} + 2\epsilon\right) - W_v\left(\frac{3}{4} - 2\epsilon\right)\right).$$
The result follows from the definition of $\varphi_l(\epsilon)$ and $\varphi_u(\epsilon)$ in Assumption 3.

Proof of Propositions 3 and 1 (second part): isomorphic property of $SDO_K$.

The proofs of Proposition 3 and of the second part of Proposition 1 rely on the next result.
Proposition 6.
We assume that the adversarial contamination with $L_2$ inliers model from Assumption 1 holds with a number of adversarial outliers denoted by $|\mathcal{O}|$. Let $K \in [N]$, $u > 0$ and $r^*$ be such that (29) holds. Then, with probability at least $1 - \exp(-u)$,
$$\sup_{v \in \mathcal{S}^{d-1}} |\mathrm{Med}(\langle \Sigma^{-1/2}(\bar X_k - \mu), v \rangle)| \le \sqrt{\frac{K}{N}}\, r^*.$$

Proof of Proposition 6.
Denote by $\mathcal{K} = \{k : B_k \cap \mathcal{O} = \emptyset\}$ the set of indices of non-corrupted blocks of data. It follows from (28) and the definition of $r^*$ that, with probability at least $1 - \exp(-u)$, for all $v \in \mathcal{S}^{d-1}$,
\begin{align*}
\frac{1}{K}\sum_{k=1}^K I\left(\langle \Sigma^{-1/2}(\bar X_k - \mu), v \rangle \ge \frac{r^*}{\sqrt{N/K}}\right) &= \frac{1}{K}\sum_{k \in \mathcal{K}} I\left(\langle \Sigma^{-1/2}(\tilde X_k - \mu), v \rangle \ge \frac{r^*}{\sqrt{N/K}}\right) + \frac{1}{K}\sum_{k \in \mathcal{K}^c} I\left(\langle \Sigma^{-1/2}(\bar X_k - \mu), v \rangle \ge \frac{r^*}{\sqrt{N/K}}\right)\\
&\le \frac{1}{K}\sum_{k=1}^K I\left(\langle \Sigma^{-1/2}(\tilde X_k - \mu), v \rangle \ge \frac{r^*}{\sqrt{N/K}}\right) + \frac{|\mathcal{O}|}{K}\\
&\le \sup_{\|v\|_2=1}\left(\frac{1}{K}\sum_{k=1}^K I\left(\langle \Sigma^{-1/2}(\tilde X_k - \mu), v \rangle \ge \frac{r^*}{\sqrt{N/K}}\right) - \mathbb{P}\left[\langle \Sigma^{-1/2}(\tilde X_k - \mu), v \rangle \ge \frac{r^*}{\sqrt{N/K}}\right]\right)\\
&\qquad + \mathbb{P}\left[\langle \Sigma^{-1/2}(\tilde X_1 - \mu), v \rangle \ge \frac{r^*}{\sqrt{N/K}}\right] + \frac{|\mathcal{O}|}{K}\\
&\le C\left(\sqrt{\frac{d+1}{K}} + \sqrt{\frac{u}{K}}\right) + H_{N,K,v}(r^*) + \frac{|\mathcal{O}|}{K} < \frac{1}{2}.
\end{align*}
As a consequence, with probability at least $1 - \exp(-u)$, for all $v \in \mathcal{S}^{d-1}$,
$$\sum_{k=1}^K I\left(\langle \Sigma^{-1/2}(\bar X_k - \mu), v \rangle \ge \frac{r^*}{\sqrt{N/K}}\right) < \frac{K}{2}$$
and therefore, applying the same argument to $-v$ as well, $\sup_{v \in \mathcal{S}^{d-1}} |\mathrm{Med}(\langle \Sigma^{-1/2}(\bar X_k - \mu), v \rangle)| \le \sqrt{K/N}\, r^*$. (32)

Remark 1.
It is also possible to consider a "directional version" of Proposition 6 if one defines a "directional version" of $r^*$, that is, for all directions $v \in \mathcal{S}^{d-1}$, define $r^*_v > 0$ satisfying
$$C\left(\sqrt{\frac{d+1}{K}} + \sqrt{\frac{u}{K}}\right) + H_{N,K,v}(r^*_v) + \frac{|\mathcal{O}|}{K} < \frac{1}{2}.$$
Then, under the same conditions as in Proposition 6, we have, with probability at least $1 - \exp(-u)$,
$$\sup_{v \in \mathcal{S}^{d-1}} \frac{|\mathrm{Med}(\langle \Sigma^{-1/2}(\bar X_k - \mu), v \rangle)|}{r^*_v} \le \sqrt{\frac{K}{N}}.$$

Hence, Proposition 6 holds as well for $r^* = \sup_{\|v\|_2=1} r^*_v$. Note that for most of the $v \in \mathcal{S}^{d-1}$, the value of $r^*_v$ is expected to be much smaller than $r^*$. For instance, for well-spread vectors $v$, we expect them to have a strong "mixing" power (see for instance "super-Gaussian directions" in [35] or [33, 34]).

Proof of Propositions 3 and 1 (second part). It follows from Proposition 6 and Proposition 2 that, with probability at least $1 - \exp(-u) - \exp(-c_0\epsilon^2 K)$, for all $v \in \mathcal{S}^{d-1}$,
$$|\mathrm{Med}(\langle \Sigma^{-1/2}(\bar X_k - \mu), v \rangle)| \le \sqrt{\frac{K}{N}}\, r^* \quad \text{and} \quad \varphi_l(\epsilon)\sqrt{\frac{K}{N}}\|\Sigma^{1/2}v\|_2 \le MOMAD_K(v) \le \varphi_u(\epsilon)\sqrt{\frac{K}{N}}\|\Sigma^{1/2}v\|_2.$$
We denote by $\Omega_0$ the event on which the last two properties hold. On the event $\Omega_0$, for all $\nu \in \mathbb{R}^d$, we have
\begin{align*}
SDO_K(\nu) &= \sup_{v \in \mathbb{R}^d} \frac{|\mathrm{Med}(\langle \bar X_k - \nu, v \rangle)|}{MOMAD_K(v)} \le \sup_{v \in \mathbb{R}^d} \frac{|\mathrm{Med}(\langle \bar X_k - \nu, v \rangle)|}{\varphi_l(\epsilon)\sqrt{K/N}\|\Sigma^{1/2}v\|_2} = \sup_{v \in \mathcal{S}^{d-1}} \frac{|\mathrm{Med}(\langle \Sigma^{-1/2}(\bar X_k - \nu), v \rangle)|}{\varphi_l(\epsilon)\sqrt{K/N}}\\
&\le \sup_{v \in \mathcal{S}^{d-1}} \frac{|\mathrm{Med}(\langle \Sigma^{-1/2}(\bar X_k - \mu), v \rangle)| + |\langle \Sigma^{-1/2}(\nu - \mu), v \rangle|}{\varphi_l(\epsilon)\sqrt{K/N}} \le \frac{\sqrt{K/N}\, r^* + \|\Sigma^{-1/2}(\nu - \mu)\|_2}{\varphi_l(\epsilon)\sqrt{K/N}}\\
&\le \begin{cases} \dfrac{2\|\Sigma^{-1/2}(\nu - \mu)\|_2}{\varphi_l(\epsilon)\sqrt{K/N}} & \text{if } \|\Sigma^{-1/2}(\nu - \mu)\|_2 \ge \sqrt{K/N}\, r^*,\\[2mm] 2r^*/\varphi_l(\epsilon) & \text{otherwise,} \end{cases}
\end{align*}
and when $\|\Sigma^{-1/2}(\nu - \mu)\|_2 \ge 2\sqrt{K/N}\, r^*$, we have
$$SDO_K(\nu) \ge \sup_{v \in \mathcal{S}^{d-1}} \frac{|\langle \Sigma^{-1/2}(\nu - \mu), v \rangle| - |\mathrm{Med}(\langle \Sigma^{-1/2}(\bar X_k - \mu), v \rangle)|}{\varphi_u(\epsilon)\sqrt{K/N}} \ge \frac{\|\Sigma^{-1/2}(\nu - \mu)\|_2}{2\varphi_u(\epsilon)\sqrt{K/N}}.$$
Proof of Proposition 1.
Proposition 1 is a corollary of Proposition 2 for $K = N$. For this choice of $K$, there are $N$ blocks, each containing a single datum, and so $MOMAD_N(v) = MAD(v)$ for all $v \in \mathbb{R}^d$. The only thing that remains to be checked is the validity of Assumption 3 in the Gaussian case and the dependence of $\varphi_l(\epsilon)$ and $\varphi_u(\epsilon)$ on $\epsilon$.

When the original data $\tilde X_i, i \in [N]$ are $N$ i.i.d. Gaussian vectors $G_1, \ldots, G_N$ with mean $\mu$ and covariance matrix $\Sigma$, then for all $K \in [N]$, $(1/\sqrt{N/K})\sum_{i=1}^{N/K} \Sigma^{-1/2}(\tilde X_i - \mu)$ is a standard Gaussian vector in $\mathbb{R}^d$. Therefore the $H := H_{N,K,v}$ function from Assumption 3 is equal to the function $x \in \mathbb{R} \mapsto 1 - \Phi(x)$, where $\Phi : x \in \mathbb{R} \mapsto \mathbb{P}[g \le x]$ is the cdf of a standard Gaussian variable $g \sim \mathcal{N}(0, 1)$ in $\mathbb{R}$. This holds for all $N, K$ and $v \in \mathcal{S}^{d-1}$, that is, $H_{N,K,v}$ is independent of $N, K$ and $v \in \mathcal{S}^{d-1}$. Since $W := W_{N,K,v}$ is the generalized inverse of $H$, in the Gaussian case we obtain that $W(p) = \Phi^{-1}(1-p)$ for all $p \in (0,1)$. Since $\Phi^{-1}$ is Lipschitz in a neighborhood of $1/4$, $1/2$ and $3/4$, there exists an absolute constant $C_0 > 0$ such that
$$\min\left(W\left(\frac{1}{4} + 2\epsilon\right) - W\left(\frac{1}{2} - 2\epsilon\right),\; W\left(\frac{1}{2} + 2\epsilon\right) - W\left(\frac{3}{4} - 2\epsilon\right)\right) \ge \Phi^{-1}(3/4) - C_0\epsilon =: \varphi_l(\epsilon)$$
and
$$\max\left(W\left(\frac{1}{4} - 2\epsilon\right) - W\left(\frac{1}{2} + 2\epsilon\right),\; W\left(\frac{1}{2} - 2\epsilon\right) - W\left(\frac{3}{4} + 2\epsilon\right)\right) \le \Phi^{-1}(3/4) + C_0\epsilon =: \varphi_u(\epsilon).$$
As a consequence, Assumption 3 holds in the Gaussian case for all $0 < \epsilon < \Phi^{-1}(3/4)/C_0$ with $\varphi_l(\epsilon) = \Phi^{-1}(3/4) - C_0\epsilon$ and $\varphi_u(\epsilon) = \Phi^{-1}(3/4) + C_0\epsilon$.

Proofs of Theorems 1, 2 and 4
Theorems 1, 2 and 4 are corollaries of a general result that we are stating now.
Theorem 5.
There are absolute constants $c_0$, $c_1$ and $c_2$ such that the following holds. We assume that Assumption 3 holds for some $0 < \epsilon < 1/4$ and constants $\varphi_l(\epsilon)$ and $\varphi_u(\epsilon)$. We assume that the adversarial contamination with $L_2$ inliers model from Assumption 1 holds with a number of adversarial outliers denoted by $|\mathcal{O}|$. Let $K \ge \max(\epsilon^{-1}|\mathcal{O}|, c_1\epsilon^{-2}d)$, $0 < u < c_2\epsilon^2 K$ and $r^*$ be such that (29) holds. Then, with probability at least $1 - 2\exp(-u)$,
$$\left\|\Sigma^{-1/2}(\hat\mu^{SDO}_{MOM,K} - \mu)\right\|_2 \le 2\frac{\varphi_u(\epsilon)}{\varphi_l(\epsilon)}\sqrt{\frac{K}{N}}\, r^*.$$

Proof of Theorem 1. There exists an absolute constant $c_0$ such that for all $0 \le r \le c_0$, $\mathbb{P}[g \ge r] \le 1/2 - r/4$, where $g \sim \mathcal{N}(0,1)$. Moreover, in the Gaussian case, for all $K \in [N]$, $v \in \mathcal{S}^{d-1}$ and $r > 0$, we have $H_{N,K,v}(r) = \mathbb{P}[g \ge r]$. As a consequence, one can choose $r^*$, $u$ and $K$ such that
$$r^* = 4\left(C\left(\sqrt{\frac{d+1}{K}} + \sqrt{\frac{u}{K}}\right) + \frac{|\mathcal{O}|}{K}\right)$$
as long as this latter quantity is less than or equal to $c_0$. Finally, we apply Theorem 5 for $K = N$ and the result follows since $\hat\mu^{SDO}_{MOM,N} = \hat\mu^{SDO}$.

Proof of Theorem 2.
It follows from Markov's inequality (14) that we can choose $u$, $r^*$ and $K$ such that
$$C\left(\sqrt{\frac{d+1}{K}} + \sqrt{\frac{u}{K}}\right) + \frac{1}{1 + (r^*)^2} + \frac{|\mathcal{O}|}{K} < \frac{1}{2}$$
for $r^* = 2$, $K \ge c_0|\mathcal{O}|$, $K \ge C_1(d+1)$ and $K \ge C_1 u$, for appropriate absolute constants $c_0$ and $C_1$. Note, however, that because $r^*$ is constant, the convergence rate is proportional to $\sqrt{K/N}$; in particular, it does not depend on $u$. Hence, there is no interest in considering values of $u$ smaller than $K$ (up to constants). We therefore apply Theorem 5 for this choice of $K$, $u = c_2\epsilon^2 K$ and $r^* = 2$.

Proof of Theorem 4.
Thanks to Assumption 4, there exist absolute constants $c_0 > 0$ and $c_1 > 0$ such that for all $v \in \mathcal{S}^{d-1}$ and all $(2C/c_0)\sqrt{(d+1)/K} \le r \le c_1$, $H_{N,K,v}(r) \le 1/2 - c_0 r$. As a consequence, one can choose $r^*$, $u$ and $K$ such that
$$r^* = \frac{2}{c_0}\left(C\left(\sqrt{\frac{d+1}{K}} + \sqrt{\frac{u}{K}}\right) + \frac{|\mathcal{O}|}{K}\right)$$
as long as this latter quantity is less than or equal to $c_1$. Finally, we apply Theorem 5 for this choice of $K$, $u$ and $r^*$.

Proof of Theorem 5.
We first note that a proof of Theorem 5 may follow from the isomorphic property of
$SDO_K$ from Proposition 3. However, it is possible to improve the constants by using the following strategy. Let us place ourselves on the intersection of the two events where the results of both Proposition 2 and Proposition 6 hold. We set $f : v \in \mathbb{R}^d \mapsto \mathrm{Med}(\langle \bar X_k, v \rangle)$. Since $f$ is symmetric, we have
\begin{align*}
\left\|\Sigma^{-1/2}(\hat\mu^{SDO}_{MOM,K} - \mu)\right\|_2 &= \sup_{\|v\|_2=1} \langle \Sigma^{-1/2}(\hat\mu^{SDO}_{MOM,K} - \mu), v \rangle = \sup_{v \in \mathbb{R}^d} \frac{\langle \hat\mu^{SDO}_{MOM,K} - \mu, v \rangle}{\|\Sigma^{1/2}v\|_2}\\
&= \sup_{v \in \mathbb{R}^d} \frac{\langle \hat\mu^{SDO}_{MOM,K}, v \rangle - f(v) + f(v) - \langle \mu, v \rangle}{MOMAD_K(v)} \cdot \frac{MOMAD_K(v)}{\|\Sigma^{1/2}v\|_2}\\
&\le \left(\sup_{v \in \mathbb{R}^d} \frac{\langle \hat\mu^{SDO}_{MOM,K}, v \rangle - f(v)}{MOMAD_K(v)} + \sup_{v \in \mathbb{R}^d} \frac{f(v) - \langle \mu, v \rangle}{MOMAD_K(v)}\right)\sup_{v \in \mathbb{R}^d} \frac{MOMAD_K(v)}{\|\Sigma^{1/2}v\|_2}\\
&\le \big(SDO_K(\hat\mu^{SDO}_{MOM,K}) + SDO_K(\mu)\big)\sup_{v \in \mathbb{R}^d} \frac{MOMAD_K(v)}{\|\Sigma^{1/2}v\|_2} \le 2\, SDO_K(\mu)\sup_{v \in \mathbb{R}^d} \frac{MOMAD_K(v)}{\|\Sigma^{1/2}v\|_2},
\end{align*}
where we used that $SDO_K(\hat\mu^{SDO}_{MOM,K}) \le SDO_K(\mu)$ by definition of $\hat\mu^{SDO}_{MOM,K}$.

We know how to control $\sup_{v \in \mathbb{R}^d} MOMAD_K(v)/\|\Sigma^{1/2}v\|_2$ by $\sqrt{K/N}\,\varphi_u(\epsilon)$ using Proposition 2. It remains to control the term $SDO_K(\mu)$. We have
$$SDO_K(\mu) = \sup_{v \in \mathbb{R}^d} \frac{|\langle \mu, v \rangle - \mathrm{Med}(\langle \bar X_k, v \rangle)|}{\mathrm{Med}(|\langle \bar X_k, v \rangle - \mathrm{Med}(\langle \bar X_k, v \rangle)|)} = \sup_{v \in \mathbb{R}^d} \frac{|\mathrm{Med}(\langle \mu - \bar X_k, v \rangle)|}{\|\Sigma^{1/2}v\|_2} \cdot \frac{\|\Sigma^{1/2}v\|_2}{MOMAD_K(v)} \le \sup_{\|v\|_2=1} |\mathrm{Med}(\langle \Sigma^{-1/2}(\bar X_k - \mu), v \rangle)| \cdot \sup_{v \in \mathbb{R}^d} \frac{\|\Sigma^{1/2}v\|_2}{MOMAD_K(v)}.$$
The term $\sup_{v \in \mathbb{R}^d} \|\Sigma^{1/2}v\|_2/MOMAD_K(v)$ is smaller than $\sqrt{N/K}/\varphi_l(\epsilon)$ thanks to Proposition 2. Finally, to finish the proof, we upper bound the term $\sup_{\|v\|_2=1} |\mathrm{Med}(\langle \Sigma^{-1/2}(\bar X_k - \mu), v \rangle)|$ by $\sqrt{K/N}\, r^*$ thanks to Proposition 6.

Proof of Theorem 3. For all $k \in [N]$, we set $\hat\mu_k = \hat\mu^{SDO}_{MOM,k}$ and we denote by $\Omega_k$ the event on which
$$\left\|\Sigma^{-1/2}(\hat\mu_k - \mu)\right\|_2 \le 4\frac{\varphi_u(\epsilon)}{\varphi_l(\epsilon)}\sqrt{\frac{k}{N}}$$
and, for all $\nu \in \mathbb{R}^d$, if $\|\Sigma^{-1/2}(\nu - \mu)\|_2 \ge 6\sqrt{k/N}$ then
$$\frac{\|\Sigma^{-1/2}(\nu - \mu)\|_2}{2\varphi_u(\epsilon)\sqrt{k/N}} \le SDO_k(\nu) \le \frac{2\|\Sigma^{-1/2}(\nu - \mu)\|_2}{\varphi_l(\epsilon)\sqrt{k/N}}$$
and if $\|\Sigma^{-1/2}(\nu - \mu)\|_2 \le 6\sqrt{k/N}$ then $SDO_k(\nu) \le 6/\varphi_l(\epsilon)$. It follows from Proposition 3 for $r^* = 3$ and $u = K/(16C^2)$ and from Theorem 2 that $\mathbb{P}[\Omega_k] \ge 1 - 4\exp(-c_0\epsilon^2 k)$ when $k \ge \max(|\mathcal{O}|/\epsilon, c_1 d/\epsilon^2)$.

Let $K \ge \max(|\mathcal{O}|/\epsilon, c_1 d/\epsilon^2)$.
On the event $\cap_{k=K}^N \Omega_k$, we have for all $K \le k \le N$,
$$SDO_k(\hat\mu_K - \hat\mu_k) \le \max\left(\frac{6}{\varphi_l(\epsilon)}, \frac{2\|\Sigma^{-1/2}(\hat\mu_K - \hat\mu_k)\|_2}{\varphi_l(\epsilon)\sqrt{k/N}}\right) \le \max\left(\frac{6}{\varphi_l(\epsilon)}, \frac{8\varphi_u(\epsilon)}{\varphi_l(\epsilon)^2}\left(1 + \sqrt{\frac{K}{k}}\right)\right)$$
and so, by definition of $\hat K$, we have $\hat K \le K$. We also have, by definition of $\hat K$ and because $\hat K \le K$, that
$$SDO_K(\hat\mu_{\hat K} - \hat\mu_K) \le \max\left(\frac{6}{\varphi_l(\epsilon)}, \frac{8\varphi_u(\epsilon)}{\varphi_l(\epsilon)^2}\left(1 + \sqrt{\frac{\hat K}{K}}\right)\right) \le \frac{16\varphi_u(\epsilon)}{\varphi_l(\epsilon)^2}.$$
We conclude that either $\|\Sigma^{-1/2}(\hat\mu_{\hat K} - \hat\mu_K)\|_2 \le 6\sqrt{K/N}$, and so
$$\left\|\Sigma^{-1/2}(\hat\mu_{\hat K} - \mu)\right\|_2 \le \left\|\Sigma^{-1/2}(\hat\mu_{\hat K} - \hat\mu_K)\right\|_2 + \left\|\Sigma^{-1/2}(\hat\mu_K - \mu)\right\|_2 \le \left(6 + 4\frac{\varphi_u(\epsilon)}{\varphi_l(\epsilon)}\right)\sqrt{\frac{K}{N}},$$
or $\|\Sigma^{-1/2}(\hat\mu_{\hat K} - \hat\mu_K)\|_2 \ge 6\sqrt{K/N}$, and so
$$\left\|\Sigma^{-1/2}(\hat\mu_{\hat K} - \mu)\right\|_2 \le \left\|\Sigma^{-1/2}(\hat\mu_{\hat K} - \hat\mu_K)\right\|_2 + \left\|\Sigma^{-1/2}(\hat\mu_K - \mu)\right\|_2 \le SDO_K(\hat\mu_{\hat K} - \hat\mu_K)\, 2\varphi_u(\epsilon)\sqrt{\frac{K}{N}} + 4\frac{\varphi_u(\epsilon)}{\varphi_l(\epsilon)}\sqrt{\frac{K}{N}} \le \frac{36\varphi_u(\epsilon)^2}{\varphi_l(\epsilon)^2}\sqrt{\frac{K}{N}}.$$
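The adaptive choice of $\hat K$ above is a Lepski-type rule; the following is a schematic version of such a rule, not the paper's exact selection procedure: the constant $C$ and the radii, which play the role of $\sqrt{k/N}$, are placeholders.

```python
import math

def lepski_select(grid, estimates, radii, C=1.0):
    """Lepski-type adaptation: return the smallest k in the (increasing) grid
    whose estimate is within C*(r_i + r_j) of the estimate at every larger
    grid point. Estimates are points in R^d, radii are confidence radii."""
    def dist(x, y):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    for i, k in enumerate(grid):
        if all(dist(estimates[i], estimates[j]) <= C * (radii[i] + radii[j])
               for j in range(i + 1, len(grid))):
            return k
    return grid[-1]

# Toy check: the first estimate is way off, the remaining ones are mutually
# consistent, so the rule settles on the second grid point.
grid = [1, 2, 3, 4]
radii = [1.0, 2.0, 4.0, 8.0]
estimates = [(30.0, 0.0), (1.5, 0.0), (1.0, 0.0), (0.0, 0.0)]
chosen = lepski_select(grid, estimates, radii)
```

As in the proof, the selected index is at most the smallest index whose confidence radius is valid, and the triangle inequality then transfers the error bound at that index to the selected estimator.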
Proof of Proposition 4.
We have for all $i, j \in [d]$, $|\phi^2\Sigma_{ij} - \hat\Sigma_{ij}| \le c\epsilon(c\epsilon + \phi)(\Sigma_{ii} + \Sigma_{jj})$ because it follows from (18) that for all $v \in \mathbb{R}^d$,
$$\left|MOMAD_K(v)^2 - \phi^2\frac{K}{N}\|\Sigma^{1/2}v\|_2^2\right| = \left|MOMAD_K(v) - \phi\sqrt{\frac{K}{N}}\|\Sigma^{1/2}v\|_2\right|\left(MOMAD_K(v) + \phi\sqrt{\frac{K}{N}}\|\Sigma^{1/2}v\|_2\right) \le c\epsilon\frac{K}{N}\|\Sigma^{1/2}v\|_2^2\,(c\epsilon + \phi).$$
Next, we have for all $u, v \in \mathbb{R}^d$ such that $\|u\| = \|v\| = 1$,
\begin{align*}
|\langle u, (\phi^2\Sigma - \hat\Sigma)v \rangle| &= \frac{N}{4K}\left|\sum_{i,j} u_i v_j\left(\phi^2\frac{K}{N}\|\Sigma^{1/2}(e_i + e_j)\|_2^2 - MOMAD_K(e_i + e_j)^2 - \phi^2\frac{K}{N}\|\Sigma^{1/2}(e_i - e_j)\|_2^2 + MOMAD_K(e_i - e_j)^2\right)\right|\\
&\le \frac{c\epsilon(c\epsilon + \phi)}{4}\sum_{i,j} |u_i||v_j|\left(\|\Sigma^{1/2}(e_i + e_j)\|_2^2 + \|\Sigma^{1/2}(e_i - e_j)\|_2^2\right) = \frac{c\epsilon(c\epsilon + \phi)}{2}\sum_{i,j} |u_i||v_j|(\Sigma_{ii} + \Sigma_{jj}).
\end{align*}

References

[1] Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. J. Comput. System Sci., 58(1, part 2):137–147, 1999. Twenty-eighth Annual ACM Symposium on the Theory of Computing (Philadelphia, PA, 1996).
[2] Imre Bárány and Nabil H. Mustafa. An application of the universality theorem for Tverberg partitions to data depth and hitting convex sets.
Computational Geometry, page 101649, 2020.
[3] Włodzimierz Bryc.
The normal distribution , volume 100 of
Lecture Notes in Statistics . Springer-Verlag, New York, 1995. Charac-terizations with applications.[4] S´ebastien Bubeck, Nicol`o Cesa-Bianchi, and G´abor Lugosi. Bandits with heavy tail.
IEEE Trans. Inform. Theory , 59(11):7711–7717,2013.[5] T. Tony Cai, Weidong Liu, and Harrison H. Zhou. Estimating sparse precision matrix: optimal rates of convergence and adaptiveestimation.
Ann. Statist. , 44(2):455–488, 2016.[6] Olivier Catoni. Challenging the empirical mean and empirical variance: a deviation study.
Ann. Inst. Henri Poincar´e Probab. Stat. ,48(4):1148–1185, 2012.[7] Louis H. Y. Chen and Qi-Man Shao. A non-uniform Berry-Esseen bound via Stein’s method.
Probab. Theory Related Fields ,120(2):236–254, 2001.[8] Mengjie Chen, Chao Gao, and Zhao Ren. Robust covariance and scatter matrix estimation under Huber’s contamination model.
Ann. Statist. , 46(5):1932–1960, 2018.[9] Yanchu Chen and Qi-Man Shao. Berry-Esseen inequality for unbounded exchangeable pairs. In
Probability approximations andbeyond , volume 205 of
Lect. Notes Stat. , pages 13–30. Springer, New York, 2012.[10] Yeshwanth Cherapanamjeri, Nicolas Flammarion, and Peter L. Bartlett. Fast mean estimation with sub-gaussian rates, 2019.[11] Arnak Dalalyan and Philip Thompson. Outlier-robust estimation of a sparse linear model using l1-penalized huber’s m-estimator.In
Advances in Neural Information Processing Systems , pages 13188–13198, 2019.[12] Arnak S Dalalyan and Arshak Minasyan. All-in-one robust estimator of the gaussian mean. arXiv preprint arXiv:2002.01432 , 2020.[13] P. L. Davies. Asymptotic behaviour of S -estimates of multivariate location parameters and dispersion matrices. Ann. Statist. ,15(3):1269–1292, 1987.[14] Michiel Debruyne. An outlier map for support vector machine classification.
Ann. Appl. Stat. , 3(4):1566–1580, 2009.[15] Jules Depersin and Guillaume Lecu´e. Fast algorithms for robust estimation of a mean vector. 2019.[16] Jules Depersin and Guillaume Lecu´e. Convex programs and algorithms for robust subgaussian estimation of a mean vector withrespect to any norm. Technical report, IPParis, Crest, ENSAE, 2020.[17] Luc Devroye, Matthieu Lerasle, Gabor Lugosi, and Roberto I. Oliveira. Sub-gaussian mean estimators.
Ann. Statist. , 44(6):2695–2725, 12 2016.[18] Luc Devroye, Matthieu Lerasle, Gabor Lugosi, and Roberto I. Oliveira. Sub-Gaussian mean estimators.
Ann. Statist., 44(6):2695–2725, 2016.
[19] Ilias Diakonikolas, Gautam Kamath, Daniel M. Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Robust estimators in high dimensions without the computational intractability. In , pages 655–664. IEEE Computer Soc., Los Alamitos, CA, 2016.
[20] Ilias Diakonikolas, Gautam Kamath, Daniel M. Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Being robust (in high dimensions) can be practical. arXiv preprint arXiv:1703.00893, 2017.
[21] David L. Donoho. Breakdown properties of multivariate location estimators. Ph.D. qualifying paper, Dept. of Statistics, Harvard Univ., 1982.
[22] David Donoho and Peter J. Huber. The notion of breakdown point. In
A Festschrift for Erich L. Lehmann
Ann. Statist. , 20(4):1803–1827, 1992.[25] David L Donoho and Miriam Gasko. Breakdown properties of location estimates based on halfspace depth and projected outlyingness.
The Annals of Statistics , 20(4):1803–1827, 1992.[26] Ramanathan Gnanadesikan and John R Kettenring. Robust estimates, residuals, and outlier detection with multiresponse data.
Biometrics , pages 81–124, 1972.[27] JBS Haldane. Note on the median of a multivariate distribution.
Biometrika , 35(3-4):414–417, 1948.[28] Frank R. Hampel. Robust estimation: a condensed partial survey.
Z. Wahrscheinlichkeitstheorie und Verw. Gebiete , 27:87–104,1973.[29] Frank R. Hampel. The influence curve and its role in robust estimation.
J. Amer. Statist. Assoc. , 69:383–393, 1974.[30] Samuel B Hopkins. Sub-gaussian mean estimation in polynomial time. arXiv preprint arXiv:1809.07425 , 2018.[31] Peter J. Huber and Elvezio M. Ronchetti.
Robust statistics . Wiley Series in Probability and Statistics. John Wiley & Sons, Inc.,Hoboken, NJ, second edition, 2009.[32] Mark R. Jerrum, Leslie G. Valiant, and Vijay V. Vazirani. Random generation of combinatorial structures from a uniform distribution.
Theoret. Comput. Sci. , 43(2-3):169–188, 1986.
[33] B. Klartag and S. Sodin. Variations on the Berry-Esseen theorem.
Teor. Veroyatn. Primen. , 56(3):514–533, 2011.[34] Bo’az Klartag. A Berry-Esseen type inequality for convex bodies with an unconditional basis.
Probab. Theory Related Fields ,145(1-2):1–33, 2009.[35] Bo’az Klartag. Super-Gaussian directions of random vectors. In
Geometric aspects of functional analysis , volume 2169 of
LectureNotes in Math. , pages 187–211. Springer, Cham, 2017.[36] V. Koltchinskii.
Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems . Springer, Berlin, 2011.[37] Guillaume Lecu´e and Matthieu Lerasle. Robust machine learning by median-of-means: theory and practice.
Ann. Statist. , 48(2):906–931, 2020.[38] Michel Ledoux and Michel Talagrand.
Probability in Banach spaces . Classics in Mathematics. Springer-Verlag, Berlin, 2011.Isoperimetry and processes, Reprint of the 1991 edition.[39] Zhixian Lei, Kyle Luh, Prayaag Venkat, and Fred Zhang. A fast spectral algorithm for mean estimation with sub-gaussian rates. In
Conference on Learning Theory , pages 2598–2612. PMLR, 2020.[40] O. V. Lepski˘ı. A problem of adaptive estimation in Gaussian white noise.
Teor. Veroyatnost. i Primenen. , 35(3):459–470, 1990.[41] O. V. Lepski˘ı. Asymptotically minimax adaptive estimation. I. Upper bounds. Optimally adaptive estimates.
Teor. Veroyatnost. iPrimenen. , 36(4):645–659, 1991.[42] M. Lerasle and R. Oliveira. Robust empirical mean estimators. Technical report, IMPA and CNRS, 2011.[43] Matthieu Lerasle, Zolt´an Szab´o, Timoth´ee Mathieu, and Guillaume Lecu´e. Monk outlier-robust mean embedding estimation bymedian-of-means. In
International Conference on Machine Learning , pages 3782–3793. PMLR, 2019.[44] Regina Y. Liu. On a notion of data depth based on random simplices.
Ann. Statist., 18(1):405–414, 1990.
[45] Regina Y. Liu. Data depth and multivariate rank tests. In L₁-statistical analysis and related methods (Neuchâtel, 1992), pages 279–294. North-Holland, Amsterdam, 1992.
[46] Regina Y. Liu and Kesar Singh. Ordering directional data: concepts of data depth on circles and spheres. Ann. Statist., 20(3):1468–1484, 1992.
[47] Karim Lounici. High-dimensional covariance matrix estimation with missing observations.
Bernoulli , 20(3):1029–1058, 2014.[48] Junwei Lu, Fang Han, and Han Liu. Robust scatter matrix estimation for high dimensional distributions with heavy tail.
IEEEtransactions on information theory , 2020.[49] G´abor Lugosi and Shahar Mendelson. Mean estimation and regression under heavy-tailed distributions: a survey.
Found. Comput.Math. , 19(5):1145–1190, 2019.[50] G´abor Lugosi and Shahar Mendelson. Near-optimal mean estimators with respect to general norms.
Probab. Theory Related Fields ,175(3-4):957–973, 2019.[51] G´abor Lugosi, Shahar Mendelson, et al. Sub-gaussian estimators of the mean of a random vector.
The Annals of Statistics ,47(2):783–794, 2019.[52] Z. Szabo M. Lerasle, T. Matthieu and G. Lecu´e. Monk – outliers-robust mean embedding estimation by median-of-means. Technicalreport, CNRS, University of Paris 11, Ecole Polytechnique and CREST, 2017.[53] Ricardo A. Maronna, R. Douglas Martin, and Victor J. Yohai.
Robust statistics . Wiley Series in Probability and Statistics. JohnWiley & Sons, Ltd., Chichester, 2006. Theory and methods.[54] Ricardo A Maronna and Victor J Yohai. The behavior of the stahel-donoho robust multivariate estimator.
Journal of the AmericanStatistical Association , 90(429):330–341, 1995.[55] S Minsker and N. Strawn. Distributed statistical estimation and rates of convergence in normal approximation. Technical report,arXiv: 1704.02658, 2017.[56] Stanislav Minsker. Geometric median and robust estimation in banach spaces.
Bernoulli , 21(4):2308–2335, 2015.[57] Stanislav Minsker. Uniform bounds for robust mean estimators. arXiv preprint arXiv:1812.03523 , 2018.[58] Stanislav Nagy, Carsten Sch¨utt, Elisabeth M Werner, et al. Halfspace depth and floating body.
Statistics Surveys, 13:52–118, 2019.
[59] A. S. Nemirovsky and D. B. Yudin.
Problem complexity and method efficiency in optimization . A Wiley-Interscience Publication.John Wiley & Sons, Inc., New York, 1983. Translated from the Russian and with a preface by E. R. Dawson, Wiley-InterscienceSeries in Discrete Mathematics.[60] Daniel Pe˜na and Francisco J Prieto. Combining random and specific directions for outlier detection and robust estimation inhigh-dimensional multivariate data.
Journal of Computational and Graphical Statistics , 16(1):228–254, 2007.[61] Valentin V. Petrov.
Limit theorems of probability theory , volume 4 of
Oxford Studies in Probability . The Clarendon Press, OxfordUniversity Press, New York, 1995. Sequences of independent random variables, Oxford Science Publications.[62] Peter J Rousseeuw, Jakob Raymaekers, and Mia Hubert. A measure of directional outlyingness with applications to image data andvideo.
Journal of Computational and Graphical Statistics , 27(2):345–359, 2018.[63] Werner A Stahel.
Robuste sch¨atzungen: infinitesimale optimalit¨at und sch¨atzungen von kovarianzmatrizen . PhD thesis, ETH Zurich,1981.[64] John W. Tukey. Mathematics and the picturing of data. In
Proceedings of the International Congress of Mathematicians (Vancouver,B. C., 1974), Vol. 2 , pages 523–531, 1975.[65] David E. Tyler. Finite sample breakdown points of projection based multivariate location and scatter statistics.
Ann. Statist. ,22(2):1024–1044, 1994.
[66] Stefan Van Aelst. Stahel–Donoho estimation for high-dimensional data.
International Journal of Computer Mathematics , 93(4):628–639, 2016.[67] Stefan Van Aelst, E Vandervieren, and Gert Willems. Stahel-donoho estimators with cellwise weights.
Journal of StatisticalComputation and Simulation , 81(1):1–27, 2011.[68] Aad W. van der Vaart and Jon A. Wellner.
Weak convergence and empirical processes . Springer Series in Statistics. Springer-Verlag,New York, 1996. With applications to statistics.[69] V. N. Vapnik and A. Ya. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. In
Measures of complexity, pages 11–30. Springer, Cham, 2015. Reprint of Theor. Probability Appl.
[70] V. N. Vapnik. The nature of statistical learning theory. Statistics for Engineering and Information Science. Springer-Verlag, New York, second edition, 2000.
[71] Yijun Zuo, Hengjian Cui, and Xuming He. On the Stahel-Donoho estimator and depth-weighted means of multivariate data. Ann. Statist., 32(1):167–188, 2004.