Optimal robust mean and location estimation via convex programs with respect to any pseudo-norms
Jules Depersin and Guillaume Lecué, email: [email protected], email: [email protected], ENSAE, IP Paris, 5 avenue Henry Le Chatelier, 91120 Palaiseau, France. February 2, 2021
Abstract
We consider the problem of robust mean and location estimation w.r.t. any pseudo-norm of the form $x \in \mathbb{R}^d \mapsto \|x\|_S = \sup_{v \in S}\langle v, x\rangle$ where $S$ is any symmetric subset of $\mathbb{R}^d$. We show that the deviation-optimal minimax subgaussian rate for confidence $1-\delta$ is
$$\max\left(\frac{\ell^*(\Sigma^{1/2} S)}{\sqrt{N}},\ \sup_{v \in S}\big\|\Sigma^{1/2} v\big\|_2 \sqrt{\frac{\log(1/\delta)}{N}}\right)$$
where $\ell^*(\Sigma^{1/2} S)$ is the Gaussian mean width of $\Sigma^{1/2} S$ and $\Sigma$ is the covariance of the data (in the benchmark i.i.d. Gaussian case). This improves the entropic minimax lower bound from [30] and closes the gap, characterized by Sudakov's inequality, between the entropy and the Gaussian mean width for this problem. It shows that the right statistical complexity measure for the mean estimation problem is the Gaussian mean width. We also show that this rate can be achieved by a solution to a convex optimization problem in the adversarial and $L_2$ heavy-tailed setup, by considering minima of some Fenchel-Legendre transforms constructed using the median-of-means principle. We finally show that this rate may also be achieved in situations where there is not even a first moment but a location parameter exists.

We consider the problem of robust (to adversarial corruption and heavy-tailed data) multivariate mean and location estimation with respect to any pseudo-norm $x \in \mathbb{R}^d \mapsto \|x\|_S = \sup_{v \in S}\langle v, x\rangle$ where $S$ is any symmetric subset of $\mathbb{R}^d$ (i.e. if $x \in S$ then $-x \in S$). This problem has been extensively studied during the last decade for $S = B_2^d$, the unit Euclidean ball [34, 8, 15, 7, 11, 31, 6, 13, 28, 14, 9, 10, 26]. Only little is known for general symmetric sets $S$, and we will mainly refer to [30], where this problem has been handled for $S$ equal to the unit dual ball $B^\circ$ of a norm $\|\cdot\|$ (so that $\|\cdot\|_S = \|\cdot\|$).

In [30], the authors introduced the problem of estimating a mean vector w.r.t. any norm robustly to heavy-tailed data. The problem can be stated as follows: given $N$ i.i.d. random vectors $X_1, \ldots, X_N$ in $\mathbb{R}^d$ with mean $\mu^*$ and covariance matrix $\Sigma$, a norm $\|\cdot\|$ on $\mathbb{R}^d$ and a confidence parameter $\delta \in (0,1)$, find an estimator $\tilde\mu_N(\delta)$ and the best possible accuracy $r^*(N,\delta)$ such that, with probability at least $1-\delta$, $\|\tilde\mu_N(\delta) - \mu^*\| \leq r^*(N,\delta)$. In [30], the authors use the median-of-means principle [35, 17, 1] to construct an estimator satisfying the following result.
Theorem 1. [Theorem 2 in [30]] There exists an absolute constant $c$ such that the following holds. Given a norm $\|\cdot\|$ on $\mathbb{R}^d$ and a confidence $\delta \in (0,1)$, one can construct $\tilde\mu_N(\delta)$ such that, with probability at least $1-\delta$,
$$\|\tilde\mu_N(\delta) - \mu^*\| \leq \frac{c}{\sqrt{N}}\left(\mathbb{E}\left\|\frac{1}{\sqrt{N}}\sum_{i=1}^N \epsilon_i (X_i - \mu^*)\right\| + \mathbb{E}\big\|\Sigma^{1/2} G\big\| + \sup_{v \in B^\circ}\big\|\Sigma^{1/2} v\big\|_2 \sqrt{\log(1/\delta)}\right)$$
where $B^\circ$ is the unit dual ball associated with $\|\cdot\|$, $(\epsilon_i)_i$ are i.i.d. Rademacher variables independent of the $X_i$'s and $G \sim \mathcal{N}(0, I_d)$.

The construction of $\tilde\mu_N(\delta)$ is rather involved and it seems hard to design an algorithm out of this procedure. In particular, $\tilde\mu_N(\delta)$ has not been proved to be the solution to a convex optimization problem. Theorem 1's main interest is thus theoretical, while robust multivariate mean estimation can also be interesting from a practical point of view [12].

The rate obtained in Theorem 1 can be decomposed into two terms: a deviation term $\sup_{v \in B^\circ}\|\Sigma^{1/2} v\|_2 \sqrt{\log(1/\delta)}$, where $\sup_{v \in B^\circ}\|\Sigma^{1/2} v\|_2$ is a weak variance term, and a complexity term which is the sum of a Rademacher complexity $\mathbb{E}\|N^{-1/2}\sum_{i=1}^N \epsilon_i(X_i - \mu^*)\|$ and a Gaussian mean width $\mathbb{E}\|\Sigma^{1/2} G\|$. The intuition behind this rate is explained in [30], in particular in Question 1 there. We will however show that this rate is not the right one and that the Gaussian mean width term is actually not necessary.
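To make the two kinds of terms concrete, the following small simulation (our own illustration, not part of [30]) estimates, in the Euclidean case $S = B_2^d$, the Gaussian mean width term $\mathbb{E}\|\Sigma^{1/2}G\|_2 \sim \sqrt{\mathrm{Tr}(\Sigma)}$ and the weak variance term $\sup_{v \in B_2^d}\|\Sigma^{1/2}v\|_2 = \sqrt{\|\Sigma\|_{\mathrm{op}}}$, for a diagonal covariance with spectrum $\lambda_j = 1/j$ (a spectrum that will reappear below): the width grows like $\sqrt{\log d}$ while the weak variance stays bounded.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 200
lam = 1.0 / np.arange(1, d + 1)       # spectrum lambda_j = 1/j of Sigma
Sigma_half = np.diag(np.sqrt(lam))    # Sigma^{1/2} for this diagonal Sigma

# Gaussian mean width term: E || Sigma^{1/2} G ||_2 with G ~ N(0, I_d),
# estimated by Monte Carlo over 20000 standard Gaussian vectors.
G = rng.standard_normal((20_000, d))
width = np.linalg.norm(G @ Sigma_half, axis=1).mean()

# Weak variance term: sup_{v in B_2^d} || Sigma^{1/2} v ||_2 = sqrt(||Sigma||_op).
weak_var = np.sqrt(lam.max())

# width is close to sqrt(Tr Sigma) = sqrt(sum 1/j) ~ sqrt(log d),
# while weak_var equals 1 for every d.
print(width, np.sqrt(lam.sum()), weak_var)
```

The two printed quantities on the left are close (the Gaussian mean width of an ellipsoid is of order $\sqrt{\mathrm{Tr}(\Sigma)}$), while the weak variance is dimension-free here.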
Moreover, we will show that the improved rate can be achieved by an estimator that is the solution to a convex optimization problem in Section 3, that this holds even in the adversarial corruption model (see Assumption 1 in Section 3 below for a formal definition), and even in some situations where there is not even a first moment; in that case, $\mu^*$ is a location parameter and $\Sigma$ a scatter parameter.

The question of the optimality of the rate in Theorem 1 was raised in [30]. The classical approach to answer this type of question is to consider the Gaussian case, that is, when the data $X_i, i \in [N]$, are i.i.d. $\mathcal{N}(\mu^*, \Sigma)$. This is also the strategy used in [30] to obtain the following deviation-minimax lower bound result (the result from [30] is proved for $\Sigma = I_d$; it is however straightforward to extend it to the general case).

Theorem 2. [Theorem 3 and first paragraph on p.962 in [30]] There exists an absolute constant $c > 0$ such that the following holds. If $\hat\mu : \mathbb{R}^{Nd} \to \mathbb{R}^d$ is an estimator such that for all $\mu^* \in \mathbb{R}^d$ and all $\delta \in (0, 1/2)$, $P^N_{\mu^*}[\|\hat\mu - \mu^*\| \leq r^*] \geq 1-\delta$, where $P^N_{\mu^*}$ is the probability distribution of $(X_i)_{i \in [N]}$ when the $X_i$ are i.i.d. $\mathcal{N}(\mu^*, \Sigma)$, then
$$r^* \geq \frac{c}{\sqrt{N}}\left(\sup_{\eta > 0}\, \eta\sqrt{\log N(\Sigma^{1/2} B^\circ, \eta B_2^d)} + \sup_{v \in B^\circ}\big\|\Sigma^{1/2} v\big\|_2 \sqrt{\log(1/\delta)}\right)$$
where $N(\Sigma^{1/2} B^\circ, \eta B_2^d)$ is the minimal number of translates of $\eta B_2^d$ needed to cover $\Sigma^{1/2} B^\circ$.

The term $\sup_{v \in S}\|\Sigma^{1/2} v\|_2 \sqrt{\log(1/\delta)}$ in the lower bound from Theorem 2 is obtained in [30] from Proposition 6.1 in [6], a deviation-minimax lower bound holding in the one-dimensional case, which relies on the fact that the empirical mean is a sufficient statistic in the Gaussian shift model. (The argument used in [30] goes from the one-dimensional case studied in [6] to the $d$-dimensional case. It is given in a non-formal way and may require some extra argument to hold. Indeed, the estimator $x^*(\hat\Psi_N)$ in [30] is constructed using the $d$-dimensional data $X_1, \ldots, X_N$ and not one-dimensional data such as $x^*(X_1), \ldots, x^*(X_N)$. However, the result from [6] holds for estimators of a one-dimensional mean using one-dimensional data and not $d$-dimensional data. Nevertheless, Olivier Catoni showed us how to adapt the proof of Proposition 6.1 in [6], by using the sufficiency of the empirical mean in the Gaussian shift model in $\mathbb{R}^d$, to get this deviation-dependent lower bound term.) The complexity term $\sup_{\eta > 0}\eta\sqrt{\log N(\Sigma^{1/2} B^\circ, \eta B_2^d)}$ obtained in Theorem 2 follows from the duality theorem of metric entropy from [2] and a volumetric argument in the Gauss space similar to the one used
to prove the dual Sudakov inequality on p.82-83 in [25], which has also been used to obtain minimax lower bounds based on the entropy in [22] and [32].

In general, there is a gap between the upper bound from Theorem 1 and the lower bound from Theorem 2, even in the Gaussian case. This gap is characterized by Sudakov's inequality (see Theorem 3.18 in [25] or Theorem 5.6 in [36]):
$$\sup_{\eta > 0}\, \eta\sqrt{\log N(\Sigma^{1/2} B^\circ, \eta B_2^d)} \leq c\, \mathbb{E}\big\|\Sigma^{1/2} G\big\| \quad (1)$$
where $G \sim \mathcal{N}(0, I_d)$. Indeed, in the Gaussian case the complexity term of the rate obtained in Theorem 1 is the Gaussian mean width, that is, the right-hand side of (1), whereas the complexity term from Theorem 2 is the entropy, that is, the left-hand side of (1).

As mentioned in Remark 3 from [30], when Sudakov's inequality (1) is sharp, the upper and lower bounds from Theorems 1 and 2 match in the Gaussian case (in that case the Rademacher complexity is equal to the Gaussian mean width in Theorem 1). Sharpness in Sudakov's inequality is however not a typical situation. In particular, for ellipsoids, Sudakov's bound (1) is not sharp in general, and therefore the lower bound from Theorem 2 fails to recover the classical subgaussian rate for the standard Euclidean norm case (that is, for $S = B_2^d$), which is given in [31] by
$$\sqrt{\frac{\mathrm{Tr}(\Sigma)}{N}} + \sqrt{\frac{\|\Sigma\|_{\mathrm{op}} \log(1/\delta)}{N}}. \quad (2)$$
Indeed, when $\|\cdot\|$ is the $\ell_2^d$ Euclidean norm, then $\mathbb{E}\|\Sigma^{1/2} G\| = \mathbb{E}\|\Sigma^{1/2} G\|_2 \sim \sqrt{\mathrm{Tr}(\Sigma)}$ (see, for instance, Proposition 2.5.1 in [37]). Whereas, for the entropy of $\Sigma^{1/2} B^\circ = \Sigma^{1/2} B_2^d$ w.r.t. $\eta B_2^d$, it follows from equation (5.45) in [36] that
$$\sup_{\eta > 0}\, \eta\sqrt{\log N(\Sigma^{1/2} B_2^d, \eta B_2^d)} = \sup_{n \geq 0}\, e_{n+1}(\Sigma^{1/2})\sqrt{n+1} \sim \sup_{n \geq 0,\, k \in [d]} \sqrt{n}\, 2^{-n/k}\Big(\prod_{j=1}^k \sqrt{\lambda_j}\Big)^{1/k} \sim \sqrt{\sup_{k \in [d]}\, k\Big(\prod_{j=1}^k \lambda_j\Big)^{1/k}} \quad (3)$$
where $(e_{n+1}(\Sigma^{1/2}))_n$ are the entropy numbers of $\Sigma^{1/2} : \ell_2^d \to \ell_2^d$ (see page 62 in [36] for a definition) and $\lambda_1 \geq \ldots \geq \lambda_d$ are the singular values of $\Sigma$. In particular, when $\lambda_j = 1/j$, the entropy bound (3) is of the order of a constant whereas the Gaussian mean width is of the order of $\sqrt{\log d}$. We will fill this gap in Section 2 by showing a lower bound where the entropy is replaced by the (larger) Gaussian mean width. We will therefore obtain matching upper and lower bounds, revealing that the Gaussian mean width is the right way to measure the statistical complexity for the mean estimation problem w.r.t. any $\|\cdot\|_S$.

The paper is organized as follows. In the next section, we obtain the deviation-minimax optimal rate in the i.i.d. Gaussian case. In Section 3, we show that the rate from Theorem 1 can be improved and that it can be achieved by a solution to a convex program in the adversarial contamination model, under weak or no moment assumptions. All the proofs have been gathered in Section 4.

Deviation-minimax optimal rates w.r.t. $\|\cdot\|_S$

In this section, we obtain the optimal deviation-minimax rates of estimation of a mean vector $\mu^*$ when we are given $N$ i.i.d. vectors $X_1, \ldots, X_N$ distributed like $\mathcal{N}(\mu^*, \Sigma)$, where $\Sigma \succeq 0$. ($P^N_{\mu^*}$ denotes the probability distribution of $(X_1, \ldots, X_N)$; it is a Gaussian measure on $\mathbb{R}^{Nd}$ with mean $((\mu^*)^\top, \ldots, (\mu^*)^\top)^\top$ and a block $(Nd) \times (Nd)$ covariance matrix whose $d \times d$ diagonal blocks are given by $\Sigma$ repeated $N$ times, with 0 outside of these blocks.)

Unlike classical minimax results holding in expectation or with constant probability (see Chapter 2 in [38]), we want, in this section, the deviation parameter $\delta$ to appear explicitly in the minimax lower bound. Moreover, this dependency of the convergence rate with respect to $\delta$ should be of the right order, given by the subgaussian $\sqrt{\log(1/\delta)}$ rate, and not another polynomial dependency such as $\sqrt{1/\delta}$, as one gets for the empirical mean for $L_2$ variables (see Proposition 6.2 in [6]). This subtle behavior of the rate in terms of $\delta$ cannot be seen in expectation or constant-deviation minimax lower bounds. In particular, this makes such results (like Theorems 3 and 4 below) unachievable via classical information-theoretic arguments as in Chapter 2 in [38]. Fortunately, in [22], a minimax lower bound has been proved thanks to the Gaussian shift theorem, which makes the deviation parameter $\delta$ appear explicitly in the minimax lower bound. We use the same strategy here to prove our main result, Theorem 3 below, and its corollary, Theorem 4, in the classical Euclidean case $S = B_2^d$.

We consider the general problem of estimating $\mu^*$ w.r.t. $\|\cdot\|_S$. Let $S \subset \mathbb{R}^d$ be a symmetric set. We first obtain an upper bound result revealing the subgaussian rate. We use the empirical mean $\bar{X}_N = N^{-1}\sum_i X_i$ as an estimator of $\mu^*$. Using the Borell-TIS inequality (Theorem 7.1 in [24] or pages 56-57 in [37]) we get: for all $0 < \delta < 1$, with probability at least $1-\delta$,
$$\big\|\bar{X}_N - \mu^*\big\|_S = \sup_{v \in S}\langle v, \bar{X}_N - \mu^*\rangle \leq \mathbb{E}\sup_{v \in S}\langle v, \bar{X}_N - \mu^*\rangle + \sigma_S\sqrt{2\log(1/\delta)}$$
where $\sigma_S = \sup_{v \in S}\sqrt{\mathbb{E}\langle v, \bar{X}_N - \mu^*\rangle^2}$ is called the weak variance. It follows that with probability at least $1-\delta$,
$$\big\|\bar{X}_N - \mu^*\big\|_S \leq \frac{\ell^*(\Sigma^{1/2} S)}{\sqrt{N}} + \sup_{v \in S}\big\|\Sigma^{1/2} v\big\|_2 \frac{\sqrt{2\log(1/\delta)}}{\sqrt{N}} \quad (4)$$
where $\ell^*(\Sigma^{1/2} S) = \mathbb{E}\sup\big(\langle G, x\rangle : x \in \Sigma^{1/2} S\big) = \mathbb{E}\|\Sigma^{1/2} G\|_S$, for $G \sim \mathcal{N}(0, I_d)$, is the Gaussian mean width of the set $\Sigma^{1/2} S$. In particular, in the case where $S = B_2^d$, we recover the subgaussian rate (2) in (4). Our aim is now to show that the rate in (4) is deviation-minimax optimal. This is what is obtained in the next result.

Theorem 3.
Let $S$ be a symmetric subset of $\mathbb{R}^d$ such that $\mathrm{span}(S) = \mathbb{R}^d$. If $\hat\mu : \mathbb{R}^{Nd} \to \mathbb{R}^d$ is an estimator such that for all $\mu^* \in \mathbb{R}^d$ and all $\delta \in (0, 1/2)$, $P^N_{\mu^*}[\|\hat\mu - \mu^*\|_S \leq r^*] \geq 1-\delta$, then
$$r^* \geq \max\left(\sqrt{\frac{\log 2}{2\log 5}}\,\frac{\ell^*(\Sigma^{1/2} S)}{\sqrt{N}},\ \sup_{v \in S}\big\|\Sigma^{1/2} v\big\|_2 \sqrt{\frac{\log(1/\delta)}{N}}\right).$$

It follows from the upper bound (4) and the deviation-minimax lower bound from Theorem 3 that it is now possible to know exactly (up to absolute constants) the subgaussian rate for the problem of mean estimation in $\mathbb{R}^d$ w.r.t. $\|\cdot\|_S$; it is given by
$$\max\left(\frac{\ell^*(\Sigma^{1/2} S)}{\sqrt{N}},\ \sup_{v \in S}\big\|\Sigma^{1/2} v\big\|_2 \frac{\sqrt{\log(1/\delta)}}{\sqrt{N}}\right). \quad (5)$$
We may identify the complexity and deviation terms in this rate. In particular, the complexity term is measured here via the Gaussian mean width of the set $\Sigma^{1/2} S$ and not via its entropy, as was previously known following Theorem 2. Theorem 3 together with (4) shows that the right way to measure the statistical complexity in the problem of mean estimation in $\mathbb{R}^d$ w.r.t. any $\|\cdot\|_S$ is via the Gaussian mean width. This differs from other statistical problems, such as the regression model with random design, where the entropy has been proved to be the right statistical complexity in several examples [32, 22]. In light of the latter results in the regression model, Theorem 3 is a bit unexpected, because one might have thought that by taking an ERM over an epsilon-net of $\mathbb{R}^d$, for the right choice of $\epsilon$, one could obtain a better rate than the one driven by the Gaussian mean width in (5); indeed, for this type of procedure, one might expect a rate depending on the (smaller) entropy instead of the (larger) Gaussian mean width. Theorem 3 shows that this is not the case: even discretized ERMs cannot achieve a better rate than the one driven by the Gaussian mean width in the mean estimation problem.

An important consequence of Theorem 3 is obtained when $S = B_2^d$, that is, for the problem of multivariate mean estimation w.r.t. the $\ell_2^d$-norm, which is the problem that has been extensively considered during the last decade. In the following result, we recover the well-known subgaussian rate (2), showing that all the upper bound results where this rate has been proved to be achieved are actually deviation-minimax optimal and therefore could not have been improved uniformly over all $\mu^* \in \mathbb{R}^d$.

Theorem 4. If $\hat\mu : \mathbb{R}^{Nd} \to \mathbb{R}^d$ is an estimator such that $P^N_{\mu^*}[\|\hat\mu - \mu^*\|_2 \leq r^*] \geq 1-\delta$ for all $\mu^* \in \mathbb{R}^d$ and all $\delta \in (0, 1/2)$, then
$$r^* \geq \max\left(\sqrt{\frac{\log 2}{2\log 5}}\,\sqrt{\frac{\mathrm{Tr}(\Sigma)}{N}},\ \sqrt{\frac{\|\Sigma\|_{\mathrm{op}} \log(1/\delta)}{N}}\right).$$

Given that the empirical mean $\bar{X}_N$ is such that, for all $\mu \in \mathbb{R}^d$, with $P^N_\mu$-probability at least $1-\delta$,
$$\big\|\bar{X}_N - \mu\big\|_2 \leq \sqrt{\frac{\mathrm{Tr}(\Sigma)}{N}} + \sqrt{\frac{\|\Sigma\|_{\mathrm{op}} \log(1/\delta)}{N}},$$
we conclude from Theorem 4 that the subgaussian rate (2) is the deviation-minimax rate of convergence for the multivariate mean estimation problem w.r.t. $\ell_2^d$ and that it is achieved by the empirical mean. In particular, there is no statistical procedure that can do better than the empirical mean, up to constants, uniformly over all mean vectors $\mu^* \in \mathbb{R}^d$; this includes, in particular, all discretized versions of $\bar{X}_N$.

In this section, we introduce statistical procedures which are solutions to convex programs and which can achieve the rate from Theorem 1 without the unnecessary Gaussian mean width term $\mathbb{E}\|\Sigma^{1/2} G\|$. We also show that these procedures handle adversarial corruption and may still perform optimally in some situations where there is not even a first moment.

Definition 1.
Let $S$ be a subset of $\mathbb{R}^d$ and $f : \mathbb{R}^d \to \mathbb{R}$. The Fenchel-Legendre transform of $f$ on $S$ is the function $f^*_S$ defined for all $\mu \in \mathbb{R}^d$ by
$$f^*_S(\mu) = \sup_{v \in S}\big(\langle \mu, v\rangle - f(v)\big).$$

For our purpose, the main property of a Fenchel-Legendre transform we will use is that it is a convex function, being the supremum of the family $(\mu \in \mathbb{R}^d \mapsto \langle \mu, v\rangle - f(v) : v \in S)$ of affine functions.

We now define two examples of functions such that taking the minimum of their Fenchel-Legendre transform over $S$ will lead to optimal estimators of $\mu^*$ w.r.t. $\|\cdot\|_S$. The construction of these two functions is based on the median-of-means principle: the dataset $\{X_1, \ldots, X_N\}$ is split into $K$ equal-size blocks of data indexed by $(B_k)_k$, forming an equipartition of $[N]$. On each block, an empirical mean is constructed: $\bar{X}_k = |B_k|^{-1}\sum_{i \in B_k} X_i$. The two functions we are considering use the $K$ bucketed means $(\bar{X}_k)_k$ and are defined, for all $v \in \mathbb{R}^d$, by
$$f(v) = \frac{1}{|I_K|}\sum_{k \in I_K}\langle \bar{X}_k, v\rangle^*_{(k)} \quad \text{and} \quad g(v) = \mathrm{Med}\big(\langle \bar{X}_k, v\rangle\big) = \langle \bar{X}_k, v\rangle^*_{\left(\frac{K+1}{2}\right)} \quad (6)$$
where, if $a_k = \langle \bar{X}_k, v\rangle$ for $k \in [K]$, then $\langle \bar{X}_k, v\rangle^*_{(k)}, k \in [K]$, is the non-decreasing rearrangement of $(a_k)_k$, i.e. $a^*_{(1)} \leq \ldots \leq a^*_{(K)}$ (this is the rearrangement of the values $a_k$ themselves and not of their absolute values), and
$$I_K = \left[\frac{K+1}{4}, \frac{3(K+1)}{4}\right] = \left\{\frac{K+1}{2} \pm k : k = 0, 1, \cdots, \frac{K+1}{4}\right\}$$
is the inter-quartile interval – w.l.o.g. we assume that $K+1$ can be divided by 4. In other words, $f(v)$ is the average over all inter-quartile values of the vector $(\langle \bar{X}_k, v\rangle)_{k \in [K]}$ and $g(v)$ is the median of this vector. Note that both functions $f$ and $g$ are homogeneous, i.e. $f(\theta v) = \theta f(v)$ and $g(\theta v) = \theta g(v)$ for every $v \in \mathbb{R}^d$ and $\theta \in \mathbb{R}$, and in particular they are odd functions; these are two facts we will use later.

We now consider the Fenchel-Legendre transforms of the functions $f$ and $g$ over a symmetric set $S$:
$$f^*_S : \mu \in \mathbb{R}^d \mapsto \sup_{v \in S}\big(\langle \mu, v\rangle - f(v)\big) \quad \text{and} \quad g^*_S : \mu \in \mathbb{R}^d \mapsto \sup_{v \in S}\big(\langle \mu, v\rangle - g(v)\big). \quad (7)$$
As mentioned previously, the two functions $f^*_S$ and $g^*_S$ are convex. We now use them to define convex programs whose solutions will be proved to be robust and subgaussian estimators of the mean / location vector $\mu^*$ w.r.t. $\|\cdot\|_S$:
$$\hat\mu^f_S \in \underset{\mu \in \mathbb{R}^d}{\mathrm{argmin}}\, f^*_S(\mu) \quad \text{and} \quad \hat\mu^g_S \in \underset{\mu \in \mathbb{R}^d}{\mathrm{argmin}}\, g^*_S(\mu). \quad (8)$$

For some special choices of $S$, the Fenchel-Legendre minimization estimator $\hat\mu^g_S$ coincides with classical procedures. This is for instance the case when $S = B_1^d$ (the unit ball of the $\ell_1^d$-norm) or $S = B_2^d$. Indeed, when $S = B_1^d$, $\hat\mu^g_S$ is the coordinate-wise median-of-means:
$$\hat\mu^g_S = \underset{\mu = (\mu_j) \in \mathbb{R}^d}{\mathrm{argmin}}\, \max_{j \in [d]}\big|\mu_j - \mathrm{Med}\big(\langle \bar{X}_k, e_j\rangle\big)\big| = \big(\mathrm{Med}\big(\langle \bar{X}_k, e_j\rangle\big) : j \in [d]\big) \quad (9)$$
where $(e_j)_{j=1}^d$ is the canonical basis of $\mathbb{R}^d$; this holds because $\|\cdot\|_S = \|\cdot\|_{\mathrm{conv}(S)}$, where $\mathrm{conv}(S)$ is the convex hull of $S$, and so one may just take $S = \{\pm e_j : j \in [d]\}$. It is therefore possible to derive deviation-minimax optimal bounds for the coordinate-wise median-of-means w.r.t. the $\ell_\infty^d$-norm from general upper bounds on $\hat\mu^g_S$, since in that case $\|\cdot\|_S = \|\cdot\|_\infty$.

In the case $S = B_2^d$ (that is, for the mean/location estimation problem w.r.t. $\ell_2^d$), the Fenchel-Legendre minimum estimator $\hat\mu^g_S$ is a minmax MOM estimator [23]. This connection allows us to write $\hat\mu^g_S$ (as well as $\hat\mu^f_S$) as an unconstrained estimator; it also shows that this minmax MOM estimator is actually the solution to a convex optimization problem, and how minmax MOM estimators can be generalized to other estimation risks.

Minmax MOM estimators have been introduced as a systematic way to construct robust and subgaussian estimators in [23]. They have been proved to be deviation-minimax optimal for the mean estimation problem w.r.t. the $\ell_2^d$-norm in [28]. Their definition only requires a loss function; here we take, for all $\mu \in \mathbb{R}^d$, $\ell_\mu : x \in \mathbb{R}^d \mapsto \|x - \mu\|_2^2$, and the minmax MOM estimator is then defined as
$$\tilde\mu \in \underset{\mu \in \mathbb{R}^d}{\mathrm{argmin}}\, \sup_{\nu \in \mathbb{R}^d}\, \mathrm{Med}\big(P_{B_k}(\ell_\mu - \ell_\nu) : k \in [K]\big) \quad (10)$$
where $P_{B_k}$ is the empirical measure on the data in block $B_k$. The minmax MOM estimator $\tilde\mu$ was proved to achieve the subgaussian rate (2) with confidence $1-\delta$ when the number of blocks is $K \sim \log(1/\delta)$ and $K \gtrsim |\mathcal{O}|$ in [28].

Even though the minmax formulation of $\tilde\mu$ suggests a robust version of a descent/ascent gradient method over the median block (see [23, 28] for more details), no proof of convergence of this algorithm is known so far. Moreover, the main drawback of the minmax MOM estimator seems to be that it is the solution of a non-convex optimization problem and may therefore be rather difficult to compute in practice. In the next result, we show that this is not the case, since the minmax MOM estimator (10) is in fact equal to $\hat\mu^g_S$ for $S = B_2^d$ and is therefore the solution to a convex optimization problem.

Proposition 1.
The minmax MOM estimator $\tilde\mu$ defined in (10) satisfies $\tilde\mu \in \mathrm{argmin}_{\mu \in \mathbb{R}^d}\, g^*_{B_2^d}(\mu)$. The minmax MOM estimator is therefore the solution to a convex optimization problem.

Proof. We show that $\tilde\mu \in \mathrm{argmin}_{\mu \in \mathbb{R}^d}\sup_{\|v\|_2=1}\mathrm{Med}(\langle \bar{X}_k - \mu, v\rangle)$. We consider the quadratic/multiplier decomposition of the difference of loss functions: for all $\mu, \nu \in \mathbb{R}^d$ and $x \in \mathbb{R}^d$, we have
$$(\ell_\mu - \ell_\nu)(x) = \|x - \mu\|_2^2 - \|x - \nu\|_2^2 = -2\langle x - \mu, \mu - \nu\rangle - \|\mu - \nu\|_2^2.$$
Hence, for all $\mu \in \mathbb{R}^d$, writing $\nu = \mu + \theta v$ with $\theta \geq 0$ and $\|v\|_2 = 1$, we have
$$\sup_{\nu \in \mathbb{R}^d}\mathrm{Med}\big(P_{B_k}(\ell_\mu - \ell_\nu)\big) = \sup_{\nu \in \mathbb{R}^d}\Big(\mathrm{Med}\big(-2\langle \bar{X}_k - \mu, \mu - \nu\rangle\big) - \|\mu - \nu\|_2^2\Big) = \sup_{\|v\|_2=1}\,\sup_{\theta \geq 0}\big(2\theta\,\mathrm{Med}(\langle \bar{X}_k - \mu, v\rangle) - \theta^2\big) = \left(\sup_{\|v\|_2=1}\mathrm{Med}\big(\langle \bar{X}_k - \mu, v\rangle\big)\right)^2,$$
where the last equality uses that $\sup_{\|v\|_2=1}\mathrm{Med}(\langle \bar{X}_k - \mu, v\rangle) \geq 0$ by the symmetry $v \leftrightarrow -v$. We conclude since, the latter function being non-negative,
$$\underset{\mu \in \mathbb{R}^d}{\mathrm{argmin}}\left(\sup_{\|v\|_2=1}\mathrm{Med}\big(\langle \bar{X}_k - \mu, v\rangle\big)\right)^2 = \underset{\mu \in \mathbb{R}^d}{\mathrm{argmin}}\,\sup_{\|v\|_2=1}\mathrm{Med}\big(\langle \bar{X}_k - \mu, v\rangle\big).$$

It follows from Proposition 1 that the minmax MOM estimator $\tilde\mu$ is the solution to a convex optimization problem. This fact is far from obvious given the definition of $\tilde\mu$ in (10).

Proposition 1 also suggests a new formulation for $\hat\mu^g_S$ and $\hat\mu^f_S$. It is indeed possible to write these estimators as regularized estimators instead of their original constrained formulation (note that the Fenchel-Legendre transforms in (7) are suprema over $S$ and therefore yield constrained optimization problems). We now show that we may write them as suprema over all of $\mathbb{R}^d$ if we add an ad hoc regularization function. Let us introduce the two following functions, which may be seen as regularized versions of the functions $f$ and $g$ from (6): for all $\nu \in \mathbb{R}^d$,
$$F_S(\nu) = f(\nu) + \|\nu\|_S^2 \quad \text{and} \quad G_S(\nu) = g(\nu) + \|\nu\|_S^2. \quad (11)$$
We also consider their Fenchel-Legendre transforms over the entire set $\mathbb{R}^d$: for all $\mu \in \mathbb{R}^d$,
$$F^*_S(\mu) = \sup_{\nu \in \mathbb{R}^d}\big(\langle \mu, \nu\rangle - F_S(\nu)\big) \quad \text{and} \quad G^*_S(\mu) = \sup_{\nu \in \mathbb{R}^d}\big(\langle \mu, \nu\rangle - G_S(\nu)\big).$$
The next result shows that these two Fenchel-Legendre transforms can be used to define the two estimators $\hat\mu^f_S$ and $\hat\mu^g_S$. The proof of Proposition 2 is similar to that of Proposition 1, with the $\ell_2$-norm replaced by $\|\cdot\|_S$, and is therefore omitted.

Proposition 2.
Let $S$ be a symmetric subset of $\mathbb{R}^d$ such that $\mathrm{span}(S) = \mathbb{R}^d$. We have
$$\hat\mu^f_S \in \underset{\mu \in \mathbb{R}^d}{\mathrm{argmin}}\, F^*_S(\mu) \quad \text{and} \quad \hat\mu^g_S \in \underset{\mu \in \mathbb{R}^d}{\mathrm{argmin}}\, G^*_S(\mu).$$

As a consequence of Proposition 2, one can write the two estimators $\hat\mu^f_S$ and $\hat\mu^g_S$ as solutions to unconstrained minmax optimization problems, like the minmax MOM estimator (10), and in particular one may design an alternating ascent/descent sub-gradient algorithm similar to the one from [23]. We expect the algorithm associated with $\hat\mu^f_S$, which uses half of the dataset at each iteration, to be more efficient than the one associated with $\hat\mu^g_S$, which uses only the $N/K$ data points in the median block at each iteration. That is the reason why we provide in Figure 1 this algorithm only for
$$\hat\mu^f_S \in \underset{\mu \in \mathbb{R}^d}{\mathrm{argmin}}\,\sup_{\nu \in \mathbb{R}^d}\left(\langle \mu, \nu\rangle - \frac{1}{|I_K|}\sum_{k \in I_K}\langle \bar{X}_k, \nu\rangle^*_{(k)} - \|\nu\|_S^2\right).$$

Input: the data $X_1, \ldots, X_N$, a number $K$ of blocks, two step-size sequences $(\eta_t)_t, (\theta_t)_t$ and $\epsilon > 0$.
Output: a robust estimator of the mean $\mu^*$.
1. Construct an equipartition $B_1 \sqcup \cdots \sqcup B_K = \{1, \cdots, N\}$ at random.
2. Construct the $K$ empirical means $\bar{X}_k = (K/N)\sum_{i \in B_k} X_i$, $k \in [K]$.
3. Compute $\tilde\mu^{(0)}$, the coordinate-wise median-of-means, and put $\mu^{(0)} = \tilde\mu^{(0)}$ and $\nu^{(0)} = \tilde\mu^{(0)}$.
4. While $\|\mu^{(t)} - \mu^{(t+1)}\|_S \geq \epsilon$ do:
   (a) Construct an equipartition $B_1 \sqcup \cdots \sqcup B_K = \{1, \cdots, N\}$ at random.
   (b) Construct the $K$ empirical means $\bar{X}_k = (K/N)\sum_{i \in B_k} X_i$, $k \in [K]$.
   (c) Find the inter-quartile block numbers $k_1, \ldots, k_{(K+1)/2} \in [K]$ such that $f(\nu^{(t)}) = \frac{1}{|I_K|}\sum_{j=1}^{(K+1)/2}\langle \bar{X}_{k_j}, \nu^{(t)}\rangle$.
   (d) Construct $g^{(t)}$, a subgradient of $\|\cdot\|_S$ at $\nu^{(t)}$, and the ascent direction
$$\nabla^{(t+1)}_\nu = \mu^{(t)} - \frac{1}{|I_K|}\sum_{j=1}^{(K+1)/2}\bar{X}_{k_j} - 2\big\|\nu^{(t)}\big\|_S\, g^{(t)}.$$
   (e) Update $\nu^{(t+1)} \leftarrow \nu^{(t)} + \eta_t \nabla^{(t+1)}_\nu$.
   (f) Make one descent step: $\mu^{(t+1)} \leftarrow \mu^{(t)} - \theta_t \nu^{(t+1)}$.
5. Return $\mu^{(t+1)}$.

Algorithm 1: An alternating ascent/descent algorithm for the robust mean estimation problem w.r.t. $\|\cdot\|_S$, with randomly chosen blocks of data at each step.

In this section, we introduce the assumptions under which we will obtain statistical upper bounds for the Fenchel-Legendre minimum estimators introduced above. We consider two types of assumptions: one for the outliers, which will be the adversarial corruption model, and one for the inliers, which will be either the existence of a second moment or a regularity assumption on a family of cdfs around 0. We start with the adversarial corruption model.
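As a minimal runnable sketch (our own, with hypothetical parameter choices, not code from the paper), the following implements the coordinate-wise median-of-means from (9) — the special case $S = \{\pm e_j : j \in [d]\}$ of $\hat\mu^g_S$, which also serves as the starting point $\mu^{(0)}$ of Algorithm 1 — and illustrates its robustness when a few data points are adversarially corrupted:

```python
import numpy as np

rng = np.random.default_rng(0)

def bucketed_means(X, K, rng):
    """Split the N data points into K roughly equal-size blocks at random
    and return the K block-wise empirical means (the "bucketed means")."""
    N = X.shape[0]
    blocks = np.array_split(rng.permutation(N), K)
    return np.array([X[b].mean(axis=0) for b in blocks])

def coordinatewise_mom(X, K, rng):
    """Coordinate-wise median-of-means, i.e. mu_hat^g_S for S = {+-e_j}."""
    return np.median(bucketed_means(X, K, rng), axis=0)

# Inliers ~ N(mu_star, I_d); an "adversary" then replaces 10 points.
N, d, K = 1000, 5, 31
mu_star = np.ones(d)
X = rng.normal(mu_star, 1.0, size=(N, d))
X[:10] = 1e6        # |O| = 10 outliers, corrupting at most 10 of the K blocks

mom_est = coordinatewise_mom(X, K, rng)
emp_mean = X.mean(axis=0)
# mom_est stays close to mu_star; emp_mean is dragged far away.
```

Since at most 10 of the 31 blocks can contain an outlier, a majority of block means remain informative and the coordinate-wise median is essentially unaffected, whereas the empirical mean is ruined by a single corrupted point.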
Assumption 1.
There exist $N$ independent random vectors $(\tilde{X}_i)_{i=1}^N$ in $\mathbb{R}^d$. The $N$ random vectors $(\tilde{X}_i)_{i=1}^N$ are first given to an "adversary" who is allowed to modify up to $|\mathcal{O}|$ of these vectors. This modification does not have to follow any rule. Then, the "adversary" gives the modified dataset $(X_i)_{i=1}^N$ to the statistician. Hence, the statistician receives an "adversarially" contaminated dataset of $N$ vectors in $\mathbb{R}^d$, which can be partitioned into two groups: the modified data $(X_i)_{i \in \mathcal{O}}$, which can be seen as outliers, and the "good data" or inliers $(X_i)_{i \in \mathcal{I}}$, such that $\forall i \in \mathcal{I}, X_i = \tilde{X}_i$. Of course, the statistician does not know which data have been modified, so that the partition $\mathcal{O} \cup \mathcal{I} = \{1, \ldots, N\}$ is unknown to the statistician.

In the adversarial contamination model from Assumption 1, the set $\mathcal{O} \subset [N]$ can depend arbitrarily on the initial data $(\tilde{X}_i)_{i=1}^N$; the corrupted data $(X_i)_{i \in \mathcal{O}}$ can have any arbitrary dependence structure; and the informative data $(X_i)_{i \in \mathcal{I}}$ may also be correlated (this is, for instance, in general the case when the $|\mathcal{O}|$ data points $\tilde{X}_i$ with largest $\ell_2^d$-norm are modified by the adversary). The adversarial corruption model covers the Huber $\epsilon$-contamination model [16] and also the $\mathcal{O} \cup \mathcal{I}$ framework from [21, 23, 27].

Assumption 1 does not grant any property of the inlier data $(\tilde{X}_i)_{i \in [N]}$ except that they are independent. We will obtain a general result under only Assumption 1 in Section 4. However, to recover convergence rates similar to the one in Theorem 1, or the subgaussian rate in (5), we will grant some assumptions on the inliers as well. We are now considering two assumptions on the inliers, which are of a different nature.

These two assumptions on the inliers are related to a subtle property of the median-of-means (MOM) principle, which somehow benefits from both of its components: the empirical median and the empirical mean. Indeed, MOM is an empirical median of empirical means, and so if we refer to the classical asymptotic normality (a.n.) results for the empirical mean and the empirical median, the first holds under the existence of a second moment and the second holds under the assumption that the cdf is differentiable at the median with a positive derivative there (see Corollary 21.5 in [39]). We therefore recover these two types of assumptions when we work with estimators using the MOM principle. A nice feature of MOM-based estimators is that their estimation results hold under either one of the two conditions and do not require the two assumptions to hold simultaneously. We can therefore consider the two assumptions independently and get two estimation results for the Fenchel-Legendre minimum estimators introduced above (which are based on the MOM principle). We start with the moment assumption.
Assumption 2.
The $N$ independent random vectors $(\tilde{X}_i)_{i=1}^N$ have mean $\mu^*$ and there exists a positive semi-definite matrix $\Sigma \in \mathbb{R}^{d \times d}$ such that $\mathbb{E}(\tilde{X}_i - \mu^*)(\tilde{X}_i - \mu^*)^\top \preceq \Sigma$.

Most of the statistical bounds obtained for MOM-based estimators have focused on the heavy-tailed setup and have therefore considered Assumption 2 as their main assumption. This is the 'empirical mean component' of the MOM principle, which has been the most exploited so far. It is, however, also possible to use the 'empirical median component' of the MOM principle to get statistical bounds even in cases where a first moment does not exist. In that case, $\mu^*$ is called a location parameter and $\Sigma$ a scale parameter. A natural assumption is then similar to the one used to get the a.n. of the empirical median, that is, an assumption on the cdf at the median, adapted to the multidimensional and non-asymptotic setup. We are now introducing such an assumption.

Assumption 3.
The inlier data $(\tilde{X}_i)_{i=1}^N$ are i.i.d. There exist $\mu^* \in \mathbb{R}^d$ and two absolute constants $c_0 > 0$ and $c_1 > 0$ such that the following holds: for all $v \in S$ and all $0 < r \leq c_0$, $H_{N,K,v}(r) \leq 1/2 - c_1 r$, where
$$H_{N,K,v}(r) = \mathbb{P}\left[\frac{1}{\sqrt{N/K}}\sum_{i=1}^{N/K}\langle \tilde{X}_i - \mu^*, v\rangle > r\right]. \quad (12)$$

A typical example where Assumption 3 holds is when $S = S_2^{d-1}$ (that is, for the location estimation problem w.r.t. the Euclidean $\ell_2^d$-norm) and the $\tilde{X}_i$'s are rotationally invariant, that is, when for all $v \in S_2^{d-1}$, $\langle \tilde{X}_1 - \mu^*, v\rangle$ has the same distribution as $\langle \tilde{X}_1 - \mu^*, e_1\rangle$, where $e_1 = (1, 0, \ldots, 0) \in \mathbb{R}^d$. In that case, $\tilde{X}_1$ has the same distribution as $\mu^* + RU$, where $R$ is a real-valued random variable on $\mathbb{R}_+$ independent of $U$, a random vector uniformly distributed over $S_2^{d-1}$. In that case and for $K = N$, for all $v \in S_2^{d-1}$ and all $r \in \mathbb{R}$,
$$H_{N,K=N,v}(r) = H(r) := \mathbb{P}\big[R\langle U, e_1\rangle \geq r\big] = \int_r^{+\infty} f(x)\,dx \quad \text{where} \quad f : x \in \mathbb{R} \mapsto C_d \int_{|x|}^{+\infty}\left(1 - \frac{x^2}{u^2}\right)^{\frac{d-3}{2}} dP_R(u),$$
$P_R$ is the probability distribution of $R$ and $C_d$ is a normalization constant which can be proved to be of order $\sqrt{d}$ (see, for instance, Chapter 4 in [5]). In particular, it follows from the mean value theorem that, for all $r \geq 0$, $H(r) \leq H(0) - \min_{0 \leq x \leq r} f(x)\, r = 1/2 - f(r)\,r$. Therefore, Assumption 3 holds in that case when there exist constants $c_0, c_1 > 0$ such that $f(c_0) \geq c_1$. Furthermore, we have
$$f(c_0) \geq C_d \int_{c_0\sqrt{d}}^{+\infty}\left(1 - \frac{c_0^2}{u^2}\right)^{\frac{d-3}{2}} dP_R(u) \gtrsim \sqrt{d}\;\mathbb{P}\big[R \geq c_0\sqrt{d}\big]$$
because $C_d \gtrsim \sqrt{d}$ and, for all $u \geq c_0\sqrt{d}$, $(1 - (c_0/u)^2)^{(d-3)/2} \geq 1/2$. As a consequence, Assumption 3 holds if there are some constants $c_0, c_1 > 0$ such that $\mathbb{P}[R \geq c_0\sqrt{d}] \geq c_1/\sqrt{d}$. This is for instance the case when $R$ is distributed like $\|G\|_2$ for $G \sim \mathcal{N}(0, I_d)$ (in that case $\tilde{X}_1 \sim \mathcal{N}(\mu^*, I_d)$), because $\mathbb{P}[\|G\|_2 \geq \mathbb{E}\|G\|_2/2] \geq 1/2$ and $\mathbb{E}\|G\|_2 \gtrsim \sqrt{d}$, or when $R$ is the positive part of a Cauchy variable, because $\int_{\sqrt{d}}^{+\infty}(1/(1+x^2))\,dx \geq 1/(2\sqrt{d})$. As a consequence, Assumption 3 has nothing to do with the existence of any moment: it may hold even when there is no first moment, and even for $K = N$.

Another example where Assumption 3 holds, which we will use in the following to obtain statistical bounds for the coordinate-wise median-of-means in the location problem, is when $S = \{\pm e_j : j \in [d]\}$ and $\tilde{X}_1 = \mu^* + Z$, where $Z = (z_j)_{j=1}^d$ is a random vector in $\mathbb{R}^d$ whose coordinates $z_1, \ldots, z_d$ have a symmetric-around-0 Cauchy distribution. In that case, $\tilde{X}_1$ does not have a first moment and $\mu^*$ is a location parameter, being the center of symmetry of the distribution of $\tilde{X}_1$. We have, for all $j \in [d]$ and all $0 < r \leq 1$,
$$H_{N,K=N,\pm e_j}(r) = \mathbb{P}\big[\langle \tilde{X}_1 - \mu^*, \pm e_j\rangle \geq r\big] = \mathbb{P}[z_j \geq r] = \int_r^{+\infty}\frac{dx}{\pi(1+x^2)} \leq \frac{1}{2} - \frac{r}{\pi(1+r^2)} \leq \frac{1}{2} - \frac{r}{2\pi}.$$
Therefore, Assumption 3 holds in that case as well.

Statistical bounds for $\hat\mu^f_S$ and $\hat\mu^g_S$

In this section, we obtain estimation bounds w.r.t. $\|\cdot\|_S$ for $\hat\mu^f_S$ and $\hat\mu^g_S$ in the adversarial contamination model, with either the $L_2$ moment Assumption 2 or the regularity-at-0 Assumption 3.

Estimation properties of $\hat\mu^f_S$ and $\hat\mu^g_S$ under Assumption 2. In this section, we obtain high-probability estimation upper bounds satisfied by $\hat\mu^f_S$ and $\hat\mu^g_S$ w.r.t. $\|\cdot\|_S$ in the adversarial contamination and heavy-tailed inlier model. The rate of convergence is given by the quantity
$$r^*_S = \max\left(\frac{1}{\sqrt{N}}\,\mathbb{E}\left\|\frac{1}{\sqrt{N}}\sum_{i \in [N]}\epsilon_i(\tilde{X}_i - \mu^*)\right\|_S,\ \sup_{v \in S}\big\|\Sigma^{1/2} v\big\|_2 \sqrt{\frac{K}{N}}\right). \quad (13)$$
The key metric property satisfied by the two Fenchel-Legendre transforms $f^*_S$ and $g^*_S$ in the adversarial contamination and heavy-tailed inlier model is the following isomorphic result.

Lemma 1.
Grant Assumption 1 and Assumption 2. Let $S$ be a symmetric subset of $\mathbb{R}^d$ and assume that $|\mathcal O| < K/8$. With probability at least $1 - \exp(-K/64)$, for all $\mu \in \mathbb{R}^d$,

$$\big|g_S^*(\mu) - \|\mu - \mu^*\|_S\big| \leq g_S^*(\mu^*) \leq r_S^* \quad \text{and} \quad \big|f_S^*(\mu) - \|\mu - \mu^*\|_S\big| \leq f_S^*(\mu^*) \leq r_S^*.$$

Lemma 1 shows that if $\|\mu - \mu^*\|_S \geq 2 r_S^*$ then $\frac12 \|\mu - \mu^*\|_S \leq g_S^*(\mu) \leq \frac32 \|\mu - \mu^*\|_S$, and the same holds for $f_S^*$. It means that both $g_S^*$ and $f_S^*$ are convex functions equivalent (up to absolute constants) to $\mu \to \|\mu - \mu^*\|_S$ on $\mathbb{R}^d \setminus (2 r_S^*) B_S$, where $B_S$ is the unit ball associated with $\|\cdot\|_S$, and, on $(2 r_S^*) B_S$, they are both smaller than $3 r_S^*$. Hence, both $g_S^*$ and $f_S^*$ provide a good approximation of the metric space $(\mathbb{R}^d, \|\cdot\|_S)$ around $\mu^*$. In particular, any minimizer of $g_S^*$ or of $f_S^*$ is close (up to $2 r_S^*$) to the minimizer of $\mu \to \|\mu - \mu^*\|_S$, which is $\mu^*$. This explains the statistical properties of $\hat\mu_S^f$ and $\hat\mu_S^g$: from Lemma 1 and the minimality of $\hat\mu_S^f$,

$$\big\|\hat\mu_S^f - \mu^*\big\|_S \leq f_S^*(\hat\mu_S^f) + f_S^*(\mu^*) \leq 2 f_S^*(\mu^*) \leq 2 r_S^*$$

and the same holds for $\hat\mu_S^g$. This leads to the following result.

Theorem 5.
Grant Assumption 1 and Assumption 2. Let $S$ be a symmetric subset of $\mathbb{R}^d$ and let $r_S^*$ be defined in (13). For all $K > 8|\mathcal O|$, with probability at least $1 - \exp(-K/64)$,

$$\big\|\hat\mu_S^f - \mu^*\big\|_S \leq 2 r_S^* \quad \text{and} \quad \big\|\hat\mu_S^g - \mu^*\big\|_S \leq 2 r_S^*.$$

The rate $r_S^*$ obtained in Theorem 5 can be split into two terms: a complexity term given by the Rademacher complexity, and a deviation term exhibiting the same weak variance term as in the Gaussian case. Compared with Theorem 1 from [30], this result shows that the Gaussian mean width term appearing in Theorem 1 is actually not necessary; it also shows that this improved rate can be achieved by a solution to a convex optimization problem, and that adversarial corruption can be handled as well. When $S = B_2^d$, we recover the classical subgaussian rate because, in that case, the Rademacher complexity term in $r_S^*$ is at most of the order of $\sqrt{\mathrm{Tr}(\Sigma)}$ [19]. In particular, since $\hat\mu_S^g$ is the minmax MOM estimator in that case, we recover the main result from [28].

Estimation properties of $\hat\mu_S^g$ under Assumption 3. In this section, we consider some cases where a first moment may not exist; in that case, $\mu^*$ is a location parameter and Assumption 3 may hold. The rate of convergence we obtain in that case is given by

$$r^\diamond = \frac{C}{c_1}\Bigg( \sqrt{\frac{d+1}{N}} + \sqrt{\frac uN} \Bigg) + \frac{|\mathcal O|}{c_1 \sqrt{KN}} \quad (14)$$

where $c_1$ is the absolute constant from Assumption 3, $C$ is the absolute constant from (28) and $u > 0$. The following lemma states an isomorphic result for $g_S^*$ under Assumption 3. It is similar to the one of Lemma 1, but with the rate $r^\diamond$.

Lemma 2.
Let $S$ be a symmetric subset of $\mathbb{R}^d$. Grant Assumption 1 and Assumption 3 for some $K \in [N]$. Let $u > 0$ and assume that $C\big(\sqrt{(d+1)/K} + \sqrt{u/K}\big) + |\mathcal O|/K \leq c_1 c_2$. With probability at least $1 - \exp(-u)$, for all $\mu \in \mathbb{R}^d$,

$$\big|g_S^*(\mu) - \|\mu - \mu^*\|_S\big| \leq r^\diamond.$$

As explained below Lemma 1, a result such as Lemma 2 may be used to upper bound the $\|\cdot\|_S$-distance between $\hat\mu_S^g$, a minimizer of $g_S^*$, and $\mu^*$, the minimizer of $\mu \to \|\mu - \mu^*\|_S$. This yields the following result.

Theorem 6.
Let $S$ be a symmetric subset of $\mathbb{R}^d$. Grant Assumption 1 and Assumption 3 for some $K \in [N]$. Let $u > 0$ and assume that $C\big(\sqrt{(d+1)/K} + \sqrt{u/K}\big) + |\mathcal O|/K \leq c_1 c_2$. With probability at least $1 - \exp(-u)$,

$$\big\|\hat\mu_S^g - \mu^*\big\|_S \leq 2 r^\diamond$$

where $r^\diamond$ is defined in (14).

Unlike Theorem 5, Theorem 6 may hold even when there is no first moment. The result of Theorem 6 holds for all $0 < u \lesssim K$, whereas Theorem 5 holds only for $u \sim K$ (even though one may use a Lepski-type adaptation scheme to choose $K$ adaptively). The price for adversarial corruption in (14) is between $|\mathcal O|/N$ (for $K \sim N$) and $\sqrt{|\mathcal O|/N}$ (for $K \sim |\mathcal O|$). It therefore depends on the choice of $K$ for which Assumption 3 holds. As shown after Assumption 3, for spherically symmetric random variables one can take $K = N$, so that the best possible price $|\mathcal O|/N$ for adversarial corruption may be achieved even when a first moment does not exist. If one needs some averaging effect for Assumption 3 to hold, then one should take $K$ as small as possible, that is $K \sim |\mathcal O|$, and then $\sqrt{|\mathcal O|/N}$ is the price for adversarial corruption, as in the $L^2$ case of Theorem 5.

Subgaussian rates under weak or no moment assumptions.
It is possible to recover (up to absolute constants) the subgaussian rate (5) from Theorem 5 for $K \sim \log(1/\delta)$ when the Rademacher complexity term from (13) and the Gaussian mean width from (5) satisfy

$$\mathbb{E}\bigg\|\frac{1}{\sqrt N}\sum_{i \in [N]} \varepsilon_i(\tilde X_i - \mu^*)\bigg\|_S \lesssim \ell^*\big(\Sigma^{1/2} S\big). \quad (15)$$

Such a result (i.e., the Rademacher complexity is at most of the order of the Gaussian mean width) depends on the set $S$, on the number of moments granted on the $\tilde X_i$'s, and on the sample size. It obviously holds when the $\tilde X_i$'s are i.i.d. $\mathcal N(\mu^*, \Sigma)$, so that we recover the deviation-minimax optimal subgaussian rate (5) in that case. It is also true when the $\tilde X_i$'s are subgaussian vectors. There are other situations, under weaker moment assumptions, where (15) holds.

For instance, when $S = B_2^d$, (15) holds under only an $L^2$-moment assumption (see [19]). It also holds for $S = B_1^d$ when the $\tilde X_i$'s are isotropic with coordinates having $\log d$ subgaussian moments (i.e., $\|\langle \tilde X_i, e_j\rangle\|_{L^p} \leq L\sqrt p$ for all $1 \leq p \leq \log d$ and every coordinate $j \in [d]$) and $N \gtrsim \log d$. Together with (9) and Theorem 5, this implies that the coordinate-wise MOM is a subgaussian estimator of the mean under a $\log d$ subgaussian moment assumption. Upper bounds such as (15) have been extended in [33] to general unconditional norms.

It is also possible to recover the subgaussian rate (5) in situations where there is not even a first moment, thanks to Theorem 6. Indeed, for the case $S = \{\pm e_j : j \in [d]\}$ and $\tilde X_1 = \mu^* + Z$, where $Z = (z_j)_{j=1}^d$ has symmetric (around $0$) Cauchy distributed coordinates, we showed that Assumption 3 holds for $K = N$ and that $\hat\mu_S^g$ is the coordinate-wise median (here $K = N$) in (9). It follows from Theorem 6 that, when $d \lesssim N$ and $|\mathcal O| \lesssim N$, for all $d \leq u \lesssim N$, with probability at least $1 - \exp(-u)$,

$$\big\|\hat\mu_S^g - \mu^*\big\|_\infty \leq C\Bigg( \sqrt{\frac{d+1}{N}} + \sqrt{\frac uN} \Bigg) + \frac{2\pi |\mathcal O|}{N} \quad (16)$$

which is the deviation-minimax optimal subgaussian rate (5) we would have obtained (up to absolute constants) if the $\tilde X_i$ were i.i.d. isotropic Gaussian vectors centered at $\mu^*$ and corrupted by $|\mathcal O|$ adversarial outliers. But here, (16) is obtained without the existence of a first moment. Moreover, in (16), the number of outliers is allowed to be proportional to $N$, and the price for adversarial corruption is of the order of $|\mathcal O|/N$, which is the same price one pays when the inliers have a Gaussian distribution; this differs from the $\sqrt{|\mathcal O|/N}$ information-theoretic lower bound that has been obtained for some non-symmetric inliers. Furthermore, the computational cost of the coordinate-wise MOM is $O(Nd)$: the cost for computing the bucketed means is $O(Nd)$ and the one for finding the median of $K$ numbers is $O(K)$ [3]. It is therefore the same computational cost as the one of the empirical mean. It is thus possible to achieve the same computational and statistical properties as the empirical mean in a setup where a first moment does not even exist.

Proof of Theorem 3.
The minimax lower bound rate $r^*$ exhibits two quantities: a complexity term depending on the Gaussian mean width of $\Sigma^{1/2} S$ and a deviation term depending on $\delta$. The two terms come from two different arguments. We start with the deviation term.

Let $v \in \mathbb{R}^d$ be such that $\|v\|_S = 1$. We consider two Gaussian measures on $\mathbb{R}^{dN}$: $P_0 = \mathcal N(0, \Sigma)^{\otimes N}$ and $P_1 = \mathcal N(3 r^* v, \Sigma)^{\otimes N}$. They are the distributions of a sample of $N$ i.i.d. Gaussian vectors in $\mathbb{R}^d$ with the same covariance matrix $\Sigma$, the first one with mean $0$ and the second one with mean $3 r^* v$. We set

$$A_0 = \hat\mu^{-1}\big(B_S(0, r^*)\big) = \big\{(x_1, \ldots, x_N) \in \mathbb{R}^{Nd} : \|\hat\mu(x_1, \ldots, x_N)\|_S \leq r^*\big\}$$

and $A_1 = \hat\mu^{-1}(B_S(3 r^* v, r^*))$. It follows from the statistical properties of $\hat\mu$ that $P_0[A_0] \geq 1 - \delta$ and $P_1[A_1] \geq 1 - \delta$.

The key ingredient for the deviation lower bound is a slight generalization of Lemma 3.3 in [22], which is based on a version of the Gaussian shift theorem from [29].

Lemma 3.
Let $\Phi(t) = \mathbb{P}(g \leq t)$ be the cumulative distribution function of a standard Gaussian random variable $g$ on $\mathbb{R}$. Let $\Sigma_0 \succeq 0$ be in $\mathbb{R}^{(Nd) \times (Nd)}$ and $u, v \in \mathbb{R}^{dN}$. Consider the two Gaussian measures $\nu_u = \mathcal N(u, \Sigma_0)$ and $\nu_v = \mathcal N(v, \Sigma_0)$ on $\mathbb{R}^{Nd}$. If $A \subset \mathbb{R}^{dN}$ is measurable, then

$$\nu_v(A) \geq 1 - \Phi\Big( \Phi^{-1}\big(1 - \nu_u(A)\big) + \big\|\Sigma_0^{-1/2}(u - v)\big\|_2 \Big) \quad (17)$$

where $\Sigma_0^{-1/2}$ is the square root of the pseudo-inverse of $\Sigma_0$.

Proof of Lemma 3. When $\Sigma_0 = I_{Nd}$, Lemma 3 is exactly Lemma 3.3 in [22] for $\sigma = 1$. To prove Lemma 3, we observe that $\nu_v(A) = \mathbb{P}[G + \Sigma_0^{-1/2} v \in B]$, where $B = \Sigma_0^{-1/2} A$ and $G$ is a standard Gaussian vector in $\mathrm{Im}(\Sigma_0)$. Hence, it follows from Lemma 3.3 in [22] that

$$\mathbb{P}\big[G + \Sigma_0^{-1/2} v \in B\big] \geq 1 - \Phi\Big( \Phi^{-1}\big(1 - \mathbb{P}[G + \Sigma_0^{-1/2} u \in B]\big) + \big\|\Sigma_0^{-1/2}(u - v)\big\|_{\ell_2^{Nd}} \Big)$$

which is exactly (17).

It follows from Lemma 3 (applied with $\Sigma_0$ the block-diagonal matrix with $N$ diagonal blocks equal to $\Sigma$) that

$$P_1[A_0] \geq 1 - \Phi\Big[ \Phi^{-1}\big(1 - P_0[A_0]\big) + \big\|\Sigma_0^{-1/2}\big(0 - (3 r^* v, \ldots, 3 r^* v)\big)\big\|_2 \Big]. \quad (18)$$

Moreover, we have $\Phi^{-1}(1 - P_0[A_0]) \leq \Phi^{-1}(\delta)$ (because $1 - P_0[A_0] \leq \delta$) and

$$\big\|\Sigma_0^{-1/2}\big(0 - (3 r^* v, \ldots, 3 r^* v)\big)\big\|_2 = 3 r^* \sqrt N \big\|\Sigma^{-1/2} v\big\|_2. \quad (19)$$

As a consequence, if $3 r^* \sqrt N \|\Sigma^{-1/2} v\|_2 \leq -\Phi^{-1}(\delta)$ then, in (18), we get $P_1[A_0] \geq 1 - \Phi[0] = 1/2$, which contradicts $P_1[A_1] \geq 1 - \delta > 1/2$ together with $A_0 \cap A_1 = \emptyset$. As a consequence, we necessarily have $3 r^* \sqrt N \geq (-\Phi^{-1}(\delta)) \|\Sigma^{-1/2} v\|_2^{-1}$. The latter holds for any $v \in \mathbb{R}^d$ such that $\|v\|_S = 1$, hence $3 r^* \sqrt N \geq (-\Phi^{-1}(\delta)) \big[1 / \inf_{\|v\|_S = 1} \|\Sigma^{-1/2} v\|_2\big]$. It also follows from the bound on the Mills ratio from [20] (here we use that, for all $x \geq 0$, $\Phi(-x) \geq 2\varphi(x)/(x + \sqrt{x^2 + 4})$, where $\varphi$ is the standard Gaussian density function) that, for all $0 < \delta \leq 1/4$, $-\Phi^{-1}(\delta) \geq \frac12 \sqrt{\log(1/\delta)}$. This shows that

$$r^* \geq \frac16 \sqrt{\frac{\log(1/\delta)}{N}} \Big[ 1 \Big/ \inf_{\|v\|_S = 1} \big\|\Sigma^{-1/2} v\big\|_2 \Big]. \quad (20)$$

To conclude on the deviation term, we use the following duality argument.

Lemma 4.
Let $A \in \mathbb{R}^{d \times d}$ be a symmetric and invertible matrix. Let $\|\cdot\|$ be a norm on $\mathbb{R}^d$ with dual norm $\|\cdot\|_*$. Let $S$ be a symmetric subset of $\mathbb{R}^d$ such that $\mathrm{span}(S) = \mathbb{R}^d$. We have

$$1 \Big/ \inf_{\|v\|_S = 1} \big\|A^{-1} v\big\| \geq \sup_{w \in S} \|A w\|_*.$$

Proof of Lemma 4.
Let $v$ be such that $\|v\|_S = 1$ and let $w \in S$. We have $|\langle v, w\rangle| \leq \|v\|_S = 1$ and $\langle v, w\rangle = \langle A^{-1} v, A w\rangle$, so that

$$\Big|\Big\langle \frac{A^{-1} v}{\|A^{-1} v\|}, A w \Big\rangle\Big| \leq \frac{1}{\|A^{-1} v\|}.$$

The latter holds for all $v$ such that $\|v\|_S = 1$, and $\{A^{-1} v / \|A^{-1} v\| : \|v\|_S = 1\}$ is the unit sphere of $\|\cdot\|$. Since $\|A w\|_* = \sup_{\|x\| = 1} \langle x, A w\rangle$, we conclude by taking the supremum over all such $v$ and over $w \in S$.

It follows from (20) and Lemma 4, applied to $\|\cdot\| = \|\cdot\|_2$ and $A = \Sigma^{1/2}$, that

$$r^* \geq \frac16 \sqrt{\frac{\log(1/\delta)}{N}} \sup_{w \in S} \big\|\Sigma^{1/2} w\big\|_2. \quad (21)$$

Let us now turn to the second part of the lower bound, the one coming from the complexity of the problem (here, the Gaussian mean width of $\Sigma^{1/2} S$). We know that $\hat\mu$ is an estimator such that, for all $\mu \in \mathbb{R}^d$, $P_\mu^N[\|\hat\mu - \mu\|_S \leq r^*] \geq 1 - \delta$, which is equivalent to saying that

$$\delta \geq \sup_{\mu \in \mathbb{R}^d} \mathbb{E}_\mu^N \phi\Big( \frac{\|\hat\mu - \mu\|_S}{r^*} \Big) \quad (22)$$

where we set $\phi : t \in \mathbb{R} \to I(t > 1)$ and $\mathbb{E}_\mu^N$ is the expectation with respect to $X_1, \ldots, X_N$ i.i.d. $\sim \mathcal N(\mu, \Sigma)$. Next, we consider a Gaussian prior distribution $\gamma$ over the set of parameters $\mu \in \mathbb{R}^d$: for $s > 0$, we assume that $\mu \sim \mathcal N(0, s \Sigma)$. It follows from (22) that

$$\delta \geq \int_{\mathbb{R}^d} \mathbb{E}_\mu^N \phi\Big( \frac{\|\hat\mu - \mu\|_S}{r^*} \Big) d\gamma(\mu) = \mathbb{E}\bigg[ \mathbb{E}\bigg[ \phi\Big( \frac{\|\hat\mu(X_1, \ldots, X_N) - \mu\|_S}{r^*} \Big) \Big| X_1, \ldots, X_N \bigg]\bigg]. \quad (23)$$

In other words, we lower bound the minimax risk by a Bayesian risk. We now use Anderson's lemma to lower bound the Bayesian risk appearing in (23). We first recall Anderson's lemma.

Theorem 7 (Anderson's lemma). Let $\Gamma$ be a positive semi-definite $d \times d$ matrix and $Z \sim \mathcal N(0, \Gamma)$. Let $w : \mathbb{R}^d \to \mathbb{R}$ be such that all its level sets (i.e., $\{x \in \mathbb{R}^d : w(x) \leq c\}$ for $c \in \mathbb{R}$) are convex and symmetric around the origin. Then, for all $x \in \mathbb{R}^d$, $\mathbb{E} w(Z + x) \geq \mathbb{E} w(Z)$.

We remark that, conditionally on $X_1, \ldots, X_N$, $\mu - \mathbb{E}[\mu | X_1, \ldots, X_N]$ is distributed according to $\mathcal N(0, (s/(1 + N s)) \Sigma)$. Therefore, applying Anderson's lemma conditionally on $X_1, \ldots, X_N$, we obtain from (23) that

$$\delta \geq \mathbb{E}\bigg[ \phi\Big( \frac{\|\mathbb{E}[\mu | X_1, \ldots, X_N] - \mu\|_S}{r^*} \Big)\bigg] = \mathbb{P}\bigg[ \big\|\Sigma^{1/2} G\big\|_S \geq \sqrt{\frac{1 + N s}{s}}\, r^* \bigg]$$

where $G \sim \mathcal N(0, I_d)$. This result is true for all $s > 0$; taking the limit $s \uparrow +\infty$, we obtain $\delta \geq \mathbb{P}\big[\|\Sigma^{1/2} G\|_S \geq \sqrt N r^*\big]$.

Using the Borell-TIS inequality (Theorem 7.1 in [24] or pages 56-57 in [37]), we know that, with probability at least $4/5$,

$$\big\|\Sigma^{1/2} G\big\|_S \geq \mathbb{E}\big\|\Sigma^{1/2} G\big\|_S - \sigma_S \sqrt{2 \log 5}$$

where we set $\sigma_S = \sup_{v \in S} \|\Sigma^{1/2} v\|_2$. As a consequence, for $\delta = 1/4$, we necessarily have $\sqrt N r^* \geq \mathbb{E}\|\Sigma^{1/2} G\|_S - \sigma_S \sqrt{2 \log 5}$, and so $\sqrt N r^* \geq (1/2)\, \mathbb{E}\|\Sigma^{1/2} G\|_S$ when $\mathbb{E}\|\Sigma^{1/2} G\|_S \geq 2 \sigma_S \sqrt{2 \log 5}$. When $\mathbb{E}\|\Sigma^{1/2} G\|_S < 2 \sigma_S \sqrt{2 \log 5}$, we use (21) with $\delta = 1/4$ to get

$$r^* \geq \frac16 \sqrt{\frac{\log 4}{N}}\, \sigma_S \geq \frac16 \sqrt{\frac{\log 4}{8 \log 5}}\, \frac{\mathbb{E}\|\Sigma^{1/2} G\|_S}{\sqrt N}.$$
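Remark (numerical sanity check). The two elementary Gaussian estimates used in the proof above can be checked numerically. The following Python snippet (illustrative only, not part of the argument; the function names `Phi` and `phi` are our own) verifies Komatu's lower bound on the Mills ratio from [20] and the resulting quantile bound $-\Phi^{-1}(\delta) \geq \frac12 \sqrt{\log(1/\delta)}$ for $0 < \delta \leq 1/4$ on a grid of values.

```python
import math

def Phi(t: float) -> float:
    """Standard Gaussian CDF, computed via the complementary error function."""
    return 0.5 * math.erfc(-t / math.sqrt(2.0))

def phi(t: float) -> float:
    """Standard Gaussian density."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

# Komatu's bound [20]: for all x >= 0, Phi(-x) >= 2*phi(x) / (x + sqrt(x^2 + 4)).
xs = [0.01 * k for k in range(0, 801)]               # grid of x in [0, 8]
komatu_ok = all(
    Phi(-x) >= 2.0 * phi(x) / (x + math.sqrt(x * x + 4.0)) for x in xs
)

# Consequence used in the proof: for 0 < delta <= 1/4,
#     -Phi^{-1}(delta) >= (1/2) * sqrt(log(1/delta)),
# which is equivalent to Phi(-(1/2) * sqrt(log(1/delta))) >= delta.
deltas = [0.25 * 0.9 ** k for k in range(0, 200)]    # grid of delta in (0, 1/4]
quantile_ok = all(
    Phi(-0.5 * math.sqrt(math.log(1.0 / d))) >= d for d in deltas
)

print(komatu_ok, quantile_ok)
```

Both checks pass on these grids, which is consistent with the strict inequality in Komatu's bound (the two sides only become comparable as $x \to \infty$).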
Proof of Theorem 4.
Theorem 4 follows from Theorem 3 and the following lower bound on $\mathbb{E}\|\Sigma^{1/2} G\|_{B_2^d} = \mathbb{E}\|\Sigma^{1/2} G\|_2$. We have from the Borell-TIS inequality that

$$\mathbb{E}\big\|\Sigma^{1/2} G\big\|_2^2 - \big(\mathbb{E}\big\|\Sigma^{1/2} G\big\|_2\big)^2 = \mathbb{E}\Big(\big\|\Sigma^{1/2} G\big\|_2 - \mathbb{E}\big\|\Sigma^{1/2} G\big\|_2\Big)^2 = \int_0^\infty \mathbb{P}\Big[\big|\|\Sigma^{1/2} G\|_2 - \mathbb{E}\|\Sigma^{1/2} G\|_2\big| \geq \sqrt t\Big]\, dt \leq 4 \sigma_{B_2^d}^2$$

where $\sigma_{B_2^d} = \sup_{\|v\|_2 = 1} \|\Sigma^{1/2} v\|_2 = \sqrt{\|\Sigma\|_{op}}$. Since $\mathbb{E}\|\Sigma^{1/2} G\|_2^2 = \mathrm{Tr}(\Sigma)$, we have $(\mathbb{E}\|\Sigma^{1/2} G\|_2)^2 \geq \mathrm{Tr}(\Sigma) - 4 \|\Sigma\|_{op}$. Therefore, $\mathbb{E}\|\Sigma^{1/2} G\|_2 \geq \sqrt{\mathrm{Tr}(\Sigma)/2}$ when $\mathrm{Tr}(\Sigma) \geq 8 \|\Sigma\|_{op}$; when $\mathrm{Tr}(\Sigma) < 8 \|\Sigma\|_{op}$, we use the lower bound from (21) and an argument similar to the one at the end of the proof of Theorem 3 to get the result.

Proof of Lemma 1.
We first prove the result for the function $g_S^*$; the proof for $f_S^*$ is similar up to constants and is sketched afterwards. The proof of Lemma 1 for $g_S^*$ is a corollary of a general fact which holds under Assumption 1 alone. Let $u > 0$ and let $R_S^*$ be such that

$$\frac{4}{\sqrt N R_S^*}\, \mathbb{E}\bigg\|\frac{1}{\sqrt N}\sum_{i \in [N]} \varepsilon_i(\tilde X_i - \mu^*)\bigg\|_S + \sqrt{\frac uK} + \sup_{v \in S} H_{N,K,v}\Big(\frac{R_S^*}{2}\sqrt{\frac NK}\Big) + \frac{|\mathcal O|}{K} < \frac12. \quad (24)$$

Let us show that, with large probability, for all $\mu \in \mathbb{R}^d$, $|g_S^*(\mu) - \|\mu - \mu^*\|_S| \leq R_S^*$. We have, for all $\mu \in \mathbb{R}^d$,

$$\big|g_S^*(\mu) - \|\mu - \mu^*\|_S\big| = \Big| \sup_{v \in S}\big(\langle \mu, v\rangle - g(v)\big) - \sup_{v \in S}\langle v, \mu - \mu^*\rangle \Big| \leq \sup_{v \in S}\big|\langle \mu^*, v\rangle - g(v)\big| = g_S^*(\mu^*) \quad (25)$$

where we used that $S$ is symmetric and $g$ is odd. It only remains to show that $g_S^*(\mu^*) \leq R_S^*$ with large probability. To that end, it is enough to prove that, with large probability, for all $v \in S$,

$$\sum_{k \in [K]} I\big(\langle \bar X_k - \mu^*, v\rangle > R_S^*\big) < \frac K2. \quad (26)$$

We use the notation introduced in Assumption 1 and consider $\tilde X_k = |B_k|^{-1} \sum_{i \in B_k} \tilde X_i$ for $k \in [K]$, the $K$ bucketed means constructed on the $N$ independent vectors $\tilde X_i$, $i \in [N]$, before contamination (whereas the $\bar X_k$ are the ones constructed after contamination). We also set $\mathcal K = \{k \in [K] : B_k \cap \mathcal O = \emptyset\}$, the set of indices of the non-corrupted blocks. We have

$$\sum_{k \in [K]} I\big(\langle \bar X_k - \mu^*, v\rangle > R_S^*\big) = \sum_{k \in \mathcal K} I\big(\langle \bar X_k - \mu^*, v\rangle > R_S^*\big) + \sum_{k \notin \mathcal K} I\big(\langle \bar X_k - \mu^*, v\rangle > R_S^*\big) \leq \sum_{k \in [K]} I\big(\langle \tilde X_k - \mu^*, v\rangle > R_S^*\big) + |\mathcal O|. \quad (27)$$

It only remains to show that, with probability at least $1 - \exp(-u)$, for all $v \in S$,

$$\sum_{k \in [K]} I\big(\langle \tilde X_k - \mu^*, v\rangle > R_S^*\big) \leq \frac{4K}{\sqrt N R_S^*}\, \mathbb{E}\bigg\|\frac{1}{\sqrt N}\sum_{i \in [N]} \varepsilon_i(\tilde X_i - \mu^*)\bigg\|_S + \sqrt{uK} + K \sup_{v \in S} H_{N,K,v}\Big(\frac{R_S^*}{2}\sqrt{\frac NK}\Big).$$

We define $\phi(t) = 0$ if $t \leq 1/2$, $\phi(t) = 2(t - 1/2)$ if $1/2 \leq t \leq 1$, and $\phi(t) = 1$ if $t \geq 1$, and write $\phi_k(v) = \phi(\langle \tilde X_k - \mu^*, v\rangle / R_S^*)$. We have $I(t \geq 1) \leq \phi(t) \leq I(t \geq 1/2)$ for all $t \in \mathbb{R}$, and so

$$\sum_{k \in [K]} I\big(\langle \tilde X_k - \mu^*, v\rangle > R_S^*\big) \leq \sum_{k \in [K]} \big[\phi_k(v) - \mathbb{E}\phi_k(v)\big] + \sum_{k \in [K]} \mathbb{P}\Big[\langle \tilde X_k - \mu^*, v\rangle > \frac{R_S^*}{2}\Big] \leq \sup_{v \in S} \sum_{k \in [K]} \big[\phi_k(v) - \mathbb{E}\phi_k(v)\big] + K \sup_{v \in S} H_{N,K,v}\Big(\frac{R_S^*}{2}\sqrt{\frac NK}\Big).$$

Next, we use several tools from empirical process theory; in particular, for a symmetrization argument, we consider a family of $N$ i.i.d. Rademacher variables $(\varepsilon_i)_{i=1}^N$ independent of the $(\tilde X_i)_{i=1}^N$. In (bdi) below, we use the bounded difference inequality (Theorem 6.2 in [4]); in (sa-cp), we use the symmetrization argument and the contraction principle (Chapter 4 in [25]), and we refer to the supplementary material of [27] for more details. We have, with probability at least $1 - \exp(-u)$,

$$\sup_{v \in S} \sum_{k \in [K]} \big[\phi_k(v) - \mathbb{E}\phi_k(v)\big] \overset{(bdi)}{\leq} \mathbb{E}\sup_{v \in S} \sum_{k \in [K]} \big[\phi_k(v) - \mathbb{E}\phi_k(v)\big] + \sqrt{uK} \overset{(sa\text{-}cp)}{\leq} \frac{4K}{N R_S^*}\, \mathbb{E}\sup_{v \in S} \Big\langle v, \sum_{i \in [N]} \varepsilon_i(\tilde X_i - \mu^*)\Big\rangle + \sqrt{uK} = \frac{4K}{\sqrt N R_S^*}\, \mathbb{E}\bigg\|\frac{1}{\sqrt N}\sum_{i \in [N]} \varepsilon_i(\tilde X_i - \mu^*)\bigg\|_S + \sqrt{uK}.$$

We therefore showed that, under Assumption 1 alone, with probability at least $1 - \exp(-u)$, for all $\mu \in \mathbb{R}^d$, $|g_S^*(\mu) - \|\mu - \mu^*\|_S| \leq R_S^*$ for any $R_S^*$ satisfying (24).

Now, if Assumption 2 also holds, then for all $v \in S$ it follows from Markov's inequality that

$$H_{N,K,v}\Big(\frac{r_S^*}{2}\sqrt{\frac NK}\Big) = \mathbb{P}\Big[\langle \tilde X_k - \mu^*, v\rangle > \frac{r_S^*}{2}\Big] \leq \frac{4\, \mathbb{E}\langle \tilde X_k - \mu^*, v\rangle^2}{(r_S^*)^2} = \frac{4 K v^\top \Sigma v}{N (r_S^*)^2} \leq \frac{4 K \sup_{v \in S} \|\Sigma^{1/2} v\|_2^2}{N (r_S^*)^2} \leq \frac18$$

by definition of $r_S^*$ in (13) (for $c_1$ large enough). Hence, when $|\mathcal O| < K/8$ and for $u = K/64$, condition (24) is satisfied by $R_S^* = r_S^*$, which proves the statement of Lemma 1 for $g_S^*$.

Finally, for the function $f_S^*$, one needs to control the average of the $K/2$ values $\langle \bar X_k - \mu^*, v\rangle$ lying in the inter-quartile interval. This can be done by defining an $R_S^*$ similar to the one in (24), but where the right-hand side value $1/2$ is replaced by $1/4$.

Proof of Lemma 2.
Unlike in Lemma 1, where we used Rademacher complexities as a complexity measure, in this proof the complexity measure is the Vapnik-Chervonenkis (VC) dimension [41, 42] of a class $F$ of Boolean functions, i.e., of functions from $\mathbb{R}^d$ to $\{0, 1\}$ in our case. We recall that the VC dimension of $F$, denoted by $VC(F)$, is the maximal integer $n$ such that there exist $x_1, \ldots, x_n \in \mathbb{R}^d$ for which the set $\{(f(x_1), \ldots, f(x_n)) : f \in F\}$ is of maximal cardinality, that is, of size $2^n$. The VC dimension of the set of all indicators of affine half-spaces in $\mathbb{R}^d$ is $d + 1$ (see Example 2.6.1 in [40]). We also know (see, for instance, Chapter 3 in [18]) the following concentration bound: if $Y_1, \ldots, Y_n$ are independent random vectors in $\mathbb{R}^d$, there exists an absolute constant $C$ such that, for all $u > 0$, with probability at least $1 - \exp(-u)$,

$$\sup_{f \in F} \frac1n \sum_{i=1}^n \big(f(Y_i) - \mathbb{E} f(Y_i)\big) \leq C\Big(\sqrt{\frac{VC(F)}{n}} + \sqrt{\frac un}\Big). \quad (28)$$

Lemma 2 is a corollary of a general result which holds under Assumption 1 alone. This general result says that, for all $u > 0$, with probability at least $1 - \exp(-u)$, for all $\mu \in \mathbb{R}^d$, $|g_S^*(\mu) - \|\mu - \mu^*\|_S| \leq R^\diamond$, where $R^\diamond$ is any point such that

$$C\Big(\sqrt{\frac{d+1}{K}} + \sqrt{\frac uK}\Big) + \sup_{v \in S} H_{N,K,v}\Big(R^\diamond \sqrt{\frac NK}\Big) + \frac{|\mathcal O|}{K} < \frac12 \quad (29)$$

where $C$ is the constant from (28). In particular, when Assumption 3 holds, one can check that (29) holds for $R^\diamond = r^\diamond$ when $r^\diamond \leq c_2$; this proves the result of Lemma 2. It only remains to show the general result. To that end, we follow the same strategy as in the proof of Lemma 1 up to (27) (with $R_S^*$ replaced by $R^\diamond$). From that point, we use (28) and the VC dimension of the class of indicators of affine half-spaces to get that, with probability at least $1 - \exp(-u)$, for all $v \in S$,

$$\frac1K \sum_{k \in [K]} I\big(\langle \tilde X_k - \mu^*, v\rangle > R^\diamond\big) \leq H_{N,K,v}\Big(R^\diamond \sqrt{\frac NK}\Big) + C\Big(\sqrt{\frac{d+1}{K}} + \sqrt{\frac uK}\Big)$$

and so, by definition of $R^\diamond$ in (29), on the same event, for all $v \in S$, $\sum_{k \in [K]} I(\langle \bar X_k - \mu^*, v\rangle > R^\diamond) < K/2$. This concludes the proof.
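Remark (numerical illustration). As an illustration of the coordinate-wise median-of-means discussed around (16), the following short Python simulation (illustrative only; the sample size, the contamination level and the tolerance are arbitrary choices, not quantities appearing in Theorem 6) computes the estimator on symmetric Cauchy data corrupted by adversarial outliers. For $K = N$ it reduces to the coordinate-wise median, and the whole computation runs in $O(Nd)$ time, as claimed.

```python
import numpy as np

def coordinatewise_mom(X: np.ndarray, K: int) -> np.ndarray:
    """Coordinate-wise median-of-means: split the N rows of X into K blocks,
    average inside each block, then take the coordinate-wise median of the
    K bucketed means.  For K = N this is the plain coordinate-wise median."""
    N, d = X.shape
    m = N // K                                          # block size (drop remainder)
    means = X[: K * m].reshape(K, m, d).mean(axis=1)    # O(N d) bucketed means
    return np.median(means, axis=0)                     # O(K d) medians

rng = np.random.default_rng(0)
N, d = 20_000, 5
mu_star = np.linspace(-2.0, 2.0, d)

# Heavy-tailed inliers: coordinates are symmetric Cauchy around mu_star, so the
# mean does not exist but mu_star is a location parameter.
X = mu_star + rng.standard_cauchy((N, d))

# Adversarial corruption: overwrite |O| = 5% of the rows with a huge constant.
n_out = N // 20
X[:n_out] = 1e6

est = coordinatewise_mom(X, K=N)        # K = N: coordinate-wise median
err = np.max(np.abs(est - mu_star))     # ell_infinity estimation error
print(err)
```

Despite 5% of the sample being replaced by arbitrarily large values, the $\ell_\infty$ error stays small, in line with the $|\mathcal O|/N$ corruption price discussed after (16); the empirical mean, by contrast, is meaningless here since no first moment exists.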
References

[1] Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. J. Comput. System Sci., 58(1):137-147, 1999.
[2] S. Artstein, V. Milman, and S. J. Szarek. Duality of metric entropy. Ann. of Math. (2), 159(3):1313-1328, 2004.
[3] Manuel Blum, Robert W. Floyd, Vaughan R. Pratt, Ronald L. Rivest, and Robert Endre Tarjan. Time bounds for selection. J. Comput. Syst. Sci., 7(4):448-461, 1973.
[4] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press, Oxford, 2013.
[5] Włodzimierz Bryc. The normal distribution: Characterizations with applications, volume 100 of Lecture Notes in Statistics. Springer-Verlag, New York, 1995.
[6] Olivier Catoni. Challenging the empirical mean and empirical variance: a deviation study. Ann. Inst. Henri Poincaré Probab. Stat., 48(4):1148-1185, 2012.
[7] Olivier Catoni and Ilaria Giulini. Dimension-free PAC-Bayesian bounds for matrices, vectors, and linear least squares regression. arXiv preprint arXiv:1712.02747, 2017.
[8] Olivier Catoni and Ilaria Giulini. Dimension-free PAC-Bayesian bounds for the estimation of the mean of a random vector. arXiv preprint arXiv:1802.04308, 2018.
[9] Yeshwanth Cherapanamjeri, Nicolas Flammarion, and Peter L. Bartlett. Fast mean estimation with sub-Gaussian rates, 2019.
[10] Jules Depersin and Guillaume Lecué. Fast algorithms for robust estimation of a mean vector, 2019.
[11] Luc Devroye, Matthieu Lerasle, Gábor Lugosi, and Roberto I. Oliveira. Sub-Gaussian mean estimators. Ann. Statist., 44(6):2695-2725, 2016.
[12] Ilias Diakonikolas, Gautam Kamath, Daniel M. Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Being robust (in high dimensions) can be practical. arXiv preprint arXiv:1703.00893, 2017.
[13] Matthew J. Holland. Distribution-robust mean estimation via smoothed random perturbations. arXiv preprint arXiv:1906.10300, 2019.
[14] Samuel B. Hopkins. Mean estimation with sub-Gaussian rates in polynomial time. Ann. Statist., 48(2):1193-1213, 2020.
[15] Daniel Hsu and Sivan Sabato. Loss minimization and parameter estimation with heavy tails. J. Mach. Learn. Res., 17:Paper No. 18, 40, 2016.
[16] Peter J. Huber and Elvezio M. Ronchetti. Robust statistics. Wiley Series in Probability and Statistics. John Wiley & Sons, Hoboken, NJ, second edition, 2009.
[17] Mark R. Jerrum, Leslie G. Valiant, and Vijay V. Vazirani. Random generation of combinatorial structures from a uniform distribution. Theoret. Comput. Sci., 43(2-3):169-188, 1986.
[18] V. Koltchinskii. Oracle inequalities in empirical risk minimization and sparse recovery problems. Springer, Berlin, 2011.
[19] Vladimir Koltchinskii. Local Rademacher complexities and oracle inequalities in risk minimization. Ann. Statist., 34(6):2593-2656, 2006.
[20] Yûsaku Komatu. Elementary inequalities for Mills' ratio. Rep. Statist. Appl. Res. Un. Jap. Sci. Engrs., 4:69-70, 1955.
[21] Guillaume Lecué and Matthieu Lerasle. Learning from MOM's principles: Le Cam's approach. Stochastic Processes and their Applications, 129(11):4385-4410, 2019.
[22] Guillaume Lecué and Shahar Mendelson. Learning subgaussian classes: Upper and minimax bounds. arXiv preprint arXiv:1305.4825, 2013.
[23] Guillaume Lecué and Matthieu Lerasle. Robust machine learning by median-of-means: Theory and practice. Ann. Statist., 48(2):906-931, 2020.
[24] Michel Ledoux. The concentration of measure phenomenon, volume 89 of Mathematical Surveys and Monographs. American Mathematical Society, Providence, RI, 2001.
[25] Michel Ledoux and Michel Talagrand. Probability in Banach spaces: Isoperimetry and processes. Classics in Mathematics. Springer-Verlag, Berlin, 2011. Reprint of the 1991 edition.
[26] Zhixian Lei, Kyle Luh, Prayaag Venkat, and Fred Zhang. A fast spectral algorithm for mean estimation with sub-Gaussian rates. In Conference on Learning Theory, pages 2598-2612, 2020.
[27] Matthieu Lerasle, Zoltán Szabó, Timothée Mathieu, and Guillaume Lecué. MONK: outliers-robust mean embedding estimation by median-of-means. Technical report, CNRS, University of Paris 11, Ecole Polytechnique and CREST, 2017.
[28] Matthieu Lerasle, Zoltán Szabó, Timothée Mathieu, and Guillaume Lecué. MONK: outlier-robust mean embedding estimation by median-of-means. In International Conference on Machine Learning, pages 3782-3793, 2019.
[29] Wenbo V. Li and James Kuelbs. Some shift inequalities for Gaussian measures. In High dimensional probability (Oberwolfach, 1996), volume 43 of Progr. Probab., pages 233-243. Birkhäuser, Basel, 1998.
[30] Gábor Lugosi and Shahar Mendelson. Near-optimal mean estimators with respect to general norms. Probab. Theory Related Fields, 175(3-4):957-973, 2019.
[31] Gábor Lugosi and Shahar Mendelson. Sub-Gaussian estimators of the mean of a random vector. The Annals of Statistics, 47(2):783-794, 2019.
[32] Shahar Mendelson. "Local" vs. "global" parameters: breaking the Gaussian complexity barrier. Ann. Statist., 45(5):1835-1862, 2017.
[33] Shahar Mendelson. On multiplier processes under weak moment assumptions. In Geometric aspects of functional analysis, volume 2169 of Lecture Notes in Math., pages 301-318. Springer, Cham, 2017.
[34] Stanislav Minsker. Geometric median and robust estimation in Banach spaces. Bernoulli, 21(4):2308-2335, 2015.
[35] A. S. Nemirovsky and D. B. Yudin. Problem complexity and method efficiency in optimization. Wiley-Interscience, New York, 1983. Translated from the Russian.
[36] Gilles Pisier. The volume of convex bodies and Banach space geometry, volume 94 of Cambridge Tracts in Mathematics. Cambridge University Press, Cambridge, 1989.
[37] Michel Talagrand. Upper and lower bounds for stochastic processes: Modern methods and classical problems, volume 60 of Ergebnisse der Mathematik und ihrer Grenzgebiete. Springer, Heidelberg, 2014.
[38] Alexandre B. Tsybakov. Introduction to nonparametric estimation. Springer Series in Statistics. Springer, New York, 2009. Revised and extended from the 2004 French original.
[39] A. W. van der Vaart. Asymptotic statistics, volume 3 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, 1998.
[40] Aad W. van der Vaart and Jon A. Wellner. Weak convergence and empirical processes: With applications to statistics. Springer Series in Statistics. Springer-Verlag, New York, 1996.
[41] V. N. Vapnik and A. Ya. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. In Measures of complexity, pages 11-30. Springer, Cham, 2015.
[42] V. N. Vapnik. The nature of statistical learning theory. Statistics for Engineering and Information Science. Springer-Verlag, New York, second edition, 2000.