Outlier Robust Mean Estimation with Subgaussian Rates via Stability
Ilias Diakonikolas ∗ University of Wisconsin-Madison [email protected]
Daniel M. Kane † University of California, San Diego [email protected]
Ankit Pensia ‡ University of Wisconsin-Madison [email protected]
July 31, 2020
Abstract
We study the problem of outlier robust high-dimensional mean estimation under a finite covariance assumption, and more broadly under finite low-degree moment assumptions. We consider a standard stability condition from the recent robust statistics literature and prove that, except with exponentially small failure probability, there exists a large fraction of the inliers satisfying this condition. As a corollary, it follows that a number of recently developed algorithms for robust mean estimation, including iterative filtering and non-convex gradient descent, give optimal error estimators with (near-)subgaussian rates. Previous analyses of these algorithms gave significantly suboptimal rates. As a corollary of our approach, we obtain the first computationally efficient algorithm with subgaussian rate for outlier-robust mean estimation in the strong contamination model under a finite covariance assumption.

∗ Authors are in alphabetical order. Supported by NSF Award CCF-1652862 (CAREER) and a Sloan Research Fellowship.
† Supported by NSF Award CCF-1553288 (CAREER) and a Sloan Research Fellowship.
‡ Supported by NSF Award CCF-1740707 (TRIPODS).
Introduction
Consider the following problem: For a given family F of distributions on R^d, estimate the mean of an unknown D ∈ F, given access to i.i.d. samples from D. This is the problem of (multivariate) mean estimation and is arguably the most fundamental statistical task. In the most basic setting where F is the family of high-dimensional Gaussians, the empirical mean is well-known to be an optimal estimator — in the sense that it achieves the best possible accuracy-confidence tradeoff — and is easy to compute. Unfortunately, the empirical mean is known to be highly suboptimal if we relax the aforementioned modeling assumptions. In this work, we study high-dimensional mean estimation in the high confidence regime when the underlying family F is only assumed to satisfy bounded moment conditions (e.g., finite covariance). Moreover, we relax the "i.i.d. assumption" and aim to obtain estimators that are robust to a constant fraction of adversarial outliers.

Throughout this paper, we focus on the following data contamination model (see, e.g., [DKK+16]):

Definition 1.1 (Strong Contamination Model). Given a parameter 0 < ǫ < 1/2 and a distribution family F on R^d, the adversary operates as follows: The algorithm specifies the number of samples n, and n samples are drawn from some unknown D ∈ F. The adversary is allowed to inspect the samples, remove up to ǫn of them, and replace them with arbitrary points. This modified set of n points is then given as input to the algorithm. We say that a set of samples is ǫ-corrupted if it is generated by the above process.

The parameter ǫ in Definition 1.1 is the fraction of outliers and quantifies the power of the adversary. Intuitively, among our input samples, an unknown (1 − ǫ) fraction are generated from a distribution of interest and are called inliers, and the rest are called outliers.

We note that the strong contamination model is strictly stronger than Huber's contamination model. Recall that in Huber's contamination model [Hub64], the adversary generates samples from a mixture distribution P of the form P = (1 − ǫ)D + ǫN, where D ∈ F is the unknown target distribution and N is an adversarially chosen noise distribution. That is, in Huber's model the adversary is oblivious to the inliers and is only allowed to add outliers.

In the context of robust mean estimation, we want to design an algorithm (estimator) with the following performance: Given any ǫ-corrupted set of n samples from an unknown distribution D ∈ F, the algorithm outputs an estimate µ̂ ∈ R^d of the target mean µ of D such that with high probability the ℓ₂-norm ‖µ̂ − µ‖₂ is small. The ultimate goal is to obtain a computationally efficient estimator with an optimal confidence-accuracy tradeoff. For concreteness, in the ensuing discussion we focus on the case that F is the family of all distributions on R^d with bounded covariance, i.e., any D ∈ F has covariance matrix Σ ⪯ I. (We note that the results of this paper apply to the more general setting where Σ ⪯ σ²I, where σ > 0 is unknown to the algorithm.)

Perhaps surprisingly, even for the special case of ǫ = 0 (i.e., without adversarial contamination), designing an optimal mean estimator in the high-confidence regime is far from trivial. In particular, it is well-known (and easy to see) that the empirical mean achieves a highly sub-optimal rate. A sequence of works in mathematical statistics (see, e.g., [Cat12, Min15, DLLO16, LM19c]) designed novel estimators with improved rates, culminating in an optimal estimator [LM19c].
(See [LM19a] for a survey on the topic.) The estimator of [LM19c] is based on the median-of-means framework and achieves a "subgaussian" performance guarantee:

‖µ̂ − µ‖₂ = O( √(d/n) + √(log(1/τ)/n) ),  (1)

where τ > 0 is the failure probability. The error rate (1) is information-theoretically optimal for any estimator and matches the error rate achieved by the empirical mean on Gaussian data. Unfortunately, the estimator of [LM19c] is not efficiently computable. In particular, known algorithms to compute it have running time exponential in the dimension d. Related works [Min15, PBR19] provide computationally efficient estimators, alas with suboptimal rates. The first polynomial-time algorithm achieving the optimal rate (1) was given in [Hop18], using a convex program derived from the Sums-of-Squares method. Efficient algorithms with improved asymptotic runtimes were subsequently given in [CFB19, DL19, LLVZ19].

We now turn to the outlier-robust setting (ǫ > 0) in the constant confidence regime, i.e., when the failure probability τ is a small universal constant. The statistical foundations of outlier-robust estimation were laid out in early work by the robust statistics community, starting with the pioneering works of [Tuk60] and [Hub64]. For example, the minimax optimal estimator satisfies:

‖µ̂ − µ‖₂ = O( √ǫ + √(d/n) ).  (2)

Until fairly recently, however, all known polynomial-time estimators attained sub-optimal rates. Specifically, even in the limit when n → ∞, known polynomial-time estimators achieved error of O(√(ǫd)), i.e., scaling polynomially with the dimension d. Recent work in computer science, starting with [DKK+16, LRV16],
gave the first efficiently computable outlier-robust estimators for high-dimensional mean estimation. For bounded covariance distributions, [DKK+17, SCV18]
gave efficient algorithms with the right error guarantee of O(√ǫ). Specifically, the filtering algorithm of [DKK+17]
is known to achieve a near-optimal rate of O(√ǫ + √(d log d / n)).

In this paper, we aim to achieve the best of both worlds. In particular, we ask the following question:

Can we design computationally efficient estimators with subgaussian rates and optimal dependence on the contamination parameter ǫ?

Recent work [LM19b] gave an exponential-time estimator with the optimal rate in this setting. Specifically, [LM19b] showed that a multivariate extension of the trimmed mean achieves the optimal error of

‖µ̂ − µ‖₂ = O( √ǫ + √(d/n) + √(log(1/τ)/n) ).  (3)

We note that [LM19b] posed as an open question the existence of a computationally efficient estimator achieving the optimal rate (3). Two recent works [DL19, LLVZ19] gave efficient estimators with subgaussian rates that are outlier-robust in the additive contamination model — a weaker model than that of Definition 1.1. Prior to this work, no polynomial-time algorithm with optimal (or near-optimal) rate was known in the strong contamination model of Definition 1.1. As a corollary of our approach, we answer the question of [LM19b] in the affirmative (see Proposition 1.6). In the following subsection, we describe our results in detail.

At a high level, the main conceptual contribution of this work is in showing that several previously developed computationally efficient algorithms for high-dimensional robust mean estimation achieve near-subgaussian rates, or subgaussian rates after a simple pre-processing. A number of these algorithms are known to succeed under a standard stability condition (Definition 1.2) — a simple deterministic condition on the empirical mean and covariance of a finite point set. We will call such algorithms stability-based.
Our contributions are as follows:
• We show (Theorem 1.4) that given a set of i.i.d. samples from a finite covariance distribution, except with exponentially small failure probability, there exists a large fraction of the samples satisfying the stability condition. As a corollary, it follows (Proposition 1.5) that any stability-based robust mean estimation algorithm achieves optimal error with (near-)subgaussian rates.

• We show an analogous probabilistic result (Theorem 1.8) for known covariance distributions (or, more generally, spherical covariance distributions) with bounded k-th moment, for some k ≥ 4. As a corollary, we obtain that any stability-based robust mean estimator achieves optimal error with (near-)subgaussian rates (Proposition 1.9).

• For the case of finite covariance distributions, we show (Proposition 1.6) that a simple pre-processing step followed by any stability-based robust mean estimation algorithm yields optimal error and subgaussian rates.

To formally state our results, we require some terminology and background.
Basic Notation
For a vector v ∈ R^d, we use ‖v‖₂ to denote its ℓ₂-norm. For a square matrix M, we use tr(M) to denote its trace and ‖M‖₂ to denote its spectral norm. We say a symmetric matrix A is PSD (positive semidefinite) if xᵀAx ≥ 0 for all vectors x. For a PSD matrix M, we use r(M) to denote its stable rank (or intrinsic dimension), i.e., r(M) := tr(M)/‖M‖₂. For two symmetric matrices A and B, we use ⟨A, B⟩ to denote the trace inner product tr(AB), and we say A ⪯ B when B − A is PSD.

We use [n] to denote the set {1, . . . , n} and S^{d−1} to denote the unit sphere in R^d. We use ∆_n to denote the probability simplex on [n], i.e., ∆_n = { w ∈ R^n : w_i ≥ 0, Σ_{i=1}^n w_i = 1 }. For a multiset S = {x_1, . . . , x_n} ⊂ R^d of cardinality n and w ∈ ∆_n, we use µ_w to denote its weighted mean µ_w = Σ_{i=1}^n w_i x_i. Similarly, we use Σ_w to denote its weighted second moment matrix, centered with respect to µ: Σ_w = Σ_{i=1}^n w_i (x_i − µ)(x_i − µ)ᵀ. For a set S ⊂ R^d, we use µ_S = (1/|S|) Σ_{x∈S} x and Σ_S = (1/|S|) Σ_{x∈S} (x − µ)(x − µ)ᵀ to denote the mean and (central) second moment matrix with respect to the uniform distribution on S.

For a set E, we use I(x ∈ E) to denote the indicator function for the event E. For simplicity, we use I(x ≥ t) to denote the indicator function for the event E = {x : x ≥ t}. For a random variable Z, we use V(Z) to denote its variance. We use d_TV(p, q) to denote the total variation distance between distributions p and q.
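For concreteness, the following numpy snippet computes the weighted mean µ_w, the weighted second moment matrix Σ_w (centered at a reference vector µ), and the stable rank r(·). The function names and the example are ours, purely to fix the notation; they are not from the paper.

```python
import numpy as np

def weighted_mean(X, w):
    # mu_w = sum_i w_i x_i for points X (n x d) and weights w on the simplex.
    return X.T @ w

def weighted_second_moment(X, w, mu):
    # Sigma_w = sum_i w_i (x_i - mu)(x_i - mu)^T, centered at a reference mu.
    Xc = X - mu
    return (Xc * w[:, None]).T @ Xc

def stable_rank(M):
    # r(M) = tr(M) / ||M||_2 (intrinsic dimension) for a PSD matrix M.
    return np.trace(M) / np.linalg.norm(M, 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
w = np.full(500, 1 / 500)                 # uniform distribution on [n]
Sigma_w = weighted_second_moment(X, w, np.zeros(10))
print(stable_rank(Sigma_w))               # close to d = 10 for isotropic data
```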
Stability Condition and Robust Mean Estimation

We can now define the stability condition:
Definition 1.2 (see, e.g., [DK19]). Fix 0 < ǫ < 1/2 and δ ≥ ǫ. A finite set S ⊂ R^d is (ǫ, δ)-stable with respect to mean µ ∈ R^d and σ if for every S′ ⊆ S with |S′| ≥ (1 − ǫ)|S|, the following conditions hold: (i) ‖µ_{S′} − µ‖₂ ≤ σδ, and (ii) ‖Σ_{S′} − σ²I‖₂ ≤ σ²δ²/ǫ.

The aforementioned condition, or a variant thereof, is used in every known outlier-robust mean estimation algorithm. Definition 1.2 requires that after restricting to a (1 − ǫ)-density subset S′, the sample mean of S′ is within σδ of µ, and the sample variance of S′ is σ²(1 ± δ²/ǫ) in every direction. (We note that Definition 1.2 is intended for distributions with covariance Σ ⪯ σ²I.) We will omit the parameters µ and σ when they are clear from context. In particular, our proofs will focus on the case σ = 1, which can be achieved by scaling the datapoints appropriately.
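Since Definition 1.2 quantifies over all large subsets, it can be verified directly only for very small point sets. The following brute-force Python sketch makes the two conditions concrete; it is a sanity check of the definition, not an algorithm from the paper, and the function name, example data, and parameter values are our own illustrative choices.

```python
import itertools
import numpy as np

# Brute-force check of Definition 1.2 for a tiny point set.
# Exponential in n, so this only illustrates the definition.
def is_stable(X, mu, sigma, eps, delta):
    n, d = X.shape
    m = int(np.ceil((1 - eps) * n))
    for size in range(m, n + 1):
        for idx in itertools.combinations(range(n), size):
            Xs = X[list(idx)]
            # condition (i): the restricted mean stays sigma*delta-close to mu
            if np.linalg.norm(Xs.mean(axis=0) - mu) > sigma * delta:
                return False
            # condition (ii): restricted second moment (centered at mu) is
            # sigma^2 (1 +- delta^2/eps) in every direction
            cov_s = (Xs - mu).T @ (Xs - mu) / size
            if np.linalg.norm(cov_s - sigma**2 * np.eye(d), 2) > sigma**2 * delta**2 / eps:
                return False
    return True

rng = np.random.default_rng(1)
X = rng.normal(size=(12, 2))  # 12 standard Gaussian points in R^2
print(is_stable(X, mu=np.zeros(2), sigma=1.0, eps=0.2, delta=1.0))
```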
A number of known algorithmic techniques previously used for robust mean estimation, including convex programming based methods [DKK+16, SCV18, CDG18], iterative filtering [DKK+16, DKK+17, DHL19],
and even first-order methods [CDGS20], are known to succeed under the stability condition. Specifically, prior work has established the following theorem:
Theorem 1.3 (Robust Mean Estimation Under Stability, see, e.g., [DK19]). Let T ⊂ R^d be an ǫ-corrupted version of a set S with the following property: S contains a subset S′ ⊆ S such that |S′| ≥ (1 − ǫ)|S| and S′ is (Cǫ, δ)-stable with respect to µ ∈ R^d and σ, for a sufficiently large constant C > 0. Then there is a polynomial-time algorithm that, on input ǫ and T, computes µ̂ such that ‖µ̂ − µ‖₂ = O(σδ).

We note in particular that the iterative filtering algorithm
[DKK+17, DK19] (see also Section 2.4.3 of [DK19]) is a very simple and practical stability-based algorithm. While previous works made the assumption that the upper bound parameter σ is known to the algorithm, we point out in Appendix A.2 that essentially the same algorithm and analysis work for unknown σ as well.
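For intuition, here is a minimal Python sketch of the filtering idea that stability-based algorithms build on: repeatedly remove the points contributing most to the variance in the worst direction, until the empirical covariance certifies stability. The stopping constant, removal fraction, and overall structure are simplifying assumptions of ours; the algorithms analyzed in the paper (and their randomized removal rules) differ in the details.

```python
import numpy as np

# Simplified threshold-based filtering sketch (constants are illustrative).
def filter_mean(X, eps, sigma=1.0):
    X = X.copy()
    for _ in range(int(2 * eps * len(X)) + 1):
        mu = X.mean(axis=0)
        Xc = X - mu
        cov = Xc.T @ Xc / len(X)
        eigvals, eigvecs = np.linalg.eigh(cov)
        lam, v = eigvals[-1], eigvecs[:, -1]      # top eigenpair
        if lam <= sigma**2 * (1 + 10 * eps):      # covariance looks "stable"
            return mu
        scores = (Xc @ v) ** 2                    # variance contribution along v
        k = max(1, int(eps * len(X) / 10))        # drop the worst few points
        X = X[np.argsort(scores)[:-k]]
    return X.mean(axis=0)
```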
Our Results

Our first main result establishes the stability of a subset of i.i.d. points drawn from a distribution with bounded covariance.
Theorem 1.4.
Fix any 0 < τ < 1. Let S be a multiset of n i.i.d. samples from a distribution on R^d with mean µ and covariance Σ. Let ǫ′ = O(log(1/τ)/n + ǫ) ≤ c, for a sufficiently small constant c > 0. Then, with probability at least 1 − τ, there exists a subset S′ ⊆ S such that |S′| ≥ (1 − ǫ′)n and S′ is (2ǫ′, δ)-stable with respect to µ and σ² = ‖Σ‖₂, where

δ = O( √((r(Σ) log r(Σ))/n) + √ǫ + √(log(1/τ)/n) ).

We note here that the restriction ǫ′ = O(1) in the theorem statement is information-theoretically required [DLLO16]. Theorem 1.4 significantly improves the probabilistic guarantees in prior work on robust mean estimation; this includes the resilience condition of [SCV18, ZJS19] and the goodness condition of [DHL19]. As a corollary, it follows that any stability-based algorithm for robust mean estimation achieves near-subgaussian rates.

Proposition 1.5.
Let T be an ǫ-corrupted set of n samples from a distribution on R^d with mean µ and covariance Σ. Let ǫ′ = O(log(1/τ)/n + ǫ) ≤ c be given, for a constant c > 0. Then any stability-based algorithm, on input T and ǫ′, efficiently computes µ̂ such that with probability at least 1 − τ, we have

‖µ̂ − µ‖₂ = O( √((tr(Σ) log r(Σ))/n) + √(‖Σ‖₂ ǫ) + √(‖Σ‖₂ log(1/τ)/n) ).

We note that the above error rate is minimax optimal in both ǫ and τ. In particular, the term √(log(1/τ)/n) is additive as opposed to multiplicative. The first term is near-optimal, up to the √(log r(Σ)) factor, which is at most √(log d) (recall that r(Σ) denotes the stable rank of Σ, i.e., r(Σ) = tr(Σ)/‖Σ‖₂). Prior to this work, the existence of a polynomial-time algorithm achieving the above near-subgaussian rate in the strong contamination model was open. Proposition 1.5 shows that any stability-based algorithm suffices for this purpose; in particular, it implies that the iterative filtering algorithm [DK19] achieves this rate as is.

Given the above, a natural question is whether stability-based algorithms achieve subgaussian rates exactly, i.e., whether they match the optimal bound (3) attained by the non-constructive estimator of [LM19b]. While the answer to this question remains open, we show that after a simple pre-processing of the data, stability-based estimators are indeed subgaussian.

The pre-processing step follows the median-of-means principle [NU83, JVV86, AMS99]. Given a multiset of n points x_1, . . . , x_n in R^d and k ∈ [n], we proceed as follows:
1. First randomly bucket the data into k disjoint buckets of equal size (if k does not divide n, remove some samples) and compute their empirical means z_1, . . . , z_k.
2. Output an (appropriately defined) multivariate median of z_1, . . . , z_k.
Notably, for the case of ǫ = 0, all known efficient mean estimators with subgaussian rates use the median-of-means framework [Hop18, DL19, CFB19, LLVZ19].

To obtain the desired computationally efficient robust mean estimators with subgaussian rates, we proceed as follows:
1. Given a multiset S of n ǫ-corrupted samples, randomly group the data into k = ⌊ǫ′n⌋ disjoint buckets, where ǫ′ = O(log(1/τ)/n + ǫ), and let z_1, . . . , z_k be the corresponding empirical means of the buckets.
2. Run any stability-based robust mean estimator on input {z_1, . . . , z_k}.
Specifically, we show:

Proposition 1.6 (informal). Consider the same setting as in Proposition 1.5. Let k = ⌊ǫ′n⌋ and let z_1, . . . , z_k be the points after the median-of-means pre-processing on the corrupted set T. Then any stability-based algorithm, on input {z_1, . . . , z_k}, computes µ̂ such that with probability at least 1 − τ, it holds

‖µ̂ − µ‖₂ = O( √(tr(Σ)/n) + √(‖Σ‖₂ ǫ) + √(‖Σ‖₂ log(1/τ)/n) ).

Proposition 1.6 yields the first computationally efficient algorithm with subgaussian rates in the strong contamination model, answering the open question of [LM19b]. To prove Proposition 1.6, we establish a connection between the median-of-means principle and stability. In particular, we show that the key probabilistic lemma from the median-of-means literature [LM19c, DL19] also implies stability.
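The two-step estimator described above is short enough to state in code. In the following Python sketch, `robust_mean` is a placeholder for any stability-based robust mean estimator (for instance, a filtering routine like the one sketched earlier); its name, its signature, and the numeric constants are our illustrative assumptions, not the paper's.

```python
import numpy as np

def mom_preprocess(X, k, rng):
    # Randomly bucket the n points into k disjoint equal-size buckets
    # (dropping n mod k leftover points) and return the bucket means.
    n = len(X)
    idx = rng.permutation(n)[: (n // k) * k]
    buckets = idx.reshape(k, n // k)
    return X[buckets].mean(axis=1)            # z_1, ..., z_k

def subgaussian_robust_mean(X, eps, tau, robust_mean, rng=None):
    rng = rng or np.random.default_rng()
    n = len(X)
    eps_prime = 4 * eps + np.log(1 / tau) / n  # eps' = O(eps + log(1/tau)/n)
    k = max(1, int(eps_prime * n))
    Z = mom_preprocess(X, k, rng)
    # Feed the bucket means to any stability-based estimator, run with a
    # constant contamination parameter (illustrative value).
    return robust_mean(Z, 0.05)
```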
Theorem 1.7 (informal). Consider the setting of Theorem 1.4 and set k = ⌊ǫ′n⌋. The set {z_1, . . . , z_k}, with probability 1 − τ, contains a subset of size at least 0.99k which is (0.01, δ)-stable with respect to µ and σ² = k‖Σ‖₂/n, where δ = O(√(r(Σ)/k) + 1).

A drawback of the median-of-means framework is that the error dependence on ǫ does not improve if we impose stronger assumptions on the distribution. Even if the underlying distribution is an identity covariance Gaussian, the error rate would scale as O(√ǫ), whereas stability-based algorithms achieve error of O(ǫ√(log(1/ǫ))) [DKK+16]. Motivated by this, we also consider distributions with bounded central moments: we say that a distribution has bounded k-th central moment σ_k if for all unit vectors v, it holds that (E(vᵀ(X − µ))^k)^{1/k} ≤ σ_k (E(vᵀ(X − µ))²)^{1/2}. For such distributions, we establish the following stronger stability condition.
Let S be a multiset of n i.i.d. samples from a distribution on R^d with mean µ, covariance Σ = I, and bounded k-th central moment σ_k, for some k ≥ 4. Let ǫ′ = O(log(1/τ)/n + ǫ) ≤ c, for a sufficiently small constant c > 0. Then, with probability at least 1 − τ, there exists a subset S′ ⊆ S such that |S′| ≥ (1 − ǫ′)n and S′ is (2ǫ′, δ)-stable with respect to µ and σ = 1, where

δ = O( √(d log d / n) + σ_k ǫ^{1−1/k} + √(log(1/τ)/n) ).
Proposition 1.9.
Let T be an ǫ-corrupted set of n points from a distribution on R^d with mean µ, covariance σ²I, and bounded k-th central moment σ_k, for some k ≥ 4. Let ǫ′ = O(log(1/τ)/n + ǫ) ≤ c be given, for some c > 0. Then any stability-based algorithm, on input T and ǫ′, efficiently computes µ̂ such that with probability at least 1 − τ, we have

‖µ̂ − µ‖₂ = O( σ ( √(d log d / n) + σ_k ǫ^{1−1/k} + √(log(1/τ)/n) ) ).
We note that the above error rate is near-optimal up to the √(log d) factor and the dependence on σ_k. Prior to this work, no polynomial-time estimator achieving this rate was known. Finally, recent computational hardness results [HL19] suggest that the assumption on the covariance above is inherent to obtaining computationally efficient estimators with error rate better than Ω(√ǫ), even in the constant confidence regime.

Related Work

Since the initial works [DKK+16, LRV16],
there has been an explosion of research activity on algorithmic aspects of outlier-robust high-dimensional estimation by several communities. See, e.g., [DK19] for a recent survey on the topic. In the context of outlier-robust mean estimation, a number of works [DKK+17, SCV18, CDG18, DHL19]
have obtained efficient algorithms under various assumptions on the distribution of the inliers. Notably, efficient high-dimensional outlier-robust mean estimators have been used as primitives for robustly solving machine learning tasks that can be expressed as stochastic optimization problems [PSBR18, DKK+19].
As noted above, our results require a robust mean estimation algorithm that succeeds under the stability condition; in particular, the iterative filtering algorithm [DKK+17, DK19] achieves this guarantee.
In Section 2, we prove Theorem 1.4 that establishes the stability of points sampled from a finitecovariance distribution. In Section 3, we establish the connection between median-of-means principleand stability to prove Theorem 1.7. Finally, Section 4 contains our results for distributions withidentity covariance and finite central moments.
Problem Setting
Consider a distribution P on R^d with unknown mean µ and unknown covariance Σ. We first note that it suffices to consider distributions with ‖Σ‖₂ = 1. Note that for covariance matrices Σ with ‖Σ‖₂ = 1, we have r(Σ) = tr(Σ). In the remainder of this section, we will thus establish (ǫ, δ)-stability with respect to µ and σ = 1, where δ = O(√(tr(Σ) log r(Σ) / n) + √ǫ + √(log(1/τ)/n)).

Let S be a multiset of n i.i.d. samples from P. For ease of exposition, we will assume that the support of P is bounded, i.e., for each i, ‖x_i − µ‖₂ = O(√(tr(Σ)/ǫ)) almost surely. As we show in Section 2.3, we can simply treat the points violating this condition as outliers.

We first relax the conditions for stability in Definition 1.2 in the following Claim 2.1, proved in Appendix D.1, at an additional cost of O(√ǫ).

Claim 2.1 (Stability for bounded covariance). Let R ⊂ R^d be a finite multiset such that ‖µ_R − µ‖₂ ≤ δ and ‖Σ_R − I‖₂ ≤ δ²/ǫ for some 0 ≤ ǫ ≤ δ². Then R is (O(ǫ), δ′)-stable with respect to µ (and σ = 1), where δ′ = O(δ + √ǫ).

Given Claim 2.1, our goal in proving Theorem 1.4 is to show that with probability 1 − τ, there exists a set S′ ⊆ S such that |S′| ≥ (1 − ǫ′)n, ‖µ_{S′} − µ‖₂ ≤ δ, and ‖Σ_{S′} − I‖₂ ≤ δ²/ǫ′, for some value of δ = O(√(tr(Σ) log r(Σ) / n) + √ǫ + √(log(1/τ)/n)) and ǫ′ = O(ǫ + log(1/τ)/n).

We first remark that the original set S of n i.i.d. data points does not satisfy either of the conditions in Claim 2.1. It does not satisfy the first condition because the sample mean is highly sub-optimal for heavy-tailed data [Cat12]. For the second condition, we note that the known concentration results for Σ_S are not sufficient. For example, consider the case of Σ = I in the parameter regime of ǫ, τ, and n such that ǫ = O(log(1/τ)/n) and n = Ω(d log d / ǫ), so that δ = O(√ǫ). For S to be (ǫ, δ)-stable, we require that ‖Σ_S − I‖₂ = O(1) with probability 1 − τ. However, the matrix Chernoff bound (see, e.g., [Tro15, Theorem 5.1.1]) only guarantees that with probability at least 1 − τ, ‖Σ_S − I‖₂ = Õ(d).

The rest of this section is devoted to showing that, with high probability, it is possible to remove ǫ′n points from S such that both conditions in Claim 2.1 are satisfied for the resulting subset.

As a first step, we show that it is possible to remove an ǫ-fraction of points so that the second moment matrix concentrates. Since finding a subset is a discrete optimization problem, we first perform a continuous relaxation: instead of finding a large subset, we find a suitable distribution on the points. Define the following set of distributions:

∆_{n,ǫ} = { w ∈ R^n : 0 ≤ w_i ≤ 1/((1 − ǫ)n), Σ_{i=1}^n w_i = 1 }.

Note that ∆_{n,ǫ} is the convex hull of the uniform distributions on the sets S′ ⊆ S with |S′| ≥ (1 − ǫ)n. In Appendix D.2, we show how to recover a subset S′ from such a w. Although we use the set ∆_{n,ǫ} for the sole purpose of theoretical analysis, this object has also been useful in the design of computationally efficient algorithms [DKK+16, DK19].
We will now show that, with high probability, there exists a w ∈ ∆_{n,ǫ} such that Σ_w has small spectral norm.

Our proof technique has three main ingredients: (i) minimax duality, (ii) truncation, and (iii) concentration of truncated empirical processes. Let M be the set of all PSD matrices with trace norm 1, i.e., M = { M : M ⪰ 0, tr(M) = 1 }. Using minimax duality [Sio58] and the variational characterization of the spectral norm, we obtain the following reformulation:

min_{w ∈ ∆_{n,ǫ}} ‖Σ_w − I‖₂ = min_{w ∈ ∆_{n,ǫ}} max_{M ∈ M} ⟨ Σ_{i=1}^n w_i (x_i − µ)(x_i − µ)ᵀ − I, M ⟩ = max_{M ∈ M} min_{w ∈ ∆_{n,ǫ}} ⟨ Σ_{i=1}^n w_i (x_i − µ)(x_i − µ)ᵀ − I, M ⟩.  (4)

This dual reformulation plays a fundamental role in our analysis. Lemma 2.2 below, proved in Appendix C.2, states that, with high probability, all the terms in the dual reformulation are bounded.
Lemma 2.2. Let x_1, . . . , x_n be n i.i.d. points from a distribution on R^d with mean µ and covariance Σ ⪯ I. Let Q = O(1/√ǫ + (1/ǫ)√(tr(Σ)/n)). For M ∈ M, let S_M = { i ∈ [n] : (x_i − µ)ᵀM(x_i − µ) ≤ Q² }. Let E be the event E = { ∀ M ∈ M : |S_M| ≥ (1 − ǫ)n }. There exists a constant c > 0 such that the event E happens with probability at least 1 − exp(−cǫn).

Note that for n = Ω(tr(Σ)/ǫ) samples, the threshold Q is O(1/√ǫ). Approximating the empirical process in Eq. (4) with a truncated process allows us to use the powerful inequality for the concentration of bounded empirical processes due to Talagrand [Tal96]. Formally, we show the following lemma:
Lemma 2.3. Let x_1, . . . , x_n be n i.i.d. points from a distribution on R^d with mean µ and covariance Σ ⪯ I. Further assume that for each i, ‖x_i − µ‖₂ = O(√(tr(Σ)/ǫ)) almost surely. There exist constants c, c′ > 0 such that for ǫ ∈ (0, c′), with probability 1 − exp(−cnǫ), we have that

min_{w ∈ ∆_{n,ǫ}} ‖Σ_w − I‖₂ ≤ δ²/ǫ,  where δ = O( √((tr(Σ) log r(Σ))/n) + √ǫ ).

Proof. Throughout the proof, assume that the event E from Lemma 2.2 holds. Without loss of generality, also assume that µ = 0. Let f : R⁺ → R⁺ be the following function:

f(x) := x if x ≤ Q², and f(x) := Q² otherwise.  (5)

It follows directly that f is 1-Lipschitz and 0 ≤ f(x) ≤ x. Using minimax duality,

min_{w ∈ ∆_{n,ǫ}} ‖Σ_w − I‖₂ = max_{M ∈ M} min_{w ∈ ∆_{n,ǫ}} ( Σ_i w_i x_iᵀM x_i − 1 ) ≤ max_{M ∈ M} Σ_{i=1}^n f(x_iᵀM x_i)/((1 − ǫ)n) − 1,

where the inequality uses that, on the event E, for every M ∈ M the set S_M = { i ∈ [n] : x_iᵀM x_i ≤ Q² } has cardinality at least (1 − ǫ)n, and thus the uniform distribution on S_M belongs to ∆_{n,ǫ}. Define the following empirical processes R and R′:

R = sup_{M ∈ M} Σ_{i=1}^n f(x_iᵀM x_i),  R′ = sup_{M ∈ M} Σ_{i=1}^n ( f(x_iᵀM x_i) − E f(x_iᵀM x_i) ).

As 0 ≤ f(x) ≤ x, we have 0 ≤ E f(x_iᵀM x_i) ≤ 1, which gives |R − R′| ≤ n. Overall, we obtain the following bound:

min_{w ∈ ∆_{n,ǫ}} ‖Σ_w − I‖₂ ≤ R/((1 − ǫ)n) − 1 ≤ (R′ + ǫn)/((1 − ǫ)n) ≤ 2R′/n + 2ǫ.

Note that 2ǫ = O(δ²/ǫ). We now apply Talagrand's concentration inequality to R′, as each term is bounded by Q². We defer the details to Lemma 2.4 below, which shows that R′/n = O(δ²/ǫ) with probability 1 − exp(−cnǫ). By taking a union bound, we get that both R′/n = O(δ²/ǫ) and E hold with high probability.
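For a fixed M ∈ M, the inner minimization in Eq. (4) is a linear program over ∆_{n,ǫ} whose optimum simply places the maximal allowed weight 1/((1 − ǫ)n) on the (1 − ǫ)n points with the smallest scores x_iᵀM x_i; these are exactly the points that the truncation f in Eq. (5) effectively keeps. A small numerical illustration (assuming for simplicity that (1 − ǫ)n is rounded down to an integer; all names are ours):

```python
import numpy as np

def inner_min(scores, eps):
    # min over Delta_{n,eps} of sum_i w_i * scores[i]: uniform weight
    # 1/((1-eps)n) on the (1-eps)n smallest scores is optimal.
    m = int(np.floor((1 - eps) * len(scores)))
    return np.sort(scores)[:m].mean()

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 5))
M = np.eye(5) / 5                          # a PSD matrix with trace 1
scores = np.einsum('ij,jk,ik->i', X, M, X) # x_i^T M x_i for each i
print(inner_min(scores, eps=0.1) - 1)      # dual term <Sigma_w - I, M>
```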
We now provide the details of the concentration of the empirical process R′ that were omitted above.

Lemma 2.4. Consider the setting in the proof of Lemma 2.3. Then, with probability 1 − exp(−nǫ), R′/n ≤ δ²/ǫ, where δ = O( √((tr(Σ) log r(Σ))/n) + √ǫ ).

Proof. We will apply Talagrand's concentration inequality for bounded empirical processes, see Theorem B.1. We first calculate the wimpy variance σ̄² required in Theorem B.1:

σ̄² = sup_{M ∈ M} Σ_{i=1}^n V( f(x_iᵀM x_i) ) ≤ sup_{M ∈ M} Σ_{i=1}^n E ( f(x_iᵀM x_i) )² ≤ sup_{M ∈ M} Σ_{i=1}^n Q² E f(x_iᵀM x_i) ≤ nQ²,

where we use that f(x) ≤ Q², f(x) ≤ x, and E x_iᵀM x_i ≤ 1. We now focus our attention on E R′. Let ξ_1, . . . , ξ_n be n i.i.d. Rademacher random variables, independent of x_1, . . . , x_n. We use the contraction and symmetrization properties of Rademacher averages [LT91, BLM13] to get

E R′ = E sup_{M ∈ M} Σ_{i=1}^n ( f(x_iᵀM x_i) − E f(x_iᵀM x_i) ) ≤ 2 E sup_{M ∈ M} Σ_{i=1}^n ξ_i f(x_iᵀM x_i) ≤ 2 E sup_{M ∈ M} Σ_{i=1}^n ξ_i x_iᵀM x_i = 2 E ‖ Σ_{i=1}^n ξ_i x_i x_iᵀ ‖₂ = O( √(n tr(Σ) log r(Σ) / ǫ) + tr(Σ) log r(Σ) / ǫ ),

where the last step uses the refined version of the matrix Bernstein inequality [Min17], stated in Theorem B.2, with L = O(tr(Σ)/ǫ).

Note that each term of the empirical process R′ is bounded by Q². By applying Talagrand's concentration inequality for bounded empirical processes (Theorem B.1), with probability at least 1 − exp(−nǫ), we have

R′ = O( E R′ + √(nQ²) √(nǫ) + Q² nǫ ),

which implies

R′/n = O( tr(Σ) log r(Σ)/(nǫ) + √(tr(Σ) log r(Σ)/(nǫ)) + Q√ǫ + ǫQ² )
     = (1/ǫ) O( tr(Σ) log r(Σ)/n + √(tr(Σ) log r(Σ)/n) √ǫ + Qǫ√ǫ + (ǫQ)² )
     = (1/ǫ) O( ( √(tr(Σ) log r(Σ)/n) + √ǫ + ǫQ )² ) = δ²/ǫ,

where δ = O(√(tr(Σ) log r(Σ)/n) + √ǫ + ǫQ) = O(√(tr(Σ) log r(Σ)/n) + √ǫ), using the fact that ǫQ = O(√ǫ + √(tr(Σ)/n)).

Suppose u* ∈ ∆_{n,ǫ} achieves the minimum in Lemma 2.3, i.e., ‖Σ_{u*} − I‖₂ ≤ δ²/ǫ. It is not necessary that ‖µ_{u*} − µ‖₂ ≤ δ. Recall that our aim is to find a w that satisfies both conditions: (i) ‖µ_w − µ‖₂ ≤ δ, and (ii) ‖Σ_w − I‖₂ ≤ δ²/ǫ. Given u*, we will remove an additional O(ǫ)-fraction of probability mass from u* to obtain a w ∈ ∆_n such that ‖µ_w − µ‖₂ ≤ δ. For u ∈ ∆_n, consider the following set of distributions:

∆_{n,ǫ,u} = { w : Σ_{i=1}^n w_i = 1, w_i ≤ u_i/(1 − ǫ) }.

For any w ∈ ∆_{n,ǫ,u*}, we directly obtain that Σ_w ⪯ Σ_{u*}/(1 − ǫ). Our main result in this subsection is that, with high probability, there exists a w* ∈ ∆_{n,4ǫ,u*} such that ‖µ_{w*} − µ‖₂ ≤ δ. We first prove an intermediate result, Lemma 2.5 below, which uses the truncation (Lemma 2.2) and simplifies the constraint set ∆_{n,4ǫ,u*}. Let g : R → R be the following thresholding function:

g(x) = x if x ∈ [−Q, Q],  g(x) = Q if x > Q,  g(x) = −Q if x < −Q.  (6)

Lemma 2.5. Let w ∈ ∆_{n,ǫ} for some ǫ ≤ 1/4. Suppose that the following event E holds:

E := { sup_{M ∈ M} |{ i : (x_i − µ)ᵀM(x_i − µ) ≥ Q² }| ≤ ǫn }.

For a unit vector v, let S_v ⊆ [n] be the following set: S_v = { i ∈ [n] : |vᵀ(x_i − µ)| ≤ Q }. For a unit vector v, let w(v) be the following distribution:

w̃(v)_i := min( w_i, I(i ∈ S_v)/|S_v| ),  w(v) := w̃(v)/‖w̃(v)‖₁.  (7)

Let g(·) be defined as in Eq. (6). Then, for all unit vectors v, w(v) ∈ ∆_{n,4ǫ,w}.
Moreover, the following inequalities hold:

| Σ_{i=1}^n w(v)_i vᵀ(x_i − µ) | ≤ O(ǫQ) + | Σ_{i ∈ S_v} vᵀ(x_i − µ) | / |S_v| ≤ O(ǫQ) + | Σ_{i ∈ [n]} g(vᵀ(x_i − µ)) | / ((1 − ǫ)n).
Proof. On the event E, we have that |S_v| ≥ (1 − ǫ)n for all v ∈ S^{d−1}. In order to show that w(v) ∈ ∆_{n,4ǫ,w}, it suffices to show that for all v, w(v)_i ≤ w_i/(1 − 4ǫ). By the definition of w(v)_i, it is sufficient to show that ‖w̃(v)‖₁ ≥ 1 − 4ǫ. Let u_S and u_{S_v} denote the uniform distributions on S and S_v, respectively. First note that

d_TV(w, u_{S_v}) ≤ d_TV(w, u_S) + d_TV(u_S, u_{S_v}) ≤ ǫ/(1 − ǫ) + ǫ/(1 − ǫ) ≤ 2ǫ/(1 − ǫ) ≤ 4ǫ.  (8)

We now use the alternative characterization of the total variation distance (see, e.g., [Tsy08, Lemma 2.1]):

d_TV(p, q) = (1/2) Σ_{i=1}^n |p_i − q_i| = 1 − Σ_{i=1}^n min(p_i, q_i).

Observe that w̃(v) = min(w, u_{S_v}) coordinate-wise; combining this observation with Eq. (8), we get the following lower bound on ‖w̃(v)‖₁:

‖w̃(v)‖₁ = 1 − d_TV(w, u_{S_v}) ≥ 1 − 4ǫ.

This concludes that w(v) ∈ ∆_{n,4ǫ,w}. We now focus our attention on the second part of the lemma statement. The first inequality follows from the facts that the distributions w(v) and u_{S_v} have total variation distance at most 4ǫ and that the projections vᵀ(x_i − µ) are supported on [−Q, Q] for i ∈ S_v. The second inequality follows from the facts that (i) |S_v| ≥ (1 − ǫ)n, (ii) g(·) is the identity on S_v and bounded by Q in absolute value outside [−Q, Q], and (iii) at most an ǫ-fraction of the points lie outside S_v. This completes the proof.

Using Lemma 2.5, we prove the following:
Lemma 2.6. Let x_1, . . . , x_n be n i.i.d. points from a distribution on R^d with mean µ and covariance Σ ⪯ I. Let 0 < ǫ < 1/4 and u ∈ ∆_{n,ǫ}. Then, for a constant c > 0, the following holds with probability 1 − exp(−cnǫ):

min_{w ∈ ∆_{n,4ǫ,u}} ‖µ_w − µ‖₂ ≤ δ,  where δ = O( √ǫ + √(tr(Σ)/n) ).
At a high level, the proof of Lemma 2.6 proceeds as follows: We use duality and the variational characterization of the ℓ₂ norm to reduce our problem to an empirical process over projections. We then use Lemma 2.5 to simplify the domain constraint ∆_{n,4ǫ,u*} and obtain a bounded empirical process, with an overhead of O(ǫQ) = O(δ).

Proof (of Lemma 2.6). Let ∆ denote the set ∆_{n,4ǫ,u} and assume that µ = 0 without loss of generality. On the event E (defined in Lemma 2.2), using minimax duality and Lemma 2.5, we get

min_{w ∈ ∆} max_{v ∈ S^{d−1}} Σ_{i=1}^n w_i x_iᵀv = max_{v ∈ S^{d−1}} min_{w ∈ ∆} Σ_{i=1}^n w_i x_iᵀv ≤ O(ǫQ) + max_{v ∈ S^{d−1}} | Σ_{i ∈ [n]} g(vᵀx_i)/n |.  (9)

We define the following empirical processes:

N = sup_{v ∈ S^{d−1}} Σ_{i=1}^n g(vᵀx_i),  N′ = sup_{v ∈ S^{d−1}} Σ_{i=1}^n ( g(vᵀx_i) − E g(vᵀx_i) ).

As g(·) is an odd function and S^{d−1} is a symmetric set, both N and N′ are non-negative. For any v ∈ S^{d−1}, note that vᵀx has variance at most 1 and P(|vᵀx| ≥ Q) = O(ǫ). We can thus bound E g(vᵀx) by O(√ǫ) = O(ǫQ) (see Proposition B.3). This gives us that |N − N′| = O(nǫQ). Using the variational form of the ℓ₂ norm with Eq. (9) leads to the following inequality in terms of N′:

min_{w ∈ ∆} ‖µ_w‖₂ = max_{v ∈ S^{d−1}} min_{w ∈ ∆} Σ_{i=1}^n w_i x_iᵀv ≤ O(ǫQ) + N/((1 − ǫ)n) = O(ǫQ) + 2N′/n.

Note that the term ǫQ is small, as ǫQ = O(δ). As N′ is a bounded empirical process, with bound Q, we can apply Talagrand's concentration inequality. We defer the details to Lemma 2.7 below, which shows that N′/n = O(√(tr(Σ)/n) + √ǫ) = O(δ). Taking a union bound over the concentration of N′ and the event E, we get that the desired result holds with high probability.
Lemma 2.7. Consider the setting in Lemma 2.6. Then, with probability 1 − exp(−nǫ), N′/n = O( √(tr(Σ)/n) + √ǫ ).

Proof. We will use Talagrand's concentration inequality for bounded empirical processes, stated in Theorem B.1. We first calculate the wimpy variance required for Theorem B.1:

σ̄² = sup_{v ∈ S^{d−1}} Σ_{i=1}^n V( g(x_iᵀv) ) ≤ sup_{v ∈ S^{d−1}} Σ_{i=1}^n E g²(vᵀx_i) ≤ sup_{v ∈ S^{d−1}} n E (vᵀx_i)² ≤ n.  (10)

We also bound the quantity E N′ using the symmetrization and contraction properties of Rademacher averages [LT91, BLM13]. Letting ξ_1, . . . , ξ_n be i.i.d. Rademacher random variables, independent of x_1, . . . , x_n, we have

E N′ = E sup_{v ∈ S^{d−1}} Σ_{i=1}^n ( g(vᵀx_i) − E g(vᵀx_i) ) ≤ 2 E sup_{v ∈ S^{d−1}} Σ_{i=1}^n ξ_i g(vᵀx_i) ≤ 2 E sup_{v ∈ S^{d−1}} Σ_{i=1}^n ξ_i vᵀx_i = 2 E ‖ Σ_{i=1}^n ξ_i x_i ‖₂ ≤ 2√(n tr(Σ)),

where the last step uses that ξ_i x_i has covariance Σ. By applying Talagrand's concentration inequality for bounded empirical processes (Theorem B.1), we get that with probability at least 1 − exp(−nǫ),

N′/n = O( E N′/n + √ǫ + Qǫ ) = O( √(tr(Σ)/n) + √ǫ ).
Lemma 2.8.
For ǫ ≤ 1/4, let w* ∈ ∆_{n,ǫ} be such that, for some ǫ ≤ δ², we have (i) ‖µ_{w*} − µ‖₂ ≤ δ and (ii) ‖Σ_{w*} − I‖₂ ≤ δ²/ǫ. Then there exists a subset S′ ⊆ S such that:
1. |S′| ≥ (1 − ǫ)|S|.
2. S′ is (ǫ′, δ′)-stable with respect to µ and σ = 1, where δ′ = O(δ + √ǫ + √ǫ′).
Proof Sketch of Theorem 1.4
By Lemma 2.3, we get that there exists a u ∗ ∈ ∆ n,ǫ such that k Σ u ∗ − I k ≤ δ /ǫ . Applying Lemma 2.6 with this u ∗ , we get that there exists a w ∗ ∈ ∆ n, ǫ,u ∗ suchthat k µ u ∗ − µ k ≤ δ . v T Σ w ∗ v ≤ (1 / (1 − ǫ )) v T Σ u ∗ v = O ( δ /ǫ ) , for small enough ǫ . To obtain adiscrete set, we show that rounding w ∗ to a discrete set only leads to slightly worse constants.We are now ready to prove our main theorem, which we restate for completeness. Theorem 2.9. (Theorem 1.4) Let x , . . . , x n be n i.i.d. points in R d from a distribution with mean µ and covariance Σ . Let ǫ ′ = O (log(1 /τ ) /n + ǫ ) ≤ c for a sufficiently small positive constant c .Then, with probability at least − τ , there exists a subset S ′ ⊆ S s.t. | S | ′ ≥ (1 − ǫ ′ ) n and | S ′ | is ( Cǫ ′ , δ ) -stable with respect to µ and k Σ k with δ = O ( p (r(Σ) log r(Σ)) /n + √ Cǫ ′ ) .Proof. Note that we can assume without loss of generality that µ = 0 and k Σ k = 1 , upper bound δ by δ = O ( p tr(Σ) log(r(Σ)) /n + √ Cǫ ′ ) ; othwerwise, apply the following arguments to the randomvariable ( x i − µ ) / p k Σ k (the result holds trivially if k Σ k = 0 ).We first prove a simpler version of the theorem for distributions with bounded support. Thereason we make this assumption is to apply the matrix concentration results in Theorem B.2. Base case: Bounded support
Assume that k x i − µ k = O ( p tr(Σ) /ǫ ′ ) almost surely.Note that the bounded support assumption allows us to apply Lemma 2.3. Set ˜ ǫ = ǫ ′ /c ′ fora large constant c ′ to be determined later. Let u ∗ ∈ ∆ n, ˜ ǫ achieve the minimum in Lemma 2.3.For this u ∗ , let w ∗ ∈ ∆ n, ǫ,u ∗ be the distribution achieving the minimum in Lemma 2.6. Notethat the probability of error is at most − Ω( n ˜ ǫ )) . We can choose ǫ ′ large enough, ˜ ǫ = ǫ ′ /c =Ω(log(1 /τ ) /n ) , so that the probability of failure is at most − τ . Let δ = C p tr(Σ) log r(Σ) /n + C √ ˜ ǫ for a large enough constant C to be determined later. We first look at the variance of w ∗ using theguarantee of u ∗ in Lemma 2.3: n X i =1 w ∗ i x i x Ti (cid:22) n X i =1 − ǫ ′ u ∗ i x i x Ti (cid:22) n X i =1 u ∗ i x i x Ti ≤ ǫ ( C p tr(Σ) log r(Σ) /n + C √ ˜ ǫ ) . (11)By choosing C to be a large enough constant, we get that k P ni =1 w ∗ x i x Ti − I k ≤ δ / ˜ ǫ . Now, welook at the mean. Lemma 2.6 states that (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X i =1 w ∗ x i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) = O √ ˜ ǫ + C r tr(Σ) n ! ≤ δ. (12)Since w ∗ ∈ ∆ n, ǫ,u ∗ and u ∗ ∈ ∆ n, ˜ ǫ , we have that w ∗ ∈ ∆ n, ǫ . Therefore, we have a w ∗ ∈ ∆ n, ǫ thatsatisfies the requirements of Lemma D.2. Applying Lemma D.2, we get the desired statement for aset S ′ ⊆ S . Finally, we can choose the constant c ′ in the definition of ˜ ǫ large enough, so that theset has cardinality | S ′ | ≥ (1 − ǫ ′ ) n . This completes the proof for the case of bounded support.12 eneral case We first do a simple truncation. For a large enough constant C ′ , let E be thefollowing event: E = ( X : k X − µ k ≤ C ′ r tr(Σ) ǫ ′ ) . (13)Let Q be the distribution of X conditioned on E . Note that P can be written as a convex combi-nation of two distributions: Q and some distribution R , P = (1 − P ( E )) Q + P ( E c ) R. (14)Let Z ∼ Q . By Chebyshev’s inequality, we get that P ( E c ) ≤ ǫ ′ /C ′ . Using Lemma B.5, we get that k E Z − µ k = O ( √ ǫ ′ ) and Cov ( Z ) (cid:22) I . The distribution Q satisfies the assumptions of the base caseanalyzed above. Let S E be the set { i : x i ∈ E } and let E be the following event: E = {| S E | ≥ (1 − ǫ ′ / n } . (15)A Chernoff bound implies that given n samples from P , for a c > , with probability at least − exp( − cnǫ ′ /C ′ ) ≥ − τ / (by choosing C ′ large enough and ǫ ′ = Ω(log(1 /τ ) /n ) ), E holds.For a fixed m ≥ (1 − ǫ ′ / n , let z , . . . , z m be m i.i.d. draws from the distribution Q . Applyingthe theorem statement of the base case for each such m , we get that, except with probability τ / ,there exists an S ′ ⊆ [ m ] ⊆ [ n ] with | S ′ | ≥ (1 − ǫ ′ / m ≥ (1 − ǫ ′ / n ≥ (1 − ǫ ′ ) n , such that | S ′ | is ( Cǫ ′ , O ( p d log d/n + √ Cǫ ′ )) -stable.As mentioned above (event E ), m ≥ (1 − ǫ ′ / n with probability at least − τ / . We can nowmarginalize over m to say that with probability at least − τ , there exists a ( Cǫ ′ , δ ) stable set S ′ of cardinality at least (1 − ǫ ′ ) n .However, we are still not done. We have the guarantee that S ′ is stable with respect to E Z .Using the triangle inequality and Cauchy-Schwarz, we get that the set is also ( Cǫ ′ , δ ′ ) stable withrespect to µ as well, where δ ′ = δ + k µ − E Z k = δ + O ( √ ǫ ′ ) . This completes the proof. In this section, we again consider distributions with finite covariance matrix Σ . 
We now turn our attention to the proof of Theorem 1.7, which removes the additional logarithmic factor √(log r(Σ)). In Section 3.1, we show that the pre-processing applied to i.i.d. points yields a set that contains a large stable subset (after rescaling). Then, in Section 3.2, we use a coupling argument to show a similar result in the strong contamination model.

We recall the median-of-means principle. Let k ∈ [n].
1. First randomly bucket the data into k disjoint buckets of equal size (if k does not divide n, remove some samples) and compute their empirical means z_1, . . . , z_k.
2. Output an (appropriately defined) multivariate median of z_1, . . . , z_k.

We first recall the following result (with different constants) from Depersin and Lecué [DL19], in slightly different notation.

Theorem 3.1 ([DL19, Proposition 1]). Let z_1, . . . , z_k be the k points in R^d obtained by the median-of-means preprocessing on n i.i.d. data x_1, . . . , x_n from a distribution with mean µ and covariance Σ. Let M be the set of PSD matrices with trace at most 1. Then there exists a constant c > 0 such that, with probability at least 1 − exp(−ck), we have that for all M ∈ M,

|{ i ∈ [k] : (z_i − µ)ᵀM(z_i − µ) > (k‖Σ‖₂/n) δ² }| ≤ 0.01k,  where δ = O( √(r(Σ)/k) + 1 ).

We now state our main result in this section, proved using minimax duality: Theorem 3.1 implies stability. We first consider the case of i.i.d. data points, as it conveys the underlying idea clearly.
Theorem 3.2.
Let x_1, . . . , x_n be n i.i.d. random variables from a distribution with mean µ and covariance Σ ⪯ I. For k ∈ [n], let z_1, . . . , z_k be the variables obtained by the median-of-means preprocessing. Then, with probability 1 − exp(−ck), where c is a positive universal constant, there exists a set S ⊆ [k] with |S| ≥ 0.99k such that S is (0.01, δ)-stable with respect to µ and σ² = k‖Σ‖₂/n, where δ = O( √(r(Σ)/k) + 1 ).

Proof. For brevity, let σ̃ = √(k‖Σ‖₂/n). Suppose that the conclusion of Theorem 3.1 holds with δ = O(√(r(Σ)/k) + 1) such that δ ≥ 1, i.e., for every M ∈ M, for at least 0.99k points we have (z_i − µ)ᵀM(z_i − µ) ≤ σ̃²δ². Using minimax duality, we get

min_{w ∈ ∆_{k,0.01}} ‖ Σ_{i=1}^k w_i (z_i − µ)(z_i − µ)ᵀ ‖₂ = min_{w ∈ ∆_{k,0.01}} max_{M ∈ M} ⟨ M, Σ_{i=1}^k w_i (z_i − µ)(z_i − µ)ᵀ ⟩ = max_{M ∈ M} min_{w ∈ ∆_{k,0.01}} ⟨ M, Σ_{i=1}^k w_i (z_i − µ)(z_i − µ)ᵀ ⟩ ≤ σ̃²δ²,

where the last step uses the conclusion of Theorem 3.1. Let w* be the distribution that achieves the minimum in the above statement. As δ ≥ 1, we also get that ‖ Σ_{i=1}^k w*_i (z_i − µ)(z_i − µ)ᵀ − σ̃²I ‖₂ ≤ 2σ̃²δ². We can also bound the first moment of w* using the bound on its second moment: for any unit vector v,

Σ_{i=1}^k w*_i vᵀ(z_i − µ) ≤ √( Σ_{i=1}^k w*_i (vᵀ(z_i − µ))² ) ≤ √( ‖ Σ_{i=1}^k w*_i (z_i − µ)(z_i − µ)ᵀ ‖₂ ) ≤ √(σ̃²δ²) = σ̃δ.

Given this w* ∈ ∆_{k,0.01}, we will now obtain a subset of {z_1, . . . , z_k} that satisfies the stability condition. In particular, Lemma D.2 shows that we can deterministically round w* such that there exists a large subset of {z_1, . . . , z_k} which is (0.01, δ)-stable with respect to µ and σ̃.

We now prove Theorem 1.7, i.e., the stability of a subset after corruption, using Theorem 3.2. The following result shares the same principle as [DHL19, Lemma B.1]; we add a coupling argument because the pre-processing step (random bucketing) introduces an additional source of randomness.
Theorem 3.3 (Formal statement of Theorem 1.7). Let T be an ǫ-corrupted version of the set S, where S is a set of n i.i.d. points from a distribution P with mean µ and covariance Σ. Set ǫ′ = O(ǫ + log(1/τ)/n) and set k = ⌊ǫ′n⌋. Let T_k be the set of k points obtained by the median-of-means preprocessing on the set T. Then, with probability 1 − τ, T_k is a 0.01-corruption of a set S_k such that there exists an S′_k ⊆ S_k with |S′_k| ≥ 0.99k and S′_k is (0.01, δ)-stable with respect to µ and σ² = k‖Σ‖₂/n, where δ = O( √(r(Σ)/k) + 1 ).

Proof. For simplicity, assume that k divides n and let m = n/k. Let S = {x_1, . . . , x_n} be the multiset of n i.i.d. points in R^d from P. We can write T as T = {x′_1, . . . , x′_n} such that |{ i : x′_i ≠ x_i }| ≤ ǫn.

As the algorithm only gets a multiset, we first order the points arbitrarily. Let r′_1, . . . , r′_n be an arbitrary labelling of the points of T and let σ₁(·) be the permutation such that r′_i = x′_{σ₁(i)}. We now split the points randomly into buckets by randomly shuffling them. Let σ₂(·) be a uniformly random permutation of [n], independent of T (and S). Define w′_i = r′_{σ₂(i)} = x′_{σ₁(σ₂(i))}. For i ∈ [k], define the bucket B′_i to be the multiset B′_i := { w′_{(i−1)m+1}, . . . , w′_{im} }, and define z′_i to be the mean of B′_i, i.e., z′_i = µ_{B′_i}. That is, the input to the stability-based algorithm is the multiset T_k = {z′_1, . . . , z′_k}.

We now couple the corrupted points with the original points. For σ₁ and σ₂, define their composition σ′ as σ′(i) := σ₁(σ₂(i)). Define r_i := x_{σ₁(i)} and w_i := r_{σ₂(i)} = x_{σ′(i)}. Importantly, Proposition 3.4 below states that the w_i's are i.i.d. from P. The analogous bucket of uncorrupted samples is B_i := { w_{(i−1)m+1}, . . . , w_{im} }. For i ∈ [k], define z_i := µ_{B_i} and define S_k := {z_1, . . . , z_k}. Therefore, z_1, . . . , z_k are obtained from the median-of-means preprocessing of the i.i.d. data w_1, . . . , w_n, and thus Theorem 3.2 holds. (Indeed, if (x_1, . . . , x_n) are i.i.d., then any partition into k buckets of equal cardinality that does not depend on the values of (x_1, . . . , x_n) preserves independence; therefore, Theorems 3.1 and 3.2 hold for this bucketing strategy too.) That is, there exists an S′_k ⊆ S_k that satisfies the desired properties.

It remains to show that T_k is a corruption of S_k. It is easy to see that |T_k ∩ S_k| ≥ k − ǫn ≥ 0.99k, by choosing the constant in the definition of ǫ′ large enough. That is, for any σ₁ and σ₂, T_k is at most a 0.01-contamination of the set S_k.
Proposition 3.4. Let x_1, . . . , x_n be n i.i.d. points from a distribution P and let σ₁(·) be a permutation, potentially depending on x_1, . . . , x_n. Let σ₂(·) be a random permutation independent of x_1, . . . , x_n and σ₁(·). Define the composition permutation σ′(i) := σ₁(σ₂(i)). Then x_{σ′(1)}, . . . , x_{σ′(n)} are also i.i.d. from the distribution P.

Proof. First observe that σ′(·) is a uniform random permutation independent of x_1, . . . , x_n. The result follows from the following fact:
Fact 3.5. Let x_1, . . . , x_n be n i.i.d. points from a distribution P, and let σ(·) be a random permutation independent of x_1, . . . , x_n. Then x_{σ(1)}, . . . , x_{σ(n)} are also i.i.d. from the distribution P.

Distributions with Bounded Central Moments

In this section, we consider distributions with identity covariance and bounded central moments. Our main result in this section is the proof of Theorem 1.8, which obtains a tighter dependence on ǫ. Our proof strategy closely follows the proof structure of the bounded covariance case, and we suggest that the reader read Section 2 first. This section has a similar organization to Section 2. We start with a simplified stability condition (Claim 4.1). Sections 4.1 and 4.2 contain the arguments for controlling the second moment matrix from above and below, respectively. Section 4.3 contains the concentration results for controlling the sample mean. Finally, we combine the results of the previous sections in Section 4.4 to complete the proof of Theorem 1.8.

In the bounded covariance setting, we considered δ such that δ = Ω(√ǫ). As such, we only needed an upper bound on the second moment matrix Σ_{S′} for a set S′ ⊆ S (for δ = Ω(√ǫ), the required lower bound on the eigenvalues holds automatically). In contrast, for δ = o(√ǫ), we need a sharp lower bound on the minimum eigenvalue of Σ_{S′′} for all large subsets S′′ of a set S′. Such a result is not possible in general, unless we impose both: (i) identity covariance, and (ii) tighter control on the tails of X.

We will prove the existence of a stable set with high probability using the following claim, which is analogous to Claim 2.1 in the bounded covariance setting. In particular, we also need a lower bound on the minimum eigenvalue of Σ_{S′} for all large subsets S′.
Claim 4.1. Let 0 ≤ ǫ ≤ δ² and ǫ ≤ 0.1. A set S is (ǫ, O(δ))-stable with respect to µ and σ = 1 if it satisfies the following for all unit vectors v:
1. ‖µ_S − µ‖₂ ≤ δ.
2. vᵀΣ_S v ≤ 1 + δ²/ǫ.
3. For all subsets S′ ⊆ S with |S′| ≥ (1 − ǫ)|S|: vᵀΣ_{S′}v ≥ 1 − δ²/ǫ.
For simplicity, we will state our probabilistic results directly in terms of d instead of tr(Σ) and r(Σ) .The proof techniques of Section 2 can directly be translated to obtain results in terms of Σ . Wefollow the same strategy as in Section 2.1. We first refine the bound on the truncation threshold inthe following result, proved in Appendix C.2. Lemma 4.2.
Lemma 4.2. Consider the setting in Theorem 1.8. Let Q_k = O( σ_k ǫ^{−1/k} + (1/ǫ)√(tr(Σ)/n) ). For each M ∈ M, let S_M be the set { i : (x_i − µ)ᵀM(x_i − µ) ≥ Q_k² }. Let E be the event E = { sup_{M ∈ M} |S_M| ≤ ǫn }. Then, for a c > 0, with probability at least 1 − exp(−cǫn), the event E holds.
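The quantities σ_k and Q_k can be probed numerically. The following Python sketch estimates the bounded k-th central moment parameter along random unit directions and the corresponding σ_k ǫ^{−1/k} part of the truncation threshold Q_k. Taking a maximum over finitely many random directions is only a heuristic lower bound on the supremum over all directions, and the function name, the even-k restriction, and the constants are our own illustrative choices.

```python
import numpy as np

# Monte Carlo probe of sup_v (E(v^T(X-mu))^k)^{1/k} / (E(v^T(X-mu))^2)^{1/2}
# over random unit directions v (k assumed even so powers are non-negative).
def estimate_sigma_k(X, mu, k, n_dirs=2000, rng=None):
    rng = rng or np.random.default_rng()
    V = rng.normal(size=(n_dirs, X.shape[1]))
    V /= np.linalg.norm(V, axis=1, keepdims=True)   # random unit vectors
    P = (X - mu) @ V.T                              # projections, (n, n_dirs)
    ratio = np.mean(P**k, axis=0) ** (1 / k) / np.sqrt(np.mean(P**2, axis=0))
    return ratio.max()

rng = np.random.default_rng(3)
X = rng.normal(size=(5000, 10))
sigma_k = estimate_sigma_k(X, np.zeros(10), k=4, rng=rng)
eps = 0.05
print(sigma_k, sigma_k * eps ** (-1 / 4))  # sigma_k and its part of Q_k
```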
We first find a subset whose covariance matrix is bounded above. For technical reasons, we do not assume that the covariance is exactly the identity and allow some slack. The argument is similar to that of Lemma 2.3 for bounded covariance. We also impose some additional constraints to simplify the expressions, as the excluded regimes would not arise in the proof anyway.

Lemma 4.3.
Let x_1, . . . , x_n be n i.i.d. points in R^d from a distribution with mean µ and covariance Σ whose k-th central moment is bounded by σ_k, for some k ≥ 4. Further assume that, for ǫ < 0.1, the covariance matrix Σ satisfies (1 − σ_k²ǫ^{1−2/k}) I ⪯ Σ ⪯ I, and that the following conditions hold:
1. log(1/τ)/n = O(ǫ).
2. ‖x_i‖₂ = O(σ_k √d ǫ^{−1/k}) almost surely.
3. σ_k ǫ^{1−1/k} = O(1).
Then, for a c > 0, with probability 1 − τ − exp(−cnǫ):

min_{w ∈ ∆_{n,ǫ}} ‖Σ_w − I‖₂ ≤ δ²/ǫ,  where δ = O( √((d log d)/n) + σ_k ǫ^{1−1/k} + √(log(1/τ)/n) ).

Proof. We will assume without loss of generality that µ = 0. We will also assume that the event E in Lemma 4.2 holds, as it only incurs an additional probability of error of exp(−cnǫ). We use the variational characterization of the spectral norm and minimax duality to write the following:

min_{w ∈ ∆_{n,ǫ}} ‖ Σ_i w_i x_i x_iᵀ − I ‖₂ = min_{w ∈ ∆_{n,ǫ}} max_{M ∈ M} Σ_i w_i ⟨ x_i x_iᵀ − I, M ⟩ = max_{M ∈ M} min_{w ∈ ∆_{n,ǫ}} ( Σ_i w_i x_iᵀM x_i − 1 ) ≤ max_{M ∈ M} Σ_{i=1}^n (x_iᵀM x_i) I(x_iᵀM x_i ≤ Q_k²) / ((1 − ǫ)n) − 1,

where the last inequality uses Lemma 4.2: on the event E, the set { i : x_iᵀM x_i ≤ Q_k² } has cardinality at least (1 − ǫ)n, so we may choose the uniform distribution on it. Let f : R⁺ → R⁺ be the following function:

f(x) := x if x ≤ Q_k², and f(x) := Q_k² otherwise.

Define the following random variables R and R′:

R = sup_{M ∈ M} Σ_{i=1}^n f(x_iᵀM x_i),  R′ = sup_{M ∈ M} Σ_{i=1}^n ( f(x_iᵀM x_i) − E f(x_iᵀM x_i) ).

By Lemma B.4, we get that |E f(x_iᵀM x_i) − 1| ≤ 2σ_k²ǫ^{1−2/k}, which gives that |R − n − R′| ≤ 2nσ_k²ǫ^{1−2/k}. We therefore get that

min_{w ∈ ∆_{n,ǫ}} ‖ Σ_i w_i x_i x_iᵀ − I ‖₂ ≤ max_{M ∈ M} Σ_{i=1}^n f(x_iᵀM x_i) / ((1 − ǫ)n) − 1
= R/((1 − ǫ)n) − 1 ≤ 2R′/n + 4σ_k²ǫ^{1−2/k} + 2ǫ.

Observe that the last two terms in the above expression are small, i.e., σ_k²ǫ^{1−2/k} + ǫ = O(δ²/ǫ). We next use Lemma E.3 in the Appendix to conclude that R′ concentrates well: Lemma E.3 states that, with probability 1 − τ,

R′/n ≤ (1/ǫ) O( √((d log d)/n) + σ_k ǫ^{1−1/k} + √(log(1/τ)/n) )² = δ²/ǫ.

Overall, we get that min_{w ∈ ∆_{n,ǫ}} ‖Σ_w − I‖₂ ≤ δ²/ǫ. Taking a union bound over the event E and the concentration of R′ concludes the result.

4.2 Minimum Eigenvalue of Large Subsets

In this section, we prove that, under bounded central moments, the minimum eigenvalue of Σ_{S′}, for each large enough subset S′, has a lower bound close to 1. Our result is similar in spirit to that of Koltchinskii and Mendelson [KM15, Theorem 1.3], which only bounds the eigenvalues of Σ_S. The proof of the following lemma is very similar to the proof of Lemma 4.3.
Lemma 4.4. Consider the setting in Lemma 4.3. Then, for a constant c > 0, with probability 1 − τ − exp(−cnǫ), the following holds for all unit vectors v:

min_{S′ : |S′| ≥ (1−ǫ)n} vᵀΣ_{S′}v ≥ 1 − δ²/ǫ,  where δ = O( √((d log d)/n) + σ_k ǫ^{1−1/k} + √(log(1/τ)/n) ).

Proof. Without loss of generality, assume that µ = 0. We will assume that the event E from Lemma 4.2 holds, at an additional probability of error of exp(−cnǫ); that is,

sup_{v ∈ S^{d−1}} |{ i : (x_iᵀv)² ≥ Q_k² }| ≤ nǫ.

Let f be as defined in the proof of Lemma 4.3. For a sequence y_1, . . . , y_n, let y_{(1)}, . . . , y_{(n)} be its rearrangement in non-decreasing order. For any unit vector v, we have that

min_{S′ : |S′| ≥ (1−ǫ)n} vᵀΣ_{S′}v ≥ min_{w ∈ ∆_{n,ǫ}} vᵀΣ_w v = min_{w ∈ ∆_{n,ǫ}} Σ_{i=1}^n w_i (x_iᵀv)² ≥ Σ_{i=1}^{(1−ǫ)n} ((xᵀv)²)_{(i)} / ((1 − ǫ)n) ≥ ( Σ_{i=1}^n f((x_iᵀv)²) − Q_k² ǫn ) / ((1 − ǫ)n),

where we use that at most ǫn points have projections larger than Q_k. Thus the minimum eigenvalue of any large subset is lower bounded as

(1 − ǫ)n · min_{w ∈ ∆_{n,ǫ}} min_{v ∈ S^{d−1}} Σ_{i=1}^n w_i (x_iᵀv)² ≥ min_{v ∈ S^{d−1}} Σ_{i=1}^n f((x_iᵀv)²) − Q_k² ǫn.

Let h(·) be the negative of the function f(·). Define the following random variable Z and its centered counterpart Z′:

Z := sup_{v ∈ S^{d−1}} Σ_{i=1}^n h((x_iᵀv)²),  Z′ := sup_{v ∈ S^{d−1}} Σ_{i=1}^n ( h((x_iᵀv)²) − E h((x_iᵀv)²) ).

From Lemma B.4, it follows that |E h((x_iᵀv)²) + 1| = |E f((x_iᵀv)²) − 1| = O(σ_k²ǫ^{1−2/k}). This immediately gives us that |Z′ − Z − n| = O(nσ_k²ǫ^{1−2/k}), and therefore

(1 − ǫ)n · min_{w ∈ ∆_{n,ǫ}} min_{v ∈ S^{d−1}} Σ_{i=1}^n w_i (x_iᵀv)² ≥ min_{v ∈ S^{d−1}} Σ_{i=1}^n f((x_iᵀv)²) − Q_k² ǫn = −Z − Q_k² ǫn ≥ −Z′ + n − O(nσ_k²ǫ^{1−2/k}) − Q_k² ǫn.

We thus require a high-probability upper bound on Z′. Note that Z′ behaves similarly to R′, defined in the proof of Lemma 4.3. Similar to the proof of Lemma E.3, we get that, with probability at least 1 − τ,

Z′/n ≤ (1/ǫ) O( √((d log d)/n) + σ_k ǫ^{1−1/k} + √(log(1/τ)/n) )².

Note that the remaining terms satisfy σ_k²ǫ^{1−2/k} = O(δ²/ǫ) and ǫQ_k² = O(σ_k²ǫ^{1−2/k} + d/(nǫ)) = O(δ²/ǫ). Therefore, we get that the minimum eigenvalue of any large subset is at least

min_{w ∈ ∆_{n,ǫ}} λ_min(Σ_w) ≥ 1 − δ²/ǫ,  where δ = O( √((d log d)/n) + σ_k ǫ^{1−1/k} + √(log(1/τ)/n) ).

Lemmas 4.3 and 4.4 give control of the second moment matrix. We will now further remove an O(ǫ)-fraction of points to obtain a w such that ‖µ_w − µ‖₂ is small.
Lemma 4.5. Let x_1, . . . , x_n be n i.i.d. random variables from a distribution with mean µ and covariance Σ ⪯ I. Further assume that the x_i's are drawn from a distribution with bounded k-th central moment σ_k, for a k ≥ 4. Let u ∈ ∆_{n,ǫ} and assume that log(1/τ)/n = O(ǫ). Then, for a constant c > 0, the following holds with probability 1 − τ − exp(−cnǫ):

min_{w ∈ ∆_{n,4ǫ,u}} ‖ Σ_{i=1}^n w_i x_i − µ ‖₂ = O( √(d/n) + σ_k ǫ^{1−1/k} + √(log(1/τ)/n) ).
Proof. Without loss of generality, let us assume that µ = 0. Also, assume that the event E from Lemma 4.2 holds, at an additional error probability of exp(−cnǫ). Let g(·) be the following function:

g(x) = x if x ∈ [−Q_k, Q_k],  g(x) = Q_k if x > Q_k,  g(x) = −Q_k if x < −Q_k.

Let N be the following random variable:

N = sup_{v ∈ S^{d−1}} Σ_{i=1}^n g(vᵀx_i) = sup_{v ∈ S^{d−1}} | Σ_{i=1}^n g(vᵀx_i) |,

where the last equality holds as g(·) is an odd function. We also define the following empirical process, where each term is centered:

N′ = sup_{v ∈ S^{d−1}} Σ_{i=1}^n ( g(vᵀx_i) − E g(vᵀx_i) ).

As Q_k = Ω(σ_k ǫ^{−1/k}), Lemma B.4 states that sup_v E g(vᵀx) = O(σ_k ǫ^{1−1/k}), and this gives that |N − N′| = O(nσ_k ǫ^{1−1/k}). We now use duality to write the following:

min_{w ∈ ∆_{n,4ǫ,u}} ‖ Σ_{i=1}^n w_i x_i ‖₂ = max_{v ∈ S^{d−1}} min_{w ∈ ∆_{n,4ǫ,u}} ⟨ Σ_{i=1}^n w_i x_i, v ⟩ ≤ O(ǫQ_k) + N/((1 − ǫ)n) ≤ O(ǫQ_k) + O(σ_k ǫ^{1−1/k}) + 2N′/n,

where the first inequality uses Lemma 2.5. We now use Lemma E.4 to conclude that N′ concentrates. Recall that ǫQ_k = O(σ_k ǫ^{1−1/k} + √(d/n)). Overall, we get that, with probability 1 − τ − exp(−cnǫ), there exists a w ∈ ∆_{n,4ǫ,u} such that ‖Σ_i w_i x_i‖₂ = O( √(d/n) + σ_k ǫ^{1−1/k} + √(log(1/τ)/n) ).

We now combine the results in the previous lemmas to obtain the stability of a subset with high probability. Although we prove the following result showing the existence of a (2ǫ′, δ)-stable subset, it can be generalized to the existence of a (Cǫ, O(δ))-stable subset for a large constant C.

Theorem 4.6 (Theorem 1.8). Let S = {x_1, . . . , x_n} ⊂ R^d be n i.i.d. points from a distribution with mean µ and covariance Σ such that (1 − σ_k²ǫ′^{1−2/k}) I ⪯ Σ ⪯ I. Further assume that, for a k ≥ 4, the k-th central moment is bounded by σ_k. Let ǫ′ = O(ǫ + log(1/τ)/n) ≤ c for a sufficiently small constant c. Then, with probability at least 1 − τ, there exists a subset S′ ⊆ S such that |S′| ≥ (1 − ǫ′)n and S′ is (2ǫ′, δ)-stable, with δ = O( σ_k ǫ^{1−1/k} + √((d log d)/n) + √(log(1/τ)/n) ).

Proof. First note that Theorem 1.4, which requires only bounded covariance, already guarantees that, with probability at least 1 − τ, there exists a stable subset with

δ = O( √((d log d)/n) + √ǫ + √(log(1/τ)/n) ).  (16)

Therefore, the guarantee of the present theorem is tighter only in the following regimes:

log(1/τ)/n = O(ǫ),  σ_k ǫ^{1/2−1/k} = O(1),  d log d/n = O(ǫ).  (17)

For the rest of the proof, we will assume that all three of these conditions hold. Similar to the proof of Theorem 1.4, we first prove the result when the samples are bounded.

Base case: Bounded support. In this case, we will assume that ‖x_i − µ‖₂ = O(σ_k ǫ^{−1/k} √d) almost surely. We will use Lemma E.2 to show that the set is stable. Set ǫ̃ = ǫ′/C′ for a large enough constant C′ to be determined later.

Note that x_1, . . . , x_n satisfy the conditions of Lemmas 4.3, 4.4, and 4.5. In particular, we will use Lemma 4.4 with Cǫ̃, where C is large enough. By choosing ǫ′ = Ω(log(1/τ)/n), we get that, with probability 1 − τ/3, for any S′ with |S′| ≥ (1 − Cǫ̃)n and any unit vector v,

Σ_{i ∈ S′} (vᵀx_i)² / |S′| ≥ 1 − δ²/(Cǫ̃).  (18)
We first look at the variance using the guarantee in Lemma 4.3: let $u \in \Delta_{n,\tilde{\epsilon}}$ be the distribution achieving the minimum in Lemma 4.3. By choosing $\epsilon' = \Omega(\log(1/\tau)/n)$, we get that, with probability $1 - \tau/3$,
\[ \sum_{i=1}^n u_i (x_i^T v)^2 \leq 1 + \frac{\delta^2}{\tilde{\epsilon}}. \tag{19} \]
We now obtain a guarantee on the mean using Lemma 4.5. For this $u$, let $w \in \Delta_{n,2\tilde{\epsilon},u}$ be the distribution achieving the minimum in Lemma 4.5. Then, with probability $1 - \tau/3$,
\[ \Big\| \sum_{i=1}^n w_i x_i \Big\| \leq \delta. \tag{20} \]
Since $u \in \Delta_{n,\tilde{\epsilon}}$ and $w \in \Delta_{n,2\tilde{\epsilon},u}$, we have that $w \in \Delta_{n,4\tilde{\epsilon}}$. Moreover,
\[ \sum_{i=1}^n w_i (x_i^T v)^2 \leq \sum_{i=1}^n \frac{u_i}{1 - 2\tilde{\epsilon}} (x_i^T v)^2 = \frac{1}{1-2\tilde{\epsilon}} \Big(1 + \frac{\delta^2}{\tilde{\epsilon}}\Big) \leq 1 + \frac{2}{1-2\tilde{\epsilon}} \Big( \tilde{\epsilon} + \frac{\delta^2}{\tilde{\epsilon}} \Big) = 1 + O\Big(\frac{\delta^2}{\tilde{\epsilon}}\Big). \tag{21} \]
Therefore, $w$ satisfies the requirements of Lemma E.2, where we note that $r_1 = O(1)$ and $r_2 = O(1)$, and we obtain the desired statement. By a union bound, the total failure probability is $\tau$. Finally, we choose $C_1$ and $C'$ large enough so that the cardinality of the stable set is at least $(1 - \epsilon')n$ and the set is $(2\epsilon', \delta)$-stable.
General case: unbounded support. We first perform a simple truncation. Let $E$ be the following event:
\[ E = \big\{ X : \|X - \mu\| \leq C \sigma_k \epsilon^{-1/k} \sqrt{d} \big\}. \tag{22} \]
Let $Q$ be the distribution of $X$ conditioned on $E$. Note that $P$ can be written as a convex combination of two distributions, $Q$ and some distribution $R$:
\[ P = P(E)\, Q + P(E^c)\, R. \tag{23} \]
Let $Z \sim Q$. Using Lemma B.5, we get that $\|\mathbf{E} Z - \mu\| = O(\sigma_k \epsilon^{1-1/k}/C^{k-1})$ and $\big(1 - O(\sigma_k^2 \epsilon^{1-2/k}/C^{k-2})\big) I \preceq \mathrm{Cov}(Z) \preceq I$. Thus the distribution $Q$ satisfies the assumptions of the base case for a large enough constant $C$.

Let $S_E$ be the set $\{X_i : X_i \in E\}$. A Chernoff bound gives that, given $n$ samples from $P$, with probability at least $1 - \exp(-n\epsilon')$, the following event holds:
\[ \mathcal{E}_1 = \{ |S_E| \geq (1 - \epsilon'/2)\, n \}. \tag{24} \]
For a fixed $m \geq (1 - \epsilon'/2)n$, let $z_1, \ldots, z_m$ be $m$ i.i.d. draws from the distribution $Q$. Applying the theorem statement to $Q$, as it satisfies the base case above, we get that, with probability at least $1 - \exp(-cm\epsilon')$, there exists $S'$ with $|S'| \geq (1 - \epsilon'/2)m \geq (1 - \epsilon'/2)^2 n \geq (1 - \epsilon')n$ such that $S'$ is $(2\epsilon', \delta)$-stable. This gives us a set $S'$ that is stable with respect to $\mathbf{E} Z$. Using the triangle inequality, the set $S'$ is $(2\epsilon', \delta')$-stable with respect to $\mu$ as well, where $\delta' = \delta + \|\mu - \mathbf{E} Z\| = \delta + O(\sigma_k \epsilon^{1-1/k})$. We can now marginalize over $m$ to get that, except with probability $2\exp(-cn\epsilon')$, the desired claim holds. Choosing $\epsilon' = \Omega(\log(1/\tau)/n)$, we can make the probability of failure less than $\tau$.

Conclusions

In this paper, we showed that a standard stability condition from the recent high-dimensional robust statistics literature suffices to obtain near-subgaussian rates for robust mean estimation in the strong contamination model. With a simple pre-processing step (bucketing), this leads to efficient outlier-robust estimators with subgaussian rates under only a bounded covariance assumption. An interesting technical question is whether the extra $\log d$ factor in Theorem 1.4 is actually needed. (Our results imply that it is not needed when $\epsilon = \Omega(1)$.) If not, this would imply that stability-based algorithms achieve subgaussian rates without the pre-processing.

References
[AMS99] N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. J. Comput. Syst. Sci., 58(1):137–147, 1999.

[BLM13] S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.

[Cat12] O. Catoni. Challenging the empirical mean and empirical variance: A deviation study. Ann. Inst. H. Poincaré Probab. Statist., 48(4):1148–1185, 2012.

[CDG18] Y. Cheng, I. Diakonikolas, and R. Ge. High-dimensional robust mean estimation in nearly-linear time. CoRR, abs/1811.09380, 2018. Conference version in SODA 2019, pages 2755–2771. URL: http://arxiv.org/abs/1811.09380.

[CDGS20] Y. Cheng, I. Diakonikolas, R. Ge, and M. Soltanolkotabi. High-dimensional robust mean estimation via gradient descent. CoRR, abs/2005.01378, 2020. URL: https://arxiv.org/abs/2005.01378.

[CFB19] Y. Cherapanamjeri, N. Flammarion, and P. L. Bartlett. Fast mean estimation with sub-gaussian rates. In Conference on Learning Theory, COLT 2019, volume 99 of Proceedings of Machine Learning Research, pages 786–806. PMLR, 2019.

[DHL19] Y. Dong, S. B. Hopkins, and J. Li. Quantum entropy scoring for fast robust mean estimation and improved outlier detection. CoRR, abs/1906.11366, 2019. Conference version in NeurIPS 2019. URL: http://arxiv.org/abs/1906.11366.

[DK19] I. Diakonikolas and D. M. Kane. Recent advances in algorithmic high-dimensional robust statistics. CoRR, abs/1911.05911, 2019. URL: http://arxiv.org/abs/1911.05911.

[DKK+16] I. Diakonikolas, G. Kamath, D. M. Kane, J. Li, A. Moitra, and A. Stewart. Robust estimators in high dimensions without the computational intractability. In Proc. 57th IEEE Symposium on Foundations of Computer Science (FOCS), pages 655–664, 2016.

[DKK+17] I. Diakonikolas, G. Kamath, D. M. Kane, J. Li, A. Moitra, and A. Stewart. Being robust (in high dimensions) can be practical. In Proc. 34th International Conference on Machine Learning (ICML), pages 999–1008, 2017.

[DKK+18] I. Diakonikolas, G. Kamath, D. M. Kane, J. Li, J. Steinhardt, and A. Stewart. Sever: A robust meta-algorithm for stochastic optimization. CoRR, abs/1803.02815, 2018. Conference version in ICML 2019. URL: http://arxiv.org/abs/1803.02815.

[DL19] J. Depersin and G. Lecué. Robust subgaussian estimation of a mean vector in nearly linear time. CoRR, abs/1906.03058, 2019.

[DLLO16] L. Devroye, M. Lerasle, G. Lugosi, and R. I. Oliveira. Sub-gaussian mean estimators. Ann. Statist., 44(6):2695–2725, 2016.

[HL19] S. B. Hopkins and J. Li. How hard is robust mean estimation? In Conference on Learning Theory, COLT 2019, pages 1649–1682, 2019.

[Hop18] S. B. Hopkins. Sub-gaussian mean estimation in polynomial time. CoRR, abs/1809.07425, 2018. URL: http://arxiv.org/abs/1809.07425.

[Hub64] P. J. Huber. Robust estimation of a location parameter. Ann. Math. Statist., 35(1):73–101, 1964.

[JVV86] M. Jerrum, L. G. Valiant, and V. V. Vazirani. Random generation of combinatorial structures from a uniform distribution. Theor. Comput. Sci., 43:169–188, 1986.

[KM15] V. Koltchinskii and S. Mendelson. Bounding the smallest singular value of a random matrix without concentration. International Mathematics Research Notices, 2015(23):12991–13008, 2015.

[LLVZ19] Z. Lei, K. Luh, P. Venkat, and F. Zhang. A fast spectral algorithm for mean estimation with sub-gaussian rates. CoRR, abs/1908.04468, 2019. URL: http://arxiv.org/abs/1908.04468.

[LM19a] G. Lugosi and S. Mendelson. Mean estimation and regression under heavy-tailed distributions: A survey. Foundations of Computational Mathematics, 19(5):1145–1190, 2019.

[LM19b] G. Lugosi and S. Mendelson. Robust multivariate mean estimation: the optimality of trimmed mean. CoRR, abs/1907.11391, 2019. URL: http://arxiv.org/abs/1907.11391.

[LM19c] G. Lugosi and S. Mendelson. Sub-gaussian estimators of the mean of a random vector. Ann. Statist., 47(2):783–794, 2019.

[LRV16] K. A. Lai, A. B. Rao, and S. Vempala. Agnostic estimation of mean and covariance. In Proc. 57th IEEE Symposium on Foundations of Computer Science (FOCS), pages 665–674, 2016.

[LT91] M. Ledoux and M. Talagrand. Probability in Banach Spaces. Springer, 1991.

[Min15] S. Minsker. Geometric median and robust estimation in Banach spaces. Bernoulli, 21(4):2308–2335, 2015.

[Min17] S. Minsker. On some extensions of Bernstein's inequality for self-adjoint operators. Statistics & Probability Letters, 127:111–119, 2017.

[NU83] A. S. Nemirovsky and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley, 1983.

[PBR19] A. Prasad, S. Balakrishnan, and P. Ravikumar. A unified approach to robust mean estimation. CoRR, abs/1907.00927, 2019. URL: http://arxiv.org/abs/1907.00927.

[PSBR18] A. Prasad, A. S. Suggala, S. Balakrishnan, and P. Ravikumar. Robust estimation via robust gradient estimation. arXiv preprint arXiv:1802.06485, 2018.

[SCV18] J. Steinhardt, M. Charikar, and G. Valiant. Resilience: A criterion for learning in the presence of arbitrary outliers. In Proc. 9th Innovations in Theoretical Computer Science Conference (ITCS), pages 45:1–45:21, 2018.

[Sio58] M. Sion. On general minimax theorems. Pacific Journal of Mathematics, 8(1):171–176, 1958.

[Tal96] M. Talagrand. New concentration inequalities in product spaces. Inventiones Mathematicae, 126(3):505–563, 1996.

[Tro15] J. A. Tropp. An introduction to matrix concentration inequalities. Foundations and Trends in Machine Learning, 8(1-2):1–230, 2015.

[Tsy08] A. B. Tsybakov. Introduction to Nonparametric Estimation. Springer, 2008.

[Tuk60] J. W. Tukey. A survey of sampling from contaminated distributions. Contributions to Probability and Statistics, 2:448–485, 1960.

[ZJS19] B. Zhu, J. Jiao, and J. Steinhardt. Generalized resilience and robust statistics. CoRR, abs/1909.08755, 2019. URL: http://arxiv.org/abs/1909.08755.

Appendix

A Robust Mean Estimation and Stability

A.1 Robust Mean Estimation from Subset Stability
The theorem statement in [DK19, Theorem 2.7] requires that the input multiset $S$ be stable. We note that the arguments go through straightforwardly when $S$ contains a large stable subset $S' \subseteq S$ (see, e.g., [DKK+16, DKK+17, DHL19]). For concreteness, we describe a simple pre-processing of the data that ensures the data fits the definition as is: simply throw away points so that the cardinality of the corrupted set matches the cardinality of the stable subset.
Proposition A.1.
Let $S$ be a set such that there exists $S' \subseteq S$ with $|S'| \geq (1 - \epsilon)|S|$ and $S'$ is $(C\epsilon, \delta)$-stable for some $C > 2$. Let $T$ be an $\epsilon$-corrupted version of $S$, and let $T'$ be the multiset obtained by removing any $2\epsilon n$ points from $T$. Let $\epsilon' = 2\epsilon/(1-2\epsilon)$. Then $T'$ is an $\epsilon'$-corrupted version of a $((C-2)\epsilon'/2,\, \delta)$-stable set.

Proof. Let $T$ be an $\epsilon$-corrupted version of $S$; that is, $T = (S \cup A) \setminus R$ with $|A| = |R| \leq \epsilon n$. We remove $2\epsilon n$ points arbitrarily from $T$ to obtain the multiset $T'$ of cardinality $(1-2\epsilon)n$. Let $S_0$ be any subset of $S'$ such that $|S_0| = |T'| = (1-2\epsilon)n$. Then $T'$ is an (at most) $(2\epsilon)/(1-2\epsilon)$-corrupted version of $S_0$. As $S'$ is $(C\epsilon, \delta)$-stable and $S_0$ is a large subset of $S'$, Claim A.2 states that $S_0$ is $(\epsilon_0, \delta)$-stable, where
\[ \epsilon_0 \geq 1 - \frac{1 - C\epsilon}{1 - 2\epsilon} = \frac{(C-2)\epsilon}{1-2\epsilon} = \frac{(C-2)\epsilon'}{2}. \]
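In code, this pre-processing is a single discard step. The following sketch (an illustrative helper we introduce here for exposition, not code from the paper or any library) spells out the bookkeeping of Proposition A.1.

```python
import numpy as np

def match_cardinalities(T, eps):
    """Pre-processing for Proposition A.1: given an eps-corrupted
    multiset T of n points, discard 2*eps*n arbitrary points so the
    remaining multiset T' has the cardinality (1 - 2*eps)*n of a
    stable subset of the inliers.  Which points are removed is
    irrelevant for the guarantee, so we simply drop a prefix."""
    n = len(T)
    k = int(np.floor(2 * eps * n))      # number of points to discard
    T_prime = T[k:]                     # any k points may be removed
    # T' is an eps'-corrupted version of a stable set, with
    # eps' = 2*eps / (1 - 2*eps), as in the proposition.
    eps_prime = 2 * eps / (1 - 2 * eps)
    return T_prime, eps_prime
```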
Claim A.2. If a set $S$ of cardinality $n$ is $(\epsilon, \delta)$-stable, then any subset $S' \subseteq S$ of cardinality $m > (1-\epsilon)n$ is $\big(1 - \frac{(1-\epsilon)n}{m},\, \delta\big)$-stable.

Proof. To show that $S'$ is $(\epsilon', \delta)$-stable, it suffices to ensure that $\epsilon' \leq \epsilon$ and $(1-\epsilon')|S'| \geq (1-\epsilon)|S|$. Therefore, we require that
\[ (1-\epsilon')m \geq (1-\epsilon)n \iff \epsilon' \leq 1 - \frac{(1-\epsilon)n}{m}. \]
This upper bound is always at most $\epsilon$ for $m \leq n$.

A.2 Adapting to Unknown Upper Bound on Covariance
As stated, the stability-based algorithms in [DKK+17, DK19] assume that the inliers are drawn from a distribution with unknown bounded covariance $\Sigma \preceq \sigma^2 I$, where the parameter $\sigma > 0$ is known. Here we note that essentially the same algorithms work even if the parameter $\sigma > 0$ is unknown. For this, we establish the following simple modification of standard results; see, e.g., [DK19].

Theorem A.3.
Let $T \subset \mathbb{R}^d$ be an $\epsilon$-corrupted version of a set $S$, where $S$ is $(C\epsilon, \delta)$-stable with respect to $\mu_S$ and $\sigma^2$, and where $C > 0$ is a sufficiently large constant. There exists a polynomial-time algorithm that, given $T$ and $\epsilon$ (but not $\sigma$ or $\delta$), returns a vector $\hat{\mu}$ so that $\|\mu_S - \hat{\mu}\| = O(\sigma\delta)$.

Proof. The algorithm is very similar to the algorithm from [DK19], except for the stopping condition. We define a weight function $w : T \to \mathbb{R}_{\geq 0}$, initialized so that $w(x) = 1/|T|$ for all $x \in T$. We iteratively do the following:

- Compute $\mu(w) = \frac{1}{\|w\|_1} \sum_{x \in T} w(x)\, x$.
- Compute $\Sigma(w) = \frac{1}{\|w\|_1} \sum_{x \in T} w(x) (x - \mu(w))(x - \mu(w))^T$.
- Compute an approximate largest eigenvector $v$ of $\Sigma(w)$.
- Define $g(x) = |v \cdot (x - \mu(w))|^2$ for $x \in T$.
- Find the largest $t$ so that $\sum_{x \in T : g(x) \geq t} w(x) \geq 2\epsilon$.
- Define $f(x) = g(x)$ if $g(x) \geq t$, and $f(x) = 0$ otherwise.
- Let $m$ be the largest value of $f(x)$ over $x \in T$ with $w(x) \neq 0$.
- Set $w(x) \leftarrow w(x)(1 - f(x)/m)$ for all $x \in T$.

We repeat this loop until $\|w\|_1 < 1 - 3\epsilon$, at which point we return $\mu(w)$.

Note that if $S$ is $(\epsilon, \delta)$-stable with respect to $\mu_S$ and $\sigma^2$, then $S/\sigma$ is $(\epsilon, \delta)$-stable with respect to $\mu_S/\sigma$ and $1$. If $\sigma$ were known, the weighted universal filter algorithm of [DK19] could be applied to $T/\sigma$ in order to learn $\mu_S/\sigma$ to error $O(\delta)$; multiplying the result by $\sigma$ would then yield an approximation to $\mu_S$ with error $O(\sigma\delta)$. We note that this algorithm is equivalent to the one provided above, except that it would stop the loop as soon as $\|\Sigma(w)\| \leq \sigma^2(1 + O(\delta^2/\epsilon))$ rather than waiting until $\|w\|_1 < 1 - 3\epsilon$.

However, by the analysis in [DK19] of this algorithm, at each iteration until it stops, $\sum_{x \in S} w(x)$ decreases by less than $\sum_{x \in T \setminus S} w(x)$ does. Since the latter cannot decrease by more than $\epsilon$, this means that the algorithm of [DK19] would stop before ours does. Our algorithm then continues to remove an additional $O(\epsilon)$ mass from the weight function $w$ (but only this much, since $f$ is supported on points of total mass only slightly more than $2\epsilon$). It is easy to see that these extra removals do not increase $\|\Sigma(w)\|$ by more than a factor of $1 + O(\epsilon)$. This means that when our algorithm terminates, $\Sigma(w)/\sigma^2 \preceq (1 + O(\delta^2/\epsilon)) I$. Thus, by the weighted version of Lemma 2.4 of [DK19], we have that
\[ \|\mu_S - \mu(w)\| = \sigma\, \|\mu_S/\sigma - \mu(w)/\sigma\| \leq \sigma \cdot O\big(\delta + \sqrt{\epsilon \cdot (\delta^2/\epsilon)}\big) = O(\sigma\delta). \]
This completes the proof.
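To make the loop above concrete, here is a minimal numpy sketch of the filter. It is an illustration of the iteration only, not an optimized or certified implementation of [DK19]; the constants $2\epsilon$ and $3\epsilon$ mirror the proof sketch above and are assumptions rather than tuned values, and the top eigenvector is computed exactly for simplicity.

```python
import numpy as np

def filter_mean(T, eps, max_iter=500):
    """Sketch of the weight-decay filter from Theorem A.3.
    T: (n, d) array of eps-corrupted samples.  Neither sigma nor
    delta is needed; the loop stops on total weight alone."""
    n, d = T.shape
    w = np.full(n, 1.0 / n)                  # weight function w(x)
    for _ in range(max_iter):
        if w.sum() < 1 - 3 * eps:            # stopping condition
            break
        mu = (w @ T) / w.sum()               # weighted mean mu(w)
        X = T - mu
        Sigma = (X * w[:, None]).T @ X / w.sum()   # weighted covariance
        v = np.linalg.eigh(Sigma)[1][:, -1]  # (approximate) top eigenvector
        g = (X @ v) ** 2                     # outlier scores g(x)
        order = np.argsort(g)[::-1]          # largest t with >= 2*eps tail mass
        t = g[order][np.searchsorted(np.cumsum(w[order]), 2 * eps)]
        f = np.where(g >= t, g, 0.0)         # f supported on the score tail
        m = f[w > 0].max()
        if m <= 0:
            break
        w *= 1 - f / m                       # multiplicative down-weighting
    return (w @ T) / w.sum()
```

Usage is simply `mu_hat = filter_mean(T, eps)`. The multiplicative update guarantees that at least one supported point's weight is zeroed per round, so the loop terminates.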
B Tools from Concentration and Truncation
Organization.
In Section B.1, we state the concentration results that we will use repeatedly in the following sections. Section B.2 contains some well-known results regarding properties of truncated distributions.
B.1 Concentration Results
We first state Talagrand’s concentration inequality for bounded empirical processes.
Theorem B.1 ([BLM13, Theorem 12.5]). Let $Y_1, \ldots, Y_n$ be independent identically distributed random vectors. Assume that $\mathbf{E}\, Y_{i,s} = 0$ and that $Y_{i,s} \leq L$ for all $s \in \mathcal{T}$. Define
\[ Z = \sup_{s \in \mathcal{T}} \sum_{i=1}^n Y_{i,s}, \qquad \sigma^2 = \sup_{s \in \mathcal{T}} \sum_{i=1}^n \mathbf{E}\, Y_{i,s}^2. \]
Then, with probability at least $1 - \exp(-t)$, we have that
\[ Z = O\big( \mathbf{E} Z + \sigma\sqrt{t} + Lt \big). \tag{25} \]
See [BLM13, Exercise 12.15] for explicit constants.
We will also repeatedly use the following version of the matrix Bernstein inequality [Tro15, Min17].
Theorem B.2 ([Tro15, Corollary 7.3.2]). Let $S_1, \ldots, S_n$ be $n$ independent symmetric matrices such that $\mathbf{E}\, S_i = 0$ and $\|S_i\| \leq L$ almost surely for each index $i$. Let $Z = \sum_{i=1}^n S_i$, and let $V$ be any PSD matrix such that $\sum_{k=1}^n \mathbf{E}\, S_k S_k^T \preceq V$. Let $\nu = \|V\|$ and let $r = \mathrm{r}(V)$ be the intrinsic dimension of $V$. Then, we have that
\[ \mathbf{E}\|Z\| = O\big( \sqrt{\nu \log r} + L \log r \big). \tag{26} \]
In particular, if $S_i = \xi_i x_i x_i^T$, where $\xi_i$ is a Rademacher random variable and $x_i$ is sampled independently from a distribution with zero mean, covariance $\Sigma$, and bounded support of radius $L$ (i.e., $\|x_i\| \leq L$ almost surely), then
\[ \mathbf{E}\|Z\| = O\big( \sqrt{n L^2 \|\Sigma\| \log \mathrm{r}(\Sigma)} + L^2 \log \mathrm{r}(\Sigma) \big). \]
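As a numerical sanity check of the scaling in the "in particular" clause, the following small Monte Carlo experiment (our own illustration, with arbitrary choices of $n$, $d$, and sampling distribution) compares the empirical $\mathbf{E}\|Z\|$ with the magnitude predicted by the bound.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, trials = 500, 50, 20
L = np.sqrt(d)              # norm bound for the sampled points

def sample_Z_norm():
    # x_i uniform on the sphere of radius L: zero mean, covariance (L^2/d) I
    x = rng.normal(size=(n, d))
    x *= L / np.linalg.norm(x, axis=1, keepdims=True)
    xi = rng.choice([-1.0, 1.0], size=n)     # Rademacher signs
    Z = (x.T * xi) @ x                        # sum_i xi_i x_i x_i^T
    return np.linalg.norm(Z, 2)               # spectral norm

emp = np.mean([sample_Z_norm() for _ in range(trials)])
Sigma_norm = L**2 / d                          # ||Sigma|| = 1 here
pred = np.sqrt(n * L**2 * Sigma_norm * np.log(d)) + L**2 * np.log(d)
print(f"empirical E||Z|| ~ {emp:.1f}   vs   bound scale {pred:.1f}")
```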
B.2 Properties under Truncation

In this subsection, we state some basic results regarding truncation of a distribution. These results are well known in the literature and are included here for completeness (see, e.g., [DKK+17, LRV16]).
Proposition B.3 (Shift in mean by truncation). Let $X$ be sampled from a distribution with mean $0$ and covariance $\Sigma \preceq I$. For a $t \geq 0$, let $g(\cdot)$ be defined as
\[ g(x) = \begin{cases} x, & x \in [-t, t],\\ t, & x > t,\\ -t, & x < -t. \end{cases} \]
If $t \geq C\epsilon^{-1/2}$, then for all $v \in \mathcal{S}^{d-1}$, $|\mathbf{E}\, g(X^T v)| \leq C^{-1}\sqrt{\epsilon}$.

Proof. Let $Z = X^T v$. By Markov's inequality applied to $Z^2$,
\[ \mathbf{P}(|Z| \geq t) \leq \mathbf{P}(|Z| \geq C\epsilon^{-1/2}) \leq \frac{\mathbf{E} Z^2}{C^2 \epsilon^{-1}} \leq C^{-2}\epsilon. \]
By the Cauchy–Schwarz inequality, we get that
\[ |\mathbf{E}\, g(Z)| = |\mathbf{E}(Z - g(Z))| \leq \mathbf{E}|Z - g(Z)| \leq \mathbf{E}\big[|Z|\, \mathbb{I}_{|Z| \geq t}\big] \leq \sqrt{\mathbf{E} Z^2}\, \sqrt{\mathbf{P}(|Z| \geq t)} \leq \frac{\sqrt{\epsilon}}{C}. \tag{27} \]
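The next snippet illustrates Proposition B.3 numerically for an asymmetric heavy-tailed projection (a centered Pareto-type variable; the distribution and parameter values are our arbitrary choices): clipping at $t \geq C\epsilon^{-1/2}$ shifts the mean by well under $\sqrt{\epsilon}/C$.

```python
import numpy as np

rng = np.random.default_rng(1)
eps, C = 0.05, 4.0
t = C / np.sqrt(eps)                    # truncation level t >= C * eps^{-1/2}

# Z plays the role of X^T v: a centered Lomax/Pareto variable with
# mean 0 and variance 0.75 <= 1, so the covariance assumption holds.
a = 3.0
z = rng.pareto(a, size=2_000_000) - 1.0 / (a - 1.0)

g = np.clip(z, -t, t)                   # the function g from Proposition B.3
print(f"|E g(Z)| ~ {abs(g.mean()):.4f}   vs   bound sqrt(eps)/C = {np.sqrt(eps)/C:.4f}")
```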
Proposition B.4 (Shift in mean by truncation under higher moments). Let $X$ be sampled from a distribution with mean $0$ and covariance $(1 - \sigma_k^2 \epsilon^{2-2/k}) I \preceq \Sigma \preceq I$. Moreover, assume that the distribution has bounded moments, i.e., for some $k \geq 2$:
\[ \forall v \in \mathcal{S}^{d-1}, \qquad \big(\mathbf{E}(v^T X)^k\big)^{1/k} \leq \sigma_k. \tag{28} \]
Note that $\sigma_2 \leq 1$. Let $T_k = \sigma_k \epsilon^{-1/k}$, and let $\mathcal{M}$ denote the set of PSD matrices of trace $1$. Then:

1. For all $M \in \mathcal{M}$, $\mathbf{E}(X^T M X)^{k/2} \leq \sigma_k^k$.

2. For all $M \in \mathcal{M}$ and $t \geq C T_k^2$, $\mathbf{E}\big[X^T M X\; \mathbb{I}_{X^T M X \geq t}\big] \leq \sigma_k^2\, C^{1-k/2}\, \epsilon^{1-2/k}$.
3. Let $f(\cdot)$ be defined as $f(x) = \min(x, t)$. For a $t \geq CT_k^2$, $|\mathbf{E} f(X^T M X) - 1| \leq \sigma_k^2 \epsilon^{2-2/k} + \sigma_k^2 \epsilon^{1-2/k} C^{1-k/2}$.

4. Let $t \geq CT_k$. For all $v \in \mathcal{S}^{d-1}$, $\big|\mathbf{E}\, X^T v\; \mathbb{I}_{|X^T v| \leq t}\big| \leq \sigma_k\, C^{1-k}\, \epsilon^{1-1/k}$.

5. Let $g(\cdot)$ be defined as $g(x) = \mathrm{sign}(x)\min(|x|, t)$. For $t \geq CT_k$ and all $v \in \mathcal{S}^{d-1}$, $|\mathbf{E}\, g(X^T v)| \leq \sigma_k\, C^{1-k}\, \epsilon^{1-1/k}$.

6. $\mathbf{E}\|X\|^k \leq d^{k/2}\sigma_k^k$.

7. $\mathbf{P}(\|X\| \geq \sigma_k \sqrt{d}\, \epsilon^{-1/k}) \leq \epsilon$.

Proof.
We prove each statement in turn.

1. We use the spectral decomposition of $M$ to write $M = U^T \Delta U$, where $U$ is a rotation matrix and $\Delta$ is a non-negative diagonal matrix with diagonal entries $\lambda_i$ and trace $1$. Observe that if the random variable $X$ satisfies Equation (28), then the random variable $Z := UX$ also satisfies Equation (28). We use this observation and apply Jensen's inequality to get:
\[ \mathbf{E}(X^T M X)^{k/2} = \mathbf{E}(Z^T \Delta Z)^{k/2} = \mathbf{E}\Big(\sum_{i=1}^d \lambda_i z_i^2\Big)^{k/2} \leq \sum_{i=1}^d \lambda_i\, \mathbf{E}|z_i|^k \leq \sum_{i=1}^d \lambda_i \sigma_k^k \leq \sigma_k^k. \]
2. Let $Z = X^T M X$. From the first part, the $(k/2)$-th moment of $Z$ is bounded by $\sigma_k^2$. By Markov's inequality, we get that
\[ \mathbf{P}\{Z \geq t\} \leq \mathbf{P}\{Z \geq CT_k^2\} = \mathbf{P}\Big\{Z \geq \frac{C\sigma_k^2}{\epsilon^{2/k}}\Big\} \leq \frac{\epsilon}{C^{k/2}\sigma_k^k}\, \mathbf{E} Z^{k/2} \leq \frac{\epsilon}{C^{k/2}}. \]
We can now apply Hölder's inequality to get
\[ \mathbf{E}\big[Z\; \mathbb{I}_{Z \geq CT_k^2}\big] \leq \big(\mathbf{E} Z^{k/2}\big)^{2/k}\, \big(\mathbf{P}\{Z \geq CT_k^2\}\big)^{1-2/k} \leq \sigma_k^2\, C^{1-k/2}\, \epsilon^{1-2/k}. \]
3. As above, let $Z = X^T M X$. Since $f(x) \leq x$, we get that $\mathbf{E} f(X^T M X) \leq \mathbf{E}\, X^T M X \leq 1$. For the lower bound, using the covariance assumption and part 2, we get that
\[ \mathbf{E} f(X^T M X) \geq \mathbf{E}\big[Z\; \mathbb{I}_{Z \leq CT_k^2}\big] = \mathbf{E} Z - \mathbf{E}\big[Z\; \mathbb{I}_{Z > CT_k^2}\big] \geq 1 - \sigma_k^2 \epsilon^{2-2/k} - \sigma_k^2 \epsilon^{1-2/k} C^{1-k/2}. \]
4. Let $Z = X^T v$. We note that
\[ \mathbf{P}(|Z| \geq t) \leq \mathbf{P}(|Z| \geq CT_k) \leq \mathbf{P}(|Z|^k \geq C^k T_k^k) \leq \frac{\sigma_k^k}{\sigma_k^k\, \epsilon^{-1} C^k} \leq C^{-k}\epsilon. \]
We now bound the deviation in mean caused by truncation. Since $\mathbf{E} Z = \mathbf{E}\big[Z\, \mathbb{I}_{|Z| \leq t}\big] + \mathbf{E}\big[Z\, \mathbb{I}_{|Z| > t}\big] = 0$, Hölder's inequality gives
\[ \big|\mathbf{E}\, Z\, \mathbb{I}_{|Z| \leq t}\big| = \big|\mathbf{E}\, Z\, \mathbb{I}_{|Z| > t}\big| \leq \big(\mathbf{E}|Z|^k\big)^{1/k}\, \big(\mathbf{P}\{|Z| > t\}\big)^{1-1/k} \leq \sigma_k\, C^{1-k}\, \epsilon^{1-1/k}. \]
5. Let $Z = X^T v$. We get that
\[ |\mathbf{E}\, g(Z)| = |\mathbf{E}(Z - g(Z))| \leq \mathbf{E}|Z - g(Z)| \leq \mathbf{E}\big[|Z|\, \mathbb{I}_{|Z| \geq CT_k}\big] \leq \sigma_k\, C^{1-k}\, \epsilon^{1-1/k}. \]
6. This follows by taking $M = \frac{1}{d} I$ in the first part.

7. This follows by Markov's inequality and the previous part.
Lemma B.5. Let $P$ be a distribution with mean $\mu$ and covariance $I$. Let $X \sim P$. For $k \geq 2$, let its $k$-th central moment be bounded as
\[ \text{for all } v \in \mathcal{S}^{d-1}: \qquad \big(\mathbf{E}|v^T (X-\mu)|^k\big)^{1/k} \leq \sigma_k. \]
For $\epsilon \leq 0.5$, let $E$ be the event $E = \{\|X - \mu\| \leq T\}$, where $T$ is such that $\mathbf{P}(E) \geq 1 - \epsilon$. Let $Z$ be the random variable $X \mid E$, that is, $X$ conditioned on $X \in E$. Then, we have that:

1. $\|\mu - \mathbf{E} Z\| \leq \frac{1}{1-\epsilon}\, \sigma_k \epsilon^{1-1/k} \leq 2\sigma_k \epsilon^{1-1/k}$.

2. $(1 - 5\sigma_k^2 \epsilon^{1-2/k})\, I \preceq \mathrm{Cov}(Z) \preceq I$.

Proof. We prove each statement in turn.

1. Let $Q$ be the distribution of $Z$. We will assume that $\mathbf{P}(E^c) > 0$, as otherwise the results hold trivially. Let $R$ be the distribution of $X$ conditioned on $X \in E^c$, and let $Y \sim R$. Note that $P$ can be written as the convex combination of $Q$ and $R$:
\[ P = \mathbf{P}(E)\, Q + (1 - \mathbf{P}(E))\, R. \tag{29} \]
Using this decomposition, we can calculate the shift in mean along any direction $v \in \mathcal{S}^{d-1}$: since $\mathbf{P}(E)\, v^T \mathbf{E} Z + (1 - \mathbf{P}(E))\, \mathbf{E}\, v^T Y = v^T \mathbf{E} X = v^T\mu$, we get
\[ v^T(\mathbf{E} Z - \mu) = \frac{1}{\mathbf{P}(E)}\, \mathbf{E}\big[ -v^T(X - \mu)\; \mathbb{I}_{X \notin E} \big] \leq \frac{1}{\mathbf{P}(E)}\, \big(\mathbf{E}|v^T(X-\mu)|^k\big)^{1/k}\, \big(\mathbf{P}(E^c)\big)^{1-1/k} \leq \frac{\sigma_k\, \epsilon^{1-1/k}}{1-\epsilon}, \]
where the first inequality uses Hölder's inequality. Therefore, $\|\mathbf{E} Z - \mu\| \leq \sigma_k \epsilon^{1-1/k}/(1-\epsilon)$.

2. We follow the notation from the previous part. Note that, for all $v \in \mathcal{S}^{d-1}$, the mean minimizes the quadratic loss, so $\mathbf{E}(v^T(Z - \mathbf{E} Z))^2 \leq \mathbf{E}(v^T(Z - \mu))^2$. Moreover, for every $v$ we have $\mathbf{E}(v^T(Z-\mu))^2 \leq \mathbf{E}(v^T(Y-\mu))^2$, since conditioning on $E$ retains exactly the samples closest to $\mu$. As $\mathbf{E}(v^T(X - \mu))^2$ is a convex combination of $\mathbf{E}(v^T(Z-\mu))^2$ and $\mathbf{E}(v^T(Y-\mu))^2$, and is thus at least the minimum of the two, we get
\[ \mathbf{E}(v^T(Z - \mu))^2 = \min\big( \mathbf{E}(v^T(Y-\mu))^2,\; \mathbf{E}(v^T(Z-\mu))^2 \big) \leq \mathbf{E}(v^T(X - \mu))^2 = 1. \]
Therefore, we obtain the upper bound $\mathbf{E}(v^T(Z - \mathbf{E} Z))^2 \leq \mathbf{E}(v^T(Z - \mu))^2 \leq 1$.

We now turn our attention to the lower bound. We first note that, by Hölder's inequality,
\[ (1 - \mathbf{P}(E))\, \mathbf{E}(v^T(Y - \mu))^2 = \mathbf{E}\big[(v^T(X-\mu))^2\; \mathbb{I}\{X \in E^c\}\big] \leq \big(\mathbf{E}|v^T(X-\mu)|^k\big)^{2/k}\, \big(\mathbf{P}(E^c)\big)^{1-2/k} \leq \sigma_k^2\, \epsilon^{1-2/k}. \]
Using the definitions of $P$, $Q$, and $R$, we get
\[ \mathbf{E}(v^T(Z - \mu))^2 = \frac{1}{\mathbf{P}(E)}\Big( \mathbf{E}(v^T(X-\mu))^2 - (1 - \mathbf{P}(E))\, \mathbf{E}(v^T(Y-\mu))^2 \Big) \geq 1 - \sigma_k^2\, \epsilon^{1-2/k}. \]
We are now ready to lower bound the deviation from the mean:
\[ \mathbf{E}(v^T(Z - \mathbf{E} Z))^2 = \mathbf{E}(v^T(Z - \mu))^2 - \big(v^T(\mathbf{E} Z - \mu)\big)^2 \geq 1 - \sigma_k^2 \epsilon^{1-2/k} - 4\sigma_k^2 \epsilon^{2-2/k} \geq 1 - 5\sigma_k^2 \epsilon^{1-2/k}. \]
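Lemma B.5 can be checked empirically. The sketch below (with illustrative choices of distribution and parameters that are ours, not the paper's) conditions a heavy-tailed isotropic sample on a norm ball containing $1-\epsilon$ of the mass and reports the induced mean shift and covariance spectrum.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, eps = 200_000, 5, 0.05

# Heavy-tailed isotropic X with mean 0: independent centered Pareto
# coordinates, rescaled so each coordinate has unit variance.
a = 4.5                                       # shape: 4th moments exist
x = rng.pareto(a, size=(n, d)) - 1.0 / (a - 1.0)
x /= np.sqrt(a / ((a - 1) ** 2 * (a - 2)))

r = np.linalg.norm(x, axis=1)
T = np.quantile(r, 1 - eps)                   # threshold with P(E) = 1 - eps
z = x[r <= T]                                 # Z = X | E

print("mean shift ||E Z - mu||:", np.linalg.norm(z.mean(axis=0)))
eigs = np.linalg.eigvalsh(np.cov(z, rowvar=False))
print("Cov(Z) eigenvalue range:", eigs.min(), eigs.max())
```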
Organization.
This section contains the proofs of Lemma 2.2 and Lemma 4.2 from the main paper. In Section C.1, we prove the results controlling the number of outliers uniformly along all directions $v \in \mathcal{S}^{d-1}$. We then generalize these results to projections along PSD matrices in Section C.2.
We state Lemma 1 from Lugosi and Mendelson [LM19b]. We will use this result for distributions with bounded covariance.
Lemma C.1 ([LM19b, Lemma 1]). Let $x_1, \ldots, x_n$ be $n$ i.i.d. points from a distribution with mean zero and covariance $\Sigma \preceq I$. Let $Q$ be defined as follows:
\[ Q = \frac{256}{\epsilon}\sqrt{\frac{\mathrm{tr}(\Sigma)}{n}} + \frac{16}{\sqrt{\epsilon}}. \]
Then, for a constant $c > 0$, with probability at least $1 - \exp(-c\epsilon n)$,
\[ \sup_{v \in \mathcal{S}^{d-1}} \big|\{ i : |v^T x_i| \geq Q \}\big| \leq 0.2\, \epsilon n. \]
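Since the supremum over all directions cannot be probed exhaustively, the following sketch checks only a necessary condition of Lemma C.1 on randomly drawn directions; it is a numerical illustration of the counted quantity, not a verification of the uniform statement. All sizes are arbitrary choices of ours.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, eps = 20_000, 20, 0.05
x = rng.normal(size=(n, d))            # mean zero, covariance I

Q = (256 / eps) * np.sqrt(d / n) + 16 / np.sqrt(eps)   # tr(Sigma) = d

worst = 0
for _ in range(200):                   # probe random unit directions
    v = rng.normal(size=d)
    v /= np.linalg.norm(v)
    worst = max(worst, int(np.sum(np.abs(x @ v) >= Q)))
print(f"max count over probed v: {worst}   (budget 0.2*eps*n = {0.2*eps*n:.0f})")
```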
We state the following straightforward generalization of Lemma C.1 for distributions with bounded central moments. We give the proof for completeness.
Lemma C.2.
Let $x_1, \ldots, x_n$ be $n$ i.i.d. points from a distribution with mean zero and covariance $\Sigma \preceq I$. Further assume that, for some $k \geq 2$,
\[ \text{for all } v \in \mathcal{S}^{d-1}: \qquad \big(\mathbf{E}(v^T X)^k\big)^{1/k} \leq \sigma_k. \tag{30} \]
Let $Q_k$ be defined as follows:
\[ Q_k = O\Big( \frac{1}{\epsilon}\sqrt{\frac{\mathrm{tr}(\Sigma)}{n}} + \sigma_k\, \epsilon^{-1/k} \Big). \]
Then, there exists a $c > 0$ such that, with probability at least $1 - \exp(-cn\epsilon)$,
\[ \sup_{v \in \mathcal{S}^{d-1}} \big|\{ i : |x_i^T v| \geq Q_k \}\big| = O(n\epsilon). \tag{31} \]
Proof. We follow the same strategy as Lugosi and Mendelson [LM19b]. We first set
\[ Q_k = C\Big( \frac{1}{\epsilon}\sqrt{\frac{\mathrm{tr}(\Sigma)}{n}} + \sigma_k\, \epsilon^{-1/k} \Big) \]
for a large enough constant $C$ to be determined later. Consider the function $\chi : \mathbb{R} \to \mathbb{R}$ defined by
\[ \chi(x) = \begin{cases} 0, & x \leq Q_k/2,\\ \frac{2x}{Q_k} - 1, & x \in [Q_k/2,\, Q_k],\\ 1, & x \geq Q_k. \end{cases} \tag{32} \]
Therefore, $\mathbb{I}_{x \geq Q_k} \leq \chi(x) \leq \mathbb{I}_{x \geq Q_k/2}$, and note that $\chi(\cdot)$ is $(2/Q_k)$-Lipschitz. We first bound the number of points violating the upper tail bound; the lower tail is handled identically by replacing $v$ with $-v$. The random quantity of interest is
\[ Z = \sup_{v \in \mathcal{S}^{d-1}} \sum_{i=1}^n \mathbb{I}_{x_i^T v \geq Q_k}. \tag{33} \]
We first bound its expectation using the symmetrization principle [LT91, BLM13]. We have that
\[ \mathbf{E} Z \leq \mathbf{E}\sup_{v} \sum_{i=1}^n \chi(x_i^T v) \leq \mathbf{E}\sup_{v} \sum_{i=1}^n \big(\chi(x_i^T v) - \mathbf{E}\chi(x_i^T v)\big) + \sup_{v} \sum_{i=1}^n \mathbf{E}\chi(x_i^T v) \leq 2\,\mathbf{E}\sup_{v} \sum_{i=1}^n \epsilon_i\, \chi(x_i^T v) + \sup_{v} \sum_{i=1}^n \mathbf{E}\chi(x_i^T v). \tag{34} \]
We bound the second term in Eq. (34) by
\[ \sum_{i=1}^n \mathbf{E}\chi(x_i^T v) \leq \sum_{i=1}^n \mathbf{E}\,\mathbb{I}_{x_i^T v \geq Q_k/2} = n\,\mathbf{P}\big(x_i^T v \geq Q_k/2\big) \leq n\,\mathbf{P}\big(x_i^T v \geq C\sigma_k \epsilon^{-1/k}/2\big) = O(n\epsilon), \]
by applying Markov's inequality and choosing a large enough constant $C$ in $Q_k$. For the first term in Eq. (34), we remove $\chi(\cdot)$ using the contraction principle for Rademacher averages and the independence of the $x_i$:
\[ \mathbf{E}\sup_{v} \sum_{i=1}^n \epsilon_i\, \chi(x_i^T v) \leq \frac{2}{Q_k}\,\mathbf{E}\sup_{v} \sum_{i=1}^n \epsilon_i\, x_i^T v = \frac{2}{Q_k}\,\mathbf{E}\Big\| \sum_i \epsilon_i x_i \Big\| \leq \frac{2}{Q_k}\sqrt{n\,\mathrm{tr}(\Sigma)} = O(n\epsilon), \]
where we use the covariance bound on $x_i$ and the fact that $Q_k \geq (C/\epsilon)\sqrt{\mathrm{tr}(\Sigma)/n}$. Therefore, we get that $\mathbf{E} Z = O(n\epsilon)$. We can bound the wimpy variance, i.e., the quantity $\sigma^2$ in Theorem B.1, by $O(\epsilon n)$. By Talagrand's concentration inequality (Theorem B.1), with probability $1 - \exp(-cn\epsilon)$,
\[ Z = O\big( n\epsilon + \sqrt{n\epsilon}\sqrt{cn\epsilon} + cn\epsilon \big) = O(n\epsilon). \tag{35} \]

C.2 Matrix Projections
We will now use the results from the previous section to prove Lemma 2.2 and Lemma 4.2. The proof follows the ideas from [DL19, Proposition 1].
Lemma C.3.
Suppose that the event $\mathcal{E}_1$ holds, where
\[ \mathcal{E}_1 := \Big\{ \sup_{v \in \mathcal{S}^{d-1}} \big|\{ i : |x_i^T v| \geq Q_1 \}\big| \leq 0.2\, \epsilon n \Big\}. \]
Let $Q_2 = 8 Q_1$. Then the event $\mathcal{E}_2$ also holds, where $\mathcal{E}_2$ is defined as follows:
\[ \mathcal{E}_2 := \Big\{ \sup_{M \in \mathcal{M}} \big|\{ i : x_i^T M x_i \geq Q_2^2 \}\big| \leq 2\, \epsilon n \Big\}. \]
Proof. We follow the same proof strategy as Depersin and Lecué [DL19] and reproduce the argument here for completeness.

Suppose that $\mathcal{E}_1$ holds but the desired event $\mathcal{E}_2$ does not. Let $M$ be such that $|\{ i : x_i^T M x_i \geq Q_2^2 \}| > 2\epsilon n$. Let $G$ be the Gaussian vector in $\mathbb{R}^d$, independent of $x_1, \ldots, x_n$, with distribution $\mathcal{N}(0, M)$. We will work conditionally on $x_1, \ldots, x_n$ in the remainder of the proof. Let $Z$ be the following random variable:
\[ Z = \sum_{i=1}^n \mathbb{I}_{|x_i^T G| \geq Q_2}. \]
We have that $x_i^T G \sim \mathcal{N}(0,\, x_i^T M x_i)$. For $i$ such that $x_i^T M x_i \geq Q_2^2$, we have that
\[ \mathbf{P}(|x_i^T G| > Q_2) \geq \mathbf{P}(|g| \geq 1) > 0.3, \]
where $g$ is a standard Gaussian random variable. Therefore,
\[ \mathbf{E} Z = \sum_{i=1}^n \mathbf{P}(|x_i^T G| > Q_2) \geq (2\epsilon n)(0.3). \]
Note that $Z$ is a sum of independent indicator random variables. A Chernoff bound states that, with probability at least $1 - \exp(-cn\epsilon)$, $Z \geq \mathbf{E} Z/2 > 0.3\, n\epsilon$. Moreover, by Gaussian concentration (see, e.g., [BLM13]), with probability at least $0.9$ we have $\|G\| \leq 4$ (recall $\mathbf{E}\|G\|^2 = \mathrm{tr}(M) = 1$). Taking a union bound, both of these events happen simultaneously with non-zero probability. Therefore, with non-zero probability, there exists $u$ with $\|u\| \leq 4$ such that
\[ \sum_{i=1}^n \mathbb{I}_{|x_i^T u| \geq Q_2} > 0.3\, n\epsilon. \]
That is, there exists $v$ with $\|v\| = 1$ such that
\[ \sum_{i=1}^n \mathbb{I}_{|x_i^T v| \geq Q_2/4} > 0.3\, n\epsilon \quad \equiv \quad \sum_{i=1}^n \mathbb{I}_{|x_i^T v| \geq 2Q_1} > 0.3\, n\epsilon, \]
which contradicts $\mathcal{E}_1$, since $2Q_1 \geq Q_1$ and $0.3\, n\epsilon > 0.2\, n\epsilon$. This completes the proof.

We are now ready to prove Lemmas 2.2 and 4.2.

Proof. (Proof of Lemma 2.2) It follows from Lemma C.1, due to Lugosi and Mendelson [LM19b, Lemma 1], and Lemma C.3.
Proof. (Proof of Lemma 4.2) It follows from Lemma C.2 (which might require a change of variables) and Lemma C.3.
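The engine of Lemma C.3 is the elementary fact that, for $G \sim \mathcal{N}(0, M)$, the projection $x^T G$ is Gaussian with variance $x^T M x$, which converts a quadratic-form threshold into a linear one. A quick numerical confirmation (with illustrative sizes of our choosing):

```python
import numpy as np

rng = np.random.default_rng(4)
d = 30

# A PSD matrix M with trace 1, i.e., an arbitrary element of the set M.
A = rng.normal(size=(d, d))
M = A @ A.T
M /= np.trace(M)

x = rng.normal(size=d) * 3.0              # a fixed data point
s2 = x @ M @ x                            # x^T M x

# Sample G ~ N(0, M) via the eigendecomposition of M.
w, U = np.linalg.eigh(M)
G = (U * np.sqrt(np.clip(w, 0, None))) @ rng.normal(size=(d, 100_000))
proj = x @ G
print("empirical Var(x^T G):", proj.var(), "  vs  x^T M x:", s2)
```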
D Stability for Distributions with Bounded Covariance
Organization.
Section D.1 contains the proof of the sufficient conditions for stability under the bounded covariance assumption (Claim 2.1). Section D.2 contains the arguments for deterministic rounding (Lemma D.2).
D.1 Sufficient Conditions for Stability
The following claim simplifies the stability condition for the bounded covariance case.
Claim D.1 (Claim 2.1). Let $S$ be a set such that $\|\mu_S - \mu\| \leq \sigma\delta$ and $\|\Sigma_S - \sigma^2 I\| \leq \sigma^2\delta^2/\epsilon$, for some $0 < \epsilon \leq \delta^2$. Let $\epsilon' \leq 1/2$. Then $S$ is $(\epsilon', \delta')$-stable with respect to $\mu$ and $\sigma$, where $\delta' = \delta + 2\sqrt{\epsilon'} + 2\delta\sqrt{\epsilon'/\epsilon}$.

Proof. Without loss of generality, we may assume that $\sigma = 1$ and $\mu = 0$. For any $S' \subseteq S$ with $|S'| \geq (1-\epsilon')|S|$ and any unit vector $v$,
\[ \frac{1}{|S'|}\sum_{i\in S'}(x_i^T v)^2 - 1 \leq \frac{|S|}{|S'|}\cdot\frac{1}{|S|}\sum_{i\in S}(x_i^T v)^2 - 1 \leq \frac{1}{1-\epsilon'}\Big(1 + \frac{\delta^2}{\epsilon}\Big) - 1 = \frac{1}{1-\epsilon'}\Big(\epsilon' + \frac{\delta^2}{\epsilon}\Big) \leq 2\epsilon' + \frac{2\delta^2}{\epsilon} \leq \frac{(\delta')^2}{\epsilon'}. \]
Since $\delta' \geq 2\sqrt{\epsilon'}$, the lower bound on the eigenvalues of $\Sigma_{S'}$ is trivially satisfied (as $1 - (\delta')^2/\epsilon' \leq 1 - 4 < 0$). We now bound the deviation in mean. Observe that the uniform distribution on $S'$ can be obtained by conditioning the uniform distribution on $S$ on an event $E$ with $\mathbf{P}(E) \geq 1 - \epsilon'$. Using this observation in conjunction with Hölder's inequality, for any unit vector $v$ the shift in mean is at most
\[ \Big| \frac{1}{|S'|}\sum_{i\in S'} v^T x_i - \frac{1}{|S|}\sum_{i\in S} v^T x_i \Big| \leq \sqrt{1 + \frac{\delta^2}{\epsilon}}\cdot 2\sqrt{\epsilon'} \leq 2\sqrt{\epsilon'} + 2\delta\sqrt{\frac{\epsilon'}{\epsilon}}. \tag{36} \]
Together with $\|\mu_S\| \leq \delta$, this gives $\|\mu_{S'} - \mu\| \leq \delta'$.

D.2 Deterministic Rounding of the Weight Function
The next lemma states that, for stability, it suffices to find a suitable distribution $w \in \Delta_{n,\epsilon}$.

Lemma D.2 (Lemma 2.8). For $\epsilon \leq 1/3$, let $w^* \in \Delta_{n,\epsilon}$ be such that, for $\epsilon \leq \delta^2$, we have:
1. $\|\mu_{w^*} - \mu\| \leq \sigma\delta$, and
2. $\|\Sigma_{w^*} - \sigma^2 I\| \leq \sigma^2\delta^2/\epsilon$.
Then there exists a subset $S_0 \subseteq S$ such that:
1. $|S_0| \geq (1 - 2\epsilon)|S|$, and
2. $S_0$ is $(\epsilon', \delta')$-stable with respect to $\mu$ and $\sigma$, where $\delta' = O(\delta + \sqrt{\epsilon} + \sqrt{\epsilon'})$.

Proof. Without loss of generality, we will assume that $\sigma = 1$, that $\mu = 0$, and that $\epsilon n$ is an integer. We will use Claim D.1 to prove this result, by first exhibiting a subset with bounded covariance and a good sample mean. Write $w$ for $w^*$, and order the weights as $w_{(1)} \geq w_{(2)} \geq \cdots \geq w_{(n)}$; recall that $w_{(1)} \leq \frac{1}{(1-\epsilon)n}$. For any $k \in [n]$, we have that
\[ 1 = \sum_i w_i \leq \frac{n-k}{(1-\epsilon)n} + k\, w_{(n-k)} \tag{37} \]
\[ \implies w_{(n-k)} \geq \frac{1}{k}\Big( 1 - \frac{n-k}{(1-\epsilon)n} \Big) = \frac{k - \epsilon n}{(1-\epsilon)nk}. \tag{38} \]
Setting $k = 2\epsilon n$, we have that
\[ w_{(n - 2\epsilon n)} \geq \frac{\epsilon n}{2\epsilon n\, (1-\epsilon) n} = \frac{1}{2(1-\epsilon)n}. \tag{39} \]
We thus have a lower bound of $\frac{1}{2(1-\epsilon)n}$ on each of the $(1-2\epsilon)n$ largest weights. Now let $S_0$ be the set of the $(1-2\epsilon)n$ points with the largest $w_i$. We have that, for any unit vector $v$,
\[ \sum_{i \in S_0} \frac{1}{|S_0|}(x_i^T v)^2 = \sum_{i\in S_0} \frac{1}{(1-2\epsilon)n}(x_i^T v)^2 \leq \sum_{i\in S_0} \frac{2(1-\epsilon)}{1-2\epsilon}\, w_i (x_i^T v)^2 \leq \frac{2(1-\epsilon)}{1-2\epsilon}\Big(1 + \frac{\delta^2}{\epsilon}\Big) \leq 4\Big(1 + \frac{\delta^2}{\epsilon}\Big), \tag{40} \]
using Eq. (39) and $\epsilon \leq 1/3$. Let $u$ be the uniform distribution on $S$ and let $u^{(1)}$ be the uniform distribution on $S_0$. We now calculate the total variation distance between $w$ and $u^{(1)}$:
\[ d_{\mathrm{TV}}(w, u^{(1)}) \leq d_{\mathrm{TV}}(w, u) + d_{\mathrm{TV}}(u, u^{(1)}) \leq \epsilon + 2\epsilon = 3\epsilon. \tag{41} \]
Therefore, there exist distributions $p^{(1)}, p^{(2)}, p^{(3)}$ such that $w = (1-3\epsilon)p^{(1)} + 3\epsilon\, p^{(2)}$ and $u^{(1)} = (1-3\epsilon)p^{(1)} + 3\epsilon\, p^{(3)}$. This decomposition follows from an alternate characterization of total variation distance (see, e.g., [Tsy08, Lemma 2.1]). We first note that
\[ 3\epsilon \sum_i p^{(2)}_i (x_i^T v)^2 \leq \sum_i w_i (x_i^T v)^2 \leq 1 + \frac{\delta^2}{\epsilon}, \qquad 3\epsilon \sum_i p^{(3)}_i (x_i^T v)^2 \leq \sum_i u^{(1)}_i (x_i^T v)^2 \leq 4\Big(1 + \frac{\delta^2}{\epsilon}\Big). \]
Therefore, we get that
\[ \Big| \sum_{i=1}^n (1-3\epsilon) p^{(1)}_i x_i^T v \Big| \leq \Big| \sum_{i=1}^n w_i x_i^T v \Big| + 3\epsilon\, \Big|\sum_i p^{(2)}_i x_i^T v\Big| \leq \delta + \sqrt{3\epsilon}\,\sqrt{3\epsilon \sum_i p^{(2)}_i (x_i^T v)^2} \leq \delta + \sqrt{3\epsilon}\,\sqrt{1 + \frac{\delta^2}{\epsilon}} \leq 3\delta + 2\sqrt{\epsilon}. \]
We finally get that
\[ \Big| \sum_{i=1}^n u^{(1)}_i x_i^T v \Big| \leq \Big| \sum_{i=1}^n (1-3\epsilon) p^{(1)}_i x_i^T v \Big| + 3\epsilon\,\Big|\sum_i p^{(3)}_i x_i^T v\Big| \leq 3\delta + 2\sqrt{\epsilon} + \sqrt{3\epsilon}\,\sqrt{4\Big(1 + \frac{\delta^2}{\epsilon}\Big)} \leq 10\delta + 10\sqrt{\epsilon}. \tag{42} \]
Therefore, using Equations (40) and (42), we have a set $S_0$ that satisfies the conditions of Claim D.1 with $\delta'' = 10\delta + 10\sqrt{\epsilon}$. Using Claim D.1, we get that $S_0$ is $(\epsilon', \delta')$-stable.
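In code, the deterministic rounding of Lemma D.2 is just a weight-sorting step; the sketch below (an illustrative helper of ours) returns the selected subset.

```python
import numpy as np

def round_weights_deterministic(x, w, eps):
    """Sketch of the rounding in Lemma D.2: given w in Delta_{n,eps}
    (so w_i <= 1/((1-eps)n)), keep the (1 - 2*eps)*n points with the
    largest weights.  By the counting argument (Eq. (39)), every kept
    point has w_i >= 1/(2(1-eps)n), so the uniform distribution on the
    kept set inherits the mean/covariance bounds up to constants."""
    n = len(w)
    k = int(2 * eps * n)
    keep = np.argsort(w)[k:]          # drop the 2*eps*n smallest weights
    return x[keep]                    # S_0: take the uniform distribution here
```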
E Stability for Distributions with Bounded Central Moments

Organization.
In this section, we provide the detailed arguments in the proof of Theorem 1.8 that were omitted from the main text. We start with a simplified stability condition in Section E.1. Section E.2 contains the argument for rounding a good distribution $w \in \Delta_{n,\epsilon}$ to a subset. Sections E.3 and E.4 contain the arguments for controlling, respectively, the second moment matrix and the mean.

E.1 Sufficient Conditions for Stability

We will prove the existence of a stable set with high probability using the following claim. It is analogous to Claim D.1 in the bounded covariance setting, but here we also need a lower bound on the minimum eigenvalue of $\Sigma_{S'}$ for all large subsets $S'$.

Claim E.1.
Let $0 < \epsilon \leq \delta^2$ and $\epsilon \leq 1/2$. A set $S$ is $(\epsilon, 4\delta)$-stable (with respect to $\mu$ and $\sigma = 1$) if it satisfies the following for all unit vectors $v$:
1. $\|\mu_S - \mu\| \leq \delta$.
2. $v^T \Sigma_S v \leq 1 + \delta^2/\epsilon$.
3. For all subsets $S' \subseteq S$ with $|S'| \geq (1-\epsilon)|S|$, we have $v^T \Sigma_{S'} v \geq 1 - \delta^2/\epsilon$.

Proof. Without loss of generality, we will assume that $\mu = 0$. We first show the eigenvalue condition in the definition of stability. Let $S'$ be any subset of $S$ with $|S'| \geq (1-\epsilon)|S|$. The minimum eigenvalue of $\Sigma_{S'}$ is lower bounded directly by the third assumption:
\[ v^T \Sigma_{S'} v \geq 1 - \frac{\delta^2}{\epsilon}. \tag{43} \]
We now look at the largest eigenvalue of $\Sigma_{S'}$:
\[ v^T \Sigma_{S'} v - 1 \leq \frac{|S|}{|S'|}\cdot\frac{1}{|S|}\sum_{i\in S}(v^T x_i)^2 - 1 \leq \frac{1}{1-\epsilon}\Big(1 + \frac{\delta^2}{\epsilon}\Big) - 1 = \frac{1}{1-\epsilon}\Big(\frac{\delta^2}{\epsilon} + \epsilon\Big) \leq \frac{2\delta^2}{\epsilon} + 2\epsilon \leq \frac{4\delta^2}{\epsilon}. \]
We now need to show that the mean of $S'$ is also good. To do that, we first control the contribution of the small set $S \setminus S'$:
\[ \frac{1}{|S|}\sum_{i \in S\setminus S'}(v^T x_i)^2 = \frac{1}{|S|}\sum_{i\in S}(v^T x_i)^2 - \frac{|S'|}{|S|}\cdot\frac{1}{|S'|}\sum_{i\in S'}(v^T x_i)^2 \leq \Big(1+\frac{\delta^2}{\epsilon}\Big) - (1-\epsilon)\Big(1-\frac{\delta^2}{\epsilon}\Big) \leq \frac{2\delta^2}{\epsilon} + \epsilon. \tag{44} \]
We break the deviation in mean into two terms and control each individually:
\[ \Big|\frac{1}{|S'|}\sum_{i\in S'} v^T x_i\Big| \leq \frac{|S|}{|S'|}\,\Big|\frac{1}{|S|}\sum_{i\in S} v^T x_i\Big| + \frac{|S|}{|S'|}\,\Big|\frac{1}{|S|}\sum_{i\in S\setminus S'} v^T x_i\Big|. \]
We can upper bound the first term by $\|\mu_S\|/(1-\epsilon) \leq \delta/(1-\epsilon)$. We bound the second term using the Cauchy–Schwarz inequality and Eq. (44):
\[ \frac{|S|}{|S'|}\cdot\frac{1}{|S|}\Big|\sum_{i\in S\setminus S'} v^T x_i\Big| \leq \frac{\sqrt{|S\setminus S'|\,|S|}}{|S'|}\,\sqrt{\frac{1}{|S|}\sum_{i\in S\setminus S'}(v^T x_i)^2} \leq \frac{\sqrt{\epsilon}}{1-\epsilon}\,\sqrt{\frac{2\delta^2}{\epsilon}+\epsilon}. \]
Overall, using $\epsilon \leq \delta^2 \leq \delta$, we get that
\[ |v^T \mu_{S'}| \leq \frac{1}{1-\epsilon}\Big( \delta + \sqrt{2\delta^2 + \epsilon^2} \Big) \leq \frac{1}{1-\epsilon}\big( \delta + \sqrt{2}\,\delta + \epsilon \big) \leq 4\delta. \]

E.2 Randomized Rounding of Weight Function
In this section, we show how to recover a subset from a distribution $w \in \Delta_{n,\epsilon}$. Unlike the deterministic rounding in Section D.2, we use a randomized rounding in Lemma E.2 to get a better dependence on $\epsilon$. For the second condition ($\delta' = O(\sqrt{\epsilon})$) in Lemma E.2 to hold, it is necessary that $n = \Omega(d)$. If $n = O(d)$, this is not a problem because, in this regime, the bounded covariance assumption already leads to optimal error.

Lemma E.2.
Let $k \geq 2$, and let $w \in \Delta_{n,\epsilon}$, for $\epsilon \leq 1/3$, be a distribution on the set of points $S$ such that:
1. $\|\mu_w - \mu\| \leq \delta$.
2. $\|\Sigma_w - I\| \leq \delta^2/\epsilon \leq r_1$, for some $r_1 \geq 1$.
3. Let $C \geq 6$. For all subsets $S'$ with $|S'| \geq (1 - C\epsilon)n$ and all $v \in \mathcal{S}^{d-1}$: $v^T \Sigma_{S'} v \geq 1 - \delta^2/(C\epsilon)$.
4. $w_i > 0$ implies that $\|x_i\| \leq r_2\, \sigma_k \sqrt{d}\, \epsilon^{-1/k}$, for some $r_2 \geq 1$.
Then, there exists a subset $S_0 \subseteq [n]$ such that:
1. $|S_0| \geq (1 - 2\epsilon)n$.
2. $S_0$ is $(\epsilon', \delta')$-stable, where $\epsilon' = (C-2)\epsilon/2$ and
\[ \delta' = O\Big( \delta + \sqrt{\frac{r_1\, d \log d}{n}} + r_2\sigma_k\, \epsilon^{1/2-1/k}\sqrt{\frac{d \log d}{n}} + r_2\sigma_k\, \epsilon^{1-1/k} \Big). \tag{45} \]
Proof. We will use Claim E.1 to prove this result. Without loss of generality, let $\mu = 0$; it then suffices to find a subset for which both the mean and the largest eigenvalue are controlled. Let $Y_i \sim \mathrm{Bernoulli}\big(w_i(1-\epsilon)n\big)$, independently; note that $w_i(1-\epsilon)n \leq 1$ since $w \in \Delta_{n,\epsilon}$. We have $\sum_{i=1}^n \mathbf{E} Y_i = (1-\epsilon)n$. Let $S_0$ be the (random) set
\[ S_0 = \{ i : Y_i = 1 \}. \tag{46} \]
By a Chernoff bound, we have that, for some constant $c' > 0$,
\[ \mathbf{P}\big( |S_0| \leq (1-2\epsilon)n \big) \leq \exp(-c' n\epsilon^2). \tag{47} \]
Let $\mathcal{E}_1$ be the event $\mathcal{E}_1 = \{|S_0| \geq (1-2\epsilon)n\}$. We now bound the mean of the set $S_0$. Consider the following random variable $Z$:
\[ Z = \sum_i \big( Y_i - (1-\epsilon)\, w_i n \big)\, x_i. \tag{48} \]
The random variable $Z$ satisfies $\mathbf{E} Z = 0$. Moreover, its covariance can be bounded using the assumptions as follows:
\[ v^T \Sigma_Z v = \sum_{i=1}^n w_i(1-\epsilon)n\,\big(1 - w_i(1-\epsilon)n\big)\,(v^T x_i)^2 \leq (1-\epsilon)n \sum_{i=1}^n w_i (x_i^T v)^2 \leq (1-\epsilon)n\Big(1 + \frac{\delta^2}{\epsilon}\Big) \leq 2 r_1 n. \]
Therefore, by Markov's inequality on $\mathbf{E}\|Z\|^2 = \mathrm{tr}(\Sigma_Z) \leq 2 r_1 n d$, with probability at least $0.9$ we have $\|Z\| \leq 10\sqrt{2 r_1 n d}$, which implies
\[ \Big\| \sum_i Y_i x_i \Big\| \leq (1-\epsilon)n\, \Big\| \sum_i w_i x_i \Big\| + 10\sqrt{2 r_1 n d}. \]
Let $\mathcal{E}_2$ be this event. On the event $\mathcal{E}_1 \cap \mathcal{E}_2$,
\[ \|\mu_{S_0}\| \leq \frac{1-\epsilon}{1-2\epsilon}\,\delta + \frac{10\sqrt{2 r_1 n d}}{(1-2\epsilon)n} \leq \delta + 30\sqrt{\frac{r_1 d}{n}}. \tag{49} \]
We now focus our attention on upper bounding the largest eigenvalue. Define the symmetric random matrices
\[ Z_i := \big( Y_i - w_i(1-\epsilon)n \big)\, x_i x_i^T. \]
We have $\mathbf{E} Z_i = 0$ and $\|Z_i\| \leq r_2^2 \sigma_k^2\, d\, \epsilon^{-2/k}$ almost surely. We now bound the matrix variance statistic (used in Theorem B.2):
\[ \nu(Z) = \Big\| \sum_{i=1}^n w_i(1-\epsilon)n\,\big(1 - w_i(1-\epsilon)n\big)\,\|x_i\|^2\, x_i x_i^T \Big\| \leq r_2^2 \sigma_k^2\, d\, \epsilon^{-2/k}\,(1-\epsilon)n\,\Big\| \sum_{i=1}^n w_i x_i x_i^T \Big\| \leq 2 r_2^2 \sigma_k^2\, n d\, \epsilon^{-2/k}. \]
By matrix concentration (Theorem B.2, combined with Markov's inequality), with probability at least $0.9$ we have
\[ \Big\| \sum_{i=1}^n \big( Y_i - w_i(1-\epsilon)n \big) x_i x_i^T \Big\| = O\Big( \sqrt{r_2^2\sigma_k^2\, \epsilon^{-2/k}\, n d \log d} + r_2^2\sigma_k^2\, \epsilon^{-2/k}\, d \log d \Big). \tag{50} \]
Let $\mathcal{E}_3$ be the event above. Under the event $\mathcal{E}_1 \cap \mathcal{E}_3$, we get that
\[ v^T \Sigma_{S_0} v \leq \frac{(1-\epsilon)n}{(1-2\epsilon)n} \sum_i w_i (x_i^T v)^2 + \frac{1}{(1-2\epsilon)n}\, O\Big( \sqrt{r_2^2\sigma_k^2\epsilon^{-2/k} n d\log d} + r_2^2\sigma_k^2\epsilon^{-2/k} d\log d \Big) \leq 1 + \frac{1}{\epsilon}\, O\Big( \epsilon^2 + \delta^2 + r_2\sigma_k\,\epsilon^{1-1/k}\sqrt{\frac{d\log d}{n}} + r_2^2\sigma_k^2\,\epsilon^{1-2/k}\,\frac{d\log d}{n} \Big) \leq 1 + \frac{(\delta')^2}{\epsilon}, \tag{51} \]
with $\delta'$ as in Eq. (45) (the cross term is absorbed by the AM–GM inequality). Let $\epsilon' = (C-2)\epsilon/2$. Note that if $|S_0| \geq (1-2\epsilon)|S|$, then $|S'| \geq (1-\epsilon')|S_0|$ implies $|S'| \geq (1-C\epsilon)|S|$, which yields the required lower bound on the minimum eigenvalue. This follows from the elementary calculation
\[ \frac{|S'|}{|S|} \geq (1-\epsilon')\,\frac{|S_0|}{|S|} \geq (1-2\epsilon)\big(1 - (C-2)\epsilon/2\big) \geq 1 - C\epsilon. \tag{52} \]
Since each of the events $\mathcal{E}_1$, $\mathcal{E}_2$, $\mathcal{E}_3$ fails with small probability, all three hold simultaneously with positive probability. Using Equations (47), (49), and (51), we get that there exists a subset $S_0$ such that, for all $v \in \mathcal{S}^{d-1}$ and $\delta'$ as above:
1. $|S_0| \geq (1-2\epsilon)n$;
2. $\|\mu_{S_0}\| \leq \delta'$;
3. $v^T \Sigma_{S_0} v \leq 1 + (\delta')^2/\epsilon'$;
4. for all subsets $S' \subseteq S_0$ with $|S'| \geq (1-\epsilon')|S_0|$: $v^T \Sigma_{S'} v \geq 1 - (\delta')^2/\epsilon'$.
We now invoke Claim E.1 to conclude that $S_0$ is $(\epsilon', O(\delta'))$-stable.
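For comparison with Section D.2, the randomized rounding used in this proof is equally short in code; the following sketch (illustrative, using the Bernoulli probabilities from Eq. (46)) returns the random subset $S_0$.

```python
import numpy as np

def round_weights_randomized(x, w, eps, rng=None):
    """Sketch of the randomized rounding in Lemma E.2: keep point i
    with probability w_i * (1 - eps) * n, which is at most 1 because
    w lies in Delta_{n,eps}.  The expected number of kept points is
    (1 - eps)*n, and a Chernoff bound gives |S_0| >= (1 - 2*eps)*n
    with high probability (Eq. (47))."""
    rng = rng or np.random.default_rng()
    n = len(w)
    p = np.clip(w * (1 - eps) * n, 0.0, 1.0)
    keep = rng.random(n) < p          # Y_i ~ Bernoulli(w_i (1-eps) n)
    return x[keep]
```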
E.3 Upper Bound on the Second Moment Matrix

Lemma E.3.
Consider the conditions in Lemma 4.3. Then, with probability $1 - \tau$, $R'/n \leq \delta^2/\epsilon$, where $\delta = O\big( \sqrt{d\log d/n} + \sigma_k \epsilon^{1-1/k} + \sigma\sqrt{\log(1/\tau)/n} \big)$.

Proof. (Proof of Lemma E.3) We first calculate the wimpy variance required for Theorem B.1:
\[ \sigma^2 = \sup_{M \in \mathcal{M}} \sum_{i=1}^n \mathbf{V}\big( f(x_i^T M x_i) \big) \leq \sup_{M \in \mathcal{M}} \sum_{i=1}^n \mathbf{E}\, f(x_i^T M x_i)^2 \tag{53} \]
\[ \leq n \sup_{M \in \mathcal{M}} \mathbf{E}(x_i^T M x_i)^2 \leq n\sigma^2. \tag{54} \]
We use symmetrization, contraction, and matrix concentration (Theorem B.2) to bound $\mathbf{E} R'$ as follows:
\[ \mathbf{E} R' = \mathbf{E}\sup_{M\in\mathcal{M}} \sum_{i=1}^n \big( f(x_i^T M x_i) - \mathbf{E} f(x_i^T M x_i) \big) \leq 2\,\mathbf{E}\sup_{M} \sum_{i=1}^n \epsilon_i\, f(x_i^T M x_i) \leq 2\,\mathbf{E}\sup_{M} \sum_{i=1}^n \epsilon_i\, x_i^T M x_i = 2\,\mathbf{E}\Big\| \sum_{i=1}^n \epsilon_i x_i x_i^T \Big\| = O\Big( \sqrt{\sigma_k^2\, \epsilon^{-2/k}\, n d \log d} + \sigma_k^2\, \epsilon^{-2/k}\, d\log d \Big), \]
where we apply Theorem B.2 with $\nu = O(\sigma_k^2\, n d\, \epsilon^{-2/k})$ and $L = O(\sigma_k^2\, d\, \epsilon^{-2/k})$.

Note that $Q_k^2 = O\big( \sigma_k^2\epsilon^{-2/k} + (1/\epsilon^2)(d/n) \big)$, and each summand of $R'$ is bounded by $Q_k^2$. Applying Theorem B.1 (with $t = \log(1/\tau)$), and then using the parameter regime of Lemma 4.3 — namely $\log(1/\tau)/n = O(\epsilon)$ and $\sigma_k \epsilon^{1/2-1/k} = O(1)$ — together with the AM–GM inequality to absorb cross terms, we obtain that, with probability at least $1-\tau$,
\[ \frac{R'}{n} = O\Big( \frac{\mathbf{E} R'}{n} + \sigma\sqrt{\frac{\log(1/\tau)}{n}} + Q_k^2\,\frac{\log(1/\tau)}{n} \Big) \leq \frac{1}{\epsilon}\, O\Big( \frac{d\log d}{n} + \sigma_k^2\, \epsilon^{2-2/k} + \epsilon\,\sigma\sqrt{\frac{\log(1/\tau)}{n}} \Big) \leq \frac{\delta^2}{\epsilon}, \]
with $\delta$ as in the statement of the lemma.

E.4 Controlling the Mean
Lemma E.4.
Consider the setting in Lemma 4.5. Then, with probability $1 - \tau - \exp(-n\epsilon)$,
\[ \frac{R'}{n} = O\Big( \sqrt{\frac{d}{n}} + \sqrt{\frac{\log(1/\tau)}{n}} + \sigma_k\, \epsilon^{1-1/k} \Big). \]
Proof.
We first calculate the wimpy variance required for Theorem B.1:
\[ \sigma^2 = \sup_{v\in\mathcal{S}^{d-1}} \sum_{i=1}^n \mathbf{V}\big( g(x_i^T v) \big) \leq \sup_{v} \sum_{i=1}^n \mathbf{E}\, g(v^T x_i)^2 \leq n \sup_{v} \mathbf{E}(v^T x_i)^2 \leq n. \]
We use symmetrization and the contraction principle for Rademacher averages to bound $\mathbf{E} R'$:
\[ \mathbf{E} R' = \mathbf{E}\sup_{v} \sum_{i=1}^n \big( g(v^T x_i) - \mathbf{E}\, g(v^T x_i) \big) \leq 2\,\mathbf{E}\sup_{v} \sum_{i=1}^n \epsilon_i\, g(v^T x_i) \leq 2\,\mathbf{E}\sup_{v} \sum_{i=1}^n \epsilon_i\, v^T x_i = 2\,\mathbf{E}\Big\| \sum_{i=1}^n \epsilon_i x_i \Big\| \leq 2\sqrt{nd}. \]
By applying Theorem B.1 (with $t = \log(1/\tau)$), we get that, with probability at least $1 - \tau$,
\[ \frac{R'}{n} = O\Big( \frac{\mathbf{E} R'}{n} + \sqrt{\frac{\log(1/\tau)}{n}} + Q_k\,\frac{\log(1/\tau)}{n} \Big) = O\Big( \sqrt{\frac{d}{n}} + \sqrt{\frac{\log(1/\tau)}{n}} + \sigma_k\epsilon^{-1/k}\,\frac{\log(1/\tau)}{n} + \frac{1}{\epsilon}\sqrt{\frac{d}{n}}\,\frac{\log(1/\tau)}{n} \Big) = O\Big( \sqrt{\frac{d}{n}} + \sqrt{\frac{\log(1/\tau)}{n}} + \sigma_k\, \epsilon^{1-1/k} \Big), \]
where the last equality uses the assumption that $\log(1/\tau)/n = O(\epsilon)$.