aa r X i v : . [ c s . L G ] F e b Robust Sparse Estimation Tasks in High Dimensions
Jerry Li ∗ EECS, MIT [email protected]
March 2, 2017
Abstract
In this paper we initiate the study of whether or not sparse estimation tasks can be performed efficiently in highdimensions, in the robust setting where an ε -fraction of samples are corrupted adversarially. We study the naturalrobust version of two classical sparse estimation problems, namely, sparse mean estimation and sparse PCA in thespiked covariance model. For both of these problems, we provide the first efficient algorithms that provide non-trivialerror guarantees in the presence of noise, using only a number of samples which is similar to the number required forthese problems without noise. In particular, our sample complexities are sublinear in the ambient dimension d . Ourwork also suggests evidence for new computational-vs-statistical gaps for these problems (similar to those for sparsePCA without noise) which only arise in the presence of noise. In the last couple of decades, there has been a large amount of work in machine learning and statistics on how toexploit sparsity in high dimensional data analysis. Motivated by the ever-increasing quantity and dimensionality ofdata, the goal at a high level is to utilize the underlying sparsity of natural data to extract meaningful guarantees usinga number of samples that is sublinear in the dimensionality of the data. In this paper, we will consider the unsupervisedsetting, where we have sample access to some distribution with some underlying sparsity, and our goal is to recoverthis distribution by exploiting this structure. Two natural and well-studied problems in this setting that attempt toexploit sparsity are sparse mean estimation and sparse PCA. In both problems, the shared theme is that we assume thatone wishes to find a distinguished sparse direction of a Gaussian data set. However, the algorithms inspired by thisline of work tend to be quite brittle—it can be shown that they fail when the model is slightly perturbed.This connects to a major concern in high dimensional data analysis: that of model misspecification . At a highlevel, the worry is that our algorithms should be able to tolerate the case when our assumed model and the true modeldo not perfectly coincide. In the distributional setting, this (more or less) corresponds to the regime when a smallfraction of our samples are adversarially corrupted. The study of these so-called robust estimators , i.e., estimatorswhich work in the presence of such noise, is a classical subfield of statistics. Unfortunately, the classical algorithmsfor these problems fail to scale as the dimensionality of the problem grows—either the algorithms run in time whichis exponential in the dimension, or the error guarantees for these algorithms degrade substantially as the dimensiongrows. In a flurry of recent work, we now know new algorithms which circumvent this “curse of dimensionality”: theyrun efficiently, and provide dimension independent error guarantees. However, these algorithms are unable to exploitany inherent sparsity in the problem.This raises the natural “meta-question”:
Question 1.1.
Do the statistical gains (achievable by computationally efficient algorithms) for sparse estimation prob-lems persist in the presence of noise? ∗ Supported by NSF grant CCF-1217921, DOE grant de-sc0008923, NSF CAREER Award CCF-145326, and a NSF Graduate Research Fel-lowship D withsome underlying sparsity constraint (e.g. sparse PCA). Suppose now an ε -fraction of the samples are corrupted. Canwe still solve the same sparse estimation problem? In this work, we initiate the study of such issues. Interestingly,new gaps between computational and statistical rates seem to emerge in the presence of noise. In particular, whilethe sparse mean estimation problem was previously quite simple to solve, the efficient algorithms which achieve theminimax rate for this problem break down in the presence of this adversarial noise. More concretely, it seems thatthe efficient algorithms which are robust to noise run into the same computational issues as those which plague sparsePCA. A very interesting question is whether this phenomenon is inherent to any computationally efficient algorithm. We study the natural robust versions of two classical, well-studied statistical tasks involving sparsity, namely, sparsemean estimation, and sparse PCA.
Robust sparse mean estimation
Here, we get a set of d -dimensional samples from N ( µ, I ) , where µ is k -sparse,and an ε -fraction of the points are corrupted adversarially. Our goal then is to recover µ . Our main contribution is thefollowing: Theorem 1.2 (informal, see Theorem 2.1) . There is an efficient algorithm, which given a set of ε -corrupted samples ofsize e O ( k log dε ) from N ( µ, I ) where µ is k -sparse, outputs a b µ so that with high probability, k b µ − µ k ≤ ε p log 1 /ε . The recovery guarantee we achieve, namely O ( ε p log 1 /ε ) , is off by the optimal guarantee by only a factor of p log 1 /ε . Moreover, results of [DKS16] imply that our bound is tight for any efficient SQ algorithm. One can showthat information theoretically, it suffices to take O ( k log dε ) samples to learn the mean to ℓ error O ( ε ) , even withcorrupted data. Without model misspecification, this problem is quite simple algorithmically: it turns out that thetruncated empirical mean achieves the information theoretically optimal rate. However, efficient algorithms for thistask break down badly given noise, and to our knowledge there is no simple way of fixing them. Very interestingly, therate we achieve is off from this information theoretic rate by a k vs k factor—the same computational vs. statisticalgap that arises in sparse PCA. This phenomenon only seems to appear in the presence of noise, and we conjecture thatthis is inherent: Conjecture 1.1.
Any efficient algorithm for robust sparse mean estimation needs e Ω( k log dε ) samples.In Appendix D we give some intuition for why it seems to be true. At a high level, it seems that any techniqueto detect outliers for the mean must look for sparse directions in which the variance is much larger than it should be;at which point the problem faces the same computational difficulties as sparse PCA. We leave closing this gap as aninteresting open problem. Robust sparse PCA
Here, we study the natural robust analogue of the spiked covariance model. Classically, twoproblems are studied in this setting. The detection problem is given as follows: given sample access to the distributions,we are asked to distinguish between N (0 , I ) , and N (0 , I + ρvv T ) where v is a k -sparse unit vector. That is, we wishto understand if we can detect the presence of any sparse principal component. Our main result is the following: Theorem 1.3 (informal, see Theorem 2.2) . Fix ρ > , and let η = O ( ε p log 1 /ε ) . If ρ > η , there is an efficientalgorithm, which given a set of ε -corrupted samples of size O ( k log dρ ) which distinguishes between N (0 , I ) , and N (0 , I + ρvv T ) with high probability. The condition that ε = e O ( ρ ) is necessary (up to log factors), as otherwise the problem is impossible informationtheoretically. Observe that this (up to log factors) matches the optimal rate for computationally efficient detection forsparse PCA without noise (under reasonable complexity theoretic assumptions, see [BR13, WBS16]), and so it seemsthat noise does not introduce an additional gap here. The recovery problem is similar, except now we want to recoverthe planted spike v , i.e. find a u minimizing L ( u, v ) = √ k uu T − vv T k , which turns out to be the natural measurefor this problem. For this, we show: 2 heorem 1.4 (informal, see Theorem 2.3) . Fix ε > and < ρ = O (1) , and let η = O ( ε p log 1 /ε ) . There is anefficient algorithm, which given a set of ε -corrupted samples of size O ( k log dη ) from N (0 , I + ρvv T ) , outputs a u sothat L ( u, v ) = O (cid:16) ηρ (cid:17) with high probability. This rate is non-trivial—in particular, it provides guarantees for recovery of v when the number of samples we takeis at the detection threshold. Moreover, up to log factors, our rate is optimal for computationally efficient algorithms–[WBS16] gives an algorithm with rate roughly O ( ε/ρ ) , and show that this is necessary. Techniques
We first introduce a simple way to describe the optimization problems used for solving sparse meanestimation and sparse PCA. This approach is very similar to the approach taken by [CRPW12] for solving under-determined linear systems. We observe that any set S in a Hilbert space naturally induces a dual norm k x k ∗S =max y ∈S |h x, y i| , and that well-known efficient algorithms for sparse mean estimation and sparse PCA simply computethis norm, and the corresponding dual witness y ∈ S which maximizes this norm, for appropriate choices of S .These norms give us a language to only consider deviations in directions we care about, which allows us to proveconcentration bounds which are not true for more traditional norms.We now describe our techniques for robust sparse mean estimation. Our starting point is the convex programmingapproach of [DKK + + A so that the empirical covariance with these weights is “maximallyexplained” by A , in a norm very similar to the one induced by the sparse PCA SDP. Utilizing that norms are convex,we show that this can be done via a convex program using the types of techniques described above, and that the topeigenvector of the optimal A gives us the desired solution. While the convex program would be quite difficult to writedown in one shot, it is quite easily expressible using the abstraction of dual norms. As mentioned previously, there has been a large amount of work on various ways to exploit sparsity for machinelearning and statistics. In the supervised setting, perhaps the most well-known of these is compressive sensing and itsvariants (see [CW08, HTW15] for more details). We do not attempt to provide an exhaustive overview the field here.Other well-known problems in the same vein include general classes of linear inverse problems, see [CRPW12] andmatrix completion ([CR12]).The question of estimating a sparse mean is very related to a classical statistical model known as the
Gaussiansequence model , and the reader is referred to [Tsy09, Joh11, Rig15] for in-depth surveys on the area. This problem3as also garnered a lot of attention recently in various distributed and memory-limited settings, see [GMN14, SD15,BGM + +
16, LRV16, CSV16, DKK +
17, DKS17, DKS16]) have givennew, computationally efficient, robust estimators for these problems and other settings which avoid this loss, and areoften almost optimal. Independent work of [DSS17] also considers the robust sparse setting. They give a similar resultfor robust mean estimation, and also consider robust sparse PCA, though in a somewhat different setting than we do,as well as robust sparse linear regression.The questions we consider are similar to learning in the presence of malicious error studied in [Val85, KL93],which has received a lot of attention, particularly in the setting of learning halfspaces ([Ser03, KLS09, ABL14]). Theyalso are connected to work on related models of robust PCA ([Bru09, CLMW11, LMTZ12, ZL14]). We refer thereader to [DKK +
16] to a detailed discussion on the relationships between these questions and the ones we study.
Throughout this paper, if v is a vector, we will let k v k denote its ℓ norm. If M is a matrix, we let k M k denoteits spectral norm, we let k M k F denote its Frobenius norm, and we let k M k = P ij | M ij | be its ℓ -norm if it wereconsidered a vector. For any two distributions F, G over R d , we let d TV ( F, G ) = R R d | F − G | dx denote the totalvariation distance between the two distributions.We will study the following contamination model: Definition 2.1 ( ε -corruption) . We say a a set of samples X , X , . . . , X n is an ε -corrupted set of samples froma distribution D if it is generated by the process following process. First, we draw n independent samples from D .Then, an adversary inspects these samples, and changes an ε -fraction of them arbitrarily , then returns these new pointsto us, in any order. Given an ε -corrupted set of samples, we let S good ⊆ [ n ] denote the indices of the uncorruptedsamples, and we let S bad ⊆ [ n ] denote the indices of the corrupted samples.As discussed in [DKK + D , but come from a distribution D ′ with total variation distance at most O ( ε ) from D .We may now formally define the algorithmic problems we consider. Robust sparse mean estimation
Here, we assume we get an ε -corrupted set of samples from N ( µ, I ) , where µ is k -sparse. Our goal is to recover µ in ℓ . It is not hard to show that there is an exponential time estimator whichachieves rate e O ( k log d/ε ) , and moreover, this rate is optimal (see Appendix A). However, this algorithm requireshighly exponential time. We show: Theorem 2.1 (Efficient robust sparse mean estimation) . Fix ε, δ > , and let k be fixed. Let η = O ( ε p log 1 /ε ) .Given an ε -corrupted set of samples X , . . . , X n ∈ R d from N ( µ, I ) , where µ is k -sparse, and n = Ω min( k , d ) + log (cid:0) d k (cid:1) + log 1 /δη ! , there is a poly-time algorithm which outputs b µ so that w.p. − δ , we have k µ − b µ k ≤ O ( η ) .
4t is well-known that information theoretically, the best error one can achieve is Θ( ε ) , as achieved by Fact A.1. Weshow that it is possible to efficiently match this bound, up to a p log 1 /ε factor. Interestingly, our rate differs from thatin Fact A.1: our sample complexity is (roughly) e O ( k log d/ε ) versus O ( k log d/ε ) . We conjecture this is necessaryfor any efficient algorithm. Robust sparse PCA
We will consider both the detection and recovery problems for sparse PCA. We first focus detection problem for sparse PCA. Here, we are given ρ > , and an ε -corrupted set of samples from a d -dimensionaldistribution D , where D can is either N (0 , I ) or N (0 , I + ρvv T ) for some k -sparse unit vector v . Our goal isto distinguish between the two cases, using as few samples as possible. It is not hard to show that informationtheoretically, O ( k log d/ρ ) samples suffice for this problem, with an inefficient algorithm (see Appendix A). Our firstresult is that efficient robust sparse PCA detection is possible, at effectively the best computationally efficient rate: Theorem 2.2 (Robust sparse PCA detection) . Fix ρ, δ, ε > . Let η = O ( ε p log 1 /ε ) . Then, if η = O ( ρ ) , and weare given a we are given a ε -corrupted set of samples from either N (0 , I ) or N (0 , I + ρvv T ) for some k -sparse unitvector v of size n = Ω min( d, k ) + log (cid:0) d k (cid:1) + log 1 /δρ ! then there is a polynomial time algorithm which succeeds with probability − δ for detection. It was shown in [BR13] that even without noise, at least n = Ω( k log d/ε ) samples are required for any polyno-mial time algorithm for detection, under reasonable complexity theoretic assumptions. Up to log factors, we recoverthis rate, even in the presence of noise.We next consider the recovery problem. Here, we are given an ε -corrupted set of samples from N (0 , I + ρvv T ) ,and our goal is to output a u minimizing L ( u, v ) , where L ( u, v ) = √ k uu T − vv T k . For the recovery problem, werecover the following efficient rate: Theorem 2.3 (Robust sparse PCA recovery) . Fix ε, ρ > . Let η be as in Theorem 2.2. There is an efficient algorithm,which given a set of ε -corrupted samples of size n from N (0 , I + ρvv T ) , where n = Ω min( d, k ) + log (cid:0) d k (cid:1) + log 1 /δη ! , outputs a u so that L ( u, v ) = O (cid:18) (1 + ρ ) ηρ (cid:19) . In particular, observe that when η = O ( ρ ) , so when ε = e O ( ρ ) , this implies that we recover v to some smallconstant error. Therefore, given the same number of samples as in Theorem 2.2, this algorithm begins to providenon-trivial recovery guarantees. Thus, this algorithm has the right “phase transition” for when it begins to work, asthis number of samples is likely necessary for any computationally efficient algorithm. Moreover, our rate itself islikely optimal (up to log factors), when ρ = O (1) . In the non-robust setting, [WBS16] showed a rate of (roughly) O ( ε/ρ ) with the same number of samples, and that any computationally efficient algorithm cannot beat this rate. Weleave it as an interesting open problem to show if this rate is achievable or not in the presence of error when ρ = ω (1) . In this section we provide technical preliminaries that we will require throughout the paper.5 .1 Naive pruning
We will require the following (straightforward) preprocessing subroutine from [DKK +
16] to remove all points whichare more than e Ω( d ) away from the true mean. Fact 3.1 (c.f. Fact 4.18 in [DKK + . Let X , . . . , X n be an ε -corrupted set of samples from N ( µ, I ) , and let δ > .There is an algorithm N AIVE P RUNE ( X , . . . , X n , δ ) which runs in O ( εd n ) time so that with probability − δ ,we have that (1) N AIVE P RUNE removes no uncorrupted points, and (2) if X i is not removed by N AIVE P RUNE , then k X i − µ k ≤ O ( p d log( n/δ )) . If these two conditions happen, we say that N AIVE P RUNE has succeeded . In this section we give a couple of concentration inequalities that we will require in the remainder of the paper. These“per-vector” and “per-matrix” concentration guarantees are well-known and follow from (scalar) Chernoff bounds, seee.g. [DKK + Fact 3.2 (Per-vector Gaussian concentration) . Fix ε, δ > . Let v ∈ R d be a unit vector, and let X , . . . , X n ∼N (0 , I ) , where n = Ω (cid:18) log 1 /δε (cid:19) . Then, with probability − δ , we have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)* n n X i =1 X i , v +(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ ε . Fact 3.3 (Per-matrix Gaussian concentration) . Fix ε, δ > , and suppose ε ≤ . Let M ∈ R d × d be a symmetricmatrix, and let X , . . . , X n ∼ N (0 , I ) , where n = Ω (cid:18) log 1 /δε (cid:19) . Then, with probability − δ , we have: (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)* n n X i =1 X i X Ti − I, M +(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ ε . S n,ε For any n, ε , define the set S n,ε = ( w ∈ R n : n X i =1 w i = 1 , and ≤ w i ≤ − ε ) n , ∀ i ) . (1)We make the following observation. For any subset I ⊆ [ n ] , if we let w I be the vector whose i th coordinate is / | I | if i ∈ I and otherwise, we have S n,ε = conv { w I : | I | = (1 − ε ) n } . The set S n,ε will play a key role in our algorithms. We will think of elements in S n,ε as weights we place upon oursample points, where higher weight indicates a higher confidence that the sample is uncorrupted, and a lower weightwill indicate a higher confidence that the sample is corrupted.6 Concentration for sparse estimation problems via dual norms
In this section we give a clean way of proving concentration bounds for various objects which arise in sparse PCAand sparse mean estimation problems. We do so by observing they are instances of a very general “meta-algorithm”we call dual norm maximization. This will prove crucial to proving the correctness of our algorithms for robustsparse recovery. While this may sound similar to the “dual certificate” techniques often used in the sparse estimationliterature, these techniques are actually quite different.
Definition 4.1 (Dual norm maximization) . Let H be a Hilbert space with inner product h· , ·i . Fix any set S ⊆ H .Then the dual norm induced by S , denoted k · k ∗ S , is defined by k x k ∗ S = sup y ∈ S |h x, y i| . The dual norm maximizer of x , denoted d S ( x ) , is the vector d S ( x ) = arg max v ∈ S |h v, x i| .In particular, we will use the following two sets. Equip the space of symmetric d × d matrices with the trace innerproduct, i.e., h A, B i = tr( AB ) , so that it is a Hilbert space, and let U k = { u ∈ R d : k u k = 1 , k u k = k } (2) X k = { X ∈ R d × d : tr( X ) = 1 , k X k ≤ k, X (cid:23) } . (3)We show in Appendix B.1 that existing well-known algorithms for sparse mean recovery and sparse PCA withoutnoise can be naturally written in this fashion.Another detail we will largely ignore in this paper is the fact that efficient algorithms for these problems can onlyapproximately solve the dual norm maximization problem. However, we explain in Appendix B.2 why this does notaffect us in any meaningful way. Thus, for the rest of the paper we will assume we have access to the exact maximizer,and the exact value of the norm. We now show how the above concentration inequalities allow us to derive very strong concentration results for thedual norm maximization problem for U k and X k . Conceptually, we view these concentration results as being the majordistinction between sparse estimation and non-sparse estimation tasks. Indeed, these results are crucial for adaptingthe convex programming framework for robust estimation to sparse estimation tasks. Additionally, they allow us togive an easy proof that the L relaxation works for sparse PCA. Corollary 4.1.
Fix ε, δ > . Let X , . . . , X n ∼ N (0 , I ) , where n = Ω k + log (cid:0) dk (cid:1) + log 1 /δε ! . Then k n P ni =1 X i k ∗U k ≤ ε .Proof. Fix a set of k coordinates, and let S be the set of unit vectors supported on these k coordinates. By Fact 3.2and a net argument, one can show that for all δ , given n = Ω (cid:16) k +log 1 /δε (cid:17) , we have that (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)* v, n n X i =1 X i +(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ ε , with probability − δ . The result then follows by setting δ ′ = (cid:0) dk (cid:1) − δ and union bounding over all sets of k coordinates.The second concentration bound, which bounds deviation in X k norm, uses ideas which are similar at a high level, butrequires a bit more technical work. 7 heorem 4.2. Fix ε, δ > . Let X , . . . X n ∼ N (0 , I ) , where n = Ω min( d, k ) + log (cid:0) d k (cid:1) + log 1 /δε ! . Then (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n X i =1 X i X Ti − I (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∗X k ≤ ε . Let us first introduce the following definition.
Definition 4.2. A symmetric sparsity pattern is a set S of indices ( i, j ) ∈ [ d ] × [ d ] so that if ( i, j ) ∈ S then ( j, i ) ∈ S .We say that a symmetric matrix M ∈ R d × d respects a symmetric sparsity pattern S if supp( M ) = S .With this definition, we now show: Lemma 4.3.
Let n = O (cid:18) min( d,k )+log ( d k ) +log 1 /δε (cid:19) . Then, with probability − δ , the following holds: | tr(( b Σ − I ) X ) | ≤ O ( ε ) , for all symmetric X with k X k = k and k X k F ≤ . (4) Proof.
Fix any symmetric sparsity pattern S so that | S | ≤ k . By classical arguments one can show that there isa (1 / -net over all symmetric matrices X with k X k F = 1 respecting S of size at most O (min( d,k )) . By Fact3.3 and a basic net argument, we know that for any δ ′ , we know that except with probability − δ ′ , if we take n = O (cid:16) min( d,k )+log 1 /δ ′ ε (cid:17) samples, then for all symmetric X respecting S so that k X k F ≤ , we have | tr(( b Σ − I ) X ) | ≤ ε .The claim then follows by further union bounding over all O (cid:16)(cid:0) d k (cid:1)(cid:17) symmetric sparsity patterns S with | S | ≤ k .We will also require the following structural lemma. Lemma 4.4.
Any PSD matrix X so that tr( X ) = 1 and k X k ≤ k can be written as X = O ( n /k ) X i =1 Y i , where each Y i is symmetric, have P O ( n /k ) i =1 k Y i k F ≤ , and each Y i is k -sparse.Proof. Observe that since X is PSD, then k X k F ≤ tr( X ) = 1 . For simplicity of exposition, let us ignore that the Y i must be symmetric for this proof. We will briefly mention how to in addition ensure that the Y i are symmetricat the end of the proof. Sort the entries of X in order of decreasing | X ij | . Let Y i be the matrix whose nonzeroesare the ik + 1 through ( i + 1) k largest entries of X , in the same positions as they appear in X . Then we clearlyhave that P Y i = X i , and each Y i is exactly k -sparse. Thus it suffices to show that P k Y i k F ≤ . We have k Y k F ≤ k X k F ≤ . Additionally, we have k Y i +1 k F ≤ T | Y i | k , which follows simply because every nonzero entryof Y i +1 is at most the smallest entry of Y i , and each has exactly k nonzeros (except potentially the last one, but it isnot hard to see this cannot affect anything). Thus, in aggregate we have O ( n /k ) X i =1 k Y i k F ≤ O ( n /k ) X i =2 k Y i k F ≤ O ( n /k ) X i =1 T | Y i | k = 1 + 1 T | X | k ≤ , which is stronger than claimed. Technically the last Y i may not be k sparse but this is easily dealt with, and we will ignore this case here Y i ’s must be symmetric, and indeed they do not have to be. The only realcondition we needed was that the Y i ’s (1) had disjoint support, (2) summed to X , (3) are each Θ( k ) sparse (exceptpotentially the last one), and (4) the largest entry of Y i +1 is bounded by the smallest entry of Y i . It should be clear thatthis can be done while respecting symmetry by doubling the number of Y i , which also at most doubles the bound inthe sum of the Frobenius norms. We omit the details for simplicity. Proof of Theorem 4.2.
Let us condition on the event that (4) holds. We claim then that for all X ∈ X , we must have | tr(( b Σ − I ) X ) | ≤ O ( ε ) , as claimed. Indeed, by Lemma 4.4, for all X ∈ X , we have that X = O ( d /k ) X i =1 Y i , where each Y i is symmetric, have P O ( d /k ) i =1 k Y i k F ≤ , and each Y i is k -sparse. Thus, | tr(( b Σ − I ) X ) | ≤ O ( d /k ) X i =1 (cid:12)(cid:12)(cid:12) tr(( b Σ − I ) Y i ) (cid:12)(cid:12)(cid:12) = O ( d /k ) X i =1 k Y i k F (cid:12)(cid:12)(cid:12)(cid:12) tr (cid:18) ( b Σ − I ) Y i k Y i k F (cid:19)(cid:12)(cid:12)(cid:12)(cid:12) ( a ) ≤ O ( d /k ) X i =1 k Y i k F · O ( ε ) ( b ) ≤ O ( ε ) , where (a) follows since each Y i / k Y i k F satisfies the conditions in (4), and (b) follows from the bound on the sum ofthe Frobenius norms of the Y i . S n,ε We will require the following concentration inequalities for weighted sums of Gaussians, where the weights comefrom S n,ε , as these objects will naturally arise in our algorithms. These bounds follow by applying the above bounds,then carefully union bounding over all choices of possible subsets of (cid:0) nεn (cid:1) subsets. We need to be careful here sincethe number of things we are union bounding over increases as n increases. We include the proofs in Appendix C. Theorem 4.5.
Fix ε ≤ / and δ ≤ , and fix k ≤ d . There is a η = O ( ε p log 1 /ε ) so that for any η > η , if X , . . . , X n ∼ N (0 , I ) and n = Ω (cid:18) min( d,k )+log ( d k ) +log 1 /δη (cid:19) , then Pr ∃ w ∈ S n,ε : (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n X i =1 w i X i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∗U k ≥ η ≤ δ . Theorem 4.6.
Fix ε ≤ / and δ ≤ , and fix k ≤ d . There is a η = O ( ε p log 1 /ε ) so that if X , . . . , X n ∼ N (0 , I ) and n = Ω (cid:18) min( d,k )+log ( d k ) +log 1 /δη (cid:19) , then we have Pr ∃ w ∈ S n,ε : (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n X i =1 w i X i X Ti − I (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∗X k ≥ η ≤ δ . A robust algorithm for robust sparse mean estimation
This section is dedicated to the description of an algorithm R
ECOVER R OBUST SM EAN for robustly learning Gaussiansequence models, and the proof of the following theorem:
Theorem 5.1.
Fix ε, τ > . Let η = O ( ε p log 1 /ε ) . Given an ε -corrupted set of samples of size n from N ( µ, I ) ,where µ is k -sparse n = Ω min( k , d ) + log (cid:0) d k (cid:1) + log 1 /τη ! , then R ECOVER R OBUST SM EAN outputs a b µ so that with probability − τ, we have k b µ − µ k ≤ O ( η ) . Our algorithm builds upon the convex programming framework developed in [DKK + Ω( √ d ) away from the mean. Then, for an appropriate choice of δ , it will attempt to (approximately) find apoint within the following convex set: RobustSM eanC τ = w ∈ S n,ε : (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X i =1 w i ( X i − µ )( X i − µ ) T − I (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∗X k ≤ τ . (5)The main difficulty with finding a point in C τ is that µ is unknown. A key insight of [DKK +
16] is that it sufficesto create an (approximate) separation oracle for the feasible set, as then we may use classical convex optimizationalgorithms (i.e. ellipsoid or cutting plane methods) to find a feasible point. In their setting (for a different C τ ), it turnsout that a simple spectral algorithm suffices to give such a separation oracle.Our main contribution is the design of separation oracle for C τ , which requires more sophisticated techniques. Inparticular, we will ideas developed in analogy to hard thresholding and SDPs similar to those developed for sparsePCA to design such an oracle. Throughout this section, we let X , . . . , X n denote an ε -corrupted set of samples from N ( µ, I ) , where µ is k -sparse.We let S good denote the set of uncorrupted samples, and we let S bad denote the set of corrupted samples. For any setof weights w ∈ S n,ε , we let w g = P i ∈ S good w i and w b = P i ∈ S bad w i .Throughout this section, we will condition on the following three deterministic events occurring:N AIVE P RUNE ( X , . . . , X n , δ ) succeeds, (6) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) X i ∈ S good w i ( X i − µ ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∗U k ≤ η , ∀ w ∈ S n, ε , and (7) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) X i ∈ S good w i ( X i − µ )( X i − µ ) T − w g I (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∗X k ≤ η , ∀ w ∈ S n, ε , (8)where η = O ( ε p log 1 /ε ) . When n = Ω (cid:18) min( k ,d )+log ( k d ) +log 1 /δη (cid:19) these events simultaneously happen withprobability at least − O ( δ ) by Fact 3.1, Theorem 4.5, Theorem 4.6 and a union bound, and the observation that if w ∈ S n,ε , then w/w g restricted to the indices in S good is in S (1 − ε ) n, ε .10 .2 The separation oracle Our main result in this section is the description of a polynomial time algorithm R
OBUST SM EAN O RACLE and theproof of the following theorem of its correctness:
Theorem 5.2.
Fix ε > sufficiently small. Suppose that (7) and (8) hold. Let w ∗ denote the set of weights which areuniform over the uncorrupted points. Then, there is a constant ≤ c ≤ so that R OBUST SM EAN O RACLE satisfies:1. (Completeness) If w = w ∗ , R OBUST SM EAN O RACLE outputs “YES”.2. (Soundness) If w C cη the algorithm outputs a hyperplane ℓ : R n → R so that ℓ ( w ) ≥ but ℓ ( w ∗ ) < .Moreover, if the algorithm ever outputs a hyperplane, we have ℓ ( w ∗ ) < . Plugging these guarantees into an ellipsoid (or cutting-plane) method, we obtain the following:
Corollary 5.3.
Fix ε > sufficiently small. Suppose that (7) and (8) hold. There is an algorithm A PPROX R ECOV - ER R OBUST SM EAN which queries R OBUST SM EAN O RACLE at most poly( d, /ε, log 1 /δ ) times, and so runs in time poly( d, /ε, /δ ) which outputs a w ′ so that k w − w ′ k ∞ ≤ ε/ ( n p d log n/δ ) , for some w ∈ C cτ . Our separation oracle, formally described in Algorithm 1, proceeds as follows. Given w ∈ S n,ε , it forms b µ = k b µ ′ k ∗U k · d U k ( b µ ′ ) , where b µ = P w i X i . It then forms the matrix b Σ = P w i ( X i − b µ )( X i − b µ ) T , and computes A = d X k ( b Σ) . The algorithm then checks if (cid:12)(cid:12)(cid:12) h A, b Σ i (cid:12)(cid:12)(cid:12) > C for appropriately chosen threshold C . If it does not, thealgorithm outputs “YES”. Otherwise, the algorithm outputs a separating hyperplane given by this matrix A . Algorithm 1
Separation oracle for robust sparse mean estimation. function R OBUST SM EAN O RACLE ( X , . . . , X n , w ) Let b µ = P w i X i Let b Σ = P w i ( X i − b µ )( X i − b µ ) T Let A = d X k ( b Σ) if |h A, b Σ − I i| ≥ η then Let σ = sgn (cid:16) h A, b Σ − I i (cid:17) return the hyperplane ℓ given by ℓ ( w ) = σ n X i =1 w i (cid:10) A, ( X i − b µ )( X i − b µ ) T (cid:11) − ! − |h A, b Σ − I i| . else return “YES” end end function We will require the following two lemmata:
Lemma 5.4.
Let ω , . . . , ω m be a set of non-negative weights that sum to 1. Let a , . . . , a m be any sequence ofscalars. Then m X i =1 ω i a i ≥ m X i =1 ω i a i ! . Proof.
Let Z be a random variable which is a i with probability ω i . Then E [ Z ] = P ω i a i and E [ Z ] = P ω i a i . Thenthe inequality follows from the fact that E [ Z ] − E [ Z ] = Var[ Z ] ≥ . Lemma 5.5.
Let u ∈ R d . Then ( k u k ∗U k ) ≤ k uu T k ∗X k ≤ k u k ∗U k ) . roof. Let v = d U k ( u ) . Then since vv T ∈ X k , we have that ( k uu T k ∗X k ) ≥ h vv T , uu T i = h u, v i = ( k u k ∗U k ) . Thisproves the first inequality.To prove the other inequality, we first prove the intermediate claim that sup M ∈Y k u T M u ≤ ( k u k ∗U k ) , where Y k is the set of symmetric matrices M with at most k -non-zeroes satisfying k M k F = 1 . Indeed, fix any M ∈ Y k . Let S ⊆ [ n ] be the set of non-zeroes of d U k ( u ) . This is exactly the set of the k largest elements in u , sorted by absolutevalue. Let P be the symmetric sparsity pattern respected by M . Fix an arbitrary bijection φ : P \ ( S × S ) → ( S × S ) \ P ,and let M ′ be the following matrix: M ′ i,j = M ij if ( i, j ) ∈ P T ( S × S ) , sgn ( u i u j ) M φ − ( i,j ) if ( i, j ) ∈ ( S × S ) \ P , otherwise . Then we claim that u T M u ≤ u T M ′ u . Indeed, we have u T M ′ u − u T M u = X ( i,j ) ∈ P \ ( S × S ) | M ij ( uu T ) φ ( i,j ) | − M ij ( uu T ) i,j ≥ X ( i,j ) ∈ P \ ( S × S ) | M i,j | (cid:0) | ( uu T ) φ ( i,j ) | − | ( uu T ) i,j | (cid:1) ≥ , from the definition of S . Moreover, for any M respecting S × S with k M k F = 1 , it is not hard to see that u T M u ≤ ( k u k ∗U k ) . This is because now the problem is equivalent to restricting our attention to the coordinates in S , andasking for the symmetric matrix M ∈ R S × S with k M k F = 1 maximizing u TS M u S , where u S is u restricted tothe coordinates in S . This is clearly maximized by M = k u S k u S u TS , which yields the desired expression, since k u S k = k u k U k .We can now prove the original lemma. By Lemma 4.4 we may write A = P O ( n /k ) i =1 Y i where each Y i issymmetric, k -sparse, and have P O ( n /k ) i =1 k Y i k F ≤ . We therefore have u T Au = O ( n /k ) X i =1 u T Y i u = O ( n /k ) X i =1 k Y i k F ( k u k ∗U k ) ≤ k u k ∗U k ) , as claimed, where the second line follows from the arguments above.Throughout the rest of this section, let Y i = X i − µ , so that so that Y i ∼ N (0 , I ) if i ∈ S good . We first prove thefollowing crucial proposition: Proposition 5.6.
Let w ∈ S n,ε , and let τ ≥ η . Assuming (7) and (8) hold, if k P ni =1 w i Y i k ∗U k ≥ τ , then (cid:13)(cid:13)P ni =1 w i Y i Y Ti − I (cid:13)(cid:13) ∗X k ≥ τ ε .Proof. Observe that (7) and a triangle inequality together imply that (cid:13)(cid:13)P i ∈ S bad w i Y i (cid:13)(cid:13) ∗U k ≥ τ . By definition, thisimplies there is a k -sparse unit vector u so that (cid:12)(cid:12) h u, P i ∈ S bad w i Y i i (cid:12)(cid:12) ≥ τ . WLOG assume that h u, P i ∈ S bad w i Y i i ≥ η (if the sign is negative a symmetric argument suffices). This is equivalent to the statement that X i ∈ S bad w i w b h u, Y i i ≥ τw b . Observe that the w i /w b are a set of non-negative weights summing to . Hence, by Lemma 5.4, we have X i ∈ S bad w i w b h u, Y i i ≥ (cid:18) τw b (cid:19) . A = uu T . Observe that A ∈ X k . Then the above inequality is equivalent to the statement that X i ∈ S bad w i Y Ti AY i ≥ τ w b ≥ τ ε . Moreover, by (8), we have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) X i ∈ S good w i Y Ti AY i − I (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ η , and together these two inequalities imply that n X i =1 w i Y i AY i ≥ τ ε − η ≥ τ ε , as claimed. The final inequality follows from the definition of η , and since > . Proof of Theorem 5.2.
Completeness follows from (8). We will now show soundness. Suppose w C η . We wish toshow that we will output a separating hyperplane. From the description of the algorithm, this is equivalent to showingthat k b Σ − I k X k ≥ η . Let b µ = P ni =1 w i X i , and let ∆ = µ − b µ . By elementary manipulations, we may write (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X i =1 w i ( X i − b µ )( X i − b µ ) T − I (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) X k = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X i =1 w i ( Y i + ∆)( Y i + ∆) T − I (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) X k ( a ) = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X i =1 w i Y i Y Ti + ∆∆ T − I (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) X k ( b ) ≥ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X i =1 w i Y i Y Ti − I (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) X k − (cid:13)(cid:13) ∆∆ T (cid:13)(cid:13) X k ( c ) ≥ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X i =1 w i Y i Y Ti − I (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) X k − k ∆ k U k , where (a) follows since P ni =1 w i Y i = ∆ by definition, (b) follows from a triangle inequality, and (c) follows fromLemma 5.5. If k ∆ k U k ≤ p η/ , then the RHS is at least η since the second term is at most η , and the first termis at least η since we assume that w C η . Conversely, if k ∆ k U k ≥ p η/ , then by Proposition 5.6, we have k P ni =1 w i Y i Y i − I k X k ≥ k ∆ k X k / (6 ε ) > k ∆ k X k as long as ε ≤ / . This implies that the RHS is at least k ∆ k X k ≥ η , as claimed.Hence, this implies that if w C η , then we output a hyperplane ℓ . It is clear by construction that ℓ ( w ) ≥ ;thus, it suffices to show that if we output a hyperplane, that ℓ ( w ∗ ) < . Letting e µ = − ε ) n P i ∈ S good w i Y i , we haveObserve that we have n X i =1 w ∗ i ( X i − b µ )( X i − b µ ) T − I = 1(1 − ε ) n X i ∈ S good ( Y i + ∆)( Y i + ∆) T − I = 1(1 − ε ) n X i ∈ S good Y i Y Ti − I + ∆ e µ T + e µ ∆ T + ∆∆ T = 1(1 − ε ) n X i ∈ S good Y i Y Ti − I + (∆ + e µ )(∆ + e µ ) T − e µ e µ T . (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X i =1 w ∗ i ( X i − b µ )( X i − b µ ) T − I (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) X k ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) − ε ) n X i ∈ S good Y i Y Ti − I (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∗X k + 4 (cid:0) k ∆ + e µ k ∗U k (cid:1) + 4 (cid:0) k e µ k ∗U k (cid:1) ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) − ε ) n X i ∈ S good Y i Y Ti − I (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) X k + 8 (cid:0) k ∆ k ∗U k (cid:1) + 8 (cid:0) k e µ k ∗U k (cid:1) + 4 (cid:0) k e µ k ∗U k (cid:1) ≤ η + 8 (cid:0) k ∆ k ∗U k (cid:1) , (9)by (7) and (8).Observe that to show that ℓ ( w ∗ ) < it suffices to show that (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X i =1 w ∗ i ( X i − b µ )( X i − b µ ) − I (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∗X k < (cid:13)(cid:13)(cid:13)b Σ − I (cid:13)(cid:13)(cid:13) ∗X k . (10)If k ∆ k ∗U k ≤ p η/ , then this follows since the quantity on the RHS is at least η by assumption, and the quantity onthe LHS is at most η by (9). If k ∆ k ∗U k ≥ p η/ , then by Proposition 5.6, the RHS of (10) is at least (cid:0) k ∆ k ∗U k (cid:1) / (3 ε ) ,which dominates the LHS as long as k ∆ k ∗U k ≥ η and ε ≤ / , which completes the proof. We now have the ingredients to prove our main theorem. Given what we have, our full algorithm R
ECOVER R O - BUST SM EAN is straightforward: first run N
AIVE P RUNE , then run A
PPROX R ECOVER R OBUST SM EAN on the prunedpoints to output some set of weights w . We then output k b µ k U k d U k ( b µ ) . The algorithm is formally defined in Algorithm2. Algorithm 2
An efficient algorithm for robust sparse mean estimation function R ECOVER R OBUST SM EAN ( X , . . . , X n , ε, δ ) Let S be the set output by N AIVE P RUNE ( X , . . . , X n , δ ) . WLOG assume S = [ n ] . Let w ′ = A PPROX R ECOVER R OBUST SM EAN ( X , . . . , X n , ε, δ ) . Let b µ = P ni =1 w ′ i X i . return k b µ k ∗U k d U k ( b µ ) end function Proof of Theorem 5.1.
Let us condition on the event that (6), (7), and (8) all hold simultaneously. As previouslymentioned, when n = Ω (cid:18) min( k ,d )+log ( k d ) +log 1 /δη (cid:19) these events simultaneously happen with probability at least − O ( δ ) . For simplicity of exposition, let us assume that N AIVE P RUNE does not remove any points. This is okaysince if it succeeds, it never removes any good points, so if it removes any points, it can only help us. Moreover, sinceit succeeds, we know that k X i − µ k ≤ O ( p d log( n/δ )) for all i ∈ [ n ] . By Corollary 5.3, we know that there is some w ∈ C η so that k w − w ′ k ∞ ≤ ε/ ( n p d log n/δ ) . We have k b µ − µ k U k = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X i =1 w ′ i X i − b µ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∗U k ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X i =1 w i X i − b µ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∗U k + n X i =1 | w i − w ′ i | k X i − µ k ≤ O ( η ) + O ( ε ) ,
14y Proposition 5.6. We now show that this implies that if we let µ ′ = k b µ k ∗U k d U k ( b µ ) , then k µ ′ − µ k ≤ O ( η ) . Let S be the support of µ ′ , and let T be the support of µ . Then we have k µ ′ − µ k = X i ∈ S ∩ T ( µ ′ i − µ i ) + X i ∈ S \ T ( µ ′ i ) + X i ∈ T \ S µ i . Observe that P i ∈ S ∩ T ( µ ′ i − µ i ) + P i ∈ S \ T ( µ ′ i ) ≤ (cid:0) k b µ − µ k ∗U k (cid:1) , since µ was originally nonzero on the entries in S \ T . Moreover, for all i ∈ T \ S and j ∈ S \ T , we have ( µ ′ i ) ≤ ( µ ′ j ) . Thus we have X i ∈ T \ S µ i ≤ X i ∈ T \ S ( µ − µ ′ i ) + X i ∈ S \ T ( µ ′ j ) ≤ (cid:0) k b µ − µ k ∗U k (cid:1) . Therefore we have k µ ′ − µ k ≤ (cid:0) k b µ − µ k ∗U k (cid:1) , which implies that k µ ′ − µ k ≤ O ( η ) , as claimed. In this section, we give an efficient algorithm for detecting a spiked covariance matrix in the presence of adversarialnoise. Our algorithm is fairly straightforward: we ask for the set of weights w ∈ S n,ε so that the empirical secondmoment with these weights has minimal deviation from the identity in the dual X k norm. We may write this as aconvex program. Then, we check the value of the optimal solution of this convex program. If this value is small, thenwe say it is N (0 , I ) . if this value is large, then we say it is N (0 , I + ρvv T ) . We refer to the former as Case 1 and thelatter as Case 2. The formal description of this algorithm is given in Algorithm. Algorithm 3
Learning a spiked covariance model, robustly function D ETECT R OBUST
SPCA( X , . . . , X n , ε, δ, ρ ) Let γ be the value of the solution min w ∈ S n,ε (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X i =1 w i ( X i X Ti − I ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∗X k (11) if γ < ρ/ then return Case 1 else return
Case 2 end function ETECT R OBUST
SPCA
We first show that the algorithm presented above can be efficiently implemented. Indeed, one can show that by takingthe dual of the SDP defining the k · k ∗X k norm, this problem can be re-written as an SDP with (up to constant factorblowups) the same number of constraints and variables, and therefore we may solve it using traditional SDP solvertechniques.Alternatively, one may observe that to optimize Algorithm 4 via ellipsoid or cutting plane methods, it suffices to,given w ∈ S n,ε , produce a separating hyperplane for the constraint (11). This is precisely what dual norm maximiza-tion allows us to do efficiently. It is straightforward to show that the volume of S n,ε × X k is at most exponential in therelevant parameters. Therefore, by the classical theory of convex optimization, (see e.g. [CITE]), for any ξ , we mayfind a solution w ′ and γ ′ so that k w ′ − w ∗ k ∞ ≤ ξ and γ ′ so that | γ − γ ′ | < ξ for some exact minimizer w ∗ , where γ is the true value of the solution, in time poly( d, n, /ε, log 1 /ξ ) ,As mentioned in Section B.2, neither approach will in general give exact solutions, however, both can achieveinverse polynomial accuracy in the parameters in polynomial time. We will ignore these issues of numerical precisionthroughout the remainder of this section, and assume we work with exact γ .15bserve that in general it may be problematic that we don’t have exact access to the minimizer w ∗ , since some ofthe X i may be unboundedly large (in particular, if it’s corrupted) in norm. However, we only use information about γ .Since γ lives within a bounded range, and our analysis is robust to small changes to γ , these numerical issues do notchange anything in the analysis. We now show that Algorithm 4 provides the guarantees required for Theorem 2.2. We first show that if we are in Case1, then γ is small: Lemma 6.1.
Let ρ, δ > . Let ε, η be as in Theorem 2.2. Let X , . . . , X n be an ε -corrupted set of samples from N (0 , I ) of size n , where n is as in Theorem 2.2. Then, with probability − δ , we have γ ≤ ρ/ .Proof. Let w be the uniform weights over the uncorrupted points. Then it from Theorem 4.2 that k P w P ni =1 w i ( X i X Ti − I ) k ∗X k ≤ O ( η ) with probability − δ . Since w ∈ S n,ε , this immediately implies that γ ≤ O ( ρ ) . By setting constantsappropriately, we obtain the desired guarantee.We now show that if we are in Case 2, then γ must be large: Lemma 6.2.
Let ρ, δ > . Let ε, η, n be as in Theorem 2.2. Let X , . . . , X n be an ε -corrupted set of samples from N (0 , I ) of size n . Then, with probability − δ , we have γ ≥ (1 − ε ) ρ − (2 + ρ ) η . In particular, for ε sufficientlysmall, and η = O ( ρ ) , we have that γ > ρ/ .Proof. Let
Σ = I + ρvv T , and let Y i = Σ − / X i , so that if Y i is uncorrupted, then Y i ∼ N (0 , I ) . Let w ∗ bethe optimal solution to (11). By Theorem 4.6, we have that with probability − δ , we can write P ni =1 w ∗ i Y i Y Ti = w g ( I + N ) + B , where k N k ∗X k ≤ η , and B = P i ∈ S bad w ∗ i Y i Y Ti . Therefore, we have P ni =1 w ∗ X i X Ti = w g (Σ +Σ / N Σ / ) + Σ / B Σ / . By definition, we have (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X i =1 w i ( X i X Ti − I ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∗X k ≥ h w g (Σ + Σ / N Σ / ) + Σ / B Σ / − I, vv T i≥ w g h (Σ + Σ / N Σ / ) , vv T i − w g (1 + ρ ) + w g v T Σ / N Σ / v − ≥ (1 − ε ) ρ + (1 − ε ) v T Σ / N Σ / v − ε . It thus suffices to show that | v T Σ / N Σ / v | < (1 + ρ ) η . Since v is an eigenvector for Σ with eigenvalue ρ , wehave that Σ / v = √ ρ + 1 · v and thus v T Σ / N Σ / v = (1 + ρ ) v T N v = (1 + ρ ) h N, vv T i ≤ (1 + ρ ) k N k ∗X k ≤ (1 + ρ ) η . Lemmas 6.1 and 6.2 together imply the correctness of D
ETECT R OBUST
SPCA and Theorem 2.2.
In this section, we prove Theorem 2.3. We give some intuition here. Perhaps the first naive try would be to simplyrun the same SDP in (11), and hope that the dual norm maximizer gives you enough information to recover thehidden spike. This would more or less correspond to the simplest modification SDP of the sparse PCA in the non-robust setting that one could hope gives non-trivial information in this setting. However, this cannot work, for thefollowing straightforward reason: the value of the SDP is always at least O ( ρ ) , as we argued in Section 6. Therefore,the noise can pretend to be some other sparse vector u orthogonal to v , so that the covariance with noise looks like w g ( I + ρvv T ) + w g ρuu T , so that the value of the SDP can be minimized with the uniform set of weights. Then it is16asily verified that both vv T and uu T are dual norm maximizers, and so the dual norm maximizer does not uniquelydetermine v .To circumvent this, we simply add an additional slack variable to the SDP, which is an additional matrix in X k ,which we use to try to maximally explain away the rank-one part of I + ρvv T . This forces the value of the SDP to bevery small, which allows us to show that the slack variable actually captures v . Our algorithms and analyses will make crucial use of the following convex set, which is a further relaxation of X k : W (2) k = n X ∈ R d × d : tr( X ) ≤ , k X k ≤ , k X k ≤ k, X (cid:23) o . Our algorithm, given formally in Algorithm 4, will be the following. We solve a convex program which simulta-neously chooses a weights in S n,ε and a matrix A ∈ W k to minimize the W k distance between the sample covariancewith these weights, and A . Our output is then just the top eigenvector of A . Algorithm 4
Learning a spiked covariance model, robustly function R ECOVER R OBUST
SPCA( X , . . . , X n , ε, δ, ρ ) Let w ∗ , A ∗ be the solution to arg min w ∈ S n,ε ,A ∈X k (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X i =1 w i ( X i X Ti − I ) − ρA (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∗W k (12) Let u be the top eigenector of A ∗ return The d U k ( u ) k u k ∗U k , i.e., the vector with all but the top k coordinates of v zeroed out. end function This algorithm can be run efficiently for the same reasons as explained for D
ETECT R OBUST
SPCA. For the rest of thesection we will assume that we have an exact solution for this problem. As before, we only use information about A ,and since A comes from a bounded space, and our analysis is robust to small perturbations in A , this does not changeanything. Before we can prove correctness of our algorithm, we require a couple of concentration inequalities for the set W k . Lemma 7.1.
Fix ε, δ > . Let X , . . . , X n ∼ N (0 , I ) , where n is as in Theorem 4.2. Then with probability − δ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n X i =1 X i X Ti − I (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∗W k ≤ O ( ε ) . Proof.
Let b Σ denote the empirical covariance. Observe that W k ⊆ S ∞ i =0 − i X i +1 k . Moreover, for any i , by Theorem4.2, if we take n = Ω min( d, (2 i +1 k ) ) + log (cid:0) d (2 i +1 k ) (cid:1) + log 1 /δ (2 − i ε ) = Ω min( d, k ) + log (cid:0) d k (cid:1) + 2 i log 1 /δε ! , |h M, b Σ i| ≤ ε for all M ∈ − i X i +1 k with probability − δ/ . In particular, if we take n = Ω min( d, k ) + log (cid:0) d k (cid:1) + log 1 /δε ! samples, then for any i , we have |h M, b Σ i| ≤ ε for all M ∈ − X i +1 k with probability at least − δ i / . By a unionbound over all these events, since P ∞ i =0 δ i ≤ δ , we conclude that if we take n to be as above, then |h M, b Σ i| ≤ ε forall M ∈ S ∞ i =0 − i X i +1 k with probability − δ . Since W k is contained in this set, this implies that k b Σ − Σ k ∗W k ≤ O ( ε ) with probability at least − δ , as claimed.By the same techniques as in the proofs of Theorems 4.5 and 4.6, we can show the following bound. Because ofthis, we omit the proof for conciseness. Corollary 7.2.
Fix ε, δ > . Let X , . . . , X n ∼ N (0 , I ) where n is as in Theorem 4.6. Then there is an η = O ( ε p log 1 /ε ) so that Pr ∃ w ∈ S n,ε : (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X i =1 w i X i X Ti − I (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∗W k ≥ η ≤ δ . In the rest of this section we will condition on the following deterministic event happening: ∀ w ∈ S n,ε : (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X i =1 w i X i X Ti − I (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∗W k ≤ η , (13)where η = O ( ε log 1 /ε ) . By Corollary 7.2, this holds if we take n = Ω min( d, k ) + log (cid:0) d k (cid:1) + log 1 /δη ! samples.The rest of this section is dedicated to the proof of the following theorem, which immediately implies Theorem2.3. Theorem 7.3.
Fix ε, δ, and let η be as in (13). Assume that (13) holds.Let b v be the output of R ECOVERY R OBUST
SPCA ( X , . . . , X n , ε, δ, ρ ) . Then L ( b v, v ) ≤ O ( p (1 + ρ ) η/ρ ) . Our proof proceeds in a couple of steps. Let
Σ = I + ρvv T denote the true covariance. We first need the following,technical lemma: Lemma 7.4.
Let M ∈ W k . Then Σ / M Σ / ∈ (1 + ρ ) W k .Proof. Clearly, Σ / M Σ / (cid:23) . Moreover, since Σ / = I + ( √ ρ − vv T , we have that the maximum valueof any element of Σ / is upper bounded by √ ρ . Thus, we have k Σ / M Σ / k ≤ (1 + ρ ) k M k . We also have tr(Σ / M Σ / ) = tr(Σ M )= tr( M ) + ρv T M v ≤ ρ , since k M k ≤ . Thus Σ / M Σ / ∈ (1 + ρ ) W k , as claimed.Let w ∗ , A ∗ be the output of our algorithm. We first claim that the value of the optimal solution is quite small:18 emma 7.5. (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X i =1 w ∗ i ( X i X Ti − I ) − ρA ∗ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∗W k ≤ η (1 + ρ ) . Proof.
Indeed, if we let w be the uniform set of weights over the good points, and we let A = vv T , then by (13), wehave n X i =1 w i X i X Ti = Σ / ( I + N )Σ / , where k N k ∗X k ≤ η , and Σ = I + ρvv T . Thus we have that (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X i =1 w i ( X i X Ti − I ) − ρvv T (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∗W k = k Σ / N Σ / k ∗W k = max M ∈W k (cid:12)(cid:12)(cid:12) tr(Σ / N Σ / M ) (cid:12)(cid:12)(cid:12) = max M ∈W k (cid:12)(cid:12)(cid:12) tr( N Σ / M Σ / ) (cid:12)(cid:12)(cid:12) ≤ (1 + ρ ) k N k ∗W k , by Lemma 7.4.We now show that this implies the following: Lemma 7.6. v T A ∗ v ≥ − (2 + 3 ρ ) η/ρ .Proof. By (13), we know that we may write P ni =1 w i ( X i X Ti − I ) = w g ρvv T + B − (1 − w g ) I + N , where B = P i ∈ S bad w i X i X Ti , and k N k ∗W k ≤ (1 + ρ ) η . Thus, by Lemma 7.5 and the triangle inequality, we have that (cid:13)(cid:13) w g ρvv T + B − ρA (cid:13)(cid:13) ∗W k ≤ η + k N k ∗W k + (1 − w g ) k I k ∗W k + (1 − w g ) k ρA k ∗W k ≤ (1 + ρ ) η + ε + ρε ≤ (1 + 2 ρ ) η + ε . Now, since vv T ∈ W k , the above implies that | w g ρ + v T Bv − ρv T A ∗ v | ≤ (1 + 2 ρ ) η + ε , which by a further triangle inequality implies that | ρ (1 − v T A ∗ v ) + v T Bv | ≤ (1 + 2 ρ ) η + ε + ερ ≤ (2 + 3 ρ ) η . Since ≤ v T A ∗ v ≤ (since A ∈ X k ) and B is PSD, this implies that in fact, we have ≤ ρ (1 − v T A ∗ v ) ≤ (2 + 3 ρ ) η . Hence v T A ∗ v ≥ − (2 + 3 ρ ) η/ρ , as claimed.Let γ = (2 + 3 ρ ) η/ρ . The lemma implies that the top eigenvalue of A ∗ is at least − γ . Moreover, since A ∗ ∈ X k , as long as γ ≤ / , this implies that the top eigenvector of A ∗ is unique up to sign. By the constraint that η ≤ O (min( ρ, , for an appropriate choice of constants, we that γ ≤ / , and so this condition is satisfied. Recallthat u is the top eigenvector of A ∗ . Since tr( A ∗ ) = 1 and A ∗ is PSD, we may write A ∗ = λ uu T + A , where u isthe top eigenvector of A ∗ , λ ≥ − γ , and k A k ≤ γ . Thus, by the triangle inequality, this implies that k ρ ( vv T − λ uu T ) + B k ∗X k ≤ O ( ργ ) k ρ ( vv T − uu T ) + B k ∗X k ≤ O ( ργ ) . (14)We now show this implies the following intermediate result: Lemma 7.7. ( v T u ) ≥ − O ( γ ) .Proof. By Lemma 7.6, we have that v T A ∗ v = λ ( v T u ) + v T A v ≥ − γ . In particular, this implies that ( v T u ) ≥ (1 − γ ) /λ ≥ − γ , since − γ ≤ λ ≤ .We now wish to control the spectrum of B . For any subsets S, T ⊆ [ d ] , and for any vector x and any matrix M ,let x S denote x restricted to S and M S,T denote the matrix restricted to the rows in S and the columns in T . Let I bethe support of u , and let J be the support of the largest k elements of v . Lemma 7.8. k B I,I k ≤ O ( ργ ) .Proof. Observe that the condition (14) immediately implies that k ρ ( v I v TI − u I u TI ) + B I,I k ≤ cργ , (15)for some c , since any unit vector x supported on I satisfies xx T ∈ X k . Suppose that k B I,I k ≥ Cγ for somesufficiently large C . Then (15) immediately implies that k ρ ( v I v TI − u I u TI ) k ≥ ( C − c ) ργ . Since ( v I v TI − u I u TI ) isclearly rank 2, and satisfies tr( v I v TI − u I u TI ) = 1 −k u I k ≥ , this implies that the largest eigenvalue of v I v TI − u I u TI is positive. Let x be the top eigenvector of v I v TI − u I u TI . Then, we have x T ( v I v TI − u I u TI ) x + x T Bx = ( C − c ) ργ + x T Bx ≥ ( C − c ) ργ by the PSD-ness of B . If C > c, this contradicts (15), which proves the theorem.This implies the following corollary:
Corollary 7.9. k u I k ≥ − O ( γ ) .Proof. Lemma 7.8 and (15) together imply that k v I v TI − u I u TI k ≤ O ( γ ) . The desired bound then follows from areverse triangle inequality.We now show this implies a bound on B J \ I,J \ I : Lemma 7.10. k B J \ I,J \ I k ≤ O ( ργ ) .Proof. Suppose k B J \ I,J \ I k ≥ Cγ for some sufficiently large C . Since u is zero on J \ I , (14) implies that k ρv J \ I v TJ \ I + B J \ I,J \ I k ≤ cργ , for some universal c . By a triangle inequality, this implies that k v J \ I k = k v J \ I v TJ \ I k ≥ ( C − c ) γ . Since v is a unitvector, this implies that k v I k ≤ − ( C − c ) γ , which for a sufficiently large C , contradicts Corollary 7.9.We now invoke the following general fact about PSD matrices: Lemma 7.11.
Suppose M is a PSD matrix, written in block form as M = (cid:18) C DD T E (cid:19) . Suppose furthermore that k C k ≤ ξ and k E k ≤ ξ . Then k M k ≤ O ( ξ ) .Proof. It is easy to see that k M k ≤ O (max( k C k , k D k , k E k )) . Thus it suffices to bound the largest singular value of D . For any vectors φ, ψ with appropriate dimension, we have that ( φ T − ψ T ) M (cid:18) φ − ψ (cid:19) = φ T Aφ − φ T Dψ + ψ T Cψ ≥ , which immediately implies that the largest singular value of D is at most ( k A k + k B k ) / , which implies the claim.20herefore, Lemmas 7.8 and 7.10 together imply: Corollary 7.12. k v I ∪ J v TI ∪ J − u I ∪ J u TI ∪ J k ≤ O ( γ ) . Proof.
Therefore, Lemmas 7.8 and 7.10 together imply:

Corollary 7.12. $\|v_{I \cup J} v_{I \cup J}^T - u_{I \cup J} u_{I \cup J}^T\| \le O(\gamma)$.

Proof. Observe that (14) immediately implies that $\|\rho(v_{I \cup J} v_{I \cup J}^T - u_{I \cup J} u_{I \cup J}^T) + B_{I \cup J, I \cup J}\| \le O(\rho\gamma)$, since $|I \cup J| \le 2k$. Moreover, Lemmas 7.8 and 7.10 together with Lemma 7.11 imply that $\|B_{I \cup J, I \cup J}\| \le O(\rho\gamma)$, which immediately implies the statement by a triangle inequality.

Finally, we show this implies $\|vv^T - u_J u_J^T\| \le O(\gamma)$, which is equivalent to the theorem.

Proof of Theorem 7.3.
We will in fact show the slightly stronger statement that $\|u_J u_J^T - vv^T\|_F \le O(\gamma)$. Observe that since $v_{I \cup J} v_{I \cup J}^T - u_{I \cup J} u_{I \cup J}^T$ is rank 2, Corollary 7.12 implies that $\|v_{I \cup J} v_{I \cup J}^T - u_{I \cup J} u_{I \cup J}^T\|_F \le O(\gamma)$, since for rank-two matrices the spectral and Frobenius norms differ by at most a constant factor. We have
\[
\|u_J u_J^T - vv^T\|_F^2
= \sum_{(i,j) \in (I \cap J) \times (I \cap J)} (u_i u_j - v_i v_j)^2
+ \sum_{(i,j) \in I \times I \setminus J \times J} (v_i v_j)^2
+ \sum_{(i,j) \in J \times J \setminus I \times I} (u_i u_j)^2.
\]
We have
\[
\sum_{(i,j) \in (I \cap J) \times (I \cap J)} (u_i u_j - v_i v_j)^2
+ \sum_{(i,j) \in J \times J \setminus I \times I} (u_i u_j)^2
\le \|v_{I \cup J} v_{I \cup J}^T - u_{I \cup J} u_{I \cup J}^T\|_F^2 \le O(\gamma^2),
\]
by Corollary 7.12. Moreover, we have that
\[
\sum_{(i,j) \in I \times I \setminus J \times J} (v_i v_j)^2
\le 2 \sum_{(i,j) \in I \times I \setminus J \times J} (v_i v_j - u_i u_j)^2
+ 2 \sum_{(i,j) \in I \times I \setminus J \times J} (u_i u_j)^2
\le 2\,\|v_{I \cup J} v_{I \cup J}^T - u_{I \cup J} u_{I \cup J}^T\|_F^2
+ 2 \sum_{(i,j) \in J \times J \setminus I \times I} (u_i u_j)^2
\le O(\gamma^2),
\]
since $J \times J$ contains the $k^2$ largest entries of $uu^T$. This completes the proof.
Acknowledgements

The author would like to thank Ankur Moitra for helpful advice throughout the project, and Michael Cohen for some surprisingly useful conversations. (Is it really surprising, though?)

References

[ABL14] P. Awasthi, M. F. Balcan, and P. M. Long. The power of localization for efficiently learning linear separators with noise. In STOC, pages 449–458, 2014.
[ACCD11] Ery Arias-Castro, Emmanuel J. Candès, and Arnaud Durand. Detection of an anomalous cluster in a network. The Annals of Statistics, pages 278–304, 2011.
[AW08] Arash A. Amini and Martin J. Wainwright. High-dimensional analysis of semidefinite relaxations for sparse principal components. In IEEE International Symposium on Information Theory (ISIT), pages 2454–2458. IEEE, 2008.
[BD15] S. Balmand and A. Dalalyan. Convex programming approach to robust estimation of a multivariate Gaussian model, 2015.
[Ber06] T. Bernholt. Robust estimators are hard to compute. Technical report, University of Dortmund, Germany, 2006.
[BGM+16] Mark Braverman, Ankit Garg, Tengyu Ma, Huy L. Nguyen, and David P. Woodruff. Communication lower bounds for statistical estimation problems via a distributed data processing inequality. In STOC, 2016.
[BJNP13] Aharon Birnbaum, Iain M. Johnstone, Boaz Nadler, and Debashis Paul. Minimax bounds for sparse PCA with noisy high-dimensional data. Annals of Statistics, 41(3):1055, 2013.
[BMVX16] Jess Banks, Cristopher Moore, Roman Vershynin, and Jiaming Xu. Information-theoretic bounds and phase transitions in clustering, sparse PCA, and submatrix localization. arXiv preprint arXiv:1607.05222, 2016.
[BR13] Quentin Berthet and Philippe Rigollet. Optimal detection of sparse principal components in high dimension. The Annals of Statistics, 41(4):1780–1815, 2013.
[Bru09] S. C. Brubaker. Robust PCA and clustering in noisy mixtures. In SODA, pages 1078–1087, 2009.
[CLMW11] E. J. Candès, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? J. ACM, 58(3):11, 2011.
[CMW13] T. Tony Cai, Zongming Ma, and Yihong Wu. Sparse PCA: Optimal rates and adaptive estimation. The Annals of Statistics, 41(6):3074–3110, 2013.
[CR12] Emmanuel Candès and Benjamin Recht. Exact matrix completion via convex optimization. Communications of the ACM, 55(6):111–119, 2012.
[CRPW12] Venkat Chandrasekaran, Benjamin Recht, Pablo A. Parrilo, and Alan S. Willsky. The convex geometry of linear inverse problems. Foundations of Computational Mathematics, 12(6):805–849, 2012.
[CRZ16] T. Tony Cai, Zhao Ren, and Harrison H. Zhou. Estimating structured high-dimensional covariance and precision matrices: Optimal rates and adaptive estimation. Electronic Journal of Statistics, 10(1):1–59, 2016.
[CSV16] Moses Charikar, Jacob Steinhardt, and Gregory Valiant. Learning from untrusted data. arXiv preprint arXiv:1611.02315, 2016.
[CT12] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. John Wiley & Sons, 2012.
[CW08] Emmanuel J. Candès and Michael B. Wakin. An introduction to compressive sampling. IEEE Signal Processing Magazine, 25(2):21–30, 2008.
[dBG08] Alexandre d'Aspremont, Francis Bach, and Laurent El Ghaoui. Optimal solutions for sparse principal component analysis. Journal of Machine Learning Research, 9(Jul):1269–1294, 2008.
[dEGJL07] Alexandre d'Aspremont, Laurent El Ghaoui, Michael I. Jordan, and Gert R. G. Lanckriet. A direct formulation for sparse PCA using semidefinite programming. SIAM Review, 49(3):434–448, 2007.
[DKK+16] Ilias Diakonikolas, Gautam Kamath, Daniel M. Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Robust estimators in high dimensions without the computational intractability. In FOCS, pages 655–664. IEEE, 2016.
[DKK+17] Ilias Diakonikolas, Gautam Kamath, Daniel M. Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Efficient and optimally robust learning of high-dimensional Gaussians. 2017.
[DKS16] Ilias Diakonikolas, Daniel M. Kane, and Alistair Stewart. Statistical query lower bounds for robust estimation of high-dimensional Gaussians and Gaussian mixtures. arXiv preprint arXiv:1611.03473, 2016.
[DKS17] Ilias Diakonikolas, Daniel M. Kane, and Alistair Stewart. Robust learning of fixed-structure Bayesian networks. Available at arXiv:1606.07384, 2017.
[DSS17] Simon Du, Sivaraman Balakrishnan, and Aarti Singh. Computationally efficient robust estimation of sparse functionals. arXiv preprint arXiv:1702.07709, 2017.
[GMN14] Ankit Garg, Tengyu Ma, and Huy Nguyen. On communication cost of distributed statistical estimation and dimensionality. In NIPS, pages 2726–2734, 2014.
[GWL14] Quanquan Gu, Zhaoran Wang, and Han Liu. Sparse PCA with oracle property. In NIPS, pages 1529–1537, 2014.
[HM13] M. Hardt and A. Moitra. Algorithms and hardness for robust subspace recovery. In COLT, pages 354–375, 2013.
[HR09] P. J. Huber and E. M. Ronchetti. Robust Statistics. Wiley, New York, 2009.
[HRRS86] F. R. Hampel, E. M. Ronchetti, P. J. Rousseeuw, and W. A. Stahel. Robust Statistics: The Approach Based on Influence Functions. Wiley, New York, 1986.
[HTW15] Trevor Hastie, Robert Tibshirani, and Martin Wainwright. Statistical Learning with Sparsity. CRC Press, 2015.
[JNRS10] Michel Journée, Yurii Nesterov, Peter Richtárik, and Rodolphe Sepulchre. Generalized power method for sparse principal component analysis. Journal of Machine Learning Research, 11(Feb):517–553, 2010.
[Joh01] Iain M. Johnstone. On the distribution of the largest eigenvalue in principal components analysis. Annals of Statistics, pages 295–327, 2001.
[Joh11] Iain M. Johnstone. Gaussian estimation: Sequence and wavelet models. Unpublished manuscript, 2011.
[JP78] D. S. Johnson and F. P. Preparata. The densest hemisphere problem. Theoretical Computer Science, 6:93–107, 1978.
[KL93] M. J. Kearns and M. Li. Learning in the presence of malicious errors. SIAM Journal on Computing, 22(4):807–837, 1993.
[KLS09] A. Klivans, P. Long, and R. Servedio. Learning halfspaces with malicious noise. 2009.
[KNV15] Robert Krauthgamer, Boaz Nadler, and Dan Vilenchik. Do semidefinite relaxations solve sparse PCA up to the information limit? The Annals of Statistics, 43(3):1300–1322, 2015.
[LMTZ12] G. Lerman, M. B. McCoy, J. A. Tropp, and T. Zhang. Robust computation of linear models, or how to find a needle in a haystack. CoRR, abs/1202.4044, 2012.
[LRV16] Kevin A. Lai, Anup B. Rao, and Santosh Vempala. Agnostic estimation of mean and covariance. In FOCS, pages 665–674. IEEE, 2016.
[LT15] P. L. Loh and X. L. Tan. High-dimensional robust precision matrix estimation: Cellwise corruption under ε-contamination. Available at arXiv:1509.07229, 2015.
[LZ12] Zhaosong Lu and Yong Zhang. An augmented Lagrangian approach for sparse principal component analysis. Mathematical Programming, pages 1–45, 2012.
[Ma13] Zongming Ma. Sparse principal component analysis and iterative thresholding. The Annals of Statistics, 41(2):772–801, 2013.
[MW15] Tengyu Ma and Avi Wigderson. Sum-of-squares lower bounds for sparse PCA. In Advances in Neural Information Processing Systems, pages 1612–1620, 2015.
[OMH14] Alexei Onatski, Marcelo J. Moreira, and Marc Hallin. Signal detection in high dimension: The multispiked case. The Annals of Statistics, 42(1):225–254, 2014.
[PWBM16] Amelia Perry, Alexander S. Wein, Afonso S. Bandeira, and Ankur Moitra. Optimality and sub-optimality of PCA for spiked random matrices and synchronization. arXiv preprint arXiv:1609.05573, 2016.
[Rig15] Philippe Rigollet. High dimensional statistics, 2015.
[SD15] Jacob Steinhardt and John C. Duchi. Minimax rates for memory-bounded sparse linear regression. In COLT, 2015.
[Ser03] R. Servedio. Smooth boosting and learning with malicious noise. JMLR, 4:633–648, 2003.
[Tsy09] Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Revised and extended from the 2004 French original, translated by Vladimir Zaiats. Springer Series in Statistics. Springer, New York, 2009.
[Tuk75] J. W. Tukey. Mathematics and picturing of data. In Proceedings of ICM, volume 6, pages 523–531, 1975.
[Val85] L. Valiant. Learning disjunctions of conjunctions. In Proc. 9th IJCAI, pages 560–566, 1985.
[WBS16] Tengyao Wang, Quentin Berthet, and Richard J. Samworth. Statistical and computational trade-offs in estimation of sparse principal components. The Annals of Statistics, 44(5):1896–1930, 2016.
[WGL15] Zhaoran Wang, Quanquan Gu, and Han Liu. Statistical limits of convex relaxations. arXiv preprint arXiv:1503.01442, 2015.
[WTH09] Daniela M. Witten, Robert Tibshirani, and Trevor Hastie. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, page kxp008, 2009.
[ZL14] T. Zhang and G. Lerman. A novel M-estimator for robust PCA. J. Mach. Learn. Res., 15(1):749–808, January 2014.
A Information theoretic estimators for robust sparse estimation
This section is dedicated to the proofs of the following two facts:
Fact A.1.
Fix $\varepsilon, \delta > 0$, and let $k$ be fixed. Given an $\varepsilon$-corrupted set of samples $X_1, \ldots, X_n \in \mathbb{R}^d$ from $\mathcal{N}(\mu, I)$, where $\mu$ is $k$-sparse, and
\[
n = O\!\left(\frac{k \log(d/\varepsilon) + \log(1/\delta)}{\varepsilon^2}\right),
\]
there is an (inefficient) algorithm which outputs $\widehat{\mu}$ so that with probability $1 - \delta$, we have $\|\mu - \widehat{\mu}\|_2 \le O(\varepsilon)$. Moreover, up to logarithmic factors, this rate is optimal.

Fact A.2.
Fix $\rho, \delta > 0$. Suppose that $\rho = O(1)$. Then there exist universal constants $c, C$ so that: (a) if $\varepsilon \le c\rho$, and we are given an $\varepsilon$-corrupted set of samples from either $\mathcal{N}(0, I)$ or $\mathcal{N}(0, I + \rho vv^T)$ for some $k$-sparse unit vector $v$, of size
\[
n = \Omega\!\left(\frac{k + \log\binom{d}{k} + \log(1/\delta)}{\rho^2}\right),
\]
then there is an (inefficient) algorithm which succeeds with probability $1 - \delta$ for the detection problem; and (b) if $\varepsilon \ge C\rho$, then no algorithm succeeds with probability greater than $2/3$. Moreover, this statistical rate is optimal.

The rates in Facts A.1 and A.2 are already known to be optimal (up to log factors) without noise. Thus in this section we focus on proving the upper bounds, and the lower bounds on the error. The lower bounds on what error is achievable follow from the following two facts, which follow from Pinsker's inequality (see e.g. [CT12]) and the fact that the corruption model can, given samples from $D_1$, simulate samples from $D_2$ by corrupting an $O(\varepsilon)$ fraction of points, provided $d_{\mathrm{TV}}(D_1, D_2) \le O(\varepsilon)$.

Fact A.3.
Fix $\varepsilon > 0$ sufficiently small. Let $\mu_1, \mu_2$ be arbitrary. There is some universal constant $C$ so that if $d_{\mathrm{TV}}(\mathcal{N}(\mu_1, I), \mathcal{N}(\mu_2, I)) \le \varepsilon$, then $\|\mu_1 - \mu_2\|_2 \le C\varepsilon$, and if $\|\mu_1 - \mu_2\|_2 \le \varepsilon$, then $d_{\mathrm{TV}}(\mathcal{N}(\mu_1, I), \mathcal{N}(\mu_2, I)) \le C\varepsilon$.

Fact A.4.
Fix $\rho = O(1)$. Let $u, v$ be arbitrary unit vectors. Then $d_{\mathrm{TV}}(\mathcal{N}(0, I), \mathcal{N}(0, I + \rho vv^T)) = \Theta(\rho)$, and $d_{\mathrm{TV}}(\mathcal{N}(0, I + \rho vv^T), \mathcal{N}(0, I + \rho uu^T)) = O(L(u, v))$.

Our techniques for proving the upper bounds go through the technique of agnostic hypothesis selection via tournaments. Specifically, we use the following lemma:
Lemma A.5 ([DKK+16]). Let $\mathcal{C}$ be a class of probability distributions. Suppose that for some $n, \gamma, \varepsilon, \delta > 0$ there exists an algorithm that, given an $\varepsilon$-corrupted set of $n$ samples from some $D \in \mathcal{C}$, returns a list $M$ of distributions so that with probability $1 - \delta/2$ there exists a $D' \in M$ with $d_{\mathrm{TV}}(D', D) < \gamma$. Suppose furthermore that with probability $1 - \delta/2$, the distributions returned by this algorithm are all contained in some fixed set $\mathcal{M}$. Then there exists another algorithm which, given $O\bigl(n + (\log|\mathcal{M}| + \log(1/\delta))/\varepsilon^2\bigr)$ samples from $D$, an $\varepsilon$-fraction of which have been arbitrarily corrupted, returns a single distribution $\widehat{D}$ so that with probability $1 - \delta$ we have $d_{\mathrm{TV}}(\widehat{D}, D) < O(\gamma + \varepsilon)$.
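Lemma A.5 is used here as a black box. For intuition only, the following is a minimal sketch of a Scheffé-style tournament for choosing among spherical Gaussian candidates $\mathcal{N}(\mu_i, I)$; the helper name and the simplifications (no handling of the corrupted fraction or of the confidence parameter) are ours, so this should not be read as the actual procedure of [DKK+16].

```python
import numpy as np
from itertools import combinations
from scipy.stats import norm

def scheffe_tournament(samples, means):
    """Pick the candidate N(mu_i, I) whose predicted mass of each Scheffe set
    {x : p_i(x) > p_j(x)} best matches its empirical mass.  For two spherical
    Gaussians this set is the halfspace <x, mu_i - mu_j> > (|mu_i|^2 - |mu_j|^2)/2."""
    wins = np.zeros(len(means), dtype=int)
    for i, j in combinations(range(len(means)), 2):
        w = means[i] - means[j]
        if np.linalg.norm(w) == 0:
            continue
        t = (means[i] @ means[i] - means[j] @ means[j]) / 2.0
        emp = np.mean(samples @ w > t)                                # empirical mass of the Scheffe set
        p_i = 1 - norm.cdf((t - means[i] @ w) / np.linalg.norm(w))    # mass under N(mu_i, I)
        p_j = 1 - norm.cdf((t - means[j] @ w) / np.linalg.norm(w))    # mass under N(mu_j, I)
        wins[i if abs(emp - p_i) <= abs(emp - p_j) else j] += 1
    return int(np.argmax(wins))                                       # index of the selected candidate
```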
A.1 Proof of Upper Bound in Fact A.1

Let $\mathcal{M}_A$ be the set of distributions $\{\mathcal{N}(\mu', I)\}$, where $\mu'$ ranges over the set of $k$-sparse vectors such that each coordinate of $\mu'$ is an integer multiple of $\varepsilon/(10\sqrt{d})$ and $\|\mu' - \mu\|_2 \le A$. We then have:

Claim A.6.
There exists a $\mathcal{N}(\mu', I) = D' \in \mathcal{M}_A$ so that $\|\mu - \mu'\|_2 \le O(\varepsilon)$. Moreover, $|\mathcal{M}_A| \le \binom{d}{k} \cdot (10 A \sqrt{d}/\varepsilon)^k$.

Proof. The first claim is straightforward. We now prove the second claim. For each possible set of $k$ coordinates, there are at most $(10 A \sqrt{d}/\varepsilon)^k$ vectors supported on those $k$ coordinates with each coordinate being an integer multiple of $\varepsilon/(10\sqrt{d})$ and with distance at most $A$ from any fixed vector. Enumerating over all $\binom{d}{k}$ possible choices of $k$ coordinates yields the desired answer.

The estimator is given as follows: first, run NaivePrune$(X_1, \ldots, X_n, \delta)$ to output some $\mu_0$ so that with probability $1 - \delta$, we have $\|\mu - \mu_0\|_2 \le O(\sqrt{d \log(n/\delta)})$. Round each coordinate of $\mu_0$ so that it is an integer multiple of $\varepsilon/(10\sqrt{d})$. Then, output the set of distributions $\mathcal{M}' = \{\mathcal{N}(\mu'', I)\}$, where $\mu''$ is any $k$-sparse vector with each coordinate being an integer multiple of $\varepsilon/(10\sqrt{d})$ and with $\|\mu'' - \mu_0\|_2 \le O(\sqrt{d \log(n/\delta)})$. With probability $1 - \delta$, we have $\mathcal{M}' \subseteq \mathcal{M}_{O(\sqrt{d \log(n/\delta)})}$. By Claim A.6, applying Lemma A.5 to this set of distributions yields that we will select, with probability $1 - \delta$, a $\mu'$ so that $\|\mu - \mu'\|_2 \le O(\varepsilon)$. By Claim A.6, this requires
\[
O\!\left(\frac{\log |\mathcal{M}_{O(\sqrt{d \log(n/\delta)})}| + \log(1/\delta)}{\varepsilon^2}\right)
= O\!\left(\frac{\log\binom{d}{k} + k \log(d/\varepsilon) + \log(1/\delta)}{\varepsilon^2}\right)
\]
samples, which simplifies to the desired bound, as claimed.
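As a quick numerical illustration of how Claim A.6 feeds into Lemma A.5 (the helper names and the explicit $1/\varepsilon^2$ form of the bound are ours and only indicative, since constants are hidden in the $O(\cdot)$ above):

```python
import math

def log_net_size(d, k, A, eps):
    """Claim A.6: log |M_A| <= log C(d, k) + k * log(10 * A * sqrt(d) / eps)."""
    return math.log(math.comb(d, k)) + k * math.log(10 * A * math.sqrt(d) / eps)

def robust_sparse_mean_samples(d, k, A, eps, delta):
    """Sample count suggested by Lemma A.5: ~ (log|M_A| + log(1/delta)) / eps^2."""
    return (log_net_size(d, k, A, eps) + math.log(1 / delta)) / eps ** 2

# e.g. d = 10_000, k = 20, radius A ~ sqrt(d log(n/delta)), eps = 0.05, delta = 0.01:
print(robust_sparse_mean_samples(10_000, 20, 500.0, 0.05, 0.01))
```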
A.2 Proof of Upper Bound in Fact A.2

Our detection algorithm is given as follows. We let $\mathcal{N}_{\mathrm{net}}$ be an $O(1)$-net over all $k$-sparse unit vectors, and we apply Lemma A.5 to the set $\{\mathcal{N}(0, I + \rho uu^T)\}_{u \in \mathcal{N}_{\mathrm{net}}} \cup \{\mathcal{N}(0, I)\}$. Clearly, we have:

Claim A.7. $|\mathcal{M}| = \binom{d}{k} \cdot 2^{O(k)}$.

By Fact A.4 and the guarantees of Lemma A.5, by an appropriate setting of parameters, if we have
\[
n = O\!\left(\frac{\log|\mathcal{M}| + \log(1/\delta)}{\varepsilon^2}\right)
= O\!\left(\frac{k + \log\binom{d}{k} + \log(1/\delta)}{\varepsilon^2}\right)
\]
samples, then with probability $1 - \delta$ we will output $\mathcal{N}(0, I)$ if and only if the true model is $\mathcal{N}(0, I)$. This proves the upper bound.

B Omitted Details from Section 4

B.1 Writing non-robust algorithms as dual norm maximization
In this section we will briefly review well-known non-robust algorithms for sparse mean recovery and for sparse PCA, and write them using our language.
Thresholding
Recall that in the (non-robust) sparse mean estimation problem, one is given samples $X_1, \ldots, X_n \sim \mathcal{N}(\mu, I)$ where $\mu$ is $k$-sparse. The goal is then to recover $\mu$. It turns out the simple thresholding algorithm ThresholdMean given in Algorithm 5 suffices for recovery:
Algorithm 5 Thresholding for sparse mean estimation

function ThresholdMean($X_1, \ldots, X_n$)
    Let $\widehat{\mu} = \frac{1}{n}\sum_{i=1}^n X_i$
    Let $S$ be the set of $k$ coordinates of $\widehat{\mu}$ with largest magnitude
    Let $\widehat{\mu}'$ be defined by $\widehat{\mu}'_i = \widehat{\mu}_i$ if $i \in S$, and $\widehat{\mu}'_i = 0$ otherwise
    return $\widehat{\mu}'$
end function

The correctness of this algorithm follows from the following folklore result, whose proof we shall omit for conciseness:
Fact B.1 (c.f. [Rig15]). Fix $\varepsilon, \delta > 0$, and let $X_1, \ldots, X_n$ be samples from $\mathcal{N}(\mu, I)$, where $\mu$ is $k$-sparse and
\[
n = \Omega\!\left(\frac{\log\binom{d}{k} + \log(1/\delta)}{\varepsilon^2}\right).
\]
Then, with probability $1 - \delta$, if $\widehat{\mu}'$ is the output of ThresholdMean, we have $\|\widehat{\mu}' - \mu\|_2 \le \varepsilon$.
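For concreteness, here is a minimal NumPy sketch of Algorithm 5 (the function name is ours; samples are assumed to be the rows of an $n \times d$ array):

```python
import numpy as np

def threshold_mean(X, k):
    """Algorithm 5: empirical mean followed by hard thresholding to the
    k coordinates of largest magnitude."""
    mu_hat = X.mean(axis=0)                    # empirical mean
    top_k = np.argsort(np.abs(mu_hat))[-k:]    # indices of the k largest-magnitude coordinates
    mu_sparse = np.zeros_like(mu_hat)
    mu_sparse[top_k] = mu_hat[top_k]           # zero out everything outside the top k
    return mu_sparse
```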
To write this in our language, observe that
\[
\textsc{ThresholdMean}(X_1, \ldots, X_n) = \|\widehat{\mu}\|_{U_k}^* \cdot d_{U_k}(\widehat{\mu}),
\quad\text{where } \widehat{\mu} = \frac{1}{n}\sum_{i=1}^n X_i.
\]

$\ell_1$ relaxation

In various scenarios, including recovery of a spiked covariance, one may envision the need to compute $k$-sparse eigenvectors of a matrix $A$, that is, vectors which solve the following non-convex optimization problem:
\[
\max\; v^T A v \quad \text{s.t.}\quad \|v\|_2 = 1,\; \|v\|_0 \le k. \tag{16}
\]
However, this problem is non-convex and cannot be solved efficiently. This motivates the following SDP relaxation of (16). First, one rewrites the problem as
\[
\max\; \operatorname{tr}(AX) \quad \text{s.t.}\quad \operatorname{tr}(X) = 1,\; \|X\|_0 \le k^2,\; X \succeq 0,\; \operatorname{rank}(X) = 1, \tag{17}
\]
where $\|X\|_0$ is the number of non-zeros of $X$. Observe that since $X$ is rank 1, if we let $X = vv^T$ these two problems are indeed equivalent. Then, to form the SDP, one removes the rank constraint and relaxes the $\ell_0$ constraint to an $\ell_1$ constraint:
\[
\max\; \operatorname{tr}(AX) \quad \text{s.t.}\quad \operatorname{tr}(X) = 1,\; \|X\|_1 \le k,\; X \succeq 0. \tag{18}
\]
The work of [dEGJL07] shows that this indeed detects the presence of a spike (but at an information-theoretically suboptimal rate). Finally, by definition, for any PSD matrix $A$, if $X$ is the solution to (18) with input $A$, we have $X = d_{X_k}(A)$.
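For concreteness, the relaxation (18) can be written in a few lines of cvxpy; this sketch is only illustrative (it assumes an SDP-capable solver such as SCS is available, and solver tolerances are left at their defaults):

```python
import cvxpy as cp
import numpy as np

def sparse_pca_sdp(A, k):
    """Solve the SDP relaxation (18):
    maximize tr(A X) subject to tr(X) = 1, sum_ij |X_ij| <= k, X PSD."""
    d = A.shape[0]
    X = cp.Variable((d, d), PSD=True)
    constraints = [cp.trace(X) == 1, cp.sum(cp.abs(X)) <= k]
    cp.Problem(cp.Maximize(cp.trace(A @ X)), constraints).solve()
    return X.value

# A candidate spike can then be read off as the top eigenvector of the returned X.
```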
B.2 Numerical precision

In general, we cannot find closed-form solutions for $d_{X_k}(A)$ in finite time. However, it is well known that we can find these to very high numerical precision in polynomial time. For instance, using the ellipsoid method, we can find an $M'$ so that $\|M' - d_{X_k}(A)\|_\infty \le \varepsilon'$ in time $\mathrm{poly}(d, \log(1/\varepsilon'))$. It is readily verified that if we set $\varepsilon' = \mathrm{poly}(\varepsilon, 1/d)$, then the numerical precision of the answer will not affect any of the calculations we make further on. Thus, for simplicity of exposition, we will assume throughout the paper that given any $A$, we can find $d_{X_k}(A)$ exactly in polynomial time.

C Omitted Proofs from Section 4
Proof of Theorem 4.5.
Fix $n$ as in Theorem 4.5, and let $\delta' = \binom{n}{\varepsilon n}^{-1} \delta$. By convexity of $S_{n,\varepsilon}$ and the objective function, it suffices to show that with probability $1 - \delta$, the following holds:
\[
\forall\, w_I \text{ s.t. } |I| = (1 - \varepsilon)n:\quad \Bigl\| \sum_{i=1}^n (w_I)_i X_i \Bigr\|_{U_k}^* \le \eta.
\]
Condition on the event that
\[
\Bigl\| \frac{1}{n} \sum_{i=1}^n X_i \Bigr\|_{U_k}^* \le \varepsilon. \tag{19}
\]
By Corollary 4.1, this occurs with probability $1 - O(\delta)$.

Fix any $I \subseteq [n]$ so that $|I| = (1 - \varepsilon)n$. By Corollary 4.1 applied to $I^c$, we have that there is some universal constant $C$ so that as long as
\[
\varepsilon n \ge C \cdot \frac{\min(d, k) + \log\binom{d}{k} + \log\binom{n}{\varepsilon n} + \log(1/\delta)}{\alpha^2}, \tag{20}
\]
then with probability $1 - \delta'$,
\[
\Bigl\| \frac{1}{\varepsilon n} \sum_{i \in I^c} X_i \Bigr\|_{U_k}^* \le \alpha. \tag{21}
\]
Since $\log\binom{n}{\varepsilon n} = \Theta(n\varepsilon \log(1/\varepsilon))$, (20) is equivalent to the condition that
\[
n\left(\varepsilon - \frac{C_1\,\varepsilon \log(1/\varepsilon)}{\alpha^2}\right) \ge C \cdot \frac{\min(d, k) + \log\binom{d}{k} + \log(1/\delta)}{\alpha^2}.
\]
Let $\alpha = O(\sqrt{\log(1/\varepsilon)})$. By our choice of $\eta$, we have that $C_1 \varepsilon \log(1/\varepsilon)/\alpha^2 \le \varepsilon/(2C)$, and so, by an appropriate setting of constants, since by our choice of $n$ we have
\[
\varepsilon n \ge C \cdot \frac{\min(d, k) + \log\binom{d}{k} + \log(1/\delta)}{\alpha^2},
\]
we have that (21) holds with probability $1 - \delta'$. Thus, by a union bound over all $\binom{n}{\varepsilon n}$ choices of $I$ with $|I| = (1 - \varepsilon)n$, we have that except with probability $\delta$, (21) holds simultaneously for all $I$ with $|I| = (1 - \varepsilon)n$. The desired result then follows from this, (19), and a union bound.

Proof of Theorem 4.6.
This follows from exactly the same techniques as the proof of Theorem 4.5, by replacing all $U_k$ with $X_k$, and using Theorem 4.2 instead of Corollary 4.1.

D Computational Barriers for sample optimal robust sparse mean estimation
We conjecture that the rate achieved by Theorem 5.1 is tight for computationally efficient algorithms (up to log factors). Intuitively, the major difficulty is that distinguishing between $\mathcal{N}(\mu_1, I)$ and $\mathcal{N}(\mu_2, I)$ given corrupted samples seems to inherently require second moment (or higher) information, for any $\mu_1, \mu_2 \in \mathbb{R}^d$. Certainly first moment information by itself is insufficient. In the sparse setting, this is very problematic, as it inherently asks us to detect a large sparse eigenvector of the empirical covariance. This more or less reduces to the problem solved by (16). This in turn requires us to relax to the problem solved by SDPs for sparse PCA, for which we know $\Omega(k \log d/\varepsilon))$