Compressed Sensing and Matrix Completion with Constant Proportion of Corruptions
Xiaodong Li, Department of Mathematics, Stanford University, Stanford, CA 94305
Abstract
In this paper we improve existing results in the field of compressed sensing and matrix completion when sampled data may be grossly corrupted. We introduce three new theorems. 1) In compressed sensing, we show that if the $m \times n$ sensing matrix has independent Gaussian entries, then one can recover a sparse signal $x$ exactly by tractable $\ell_1$ minimization even if a positive fraction of the measurements are arbitrarily corrupted, provided the number of nonzero entries in $x$ is $O(m/(\log(n/m)+1))$. 2) In the very general sensing model introduced in [7], and assuming a positive fraction of corrupted measurements, exact recovery still holds if the signal now has $O(m/\log^2 n)$ nonzero entries. 3) Finally, we prove that one can recover an $n \times n$ low-rank matrix from $m$ corrupted sampled entries by tractable optimization provided the rank is on the order of $O(m/(n\log^2 n))$; again, this holds when there is a positive fraction of corrupted samples.

Keywords.
Compressed Sensing, Matrix Completion, Robust PCA, Convex Optimization, Restricted Isometry Property, Golfing Scheme.
1 Introduction

Compressed sensing (CS) has been well studied in recent years [9, 19]. This novel theory asserts that a sparse or approximately sparse signal $x \in \mathbb{R}^n$ can be acquired by taking just a few non-adaptive linear measurements. This fact has numerous consequences which are being explored in a number of fields of applied science and engineering. In CS, the acquisition procedure is often represented as $y = Ax$, where $A \in \mathbb{R}^{m \times n}$ is called the sensing matrix and $y \in \mathbb{R}^m$ is the vector of measurements or observations. It is now well established that the solution $\hat{x}$ to the optimization problem
$$\min_{\tilde{x}} \|\tilde{x}\|_1 \quad \text{such that} \quad A\tilde{x} = y \tag{1.1}$$
is guaranteed to be the original signal $x$ with high probability, provided $x$ is sufficiently sparse and $A$ obeys certain conditions. A typical result is this: if $A$ has iid Gaussian entries, then exact recovery occurs provided $\|x\|_0 \le Cm/(\log(n/m)+1)$ [10, 18, 37] for some positive numerical constant $C > 0$. When $A$ is a matrix with rows randomly selected from the DFT matrix, the condition becomes $\|x\|_0 \le Cm/\log n$ [9].

This paper discusses a natural generalization of CS, which we shall refer to as compressed sensing with corruptions. We assume that some entries of the data vector $y$ are totally corrupted, but we have absolutely no idea which entries are unreliable. We still want to recover the original signal efficiently and accurately. Formally, we have the mathematical model
$$y = Ax + f = [A, I]\begin{bmatrix} x \\ f \end{bmatrix}, \tag{1.2}$$
where $x \in \mathbb{R}^n$ and $f \in \mathbb{R}^m$. The number of nonzero coefficients in $x$ is $\|x\|_0$, and similarly for $f$. As in the above model, $A$ is an $m \times n$ sensing matrix, usually sampled from a probability distribution. The problem of recovering $x$ (and hence $f$) from $y$ has recently been studied in the literature in connection with some interesting applications. We discuss a few of them.

• Clipping.
Signal clipping frequently appears because of nonlinearities in the acquisition device [27, 38]. Here, one typically measures $g(Ax)$ rather than $Ax$, where $g$ is a nonlinear map. Letting $f = g(Ax) - Ax$, we thus observe $y = Ax + f$. Nonlinearities usually occur at large amplitudes, so that for those components with small amplitudes we have $f = g(Ax) - Ax = 0$. This means that $f$ is sparse and, therefore, our model is appropriate. Just as before, locating the portion of the data vector that has been clipped may be difficult because of additional noise.

• CS for networked data.
In a sensor network, different sensors collect measurements of the same signal $x$ independently (each measures $z_i = \langle a_i, x\rangle$) and send the outcome to a center hub for analysis [23, 30]. Setting the $a_i^*$ as the rows of $A$, this is just $z = Ax$. However, some sensors will typically fail to send the measurements correctly, and will sometimes report totally meaningless measurements. Therefore, we collect $y = Ax + f$, where $f$ models recording errors.

There have been several theoretical papers investigating exact recovery methods for CS with corruptions [28-30, 38, 40], and all of them consider the following recovery procedure in the noiseless case:
$$\min_{\tilde{x}, \tilde{f}} \|\tilde{x}\|_1 + \lambda(m,n)\|\tilde{f}\|_1 \quad \text{such that} \quad A\tilde{x} + \tilde{f} = [A, I]\begin{bmatrix} \tilde{x} \\ \tilde{f} \end{bmatrix} = y. \tag{1.3}$$
We will compare them with our results in Section 1.4.

Matrix completion (MC) bears some similarity to CS. Here, the goal is to recover a low-rank matrix $L \in \mathbb{R}^{n \times n}$ from a small fraction of linear measurements. For simplicity, we suppose the matrix is square as above (the general case is similar). The standard model is that we observe $P_O(L)$, where $O \subset [n]\times[n] := \{1, \ldots, n\}\times\{1, \ldots, n\}$ and
$$P_O(L)_{ij} = \begin{cases} L_{ij} & \text{if } (i,j) \in O; \\ 0 & \text{otherwise}. \end{cases}$$
The problem is to recover the original matrix $L$, and there have been many papers studying this problem in recent years; see [8, 12, 21, 26, 33], for example. Here one minimizes the nuclear norm (the sum of all the singular values [20]) to recover the original low-rank matrix. We discuss below an improved result due to Gross [21] (with a slight difference).

Define $O \sim \mathrm{Ber}(\rho)$ for some $0 < \rho < 1$ to mean that the indicators $1_{\{(i,j)\in O\}}$ are iid Bernoulli random variables with parameter $\rho$. Then the solution to
$$\min_{\widetilde{L}} \|\widetilde{L}\|_* \quad \text{such that} \quad P_O(\widetilde{L}) = P_O(L) \tag{1.4}$$
is guaranteed to be exactly $L$ with high probability, provided
$$\rho \ge C_\rho\,\frac{\mu r\log^2 n}{n}.$$
Here, $C_\rho$ is a positive numerical constant, $r$ is the rank of $L$, and $\mu$ is an incoherence parameter introduced in [8] which depends only on $L$.

This paper is concerned with the situation in which some entries may have been corrupted. Therefore, our model is that we observe
$$P_O(L) + S, \tag{1.5}$$
where $O$ and $L$ are as before and $S \in \mathbb{R}^{n\times n}$ is supported on $\Omega \subset O$. Just as in CS, this model has broad applicability. For example, Wu et al. used this model in photometric stereo [42]. This problem was also introduced in [4] and is related to recent work on separating a low-rank component from a sparse component [4, 13, 14, 24, 43]. A typical result is that the solution $(\hat{L}, \hat{S})$ to
$$\min_{\widetilde{L}, \widetilde{S}} \|\widetilde{L}\|_* + \lambda(m,n)\|\widetilde{S}\|_1 \quad \text{such that} \quad P_O(\widetilde{L}) + \widetilde{S} = P_O(L) + S \tag{1.6}$$
is guaranteed to be the true pair $(L, S)$ with high probability under some assumptions about $L$, $O$ and $S$ [4, 16]. We will compare them with our result in Section 1.4.
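As a quick numerical illustration of the recovery program (1.3), the $\ell_1/\ell_1$ objective can be cast as a linear program by splitting each variable into its positive and negative parts. The sketch below uses assumed, illustrative dimensions and $\lambda = 1$ (these values are not taken from the theorems), with a Model-1-style Gaussian sensing matrix:

```python
# Sketch: recover (x, f) from corrupted measurements y = A x + f by solving
# the l1/l1 program (1.3) as a linear program. Sizes and lam are illustrative
# assumptions, not values prescribed by the theorems.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
m, n, lam = 80, 120, 1.0
A = rng.standard_normal((m, n)) / np.sqrt(m)   # iid N(0, 1/m) sensing matrix

x = np.zeros(n); x[rng.choice(n, 3, replace=False)] = 3.0 * rng.standard_normal(3)
f = np.zeros(m); f[rng.choice(m, 6, replace=False)] = 10.0 * rng.standard_normal(6)
y = A @ x + f                                   # corrupted observations, model (1.2)

# Split into nonnegative parts: z = [x+, x-, f+, f-] >= 0,
# minimize 1'(x+ + x-) + lam * 1'(f+ + f-)  s.t.  A(x+ - x-) + (f+ - f-) = y.
c = np.concatenate([np.ones(2 * n), lam * np.ones(2 * m)])
A_eq = np.hstack([A, -A, np.eye(m), -np.eye(m)])
res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
x_hat = res.x[:n] - res.x[n:2 * n]
f_hat = res.x[2 * n:2 * n + m] - res.x[2 * n + m:]
print(np.linalg.norm(x_hat - x), np.linalg.norm(f_hat - f))
```

With signal and corruption supports this small relative to $m$, the printed errors should sit at solver-tolerance level; as the sparsity of $x$ or the fraction of corruptions grows, recovery eventually breaks down, in line with the constant-fraction thresholds discussed below.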
This section introduces three models and three corresponding recovery results. The proofs of these results are deferred to Section 2 for Theorem 1.1, Section 3 for Theorem 1.2, and Section 4 for Theorem 1.3.
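The results below all rest on concentration properties of random sensing matrices. As a quick sanity check (with assumed, illustrative sizes), one can verify numerically that a Gaussian matrix with iid $N(0, 1/m)$ entries nearly preserves the norm of sparse vectors, which is the phenomenon formalized in Lemma 2.4 below:

```python
# Sketch: empirical near-isometry of a Gaussian sensing matrix with iid
# N(0, 1/m) entries acting on k-sparse vectors (cf. Lemma 2.4).
# The sizes below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
m, n, k = 200, 1000, 5
A = rng.standard_normal((m, n)) / np.sqrt(m)

ratios = []
for _ in range(100):
    x = np.zeros(n)
    x[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
    ratios.append(np.linalg.norm(A @ x) / np.linalg.norm(x))

print(min(ratios), max(ratios))  # both concentrate near 1
```

The observed ratios cluster tightly around $1$; the deviation shrinks as $m$ grows, at the rate the restricted isometry analysis of Section 2 quantifies.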
Theorem 1.1 Suppose that $A$ is an $m \times n$ ($m < n$) random matrix whose entries are iid Gaussian variables with mean $0$ and variance $1/m$, the signal to acquire is $x \in \mathbb{R}^n$, and our observation is $y = Ax + f + w$, where $f, w \in \mathbb{R}^m$ and $\|w\|_2 \le \epsilon$. Then by choosing $\lambda(n,m) = \frac{1}{\sqrt{\log(n/m)+1}}$, the solution $(\hat{x}, \hat{f})$ to
$$\min_{\tilde{x}, \tilde{f}} \|\tilde{x}\|_1 + \lambda\|\tilde{f}\|_1 \quad \text{such that} \quad \|(A\tilde{x} + \tilde{f}) - y\|_2 \le \epsilon \tag{1.7}$$
satisfies $\|\hat{x} - x\|_2 + \|\hat{f} - f\|_2 \le K\epsilon$ with probability at least $1 - C\exp(-cm)$. This holds universally; that is to say, for all vectors $x$ and $f$ obeying $\|x\|_0 \le \alpha m/(\log(n/m)+1)$ and $\|f\|_0 \le \alpha m$. Here $\alpha$, $C$, $c$ and $K$ are numerical constants.

In the above statement the matrix $A$ is random; everything else is deterministic. The reader will notice that the number of nonzero entries is on the same order as that needed for recovery from clean data [3, 10, 19, 37], while the condition on $f$ implies that one can tolerate a constant fraction of possibly adversarial errors. Moreover, our convex optimization is related to LASSO [35] and Basis Pursuit [15].

We now turn to the second model. In this model, $m < n$ and
$$A = \frac{1}{\sqrt{m}}\begin{bmatrix} a_1^* \\ \vdots \\ a_m^* \end{bmatrix},$$
where $a_1, \ldots, a_m$ are iid copies of a random vector $a$ whose distribution obeys the following two properties: 1) $\mathbb{E}\,aa^* = I$; 2) $\|a\|_\infty \le \sqrt{\mu}$. This model was introduced in [7] and includes a lot of the stochastic models used in the literature. Examples include partial DFT matrices, matrices with iid entries, certain random convolutions [34], and so on.

In this model, we assume that $x$ and $f$ in (1.2) have fixed supports, denoted by $T$ and $B$, with cardinalities $|T| = s$ and $|B| = m_b$. In the remainder of the paper, $x_T$ is the restriction of $x$ to indices in $T$ and $f_B$ is the restriction of $f$ to $B$. Our main assumption here concerns the sign sequences: the sign sequences of $x_T$ and $f_B$ are independent of each other, and each is a sequence of symmetric iid $\pm 1$ random variables.

Theorem 1.2
For the model above, the solution $(\hat{x}, \hat{f})$ to (1.3), with $\lambda(n,m) = 1/\sqrt{\log n}$, is exact with probability at least $1 - Cn^{-1}$, provided that $s \le \alpha\frac{m}{\mu\log^2 n}$ and $m_b \le \beta\frac{m}{\mu}$. Here $C$, $\alpha$ and $\beta$ are numerical constants.

Above, $x$ and $f$ have fixed supports and random signs. However, by a recent de-randomization technique first introduced in [4], exact recovery with random supports and fixed signs would also hold. We will explain this de-randomization technique in the proof of Theorem 1.3. In some specific models, such as independent rows from the DFT matrix, $\mu$ can be a numerical constant, which implies that the proportion of corruptions is also a constant. An open problem is whether Theorem 1.2 still holds in the case where $x$ and $f$ have both fixed supports and fixed signs. Another open problem is whether the result would hold under more general conditions on $A$, as in [6], in the case where $x$ has both random support and random signs.

We emphasize that the sparsity condition $\|x\|_0 \le C\frac{m}{\mu\log^2 n}$ is a little stronger than the optimal result available in the noise-free literature [7, 9], namely $\|x\|_0 \le C\frac{m}{\mu\log n}$. The extra logarithmic factor appears to be important in the proof, as we will explain in Section 3, and a third open problem is whether or not it is possible to remove this factor.

Here we do not give a sensitivity analysis for the recovery procedure as in Model 1. Actually, by applying a method similar to that introduced in [7] to our argument in Section 3, a very good error bound can be obtained in the noisy case. However, there would be little technical novelty in doing so, and it would make the paper very long. We therefore decided to discuss only the noiseless case and to focus on the sampling rate and the corruption ratio.

1.3.3 MC from corrupted entries [Model 3]

We assume $L$ is of rank $r$ and write its reduced SVD as $L = U\Sigma V^*$, where $U, V \in \mathbb{R}^{n\times r}$ and $\Sigma \in \mathbb{R}^{r\times r}$.
Let $\mu$ be the smallest quantity such that for all $1 \le i \le n$,
$$\|UU^* e_i\|_2^2 \le \frac{\mu r}{n}, \qquad \|VV^* e_i\|_2^2 \le \frac{\mu r}{n}, \qquad \text{and} \qquad \|UV^*\|_\infty \le \sqrt{\frac{\mu r}{n^2}}.$$
This model is the same as that originally introduced in [8], and later used in [4, 12, 16, 21, 32]. We observe $P_O(L) + S$, where $O \subset [n]\times[n]$ and $S$ is supported on $\Omega \subset O$. Here we assume that $O$, $\Omega$ and $S$ satisfy the following model:

Model 3.1:
1. Fix an $n \times n$ matrix $K$ whose entries are either $1$ or $-1$.

2. $O \sim \mathrm{Ber}(\rho)$ for a constant $\rho$ satisfying $0 < \rho < 1$. Specifically speaking, the indicators $1_{\{(i,j)\in O\}}$ are iid Bernoulli random variables with parameter $\rho$.

3. Conditioning on $(i,j) \in O$, assume that $\{(i,j) \in \Omega\}$ are independent events with $P\big((i,j)\in\Omega \mid (i,j)\in O\big) = s$. This implies that $\Omega \sim \mathrm{Ber}(\rho s)$.

4. Define $\Gamma := O \setminus \Omega$. Then we have $\Gamma \sim \mathrm{Ber}(\rho(1-s))$.

5. Let $S$ be supported on $\Omega$, with $\mathrm{sgn}(S) := P_\Omega(K)$.

Theorem 1.3
Under Model 3.1, suppose $\rho \ge C_\rho\frac{\mu r\log^2 n}{n}$ and $s \le C_s$. Moreover, suppose $\lambda := \frac{1}{\sqrt{\rho n\log n}}$, and denote by $(\hat{L}, \hat{S})$ the optimal solution to problem (1.6). Then we have $(\hat{L}, \hat{S}) = (L, S)$ with probability at least $1 - Cn^{-3}$ for some numerical constant $C$, provided the numerical constant $C_s$ is sufficiently small and $C_\rho$ is sufficiently large.

In this model, $O$ is available while $\Omega$, $\Gamma$ and $S$ are not known explicitly from the observation $P_O(L) + S$. By the assumption $O \sim \mathrm{Ber}(\rho)$, we can use $|O|/n^2$ to approximate $\rho$. From the proof one can see that $\lambda$ is not required to take exactly this value for exact recovery. The power of our result is that one can recover a low-rank matrix from a nearly minimal number of samples even when a constant proportion of these samples has been corrupted.

We only discuss the noiseless case for this model. Actually, by a method similar to [6], a suboptimal estimation error bound can be obtained by a slight modification of our argument. However, it is of little technical interest and falls short of the optimal result when $n$ is large. There are other suboptimal results for matrix completion with noise, such as [1], but the error bound there is not tight when the additional noise is small. We want to focus on the noiseless case in this paper and leave the noisy problem for future work.

The above values of $\lambda$ are chosen for the theoretical guarantee of exact recovery in Theorems 1.1, 1.2 and 1.3. In practice, $\lambda$ is usually chosen by cross-validation.

1.4 Comparison with existing results

In this section we compare Theorems 1.1, 1.2 and 1.3 with existing results in the literature.

We begin with Model 1. In [40], Wright and Ma discussed a model where the sensing matrix $A$ has independent columns with common mean $\mu$ and normal perturbations with variance $\sigma^2/m$. They chose $\lambda(m,n) = 1$, and proved that $(\hat{x}, \hat{f}) = (x, f)$ with high probability provided $\|x\|_0 \le C(\sigma, n/m)\,m$, $\|f\|_0 \le C(\sigma, n/m)\,m$ and $f$ has random signs. Here $C(\sigma, n/m)$ is much smaller than $C/(\log(n/m)+1)$.
We note that since the authors of [40] considered a different model, motivated by [41], their result may not be directly comparable with ours. However, for our motivation of CS with corruptions, we assume that $A$ has a symmetric distribution and obtain a better sampling rate.

A bit later, Laska et al. [28] and Li et al. [29] also studied this problem. Setting $\lambda(m,n) = 1$, both papers establish that for Gaussian (or sub-Gaussian) sensing matrices $A$, if $m > C(\|x\|_0 + \|f\|_0)\log\big((n+m)/(\|x\|_0+\|f\|_0)\big)$, then the recovery is exact. This follows from the fact that $[A, I]$ obeys a restricted isometry property known to guarantee exact recovery of sparse vectors via $\ell_1$ minimization. Furthermore, the sparsity requirement on $x$ is the same as that found in the standard CS literature, namely $\|x\|_0 \le Cm/(\log(n/m)+1)$. However, the result does not allow a positive fraction of corruptions. For example, if $m = \sqrt{n}$, we have $\|f\|_0/m \le C/\log n$, which goes to zero as $n$ goes to infinity.

As for Model 2, an interesting piece of work [30] (and later [31] on the noisy case) appeared during the preparation of this paper. These papers discuss models in which $A$ is formed by selecting rows from an orthogonal matrix with low incoherence parameter $\mu$, which is the minimum value such that $n|A_{ij}|^2 \le \mu$ for all $i, j$. The main result states that selecting $\lambda = \sqrt{n/(C\mu m\log n)}$ gives exact recovery under the following assumptions: 1) the rows of $A$ are chosen from an orthogonal matrix uniformly at random; 2) the signs of $x$ are independent and equally likely to be either $\pm 1$; 3) the support of $f$ is chosen uniformly at random. (By the de-randomization technique introduced in [4] and used in [30], it would have been sufficient to assume that the signs of $f$ are independent and take the values $\pm 1$ with equal probability.) The conditions are $m \ge C\mu^2\|x\|_0(\log n)^2$ and $\|f\|_0 \le \gamma m$, which are nearly optimal, for the best known sparsity condition when $f = 0$ is $m \ge C\mu\|x\|_0\log n$. In other words, the result is optimal up to an extra factor of $\mu\log n$; the sparsity condition on $f$ is of course nearly optimal.

However, the model for $A$ does not include some models frequently discussed in the literature, such as subsampled tight or continuous frames. Against this background, a recent paper of Candès and Plan [7] considers a very general framework, which includes a lot of common models in the literature. Theorem 1.2 in our paper is similar to Theorem 1 in [30]. It assumes similar sparsity conditions, but is based on the much broader and more applicable model introduced in [7]. Notice that we require $m \ge C\mu\|x\|_0(\log n)^2$ whereas [30] requires $m \ge C\mu^2\|x\|_0(\log n)^2$. Therefore, we improve the condition by a factor of $\mu$, which is always at least $1$ and can be as large as $n$. However, our result imposes $\|f\|_0 \le Cm/\mu$, which is worse than $\|f\|_0 \le \gamma m$ by the same factor. In [30], the parameter $\lambda$ depends upon $\mu$, while our $\lambda$ is only a function of $m$ and $n$. This is why the results differ, and we prefer a value of $\lambda$ that does not depend on $\mu$ because, in some applications, an accurate estimate of $\mu$ may be difficult to obtain. In addition, we use different techniques of proof, in which the clever golfing scheme of [21] is exploited.

Sparse approximation is another underdetermined-linear-system problem, where the dictionary matrix $A$ is always assumed to be deterministic.
Readers interested in this problem (which always requires stronger sparsity conditions) may also want to study the recent paper [38] by Studer et al. There, the authors introduce a more general problem of the form $y = Ax + Bf$, and analyze the performance of $\ell_1$-recovery techniques using ideas which have been popularized under the name of generalized uncertainty principles in the basis pursuit and sparse approximation literature.

As for Model 3, Theorem 1.3 is a significant extension of the results presented in [4], in which the authors have the stringent requirement $\rho = 0.1$. In a very recent and independent work [16], the authors consider a model where both $O$ and $\Omega$ are unions of stochastic and deterministic subsets, while we only assume the stochastic model. We refer interested readers to that paper for the details. However, considering only their results on stochastic $O$ and $\Omega$, a direct comparison shows that the number of samples we need is smaller than that in this reference; the difference is several logarithmic factors. Actually, the requirement on $\rho$ in our paper matches the optimal one in the MC literature even for clean data. Finally, we want to emphasize that the random support assumption is essential in Theorem 1.3 when the rank is large. Examples can be found in [24].

We wish to close our introduction with a few words concerning the techniques of proof we shall use. The proof of Theorem 1.1 is based on the concept of restricted isometry, which is a standard technique in the literature of CS. However, our argument involves a generalization of the restricted isometry concept. The proofs of Theorems 1.2 and 1.3 are based on the golfing scheme, an elegant technique pioneered by David Gross [21], and later used in [4, 7, 32] to construct dual certificates. Our proof leverages results from [4]. However, we contribute novel elements by finding an appropriate way to phrase sufficient optimality conditions which are amenable to the golfing scheme. Details are presented in the following sections.

2 Proof of Theorem 1.1

In the proof of Theorem 1.1 we will use the notation $P_T x$. Here $x$ is a $k$-dimensional vector, $T$ is a subset of $\{1, \ldots, k\}$, and we also use $T$ to denote the subspace of all $k$-dimensional vectors supported on $T$. Then $P_T x$ is the projection of $x$ onto the subspace $T$: it keeps the values of $x$ on the support $T$ and sets the other entries to zero. In this section we use the floor-function notation $\lfloor\cdot\rfloor$ to denote the integer part of a real number.

First we generalize the concept of the restricted isometry property (RIP) [11] for the convenience of proving our theorem:

Definition 2.1
For any matrix $\Phi \in \mathbb{R}^{l\times(n+m)}$, define the RIP constant $\delta_{s_1,s_2}$ as the infimum value of $\delta$ such that
$$(1-\delta)\left(\|x\|_2^2 + \|f\|_2^2\right) \le \left\|\Phi\begin{bmatrix} x \\ f\end{bmatrix}\right\|_2^2 \le (1+\delta)\left(\|x\|_2^2 + \|f\|_2^2\right)$$
holds for any $x \in \mathbb{R}^n$ with $|\mathrm{supp}(x)| \le s_1$ and $f \in \mathbb{R}^m$ with $|\mathrm{supp}(f)| \le s_2$.

Lemma 2.2 For any $x_1, x_2 \in \mathbb{R}^n$ and $f_1, f_2 \in \mathbb{R}^m$ such that $\mathrm{supp}(x_1) \cap \mathrm{supp}(x_2) = \emptyset$, $|\mathrm{supp}(x_1)| + |\mathrm{supp}(x_2)| \le s_1$, and $\mathrm{supp}(f_1) \cap \mathrm{supp}(f_2) = \emptyset$, $|\mathrm{supp}(f_1)| + |\mathrm{supp}(f_2)| \le s_2$, we have
$$\left|\left\langle \Phi\begin{bmatrix} x_1 \\ f_1\end{bmatrix}, \Phi\begin{bmatrix} x_2 \\ f_2\end{bmatrix}\right\rangle\right| \le \delta_{s_1,s_2}\sqrt{\|x_1\|_2^2 + \|f_1\|_2^2}\,\sqrt{\|x_2\|_2^2 + \|f_2\|_2^2}.$$

Proof
First, suppose $\|x_1\|_2^2 + \|f_1\|_2^2 = \|x_2\|_2^2 + \|f_2\|_2^2 = 1$. By the definition of $\delta_{s_1,s_2}$ and the disjointness of the supports, we have
$$2(1-\delta_{s_1,s_2}) \le \left\langle \Phi\begin{bmatrix} x_1+x_2 \\ f_1+f_2\end{bmatrix}, \Phi\begin{bmatrix} x_1+x_2 \\ f_1+f_2\end{bmatrix}\right\rangle \le 2(1+\delta_{s_1,s_2}),$$
and
$$2(1-\delta_{s_1,s_2}) \le \left\langle \Phi\begin{bmatrix} x_1-x_2 \\ f_1-f_2\end{bmatrix}, \Phi\begin{bmatrix} x_1-x_2 \\ f_1-f_2\end{bmatrix}\right\rangle \le 2(1+\delta_{s_1,s_2}).$$
By the polarization identity, the inner product of interest is one quarter of the difference of the two quadratic forms above, so the above inequalities give
$$\left|\left\langle \Phi\begin{bmatrix} x_1 \\ f_1\end{bmatrix}, \Phi\begin{bmatrix} x_2 \\ f_2\end{bmatrix}\right\rangle\right| \le \delta_{s_1,s_2},$$
and hence, by homogeneity,
$$\left|\left\langle \Phi\begin{bmatrix} x_1 \\ f_1\end{bmatrix}, \Phi\begin{bmatrix} x_2 \\ f_2\end{bmatrix}\right\rangle\right| \le \delta_{s_1,s_2}\sqrt{\|x_1\|_2^2+\|f_1\|_2^2}\,\sqrt{\|x_2\|_2^2+\|f_2\|_2^2}$$
without the norm assumption.

Lemma 2.3
Suppose $\Phi \in \mathbb{R}^{l\times(n+m)}$ has RIP constant $\delta_{2s_1,2s_2} < 1/9$ ($s_1, s_2 > 0$), and $\lambda$ lies between $\sqrt{s_1/s_2}$ and $2\sqrt{s_1/s_2}$. Then for any $x \in \mathbb{R}^n$ with $|\mathrm{supp}(x)| \le s_1$, any $f \in \mathbb{R}^m$ with $|\mathrm{supp}(f)| \le s_2$, and any $w \in \mathbb{R}^m$ with $\|w\|_2 \le \epsilon$, the solution $(\hat{x}, \hat{f})$ to the optimization problem (1.7) satisfies
$$\|\hat{x} - x\|_2 + \|\hat{f} - f\|_2 \le \frac{2\sqrt{13 + 13\,\delta_{2s_1,2s_2}}}{1 - 9\,\delta_{2s_1,2s_2}}\,\epsilon.$$

Proof
Set $\Delta x = \hat{x} - x$ and $\Delta f = \hat{f} - f$. Then by (1.7) we have
$$\left\|\Phi\begin{bmatrix}\Delta x \\ \Delta f\end{bmatrix}\right\|_2 \le \|w\|_2 + \left\|\Phi\begin{bmatrix}\hat{x} \\ \hat{f}\end{bmatrix} - \left(\Phi\begin{bmatrix}x \\ f\end{bmatrix} + w\right)\right\|_2 \le 2\epsilon.$$
It is easy to check that the original pair $(x, f)$ satisfies the inequality constraint in (1.7), so we have
$$\|x + \Delta x\|_1 + \lambda\|f + \Delta f\|_1 \le \|x\|_1 + \lambda\|f\|_1. \tag{2.1}$$
It then suffices to show that $\|\Delta x\|_2 + \|\Delta f\|_2 \le \frac{2\sqrt{13+13\,\delta_{2s_1,2s_2}}}{1-9\,\delta_{2s_1,2s_2}}\,\epsilon$.

Choose $T$ with $|T| = s_1$ such that $\mathrm{supp}(x) \subset T$. Write $T^c = T_1 \cup \cdots \cup T_l$, where $|T_1| = \cdots = |T_{l-1}| = s_1$ and $|T_l| \le s_1$; moreover, $T_1$ contains the indices of the $s_1$ largest (in absolute value) coefficients of $P_{T^c}\Delta x$, $T_2$ contains the indices of the $s_1$ largest coefficients of $P_{(T\cup T_1)^c}\Delta x$, and so on. Similarly, define $V$ such that $\mathrm{supp}(f) \subset V$ and $|V| = s_2$, and divide $V^c = V_1 \cup \cdots \cup V_k$ in the same way. By this setup, we easily have
$$\sum_{j\ge 2}\|P_{T_j}\Delta x\|_2 \le s_1^{-1/2}\|P_{T^c}\Delta x\|_1, \tag{2.2}$$
and
$$\sum_{j\ge 2}\|P_{V_j}\Delta f\|_2 \le s_2^{-1/2}\|P_{V^c}\Delta f\|_1. \tag{2.3}$$
On the other hand, by the assumptions $\mathrm{supp}(x) \subset T$ and $\mathrm{supp}(f) \subset V$, we have
$$\|x+\Delta x\|_1 = \|P_T x + P_T\Delta x\|_1 + \|P_{T^c}\Delta x\|_1 \ge \|x\|_1 - \|P_T\Delta x\|_1 + \|P_{T^c}\Delta x\|_1, \tag{2.4}$$
and similarly,
$$\|f+\Delta f\|_1 \ge \|f\|_1 - \|P_V\Delta f\|_1 + \|P_{V^c}\Delta f\|_1. \tag{2.5}$$
By inequalities (2.1), (2.4) and (2.5), we have
$$\|P_{T^c}\Delta x\|_1 + \lambda\|P_{V^c}\Delta f\|_1 \le \|P_T\Delta x\|_1 + \lambda\|P_V\Delta f\|_1. \tag{2.6}$$
Write $a = \|P_{T\cup T_1}\Delta x\|_2$, $b = \|P_{V\cup V_1}\Delta f\|_2$, and $D = \sqrt{a^2+b^2}$. Then
$$\begin{aligned}
\sum_{j\ge2}\|P_{T_j}\Delta x\|_2 + \sum_{j\ge2}\|P_{V_j}\Delta f\|_2
&\le s_1^{-1/2}\|P_{T^c}\Delta x\|_1 + s_2^{-1/2}\|P_{V^c}\Delta f\|_1 && \text{by (2.2) and (2.3)}\\
&\le s_1^{-1/2}\left(\|P_{T^c}\Delta x\|_1 + \lambda\|P_{V^c}\Delta f\|_1\right) && \text{by } \lambda \ge \sqrt{s_1/s_2}\\
&\le s_1^{-1/2}\left(\|P_T\Delta x\|_1 + \lambda\|P_V\Delta f\|_1\right) && \text{by (2.6)}\\
&\le \|P_T\Delta x\|_2 + \lambda\sqrt{s_2/s_1}\,\|P_V\Delta f\|_2 && \text{by the Cauchy-Schwarz inequality}\\
&\le a + 2b && \text{by } \lambda \le 2\sqrt{s_1/s_2}.
\end{aligned}$$
By the definition of $\delta_{2s_1,2s_2}$, the fact that $\big\|\Phi[\Delta x;\Delta f]\big\|_2 \le 2\epsilon$, and Lemma 2.2, we have
$$\begin{aligned}
(1-\delta_{2s_1,2s_2})\,D^2
&\le \left\|\Phi\begin{bmatrix}P_{T\cup T_1}\Delta x \\ P_{V\cup V_1}\Delta f\end{bmatrix}\right\|_2^2\\
&= \left\langle \Phi\begin{bmatrix}P_{T\cup T_1}\Delta x \\ P_{V\cup V_1}\Delta f\end{bmatrix},\ \Phi\begin{bmatrix}\Delta x \\ \Delta f\end{bmatrix} - \sum_{j\ge 2}\Phi\begin{bmatrix}P_{T_j}\Delta x \\ P_{V_j}\Delta f\end{bmatrix}\right\rangle\\
&\le \delta_{2s_1,2s_2}\left(\left\|\begin{bmatrix}P_T\Delta x \\ P_V\Delta f\end{bmatrix}\right\|_2 + \left\|\begin{bmatrix}P_{T_1}\Delta x \\ P_{V_1}\Delta f\end{bmatrix}\right\|_2\right)\left(\sum_{j\ge2}\|P_{T_j}\Delta x\|_2 + \sum_{j\ge2}\|P_{V_j}\Delta f\|_2\right) + 2\epsilon\sqrt{1+\delta_{2s_1,2s_2}}\,D\\
&\le 2\sqrt{5}\,\delta_{2s_1,2s_2}\,D^2 + 2\epsilon\sqrt{1+\delta_{2s_1,2s_2}}\,D,
\end{aligned}$$
where the last step uses the facts that the pair sum is at most $2D$ and that $a + 2b \le \sqrt{5}\,D$. Since $1 + 2\sqrt{5} \le 9$ and $\delta_{2s_1,2s_2} < 1/9$, rearranging gives
$$D \le \frac{2\sqrt{1+\delta_{2s_1,2s_2}}}{1-9\,\delta_{2s_1,2s_2}}\,\epsilon.$$
Finally, since
$$\|\Delta x\|_2 + \|\Delta f\|_2 \le \left(\sum_{j\ge2}\|P_{T_j}\Delta x\|_2 + \sum_{j\ge2}\|P_{V_j}\Delta f\|_2\right) + (a+b) \le (a+2b) + (a+b) = 2a+3b \le \sqrt{13}\,D,$$
we conclude that
$$\|\Delta x\|_2 + \|\Delta f\|_2 \le \frac{2\sqrt{13+13\,\delta_{2s_1,2s_2}}}{1-9\,\delta_{2s_1,2s_2}}\,\epsilon.$$

We now cite a well-known result in the literature of CS, e.g. Theorem 5.2 of [3].
Lemma 2.4
Suppose $A$ is a random matrix as defined in Model 1. Then for any $0 < \delta < 1$, there exist $c_1(\delta), c_2(\delta) > 0$ such that, with probability at least $1 - 2\exp(-c_1(\delta)m)$,
$$(1-\delta)\|x\|_2^2 \le \|Ax\|_2^2 \le (1+\delta)\|x\|_2^2$$
holds universally for any $x$ with $|\mathrm{supp}(x)| \le c_2(\delta)\,\frac{m}{\log(n/m)+1}$.

Also, we cite a well-known result bounding the largest singular value of a random matrix, e.g. [17] and [39].
Lemma 2.5
Let $B$ be an $m\times n$ matrix whose entries are independent standard normal random variables. Then for every $t \ge 0$, with probability at least $1 - e^{-t^2/2}$, one has $\|B\| \le \sqrt{m} + \sqrt{n} + t$.

We now prove Theorem 1.1:
Proof
Suppose $\alpha_1, \delta$ are two constants whose values will be specified later. Set $s_1 = \left\lfloor \alpha_1\frac{m}{\log(n/m)+1}\right\rfloor$ with $\alpha_1 = \alpha/2$, and $s_2 = \lfloor\alpha m\rfloor$. We want to bound the RIP constant $\delta_{2s_1,2s_2}$ of the $m\times(n+m)$ matrix $\Phi = [A, I]$ when $\alpha$ is sufficiently small. For any $T$ with $|T| = 2s_1$ and $V$ with $|V| = 2s_2$, and any $x$ with $\mathrm{supp}(x)\subset T$, any $f$ with $\mathrm{supp}(f)\subset V$, we have
$$\left\|[A, I]\begin{bmatrix}x \\ f\end{bmatrix}\right\|_2^2 = \|Ax+f\|_2^2 = \|Ax\|_2^2 + \|f\|_2^2 + 2\langle P_V A P_T x, f\rangle.$$
By Lemma 2.4, assuming $2\alpha_1 \le c_2(\delta)$, with probability at least $1 - 2\exp(-c_1(\delta)m)$ we have
$$(1-\delta)\|x\|_2^2 \le \|Ax\|_2^2 \le (1+\delta)\|x\|_2^2 \tag{2.7}$$
universally for any such $T$ and $x$.

Now fix $T$ and $V$; we want to bound $\|P_V A P_T\|$. By Lemma 2.5 (applied with $t = \delta\sqrt{m}$, recalling that the entries of $\sqrt{m}A$ are standard normal), we have
$$\|P_V A P_T\| \le \frac{1}{\sqrt{m}}\left(\sqrt{2s_1} + \sqrt{2s_2} + \delta\sqrt{m}\right) \le 2\sqrt{2\alpha} + \delta \tag{2.8}$$
with probability at least $1 - e^{-\delta^2 m/2}$. By the union bound, with probability at least $1 - e^{-\delta^2 m/2}\binom{n}{2s_1}\binom{m}{2s_2}$, inequality (2.8) holds universally for every $V$ with $|V| = 2s_2$ and $T$ with $|T| = 2s_1$.

Since $2s_1 \le 2\alpha_1\frac{m}{\log(n/m)+1}$, we have $2s_1\log\frac{en}{2s_1} \le \alpha_2 m$, where $\alpha_2$ depends only on $\alpha_1$ and $\alpha_2 \to 0$ as $\alpha_1 \to 0$; hence $\binom{n}{2s_1} \le \left(\frac{en}{2s_1}\right)^{2s_1} \le \exp(\alpha_2 m)$. Similarly, because $2s_2 \le 2\alpha m$, we have $2s_2\log\frac{em}{2s_2} \le \alpha_3 m$, where $\alpha_3$ depends only on $\alpha$ and $\alpha_3 \to 0$ as $\alpha \to 0$; hence $\binom{m}{2s_2} \le \left(\frac{em}{2s_2}\right)^{2s_2} \le \exp(\alpha_3 m)$. Therefore, inequality (2.8) holds universally for all such $T$ and $V$ with probability at least $1 - \exp(-(\delta^2/2 - \alpha_2 - \alpha_3)m)$.

Combined with (2.7), we have
$$(1-\delta)\|x\|_2^2 + \|f\|_2^2 - 2(2\sqrt{2\alpha}+\delta)\|x\|_2\|f\|_2 \le \left\|[A,I]\begin{bmatrix}x \\ f\end{bmatrix}\right\|_2^2 \le (1+\delta)\|x\|_2^2 + \|f\|_2^2 + 2(2\sqrt{2\alpha}+\delta)\|x\|_2\|f\|_2$$
universally for all such $T$, $V$, $x$ and $f$, with probability at least $1 - 2\exp(-c_1(\delta)m) - \exp(-(\delta^2/2 - \alpha_2 - \alpha_3)m)$. Since $2\|x\|_2\|f\|_2 \le \|x\|_2^2 + \|f\|_2^2$, this gives $\delta_{2s_1,2s_2} \le 2\delta + 2\sqrt{2\alpha}$; by choosing an appropriate $\delta$ and letting $\alpha$ be sufficiently small, we obtain $\delta_{2s_1,2s_2} < 1/9$ with probability at least $1 - Ce^{-cm}$.

Moreover, under the assumption that $\alpha_1\frac{m}{\log(n/m)+1} \ge 1$, with the choice $\alpha_1 = \alpha/2$ we have
$$\sqrt{s_1/s_2} \le \frac{1}{\sqrt{\log(n/m)+1}} \le 2\sqrt{s_1/s_2},$$
so $\lambda = \frac{1}{\sqrt{\log(n/m)+1}}$ is admissible in Lemma 2.3. Theorem 1.1 then follows as a direct corollary of Lemma 2.3.

3 Proof of Theorem 1.2

In this section we will encounter several absolute constants. Instead of denoting them by $C_1, C_2, \ldots$, we just use $C$; i.e., the value of $C$ may change from line to line. Also, we will use the phrase "with high probability" to mean with probability at least $1 - Cn^{-c}$, where $C > 0$ and $c = 3$, $4$ or $5$ depending on the context.

Here we will use a lot of notation for sub-matrices and sub-vectors. Suppose $A \in \mathbb{R}^{m\times n}$, $P \subset [m] := \{1,\ldots,m\}$, $Q \subset [n]$ and $i \in [n]$. We denote by $A_{P,:}$ the sub-matrix of $A$ with row indices contained in $P$, by $A_{:,Q}$ the sub-matrix of $A$ with column indices contained in $Q$, and by $A_{P,Q}$ the sub-matrix of $A$ with row indices contained in $P$ and column indices contained in $Q$. Moreover, we denote by $A_{P,i}$ the sub-matrix of $A$ with row indices contained in $P$ and column $i$, which is actually a column vector.

The term "vector" means column vector in this section, and all row vectors are written as adjoints of vectors, such as $a^*$ for a vector $a$. Suppose $a$ is a vector and $T$ a subset of indices. Then we denote by $a_T$ the restriction of $a$ to $T$, i.e., a vector consisting of the elements of $a$ with indices in $T$. For any vector $v$, we use $v\{i\}$ to denote the $i$-th element of $v$.

3.1 Supporting lemmas

To prove Theorem 1.2 we need some supporting lemmas. Because our model for the sensing matrix $A$ is the same as in [7], we cite some lemmas from that paper directly.

Lemma 3.1 (Lemma 2.1 of [7]) Suppose $A$ is as defined in Model 2. Let $T \subset [n]$ be a fixed set of cardinality $s$. Then for $\delta > 0$,
$$P\left(\left\|A_{:,T}^* A_{:,T} - I\right\| \ge \delta\right) \le 2s\exp\left(-\frac{m}{\mu s}\cdot\frac{\delta^2}{2 + 2\delta/3}\right).$$
In particular, $\|A_{:,T}^* A_{:,T} - I\| \le 1/2$ with high probability provided $s \le \gamma\frac{m}{\mu\log n}$, and $\|A_{:,T}^* A_{:,T} - I\| \le \frac{1}{2\sqrt{\log n}}$ with high probability provided $s \le \gamma\frac{m}{\mu\log^2 n}$, where $\gamma$ is some absolute constant.
This lemma was proved in [7] via the matrix Bernstein inequality, which was first introduced in [2]. A deep generalization is given in [25].
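For the reader's convenience, one standard form of the matrix Bernstein inequality (the exact constants vary slightly across references) reads:

```latex
% One standard form of the matrix Bernstein inequality.
% Let $X_1,\ldots,X_N$ be independent random self-adjoint $d\times d$ matrices
% with $\mathbb{E}X_k = 0$ and $\|X_k\| \le R$ almost surely, and set
% $\sigma^2 := \bigl\|\sum_{k=1}^N \mathbb{E}X_k^2\bigr\|$. Then for all $t \ge 0$,
\[
  \mathbb{P}\left( \Bigl\| \sum_{k=1}^N X_k \Bigr\| \ge t \right)
  \;\le\; 2d \exp\!\left( - \frac{t^2/2}{\sigma^2 + Rt/3} \right).
\]
```

Applying this with $X_k = \frac{1}{m}\left(a_k a_k^* - I\right)$ restricted to $T$ is what produces the tail bound of Lemma 3.1.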
Lemma 3.2 (Lemma 2.4 of [7]) Suppose $A$ is as defined in Model 2. Fix $T \subset [n]$ with $|T| = s$ and $v \in \mathbb{R}^s$. Then $\|A_{:,T^c}^* A_{:,T} v\|_\infty \le \frac{1}{\sqrt{s}}\|v\|_2$ with high probability provided $s \le \gamma\frac{m}{\mu\log n}$, where $\gamma$ is some absolute constant.

Lemma 3.3 (Lemma 2.5 of [7]) Suppose $A$ is as defined in Model 2. Fix $T \subset [n]$ with $|T| = s$. Then $\max_{i\in T^c}\|A_{:,T}^* A_{:,i}\|_2 \le 1$ with high probability provided $s \le \gamma\frac{m}{\mu\log n}$, where $\gamma$ is some absolute constant.

In this part we give a complete proof of Theorem 1.2 using a powerful technique called the "golfing scheme", introduced by David Gross in [21] and later used in [4] and [7]. Under the assumptions of Model 2, we additionally assume $s \le \alpha\frac{m}{\mu\log^2 n}$ and $m_b \le \beta\frac{m}{\mu}$, where $\alpha$ and $\beta$ are numerical constants whose values will be specified later.

First we give two useful inequalities. Replacing $A$ with $\sqrt{\frac{m}{m-m_b}}\,A_{B^c,T}$ in Lemma 3.1 and Lemma 3.3, we have
$$\left\|\frac{m}{m-m_b}A_{B^c,T}^* A_{B^c,T} - I\right\| \le \frac12, \tag{3.1}$$
and
$$\max_{i\in T^c}\left\|\frac{m}{m-m_b}A_{B^c,T}^* A_{B^c,i}\right\|_2 \le 1, \tag{3.2}$$
both with high probability provided $s \le \gamma\frac{m-m_b}{\mu\log n}$. Since $s \le \alpha\frac{m}{\mu\log^2 n}$ and $m_b \le \beta\frac{m}{\mu}$, both (3.1) and (3.2) hold with high probability provided $\alpha$ and $\beta$ are sufficiently small. We assume (3.1) and (3.2) hold throughout this section.

First we prove that the solution $(\hat{x}, \hat{f})$ of (1.3) equals $(x, f)$ if we can find an appropriate dual vector $q_{B^c}$ satisfying the following requirement. This is actually an "inexact dual vector" for the optimization problem (1.3). The idea was first given explicitly in [22] and [21], and is related to [5]. We give a result similar to [7].

Lemma 3.4 (Inexact Duality) Suppose there exists a vector $q_{B^c} \in \mathbb{R}^{m-m_b}$ satisfying
$$\|v_T - \mathrm{sgn}(x_T)\|_2 \le \frac{\lambda}{4}, \qquad \|v_{T^c}\|_\infty \le \frac14, \qquad \text{and} \qquad \|q_{B^c}\|_\infty \le \frac{\lambda}{4}, \tag{3.3}$$
where
$$v = A_{B^c,:}^*\,q_{B^c} + \lambda\,A_{B,:}^*\,\mathrm{sgn}(f_B). \tag{3.4}$$
Then the solution $(\hat{x}, \hat{f})$ of (1.3) equals $(x, f)$, provided $\beta$ is sufficiently small and $\lambda < 1$.

Proof
Set $h = \hat{x} - x$. Since $x_{T^c} = 0$, we have
$$h_{T^c} = \hat{x}_{T^c}. \qquad (3.5)$$
Since $f_{B^c} = 0$ and $Ax + f = A\hat{x} + \hat{f}$, we have $Ah = f - \hat{f}$ and
$$A_{B^c,:} h = (f - \hat{f})_{B^c} = -\hat{f}_{B^c}. \qquad (3.6)$$
Then we have the following inequality:
$$\begin{aligned}
\|\hat{x}\|_1 + \lambda \|\hat{f}\|_1 &= \langle \hat{x}_T, \operatorname{sgn}(\hat{x}_T)\rangle + \|\hat{x}_{T^c}\|_1 + \lambda \big( \langle \hat{f}_B, \operatorname{sgn}(\hat{f}_B)\rangle + \|\hat{f}_{B^c}\|_1 \big) \\
&\ge \langle \hat{x}_T, \operatorname{sgn}(x_T)\rangle + \|\hat{x}_{T^c}\|_1 + \lambda \big( \langle \hat{f}_B, \operatorname{sgn}(f_B)\rangle + \|\hat{f}_{B^c}\|_1 \big) \\
&= \langle x_T + h_T, \operatorname{sgn}(x_T)\rangle + \|h_{T^c}\|_1 + \lambda \big( \langle f_B - A_{B,:}h, \operatorname{sgn}(f_B)\rangle + \|A_{B^c,:}h\|_1 \big) \quad \text{by (3.5), (3.6)} \\
&= \|x\|_1 + \lambda \|f\|_1 + \|h_{T^c}\|_1 + \lambda \|A_{B^c,:}h\|_1 + \langle h_T, \operatorname{sgn}(x_T)\rangle - \lambda \langle A_{B,:}h, \operatorname{sgn}(f_B)\rangle.
\end{aligned}$$
Since $\|\hat{x}\|_1 + \lambda \|\hat{f}\|_1 \le \|x\|_1 + \lambda \|f\|_1$, we have
$$\|h_{T^c}\|_1 + \lambda \|A_{B^c,:}h\|_1 + \langle h_T, \operatorname{sgn}(x_T)\rangle - \lambda \langle A_{B,:}h, \operatorname{sgn}(f_B)\rangle \le 0. \qquad (3.7)$$
By (3.4), we have
$$\langle h_T, v_T\rangle + \langle h_{T^c}, v_{T^c}\rangle = \langle h, v\rangle = \langle h, A_{B^c,:}^* q_{B^c} + \lambda A_{B,:}^* \operatorname{sgn}(f_B)\rangle = \langle A_{B^c,:}h, q_{B^c}\rangle + \lambda \langle A_{B,:}h, \operatorname{sgn}(f_B)\rangle,$$
and then by (3.3),
$$\begin{aligned}
\langle h_T, \operatorname{sgn}(x_T)\rangle - \lambda \langle A_{B,:}h, \operatorname{sgn}(f_B)\rangle &= \langle h_T, \operatorname{sgn}(x_T) - v_T\rangle + \langle A_{B^c,:}h, q_{B^c}\rangle - \langle h_{T^c}, v_{T^c}\rangle \\
&\ge -\frac{\lambda}{4}\|h_T\| - \frac{\lambda}{4}\|A_{B^c,:}h\|_1 - \frac{1}{4}\|h_{T^c}\|_1.
\end{aligned}$$
Combining this with (3.7), we have
$$-\frac{\lambda}{4}\|h_T\| + \frac{3}{4}\lambda \|A_{B^c,:}h\|_1 + \frac{3}{4}\|h_{T^c}\|_1 \le 0. \qquad (3.8)$$
By (3.1), we have $\big\| \sqrt{\frac{m}{m-m_b}}\, A_{B^c,T} \big\| \le \sqrt{3/2}$, and the smallest singular value of $\frac{m}{m-m_b} A_{B^c,T}^* A_{B^c,T}$ is at least $1/2$. Therefore,
$$\begin{aligned}
\|h_T\| &\le 2 \Big\| \frac{m}{m-m_b} A_{B^c,T}^* A_{B^c,T} h_T \Big\| \\
&\le 2 \Big( \Big\| \frac{m}{m-m_b} A_{B^c,T}^* A_{B^c,T^c} h_{T^c} \Big\| + \Big\| \frac{m}{m-m_b} A_{B^c,T}^* A_{B^c,:} h \Big\| \Big) \\
&\le 2 \Big\| \frac{m}{m-m_b} A_{B^c,T}^* A_{B^c,T^c} h_{T^c} \Big\| + 2\sqrt{\frac{3}{2}} \Big\| \sqrt{\frac{m}{m-m_b}}\, A_{B^c,:} h \Big\| \\
&\le 2 \sum_{i \in T^c} \Big\| \frac{m}{m-m_b} A_{B^c,T}^* A_{B^c,i} \Big\| \, |h_{\{i\}}| + 2\sqrt{\frac{3}{2}} \Big\| \sqrt{\frac{m}{m-m_b}}\, A_{B^c,:} h \Big\| \quad \text{by the triangle inequality} \\
&\le \frac{1}{10\sqrt{s}} \|h_{T^c}\|_1 + 2\sqrt{\frac{3}{2}} \Big\| \sqrt{\frac{m}{m-m_b}}\, A_{B^c,:} h \Big\| \quad \text{by (3.2)}.
\end{aligned}$$
Plugging this into (3.8), and using $\frac{1}{10\sqrt{s}} \le 1$ together with $\|A_{B^c,:}h\| \le \|A_{B^c,:}h\|_1$, we have
$$\Big( \frac{3}{4} - \frac{\lambda}{4} \Big) \|h_{T^c}\|_1 + \Big( \frac{3}{4} - \frac{1}{2}\sqrt{\frac{3}{2}} \sqrt{\frac{m}{m-m_b}} \Big) \lambda \|A_{B^c,:}h\|_1 \le 0.$$
We know $\frac{3}{4} - \frac{1}{2}\sqrt{\frac{3}{2}}\sqrt{\frac{m}{m-m_b}} > 0$ provided $\beta$ is sufficiently small, and $\frac{3}{4} - \frac{\lambda}{4} > 0$ by the assumption $\lambda < 1$. Hence $h_{T^c} = 0$ and $A_{B^c,:}h = 0$. Since $A_{B^c,:}h = A_{B^c,T}h_T + A_{B^c,T^c}h_{T^c}$, we have $A_{B^c,T}h_T = 0$. The inequality (3.1) implies that $A_{B^c,T}$ is injective, so $h_T = 0$ and $h = h_T + h_{T^c} = 0$, which implies $(\hat{x}, \hat{f}) = (x, f)$.

Now let us construct a vector $q_{B^c}$ satisfying the requirement (3.3) by choosing an appropriate $\lambda$.

Proof (of Theorem 1.2) Set $\lambda = 1/\sqrt{\log n}$. It suffices to construct a $q_{B^c}$ satisfying (3.3). Denoting $u = A_{B^c,:}^* q_{B^c}$, we only need to construct a $q_{B^c}$ satisfying
$$\|u_T + \lambda A_{B,T}^* \operatorname{sgn}(f_B) - \operatorname{sgn}(x_T)\| \le \frac{\lambda}{4}, \quad \|u_{T^c}\|_\infty \le \frac{1}{8}, \quad \|\lambda A_{B,:}^* \operatorname{sgn}(f_B)\|_\infty \le \frac{1}{8}, \quad \|q_{B^c}\|_\infty \le \frac{\lambda}{4}.$$
Now let us construct our $q_{B^c}$ by the golfing scheme. First we have to write $A_{B^c,:}$ as a block matrix. We divide $B^c$ into $l = \lfloor 2\log_2 n \rfloor + 1$ disjoint subsets: $B^c = G_1 \cup \dots \cup G_l$, where $|G_i| = m_i$. Then $\sum_{i=1}^l m_i = m - m_b$ and
$$A_{B^c,:} = \begin{bmatrix} A_{G_1,:} \\ \vdots \\ A_{G_l,:} \end{bmatrix}.$$
We want to mention that the partition of $B^c$ is deterministic, not depending on $A$, so $A_{G_1,:}, \dots, A_{G_l,:}$ are independent. Noticing $m_b \le \beta \frac{m}{\mu} \le \beta m$, by letting $\beta$ be sufficiently small we can require
$$\frac{m}{m_1} \le C, \quad \frac{m}{m_2} \le C, \quad \frac{m}{m_k} \le C \log n \ \text{ for } k = 3, \dots, l,$$
for some absolute constant $C$. Since $s \le \alpha \frac{m}{\mu \log^2 n}$, we have
$$s \le \alpha C \frac{m_1}{\mu \log^2 n}, \quad s \le \alpha C \frac{m_2}{\mu \log^2 n}, \quad s \le \alpha C \frac{m_k}{\mu \log n} \ \text{ for } k = 3, \dots, l. \qquad (3.9)$$
Then by Lemma 3.1, replacing $A$ with $\sqrt{\frac{m}{m_j}}\, A_{G_j,T}$, we have the following inequalities:
$$\Big\| \frac{m}{m_j} A_{G_j,T}^* A_{G_j,T} - I \Big\| \le \frac{1}{2\sqrt{\log n}} \ \text{ for } j = 1, 2; \qquad (3.10)$$
$$\Big\| \frac{m}{m_j} A_{G_j,T}^* A_{G_j,T} - I \Big\| \le \frac{1}{2} \ \text{ for } j = 3, \dots, l; \qquad (3.11)$$
with high probability provided $\alpha$ is sufficiently small.

Now let us give an explicit construction of $q_{B^c}$. Define
$$p_0 = \operatorname{sgn}(x_T) - \lambda A_{B,T}^* \operatorname{sgn}(f_B) \qquad (3.12)$$
and
$$p_i = \Big( I - \frac{m}{m_i} A_{G_i,T}^* A_{G_i,T} \Big) p_{i-1} = \Big( I - \frac{m}{m_i} A_{G_i,T}^* A_{G_i,T} \Big) \cdots \Big( I - \frac{m}{m_1} A_{G_1,T}^* A_{G_1,T} \Big) p_0 \qquad (3.13)$$
for $i = 1, \dots, l$, and construct
$$q_{B^c} = \begin{bmatrix} \frac{m}{m_1} A_{G_1,T}\, p_0 \\ \vdots \\ \frac{m}{m_l} A_{G_l,T}\, p_{l-1} \end{bmatrix}. \qquad (3.14)$$
Then by $u = A_{B^c,:}^* q_{B^c}$, we have
$$u = \sum_{i=1}^l \frac{m}{m_i} A_{G_i,:}^* A_{G_i,T}\, p_{i-1}. \qquad (3.15)$$
We now bound the $\ell_2$ norms of the $p_i$. Actually, by (3.10), (3.11) and (3.13), we have
$$\|p_1\| \le \frac{1}{2\sqrt{\log n}} \|p_0\|, \qquad (3.16)$$
$$\|p_2\| \le \frac{1}{4\log n} \|p_0\|, \qquad (3.17)$$
$$\|p_j\| \le \frac{1}{4\log n} \Big( \frac{1}{2} \Big)^{j-2} \|p_0\| \ \text{ for } j = 3, \dots, l. \qquad (3.18)$$
Now we will prove that our constructed $q_{B^c}$ satisfies the desired requirements.

The proof of $\|\lambda A_{B,:}^* \operatorname{sgn}(f_B)\|_\infty \le 1/8$. By Hoeffding's inequality, for any $i = 1, \dots, n$, we have
$$\mathbb{P}\big( |A_{B,i}^* \operatorname{sgn}(f_B)| \ge t \big) \le 2 \exp\Big( -\frac{t^2}{2\|A_{B,i}\|^2} \Big).$$
By choosing $t = C\sqrt{\log n}\,\|A_{B,i}\|$ ($C$ is some absolute constant), with high probability we have
$$|\lambda A_{B,i}^* \operatorname{sgn}(f_B)| \le \lambda C \sqrt{\log n}\, \|A_{B,i}\| \le C\sqrt{\frac{\mu m_b}{m}} \le C\sqrt{\beta} \le \frac{1}{8},$$
provided $\beta$ is sufficiently small, and this implies $\|\lambda A_{B,:}^* \operatorname{sgn}(f_B)\|_\infty \le \frac{1}{8}$.

The proof of $\|u_T + \lambda A_{B,T}^* \operatorname{sgn}(f_B) - \operatorname{sgn}(x_T)\| \le \lambda/4$. By (3.15) and (3.13), we have
$$u_T = \sum_{i=1}^l \frac{m}{m_i} A_{G_i,T}^* A_{G_i,T}\, p_{i-1} = \sum_{i=1}^l (p_{i-1} - p_i) = p_0 - p_l.$$
Then by (3.12) we have $\|u_T + \lambda A_{B,T}^* \operatorname{sgn}(f_B) - \operatorname{sgn}(x_T)\| = \|u_T - p_0\| = \|p_l\|$. Since $\|\lambda A_{B,:}^* \operatorname{sgn}(f_B)\|_\infty \le 1/8$, we have $\|\lambda A_{B,T}^* \operatorname{sgn}(f_B)\| \le \frac{1}{8}\sqrt{s}$, which implies
$$\|p_0\| = \|\lambda A_{B,T}^* \operatorname{sgn}(f_B) - \operatorname{sgn}(x_T)\| \le \frac{9}{8}\sqrt{s}. \qquad (3.19)$$
Then by (3.18) and $l = \lfloor 2\log_2 n \rfloor + 1$, we have
$$\|p_l\| \le \frac{1}{4\log n} \Big( \frac{1}{2} \Big)^{l-2} \cdot \frac{9}{8}\sqrt{s} \le \frac{C}{n^2 \log n} \sqrt{\frac{\alpha m}{\mu \log^2 n}} \le \frac{1}{4\sqrt{\log n}} = \frac{\lambda}{4},$$
provided $\alpha$ is sufficiently small.

The proof of $\|u_{T^c}\|_\infty \le 1/8$. We have
$$u_{T^c} = \sum_{i=1}^l \frac{m}{m_i} A_{G_i,T^c}^* A_{G_i,T}\, p_{i-1}.$$
Recall that $A_{G_1,:}, \dots, A_{G_l,:}$ are independent, so by the construction of $p_{i-1}$ we know $A_{G_i,:}$ and $p_{i-1}$ are independent.
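The golfing construction (3.13)–(3.14) is easy to simulate: each block contracts the residual $p_i$, and the telescoping identity $u_T = p_0 - p_l$ holds exactly. A minimal sketch (Python with NumPy; the Gaussian rows, equal-size blocks, and all sizes are illustrative assumptions, not those of the proof):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, s, l = 2048, 4096, 5, 13            # illustrative sizes; l blocks
A = rng.standard_normal((m, n)) / np.sqrt(m)
T = np.arange(s)
blocks = np.array_split(np.arange(m), l)  # a deterministic partition G_1,...,G_l
p0 = np.sign(rng.standard_normal(s))      # stand-in for p_0 in (3.12)
p = p0.copy()
q = np.zeros(m)
for G in blocks:
    AG = A[np.ix_(G, T)]
    q[G] = (m / len(G)) * (AG @ p)                  # one block of q, as in (3.14)
    p = p - (m / len(G)) * (AG.T @ (AG @ p))        # p_i, as in (3.13)
u_T = A[:, T].T @ q
# telescoping gives u_T = p_0 - p_l, and ||p_l|| shrinks geometrically
```

The geometric decay of $\|p_i\|$ is exactly what (3.16)–(3.18) quantify, and the exact identity $u_T = p_0 - p_l$ is the telescoping step used in the proof above.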
Replacing $A$ with $\sqrt{\frac{m}{m_i}}\, A_{G_i,:}$ in Lemma 3.2, and by the sparsity condition (3.9), we have
$$\sum_{i=1}^l \Big\| \frac{m}{m_i} A_{G_i,T^c}^* A_{G_i,T}\, p_{i-1} \Big\|_\infty \le \sum_{i=1}^l \frac{1}{20\sqrt{s}} \|p_{i-1}\|$$
with high probability, provided $\alpha$ is sufficiently small. By (3.16), (3.17), (3.18) and (3.19), we have $\sum_{i=1}^l \|p_{i-1}\| \le 2\|p_0\|$, and hence
$$\|u_{T^c}\|_\infty \le \sum_{i=1}^l \frac{1}{20\sqrt{s}} \|p_{i-1}\| \le \frac{1}{20\sqrt{s}} \cdot 2\|p_0\| \le \frac{1}{20\sqrt{s}} \cdot \frac{9}{4}\sqrt{s} < \frac{1}{8}.$$
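This bound and the remaining one both rest on a scalar concentration fact: for a fixed vector $w$ and an independent vector of iid signs, the inner product exceeds $C\sqrt{\log n}\,\|w\|$ only with polynomially small probability. A quick simulation of this fact (Python with NumPy; all sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
s, n, trials = 64, 4096, 2000                    # illustrative sizes
w = rng.standard_normal(s)
w /= np.linalg.norm(w)                           # a fixed unit vector
eps = rng.choice([-1.0, 1.0], size=(trials, s))  # fresh iid sign vectors
z = eps @ w                                      # realizations of <w, sgn-vector>
frac = np.mean(np.abs(z) > 3.0 * np.sqrt(np.log(n)))
```

By Hoeffding's inequality the exceedance probability at threshold $3\sqrt{\log n}$ is at most $2n^{-4.5}$, so essentially no trial crosses it.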
The proof of $\|q_{B^c}\|_\infty \le \lambda/4$. For $k = 1, \dots, l$, we denote
$$A_{G_k,:} = \frac{1}{\sqrt{m}} \begin{bmatrix} a_{k1}^* \\ \vdots \\ a_{km_k}^* \end{bmatrix}, \qquad A_{B,:} = \frac{1}{\sqrt{m}} \begin{bmatrix} \tilde{a}_1^* \\ \vdots \\ \tilde{a}_{m_b}^* \end{bmatrix}.$$
By (3.13), (3.14) and (3.12), it suffices to show that for any $1 \le k \le l$ and $1 \le j \le m_k$,
$$\Big| \frac{\sqrt{m}}{m_k} (a_{kj})_T^* \Big( I - \frac{m}{m_{k-1}} A_{G_{k-1},T}^* A_{G_{k-1},T} \Big) \cdots \Big( I - \frac{m}{m_1} A_{G_1,T}^* A_{G_1,T} \Big) \big( \operatorname{sgn}(x_T) - \lambda A_{B,T}^* \operatorname{sgn}(f_B) \big) \Big| \le \frac{\lambda}{4}.$$
Set
$$w = \Big( I - \frac{m}{m_1} A_{G_1,T}^* A_{G_1,T} \Big) \cdots \Big( I - \frac{m}{m_{k-1}} A_{G_{k-1},T}^* A_{G_{k-1},T} \Big) (a_{kj})_T. \qquad (3.20)$$
Then it suffices to prove
$$\Big| \frac{\sqrt{m}}{m_k} w^* \big( \operatorname{sgn}(x_T) - \lambda A_{B,T}^* \operatorname{sgn}(f_B) \big) \Big| \le \frac{\lambda}{4}.$$
Since $w$ and $\operatorname{sgn}(x_T)$ are independent, by Hoeffding's inequality and conditioning on $w$, we have $\mathbb{P}(|w^* \operatorname{sgn}(x_T)| \ge t) \le 2\exp\big( -\frac{t^2}{2\|w\|^2} \big)$ for any $t > 0$. Then with high probability we have
$$|w^* \operatorname{sgn}(x_T)| \le C\sqrt{\log n}\, \|w\| \qquad (3.21)$$
for some absolute constant $C$.

Setting $z = \operatorname{sgn}(f_B)$, we have $w^* A_{B,T}^* \operatorname{sgn}(f_B) = \frac{1}{\sqrt{m}} \sum_{i=1}^{m_b} [(\tilde{a}_i)_T^* w]\, z_{\{i\}}$. Since $w$, $A_{B,T}$ and $z$ are independent, conditioning on $w$ we have
$$\mathbb{E}\big\{ [(\tilde{a}_i)_T^* w]\, z_{\{i\}} \big\} = \mathbb{E}\{(\tilde{a}_i)_T^* w\}\, \mathbb{E}\{z_{\{i\}}\} = 0,$$
$$\big| [(\tilde{a}_i)_T^* w]\, z_{\{i\}} \big| \le \|w\| \|(\tilde{a}_i)_T\| \le \sqrt{s\mu}\, \|w\| \le \sqrt{\frac{\alpha m}{\log^2 n}}\, \|w\|,$$
and
$$\mathbb{E}\big\{ \big| [(\tilde{a}_i)_T^* w]\, z_{\{i\}} \big|^2 \big\} = \mathbb{E}\big\{ [w^* (\tilde{a}_i)_T][(\tilde{a}_i)_T^* w] \big\} = w^* \mathbb{E}\{ (\tilde{a}_i)_T (\tilde{a}_i)_T^* \} w = \|w\|^2.$$
By Bernstein's inequality, we have
$$\mathbb{P}\Big( \big| w^* A_{B,T}^* \operatorname{sgn}(f_B) \big| \ge \frac{t}{\sqrt{m}} \Big) \le 2\exp\Big( -\frac{t^2/2}{m_b \|w\|^2 + \sqrt{\frac{\alpha m}{\log^2 n}}\, \|w\|\, t/3} \Big).$$
By choosing some numerical constant $C$ and $t = C\sqrt{m \log n}\, \|w\|$, we have
$$\big| w^* A_{B,T}^* \operatorname{sgn}(f_B) \big| \le C\sqrt{\log n}\, \|w\| \qquad (3.22)$$
with high probability, provided $\alpha$ is sufficiently small. By (3.21) and (3.22), we have
$$\Big| \frac{\sqrt{m}}{m_k} w^* \big( \operatorname{sgn}(x_T) - \lambda A_{B,T}^* \operatorname{sgn}(f_B) \big) \Big| \le \frac{\sqrt{m}}{m_k}\, C\sqrt{\log n}\, \|w\|, \qquad (3.23)$$
for some numerical constant $C$.

When $k \ge 3$, by (3.20), (3.10) and (3.11), we have
$$\|w\| \le \Big( \frac{1}{2} \Big)^{k-3} \frac{1}{4\log n} \sqrt{\mu s} \le \Big( \frac{1}{2} \Big)^{k-3} \frac{1}{4\log n} \sqrt{\frac{\alpha m}{\log^2 n}}.$$
Recalling $\frac{m}{m_k} \le C \log n$, by (3.23) we have
$$\Big| \frac{\sqrt{m}}{m_k} w^* \big( \operatorname{sgn}(x_T) - \lambda A_{B,T}^* \operatorname{sgn}(f_B) \big) \Big| \le C \Big( \frac{1}{2} \Big)^{k-3} \sqrt{\alpha}\, (\log n)^{-1/2} \le \frac{\lambda}{4},$$
provided $\alpha$ is sufficiently small.

When $k \le 2$, by (3.20) and (3.10), we have $\|w\| \le \sqrt{\mu s} \le \sqrt{\frac{\alpha m}{\log^2 n}}$. Recalling $\frac{m}{m_k} \le C$, by (3.23) we have
$$\Big| \frac{\sqrt{m}}{m_k} w^* \big( \operatorname{sgn}(x_T) - \lambda A_{B,T}^* \operatorname{sgn}(f_B) \big) \Big| \le C\sqrt{\alpha}\, (\log n)^{-1/2} \le \frac{\lambda}{4},$$
provided $\alpha$ is sufficiently small.

Here we would like to compare our golfing scheme with that in [7]. There are mainly two differences. First, we have an extra term $\lambda A_{B,:}^* \operatorname{sgn}(f_B)$ in the dual vector. To obtain the inequality $\|v_{T^c}\|_\infty \le 1/4$, we propose to bound $\|u_{T^c}\|_\infty$ and $\|\lambda A_{B,:}^* \operatorname{sgn}(f_B)\|_\infty$ separately, and this leads to the extra log factor compared with [7]. Moreover, by using the golfing scheme to construct the dual vector, we need to bound the term $\|q_{B^c}\|_\infty$, which is not necessary in [7]. This inevitably incurs the random-signs assumption on the signal.

4 Proof of Theorem 1.3
In this section, capital letters $X$, $Y$, etc. represent matrices, and symbols in script font $\mathcal{I}$, $\mathcal{P}_T$, etc. represent linear operators from a matrix space to a matrix space. Moreover, for any $\Omega \subset [n] \times [n]$, $\mathcal{P}_\Omega M$ keeps the entries of $M$ on the support $\Omega$ and changes the other entries to zero. For any $n \times n$ matrix $A$, denote by $\|A\|_F$, $\|A\|$, $\|A\|_\infty$ and $\|A\|_*$ respectively the Frobenius norm, the operator norm (the largest singular value), the largest magnitude among all entries, and the nuclear norm (the sum of all singular values).

Similarly to Section 3, instead of denoting constants as $C_1, C_2, \dots$, we just use $C$, whose value may change from line to line. Also, we will use the phrase "with high probability" to mean with probability at least $1 - Cn^{-c}$, where $C > 0$ and $c = 3$, $4$, or $5$ depending on the context.

Model 3.1 is natural and was used in [4], but we will use the following equivalent model for the convenience of the proof:
Model 3.2:
1. Fix an $n \times n$ matrix $K$ whose entries are either $1$ or $-1$.

2. Sample two independent random subsets of $[n] \times [n]$: $\Gamma' \sim \mathrm{Ber}((1-s)\rho)$ and $\Omega' \sim \mathrm{Ber}\big( \frac{2s\rho}{1-\rho+2s\rho} \big)$. Moreover, let $O := \Gamma' \cup \Omega'$, which thus satisfies $O \sim \mathrm{Ber}(\rho)$.

3. Define an $n \times n$ random matrix $W$ with independent entries $W_{ij}$ satisfying $\mathbb{P}(W_{ij} = 1) = \mathbb{P}(W_{ij} = -1) = \frac{1}{2}$.

4. Define $\Omega'' \subset \Omega'$: $\Omega'' := \{(i,j) : (i,j) \in \Omega',\ W_{ij} = K_{ij}\}$.

5. Define $\Omega := \Omega'' \setminus \Gamma'$, and $\Gamma := O \setminus \Omega$.

6. Let $S$ satisfy $\operatorname{sgn}(S) := \mathcal{P}_\Omega(K)$.

Obviously, in both Model 3.1 and Model 3.2 the whole setting is deterministic once we fix $(O, \Omega)$. Therefore, the probability of $(\hat{L}, \hat{S}) = (L, S)$ is determined by the joint distribution of $(O, \Omega)$. It is not difficult to prove that the joint distributions of $(O, \Omega)$ in the two models are the same. Indeed, in Model 3.1, the pairs $(1_{\{(i,j)\in O\}}, 1_{\{(i,j)\in\Omega\}})$ are iid random vectors with the probability distribution $\mathbb{P}(1_{\{(i,j)\in O\}} = 1) = \rho$, $\mathbb{P}(1_{\{(i,j)\in\Omega\}} = 1 \mid 1_{\{(i,j)\in O\}} = 1) = s$ and $\mathbb{P}(1_{\{(i,j)\in\Omega\}} = 1 \mid 1_{\{(i,j)\in O\}} = 0) = 0$. In Model 3.2, we have
$$(1_{\{(i,j)\in O\}},\ 1_{\{(i,j)\in\Omega\}}) = \big( \max(1_{\{(i,j)\in\Gamma'\}},\ 1_{\{(i,j)\in\Omega'\}}),\ \ 1_{\{(i,j)\in\Omega'\}}\, 1_{\{W_{ij} = K_{ij}\}}\, 1_{\{(i,j)\in\Gamma'^c\}} \big).$$
This implies that the pairs $(1_{\{(i,j)\in O\}}, 1_{\{(i,j)\in\Omega\}})$ are independent random vectors. Moreover, it is easy to calculate that $\mathbb{P}(1_{\{(i,j)\in O\}} = 1) = \rho$, $\mathbb{P}(1_{\{(i,j)\in\Omega\}} = 1) = s\rho$ and $\mathbb{P}(1_{\{(i,j)\in\Omega\}} = 1,\ 1_{\{(i,j)\in O\}} = 0) = 0$. Then we have
$$\mathbb{P}(1_{\{(i,j)\in\Omega\}} = 1 \mid 1_{\{(i,j)\in O\}} = 1) = \mathbb{P}(1_{\{(i,j)\in\Omega\}} = 1,\ 1_{\{(i,j)\in O\}} = 1)\, /\, \mathbb{P}(1_{\{(i,j)\in O\}} = 1) = s,$$
and
$$\mathbb{P}(1_{\{(i,j)\in\Omega\}} = 1 \mid 1_{\{(i,j)\in O\}} = 0) = \mathbb{P}(1_{\{(i,j)\in\Omega\}} = 1,\ 1_{\{(i,j)\in O\}} = 0)\, /\, \mathbb{P}(1_{\{(i,j)\in O\}} = 0) = 0.$$
Although $(1_{\{(i,j)\in O\}}, 1_{\{(i,j)\in\Omega\}})$ depends on $K$, its distribution does not. By the above we know that $(O, \Omega)$ has the same distribution in both models. Therefore in the following we will use Model 3.2 instead. The advantage of using Model 3.2 is that we can utilize $\Gamma'$, $\Omega'$, $W$, etc. as auxiliaries.

We now prove some supporting lemmas which are useful for the proof of the main theorem. Define $T := \{UX^* + YV^* :\ X, Y \in \mathbb{R}^{n \times r}\}$, a subspace of $\mathbb{R}^{n \times n}$.
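The bookkeeping of Model 3.2 can be checked structurally in a simulation: $\Omega$ and $\Gamma$ partition $O$, and on $\Omega$ the auxiliary coin $W$ agrees with $K$, so $\mathcal{P}_\Omega(K) = \mathcal{P}_\Omega(W)$. A small sketch (Python with NumPy; all parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
n, rho, s = 400, 0.3, 0.1                     # illustrative parameters
gam_p = (1 - s) * rho                         # Gamma' parameter, as in step 2
q = 2 * s * rho / (1 - rho + 2 * s * rho)     # Omega' parameter, as in step 2
K = rng.choice([-1.0, 1.0], size=(n, n))      # fixed sign pattern
W = rng.choice([-1.0, 1.0], size=(n, n))      # the auxiliary coin of step 3
Gp = rng.random((n, n)) < gam_p               # Gamma'
Op = rng.random((n, n)) < q                   # Omega'
O = Gp | Op                                   # observed set O = Gamma' u Omega'
Opp = Op & (W == K)                           # Omega'': Omega' thinned by the coin
Om = Opp & ~Gp                                # Omega = Omega'' \ Gamma'
Gam = O & ~Om                                 # Gamma = O \ Omega
```

The structural identities below are exact consequences of the construction, independent of the parameter values.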
Then the orthogonal projections $\mathcal{P}_T$ and $\mathcal{P}_{T^\perp}$ in $\mathbb{R}^{n \times n}$ satisfy
$$\mathcal{P}_T X = UU^*X + XVV^* - UU^*XVV^* \quad \text{and} \quad \mathcal{P}_{T^\perp} X = (I - UU^*)\, X\, (I - VV^*)$$
for any $X \in \mathbb{R}^{n \times n}$. This implies $\|\mathcal{P}_{T^\perp} X\| \le \|X\|$ for any $X$. Recalling the incoherence conditions, for any $i \in \{1, \dots, n\}$, $\|UU^* e_i\|^2 \le \frac{\mu r}{n}$ and $\|VV^* e_i\|^2 \le \frac{\mu r}{n}$, we have $\|\mathcal{P}_T(e_i e_j^*)\|_\infty \le \frac{2\mu r}{n}$ and $\|\mathcal{P}_T(e_i e_j^*)\|_F \le \sqrt{\frac{2\mu r}{n}}$ [8, 12].

Lemma 4.1 (Theorem 4.1 of [8]) Suppose $\Omega \sim \mathrm{Ber}(\rho)$. Then with high probability,
$$\|\mathcal{P}_T - \rho^{-1}\, \mathcal{P}_T \mathcal{P}_\Omega \mathcal{P}_T\| \le \epsilon,$$
provided that $\rho \ge C_0\, \epsilon^{-2}\, \frac{\mu r \log n}{n}$ for some numerical constant $C_0 > 0$. The original idea of the proof of this theorem is due to [36].

Lemma 4.2 (Theorem 3.1 of [4]) Suppose $Z \in \mathrm{Range}(\mathcal{P}_T)$ is a fixed matrix, $\Omega \sim \mathrm{Ber}(\rho)$, and $\epsilon \le 1$ is an arbitrary constant. Then with high probability, $\|(\mathcal{I} - \rho^{-1}\mathcal{P}_T\mathcal{P}_\Omega) Z\|_\infty \le \epsilon \|Z\|_\infty$, provided that $\rho \ge C_0\, \epsilon^{-2}\, \frac{\mu r \log n}{n}$ for some numerical constant $C_0 > 0$.

Lemma 4.3 (Theorem 6.3 of [8]) Suppose $Z$ is a fixed matrix and $\Omega \sim \mathrm{Ber}(\rho)$. Then with high probability,
$$\|(\rho\,\mathcal{I} - \mathcal{P}_\Omega) Z\| \le C_0' \sqrt{np \log n}\, \|Z\|_\infty,$$
provided that $\rho \le p$ and $p \ge C_0 \frac{\log n}{n}$ for some numerical constants $C_0 > 0$ and $C_0' > 0$. Notice that we only have $\rho = p$ in Theorem 6.3 of [8]. By a very slight modification of the proof (specifically, the proof of Lemma 6.2) we can obtain $\rho \le p$ as stated above.

By Lemma 4.1, we have
$$\Big\| \frac{1}{(1-s)\rho}\, \mathcal{P}_T \mathcal{P}_{\Gamma'} \mathcal{P}_T - \mathcal{P}_T \Big\| \le \frac{1}{2} \quad \text{and} \quad \Big\| \frac{1}{\sqrt{(1-s)\rho}}\, \mathcal{P}_T \mathcal{P}_{\Gamma'} \Big\| \le \sqrt{\frac{3}{2}}$$
with high probability, provided $C_\rho$ is sufficiently large and $C_s$ is sufficiently small. We will assume both inequalities hold throughout this section.

Theorem 4.4
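The projection formulas above, and the near-isometry of Lemma 4.1, can be sanity-checked numerically (Python with NumPy; the sizes, the sampling rate, and the random subspaces are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
n, r, rho = 120, 3, 0.5                       # illustrative sizes
U, _ = np.linalg.qr(rng.standard_normal((n, r)))
V, _ = np.linalg.qr(rng.standard_normal((n, r)))

def P_T(X):
    # P_T X = UU*X + XVV* - UU*XVV*
    return U @ (U.T @ X) + (X @ V) @ V.T - U @ (U.T @ X @ V) @ V.T

X = rng.standard_normal((n, n))
PX = P_T(X)
idem = np.linalg.norm(P_T(PX) - PX)           # P_T is a projection
Xperp = X - PX                                # equals (I-UU*)X(I-VV*)
orth = np.linalg.norm(P_T(Xperp))             # P_T annihilates its complement
mask = rng.random((n, n)) < rho               # Omega ~ Ber(rho)
Z = P_T(rng.standard_normal((n, n)))          # a fixed Z in Range(P_T)
dev = np.linalg.norm(P_T(mask * Z) / rho - Z) / np.linalg.norm(Z)
```

The quantity `dev` is a proxy for the operator deviation in Lemma 4.1 applied to one fixed matrix; at these sizes it is well below one, which is the regime the lemma guarantees once $\rho \gtrsim \mu r \log n / n$.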
If there exists an $n \times n$ matrix $Y$ obeying
$$\|\mathcal{P}_T Y + \mathcal{P}_T(\lambda \mathcal{P}_{O\setminus\Gamma'} W - UV^*)\|_F \le \frac{\lambda}{n}, \quad \|\mathcal{P}_{T^\perp} Y + \mathcal{P}_{T^\perp}(\lambda \mathcal{P}_{O\setminus\Gamma'} W)\| \le \frac{1}{4}, \quad \mathcal{P}_{\Gamma'^c} Y = 0, \quad \|\mathcal{P}_{\Gamma'} Y\|_\infty \le \frac{\lambda}{4}, \qquad (4.1)$$
where $\lambda = \frac{1}{\sqrt{n\rho \log n}}$, then the solution $(\hat{L}, \hat{S})$ to (1.6) satisfies $(\hat{L}, \hat{S}) = (L, S)$.

Proof Set $H = \hat{L} - L$. The condition $\mathcal{P}_O(L) + S = \mathcal{P}_O(\hat{L}) + \hat{S}$ implies that $\mathcal{P}_O(H) = S - \hat{S}$. Then $\hat{S}$ is supported on $O$ because $S$ is supported on $\Omega \subset O$. By considering the subgradient of the nuclear norm at $L$, we have
$$\|\hat{L}\|_* \ge \|L\|_* + \langle \mathcal{P}_T H,\, UV^* \rangle + \|\mathcal{P}_{T^\perp} H\|_*.$$
By the definition of $(\hat{L}, \hat{S})$, we have $\|\hat{L}\|_* + \lambda \|\hat{S}\|_1 \le \|L\|_* + \lambda \|S\|_1$. By the two inequalities above, we have
$$\lambda \|S\|_1 - \lambda \|\hat{S}\|_1 \ge \langle \mathcal{P}_T(H),\, UV^* \rangle + \|\mathcal{P}_{T^\perp} H\|_*,$$
which implies
$$\lambda \|S\|_1 - \lambda \|\mathcal{P}_{O\setminus\Gamma'}(\hat{S})\|_1 \ge \langle H,\, UV^* \rangle + \|\mathcal{P}_{T^\perp}(H)\|_* + \lambda \|\mathcal{P}_{\Gamma'}(\hat{S})\|_1.$$
On the other hand,
$$\|\mathcal{P}_{O\setminus\Gamma'} \hat{S}\|_1 = \|S + \mathcal{P}_{O\setminus\Gamma'}(-H)\|_1 \ge \|S\|_1 + \langle \operatorname{sgn}(S),\, \mathcal{P}_\Omega(-H) \rangle + \|\mathcal{P}_{O\setminus(\Gamma'\cup\Omega)}(-H)\|_1 \ge \|S\|_1 + \langle \mathcal{P}_{O\setminus\Gamma'}(W),\, -H \rangle.$$
By the two inequalities above and the fact $\mathcal{P}_{\Gamma'}\hat{S} = \mathcal{P}_{\Gamma'}(\hat{S} - S) = -\mathcal{P}_{\Gamma'} H$, we have
$$\|\mathcal{P}_{T^\perp}(H)\|_* + \lambda \|\mathcal{P}_{\Gamma'}(H)\|_1 \le \langle H,\, \lambda \mathcal{P}_{O\setminus\Gamma'}(W) - UV^* \rangle. \qquad (4.2)$$
By the assumptions on $Y$, we have
$$\begin{aligned}
\langle H,\, \lambda \mathcal{P}_{O\setminus\Gamma'}(W) - UV^* \rangle &= \langle H,\, Y + \lambda \mathcal{P}_{O\setminus\Gamma'}(W) - UV^* \rangle - \langle H,\, Y \rangle \\
&= \langle \mathcal{P}_T(H),\, \mathcal{P}_T(Y + \lambda \mathcal{P}_{O\setminus\Gamma'}(W) - UV^*) \rangle + \langle \mathcal{P}_{T^\perp}(H),\, \mathcal{P}_{T^\perp}(Y + \lambda \mathcal{P}_{O\setminus\Gamma'}(W)) \rangle \\
&\quad - \langle \mathcal{P}_{\Gamma'}(H),\, \mathcal{P}_{\Gamma'}(Y) \rangle - \langle \mathcal{P}_{\Gamma'^c}(H),\, \mathcal{P}_{\Gamma'^c}(Y) \rangle \\
&\le \frac{\lambda}{n} \|\mathcal{P}_T(H)\|_F + \frac{1}{4} \|\mathcal{P}_{T^\perp}(H)\|_* + \frac{\lambda}{4} \|\mathcal{P}_{\Gamma'}(H)\|_1.
\end{aligned}$$
By inequality (4.2),
$$\frac{3}{4} \|\mathcal{P}_{T^\perp}(H)\|_* + \frac{3\lambda}{4} \|\mathcal{P}_{\Gamma'}(H)\|_1 \le \frac{\lambda}{n} \|\mathcal{P}_T(H)\|_F. \qquad (4.3)$$
Recall that we assume $\| \frac{1}{(1-s)\rho} \mathcal{P}_T\mathcal{P}_{\Gamma'}\mathcal{P}_T - \mathcal{P}_T \| \le \frac{1}{2}$ and $\| \frac{1}{\sqrt{(1-s)\rho}} \mathcal{P}_T\mathcal{P}_{\Gamma'} \| \le \sqrt{3/2}$. Then
$$\begin{aligned}
\|\mathcal{P}_T(H)\|_F &\le 2 \Big\| \frac{1}{(1-s)\rho}\, \mathcal{P}_T\mathcal{P}_{\Gamma'}\mathcal{P}_T(H) \Big\|_F \\
&\le 2 \Big\| \frac{1}{(1-s)\rho}\, \mathcal{P}_T\mathcal{P}_{\Gamma'}\mathcal{P}_{T^\perp}(H) \Big\|_F + 2 \Big\| \frac{1}{(1-s)\rho}\, \mathcal{P}_T\mathcal{P}_{\Gamma'}(H) \Big\|_F \\
&\le \sqrt{\frac{6}{(1-s)\rho}}\, \|\mathcal{P}_{T^\perp} H\|_F + \sqrt{\frac{6}{(1-s)\rho}}\, \|\mathcal{P}_{\Gamma'} H\|_F.
\end{aligned}$$
By inequality (4.3), we have
$$\Big( \frac{3}{4} - \frac{\lambda}{n}\sqrt{\frac{6}{(1-s)\rho}} \Big) \|\mathcal{P}_{T^\perp}(H)\|_F + \Big( \frac{3\lambda}{4} - \frac{\lambda}{n}\sqrt{\frac{6}{(1-s)\rho}} \Big) \|\mathcal{P}_{\Gamma'} H\|_F \le 0.$$
Then $\mathcal{P}_{T^\perp}(H) = \mathcal{P}_{\Gamma'} H = 0$, which implies $\mathcal{P}_{\Gamma'}\mathcal{P}_T(H) = 0$. Since $\mathcal{P}_{\Gamma'}\mathcal{P}_T$ is injective on $T$ (because $\| \frac{1}{(1-s)\rho}\mathcal{P}_T\mathcal{P}_{\Gamma'}\mathcal{P}_T - \mathcal{P}_T \| \le \frac{1}{2}$), we have $\mathcal{P}_T(H) = 0$. Then $H = 0$.

Suppose we can construct $Y$ and $\tilde{Y}$ satisfying
$$\|\mathcal{P}_T Y + \mathcal{P}_T(\lambda \mathcal{P}_{\Omega'} W - UV^*)\|_F \le \frac{\lambda}{n}, \quad \|\mathcal{P}_{T^\perp} Y + \mathcal{P}_{T^\perp}(\lambda \mathcal{P}_{\Omega'} W)\| \le \frac{1}{4}, \quad \mathcal{P}_{\Gamma'^c} Y = 0, \quad \|\mathcal{P}_{\Gamma'} Y\|_\infty \le \frac{\lambda}{4}, \qquad (4.4)$$
and
$$\|\mathcal{P}_T \tilde{Y} + \mathcal{P}_T(\lambda(2\mathcal{P}_{\Omega'\setminus\Gamma'}(W) - \mathcal{P}_{\Omega'} W) - UV^*)\|_F \le \frac{\lambda}{n}, \quad \|\mathcal{P}_{T^\perp} \tilde{Y} + \mathcal{P}_{T^\perp}(\lambda(2\mathcal{P}_{\Omega'\setminus\Gamma'}(W) - \mathcal{P}_{\Omega'} W))\| \le \frac{1}{4}, \quad \mathcal{P}_{\Gamma'^c} \tilde{Y} = 0, \quad \|\mathcal{P}_{\Gamma'} \tilde{Y}\|_\infty \le \frac{\lambda}{4}. \qquad (4.5)$$
Then $(Y + \tilde{Y})/2$ satisfies (4.1), since $\frac{1}{2}\big( \mathcal{P}_{\Omega'}W + 2\mathcal{P}_{\Omega'\setminus\Gamma'}(W) - \mathcal{P}_{\Omega'}W \big) = \mathcal{P}_{\Omega'\setminus\Gamma'}(W) = \mathcal{P}_{O\setminus\Gamma'}(W)$. Moreover, $(\Gamma', \mathcal{P}_{\Omega'} W)$ and $(\Gamma', 2\mathcal{P}_{\Omega'\setminus\Gamma'}(W) - \mathcal{P}_{\Omega'} W)$ have the same distribution. Therefore, if we can construct $Y$ satisfying (4.4) with high probability, we can also construct $\tilde{Y}$ satisfying (4.5) with high probability. Therefore, to prove Theorem 1.3, we only need to prove that there exists $Y$ satisfying (4.4) with high probability.

Proof (of Theorem 1.3) Notice that $\Gamma' \sim \mathrm{Ber}((1-s)\rho)$. Suppose $q$ satisfies
$$1 - (1-s)\rho = \Big( 1 - \frac{(1-s)\rho}{6} \Big)^2 (1-q)^{l-2},$$
where $l = \lfloor 2\log_2 n \rfloor + 1$. This implies that $q \ge C\rho/\log n$. Define $q_1 = q_2 = \frac{(1-s)\rho}{6}$ and $q_3 = \dots = q_l = q$. Then in distribution we can let $\Gamma' = \Gamma_1 \cup \dots \cup \Gamma_l$, where the $\Gamma_j \sim \mathrm{Ber}(q_j)$ independently.

Construct
$$Z_0 = \mathcal{P}_T(UV^* - \lambda \mathcal{P}_{\Omega'} W), \qquad Z_j = (\mathcal{P}_T - q_j^{-1}\mathcal{P}_T\mathcal{P}_{\Gamma_j}\mathcal{P}_T)\, Z_{j-1} \ \text{ for } j = 1, \dots, l, \qquad Y = \sum_{j=1}^l q_j^{-1}\mathcal{P}_{\Gamma_j} Z_{j-1}.$$
Then by Lemma 4.1, we have
$$\|Z_j\|_F \le \frac{1}{2}\|Z_{j-1}\|_F \ \text{ for } j = 1, \dots, l$$
with high probability, provided $C_\rho$ is large enough and $C_s$ is small enough. Then $\|Z_j\|_F \le (\frac{1}{2})^j \|Z_0\|_F$. By the construction of $Z_j$, we know that $Z_j \in \mathrm{Range}(\mathcal{P}_T)$ and $Z_j = (\mathcal{I} - q_j^{-1}\mathcal{P}_T\mathcal{P}_{\Gamma_j})\, Z_{j-1}$. Then similarly, by Lemma 4.2, we have
$$\|Z_1\|_\infty \le \frac{1}{2\sqrt{\log n}} \|Z_0\|_\infty, \qquad \|Z_j\|_\infty \le \Big(\frac{1}{2}\Big)^{j-2} \frac{1}{4\log n} \|Z_0\|_\infty \ \text{ for } j = 2, \dots, l,$$
with high probability, provided $C_\rho$ is large enough and $C_s$ is small enough.
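The construction of $Y$ above can be sketched numerically. For simplicity the sketch below takes $Z_0 = UV^*$ (dropping the $\lambda \mathcal{P}_{\Omega'}W$ term) and uses equal sampling rates $q_j$, both simplifying assumptions (Python with NumPy; all sizes illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n, r, q, l = 200, 2, 0.25, 10                # illustrative sizes and rate
U, _ = np.linalg.qr(rng.standard_normal((n, r)))
V, _ = np.linalg.qr(rng.standard_normal((n, r)))

def P_T(X):
    return U @ (U.T @ X) + (X @ V) @ V.T - U @ (U.T @ X @ V) @ V.T

Z = U @ V.T                                  # stand-in for Z_0
Y = np.zeros((n, n))
f0 = np.linalg.norm(Z)
for _ in range(l):
    mask = rng.random((n, n)) < q            # Gamma_j ~ Ber(q), fresh each round
    Y += (mask * Z) / q                      # q_j^{-1} P_{Gamma_j} Z_{j-1}
    Z = P_T(Z) - P_T(mask * Z) / q           # Z_j
# telescoping gives P_T(Y) = Z_0 - Z_l, and ||Z_l||_F is geometrically small
```

By construction $Y$ is supported on the union of the $\Gamma_j$, and the telescoping identity $\mathcal{P}_T Y = Z_0 - Z_l$ is exact, which is the first requirement of (4.4) up to the small residual $Z_l$.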
Also, by Lemma 4.3, we have
$$\big\| (\mathcal{I} - q_j^{-1}\mathcal{P}_{\Gamma_j})\, Z_{j-1} \big\| \le C_0' \sqrt{\frac{n \log n}{q_j}}\, \|Z_{j-1}\|_\infty \ \text{ for } j = 1, \dots, l,$$
with high probability, provided $C_\rho$ is large enough and $C_s$ is small enough.

We first bound $\|Z_0\|_F$ and $\|Z_0\|_\infty$. Obviously $\|Z_0\|_\infty \le \|UV^*\|_\infty + \lambda \|\mathcal{P}_T\mathcal{P}_{\Omega'}(W)\|_\infty$. Recall that for any $i, j \in [n]$, we have $\|\mathcal{P}_T(e_i e_j^*)\|_\infty \le \frac{2\mu r}{n}$ and $\|\mathcal{P}_T(e_i e_j^*)\|_F \le \sqrt{\frac{2\mu r}{n}}$. Moreover, the entries of $\mathcal{P}_{\Omega'}(W)$ are iid random variables with the distribution
$$(\mathcal{P}_{\Omega'}(W))_{ij} = \begin{cases} 1 & \text{with probability } \frac{s\rho}{1-\rho+2s\rho}, \\ 0 & \text{with probability } 1 - \frac{2s\rho}{1-\rho+2s\rho}, \\ -1 & \text{with probability } \frac{s\rho}{1-\rho+2s\rho}. \end{cases}$$
Then by Bernstein's inequality, we have
$$\mathbb{P}\big( \big| \langle \mathcal{P}_T(\mathcal{P}_{\Omega'}(W)),\, e_i e_j^* \rangle \big| \ge t \big) = \mathbb{P}\big( \big| \langle \mathcal{P}_{\Omega'}(W),\, \mathcal{P}_T(e_i e_j^*) \rangle \big| \ge t \big) \le 2\exp\Big( -\frac{t^2/2}{\sum \mathbb{E} X_{ij}^2 + Mt/3} \Big),$$
where
$$\sum \mathbb{E} X_{ij}^2 = \frac{2s\rho}{1-\rho+2s\rho}\, \|\mathcal{P}_T(e_i e_j^*)\|_F^2 \le C\rho s\, \frac{\mu r}{n} \quad \text{and} \quad M = \|\mathcal{P}_T(e_i e_j^*)\|_\infty \le \frac{2\mu r}{n}.$$
Then with high probability we have $\|\mathcal{P}_T\mathcal{P}_{\Omega'}(W)\|_\infty \le C\sqrt{\rho\, \frac{\mu r \log n}{n}}$ (here the variance term dominates because $\rho \ge C_\rho \frac{\mu r \log^2 n}{n}$). Then by the incoherence condition $\|UV^*\|_\infty \le \frac{\sqrt{\mu r}}{n}$, we have $\|Z_0\|_\infty \le C\frac{\sqrt{\mu r}}{n}$, which implies $\|Z_0\|_F \le n\|Z_0\|_\infty \le C\sqrt{\mu r}$.

Now we want to prove that $Y$ satisfies (4.4). By construction, $\mathcal{P}_{\Gamma'^c} Y = 0$. It suffices to prove
$$\|\mathcal{P}_T Y + \mathcal{P}_T(\lambda\mathcal{P}_{\Omega'}(W) - UV^*)\|_F \le \frac{\lambda}{n}, \quad \|\mathcal{P}_{T^\perp} Y\| \le \frac{1}{8}, \quad \|\mathcal{P}_{T^\perp}(\lambda\mathcal{P}_{\Omega'}(W))\| \le \frac{1}{8}, \quad \|\mathcal{P}_{\Gamma'} Y\|_\infty \le \frac{\lambda}{4}. \qquad (4.6)$$
First,
$$\begin{aligned}
\|\mathcal{P}_T Y + \mathcal{P}_T(\lambda\mathcal{P}_{\Omega'}(W) - UV^*)\|_F &= \Big\| Z_0 - \sum_{j=1}^l q_j^{-1}\mathcal{P}_T\mathcal{P}_{\Gamma_j} Z_{j-1} \Big\|_F = \Big\| Z_0 - \sum_{j=1}^l q_j^{-1}\mathcal{P}_T\mathcal{P}_{\Gamma_j}\mathcal{P}_T Z_{j-1} \Big\|_F \\
&= \Big\| (\mathcal{P}_T - q_1^{-1}\mathcal{P}_T\mathcal{P}_{\Gamma_1}\mathcal{P}_T)\, Z_0 - \sum_{j=2}^l q_j^{-1}\mathcal{P}_T\mathcal{P}_{\Gamma_j}\mathcal{P}_T Z_{j-1} \Big\|_F \\
&= \Big\| Z_1 - \sum_{j=2}^l q_j^{-1}\mathcal{P}_T\mathcal{P}_{\Gamma_j}\mathcal{P}_T Z_{j-1} \Big\|_F = \dots = \|Z_l\|_F \le C\Big(\frac{1}{2}\Big)^l \sqrt{\mu r} \le \frac{\lambda}{n}.
\end{aligned}$$
Second,
$$\begin{aligned}
\|\mathcal{P}_{T^\perp} Y\| &= \Big\| \mathcal{P}_{T^\perp} \sum_{j=1}^l q_j^{-1}\mathcal{P}_{\Gamma_j} Z_{j-1} \Big\| \le \sum_{j=1}^l \big\| q_j^{-1}\mathcal{P}_{T^\perp}\mathcal{P}_{\Gamma_j} Z_{j-1} \big\| = \sum_{j=1}^l \big\| \mathcal{P}_{T^\perp}\big( q_j^{-1}\mathcal{P}_{\Gamma_j} Z_{j-1} - Z_{j-1} \big) \big\| \\
&\le \sum_{j=1}^l \big\| q_j^{-1}\mathcal{P}_{\Gamma_j} Z_{j-1} - Z_{j-1} \big\| \le \sum_{j=1}^l C_0' \sqrt{\frac{n\log n}{q_j}}\, \|Z_{j-1}\|_\infty \\
&\le C\sqrt{n\log n}\, \Big( \sum_{j=3}^l \Big(\frac{1}{2}\Big)^{j-3} \frac{1}{4\log n} \frac{1}{\sqrt{q}} + \frac{1}{2\sqrt{\log n}}\frac{1}{\sqrt{q_2}} + \frac{1}{\sqrt{q_1}} \Big)\, \|Z_0\|_\infty \\
&\le C\sqrt{\frac{n\log n}{\rho}} \cdot \frac{\sqrt{\mu r}}{n} = C\sqrt{\frac{\mu r \log n}{n\rho}} \le \frac{1}{8},
\end{aligned}$$
provided $C_\rho$ is sufficiently large.

Third, we have $\|\lambda\, \mathcal{P}_{T^\perp}\mathcal{P}_{\Omega'}(W)\| \le \lambda \|\mathcal{P}_{\Omega'}(W)\|$. Notice that $(W_{ij})$ is an independent Rademacher sequence independent of $\Omega'$. By Lemma 4.3, we have
$$\Big\| \frac{2s\rho}{1-\rho+2s\rho}\, W - \mathcal{P}_{\Omega'}(W) \Big\| \le C_0' \sqrt{np\log n}\, \|W\|_\infty,$$
provided $\frac{2s\rho}{1-\rho+2s\rho} \le p$ and $p \ge C_0 \frac{\log n}{n}$. By Theorem 3.9 of [39], we have $\|W\| \le C\sqrt{n}$ with high probability. Therefore,
$$\|\mathcal{P}_{\Omega'}(W)\| \le C_0'\sqrt{np\log n} + C\sqrt{n}\, \frac{2s\rho}{1-\rho+2s\rho}.$$
By choosing $p = \rho/C_3$ for some appropriate $C_3$, we have $\|\mathcal{P}_{\Omega'}(W)\| \le \frac{1}{8}\sqrt{n\rho\log n}$, provided $C_\rho$ is large enough and $C_s$ is small enough; hence $\|\lambda\, \mathcal{P}_{T^\perp}\mathcal{P}_{\Omega'}(W)\| \le \frac{1}{8}$.

Fourth,
$$\|\mathcal{P}_{\Gamma'} Y\|_\infty = \Big\| \mathcal{P}_{\Gamma'} \sum_j q_j^{-1}\mathcal{P}_{\Gamma_j} Z_{j-1} \Big\|_\infty \le \sum_j q_j^{-1} \|Z_{j-1}\|_\infty \le \Big( \sum_{j=3}^l \frac{1}{q}\Big(\frac{1}{2}\Big)^{j-3}\frac{1}{4\log n} + \frac{1}{q_2}\frac{1}{2\sqrt{\log n}} + \frac{1}{q_1} \Big) \|Z_0\|_\infty \le C\frac{\sqrt{\mu r}}{n\rho} \le \frac{\lambda}{4},$$
provided $C_\rho$ is sufficiently large.

Notice that in [4] the authors used a very similar golfing scheme. To compare the two methods: here we use a golfing scheme with non-uniform block sizes to achieve a result with fewer log factors. Moreover, unlike [4], where both the golfing scheme and a least-squares method are used to construct the two parts of the dual matrix, here we use only the golfing scheme. Actually, the method of constructing the dual matrix in [4] cannot be applied directly to our problem when $\rho = O(r\log n/n)$.

Acknowledgements
I am grateful to my Ph.D. advisor, Emmanuel Candès, for his encouragement and his help in preparing this manuscript.
References

[1] A. Agarwal, S. Negahban, and M. Wainwright. Noisy matrix decomposition via convex relaxation: Optimal rates in high dimensions. In Proc. 28th Inter. Conf. Mach. Learn. (ICML), pages 1129–1136, 2011.
[2] R. Ahlswede and A. Winter. Strong converse for identification via quantum channels. IEEE Trans. Inform. Theory, 48(3):569–579, 2002.
[3] R. Baraniuk, M. Davenport, R. DeVore, and M. Wakin. A simple proof of the restricted isometry property for random matrices. Constructive Approximation, 28(3):253–263, 2008.
[4] E. Candès, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? Journal of the ACM, 58(3), 2011.
[5] E. Candès and Y. Plan. Matrix completion with noise. Proceedings of the IEEE, 2009.
[6] E. Candès and Y. Plan. Near-ideal model selection by $\ell_1$ minimization. Ann. Statist., 37(5A):2145–2177, 2009.
[7] E. Candès and Y. Plan. A probabilistic and RIPless theory of compressed sensing. IEEE Trans. Inform. Theory, 57(11):7235–7254, 2011.
[8] E. Candès and B. Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6), 2009.
[9] E. Candès, J. Romberg, and T. Tao. Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Trans. Inform. Theory, 52(2):489–509, 2006.
[10] E. Candès, J. Romberg, and T. Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics, 59(8):1207–1223, 2006.
[11] E. Candès and T. Tao. Decoding by linear programming. IEEE Trans. Inform. Theory, 51(12), 2005.
[12] E. Candès and T. Tao. The power of convex relaxation: Near-optimal matrix completion. IEEE Trans. Inform. Theory, 56(5):2053–2080, 2010.
[13] V. Chandrasekaran, S. Sanghavi, P. Parrilo, and A. Willsky. Sparse and low-rank matrix decompositions. In 15th IFAC Symposium on System Identification (SYSID), 2009.
[14] V. Chandrasekaran, S. Sanghavi, P. Parrilo, and A. Willsky. Rank-sparsity incoherence for matrix decomposition. SIAM J. on Optimization, 21(2):572–596, 2011.
[15] S. Chen, D. Donoho, and M. Saunders. Atomic decomposition by basis pursuit. SIAM J. Sci. Comput., 20(1):33–61, 1998.
[16] Y. Chen, A. Jalali, S. Sanghavi, and C. Caramanis. Low-rank matrix recovery from errors and erasures. ISIT, 2011.
[17] K. Davidson and S. Szarek. Local operator theory, random matrices and Banach spaces. Handbook of the Geometry of Banach Spaces, I(8):317–366, 2001.
[18] D. Donoho. For most large underdetermined systems of linear equations the minimal $\ell_1$-norm solution is also the sparsest solution. Communications on Pure and Applied Mathematics, 59(6):797–829, 2006.
[19] D. Donoho. Compressed sensing. IEEE Trans. Inform. Theory, 52(4):1289–1306, 2006.
[20] M. Fazel. Matrix rank minimization with applications. Ph.D. Thesis, 2002.
[21] D. Gross. Recovering low-rank matrices from few coefficients in any basis. IEEE Trans. Inform. Theory, 57(3):1548–1566, 2011.
[22] D. Gross, Y.-K. Liu, S. Flammia, S. Becker, and J. Eisert. Quantum state tomography via compressed sensing. Physical Review Letters, 105(15), 2010.
[23] J. Haupt, W. Bajwa, M. Rabbat, and R. Nowak. Compressed sensing for networked data. IEEE Signal Processing Magazine, 25(2):92–101, 2008.
[24] D. Hsu, S. Kakade, and T. Zhang. Robust matrix decomposition with sparse corruptions. IEEE Trans. Inform. Theory, 57(11):7221–7234, 2011.
[25] J. Tropp. User-friendly tail bounds for sums of random matrices. Found. Comput. Math., 2011.
[26] R. Keshavan, A. Montanari, and S. Oh. Matrix completion from a few entries. IEEE Trans. Inform. Theory, 56(6):2980–2998, 2010.
[27] J. Laska, P. Boufounos, M. Davenport, and R. Baraniuk. Democracy in action: Quantization, saturation, and compressive sensing. Applied and Computational Harmonic Analysis, 31(3):429–443, 2011.
[28] J. Laska, M. Davenport, and R. Baraniuk. Exact signal recovery from sparsely corrupted measurements through the pursuit of justice. Asilomar Conference on Signals, Systems and Computers, 2009.
[29] Z. Li, F. Wu, and J. Wright. On the systematic measurement matrix for compressed sensing in the presence of gross errors. Data Compression Conference, pages 356–365, 2010.
[30] N. Nguyen and T. Tran. Exact recoverability from dense corrupted observations via $\ell_1$ minimization. Preprint, 2011.
[31] N. Nguyen, N. Nasrabadi, and T. Tran. Robust lasso with missing and grossly corrupted observations. Preprint, 2011.
[32] B. Recht. A simpler approach to matrix completion. Journal of Machine Learning Research, 12:3413–3430, 2011.
[33] B. Recht, M. Fazel, and P. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3), 2010.
[34] J. Romberg. Compressive sensing by random convolution. SIAM J. Imaging Sciences, 2(4):1098–1128, 2009.
[35] R. Tibshirani. Regression shrinkage and selection via the lasso. J. Royal Statist. Soc. B, 58(1):267–288, 1996.
[36] M. Rudelson. Random vectors in the isotropic position. J. of Functional Analysis, 164(1):60–72, 1999.
[37] M. Rudelson and R. Vershynin. On sparse reconstruction from Fourier and Gaussian measurements. Communications on Pure and Applied Mathematics, 61(8):1025–1045, 2008.
[38] C. Studer, P. Kuppinger, G. Pope, and H. Bölcskei. Recovery of sparsely corrupted signals. Preprint, 2011.
[39] R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. Chapter 5 of Compressed Sensing: Theory and Applications, ed. Y. Eldar and G. Kutyniok, Cambridge University Press, pages 210–268, 2012.
[40] J. Wright and Y. Ma. Dense error correction via $\ell_1$-minimization. IEEE Trans. Inform. Theory, 56(7):3540–3560, 2010.
[41] J. Wright, A. Y. Yang, A. Ganesh, S. Sastry, and Y. Ma. Robust face recognition via sparse representation. IEEE Trans. Pattern Anal. Mach. Intell., 31(2):210–227, 2009.
[42] L. Wu, A. Ganesh, B. Shi, Y. Matsushita, Y. Wang, and Y. Ma. Robust photometric stereo via low-rank matrix completion and recovery. Proceedings of the 10th Asian Conference on Computer Vision, Part III, 2010.
[43] H. Xu, C. Caramanis, and S. Sanghavi. Robust PCA via outlier pursuit. In Adv. Neural Inf. Proc. Sys. (NIPS), pages 2496–2504, 2010.