Sharp Support Recovery from Noisy Random Measurements by L1 minimization
Charles Dossal (a), Marie-Line Chabanol (a), Gabriel Peyré (b), Jalal Fadili (c)

(a) IMB, Université Bordeaux 1, 351 cours de la Libération, F-33405 Talence cedex, France
(b) CNRS and CEREMADE, Université Paris-Dauphine, Place du Maréchal De Lattre De Tassigny, 75775 Paris Cedex 16, France
(c) GREYC, CNRS-ENSICAEN-Université Caen, 6 Bd du Maréchal Juin, 14050 Caen Cedex, France
Abstract
In this paper, we investigate the theoretical guarantees of penalized ℓ₁-minimization (also called Basis Pursuit Denoising or Lasso) in terms of sparsity pattern recovery (support and sign consistency) from noisy measurements with non-necessarily random noise, when the sensing operator belongs to the Gaussian ensemble (i.e. random design matrix with i.i.d. Gaussian entries). More precisely, we derive sharp non-asymptotic bounds on the sparsity level and (minimal) signal-to-noise ratio that ensure support identification for most signals and most Gaussian sensing matrices by solving the Lasso with an appropriately chosen regularization parameter.

Our first purpose is to establish conditions allowing exact sparsity pattern recovery when the signal is strictly sparse. Then, these conditions are extended to cover the compressible or nearly sparse case. In these two results, the role of the minimal signal-to-noise ratio is crucial. Our third main result gets rid of this assumption in the strictly sparse case, but this time, the Lasso allows only partial recovery of the support. We also provide in this case a sharp ℓ₂-consistency result on the coefficient vector.

The results of the present work have several distinctive features compared to previous ones. One of them is that the leading constants involved in all the bounds are sharp and explicit. This is illustrated by some numerical experiments where it is indeed shown that the sharp sparsity level threshold identified by our theoretical results, below which sparsistency of the Lasso solution is guaranteed, meets the one empirically observed.

Key words: Compressed sensing, ℓ₁ minimization, sparsistency, consistency.
1. Introduction
The conventional wisdom in digital signal processing is the Shannon sampling theorem, valid for bandlimited signals. However, such a sampling scheme excludes many signals of interest that are not necessarily bandlimited but can still be explained either exactly or accurately by a small number of degrees of freedom. Such signals are termed sparse signals.

✩ This work is supported by ANR grant NatImages ANR-08-EMER-009.

Email addresses: [email protected] (Charles Dossal), [email protected] (Marie-Line Chabanol), [email protected] (Gabriel Peyré), [email protected] (Jalal Fadili)
1.1. Sparsity and Compressed Sensing

In fact we distinguish two types of sparsity: strict and weak sparsity (the latter is also termed compressibility). A signal x, considered as a vector in a finite dimensional subspace of R^p, is strictly or exactly sparse if all but a few of its entries vanish; i.e., if its support I(x) = supp(x) = {1 ≤ i ≤ p | x[i] ≠ 0} is of cardinality k ≪ p. A k-sparse signal is a signal where exactly k samples have a non-zero value. Signals and images of practical interest may be compressible or weakly sparse in the sense that the sorted magnitudes |x_sorted[i]| decay quickly. Thus x can be well-approximated as k-sparse up to an error term (this property will be used when we tackle compressible signals). If a signal is not sparse in its original domain, it may be sparsified in an appropriate orthobasis Φ (hence the importance of the point of view of computational harmonic analysis and approximation theory). Without loss of generality, we assume throughout that Φ is the standard basis.

Compressed sensing/sampling [1, 2, 3] asserts that sparse or compressible signals can be reconstructed with theoretical guarantees from far fewer measurements than the ambient dimension of the signal. Furthermore, the reconstruction is stable if the measurements are corrupted by an additive bounded noise. The encoding (or sampling) step is very fast since it gathers n non-adaptive linear measurements that preserve the structure of the signal x:

    y = Ax + w ∈ R^n,   (1)

where A ∈ R^{n×p} is a rectangular measurement matrix, i.e., n < p, and w accounts for possible noise with bounded ℓ₂ norm. In this work, we do not need w to be random, and we consider that A is drawn from the Gaussian matrix ensemble, i.e., the entries of A are independent and identically distributed (i.i.d.) N(0, 1/n).¹ The columns of A are denoted a_i, for i = 1, ..., p. In the sequel, the sub-matrix A_I is the restriction of A to the columns indexed by I(x). To lighten the notation, the dependence of I on x is dropped and should be understood from the context.

The signal is reconstructed from this underdetermined system of linear equations by solving a convex program of the form

    x⋆ ∈ argmin_{x ∈ R^p} ‖x‖₁ such that Ax − y ∈ C,   (2)

where C is an appropriate closed convex set, and ‖x‖_q := (Σ_i |x[i]|^q)^{1/q}, q ≥ 1, is the ℓ_q-norm of a vector, with the usual adaptation for q = ∞: ‖x‖∞ = max_i |x[i]|. We also denote ‖x‖₀ the ℓ₀ pseudo-norm, which counts the number of non-zero entries of x. Obviously, ‖x‖₀ = |I(x)|. For any vector x, the notation x̄ ∈ R^{|I(x)|} means the restriction of x to its support.

Typically, if C = {0} (no noise), we end up with the so-called Basis Pursuit [4] problem

    min_{x ∈ R^p} ‖x‖₁ such that y = Ax.   (BP)

Taking C as the ℓ₂ ball of radius ε, we have a noise-aware variant of BP,

    min_{x ∈ R^p} ‖x‖₁ such that ‖Ax − y‖₂ ≤ ε,   (ℓ₁-constrained)

where the parameter ε > 0 depends on the noise level ‖w‖₂. This constrained form can also be shown to be equivalent to the ℓ₁-penalized optimization problem, which goes by the name of Basis Pursuit DeNoising [4], or Lasso in the statistics community after [5]:

    min_{x ∈ R^p} (1/2)‖y − Ax‖₂² + γ‖x‖₁,   (Lasso)

where γ is the regularization parameter. (ℓ₁-constrained) and (Lasso) are equivalent in the sense that there is a bijection between γ and ε such that both problems share the same set of solutions.

¹ In a statistical linear regression setting, we would speak of a random Gaussian design.
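To fix ideas, the measurement model (1) and the (Lasso) problem can be simulated in a few lines. The following sketch solves (Lasso) with plain iterative soft-thresholding (ISTA, one of the one-step schemes reviewed below); the dimensions, noise level, step size and iteration count are illustrative choices, not values prescribed by the paper.

```python
import numpy as np

def soft_threshold(z, t):
    # Proximal operator of t * ||.||_1 (entrywise soft-thresholding).
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(A, y, gamma, n_iter=2000):
    # Minimize 0.5 * ||y - A x||_2^2 + gamma * ||x||_1 by ISTA.
    L = np.linalg.norm(A, 2) ** 2            # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = soft_threshold(x + A.T @ (y - A @ x) / L, gamma / L)
    return x

rng = np.random.default_rng(0)
n, p, k = 200, 800, 10                       # illustrative sizes, n < p
A = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, p))   # Gaussian ensemble, i.i.d. N(0, 1/n)
x0 = np.zeros(p)
support = rng.choice(p, size=k, replace=False)
x0[support] = rng.choice([-1.0, 1.0], size=k)        # k-sparse signal with random signs
w = rng.normal(size=n)
w *= 0.05 / np.linalg.norm(w)                # bounded noise, ||w||_2 = eps
y = A @ x0 + w                               # measurements (1)

x_hat = lasso_ista(A, y, gamma=0.05)
print(sorted(np.flatnonzero(np.abs(x_hat) > 1e-6)) == sorted(support))
```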
However, the bijection between γ and ε is not known explicitly and depends on y and A, so that in practice one needs to use different algorithms to solve each problem, and theoretical results are stated using one formulation or the other. In this paper, we focus on the Lasso formulation. It is worth noting that the Dantzig selector [6, 7] is also a special instance of (2), obtained when C = {z ∈ R^p | ‖A^T z‖∞ ≤ γ}.

The convex problems of the form (ℓ₁-constrained) and (Lasso) are computationally tractable and many algorithms have been developed to solve them; we only mention here a few representatives. Homotopy continuation algorithms [8, 9, 10] track the whole regularization path. Many first-order algorithms originating from convex non-smooth optimization theory have been proposed to solve (Lasso). These include one-step iterative thresholding algorithms [11, 12, 13, 14], accelerated variants [15, 16], and multi-step schemes such as [17] or [18]. The Douglas-Rachford algorithm [19, 20] is a first-order scheme that can be used to solve (ℓ₁-constrained). A more comprehensive account can be found in [21, Chapter 7].

These last years, we have witnessed a flurry of research activity where efforts have been made to investigate the theoretical guarantees of ℓ₁ minimization by solving the Lasso for sparse recovery from noisy measurements in the underdetermined case n < p. Overall, the derived conditions hinge on strong assumptions on the structure and interaction between the variables in A as indexed by x. An overview of the literature pertaining to our work will be covered in Section 1.3, after notions are introduced so that the discussions are clearer.

1.2. Performance Measures

Let x be the original vector as defined in (1), f = Ax the noiseless measurements, x⋆(γ) a minimizer of the Lasso problem, and f⋆(γ) = Ax⋆(γ).

Consistency. ℓ_q-consistency on the signal x means that the ℓ_q-error ‖x − x⋆(γ)‖_q, typically for q = 1, 2 or ∞, between the unknown vector x and a solution x⋆(γ) of either (Lasso) or (ℓ₁-constrained), comes within a factor of the noise level.
Sparsistency. Sparsity pattern recovery (also dubbed sparsistency for short, or variable selection in the statistical language) requires that the indices and signs of the solutions x⋆(γ) are equal to those of x for a well chosen value of γ. Partial support recovery occurs when the recovered support is included (strictly) in that of x with the correct sign pattern.

In general, it is not clear which of these performance measures is better to characterize the Lasso solution. Nevertheless, in the noisy case, consistency does not tell the whole story, and there are many applications where bounds on the ℓ_q-error are insufficient to characterize the accuracy of the Lasso estimate. In this case, exact or partial recovery of the support, hence of the correct model variables, is the desirable property to have. Among other advantages, this allows for instance to circumvent the bias of the Lasso and thus enhance the estimation of x and Ax using a debiasing procedure: recover the support I by solving the Lasso, followed by least-squares regression on the selected variables (a_i)_{i∈I}; see e.g. [6, 22]. Our work falls within this scope and focuses on exact and partial support identification for both strictly sparse and compressible signals in the presence of noise on Gaussian random measurements.

1.3. Literature Overview

The properties of the Lasso have been extensively studied, including consistency and distribution of its estimates. There is of course a huge literature on the subject, and covering it fairly is beyond the scope of this paper. In this section, we restrict our overview to those works pertaining to ours, i.e., sparsity pattern recovery in presence of noise. Much recent work aims at understanding the Lasso estimates from the point of view of sparsistency. This body of work includes [22, 6, 23, 24, 25, 26, 27, 28, 29]. For the Lasso estimates to be close to the model selection estimates when the data dimensions (n, p) grow, all the aforementioned papers assumed a sparse model and used various conditions that require the irrelevant variables to be not too correlated with the relevant ones.

Mutual coherence-based conditions.
Several researchers have independently studied the qualitative performance of the Lasso for either exact or partial sparsity pattern recovery of sufficiently sparse signals under a mutual coherence condition on the measurement matrix A; see for instance [23, 30, 26, 31] when A is deterministic, and [32] when A is Gaussian. However, mutual coherence is known to lead to overly pessimistic sparsity bounds.

Support structure-based conditions.
These sufficient recovery conditions were refined by considering not only the cardinality of the support but also its structure, including the signs of the non-zero elements of x. Such criteria use the interactions between the relevant columns of A_I = (a_i)_{i∈I} and the irrelevant ones (a_i)_{i∉I}. More precisely, we define the following condition, developed in [33] to analyze the properties of the Lasso. This condition goes by the name of irrepresentable condition in the statistical literature; see e.g. [28, 22, 27, 34] and [35] for a detailed review.

Definition 1.
Let I be the support of x and I^c its complement in {1, ..., p}. The irrepresentable (or Fuchs) condition is fulfilled if

    F(x) := ‖A_{I^c}^T A_I (A_I^T A_I)^{-1} sign(x̄)‖∞ = max_{i∈I^c} |⟨a_i, d(x)⟩| < 1,   (3)

where

    d(x) := A_I (A_I^T A_I)^{-1} sign(x̄).   (4)

Condition (3) will also be the soul of our analysis in this paper. The criterion (3) is closely related to the exact recovery coefficient (ERC) of Tropp [26]:

    ERC(x) := 1 − max_{i∈I^c} ‖(A_I^T A_I)^{-1} A_I^T a_i‖₁.   (5)

In [26, Corollary 13], it is established that if ERC(x) > 0, then the support of the Lasso solution with a large enough parameter γ is included in the one of the subset selection (i.e., ℓ₀-minimization) optimal solution.

In [28], an asymptotic result is reported showing that (3) is sufficient² for the Lasso to guarantee exact support recovery and sign consistency. It is also shown that (3) is essentially necessary for variable selection. [24] develop very similar results and use similar requirements. [36] and [37] derive asymptotic conditions for sparsistency of the block Lasso [38] by extending (3) and (5) to the group setting.

Reference [22] proposes a non-asymptotic analysis with a sufficient condition ensuring exact support and sign pattern recovery of most sufficiently sparse vectors for matrices with coherence of order (log p)^{-1}. Their proof relies upon (3) and a bound on norms of random sub-matrices developed in [39]. The work in [27] considers a condition of the form (3) to ensure sparsity pattern recovery. The analysis in that paper was conducted for both deterministic and standard Gaussian A in a high-dimensional setting where p and the sparsity level grow with the number of measurements n. That author also established that violation of (3) is sufficient for failure of the Lasso in recovering the support set. In [40], the sufficient bound on the number of measurements established in [27] for the standard Gaussian dense ensemble was shown to hold for sparse measurement ensembles. The works of [22] and [27] are certainly the most closely related to ours. We will elaborate more on these connections by highlighting the similarities and differences in Section 2.4.

² In fact, a slightly stronger assumption requiring that all elements in (3) are uniformly bounded away from 1.
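For a concrete grasp of Definition 1 and of (5), both quantities are directly computable for a given matrix and support. Below is a minimal sketch using plain linear algebra; the dimensions and the helper names (fuchs_criterion, erc) are purely illustrative and not part of the paper.

```python
import numpy as np

def fuchs_criterion(A, support, signs):
    # F(x) in (3): maximal correlation of the off-support columns with d(x) of (4).
    A_I = A[:, support]
    d = A_I @ np.linalg.solve(A_I.T @ A_I, signs)        # d(x) = A_I (A_I^T A_I)^{-1} sign(x)
    off = np.setdiff1d(np.arange(A.shape[1]), support)
    return np.max(np.abs(A[:, off].T @ d))

def erc(A, support):
    # Tropp's exact recovery coefficient (5).
    A_I = A[:, support]
    off = np.setdiff1d(np.arange(A.shape[1]), support)
    G = np.linalg.solve(A_I.T @ A_I, A_I.T @ A[:, off])  # (A_I^T A_I)^{-1} A_I^T a_i for all i off I
    return 1.0 - np.max(np.sum(np.abs(G), axis=0))       # 1 minus the largest column l1-norm

rng = np.random.default_rng(1)
n, p, k = 200, 800, 10                                   # illustrative sizes
A = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, p))
support = rng.choice(p, size=k, replace=False)
signs = rng.choice([-1.0, 1.0], size=k)
print(fuchs_criterion(A, support, signs) < 1, erc(A, support) > 0)
```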
Other variations of the Lasso have been proposed, such as the adaptive Lasso [29, 42]³ or multi-stage variable selection methods [43, 44, 45, 46, 34]. For an overview of other penalized methods that have been proposed for the purpose of variable selection, see [43].

³ The adaptive Lasso as seen in the statistical literature turns out to be a two-step procedure, where the second step is to solve a reweighted ℓ₁ norm problem, with weights given by the Lasso estimate in the first step. In fact, this is a special case of the iteratively reweighted ℓ₁-minimization [41].

Information-theoretic bounds.
A recent line of research has developed information-theoretic sufficient and necessary bounds to characterize fundamental limits on the minimal signal-to-noise ratio (SNR), the number of measurements n, and the tolerable sparsity level k required for exact or partial support pattern recovery of exactly sparse signals by any algorithm, including the optimal exhaustive ℓ₀ decoder [47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57]. In most of these works, the bounds are asymptotic, i.e., they provide asymptotic scaling and typically require that the sparsity level k varies at some rate (linearly or sub-linearly) with the signal dimension p when n grows to infinity. It is worth mentioning that a careful normalization is needed, for instance of the sampling matrix and noise, when comparing these results in the literature.

The paper [47] was the first to consider the information-theoretic limits of exact sparsity recovery from the Gaussian measurement ensemble, explicitly identifying the minimal SNR (or equivalently T = min_{i∈I(x)} |x[i]|) as a key parameter. This analysis yielded necessary and sufficient conditions on the tuples (n, p, k, T) for asymptotically reliable sparsity recovery. This complements the analysis of [27] by showing that in the sub-linear sparsity regime, i.e. k = o(p), the number of measurements required by the Lasso, n ≳ k log(p − k),⁴ achieves the information-theoretic necessary bound. Subsequent work [48, 49, 50, 51, 52, 53, 54, 55, 56, 57] has extended or strengthened this type of analysis to other settings (e.g. partial support recovery, other matrix ensembles, other scaling regimes, the compressible case).

Most of the results developed in the literature on sparsistency of the Lasso estimate exhibit asymptotic scaling results in terms of the triple (n, p, k), but this does not tell the whole story. One often needs to know explicitly the exact numerical constants involved in the bounds, not only their dependence on key quantities such as the SNR and/or other parameters of the signal x. As a consequence, the majority of sufficient conditions are more conservative than those suggested by empirical evidence.

⁴ The shorthand notation f ≳ g means that g = O(f).
1.4. Contributions

In this paper, we investigate the theoretical properties of the Lasso estimate in terms of sparsity pattern recovery (support and sign consistency) from noisy measurements (the noise being not necessarily random) when the measurement matrix belongs to the Gaussian ensemble. We provide precise non-asymptotic bounds, including explicit sharp leading numerical constants, on the key quantities that come into play (sparsity level for a given measurement budget, minimal SNR, regularization parameter) to ensure exact or partial sparsity pattern recovery for both strictly sparse and compressible signals. Our results have several distinctive features compared to previous closely-connected works. This will be discussed in further detail in Section 2.4. Numerical evidence is reported in Section 6 to confirm the theoretical findings.
1.5. Organization of the Paper

The rest of the paper is organized as follows. We first state our main results and discuss the connections and novelties with respect to existing work. In Sections 3 and 4, we detail the proofs for exact recovery with strictly sparse and compressible signals, before proving the partial support recovery result in Section 5. Numerical experiments are carried out in Section 6. Section 7 includes a final discussion and some concluding remarks.
2. Main results
Our first result, Theorem 1, establishes conditions allowing exact sparsity pattern recovery when the signal is strictly sparse. Then, these conditions are extended to cover the compressible case in Theorem 2. In these two results, the role of the minimal SNR is crucial. Our third main result, Theorem 3, gets rid of this assumption in the strictly sparse case, but this time, the Lasso allows only partial recovery of the support. We also provide in this case a sharp ℓ₂-consistency result on the Lasso estimate.

The three theorems are stated following the same structure: suppose that (x, w) fulfill some requirements formalized by a set Y; then with overwhelming probability (w.o.p. for short) on the choice of A, the Lasso estimate obeys some property P. It should be noted that these theorems imply in particular that, w.o.p. on the choice of A, for most vectors (x, w) ∈ Y, the Lasso estimate satisfies property P, whatever the probability measure used on the set Y.

The proof of Theorem 1 is given in Section 3. We prove its extension to compressible signals as stated in Theorem 2 in Section 4. Both proofs capitalize on an implicit formula of the Lasso solution. The proof of Theorem 3, given in Section 5, is quite different, since no such implicit formula is used directly.

2.1. Exact Support Recovery of Strictly Sparse Signals
Theorem 1. Let A ∈ R^{n×p} be a Gaussian matrix, i.e. its entries are i.i.d. N(0, 1/n), let w ∈ R^n be such that ‖w‖₂ ≤ ε, let 0 ≤ α, β < 1 and p > e^{1/(2(1−√β))}. Suppose that x ∈ R^p obeys

    ‖x‖₀ = k ≤ αβ n / (2 log p)   (6)

and

    min_{i∈I} |x[i]| = T ≥ (5ε/√(1−α)) √((2 log p)/n).   (7)

Solve the Lasso problem from the measurements y = Ax + w. Then with probability P(n, p, α, β) converging to 1 as n goes to infinity, the Lasso solution x⋆(γ) with

    γ = (ε/√(1−α)) √((2 log p)/n)   (8)

is unique and satisfies supp(x⋆(γ)) = supp(x) and sign(x⋆(γ)) = sign(x).

The proof (see Section 3) provides an explicit bound for P(n, p, α, β), showing in particular that P(n, p, α, β) is larger than

    1 − e^{−0.1√(log n)} − 1/(2√(π log p)) − o(1/p) − o(e^{−0.1√(log n)}),

although this bound on the probability is far from optimal.

In plain words, Theorem 1 asserts that for (α, β) ∈ [0,1)², the support and the sign of most vectors obeying (6) can be recovered using the Lasso if the non-zero coefficients of x are large enough compared to the noise. This bound on the sparsity of x turns out to be optimal, since for any c > 1, for most vectors x such that ‖x‖₀ ≥ cn/(2 log p), the support cannot be recovered using the Lasso even with no noise. Indeed, [33] and [58] proved that the Lasso solution for some γ shares the same sign and the same support as x when y = Ax if and only if

    max_{j∉I} |⟨a_j, A_I (A_I^T A_I)^{-1} sign(x̄)⟩| ≤ 1.

Note in passing the difference with the strict inequality in (3). On the other hand, if ‖x‖₀ ≥ cn/(2 log p) with c > 1, then w.o.p. ‖A_I (A_I^T A_I)^{-1} sign(x̄)‖₂² ≥ Cn/(2 log p) for some C > 1 and sufficiently large p. As a result, max_{j∉I} |⟨a_j, A_I (A_I^T A_I)^{-1} sign(x̄)⟩| ≥ √C > 1. This informal optimality discussion is consistent with the information-theoretic bounds of [47], where it was proved that the number of measurements required by the Lasso achieves the (asymptotic) information-theoretic necessary bound that has the scaling (6) when the sparsity regime is sub-linear and T ∼ ‖x‖₂/√k.

An important feature of Theorem 1 is that all the constants are made explicit and are governed by the two numerical constants α and β. The role of α is very instructive: when lowering γ by decreasing α, the threshold on the minimal SNR is decreased to allow smaller coefficients to be recovered, but simultaneously the probability of success gets lower and the number of measurements required to recover the k-sparse signal increases. The converse applies when α is increased. On the other hand, increasing β (in an appropriate range; see Section 3.3 for details) allows a higher threshold on the sparsity level, but again at the price of a smaller probability of success.

2.2. Extension to Compressible Signals

Theorem 1 can be easily extended to weakly sparse or compressible signals. We consider the best k-term approximation x_k of x obtained by keeping only the k largest entries of x and setting the others to zero. Obviously, k = |I(x_k)|. This is equivalently defined using a thresholding:

    x_k[i] = x[i] if |x[i]| ≥ T, and x_k[i] = 0 otherwise.   (9)

A signal is generally considered as compressible if the residual x_k − x is small. For sparsistency to make sense in this compressible case, additional assumptions are required, namely that the largest components x_k of the signal are significantly larger than the residual x_k − x. This is made formal in the following theorem.
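Before stating the theorem, note that the quantities in (6)–(9) are straightforward to evaluate numerically. The following small helper is a sketch under illustrative inputs; the function names are hypothetical, and the constants are those of (6)–(8).

```python
import numpy as np

def theorem1_quantities(n, p, alpha, beta, eps):
    # Sparsity bound (6), minimal signal level (7) and regularization parameter (8).
    k_max = alpha * beta * n / (2.0 * np.log(p))
    gamma = eps / np.sqrt(1.0 - alpha) * np.sqrt(2.0 * np.log(p) / n)
    T_min = 5.0 * gamma
    return k_max, T_min, gamma

def best_k_term(x, k):
    # Best k-term approximation of (9): keep the k largest entries in magnitude.
    xk = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    xk[idx] = x[idx]
    return xk

k_max, T_min, gamma = theorem1_quantities(n=8000, p=32000, alpha=0.5, beta=0.8, eps=1.0)
print(int(k_max), T_min, gamma)
print(best_k_term(np.random.default_rng(0).normal(size=10) ** 3, 3))
```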
Theorem 2. Let A, α, β and p be as in Theorem 1. We measure y = Ax + w, and let x_k be the best k-term approximation of x, where k satisfies (6). We denote

    Δ = (2/√((1−√α)(1+3√α))) √((2 log p)/n).

Suppose that

    ‖w‖₂ + 4 ‖x − x_k‖₂ ≤ ε,   (10)

that T, as defined in (9), is such that

    T ≥ 5 Δ ε,   (11)

and that

    ‖x − x_k‖∞ ≤
((1−√α)Δ/4.5) ε.   (12)

Then, with probability P(n, p, α, β) converging to 1 as n goes to infinity, the solution x⋆(γ) of the Lasso from measurements y with

    γ = Δ ε   (13)

is unique and satisfies supp(x⋆(γ)) = supp(x_k) and sign(x⋆(γ)) = sign(x_k).

Again, all the leading constants are explicit. Conditions (11) and (12) impose compressibility constraints on the signal, namely that the magnitude of the k largest components of x is well above the average magnitude ε/√n of the residual, and that the latter is "flat", since the ratio of its ℓ∞ and ℓ₂ norms should be small.

The proof (see Section 4) provides an explicit bound for P(n, p, α, β), showing that P(n, p, α, β) is greater than

    1 − e^{−0.1√(log n)} − 1/(2√(π log p)) − o(1/p) − o(e^{−0.1√(log n)}),

although once again this bound on the probability is far from optimal.

Theorem 2 encompasses the strictly sparse case of Theorem 1, which is easily recovered by letting x = x_k. The parameter α plays a similar role in both theorems. Furthermore, in Theorem 2, the Lasso solution becomes more tolerant to compressibility errors x − x_k as α decreases. This however comes at the price of a lower probability of success, as indicated in our proof.

2.3. Partial Support Recovery with Strictly Sparse Signals

In both previous theorems, the assumption on T plays a pivotal role: if T is too small, there is no way to distinguish the small components of x from the noise; see also the discussion and literature review in Section 1.3. Nevertheless, if no assumptions are made on T, one can still expect to partly recover the support of x. This is formalized in the following result.
Theorem 3. Let A, α and β be as in Theorem 1. We measure y = Ax + w, where x fulfills (6). Then with probability P(n, p, α, β) converging to 1 as n goes to infinity, the solution x⋆(γ) of the Lasso from measurements y with

    γ = (ε/√(1−α)) √((2 log p)/n)

is unique and satisfies supp(x⋆(γ)) ⊂ supp(x). Moreover, the Lasso solution is ℓ₂-consistent:

    ‖x̄ − x̄⋆(γ)‖₂ ≤ (2 + √(α/(β(1−α)))) ε.   (14)

The proof in Section 5 provides an explicit lower bound for P(n, p, α, β), and shows that P(n, p, α, β) is larger than

    1 − e^{−n(1−√β−√(k/n))²/2} − 1/(2√(π log p)).

As before, this bound on the probability is not optimal.

If γ is large enough, it is clear that supp(x⋆(γ)) ⊂ supp(x), since for γ ≥ ‖A^T y‖∞ one has x⋆(γ) = 0. Theorem 3 provides a parameter γ proportional to ε that ensures a partial support recovery without any assumption on T. It also gives a sharp upper bound on the ℓ₂-error of the Lasso solution. This result remains valid under the additional hypotheses of Theorem 1 or 2 allowing exact recovery of the support.

2.4. Relation to Prior Work

As we mentioned in Section 1.3, our work is closely related to [22, 27], but is different in many important ways that we summarize as follows.

• Deterministic vs random measurement matrices: the work of [22] considers deterministic matrices satisfying a weak incoherence condition. Our work focuses on the classical Gaussian ensemble.

• Asymptotic vs non-asymptotic analysis: the analysis in [27] applies to a high-dimensional setting where even the sparsity level k grows with the number of measurements n. As a result, k appears in the statements of the probabilities, which thus requires that k → +∞. This is very different from our setting as well as that of [22], where the probabilities depend solely on the dimensions of A. We believe that this is more natural in many applications.

• Random vs deterministic noise: in both previous works, the noise is stochastic (Gaussian in [22] and sub-Gaussian in [27]). In our work, we handle any noise with a finite ℓ₂-norm.

• Leading numerical constants: these are not always explicit and sharp in those works. The constant involved in the sparsity level upper-bound in [22, Theorem 1.3] is not given, whereas (6) gives an explicit and sharp bound. The bounds (7) and (8) on T and γ are similar to those given in [22, Theorem 1.3] once specialized to α = 3/4. In [27, Theorem 2], the constant appearing in the lower-bound on T is not given, whereas (7) provides an explicit expression that is shown to be reasonably good in Section 6.

• Compressible signals: to the best of our knowledge, the compressible case has not been covered in the literature, and Theorem 2 appears then as a distinctively novel result of this paper.

• ℓ₂-consistency: such a result is not given in those references. A bound on the ℓ₂-prediction error Ax − Ax⋆(γ) is proved in [22]. An ℓ∞-consistency is established in [27], which is an immediate consequence of sparsistency. Our method of proof differs significantly from the one used in [27], and in particular it naturally leads to the ℓ₂-consistency result.

• Exact and partial support recovery: in [22] the partial recovery case was not considered. In [27], exact and partial recovery are somewhat handled simultaneously, while we give two distinct results for each case.

ℓ₂-consistency. This property of the Lasso estimate has been widely studied by many authors under various sufficient conditions.
Theorem 3 may then be compared to this literature; we focus here on results based on the restricted isometry property (RIP) [59] and more or less similar variants in the literature; see the discussion in [34] and the review in [35].

The RIP results are uniform and ensure ℓ₂-stability of the Lasso estimate for all sufficiently sparse vectors from noisy measurements, whereas Theorem 3 guarantees that the Lasso estimate is ℓ₂-consistent for most sparse vectors and a given matrix. When A is Gaussian, the scaling of the sparsity bound is O(n/log(p/n)) for RIP-based results, which is better than the O(n/log p) in Theorem 3. Note that the scaling O(n) was derived in [60] when A belongs to the uniform spherical ensemble to ensure ℓ₂-stability of the Lasso estimate for most matrices A, although the leading constants are not given explicitly. However, the RIP is a worst-case analysis, and the price is that the leading constants in the sufficient sparsity bounds are overly small. In contrast, the leading numerical constants in our sparsity and ℓ₂-consistency upper-bounds are explicit and solely controlled by (α, β) ∈ [0,1)². For instance, it can be verified from our proof that the value of the sparsity upper-bound we provide is actually larger than the bounds obtained from the RIP for p up to e^{20}. Finally, the RIP is a deterministic property that turns out to be satisfied by many ensembles of random matrices other than the Gaussian. Our Theorem 3 could presumably be extended to sub-Gaussian matrices (e.g. using [61, Corollary V.2.1]), but this needs further investigation that we leave for future work.
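To close this section, the ℓ₂-consistency claim of Theorem 3 is simple to probe empirically. The sketch below compares the on-support error with the right-hand side of (14) on one random draw, using the same plain ISTA solver as in the sketch of Section 1.1; all sizes are illustrative and much smaller than those of Section 6.

```python
import numpy as np

def lasso_ista(A, y, gamma, n_iter=3000):
    # Plain iterative soft-thresholding, as in the earlier sketch.
    L = np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        z = x + A.T @ (y - A @ x) / L
        x = np.sign(z) * np.maximum(np.abs(z) - gamma / L, 0.0)
    return x

rng = np.random.default_rng(2)
n, p, alpha, beta, eps = 500, 2000, 0.5, 0.8, 0.1
k = int(alpha * beta * n / (2 * np.log(p)))              # sparsity allowed by (6)
A = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, p))
support = rng.choice(p, size=k, replace=False)
x0 = np.zeros(p)
x0[support] = rng.choice([-1.0, 1.0], size=k)
w = rng.normal(size=n)
w *= eps / np.linalg.norm(w)
gamma = eps / np.sqrt(1 - alpha) * np.sqrt(2 * np.log(p) / n)   # gamma of Theorem 3
x_hat = lasso_ista(A, A @ x0 + w, gamma)

recovered = set(np.flatnonzero(np.abs(x_hat) > 1e-8))
err = np.linalg.norm((x0 - x_hat)[support])
bound = (2 + np.sqrt(alpha / (beta * (1 - alpha)))) * eps       # right-hand side of (14)
print(recovered <= set(support), err, bound)
```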
3. Proof of Support Identification of Exactly Sparse Signals
This section gives the proof of Theorem 1. Recall that x̄ is the restriction of x to its support I(x), and A_I the corresponding sub-matrix. We also denote the Moore-Penrose pseudo-inverse of A_I as A_I⁺ = (A_I^T A_I)^{-1} A_I^T.

3.1. Optimality Conditions and Candidate Solution

From classical convex analysis, the first order optimality conditions show that a vector x⋆ is a solution of the Lasso if and only if

    A_I^T (y − Ax⋆) = γ sign(x̄⋆)  and  |⟨a_j, y − Ax⋆⟩| ≤ γ for all j ∉ I,   (15)

where I = I(x⋆).

Hence, if the goal pursued is to ensure that I(x⋆) = I(x) = I and sign(x⋆) = sign(x), the only candidate solution of the Lasso is

    x̄⋆ = x̄ − γ (A_I^T A_I)^{-1} sign(x̄) + A_I⁺ w.   (16)

Consequently, this vector x⋆ is a solution of the Lasso if and only if the two following conditions are met:

    sign(x̄) = sign(x̄⋆),   (C1)
    ∀ j ∉ I(x), |⟨a_j, γ d(x) + P_{V_I⊥}(w)⟩| ≤ γ,   (C2)

where V_I = Span(A_I), P_{V_I⊥} is the orthogonal projection on the subspace orthogonal to V_I, and d(x) is defined in (4).

Sections 3.2 and 3.3 show that under the hypotheses of Theorem 1, conditions (C1) and (C2) are in force with probability converging to 1 as n goes to infinity. This will thus conclude the proof of Theorem 1.

3.2. Checking Condition (C1)

To ensure that sign(x̄) = sign(x̄⋆), it is sufficient that

    ‖γ (A_I^T A_I)^{-1} sign(x̄) + A_I⁺ w‖∞ ≤ T.   (17)

We prove that this is indeed the case w.o.p.

Lemma 4, whose proof is given in Appendix A.3, shows that γ = (ε/√(1−α)) √((2 log p)/n) ≤ T/5 implies

    γ ‖(A_I^T A_I)^{-1} sign(x̄)‖∞ ≤ T (1 + 4√α)/5

with probability greater than 1 − 2kp^{−2.25} − 2e^{−nαβ(0.9√2−1)²/(4 log p)}.

To prove (17), we will now bound ‖A_I⁺ w‖∞. To this end, we split it as follows:

    ‖A_I⁺ w‖∞ = D₁ × D₂ × D₃ × ‖w‖₂,

where

    D₁ = ‖A_I⁺ w‖∞/‖A_I⁺ w‖₂,  D₂ = ‖A_I⁺ w‖₂/‖A_I^T w‖₂,  D₃ = ‖A_I^T w‖₂/‖w‖₂.

Bounding D₁. As A and w are independent, Lemma 5, proved in Appendix A.4, shows that the distribution of A_I⁺ w is invariant under orthogonal transforms on R^k. Therefore the random variable A_I⁺w/‖A_I⁺w‖₂ is uniformly distributed on the unit ℓ₂ sphere of R^k. Using the concentration Lemma 7, detailed in Appendix B, with ǫ = √(2/k) (n log k)^{1/4}, it follows that

    P( D₁ ≤ √(2/k) (n log k)^{1/4} ) ≥ 1 − 2k e^{−√(n log k)} ≥ 1 − max(n^{−3}, 4e^{−√n}).   (18)

One can notice that D₁ ≤ 1 actually gives a better bound if k is small compared to n. Moreover, the bound on the probability is 1 − n^{−3} for large k.

Bounding D₂. D₂ is bounded by the maximal eigenvalue of (A_I^T A_I)^{-1}. Indeed, owing to Lemma 3 with t = 1 − √(k/n) − 1/√2, we arrive at

    P( D₂ ≤ 2 ) ≥ 1 − e^{−n(1 − 1/√2 − 1/√(2 log p))²/2}.   (19)

Bounding D₃. Let's write

    D₃² = (1/‖w‖₂²) Σ_{i∈I} |⟨a_i, w⟩|².

Since each ⟨a_i, w⟩ is a zero-mean Gaussian variable with variance ‖w‖₂²/n, the variable n‖A_I^T w‖₂²/‖w‖₂² follows a χ² distribution with k degrees of freedom. Therefore, in virtue of the concentration Lemma 8, stated in Appendix B, applied with δ = 2√(log n/log k), we obtain

    P( D₃ ≤ 2√((k/n)(log n/log k)) ) ≥ 1 − (1/(2√(πk))) e^{−k(√(log n/log k) − (log 2)/2 − (1/2)log(log n/log k))} ≥ 1 − e^{−0.1√(log n)}.

This last bound may be pessimistic; when k is large this probability is actually much bigger. This shows that w.o.p.,

    D₃ ≤ 2√((k/n)(log n/log k)).   (20)

Putting (18), (19) and (20) together, we conclude that

    ‖A_I⁺ w‖∞ ≤ 4ε √((2 log n)/n),   (21)

with probability greater than

    1 − e^{−0.1√(log n)} − e^{−n(1 − 1/√2 − 1/√(2 log p))²/2} − max(n^{−3}, 4e^{−√n}) − 2kp^{−2.25} − 2e^{−nαβ(0.9√2−1)²/(4 log p)},

which converges to 1 as n → +∞. In turn, the bound (21) becomes, under assumption (7) on T,

    ‖A_I⁺ w‖∞ ≤ T √(1−α)/10.

This shows that condition (C1) is in force with probability converging to 1 as n → +∞.

3.3. Checking Condition (C2)

Let's introduce the following vector:

    u = γ d(x) + P_{V_I⊥}(w),   (22)

which depends on both x and w. Clearly, to comply with (C2), we need to bound (⟨a_j, u⟩)_{j∉I} w.o.p. We will start by bounding ‖u‖₂.

Bounding ‖u‖₂. As d(x) ∈ V_I, the Pythagorean theorem yields

    ‖u‖₂² = γ² ‖d(x)‖₂² + ‖P_{V_I⊥}(w)‖₂².   (23)

Let S = sign(x̄). Then nk/‖d(x)‖₂² = nk/(S^T (A_I^T A_I)^{-1} S), since ‖d(x)‖₂² = S^T (A_I^T A_I)^{-1} S and ‖S‖₂² = k.
Since x and A are independent, Lemma 6, stated in Appendix B, shows that nk/‖d(x)‖₂² is χ²-distributed with n − k + 1 degrees of freedom. Thanks to Lemma 9, see Appendix B, it follows that for all δ > 0,

    P( nk/(n − k + 1) < (1 − δ) ‖d(x)‖₂² ) ≤ e^{(n−k+1) log(1−δ)/2}.

Since k/n ≤ 1/(2 log p), we obtain for p ≥ e^{1/(2δ)} (so that k/n ≤ δ),

    P( k < (1−δ)(1−δ/2) ‖d(x)‖₂² ) ≤ e^{n(4−2δ) log(1−δ)/8}.

Choosing δ such that (1−δ)(1−δ/2) > √β, we have

    P( ‖d(x)‖₂² ≤ k/√β ) ≥ 1 − e^{n(3−√β) log β/8}.

This shows that ‖d(x)‖₂² ≤ k/√β with probability converging to 1 as n → +∞. It is worthy to mention that the condition p > e^{1/(2(1−√β))} actually guarantees the existence of a suitable δ.

As P_{V_I⊥} is an orthogonal projector, we have ‖P_{V_I⊥}(w)‖₂ ≤ ‖w‖₂ ≤ ε. Together with (23), this shows that

    P( ‖u‖₂² ≤ γ² k/√β + ε² ) ≥ 1 − e^{n(3−√β) log β/8}.   (24)

Bounding max_{j∉I} |⟨u, a_j⟩|. For a fixed u, the random variables (⟨a_j, u⟩)_{j∉I} are zero-mean Gaussian variables with variance ‖u‖₂²/n. Using the bound (24), traditional arguments from the concentration of the maximum of Gaussian variables tell us that

    max_{j∉I} |⟨a_j, u⟩| ≤ √( (2 log p/n)(γ² k/√β + ε²) )   (25)

with a probability larger than 1 − e^{n(3−√β) log β/8} − 1/(2√(π log p)). In turn, this implies that condition (C2) is in force w.o.p. if

    √( (2 log p/n)(γ² k/√β + ε²) ) ≤ γ.

This holds if (ε/√(1−α)) √((2 log p)/n) ≤ γ.

This concludes the proof of Theorem 1, and shows that overall

    P(n, p, α, β) ≥ 1 − e^{−0.1√(log n)} − e^{−n(1 − 1/√2 − 1/√(2 log p))²/2} − max(n^{−3}, 4e^{−√n}) − 2kp^{−2.25} − 2e^{−nαβ(0.9√2−1)²/(4 log p)} − e^{n(3−√β) log β/8} − 1/(2√(π log p)).
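The distributional facts used in this proof lend themselves to quick numerical sanity checks. For instance, the following sketch (with illustrative sizes) verifies that nk/‖d(x)‖₂² behaves like a χ² variable with n − k + 1 degrees of freedom, as asserted by Lemma 6.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k, trials = 400, 20, 2000
samples = np.empty(trials)
for t in range(trials):
    A_I = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, k))
    S = rng.choice([-1.0, 1.0], size=k)                  # sign(x) on the support
    d_sq = S @ np.linalg.solve(A_I.T @ A_I, S)           # ||d(x)||_2^2 = S^T (A_I^T A_I)^{-1} S
    samples[t] = n * k / d_sq                            # should be chi^2 with n - k + 1 dof
print(samples.mean(), n - k + 1)                         # empirical mean vs. degrees of freedom
```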
4. Proof of Support Identification of Compressible Signals
To prove this theorem, we capitalize on the results of Section 3.1 by noting that y = Ax_k + A(x − x_k) + w := Ax_k + Ah + w, and by replacing x by x_k and w by w̃ = Ah + w. With this change of variables, it is then sufficient to check conditions (C1) and (C2), with the notable difference that the noise w̃ is not independent of A anymore. More precisely, w̃ is independent of (a_i)_{i∈I} but not of (a_j)_{j∉I}.

Condition (C1). Since this condition only depends on A_I, it is verified with probability converging to 1 as n → +∞, as in the proof of Theorem 1, provided that T ≥ 5γ and ‖w̃‖₂ ≤ (T/5) √((1−α)n/(2 log p)). The first condition is a direct consequence of assumptions (11) and (13). Moreover, ‖w̃‖₂ ≤ ‖w‖₂ + ‖Ah‖₂, where Ah is a zero-mean Gaussian vector whose entries are independent with variance ‖h‖₂²/n. Therefore n‖Ah‖₂²/‖h‖₂² has a χ² distribution with n degrees of freedom. We then derive from the concentration Lemma 8 that

    P( ‖Ah‖₂ ≤ 2‖h‖₂ ) ≥ 1 − (1/(3√(πn))) e^{−0.8n}.

Under assumptions (10)-(11), the last inequality implies that w.o.p.

    ‖w̃‖₂ ≤ ‖w‖₂ + 2‖h‖₂ ≤ ε ≤ T/(5Δ) ≤ (T/5) √((1−α)n/(2 log p)).

Condition (C1) is thus satisfied with a probability larger than

    1 − e^{−0.1√(log n)} − e^{−n(1 − 1/√2 − 1/√(2 log p))²/2} − max(n^{−3}, 4e^{−√n}) − 2kp^{−2.25} − 2e^{−nαβ(0.9√2−1)²/(4 log p)} − (1/(3√(πn))) e^{−0.8n}.

Condition (C2). For any j ∉ I, define the vector v_j = w̃ − h[j] a_j. In particular, v_j is independent of a_j. Condition (C2) now reads:

    ∀ j ∉ I, |⟨a_j, γ d(x_k) + P_{V_I⊥}(v_j) + h[j] P_{V_I⊥}(a_j)⟩| ≤ γ,

where the vector d(x_k) is defined by replacing x with x_k in (4). Similarly to (24), it can be shown that w.o.p.

    ‖γ d(x_k) + P_{V_I⊥}(v_j)‖₂² ≤ γ² k/√β + ‖v_j‖₂².

On the other hand, ‖v_j‖₂ ≤ ‖w̃‖₂ + ‖h‖∞ ‖a_j‖₂, and n‖a_j‖₂² is χ²-distributed with n degrees of freedom. Applying Lemma 8 to bound ‖a_j‖₂ by 2 for all j, and using similar arguments to those leading to (25), we get

    max_{j∉I} |⟨a_j, γ d(x_k) + P_{V_I⊥}(v_j)⟩| ≤ √( (2 log p/n)(γ² k/√β + (‖w‖₂ + 4‖h‖₂)²) )

with probability larger than 1 − ((p+1)/(3√(πn))) e^{−0.8n} − 1/(2√(π log p)), converging to 1 as n → +∞. It then follows from assumptions (10) and (13) that w.o.p.

    max_{j∉I} |⟨a_j, γ d(x_k) + P_{V_I⊥}(v_j)⟩| ≤ γ (1 + √α)/2.   (26)

As an orthogonal projector is a self-adjoint idempotent operator, we have for all j ≤ p,

    |h[j] ⟨a_j, P_{V_I⊥}(a_j)⟩| ≤ ‖h‖∞ ‖P_{V_I⊥}(a_j)‖₂²,

where ‖P_{V_I⊥}(a_j)‖₂² is the squared ℓ₂-norm of the projection of a Gaussian vector on the subspace V_I⊥, whose dimension is n − k. As V_I⊥ is independent of a_j for j ∉ I, n‖P_{V_I⊥}(a_j)‖₂² follows a χ² distribution with n − k degrees of freedom. Using Lemma 8 together with assumptions (12)-(13), the following bound holds w.o.p.:

    max_{j∉I} |h[j] ⟨a_j, P_{V_I⊥}(a_j)⟩| ≤ 1.5 ‖h‖∞ ≤ γ (1 − √α)/2.   (27)

In summary, (26) and (27) show that (C2) is fulfilled with probability larger than

    1 − (1/(3√(πn))) e^{−0.8n} − (2p/(3√(πn))) e^{−0.8n} − (2p/√(π(n−k))) e^{−0.05(n−k)}.

5. Proof of Partial Support Recovery

To prove the first part of Theorem 3, we need to show that, w.o.p., the extension x̃⋆(γ) to R^p of the solution of

    min_{x ∈ R^{|I|}} (1/2)‖y₀ − A_I x‖₂² + γ ‖x‖₁,   (28)
By definition, the support J of this extensionis included in I .Proving this assertion amounts to showing that x ( γ ) fulfills the necessary and sufficientoptimality conditions ( A T J ( y − Ax ( γ )) = γ sign (cid:16) x ( γ ) (cid:17) , ∀ l / ∈ J, |h a l , y − Ax ( γ ) i| ≤ γ. (29)Since y = P A I ( y ) and J ⊂ I , A T J ( y − Ax ( γ )) = A T J ( y − Ax ( γ )) . In addition, as x ( γ ) is the extension of the solution of (28), the optimality conditions associated to (28) yield ( A T J ( y − Ax ( γ )) = γ sign (cid:16) x ( γ ) (cid:17) , ∀ l ∈ ( I ∩ J c ) , |h a l , y − Ax ( γ ) i| ≤ γ. To complete the proof, it remains now to show that w.o.p. ∀ l / ∈ I, |h a l , y − Ax ( γ ) i| ≤ γ. (30)As in the proofs of Theorems 1 and 2, to bound these scalar products, the key argumentis the independence between the vectors ( a l ) l / ∈ I and the residual vector y − Ax ( γ ) .We first need the following intermediate lemma. Lemma 1.
Let A ∈ R^{n×k} be such that (A^T A) is invertible. Take x⋆(γ) as a solution of the Lasso from observations y ∈ R^n. The mapping f : R₊* → R₊, γ ↦ f(γ) = ‖y − Ax⋆(γ)‖₂²/γ², is well-defined and non-increasing.

Proof:
The authors in [8] and [58] independently proved that under the assumptions of the lemma:

• the solution x⋆(γ) of the Lasso is unique;
• there is a finite increasing sequence (γ_t)_{t≤K} with γ₀ = 0 and γ_K = ‖A^T y‖∞ such that for all t < K, the sign and the support of x⋆(γ) are constant on each interval (γ_t, γ_{t+1});
• x⋆(γ) is a continuous function of γ.

Moreover, x⋆(γ) with support J satisfies

    x̄⋆(γ) = A_J⁺ y − γ (A_J^T A_J)^{-1} sign(x̄⋆(γ)),   (31)

which implies that

    r(γ) := y − Ax⋆(γ) = P_{V_J⊥}(y) − γ A_J (A_J^T A_J)^{-1} sign(x̄⋆(γ)).

On each interval (γ_t, γ_{t+1}), r(γ) is an affine function of γ which can be written r(γ) = z − γ v, where z := P_{V_J⊥}(y) and v := A_J (A_J^T A_J)^{-1} sign(x̄⋆(γ)). As v ∈ V_J and z ∈ V_J⊥, the Pythagorean theorem allows to write, for γ ∈ (γ_t, γ_{t+1}),

    ‖r(γ)‖₂²/γ² = ‖z‖₂²/γ² + ‖v‖₂².   (32)

We then deduce that f(γ) = ‖r(γ)‖₂²/γ² is a non-increasing function of γ on each interval (γ_t, γ_{t+1}). By continuity of f, it follows that f is non-increasing on R₊*.

Remark 1. If (A_I^T A_I) is not invertible, the Lasso may have several solutions. Nevertheless, r(γ) is always uniquely defined and the lemma should also apply.

From Lemma 1, we deduce that ‖y − Ax̃⋆(γ)‖₂/γ is a non-increasing function of γ. Because y₀ ∈ V_I and A_I has full column-rank, we also have lim_{γ→0} x̃⋆(γ) = x₀, where on I the entries of x₀ are those of the unique vector of R^{|I|} such that A_I x̄₀ = y₀. Therefore,

    x₀[i] = x[i] + (A_I⁺ w)[i], for i ∈ I.   (33)

Since A_I is Gaussian and independent of x and w, the support of x₀ is almost surely equal to I. Hence there exists γ₁ > 0 such that if γ < γ₁, the support and the sign of x̃⋆(γ) are equal to those of x₀. More precisely, if γ < γ₁, x̃⋆(γ) satisfies

    x̄̃⋆(γ) = x̄₀ − γ (A_I^T A_I)^{-1} sign(x̄₀)

and

    r(γ) := y₀ − Ax̃⋆(γ) = γ A_I (A_I^T A_I)^{-1} sign(x̄₀).

It then follows that for γ ∈ (0, γ₁),

    ‖y₀ − Ax̃⋆(γ)‖₂/γ = ‖A_I (A_I^T A_I)^{-1} sign(x̄₀)‖₂.

Now, since ‖A_I (A_I^T A_I)^{-1} sign(x̄₀)‖₂² = ⟨(A_I^T A_I)^{-1} sign(x̄₀), sign(x̄₀)⟩, we deduce that for all γ > 0,

    ‖y₀ − Ax̃⋆(γ)‖₂/γ ≤ √(|I| ρ((A_I^T A_I)^{-1})),

where ρ((A_I^T A_I)^{-1}) is the spectral radius of (A_I^T A_I)^{-1}. Using Lemma 3 with β < (1 − √(k/n))² then leads to

    P( ‖y₀ − Ax̃⋆(γ)‖₂/γ ≤ √(k/β) ) ≥ 1 − e^{−n(1−√β−√(k/n))²/2}.   (34)

By the Pythagorean theorem and the fact that ‖P_{V_I⊥} w‖₂ ≤ ε, we have

    ‖y − Ax̃⋆(γ)‖₂² = ‖y − y₀‖₂² + ‖y₀ − Ax̃⋆(γ)‖₂² = ‖P_{V_I⊥} w‖₂² + ‖y₀ − Ax̃⋆(γ)‖₂² ≤ ε² + ‖y₀ − Ax̃⋆(γ)‖₂².

With similar arguments as those leading to (25), it can then be deduced that

    max_{l∉I} |⟨a_l, y − Ax̃⋆(γ)⟩| ≤ √( (2 log p/n)(ε² + γ² k/β) )   (35)

with probability larger than 1 − e^{−n(1−√β−√(k/n))²/2} − 1/(2√(π log p)). If k ≤ αβn/(2 log p) and γ ≥ (ε/√(1−α))√((2 log p)/n), then √((2 log p/n)(ε² + γ²k/β)) ≤ γ, and therefore inequality (30) is satisfied w.o.p. This ends the proof of the first part of the theorem.

Let's now turn to the proof of (14). To prove this inequality, we notice that for large γ, the Lasso solution x⋆(γ) is also the extension of the solution of (28) w.o.p., and we use the Lipschitz property of the mapping γ ↦ x⋆(γ). Indeed, by the triangle inequality,

    ‖x̄ − x̄⋆(γ)‖₂ ≤ ‖x̄ − x̄₀‖₂ + ‖x̄₀ − x̄⋆(γ)‖₂.   (36)
Recalling from (33) that x̄₀ − x̄ = A_I⁺ w, it follows that ‖x̄₀ − x̄‖₂ ≤ ε √(ρ((A_I^T A_I)^{-1})), which, using again Lemma 3, leads to the bound ‖x̄₀ − x̄‖₂ ≤ 2ε with probability larger than 1 − e^{−n(0.5 − √(k/n))²/2}.

For all γ > 0, x̃⋆(γ) obeys (31), and since lim_{γ→0} x̃⋆(γ) = x₀, we get that

    ‖x̄₀ − x̄⋆(γ)‖₂ ≤ γ max_{J⊂I, S∈{−1,1}^{|J|}} ‖(A_J^T A_J)^{-1} S‖₂.   (37)

For all J ⊂ I, the inclusion principle tells us that ρ((A_J^T A_J)^{-1}) ≤ ρ((A_I^T A_I)^{-1}). Furthermore, for all S ∈ {−1,1}^{|J|}, ‖S‖₂ ≤ √k. Using Lemma 3 once again implies that

    P( ‖x̄₀ − x̄⋆(γ)‖₂ ≤ γ √k/β ) ≥ 1 − e^{−n(1−√β−√(k/n))²/2}.

If γ = (ε/√(1−α))√((2 log p)/n) and k ≤ αβn/(2 log p), then w.o.p.

    ‖x̄₀ − x̄⋆(γ)‖₂ ≤ ε √(α/(β(1−α))).

Together with the bound on ‖x̄ − x̄₀‖₂, this proves (14) and concludes the proof.

6. Numerical Illustrations
This section aims at providing empirical support for the sharpness of our bounds by assessing experimentally the quality of the constants involved in Theorem 1. More specifically, we perform a probabilistic analysis of support and sign recovery, to show that the bounds (6), (8) and (7) are quite tight.

In all the numerical tests, we use problems of size (n, p) = (8000, 32000) and (n, p) = (3000, 32000), corresponding to moderate and high redundancies. These are realistic high-dimensional settings in agreement with signal and image processing applications. We perform a randomized analysis, where the probabilities of exact recovery of supports and signs (sparsistency) are computed by Monte-Carlo sampling with respect to a probability distribution on the measurement matrix, on k-sparse signals and on the noise w. As detailed in Section 1.1, the matrix A is drawn from the Gaussian ensemble. We assume that the non-zero entries x[i], for i ∈ I(x), of a vector x ∈ R^p are independent realizations of a Bernoulli variable taking equiprobable values {+T, −T}. We also assume that the noise w is drawn from the uniform distribution on the sphere {w ∈ R^n : ‖w‖₂ = ε}. Since only the SNR matters in the bounds, we fix ε = 1 and only vary the value of T.
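A condensed version of this Monte-Carlo protocol is sketched below, in Python rather than the authors' Matlab and with smaller illustrative sizes; each trial draws (A, x, w) as described above and tests conditions (C1) and (C2) of Section 3.1 directly.

```python
import numpy as np

def sparsistency_trial(rng, n, p, k, T, gamma, eps):
    # One Monte-Carlo draw: test conditions (C1) and (C2) of Section 3.1.
    A = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, p))
    I = rng.choice(p, size=k, replace=False)
    S = rng.choice([-1.0, 1.0], size=k)                 # equiprobable signs, entries +/- T
    w = rng.normal(size=n)
    w *= eps / np.linalg.norm(w)                        # uniform on the sphere ||w||_2 = eps
    A_I = A[:, I]
    G_inv_S = np.linalg.solve(A_I.T @ A_I, S)
    corr = np.linalg.solve(A_I.T @ A_I, A_I.T @ w)      # A_I^+ w
    # (C1): the candidate solution (16) keeps the signs of x.
    c1 = np.all(np.sign(T * S - gamma * G_inv_S + corr) == S)
    # (C2): off-support correlations with u = gamma d(x) + P_{V_I-perp}(w) stay below gamma.
    u = gamma * (A_I @ G_inv_S) + (w - A_I @ corr)
    off = np.setdiff1d(np.arange(p), I)
    c2 = np.max(np.abs(A[:, off].T @ u)) <= gamma
    return c1 and c2

rng = np.random.default_rng(4)
n, p, alpha, eps = 500, 2000, 0.5, 1.0                  # illustrative, smaller than in the text
gamma0 = eps / np.sqrt(1 - alpha) * np.sqrt(2 * np.log(p) / n)
T = 5 * gamma0
for k in (5, 15, 30, 60):
    rate = np.mean([sparsistency_trial(rng, n, p, k, T, T / 2, eps) for _ in range(50)])
    print(k, rate)
```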
[Figure 1 about here: two panels, (n, p) = (8000, 32000) on the left and (n, p) = (3000, 32000) on the right, each showing the probability of sparsistency as a function of k, with curves for γ = T/2, T/4, T/6 and T/8.]
Figure 1: Probability of sparsistency as a function of k, for α = 0.5. The vertical lines correspond to our sparsistency bound k_β, from left to right, for β = 0.2, 0.4, 0.6 and 0.8.

Challenging the sparsity bound (6). We first evaluate, for α = 0.5 and for a varying value of k, the probability of sparsistency given that

    T = 5 (ε/√(1−α)) √((2 log p)/n)  and  γ = T/2,   (38)

which are values in accordance with the bounds (7) and (8). In order to compute this probability numerically, for each k we generate sparse signals x with ‖x‖₀ = k, and check whether conditions (C1) and (C2) defined in Section 3.1 are satisfied. Figure 1 shows how this probability decays when k increases.⁵
⁵ The Matlab code to reproduce the figures is freely available for download.

The vertical lines in Figure 1 correspond to the critical value

    k_β = αβ n/(2 log p),   (39)

as identified by the bound (6). The estimated probability exhibits a typical phase transition that is located precisely around the critical value k_β for β close to one. This shows that our bound is quite sharp. We also display the same probability curve for other, less conservative, values of γ ∈ {T/4, T/6, T/8}, which improves slightly the probability with respect to γ = T/2.

Challenging the regularization parameter value (8). We evaluate, for (α, β) = (0.5, 0.8), the probability of sparsistency using a value of γ different from

    γ₀ = (ε/√(1−α)) √((2 log p)/n)   (40)

given in (8), for which Theorem 1 is valid. We use the critical sparsity level k = k_β defined in (39). To study only the influence of γ, we use a SNR that is infinite, meaning that ε is negligible in comparison with T. This implies in particular that in this regime, only condition (C2) has to be checked to estimate the probability of sparsistency.

Figure 2 shows the increase in this probability as the ratio γ/γ₀ increases. This makes sense because the signal is large with respect to the noise, so that a large threshold should be preferred. One can see that at the critical value γ = γ₀ suggested by Theorem 1, this probability is close to 1. This again confirms that the value (8) of γ is quite sharp.

[Figure 2 about here: two panels, (n, p) = (8000, 32000) on the left and (n, p) = (3000, 32000) on the right.]
Figure 2: Probability of support recovery for large T, as a function of γ/γ₀, for k = k_β and (α, β) = (0.5, 0.8).

Challenging the signal-to-noise ratio (7). Lastly, we estimate, for (α, β) = (0.5, 0.8), the minimal signal level T that is required to ensure the inclusion of the support, meaning that I(x⋆(γ)) ⊂ I(x). We use the critical sparsity k = k_β and γ = γ₀, with k_β and γ₀ as defined respectively in (39) and (40). Since we are only interested in support inclusion, it is only needed to check condition (C2).

The bound in (7) suggests that T ≥ 5γ₀ is enough. Figure 3 however shows that this bound is pessimistic, and that T ≥ γ₀ appears to be enough to guarantee the support inclusion with high probability. A few reasons may explain this sub-optimality.

• There is no guarantee that the concentration lemmas we use are optimal.
• The limit ratio T/ε relies mainly on Lemma 4, and especially on the bound 1 + 4√b in it. This bound can be improved in at least three ways.

– Using the same proof, the bound can be slightly enhanced by decreasing the probability of success.
– The result in the lemma is non-asymptotic. The bound and the probability were computed to be valid for all α ≤ 1, β ≤ 1 and for all p ≥ 2000. With the values used in the numerical experiments, and decreasing a bit the probability of success, the bound can turn into 1 + 2.5√b, yielding a better bound T ≥ 4.5γ₀.
– In the proof of Lemma 4, the inequality ‖B_i‖₂ ≤ ρ(B) is used, where ρ(B) is the spectral radius of B. This bound is valid for any matrix, but one might perhaps do better by exploiting the Gaussianity of the measurement matrix.

[Figure 3 about here: two panels, (n, p) = (8000, 32000) on the left and (n, p) = (3000, 32000) on the right.]
Figure 3: Probability of support inclusion as a function of T/γ₀, for k = k_β and (α, β) = (0.5, 0.8).

7. Conclusion
This paper has presented a novel analysis of the sparsistency of the Lasso from noisy Gaussian measurements. We derived sharp bounds on the sparsity of the signal that guarantee sparsistency with high probability. This result is extended to handle compressible signals and to establish sharp ℓ₂-consistency. A distinctive feature of our analysis is that it provides explicit constants for the three key parameters of the problem: the sparsity of the signal, the minimal signal-to-noise ratio and the Lasso regularization parameter. Numerical results support the claim that these constants are either sharp or at least reasonably well behaved.

Appendix A. Properties of Wishart Matrices
Appendix A.1. Signs of non-diagonal entries of an inverse Wishart matrix
Lemma 2. If B ∈ R^{k×k} is the inverse of a Wishart matrix, then for all i ≤ k, the variables (sign(B_{i,j}), j ≠ i) form a Rademacher sequence, that is, they are independent and uniformly distributed on {−1, 1}. Moreover, this sequence is independent of B_{i,i} and of (|B_{i,j}|)_{j≠i}.

Proof: If B = (B_{i,j})_{i≤k, j≤k} ∈ R^{k×k} is the inverse of a Wishart matrix, then B = (A^T A)^{-1}, where A ∈ M_{n,k}(R) is a Gaussian matrix. Let E ∈ M_{k,k}(R) be diagonal and such that |E_{i,i}| = 1 for all 1 ≤ i ≤ k. Then (AE)^T AE = E A^T A E, hence ((AE)^T (AE))^{-1} = E (A^T A)^{-1} E. Therefore the entries of C = ((AE)^T AE)^{-1} are C_{i,j} = E_{i,i} E_{j,j} B_{i,j} for 1 ≤ i, j ≤ k.

But A and AE have the same law, hence B and C also have the same law. Hence for all (ǫ_j)_{j≤k, j≠i} ∈ {−1,1}^{k−1}, the laws of (B_{i,1}, ..., B_{i,k}) and (ǫ₁ B_{i,1}, ..., B_{i,i}, ..., ǫ_k B_{i,k}) are the same. This implies that the variables (sign(B_{i,j}), j ≠ i) form a Rademacher sequence, and this sequence is independent of B_{i,i} and of (|B_{i,j}|)_{j≠i}.

Appendix A.2. Extreme eigenvalues of a Wishart matrix
The proof of the following lemma can be found in [62, page 42].
Lemma 3. If A ∈ R^{n×k} is a Gaussian matrix whose coefficients are centered with variance 1/n, then the maximal and minimal eigenvalues of the Wishart matrix B = A^T A satisfy, for all t > 0,

    P( λ_max(B) ≥ (1 + √(k/n) + t)² ) ≤ e^{−nt²/2}  and  P( λ_min(B) ≤ (1 − √(k/n) − t)² ) ≤ e^{−nt²/2}.
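These deviation inequalities are easy to check by simulation; the following sketch (illustrative sizes) compares the observed extreme eigenvalues of A^T A with the centering terms (1 ± √(k/n))².

```python
import numpy as np

rng = np.random.default_rng(5)
n, k, trials = 1000, 50, 200
lmin = np.empty(trials)
lmax = np.empty(trials)
for t in range(trials):
    A = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, k))
    ev = np.linalg.eigvalsh(A.T @ A)                   # eigenvalues of the Wishart matrix
    lmin[t], lmax[t] = ev[0], ev[-1]
r = np.sqrt(k / n)
print(lmax.max(), (1 + r) ** 2)                        # concentration near (1 + sqrt(k/n))^2
print(lmin.min(), (1 - r) ** 2)                        # concentration near (1 - sqrt(k/n))^2
```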
Appendix A.3. Sup-norm of a projected Rademacher sequence

Lemma 4. If C ∈ R^{n×k} is a Gaussian matrix with k ≤ nb/(2 log p) for some 0 < b ≤ 1, and if S ∈ {−1,1}^k is drawn independently of C, then, if p ≥ 2000,

    P( ‖(C^T C)^{-1} S‖∞ ≤ 1 + 4√b ) ≥ 1 − 2kp^{−2.25} − 2e^{−nb(0.9√2−1)²/(4 log p)}.

Proof:
We use the following splitting:

    (C^T C)^{-1} = I + ((C^T C)^{-1} − I) = I + B.

This shows that ‖(C^T C)^{-1} S‖∞ ≤ ‖S‖∞ + ‖BS‖∞ = 1 + ‖BS‖∞.

One can then observe that (BS)[i] = Σ_{j≤k} |B_{i,j}| S[j] sign(B_{i,j}); one has B_{i,i} > −1, and according to Lemma 2, for a given i, the variables (sign(B_{i,j}))_{j≠i} form a Rademacher sequence (this means that they are independent and uniformly distributed on {−1,1}), and this sequence is independent of B_{i,i} and of (|B_{i,j}|)_{j≠i}. Hence one can apply Hoeffding's Lemma 10 (multiplying the line by an independent variable uniform on {−1,1} to take care of the fact that sign(B_{i,i}) is not uniformly distributed), thus getting for any i ≤ k and any t > 0,

    P( |Σ_{j=1}^k B_{i,j} S[j]| ≥ t ‖B_i‖₂ ) ≤ 2e^{−t²/2}.   (41)

Now, for all i ≤ k, ‖B_i‖₂ ≤ ρ(B), where ρ(B) is the spectral radius of B. Using Lemma 3 with t = (0.9√2 − 1)√(b/(2 log p)) and the fact that k/n ≤ b/(2 log p), we get

    P( λ_min(C^T C) ≤ (1 − 0.9√(b/log p))² ) ≤ e^{−nb(0.9√2−1)²/(4 log p)},

and similarly for λ_max. It finally follows that, with probability larger than 1 − 2e^{−nb(0.9√2−1)²/(4 log p)},

    ρ(B) ≤ max( |(1 + 0.9√(b/log p))^{−2} − 1|, |(1 − 0.9√(b/log p))^{−2} − 1| ).

In particular, taking log(p)/b ≥ (17−√2)²/32 ≃ 7.6 leads to ρ(B) ≤ 1.85√(b/log p) with probability greater than 1 − 2e^{−nb(0.9√2−1)²/(4 log p)}.

Using this bound in (41) with t = 1.5√(2 log(p)) yields

    P( ‖BS‖∞ ≥ 4√b ) ≤ P( ‖BS‖∞ ≥ t ρ(B) and ρ(B) ≤ 1.85√(b/log p) ) + P( ρ(B) ≥ 1.85√(b/log p) ) ≤ 2kp^{−2.25} + 2e^{−nb(0.9√2−1)²/(4 log p)}.

If we set log(p)/b ≥ 7.6, the following holds:

    P( ‖(C^T C)^{-1} S‖∞ ≤ 1 + 4√b ) ≥ 1 − 2kp^{−2.25} − 2e^{−nb(0.9√2−1)²/(4 log p)}.
Remark 2. It is worth noting that if log(p)/b ≥ 25, as in the numerical experiments (b = 0.4, p = 32000), one can adapt this proof and, by losing a bit on the probability (i.e. applying the concentration lemmas with smaller values of t), one can get ‖(C^T C)^{-1} S‖∞ ≤ 1 + 2.5√b w.o.p.

Appendix A.4. Rotation invariance

Lemma 5. If C ∈ R^{n×k} is a Gaussian matrix, and w ∈ R^n is independent of C, the law of C⁺w is invariant under orthogonal transforms on R^k.

Proof: If C ∈ R^{n×k} is a Gaussian matrix, then for any orthogonal matrix U ∈ R^{k×k}, D = CU and C have the same distribution. The laws of D⁺w and C⁺w are thus the same. Since for all w one has D⁺w = U^{-1} C⁺w, the law of U^{-1} C⁺w is the same as that of C⁺w.

Appendix A.5. Distribution of a quadratic form

The following lemma is a consequence of [63, Theorem 3.2.12].
Lemma 6. If B is a Wishart matrix as described in Lemma 3, then for all X ∈ R^k independent of B, the random variable n‖X‖₂²/(X^T B^{-1} X) follows a χ² distribution with n − k + 1 degrees of freedom.

Appendix B. Concentration inequalities
The following lemma is well known; a proof can be found in [64].
Lemma 7. Let μ_k denote the uniform probability on the unit sphere S^{k−1} in R^k, and let A ⊂ S^{k−1} be such that μ_k(A) ≥ 1/2. Then μ_k({x ∈ S^{k−1}, d(x, A) ≤ ǫ}) ≥ 1 − 2e^{−kǫ²/2}. As a corollary, μ_k({x ∈ S^{k−1}, |x₁| ≤ ǫ}) ≥ 1 − 2e^{−kǫ²/2}.

The following lemma is due to Cai and Silverman, see [65].
Lemma 8. If X follows a χ² distribution with k degrees of freedom, then for all δ > 0,

    P( X > (1 + δ) k ) ≤ (1/(δ√(πk))) e^{−k(δ − log(1+δ))/2}.

The following lemma is due to Hoeffding, see [66].
Lemma 9. If X follows a χ² distribution with k degrees of freedom, then for all δ > 0,

    P( X < (1 − δ) k ) ≤ e^{k log(1−δ)/2}.

The following lemma can be obtained by applying the Chernoff-Hoeffding inequality.
Lemma 10. If (ε_i)_{i≤k} is a Rademacher sequence, then for all a = (a_i)_{i≤k} ∈ R^k and for all t > 0,

    P( |Σ_{i=1}^k ε_i a_i| ≥ t ‖a‖₂ ) ≤ 2e^{−t²/2}.

References

[1] E. Candès, J. Romberg, T. Tao, Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information, IEEE Trans. Info. Theory 52 (2) (2006) 489–509.
[2] E. Candès, T. Tao, Near-optimal signal recovery from random projections: Universal encoding strategies?, IEEE Trans. Info. Theory 52 (12) (2006) 5406–5425.
[3] D. Donoho, Compressed sensing, IEEE Trans. Info. Theory 52 (4) (2006) 1289–1306.
[4] S. S. Chen, D. Donoho, M. Saunders, Atomic decomposition by basis pursuit, SIAM Journal on Scientific Computing 20 (1) (1998) 33–61.
[5] R. Tibshirani, Regression shrinkage and selection via the Lasso, Journal of the Royal Statistical Society 58 (1) (1996) 267–288.
[6] E. J. Candès, T. Tao, Rejoinder: the Dantzig selector: statistical estimation when p is much larger than n, Annals of Statistics 35 (6) (2007) 2392–2404.
[7] P. J. Bickel, Y. Ritov, A. Tsybakov, Simultaneous analysis of lasso and Dantzig selector, Annals of Statistics 37 (2009) 1705–1732.
[8] M. R. Osborne, B. Presnell, B. A. Turlach, On the lasso and its dual, Journal of Computational and Graphical Statistics 9 (2) (2000) 319–337.
[9] B. Efron, T. Hastie, I. Johnstone, R. Tibshirani, Least angle regression, Annals of Statistics 32 (2) (2004) 407–499.
[10] D. L. Donoho, Y. Tsaig, Fast solution of ℓ₁-norm minimization problems when the solution may be sparse, IEEE Trans. Info. Theory 54 (11) (2008) 4789–4812.
[11] M. Figueiredo, R. Nowak, An EM algorithm for wavelet-based image restoration, IEEE Trans. Image Proc. 12 (8) (2003) 906–916.
[12] I. Daubechies, M. Defrise, C. De Mol, An iterative thresholding algorithm for linear inverse problems with a sparsity constraint, Commun. on Pure and Appl. Math. 57 (2004) 1413–1541.
[13] J. Bect, L. Blanc-Féraud, G. Aubert, A. Chambolle, An ℓ₁-unified variational framework for image restoration, in: Proc. of ECCV04, Springer-Verlag, 2004, pp. Vol IV: 1–13.
[14] P. L. Combettes, V. R. Wajs, Signal recovery by proximal forward-backward splitting, SIAM Journal on Multiscale Modeling and Simulation 4 (4) (2005) 1168–1200.
[15] M. A. T. Figueiredo, R. D. Nowak, S. J. Wright, Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems, IEEE Journal of Selected Topics in Signal Processing 1 (4) (2007) 586–598.
[16] J. M. B. Dias, M. A. T. Figueiredo, A new TwIST: Two-step iterative shrinkage/thresholding algorithms for image restoration, IEEE Trans. Image Proc. 16 (12) (2007) 2992–3004.
[17] Y. Nesterov, Gradient methods for minimizing composite objective function, CORE Discussion Papers 2007076, Université catholique de Louvain, Center for Operations Research and Econometrics (CORE) (Sep. 2007).
[18] A. Beck, M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, Journal on Imaging Sciences 2 (1) (2009) 183–202.
[19] P. L. Combettes, J.-C. Pesquet, A Douglas-Rachford splitting approach to nonsmooth convex variational signal recovery, IEEE Journal of Selected Topics in Signal Processing 1 (4) (2007) 564–574.
[20] M. Fadili, J.-L. Starck, Monotone operator splitting for fast sparse solutions of inverse problems, in: Proc. of IEEE ICIP, Cairo, Egypt, 2009.
[21] J.-L. Starck, F. Murtagh, M. Fadili, Sparse Signal and Image Processing: Wavelets, Curvelets and Morphological Diversity, Cambridge University Press, Cambridge, UK, 2010.
Fadili, Sparse Signal and Image Processing: Wavelets,Curvelets and Morphological Diversity, Cambridge University Press, Cambridge, UK,2010.[22] E. J. Candès, Y. Plan, Near-ideal model selection by ℓ minimization, Annals ofStatistics 37 (5A) (2009) 2145–2177. 2523] D. L. Donoho, M. Elad, V. N. Temlyakov, Stable recovery of sparse overcompleterepresentations in the presence of noise, IEEE Trans. Info. Theory 52 (1) (2006) 6–18.[24] N. Meinshausen, P. Bühlmann, High-dimensional graphs and variable selection withthe lasso, Ann. Statist. 34 (3) (2006) 1436–1462.[25] E. Greenshtein, Best subset selection, persistence in high-dimensional statistical learn-ing and optimization under ℓ constraint, Annals of Statistics 34 (2006) 2367–2386.[26] J. A. Tropp, Just relax: convex programming methods for identifying sparse signalsin noise, IEEE Trans. Info. Theory 52 (3) (2006) 1030–1051.[27] M. J. Wainwright, Sharp thresholds for high-dimensional and noisy sparsity recoveryusing ℓ -constrained quadratic programming (lasso), IEEE Trans. Info. Theory 55 (5)(2009) 2183–2202.[28] P. Zhao, B. Yu, On model selection consistency of lasso, J. Mach. Learn. Res. 7 (2006)2541–2563.[29] H. Zou, The adaptive lasso and its oracle properties, Journal of the American Statis-tical Association 101 (476) (2006) 1418–1429.[30] J. Fuchs, Recovery of exact sparse representations in the presence of bounded noise,IEEE Trans. Info. Theory 51 (10) (2005) 3601–3608.[31] F. Bunea, Consistent selection via the lasso for high dimensional approximating re-gression models, in: Pushing the Limits of Contemporary Statistics: Contributionsin Honor of Jayanta K. Ghosh, Vol. 3, Institute of Mathematical Statistics, 2008, pp.122–137.[32] S. Zhou, J. D. Lafferty, L. A. Wasserman, Compressed and privacy-sensitive sparseregression, IEEE Trans. Info. Theory 55 (2) (2009) 846–866.[33] J.-J. Fuchs, On sparse representations in arbitrary redundant bases, IEEE Trans. Info.Theory 50 (6) (2004) 1341–1344.[34] N. Meinshausen, B. Yu, Lasso-type recovery of sparse representations for high-dimensional data, Ann. Statist. 37 (1) (2009) 246—270.[35] S. A. van de Geer, P. Bühlmann, On the conditions used to prove oracle results forthe lasso, Electron. J. Statist. 3 (2009) 1360–1392.[36] F. R. Bach, Consistency of the group lasso and multiple kernel learning, J. Mach.Learn. Res. 9 (2008) 1179–1225.[37] Y. Nardi, A. Rinaldo, On the asymptotic properties of the group lasso estimator forlinear models, Electron. J. Statist. 2 (2008) 605–633.[38] M. Yuan, Y. Lin, Model selection and estimation in regression with grouped variables,Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68 (1)(2006) 49–67.[39] J. A. Tropp, Norms of random submatrices and sparse approximation, C. R. Math.Acad. Sci. 346 (2008) 1271–1274. 2640] D. Omidiran, M. J. Wainwright, High-dimensional subset recovery in noise: Sparsifiedmeasurements without loss of statistical efficiency, Tech. Rep. 753, UC Berkeley (2008).[41] E. J. Candes, M. B. Wakin, S. P. Boyd, Enhancing sparsity by reweighted L1 mini-mization, J. Fourier Anal. Appl. 14 (5) (2008) 877–905.[42] J. Huang, S. Ma, C.-H. Zhang, Adaptive lasso for sparse high dimensional regressionmodels, Tech. rep., Univ. of Iowa (2006).[43] J. Fan, J. Lv, A selective overview of variable selection in high dimensional featurespace (invited review article), To appear in Statistica Sinica.[44] T. 
Zhang, Some sharp performance bounds for least squares regression with ℓ l regu-larization, Annals of Statistics 37 (2009) 2109–2144.[45] L. Wasserman, K. Roeder, High dimensional variable selection, Annals of statistics 37(2009) 2178–2201.[46] S. A. van de Geer, P. Bühlmann, S. Zhou, Prediction and variable selection with theadaptive lasso, Tech. Rep. arXiv:1001.5176v2 (2010).[47] M. J. Wainwright, Information-theoretic limits on sparsity recovery in the high-dimensional and noisy setting, IEEE Trans. Info. Theory 55 (12) (2009) 5728–5741.[48] A. K. Fletcher, S. Rangan, V. K. Goyal, Necessary and sufficient conditions on sparsitypattern recovery, IEEE Trans. on Information Theory 55 (12) (2009) 5758–5772.[49] M. Akçakaya, V. Tarokh, Shannon theoretic limits on noisy compressive sampling,IEEE Trans. on Information Theory 56 (1) (2010) 492–504.[50] G. Reeves, M. Gastpar, Sampling bounds for sparse support recovery in the presenceof noise, in: Proceedings IEEE Int. Symp. on Inform. Theory, 2008, pp. 2187–2191.[51] W. Wang, M. J. Wainwright, K. Ramchandran, Information-theoretic limits on sparsesupport recovery: Dense versus sparse measurements, IEEE Trans. on InformationTheory 56 (6) (2010) 2967–2979.[52] S. Aeron, V. Saligrama, M. Zhao, Information theoretic bounds for compressed sens-ing, IEEE Trans. on Information Theory 56 (10) (2010) 5111–5130.[53] V. Saligrama, M. Zhao, Thresholded basis pursuit: An lp algorithm for achieving op-timal support recovery for sparse and approximately sparse signals from noisy randommeasurements, Tech. Rep. arxiv 0809.4883v3 (2010).[54] G. Reeves, M. Gastpar, Approximate sparsity pattern recovery: Information-theoreticlower bounds, Tech. Rep. arXiv:1002.4458v1 (2010).[55] A. Hormati, A. Karbasi, S. Mohajer, M. Vetterli, An estimation theoretic approach forsparsity pattern recovery in the noisy setting, Tech. Rep. LCAV-ARTICLE-2009-014,EPFL (2009).[56] P. Tune, S. R. Bhaskaran, S. Hanly, Number of measurements in sparse signal recovery,in: Proceedings IEEE Int. Symp. on Inform. Theory, 2009, pp. 16–20.[57] K. R. Rad, Sharp sufficient conditions on exact sparsity pattern recovery, Tech. Rep.Preprint arXiv:0910.0456v3 (2009). 2758] C. Dossal, A necessary and sufficient condition for exact recovery by ℓ minimization,Tech. Rep. Hal-00164738 (2007).[59] E. Candès, T. Tao, Decoding by linear programming, IEEE Trans. Info. Theory 51 (12)(2005) 4203–4215.[60] D. Donoho, For most large underdetermined systems of linear equations, the minimal ℓ1