Model Selection for High-Dimensional Regression under the Generalized Irrepresentability Condition
MModel Selection for High-Dimensional Regression under theGeneralized Irrepresentability Condition
Adel Javanmard ∗ and Andrea Montanari † May 3, 2013
Abstract
In the high-dimensional regression model a response variable is linearly related to p covariates,but the sample size n is smaller than p . We assume that only a small subset of covariates is ‘active’(i.e., the corresponding coefficients are non-zero), and consider the model-selection problem ofidentifying the active covariates.A popular approach is to estimate the regression coefficients through the Lasso ( (cid:96) -regularizedleast squares). This is known to correctly identify the active set only if the irrelevant covariatesare roughly orthogonal to the relevant ones, as quantified through the so called ‘irrepresentability’condition. In this paper we study the ‘Gauss-Lasso’ selector, a simple two-stage method that firstsolves the Lasso, and then performs ordinary least squares restricted to the Lasso active set.We formulate ‘generalized irrepresentability condition’ (GIC), an assumption that is substan-tially weaker than irrepresentability. We prove that, under GIC, the Gauss-Lasso correctly recov-ers the active set. Contents n = ∞ problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113.2 The high-dimensional problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 ∗ Department of Electrical Engineering, Stanford University. Email: [email protected] † Department of Electrical Engineering and Department of Statistics, Stanford University. Email: [email protected] a r X i v : . [ m a t h . S T ] M a y Proof of Theorems 2.5 and 2.7 15
In linear regression, we wish to estimate an unknown but fixed vector of parameters θ ∈ R p from n pairs ( Y , X ) , ( Y , X ) , . . . , ( Y n , X n ), with vectors X i taking values in R p and response variables Y i given by Y i = (cid:104) θ , X i (cid:105) + W i , W i ∼ N (0 , σ ) , (1)where (cid:104) · , · (cid:105) is the standard scalar product.In matrix form, letting Y = ( Y , . . . , Y n ) T and denoting by X the design matrix with rows X T , . . . , X T n , we have Y = X θ + W , W ∼ N (0 , σ I n × n ) . (2)In this paper, we consider the high-dimensional setting in which the number of parameters exceedsthe sample size, i.e., p > n , but the number of non-zero entries of θ is smaller than p . We denoteby S ≡ supp( θ ) ⊆ [ p ] the support of θ , and let s ≡ | S | . We are interested in the ‘model selection’problem, namely in the problem of identifying S from data Y , X .In words, there exists a ‘true’ low dimensional linear model that explains the data. We want toidentify the set S of covariates that are ‘active’ within this model. This problem has motivated alarge body of research, because of its relevance to several modern data analysis tasks, ranging fromsignal processing [Don06, CRT06] to genomics [PZB +
10, SK03]. A crucial step forward has been thedevelopment of model-selection techniques based on convex optimization formulations [Tib96, CD95,CT07]. These formulations have lead to computationally efficient algorithms that can be applied tolarge scale problems. Such developments pose the following theoretical question:
For which vectors θ , designs X , and noise levels σ , the support S can be identified, with high probability, throughcomputationally efficient procedures? The same question can be asked for random designs X and, inthis case, ‘high probability’ will refer both to the noise realization W , and to the design realization X . In the rest of this introduction we shall focus –for the sake of simplicity– on the deterministicsettings, and refer to Section 3 for a treatment of Gaussian random designs.The analysis of computationally efficient methods has largely focused on (cid:96) -regularized leastsquares, a.k.a. the Lasso [Tib96]. The Lasso estimator is defined by (cid:98) θ n ( Y, X ; λ ) ≡ arg min θ ∈ R p (cid:110) n (cid:107) Y − X θ (cid:107) + λ (cid:107) θ (cid:107) (cid:111) . (3)2n case the right hand side has more than one minimizer, one of them can be selected arbitrarily forour purposes. We will often omit the arguments Y , X , as they are clear from the context. (A closelyrelated method is the so-called Dantzig selector [CT07]: it would be interesting to explore whetherour results can be generalized to that approach.)It was understood early on that, even in the large-sample, low-dimensional limit n → ∞ at p constant, supp( (cid:98) θ n ) (cid:54) = S unless the columns of X with index in S are roughly orthogonal to theones with index outside S [KF00]. This assumption is formalized by the so-called ‘irrepresentabilitycondition’, that can be stated in terms of the empirical covariance matrix (cid:98) Σ = ( X T X /n ). Letting (cid:98) Σ A,B be the submatrix ( (cid:98) Σ i,j ) i ∈ A,j ∈ B , irrepresentability requires (cid:107) (cid:98) Σ S c ,S (cid:98) Σ − S,S sign( θ ,S ) (cid:107) ∞ ≤ − η , (4)for some η > u ) i = +1, 0, − u i >
0, = 0, < η uniformly bounded away from 0,it guarantees correct model selection also in the high-dimensional regime p (cid:29) n . Meinshausenand B¨ulmann [MB06] independently established the same result for random Gaussian designs, withapplications to learning Gaussian graphical models. These papers applied to very sparse models,requiring in particular s = O ( n c ), c <
1, and parameter vectors with large coefficients. Namely,scaling the columns of X such that (cid:98) Σ i,i ≤
1, for i ∈ [ p ], they require θ min ≡ min i ∈ S | θ ,i | ≥ c (cid:112) s /n .Wainwright [Wai09] strengthened considerably these results by allowing for general scalings of s , p, n and proving that much smaller non-zero coefficients can be detected. Namely, he proved thatfor a broad class of empirical covariances it is only necessary that θ min ≥ cσ (cid:112) (log p ) /n . This scalingof the minimum non-zero entry is optimal up to constants. Also, for a specific classes of randomGaussian designs (including X with i.i.d. standard Gaussian entries), the analysis of [Wai09] providestight bounds on the minimum sample size for correct model selection. Namely, there exists c (cid:96) , c u > n < c (cid:96) s log p and succeeds with high probability if n ≥ c u s log p .While, thanks to these recent works [ZY06, MB06, Wai09], we understand reasonably well modelselection via the Lasso, it is fundamentally unknown what model-selection performances can beachieved with general computationally practical methods. Two aspects of of the above theory cannotbe improved substantially: ( i ) The non-zero entries must satisfy the condition θ min ≥ cσ/ √ n to bedetected with high probability. Even if n = p and the measurement directions X i are orthogonal,e.g., X = √ n I n × n , one would need | θ ,i | ≥ cσ/ √ n to distinguish the i -th entry from noise. Forinstance, in [JM13], the present authors prove a general upper bound on the minimax power oftests for hypotheses H ,i = { θ ,i = 0 } . Specializing this bound to the case of standard Gaussiandesigns, the analysis of [JM13] shows formally that no test can detect θ ,i (cid:54) = 0, with a fixed degree ofconfidence, unless | θ ,i | ≥ cσ/ √ n . ( ii ) The sample size must satisfy n ≥ s . Indeed, if this is not thecase, for each θ with support of size | S | = s , there is a one parameter family { θ ( t ) = θ + t v } t ∈ R with supp( θ ( t )) ⊆ S , X θ ( t ) = X θ and, for specific values of t , the support of θ ( t ) is strictlycontained in S .On the other hand, there is no fundamental reason to assume the irrepresentability condition (4).This follows from the requirement that a specific method (the Lasso) succeeds, but is unclear whyit should be necessary in general. The situation is very different for estimation consistency, e.g., forcharacterizing the (cid:96) error (cid:107) (cid:98) θ − θ (cid:107) . In that case the restricted isometry property (RIP) [CT05] (orone of its relaxations [BRT09, vdGB09]) is sufficient and –essentially– necessary.3 auss-Lasso selector : Model selector for high dimensional problems Input:
Measurement vector y , design model X , regularization parameter λ , support size s . Output:
Estimated support (cid:98) S . Let T = supp( (cid:98) θ n ) be the support of Lasso estimator (cid:98) θ n = (cid:98) θ n ( y, X , λ ) given by (cid:98) θ n ( Y, X ; λ ) ≡ arg min θ ∈ R p (cid:110) n (cid:107) Y − X θ (cid:107) + λ (cid:107) θ (cid:107) (cid:111) . Construct the estimator (cid:98) θ GL as follows: (cid:98) θ GL T = ( X T T X T ) − X T T y , (cid:98) θ GL T c = 0 . Find s -th largest entry (in modulus) of (cid:98) θ GL T , denoted by (cid:98) θ GL( s ) , and let (cid:98) S ≡ (cid:8) i ∈ [ p ] : | (cid:98) θ GL i | ≥ | (cid:98) θ GL( s ) | (cid:9) . In this paper we prove that the
Gauss-Lasso selector has nearly optimal model selection propertiesunder a condition that is strictly weaker than irrepresentability. We call this condition the generalizedirrepresentability condition (GIC). The Gauss-Lasso procedure uses the Lasso estimator to estimatea first model T ⊆ { , . . . , p } . It then constructs a new estimator by ordinary least squares regressionof the data Y onto the model T .We prove that the estimated model is, with high probability, correct (i.e., (cid:98) S = S ) under conditionscomparable to the ones assumed in [MB06, ZY06, Wai09], while replacing irrepresentability by theweaker generalized irrepresentability condition. In the case of random Gaussian designs, our analysisfurther assumes the restricted eigenvalue property in order to establish a nearly optimal scaling ofthe sample size n with the sparsity parameter s .In order to build some intuition about the difference between irrepresentability and generalizedirrepresentability, it is convenient to consider the Lasso cost function at ‘zero noise’: G ( θ ; ξ ) ≡ n (cid:107) X ( θ − θ ) (cid:107) + ξ (cid:107) θ (cid:107) = 12 (cid:104) ( θ − θ ) , (cid:98) Σ( θ − θ ) (cid:105) + ξ (cid:107) θ (cid:107) . Let (cid:98) θ ZN ( ξ ) be the minimizer of G ( · ; ξ ) and v ≡ lim ξ → sign( (cid:98) θ ZN ( ξ )). The limit is well defined byLemma 2.2 below. The KKT conditions for (cid:98) θ ZN imply, for T ≡ supp( v ), (cid:107) (cid:98) Σ T c ,T (cid:98) Σ − T,T v T (cid:107) ∞ ≤ . Since G ( · ; ξ ) has always at least one minimizer, this condition is always satisfied . Generalizedirrepresentability requires that the above inequality holds with some small slack η > (cid:107) (cid:98) Σ T c ,T (cid:98) Σ − T,T v T (cid:107) ∞ ≤ − η . v = sign( θ ). In other words, earlier work [MB06, ZY06, Wai09] required generalizedirrepresentability plus sign-consistency in zero noise, and established sign consistency in non-zeronoise. In this paper the former condition is shown to be sufficient.From a different point of view, GIC demands that irrepresentability holds for a superset of thetrue support S . It was indeed argued in the literature that such a relaxation of irrepresentabilityallows to cover a significantly broader set of cases (see for instance [BvdG11, Section 7.7.6]). However,it was never clarified why such a superset irrepresentability condition should be significantly moregeneral than simple irrepresentability. Further, no precise prescription existed for the superset of thetrue support.Our contributions can therefore be summarized as follows:1. By tying it to the KKT condition for the zero-noise problem, we justify the expectation thatgeneralized irrepresentability should hold for a broad class of design matrices.2. We thus provide a specific formulation of superset irrepresentability, prescribing both the su-perset T and the sign vector v T , that is –by itself– significantly more general than simpleirrepresentability.3. We show that, under GIC, exact support recovery can be guaranteed using the Gauss-Lasso,and formulate the appropriate ‘minimum coefficient’ conditions that guarantee this.As a side remark, even when simple irrepresentability holds, our results strengthen somewhat theestimates of [Wai09] (see below for details).The paper is organized as follows. In the rest of the introduction we illustrate the range ofapplicability of GIC through a simple example and we discuss further related work. 
We finallyintroduce the basic notations to be used throughout the paper.Section 2 treats the case of deterministic designs X , and develops our main results on the basis ofthe GIC. Section 3 extends our analysis to the case of random designs. In this case GIC is requiredto hold for the population covariance, and the analysis is more technical as it requires to control therandomness of the design matrix. The proofs of our main results can be found in Sections 5 and 6,with several technical steps deferred to the Appendices. In order to illustrate the range of new cases covered by our results, it is instructive to consider asimple example. A detailed discussion of this calculation can be found in Appendix B. The examplecorresponds to a Gaussian random design, i.e., the rows X T , . . . X T n are i.i.d. realizations of a p -variate normal distribution with mean zero. We write X i = ( X i, , X i, , . . . , X i,p ) T for the componentsof X i . The response variable is linearly related to the first s covariates Y i = θ , X i, + θ , X i, + · · · + θ ,s X i,s + W i , where W i ∼ N (0 , σ ) and we assume θ ,i > i ≤ s . In particular S = { , . . . , s } .As for the design matrix, first p − X i,j ∼ N (0 ,
1) are independent for 1 ≤ j ≤ p − ≤ i ≤ n ). However the p -th covariate is correlated5o the s relevant ones: X i,p = a X i, + a X i, + · · · + a X i,s + b ˜ X i,p . Here ˜ X i,p ∼ N (0 ,
1) is independent from { X i, , . . . , X i,p − } and represents the orthogonal componentof the p -th covariate. We choose the coefficients a, b ≥ s a + b = 1, whence E { X i,p } = 1and hence the p -th covariate is normalized as the first ( p −
1) ones. In other words, the rows of X are i.i.d. Gaussian X i ∼ N (0 , Σ) with covariance given byΣ ij = i = j,a if i = p, j ∈ S or i ∈ S, j = p ,0 otherwise.For a = 0, this is the standard i.i.d. design and irrepresentability holds. The Lasso correctlyrecovers the support S from n ≥ c s log p samples, provided θ min ≥ c (cid:48) (cid:112) (log p ) /n . It follows from[Wai09] that this remains true as long as a ≤ (1 − η ) /s for some η > a > /s , the Lasso includes the p -th covariate in the estimated model, withhigh probability (see Appendix B).As it is shown in Appendix B, the Gauss-Lasso is successful for a significantly larger set of valuesof a . Namely, if a ∈ (cid:20) , − ηs (cid:21) ∪ (cid:18) s , − η √ s (cid:21) , then it recovers S from n ≥ c s log p samples, provided θ min ≥ c (cid:48) (cid:112) (log p ) /n . While the interval((1 − η ) /s , /s ] is not covered by this result, we expect this to be due to the proof technique ratherthan to an intrinsic limitation of the Gauss-Lasso selector. The restricted isometry property [CT05, CT07] (or the related restricted eigenvalue [BRT09] orcompatibility conditions [vdGB09]) have been used to establish guarantees on the estimation andmodel selection errors of the Lasso or similar approaches. In particular, Bickel, Ritov and Tsybakov[BRT09] show that, under such conditions, with high probability, (cid:107) (cid:98) θ − θ (cid:107) ≤ Cσ s log pn . The same conditions can be used to prove model-selection guarantees. In particular, Zhou [Zho10]studies a multi-step thresholding procedure whose first steps coincide with the Gauss-Lasso. Whilethe main objective of this work is to prove high-dimensional (cid:96) consistency with a sparse estimatedmodel, the author also proves partial model selection guarantees. Namely, the method correctlyrecovers a subset of large coefficients S L ⊆ S , provided | θ ,i | ≥ cσ (cid:112) s (log p ) /n , for i ∈ S L . Thismeans that the coefficients that are guaranteed to be detected must be a factor √ s larger than whatis required by our results.Also related to model selection is the recent line of work on hypothesis testing in high-dimensionalregression [ZZ11, B¨uh12]. These papers propose methods for testing hypotheses of the form H ,i =6 θ ,i = 0 } . In order to achieve a given significance level, they require –again– large coefficients,namely | θ ,i | ≥ cσ (cid:112) s (log p ) /n (see [JM13] for a discussion of this point). In [JM13], we investigatea hypothesis testing method that achieves any given significance level α for | θ ,i | ≥ cσ/ √ n , with c a constant that depends on α . Although the testing procedure can be used for general setting,the guarantee on its statistical power is provided only for some random Gaussian designs in anasymptotic sense. A very recent paper by van de Geer, B¨uhlmann and Ritov [vdGBR13] proposesa similar procedure and gives conditions under which the procedure achieves the semiparametricefficiency bound. Their analysis allows for general Gaussian and sub-Gaussian designs. However, itrequires a sample size n ≥ C ( s log p ) , namely the square of the optimal sample size.Let us finally mention that an alternative approach to establishing model-selection guaranteesassumes a suitable mutual incoherence conditions. Lounici [Lou08] proves correct model selectionunder the assumption max i (cid:54) = j | (cid:98) Σ ij | = O (1 /s ). This assumption is however stronger than irrepre-sentability [vdGB09]. 
Cand´es and Plan [CP09] also assume mutual incoherence, albeit with a muchweaker requirement, namely max i (cid:54) = j | (cid:98) Σ ij | = O (1 / (log p )). Under this condition, they establish modelselection guarantees for an ideal scaling of the non-zero coefficients θ min ≥ cσ (cid:112) (log p ) /n . How-ever, this result only holds with high probability for a ‘random signal model’ in which the non-zerocoefficients θ ,i have uniformly random signs.Finally, model selection consistency can be obtained without irrepresentability through othermethods. For instance [Zou06] develops the adaptive Lasso, using a data-dependent weighted (cid:96) regularization, and [Bac08] proposes the Bolasso, a resampling-based techniques. Unfortunately,both of these approaches are only guaranteed to succeed in the low-dimensional regime of p fixed,and n → ∞ . We provide a brief summary of the notations used throughout the paper. For a matrix A and set ofindices I, J , we let A J denote the submatrix containing just the columns in J and A I,J denote thesubmatrix formed by the rows in I and columns in J . Likewise, for a vector v , v I is the restriction of v to indices in I . Further, the notation A − I,I represents the inverse of A I,I , i.e., A − I,I = ( A I,I ) − . Themaximum and the minimum singular values of A are respectively denoted by σ max ( A ) and σ min ( A ).We write (cid:107) v (cid:107) p for the standard (cid:96) p norm of a vector v . Specifically, (cid:107) v (cid:107) denotes the number ofnonzero entries in v . Also, (cid:107) A (cid:107) p refers to the induced operator norm on a matrix A . We use e i torefer to the i -th standard basis element, e.g., e = (1 , , . . . , v , supp( v ) representsthe positions of nonzero entries of v . Throughout, we denote the rows of the design matrix X by X , . . . , X n ∈ R p and denote its columns by x , . . . , x p ∈ R n . Further, for a vector v , sign( v ) is thevector with entries sign( v ) i = +1 if v i >
0, sign( v ) i = − v i <
0, and sign( v ) i = 0 otherwise. An outline of this section is given below:1. We first consider the zero-noise problem W = 0, and prove several useful properties of the Lassoestimator in this case. In particular, we show that there exists a threshold for the regularizationparameter below which the support of the Lasso estimator remains the same and containssupp( θ ). Moreover, the Lasso estimator support is not much larger than supp( θ ).7. We then turn to the noisy problem, and introduce the generalized irrepresentability condition (GIC) that is motivated by the properties of the Lasso in the zero-noise case. We prove thatunder GIC (and other technical conditions), with high probability, the signed support of theLasso estimator is the same as that in the zero-noise problem.3. We show that the Gauss-Lasso selector correctly recovers the signed support of θ . Recall that (cid:98) Σ ≡ ( X T X /n ) denotes the empirical covariance of the rows of the design matrix. Given (cid:98) Σ ∈ R p × p , (cid:98) Σ (cid:23) θ ∈ R p and ξ ∈ R + , we define the zero-noise Lasso estimator as (cid:98) θ ZN ( ξ ) ≡ arg min θ ∈ R p (cid:110) n (cid:104) ( θ − θ ) , (cid:98) Σ( θ − θ ) (cid:105) + ξ (cid:107) θ (cid:107) (cid:111) . (5)Note that (cid:98) θ ZN ( ξ ) is obtained by letting Y = X θ in the definition of (cid:98) θ n ( Y, X ; ξ ).Following [BRT09], we introduce a restricted eigenvalue constant for the empirical covariancematrix (cid:98) Σ: (cid:98) κ ( s, c ) ≡ min J ⊆ [ p ] | J |≤ s min u ∈ R p (cid:107) u Jc (cid:107) ≤ c (cid:107) u J (cid:107) (cid:104) u, (cid:98) Σ u (cid:105)(cid:107) u (cid:107) . (6)Our first result states that the support of (cid:98) θ ZN ( ξ ) is not much larger than the support of θ , forany ξ > Lemma 2.1.
Let (cid:98) θ ZN = (cid:98) θ ZN ( ξ ) be defined as per Eq. (17), with ξ > . Then, if s = (cid:107) θ (cid:107) , (cid:107) (cid:98) θ ZN (cid:107) ≤ (cid:32) (cid:107) (cid:98) Σ (cid:107) (cid:98) κ ( s , (cid:33) s . (7)The proof of this lemma is deferred to Section A.1. Lemma 2.2.
Let (cid:98) θ ZN = (cid:98) θ ZN ( ξ ) be defined as per Eq. (5), with ξ > . Then there exist ξ = ξ ( (cid:98) Σ , S, θ ) > , T ⊆ [ p ] , v ∈ {− , , +1 } p , such that the following happens. For all ξ ∈ (0 , ξ ) , sign( (cid:98) θ ZN ( ξ )) = v and supp( (cid:98) θ ZN ( ξ )) = supp( v ) = T . Further T ⊇ S , v ,S = sign( θ ,S ) and ξ = min i ∈ S | θ ,i / [ (cid:98) Σ − T ,T v ,T ] i | . Proof of Lemma 2.2 can be found in Section A.2.Finally we have the following standard characterization of the solution of the zero-noise problem.
Lemma 2.3.
Let (cid:98) θ ZN = (cid:98) θ ZN ( ξ ) be defined as per Eq. (5), with ξ > . Let T ⊇ S and v ∈ { +1 , , − } p be such that supp( v ) = T . Then sign( (cid:98) θ ZN ) = v if and only if (cid:13)(cid:13)(cid:13)(cid:98) Σ T c ,T (cid:98) Σ − T,T v T (cid:13)(cid:13)(cid:13) ∞ ≤ , (8) v T = sign (cid:16) θ ,T − ξ (cid:98) Σ − T,T v T (cid:17) . (9) Further, if the above holds, (cid:98) θ ZN is given by (cid:98) θ ZN T c = 0 and (cid:98) θ ZN T = θ ,T − ξ (cid:98) Σ − T,T v T . generalized irrepresentability condition (GIC) fordeterministic designs. Generalized irrepresentability (deterministic designs).
The pair ( (cid:98) Σ , θ ), (cid:98) Σ ∈ R p × p , θ ∈ R p satisfy the generalized irrepresentability condition with parameter η > v , T be defined as per Lemma 2.2. Then (cid:13)(cid:13)(cid:13)(cid:98) Σ T c ,T (cid:98) Σ − T ,T v ,T (cid:13)(cid:13)(cid:13) ∞ ≤ − η . (10)In other words we require the dual feasibility condition (8) –which always holds– to hold with apositive slack η . Consider the noisy linear observation model as described in (2), and let (cid:98) r ≡ ( X T W/n ). We beginwith a standard characterization of sign( (cid:98) θ n ), the signed support of the Lasso estimator (3). Lemma 2.4.
Let (cid:98) θ n = (cid:98) θ n ( y, X ; λ ) be defined as per Eq. (3), and let z ∈ { +1 , , − } p with supp( z ) = T . Further assume T ⊇ S . Then the signed support of the Lasso estimator is given by sign( (cid:98) θ n ) = z if and only if (cid:13)(cid:13)(cid:13)(cid:98) Σ T c ,T (cid:98) Σ − T,T z T + 1 λ ( (cid:98) r T c − (cid:98) Σ T c ,T (cid:98) Σ − T,T (cid:98) r T ) (cid:13)(cid:13)(cid:13) ∞ ≤ , (11) z T = sign (cid:16) θ ,T − (cid:98) Σ − T,T ( λz T − (cid:98) r T ) (cid:17) . (12)Lemma 2.4 is proved in Appendix A.4. Theorem 2.5.
Consider the deterministic design model with empirical covariance matrix (cid:98) Σ ≡ ( X T X ) /n , and assume that (cid:98) Σ i,i ≤ for i ∈ [ p ] . Let T ⊆ [ p ] , v ∈ { +1 , , − } p be the set andvector defined in Lemma 2.2, and t ≡ | T | . Assume that(i) We have σ min ( (cid:98) Σ T ,T ) ≥ C min > .(ii) The pair ( (cid:98) Σ , θ ) satisfies the generalized irrepresentability condition with parameter η .Consider the Lasso estimator (cid:98) θ n = (cid:98) θ n ( y, X ; λ ) defined as per Eq. (3), with regularization parameter λ = ση (cid:114) c log pn , (13) for some constant c > , and suppose that(iii) For some c > : | θ ,i | ≥ c λ + λ (cid:12)(cid:12) [ (cid:98) Σ − T ,T v ,T ] i (cid:12)(cid:12) for all i ∈ S, (14) (cid:12)(cid:12) [ (cid:98) Σ − T ,T v ,T ] i (cid:12)(cid:12) ≥ c for all i ∈ T \ S. (15)9 e further assume, without loss of generality, η ≤ c √ C min . Then the following holds true: P (cid:110) sign( (cid:98) θ n ( λ )) = v (cid:111) ≥ − p − c . (16)Theorem 2.5 is proved in Section 5.1. Note that, even in the case standard irrepresentabilityholds (and hence T = S ), this result improves over [Wai09, Theorem 1.(b)], in that the requiredlower bound for | θ ,i | , i ∈ S , does not depend on (cid:107) (cid:98) Σ S,S (cid:107) ∞ . More precisely, Theorem 2.5 assumes | θ ,i | ≥ λ ( c + | [ (cid:98) Σ − S,S v ,S ] i | ), for i ∈ S , which is weaker than the assumption of Theorem1.(b)[Wai09],namely, | θ ,i | ≥ λ ( c + (cid:107) (cid:98) Σ − S,S (cid:107) ∞ ), since (cid:107) v ,S (cid:107) ∞ ≤ Remark 2.6.
Condition (i) in Theorem 2.5 requires the submatrix (cid:98) Σ T ,T to have minimum singularvalue bounded away form zero. Assuming (cid:98) Σ S,S to be non-singular is necessary for identifiability.Requiring the minimum singular value of (cid:98) Σ T ,T to be bounded away from zero is not much morerestrictive since T is comparable in size with S , as stated in Lemma 2.1. We next show that the Gauss-Lasso selector correctly recovers the support of θ . Theorem 2.7.
Consider the deterministic design model with empirical covariance matrix (cid:98) Σ ≡ ( X T X ) /n , and assume that (cid:98) Σ i,i ≤ for i ∈ [ p ] . Under the assumptions of Theorem 2.5, P (cid:16) (cid:107) (cid:98) θ GL − θ (cid:107) ∞ ≥ µ (cid:17) ≤ p − c + 2 pe − nC min µ / σ . In particular, if (cid:98) S is the model selected by the Gauss-Lasso, we have P ( (cid:98) S = S ) ≥ − p − c / . The proof of Theorem 2.7 is given in Section 5.2.
In the previous section, we studied the case of deterministic design models which allowed for astraightforward analysis. Here, we consider the random design model which needs a more involvedanalysis. Within the random Gaussian design model, the rows X i are distributed as X i ∼ N (0 , Σ)for some (unknown) covariance matrix Σ (cid:31) ∈ R p × p , Σ (cid:31) θ ∈ R p and ξ ∈ R + , the population-levelestimator (cid:98) θ ∞ ( ξ ) = (cid:98) θ ∞ ( ξ ; θ , Σ) is defined as (cid:98) θ ∞ ( ξ ) ≡ arg min θ ∈ R p (cid:110) (cid:104) ( θ − θ ) , Σ( θ − θ ) (cid:105) + ξ (cid:107) θ (cid:107) (cid:111) . (17)Notice that the minimizer is unique because Σ is strictly positive definite and hence the cost functionon the right-hand side is strongly convex. In fact, the population-level estimator is obtained byassuming that the response vector Y is noiseless and n = ∞ , hence replacing the empirical covariance( X T X /n ) with the exact covariance Σ in the lasso optimization problem (3).Notice that the population-level estimator (cid:98) θ ∞ is deterministic, albeit X is a random design. Weshow that under some conditions on the covariance Σ and vector θ , T ≡ supp( (cid:98) θ n ) = supp( (cid:98) θ ∞ ), i.e.,10he population-level estimator and the Lasso estimator share the same (signed) support. Further T ⊇ S . Since (cid:98) θ ∞ (and hence T ) is deterministic, X T is a Gaussian matrix with rows drawn independentlyfrom N (0 , Σ T,T ). This observation allows for a simple analysis of the Gauss-Lasso selector (cid:98) θ GL .An outline of the section is given below:1. We begin with proving several properties of the population-level estimator. Similar to thezero-noise problem in Section 2.1, we show that there exists a threshold ξ , such that for all ξ ∈ (0 , ξ ), supp( (cid:98) θ ∞ ( ξ )) remains the same and contains supp( θ ). Moreover, supp( (cid:98) θ ∞ ( ξ )) isnot much larger than supp( θ ).2. We show that under GIC for covariance matrix Σ (and other sufficient conditions), with highprobability, the signed support of the Lasso estimator is the same as the signed support of thepopulation-level estimator.3. Following the previous steps, we show that the Gauss-Lasso selector correctly recovers thesigned support of θ . n = ∞ problem In this section we derive several useful properties of the population-level problem (17). ComparingEqs. (5) and (17), the estimators (cid:98) θ ZN ( ξ ) and (cid:98) θ ∞ ( ξ ) are defined in a very similar manner (the formeris defined with respect to (cid:98) Σ and the latter is defined with respect to Σ), and as we will see (cid:98) θ ∞ alsopossesses the properties stated in Section 2.1.Let κ ∞ ( s, c ) be the restricted eigenvalue constant for the covariance matrix Σ: κ ( s, c ) ≡ min J ⊆ [ p ] | J |≤ s min u ∈ R p (cid:107) u Jc (cid:107) ≤ c (cid:107) u J (cid:107) (cid:104) u, Σ u (cid:105)(cid:107) u (cid:107) . (18)The proofs of the following Lemmas are very similar to the corresponding ones in Section 2.1,and are omitted. Lemma 3.1.
Let (cid:98) θ ∞ = (cid:98) θ ∞ ( ξ ) be defined as per Eq. (17), with ξ > . Then, if s = (cid:107) θ (cid:107) , (cid:107) (cid:98) θ ∞ (cid:107) ≤ (cid:18) (cid:107) Σ (cid:107) κ ( s , (cid:19) s . (19) Lemma 3.2.
Let (cid:98) θ ∞ = (cid:98) θ ∞ ( ξ ) be defined as per Eq. (17), with ξ > . Then there exist ξ = ξ (Σ , S, θ ) > , T ⊆ [ p ] , v ∈ {− , , +1 } p , such that the following happens. For all ξ ∈ (0 , ξ ) , sign( (cid:98) θ ∞ ( ξ )) = v and supp( (cid:98) θ ∞ ( ξ )) = supp( v ) = T . Further T ⊇ S , v ,S = sign( θ ,S ) and ξ = min i ∈ S | θ ,i / [Σ − T ,T v ,T ] i | . Finally we have the following standard characterization of the solution of the n = ∞ problem(17). Lemma 3.3.
Let (cid:98) θ ∞ = (cid:98) θ ∞ ( ξ ) be defined as per Eq. (17), with ξ > . Let T ⊇ S and v ∈ { +1 , , − } p be such that supp( v ) = T . Then sign( (cid:98) θ ∞ ) = v if and only if (cid:13)(cid:13)(cid:13) Σ T c ,T Σ − T,T v T (cid:13)(cid:13)(cid:13) ∞ ≤ ,v T = sign (cid:16) θ ,T − ξ Σ − T,T v T (cid:17) . urther, if the above holds, (cid:98) θ ∞ is given by (cid:98) θ ∞ T c = 0 and (cid:98) θ ∞ T = θ ,T − ξ Σ − T,T v T . Motivated by this result, we introduce the following assumption.
Generalized irrepresentability (random designs).
The pair (Σ , θ ), Σ ∈ R p × p , θ ∈ R p satisfy the generalized irrepresentability condition with parameter η > v , T be defined as per Lemma 3.2. Then (cid:13)(cid:13)(cid:13) Σ T c ,T Σ − T ,T v ,T (cid:13)(cid:13)(cid:13) ∞ ≤ − η , (20) We now consider the Lasso estimator (3). Recall the notations (cid:98) Σ ≡ n X T X , (cid:98) r ≡ n X T W .
Note that (cid:98) Σ ∈ R p × p , (cid:98) r ∈ R p are both random quantities in the case of random designs. Theorem 3.4.
Consider the Gaussian random design model with covariance matrix Σ (cid:31) , andassume that Σ i,i ≤ for i ∈ [ p ] . Let T ⊆ [ p ] , v ∈ { +1 , , − } p be the deterministic set and vectordefined in Lemma 3.2, and t ≡ | T | . Assume that(i) We have σ min (Σ T ,T ) ≥ C min > .(ii) The pair (Σ , θ ) satisfies the generalized irrepresentability condition with parameter η .Consider the Lasso estimator (cid:98) θ n = (cid:98) θ n ( y, X ; λ ) defined as per Eq. (3), with regularization parameter λ = 4 ση (cid:114) c log pn , (21) for some constant c > , and suppose that(iii) For some c > : | θ ,i | ≥ c λ + 32 λ (cid:12)(cid:12) [Σ − T ,T v ,T ] i (cid:12)(cid:12) for all i ∈ S, (22) (cid:12)(cid:12) [Σ − T ,T v ,T ] i (cid:12)(cid:12) ≥ c for all i ∈ T \ S. (23) We further assume, without loss of generality, η ≤ c √ C min .If n ≥ max( M , M ) t log p with M ≡ c η C min , M ≡ c c C , then the following holds true: P (cid:110) sign( (cid:98) θ n ( λ )) = v (cid:111) ≥ − pe − n − e − t − p − c . (24)12nder standard irrepresentability, this result improves over [Wai09, Theorem 3.(ii)], in that therequired lower bound for | θ ,i | , i ∈ S , does not depend on (cid:107) Σ − / S,S (cid:107) ∞ . More precisely, Theorem 2.5assumes | θ ,i | ≥ λ ( c + 1 . | [Σ − S,S v ,S ] i | ), for i ∈ S , while Theorem 3.(ii)[Wai09] requires | θ ,i | ≥ cλ (cid:107) Σ − / S,S (cid:107) ∞ , for i ∈ S . Note that | [Σ − S,S v ,S ] i | ≤ (cid:107) Σ − S,S (cid:107) ∞ ≤ (cid:107) Σ − / S,S (cid:107) ∞ .While being closely analogous to Theorem 2.5, the last theorem has somewhat worse constants.Indeed in the present case we need to control the randomness of the design matrix X in addition tothe one of the noise. Remark 3.5.
Condition (i) follows readily from the restricted eigenvalue constraint as in Eq. (18) ,i.e., κ ∞ ( t , > . This is a reasonable assumption since T is not much larger than S , as statedin Lemma 3.1. Corollary 3.6.
Under the assumptions of Theorem 3.4, if n ≥ max( (cid:102) M , (cid:102) M ) s log p , with (cid:102) M = (cid:16) (cid:107) Σ (cid:107) κ ∞ ( s , (cid:17) M , (cid:102) M = (cid:16) (cid:107) Σ (cid:107) κ ∞ ( s , (cid:17) M , then the following holds: P (cid:110) sign( (cid:98) θ n ( λ )) = v (cid:111) ≥ − pe − n − e − s − p − c . Proof (Corollary 3.6).
The result follows readily from Theorem 3.4, noting that s ≤ t since S ⊆ T , and t ≤ (1 + 4 (cid:107) Σ (cid:107) /κ ∞ ( s , s as per Lemma 3.1.Below, we show that the Gauss-Lasso selector correctly recovers the signed support of θ . Theorem 3.7.
Consider the random Gaussian design model with covariance matrix Σ (cid:31) , and as-sume that Σ i,i ≤ for i ∈ [ p ] . Under the assumptions of Theorem 3.4, and for n ≥ max( (cid:102) M , (cid:102) M ) s log p ,we have P (cid:16) (cid:107) (cid:98) θ GL − θ (cid:107) ∞ ≥ µ (cid:17) ≤ pe − n + 6 e − s + 8 p − c + 2 pe − nC min µ / σ . Moreover, letting ˆ S be the model returned by the Gauss-Lasso selector, we have P ( (cid:98) S = S ) ≥ − p e − n − e − s − p − c . The proof of Theorem 3.7 is deferred to Section 6.4.
Remark 3.8. [Detection level]
Let θ min ≡ min i ∈ S | θ ,i | be the minimum magnitude of the non-zero entries of vector θ . By Theorem 3.7, Gauss-Lasso selector correctly recovers supp( θ ) , withprobability greater than − p e − n − e − s − p − c , if n ≥ max( ˜ M , ˜ M ) s log p , and θ min ≥ Cσ (cid:114) log pn (cid:0) (cid:107) Σ − T ,T (cid:107) ∞ (cid:1) , (25) where C = C ( c , c , η ) is a constant depending on c , c , and η . Eq. (25) stems from the condition (iii)in Theorem 3.4.We can further generalize this result. Define S = (cid:26) i ∈ S : | θ ,i | ≥ Cσ (cid:114) log pn (cid:0) (cid:107) Σ − T ,T (cid:107) ∞ (cid:1)(cid:27) , - . - . . . . . . θ Figure 1: Parameter vector θ for the communities dataset. The entries with magnitude larger than0 .
04 (shown in black) are treated as significant ones. and S = S \ S . By a very similar argument to the proof of Theorem 3.4, the Gauss-Lasso selectorcan recover S , if (cid:107) θ ,S (cid:107) = O ( σ (cid:112) log p/n ) . More precisely, letting (cid:102) W = X θ ,S + W , the responsevector Y can be recast as Y = X θ ,S + (cid:102) W and the Gauss-Lasso selector treats the small entries θ ,S as noise. We consider a problem about predicting the rate of violent crimes in different communities withinUS, based on other demographic attributes of the communities. We evaluate the performance ofthe Gauss-Lasso selector on the UCI communities and crimes dataset [FA10]. The dataset consistsof a univariate response variable and 122 predictive attributes for 1994 communities. The responsevariable is the total number of violent crimes per 100 K population. Covariates are quantitative,including e.g., the average family income, the fraction of unemployed population, and the policeoperating budget. We consider a linear model as in (2) and perform model selection using Gauss-Lasso selector and Lasso estimator.We do the following preprocessing steps: ( i ) Each missing value is replaced by the mean of thenon-missing values of that attribute for other communities; ( ii ) We eliminate 16 attributes to makethe ensemble of the attribute vectors linearly independent; ( iii ) We normalize the columns to havemean zero and (cid:96) norm √ n . Thus we obtain a design matrix X tot ∈ R n tot × p with n tot = 1994 and p = 106.For the sake of performance evaluation, we need to know the true model, i.e., the true significantcovariates. We let θ = ( X T tot X tot ) − X T tot y be the least square solution obtained from the wholedataset X tot . The entries of θ are shown in Fig. 1. Clearly only a few of them are non negligible,14orresponding to the true model. We treat the entries with magnitude larger than 0 .
04 as trulyactive and the others as truly inactive. The number of active covariates according to this criterionis s = 13.We choose random subsamples of size n = 85 from the communities and normalize each columnof the resulting design matrix to have mean zero and (cid:96) norm √ n . We use Gauss-Lasso selectorand Lasso for model selection based on this design. Figures 2 and 3 respectively show the solutionpath for Gauss-Lasso and Lasso as the parameter λ changes form λ = 0 .
001 to λ = 1. The pathscorresponding to the truly active set are in black and the paths corresponding to the truly inactivevariables are in red. At λ = 1, the solutions (cid:98) θ GL ( λ ) and (cid:98) θ n ( λ ) have no active variables; for decreasing λ , each knot λ k marks the entry or removal of some variables from the current active set of the Lassosolution. Therefore, the support of the Lasso solution T remains constant in between knots. SinceGauss-Lasso selector performs ordinary least squares restricted to T , its coordinate paths are constantin between knots. However, the Lasso paths are linear with respect to λ , with changes in slope atthe knots (see e.g., [EHJT04] for a discussion).It is clear from Figure 3 that the Lasso support either misses a large fraction of the truly activecovariates, or includes many false positives. For instance at λ = 0 .
08, we get 4 true positives outof 13 and 4 false positives. On the other hand, for a smaller value of the regularization parameter, λ = 0 .
01, we get 10 true positives out of 13 and 8 false positives. If we consider on the other hand the Gauss-Lasso, any λ ≤ .
02 produces a set of coefficientswith a gap between large ones, that are mostly true positives, and small ones, that are mostly truenegatives.
In this section we prove Theorems 2.5 and 2.7 using Lemmas 2.1 to 2.4. The latter are proved in theappendices.
By the condition (iii) in the statement of the theorem, we have λ < min i ∈ S (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) θ ,i [ (cid:98) Σ − T ,T v ,T ] i (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = ξ , where the equality holds because of Lemma 2.2. By Lemma 2.2, we know that sign( (cid:98) θ ZN ( λ )) = v and that supp( v ) = T contains the true support S . Applying Lemma 2.3, Eq. (9) and using thegeneralized irrepresentability assumption (10), we obtain (cid:13)(cid:13)(cid:13)(cid:98) Σ T c ,T (cid:98) Σ − T ,T v ,T (cid:13)(cid:13)(cid:13) ∞ ≤ − η , (26) v ,T = sign (cid:16) θ ,T − λ (cid:98) Σ − T ,T v ,T (cid:17) . (27) We treat the entries of the Lasso solution with magnitude less than 0 .
005 as zero. -6 -5 -4 -3 -2 -1 - . - . - . . . . . Gauss-Lasso log ( λ ) C oe ff i c i en t s Figure 2: Coordinate paths for Gauss-Lasso selector and a random subset of n = 85 communities.The paths corresponding to the significant variables of θ are shown in black. The coordinate pathsfor Gauss-Lasso are piecewise constant. -7 -6 -5 -4 -3 -2 -1 - . - . - . . . . . Lasso log ( λ ) C oe ff i c i en t s Figure 3: Coordinate paths for Lasso selector and a random subset of n = 85 communities. Thepaths corresponding to the significant variables of θ are shown in black. The coordinate paths forLasso are piecewise linear.Also, by Lemma 2.4, sign( (cid:98) θ n ) = v if Eqs. (11) and (12) hold with z = v and T = T , namely, if (cid:13)(cid:13)(cid:13)(cid:98) Σ T c ,T (cid:98) Σ − T ,T v ,T + 1 λ ( (cid:98) r T c − (cid:98) Σ T c ,T (cid:98) Σ − T ,T (cid:98) r T ) (cid:13)(cid:13)(cid:13) ∞ ≤ , (28) v ,T = sign (cid:16) θ ,T − (cid:98) Σ − T ,T ( λv ,T − (cid:98) r T ) (cid:17) . (29)16n the sequel, we show that these equations are satisfied, with probability lower bounded as perEq. (16).We begin with proving Eq. (28). Let T = (1 /λ )( (cid:98) r T c − (cid:98) Σ T c ,T (cid:98) Σ − T ,T (cid:98) r T ). We need to show that (cid:107)T (cid:107) ∞ ≤ η . Plugging for (cid:98) r , we get T ≡ X T c Π X ⊥ T W/ ( nλ ), where Π X ⊥ T = I − X T ( X T T X T ) − X T T is the orthogonal projection onto the orthogonal complement of the column space of X T . Since W ∼ N (0 , σ I n × n ), the variable T j = x T j Π X ⊥ T W/ ( nλ ) is normal with variance at most (cid:16) σnλ (cid:17) (cid:107) Π X ⊥ T x j (cid:107) ≤ (cid:16) σnλ (cid:17) (cid:107) x j (cid:107) ≤ σ nλ , where we used the fact that (cid:107) x j (cid:107) ≤ n , as (cid:98) Σ i,i ≤
1. By the Gaussian tail bound with union boundover j ∈ T c , we obtain P ( (cid:107)T (cid:107) ∞ ≤ η ) ≥ − pe − nλ η σ = 1 − p − c . (30)We next prove Eq. (29). Given Eq. (27), we need to showsign (cid:16) θ ,T − λ (cid:98) Σ − T ,T v ,T (cid:17) = sign (cid:16) θ ,T − (cid:98) Σ − T ,T ( λv ,T − (cid:98) r T ) (cid:17) . Let u ≡ θ ,T − λ (cid:98) Σ − T ,T v ,T , and (cid:98) u ≡ θ ,T − (cid:98) Σ − T ,T ( λv ,T − (cid:98) r T ).By condition (iii) , we have, for all i ∈ S , | u i | ≥ | θ ,i | − λ | [ (cid:98) Σ − T ,T v ,T ] i | ≥ c λ . Further, for all i ∈ T \ S , we have | u i | = λ | [ (cid:98) Σ − T ,T v ,T ] i | ≥ c λ . Summarizing, for all i ∈ T , we have | u i | ≥ c λ .We will show that (cid:107) u − (cid:98) u (cid:107) ∞ = (cid:107) (cid:98) Σ − T ,T (cid:98) r T (cid:107) ∞ < c λ , with high probability, thus implying sign( u T ) =sign( (cid:98) u T ) as desired. Lemma 5.1.
The following holds true. P (cid:16) (cid:107) (cid:98) Σ − T ,T (cid:98) r T (cid:107) ∞ ≥ σ (cid:114) c log pn (cid:107) (cid:98) Σ − T ,T (cid:107) / (cid:17) ≤ p − c . (31)Lemma 5.1 is proved by noting that conditioned on X T , (cid:98) Σ − T ,T (cid:98) r T is a Gaussian vector and thenapplying standard tail bound inequality. The details are deferred to Section A.5.Using Lemma 5.1 and the assumption η ≤ c √ C min , we get (cid:107) u − (cid:98) u (cid:107) ∞ < c λ , with probability atleast 1 − p − c .Putting all this together, Eqs. (28) and (29) hold simultaneously, with probability at least 1 − p − c . This implies the thesis. Recall that T = supp( (cid:98) θ n ). On the event E ≡ { T = T } , we have (cid:98) θ GL T = ( X T T X T ) − X T T ( X T θ ,T + W ) = θ ,T + ( X T T X T ) − X T T W , where the first equality holds since T = T ⊇ S and thus θ ,T c = 0. Further note that (cid:98) θ GL i − θ ,i , for i ∈ T , is a zero mean Gaussian vector with variance σ (cid:107) e T i ( X T T X T ) − X T T (cid:107) ≤ σ (cid:107) (cid:98) Σ − T,T (cid:107) /n ≤ σ / ( nC min ) . i ∈ [ p ], we get P (cid:16) (cid:107) (cid:98) θ GL T − θ ,T (cid:107) ∞ ≥ µ ; E (cid:17) ≤ e − nC min µ / σ . Also, under the assumptions of Theorem 2.5, P ( E ) ≥ − p − c . Hence P (cid:16) (cid:107) (cid:98) θ GL T − θ ,T (cid:107) ∞ ≥ µ (cid:17) ≤ P (cid:16) (cid:107) (cid:98) θ GL T − θ ,T (cid:107) ∞ ≥ µ ; E (cid:17) + P ( E c ) ≤ e − nC min µ / σ + 4 p − c . Since (cid:98) θ GL T c = θ ,T c = 0, we get (cid:107) (cid:98) θ GL − θ (cid:107) ∞ < µ , with probability at least 1 − p − c − e − nC min µ / σ .Moreover, if (cid:107) (cid:98) θ GL − θ (cid:107) < θ min /
2, then | (cid:98) θ GL i | > θ min / i ∈ S and | (cid:98) θ GL i | < θ min /
2, for i ∈ S c .Hence, the s top entries of (cid:98) θ GL (in modulus), returned by the Gauss-Lasso selector, correspond tothe true support S . Therefore, P ( ˆ S = S ) ≥ P ( (cid:107) (cid:98) θ GL − θ (cid:107) ∞ < θ min / ≥ − p − c − pe − nC min θ / σ ≥ − p − c / , where the last inequality follows from the facts θ min ≥ c λ , and η ≤ c √ C min . By the condition (iii) in the statement of the theorem, we have λ ≤
23 min i ∈ S (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) θ ,i [Σ − T ,T v ,T ] i (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) < ξ , where the second inequality holds because of Lemma 3.2. Therefore, as a result of Lemma 3.2, wehave sign( (cid:98) θ ∞ ( λ )) = v and that supp( v ) = T contains the true support S . Applying Lemma 3.3and using the generalized irrepresentability assumption, we have (cid:13)(cid:13)(cid:13) Σ T c ,T Σ − T ,T v ,T (cid:13)(cid:13)(cid:13) ∞ ≤ − η , (32) v ,T = sign (cid:16) θ ,T − λ Σ − T ,T v ,T (cid:17) . (33)Moreover, by Lemma 2.4, sign( (cid:98) θ n ) = v if Eqs. (11) and (12) hold with z = v and T = T , namely, (cid:13)(cid:13)(cid:13)(cid:98) Σ T c ,T (cid:98) Σ − T ,T v ,T + 1 λ ( (cid:98) r T c − (cid:98) Σ T c ,T (cid:98) Σ − T ,T (cid:98) r T ) (cid:13)(cid:13)(cid:13) ∞ ≤ , (34) v ,T = sign (cid:16) θ ,T − (cid:98) Σ − T ,T ( λv ,T − (cid:98) r T ) (cid:17) . (35)The rest of the proof is devoted to show the validity of these equations, with probability lowerbounded as per Eq. (24). 18 .1 Proof of Eq. (34) It is immediate to see that Eq. (34) holds if the followings hold true: T ≡ (cid:13)(cid:13)(cid:98) Σ T c ,T (cid:98) Σ − T ,T v ,T (cid:13)(cid:13) ∞ ≤ − η , (36) T ≡ λ (cid:13)(cid:13)(cid:98) r T c − (cid:98) Σ T c ,T (cid:98) Σ − T ,T (cid:98) r T (cid:13)(cid:13) ∞ ≤ η . (37)In order to prove inequalities (36) and (37), it is useful to recall the following proposition fromrandom matrix theory. Proposition 6.1 ([DS01, Wai09, Ver12]) . For k ≤ n , let X ∈ R n × k be a random matrix with i.i.drows drawn from N (0 , Σ) . Then the following hold true for all t ≥ and τ ≡ (cid:113) kn + t ) + ( (cid:113) kn + t ) .(a) If Σ has maximum eigenvalue σ max < ∞ , then P (cid:18) (cid:107) n X T X − Σ (cid:107) ≥ σ max τ (cid:19) ≤ e − nt / . (b) If Σ has minimum eigenvalue σ min > , then P (cid:18) (cid:107) ( 1 n X T X ) − − Σ − (cid:107) ≥ σ − τ (cid:19) ≤ e − nt / . We consider the particular choice of t = (cid:112) k/n which is useful for future reference. Since k/n ≤ τ ≤ (cid:112) k/n and therefore the specialized version of Proposition 6.1 reads: P (cid:18) (cid:107) n X T X − Σ (cid:107) ≥ (cid:112) k/n σ max (cid:19) ≤ e − k/ , (38) P (cid:18) (cid:107) ( 1 n X T X ) − − Σ − (cid:107) ≥ (cid:112) k/n σ − (cid:19) ≤ e − k/ . (39)We define the event E as E ≡ (cid:26) (cid:107) ( (cid:98) Σ T ,T ) − − Σ − T ,T (cid:107) ≤ (cid:112) t /n C − (cid:27) . Applying Eqs. (38), (39) to X T , we conclude that P ( E c ) ≤ e − t / . (40)We now have in place all we need to bound the terms T and T . T To bound T , we employ similar techniques to those used in [Wai09, Theorem 3] to verify strictdual feasibility. The argument in [Wai09] works under the irrepresentability condition (see Eq. (26)therein) and we modify it to apply to the current setting, i.e., the generalized irrepresentabilitycondition. 19e begin by conditioning on X T . For j ∈ T c , x j is a zero mean Gaussian vector and we candecompose it into a linear correlated part plus an uncorrelated part as x T j = Σ j,T Σ − T ,T X T T + (cid:15) T j , where (cid:15) j ∈ R n has i.i.d. entries distributed as (cid:15) ji ∼ N (0 , Σ j,j − Σ j,T Σ − T ,T Σ T ,j ).Letting u = (cid:98) Σ T c ,T (cid:98) Σ − T ,T v ,T , we write u j = x T j X T ( X T T X T ) − v ,T = Σ j,T (Σ T ,T ) − v ,T + (cid:15) T j X T ( X T T X T ) − v ,T . (41)The first term is bounded as | Σ j,T (Σ T ,T ) − v ,T | ≤ − η as per Eq. (32). Let m j = (cid:15) T j X T ( X T T X T ) − v ,T .Since Var( (cid:15) ji ) ≤ Σ j,j ≤
1, conditioned on X T , m j is zero mean Gaussian with variance at mostVar( m j ) ≤ (cid:107) X T ( X T T X T ) − v ,T (cid:107) ≤ n v T ,T (cid:16) X T T X T n (cid:17) − v ,T ≤ n (cid:107) (cid:98) Σ − T ,T (cid:107) (cid:107) v ,T (cid:107) . (42)Under the event E , we have (cid:107) (cid:98) Σ − T ,T (cid:107) ≤ (cid:107) Σ − T ,T (cid:107) + (cid:107) (cid:98) Σ − T ,T − Σ − T ,T (cid:107) ≤ (1 + 8 (cid:112) t /n ) C − ≤ C − , (43)and hence, Var( m j ) ≤ t / ( nC min ). We now define the event E as E ≡ (cid:26) max j ∈ T c | m j | ≥ (cid:114) c t log pn C min (cid:27) . By the total probability rule, we have P ( E ) ≤ P ( E ; E ) + P ( E c ) . Using Gaussian tail bound and union bounding over j ∈ T c , we obtain P ( E ; E ) ≤ p − c . Using thebound P ( E c ) ≤ e − t / , we arrive at: P (cid:32) max j ∈ T c | m j | > (cid:114) c t log pn C min (cid:33) ≤ p − c + 2 e − t . (44)Using this, together with Eq. (32), in Eq. (41), we obtain that the following holds true with probabilityat least 1 − p − c − e − t / : T ≤ − η + (cid:114) c t log pn C min . (45)It is easy to check that the this implies T < − η/
2, for λ as claimed in Eq. (21) provided n ≥ M t log p . 20 .1.2 Bounding T We bound T by the same technique used in proving Eq. (28). Let m = (1 /λ )( (cid:98) r T c − (cid:98) Σ T c ,T (cid:98) Σ − T ,T (cid:98) r T ).Plugging for (cid:98) r , we get m ≡ X T c Π X ⊥ T W/ ( nλ ). Since W ∼ N (0 , σ I n × n ), conditioned on X , thevariable m j = x T j Π X ⊥ T W/ ( nλ ) is normal with variance at most( σnλ ) (cid:107) Π X ⊥ T x j (cid:107) ≤ ( σnλ ) (cid:107) x j (cid:107) , where we used the contraction property of orthogonal projections. Now, define the event E as follows. E ≡ (cid:26) (cid:107) x j (cid:107) < n, ∀ j ∈ [ p ] (cid:27) . Note that (cid:107) x j (cid:107) = Σ j,j Z , where Z is a chi-squared random variable with n degrees of freedom.Using the standard chi-squared tail bounds [Joh01], for a fixed j , we have (cid:107) x j (cid:107) < j,j n ≤ n ,with probability at least 1 − e − n/ . Union bounding over j ∈ [ p ], we obtain P ( E c ) ≤ pe − n/ .Under the event E , we have Var( m j ) ≤ σ / ( nλ ). Employing the standard Gaussian tail boundalong with union bounding over j ∈ T c , we obtain P ( T ≥ η/ E ) ≤ pe − nλ η σ = 2 p − c . (46)Hence, P ( T ≥ η/ ≤ P ( T ≥ η/ E ) + P ( E c ) ≤ p − c + pe − n . (47) We next prove Eq. (35). Given Eq. (33), we need to showsign (cid:16) θ ,T − λ Σ − T ,T v ,T (cid:17) = sign (cid:16) θ ,T − (cid:98) Σ − T ,T ( λv ,T − (cid:98) r T ) (cid:17) . Let u ≡ θ ,T − λ Σ − T ,T v ,T , and (cid:98) u ≡ θ ,T − (cid:98) Σ − T ,T ( λv ,T − (cid:98) r T ).By condition (iii), we have, for all i ∈ S , | u i | ≥ | θ ,i |− λ | [Σ − T ,T v ,T ] i | ≥ c λ +(1 / λ | [Σ − T ,T v ,T ] i | .Further, for all i ∈ T \ S , we have | u i | = λ | [Σ − T ,T v ,T ] i | ≥ c λ + (1 / λ | [Σ − T ,T v ,T ] i | . Summarizing,for all i ∈ T , we have | u i | ≥ c λ + 12 λ | [Σ − T ,T v ,T ] i | . We will show that | u i − (cid:98) u i | < c λ + (1 / λ | [Σ − T ,T v ,T ] i | for all i ∈ T , with high probability, thusimplying sign( u T ) = sign( (cid:98) u T ) as desired. Since | u i − (cid:98) u i | ≤ λ | [( (cid:98) Σ − T ,T − Σ − T ,T ) v ,T ] i | + | [ (cid:98) Σ − T ,T (cid:98) r T ] i | ,it suffices to show that T ( i ) ≡ λ | [( (cid:98) Σ − T ,T − Σ − T ,T ) v ,T ] i (cid:12)(cid:12) < λ | [Σ − T ,T v ,T ] i | for all i ∈ T , (48) T ≡ (cid:107) (cid:98) Σ − T ,T (cid:98) r T (cid:107) ∞ < c λ . (49)In the sequel, we provide probabilistic bounds on T ( i ) and T .21 .2.1 Bounding T ( i ) Lemma 6.2.
Under the assumptions of Theorem 3.4, for any c (cid:48) > , t ≥ , we have P (cid:40) ∃ i ∈ T s.t. (cid:12)(cid:12) [( (cid:98) Σ − T ,T − Σ − T ,T ) v ,T ] i (cid:12)(cid:12) ≥ (cid:114) c (cid:48) c ∗ t log pn (cid:12)(cid:12) [Σ − T ,T v ,T ] i (cid:12)(cid:12)(cid:41) ≤ e − t + 2 p − c (cid:48) , where c ∗ ≡ ( c C min ) − . The proof of Lemma 6.2 is presented in Section A.6.Applying this lemma, with probability at least 1 − e − t / − p − c , we have T ( i ) < (1 / λ | [Σ − T ,T v ,T ] i | provided 16 (cid:114) c c ∗ t log pn ≤ . i.e., for n ≥ M t log p . T Lemma 6.3.
The following holds true. P (cid:18) T ≤ σ (cid:114) c log pn C min (cid:19) ≥ − e − t − p − c . (50)Lemma 6.3 is proved in Section A.7.From the last lemma, it follows that Eq. (49) holds with probability at least 1 − e − t − p − c ,provided 3 σ (cid:114) c log pn C min ≤ c λ . Choosing λ as per Eq. (21), the latter is easily shown to follow from η ≤ c √ C min . Now combining the bounds on T ,. . . T , we get that for n ≥ max( M , M ) t log p , Eqs. (34) and (35)hold simultaneously, with probability at least 1 − pe − n/ − e − t / − p − c . This implies sign( (cid:98) θ n ( λ )) = v . Note that the matrix X T is a random Gaussian matrix with rows drawn independently form N (0 , Σ T ,T ) (recall that T is a deterministic set determined by the population-level problem). There-fore, (cid:107) (cid:98) Σ − T ,T (cid:107) ≤ (cid:107) Σ − T ,T (cid:107) ≤ C − . Using Theorem 3.4 to bound the probability that T (cid:54) = T , theproof proceeds along the same lines as the proof of Theorem 2.7.22 cknowledgements A.J. is supported by a Caroline and Fabian Pease Stanford Graduate Fellowship. This work waspartially supported by the NSF CAREER award CCF-0743978, the NSF grant DMS-0806211, andthe grants AFOSR/DARPA FA9550-12-1-0411 and FA9550-13-1-0036.
A Proof of technical lemmas
A.1 Proof of Lemma 2.1
By a change of variables, it is easy to see that (cid:98) θ ZN ( ξ ) = θ + ξ (cid:98) u ( ξ ), where (cid:98) u ( ξ ) = arg min u ∈ R p F ( u ; ξ )and F ( u ; ξ ) ≡ (cid:104) u, (cid:98) Σ u (cid:105) + (cid:107) u S c (cid:107) + (cid:16) (cid:107) ξ − θ ,S + u S (cid:107) − (cid:107) ξ − θ ,S (cid:107) (cid:17) . The rest of the proof is analogous to an argument in [BRT09]. Since, by definition, F ( (cid:98) u ; ξ ) ≤ F (0; ξ ), we have 12 (cid:104) (cid:98) u, (cid:98) Σ (cid:98) u (cid:105) + (cid:107) (cid:98) u S c (cid:107) − (cid:107) (cid:98) u S (cid:107) ≤ (cid:107) (cid:98) u S c (cid:107) ≤ (cid:107) (cid:98) u S (cid:107) . Using the definition of (cid:98) κ , with J = S , c = 1, we have0 ≥ (cid:98) κ ( s , (cid:107) (cid:98) u (cid:107) + (cid:107) (cid:98) u S c (cid:107) − (cid:107) (cid:98) u S (cid:107) ≥ (cid:98) κ ( s , (cid:107) (cid:98) u S (cid:107) − (cid:107) (cid:98) u S (cid:107) , and since (cid:107) (cid:98) u S (cid:107) ≥ (cid:107) (cid:98) u S (cid:107) /s , we deduce that (cid:107) (cid:98) u S (cid:107) ≤ s (cid:98) κ ( s , . By Eq. (51), this implies in turn (cid:104) (cid:98) u, (cid:98) Σ (cid:98) u (cid:105) ≤ s (cid:98) κ ( s , . (52)Now, consider the stationarity conditions of F . These imply( (cid:98) Σ (cid:98) u ) i = − sign( (cid:98) u i ) , for i ∈ T \ S. We therefore have | T \ S | ≤ (cid:88) i ∈ T \ S ( (cid:98) Σ (cid:98) u ) i ≤ (cid:107) (cid:98) Σ (cid:98) u (cid:107) ≤ (cid:107) (cid:98) Σ (cid:107) (cid:104) (cid:98) u, (cid:98) Σ (cid:98) u (cid:105) , and our claim follows by substituting Eq. (52) in the latter equation.23 .2 Proof of Lemma 2.2 By a change of variables, it is easy to see that (cid:98) θ ZN ( ξ ) = θ + ξ (cid:98) u ( ξ ), where (cid:98) u ( ξ ) = arg min u ∈ R p F ( u ; ξ )and F ( u ; ξ ) ≡ (cid:104) u, (cid:98) Σ u (cid:105) + (cid:107) u S c (cid:107) + (cid:16) (cid:107) ξ − θ ,S + u S (cid:107) − (cid:107) ξ − θ ,S (cid:107) (cid:17) . Notice that, for any u ∈ R p , lim ξ → F ( u ; ξ ) = F ( u ), where F ( u ) ≡ (cid:104) u, (cid:98) Σ u (cid:105) + (cid:107) u S c (cid:107) + (cid:104) sign( θ ,S ) , u S (cid:105) . Indeed F ( u ; ξ ) = F ( u ) provided ξ ≤ min i ∈ S | θ ,i /u i | . Further, F ( u ; ξ ) ≥ F ( u ) for all u .Let u ≡ arg min u ∈ R p F ( u ), and set ξ ≡ min i ∈ S | θ ,i /u ,i | . Then, for any u (cid:54) = u , and all ξ ∈ (0 , ξ ), we have F ( u ; ξ ) ≥ F ( u ) > F ( u ) = F ( u ; ξ ) . Hence u is the unique minimizer of F ( u ; ξ ), i.e., (cid:98) u ( ξ ) = u for all ξ ∈ (0 , ξ ).It follows that (cid:98) θ ZN ( ξ ) = θ + ξ u for all ξ ∈ (0 , ξ ) and hence sign( (cid:98) θ ZN ( ξ )) = v and supp( (cid:98) θ ZN ( ξ )) = T where we set v ,S ≡ sign( θ ,S ) ,v ,S c ≡ sign( u ,S c ) ,T ≡ S ∪ supp( u ) . Finally, the zero subgradient condition for u reads (cid:98) Σ u + z = 0, with z S = sign( θ ,S ) and z S c ∈ ∂ (cid:107) u ,S c (cid:107) . In particular, z T = v ,T and therefore u ,T = − (cid:98) Σ − T ,T v T . This implies ξ ≡ min i ∈ S (cid:12)(cid:12)(cid:12)(cid:12) θ ,i u ,i (cid:12)(cid:12)(cid:12)(cid:12) = min i ∈ S (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) θ ,i [ (cid:98) Σ − T ,T v ,T ] i (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) . A.3 Proof of Lemma 2.3
Writing the zero-subgradient conditions for problem (5), we have (cid:98) Σ( (cid:98) θ ZN − θ ) = − ξu, u ∈ ∂ (cid:107) (cid:98) θ ZN (cid:107) . Given that T ⊇ S , we have θ ,T c = 0, and thus (cid:98) Σ T,T ( (cid:98) θ ZN T − θ ,T ) = − ξu T , (cid:98) Σ T c ,T ( (cid:98) θ ZN T − θ ,T ) = − ξu T c . Solving for (cid:98) θ ZN T − θ ,T in terms of u T , we obtain (cid:98) Σ T c ,T (cid:98) Σ − T,T u T = u T c , (cid:98) θ ZN T = θ ,T − ξ (cid:98) Σ − T,T u T . u T = sign( (cid:98) θ ZN T ) = v T , and (cid:107) u T c (cid:107) ∞ ≤ u ∈ ∂ (cid:107) (cid:98) θ ZN (cid:107) .Now suppose that Eqs. (8) and (9) hold true.Let ˜ θ T = θ ,T − ξ (cid:98) Σ − T,T v T , and ˜ θ T c = 0. We prove that ˜ θ = (cid:98) θ ZN , by showing that it satisfiesthe zero-subgradient condition. By Eq. (9), v T = sign(˜ θ T ). Define u ∈ R p by letting u T = v T and u T c = (cid:98) Σ T c ,T (cid:98) Σ − T,T v T . Note that (cid:107) u T c (cid:107) ∞ ≤ u ∈ ∂ (cid:107) ˜ θ (cid:107) . Moreover, (cid:98) Σ T,T (˜ θ T − θ ,T ) = − ξu T (cid:98) Σ T c ,T (˜ θ T − θ ,T ) = − ξu T c , Combining the above two equations, we get the zero-subgradient condition for (˜ θ, u ). Therefore,˜ θ = (cid:98) θ ZN , and v = sign( (cid:98) θ ZN ). A.4 Proof of Lemma 2.4
A.4 Proof of Lemma 2.4

The proof proceeds along the same lines as the proof of Lemma 2.3. We begin by proving the 'only if' part. The zero-subgradient condition for problem (3) reads

    −(1/n) X^T(Y − X θ̂^n) + λu = 0,   u ∈ ∂‖θ̂^n‖_1.

Plugging in Y = X θ_0 + W and r̂ = X^T W/n, we arrive at

    Σ̂(θ̂^n − θ_0) = r̂ − λu.

Since T ⊇ S, we have θ_{0,T^c} = 0; writing the above equation for the indices in T and in T^c separately, we obtain

    Σ̂_{T,T}(θ̂^n_T − θ_{0,T}) = r̂_T − λ u_T,
    Σ̂_{T^c,T}(θ̂^n_T − θ_{0,T}) = r̂_{T^c} − λ u_{T^c}.

Solving the first equation for θ̂^n_T − θ_{0,T} and substituting in the second, we get

    Σ̂_{T^c,T} Σ̂^{-1}_{T,T} u_T + (1/λ)( r̂_{T^c} − Σ̂_{T^c,T} Σ̂^{-1}_{T,T} r̂_T ) = u_{T^c},
    θ̂^n_T = θ_{0,T} − Σ̂^{-1}_{T,T}( λ u_T − r̂_T ).

This proves Eqs. (11) and (12), since u_T = sign(θ̂^n_T) = z_T and ‖u_{T^c}‖_∞ ≤ 1.

Conversely, suppose that Eqs. (11) and (12) hold true. Let θ̃_T = θ_{0,T} − Σ̂^{-1}_{T,T}(λ z_T − r̂_T), and θ̃_{T^c} = 0. We prove that θ̃ = θ̂^n, by showing that it satisfies the zero-subgradient condition. By Eq. (12), z_T = sign(θ̃_T). Define u ∈ R^p by letting u_T = z_T and u_{T^c} = Σ̂_{T^c,T} Σ̂^{-1}_{T,T} z_T + ( r̂_{T^c} − Σ̂_{T^c,T} Σ̂^{-1}_{T,T} r̂_T )/λ. Note that ‖u_{T^c}‖_∞ ≤ 1 by Eq. (11), whence u ∈ ∂‖θ̃‖_1. Moreover,

    Σ̂_{T,T}(θ̃_T − θ_{0,T}) = −(λ u_T − r̂_T),
    Σ̂_{T^c,T}(θ̃_T − θ_{0,T}) = −(λ u_{T^c} − r̂_{T^c}).

Combining the above two equations, we get the zero-subgradient condition for (θ̃, u). Therefore θ̃ = θ̂^n, and z = sign(θ̂^n).

A.5 Proof of Lemma 5.1

Let m = Σ̂^{-1}_{T_0,T_0} r̂_{T_0} = (X_{T_0}^T X_{T_0})^{-1} X_{T_0}^T W. Conditional on X_{T_0}, m_i is a zero-mean Gaussian variable with variance σ^2 ‖e_i^T (X_{T_0}^T X_{T_0})^{-1} X_{T_0}^T‖^2. By a Gaussian tail bound, we get

    P( |m_i| ≥ √(2c log p) σ ‖e_i^T (X_{T_0}^T X_{T_0})^{-1} X_{T_0}^T‖ ) ≤ 2 p^{-c}.

Further, notice that ‖e_i^T (X_{T_0}^T X_{T_0})^{-1} X_{T_0}^T‖^2 = e_i^T (X_{T_0}^T X_{T_0})^{-1} e_i ≤ ‖Σ̂^{-1}_{T_0,T_0}‖_2 / n. By union bounding over the at most p indices i, we have

    P( ‖m‖_∞ ≥ σ √(2c log p / n) ‖Σ̂^{-1}_{T_0,T_0}‖_2^{1/2} ) ≤ 2 p^{-c+1}.
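The zero-subgradient condition of Lemma 2.4 can likewise be verified on synthetic data: at any Lasso optimum, the vector u = (r̂ − Σ̂(θ̂^n − θ_0))/λ must equal sign(θ̂^n) on the active set and have sup-norm at most one. A hedged sketch (the data-generating choices and the ISTA solver are ours):

    import numpy as np

    rng = np.random.default_rng(2)
    n, p, s = 200, 20, 3
    theta0 = np.zeros(p); theta0[:s] = 1.0
    X = rng.standard_normal((n, p))
    Y = X @ theta0 + 0.5 * rng.standard_normal(n)
    lam = 0.5 * np.sqrt(2 * np.log(p) / n)
    Sig = X.T @ X / n
    r = X.T @ (Y - X @ theta0) / n                 # r-hat = X^T W / n

    # ISTA for the Lasso objective (1/2n)||Y - X theta||^2 + lam * ||theta||_1
    L = np.linalg.eigvalsh(Sig).max()
    b = X.T @ Y / n
    th = np.zeros(p)
    for _ in range(20000):
        z = th - (Sig @ th - b) / L
        th = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)

    # Stationarity: Sig (th - theta0) = r - lam * u with u in the subgradient of ||.||_1
    u = (r - Sig @ (th - theta0)) / lam
    act = np.abs(th) > 1e-8
    print(np.allclose(u[act], np.sign(th[act]), atol=1e-4), np.max(np.abs(u)) <= 1 + 1e-4)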
A.6 Proof of Lemma 6.2

We begin by stating and proving a lemma that is similar to Lemma 5 in [Wai09], but provides stronger control.
Lemma A.1.
Let Z ∈ R^{n×k} be a random matrix with i.i.d. Gaussian rows with zero mean and covariance Σ, with k ≥ 4. Further, let a_1, …, a_M ∈ R^k and b_1, …, b_M ∈ R^k be non-random vectors. Then, letting Σ̂_Z ≡ Z^T Z/n, we have, for all ∆ > 0:

    P{ ∃ i ∈ [M] s.t. |⟨a_i, (Σ̂_Z^{-1} − Σ^{-1}) b_i⟩| ≥ 8√(k/n) |⟨a_i, Σ^{-1} b_i⟩| + ∆ ‖Σ^{-1/2} a_i‖ ‖Σ^{-1/2} b_i‖ }
        ≤ 4 e^{-k/2} + 2M exp( −n∆^2/256 ).                                 (53)

Proof.
First notice that Z = Z̃ Σ^{1/2}, with Z̃ ∈ R^{n×k} a random matrix with i.i.d. standard Gaussian entries Z̃_{ij} ~ N(0, 1). It is therefore sufficient to prove the claim for Σ = I_{k×k} (i.e., for Z with i.i.d. entries), which we shall assume hereafter.

Defining the event E_* = { ‖Σ̂^{-1} − I‖_2 ≤ 8√(k/n) }, we have, by Eq. (39) and the union bound,

    P{ ∃ i ∈ [M] s.t. |⟨a_i, (Σ̂^{-1} − I) b_i⟩| ≥ 8√(k/n) |⟨a_i, b_i⟩| + ∆ ‖a_i‖ ‖b_i‖ }
        ≤ 4 e^{-k/2} + M max_{i ∈ [M]} P{ |⟨a_i, (Σ̂^{-1} − I) b_i⟩| ≥ 8√(k/n) |⟨a_i, b_i⟩| + ∆ ‖a_i‖ ‖b_i‖ ; E_* }.

We can now concentrate on the last probability. Let α ≡ |⟨a_i, b_i⟩| and β ≡ ( ‖a_i‖^2 ‖b_i‖^2 − ⟨a_i, b_i⟩^2 )^{1/2}. Since Σ̂ is distributed as R Σ̂ R^T for any orthogonal matrix R, we have

    ⟨a_i, (Σ̂^{-1} − I) b_i⟩ =_d α ⟨e_1, (Σ̂^{-1} − I) e_1⟩ + β ⟨e_1, (Σ̂^{-1} − I) e_2⟩,

where =_d denotes equality in distribution. Under the event E_*, we have |α ⟨e_1, (Σ̂^{-1} − I) e_1⟩| ≤ 8α√(k/n). Further, (Σ̂^{-1} − I) = U D U^T with U a uniformly random orthogonal matrix (with respect to the Haar measure on the manifold of orthogonal matrices). Letting u_1, u_2 denote the first two rows of U, we then have

    P{ |⟨a_i, (Σ̂^{-1} − I) b_i⟩| ≥ 8√(k/n) |⟨a_i, b_i⟩| + ∆ ‖a_i‖ ‖b_i‖ ; E_* } ≤ P{ |⟨u_1, D u_2⟩| ≥ ∆ ; E_* }.

Notice that, conditional on u_1 and D, u_2 is uniformly random on a (k − 1)-dimensional sphere (the unit sphere orthogonal to u_1). Letting v = D u_1, we have ‖v‖ ≤ 8√(k/n) under E_*. Hence, by isoperimetric inequalities on the sphere [Led01], we obtain

    P{ |⟨u_1, D u_2⟩| ≥ ∆ ; E_* } ≤ sup_{‖v‖ ≤ 8√(k/n)} P{ |⟨u_2, v⟩| ≥ ∆ | v }
        ≤ 2 exp( −(k − 1)∆^2 / (128 k/n) ) ≤ 2 exp( −n∆^2/256 ),

where the last inequality holds for all k ≥ 4. The proof is completed by substituting this inequality in the expressions above.
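The √(k/n) scale in Lemma A.1 can be observed in simulation. The sketch below (a Monte Carlo illustration under our own parameter choices; it probes the order of magnitude only and does not certify the lemma's constants) samples Wishart matrices with Σ = I and records the deviations ⟨a, (Σ̂^{-1} − I)b⟩ for fixed unit vectors a, b.

    import numpy as np

    rng = np.random.default_rng(3)
    k, n, reps = 10, 2000, 500
    a = np.eye(k)[0]
    b = np.ones(k) / np.sqrt(k)
    devs = []
    for _ in range(reps):
        Z = rng.standard_normal((n, k))            # Sigma = I case of Lemma A.1
        Sig_hat = Z.T @ Z / n
        devs.append(a @ np.linalg.solve(Sig_hat, b) - a @ b)
    # The 95% quantile of the deviations is of the same order as sqrt(k/n):
    print(np.quantile(np.abs(devs), 0.95), np.sqrt(k / n))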
We are now in a position to prove Lemma 6.2.

Proof (Lemma 6.2).
We apply Lemma A.1 to Σ̂ = Σ̂_{T_0,T_0}, with M = t_0, a_i = e_i and b_i = v_{0,T_0} for i ∈ {1, …, t_0}. We get

    P{ ∃ i ∈ T_0 s.t. |[(Σ̂^{-1}_{T_0,T_0} − Σ^{-1}_{T_0,T_0}) v_{0,T_0}]_i| ≥ 8√(t_0/n) |[Σ^{-1}_{T_0,T_0} v_{0,T_0}]_i| + ∆ ‖Σ^{-1/2}_{T_0,T_0} e_i‖ ‖Σ^{-1/2}_{T_0,T_0} v_{0,T_0}‖ }
        ≤ 4 e^{-t_0/2} + 2 t_0 exp( −n∆^2/256 ).

Note that ‖Σ^{-1/2}_{T_0,T_0} e_i‖ ‖Σ^{-1/2}_{T_0,T_0} v_{0,T_0}‖ ≤ C_min^{-1} ‖e_i‖ ‖v_{0,T_0}‖ = C_min^{-1} √t_0. Further, |[Σ^{-1}_{T_0,T_0} v_{0,T_0}]_i| ≥ c_*, and hence

    ‖Σ^{-1/2}_{T_0,T_0} e_i‖ ‖Σ^{-1/2}_{T_0,T_0} v_{0,T_0}‖ ≤ ( √t_0 / (c_* C_min) ) |[Σ^{-1}_{T_0,T_0} v_{0,T_0}]_i|.

We therefore get

    P{ ∃ i ∈ T_0 s.t. |[(Σ̂^{-1}_{T_0,T_0} − Σ^{-1}_{T_0,T_0}) v_{0,T_0}]_i| ≥ ( 8√(t_0/n) + ∆ √t_0 / (c_* C_min) ) |[Σ^{-1}_{T_0,T_0} v_{0,T_0}]_i| }
        ≤ 4 e^{-t_0/2} + 2 t_0 exp( −n∆^2/256 ).

The proof is completed by taking ∆ = 16 √( (c′ log p)/n ).
A.7 Proof of Lemma 6.3

By Lemma 5.1, we have

    P( ‖Σ̂^{-1}_{T_0,T_0} r̂_{T_0}‖_∞ ≥ σ √(2c log p / n) ‖Σ̂^{-1}_{T_0,T_0}‖_2^{1/2} ) ≤ 2 p^{-c+1}.

Recalling Eq. (43), under the event E we have ‖Σ̂^{-1}_{T_0,T_0}‖_2 ≤ 2 C_min^{-1}. Since P(E^c) ≤ 4 e^{-t_0/2}, we arrive at

    P( ‖Σ̂^{-1}_{T_0,T_0} r̂_{T_0}‖_∞ ≥ 2σ √( c log p / (n C_min) ) ) ≤ 2 p^{-c+1} + 4 e^{-t_0/2}.
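The sup-norm bound of Lemma 6.3 is also easy to probe empirically. In the sketch below (our own illustration; with Σ_{T_0,T_0} = I we have C_min = 1, and the ambient dimension p entering the bound is an arbitrary choice), the observed values of ‖(X_{T_0}^T X_{T_0})^{-1} X_{T_0}^T W‖_∞ fall below the stated σ√(log p/n) scale.

    import numpy as np

    rng = np.random.default_rng(4)
    n, t0, sigma, reps = 500, 20, 1.0, 200
    sup_norms = []
    for _ in range(reps):
        X = rng.standard_normal((n, t0))           # Sigma_{T0,T0} = I, so C_min = 1
        W = sigma * rng.standard_normal(n)
        m = np.linalg.solve(X.T @ X, X.T @ W)      # = Sig_hat^{-1} r_hat restricted to T0
        sup_norms.append(np.max(np.abs(m)))
    p_dim = 1000                                   # ambient dimension used in the bound
    print(np.mean(sup_norms), 2 * sigma * np.sqrt(np.log(p_dim) / n))  # observed vs bound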
B Generalized irrepresentability vs. irrepresentability

In this appendix we discuss the example provided in Section 1.1 in more detail. The objective is to develop some intuition on the domain of validity of generalized irrepresentability, and to compare it with the standard irrepresentability condition.

As explained in Section 1.1, let S = supp(θ_0) = {1, …, s} and consider the following covariance matrix:

    Σ_{ij} = 1   if i = j,
             a   if i = p, j ∈ S, or i ∈ S, j = p,
             0   otherwise.

Equivalently, Σ = I_{p×p} + a( e_p u_S^T + u_S e_p^T ), where u_S is the vector with entries (u_S)_i = 1 for i ∈ S and (u_S)_i = 0 for i ∉ S. It is easy to check that Σ is strictly positive definite for a ∈ (−1/√s, +1/√s). By redefining the p-th covariate, we can assume, without loss of generality, a ∈ [0, +1/√s). We will further assume sign(θ_{0,i}) = +1 for all i ∈ S.

This example captures the case of a single confounding variable, i.e., of an irrelevant covariate that correlates strongly with the relevant covariates, and hence with the response variable.

We will show that the Gauss-Lasso has a significantly broader domain of validity than the simple Lasso.
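Before stating the claim, it may help to see the construction concretely. The sketch below (parameter values are our own) builds Σ = I + a(e_p u_S^T + u_S e_p^T) and confirms that its smallest eigenvalue is 1 − a√s, so that Σ is strictly positive definite precisely for |a| < 1/√s.

    import numpy as np

    p, s, a = 30, 9, 0.2                           # note 1/s ≈ 0.11 < a < 1/sqrt(s) ≈ 0.33
    u_S = np.zeros(p); u_S[:s] = 1.0
    e_p = np.eye(p)[p - 1]
    Sigma = np.eye(p) + a * (np.outer(e_p, u_S) + np.outer(u_S, e_p))
    print(np.isclose(np.linalg.eigvalsh(Sigma).min(), 1 - a * np.sqrt(s)))  # True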
Consider the Gaussian design defined above, and suppose that a > 1/s. Then, for any regularization parameter λ and any sample size n, the probability of correct signed support recovery with the Lasso is at most 1/2 (and correct recovery is not guaranteed with high probability unless a ∈ [0, (1 − η)/s], for some constant η > 0).

On the other hand, Theorem 3.7 implies correct support recovery with the Gauss-Lasso from n = Ω(s log p) samples, for any

    a ∈ [0, (1 − η)/s] ∪ ( (1 + η)/s, (1 − η)/√s ].                        (54)

Proof.
In order to prove that the Gauss-Lasso correctly recovers the support of θ_0, we will show that all the conditions of Theorem 3.4 and Theorem 3.7 hold with constants of order one, provided Eq. (54) holds. Conversely, the irrepresentability condition does not hold unless a ∈ [0, 1/s), and hence the simple Lasso fails outside this regime.

We now proceed to check the assumptions of Theorems 3.4 and 3.7, while showing that irrepresentability does not hold for a ≥ 1/s.
We have λ_min(Σ) = 1 − a√s. In particular, for any set T ⊆ [p], we have λ_min(Σ_{T,T}) ≥ 1 − a√s ≥ η. Also, for any constant c ≥ 1, κ(s_0, c) ≥ 1 − a√s ≥ η.
We have Σ_{S,S} = I_{s×s}, and hence ‖Σ_{S^c,S} Σ^{-1}_{S,S}‖_∞ = ‖Σ_{p,S}‖_1 = as. Hence the irrepresentability condition holds only if a ∈ [0, 1/s). The corresponding irrepresentability parameter is η = 1 − as.

For large s, the condition is therefore only satisfied on a small interval of values of a, compared to the interval on which Σ is positive definite.
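This computation is immediate to reproduce numerically. A minimal sketch (continuing the construction above, with our own parameter values) evaluates ‖Σ_{S^c,S} Σ^{-1}_{S,S} sign(θ_{0,S})‖_∞ for the confounder design and recovers the value as:

    import numpy as np

    p, s, a = 30, 9, 0.05                          # a < 1/s, so irrepresentability holds
    u_S = np.zeros(p); u_S[:s] = 1.0
    e_p = np.eye(p)[p - 1]
    Sigma = np.eye(p) + a * (np.outer(e_p, u_S) + np.outer(u_S, e_p))
    S = np.arange(s); Sc = np.arange(s, p)
    irr = Sigma[np.ix_(Sc, S)] @ np.linalg.solve(Sigma[np.ix_(S, S)], np.ones(s))
    print(np.max(np.abs(irr)), a * s)              # both equal 0.45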
Generalized irrepresentability condition. In order to check this condition, we need to compute T_0 and v_0, defined as per Lemma 3.2. We have θ̂^∞(ξ) = argmin_{θ ∈ R^p} G(θ; ξ), where

    G(θ; ξ) ≡ (1/2)⟨θ − θ_0, Σ(θ − θ_0)⟩ + ξ‖θ‖_1 = (1/2)‖θ − θ_0‖^2 + a ⟨u_S, θ_S − θ_{0,S}⟩ θ_p + ξ‖θ‖_1.

From this expression, it is immediate to see that θ̂^∞_i(ξ) = 0 for i ∉ S ∪ {p}. Further, θ̂^∞_{S∪{p}}(ξ) satisfies

    θ_S − θ_{0,S} + a θ_p u_S + ξ v_S = 0,                                  (55)
    θ_p + a ⟨u_S, θ_S − θ_{0,S}⟩ + ξ v_p = 0,                               (56)

with v_S ∈ ∂‖θ_S‖_1 and v_p ∈ ∂|θ_p|. Since θ_{0,S} > 0, we have, from Eq. (55),

    θ̂^∞_S = θ_{0,S} − ( a θ̂^∞_p + ξ ) u_S,

provided ( a θ̂^∞_p + ξ ) ≤ θ_min. Substituting in Eq. (56) and solving for θ_p, we get

    θ̂^∞_p(ξ) = 0                              if a ∈ [0, 1/s),
    θ̂^∞_p(ξ) = ( (as − 1)/(1 − a^2 s) ) ξ     if a ∈ [1/s, 1/√s).

This holds provided ( a θ̂^∞_p + ξ ) ≤ θ_min, i.e., if ξ ≤ ξ_* ≡ min( 1, (1 − a^2 s)/(1 − a) ) θ_min.

Using the definition in Lemma 3.2, we have

    T_0 = S            if a ∈ [0, 1/s),
    T_0 = S ∪ {p}      if a ∈ [1/s, 1/√s),

and v_{0,T_0} is the all-ones vector (v_{0,i} = +1 for all i ∈ T_0).

We can now check the generalized irrepresentability condition. For a ∈ [0, 1/s) we have ‖Σ_{T_0^c,T_0} Σ^{-1}_{T_0,T_0} v_{0,T_0}‖_∞ = ‖Σ_{S^c,S} Σ^{-1}_{S,S} u_S‖_∞ = as, and therefore the generalized irrepresentability condition is satisfied with parameter η = 1 − as. For a ∈ [1/s, 1/√s), we have ‖Σ_{T_0^c,T_0} Σ^{-1}_{T_0,T_0} v_{0,T_0}‖_∞ = 0. We therefore conclude that, for any fixed η ∈ (0, 1), the generalized irrepresentability condition with parameter η is satisfied for

    a ∈ [0, (1 − η)/s] ∪ [1/s, 1/√s),

a significantly larger domain than for simple irrepresentability.
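The vanishing of the generalized irrepresentability quantity for a > 1/s can be checked in the same way, now with T_0 = S ∪ {p}. A sketch under our own parameter choices:

    import numpy as np

    p, s, a = 30, 9, 0.2                           # 1/s < a < 1/sqrt(s): irrepresentability fails
    u_S = np.zeros(p); u_S[:s] = 1.0
    e_p = np.eye(p)[p - 1]
    Sigma = np.eye(p) + a * (np.outer(e_p, u_S) + np.outer(u_S, e_p))
    T0 = np.append(np.arange(s), p - 1)            # T0 = S union {p}
    T0c = np.arange(s, p - 1)
    v0 = np.ones(s + 1)                            # all-ones sign vector on T0
    gic = Sigma[np.ix_(T0c, T0)] @ np.linalg.solve(Sigma[np.ix_(T0, T0)], v0)
    print(np.max(np.abs(gic)))                     # 0.0: GIC holds for this range of a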
Minimum entry condition. For a ∈ [0, 1/s), we have T_0 = S, and it is therefore only necessary to check Eq. (22). Since [Σ^{-1}_{T_0,T_0} v_{0,T_0}]_i = 1, this reads

    |θ_{0,i}| ≥ ( c + 3/2 ) λ = C σ √(log p / n),

with C a constant.

For a ∈ (1/s, (1 − η)/√s], we have T_0 = S ∪ {p}. A straightforward calculation shows that

    |[Σ^{-1}_{T_0,T_0} v_{0,T_0}]_i| = (1 − a)/(1 − a^2 s)   for i ∈ S,
    |[Σ^{-1}_{T_0,T_0} v_{0,T_0}]_p| = (as − 1)/(1 − a^2 s).

It is not hard to show that, for all a satisfying Eq. (54), we have

    |[Σ^{-1}_{T_0,T_0} v_{0,T_0}]_i| ≤ 1/(1 − (1 − η)^2)   for i ∈ S,
    |[Σ^{-1}_{T_0,T_0} v_{0,T_0}]_p| ≥ C,

for some constant C > 0. It therefore follows that condition (22) holds if |θ_{0,i}| ≥ C′ σ √(log p / n), and condition (23) holds with c = C/2.

References
[Bac08] F. R. Bach, Bolasso: model consistent Lasso estimation through the bootstrap, Proceedings of the 25th International Conference on Machine Learning, ACM, 2008, pp. 33–40.

[BRT09] P. J. Bickel, Y. Ritov, and A. B. Tsybakov, Simultaneous analysis of Lasso and Dantzig selector, Annals of Statistics 37 (2009), 1705–1732.

[Büh12] P. Bühlmann, Statistical significance in high-dimensional linear models, arXiv:1202.1377, 2012.

[BvdG11] P. Bühlmann and S. van de Geer, Statistics for high-dimensional data, Springer-Verlag, 2011.

[CD95] S. S. Chen and D. L. Donoho, Examples of basis pursuit, Proceedings of Wavelet Applications in Signal and Image Processing III (San Diego, CA), 1995.

[CP09] E. J. Candès and Y. Plan, Near-ideal model selection by ℓ1 minimization, The Annals of Statistics 37 (2009), no. 5A, 2145–2177.

[CRT06] E. Candès, J. K. Romberg, and T. Tao, Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information, IEEE Trans. on Inform. Theory 52 (2006), 489–509.

[CT05] E. J. Candès and T. Tao, Decoding by linear programming, IEEE Trans. on Inform. Theory 51 (2005), 4203–4215.

[CT07] E. Candès and T. Tao, The Dantzig selector: statistical estimation when p is much larger than n, Annals of Statistics 35 (2007), 2313–2351.

[Don06] D. L. Donoho, Compressed sensing, IEEE Trans. on Inform. Theory 52 (2006), 1289–1306.

[DS01] K. R. Davidson and S. J. Szarek, Local operator theory, random matrices and Banach spaces, Handbook of the Geometry of Banach Spaces, vol. 1, Elsevier Science, 2001, pp. 317–366.

[EHJT04] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, Least angle regression, Annals of Statistics 32 (2004), 407–499.

[FA10] A. Frank and A. Asuncion, UCI machine learning repository (communities and crime data set), http://archive.ics.uci.edu/ml, University of California, Irvine, School of Information and Computer Sciences, 2010.

[JM13] A. Javanmard and A. Montanari, Hypothesis testing in high-dimensional regression under the Gaussian random design model: asymptotic theory, arXiv:1301.4240, 2013.

[Joh01] I. Johnstone, Chi-squared oracle inequalities, State of the Art in Probability and Statistics (M. de Gunst, C. Klaassen, and A. van der Vaart, eds.), IMS Lecture Notes, Institute of Mathematical Statistics, 2001, pp. 399–418.

[KF00] K. Knight and W. Fu, Asymptotics for Lasso-type estimators, Annals of Statistics 28 (2000), 1356–1378.

[Led01] M. Ledoux, The concentration of measure phenomenon, Mathematical Surveys and Monographs, vol. 89, American Mathematical Society, Providence, RI, 2001.

[Lou08] K. Lounici, Sup-norm convergence rate and sign concentration property of Lasso and Dantzig estimators, Electronic Journal of Statistics 2 (2008), 90–102.

[MB06] N. Meinshausen and P. Bühlmann, High-dimensional graphs and variable selection with the Lasso, Annals of Statistics 34 (2006), 1436–1462.

[PZB+10] J. Peng, J. Zhu, A. Bergamaschi, W. Han, D.-Y. Noh, J. R. Pollack, and P. Wang, Regularized multivariate regression for identifying master predictors with application to integrative genomics study of breast cancer, The Annals of Applied Statistics 4 (2010), no. 1, 53–77.

[SK03] S. K. Shevade and S. S. Keerthi, A simple and efficient algorithm for gene selection using sparse logistic regression, Bioinformatics 19 (2003), no. 17, 2246–2253.

[Tib96] R. Tibshirani, Regression shrinkage and selection via the Lasso, J. Royal Statist. Soc. B 58 (1996), 267–288.

[vdGB09] S. A. van de Geer and P. Bühlmann, On the conditions used to prove oracle results for the Lasso, Electron. J. Statist. 3 (2009), 1360–1392.

[vdGBR13] S. van de Geer, P. Bühlmann, and Y. Ritov, On asymptotically optimal confidence regions and tests for high-dimensional models, arXiv:1303.0518, 2013.

[Ver12] R. Vershynin, Introduction to the non-asymptotic analysis of random matrices, Compressed Sensing: Theory and Applications (Y. C. Eldar and G. Kutyniok, eds.), Cambridge University Press, 2012, pp. 210–268.

[Wai09] M. J. Wainwright, Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (Lasso), IEEE Trans. on Inform. Theory 55 (2009), 2183–2202.

[Zho10] S. Zhou, Thresholded Lasso for high dimensional variable selection and statistical estimation, arXiv:1002.1583v2, 2010.

[Zou06] H. Zou, The adaptive Lasso and its oracle properties, Journal of the American Statistical Association 101 (2006), no. 476, 1418–1429.

[ZY06] P. Zhao and B. Yu, On model selection consistency of Lasso, The Journal of Machine Learning Research 7 (2006), 2541–2563.

[ZZ11] C.-H. Zhang and S. S. Zhang, Confidence intervals for low-dimensional parameters in high-dimensional linear models, arXiv:1110.2563, 2011.