High-dimensional regression with noisy and missing data: Provable guarantees with nonconvexity
aa r X i v : . [ m a t h . S T ] S e p The Annals of Statistics (cid:13)
Institute of Mathematical Statistics, 2012
HIGH-DIMENSIONAL REGRESSION WITH NOISY AND MISSINGDATA: PROVABLE GUARANTEES WITH NONCONVEXITY
By Po-Ling Loh , and Martin J. Wainwright University of California, Berkeley
Although the standard formulations of prediction problems in-volve fully-observed and noiseless data drawn in an i.i.d. manner,many applications involve noisy and/or missing data, possibly in-volving dependence, as well. We study these issues in the contextof high-dimensional sparse linear regression, and propose novel esti-mators for the cases of noisy, missing and/or dependent data. Manystandard approaches to noisy or missing data, such as those usingthe EM algorithm, lead to optimization problems that are inherentlynonconvex, and it is difficult to establish theoretical guarantees onpractical algorithms. While our approach also involves optimizingnonconvex programs, we are able to both analyze the statistical errorassociated with any global optimum, and more surprisingly, to provethat a simple algorithm based on projected gradient descent will con-verge in polynomial time to a small neighborhood of the set of allglobal minimizers. On the statistical side, we provide nonasymptoticbounds that hold with high probability for the cases of noisy, missingand/or dependent data. On the computational side, we prove that un-der the same types of conditions required for statistical consistency,the projected gradient descent algorithm is guaranteed to converge ata geometric rate to a near-global minimizer. We illustrate these the-oretical predictions with simulations, showing close agreement withthe predicted scalings.
1. Introduction.
In standard formulations of prediction problems, it isassumed that the covariates are fully-observed and sampled independentlyfrom some underlying distribution. However, these assumptions are not re-
Received September 2011; revised May 2012. Supported in part by a Hertz Foundation Fellowship and the Department of Defense(DoD) through an NDSEG Fellowship. Supported in part by NSF Grant DMS-09-07632 and Air Force Office of ScientificResearch Grant AFOSR-09NL184.
AMS 2000 subject classifications.
Primary 62F12; secondary 68W25.
Key words and phrases.
High-dimensional statistics, missing data, nonconvexity, regu-larization, sparse linear regression, M -estimation. This is an electronic reprint of the original article published by theInstitute of Mathematical Statistics in
The Annals of Statistics ,2012, Vol. 40, No. 3, 1637–1664. This reprint differs from the original inpagination and typographic detail. 1
P.-L. LOH AND M. J. WAINWRIGHT alistic for many applications, in which covariates may be observed only par-tially, observed subject to corruption or exhibit some type of dependency.Consider the problem of modeling the voting behavior of politicians: in thissetting, votes may be missing due to abstentions, and temporally dependentdue to collusion or “tit-for-tat” behavior. Similarly, surveys often suffer fromthe missing data problem, since users fail to respond to all questions. Sensornetwork data also tends to be both noisy due to measurement error, andpartially missing due to failures or drop-outs of sensors.There are a variety of methods for dealing with noisy and/or missing data,including various heuristic methods, as well as likelihood-based methods in-volving the expectation–maximization (EM) algorithm (e.g., see the book [8]and references therein). A challenge in this context is the possible noncon-vexity of associated optimization problems. For instance, in applications ofEM, problems in which the negative likelihood is a convex function oftenbecome nonconvex with missing or noisy data. Consequently, although theEM algorithm will converge to a local minimum, it is difficult to guaranteethat the local optimum is close to a global minimum.In this paper, we study these issues in the context of high-dimensionalsparse linear regression—in particular, in the case when the predictors orcovariates are noisy, missing, and/or dependent. Our main contribution isto develop and study simple methods for handling these issues, and to provetheoretical results about both the associated statistical error and the op-timization error. Like EM-based approaches, our estimators are based onsolving optimization problems that may be nonconvex; however, despite thisnonconvexity, we are still able to prove that a simple form of projected gra-dient descent will produce an output that is “sufficiently close”—as small asthe statistical error—to any global optimum. As a second result, we boundthe statistical error, showing that it has the same scaling as the minimaxrates for the classical cases of perfectly observed and independently sampledcovariates. In this way, we obtain estimators for noisy, missing, and/or de-pendent data that have the same scaling behavior as the usual fully-observedand independent case. The resulting estimators allow us to solve the problemof high-dimensional Gaussian graphical model selection with missing data.There is a large body of work on the problem of corrupted covariates orerror-in-variables for regression problems (e.g., see the papers and books [3,6, 7, 21], as well as references therein). Much of the earlier theoretical work isclassical in nature, meaning that it requires that the sample size n divergeswith the dimension p fixed. Most relevant to this paper is more recent workthat has examined issues of corrupted and/or missing data in the contextof high-dimensional sparse linear models, allowing for n ≪ p . St¨adler andB¨uhlmann [18] developed an EM-based method for sparse inverse covariancematrix estimation in the missing data regime, and used this result to derivean algorithm for sparse linear regression with missing data. As mentioned IGH-DIMENSIONAL NOISY LASSO above, however, it is difficult to guarantee that EM will converge to a pointclose to a global optimum of the likelihood, in contrast to the methodsstudied here. 
Rosenbaum and Tsybakov [14] studied the sparse linear modelwhen the covariates are corrupted by noise, and proposed a modified formof the Dantzig selector (see the discussion following our main results fora detailed comparison to this past work, and also to concurrent work [15]by the same authors). For the particular case of multiplicative noise, thetype of estimator that we consider here has been studied in past work [21];however, this theoretical analysis is of the classical type, holding only for n ≫ p , in contrast to the high-dimensional models that are of interest here.The remainder of this paper is organized as follows. We begin in Section 2with background and a precise description of the problem. We then introducethe class of estimators we will consider and the form of the projected gradientdescent algorithm. Section 3 is devoted to a description of our main results,including a pair of general theorems on the statistical and optimization error,and then a series of corollaries applying our results to the cases of noisy,missing, and dependent data. In Section 4, we demonstrate simulations toconfirm that our methods work in practice, and verify the theoretically-predicted scaling laws. Section 5 contains proofs of some of the main results,with the remaining proofs contained in the supplementary Appendix [9]. Notation.
For a matrix M , we write k M k max := max i,j | m ij | to bethe elementwise ℓ ∞ -norm of M . Furthermore, ||| M ||| denotes the induced ℓ -operator norm (maximum absolute column sum) of M , and ||| M ||| op isthe spectral norm of M . We write κ ( M ) := λ max ( M ) λ min ( M ) , the condition num-ber of M . For matrices M , M , we write M ⊙ M to denote the compo-nentwise Hadamard product, and write M : ⊖ M to denote componentwisedivision. For functions f ( n ) and g ( n ), we write f ( n ) - g ( n ) to mean that f ( n ) ≤ cg ( n ) for a universal constant c ∈ (0 , ∞ ), and similarly, f ( n ) % g ( n )when f ( n ) ≥ c ′ g ( n ) for some universal constant c ′ ∈ (0 , ∞ ). Finally, we write f ( n ) ≍ g ( n ) when f ( n ) - g ( n ) and f ( n ) % g ( n ) hold simultaneously.
2. Background and problem setup.
In this section, we provide back-ground and a precise description of the problem, and then motivate theclass of estimators analyzed in this paper. We then discuss a simple class ofprojected gradient descent algorithms that can be used to obtain an estima-tor.2.1.
Observation model and high-dimensional framework.
Suppose weobserve a response variable y i ∈ R linked to a covariate vector x i ∈ R p viathe linear model y i = h x i , β ∗ i + ε i for i = 1 , , . . . , n .(2.1) P.-L. LOH AND M. J. WAINWRIGHT
Here, the regression vector β ∗ ∈ R p is unknown, and ε i ∈ R is observationnoise, independent of x i . Rather than directly observing each x i ∈ R p , we ob-serve a vector z i ∈ R p linked to x i via some conditional distribution, that is, z i ∼ Q ( · | x i ) for i = 1 , , . . . , n .(2.2)This setup applies to various disturbances to the covariates, including:(a) Covariates with additive noise : We observe z i = x i + w i , where w i ∈ R p is a random vector independent of x i , say zero-mean with known covariancematrix Σ w .(b) Missing data : For some fraction ρ ∈ [0 , z i ∈ R p such that for each component j , we independently observe z ij = x ij with probability 1 − ρ , and z ij = ∗ with probability ρ . We can also considerthe case when the entries in the j th column have a different probability ρ j of being missing.(c) Covariates with multiplicative noise : Generalizing the missing dataproblem, suppose we observe z i = x i ⊙ u i , where u i ∈ R p is again a randomvector independent of x i , and ⊙ is the Hadamard product. The problemof missing data is a special case of multiplicative noise, where all u ij ’s areindependent and u ij ∼ Bernoulli(1 − ρ j ).Our first set of results is deterministic, depending on specific instantiationsof the observations { ( y i , z i ) } ni =1 . However, we are also interested in resultsthat hold with high probability when the x i ’s and z i ’s are drawn at random.We consider both the case when the x i ’s are drawn i.i.d. from a fixed distri-bution; and the case of dependent covariates, when the x i ’s are generatedaccording to a stationary vector autoregressive (VAR) process.We work within a high-dimensional framework that allows the numberof predictors p to grow and possibly exceed the sample size n . Of course,consistent estimation when n ≪ p is impossible unless the model is endowedwith additional structure—for instance, sparsity in the parameter vector β ∗ .Consequently, we study the class of models where β ∗ has at most k nonzeroparameters, where k is also allowed to increase to infinity with p and n .2.2. M -estimators for noisy and missing covariates. In order to moti-vate the class of estimators we will consider, let us begin by examininga simple deterministic problem. Let Σ x ≻ ℓ -constrained quadratic program b β ∈ arg min k β k ≤ R (cid:26) β T Σ x β − h Σ x β ∗ , β i (cid:27) . (2.3)As long as the constraint radius R is at least k β ∗ k , the unique solution tothis convex program is b β = β ∗ . Of course, this program is an idealization,since in practice we may not know the covariance matrix Σ x , and we certainlydo not know Σ x β ∗ —after all, β ∗ is the quantity we are trying to estimate! IGH-DIMENSIONAL NOISY LASSO Nonetheless, this idealization still provides useful intuition, as it suggestsvarious estimators based on the plug-in principle. Given a set of samples, itis natural to form estimates of the quantities Σ x and Σ x β ∗ , which we denoteby b Γ ∈ R p × p and b γ ∈ R p , respectively, and to consider the modified program b β ∈ arg min k β k ≤ R (cid:26) β T b Γ β − h b γ, β i (cid:27) , (2.4)or alternatively, the regularized version b β ∈ arg min β ∈ R p (cid:26) β T b Γ β − h b γ, β i + λ n k β k (cid:27) , (2.5)where λ n > b Γ Las := 1 n X T X and b γ Las := 1 n X T y, (2.6)where we have introduced the shorthand y = ( y , . . . , y n ) T ∈ R n , and X ∈ R n × p , with x Ti as its i th row. A simple calculation shows that ( b Γ Las , b γ Las ) areunbiased estimators of the pair (Σ x , Σ x β ∗ ). 
This unbiasedness and additionalconcentration inequalities (to be described in the sequel) underlie the well-known analysis of the Lasso in the high-dimensional regime.In this paper, we focus on more general instantiations of the programs (2.4)and (2.5), involving different choices of the pair ( b Γ , b γ ) that are adapted tothe cases of noisy and/or missing data. Note that the matrix b Γ Las is positivesemidefinite, so the Lasso program is convex. In sharp contrast, for the caseof noisy or missing data, the most natural choice of the matrix b Γ is not posi-tive semidefinite , hence the quadratic losses appearing in the problems (2.4)and (2.5) are nonconvex . Furthermore, when b Γ has negative eigenvalues, theobjective in equation (2.5) is unbounded from below. Hence, we make useof the following regularized estimator: b β ∈ arg min k β k ≤ b √ k (cid:26) β T b Γ β − h b γ, β i + λ n k β k (cid:27) (2.7)for a suitable constant b .In the presence of nonconvexity, it is generally impossible to providea polynomial-time algorithm that converges to a (near) global optimum,due to the presence of local minima. Remarkably, we are able to prove thatthis issue is not significant in our setting, and a simple projected gradientdescent algorithm applied to the programs (2.4) or (2.7) converges with highprobability to a vector extremely close to any global optimum. P.-L. LOH AND M. J. WAINWRIGHT
Let us illustrate these ideas with some examples. Recall that ( b Γ , b γ ) serveas unbiased estimators for (Σ x , Σ x β ∗ ). Example 1 (Additive noise). Suppose we observe Z = X + W , where W is a random matrix independent of X , with rows w i drawn i.i.d. from a zero-mean distribution with known covariance Σ w . We consider the pair b Γ add := 1 n Z T Z − Σ w and b γ add := 1 n Z T y. (2.8)Note that when Σ w = 0 (corresponding to the noiseless case), the estimatorsreduce to the standard Lasso. However, when Σ w = 0, the matrix b Γ add is not positive semidefinite in the high-dimensional regime ( n ≪ p ). Indeed,since the matrix n Z T Z has rank at most n , the subtracted matrix Σ w maycause b Γ add to have a large number of negative eigenvalues. For instance, ifΣ w = σ w I for σ w >
0, then b Γ add has p − n eigenvalues equal to − σ w . Example 2 (Missing data). We now consider the case where the entriesof X are missing at random. Let us first describe an estimator for the spe-cial case where each entry is missing at random, independently with someconstant probability ρ ∈ [0 , Z ∈ R n × p with entries Z ij = (cid:26) X ij , with probability 1 − ρ ,0 , otherwise.Given the observed matrix Z ∈ R n × p , we use b Γ mis := e Z T e Zn − ρ diag (cid:18) e Z T e Zn (cid:19) and b γ mis := 1 n e Z T y, (2.9)where e Z ij = Z ij / (1 − ρ ). It is easy to see that the pair ( b Γ mis , b γ mis ) reducesto the pair ( b Γ Las , b γ Las ) for the standard Lasso when ρ = 0, correspondingto no missing data. In the more interesting case when ρ ∈ (0 , e Z T e Zn in equation (2.9) has rank at most n , so the subtracted diagonal matrixmay cause the matrix b Γ mis to have a large number of negative eigenvalueswhen n ≪ p . As a consequence, the matrix b Γ mis is not (in general) positivesemidefinite, so the associated quadratic function is not convex. Example 3 (Multiplicative noise). As a generalization of the previousexample, we now consider the case of multiplicative noise. In particular,suppose we observe the quantity Z = X ⊙ U , where U is a matrix of non-negative noise variables. In many applications, it is natural to assume thatthe rows u i of U are drawn in an i.i.d. manner, say from some distribution IGH-DIMENSIONAL NOISY LASSO in which both the vector E [ u ] and the matrix E [ u u T ] have strictly positiveentries. This general family of multiplicative noise models arises in variousapplications; we refer the reader to the papers [3, 6, 7, 21] for more discussionand examples. A natural choice of the pair ( b Γ , b γ ) is given by the quantities b Γ mul := 1 n Z T Z : ⊖ E ( u u T ) and b Γ mul := 1 n Z T y : ⊖ E ( u ) , (2.10)where : ⊖ denotes elementwise division. A small calculation shows that theseare unbiased estimators of Σ x and Σ x β ∗ , respectively. The estimators (2.10)have been studied in past work [21], but only under classical scaling ( n ≫ p ).As a special case of the estimators (2.10), suppose the entries u ij of U areindependent Bernoulli(1 − ρ j ) random variables. Then the observed matrix Z = X ⊙ U corresponds to a missing-data matrix, where each element ofthe j th column has probability ρ j of being missing. In this case, the estima-tors (2.10) become b Γ mis = Z T Zn : ⊖ M and b γ mis = 1 n Z T y : ⊖ ( − ρ ) , (2.11)where M := E ( u u T ) satisfies M ij = (cid:26) (1 − ρ i )(1 − ρ j ) , if i = j ,1 − ρ i , if i = j , ρ is the parameter vector containing the ρ j ’s, and is the vector of all 1’s. Inthis way, we obtain a generalization of the estimator discussed in Example 2.2.3. Restricted eigenvalue conditions.
Given an estimate b β , there arevarious ways to assess its closeness to β ∗ . In this paper, we focus on the ℓ -norm k b β − β ∗ k , as well as the closely related ℓ -norm k b β − β ∗ k . Whenthe covariate matrix X is fully observed (so that the Lasso can be applied),it is now well understood that a sufficient condition for ℓ -recovery is thatthe matrix b Γ Las = n X T X satisfy a certain type of restricted eigenvalue (RE)condition (e.g., [2, 20]). In this paper, we make use of the following condition. Definition 1 (Lower-RE condition). The matrix b Γ satisfies a lower re-stricted eigenvalue condition with curvature α > τ ( n, p ) > θ T b Γ θ ≥ α k θ k − τ ( n, p ) k θ k for all θ ∈ R p .(2.12)It can be shown that when the Lasso matrix b Γ Las = n X T X satisfiesthis RE condition (2.12), the Lasso estimate has low ℓ -error for any vec-tor β ∗ supported on any subset of size at most k . τ ( n,p ) . In particular,bound (2.12) implies a sparse RE condition for all k of this magnitude, andconversely, Lemma 11 in the Appendix of [9] shows that a sparse RE con- P.-L. LOH AND M. J. WAINWRIGHT dition implies bound (2.12). In this paper, we work with condition (2.12),since it is especially convenient for analyzing optimization algorithms.In the standard setting (with uncorrupted and fully observed design ma-trices), it is known that for many choices of the design matrix X (with rowshaving covariance Σ), the Lasso matrix b Γ Las will satisfy such an RE conditionwith high probability (e.g., [13, 17]) with α = λ min (Σ) and τ ( n, p ) ≍ log pn .A significant portion of the analysis in this paper is devoted to proving thatdifferent choices of b Γ, such as the matrices b Γ add and b Γ mis defined earlier,also satisfy condition (2.12) with high probability. This fact is by no meansobvious, since as previously discussed, the matrices b Γ add and b Γ mis generallyhave large numbers of negative eigenvalues.Finally, although such upper bounds are not necessary for statistical con-sistency, our algorithmic results make use of the analogous upper restrictedeigenvalue condition, formalized in the following: Definition 2 (Upper-RE condition). The matrix b Γ satisfies an up-per restricted eigenvalue condition with smoothness α > τ ( n, p ) > θ T b Γ θ ≤ α k θ k + τ ( n, p ) k θ k for all θ ∈ R p .(2.13)In recent work on high-dimensional projected gradient descent, Agarwal etal. [1] make use of a more general form of the lower and upper bounds (2.12)and (2.13), applicable to nonquadratic losses as well, which are referred toas the restricted strong convexity (RSC) and restricted smoothness (RSM)conditions, respectively. For various class of random design matrices, it canbe shown that the Lasso matrix b Γ Las satisfies the upper bound (2.13) with α = 2 λ max (Σ x ) and τ ( n, p ) ≍ log pn ; see Raskutti et al. [13] for the Gaus-sian case and Rudelson and Zhou [17] for the sub-Gaussian setting. We willestablish similar scaling for our choices of b Γ.2.4.
Gradient descent algorithms.
In addition to proving results aboutthe global minima of the (possibly nonconvex) programs (2.4) and (2.5), weare also interested in polynomial-time procedures for approximating suchoptima. In this paper, we analyze some simple algorithms for solving eitherthe constrained program (2.4) or the Lagrangian version (2.7). Note thatthe gradient of the quadratic loss function takes the form ∇L ( β ) = b Γ β − b γ .In application to the constrained version, the method of projected gradientdescent generates a sequence of iterates { β t , t = 0 , , , . . . } by the recursion β t +1 = arg min k β k ≤ R (cid:26) L ( β t ) + h∇L ( β t ) , β − β t i + η k β − β t k (cid:27) , (2.14)where η > β t +1 = Π( β t − η ∇L ( β t )), where Π denotes the ℓ -projection onto the ℓ - IGH-DIMENSIONAL NOISY LASSO ball of radius R . This projection can be computed rapidly in O ( p ) timeusing a procedure due to Duchi et al. [5]. For the Lagrangian update, we usea slight variant of the projected gradient update (2.14), namely β t +1 = arg min k β k ≤ R (cid:26) L ( β t ) + h∇L ( β t ) , β − β t i + η k β − β t k + λ n k β k (cid:27) (2.15)with the only difference being the inclusion of the regularization term. Thisupdate can also performed efficiently by performing two projections ontothe ℓ -ball; see the paper [1] for details.When the objective function is convex (equivalently, b Γ is positive semidef-inite), the iterates (2.14) or (2.15) are guaranteed to converge to a globalminimum of the objective functions (2.4) and (2.7), respectively. In our set-ting, the matrix b Γ need not be positive semidefinite, so the best genericguarantee is that the iterates converge to a local optimum. However, ouranalysis shows that for the family of programs (2.4) or (2.7), under a rea-sonable set of conditions satisfied by various statistical models, the iteratesactually converge to a point extremely close to any global optimum in both ℓ -norm and ℓ -norm; see Theorem 2 to follow for a more detailed statement.
3. Main results and consequences.
We now state our main results anddiscuss their consequences for noisy, missing, and dependent data.3.1.
General results.
We provide theoretical guarantees for both the con-strained estimator (2.4) and the Lagrangian version (2.7). Note that weobtain different optimization problems as we vary the choice of the pair( b Γ , b γ ) ∈ R p × p × R p . We begin by stating a pair of general results, applicableto any pair that satisfies certain conditions. Our first result (Theorem 1)provides bounds on the statistical error, namely the quantity k b β − β ∗ k , aswell as the corresponding ℓ -error, where b β is any global optimum of theprograms (2.4) or (2.7). Since the problem may be nonconvex in general, itis not immediately obvious that one can obtain a provably good approxima-tion to any global optimum without resorting to costly search methods. Inorder to assuage this concern, our second result (Theorem 2) provides rigor-ous bounds on the optimization error, namely the differences k β t − b β k and k β t − b β k incurred by the iterate β t after running t rounds of the projectedgradient descent updates (2.14) or (2.15).3.1.1. Statistical error.
In controlling the statistical error, we assumethat the matrix b Γ satisfies a lower-RE condition with curvature α and tol-erance τ ( n, p ), as previously defined (2.12). Recall that b Γ and b γ serve assurrogates to the deterministic quantities Σ x ∈ R p × p and Σ x β ∗ ∈ R p , respec-tively. Our results also involve a measure of deviation in these surrogates. P.-L. LOH AND M. J. WAINWRIGHT
In particular, we assume that there is some function ϕ ( Q , σ ε ), depending onthe two sources of noise in our problem: the standard deviation σ ε of theobservation noise vector ε from equation (2.1), and the conditional distri-bution Q from equation (2.2) that links the covariates x i to the observedversions z i . With this notation, we consider the deviation condition k b γ − b Γ β ∗ k ∞ ≤ ϕ ( Q , σ ε ) r log pn . (3.1)To aid intuition, note that inequality (3.1) holds whenever the following twodeviation conditions are satisfied: k b γ − Σ x β ∗ k ∞ ≤ ϕ ( Q , σ ε ) r log pn and(3.2) k ( b Γ − Σ x ) β ∗ k ∞ ≤ ϕ ( Q , σ ε ) r log pn . The pair of inequalities (3.2) clearly measures the deviation of the estima-tors ( b Γ , b γ ) from their population versions, and they are sometimes easierto verify theoretically. However, inequality (3.1) may be used directly toderive tighter bounds (e.g., in the additive noise case). Indeed, the boundsestablished via inequalities (3.2) is not sharp in the limit of low noise onthe covariates, due to the second inequality. In the proofs of our corollariesto follow, we will verify the deviation conditions for various forms of noisy,missing, and dependent data, with the quantity ϕ ( Q , σ ε ) changing depend-ing on the model. We have the following result, which applies to any globaloptimum b β of the regularized version (2.7) with λ n ≥ ϕ ( Q , σ ε ) q log pn : Theorem 1 (Statistical error).
Suppose the surrogates ( b Γ , b γ ) satisfythe deviation bound (3.1), and the matrix b Γ satisfies the lower-RE condi-tion (2.12) with parameters ( α , τ ) such that √ kτ ( n, p ) ≤ min ( α √ k , ϕ ( Q , σ ε ) b r log pn ) . (3.3) Then for any vector β ∗ with sparsity at most k , there is a universal positiveconstant c such that any global optimum b β of the Lagrangian program (2.7)with any b ≥ k β ∗ k satisfies the bounds k b β − β ∗ k ≤ c √ kα max ( ϕ ( Q , σ ε ) r log pn , λ n ) and (3.4a) k b β − β ∗ k ≤ c kα max ( ϕ ( Q , σ ε ) r log pn , λ n ) . (3.4b) IGH-DIMENSIONAL NOISY LASSO The same bounds (without λ n ) also apply to the constrained program (2.4)with radius choice R = k β ∗ k . Remarks.
To be clear, all the claims of Theorem 1 are deterministic.Probabilistic conditions will enter when we analyze specific statistical modelsand certify that the RE condition (3.3) and deviation conditions are satisfiedby a random pair ( b Γ , b γ ) with high probability. We note that for the standardLasso choice ( b Γ Las , b γ Las ) of this matrix–vector pair, bounds of the form (3.4)for sub-Gaussian noise are well known from past work (e.g., [2, 11, 12, 23]).The novelty of Theorem 1 is in allowing for general pairs of such surrogates,which—as shown by the examples discussed earlier—can lead to nonconvex-ity in the underlying M -estimator. Moreover, some interesting differencesarise due to the term ϕ ( Q , σ ε ), which changes depending on the nature ofthe model (missing, noisy, and/or dependent). As will be clarified in thesequel. Proving that the conditions of Theorem 1 are satisfied with highprobability for noisy/missing data requires some nontrivial analysis involv-ing both concentration inequalities and random matrix theory.Note that in the presence of nonconvexity, it is possible in principle forthe optimization problems (2.4) and (2.7) to have many global optima thatare separated by large distances. Interestingly, Theorem 1 guarantees thatthis unpleasant feature does not arise under the stated conditions: given anytwo global optima b β and e β of the program (2.4), Theorem 1 combined withthe triangle inequality guarantees that k b β − e β k ≤ k b β − β ∗ k + k e β − β ∗ k ≤ c ϕ ( Q , σ ε ) α r k log pn [and similarly for the program (2.7)]. Consequently, under any scaling suchthat k log pn = o (1), the set of all global optima must lie within an ℓ -ballwhose radius shrinks to zero.In addition, it is worth observing that Theorem 1 makes a specific predic-tion for the scaling behavior of the ℓ -error k b β − β ∗ k . In order to study thisscaling prediction, we performed simulations under the additive noise modeldescribed in Example 1, using the parameter setting Σ x = I and Σ w = σ w I with σ w = 0 .
2. Panel (a) of Figure 1 provides plots of the error k b β − β ∗ k versus the sample size n , for problem dimensions p ∈ { , , } . Notethat for all three choices of dimensions, the error decreases to zero as thesample size n increases, showing consistency of the method. The curves also Corollary 1, to be stated shortly, guarantees that the conditions of Theorem 1 aresatisfied with high probability for the additive noise model. In addition, Theorem 2 tofollow provides an efficient method of obtaining an accurate approximation of the globaloptimum. P.-L. LOH AND M. J. WAINWRIGHT (a) (b)
Fig. 1.
Plots of the error k b β − β ∗ k after running projected gradient descent on the non-convex objective, with sparsity k ≈ √ p . Plot (a) is an error plot for i.i.d. data with additivenoise, and plot (b) shows ℓ -error versus the rescaled sample size nk log p . As predicted byTheorem 1, the curves align for different values of p in the rescaled plot. shift to the right as the dimension p increases, reflecting the natural intu-ition that larger problems are harder in a certain sense. Theorem 1 makesa specific prediction about this scaling behavior: in particular, if we plot the ℓ -error versus the rescaled sample size n/ ( k log p ), the curves should roughlyalign for different values of p . Panel (b) shows the same data re-plotted onthese rescaled axes, thus verifying the predicted “stacking behavior.”Finally, as noted by a reviewer, the constraint R = k β ∗ k in the pro-gram (2.4) is rather restrictive, since β ∗ is unknown. Theorem 1 merelyestablishes a heuristic for the scaling expected for this optimal radius. Inthis regard, the Lagrangian estimator (2.7) is more appealing, since it onlyrequires choosing b to be larger than k β ∗ k , and the conditions on theregularizer λ n are the standard ones from past work on the Lasso.3.1.2. Optimization error.
Although Theorem 1 provides guarantees thathold uniformly for any global minimizer, it does not provide guidance on howto approximate such a global minimizer using a polynomial-time algorithm.Indeed, for nonconvex programs in general, gradient-type methods may be-come trapped in local minima, and it is impossible to guarantee that all suchlocal minima are close to a global optimum. Nonetheless, we are able to showthat for the family of programs (2.4), under reasonable conditions on b Γ satis-fied in various settings, simple gradient methods will converge geometricallyfast to a very good approximation of any global optimum. The followingtheorem supposes that we apply the projected gradient updates (2.14) tothe constrained program (2.4), or the composite updates (2.15) to the La-grangian program (2.7), with stepsize η = 2 α . In both cases, we assumethat n % k log p , as is required for statistical consistency in Theorem 1. IGH-DIMENSIONAL NOISY LASSO Theorem 2 (Optimization error).
Under the conditions of Theorem 1: (a)
For any global optimum b β of the constrained program (2.4), there areuniversal positive constants ( c , c ) and a contraction coefficient γ ∈ (0 , ,independent of ( n, p, k ) , such that the gradient descent iterates (2.14) satisfythe bounds k β t − b β k ≤ γ t k β − b β k + c log pn k b β − β ∗ k + c k b β − β ∗ k , (3.5) k β t − b β k ≤ √ k k β t − b β k + 2 √ k k b β − β ∗ k + 2 k b β − β ∗ k (3.6) for all t ≥ . (b) Letting φ denote the objective function of Lagrangian program (2.7)with global optimum b β , and applying composite gradient updates (2.15), thereare universal positive constants ( c , c ) and a contraction coefficient γ ∈ (0 , , independent of ( n, p, k ) , such that k β t − b β k ≤ c k b β − β ∗ k | {z } δ for all iterates t ≥ T , (3.7) where T := c log ( φ ( β ) − φ ( b β )) δ / log(1 /γ ) .Remarks. As with Theorem 1, these claims are deterministic in nature.Probabilistic conditions will enter into the corollaries, which involve prov-ing that the surrogate matrices b Γ used for noisy, missing and/or dependentdata satisfy the lower- and upper-RE conditions with high probability. Theproof of Theorem 2 itself is based on an extension of a result due to Agarwalet al. [1] on the convergence of projected gradient descent and compositegradient descent in high dimensions. Their result, as originally stated, im-posed convexity of the loss function, but the proof can be modified so as toapply to the nonconvex loss functions of interest here. As noted followingTheorem 1, all global minimizers of the nonconvex program (2.4) lie withina small ball. In addition, Theorem 2 guarantees that the local minimizersalso lie within a ball of the same magnitude. Note that in order to showthat Theorem 2 can be applied to the specific statistical models of interestin this paper, a considerable amount of technical analysis remains in orderto establish that its conditions hold with high probability.In order to understand the significance of the bounds (3.5) and (3.7), notethat they provide upper bounds for the ℓ -distance between the iterate β t at time t , which is easily computed in polynomial-time, and any global op-timum b β of the program (2.4) or (2.7), which may be difficult to compute.Focusing on bound (3.5), since γ ∈ (0 , t increases. The remaining terms involve the statistical errors k b β − β ∗ k q ,for q = 1 ,
2, which are controlled in Theorem 1. It can be verified that thetwo terms involving the statistical error on the right-hand side are boundedas O ( k log pn ), so Theorem 2 guarantees that projected gradient descent pro- P.-L. LOH AND M. J. WAINWRIGHT (a) (b)
Fig. 2.
Plots of the optimization error log( k β t − b β k ) and statistical error log( k β t − β ∗ k ) versus iteration number t , generated by running projected gradient descent on the noncon-vex objective. Each plot shows the solution path for the same problem instance, using different starting points. As predicted by Theorem 2, the optimization error decreases ge-ometrically. duce an output that is essentially as good—in terms of statistical error—asany global optimum of the program (2.4). Bound (3.7) provides a similarguarantee for composite gradient descent applied to the Lagrangian version.Experimentally, we have found that the predictions of Theorem 2 areborne out in simulations. Figure 2 shows the results of applying the pro-jected gradient descent method to solve the optimization problem (2.4) inthe case of additive noise [panel (a)], and missing data [panel (b)]. In eachcase, we generated a random problem instance, and then applied the pro-jected gradient descent method to compute an estimate b β . We then reappliedthe projected gradient method to the same problem instance 10 times, eachtime with a random starting point, and measured the error k β t − b β k be-tween the iterates and the first estimate (optimization error), and the error k β t − β ∗ k between the iterates and the truth (statistical error). Within eachpanel, the blue traces show the optimization error over 10 trials, and thered traces show the statistical error. On the logarithmic scale given, a ge-ometric rate of convergence corresponds to a straight line. As predicted byTheorem 2, regardless of the starting point, the iterates { β t } exhibit geo-metric convergence to the same fixed point. The statistical error contractsgeometrically up to a certain point, then flattens out.3.2.
Some consequences.
As discussed previously, both Theorems 1 and 2are deterministic results. Applying them to specific statistical models re-quires some additional work in order to establish that the stated condi- To be precise, Theorem 2 states that the iterates will converge geometrically to a smallneighborhood of all the global optima.IGH-DIMENSIONAL NOISY LASSO tions are met. We now turn to the statements of some consequences ofthese theorems for different cases of noisy, missing and dependent data.In all the corollaries below, the claims hold with probability greater than1 − c exp( − c log p ), where ( c , c ) are universal positive constants, inde-pendent of all other problem parameters. Note that in all corollaries, thetriplet ( n, p, k ) is assumed to satisfy scaling of the form n % k log p , asis necessary for ℓ -consistent estimation of k -sparse vectors in p dimen-sions. Definition 3.
We say that a random matrix X ∈ R n × p is sub-Gaussianwith parameters (Σ , σ ) if:(a) each row x Ti ∈ R p is sampled independently from a zero-mean distri-bution with covariance Σ, and(b) for any unit vector u ∈ R p , the random variable u T x i is sub-Gaussianwith parameter at most σ .For instance, if we form a random matrix by drawing each row indepen-dently from the distribution N (0 , Σ), then the resulting matrix X ∈ R n × p isa sub-Gaussian matrix with parameters (Σ , ||| Σ ||| op ).3.2.1. Bounds for additive noise: i.i.d. case.
We begin with the case ofi.i.d. samples with additive noise, as described in Example 1.
Corollary 1.
Suppose that we observe Z = X + W , where the randommatrices X, W ∈ R n × p are sub-Gaussian with parameters (Σ x , σ x ) , and let ε be an i.i.d. sub-Gaussian vector with parameter σ ε . Let σ z = σ x + σ w . Thenunder the scaling n % max { σ z λ min2 (Σ x ) , } k log p , for the M -estimator based onthe surrogates ( b Γ add , b γ add ) , the results of Theorems 1 and 2 hold with param-eters α = λ min (Σ x ) and ϕ ( Q , σ ε ) = c σ z ( σ w + σ ε ) k β ∗ k , with probability atleast − c exp( − c log p ) .Remarks. (a) Consequently, the ℓ -error of any optimal solution b β sat-isfies the bound k b β − β ∗ k - σ z ( σ w + σ ε ) λ min (Σ x ) k β ∗ k r k log pn with high probability. The prefactor in this bound has a natural interpre-tation as an inverse signal-to-noise ratio; for instance, when X and W arezero-mean Gaussian matrices with row covariances Σ x = σ x I and Σ w = σ w I ,respectively, we have λ min (Σ x ) = σ x , so( σ w + σ ε ) p σ x + σ w λ min (Σ x ) = σ w + σ ε σ x s σ w σ x . P.-L. LOH AND M. J. WAINWRIGHT
This quantity grows with the ratios σ w /σ x and σ ε /σ x , which measure theSNR of the observed covariates and predictors, respectively. Note that when σ w = 0, corresponding to the case of uncorrupted covariates, the bound on ℓ -error agrees with known results. See Section 4 for simulations and furtherdiscussions of the consequences of Corollary 1.(b) We may also compare the results in (a) with bounds from past workon high-dimensional sparse regression with noisy covariates [15]. In thiswork, Rosenbaum and Tsybakov derive similar concentration bounds onsub-Gaussian matrices. The tolerance parameters are all O ( q log pn ), withprefactors depending on the sub-Gaussian parameters of the matrices. Inparticular, in their notation, ν ≍ ( σ x σ w + σ w σ ε + σ w ) r log pn k β ∗ k , leading to the bound (cf. Theorem 2 of Rosenbaum and Tsybakov [15]) k b β − β ∗ k - ν √ kλ min (Σ x ) ≍ σ λ min (Σ x ) r k log pn k β ∗ k . Extensions to unknown noise covariance.
Situations may arise wherethe noise covariance Σ w is unknown, and must be estimated from the data.One simple method is to assume that Σ w is estimated from independent ob-servations of the noise. In this case, suppose we independently observe a ma-trix W ∈ R n × p with n i.i.d. vectors of noise. Then we use b Σ w = n W T W asour estimate of Σ w . A more sophisticated variant of this method (cf. Chap-ter 4 of Carroll et al. [3]) assumes that we observe k i replicate measurements Z i , . . . , Z ik for each x i and form the estimator b Σ w = P ni =1 P k i j =1 ( Z ij − Z i · )( Z ij − Z i · ) T P ni =1 ( k i − . (3.8)Based on the estimator b Σ w , we form the pair ( e Γ , e γ ) such that e γ = n Z T y and e Γ = Z T Zn − b Σ w . In the proofs of Section 5, we will analyze the case where b Σ w = n W T W and show that the result of Corollary 1 still holds when Σ w must be estimated from the data. Note that the estimator in equation (3.8)will also yield the same result, but the analysis is more complicated.3.2.2. Bounds for missing data: i.i.d. case.
Next, we turn to the case ofi.i.d. samples with missing data, as discussed in Example 3. For a missingdata parameter vector ρ , we define ρ max := max j ρ j , and assume ρ max < Corollary 2.
Let X ∈ R n × p be sub-Gaussian with parameters (Σ x , σ x ) ,and Z the missing data matrix with parameter ρ . Let ε be an i.i.d. sub- IGH-DIMENSIONAL NOISY LASSO Gaussian vector with parameter σ ε . If n % max( − ρ max ) σ x λ (Σ x ) , k log p ,then Theorems 1 and 2 hold with probability at least − c exp( − c log p ) for α = λ min (Σ x ) and ϕ ( Q , σ ε ) = c σ x − ρ max ( σ ε + σ x − ρ max ) k β ∗ k .Remarks. Suppose X is a Gaussian random matrix and ρ j = ρ for all j .In this case, the ratio σ x λ min (Σ x ) = λ max (Σ x ) λ min (Σ x ) = κ (Σ x ) is the condition numberof Σ x . Then ϕ ( Q , σ ε ) α ≍ (cid:18) λ min (Σ x ) σ x σ ε − ρ + κ (Σ x )(1 − ρ ) (cid:19) k β ∗ k , a quantity that depends on both the conditioning of Σ x , and the fraction ρ ∈ [0 ,
1) of missing data. We will consider the results of Corollary 2 appliedto this example in the simulations of Section 4.
Extensions to unknown ρ . As in the additive noise case, we may wishto consider the case when the missing data parameters ρ are not observedand must be estimated from the data. For each j = 1 , , . . . , p , we estimate ρ j using b ρ j , the empirical average of the number of observed entries per column.Let b ρ ∈ R p denote the resulting estimator of ρ . Naturally, we use the pairof estimators ( e Γ , e γ ) defined by e Γ = Z T Zn : ⊖ f M and e γ = 1 n Z T y : ⊖ ( − b ρ ) , (3.9)where f M ij = (cid:26) (1 − b ρ i )(1 − b ρ j ) , if i = j ,1 − b ρ i , if i = j .We will show in Section 5 that Corollary 2 holds when ρ is estimated by b ρ .3.2.3. Bounds for dependent data.
Turning to the case of dependentdata, we consider the setting where the rows of X are drawn from a station-ary vector autoregressive (VAR) process according to x i +1 = Ax i + v i for i = 1 , , . . . , n − , (3.10)where v i ∈ R p is a zero-mean noise vector with covariance matrix Σ v , and A ∈ R p × p is a driving matrix with spectral norm ||| A ||| <
1. We assume therows of X are drawn from a Gaussian distribution with covariance Σ x , suchthat Σ x = A Σ x A T + Σ v . Hence, the rows of X are identically distributedbut not independent, with the choice A = 0 giving rise to the i.i.d. scenario.Corollaries 3 and 4 correspond to the case of additive noise and missing datafor a Gaussian VAR process. Corollary 3.
Suppose the rows of X are drawn according to a GaussianVAR process with driving matrix A . Suppose the additive noise matrix W is i.i.d. with Gaussian rows, and let ε be an i.i.d. sub-Gaussian vector with P.-L. LOH AND M. J. WAINWRIGHT parameter σ ε . If n % max( ζ λ (Σ x ) , k log p , with ζ = ||| Σ w ||| op + ||| Σ x ||| op −||| A ||| op ,then Theorems 1 and 2 hold with probability at least − c exp( − c log p ) for α = λ min (Σ x ) and ϕ ( Q , σ ε ) = c ( σ ε ζ + ζ ) k β ∗ k . Corollary 4.
Suppose the rows of X are drawn according to a Gaus-sian VAR process with driving matrix A , and Z is the observed matrix subjectto missing data, with parameter ρ . Let ε be an i.i.d. sub-Gaussian vector withparameter σ ε . If n % max( ζ ′ λ (Σ x ) , k log p , with ζ ′ = − ρ max ) ||| Σ x ||| op −||| A ||| op ,then Theorems 1 and 2 hold with probability at least − c exp( − c log p ) for α = λ min (Σ x ) and ϕ ( Q , σ ε ) = c ( σ ε ζ ′ + ζ ′ ) k β ∗ k . Remarks.
Note that the scaling and the form of ϕ in Corollaries 2–4are very similar, except with different effective variances σ = σ x (1 − ρ max ) , ζ or ζ ′ , depending on the type of corruption in the data. As we will see inSection 5, the proofs involve verifying the deviation conditions (3.2) usingsimilar techniques. On the other hand, the proof of Corollary 1 proceeds viadeviation condition (3.1), which produces a tighter bound.Note that we may extend the cases of dependent data to situations when Σ w and ρ are unknown and must be estimated from the data. The proofs of theseextensions are identical to the i.i.d case, so we will omit them.3.3. Application to graphical model inverse covariance estimation.
Theproblem of inverse covariance estimation for a Gaussian graphical model isalso related to the Lasso. Meinshausen and B¨uhlmann [10] prescribed a wayto recover the support of the precision matrix Θ when each column of Θ is k -sparse, via linear regression and the Lasso. More recently, Yuan [22] proposeda method for estimating Θ using the Dantzig selector, and obtained errorbounds on ||| b Θ − Θ ||| when the columns of Θ are bounded in ℓ . Both ofthese results assume that X is fully-observed and has i.i.d. rows.Suppose we are given a matrix X ∈ R n × p of samples from a multivariateGaussian distribution, where each row is distributed according to N (0 , Σ).We assume the rows of X are either i.i.d. or sampled from a Gaussian VARprocess. Based on the modified Lasso of the previous section, we devisea method to estimate Θ based on a corrupted observation matrix Z , when Θis sparse. Our method bears similarity to the method of Yuan [22], but isvalid in the case of corrupted data, and does not require an ℓ column bound.Let X j denote the j th column of X , and let X − j denote the matrix X with j th column removed. By standard results on Gaussian graphical models,there exists a vector θ j ∈ R p − such that X j = X − j θ j + ε j , (3.11)where ε j is a vector of i.i.d. Gaussians and ε j ⊥⊥ X − j for each j . If we define a j := − (Σ jj − Σ j, − j θ j ) − , we can verify that Θ j, − j = a j θ j . Our algorithm, IGH-DIMENSIONAL NOISY LASSO described below, forms estimates b θ j and b a j for each j , then combines theestimates to obtain an estimate b Θ j, − j = b a j b θ j .In the additive noise case, we observe the matrix Z = X + W . From theequations (3.11), we obtain Z j = X − j θ j + ( ε j + W j ). Note that δ j = ε j + W j is a vector of i.i.d. Gaussians, and since X ⊥⊥ W , we have δ j ⊥⊥ X − j . Hence,our results on covariates with additive noise allow us to recover θ j from Z .We can verify that this reduces to solving the program (2.4) or (2.7) withthe pair ( b Γ ( j ) , b γ ( j ) ) = ( b Σ − j, − j , n Z − jT Z j ), where b Σ = n Z T Z − Σ w .When Z is a missing-data version of X , we similarly estimate the vec-tors θ j via equation (3.11), using our results on the Lasso with missingcovariates. Here, both covariates and responses are subject to missing data,but this makes no difference in our theoretical results. For each j , we usethe pair ( b Γ ( j ) , b γ ( j ) ) = (cid:18)b Σ − j, − j , n Z − jT Z j : ⊖ ( − ρ − j )(1 − ρ j ) (cid:19) , where b Σ = n Z T Z : ⊖ M , and M is defined as in Example 3.To obtain the estimate b Θ, we therefore propose the following procedure,based on the estimators { ( b Γ ( j ) , b γ ( j ) ) } pj =1 and b Σ. Algorithm 3.1. (1) Perform p linear regressions of the variables Z j upon the remaining variables Z − j , using the program (2.4) or (2.7) with theestimators ( b Γ ( j ) , b γ ( j ) ), to obtain estimates b θ j of θ j .(2) Estimate the scalars a j using the quantity b a j := − ( b Σ jj − b Σ j, − j b θ j ) − ,based on the estimator b Σ. Form e Θ with e Θ j, − j = b a j b θ j and e Θ jj = − b a j .(3) Set b Θ = arg min Θ ∈ S p ||| Θ − e Θ ||| , where S p is the set of symmetricmatrices.Note that the minimization in step (3) is a linear program, so is easilysolved with standard methods. We have the following corollary about b Θ: Corollary 5.
Suppose the columns of the matrix Θ are k -sparse, andsuppose the condition number κ (Θ) is nonzero and finite. Suppose we have k b γ ( j ) − b Γ ( j ) θ j k ∞ ≤ ϕ ( Q , σ ε ) r log pn ∀ j, (3.12) and suppose we have the following additional deviation condition on b Σ : k b Σ − Σ k max ≤ cϕ ( Q , σ ε ) r log pn . (3.13) Finally, suppose the lower-RE condition holds uniformly over the matri-ces b Γ ( j ) with the scaling (3.3). Then under the estimation procedure of Al- P.-L. LOH AND M. J. WAINWRIGHT gorithm 3.1, there exists a universal constant c such that ||| b Θ − Θ ||| op ≤ c κ (Σ) λ min (Σ) (cid:18) ϕ ( Q , σ ε ) λ min (Σ) + ϕ ( Q , σ ε ) α (cid:19) k r log pn . Remarks.
Note that Corollary 5 is again a deterministic result, withparallel structure to Theorem 1. Furthermore, the deviation bounds (3.12)and (3.13) hold for all scenarios considered in Section 3.2 above, using Corol-laries 1–4 for the first two inequalities, and a similar bounding technique for k b Σ − Σ k max ; and the lower-RE condition holds over all matrices b Γ ( j ) by thesame technique used to establish the lower-RE condition for b Γ. The unifor-mity of the lower-RE bound over all sub-matrices holds because0 < λ min (Σ) ≤ λ min (Σ − j, − j ) ≤ λ max (Σ − j, − j ) ≤ λ max (Σ) < ∞ . Hence, the error bound in Corollary 5 holds with probability at least 1 − c exp( − c log p ) when n % k log p , for the appropriate values of ϕ and α .
4. Simulations.
In this section, we report some additional simulationresults to confirm that the scalings predicted by our theory are sharp. InFigure 1 following Theorem 1, we showed that the error curves align whenplotted against a suitably rescaled sample size, in the case of additive noiseperturbations. Panel (a) of Figure 3 shows these same types of rescaledcurves for the case of missing data, with sparsity k ≈ √ p , covariate matrixΣ x = I , and missing fraction ρ = 0 .
2, whereas panel (b) shows the rescaled(a) (b)
Fig. 3.
Plots of the error k b β − β ∗ k after running projected gradient descent on the non-convex objective, with sparsity k ≈ √ p . In all cases, we plotted the error versus the rescaledsample size nk log p . As predicted by Theorems 1 and 2, the curves align for different valuesof p when plotted in this rescaled manner. (a) Missing data case with i.i.d. covariates. (b)
Vector autoregressive data with additive noise. Each point represents an average over100 trials.
IGH-DIMENSIONAL NOISY LASSO plots for the vector autoregressive case with additive noise perturbations,using a driving matrix A with ||| A ||| op = 0 .
2. Each point corresponds to anaverage over 100 trials. Once again, we see excellent agreement with thescaling law provided by Theorem 1.We also ran simulations to verify the form of the function ϕ ( Q , σ ε ) ap-pearing in Corollaries 1 and 2. In the additive noise setting for i.i.d. data,we set Σ x = I and ε equal to i.i.d. Gaussian noise with σ ε = 0 .
5. For a fixedvalue of the parameters p = 256 and k ≈ log p , we ran the projected gra-dient descent algorithm for different values of σ w ∈ (0 . , . w = σ w I and n ≈ σ w ) k log p , with k β ∗ k = 1. According to thetheory, ϕ ( Q ,σ ε ) α ≍ ( σ w + 0 . p σ w , so that k b β − β ∗ k - ( σ w + 0 . p σ w s k log p (1 + σ w ) k log p ≍ σ w + 0 . p σ w . In order to verify this theoretical prediction, we plotted σ w versus therescaled error √ σ w σ w +0 . k b β − β ∗ k . As shown by Figure 4(a), the curve isroughly constant, as predicted by the theory.Similarly, in the missing data setting for i.i.d. data, we set Σ x = I and ε equal to i.i.d. Gaussian noise with σ ε = 0 .
5. For a fixed value of the pa-rameters p = 128 and k ≈ log p , we ran simulations for different values of themissing data parameter ρ ∈ (0 , . n ≈ − ρ ) k log p . According tothe theory, ϕ ( Q ,σ ε ) α ≍ σ ε − ρ + − ρ ) . Consequently, with our specified scalings(a) (b) Fig. 4. (a)
Plot of the rescaled ℓ -error √ σ w σ w +0 . k b β − β ∗ k versus the additive noise stan-dard deviation σ w for the i.i.d. model with additive noise. (b) Plot of the rescaled ℓ -error k b β − β ∗ k . − ρ ) versus the missing fraction ρ for the i.i.d. model with missing data. Both curvesare roughly constant, showing that our error bounds on k b β − β ∗ k exhibit the proper scaling.Each point represents an average over 200 trials. P.-L. LOH AND M. J. WAINWRIGHT of ( n, p, k ), we should expect a bound of the form k b β − β ∗ k - ϕ ( Q , σ ε ) α r k log pn ≍ . − ρ ) . The plot of ρ versus the rescaled error k b β − β ∗ k . − ρ ) is shown in Figure 4(b).The curve is again roughly constant, agreeing with theoretical results.Finally, we studied the behavior of the inverse covariance matrix estima-tion algorithm on three types of Gaussian graphical models:(a) Chain-structured graphs.
In this case, all nodes of the graph are ar-ranged in a linear chain. Hence, each node (except the two end nodes) hasdegree k = 2. The diagonal entries of Θ are set equal to 1, and all entriescorresponding to links in the chain are set equal to 0 .
1. Then Θ is rescaledso ||| Θ ||| op = 1.(b) Star-structured graphs.
In this case, all nodes are connected to a cen-tral node, which has degree k ≈ . p . All other nodes have degree 1. Thediagonal entries of Θ are set equal to 1, and all entries corresponding toedges in the graph are set equal to 0 .
1. Then Θ is rescaled so ||| Θ ||| op = 1.(c) Erd˝os–Renyi graphs.
This example comes from Rothman et al. [16].For a sparsity parameter k ≈ log p , we randomly generate the matrix Θ byfirst generating the matrix B such that the diagonal entries are 0, and allother entries are independently equal to 0.5 with probability k/p , and 0otherwise. Then δ is chosen so that Θ = B + δI has condition number p .Finally, Θ is rescaled so ||| Θ ||| op = 1.After generating the matrix X of n i.i.d. samples from the appropriate graph-ical model, with covariance matrix Σ x = Θ − , we generated the corruptedmatrix Z = X + W with Σ w = (0 . I in the additive noise case, or themissing data matrix Z with ρ = 0 . ℓ -error √ k ||| b Θ − Θ ||| op plotted against the sample size n for a chain-structured graph. In panels (b)and (d), we have ℓ -error plotted against the rescaled sample size, n/ ( k log p ).Once again, we see good agreement with the theoretical predictions. We haveobtained qualitatively similar results for the star and Erd˝os–Renyi graphs.
5. Proofs.
In this section, we prove our two main theorems. For the moretechnical proofs of the corollaries, see the supplementary Appendix [9].5.1.
Proof of Theorem 1.
Let L ( β ) = β T b Γ β − h b γ, β i + λ n k β k denotethe loss function to be minimized. This definition captures both the estima-tor (2.4) with λ n = 0 and the estimator (2.7) with the choice of λ n given inthe theorem statement. For either estimator, we are guaranteed that β ∗ isfeasible and b β is optimal for the program, so L ( b β ) ≤ L ( β ∗ ). Indeed, in theregularized case, the k -sparsity of β ∗ implies that k β ∗ k ≤ √ k k β ∗ k ≤ b √ k .Defining the error vector b ν := b β − β ∗ and performing some algebra leads to IGH-DIMENSIONAL NOISY LASSO (a) (b)(c) (d) Fig. 5. (a)
Plots of the error ||| b Θ − Θ ||| op after running projected gradient descent on thenonconvex objective for a chain-structured Gaussian graphical model with additive noise.As predicted by Theorems 1 and 2, all curves align when the error is rescaled by √ k andplotted against the ratio nk log p , as shown in (b) . Plots (c) and (d) show the results ofsimulations on missing data sets. Each point represents the average over 50 trials. the equivalent inequality b ν T b Γ b ν ≤ h b ν, b γ − b Γ β ∗ i + λ n {k β ∗ k − k β ∗ + b ν k } . (5.1)In the remainder of the proof, we first derive an upper bound for the right-hand side of this inequality. We then use this upper bound and the lower-REcondition to show that the error vector b ν must satisfy the inequality k b ν k ≤ √ k k b ν k . (5.2)Finally, we combine inequality (5.2) with the lower-RE condition to derivea lower bound on the left-hand side of the basic inequality (5.1). Combinedwith our earlier upper bound on the right-hand side, some algebra yields theclaim. P.-L. LOH AND M. J. WAINWRIGHT
Upper bound on right-hand side.
We first upper-bound the right-handside of inequality (5.1). H¨older’s inequality gives h b ν, b γ − b Γ β ∗ i ≤ k b ν k k b γ − b Γ β ∗ k ∞ . By the triangle inequality, we have k b γ − b Γ β ∗ k ∞ ≤ k b γ − Σ x β ∗ k ∞ + k (Σ x − b Γ) β ∗ k ∞ (i) ≤ ϕ ( Q , σ ε ) r log pn , where inequality (i) follows from the deviation conditions (3.2). Combiningthe pieces, we conclude that h b ν, b γ − b Γ β ∗ i ≤ k b ν k ϕ ( Q , σ ε ) r log pn (5.3) = ( k b ν S k + k b ν S c k )2 ϕ ( Q , σ ε ) r log pn . On the other hand, we have k β ∗ + b ν k − k β ∗ k ≥ {k β ∗ S k − k b ν S k } + k b ν S c k − k β ∗ k (5.4) = k b ν S c k − k b ν S k , where we have exploited the sparsity of β ∗ and applied the triangle inequal-ity. Combining the pieces, we conclude that the right-hand side of inequal-ity (5.1) is upper-bounded by2 ϕ ( Q , σ ε ) r log pn ( k b ν S k + k b ν S c k ) + λ n {k b ν S k − k b ν S c k } , (5.5)a bound that holds for any nonnegative choice of λ n . Proof of inequality (5.2).
We first consider the constrained program (2.4), with R = ‖β*‖_1, so that ‖β̂‖_1 = ‖β* + ν̂‖_1 ≤ ‖β*‖_1. Combined with inequality (5.4), we conclude that ‖ν̂_{S^c}‖_1 ≤ ‖ν̂_S‖_1. Consequently, we have the inequality ‖ν̂‖_1 ≤ 2‖ν̂_S‖_1 ≤ 2√k ‖ν̂‖_2, which is a slightly stronger form of the bound (5.2).

For the regularized estimator (2.7), we first note that our choice of λ_n guarantees that the term (5.5) is at most (3λ_n/2)‖ν̂_S‖_1 − (λ_n/2)‖ν̂_{S^c}‖_1. Returning to the basic inequality, we apply the lower-RE condition to lower-bound the left-hand side, thereby obtaining the inequality

−τ(n,p) ‖ν̂‖_1^2 ≤ α ‖ν̂‖_2^2 − τ(n,p) ‖ν̂‖_1^2 ≤ (3λ_n/2)‖ν̂_S‖_1 − (λ_n/2)‖ν̂_{S^c}‖_1.

By the triangle inequality, we have ‖ν̂‖_1 ≤ ‖β̂‖_1 + ‖β*‖_1 ≤ 2b√k. Since we have assumed √k τ(n,p) ≤ (ϕ(Q, σ_ε)/(2b)) √(log p / n), we are guaranteed that

τ(n,p) ‖ν̂‖_1^2 ≤ 2b√k τ(n,p) ‖ν̂‖_1 ≤ ϕ(Q, σ_ε) √(log p / n) ‖ν̂‖_1 ≤ (λ_n/4) ‖ν̂‖_1

by our choice of λ_n. Combining the pieces, we conclude that

0 ≤ (3λ_n/2)‖ν̂_S‖_1 − (λ_n/2)‖ν̂_{S^c}‖_1 + (λ_n/4)(‖ν̂_S‖_1 + ‖ν̂_{S^c}‖_1) = (7λ_n/4)‖ν̂_S‖_1 − (λ_n/4)‖ν̂_{S^c}‖_1,

and rearranging implies ‖ν̂_{S^c}‖_1 ≤ 7‖ν̂_S‖_1, from which we conclude that ‖ν̂‖_1 ≤ 8‖ν̂_S‖_1 ≤ 8√k ‖ν̂‖_2, as claimed.
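The last step is an instance of a generic cone argument; the following display (added here for completeness, with a generic cone constant c in place of the specific value) records how an ℓ_1-cone condition yields an ℓ_1–ℓ_2 comparison of the form (5.2).

```latex
% If \|\hat{\nu}_{S^c}\|_1 \le c\,\|\hat{\nu}_S\|_1 with |S| = k, then
\[
  \|\hat{\nu}\|_1
    = \|\hat{\nu}_S\|_1 + \|\hat{\nu}_{S^c}\|_1
    \le (1+c)\,\|\hat{\nu}_S\|_1
    \le (1+c)\,\sqrt{k}\,\|\hat{\nu}_S\|_2
    \le (1+c)\,\sqrt{k}\,\|\hat{\nu}\|_2,
\]
% where the middle step is the Cauchy--Schwarz inequality applied to the
% (at most) k nonzero coordinates of \hat{\nu}_S.
```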
Lower bound on left-hand side.
We now derive a lower bound on the left-hand side of inequality (5.1). Combining inequality (5.2) with the RE condition (2.12) gives

ν̂^T Γ̂ ν̂ ≥ α ‖ν̂‖_2^2 − τ(n,p) ‖ν̂‖_1^2 ≥ {α − 64 k τ(n,p)} ‖ν̂‖_2^2 ≥ (α/2) ‖ν̂‖_2^2, (5.6)

where the final step uses our assumption that 64 k τ(n,p) ≤ α/2. Finally, combining bounds (5.5), (5.2) and (5.6) yields

(α/2) ‖ν̂‖_2^2 ≤ 2 max{2ϕ(Q, σ_ε) √(log p / n), λ_n} ‖ν̂‖_1 ≤ 16 √k max{2ϕ(Q, σ_ε) √(log p / n), λ_n} ‖ν̂‖_2,

giving inequality (3.4a). Using inequality (5.2) again gives inequality (3.4b).

5.2. Proof of Theorem 2.
We begin by proving the claims for the constrained problem, and projected gradient descent. For the ℓ_2-error bound, we make use of Theorem 1 in the pre-print of Agarwal et al. [1]. Their theory, as originally stated, requires that the loss function be convex, but a careful examination of their proof shows that their arguments hinge on restricted strong convexity and smoothness assumptions, corresponding to a more general version of the lower- and upper-RE conditions given here. Apart from these conditions, the proof exploits the fact that the sub-problems defining the gradient updates (2.14) and (2.15) are convex. Since the loss function itself appears only in a linear term, their theory still applies.

In order to apply Theorem 1 in their paper, we first need to compute the tolerance parameter ε defined there; since β* is supported on the set S with |S| = k and the RE conditions hold with τ ≍ (log p)/n, we find that

ε ≤ c (log p)/(α n) (√k ‖β̂ − β*‖_2 + 2‖β̂ − β*‖_1)^2 ≤ c' (k log p)/(α n) ‖β̂ − β*‖_2^2 + c (log p)/(α n) ‖β̂ − β*‖_1^2 ≤ c ‖β̂ − β*‖_2^2 + c (log p)/(α n) ‖β̂ − β*‖_1^2,

where the final inequality makes use of the assumption that n ≳ k log p. Similarly, we may compute the contraction coefficient to be

γ = (1 − α/(α + c (k log p)/(α n))) (1 − c (k log p)/(α n))^{-1}, (5.7)

so γ ∈ (0, 1) for n ≳ k log p.

We now establish the ℓ_1-error bound. First, let Δ^t := β^t − β*. Since β^t is feasible and β̂ is optimal with an active constraint, we have ‖β^t‖_1 ≤ ‖β̂‖_1. Applying the triangle inequality gives

‖β̂‖_1 ≤ ‖β*‖_1 + ‖β̂ − β*‖_1 = ‖β*_S‖_1 + ‖β̂ − β*‖_1,
‖β^t‖_1 = ‖β* + Δ^t‖_1 ≥ ‖β*_S + Δ^t_{S^c}‖_1 − ‖Δ^t_S‖_1 = ‖β*_S‖_1 + ‖Δ^t_{S^c}‖_1 − ‖Δ^t_S‖_1;

combining the bounds yields ‖Δ^t_{S^c}‖_1 ≤ ‖Δ^t_S‖_1 + ‖β̂ − β*‖_1. Then

‖Δ^t‖_1 ≤ 2‖Δ^t_S‖_1 + ‖β̂ − β*‖_1 ≤ 2√k ‖Δ^t‖_2 + ‖β̂ − β*‖_1,

so

‖β^t − β̂‖_1 ≤ ‖β̂ − β*‖_1 + ‖Δ^t‖_1 ≤ 2√k (‖β^t − β̂‖_2 + ‖β̂ − β*‖_2) + 2‖β̂ − β*‖_1.

Turning to the Lagrangian version, we exploit Theorem 2 in Agarwal et al. [1], with M corresponding to the subspace of all vectors with support contained within the support set of β*. With this choice, we have ψ(M) = √k, and the contraction coefficient γ takes the previous form (5.7), so that the assumption n ≳ k log p guarantees that γ ∈ (0, 1). Moreover, β(M) = O((log p)/n) and ρ = √k, and the condition n ≳ k log p implies that ξ(M) = O(1). Putting together the pieces, we find that the compound tolerance parameter ε satisfies the bound ε = O((k log p / n) ‖β̂ − β*‖_2^2) = O(‖β̂ − β*‖_2^2), so the claim follows.
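The algorithm analyzed above is plain projected gradient descent on the quadratic loss, with iterates projected onto an ℓ_1-ball. The following Python sketch is one possible implementation, not the authors' code: the sorting-based projection follows the general recipe of Duchi et al. [5], and the step size eta (which should be of the order of the inverse of the upper restricted-smoothness constant) and radius R are left as user-chosen parameters.

```python
import numpy as np

def project_l1_ball(v, R):
    """Euclidean projection of v onto the l1-ball of radius R (sorting method, cf. [5])."""
    if np.abs(v).sum() <= R:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]                 # sorted magnitudes, descending
    css = np.cumsum(u)
    idx = np.arange(1, len(v) + 1)
    rho = np.max(idx[u - (css - R) / idx > 0])   # largest index kept after thresholding
    theta = (css[rho - 1] - R) / rho
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def projected_gradient_descent(Gamma_hat, gamma_hat, R, eta, n_iter=500):
    """Minimize (1/2) b' Gamma_hat b - <gamma_hat, b> over the l1-ball {||b||_1 <= R}.

    Gamma_hat may be indefinite (the corrupted-data surrogate), so the objective
    is possibly nonconvex; each projection step is nonetheless a convex sub-problem.
    """
    beta = np.zeros_like(gamma_hat)
    for _ in range(n_iter):
        grad = Gamma_hat @ beta - gamma_hat      # gradient of the quadratic part
        beta = project_l1_ball(beta - eta * grad, R)
    return beta
```

In the corrupted-data setting, Γ̂ need not be positive semidefinite, which is exactly why the convergence guarantee rests on the restricted eigenvalue conditions rather than convexity.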
6. Discussion.
In this paper, we formulated an ℓ_1-constrained minimization problem for sparse linear regression on corrupted data. The source of corruption may be additive noise or missing data, and although the resulting objective is not generally convex, we showed that projected gradient descent is guaranteed to converge to a point within statistical precision of the optimum. In addition, we established ℓ_1- and ℓ_2-error bounds that hold with high probability when the data are drawn i.i.d. from a sub-Gaussian distribution, or drawn from a Gaussian vector autoregressive process. Finally, we applied our methods to sparse inverse covariance estimation for a Gaussian graphical model with corruptions, and obtained spectral norm rates of the same order as existing rates for uncorrupted, i.i.d. data.

Future directions of research include studying more general types of dependencies or corruption in the covariates of regression, such as more general types of multiplicative noise, and performing sparse linear regression for corrupted data with additive noise when the noise covariance is unknown and replicates of the data may be unavailable. As pointed out by a reviewer, it would also be interesting to study the performance of our algorithms on data that are not sub-Gaussian, or even under model mismatch. In addition, one might consider other loss functions, where it is more difficult to correct the objective for corrupted covariates. Finally, it remains to be seen whether or not our techniques, used to show that certain nonconvex problems can be solved to statistical precision, can be applied more broadly.

Acknowledgments.
The authors thank Alekh Agarwal, Sahand Negahban, John Duchi and Alexandre Tsybakov for useful discussions and guidance. They are also grateful to the Associate Editor and anonymous referees for improvements on the paper.

SUPPLEMENTARY MATERIAL
Supplementary material for: High-dimensional regression with noisy and missing data: Provable guarantees with nonconvexity (DOI: 10.1214/12-AOS1018SUPP; .pdf). Due to space constraints, we have relegated technical details of the remaining proofs to the supplement [9].

REFERENCES
[1] Agarwal, A., Negahban, S. and Wainwright, M. J. (2012). Fast global convergence of gradient methods for high-dimensional statistical recovery. Available at http://arxiv.org/abs/1104.4824.
[2] Bickel, P. J., Ritov, Y. and Tsybakov, A. B. (2009). Simultaneous analysis of Lasso and Dantzig selector. Ann. Statist.
[3] Carroll, R. J., Ruppert, D. and Stefanski, L. A. (1995). Measurement Error in Nonlinear Models. Monographs on Statistics and Applied Probability. Chapman & Hall, London. MR1630517
[4] Chen, S. S., Donoho, D. L. and Saunders, M. A. (1998). Atomic decomposition by basis pursuit. SIAM J. Sci. Comput.
[5] Duchi, J., Shalev-Shwartz, S., Singer, Y. and Chandra, T. (2008). Efficient projections onto the ℓ_1-ball for learning in high dimensions. In International Conference on Machine Learning.
[6] Hwang, J. T. (1986). Multiplicative errors-in-variables models with applications to recent data released by the U.S. Department of Energy. J. Amer. Statist. Assoc.
[7] Iturria, S. J., Carroll, R. J. and Firth, D. (1999). Polynomial regression and estimating functions in the presence of multiplicative measurement error. J. R. Stat. Soc. Ser. B Stat. Methodol.
[8] Little, R. J. A. and Rubin, D. B. (1987). Statistical Analysis with Missing Data. Wiley, New York. MR0890519
[9] Loh, P. and Wainwright, M. J. (2012). Supplement to "High-dimensional regression with noisy and missing data: Provable guarantees with nonconvexity." DOI:10.1214/12-AOS1018SUPP.
[10] Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable selection with the Lasso. Ann. Statist.
[11] Meinshausen, N. and Yu, B. (2009). Lasso-type recovery of sparse representations for high-dimensional data. Ann. Statist.
[12] Negahban, S., Ravikumar, P., Wainwright, M. J. and Yu, B. (2009). A unified framework for the analysis of regularized M-estimators. In Advances in Neural Information Processing Systems. Curran Associates, Red Hook, NY.
[13] Raskutti, G., Wainwright, M. J. and Yu, B. (2010). Restricted eigenvalue properties for correlated Gaussian designs. J. Mach. Learn. Res.
[14] Rosenbaum, M. and Tsybakov, A. B. (2010). Sparse recovery under matrix uncertainty. Ann. Statist.
[15] Rosenbaum, M. and Tsybakov, A. B. (2011). Improved matrix uncertainty selector. Technical report. Available at http://arxiv.org/abs/1112.4413.
[16] Rothman, A. J., Bickel, P. J., Levina, E. and Zhu, J. (2008). Sparse permutation invariant covariance estimation. Electron. J. Stat.
[17] Rudelson, M. and Zhou, S. (2011). Reconstruction from anisotropic random measurements. Technical report, Univ. Michigan.
[18] Städler, N. and Bühlmann, P. (2012). Missing values: Sparse inverse covariance estimation and an extension to sparse regression. Statist. Comput.
[19] Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. J. R. Stat. Soc. Ser. B Stat. Methodol.
[20] van de Geer, S. A. and Bühlmann, P. (2009). On the conditions used to prove oracle results for the Lasso. Electron. J. Stat.
[21] Xu, Q. and You, J. (2007). Covariate selection for linear errors-in-variables regression models. Comm. Statist. Theory Methods.
[22] Yuan, M. (2010). High dimensional inverse covariance matrix estimation via linear programming. J. Mach. Learn. Res.
[23] Zhang, C.-H. and Huang, J. (2008). The sparsity and bias of the LASSO selection in high-dimensional linear regression.