Estimation of (near) low-rank matrices with noise and high-dimensional scaling
Sahand Negahban⋆ and Martin J. Wainwright†,⋆
Department of Statistics†, and Department of Electrical Engineering and Computer Sciences⋆
UC Berkeley, Berkeley, CA 94720
May 28, 2018
Technical Report, Department of Statistics, UC Berkeley
Abstract
High-dimensional inference refers to problems of statistical estimation in which the ambient dimension of the data may be comparable to or possibly even larger than the sample size. We study an instance of high-dimensional inference in which the goal is to estimate a matrix Θ∗ ∈ R^{k×p} on the basis of N noisy observations, and the unknown matrix Θ∗ is assumed to be either exactly low rank, or “near” low-rank, meaning that it can be well-approximated by a matrix with low rank. We consider an M-estimator based on regularization by the trace or nuclear norm over matrices, and analyze its performance under high-dimensional scaling. We provide non-asymptotic bounds on the Frobenius norm error that hold for a general class of noisy observation models, and then illustrate their consequences for a number of specific matrix models, including low-rank multivariate or multi-task regression, system identification in vector autoregressive processes, and recovery of low-rank matrices from random projections. Simulation results show excellent agreement with the high-dimensional scaling of the error predicted by our theory.

1 Introduction

High-dimensional inference refers to instances of statistical estimation in which the ambient dimension of the data is comparable to (or possibly larger than) the sample size. Problems with a high-dimensional character arise in a variety of applications in science and engineering, including analysis of gene array data, medical imaging, remote sensing, and astronomical data analysis. In settings where the number of parameters may be large relative to the sample size, the utility of classical “fixed p” results is questionable, and accordingly, a line of on-going statistical research seeks to obtain results that hold under high-dimensional scaling, meaning that both the problem size and sample size (as well as other problem parameters) may tend to infinity simultaneously. It is usually impossible to obtain consistent procedures in such settings without imposing some sort of additional constraints. Accordingly, there are now various lines of work on high-dimensional inference based on imposing different types of structural constraints. A substantial body of past work has focused on models with sparsity constraints, including the problem of sparse linear regression [45, 14, 17, 34, 9], banded or sparse covariance matrices [7, 8, 18], sparse inverse covariance matrices [49, 21, 42, 40], sparse eigenstructure [27, 2, 38], and sparse regression matrices [37, 30, 48, 25]. A theme common to much of this work is the use of the ℓ1-penalty as a surrogate function to enforce the sparsity constraint.

In this paper, we focus on the problem of high-dimensional inference in the setting of matrix estimation. As mentioned above, there is already a substantial body of work on the problem of sparse matrix recovery. In contrast, our interest in this paper is the problem of estimating a matrix Θ∗ ∈ R^{k×p} that is either exactly low rank, meaning that it has at most r ≪ min{k, p} non-zero singular values, or more generally is near low-rank, meaning that it can be well-approximated by a matrix of low rank. As we discuss at more length in the sequel, such exact or approximate low-rank conditions are appropriate for many applications, including multivariate or multi-task forms of regression, system identification for autoregressive processes, collaborative filtering, and matrix recovery from random projections.
Analogousto the use of an ℓ -regularizer for enforcing sparsity, we consider the use of the nuclear norm(also known as the trace norm) for enforcing a rank constraint in the matrix setting. Bydefinition, the nuclear norm is the sum of the singular values of a matrix, and so encouragessparsity in the vector of singular values, or equivalently for the matrix to be low-rank. Theproblem of low-rank matrix approximation and the use of nuclear norm regularization havebeen studied by various researchers. In her Ph.D. thesis, Fazel [19] discusses the use of nuclearnorm as a heuristic for restricting the rank of a matrix, showing that in practice it is oftenable to yield low-rank solutions. Other researchers have provided theoretical guarantees onthe performance of nuclear norm and related methods for low-rank matrix approximation.Srebro et al. [43] proposed nuclear norm regularization for the collaborative filtering problem,and established risk consistency under certain settings. Recht et al. [41] provided sufficientconditions for exact recovery using the nuclear norm heuristic when observing random pro-jections of a low-rank matrix, a set-up analogous to the compressed sensing model in sparselinear regression [17, 12]. Other researchers have studied a version of matrix completion inwhich a subset of entries are revealed, and the goal is to obtain perfect reconstruction eithervia the nuclear norm heuristic [13] or by other SVD-based methods [28]. Finally, Bach [6]has provided results on the consistency of nuclear norm minimization for general observationmodels in noisy settings, but applicable to the classical “fixed p ” setting.The goal of this paper is to analyze the nuclear norm relaxation for a general class of noisyobservation models, and obtain non-asymptotic error bounds on the Frobenius norm that holdunder high-dimensional scaling, and are applicable to both exactly and approximately low-rank matrices. We begin by presenting a generic observation model, and illustrating how it canbe specialized to the several cases of interest, including low-rank multivariate regression, esti-mation of autoregressive processes, and random projection (compressed sensing) observations.In particular, this model is specified in terms of an operator X , which may be deterministicor random depending on the setting, that maps any matrix Θ ∗ ∈ R k × p to a vector of N noisyobservations. We then present a single main theorem (Theorem 1) followed by two corollariesthat cover the cases of exact low-rank constraints (Corollary 1) and near low-rank constraints(Corollary 2) respectively. These results demonstrate that high-dimensional error rates arecontrolled by two key quantities. First, the (random) observation operator X is required tosatisfy a condition known as restricted strong convexity (RSC), which ensures that the lossfunction has sufficient curvature to guarantee consistent recovery of the unknown matrix Θ ∗ .Second, our theory provides insight into the choice of regularization parameter that weightsthe nuclear norm, showing that an appropriate choice is to set it proportional to the spectralnorm of a random matrix defined by the adjoint of observation operator X , and the observationnoise in the problem.This initial set of results, though appealing in terms of their simple statements and general-ity, are somewhat abstractly formulated. 
Our next contribution is to show that by specializingour main result (Theorem 1) to three classes of models, we can obtain some concrete results2ased on readily interpretable conditions. In particular, Corollary 3 deals with the case oflow-rank multivariate regression, relevant for applications in multitask learning. We showthat the random operator X satisfies the RSC property for a broad class of observation mod-els, and we use random matrix theory to provide an appropriate choice of the regularizationparameter. Our next result, Corollary 4, deals with the case of estimating the matrix ofparameters specifying a vector autoregressive (VAR) process [4, 31]. Here we also establishthat a suitable RSC property holds with high probability for the random operator X , and alsospecify a suitable choice of the regularization parameter. We note that the technical detailshere are considerably more subtle than the case of low-rank multivariate regression, due todependencies introduced by the autoregressive sampling scheme. Accordingly, in additionto terms that involve the size, the matrix dimensions and rank, our bounds also depend onthe mixing rate of the VAR process. Finally, we turn to the compressed sensing observationmodel for low-rank matrix recovery, as introduced by Recht et al. [41]. In this setting, weagain establish that the RSC property holds with high probability, specify a suitable choiceof the regularization parameter, and thereby obtain a Frobenius error bound for noisy obser-vations (Corollary 5). A technical result that we prove en route—namely, Proposition 1—isof possible independent interest, since it provides a bound on the constrained norm of a ran-dom Gaussian operator. In particular, this proposition allows us to obtain a sharp result(Corollary 6) for the problem of recovering a low-rank matrix from perfectly observed randomprojections, one that removes a logarithmic factor from past work [41].The remainder of this paper is organized as follows. Section 2 is devoted to backgroundmaterial, and the set-up of the problem. We present a generic observation model for low-rank matrices, and then illustrate how it captures various cases of interest. We then definethe convex program based on nuclear norm regularization that we analyze in this paper. InSection 3, we state our main theoretical results and discuss their consequences for differentmodel classes. Section 4 is devoted to the proofs of our results; in each case, we break downthe key steps in a series of lemmas, with more technical details deferred to the appendices.In Section 5, we present the results of various simulations that illustrate excellent agreementbetween the theoretical bounds and empirical behavior. Notation:
For the convenience of the reader, we collect standard pieces of notation here. For a pair of matrices Θ and Γ with commensurate dimensions, we let ⟨⟨Θ, Γ⟩⟩ = trace(Θ^T Γ) denote the trace inner product on matrix space. For a matrix Θ ∈ R^{k×p}, we let m = min{k, p}, and denote its (ordered) singular values by σ_1(Θ) ≥ σ_2(Θ) ≥ ... ≥ σ_m(Θ) ≥ 0. We also use the notation σ_max(Θ) = σ_1(Θ) and σ_min(Θ) = σ_m(Θ) to refer to the maximal and minimal singular values respectively. We use the notation ||| · ||| for various types of matrix norms based on these singular values, including the nuclear norm |||Θ|||_1 = Σ_{j=1}^m σ_j(Θ), the spectral or operator norm |||Θ|||_op = σ_1(Θ), and the Frobenius norm |||Θ|||_F = √(trace(Θ^T Θ)) = √(Σ_{j=1}^m σ_j(Θ)²). We refer the reader to Horn and Johnson [23, 24] for more background on these matrix norms and their properties.

2 Background and problem set-up

We begin with some background on problems and applications in which rank constraints arise, before describing a generic observation model. We then introduce the semidefinite program (SDP) based on nuclear norm regularization that we study in this paper.

2.1 Models with rank constraints
Imposing a rank r constraint on a matrix Θ ∗ ∈ R k × p is equivalent to requiring the rows (orcolumns) of Θ ∗ lie in some r -dimensional subspace of R p (or R k respectively). Such typesof rank constraints (or approximate forms thereof) arise in a variety of applications, as wediscuss here. In some sense, rank constraints are a generalization of sparsity constraints;rather than assuming that the data is sparse in a known basis, a rank constraint implicitlyimposes sparsity but without assuming the basis.We first consider the problem of multivariate regression, also referred to as multi-tasklearning in statistical machine learning. The goal of multivariate regression is to estimate aprediction function that maps covariates Z j ∈ R p to multi-dimensional output vectors Y j ∈ R k .More specifically, let us consider the linear model, specified by a matrix Θ ∗ ∈ R k × p , of theform Y a = Θ ∗ Z a + W a , for a = 1 , . . . , n, (1)where { W a } na =1 is an i.i.d. sequence of k -dimensional zero-mean noise vectors. Given a collec-tion of observations { Z a , Y a } na =1 of covariate-output pairs, our goal is to estimate the unknownmatrix Θ ∗ . This type of model has been used in many applications, including analysis of fMRIimage data [22], analysis of EEG data decoding [3], neural response modeling [11] and analysisof financial data. This model and closely related ones also arise in the problem of collaborativefiltering [44], in which the goal is to predict users’ preferences for items (such as movies ormusic) based on their and other users’ ratings of related items. The papers [1, 5] discuss ad-ditional instances of low-rank decompositions. In all of these settings, the low-rank conditiontranslates into the existence of a smaller set of “features” that are actually controlling theprediction.As a second (not unrelated) example, we now consider the problem of system identificationin vector autoregressive processes (see the book [31] for detailed background). A vectorautoregressive (VAR) process in p -dimensions is a a stochastic process { Z t } ∞ t =1 specified byan initialization Z ∈ R p , followed by the recursion Z t +1 = Θ ∗ Z t + W t , for t = 1 , , , . . .. (2)In this recursion, the sequence { W t } ∞ t =1 consists of i.i.d. samples of innovations noise. Weassume that each vector W t ∈ R p is zero-mean with covariance ν I , so that the process { Z t } ∞ t =1 is zero-mean, and has a covariance matrix Σ given by the solution of the discrete-time Ricattiequation Σ = Θ ∗ Σ(Θ ∗ ) T + ν I. (3)The goal of system identification in a VAR process is to estimate the unknown matrixΘ ∗ ∈ R p × p on the basis of a sequence of samples { Z t } nt =1 . In many application domains,it is natural to expect that the system is controlled primarily by a low-dimensional subsetof variables. For instance, models of financial data might have an ambient dimension p ofthousands (including stocks, bonds, and other financial instruments), but the behavior of themarket might be governed by a much smaller set of macro-variables (combinations of thesefinancial instruments). Similar statements apply to other types of time series data, includingneural data [11, 20], subspace tracking models in signal processing, and motion models modelsin computer vision. 4 third example that we consider in this paper is a compressed sensing observation model,in which one observes random projections of the unknown matrix Θ ∗ . 
This observation modelhas been studied extensively in the context of estimating sparse vectors [17, 12], and Recht etal. [41] suggested and studied its extension to low-rank matrices. In their set-up, one observestrace inner products of the form hh X i , Θ ∗ ii = trace( X Ti Θ ∗ ), where X i ∈ R k × p is a randommatrix (for instance, filled with standard normal N (0 ,
1) entries). Like compressed sensing for sparse vectors, applications of this model include computationally efficient updating in large databases (where the matrix Θ∗ measures the difference between the database at two different time instants), and matrix denoising.

2.2 A generic observation model

We now introduce a generic observation model that will allow us to deal with these different observation models in a unified manner. For pairs of matrices
A, B ∈ R k × p , recall the Frobe-nius or trace inner product hh A, B ii := trace( BA T ). We then consider a linear observationmodel of the form y i = hh X i , Θ ∗ ii + ε i , for i = 1 , , . . . , N , (4)which is specified by the sequence of observation matrices { X i } Ni =1 and observation noise { ε i } Ni =1 . This observation model can be written in a more compact manner using operator-theoretic notation. In particular, let us define the observation vector ~y = (cid:2) y . . . y n (cid:3) T ∈ R N , with a similar definition for ~ε ∈ R N in terms of { ε i } Ni =1 . We then use the observation matrices { X i } Ni =1 to define an operator X : R k × p → R N via (cid:2) X (Θ) (cid:3) i = hh X i , Θ ii . With this notation,the observation model (4) can be re-written as ~y = X (Θ ∗ ) + ~ε. (5)Let us illustrate the form of the observation model (5) for some of the applications thatwe considered earlier. Example 1 (Multivariate regression) . Recall the observation model (1) for multivariate re-gression. In this case, we make n observations of vector pairs ( Y a , Z a ) ∈ R k × R p . Accountingfor the k -dimensional nature of the output, after the model is scalarized, we receive a totalof N = kn observations. Let us introduce the quantity b = 1 , . . . , k to index the differentelements of the output, so that we can write Y ab = hh Z a e Tb , Θ ∗ ii + W ab , for b = 1 , , . . . , k . (6)By re-indexing this collection of N = nk observations via the mapping ( a, b ) i = a + ( b − k , we recognize multivariate regression as an instance of the observation model (4) withobservation matrix X i = Z a e Tb and scalar observation y i = V ab . Example 2 (Vector autoregressive processes) . Recall that a vector autoregressive (VAR)process is defined by the recursion (2), and suppose that observe an n -sequence { Z t } nt =1 produced by this recursion. Since each Z t = (cid:2) Z t . . . Z tp (cid:3) T is p -variate, the scalarizedsample size is N = np . Letting b = 1 , , . . . , p index the dimension, we have Z ( t +1) b = hh Z t e Tb , Θ ∗ ii + W tb . (7)5n this case, we re-index the collection of N = np observations via the mapping ( t, b ) i = t + ( b − p . After doing so, we see that the autoregressive problem can be written in theform (4) with y i = Z ( t +1) b and observation matrix X i = Z t e Tb . Example 3 (Compressed sensing) . As mentioned earlier, this is a natural extension of thecompressed sensing observation model for sparse vectors to the case of low-rank matrices [41].In particular, suppose that each observation matrix X i ∈ R k × p has i.i.d. standard normal N (0 ,
1) entries, so that we make observations of the form y i = hh X i , Θ ∗ ii + ε i , for i = 1 , , . . . , n . (8)By construction, these observations are an instance of the model (4). In this case, the morecompact form (5) involves a random Gaussian operator mapping R k × p to R n , and we studysome of its properties in the sequel. We now consider an estimator that is naturally suited to the problems described in theprevious section. Recall that the nuclear or trace norm of a matrix Θ ∈ R k × p is given by ||| Θ ||| = P min { k,p } j =1 σ j (Θ), corresponding to the sum of its singular values. Given a collectionof observations ( y i , X i ) ∈ R × R k × p , for i = 1 , . . . , N from the observation model (4), weconsider estimating the unknown Θ ∗ by solving the following optimization problem b Θ ∈ arg min Θ ∈ R k × p n N k ~y − X (Θ) k + λ N ||| Θ ||| o , (9)where λ N > N and the matrix dimensions k and p . Indeed, theoptimization problem (9) is an instance of a semidefinite program [46], a class of convex opti-mization problems that can be solved efficiently by various polynomial-time algorithms [10].For instance, interior point methods are a classical method for solving semidefinite programs;moreover, as we discuss in Section 5, there are a variety of other methods for solving thesemidefinite program (SDP) defining our M -estimatorLike in any typical M -estimator for statistical inference, the regularization parameter λ N is specified by the statistician. As part of the theoretical results in the next section, we providesuitable choices of this parameter in order for the estimate b Θ to behave well, in the sense ofbeing close in Frobenius norm to the unknown matrix Θ ∗ . In this section, we state our main results and discuss some of their consequences. Section 3.1is devoted to results that apply to generic instances of low-rank problems, whereas Section 3.2is devoted to the consequences of these results for more specific problem classes, includinglow-rank multivariate regression, estimation of vector autoregressive processes, and recoveryof low-rank matrices from random projections.6 .1 Results for general model classes
We begin by introducing the key technical condition that allows us to control the error b Θ − Θ ∗ between an SDP solution b Θ and the unknown matrix Θ ∗ . We refer to it as the restricted strongconvexity condition [35], since it amounts to guaranteeing that the quadratic loss function inthe SDP (9) is strictly convex over a restricted set of directions. Letting C ⊆ R k × p denote therestricted set of directions, we say that the operator X satisfies restricted strong convexity(RSC) over the set C if there exists some κ ( X ) > N k X (∆) k ≥ κ ( X ) ||| ∆ ||| F for all ∆ ∈ C . (10)We note that analogous conditions have been used to establish error bounds in the contextof sparse linear regression [9, 15], in which case the set C corresponded to certain subsets ofsparse vectors.Of course, the definition (10) hinges on the choice of the restricted set C . In order tospecify some appropriate sets for the case of low-rank matrices, we require some additionalnotation. For any matrix Θ ∈ R k × p , we let row(Θ) ⊆ R p and col(Θ) ⊆ R k denote its row spaceand column space, respectively. For a given positive integer r ≤ min { k, p } , any r -dimensionalsubspace of R k can be represented by some orthogonal matrix U ∈ R k × r (i.e., that satisfies U T U = I r × r . In a similar fashion, any r -dimensional subspace of R p can be represented byan orthogonal matrix V ∈ R p × r . For any fixed pair of such matrices ( U, V ), we may definethe following two subspaces of R k × p : M ( U, V ) := (cid:8) Θ ∈ R k × p | row(Θ) = V and col(Θ) = U (cid:9) , and (11a) M ⊥ ( U, V ) := (cid:8) Θ ∈ R k × p | row(Θ) ⊥ V and col(Θ) ⊥ U . (cid:9) . (11b)Finally, we let Π M ( U,V ) and Π M ⊥ ( U,V ) denote the (respective) projection operators onto thesesubspaces. When the subspaces ( U, V ) are clear from context, we use the shorthand notation∆ ′′ = Π M ⊥ ( U,V ) (∆) and ∆ ′ = ∆ − ∆ ′′ . With this notation, we can define the restrictedsets C of interest. Using the singular value decomposition, we can write Θ ∗ = e U D e V T , where e U ∈ R k × k and e V ∈ R p × p are both orthogonal matrices, and D ∈ R k × p contains the singularvalues of Θ ∗ . For any positive integer r ≤ min { k, p } , we let ( U r , V r ) denote the subspace pairdefined by the top r left and right singular vectors of Θ ∗ . For a given integer r and tolerance δ >
0, we then define a subset of matrices as follows: C ( r ; δ ) : = (cid:26) ∆ ∈ R k × p | ||| ∆ ||| F ≥ δ, ||| ∆ ′′ ||| ≤ ||| ∆ ′ ||| + 4 ||| Π M ⊥ ( U r ,V r ) (Θ ∗ ) ||| (cid:27) . (12)The next ingredient is the choice of the regularization parameter λ N used in solving theSDP (9). Our theory specifies a choice for this quantity in terms of the adjoint of the operator X —namely, the operator X ∗ : R N → R k × p defined by X ∗ ( ~ε ) : = N X i =1 ε i X i . (13)With this notation, we come to the first result of our paper. It is a deterministic result, whichspecifies two conditions—namely, an RSC condition and a choice of the regularizer—thatsuffice to guarantee for any solution of the SDP (9) fall within a certain radius.7 heorem 1. Suppose that the operator X satisfies restricted strong convexity with parameter κ ( X ) > over the set C ( r ; δ ) , and that the regularization parameter λ N is chosen such that λ N ≥ ||| X ∗ ( ~ε ) ||| op /N . Then any solution b Θ to the semidefinite program (9) satisfies ||| b Θ − Θ ∗ ||| F ≤ max (cid:26) δ, λ N √ rκ ( X ) , (cid:20) λ N ||| Π M ⊥ ( U r ,V r ) (Θ ∗ ) ||| κ ( X ) (cid:21) / (cid:27) . (14)Apart from the tolerance parameter δ , the two main terms in the bound (14) have a nat-ural interpretation. The first term (involving √ r ) corresponds to estimation error , capturingthe difficulty of estimating a rank r matrix. The second is an approximation error , in whichthe projection onto the set M ⊥ ( U r , V r ) describes the gap between the true matrix Θ ∗ andthe rank r approximation.Let us begin by illustrating the consequences of Theorem 1 when the true matrix Θ ∗ hasexactly rank r , in which case there is a very natural choice of the subspaces represented by U and V . In particular, we form U from the r non-zero left singular vectors of Θ ∗ , and V from its r non-zero right singular vectors. Note that this choice of ( U, V ) ensures thatΠ M ⊥ ( U,V ) (Θ ∗ ) = 0. For technical reasons to be clarified, it suffices to set δ = 0 in the case ofexact rank constraints, and we thus obtain the following result: Corollary 1 (Exact low-rank recovery) . Suppose that Θ ∗ has rank r , and X satisfies RSCwith respect to C ( r ; 0) . Then as long as λ N ≥ ||| X ∗ ( ~ε ) ||| op /N , any optimal solution b Θ to theSDP (9) satisfies the bound ||| b Θ − Θ ∗ ||| F ≤ √ r λ N κ ( X ) . (15)Like Theorem 1, Corollary 1 is a deterministic statement on the SDP error. It takes a muchsimpler form since when Θ ∗ is exactly low rank, then neither tolerance parameter δ nor theapproximation term are required.As a more delicate example, suppose instead that Θ ∗ is nearly low-rank , an assumptionthat we can formalize by requiring that its singular value sequence { σ i (Θ ∗ ) } min { k,p } i =1 decaysquickly enough. In particular, for a parameter q ∈ [0 ,
1] and a positive radius R q , we definethe set B q ( R q ) := (cid:8) Θ ∈ R k × p | min { k,p } X i =1 | σ i (Θ) | q ≤ R q (cid:9) . (16)Note that when q = 0, the set B ( R ) corresponds to the set of matrices with rank at most R . Corollary 2 (Near low-rank recovery) . Suppose that Θ ∗ ∈ B q ( R q ) , the regularization parame-ter is lower bounded as λ N ≥ ||| X ∗ ( ~ε ) ||| op /N , and the operator X satisfies RSC with parameter κ ( X ) ∈ (0 , over the set C ( R q λ − qN ; δ ) . Then any solution b Θ to the SDP (9) satisfies ||| b Θ − Θ ∗ ||| F ≤ max n δ, p R q (cid:18) λ N κ ( X ) (cid:19) − q/ o . (17)8ote that the error bound (17) reduces to the exact low rank case (15) when q = 0, and δ = 0.The quantity λ − qN R q acts as the “effective rank” in this setting; as clarified by our proof inSection 4.2. This particular choice is designed to provide an optimal trade-off between theapproximation and estimation error terms in Theorem 1. Since λ N is chosen to decay to zeroas the sample size N increases, this effective rank will increase, reflecting the fact that as weobtain more samples, we can afford to estimate more of the smaller singular values of thematrix Θ ∗ . As stated, Corollaries 1 and 2 are fairly abstract in nature. More importantly, it is notimmediately clear how the key underlying assumption—namely, the RSC condition—can beverified, since it is specified via subspaces that depend on Θ ∗ , which is itself the unknownquantity that we are trying to estimate. Nonetheless, we now show how, when specialized tomore concrete models, these results yield concrete and readily interpretable results. As willbe clear in the proofs of these results, each corollary requires overcoming two main technicalobstacles: establishing that the appropriate form of the RSC property holds in a uniformsense (so that a priori knowledge of Θ ∗ is not required), and specifying an appropriate choiceof the regularization parameter λ N . Each of these two steps is non-trivial, requiring somerandom matrix theory, but the end results are simply stated upper bounds that hold withhigh probability.We begin with the case of rank-constrained multivariate regression. As discussed earlierin Example 1, recall that we observe pairs ( Y i , Z i ) ∈ R k × R p linked by the linear model Y i = Θ ∗ Z i + W i , where W i ∼ N (0 , ν I k × k ) is observation noise. Here we treat the caseof random design regression , meaning that the covariates Z i are modeled as random. Inparticular, in the following result, we assume that Z i ∼ N (0 , Σ), i.i.d. for some p -dimensionalcovariance matrix Σ ≻
0. Recalling that σ max (Σ) and σ min (Σ) denote the maximum andminimum eigenvalues respectively, we have: Corollary 3 (Low-rank multivariate regression) . Consider the random design multivariateregression model where Θ ∗ ∈ B q ( R q ) . There are universal constants { c i , i = 1 , , } such thatif we solve the SDP (9) with regularization parameter λ N = 10 ν p σ max (Σ) q ( k + p ) n , we have ||| b Θ − Θ ∗ ||| F ≤ c (cid:18) ν σ max (Σ) σ (Σ) (cid:19) − q/ R q (cid:18) k + pn (cid:19) − q/ (18) with probability greater than − c exp( − c ( k + p )) . Remarks:
Corollary 3 takes a particularly simple form when Σ = I p × p : then there exists aconstant c ′ such that ||| b Θ − Θ ∗ ||| F ≤ c ′ ν R q (cid:0) k + pn (cid:1) − q/ . When Θ ∗ is exactly low rank—thatis, q = 0, and Θ ∗ has rank r = R —this simplifies even further to ||| b Θ − Θ ∗ ||| F ≤ c ′ ν r ( k + p ) n . The scaling in this error bound is easily interpretable: naturally, the squared error is propor-tional to the noise variance ν , and the quantity r ( k + p ) counts the number of degrees offreedom of a k × p matrix with rank r . Note that if we did not impose any constraints onΘ ∗ , then since a k × p matrix has a total of kp free parameters, we would expect at best to9btain rates of the order ||| b Θ − Θ ∗ ||| F = Ω( ν k pn ). Note that when Θ ∗ is low rank—in par-ticular, when r ≪ min { k, p } —then the nuclear norm estimator achieves substantially fasterrates. Finally, we note that as stated, the result requires that min { k, p } tend to infinityin order for the claim to hold with high probability. Although such high-dimensional scal-ing is the primary focus of this paper, we note that for application to the classical settingof fixed ( k, p ), the same statement (with different constants) holds with k + p replaced by log n .Next we turn to the case of estimating the system matrix Θ ∗ of an autoregressive (AR)model, as discussed in Example 2. Corollary 4 (Autoregressive models) . Suppose that we are given n samples { Z t } nt =1 froma p -dimensional autoregressive process (2) that is stationary, based on a system matrix thatis stable ( ||| Θ ∗ ||| op ≤ γ < ), and approximately low-rank ( Θ ∗ ∈ B q ( R q ) ). Then there areuniversal constants { c i , i = 1 , , } such that if we solve the SDP (9) with regularizationparameter λ N = ||| Σ ||| op − γ q pn , then any solution b Θ satisfies ||| b Θ − Θ ∗ ||| F ≤ c (cid:20) σ max (Σ) σ min (Σ) (1 − γ ) (cid:21) − q R q (cid:16) pn (cid:17) − q/ (19) with probability greater than − c exp( − c p ) . Remarks:
Like Corollary 3, the result as stated requires that p tend to infinity, but the samebounds hold with p replaced by log n , yielding results suitable for classical (fixed dimension)scaling. Second, the factor ( p/n ) − q/ , like the analogous term in Corollary 3, shows thatfaster rates are obtained if Θ ∗ can be well-approximated by a low rank matrix, namely forchoices of the parameter q ∈ [0 ,
1] that are closer to zero. Indeed, in the limit q = 0, we againreduce to the case of an exact rank constraint r = R , and the corresponding squared errorscales as rp/n . In contrast to the case of multivariate regression, the error bound (19) alsodepends on the upper bound ||| Θ ∗ ||| op = γ < ∗ . Such dependence is to be expected since the quantity γ controls the (in)stability andmixing rate of the autoregressive process. As clarified in the proof, the dependence of thesampling in the AR model also presents some technical challenges not present in the settingof multivariate regression.Finally, we turn to analysis of the compressed sensing model for matrix recovery, which wasdiscussed in Example 3. The following result applies to the setting in which the observationmatrices { X i } Ni =1 are drawn i.i.d., with standard N (0 ,
1) elements. We assume that theobservation noise vector ~ε ∈ R N satisfies the bound k ~ε k ≤ ν √ N for some constant ν , anassumption that holds for any bounded noise, and also holds with high probability for anyrandom noise vector with sub-Gaussian entries with parameter ν (one example being Gaussiannoise N (0 , ν )). Corollary 5 (Compressed sensing recovery) . Suppose that Θ ∗ ∈ B q ( R q ) , and that the samplesize is lower bounded as N > R − q/ q k p . Then any solution b Θ to the SDP (9) satisfiesthe bound ||| b Θ − Θ ∗ ||| F ≤ ν − q R q hr kN + r pN i − q (20) with probability greater than − c exp( − c ( k + p )) . The term in Corollary 3 has a factor k + p , since the matrix in that case could be non-square in general. X may be of independent interest here: Proposition 1.
Under the stated conditions, the random operator X satisfies k X (Θ) k √ N ≥ ||| Θ ||| F − r kN + r pN ! ||| Θ ||| for all Θ ∈ R k × p (21) with probability at least − − N/ . The proof of this result, provided in Appendix D, makes use of the Gordon-Slepian inequalitiesfor Gaussian processes, and concentration of measure. As we show in Section 4.5, it impliesthe form of the RSC property needed to establish Corollary 5.Proposition 1 also implies an interesting property of the null space of the operator X ;one that can be used to establish a corollary about recovery of low-rank matrices when theobservations are noiseless. In particular, suppose that we are given the noiseless observations y i = hh X i , Θ ∗ ii for i = 1 , . . . , N , and that we try to recover the unknown matrix Θ ∗ by solvingthe SDP min Θ ∈ R k ⋉ p ||| Θ ||| such that hh X i , Θ ii = y i for all i = 1 , . . . , N , (22)a recovery procedure that was studied by Recht et al. [41]. Proposition 1 allows us to obtaina sharp result on recovery using this method: Corollary 6.
Suppose that Θ ∗ has rank r , and that we are given N > r ( k + p ) noiselesssamples. Then with probability at least − − N/ , the SDP (22) recovers the matrix Θ ∗ exactly. This result removes some extra logarithmic factors that were included in the earlier work [41],and provides the appropriate analog to compressed sensing results for sparse vectors [17, 12].Note that (like in most of our results) we have made little effort to obtain good constants inthis result: the important property is that the sample size N scales linearly in both r and k + p . We now turn to the proofs of Theorem 1, and Corollaries 1 through 6. In each case, weprovide the primary steps in the main text, with more technical details stated as lemmas andproved in the Appendix.
By the optimality of b Θ for the SDP (9), we have12 N k ~y − X ( b Θ) k + λ N ||| b Θ ||| ≤ N k ~y − X (Θ ∗ ) k + λ N ||| Θ ∗ ||| . Defining the error matrix ∆ = Θ ∗ − b Θ and performing some algebra yields the inequality12 N k X (∆) k ≤ N h ~ε, X (∆) i + λ N (cid:8) ||| b Θ + ∆ ||| − ||| b Θ ||| (cid:9) . (23)11y definition of the adjoint and H¨older’s inequality, we have1 N (cid:12)(cid:12) h ~ε, X (∆) i (cid:12)(cid:12) = 1 N (cid:12)(cid:12) h X ∗ ( ~ε ) , ∆ i (cid:12)(cid:12) ≤ N ||| X ∗ ( ~ε ) ||| op ||| ∆ ||| . (24)By the triangle inequality, we have ||| b Θ + ∆ ||| − ||| b Θ ||| ≤ ||| ∆ ||| . Substituting this inequalityand the bound (24) into the inequality (23) yields12 N k X (∆) k ≤ N ||| X ∗ ( ~ε ) ||| op ||| ∆ ||| + λ N ||| ∆ ||| ≤ λ N ||| ∆ ||| , where the second inequality makes use of our choice λ N ≥ N ||| X ∗ ( ~ε ) ||| op .It remains to lower bound the term on the left-hand side, while upper bounding thequantity ||| ∆ ||| on the right-hand side. The following technical result allows us to do so.Recall our earlier definition (11) of the sets M and M ⊥ associated with a given subspacepair. Lemma 1.
Let ( e U , e V ) represent a pair of r -dimensional subspaces of left and right singularvectors of Θ ∗ . Then there exists a matrix decomposition ∆ = ∆ ′ + ∆ ′′ of the error ∆ suchthat(a) The matrix ∆ ′ satisfies the constraint rank(∆ ′ ) ≤ r , and(b) If λ N ≥ ||| X ∗ ( ~ε ) ||| /N , then the nuclear norm of ∆ ′′ is bounded as ||| ∆ ′′ ||| ≤ ||| ∆ ′ ||| + 4 ||| Π M ⊥ ( e U, e V ) (Θ ∗ ) ||| (25)See Appendix A for the proof of this claim. Using Lemma 1, we can complete the proofof the theorem. In particular, from the bound (25) and the RSC assumption, we find that12 N k X (∆) k ≥ κ ( X ) ||| ∆ ||| F . Using the triangle inequality together with inequality (25), we obtain ||| ∆ ||| ≤ ||| ∆ ′ ||| + ||| ∆ ′′ ||| ≤ ||| ∆ ′ ||| + 4 ||| Π M ⊥ (Θ ∗ ) ||| . From the rank constraint in Lemma 1(a), we have ||| ∆ ′ ||| ≤ √ r ||| ∆ ′ ||| F . Putting together thepieces, we find κ ( X ) ||| ∆ ||| F ≤ max (cid:8) λ N √ r ||| ∆ ||| F , λ N ||| Π M ⊥ (Θ ∗ ) ||| (cid:9) , which implies that ||| ∆ ||| F ≤ max (cid:26) λ N √ rκ ( X ) , (cid:16) λ N ||| Π M ⊥ (Θ ∗ ) ||| κ ( X ) (cid:17) / (cid:27) , as claimed. 12 .2 Proof of Corollary 2 Let m = min { k, p } . In this case, we consider the singular value decomposition Θ ∗ = U D T V ,where U ∈ R k × m and V ∈ R p × m are orthogonal, and we assume that D is diagonal with thesingular values in non-increasing order σ (Θ ∗ ) ≥ σ (Θ ∗ ) ≥ . . . σ m (Θ ∗ ) ≥
0. For a parameter τ > K = { i | σ i (Θ ∗ ) > τ } , and let U K (respectively V K ) denote the k × | K | (respectively the p × | K | ) orthogonal matrix consisting of the first | K | columns of U (respectively V ). With this choice, the matrix Θ ∗ K c : = Π M ⊥ ( U K ,V K ) (Θ ∗ ) has rank at most m − | K | , with singular values { σ i (Θ ∗ ) , i ∈ K c } . Moreover, since σ i (Θ ∗ ) ≤ τ for all i ∈ K c , wehave ||| Θ ∗ K c ||| = τ m X i = | K | +1 σ i (Θ ∗ ) τ ≤ τ m X i = | K | +1 (cid:16) σ i (Θ ∗ ) τ (cid:17) q ≤ τ − q R q . On the other hand, we also have R q ≥ P mi =1 | σ i (Θ ∗ ) | q ≥ | K | τ q , which implies that | K | ≤ τ − q R q . From the general error bound with r = | K | , we obtain ||| b Θ − Θ ∗ ||| F ≤ max (cid:26) λ N p R q τ − q/ κ ( X ) , (cid:20) λ N τ − q R q κ ( X ) (cid:21) / (cid:27) , Setting τ = λ N /κ yields that ||| b Θ − Θ ∗ ||| F ≤ max (cid:26) λ − q/ N p R q κ − q/ , (cid:20) λ − qN R q κ − q (cid:21) / (cid:27) = 32 p R q (cid:18) λ N κ ( X ) (cid:19) − q/ , as claimed. For the proof of this corollary, we adopt the following notation. We first define the threematrices X = Z T Z T · · · Z Tn ∈ R n × p , Y = Y T Y T · · · Y Tn ∈ R n × k , and W = W T W T · · · W Tn ∈ R n × k . (26)With this notation and using the relation N = nk , the SDP objective function (9) can bewritten as k (cid:8) n ||| Y − X Θ ||| F + λ n ||| Θ ||| (cid:9) , where we have defined λ n = λ N k .In order to establish the RSC property for this model, some algebra shows that we needto establish a lower bound on the quantity12 n ||| X ∆ ||| F = 12 n p X j =1 k ( X ∆) j k ≥ σ min ( X T X )2 n ||| ∆ ||| F , where σ min denotes the minimum eigenvalue. The following lemma follows by adapting knownconcentration results for random matrices (see the paper [47] for details):13 emma 2. Let X ∈ R n × p be a random matrix with i.i.d. rows sampled from a p -variate N (0 , Σ) distribution. Then for n ≥ p , we have P (cid:20) σ min (cid:16) n X T X (cid:17) ≥ σ min (Σ)9 , σ max (cid:16) n X T X (cid:17) ≤ σ max (Σ) (cid:21) ≥ − − n/ . As a consequence, we have σ min ( X T X )2 n ≥ σ min (Σ)18 with probability at least 1 − − n ) for all n ≥ p , which establishes that the RSC property holds with κ ( X ) = σ min (Σ).Next we need upper bound the quantity ||| X ∗ ( ~ε ) ||| for this model, so as to verify that thestated choice for λ N is valid. Following some algebra, we find that1 n ||| X ∗ ( ~ε ) ||| op = 1 n ||| X T W ||| op . The following lemma is proved in Appendix B:
Lemma 3.
There are constants c i > such that P (cid:20)(cid:12)(cid:12) n ||| X T W ||| op (cid:12)(cid:12) ≥ ν q ||| Σ ||| op r k + pn (cid:21) ≤ c exp( − c ( k + p )) . (27)Using these two lemmas, we can complete the proof of Corollary 3. First, recalling thescaling N = kn , we see that Lemma 3 implies that the choice λ N = 10 ν p ||| Σ ||| op q k + pn satisfiesthe conditions of Corollary 2 with high probability. Lemma 2 shows that the RSC propertyholds with κ ( X ) = σ min (Σ) /
20, again with high probability. Consequently, Corollary 2 impliesthat ||| b Θ − Θ ∗ ||| F ≤ R q (cid:18) ν q ||| Σ ||| op r k + pn σ min (Σ) (cid:19) − q = c (cid:18) ν ||| Σ ||| op σ (Σ) (cid:19) − q/ R q (cid:18) k + pn (cid:19) − q/ with probability greater than 1 − c exp( − c ( k + p )), as claimed. For the proof of this corollary, we adopt the notation X = Z T Z T · · · Z Tn ∈ R n × p , and Y = Z T Z T · · · Z Tn +1 ∈ R n × p . Finally, we let W ∈ R n × p be a matrix with i.i.d. N (0 , ν ) elements, corresponding to theinnovations noise driving the AR process. With this notation and using the relation N = np ,the SDP objective function (9) can be written as p (cid:8) n ||| Y − X Θ ||| F + λ n ||| Θ ||| (cid:9) , where we havedefined λ n = λ N p . At a high level, the proof of this corollary is similar to that of Corollary 3,in that we use random matrix theory to establish the required RSC property, and to justifythe choice of λ n , or equivalently λ N . However, it is considerably more challenging, due to thedependence in the rows of the random matrices, and the cross-dependence between the twomatrices X and W (which were independent in setting of multivariate regression).The following lemma provides the lower bound needed to establish RSC for the autore-gressive model: 14 emma 4. The eigenspectrum of the matrix X T X/n is well-controlled in terms of the sta-tionary covariance matrix: in particular, as long as n > c p , we have σ max (cid:16)(cid:16) n X T X (cid:17)(cid:17) ( a ) ≤ σ max (Σ)1 − γ , and σ min (cid:16)(cid:16) n X T X (cid:17)(cid:17) ( b ) ≥ σ min (Σ)4 , (28) both with probability greater than − c exp( − c p ) . Thus, from the bound (28)(b), we see with the high probability, the RSC property holds with κ ( X ) = σ min (Σ) / n > c p .As before, in order to verify the choice of λ N , we need to control the quantity n ||| X T W ||| op .The following inequality, proved in Appendix C.2, yields a suitable upper bound: Lemma 5.
There exist constants c i > , independent of n, p, Σ etc. such that P (cid:2) n ||| X T W ||| op ≥ ||| Σ ||| op − γ r pn (cid:3) ≤ c exp( − c p ) . (29)From Lemma 5, we see that it suffices to choose λ N = ||| Σ ||| op − γ q pn . With this choice, Corol-lary 2 of Theorem 1 yields that ||| Θ − Θ ∗ ||| F ≤ c R q (cid:20) σ max (Σ) σ min (Σ) (1 − γ ) (cid:21) − q (cid:16) pn (cid:17) − q/ with probability greater than 1 − c exp( − c p ), as claimed. Recall that for this model, the observations are of the form y i = hh X i , Θ ∗ ii + ε i , whereΘ ∗ ∈ R k × p is the unknown matrix, and { ε i } Ni =1 is an associated noise sequence.Let us now show how Proposition 1 implies the RSC property with an appropriate toleranceparameter. In particular, let us define δ := R q (cid:2)q kN + q pN (cid:3) − q , so that if we have theinequality ||| ∆ ||| F ≤ δ , the result of Corollary 5 follows immediately. Therefore, we may take ||| ∆ ||| F ≥ δ . Now recall from Lemma 1 that the error ∆ satisfies the bound (25). Combiningthese facts, we are guaranteed that ∆ ∈ C ( r ; δ ), where the set C was previously defined (12),and it is sufficient to establish the RSC property over this set.Observe that the bound (21) implies that for any ∆ ∈ C , k X (∆) k √ N ≥ ||| ∆ ||| F − (cid:18)r kN + r pN (cid:19) ||| ∆ ||| . (30)Following the arguments used in the proofs of Theorem 1 and Corollary 2, we find that ||| ∆ ||| ≤ ||| ∆ ′ ||| + 4 ||| Π M ⊥ (Θ ∗ ) ||| ≤ q R q τ − q ||| ∆ ′ ||| F + 4 R q τ − q , (31)where τ > τ = (cid:0) √ k + √ p (cid:1) / √ N , and substitutethe resulting bound (31) into equation (30), thereby obtaining k X (∆) k √ N ≥ ||| ∆ ||| F − p R q τ − q/ ||| ∆ ||| F − R q τ − q ≥ ||| ∆ ||| F − √ δ ||| ∆ ||| F − δ ||| ∆ ||| F .
If we choose
N > R (1 − q/ q k p , then we are guaranteed that − (4 + √ δ ≥ , whichshows that the RSC property holds with κ ( X ) = 1 / k X ∗ ( ~ε ) k /N , required for specifying a suitablechoice of λ N . Lemma 6. If k ~ε k ≤ ν √ N , then P h k X ∗ ( ~ε ) k N ≥ ν (cid:18)r kN + r pN (cid:19)i ≤ c exp( − c ( k + p )) . (32) Proof.
By definition of the adjoint operator, we have N X ∗ ( ~ε ) = N P Ni =1 ε i X i . Since theobservation matrices { X i } Ni =1 are i.i.d. Gaussian, if the sequence { ε i } Ni =1 is viewed as fixed(by conditioning as needed), then the random matrix Z : = N P Ni =1 ε i X i has zero-meani.i.d. Gaussian entries with variance k ~ε k N . Since Z ∈ R k × p , known results in random matrixtheory [16] imply that P (cid:20) ||| Z ||| op ≥ k ~ε k √ N (cid:16)r kN + r pN (cid:17)(cid:21) ≤ − c ( k + p )) , as claimed. This corollary follows from a combination of Proposition 1 and Lemma 1. Let b Θ be an optimalsolution to the SDP (22), and let ∆ = b Θ − Θ ∗ be the error. Since b Θ is optimal and Θ ∗ is feasiblefor the SDP, we have ||| b Θ ||| = ||| Θ ∗ +∆ ||| ≤ ||| Θ ∗ ||| . Using the decomposition ∆ = ∆ ′ +∆ ′′ fromLemma 1 and applying triangle inequality, we have ||| Θ ∗ + ∆ ′ + ∆ ′′ ||| ≥ ||| Θ ∗ + ∆ ′′ ||| − ||| ∆ ′ ||| .From the properties of the decomposition in Lemma 1 (see Appendix A), we find that ||| b Θ ||| = ||| Θ ∗ + ∆ ′ + ∆ ′′ ||| ≥ ||| Θ ∗ ||| + ||| ∆ ′′ ||| − ||| ∆ ′ ||| . Combining the pieces yields that ||| ∆ ′′ ||| ≤ ||| ∆ ′ ||| , and hence ||| ∆ ||| ≤ ||| ∆ ′ ||| . By Lemma 1(a),the rank of ∆ ′ is at most 2 r , so that we obtain ||| ∆ ||| ≤ √ r ||| ∆ ||| F ≤ r ||| ∆ ||| F .Note that X (∆) = 0, since both b Θ and Θ ∗ agree with the observations. Consequently,from Proposition 1, we have that0 = k X (∆) k √ N ≥ ||| ∆ ||| F − r kN + r pN ! ||| ∆ ||| ≥ ||| ∆ ||| F − r rkN + 4 r rpN ! ≥ ||| ∆ ||| F / , where the final inequality follows from the assumption that N > r ( k + p ). We have thusshown that ∆ = 0, which implies that b Θ = Θ ∗ as claimed.16 Experimental results
In this section, we report the results of various simulations that demonstrate the close agreement between the scaling predicted by our theory, and the actual behavior of the SDP-based M-estimator (9) in practice. In all cases, we solved the convex program (9) by using our own implementation in MATLAB of an accelerated gradient descent method which adapts a non-smooth convex optimization procedure [36] to the nuclear norm [26]. We chose the regularization parameter λ_N in the manner suggested by our theoretical results; in doing so, we assumed knowledge of quantities such as the noise variance ν². (In practice, one would have to estimate such quantities from the data using standard methods.)

We report simulation results for three of the running examples discussed in this paper: low-rank multivariate regression, estimation in vector autoregressive processes, and matrix recovery from random projections (compressed sensing). In each case, we solved instances of the SDP for a square matrix Θ∗ ∈ R^{p×p}, where p ∈ {40, 80, 160} for the first two examples, and p ∈ {20, 40, 80} for the compressed sensing example. In all cases, we considered the case of exact low-rank constraints, with rank(Θ∗) = r = 10, and we generated Θ∗ by choosing the subspaces of its left and right singular vectors uniformly at random from the Grassmann manifold. The observation or innovations noise had variance ν² = 1 in each case. The VAR process was generated by first solving for the covariance matrix Σ using the MATLAB function dlyap and then generating a sample path. For each setting of (r, p), we solved the SDP for a range of sample sizes N.
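The accelerated first-order method referenced above is not reproduced here, but the basic update it accelerates is easy to state: the objective in (9) couples a smooth quadratic loss with the nuclear norm, so each iteration takes a gradient step on the quadratic part and then applies singular value soft-thresholding, the proximal operator of the nuclear norm. The following Python sketch is a minimal, non-accelerated version of this scheme, assuming the observation matrices are stacked in a dense array `Xs`; the step size rule, iteration count, and function names are illustrative choices, not the implementation used for the experiments.

```python
import numpy as np

def svt(M, tau):
    """Singular value soft-thresholding: the proximal operator of tau * |||.|||_1."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def apply_X(Xs, Theta):
    """[X(Theta)]_i = <<X_i, Theta>> = trace(X_i^T Theta), for Xs of shape (N, k, p)."""
    return np.einsum('ikp,kp->i', Xs, Theta)

def adjoint_X(Xs, v):
    """Adjoint operator X^*(v) = sum_i v_i X_i, mapping R^N back to k x p matrices."""
    return np.einsum('i,ikp->kp', v, Xs)

def nuclear_norm_ls(Xs, y, lam, n_iter=500):
    """Plain proximal gradient for (1/2N) ||y - X(Theta)||_2^2 + lam * |||Theta|||_1."""
    N, k, p = Xs.shape
    # Conservative step size 1/L, with L = |||X|||_op^2 / N for the flattened operator.
    L = np.linalg.norm(Xs.reshape(N, k * p), 2) ** 2 / N
    step = 1.0 / L
    Theta = np.zeros((k, p))
    for _ in range(n_iter):
        grad = adjoint_X(Xs, apply_X(Xs, Theta) - y) / N
        Theta = svt(Theta - step * grad, step * lam)
    return Theta
```

In the regression and autoregressive examples the observation matrices are rank one (X_i = Z_a e_b^T), so in practice one would avoid storing them explicitly; the dense form above simply keeps the sketch generic.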
Figure 1.
Results of applying the SDP (9) with nuclear norm regularization to the problem of low-rank multivariate regression. (a) Plots of the Frobenius error |||Θ̂ − Θ∗|||_F on a logarithmic scale versus the sample size N for three different matrix sizes p ∈ {40, 80, 160}, all with rank r = 10. (b) Plots of the same Frobenius error versus the rescaled sample size N/(rp). Consistent with theory, all three plots are now extremely well-aligned.

Figure 1 shows results for a multivariate regression model with the covariates chosen randomly from a N(0, I) distribution. Panel (a) plots the Frobenius error |||Θ̂ − Θ∗|||_F on a logarithmic scale versus the sample size N for three different matrix sizes, p ∈ {40, 80, 160}. Naturally, in each case, the error decays to zero as N increases, but larger matrices require larger sample sizes, as reflected by the rightward shift of the curves as p is increased. Panel (b) of Figure 1 shows the exact same set of simulation results, but now with the Frobenius error plotted versus the rescaled sample size Ñ := N/(rp). As predicted by Corollary 3, the error plots are now all aligned with one another; the degree of alignment in this particular case is so close that the three plots are indistinguishable. (The blue curve is the only one visible since it was plotted last by our routine.) Consequently, Figure 1 shows that N/(rp) acts as the effective sample size in this high-dimensional setting.

Figure 2 shows similar results for the autoregressive model discussed in Example 2. As shown in panel (a), the Frobenius error again decays as the sample size is increased, although problems involving larger matrices are shifted to the right. Panel (b) shows the same Frobenius error plotted versus the rescaled sample size N/(rp); as predicted by Corollary 4, the errors for different matrix sizes p are again quite well-aligned. In this case, we find (both in our theoretical analysis and experimental results) that the dependence in the autoregressive process slows down the rate at which concentration occurs, so that the results are not as crisp as in the low-rank multivariate setting of Figure 1.
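To make the autoregressive experiment concrete, the data-generation step described in the previous section can be sketched in a few lines: draw a low-rank, stable system matrix, solve the discrete-time equation (3) for the stationary covariance Σ (scipy's solve_discrete_lyapunov playing the role of MATLAB's dlyap), and then simulate the recursion (2). The stability rescaling, the way the singular subspaces are drawn, and the stationary initialization are illustrative assumptions, not the authors' exact recipe.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def random_low_rank_system(p, r, gamma=0.9, rng=None):
    """Draw a rank-r system matrix rescaled so that |||Theta|||_op = gamma < 1 (stability)."""
    rng = np.random.default_rng(rng)
    # Random r-dimensional left/right singular subspaces via QR of Gaussian matrices.
    U, _ = np.linalg.qr(rng.standard_normal((p, r)))
    V, _ = np.linalg.qr(rng.standard_normal((p, r)))
    D = np.diag(rng.uniform(0.5, 1.0, size=r))
    Theta = U @ D @ V.T
    return gamma * Theta / np.linalg.norm(Theta, 2)

def simulate_var(Theta, n, nu=1.0, rng=None):
    """Simulate Z_{t+1} = Theta Z_t + W_t with W_t ~ N(0, nu^2 I), started at stationarity."""
    rng = np.random.default_rng(rng)
    p = Theta.shape[0]
    # Stationary covariance solves Sigma = Theta Sigma Theta^T + nu^2 I (equation (3)).
    Sigma = solve_discrete_lyapunov(Theta, nu ** 2 * np.eye(p))
    Z = np.empty((n, p))
    Z[0] = rng.multivariate_normal(np.zeros(p), Sigma)
    for t in range(n - 1):
        Z[t + 1] = Theta @ Z[t] + nu * rng.standard_normal(p)
    return Z
```

The resulting sample path {Z_t} then feeds the scalarized observation model (7), with N = np observations in total.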
Figure 2.
Results of applying the SDP (9) with nuclear norm regularization to estimating the system matrix of a vector autoregressive process. (a) Plots of the Frobenius error |||Θ̂ − Θ∗|||_F on a logarithmic scale versus the sample size N for three different matrix sizes p ∈ {40, 80, 160}, all with rank r = 10. (b) Plots of the same Frobenius error versus the rescaled sample size N/(rp). Consistent with theory, all three plots are now reasonably well-aligned.

Finally, Figure 3 presents the same set of results for the compressed sensing observation model discussed in Example 3. Even though the observation matrices X_i here are qualitatively different (in comparison to the multivariate regression and autoregressive examples), we again see the “stacking” phenomenon of the curves when plotted versus the rescaled sample size N/(rp), as predicted by Corollary 5.
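For the compressed sensing model the whole experiment is short enough to spell out: generate a rank-r target, draw N Gaussian observation matrices and noisy measurements as in (8), solve the nuclear-norm program, and record the Frobenius error against the rescaled sample size N/(rp). The sketch below does this, reusing the hypothetical `nuclear_norm_ls` solver from the earlier sketch; the sample-size grid, the constant in the regularization parameter, and the way Θ∗ is drawn are illustrative choices, not the settings behind Figure 3.

```python
import numpy as np

def compressed_sensing_instance(p, r, N, nu=1.0, rng=None):
    """Rank-r target Theta*, Gaussian observation matrices, and noisy measurements
    y_i = <<X_i, Theta*>> + eps_i, following the observation model (8)."""
    rng = np.random.default_rng(rng)
    A = rng.standard_normal((p, r))
    B = rng.standard_normal((p, r))
    Theta_star = A @ B.T / np.sqrt(r)              # rank-r, random singular subspaces
    Xs = rng.standard_normal((N, p, p))            # i.i.d. standard normal entries
    y = np.einsum('ikp,kp->i', Xs, Theta_star) + nu * rng.standard_normal(N)
    return Theta_star, Xs, y

# Illustrative sweep: error versus the rescaled sample size N / (r p).
p, r, nu = 40, 10, 1.0
for N in (2000, 4000, 8000):
    Theta_star, Xs, y = compressed_sensing_instance(p, r, N, nu, rng=0)
    # Regularization proportional to nu * (sqrt(k/N) + sqrt(p/N)), as the analysis suggests;
    # the constant 4 is an illustrative choice.
    lam = 4 * nu * np.sqrt(p / N)
    Theta_hat = nuclear_norm_ls(Xs, y, lam)        # solver sketched after the set-up above
    err = np.linalg.norm(Theta_hat - Theta_star, 'fro')
    print(N / (r * p), err)
```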
Figure 3.
Results of applying the SDP (9) with nuclear norm regularization to recovering a low-rank matrix on the basis of random projections (compressed sensing model). (a) Plots of the Frobenius error |||Θ̂ − Θ∗|||_F on a logarithmic scale versus the sample size N for three different matrix sizes p ∈ {20, 40, 80}, all with rank r = 10. (b) Plots of the same Frobenius error versus the rescaled sample size N/(rp). Consistent with theory, all three plots are now reasonably well-aligned.

6 Discussion

In this paper, we have analyzed the nuclear norm relaxation for a general class of noisy observation models, and obtained non-asymptotic error bounds on the Frobenius norm that hold under high-dimensional scaling. In contrast to most past work, our results are applicable to both exactly and approximately low-rank matrices. We stated a main theorem that provides high-dimensional rates in a fairly general setting, and then showed how, by specializing this result to some specific model classes—namely, low-rank multivariate regression, estimation of autoregressive processes, and matrix recovery from random projections—it yields concrete and readily interpretable rates. Lastly, we provided some simulation results that showed excellent agreement with the predictions from our theory.

This paper has focused on achievable results for low-rank matrix estimation using a particular polynomial-time method. It would be interesting to establish matching lower bounds, showing that the rates obtained by this estimator are minimax-optimal. We suspect that this should be possible, for instance by using the techniques exploited in Raskutti et al. [39] in analyzing minimax rates for regression over ℓq-balls.

Acknowledgements
This work was partially supported by a Sloan Foundation Fellowship, AFOSR-09NL184 grant, and an NSF-CCF-0545862 CAREER grant to MJW.
A Proof of Lemma 1
Part (a) of the claim was proved in Recht et al. [41]; we simply provide a proof here forcompleteness. We write the SVD as Θ ∗ = U DV T , where U ∈ R k × k and V ∈ R p × p areorthogonal matrices, and D is the matrix formed by the singular values of Θ ∗ . By re-orderingas needed, we may assume without loss of generality that the first r columns of U (respectively V ) correspond to the matrices e U (respectively e V ) from the statement. We then define thematrix Γ = U T ∆ V ∈ R k × p , and write it in block form asΓ = (cid:20) Γ Γ Γ Γ (cid:21) , where Γ ∈ R r × r , and Γ ∈ R ( k − r ) × ( p − r ) .19e now define the matrices∆ ′′ = U (cid:20) (cid:21) V T , and ∆ ′ = ∆ − ∆ ′′ . Note that we haverank(∆ ′ ) = rank (cid:20) Γ Γ Γ (cid:21) ≤ rank (cid:20) Γ Γ (cid:21) + rank (cid:20) Γ (cid:21) ≤ r, which establishes Lemma 1(a). Moreover, we note for future reference that by constructionof ∆ ′′ , the nuclear norm satisfies the decomposition ||| Π M ( e U, e V ) (Θ ∗ ) + ∆ ′′ ||| = ||| Π M ( e U, e V ) (Θ ∗ ) ||| + ||| ∆ ′′ ||| . (33)We now turn to the proof of Lemma 1(b). Recall that the error ∆ = b Θ − Θ ∗ associatedwith any optimal solution must satisfy the inequality (23), which implies that0 ≤ N h ~ε, X (∆) i + λ N (cid:8) ||| Θ ∗ ||| − ||| b Θ ||| (cid:9) ≤ ||| N X ∗ ( ~ε ) ||| op ||| ∆ ||| + λ N (cid:8) ||| Θ ∗ ||| − ||| b Θ ||| (cid:9) , (34)where we have used the bound (24).Using the triangle inequality and the relation (33), we have ||| b Θ ||| = ||| (Π M (Θ ∗ ) + ∆ ′′ ) + (Π M ⊥ (Θ ∗ ) + ∆ ′ ) ||| ≥ ||| (Π M (Θ ∗ ) + ∆ ′′ ) ||| − ||| (Π M ⊥ (Θ ∗ ) + ∆ ′ ) ||| ≥ ||| Π M (Θ ∗ ) ||| + ||| ∆ ′′ ||| − (cid:8) ||| (Π M ⊥ (Θ ∗ ) ||| + ||| ∆ ′ ||| (cid:9) . Consequently, we have ||| Θ ∗ ||| − ||| b Θ ||| ≤ ||| Θ ∗ ||| − (cid:8) ||| Π M (Θ ∗ ) ||| + ||| ∆ ′′ ||| (cid:9) + (cid:8) ||| (Π M ⊥ (Θ ∗ ) ||| + ||| ∆ ′ ||| (cid:9) = 2 ||| Π M ⊥ (Θ ∗ ) ||| + ||| ∆ ′ ||| − ||| ∆ ′′ ||| . Substituting this inequality into the bound (34), we obtain0 ≤ ||| N X ∗ ( ~ε ) ||| op ||| ∆ ||| + λ N (cid:8) ||| Π M ⊥ (Θ ∗ ) ||| + ||| ∆ ′ ||| − ||| ∆ ′′ ||| (cid:9) . Finally, since ||| N X ∗ ( ~ε ) ||| op ≤ λ N / ≤ λ N (cid:8) ||| Π M ⊥ (Θ ∗ ) ||| + 32 ||| ∆ ′ ||| − ||| ∆ ′′ ||| (cid:9) , from which the bound (25) follows. B Proof of Lemma 3
Let S p − = { u ∈ R p | k u k = 1 } denote the Euclidean sphere in p -dimensions. The operatornorm of interest has the variational representation1 n ||| X T W ||| op = 1 n sup u ∈ S k − sup v ∈ S p − v T X T W u a and b , define the (random) quantityΨ( a, b ) : = sup u ∈ a S k − sup v ∈ b S p − h Xv, W u i . and note that our goal is to upper bound Ψ(1 , a, b ) = a b Ψ(1 , A = { u , . . . , u A } and B = { v , . . . , v B } denote 1 / S k − and S p − ,respectively. We now claim that we have the upper boundΨ(1 , ≤ u a ∈A ,v b ∈B h Xv b , W u a i (35)To establish this claim, we note that since the sets A and B are 1 / u, v ) ∈ S p − × S p − , there exists a pair ( u a , v b ) ∈ A × B such that u = u a + ∆ u and v = v b + ∆ v , with max {k ∆ u k , k ∆ v k } ≤ /
4. Consequently, we can write h Xv, W u i = h Xv b , W u a i + h Xv b , W ∆ u i + h X ∆ v, W u a i + h X ∆ v, W ∆ u i . (36)By construction, we have the bound |h Xv b , W ∆ u i| ≤ Ψ(1 , /
4) = Ψ(1 , |h X ∆ v, W u a i| ≤ Ψ(1 ,
1) as well as |h X ∆ v, W ∆ u i| ≤ Ψ(1 , , ≤ max u a ∈A ,v b ∈B h Xv b , W u a i + 916 Ψ(1 , , from which the bound (35) follows.We now apply the union bound to control the discrete maximum. It is known (e.g., [29, 33])that there exists a 1 / S k − and S p − with at most A ≤ k and B ≤ p elementsrespectively. Consequently, we have P (cid:2) | Ψ(1 , | ≥ δ n (cid:3) ≤ k + p max u a ,v b P (cid:20) |h Xv b , W u a i| n ≥ δ (cid:21) . (37)It remains to obtain a good bound on the quantity n h Xv, W u i = n P ni =1 h v, X i ih u, W i i ,where ( u, v ) ∈ S k − × S p − are arbitrary but fixed. Since W i ∈ R k has i.i.d. N (0 , ν ) elementsand u is fixed, we have Z i : = h u, W i i ∼ N (0 , ν ) for each i = 1 , . . . , n . These variables areindependent of one another, and of the random matrix X . Therefore, conditioned on X , thesum Z : = n P ni =1 h v, X i ih u, W i i is zero-mean Gaussian with variance α : = ν n (cid:18) n k Xv k (cid:19) ≤ ν n ||| X T X/n ||| op . Define the event T = { α ≤ ν ||| Σ ||| op n } . Using Lemma 2, we have ||| X T X/n ||| op ≤ σ max (Σ)with probability at least 1 − − n/ P [ T c ] ≤ − n/ T and its complement T c , we obtain P [ | Z | ≥ t ] ≤ P (cid:2) | Z | ≥ t | T (cid:3) + P [ T c ] ≤ exp (cid:18) − n t ν (4 + ||| Σ ||| op ) (cid:19) + 2 exp( − n/ . Combining this tail bound with the upper bound (37), we have P (cid:2) | ψ (1 , | ≥ δ n (cid:3) ≤ k + p (cid:26) exp (cid:18) − n t ν ||| Σ ||| op (cid:19) + 2 exp( − n/ (cid:27) . Setting t = 20 ν ||| Σ ||| op k + pn , this probability vanishes as long as n > k + p ).21 Technical details for Corollary 4
C Technical details for Corollary 4

In this appendix, we collect the proofs of Lemmas 4 and 5.
C.1 Proof of Lemma 4
Recalling that $S^{p-1}$ denotes the unit Euclidean sphere in $p$ dimensions, we first observe that $||| X |||_{\mathrm{op}} = \sup_{u \in S^{p-1}} \|Xu\|_2$. Our next step is to reduce the supremum to a maximization over a finite set, using a standard covering argument. Let $\mathcal{A} = \{u^1, \ldots, u^A\}$ denote a $1/2$-cover of $S^{p-1}$: for any $u \in S^{p-1}$, there is some $u^a \in \mathcal{A}$ such that $u = u^a + \Delta u$, where $\|\Delta u\|_2 \le 1/2$. Consequently, for any $u \in S^{p-1}$, the triangle inequality implies that
$$\|Xu\|_2 \le \|X u^a\|_2 + \|X \Delta u\|_2 \le \|X u^a\|_2 + \tfrac{1}{2} ||| X |||_{\mathrm{op}},$$
and hence that $||| X |||_{\mathrm{op}} \le \max_{u^a \in \mathcal{A}} \|X u^a\|_2 + \tfrac{1}{2} ||| X |||_{\mathrm{op}}$. Re-arranging yields the useful inequality
$$||| X |||_{\mathrm{op}} \le 2 \max_{u^a \in \mathcal{A}} \|X u^a\|_2. \tag{38}$$
Using inequality (38), we have
$$\mathbb{P}\Bigl[ \frac{1}{n} ||| X^T X |||_{\mathrm{op}} > 4t \Bigr] \le \mathbb{P}\Bigl[ \max_{u^a \in \mathcal{A}} \frac{1}{n} \sum_{i=1}^n \bigl( \langle u^a, X_i \rangle \bigr)^2 > t \Bigr] \le 4^p \max_{u^a \in \mathcal{A}} \mathbb{P}\Bigl[ \frac{1}{n} \sum_{i=1}^n \bigl( \langle u^a, X_i \rangle \bigr)^2 > t \Bigr], \tag{39}$$
where the last inequality follows from the union bound, and the fact [29, 33] that there exists a $1/2$-cover of $S^{p-1}$ with at most $4^p$ elements.

In order to complete the proof, we need to obtain a sharp upper bound on the quantity $\mathbb{P}\bigl[ \frac{1}{n} \sum_{i=1}^n ( \langle u, X_i \rangle )^2 > t \bigr]$, valid for any fixed $u \in S^{p-1}$. Define the random vector $Y \in \mathbb{R}^n$ with elements $Y_i = \langle u, X_i \rangle$. Note that $Y$ is zero-mean, and its covariance matrix $R$ has elements $R_{ij} = \mathbb{E}[Y_i Y_j] = u^T (\Theta^*)^{|j-i|} \Sigma\, u$. In order to bound the spectral norm of $R$, we note that since it is symmetric, we have $||| R |||_{\mathrm{op}} \le \max_{i=1,\ldots,n} \sum_{j=1}^n |R_{ij}|$, and moreover
$$|R_{ij}| = \bigl| u^T (\Theta^*)^{|j-i|} \Sigma\, u \bigr| \le \bigl( ||| \Theta^* |||_{\mathrm{op}} \bigr)^{|j-i|} ||| \Sigma |||_{\mathrm{op}} \le \gamma^{|j-i|} ||| \Sigma |||_{\mathrm{op}}.$$
Combining the pieces, we obtain
$$||| R |||_{\mathrm{op}} \le \max_i \sum_{j=1}^n \gamma^{|i-j|} ||| \Sigma |||_{\mathrm{op}} \le 2 ||| \Sigma |||_{\mathrm{op}} \sum_{j=0}^\infty \gamma^j = \frac{2 ||| \Sigma |||_{\mathrm{op}}}{1 - \gamma}. \tag{40}$$
Moreover, we have $\operatorname{trace}(R)/n = u^T \Sigma u \le ||| \Sigma |||_{\mathrm{op}}$. Applying Lemma 8 with $t = 5\sqrt{p/n}$, we conclude that
$$\mathbb{P}\Bigl[ \frac{1}{n} \|Y\|_2^2 > ||| \Sigma |||_{\mathrm{op}} + 25 \sqrt{\frac{p}{n}}\, ||| R |||_{\mathrm{op}} \Bigr] \le 2 \exp(-2p) + 2 \exp(-n/2).$$
Combined with the bound (39) and the bound (40), we obtain
$$\Bigl|\Bigl|\Bigl| \frac{1}{n} X^T X \Bigr|\Bigr|\Bigr|_{\mathrm{op}} \le 4 ||| \Sigma |||_{\mathrm{op}} \Bigl\{ 1 + \frac{50}{1 - \gamma} \sqrt{\frac{p}{n}} \Bigr\} \le \frac{24\, ||| \Sigma |||_{\mathrm{op}}}{1 - \gamma}, \tag{41}$$
with probability at least $1 - c_1 \exp(-c_2 p)$, where the final inequality holds as long as $n \ge 100\, p$. This establishes the upper bound (28)(a).

Turning to the lower bound (28)(b), we let $\mathcal{B} = \{v^1, \ldots, v^B\}$ be an $\epsilon$-cover of $S^{p-1}$, for some $\epsilon \in (0, 1)$ to be chosen. Thus, for any $v \in S^{p-1}$, there exists some $v^b$ such that $v = v^b + \Delta v$, with $\|\Delta v\|_2 \le \epsilon$. Define the function $\Psi : \mathbb{R}^p \times \mathbb{R}^p \to \mathbb{R}$ via $\Psi(u, v) = u^T \bigl( \tfrac{1}{n} X^T X \bigr) v$, and note that $\Psi(u, v) = \Psi(v, u)$. With this notation, we have
$$v^T \Bigl( \frac{1}{n} X^T X \Bigr) v = \Psi(v, v) = \Psi(v^b, v^b) + 2 \Psi(\Delta v, v^b) + \Psi(\Delta v, \Delta v) \ge \Psi(v^b, v^b) + 2 \Psi(\Delta v, v^b),$$
since $\Psi(\Delta v, \Delta v) \ge 0$. Since $|\Psi(\Delta v, v^b)| \le \epsilon\, ||| \tfrac{1}{n} X^T X |||_{\mathrm{op}}$, we obtain the lower bound
$$\sigma_{\min}\Bigl( \frac{1}{n} X^T X \Bigr) = \inf_{v \in S^{p-1}} v^T \Bigl( \frac{1}{n} X^T X \Bigr) v \ge \min_{v^b \in \mathcal{B}} \Psi(v^b, v^b) - 2 \epsilon\, \Bigl|\Bigl|\Bigl| \frac{1}{n} X^T X \Bigr|\Bigr|\Bigr|_{\mathrm{op}}.$$
By the previously established upper bound (28)(a), we have $||| \tfrac{1}{n} X^T X |||_{\mathrm{op}} \le \tfrac{24 ||| \Sigma |||_{\mathrm{op}}}{1 - \gamma}$ with high probability. Hence, choosing $\epsilon = \tfrac{(1 - \gamma)\, \sigma_{\min}(\Sigma)}{200\, ||| \Sigma |||_{\mathrm{op}}}$ ensures that $2 \epsilon\, ||| \tfrac{1}{n} X^T X |||_{\mathrm{op}} \le \sigma_{\min}(\Sigma)/4$.

It remains to lower bound the quantity $\Psi(v, v)$ for any fixed $v \in S^{p-1}$. Note that we can write
$$\Psi(v, v) = \frac{1}{n} \sum_{i=1}^n \bigl( \langle v, X_i \rangle \bigr)^2.$$
As before, if we define the random vector $Y \in \mathbb{R}^n$ with elements $Y_i = \langle v, X_i \rangle$, then $Y \sim N(0, R)$ with $||| R |||_{\mathrm{op}} \le \tfrac{2 ||| \Sigma |||_{\mathrm{op}}}{1 - \gamma}$. Moreover, we have $\operatorname{trace}(R)/n = v^T \Sigma v \ge \sigma_{\min}(\Sigma)$. Consequently, applying Lemma 8 yields
$$\mathbb{P}\Bigl[ \frac{1}{n} \|Y\|_2^2 < \sigma_{\min}(\Sigma) - \frac{10\, t\, ||| \Sigma |||_{\mathrm{op}}}{1 - \gamma} \Bigr] \le 2 \exp\bigl( -n (t - 2/\sqrt{n})^2 / 2 \bigr) + 2 \exp(-n/2).$$
Note that this bound holds for any fixed $v \in S^{p-1}$. Setting $t^* = \tfrac{(1 - \gamma)\, \sigma_{\min}(\Sigma)}{16\, ||| \Sigma |||_{\mathrm{op}}}$ and applying the union bound over the cover yields that
$$\mathbb{P}\Bigl[ \min_{v^b \in \mathcal{B}} \Psi(v^b, v^b) < \tfrac{3}{8} \sigma_{\min}(\Sigma) \Bigr] \le \Bigl( \frac{3}{\epsilon} \Bigr)^p \Bigl\{ 2 \exp\bigl( -n (t^* - 2/\sqrt{n})^2 / 2 \bigr) + 2 \exp(-n/2) \Bigr\},$$
which vanishes as long as $n > c\, p \log(3/\epsilon) / (t^*)^2$. Putting together the pieces, we obtain $\sigma_{\min}\bigl( \tfrac{1}{n} X^T X \bigr) \ge \tfrac{3}{8} \sigma_{\min}(\Sigma) - \tfrac{1}{4} \sigma_{\min}(\Sigma) = \sigma_{\min}(\Sigma)/8$ with probability converging to one, which completes the proof.
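The design analyzed here consists of successive states of the vector autoregression $X_{t+1} = \Theta^* X_t + W_t$. The following sketch (ours; the process parameters are arbitrary and the truncated series for $\Sigma$ is a convenience) simulates such a process and reports the extreme eigenvalues of $X^T X / n$ alongside $\sigma_{\min}(\Sigma)$ and $||| \Sigma |||_{\mathrm{op}} / (1 - \gamma)$, the two quantities appearing in the bounds (28).

```python
import numpy as np

# Sanity check for Lemma 4: extreme eigenvalues of X^T X / n when the rows of X
# are consecutive states of a stable VAR(1) process.
rng = np.random.default_rng(2)
p, n, nu = 15, 5000, 1.0
A = rng.standard_normal((p, p))
Theta = 0.5 * A / np.linalg.norm(A, 2)          # gamma = |||Theta*|||_op = 0.5
gamma = np.linalg.norm(Theta, 2)

# Stationary covariance  Sigma = sum_j Theta^j (nu^2 I) (Theta^j)^T  (truncated series).
Sigma = np.zeros((p, p))
M = nu ** 2 * np.eye(p)
for _ in range(200):
    Sigma += M
    M = Theta @ M @ Theta.T

# Simulate the process with a burn-in, then stack the states into the data matrix X.
x = np.zeros(p)
rows = []
for t in range(n + 500):
    x = Theta @ x + nu * rng.standard_normal(p)
    if t >= 500:
        rows.append(x.copy())
X = np.array(rows)

eigs = np.linalg.eigvalsh(X.T @ X / n)
print("min eig of X^T X/n:", eigs[0], " vs sigma_min(Sigma):", np.linalg.eigvalsh(Sigma)[0])
print("max eig of X^T X/n:", eigs[-1], " vs |||Sigma|||_op/(1-gamma):",
      np.linalg.norm(Sigma, 2) / (1 - gamma))
```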
C.2 Proof of Lemma 5

Let $S^{p-1} = \{u \in \mathbb{R}^p \mid \|u\|_2 = 1\}$ denote the Euclidean sphere in $p$ dimensions, and for positive scalars $a$ and $b$, define the random variable $\Psi(a, b) := \sup_{u \in a S^{p-1}} \sup_{v \in b S^{p-1}} \langle Xv, Wu \rangle$. Note that our goal is to upper bound $\Psi(1,1)$. Let $\mathcal{A} = \{u^1, \ldots, u^A\}$ and $\mathcal{B} = \{v^1, \ldots, v^B\}$ denote $1/4$-covers of $S^{p-1}$. Following the same argument as in the proof of Lemma 3, we obtain the upper bound
$$\Psi(1,1) \le 4 \max_{u^a \in \mathcal{A},\, v^b \in \mathcal{B}} \langle X v^b, W u^a \rangle. \tag{42}$$
We now apply the union bound to control the discrete maximum. It is known (e.g., [29, 33]) that there exists a $1/4$-cover of $S^{p-1}$ with at most $8^p$ elements. Consequently, we have
$$\mathbb{P}\bigl[ \Psi(1,1)/n \ge 4\delta \bigr] \le 8^{2p} \max_{u^a, v^b} \mathbb{P}\Bigl[ \frac{|\langle X v^b, W u^a \rangle|}{n} \ge \delta \Bigr]. \tag{43}$$
It remains to obtain a tail bound on the quantity $\mathbb{P}\bigl[ \frac{|\langle Xv, Wu \rangle|}{n} \ge \delta \bigr]$, for any fixed pair $(u, v) \in \mathcal{A} \times \mathcal{B}$.

For each $i = 1, \ldots, n$, let $X_i$ and $W_i$ denote the $i$-th rows of $X$ and $W$. Following some simple algebra, we have the decomposition
$$\frac{\langle Xv, Wu \rangle}{n} = T_1 - T_2 - T_3,$$
where
$$T_1 = \frac{1}{2n} \sum_{i=1}^n \bigl( \langle u, W_i \rangle + \langle v, X_i \rangle \bigr)^2 - \frac{1}{2}\bigl( \nu^2 + v^T \Sigma v \bigr), \quad T_2 = \frac{1}{2n} \sum_{i=1}^n \bigl( \langle u, W_i \rangle \bigr)^2 - \frac{\nu^2}{2}, \quad T_3 = \frac{1}{2n} \sum_{i=1}^n \bigl( \langle v, X_i \rangle \bigr)^2 - \frac{v^T \Sigma v}{2}.$$
We may now bound each term $T_j$, $j = 1, 2, 3$, in turn; each is a deviation of a quantity of the form $\tfrac{1}{2n} \|Y\|_2^2$ from its mean, where $Y \sim N(0, Q)$ for some matrix $Q \succeq 0$.

Bound on $T_2$: We begin with $T_2$, which is the easiest to control, since (up to scaling by $\nu^2$) it corresponds to the deviation from its mean of a $\chi^2$-variable with $n$ degrees of freedom. Consequently, applying Lemma 8 with $Q = \nu^2 I_n$, we obtain
$$\mathbb{P}\bigl[ |T_2| > \tfrac{5}{2} \nu^2 t \bigr] \le 2 \exp\bigl( -n (t - 2/\sqrt{n})^2 / 2 \bigr) + 2 \exp(-n/2). \tag{44}$$

Bound on $T_3$: We can write the term $T_3$ as a deviation of $\tfrac{1}{2n} \|Y\|_2^2$ from its mean, where in this case the covariance matrix $Q$ is no longer the identity. In concrete terms, define a random vector $Y \in \mathbb{R}^n$ with elements $Y_i = \langle v, X_i \rangle$. As seen in the proof of Lemma 4 from Appendix C.1, the vector $Y$ is zero-mean Gaussian with covariance matrix $R$ such that $||| R |||_{\mathrm{op}} \le \tfrac{2 ||| \Sigma |||_{\mathrm{op}}}{1 - \gamma}$ (see equation (40)). Since we have $\operatorname{trace}(R)/n = v^T \Sigma v$, applying Lemma 8 yields
$$\mathbb{P}\Bigl[ |T_3| \ge \frac{5\, ||| \Sigma |||_{\mathrm{op}}}{1 - \gamma}\, t \Bigr] \le 2 \exp\bigl( -n (t - 2/\sqrt{n})^2 / 2 \bigr) + 2 \exp(-n/2). \tag{45}$$

Bound on $T_1$: To control this quantity, let us define a zero-mean Gaussian random vector $Z \in \mathbb{R}^n$ with elements $Z_i = \langle v, X_i \rangle + \langle u, W_i \rangle$. This random vector has covariance matrix $S$ with elements
$$S_{ij} = \mathbb{E}[Z_i Z_j] = (\nu^2 + v^T \Sigma v)\, \delta_{ij} + (1 - \delta_{ij}) \bigl\{ \nu^2 v^T (\Theta^*)^{|i-j|-1} u + v^T (\Theta^*)^{|i-j|} \Sigma\, v \bigr\},$$
where $\delta_{ij}$ is the Kronecker delta for the event $\{i = j\}$. As before, by symmetry of $S$, we have $||| S |||_{\mathrm{op}} \le \max_{i=1,\ldots,n} \sum_{j=1}^n |S_{ij}|$, and hence
$$||| S |||_{\mathrm{op}} \le \nu^2 + ||| \Sigma |||_{\mathrm{op}} + 2 \sum_{j=1}^\infty \nu^2 \gamma^{j-1} + 2 \sum_{j=1}^\infty \gamma^j ||| \Sigma |||_{\mathrm{op}} \le \nu^2 + ||| \Sigma |||_{\mathrm{op}} + \frac{2 \nu^2}{1 - \gamma} + \frac{2 \gamma\, ||| \Sigma |||_{\mathrm{op}}}{1 - \gamma}.$$
Moreover, we have $\operatorname{trace}(S)/n = \nu^2 + v^T \Sigma v$, so that by applying Lemma 8, we conclude that
$$\mathbb{P}\Bigl[ |T_1| > \Bigl( \frac{12 \nu^2}{1 - \gamma} + \frac{12\, ||| \Sigma |||_{\mathrm{op}}}{1 - \gamma} \Bigr) t \Bigr] \le 2 \exp\bigl( -n (t - 2/\sqrt{n})^2 / 2 \bigr) + 2 \exp(-n/2), \tag{46}$$
which completes the analysis of this term.

Combining the bounds (44), (45) and (46), we conclude that for all $t > 2/\sqrt{n}$,
$$\mathbb{P}\Bigl[ \frac{|\langle Xv, Wu \rangle|}{n} \ge \frac{20\, (||| \Sigma |||_{\mathrm{op}} + \nu^2)\, t}{1 - \gamma} \Bigr] \le 6 \exp\bigl( -n (t - 2/\sqrt{n})^2 / 2 \bigr) + 6 \exp(-n/2). \tag{47}$$
Setting $t = 10 \sqrt{p/n}$ and combining with the bound (43), we conclude that
$$\mathbb{P}\Bigl[ \Psi(1,1)/n \ge \frac{c_0 (||| \Sigma |||_{\mathrm{op}} + \nu^2)}{1 - \gamma} \sqrt{\frac{p}{n}} \Bigr] \le 8^{2p} \bigl\{ 6 \exp\bigl( -n (t - 2/\sqrt{n})^2 / 2 \bigr) + 6 \exp(-n/2) \bigr\} \le 12 \exp(-p)$$
as long as $n > ((4 \log 8) + 1)\, p$, where $c_0$ is a universal constant.
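The decomposition of $\langle Xv, Wu \rangle / n$ into $T_1 - T_2 - T_3$ is simply the polarization identity $ab = \tfrac{1}{2}(a+b)^2 - \tfrac{1}{2}a^2 - \tfrac{1}{2}b^2$ applied to $a_i = \langle v, X_i \rangle$ and $b_i = \langle u, W_i \rangle$, with each squared term recentered at its mean. The toy check below (ours; it uses i.i.d. rows purely for simplicity, since the identity is algebraic and does not depend on the autoregressive structure) verifies this algebra numerically for one draw.

```python
import numpy as np

# Numerical check of the polarization decomposition <Xv, Wu>/n = T1 - T2 - T3
# used in the proof of Lemma 5.  Dimensions and covariances are illustrative.
rng = np.random.default_rng(3)
p, n, nu = 12, 400, 0.7
Sigma = 0.4 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
L = np.linalg.cholesky(Sigma)

X = rng.standard_normal((n, p)) @ L.T        # rows ~ N(0, Sigma)
W = nu * rng.standard_normal((n, p))         # rows ~ N(0, nu^2 I)
u = rng.standard_normal(p); u /= np.linalg.norm(u)
v = rng.standard_normal(p); v /= np.linalg.norm(v)

a, b = X @ v, W @ u                          # a_i = <v, X_i>,  b_i = <u, W_i>
vSv = v @ Sigma @ v
T1 = np.mean((a + b) ** 2) / 2 - (nu ** 2 + vSv) / 2
T2 = np.mean(b ** 2) / 2 - nu ** 2 / 2
T3 = np.mean(a ** 2) / 2 - vSv / 2

print(np.isclose(np.dot(a, b) / n, T1 - T2 - T3))   # the identity holds exactly
```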
D Proof of Proposition 1

Note that $\|X(\Theta)\|_2 = \sup_{u \in S^{N-1}} \langle u, X(\Theta) \rangle$, and that since the claim (21) is invariant to rescaling, it suffices to prove it for all $\Theta \in \mathbb{R}^{k \times p}$ with $||| \Theta |||_F = 1$. Letting $t \ge 1$ be given, define the random quantity
$$Z^*(t) := \inf_{\Theta \in \mathcal{R}(t)} \sup_{u \in S^{N-1}} \langle u, X(\Theta) \rangle, \qquad \text{where } \mathcal{R}(t) = \bigl\{ \Theta \in \mathbb{R}^{k \times p} \mid ||| \Theta |||_F = 1,\ ||| \Theta |||_1 \le t \bigr\}.$$
In particular, our goal is to prove that for any $t \ge 1$, the lower bound
$$\frac{Z^*(t)}{\sqrt{N}} \ge \frac{1}{4} - \Bigl[ \frac{\sqrt{k} + \sqrt{p}}{\sqrt{N}} \Bigr] t \tag{48}$$
holds with probability at least $1 - c_1 \exp(-c_2 N)$. By a standard peeling argument (see Raskutti et al. [39] for details), this lower bound implies the claim (21).

We establish the lower bound (48) using Gaussian comparison inequalities [29] and concentration of measure (see Lemma 7). For each pair $(u, \Theta) \in S^{N-1} \times \mathcal{R}(t)$, consider the random variable $Z_{u, \Theta} := \langle u, X(\Theta) \rangle$, and note that it is Gaussian with zero mean. For any two pairs $(u, \Theta)$ and $(u', \Theta')$, some calculation yields
$$\mathbb{E}\bigl[ (Z_{u, \Theta} - Z_{u', \Theta'})^2 \bigr] = ||| u \otimes \Theta - u' \otimes \Theta' |||_F^2. \tag{49}$$
We now define a second Gaussian process $\{ Y_{u, \Theta} \mid (u, \Theta) \in S^{N-1} \times \mathcal{R}(t) \}$ via
$$Y_{u, \Theta} := \langle g, u \rangle + \langle\!\langle G, \Theta \rangle\!\rangle,$$
where $g \in \mathbb{R}^N$ and $G \in \mathbb{R}^{k \times p}$ are independent with i.i.d. $N(0,1)$ entries. By construction, $Y_{u, \Theta}$ is zero-mean, and moreover, for any two pairs $(u, \Theta)$ and $(u', \Theta')$, we have
$$\mathbb{E}\bigl[ (Y_{u, \Theta} - Y_{u', \Theta'})^2 \bigr] = \|u - u'\|_2^2 + ||| \Theta - \Theta' |||_F^2. \tag{50}$$
It can be shown that for all pairs $(u, \Theta), (u', \Theta') \in S^{N-1} \times \mathcal{R}(t)$, we have
$$||| u \otimes \Theta - u' \otimes \Theta' |||_F^2 \le \|u - u'\|_2^2 + ||| \Theta - \Theta' |||_F^2. \tag{51}$$
Moreover, equality holds whenever $\Theta = \Theta'$. The conditions of the Gordon-Slepian inequality [29] are therefore satisfied, so that we are guaranteed that
$$\mathbb{E}\Bigl[ \inf_{\Theta \in \mathcal{R}(t)} \|X(\Theta)\|_2 \Bigr] = \mathbb{E}\Bigl[ \inf_{\Theta \in \mathcal{R}(t)} \sup_{u \in S^{N-1}} Z_{u, \Theta} \Bigr] \ge \mathbb{E}\Bigl[ \inf_{\Theta \in \mathcal{R}(t)} \sup_{u \in S^{N-1}} Y_{u, \Theta} \Bigr]. \tag{52}$$
We compute
$$\mathbb{E}\Bigl[ \inf_{\Theta \in \mathcal{R}(t)} \sup_{u \in S^{N-1}} Y_{u, \Theta} \Bigr] = \mathbb{E}\Bigl[ \sup_{u \in S^{N-1}} \langle g, u \rangle \Bigr] + \mathbb{E}\Bigl[ \inf_{\Theta \in \mathcal{R}(t)} \langle\!\langle G, \Theta \rangle\!\rangle \Bigr] = \mathbb{E}[\|g\|_2] - \mathbb{E}\Bigl[ \sup_{\Theta \in \mathcal{R}(t)} \langle\!\langle G, \Theta \rangle\!\rangle \Bigr] \ge \frac{\sqrt{N}}{2} - t\, \mathbb{E}\bigl[ ||| G |||_{\mathrm{op}} \bigr],$$
where the final step uses $\mathbb{E}[\|g\|_2] \ge \sqrt{N}/2$ together with the duality between the nuclear and operator norms, which gives $\sup_{\Theta \in \mathcal{R}(t)} \langle\!\langle G, \Theta \rangle\!\rangle \le t\, ||| G |||_{\mathrm{op}}$. Since $G \in \mathbb{R}^{k \times p}$ has i.i.d. $N(0,1)$ entries, standard random matrix theory [16] implies that $\mathbb{E}[||| G |||_{\mathrm{op}}] \le \sqrt{k} + \sqrt{p}$. Putting together the pieces, we conclude that
$$\mathbb{E}\Bigl[ \inf_{\Theta \in \mathcal{R}(t)} \frac{\|X(\Theta)\|_2}{\sqrt{N}} \Bigr] \ge \frac{1}{2} - \frac{\sqrt{k} + \sqrt{p}}{\sqrt{N}}\, t.$$
Finally, we need to establish sharp concentration around the mean. Note that the function $f(X) := \inf_{\Theta \in \mathcal{R}(t)} \|X(\Theta)\|_2 / \sqrt{N}$ is Lipschitz with constant $1/\sqrt{N}$ as a function of the Gaussian ensemble defining the operator, so that Lemma 7 implies that
$$\mathbb{P}\Bigl[ \inf_{\Theta \in \mathcal{R}(t)} \frac{\|X(\Theta)\|_2}{\sqrt{N}} \le \frac{1}{2} - \frac{\sqrt{k} + \sqrt{p}}{\sqrt{N}}\, t - \delta \Bigr] \le \exp(-N \delta^2 / 2) \qquad \text{for all } \delta > 0.$$
Setting $\delta = 1/4$ yields the claimed bound (48).
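The only random-matrix fact invoked above is the bound $\mathbb{E}[||| G |||_{\mathrm{op}}] \le \sqrt{k} + \sqrt{p}$ for a standard Gaussian matrix $G$ [16]. The snippet below (our illustration; the dimensions and number of trials are arbitrary) checks this expectation by Monte Carlo.

```python
import numpy as np

# Monte Carlo check of E[|||G|||_op] <= sqrt(k) + sqrt(p) for G with i.i.d. N(0,1) entries,
# the random-matrix bound used in the proof of Proposition 1.
rng = np.random.default_rng(4)
k, p, trials = 30, 50, 500

op_norms = [np.linalg.norm(rng.standard_normal((k, p)), 2) for _ in range(trials)]
print("Monte Carlo E|||G|||_op ~=", np.mean(op_norms))
print("sqrt(k) + sqrt(p)       =", np.sqrt(k) + np.sqrt(p))
```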
E Some useful concentration results

The following lemma is classical [29, 32], and yields sharp concentration of a Lipschitz function of Gaussian random variables around its mean.
Lemma 7.
Let $X \in \mathbb{R}^n$ have i.i.d. $N(0, 1)$ entries, and let $f : \mathbb{R}^n \to \mathbb{R}$ be Lipschitz with constant $L$ (i.e., $|f(x) - f(y)| \le L \|x - y\|_2$ for all $x, y \in \mathbb{R}^n$). Then for all $t > 0$, we have
$$\mathbb{P}\bigl[ |f(X) - \mathbb{E} f(X)| > t \bigr] \le 2 \exp\Bigl( -\frac{t^2}{2 L^2} \Bigr).$$
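As a quick numerical illustration of Lemma 7 (ours; the choice of function, dimension, and threshold is arbitrary), the simulation below takes the 1-Lipschitz function $f(x) = \|x\|_2$ and compares the empirical tail of $|f(X) - \mathbb{E} f(X)|$ with the bound $2\exp(-t^2/2)$.

```python
import numpy as np

# Empirical illustration of Lemma 7 with the 1-Lipschitz function f(x) = ||x||_2.
rng = np.random.default_rng(5)
n, trials, t = 100, 20000, 1.5

samples = np.linalg.norm(rng.standard_normal((trials, n)), axis=1)
deviation = np.abs(samples - samples.mean())   # |f(X) - E f(X)|, mean estimated empirically
print("empirical P[|f(X) - Ef(X)| > t] =", np.mean(deviation > t))
print("Lemma 7 bound 2*exp(-t^2/2)     =", 2 * np.exp(-t ** 2 / 2))
```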
By exploiting this lemma, we can prove the following result, which yields concentration of the squared $\ell_2$-norm of an arbitrary Gaussian vector:

Lemma 8.
Given a Gaussian random vector $Y \sim N(0, Q)$, for all $t > 2/\sqrt{n}$, we have
$$\mathbb{P}\Bigl[ \frac{1}{n} \bigl| \|Y\|_2^2 - \operatorname{trace}(Q) \bigr| > 5 t\, ||| Q |||_{\mathrm{op}} \Bigr] \le 2 \exp\Bigl( -\frac{n (t - 2/\sqrt{n})^2}{2} \Bigr) + 2 \exp(-n/2). \tag{53}$$

Proof.
Let $\sqrt{Q}$ be the symmetric matrix square root of $Q$, and consider the function $f(x) = \|\sqrt{Q}\, x\|_2 / \sqrt{n}$. Since it is Lipschitz with constant $||| \sqrt{Q} |||_{\mathrm{op}} / \sqrt{n} = \sqrt{||| Q |||_{\mathrm{op}} / n}$, Lemma 7 implies that
$$\mathbb{P}\Bigl[ \bigl| \|\sqrt{Q}\, X\|_2 - \mathbb{E} \|\sqrt{Q}\, X\|_2 \bigr| > \sqrt{n}\, \delta \Bigr] \le 2 \exp\Bigl( -\frac{n \delta^2}{2\, ||| Q |||_{\mathrm{op}}} \Bigr) \qquad \text{for all } \delta > 0. \tag{54}$$
By integrating this tail bound, we find that the variable $Z := \|\sqrt{Q}\, X\|_2 / \sqrt{n}$ satisfies the bound $\operatorname{var}(Z) \le 4\, ||| Q |||_{\mathrm{op}} / n$, and hence we conclude that
$$\Bigl| \sqrt{\mathbb{E}[Z^2]} - \mathbb{E}[Z] \Bigr| = \Bigl| \sqrt{\operatorname{trace}(Q)/n} - \mathbb{E}\bigl[ \|\sqrt{Q}\, X\|_2 / \sqrt{n} \bigr] \Bigr| \le \sqrt{\operatorname{var}(Z)} \le \frac{2 \sqrt{||| Q |||_{\mathrm{op}}}}{\sqrt{n}}. \tag{55}$$
Combining this bound with the tail bound (54), we conclude that
$$\mathbb{P}\Bigl[ \frac{1}{\sqrt{n}} \bigl| \|\sqrt{Q}\, X\|_2 - \sqrt{\operatorname{trace}(Q)} \bigr| > \delta + 2 \sqrt{\frac{||| Q |||_{\mathrm{op}}}{n}} \Bigr] \le 2 \exp\Bigl( -\frac{n \delta^2}{2\, ||| Q |||_{\mathrm{op}}} \Bigr) \qquad \text{for all } \delta > 0. \tag{56}$$
Setting $\delta = (t - 2/\sqrt{n}) \sqrt{||| Q |||_{\mathrm{op}}}$ in the bound (56) yields that
$$\mathbb{P}\Bigl[ \frac{1}{\sqrt{n}} \bigl| \|\sqrt{Q}\, X\|_2 - \sqrt{\operatorname{trace}(Q)} \bigr| > t \sqrt{||| Q |||_{\mathrm{op}}} \Bigr] \le 2 \exp\Bigl( -\frac{n (t - 2/\sqrt{n})^2}{2} \Bigr). \tag{57}$$
Similarly, setting $\delta = \sqrt{||| Q |||_{\mathrm{op}}}$ in the tail bound (56) yields that with probability greater than $1 - 2\exp(-n/2)$, we have
$$\Bigl| \frac{\|Y\|_2}{\sqrt{n}} + \sqrt{\frac{\operatorname{trace}(Q)}{n}} \Bigr| \le 2 \sqrt{\frac{\operatorname{trace}(Q)}{n}} + 3 \sqrt{||| Q |||_{\mathrm{op}}} \le 5 \sqrt{||| Q |||_{\mathrm{op}}}. \tag{58}$$
Using these two bounds, we obtain
$$\Bigl| \frac{\|Y\|_2^2}{n} - \frac{\operatorname{trace}(Q)}{n} \Bigr| = \Bigl| \frac{\|Y\|_2}{\sqrt{n}} - \sqrt{\frac{\operatorname{trace}(Q)}{n}} \Bigr| \cdot \Bigl| \frac{\|Y\|_2}{\sqrt{n}} + \sqrt{\frac{\operatorname{trace}(Q)}{n}} \Bigr| \le 5 t\, ||| Q |||_{\mathrm{op}}$$
with the claimed probability.
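To see Lemma 8 in action, the following sketch (ours; the covariance $Q$ and the choice of $t$ are arbitrary, and the deviation level matches the statement as reconstructed above) draws $Y \sim N(0, Q)$ repeatedly and compares the empirical tail of $\tfrac{1}{n}|\|Y\|_2^2 - \operatorname{trace}(Q)|$ with the bound in (53).

```python
import numpy as np

# Empirical check of the chi^2-type concentration in Lemma 8 for a correlated Gaussian vector.
rng = np.random.default_rng(6)
n, trials = 200, 5000
Q = 0.6 ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))   # an arbitrary covariance
L = np.linalg.cholesky(Q)
q_op, tr_Q = np.linalg.norm(Q, 2), np.trace(Q)

t = 4.0 / np.sqrt(n)                                   # any t > 2/sqrt(n)
Y = rng.standard_normal((trials, n)) @ L.T             # rows ~ N(0, Q)
dev = np.abs((Y ** 2).sum(axis=1) - tr_Q) / n

print("empirical P[(1/n)|  ||Y||^2 - tr(Q) | > 5 t |||Q|||_op] =", np.mean(dev > 5 * t * q_op))
print("bound 2exp(-n(t - 2/sqrt(n))^2/2) + 2exp(-n/2)          =",
      2 * np.exp(-n * (t - 2 / np.sqrt(n)) ** 2 / 2) + 2 * np.exp(-n / 2))
```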
References

[1] J. Abernethy, F. Bach, T. Evgeniou, and J. Stein. Low-rank matrix factorization with attributes. Technical Report N-24/06/MM, Ecole des Mines de Paris, France, September 2006.
[2] A. A. Amini and M. J. Wainwright. High-dimensional analysis of semidefinite relaxations for sparse principal component analysis. Annals of Statistics, 37(5B):2877–2921, 2009.
[3] C. W. Anderson, E. A. Stolz, and S. Shamsunder. Multivariate autoregressive models for classification of spontaneous electroencephalogram during mental tasks. IEEE Transactions on Biomedical Engineering, 45(3):277, 1998.
[4] T. W. Anderson. The Statistical Analysis of Time Series. Wiley Classics Library. John Wiley and Sons, New York, 1971.
[5] A. Argyriou, T. Evgeniou, and M. Pontil. Multi-task feature learning. In Neural Information Processing Systems (NIPS), Vancouver, Canada, December 2006.
[6] F. Bach. Consistency of trace norm minimization. Journal of Machine Learning Research, 9:1019–1048, June 2008.
[7] P. Bickel and E. Levina. Covariance estimation by thresholding. Annals of Statistics, 2008. To appear.
[8] P. Bickel and E. Levina. Regularized estimation of large covariance matrices. Annals of Statistics, 36(1):199–227, 2008.
[9] P. Bickel, Y. Ritov, and A. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. Annals of Statistics, 2009. To appear.
[10] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, UK, 2004.
[11] E. N. Brown, R. E. Kass, and P. P. Mitra. Multiple neural spike train data analysis: state-of-the-art and future challenges. Nature Neuroscience, 7(5), May 2004.
[12] E. Candes and T. Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203–4215, December 2005.
[13] E. J. Candès and B. Recht. Exact matrix completion via convex optimization. CoRR, abs/0805.4471, 2008.
[14] S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20(1):33–61, 1998.
[15] A. Cohen, W. Dahmen, and R. DeVore. Compressed sensing and best k-term approximation. Journal of the American Mathematical Society, 22(1):211–231, July 2008.
[16] K. R. Davidson and S. J. Szarek. Local operator theory, random matrices, and Banach spaces. In Handbook of Banach Spaces, volume 1, pages 317–336. Elsevier, Amsterdam, NL, 2001.
[17] D. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, April 2006.
[18] N. El-Karoui. Operator norm consistent estimation of large dimensional sparse covariance matrices. Technical Report 734, UC Berkeley, Department of Statistics, 2007.
[19] M. Fazel. Matrix Rank Minimization with Applications. PhD thesis, Stanford, 2002. Available online: http://faculty.washington.edu/mfazel/thesis-final.pdf.
[20] J. Fisher and M. J. Black. Motor cortical decoding using an autoregressive moving average model. IEEE Engineering in Medicine and Biology Society, pages 1469–1472, September 2005.
[21] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical Lasso. Biostatistics, 2007.
[22] L. Harrison, W. D. Penny, and K. Friston. Multivariate autoregressive modeling of fMRI time series. NeuroImage, 19:1477–1491, 2003.
[23] R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, Cambridge, 1985.
[24] R. A. Horn and C. R. Johnson. Topics in Matrix Analysis. Cambridge University Press, Cambridge, 1991.
[25] J. Huang and T. Zhang. The benefit of group sparsity. Technical Report arXiv:0901.2962, Rutgers University, January 2009.
[26] S. Ji and J. Ye. An accelerated gradient method for trace norm minimization. In International Conference on Machine Learning (ICML), New York, NY, USA, 2009. ACM.
[27] I. M. Johnstone. On the distribution of the largest eigenvalue in principal components analysis. Annals of Statistics, 29(2):295–327, April 2001.
[28] R. H. Keshavan, A. Montanari, and S. Oh. Matrix completion from noisy entries. Preprint available at http://arxiv.org/abs/0906.2027v1, 2009.
[29] M. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer-Verlag, New York, NY, 1991.
[30] K. Lounici, M. Pontil, A. B. Tsybakov, and S. van de Geer. Taking advantage of sparsity in multi-task learning. Technical Report arXiv:0903.1468, ETH Zurich, March 2009.
[31] H. Lütkepohl. New Introduction to Multiple Time Series Analysis. Springer, New York, 2006.
[32] P. Massart. Concentration Inequalities and Model Selection. Ecole d'Eté de Probabilités de Saint-Flour. Springer, New York, 2003.
[33] J. Matousek. Lectures on Discrete Geometry. Springer-Verlag, New York, 2002.
[34] N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the Lasso. Annals of Statistics, 34:1436–1462, 2006.
[35] S. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. In Proceedings of the NIPS Conference, Vancouver, Canada, December 2009.
[36] Y. Nesterov. Gradient methods for minimizing composite objective function. Technical Report 2007/76, CORE, Université catholique de Louvain, 2007.
[37] G. Obozinski, M. J. Wainwright, and M. I. Jordan. Union support recovery in high-dimensional multivariate regression. Annals of Statistics, to appear. Presented in part at the NIPS 2008 conference.
[38] D. Paul and I. Johnstone. Augmented sparse principal component analysis for high-dimensional data. Technical report, UC Davis, January 2008.
[39] G. Raskutti, M. J. Wainwright, and B. Yu. Minimax rates of estimation for high-dimensional linear regression over ℓ_q-balls. Technical Report arXiv:0910.2042, UC Berkeley, Department of Statistics, 2009. Presented in part at the Allerton Conference, September 2009.
[40] P. Ravikumar, M. J. Wainwright, G. Raskutti, and B. Yu. High-dimensional covariance estimation: Convergence rates of ℓ_1-regularized log-determinant divergence. Technical report, Department of Statistics, UC Berkeley, September 2008. Presented in part at NIPS 2008.
[41] B. Recht, M. Fazel, and P. A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 2007. To appear.
[42] A. J. Rothman, P. J. Bickel, E. Levina, and J. Zhu. Sparse permutation invariant covariance estimation. Electronic Journal of Statistics, 2009.
[43] N. Srebro, J. Rennie, and T. Jaakkola. Maximum-margin matrix factorization. In Proceedings of the NIPS Conference, Vancouver, Canada, 2005.
[44] N. Srebro, J. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In Neural Information Processing Systems (NIPS), Vancouver, Canada, December 2004.
[45] R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.
[46] L. Vandenberghe and S. Boyd. Semidefinite programming. SIAM Review, 38:49–95, 1996.
[47] M. J. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ_1-constrained quadratic programming (Lasso). IEEE Transactions on Information Theory, 55:2183–2202, May 2009.
[48] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68(1):49–67, 2006.
[49] M. Yuan and Y. Lin. Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19–35, 2007.