Guaranteed Minimum-Rank Solutions of Linear Matrix Equations via Nuclear Norm Minimization
Benjamin Recht∗   Maryam Fazel†   Pablo A. Parrilo‡

February 3, 2008

∗ Center for the Mathematics of Information, California Institute of Technology
† Control and Dynamical Systems, California Institute of Technology
‡ Laboratory for Information and Decision Systems, Massachusetts Institute of Technology
Abstract
The affine rank minimization problem consists of finding a matrix of minimum rank that satisfies a given system of linear equality constraints. Such problems have appeared in the literature of a diverse set of fields including system identification and control, Euclidean embedding, and collaborative filtering. Although specific instances can often be solved with specialized algorithms, the general affine rank minimization problem is NP-hard, because it contains vector cardinality minimization as a special case.

In this paper, we show that if a certain restricted isometry property holds for the linear transformation defining the constraints, the minimum rank solution can be recovered by solving a convex optimization problem, namely the minimization of the nuclear norm over the given affine space. We present several random ensembles of equations where the restricted isometry property holds with overwhelming probability, provided the codimension of the subspace is Ω(r(m + n) log mn), where m, n are the dimensions of the matrix, and r is its rank.

The techniques used in our analysis have strong parallels in the compressed sensing framework. We discuss how affine rank minimization generalizes this pre-existing concept and outline a dictionary relating concepts from cardinality minimization to those of rank minimization. We also discuss several algorithmic approaches to solving the norm minimization relaxations, and illustrate our results with numerical examples.

Keywords. rank, convex optimization, matrix norms, random matrices, compressed sensing, semidefinite programming.
Notions such as order, complexity, or dimensionality can often be expressed by means of the rank of an appropriate matrix. For example, a low-rank matrix could correspond to a low-degree statistical model for a random process (e.g., factor analysis), a low-order realization of a linear system [28], a low-order controller for a plant [22], or a low-dimensional embedding of data in Euclidean space [34]. If the set of feasible models or designs is affine in the matrix variable, choosing the simplest model can be cast as an affine rank minimization problem,

minimize    rank(X)
subject to  A(X) = b,          (1.1)

where X ∈ R^{m×n} is the decision variable, and the linear map A : R^{m×n} → R^p and vector b ∈ R^p are given. In certain instances with very special structure, the rank minimization problem can be solved by using the singular value decomposition, or can be exactly reduced to the solution of linear systems [37, 41]. In general, however, problem (1.1) is a challenging nonconvex optimization problem for which all known finite time algorithms have at least doubly exponential running times in both theory and practice. For the general case, a variety of heuristic algorithms based on local optimization, including alternating projections [31] and alternating LMIs [45], have been proposed.

A recent heuristic introduced in [27] minimizes the nuclear norm, or the sum of the singular values of the matrix, over the affine subset. The nuclear norm is a convex function, can be optimized efficiently, and is the best convex approximation of the rank function over the unit ball of matrices with norm less than one. When the matrix variable is symmetric and positive semidefinite, this heuristic is equivalent to the trace heuristic often used by the control community (see, e.g., [5, 37]). The nuclear norm heuristic has been observed to produce very low-rank solutions in practice, but a theoretical characterization of when it produces the minimum rank solution has not been previously available. This paper provides the first such mathematical characterization.

Our work is built upon a large body of literature on a related optimization problem. When the matrix variable is constrained to be diagonal, the affine rank minimization problem reduces to finding the sparsest vector in an affine subspace. This problem is commonly referred to as cardinality minimization, since we seek the vector whose support has the smallest cardinality, and is known to be NP-hard [39]. For diagonal matrices, the sum of the singular values is equal to the sum of the absolute values (i.e., the ℓ1 norm) of the diagonal elements. Minimization of the ℓ1 norm is a well-known heuristic for the cardinality minimization problem, and stunning results pioneered by Candès and Tao [10] and Donoho [17] have characterized a vast set of instances for which the ℓ1 heuristic can be a priori guaranteed to yield the optimal solution. These techniques provide the foundations of the recently developed compressed sensing or compressive sampling frameworks for measurement, coding, and signal estimation.
As has been shown by a number of research groups(e.g., [4, 12, 13, 14]), the (cid:96) heuristic for cardinality minimization provably recovers the sparsestsolution whenever the sensing matrix has certain “basis incoherence” properties, and in particular,when it is randomly chosen according to certain specific ensembles.The fact that the (cid:96) heuristic is a special case of the nuclear norm heuristic suggests that theseresults from the compressed sensing literature might be extended to provide guarantees about thenuclear norm heuristic for the more general rank minimization problem. In this paper, we showthat this is indeed the case, and the parallels are surprisingly strong. Following the program laidout in the work of Cand`es and Tao, our main contribution is the development of a restrictedisometry property (RIP), under which the nuclear norm heuristic can be guaranteed to producethe minimum-rank solution. Furthermore, as in the case for the (cid:96) heuristic, we provide severalspecific examples of matrix ensembles for which RIP holds with overwhelming probability. Ourresults considerably extend the compressed sensing machinery in a so far undeveloped direction,by allowing a much more general notion of parsimonious models that rely on low-rank assumptionsinstead of cardinality restrictions.To make the parallels as clear as possible, we begin by establishing a dictionary between thematrix rank and nuclear norm minimization problems and the vector sparsity and (cid:96) norm problemsin Section 2. In the process of this discussion, we present a review of many useful properties ofthe matrices and matrix norms necessary for the main results. We then generalize the notion ofRestricted Isometry to matrices in Section 3 and show that when linear mappings are Restricted2sometries, recovering low-rank solutions of underdetermined systems can be achieved by nuclearnorm minimization. In Section 4, we present several families of random linear maps that arerestricted isometries with overwhelming probability when the dimensions are sufficiently large.In Section 5, we briefly discuss three different algorithms designed for solving the nuclear normminimization problem and their relative strengths and weaknesses: interior point methods, gradientprojection methods, and a low-rank factorization technique. In Section 6, we demonstrate that inpractice nuclear-norm minimization recovers the lowest rank solutions of affine sets with even fewerconstraints than those guaranteed by our mathematical analysis. Finally, in Section 7, we list anumber of possible directions for future research. As in the case of compressed sensing, the conditions we derive to guarantee properties about thenuclear norm heuristic are deterministic, but they are at least as difficult to check as solving therank minimization problem itself. We are only able to guarantee that the nuclear norm heuristicrecovers the minimum rank solution of A ( X ) = b when A is sampled from specific ensembles ofrandom maps. The constraints appearing in many of the applications mentioned above, such aslow-order control system design, are typically not random at all and have structured demandsaccording to the specifics of the design problem. It thus behooves us to present several exampleswhere random constraints manifest themselves in practical scenarios for which no practical solutionprocedure is known. Minimum order linear system realization
Rank minimization forms the basis of many model reduction and low-order system identification problems for linear time-invariant (LTI) systems. The following example illustrates how random constraints might arise in this context. Consider the problem of finding the minimum order discrete-time LTI system that is consistent with a set of time-domain observations. In particular, suppose our observations are the system output sampled at a fixed time N, after a random Gaussian input signal is applied from t = 0 to t = N. Suppose we make such measurements for p different input signals, that is, we observe y_i(N) = Σ_{t=0}^{N} a_i(N − t) h(t) for i = 1, …, p, where a_i, the i-th input signal, is a zero-mean Gaussian random variable with the same variance for t = 0, …, N, and h(t) denotes the impulse response. We can write this compactly as y = Ah, where h = [h(0), …, h(N)]′, and A_{ij} = a_i(N − j).

From linear system theory, the order of the minimal realization for such a system is given by the rank of the following Hankel matrix (see, e.g., [29, 46])

hank(h) := [ h(0)    h(1)      ···   h(N)
             h(1)    h(2)      ···   h(N+1)
              ⋮        ⋮               ⋮
             h(N)    h(N+1)    ···   h(2N) ].

Therefore the problem can be expressed as

minimize    rank(hank(h))
subject to  Ah = y,

where the optimization variables are h(0), …, h(2N), and the matrix A consists of i.i.d. zero-mean Gaussian entries.

Low-Rank Matrix Completion

In the matrix completion problem, where we are given a random subset of entries of a matrix, we would like to fill in the missing entries such that the resulting matrix has the lowest possible rank. This problem arises in machine learning scenarios where we are given partially observed examples of a process with a low-rank covariance matrix and would like to estimate the missing data. A typical situation where the hidden matrix is low-rank is when the columns are i.i.d. samples of a random process with low-rank covariance. Such models are ubiquitous in Factor Analysis, Collaborative Filtering, and Latent Semantic Indexing [42, 47]. In many of these settings, some prior probability distribution (such as a Bernoulli model or uniform distribution on subsets) is assumed to generate the set of available entries.

Suppose we are presented with a set of triples (I(i), J(i), S(i)) for i = 1, …, p and wish to find the matrix with S(i) in the entry corresponding to row I(i) and column J(i) for all i. The matrix completion problem seeks to solve

minimize_Y   rank(Y)
subject to   Y_{I(i),J(i)} = S(i),   i = 1, …, p,

which is a special case of the affine rank minimization problem.
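To make the measurement model of the system realization example concrete, the short sketch below (not from the paper; sizes and variable names are illustrative) builds the Hankel matrix hank(h) from an impulse response and forms the measurements y = Ah with i.i.d. Gaussian inputs. The rank of hank(h) equals the order of a minimal realization.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 10, 15                      # horizon and number of input experiments

# A low-order impulse response: h(t) is a sum of two decaying exponentials,
# so hank(h) has rank 2 (a second-order system).
t = np.arange(2 * N + 1)
h = 0.9 ** t + (-0.5) ** t

# Hankel matrix hank(h)_{ij} = h(i + j), for i, j = 0, ..., N.
H = np.array([[h[i + j] for j in range(N + 1)] for i in range(N + 1)])
print("order of minimal realization:", np.linalg.matrix_rank(H, tol=1e-8))

# Random Gaussian inputs a_i(t); the i-th output sample at time N is
# y_i = sum_t a_i(N - t) h(t), i.e. y = A h with A_{ij} = a_i(N - j).
# Only h(0), ..., h(N) enter these particular measurements.
a = rng.standard_normal((p, N + 1))
A = a[:, ::-1]                     # A[i, j] = a_i(N - j)
y = A @ h[: N + 1]
print("measurement vector y has shape", y.shape)
```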
Low-dimensional Euclidean embedding problems

A problem that arises in a variety of fields is the determination of configurations of points in low-dimensional Euclidean spaces, subject to some given distance information. In Multi-Dimensional Scaling (MDS), such problems occur in extracting the underlying geometric structure of distance data. In psychometrics, the information about inter-point distances is usually gathered through a set of experiments where subjects are asked to make quantitative (in metric MDS) or qualitative (in non-metric MDS) comparisons of objects. In computational chemistry, they come up in inferring the three-dimensional structure of a molecule (molecular conformation) from information about interatomic distances [52].

A symmetric matrix D ∈ S^n is called a Euclidean distance matrix (EDM) if there exist points x_1, …, x_n in R^d such that D_{ij} = ‖x_i − x_j‖². Let V := I_n − (1/n) 1 1′ be the projection matrix onto the hyperplane {v ∈ R^n : 1′v = 0}. A classical result by Schoenberg states that D is a Euclidean distance matrix of n points in R^d if and only if D_{ii} = 0, the matrix V D V is negative semidefinite, and rank(V D V) is less than or equal to d [44]. If the matrix D is known exactly, the corresponding configuration of points (up to a unitary transform) is obtained by simply taking a matrix square root of −(1/2) V D V. However, in many cases, only a random sampling collection of the distances is available. The problem of finding a valid EDM consistent with the known inter-point distances and with the smallest embedding dimension can be expressed as the rank optimization problem

minimize    rank(V D V)
subject to  V D V ⪯ 0,   A(D) = b,

where A : S^n → R^p is a random sampling operator as discussed in the matrix completion problem.

This problem involves a Linear Matrix Inequality (LMI) and appears to be more general than the equality constrained rank minimization problem. However, general LMIs can equivalently be expressed as rank constraints on an appropriately defined block matrix. The rank of a block symmetric matrix is equal to the rank of a diagonal block plus the rank of its Schur complement (see, e.g., [33]). Thus, for a function f that maps matrices into q × q symmetric matrices, the condition that f(X) is positive semidefinite can be equivalently expressed through a rank constraint as

f(X) ⪰ 0   ⟺   rank( [ I_q   B ;  B′   f(X) ] ) ≤ q,   for some B ∈ R^{q×q}.

That is, if there exists a matrix B satisfying the inequality above, then f(X) = B′B ⪰ 0. Using this equivalent representation allows us to rewrite (1.1) as an affine rank minimization problem.
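A small numerical illustration of Schoenberg's criterion may be helpful (this sketch is not part of the paper; dimensions are arbitrary). Given points in R^d, the matrix V D V built from the squared-distance matrix D is negative semidefinite, its rank recovers the embedding dimension, and a square root of −(1/2) V D V recovers the configuration up to a rigid transformation.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 8, 3
X = rng.standard_normal((n, d))              # n points in R^d

# Squared Euclidean distance matrix D_ij = ||x_i - x_j||^2.
G0 = X @ X.T
sq = np.diag(G0)
D = sq[:, None] + sq[None, :] - 2 * G0

# Centering projection V = I - (1/n) 1 1'.
V = np.eye(n) - np.ones((n, n)) / n
VDV = V @ D @ V

eigvals = np.linalg.eigvalsh(VDV)
print("V D V negative semidefinite:", eigvals.max() <= 1e-9)
print("embedding dimension rank(V D V):", np.linalg.matrix_rank(VDV, tol=1e-9))

# Recover a configuration from the Gram matrix G = -(1/2) V D V.
G = -0.5 * VDV
w, U = np.linalg.eigh(G)
Y = U[:, -d:] * np.sqrt(np.maximum(w[-d:], 0.0))   # points up to rotation/translation
D_rec = np.square(Y[:, None, :] - Y[None, :, :]).sum(-1)
print("distances reproduced:", np.allclose(D_rec, D, atol=1e-8))
```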
Image Compression
A simple and well-known method to compress two-dimensional images can be obtained by using the singular value decomposition (e.g., [3]). The basic idea is to associate to the given grayscale image a rectangular matrix M, with the entries M_{ij} corresponding to the gray level of the (i, j) pixel. The best rank-k approximation of M is given by

X* := arg min_{rank(X) ≤ k} ‖M − X‖,

where ‖·‖ is any unitarily invariant norm. By the classical Eckart-Young-Mirsky theorem ([20, 38]), the optimal approximant is given by a truncated singular value decomposition of M, i.e., if M = U Σ V^T, then X* = U Σ_k V^T, where the first k diagonal entries of Σ_k are the largest k singular values, and the rest of the entries are zero. If for a given rank k, the approximation error ‖M − X*‖ is small enough, then the amount of data needed to encode the information about the image is k(m + n − k) real numbers, which can be much smaller than the mn required to transmit the values of all the entries.

Consider a given image, whose associated matrix M has low rank, or can be well-approximated by a low-rank matrix. As proposed by Wakin et al. [54], a single-pixel camera would ideally produce measurements that are random linear combinations of all the pixels of the given image. Under this situation, the image reconstruction problem boils down exactly to affine rank minimization, where the constraints are given by the random linear functionals.

It should be remarked that the simple SVD image compression scheme described has certain deficiencies that more sophisticated techniques do not share (in particular, the lack of invariance of the description length under rotations). Nevertheless, due to its simplicity and relatively good practical performance, this method is particularly popular in introductory treatments and numerical linear algebra textbooks.

As discussed above, when the matrix variable is constrained to be diagonal, the affine rank minimization problem (1.1) reduces to the cardinality minimization problem of finding the element in the affine space with the fewest number of nonzero components. In this section we will establish a dictionary between the concepts of rank and cardinality minimization. The main elements of this correspondence are outlined in Table 1. With these elements in place, the existing proofs of sparsity recovery provide a template for the more general case of low-rank recovery.

In establishing our dictionary, we will provide a review of useful facts regarding matrix norms and their characterization as convex optimization problems. We will show how computing both the operator norm and the nuclear norm of a matrix can be cast as semidefinite programming problems.
Parsimony concept          cardinality              rank
Hilbert space norm         Euclidean                Frobenius
sparsity inducing norm     ℓ1                       nuclear
dual norm                  ℓ∞                       operator
additivity                 disjoint support         orthogonal row and column spaces
convex optimization        linear programming       semidefinite programming

Table 1: A dictionary relating the concepts of cardinality and rank minimization.

We also establish the suitable optimality conditions for the minimization of the nuclear norm under affine equality constraints, the main convex optimization problem studied in this article. Our discussion of matrix norms will mostly follow the discussion in [27, 53] where extensive lists of references are provided.
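The correspondence in Table 1 can be checked directly in the diagonal case, where each matrix quantity reduces to its vector counterpart on the diagonal. The following quick sketch (not from the paper) verifies this numerically:

```python
import numpy as np

x = np.array([3.0, -1.0, 0.0, 2.0])
X = np.diag(x)
sigma = np.linalg.svd(X, compute_uv=False)   # singular values are |x_i|, sorted

# nuclear norm <-> l1 norm, operator norm <-> l_infinity norm,
# Frobenius norm <-> Euclidean norm, rank <-> cardinality
print(np.isclose(sigma.sum(), np.abs(x).sum()))            # ||X||_* = ||x||_1
print(np.isclose(sigma.max(), np.abs(x).max()))            # ||X||   = ||x||_inf
print(np.isclose(np.linalg.norm(X, "fro"), np.linalg.norm(x)))
print(np.linalg.matrix_rank(X) == np.count_nonzero(x))     # rank = cardinality
```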
Matrix vs. Vector Norms
The three vector norms that play significant roles in the compressed sensing framework are the ℓ1, ℓ2, and ℓ∞ norms, denoted by ‖x‖_1, ‖x‖_2 and ‖x‖_∞ respectively. These norms have natural generalizations to matrices, inheriting many appealing properties from the vector case. In particular, there is a parallel duality structure.

For a rectangular matrix X ∈ R^{m×n}, σ_i(X) denotes the i-th largest singular value of X and is equal to the square-root of the i-th largest eigenvalue of XX′. The rank of X will usually be denoted by r, and is equal to the number of nonzero singular values. For matrices X and Y of the same dimensions, we define the inner product in R^{m×n} as ⟨X, Y⟩ := Tr(X′Y) = Σ_{i=1}^{m} Σ_{j=1}^{n} X_{ij} Y_{ij}. The norm associated with this inner product is called the Frobenius (or Hilbert-Schmidt) norm ‖·‖_F. The Frobenius norm is also equal to the Euclidean, or ℓ2, norm of the vector of singular values, i.e.,

‖X‖_F := √⟨X, X⟩ = √Tr(X′X) = ( Σ_{i=1}^{m} Σ_{j=1}^{n} X_{ij}² )^{1/2} = ( Σ_{i=1}^{r} σ_i² )^{1/2}.

The operator norm (or induced 2-norm) of a matrix is equal to its largest singular value (i.e., the ℓ∞ norm of the singular values):

‖X‖ := σ_1(X).

The nuclear norm of a matrix is equal to the sum of its singular values, i.e.,

‖X‖_* := Σ_{i=1}^{r} σ_i(X),

and is alternatively known by several other names including the Schatten 1-norm, the Ky Fan r-norm, and the trace class norm. Since the singular values are all positive, the nuclear norm is equal to the ℓ1 norm of the vector of singular values. These three norms are related by the following inequalities which hold for any matrix X of rank at most r:

‖X‖ ≤ ‖X‖_F ≤ ‖X‖_* ≤ √r ‖X‖_F ≤ r ‖X‖.          (2.1)

Dual norms

For any given norm ‖·‖ in an inner product space, there exists a dual norm ‖·‖_d defined as

‖X‖_d := max_Y { ⟨X, Y⟩ : ‖Y‖ ≤ 1 }.          (2.2)

Furthermore, the norm dual to the norm ‖·‖_d is again the original norm ‖·‖.

In the case of vector norms in R^n, it is well-known that the dual norm of the ℓ_p norm (with 1 < p < ∞) is the ℓ_q norm, where 1/p + 1/q = 1. This fact is essentially equivalent to Hölder's inequality. Similarly, the dual norm of the ℓ∞ norm of a vector is the ℓ1 norm. These facts also extend to the matrix norms we have defined. For instance, the dual norm of the Frobenius norm is the Frobenius norm. This can be verified by simple calculus (or Cauchy-Schwarz), since

max_Y { Tr(X′Y) : Tr(Y′Y) ≤ 1 }

is equal to ‖X‖_F, with the maximizing Y being equal to X/‖X‖_F. Similarly, as shown below, the dual norm of the operator norm is the nuclear norm. The proof of this fact will also allow us to present variational characterizations of each of these norms as semidefinite programs.
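As a sanity check on these definitions and on the chain of inequalities (2.1), the following sketch (not part of the paper; sizes are arbitrary) computes the three norms of a random low-rank matrix directly from its singular values:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, r = 30, 20, 4
X = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))   # rank-r matrix

sigma = np.linalg.svd(X, compute_uv=False)
op_norm = sigma[0]                        # operator norm: largest singular value
fro_norm = np.sqrt((sigma ** 2).sum())    # Frobenius norm: l2 norm of singular values
nuc_norm = sigma.sum()                    # nuclear norm: l1 norm of singular values

assert np.isclose(fro_norm, np.linalg.norm(X, "fro"))
# Chain (2.1): ||X|| <= ||X||_F <= ||X||_* <= sqrt(r) ||X||_F <= r ||X||
print(op_norm <= fro_norm <= nuc_norm <= np.sqrt(r) * fro_norm <= r * op_norm)
```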
Proposition 2.1 The dual norm of the operator norm ‖·‖ in R^{m×n} is the nuclear norm ‖·‖_*.

Proof
First consider an m × n matrix Z. The fact that Z has operator norm less than t can be expressed as a linear matrix inequality:

‖Z‖ ≤ t   ⟺   t² I_m − ZZ′ ⪰ 0   ⟺   [ tI_m   Z ;  Z′   tI_n ] ⪰ 0,          (2.3)

where the last implication follows from a Schur complement argument. As a consequence, we can give a semidefinite optimization characterization of the operator norm, namely

‖Z‖ = min_t   t   s.t.   [ tI_m   Z ;  Z′   tI_n ] ⪰ 0.          (2.4)

Now let X = UΣV′ be a singular value decomposition of an m × n matrix X, where U is an m × r matrix, V is an n × r matrix, Σ is an r × r diagonal matrix and r is the rank of X. Let Y := UV′. Then ‖Y‖ = 1 and Tr(XY′) = Σ_{i=1}^{r} σ_i(X) = ‖X‖_*, and hence the dual norm is greater than or equal to the nuclear norm.

To provide an upper bound on the dual norm, we appeal to semidefinite programming duality. From the characterization in (2.3), the optimization problem

max_Y { ⟨X, Y⟩ : ‖Y‖ ≤ 1 }

is equivalent to the semidefinite program

max_Y   Tr(X′Y)
s.t.    [ I_m   Y ;  Y′   I_n ] ⪰ 0.          (2.5)

The dual of this SDP (after an inconsequential rescaling) is given by

min_{W_1, W_2}   (1/2)(Tr(W_1) + Tr(W_2))
s.t.             [ W_1   X ;  X′   W_2 ] ⪰ 0.          (2.6)

Set W_1 := UΣU′ and W_2 := VΣV′. Then the triple (W_1, W_2, X) is feasible for (2.6) since

[ W_1   X ;  X′   W_2 ] = [ U ; V ] Σ [ U ; V ]′ ⪰ 0.

Furthermore, we have Tr(W_1) = Tr(W_2) = Tr(Σ), and thus the objective function satisfies (1/2)(Tr(W_1) + Tr(W_2)) = ‖X‖_*. Since any feasible solution of (2.6) provides an upper bound for (2.5), we have that the dual norm is less than or equal to the nuclear norm of X, thus proving the proposition.

Notice that the argument given in the proof above further shows that the nuclear norm ‖X‖_* can be computed using either the SDP (2.5) or its dual (2.6), since there is no duality gap between them. Alternatively, this could have also been proven using a Slater-type interior point condition since both (2.5) and (2.6) admit strictly feasible solutions.

Convex envelopes of rank and cardinality functions
Let C be a given convex set. The convex envelope of a (possibly nonconvex) function f : C → R is defined as the largest convexfunction g such that g ( x ) ≤ f ( x ) for all x ∈ C (see, e.g., [32]). This means that among all convexfunctions, g is the best pointwise approximation to f . In particular, if the optimal g can beconveniently described, it can serve as an approximation to f that can be minimized efficiently.By the chain of inequalities in (2.1), we have that rank( X ) ≥ (cid:107) X (cid:107) ∗ / (cid:107) X (cid:107) for all X . For allmatrices with (cid:107) X (cid:107) ≤
1, we must have that rank( X ) ≥ (cid:107) X (cid:107) ∗ , so the nuclear norm is a convex lowerbound of the rank function on the unit ball in the operator norm. In fact, it can be shown thatthis is the tightest convex lower bound. Theorem 2.2 ([27])
The convex envelope of rank( X ) on the set { X ∈ R m × n : (cid:107) X (cid:107) ≤ } is thenuclear norm (cid:107) X (cid:107) ∗ . The proof is given in [27] and uses a basic result from convex analysis that establishes that (undersome technical conditions) the biconjugate of a function is its convex envelope [32].Theorem 2.2 provides the following interpretation of the nuclear norm heuristic for the affinerank minimization problem. Suppose X is the minimum rank solution of A ( X ) = b , and M = (cid:107) X (cid:107) . The convex envelope of the rank on the set C = { X ∈ R m × n : (cid:107) X (cid:107) ≤ M } is (cid:107) X (cid:107) ∗ /M . Let X ∗ be the minimum nuclear norm solution of A ( X ) = b . Then we have (cid:107) X ∗ (cid:107) ∗ /M ≤ rank( X ) ≤ rank( X ∗ )providing an upper and lower bound on the optimal rank when the norm of the optimal solution isknown. Furthermore, this is the tightest lower bound among all convex lower bounds of the rankfunction on the set C .For vectors, we have a similar inequality. Let card( x ) denote the cardinality function whichcounts the number of non-zero entries in the vector x . Then we have card( x ) ≥ (cid:107) x (cid:107) / (cid:107) x (cid:107) ∞ .8ot surprisingly, the (cid:96) norm is also the convex envelope of the cardinality function over the set { x ∈ R n : (cid:107) x (cid:107) ∞ ≤ } . This result can be either proven directly or can be seen as a special caseof the above theorem. Additivity of rank and nuclear norm
A function f mapping a linear space S to R is called subadditive if f ( x + y ) ≤ f ( x ) + f ( y ). It is additive if f ( x + y ) = f ( x ) + f ( y ). In the case ofvectors, both the cardinality function and the (cid:96) norm are subadditive. That is, if x and y aresparse vectors, then it always holds that the number of non-zeros in x + y is less than or equalto the number of non-zeros in x plus the number of non-zeros of y ; furthermore (by the triangleinequality) (cid:107) x + y (cid:107) ≤ (cid:107) x (cid:107) + (cid:107) y (cid:107) . In particular, the cardinality function is additive exactly whenthe vectors x and y have disjoint support. In this case, the (cid:96) norm is also additive, in the sensethat (cid:107) x + y (cid:107) = (cid:107) x (cid:107) + (cid:107) y (cid:107) .For matrices, the rank function is subadditive. For the rank to be additive, it is necessary andsufficient that the row and column spaces of the two matrices intersect only at the origin, since inthis case they operate in essentially disjoint spaces (see, e.g., [36]). As we will show below, a relatedcondition that ensures that the nuclear norm is additive, is that the matrices A and B have rowand column spaces that are orthogonal . In fact, a compact sufficient condition for the additivityof the nuclear norm will be that AB (cid:48) = 0 and A (cid:48) B = 0. This is a stronger requirement than theaforementioned condition for rank additivity, as orthogonal subspaces only intersect at the origin.The disparity arises because the nuclear norm of a linear map depends on the choice of the innerproducts on the spaces R m and R n on which the matrix acts, whereas the rank is independent ofsuch a choice. Lemma 2.3
Let A and B be matrices of the same dimensions. If AB (cid:48) = 0 and A (cid:48) B = 0 then (cid:107) A + B (cid:107) ∗ = (cid:107) A (cid:107) ∗ + (cid:107) B (cid:107) ∗ . Proof
Partition the singular value decompositions of A and B to reflect the zero and non-zerosingular vectors A = (cid:2) U A U A (cid:3) (cid:20) Σ A (cid:21) (cid:2) V A V A (cid:3) (cid:48) B = (cid:2) U B U B (cid:3) (cid:20) Σ B (cid:21) (cid:2) V B V B (cid:3) (cid:48) . The condition AB (cid:48) = 0 implies that V (cid:48) A V B = 0, and similarly, A (cid:48) B = 0 implies that U (cid:48) A U B = 0.Hence, there exist matrices U C and V C such that [ U A U B U C ] and [ V A V B V C ] are orthogonalmatrices. Thus, the following are valid singular value decompositions for A and B : A = (cid:2) U A U B U C (cid:3) Σ A (cid:2) V A V B V C (cid:3) (cid:48) B = (cid:2) U A U B U C (cid:3) B (cid:2) V A V B V C (cid:3) (cid:48) . In particular, we have that A + B = (cid:2) U A U B (cid:3) (cid:20) Σ A Σ B (cid:21) (cid:2) V A V B (cid:3) (cid:48) . A + B are equal to the union (with repetition) of the singularvalues of A and B . Hence, (cid:107) A + B (cid:107) ∗ = (cid:107) A (cid:107) ∗ + (cid:107) B (cid:107) ∗ as desired. Corollary 2.4
Let A and B be matrices of the same dimensions. If the row and column spaces of A and B are orthogonal, then (cid:107) A + B (cid:107) ∗ = (cid:107) A (cid:107) ∗ + (cid:107) B (cid:107) ∗ . Proof
It suffices to show that if the row and column spaces of A and B are orthogonal, then AB (cid:48) = 0 and A (cid:48) B = 0. But this is immediate: if the columns of A are orthogonal to the columnsof B , we have A (cid:48) B = 0. Similarly, orthogonal row spaces imply that AB (cid:48) = 0 as well. Nuclear norm minimization
Let us turn now to the study of equality-constrained norm minimization problems where we are searching for a matrix X ∈ R^{m×n} of minimum nuclear norm belonging to a given affine subspace. In our applications, the subspace is usually described by linear equations of the form A(X) = b, where A : R^{m×n} → R^p is a linear mapping. This problem admits the primal-dual convex formulation

min_X   ‖X‖_*              max_z   b′z
s.t.    A(X) = b           s.t.    ‖A*(z)‖ ≤ 1,          (2.7)

where A* : R^p → R^{m×n} is the adjoint of A. The formulation (2.7) is valid for any norm minimization problem, by replacing the norms appearing above by any dual pair of norms. In particular, if we replace the nuclear norm with the ℓ1 norm and the operator norm with the ℓ∞ norm, we obtain a primal-dual pair of optimization problems that can be reformulated in terms of linear programming.

Using the SDP characterizations of the nuclear and operator norms given in (2.5)-(2.6) above allows us to rewrite (2.7) as the primal-dual pair of semidefinite programs

min_{X, W_1, W_2}   (1/2)(Tr(W_1) + Tr(W_2))        max_z   b′z
s.t.   [ W_1   X ;  X′   W_2 ] ⪰ 0                  s.t.    [ I_m   A*(z) ;  A*(z)′   I_n ] ⪰ 0.          (2.8)
       A(X) = b

Optimality conditions
In order to describe the optimality conditions for the norm minimizationproblem (2.7), we must first characterize the subdifferential of the nuclear norm. Recall that for aconvex function f : R n → R , the subdifferential of f at x ∈ R n is the compact convex set ∂f ( x ) := { d ∈ R n : f ( y ) ≥ f ( x ) + (cid:104) d, y − x (cid:105) ∀ y ∈ R n } . Let X be an m × n matrix with rank r and let X = U Σ V (cid:48) be a singular value decomposition where U ∈ R m × r , V ∈ R n × r and Σ is an r × r diagonal matrix. The subdifferential of the nuclear normat X is then given by (see, e.g., [55]) ∂ (cid:107) X (cid:107) ∗ = { U V (cid:48) + W : W and X have orthogonal row and column spaces, and (cid:107) W (cid:107) ≤ } . (2.9)10or comparison, recall the case of the (cid:96) norm, where T denotes the support of the n -vector x , T c is the complement of T in the set { , . . . , n } , and ∂ (cid:107) x (cid:107) = { d ∈ R n : d i = sign( x ) for i ∈ T, | d i | ≤ i ∈ T c } . (2.10)The similarity between (2.9) and (2.10) is particularly transparent if we recall the polar decomposi-tion of a matrix into a product of orthogonal and positive semidefinite matrices (see, e.g., [33]). The“angular” component of the matrix X is exactly given by U V (cid:48) . Thus, these subgradients alwayshave the form of an “angle” (or sign), plus possibly a contraction in an orthogonal direction if thenorm is not differentiable at the current point.We can now write concise optimality conditions for the optimization problem (2.7). A matrix X is optimal for (2.7) if there exists a vector z ∈ R p such that A ( X ) = b, A ∗ ( z ) ∈ ∂ (cid:107) X (cid:107) ∗ . (2.11)The first condition in (2.11) requires feasibility of the linear equations, and the second one guaran-tees that there is no feasible direction of improvement. Indeed, since A ∗ ( z ) is in the subdifferentialat X , for any Y in the primal feasible set of (2.7) we have || Y || ∗ ≥ || X || ∗ + (cid:104)A ∗ ( z ) , Y − X (cid:105) = || X || ∗ + (cid:104) z, A ( Y − X ) (cid:105) = || X || ∗ , where the last step follows from the feasibility of X and Y . As we can see, the optimality conditions(2.11) for the nuclear norm minimization problem exactly parallel those of the (cid:96) optimization case.These optimality conditions can be used to check and certify whether a given candidate X isindeed the minimum nuclear norm solution. For this, it is sufficient (and necessary) to find a vector z ∈ R p in the subdifferential of the norm, i.e., such that the left- and right-singular spaces of A ∗ ( z )are aligned with those of X , and is a contraction in the orthogonal complement. Let us now turn to the central problem analyzed in this paper. Let A : R m × n → R p be a linearmap and let X be a matrix of rank r . Set b := A ( X ), and define the convex optimization problem X ∗ := arg min X (cid:107) X (cid:107) ∗ s.t. A ( X ) = b. (3.1)In this section, we will characterize specific cases when we can a priori guarantee that X ∗ = X .The key conditions will be determined by the values of a sequence of parameters δ r that quantifythe behavior of the linear map A when restricted to the subvariety of matrices of rank r . Thefollowing definition is the natural generalization of the Restricted Isometry Property from vectorsto matrices. Definition 3.1
Let A : R m × n → R p be a linear map. Without loss of generality, assume m ≤ n .For every integer r with ≤ r ≤ m , define the r -restricted isometry constant to be the smallestnumber δ r ( A ) such that (1 − δ r ( A )) (cid:107) X (cid:107) F ≤ (cid:107)A ( X ) (cid:107) ≤ (1 + δ r ( A )) (cid:107) X (cid:107) F (3.2) holds for all matrices X of rank at most r . δ r ( A ) ≤ δ r (cid:48) ( A ) for r ≤ r (cid:48) .The Restricted Isometry Property for sparse vectors was developed by Cand`es and Tao in [14],and requires (3.2) to hold with the Euclidean norm replacing the Frobenius norm and rank beingreplaced by cardinality. Since for diagonal matrices, the Frobenius norm is equal to the Euclideannorm of the diagonal, this definition reduces to the original Restricted Isometry Property of [14] inthe diagonal case. Unlike the case of “standard” compressed sensing, our RIP condition for low-rank matricescannot be interpreted as guaranteeing all sub-matrices of the linear transform A of a certain sizeare well conditioned. Indeed, the set of matrices X for which (3.2) must hold is not a finite unionof subspaces, but rather a certain “generalized Stiefel manifold,” which is also an algebraic variety(in fact, it is the r th-secant variety of the variety of rank-one matrices). Surprisingly, we are stillable to derive analogous recovery results for low-rank solutions of equations when A obeys this RIPcondition. Furthermore, we will see in Section 4 that many ensembles of random matrices havethe Restricted Isometry Property with δ r quite small with high probability for reasonable values of m , n , and p .The following two recovery theorems will characterize the power of the restricted isometryconstants. Both theorems are more or less immediate generalizations from the sparse case to thelow-rank case and use only minimal properties of the rank of matrices and the nuclear norm. Thefirst theorem generalizes Lemma 1.3 in [14] to low-rank recovery. Theorem 3.2
Suppose that δ_{2r} < 1 for some integer r ≥ 1. Then X is the only matrix of rank at most r satisfying A(X) = b.

Proof
Assume, on the contrary, that there exists a rank r matrix X satisfying A ( X ) = b and X (cid:54) = X . Then Z := X − X is a nonzero matrix of rank at most 2 r , and A ( Z ) = 0. But then wewould have 0 = (cid:107)A ( Z ) (cid:107) ≥ (1 − δ r ) (cid:107) Z (cid:107) F > δ r . No adjustment is necessary in thetransition from sparse vectors to low-rank matrices. The key property used is the sub-additivity ofthe rank.Next, we state a weak (cid:96) -type recovery theorem whose proof mimics the approach in [12], butfor which a few details need to be adjusted when switching from vectors to matrices. Theorem 3.3
Suppose that r ≥ 1 is such that δ_{5r} < 1/10. Then X* = X.

We will need the following technical lemma that shows that for any two matrices A and B, we can decompose B as the sum of two matrices B_1 and B_2 such that rank(B_1) is not too large and B_2 satisfies the conditions of Lemma 2.3. This will be the key decomposition for proving Theorem 3.3.

Lemma 3.4
Let A and B be matrices of the same dimensions. Then there exist matrices B and B such that1. B = B + B In [14], the authors define the restricted isometry properties with squared norms. We note here that the analysisis identical modulo some algebraic rescaling of constants. We choose to drop the squares as it greatly simplifies theanalysis in Section 4. . rank( B ) ≤ A ) AB (cid:48) = 0 and A (cid:48) B = 0 (cid:104) B , B (cid:105) = 0 Proof
Consider a full singular value decomposition of AA = U (cid:20) Σ 00 0 (cid:21) V (cid:48) , and let ˆ B := U (cid:48) BV . Partition ˆ B as ˆ B = (cid:20) ˆ B ˆ B ˆ B ˆ B (cid:21) . Defining now B := U (cid:20) ˆ B ˆ B ˆ B (cid:21) V (cid:48) , B := U (cid:20) B (cid:21) V (cid:48) , it can be easily verified that B and B satisfy the conditions (1)–(4).We now proceed to a proof of Theorem 3.3. Proof [of Theorem 3.3]
By optimality of X ∗ , we have (cid:107) X (cid:107) ∗ ≥ (cid:107) X ∗ (cid:107) ∗ . Let R := X ∗ − X .Applying Lemma 3.4 to the matrices X and R , there exist matrices R and R c such that R = R + R c , rank( R ) ≤ X ), and X R (cid:48) c = 0 and X (cid:48) R c = 0. Then, (cid:107) X (cid:107) ∗ ≥ (cid:107) X + R (cid:107) ∗ ≥ (cid:107) X + R c (cid:107) ∗ − (cid:107) R (cid:107) ∗ = (cid:107) X (cid:107) ∗ + (cid:107) R c (cid:107) ∗ − (cid:107) R (cid:107) ∗ , (3.3)where the middle assertion follows from the triangle inequality and the last one from Lemma 2.3.Rearranging terms, we can conclude that (cid:107) R (cid:107) ∗ ≥ (cid:107) R c (cid:107) ∗ . (3.4)Next we partition R c into a sum of matrices R , R , . . . , each of rank at most 3 r . Let R c = U diag( σ ) V (cid:48) be the singular value decomposition of R c . For each i ≥ I i = { r ( i −
1) + 1 , . . . , ri } , and let R i := U I i diag( σ I i ) V (cid:48) I i (notice that (cid:104) R i , R j (cid:105) = 0 if i (cid:54) = j ). Byconstruction, we have σ k ≤ r (cid:88) j ∈ I i σ j ∀ k ∈ I i +1 , (3.5)which implies (cid:107) R i +1 (cid:107) F ≤ r (cid:107) R i (cid:107) ∗ . We can then compute the following bound (cid:88) j ≥ (cid:107) R j (cid:107) F ≤ √ r (cid:88) j ≥ (cid:107) R j (cid:107) ∗ = 1 √ r (cid:107) R c (cid:107) ∗ ≤ √ r (cid:107) R (cid:107) ∗ ≤ √ r √ r (cid:107) R (cid:107) F , (3.6)13here the last inequality follows from (2.1) and the fact that rank( R ) ≤ r . Finally, note that therank of R + R is at most 5 r , so we may put this all together as (cid:107)A ( R ) (cid:107) ≥ (cid:107)A ( R + R ) (cid:107) − (cid:88) j ≥ (cid:107)A ( R j ) (cid:107)≥ (1 − δ r ) (cid:107) R + R (cid:107) F − (1 + δ r ) (cid:88) j ≥ (cid:107) R j (cid:107) F ≥ (cid:18) (1 − δ r ) − (cid:113) (1 + δ r ) (cid:19) (cid:107) R (cid:107) F ≥ (cid:0) (1 − δ r ) − (1 + δ r ) (cid:1) (cid:107) R (cid:107) F . (3.7)By assumption A ( R ) = A ( X ∗ − X ) = 0, so if the factor on the right-hand side is strictly positive, R = 0, which further implies R c = 0 by (3.4), and thus X ∗ = X . Simple algebra reveals thatthe right-hand side is positive when 9 δ r + 11 δ r <
2. Since δ_{3r} ≤ δ_{5r}, we immediately have that X* = X if δ_{5r} < 1/10.

The constant 9/11 in the proof of the theorem is chosen for notational simplicity and is clearly not optimal. A slightly tighter bound can be achieved working directly with the second to last line in (3.7). The most important point is that our recovery condition on δ_{5r} is an absolute constant, independent of m, n, r, and p.

We have yet to demonstrate any specific linear mappings A for which δ_r < 1. We shall show in the next section that linear transformations sampled from several families of random matrices with appropriately chosen dimensions have this property with overwhelming probability. The analysis is again similar to the compressive sampling literature, but several details specific to the rank recovery problem need to be employed.
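As a preview of the random ensembles analyzed in the next section, the sketch below (not from the paper; sizes are arbitrary) draws a map A with i.i.d. N(0, 1/p) entries, applies it to vectorized random rank-r matrices, and checks empirically that ‖A(X)‖ concentrates around ‖X‖_F.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, r = 20, 20, 2
p = 4 * r * (m + n)                      # number of measurements, p < mn here

# Matrix representation of A: a p x mn matrix acting on a vectorization of X.
# (For a Gaussian ensemble the particular vectorization order is immaterial.)
A = rng.standard_normal((p, m * n)) / np.sqrt(p)   # A_ij ~ N(0, 1/p)

ratios = []
for _ in range(200):
    X = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))  # rank-r matrix
    ratios.append(np.linalg.norm(A @ X.ravel()) / np.linalg.norm(X, "fro"))

ratios = np.array(ratios)
# An empirical surrogate, on this sample only, for the restricted isometry constant.
print("max deviation of ||A(X)|| / ||X||_F from 1:", np.abs(ratios - 1).max())
```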
In this section, we will demonstrate that when we sample linear maps from a class of probabilitydistributions obeying certain tail bounds, then they will obey the Restricted Isometry Property(3.2) as p , m , and n tend to infinity at appropriate rates. The following definition characterizesthis family of random linear transformation. Definition 4.1
Let A be a random variable that takes values in linear maps from R m × n to R p .We say that A is nearly isometrically distributed if for all X ∈ R m × n E [ (cid:107)A ( X ) (cid:107) ] = (cid:107) X (cid:107) F (4.1) and for all < (cid:15) < we have, P ( |(cid:107)A ( X ) (cid:107) − (cid:107) X (cid:107) F | ≥ (cid:15) (cid:107) X (cid:107) F ) ≤ (cid:16) − p (cid:15) / − (cid:15) / (cid:17) (4.2) and for all t > , we have P (cid:18) (cid:107)A(cid:107) ≥ (cid:114) mnp + t (cid:19) ≤ exp( − γpt ) (4.3) for some constant γ > . A : R m × n → R p , we can always write its matrix representation as A ( X ) = A vec( X ) , (4.4)where vec( X ) denotes the vector of X with its columns stacked in order on top of one another,and A is a p × mn matrix. We now give several examples of nearly isometric random variablesin this matrix representation. The most well known is the ensemble with independent, identicallydistributed (i.i.d.) Gaussian entries [15] A ij ∼ N (0 , p ) . (4.5)We also mention the two following ensembles of matrices, described in [2]. One has entries sampledfrom an i.i.d. symmetric Bernoulli distribution A ij = (cid:113) p with probability − (cid:113) p with probability , (4.6)and the other has zeros in two-thirds of the entries A ij = (cid:113) p with probability − (cid:113) p with probability . (4.7)The fact that the top singular value of the matrix A is concentrated around 1 + (cid:112) D/p for all ofthese ensembles follows from the work of Yin, Bai, and Krishnaiah, who showed that whenever theentries A ij are i.i.d. with zero mean and finite fourth moment, then the maximum singular valueof A is almost surely 1 + (cid:112) D/p for D sufficiently large [56]. El Karoui uses this result to provethe concentration inequality (4.3) for all such distributions [23]. The result for Gaussians is rathertight with γ = 1 / (cid:112) D/p , (4.3) holds trivially. Theconcentration inequality (4.2) is proven in [15].The main result of this section is the following:
Theorem 4.2
Fix 0 < δ < 1. If A is a nearly isometric random variable, then, for every 1 ≤ r ≤ m, there exist constants c_0, c_1 > 0 depending only on δ such that, with probability at least 1 − exp(−c_1 p), δ_r(A) ≤ δ whenever p ≥ c_0 r(m + n) log(mn).

The proof proceeds by first examining the behavior of A when restricted to an arbitrary subspace of matrices U of dimension d.

Lemma 4.3
Let A be a nearly isometric linear map and let U be an arbitrary subspace of m × n matrices with d = dim( U ) ≤ p . Then for any < δ < we have (1 − δ ) (cid:107) X (cid:107) F ≤ (cid:107)A ( X ) (cid:107) ≤ (1 + δ ) (cid:107) X (cid:107) F ∀ X ∈ U (4.8) with probability at least − /δ ) d exp (cid:16) − p δ / − δ / (cid:17) . (4.9) Proof
The proof of this theorem is identical to the argument in [4] where the authors restrictedtheir attention to subspaces aligned with the coordinate axes. We will sketch the proof here as theargument is straightforward.There exists a finite set Ω of at most (12 /δ ) d points such that for every X ∈ U with (cid:107) X (cid:107) F ≤ Q ∈ Ω such that (cid:107) X − Q (cid:107) F ≤ δ/
4. By the standard union bound, the concentrationinequality (4.2) holds for all Q ∈ Ω with (cid:15) = δ/ Q ∈ Ω, then we immediately have that (1 − δ/ (cid:107) Q (cid:107) F ≤ (cid:107)A ( Q ) (cid:107) ≤ (1 + δ/ (cid:107) Q (cid:107) F for all Q ∈ Ωas well.Let X be in { X ∈ U : (cid:107) X (cid:107) F ≤ } , and M be the maximum of (cid:107)A ( X ) (cid:107) on this set. Then thereexists a Q ∈ Ω such that (cid:107) X − Q (cid:107) F ≤ δ/
4. We then have (cid:107)A ( X ) (cid:107) ≤ (cid:107)A ( Q ) (cid:107) + (cid:107)A ( X − Q ) (cid:107) ≤ δ/ M δ/ , and since M ≤ δ/ M δ/ M ≤ δ . The lower bound is proven bythe following chain of inequalities (cid:107)A ( X ) (cid:107) ≥ (cid:107)A ( Q ) (cid:107) − (cid:107)A ( X − Q ) (cid:107) ≥ − δ/ − (1 + δ ) δ/ ≥ − δ. The proof of preceding lemma revealed that the near isometry of a linear map is robust to smallperturbations of the matrix on which the map is acting. We will now show that this behavior isrobust with respect to small perturbations of the subspace U as well. This perturbation will bemeasured in the natural distance between two subspaces ρ ( T , T ) := (cid:107) P T − P T (cid:107) , (4.10)where T and T are subspaces and P T i is the orthogonal projection associated with each subspace.This distance measures the operator norm of the difference between the corresponding projections,and is equal to the sine of the largest principal angle between T and T [1].16he set of all d -dimensional subspaces of R D is commonly known as the Grassmannian manifold G ( D, d ). We will endow it with the metric ρ ( · , · ) given by (4.10), also known as the projection 2-norm . In the following lemma we characterize and quantify the change in the isometry constant δ as one smoothly moves through the Grassmannian. Lemma 4.4
Let U and U be d -dimensional subspaces of R D . Suppose that for all X ∈ U , (1 − δ ) (cid:107) X (cid:107) F ≤ (cid:107)A ( X ) (cid:107) ≤ (1 + δ ) (cid:107) X (cid:107) F (4.11) for some constant < δ < . Then for all Y ∈ U (1 − δ (cid:48) ) (cid:107) Y (cid:107) F ≤ (cid:107)A ( Y ) (cid:107) ≤ (1 + δ (cid:48) ) (cid:107) Y (cid:107) F (4.12) with δ (cid:48) = δ + (1 + (cid:107)A(cid:107) ) · ρ ( U , U ) . (4.13) Proof
Consider any Y ∈ U . Then (cid:107)A ( Y ) (cid:107) = (cid:107)A ( P U ( Y ) − [ P U − P U ]( Y )) (cid:107)≤ (cid:107)A ( P U ( Y )) (cid:107) + (cid:107)A ([ P U − P U ]( Y )) (cid:107)≤ (1 + δ ) (cid:107) P U ( Y ) (cid:107) F + (cid:107)A(cid:107)(cid:107) P U − P U (cid:107)(cid:107) Y (cid:107) F ≤ (1 + δ + (cid:107)A(cid:107)(cid:107) P U − P U (cid:107) ) (cid:107) Y (cid:107) F . (4.14)Similarly, we have (cid:107)A ( Y ) (cid:107) ≥ (cid:107)A ( P U ( Y )) (cid:107) − (cid:107)A ([ P U − P U ]( Y )) (cid:107)≥ (1 − δ ) (cid:107) P U ( Y ) (cid:107) F − (cid:107)A(cid:107)(cid:107) P U − P U (cid:107)(cid:107) Y (cid:107) F ≥ (1 − δ ) (cid:107) Y (cid:107) F − (1 − δ ) (cid:107) ( P U − P U )( Y ) (cid:107) F − (cid:107)A(cid:107)(cid:107) P U − P U (cid:107)(cid:107) Y (cid:107) F ≥ [1 − δ − ( (cid:107)A(cid:107) + 1) (cid:107) P U − P U (cid:107) ] (cid:107) Y (cid:107) F , (4.15)which completes the proof.To apply these concentration results to low-rank matrices, we characterize the set of all matricesof rank at most r as a union of subspaces. Let V ⊂ R m and W ⊂ R n be fixed subspaces of dimension r . Then the set of all m × n matrices X whose row space is contained in W and column spaceis contained in V forms an r -dimensional subspace of matrices of rank less than or equal to r .Denote this subspace as Σ( V, W ) ⊂ R m × n . Any matrix of rank less than or equal to r is an elementof some Σ( V, W ) for a suitable pair of subspaces, i.e., the setΣ mnr := { Σ( V, W ) : V ∈ G ( m, r ) , W ∈ G ( n, r ) } . We now characterize how many subspaces are necessary to cover this set to arbitrary resolution. The covering number N ( (cid:15) ) of Σ mnr at resolution (cid:15) is defined to be the smallest number of subspaces( V i , W i ) such that for any pair of subspaces ( V, W ), there is an i with ρ (Σ( V, W ) , Σ( V i , W i )) ≤ (cid:15) . That is, the covering number is the smallest cardinality of an (cid:15) -net. The following Lemmacharacterizes the cardinality of such a set. 17 emma 4.5 The covering number N ( (cid:15) ) of Σ mnr is bounded above by N ( (cid:15) ) ≤ (cid:18) C (cid:15) (cid:19) r ( m + n − r ) (4.16) where C is a constant independent of (cid:15) , m , n , and r . Proof
Note that the projection operator onto Σ(
V, W ) can be written as P Σ( V,W ) = P V ⊗ P W ,so for a pair of subspaces ( V , W ) and ( V , W ), we have ρ (Σ( V , W ) , Σ( V , W )) = (cid:107) P Σ( V ,W ) − P Σ( V ,W ) (cid:107) = (cid:107) P V ⊗ P W − P V ⊗ P W (cid:107) = (cid:107) ( P V − P V ) ⊗ P W + P V ⊗ ( P W − P W ) (cid:107)≤ (cid:107) P V − P V (cid:107)(cid:107) P W (cid:107) + (cid:107) P V (cid:107)(cid:107) P W − P W (cid:107) = ρ ( V , V ) + ρ ( W , W ) . (4.17)The conditions ρ ( V , V ) ≤ (cid:15) and ρ ( W , W ) ≤ (cid:15) together imply that ρ (Σ( V , W ) , Σ( V , W )) ≤ ρ ( V , V ) + ρ ( W , W ) ≤ (cid:15) . Let V , . . . , V N cover the set of r -dimensional subspaces of R m toresolution (cid:15)/ U , . . . , U N cover the r -dimensional subspaces of R n to resolution (cid:15)/
2. Thenfor any (
V, W ), there exist i and j such that ρ ( V, V i ) ≤ (cid:15)/ ρ ( W, W j ) ≤ (cid:15)/
2. Therefore, N ( (cid:15) ) ≤ N N . By the work of Szarek on (cid:15) -nets of the Grassmannian ([49], [50, Th. 8]) there is auniversal constant C , independent of m , n , and r , such that N ≤ (cid:18) C (cid:15) (cid:19) r ( m − r ) and N ≤ (cid:18) C (cid:15) (cid:19) r ( n − r ) (4.18)which completes the proof.The exact value of the universal constant C is not provided by Szarek in [50]. It takes thesame value for any homogeneous space whose automorphism group is a subgroup of the orthogonalgroup, and is independent of the dimension of the homogeneous space. Hence, one might expectthis constant to be quite large. However, it is known that for the sphere C ≤ U, V ). Proof [of Theorem 4.2]Let Ω = { ( V i , W i ) } be a finite set of subspaces that satisfies the conditions of Lemma 4.5 for (cid:15) >
0, so | Ω | ≤ N ( (cid:15) ). For each pair ( V i , W i ), define the set of matrices B i := (cid:110) X (cid:12)(cid:12)(cid:12) ∃ ( V, W ) such that X ∈ Σ( V, W ) and ρ (Σ( V, W ) , Σ( V i , W i )) ≤ (cid:15) (cid:111) . (4.19)Since Ω is an (cid:15) -net, we have that the union of all the B i is equal to Σ mnr . Therefore, if for all i ,(1 − δ ) (cid:107) X (cid:107) F ≤ (cid:107)A ( X ) (cid:107) ≤ (1 + δ ) (cid:107) X (cid:107) F for all X ∈ B i , we must have that δ r ( A ) ≤ δ proving that P ( δ r ( A ) ≤ δ ) = P [(1 − δ ) (cid:107) X (cid:107) F ≤ (cid:107)A ( X ) (cid:107) ≤ (1 + δ ) (cid:107) X (cid:107) F ∀ X s.t. rank( X ) ≤ r ] ≥ P [ ∀ i (1 − δ ) (cid:107) X (cid:107) F ≤ (cid:107)A ( X ) (cid:107) ≤ (1 + δ ) (cid:107) X (cid:107) F ∀ X ∈ B i ] (4.20)18ow note that if we have (1 + (cid:107)A(cid:107) ) (cid:15) ≤ δ/ Y ∈ Σ( V i , W i ), (1 − δ/ (cid:107) Y (cid:107) F ≤ (cid:107)A ( Y ) (cid:107) ≤ (1 + δ/ (cid:107) Y (cid:107) F , Lemma 4.3 implies that (1 − δ ) (cid:107) X (cid:107) F ≤ (cid:107)A ( X ) (cid:107) ≤ (1 + δ ) (cid:107) X (cid:107) F for all X ∈ B i .Therefore, using a union bound, (4.20) is greater than or equal to1 − | Ω | (cid:88) i =1 P (cid:20) ∃ Y ∈ Σ( V i , W i ) (cid:107)A ( Y ) (cid:107) < (1 − δ (cid:107) Y (cid:107) F or (cid:107)A ( Y ) (cid:107) > (1 + δ (cid:107) Y (cid:107) F (cid:21) − P (cid:20) (cid:107)A(cid:107) ≥ δ (cid:15) − (cid:21) . (4.21)We can bound these quantities separately. First we have by Lemmas 4.3 and 4.5 | Ω | (cid:88) i =1 P (cid:20) ∃ Y ∈ Σ( V i , W i ) (cid:107)A ( Y ) (cid:107) < (1 − δ (cid:107) Y (cid:107) F or (cid:107)A ( Y ) (cid:107) > (1 + δ (cid:107) Y (cid:107) F (cid:21) ≤ N ( (cid:15) ) (cid:18) δ (cid:19) r exp (cid:16) − p δ / − δ / (cid:17) ≤ (cid:18) C (cid:15) (cid:19) r ( m + n − r ) (cid:18) δ (cid:19) r exp (cid:16) − p δ / − δ / (cid:17) . (4.22)Secondly, since A is nearly isometric, there exists a constant γ such that P (cid:18) (cid:107)A(cid:107) ≥ (cid:114) mnp + t (cid:19) ≤ exp( − γpt ) . (4.23)In particular, P (cid:18) (cid:107)A(cid:107) ≥ δ (cid:15) − (cid:19) ≤ exp (cid:32) − γp (cid:18) δ (cid:15) − (cid:114) mnp − (cid:19) (cid:33) . (4.24)We now must pick a suitable resolution (cid:15) to guarantee that this probability is less than exp( − c p )for a suitably chosen constant c . First note that if we choose (cid:15) < ( δ/ (cid:112) mn/p + 1) − , P (cid:18) (cid:107)A(cid:107) ≥ δ (cid:15) − (cid:19) ≤ exp( − γmn ) , (4.25)which achieves the desired scaling because mn > p . With this choice of (cid:15) , the quantity in Equa-tion (4.22) is less than or equal to2 (cid:32) C ( (cid:112) mn/p + 1) δ (cid:33) r ( m + n − r ) (24 /δ ) r exp (cid:16) − p δ / − δ / (cid:17) = exp (cid:18) − pa ( δ ) + r ( m + n − r ) log (cid:18)(cid:114) mnp + 1 (cid:19) + r ( m + n − r ) log (cid:18) C δ (cid:19) + r log (cid:18) δ (cid:19)(cid:19) (4.26)where a ( δ ) = δ / − δ / mn/p < mn for all p >
1, there exists a constant c independentof m , n , p , and r , such that the sum of the last three terms in the exponent are bounded above by19 c /a ( δ )) r ( m + n ) log( mn ). It follows that there exists a constant c independent of m , n , p , and r such that p ≥ c r ( m + n ) log( mn ) observations are sufficient to yield an RIP of δ with probabilitygreater than 1 − e − c p .Heuristically, the scaling p = O ( r ( m + n ) log( mn )) is very reasonable, since a rank r matrixhas r ( m + n − r ) degrees of freedom. This coarse tail bound only provides asymptotic estimatesfor recovery, and is quite conservative in practice. As we demonstrate in Section 6, minimum ranksolutions can be determined from between 2 r ( m + n − r ) to 4 r ( m + n − r ) observations for manypractical problems. A variety of methods can be developed for the effective minimization of the nuclear norm over anaffine subspace of matrices, and we do not have room for a comprehensive treatment here. Instead,we focus on three methods highlighting the trade-offs between computational speed and guaranteeson accuracy of the resulting solution. Directly solving the semidefinite characterization of thenuclear norm problem using primal-dual interior point methods is a numerically efficient methodfor small problems and can be used to yield accuracy up to floating-point precision.Since interior point methods use second order information, the memory requirements for com-puting descent directions quickly becomes too large as the problem size increases. Moreover, forlarger problem sizes it is preferable to use methods that exploit, at least partially, the structure ofthe problem. This can be done at several levels, either by taking into account further informationthat may be available about the linear map A (e.g., the case of partially observed Fourier mea-surements) or by formulating algorithms that are specific to the nuclear norm problem. For thelatter, we show how to apply subgradient methods to minimize the nuclear norm over an affine set.Such first-order methods cannot yield as high numerical precision as interior point methods, butmuch larger problems can be solved because no second-order information needs to be stored. Foreven larger problems, we discuss a low-rank semidefinite programming that explicitly works with afactorization of the decision variable. This method can be applied even when the matrix decisionvariable cannot fit into memory, but convergence guarantees are much less satisfactory than in theother two cases. For small problems where a high-degree of numerical precision is required, interior point methodsfor semidefinite programming can be directly applied to solve affine nuclear minimization problems.As we have seen in earlier sections, the nuclear norm minimization problem can be directly posed asa semidefinite programming problem via the standard form primal-dual pair (2.8). As written, theprimal problem has one ( n + m ) × ( n + m ) semidefinite constraint and p affine constraints. Conversely,the dual problem has one ( n + m ) × ( n + m ) semidefinite constraint and p scalar decision variables.Thus, the total number of decision variables (primal and dual) is equal to (cid:0) n + m +12 (cid:1) + p .Modern interior point solvers for semidefinite programming generally use primal-dual methods,and compute an update direction for the current solution by solving a suitable Newton system.Depending on the structure of the linear mapping A , this may entail solving a potentially large,dense linear system. 
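For small instances, the nuclear norm problem (3.1) and its SDP form (2.8) can be prototyped with an off-the-shelf conic modeling layer that assembles the semidefinite program and hands it to such a solver. The sketch below uses CVXPY, which is not discussed in the paper and is only one convenient option; the data, sizes, and names are illustrative.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(4)
m, n, r = 15, 15, 2
p = 3 * r * (m + n)                                               # p < mn measurements

X0 = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))    # rank-r ground truth
As = rng.standard_normal((p, m, n)) / np.sqrt(p)                  # sensing matrices
b = np.array([np.sum(Ai * X0) for Ai in As])                      # b = A(X0)

# Minimize the nuclear norm subject to the affine measurements A(X) = b.
X = cp.Variable((m, n))
constraints = [cp.sum(cp.multiply(Ai, X)) == bi for Ai, bi in zip(As, b)]
prob = cp.Problem(cp.Minimize(cp.normNuc(X)), constraints)
prob.solve()

print("status:", prob.status)
print("relative recovery error:",
      np.linalg.norm(X.value - X0, "fro") / np.linalg.norm(X0, "fro"))
```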
If the matrix dimensions n and m are not too large, then any good interior point SDP solver, such as SeDuMi [48] or SDPT3 [51], will quickly produce accurate solutions. In fact, as we will see in the next section, problems with n and m around 50 can be solved to machine precision in minutes on a desktop computer. However, solving such a primal-dual pair of programs with traditional interior point methods can prove to be quite challenging when the dimensions of the matrix X are much bigger than 100 × 100.

The nuclear norm minimization (3.1) is a linearly constrained nondifferentiable convex problem. There are numerous techniques to approach this kind of problem, depending on the specific nature of the constraints (e.g., dense vs. sparse), and the possibility of using first- or second-order information.

In this section we describe a simple, easy to implement, subgradient projection approach to the solution of (3.1). This first-order method will proceed by computing a sequence of feasible points {X_k}, with iterates satisfying the update rule

X_{k+1} = Π(X_k − s_k Y_k),   Y_k ∈ ∂‖X_k‖_*,

where Π is the orthogonal projection onto the affine subspace defined by the linear constraints A(X) = b, and s_k > 0 is a stepsize. In other words, the method updates the current iterate X_k by taking a step along the direction of a subgradient at the current point and then projecting back onto the feasible set. Alternatively, since X_k is feasible, we can rewrite this as

X_{k+1} = X_k − s_k Π_A Y_k,

where Π_A is the orthogonal projection onto the kernel of A. Since the feasible set is an affine subspace, there are several options for the projection Π_A. For small problems, one can precompute it using, for example, a QR decomposition of the matrix representation of A and store it. Alternatively, one can solve a least squares problem at each step by iterative methods such as conjugate gradients.

The subgradient-based method described above is extremely simple to implement, since only a subgradient evaluation is required at every step. The computation of the subgradient can be done using the formula given in (2.9) earlier, thus requiring only a singular value decomposition of the current point X_k.

A possible alternative here to the use of the SVD for the subgradient computation is to directly focus on the "angular" factor of the polar decomposition of X_k, using for instance the Newton-like methods developed by Gander in [30]. Specifically, for a given matrix X_k, the Halley-like iteration

X → X(X′X + 3I)(3X′X + I)^{-1}

converges to the angular factor of the polar decomposition of X_k, and thus yields an element of the subdifferential of the nuclear norm. This iteration method (suitably scaled) can be faster than a direct SVD computation, particularly if the singular values of the initial matrix are close to 1. This could be appealing since presumably only a very small number of iterations would be needed to update the polar factor of X_k, although the nonsmoothness of the subdifferential is bound to cause some additional difficulties.

Regarding convergence, for general nonsmooth problems, subgradient methods do not guarantee a decrease of the cost function at every iteration, even for arbitrarily small step sizes (see, e.g., [7]), although with suitably chosen step sizes one can still guarantee convergence of the iterates X_k to an optimal point. There are several possibilities for the choice of stepsize s_k.
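A minimal NumPy sketch of this projected subgradient iteration is given below (not from the paper). It precomputes the projection onto the null space of the matrix representation of A via a pseudoinverse, takes the subgradient UV′ from the SVD of the current iterate, and uses a diminishing stepsize of the kind discussed next; all sizes and names are illustrative.

```python
import numpy as np

def nuclear_norm_subgradient(A, b, m, n, iters=500, s0=1.0):
    """Minimize ||X||_* over {X : A vec(X) = b} by projected subgradient steps."""
    pinvA = np.linalg.pinv(A)
    x = pinvA @ b                              # feasible start (minimum Frobenius norm)
    P_null = np.eye(m * n) - pinvA @ A         # projector onto the kernel of A
    X = x.reshape(m, n)
    for k in range(1, iters + 1):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        Y = U @ Vt                             # subgradient of ||.||_* at X, cf. (2.9) with W = 0
        step = s0 / k                          # diminishing stepsize
        X = X - step * (P_null @ Y.ravel()).reshape(m, n)
    return X

# Illustrative use on a small random instance.
rng = np.random.default_rng(5)
m, n, r = 15, 15, 2
p = 3 * r * (m + n)                            # p < mn, so the system is underdetermined
X0 = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
A = rng.standard_normal((p, m * n)) / np.sqrt(p)
b = A @ X0.ravel()

X_hat = nuclear_norm_subgradient(A, b, m, n)
print("constraint residual:", np.linalg.norm(A @ X_hat.ravel() - b))
print("relative error vs. X0:",
      np.linalg.norm(X_hat - X0, "fro") / np.linalg.norm(X0, "fro"))
```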
There are several possibilities for the choice of stepsize s_k. The simplest choice that can guarantee convergence is a diminishing stepsize with an infinite travel condition (i.e., such that lim_{k→∞} s_k = 0 and ∑_{k>0} s_k diverges).

Often, even the computation of a singular value decomposition or a Halley-like iteration can be too computationally expensive. The next section proposes a reduction of the size of the search space to alleviate such demands. We must give up guarantees of convergence for this convenience, but this may be an acceptable trade-off for very large-scale problems.

We now turn to a method that works with an explicit low-rank factorization of X. This algorithm not only requires less storage capacity and computational overhead than the previous methods, but for many problems does not even require one to be able to store the decision variable X in memory. This is the case, for example, in the matrix completion problem, where A(X) is a subset of the entries of X.

Given observations of the form A(X) = b of an m × n matrix X of rank r, a possible search algorithm for a suitable X would be to find a factorization X = LR′, where L is an m × r matrix and R an n × r matrix, such that the equality constraints are satisfied. Since there are many such factorizations, we could search for one where the matrices L and R have Frobenius norm as small as possible, that is, the solution of the optimization problem

minimize_{L,R}   (1/2)(‖L‖_F² + ‖R‖_F²)
subject to       A(LR′) = b.        (5.1)

Even though the cost function is convex, the constraint is not. Such a problem is a nonconvex quadratic program, and it is not obviously easy to optimize. We show below that the minimization of the nuclear norm subject to equality constraints is in fact equivalent to this rather natural heuristic optimization, as long as r is chosen to be larger than the rank of the optimum of the nuclear norm problem.

Lemma 5.1
Assume r ≥ rank(X*), where X* is the optimal solution of the nuclear norm relaxation (3.1). Then the nonconvex quadratic optimization problem (5.1) is equivalent to (3.1).

Proof
Consider any feasible solution (L, R) of (5.1). Then, defining W₁ := LL′, W₂ := RR′, and X := LR′ yields a feasible solution of the primal SDP problem (2.8) that achieves the same cost. Since the SDP formulation is equivalent to the nuclear norm problem, it follows that the optimal value of (5.1) is always greater than or equal to that of the nuclear norm heuristic.

For the converse, we can use an argument similar to the proof of Proposition 2.1. From the SVD X* = UΣV′ of the optimal solution of the nuclear norm relaxation (3.1), we can explicitly construct matrices L := UΣ^{1/2} and R := VΣ^{1/2} for (5.1) that yield exactly the same value of the objective.

The main advantage of this reformulation is that it substantially decreases the number of primal decision variables from nm to (n + m)r. For large problems, this is quite a significant reduction that allows us to search for matrices of rank on the order of 100 with n + m in the hundreds of thousands on a desktop computer. However, the problem is nonconvex and potentially subject to local minima. This is not as much of a problem as it could be, for two reasons. First, recall from Theorem 3.2 that if δ_{2r}(A) < 1, there is a unique X* with rank at most r such that A(X*) = b. Since any local minimum (L*, R*) of (5.1) is feasible, we would have X* = L*R*′ and we would have found the minimum rank solution. Second, we now present an algorithm that is guaranteed to converge to a local minimum for a judiciously selected r. We will also provide a sufficient condition under which we can construct an optimal solution of (2.8) from the solution computed by the method of multipliers.

SDPLR and the method of multipliers
For general semidefinite programming problems, Burer and Monteiro have developed in [8, 9] a nonlinear programming approach that relies on a low-rank factorization of the matrix decision variable. We will adapt this idea to our problem, to provide a first-order Lagrangian minimization algorithm that efficiently finds a local minimum of (5.1). As a consequence of the work in [9], it will follow that for values of r larger than the rank of the true optimal solution, the local minima of (5.1) can be transformed into global minima of (2.8) under the identification W₁ = LL′, W₂ = RR′, and X = LR′. We summarize the details of this approach below.

The algorithm employed is called the method of multipliers, a standard approach for solving equality constrained optimization problems [6]. The method of multipliers works with an augmented Lagrangian for (5.1),

L_a(L, R; y, σ) := (1/2)(‖L‖_F² + ‖R‖_F²) − y′(A(LR′) − b) + (σ/2)‖A(LR′) − b‖²,        (5.2)

where the y_i are arbitrarily signed Lagrange multipliers and σ is a positive constant. A somewhat similar algorithm was proposed by Rennie et al. in [42] in the collaborative filtering setting. In that work, the authors minimize L_a with σ fixed and y = 0, which serves as a regularized algorithm for matrix completion. Remarkably, by deterministically varying σ and y, this method can be adapted into an algorithm for solving linearly constrained nuclear norm minimization.

In the method of multipliers, one alternately minimizes the augmented Lagrangian with respect to the decision variables L and R, and then increases the value of the penalty coefficient σ and updates y. The augmented Lagrangian can be minimized using any local search technique, and the partial derivatives are particularly simple to compute. Letting ŷ := y − σ(A(LR′) − b), we have

∇_L L_a = L − A*(ŷ)R,
∇_R L_a = R − A*(ŷ)′L.
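For concreteness, the sketch below spells out one way the resulting updates might be organized. The dense matrix representation of A, the plain gradient steps used for the inner minimization, and the doubling schedule for σ are placeholder assumptions for illustration only; they do not reproduce the SDPLR implementation of [8, 9].

import numpy as np

def A_op(A, L, R):
    # Evaluate A(L R') with A stored as a dense p x (m*n) matrix acting on vec(L R').
    return A @ (L @ R.T).flatten(order="F")

def A_adj(A, y, m, n):
    # Adjoint A*(y), returned as an m x n matrix.
    return (A.T @ y).reshape((m, n), order="F")

def method_of_multipliers(A, b, m, n, r_d, outer=30, inner=200, step=1e-3, sigma=1.0, seed=0):
    # Method of multipliers for the augmented Lagrangian (5.2): inner gradient descent
    # on (L, R) using the gradients displayed above, then a multiplier/penalty update.
    rng = np.random.default_rng(seed)
    L = rng.standard_normal((m, r_d))
    R = rng.standard_normal((n, r_d))
    y = np.zeros(len(b))
    for _ in range(outer):
        for _ in range(inner):
            y_hat = y - sigma * (A_op(A, L, R) - b)
            G = A_adj(A, y_hat, m, n)
            L, R = L - step * (L - G @ R), R - step * (R - G.T @ L)
        y = y - sigma * (A_op(A, L, R) - b)   # multiplier update
        sigma *= 2.0                          # increase the penalty coefficient
    return L, R, y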
To calculate the gradients, we first compute the constraint violations A(LR′) − b, then form ŷ, and finally use the above equations to compute the gradients.

As the number of iterations tends to infinity, only feasible points will have finite values of L_a, and for any feasible point L_a(L, R) is equal to the original cost function (‖L‖_F² + ‖R‖_F²)/2. The method terminates when L and R are feasible, as in this case the Lagrangian is stationary and we are at a local minimum of (5.1). Including the y multipliers improves the conditioning of each subproblem in which L_a is minimized and enhances the rate of convergence. The following theorem shows that when the method of multipliers converges, it converges to a local minimum of (5.1).

Theorem 5.2
Suppose we have a sequence (L^(k), R^(k), y^(k)) of local minima of the augmented Lagrangian at each step of the method of multipliers. Assume that σ^(k) → ∞ and that the sequence y^(k) is bounded. If (L^(k), R^(k)) converges to (L*, R*) and the linear map

Λ^(k)(y) := [ A*(y) R^(k) ]
            [ A*(y)′ L^(k) ]        (5.3)

has kernel equal to the zero vector for all k, then there exists a vector y* such that

(i) ∇L_a(L*, R*; y*) = 0;
(ii) A(L*R*′) = b.

Proof
This proof is standard and follows the approach in [6]. As above, define ŷ^(k) := y^(k) − σ^(k)(A(L^(k)R^(k)′) − b) for all k. Since (L^(k), R^(k)) minimize the augmented Lagrangian at iteration k, we have

0 = L^(k) − A*(ŷ^(k))R^(k),
0 = R^(k) − A*(ŷ^(k))′L^(k),        (5.4)

which we may rewrite as

Λ^(k)(ŷ^(k)) = [ L^(k) ]
               [ R^(k) ].        (5.5)

Since we have assumed that there is no nonzero y with Λ^(k)(y) = 0, the map Λ^(k) has a left-inverse and we can solve for ŷ^(k):

ŷ^(k) = (Λ^(k))† ([ L^(k) ; R^(k) ]).        (5.6)

Everything on the right-hand side is bounded, and L^(k) and R^(k) converge. Therefore, ŷ^(k) must converge to some ŷ*. Taking the limit of (5.4) proves (i). To prove (ii), note that ŷ^(k) is bounded. Since y^(k) is also bounded, we find that σ^(k)(A(L^(k)R^(k)′) − b) is also bounded. But σ^(k) → ∞ implies that A(L*R*′) = b, completing the proof.

Suppose the decision variables are chosen to be of size m × r_d and n × r_d. A necessary condition for Λ^(k)(y) to have only the trivial kernel is that the number of decision variables r_d(m + n) be greater than or equal to the number of equalities p. In particular, this means that we must choose r_d ≥ p/(m + n) in order to have any hope of satisfying the conditions of Theorem 5.2.

Figure 1: The MIT logo image. The associated matrix has dimensions 46 × 81 and has rank 5.
We close this section by relating the solution found by the method of multipliers to the optimal solution of the nuclear norm minimization problem. We have already shown that when the low-rank algorithm converges, it converges to a low-rank solution of A(X) = b. If we additionally find that A*(y*) has norm less than or equal to one (in the operator norm, the dual of the nuclear norm), then it is dual feasible. One can check using straightforward algebra that (L*R*′, L*L*′, R*R*′) and y* then form an optimal primal-dual pair for (2.8). This analysis proves the following theorem.

Theorem 5.3
Let (L*, R*, y*) satisfy (i)-(ii) in Theorem 5.2 and suppose ‖A*(y*)‖ ≤ 1. Then (L*R*′, L*L*′, R*R*′) is an optimal primal solution and y* is an optimal dual solution of (2.8).

To illustrate the scaling of low-rank recovery for a particular matrix M, consider the MIT logo presented in Figure 1. The image has a total of 46 rows and 81 columns (3726 elements in all), and three distinct non-zero numerical values corresponding to the colors white, red, and grey. Since the logo has only 5 distinct rows, it has rank 5. For each of the ensembles discussed in Section 4, we sampled measurement matrices with p ranging between 700 and 1500, and solved the semidefinite program (2.6) using the freely available software SeDuMi [48]. On a 2.0 GHz laptop, each semidefinite program could be solved in less than four minutes. We chose to use this interior point method because it yielded the highest accuracy in the shortest amount of time, and we were interested in characterizing precisely when the nuclear norm heuristic succeeded and failed.

Figure 2 plots the Frobenius norm of the difference between the optimal point of the semidefinite program and the true image. We observe a sharp transition to perfect recovery near 1200 measurements, which is approximately equal to 2r(m + n − r) (here 2 · 5 · (46 + 81 − 5) = 1220). In Figure 3, we graphically plot the recovered solutions for various values of p under the Gaussian ensemble.

To demonstrate the average behavior of low-rank recovery, we conducted a series of experiments for a variety of matrix sizes n, ranks r, and numbers of measurements p. For a fixed n, we constructed random recovery scenarios for low-rank n × n matrices. For each n, we varied p between 0 and n², the point at which the matrix is completely determined. For a fixed n and p, we generated all possible ranks such that r(2n − r) ≤ p. This cutoff was chosen because beyond that point there would be an infinite set of matrices of rank r satisfying the p equations.
Figure 2: (a) Error, measured in the Frobenius norm, between the recovered image and the ground truth, plotted against the number of measurements for the gaussian, projection, binary, and sparse binary ensembles. Observe that there is a sharp transition to near zero error at around 1200 measurements. (b) Zooming in on this transition, we see fluctuation between high and low error when between 1125 and 1225 measurements are available. Note that the error is plotted on a logarithmic scale.

Figure 3: Example recovered images using the Gaussian ensemble. (a) 700 measurements. (b) 1100 measurements. (c) 1250 measurements. The total number of pixels is 46 × 81 = 3726.
For each (n, p, r) triple, we repeated the following procedure 10 times. A matrix of rank r was generated by choosing two random n × r factors Y_L and Y_R with i.i.d. random entries and setting Y = Y_L Y_R′. A matrix A was sampled from the Gaussian ensemble with p rows and n² columns. Then the nuclear norm minimization

minimize_X ‖X‖_*   subject to   A vec(X) = A vec(Y)        (6.1)

was solved using the SDP solver SeDuMi on the formulation (2.6). Again, we chose to use SeDuMi because we wanted to distinguish precisely between success and failure of the heuristic. We declared Y to be recovered if ‖X − Y‖_F/‖Y‖_F < 10⁻³. Figure 4 shows the results of these experiments for n = 30 and 40. The color of each cell in the figures reflects the empirical recovery rate of the 10 runs (scaled between 0 and 1). White denotes perfect recovery in all experiments, and black denotes failure for all experiments.

These experiments demonstrate that the logarithmic factors and constants present in our scaling results are somewhat conservative. For example, as one might expect, low-rank matrices are perfectly recovered by nuclear norm minimization when p = n², as the matrix is then uniquely determined. Moreover, as p is reduced slightly away from this value, low-rank matrices are still recovered 100 percent of the time for most values of r. Finally, we note that despite the asymptotic nature of our analysis, our experiments demonstrate excellent performance with low-rank matrices of size 30 × 30 and 40 × 40, showing that the heuristic is practical even in low-dimensional settings.

Intriguingly, Figure 4 also demonstrates a "phase transition" between perfect recovery and failure. As observed in several recent papers by Donoho and his collaborators (see, e.g., [18, 19]), the random sparsity recovery problem has two distinct connected regions of parameter space: one where the sparsity pattern is perfectly recovered, and one where no sparse solution is found. Not surprisingly, Figure 4 illustrates an analogous phenomenon in rank recovery. Computing explicit formulas for the transition between perfect recovery and failure is left for future work.
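For reference, one trial of this procedure can be sketched as follows. The use of CVXPY in place of SeDuMi, the particular random number generation, and the small illustrative problem sizes are assumptions made for this sketch rather than a record of the exact experimental setup; the recovery threshold mirrors the one quoted above.

import numpy as np
import cvxpy as cp

def recovery_trial(n, p, r, seed=0, tol=1e-3):
    # One random trial: draw a rank-r n x n matrix Y and p Gaussian measurements of it,
    # solve the nuclear norm minimization (6.1), and report whether Y was recovered.
    rng = np.random.default_rng(seed)
    Y = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))   # Y = Y_L Y_R'
    A = rng.standard_normal((p, n, n))                              # Gaussian ensemble
    b = np.array([np.sum(Ai * Y) for Ai in A])                      # b_i = <A_i, Y>
    X = cp.Variable((n, n))
    constraints = [cp.sum(cp.multiply(A[i], X)) == b[i] for i in range(p)]
    cp.Problem(cp.Minimize(cp.normNuc(X)), constraints).solve()
    return np.linalg.norm(X.value - Y, "fro") / np.linalg.norm(Y, "fro") < tol

# Empirical recovery rate over 10 trials for one (n, p, r) cell in the style of Figure 4.
print(np.mean([recovery_trial(20, 350, 2, seed=s) for s in range(10)]))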
Figure 4: For each (n, p, r) triple, we repeated the following procedure ten times: a matrix of rank r was generated by choosing two random n × r factors Y_L and Y_R with i.i.d. random entries and setting Y = Y_L Y_R′; a matrix A was selected from the Gaussian ensemble with p rows and n² columns; the nuclear norm minimization subject to A vec(X) = A vec(Y) was then solved, and Y was declared to be recovered if ‖X − Y‖_F/‖Y‖_F < 10⁻³. The results are shown for (a) n = 30 and (b) n = 40, with p/n² and r(2n − r)/p on the axes. The color of each cell reflects the empirical recovery rate (scaled between 0 and 1). White denotes perfect recovery in all experiments, and black denotes failure for all experiments.

Having illustrated the natural connections between affine rank minimization and affine cardinality minimization, we were able to draw on these parallels to determine scenarios where the nuclear norm heuristic exactly solves the rank minimization problem. These scenarios directly generalize conditions under which the ℓ1 heuristic is known to succeed, together with ensembles of linear maps for which these conditions hold. Furthermore, our experimental results display recovery properties similar to those demonstrated in the empirical studies of ℓ1 minimization. Inspired by the success of this program, we close this report by briefly discussing several exciting directions that are natural continuations of this work, building on further analogies from the compressed sensing literature. We also describe possible extensions to more general notions of parsimony.

Factored measurements and alternative ensembles
All of the measurement ensembles discussed so far require the storage of O(mnp) numbers. For large problems this is wholly impractical. There are many promising alternative measurement ensembles that seem to obey the same scaling laws as those presented in Section 4. For example, "factored" measurements of the form A_i : X ↦ u_i′ X v_i, where u_i and v_i are Gaussian random vectors, empirically yield the same performance as the Gaussian ensemble. This factored ensemble only requires storage of O((m + n)p) numbers, which is a rather significant savings for very large problems. The proof in Section 4 does not seem to extend to this ensemble, so new machinery must be developed to guarantee properties of such low-rank measurements.
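As a sketch, such factored measurements are straightforward to generate and apply without ever forming a p × mn matrix; the function and variable names below are illustrative only.

import numpy as np

def factored_ensemble(m, n, p, seed=0):
    # Each measurement is A_i(X) = u_i' X v_i; only the p Gaussian vectors u_i (length m)
    # and v_i (length n) are stored, i.e. O((m + n) p) numbers instead of O(m n p).
    rng = np.random.default_rng(seed)
    return rng.standard_normal((p, m)), rng.standard_normal((p, n))

def apply_factored(U, V, X):
    # Evaluate all p measurements u_i' X v_i at once.
    return np.einsum("pm,mn,pn->p", U, X, V)

# Example: measure a random rank-3 200 x 150 matrix with p = 2000 factored measurements.
rng = np.random.default_rng(1)
m, n, r, p = 200, 150, 3, 2000
X = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
U, V = factored_ensemble(m, n, p)
b = apply_factored(U, V, X)
print(b.shape)   # (2000,)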
Noisy measurements and low-rank approximation

Our results in this paper address only the case of exact (noiseless) measurements. It is of natural interest to understand the behavior of the nuclear norm heuristic in the case of noisy data. Based on the existing results for the sparse case (e.g., [12]), it would be natural to expect similar stability properties of the recovered solution, for instance in terms of the ℓ2 norm of the computed solution. Such an analysis could also be used to study the nuclear norm heuristic as an approximation technique when a matrix has rapidly decaying singular values and a low-rank approximation is desired.

Incoherent ensembles and partially observed transforms
Again taking our lead from the compressed sensing literature, it would be of great interest to extend the results of [11] to low-rank recovery. In that work, the authors show that partially observed unitary transformations of sparse vectors can be used to recover the sparse vector using ℓ1 minimization. There are many practical applications where low-rank processes are partially observed. For instance, the matrix completion problem can be thought of as partial observation under the identity transformation. As another example, there are many settings in two-dimensional Fourier spectroscopy where only partial information can be observed due to experimental constraints.

Alternative numerical methods
Besides the techniques described in Section 5, there are a number of additional interesting possibilities for solving the nuclear norm minimization problem. An appealing suggestion is to combine the strength of second-order methods (as in the SDP approach) with the known geometry of the nuclear norm (as in the subgradient approach) and develop a customized interior point method, possibly yielding faster convergence rates while still being relatively memory-efficient.

It is also of much interest to investigate the possible adaptation of some of the successful path-following approaches from traditional ℓ1/cardinality minimization, such as Homotopy [40] or LARS (least angle regression) [21]. This may not be completely straightforward, since the efficiency of many of these methods often relies explicitly on the polyhedral structure of the feasible set of the ℓ1 norm problem.

Geometric interpretations

For the case of cardinality/ℓ1 minimization, a beautiful geometric interpretation has been set forth by Donoho and Tanner [18, 19]. Key to their results is the notion of central k-neighborliness of a centrosymmetric polytope, namely the property that every subset of k + 1 vertices not including an antipodal pair spans a k-face. In particular, they show that the ℓ1 heuristic always succeeds whenever the image of the ℓ1 unit ball (the cross-polytope) under the linear mapping A is a centrally k-neighborly polytope.

In the case of rank minimization, the direct application of these concepts fails, since the unit ball of the nuclear norm is not a polyhedral set. Nevertheless, it seems likely that a similar explanation could be developed, where the key feature would be the preservation, under a linear map, of the extremality of the components of the boundary of the nuclear norm unit ball defined by low-rank conditions.

Jordan algebras
As we have seen, our results for the rank minimization problem closely parallel the earlier developments in cardinality minimization. A convenient mathematical framework that allows the simultaneous consideration of these cases, as well as a few new ones, is that of Jordan algebras and the related symmetric cones [24]. In the Jordan-algebraic setting there is an intrinsic notion of rank that agrees with the cardinality of the support in the case of the nonnegative orthant, and with the rank of a matrix in the case of the positive semidefinite cone. Besides mathematical elegance, a direct Jordan-algebraic approach would transparently yield similar results for the case of second-order (or Lorentz) cone constraints.

As specific examples of the power and elegance of this approach, we mention the work of Faybusovich [25] and of Schmieta and Alizadeh [43], which provides a unified development of interior point methods for symmetric cones, as well as Faybusovich's work on convexity theorems for quadratic mappings [26].
Parsimonious models and optimization
Sparsity and low rank are two specific classes of parsimonious (or low-complexity) descriptions. Are there other kinds of easy-to-describe parametric models that are amenable to exact solutions via convex optimization techniques? Given the intimate connections between linear and semidefinite programming and the Jordan-algebraic approaches described earlier, it is likely that this will require alternative tractable convex optimization formulations.
We thank Stephen Boyd, Emmanuel Candès, José Costa, John Doyle, Ali Jadbabaie, Ali Rahimi, and Michael Wakin for their useful comments and suggestions. We also thank the IMA in Minneapolis for hosting us during the initial stages of our collaboration.
References

[1] P.-A. Absil, A. Edelman, and P. Koev. On the largest principal angle between random subspaces. Linear Algebra Appl., 414(1):288-294, 2006.
[2] D. Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671-687, 2003. Special issue of invited papers from PODS'01.
[3] H. C. Andrews and C. L. Patterson, III. Singular value decomposition (SVD) image coding. IEEE Transactions on Communications, 24(4):425-432, 1976.
[4] R. Baraniuk, M. Davenport, R. DeVore, and M. Wakin. A simple proof of the restricted isometry property for random matrices. Preprint, 2007. http://dsp.rice.edu/cs/jlcs-v03.pdf.
[5] C. Beck and R. D'Andrea. Computational study and comparisons of LFT reducibility methods. In Proceedings of the American Control Conference, 1998.
[6] D. P. Bertsekas. Constrained Optimization and Lagrange Multiplier Methods. Athena Scientific, Belmont, Massachusetts, 1996.
[7] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, MA, 2nd edition, 1999.
[8] S. Burer and R. D. C. Monteiro. A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Mathematical Programming (Series B), 95:329-357, 2003.
[9] S. Burer and R. D. C. Monteiro. Local minima and convergence in low-rank semidefinite programming. Mathematical Programming, 103(3):427-444, 2005.
[10] E. J. Candès. Compressive sampling. In International Congress of Mathematicians, Vol. III, pages 1433-1452. Eur. Math. Soc., Zürich, 2006.
[11] E. J. Candès and J. Romberg. Sparsity and incoherence in compressive sampling. Inverse Problems, 23(3):969-985, 2007.
[12] E. J. Candès, J. Romberg, and T. Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics, 59:1207-1223, 2005.
[13] E. J. Candès, J. Romberg, and T. Tao. Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Trans. Inform. Theory, 52(2):489-509, 2006.
[14] E. J. Candès and T. Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203-4215, 2005.
[15] S. Dasgupta and A. Gupta. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures and Algorithms, 22(1):60-65, 2003.
[16] K. R. Davidson and S. J. Szarek. Local operator theory, random matrices and Banach spaces. In W. B. Johnson and J. Lindenstrauss, editors, Handbook on the Geometry of Banach Spaces, pages 317-366. Elsevier Scientific, 2001.
[17] D. L. Donoho. Compressed sensing. IEEE Trans. Inform. Theory, 52(4):1289-1306, 2006.
[18] D. L. Donoho and J. Tanner. Neighborliness of randomly projected simplices in high dimensions. Proc. Natl. Acad. Sci. USA, 102(27):9452-9457, 2005.
[19] D. L. Donoho and J. Tanner. Sparse nonnegative solution of underdetermined linear equations by linear programming. Proc. Natl. Acad. Sci. USA, 102(27):9446-9451, 2005.
[20] C. Eckart and G. Young. The approximation of one matrix by another of lower rank. Psychometrika, 1(3):211-218, 1936.
[21] B. Efron, T. Hastie, I. M. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32(2):407-499, 2004.
[22] L. El Ghaoui and P. Gahinet. Rank minimization under LMI constraints: A framework for output feedback problems. In Proceedings of the European Control Conference, 1993.
[23] N. El Karoui. New results about random covariance matrices and statistical applications. PhD thesis, Stanford University, 2004.
[24] J. Faraut and A. Korányi. Analysis on Symmetric Cones. Oxford Mathematical Monographs. The Clarendon Press, Oxford University Press, New York, 1994.
[25] L. Faybusovich. Euclidean Jordan algebras and interior-point algorithms. Positivity, 1(4):331-357, 1997.
[26] L. Faybusovich. Jordan-algebraic approach to convexity theorem for quadratic mappings, 2005.
[27] M. Fazel. Matrix Rank Minimization with Applications. PhD thesis, Stanford University, 2002.
[28] M. Fazel, H. Hindi, and S. Boyd. A rank minimization heuristic with application to minimum order system approximation. In Proceedings of the American Control Conference, 2001.
[29] M. Fazel, H. Hindi, and S. Boyd. Log-det heuristic for matrix rank minimization with applications to Hankel and Euclidean distance matrices. In Proceedings of the American Control Conference, 2003.
[30] W. Gander. Algorithms for the polar decomposition. SIAM J. Sci. Statist. Comput., 11(6):1102-1115, 1990.
[31] K. M. Grigoriadis and E. B. Beran. Alternating projection algorithms for linear matrix inequalities problems with rank constraints. In L. El Ghaoui and S. Niculescu, editors, Advances in Linear Matrix Inequality Methods in Control, chapter 13, pages 251-267. SIAM, 2000.
[32] J.-B. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms II: Advanced Theory and Bundle Methods. Springer-Verlag, New York, 1993.
[33] R. A. Horn and C. R. Johnson. Topics in Matrix Analysis. Cambridge University Press, New York, 1991.
[34] N. Linial, E. London, and Y. Rabinovich. The geometry of graphs and some of its algorithmic applications. Combinatorica, 15:215-245, 1995.
[35] G. G. Lorentz, M. von Golitschek, and Y. Makovoz. Constructive Approximation: Advanced Problems, volume 304 of Grundlehren der Mathematischen Wissenschaften. Springer, 1996.
[36] G. Marsaglia and G. P. H. Styan. When does rank(A + B) = rank(A) + rank(B)? Canad. Math. Bull., 15:451-452, 1972.
[37] M. Mesbahi and G. P. Papavassilopoulos. On the rank minimization problem over a positive semidefinite linear matrix inequality. IEEE Transactions on Automatic Control, 42(2):239-243, 1997.
[38] L. Mirsky. Symmetric gauge functions and unitarily invariant norms. Quart. J. Math. Oxford Ser. (2), 11:50-59, 1960.
[39] B. K. Natarajan. Sparse approximate solutions to linear systems. SIAM Journal on Computing, 24(2):227-234, 1995.
[40] M. R. Osborne, B. Presnell, and B. A. Turlach. A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis, 20:389-403, 2000.
[41] P. A. Parrilo and S. Khatri. On cone-invariant linear matrix inequalities. IEEE Trans. Automat. Control, 45(8):1558-1563, 2000.
[42] J. D. M. Rennie and N. Srebro. Fast maximum margin matrix factorization for collaborative prediction. In Proceedings of the International Conference on Machine Learning, 2005.
[43] S. H. Schmieta and F. Alizadeh. Associative and Jordan algebras, and polynomial time interior-point algorithms for symmetric cones. Math. Oper. Res., 26(3):543-564, 2001.
[44] I. J. Schoenberg. Remarks to Maurice Fréchet's article "Sur la définition axiomatique d'une classe d'espace distanciés vectoriellement applicable sur l'espace de Hilbert". Annals of Mathematics, 36(3):724-732, July 1935.
[45] R. E. Skelton, T. Iwasaki, and K. Grigoriadis. A Unified Algebraic Approach to Linear Control Design. Taylor and Francis, 1998.
[46] E. Sontag. Mathematical Control Theory. Springer-Verlag, New York, 1998.
[47] N. Srebro. Learning with Matrix Factorizations. PhD thesis, Massachusetts Institute of Technology, 2004.
[48] J. F. Sturm. Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones. Optimization Methods and Software, 11-12:625-653, 1999.
[49] S. J. Szarek. The finite dimensional basis problem with an appendix on nets of the Grassmann manifold. Acta Mathematica, 151:153-179, 1983.
[50] S. J. Szarek. Metric entropy of homogeneous spaces. In Quantum Probability (Gdańsk, 1997), volume 43 of Banach Center Publ., pages 395-410. Polish Acad. Sci., Warsaw, 1998. Preprint available at arXiv:math/9701213v1.
[51] K. C. Toh, M. Todd, and R. H. Tütüncü. SDPT3 - a MATLAB software package for semidefinite-quadratic-linear programming. Available from .
[52] M. W. Trosset. Distance matrix completion by numerical optimization. Computational Optimization and Applications, 17(1):11-22, October 2000.
[53] L. Vandenberghe and S. Boyd. Semidefinite programming. SIAM Review, 38(1):49-95, 1996.
[54] M. Wakin, J. Laska, M. Duarte, D. Baron, S. Sarvotham, D. Takhar, K. Kelly, and R. Baraniuk. An architecture for compressive imaging. In Proc. International Conference on Image Processing (ICIP 2006), October 2006.
[55] G. A. Watson. Characterization of the subdifferential of some matrix norms. Linear Algebra and its Applications, 170:1039-1053, 1992.
[56] Y. Q. Yin, Z. D. Bai, and P. R. Krishnaiah. On the limit of the largest eigenvalue of the large dimensional sample covariance matrix. Probability Theory and Related Fields, 78:509-512, 1988.