Alternating Direction Method of Multipliers for Sparse Principal Component Analysis
SHIQIAN MA∗

November 28, 2011

∗ Institute for Mathematics and Its Applications, 400 Lind Hall, 207 Church Street SE, University of Minnesota, Minneapolis, MN 55455, USA. Email: [email protected].
Abstract.
We consider a convex relaxation of sparse principal component analysis proposed by d'Aspremont et al. in [9]. This convex relaxation is a nonsmooth semidefinite programming problem in which the $\ell_1$ norm of the desired matrix is imposed in either the objective function or the constraint to improve the sparsity of the resulting matrix. The sparse principal component is obtained by a rank-one decomposition of the resulting sparse matrix. We propose an alternating direction method based on a variable-splitting technique and an augmented Lagrangian framework for solving this nonsmooth semidefinite programming problem. In contrast to the first-order method proposed in [9], which approximately solves the dual of the original semidefinite programming problem, our method deals with the primal problem directly and solves it exactly, which guarantees that the resulting matrix is sparse. A global convergence result is established for the proposed method. Numerical results on both synthetic problems and real applications from classification of text data and senate voting data are reported to demonstrate the efficacy of our method.

Key words.
Sparse PCA, Semidefinite Programming, Alternating Direction Method, Augmented Lagrangian Method, Deflation, Projection onto the Simplex
AMS subject classifications.
1. Introduction.
Principal component analysis (PCA) plays an important role in applications arising from data analysis, dimension reduction, bioinformatics, etc. PCA finds a few linear combinations of the original variables. These linear combinations, which are called principal components (PCs), are orthogonal to each other and explain most of the variance of the data. Specifically, for a given data matrix $M \in \mathbb{R}^{p\times n}$ consisting of $n$ samples of $p$ variables, PCA corresponds to a singular value decomposition (SVD) of $M$ or an eigenvalue decomposition of the sample covariance matrix $\Sigma = MM^\top \in \mathbb{R}^{p\times p}$. Thus, for a given sample covariance matrix $\Sigma$, PCA is usually formulated as an eigenvalue problem:

(1.1) $x^* := \arg\max\ x^\top \Sigma x, \quad \text{s.t. } \|x\|_2 \le 1,$

where $\|x\|_2$ is the Euclidean norm of the vector $x$. Problem (1.1) gives the eigenvector corresponding to the largest eigenvalue of $\Sigma$. However, the loading vector $x^*$ is not expected to have many zero coefficients, which makes it hard to interpret the PCs. For example, in the text classification problem we are given a binary data matrix $M \in \mathbb{R}^{p\times n}$ that records the occurrences of $p$ words in $n$ postings: $M_{ij} = 1$ if the $i$-th word appears in the $j$-th posting and $M_{ij} = 0$ otherwise. Standard PCA cannot tell which words contribute most to the explained variance, since the loadings are linear combinations of all the variables. Thus, sparse PCs are needed because it is easier to analyze which variables contribute most to the explained variance.

Many techniques have been proposed to extract sparse PCs from a given sample covariance matrix $\Sigma$ or sample data matrix $M$. One natural thought is to impose a cardinality constraint on (1.1), which leads to the following formulation of sparse PCA:

(1.2) $x^* := \arg\max\ x^\top \Sigma x, \quad \text{s.t. } \|x\|_2 \le 1,\ \|x\|_0 \le K,$

where $\|x\|_0$ (the $\ell_0$ norm of $x$) counts the number of nonzeros of $x$ and the integer $K$ controls the sparsity of the solution. Since the cardinality constraint $\|x\|_0 \le K$ makes the problem numerically intractable, many different models have been proposed in the literature to overcome this difficulty.

In [9], d'Aspremont et al. proposed to approximately solve (1.2) by its convex relaxation, which is a nonsmooth semidefinite programming (SDP) problem. This is the first work that attempts to approximately solve (1.2) by a convex problem. The SDP formulation is based on the lifting and projection technique, a standard technique for using SDP to approximate combinatorial problems (see, e.g., [1, 3, 34]). Note that if we denote $X = xx^\top$, then (1.2) can be rewritten as

(1.3) $\max_{X\in\mathbb{R}^{p\times p}}\ \{\langle \Sigma, X\rangle,\ \text{s.t. } \mathrm{Tr}(X) = 1,\ \|X\|_0 \le K^2,\ X \succeq 0,\ \mathrm{rank}(X) = 1\},$

where $\mathrm{Tr}(X)$ denotes the trace of the matrix $X$. The rank constraint is then dropped and the cardinality constraint is replaced by an $\ell_1$ norm constraint, which leads to the following convex problem, an SDP:

(1.4) $\max_{X\in\mathbb{R}^{p\times p}}\ \{\langle \Sigma, X\rangle,\ \text{s.t. } \mathrm{Tr}(X) = 1,\ \|X\|_1 \le K,\ X \succeq 0\},$

where the $\ell_1$ norm of $X$ is defined as $\|X\|_1 := \sum_{ij} |X_{ij}|$, and using the convex constraint $\|X\|_1 \le K$ to impose sparsity of the solution is inspired by the recent emergence of compressed sensing (see, e.g., [5, 10]). Note that $\|X\|_1 \le K$ is used in (1.4) instead of $\|X\|_1 \le K^2$.
This is due to the fact that, when $X = xx^\top$ and $\mathrm{Tr}(X) = 1$, we have $\|X\|_F = 1$, and also that if $\|x\|_0 \le K$, then $\|X\|_1 \le K\|X\|_F$. After the optimal solution $X^*$ to (1.4) is obtained, the vector $\hat{x}$ from the rank-one decomposition of $X^*$, i.e., $X^* = \hat{x}\hat{x}^\top$, is used as an approximation of the solution of (1.2). This is the whole procedure of the lifting and projection technique. Although standard methods such as interior point methods can be used to solve the SDP (1.4) (see, e.g., [1, 3, 34]), it is not wise to do so because (1.4) is a nonsmooth problem, and transforming it into a standard SDP increases the size of the problem dramatically.

It is known that (1.4) is equivalent to the following problem with an appropriately chosen parameter $\rho > 0$:

(1.5) $\max_{X\in\mathbb{R}^{p\times p}}\ \{\langle \Sigma, X\rangle - \rho\|X\|_1,\ \text{s.t. } \mathrm{Tr}(X) = 1,\ X \succeq 0\}.$

Note that (1.5) can be rewritten as

(1.6) $\max_{X \succeq 0,\ \mathrm{Tr}(X)=1}\ \min_{\|U\|_\infty \le \rho}\ \langle \Sigma + U, X\rangle,$
where $\|U\|_\infty$ denotes the largest component of $U$ in magnitude, i.e., $\|U\|_\infty = \max_{ij}|U_{ij}|$. The dual problem of (1.5) is obtained by interchanging the max and min in (1.6), i.e.,

$\min_{\|U\|_\infty \le \rho}\ \max_{X \succeq 0,\ \mathrm{Tr}(X)=1}\ \langle \Sigma + U, X\rangle,$
which can be further reduced to

(1.7) $\min_{U \in \mathbb{R}^{p\times p}}\ \lambda_{\max}(\Sigma + U), \quad \text{s.t. } \|U\|_\infty \le \rho,$

where $\lambda_{\max}(Z)$ denotes the largest eigenvalue of the matrix $Z$. d'Aspremont et al. [9] proposed to solve the dual problem (1.7) using Nesterov's first-order algorithm (see, e.g., [27, 28]), which is an accelerated projected gradient method. However, since the objective function of (1.7) is nonsmooth, one needs to smooth it in order to apply Nesterov's algorithm. Thus, the authors of [9] actually solve an approximation of the dual problem (1.7), which can be formulated as follows:

(1.8) $\min\ f_\mu(U), \quad \text{s.t. } \|U\|_\infty \le \rho,$

where $\mu > 0$ is a smoothing parameter and

$f_\mu(U) := \max\ \{\langle \Sigma + U, X\rangle - \mu d(X),\ \text{s.t. } \mathrm{Tr}(X) = 1,\ X \succeq 0\}$
and $d(X) := \mathrm{Tr}(X\log X) + \log(n)$. It is shown in [9] that an approximate solution $X^k$ to the primal problem (1.5) can be obtained by $X^k = \nabla f_\mu(U^k)$, where $U^k$ is an approximate solution of (1.8). It is easy to see that $X^k$ is not guaranteed to be a sparse matrix. Besides, although there is no duality gap between (1.5) and (1.7), the authors solve only an approximation of (1.7). It should also be noted that Nesterov's algorithm used in [9] cannot solve the constrained problem (1.4). Although (1.4) and (1.5) are equivalent for appropriately chosen parameters $K$ and $\rho$, in many applications it is easier to choose an appropriate $K$, since we know how many nonzeros are preferred in the sparse PCs.

Nonconvex reformulations of (1.2) include the following. Zou et al. [42] considered a regression-type formulation of (1.2) with lasso and elastic net regularizations. d'Aspremont et al. [8] considered a penalty version of (1.2),

(1.9) $\phi(\rho) \equiv \max_{\|x\|_2 \le 1}\ x^\top \Sigma x - \rho\|x\|_0.$

d'Aspremont et al. [8] showed that (1.9) is equivalent to the following problem, which maximizes a convex function over a spherical constraint:

(1.10) $\phi(\rho) = \max_{\|x\|_2 = 1}\ \sum_{i=1}^{p} \left((a_i^\top x)^2 - \rho\right)_+,$

where $(\alpha)_+ := \max\{\alpha, 0\}$, $\Sigma = A^\top A$ and $a_i$ is the $i$-th column of $A \in \mathbb{R}^{p\times p}$. Clearly, (1.10) is a nonconvex problem. d'Aspremont et al. thus proposed in [8] to solve (1.10) by a greedy method. Journee et al. [22] considered the same formulation (1.10) and proposed a gradient-type method, which is actually a generalized power method, to solve it.

Note that (1.2) only gives the largest sparse PC. In many applications, several leading sparse PCs are needed in order to explain more variance. Multiple sparse PCs are usually found by solving a sequence of sparse PCA problems (1.2), with $\Sigma$ constructed via the so-called deflation technique for each sparse PC. Lu and Zhang [24] proposed the following model to compute the leading $r$ sparse PCs of $\Sigma$ simultaneously:

(1.11) $\max_{V\in\mathbb{R}^{p\times r}}\ \mathrm{Tr}(V^\top \Sigma V) - \rho\|V\|_1, \quad \text{s.t. } |V_i^\top \Sigma V_j| \le \Delta_{ij}\ \forall i \neq j,\ V^\top V = I,$

where each column of $V$ corresponds to a loading vector of the sample covariance matrix $\Sigma$ and $\Delta_{ij} \ge 0$ ($i \neq j$) are parameters that control the correlation of the PCs. Lu and Zhang [24] proposed an augmented Lagrangian method to solve (1.11). Note that for these nonconvex formulations, the algorithms proposed in the literature usually have only local convergence, and global convergence is not guaranteed.

In this paper, we propose an alternating direction method based on a variable-splitting technique and an augmented Lagrangian framework for solving the primal problems (1.4) and (1.5) directly. Our method solves two subproblems in each iteration. One subproblem has a closed-form solution corresponding to the projection of a given matrix onto the simplex of the cone of semidefinite matrices; this projection requires an eigenvalue decomposition. The other subproblem has a closed-form solution corresponding to a vector shrinkage operation (for Problem (1.5)) or a projection onto the $\ell_1$ ball (for Problem (1.4)). Thus, our method produces two iterates at each iteration: one is a semidefinite matrix with trace equal to one, and the other is a sparse matrix. Eventually these two points converge to the same point, and thus we obtain an optimal solution which is a sparse semidefinite matrix.
Compared with Nesterov's first-order method suggested in [9] for solving the approximate dual problem (1.8), our method can solve the nonsmooth primal problems (1.4) and (1.5) in a uniform manner. Also, since we deal with the primal problems directly, the $\ell_1$ norm in the constraint or the objective function guarantees that our solution is a sparse matrix, while Nesterov's method in [9] does not guarantee this since it solves the approximate dual problem.

The rest of the paper is organized as follows. In Section 2, we introduce our alternating direction method of multipliers for solving the nonsmooth SDP problems (1.4) and (1.5). Global convergence results for the alternating direction method of multipliers are given in Section 3. We discuss some practical issues, including the deflation technique for computing multiple sparse PCs, in Section 4. In Section 5, we use our alternating direction method of multipliers to solve sparse PCA problems arising from different applications, such as classification of text data and senate voting records, to demonstrate the efficacy of our method. We draw some conclusions in Section 6.
2. Alternating Direction Method of Multipliers.
We first introduce some notation. We use $\mathcal{C}$ to denote the simplex of the cone of semidefinite matrices, i.e., $\mathcal{C} = \{X \in \mathbb{R}^{p\times p} \mid \mathrm{Tr}(X) = 1,\ X \succeq 0\}$. We use $\mathcal{B}$ to denote the $\ell_1$-ball with radius $K$ in $\mathbb{R}^{p\times p}$, i.e., $\mathcal{B} = \{X \in \mathbb{R}^{p\times p} \mid \|X\|_1 \le K\}$. $I_{\mathcal{A}}(X)$ denotes the indicator function of a set $\mathcal{A}$, i.e.,

(2.1) $I_{\mathcal{A}}(X) = \begin{cases} 0, & X \in \mathcal{A}, \\ +\infty, & \text{otherwise.} \end{cases}$

We know that $I_{\mathcal{C}}(X)$ and $I_{\mathcal{B}}(X)$ are both convex functions since $\mathcal{C}$ and $\mathcal{B}$ are both convex sets. We can then reformulate (1.4) and (1.5) uniformly as the following unconstrained problem:

(2.2) $\min\ -\langle \Sigma, X\rangle + I_{\mathcal{C}}(X) + h(X),$

where $h(X) = I_{\mathcal{B}}(X)$ for (1.4) and $h(X) = \rho\|X\|_1$ for (1.5). Note that $h(X)$ is convex in both cases. Problem (2.2) can also be viewed as the following inclusion problem:

(2.3) Find $X$, s.t. $0 \in -\Sigma + \partial I_{\mathcal{C}}(X) + \partial h(X).$

Problem (2.3) finds a zero of the sum of two monotone operators. Methods based on operator-splitting techniques, such as the Douglas-Rachford method and the Peaceman-Rachford method, are usually used to solve Problem (2.3) (see, e.g., [11, 30, 23, 13, 14, 6, 7]). From the convex optimization perspective, the alternating direction method of multipliers (ADMM) for solving (2.2) is a direct application of the Douglas-Rachford method. ADMM has been successfully used to solve structured convex optimization problems arising from image processing, compressed sensing, machine learning, semidefinite programming, etc. (see, e.g., [16, 15, 36, 38, 19, 39, 33, 17, 18, 26, 37, 2]). We now show how ADMM can be used to solve the sparse PCA problem (2.2).

ADMM is based on a variable-splitting technique and an augmented Lagrangian framework. By introducing a new variable $Y$, (2.2) can be rewritten as

(2.4) $\min\ -\langle \Sigma, X\rangle + I_{\mathcal{C}}(X) + h(Y), \quad \text{s.t. } X = Y.$

Note that although the number of variables is increased, the two nonsmooth functions $I_{\mathcal{C}}(\cdot)$ and $h(\cdot)$ are now separated, since they are associated with different variables. For this equality-constrained problem, the augmented Lagrangian method is a standard approach. A typical iteration of the augmented Lagrangian method for solving (2.4) is given by:

(2.5) $(X^{k+1}, Y^{k+1}) := \arg\min_{(X,Y)} \mathcal{L}_\mu(X, Y; \Lambda^k), \qquad \Lambda^{k+1} := \Lambda^k - (X^{k+1} - Y^{k+1})/\mu,$

where the augmented Lagrangian function $\mathcal{L}_\mu(X, Y; \Lambda)$ is defined as

(2.6) $\mathcal{L}_\mu(X, Y; \Lambda) := -\langle \Sigma, X\rangle + I_{\mathcal{C}}(X) + h(Y) - \langle \Lambda, X - Y\rangle + \tfrac{1}{2\mu}\|X - Y\|_F^2,$

and $\mu > 0$ is a penalty parameter for the violation of the constraint $X = Y$. Note that it is usually hard to minimize the augmented Lagrangian function $\mathcal{L}_\mu(X, Y; \Lambda^k)$ with respect to $X$ and $Y$ simultaneously; in fact, it is as difficult as solving the original problem (2.4). However, if we minimize the augmented Lagrangian function with respect to $X$ and $Y$ alternatingly, we obtain two subproblems in each iteration, and both of them are relatively easy to solve. This results in the following alternating direction method of multipliers:

(2.7) $X^{k+1} := \arg\min_X \mathcal{L}_\mu(X, Y^k; \Lambda^k), \qquad Y^{k+1} := \arg\min_Y \mathcal{L}_\mu(X^{k+1}, Y; \Lambda^k), \qquad \Lambda^{k+1} := \Lambda^k - (X^{k+1} - Y^{k+1})/\mu.$

It can be shown that the two subproblems in (2.7) are both relatively easy to solve for the sparse PCA problem. Before we do that, we characterize two nice properties of the indicator function (2.1).

• Property 1.
The proximal mapping of the indicator function $I_{\mathcal{A}}(\cdot)$ is the Euclidean projection onto $\mathcal{A}$, i.e.,

(2.8) $\mathrm{prox}_{I_{\mathcal{A}}}(X) \equiv P_{\mathcal{A}}(X),$

where

(2.9) $\mathrm{prox}_{I_{\mathcal{A}}}(X) := \arg\min_U \left\{ I_{\mathcal{A}}(U) + \tfrac{1}{2}\|U - X\|_F^2 \right\},$

and

(2.10) $P_{\mathcal{A}}(X) := \arg\min_U \left\{ \tfrac{1}{2}\|U - X\|_F^2,\ \text{s.t. } U \in \mathcal{A} \right\}.$

• Property 2.
The optimality conditions for Problem (2.10) are given by

(2.11) $X - U^* \in \partial I_{\mathcal{A}}(U^*),$

which is equivalent to

(2.12) $\langle X - U^*, Z - U^*\rangle \le 0, \quad \forall Z \in \mathcal{A},$

where $U^*$ is the optimal solution of (2.10).

Now, the first subproblem in (2.7) can be reduced to

(2.13) $X^{k+1} := \arg\min \left\{ \mu I_{\mathcal{C}}(X) + \tfrac{1}{2}\|X - (Y^k + \mu\Lambda^k + \mu\Sigma)\|_F^2 \right\},$

which can be further reduced to a projection onto $\mathcal{C}$ using Property 1,

(2.14) $X^{k+1} = P_{\mathcal{C}}(Y^k + \mu\Lambda^k + \mu\Sigma) := \arg\min \left\{ \tfrac{1}{2}\|X - (Y^k + \mu\Lambda^k + \mu\Sigma)\|_F^2,\ \text{s.t. } X \in \mathcal{C} \right\}.$

When $h(Y) = I_{\mathcal{B}}(Y)$ as in Problem (1.4), the second subproblem in (2.7) can be reduced to

(2.15) $Y^{k+1} := \arg\min \left\{ \mu I_{\mathcal{B}}(Y) + \tfrac{1}{2}\|Y - (X^{k+1} - \mu\Lambda^k)\|_F^2 \right\},$

which can be further reduced to a projection onto $\mathcal{B}$ using Property 1,

(2.16) $Y^{k+1} = P_{\mathcal{B}}(X^{k+1} - \mu\Lambda^k) := \arg\min \left\{ \tfrac{1}{2}\|Y - (X^{k+1} - \mu\Lambda^k)\|_F^2,\ \text{s.t. } Y \in \mathcal{B} \right\}.$

When $h(Y) = \rho\|Y\|_1$ as in Problem (1.5), the second subproblem in (2.7) can be reduced to

(2.17) $Y^{k+1} := \arg\min \left\{ \mu\rho\|Y\|_1 + \tfrac{1}{2}\|Y - (X^{k+1} - \mu\Lambda^k)\|_F^2 \right\}.$

Problem (2.17) has a closed-form solution given by

(2.18) $Y^{k+1} = \mathrm{Shrink}(X^{k+1} - \mu\Lambda^k,\ \mu\rho),$

where the shrinkage operator is defined as

(2.19) $(\mathrm{Shrink}(Z, \tau))_{ij} := \mathrm{sgn}(Z_{ij})\max\{|Z_{ij}| - \tau,\ 0\}, \quad \forall i, j.$

In the following, we show that (2.13) and (2.15) are easy to solve, i.e., that the two projections (2.14) and (2.16) can be computed efficiently. First, since the problem of projection onto $\mathcal{C}$,

(2.20) $P_{\mathcal{C}}(X) = \arg\min \left\{ \tfrac{1}{2}\|Z - X\|_F^2,\ \text{s.t. } \mathrm{Tr}(Z) = 1,\ Z \succeq 0 \right\},$
is unitarily invariant, its solution is given by $P_{\mathcal{C}}(X) = U\mathrm{diag}(\gamma)U^\top$, where $X = U\mathrm{diag}(\sigma)U^\top$ is the eigenvalue decomposition of $X$ and $\gamma$ is the projection of $\sigma$ onto the simplex in the Euclidean space, i.e.,

(2.21) $\gamma := \arg\min \left\{ \tfrac{1}{2}\|\xi - \sigma\|_2^2,\ \text{s.t. } \sum_{i=1}^{p}\xi_i = 1,\ \xi \ge 0 \right\}.$

We consider a slightly more general problem

(2.22) $\xi^* := \arg\min \left\{ \tfrac{1}{2}\|\xi - \sigma\|_2^2,\ \text{s.t. } \sum_{i=1}^{p}\xi_i = r,\ \xi \ge 0 \right\},$

where the scalar $r > 0$. Note that (2.21) is a special case of (2.22) with $r = 1$. From the first-order optimality conditions for (2.22), it is easy to show that the optimal solution of (2.22) is given by

$\xi_i^* := \max\{\sigma_i - \theta,\ 0\}, \quad \forall i = 1, \ldots, p,$

where the scalar $\theta$ is the solution of the following piecewise linear equation:

(2.23) $\sum_{i=1}^{p} \max\{\sigma_i - \theta,\ 0\} = r.$

It is known that the piecewise linear equation (2.23) can be solved quite efficiently, and thus solving (2.22) can be done easily. In fact, the following procedure (Algorithm 1) gives the optimal solution of (2.22). We refer the reader to [31] for a proof of the validity of the algorithm.
Algorithm 1: Projection onto the simplex in the Euclidean space

Input: a vector $\sigma \in \mathbb{R}^p$ and a scalar $r > 0$.
1. Sort $\sigma$ into $\hat\sigma$ in non-decreasing order: $\hat\sigma_1 \le \hat\sigma_2 \le \ldots \le \hat\sigma_p$.
2. Find $\hat{j}$, the smallest index $j$ such that $\hat\sigma_j - \frac{1}{p-j+1}\left(\sum_{i=j}^{p}\hat\sigma_i - r\right) > 0$.
3. Set $\theta = \frac{1}{p-\hat{j}+1}\left(\sum_{i=\hat{j}}^{p}\hat\sigma_i - r\right)$.
Output: the vector $\gamma$ with $\gamma_i = \max\{\sigma_i - \theta,\ 0\}$, $i = 1, \ldots, p$.

It is easy to see that Algorithm 1 has an $O(p\log p)$ complexity. Linear-time algorithms for solving (2.22) are studied in [4, 29, 12]. Thus, solving (2.13) corresponds to an eigenvalue decomposition and a projection onto the simplex in the Euclidean space, and both can be done efficiently.

Solving (2.15) (or equivalently (2.16)) corresponds to a projection onto the $\ell_1$-ball $\|Y\|_1 \le K$. It has been shown in [12, 35] that projection onto the $\ell_1$-ball can be done easily. In fact, the solution of

(2.24) $\hat\gamma = \arg\min \left\{ \tfrac{1}{2}\|\xi - \hat\sigma\|_2^2,\ \text{s.t. } \|\xi\|_1 \le r \right\}$

is given by $\hat\gamma_i = \mathrm{sgn}(\hat\sigma_i)\gamma_i$, $\forall i = 1, \ldots, p$, where $\gamma$ is the solution of

$\min\ \tfrac{1}{2}\|\gamma - |\hat\sigma|\|_2^2, \quad \text{s.t. } \sum_{i=1}^{p}\gamma_i = r,\ \gamma \ge 0,$

i.e., the projection of $|\hat\sigma|$ (the elementwise absolute value of $\hat\sigma$) onto the simplex. Thus, (2.15) can be rewritten as

(2.25) $\mathrm{vec}(Y^{k+1}) = \arg\min \left\{ \tfrac{1}{2}\|y - \mathrm{vec}(X^{k+1} - \mu\Lambda^k)\|_2^2,\ \text{s.t. } \|y\|_1 \le K \right\},$

and it corresponds to a projection onto the simplex in the Euclidean space, where $\mathrm{vec}(Y)$ denotes the vector form of $Y$ obtained by stacking the columns of $Y$ into a long vector.

To summarize, our ADMM for solving (1.4) and (1.5) can be uniformly described as Algorithm 2.

Algorithm 2: ADMM for solving (1.4) and (1.5)

Initialization: $Y^0 = 0$, $\Lambda^0 = 0$.
for $k = 0, 1, \ldots$ do
  Compute the eigenvalue decomposition $Y^k + \mu\Lambda^k + \mu\Sigma = U\mathrm{diag}(\sigma)U^\top$.
  Project $\sigma$ onto the simplex in the Euclidean space by Algorithm 1, and denote the solution by $\gamma$.
  Compute $X^{k+1} = U\mathrm{diag}(\gamma)U^\top$.
  Perform one of the following:
    • if (1.4) is solved, update $Y^{k+1}$ by solving (2.25);
    • if (1.5) is solved, update $Y^{k+1}$ by (2.18).
  Update $\Lambda^{k+1} = \Lambda^k - (X^{k+1} - Y^{k+1})/\mu$.

Remark 2.1.
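For concreteness, Algorithm 1 admits a short vectorized implementation. The following NumPy sketch uses the equivalent decreasing-order form of the same computation of $\theta$; the function name and interface are ours and only illustrate the procedure.

```python
import numpy as np

def project_simplex(sigma, r=1.0):
    """Euclidean projection of sigma onto {xi : xi >= 0, sum(xi) = r} (cf. Algorithm 1)."""
    s = np.sort(sigma)[::-1]                        # sort in decreasing order
    cssum = np.cumsum(s) - r                        # running sums shifted by the radius r
    idx = np.arange(1, sigma.size + 1)
    rho = np.nonzero(s - cssum / idx > 0)[0][-1]    # last index kept in the support
    theta = cssum[rho] / (rho + 1.0)                # solves sum_i max(sigma_i - theta, 0) = r
    return np.maximum(sigma - theta, 0.0)
```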
Although Algorithm 2 suggests that we need the full eigenvalue decomposition of $Y^k + \mu\Lambda^k + \mu\Sigma$ in order to obtain the solution of (2.13), we actually only need to compute the positive eigenvalues and the corresponding eigenvectors of $Y^k + \mu\Lambda^k + \mu\Sigma$.
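Putting the pieces together, the whole of Algorithm 2 is only a few lines on top of an eigenvalue decomposition. Below is a minimal NumPy sketch reusing the `project_simplex` helper above; the function names, defaults and the relative stopping test (the criterion described later in Section 4) are our choices and only illustrate the updates (2.14), (2.16) and (2.18).

```python
import numpy as np

def shrink(Z, tau):
    """Entrywise soft-thresholding, the Shrink operator of (2.19)."""
    return np.sign(Z) * np.maximum(np.abs(Z) - tau, 0.0)

def project_l1_ball(Z, K):
    """Projection onto {Y : ||Y||_1 <= K} via a simplex projection of |vec(Z)| (cf. (2.24)-(2.25))."""
    z = Z.ravel()
    if np.abs(z).sum() <= K:
        return Z.copy()
    gamma = project_simplex(np.abs(z), K)
    return (np.sign(z) * gamma).reshape(Z.shape)

def admm_sparse_pca(Sigma, mu=0.8, rho=None, K=None, max_iter=1000, tol=1e-5):
    """Sketch of Algorithm 2: pass rho to solve (1.5), or K to solve (1.4)."""
    assert (rho is None) != (K is None), "specify exactly one of rho (1.5) or K (1.4)"
    p = Sigma.shape[0]
    Y = np.zeros((p, p))
    Lam = np.zeros((p, p))
    for _ in range(max_iter):
        # X-update (2.14): project Y^k + mu*Lam^k + mu*Sigma onto C = {X >= 0, Tr(X) = 1}
        W = Y + mu * Lam + mu * Sigma
        w, U = np.linalg.eigh((W + W.T) / 2)
        X = (U * project_simplex(w, 1.0)) @ U.T
        # Y-update: shrinkage (2.18) for problem (1.5), or l1-ball projection (2.16) for (1.4)
        V = X - mu * Lam
        Y = shrink(V, mu * rho) if rho is not None else project_l1_ball(V, K)
        # multiplier update
        Lam = Lam - (X - Y) / mu
        gap = np.linalg.norm(X - Y, 'fro')
        if gap <= tol * max(1.0, np.linalg.norm(X, 'fro'), np.linalg.norm(Y, 'fro')):
            break
    return X, Y
```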
3. Global Convergence Results.
In this section, we prove that the sequence $(X^k, Y^k, \Lambda^k)$ produced by the alternating direction method of multipliers (2.7) (i.e., Algorithm 2) converges to $(X^*, Y^*, \Lambda^*)$, where $(X^*, Y^*)$ is an optimal solution of (2.4) and $\Lambda^*$ is the corresponding optimal dual variable. Although the global convergence of ADMM has been studied extensively in the literature (see, e.g., [14, 20]), we give here a very simple proof that utilizes the special structure of the sparse PCA problem. We only prove the case $h(Y) = I_{\mathcal{B}}(Y)$ and leave the case $h(Y) = \rho\|Y\|_1$ to the reader, since the proofs are almost identical.

Before we give the main theorem about the global convergence of (2.7) (Algorithm 2), we need the following lemma.

Lemma 3.1.
Assume that $(X^*, Y^*)$ is an optimal solution of (2.4) and $\Lambda^*$ is the corresponding optimal dual variable associated with the equality constraint $X = Y$. Then the sequence $(X^k, Y^k, \Lambda^k)$ produced by (2.7) satisfies

(3.1) $\|U^k - U^*\|_G^2 - \|U^{k+1} - U^*\|_G^2 \ge \|U^k - U^{k+1}\|_G^2,$

where

$U^* = \begin{pmatrix} \Lambda^* \\ Y^* \end{pmatrix}, \quad U^k = \begin{pmatrix} \Lambda^k \\ Y^k \end{pmatrix}, \quad G = \begin{pmatrix} \mu I & 0 \\ 0 & \frac{1}{\mu} I \end{pmatrix},$

the norm $\|\cdot\|_G$ is defined by $\|U\|_G^2 = \langle U, GU\rangle$, and the corresponding inner product is $\langle U, V\rangle_G = \langle U, GV\rangle$.

Proof. Since $(X^*, Y^*, \Lambda^*)$ is optimal for (2.4), it follows from the KKT conditions that

(3.2) $0 \in -\Sigma + \partial I_{\mathcal{C}}(X^*) - \Lambda^*,$

(3.3) $0 \in \partial I_{\mathcal{B}}(Y^*) + \Lambda^*,$

and

(3.4) $X^* = Y^* \in \mathcal{C} \cap \mathcal{B}.$

By using Property 2, (3.2) and (3.3) can be respectively reduced to

(3.5) $\langle \Sigma + \Lambda^*,\ X - X^*\rangle \le 0, \quad \forall X \in \mathcal{C},$

and

(3.6) $\langle -\Lambda^*,\ Y - Y^*\rangle \le 0, \quad \forall Y \in \mathcal{B}.$

Note that the optimality conditions for the first subproblem (i.e., the subproblem with respect to $X$) in (2.7) are given by $X^{k+1} \in \mathcal{C}$ and

(3.7) $0 \in -\Sigma + \partial I_{\mathcal{C}}(X^{k+1}) - \Lambda^k + \tfrac{1}{\mu}(X^{k+1} - Y^k).$

By using Property 2 and the updating formula for $\Lambda^k$ in (2.7), i.e.,

(3.8) $\Lambda^{k+1} = \Lambda^k - (X^{k+1} - Y^{k+1})/\mu,$

(3.7) can be rewritten as

(3.9) $\langle \Sigma + \Lambda^{k+1} + \tfrac{1}{\mu}(Y^k - Y^{k+1}),\ X - X^{k+1}\rangle \le 0, \quad \forall X \in \mathcal{C}.$

Letting $X = X^{k+1}$ in (3.5) and $X = X^*$ in (3.9), and summing the two resulting inequalities, we get

(3.10) $\langle \Lambda^{k+1} - \Lambda^* + \tfrac{1}{\mu}(Y^k - Y^{k+1}),\ X^* - X^{k+1}\rangle \le 0.$

The optimality conditions for the second subproblem (i.e., the subproblem with respect to $Y$) in (2.7) are given by $Y^{k+1} \in \mathcal{B}$ and

(3.11) $0 \in \partial I_{\mathcal{B}}(Y^{k+1}) + \Lambda^k + \tfrac{1}{\mu}(Y^{k+1} - X^{k+1}).$

By using Property 2 and (3.8), (3.11) can be rewritten as

(3.12) $\langle -\Lambda^{k+1},\ Y - Y^{k+1}\rangle \le 0, \quad \forall Y \in \mathcal{B}.$

Letting $Y = Y^{k+1}$ in (3.6) and $Y = Y^*$ in (3.12), and summing the two resulting inequalities, we obtain

(3.13) $\langle \Lambda^* - \Lambda^{k+1},\ Y^* - Y^{k+1}\rangle \le 0.$

Summing (3.10) and (3.13), and using the facts that $X^* = Y^*$ and $X^{k+1} = \mu(\Lambda^k - \Lambda^{k+1}) + Y^{k+1}$, we obtain

(3.14) $\mu\langle \Lambda^k - \Lambda^{k+1},\ \Lambda^{k+1} - \Lambda^*\rangle + \tfrac{1}{\mu}\langle Y^k - Y^{k+1},\ Y^{k+1} - Y^*\rangle \ge -\langle Y^k - Y^{k+1},\ \Lambda^k - \Lambda^{k+1}\rangle.$

Rearranging the left-hand side of (3.14) by using $\Lambda^{k+1} - \Lambda^* = (\Lambda^{k+1} - \Lambda^k) + (\Lambda^k - \Lambda^*)$ and $Y^{k+1} - Y^* = (Y^{k+1} - Y^k) + (Y^k - Y^*)$, we get

(3.15) $\mu\langle \Lambda^k - \Lambda^*,\ \Lambda^k - \Lambda^{k+1}\rangle + \tfrac{1}{\mu}\langle Y^k - Y^*,\ Y^k - Y^{k+1}\rangle \ge \mu\|\Lambda^k - \Lambda^{k+1}\|^2 + \tfrac{1}{\mu}\|Y^k - Y^{k+1}\|^2 - \langle \Lambda^k - \Lambda^{k+1},\ Y^k - Y^{k+1}\rangle.$

Using the notation $U^k$, $U^*$ and $G$, (3.15) can be rewritten as

(3.16) $\langle U^k - U^*,\ U^k - U^{k+1}\rangle_G \ge \|U^k - U^{k+1}\|_G^2 - \langle \Lambda^k - \Lambda^{k+1},\ Y^k - Y^{k+1}\rangle.$

Combining (3.16) with the identity

$\|U^{k+1} - U^*\|_G^2 = \|U^{k+1} - U^k\|_G^2 - 2\langle U^k - U^{k+1},\ U^k - U^*\rangle_G + \|U^k - U^*\|_G^2,$

we get

(3.17) $\|U^k - U^*\|_G^2 - \|U^{k+1} - U^*\|_G^2 = 2\langle U^k - U^{k+1},\ U^k - U^*\rangle_G - \|U^{k+1} - U^k\|_G^2 \ge 2\|U^k - U^{k+1}\|_G^2 - 2\langle \Lambda^k - \Lambda^{k+1},\ Y^k - Y^{k+1}\rangle - \|U^{k+1} - U^k\|_G^2 = \|U^k - U^{k+1}\|_G^2 - 2\langle \Lambda^k - \Lambda^{k+1},\ Y^k - Y^{k+1}\rangle.$

Now, using (3.12) with $k$ in place of $k+1$ and letting $Y = Y^{k+1}$, we get

(3.18) $\langle -\Lambda^k,\ Y^{k+1} - Y^k\rangle \le 0.$

Letting $Y = Y^k$ in (3.12) and adding it to (3.18) yields

(3.19) $\langle \Lambda^k - \Lambda^{k+1},\ Y^k - Y^{k+1}\rangle \le 0.$

Substituting (3.19) into (3.17), we get the desired result (3.1).

We are now ready to give the main convergence result for (2.7) (Algorithm 2).
Theorem 3.2.
The sequence $\{(X^k, Y^k, \Lambda^k)\}$ produced by (2.7) (Algorithm 2) from any starting point converges to an optimal solution of Problem (2.4).

Proof. From Lemma 3.1 we easily get that

(i) $\|U^k - U^{k+1}\|_G \to 0$;
(ii) $\{U^k\}$ lies in a compact region;
(iii) $\|U^k - U^*\|_G^2$ is monotonically non-increasing and thus converges.

It follows from (i) that $\Lambda^k - \Lambda^{k+1} \to 0$ and $Y^k - Y^{k+1} \to 0$. Then (3.8) implies that $X^k - X^{k+1} \to 0$ and $X^k - Y^k \to 0$. From (ii) we obtain that $\{U^k\}$ has a subsequence $\{U^{k_j}\}$ that converges to $\hat{U} = (\hat\Lambda, \hat{Y})$, i.e., $\Lambda^{k_j} \to \hat\Lambda$ and $Y^{k_j} \to \hat{Y}$. From $X^k - Y^k \to 0$ we get $X^{k_j} \to \hat{X} := \hat{Y}$. Therefore, $(\hat{X}, \hat{Y}, \hat\Lambda)$ is a limit point of $\{(X^k, Y^k, \Lambda^k)\}$.

Note that by using (3.8), (3.7) can be rewritten as

(3.20) $0 \in -\Sigma + \partial I_{\mathcal{C}}(X^{k+1}) - \Lambda^{k+1} + \tfrac{1}{\mu}(Y^{k+1} - Y^k),$

which implies that

(3.21) $0 \in -\Sigma + \partial I_{\mathcal{C}}(\hat{X}) - \hat\Lambda.$

Note also that (3.11) implies that

(3.22) $0 \in \partial I_{\mathcal{B}}(\hat{Y}) + \hat\Lambda.$

Moreover, it follows from $X^k \in \mathcal{C}$ and $Y^k \in \mathcal{B}$ that

(3.23) $\hat{X} \in \mathcal{C}$ and $\hat{Y} \in \mathcal{B}.$

(3.21), (3.22), (3.23) together with $\hat{X} = \hat{Y}$ imply that $(\hat{X}, \hat{Y}, \hat\Lambda)$ satisfies the KKT conditions for (2.4) and thus is an optimal solution of (2.4). Therefore, any limit point of $\{(X^k, Y^k, \Lambda^k)\}$ is an optimal solution of (2.4); together with (iii), applied with $U^*$ taken to be such a limit point, this implies that the whole sequence converges to an optimal solution of (2.4).
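Beyond the proof, the conclusions of this section are easy to check numerically: along the iterations of Algorithm 2 both the $G$-norm of successive differences $\|U^k - U^{k+1}\|_G$ and the gap $\|X^k - Y^k\|_F$ decay to zero. A small sketch of such a check, reusing the helpers defined after Algorithm 2; the test matrix below is an arbitrary random instance, not one of the paper's test problems.

```python
import numpy as np

def monitor_convergence(Sigma, mu=0.8, K=4, iters=300):
    """Run the (1.4)-version of Algorithm 2 and record ||U^k - U^{k+1}||_G and ||X^k - Y^k||_F,
    where U = (Lambda; Y) and G = diag(mu*I, I/mu) as in Lemma 3.1."""
    p = Sigma.shape[0]
    Y, Lam = np.zeros((p, p)), np.zeros((p, p))
    g_diffs, gaps = [], []
    for _ in range(iters):
        W = Y + mu * Lam + mu * Sigma
        w, U = np.linalg.eigh((W + W.T) / 2)
        X = (U * project_simplex(w, 1.0)) @ U.T            # projection onto C
        Y_new = project_l1_ball(X - mu * Lam, K)           # projection onto B
        Lam_new = Lam - (X - Y_new) / mu
        g_diffs.append(np.sqrt(mu * np.linalg.norm(Lam - Lam_new, 'fro') ** 2
                               + np.linalg.norm(Y - Y_new, 'fro') ** 2 / mu))
        gaps.append(np.linalg.norm(X - Y_new, 'fro'))
        Y, Lam = Y_new, Lam_new
    return g_diffs, gaps

# Example: both recorded sequences should tend to zero.
# A = np.random.randn(30, 30); g, gap = monitor_convergence(A @ A.T / 30)
```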
4. The Deflation Techniques and Other Practical Issues.
It should be noted that the solution of Problem (1.1) only gives the largest eigenvector (the eigenvector corresponding to the largest eigenvalue) of $\Sigma$. In many applications, the largest eigenvector is not enough to explain the total variance of the data, and one usually needs to compute several leading eigenvectors. Hotelling's deflation method [32] is usually used to extract the leading eigenvectors sequentially. Hotelling's deflation method extracts the $r$-th leading eigenvector of $\Sigma$ by solving

$x_r = \arg\max\ \{x^\top \Sigma_{r-1} x,\ \text{s.t. } \|x\|_2 \le 1\},$

where $\Sigma_0 := \Sigma$ and

$\Sigma_r = \Sigma_{r-1} - x_r x_r^\top \Sigma_{r-1} x_r x_r^\top.$

It is easy to verify that Hotelling's deflation method preserves the positive semidefiniteness of $\Sigma_r$ when $x_r$ is an exact eigenvector of $\Sigma_{r-1}$. However, as pointed out in [25], it does not preserve the positive semidefiniteness of $\Sigma_r$ when it comes to the sparse PCA problem (1.2), because the solution $x_r$ is no longer an eigenvector of $\Sigma_{r-1}$. Thus, the second leading eigenvector produced by solving the sparse PCA problem may not explain the variance of the data well. We point out that the deflation method used in [9] is Hotelling's deflation method.

Several deflation techniques that overcome this difficulty for sparse PCA were proposed by Mackey in [25]. In our numerical experiments, we chose to use the Schur complement deflation method in [25], which updates the matrix $\Sigma_r$ by

(4.1) $\Sigma_r = \Sigma_{r-1} - \dfrac{\Sigma_{r-1} x_r x_r^\top \Sigma_{r-1}}{x_r^\top \Sigma_{r-1} x_r}.$

The Schur complement deflation method has the following properties, as shown in [25]: (i) Schur complement deflation preserves the positive semidefiniteness of $\Sigma_r$, i.e., $\Sigma_r \succeq 0$;
(ii) Schur complement deflation renders $x_s$ orthogonal to $\Sigma_r$ for $s \le r$, i.e., $\Sigma_r x_s = 0$, $\forall s \le r$.

When we want to find the leading $r$ sparse PCs of $\Sigma$, we use ADMM to sequentially solve $r$ problems (1.4) or (1.5), with $\Sigma$ updated by the Schur complement deflation method (4.1). We denote the leading $r$ sparse PCs obtained by our ADMM by $X_r = (x_1, \ldots, x_r)$. Usually the total variance explained by $X_r$ is given by $\mathrm{Tr}(X_r^\top \Sigma X_r)$. However, because we do not require $x_1, \ldots, x_r$ to be orthogonal to each other when we sequentially solve the SDPs (1.4) or (1.5), these loadings are correlated, and $\mathrm{Tr}(X_r^\top \Sigma X_r)$ overestimates the total variance explained by $x_1, \ldots, x_r$. To alleviate this overestimation, Zou et al. [42] suggested that the explained total variance be computed by the following procedure, called the adjusted variance:

$\mathrm{AdjVar}(X_r) := \mathrm{Tr}(R^2),$

where $X_r = QR$ is the QR decomposition of $X_r$. In our numerical experiments, we always report the adjusted variance as the explained variance.

It is also worth noticing that the problems we solve are convex relaxations of the original problems (1.2) and (1.9). Hence, one needs to postprocess the matrix $X$ obtained by solving (1.4) or (1.5) to get the solution of (1.2) or (1.9). To obtain this solution from the solution $X$ of the convex SDP problem, we simply perform a rank-one decomposition of $X$, i.e., $X = xx^\top$. Since $X$ is a sparse matrix, $x$ should be a sparse vector. This postprocessing technique is also used in [9].

Since the sequences $\{X^k\}$ and $\{Y^k\}$ generated by ADMM eventually converge to the same point, we terminate ADMM when the difference between $X^k$ and $Y^k$ is sufficiently small. In our numerical experiments, we terminate ADMM when

$\dfrac{\|X^k - Y^k\|_F}{\max\{1,\ \|X^k\|_F,\ \|Y^k\|_F\}}$

falls below a prescribed tolerance.
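For reference, the Schur complement update (4.1), the sequential extraction of several sparse PCs and the rank-one post-processing described above fit in a few lines. The sketch below reuses the `admm_sparse_pca` helper from Section 2; the driver name, the zero threshold used to clean the recovered loading and the renormalization are our own choices.

```python
import numpy as np

def schur_deflate(Sigma, x):
    """Schur complement deflation (4.1); keeps the deflated matrix positive semidefinite."""
    Sx = Sigma @ x
    return Sigma - np.outer(Sx, Sx) / float(x @ Sx)

def leading_sparse_pcs(Sigma, n_components, K, mu=0.8, zero_tol=1e-6):
    """Extract several sparse loading vectors by solving (1.4) sequentially with deflation."""
    loadings, S = [], Sigma.copy()
    for _ in range(n_components):
        X, _ = admm_sparse_pca(S, mu=mu, K=K)
        w, U = np.linalg.eigh((X + X.T) / 2)
        x = U[:, -1]                                  # rank-one decomposition X ~ x x^T
        x = np.where(np.abs(x) < zero_tol, 0.0, x)    # drop numerically tiny entries
        x /= np.linalg.norm(x)
        loadings.append(x)
        S = schur_deflate(S, x)                       # deflate before computing the next PC
    return np.column_stack(loadings)
```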
5. Numerical Results.
In this section, we use our ADMM to solve the SDP formulations (1.4) and (1.5) of sparse PCA on both synthetic and real data sets. We compare the performance of ADMM with two methods for sparse PCA: DSPCA [9] for solving (1.5) and ALSPCA [24] for solving (1.11). The MATLAB codes of DSPCA and ALSPCA were downloaded from the authors' websites. Note that the main parts of the DSPCA code are actually written in C-Mex files, while our code is written in MATLAB. All experiments were run in MATLAB 7.3.0 on a laptop with a 1.66 GHz CPU and 2 GB of RAM.
We tested our ADMM on a synthetic data set suggested by Zou et al. in [42]. This synthetic example has three hidden factors:

$V_1 \sim N(0, 290), \quad V_2 \sim N(0, 300), \quad V_3 = -0.3\,V_1 + 0.925\,V_2 + \epsilon, \quad \epsilon \sim N(0, 1),$

where $V_1$, $V_2$ and $\epsilon$ are independent. The 10 observable variables are generated as follows:

$X_i = V_1 + \epsilon_i^1, \quad \epsilon_i^1 \sim N(0, 1), \quad i = 1, 2, 3, 4,$
$X_i = V_2 + \epsilon_i^2, \quad \epsilon_i^2 \sim N(0, 1), \quad i = 5, 6, 7, 8,$
$X_i = V_3 + \epsilon_i^3, \quad \epsilon_i^3 \sim N(0, 1), \quad i = 9, 10,$

where the $\epsilon_i^j$ are independent for $j = 1, 2, 3$, $i = 1, \ldots, 10$. The exact covariance matrix of $(X_1, \ldots, X_{10})$ is used to find the standard PCs by standard PCA and the sparse PCs by ADMM. Note that the variances of $V_1$, $V_2$ and $V_3$ indicate that $V_2$ is slightly more important than $V_1$, and that both are more important than $V_3$. Also, the first two PCs explain more than 99% of the total variance. Thus, using the first two PCs should explain most of the variance: ideally, the first sparse PC recovers the variance of $V_2$ using $(X_5, X_6, X_7, X_8)$ and the second sparse PC recovers the variance of $V_1$ using $(X_1, X_2, X_3, X_4)$. Based on these observations, we set $K = 4$ in (1.4) for computing both the first and the second sparse PCs. When computing the second sparse PC, we used the Schur complement deflation method described in Section 4 to construct the corresponding sample covariance matrix. The penalty parameter $\mu$ in ADMM was set to 0.8. The PCs given by standard PCA and by sparse PCA using our ADMM for solving (1.4), together with the explained variances, are shown in Table 5.1. From Table 5.1 we see that ADMM gives sparse loadings using $(X_5, X_6, X_7, X_8)$ in the first PC and $(X_1, X_2, X_3, X_4)$ in the second PC. The first two sparse PCs explain 80.41% of the total variance.
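The exact covariance matrix of $(X_1, \ldots, X_{10})$ used above has a simple closed form, since every observable loads on the factors $(V_1, V_2, \epsilon)$ with unit observation noise added. A sketch of how it can be assembled, reusing the `admm_sparse_pca` helper from Section 2; the factor parameters are those of the Zou et al. example as stated above.

```python
import numpy as np

def synthetic_covariance():
    """Exact covariance of the ten observables in the three-factor example of Zou et al. [42]."""
    L = np.zeros((10, 3))                # loadings of X_1..X_10 on (V1, V2, eps)
    L[0:4, 0] = 1.0                      # X_1..X_4 = V1 + noise
    L[4:8, 1] = 1.0                      # X_5..X_8 = V2 + noise
    L[8:10, :] = [-0.3, 0.925, 1.0]      # X_9, X_10 = V3 + noise, V3 = -0.3 V1 + 0.925 V2 + eps
    D = np.diag([290.0, 300.0, 1.0])     # Var(V1), Var(V2), Var(eps)
    return L @ D @ L.T + np.eye(10)      # + I for the independent observation noises

Sigma = synthetic_covariance()
X, Y = admm_sparse_pca(Sigma, mu=0.8, K=4)   # first sparse PC of (1.4) for this model
```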
Table 5.1
Loadings and explained variance for the first two PCs
                        Standard PCA              ADMM
Variables               PC1         PC2           PC1        PC2
X1                      -0.1157     -0.4785       0          0.5000
X2                      -0.1157     -0.4785       0          0.5000
X3                      -0.1157     -0.4785       0          0.5000
X4                      -0.1157     -0.4785       0          0.5000
X5                      ·           ·             ·          ·
X6                      ·           ·             ·          ·
X7                      ·           ·             ·          ·
X8                      ·           ·             ·          ·
X9                      ·           ·             ·          ·
X10                     ·           ·             ·          ·
Explained variance      99.68%                    80.41%
The pit props data set has been a standard benchmark for testing sparse PCA algorithms since it was introduced by Jeffers in [21]. It has 180 observations and 13 measured variables, so the covariance matrix $\Sigma$ is a $13 \times 13$ matrix.
We used our ADMM to compute the first six sparse PCs sequentially via the Schur complement deflation technique discussed in Section 4. We set $K = (6, \cdot, \cdot, \cdot, \cdot, 1)$ for the six problems (1.4), as suggested in [9].
We set $\mu = 0.\cdot$ in ADMM; for ALSPCA we set $r = 6$, $\rho = 0.\cdot$ and $\Delta_{ij} = 0.\cdot$ for all $i \neq j$. The results given by our ADMM are reported in Table 5.2, and the results given by ALSPCA are reported in Table 5.3. Since there was no clear way to choose the six parameters $\rho$ in the six problems (1.5) when DSPCA is used to solve them, we did not compare with DSPCA on this data set. From Tables 5.2 and 5.3 we see that both ADMM and ALSPCA give a solution with 15 nonzeros in the first six sparse PCs, and the solution given by ADMM explains slightly more variance than the solution given by ALSPCA.

Table 5.2
First six sparse PCs of the pit props data set given by ADMM
Variables   PC1       PC2       PC3      PC4      PC5      PC6
Topdiam     -0.4908   0         0        0        0        0
Length      -0.5067   0         0        0        0        0
Moist       0         -0.7175   0        0        0        0
Testsg      0         -0.6965   0        0        0        0
Ovensg      0         0         0.9263   0        0        0
Ringtop     -0.0668   0         0.3511   0        0        0
Ringbut     -0.3565   0         0.1369   0        0        0
Bowmax      -0.2334   0         0        0        0        0
Bowdist     -0.3861   0         0        0        0        0
Whorls      -0.4089   0         0        0        0        0
Clear       0         0         0        1.0000   0        0
Knots       0         0         0        0        1.0000   0
Diaknot     0         0         0        0        0        1.0000
Total sparsity: 15; total explained variance: 74.·%

Table 5.3
First six sparse PCs of the pit props data set given by ALSPCA
Variables   PC1       PC2       PC3       PC4      PC5      PC6
Topdiam     0.4052    0         0         0        0        0
Length      0.4248    0         0         0        0        0
Moist       0         -0.7262   0         0        0        0
Testsg      0.0014    -0.6874   0         0        0        0
Ovensg      0         0         -1.0000   0        0        0
Ringtop     0.1857    0         0         0        0        0
Ringbut     0.4122    0         0         0        0        0
Bowmax      0.3277    0         0         0        0        0
Bowdist     0.3829    0         0         0        0        0
Whorls      0.4437    0.0021    0         0        0        0
Clear       0         0         0         1.0000   0        0
Knots       0         0         0         0        1.0000   0
Diaknot     0         0         0         0        0        1.0000
Total sparsity: 15; total explained variance: 73.·%

We created some random examples to test the speed of ADMM and compared it with DSPCA [9] and ALSPCA [24]. The sample covariance matrix $\Sigma$ was created by adding some small noise to a sparse rank-one matrix. Specifically, we first created a sparse vector $\hat{x} \in \mathbb{R}^p$ with $s$ nonzeros randomly drawn from the Gaussian distribution $N(0, 1)$, and then set $\Sigma = \hat{x}\hat{x}^\top + \sigma vv^\top$, where $\sigma$ denotes the noise level and $v \in \mathbb{R}^p$ is a random vector with entries uniformly drawn from $[0, 1]$.
We tested noise levels $\sigma = 0.01$ and $\sigma = 0.1$. For DSPCA and ALSPCA, we tried different $\rho$'s to obtain solutions with different sparsity levels. Specifically, we tested DSPCA for $\rho = 0.01,\ 0.\cdot,\ 1$ in Table 5.4 with $\sigma = 0.01$ and for $\rho = 0.\cdot,\ 1,\ 10$ in Table 5.5 with $\sigma = 0.1$; we tested ALSPCA for $\rho = 0.01,\ 0.\cdot,\ 1$ in both tables. We used different $K$'s in (1.4) to control the sparsity level when using ADMM. In both Tables 5.4 and 5.5, we tested four data sets with dimension $p$ and sparsity $s$ set to $(p, s) = (100, 10)$, $(100, 20)$, $(200, 10)$ and $(200, 20)$. We used the following continuation technique for $\mu$ in ADMM: starting from $\mu_0 = 1$, $\mu$ was decreased by a constant factor at each iteration until it reached a prescribed lower bound. The parameters $\Delta_{ij}$ were set to 0.1 for all $i \neq j$ in all the ALSPCA tests, as suggested in [24] for random data sets.

We report the cardinality of the largest sparse PC (Card), the percentage of explained variance (PEV) and the CPU time in Tables 5.4 and 5.5. From Table 5.4 we see that, for $\sigma = 0.01$, all three algorithms DSPCA, ADMM and ALSPCA are sensitive to the parameters that control the sparsity, i.e., $\rho$ and $K$. The choice $\rho = 0.01$ always gave the best results for DSPCA and ALSPCA, with explained variance very close to that of standard PCA. When $\rho$ was increased to 1, the solutions given by ALSPCA were too sparse to give a relatively large explained variance, while the solutions given by DSPCA sometimes had more nonzeros than the desired sparsity level (when $(p, s) = (100, \cdot)$, $(100, 20)$ and $(200, \cdot)$). For ADMM we used $K = 5, 4, 3$ for $s = 10$ and $K = 10, 9, 8$ for $s = 20$. The results in Table 5.4 indicate that $K = s/2$ gives solutions whose explained variance is close to that of standard PCA. When $K$ was changed from 5 to 4 and 3 for $s = 10$, the sparsity and explained variance of the solution changed considerably, whereas when $K$ was changed from 10 to 9 and 8 for $s = 20$, the solution was not affected much in terms of explained variance. In particular, for $(p, s) = (100, 20)$ and $(200, 20)$ with $K = 8$, ADMM gave solutions with sparsity 13 that explain 96.10% and 93.66% of the variance respectively, both of which are very close to the results given by standard PCA.

From Table 5.5 we see that, for $\sigma = 0.1$, i.e., when the noise level was larger, DSPCA and ALSPCA were more sensitive to the noise than they were for $\sigma = 0.01$. More specifically, in Table 5.5 the DSPCA solutions for $\rho = 1$ and $\rho = 10$ were not very satisfactory, and for ALSPCA, when $(p, s) = (100, 10)$ and $(100, 20)$, the solutions for $\rho = 0.01$ and $\rho = 1$ were not very satisfactory. In contrast, the performance of ADMM for $\sigma = 0.1$ was similar to its performance for $\sigma = 0.01$, i.e., ADMM was not very sensitive to the noise.

From both Tables 5.4 and 5.5, we see that ALSPCA was slightly faster than ADMM, and both were significantly faster than DSPCA. This is reasonable because ALSPCA solves the nonconvex problem (1.11) and thus requires no eigenvalue decomposition, which accounts for most of the computational effort in DSPCA and ADMM.
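The noisy sparse rank-one covariance used in these speed tests can be generated as follows. The source only partially specifies the distributions, so the standard normal nonzeros and the uniform $[0, 1]$ noise entries below are assumptions consistent with the description above.

```python
import numpy as np

def random_test_covariance(p, s, sigma, seed=0):
    """Sparse rank-one matrix plus a small rank-one noise term: Sigma = x x^T + sigma * v v^T."""
    rng = np.random.default_rng(seed)
    x = np.zeros(p)
    support = rng.choice(p, size=s, replace=False)   # s randomly placed nonzeros
    x[support] = rng.standard_normal(s)              # assumed N(0, 1) nonzero values
    v = rng.uniform(0.0, 1.0, size=p)                # assumed U[0, 1] noise direction
    return np.outer(x, x) + sigma * np.outer(v, v)
```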
Table 5.4
Comparisons of ADMM, ALSPCA and DSPCA on random examples with σ = 0.01

(p, s)      Method   Parameters   Card   PEV      CPU
(100, 10)   PCA                          96.16%
            DSPCA    ρ = 0.01     10     96.16%   12.12
            DSPCA    ρ = 0.·      ·      ·        ·
            DSPCA    ρ = 1        13     87.28%   6.56
            ADMM     K = 5        9      95.30%   0.26
            ADMM     K = 4        6      91.55%   0.28
            ADMM     K = 3        4      79.30%   0.19
            ALSPCA   ρ = 0.01     10     96.16%   0.13
            ALSPCA   ρ = 0.·      ·      ·        ·
            ALSPCA   ρ = 1        4      89.54%   0.13
(100, 20)   PCA                          98.07%
            DSPCA    ρ = 0.01     20     98.07%   10.93
            DSPCA    ρ = 0.·      ·      ·        ·
            DSPCA    ρ = 1        20     85.25%   5.75
            ADMM     K = 10       20     97.98%   0.28
            ADMM     K = 9        18     97.40%   0.28
            ADMM     K = 8        13     96.10%   0.31
            ALSPCA   ρ = 0.01     20     98.07%   0.13
            ALSPCA   ρ = 0.·      ·      ·        ·
            ALSPCA   ρ = 1        8      83.87%   0.13
(200, 10)   PCA                          91.43%
            DSPCA    ρ = 0.01     10     91.42%   88.87
            DSPCA    ρ = 0.·      ·      ·        ·
            DSPCA    ρ = 1        8      82.91%   45.97
            ADMM     K = 5        9      90.61%   1.23
            ADMM     K = 4        6      87.04%   1.27
            ADMM     K = 3        4      75.40%   0.88
            ALSPCA   ρ = 0.01     10     91.42%   0.23
            ALSPCA   ρ = 0.·      ·      ·        ·
            ALSPCA   ρ = 1        4      85.13%   0.29
(200, 20)   PCA                          95.58%
            DSPCA    ρ = 0.01     20     95.58%   79.87
            DSPCA    ρ = 0.·      ·      ·        ·
            DSPCA    ρ = 1        20     83.09%   42.22
            ADMM     K = 10       20     95.49%   1.57
            ADMM     K = 9        18     94.93%   1.67
            ADMM     K = 8        13     93.66%   1.71
            ALSPCA   ρ = 0.01     20     95.58%   0.23
            ALSPCA   ρ = 0.·      ·      ·        ·
            ALSPCA   ρ = 1        8      81.73%   0.23

Table 5.5
Comparisons of ADMM, ALSPCA and DSPCA on random examples with σ = 0.1

(p, s)      Method   Parameters   Card   PEV      CPU
(100, 10)   PCA                          71.51%
            DSPCA    ρ = 0.·      ·      ·        ·
            DSPCA    ρ = 1        10     64.92%   4.40
            DSPCA    ρ = 10       1      27.04%   4.14
            ADMM     K = 5        9      70.83%   0.24
            ADMM     K = 4        6      68.03%   0.25
            ADMM     K = 3        4      58.93%   0.17
            ALSPCA   ρ = 0.01     31     71.49%   0.13
            ALSPCA   ρ = 0.·      ·      ·        ·
            ALSPCA   ρ = 1        4      66.53%   0.13
(100, 20)   PCA                          83.59%
            DSPCA    ρ = 0.·      ·      ·        ·
            DSPCA    ρ = 1        20     72.75%   5.82
            DSPCA    ρ = 10       56     26.80%   3.01
            ADMM     K = 10       20     83.50%   0.32
            ADMM     K = 9        18     83.02%   0.40
            ADMM     K = 8        13     81.90%   0.27
            ALSPCA   ρ = 0.01     25     83.58%   0.13
            ALSPCA   ρ = 0.·      ·      ·        ·
            ALSPCA   ρ = 1        8      71.37%   0.13
(200, 10)   PCA                          51.69%
            DSPCA    ρ = 0.·      ·      ·        ·
            DSPCA    ρ = 1        88     46.95%   28.51
            DSPCA    ρ = 10       1      19.80%   28.04
            ADMM     K = 5        9      51.15%   1.22
            ADMM     K = 4        6      49.13%   1.38
            ADMM     K = 3        4      42.61%   0.87
            ALSPCA   ρ = 0.01     10     51.61%   0.26
            ALSPCA   ρ = 0.·      ·      ·        ·
            ALSPCA   ρ = 1        4      48.01%   0.24
(200, 20)   PCA                          68.38%
            DSPCA    ρ = 0.·      ·      ·        ·
            DSPCA    ρ = 1        20     59.54%   42.63
            DSPCA    ρ = 10       64     22.03%   34.83
            ADMM     K = 9        18     67.91%   1.46
            ADMM     K = 8        14     67.01%   1.74
            ADMM     K = 7        11     65.26%   1.76
            ALSPCA   ρ = 0.01     20     68.37%   0.24
            ALSPCA   ρ = 0.·      ·      ·        ·
            ALSPCA   ρ = 1        8      58.37%   0.24

Sparse PCA can also be used to classify the keywords in text data. This application has been studied by Zhang, d'Aspremont and El Ghaoui in [40] and by Zhang and El Ghaoui in [41]. In this section, we show that by using our ADMM to solve the sparse PCA problem, we can also classify the keywords from text data very well. The data set we used is a small version of the "20-newsgroups" data (available at http://cs.nyu.edu/~roweis/data.html), which is also used in [40]. This data set consists of the binary occurrences of 100 specific words across 16242 postings, i.e., the data matrix $M$ is of size $100 \times 16242$, where $M_{ij} = 1$ if the $i$-th word appears at least once in the $j$-th posting and $M_{ij} = 0$ otherwise. These words can be approximately divided into different groups such as "computer", "religion", etc. We want to find the words that contribute as much variance as possible and also to discover which words belong to the same category. By viewing each posting as a sample of the 100 variables, we have 16242 samples of the variables, and thus the sample covariance matrix $\Sigma \in \mathbb{R}^{100\times 100}$. Using standard PCA, it is hard to interpret which words contribute to each of the leading eigenvalues since the loadings are dense. However, sparse PCA can explain as much of the variance as the standard PCs while making clear which words contribute together to the corresponding variance. We applied our ADMM to solve (1.4) for the first three sparse PCs. We set $K = 5$ in all three problems and used the following continuation technique for $\mu$: starting from $\mu_0 = 100$, $\mu$ was decreased by a constant factor at each iteration until it reached a prescribed lower bound. The resulting three sparse PCs have 10, 12 and 17 nonzeros respectively, and the total variance they explain is 12.·%. The words associated with these three sparse PCs are listed in Table 5.6.

Table 5.6
Words associated with the first three sparse PCs using ADMM

In this section, we use sparse PCA to analyze the voting records of the 109th US Senate, which were also studied by Zhang, d'Aspremont and El Ghaoui in [40]. The votes are recorded as 1 for "yes" and $-1$ for "no". The data matrix $M$ is a $66 \times 100$ matrix with entries 1, $-1$ and 0, and each column of $M$ corresponds to one senator's voting record.
66 matrix.To see how standard PCA and sparse PCA perform in classifying the voting records, we implementedthe following procedure as suggested in [40]. We used standard PCA to find the largest two PCs (denotedas v and v ) of Σ. We then projected each column of M onto the subspace spanned by v and v , i.e., wefound ¯ α i and ¯ β i for each column M i such that(¯ α i , ¯ β i ) := arg min ( α i ,β i ) k α i v + β i v − M i k . We then drew each column M i as a point (¯ α i , ¯ β i ) in the two-dimensional subspace spanned by v and v . Theleft figure in Figure 5.1 shows the 100 points. We see from this figure that senators are separated very well bypartisanship. However, it is hard to interpret which bills are responsible to the explained variance, becauseall the bills are involved in the PCs. By using sparse PCA, we can interpret the explained variance by just afew bills. We applied our ADMM to find the first two sparse PCs (denoted as s and s ) of Σ. We set K = 4for both problems and used the following continuation technique on µ : µ = 100 , µ k = max { µ k − / , − } . The resulting two sparse PCs s and s produced by our ADMM have 9 and 5 nonzeros respectively. Weprojected each column of M onto the subspace spanned by these two sparse PCs. The right figure in Figure5.1 shows the 100 projections onto the subspace spanned by the sparse PCs s and s . We see from thisfigure that the senators are still separated well by partisanship. Now since only a few bills are involved in thetwo sparse PCs, we can interpret which bills are responsible most for the classification. The bills involved inthe first two PCs are listed in Table 5.7. From Table 5.7 we see that the most controversial issues betweenRepublicans and Democrats are topics such as “Budget” and “Appropriations”. Other controversial issuesinvolve topics like “Energy”, “Abortion” and “Health”.
6. Conclusion.
In this paper, we proposed an alternating direction method of multipliers for solving an SDP relaxation of the sparse PCA problem. Our method incorporates a variable-splitting technique to separate the $\ell_1$ norm constraint, which controls the sparsity of the solution, from the positive-semidefiniteness constraint. This results in two relatively simple subproblems with closed-form solutions in each iteration. Global convergence results were established for the proposed method. Numerical results on both synthetic data and real data from classification of text data and senate voting records demonstrated the efficacy of our method.

Compared with Nesterov's first-order method DSPCA for sparse PCA studied in [9], our ADMM method solves the primal problems directly and guarantees sparse solutions. Numerical results also indicate that ADMM is much faster than DSPCA. Compared with methods for solving nonconvex formulations of sparse PCA, the nonsmooth SDP formulation considered in this paper usually requires more computational effort per iteration. However, the global convergence of our ADMM for solving the nonsmooth SDP is guaranteed, while methods for solving nonconvex problems usually have only local convergence.

Acknowledgement.
The author is grateful to Alexandre d'Aspremont for discussions on using DSPCA. The author thanks Stephen J. Wright and Lingzhou Xue for reading an earlier version of the manuscript and for helpful discussions. This work was partially supported by an NSF Postdoctoral Fellowship through the Institute for Mathematics and Its Applications at the University of Minnesota.

REFERENCES
[1] F. Alizadeh, Interior point methods in semidefinite programming with applications to combinatorial optimization, SIAM Journal on Optimization, 5 (1995), pp. 13-51.
[2] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, Distributed optimization and statistical learning via the alternating direction method of multipliers, Foundations and Trends in Machine Learning, (2011).
[3] S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge University Press, Cambridge, 2004.
[4] P. Brucker, An O(n) algorithm for quadratic knapsack problems, Operations Research Letters, 3 (1984), pp. 163-166.
[5] E. J. Candès, J. Romberg, and T. Tao, Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information, IEEE Transactions on Information Theory, 52 (2006), pp. 489-509.
[6] P. L. Combettes and J.-C. Pesquet, A Douglas-Rachford splitting approach to nonsmooth convex variational signal recovery, IEEE Journal of Selected Topics in Signal Processing, 1 (2007), pp. 564-574.
[7] P. L. Combettes and V. R. Wajs, Signal recovery by proximal forward-backward splitting, SIAM Journal on Multiscale Modeling and Simulation, 4 (2005), pp. 1168-1200.
[8] A. d'Aspremont, F. Bach, and L. El Ghaoui, Optimal solutions for sparse principal component analysis, Journal of Machine Learning Research, 9 (2008), pp. 1269-1294.
[9] A. d'Aspremont, L. El Ghaoui, M. I. Jordan, and G. R. G. Lanckriet, A direct formulation for sparse PCA using semidefinite programming, SIAM Review, 49 (2007), pp. 434-448.
[10] D. Donoho, Compressed sensing, IEEE Transactions on Information Theory, 52 (2006), pp. 1289-1306.
[11] J. Douglas and H. H. Rachford, On the numerical solution of the heat conduction problem in 2 and 3 space variables, Transactions of the American Mathematical Society, 82 (1956), pp. 421-439.
[12] J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra, Efficient projections onto the l1-ball for learning in high dimensions, in ICML, 2008.
[13] J. Eckstein, Splitting methods for monotone operators with applications to parallel optimization, PhD thesis, Massachusetts Institute of Technology, 1989.
[14] J. Eckstein and D. P. Bertsekas, On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators, Math. Program., 55 (1992), pp. 293-318.
[15] D. Gabay, Applications of the method of multipliers to variational inequalities, in Augmented Lagrangian Methods: Applications to the Solution of Boundary Value Problems, M. Fortin and R. Glowinski, eds., North-Holland, Amsterdam, 1983.
[16] R. Glowinski and P. Le Tallec, Augmented Lagrangian and Operator-Splitting Methods in Nonlinear Mechanics, SIAM, Philadelphia, Pennsylvania, 1989.
[17] D. Goldfarb and S. Ma, Fast multiple splitting algorithms for convex optimization, tech. report, Department of IEOR, Columbia University, 2009. Preprint available at http://arxiv.org/abs/0912.4570.
[18] D. Goldfarb, S. Ma, and K. Scheinberg, Fast alternating linearization methods for minimizing the sum of two convex functions, tech. report, Department of IEOR, Columbia University, 2010. Preprint available at http://arxiv.org/abs/0912.4571.
[19] T. Goldstein and S. Osher, The split Bregman method for L1-regularized problems, SIAM J. Imaging Sci., 2 (2009), pp. 323-343.
[20] B. S. He, L.-Z. Liao, D. Han, and H. Yang, A new inexact alternating direction method for monotone variational inequalities, Math. Program., 92 (2002), pp. 103-118.
[21] J. Jeffers, Two case studies in the application of principal component analysis, Appl. Stat., 16 (1967), pp. 225-236.
[22] M. Journee, Yu. Nesterov, P. Richtarik, and R. Sepulchre, Generalized power method for sparse principal component analysis, Journal of Machine Learning Research, 11 (2010), pp. 517-553.
[23] P. L. Lions and B. Mercier, Splitting algorithms for the sum of two nonlinear operators, SIAM Journal on Numerical Analysis, 16 (1979), pp. 964-979.
[24] Z. Lu and Y. Zhang, An augmented Lagrangian approach for sparse principal component analysis, Mathematical Programming, (2011).
[25] L. Mackey, Deflation methods for sparse PCA, in Advances in Neural Information Processing Systems (NIPS), 2008.
[26] J. Malick, J. Povh, F. Rendl, and A. Wiegele, Regularization methods for semidefinite programming, SIAM Journal on Optimization, 20 (2009), pp. 336-356.
[27] Y. E. Nesterov, A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2), Dokl. Akad. Nauk SSSR, 269 (1983), pp. 543-547.
[28] Y. E. Nesterov, Smooth minimization of non-smooth functions, Math. Program. Ser. A, 103 (2005), pp. 127-152.
[29] P. M. Pardalos and N. Kovoor, An algorithm for a singly constrained class of quadratic programs subject to upper and lower bounds, Mathematical Programming, 46 (1990), pp. 321-328.
[30] D. H. Peaceman and H. H. Rachford, The numerical solution of parabolic and elliptic differential equations, SIAM Journal on Applied Mathematics, 3 (1955), pp. 28-41.
[31] S. Shalev-Shwartz and Y. Singer, Efficient learning of label ranking by soft projections onto polyhedra, Journal of Machine Learning Research, 7 (2006), pp. 1567-1599.
[32] Y. Saad, Projection and deflation methods for partial pole assignment in linear state feedback, IEEE Trans. Automat. Contr., 33 (1988), pp. 290-297.
[33] K. Scheinberg, S. Ma, and D. Goldfarb, Sparse inverse covariance selection via alternating linearization methods, in Proceedings of the Neural Information Processing Systems (NIPS), 2010.
[34] M. J. Todd, Semidefinite optimization, Acta Numer., 10 (2001), pp. 515-560.
[35] E. van den Berg and M. P. Friedlander, Probing the Pareto frontier for basis pursuit solutions, SIAM J. on Scientific Computing, 31 (2008), pp. 890-912.
[36] Y. Wang, J. Yang, W. Yin, and Y. Zhang, A new alternating minimization algorithm for total variation image reconstruction, SIAM Journal on Imaging Sciences, 1 (2008), pp. 248-272.
[37] Z. Wen, D. Goldfarb, and W. Yin, Alternating direction augmented Lagrangian methods for semidefinite programming, Mathematical Programming Computation, 2 (2010), pp. 203-230.
[38] J. Yang and Y. Zhang, Alternating direction algorithms for l1 problems in compressive sensing, SIAM Journal on Scientific Computing, 33 (2011), pp. 250-278.
[39] X. Yuan, Alternating direction methods for sparse covariance selection.
[40] Y. Zhang, A. d'Aspremont, and L. El Ghaoui, Sparse PCA: Convex relaxations, algorithms and applications, in Handbook on Semidefinite, Cone and Polynomial Optimization, M. Anjos and J. B. Lasserre, eds., 2011.
[41] Y. Zhang and L. El Ghaoui, Large-scale sparse principal component analysis with application to text data, in NIPS, 2011.
[42] H. Zou, T. Hastie, and R. Tibshirani, Sparse principal component analysis, Journal of Computational and Graphical Statistics, 15 (2006), pp. 265-286.