Alternating Maximization: Unifying Framework for 8 Sparse PCA Formulations and Efficient Parallel Codes
Peter Richtárik, Majid Jahani, Selin Damla Ahipaşaoğlu, Martin Takáč
December 15, 2012
Abstract
Given a multivariate data set, sparse principal component analysis (SPCA) aims to extract several linear combinations of the variables that together explain the variance in the data as much as possible, while controlling the number of nonzero loadings in these combinations. In this paper we consider 8 different optimization formulations for computing a single sparse loading vector; these are obtained by combining the following factors: we employ two norms for measuring variance (L2, L1) and two sparsity-inducing norms (L0, L1), which are used in two different ways (constraint, penalty). Three of our formulations, notably the one with the L0 constraint and L1 variance, have not been considered in the literature. We give a unifying reformulation which we propose to solve via a natural alternating maximization (AM) method. We show that the AM method is nontrivially equivalent to GPower (Journée et al., JMLR, 517–553, 2010) for all our formulations. Besides this, we provide 24 efficient parallel SPCA implementations: 3 codes (multi-core, GPU and cluster) for each of the 8 problems. Parallelism in the methods is aimed at i) speeding up computations (our GPU code can be 100 times faster than an efficient serial code written in C++), ii) obtaining solutions explaining more variance and iii) dealing with big data problems (our cluster code is able to solve a 357 GB problem in about a minute).

Keywords: sparse principal component analysis, alternating maximization, GPower, high performance computing, big data analytics, unsupervised learning.
Principal component analysis (PCA) is an indispensable tool used for dimension reduction in virtually all areas of science and engineering, from machine learning, statistics, genetics and finance to computer networks [1]. Let A ∈ R^{n×p} denote a data matrix encoding n samples (observations) of p variables (features). PCA aims to extract a few linear combinations of the columns of A, called principal components (PCs), pointing in mutually orthogonal directions, together explaining as much variance in the data as possible. If the columns of A are centered, the problem of extracting the first PC can be written as

max { ‖Ax‖ : ‖x‖₂ ≤ 1 },   (1.1)

where ‖·‖ is a suitable norm for measuring variance. The solution x of this optimization problem is called the loading vector, and Ax (normalized) is the first PC. Further PCs can be obtained in the same way with A replaced by a new matrix in a process called deflation [2]. Classical PCA employs the L2 norm in the objective; using the L1 norm instead may alleviate problems caused by outliers in the data and hence leads to a robust PCA model [3].

As normally there is no reason for the optimal loading vectors defining the PCs to be sparse, they are usually combinations of all of the variables. In some applications, however, sparse loading vectors enhance the interpretability of the components and are easier to store, which leads to the idea of inducing sparsity in the loading vectors. This problem and approaches to it are known collectively as sparse PCA (SPCA); for some recent work, see [4]–[11].

∗ Open source code with efficient implementations of the algorithms developed in this paper is published here: https://code.google.com/p/24am/
† School of Mathematics, University of Edinburgh, Edinburgh, EH9 3JZ, United Kingdom
‡ School of Mathematics, University of Edinburgh, Edinburgh, EH9 3JZ, United Kingdom
§ Engineering Systems and Design, Singapore University of Technology and Design, Singapore, 138682
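In the classical L2 case, (1.1) is solved by the top right singular vector of A and can be approximated by power iteration on AᵀA; the following minimal sketch (our own illustration, assuming numpy; the function name is ours) shows the computational pattern of repeated products with A and Aᵀ that also underlies the methods developed below.

```python
import numpy as np

def first_pc_loading(A, iters=200, seed=0):
    """Approximate argmax{ ||Ax||_2 : ||x||_2 <= 1 } by power iteration
    on A^T A; A is assumed to have centered columns."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(A.shape[1])
    x /= np.linalg.norm(x)
    for _ in range(iters):
        v = A.T @ (A @ x)        # two matrix-vector products per iteration
        x = v / np.linalg.norm(v)
    return x                     # loading vector; A @ x is the first PC

A = np.random.default_rng(1).standard_normal((50, 8))
A -= A.mean(axis=0)              # center the columns
x = first_pc_loading(A)
u = np.linalg.svd(A)[2][0]       # top right singular vector, for comparison
assert abs(abs(x @ u) - 1.0) < 1e-6
```

Deflation [2] would then subtract the found component from A before extracting the next one.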
A popular way of incorporating a sparsity-inducing mechanism into the above optimization formulation is via either a sparsity-inducing constraint or penalty. Two of the most popular functions for this are the L0 and L1 norms of the loading vector x (the L0 "norm" of x, denoted ‖x‖₀, is the number of nonzeros in x).

In this paper we consider 8 optimization formulations for extracting a single sparse loading vector (i.e., for computing the first PC), arising as combinations of the following modeling factors: we use two norms for measuring variance (classical L2 and robust L1) and two sparsity-inducing (SI) norms (cardinality L0 and L1), which are used in two different ways (as a constraint or a penalty). All have the form

OPT = max_{x∈X} f(x),   (1.2)

with X ⊂ R^p and f detailed in Table 1. Note that if we set s = p in the constrained or γ = 0 in the penalized versions, the sparsity-inducing functions stop having any effect and we recover the classical and robust PCA (1.1). (In the L1-constrained formulations this can be seen from the inequality ‖x‖₁ ≤ √‖x‖₀ ‖x‖₂.) Choosing 1 ≤ s < p or γ > 0 encourages sparsity in x.

1. L2 var., L0 constraint:  X = {x ∈ R^p : ‖x‖₂ ≤ 1, ‖x‖₀ ≤ s},   f(x) = ‖Ax‖₂
2. L1 var., L0 constraint:  X = {x ∈ R^p : ‖x‖₂ ≤ 1, ‖x‖₀ ≤ s},   f(x) = ‖Ax‖₁
3. L2 var., L1 constraint:  X = {x ∈ R^p : ‖x‖₂ ≤ 1, ‖x‖₁ ≤ √s},  f(x) = ‖Ax‖₂
4. L1 var., L1 constraint:  X = {x ∈ R^p : ‖x‖₂ ≤ 1, ‖x‖₁ ≤ √s},  f(x) = ‖Ax‖₁
5. L2 var., L0 penalty:     X = {x ∈ R^p : ‖x‖₂ ≤ 1},             f(x) = ‖Ax‖₂² − γ‖x‖₀
6. L1 var., L0 penalty:     X = {x ∈ R^p : ‖x‖₂ ≤ 1},             f(x) = ‖Ax‖₁² − γ‖x‖₀
7. L2 var., L1 penalty:     X = {x ∈ R^p : ‖x‖₂ ≤ 1},             f(x) = ‖Ax‖₂ − γ‖x‖₁
8. L1 var., L1 penalty:     X = {x ∈ R^p : ‖x‖₂ ≤ 1},             f(x) = ‖Ax‖₁ − γ‖x‖₁

Table 1: Eight sparse PCA optimization formulations; see (1.2).

All 4 SPCA formulations of Table 1 involving L2 variance were previously studied in the literature and are very popular. For instance, [6] solve a series of convex relaxations of the L0-constrained L2-variance problem, [7] considered the L1-penalized and L1-constrained formulations, [9] studied the L0- and L1-penalized versions, and [10] looked at all four.
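To make Table 1 concrete, the eight objectives can be written down in a few lines (a sketch assuming numpy; the function and argument names are ours). The remark that s = p or γ = 0 recovers (1.1) can then be checked directly.

```python
import numpy as np

def spca_objective(A, x, variance='L2', sparsity='L0', mode='constraint',
                   s=None, gamma=0.0):
    """Objective f(x) of Table 1 (our own sketch; names are ours).
    variance: 'L2' or 'L1'; sparsity: 'L0' or 'L1'; mode: 'constraint'
    or 'penalty'. Constrained versions return -inf for infeasible x."""
    var = np.linalg.norm(A @ x, 2 if variance == 'L2' else 1)
    nnz = np.count_nonzero(x)
    l1 = np.abs(x).sum()
    if mode == 'constraint':
        feasible = (nnz <= s) if sparsity == 'L0' else (l1 <= np.sqrt(s))
        return var if feasible else -np.inf
    if sparsity == 'L0':          # L0 penalties act on the squared variance
        return var ** 2 - gamma * nnz
    return var - gamma * l1

# with s = p (or gamma = 0) the sparsity term has no effect: we recover (1.1)
A = np.random.default_rng(0).standard_normal((20, 5))
x = np.ones(5) / np.sqrt(5)
assert spca_objective(A, x, 'L2', 'L0', 'constraint', s=5) == np.linalg.norm(A @ x)
assert spca_objective(A, x, 'L1', 'L1', 'penalty', gamma=0.0) == np.linalg.norm(A @ x, 1)
```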
Of the four L1-variance formulations, one was first proposed only recently, in [11]. To the best of our knowledge, the remaining three L1-variance formulations were not considered in the literature before. In particular, the L0-constrained L1-variance formulation is new, and is perhaps preferable as it directly constrains the cardinality of the loading vector x without using any proxies.

In all 8 formulations we introduce an additional (dummy) variable y, which allows us to propose a generic alternating maximization (AM) method for solving them: i) for a fixed loading vector, find the best dummy variable (one maximizing the objective); then ii) fix the dummy variable and find the best loading vector; repeat steps i) and ii). This and the resulting algorithms are described in detail in Section 2. The generic AM method is not limited to our choice of SPCA formulations. Indeed, it is applicable, for instance, if instead of measuring the variance using either the L2 or the L1 norm, we use any other norm. One critical feature shared by the formulations in Table 1 is that steps i) and ii) of the AM method can be performed efficiently, in closed form, with the main computational burden in each step being a matrix-vector multiplication (Ax in step i) and Aᵀy in step ii)). Our method produces a sequence of loading vectors x^(k), k ≥ 0, with monotonically increasing values f(x^(k)).

Our approach of introducing a dummy variable and using AM is similar to that of [9], where it is done implicitly, but mainly to [12], where it is fully explicit, albeit used for different purposes. Besides providing a conceptual unification for solving all 8 formulations using a single algorithm (AM), the main theoretical result of this paper is establishing that, surprisingly, in all 8 cases the AM method is equivalent to the GPower method [9] applied to a certain derived objective function, with the iterates being either the loading vectors or the dummy variables, depending on the formulation. This result is stated and proved in Section 3.

Besides giving a new unifying framework and a generic algorithm for solving a number of SPCA formulations, 5 of which were previously proposed in the literature and 3 of which are new, our further contribution is in providing efficient strategies for parallelizing AM at two different levels: i) running AM in parallel from multiple starting points in order to obtain a solution explaining more variance, and ii) speeding up the linear algebra involved. This is described in detail in Section 4.

Moreover, we provide parallel open-source codes implementing these parallelization strategies, for each of our 8 formulations, on 3 computing architectures: i) multi-core machine, ii) GPU-enabled computer, and iii) computer cluster. We also provide a serial code; however, as nearly all modern computers are multi-core, the serial implementation only serves as a benchmark against which one can measure parallelization speedup. Hence, we provide a total of 8 × 3 = 24 parallel implementations. Parallelism in the methods is aimed at the following three goals:

1. Speeding up computations.
As described above, the AM method computes a matrix-vector multiplication at every iteration; this can be parallelized. We find that our GPU implementations are faster than our multi-core implementations, which are, in turn, considerably faster than the benchmark single-core codes.
2. Obtaining solutions explaining more variance.

In some applications, such as the computation of RIP constants in compressed sensing [13], it is critical that a PC is computed with as high an explained variance as possible. The output of each of our 8 subroutines depends on the starting point used; it only finds local solutions. Running them repeatedly from different starting points and keeping the solution with the largest objective value results in a PC explaining more variance. There are several ways in which this can be done; we implement 4 (NAI = "naive", SFA = "start-from-all", BAT = "batches" and OTF = "on-the-fly"); details are given in Section 4. A naive (NAI) approach is to do this sequentially; a different possibility is to run the method from several or all starting points in parallel (BAT, SFA), possibly asynchronously (OTF). This way, at each iteration we need to perform a matrix-matrix multiplication which, when computed in parallel, is performed significantly faster than the corresponding number of matrix-vector multiplications executed one after another.
3. Dealing with big data problems.

If speed matters, for problems of small enough size we recommend using a GPU, if available. Since GPUs have stricter memory limitations than multi-core workstations (a typical GPU has 6 GB RAM, while a multi-core machine could have 20 GB RAM), one may need to use a high-memory multi-core workstation if the problem size exceeds the GPU limit. However, for large enough (= big data) problems, one will need to use a cluster. Our cluster codes partition A, store the parts on different nodes, and do the computations in a distributed way.

Notation. By x and y we denote column vectors in R^p and R^n, respectively. The coordinates of a vector are denoted by subscripts (e.g., x₁, x₂, ...), while iterates are denoted by superscripts in brackets (e.g., x^(0), x^(1), ...). We reserve the letter k for the iteration counter. By ‖x‖₀ we refer to the cardinality (number of nonzero loadings) of the vector x. The L1, L2 and L∞ norms are defined by ‖z‖₁ = Σᵢ |zᵢ|, ‖z‖₂ = (Σᵢ zᵢ²)^{1/2} and ‖z‖∞ = maxᵢ |zᵢ|, respectively. For a scalar t, we let [t]₊ = max{0, t}, and by sgn(t) we denote the sign of t.

As outlined in the previous section, we will solve (1.2) by introducing a dummy variable y into each of the 8 formulations and applying an AM method to the reformulation. First, notice that for any pair of conjugate norms ‖·‖ and ‖·‖∗ we have, by definition,

‖z‖ = max_{‖y‖∗ ≤ 1} yᵀz.   (2.3)

In particular, ‖·‖₂∗ = ‖·‖₂ and ‖·‖₁∗ = ‖·‖∞. Now, let Y := {y ∈ R^n : ‖y‖₂ ≤ 1} for the L2-variance formulations and Y := {y ∈ R^n : ‖y‖∞ ≤ 1} for the L1-variance formulations. Further, let F(x, y) be the function obtained from f(x) after replacing ‖Ax‖ with yᵀAx (resp. ‖Ax‖² with (yᵀAx)²). Then, in view of the above, (1.2) takes on the equivalent form

OPT = max_{x∈X} max_{y∈Y} F(x, y).   (2.4)

That is, the 8 problems from Table 1 can be reformulated into the form (2.4); the details can be found in Table 2.
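The conjugacy relations behind (2.3) are easy to check numerically: the maximizers are y = z/‖z‖₂ for the L2 norm and y = sgn(z) for the L1 norm. A small sketch (our own illustration, assuming numpy):

```python
import numpy as np

z = np.array([1.5, -2.0, 0.5])

# ||z||_2 = max{ y^T z : ||y||_2 <= 1 }, attained at y = z / ||z||_2
y2 = z / np.linalg.norm(z)
assert np.isclose(y2 @ z, np.linalg.norm(z))

# ||z||_1 = max{ y^T z : ||y||_inf <= 1 }, attained at y = sgn(z)
y1 = np.sign(z)
assert np.isclose(y1 @ z, np.abs(z).sum())

# randomized check of (2.3) for the L1 norm: no feasible y does better
rng = np.random.default_rng(0)
for y in rng.uniform(-1.0, 1.0, size=(1000, 3)):   # ||y||_inf <= 1
    assert y @ z <= np.abs(z).sum() + 1e-12
```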
No. | X | Y | F(x, y)
1 | {x ∈ R^p : ‖x‖₂ ≤ 1, ‖x‖₀ ≤ s} | {y ∈ R^n : ‖y‖₂ ≤ 1} | yᵀAx
2 | {x ∈ R^p : ‖x‖₂ ≤ 1, ‖x‖₀ ≤ s} | {y ∈ R^n : ‖y‖∞ ≤ 1} | yᵀAx
3 | {x ∈ R^p : ‖x‖₂ ≤ 1, ‖x‖₁ ≤ √s} | {y ∈ R^n : ‖y‖₂ ≤ 1} | yᵀAx
4 | {x ∈ R^p : ‖x‖₂ ≤ 1, ‖x‖₁ ≤ √s} | {y ∈ R^n : ‖y‖∞ ≤ 1} | yᵀAx
5 | {x ∈ R^p : ‖x‖₂ ≤ 1} | {y ∈ R^n : ‖y‖₂ ≤ 1} | (yᵀAx)² − γ‖x‖₀
6 | {x ∈ R^p : ‖x‖₂ ≤ 1} | {y ∈ R^n : ‖y‖∞ ≤ 1} | (yᵀAx)² − γ‖x‖₀
7 | {x ∈ R^p : ‖x‖₂ ≤ 1} | {y ∈ R^n : ‖y‖₂ ≤ 1} | yᵀAx − γ‖x‖₁
8 | {x ∈ R^p : ‖x‖₂ ≤ 1} | {y ∈ R^n : ‖y‖∞ ≤ 1} | yᵀAx − γ‖x‖₁

Table 2: Reformulations of the problems from Table 1.

We propose to solve (2.4) via Algorithm 1.
Algorithm 1
Alternating Maximization (AM) Method

Select initial point x^(0) ∈ R^p; k ← 0
Repeat
  y^(k) ← y(x^(k)) := argmax_{y∈Y} F(x^(k), y)
  x^(k+1) ← x(y^(k)) := argmax_{x∈X} F(x, y^(k))
  k ← k + 1
Until a stopping criterion is satisfied
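As a concrete instance of Algorithm 1, consider formulation 1 (L2 variance, L0 constraint): by (2.3) the y-step returns Ax/‖Ax‖₂, and the x-step returns the normalized restriction of Aᵀy to its s largest-magnitude entries. A minimal Python sketch (our own illustration with numpy, not the paper's C++ code; the function name is ours):

```python
import numpy as np

def am_sparse_pca(A, s, iters=100, seed=0):
    """Alternating maximization for max{ y^T A x : ||y||_2 <= 1,
    ||x||_2 <= 1, ||x||_0 <= s }; both subproblems are closed-form."""
    rng = np.random.default_rng(seed)
    p = A.shape[1]
    x = np.zeros(p)
    x[rng.integers(p)] = 1.0               # sparse, feasible starting point
    for _ in range(iters):
        u = A @ x
        y = u / np.linalg.norm(u)           # y-step: best y for fixed x
        v = A.T @ y
        keep = np.argsort(np.abs(v))[-s:]   # s largest entries of A^T y
        x = np.zeros(p)
        x[keep] = v[keep]
        x /= np.linalg.norm(x)              # x-step: best x for fixed y
    return x

A = np.random.default_rng(1).standard_normal((40, 10))
x = am_sparse_pca(A, s=3)
assert np.count_nonzero(x) <= 3 and np.isclose(np.linalg.norm(x), 1.0)
```

The per-iteration cost is dominated by the two matrix-vector products Ax and Aᵀy, as noted in the text.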
All 8 problems of Table 2 enjoy the property that both of the steps (subproblems) of Algorithm 1 can be computed in closed form. In particular, each of these 8 × 2 = 16 subproblems reduces to one of the six generic subproblems (S1)–(S6) listed in Table 3; the first result is used with z = x and z = y, the second result with z = y, and the remaining four results with z = x.

Table 3 is brief at the cost of referring to a number of operators (T_s, U_γ, V_γ : R^m → R^m and λ_s : R^m → R), which we now define. For a given vector a ∈ R^m and integer s ∈ {1, 2, ..., m}, by T_s(a) ∈ R^m we denote the vector obtained from a by retaining only the s largest components of a in absolute value, with the remaining ones replaced by zero. For instance, for a = (1, −3, 2, 0)ᵀ and s = 2 we have T_s(a) = (0, −3, 2, 0)ᵀ. For γ ≥ 0, we define the operators U_γ and V_γ element-wise for i = 1, ..., m as follows:

(U_γ(a))ᵢ := aᵢ [sgn(aᵢ² − γ)]₊,   (2.5)
(V_γ(a))ᵢ := sgn(aᵢ)(|aᵢ| − γ)₊.   (2.6)

Furthermore, we let λ_s(a) := argmin_{λ≥0} λ√s + ‖V_λ(a)‖₂, which is the solution of the one-dimensional dual of the optimization problem in line S4 of Table 3.

Subproblem | φ(z) | Z | z* | φ(z*)
S1 | aᵀz or (aᵀz)² | ‖z‖₂ ≤ 1 | a/‖a‖₂ | ‖a‖₂ (resp. ‖a‖₂²)
S2 | aᵀz | ‖z‖∞ ≤ 1 | sgn(a) | ‖a‖₁
S3 | aᵀz | ‖z‖₂ ≤ 1, ‖z‖₀ ≤ s | T_s(a)/‖T_s(a)‖₂ | ‖T_s(a)‖₂
S4 | aᵀz | ‖z‖₂ ≤ 1, ‖z‖₁ ≤ √s | V_{λ_s(a)}(a)/‖V_{λ_s(a)}(a)‖₂ | λ_s(a)√s + ‖V_{λ_s(a)}(a)‖₂
S5 | (aᵀz)² − γ‖z‖₀ | ‖z‖₂ ≤ 1 | U_γ(a)/‖U_γ(a)‖₂ | ‖U_γ(a)‖₂² − γ‖U_γ(a)‖₀
S6 | aᵀz − γ‖z‖₁ | ‖z‖₂ ≤ 1 | V_γ(a)/‖V_γ(a)‖₂ | ‖V_γ(a)‖₂

Table 3: Closed-form solutions of the AM subproblems; z* := argmax_{z∈Z} φ(z).

Combining Algorithm 1 with the subproblem solutions given in Table 3, the AM method for all our 8 SPCA formulations can be written down concisely; see Algorithm 2.
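These operators are cheap to apply; a direct numpy sketch follows (our own illustration; λ_s is computed only approximately, by a grid search over [0, maxᵢ|aᵢ|], since beyond maxᵢ|aᵢ| we have V_λ(a) = 0 and the objective only grows):

```python
import numpy as np

def T(a, s):
    """T_s(a): keep the s largest components of a in absolute value."""
    out = np.zeros_like(a)
    idx = np.argsort(np.abs(a))[-s:]
    out[idx] = a[idx]
    return out

def U(a, gamma):
    """(2.5): (U_gamma(a))_i = a_i [sgn(a_i^2 - gamma)]_+ ."""
    return a * (np.sign(a ** 2 - gamma) > 0)

def V(a, gamma):
    """(2.6): (V_gamma(a))_i = sgn(a_i)(|a_i| - gamma)_+  (soft thresholding)."""
    return np.sign(a) * np.maximum(np.abs(a) - gamma, 0.0)

def lam(a, s, grid=10_000):
    """Approximate lambda_s(a) = argmin_{l >= 0} l*sqrt(s) + ||V_l(a)||_2
    by a grid search on [0, max|a_i|]."""
    ls = np.linspace(0.0, np.abs(a).max(), grid)
    vals = [l * np.sqrt(s) + np.linalg.norm(V(a, l)) for l in ls]
    return ls[int(np.argmin(vals))]

a = np.array([1.0, -3.0, 2.0, 0.5])
assert np.allclose(T(a, 2), [0.0, -3.0, 2.0, 0.0])
assert np.allclose(U(a, 1.5), [0.0, -3.0, 2.0, 0.0])  # keeps a_i with a_i^2 > 1.5
assert np.allclose(V(a, 1.0), [0.0, -2.0, 1.0, 0.0])
```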
Algorithm 2
AM method for solving the 8 SPCA formulations of Table 2

Select initial point x^(0) ∈ R^p; k ← 0
Repeat
  u = Ax^(k)
  If L1 variance then y^(k) ← sgn(u)
  If L2 variance then y^(k) ← u/‖u‖₂
  v = Aᵀy^(k)
  If L0 penalty then x^(k+1) ← U_γ(v)/‖U_γ(v)‖₂
  If L1 penalty then x^(k+1) ← V_γ(v)/‖V_γ(v)‖₂
  If L0 constraint then x^(k+1) ← T_s(v)/‖T_s(v)‖₂
  If L1 constraint then x^(k+1) ← V_{λ_s(v)}(v)/‖V_{λ_s(v)}(v)‖₂
  k ← k + 1
Until a stopping criterion is satisfied

Note that in the methods described in Algorithm 2 it is not necessary to normalize the vector U_γ(v) (resp. V_γ(v), T_s(v) and V_{λ_s(v)}(v)) when computing x^(k+1), since the iterate y^(k+1), which depends on x^(k+1), is invariant under positive scalings of x^(k+1). We have to remember, however, to normalize the output. The method is terminated when a maximum number of iterations maxIt is reached or when

F(x^(k+1), y^(k)) − F(x^(k), y^(k−1)) ≤ tol.

GPower (generalized power method) [9] is a simple algorithm for maximizing a convex function Ψ on a compact set Ω, which works via a "linearize and maximize" strategy. If by Ψ′(z^(k)) we denote an arbitrary subgradient of Ψ at z^(k), then GPower performs the following iteration:

z^(k+1) = argmax_{z∈Ω} { Ψ(z^(k)) + ⟨Ψ′(z^(k)), z − z^(k)⟩ } = argmax_{z∈Ω} ⟨Ψ′(z^(k)), z⟩.   (3.7)

The following theorem, our main result, gives a nontrivial insight into the relationship of AM and GPower, when the former is applied to solving any of the 8 SPCA formulations considered, and GPower is applied to a derived problem, as described by the theorem.

Theorem 1 (AM = GPower). The AM and GPower methods are equivalent in the following sense:
1. For the 4 constrained sparse PCA formulations of Table 1, the x iterates of the AM method applied to the corresponding reformulation of Table 2 are identical to the iterates of the GPower method applied to the problem of maximizing the convex function F_Y(x) := max_{y∈Y} F(x, y) on X, started from x^(0).

2. For the 4 penalized sparse PCA formulations of Table 1, the y iterates of the AM method applied to the corresponding reformulation of Table 2 are identical to the iterates of the GPower method applied to the problem of maximizing the convex function F_X(y) := max_{x∈X} F(x, y) on Y, started from y^(0).

Proof. Recall that we wish to solve the problem

OPT = max_{x∈X} f(x) = max_{x∈X} max_{y∈Y} F(x, y) = max_{y∈Y} max_{x∈X} F(x, y),

where the inner maximum in the second expression is F_Y(x) and in the third is F_X(y). We will now prove the equivalence for all 8 choices of (f, X, Y, F) given in Tables 1 and 2. In the proofs we will also refer to the closed-form solutions of the subproblems (S1)–(S6), as detailed in Table 3.

Consider first the constrained formulations 1, 2, 3 and 4. Assume that the k-th x-iterate x^(k) of AM is identical to the k-th iterate of GPower (for k = 0 this is enforced by the assumption that GPower is started from x^(0)). By considering all 4 formulations individually, we will show that the iterates x^(k+1) produced by AM and GPower are also identical.

Formulation 1: Here we have

f(x) = ‖Ax‖₂, F(x, y) = yᵀAx, X = {x ∈ R^p : ‖x‖₂ ≤ 1, ‖x‖₀ ≤ s}, Y = {y ∈ R^n : ‖y‖₂ ≤ 1}.

First, note that F_Y(x) = max_{y∈Y} F(x, y) =(S1)= ‖Ax‖₂, the gradient of which is given by

F′_Y(x) = AᵀAx / ‖Ax‖₂.   (3.8)

Given x^(k), in the AM method we have

y^(k) = argmax_{y∈Y} F(x^(k), y) =(S1)= Ax^(k) / ‖Ax^(k)‖₂.   (3.9)

One iteration of GPower started from x^(k) will thus produce the iterate

x^(k+1) =(3.7)= argmax_{x∈X} ⟨F′_Y(x^(k)), x⟩ =(3.8)= argmax_{x∈X} ⟨AᵀAx^(k)/‖Ax^(k)‖₂, x⟩ =(3.9)= argmax_{x∈X} ⟨Aᵀy^(k), x⟩ =(S3)= T_s(Aᵀy^(k)) / ‖T_s(Aᵀy^(k))‖₂.

Observe that this is precisely how x^(k+1) is computed in the AM method.

Formulation 2: Here we have

f(x) = ‖Ax‖₁, F(x, y) = yᵀAx, X = {x ∈ R^p : ‖x‖₂ ≤ 1, ‖x‖₀ ≤ s}, Y = {y ∈ R^n : ‖y‖∞ ≤ 1}.

First, note that F_Y(x) = max_{y∈Y} F(x, y) =(S2)= ‖Ax‖₁, a subgradient of which is given by

F′_Y(x) = Aᵀ sgn(Ax).   (3.10)

Given x^(k), in the AM method we have

y^(k) = argmax_{y∈Y} F(x^(k), y) =(S2)= sgn(Ax^(k)).   (3.11)

One iteration of GPower started from x^(k) will thus produce the iterate

x^(k+1) =(3.7)= argmax_{x∈X} ⟨F′_Y(x^(k)), x⟩ =(3.10)= argmax_{x∈X} ⟨Aᵀ sgn(Ax^(k)), x⟩ =(3.11)= argmax_{x∈X} ⟨Aᵀy^(k), x⟩ =(S3)= T_s(Aᵀy^(k)) / ‖T_s(Aᵀy^(k))‖₂.

Observe that this is precisely how x^(k+1) is computed in the AM method.

Formulation 3: Here we have

f(x) = ‖Ax‖₂, F(x, y) = yᵀAx, X = {x ∈ R^p : ‖x‖₂ ≤ 1, ‖x‖₁ ≤ √s}, Y = {y ∈ R^n : ‖y‖₂ ≤ 1}.

First, note that F_Y(x) = max_{y∈Y} F(x, y) =(S1)= ‖Ax‖₂, the gradient of which is given by

F′_Y(x) = AᵀAx / ‖Ax‖₂.   (3.12)

Given x^(k), in the AM method we have

y^(k) = argmax_{y∈Y} F(x^(k), y) =(S1)= Ax^(k) / ‖Ax^(k)‖₂.   (3.13)

One iteration of GPower started from x^(k) will thus produce the iterate

x^(k+1) =(3.7)= argmax_{x∈X} ⟨F′_Y(x^(k)), x⟩ =(3.12)= argmax_{x∈X} ⟨AᵀAx^(k)/‖Ax^(k)‖₂, x⟩ =(3.13)= argmax_{x∈X} ⟨Aᵀy^(k), x⟩ =(S4)= V_{λ_s(Aᵀy^(k))}(Aᵀy^(k)) / ‖V_{λ_s(Aᵀy^(k))}(Aᵀy^(k))‖₂.

Observe that this is precisely how x^(k+1) is computed in the AM method.

Formulation 4: Here we have

f(x) = ‖Ax‖₁, F(x, y) = yᵀAx, X = {x ∈ R^p : ‖x‖₂ ≤ 1, ‖x‖₁ ≤ √s}, Y = {y ∈ R^n : ‖y‖∞ ≤ 1}.

First, note that F_Y(x) = max_{y∈Y} F(x, y) =(S2)= ‖Ax‖₁, a subgradient of which is given by

F′_Y(x) = Aᵀ sgn(Ax).   (3.14)

Given x^(k), in the AM method we have

y^(k) = argmax_{y∈Y} F(x^(k), y) =(S2)= sgn(Ax^(k)).   (3.15)

One iteration of GPower started from x^(k) will thus produce the iterate

x^(k+1) =(3.7)= argmax_{x∈X} ⟨F′_Y(x^(k)), x⟩ =(3.14)= argmax_{x∈X} ⟨Aᵀ sgn(Ax^(k)), x⟩ =(3.15)= argmax_{x∈X} ⟨Aᵀy^(k), x⟩ =(S4)= V_{λ_s(Aᵀy^(k))}(Aᵀy^(k)) / ‖V_{λ_s(Aᵀy^(k))}(Aᵀy^(k))‖₂.
Observe that this is precisely how x^(k+1) is computed in the AM method.

Consider now the penalized formulations 5, 6, 7 and 8. Assume that the k-th y-iterate y^(k) of AM is identical to the k-th iterate of GPower (for k = 0 this is enforced by the assumption that GPower is started from y^(0)). By considering all 4 formulations individually, we will show that the iterates y^(k+1) produced by AM and GPower are also identical. Let A = [a₁, ..., a_p], i.e., the i-th column of A is aᵢ.

Formulation 5: Here we have

f(x) = ‖Ax‖₂² − γ‖x‖₀, F(x, y) = (yᵀAx)² − γ‖x‖₀, X = {x ∈ R^p : ‖x‖₂ ≤ 1}, Y = {y ∈ R^n : ‖y‖₂ ≤ 1}.

First, note that

F_X(y) = max_{x∈X} F(x, y) =(S5)= ‖U_γ(Aᵀy)‖₂² − γ‖U_γ(Aᵀy)‖₀ = Σᵢ₌₁ᵖ [(aᵢᵀy)² − γ]₊,

a subgradient of which is given by

F′_X(y) = 2 Σᵢ₌₁ᵖ [sgn((aᵢᵀy)² − γ)]₊ (aᵢᵀy) aᵢ =(2.5)= 2AU_γ(Aᵀy).   (3.16)

Given y^(k), in the AM method we have

x^(k+1) = argmax_{x∈X} F(x, y^(k)) =(S5)= U_γ(Aᵀy^(k)) / ‖U_γ(Aᵀy^(k))‖₂.   (3.17)

One iteration of GPower started from y^(k) will thus produce the iterate

y^(k+1) =(3.7)= argmax_{y∈Y} ⟨F′_X(y^(k)), y⟩ =(3.16)= argmax_{‖y‖₂≤1} ⟨AU_γ(Aᵀy^(k)), y⟩ =(3.17)= argmax_{‖y‖₂≤1} ⟨Ax^(k+1), y⟩ =(S1)= Ax^(k+1) / ‖Ax^(k+1)‖₂.

Observe that this is precisely how y^(k+1) is computed in the AM method.

Formulation 6: Here we have

f(x) = ‖Ax‖₁² − γ‖x‖₀, F(x, y) = (yᵀAx)² − γ‖x‖₀, X = {x ∈ R^p : ‖x‖₂ ≤ 1}, Y = {y ∈ R^n : ‖y‖∞ ≤ 1}.

First, note that

F_X(y) = max_{x∈X} F(x, y) =(S5)= ‖U_γ(Aᵀy)‖₂² − γ‖U_γ(Aᵀy)‖₀ = Σᵢ₌₁ᵖ [(aᵢᵀy)² − γ]₊,

a subgradient of which is given by

F′_X(y) = 2 Σᵢ₌₁ᵖ [sgn((aᵢᵀy)² − γ)]₊ (aᵢᵀy) aᵢ =(2.5)= 2AU_γ(Aᵀy).   (3.18)

Given y^(k), in the AM method we have

x^(k+1) = argmax_{x∈X} F(x, y^(k)) =(S5)= U_γ(Aᵀy^(k)) / ‖U_γ(Aᵀy^(k))‖₂.   (3.19)

One iteration of GPower started from y^(k) will thus produce the iterate

y^(k+1) =(3.7)= argmax_{y∈Y} ⟨F′_X(y^(k)), y⟩ =(3.18)= argmax_{‖y‖∞≤1} ⟨AU_γ(Aᵀy^(k)), y⟩ =(3.19)= argmax_{‖y‖∞≤1} ⟨Ax^(k+1), y⟩ =(S2)= sgn(Ax^(k+1)).

Observe that this is precisely how y^(k+1) is computed in the AM method.

Formulation 7: Here we have

f(x) = ‖Ax‖₂ − γ‖x‖₁, F(x, y) = yᵀAx − γ‖x‖₁, X = {x ∈ R^p : ‖x‖₂ ≤ 1}, Y = {y ∈ R^n : ‖y‖₂ ≤ 1}.

Note that the functions y ↦ F(x, y) are linear and that, by definition, F_X(y) = max_{x∈X} F(x, y). Moreover, note that the gradient of y ↦ F(x, y) at y is equal to Ax. Hence, if x is any vector that maximizes F(x, y^(k)) over X, then Ax is a subgradient of F_X at y^(k). Note that this is precisely how x^(k+1) is defined in the AM method: x^(k+1) = argmax_{x∈X} F(x, y^(k)). Hence, Ax^(k+1) is a subgradient of F_X at y^(k), and one iteration of GPower started from y^(k) will produce the iterate

y^(k+1) =(3.7)= argmax_{y∈Y} ⟨F′_X(y^(k)), y⟩ = argmax_{‖y‖₂≤1} ⟨Ax^(k+1), y⟩ =(S1)= Ax^(k+1) / ‖Ax^(k+1)‖₂.

Observe that this is precisely how y^(k+1) is computed in the AM method.

Formulation 8: Here we have

f(x) = ‖Ax‖₁ − γ‖x‖₁, F(x, y) = yᵀAx − γ‖x‖₁, X = {x ∈ R^p : ‖x‖₂ ≤ 1}, Y = {y ∈ R^n : ‖y‖∞ ≤ 1}.

Note that the functions y ↦ F(x, y) are linear and that, by definition, F_X(y) = max_{x∈X} F(x, y). Moreover, note that the gradient of y ↦ F(x, y) at y is equal to Ax. Hence, if x is any vector that maximizes F(x, y^(k)) over X, then Ax is a subgradient of F_X at y^(k). Note that this is precisely how x^(k+1) is defined in the AM method: x^(k+1) = argmax_{x∈X} F(x, y^(k)).
Hence, Ax^(k+1) is a subgradient of F_X at y^(k), and one iteration of GPower started from y^(k) will produce the iterate

y^(k+1) =(3.7)= argmax_{y∈Y} ⟨F′_X(y^(k)), y⟩ = argmax_{‖y‖∞≤1} ⟨Ax^(k+1), y⟩ =(S2)= sgn(Ax^(k+1)).

Observe that this is precisely how y^(k+1) is computed in the AM method.

Having established the equivalence between AM and GPower, local convergence of the AM method for all 8 SPCA formulations follows from the theory developed in [9] and [10].

In this section we describe several approaches for embedding Algorithm 2 (AM) within a parallel scheme for solving l identical SPCA problems, started from a number of starting points x^(0,1), ..., x^(0,l). This is done in order to obtain a loading vector explaining more variance and will be discussed in more detail in Section 4.1. As we will see, it may not necessarily be most efficient to solve all l problems simultaneously. Instead, we consider a class of parallelization schemes where we divide the l problems into "batches" of r problems each, and solve each batch of r problems simultaneously. In this setting, at each iteration we need to perform identical operations in parallel, notably the matrix-vector multiplications Ax^(k,1), ..., Ax^(k,r) and Aᵀy^(k,1), ..., Aᵀy^(k,r). It is useful to view such a sequence of matrix-vector products as a single matrix-matrix product, e.g., A[x^(k,1), ..., x^(k,r)] in the first case, and use optimized libraries for parallelization. This simple trick leads to considerable speedups when compared to other approaches. We use similar ideas for the parallel evaluation of the operators. Note that even in the l = 1 case, i.e., if we wish to run SPCA from a single starting point only, there is scope for parallelization of the matrix-vector products and function evaluations. Hence, parallelization in our method serves two purposes:
1. to obtain solutions explaining more variance, by solving the problem l > 1 times using different starting points, and
2. to speed up the computations themselves; this is relevant even in the l = 1 case.

In particular, in this section we describe 4 parallelization approaches:

• NAI = "naive" (r = 1),
• SFA = "start-from-all" (r = l),
• BAT = "batches" (1 < r < l),
• OTF = "on-the-fly" (BAT improved by a dynamic replacement strategy to reduce idle time).

The working of these 4 approaches is illustrated in Figure 1 in a situation with l = 6. In what follows we describe the methods informally, in a narrative style, with a suitable choice of numerical experiments illustrating the differences between the ideas.

As shown in [9], [10] for GPower, and due to our equivalence theorem (Theorem 1), we know that Algorithm 2 (AM) is only able to converge to a local solution. Moreover, the quality of the solution will depend on the starting point (SP) x^(0) used. When the algorithm is run just once, the quality of the obtained solution, in terms of the objective value (or explained variance), can be poor. Hence, if the amount of explained variance is important, it is useful to run the method repeatedly from a number of different SPs.

In this and all subsequent experiments we generated A ∈ R^{n×p} with independent entries uniformly distributed on [−1, 1], with n = 512, and considered the L0-constrained L2-variance SPCA problem. For each target sparsity level s we ran AM from l = 1,000 randomly generated SPs with maxIt = 200 and a fixed small tolerance tol. The results are given in Figure 2, where the vertical axis corresponds to the amount of explained variance of a particular solution relative to the best solution found.

Figure 2: It may be easy to converge to a poor local solution. [Horizontal axis: target sparsity level s; vertical axis: explained variance / best explained variance.]

Clearly, for small s it is easy to obtain a bad solution if we run the method only a few times; this effect is milder for large s but may be substantial nevertheless in real-life problems. Hence, especially when s is small, it is necessary to employ a globalization strategy such as rerunning AM from a number of different starting points. This experiment illustrates that the simple strategy of running the method from a number of randomly generated starting points can be effective in finding solutions with more explained variance. A "naive" (NAI) approach would be to do this sequentially: solve the problem with one starting point first before solving it for another starting point.

Running AM in parallel, started from a number of SPs, increases the utilization of computer resources, especially on parallel architectures. In order to demonstrate this, we generated 6 data matrices with p ranging from 1,000 upwards and solved the L1-penalized L2-variance SPCA formulation with l = 256 SPs (and maxIt = 10). By BATr we denote the approach with batches of size r. Hence, SFA = BAT256 and NAI = BAT1. Besides these two basic choices, we look at BAT4, BAT16 and BAT64 as well. The results can be found in Figure 3: different problem sizes p appear on the horizontal axis, and the vertical axis shows the speedup obtained by applying a particular batching strategy compared to NAI.

Figure 3: Economies of scale: "start-from-all" (SFA) is better than any of the batching strategies on a single-core machine (LEFT); even more so on a multi-core machine (RIGHT).

Note that even on a single-core computer (LEFT plot) we benefit from running the methods in parallel ("economies of scale") rather than running them one after another. Indeed, we obtain a speedup of at least 2× with BAT16 across the whole range of problem sizes, and a 4× speedup with SFA for large enough p. With 12 cores (RIGHT plot) the effect is much more dramatic: the speedup for BAT16 is consistently above 10×, and can even reach 50× for SFA.

It often happens, especially when the batch size is large, that some problems within a batch converge sooner than others. The vanilla BAT approach described above does nothing about this, and continues through matrix-matrix multiplies, updating the already converged iterates, until the last problem in the batch converges. A minor but not negligible speedup is possible by employing an "on-the-fly" (OTF) dynamic replacement technique, where whenever a certain problem converges, it is replaced by a new one. Hence, no predefined batches exist; OTF can be viewed as a greedy list scheduling heuristic. We used l = 1024 starting points and compare SFA (= BAT1024) with BAT64 and OTF64, the dynamic replacement variant of BAT64. Looking at the LEFT plot in Figure 4, we see that the average number of iterations per starting point is much smaller for OTF. This results in a speedup of more than 2× when compared with SFA (RIGHT plot). Notably, SFA is slower than both BAT64 and OTF64, which shows that it may not be optimal to choose r = l.

Accompanying this paper is the open source software package "24am" implementing the parallelization strategies described in Section 4, all with Algorithm 2 (AM) used as the underlying solution method, with the option of using any of the 8 optimization formulations of SPCA described in Table 1.
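The batching idea, running r starting points simultaneously so that each AM iteration becomes two matrix-matrix products A X and AᵀY handled by an optimized BLAS, can be sketched as follows (our own minimal illustration of the SFA/BAT pattern for formulation 1, assuming numpy; not the 24am code itself):

```python
import numpy as np

def am_batched(A, s, r=16, iters=100, seed=0):
    """Run r AM instances of formulation 1 (L2 variance, L0 constraint)
    at once; the columns of X are independent starting points."""
    rng = np.random.default_rng(seed)
    p = A.shape[1]
    X = rng.standard_normal((p, r))
    X /= np.linalg.norm(X, axis=0)
    for _ in range(iters):
        U = A @ X                                 # one matrix-matrix product
        Y = U / np.linalg.norm(U, axis=0)         # y-step for all columns
        V = A.T @ Y                               # second matrix-matrix product
        thr = np.sort(np.abs(V), axis=0)[-s]      # per-column s-th largest |entry|
        X = np.where(np.abs(V) >= thr, V, 0.0)    # T_s applied column-wise
        X /= np.linalg.norm(X, axis=0)
    obj = np.linalg.norm(A @ X, axis=0)
    return X[:, np.argmax(obj)], obj.max()        # keep the best of the r runs

A = np.random.default_rng(2).standard_normal((60, 15))
x_best, f_best = am_batched(A, s=4)
assert np.count_nonzero(x_best) <= 4
assert f_best <= np.linalg.norm(A, 2) + 1e-9   # cannot exceed the dense optimum
```

OTF would additionally replace any converged column of X with a fresh starting point instead of carrying it through further iterations.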
The name 24am comes from the fact that we implement the solver for 3 different parallel architectures: multi-core processors, GPUs and computer clusters, leading to 24 = 8 × 3 codes (published at https://code.google.com/p/24am/).

Figure 4: Dynamic replacement: "on-the-fly" (OTF) is better than "batches" (BAT), which is better than "start-from-all" (SFA). [LEFT: average number of iterations per SP; RIGHT: speedup, for SFA (= BAT1024), BAT64 and OTF64.]

Here we solve 9 random L0-constrained L2-variance SPCA instances of sizes p = 100 × i, i = 1, ..., 9, with n = p/10 and 100 SPs each, on a machine using 1, 2, 4 and 8 cores; the results are shown in Figure 5.
Figure 5: Multi-core speedup is proportional to the number of cores.

The plot on the LEFT shows the total computational time; the plot on the RIGHT shows the speedup of the multi-core codes compared to the single-core code. Note that the speedup is consistently close to the number of cores for the 2- and 4-core setups across all problem sizes, and grows with p from 5x to over 7x in the 8-core setup.

Here we solve 8 random L1-penalized L2-variance SPCA instances with p varying over several orders of magnitude and n a fixed fraction of p, using 1, 16 and 256 SPs on a single-core CPU and a GPU; the results are shown in Figure 6.
Figure 6: GPU code can achieve a 125x speedup compared to the single-core code when 256 starting points are used.

The plot on the LEFT shows the total computational time. The red lines with triangle markers correspond to the single-core setup; the "higher" the line, the more starting points were used. The blue lines with square markers correspond to our GPU codes. While the runtime increases linearly with problem size for the single-core codes, it grows slowly for the GPU codes. Note that the GPU code may actually be slower for small problem sizes. Looking at the RIGHT plot, we see that the GPU code is capable of a 100-125x speedup; this happens for large problem sizes and 256 SPs. The speedup can reach 100x for 16 SPs as well.

In this experiment we solved several L1-penalized L2-variance SPCA problems with a fully dense matrix A in R^{n x p}; the results are in Table 4. We focus our discussion on the largest of the problems only (last three lines of the table), one with n = 6 x 10^5 and p = 8 x 10^4. We used a cluster of 800 CPUs; storage of the data matrix required 357.6 GB of memory. The matrix was first loaded from files to memory; this process took t = 92 seconds. Subsequently, the loaded data was distributed to the CPUs where needed, which took an additional t = 713 seconds. Finally, we ran the AM method with 1, 32 and 64 starting points and measured the average time of a single iteration; the results are t = 4.14, t = 51.11 and t = 134.89 seconds, respectively. The t_k column of Table 4 depicts the time it takes for the solver to perform k iterations. We treated the problem directly, without using any safe feature elimination techniques [14]. Such preprocessing could, however, expand the reach of our cluster code to even larger problem sizes.
n x p               | memory (GB) | SPs | t_load (s) | t_dist (s) | t_1    | t_{k_2} | t_{k_3}
6x10^5 x 1x10^4     | 44.7        | 1   | 49.22      | 104.26     | 0.45   | 2.44    | 11.62
6x10^5 x 1x10^4     | 44.7        | 32  | -          | -          | 6.37   | 29.72   | 115.73
6x10^5 x 1x10^4     | 44.7        | 64  | -          | -          | 14.14  | 52.64   | 219.8
6x10^5 x 4x10^4     | 178.8       | 1   | 129.69     | 611.69     | 1.24   | 5.12    | 31.46
6x10^5 x 4x10^4     | 178.8       | 32  | -          | -          | 17.50  | 61.36   | 255.80
6x10^5 x 4x10^4     | 178.8       | 64  | -          | -          | 31.36  | 141.61  | 525.08
6x10^5 x 8x10^4     | 357.6       | 1   | 92.12      | 713.45     | 4.14   | 15.82   | 95.51
6x10^5 x 8x10^4     | 357.6       | 32  | -          | -          | 51.11  | 324.26  | 619.45
6x10^5 x 8x10^4     | 357.6       | 64  | -          | -          | 134.89 | 690.06  | -
Table 4: Results from the cluster (times in seconds; t_k is the time needed to perform k iterations, with t_1 the average time of a single iteration). For the first experiment the dimensions of the virtual grid were the same as those of the loaded data, hence t_dist is small; for the rest we used 64 x 64 pieces.

In the first experiment we tested the AM method with the L0-constrained L2-variance formulation (with s = 5) on two medium-size data sets from the Machine Learning Repository [18]: news articles that appeared in the New York Times and abstracts of articles published in PubMed. Each data set is formatted as a matrix A in R^{n x p}, where the rows of A correspond to news articles in the NYTimes data set and to abstracts in PubMed, and the columns correspond to words. The number of appearances of word j in article or abstract i is the (i, j)-th entry of A; the matrices are hence clearly sparse. The NYTimes data set has 102,660 articles, 300,000 words, and approximately 70 million nonzero entries. The PubMed data set contains 8.2 million abstracts, 141,043 words, and approximately 484 million nonzeros. The matrices can be stored in 0.778 GB and 5.42 GB of memory, respectively. We have customized the AM method to exploit sparsity as much as possible. In Table 5 we present the first 5 sparse principal components (5 words each). Clearly, the first PC for NYTimes is about sports, the second about business, the third about elections, the fourth about education and the fifth about the United States. Similar interpretations can be given to the PubMed PCs.

For single and multi-core architectures we developed our codes using the CBLAS interface. In particular, we use both the GSL BLAS and the Intel MKL [15] implementations (single-core) and the GotoBLAS2 [16] and Intel MKL implementations (multi-core). Parallelization in the multi-core case is performed via the OpenMP interface. When comparing the performance of single-core and multi-core architectures, we use the Intel MKL library for both the serial and parallel versions of the same algorithm, for consistency.
Nevertheless, in our experience, the GotoBLAS2 implementations of these algorithms are faster than the Intel MKL implementations. We use CuBLAS version 4.0 [17] on the GPU (and make use of Thrust whenever possible for operations such as sorting, memory arrangement and data allocation on the GPU). For comparisons between single-core and GPU architectures, we use
NYT 1st PC | NYT 2nd PC | NYT 3rd PC | NYT 4th PC | NYT 5th PC
game       | companies  | campaign    | children   | attack
play       | company    | president   | program    | government
player     | million    | al gore     | school     | official
season     | percent    | bush        | student    | US
team       | stock      | george bush | teacher    | united states

PubMed 1st PC | PubMed 2nd PC | PubMed 3rd PC | PubMed 4th PC | PubMed 5th PC
disease       | cell          | activity      | cancer        | age
level         | effect        | concentration | malignant     | child
patient       | expression    | control       | mice          | children
therapy       | human         | rat           | primary       | parent
treatment     | protein       | receptor      | tumor         | year

Table 5: First 5 sparse PCs for the NYTimes and PubMed data sets.

the GSL BLAS implementation on the single-core. On a cluster, linear algebra is done with Intel MKL's PBLAS, while communication between nodes is via MPI.
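To make the role of these BLAS kernels concrete, each AM iteration for the L0-constrained L2-variance formulation (the one used in the text-data experiment above) reduces to two matrix-vector products plus a hard-thresholding step, which is why the dominant cost maps directly onto BLAS (or sparse matrix-vector) routines. The following is an illustrative single-starting-point Python sketch, not the 24am C++ code; the function name and defaults are our own.

```python
import numpy as np

def am_l0_l2(A, s, max_iter=200, tol=1e-8, seed=0):
    """Alternating maximization for  max ||Ax||_2  s.t. ||x||_2 <= 1, ||x||_0 <= s.
    Each iteration costs one product with A, one with A^T, and a partial sort;
    for sparse data (such as the bag-of-words matrices above) the two products
    can use sparse storage directly."""
    rng = np.random.default_rng(seed)
    n, p = A.shape
    x = rng.standard_normal(p)
    x /= np.linalg.norm(x)
    obj_old = -np.inf
    for _ in range(max_iter):
        y = A @ x
        y /= np.linalg.norm(y)                 # y-step: y = Ax / ||Ax||_2
        z = A.T @ y                            # x-step: hard-threshold A^T y
        x = np.zeros(p)
        idx = np.argsort(np.abs(z))[-s:]       # indices of the s largest |z_i|
        x[idx] = z[idx]
        x /= np.linalg.norm(x)
        obj = float(np.linalg.norm(A @ x))
        if obj - obj_old <= tol:               # the objective is monotone
            break
        obj_old = obj
    return x, obj
```

The returned loading vector has at most s nonzeros (e.g., s = 5 words per component in the text experiment), and the objective value is the explained variance of the corresponding sparse PC.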
We propose a unifying framework for solving 8 SPCA formulations, in which all have the same form and are solved by the same algorithm: the alternating maximization (AM) method. We observed that AM is in all cases equivalent to the GPower method applied to a suitable convex function. Five of these formulations were previously studied in the literature and three were not; notably, the L0-constrained L1 (robust) variance formulation seems to be new. For each of these formulations we have written 4 efficient codes, one serial and three parallel, aimed at single-core, multi-core and GPU workstations and a cluster. All these codes are equipped with efficient parallel implementations of a multiple-starting-point globalization strategy which aims to find PCs explaining more variance, with the speedup per starting point reaching up to two orders of magnitude. The most efficient of these implementations is "on-the-fly". We demonstrated that our cluster code is able to solve a very large problem with a 357 GB fully dense data matrix.

Acknowledgements
The work of P. Richtárik and M. Takáč was supported by the EPSRC grant EP/I017127/1 (Mathematics for Vast Digital Resources) and the Centre for Numerical Algorithms and Intelligent Software (funded by EPSRC grant EP/G036136/1 and the Scottish Funding Council). P. Richtárik was also partially supported by the EPSRC grant EP/J020567/1 (Algorithms for Data Simplicity). S. D. Ahipaşaoğlu was partially supported by the SUTD Start-Up Research Grant, Project Number: SRG ESD 2012 033.
References

[1] Jolliffe, I. (1986) Principal Component Analysis. Springer Verlag, NY.
[2] Mackey, L. (2008) Deflation methods for sparse PCA. Advances in Neural Information Processing Systems 21:1017-1024.
[3] Kwak, N. (2008) Principal component analysis based on L1-norm maximization. IEEE Transactions on Pattern Analysis and Machine Intelligence 30:1672-1680.
[4] Zou, H., Hastie, T. and Tibshirani, R. (2004) Sparse principal component analysis. Technical report, Stanford University.
[5] Moghaddam, B., Weiss, Y. and Avidan, S. (2006) Spectral bounds for sparse PCA: exact and greedy algorithms. In: Advances in Neural Information Processing Systems 18:915-922.
[6] d'Aspremont, A., El Ghaoui, L., Jordan, M.I. and Lanckriet, G.R.G. (2007) A direct formulation for sparse PCA using semidefinite programming. SIAM Review 49:434-448.
[7] d'Aspremont, A., Bach, F. and El Ghaoui, L. (2008) Optimal solutions for sparse principal component analysis. Journal of Machine Learning Research 9:1269-1294.
[8] Lu, Z.S. and Zhang, Y. (2009) An augmented Lagrangian approach for sparse principal component analysis. Mathematical Programming, Series A. DOI: 10.1007/s10107-011-0452-4.
[9] Journée, M., Nesterov, Y., Richtárik, P. and Sepulchre, R. (2010) Generalized power method for sparse principal component analysis. Journal of Machine Learning Research 11:517-553.
[10] Luss, R. and Teboulle, M. (2011) Conditional gradient algorithms for rank-one matrix approximations with a sparsity constraint. Technical report.
[11] Meng, D., Zhao, Q. and Xu, Z. (2012) Improve robustness of sparse PCA by L1-norm maximization. Pattern Recognition 45:487-497.
[12] Richtárik, P. (2011) Finding sparse approximations to extreme eigenvectors: generalized power method for sparse PCA and extensions. In: Proceedings of Signal Processing with Adaptive Sparse Structured Representations.
[13] Bah, B. and Tanner, J. (2010) Improved bounds on restricted isometry constants for Gaussian matrices. SIAM Journal on Matrix Analysis and Applications 31:2882-2898.
[14] Zhang, Y. and El Ghaoui, L. (2011) Large-scale sparse principal component analysis with application to text data. In: Advances in Neural Information Processing Systems 24.