Exact and Approximation Algorithms for Sparse PCA
Yongchun Li
Department of Industrial & Systems Engineering, Virginia Tech, Blacksburg, VA 24061, [email protected]
Weijun Xie
Department of Industrial & Systems Engineering, Virginia Tech, Blacksburg, VA 24061, [email protected]
Sparse PCA (SPCA) is a fundamental model in machine learning and data analytics, which has witnessed a variety of application areas such as finance, manufacturing, biology, and healthcare. By selecting a prespecified-size principal submatrix of a covariance matrix that maximizes its largest eigenvalue for better interpretability, SPCA advances the conventional PCA with both feature selection and dimensionality reduction. Existing approaches often approximate SPCA as a semi-definite program (SDP) without strictly enforcing the important cardinality constraint that restricts the number of selected features to a constant. To fill this gap, we propose two exact mixed-integer SDPs (MISDPs) by exploiting the spectral decomposition of the covariance matrix and the properties of the largest eigenvalues. We then analyze the theoretical optimality gaps of their continuous relaxation values and prove that they are stronger than that of the state-of-the-art one. We further show that the continuous relaxations of the two MISDPs can be recast as saddle point problems without involving semi-definite cones, and thus can be effectively solved by first-order methods such as the subgradient method. Since off-the-shelf solvers, in general, have difficulty in solving MISDPs, we approximate SPCA with arbitrary accuracy by a mixed-integer linear program (MILP) of a similar size to the MISDPs. The continuous relaxation values of the two MISDPs can be leveraged to further reduce the size of the proposed MILP. To be more scalable, we also analyze greedy and local search algorithms, prove their first-known approximation ratios, and show that these ratios are tight. Our numerical study demonstrates that the continuous relaxation values of the proposed MISDPs are quite close to optimality, the proposed MILP model can solve small and medium-size instances to optimality, and the approximation algorithms work very well for all the instances. Finally, we extend the analyses to Rank-one Sparse SVD (R1-SSVD) with non-symmetric matrices and to Sparse Fair PCA (SFPCA) when there are multiple covariance matrices, each corresponding to a protected group.
Key words: Sparse PCA, Largest Eigenvalue, Mixed-Integer Program, Semi-definite Program, Greedy, Local Search, SVD, Fairness
1. Introduction
This paper studies the sparse principal component analysis (SPCA) problem of the form

(SPCA)  w* := max_{x∈ℝⁿ} { xᵀAx : ‖x‖₂ = 1, ‖x‖₀ = k },   (1)

where the symmetric positive semi-definite matrix A ∈ ℝ^{n×n} denotes the sample covariance of a dataset with n features and the integer k ∈ [n] denotes the sparsity of its first principal component (PC). In SPCA (1), the objective is to select the size-k principal submatrix of the covariance matrix A with the maximum largest eigenvalue. Compared to the conventional PCA, the extra zero-norm constraint ‖x‖₀ = k in SPCA (1) restricts the first PC x to the k most important features. In this way, SPCA improves the interpretability of the obtained PC, which has been recognized as early as Jeffers [20] in 1967. It is also recognized that SPCA can be more reliable than PCA for large-scale datasets, where the number of features is far more than that of observations [41]. These advantages of SPCA have benefited many application fields such as biology, finance, cloud computing, and healthcare, which frequently deal with datasets with a massive number of features (see, e.g., [8, 21, 25, 30]). Our paper contributes to the relevant literature on SPCA from three aspects: exact mixed-integer programs, convex relaxations, and approximation algorithms.
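To make formulation (1) concrete, the following minimal Python sketch (ours, purely illustrative; the function name is not from the paper) solves SPCA by brute force, enumerating all size-k supports and computing the largest eigenvalue of each principal submatrix. This is viable only for very small n, but it mirrors the combinatorial view of SPCA used throughout the paper.

    import itertools
    import numpy as np

    def spca_brute_force(A, k):
        """Solve SPCA (1) exactly by enumerating all size-k supports S
        and computing lambda_max(A[S, S]); only viable for tiny n."""
        n = A.shape[0]
        best_val, best_S = -np.inf, None
        for S in itertools.combinations(range(n), k):
            idx = np.array(S)
            # eigvalsh returns eigenvalues in ascending order
            val = np.linalg.eigvalsh(A[np.ix_(idx, idx)])[-1]
            if val > best_val:
                best_val, best_S = val, S
        return best_val, best_S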
Exact Mixed-Integer Programs:
As shown in formulation (1), SPCA is highly nonconvex: it maximizes a convex function subject to two nonconvex constraints (i.e., an ℓ₂ equality constraint and an ℓ₀ equality constraint). Albeit superior to traditional PCA, SPCA (1) is notoriously expensive to solve; see, e.g., the complexity analysis and inapproximability results in Magdon-Ismail [27]. As a result, equivalent formulations and algorithms for exactly solving SPCA are quite limited in the literature (see, e.g., [5, 17, 29]). Moghaddam et al. [29] introduced a branch and bound method for SPCA, pruning redundant nodes using the eigenvalues of principal submatrices and a greedy algorithm. Recently, Berk and Bertsimas [5] embedded various upper and lower bounds into this branch and bound framework, which can efficiently prune nodes and quickly certify optimality for quite a few instances. It is worth mentioning that Gally and Pfetsch [17] proposed a mixed-integer semi-definite program (MISDP) formulation for SPCA; our second MISDP formulation differs from that of Gally and Pfetsch [17] by deriving two strong conic valid inequalities. Another interesting work can be found in Dey et al. [15], where the authors developed approximate convex integer programs for SPCA with an optimality gap of (1 + √(k/(k+1))). Quite differently, we propose two exact MISDP formulations and one approximate mixed-integer linear program (MILP) for SPCA from novel perspectives of analyzing the largest eigenvalue. Specifically, the proposed MILP formulation can be arbitrarily close to the optimal value of SPCA, and it can be directly solved by off-the-shelf solvers such as Gurobi.

Convex Relaxations:
Besides solving SPCA exactly, researchers have also actively sought effective convex relaxations. A common approach in the literature is to develop SDP relaxations
for SPCA (see, e.g., [1, 13, 16, 12, 40]). Albeit convex, solvers often have difficulty solving large-scale instances of these SDP formulations (e.g., n = Ω(100)). The computational challenge of these SDP problems urgently calls for more effective methods to compute relaxation values for SPCA. From a different angle, this paper solves the continuous relaxations of the proposed MISDP formulations as maximin saddle point problems, where the subgradient method enjoys an O(1/√T) rate of convergence [31] based on Euclidean projections. Surprisingly, we further show that the projection oracle of the subgradient method is a second-order conic program rather than an SDP and thus can be easily dealt with.

Approximation Algorithm:
Another early thread of research on SPCA is the development of high-quality heuristics for solving SPCA to near optimality, such as the greedy algorithm [16, 19], the truncation algorithm [9], the power method [22], and the variable neighborhood search method [7]. In particular, the truncation algorithm in [9] so far provides the best-known approximation ratio O(n^{-1/3}) and can be easily implemented to generate a feasible solution for SPCA. This paper investigates the greedy and local search algorithms and proves their first-known approximation ratios O(1/k) for SPCA.

We observe that when the support of x has been successfully identified, SPCA (1) reduces to the conventional PCA of finding the largest eigenvalue and eigenvector of a size-k principal submatrix of A. This fact motivates us to derive two equivalent MISDP formulations and an approximate MILP for SPCA. Below is a summary of the main contributions of this paper.
(i) For each formulation, we derive the theoretical optimality gap between its continuous relaxation value and the optimal value of SPCA;
(ii) our first MISDP formulation inspires us to derive closed-form expressions for the coefficients of valid inequalities, which can be efficiently embedded into branch and cut algorithms;
(iii) we show that the subgradient method can be adapted to ease the computational burden of obtaining the MISDP continuous relaxation values with an O(1/√T) rate of convergence; these continuous relaxation values can further help reduce the size of the MILP;
(iv) the continuous relaxation of our second MISDP formulation is proven to be stronger than the one proposed in d'Aspremont et al. [13];
(v) the proposed MILP formulation has a similar size to the two MISDPs and can be directly solved by many existing solvers;
(vi) we prove and demonstrate the tightness of the first-known approximation ratios for the greedy and local search algorithms;
(vii) our analyses extend to Rank-one Sparse SVD (R1-SSVD), which computes the largest singular value of a possibly non-symmetric matrix A with sparsity constraints imposed separately on its left-singular and right-singular vectors; and
(viii) we extend the second MISDP formulation to Sparse Fair PCA (SFPCA), where the covariance matrices are observed from multiple protected groups.

Our contributions have both theoretical and practical relevance. Theoretically, we contribute three exact mixed-integer convex programs for SPCA. Practically, our MILP formulation can attain optimal solutions for SPCA, improve the continuous relaxations, or find better-quality feasible solutions for small and medium-size instances. We apply the computationally efficient subgradient method to solve the continuous relaxations of the proposed MISDPs, as well as to derive their theoretical optimality gaps. We also develop two scalable approximation algorithms that solve SPCA to near optimality and prove their approximation ratios. Our proposed algorithms have been successfully applied to large-scale data analytics problems, such as identifying key features for the drug abuse problem. We further extend the analyses to R1-SSVD and SFPCA. All the theoretical contributions are summarized in Table 1.

Table 1. Summary of Theoretical Contributions
Problem    Exact Mixed-Integer Program    Optimality Gap
SPCA       MISDP (6)                      min{k, n/k}
           MISDP (15)                     min{k, n/k}
           MILP (22)                      min{k(√d/2)²/4, (n/k)√d + (n−k)(√d/2)²/4}
R1-SSVD    MISDP (34)                     √(mn/(k₁k₂))
           MISDP (35)                     min{√(k₁k₂), √(mn/(k₁k₂))}
           MILP (36)                      √(mn/(k₁k₂)) · min{(k₁+k₂)(√d/2)²/4, (mn/(k₁k₂))√d + (m+n−k₁−k₂)(√d/2)²/4}
SFPCA      MISDP (40)                     -

Problem    Approximation Algorithm        Approximation Ratio
SPCA       Greedy Algorithm 1             1/k
           Local Search Algorithm 2       1/k
R1-SSVD    Truncation algorithm           max{√(1/k₁), √(1/k₂), √(k₁k₂/(mn))}
           Greedy Algorithm 3             √(1/(k₁k₂))
           Local Search Algorithm 4       √(1/(k₁k₂))

Notes. The optimality gap is the ratio between the continuous relaxation value and the optimal one; the formulation (40) provides an upper bound for general SFPCA and becomes exact when there are only two groups; the approximation ratio denotes the ratio between the objective value of an approximation algorithm and the optimal one.
Organization:
The remainder of this paper is organized as follows. Sections 2 and 3 develop two MISDP formulations for SPCA and prove the optimality gaps of their continuous relaxation values. Section 4 investigates an approximate MILP, which can be arbitrarily close to the optimal value of SPCA, and proves the optimality gap of its continuous relaxation value. Section 5 introduces and analyzes two approximation algorithms. Section 6 conducts a numerical study to demonstrate the efficiency and solution quality of our proposed formulations and algorithms. Sections 7 and 8 extend the analyses to Rank-one Sparse SVD (R1-SSVD) and Sparse Fair PCA (SFPCA), respectively. Finally, Section 9 presents conclusions and future directions.
Notation:
The following notation is used throughout the paper. We let Sⁿ, Sⁿ₊, and Sⁿ₊₊ denote the set of all n × n symmetric real matrices, the set of all n × n symmetric positive semi-definite matrices, and the set of all n × n symmetric positive definite matrices, respectively. We use bold lower-case letters (e.g., x) and bold upper-case letters (e.g., X) to denote vectors and matrices, respectively, and use the corresponding non-bold letters (e.g., x_i, X_ij) to denote their components. We use 0 to denote the zero vector and 1 to denote the all-ones vector. We use ⌈·⌉ to denote the ceiling function. We let ℝⁿ₊ denote the set of all n-dimensional nonnegative vectors and ℝⁿ₊₊ the set of all n-dimensional positive vectors. Given a positive integer n and an integer s ≤ n, we let [n] := {1, 2, …, n} and [s, n] := {s, s+1, …, n}. We let I_n denote the n × n identity matrix and e_i denote its i-th column vector. Given a set S and an integer k, we let |S| denote its cardinality and let (S choose k) denote the collection of all size-k subsets of S. Given an m × n matrix A and two sets S ⊆ [m], T ⊆ [n], we let A_{S,T} denote the submatrix of A with rows and columns indexed by S and T, respectively, and let A_S denote the submatrix of A with columns from the set S. Given a vector x ∈ ℝⁿ, we let Diag(x) denote the diagonal matrix with diagonal elements x₁, …, x_n and let supp(x) denote the support of x. Given a square symmetric matrix A, we let diag(A) denote the vector of diagonal entries of A and let λ_min(A), λ_max(A) denote the smallest and largest eigenvalues of A, respectively. Given a non-square matrix A, we let σ_max(A) denote its largest singular value. Additional notation will be introduced as needed.
2. Exact MISDP Formulation (I)
In this section, we derive an equivalent mixed-integer semi-definite programming (MISDP) formulation for SPCA based on the spectral decomposition and disjunctive programming techniques.

To begin with, for each i ∈ [n], we let the binary variable z_i = 1 if the i-th feature is selected, and 0 otherwise. Linearizing the zero-norm constraint using the binary vector z, SPCA (1) can be equivalently formulated as the following nonconvex mixed-integer quadratic program:

(SPCA)  w* := max_{x∈ℝⁿ, z∈Z} { xᵀAx : ‖x‖₂ = 1, |x_i| ≤ z_i, ∀i ∈ [n] },   (2)

where the cardinality set Z denotes the feasible region of z, i.e., Z = { z ∈ {0,1}ⁿ : Σ_{i∈[n]} z_i = k }.

For SPCA (2), we note that (i) the binary vector z is of vital importance, and its feasible region Z will be used throughout this paper for the two MISDPs and the MILP; and (ii) the derivations of all three mixed-integer formulations originate from the plain SPCA (2). We also observe that, given a size-k subset of features (i.e., once the support of the binary vector z in formulation (2) is specified), SPCA (2) is equivalent to finding the largest eigenvalue of the corresponding principal submatrix of A. This fact inspires us to propose three equivalent mixed-integer convex programs for SPCA (2). The observation is summarized below.

Lemma 1
For a symmetric matrix A ∈ Sⁿ and a size-k set S ⊆ [n], the following must hold:
(i) max_{x∈ℝⁿ} { xᵀAx : ‖x‖₂ = 1, x_i = 0, ∀i ∉ S } = λ_max(A_{S,S});
(ii) max_{X∈S^k₊} { tr(A_{S,S} X) : tr(X) = 1 } = λ_max(A_{S,S}); and
(iii) if the matrix A is positive semi-definite, then λ_max(A_{S,S}) = λ_max(Σ_{i∈S} c_i c_iᵀ), where A = CᵀC, C ∈ ℝ^{d×n} denotes the Cholesky factorization matrix of A, d is the rank of A, and c_i ∈ ℝ^d denotes the i-th column vector of C for each i ∈ [n].

Proof. See Appendix A.1. □
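As a quick numerical sanity check of Part (iii) (our sketch, not part of the paper's development), one can verify the identity λ_max(A_{S,S}) = λ_max(Σ_{i∈S} c_i c_iᵀ) on a random covariance matrix:

    import numpy as np

    rng = np.random.default_rng(0)
    n, k = 8, 3
    B = rng.standard_normal((n, n))
    A = B.T @ B                  # a random positive (semi-)definite covariance
    C = np.linalg.cholesky(A).T  # A = C^T C, with C upper triangular (full rank here)
    S = [1, 4, 6]                # an arbitrary size-k support

    lhs = np.linalg.eigvalsh(A[np.ix_(S, S)])[-1]
    rhs = np.linalg.eigvalsh(sum(np.outer(C[:, i], C[:, i]) for i in S))[-1]
    assert np.isclose(lhs, rhs)  # Lemma 1(iii): both equal lambda_max(A_{S,S})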
The results in Lemma 1 are crucial to this paper and allow us to derive the exact mixed-integer convex programs for SPCA. Specifically, we remark that: Part (i) of Lemma 1 reduces SPCA to selecting the size-k principal submatrix of A that achieves the maximum largest eigenvalue, which establishes a combinatorial formulation of SPCA; Part (ii) of Lemma 1 shows that the SDP relaxation of the largest eigenvalue problem obtained by dropping the rank-one constraint is exact, which inspires us to develop two MISDP formulations for SPCA; and, since the covariance matrix used in SPCA is always positive semi-definite, the identity in Part (iii) of Lemma 1 suggests an alternative way of formulating SPCA using the Cholesky decomposition, which motivates us to derive an exact MISDP formulation in this section and an MILP in a later section.

According to Part (i) in Lemma 1, introducing a subset S, a natural combinatorial reformulation of SPCA (1) is

w* := max_S { λ_max(A_{S,S}) : |S| = k, S ⊆ [n] }.   (3)

By computing the Cholesky factorization A = CᵀC with C ∈ ℝ^{d×n} and d denoting the rank of A, the identity in Part (iii) of Lemma 1 recasts the objective function of SPCA (3) as

w* := max_S { λ_max(Σ_{i∈S} c_i c_iᵀ) : |S| = k, S ⊆ [n] }.   (4)

Recall that for each i ∈ [n], the binary variable z_i = 1 if the i-th feature (i.e., column c_i) is selected, and 0 otherwise. Therefore, SPCA (4) can be further reformulated as

w* := max_{z∈Z} { λ_max(Σ_{i∈[n]} z_i c_i c_iᵀ) }.   (5)

The objective function above is convex in z, yet (5) is a maximization problem, which causes much trouble. Fortunately, the result in Part (ii) of Lemma 1 and the reformulation technique from disjunctive programming [2] allow us to convert SPCA (5) into an equivalent MISDP, which is shown below.

Theorem 1
The SPCA (2) admits an equivalent MISDP formulation:

(SPCA)  w* := max_{z∈Z, X, W¹,…,Wⁿ∈S^d₊} { Σ_{i∈[n]} c_iᵀ W^i c_i : tr(X) = 1, X ⪰ W^i, tr(W^i) = z_i, ∀i ∈ [n] }.   (6)

Proof. According to Part (ii) in Lemma 1, the largest eigenvalue of a symmetric matrix can be equivalently reformulated as an SDP; thus, by introducing a positive semi-definite matrix variable X ∈ S^d₊, SPCA (5) can be represented as

w* := max_{z∈Z, X∈S^d₊} { Σ_{i∈[n]} z_i c_iᵀ X c_i : tr(X) = 1 },   (7)

where the objective function comes from the identity tr(c_i c_iᵀ X) = c_iᵀ X c_i for each i ∈ [n].

In SPCA (7), the objective function contains the bilinear terms {z_i X}_{i∈[n]}. To convexify them, we create two copies of the matrix variable X, denoted by W^{i1} and W^{i2} for each i ∈ [n], one of which will equal X depending on the value of the binary variable z_i. Specifically, SPCA (7) now becomes

w* := max_{z∈Z, X, W^{i1}, W^{i2}∈S^d₊} { Σ_{i∈[n]} c_iᵀ W^{i1} c_i : X = W^{i1} + W^{i2}, tr(X) = 1, tr(W^{i1}) = z_i, tr(W^{i2}) = 1 − z_i, ∀i ∈ [n] }.

Above, the matrix variables {W^{i2}}_{i∈[n]} are redundant and can be replaced by the inequality X ⪰ W^{i1} for each i ∈ [n]. Thus, writing W^i := W^{i1}, we arrive at the equivalent reformulation (6) of SPCA. □

Theorem 1 presents the first equivalent MISDP formulation (6) of SPCA. The resulting formulation (6) has several interesting properties: (i) it can be directly solved via exact MISDP solvers such as YALMIP; (ii) the matrix variables X and {W^i}_{i∈[n]} have dimension d × d, where d is the rank of the matrix A, so the size of SPCA (6) can be further reduced when the covariance matrix A is low-rank; and (iii) the binary variables z can be separated from the other variables, so one can apply Benders decomposition to solve SPCA (6); this will be elaborated in more detail in the next subsection.

For large-scale instances, computing the continuous relaxation value of SPCA (6) provides an upper bound on the optimal value and can be used to check the quality of different heuristics. In the following, we show that the continuous relaxation value of SPCA (6) is not too far away from the optimal value w*. First, let w₁ denote the continuous relaxation value, i.e.,

w₁ := max_{z∈Z̄, X, W¹,…,Wⁿ∈S^d₊} { Σ_{i∈[n]} c_iᵀ W^i c_i : tr(X) = 1, X ⪰ W^i, tr(W^i) = z_i, ∀i ∈ [n] },   (8)

where Z̄ denotes the continuous relaxation of the set Z, i.e., Z̄ = { z ∈ [0,1]ⁿ : Σ_{i∈[n]} z_i = k }.

Theorem 2
The continuous relaxation value w₁ of formulation (6) achieves a min{k, n/k} optimality gap for SPCA, i.e., w* ≤ w₁ ≤ min{k, n/k} w*.

Proof. It is obvious that w* ≤ w₁, since the feasible region of the continuous relaxation (8) includes the original decision space. Thus, it remains to show that (i) w₁ ≤ k w* and (ii) w₁ ≤ (n/k) w*.

Part (i): w₁ ≤ k w*. For any feasible solution (z, X, {W^i}_{i∈[n]}) to problem (8), we must have

Σ_{i∈[n]} c_iᵀ W^i c_i ≤ Σ_{i∈[n]} c_iᵀ c_i tr(W^i) = Σ_{i∈[n]} z_i c_iᵀ c_i ≤ Σ_{i∈[n]} z_i w* = k w*,

where the first inequality is due to the fact that the trace of the product of two symmetric positive semi-definite matrices is no larger than the product of their traces [10], the first equality follows from tr(W^i) = z_i for each i ∈ [n], the second inequality holds because

c_iᵀ c_i = λ_max(c_i c_iᵀ) ≤ max_{S⊆[n]: |S|=k} λ_max(Σ_{j∈S} c_j c_jᵀ) = w*,

and the second equality is due to Σ_{i∈[n]} z_i = k.

Part (ii): w₁ ≤ (n/k) w*. Similarly, given any feasible solution (z, X, {W^i}_{i∈[n]}) of the continuous relaxation (8), we must have

Σ_{i∈[n]} c_iᵀ W^i c_i ≤ Σ_{i∈[n]} c_iᵀ X c_i = (1/C(n−1, k−1)) Σ_{S⊆[n]: |S|=k} Σ_{i∈S} c_iᵀ X c_i ≤ (C(n, k)/C(n−1, k−1)) w* = (n/k) w*,

where C(·,·) denotes the binomial coefficient, the first inequality holds because X ⪰ W^i, and the second one follows from Part (ii) in Lemma 1. □

Theorem 2 shows that the continuous relaxation value of formulation (8) is at most min{k, n/k} away from the true optimal value of SPCA (6), implying that if k → 1 or k → n, then the continuous relaxation value w₁ is very close to the true optimal value w*; this is consistent with the numerical study in Section 6.

Solving SPCA (6) and SDP Relaxation (8): Benders Decomposition

It has been recognized that large-scale SDPs are challenging to solve, and so is the MISDP (6). In this subsection, we apply Benders decomposition [4, 18] to the proposed MISDP (6), which can be further integrated into a branch and cut framework. By relaxing the binary vector z to be continuous, Benders decomposition recasts the continuous SDP relaxation (8) as a maximin saddle point problem, which enables the adoption of the efficient subgradient method.

The main idea of Benders decomposition is to decompose SPCA (6) into two stages: first, the master problem is a pure integer maximization problem over z; second, given a feasible z ∈ Z, the subproblem maximizes over the remaining variables (X, {W^i}_{i∈[n]}). Thus, by separating the binary variables, we rewrite SPCA (6) as

w* := max_{z∈Z} H₁(z) := max_{X, W¹,…,Wⁿ∈S^d₊} { Σ_{i∈[n]} c_iᵀ W^i c_i : tr(X) = 1, X ⪰ W^i, tr(W^i) = z_i, ∀i ∈ [n] }.   (9)

Benders decomposition is of particular interest when the subproblem H₁(z) is easy to compute for any z ∈ Z, which is, unfortunately, not the case here. Therefore, it is desirable to specify the function H₁(z) for any given z ∈ Z in an efficient way. Surprisingly, invoking Part (ii) in Lemma 1, strong duality of the inner SDP maximization problem in (9) holds, and the resulting dual problem admits a closed-form solution for any binary z ∈ Z, which enables the subproblem to generate valid inequalities for the master problem efficiently. The results are shown below.

Proposition 1
For the function H₁(z) defined in (9), the following hold:
(i) For any z ∈ Z̄, the function H₁(z) is equivalent to

H₁(z) = min_{μ, Q¹,…,Qⁿ∈S^d₊} { λ_max(Σ_{i∈[n]} Q^i) + Σ_{i∈[n]} μ_i z_i : c_i c_iᵀ ⪯ Q^i + μ_i I_d, 0 ≤ μ_i ≤ ‖c_i‖₂², ∀i ∈ [n] },   (10)

which is concave in z.
(ii) For any binary z ∈ Z, an optimal solution to problem (10) is μ*_i = 0 if z_i = 1 and μ*_i = ‖c_i‖₂² otherwise, together with Q^{i*} := (1 − μ*_i/‖c_i‖₂²) c_i c_iᵀ for each i ∈ [n].

Proof. See Appendix A.2. □
Part (ii) of Proposition 1 shows that, given a binary solution ẑ ∈ Z with support S, plugging the optimal dual solution into (10) gives H₁(ẑ) = λ_max(Σ_{i∈S} c_i c_iᵀ) and, since this dual solution remains feasible for every z, the valid inequality

H₁(z) ≤ λ_max(Σ_{i∈S} c_i c_iᵀ) + Σ_{i∈[n]\S} ‖c_i‖₂² z_i,

which leads to an equivalent reformulation of SPCA (9) as

w* = max_{z∈Z, w} { w : w ≤ λ_max(A_{S,S}) + Σ_{i∈[n]\S} ‖c_i‖₂² z_i, ∀S ⊆ [n] : |S| = k }.   (11)

Above, for any mixed binary solution (ẑ, ŵ) ∈ Z × ℝ, the most violated constraint is

w ≤ λ_max(A_{Ŝ,Ŝ}) + Σ_{i∈[n]\Ŝ} ‖c_i‖₂² z_i,

where the set Ŝ := {i ∈ [n] : ẑ_i = 1} denotes the support of ẑ. We remark that an exact branch and cut approach for solving SPCA (11) using callback functions benefits from these closed-form valid inequalities.

Note that by relaxing the binary variables to be continuous, the relaxed problem (9) is equivalent to the SDP relaxation (8). However, given z ∈ Z̄, the dual representation of the function H₁(z) in (10) is still a difficult SDP. Motivated by Part (ii) in Proposition 1, we propose a more efficient upper bound H̄₁(z) of H₁(z) by fixing Q^i := (1 − μ_i/‖c_i‖₂²) c_i c_iᵀ for each i ∈ [n] in problem (10). In the next theorem, we show that the relaxed function H̄₁(z) becomes exact for any binary vector z ∈ Z and that the resulting upper bound of SPCA also achieves a min{k, n/k} optimality gap.

Theorem 3
The following results hold for the relaxed function H̄₁(z):
(i) For any z ∈ Z̄, the function H₁(z) is upper bounded by

H̄₁(z) = min_μ { λ_max(Σ_{i∈[n]} (1 − μ_i/‖c_i‖₂²) c_i c_iᵀ) + Σ_{i∈[n]} μ_i z_i : 0 ≤ μ_i ≤ ‖c_i‖₂², ∀i ∈ [n] };   (12)

(ii) if z ∈ Z, then H̄₁(z) = H₁(z) = λ_max(Σ_{i∈[n]} z_i c_i c_iᵀ); and
(iii) the continuous relaxation value of SPCA

w̄₁ = max_{z∈Z̄} H̄₁(z)   (13)

achieves a min{k, n/k} optimality gap for SPCA, i.e., w* ≤ w₁ ≤ w̄₁ ≤ min{k, n/k} w*, where w₁ is defined in (8).

Proof.
(i) The conclusion follows by choosing the feasible Q^i := (1 − μ_i/‖c_i‖₂²) c_i c_iᵀ for each i ∈ [n] in the representation (10).
(ii) For any z ∈ Z, we derive from Part (ii) in Proposition 1 that H̄₁(z) ≥ H₁(z) = λ_max(Σ_{i∈[n]} z_i c_i c_iᵀ). Thus, it suffices to show that H̄₁(z) ≤ λ_max(Σ_{i∈[n]} z_i c_i c_iᵀ). Indeed, this can be done simply by letting μ_i = 0 if z_i = 1, and μ_i = ‖c_i‖₂² otherwise, in (12).
(iii) By the proof of Theorem 2, to obtain the same optimality gap for (13) as for the SDP relaxation (8), we need to show that H̄₁(z) ≤ Σ_{i∈[n]} z_i c_iᵀ c_i and H̄₁(z) ≤ λ_max(A) = λ_max(Σ_{i∈[n]} c_i c_iᵀ) for any z ∈ Z̄. The former holds by letting μ_i = c_iᵀ c_i for all i ∈ [n] in (12); the latter holds by letting μ_i = 0 for all i ∈ [n] in (12). The rest of the proof follows directly from that of Theorem 2 and is thus omitted. □

We remark that: (i) compared to H₁(z), the function H̄₁(z) in (12) only involves an n-dimensional variable μ, and the resulting relaxation (13) of SPCA can be viewed as a conventional saddle point problem, so the subgradient method with an O(1/√T) rate of convergence (see, e.g., [31]) applies and offers an efficient way to generate an upper bound of SPCA in Section 6; (ii) on the other hand, the continuous relaxation value w₁ = max_{z∈Z̄} H₁(z) tends to be stronger than w̄₁ in (13), so there is a tradeoff between computational effort and a better upper bound; (iii) surprisingly, both bounds w₁ and w̄₁ achieve the same optimality gap for SPCA, which implies that there might be room to improve the analysis of the optimality gap in Theorem 2; we leave this to interested readers; and (iv) more importantly, when z ∈ Z is binary, both problems (10) and (12) admit closed-form solutions, which are very helpful within the branch and cut method.
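Concretely, the closed form in Proposition 1(ii) means a Benders cut for (11) can be assembled from a single eigenvalue computation plus the diagonal of A, since ‖c_i‖₂² = (CᵀC)_{ii} = A_{ii}. A minimal Python sketch of ours (assuming A is positive semi-definite; the function name is illustrative):

    import numpy as np

    def benders_cut(A, S):
        """Closed-form Benders cut from (11) for a binary incumbent with support S:
        w <= lambda_max(A_{S,S}) + sum_{i not in S} ||c_i||_2^2 * z_i,
        where ||c_i||_2^2 = (C^T C)_{ii} = A_{ii}."""
        S = list(S)
        lam = np.linalg.eigvalsh(A[np.ix_(S, S)])[-1]   # subproblem value H_1(z)
        coeff = {i: A[i, i] for i in range(A.shape[0]) if i not in S}
        return lam, coeff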
3. Exact MISDP Formulation (II)
The MISDP formulation (6) developed for SPCA in the previous section is mainly inspired by Parts (ii) and (iii) of Lemma 1. In this section, we propose another exact MISDP reformulation of SPCA using Parts (i) and (ii) of Lemma 1. Similarly, we present the optimality gap of the corresponding SDP relaxation to demonstrate the strength of the second formulation. It is worth noting that the proposed MISDP (6) requires the positive semi-definiteness of the matrix A, as it is built on the Cholesky decomposition of A, whereas the results in this section are more general and hold even when the matrix A is not positive semi-definite.

We first establish a naive exact MISDP formulation of SPCA (2) based on Part (ii) in Lemma 1, whose continuous relaxation value is equal to λ_max(A).

Proposition 2
The SPCA (2) admits the following MISDP formulation:

(SPCA)  w* := max_{z∈Z, X∈Sⁿ₊} { tr(AX) : tr(X) = 1, X_ii ≤ z_i, ∀i ∈ [n] },   (14)

and its continuous relaxation value is equal to λ_max(A).

Proof. See Appendix A.3. □
The SPCA formulation (14) can also be found in [17]. However, our proof is quite different and shorter, since it does not involve a sophisticated extreme point characterization of SDPs. Although the MISDP (14) is equivalent to SPCA (2), the fact that its continuous relaxation value equals λ_max(A) indicates that it might be a weak formulation. This motivates us to strengthen formulation (14) by adding valid inequalities in the remainder of this section: we first propose two types of valid inequalities for the SPCA formulation (14) and then derive the optimality gap of the continuous relaxation value of the improved formulation. After examining different candidates, we propose the following two types of valid inequalities.
Lemma 2
The following two inequalities are valid for SPCA (14):
(i) Σ_{j∈[n]} X_ij² ≤ X_ii z_i for all i ∈ [n]; and
(ii) (Σ_{j∈[n]} |X_ij|)² ≤ k X_ii z_i for all i ∈ [n].

Proof. See Appendix A.4. □
We make the following remarks about Lemma 2.
(i) Many other valid inequalities are dominated by the two types of valid inequalities in Lemma 2, such as |X_ij| ≤ z_i, X_ij² ≤ X_ii z_j, and X_ij² ≤ z_i z_j for all i, j ∈ [n];
(ii) the two types of valid inequalities are both second-order conic representable (see, e.g., [3]) and thus can be handled by SDP solvers such as MOSEK and SDPT3; and
(iii) we further observe that the inequality X_ii ≤ z_i in (14) is dominated by the first type of inequalities, given the facts that X_ii² + Σ_{j∈[n]\{i}} X_ij² ≤ X_ii z_i and X_ii ≥ 0 for all i ∈ [n].
The results in Lemma 2 together with Proposition 2 give rise to a stronger MISDP formulation of SPCA than (14), which is summarized below.

Theorem 4
The SPCA (2) admits the following stronger MISDP formulation:

(SPCA)  w* := max_{z∈Z, X∈Sⁿ₊} { tr(AX) : tr(X) = 1, Σ_{j∈[n]} X_ij² ≤ X_ii z_i, (Σ_{j∈[n]} |X_ij|)² ≤ k X_ii z_i, ∀i ∈ [n] }.   (15)

Let w₂ denote the continuous relaxation value of the SPCA formulation (15), i.e.,

w₂ := max_{z∈Z̄, X∈Sⁿ₊} { tr(AX) : tr(X) = 1, Σ_{j∈[n]} X_ij² ≤ X_ii z_i, (Σ_{j∈[n]} |X_ij|)² ≤ k X_ii z_i, ∀i ∈ [n] }.   (16)

Clearly, we have λ_max(A) ≥ w₂. We now prove that this continuous relaxation value can be even stronger than a well-known SDP upper bound for SPCA (2) introduced by d'Aspremont et al. [13], denoted by w₃, which has been widely used for solving SPCA in the literature. The upper bound from [13] is given by

w₃ := max_{X∈Sⁿ₊} { tr(AX) : tr(X) = 1, Σ_{i∈[n]} Σ_{j∈[n]} |X_ij| ≤ k }.   (17)

The formal comparison result is shown below.

Proposition 3
The upper bounds w₂ and w₃ of SPCA defined in (16) and (17), respectively, satisfy w₃ ≥ w₂; i.e., the continuous relaxation value of the stronger MISDP (15) is at least as strong as the optimal value of the SDP formulation (17) from [13].

Proof. To show w₃ ≥ w₂, it suffices to prove that any feasible solution (z, X) of the continuous relaxation problem (16) satisfies the constraints of the SDP formulation (17). Clearly, we have X ∈ Sⁿ₊ and tr(X) = 1. It remains to show Σ_{i∈[n]} Σ_{j∈[n]} |X_ij| ≤ k. Indeed, we have

Σ_{i∈[n]} Σ_{j∈[n]} |X_ij| ≤ Σ_{i∈[n]} √k √(X_ii z_i) ≤ √k √(Σ_{i∈[n]} X_ii) √(Σ_{i∈[n]} z_i) = k,

where the first inequality results from the type (ii) inequalities in Lemma 2, the second one is due to the Cauchy-Schwarz inequality, and the equality is due to tr(X) = 1 and Σ_{i∈[n]} z_i = k. □

Next, we show that the continuous relaxation value of the stronger MISDP (15) is also quite close to the true optimal value. This phenomenon is even more striking in the numerical study.
Theorem 5
The continuous relaxation value of the stronger MISDP formulation (15) yields a min{k, n/k} optimality gap for SPCA, i.e., w* ≤ w₂ ≤ min{k, n/k} w*.

Proof. The proof is separated into two parts: (i) w₂ ≤ k w* and (ii) w₂ ≤ (n/k) w*.
(i) w₂ ≤ k w*. For any feasible solution X to problem (16), we have

tr(AX) = Σ_{i∈[n]} Σ_{j∈[n]} A_ij X_ij ≤ Σ_{i∈[n]} Σ_{j∈[n]} |A_ij| |X_ij| ≤ w* Σ_{i∈[n]} Σ_{j∈[n]} |X_ij| ≤ k w*,

where the first inequality is due to taking absolute values, the second one is based on the facts that max_{i∈[n]} A_ii ≤ w* and |A_ij| ≤ √(A_ii A_jj) ≤ w* for each pair i, j ∈ [n], and the third one can be obtained from the proof of Proposition 3.
(ii) w₂ ≤ (n/k) w*. The proof is similar to that of Theorem 2, since w₂ ≤ λ_max(A) ≤ (n/k) w*. □

In general, our two proposed MISDP formulations (6) and (15) are not comparable, although their continuous relaxations have the same theoretical optimality gap; this will also be illustrated in the numerical study section. The continuous relaxation of the MISDP formulation (15) might be difficult to solve due to the larger size of its matrix variable and the higher complexity of its constraints. In the next subsection, we discuss Benders decomposition for SPCA (15), where the subproblem reduces to a second-order conic program rather than an SDP.
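Before moving on, we note that the benchmark bound w₃ in (17), which Proposition 3 compares against and which reappears in Section 6, can be prototyped in a few lines of cvxpy. The sketch below is ours and assumes cvxpy with a conic solver (e.g., MOSEK or SCS) is installed:

    import cvxpy as cp

    def dspca_bound(A, k):
        """Upper bound w3 of d'Aspremont et al. (17): an SDP with a trace
        normalization and an elementwise l1 budget standing in for sparsity."""
        n = A.shape[0]
        X = cp.Variable((n, n), PSD=True)
        prob = cp.Problem(
            cp.Maximize(cp.trace(A @ X)),
            [cp.trace(X) == 1, cp.sum(cp.abs(X)) <= k],
        )
        prob.solve()  # picks an installed conic solver, e.g., SCS or MOSEK
        return prob.value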
The decomposition method developed for SPCA (15) in this subsection follows Section 2.2; therefore, many details are omitted for brevity. Similarly, we decompose the proposed MISDP formulation (15) into a master problem over the binary variables z ∈ Z and a subproblem over the matrix variable X ∈ Sⁿ₊. That is, we reformulate SPCA (15) as the following equivalent two-stage optimization problem:

w* = max_{z∈Z} H₂(z) := max_{X∈Sⁿ₊} { tr(AX) : tr(X) = 1, Σ_{j∈[n]} X_ij² ≤ X_ii z_i, (Σ_{j∈[n]} |X_ij|)² ≤ k X_ii z_i, ∀i ∈ [n] }.   (18)

It is favorable to derive an efficient dual formulation of H₂(z) for any given z ∈ Z̄ such that its subgradient can be easily computed. Indeed, invoking Part (ii) in Lemma 1 and dualizing the second-order conic constraints, strong duality of the inner maximization over X in (18) still holds. The proof is similar to that of Proposition 1 and is thus omitted.

Proposition 4
For any z ∈ Z̄, the function H₂(z) is equivalent to the optimal value of a dual minimization problem, denoted by (19), over the dual variables (μ¹, μ², ν¹, ν², Λ, W¹, W², β): its objective is the largest eigenvalue of an affine combination of A and the dual matrices plus terms that are linear in z, and its constraints are all linear or second-order conic. In particular, H₂(z) is concave in z.

For the equivalent function H₂(z) derived in Proposition 4, we remark that: (i) for any given z ∈ Z̄, H₂(z) can be computed by solving a second-order conic program, thus escaping the SDP curse; more effectively, it can be solved via many first-order methods (e.g., the subgradient method), since a subgradient is easy to obtain and the projection only involves second-order conic constraints; (ii) when we solve the continuous relaxation

w₂ = max_{z∈Z̄} H₂(z),   (20)

the subgradient method is also applicable to the entire maximin saddle point problem with an O(1/√T) rate of convergence (see, e.g., [31]); and (iii) we can warm-start the exact branch and cut algorithm by solving the continuous relaxation (20) and adding all the subgradient inequalities to the root relaxed problem.
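In particular, the Euclidean projection onto Z̄ = {z ∈ [0,1]ⁿ : Σ_i z_i = k} used by the projected subgradient method has a simple one-dimensional characterization. A minimal sketch of ours, assuming nothing beyond NumPy:

    import numpy as np

    def project_capped_simplex(y, k, tol=1e-10):
        """Euclidean projection of y onto {z in [0,1]^n : sum(z) = k}, 0 <= k <= n.
        KKT conditions give z_i = clip(y_i - tau, 0, 1) for a scalar tau chosen
        so that the coordinates sum to k; tau is found by bisection."""
        lo, hi = y.min() - 1.0, y.max()  # sum is n at tau=lo and 0 at tau=hi
        for _ in range(200):
            tau = 0.5 * (lo + hi)
            s = np.clip(y - tau, 0.0, 1.0).sum()
            if s > k:
                lo = tau
            else:
                hi = tau
            if hi - lo < tol:
                break
        return np.clip(y - 0.5 * (lo + hi), 0.0, 1.0)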
4. A Mixed-Integer Linear Program (MILP) for SPCA with Arbitrary Accuracy
The formulations developed in the previous sections for solving SPCA rely on either MISDP solvers or customized branch and cut algorithms, which do not leverage the computational power of existing solvers such as CPLEX and Gurobi. In this section, motivated by the SPCA formulation (5) and the defining identity of eigenvalues, we further derive an approximate mixed-integer linear program (MILP) for SPCA with arbitrary accuracy ε > 0 using O(n + d + log₂(ε⁻¹)) binary variables. We also prove the optimality gap of its corresponding LP relaxation. The results in this section assume that A is positive semi-definite.

The difficulty of SPCA (5) lies in how to convexify the objective function, i.e., the largest eigenvalue of a symmetric matrix. In particular, our proposed MISDP formulations stem from the fact that the largest eigenvalue can be formulated as an equivalent SDP. Through a different lens, we now represent the largest eigenvalue based on the defining property of eigenvalues of a matrix, i.e.,

λ_max(A) = max_{w∈ℝ, x∈ℝⁿ} { w : Ax = wx, x ≠ 0 },

where x denotes an eigenvector and the nonzero constraint rules out the trivial solution x = 0. This motivates us to recast the SPCA formulation (5) as the following nonconvex problem:

w* = max_{w, x∈ℝ^d, z∈Z} { w : (Σ_{i∈[n]} z_i c_i c_iᵀ) x = wx, ‖x‖∞ = 1 },   (21)

where ‖x‖∞ = 1 also excludes the trivial solution x = 0.

For any given z ∈ Z, the nonconvexity of the SPCA formulation (21) lies in three aspects: (i) the bilinear terms {z_i x}_{i∈[n]}, which can be easily linearized using disjunctive programming techniques since the vector z is binary; (ii) the constraint ‖x‖∞ = 1, which can be equivalently written as a disjunction of 2d sets,

∪_{j∈[d]} { x ∈ ℝ^d : x_j = 1, ‖x‖∞ ≤ 1 } ∪_{j∈[d]} { x ∈ ℝ^d : x_j = −1, ‖x‖∞ ≤ 1 };

due to the equivalence of x and −x in SPCA (21), it suffices to keep only the first d sets, i.e., ∪_{j∈[d]} { x ∈ ℝ^d : x_j = 1, ‖x‖∞ ≤ 1 }, and this disjunction can be equivalently described as an MILP using the results in [2]; and (iii) the bilinear term wx, for which we first approximate the variable w using a binary expansion and then linearize the resulting bilinear terms by the same disjunctive technique as in part (i). The resulting MILP formulation is summarized in the following theorem.

Theorem 6
Given a threshold ε > 0, the following MILP is O(ε)-approximate to SPCA (2), i.e., −ε ≤ ŵ(ε) − w* ≤ ε√d:

ŵ(ε) := max_{w, z∈Z, y, α, x, δ, μ, σ}  w
s.t.  x = δ^{i1} + δ^{i2}, ‖δ^{i1}‖∞ ≤ z_i, ‖δ^{i2}‖∞ ≤ 1 − z_i, ∀i ∈ [n],
      x = Σ_{j∈[d]} σ^j, ‖σ^j‖∞ ≤ y_j, σ^j_j = y_j, ∀j ∈ [d],  Σ_{j∈[d]} y_j = 1,
      x = μ^{ℓ1} + μ^{ℓ2}, ‖μ^{ℓ1}‖∞ ≤ α_ℓ, ‖μ^{ℓ2}‖∞ ≤ 1 − α_ℓ, ∀ℓ ∈ [m],
      w = w^U − (w^U − w^L) Σ_{ℓ∈[m]} 2^{−ℓ} α_ℓ,
      ‖ Σ_{i∈[n]} c_i c_iᵀ δ^{i1} − w^U x + (w^U − w^L) Σ_{ℓ∈[m]} 2^{−ℓ} μ^{ℓ1} ‖∞ ≤ ε,
      α ∈ {0,1}^m, y ∈ {0,1}^d,   (22)

where w^L, w^U denote lower and upper bounds of SPCA, respectively, m := ⌈log₂((w^U − w^L) ε⁻¹)⌉, and the infinity-norm inequality constraints can be easily linearized.

Proof. See Appendix A.5. □
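The number of expansion bits m = ⌈log₂((w^U − w^L) ε⁻¹)⌉ in (22) is exactly what is needed for the binary expansion of w to be ε-accurate. A quick self-contained check of ours (the function name is illustrative):

    import math

    def encode_w(w, wL, wU, eps):
        """Greedy binary expansion of w = wU - (wU - wL) * sum_l 2^{-l} alpha_l,
        using m = ceil(log2((wU - wL)/eps)) bits as in (22)."""
        m = math.ceil(math.log2((wU - wL) / eps))
        alpha, r = [], (wU - w) / (wU - wL)   # target value of sum_l 2^{-l} alpha_l
        for l in range(1, m + 1):
            bit = 1 if r >= 2.0 ** (-l) else 0
            alpha.append(bit)
            r -= bit * 2.0 ** (-l)
        w_hat = wU - (wU - wL) * sum(a * 2.0 ** (-l) for l, a in enumerate(alpha, 1))
        return w_hat, alpha

    w_hat, alpha = encode_w(3.3333, wL=0.0, wU=5.0, eps=1e-4)
    assert abs(w_hat - 3.3333) <= 1e-4  # representation error is at most eps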
For the proposed MILP formulation (22), we remark that:
(i) this is the first-known MILP representation of SPCA with arbitrary accuracy O(ε) in the literature;
(ii) the MILP formulation (22), although compact, involves O(n + d + log₂ ε⁻¹) binary variables, O(nd + d log₂ ε⁻¹) continuous variables, and O(nd + n log₂ ε⁻¹) linear constraints;
(iii) in SPCA (21), one might be curious about the choice of the infinity norm; as far as we are concerned, this is the only norm that leads to a compact MILP formulation;
(iv) in the MILP formulation (22), one might consider replacing the infinity norm in the constraint ‖Σ_{i∈[n]} c_i c_iᵀ δ^{i1} − w^U x + (w^U − w^L) Σ_{ℓ∈[m]} 2^{−ℓ} μ^{ℓ1}‖∞ ≤ ε by other norms, which leads to different formulations (either MILP or mixed-integer conic program) and slightly different approximation bounds;
(v) strong lower and upper bounds w^L, w^U of SPCA can speed up the solution procedure; and
(vi) instead of building one relatively large-scale MILP (22), one might solve d smaller-scale MILPs by enumerating each set of the disjunction ∪_{j∈[d]} { x : x_j = 1, ‖x‖∞ ≤ 1 }.
The last remark is summarized in the following corollary.
Corollary 1
Given a threshold ε > 0, the optimal value of MILP (22) is equal to ŵ(ε) = max_{j∈[d]} ŵ_j(ε), where for each j ∈ [d],

ŵ_j(ε) := max_{w, z∈Z, α, x, δ, μ}  w
s.t.  x = δ^{i1} + δ^{i2}, ‖δ^{i1}‖∞ ≤ z_i, ‖δ^{i2}‖∞ ≤ 1 − z_i, ∀i ∈ [n],
      ‖x‖∞ ≤ 1, x_j = 1,
      x = μ^{ℓ1} + μ^{ℓ2}, ‖μ^{ℓ1}‖∞ ≤ α_ℓ, ‖μ^{ℓ2}‖∞ ≤ 1 − α_ℓ, ∀ℓ ∈ [m],
      w = w^U − (w^U − w^L) Σ_{ℓ∈[m]} 2^{−ℓ} α_ℓ,
      ‖ Σ_{i∈[n]} c_i c_iᵀ δ^{i1} − w^U x + (w^U − w^L) Σ_{ℓ∈[m]} 2^{−ℓ} μ^{ℓ1} ‖∞ ≤ ε,
      α ∈ {0,1}^m,   (23)

where w^L, w^U denote lower and upper bounds of SPCA, respectively, m := ⌈log₂((w^U − w^L) ε⁻¹)⌉, and the infinity-norm inequality constraints can be easily linearized.

Albeit smaller, some of the MILPs defined in Corollary 1 might be infeasible. Since the optimal value of an infeasible maximization problem is −∞ by default, the result in Corollary 1 still holds; however, one needs to be cautious when using this result and be aware of the infeasibilities.

Similar to the other two exact formulations, we are also interested in deriving a theoretical approximation bound for the MILP formulation (22) by relaxing the binary variables z. In particular, we assume that the other binary variables y, α can be enumerated effectively. Our result shows that the theoretical optimality gap is, in general, worse than the other two bounds.

Theorem 7
Given a threshold ε > 0, by relaxing the binary variables z to be continuous, let w̄(ε) denote the optimal value of the relaxed MILP formulation (22). Then we have

w̄(ε) ≤ min{ k(√d/2)²/4, (n/k)√d + (n − k)(√d/2)²/4 } w* + ε√d.

Proof. See Appendix A.6. □
5. Approximation Algorithms
In this section, motivated by the equivalent combinatorial formulation (4), we prove and demonstrate the tightness of the approximation ratios of the well-known greedy and local search algorithms for solving SPCA.
The greedy algorithm has been widely used in many combinatorial problems with a cardinality constraint. The greedy algorithm in this subsection is particularly based on the combinatorial formulation (4) and proceeds as follows: given a subset Ŝ_G ⊆ [n] of already selected vectors, it finds a new vector from {c_i}_{i∈[n]\Ŝ_G} that maximizes the largest eigenvalue of the sum of the rank-one matrices obtained so far, including the new one. The detailed implementation can be found in Algorithm 1.

Algorithm 1
Greedy Algorithm for SPCA (4)
Input: n × n matrix A ⪰ 0 and integer k ∈ [n]
Let A = CᵀC denote its Cholesky factorization, where C ∈ ℝ^{d×n}
Let c_i ∈ ℝ^d denote the i-th column vector of matrix C for each i ∈ [n]
Let Ŝ_G := ∅ denote the chosen set
for ℓ = 1, …, k do
  Compute j* ∈ argmax_{j∈[n]\Ŝ_G} { λ_max(Σ_{i∈Ŝ_G∪{j}} c_i c_iᵀ) }
  Add j* to the set Ŝ_G
end for
Output: Ŝ_G

The following result shows that greedy Algorithm 1 yields a 1/k-approximation ratio.

Theorem 8
Greedy Algorithm 1 yields a 1/k-approximation ratio for SPCA (4); i.e., the output Ŝ_G of Algorithm 1 satisfies

λ_max(Σ_{i∈Ŝ_G} c_i c_iᵀ) ≥ (1/k) w*.

Proof. Suppose that the optimal set of SPCA (4) is S*. Then we have

λ_max(Σ_{i∈S*} c_i c_iᵀ) ≤ Σ_{i∈S*} λ_max(c_i c_iᵀ) ≤ k max_{i∈[n]} λ_max(c_i c_iᵀ) ≤ k λ_max(Σ_{i∈Ŝ_G} c_i c_iᵀ),

where the first inequality results from the convexity of the largest eigenvalue function and the last one holds because, at its first iteration, greedy Algorithm 1 must choose the longest vector. □

The approximation ratio 1/k of greedy Algorithm 1 is tight, since there exists an example whose greedy output is no better than 1/k times the optimal value. This example is presented below.

Example 1
Given any positive integer k, let d = k + 1, n = 2k, and define the vectors {c_i}_{i∈[n]} ⊆ ℝ^d by

c_i = e_i if i ∈ [k], and c_i = e_{k+1} if i ∈ [k+1, n].

Proposition 5
In Example 1, the output value of greedy Algorithm 1 is 1/k of the true optimal value of SPCA; that is, the approximation ratio 1/k of greedy Algorithm 1 is tight.

Proof. In Example 1, all candidate vectors have unit length, so greedy Algorithm 1 (breaking ties by the smallest index) selects c_1, c_2, …, c_k at its iterations; i.e., the output set is Ŝ_G = [k]. Thus, the resulting largest eigenvalue of greedy Algorithm 1 is equal to 1. Meanwhile, the true optimal value of Example 1 is equal to

λ_max(Σ_{i∈[k+1,n]} c_i c_iᵀ) = λ_max(k e_{k+1} e_{k+1}ᵀ) = k.

This completes the proof. □
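For concreteness, a direct Python implementation of Algorithm 1 is sketched below (ours; it uses the identity λ_max(A_{S,S}) = λ_max(Σ_{i∈S} c_i c_iᵀ) from Lemma 1, so no explicit Cholesky factor is needed, and a full eigenvalue call stands in for the power iteration recommended later for large instances):

    import numpy as np

    def greedy_spca(A, k):
        """Greedy Algorithm 1: grow a support of size k, each time adding the
        index that maximizes lambda_max of the selected principal submatrix."""
        n = A.shape[0]
        S, best_val = [], -np.inf
        for _ in range(k):
            best_j, best_val = None, -np.inf
            for j in range(n):
                if j in S:
                    continue
                T = S + [j]
                # = lambda_max(sum_{i in T} c_i c_i^T) by Lemma 1(iii)
                val = np.linalg.eigvalsh(A[np.ix_(T, T)])[-1]
                if val > best_val:
                    best_j, best_val = j, val
            S.append(best_j)
        return S, best_val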
The local search algorithm improves upon existing solutions and has been successfully used to solve many interesting machine learning and data analytics problems, such as experimental design [26] and maximum entropy sampling [24]. This subsection investigates the local search algorithm for SPCA (4) and proves its approximation ratio. In the local search algorithm, we start with a size-k subset and, at each iteration, swap an element of the chosen set with one of the unchosen set as long as the swap improves the largest eigenvalue. The detailed implementation can be found in Algorithm 2.

Algorithm 2
Local Search Algorithm for SPCA (4)
Input: n × n matrix A ⪰ 0 and integer k ∈ [n]
Let A = CᵀC denote its Cholesky factorization, where C ∈ ℝ^{d×n}
Let c_i ∈ ℝ^d denote the i-th column vector of matrix C for each i ∈ [n]
Initialize a size-k subset Ŝ_L ⊆ [n]
do
  for each pair (i, j) ∈ Ŝ_L × ([n] \ Ŝ_L) do
    if λ_max(Σ_{ℓ∈Ŝ_L∪{j}\{i}} c_ℓ c_ℓᵀ) > λ_max(Σ_{ℓ∈Ŝ_L} c_ℓ c_ℓᵀ) then
      Update Ŝ_L := Ŝ_L ∪ {j} \ {i}
    end if
  end for
while there is still an improvement
Output: Ŝ_L

Theorem 9
Local search Algorithm 2 yields a 1/k-approximation ratio for SPCA; i.e., the output Ŝ_L of local search Algorithm 2 satisfies

λ_max(Σ_{i∈Ŝ_L} c_i c_iᵀ) ≥ (1/k) w*.

Proof. First, for each j ∈ [n], we show that

λ_max(Σ_{ℓ∈Ŝ_L} c_ℓ c_ℓᵀ) ≥ λ_max(c_j c_jᵀ).   (24)

There are two cases to discuss, according to whether j belongs to Ŝ_L or not. If j ∈ Ŝ_L, the monotonicity of the largest eigenvalue of a sum of positive semi-definite matrices implies that inequality (24) holds. If j ∈ [n] \ Ŝ_L, then the local optimality condition implies that there exists i ∈ Ŝ_L such that

λ_max(Σ_{ℓ∈Ŝ_L} c_ℓ c_ℓᵀ) ≥ λ_max(Σ_{ℓ∈Ŝ_L∪{j}\{i}} c_ℓ c_ℓᵀ) ≥ λ_max(c_j c_jᵀ),

where the second inequality is again due to the monotonicity of the largest eigenvalue of a sum of positive semi-definite matrices.

Second, letting S* be the optimal solution to SPCA (4), inequality (24) yields

w* = λ_max(Σ_{i∈S*} c_i c_iᵀ) ≤ Σ_{i∈S*} λ_max(c_i c_iᵀ) ≤ k λ_max(Σ_{ℓ∈Ŝ_L} c_ℓ c_ℓᵀ),

where the first inequality follows from the convexity of the function λ_max(·). □

We remark that Example 1 also confirms the tightness of our analysis for local search Algorithm 2.
Proposition 6
In Example 1, the output value of local search Algorithm 2 is 1/k of the optimal value of SPCA; that is, the approximation ratio 1/k of local search Algorithm 2 is tight.

Proof. In Example 1, we show that the initial subset Ŝ_L = [k] already satisfies the local optimality condition. Indeed, for each pair (i, j) ∈ Ŝ_L × ([n] \ Ŝ_L), we have

λ_max(Σ_{ℓ∈Ŝ_L∪{j}\{i}} c_ℓ c_ℓᵀ) = λ_max(I_d − e_i e_iᵀ) = 1 = λ_max(I_d − e_d e_dᵀ) = λ_max(Σ_{ℓ∈Ŝ_L} c_ℓ c_ℓᵀ),

where the identities follow from the construction of {c_i}_{i∈[n]} in Example 1. Therefore, the set Ŝ_L achieves a local optimum with largest eigenvalue 1. Since the optimal value of SPCA is w* = k, the approximation ratio of the set Ŝ_L is equal to 1/k. □

As an improved heuristic, local search Algorithm 2 can use the output of greedy Algorithm 1 as its initial solution. The results in Theorem 9 and Proposition 6 imply that the integrated algorithm still yields a 1/k-approximation ratio for SPCA, while for practical instances our numerical study shows that the integrated algorithm in fact works very well. Since greedy Algorithm 1 and local search Algorithm 2 repeatedly require computing largest eigenvalues, at each iteration we can apply the power iteration method to calculate them efficiently [35] and use the eigenvectors from previous iterations as warm starts.

Finally, we remark that there is only one swap at a time in local search Algorithm 2. We can generalize it by increasing the number of swapped elements at each iteration, termed s-swap local search with s ∈ [k]. The following result shows that s-swap local search can indeed achieve a better approximation ratio.

Corollary 2
The approximation ratio of s-swap local search is s/k for any s ∈ [k], and this approximation ratio is tight.

Proof. First, let the set Ŝ_L denote the indices of the vectors selected by the s-swap local search algorithm. Following the same argument as in the proof of Theorem 9, for any size-s set T ⊆ [n], we have

λ_max(Σ_{i∈Ŝ_L} c_i c_iᵀ) ≥ λ_max(Σ_{i∈T} c_i c_iᵀ).   (25)

Let S* denote the optimal solution to SPCA (4). Since each i ∈ S* appears in exactly C(k−1, s−1) of the size-s subsets T ⊆ S*, using (25) the optimal value w* is upper bounded by

w* = λ_max(Σ_{i∈S*} c_i c_iᵀ) = λ_max( (1/C(k−1, s−1)) Σ_{T⊆S*, |T|=s} Σ_{i∈T} c_i c_iᵀ ) ≤ (C(k, s)/C(k−1, s−1)) λ_max(Σ_{i∈Ŝ_L} c_i c_iᵀ) = (k/s) λ_max(Σ_{i∈Ŝ_L} c_i c_iᵀ).

Second, to show the tightness, let us consider the following example.
Example 2
Given a positive integer k and s ∈ [k], let d = k + 1, n = (s + 1)k, and define the vectors {c_i}_{i∈[n]} ⊆ ℝ^d by

c_i = e_i if i ∈ [k], …, c_i = e_{i−(s−1)k} if i ∈ [(s−1)k + 1, sk], and c_i = e_{k+1} if i ∈ [sk + 1, n];

that is, each of e_1, …, e_k appears s times and e_{k+1} appears k times.

In Example 2, we show that the subset Ŝ_L = [k − s + 1] ∪ {ℓk + 1}_{ℓ∈[s−1]} satisfies the s-swap local optimality condition. Indeed, for each pair (T₁, T₂) such that T₁ ⊆ Ŝ_L and T₂ ⊆ [n] \ Ŝ_L with |T₁| = |T₂| = s, we have

λ_max(Σ_{ℓ∈Ŝ_L∪T₂\T₁} c_ℓ c_ℓᵀ) ≤ s.

Therefore, the set Ŝ_L achieves an s-swap local optimum with largest eigenvalue s. Since the optimal value of SPCA is w* = k, the approximation ratio of the set Ŝ_L is equal to s/k for SPCA. □

Albeit theoretically sound, s-swap local search with s ≥ 2 requires evaluating O(n^s) candidate swaps at each iteration. Therefore, in the numerical study, we use the simple local search Algorithm 2, which already works very well.
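A compact Python implementation of Algorithm 2 is sketched below (ours; it accepts any initial support, e.g., the greedy output, and calls a full symmetric eigensolver where the paper recommends warm-started power iterations for large instances):

    import numpy as np

    def lam_max(A, S):
        S = list(S)
        return np.linalg.eigvalsh(A[np.ix_(S, S)])[-1]

    def local_search_spca(A, S0):
        """Local search Algorithm 2: swap one chosen index with one unchosen
        index whenever the swap strictly increases the largest eigenvalue."""
        n = A.shape[0]
        S = set(S0)
        improved = True
        while improved:
            improved = False
            for i in list(S):
                for j in range(n):
                    if j in S:
                        continue
                    # small tolerance guards against cycling on numerical ties
                    if lam_max(A, (S - {i}) | {j}) > lam_max(A, S) + 1e-12:
                        S = (S - {i}) | {j}
                        improved = True
                        break
                if improved:
                    break
        return sorted(S), lam_max(A, S)

    # Typical usage: warm-start with the greedy output from the earlier sketch,
    # e.g., S, val = local_search_spca(A, greedy_spca(A, k)[0])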
6. Numerical Study
In this section, we conduct numerical experiments on six datasets with the number of features n ranging from 13 to 2365 to demonstrate the computational efficiency and solution quality of the MISDP (6), the MISDP (15), and the MILP (22) for exactly solving SPCA, and of the continuous relaxations (8) and (16) and the heuristic Algorithms 1 and 2 for approximately solving SPCA. All the methods in this section are coded in Python 3.6 with calls to Gurobi 9.0 and MOSEK 9.0 on a personal PC with a 2.3 GHz Intel Core i5 processor and 8GB of memory. The codes and data are available at https://github.com/yongchunli-13/Sparse-PCA.
Pitprops Dataset

We first test the proposed three exact SPCA formulations (6), (15), (22) and their continuous relaxations on a commonly used benchmark instance, the
Pitprops dataset of Jeffers [20], which consists of 13 features (i.e., n = 13). For this instance, the computational results of seven different cases with k chosen from {4, …, 10} are displayed in Tables 2, 3, and 4.

For each testing case, we solve the two MISDP formulations (6) and (15) using the branch and cut method, while the MILP (22) can be solved directly in Gurobi. Throughout the numerical study of MILP (22), we fix a small accuracy ε, use the best SDP relaxation value as the upper bound w^U, and use local search Algorithm 2 to compute the lower bound w^L. As the newly released Gurobi 9.0 is able to solve nonconvex quadratic programs, for the purpose of comparison we further use Gurobi to solve the following SPCA formulation:

w* := max_{z∈Z, x∈ℝⁿ} { xᵀAx : ‖x‖₂ = 1, ‖x‖₁ ≤ √k, |x_i| ≤ z_i, ∀i ∈ [n] }.   (26)

The computational results of the exact methods are shown in Table 2, where time(s) denotes the running time in seconds of each case and the Gurobi columns report the performance of Gurobi on SPCA (26). In Table 2, we see that all the SPCA formulations (6), (15), (22) can be solved to optimality within seconds, which demonstrates the efficiency of the proposed formulations. We also compare the numerical performance of the MILP formulation (22) with formulation (26) using the Gurobi solver, and it is clear that the MILP is more efficient and stable; especially for the case k = 10, Gurobi has trouble finding the optimal solution of SPCA (26).

Although the theoretical optimality gaps of the proposed SDP relaxations (8) and (16) are the same, these gaps in practice can be much smaller and can differ significantly from each other.

Table 2. Computational results of exact values with the Pitprops dataset (n = 13)
Table 2. Computational results of exact methods with the Pitprops dataset (n = 13)

            MISDP (6)         MISDP (15)        MILP (22)         Gurobi (26)
 k    w*      w*    time(s)     w*    time(s)    ŵ(ε)   time(s)     w*    time(s)
 4  2.9375  2.9375     1      2.9375     2     2.9375      1      2.9375      1
 5  3.4062  3.4062     1      3.4062     2     3.4062      1      3.4062      1
 6  3.7710  3.7710     1      3.7710     2     3.7710      2      3.7710      1
 7  3.9962  3.9962     1      3.9962     1     3.9962      1      3.9962      3
 8  4.0686  4.0686     1      4.0686     2     4.0686      2      4.0686     12
 9  4.1386  4.1386     1      4.1386     2     4.1386      1      4.1387     30
10  4.1726  4.1726     1      4.1726     1     4.1726      1      4.1441     83

We use MOSEK to solve both SDP relaxations. The numerical results can be found in Table 3, where the SDP relaxation (17) proposed by d'Aspremont et al. [13] is presented as a benchmark comparison. In Table 3, gap(%) denotes the optimality gap, computed as 100 × (Upper Bound − w*)/w*. It can be seen that the second SDP relaxation (16) is superior to the first SDP relaxation (8) on the first five cases; when k is close to n, the first SDP relaxation (8) can be better. This finding is consistent with the remarks after Theorem 2. In addition, as proved in Proposition 3, the second SDP relaxation (16) always outperforms the bound (17) of d'Aspremont et al. [13]. Finally, the first SDP relaxation (8) and the bound (17) of d'Aspremont et al. [13] are not comparable.
Table 3. Computational results of upper bounds with the Pitprops dataset (n = 13)

            Benchmark (17)    SDP Relaxation (8)          SDP Relaxation (16)
 k    w*      w     gap(%)      w      gap(%)   time(s)     w      gap(%)   time(s)
 4  2.9375  3.0172   2.71    3.1065     5.75     0.51     2.9495     0.41     0.13
 5  3.4062  3.4581   1.52    3.4868     2.37     0.55     3.4124     0.18     0.18
 6  3.7710  3.8137   1.13    3.7859     0.39     0.52     3.7767     0.15     0.15
 7  3.9962  4.0316   0.89    3.9962     0.00     0.43     3.9962     0.00     0.15
 8  4.0686  4.1448   1.87    4.0805     0.29     0.29     4.0793     0.26     0.17
 9  4.1386  4.2063   1.64    4.1386     0.00     0.00     4.1398     0.03     0.15
10  4.1726  4.2186   1.10    4.1763     0.09     0.09     4.1778     0.12     0.16

Table 4 presents the objective values and optimality gaps of the proposed approximation algorithms on the Pitprops instance, where LB denotes the lower bound and gap(%) is computed as 100 × (w* − LB)/w*. Note that we initialize the local search Algorithm 2 with the output of the greedy Algorithm 1. To further speed up the two algorithms, at each iteration we employ the power iteration method [35] to efficiently compute the largest eigenvalues and warm-start it with the good-quality eigenvectors from previous iterations. In Table 4, we see that the greedy Algorithm 1 and the local search Algorithm 2 successfully find the optimal solutions and outperform the truncation algorithm proposed by [9].
Table 4. Computational results of lower bounds with the Pitprops dataset (n = 13)

            Truncation algorithm [9]    Greedy Algorithm 1        Local Search Algorithm 2
 k    w*      LB      gap(%)  time(s)     LB      gap(%)  time(s)    LB      gap(%)  time(s)
 4  2.9375  2.8913    1.57     1e-3     2.9375    0.00     1e-3    2.9375    0.00     1e-2
 5  3.4062  3.3951    0.32     1e-3     3.4062    0.00     1e-3    3.4062    0.00     1e-2
 6  3.7710  3.7576    0.36     1e-3     3.7710    0.00     1e-2    3.7710    0.00     1e-2
 7  3.9962  3.9929    0.08     1e-3     3.9962    0.00     1e-2    3.9962    0.00     1e-2
 8  4.0686  4.0648    0.09     1e-3     4.0686    0.00     1e-2    4.0686    0.00     1e-2
 9  4.1386  4.1313    0.18     1e-3     4.1386    0.00     1e-2    4.1386    0.00     1e-2
10  4.1726  4.0094    3.91     1e-3     4.1726    0.00     1e-2    4.1726    0.00     1e-2
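For concreteness, the following is a minimal Python sketch of the greedy Algorithm 1 and the local search Algorithm 2 (using numpy's dense eigensolver in place of the warm-started power iteration; this is an illustrative sketch, not the released implementation at the repository above):

import itertools
import numpy as np

# Sketch of greedy Algorithm 1 and local search Algorithm 2 for SPCA:
# maximize lambda_max(A[S, S]) over size-k index sets S.
def lam_max(A, S):
    S = sorted(S)
    return np.linalg.eigvalsh(A[np.ix_(S, S)])[-1]

def greedy(A, k):
    S = set()
    for _ in range(k):
        # add the index that most increases the largest eigenvalue
        j = max(set(range(len(A))) - S, key=lambda i: lam_max(A, S | {i}))
        S.add(j)
    return S

def local_search(A, k, S):
    n, improved = len(A), True
    while improved:
        improved = False
        for i, j in itertools.product(list(S), set(range(n)) - S):
            T = (S - {i}) | {j}          # swap i out, j in
            if lam_max(A, T) > lam_max(A, S) + 1e-10:
                S, improved = T, True
                break
    return S

rng = np.random.default_rng(0)
B = rng.standard_normal((50, 13))
A = B.T @ B                              # a 13x13 covariance-like matrix, n = 13 as in Pitprops
S = local_search(A, 6, greedy(A, 6))     # local search initialized by greedy, as in the text
print("k = 6 objective:", lam_max(A, S))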
In this subsection, we conduct experiments on four larger instances from Dey et al. [15] to further test the efficiency of our proposed methods for SPCA, which are
Eisen-1, Eisen-2, Colon, and
Reddit with n = 79, 118, 500, and 2000, respectively. Since the MILP formulation (22) consistently outperforms the two MISDP formulations (6) and (15), we stick to the MILP formulation (22) in this set of numerical experiments. We first compare the performances of different heuristic methods on the Reddit dataset with n = 2000 and k ∈ {10, 20, ..., 70}, i.e., 7 cases in total. We implement the greedy Algorithm 1 and the local search Algorithm 2 and compare them with the best-known truncation algorithm proposed by [9]. The numerical results are shown in Table 5. We see that the local search Algorithm 2 provides the highest-quality solutions of the three, while the greedy Algorithm 1 is almost as good as the truncation algorithm. Although the local search Algorithm 2 takes the longest running time, the time is quite reasonable given the size of the testing cases. Hence, our computational experiments show that the local search Algorithm 2 consistently outperforms the other two methods within a reasonably short time, and we recommend this algorithm for practical problems. Next, we obtain the local search Algorithm 2 solutions, the continuous relaxation bounds, and the exact values of SPCA on the four instances, i.e., Eisen-1, Eisen-2, Colon, and
Reddit. For these instances, MOSEK fails to solve our proposed SDP relaxations (8) and (16). Thus, instead, we use the subgradient method to solve the continuous relaxation formulations (13) and (20). For the MILP formulation (22), we set the time limit of Gurobi to one hour. The computational results are presented in Table 6, where we let UB denote the upper bound of SPCA, let VAL denote the best lower bound of MILP (22) found when the time limit is reached, and let MIPgap(%) denote the percentage MIP gap reported by Gurobi.
Table 5. Computational results of lower bounds with the Reddit dataset (n = 2000)

      Truncation algorithm [9]    Greedy Algorithm 1     Local Search Algorithm 2
  k       LB        time(s)          LB       time(s)         LB        time(s)
 10   1482.3205        3         1521.3081       1        1521.3083        9
 20   1666.2397        2         1670.4712       4        1684.3943       59
 30   1953.3711        2         1856.2875       7        1953.7502       92
 40   2203.1715        2         2123.5635      10        2208.2452      208
 50   2311.2407        2         2289.0371      13        2322.8204      207
 60   2427.2685        3         2402.8345      16        2441.7020      202
 70   2475.9581        2         2488.8991      19        2494.6142      193

For these instances, we see that the local search Algorithm 2 still performs very well and that the subgradient method is efficient for solving the continuous relaxation (13). The continuous relaxation (20) turns out to be very difficult to compute, even more difficult than the MILP formulation (22). For the instance Eisen-1, both the MILP formulation (22) and the local search Algorithm 2 find the optimal solutions, which further demonstrates the effectiveness of the local search Algorithm 2.
Table 6. Computational results of lower bounds, upper bounds, and exact values with four larger instances

                     Local Search Alg. 2   Cont. Relaxation (13)   Cont. Relaxation (20)          MILP (22)
Data       n    k      LB       time(s)      UB        time(s)       UB        time(s)     VAL        MIPgap(%)  time(s)
Eisen-1   79   10    17.3355       1       17.9144       14        17.7571       126     17.3355        0.00        34
Eisen-1   79   20    17.7195       1       18.1309       13        18.0362        85     17.7195        0.00       125
Eisen-2  118   10    11.7182       1       13.8732       89           -            -     11.7182       18.39      3600
Eisen-2  118   20    19.3228       1       22.9268       90           -            -     19.3228       18.65      3600
Colon    500   10  2641.2289       1     2901.1105      342           -            -   2641.2289        9.84      3600
Colon    500   20  4255.6941       3     4833.1900      344           -            -   4255.6941       13.57      3600
Reddit  2000   10  1521.3083       9     1867.9965     1198           -            -        -             -          -
Reddit  2000   20  1684.3943      59     2184.2436     1241           -            -        -             -          -
Drugabuse Dataset

We finally apply the proposed local search Algorithm 2 to the Drugabuse
dataset with n = 2365 features, where the data come from a questionnaire collected by the National Survey on Drug Use and Health (NSDUH) in 2018. It has been reported [33] that, with the growing illicit online sale of controlled substances, deaths attributable to opioid-related drugs have more than quadrupled in the U.S. since 1999. Thus, it is important to select a handful of features that researchers can focus on for further exploration. Indeed, SPCA is a good tool to reduce the complexity and improve the interpretability of machine learning algorithms by selecting the most important features. Our numerical finding for the case of k = 10 is illustrated in Figure 1, where the vertical values correspond to the loadings of the selected features of the first PC, scaled by 100. We see that among the 10 selected features there are three categories (i.e., inhalants, drug injection, drug treatment), which are important for analyzing drug abuse. In particular, SPCA selects 6 features related to drug treatment, which is consistent with the literature [11, 39] that the treatment records of drug abuse are informative and important. Three drug-injection questions have been designed to understand the injection experience of different special drugs, and it is well known that drug-injection users are at high risk for HIV and other blood-borne infections [32, 38]. The inhalants feature, corresponding to various accessible products that can easily cause addiction, significantly contributes to the increase of drug abuse [6, 14].
Figure 1. 10 features selected by the local search Algorithm 2 for the Drugabuse dataset
7. Extension to the Rank-one Sparse Singular Value Decomposition (R1-SSVD)
In this section, we extend the proposed formulations and theoretical results to the rank-one sparse singular value decomposition (R1-SSVD). R1-SSVD has been successfully used to analyze the row-column associations within high-dimensional data (see, e.g., [23, 28, 36]). The goal of R1-SSVD is to find, from a given matrix, the best (possibly non-square) submatrix of a prescribed size whose largest singular value is maximized. Formally, R1-SSVD can be formulated as

(R1-SSVD) $w^*_{\mathrm{SVD}} := \max_{u\in\mathbb{R}^m,\, v\in\mathbb{R}^n}\left\{u^\top Av : \|u\|_2 = 1, \|v\|_2 = 1, \|u\|_0 = k_1, \|v\|_0 = k_2\right\}$,   (27)

where the matrix $A\in\mathbb{R}^{m\times n}$ is given, and $k_1\in[m]$ and $k_2\in[n]$ are positive integers. Our reduction of R1-SSVD (27) to SPCA (1) relies on an augmented symmetric matrix $\bar{A}\in\mathbb{S}^{m+n}$,

$\bar{A} = \begin{pmatrix} 0 & A \\ A^\top & 0 \end{pmatrix}$.   (28)

Let $x := [u^\top, v^\top]^\top$ denote an $(m+n)$-dimensional vector. According to the identity $x^\top\bar{A}x = 2u^\top Av$, R1-SSVD (27) can be reformulated as

$w^*_{\mathrm{SVD}} := \frac{1}{2}\max_{x\in\mathbb{R}^{m+n}}\left\{x^\top\bar{A}x : \|x_{1:m}\|_2 = 1, \|x_{m+1:m+n}\|_2 = 1, \|x_{1:m}\|_0 = k_1, \|x_{m+1:m+n}\|_0 = k_2\right\}$,   (29)

where $x_{1:m}$ collects the $m$ entries of $x$ indexed by $[m]$ and $x_{m+1:m+n}$ the $n$ entries indexed by $[m+1, m+n]$. In R1-SSVD (29), the sparsity restrictions are enforced on both $x_{1:m}$ and $x_{m+1:m+n}$. Thus, R1-SSVD (29) can be viewed as a variant of the conventional SPCA (1) in which $\bar{A}$ is symmetric but not positive semi-definite and there are two sparsity constraints instead of one. Similarly, introducing binary variables $z_i = 1$ if the $i$-th column of $\bar{A}$ is chosen and $z_i = 0$ otherwise, we can linearize the zero-norm constraints and recast R1-SSVD (29) as

$w^*_{\mathrm{SVD}} := \frac{1}{2}\max_{x\in\mathbb{R}^{m+n},\, z\in Z_{\mathrm{SVD}}}\left\{x^\top\bar{A}x : \|x_{1:m}\|_2 = 1, \|x_{m+1:m+n}\|_2 = 1, |x_i|\le z_i, \forall i\in[m+n]\right\}$,   (30)

where the set $Z_{\mathrm{SVD}}$ is defined as $Z_{\mathrm{SVD}} := \{z\in\{0,1\}^{m+n} : \sum_{i\in[m]} z_i = k_1, \sum_{i\in[m+1,m+n]} z_i = k_2\}$. The following lemma motivates three exact mixed-integer formulations for R1-SSVD (30); a short numerical illustration of the augmented matrix (28) is given below.
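The sketch below (a numpy sketch on a random instance; purely illustrative) verifies the key spectral property of the augmented matrix (28) that underlies this reduction, formalized in Part (i) of Lemma 3:

import numpy as np

# The eigenvalues of the augmented matrix (28) are the singular values of A,
# their negations, and |n - m| zeros; in particular lambda_max = sigma_max(A).
rng = np.random.default_rng(0)
m, n = 4, 6
A = rng.standard_normal((m, n))
A_aug = np.block([[np.zeros((m, m)), A], [A.T, np.zeros((n, n))]])

eig = np.sort(np.linalg.eigvalsh(A_aug))
sv = np.linalg.svd(A, compute_uv=False)
expected = np.sort(np.concatenate([sv, -sv, np.zeros(n - m)]))
assert np.allclose(eig, expected)
print("lambda_max(A_aug) == sigma_max(A):", np.isclose(eig[-1], sv[0]))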
Lemma 3. Given a matrix $A\in\mathbb{R}^{m\times n}$, consider its augmented counterpart $\bar{A}$ defined in (28), two integers $k_1\in[m]$ and $k_2\in[n]$, and three subsets $S, S_1, S_2\subseteq[m+n]$ such that $|S| = k_1+k_2$, $S_1 = S\cap[m]$ with $|S_1| = k_1$, and $S_2 = S\cap[m+1, m+n]$ with $|S_2| = k_2$. Then the following identities must hold:
(i) the eigenvalues of the augmented submatrix $\bar{A}_{S,S}$ are the singular values of the submatrix $A_{S_1,S_2}$ together with their negations; and
(ii) $\sigma_{\max}(A_{S_1,S_2}) = \lambda_{\max}(\bar{A}_{S,S}) = \frac{1}{2}\max_{x\in\mathbb{R}^{k_1+k_2}}\{x^\top\bar{A}_{S,S}x : \|x_{1:k_1}\|_2 = 1, \|x_{k_1+1:k_1+k_2}\|_2 = 1\} = \frac{1}{2}\max_{X\in\mathbb{S}^{k_1+k_2}_+}\{\mathrm{tr}(\bar{A}_{S,S}X) : \sum_{j\in[k_1]} X_{jj} = 1, \sum_{i\in[k_1+1,k_1+k_2]} X_{ii} = 1\}$.

Proof. See Appendix A.7. □
Notably, Part (ii) of Lemma 3 shows that R1-SSVD is equivalent to the following combinatorial optimization problem:

$w^*_{\mathrm{SVD}} := \max_{S\subseteq[m+n]}\left\{\lambda_{\max}(\bar{A}_{S,S}) : |S\cap[m]| = k_1, |S\cap[m+1,m+n]| = k_2\right\}$.   (31)

The next four subsections present MISDP formulations (I) and (II), a MILP formulation, and approximation algorithms, respectively. The fact that the matrix $\bar{A}$ is symmetric but not positive semi-definite prevents us from directly applying the results in Section 2. Fortunately, a simple remedy of adding the matrix $\sigma_{\max}(A)I_{m+n}$ to $\bar{A}$ fixes this issue. That is, let us define

$\widehat{A} := \bar{A} + \sigma_{\max}(A)I_{m+n}$,   (32)

which is indeed positive semi-definite according to Part (i) of Lemma 3. More importantly, the new matrix $\widehat{A}$ preserves all the sparsity properties of the original $\bar{A}$. Thus, the combinatorial R1-SSVD (31) is equivalent to

$w^*_{\mathrm{SVD}} := \max_{S\subseteq[m+n]}\left\{\lambda_{\max}(\widehat{A}_{S,S}) : |S\cap[m]| = k_1, |S\cap[m+1,m+n]| = k_2\right\} - \sigma_{\max}(A)$.   (33)

Now all the results in Section 2 are directly applicable to R1-SSVD (33). We highlight two important ones below.

Theorem 10.
The R1-SSVD (33) admits an equivalent MISDP formulation:

$w^*_{\mathrm{SVD}} := \max_{z\in Z_{\mathrm{SVD}},\, X, W_1,\dots,W_{m+n}\in\mathbb{S}^d_+}\left\{\sum_{i\in[m+n]} c_i^\top W_ic_i : \mathrm{tr}(X) = 1,\ X\succeq W_i,\ \mathrm{tr}(W_i) = z_i,\ \forall i\in[m+n]\right\} - \sigma_{\max}(A)$,   (34)

where $\widehat{A} = C^\top C$ denotes the Cholesky factorization of $\widehat{A}$ with $C\in\mathbb{R}^{d\times(m+n)}$, $d$ is the rank of $\widehat{A}$, and $c_i\in\mathbb{R}^d$ denotes the $i$-th column vector of $C$ for each $i\in[m+n]$.

Theorem 11.
The continuous relaxation value $w_{\mathrm{SVD1}}$ of formulation (34) satisfies $w^*_{\mathrm{SVD}} \le w_{\mathrm{SVD1}} \le \sqrt{mnk_1^{-1}k_2^{-1}}\, w^*_{\mathrm{SVD}}$.

Proof. See Appendix A.8. □
Since the results in Section 3 do not rely on the positive semi-definiteness of the matrix $\bar{A}$, they can be directly extended to R1-SSVD (30). We first present a naive MISDP for R1-SSVD (30) based on Part (ii) of Lemma 3.

Proposition 7.
The R1-SSVD (30) is equivalent to the following MISDP formulation:

$w^*_{\mathrm{SVD}} := \frac{1}{2}\max_{z\in Z_{\mathrm{SVD}},\, X\in\mathbb{S}^{m+n}_+}\left\{\mathrm{tr}(\bar{A}X) : \sum_{j\in[m]} X_{jj} = 1,\ \sum_{j\in[m+1,m+n]} X_{jj} = 1,\ X_{ii}\le z_i,\ \forall i\in[m+n]\right\}$.   (35)

The R1-SSVD formulation (35) is rather weak, and its continuous relaxation value is equal to $\sigma_{\max}(A)$. Fortunately, we can derive two types of valid inequalities to strengthen it, as below.

Lemma 4.
For R1-SSVD (35), the following second-order conic inequalities are valid:
(i) $\sum_{j\in[m]} X_{ij}^2 \le z_iX_{ii}$ and $\sum_{j\in[m+1,m+n]} X_{ij}^2 \le z_iX_{ii}$ for all $i\in[m+n]$; and
(ii) $\big(\sum_{j\in[m]} |X_{ij}|\big)^2 \le k_1X_{ii}z_i$ and $\big(\sum_{j\in[m+1,m+n]} |X_{ij}|\big)^2 \le k_2X_{ii}z_i$ for all $i\in[m+n]$.

Proof. See Appendix A.9. □
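As an illustration, the following sketch (assuming cvxpy with an SDP-capable solver such as SCS or MOSEK; not the authors' released code) forms the continuous relaxation of (35) strengthened with the inequalities (i), written in disciplined convex programming form via quad-over-linear constraints:

import numpy as np
import cvxpy as cp

# Continuous relaxation of the MISDP (35) with the valid inequalities (i) of Lemma 4.
rng = np.random.default_rng(0)
m, n, k1, k2 = 6, 8, 2, 3
A = rng.standard_normal((m, n))
A_aug = np.block([[np.zeros((m, m)), A], [A.T, np.zeros((n, n))]])

N = m + n
X = cp.Variable((N, N), PSD=True)
z = cp.Variable(N)
cons = [cp.sum(cp.diag(X)[:m]) == 1,
        cp.sum(cp.diag(X)[m:]) == 1,
        cp.diag(X) <= z, z >= 0, z <= 1,
        cp.sum(z[:m]) == k1, cp.sum(z[m:]) == k2]
# Inequalities (i): sum_j X_ij^2 <= z_i * X_ii over each block, written in
# DCP form as quad_over_lin(X[i, block], X[i, i]) <= z[i] (valid when X_ii > 0).
for i in range(N):
    cons.append(cp.quad_over_lin(X[i, :m], X[i, i]) <= z[i])
    cons.append(cp.quad_over_lin(X[i, m:], X[i, i]) <= z[i])
prob = cp.Problem(cp.Maximize(0.5 * cp.trace(A_aug @ X)), cons)
prob.solve(solver=cp.SCS)
print("upper bound on w*_SVD:", prob.value)

Inequality (ii) can be added analogously through auxiliary variables bounding the block-wise $\ell_1$ norms of the rows of $X$.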
Adding these valid inequalities strengthens the MISDP formulation (35) for R1-SSVD. Similar to Theorem 5, we provide the optimality gap of its continuous relaxation value below.
Theorem 12. The continuous relaxation value $w_{\mathrm{SVD2}}$ of R1-SSVD (35) with the inequalities in Lemma 4 yields an optimality gap of at most $\min\{\sqrt{k_1k_2}, \sqrt{mnk_1^{-1}k_2^{-1}}\}$, i.e., $w^*_{\mathrm{SVD}} \le w_{\mathrm{SVD2}} \le \min\{\sqrt{k_1k_2}, \sqrt{mnk_1^{-1}k_2^{-1}}\}\, w^*_{\mathrm{SVD}}$.

Similarly, we can develop an MILP formulation with arbitrary accuracy based on the Cholesky decomposition of the matrix $\widehat{A}$ in R1-SSVD (33). The proofs are similar to those in Section 4 and are thus omitted.

Theorem 13.
Given a threshold $\epsilon > 0$ and lower and upper bounds $w^L, w^U$ of the optimal value of R1-SSVD, the following MILP is $O(\epsilon)$-approximate to R1-SSVD (33), i.e., $-\epsilon \le \widehat{w}(\epsilon) - w^*_{\mathrm{SVD}} \le \epsilon\sqrt{d}$:

$\widehat{w}(\epsilon) := \max_{w, z\in Z_{\mathrm{SVD}}, y, \alpha, x, \delta, \mu, \sigma}\ w - \sigma_{\max}(A)$
s.t. $x = \delta_i^1 + \delta_i^2,\ \|\delta_i^1\|_\infty\le z_i,\ \|\delta_i^2\|_\infty\le 1-z_i,\ \forall i\in[m+n]$,
$x = \sum_{j\in[d]}\sigma^j,\ \|\sigma^j\|_\infty\le y_j,\ \sigma^j_j = y_j,\ \forall j\in[d],\ \sum_{j\in[d]} y_j = 1$,
$x = \mu^{\ell 1} + \mu^{\ell 2},\ \|\mu^{\ell 1}\|_\infty\le\alpha_\ell,\ \|\mu^{\ell 2}\|_\infty\le 1-\alpha_\ell,\ \forall\ell\in[L]$,
$w = w^U - (w^U - w^L)\sum_{\ell\in[L]} 2^{-\ell}\alpha_\ell$,
$\big\|\sum_{i\in[m+n]} c_ic_i^\top\delta_i^1 - w^Ux + (w^U - w^L)\sum_{\ell\in[L]} 2^{-\ell}\mu^{\ell 1}\big\|_\infty \le \epsilon$,
$\alpha\in\{0,1\}^L,\ y\in\{0,1\}^d$,   (36)

where $L := \lceil\log_2((w^U - w^L)\epsilon^{-1})\rceil$.

Theorem 14.
Given a threshold $\epsilon > 0$, let $w_{\mathrm{SVD3}}(\epsilon)$ denote the optimal value of the MILP formulation (36) with the binary variables $z$ relaxed to be continuous. Then we have

$w_{\mathrm{SVD3}}(\epsilon) \le \sqrt{mnk_1^{-1}k_2^{-1}}\left[\min\left\{(k_1+k_2)\frac{\sqrt{d}+1}{2},\ \frac{m+n}{k_1+k_2}\sqrt{d} + (m+n-k_1-k_2)\frac{\sqrt{d}+1}{2}\right\} - 1\right] w^*_{\mathrm{SVD}} + \epsilon\sqrt{d}$.

We will investigate three approximation algorithms for R1-SSVD (27): the truncation algorithm, the greedy algorithm, and the local search algorithm.
The truncation-based approximation algorithm in [9] is known to achieve the best approximation ratio so far, $O(n^{-1/3})$, for SPCA. We show that a similar truncation also works for R1-SSVD. First, we define the truncation operator below.

Definition 1 (Normalized Truncation).
Given a vector $x\in\mathbb{R}^n$ and an integer $s\in[n]$, the vector $\bar{x}$ is an $s$-truncation of $x$ if $\bar{x}_i = |x_i|$ when $|x_i|$ is one of the $s$ largest absolute entries of $x$ and $\bar{x}_i = 0$ otherwise, for each $i\in[n]$. The normalized $s$-truncation of $x$ is defined as $\widehat{x} := \bar{x}/\|\bar{x}\|_2$, which is normalized to be of unit length.

Then the truncation algorithm for R1-SSVD has the following two steps:

(i) Truncation in the standard basis:
for each $i\in[n]$, let $\widehat{u}_i\in\mathbb{R}^m$ be the normalized $k_1$-truncation of the $i$-th column vector of $A$, and for each $j\in[m]$, let $\widehat{v}_j\in\mathbb{R}^n$ be the normalized $k_2$-truncation of the $j$-th row vector of $A$. Clearly, each $\widehat{u}_i$ and $\widehat{v}_j$ can be completed to a feasible solution of R1-SSVD (27);

(ii) Truncation in the eigen-space basis: let $u$ and $v$ denote the left and right singular vectors of $A$ corresponding to the largest singular value. We then define $\widehat{u}$ as the normalized $k_1$-truncation of $u$ and $\widehat{v}$ as the normalized $k_2$-truncation of the vector $A^\top\widehat{u}$. It is clear that $(\widehat{u}, \widehat{v})$ is also feasible to R1-SSVD (27).

The approximation results of the truncation procedure are summarized below.

Theorem 15.
For R1-SSVD (27), the truncation algorithm yields an approximation ratio of $\max\{\sqrt{k_1^{-1}}, \sqrt{k_2^{-1}}, \sqrt{k_1k_2m^{-1}n^{-1}}\}$. In particular, the approximation ratio is $O(n^{-1/3})$ when $k_1\approx k_2$ and $m\approx n$.

Proof. See Appendix A.10. □
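For concreteness, here is a small numpy sketch of the normalized truncation of Definition 1 and of the eigen-space truncation step (ii); the random instance is purely illustrative:

import numpy as np

def normalized_truncation(x, s):
    """Keep the s entries of largest absolute value (as |x_i|), zero the rest, normalize."""
    xt = np.zeros_like(x)
    idx = np.argsort(-np.abs(x))[:s]
    xt[idx] = np.abs(x[idx])
    return xt / np.linalg.norm(xt)

rng = np.random.default_rng(1)
m, n, k1, k2 = 30, 40, 5, 7
A = rng.standard_normal((m, n))

# Step (ii): truncate the top left singular vector u, then truncate A^T u_hat.
u = np.linalg.svd(A)[0][:, 0]
u_hat = normalized_truncation(u, k1)
v_hat = normalized_truncation(A.T @ u_hat, k2)
print("feasible objective:", u_hat @ A @ v_hat,
      "vs sigma_max(A):", np.linalg.svd(A, compute_uv=False)[0])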
We design the greedy and local search algorithms according to the following equivalent combinatorial formulation of R1-SSVD (27):

$w^*_{\mathrm{SVD}} := \max_{S_1\subseteq[m],\, S_2\subseteq[n]}\left\{\sigma_{\max}(A_{S_1,S_2}) : |S_1| = k_1, |S_2| = k_2\right\}$.   (37)

Different from SPCA (3), R1-SSVD (37) maximizes the largest singular value over $k_1\times k_2$ submatrices rather than over size-$k$ principal submatrices. Therefore, to solve R1-SSVD (37), we adapt the greedy Algorithm 1 and the local search Algorithm 2 to select a row and/or a column at each iteration. Specifically, for the greedy algorithm, let the subsets $\widehat{S}_1, \widehat{S}_2$ denote the index sets of the selected rows and columns, respectively. We first initialize the greedy algorithm by selecting the entry of $A$ with the largest absolute value. Then we add one element to each subset at each iteration so as to maximize the largest singular value of the resulting submatrix, until one subset reaches its target size; afterwards, we continue selecting one row (or one column) at each iteration until we obtain a $k_1\times k_2$ submatrix. The detailed implementation can be found in Algorithm 3. Given an initial feasible solution $(\widehat{S}_1, \widehat{S}_2)$ to R1-SSVD (37), the adapted local search algorithm performs the swapping procedure on both $\widehat{S}_1$ and $\widehat{S}_2$ simultaneously (see Algorithm 4 for details).

Algorithm 3
Greedy Algorithm for R1-SSVD (37)
Input: matrix $A\in\mathbb{R}^{m\times n}$, integers $k_1\in[m]$, $k_2\in[n]$
Let $\widehat{S}_1 := \emptyset$ and $\widehat{S}_2 := \emptyset$ denote the selected rows and columns, respectively
Compute $(j_1^*, j_2^*)\in\arg\max_{j_1\in[m], j_2\in[n]}\{|A_{j_1,j_2}|\}$
Add $j_1^*$ to $\widehat{S}_1$ and $j_2^*$ to $\widehat{S}_2$
for $\ell = 2, \dots, \max\{k_1, k_2\}$ do
  if $\ell \le \min\{k_1, k_2\}$ then
    Compute $j^*\in\arg\max_{j\in[m]\setminus\widehat{S}_1}\{\sigma_{\max}(A_{\widehat{S}_1\cup\{j\},\widehat{S}_2})\}$ and add $j^*$ to $\widehat{S}_1$
    Compute $j^*\in\arg\max_{j\in[n]\setminus\widehat{S}_2}\{\sigma_{\max}(A_{\widehat{S}_1,\widehat{S}_2\cup\{j\}})\}$ and add $j^*$ to $\widehat{S}_2$
  else if $k_1 \le k_2$ then
    Compute $j^*\in\arg\max_{j\in[n]\setminus\widehat{S}_2}\{\sigma_{\max}(A_{\widehat{S}_1,\widehat{S}_2\cup\{j\}})\}$ and add $j^*$ to $\widehat{S}_2$
  else
    Compute $j^*\in\arg\max_{j\in[m]\setminus\widehat{S}_1}\{\sigma_{\max}(A_{\widehat{S}_1\cup\{j\},\widehat{S}_2})\}$ and add $j^*$ to $\widehat{S}_1$
  end if
end for
Output: $\widehat{S}_1, \widehat{S}_2$

The following results illustrate the theoretical performance guarantees of the two algorithms for R1-SSVD and show that the approximation ratios are both tight.
Theorem 16. For the greedy Algorithm 3 and the local search Algorithm 4, we have: (i) both algorithms achieve a $(\sqrt{k_1k_2})^{-1}$-approximation ratio for R1-SSVD (37); and (ii) the ratio is tight.

Proof. See Appendix A.11. □
Algorithm 4
Local Search Algorithm for R1-SSVD (37)
Input: matrix $A\in\mathbb{R}^{m\times n}$, integers $k_1\in[m]$, $k_2\in[n]$
Initialize a size-$k_1$ subset $\widehat{S}_1\subseteq[m]$ and a size-$k_2$ subset $\widehat{S}_2\subseteq[n]$
do
  for each pair $(i_1, j_1, i_2, j_2)\in\widehat{S}_1\times([m]\setminus\widehat{S}_1)\times\widehat{S}_2\times([n]\setminus\widehat{S}_2)$ do
    if $\sigma_{\max}(A_{\widehat{S}_1\cup\{j_1\}\setminus\{i_1\},\widehat{S}_2\cup\{j_2\}\setminus\{i_2\}}) > \sigma_{\max}(A_{\widehat{S}_1,\widehat{S}_2})$ then
      Update $\widehat{S}_1 := \widehat{S}_1\cup\{j_1\}\setminus\{i_1\}$, $\widehat{S}_2 := \widehat{S}_2\cup\{j_2\}\setminus\{i_2\}$
    end if
  end for
while there is still an improvement
Output: $\widehat{S}_1, \widehat{S}_2$
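A minimal Python sketch of the swapping step of Algorithm 4 is given below (assuming numpy; for clarity, $\sigma_{\max}$ is recomputed from scratch for every candidate swap, whereas an efficient implementation would warm-start the singular-vector computations):

import itertools
import numpy as np

def sigma_max(A, S1, S2):
    # largest singular value of the submatrix indexed by rows S1, columns S2
    return np.linalg.svd(A[np.ix_(sorted(S1), sorted(S2))], compute_uv=False)[0]

def local_search_r1ssvd(A, S1, S2):
    m, n = A.shape
    improved = True
    while improved:
        improved = False
        best = sigma_max(A, S1, S2)
        # simultaneously swap one row (i1 -> j1) and one column (i2 -> j2)
        for i1, j1, i2, j2 in itertools.product(
                list(S1), set(range(m)) - S1, list(S2), set(range(n)) - S2):
            T1, T2 = (S1 - {i1}) | {j1}, (S2 - {i2}) | {j2}
            if sigma_max(A, T1, T2) > best + 1e-9:
                S1, S2, improved = T1, T2, True
                break
    return S1, S2, sigma_max(A, S1, S2)

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 10))
S1, S2, val = local_search_r1ssvd(A, set(range(3)), set(range(4)))  # k1 = 3, k2 = 4
print("locally optimal sigma_max:", val)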
8. Extension to Sparse Fair PCA
In this section, we study the Sparse Fair PCA (SFPCA) problem and derive an approximate MISDP formulation for it. Fair PCA has recently been studied in the literature (see, e.g., [34, 37]). The goal of SFPCA is to seek the principal submatrices of multiple group covariance matrices that achieve relatively similar objective values across groups. Suppose there are $s$ groups with corresponding covariance matrices $\{A_i\}_{i\in[s]}$. Then SFPCA can be formulated as

$w^*_F := \max_{x}\left\{\min_{i\in[s]} x^\top A_ix : \|x\|_2 = 1, \|x\|_0 \le k\right\}$.   (38)

By introducing binary variables $z$ and linearizing the objective function, we obtain

$w^*_F := \max_{w, x, z\in Z}\left\{w : w \le x^\top A_ix, \forall i\in[s],\ \|x\|_2 = 1,\ -z_i \le x_i \le z_i, \forall i\in[n]\right\}$.   (39)

As SFPCA (39) is quite different from SPCA, it is not surprising that the results in Section 2 and Section 4 do not apply to SFPCA (39). Fortunately, the results in Section 3 do provide an interesting upper bound for SFPCA (39), which is exact when there are $s = 2$ groups of covariance matrices. Introducing a rank-one positive semi-definite matrix variable $X\in\mathbb{S}^n_+$ such that $X = xx^\top$, dropping the rank-one restriction, and adding the valid inequalities in Theorem 4, problem (39) can be upper bounded by

$w_F := \max_{w, X, z\in Z}\left\{w : w \le \mathrm{tr}(A_iX), \forall i\in[s],\ \mathrm{tr}(X) = 1,\ \sum_{j\in[n]} X_{ij}^2 \le X_{ii}z_i,\ \big(\sum_{j\in[n]} |X_{ij}|\big)^2 \le kX_{ii}z_i, \forall i\in[n]\right\}$.   (40)

The following result shows that the approximation (40) is exact if $s = 2$, and otherwise it provides an upper bound of SFPCA (39).
Proposition 8. For the MISDP formulation (40), we have:
(i) the optimal value of the MISDP formulation (40) provides an upper bound of SFPCA (39), i.e., $w_F \ge w^*_F$; moreover, when $s = 2$, the formulation (40) becomes exact, i.e., $w_F = w^*_F$; and
(ii) there exists an optimal solution $(w^*, X^*, z^*)$ of MISDP (40) such that the rank of $X^*$ is at most $\lfloor\sqrt{2s + 9/4} - 1/2\rfloor$.

Proof. (i) It is clear that $w_F \ge w^*_F$ since we drop the rank-one restriction on $X$ in the MISDP formulation (40). On the other hand, for the case of $s = 2$, Theorem 1.1 in [37] shows that for any feasible solution $(w, X, z)$, there exists a rank-one positive semi-definite matrix $\widehat{X}$ such that the new solution $(w, \widehat{X}, z)$ is also feasible and achieves the same objective value. Thus, we must have $w_F = w^*_F$.
(ii) Suppose $(w, X, z)$ denotes an optimal solution of MISDP (40) and let $S = \{i\in[n] : z_i = 1\}$. Then, according to Theorem 1.7 in [37], there exists a positive semi-definite matrix $\widehat{X}$ of rank at most $\lfloor\sqrt{2s + 9/4} - 1/2\rfloor$ such that the new solution $(w, \widehat{X}, z)$ is also optimal. □

Proposition 8 shows that two-group SFPCA (39) admits an MISDP representation, while for $s > 2$ the MISDP formulation (40) provides a low-rank solution in general. It is worth mentioning that the results in Proposition 8 hold for any convex fairness measure.
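To illustrate the upper bound (40), the following sketch (assuming cvxpy with an SDP-capable solver such as SCS; the random covariance matrices and the relaxation of z to [0, 1] are our own illustrative choices) solves a continuous relaxation of (40), whose value upper bounds $w^*_F$:

import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, k, s = 8, 3, 2
As = []
for _ in range(s):
    B = rng.standard_normal((20, n))
    As.append(B.T @ B)                 # group covariance matrices

X = cp.Variable((n, n), PSD=True)
z = cp.Variable(n)
w = cp.Variable()
t = cp.Variable(n)                     # auxiliary bound on row-wise l1 norms
cons = [w <= cp.trace(Ai @ X) for Ai in As]
cons += [cp.trace(X) == 1, z >= 0, z <= 1, cp.sum(z) <= k]
for i in range(n):
    cons.append(cp.quad_over_lin(X[i, :], X[i, i]) <= z[i])       # inequality (i)
    cons.append(t[i] >= cp.norm1(X[i, :]))
    cons.append(cp.quad_over_lin(t[i], X[i, i]) <= k * z[i])      # inequality (ii)
prob = cp.Problem(cp.Maximize(w), cons)
prob.solve(solver=cp.SCS)
print("upper bound on w*_F:", prob.value)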
9. Conclusion
In practice, to tune the parameter k via cross-validation, our developed greedy and local search algorithms can be quickly warm-started from the solutions of previous iterations. We anticipate that the theoretical optimality gaps of the three exact formulations for SPCA and R1-SSVD are not tight and can be further strengthened. The analysis of the optimality gap of sparse fair PCA requires new techniques, which can be an exciting research direction. Also, it might be desirable to study robust sparse PCA when the datasets are noisy or contain outliers.

References

[1] Amini AA, Wainwright MJ (2008) High-dimensional analysis of semidefinite relaxations for sparse principal components. 2454–2458 (IEEE).
[2] Balas E (1975) Disjunctive programming: cutting planes from logical conditions. Nonlinear Programming 2, 279–312 (Elsevier).
[3] Ben-Tal A, Nemirovski A (2001) Lectures on Modern Convex Optimization: Analysis, Algorithms, and Engineering Applications, volume 2 (SIAM).
[4] Benders JF (1962) Partitioning procedures for solving mixed-variables programming problems. Numerische Mathematik. http://dx.doi.org/10.1007/BF01386316.
[5] Berk L, Bertsimas D (2019) Certifiably optimal sparse principal component analysis. Mathematical Programming Computation.
[6] Psychological Medicine.
[7] Computers & Operations Research.
[8] IEEE Geoscience and Remote Sensing Letters.
[9] Conference on Learning Theory, 623–646.
[10] Coope I (1994) On matrix trace inequalities and related topics for products of Hermitian matrices. Journal of Mathematical Analysis and Applications.
[11] Nicotine and Tobacco Research.
[12] arXiv preprint arXiv:1205.0121.
[13] d'Aspremont A, Ghaoui LE, Jordan MI, Lanckriet GR (2005) A direct formulation for sparse PCA using semidefinite programming. Advances in Neural Information Processing Systems, 41–48.
[14] De Barona MS, Simpson DD (1984) Inhalant users in drug abuse prevention programs. The American Journal of Drug and Alcohol Abuse.
[15] Dey SS, et al. (2018) arXiv preprint arXiv:1810.09062.
[16] d'Aspremont A, Bach F, Ghaoui LE (2008) Optimal solutions for sparse principal component analysis. Journal of Machine Learning Research.
[17] Preprint, submitted.
[18] Geoffrion AM (1972) Generalized Benders decomposition. Journal of Optimization Theory and Applications.
[19] Proceedings of the 2011 SIAM International Conference on Data Mining, 771–782 (SIAM).
[20] Jeffers J (1967) Two case studies in the application of principal component analysis. Journal of the Royal Statistical Society: Series C (Applied Statistics).
[21] IEEE Transactions on Knowledge and Data Engineering.
[22] Journal of Machine Learning Research.
[23] Biometrics.
[24] arXiv preprint arXiv:2001.08537.
[25] Luss R, d'Aspremont A (2010) Clustering and feature selection using sparse principal component analysis. Optimization and Engineering.
[26] Conference on Learning Theory, 2210–2258.
[27] Magdon-Ismail M (2017) NP-hardness and inapproximability of sparse PCA. Information Processing Letters.
[28] arXiv preprint arXiv:1603.06035.
[29] Moghaddam B, Weiss Y, Avidan S (2006) Spectral bounds for sparse PCA: exact and greedy algorithms. Advances in Neural Information Processing Systems, 915–922.
[30] Naikal N, Yang AY, Sastry SS (2011) Informative feature selection for object recognition via sparse PCA. 818–825 (IEEE).
[31] Nedić A, Ozdaglar A (2009) Subgradient methods for saddle-point problems. Journal of Optimization Theory and Applications.
[32] American Journal of Public Health.
[33] Atlanta, Centers for Disease Control and Prevention.
[34] Advances in Neural Information Processing Systems, 10976–10987.
[35] Semlyen A, Angelidis G (1995) Efficient calculation of critical eigenvalue clusters in the small signal stability analysis of large power systems.
[36] Sill M, Kaiser S, Benner A, Kopp-Schneider A (2011) Robust biclustering by sparse singular value decomposition incorporating stability selection. Bioinformatics.
[37] Advances in Neural Information Processing Systems, 15135–15145.
[38] Thomas DL, Vlahov D, Solomon L, Cohn S, Taylor E, Garfein R, Nelson KE (1995) Correlates of hepatitis C virus infections among injection drug users. Medicine.
[39] Archives of Neurology.
[40] Handbook on Semidefinite, Conic and Polynomial Optimization, 915–940 (Springer).
[41] Zhang Y, Ghaoui LE (2011) Large-scale sparse principal component analysis with application to text data. Advances in Neural Information Processing Systems, 532–539.
Appendix A. Proofs

A.1 Proof of Lemma 1

Lemma 1.
For a symmetric matrix $A\in\mathbb{S}^n$ and a size-$k$ set $S\subseteq[n]$, the following must hold:
(i) $\max_{x\in\mathbb{R}^n}\{x^\top Ax : \|x\|_2 = 1, x_i = 0, \forall i\notin S\} = \lambda_{\max}(A_{S,S})$;
(ii) $\max_{X\in\mathbb{S}^k_+}\{\mathrm{tr}(A_{S,S}X) : \mathrm{tr}(X) = 1\} = \lambda_{\max}(A_{S,S})$; and
(iii) if the matrix $A$ is positive semi-definite, then $\lambda_{\max}(A_{S,S}) = \lambda_{\max}(\sum_{i\in S} c_ic_i^\top)$, where $A = C^\top C$ with $C\in\mathbb{R}^{d\times n}$ denotes the Cholesky factorization of $A$, $d$ is the rank of $A$, and $c_i\in\mathbb{R}^d$ denotes the $i$-th column vector of $C$ for each $i\in[n]$.

Proof. Part (i).
Given a size-$k$ set $S\subseteq[n]$, the maximization problem $\max_{x\in\mathbb{R}^n}\{x^\top Ax : \|x\|_2 = 1, x_i = 0, \forall i\notin S\}$ reduces to $\max_{x\in\mathbb{R}^k}\{x^\top A_{S,S}x : \|x\|_2 = 1\}$, which is exactly the definition of the largest eigenvalue of the principal submatrix $A_{S,S}$.

Part (ii).
According to Part (i), it is sufficient to show that $v^* = \widehat{v}$, where

$v^* := \max_{X\in\mathbb{S}^k_+}\{\mathrm{tr}(A_{S,S}X) : \mathrm{tr}(X) = 1\}$,   (41)
$\widehat{v} := \max_{x\in\mathbb{R}^k}\{x^\top A_{S,S}x : \|x\|_2 = 1\}$.   (42)

First, we must have $v^* \ge \widehat{v}$: for any $x\in\mathbb{R}^k$ feasible to problem (42) with $\|x\|_2 = 1$, the positive semi-definite matrix $X = xx^\top$ is feasible to problem (41) and yields the same objective value. Second, to prove $\widehat{v} \ge v^*$, let $X^*\in\mathbb{S}^k_+$ denote an optimal solution to problem (41) with spectral decomposition $X^* = \sum_{i\in[k]}\lambda_iq_iq_i^\top$. Since $\mathrm{tr}(X^*) = 1$ and $X^*\in\mathbb{S}^k_+$, the eigenvalues satisfy $\sum_{i\in[k]}\lambda_i = 1$ and $\lambda_i \ge 0$ for all $i\in[k]$. Thus, the optimal value of problem (41) equals

$v^* = \mathrm{tr}(A_{S,S}X^*) = \sum_{i\in[k]}\lambda_iq_i^\top A_{S,S}q_i \le \max_{i\in[k]} q_i^\top A_{S,S}q_i \le \widehat{v}$,

where the first inequality is due to $\sum_{i\in[k]}\lambda_i = 1$ and $\lambda_i \ge 0$ for all $i\in[k]$.

Part (iii).
For a positive semi-definite matrix $A$, let $A = C^\top C$ with $C\in\mathbb{R}^{d\times n}$ denote its Cholesky factorization. Then we have

$\lambda_{\max}(A_{S,S}) = \lambda_{\max}(C_S^\top C_S) = \lambda_{\max}(C_SC_S^\top) = \lambda_{\max}\big(\sum_{i\in S} c_ic_i^\top\big)$,

where the second equality holds because, for any matrix, the largest singular value is equal to that of its transpose. □

A.2 Proof of Proposition 1

Proposition 1.
For the function $H(z)$ defined in (9), we have:
(i) for any $z\in Z$, the function $H(z)$ is equivalent to

$H(z) = \min_{\mu,\, Q_1,\dots,Q_n\in\mathbb{S}^d_+}\left\{\lambda_{\max}\big(\sum_{i\in[n]} Q_i\big) + \sum_{i\in[n]}\mu_iz_i : c_ic_i^\top \preceq Q_i + \mu_iI_d,\ 0\le\mu_i\le\|c_i\|_2^2,\ \forall i\in[n]\right\}$,   (10)

which is concave in $z$; and
(ii) for any binary $z\in Z$, an optimal solution to problem (10) is $\mu_i^* = 0$ if $z_i = 1$ and $\mu_i^* = \|c_i\|_2^2$ otherwise, and $Q_i^* := (1 - \mu_i^*/\|c_i\|_2^2)\, c_ic_i^\top$ for each $i\in[n]$.

Proof. Part (i).
We split the proof of strong duality into two cases, depending on whether $z$ is a relative interior point of the set $Z$ or not.

Case a. We first prove the result assuming that $z$ is in the relative interior of the set $Z$, i.e., $0 < z_i < 1$ for all $i\in[n]$. For the inner maximization problem in (9), we dualize the constraints $X\succeq W_i$ and $\mathrm{tr}(W_i) = z_i$ with Lagrangian multipliers $Q_i\in\mathbb{S}^d_+$ and $\mu_i$ for each $i\in[n]$. Note that the constraints $X\succeq W_i$, $\mathrm{tr}(W_i) = z_i$ for each $i\in[n]$ with $X, W_1, \dots, W_n\in\mathbb{S}^d_+$ can always be strictly satisfied since $0 < z_i < 1$. Thus, according to the strong duality of general conic programs (see, e.g., Theorem 1.4.4 in [3]), the function $H(z)$ can be rewritten as

$\min_{\mu,\, Q_1,\dots,Q_n\in\mathbb{S}^d_+}\ \max_{X, W_1,\dots,W_n\in\mathbb{S}^d_+}\left\{\sum_{i\in[n]} c_i^\top W_ic_i + \sum_{i\in[n]}\mathrm{tr}(Q_i(X - W_i)) + \sum_{i\in[n]}\mu_i(z_i - \mathrm{tr}(W_i)) : \mathrm{tr}(X) = 1\right\}$.   (43)

The inner maximization in (43) over each $W_i$ and over $X$ yields

$\max_{W_i\in\mathbb{S}^d_+}\mathrm{tr}\big((c_ic_i^\top - Q_i - \mu_iI_d)W_i\big) = 0$ if $c_ic_i^\top\preceq Q_i + \mu_iI_d$, and $+\infty$ otherwise;
$\max_{X\in\mathbb{S}^d_+}\left\{\mathrm{tr}\big(\big(\sum_{i\in[n]} Q_i\big)X\big) : \mathrm{tr}(X) = 1\right\} = \lambda_{\max}\big(\sum_{i\in[n]} Q_i\big)$,

where the second identity is due to Part (ii) of Lemma 1. Thus, problem (43) simplifies to

$H(z) = \min_{\mu,\, Q_1,\dots,Q_n\in\mathbb{S}^d_+}\left\{\lambda_{\max}\big(\sum_{i\in[n]} Q_i\big) + \sum_{i\in[n]}\mu_iz_i : c_ic_i^\top\preceq Q_i + \mu_iI_d, \forall i\in[n]\right\}$.   (44)

We next show by contradiction that any optimal solution $(\mu, Q_1, \dots, Q_n)$ of the minimization problem (44) must satisfy $0\le\mu_i\le\|c_i\|_2^2$ for each $i\in[n]$. Suppose there exists an optimal solution $(\mu, Q_1, \dots, Q_n)$ to problem (44) such that $\mu_j < 0$ for some $j\in[n]$. Then we can construct a new feasible solution identical to $(\mu, Q_1, \dots, Q_n)$ except $\bar\mu_j = 0$ and $\bar{Q}_j = Q_j + \mu_jI_d$. The new solution yields the objective value $H(z) + \mu_j - \mu_jz_j = H(z) + \mu_j(1 - z_j) < H(z)$, which contradicts the optimality of $(\mu, Q_1, \dots, Q_n)$. Similarly, suppose there exists an optimal solution with $\mu_j > \|c_j\|_2^2$ for some $j\in[n]$; then we arrive at a contradiction by defining a new feasible solution identical to it except $\bar\mu_j = \|c_j\|_2^2$. Therefore, (44) can be reduced to (10).

Case b. Now consider the case that $z$ is not in the relative interior of $Z$, and define the two sets $T_0 := \{i\in[n] : z_i = 0\}$ and $T_1 := \{i\in[n] : z_i = 1\}$; at least one of the two sets is nonempty. In this case, we first observe that $H(z)$ in (9) is equivalent to

$H(z) := \max_{X, \{W_i\}\subseteq\mathbb{S}^d_+}\left\{\sum_{i\in[n]\setminus(T_0\cup T_1)} c_i^\top W_ic_i + \sum_{i\in T_1} c_i^\top Xc_i : \mathrm{tr}(X) = 1,\ X\succeq W_i,\ \mathrm{tr}(W_i) = z_i,\ \forall i\in[n]\setminus(T_0\cup T_1)\right\}$.   (45)

Next, applying the same procedure as in Case a, we have

$H(z) = \min_{\mu, \{Q_i\}_{i\in[n]\setminus(T_0\cup T_1)}\subseteq\mathbb{S}^d_+}\left\{\lambda_{\max}\big(\sum_{i\in[n]\setminus(T_0\cup T_1)} Q_i + \sum_{i\in T_1} c_ic_i^\top\big) + \sum_{i\in[n]\setminus(T_0\cup T_1)}\mu_iz_i : c_ic_i^\top\preceq Q_i + \mu_iI_d,\ 0\le\mu_i\le\|c_i\|_2^2,\ \forall i\in[n]\setminus(T_0\cup T_1)\right\}$.   (46)

To show the equivalence between (46) and (10), it remains to prove that (46) equals

$\widehat{H}(z) = \min_{\mu, \{Q_i\}_{i\in[n]}\subseteq\mathbb{S}^d_+}\left\{\lambda_{\max}\big(\sum_{i\in[n]} Q_i\big) + \sum_{i\in[n]}\mu_iz_i : c_ic_i^\top\preceq Q_i + \mu_iI_d,\ 0\le\mu_i\le\|c_i\|_2^2,\ \forall i\in[n]\right\}$.   (47)

First, given any feasible solution $(\mu, \{Q_i\}_{i\in[n]\setminus(T_0\cup T_1)})$ to problem (46), let us augment it by setting $Q_i = 0, \mu_i = \|c_i\|_2^2$ for each $i\in T_0$ and $Q_i = c_ic_i^\top, \mu_i = 0$ for each $i\in T_1$. Then $(\mu, \{Q_i\}_{i\in[n]})$ is feasible to problem (47) with the same objective value; thus $\widehat{H}(z) \le H(z)$. On the other hand, given any feasible solution $(\mu, \{Q_i\}_{i\in[n]})$ to problem (47), the restriction $(\mu, \{Q_i\}_{i\in[n]\setminus(T_0\cup T_1)})$ is feasible to problem (46) with a no-larger objective value, since $c_ic_i^\top\preceq Q_i + \mu_iI_d$ for each $i\in T_1$; thus $\widehat{H}(z) \ge H(z)$. This completes the proof.

Part (ii).
For any binary $z\in Z$, let the set $S$ denote its support. We then construct a pair of primal and dual solutions to the maximization problem in (9) and its dual (10) as

$X^* = q_1q_1^\top$;  $W_i^* = X^*$ for $i\in S$, $W_i^* = 0$ for $i\in[n]\setminus S$;  $Q_i^* = c_ic_i^\top, \mu_i^* = 0$ for $i\in S$, $Q_i^* = 0, \mu_i^* = \|c_i\|_2^2$ for $i\in[n]\setminus S$,

where $q_1$ denotes an eigenvector corresponding to the largest eigenvalue of the matrix $\sum_{i\in S} c_ic_i^\top$. According to the results in Lemma 1, the above solutions return the same objective value for the primal and dual problems, namely $\lambda_{\max}(\sum_{i\in S} c_ic_i^\top)$. This proves the optimality of the proposed dual solution. □
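As a quick numerical sanity check of Part (ii) (a numpy sketch on random data; illustrative only), one can verify dual feasibility of the proposed solution and that its objective value equals $\lambda_{\max}(\sum_{i\in S} c_ic_i^\top)$:

import numpy as np

# On the support S: Q_i = c_i c_i^T, mu_i = 0; off S: Q_i = 0, mu_i = ||c_i||_2^2.
rng = np.random.default_rng(0)
d, n, k = 5, 10, 4
C = rng.standard_normal((d, n))
S = set(range(k))                                   # support of a binary z
Q = {i: (np.outer(C[:, i], C[:, i]) if i in S else np.zeros((d, d))) for i in range(n)}
mu = {i: (0.0 if i in S else C[:, i] @ C[:, i]) for i in range(n)}
for i in range(n):                                  # dual feasibility: c_i c_i^T <= Q_i + mu_i I
    M = Q[i] + mu[i] * np.eye(d) - np.outer(C[:, i], C[:, i])
    assert np.linalg.eigvalsh(M)[0] >= -1e-9
dual = np.linalg.eigvalsh(sum(Q.values()))[-1]      # every term mu_i * z_i vanishes
primal = np.linalg.eigvalsh(sum(np.outer(C[:, i], C[:, i]) for i in S))[-1]
assert np.isclose(dual, primal)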
A.3 Proof of Proposition 2

Proposition 2. The SPCA (2) admits the following MISDP formulation:

(SPCA) $w^* := \max_{z\in Z,\, X\in\mathbb{S}^n_+}\left\{\mathrm{tr}(AX) : \mathrm{tr}(X) = 1,\ X_{ii}\le z_i,\ \forall i\in[n]\right\}$,   (14)

and its continuous relaxation value is equal to $\lambda_{\max}(A)$.

Proof. (i) To show the equivalence of problem (14) and SPCA (2), we only need to show that for any feasible $z\in Z$ with support $S = \{i : z_i = 1\}$, we must have

$\max_{X\in\mathbb{S}^n_+}\left\{\mathrm{tr}(AX) : \mathrm{tr}(X) = 1, X_{ii}\le z_i, \forall i\in[n]\right\} = \lambda_{\max}(A_{S,S})$.   (48)

Indeed, since $X$ is a positive semi-definite matrix, $X_{ii} = 0$ for each $i\in[n]\setminus S$ implies $X_{ij} = 0$ for all $(i,j)\notin S\times S$. Hence the left-hand side of (48) is equivalent to $\max_{X\in\mathbb{S}^k_+}\{\mathrm{tr}(A_{S,S}X) : \mathrm{tr}(X) = 1\} = \lambda_{\max}(A_{S,S})$, where the second equality is due to Part (ii) of Lemma 1.

(ii) The continuous relaxation value of problem (14) is $\bar{w} = \max_{z\in Z, X\in\mathbb{S}^n_+}\{\mathrm{tr}(AX) : \mathrm{tr}(X) = 1, X_{ii}\le z_i, \forall i\in[n]\}$ with $z$ relaxed to be continuous. Since $\mathrm{tr}(X) = 1$, the linking constraint $X_{ii}\le z_i$ is redundant for each $i\in[n]$. Hence $\bar{w} = \max_{X\in\mathbb{S}^n_+}\{\mathrm{tr}(AX) : \mathrm{tr}(X) = 1\} = \lambda_{\max}(A)$, where the equality is due to Part (ii) of Lemma 1. □

A.4 Proof of Lemma 2

Lemma 2.
The following two inequalities are valid for SPCA (14):
(i) $\sum_{j\in[n]} X_{ij}^2 \le X_{ii}z_i$ for all $i\in[n]$; and
(ii) $\big(\sum_{j\in[n]} |X_{ij}|\big)^2 \le kX_{ii}z_i$ for all $i\in[n]$.

Proof. From the proof of Proposition 2, there must exist an optimal solution $(z^*, X^*)$ of SPCA (14) such that $X^*$ is rank-one. Thus, without loss of generality, for any feasible solution $(z, X)$ of SPCA (14), we can assume $X = xx^\top$, where $(x, z)$ is also feasible to SPCA (2). We split the proof into two parts.
(i) Since $X = xx^\top$ and $\|x\|_2 = 1$,

$\sum_{j\in[n]} X_{ij}^2 = \sum_{j\in[n]} x_i^2x_j^2 = x_i^2 \le z_iX_{ii}$ for all $i\in[n]$,

where the last inequality follows from the facts that $X_{ii} = x_i^2 \le z_i$ and $z_i$ is binary for each $i\in[n]$.
(ii) It is known (see, e.g., [15]) that $\|x\|_1 \le \sqrt{k}$. Thus

$\sum_{j\in[n]} |X_{ij}| = \sum_{j\in[n]} |x_i||x_j| \le \sqrt{k}|x_i| \le \sqrt{k}\sqrt{X_{ii}z_i}$,

where the second inequality is again due to $X_{ii} = x_i^2 \le z_i$ and the binarity of $z_i$; squaring both sides yields the claim. □

A.5 Proof of Theorem 6

Theorem 6.
Given a threshold $\epsilon > 0$, the following MILP is $O(\epsilon)$-approximate to SPCA (2), i.e., $-\epsilon \le \widehat{w}(\epsilon) - w^* \le \epsilon\sqrt{d}$:

$\widehat{w}(\epsilon) := \max_{w, z\in Z, y, \alpha, x, \delta, \mu, \sigma}\ w$
s.t. $x = \delta_i^1 + \delta_i^2,\ \|\delta_i^1\|_\infty\le z_i,\ \|\delta_i^2\|_\infty\le 1-z_i,\ \forall i\in[n]$,
$x = \sum_{j\in[d]}\sigma^j,\ \|\sigma^j\|_\infty\le y_j,\ \sigma^j_j = y_j,\ \forall j\in[d],\ \sum_{j\in[d]} y_j = 1$,
$x = \mu^{\ell 1} + \mu^{\ell 2},\ \|\mu^{\ell 1}\|_\infty\le\alpha_\ell,\ \|\mu^{\ell 2}\|_\infty\le 1-\alpha_\ell,\ \forall\ell\in[m]$,
$w = w^U - (w^U - w^L)\sum_{\ell\in[m]} 2^{-\ell}\alpha_\ell$,
$\big\|\sum_{i\in[n]} c_ic_i^\top\delta_i^1 - w^Ux + (w^U - w^L)\sum_{\ell\in[m]} 2^{-\ell}\mu^{\ell 1}\big\|_\infty \le \epsilon$,
$\alpha\in\{0,1\}^m,\ y\in\{0,1\}^d$,   (22)

where $w^L, w^U$ denote lower and upper bounds of SPCA, respectively, $m := \lceil\log_2((w^U - w^L)\epsilon^{-1})\rceil$, and the infinity-norm inequality constraints can be easily linearized.
Proof. Throughout the proof, we use indices $i\in[n]$, $j\in[d]$, and $\ell\in[m]$ for the components of the corresponding vectors. To construct the MILP from SPCA (21) and establish the approximation accuracy, we proceed in four steps.

Step
1. Linearize the bilinear terms $\{z_ix\}_{i\in[n]}$ in (21). This can be done by introducing two copies $\delta_i^1, \delta_i^2$ of the vector $x$ for each $i\in[n]$ such that

$x = \delta_i^1 + \delta_i^2,\ \|\delta_i^1\|_\infty\le z_i,\ \|\delta_i^2\|_\infty\le 1-z_i,\ \forall i\in[n]$, so that $\sum_{i\in[n]} z_ic_ic_i^\top x = \sum_{i\in[n]} c_ic_i^\top\delta_i^1$.

Step
2. Linearize the nonconvex constraint $\|x\|_\infty = 1$. We first observe that, due to symmetry, $\|x\|_\infty = 1$ can be equivalently written as a disjunction of $d$ sets, $\cup_{j\in[d]}\{x\in\mathbb{R}^d : x_j = 1, \|x\|_\infty\le 1\}$. Next, for each $j\in[d]$ we introduce a binary variable $y_j$ equal to 1 if the $j$-th set is active and 0 otherwise, and create a copy $\sigma^j\in\mathbb{R}^d$ of the variable $x$ such that

$x = \sum_{j\in[d]}\sigma^j,\ \|\sigma^j\|_\infty\le y_j,\ \sigma^j_j = y_j,\ \forall j\in[d],\ \sum_{j\in[d]} y_j = 1,\ y\in\{0,1\}^d$.

Step
3. Approximate and linearize the bilinear term $wx$. We first approximate the variable $w$ using $m$ binary variables $\alpha\in\{0,1\}^m$ with $m := \lceil\log_2((w^U - w^L)/\epsilon)\rceil$. Thus, we have

$w \approx w^U - (w^U - w^L)\sum_{\ell\in[m]} 2^{-\ell}\alpha_\ell$

with approximation error at most $(w^U - w^L)/2^m \le \epsilon$. The bilinear term $wx$ is then approximated by

$wx \approx w^Ux - (w^U - w^L)\sum_{\ell\in[m]} 2^{-\ell}\alpha_\ell x$.   (49)

With binary variables $\alpha$, the resulting bilinear terms $\{\alpha_\ell x\}_{\ell\in[m]}$ can be further linearized by the same argument as in Step 1, i.e.,

$x = \mu^{\ell 1} + \mu^{\ell 2},\ \|\mu^{\ell 1}\|_\infty\le\alpha_\ell,\ \|\mu^{\ell 2}\|_\infty\le 1-\alpha_\ell,\ \forall\ell\in[m]$, so that $w^Ux - (w^U - w^L)\sum_{\ell\in[m]} 2^{-\ell}\alpha_\ell x = w^Ux - (w^U - w^L)\sum_{\ell\in[m]} 2^{-\ell}\mu^{\ell 1}$.

Step
4. Finally, following the approximation and linearization results in Step 3, the equality constraint $\sum_{i\in[n]} c_ic_i^\top\delta_i^1 = wx$ in (21) might not hold exactly. Thus, we replace the equality by the inequality

$\big\|\sum_{i\in[n]} c_ic_i^\top\delta_i^1 - w^Ux + (w^U - w^L)\sum_{\ell\in[m]} 2^{-\ell}\mu^{\ell 1}\big\|_\infty = \big\|\sum_{i\in[n]} c_ic_i^\top z_ix - w^Ux + (w^U - w^L)\sum_{\ell\in[m]} 2^{-\ell}\alpha_\ell x\big\|_\infty = \big\|wx - w^Ux + (w^U - w^L)\sum_{\ell\in[m]} 2^{-\ell}\alpha_\ell x\big\|_\infty \le (w^U - w^L)/2^m \le \epsilon$,

which holds for any feasible solution of formulation (21).

First, we have $\widehat{w}(\epsilon) \ge w^* - \epsilon$, since $w := w^* - \epsilon$ is feasible to the MILP (22). Moreover, given an optimal solution $(\widehat{x}, \widehat{z}, \widehat{w}(\epsilon))$ to the MILP (22), we must have

$\big\|\sum_{i\in[n]}\widehat{z}_ic_ic_i^\top\widehat{x} - \widehat{w}(\epsilon)\widehat{x}\big\|_\infty \le \epsilon$
$(\Rightarrow)\ \min_{x : \|x\|_\infty = 1}\big\|\sum_{i\in[n]}\widehat{z}_ic_ic_i^\top x - \widehat{w}(\epsilon)x\big\|_\infty \le \epsilon$
$(\Rightarrow)\ d^{-1/2}\min_{x : \|x\|_\infty = 1}\big\|\sum_{i\in[n]}\widehat{z}_ic_ic_i^\top x - \widehat{w}(\epsilon)x\big\|_2 \le \epsilon$
$(\Rightarrow)\ d^{-1/2}\min_{x : \|x\|_2 \ge 1}\big\|\sum_{i\in[n]}\widehat{z}_ic_ic_i^\top x - \widehat{w}(\epsilon)x\big\|_2 \le \epsilon$
$(\Leftrightarrow)\ d^{-1/2}\min_{x : \|x\|_2 = 1}\big\|\sum_{i\in[n]}\widehat{z}_ic_ic_i^\top x - \widehat{w}(\epsilon)x\big\|_2 \le \epsilon$,

where the first implication is due to $\|\widehat{x}\|_\infty = 1$, the second is due to $\|v\|_\infty \ge d^{-1/2}\|v\|_2$ for $v\in\mathbb{R}^d$, the third is because $\|x\|_\infty = 1$ implies $\|x\|_2 \ge 1$, and the equivalence is because of the monotonicity and positive homogeneity of the objective function. According to the last inequality, there exists an eigenvalue $\tilde{w}$ of the matrix $\sum_{i\in[n]}\widehat{z}_ic_ic_i^\top$ such that $|\widehat{w}(\epsilon) - \tilde{w}| \le \epsilon\sqrt{d}$, which further implies that $\widehat{w}(\epsilon) - w^* \le \epsilon\sqrt{d}$ since $\tilde{w} \le w^*$. □

A.6 Proof of Theorem 7

Theorem 7.
Given a threshold $\epsilon > 0$, let $\bar{w}(\epsilon)$ denote the optimal value of the relaxed MILP formulation (22) obtained by allowing the binary variables $z$ to be continuous. Then we have

$\bar{w}(\epsilon) \le \min\left\{k\frac{\sqrt{d}+1}{2},\ \frac{n}{k}\sqrt{d} + (n-k)\frac{\sqrt{d}+1}{2}\right\} w^* + \epsilon\sqrt{d}$.

Proof.
From the proof of Theorem 6, we know that $\bar{w}(\epsilon) \le \bar{w}(0) + \epsilon\sqrt{d}$, so it suffices to bound $\bar{w}(0)$. We observe that when $\epsilon = 0$, the formulation obtained by relaxing the binary variables $z$ to be continuous becomes

$\bar{w}(0) = \max_{w, z, x, \{\delta_i^1\}, \{\delta_i^2\}}\left\{w : \sum_{i\in[n]} c_ic_i^\top\delta_i^1 = wx,\ \|x\|_\infty = 1,\ x = \delta_i^1 + \delta_i^2,\ \|\delta_i^1\|_\infty\le z_i,\ \|\delta_i^2\|_\infty\le 1-z_i,\ \forall i\in[n]\right\}$,   (50)

where $z$ ranges over the continuous relaxation of $Z$. Next, we split the proof into three steps.

Step
1. For any feasible solution to problem (50), we have

$w = \frac{\|\sum_{i\in[n]} c_ic_i^\top\delta_i^1\|_\infty}{\|x\|_\infty} = \big\|\sum_{i\in[n]} c_ic_i^\top\delta_i^1\big\|_\infty \le \sum_{i\in[n]}\|c_ic_i^\top\delta_i^1\|_\infty = \sum_{i\in[n]}\|c_i\|_\infty|c_i^\top\delta_i^1| \le \sum_{i\in[n]}\|c_i\|_\infty\|c_i\|_1\|\delta_i^1\|_\infty \le \sum_{i\in[n]}\|c_i\|_\infty\|c_i\|_1z_i \le k\max_{i\in[n]}\|c_i\|_\infty\|c_i\|_1$,

where the first inequality is the triangle inequality, the second is Hölder's inequality, the third is because $\|\delta_i^1\|_\infty\le z_i$, and the last is due to $\|c_i\|_\infty\|c_i\|_1 \le \max_{j\in[n]}\|c_j\|_\infty\|c_j\|_1$ for each $i\in[n]$ and $\sum_{i\in[n]} z_i = k$.

Step
2. Now it remains to show that for each $i\in[n]$, $\|c_i\|_\infty\|c_i\|_1 \le \frac{\sqrt{d}+1}{2}w^*$. Let $\varsigma$ be a permutation of the index set $[d]$ such that $|c_{i,\varsigma(1)}|, \dots, |c_{i,\varsigma(d)}|$ are sorted in descending order. Then we have

$c_{i,\varsigma(1)}^2 + \frac{1}{d-1}\big(\sum_{j\in[2,d]} |c_{i,\varsigma(j)}|\big)^2 \le c_{i,\varsigma(1)}^2 + \cdots + c_{i,\varsigma(d)}^2 = \|c_i\|_2^2 \le w^*$,

where the first inequality follows from the arithmetic-quadratic mean inequality and the second from $\|c_i\|_2^2 = \lambda_{\max}(c_ic_i^\top) \le w^*$. For ease of exposition, let us introduce $v_1 = |c_{i,\varsigma(1)}|$ and $v_2 = \sum_{j\in[2,d]} |c_{i,\varsigma(j)}|$, and consider the optimization problem

$\nu = \max_{v}\left\{v_1(v_1 + v_2) : v_1^2 + \frac{1}{d-1}v_2^2 \le w^*\right\}$,   (51)

whose optimal value clearly provides an upper bound on $\|c_i\|_\infty\|c_i\|_1$. To solve (51), we first write $v_1 = r\sin\theta$ and $v_2 = r\sqrt{d-1}\cos\theta$ with $\theta\in[0,\pi/2]$ and $r \le \sqrt{w^*}$. In this way, the objective of (51) equals

$v_1(v_1 + v_2) = v_1^2 + v_1v_2 = r^2\sin^2\theta + r^2\sqrt{d-1}\sin\theta\cos\theta = r^2\frac{1 - \cos 2\theta}{2} + r^2\sqrt{d-1}\frac{\sin 2\theta}{2} \le \frac{r^2}{2} + \frac{\sqrt{d}}{2}r^2 \le \frac{\sqrt{d}+1}{2}w^*$,

where the first inequality is due to the Cauchy–Schwarz inequality and the second because $r^2 \le w^*$. Thus, we must have $\|c_i\|_\infty\|c_i\|_1 \le \frac{\sqrt{d}+1}{2}w^*$, which together with Step 1 proves the first bound $k\frac{\sqrt{d}+1}{2}$.
Step 3. We now prove the second bound. Substituting $\delta_i^1 = x - \delta_i^2$ for all $i\in[n]$, we rewrite the continuous relaxation value as

$w = \frac{\|\sum_{i\in[n]} c_ic_i^\top(x - \delta_i^2)\|_\infty}{\|x\|_\infty} \le \frac{\|\sum_{i\in[n]} c_ic_i^\top x\|_\infty}{\|x\|_\infty} + \frac{\|\sum_{i\in[n]} c_ic_i^\top\delta_i^2\|_\infty}{\|x\|_\infty} \le \frac{\|\sum_{i\in[n]} c_ic_i^\top x\|_\infty}{\|x\|_\infty} + (n-k)\frac{\sqrt{d}+1}{2}w^* \le \max_{i\in[d]}\sum_{j\in[d]}|\bar{C}_{ij}| + (n-k)\frac{\sqrt{d}+1}{2}w^*$,

where $\bar{C} := CC^\top = \sum_{i\in[n]} c_ic_i^\top$, the first inequality is the triangle inequality, the second follows from the derivations in Steps 1 and 2 (here $\|\delta_i^2\|_\infty\le 1-z_i$ and $\sum_{i\in[n]}(1-z_i) = n-k$), and the third holds because $|x_i|\le 1$ for all $i\in[d]$. Next, the first term on the right-hand side can be upper bounded by

$\max_{i\in[d]}\sum_{j\in[d]}|\bar{C}_{ij}| = \|\bar{C}\|_\infty \le \sqrt{d}\|\bar{C}\|_2 = \sqrt{d}\lambda_{\max}(\bar{C}) \le \frac{n}{k}\sqrt{d}\,w^*$,

where the equalities follow from the definitions of the matrix $\ell_\infty$- and $\ell_2$-norms, and the last inequality is due to $\lambda_{\max}(\bar{C}) = \lambda_{\max}(A) \le (n/k)w^*$. □

A.7 Proof of Lemma 3

Lemma 3.
Given a matrix $A\in\mathbb{R}^{m\times n}$, consider its augmented counterpart $\bar{A}$ defined in (28), two integers $k_1\in[m]$ and $k_2\in[n]$, and three subsets $S, S_1, S_2\subseteq[m+n]$ such that $|S| = k_1+k_2$, $S_1 = S\cap[m]$ with $|S_1| = k_1$, and $S_2 = S\cap[m+1, m+n]$ with $|S_2| = k_2$. Then the identities (i) and (ii) stated in Section 7 must hold.

Proof. The proof includes two parts.
(i) By the definition of the augmented matrix $\bar{A}$ in (28), its submatrix satisfies

$\bar{A}_{S,S} = \begin{pmatrix} 0 & A_{S_1,S_2} \\ A_{S_1,S_2}^\top & 0 \end{pmatrix}$.

The statement in Part (i) then directly follows from the result in Ben-Tal and Nemirovski [3], which shows that the eigenvalues of an augmented symmetric matrix are exactly the singular values of the original matrix together with their negations.
(ii) The first equality $\lambda_{\max}(\bar{A}_{S,S}) = \sigma_{\max}(A_{S_1,S_2})$ is obtained from Part (i). For the largest singular value of $A_{S_1,S_2}$, we have

$\sigma_{\max}(A_{S_1,S_2}) = \max_{u\in\mathbb{R}^{k_1},\, v\in\mathbb{R}^{k_2}}\left\{u^\top A_{S_1,S_2}v : \|u\|_2 = 1, \|v\|_2 = 1\right\} = \frac{1}{2}\max_{x\in\mathbb{R}^{k_1+k_2}}\left\{x^\top\bar{A}_{S,S}x : \|x_{1:k_1}\|_2 = 1, \|x_{k_1+1:k_1+k_2}\|_2 = 1\right\}$,   (52)

which proves the second equality of Part (ii). As for the last equality of Part (ii), let $\widehat{w}^*_{\mathrm{SVD}}$ denote the optimal value of the SDP problem on the right-hand side. We must have $\widehat{w}^*_{\mathrm{SVD}} \ge \sigma_{\max}(A_{S_1,S_2})$, as the SDP problem is exactly an SDP relaxation of the maximization problem over $x$ in (52) obtained by relaxing the rank-one constraint. On the other hand, summing up the two constraints of the SDP problem, we obtain an upper bound of $\widehat{w}^*_{\mathrm{SVD}}$, i.e., $\widehat{w}^*_{\mathrm{SVD}} \le$
$\frac{1}{2}\max_{X\in\mathbb{S}^{k_1+k_2}_+}\left\{\mathrm{tr}(\bar{A}_{S,S}X) : \mathrm{tr}(X) = 2\right\} = \lambda_{\max}(\bar{A}_{S,S}) = \sigma_{\max}(A_{S_1,S_2})$,

where the first equality is due to Part (ii) of Lemma 1. □

A.8 Proof of Theorem 11

Theorem 11.
The continuous relaxation value $w_{\mathrm{SVD1}}$ of formulation (34) satisfies $w^*_{\mathrm{SVD}} \le w_{\mathrm{SVD1}} \le \sqrt{mnk_1^{-1}k_2^{-1}}\, w^*_{\mathrm{SVD}}$.

Proof.
For the matrix $\widehat{A}$ defined in (32), Part (i) of Lemma 3 implies that its largest eigenvalue equals $2\sigma_{\max}(A)$. Let $(\widehat{z}, \widehat{X}, \widehat{W}_1, \dots, \widehat{W}_{m+n})$ denote an optimal solution to the continuous SDP relaxation of problem (34). We now have

$2\sigma_{\max}(A) = \lambda_{\max}(\widehat{A}) = \max_{X\succeq 0,\, \mathrm{tr}(X) = 1}\left\{\sum_{i\in[m+n]} c_i^\top Xc_i\right\} \ge \sum_{i\in[m+n]} c_i^\top\widehat{X}c_i \ge \sum_{i\in[m+n]} c_i^\top\widehat{W}_ic_i$,

where the last inequality is because $\widehat{X}\succeq\widehat{W}_i$ for each $i\in[m+n]$. Note that the right-hand side above is equal to $w_{\mathrm{SVD1}} + \sigma_{\max}(A)$, and the inequalities above lead to

$w_{\mathrm{SVD1}} = \sum_{i\in[m+n]} c_i^\top\widehat{W}_ic_i - \sigma_{\max}(A) \le 2\sigma_{\max}(A) - \sigma_{\max}(A) = \sigma_{\max}(A)$.

Now it remains to show that
Claim 1. $\sigma_{\max}(A) \le \sqrt{mnk_1^{-1}k_2^{-1}}\, w^*_{\mathrm{SVD}}$.

Proof. Let $u, v$ denote the top left and right singular vectors of $A$, i.e., $u^\top Av = \sigma_{\max}(A)$, $Av = \sigma_{\max}(A)u$, and $A^\top u = \sigma_{\max}(A)v$. We tailor $u, v$ to meet the feasibility of R1-SSVD (27) as below:

$\bar{u}_j = u_j$ if $|u_j|$ is one of the $k_1$ largest absolute entries of $u$, and $\bar{u}_j = 0$ otherwise, for $j\in[m]$;
$\bar{v}_j = (A^\top\bar{u})_j$ if $|(A^\top\bar{u})_j|$ is one of the $k_2$ largest entries of $|A^\top\bar{u}|$, and $\bar{v}_j = 0$ otherwise, for $j\in[n]$.

Let us normalize $\widehat{u} = \bar{u}/\|\bar{u}\|_2$ and $\widehat{v} = \bar{v}/\|\bar{v}\|_2$. Clearly, $(\widehat{u}, \widehat{v})$ is feasible to R1-SSVD (27). Then we have

$\sqrt{\frac{k_1}{m}}\,\sigma_{\max}(A) \le \sigma_{\max}(A)\,\widehat{u}^\top u = \widehat{u}^\top Av \le \|\widehat{u}^\top A\|_2 \le \sqrt{\frac{n}{k_2}}\,\widehat{u}^\top A\widehat{v} \le \sqrt{\frac{n}{k_2}}\, w^*_{\mathrm{SVD}}$,

where the first inequality is due to the definition of $\widehat{u}$, the equality follows from $Av = \sigma_{\max}(A)u$, the second inequality is the Cauchy–Schwarz inequality, the third is based on the choice of $\widehat{v}$, and the last is due to the feasibility of $(\widehat{u}, \widehat{v})$. Rearranging yields the claim. This completes the proof. ◊ □

A.9 Proof of Lemma 4

Lemma 4.
For R1-SSVD (35), the second-order conic inequalities (i) and (ii) stated in Section 7 are valid.

Proof. According to Proposition 7, there must exist an optimal solution $(z^*, X^*)$ to MISDP (35) such that $X^*$ is rank-one. Thus, without loss of generality, for any feasible solution $(z, X)$ of (35), we can assume that

$X = \begin{pmatrix} u \\ v \end{pmatrix}\begin{pmatrix} u \\ v \end{pmatrix}^\top$,

where the vectors $(u, v)$ satisfy $\|u\|_2 = \|v\|_2 = 1$, $\|u\|_1 \le \sqrt{k_1}$, and $\|v\|_1 \le \sqrt{k_2}$. The rest of the proof is almost identical to that of Lemma 2 and is thus omitted for brevity. □
A.10 Proof of Theorem 15

Theorem 15. For R1-SSVD (27), the truncation algorithm yields an approximation ratio of $\max\{\sqrt{k_1^{-1}}, \sqrt{k_2^{-1}}, \sqrt{k_1k_2m^{-1}n^{-1}}\}$. In particular, the approximation ratio is $O(n^{-1/3})$ when $k_1\approx k_2$ and $m\approx n$.

Proof. We derive the three approximation ratios of the truncation algorithm below.
(i) According to the truncation in the standard basis, the obtained vector $\widehat{u}_i$ is feasible to the R1-SSVD problem for each $i\in[n]$ and is also optimal to the following problem:

$\widehat{u}_i \in \arg\max_{\|u_i\|_2 = 1,\, \|u_i\|_0 = k_1}\left\{u_i^\top Ae_i\right\},\ \forall i\in[n]$.

Suppose the optimal solution of R1-SSVD (27) is $(u^*, v^*)$, and let $S_1^*, S_2^*$ denote their supports, respectively. Writing $v^* = \sum_{i\in S_2^*} v_i^*e_i$, we have

$w^*_{\mathrm{SVD}} = (u^*)^\top Av^* = \sum_{i\in S_2^*} v_i^*(u^*)^\top Ae_i \le \sqrt{\sum_{i\in S_2^*}(v_i^*)^2}\sqrt{\sum_{i\in S_2^*}[(u^*)^\top Ae_i]^2} \le \sqrt{k_2}\max_{i\in[n]}\widehat{u}_i^\top Ae_i$,

where the first inequality is the Cauchy–Schwarz inequality and the second is because of the maximality of $\max_{i\in[n]}\widehat{u}_i^\top Ae_i$.
0. This prove the approximation ratio (cid:112) k − .Similarly, we can derive w ∗ SVD = ( u ∗ ) (cid:62) Av ∗ ≤ (cid:112) k max j ∈ [ m ] e (cid:62) j A (cid:98) v j , which prove the approximation ratio (cid:112) k − .(ii) Following the proof of Claim 1, for the truncation in the eigen-space basis, we have (cid:114) k n w ∗ SVD ≤ (cid:114) k n σ max ( A ) ≤ σ max ( A ) (cid:98) u (cid:62) u = (cid:98) u (cid:62) Av ≤ (cid:107) (cid:98) u (cid:62) A (cid:107) ≤ (cid:114) mk (cid:98) u (cid:62) A (cid:98) v , which proves the approximation ratio of √ k k m − n − . (cid:3) A.11 Proof of Theorem 16Theorem 16
For the greedy Algorithm 3 and the local search Algorithm 4, we have: (i) both algorithms achieve a $(\sqrt{k_1k_2})^{-1}$-approximation ratio for R1-SSVD (37); and (ii) the ratio is tight.

Proof. The proof is split into two parts.
(i) In R1-SSVD (37), according to Part (i) of the proof of Theorem 15, we have

$w^*_{\mathrm{SVD}} \le \sqrt{k_2}\max_{j\in[n]}\widehat{u}_j^\top Ae_j \le \sqrt{k_1k_2}\max_{i\in[m],\, j\in[n]}|A_{ij}|$,

where the vectors $\{\widehat{u}_j\}_{j\in[n]}\subseteq\mathbb{R}^m$ are obtained by the normalized $k_1$-truncation in the standard basis of $A$. Then, following analyses similar to those of Theorem 8 and Theorem 9, the largest singular value returned by the greedy Algorithm 3 or the local search Algorithm 4 must be lower bounded by $\max_{i\in[m],\, j\in[n]}|A_{ij}|$, which proves the ratio.

(ii) We next give an example in which the ratio $(\sqrt{k_1k_2})^{-1}$ is attained. Without loss of generality, suppose $k_1 \le k_2$. Consider $m = 2k_1$, $n = 2k_2$, and the matrix $A\in\mathbb{R}^{m\times n}$ given by

$A := \begin{pmatrix} \tilde{I} & 0 \\ 0 & J \end{pmatrix}$,

where $\tilde{I} := [\, I_{k_1}\ \ 0_{k_1\times(k_2-k_1)}\,]\in\mathbb{R}^{k_1\times k_2}$ and $J$ denotes the $k_1\times k_2$ all-ones matrix. Above, the submatrix $A_{[k_1],[k_2]} = \tilde{I}$ satisfies the greedy and local optimality conditions with objective value equal to 1, while the best $k_1\times k_2$ submatrix is $A_{[k_1+1,2k_1],[k_2+1,2k_2]} = J$ with optimal value $\sqrt{k_1k_2}$. □