Low-rank Matrix Optimization Using Polynomial-filtered Subspace Extraction
YONGFENG LI∗, HAOYANG LIU†, ZAIWEN WEN‡, AND YAXIANG YUAN§

∗ Beijing International Center for Mathematical Research, Peking University, China ([email protected]).
† Beijing International Center for Mathematical Research, Peking University ([email protected]).
‡ Beijing International Center for Mathematical Research, Peking University ([email protected]). Research supported in part by the NSFC grants 11421101 and 11831002, and by the National Basic Research Project under the grant 2015CB856002.
§ State Key Laboratory of Scientific and Engineering Computing, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, China ([email protected]). Research supported in part by NSFC grants 11331012 and 11461161005.

Abstract.
In this paper, we study first-order methods on a large variety of low-rank matrix optimization problems, whose solutions only live in a low dimensional eigenspace. Traditional first-order methods depend on the eigenvalue decomposition at each iteration, which takes most of the computation time. In order to reduce the cost, we propose an inexact algorithm framework based on a polynomial subspace extraction. The idea is to use an additional polynomial-filtered iteration to extract an approximated eigenspace, and project the iteration matrix on this subspace, followed by an optimization update. The accuracy of the extracted subspace can be controlled by the degree of the polynomial filters. This kind of subspace extraction also enjoys the warm-start property: the subspace of the current iteration is refined from the previous one. Then this framework is instantiated into two algorithms: the polynomial-filtered proximal gradient method and the polynomial-filtered alternating direction method of multipliers. We give a theoretical guarantee to the two algorithms that the polynomial degree is not necessarily very large. They share the same convergence speed as the corresponding original methods if the polynomial degree grows with an order Ω(log k) at the k-th iteration. If the warm-start property is considered, the degree can be reduced to a constant, independent of the iteration k. Preliminary numerical experiments on several low-rank matrix optimization problems show that the polynomial-filtered algorithms usually provide multi-fold speedups.

Key words.
Low-rank Matrix Optimization, Eigenvalue Decomposition, Inexact Optimization Method, Poly-nomial Filter, Subspace Extraction.
AMS subject classifications.
1. Introduction.
Eigenvalue decompositions (EVD) are commonly used in a large variety of matrix optimization problems with spectral or low-rank structures. For example, in semi-definite programming (SDP), many optimization methods need to compute all positive eigenvalues and corresponding eigenvectors of a matrix at each iteration in order to preserve the semi-definite structure. In matrix recovery type problems, people are interested in the low-rank approximations of the unknown matrix. In [25], the authors propose a convex relaxation model to solve the robust PCA problem and solve it by the accelerated proximal gradient (APG) approach. The primal problem is solved by the augmented Lagrangian method (ALM) in [13]. The application can also be regarded as an optimization problem on the rank-r matrix space, thus manifold optimization methods can be applied [22]. In all methods mentioned above, EVD or singular value decompositions (SVD) are required; the latter can be transformed to the former essentially. In maximal eigenvalue problems, the objective function is the largest eigenvalue of a symmetric matrix variable. It is used in many real applications, for example, the dual formulation of the max-cut problem [8], phase retrieval, blind deconvolution [7], the distance metric learning problem [26], and Crawford number computation [11]. These optimization methods require the largest eigenvalue and the corresponding eigenvector at each iteration, and the variable dimension is usually large.

In general, at least one full or truncated EVD per iteration is required in most first-order methods to solve these applications. It has been long realized that EVD is very time-consuming for large problems, especially when the dimension of the matrix and the number of the required eigenvalues are both huge. First-order methods suffer from this issue greatly since they usually take thousands of iterations to converge. Therefore, people turn to inexact methods to save the computation time. One popular approach relies on approximated eigen-solvers to solve the eigenvalue sub-problem, such as the Lanczos method and randomized EVD with early stopping rules, see [27, 2, 20] and the references therein. The performance of these methods is determined by the accuracy of the eigen-solver, which is sometimes hard to control in practice. Another type of method is the so-called subspace method. By introducing a low-dimensional subspace, one can greatly reduce the dimension of the original problem, and then perform refinement on this subspace as the iterations proceed. For instance, it is widely used to solve a univariate maximal eigenvalue optimization problem [11, 19, 10].
In [28], the authors propose an inexact method which simplifies the eigenvalue computation using Chebyshev filters in self-consistent-field (SCF) calculations. Though inexact methods are widely used in real applications, the convergence analysis is still very limited. Moreover, designing a practical strategy to control the inexactness of the algorithm is a big challenge. Our contributions are briefly summarized as follows.

1. For low-rank matrix optimization problems involving eigenvalue computations, we propose a general inexact first-order method framework with polynomial filters which can be applied to most existing solvers without much difficulty. It can be observed that for low-rank problems, the iterates always lie in a low dimensional eigenspace. Hence our key motivation is to use one polynomial-filtered subspace extraction step to estimate the subspace, followed by a standard optimization update. The algorithm also benefits from the warm-start property: the updated iteration point can be fed to the polynomial filter again to generate the next estimation. Then we apply this framework to the proximal gradient method (PG) and the alternating direction method of multipliers (ADMM) to obtain a polynomial-filtered PG (PFPG) method and a polynomial-filtered ADMM (PFAM) method (see Section 2), respectively.

2. We analyze the convergence properties of PFPG and PFAM. It can be proved that the error between one exact and one inexact iteration is bounded by the principal angle between the true and extracted subspaces, which is further controlled by the polynomial degree. The convergence relies on an assumption that the initial space should not be orthogonal to the target space, which is essential but usually holds in many applications. Consequently, the polynomial degree barely increases during the iterations. It can even remain a constant under the warm-start setting. This result provides us the opportunity to use low-degree polynomials throughout the iterations in practice.

3. We propose a practical strategy to control the accuracy of the polynomial filters so that the convergence is guaranteed. A portable implementation of the polynomial filter framework is given based on this strategy. It can be plugged into any first-order method with only a few lines of code added, which means that the computational cost reduction is almost a free lunch.

We mention that our work is different from randomized eigen-solver based methods [27, 2, 20]. The inexactness of our method mainly depends on the quality of the subspace, which can be highly controllable using polynomial filters and the warm-start strategy. On the other hand, the convergence of randomized eigen-solver based methods relies on the tail bound or the per vector error bound of the solution returned by the eigen-solver, which is usually stronger than subspace assumptions. Our work also differs from the subspace method proposed in [11, 10]. First, those authors mainly focus on maximal eigenvalue problems, which have special structures, while we consider general matrix optimization problems with low-rank variables. Second, in the subspace method, one has to compute the exact solution of the sub-problem, which is induced by the projection of the original one onto the subspace. In our framework, after we obtain the subspace, the next step is to extract eigenvalues and eigenvectors from the projected matrix to generate inexact variable updates; there are no sub-problems in general. The matlab codes for several low-rank matrix optimization examples are available at https://github.com/RyanBernX/PFOpt.
The paper is organized as follows. In Section 2, we introduce the polynomial filter algorithm and propose two polynomial-filtered optimization methods, namely, PFPG and PFAM. Then the convergence analysis is established in Section 3. Some important implementation details that make the proposed algorithms practical are summarized in Section 4. The effectiveness of PFPG and PFAM is demonstrated in Section 5 with a number of practical applications. Finally, we conclude the paper in Section 6.
Let S^n be the collection of all n-by-n symmetric matrices. For any X ∈ S^n, we use λ(X) ∈ R^n to denote all eigenvalues of X, which are permuted in descending order, i.e., λ(X) = (λ_1, . . . , λ_n)^T where λ_1 ≥ λ_2 ≥ · · · ≥ λ_n. For any matrix X ∈ R^{n×n}, diag(X) denotes a vector in R^n consisting of all diagonal entries of X. For any vector x ∈ R^n, Diag(x) denotes a diagonal matrix in R^{n×n} whose i-th diagonal entry is x_i. We use ⟨A, B⟩ to define the inner product in the matrix space R^{n×n}, i.e., ⟨A, B⟩ := Tr(A^T B) where Tr(X) := Σ_{i=1}^n X_{ii} for X ∈ R^{n×n}. The corresponding Frobenius norm is defined as ‖A‖_F = (Tr(A^T A))^{1/2}. The Hadamard product of two matrices or vectors of the same dimension is denoted by A ⊙ B with (A ⊙ B)_{ij} = A_{ij} × B_{ij}. For any matrix X ∈ R^{n×p}, Range(X) denotes the subspace spanned by the columns of X.
2. Polynomial Filtered Matrix Optimization.

2.1. Chebyshev-filtered Subspace Iteration.
The idea of polynomial filtering originates from a well-known fact that polynomials are able to manipulate the eigenvalues of any symmetric matrix A while keeping its eigenvectors unchanged. Suppose A has the eigenvalue decomposition A = U Diag(λ_1, . . . , λ_n) U^T; then the matrix ρ(A) has the eigenvalue decomposition ρ(A) = U Diag(ρ(λ_1), . . . , ρ(λ_n)) U^T.

Consider the traditional subspace iteration U ← AU. The convergence of the desired eigen-subspace is determined by the gap of the eigenvalues, which can be very slow if the gap is nearly zero. This is where the polynomial filters come into practice: they manipulate the eigenvalue gap aiming at a better convergence. For any polynomial ρ(t) and initial matrix U, the polynomial-filtered subspace iteration is given by U ← ρ(A)U.

In general, there are many choices of ρ(t). One popular choice is to use Chebyshev polynomials of the first kind, whose explicit expression can be written as

(2.1)    T_d(t) = cos(d arccos t)  for |t| ≤ 1,
         T_d(t) = ( (t − √(t² − 1))^d + (t + √(t² − 1))^d ) / 2  for |t| > 1,

where d is the degree of the polynomial. One important property of Chebyshev polynomials is that they grow fast outside the interval [−1, 1], while suppressing all eigenvalues inside this interval effectively. Figure 1 shows some examples of the Chebyshev polynomials T_d(t) with d = 1, . . . , 6, together with powers of a Chebyshev polynomial T_d(t)^q with q = 1, . . . , 5. That is exactly why Chebyshev polynomials are taken into our consideration.
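To make the filtering step concrete, the following MATLAB sketch (our own illustration, not taken from the authors' released code) applies T_d to a block of vectors through the three-term recurrence T_{j+1}(t) = 2t T_j(t) − T_{j−1}(t), so that ρ(A)U is formed with d matrix–block products and ρ(A) is never built explicitly.

```matlab
% Apply the degree-d Chebyshev polynomial to a block of vectors:
% Y = T_d(A)*U, computed via the recurrence T_{j+1} = 2*A*T_j - T_{j-1}.
% Only matrix-block products with A are required.
function Y = cheb_apply(A, U, d)
    T0 = U;                       % T_0(A)*U
    if d == 0, Y = T0; return; end
    T1 = A * U;                   % T_1(A)*U
    for j = 2:d
        T2 = 2 * (A * T1) - T0;   % T_j(A)*U
        T0 = T1;
        T1 = T2;
    end
    Y = T1;
end
```

For a filter aimed at a general interval, A is first shifted and scaled as in (2.2) below before the recurrence is applied.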
Fig. 1. Chebyshev polynomials and their variants: (a) Chebyshev polynomials of the first kind T_d(t), d = 1, . . . , 6; (b) powers of a Chebyshev polynomial T_d(t)^q, q = 1, . . . , 5.
In order to suppress all eigenvalues in a general interval [a, b], a linear mapping from [a, b] to [−1, 1] is constructed. To be precise, the polynomial can be chosen as

(2.2)    ρ(t) = T_d( (t − (b + a)/2) / ((b − a)/2) ).

After the polynomial filters are applied, an orthogonalization step usually follows in order to prevent the subspace from losing rank. This step is often performed by a single QR decomposition. Consequently, given an arbitrary matrix U ∈ R^{n×p} and a polynomial ρ(t), the polynomial-filtered subspace iteration can be simply written as

(2.3)    U^+ = orth(ρ(A) U),

where "orth" stands for the orthogonalization operation. In most cases, the updated matrix U^+ spans a subspace which contains some approximated eigenvectors of A. By extracting eigen-pairs from this subspace, we are able to derive a number of polynomial-filtered inexact optimization methods. In the next subsections we present two examples: PFPG and PFAM.

2.2. Polynomial-filtered Proximal Gradient Method. In this subsection we show how to apply the subspace update (2.3) to the proximal gradient method, on a set of problems with a general form. Consider the following unconstrained spectral operator minimization problem

(2.4)    min_x  h(x) := F(x) + R(x),

where F(x) has a composition form F(x) = f ∘ λ(B(x)) with B(x) = G + A^*(x), and R(x) is a regularizer with simple structure which need not be smooth. Here G is a known matrix in S^n, and A^*: R^m → S^n is a linear operator. The function f: R^n → R is smooth and absolutely symmetric, i.e., f(x) = f(Px) for all x ∈ R^n and any permutation matrix P ∈ R^{n×n}. It can be easily verified that F(x) is well defined.

According to [16, Section 6.7.2], the gradient of F in (2.4) is

(2.5)    ∇F(x) = A(Ψ(B(x))),

where Ψ is a spectral operator defined by Ψ(X) = V Diag(∇f(λ(X))) V^T, and V Diag(λ(X)) V^T is the full eigen-decomposition of X.

For any matrix S ∈ R^{n×p}, let P_S denote the projection operator onto the subspace Range(S). When S is an orthogonal matrix, the operator P_S is defined by

(2.6)    P_S(X) = S S^T X S S^T.

If V is exactly a p-dimensional eigen-space of the matrix X, the eigenvalue decomposition of P_V(X) can be written as

(2.7)    P_V(X) = [Ṽ  V_⊥] [Λ_p  O; O  O] [Ṽ  V_⊥]^T,

where (Ṽ, Λ_p) are the corresponding eigen-pairs and Range(V) = Range(Ṽ). Note that (Ṽ, Λ_p) has a closed-form expression with Ṽ = V Y, where Y Λ_p Y^T is the full eigenvalue decomposition of H := V^T X V ∈ R^{p×p}.

Suppose evaluating ∇F only involves a small number of eigenvalues in the problem (2.4); then we are able to compute ∇F quickly as long as the corresponding subspace is given. We summarize these conditions in Assumption 1.

Assumption 1.
(i) Let I(x) ⊂ [n] be an integer set with I(x) = {s, s + 1, . . . , t} and suppose that the gradient of F(x) in (2.4) has the relationship

(2.8)    ∇F(x) = ḡ := A(Φ(P_{V_I}(B(x)))),

where Φ is a spectral operator and V_I ∈ R^{n×|I|} contains the eigenvectors v_i, i ∈ I(x), of B(x).
(ii) Φ is Lipschitz continuous with a factor L_Φ, i.e.,

(2.9)    ‖Φ(X) − Φ(Y)‖_F ≤ L_Φ ‖X − Y‖_F.

We now compare the gradient expression (2.8) in Assumption 1 with the original gradient (2.5). Expression (2.8) implies that the subspace V_I contains all information for evaluating ∇F(x). In other words, we do not need full eigenvalue decompositions as indicated by (2.5). The spectral operator Φ(·) in (2.8) is also different from Ψ(·) defined in (2.5). The choice of Φ(·) can be arbitrary as long as it satisfies Assumption 1. In most cases Φ(·) inherits the definition of Ψ(·), but they need not be the same.
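The extraction step behind (2.7) only involves a small p-by-p eigenvalue problem. A minimal MATLAB sketch (our own naming, assuming U has orthonormal columns) is:

```matlab
% Eigen-pairs of the projected matrix P_U(X) = U*U'*X*U*U', cf. (2.7):
% they are obtained from the small p-by-p matrix H = U'*X*U.
function [Vtil, lam] = subspace_eig(X, U)
    H = U' * (X * U);                     % projected p-by-p matrix
    H = (H + H') / 2;                     % symmetrize against round-off
    [Y, D] = eig(H);                      % full EVD of the small matrix
    [lam, idx] = sort(diag(D), 'descend');
    Vtil = U * Y(:, idx);                 % Ritz vectors spanning Range(U)
end
```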
Under Assumption 1, we can write the proximal gradient algorithm as

(2.10)    x^{k+1} = prox_{τ_k R}( x^k − τ_k A(Φ(P_{V^k_{I_k}}(B(x^k)))) ),

where V^k_{I_k} contains the eigenvectors of B(x^k) at iteration k, τ_k is the step size, and the proximal operator is defined by

(2.11)    prox_{tR}(x) = argmin_u  R(u) + (1/(2t)) ‖u − x‖².

In the scheme (2.10), the main cost is the computation of the EVD of P_{V^k_{I_k}}(B(x^k)), as the exact subspace V^k_{I_k} is unknown. Thus, an idea to reduce the expense is to use polynomial filters to generate an approximation of the eigen-space. Then it is easy to compute the projection onto this space and the corresponding eigen-decomposition, whence the image of the spectral operator on the projection matrix, as shown in (2.7). To obtain a good approximation, the update of the eigen-space is by means of Chebyshev polynomials which can suppress all unwanted eigenvectors of B(x), and the eigen-space from the last step is used as the initial value. In summary, the proximal gradient method with polynomial filters can be written as

(2.12)    x^{k+1} = prox_{τ_k R}( x^k − τ_k A(Φ(P_{U^k}(B(x^k)))) ),
(2.13)    U^{k+1} = orth( ρ_{k+1}^{q_{k+1}}(B(x^{k+1})) U^k ),

where U^k ∈ R^{n×p} with p ≥ |I_k| is an orthogonal matrix which serves as an approximation of V^k_{I_k} at each step, and q_k ≥ 1 indicates that the polynomial filter is applied q_k times before the orthogonalization.

We mention that (2.12) and (2.13) are actually two different updates that couple together. Given a subspace spanned by U^k, the step (2.12) performs the proximal gradient descent with the approximated gradient extracted from U^k. The next step (2.13) makes refinement on the subspace to obtain U^{k+1} via polynomial filters. The subspace will become more and more exact as the iterations proceed. In practice, it is usually observed that U^k is very "close" to the true eigen-space V^k_{I_k} during the last iterations of the algorithm.
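Putting the two updates together, a schematic MATLAB loop for (2.12)–(2.13) might look as follows. Here Bfun, grad_sub, prox_R and filt are problem-dependent function handles of our own naming (for instance, filt can be built from cheb_apply above combined with the shift and scaling in (2.2)); this is a sketch, not the interface of the released package.

```matlab
% Schematic PFPG iteration (2.12)-(2.13).
%   Bfun(x)        returns B(x) = G + A*(x)
%   grad_sub(B,U)  returns A(Phi(P_U(B))), e.g. built on subspace_eig above
%   prox_R(z,t)    proximal operator of R with step t
%   filt(B,U)      applies the polynomial filter rho(B) to the block U
function [x, U] = pfpg_sketch(x, U, Bfun, grad_sub, prox_R, filt, tau, maxit)
    for k = 1:maxit
        g = grad_sub(Bfun(x), U);           % approximate gradient, cf. (2.12)
        x = prox_R(x - tau * g, tau);       % proximal gradient step
        [U, ~] = qr(filt(Bfun(x), U), 0);   % subspace refinement + orth, (2.13)
    end
end
```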
2.3. Polynomial-filtered ADMM. Consider the following standard SDP:

(2.14)    min ⟨C, X⟩,   s.t.  A(X) = b,  X ⪰ 0,

where A is a bounded linear operator. Let F(X) = 1_{X ⪰ 0}(X) and G(X) = 1_{A(X) = b}(X) + ⟨C, X⟩, where 1_Ω(X) is the indicator function of a set Ω. Then the Douglas-Rachford splitting (DRS) method on the primal SDP (2.14) can be written as Z^{k+1} = T_DRS(Z^k), where

T_DRS = prox_{tG}(2 prox_{tF} − I) − prox_{tF} + I,

which is equivalent to the ADMM on the dual problem. The explicit forms of prox_{tF}(Z) and prox_{tG}(Y) can be written as

prox_{tF}(Z) = P_+(Z),    prox_{tG}(Y) = (Y + tC) − A^*(AA^*)^{-1}(A(Y) + tA(C) − b),

where P_+(Z) is the projection operator onto the positive semi-definite cone.

Similar to the PFPG method, P_+(Z) is only determined by the eigen-space spanned by the positive eigenvectors of Z, so we are able to use the polynomial filters to extract an approximation U^k. Therefore, the PFAM can be written as

(2.15)    Z^{k+1} = prox_{tG}(2 P_+(P_{U^k}(Z^k)) − Z^k) − P_+(P_{U^k}(Z^k)) + Z^k,
(2.16)    U^{k+1} = orth( ρ_{k+1}^{q_{k+1}}(Z^{k+1}) U^k ),

where U^k ∈ R^{n×p} with p ≥ |I| is an orthogonal matrix and q_k ≥ 1 plays the same role as in (2.13).
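Analogously, one PFAM sweep (2.15)–(2.16) can be sketched as below; prox_G is the closed-form proximal map given above, psd_part(Z,U) returns P_+(P_U(Z)) by keeping the positive Ritz pairs of U'*Z*U, and filt is the same Chebyshev filter. The handle names are ours, chosen for illustration.

```matlab
% Schematic PFAM iteration (2.15)-(2.16) for the SDP (2.14).
function [Z, U] = pfam_sketch(Z, U, prox_G, psd_part, filt, maxit)
    for k = 1:maxit
        P = psd_part(Z, U);               % approximation of P_+(Z^k)
        Z = prox_G(2 * P - Z) - P + Z;    % DRS/ADMM update, cf. (2.15)
        [U, ~] = qr(filt(Z, U), 0);       % subspace refinement, cf. (2.16)
    end
end
```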
3. Convergence Analysis.

3.1. Preliminary.
In this section we introduce some basic notations and tools that are used in our analysis. Firstly, we introduce the definition of principal angles to measure the distance between two subspaces.

Definition 1. Let X, Y ∈ R^{n×p} be two orthogonal matrices. The singular values of X^T Y are σ_1 ≥ σ_2 ≥ · · · ≥ σ_p, which lie in [0, 1]. The principal angle between Range(X) and Range(Y) is defined by

(3.1)    Θ(X, Y) := Diag(arccos σ_1, . . . , arccos σ_p).

In particular, we define sin Θ by taking the sine of Θ componentwise.
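Definition 1 is straightforward to evaluate numerically; a small sketch (our own helper) computes the sines of the principal angles from the singular values of X^T Y. Its largest entry is the quantity ‖sin Θ(X, Y)‖ (spectral norm) that appears throughout the analysis below.

```matlab
% Sines of the principal angles between Range(X) and Range(Y), cf. (3.1).
% X and Y are assumed to have orthonormal columns.
function s = sin_principal_angles(X, Y)
    sigma = svd(X' * Y);             % cosines of the principal angles
    sigma = min(max(sigma, 0), 1);   % clip round-off outside [0, 1]
    s = sqrt(1 - sigma.^2);          % componentwise sine
end
```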
The following lemma describes a very important property of Chebyshev polynomials.

Lemma 2. The Chebyshev polynomials increase fast outside [−1, 1], i.e.,

(3.2)    T_d(±(1 + ε)) ≥ 2^{d min{√ε, 1} − 1},   ∀ ε > 0.

Proof. It follows from the expression (2.1) that

T_d(±(1 + ε)) ≥ (1/2)(1 + ε + √(2ε + ε²))^d ≥ (1/2)(1 + √ε)^d = (1/2) e^{d log(1 + √ε)} ≥ 2^{d min{√ε, 1} − 1},

where the last inequality follows from log(1 + x) ≥ log 2 · min{x, 1}.

3.2. Convergence Analysis for General Convex Problems. In this subsection, we analyze the convergence for general convex problems. The relative gap is defined by

(3.3)    G_k = min_{i ∈ I_k} ( |λ_i − (a_k + b_k)/2| / ((b_k − a_k)/2) − 1 ),

where [a_k, b_k] is the interval suppressed by the Chebyshev polynomials at the k-th iteration and I_k is the index set of the eigenvalues beyond [a_k, b_k]. Throughout this section, we make the following assumptions.

Assumption 2.
(i) ‖sin Θ(V^{k+1}_{I_{k+1}}, U^k)‖ < γ for all k with γ < 1, where V^k_{I_k} and U^k are defined in Assumption 1 and (2.13) at iteration k.
(ii) The sequence generated by iterations (2.12) and (2.13) is bounded, i.e., ‖x^k‖ ≤ C for all k.
(iii) The relative gap (3.3) has a lower bound, i.e., G_k ≥ l for all k.

Assumption 2 (i) implies that the initial eigen-space is not orthogonal to the truly wanted eigen-space, so that we can obtain V^{k+1}_{I_{k+1}} by performing subspace iterations on U^k. Essentially, this property is required by almost all iterative eigensolvers for finding the correct eigenspace at each iteration. Assumption 2 (ii) is commonly used in the optimization literature. Assumption 2 (iii) is a key assumption which guarantees that the relative gap G_k exists in the asymptotic sense. Therefore, the polynomial filters are able to separate the wanted eigenvalues from the unwanted ones. In eigenvalue computation, this property is often enforced by adding a sufficient number of guarding vectors so that the corresponding eigenvalue gap is large enough.

The following lemma gives an error estimation when using an inexact gradient of F. It states that the polynomial degree d_k should be proportional to log(1/ε), where ε is a given tolerance.

Lemma 3. Suppose that Assumptions 1 and 2 hold. Define the error between the exact and inexact gradients:

(3.4)    e^k = A(Φ(P_{U^k}(B(x^k)))) − ∇F(x^k).

Let the polynomial filter ρ_k(t) be a Chebyshev polynomial with degree d_k. To achieve ‖e^k‖ ≤ ε for a given tolerance 0 < ε < 1, the degree d_k should satisfy

(3.5)    d_k = Ω( log(1/ε) / min{√l, 1} ).

Proof.
The update (2.1) can be regarded as one orthogonal iteration. The eigenvalue of ρ q k k ( B ( x k )) is ρ q k k ( λ i ), where λ i ’s are the eigenvalues of B ( x k ). According to Lemma 2, the eigen-values have the following distribution after we apply the Chebyshev polynomial:min i ∈I k ρ q k k ( λ i ) ≥ q k d k min {√ l, }− q k and max i/ ∈I k ρ q k k ( λ i ) ≤ . It follows from [9, Theorem 8.2.2] that(3.6) (cid:107) sin Θ( V k I k , U k ) (cid:107) ≤ q k q k d k min { l, } γ (cid:112) − γ . Due to the boundness of A and the Lipschitz continuity of Φ, we have(3.7) (cid:107) e k (cid:107) ≤ c (cid:107)P U k ( B ( x k )) − P V k I k ( B ( x k )) (cid:107) F , where c is a constant. It is shown in [9, Theorem 2.5.1] that (cid:107) sin Θ( X, Y ) (cid:107) = (cid:107) XX T − Y Y T (cid:107) . Using this identity, we have (cid:107)P U k ( B ( x k )) − P V k I k ( B ( x k )) (cid:107) F ≤(cid:107)P U k ( B ( x k )) − U k ( U k ) T ( B ( x k )) V k I k ( V k I k ) T (cid:107) F + (cid:107) U k ( U k ) T ( B ( x k )) V k I k ( V k I k ) T − P V k I k ( B ( x k )) (cid:107) F ≤ ||B ( x k ) || F (cid:107) sin Θ( V k I k , U k ) (cid:107) ≤ c · q k q k d k min { l, } , (3.8)where c is a constant depending on γ . The second inequality is due to the fact that (cid:107) AB (cid:107) F ≤(cid:107) A (cid:107) (cid:107) B (cid:107) F and the last inequality follows from (3.6) and the boundedness of A and x k . It is easyto verify that (cid:107) e k (cid:107) ≤ ε if (3.5) holds for any ε ∈ (0 , e claim that ∇ F ( x ) is Lipschitz continuous due to the boundness of A and x k and theLipschitz continuity of Φ. In the following part, we define the Lipschitz constant of ∇ F ( x ) by L .The following theorem gives the convergence of the PFPG method. Theorem Suppose that Assumptions 1 and 2 hold. Let τ k = τ ≤ L and ¯ x K = K (cid:80) Kk =1 x k .Then the convergence lim K →∞ h (¯ x K ) = h ( x ∗ ) holds if (3.9) d k = Ω (cid:18) log k min { l, } (cid:19) . Proof.
For simplicity we define the noisy residual function by r τ ( x, e ) = 1 τ ( x − prox τR ( x − τ ( ∇ F ( x ) + e ))) . Then the update (2.12) can be written as x k − x k +1 = τ k r τ k ( x k , e k ) . According to the Lipschitz continuity of ∇ F , we have h ( x k +1 ) ≤ F ( x k ) − τ k ∇ F ( x k ) T ( r τ k ( x k , e k )) + Lτ k (cid:107) r τ k ( x k , e k ) (cid:107) + R ( x k +1 ) . By the convexity of F ( x ) and R ( x ) and r τ k ( x, e k ) − ∇ F ( x k ) − e k ∈ ∂R ( x k +1 ), we have h ( x k +1 ) ≤ F ( x ∗ ) + ∇ F ( x k ) T ( x k − x ∗ ) − τ k ∇ F ( x k ) T ( r τ k ( x k , e k )) + Lτ k (cid:107) r τ k ( x k , e k ) (cid:107) + R ( x ∗ ) + ( r τ k ( x k , e k ) − ∇ F ( x k ) − e k ) T ( x k +1 − x ∗ ) . Setting the step size τ k = τ ≤ L yields h ( x k +1 ) − h ( x ∗ ) ≤ ∇ F ( x k ) T ( x k − x ∗ ) − τ ∇ F ( x k ) T ( r τ ( x k , e k )) + τ (cid:107) r τ ( x k , e k ) (cid:107) +( r τ ( x k , e k ) − ∇ F ( x k ) − e k ) T ( x k +1 − x ∗ ) . After rearranging and applying the boundness of the sequence x k , we have h ( x k +1 ) − h ( x ∗ ) ≤ τ ( (cid:107) x k − x ∗ (cid:107) − (cid:107) x k +1 − x ∗ (cid:107) ) + ( C + (cid:107) x ∗ (cid:107) ) (cid:107) e k (cid:107) , where C is a constant. Then it follows that(3.10) h (¯ x K ) − h ( x ∗ ) ≤ K K (cid:88) k =1 ( h ( x k ) − h ( x ∗ )) ≤ Kτ (cid:107) x − x ∗ (cid:107) + C + (cid:107) x ∗ (cid:107) K K (cid:88) k =1 (cid:107) e k (cid:107) . According to Lemma 3, by setting d k = Ω (cid:16) log k min { l, } (cid:17) at iteration k , we have the upper bound (cid:80) Kk =1 (cid:107) e k (cid:107) ≤ (cid:80) Kk =1 /k . Thus the right hand side of (3.10) goes to zero as K → ∞ , whichcompletes the proof. he following theorem gives a linear convergence rate under the strongly convex assumption of F ( x ). Theorem Suppose that Assumptions 1 and 2 hold, and F ( x ) is µ -strongly convex. Let τ k = τ ≤ L and w ≤ µ . Then a linear convergence rate is achieved if (3.11) d k = Ω (cid:18) − log( w (cid:107) x k +1 − x ∗ (cid:107) )min { l, } (cid:19) . Proof.
The quadratic bound in the proof of Theorem 4 gives h ( x k +1 ) ≤ F ( x k ) − τ k ∇ F ( x k ) T ( r τ k ( x k , e k )) + Lτ k (cid:107) r τ k ( x k , e k ) (cid:107) + R ( x k +1 ) . By the stongly convexity of F ( x ), the convexity of R ( x ) and r τ k ( x k , e k ) − ∇ F ( x k ) − e k ∈ ∂R ( x k +1 ),we have h ( x k +1 ) ≤ F ( x ∗ ) + ∇ F ( x k ) T ( x k − x ∗ ) − µ (cid:107) x k − x ∗ (cid:107) − τ k ∇ F ( x k ) T ( r τ k ( x k , e k ))+ Lτ k (cid:107) r τ k ( x k , e k ) (cid:107) + R ( x ∗ ) + ( r τ k ( x k , e k ) − ∇ F ( x k ) − e k ) T ( x k +1 − x ∗ ) . Under the choice of step size τ k = τ ≤ L , we have h ( x k +1 ) − h ( x ∗ ) ≤ ∇ F ( x k ) T ( x k − x ∗ ) − τ ∇ F ( x k ) T ( r τ ( x k , e k )) + τ (cid:107) r τ ( x k , e k ) (cid:107) − µ (cid:107) x k − x ∗ (cid:107) + ( r τ ( x k , e k ) − ∇ F ( x k ) − e k ) T ( x k +1 − x ∗ )= 12 τ ((1 − µτ ) (cid:107) x k − x ∗ (cid:107) − (cid:107) x k +1 − x ∗ (cid:107) ) − ( e k ) T ( x k +1 − x ∗ ) ≤ τ ((1 − µτ ) (cid:107) x k − x ∗ (cid:107) − (cid:107) x k +1 − x ∗ (cid:107) ) + (cid:107) e k (cid:107) (cid:107) x k +1 − x ∗ (cid:107) . Together with h ( x k +1 ) ≥ h ( x ∗ ), we have(3.12) (cid:107) x k +1 − x ∗ (cid:107) ≤ (1 − µτ ) (cid:107) x k − x ∗ (cid:107) + 2 τ (cid:107) e k (cid:107) (cid:107) x k +1 − x ∗ (cid:107) . The choice of d k guarantees (cid:107) e k (cid:107) ≤ w (cid:107) x k +1 − x ∗ (cid:107) . Plugging this upper bound of (cid:107) e k (cid:107) into (3.12)yields (cid:107) x k +1 − x ∗ (cid:107) ≤ η (cid:107) x k − x ∗ (cid:107) , where η = − µτ − wτ <
1, which completes the proof.
3.3. Convergence Analysis under the Warm-Start Setting. In this subsection, we analyze the linear convergence of PFPG under the warm-start setting. Suppose that the values ρ_k(λ_{s_i}(B(x^k))) are arranged in decreasing order, and define

(3.13)    η_k = ρ_k(λ_{s_{p+1}}(B(x^k))) / ρ_k(λ_{s_p}(B(x^k))).

We should point out that under Assumption 3, η_k is strictly smaller than 1 since G_k ≥ l > 0. An additional assumption is needed if we intend to refine the subspace from the previous one.

Assumption 3. ‖sin Θ(V^{k+1}_{I_{k+1}}, V^k_{I_k})‖ ≤ c ‖x^{k+1} − x^k‖ for all k.

Assumption 3 means that the eigen-spaces at two consecutive iteration points are close enough. This is necessary since we use the eigen-space at the previous step as the initial value of the polynomial filter. This assumption is satisfied when the gap is large enough, by the Davis-Kahan sin Θ theorem [5]. The next lemma shows the relationship between the two iterations in a recursive form. It plays a key role in the proof of the main convergence theorem.
Lemma Suppose that Assumptions 1, 2 and 3 hold. Let X be the set of optimal solutionsof the problem (2.4) , and dist( x, X ) = inf y ∈X (cid:107) x − y (cid:107) be the distance between x and X . Then wehave (3.14) (cid:107) sin Θ( V k +1 I k +1 , U k +1 ) (cid:107) ≤ η q k k +1 ( c (cid:107) sin Θ( V k I k , U k ) (cid:107) + c dist( x k , X )) , where c and c are constants.Proof. The update (2.3) can be seen as one iteration in the orthogonal iteration. Then it followsfrom [9, Theorem 8.2.2] that (cid:107) sin Θ( V k +1 I k +1 , U k +1 ) (cid:107) ≤ η q k k +1 (cid:107) tan Θ( V k +1 I k +1 , U k ) (cid:107) ≤ η q k k +1 (cid:112) − γ (cid:107) sin Θ( V k +1 I k +1 , U k ) (cid:107) , According to (3.7) and (3.8) in Lemma 3, we have (cid:107) e k (cid:107) ≤ c (cid:107) sin Θ( V k I k , U k ) (cid:107) , where c is a constant. The upper bound of (cid:107) x k − x k +1 (cid:107) is given by (cid:107) x k − x k +1 (cid:107) ≤ τ k ( (cid:107) r τ k ( x k , e k ) − r τ k ( x k , (cid:107) + (cid:107) r τ k ( x k , (cid:107) ) ≤ τ k (cid:107) e k (cid:107) + ω dist( x k , X ) ≤ τ k c (cid:107) sin Θ( V k I k , U k ) (cid:107) + ω dist( x k , X ) , where the second inequality is due to the monotonity of the proximal operator and the cocoercivitywith modulus ω of the residual function r τ k ( x k , (cid:107) sin Θ( V k +1 I k +1 , U k +1 ) (cid:107) ≤ η q k k +1 (cid:112) − γ (cid:107) sin Θ( V k +1 I k +1 , U k ) (cid:107) ≤ η q k k +1 (cid:112) − γ ( (cid:107) sin Θ( V k I k , U k ) (cid:107) + (cid:107) sin Θ( V k +1 I k +1 , V k I k (cid:107) ) ≤ η q k k +1 (cid:112) − γ ( (cid:107) sin Θ( V k I k , U k ) (cid:107) + c (cid:107) x k − x k +1 (cid:107) ) ≤ η q k k +1 (cid:112) − γ ((1 + τ k c c ) (cid:107) sin Θ( V k I k , U k ) (cid:107) + c ω dist( x k , X )) , where the third inequality is from (iii) in Assumption 3. This completes the proof.Lemma 6 means that the principle angle between V k +1 I k +1 and U k +1 are determined by the anglein the previous iteration and how far x k is from the optimal set X . As an intuitive interpretation,the polynomial-filtered subspace becomes more and more accurate when PFPG is close to converge. e are now ready to state the main result. It implies that the PFPG method has a linearconvergence rate if the polynomial degree is a constant, independent of the iteration k . Theorem Suppose that Assumptions 1, 2 and 3 holds, and the exact proximal gradientmethod for (2.4) has a linear convergence rate, i.e., (3.15) dist(prox τR ( x k − τ k ∇ F ( x k )) , X ) ≤ ν dist( x k , X ) , ν ∈ (0 , . If η k +1 (3.13) satisfies (3.16) ν + η q k k +1 c (cid:115)(cid:18) ν + η q k k +1 c (cid:19) + η q k k +1 ( τ k c c − νc ) < ρ < , where c , c , c are constants in Lemma 6, then the PFPG method has a linear convergence rate.Proof. Since X is closed, we denote x prj as the projection of prox τ k R ( x k − τ k ∇ F ( x k )) onto theoptimal set X . 
Then we havedist( x k +1 , X ) ≤ (cid:107) prox τ k R ( x k − τ k ( ∇ F ( x k ) + e k )) − x prj (cid:107) ≤(cid:107) prox τ k R ( x k − τ k ∇ F ( x k )) − prox τ k R ( x k − τ k ( ∇ F ( x k ) + e k )) (cid:107) + (cid:107) prox τ k R ( x k − τ k ∇ F ( x k )) − x prj (cid:107) ≤ dist(prox τ k R ( x k − τ k ∇ F ( x k )) , X ) + τ k (cid:107) e k (cid:107) ≤ dist(prox τ k R ( x k − τ k ∇ F ( x k )) , X ) + τ k c (cid:107) sin Θ( V k I k , U k ) (cid:107) F ≤ ν dist( x k , X ) + τ k c (cid:107) sin Θ( V k I k , U k ) (cid:107) F , (3.17)where the first inequality is from the definition of the distance function, the second inequality isfrom the triangle inequality, the third inequality is shown in the proof of Lemma 6 and the lastinequality is due to the linear convergence of the proximal gradient method.From (3.14) and (3.17) we observe that the error dist( x k , X ) and (cid:107) sin Θ( V k I k , U k ) (cid:107) are coupledtogether. Thus we define the error vector(3.18) s k = (cid:20) dist( x k , X ) (cid:107) sin Θ( V k I k , U k ) (cid:107) (cid:21) . The two recursive formula can be rewritten as(3.19) s k +1 ≤ R k s k , R k = (cid:20) ν τ k c η q k k +1 c η q k k +1 c (cid:21) . The spectral radius of R k is bounded by the left term in (3.16). It is easy to verify that (3.16) holdsif the degree d k is large enough to make η k +1 sufficiently small. The choice of d k is independent ofthe iteration number k . This completes the proof.Theorem 7 requires the linear convergence rate of the exact proximal gradient method. Thiscondition is satisfied if F ( x ) is strongly convex and smooth. From the proof we also observe thatdist( x k , X ) and (cid:107) sin Θ( V k I k , U k ) (cid:107) are recursively bounded by each other. By choosing suitable d k ,the two errors are able to decay simultaneously. .4. Convergence analysis for PFAM. We next analyze the convergence of PFAM. Similarto PFPG, we make the following assumptions:
Assumption 4. Let V^k_{I_k} be an orthogonal matrix whose columns are the eigenvectors corresponding to all positive eigenvalues of Z^k.
(i) ‖sin Θ(V^{k+1}_{I_{k+1}}, U^k)‖ < γ for all k with γ < 1.
(ii) The sequence generated by (2.15) and (2.16) is bounded, i.e., ‖Z^k‖_F ≤ C for all k.
(iii) The relative gap (3.3) has a lower bound, i.e., G_k ≥ l for all k.

The main theorem is established by following the proof in [6].

Theorem 8. Suppose Assumption 4 holds. The convergence ‖Z^k − T_DRS(Z^k)‖_F = O(1/√k) is achieved if

(3.20)    d_k = Ω( log k / min{√l, 1} ).

Proof.
According to the proof of Lemma 3, we obtain (cid:107)P U k ( Z k ) − P V k I k ( Z k ) (cid:107) F ≤ c · q k q k d k min { l, } , where c is a constant. Define the error between one PFAM update and one DRS iteration E k = Z k +1 − T DRS ( Z k ). Since prox tG and P + are Lipschitz continuous with constant L and L , we have (cid:107) E k (cid:107) F ≤ (2 L L + L ) (cid:107)P U k ( Z k ) − P V k I k ( Z k ) (cid:107) F ≤ c · q k q k d k min { l, } , where c is a constant. Let s > (cid:107) E k (cid:107) F = O ( 1 k s ) , by choosing d k = Ω( s log l min( l, ).Define W k = T DRS ( Z k ) − Z k , S k = Z k +1 − Z k and T k = T DRS ( Z k +1 ) − T DRS ( Z k ). Then wehave (cid:107) W k +1 (cid:107) F = (cid:107) W k (cid:107) F + (cid:107) W k +1 − W k (cid:107) F + 2 (cid:104) W k , W k +1 − W k (cid:105) = (cid:107) W k (cid:107) F + (cid:107) T k − S k (cid:107) F + 2 (cid:104) S k − E k , T k − S k (cid:105) . (3.21)By the firm nonexpansiveness of T DRS , we obtain(3.22) 2 (cid:104) S k , T k − S k (cid:105) = (cid:107) T k (cid:107) F − (cid:107) S k (cid:107) F − (cid:107) T k − S k (cid:107) F ≤ − (cid:107) T k − S k (cid:107) F . Plugging (3.22) into (3.21) yields(3.23) (cid:107) W k +1 (cid:107) F ≤ (cid:107) W k (cid:107) F − (cid:107) T k − S k (cid:107) F − (cid:104) E k , T k − S k (cid:105) ≤ (cid:107) W k (cid:107) F + (cid:107) E k (cid:107) F . Since (cid:80) ∞ k =0 (cid:107) E k (cid:107) F < ∞ , it implies the boundness of (cid:107) W k (cid:107) F , i.e., (cid:107) W k (cid:107) F ≤ (cid:107) W (cid:107) F + k − (cid:88) i =0 (cid:107) E k (cid:107) F < ∞ . et Z ∗ be a fixed point of T DRS , then we obtain that(3.24) (cid:107) T DRS ( Z k ) − Z ∗ (cid:107) F ≤ (cid:107) Z k − Z ∗ (cid:107) F ≤ (cid:107) T DRS ( Z k − ) − Z ∗ (cid:107) F + (cid:107) E k − (cid:107) F . Applying (3.24) recursively gives (cid:107) T DRS ( Z k ) − Z ∗ (cid:107) F ≤ (cid:107) T DRS ( Z ) − Z ∗ (cid:107) F + k − (cid:88) i =0 (cid:107) E k (cid:107) F < ∞ . Hence (cid:107) T DRS ( Z k ) − Z ∗ (cid:107) F is bounded. Due to the firm nonexpansiveness of T DRS , we have that (cid:107) Z k +1 − Z ∗ (cid:107) F = (cid:107) T DRS ( Z k ) − Z ∗ (cid:107) F + (cid:107) E k (cid:107) F + 2 (cid:104) T DRS ( Z k ) − Z ∗ , E k (cid:105)≤ (cid:107) Z k − Z ∗ (cid:107) F − (cid:107) W k (cid:107) F + 2 (cid:107) T DRS ( Z k ) − Z ∗ (cid:107) F (cid:107) E k (cid:107) F . Therefore, it holds(3.25) ∞ (cid:88) i =0 (cid:107) W i (cid:107) F ≤ (cid:107) Z − Z ∗ (cid:107) F + 2 ∞ (cid:88) i =0 (cid:107) T DRS ( Z i ) − Z ∗ (cid:107) F (cid:107) E i (cid:107) F ≤ ∞ . The upper bound (3.25) implies (cid:107) W k (cid:107) →
0. In fact, we can also show the convergence speed.According to (3.23) and (3.25), we have (cid:24) k (cid:25) (cid:107) W k (cid:107) F = k (cid:88) i = (cid:100) k (cid:101) (cid:107) W k (cid:107) F ≤ k (cid:88) i = (cid:100) k (cid:101) ( (cid:107) W i (cid:107) F + k (cid:88) j = i (cid:107) E j (cid:107) F )= k (cid:88) i = (cid:100) k (cid:101) (cid:107) W i (cid:107) F + k (cid:88) i = (cid:100) k (cid:101) ( k − i + 1) (cid:107) E i (cid:107) F ≤ k (cid:88) i = (cid:100) k (cid:101) (cid:107) W i (cid:107) F + c k (cid:88) i = (cid:100) k (cid:101) (cid:107) E i (cid:107) F −→ , where the last inequality is due to the fact that (cid:107) E k (cid:107) F dominates (cid:107) E k (cid:107) F since (cid:107) E k (cid:107) F = O (1 /k s ).This completes the proof.
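In practice, the logarithmic schedules in (3.9) and (3.20) translate into a very mild degree rule. A possible helper is sketched below; the constants c0 and dmax are tuning parameters of our own choosing, not prescribed by the analysis.

```matlab
% Degree schedule suggested by (3.9)/(3.20): d_k grows like
% log(k)/min(sqrt(l),1), clipped between 1 and a user cap dmax.
function d = filter_degree(k, l, c0, dmax)
    d = ceil(c0 * log(k + 1) / min(sqrt(l), 1));
    d = max(1, min(d, dmax));
end
```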
4. Implementation Details.

4.1. Choice of the Interval [a, b]. We mainly focus on how to extract a good subspace that contains the eigenvectors corresponding to 1) all positive eigenvalues and 2) the r largest eigenvalues, by choosing a proper [a, b]. In either case, a good choice of a is the smallest eigenvalue of the input matrix A, say λ_n. For example, one can use the Lanczos method (eigs in matlab) to estimate λ_n. Note that A changes during the iterations, thus a must be kept up to date periodically. The estimation of b is related to the case we are dealing with. If all positive eigenvalues are computed, b is set to b = ηa where η is a small positive number, usually 0.1 to 0.2. If only the r largest eigenvalues are needed, then we choose b = (1 − η)λ̂_r + ηa, where λ̂_r is an estimation of λ_r. A natural choice is the λ_r from the previous iteration.

4.2. Choice of the Subspace Dimension. We next discuss the choice of p, the number of columns of U^k. Suppose the r largest eigenvalues of a matrix are required; we can add some guard vectors to U^k and set p = 5 + q · r where q > 1.

4.3. Acceleration. Gradient descent algorithms usually have slow convergence. This issue can be resolved using acceleration techniques such as Anderson Acceleration (AA) [1, 23]. In our implementation we apply an extrapolation-based acceleration technique proposed in [18] to overcome the instability of the Anderson Acceleration. To be precise, we perform linear combinations of the points x^k every l + 2 iterations to obtain a better estimation x̃ = Σ_{i=0}^{l} c̃_i x^{k−l+i}. Define the matrix of differences of the l + 2 iteration points U = [x^{k−l+1} − x^{k−l}, . . . , x^{k+1} − x^k]; the coefficient vector c̃ = (c̃_0, . . . , c̃_l)^T is the solution of the following problem

(4.1)    c̃ = argmin_{1^T c = 1}  c^T (U^T U + λI) c,

where λ > 0 is a small regularization parameter.
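The coefficient problem (4.1) has a closed-form solution via one small linear solve. A sketch of the whole extrapolation step, under our own naming and with the iterates stored as the columns of X, is:

```matlab
% Regularized extrapolation (4.1): given the last l+2 iterates as the
% columns of X = [x^{k-l}, ..., x^{k+1}], solve for the coefficients c
% with sum(c) = 1 and combine the first l+1 iterates.
function xt = rna_extrapolate(X, lambda)
    U = diff(X, 1, 2);                       % columns x^{i+1} - x^i
    M = U' * U + lambda * eye(size(U, 2));   % regularized Gram matrix, (4.1)
    z = M \ ones(size(U, 2), 1);             % unnormalized minimizer
    c = z / sum(z);                          % enforce 1'*c = 1
    xt = X(:, 1:end-1) * c;                  % extrapolated point x~
end
```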
5. Numerical Experiments.
This section reports on a set of applications and their numerical results based on the spectral optimization problem (2.4). All experiments are performed on a Linux server with two twelve-core Intel Xeon E5-2680 v3 processors at 2.5 GHz and a total amount of 128 GB shared memory. All reported time is wall-clock time in seconds.
Given a matrix G ∈ S^n, the (unweighted) nearest correlation estimation problem (NCE) is to solve the semi-definite optimization problem:

(5.1)    min (1/2)‖G − X‖_F²,   s.t.  diag(X) = 1_n,  X ⪰ 0.

The dual problem of (5.1) is

(5.2)    min (1/2)‖Π_{S^n_+}(G + Diag(x))‖_F² − 1_n^T x,

where Π_{S^n_+}(·) is the projection operator onto the positive semi-definite cone. Note that (5.2) can be rewritten as a spectral function minimization problem of the form (2.4), whose objective function and gradient are:

(5.3)    min F(x) := f ∘ λ(G + Diag(x)) − 1_n^T x,
(5.4)    ∇F(x) = diag(Π_{S^n_+}(G + Diag(x))) − 1_n,

where f(λ) = (1/2) Σ_{i=1}^n max(λ_i, 0)². Expressions (5.3) and (5.4) imply that only the positive eigenvalues and eigenvectors of G + Diag(x^k) are needed at each iteration. Hence Assumption 1 is satisfied if Π_{S^n_+}(G + Diag(x)) is low-rank for all x in the neighborhood of x^*.

First we generate synthetic NCE problem data based on the first three examples in [17].

Example 5.1. C ∈ S^n is a randomly generated correlation matrix using gallery('randcorr',n) in matlab. R is an n by n random matrix whose entries R_ij are sampled from the uniform distribution in [−1, 1]. Then the matrix G is set to G = C + R.

Example 5.2. G ∈ S^n is a randomly generated matrix with G_ij satisfying the uniform distribution in [−1, 1] if i ≠ j. The diagonal elements G_ii are set to 1 for all i.

Example 5.3. G ∈ S^n is a randomly generated matrix with G_ij satisfying the uniform distribution in [0, 1] if i ≠ j. The diagonal elements G_ii are set to 1 for all i.

In all three examples, the dimension n is set to various numbers from 500 to 4000. The initial point x^0 is set to 1_n − diag(G). Three methods are compared on the NCE problems, namely, the gradient method with a fixed step size (Grad), our proposed method (PFPG), and the semi-smooth Newton method (Newton) proposed in [17]. The stopping criterion is the relative gradient norm condition ‖∇F(x)‖ ≤ ε ‖∇F(x^0)‖ for a moderate tolerance ε. The reason not to use a higher accuracy is that first-order methods (such as Grad) are not designed for obtaining high accuracy solutions. We also record the number of iterations (denoted by "iter"), the objective function value f, and the relative norm of the gradient ‖g‖ = ‖∇F(x)‖ / ‖∇F(x^0)‖. The results are shown in Table 1.

The following observations can be drawn from Table 1:
• All three methods reached the required accuracy in all test instances. For large problems such as n = 4000, PFPG usually provides at least three times speedup over Grad. For some large problems such as n = 4000 in Example 5.3, PFPG has nearly 9 times speedup.
• The Newton method usually converges within 10 iterations, since it enjoys the quadratic convergence property. The main cost of the Newton method consists of full EVDs and solving linear systems, which make each iteration slow. However, PFPG benefits from the subspace extraction so that the cost of each iteration is greatly reduced, making it very competitive against the Newton method.
• PFPG usually needs more iterations than Grad, which results from the inexactness of the Chebyshev filters. The error can be controlled by their degree. Using higher order polynomials will reduce the number of iterations, but increase the cost per iteration.
Moreover, the number of iterations does not increase much thanks to the warm-start property of the filters.

It should be pointed out that the benefit brought by PFPG is highly dependent on the number of positive eigenvalues r_+ of G + Diag(x^*). We have checked this number and found that, relative to n, r_+ is considerably smaller in Example 5.3 than in Examples 5.1 and 5.2. That is the reason why PFPG is able to provide higher speedups in Example 5.3. As a conclusion, we address that PFPG is not suitable when r_+ is a large fraction of n.

Next we test the three methods on real datasets. The test matrices are selected from the invalid correlation matrix repository (available at https://github.com/higham/matrices-correlation-invalid). Since we are more interested in large NCE problems, we choose the three largest matrices from the collection, whose details are presented in Table 2. The label "opt. rank" denotes the rank of the optimal point X^* in (5.1). Note that for the matrix "cor3120", the solution is not low-rank. However, the projection can be performed with respect to the negative definite part of the variable, which is low-rank in this case.

Again, we compare the performance of Grad, PFPG, and Newton on the three matrices. The stopping criterion is again ‖∇F(x)‖ ≤ ε ‖∇F(x^0)‖. The results are shown in Table 3. For real data, PFPG is still able to provide multi-fold speedups over the traditional gradient method.
Table 1. Results of NCE problems
Results of Example 5.1
n    | Grad: time iter f ‖g‖         | PFPG: time iter f ‖g‖         | Newton: time iter f ‖g‖
500  | 0.8   28 1.1e+05 7.9e-08 | 0.7  40 1.1e+05 4.1e-08 | 0.6  8 1.1e+05 1.7e-08
1000 | 2.6   28 3.2e+05 5.5e-08 | 1.1  46 3.2e+05 2.6e-08 | 1.4  8 3.2e+05 1.2e-08
1500 | 6.2   28 5.9e+05 7.0e-08 | 2.5  46 5.9e+05 8.2e-08 | 3.3  8 5.9e+05 1.0e-08
2000 | 13.5  31 9.2e+05 8.0e-08 | 4.2  46 9.2e+05 4.9e-08 | 5.9  8 9.2e+05 8.8e-09
2500 | 24.9  34 1.3e+06 1.7e-08 | 7.2  47 1.3e+06 6.9e-08 | 9.6  8 1.3e+06 7.9e-09
3000 | 43.2  34 1.7e+06 2.8e-08 | 13.2 52 1.7e+06 1.8e-08 | 15.7 8 1.7e+06 7.2e-09
4000 | 110.6 34 2.7e+06 1.4e-08 | 25.0 52 2.7e+06 2.4e-08 | 36.1 8 2.7e+06 6.3e-09

Results of Example 5.2
n    | Grad: time iter f ‖g‖         | PFPG: time iter f ‖g‖         | Newton: time iter f ‖g‖
500  | 0.5   16 8.9e+03 2.2e-09 | 0.6  16 8.9e+03 1.0e-08 | 0.5  5 8.9e+03 1.6e-07
1000 | 1.6   16 2.6e+04 7.4e-08 | 1.0  17 2.6e+04 8.3e-08 | 1.2  6 2.6e+04 1.2e-07
1500 | 4.0   17 5.0e+04 6.6e-08 | 2.2  20 5.0e+04 9.1e-08 | 2.5  6 5.0e+04 9.6e-08
2000 | 7.5   17 7.8e+04 8.9e-08 | 4.8  22 7.8e+04 2.4e-08 | 4.7  6 7.8e+04 8.3e-08
2500 | 13.8  19 1.1e+05 7.2e-08 | 7.4  22 1.1e+05 4.5e-08 | 7.6  6 1.1e+05 7.5e-08
3000 | 25.9  21 1.5e+05 8.9e-08 | 12.0 22 1.5e+05 8.1e-08 | 12.3 6 1.5e+05 6.9e-08
4000 | 72.5  22 2.3e+05 3.4e-09 | 27.2 25 2.3e+05 9.9e-08 | 28.6 6 2.3e+05 6.0e-08

Results of Example 5.3
n    | Grad: time iter f ‖g‖         | PFPG: time iter f ‖g‖         | Newton: time iter f ‖g‖
500  | 0.9   33 1.3e+05 7.4e-08 | 0.7  43  1.3e+05 1.3e-08 | 0.6  8  1.3e+05 1.8e-07
1000 | 3.8   43 5.0e+05 2.0e-08 | 1.1  54  5.0e+05 3.0e-08 | 1.6  9  5.0e+05 1.3e-07
1500 | 11.3  54 1.1e+06 2.6e-08 | 2.4  65  1.1e+06 7.1e-08 | 4.0  10 1.1e+06 1.0e-07
2000 | 22.6  54 2.0e+06 4.3e-08 | 5.0  76  2.0e+06 8.3e-08 | 7.2  10 2.0e+06 9.0e-08
2500 | 60.9  87 3.1e+06 3.6e-08 | 10.6 120 3.1e+06 4.2e-08 | 12.1 10 3.1e+06 8.0e-08
3000 | 104.0 87 4.5e+06 6.2e-08 | 16.1 129 4.5e+06 7.8e-08 | 19.3 10 4.5e+06 7.4e-08
4000 | 278.0 91 8.0e+06 7.9e-08 | 32.8 142 8.0e+06 7.3e-08 | 44.2 10 8.0e+06 6.4e-08

Though not always faster than Newton, PFPG is very competitive against the second-order method.
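As an illustration of how cheap the gradient (5.4) becomes once the positive eigen-space is available, a plain MATLAB evaluation (using a full eig for clarity; PFPG replaces this step by the filtered subspace extraction) reads:

```matlab
% Gradient (5.4) of the NCE dual: only the positive eigen-pairs of
% A = G + Diag(x) enter the projection onto the PSD cone.
function g = nce_grad(G, x)
    A = G + diag(x);
    [V, D] = eig((A + A') / 2);          % full EVD, for illustration only
    lam = diag(D);
    pos = lam > 0;
    g = (V(:, pos).^2) * lam(pos) - 1;   % diag(V_+ Diag(lam_+) V_+') - 1_n
end
```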
We next consider two algorithms that deal with thenuclear norm minimization problem (NMM): svt and nnls . Consider a regularized NMM problem with the form(5.5) min (cid:107) X (cid:107) ∗ + µ (cid:107) X (cid:107) F , s.t. A ( X ) = b. The dual problem of (5.5) is(5.6) min y ∈ R m (cid:107) shrink µ ( A ∗ ( y )) (cid:107) F − b T y, able 2 Invalid correlation matrices from real applications name size opt. rank descriptioncor1399 1399 × × × Table 3
Results of NCE problems on real data matrix Grad PFPG Newtontime iter f (cid:107) g (cid:107) time iter f (cid:107) g (cid:107) time iter f (cid:107) g (cid:107) cor1399 6.0 32 6.5e+04 1.3e-06 2.6 32 6.5e+04 4.1e-06 1.9 5 6.5e+04 4.4e-06cor3120 76.0 43 9.3e+04 2.4e-06 12.3 43 9.3e+04 4.8e-06 8.9 4 9.3e+04 3.8e-06bccd16 18.9 16 1.4e+06 3.3e-09 4.2 16 1.4e+06 6.2e-07 5.0 2 1.4e+06 3.4e-07where shrink µ ( X ) is the shrinkage operator defined on the matrix space R n × n , which actuallyperforms soft-thresholding on the singular values of X , i.e.shrink µ ( X ) = U Diag(( σ − µ ) + , · · · , ( σ n − µ ) + ) V T . The formulation (5.6) is a special case of our model (2.4), with f being the square of a vectorshrinkage operator. Evaluating f and ∇ f only involves computing singular values which are largerthan µ . Thus, problem (5.6) satisfies Assumption 1.We consider the svt algorithm [3] which has the iteration scheme(5.7) (cid:26) W k = shrink µ ( A ∗ ( x k − )) ,x k = x k − + τ k ( b − A ( W k )) . It can be shown that the svt iteration (5.7) is equivalent to the gradient descent method on thedual problem (5.6), with step size τ k . Thus a polynomial-filtered svt (PFSVT) can be proposed.A small difference is that we need to extract a subspace that contains the required singular vectorsof a non-symmetric matrix A for svt . This operation can be performed by the subspace extractionfrom A T A and AA T .The matrix completion problem is used to test PFSVT, with the constraint P Ω ( X ) = P Ω ( M ).The test data are generated as follows. Example
The rank r test matrix M ∈ R m × n is generated randomly using the followingprocedure: first two random standard Gaussian matrices M L ∈ R m × r and M R ∈ R n × r are generatedand then M is assembled by M = M L M TR ; the projection set Ω with p index pairs is sampled from m × n possible pairs uniformly. We use “SR” (sampling ratio) to denote the ratio p/ ( mn ) . We have tested the svt algorithm under three settings: svt with standard Lanczos SVD(SVT-LANSVD), svt with SVD based on the Gauss-Newton method (SVT-SLRP, proposed in[14]), and polynomial-filtered svt (PFSVT), where the degree of the polynomial filter is set to1 in all cases. Note that in (5.7) the variable W k is always stored in the sparse format. Weinvoke the mkl dcsrmm subroutine in the Intel Math Kernel Library (MKL) to perform sparsematrix multiplications (SpMM). The results are displayed in Table 4, where “iter” denotes the uter iteration of the svt algorithm, “svp” stands for the recovered matrix rank by the algorithm,and “mse”( = (cid:107) X − M (cid:107) F / (cid:107) M (cid:107) F ) denotes the relative mean squared error between the recoveredmatrix X and the ground truth M . Table 4
Results of svt on randomly generated matrix completion problems by using SpMM-MKL
SVT-LANSVD SVT-SLRP PFSVTm=n SR (%) r iter svp time mse iter svp time mse iter svp time mse1000 20.0 10 79 10 4.23 1.31e-04 79 10 1.94 1.31e-04 78 10 1.83 1.40e-041000 30.0 10 62 10 3.35 1.17e-04 62 10 1.71 1.17e-04 62 10 1.66 1.15e-041000 40.0 10 53 10 2.94 1.15e-04 53 10 1.52 1.14e-04 53 10 1.79 1.13e-041000 30.0 50 169 69 43.58 2.12e-04 171 68 8.67 2.12e-04 168 67 7.31 2.10e-041000 40.0 50 110 50 13.77 1.60e-04 110 50 5.35 1.60e-04 110 50 5.28 1.58e-041000 50.0 50 86 50 13.01 1.40e-04 86 50 4.73 1.40e-04 86 50 4.38 1.39e-045000 10.0 10 53 10 28.88 1.08e-04 53 10 12.37 1.08e-04 52 10 12.47 1.22e-045000 20.0 10 42 10 30.43 1.04e-04 42 10 12.85 1.04e-04 42 10 13.39 1.05e-045000 30.0 10 38 10 34.16 9.23e-05 38 10 16.62 9.23e-05 38 10 16.99 9.55e-055000 10.0 50 107 50 107.38 1.59e-04 107 50 64.38 1.59e-04 107 50 62.10 1.60e-045000 20.0 50 67 50 90.14 1.23e-04 67 50 33.11 1.22e-04 67 50 27.83 1.25e-045000 30.0 50 54 50 92.55 1.21e-04 54 50 37.90 1.21e-04 54 50 32.87 1.21e-0410000 5.0 10 54 10 59.63 1.10e-04 54 10 27.41 1.10e-04 53 10 27.12 1.23e-0410000 8.0 10 46 10 82.54 1.08e-04 46 10 34.29 1.08e-04 46 10 35.20 1.08e-0410000 10.0 10 43 10 84.65 1.07e-04 43 10 39.26 1.07e-04 43 10 39.52 1.08e-0410000 5.0 50 109 50 271.86 1.66e-04 109 50 150.35 1.65e-04 109 50 131.95 1.67e-0410000 10.0 50 69 50 260.60 1.29e-04 69 50 173.75 1.29e-04 69 50 152.81 1.30e-0410000 15.0 50 57 50 291.30 1.16e-04 57 50 206.06 1.16e-04 57 50 178.84 1.22e-04The following observations can be drawn from Table 4. • All three solvers produce similar results: “iter”, “svp”, and “mse” are almost the same. • SVT-SLRP and PFSVT have similar speed, both of which enjoy 1 to 5 times speedup overSVT-LANSVD. In all test cases SVT-SLRP is a little bit slower than PFSVT, yet stillcomparable.
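For reference, the singular value shrinkage operator shrink_μ appearing in (5.6)–(5.7) can be realized with a truncated SVD. In the sketch below, the rank guess r is our own parameter (in practice chosen with a few guard vectors, cf. Section 4.2) and must be large enough that the discarded singular values do not exceed μ.

```matlab
% Singular value shrinkage shrink_mu(X): soft-threshold the singular
% values at level mu.  Only the leading r triplets are computed, since
% values below mu are shrunk to zero anyway (r must be chosen safely).
function W = shrink(X, mu, r)
    [U, S, V] = svds(X, r);          % leading r singular triplets
    s = max(diag(S) - mu, 0);        % soft-thresholding
    W = U * diag(s) * V';
end
```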
Consider a penalized formulation of NMM:(5.8) min (cid:107) X (cid:107) ∗ + 12 µ (cid:107)A ( X ) − b (cid:107) F . We turn to the nnls algorithm [21] to solve (5.8), which is essentially an accelerated proximalgradient method. At the k -th iteration, the main cost is to compute the truncated SVD of a matrix A k = β U k ( V k ) T − β U k − ( V k − ) T − β G k , where U k , V k , U k − , V k − are dense matrices, but the matrix G k is either dense or sparse, dependingon the sample ratio (SR). Though the formulation (5.8) is not a special case of (2.4), we can stillinsert the polynomial filter (2.3) into the nnls algorithm to reduce the cost of SVD, resultingin a polynomial-filtered nnls algorithm (PFNNLS). The Example 5.4 is still used to compare the erformance of NNLS-LANSVD, NNLS-SLRP, and PFNNLS. The polynomial filter degree is merelyset to 1. The results are shown in Table 5 for dense G k and Table 6 for sparse G k . For matrixmultiplications, we call matlab built-in functions for dense G k and Intel MKL subroutines forsparse G k . The notations have the same meaning as in Table 4. Table 5
Results of nnls on dense examples
NNLS-LANSVD NNLS-SLRP PFNNLSm=n SR (%) r iter svp time mse iter svp time mse iter svp time mse1000 25.0 50 55 50 5.1 7.60e-04 55 50 2.7 1.18e-03 54 50 2.2 1.19e-031000 35.0 50 42 50 3.1 5.42e-04 49 50 2.6 4.40e-04 49 50 1.9 4.66e-041000 49.9 50 39 50 3.2 1.81e-04 39 50 1.9 1.81e-04 39 50 1.6 1.84e-041000 25.0 100 100 150 29.4 3.55e-01 100 150 8.6 3.87e-01 100 150 7.1 5.03e-011000 35.0 100 64 100 10.7 1.45e-03 64 100 4.2 1.48e-03 65 100 3.3 9.32e-041000 49.9 100 54 100 8.3 4.70e-04 48 100 3.8 1.05e-03 51 100 3.2 4.33e-045000 25.0 50 36 50 40.0 1.41e-04 36 50 17.8 1.42e-04 36 50 16.1 1.46e-045000 35.0 50 34 50 40.0 1.60e-04 34 50 18.7 1.61e-04 34 50 17.7 1.61e-045000 50.0 50 32 50 41.7 1.46e-04 32 50 21.5 1.48e-04 32 50 20.6 1.45e-045000 25.0 100 49 100 106.4 2.83e-04 49 100 35.6 2.88e-04 50 100 32.4 2.19e-045000 35.0 100 45 100 94.0 1.74e-04 45 100 38.7 1.76e-04 45 100 35.2 1.74e-045000 50.0 100 43 100 100.5 1.43e-04 43 100 46.1 1.44e-04 43 100 41.3 1.45e-0410000 25.0 50 32 50 106.9 1.17e-04 32 50 54.9 1.17e-04 32 50 51.4 1.18e-0410000 35.0 50 32 50 120.0 1.16e-04 32 50 66.3 1.16e-04 32 50 61.9 1.17e-0410000 50.0 50 31 50 129.1 1.39e-04 32 50 83.0 1.15e-04 31 50 76.2 1.37e-0410000 25.0 100 43 100 253.0 1.59e-04 44 100 114.9 1.74e-04 45 100 105.0 1.56e-0410000 35.0 100 42 100 263.7 1.32e-04 42 100 130.1 1.33e-04 42 100 118.3 1.37e-0410000 50.0 100 39 100 259.6 3.70e-04 40 100 156.1 1.18e-04 39 100 136.5 3.56e-04Similar observations can be drawn from Table 5 and 6. However, for one case (sparse G k , m = n = 10000, SR=2.0%), NNLS-LANSVD and PFNNLS seem to produce different results fromNNLS-SLRP. In another case (dense G k , m = n = 1000, SR=25.0%) the mse of all three methodsis high. The reason is that nnls terminates after it has reached the max iteration 100.The last experiment is to use the three nnls algorithms on real data sets: the Jester jokedata set and the MovieLens data set, which are also mentioned in [21]. As shown in Table 7, theJester joke data set includes four examples: jester-1, jester-2, jester-3, and jester-all. The last oneis simply the combination of the first three data sets. The MovieLens data set have three problemsbased on the size: movie-100K, movie-1M, and movie-10M. We call Intel MKL routines for matrixmultiplications since the data are all sparse. Again, the degree of the polynomial filter is set to 1and the max iteration of nnls is 100.We have the following observations from Table 7: • The three methods require similar number of iterations for all seven test cases. • NNLS-LANSVD generates solutions with the highest rank, while NNLS-SLRP producessolutions with the lowest rank. The mse of PFNNLS is a little larger in Jester joke datasets. • NNLS-SLRP provides 2-4 speedups over NNLS-LANSVD, while PFNNLS is 3-6 times faster able 6 Results of nnls on random sparse examples by using SpMM-MKL
NNLS-LANSVD NNLS-SLRP PFNNLSm=n SR (%) r iter svp time mse iter svp time mse iter svp time mse10000 2.0 10 45 10 16.5 1.86e-04 45 10 9.2 1.54e-04 45 10 7.9 1.55e-0410000 5.0 10 35 10 27.9 1.30e-04 35 10 15.0 1.30e-04 35 10 12.9 1.30e-0410000 10.0 10 30 10 53.1 1.03e-04 30 10 23.9 1.03e-04 30 10 18.9 1.03e-0410000 14.0 10 28 10 62.6 1.07e-04 29 10 28.6 1.09e-04 28 10 22.7 1.07e-0410000 2.0 50 100 77 226.7 3.42e-01 100 50 53.9 9.88e-03 100 74 43.5 7.48e-0110000 5.0 50 50 50 125.3 2.50e-04 48 50 40.1 2.52e-04 51 50 33.7 1.72e-0410000 10.0 50 37 50 187.3 1.66e-04 37 50 53.0 1.67e-04 37 50 46.3 1.79e-0410000 14.0 50 35 50 224.2 1.29e-04 35 50 62.7 1.30e-04 35 50 53.4 1.31e-0450000 2.0 10 38 10 431.9 1.09e-04 38 10 168.3 1.09e-04 38 10 137.7 1.10e-0450000 5.0 10 31 10 776.4 9.62e-05 31 10 328.7 9.62e-05 31 10 253.9 9.58e-0550000 10.0 10 28 10 1560.0 1.09e-04 28 10 586.3 1.09e-04 28 10 438.0 1.02e-0450000 14.0 10 28 10 2011.5 1.03e-04 28 10 792.0 1.03e-04 28 10 611.2 1.03e-0450000 2.0 50 49 50 1415.8 1.96e-04 49 50 442.0 1.62e-04 52 50 335.4 1.38e-0450000 5.0 50 37 50 2487.5 1.74e-04 37 50 710.7 1.77e-04 37 50 600.1 1.99e-0450000 10.0 50 32 50 4483.5 1.17e-04 32 50 1244.8 1.17e-04 32 50 1074.5 1.17e-0450000 14.0 50 31 50 5687.1 1.10e-04 31 50 1499.5 1.10e-04 31 50 1405.2 1.10e-04
Table 7
Results of nnls on real examples by using SpMM-MKL
NNLS-LANSVD / NNLS-SLRP / PFNNLS
name (m, n) | iter svp time mse | iter svp time mse | iter svp time mse
jester-1 (24983, 100) | 26 93 10.5 1.64e-01 | 27 69 4.6 1.76e-01 | 24 84 2.3 1.80e-01
jester-2 (23500, 100) | 26 93 9.1 1.65e-01 | 26 79 4.3 1.72e-01 | 25 88 2.1 1.80e-01
jester-3 (24938, 100) | 24 83 7.1 1.16e-01 | 27 74 4.6 1.24e-01 | 24 84 2.0 1.30e-01
jester-all (73421, 100) | 26 93 26.2 1.58e-01 | 26 82 12.4 1.62e-01 | 24 84 5.6 1.74e-01
movie-100K (943, 1682) | 34 100 4.2 1.28e-01 | 35 100 0.8 1.26e-01 | 36 100 0.6 1.23e-01
movie-1M (6040, 3706) | 50 100 40.6 1.42e-01 | 50 100 10.8 1.43e-01 | 51 100 7.4 1.42e-01
movie-10M (71567, 10677) | 54 100 620.1 1.26e-01 | 57 100 179.9 1.27e-01 | 52 100 92.7 1.27e-01

We make the following comments as extra interpretations of all numerical results in this subsection.
• The SLRP solver is essentially a subspace SVD method. As the authors point out in [14], the subspace is generated by solving a least-squares model with one Gauss-Newton iteration, after which high-accuracy singular pairs can be extracted from the subspace. The difference between SLRP and our polynomial filters is that solving the SLRP sub-problem requires more computation than evaluating polynomial filters, in exchange for higher-precision solutions. Hence SLRP can be regarded as a variant of the subspace extraction method.
• Converting the SVD of A to the EVD of A^T A implicitly applies a polynomial filter rho(t) = t to the singular values. Thus we can simply set the degree to 1 when performing subspace extractions on A^T A; a minimal sketch of such a degree-1 filtered step is given below.
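To make the last remark concrete, the following sketch (our own illustration, not code from the paper; all function and variable names are ours, and dense NumPy matrices are assumed, whereas the actual solvers exploit sparsity) performs one degree-1 filtered step on A^T A and then recovers approximate leading singular pairs of A through a projected SVD.

```python
import numpy as np

def degree_one_filtered_svd(A, V, r):
    """One degree-1 polynomial-filtered step for the top-r SVD of A.

    Applying rho(t) = t on A^T A amounts to a single block multiplication
    Y = (A^T A) V, followed by a Rayleigh-Ritz (projected SVD) step on the
    orthonormalized subspace.

    A : (m, n) data matrix
    V : (n, p) current subspace basis with p >= r (e.g. the basis kept
        from the previous outer iteration -- the warm start)
    r : number of singular triplets to extract
    """
    Y = A.T @ (A @ V)                  # degree-1 filter: (A^T A) V
    Q, _ = np.linalg.qr(Y)             # orthonormal basis of the filtered subspace
    B = A @ Q                          # project A onto the subspace
    Ub, s, Wt = np.linalg.svd(B, full_matrices=False)
    U = Ub[:, :r]                      # approximate left singular vectors
    Vr = Q @ Wt.T[:, :r]               # approximate right singular vectors
    return U, s[:r], Vr, Q             # Q can warm-start the next call

# toy usage: a nearly rank-5 matrix
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 150))
A += 1e-6 * rng.standard_normal((200, 150))
V0 = rng.standard_normal((150, 8))
U, s, V, Q = degree_one_filtered_svd(A, V0, r=5)
print(np.allclose(s, np.linalg.svd(A, compute_uv=False)[:5], rtol=1e-2))
```

The basis Q returned by one call can be passed back in as the warm start of the next outer iteration, which is how a constant, very low degree can remain sufficient.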
Consider the maximal eigenvalue optimization problem of the following form

(5.9) min_x F(x) + R(x) := lambda_1(B(x)) + R(x),

where B(x) = G + A^*(x). The subgradient of F(x) is dF(x) = { A(U S U^T) | S >= 0, tr(S) = 1 }, where the columns of U in R^{n x r} span the eigenspace associated with lambda_1(B(x)), whose multiplicity is r. If r = 1, then dF(x) contains only one element and the subgradient becomes the gradient. If r > 1, F(x) is only sub-differentiable; we point out that the case r > 1 is encountered far less often in practice than r = 1. In the following two applications, namely phase recovery and blind deconvolution, F(x) is differentiable within a neighborhood of x^*.

We use the phase recovery formulation in [7]. The smooth part F(x) = lambda_1(B(x)) := lambda_1(A^*(x)) is defined as follows. Let the known diagonal matrices C_k in C^{n x n} be the encoding diffraction patterns corresponding to the k-th "mask" (k = 1, ..., L). The measurements of an unknown signal x_0 in C^n are given by

b = A(x_0 x_0^*) := diag( [F C_1; ...; F C_L] (x_0 x_0^*) [F C_1; ...; F C_L]^* ),

where F is the unitary discrete Fourier transform (DFT). The adjoint of A can be written as

A^* y := sum_{k=1}^{L} C_k^* F^* Diag(y_k) F C_k,

where y_k denotes the k-th block of y. The non-smooth part R(x) is defined as the indicator function of the set { x | <b, x> - eps ||x||_* >= 1 }, where ||.||_* is the dual norm of ||.|| and eps >= 0 is a given noise parameter.

In the experiments on the phase retrieval problem, both synthetic and real data are considered. The noiseless synthetic data are generated as follows. For each value of L = 7, ..., 12, we generate a random complex Gaussian vector x_0 of length n = 1024 or 4096 and L random complex Gaussian masks C_k. We use the GAUGE algorithm proposed in [7] to solve the phase recovery problem, in which the matlab built-in function eigs is used as the eigenvalue sub-problem solver. We then insert our polynomial filters into the original GAUGE implementation to obtain the polynomial-filtered GAUGE algorithm (PFGAUGE). It should be mentioned that although GAUGE is essentially a gradient method, it has many variants to handle different cases, which we briefly discuss later. A highlight of GAUGE is the so-called "primal-dual refinement step", which is able to reduce the number of iterations drastically. This feature has a great impact on the performance of both GAUGE and PFGAUGE.

The results on noiseless synthetic data are presented in Tables 8 and 9, where "iter" denotes the number of outer iterations. In addition, we record the number of DFT calls (nDFT) and the number of eigen-solver calls or polynomial filter evaluations, denoted by nEigs and nPF, respectively. The label "err" stands for the relative mean squared error of the solution, ||x x^* - x_0 x_0^*||_F / ||x_0||^2.

Table 8
Random Gaussian signals, noiseless, n = 1024
GAUGE / PFGAUGE
L | time iter nDFT(nEigs) err | time iter nDFT(nPF) err
12 | 2.37 3 34248(10) 8.7e-07 | 2.09 7 43386(17) 1.1e-06
11 | 2.51 4 46244(13) 9.4e-07 | 2.28 9 48813(21) 1.2e-06
10 | 3.81 6 62160(17) 7.0e-07 | 2.54 10 51900(25) 1.3e-06
9 | 6.53 8 74772(19) 7.9e-07 | 3.74 12 56030(28) 1.8e-06
8 | 10.09 13 117776(29) 1.2e-06 | 5.00 16 67960(36) 1.6e-06
7 | 17.72 21 193956(47) 1.6e-06 | 6.92 23 88785(50) 6.8e-07
Table 9
Random Gaussian signals, noiseless, n = 4096
GAUGE / PFGAUGE
L | time iter nDFT(nEigs) err | time iter nDFT(nPF) err
12 | 10.85 4 57966(12) 1.4e-06 | 9.66 11 74442(30) 2.0e-06
11 | 15.25 5 83298(16) 2.0e-06 | 10.43 11 77880(36) 2.0e-06
10 | 22.63 8 118530(20) 2.3e-06 | 11.20 14 79435(37) 2.6e-06
9 | 35.89 11 176864(26) 2.0e-06 | 16.78 20 118377(56) 3.4e-06
8 | 49.88 15 235720(36) 2.8e-06 | 15.52 23 105660(53) 3.0e-06
7 | 160.64 40 779559(83) 2.6e-06 | 33.66 40 208915(84) 3.4e-06

The following observations should be clear from the results.
• Both GAUGE and PFGAUGE reach the required accuracy in all test cases. The problem grows harder to solve as L decreases from 12 to 7.
• In general, PFGAUGE needs more iterations to converge than the original GAUGE. An exception is the case L = 7, in which the iteration counts are nearly the same.
• PFGAUGE gradually becomes faster as L decreases. For example, it gives a 5 times speedup at L = 7 and n = 4096, which can also be concluded from nDFT. After examining the iteration process, we find that the eigenvalue sub-problem is hard to solve when L is small, so eigs takes more iterations to converge. However, since we only apply a polynomial filter to the iteration point, the cost of DFTs can be greatly reduced. For the easy cases such as L = 12, the performance of the two solvers is similar.

We next consider noisy synthetic data, which are generated with the following steps. First we generate L octanary masks C_k as described in [4], and a real Gaussian vector y. Then a solution x_0 in C^n is chosen as the normalized rightmost eigenvector of A^*(y), where n = 1024 or 4096, and b is computed as b := A(x_0 x_0^*) + eps * y/||y||, where eps := ||b - A(x_0 x_0^*)|| = eta ||b|| corresponds to a given relative noise level eta in (0, 1). We choose L in {12, 9, 6} and noise levels eta in {0.1%, 0.5%, 1%, 5%, 10%, 50%}. For each set of parameters, we randomly generate 100 test instances and feed them to the solvers, after which we compute the relative mean squared error of the solution. The "%" label in Tables 10 and 11 stands for the rate of successful recoveries, defined as the percentage of solutions whose MSE falls below a fixed threshold. All other columns in Tables 10 and 11 report the average over the 100 test instances. A sketch of this measurement and noise model is given below.
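The sketch below illustrates, under our own simplified assumptions, the forward map and the noise model described above: the k-th block of b = A(x_0 x_0^*) is |F(C_k x_0)|^2, and only matrix-vector products with A^*(y) are needed by a polynomial filter. We use plain complex Gaussian masks instead of the octanary masks of [4], unnormalized NumPy FFTs instead of the unitary DFT (so constant factors differ), and we skip the eigenvector-based construction of x_0.

```python
import numpy as np

def forward(x, masks):
    """b = A(x x^*): stacked vectors |F(C_k x)|^2 for each diagonal mask C_k."""
    return np.concatenate([np.abs(np.fft.fft(c * x)) ** 2 for c in masks])

def adjoint_matvec(y, z, masks):
    """Return A^*(y) z = sum_k C_k^* F^* Diag(y_k) F C_k z.

    This matrix-vector product is all a polynomial filter needs in order to
    evaluate rho(A^*(y)) applied to a block of vectors.
    """
    n = z.shape[0]
    out = np.zeros(n, dtype=complex)
    for k, c in enumerate(masks):
        yk = y[k * n:(k + 1) * n]
        # np.fft.ifft plays the role of F^* up to a 1/n factor
        out += np.conj(c) * np.fft.ifft(yk * np.fft.fft(c * z))
    return out

# noisy synthetic data in the spirit of the experiments above
rng = np.random.default_rng(0)
n, L, eta = 128, 6, 0.01
x0 = rng.standard_normal(n) + 1j * rng.standard_normal(n)
masks = [rng.standard_normal(n) + 1j * rng.standard_normal(n) for _ in range(L)]
b = forward(x0, masks)
noise = rng.standard_normal(L * n)
b_noisy = b + eta * np.linalg.norm(b) * noise / np.linalg.norm(noise)
z = adjoint_matvec(b_noisy, x0, masks)       # one filter matvec, for illustration
```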
Table 10
Random Gaussian signals, noisy, n = 1024
GAUGE / PFGAUGE
L eta | time iter nDFT(nEigs) % err | time iter nDFT(nPF) % err
12 0.1% | 3.02 51 9e+04(58) 100 2.1e-03 | 2.53 48 1e+05(55) 100 2.8e-03
9 0.1% | 6.47 88 1e+05(102) 96 3.4e-03 | 4.95 88 2e+05(103) 98 3.8e-03
6 0.1% | 13.27 195 2e+05(224) 76 7.5e-03 | 9.14 198 2e+05(223) 76 7.6e-03
12 0.5% | 5.76 94 2e+05(107) 95 5.5e-03 | 4.62 92 2e+05(99) 94 5.8e-03
9 0.5% | 9.90 128 2e+05(144) 92 5.7e-03 | 6.78 127 2e+05(147) 93 5.6e-03
6 0.5% | 19.84 284 3e+05(327) 74 8.5e-03 | 12.77 284 3e+05(335) 68 8.5e-03
12 1.0% | 22.82 296 7e+05(354) 78 7.2e-03 | 14.80 326 7e+05(372) 83 7.2e-03
9 1.0% | 22.26 271 4e+05(318) 86 8.0e-03 | 13.92 290 5e+05(346) 88 7.8e-03
6 1.0% | 71.67 979 1e+06(1171) 92 6.5e-03 | 38.64 1049 1e+06(1287) 92 6.4e-03
12 5.0% | 43.27 556 1e+06(649) 34 1.2e-02 | 30.53 734 1e+06(853) 31 1.4e-02
9 5.0% | 86.61 836 2e+06(1005) 65 7.8e-03 | 37.42 1009 1e+06(1161) 53 9.2e-03
6 5.0% | 90.90 1113 1e+06(1322) 88 4.7e-03 | 43.35 1237 1e+06(1477) 81 5.8e-03
12 10.0% | 61.22 692 2e+06(819) 21 1.6e-02 | 29.86 749 1e+06(842) 23 1.5e-02
9 10.0% | 76.80 686 1e+06(804) 48 1.0e-02 | 31.81 799 1e+06(926) 50 1.0e-02
6 10.0% | 78.61 950 1e+06(1124) 76 5.3e-03 | 32.38 956 1e+06(1110) 71 6.0e-03
12 50.0% | 36.20 368 1e+06(433) 41 1.3e-02 | 17.04 424 8e+05(491) 55 8.9e-03
9 50.0% | 42.28 361 8e+05(424) 68 5.7e-03 | 14.76 374 5e+05(438) 56 6.6e-03
6 50.0% | 34.76 389 6e+05(463) 86 2.5e-03 | 14.37 391 4e+05(477) 87 2.5e-03

We make the following comments on the results of the noisy case:
• Compared with the noiseless case, both GAUGE and PFGAUGE need many more iterations to converge.
• The solutions produced by the two algorithms are similar in terms of the number of iterations, the successful recovery rate, and the solution error. For most cases, PFGAUGE needs slightly more iterations than GAUGE.
• PFGAUGE often provides a 2-3 times speedup over GAUGE, which can also be seen from the nDFT column. Again, the number of DFTs is greatly reduced thanks to the polynomial filters.

Finally we conduct experiments on 2D real data in order to assess the speedup of the Chebyshev filters on problems of larger size. In this scenario the measured signals x_0 are gray-scale images, summarized in Table 12. For simplicity, the images are numbered from 1 to 12. Cases 1 and 2 are selected from the matlab image database; Cases 3 to 12 are from the HubbleSite Gallery (see http://hubblesite.org/images/gallery). The largest problem (No. 12) has the largest dimension n, which leads to a huge eigenvalue problem in each iteration. For each example we generate 10 and 15 octanary masks. The results are summarized in Table 13. The column with label "f" records the value of the dual objective function, and the label "gap" stands for the duality gap of the problem. The other columns have the same meaning as in Table 8. A generic sketch of evaluating such a Chebyshev filter is given below.
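Since these 2D experiments are meant to stress the Chebyshev filters, we include a generic sketch of how such a filter is usually evaluated: a degree-d Chebyshev polynomial mapped to the interval [a, b] that should be damped, applied to a block of vectors through the three-term recurrence. This is our own illustration under the assumption that a matrix-block product and the interval endpoints are supplied by the caller; it is not the implementation used in the experiments.

```python
import numpy as np

def chebyshev_filter(matvec, V, degree, a, b):
    """Apply a degree-`degree` Chebyshev filter to the block V.

    The filter damps eigenvalues inside [a, b] (the unwanted part of the
    spectrum) and amplifies those outside it, so the wanted eigenspace
    dominates the filtered block.  `matvec(X)` must return M @ X.
    """
    e = (b - a) / 2.0                       # half-width of the damped interval
    c = (b + a) / 2.0                       # center of the damped interval
    Y0 = V
    Y1 = (matvec(V) - c * V) / e            # T_1 of the shifted/scaled operator
    for _ in range(2, degree + 1):
        Y2 = 2.0 * (matvec(Y1) - c * Y1) / e - Y0   # T_j = 2 t T_{j-1} - T_{j-2}
        Y0, Y1 = Y1, Y2
    return Y1 if degree >= 1 else Y0

# toy usage: amplify the two largest eigenvalues of a symmetric matrix
rng = np.random.default_rng(1)
Qrand, _ = np.linalg.qr(rng.standard_normal((300, 300)))
lam = np.concatenate([np.linspace(0.0, 1.0, 298), [5.0, 6.0]])
M = (Qrand * lam) @ Qrand.T                  # symmetric matrix with known spectrum
V = rng.standard_normal((300, 2))
Y = chebyshev_filter(lambda X: M @ X, V, degree=10, a=0.0, b=1.0)
Q, _ = np.linalg.qr(Y)
theta = np.sort(np.linalg.eigvalsh(Q.T @ M @ Q))   # Rayleigh-Ritz values
print(np.allclose(theta, [5.0, 6.0], atol=1e-6))
```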
Table 11
Random Gaussian signals, noisy, n = 4096
GAUGE / PFGAUGE
L eta | time iter nDFT(nEigs) % err | time iter nDFT(nPF) % err
12 0.1% | 26.16 90 2e+05(106) 89 3.5e-03 | 20.90 89 2e+05(104) 88 3.1e-03
9 0.1% | 45.83 194 4e+05(224) 84 5.2e-03 | 30.10 192 3e+05(221) 80 4.6e-03
6 0.1% | 70.11 373 5e+05(436) 41 1.1e-02 | 56.40 407 5e+05(487) 35 1.3e-02
12 0.5% | 36.31 124 3e+05(149) 85 6.2e-03 | 29.69 138 3e+05(161) 90 5.3e-03
9 0.5% | 68.44 275 5e+05(310) 72 7.9e-03 | 43.98 266 5e+05(319) 69 8.0e-03
6 0.5% | 117.91 625 8e+05(720) 24 1.4e-02 | 84.42 653 7e+05(757) 19 1.5e-02
12 1.0% | 98.84 290 8e+05(331) 73 7.6e-03 | 64.63 320 8e+05(409) 76 7.2e-03
9 1.0% | 109.12 404 9e+05(471) 53 9.7e-03 | 62.14 473 8e+05(561) 52 9.9e-03
6 1.0% | 174.41 900 1e+06(1026) 25 1.4e-02 | 99.72 813 9e+05(951) 13 1.6e-02
12 5.0% | 378.11 834 3e+06(945) 54 9.5e-03 | 128.56 939 2e+06(1078) 45 1.1e-02
9 5.0% | 211.98 705 2e+06(831) 36 1.3e-02 | 94.93 793 1e+06(912) 36 1.3e-02
6 5.0% | 211.82 937 1e+06(1084) 29 1.5e-02 | 119.41 1048 1e+06(1284) 18 1.7e-02
12 10.0% | 372.63 801 3e+06(935) 35 1.4e-02 | 118.20 896 2e+06(1036) 34 1.4e-02
9 10.0% | 217.35 668 2e+06(789) 36 1.5e-02 | 102.80 826 1e+06(954) 31 1.6e-02
6 10.0% | 259.70 961 2e+06(1126) 38 1.5e-02 | 124.12 1109 1e+06(1276) 38 1.3e-02
12 50.0% | 337.82 558 3e+06(634) 45 1.2e-02 | 83.17 676 1e+06(748) 35 1.6e-02
9 50.0% | 223.85 572 2e+06(667) 53 8.6e-03 | 75.28 678 1e+06(765) 61 7.0e-03
6 50.0% | 194.21 668 1e+06(761) 67 5.5e-03 | 76.30 684 7e+05(834) 71 3.4e-03

Since the data size is very large, we terminate the algorithms as soon as they exceed a timeout threshold, which is set to 18000 seconds in this experiment.

Table 12
Image data for 2D signals
No. name size
1 coloredChips 518 x 391
2 lighthouse 480 x ...
4 giantbubble(S) 600 x ...
6 nebula(S) 800 x ...
[the remaining entries of Table 12 did not survive extraction]

We have the following observations from Table 13:
• For all test cases, PFGAUGE successfully converges within the time limit, while the original GAUGE method times out in most of the large test cases. GAUGE also fails on Case 2 with L = 10; after checking the output of the algorithm, the reason is that the eigen-solver fails to converge in some iterations.
• For the cases where both algorithms converge, PFGAUGE is able to provide over 20 times speedup, which can also be verified from the nDFT columns of the two methods. The reason is that the traditional eigen-solver eigs is not scalable enough for such large problems: its convergence is very slow, so the accuracy of its solutions is not acceptable. However, our polynomial-filtered algorithm is able to extract a proper low-rank subspace that contains the desired eigenvectors quickly (a minimal warm-started sketch of this extraction is given after Table 13), hence the performance is much better.
• PFGAUGE sometimes needs fewer iterations than GAUGE. This is because eigs fails to converge during some iterations, so GAUGE produces wrong updates.

Table 13
Phase retrieval comparisons on 2D real signals
GAUGE / PFGAUGE
No. L | time iter nDFT(nEigs) f gap | time iter nDFT(nPF) f gap
1 15 | 1155.89 4 3e+05(19) 1.0e+05 5.0e-06 | 549.97 7 2e+05(29) 1.0e+05 1.4e-06
1 10 | 3704.66 9 1e+06(26) 1.0e+05 1.0e-05 | 210.36 9 6e+04(26) 1.0e+05 7.9e-06
2 15 | 3043.72 4 1e+06(19) 1.1e+05 5.7e-06 | 722.95 6 2e+05(26) 1.1e+05 1.5e-06
2 10 | 21898.10 11 8e+06(87) 1.1e+05 NaN | 158.82 8 5e+04(25) 1.1e+05 5.0e-06
3 15 | 276.71 4 8e+04(18) 1.4e+04 4.9e-06 | 230.50 5 7e+04(21) 1.4e+04 6.2e-06
3 10 | 385.09 13 1e+05(34) 1.4e+04 9.5e-06 | 208.24 10 6e+04(28) 1.4e+04 9.4e-06
4 15 | 9583.33 12 3e+06(35) 2.3e+04 1.3e-06 | 295.84 6 8e+04(26) 2.3e+04 3.6e-06
4 10 | 7433.17 18 2e+06(46) 2.3e+04 7.9e-06 | 238.08 7 6e+04(24) 2.3e+04 9.7e-06
5 15 | 1735.88 5 5e+05(20) 2.2e+04 3.9e-06 | 622.25 9 2e+05(29) 2.2e+04 7.5e-06
5 10 | 1872.62 6 5e+05(22) 2.2e+04 1.5e-06 | 291.96 7 7e+04(24) 2.2e+04 1.8e-06
6 15 | 643.35 7 1e+05(24) 6.7e+04 2.7e-06 | 523.80 9 8e+04(30) 6.7e+04 6.4e-06
6 10 | 21943.32 15 3e+06(42) 6.7e+04 1.2e-02 | 989.89 4 1e+05(10) 6.7e+04 6.7e-06
7 15 | 2252.18 10 2e+05(30) 6.3e+04 8.3e-06 | 1116.42 9 1e+05(30) 6.3e+04 7.6e-06
7 10 | 2530.10 22 2e+05(54) 6.3e+04 8.4e-06 | 981.21 17 9e+04(46) 6.3e+04 6.9e-06
8 15 | 18543.56 5 1e+06(22) 5.8e+04 1.0e-03 | 4050.70 9 3e+05(30) 5.8e+04 5.8e-06
8 10 | 19371.82 11 1e+06(33) 5.8e+04 1.0e-03 | 1209.17 11 9e+04(34) 5.8e+04 8.9e-06
9 15 | 20239.04 2 1e+06(17) 9.1e+04 7.8e-02 | 7046.67 7 5e+05(27) 9.1e+04 9.9e-06
9 10 | 19610.56 8 1e+06(19) 9.1e+04 6.0e-01 | 3892.26 6 2e+05(14) 9.1e+04 4.7e-06
10 15 | 7680.27 10 3e+05(32) 2.7e+05 4.1e-06 | 4287.95 9 2e+05(30) 2.7e+05 5.8e-06
10 10 | 21958.19 5 8e+05(31) 2.7e+05 1.7e-01 | 4042.50 24 1e+05(58) 2.7e+05 4.8e-06
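As mentioned in the observations before Table 13, the filtered subspace is not recomputed from scratch: the basis extracted at one outer iteration warm-starts the next one. A minimal sketch of that loop is given below (our own illustration; a plain power filter rho(t) = t^d stands in for the Chebyshev filter, and all names are ours).

```python
import numpy as np

def filtered_subspace_step(matvec, U_prev, degree=3):
    """Refine the subspace carried over from the previous outer iteration.

    U_prev : orthonormal basis extracted at the previous iteration (warm start)
    Returns an orthonormal basis of Ritz vectors and the Ritz values.
    """
    Y = U_prev
    for _ in range(degree):                 # simple power filter rho(t) = t^degree
        Y = matvec(Y)
    Q, _ = np.linalg.qr(Y)
    H = Q.T @ matvec(Q)                     # small projected matrix (Rayleigh-Ritz)
    theta, S = np.linalg.eigh((H + H.T) / 2)
    return Q @ S, theta

# outer loop sketch: the iterate changes slowly, so a low degree suffices
rng = np.random.default_rng(2)
n, r = 200, 3
U = np.linalg.qr(rng.standard_normal((n, r)))[0]   # initial subspace guess
base = rng.standard_normal((n, n)); base = base @ base.T / n
for k in range(5):
    M_k = base + 0.01 * k * np.eye(n)              # slowly varying symmetric iterate
    U, theta = filtered_subspace_step(lambda X: M_k @ X, U, degree=3)
```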
Again, we use the blind deconvolution formulation in [7]. In this model, we measure the convolution of two signals s_1 in C^m and s_2 in C^m. Let B_1 in C^{m x n_1} and B_2 in C^{m x n_2} be two wavelet bases. The circular convolution of the signals is defined by

b = s_1 * s_2 = (B_1 x_1) * (B_2 x_2) = F^{-1} diag( (F B_1 x_1)(F B_2 x_2)^T ) = F^{-1} diag( (F B_1)(x_1 x_2^*)(F B_2)^* ) =: A(x_1 x_2^*),

where A is the non-symmetric linear map whose adjoint is

A^*(y) := (F B_1)^* Diag(F y) (F B_2).

The non-smooth part R(x) is defined as the indicator function of the set { x | <b, x> - eps ||x||_* >= 1 }. A small FFT-based sketch of this measurement map is given below.
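The following sketch shows, under our own simplifying assumptions, how the blind-deconvolution measurement and the matrix-vector product with A^*(y) can be evaluated with FFTs. Random orthonormal matrices stand in for the wavelet bases B_1 and B_2, and the unnormalized NumPy FFT replaces the unitary DFT, so constant factors differ from the formulas above; this is not code from [7].

```python
import numpy as np

def measure(x1, x2, B1, B2):
    """b = A(x1 x2^*): circular convolution of B1 x1 and B2 x2 via the FFT."""
    return np.fft.ifft(np.fft.fft(B1 @ x1) * np.fft.fft(B2 @ x2))

def adjoint_matvec(y, z, B1, B2):
    """Return A^*(y) z = (F B1)^* Diag(F y) (F B2) z (up to FFT scaling)."""
    return B1.conj().T @ np.fft.ifft(np.fft.fft(y) * np.fft.fft(B2 @ z))

# tiny usage with stand-in bases and real signals
rng = np.random.default_rng(3)
m = 64
B1 = np.linalg.qr(rng.standard_normal((m, m)))[0]   # stand-in for a wavelet basis
B2 = np.linalg.qr(rng.standard_normal((m, m)))[0]
x1, x2 = rng.standard_normal(m), rng.standard_normal(m)
b = measure(x1, x2, B1, B2)
w = adjoint_matvec(b, x2, B1, B2)                    # one matvec, as a filter would use
```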
In the experiments on blind deconvolution, the GAUGE algorithm and its polynomial-filtered version PFGAUGE are tested. The tolerances for feasibility and the duality gap are both set to a moderate level. We mention that there is a primal-dual refinement strategy in the GAUGE method, as noted before. It consists of two sub-problems: generating a refined primal solution from a dual solution (pfd) and a dual solution from a primal one (dfp). In the "dfp" process, the algorithm has to solve a least-squares problem using the gradient descent method. The "dfp" step is very slow in the experiments and frequently leads to algorithm failure. Therefore, we disable this step since it is optional.

The test data are shown in Table 14. There are eight pictures numbered from 1 to 8. Case 1 is from the original GAUGE paper [7]. Cases 2 to 4 are from the matlab gallery. Cases 5 to 8 are downloaded from the HubbleSite Gallery as mentioned before. The results are presented in Table 15. The columns with labels "xErr1" and "xErr2" list the relative errors ||x_j - xhat_j|| / ||x_j||, j = 1, 2, where xhat_j stands for the solution returned by the algorithms. The column with label "rErr" contains the relative residual ||b - A(xhat_1 xhat_2^*)|| / ||b||. The other columns share the same meaning as in Table 13. We have similar observations under these settings: PFGAUGE produces solutions of comparable quality while being roughly twice as fast (see Table 15).

Table 14
Image data for blind deconvolution
No. name size
1 shapes 256 x 256
2 cameraman 256 x ...
4 lighthouse 480 x ...
6 crabnebula1024 1024 x ...
[the remaining entries of Table 14 did not survive extraction]

In this section we show the numerical results of our PFAM. The experiments fall into two categories: standard SDPs and non-linear SDPs.
First we test PFAM on the two-body reduced density matrix (2-RDM) problem. The problem has a block diagonal structure with respect to the variable X, with each block being a low-rank matrix. Thus the polynomial filters can be applied to each block to reduce the cost. We use the preprocessed dataset in [12]. Only the cases with a maximum block size greater than 1000 are selected. The details of the data can be found in [15].

Table 16 contains the numerical results of ADMM and PFAM. The columns with headers "pobj", "pinf", "dinf", and "gap" record the primal objective function value, primal infeasibility, dual infeasibility, and duality gap at the solution, respectively. The overall stopping tolerance of each algorithm is set to a moderate value, since high accuracy is not needed in this application: we use ADMM only to generate initial solutions for second-order methods such as SSNSDP [12].

As observed in Table 16, PFAM is roughly two times faster than ADMM on the large 2-RDM problems, such as CH2, C2, and NH3. We mention that the speedup provided by PFAM depends on the number of large low-rank variable blocks. One may notice that PFAM is less effective on some problems like AlH and BF. The main reason is that there are very few large blocks in these datasets, and polynomial filters are not designed for small matrices. In fact, a full eigenvalue decomposition is always observed to be faster than truncated eigen-solvers or polynomial-filtered methods in these small cases. Another minor reason is that some blocks are not low-rank, causing the performance of PFAM to be limited. This observation again highlights the intended use case of PFAM: it is most suitable for large-scale low-rank problems. A sketch of the blockwise filtering is given below.

Table 15
Blind deconvolution comparisons
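A sketch of the blockwise strategy described above is given here (our own illustration, with a plain power filter as a placeholder for the Chebyshev filter): the filter is applied independently to each diagonal block, and small blocks simply fall back to a full eigendecomposition, which the experiments show to be the faster choice in that regime.

```python
import numpy as np

def filtered_psd_blocks(blocks, bases, degree=4, min_size=1000):
    """Approximate the PSD part of each diagonal block of a block-diagonal matrix.

    blocks   : list of symmetric NumPy arrays (the diagonal blocks of X)
    bases    : list of warm-start bases, one per block (None for small blocks)
    min_size : below this size a full eigendecomposition is cheaper
    Assumes the wanted (positive) eigenvalues dominate in magnitude, as is the
    case near a low-rank PSD solution; otherwise the block should be shifted.
    """
    out = []
    for A, V in zip(blocks, bases):
        if A.shape[0] < min_size or V is None:
            w, U = np.linalg.eigh(A)                  # small block: exact EVD
        else:
            Y = V
            for _ in range(degree):                   # polynomial filter rho(t)=t^degree
                Y = A @ Y
            Q, _ = np.linalg.qr(Y)
            w, S = np.linalg.eigh(Q.T @ A @ Q)        # Rayleigh-Ritz on the subspace
            U = Q @ S
        out.append((U * np.maximum(w, 0.0)) @ U.T)    # PSD projection of the block
    return out

# usage: one small block (exact) and one large low-rank block (filtered)
rng = np.random.default_rng(4)
A_small = rng.standard_normal((50, 50)); A_small = (A_small + A_small.T) / 2
L = rng.standard_normal((2000, 20)); A_big = L @ L.T
V_warm = rng.standard_normal((2000, 30))
X_plus = filtered_psd_blocks([A_small, A_big], [None, V_warm])
```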
GAUGE (nodfp)
No. | time iter nDFT(nEigs) f gap rErr xErr1 xErr2
1 | 13.10 15 4248(16) 1.681e+00 1.7e-03 3.4e-04 9.1e-02 5.5e-01
2 | 14.34 14 4816(15) 1.683e+00 4.1e-03 4.1e-04 1.4e-01 5.4e-01
3 | 15.66 18 5440(19) 1.672e+00 2.0e-03 3.0e-04 1.3e-01 5.4e-01
4 | 63.48 21 7028(22) 1.683e+00 4.3e-03 2.5e-04 1.1e-01 5.4e-01
5 | 35.62 14 4552(15) 1.689e+00 4.7e-03 2.8e-04 1.3e-01 5.3e-01
6 | 252.59 26 9588(27) 1.673e+00 3.0e-04 3.6e-04 1.0e-01 5.4e-01
7 | 282.38 22 7116(23) 1.696e+00 4.7e-03 2.6e-04 9.8e-02 5.4e-01
8 | 559.64 36 14732(37) 1.667e+00 4.7e-03 7.2e-05 9.4e-02 5.3e-01

PFGAUGE (nodfp)
No. | time iter nDFT(nPF) f gap rErr xErr1 xErr2
1 | 6.85 15 2740(16) 1.681e+00 1.7e-03 4.0e-04 9.1e-02 5.5e-01
2 | 7.30 14 2844(15) 1.684e+00 4.7e-03 5.1e-04 1.4e-01 5.4e-01
3 | 8.93 18 3576(19) 1.671e+00 2.1e-04 3.7e-04 1.3e-01 5.4e-01
4 | 40.60 26 4560(27) 1.683e+00 4.1e-03 4.1e-04 1.1e-01 5.4e-01
5 | 21.91 14 2800(15) 1.689e+00 4.8e-03 2.9e-04 1.3e-01 5.3e-01
6 | 119.43 26 4524(27) 1.677e+00 2.9e-03 3.2e-04 1.0e-01 5.4e-01
7 | 163.91 25 4204(26) 1.696e+00 4.4e-03 3.0e-04 1.0e-01 5.4e-01
8 | 297.61 37 7568(38) 1.659e+00 4.1e-04 3.4e-04 9.3e-02 5.3e-01
As an extension, we consider plugging polynomial filters into a multi-block ADMM to observe the speedup. We are interested in the non-linear SDPs arising from the weighted LS model with spectral norm constraints and from the least unsquared deviations (LUD) model in [24]. Suppose K is a given integer and S and W are two known matrices. The weighted LS model with spectral norm constraints is

(5.10) max_G <W .* S, G>, s.t. G_ii = I_2, G >= 0, ||G||_2 <= alpha*K,

where G = (G_ij)_{i,j=1,...,K} in S^{2K} is the variable, each block G_ij is a small 2-by-2 matrix, and ||.||_2 is the spectral norm. A three-block ADMM is introduced to solve (5.10):

(5.11) y^{k+1} = -A(C + X^k - Z^k) - mu (A(G^k) - b),
(5.12) Z^{k+1} = argmin_Z  alpha*K*mu ||Z||_* + (1/2) ||Z - B^k||_F^2,
(5.13) X^{k+1} = argmin_{X >= 0}  (1/2) ||X - H^k||_F^2,
(5.14) G^{k+1} = (1 - gamma) G^k + gamma*mu (X^{k+1} - H^k),

where
B^k = C + X^k + A^*(y^{k+1}) + (1/mu) G^k,
H^k = Z^{k+1} - C - A^*(y^{k+1}) - mu G^k
are auxiliary variables. In the ADMM update, steps (5.12) and (5.13) require EVDs, both of which can be replaced with the polynomial-filtered method; a minimal sketch of such a filtered step is given below.
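As a rough illustration of how such an EVD can be replaced, the sketch below approximates an eigenvalue soft-thresholding step of the kind appearing in (5.12) on a symmetric matrix by extracting only the dominant subspace with a polynomial filter; eigenvalues that are not captured are assumed to lie below the threshold and would be shrunk to zero anyway. The function name, the power filter, and the symmetric-matrix setting are our own simplifications, not the implementation used in PFAM.

```python
import numpy as np

def filtered_svt(B, V, tau, degree=6):
    """Approximate prox of tau*||.||_* at a symmetric matrix B via a filtered subspace.

    Only eigenvalues with |lambda| > tau survive the shrinkage, so it suffices
    to extract the subspace where |lambda| is large.  V is a warm-start basis
    assumed to be rich enough to capture that subspace.
    """
    Y = V
    for _ in range(degree):
        Y = B @ Y                           # power filter toward the dominant eigenspace
    Q, _ = np.linalg.qr(Y)
    w, S = np.linalg.eigh(Q.T @ B @ Q)      # Ritz values on the subspace
    U = Q @ S
    shrunk = np.sign(w) * np.maximum(np.abs(w) - tau, 0.0)
    return (U * shrunk) @ U.T

# usage on a nearly low-rank symmetric matrix
rng = np.random.default_rng(5)
L = rng.standard_normal((1500, 10))
B = L @ L.T + 0.01 * np.eye(1500)
V = rng.standard_normal((1500, 15))
Z = filtered_svt(B, V, tau=5.0)
```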
Table 16
Results of ADMM and PFAM on 2-RDM problems
name | ADMM: time itr pobj pinf dinf gap | PFAM: time itr pobj pinf dinf gap
AlH | 73 373 2.46e+02 2.1e-05 9.9e-05 4.8e-05 | 71 377 2.46e+02 2.1e-05 9.9e-05 4.7e-05
B2 | 428 800 5.73e+01 9.6e-05 9.9e-05 1.2e-04 | 257 908 5.73e+01 8.1e-05 9.9e-05 1.3e-04
BF | 64 339 1.42e+02 3.6e-05 9.9e-05 7.0e-05 | 62 341 1.42e+02 3.6e-05 9.9e-05 6.9e-05
BH | 476 1058 2.73e+01 9.6e-05 9.5e-05 3.7e-04 | 322 1026 2.73e+01 9.1e-05 9.8e-05 3.5e-04
BH3O | 727 437 1.31e+02 6.9e-05 9.9e-05 2.0e-04 | 530 458 1.31e+02 2.8e-05 9.9e-05 1.7e-04
BN | 144 748 9.33e+01 3.0e-05 9.9e-05 1.1e-04 | 93 739 9.33e+01 4.2e-05 9.9e-05 1.2e-04
BO | 91 377 1.17e+02 8.8e-05 9.9e-05 9.9e-05 | 80 375 1.17e+02 9.8e-05 9.2e-05 1.0e-04
BeF | 68 356 1.28e+02 5.3e-05 9.9e-05 9.0e-05 | 58 354 1.28e+02 5.5e-05 9.8e-05 9.1e-05
BeO | 91 488 1.02e+02 7.3e-05 9.9e-05 8.8e-05 | 71 479 1.02e+02 4.8e-05 9.9e-05 9.6e-05
C2 | 1763 831 9.10e+01 5.8e-05 9.9e-05 1.8e-04 | 1032 911 9.10e+01 5.2e-05 9.9e-05 1.7e-04
CF | 68 354 1.59e+02 3.5e-05 9.9e-05 3.6e-05 | 60 351 1.59e+02 3.8e-05 9.9e-05 3.4e-05
CH | 453 969 4.12e+01 9.7e-05 9.8e-05 3.6e-04 | 303 1009 4.12e+01 9.8e-05 9.6e-05 3.4e-04
CH2 | 1516 1350 4.50e+01 8.0e-05 9.9e-05 3.9e-04 | 754 1339 4.50e+01 8.3e-05 9.9e-05 3.6e-04
CH2 | 1698 1496 4.48e+01 9.8e-05 9.2e-05 3.4e-04 | 782 1428 4.48e+01 9.8e-05 9.5e-05 3.5e-04
CH3 | 2028 1113 4.94e+01 5.5e-05 9.9e-05 1.4e-04 | 1010 1149 4.94e+01 6.7e-05 9.9e-05 2.5e-04
CH3N | 756 450 1.27e+02 4.8e-05 9.9e-05 1.6e-04 | 491 456 1.27e+02 5.9e-05 9.9e-05 1.5e-04
CN | 83 439 1.11e+02 8.9e-05 9.9e-05 8.8e-05 | 68 437 1.11e+02 9.8e-05 9.5e-05 8.3e-05
CO+ | 73 379 1.35e+02 8.9e-05 9.9e-05 9.1e-05 | 60 377 1.35e+02 9.4e-05 9.9e-05 9.1e-05
CO | 63 328 1.35e+02 6.7e-05 9.8e-05 3.5e-05 | 54 325 1.35e+02 6.8e-05 9.9e-05 3.4e-05
F- | 265 648 9.96e+01 9.9e-05 9.5e-05 2.9e-04 | 189 639 9.96e+01 9.0e-05 9.9e-05 2.4e-04
H2O | 819 717 8.54e+01 9.5e-05 9.9e-05 3.4e-04 | 495 809 8.54e+01 9.8e-05 9.0e-05 3.4e-04
HF | 268 576 1.05e+02 9.9e-05 9.3e-05 2.2e-04 | 210 575 1.05e+02 9.6e-05 9.8e-05 2.1e-04
HLi2 | 189 622 1.91e+01 7.5e-05 9.8e-05 1.4e-04 | 126 656 1.91e+01 7.9e-05 9.8e-05 1.5e-04
HN2+ | 99 326 1.38e+02 7.7e-05 9.9e-05 7.3e-05 | 82 326 1.38e+02 8.7e-05 9.9e-05 7.1e-05
HNO | 251 451 1.60e+02 7.4e-05 9.9e-05 4.1e-05 | 207 446 1.60e+02 7.8e-05 9.9e-05 3.9e-05
LiF | 80 406 1.16e+02 4.4e-05 9.9e-05 9.9e-05 | 73 406 1.16e+02 4.5e-05 9.9e-05 9.8e-05
LiH | 472 1030 9.00e+00 9.0e-05 4.2e-05 1.1e-04 | 369 1037 9.00e+00 9.1e-05 5.4e-05 1.2e-04
LiOH | 141 468 9.57e+01 9.8e-05 8.5e-05 1.4e-04 | 107 452 9.57e+01 9.5e-05 9.6e-05 1.6e-04
NH | 443 943 5.86e+01 8.6e-05 9.9e-05 3.0e-04 | 298 946 5.86e+01 8.8e-05 9.9e-05 3.1e-04
NH | 405 870 5.86e+01 8.9e-05 9.9e-05 2.8e-04 | 290 937 5.86e+01 9.8e-05 9.1e-05 2.7e-04
NH2- | 967 849 6.32e+01 9.8e-05 9.3e-05 4.0e-04 | 502 854 6.32e+01 9.7e-05 9.7e-05 3.9e-04
NH3 | 3775 925 6.82e+01 7.3e-05 9.9e-05 3.2e-04 | 1819 967 6.82e+01 9.7e-05 9.8e-05 3.4e-04
NaH | 73 370 1.65e+02 8.9e-05 9.8e-05 1.7e-06 | 66 370 1.65e+02 8.9e-05 9.9e-05 5.0e-06
P | 176 410 3.41e+02 2.2e-05 9.9e-05 7.0e-05 | 155 412 3.41e+02 2.2e-05 9.8e-05 6.6e-05
SiH4 | 212 292 3.12e+02 2.8e-05 9.9e-05 3.8e-05 | 193 292 3.12e+02 2.8e-05 9.9e-05 2.8e-05

The semi-definite relaxation of the LUD problem is

(5.15) min_G  sum_{1 <= i < j <= K} (...), s.t. (...), ||G||_2 <= alpha*K,

where taking alpha sufficiently large means that there is no spectral norm constraint in (5.15). The number of rotation matrices K is chosen from 500, 1000, 1500, and 2000. The solution G is low-rank: indeed, it is a rank-3 matrix of the form R R^T. Note that for ADMM we have tested both the full eigenvalue decomposition and the truncated version, and we report the one which costs less time.

Table 17
Results of ADMM and PFAM on non-linear SDPs
LS model, alpha = 0.
K | ADMM: time itr pinf dinf mse | PFAM: time itr pinf dinf mse
500 | 36.1 232 9.9e-04 1.9e-04 1.46e-02 | 9.2 231 1.0e-03 1.9e-04 1.46e-02
1000 | 120.1 163 9.7e-04 6.9e-04 9.82e-03 | 21.2 163 9.7e-04 6.9e-04 9.82e-03
1500 | 426.6 189 9.9e-04 3.2e-04 6.07e-03 | 74.4 189 9.9e-04 3.2e-04 6.07e-03
2000 | 1189.6 202 9.9e-04 1.7e-04 4.42e-03 | 148.7 202 9.9e-04 1.7e-04 4.42e-03
LUD model, alpha = 0.
K | ADMM: time itr pinf dinf mse | PFAM: time itr pinf dinf mse
500 | 7.9 78 9.9e-04 8.4e-05 9.99e-03 | 3.8 83 9.6e-04 8.2e-05 9.99e-03
1000 | 44.4 99 9.4e-04 9.5e-04 3.06e-03 | 15.3 90 9.8e-04 4.6e-05 5.42e-03
1500 | 131.9 101 9.1e-04 6.1e-04 2.24e-03 | 47.8 118 9.1e-04 6.5e-04 2.24e-03
2000 | 334.5 102 9.3e-04 4.1e-04 1.91e-03 | 79.0 103 9.4e-04 4.2e-04 1.91e-03
LUD model, alpha = 0.
K | ADMM: time itr pinf dinf mse | PFAM: time itr pinf dinf mse
500 | 47.2 274 1.0e-03 2.7e-04 1.91e-03 | 14.9 277 9.9e-04 2.7e-04 1.91e-03
1000 | 294.3 356 1.0e-03 1.3e-04 1.74e-03 | 68.0 369 1.0e-03 1.3e-04 1.74e-03
1500 | 871.8 364 1.0e-03 1.2e-04 1.60e-03 | 165.7 316 1.0e-03 4.4e-04 1.63e-03
2000 | 2526.5 413 1.0e-03 3.2e-04 2.41e-03 | 347.2 373 1.0e-03 4.1e-04 2.41e-03

These examples again justify the effectiveness of PFAM, though the convergence property of the multi-block version is not yet clear.

6. Conclusion. In this paper, we propose a framework of polynomial-filtered methods for low-rank optimization problems. Our motivation is based on the key observation that the iteration points lie in a low-dimensional subspace of R^{n x n}. Therefore, the strategy is to extract this subspace approximately and then perform one update based on the projection of the current iteration point. Polynomials are applied to increase the accuracy of the extracted subspace. Intuitively, the target subspaces of two consecutive iterations should be close enough under suitable conditions. We then give two concrete examples, PFPG and PFAM, in order to show the basic structure of polynomial-filtered methods. It is easy to observe that this kind of method couples the subspace refinement and the main iteration together. In the theoretical part, we analyze the convergence of PFPG and PFAM. A key assumption is that the initial subspace should not be orthogonal to the target subspace to be used in the next iteration. Together with the Chebyshev polynomials, we are able to estimate the approximation error of the subspace.
The main convergence result indicates that the degree of the polynomial can remain a constant as the iterations proceed, which is meaningful in real applications. Even if the warm-start property is not considered, the degree grows very slowly (of order O(log k)) to ensure convergence. Our numerical experiments show that the polynomial-filtered algorithms are quite effective on low-rank problems compared with the original methods, since they successfully reduce the computational cost of large-scale EVDs. Meanwhile, the number of iterations is barely increased. These observations coincide with our theoretical results.

REFERENCES
[1] D. G. Anderson, Iterative procedures for nonlinear integral equations, Journal of the ACM, 12 (1965), pp. 547-560.
[2] S. Becker, V. Cevher, and A. Kyrillidis, Randomized low-memory singular value projection, arXiv preprint arXiv:1303.0167, (2013).
[3] J. Cai, E. Candès, and Z. Shen, A singular value thresholding algorithm for matrix completion, SIAM Journal on Optimization, 20 (2010), pp. 1956-1982.
[4] E. Candès, X. Li, and M. Soltanolkotabi, Phase retrieval via Wirtinger flow: theory and algorithms, arXiv e-prints, (2014).
[5] C. Davis and W. M. Kahan, The rotation of eigenvectors by a perturbation. III, SIAM Journal on Numerical Analysis, 7 (1970), pp. 1-46.
[6] D. Davis and W. Yin, Convergence rate analysis of several splitting schemes, in Splitting Methods in Communication, Imaging, Science, and Engineering, Springer, 2016, pp. 115-163.
[7] M. Friedlander and I. Macêdo, Low-rank spectral optimization via gauge duality, SIAM Journal on Scientific Computing, 38 (2016), pp. A1616-A1638.
[8] M. X. Goemans and D. P. Williamson, Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming, Journal of the ACM, 42 (1995), pp. 1115-1145.
[9] G. H. Golub and C. F. Van Loan, Matrix Computations, vol. 3, JHU Press, 2012.
[10] F. Kangal, K. Meerbergen, E. Mengi, and W. Michiels, A subspace method for large-scale eigenvalue optimization, SIAM Journal on Matrix Analysis and Applications, 39 (2018), pp. 48-82.
[11] D. Kressner, D. Lu, and B. Vandereycken, Subspace acceleration for the Crawford number and related eigenvalue optimization problems, SIAM Journal on Matrix Analysis and Applications, 39 (2018), pp. 961-982.
[12] Y. Li, Z. Wen, C. Yang, and Y. Yuan, A semi-smooth Newton method for solving semidefinite programs in electronic structure calculations, arXiv preprint arXiv:1708.08048, (2017).
[13] Z. Lin, M. Chen, and Y. Ma, The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices, Journal of Structural Biology, 181 (2013), pp. 116-127.
[14] X. Liu, Z. Wen, and Y. Zhang, An efficient Gauss-Newton algorithm for symmetric low-rank product matrix approximations, SIAM Journal on Optimization, 25 (2015), pp. 1571-1608.
[15] M. Nakata, B. J. Braams, K. Fujisawa, M. Fukuda, J. K. Percus, M. Yamashita, and Z. Zhao, Variational calculation of second-order reduced density matrices by strong n-representability conditions and an accurate semidefinite programming solver, The Journal of Chemical Physics, 128 (2008), p. 164113.
[16] N. Parikh, S. Boyd, et al., Proximal algorithms, Foundations and Trends in Optimization, 1 (2014), pp. 127-239.
[17] H. Qi and D. Sun, A quadratically convergent Newton method for computing the nearest correlation matrix, SIAM Journal on Matrix Analysis and Applications, 28 (2006), pp. 360-385.
[18] D. Scieur, A. d'Aspremont, and F. Bach, Regularized nonlinear acceleration, in Advances in Neural Information Processing Systems, 2016, pp. 712-720.
[19] P. Sirkovic and D. Kressner, Subspace acceleration for large-scale parameter-dependent Hermitian eigenproblems, SIAM Journal on Matrix Analysis and Applications, 37 (2016), pp. 695-718.
[20] M. Soltani and C. Hegde, Fast low-rank matrix estimation without the condition number, arXiv preprint arXiv:1712.03281, (2017).
[21] K.-C. Toh and S. Yun, An accelerated proximal gradient algorithm for nuclear norm regularized linear least squares problems, Pacific Journal of Optimization, 6 (2010), p. 15.
[22] B. Vandereycken, Low-rank matrix completion by Riemannian optimization, SIAM Journal on Optimization, 23 (2013), pp. 1214-1236.
[23] H. F. Walker and P. Ni, Anderson acceleration for fixed-point iterations, SIAM Journal on Numerical Analysis, 49 (2011), pp. 1715-1735.
[24] L. Wang, A. Singer, and Z. Wen, Orientation determination of cryo-EM images using least unsquared deviations, SIAM Journal on Imaging Sciences, 6 (2013), pp. 2450-2483.
[25] J. Wright, A. Ganesh, S. Rao, Y. Peng, and Y. Ma, Robust principal component analysis: exact recovery of corrupted low-rank matrices via convex optimization, in Advances in Neural Information Processing Systems 22, Curran Associates, Inc., 2009, pp. 2080-2088.
[26] Y. Ying and P. Li, Distance metric learning with eigenvalue optimization, Journal of Machine Learning Research, 13 (2012), pp. 1-26.
[27] T. Zhou and D. Tao, GoDec: randomized low-rank & sparse matrix decomposition in noisy case, in International Conference on Machine Learning, Omnipress, 2011.