Low-rank Matrix Optimization Using Polynomial-filtered Subspace Extraction
YONGFENG LI∗, HAOYANG LIU†, ZAIWEN WEN‡, AND YAXIANG YUAN§

∗ Beijing International Center for Mathematical Research, Peking University, China ([email protected]).
† Beijing International Center for Mathematical Research, Peking University ([email protected]).
‡ Beijing International Center for Mathematical Research, Peking University ([email protected]). Research supported in part by the NSFC grants 11421101 and 11831002, and by the National Basic Research Project under the grant 2015CB856002.
§ State Key Laboratory of Scientific and Engineering Computing, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, China ([email protected]). Research supported in part by NSFC grants 11331012 and 11461161005.

Abstract.
In this paper, we study first-order methods on a large variety of low-rank matrix optimization problems, whose solutions only live in a low dimensional eigenspace. Traditional first-order methods depend on the eigenvalue decomposition at each iteration, which takes most of the computation time. In order to reduce the cost, we propose an inexact algorithm framework based on a polynomial subspace extraction. The idea is to use an additional polynomial-filtered iteration to extract an approximated eigenspace, and project the iteration matrix on this subspace, followed by an optimization update. The accuracy of the extracted subspace can be controlled by the degree of the polynomial filters. This kind of subspace extraction also enjoys the warm-start property: the subspace of the current iteration is refined from the previous one. Then this framework is instantiated into two algorithms: the polynomial-filtered proximal gradient method and the polynomial-filtered alternating direction method of multipliers. We give a theoretical guarantee to the two algorithms that the polynomial degree is not necessarily very large. They share the same convergence speed as the corresponding original methods if the polynomial degree grows with an order Ω(log k) at the k-th iteration. If the warm-start property is considered, the degree can be reduced to a constant, independent of the iteration k. Preliminary numerical experiments on several low-rank matrix optimization problems show that the polynomial-filtered algorithms usually provide multi-fold speedups.

Key words.
Low-rank Matrix Optimization, Eigenvalue Decomposition, Inexact Optimization Method, Poly-nomial Filter, Subspace Extraction.
AMS subject classifications.
1. Introduction.
Eigenvalue decompositions (EVD) are commonly used in a large variety of matrix optimization problems with spectral or low-rank structures. For example, in semi-definite programming (SDP), many optimization methods need to compute all positive eigenvalues and corresponding eigenvectors of a matrix at each iteration in order to preserve the semi-definite structure. In matrix recovery type problems, people are interested in the low-rank approximations of the unknown matrix. In [25], the authors propose a convex relaxation model to solve the robust PCA problem and solve it by the accelerated proximal gradient (APG) approach. The primal problem is solved by the augmented Lagrangian method (ALM) in [13]. The application can also be regarded as an optimization problem on the rank-r matrix space, thus manifold optimization methods can be applied [22]. In all methods mentioned above, EVD or singular value decompositions (SVD) are required; the latter can be transformed to the former essentially. In maximal eigenvalue problems, the objective function is the largest eigenvalue of a symmetric matrix variable. It is used in many real applications, for example, the dual formulation of the max-cut problem [8], phase retrieval, blind deconvolution [7], the distance metric learning problem [26], and Crawford number computation [11]. These optimization methods require the largest eigenvalue and the corresponding eigenvector at each iteration, and the variable dimension is usually large.

In general, at least one full or truncated EVD per iteration is required in most first-order methods to solve these applications. It has been long realized that EVD is very time-consuming for large problems, especially when the dimension of the matrix and the number of the required eigenvalues are both huge. First-order methods suffer from this issue greatly since they usually take thousands of iterations to converge. Therefore, people turn to inexact methods to save the computation time. One popular approach relies on approximated eigen-solvers to solve the eigenvalue sub-problem, such as the Lanczos method and randomized EVD with early stopping rules, see [27, 2, 20] and the references therein. The performance of these methods is determined by the accuracy of the eigen-solver, which is sometimes hard to control in practice. Another type of method is the so-called subspace method. By introducing a low-dimensional subspace, one can greatly reduce the dimension of the original problem, and then perform refinement on this subspace as the iterations proceed. For instance, it is widely used to solve a univariate maximal eigenvalue optimization problem [11, 19, 10].
In [28], the authors propose an inexact method which simplifies the eigenvalue computation using Chebyshev filters in self-consistent-field (SCF) calculations. Though inexact methods are widely used in real applications, the convergence analysis is still very limited. Moreover, designing a practical strategy to control the inexactness of the algorithm is a big challenge. Our contributions are briefly summarized as follows.

1. For low-rank matrix optimization problems involving eigenvalue computations, we propose a general inexact first-order method framework with polynomial filters which can be applied to most existing solvers without much difficulty. It can be observed that for low-rank problems, the iterates always lie in a low dimensional eigenspace. Hence our key motivation is to use one polynomial-filtered subspace extraction step to estimate the subspace, followed by a standard optimization update. The algorithm also benefits from the warm-start property: the updated iteration point can be fed to the polynomial filter again to generate the next estimation. Then we apply this framework to the proximal gradient method (PG) and the alternating direction method of multipliers (ADMM) to obtain a polynomial-filtered PG (PFPG) method and a polynomial-filtered ADMM (PFAM) method (see Section 2), respectively.

2. We analyze the convergence properties of PFPG and PFAM. It can be proved that the error between one exact and one inexact iteration is bounded by the principal angle between the true and extracted subspaces, which is further controlled by the polynomial degree. The convergence relies on an assumption that the initial space should not be orthogonal to the target space, which is essential but usually holds in many applications. Consequently, the polynomial degree barely increases during the iterations. It can even remain a constant under the warm-start setting. This result provides us the opportunity to use low-degree polynomials throughout the iterations in practice.

3. We propose a practical strategy to control the accuracy of the polynomial filters so that the convergence is guaranteed. A portable implementation of the polynomial filter framework is given based on this strategy. It can be plugged into any first-order method with only a few lines of code added, which means that the computational cost reduction is almost a free lunch.

We mention that our work is different from randomized eigen-solver based methods [27, 2, 20]. The inexactness of our method mainly depends on the quality of the subspace, which can be highly controllable using polynomial filters and the warm-start strategy. On the other hand, the convergence of randomized eigen-solver based methods relies on the tail bound or the per vector error bound of the solution returned by the eigen-solver, which is usually stronger than subspace assumptions. Our work also differs from the subspace method proposed in [11, 10]. First, those authors mainly focus on maximal eigenvalue problems, which have special structures, while we consider general matrix optimization problems with low-rank variables. Second, in the subspace method, one has to compute the exact solution of the sub-problem, which is induced by the projection of the original one onto the subspace. In our framework, after we obtain the subspace, the next step is to extract eigenvalues and eigenvectors from the projected matrix to generate inexact variable updates; there are no sub-problems in general. The matlab codes for several low-rank matrix optimization examples are available at https://github.com/RyanBernX/PFOpt.
The paper is organized as follows. In Section 2, we introduce the polynomial filter algorithm and propose two polynomial-filtered optimization methods, namely, PFPG and PFAM. Then the convergence analysis is established in Section 3. Some important implementation details that make the proposed algorithms practical are summarized in Section 4. The effectiveness of PFPG and PFAM is demonstrated in Section 5 with a number of practical applications. Finally, we conclude the paper in Section 6.
Let S^n be the collection of all n-by-n symmetric matrices. For any X ∈ S^n, we use λ(X) ∈ R^n to denote all eigenvalues of X, which are permuted in descending order, i.e., λ(X) = (λ_1, . . . , λ_n)^T where λ_1 ≥ λ_2 ≥ · · · ≥ λ_n. For any matrix X ∈ R^{n×n}, diag(X) denotes a vector in R^n consisting of all diagonal entries of X. For any vector x ∈ R^n, Diag(x) denotes a diagonal matrix in R^{n×n} whose i-th diagonal entry is x_i. We use ⟨A, B⟩ to define the inner product in the matrix space R^{n×n}, i.e., ⟨A, B⟩ := Tr(A^T B) where Tr(X) := Σ_{i=1}^n X_{ii} for X ∈ R^{n×n}. The corresponding Frobenius norm is defined as ‖A‖_F = (Tr(A^T A))^{1/2}. The Hadamard product of two matrices or vectors of the same dimension is denoted by A ⊙ B with (A ⊙ B)_{ij} = A_{ij} × B_{ij}. For any matrix X ∈ R^{n×p}, Range(X) denotes the subspace spanned by the columns of X.
2. Polynomial Filtered Matrix Optimization.

2.1. Chebyshev-filtered Subspace Iteration.
The idea of polynomial filtering originates from a well-known fact that polynomials are able to manipulate the eigenvalues of any symmetric matrix A while keeping its eigenvectors unchanged. Suppose A has the eigenvalue decomposition A = U Diag(λ_1, . . . , λ_n) U^T; then the matrix ρ(A) has the eigenvalue decomposition ρ(A) = U Diag(ρ(λ_1), . . . , ρ(λ_n)) U^T.

Consider the traditional subspace iteration U ← AU. The convergence of the desired eigen-subspace is determined by the gap of the eigenvalues, which can be very slow if the gap is nearly zero. This is where the polynomial filters come into practice: they manipulate the eigenvalue gap aiming at a better convergence. For any polynomial ρ(t) and initial matrix U, the polynomial-filtered subspace iteration is given by U ← ρ(A)U.

In general, there are many choices of ρ(t). One popular choice is to use Chebyshev polynomials of the first kind, whose explicit expression can be written as

(2.1)    T_d(t) = cos(d arccos t)  for |t| ≤ 1,
         T_d(t) = ( (t − √(t² − 1))^d + (t + √(t² − 1))^d ) / 2  for |t| > 1,

where d is the degree of the polynomial. One important property of Chebyshev polynomials is that they grow fast outside the interval [−1, 1], while suppressing all eigenvalues inside this interval effectively. Figure 1 shows some examples of the Chebyshev polynomials T_d(t) with d = 1, . . . , 6, together with powers of a Chebyshev polynomial T_d(t)^q with q = 1, . . . , 5. That is exactly why Chebyshev polynomials are taken into our consideration.
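To make the filtering step concrete, the following MATLAB sketch (our own illustration, not taken from the authors' released code) applies T_d to a block of vectors through the three-term recurrence T_{j+1}(t) = 2t T_j(t) − T_{j−1}(t), so that ρ(A)U is formed with d matrix–block products and ρ(A) is never built explicitly.

```matlab
% Apply the degree-d Chebyshev polynomial to a block of vectors:
% Y = T_d(A)*U, computed via the recurrence T_{j+1} = 2*A*T_j - T_{j-1}.
% Only matrix-block products with A are required.
function Y = cheb_apply(A, U, d)
    T0 = U;                       % T_0(A)*U
    if d == 0, Y = T0; return; end
    T1 = A * U;                   % T_1(A)*U
    for j = 2:d
        T2 = 2 * (A * T1) - T0;   % T_j(A)*U
        T0 = T1;
        T1 = T2;
    end
    Y = T1;
end
```

For a filter aimed at a general interval, A is first shifted and scaled as in (2.2) below before the recurrence is applied.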
Fig. 1. Chebyshev polynomials and their variants: (a) Chebyshev polynomials of the first kind T_d(t), d = 1, . . . , 6; (b) powers of a Chebyshev polynomial T_d(t)^q, q = 1, . . . , 5.
In order to suppress all eigenvalues in a general interval [a, b], a linear mapping from [a, b] to [−1, 1] is constructed. To be precise, the polynomial can be chosen as

(2.2)    ρ(t) = T_d( (t − (b + a)/2) / ((b − a)/2) ).

After the polynomial filters are applied, an orthogonalization step usually follows in order to prevent the subspace from losing rank. This step is often performed by a single QR decomposition. Consequently, given an arbitrary matrix U ∈ R^{n×p} and a polynomial ρ(t), the polynomial-filtered subspace iteration can be simply written as

(2.3)    U^+ = orth(ρ(A) U),

where "orth" stands for the orthogonalization operation. In most cases, the updated matrix U^+ spans a subspace which contains some approximated eigenvectors of A. By extracting eigen-pairs from this subspace, we are able to derive a number of polynomial-filtered inexact optimization methods. In the next subsections we present two examples: PFPG and PFAM.

2.2. Polynomial-filtered Proximal Gradient Method. In this subsection we show how to apply the subspace update (2.3) to the proximal gradient method, on a set of problems with a general form. Consider the following unconstrained spectral operator minimization problem

(2.4)    min_x  h(x) := F(x) + R(x),

where F(x) has a composition form F(x) = f ∘ λ(B(x)) with B(x) = G + A^*(x), and R(x) is a regularizer with simple structure which need not be smooth. Here G is a known matrix in S^n, and A^*: R^m → S^n is a linear operator. The function f: R^n → R is smooth and absolutely symmetric, i.e., f(x) = f(Px) for all x ∈ R^n and any permutation matrix P ∈ R^{n×n}. It can be easily verified that F(x) is well defined.

According to [16, Section 6.7.2], the gradient of F in (2.4) is

(2.5)    ∇F(x) = A(Ψ(B(x))),

where Ψ is a spectral operator defined by Ψ(X) = V Diag(∇f(λ(X))) V^T, and V Diag(λ(X)) V^T is the full eigen-decomposition of X.

For any matrix S ∈ R^{n×p}, let P_S denote the projection operator onto the subspace Range(S). When S is an orthogonal matrix, the operator P_S is defined by

(2.6)    P_S(X) = S S^T X S S^T.

If V is exactly a p-dimensional eigen-space of the matrix X, the eigenvalue decomposition of P_V(X) can be written as

(2.7)    P_V(X) = [Ṽ  V_⊥] [Λ_p  O; O  O] [Ṽ  V_⊥]^T,

where (Ṽ, Λ_p) are the corresponding eigen-pairs and Range(V) = Range(Ṽ). Note that (Ṽ, Λ_p) has a closed-form expression with Ṽ = V Y, where Y Λ_p Y^T is the full eigenvalue decomposition of H := V^T X V ∈ R^{p×p}.

Suppose evaluating ∇F only involves a small number of eigenvalues in the problem (2.4); then we are able to compute ∇F quickly as long as the corresponding subspace is given. We summarize these conditions in Assumption 1.

Assumption 1.
(i) Let I(x) ⊂ [n] be an integer set with I(x) = {s, s + 1, . . . , t} and suppose that the gradient of F(x) in (2.4) has the relationship

(2.8)    ∇F(x) = ḡ := A(Φ(P_{V_I}(B(x)))),

where Φ is a spectral operator and V_I ∈ R^{n×|I|} contains the eigenvectors v_i, i ∈ I(x), of B(x).
(ii) Φ is Lipschitz continuous with a factor L_Φ, i.e.,

(2.9)    ‖Φ(X) − Φ(Y)‖_F ≤ L_Φ ‖X − Y‖_F.

We now compare the gradient expression (2.8) in Assumption 1 with the original gradient (2.5). Expression (2.8) implies that the subspace V_I contains all information for evaluating ∇F(x). In other words, we do not need full eigenvalue decompositions as indicated by (2.5). The spectral operator Φ(·) in (2.8) is also different from Ψ(·) defined in (2.5). The choice of Φ(·) can be arbitrary as long as it satisfies Assumption 1. In most cases Φ(·) inherits the definition of Ψ(·), but they need not be the same.
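The extraction step behind (2.7) only involves a small p-by-p eigenvalue problem. A minimal MATLAB sketch (our own naming, assuming U has orthonormal columns) is:

```matlab
% Eigen-pairs of the projected matrix P_U(X) = U*U'*X*U*U', cf. (2.7):
% they are obtained from the small p-by-p matrix H = U'*X*U.
function [Vtil, lam] = subspace_eig(X, U)
    H = U' * (X * U);                     % projected p-by-p matrix
    H = (H + H') / 2;                     % symmetrize against round-off
    [Y, D] = eig(H);                      % full EVD of the small matrix
    [lam, idx] = sort(diag(D), 'descend');
    Vtil = U * Y(:, idx);                 % Ritz vectors spanning Range(U)
end
```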
Under Assumption 1, we can write the proximal gradient algorithm as

(2.10)    x^{k+1} = prox_{τ_k R}( x^k − τ_k A(Φ(P_{V^k_{I_k}}(B(x^k)))) ),

where V^k_{I_k} contains the eigenvectors of B(x^k) at iteration k, τ_k is the step size, and the proximal operator is defined by

(2.11)    prox_{tR}(x) = argmin_u  R(u) + (1/(2t)) ‖u − x‖².

In the scheme (2.10), the main cost is the computation of the EVD of P_{V^k_{I_k}}(B(x^k)), as the exact subspace V^k_{I_k} is unknown. Thus, an idea to reduce the expense is to use polynomial filters to generate an approximation of the eigen-space. Then it is easy to compute the projection onto this space and the corresponding eigen-decomposition, whence the image of the spectral operator on the projection matrix, as shown in (2.7). To obtain a good approximation, the update of the eigen-space is by means of Chebyshev polynomials which can suppress all unwanted eigenvectors of B(x), and the eigen-space from the last step is used as the initial value. In summary, the proximal gradient method with polynomial filters can be written as

(2.12)    x^{k+1} = prox_{τ_k R}( x^k − τ_k A(Φ(P_{U^k}(B(x^k)))) ),
(2.13)    U^{k+1} = orth( ρ_{k+1}^{q_{k+1}}(B(x^{k+1})) U^k ),

where U^k ∈ R^{n×p} with p ≥ |I_k| is an orthogonal matrix which serves as an approximation of V^k_{I_k} at each step, and q_k ≥ 1 indicates that the polynomial filter is applied q_k times before the orthogonalization.

We mention that (2.12) and (2.13) are actually two different updates that couple together. Given a subspace spanned by U^k, the step (2.12) performs the proximal gradient descent with the approximated gradient extracted from U^k. The next step (2.13) makes refinement on the subspace to obtain U^{k+1} via polynomial filters. The subspace will become more and more exact as the iterations proceed. In practice, it is usually observed that U^k is very "close" to the true eigen-space V^k_{I_k} during the last iterations of the algorithm.
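Putting the two updates together, a schematic MATLAB loop for (2.12)–(2.13) might look as follows. Here Bfun, grad_sub, prox_R and filt are problem-dependent function handles of our own naming (for instance, filt can be built from cheb_apply above combined with the shift and scaling in (2.2)); this is a sketch, not the interface of the released package.

```matlab
% Schematic PFPG iteration (2.12)-(2.13).
%   Bfun(x)        returns B(x) = G + A*(x)
%   grad_sub(B,U)  returns A(Phi(P_U(B))), e.g. built on subspace_eig above
%   prox_R(z,t)    proximal operator of R with step t
%   filt(B,U)      applies the polynomial filter rho(B) to the block U
function [x, U] = pfpg_sketch(x, U, Bfun, grad_sub, prox_R, filt, tau, maxit)
    for k = 1:maxit
        g = grad_sub(Bfun(x), U);           % approximate gradient, cf. (2.12)
        x = prox_R(x - tau * g, tau);       % proximal gradient step
        [U, ~] = qr(filt(Bfun(x), U), 0);   % subspace refinement + orth, (2.13)
    end
end
```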
2.3. Polynomial-filtered ADMM. Consider the following standard SDP:

(2.14)    min ⟨C, X⟩,   s.t.  A(X) = b,  X ⪰ 0,

where A is a bounded linear operator. Let F(X) = 1_{X ⪰ 0}(X) and G(X) = 1_{A(X) = b}(X) + ⟨C, X⟩, where 1_Ω(X) is the indicator function of a set Ω. Then the Douglas-Rachford splitting (DRS) method on the primal SDP (2.14) can be written as Z^{k+1} = T_DRS(Z^k), where

T_DRS = prox_{tG}(2 prox_{tF} − I) − prox_{tF} + I,

which is equivalent to the ADMM on the dual problem. The explicit forms of prox_{tF}(Z) and prox_{tG}(Y) can be written as

prox_{tF}(Z) = P_+(Z),    prox_{tG}(Y) = (Y + tC) − A^*(AA^*)^{-1}(A(Y) + tA(C) − b),

where P_+(Z) is the projection operator onto the positive semi-definite cone.

Similar to the PFPG method, P_+(Z) is only determined by the eigen-space spanned by the positive eigenvectors of Z, so we are able to use the polynomial filters to extract an approximation U^k. Therefore, the PFAM can be written as

(2.15)    Z^{k+1} = prox_{tG}(2 P_+(P_{U^k}(Z^k)) − Z^k) − P_+(P_{U^k}(Z^k)) + Z^k,
(2.16)    U^{k+1} = orth( ρ_{k+1}^{q_{k+1}}(Z^{k+1}) U^k ),

where U^k ∈ R^{n×p} with p ≥ |I| is an orthogonal matrix and q_k ≥ 1 plays the same role as in (2.13).
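Analogously, one PFAM sweep (2.15)–(2.16) can be sketched as below; prox_G is the closed-form proximal map given above, psd_part(Z,U) returns P_+(P_U(Z)) by keeping the positive Ritz pairs of U'*Z*U, and filt is the same Chebyshev filter. The handle names are ours, chosen for illustration.

```matlab
% Schematic PFAM iteration (2.15)-(2.16) for the SDP (2.14).
function [Z, U] = pfam_sketch(Z, U, prox_G, psd_part, filt, maxit)
    for k = 1:maxit
        P = psd_part(Z, U);               % approximation of P_+(Z^k)
        Z = prox_G(2 * P - Z) - P + Z;    % DRS/ADMM update, cf. (2.15)
        [U, ~] = qr(filt(Z, U), 0);       % subspace refinement, cf. (2.16)
    end
end
```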
3. Convergence Analysis.

3.1. Preliminary.
In this section we introduce some basic notations and tools that are used in our analysis. Firstly, we introduce the definition of principal angles to measure the distance between two subspaces.

Definition 1. Let X, Y ∈ R^{n×p} be two orthogonal matrices. The singular values of X^T Y are σ_1 ≥ σ_2 ≥ · · · ≥ σ_p, which lie in [0, 1]. The principal angle between Range(X) and Range(Y) is defined by

(3.1)    Θ(X, Y) := Diag(arccos σ_1, . . . , arccos σ_p).

In particular, we define sin Θ by taking the sine of Θ componentwise.
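Definition 1 is straightforward to evaluate numerically; a small sketch (our own helper) computes the sines of the principal angles from the singular values of X^T Y. Its largest entry is the quantity ‖sin Θ(X, Y)‖ (spectral norm) that appears throughout the analysis below.

```matlab
% Sines of the principal angles between Range(X) and Range(Y), cf. (3.1).
% X and Y are assumed to have orthonormal columns.
function s = sin_principal_angles(X, Y)
    sigma = svd(X' * Y);             % cosines of the principal angles
    sigma = min(max(sigma, 0), 1);   % clip round-off outside [0, 1]
    s = sqrt(1 - sigma.^2);          % componentwise sine
end
```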
The following lemma describes a very important property of Chebyshev polynomials.

Lemma 2. The Chebyshev polynomials increase fast outside [−1, 1], i.e.,

(3.2)    T_d(±(1 + ε)) ≥ 2^{d min{√ε, 1} − 1},   ∀ ε > 0.

Proof. It follows from the expression (2.1) that

T_d(±(1 + ε)) ≥ (1/2)(1 + ε + √(2ε + ε²))^d ≥ (1/2)(1 + √ε)^d = (1/2) e^{d log(1 + √ε)} ≥ 2^{d min{√ε, 1} − 1},

where the last inequality follows from log(1 + x) ≥ log 2 · min{x, 1}.

3.2. Convergence Analysis for General Convex Problems. In this subsection, we analyze the convergence for general convex problems. The relative gap is defined by

(3.3)    G_k = min_{i ∈ I_k} ( |λ_i − (a_k + b_k)/2| / ((b_k − a_k)/2) − 1 ),

where [a_k, b_k] is the interval suppressed by the Chebyshev polynomials at the k-th iteration and I_k is the index set of the eigenvalues beyond [a_k, b_k]. Throughout this section, we make the following assumptions.

Assumption 2.
(i) ‖sin Θ(V^{k+1}_{I_{k+1}}, U^k)‖ < γ for all k with γ < 1, where V^k_{I_k} and U^k are defined in Assumption 1 and (2.13) at iteration k.
(ii) The sequence generated by iterations (2.12) and (2.13) is bounded, i.e., ‖x^k‖ ≤ C for all k.
(iii) The relative gap (3.3) has a lower bound, i.e., G_k ≥ l for all k.

Assumption 2 (i) implies that the initial eigen-space is not orthogonal to the truly wanted eigen-space, so that we can obtain V^{k+1}_{I_{k+1}} by performing subspace iterations on U^k. Essentially, this property is required by almost all iterative eigensolvers for finding the correct eigenspace at each iteration. Assumption 2 (ii) is commonly used in the optimization literature. Assumption 2 (iii) is a key assumption which guarantees that the relative gap G_k exists in the asymptotic sense. Therefore, the polynomial filters are able to separate the wanted eigenvalues from the unwanted ones. In eigenvalue computation, this property is often enforced by adding a sufficient number of guarding vectors so that the corresponding eigenvalue gap is large enough.

The following lemma gives an error estimation when using an inexact gradient of F. It states that the polynomial degree d_k should be proportional to log(1/ε), where ε is a given tolerance.

Lemma 3. Suppose that Assumptions 1 and 2 hold. Define the error between the exact and inexact gradients:

(3.4)    e^k = A(Φ(P_{U^k}(B(x^k)))) − ∇F(x^k).

Let the polynomial filter ρ_k(t) be a Chebyshev polynomial with degree d_k. To achieve ‖e^k‖ ≤ ε for a given tolerance 0 < ε < 1, the degree d_k should satisfy

(3.5)    d_k = Ω( log(1/ε) / min{√l, 1} ).

Proof.
The update (2.1) can be regarded as one orthogonal iteration. The eigenvalue of ρ q k k ( B ( x k )) is ρ q k k ( λ i ), where λ i ’s are the eigenvalues of B ( x k ). According to Lemma 2, the eigen-values have the following distribution after we apply the Chebyshev polynomial:min i ∈I k ρ q k k ( λ i ) ≥ q k d k min {√ l, }− q k and max i/ ∈I k ρ q k k ( λ i ) ≤ . It follows from [9, Theorem 8.2.2] that(3.6) (cid:107) sin Θ( V k I k , U k ) (cid:107) ≤ q k q k d k min { l, } γ (cid:112) − γ . Due to the boundness of A and the Lipschitz continuity of Φ, we have(3.7) (cid:107) e k (cid:107) ≤ c (cid:107)P U k ( B ( x k )) − P V k I k ( B ( x k )) (cid:107) F , where c is a constant. It is shown in [9, Theorem 2.5.1] that (cid:107) sin Θ( X, Y ) (cid:107) = (cid:107) XX T − Y Y T (cid:107) . Using this identity, we have (cid:107)P U k ( B ( x k )) − P V k I k ( B ( x k )) (cid:107) F ≤(cid:107)P U k ( B ( x k )) − U k ( U k ) T ( B ( x k )) V k I k ( V k I k ) T (cid:107) F + (cid:107) U k ( U k ) T ( B ( x k )) V k I k ( V k I k ) T − P V k I k ( B ( x k )) (cid:107) F ≤ ||B ( x k ) || F (cid:107) sin Θ( V k I k , U k ) (cid:107) ≤ c · q k q k d k min { l, } , (3.8)where c is a constant depending on γ . The second inequality is due to the fact that (cid:107) AB (cid:107) F ≤(cid:107) A (cid:107) (cid:107) B (cid:107) F and the last inequality follows from (3.6) and the boundedness of A and x k . It is easyto verify that (cid:107) e k (cid:107) ≤ ε if (3.5) holds for any ε ∈ (0 , e claim that ∇ F ( x ) is Lipschitz continuous due to the boundness of A and x k and theLipschitz continuity of Φ. In the following part, we define the Lipschitz constant of ∇ F ( x ) by L .The following theorem gives the convergence of the PFPG method. Theorem Suppose that Assumptions 1 and 2 hold. Let τ k = τ ≤ L and ¯ x K = K (cid:80) Kk =1 x k .Then the convergence lim K →∞ h (¯ x K ) = h ( x ∗ ) holds if (3.9) d k = Ω (cid:18) log k min { l, } (cid:19) . Proof.
For simplicity we define the noisy residual function by r τ ( x, e ) = 1 τ ( x − prox τR ( x − τ ( ∇ F ( x ) + e ))) . Then the update (2.12) can be written as x k − x k +1 = τ k r τ k ( x k , e k ) . According to the Lipschitz continuity of ∇ F , we have h ( x k +1 ) ≤ F ( x k ) − τ k ∇ F ( x k ) T ( r τ k ( x k , e k )) + Lτ k (cid:107) r τ k ( x k , e k ) (cid:107) + R ( x k +1 ) . By the convexity of F ( x ) and R ( x ) and r τ k ( x, e k ) − ∇ F ( x k ) − e k ∈ ∂R ( x k +1 ), we have h ( x k +1 ) ≤ F ( x ∗ ) + ∇ F ( x k ) T ( x k − x ∗ ) − τ k ∇ F ( x k ) T ( r τ k ( x k , e k )) + Lτ k (cid:107) r τ k ( x k , e k ) (cid:107) + R ( x ∗ ) + ( r τ k ( x k , e k ) − ∇ F ( x k ) − e k ) T ( x k +1 − x ∗ ) . Setting the step size τ k = τ ≤ L yields h ( x k +1 ) − h ( x ∗ ) ≤ ∇ F ( x k ) T ( x k − x ∗ ) − τ ∇ F ( x k ) T ( r τ ( x k , e k )) + τ (cid:107) r τ ( x k , e k ) (cid:107) +( r τ ( x k , e k ) − ∇ F ( x k ) − e k ) T ( x k +1 − x ∗ ) . After rearranging and applying the boundness of the sequence x k , we have h ( x k +1 ) − h ( x ∗ ) ≤ τ ( (cid:107) x k − x ∗ (cid:107) − (cid:107) x k +1 − x ∗ (cid:107) ) + ( C + (cid:107) x ∗ (cid:107) ) (cid:107) e k (cid:107) , where C is a constant. Then it follows that(3.10) h (¯ x K ) − h ( x ∗ ) ≤ K K (cid:88) k =1 ( h ( x k ) − h ( x ∗ )) ≤ Kτ (cid:107) x − x ∗ (cid:107) + C + (cid:107) x ∗ (cid:107) K K (cid:88) k =1 (cid:107) e k (cid:107) . According to Lemma 3, by setting d k = Ω (cid:16) log k min { l, } (cid:17) at iteration k , we have the upper bound (cid:80) Kk =1 (cid:107) e k (cid:107) ≤ (cid:80) Kk =1 /k . Thus the right hand side of (3.10) goes to zero as K → ∞ , whichcompletes the proof. he following theorem gives a linear convergence rate under the strongly convex assumption of F ( x ). Theorem Suppose that Assumptions 1 and 2 hold, and F ( x ) is µ -strongly convex. Let τ k = τ ≤ L and w ≤ µ . Then a linear convergence rate is achieved if (3.11) d k = Ω (cid:18) − log( w (cid:107) x k +1 − x ∗ (cid:107) )min { l, } (cid:19) . Proof.
The quadratic bound in the proof of Theorem 4 gives h ( x k +1 ) ≤ F ( x k ) − τ k ∇ F ( x k ) T ( r τ k ( x k , e k )) + Lτ k (cid:107) r τ k ( x k , e k ) (cid:107) + R ( x k +1 ) . By the stongly convexity of F ( x ), the convexity of R ( x ) and r τ k ( x k , e k ) − ∇ F ( x k ) − e k ∈ ∂R ( x k +1 ),we have h ( x k +1 ) ≤ F ( x ∗ ) + ∇ F ( x k ) T ( x k − x ∗ ) − µ (cid:107) x k − x ∗ (cid:107) − τ k ∇ F ( x k ) T ( r τ k ( x k , e k ))+ Lτ k (cid:107) r τ k ( x k , e k ) (cid:107) + R ( x ∗ ) + ( r τ k ( x k , e k ) − ∇ F ( x k ) − e k ) T ( x k +1 − x ∗ ) . Under the choice of step size τ k = τ ≤ L , we have h ( x k +1 ) − h ( x ∗ ) ≤ ∇ F ( x k ) T ( x k − x ∗ ) − τ ∇ F ( x k ) T ( r τ ( x k , e k )) + τ (cid:107) r τ ( x k , e k ) (cid:107) − µ (cid:107) x k − x ∗ (cid:107) + ( r τ ( x k , e k ) − ∇ F ( x k ) − e k ) T ( x k +1 − x ∗ )= 12 τ ((1 − µτ ) (cid:107) x k − x ∗ (cid:107) − (cid:107) x k +1 − x ∗ (cid:107) ) − ( e k ) T ( x k +1 − x ∗ ) ≤ τ ((1 − µτ ) (cid:107) x k − x ∗ (cid:107) − (cid:107) x k +1 − x ∗ (cid:107) ) + (cid:107) e k (cid:107) (cid:107) x k +1 − x ∗ (cid:107) . Together with h ( x k +1 ) ≥ h ( x ∗ ), we have(3.12) (cid:107) x k +1 − x ∗ (cid:107) ≤ (1 − µτ ) (cid:107) x k − x ∗ (cid:107) + 2 τ (cid:107) e k (cid:107) (cid:107) x k +1 − x ∗ (cid:107) . The choice of d k guarantees (cid:107) e k (cid:107) ≤ w (cid:107) x k +1 − x ∗ (cid:107) . Plugging this upper bound of (cid:107) e k (cid:107) into (3.12)yields (cid:107) x k +1 − x ∗ (cid:107) ≤ η (cid:107) x k − x ∗ (cid:107) , where η = − µτ − wτ <
1, which completes the proof.
3.3. Convergence Analysis under the Warm-Start Setting. In this subsection, we analyze the linear convergence of PFPG under the warm-start setting. Suppose that the values ρ_k(λ_{s_i}(B(x^k))) are arranged in decreasing order, and define

(3.13)    η_k = ρ_k(λ_{s_{p+1}}(B(x^k))) / ρ_k(λ_{s_p}(B(x^k))).

We should point out that under Assumption 3, η_k is strictly smaller than 1 since G_k ≥ l > 0. An additional assumption is needed if we intend to refine the subspace from the previous one.

Assumption 3. ‖sin Θ(V^{k+1}_{I_{k+1}}, V^k_{I_k})‖ ≤ c ‖x^{k+1} − x^k‖ for all k.

Assumption 3 means that the eigen-spaces at two consecutive iteration points are close enough. This is necessary since we use the eigen-space at the previous step as the initial value of the polynomial filter. This assumption is satisfied when the gap is large enough, by the Davis-Kahan sin Θ theorem [5]. The next lemma shows the relationship between the two iterations in a recursive form. It plays a key role in the proof of the main convergence theorem.
Lemma Suppose that Assumptions 1, 2 and 3 hold. Let X be the set of optimal solutionsof the problem (2.4) , and dist( x, X ) = inf y ∈X (cid:107) x − y (cid:107) be the distance between x and X . Then wehave (3.14) (cid:107) sin Θ( V k +1 I k +1 , U k +1 ) (cid:107) ≤ η q k k +1 ( c (cid:107) sin Θ( V k I k , U k ) (cid:107) + c dist( x k , X )) , where c and c are constants.Proof. The update (2.3) can be seen as one iteration in the orthogonal iteration. Then it followsfrom [9, Theorem 8.2.2] that (cid:107) sin Θ( V k +1 I k +1 , U k +1 ) (cid:107) ≤ η q k k +1 (cid:107) tan Θ( V k +1 I k +1 , U k ) (cid:107) ≤ η q k k +1 (cid:112) − γ (cid:107) sin Θ( V k +1 I k +1 , U k ) (cid:107) , According to (3.7) and (3.8) in Lemma 3, we have (cid:107) e k (cid:107) ≤ c (cid:107) sin Θ( V k I k , U k ) (cid:107) , where c is a constant. The upper bound of (cid:107) x k − x k +1 (cid:107) is given by (cid:107) x k − x k +1 (cid:107) ≤ τ k ( (cid:107) r τ k ( x k , e k ) − r τ k ( x k , (cid:107) + (cid:107) r τ k ( x k , (cid:107) ) ≤ τ k (cid:107) e k (cid:107) + ω dist( x k , X ) ≤ τ k c (cid:107) sin Θ( V k I k , U k ) (cid:107) + ω dist( x k , X ) , where the second inequality is due to the monotonity of the proximal operator and the cocoercivitywith modulus ω of the residual function r τ k ( x k , (cid:107) sin Θ( V k +1 I k +1 , U k +1 ) (cid:107) ≤ η q k k +1 (cid:112) − γ (cid:107) sin Θ( V k +1 I k +1 , U k ) (cid:107) ≤ η q k k +1 (cid:112) − γ ( (cid:107) sin Θ( V k I k , U k ) (cid:107) + (cid:107) sin Θ( V k +1 I k +1 , V k I k (cid:107) ) ≤ η q k k +1 (cid:112) − γ ( (cid:107) sin Θ( V k I k , U k ) (cid:107) + c (cid:107) x k − x k +1 (cid:107) ) ≤ η q k k +1 (cid:112) − γ ((1 + τ k c c ) (cid:107) sin Θ( V k I k , U k ) (cid:107) + c ω dist( x k , X )) , where the third inequality is from (iii) in Assumption 3. This completes the proof.Lemma 6 means that the principle angle between V k +1 I k +1 and U k +1 are determined by the anglein the previous iteration and how far x k is from the optimal set X . As an intuitive interpretation,the polynomial-filtered subspace becomes more and more accurate when PFPG is close to converge. e are now ready to state the main result. It implies that the PFPG method has a linearconvergence rate if the polynomial degree is a constant, independent of the iteration k . Theorem Suppose that Assumptions 1, 2 and 3 holds, and the exact proximal gradientmethod for (2.4) has a linear convergence rate, i.e., (3.15) dist(prox τR ( x k − τ k ∇ F ( x k )) , X ) ≤ ν dist( x k , X ) , ν ∈ (0 , . If η k +1 (3.13) satisfies (3.16) ν + η q k k +1 c (cid:115)(cid:18) ν + η q k k +1 c (cid:19) + η q k k +1 ( τ k c c − νc ) < ρ < , where c , c , c are constants in Lemma 6, then the PFPG method has a linear convergence rate.Proof. Since X is closed, we denote x prj as the projection of prox τ k R ( x k − τ k ∇ F ( x k )) onto theoptimal set X . 
Then we havedist( x k +1 , X ) ≤ (cid:107) prox τ k R ( x k − τ k ( ∇ F ( x k ) + e k )) − x prj (cid:107) ≤(cid:107) prox τ k R ( x k − τ k ∇ F ( x k )) − prox τ k R ( x k − τ k ( ∇ F ( x k ) + e k )) (cid:107) + (cid:107) prox τ k R ( x k − τ k ∇ F ( x k )) − x prj (cid:107) ≤ dist(prox τ k R ( x k − τ k ∇ F ( x k )) , X ) + τ k (cid:107) e k (cid:107) ≤ dist(prox τ k R ( x k − τ k ∇ F ( x k )) , X ) + τ k c (cid:107) sin Θ( V k I k , U k ) (cid:107) F ≤ ν dist( x k , X ) + τ k c (cid:107) sin Θ( V k I k , U k ) (cid:107) F , (3.17)where the first inequality is from the definition of the distance function, the second inequality isfrom the triangle inequality, the third inequality is shown in the proof of Lemma 6 and the lastinequality is due to the linear convergence of the proximal gradient method.From (3.14) and (3.17) we observe that the error dist( x k , X ) and (cid:107) sin Θ( V k I k , U k ) (cid:107) are coupledtogether. Thus we define the error vector(3.18) s k = (cid:20) dist( x k , X ) (cid:107) sin Θ( V k I k , U k ) (cid:107) (cid:21) . The two recursive formula can be rewritten as(3.19) s k +1 ≤ R k s k , R k = (cid:20) ν τ k c η q k k +1 c η q k k +1 c (cid:21) . The spectral radius of R k is bounded by the left term in (3.16). It is easy to verify that (3.16) holdsif the degree d k is large enough to make η k +1 sufficiently small. The choice of d k is independent ofthe iteration number k . This completes the proof.Theorem 7 requires the linear convergence rate of the exact proximal gradient method. Thiscondition is satisfied if F ( x ) is strongly convex and smooth. From the proof we also observe thatdist( x k , X ) and (cid:107) sin Θ( V k I k , U k ) (cid:107) are recursively bounded by each other. By choosing suitable d k ,the two errors are able to decay simultaneously. .4. Convergence analysis for PFAM. We next analyze the convergence of PFAM. Similarto PFPG, we make the following assumptions:
Assumption 4. Let V^k_{I_k} be an orthogonal matrix whose columns are the eigenvectors corresponding to all positive eigenvalues of Z^k.
(i) ‖sin Θ(V^{k+1}_{I_{k+1}}, U^k)‖ < γ for all k with γ < 1.
(ii) The sequence generated by (2.15) and (2.16) is bounded, i.e., ‖Z^k‖_F ≤ C for all k.
(iii) The relative gap (3.3) has a lower bound, i.e., G_k ≥ l for all k.

The main theorem is established by following the proof in [6].

Theorem 8. Suppose Assumption 4 holds. The convergence ‖Z^k − T_DRS(Z^k)‖_F = O(1/√k) is achieved if

(3.20)    d_k = Ω( log k / min{√l, 1} ).

Proof.
According to the proof of Lemma 3, we obtain (cid:107)P U k ( Z k ) − P V k I k ( Z k ) (cid:107) F ≤ c · q k q k d k min { l, } , where c is a constant. Define the error between one PFAM update and one DRS iteration E k = Z k +1 − T DRS ( Z k ). Since prox tG and P + are Lipschitz continuous with constant L and L , we have (cid:107) E k (cid:107) F ≤ (2 L L + L ) (cid:107)P U k ( Z k ) − P V k I k ( Z k ) (cid:107) F ≤ c · q k q k d k min { l, } , where c is a constant. Let s > (cid:107) E k (cid:107) F = O ( 1 k s ) , by choosing d k = Ω( s log l min( l, ).Define W k = T DRS ( Z k ) − Z k , S k = Z k +1 − Z k and T k = T DRS ( Z k +1 ) − T DRS ( Z k ). Then wehave (cid:107) W k +1 (cid:107) F = (cid:107) W k (cid:107) F + (cid:107) W k +1 − W k (cid:107) F + 2 (cid:104) W k , W k +1 − W k (cid:105) = (cid:107) W k (cid:107) F + (cid:107) T k − S k (cid:107) F + 2 (cid:104) S k − E k , T k − S k (cid:105) . (3.21)By the firm nonexpansiveness of T DRS , we obtain(3.22) 2 (cid:104) S k , T k − S k (cid:105) = (cid:107) T k (cid:107) F − (cid:107) S k (cid:107) F − (cid:107) T k − S k (cid:107) F ≤ − (cid:107) T k − S k (cid:107) F . Plugging (3.22) into (3.21) yields(3.23) (cid:107) W k +1 (cid:107) F ≤ (cid:107) W k (cid:107) F − (cid:107) T k − S k (cid:107) F − (cid:104) E k , T k − S k (cid:105) ≤ (cid:107) W k (cid:107) F + (cid:107) E k (cid:107) F . Since (cid:80) ∞ k =0 (cid:107) E k (cid:107) F < ∞ , it implies the boundness of (cid:107) W k (cid:107) F , i.e., (cid:107) W k (cid:107) F ≤ (cid:107) W (cid:107) F + k − (cid:88) i =0 (cid:107) E k (cid:107) F < ∞ . et Z ∗ be a fixed point of T DRS , then we obtain that(3.24) (cid:107) T DRS ( Z k ) − Z ∗ (cid:107) F ≤ (cid:107) Z k − Z ∗ (cid:107) F ≤ (cid:107) T DRS ( Z k − ) − Z ∗ (cid:107) F + (cid:107) E k − (cid:107) F . Applying (3.24) recursively gives (cid:107) T DRS ( Z k ) − Z ∗ (cid:107) F ≤ (cid:107) T DRS ( Z ) − Z ∗ (cid:107) F + k − (cid:88) i =0 (cid:107) E k (cid:107) F < ∞ . Hence (cid:107) T DRS ( Z k ) − Z ∗ (cid:107) F is bounded. Due to the firm nonexpansiveness of T DRS , we have that (cid:107) Z k +1 − Z ∗ (cid:107) F = (cid:107) T DRS ( Z k ) − Z ∗ (cid:107) F + (cid:107) E k (cid:107) F + 2 (cid:104) T DRS ( Z k ) − Z ∗ , E k (cid:105)≤ (cid:107) Z k − Z ∗ (cid:107) F − (cid:107) W k (cid:107) F + 2 (cid:107) T DRS ( Z k ) − Z ∗ (cid:107) F (cid:107) E k (cid:107) F . Therefore, it holds(3.25) ∞ (cid:88) i =0 (cid:107) W i (cid:107) F ≤ (cid:107) Z − Z ∗ (cid:107) F + 2 ∞ (cid:88) i =0 (cid:107) T DRS ( Z i ) − Z ∗ (cid:107) F (cid:107) E i (cid:107) F ≤ ∞ . The upper bound (3.25) implies (cid:107) W k (cid:107) →
0. In fact, we can also show the convergence speed.According to (3.23) and (3.25), we have (cid:24) k (cid:25) (cid:107) W k (cid:107) F = k (cid:88) i = (cid:100) k (cid:101) (cid:107) W k (cid:107) F ≤ k (cid:88) i = (cid:100) k (cid:101) ( (cid:107) W i (cid:107) F + k (cid:88) j = i (cid:107) E j (cid:107) F )= k (cid:88) i = (cid:100) k (cid:101) (cid:107) W i (cid:107) F + k (cid:88) i = (cid:100) k (cid:101) ( k − i + 1) (cid:107) E i (cid:107) F ≤ k (cid:88) i = (cid:100) k (cid:101) (cid:107) W i (cid:107) F + c k (cid:88) i = (cid:100) k (cid:101) (cid:107) E i (cid:107) F −→ , where the last inequality is due to the fact that (cid:107) E k (cid:107) F dominates (cid:107) E k (cid:107) F since (cid:107) E k (cid:107) F = O (1 /k s ).This completes the proof.
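In practice, the logarithmic schedules in (3.9) and (3.20) translate into a very mild degree rule. A possible helper is sketched below; the constants c0 and dmax are tuning parameters of our own choosing, not prescribed by the analysis.

```matlab
% Degree schedule suggested by (3.9)/(3.20): d_k grows like
% log(k)/min(sqrt(l),1), clipped between 1 and a user cap dmax.
function d = filter_degree(k, l, c0, dmax)
    d = ceil(c0 * log(k + 1) / min(sqrt(l), 1));
    d = max(1, min(d, dmax));
end
```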
4. Implementation Details.

4.1. Choice of the Interval [a, b]. We mainly focus on how to extract a good subspace that contains the eigenvectors corresponding to 1) all positive eigenvalues and 2) the r largest eigenvalues, by choosing a proper [a, b]. In either case, a good choice of a is the smallest eigenvalue of the input matrix A, say λ_n. For example, one can use the Lanczos method (eigs in matlab) to estimate λ_n. Note that A changes during the iterations, thus a must be kept up to date periodically. The estimation of b is related to the case we are dealing with. If all positive eigenvalues are computed, b is set to b = ηa where η is a small positive number, usually 0.1 to 0.2. If only the r largest eigenvalues are needed, then we choose b = (1 − η)λ̂_r + ηa, where λ̂_r is an estimation of λ_r. A natural choice is the λ_r from the previous iteration.

4.2. Choice of the Subspace Dimension. We next discuss the choice of p, the number of columns of U^k. Suppose the r largest eigenvalues of a matrix are required; we can add some guard vectors to U^k and set p = 5 + q · r where q > 1.

4.3. Acceleration. Gradient descent algorithms usually have slow convergence. This issue can be resolved using acceleration techniques such as Anderson Acceleration (AA) [1, 23]. In our implementation we apply an extrapolation-based acceleration technique proposed in [18] to overcome the instability of the Anderson Acceleration. To be precise, we perform linear combinations of the points x^k every l + 2 iterations to obtain a better estimation x̃ = Σ_{i=0}^{l} c̃_i x^{k−l+i}. Define the matrix of differences of the l + 2 iteration points U = [x^{k−l+1} − x^{k−l}, . . . , x^{k+1} − x^k]; the coefficient vector c̃ = (c̃_0, . . . , c̃_l)^T is the solution of the following problem

(4.1)    c̃ = argmin_{1^T c = 1}  c^T (U^T U + λI) c,

where λ > 0 is a small regularization parameter.
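The coefficient problem (4.1) has a closed-form solution via one small linear solve. A sketch of the whole extrapolation step, under our own naming and with the iterates stored as the columns of X, is:

```matlab
% Regularized extrapolation (4.1): given the last l+2 iterates as the
% columns of X = [x^{k-l}, ..., x^{k+1}], solve for the coefficients c
% with sum(c) = 1 and combine the first l+1 iterates.
function xt = rna_extrapolate(X, lambda)
    U = diff(X, 1, 2);                       % columns x^{i+1} - x^i
    M = U' * U + lambda * eye(size(U, 2));   % regularized Gram matrix, (4.1)
    z = M \ ones(size(U, 2), 1);             % unnormalized minimizer
    c = z / sum(z);                          % enforce 1'*c = 1
    xt = X(:, 1:end-1) * c;                  % extrapolated point x~
end
```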
5. Numerical Experiments.
This section reports on a set of applications and their numerical results based on the spectral optimization problem (2.4). All experiments are performed on a Linux server with two twelve-core Intel Xeon E5-2680 v3 processors at 2.5 GHz and a total amount of 128 GB shared memory. All reported time is wall-clock time in seconds.
Given a matrix G ∈ S^n, the (unweighted) nearest correlation estimation problem (NCE) is to solve the semi-definite optimization problem:

(5.1)    min (1/2)‖G − X‖_F²,   s.t.  diag(X) = 1_n,  X ⪰ 0.

The dual problem of (5.1) is

(5.2)    min (1/2)‖Π_{S^n_+}(G + Diag(x))‖_F² − 1_n^T x,

where Π_{S^n_+}(·) is the projection operator onto the positive semi-definite cone. Note that (5.2) can be rewritten as a spectral function minimization problem of the form (2.4), whose objective function and gradient are:

(5.3)    min F(x) := f ∘ λ(G + Diag(x)) − 1_n^T x,
(5.4)    ∇F(x) = diag(Π_{S^n_+}(G + Diag(x))) − 1_n,

where f(λ) = (1/2) Σ_{i=1}^n max(λ_i, 0)². Expressions (5.3) and (5.4) imply that only the positive eigenvalues and eigenvectors of G + Diag(x^k) are needed at each iteration. Hence Assumption 1 is satisfied if Π_{S^n_+}(G + Diag(x)) is low-rank for all x in the neighborhood of x^*.

First we generate synthetic NCE problem data based on the first three examples in [17].

Example 5.1. C ∈ S^n is a randomly generated correlation matrix using gallery('randcorr',n) in matlab. R is an n by n random matrix whose entries R_ij are sampled from the uniform distribution in [−1, 1]. Then the matrix G is set to G = C + R.

Example 5.2. G ∈ S^n is a randomly generated matrix with G_ij satisfying the uniform distribution in [−1, 1] if i ≠ j. The diagonal elements G_ii are set to 1 for all i.

Example 5.3. G ∈ S^n is a randomly generated matrix with G_ij satisfying the uniform distribution in [0, 1] if i ≠ j. The diagonal elements G_ii are set to 1 for all i.

In all three examples, the dimension n is set to various numbers from 500 to 4000. The initial point x^0 is set to 1_n − diag(G). Three methods are compared on the NCE problems, namely, the gradient method with a fixed step size (Grad), our proposed method (PFPG), and the semi-smooth Newton method (Newton) proposed in [17]. The stopping criterion is the relative gradient norm condition ‖∇F(x)‖ ≤ ε ‖∇F(x^0)‖ for a moderate tolerance ε. The reason not to use a higher accuracy is that first-order methods (such as Grad) are not designed for obtaining high accuracy solutions. We also record the number of iterations (denoted by "iter"), the objective function value f, and the relative norm of the gradient ‖g‖ = ‖∇F(x)‖ / ‖∇F(x^0)‖. The results are shown in Table 1.

The following observations can be drawn from Table 1:
• All three methods reached the required accuracy in all test instances. For large problems such as n = 4000, PFPG usually provides at least three times speedup over Grad. For some large problems such as n = 4000 in Example 5.3, PFPG has nearly 9 times speedup.
• The Newton method usually converges within 10 iterations, since it enjoys the quadratic convergence property. The main cost of the Newton method consists of full EVDs and solving linear systems, which make each iteration slow. However, PFPG benefits from the subspace extraction so that the cost of each iteration is greatly reduced, making it very competitive against the Newton method.
• PFPG usually needs more iterations than Grad, which results from the inexactness of the Chebyshev filters. The error can be controlled by their degree. Using higher order polynomials will reduce the number of iterations, but increase the cost per iteration.
Moreover, the number of iterations does not increase much thanks to the warm-start property of the filters.

It should be pointed out that the benefit brought by PFPG is highly dependent on the number of positive eigenvalues r_+ of G + Diag(x^*). We have checked this number and found that, relative to n, r_+ is considerably smaller in Example 5.3 than in Examples 5.1 and 5.2. That is the reason why PFPG is able to provide higher speedups in Example 5.3. As a conclusion, we address that PFPG is not suitable when r_+ is a large fraction of n.

Next we test the three methods on real datasets. The test matrices are selected from the invalid correlation matrix repository (available at https://github.com/higham/matrices-correlation-invalid). Since we are more interested in large NCE problems, we choose the three largest matrices from the collection, whose details are presented in Table 2. The label "opt. rank" denotes the rank of the optimal point X^* in (5.1). Note that for the matrix "cor3120", the solution is not low-rank. However, the projection can be performed with respect to the negative definite part of the variable, which is low-rank in this case.

Again, we compare the performance of Grad, PFPG, and Newton on the three matrices. The stopping criterion is again ‖∇F(x)‖ ≤ ε ‖∇F(x^0)‖. The results are shown in Table 3. For real data, PFPG is still able to provide multi-fold speedups over the traditional gradient method.
Table 1. Results of NCE problems
Results of Example 5.1
n    | Grad: time iter f ‖g‖         | PFPG: time iter f ‖g‖         | Newton: time iter f ‖g‖
500  | 0.8   28 1.1e+05 7.9e-08 | 0.7  40 1.1e+05 4.1e-08 | 0.6  8 1.1e+05 1.7e-08
1000 | 2.6   28 3.2e+05 5.5e-08 | 1.1  46 3.2e+05 2.6e-08 | 1.4  8 3.2e+05 1.2e-08
1500 | 6.2   28 5.9e+05 7.0e-08 | 2.5  46 5.9e+05 8.2e-08 | 3.3  8 5.9e+05 1.0e-08
2000 | 13.5  31 9.2e+05 8.0e-08 | 4.2  46 9.2e+05 4.9e-08 | 5.9  8 9.2e+05 8.8e-09
2500 | 24.9  34 1.3e+06 1.7e-08 | 7.2  47 1.3e+06 6.9e-08 | 9.6  8 1.3e+06 7.9e-09
3000 | 43.2  34 1.7e+06 2.8e-08 | 13.2 52 1.7e+06 1.8e-08 | 15.7 8 1.7e+06 7.2e-09
4000 | 110.6 34 2.7e+06 1.4e-08 | 25.0 52 2.7e+06 2.4e-08 | 36.1 8 2.7e+06 6.3e-09

Results of Example 5.2
n    | Grad: time iter f ‖g‖         | PFPG: time iter f ‖g‖         | Newton: time iter f ‖g‖
500  | 0.5   16 8.9e+03 2.2e-09 | 0.6  16 8.9e+03 1.0e-08 | 0.5  5 8.9e+03 1.6e-07
1000 | 1.6   16 2.6e+04 7.4e-08 | 1.0  17 2.6e+04 8.3e-08 | 1.2  6 2.6e+04 1.2e-07
1500 | 4.0   17 5.0e+04 6.6e-08 | 2.2  20 5.0e+04 9.1e-08 | 2.5  6 5.0e+04 9.6e-08
2000 | 7.5   17 7.8e+04 8.9e-08 | 4.8  22 7.8e+04 2.4e-08 | 4.7  6 7.8e+04 8.3e-08
2500 | 13.8  19 1.1e+05 7.2e-08 | 7.4  22 1.1e+05 4.5e-08 | 7.6  6 1.1e+05 7.5e-08
3000 | 25.9  21 1.5e+05 8.9e-08 | 12.0 22 1.5e+05 8.1e-08 | 12.3 6 1.5e+05 6.9e-08
4000 | 72.5  22 2.3e+05 3.4e-09 | 27.2 25 2.3e+05 9.9e-08 | 28.6 6 2.3e+05 6.0e-08

Results of Example 5.3
n    | Grad: time iter f ‖g‖         | PFPG: time iter f ‖g‖         | Newton: time iter f ‖g‖
500  | 0.9   33 1.3e+05 7.4e-08 | 0.7  43  1.3e+05 1.3e-08 | 0.6  8  1.3e+05 1.8e-07
1000 | 3.8   43 5.0e+05 2.0e-08 | 1.1  54  5.0e+05 3.0e-08 | 1.6  9  5.0e+05 1.3e-07
1500 | 11.3  54 1.1e+06 2.6e-08 | 2.4  65  1.1e+06 7.1e-08 | 4.0  10 1.1e+06 1.0e-07
2000 | 22.6  54 2.0e+06 4.3e-08 | 5.0  76  2.0e+06 8.3e-08 | 7.2  10 2.0e+06 9.0e-08
2500 | 60.9  87 3.1e+06 3.6e-08 | 10.6 120 3.1e+06 4.2e-08 | 12.1 10 3.1e+06 8.0e-08
3000 | 104.0 87 4.5e+06 6.2e-08 | 16.1 129 4.5e+06 7.8e-08 | 19.3 10 4.5e+06 7.4e-08
4000 | 278.0 91 8.0e+06 7.9e-08 | 32.8 142 8.0e+06 7.3e-08 | 44.2 10 8.0e+06 6.4e-08

Though not always faster than Newton, PFPG is very competitive against the second-order method.
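As an illustration of how cheap the gradient (5.4) becomes once the positive eigen-space is available, a plain MATLAB evaluation (using a full eig for clarity; PFPG replaces this step by the filtered subspace extraction) reads:

```matlab
% Gradient (5.4) of the NCE dual: only the positive eigen-pairs of
% A = G + Diag(x) enter the projection onto the PSD cone.
function g = nce_grad(G, x)
    A = G + diag(x);
    [V, D] = eig((A + A') / 2);          % full EVD, for illustration only
    lam = diag(D);
    pos = lam > 0;
    g = (V(:, pos).^2) * lam(pos) - 1;   % diag(V_+ Diag(lam_+) V_+') - 1_n
end
```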
We next consider two algorithms that deal with thenuclear norm minimization problem (NMM): svt and nnls . Consider a regularized NMM problem with the form(5.5) min (cid:107) X (cid:107) ∗ + µ (cid:107) X (cid:107) F , s.t. A ( X ) = b. The dual problem of (5.5) is(5.6) min y ∈ R m (cid:107) shrink µ ( A ∗ ( y )) (cid:107) F − b T y, able 2 Invalid correlation matrices from real applications name size opt. rank descriptioncor1399 1399 × × × Table 3
Results of NCE problems on real data matrix Grad PFPG Newtontime iter f (cid:107) g (cid:107) time iter f (cid:107) g (cid:107) time iter f (cid:107) g (cid:107) cor1399 6.0 32 6.5e+04 1.3e-06 2.6 32 6.5e+04 4.1e-06 1.9 5 6.5e+04 4.4e-06cor3120 76.0 43 9.3e+04 2.4e-06 12.3 43 9.3e+04 4.8e-06 8.9 4 9.3e+04 3.8e-06bccd16 18.9 16 1.4e+06 3.3e-09 4.2 16 1.4e+06 6.2e-07 5.0 2 1.4e+06 3.4e-07where shrink µ ( X ) is the shrinkage operator defined on the matrix space R n × n , which actuallyperforms soft-thresholding on the singular values of X , i.e.shrink µ ( X ) = U Diag(( σ − µ ) + , · · · , ( σ n − µ ) + ) V T . The formulation (5.6) is a special case of our model (2.4), with f being the square of a vectorshrinkage operator. Evaluating f and ∇ f only involves computing singular values which are largerthan µ . Thus, problem (5.6) satisfies Assumption 1.We consider the svt algorithm [3] which has the iteration scheme(5.7) (cid:26) W k = shrink µ ( A ∗ ( x k − )) ,x k = x k − + τ k ( b − A ( W k )) . It can be shown that the svt iteration (5.7) is equivalent to the gradient descent method on thedual problem (5.6), with step size τ k . Thus a polynomial-filtered svt (PFSVT) can be proposed.A small difference is that we need to extract a subspace that contains the required singular vectorsof a non-symmetric matrix A for svt . This operation can be performed by the subspace extractionfrom A T A and AA T .The matrix completion problem is used to test PFSVT, with the constraint P Ω ( X ) = P Ω ( M ).The test data are generated as follows. Example
The rank r test matrix M ∈ R m × n is generated randomly using the followingprocedure: first two random standard Gaussian matrices M L ∈ R m × r and M R ∈ R n × r are generatedand then M is assembled by M = M L M TR ; the projection set Ω with p index pairs is sampled from m × n possible pairs uniformly. We use “SR” (sampling ratio) to denote the ratio p/ ( mn ) . We have tested the svt algorithm under three settings: svt with standard Lanczos SVD(SVT-LANSVD), svt with SVD based on the Gauss-Newton method (SVT-SLRP, proposed in[14]), and polynomial-filtered svt (PFSVT), where the degree of the polynomial filter is set to1 in all cases. Note that in (5.7) the variable W k is always stored in the sparse format. Weinvoke the mkl dcsrmm subroutine in the Intel Math Kernel Library (MKL) to perform sparsematrix multiplications (SpMM). The results are displayed in Table 4, where “iter” denotes the uter iteration of the svt algorithm, “svp” stands for the recovered matrix rank by the algorithm,and “mse”( = (cid:107) X − M (cid:107) F / (cid:107) M (cid:107) F ) denotes the relative mean squared error between the recoveredmatrix X and the ground truth M . Table 4
Results of svt on randomly generated matrix completion problems by using SpMM-MKL
SVT-LANSVD SVT-SLRP PFSVTm=n SR (%) r iter svp time mse iter svp time mse iter svp time mse1000 20.0 10 79 10 4.23 1.31e-04 79 10 1.94 1.31e-04 78 10 1.83 1.40e-041000 30.0 10 62 10 3.35 1.17e-04 62 10 1.71 1.17e-04 62 10 1.66 1.15e-041000 40.0 10 53 10 2.94 1.15e-04 53 10 1.52 1.14e-04 53 10 1.79 1.13e-041000 30.0 50 169 69 43.58 2.12e-04 171 68 8.67 2.12e-04 168 67 7.31 2.10e-041000 40.0 50 110 50 13.77 1.60e-04 110 50 5.35 1.60e-04 110 50 5.28 1.58e-041000 50.0 50 86 50 13.01 1.40e-04 86 50 4.73 1.40e-04 86 50 4.38 1.39e-045000 10.0 10 53 10 28.88 1.08e-04 53 10 12.37 1.08e-04 52 10 12.47 1.22e-045000 20.0 10 42 10 30.43 1.04e-04 42 10 12.85 1.04e-04 42 10 13.39 1.05e-045000 30.0 10 38 10 34.16 9.23e-05 38 10 16.62 9.23e-05 38 10 16.99 9.55e-055000 10.0 50 107 50 107.38 1.59e-04 107 50 64.38 1.59e-04 107 50 62.10 1.60e-045000 20.0 50 67 50 90.14 1.23e-04 67 50 33.11 1.22e-04 67 50 27.83 1.25e-045000 30.0 50 54 50 92.55 1.21e-04 54 50 37.90 1.21e-04 54 50 32.87 1.21e-0410000 5.0 10 54 10 59.63 1.10e-04 54 10 27.41 1.10e-04 53 10 27.12 1.23e-0410000 8.0 10 46 10 82.54 1.08e-04 46 10 34.29 1.08e-04 46 10 35.20 1.08e-0410000 10.0 10 43 10 84.65 1.07e-04 43 10 39.26 1.07e-04 43 10 39.52 1.08e-0410000 5.0 50 109 50 271.86 1.66e-04 109 50 150.35 1.65e-04 109 50 131.95 1.67e-0410000 10.0 50 69 50 260.60 1.29e-04 69 50 173.75 1.29e-04 69 50 152.81 1.30e-0410000 15.0 50 57 50 291.30 1.16e-04 57 50 206.06 1.16e-04 57 50 178.84 1.22e-04The following observations can be drawn from Table 4. • All three solvers produce similar results: “iter”, “svp”, and “mse” are almost the same. • SVT-SLRP and PFSVT have similar speed, both of which enjoy 1 to 5 times speedup overSVT-LANSVD. In all test cases SVT-SLRP is a little bit slower than PFSVT, yet stillcomparable.
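For reference, the singular value shrinkage operator shrink_μ appearing in (5.6)–(5.7) can be realized with a truncated SVD. In the sketch below, the rank guess r is our own parameter (in practice chosen with a few guard vectors, cf. Section 4.2) and must be large enough that the discarded singular values do not exceed μ.

```matlab
% Singular value shrinkage shrink_mu(X): soft-threshold the singular
% values at level mu.  Only the leading r triplets are computed, since
% values below mu are shrunk to zero anyway (r must be chosen safely).
function W = shrink(X, mu, r)
    [U, S, V] = svds(X, r);          % leading r singular triplets
    s = max(diag(S) - mu, 0);        % soft-thresholding
    W = U * diag(s) * V';
end
```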
Consider a penalized formulation of NMM:(5.8) min (cid:107) X (cid:107) ∗ + 12 µ (cid:107)A ( X ) − b (cid:107) F . We turn to the nnls algorithm [21] to solve (5.8), which is essentially an accelerated proximalgradient method. At the k -th iteration, the main cost is to compute the truncated SVD of a matrix A k = β U k ( V k ) T − β U k − ( V k − ) T − β G k , where U k , V k , U k − , V k − are dense matrices, but the matrix G k is either dense or sparse, dependingon the sample ratio (SR). Though the formulation (5.8) is not a special case of (2.4), we can stillinsert the polynomial filter (2.3) into the nnls algorithm to reduce the cost of SVD, resultingin a polynomial-filtered nnls algorithm (PFNNLS). The Example 5.4 is still used to compare the erformance of NNLS-LANSVD, NNLS-SLRP, and PFNNLS. The polynomial filter degree is merelyset to 1. The results are shown in Table 5 for dense G k and Table 6 for sparse G k . For matrixmultiplications, we call matlab built-in functions for dense G k and Intel MKL subroutines forsparse G k . The notations have the same meaning as in Table 4. Table 5
Results of nnls on dense examples
NNLS-LANSVD NNLS-SLRP PFNNLSm=n SR (%) r iter svp time mse iter svp time mse iter svp time mse1000 25.0 50 55 50 5.1 7.60e-04 55 50 2.7 1.18e-03 54 50 2.2 1.19e-031000 35.0 50 42 50 3.1 5.42e-04 49 50 2.6 4.40e-04 49 50 1.9 4.66e-041000 49.9 50 39 50 3.2 1.81e-04 39 50 1.9 1.81e-04 39 50 1.6 1.84e-041000 25.0 100 100 150 29.4 3.55e-01 100 150 8.6 3.87e-01 100 150 7.1 5.03e-011000 35.0 100 64 100 10.7 1.45e-03 64 100 4.2 1.48e-03 65 100 3.3 9.32e-041000 49.9 100 54 100 8.3 4.70e-04 48 100 3.8 1.05e-03 51 100 3.2 4.33e-045000 25.0 50 36 50 40.0 1.41e-04 36 50 17.8 1.42e-04 36 50 16.1 1.46e-045000 35.0 50 34 50 40.0 1.60e-04 34 50 18.7 1.61e-04 34 50 17.7 1.61e-045000 50.0 50 32 50 41.7 1.46e-04 32 50 21.5 1.48e-04 32 50 20.6 1.45e-045000 25.0 100 49 100 106.4 2.83e-04 49 100 35.6 2.88e-04 50 100 32.4 2.19e-045000 35.0 100 45 100 94.0 1.74e-04 45 100 38.7 1.76e-04 45 100 35.2 1.74e-045000 50.0 100 43 100 100.5 1.43e-04 43 100 46.1 1.44e-04 43 100 41.3 1.45e-0410000 25.0 50 32 50 106.9 1.17e-04 32 50 54.9 1.17e-04 32 50 51.4 1.18e-0410000 35.0 50 32 50 120.0 1.16e-04 32 50 66.3 1.16e-04 32 50 61.9 1.17e-0410000 50.0 50 31 50 129.1 1.39e-04 32 50 83.0 1.15e-04 31 50 76.2 1.37e-0410000 25.0 100 43 100 253.0 1.59e-04 44 100 114.9 1.74e-04 45 100 105.0 1.56e-0410000 35.0 100 42 100 263.7 1.32e-04 42 100 130.1 1.33e-04 42 100 118.3 1.37e-0410000 50.0 100 39 100 259.6 3.70e-04 40 100 156.1 1.18e-04 39 100 136.5 3.56e-04Similar observations can be drawn from Table 5 and 6. However, for one case (sparse G k , m = n = 10000, SR=2.0%), NNLS-LANSVD and PFNNLS seem to produce different results fromNNLS-SLRP. In another case (dense G k , m = n = 1000, SR=25.0%) the mse of all three methodsis high. The reason is that nnls terminates after it has reached the max iteration 100.The last experiment is to use the three nnls algorithms on real data sets: the Jester jokedata set and the MovieLens data set, which are also mentioned in [21]. As shown in Table 7, theJester joke data set includes four examples: jester-1, jester-2, jester-3, and jester-all. The last oneis simply the combination of the first three data sets. The MovieLens data set have three problemsbased on the size: movie-100K, movie-1M, and movie-10M. We call Intel MKL routines for matrixmultiplications since the data are all sparse. Again, the degree of the polynomial filter is set to 1and the max iteration of nnls is 100.We have the following observations from Table 7: • The three methods require similar number of iterations for all seven test cases. • NNLS-LANSVD generates solutions with the highest rank, while NNLS-SLRP producessolutions with the lowest rank. The mse of PFNNLS is a little larger in Jester joke datasets. • NNLS-SLRP provides 2-4 speedups over NNLS-LANSVD, while PFNNLS is 3-6 times faster able 6 Results of nnls on random sparse examples by using SpMM-MKL
NNLS-LANSVD NNLS-SLRP PFNNLSm=n SR (%) r iter svp time mse iter svp time mse iter svp time mse10000 2.0 10 45 10 16.5 1.86e-04 45 10 9.2 1.54e-04 45 10 7.9 1.55e-0410000 5.0 10 35 10 27.9 1.30e-04 35 10 15.0 1.30e-04 35 10 12.9 1.30e-0410000 10.0 10 30 10 53.1 1.03e-04 30 10 23.9 1.03e-04 30 10 18.9 1.03e-0410000 14.0 10 28 10 62.6 1.07e-04 29 10 28.6 1.09e-04 28 10 22.7 1.07e-0410000 2.0 50 100 77 226.7 3.42e-01 100 50 53.9 9.88e-03 100 74 43.5 7.48e-0110000 5.0 50 50 50 125.3 2.50e-04 48 50 40.1 2.52e-04 51 50 33.7 1.72e-0410000 10.0 50 37 50 187.3 1.66e-04 37 50 53.0 1.67e-04 37 50 46.3 1.79e-0410000 14.0 50 35 50 224.2 1.29e-04 35 50 62.7 1.30e-04 35 50 53.4 1.31e-0450000 2.0 10 38 10 431.9 1.09e-04 38 10 168.3 1.09e-04 38 10 137.7 1.10e-0450000 5.0 10 31 10 776.4 9.62e-05 31 10 328.7 9.62e-05 31 10 253.9 9.58e-0550000 10.0 10 28 10 1560.0 1.09e-04 28 10 586.3 1.09e-04 28 10 438.0 1.02e-0450000 14.0 10 28 10 2011.5 1.03e-04 28 10 792.0 1.03e-04 28 10 611.2 1.03e-0450000 2.0 50 49 50 1415.8 1.96e-04 49 50 442.0 1.62e-04 52 50 335.4 1.38e-0450000 5.0 50 37 50 2487.5 1.74e-04 37 50 710.7 1.77e-04 37 50 600.1 1.99e-0450000 10.0 50 32 50 4483.5 1.17e-04 32 50 1244.8 1.17e-04 32 50 1074.5 1.17e-0450000 14.0 50 31 50 5687.1 1.10e-04 31 50 1499.5 1.10e-04 31 50 1405.2 1.10e-04
Table 7
Results of nnls on real examples by using SpMM-MKL
NNLS-LANSVD / NNLS-SLRP / PFNNLS
name (m, n) | iter svp time mse | iter svp time mse | iter svp time mse
jester-1 (24983, 100) | 26 93 10.5 1.64e-01 | 27 69 4.6 1.76e-01 | 24 84 2.3 1.80e-01
jester-2 (23500, 100) | 26 93 9.1 1.65e-01 | 26 79 4.3 1.72e-01 | 25 88 2.1 1.80e-01
jester-3 (24938, 100) | 24 83 7.1 1.16e-01 | 27 74 4.6 1.24e-01 | 24 84 2.0 1.30e-01
jester-all (73421, 100) | 26 93 26.2 1.58e-01 | 26 82 12.4 1.62e-01 | 24 84 5.6 1.74e-01
movie-100K (943, 1682) | 34 100 4.2 1.28e-01 | 35 100 0.8 1.26e-01 | 36 100 0.6 1.23e-01
movie-1M (6040, 3706) | 50 100 40.6 1.42e-01 | 50 100 10.8 1.43e-01 | 51 100 7.4 1.42e-01
movie-10M (71567, 10677) | 54 100 620.1 1.26e-01 | 57 100 179.9 1.27e-01 | 52 100 92.7 1.27e-01

We make the following comments as extra interpretations of all numerical results in this subsection.
• The SLRP solver is essentially a subspace SVD method. As the authors point out in [14], the subspace is generated by solving a least-squares model with one Gauss-Newton iteration, after which high-accuracy singular pairs can be extracted from the subspace. The difference between SLRP and our polynomial filters is that solving the SLRP sub-problem requires more computation than evaluating polynomial filters, in exchange for higher-precision solutions. Hence SLRP can be regarded as a variant of the subspace extraction method.
• Converting the SVD of A to the EVD of A^T A implicitly applies a polynomial filter rho(t) = t to the singular values. Thus we can simply set the degree to 1 when performing subspace extractions on A^T A; a minimal sketch of such a degree-1 filtered step is given below.
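To make the last remark concrete, the following sketch (our own illustration, not code from the paper; all function and variable names are ours, and dense NumPy matrices are assumed, whereas the actual solvers exploit sparsity) performs one degree-1 filtered step on A^T A and then recovers approximate leading singular pairs of A through a projected SVD.

```python
import numpy as np

def degree_one_filtered_svd(A, V, r):
    """One degree-1 polynomial-filtered step for the top-r SVD of A.

    Applying rho(t) = t on A^T A amounts to a single block multiplication
    Y = (A^T A) V, followed by a Rayleigh-Ritz (projected SVD) step on the
    orthonormalized subspace.

    A : (m, n) data matrix
    V : (n, p) current subspace basis with p >= r (e.g. the basis kept
        from the previous outer iteration -- the warm start)
    r : number of singular triplets to extract
    """
    Y = A.T @ (A @ V)                  # degree-1 filter: (A^T A) V
    Q, _ = np.linalg.qr(Y)             # orthonormal basis of the filtered subspace
    B = A @ Q                          # project A onto the subspace
    Ub, s, Wt = np.linalg.svd(B, full_matrices=False)
    U = Ub[:, :r]                      # approximate left singular vectors
    Vr = Q @ Wt.T[:, :r]               # approximate right singular vectors
    return U, s[:r], Vr, Q             # Q can warm-start the next call

# toy usage: a nearly rank-5 matrix
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 150))
A += 1e-6 * rng.standard_normal((200, 150))
V0 = rng.standard_normal((150, 8))
U, s, V, Q = degree_one_filtered_svd(A, V0, r=5)
print(np.allclose(s, np.linalg.svd(A, compute_uv=False)[:5], rtol=1e-2))
```

The basis Q returned by one call can be passed back in as the warm start of the next outer iteration, which is how a constant, very low degree can remain sufficient.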
Consider the maximal eigenvalue optimization problem of the following form

(5.9) min_x F(x) + R(x) := lambda_1(B(x)) + R(x),

where B(x) = G + A^*(x). The subgradient of F(x) is dF(x) = { A(U S U^T) | S >= 0, tr(S) = 1 }, where the columns of U in R^{n x r} span the eigenspace associated with lambda_1(B(x)), whose multiplicity is r. If r = 1, then dF(x) contains only one element and the subgradient becomes the gradient. If r > 1, F(x) is only sub-differentiable; we point out that the case r > 1 is encountered far less often in practice than r = 1. In the following two applications, namely phase recovery and blind deconvolution, F(x) is differentiable within a neighborhood of x^*.

We use the phase recovery formulation in [7]. The smooth part F(x) = lambda_1(B(x)) := lambda_1(A^*(x)) is defined as follows. Let the known diagonal matrices C_k in C^{n x n} be the encoding diffraction patterns corresponding to the k-th "mask" (k = 1, ..., L). The measurements of an unknown signal x_0 in C^n are given by

b = A(x_0 x_0^*) := diag( [F C_1; ...; F C_L] (x_0 x_0^*) [F C_1; ...; F C_L]^* ),

where F is the unitary discrete Fourier transform (DFT). The adjoint of A can be written as

A^* y := sum_{k=1}^{L} C_k^* F^* Diag(y_k) F C_k,

where y_k denotes the k-th block of y. The non-smooth part R(x) is defined as the indicator function of the set { x | <b, x> - eps ||x||_* >= 1 }, where ||.||_* is the dual norm of ||.|| and eps >= 0 is a given noise parameter.

In the experiments on the phase retrieval problem, both synthetic and real data are considered. The noiseless synthetic data are generated as follows. For each value of L = 7, ..., 12, we generate a random complex Gaussian vector x_0 of length n = 1024 or 4096 and L random complex Gaussian masks C_k. We use the GAUGE algorithm proposed in [7] to solve the phase recovery problem, in which the matlab built-in function eigs is used as the eigenvalue sub-problem solver. We then insert our polynomial filters into the original GAUGE implementation to obtain the polynomial-filtered GAUGE algorithm (PFGAUGE). It should be mentioned that although GAUGE is essentially a gradient method, it has many variants to handle different cases, which we briefly discuss later. A highlight of GAUGE is the so-called "primal-dual refinement step", which is able to reduce the number of iterations drastically. This feature has a great impact on the performance of both GAUGE and PFGAUGE.

The results on noiseless synthetic data are presented in Tables 8 and 9, where "iter" denotes the number of outer iterations. In addition, we record the number of DFT calls (nDFT) and the number of eigen-solver calls or polynomial filter evaluations, denoted by nEigs and nPF, respectively. The label "err" stands for the relative mean squared error of the solution, ||x x^* - x_0 x_0^*||_F / ||x_0||^2.

Table 8
Random Gaussian signals, noiseless, n = 1024
GAUGE / PFGAUGE
L | time iter nDFT(nEigs) err | time iter nDFT(nPF) err
12 | 2.37 3 34248(10) 8.7e-07 | 2.09 7 43386(17) 1.1e-06
11 | 2.51 4 46244(13) 9.4e-07 | 2.28 9 48813(21) 1.2e-06
10 | 3.81 6 62160(17) 7.0e-07 | 2.54 10 51900(25) 1.3e-06
9 | 6.53 8 74772(19) 7.9e-07 | 3.74 12 56030(28) 1.8e-06
8 | 10.09 13 117776(29) 1.2e-06 | 5.00 16 67960(36) 1.6e-06
7 | 17.72 21 193956(47) 1.6e-06 | 6.92 23 88785(50) 6.8e-07
Table 9
Random Gaussian signals, noiseless, n = 4096
GAUGE / PFGAUGE
L | time iter nDFT(nEigs) err | time iter nDFT(nPF) err
12 | 10.85 4 57966(12) 1.4e-06 | 9.66 11 74442(30) 2.0e-06
11 | 15.25 5 83298(16) 2.0e-06 | 10.43 11 77880(36) 2.0e-06
10 | 22.63 8 118530(20) 2.3e-06 | 11.20 14 79435(37) 2.6e-06
9 | 35.89 11 176864(26) 2.0e-06 | 16.78 20 118377(56) 3.4e-06
8 | 49.88 15 235720(36) 2.8e-06 | 15.52 23 105660(53) 3.0e-06
7 | 160.64 40 779559(83) 2.6e-06 | 33.66 40 208915(84) 3.4e-06

The following observations should be clear from the results.
• Both GAUGE and PFGAUGE reach the required accuracy in all test cases. The problem grows harder to solve as L decreases from 12 to 7.
• In general, PFGAUGE needs more iterations to converge than the original GAUGE. An exception is the case L = 7, in which the iteration counts are nearly the same.
• PFGAUGE gradually becomes faster as L decreases. For example, it gives a 5 times speedup at L = 7 and n = 4096, which can also be concluded from nDFT. After examining the iteration process, we find that the eigenvalue sub-problem is hard to solve when L is small, so eigs takes more iterations to converge. However, since we only apply a polynomial filter to the iteration point, the cost of DFTs can be greatly reduced. For the easy cases such as L = 12, the performance of the two solvers is similar.

We next consider noisy synthetic data, which are generated with the following steps. First we generate L octanary masks C_k as described in [4], and a real Gaussian vector y. Then a solution x_0 in C^n is chosen as the normalized rightmost eigenvector of A^*(y), where n = 1024 or 4096, and b is computed as b := A(x_0 x_0^*) + eps * y/||y||, where eps := ||b - A(x_0 x_0^*)|| = eta ||b|| corresponds to a given relative noise level eta in (0, 1). We choose L in {12, 9, 6} and noise levels eta in {0.1%, 0.5%, 1%, 5%, 10%, 50%}. For each set of parameters, we randomly generate 100 test instances and feed them to the solvers, after which we compute the relative mean squared error of the solution. The "%" label in Tables 10 and 11 stands for the rate of successful recoveries, defined as the percentage of solutions whose MSE falls below a fixed threshold. All other columns in Tables 10 and 11 report the average over the 100 test instances. A sketch of this measurement and noise model is given below.
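The sketch below illustrates, under our own simplified assumptions, the forward map and the noise model described above: the k-th block of b = A(x_0 x_0^*) is |F(C_k x_0)|^2, and only matrix-vector products with A^*(y) are needed by a polynomial filter. We use plain complex Gaussian masks instead of the octanary masks of [4], unnormalized NumPy FFTs instead of the unitary DFT (so constant factors differ), and we skip the eigenvector-based construction of x_0.

```python
import numpy as np

def forward(x, masks):
    """b = A(x x^*): stacked vectors |F(C_k x)|^2 for each diagonal mask C_k."""
    return np.concatenate([np.abs(np.fft.fft(c * x)) ** 2 for c in masks])

def adjoint_matvec(y, z, masks):
    """Return A^*(y) z = sum_k C_k^* F^* Diag(y_k) F C_k z.

    This matrix-vector product is all a polynomial filter needs in order to
    evaluate rho(A^*(y)) applied to a block of vectors.
    """
    n = z.shape[0]
    out = np.zeros(n, dtype=complex)
    for k, c in enumerate(masks):
        yk = y[k * n:(k + 1) * n]
        # np.fft.ifft plays the role of F^* up to a 1/n factor
        out += np.conj(c) * np.fft.ifft(yk * np.fft.fft(c * z))
    return out

# noisy synthetic data in the spirit of the experiments above
rng = np.random.default_rng(0)
n, L, eta = 128, 6, 0.01
x0 = rng.standard_normal(n) + 1j * rng.standard_normal(n)
masks = [rng.standard_normal(n) + 1j * rng.standard_normal(n) for _ in range(L)]
b = forward(x0, masks)
noise = rng.standard_normal(L * n)
b_noisy = b + eta * np.linalg.norm(b) * noise / np.linalg.norm(noise)
z = adjoint_matvec(b_noisy, x0, masks)       # one filter matvec, for illustration
```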
Table 10
Random Gaussian signals, noisy, n = 1024
GAUGE / PFGAUGE
L eta | time iter nDFT(nEigs) % err | time iter nDFT(nPF) % err
12 0.1% | 3.02 51 9e+04(58) 100 2.1e-03 | 2.53 48 1e+05(55) 100 2.8e-03
9 0.1% | 6.47 88 1e+05(102) 96 3.4e-03 | 4.95 88 2e+05(103) 98 3.8e-03
6 0.1% | 13.27 195 2e+05(224) 76 7.5e-03 | 9.14 198 2e+05(223) 76 7.6e-03
12 0.5% | 5.76 94 2e+05(107) 95 5.5e-03 | 4.62 92 2e+05(99) 94 5.8e-03
9 0.5% | 9.90 128 2e+05(144) 92 5.7e-03 | 6.78 127 2e+05(147) 93 5.6e-03
6 0.5% | 19.84 284 3e+05(327) 74 8.5e-03 | 12.77 284 3e+05(335) 68 8.5e-03
12 1.0% | 22.82 296 7e+05(354) 78 7.2e-03 | 14.80 326 7e+05(372) 83 7.2e-03
9 1.0% | 22.26 271 4e+05(318) 86 8.0e-03 | 13.92 290 5e+05(346) 88 7.8e-03
6 1.0% | 71.67 979 1e+06(1171) 92 6.5e-03 | 38.64 1049 1e+06(1287) 92 6.4e-03
12 5.0% | 43.27 556 1e+06(649) 34 1.2e-02 | 30.53 734 1e+06(853) 31 1.4e-02
9 5.0% | 86.61 836 2e+06(1005) 65 7.8e-03 | 37.42 1009 1e+06(1161) 53 9.2e-03
6 5.0% | 90.90 1113 1e+06(1322) 88 4.7e-03 | 43.35 1237 1e+06(1477) 81 5.8e-03
12 10.0% | 61.22 692 2e+06(819) 21 1.6e-02 | 29.86 749 1e+06(842) 23 1.5e-02
9 10.0% | 76.80 686 1e+06(804) 48 1.0e-02 | 31.81 799 1e+06(926) 50 1.0e-02
6 10.0% | 78.61 950 1e+06(1124) 76 5.3e-03 | 32.38 956 1e+06(1110) 71 6.0e-03
12 50.0% | 36.20 368 1e+06(433) 41 1.3e-02 | 17.04 424 8e+05(491) 55 8.9e-03
9 50.0% | 42.28 361 8e+05(424) 68 5.7e-03 | 14.76 374 5e+05(438) 56 6.6e-03
6 50.0% | 34.76 389 6e+05(463) 86 2.5e-03 | 14.37 391 4e+05(477) 87 2.5e-03

We make the following comments on the results of the noisy case:
• Compared with the noiseless case, both GAUGE and PFGAUGE need many more iterations to converge.
• The solutions produced by the two algorithms are similar in terms of the number of iterations, the successful recovery rate, and the solution error. For most cases, PFGAUGE needs slightly more iterations than GAUGE.
• PFGAUGE often provides a 2-3 times speedup over GAUGE, which can also be seen from the nDFT column. Again, the number of DFTs is greatly reduced thanks to the polynomial filters.

Finally we conduct experiments on 2D real data in order to assess the speedup of the Chebyshev filters on problems of larger size. In this scenario the measured signals x_0 are gray-scale images, summarized in Table 12. For simplicity, the images are numbered from 1 to 12. Cases 1 and 2 are selected from the matlab image database; Cases 3 to 12 are from the HubbleSite Gallery (see http://hubblesite.org/images/gallery). The largest problem (No. 12) has the largest dimension n, which leads to a huge eigenvalue problem in each iteration. For each example we generate 10 and 15 octanary masks. The results are summarized in Table 13. The column with label "f" records the value of the dual objective function, and the label "gap" stands for the duality gap of the problem. The other columns have the same meaning as in Table 8. A generic sketch of evaluating such a Chebyshev filter is given below.
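Since these 2D experiments are meant to stress the Chebyshev filters, we include a generic sketch of how such a filter is usually evaluated: a degree-d Chebyshev polynomial mapped to the interval [a, b] that should be damped, applied to a block of vectors through the three-term recurrence. This is our own illustration under the assumption that a matrix-block product and the interval endpoints are supplied by the caller; it is not the implementation used in the experiments.

```python
import numpy as np

def chebyshev_filter(matvec, V, degree, a, b):
    """Apply a degree-`degree` Chebyshev filter to the block V.

    The filter damps eigenvalues inside [a, b] (the unwanted part of the
    spectrum) and amplifies those outside it, so the wanted eigenspace
    dominates the filtered block.  `matvec(X)` must return M @ X.
    """
    e = (b - a) / 2.0                       # half-width of the damped interval
    c = (b + a) / 2.0                       # center of the damped interval
    Y0 = V
    Y1 = (matvec(V) - c * V) / e            # T_1 of the shifted/scaled operator
    for _ in range(2, degree + 1):
        Y2 = 2.0 * (matvec(Y1) - c * Y1) / e - Y0   # T_j = 2 t T_{j-1} - T_{j-2}
        Y0, Y1 = Y1, Y2
    return Y1 if degree >= 1 else Y0

# toy usage: amplify the two largest eigenvalues of a symmetric matrix
rng = np.random.default_rng(1)
Qrand, _ = np.linalg.qr(rng.standard_normal((300, 300)))
lam = np.concatenate([np.linspace(0.0, 1.0, 298), [5.0, 6.0]])
M = (Qrand * lam) @ Qrand.T                  # symmetric matrix with known spectrum
V = rng.standard_normal((300, 2))
Y = chebyshev_filter(lambda X: M @ X, V, degree=10, a=0.0, b=1.0)
Q, _ = np.linalg.qr(Y)
theta = np.sort(np.linalg.eigvalsh(Q.T @ M @ Q))   # Rayleigh-Ritz values
print(np.allclose(theta, [5.0, 6.0], atol=1e-6))
```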
Table 11
Random Gaussian signals, noisy, n = 4096
GAUGE / PFGAUGE
L eta | time iter nDFT(nEigs) % err | time iter nDFT(nPF) % err
12 0.1% | 26.16 90 2e+05(106) 89 3.5e-03 | 20.90 89 2e+05(104) 88 3.1e-03
9 0.1% | 45.83 194 4e+05(224) 84 5.2e-03 | 30.10 192 3e+05(221) 80 4.6e-03
6 0.1% | 70.11 373 5e+05(436) 41 1.1e-02 | 56.40 407 5e+05(487) 35 1.3e-02
12 0.5% | 36.31 124 3e+05(149) 85 6.2e-03 | 29.69 138 3e+05(161) 90 5.3e-03
9 0.5% | 68.44 275 5e+05(310) 72 7.9e-03 | 43.98 266 5e+05(319) 69 8.0e-03
6 0.5% | 117.91 625 8e+05(720) 24 1.4e-02 | 84.42 653 7e+05(757) 19 1.5e-02
12 1.0% | 98.84 290 8e+05(331) 73 7.6e-03 | 64.63 320 8e+05(409) 76 7.2e-03
9 1.0% | 109.12 404 9e+05(471) 53 9.7e-03 | 62.14 473 8e+05(561) 52 9.9e-03
6 1.0% | 174.41 900 1e+06(1026) 25 1.4e-02 | 99.72 813 9e+05(951) 13 1.6e-02
12 5.0% | 378.11 834 3e+06(945) 54 9.5e-03 | 128.56 939 2e+06(1078) 45 1.1e-02
9 5.0% | 211.98 705 2e+06(831) 36 1.3e-02 | 94.93 793 1e+06(912) 36 1.3e-02
6 5.0% | 211.82 937 1e+06(1084) 29 1.5e-02 | 119.41 1048 1e+06(1284) 18 1.7e-02
12 10.0% | 372.63 801 3e+06(935) 35 1.4e-02 | 118.20 896 2e+06(1036) 34 1.4e-02
9 10.0% | 217.35 668 2e+06(789) 36 1.5e-02 | 102.80 826 1e+06(954) 31 1.6e-02
6 10.0% | 259.70 961 2e+06(1126) 38 1.5e-02 | 124.12 1109 1e+06(1276) 38 1.3e-02
12 50.0% | 337.82 558 3e+06(634) 45 1.2e-02 | 83.17 676 1e+06(748) 35 1.6e-02
9 50.0% | 223.85 572 2e+06(667) 53 8.6e-03 | 75.28 678 1e+06(765) 61 7.0e-03
6 50.0% | 194.21 668 1e+06(761) 67 5.5e-03 | 76.30 684 7e+05(834) 71 3.4e-03

Since the data size is very large, we terminate the algorithms as soon as they exceed a timeout threshold, which is set to 18000 seconds in this experiment.

Table 12
Image data for 2D signals
No. name size
1 coloredChips 518 x 391
2 lighthouse 480 x ...
4 giantbubble(S) 600 x ...
6 nebula(S) 800 x ...
[the remaining entries of Table 12 did not survive extraction]

We have the following observations from Table 13:
• For all test cases, PFGAUGE successfully converges within the time limit, while the original GAUGE method times out in most of the large test cases. GAUGE also fails on Case 2 with L = 10; after checking the output of the algorithm, the reason is that the eigen-solver fails to converge in some iterations.
• For the cases where both algorithms converge, PFGAUGE is able to provide over 20 times speedup, which can also be verified from the nDFT columns of the two methods. The reason is that the traditional eigen-solver eigs is not scalable enough for such large problems: its convergence is very slow, so the accuracy of its solutions is not acceptable. However, our polynomial-filtered algorithm is able to extract a proper low-rank subspace that contains the desired eigenvectors quickly (a minimal warm-started sketch of this extraction is given after Table 13), hence the performance is much better.
• PFGAUGE sometimes needs fewer iterations than GAUGE. This is because eigs fails to converge during some iterations, so GAUGE produces wrong updates.

Table 13
Phase retrieval comparisons on 2D real signals
GAUGE / PFGAUGE
No. L | time iter nDFT(nEigs) f gap | time iter nDFT(nPF) f gap
1 15 | 1155.89 4 3e+05(19) 1.0e+05 5.0e-06 | 549.97 7 2e+05(29) 1.0e+05 1.4e-06
1 10 | 3704.66 9 1e+06(26) 1.0e+05 1.0e-05 | 210.36 9 6e+04(26) 1.0e+05 7.9e-06
2 15 | 3043.72 4 1e+06(19) 1.1e+05 5.7e-06 | 722.95 6 2e+05(26) 1.1e+05 1.5e-06
2 10 | 21898.10 11 8e+06(87) 1.1e+05 NaN | 158.82 8 5e+04(25) 1.1e+05 5.0e-06
3 15 | 276.71 4 8e+04(18) 1.4e+04 4.9e-06 | 230.50 5 7e+04(21) 1.4e+04 6.2e-06
3 10 | 385.09 13 1e+05(34) 1.4e+04 9.5e-06 | 208.24 10 6e+04(28) 1.4e+04 9.4e-06
4 15 | 9583.33 12 3e+06(35) 2.3e+04 1.3e-06 | 295.84 6 8e+04(26) 2.3e+04 3.6e-06
4 10 | 7433.17 18 2e+06(46) 2.3e+04 7.9e-06 | 238.08 7 6e+04(24) 2.3e+04 9.7e-06
5 15 | 1735.88 5 5e+05(20) 2.2e+04 3.9e-06 | 622.25 9 2e+05(29) 2.2e+04 7.5e-06
5 10 | 1872.62 6 5e+05(22) 2.2e+04 1.5e-06 | 291.96 7 7e+04(24) 2.2e+04 1.8e-06
6 15 | 643.35 7 1e+05(24) 6.7e+04 2.7e-06 | 523.80 9 8e+04(30) 6.7e+04 6.4e-06
6 10 | 21943.32 15 3e+06(42) 6.7e+04 1.2e-02 | 989.89 4 1e+05(10) 6.7e+04 6.7e-06
7 15 | 2252.18 10 2e+05(30) 6.3e+04 8.3e-06 | 1116.42 9 1e+05(30) 6.3e+04 7.6e-06
7 10 | 2530.10 22 2e+05(54) 6.3e+04 8.4e-06 | 981.21 17 9e+04(46) 6.3e+04 6.9e-06
8 15 | 18543.56 5 1e+06(22) 5.8e+04 1.0e-03 | 4050.70 9 3e+05(30) 5.8e+04 5.8e-06
8 10 | 19371.82 11 1e+06(33) 5.8e+04 1.0e-03 | 1209.17 11 9e+04(34) 5.8e+04 8.9e-06
9 15 | 20239.04 2 1e+06(17) 9.1e+04 7.8e-02 | 7046.67 7 5e+05(27) 9.1e+04 9.9e-06
9 10 | 19610.56 8 1e+06(19) 9.1e+04 6.0e-01 | 3892.26 6 2e+05(14) 9.1e+04 4.7e-06
10 15 | 7680.27 10 3e+05(32) 2.7e+05 4.1e-06 | 4287.95 9 2e+05(30) 2.7e+05 5.8e-06
10 10 | 21958.19 5 8e+05(31) 2.7e+05 1.7e-01 | 4042.50 24 1e+05(58) 2.7e+05 4.8e-06
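As mentioned in the observations before Table 13, the filtered subspace is not recomputed from scratch: the basis extracted at one outer iteration warm-starts the next one. A minimal sketch of that loop is given below (our own illustration; a plain power filter rho(t) = t^d stands in for the Chebyshev filter, and all names are ours).

```python
import numpy as np

def filtered_subspace_step(matvec, U_prev, degree=3):
    """Refine the subspace carried over from the previous outer iteration.

    U_prev : orthonormal basis extracted at the previous iteration (warm start)
    Returns an orthonormal basis of Ritz vectors and the Ritz values.
    """
    Y = U_prev
    for _ in range(degree):                 # simple power filter rho(t) = t^degree
        Y = matvec(Y)
    Q, _ = np.linalg.qr(Y)
    H = Q.T @ matvec(Q)                     # small projected matrix (Rayleigh-Ritz)
    theta, S = np.linalg.eigh((H + H.T) / 2)
    return Q @ S, theta

# outer loop sketch: the iterate changes slowly, so a low degree suffices
rng = np.random.default_rng(2)
n, r = 200, 3
U = np.linalg.qr(rng.standard_normal((n, r)))[0]   # initial subspace guess
base = rng.standard_normal((n, n)); base = base @ base.T / n
for k in range(5):
    M_k = base + 0.01 * k * np.eye(n)              # slowly varying symmetric iterate
    U, theta = filtered_subspace_step(lambda X: M_k @ X, U, degree=3)
```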
Again, we use the blind deconvolution formulation in [7]. In this model, we measure the convolution of two signals s_1 in C^m and s_2 in C^m. Let B_1 in C^{m x n_1} and B_2 in C^{m x n_2} be two wavelet bases. The circular convolution of the signals is defined by

b = s_1 * s_2 = (B_1 x_1) * (B_2 x_2) = F^{-1} diag( (F B_1 x_1)(F B_2 x_2)^T ) = F^{-1} diag( (F B_1)(x_1 x_2^*)(F B_2)^* ) =: A(x_1 x_2^*),

where A is the non-symmetric linear map whose adjoint is

A^*(y) := (F B_1)^* Diag(F y) (F B_2).

The non-smooth part R(x) is defined as the indicator function of the set { x | <b, x> - eps ||x||_* >= 1 }. A small FFT-based sketch of this measurement map is given below.
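The following sketch shows, under our own simplifying assumptions, how the blind-deconvolution measurement and the matrix-vector product with A^*(y) can be evaluated with FFTs. Random orthonormal matrices stand in for the wavelet bases B_1 and B_2, and the unnormalized NumPy FFT replaces the unitary DFT, so constant factors differ from the formulas above; this is not code from [7].

```python
import numpy as np

def measure(x1, x2, B1, B2):
    """b = A(x1 x2^*): circular convolution of B1 x1 and B2 x2 via the FFT."""
    return np.fft.ifft(np.fft.fft(B1 @ x1) * np.fft.fft(B2 @ x2))

def adjoint_matvec(y, z, B1, B2):
    """Return A^*(y) z = (F B1)^* Diag(F y) (F B2) z (up to FFT scaling)."""
    return B1.conj().T @ np.fft.ifft(np.fft.fft(y) * np.fft.fft(B2 @ z))

# tiny usage with stand-in bases and real signals
rng = np.random.default_rng(3)
m = 64
B1 = np.linalg.qr(rng.standard_normal((m, m)))[0]   # stand-in for a wavelet basis
B2 = np.linalg.qr(rng.standard_normal((m, m)))[0]
x1, x2 = rng.standard_normal(m), rng.standard_normal(m)
b = measure(x1, x2, B1, B2)
w = adjoint_matvec(b, x2, B1, B2)                    # one matvec, as a filter would use
```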
In the experiments on blind deconvolution, the GAUGE algorithm and its polynomial-filtered version PFGAUGE are tested. The tolerances for feasibility and the duality gap are both set to a moderate level. We mention that there is a primal-dual refinement strategy in the GAUGE method, as noted before. It consists of two sub-problems: generating a refined primal solution from a dual solution (pfd) and a dual solution from a primal one (dfp). In the "dfp" process, the algorithm has to solve a least-squares problem using the gradient descent method. The "dfp" step is very slow in the experiments and frequently leads to algorithm failure. Therefore, we disable this step since it is optional.

The test data are shown in Table 14. There are eight pictures numbered from 1 to 8. Case 1 is from the original GAUGE paper [7]. Cases 2 to 4 are from the matlab gallery. Cases 5 to 8 are downloaded from the HubbleSite Gallery as mentioned before. The results are presented in Table 15. The columns with labels "xErr1" and "xErr2" list the relative errors ||x_j - xhat_j|| / ||x_j||, j = 1, 2, where xhat_j stands for the solution returned by the algorithms. The column with label "rErr" contains the relative residual ||b - A(xhat_1 xhat_2^*)|| / ||b||. The other columns share the same meaning as in Table 13. We have similar observations under these settings: PFGAUGE produces solutions of comparable quality while being roughly twice as fast (see Table 15).

Table 14
Image data for blind deconvolution
No. name size
1 shapes 256 x 256
2 cameraman 256 x ...
4 lighthouse 480 x ...
6 crabnebula1024 1024 x ...
[the remaining entries of Table 14 did not survive extraction]

In this section we show the numerical results of our PFAM. The experiments fall into two categories: standard SDPs and non-linear SDPs.
First we test PFAM on the two-body reduced density matrix (2-RDM) problem. The problem has a block diagonal structure with respect to the variable X, with each block being a low-rank matrix. Thus the polynomial filters can be applied to each block to reduce the cost. We use the preprocessed dataset in [12]. Only the cases with a maximum block size greater than 1000 are selected. The details of the data can be found in [15].

Table 16 contains the numerical results of ADMM and PFAM. The columns with headers "pobj", "pinf", "dinf", and "gap" record the primal objective function value, primal infeasibility, dual infeasibility, and duality gap at the solution, respectively. The overall stopping tolerance of each algorithm is set to a moderate value, since high accuracy is not needed in this application: we use ADMM only to generate initial solutions for second-order methods such as SSNSDP [12].

As observed in Table 16, PFAM is roughly two times faster than ADMM on the large 2-RDM problems, such as CH2, C2, and NH3. We mention that the speedup provided by PFAM depends on the number of large low-rank variable blocks. One may notice that PFAM is less effective on some problems like AlH and BF. The main reason is that there are very few large blocks in these datasets, and polynomial filters are not designed for small matrices. In fact, a full eigenvalue decomposition is always observed to be faster than truncated eigen-solvers or polynomial-filtered methods in these small cases. Another minor reason is that some blocks are not low-rank, causing the performance of PFAM to be limited. This observation again highlights the intended use case of PFAM: it is most suitable for large-scale low-rank problems. A sketch of the blockwise filtering is given below.

Table 15
Blind deconvolution comparisons
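A sketch of the blockwise strategy described above is given here (our own illustration, with a plain power filter as a placeholder for the Chebyshev filter): the filter is applied independently to each diagonal block, and small blocks simply fall back to a full eigendecomposition, which the experiments show to be the faster choice in that regime.

```python
import numpy as np

def filtered_psd_blocks(blocks, bases, degree=4, min_size=1000):
    """Approximate the PSD part of each diagonal block of a block-diagonal matrix.

    blocks   : list of symmetric NumPy arrays (the diagonal blocks of X)
    bases    : list of warm-start bases, one per block (None for small blocks)
    min_size : below this size a full eigendecomposition is cheaper
    Assumes the wanted (positive) eigenvalues dominate in magnitude, as is the
    case near a low-rank PSD solution; otherwise the block should be shifted.
    """
    out = []
    for A, V in zip(blocks, bases):
        if A.shape[0] < min_size or V is None:
            w, U = np.linalg.eigh(A)                  # small block: exact EVD
        else:
            Y = V
            for _ in range(degree):                   # polynomial filter rho(t)=t^degree
                Y = A @ Y
            Q, _ = np.linalg.qr(Y)
            w, S = np.linalg.eigh(Q.T @ A @ Q)        # Rayleigh-Ritz on the subspace
            U = Q @ S
        out.append((U * np.maximum(w, 0.0)) @ U.T)    # PSD projection of the block
    return out

# usage: one small block (exact) and one large low-rank block (filtered)
rng = np.random.default_rng(4)
A_small = rng.standard_normal((50, 50)); A_small = (A_small + A_small.T) / 2
L = rng.standard_normal((2000, 20)); A_big = L @ L.T
V_warm = rng.standard_normal((2000, 30))
X_plus = filtered_psd_blocks([A_small, A_big], [None, V_warm])
```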
GAUGE (nodfp)
No. | time iter nDFT(nEigs) f gap rErr xErr1 xErr2
1 | 13.10 15 4248(16) 1.681e+00 1.7e-03 3.4e-04 9.1e-02 5.5e-01
2 | 14.34 14 4816(15) 1.683e+00 4.1e-03 4.1e-04 1.4e-01 5.4e-01
3 | 15.66 18 5440(19) 1.672e+00 2.0e-03 3.0e-04 1.3e-01 5.4e-01
4 | 63.48 21 7028(22) 1.683e+00 4.3e-03 2.5e-04 1.1e-01 5.4e-01
5 | 35.62 14 4552(15) 1.689e+00 4.7e-03 2.8e-04 1.3e-01 5.3e-01
6 | 252.59 26 9588(27) 1.673e+00 3.0e-04 3.6e-04 1.0e-01 5.4e-01
7 | 282.38 22 7116(23) 1.696e+00 4.7e-03 2.6e-04 9.8e-02 5.4e-01
8 | 559.64 36 14732(37) 1.667e+00 4.7e-03 7.2e-05 9.4e-02 5.3e-01

PFGAUGE (nodfp)
No. | time iter nDFT(nPF) f gap rErr xErr1 xErr2
1 | 6.85 15 2740(16) 1.681e+00 1.7e-03 4.0e-04 9.1e-02 5.5e-01
2 | 7.30 14 2844(15) 1.684e+00 4.7e-03 5.1e-04 1.4e-01 5.4e-01
3 | 8.93 18 3576(19) 1.671e+00 2.1e-04 3.7e-04 1.3e-01 5.4e-01
4 | 40.60 26 4560(27) 1.683e+00 4.1e-03 4.1e-04 1.1e-01 5.4e-01
5 | 21.91 14 2800(15) 1.689e+00 4.8e-03 2.9e-04 1.3e-01 5.3e-01
6 | 119.43 26 4524(27) 1.677e+00 2.9e-03 3.2e-04 1.0e-01 5.4e-01
7 | 163.91 25 4204(26) 1.696e+00 4.4e-03 3.0e-04 1.0e-01 5.4e-01
8 | 297.61 37 7568(38) 1.659e+00 4.1e-04 3.4e-04 9.3e-02 5.3e-01
As an extension, we consider plugging polynomial filters into a multi-block ADMM to observe the speedup. We are interested in the non-linear SDPs arising from the weighted LS model with spectral norm constraints and from the least unsquared deviations (LUD) model in [24]. Suppose K is a given integer and S and W are two known matrices. The weighted LS model with spectral norm constraints is

(5.10) max_G <W .* S, G>, s.t. G_ii = I_2, G >= 0, ||G||_2 <= alpha*K,

where G = (G_ij)_{i,j=1,...,K} in S^{2K} is the variable, each block G_ij is a small 2-by-2 matrix, and ||.||_2 is the spectral norm. A three-block ADMM is introduced to solve (5.10):

(5.11) y^{k+1} = -A(C + X^k - Z^k) - mu (A(G^k) - b),
(5.12) Z^{k+1} = argmin_Z  alpha*K*mu ||Z||_* + (1/2) ||Z - B^k||_F^2,
(5.13) X^{k+1} = argmin_{X >= 0}  (1/2) ||X - H^k||_F^2,
(5.14) G^{k+1} = (1 - gamma) G^k + gamma*mu (X^{k+1} - H^k),

where
B^k = C + X^k + A^*(y^{k+1}) + (1/mu) G^k,
H^k = Z^{k+1} - C - A^*(y^{k+1}) - mu G^k
are auxiliary variables. In the ADMM update, steps (5.12) and (5.13) require EVDs, both of which can be replaced with the polynomial-filtered method; a minimal sketch of such a filtered step is given below.
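As a rough illustration of how such an EVD can be replaced, the sketch below approximates an eigenvalue soft-thresholding step of the kind appearing in (5.12) on a symmetric matrix by extracting only the dominant subspace with a polynomial filter; eigenvalues that are not captured are assumed to lie below the threshold and would be shrunk to zero anyway. The function name, the power filter, and the symmetric-matrix setting are our own simplifications, not the implementation used in PFAM.

```python
import numpy as np

def filtered_svt(B, V, tau, degree=6):
    """Approximate prox of tau*||.||_* at a symmetric matrix B via a filtered subspace.

    Only eigenvalues with |lambda| > tau survive the shrinkage, so it suffices
    to extract the subspace where |lambda| is large.  V is a warm-start basis
    assumed to be rich enough to capture that subspace.
    """
    Y = V
    for _ in range(degree):
        Y = B @ Y                           # power filter toward the dominant eigenspace
    Q, _ = np.linalg.qr(Y)
    w, S = np.linalg.eigh(Q.T @ B @ Q)      # Ritz values on the subspace
    U = Q @ S
    shrunk = np.sign(w) * np.maximum(np.abs(w) - tau, 0.0)
    return (U * shrunk) @ U.T

# usage on a nearly low-rank symmetric matrix
rng = np.random.default_rng(5)
L = rng.standard_normal((1500, 10))
B = L @ L.T + 0.01 * np.eye(1500)
V = rng.standard_normal((1500, 15))
Z = filtered_svt(B, V, tau=5.0)
```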
Table 16
Results of ADMM and PFAM on 2-RDM problems
name | ADMM: time itr pobj pinf dinf gap | PFAM: time itr pobj pinf dinf gap
AlH | 73 373 2.46e+02 2.1e-05 9.9e-05 4.8e-05 | 71 377 2.46e+02 2.1e-05 9.9e-05 4.7e-05
B2 | 428 800 5.73e+01 9.6e-05 9.9e-05 1.2e-04 | 257 908 5.73e+01 8.1e-05 9.9e-05 1.3e-04
BF | 64 339 1.42e+02 3.6e-05 9.9e-05 7.0e-05 | 62 341 1.42e+02 3.6e-05 9.9e-05 6.9e-05
BH | 476 1058 2.73e+01 9.6e-05 9.5e-05 3.7e-04 | 322 1026 2.73e+01 9.1e-05 9.8e-05 3.5e-04
BH3O | 727 437 1.31e+02 6.9e-05 9.9e-05 2.0e-04 | 530 458 1.31e+02 2.8e-05 9.9e-05 1.7e-04
BN | 144 748 9.33e+01 3.0e-05 9.9e-05 1.1e-04 | 93 739 9.33e+01 4.2e-05 9.9e-05 1.2e-04
BO | 91 377 1.17e+02 8.8e-05 9.9e-05 9.9e-05 | 80 375 1.17e+02 9.8e-05 9.2e-05 1.0e-04
BeF | 68 356 1.28e+02 5.3e-05 9.9e-05 9.0e-05 | 58 354 1.28e+02 5.5e-05 9.8e-05 9.1e-05
BeO | 91 488 1.02e+02 7.3e-05 9.9e-05 8.8e-05 | 71 479 1.02e+02 4.8e-05 9.9e-05 9.6e-05
C2 | 1763 831 9.10e+01 5.8e-05 9.9e-05 1.8e-04 | 1032 911 9.10e+01 5.2e-05 9.9e-05 1.7e-04
CF | 68 354 1.59e+02 3.5e-05 9.9e-05 3.6e-05 | 60 351 1.59e+02 3.8e-05 9.9e-05 3.4e-05
CH | 453 969 4.12e+01 9.7e-05 9.8e-05 3.6e-04 | 303 1009 4.12e+01 9.8e-05 9.6e-05 3.4e-04
CH2 | 1516 1350 4.50e+01 8.0e-05 9.9e-05 3.9e-04 | 754 1339 4.50e+01 8.3e-05 9.9e-05 3.6e-04
CH2 | 1698 1496 4.48e+01 9.8e-05 9.2e-05 3.4e-04 | 782 1428 4.48e+01 9.8e-05 9.5e-05 3.5e-04
CH3 | 2028 1113 4.94e+01 5.5e-05 9.9e-05 1.4e-04 | 1010 1149 4.94e+01 6.7e-05 9.9e-05 2.5e-04
CH3N | 756 450 1.27e+02 4.8e-05 9.9e-05 1.6e-04 | 491 456 1.27e+02 5.9e-05 9.9e-05 1.5e-04
CN | 83 439 1.11e+02 8.9e-05 9.9e-05 8.8e-05 | 68 437 1.11e+02 9.8e-05 9.5e-05 8.3e-05
CO+ | 73 379 1.35e+02 8.9e-05 9.9e-05 9.1e-05 | 60 377 1.35e+02 9.4e-05 9.9e-05 9.1e-05
CO | 63 328 1.35e+02 6.7e-05 9.8e-05 3.5e-05 | 54 325 1.35e+02 6.8e-05 9.9e-05 3.4e-05
F- | 265 648 9.96e+01 9.9e-05 9.5e-05 2.9e-04 | 189 639 9.96e+01 9.0e-05 9.9e-05 2.4e-04
H2O | 819 717 8.54e+01 9.5e-05 9.9e-05 3.4e-04 | 495 809 8.54e+01 9.8e-05 9.0e-05 3.4e-04
HF | 268 576 1.05e+02 9.9e-05 9.3e-05 2.2e-04 | 210 575 1.05e+02 9.6e-05 9.8e-05 2.1e-04
HLi2 | 189 622 1.91e+01 7.5e-05 9.8e-05 1.4e-04 | 126 656 1.91e+01 7.9e-05 9.8e-05 1.5e-04
HN2+ | 99 326 1.38e+02 7.7e-05 9.9e-05 7.3e-05 | 82 326 1.38e+02 8.7e-05 9.9e-05 7.1e-05
HNO | 251 451 1.60e+02 7.4e-05 9.9e-05 4.1e-05 | 207 446 1.60e+02 7.8e-05 9.9e-05 3.9e-05
LiF | 80 406 1.16e+02 4.4e-05 9.9e-05 9.9e-05 | 73 406 1.16e+02 4.5e-05 9.9e-05 9.8e-05
LiH | 472 1030 9.00e+00 9.0e-05 4.2e-05 1.1e-04 | 369 1037 9.00e+00 9.1e-05 5.4e-05 1.2e-04
LiOH | 141 468 9.57e+01 9.8e-05 8.5e-05 1.4e-04 | 107 452 9.57e+01 9.5e-05 9.6e-05 1.6e-04
NH | 443 943 5.86e+01 8.6e-05 9.9e-05 3.0e-04 | 298 946 5.86e+01 8.8e-05 9.9e-05 3.1e-04
NH | 405 870 5.86e+01 8.9e-05 9.9e-05 2.8e-04 | 290 937 5.86e+01 9.8e-05 9.1e-05 2.7e-04
NH2- | 967 849 6.32e+01 9.8e-05 9.3e-05 4.0e-04 | 502 854 6.32e+01 9.7e-05 9.7e-05 3.9e-04
NH3 | 3775 925 6.82e+01 7.3e-05 9.9e-05 3.2e-04 | 1819 967 6.82e+01 9.7e-05 9.8e-05 3.4e-04
NaH | 73 370 1.65e+02 8.9e-05 9.8e-05 1.7e-06 | 66 370 1.65e+02 8.9e-05 9.9e-05 5.0e-06
P | 176 410 3.41e+02 2.2e-05 9.9e-05 7.0e-05 | 155 412 3.41e+02 2.2e-05 9.8e-05 6.6e-05
SiH4 | 212 292 3.12e+02 2.8e-05 9.9e-05 3.8e-05 | 193 292 3.12e+02 2.8e-05 9.9e-05 2.8e-05

The semi-definite relaxation of the LUD problem is

(5.15) min_G  sum_{1 <= i < j <= K} (...), s.t. (...), ||G||_2 <= alpha*K,

where taking alpha sufficiently large means that there is no spectral norm constraint in (5.15). The number of rotation matrices K is chosen from 500, 1000, 1500, and 2000. The solution G is low-rank: indeed, it is a rank-3 matrix of the form R R^T. Note that for ADMM we have tested both the full eigenvalue decomposition and the truncated version, and we report the one which costs less time.

Table 17
Results of ADMM and PFAM on non-linear SDPs
LS model, alpha = 0.
K | ADMM: time itr pinf dinf mse | PFAM: time itr pinf dinf mse
500 | 36.1 232 9.9e-04 1.9e-04 1.46e-02 | 9.2 231 1.0e-03 1.9e-04 1.46e-02
1000 | 120.1 163 9.7e-04 6.9e-04 9.82e-03 | 21.2 163 9.7e-04 6.9e-04 9.82e-03
1500 | 426.6 189 9.9e-04 3.2e-04 6.07e-03 | 74.4 189 9.9e-04 3.2e-04 6.07e-03
2000 | 1189.6 202 9.9e-04 1.7e-04 4.42e-03 | 148.7 202 9.9e-04 1.7e-04 4.42e-03
LUD model, alpha = 0.
K | ADMM: time itr pinf dinf mse | PFAM: time itr pinf dinf mse
500 | 7.9 78 9.9e-04 8.4e-05 9.99e-03 | 3.8 83 9.6e-04 8.2e-05 9.99e-03
1000 | 44.4 99 9.4e-04 9.5e-04 3.06e-03 | 15.3 90 9.8e-04 4.6e-05 5.42e-03
1500 | 131.9 101 9.1e-04 6.1e-04 2.24e-03 | 47.8 118 9.1e-04 6.5e-04 2.24e-03
2000 | 334.5 102 9.3e-04 4.1e-04 1.91e-03 | 79.0 103 9.4e-04 4.2e-04 1.91e-03
LUD model, alpha = 0.
K | ADMM: time itr pinf dinf mse | PFAM: time itr pinf dinf mse
500 | 47.2 274 1.0e-03 2.7e-04 1.91e-03 | 14.9 277 9.9e-04 2.7e-04 1.91e-03
1000 | 294.3 356 1.0e-03 1.3e-04 1.74e-03 | 68.0 369 1.0e-03 1.3e-04 1.74e-03
1500 | 871.8 364 1.0e-03 1.2e-04 1.60e-03 | 165.7 316 1.0e-03 4.4e-04 1.63e-03
2000 | 2526.5 413 1.0e-03 3.2e-04 2.41e-03 | 347.2 373 1.0e-03 4.1e-04 2.41e-03

These examples again justify the effectiveness of PFAM, though the convergence property of the multi-block version is not yet clear.

6. Conclusion. In this paper, we propose a framework of polynomial-filtered methods for low-rank optimization problems. Our motivation is based on the key observation that the iteration points lie in a low-dimensional subspace of R^{n x n}. Therefore, the strategy is to extract this subspace approximately and then perform one update based on the projection of the current iteration point. Polynomials are applied to increase the accuracy of the extracted subspace. Intuitively, the target subspaces of two consecutive iterations should be close enough under suitable conditions. We then give two concrete examples, PFPG and PFAM, in order to show the basic structure of polynomial-filtered methods. It is easy to observe that this kind of method couples the subspace refinement and the main iteration together. In the theoretical part, we analyze the convergence of PFPG and PFAM. A key assumption is that the initial subspace should not be orthogonal to the target subspace to be used in the next iteration. Together with the Chebyshev polynomials, we are able to estimate the approximation error of the subspace.
The main convergence result indicates that the degree of the polynomial can remain a constant as the iterations proceed, which is meaningful in real applications. Even if the warm-start property is not considered, the degree grows very slowly (of order O(log k)) to ensure convergence. Our numerical experiments show that the polynomial-filtered algorithms are quite effective on low-rank problems compared with the original methods, since they successfully reduce the computational cost of large-scale EVDs. Meanwhile, the number of iterations is barely increased. These observations coincide with our theoretical results.

REFERENCES
[1] D. G. Anderson, Iterative procedures for nonlinear integral equations, Journal of the ACM, 12 (1965), pp. 547-560.
[2] S. Becker, V. Cevher, and A. Kyrillidis, Randomized low-memory singular value projection, arXiv preprint arXiv:1303.0167, (2013).
[3] J. Cai, E. Candès, and Z. Shen, A singular value thresholding algorithm for matrix completion, SIAM Journal on Optimization, 20 (2010), pp. 1956-1982.
[4] E. Candès, X. Li, and M. Soltanolkotabi, Phase retrieval via Wirtinger flow: theory and algorithms, arXiv e-prints, (2014).
[5] C. Davis and W. M. Kahan, The rotation of eigenvectors by a perturbation. III, SIAM Journal on Numerical Analysis, 7 (1970), pp. 1-46.
[6] D. Davis and W. Yin, Convergence rate analysis of several splitting schemes, in Splitting Methods in Communication, Imaging, Science, and Engineering, Springer, 2016, pp. 115-163.
[7] M. Friedlander and I. Macêdo, Low-rank spectral optimization via gauge duality, SIAM Journal on Scientific Computing, 38 (2016), pp. A1616-A1638.
[8] M. X. Goemans and D. P. Williamson, Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming, Journal of the ACM, 42 (1995), pp. 1115-1145.
[9] G. H. Golub and C. F. Van Loan, Matrix Computations, vol. 3, JHU Press, 2012.
[10] F. Kangal, K. Meerbergen, E. Mengi, and W. Michiels, A subspace method for large-scale eigenvalue optimization, SIAM Journal on Matrix Analysis and Applications, 39 (2018), pp. 48-82.
[11] D. Kressner, D. Lu, and B. Vandereycken, Subspace acceleration for the Crawford number and related eigenvalue optimization problems, SIAM Journal on Matrix Analysis and Applications, 39 (2018), pp. 961-982.
[12] Y. Li, Z. Wen, C. Yang, and Y. Yuan, A semi-smooth Newton method for solving semidefinite programs in electronic structure calculations, arXiv preprint arXiv:1708.08048, (2017).
[13] Z. Lin, M. Chen, and Y. Ma, The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices, Journal of Structural Biology, 181 (2013), pp. 116-127.
[14] X. Liu, Z. Wen, and Y. Zhang, An efficient Gauss-Newton algorithm for symmetric low-rank product matrix approximations, SIAM Journal on Optimization, 25 (2015), pp. 1571-1608.
[15] M. Nakata, B. J. Braams, K. Fujisawa, M. Fukuda, J. K. Percus, M. Yamashita, and Z. Zhao, Variational calculation of second-order reduced density matrices by strong n-representability conditions and an accurate semidefinite programming solver, The Journal of Chemical Physics, 128 (2008), p. 164113.
[16] N. Parikh, S. Boyd, et al., Proximal algorithms, Foundations and Trends in Optimization, 1 (2014), pp. 127-239.
[17] H. Qi and D. Sun, A quadratically convergent Newton method for computing the nearest correlation matrix, SIAM Journal on Matrix Analysis and Applications, 28 (2006), pp. 360-385.
[18] D. Scieur, A. d'Aspremont, and F. Bach, Regularized nonlinear acceleration, in Advances in Neural Information Processing Systems, 2016, pp. 712-720.
[19] P. Sirkovic and D. Kressner, Subspace acceleration for large-scale parameter-dependent Hermitian eigenproblems, SIAM Journal on Matrix Analysis and Applications, 37 (2016), pp. 695-718.
[20] M. Soltani and C. Hegde, Fast low-rank matrix estimation without the condition number, arXiv preprint arXiv:1712.03281, (2017).
[21] K.-C. Toh and S. Yun, An accelerated proximal gradient algorithm for nuclear norm regularized linear least squares problems, Pacific Journal of Optimization, 6 (2010), p. 15.
[22] B. Vandereycken, Low-rank matrix completion by Riemannian optimization, SIAM Journal on Optimization, 23 (2013), pp. 1214-1236.
[23] H. F. Walker and P. Ni, Anderson acceleration for fixed-point iterations, SIAM Journal on Numerical Analysis, 49 (2011), pp. 1715-1735.
[24] L. Wang, A. Singer, and Z. Wen, Orientation determination of cryo-EM images using least unsquared deviations, SIAM Journal on Imaging Sciences, 6 (2013), pp. 2450-2483.
[25] J. Wright, A. Ganesh, S. Rao, Y. Peng, and Y. Ma, Robust principal component analysis: exact recovery of corrupted low-rank matrices via convex optimization, in Advances in Neural Information Processing Systems 22, Curran Associates, Inc., 2009, pp. 2080-2088.
[26] Y. Ying and P. Li, Distance metric learning with eigenvalue optimization, Journal of Machine Learning Research, 13 (2012), pp. 1-26.
[27] T. Zhou and D. Tao, GoDec: randomized low-rank & sparse matrix decomposition in noisy case, in International Conference on Machine Learning, Omnipress, 2011.