Structured Quasi-Newton Methods for Optimization with Orthogonality Constraints
JIANG HU∗, BO JIANG†, LIN LIN‡, ZAIWEN WEN§, AND YAXIANG YUAN¶

∗Beijing International Center for Mathematical Research, Peking University, China ([email protected]).
†School of Mathematical Sciences, Key Laboratory for NSLSCS of Jiangsu Province, Nanjing Normal University, China ([email protected]). Research supported in part by NSFC grants 11501298 and 11671036, and by the NSF of Jiangsu Province (BK20150965).
‡Department of Mathematics, University of California, Berkeley, Berkeley, CA 94720 and Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720 ([email protected]). Research supported in part by the National Science Foundation under Grant No. DMS-1652330, the Department of Energy under Grants No. DE-SC0017867 and No. DE-AC02-05CH11231, and the SciDAC project.
§Beijing International Center for Mathematical Research, Peking University, China ([email protected]). Research supported in part by NSFC grants 11831002, 11421101 and 91730302, and by the National Basic Research Project under grant 2015CB856002.
¶State Key Laboratory of Scientific and Engineering Computing, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, China ([email protected]). Research supported in part by NSFC grants 11331012 and 11461161005.

Abstract.
In this paper, we study structured quasi-Newton methods for optimization problems with orthogonality constraints. Note that the Riemannian Hessian of the objective function requires both the Euclidean Hessian and the Euclidean gradient. In particular, we are interested in applications where the Euclidean Hessian itself consists of a computationally cheap part and a significantly more expensive part. Our basic idea is to keep the parts of lower computational cost but to substitute the parts of higher computational cost by a limited-memory quasi-Newton update. More specifically, the part related to the Euclidean gradient and the cheaper parts of the Euclidean Hessian are preserved. The initial quasi-Newton matrix is further constructed from a limited-memory Nyström approximation to the expensive part. Consequently, our subproblems approximate the original objective function in the Euclidean space and preserve the orthogonality constraints without performing the so-called vector transports. When the subproblems are solved to sufficient accuracy, both global and local q-superlinear convergence can be established under mild conditions. Preliminary numerical experiments on the linear eigenvalue problem and the electronic structure calculation show the effectiveness of our method compared with state-of-the-art algorithms.
Key words. optimization with orthogonality constraints, structured quasi-Newton method, limited-memory Nyström approximation, Hartree-Fock total energy minimization, convergence
AMS subject classifications.
1. Introduction.
In this paper, we consider the optimization problem with orthogonality constraints:

(1.1)    min_{X ∈ C^{n×p}} f(X)   s.t.  X^*X = I_p,

where f(X) : C^{n×p} → R is an R-differentiable function [26]. Although our proposed methods are applicable to a general function f(X), we are particularly interested in cases where the Euclidean Hessian ∇²f(X) takes the natural structure

(1.2)    ∇²f(X) = H_c(X) + H_e(X),

where the computational cost of H_e(X) is much more expensive than that of H_c(X). This situation occurs when f is a summation of functions whose full Hessians are expensive to evaluate or even not accessible. A practical example is the Hartree-Fock-like total energy minimization problem in electronic structure theory [38, 31], where the computational cost associated with the Fock exchange matrix is significantly larger than the cost of the remaining components.

There are extensive methods for solving (1.1) in the literature. By exploiting the geometry of the manifold (i.e., of the orthogonality constraints), Riemannian gradient descent, conjugate gradient (CG), Newton and trust-region methods have been proposed in [11, 10, 41, 36, 1, 2, 44]. Since second-order information is sometimes not available, quasi-Newton type methods serve as an alternative with good convergence properties. Different from the Euclidean quasi-Newton methods, the vector transport operation [2] is used to compare tangent vectors lying in different tangent spaces. After obtaining a descent direction, the so-called retraction provides a curvilinear search along the manifold. By adding some restrictions between the differentiable retraction and the vector transport, a Riemannian Broyden-Fletcher-Goldfarb-Shanno (BFGS) method is presented in [33, 34, 35]. Due to the requirement of a differentiable retraction, the computational cost associated with the vector transport operation may be high. To avoid this disadvantage, the authors of [18, 21, 23, 20] developed a new class of Riemannian BFGS methods, symmetric rank-one (SR1) and Broyden family methods. Moreover, a selection of Riemannian quasi-Newton methods has been implemented in the software packages Manopt [6] and ROPTLIB [19].
Since the set of matrices with orthonormal columns can be viewed as the Stiefel manifold, the existing quasi-Newton methods focus on the construction of an approximation to the Riemannian Hessian Hess f(X):

(1.3)    Hess f(X)[ξ] = Proj_X(∇²f(X)[ξ] − ξ sym(X^*∇f(X))),

where ξ is any tangent vector in the tangent space T_X := {ξ ∈ C^{n×p} : X^*ξ + ξ^*X = 0}, Proj_X(Z) := Z − X sym(X^*Z) is the projection of Z onto T_X, and sym(A) := (A + A^*)/2. See [3] for details on the structure (1.3). We briefly summarize our contributions as follows.
• By taking advantage of the structure (1.3), we construct an approximation to the Euclidean Hessian ∇²f(X) instead of the full Riemannian Hessian Hess f(X) directly, but keep the remaining parts ξ sym(X^*∇f(X)) and Proj_X(·); a short code sketch of this structure is given after this list. Then, we solve a subproblem with orthogonality constraints, whose objective function uses an approximate second-order Taylor expansion of f with an extra regularization term. Similar to [16], a trust-region-like strategy for the update of the regularization parameter and a modified CG method for solving the subproblem are utilized. No vector transport is needed since we are working in the ambient Euclidean space.
• By further taking advantage of the structure (1.2) of f, we develop a structured quasi-Newton approach that constructs an approximation to the expensive part H_e while preserving the cheap part H_c. This kind of structured approximation usually enjoys better properties than an approximation constructed by the vanilla quasi-Newton method. For the construction of an initial approximation of H_e, we also investigate a limited-memory Nyström approximation, which gives a subspace approximation of a known good but still complicated approximation of H_e.
• When the subproblems are solved to certain accuracy, both global and local q-superlinear convergence can be established under certain mild conditions.
• Applications to the linear eigenvalue problem and the electronic structure calculation are presented. The proposed algorithms perform comparably well with state-of-the-art methods in these two applications.
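The following MATLAB sketch illustrates the structure (1.3). It is not taken from the authors' code: the handles egrad and hessB are placeholders for the Euclidean gradient ∇f(X) and for a (possibly approximate) Euclidean Hessian action B_k[·].

    % Minimal sketch of the structured Riemannian Hessian action (1.3).
    % egrad is the Euclidean gradient at X; hessB applies a (possibly
    % approximate) Euclidean Hessian to a tangent direction xi.
    function Hxi = structured_rhess(X, xi, egrad, hessB)
        symm  = @(A) (A + A') / 2;            % sym(A) = (A + A^*)/2
        projX = @(Z) Z - X * symm(X' * Z);    % Proj_X(Z): projection onto T_X
        Hxi = projX(hessB(xi) - xi * symm(X' * egrad));
    end

Note that only the expensive term is approximated; the projection and the term ξ sym(X^*∇f(X)) are kept exactly, which is the point exploited throughout the paper.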
Electronic structure theories, and particularly Kohn-Sham density functional theory (KSDFT), play an important role in quantum physics, quantum chemistry and materials science. This problem can be interpreted as a minimization problem for the electronic total energy over multiple electron wave functions which are orthogonal to each other. The mathematical structure of the Kohn-Sham equations depends heavily on the choice of the exchange-correlation (XC) functional. With some abuse of terminology, throughout the paper we will use KSDFT to refer to Kohn-Sham equations with local or semi-local exchange-correlation functionals. Before discretization, the corresponding Kohn-Sham Hamiltonian is a differential operator. On the other hand, when hybrid exchange-correlation functionals [4, 15] are used, the Kohn-Sham Hamiltonian becomes an integro-differential operator, and the Kohn-Sham equations become Hartree-Fock-like equations. Again with some abuse of terminology, we will refer to such calculations as HF calculations.

For KSDFT calculations, the most popular numerical scheme is the self-consistent field (SCF) iteration, which can be efficient when combined with certain charge mixing techniques. Since hybrid exchange-correlation functionals depend on all the elements of the density matrix, HF calculations are usually more difficult than KSDFT calculations. One commonly used algorithm is the nested two-level SCF method [12]. In the inner SCF loop, by fixing the density matrix and the hybrid exchange operator, it only performs an update on the charge density ρ, which is solved by the SCF iteration. Once the stopping criterion of the inner iteration is satisfied, the density matrix is updated in the outer loop according to the Kohn-Sham orbitals computed in the inner loop. This method can also utilize charge mixing schemes in the inner SCF loop to accelerate convergence. Recently, by combining with the adaptively compressed exchange operator (ACE) method [28], the convergence rate of the nested two-level SCF method has been greatly improved. Another popular algorithm for HF calculations is the commutator direct inversion of the iterative subspace (C-DIIS) method. By storing the density matrix explicitly, it can often lead to an accelerated convergence rate. However, when the size of the density matrix becomes large, its storage cost becomes prohibitively expensive. Thus Lin et al. [17] proposed the projected C-DIIS (PC-DIIS) method, which only requires the storage of wave-function-type objects instead of the whole density matrix.

HF calculations can also be solved with the aforementioned Riemannian optimization methods (e.g., a feasible gradient method on the Stiefel manifold [44]) without storing the density matrix explicitly. However, these existing methods often do not use the structure of the Hessian in KSDFT or HF calculations. In this paper, by exploiting the structure of the Hessian, we apply our structured quasi-Newton method to solve these problems. Preliminary numerical experiments show that our algorithm performs at least comparably well with state-of-the-art methods in the cases where they converge. In the cases where state-of-the-art methods fail, our algorithm often still returns high-quality solutions.

This paper is organized as follows. In section 2, we introduce our structured quasi-Newton method and present our algorithm. In section 3, the global and local convergence is analyzed under certain inexact conditions.
In sections 4 and 5, detailed applications to the linear eigenvalue problem and the electronic structure calculation are discussed. Finally, we demonstrate the efficiency of our proposed algorithm in section 6.
Notation. For a matrix X ∈ C^{n×p}, we use X̄, X^*, ℜX and ℑX to denote its complex conjugate, complex conjugate transpose, real part and imaginary part, respectively. Let span{X_1, ..., X_l} be the space spanned by the matrices X_1, ..., X_l. The vector vec(X) ∈ C^{np} is formed by stacking the columns of X one by one, from the first to the last; the operator mat(·) is the inverse of vec(·), i.e., mat(vec(X)) = X. Given two matrices A, B ∈ C^{n×p}, the Frobenius inner product ⟨·,·⟩ is defined as ⟨A, B⟩ = tr(A^*B), and the corresponding Frobenius norm ‖·‖_F is defined as ‖A‖_F = (tr(A^*A))^{1/2}. The Hadamard product of A and B is A ⊙ B with (A ⊙ B)_{ij} = A_{ij}B_{ij}. For a matrix M ∈ C^{n×n}, diag(M) is the vector in C^n formed by the main diagonal of M; for c ∈ C^n, Diag(c) is the n-by-n diagonal matrix with the elements of c on the main diagonal. The notation I_p denotes the p-by-p identity matrix. Let St(n, p) := {X ∈ C^{n×p} : X^*X = I_p} be the (complex) Stiefel manifold. The notation N refers to the set of all natural numbers.
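As a toy illustration of these conventions (a check written for this cleaned version, not part of the paper):

    % Toy check of the notation: vec/mat, Frobenius inner product and norm.
    n = 4; p = 2;
    X = randn(n, p) + 1i * randn(n, p);
    A = randn(n, p) + 1i * randn(n, p);
    v = X(:);                                  % vec(X): column stacking
    assert(isequal(reshape(v, n, p), X));      % mat(vec(X)) = X
    ipAX = trace(A' * X);                      % <A, X> = tr(A^* X)
    nrmX = sqrt(real(trace(X' * X)));          % ||X||_F = sqrt(tr(X^* X))
    assert(abs(nrmX - norm(X, 'fro')) < 1e-12);
    H = A .* X;                                % Hadamard product A ⊙ X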
2. A structured quasi-Newton approach.

2.1. Structured quasi-Newton subproblem.
In this subsection, we develop the structured quasi-Newton subproblem for solving (1.1). Based on the assumption (1.2), methods using the exact Hessian ∇²f(X) may not be the best choice. When the computational cost of the gradient ∇f(X) is significantly cheaper than that of the Hessian ∇²f(X), quasi-Newton methods, which mainly use gradients to construct an approximation to ∇²f(X), may outperform other methods. Considering the form (1.2), we can construct a structured quasi-Newton approximation B_k to ∇²f(X^k). The details will be presented in section 2.2. Note that a similar idea has been presented in [47] for unconstrained nonlinear least squares problems [24, 37]. Then our subproblem at the k-th iteration is constructed as

(2.1)    min_{X ∈ C^{n×p}} m_k(X)   s.t.  X^*X = I,

where

    m_k(X) := ℜ⟨∇f(X^k), X − X^k⟩ + (1/2)ℜ⟨B_k[X − X^k], X − X^k⟩ + (τ_k/2) d(X, X^k)

is an approximation to f(X) in the Euclidean space. For the second-order Taylor expansion of f(X) at a point X^k, we refer to [43, section 1.1] for details. Here, τ_k is a regularization parameter and d(X, X^k) is a proximal term to guarantee convergence. The proximal term can be chosen as the quadratic regularization

(2.2)    d(X, X^k) = ‖X − X^k‖²_F

or the cubic regularization

(2.3)    d(X, X^k) = (2/3)‖X − X^k‖³_F.

In the following, we will mainly focus on the quadratic regularization (2.2). Due to the Stiefel manifold constraint, the quadratic regularization (2.2) is actually equivalent, up to a constant, to the linear term −2ℜ⟨X, X^k⟩. By using the Riemannian Hessian formulation (1.3) on the Stiefel manifold, we have

(2.4)    Hess m_k(X^k)[ξ] = Proj_{X^k}(B_k[ξ] − ξ sym((X^k)^*∇f(X^k))) + τ_k ξ,   ξ ∈ T_{X^k}.

Hence, the regularization term shifts the spectrum of the Riemannian Hessian of the approximation B_k by τ_k.

The Riemannian quasi-Newton methods for (1.1) in the literature [19, 21, 22, 23] focus on constructing an approximation to the Riemannian Hessian Hess f(X^k) directly without using its special structure (1.3). Therefore, vector transports need to be utilized to move tangent vectors from different tangent spaces into one common tangent space. If p ≪ n, the second term sym((X^k)^*∇f(X^k)) is a small-scale matrix and thus can be computed at low cost. In this case, after computing the approximation B_k[ξ] of ∇²f(X^k)[ξ], we obtain a structured Riemannian quasi-Newton approximation Proj_{X^k}(B_k[ξ] − ξ sym((X^k)^*∇f(X^k))) of Hess f(X^k)[ξ] without using any vector transport.

2.2. The construction of B_k. The classical quasi-Newton methods construct the approximation B_k such that it satisfies the secant condition

(2.5)    B_k[S^k] = ∇f(X^k) − ∇f(X^{k−1}),

where S^k := X^k − X^{k−1}. Noticing that ∇²f(X) takes the natural structure (1.2), it is reasonable to keep the cheaper part H_c(X) and only to approximate H_e(X). Specifically, we construct the approximation B_k to the Hessian ∇²f(X^k) as

(2.6)    B_k = H_c(X^k) + E_k,

where E_k is an approximation to H_e(X^k). Substituting (2.6) into (2.5), we see that the approximation E_k should satisfy the following revised secant condition

(2.7)    E_k[S^k] = Y^k,

where

(2.8)    Y^k := ∇f(X^k) − ∇f(X^{k−1}) − H_c(X^k)[S^k].
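A small sketch of the structured secant data (2.5)-(2.8); gradf and Hc_apply are hypothetical handles for the Euclidean gradient ∇f(·) and for the cheap Hessian action H_c(X^k)[·], not names from the paper:

    % Sketch: the structured secant pair (S^k, Y^k) of (2.7)-(2.8).
    % gradf and Hc_apply are hypothetical handles, not the authors' code.
    function [Sk, Yk] = structured_secant_pair(Xk, Xkm1, gradf, Hc_apply)
        Sk = Xk - Xkm1;                                 % S^k = X^k - X^{k-1}
        Yk = gradf(Xk) - gradf(Xkm1) - Hc_apply(Sk);    % Y^k in (2.8)
        % The approximation E_k of the expensive part H_e(X^k) is then
        % required to satisfy the revised secant condition E_k[S^k] = Y^k.
    end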
For large-scale optimization problems, limited-memory quasi-Newton methods are preferred since they often provide simple but good approximations of the exact Hessian. Considering that the part H_e(X^k) itself may not be positive definite even when X^k is optimal, we utilize the limited-memory symmetric rank-one (LSR1) scheme to approximate H_e(X^k) such that it satisfies the secant equation (2.7). Let l = min{k, m}. We define the (np) × l matrices S_{k,m} and Y_{k,m} by

    S_{k,m} = [vec(S^{k−l}), ..., vec(S^{k−1})],   Y_{k,m} = [vec(Y^{k−l}), ..., vec(Y^{k−1})].

Let E_k^0 : C^{n×p} → C^{n×p} be the initial approximation of H_e(X^k) and define the (np) × l matrix

    Σ_{k,m} := [vec(E_k^0[S^{k−l}]), ..., vec(E_k^0[S^{k−1}])].

Let F_{k,m} be the matrix in C^{l×l} with (F_{k,m})_{i,j} = ⟨S^{k−l+i−1}, Y^{k−l+j−1}⟩ for 1 ≤ i, j ≤ l. Under the assumption that ⟨S^j, E_j[S^j] − Y^j⟩ ≠ 0, j = k − l, ..., k − 1,
it follows from [9, Theorem 5.1] that the matrix F_{k,m} − (S_{k,m})^*Σ_{k,m} is invertible and the LSR1 update gives

(2.9)    E_k[U] = E_k^0[U] + mat(N_{k,m}(F_{k,m} − (S_{k,m})^*Σ_{k,m})^{−1}(N_{k,m})^* vec(U)),

where U ∈ C^{n×p} is any direction and N_{k,m} = Y_{k,m} − Σ_{k,m}. In the practical implementation, we skip the update if |⟨S^j, E_j[S^j] − Y^j⟩| ≤ r‖S^j‖_F ‖E_j[S^j] − Y^j‖_F for a small positive number r. A similar idea can be found in [32].

2.3. The initial approximation E_k^0. A good initial guess for the exact Hessian is also important to accelerate the convergence of a limited-memory quasi-Newton method. Here, we assume that a good initial approximation E_k^0 of the expensive part H_e(X^k) of the Hessian is known, but that its computational cost is still very high. We show how to use the limited-memory Nyström approximation to construct another approximation with lower computational cost based on E_k^0. Specifically, let Ω be a matrix whose columns form an orthogonal basis of a well-chosen subspace S and denote W = E_k^0[Ω]. To reduce the computational cost and keep the good properties of E_k^0, we construct the following approximation

(2.10)    Ê_k[U] := W(W^*Ω)†W^*U,

where U ∈ C^{n×p} is any direction. This is called the limited-memory Nyström approximation; see [40] and the references therein for more details. By choosing the dimension of the subspace S properly, the rank of W(W^*Ω)†W^* can be small enough that the computational cost of Ê_k[U] is significantly reduced. Furthermore, we still want Ê_k to satisfy the secant condition (2.7) as E_k^0 does. More specifically, we need to seek a subspace S such that the secant condition

    Ê_k[S^k] = Y^k

holds. To this aim, the subspace S can be chosen as span{X^{k−1}, X^k}, which contains the element S^k. By assuming that E_k^0[UV] = E_k^0[U]V for any matrices U, V of proper dimensions (this condition is satisfied when E_k^0 is a matrix), Ê_k will satisfy the secant condition whenever E_k^0 does. Following the methods for linear eigenvalue computation in [25] and [30], the subspace S can also be chosen as

(2.11)    span{X^{k−1}, X^k, E_k^0[X^k]}   or   span{X^{k−h}, ..., X^{k−1}, X^k}

with a small memory length h. Once the subspace is chosen, we can obtain the limited-memory Nyström approximation by computing E_k^0[Ω] once and the pseudo-inverse of a small-scale matrix.
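A minimal sketch of (2.10), assuming applyE0 is a hypothetical handle for the action of E_k^0 and Scols collects matrices (e.g., [X^{k−1}, X^k]) spanning the chosen subspace S:

    % Sketch of the limited-memory Nystrom approximation (2.10).
    % applyE0 and Scols are assumptions for this illustration.
    function EhatU = nystrom_apply(applyE0, Scols, U)
        Omega = orth(Scols);                 % orthonormal basis of S
        W = applyE0(Omega);                  % one expensive application of E_k^0
        EhatU = W * (pinv(W' * Omega) * (W' * U));   % W (W^* Omega)^+ W^* U
    end

The expensive operator is applied only once per iteration (to Ω); every subsequent application of Ê_k costs only a few thin matrix products.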
Based on the theory of quasi-Newton methods for unconstrained optimization, algorithms which simply take the solution of (2.1) as the next iterate may not converge without proper requirements on the approximation B_k or the regularization parameter τ_k. Hence, we update the regularization parameter using a trust-region-like strategy. Following [16], we compute a trial point Z^k by using a modified CG method to solve the subproblem inexactly, i.e., to approximately solve the Newton equation of (2.1) at X^k:

(2.12)    grad m_k(X^k) + Hess m_k(X^k)[ξ] = 0,   ξ ∈ T_{X^k},

where grad m_k(X^k) = grad f(X^k) and Hess m_k(X^k) is given in (2.4). After obtaining the trial point Z^k of (2.1), we calculate the ratio between the actual reduction and the predicted reduction

(2.13)    r_k = (f(Z^k) − f(X^k)) / m_k(Z^k).

If r_k ≥ η_1 > 0, the iteration is successful and we set X^{k+1} = Z^k; otherwise, the iteration is unsuccessful and we set X^{k+1} = X^k, that is,

(2.14)    X^{k+1} = Z^k if r_k ≥ η_1,   X^{k+1} = X^k otherwise.

The regularization parameter τ_{k+1} is updated as

(2.15)    τ_{k+1} ∈ (0, γ_0 τ_k]   if r_k ≥ η_2;   τ_{k+1} ∈ [τ_k, γ_1 τ_k]   if η_1 ≤ r_k < η_2;   τ_{k+1} ∈ [γ_1 τ_k, γ_2 τ_k]   otherwise,

where 0 < η_1 ≤ η_2 < 1 and 0 < γ_0 < 1 < γ_1 ≤ γ_2. These parameters determine how aggressively the regularization parameter is decreased when an iteration is successful and increased when an iteration is unsuccessful. In practice, the performance of the regularized trust-region algorithm is not very sensitive to the values of these parameters.

Notice that Newton-type methods may still be very slow when the Hessian is close to singular [8]. Numerically, it may happen that the regularization parameter becomes huge and the Riemannian Newton direction is nearly parallel to the negative gradient direction. Then the update X^{k+1} essentially belongs to the subspace S̃_k := span{X^k, grad f(X^k)}, which makes the iteration similar to a Riemannian gradient step. To overcome this issue, we propose an optional step of solving (1.1) restricted to a subspace. Specifically, at X^k, we construct a subspace S_k with an orthogonal basis Q_k ∈ C^{n×q} (p ≤ q ≤ n), where q is the dimension of S_k. Then any point X in the subspace S_k can be represented as X = Q_kM for some M ∈ C^{q×p}. Similar to the subspace constructions for linear eigenvalue problems in [25] and [30], the subspace S_k can be built from the history information {X^k, X^{k−1}, ...}, {grad f(X^k), grad f(X^{k−1}), ...} and other useful information. Given the subspace S_k, the subspace method aims to find a solution of (1.1) with the extra constraint X ∈ S_k, namely,

(2.16)    min_{M ∈ C^{q×p}} f(Q_kM)   s.t.  M^*M = I_p.

The problem (2.16) can be solved inexactly by existing methods for optimization with orthogonality constraints. Once a good approximate solution M^k of (2.16) is obtained, we update X^{k+1} = Q_kM^k, which is an approximate minimizer in the subspace S_k instead of S̃_k. This completes one step of the subspace iteration. In practice, we compute the ratios between the norms of the Riemannian gradients of the last few iterations. If all of these ratios are close to 1, we infer that the current iterates stagnate and the subspace method is invoked. Our algorithm framework is outlined in Algorithm 1.
Algorithm 1: A structured quasi-Newton method with subspace refinement
Input: initial guess X^0 ∈ C^{n×p} with (X^0)^*X^0 = I_p and memory length m. Choose τ_0 > 0, 0 < η_1 ≤ η_2 < 1, 0 < γ_0 < 1 < γ_1 ≤ γ_2. Set k = 0.
while stopping conditions not met do
    Choose E_k^0 (use the limited-memory Nyström approximation if necessary).
    Construct the approximation B_k via (2.6) and (2.9).
    Construct the subproblem (2.1) and use the modified CG method (Algorithm 2 in [16]) to compute a new trial point Z^k.
    Compute the ratio r_k via (2.13).
    Update X^{k+1} from the trial point Z^k based on (2.14).
    Update τ_{k+1} according to (2.15).
    k ← k + 1.
    if stagnation conditions met then
        Solve the subspace problem (2.16) to update X^{k+1}.
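For concreteness, the acceptance test and regularization update (2.13)-(2.15) inside Algorithm 1 can be sketched as follows; f and mk are handles for the objective and the model value, and the parameter values are illustrative choices satisfying 0 < η_1 ≤ η_2 < 1 and 0 < γ_0 < 1 < γ_1 ≤ γ_2, not the ones used in the paper's experiments.

    % Sketch of the updates (2.13)-(2.15); parameter values are illustrative.
    function [Xnext, taunext] = accept_and_update(f, mk, Xk, Zk, tauk)
        eta1 = 0.01; eta2 = 0.9;
        gamma0 = 0.5; gamma1 = 2; gamma2 = 10;   %#ok<NASGU> gamma2 bounds the increase
        rk = (f(Zk) - f(Xk)) / mk(Zk);       % ratio (2.13); note m_k(X^k) = 0
        if rk >= eta1
            Xnext = Zk;                      % successful iteration, cf. (2.14)
        else
            Xnext = Xk;                      % unsuccessful iteration
        end
        if rk >= eta2
            taunext = gamma0 * tauk;         % very successful: decrease tau
        elseif rk >= eta1
            taunext = tauk;                  % successful: keep tau (within [tauk, gamma1*tauk])
        else
            taunext = gamma1 * tauk;         % unsuccessful: increase tau (within [gamma1, gamma2]*tauk)
        end
    end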
3. Convergence analysis.
In this section, we present the convergence properties of Algorithm 1. To guarantee global convergence and a fast local convergence rate, the inexact conditions for the subproblem (2.1) (with quadratic or cubic regularization) are chosen as

(3.1)    m_k(Z^k) ≤ −c‖grad f(X^k)‖²_F,
(3.2)    ‖grad m_k(Z^k)‖_F ≤ θ_k‖grad f(X^k)‖_F,

with some positive constant c and θ_k := min{1, ‖grad f(X^k)‖_F}. Here, inequality (3.1) guarantees global convergence and inequality (3.2) leads to fast local convergence. Throughout the convergence analysis, we assume that the stagnation conditions are never met. (In fact, a sufficient decrease of the original objective in each iteration is guaranteed by the construction of the subspace refinement; hence, the global convergence results still hold.) Since a regularization term is used, the global convergence of our method can be obtained by assuming boundedness of the constructed Hessian approximations. We first make the following assumptions.
Assumption 1. Let {X^k} be generated by Algorithm 1 without subspace refinement. We assume:
(A1) The gradient ∇f is Lipschitz continuous on the convex hull of St(n, p), i.e., there exists L_f > 0 such that ‖∇f(X) − ∇f(Y)‖_F ≤ L_f‖X − Y‖_F for all X, Y ∈ conv(St(n, p)).
(A2) There exists κ_H > 0 such that ‖B_k‖ ≤ κ_H for all k ∈ N, where ‖·‖ is the operator norm induced by the Euclidean inner product.

Remark 1. By Assumption (A1), ∇f(X) is uniformly bounded by some constant κ_g on the compact set conv(St(n, p)), i.e., ‖∇f(X)‖_F ≤ κ_g for X ∈ conv(St(n, p)). Assumption (A2) is often used for the traditional symmetric rank-one method [7] and appears to be reasonable in practice.
Based on proofs similar to those in [16, 43], we have the following theorem for global convergence.
Theorem 1. Suppose that Assumptions (A1)-(A2) and the inexact condition (3.1) hold. Then, either

(3.3)    grad f(X^t) = 0 for some t > 0,   or   lim_{k→∞} ‖grad f(X^k)‖_F = 0.

Proof.
For the quadratic regularization (2.2), note that the Riemannian Hessian Hess m_k(X^k) is bounded under Assumption 1. In fact, from (2.4), we have

    ‖Hess m_k(X^k)‖ ≤ ‖B_k‖ + ‖X^k‖‖∇f(X^k)‖_F + τ_k ≤ κ_H + κ_g + τ_k,

where ‖X^k‖ = 1 because X^k has orthonormal columns. Hence, the direction obtained from the modified CG method is a descent direction, by techniques similar to [16, Lemma 7]. The convergence of the iterates {X^k} can then be proved by following the details in [16] for the quadratic regularization. As to the cubic regularization, we refer to [43, Theorem 4.9] for a similar proof.

We now focus on the local convergence under the inexact conditions (3.1) and (3.2). We make some necessary assumptions below.

Assumption 2. Let {X^k} be the sequence generated by Algorithm 1 without subspace refinement. We assume:
(B1) The sequence {X^k} converges to X^* with grad f(X^*) = 0.
(B2) The Euclidean Hessian ∇²f is continuous on conv(St(n, p)).
(B3) The Riemannian Hessian Hess f(X) is positive definite at X^*.
(B4) The Hessian approximation B_k satisfies

(3.4)    ‖(B_k − ∇²f(X^k))[Z^k − X^k]‖_F / ‖Z^k − X^k‖_F → 0,   k → ∞.

Following the proof of [16, Lemma 17], we show that all iterations are eventually very successful (i.e., r_k ≥ η_2 for all sufficiently large k) when Assumptions (B1)-(B4) and the inexact conditions (3.1) and (3.2) hold.

Lemma 1. Let Assumptions (B1)-(B4) be satisfied. Then, all iterations are eventually very successful.

Proof.
From the second-order Taylor expansion, we have

    f(Z^k) − f(X^k) − m_k(Z^k) ≤ (1/2)ℜ⟨(∇²f(X_δ^k) − B_k)[Z^k − X^k], Z^k − X^k⟩

for some suitable δ_k ∈ [0, 1] and X_δ^k := X^k + δ_k(Z^k − X^k). Since the Stiefel manifold is compact, there exists η^k such that Z^k = Exp_{X^k}(η^k), where Exp_{X^k} is the exponential map from T_{X^k}St(n, p) to St(n, p). Following the proof in [5, Appendix B] and using Assumption (B1) (Z^k can be sufficiently close to X^k for large k), we have

(3.5)    ‖Z^k − X^k − η^k‖_F ≤ κ_1‖η^k‖²_F

with a positive constant κ_1 for all sufficiently large k. Moreover, since the Hessian Hess f(X^*) is positive definite and (B4) is satisfied, it holds for sufficiently large k that

    ‖Hess m_k(X^k)[η^k]‖_F = ‖Hess m_k(X^k)[Z^k − X^k]‖_F + O(‖η^k‖²_F)
    = ‖Hess f(X^k)[Z^k − X^k] + (Hess m_k(X^k) − Hess f(X^k))[Z^k − X^k]‖_F + O(‖η^k‖²_F)
    ≥ λ_min(Hess f(X^k))‖Z^k − X^k‖_F + o(‖Z^k − X^k‖_F) + O(‖η^k‖²_F)
    ≥ λ_min(Hess f(X^k))‖η^k‖_F + o(‖η^k‖_F),

where λ_min(Hess f(X^k)) is the smallest eigenvalue of Hess f(X^k). From Assumptions (B2)-(B3), [2, Proposition 5.5.4] and the Taylor expansion of m_k ∘ Exp_{X^k}, we have

    ‖grad(m_k ∘ Exp_{X^k})(η^k) − grad f(X^k)‖_F = ‖Hess f(X^k)[η^k]‖_F + o(‖η^k‖_F) ≥ κ_2‖η^k‖_F,

where κ_2 := λ_min(Hess f(X^*)). By [1, Lemma 7.4.9], we have

(3.6)    ‖η^k‖_F ≤ (1/κ_2)(‖grad f(X^k)‖_F + c̃‖grad m_k(Z^k)‖_F) ≤ ((1 + c̃θ_k)/κ_2)‖grad f(X^k)‖_F,

where c̃ > 0 is a constant. It then follows from the continuity of ∇²f, (3.1), (3.5) and (3.6) that

    1 − r_k ≤ c′( ‖(∇²f(X^k) − B_k)[Z^k − X^k]‖_F ‖Z^k − X^k‖_F + ‖∇²f(X_δ^k) − ∇²f(X^k)‖ ‖Z^k − X^k‖²_F ) / ‖grad f(X^k)‖²_F → 0

with some constant c′ > 0. Therefore, the iterations are eventually very successful.

As a result, q-superlinear convergence can also be guaranteed.

Theorem 2. Suppose that Assumptions (B1)-(B4) and the conditions (3.1) and (3.2) hold. Then the sequence {X^k} converges q-superlinearly to X^*.

Proof. We consider the cubic model here; the local q-superlinear convergence of the quadratic model can be shown in a similar fashion. Since the iterations are eventually very successful, we have X^{k+1} = Z^k and τ_k converges to zero. From (3.2), we have

(3.7)    ‖grad m_k(X^{k+1})‖_F = ‖Proj_{X^{k+1}}(∇f(X^k) + B_k[Δ^k] + τ_k‖Δ^k‖_F Δ^k)‖_F ≤ θ_k‖grad f(X^k)‖_F,

where Δ^k = Z^k − X^k. Hence,

(3.8)    ‖grad f(X^{k+1})‖_F = ‖Proj_{X^{k+1}}(∇f(X^{k+1}))‖_F
    = ‖Proj_{X^{k+1}}(∇f(X^k) + ∇²f(X^k)[Δ^k] + o(‖Δ^k‖_F))‖_F
    = ‖Proj_{X^{k+1}}(∇f(X^k) + B_k[Δ^k] + (∇²f(X^k) − B_k)[Δ^k] + o(‖Δ^k‖_F))‖_F
    ≤ θ_k‖grad f(X^k)‖_F + o(‖Δ^k‖_F).

It follows from an argument similar to (3.6) that there exists some constant c_1 > 0 such that ‖Δ^k‖_F ≤ c_1‖grad f(X^k)‖_F for sufficiently large k. Therefore, from (3.8) and the definition of θ_k, we have

(3.9)    ‖grad f(X^{k+1})‖_F / ‖grad f(X^k)‖_F → 0.

Combining (3.9), Assumption (B3) and [2, Lemma 7.4.8] yields

    dist(X^{k+1}, X^*) / dist(X^k, X^*) → 0,

where dist(X, Y) is the geodesic distance between X and Y on St(n, p). This completes the proof.
4. Linear eigenvalue problem.
In this section, we apply the aforementioned strategy to the linear eigenvalue problem

(4.1)    min_{X ∈ R^{n×p}} f(X) := (1/2)tr(X^⊤CX)   s.t.  X^⊤X = I_p,

where C := A + B. Here, A, B ∈ R^{n×n} are symmetric matrices and we assume that the multiplication BX is much more expensive than AX. Motivated by quasi-Newton methods, and eliminating the linear term in the subproblem (2.1), we investigate the multisecant conditions of [13]

(4.2)    B̂_kX^k = BX^k,   B̂_kS^k = BS^k,

with S^k = X^k − X^{k−1}. By a brief induction, (4.2) has the equivalent form

(4.3)    B̂_k[X^{k−1}, X^k] = B[X^{k−1}, X^k].

Then, using the limited-memory Nyström approximation, we obtain the approximate matrix B̂_k as

(4.4)    B̂_k = W_k((W_k)^⊤O_k)†(W_k)^⊤,

where

(4.5)    O_k = orth(span{X^{k−1}, X^k})   and   W_k = BO_k.

Here, orth(Z) returns an orthogonal basis of the space spanned by Z. Therefore, an approximation C_k to C can be set as

(4.6)    C_k = A + B̂_k.

Since the objective function is invariant under rotations, i.e., f(XQ) = f(X) for any orthogonal matrix Q ∈ R^{p×p}, we also want to construct a subproblem whose objective function inherits the same property. Therefore, we use the distance function between X^k and X

    d_p(X, X^k) = ‖XX^⊤ − X^k(X^k)^⊤‖²_F,

which has been considered in [10, 39, 46] for the electronic structure calculation. Since X^k and X have orthonormal columns, we have

(4.7)    d_p(X, X^k) = tr((XX^⊤ − X^k(X^k)^⊤)(XX^⊤ − X^k(X^k)^⊤)) = 2p − 2tr(X^⊤X^k(X^k)^⊤X),

which implies that d_p(X, X^k) is a quadratic function of X. Consequently, the subproblem can be constructed as

(4.8)    min_{X ∈ R^{n×p}} m_k(X)   s.t.  X^⊤X = I_p,   where m_k(X) := (1/2)tr(X^⊤C_kX) + (τ_k/2)d_p(X, X^k).

From the equivalent expression of d_p(X, X^k) in (4.7), problem (4.8) is actually a linear eigenvalue problem

    (A + B̂_k − 2τ_kX^k(X^k)^⊤)X = XΛ,   X^⊤X = I_p,

where Λ is a diagonal matrix whose diagonal elements are the p smallest eigenvalues of A + B̂_k − 2τ_kX^k(X^k)^⊤. Due to the low computational cost of multiplications with A + B̂_k − 2τ_kX^k(X^k)^⊤ compared to A + B, the subproblem (4.8) can be solved efficiently using existing eigensolvers. As in Algorithm 1, we first solve subproblem (4.8) to obtain a trial point and compute the ratio (2.13) between the actual reduction and the predicted reduction at this trial point. Then the iterate and the regularization parameter are updated according to (2.14) and (2.15). Note that it is not necessary to solve the subproblems to high accuracy in practice.

4.1. Convergence. Although the convergence analysis in section 3 is based on the regularization terms (2.2) and (2.3), similar results can be established with the specific regularization term (τ_k/2)d_p(X, X^k) using the sufficient descent condition (3.1). It follows from the construction of C_k in (4.6) that

    ‖C‖ ≤ ‖A‖ + ‖B‖,   ‖C_k‖ ≤ ‖A‖ + ‖B‖

for any given matrices A and B. Hence, Assumptions (A1) and (A2) hold with L_f = κ_H = ‖A‖ + ‖B‖. We have the following theorem on global convergence.

Theorem 3. Suppose that the inexact condition (3.1) holds. Then, for the Riemannian gradients, either

    (I_n − X^t(X^t)^⊤)(CX^t) = 0 for some t > 0,   or   lim_{k→∞} ‖(I_n − X^k(X^k)^⊤)(CX^k)‖_F = 0.

Proof.
It can be guaranteed that the distance d_p(Z^k, X^k) is very small for a large enough regularization parameter τ_k, by an argument similar to [16, Lemma 9]. Specifically, the reduction of the subproblem requires that

    ⟨Z^k, C_kZ^k⟩ + τ_k‖Z^k(Z^k)^⊤ − X^k(X^k)^⊤‖²_F − ⟨X^k, C_kX^k⟩ ≤ 0.

From the cyclic property of the trace operator, it holds that

    ⟨C_k, Z^k(Z^k)^⊤ − X^k(X^k)^⊤⟩ + τ_k‖Z^k(Z^k)^⊤ − X^k(X^k)^⊤‖²_F ≤ 0.

Then

(4.9)    ‖Z^k(Z^k)^⊤ − X^k(X^k)^⊤‖_F ≤ κ_H/τ_k.

From the descent condition (3.1) for the subproblem, there exists some positive constant ν such that

(4.10)    m_k(Z^k) − m_k(X^k) ≤ −ν‖grad f(X^k)‖²_F.

Based on the properties of C_k and C, we have

(4.11)    f(Z^k) − f(X^k) − (m_k(Z^k) − m_k(X^k))
    = (1/2)⟨Z^k, CZ^k⟩ − (1/2)⟨Z^k, C_kZ^k⟩ − (τ_k/2)‖Z^k(Z^k)^⊤ − X^k(X^k)^⊤‖²_F
    ≤ (1/2)⟨C − C_k, Z^k(Z^k)^⊤⟩ = (1/2)⟨C − C_k, Z^k(Z^k)^⊤ − X^k(X^k)^⊤⟩
    ≤ (1/2)(L_f + κ_H)‖Z^k(Z^k)^⊤ − X^k(X^k)^⊤‖_F ≤ κ_H(L_f + κ_H)/τ_k,

where the second equality is due to CX^k = C_kX^k, the orthonormality of Z^k and X^k, and the fact that ⟨C − C_k, X^k(X^k)^⊤⟩ = tr((X^k)^⊤(C − C_k)X^k) = 0. Combining (4.10) and (4.11), we have that

    1 − r_k = (f(Z^k) − f(X^k) − (m_k(Z^k) − m_k(X^k))) / (m_k(X^k) − m_k(Z^k)) ≤ 1 − η_2

for sufficiently large τ_k, as in [16, Lemma 8]. Since the subproblem is solved with some sufficient reduction, the reduction of the original objective f holds for large τ_k (i.e., r_k is close to 1). The convergence of the norm of the Riemannian gradient (I_n − X^k(X^k)^⊤)CX^k then follows in a similar fashion as [16, Theorem 11].

The ACE method in [29] needs an explicit estimate β such that B − βI_n is negative definite. By considering the equivalent matrix (A + βI_n) + (B − βI_n), the convergence of ACE to a global minimizer is given. On the other hand, our algorithmic framework uses an adaptive strategy to choose τ_k to guarantee convergence to a stationary point. By using similar proof techniques as in [29], one may also establish convergence to a global minimizer.
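One step of the construction (4.4)-(4.8) can be sketched in a few lines of MATLAB. Here Bmul is a hypothetical handle for the expensive product BX, and eigs is used for the subproblem only for illustration (the experiments in section 6 use LOBPCG instead):

    % Sketch of one ASQN step for (4.1): form the Nystrom surrogate (4.4)
    % and solve the subproblem (4.8) as a linear eigenvalue problem.
    % Bmul is a hypothetical handle for the expensive product B*X.
    function Xnew = asqn_eig_step(A, Bmul, Xkm1, Xk, tauk)
        [n, p] = size(Xk);
        Ok = orth([Xkm1, Xk]);            % (4.5): orthonormal basis O_k
        Wk = Bmul(Ok);                    % W_k = B*O_k, evaluated once
        M  = pinv(Wk' * Ok);              % small pseudo-inverse in (4.4)
        % action of A + Bhat_k - 2*tau_k*X^k*(X^k)' on a vector or block v
        Cfun = @(v) A*v + Wk*(M*(Wk'*v)) - 2*tauk*(Xk*(Xk'*v));
        [Xnew, ~] = eigs(Cfun, n, p, 'smallestreal', 'IsFunctionSymmetric', true);
    end

Every application of the surrogate operator costs only one multiplication with the sparse/cheap part A plus thin matrix products, while B is touched only once per outer iteration.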
5. Electronic structure calculation.

5.1. Formulation.
We now introduce the KS and HF total energy minimization models and present the gradients and Hessians of the objective functions of these two models. After a proper discretization, the wave functions of p occupied states can be approximated by a matrix X = [x_1, ..., x_p] ∈ C^{n×p} with X^*X = I_p, where n corresponds to the number of spatial degrees of freedom. The charge density associated with the occupied states is defined as

    ρ(X) = diag(XX^*).

Unless otherwise specified, we use the abbreviation ρ for ρ(X) in the following. The total energy functional is defined as

(5.1)    E_ks(X) := (1/4)tr(X^*LX) + (1/2)tr(X^*V_ion X) + (1/2)Σ_l Σ_i ζ_l|x_i^*w_l|² + (1/4)ρ^⊤L†ρ + (1/2)e^⊤ε_xc(ρ),

where L is a discretized Laplacian operator, V_ion is the constant ionic pseudopotential, w_l represents a discretized pseudopotential reference projection function, ζ_l is a constant whose value is ±1, e is the vector of all ones in R^n, and ε_xc is related to the exchange-correlation energy. Therefore, the KS total energy minimization problem can be expressed as

(5.2)    min_{X ∈ C^{n×p}} E_ks(X)   s.t.  X^*X = I_p.

Let μ_xc(ρ) = ∂ε_xc(ρ)/∂ρ and define the Hamiltonian H_ks(X) by

(5.3)    H_ks(X) := (1/2)L + V_ion + Σ_l ζ_l w_l w_l^* + Diag((ℜL†)ρ) + Diag(μ_xc(ρ)^*e).

Then the Euclidean gradient of E_ks(X) is

(5.4)    ∇E_ks(X) = H_ks(X)X.

Under the assumption that ε_xc(ρ(X)) is twice differentiable with respect to ρ(X), Lemma 2.1 in [43] gives an explicit form of the Hessian of E_ks(X):

(5.5)    ∇²E_ks(X)[U] = H_ks(X)U + R(X)[U],

where U ∈ C^{n×p} and R(X)[U] := Diag((ℜL† + (∂²ε_xc/∂ρ²)e)((X̄ ⊙ U + X ⊙ Ū)e))X.

Compared with KSDFT, the HF theory provides a more accurate model for electronic structure calculations by including the Fock exchange operator. After discretization, the Fock exchange operator V(·) : C^{n×n} → C^{n×n} is usually a fourth-order tensor; see equations (3.3) and (3.4) in [27] for details. Furthermore, it is easy to see from [27] that V(·) satisfies the following properties: (i) for any D_1, D_2 ∈ C^{n×n}, there holds ⟨V(D_1), D_2⟩ = ⟨V(D_2), D_1⟩, which further implies that

(5.6)    ⟨V(D_1 + D_2), D_1 + D_2⟩ = ⟨V(D_1), D_1⟩ + 2⟨V(D_1), D_2⟩ + ⟨V(D_2), D_2⟩;

(ii) if D is Hermitian, then V(D) is also Hermitian. Besides, it should be emphasized that evaluating V(D) is always very expensive since it requires the multiplication between an n×n×n×n fourth-order tensor and an n-by-n matrix. The corresponding Fock exchange energy is defined as

(5.7)    E_f(X) := (1/4)⟨V(XX^*)X, X⟩ = (1/4)⟨V(XX^*), XX^*⟩.

Then the HF total energy minimization problem can be formulated as

(5.8)    min_{X ∈ C^{n×p}} E_hf(X) := E_ks(X) + E_f(X)   s.t.  X^*X = I_p.

We can now explicitly compute the gradient and Hessian of E_f(X) by using the properties of V(·).

Lemma 2. Given U ∈ C^{n×p}, the gradient of E_f and its Hessian along U are, respectively,

(5.9)     ∇E_f(X) = V(XX^*)X,
(5.10)    ∇²E_f(X)[U] = V(XX^*)U + V(XU^* + UX^*)X.

Proof.
We first compute the value E_f(X + U). For simplicity, denote D := XU^* + UX^*. Using the property (5.6), by some direct calculations, we have

    4E_f(X + U) = ⟨V((X + U)(X + U)^*), (X + U)(X + U)^*⟩
    = 4E_f(X) + 2⟨V(XX^*), D + UU^*⟩ + ⟨V(D + UU^*), D + UU^*⟩
    = 4E_f(X) + 2⟨V(XX^*), D⟩ + 2⟨V(XX^*), UU^*⟩ + ⟨V(D), D⟩ + h.o.t.,

where h.o.t. denotes the higher-order terms. Noting that V(XX^*) and V(D) are both Hermitian, we obtain from the above assertions that

(5.11)    E_f(X + U) = E_f(X) + ℜ⟨V(XX^*)X, U⟩ + (1/2)ℜ⟨V(XX^*)U + V(D)X, U⟩ + h.o.t..

Finally, it follows from expansion (1.2) in [43] that the second-order Taylor expansion in X can be expressed as

    E_f(X + U) = E_f(X) + ℜ⟨∇E_f(X), U⟩ + (1/2)ℜ⟨∇²E_f(X)[U], U⟩ + h.o.t.,

which together with (5.11) implies (5.9) and (5.10). The proof is completed.

Let H_hf(X) := H_ks(X) + V(XX^*) be the HF Hamiltonian. Recalling that E_hf(X) = E_ks(X) + E_f(X), we have from (5.4) and (5.9) that

(5.12)    ∇E_hf(X) = H_ks(X)X + V(XX^*)X = H_hf(X)X,

and from (5.5) and (5.10) that

(5.13)    ∇²E_hf(X)[U] = H_hf(X)U + R(X)[U] + V(XU^* + UX^*)X.

We next briefly introduce widely used methods for solving the KSDFT and HF models. For the KSDFT model (5.2), the most popular method is the SCF method [27]. At the k-th iteration, SCF first fixes H_ks(X) at H(X^k) and then updates X^{k+1} by solving the linear eigenvalue problem

(5.14)    X^{k+1} := arg min_{X ∈ C^{n×p}} ⟨X, H(X^k)X⟩   s.t.  X^*X = I_p.

Because the complexity of the HF model (5.8) is much higher than that of the KSDFT model, using the SCF method directly may not give the desired results. Since computing V(X^k(X^k)^*)U for a matrix U of proper dimension is still very expensive, we investigate the limited-memory Nyström approximation V̂(X^k(X^k)^*) of V(X^k(X^k)^*) to reduce the computational cost, i.e.,

(5.15)    V̂(X^k(X^k)^*) := Z(Z^*Ω)†Z^*,

where Z = V(X^k(X^k)^*)Ω and Ω is any matrix whose columns form an orthogonal basis of a subspace such as

    span{X^k},   span{X^{k−1}, X^k}   or   span{X^{k−1}, X^k, V(X^k(X^k)^*)X^k}.

We should note that a similar idea, called the adaptively compressed exchange method, was proposed in [28]; it compresses the operator V(X^k(X^k)^*) only on the subspace span{X^k}. Then a new subproblem is constructed as

(5.16)    min_{X ∈ C^{n×p}} E_ks(X) + (1/4)⟨V̂(X^k(X^k)^*)X, X⟩   s.t.  X^*X = I_p.

Here, the exact form of the cheaper part E_ks is preserved, while its second-order approximation is used in the construction of the subproblem (2.1). As with the subproblem (2.1), we can utilize the Riemannian gradient method or the modified CG method based on the linear equation

    Proj_{X^k}(∇²E_ks(X^k)[ξ] + (1/2)V̂(X^k(X^k)^*)ξ − ξ sym((X^k)^*∇E_hf(X^k))) = −grad E_hf(X^k)

to solve (5.16) inexactly. Since (5.16) is a KS-like problem, we can also use the SCF method. We present the detailed procedure in Algorithm 2.
Algorithm 2: An iterative method for (5.8) using the Nyström approximation
Input: initial guess X^0 ∈ C^{n×p} with (X^0)^*X^0 = I_p. Set k = 0.
while stopping conditions not met do
    Compute the limited-memory Nyström approximation V̂(X^k(X^k)^*).
    Construct the subproblem (5.16) and solve it inexactly via the Riemannian gradient method, the modified CG method or the SCF method to obtain X^{k+1}.
    Set k ← k + 1.

We note that Algorithm 2 is similar to the two-level nested SCF method with the ACE formulation [28] when the subspace in (5.15) and the inner solver for (5.16) are chosen as span{X^k} and SCF, respectively.

The construction of B_k. Since the Hessian of the KSDFT or HF total energy minimization takes the natural structure (1.2), we next give the specific choices of H_c(X^k) and H_e(X^k), which are the key to formulating the structured approximation B_k. For the KS problem (5.2), the exact Hessian is given in (5.5). Since the computational cost of the part (1/2)L + Σ_l ζ_l w_l w_l^* is much cheaper than that of the remaining parts of ∇²E_ks, we can choose

(5.17)    H_c(X^k) = (1/2)L + Σ_l ζ_l w_l w_l^*,   H_e(X^k) = ∇²E_ks(X^k) − H_c(X^k).

The exact Hessian of E_hf(X) in (5.8) can be separated naturally into two parts, i.e., ∇²E_ks(X) + ∇²E_f(X). Usually the hybrid exchange operator V(XX^*) can take more than 95% of the overall time of the multiplication H_hf(X)[U] in many real applications [29]. Recalling (5.5), (5.10) and (5.13), we see that the computational cost of ∇²E_f(X) is much higher than that of ∇²E_ks(X). Hence, we obtain the decomposition

(5.18)    H_c(X^k) = ∇²E_ks(X^k),   H_e(X^k) = ∇²E_f(X^k).

Moreover, we can split the Hessian ∇²E_ks(X^k) as in (5.17) and obtain the alternative decomposition

(5.19)    H_c(X^k) = H_ks(X^k),   H_e(X^k) = ∇²E_f(X^k) + (∇²E_ks(X^k) − H_ks(X^k)).

Finally, we emphasize that the limited-memory Nyström approximation (5.15) can serve as a good initial approximation of the part ∇²E_f(X^k).

As presented in Algorithm 1, the subspace method plays an important role when the modified CG method does not perform well. The first-order optimality conditions of (5.2) and (5.8) are

    H(X)X = XΛ,   X^*X = I_p,

where X ∈ C^{n×p}, Λ is a diagonal matrix, and H represents H_ks for (5.2) and H_hf for (5.8). Hence, problems (5.2) and (5.8) are actually nonlinear eigenvalue problems which aim to find the p smallest eigenvalues of H. We should point out that, in principle, X consists of eigenvectors of H(X) but not necessarily the eigenvectors corresponding to the p smallest eigenvalues. Since the optimal X still consists of eigenvectors of H(X), we can construct a subspace which contains these possibly wanted eigenvectors. Specifically, at the current iterate, we first compute the γp smallest eigenvalues of H(X^k) and their corresponding eigenvectors, denoted by Γ_k, for some small integer γ, and then construct the subspace

(5.20)    span{X^{k−1}, X^k, grad f(X^k), Γ_k}.

With this subspace construction, Algorithm 1 is more likely to escape a stagnated point.
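A sketch of the compression (5.15); applyV is a hypothetical handle for the expensive Fock exchange application U ↦ V(X^k(X^k)^*)U, and Basis spans the chosen subspace (for example [X^{k−1}, X^k]):

    % Sketch of the limited-memory Nystrom compression (5.15) of the Fock
    % exchange operator; applyV is a hypothetical handle, not KSSOLV code.
    function Vhat = compress_exchange(applyV, Basis)
        Omega = orth(Basis);               % orthonormal basis of the subspace
        Z = applyV(Omega);                 % expensive application, done once
        M = pinv(Z' * Omega);              % small dense pseudo-inverse
        Vhat = @(U) Z * (M * (Z' * U));    % cheap surrogate U -> Vhat*U
    end

With the subspace span{X^k}, this reduces to an ACE-type compression [28]; richer subspaces additionally enforce the secant condition discussed in section 2.3.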
6. Numerical experiments.
In this section, we present numerical results to illustrate the efficiency of the limited-memory Nyström approximation and of our Algorithm 1. All codes were run on a workstation with Intel Xeon E5-2680 v4 processors at 2.40GHz and 256GB of memory, running CentOS 7.3.1611 and MATLAB R2017b.
6.1. Linear eigenvalue problem. We first construct A and B by using the following MATLAB commands:

    A = randn(n, n); A = (A + A')/2;   B = c*rand(n, n); B = (B + B')/2; B = B − T; B = −B,

where randn and rand are built-in MATLAB functions, c is a fixed positive scaling constant, T = λ_min(B)I_n and λ_min(B) is the smallest eigenvalue of B. Then B is negative semidefinite and A is symmetric. In our implementation, we compute the multiplication BX as a repeated sum Σ_i BX so that BX consumes about 95% of the whole computational time. In the second example, we set A to be a sparse matrix,

    A = gallery('wathen', s, s),

with parameter s, and B is the same as in the first example except that BX is computed directly. Since A is sufficiently sparse, the computational cost of AX is much smaller than that of BX. We use the following stopping criterion:

(6.1)    err := max_{i=1,...,p} { ‖(A + B)x_i − μ_i x_i‖ / max(1, |μ_i|) } ≤ 10^{−10},

where x_i is the i-th column of the current iterate X^k and μ_i is the corresponding approximate eigenvalue.

The numerical results of the first and second examples are summarized in Tables 1 and 2, respectively. In these tables, EIGS is the built-in function "eigs" of MATLAB, LOBPCG is the locally optimal block preconditioned conjugate gradient method [25], and ASQN is the algorithm described in section 4. The difference between ACE and ASQN is that ACE takes O_k as orth(span{X^k}) instead of orth(span{X^{k−1}, X^k}). Since a good initial guess X^k is known at the (k+1)-th iteration, LOBPCG is used to solve the corresponding linear eigenvalue subproblem (4.8). Since BX^{k−1} and BX^k are available from the computation of the residuals, we adopt the orthogonalization technique in [30] to compute O_k and W_k in (4.5) without an extra multiplication BO_k. The labels "AV" and "BV" denote the total numbers of matrix-vector multiplications (MV) with A and B, counting each operation AX, BX for X ∈ R^{n×p} as p MVs. The columns "err" and "time" are the maximal relative error (6.1) over all p eigenpairs and the wall-clock time in seconds of each algorithm, respectively. The maximal number of iterations for ASQN and ACE is set to 200.
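For reference, the first test example can be generated by the following script. The scaling constant c of the rand term is an assumed placeholder (its value is garbled in the source), and n is kept small here for a quick run:

    % Sketch of the synthetic test data: A symmetric, B negative semidefinite.
    % The scaling constant c is an assumed placeholder, not the paper's value.
    n = 1000; c = 0.1;
    A = randn(n, n); A = (A + A') / 2;     % symmetrize A
    B = c * rand(n, n); B = (B + B') / 2;  % symmetrize B
    T = min(eig(B)) * eye(n);              % T = lambda_min(B) * I_n
    B = B - T;                             % B - T is positive semidefinite
    B = -B;                                % so B becomes negative semidefinite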
Table 1
Numerical results on random matrices (n = 5000)

            p = 10                        p = 20
solver   AV/BV       err      time    AV/BV       err      time
EIGS     459/459     8.0e-11  45.1    638/638     3.2e-11  62.7
LOBPCG   1717/1717   9.9e-11  128.9   2914/2914   9.8e-11  130.3
ASQN     2323/150    9.2e-11  13.3    3809/260    9.2e-11  8.9
ACE      4056/460    9.7e-11  30.8    5902/680    9.5e-11  16.5

            p = 30                        p = 50
solver   AV/BV       err      time    AV/BV       err      time
EIGS     660/660     3.0e-11  62.8    879/879     1.6e-12  83.6
LOBPCG   4458/4458   1.0e-10  217.6   5766/5766   9.5e-11  186.7
ASQN     5315/420    9.8e-11  11.4    7879/650    9.8e-11  17.8
ACE      9701/1530   9.4e-11  23.0    21664/4450  1.0e-10  50.9

As shown in Table 1, ASQN gives an accurate solution in less time than EIGS and LOBPCG, while ACE usually takes a long time to reach a solution of high accuracy. Similar conclusions can be drawn from Table 2, in which ACE and LOBPCG do not reach the prescribed accuracy in the cases n = 11041 and p = 30, 40, 50, 60.
From the calls of AV and BV, we see that the limited-memory Nyström method reduces the number of calls to the expensive part by doing more evaluations of the cheap part.

We now test the electronic structure calculation models in subsections 6.2 and 6.3 using the new version of the KSSOLV package [45]. One of the main differences is that the new version uses the more recently developed optimized norm-conserving Vanderbilt (ONCV) pseudopotentials [14], which are compatible with those used in other community software packages such as Quantum ESPRESSO. The problem information is listed in Table 3.
Table 2
Numerical results on sparse matrices

p = 10:
            s = 11                        s = 12
solver   AV/BV        err      time    AV/BV        err      time
EIGS     1882/1882    1.5e-07  58.9    1463/1463    9.6e-11  65.4
LOBPCG   4282/4282    9.5e-11  136.0   4089/4089    9.9e-11  190.6
ASQN     8327/240     9.6e-11  16.7    6910/220     9.3e-11  17.5
ACE      15323/1060   9.7e-11  38.9    17907/2010   1.7e-08  65.5

s = 12:
            p = 10                        p = 20
EIGS     1463/1463    9.6e-11  65.4    1148/1148    5.8e-11  50.2
LOBPCG   4089/4089    9.9e-11  190.6   5530/5530    9.8e-11  86.4
ASQN     6910/220     9.3e-11  17.5    9749/340     9.5e-11  16.3
ACE      17907/2010   1.7e-08  65.5    14108/960    9.8e-11  23.4

            p = 30                        p = 40
EIGS     1784/1784    8.1e-11  74.8    1836/1836    4.8e-11  69.1
LOBPCG   9076/9076    5.3e-09  173.3   12192/12192  4.6e-10  207.2
ASQN     17056/870    9.6e-11  41.5    19967/960    9.9e-11  39.9
ACE      37162/6030   9.1e-09  78.4    48098/8040   4.6e-07  105.4

            p = 50                        p = 60
EIGS     1743/1743    7.3e-11  69.1    2122/2122    1.6e-11  86.7
LOBPCG   12288/12288  1.4e-09  168.4   15716/15716  1.1e-08  199.5
ASQN     21330/1300   9.3e-11  53.6    26343/1620   9.7e-11  71.8
ACE      49165/10050  2.9e-06  110.1   62668/12060  2.3e-08  134.0

For fair comparisons, we stop all algorithms when the Frobenius norm of the Riemannian gradient is less than 10^{−6} or the maximal number of iterations is reached. In the following tables, the column "solver" denotes which solver is used. The columns "fval", "nrmG" and "time" are the final objective function value, the final Frobenius norm of the Riemannian gradient, and the wall-clock time in seconds of each algorithm, respectively.

6.2. KS total energy minimization. In this test, we compare the structured quasi-Newton method with the SCF in KSSOLV [45], the Riemannian L-BFGS method (RQN) in Manopt [6], the Riemannian gradient method with BB step size (GBB) and the adaptive regularized Newton method (ARNT) [16]. The default parameters therein are used. Our Algorithm 1 with the approximation (5.17) is denoted by ASQN. The parameter setting of ASQN is the same as that of ARNT [16]. For each algorithm, we first use GBB to generate a good starting point, with a moderate stopping tolerance on ‖grad f(X^k)‖_F and a maximum of 2000 iterations. The maximal numbers of iterations for SCF, GBB, ARNT, ASQN and RQN are set to 1000, 10000, 500, 500 and 1000, respectively. The numerical results are reported in Tables 4 and 5. The column "its" reports the total number of iterations for SCF, GBB and RQN, while for ARNT and ASQN the two numbers are the total number of outer iterations and the average number of inner iterations.
Table 3
Problem information

name         (n1, n2, n3)      n       p
alanine      (91, 68, 61)      35829   18
c12h26       (136, 68, 28)     16099   37
ctube661     (162, 162, 21)    35475   48
glutamine    (64, 55, 74)      16517   29
graphene16   (91, 91, 23)      12015   37
graphene30   (181, 181, 23)    48019   67
pentacene    (80, 55, 160)     44791   51
gaas         (49, 49, 49)      7153    36
si40         (129, 129, 129)   140089  80
si64         (93, 93, 93)      51627   128
al           (91, 91, 91)      47833   12
ptnio        (89, 48, 42)      11471   43
c            (46, 46, 46)      6031    2

From Tables 4 and 5, we can see that SCF fails on "graphene16", "graphene30", "al", "ptnio" and "c". We next explain why SCF fails by taking "c" and "graphene16" as examples. For the case "c", we obtain the same solution by using GBB, ARNT and ASQN. The number of wanted wave functions is 2, i.e., p = 2. With some abuse of notation, we denote the final solution by X = [x_1, x_2]. Since X satisfies the first-order optimality condition, the columns of X are also eigenvectors of H(X), and the corresponding eigenvalues of H(X) are -1.8790 and -0.6058. On the other hand, the smallest four eigenvalues of H(X) are -1.8790, -0.6577, -0.6058, -0.6058, with corresponding eigenvectors denoted by Y = [y_1, y_2, y_3, y_4]. Comparing the energies and the norms of the Riemannian gradients of the different eigenvector pairs [x_1, x_2], [y_1, y_2], [y_1, y_3] and [y_1, y_4], as well as the angles between X and Y, shows that x_1 is nearly parallel to y_1 while x_2 lies in the subspace spanned by [y_3, y_4] rather than y_2. Hence, when the SCF method is used around X, the next point will jump to the subspace spanned by [y_1, y_2]. This indicates the failure of the aufbau principle, and thus the failure of the SCF procedure. This is consistent with the observation in the chemistry literature [42], where sometimes the converged solution may have a "hole" (i.e., unoccupied states) below the highest occupied energy level.

In the case "graphene16", we again obtain the same solution from GBB, ARNT and ASQN. The number of wave functions p is 37. Let X be the computed solution and the corresponding eigenvalues of H(X) be d. The smallest 37 eigenvalues of H(X) and their corresponding eigenvectors are denoted by g and Y. We find that the first 36 elements of d and g agree up to machine accuracy, but the 37th elements of d and g are 0.5821 and 0.5783, respectively; that is, the last column of X is not the eigenvector corresponding to the 37th smallest eigenvalue. Hence, SCF does not converge around the point X.
Table 4
Numerical results on KS total energy minimization

            alanine                                c12h26
solver   fval         nrmG    its        time   fval         nrmG    its         time
SCF      -6.27084e+1  6.3e-7  11         64.0   -8.23006e+1  6.5e-7  10          61.1
GBB      -6.27084e+1  8.2e-7  92         71.3   -8.23006e+1  9.5e-7  89          65.8
ARNT     -6.27084e+1  3.8e-7  3(13.3)    63.0   -8.23006e+1  7.5e-7  3(15.3)     60.9
ASQN     -6.27084e+1  9.3e-7  13(11.8)   81.9   -8.23006e+1  9.3e-7  10(13.3)    67.8
RQN      -6.27084e+1  1.5e-6  34         114.9  -8.23006e+1  1.7e-6  45          120.0

            ctube661                               glutamine
SCF      -1.35378e+2  5.7e-7  11         200.4  -9.90525e+1  4.9e-7  10          49.5
GBB      -1.35378e+2  6.3e-7  102        199.7  -9.90525e+1  4.9e-7  63          44.0
ARNT     -1.35378e+2  3.2e-7  3(18.3)    168.3  -9.90525e+1  3.6e-7  3(12.0)     42.6
ASQN     -1.35378e+2  7.6e-7  11(12.8)   201.7  -9.90525e+1  5.3e-7  12(9.8)     50.7
RQN      -1.35378e+2  3.4e-6  40         308.8  -9.90525e+1  1.8e-6  26          72.8

            graphene16                             graphene30
SCF      -9.57196e+1  8.7e-4  1000       3438.4 -1.76663e+2  3.5e-4  1000        31897.6
GBB      -9.57220e+1  9.4e-7  434        185.1  -1.76663e+2  9.0e-7  904         3383.9
ARNT     -9.57220e+1  1.8e-7  4(37.2)    164.1  -1.76663e+2  4.2e-7  5(74.2)     2386.1
ASQN     -9.57220e+1  8.8e-7  23(24.1)   221.2  -1.76663e+2  7.2e-7  74(31.1)    4388.1
RQN      -9.57220e+1  1.6e-6  213        287.8  -1.76663e+2  3.3e-5  373         4296.7

            pentacene                              gaas
SCF      -1.30846e+2  8.5e-7  12         279.8  -2.86349e+2  5.8e-7  15          41.1
GBB      -1.30846e+2  9.6e-7  101        236.1  -2.86349e+2  7.5e-7  296         77.7
ARNT     -1.30846e+2  2.1e-7  3(14.0)    213.6  -2.86349e+2  7.4e-7  3(46.3)     59.9
ASQN     -1.30846e+2  9.0e-7  23(14.5)   423.0  -2.86349e+2  6.0e-7  35(24.8)    127.2
RQN      -1.30846e+2  2.1e-6  34         437.9  -2.86349e+2  1.5e-6  111         116.0

            si40                                   si64
SCF      -1.57698e+2  7.5e-7  19         3587.4 -2.53730e+2  3.4e-7  10          1100.0
GBB      -1.57698e+2  8.7e-7  289        3657.2 -2.53730e+2  7.3e-7  249         1534.2
ARNT     -1.57698e+2  3.7e-7  3(33.0)    3343.9 -2.53730e+2  7.9e-7  3(47.3)     1106.8
ASQN     -1.57698e+2  9.8e-7  33(23.3)   4968.7 -2.53730e+2  9.4e-7  23(25.0)    1563.9
RQN      -1.57698e+2  4.1e-6  62         4946.7 -2.53730e+2  9.7e-7  122         2789.4

            al                                     ptnio
SCF      -3.52151e+2  7.4e+0  1000       4221.1 -9.25762e+2  1.9e-1  1000        4461.9
GBB      -3.53707e+2  9.7e-7  1129       219.3  -9.26927e+2  2.4e-6  10000       5627.2
ARNT     -3.53710e+2  5.9e-7  59(60.7)   947.7  -9.26927e+2  9.4e-7  104(129.6)  7558.3
ASQN     -3.53710e+2  7.1e-7  94(47.3)   1395.4 -9.26927e+2  9.2e-7  153(69.6)   12728.1
RQN      -3.53710e+2  1.8e-3  267        323.4  -9.26925e+2  2.3e-4  380         924.4
In Tables 4 and 5, ARNT usually converges within a few iterations due to the use of second-order information. It is often the fastest in terms of time, since the computational costs of the two parts of the Hessian ∇²E_ks are not significantly different. GBB also performs comparably well with ARNT. ASQN works reasonably well on most problems. It takes more iterations than ARNT since the limited-memory approximation is often not as good as the exact Hessian. Because the costs of solving the subproblems of ASQN and ARNT are more or less the same, ASQN is not competitive with ARNT here. However, by taking advantage of the problem structure, ASQN is still better than RQN in terms of computational time and accuracy. Finally, we show the convergence behaviors of these five methods on the system "glutamine" in Figure 1. Specifically, the error of the objective function values is defined as

    ΔE_ks(X^k) = E_ks(X^k) − E_min,

where E_min is the minimum total energy attained by all methods.

Fig. 1. Comparisons of different algorithms on "glutamine" for KS total energy minimization. Panels: (a) ΔE_ks(X^k) versus iterations; (b) ΔE_ks(X^k) versus time; (c) ‖grad E_ks(X^k)‖_F versus iterations; (d) ‖grad E_ks(X^k)‖_F versus time. The first two points are the input and output of the initial solver GBB, respectively.
Table 5
Numerical results on KS total energy minimization.

                 c
solver   fval          nrmG     its         time
SCF      -5.29296e+0   7.3e-3   1000        168.3
GBB      -5.31268e+0   1.0e-6   3851        112.7
ARNT     -5.31268e+0   5.7e-7   96(49.1)    211.3
ASQN     -5.31268e+0   6.7e-7   104(38.5)   183.1
RQN      -5.31244e+0   1.4e-3   73          10.8
In this subsection, we compare the performance of three variants of Algorithm 2, where the subproblem is solved by SCF (ACE), by the modified CG method (ARN), and by GBB (GBBN), respectively; the Riemannian L-BFGS method (RQN) in Manopt [6]; and two variants of Algorithm 1 with approximation (5.18) (ASQN) and approximation (5.19) (AKQN). Since the computation of the exact Hessian ∇E_hf is time-consuming, we do not present results using the exact Hessian. The limited-memory Nyström approximation (5.15) serves as an initial Hessian approximation in both ASQN and AKQN (see the illustrative sketch after this discussion). To compare the effectiveness of the quasi-Newton approximation, we set H_e(X_k) to be the limited-memory Nyström approximation (5.15) in (5.19) and use the same framework as in Algorithm 1. We should mention that the subspace refinement is not used in ASQN and AKQN; hence, only structured quasi-Newton iterations are performed in them. The default parameters in RQN and GBB are used. For ACE, GBBN, ASQN, AKQN and ARN, the subproblem is solved until the Frobenius norm of the Riemannian gradient falls below a forcing tolerance proportional to ‖grad f(X_k)‖_F. We also use the adaptive strategy of ARNT [16] for choosing the maximal number of inner iterations in GBBN, ASQN, AKQN and ARN. The settings of the other parameters of ASQN, AKQN and ARN are the same as those in ARNT [16]. For all algorithms, we generate a good initial guess by using GBB to solve the corresponding KS total energy minimization problem (i.e., removing the E_f part from E_hf in the objective function) until a maximal number of 2000 iterations is reached or the Frobenius norm of the Riemannian gradient falls below a prescribed tolerance. The maximal number of iterations for ACE, GBBN, ASQN, ARN and AKQN is set to 200, while that of RQN is set to 1000.

A detailed summary of the computational results is reported in Table 6. We see that ASQN performs best among all the algorithms in terms of both the number of iterations and time, especially on the systems "alanine", "graphene30", "gaas" and "si40". Usually, an algorithm takes fewer iterations if more parts of the Hessian are preserved. Since the computational cost of the Fock energy dominates that of the KS part, algorithms that need fewer outer iterations consume less time to converge; hence, ASQN is faster than AKQN. Comparing ASQN with ARN and RQN, we see that ASQN benefits from our quasi-Newton technique. Using a scaled identity matrix as the initial guess, RQN takes many more iterations than our algorithms, which use the adaptive compressed form of the hybrid exchange operator. ASQN is two times faster than ACE on "graphene30" and "si40". Finally, the convergence behaviors of these six methods on the system "glutamine" are shown in Figure 2, where ∆E_hf(X_k) is defined similarly to the KS case. In summary, algorithms that combine the quasi-Newton technique with the Nyström approximation are often able to give better performance.
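To make the Nyström-based initial Hessian approximation more concrete, the following minimal sketch forms a fixed-rank Nyström approximation of a positive-semidefinite operator that is available only through matrix products, in the spirit of the streaming scheme of [40]. It is an illustration under our own naming and simplifications (real arithmetic, no rank truncation), not the paper's exact construction (5.15).

```python
import numpy as np

def nystrom_factor(matvec, n, r, rng=None):
    """Return F with A ~ F @ F.T, a rank-r Nystrom approximation of an
    implicitly given PSD matrix A, accessed only via matvec(V) = A @ V,
    following the fixed-rank PSD sketch of Tropp et al. [40]."""
    rng = np.random.default_rng() if rng is None else rng
    Omega = rng.standard_normal((n, r))          # Gaussian test matrix
    Y = matvec(Omega)                            # sketch: r products with A
    nu = np.sqrt(n) * np.finfo(Y.dtype).eps * np.linalg.norm(Y)
    Y = Y + nu * Omega                           # tiny shift for stability
    B = Omega.T @ Y
    B = (B + B.T) / 2.0                          # symmetrize the core matrix
    L = np.linalg.cholesky(B)                    # PD after the shift
    return np.linalg.solve(L, Y.T).T             # F = Y @ L^{-T}
```

In this setting, if matvec applies the expensive Hessian part H_e(X_k), the products H_e(X_k)V can be replaced by the cheap surrogate F(F^T V) inside the subproblem solver, at the cost of only r applications of H_e(X_k) up front.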
Table 6
Numerical results on HF total energy minimization.

                 alanine                                      c12h26
solver   fval          nrmG     its         time       fval          nrmG     its         time
ACE      -6.61821e+1   3.8e-7   11(3.0)     261.7      -8.83756e+1   3.9e-7   8(2.9)      259.7
GBBN     -6.61821e+1   1.0e-6   11(17.4)    268.8      -8.83756e+1   4.9e-4   200(68.7)   11839.8
ARN      -6.61821e+1   9.5e-7   10(13.7)    206.6      -8.83756e+1   4.9e-4   200(2.4)    4230.3
ASQN     -6.61821e+1   9.1e-7   7(14.1)     169.6      -8.83756e+1   2.1e-7   7(12.6)     234.1
AKQN     -6.61821e+1   4.8e-7   31(7.5)     530.2      -8.83756e+1   4.9e-7   29(7.6)     871.2
RQN      -6.61821e+1   1.9e-6   76          1428.5     -8.83756e+1   1.3e-3   45          3446.3

                 ctube661                                     glutamine
ACE      -1.43611e+2   9.2e-7   8(2.8)      795.0      -1.04525e+2   3.9e-7   10(3.0)     229.6
GBBN     -1.43611e+2   6.5e-7   10(26.3)    1399.2     -1.04525e+2   8.4e-7   11(13.3)    256.9
ARN      -1.43611e+2   6.0e-7   9(14.1)     832.7      -1.04525e+2   8.8e-7   10(9.5)     209.5
ASQN     -1.43611e+2   2.0e-7   8(13.2)     777.1      -1.04525e+2   1.5e-7   8(10.1)     182.9
AKQN     -1.43611e+2   6.1e-7   17(10.3)    1502.0     -1.04525e+2   9.1e-7   25(6.0)     515.7
RQN      -1.43611e+2   7.2e-6   59          6509.0     -1.04525e+2   2.9e-6   57          1532.8

                 graphene16                                   graphene30
ACE      -1.01716e+2   7.6e-7   13(3.4)     367.0      -1.87603e+2   8.6e-7   58(4.2)     14992.0
GBBN     -1.01716e+2   4.2e-7   14(42.1)    659.0      -1.87603e+2   8.9e-7   29(72.2)    19701.8
ARN      -1.01716e+2   4.5e-7   14(23.0)    403.6      -1.87603e+2   9.0e-7   45(35.6)    14860.6
ASQN     -1.01716e+2   4.9e-7   11(20.2)    357.5      -1.87603e+2   7.6e-7   15(26.5)    6183.0
AKQN     -1.01716e+2   7.9e-7   49(15.1)    1011.0     -1.87603e+2   8.0e-7   39(12.3)    9770.7
RQN      -1.01716e+2   1.0e-3   74          2978.9     -1.87603e+2   1.5e-5   110         39091.0

                 pentacene                                    gaas
ACE      -1.39290e+2   6.2e-7   13(3.0)     1569.5     -2.93496e+2   8.8e-7   29(2.9)     343.8
GBBN     -1.39290e+2   8.2e-7   16(23.0)    2620.2     -2.93496e+2   9.3e-7   34(35.3)    659.3
ARN      -1.39290e+2   7.2e-7   15(12.2)    1708.1     -2.93496e+2   9.6e-7   31(20.4)    468.7
ASQN     -1.39290e+2   1.9e-7   9(14.3)     1168.1     -2.93496e+2   3.3e-7   10(28.0)    199.5
AKQN     -1.39290e+2   5.4e-7   29(8.5)     3458.4     -2.93496e+2   4.6e-7   22(18.4)    347.1
RQN      -1.39290e+2   2.4e-6   73          11363.8    -2.93496e+2   1.0e-6   126         2154.1

                 si40                                         si64
ACE      -1.65698e+2   9.2e-7   29(4.5)     30256.4    -2.67284e+2   9.8e-7   9(2.9)      6974.3
GBBN     -1.65698e+2   8.6e-7   24(43.9)    34692.4    -2.67284e+2   5.3e-7   14(27.0)    11467.9
ARN      -1.65698e+2   8.0e-7   22(22.1)    21181.3    -2.67284e+2   7.7e-7   12(18.6)    9180.7
ASQN     -1.65698e+2   2.8e-7   12(37.8)    15369.5    -2.67284e+2   3.0e-7   8(21.9)     6764.7
AKQN     -1.65698e+2   9.2e-7   87(7.9)     89358.8    -2.67284e+2   7.1e-7   24(18.8)    33379.0
RQN      -1.65698e+2   6.1e-6   156         181976.8   -2.67284e+2   8.4e-7   112         115728.8

(its: number of outer iterations; where a number appears in parentheses, it is the average number of inner iterations per outer iteration.)

[Figure 2: Comparisons of different algorithms on "glutamine" of HF total energy minimization: (a) ∆E_hf(X_k) versus iterations; (b) ∆E_hf(X_k) versus time; (c) ‖grad E_hf(X_k)‖_F versus iterations; (d) ‖grad E_hf(X_k)‖_F versus time.]

7. Conclusion.
We present a structured quasi-Newton method for optimization with orthogonality constraints. Instead of approximating the full Riemannian Hessian directly, we construct an approximation to the Euclidean Hessian and a regularized subproblem that uses this approximation while keeping the orthogonality constraints. When the subproblem is solved inexactly, global and local q-superlinear convergence can be guaranteed under certain assumptions. Our structured quasi-Newton method also takes advantage of the structure of the objective function when some parts are much more expensive to evaluate than others. Our numerical experiments on linear eigenvalue problems, KSDFT and HF total energy minimization demonstrate that our structured quasi-Newton algorithm is very competitive with state-of-the-art algorithms.

The performance of the quasi-Newton methods can be further improved in several respects, for example, by finding a better initial quasi-Newton matrix than the Nyström approximation and by developing a better quasi-Newton approximation than the LSR1 technique. Our technique can also be extended to general Riemannian optimization problems with similar structures.
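The compact representation of limited-memory SR1 (LSR1) matrices referred to above goes back to [9]. As a rough, self-contained illustration (our own naming, real arithmetic, and no safeguard against a singular middle matrix), applying such an operator to a block of vectors can be sketched as follows:

```python
import numpy as np

def lsr1_apply(b0_matvec, S, Y, V):
    """Apply B_k @ V for the compact limited-memory SR1 operator of [9]:
        B_k = B_0 + W M^{-1} W^T,   W = Y - B_0 S,
        M   = D + L + L^T - S^T B_0 S,
    where S, Y (n-by-m) stack the stored pairs (s_i, y_i) columnwise,
    D is the diagonal of S^T Y and L its strictly lower triangular part."""
    B0S = b0_matvec(S)                     # m actions of the initial matrix
    W = Y - B0S
    SY = S.T @ Y
    Lo = np.tril(SY, k=-1)                 # strictly lower part of S^T Y
    M = np.diag(np.diag(SY)) + Lo + Lo.T - S.T @ B0S
    return b0_matvec(V) + W @ np.linalg.solve(M, W.T @ V)
```

In the setting of this paper, b0_matvec would stand for the action of the initial quasi-Newton matrix, for instance a limited-memory Nyström approximation of the expensive Hessian part as sketched earlier.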
REFERENCES

[1] P.-A. Absil, C. G. Baker, and K. A. Gallivan, Trust-region methods on Riemannian manifolds, Found. Comput. Math., 7 (2007), pp. 303–330.
[2] P.-A. Absil, R. Mahony, and R. Sepulchre, Optimization algorithms on matrix manifolds, Princeton University Press, Princeton, NJ, 2008.
[3] P.-A. Absil, R. Mahony, and J. Trumpf, An extrinsic look at the Riemannian Hessian, in Geometric Science of Information, Springer, 2013, pp. 361–368.
[4] A. D. Becke, Density-functional thermochemistry. III. The role of exact exchange, J. Chem. Phys., 98 (1993), pp. 5648–5652.
[5] N. Boumal, P.-A. Absil, and C. Cartis, Global rates of convergence for nonconvex optimization on manifolds, IMA J. Numer. Anal., (2016).
[6] N. Boumal, B. Mishra, P.-A. Absil, and R. Sepulchre, Manopt, a Matlab toolbox for optimization on manifolds, J. Mach. Learn. Res., 15 (2014), pp. 1455–1459.
[7] R. H. Byrd, H. F. Khalfan, and R. B. Schnabel, Analysis of a symmetric rank-one trust region method, SIAM J. Optim., 6 (1996), pp. 1025–1039.
[8] R. H. Byrd, M. Marazzi, and J. Nocedal, On the convergence of Newton iterations to non-stationary points, Math. Program., 99 (2004), pp. 127–148.
[9] R. H. Byrd, J. Nocedal, and R. B. Schnabel, Representations of quasi-Newton matrices and their use in limited memory methods, Math. Program., 63 (1994), pp. 129–156.
[10] A. Edelman, T. A. Arias, and S. T. Smith, The geometry of algorithms with orthogonality constraints, SIAM J. Matrix Anal. Appl., 20 (1999), pp. 303–353.
[11] D. Gabay, Minimizing a differentiable function over a differential manifold, J. Optim. Theory Appl., 37 (1982), pp. 177–219.
[12] P. Giannozzi, S. Baroni, N. Bonini, M. Calandra, R. Car, C. Cavazzoni, D. Ceresoli, G. L. Chiarotti, M. Cococcioni, I. Dabo, et al., QUANTUM ESPRESSO: a modular and open-source software project for quantum simulations of materials, J. Phys.: Condens. Matter, 21 (2009), p. 395502.
[13] S. Gratton and P. L. Toint, Multi-secant equations, approximate invariant subspaces and multigrid optimization, tech. report, Dept. of Mathematics, FUNDP, Namur (B), 2007.
[14] D. Hamann, Optimized norm-conserving Vanderbilt pseudopotentials, Phys. Rev. B, 88 (2013), p. 085117.
[15] J. Heyd, G. E. Scuseria, and M. Ernzerhof, Hybrid functionals based on a screened Coulomb potential, J. Chem. Phys., 118 (2003), pp. 8207–8215.
[16] J. Hu, A. Milzarek, Z. Wen, and Y. Yuan, Adaptive quadratically regularized Newton method for Riemannian optimization, SIAM J. Matrix Anal. Appl., 39 (2018), pp. 1181–1207.
[17] W. Hu, L. Lin, and C. Yang, Projected commutator DIIS method for accelerating hybrid functional electronic structure calculations, J. Chem. Theory Comput., 13 (2017), pp. 5458–5467.
[18] W. Huang, Optimization algorithms on Riemannian manifolds with applications, PhD thesis, The Florida State University, 2013.
[19] W. Huang, P. Absil, K. Gallivan, and P. Hand, ROPTLIB: an object-oriented C++ library for optimization on Riemannian manifolds, Tech. Report FSU16-14, Florida State University, 2016.
[20] W. Huang, P.-A. Absil, and K. Gallivan, A Riemannian BFGS method without differentiated retraction for nonconvex optimization problems, SIAM J. Optim., 28 (2018), pp. 470–495.
[21] W. Huang, P.-A. Absil, and K. A. Gallivan, A Riemannian symmetric rank-one trust-region method, Math. Program., 150 (2015), pp. 179–216.
[22] W. Huang, P.-A. Absil, and K. A. Gallivan, A Riemannian BFGS method for nonconvex optimization problems, in Numerical Mathematics and Advanced Applications ENUMATH 2015, Springer, 2016, pp. 627–634.
[23] W. Huang, K. A. Gallivan, and P.-A. Absil, A Broyden class of quasi-Newton methods for Riemannian optimization, SIAM J. Optim., 25 (2015), pp. 1660–1685.
[24] R. E. Kass, Nonlinear regression analysis and its applications, J. Am. Stat. Assoc., 85 (1990), pp. 594–596.
[25] A. V. Knyazev, Toward the optimal preconditioned eigensolver: locally optimal block preconditioned conjugate gradient method, SIAM J. Sci. Comput., 23 (2001), pp. 517–541.
[26] K. Kreutz-Delgado, The complex gradient operator and the CR-calculus, 2009. http://arxiv.org/abs/0906.4835.
[27] C. Le Bris, Computational chemistry from the perspective of numerical analysis, Acta Numer., 14 (2005), pp. 363–444.
[28] L. Lin, Adaptively compressed exchange operator, J. Chem. Theory Comput., 12 (2016), pp. 2242–2249.
[29] L. Lin and M. Lindsey, Convergence of adaptive compression methods for Hartree-Fock-like equations, Commun. Pure Appl. Math., (2017), in press.
[30] X. Liu, Z. Wen, and Y. Zhang, Limited memory block Krylov subspace optimization for computing dominant singular value decompositions, SIAM J. Sci. Comput., 35 (2013), pp. A1641–A1668.
[31] R. M. Martin, Electronic structure: basic theory and practical methods, Cambridge University Press, 2004.
[32] J. Nocedal and S. J. Wright, Numerical Optimization, Springer Series in Operations Research and Financial Engineering, Springer, New York, second ed., 2006.
[33] C. Qi, Numerical optimization methods on Riemannian manifolds, PhD thesis, Florida State University, 2011.
[34] W. Ring and B. Wirth, Optimization methods on Riemannian manifolds and their application to shape space, SIAM J. Optim., 22 (2012), pp. 596–627.
[35] M. Seibert, M. Kleinsteuber, and K. Hüper, Properties of the BFGS method on Riemannian manifolds, Mathematical System Theory, Festschrift in Honor of Uwe Helmke on the Occasion of his Sixtieth Birthday, (2013), pp. 395–412.
[36] S. T. Smith, Optimization techniques on Riemannian manifolds, Fields Institute Communications, 3 (1994).
[37] W. Sun and Y. Yuan, Optimization theory and methods: nonlinear programming, vol. 1, Springer Science & Business Media, 2006.
[38] A. Szabo and N. S. Ostlund, Modern quantum chemistry: introduction to advanced electronic structure theory, Courier Corporation, 2012.
[39] L. Thogersen, J. Olsen, A. Kohn, P. Jorgensen, P. Salek, and T. Helgaker, The trust-region self-consistent field method in Kohn–Sham density functional theory, J. Chem. Phys., 123 (2005), p. 074103.
[40] J. A. Tropp, A. Yurtsever, M. Udell, and V. Cevher, Fixed-rank approximation of a positive-semidefinite matrix from streaming data, in Advances in Neural Information Processing Systems, 2017, pp. 1225–1234.
[41] C. Udriste, Convex functions and optimization methods on Riemannian manifolds, vol. 297, Springer Science & Business Media, 1994.
[42] R. van Leeuwen, Density functional approach to the many-body problem: key concepts and exact functionals, Adv. Quantum Chem., 43 (2003), pp. 25–94.
[43] Z. Wen, A. Milzarek, M. Ulbrich, and H. Zhang, Adaptive regularized self-consistent field iteration with exact Hessian for electronic structure calculation, SIAM J. Sci. Comput., 35 (2013), pp. A1299–A1324.
[44] Z. Wen and W. Yin, A feasible method for optimization with orthogonality constraints, Math. Program., 142 (2013), pp. 397–434.
[45] C. Yang, J. C. Meza, B. Lee, and L.-W. Wang, KSSOLV, a MATLAB toolbox for solving the Kohn-Sham equations, ACM Trans. Math. Softw., 36 (2009), pp. 1–35.
[46] C. Yang, J. C. Meza, and L.-W. Wang, A trust region direct constrained minimization algorithm for the Kohn-Sham equation, SIAM J. Sci. Comput., 29 (2007), pp. 1854–1875.
[47]