[PDF] An orthogonalization-free parallelizable framework for all-electron calculations in density functional theory

Abstract

All-electron calculations play an important role in density functional theory, in which improving computational efficiency is one of the most needed and challenging tasks. In the model formulations, both nonlinear eigenvalue problem and total energy minimization problem pursue orthogonal solutions. Most existing algorithms for solving these two models invoke orthogonalization process either explicitly or implicitly in each iteration. Their efficiency suffers from this process in view of its cubic complexity and low parallel scalability in terms of the number of electrons for large scale systems. To break through this bottleneck, we propose an orthogonalization-free algorithm framework based on the total energy minimization problem. It is shown that the desired orthogonality can be gradually achieved without invoking orthogonalization in each iteration. Moreover, this framework fully consists of Basic Linear Algebra Subprograms (BLAS) operations and thus can be naturally parallelized. The global convergence of the proposed algorithm is established. We also present a precondition technique which can dramatically accelerate the convergence of the algorithm. The numerical experiments on all-electron calculations show the efficiency and high scalability of the proposed algorithm.

Full PDF

AAN ORTHOGONALIZATION-FREE PARALLELIZABLE FRAMEWORK FORALL-ELECTRON CALCULATIONS IN DENSITY FUNCTIONAL THEORY ∗ BIN GAO † , GUANGHUI HU ‡ , YANG KUANG § , AND

XIN LIU ¶ Abstract.

All-electron calculations play an important role in density functional theory, in which improvingcomputational efﬁciency is one of the most needed and challenging tasks. In the model formulations, both nonlineareigenvalue problem and total energy minimization problem pursue orthogonal solutions. Most existing algorithmsfor solving these two models invoke orthogonalization process either explicitly or implicitly in each iteration. Theirefﬁciency suffers from this process in view of its cubic complexity and low parallel scalability in terms of the num-ber of electrons for large scale systems. To break through this bottleneck, we propose an orthogonalization-freealgorithm framework based on the total energy minimization problem. It is shown that the desired orthogonality canbe gradually achieved without invoking orthogonalization in each iteration. Moreover, this framework fully con-sists of Basic Linear Algebra Subprograms (BLAS) operations and thus can be naturally parallelized. The globalconvergence of the proposed algorithm is established. We also present a precondition technique which can dramati-cally accelerate the convergence of the algorithm. The numerical experiments on all-electron calculations show theefﬁciency and high scalability of the proposed algorithm.

Key words. density functional theory, all-electron calculations, orthogonalization-free, parallel algorithm

AMS subject classiﬁcations.

1. Introduction.

We aim to ﬁnd the ground state solution of a molecular system fromall-electron calculations. In view of Kohn–Sham density functional theory (KSDFT) [19],this can be achieved by solving the lowest p eigenpairs of the Kohn–Sham equation:(1.1)  ˆ Hψ l ( r ) = ε l ψ l ( r ) , l = 1 , , . . . , p, (cid:90) R ψ l ψ l (cid:48) d r = δ ll (cid:48) , l, l (cid:48) = 1 , , . . . , p, where ˆ H is the Hamiltonian operator, ψ l ( r ) is the l -th wavefunction (eigenfunction), ε l refersto the corresponding eigenenergy, δ ll (cid:48) is the Kronecker delta function, and p denotes thenumber of electrons. Alternatively, the ground state solution can be obtained by minimizingthe total energy with orthogonality constraints [27]:(1.2) min Ψ E KS ( Ψ )s . t . (cid:104) Ψ, Ψ (cid:105) = I p , where Ψ = ( ψ , ψ , . . . , ψ p ) , E KS denotes the Kohn–Sham total energy, (cid:104)· , ·(cid:105) stands forthe inner product, and I p denotes the p × p identity matrix. For notation brevity, we drop ∗ Submitted to the editors July 28, 2020.

Funding:

BG was supported by the Fonds de la Recherche Scientiﬁque – FNRS and the Fonds Wetenschap-pelijk Onderzoek – Vlaanderen under EOS Project no. 30468160. GH was supported from FDCT of Macao SAR(FDCT 029/2016/A1), MYRG of University of Macau (MYRG2017-00189-FST, MYRG2019-00154-FST), and Na-tional Natural Science Foundation of China (Grant Nos. 11922120, 11871489, and 11401608). YK was supportedby the Academic Research Fund of the Ministry of Education of Singapore under grant No. R-146-000-291-114.XL was supported in part by the National Natural Science Foundation of China (No. 11971466, 11991021 and11991020), Key Research Program of Frontier Sciences, Chinese Academy of Sciences (No. ZDBS-LY-7022), theNational Center for Mathematics and Interdisciplinary Sciences, Chinese Academy of Sciences and the Youth Inno-vation Promotion Association, Chinese Academy of Sciences. † ICTEAM Institute, UCLouvain, Louvain-la-Neuve, Belgium ([email protected]). ‡ Department of Mathematics, University of Macau, Macao SAR, China; Zhuhai UM Science & TechnologyResearch Institute, Guangdong Province, China ([email protected]). § Corresponding author. Department of Mathematics, National University of Singapore, Singapore([email protected]). ¶ State Key Laboratory of Scientiﬁc and Engineering Computing, Academy of Mathematics and Systems Science,Chinese Academy of Sciences, and University of Chinese Academy of Sciences, China ([email protected]).1 a r X i v : . [ phy s i c s . c o m p - ph ] J u l B. GAO, G. HU, Y. KUANG, AND X. LIU the subscript and let I = I p . The detailed expressions of the Hamiltonian operator and theKohn–Sham total energy are introduced in the next section. In electronic structure calculations, the pseu-dopotential approaches have proven to be successful in predicting electrical, magnetic andchemical properties for a wide range of materials [28]. However, the pseudopotentials canhardly construct the transition metals accurately [21] and tend to mispredict the material prop-erties under extreme environment [31]. As a result, all-electron calculations which treat theCoulomb external potential exactly are in demand.One of the most challenging aspects in all-electron calculations is the computational efﬁ-ciency, which is usually dominated by two factors: the singularities arising from the Coulombexternal potential and the orthogonality constraints of the wavefunctions.To handle the singularities, the numerical discretization is generally required to be welldesigned in a manner such that it is able to capture the sharp variations of the orbitals andmeanwhile describe the results on the regions where the orbitals vary slightly with the leasteffort. We focus on the ﬁnite element discretization (FEM) [30, 29, 3, 7] since it has local ba-sis and allows a spatially adaptive resolution. Other discretizations handling the singularitiescan be found in [2, 8] and references therein.When the quantum system is large, all-electron calculations turn into expensive [22]. Inparticular, to keep the orthogonality of the orbitals becomes the bottleneck in most existingalgorithms. The self-consistent ﬁeld (SCF) method and its variants [19, 17] are commonlyused to solve the KS equation (1.1). However, the global convergence of the SCF-based al-gorithms can hardly be guaranteed [23, 24] and hence good initial guesses are often crucialfor their performance. Since they lack robustness, the bad performance is often observed innumerical experiments [32]. This motivates the research on solving the total energy mini-mization problem (1.2) directly. Most of the ﬁrst-order methods, such as QR retraction [33]and multipliers correction framework [11], carry out a feasible update. Namely, certain or-thogonalization process is invoked in each iteration. Note that the orthogonalization processcosts at least O ( p ) per iteration. Hence, these methods are not competent in solving largequantum systems due to this cubic complexity and the low scalability of any orthogonaliza-tion process.Several algorithms have been exploited to avoid the orthogonalization. Linear scalingmethods [6] build the solutions by direct minimization of unconstrained variational formu-lations. Note that most of them require to estimate the upper bound of the eigenvalue ofthe Hamiltonian [22], which is intractable in practice. Recently, an infeasible optimizationalgorithm based on the augmented Lagrangian method has been proposed in [12]. Here, “in-feasible” indicates that the iterate is not required to satisfy the constraints in each iteration.The orthogonality can be guaranteed at any cluster point of the iteration sequence. Anotherfavored property of this algorithm is that it is not sensitive with the choices of initial guess andparameters which makes it robust. Moreover, it is illustrated both theoretically and numeri-cally that this algorithm does not highly rely on any priori knowledge of the studied system.All the calculations in which fully consist of BLAS operations. Thus it can be naturallyparallelized. In view of these features, a parallelizable framework based on this infeasibleminimization method for all-electron calculations is proposed. In this paper, we provide a competitive algorithm framework for all-electron calculations in the density functional theory. The framework consists of four partsshown in Figure 1, i.e., the pre-processing part for conﬁguring the problem, the discretizationpart for numerically discretizing the continuous problem, the solving part for obtaining thesolutions of the discretized system, and the post-processing part for transforming the numer-ical solutions for the further applications.

RTHOGONALIZATION-FREE FRAMEWORK FOR KSDFT pre-processing discretization solvingmainiteration stop? post-processingNo YesF IG . 1. Flowchart of the framework for ground state calculations.

The efﬁciency of all-electron calculations beneﬁts from the following aspects of the pro-posed framework in Figure 1: i). a quality ﬁnite element space is designed for the givenelectronic structure based on the a priori analysis; ii). an orthogonalization-free method isproposed and analyzed for the discretized minimization problem; iii). high scalability is suc-cessfully demonstrated by numerical examples.More speciﬁcally, in preparation of the tetrahedron mesh, the decay of the external poten-tial is studied with the linear interpolation theory in [15, 29, 20], and a strategy on generatingradial mesh for optimally capturing such decay is designed for a given electronic structure. Itis noted that a quality ﬁnite element space would be built based on the radial mesh, and theefﬁciency of the algorithm would beneﬁt from the sparsity of the discretized system and themature and robust solvers for the sparse system such as the algebraic multigrid method.The new method for the discretized optimization problem (1.2) is proposed by extend-ing the parallelizable column-wise augmented Lagrangian (PCAL) [12] from the followingtwo aspects. First, the PCAL is revised to handle the minimization problem with generalorthogonality constraints X (cid:62) BX = I rather than the standard ones X (cid:62) X = I . The globalconvergence of the new method is established theoretically. Second, a precondition strategyis proposed for the class of the PCAL methods, and a speciﬁc preconditioner is designed forall-electron calculations, which brings the dramatic acceleration for the convergence in thesimulations.As an attractive feature of the proposed algorithm, the robustness is successfully shownby a variety of numerical experiments, i.e., a random initial guess works for all numericalexperiments in this paper, and the numerical convergence of the algorithm is not sensitive tothe selection of the parameters. Finally, the high scalability of the algorithm is demonstratedby the numerical examples, which obviously indicates the potential of our algorithm for thelarge scale systems. SR p × p := { S ∈ R p × p | S (cid:62) = S } refers to theset of p × p real symmetric matrices. σ min ( A ) denotes the smallest singular value of givenreal matrix A . Diag( v ) ∈ SR n × n denotes a diagonal matrix with all entries of v ∈ R n in its diagonal, and diag( A ) ∈ R n extracts the diagonal entries of matrix A ∈ R n × n . Forconvenience, Θ( M ) := Diag(diag( M )) represents the diagonal matrix with the diagonalentries of square matrix M in its diagonal. sym( A ) := ( A + A (cid:62) ) stands for the average ofa square matrix and its transpose.The organization of this paper is as follows. The KSDFT and its discretization are pre-sented in section 2. In section 3, we present the algorithm and its convergence results. Theimplementation details of the proposed framework are introduced in section 4 and the nu-merical experiments are reported in section 5. In the end, we draw a brief conclusion andintroduce the future works.

2. Finite Element Discretization for KSDFT.

In this section, we introduce the detailedformulations for KSDFT and the discretization part as illustrated in Figure 1.

B. GAO, G. HU, Y. KUANG, AND X. LIU

We consider a molecular system in R consisting of M nuclei of charges Z , . . . , Z M locating at the positions R , . . . , R M and p electrons in the non-relativistic set-ting. The atomic unit is adopted in this work. Thus the Hamiltonian operator ˆ H in theKohn–Sham equation (1.1) can be written as(2.1) ˆ H = − ∇ + V ext ( r ) + V Har ([ ρ ]; r ) + V xc ([ ρ ]; r ) , where the notation V ([ ρ ]; r ) implies that V is a functional of the electron density ρ ( r ) = (cid:80) pl =1 | ψ l ( r ) | . The ﬁrst term −∇ / in ˆ H is the kinetic operator. The second term in ˆ H describes the Coulomb external potential due to the nuclei which takes the form(2.2) V ext ( r ) = − M (cid:88) j =1 Z j | r − R j | . The third term is the Hartree potential describing the Coulomb repulsion among the electrons(2.3) V Har ([ ρ ]; r ) = (cid:90) R ρ ( r (cid:48) ) | r − r (cid:48) | d r (cid:48) . The last term V xc stands for the exchange-correlation potential, which is caused by the Pauliexclusion principle and other non-classical Coulomb interactions. Note that the analyticalexpression for the exchange-correlation term is unknown and therefore an approximation isneeded. Speciﬁcally, the local density approximation (LDA) from the library Libxc [25] isadopted in this work.The total energy of the given quantum system consists of several parts:(2.4) E KS = E kinetic + E ext + E Har + E xc + E nuc , where E kinetic is the kinetic energy, and E ext , E Har , E xc , and E nuc are the potential energiesinduced by V ext , V Har , V xc , and the nucleus-nucleus potential, respectively. Denoting theexchange-correlation energy per particle by (cid:15) xc ( ρ ) , then V xc is the functional derivative of (cid:15) xc ( ρ ) with respect to ρ , i.e., V xc = δ(cid:15) xc ( ρ ) /δρ . As a result, it follows that E kinetic = 12 p (cid:88) l =1 (cid:90) R |∇ ψ l | d r , E ext = (cid:90) R V ext ρ ( r ) d r , E Har = 12 (cid:90) R V Har ρ ( r ) d r ,E xc = (cid:90) R (cid:15) xc ρ ( r ) d r , E nuc = M (cid:88) j =1 M (cid:88) k = j +1 Z j Z k | R j − R k | . Note that E nuc is a constant for the given system.The ground state of the given system can be obtained from solving either the KS equa-tion (1.1) or the total energy minimization problem (1.2). In order to numerically solve thecontinuous problem, we consider the ﬁnite element discretization. In practical simulations, a bounded polyhedral do-main Ω ⊂ R is served as the computational domain. Thus the variational form of the Kohn–Sham equation (1.1) on Ω can be formulated as: Find ( ε l , ψ l ) ∈ R × H (Ω) , l = 1 , , . . . , p ,such that(2.5)  (cid:90) Ω ϕ ˆ Hψ l d r = ε l (cid:90) Ω ψ l ϕd r , ∀ ϕ ∈ H (Ω) , (cid:90) Ω ψ l ψ l (cid:48) d r = δ ll (cid:48) , l (cid:48) = 1 , , . . . , p, RTHOGONALIZATION-FREE FRAMEWORK FOR KSDFT H (Ω) = { ϕ ∈ H (Ω) : ϕ | Ω = 0 } and H (Ω) is a standard Sobolev space.To build a high quality ﬁnite element space to approximate the solution of (2.5) in all-electron calculations, the singularities stemming from the Coulomb potential in (2.2) shouldbe prudently treated. In this work, we adopt a radial mesh generation strategy to resolve thedifﬁculty brought by the singularities; see subsection 4.2 for details.Assume that the linear ﬁnite element space V h ⊂ H (Ω) is constructed on the boundeddomain Ω partitioned by T = {T K , K = 1 , , . . . , N ele } , where N ele represents the totalnumber of elements of T . Several commonly used notations in V h are deﬁned here. Thebasis functions are denoted by ϕ i , i = 1 , . . . , n , where n is the dimension of V h and the set ofbasis functions is denoted by N = ( ϕ , . . . , ϕ n ) (cid:62) . We construct the matrix of basis function B with B i,j = ϕ i ϕ j , then the symmetric mass matrix B ∈ SR n × n can be obtained from B i,j = (cid:82) Ω B i,j d r . Furthermore, a sequence of matrices { G ( l ) ∈ SR n × n , l = 1 , . . . , n } withthe entries ( G ( l ) ) i,j = (cid:82) Ω B i,j ϕ l d r are introduced. The discretized Laplacian L ∈ SR n × n on V h is deﬁned as L i,j = (cid:82) Ω ∇ ϕ j · ∇ ϕ i d r .On the ﬁnite element space V h , the discretized variation form of (2.5) turns out: Find ( ε hl , ψ hl ) ∈ R × V h , l = 1 , , . . . , p , such that(2.6)  (cid:90) Ω ϕ ˆ Hψ hl d r = ε l (cid:90) Ω ψ hl ϕd r , ∀ ϕ ∈ V h , (cid:90) Ω ψ hl ψ hl (cid:48) d r = δ ll (cid:48) , l (cid:48) = 1 , , . . . , p. We express the l -th wavefunction as ψ hl = (cid:80) ni =1 X i,l ϕ i = X (cid:62) l N , where X ∈ R n × p and X i,l stands for the i -th degree of freedom of ψ hl . Then the electron density can be rewritten as ρ ( r ) = p (cid:88) l =1 ( X (cid:62) l N )( X (cid:62) l N ) = p (cid:88) l =1 X (cid:62) l B X l = tr ( X (cid:62) B X ) . Note that the Hartree potential V Har in (2.3) is also the solution to the Poisson equation −∇ V Har = 4 πρ ( r ) . We denote the discretized Hartree potential by U ( X ) ∈ R n such that V Har = U ( X ) (cid:62) N . After the ﬁnite element discretization on the Poisson equation, U is cal-culated from the linear system LU ( X ) = 4 π (cid:0) tr ( X (cid:62) G (1) X ) , . . . , tr ( X (cid:62) G ( n ) X ) (cid:1) (cid:62) . In prac-tical simulations, this linear system is solved by an efﬁcient algebraic multigrid method [3].Due to the arbitrary of ϕ in (2.6), we can choose ϕ = ϕ i , i = 1 , . . . , n . In view of aboveexpressions, ﬁnding the solution of the discretized variational form (2.6) turns out solving thegeneralized nonlinear eigenvalue problem:(2.7) (cid:40) H ( X ) X = BX Ξ ,X (cid:62) BX = I p , where Ξ = Diag( ε h , . . . , ε hp ) , H ( X ) ∈ SR n × n is the discretized Hamiltonian matrix whichcan be formulated from (2.1) as(2.8) H ( X ) = 12 L + M ext + M Har ( X ) + M xc ( X ) . The matrices M ext , M Har ( X ) , M xc ( X ) ∈ R n × n are deﬁned as ( M ext ) i,j = (cid:90) Ω V ext B i,j d r , ( M Har ) i,j = (cid:90) Ω V Har B i,j d r , ( M xc ) i,j = (cid:90) Ω V xc B i,j d r . B. GAO, G. HU, Y. KUANG, AND X. LIU

We now represent the total energy (2.4) in the discretized form: E kinetic ( X ) = 12 p (cid:88) l =1 (cid:90) Ω ∇ ψ l · ∇ ψ l d r = 12 p (cid:88) l =1 (cid:90) Ω X (cid:62) l ∇N · X (cid:62) l ∇N d r = 12 tr ( X (cid:62) LX ) ,E ext ( X ) = (cid:90) Ω V ext ρ ( r ) d r = (cid:90) Ω V ext tr ( X (cid:62) B X ) d r = tr ( X (cid:62) M ext X ) ,E Har ( X ) = 12 (cid:90) Ω V Har ρ ( r ) d r = 12 (cid:90) Ω V Har tr ( X (cid:62) B X ) d r = 12 tr ( X (cid:62) M Har ( X ) X ) ,E xc ( X ) = (cid:90) Ω (cid:15) xc ρ ( r ) d r = (cid:90) Ω (cid:15) xc tr ( X (cid:62) B X ) d r = tr ( X (cid:62) M exc ( X ) X ) , where the matrix M exc ( X ) in the last formula is deﬁned as ( M exc ) i,j = (cid:82) Ω ε xc B i,j d r . Thusthe discretized form of the minimization problem (1.2) is assembled as(2.9) min X ∈ R n × p E KS ( X ) = E kinetic ( X ) + E ext ( X ) + E Har ( X ) + E xc ( X ) + E nuc s . t . X (cid:62) BX = I p . The generalized orthogonality constraints in (2.9) are known as the generalized Stiefel man-ifold [1], denoted by S Bn,p := { X ∈ R n × p : X (cid:62) BX = I p } . Note that the gradient of E KS ( X ) satisﬁes ∇ E KS ( X ) = 2 H ( X ) X , while we scale it as ∇ E KS ( X ) = H ( X ) X to beconsistent with the convention.

3. Parallelizable Algorithms.

In this section, we concentrate on the solving part inFigure 1. Namely, the discretized total energy minimization problem (2.9) is considered. Weﬁrst state its optimality condition. Then a one step gradient-descent update is proposed forsolving (2.9) and its global convergence result is established. We also develop an upgradedalgorithm based on the column-wise block minimization with preconditioning.The discretized total energy minimization problem (2.9) is a nonconvex constrained op-timization problem due to the orthogonality constraints. We state its ﬁrst-order optimalitycondition as follows.D

EFINITION

Given X ∈ R n × p , we call X a ﬁrst-order stationary point of (2.9) ifthe following condition (3.1) (cid:26) tr( Z (cid:62) ∇ E KS ( X )) ≥ ,X (cid:62) BX = I p holds for any Z ∈ T S Bn,p ( X ) , where T S Bn,p ( X ) := { Z ∈ R n × p : Z (cid:62) BX + X (cid:62) BZ = 0 } isthe tangent space of S Bn,p at X . Following from [11, Lemma 2.2], it can be proved that the condition (3.1) is equivalent to(3.2)  ( I n − BXX (cid:62) ) ∇ E KS ( X ) = 0 ,X (cid:62) ∇ E KS ( X ) = ∇ E KS ( X ) (cid:62) X,X (cid:62) BX = I p . In fact, the second equality of (3.2) is automatically satisﬁed since ∇ E KS ( X ) = H ( X ) X and the Hamiltonian H ( X ) is symmetric. Moreover, the condition (3.2) can be further refor-mulated as(3.3) (cid:26) ∇ E KS ( X ) = BX Λ ,X (cid:62) BX = I p , RTHOGONALIZATION-FREE FRAMEWORK FOR KSDFT Λ ∈ SR p × p can be regarded as the Lagrangian multipliers of thegeneralized orthogonality constraints. Multiplying the ﬁrst equation from the left by X (cid:62) , itfollows that Λ reads the closed-form expression at any ﬁrst-order stationary point,(3.4) Λ = X (cid:62) ∇ E KS ( X ) = X (cid:62) H ( X ) X. The infeasible method pro-posed in [12] has been proven to be efﬁcient for solving the lager scale orthogonality con-strained optimization problems. Brieﬂy, the iterates are not required to be orthogonal. Mean-while, the feasibility violation gradually decreases to zero until the method converges. Thistype of methods enables us to get rid of the unscalable computation for preserving constraints.In addition, it provides an opportunity to employ the multi-core machines and thus gain morescalability from parallel computing.The algorithm in [12] originally aims to solve the problem with orthogonality constraints( X (cid:62) X = I ), and in this subsection, we extend it to the general case ( X (cid:62) BX = I ) whichis not a trivial task. The skeleton of this algorithm is based on the augmented Lagrangianmethod (ALM) [26]. Let X k be the current iterate, the classical ALM has two major steps ineach iteration:1) Update the Lagrangian multipliers Λ k ;2) Minimize the ALM subproblem to obtain X k +1 ,(3.5) min X ∈ R n × p L β ( X, Λ k ) := E KS ( X ) − (cid:10) Λ k , X (cid:62) BX − I p (cid:11) + β (cid:13)(cid:13) X (cid:62) BX − I p (cid:13)(cid:13) , where L β ( X, Λ k ) deﬁnes the augmented Lagrangian function of problem (2.9) and β > is the penalty parameter.This framework avoids being confronted with the generalized orthogonality constraints. Next,we discuss how to update these two steps efﬁciently.For step 1), in view of the fact (3.4), we suggest the following update of Lagrangianmultipliers(3.6) Λ k = X k (cid:62) H ( X k ) X k . Due to the symmetry of the Hamiltonian H ( X k ) , the above update provides symmetric multi-pliers Λ k , which allows us to waive the symmetrization step, sym( X k (cid:62) H ( X k ) X k ) , in [12].On the other side, the ALM subproblem in step 2) is an unconstrained optimization prob-lem, and various methods can be applied to derive different updates. Instead of solving thesubproblem to a certain preset precision, our strategy is to provide an approximate solutionby an explicit formulation. We ﬁrst introduce a proximal linearized approximation [5] to sub-stitute the augmented Lagrangian function in (3.5). Speciﬁcally, we consider the subproblem(3.7) min X ∈ R n × p (cid:10) ∇ X L β ( X k , Λ k ) , X − X k (cid:11) + η k (cid:13)(cid:13) X − X k (cid:13)(cid:13) . The parameter η k measures the dominance of the proximal term. The solution of this qua-dratic subproblem reads an explicit form X k +1 = X k − η k ∇ X L β ( X k , Λ k )= X k − η k (cid:16) H ( X k ) X k − BX k X k (cid:62) H ( X k ) X k + βBX k ( X k (cid:62) BX k − I p ) (cid:17) , (3.8) B. GAO, G. HU, Y. KUANG, AND X. LIU where the last step is owing to the update formula (3.6). It implies that this modiﬁed ALMupdate is nothing but a vanilla gradient-descent step and /η k speciﬁes the stepsize.Now we turn back to the solving part in Figure 1. By using the one step gradient-descentupdate (3.8) in the main iteration, we fulﬁll a solving part for KSDFT. The complete algorithmis described in Algorithm 1. Algorithm 1:

Proximal Linearized Augmented Lagrangian Algorithm (PLAM) Input: discretization with n ∈ N and B ∈ SR n × n ; tolerance (cid:15) > ; initial guess X ∈ R n × p ; Set k := 0 . while (cid:13)(cid:13)(cid:13) ( I n − BX k X k (cid:62) ) H ( X k ) X k (cid:13)(cid:13)(cid:13) F + (cid:13)(cid:13)(cid:13) X k (cid:62) BX k − I (cid:13)(cid:13)(cid:13) F > (cid:15) do Compute the Hamiltonian H ( X k ) by (2.8). Update the variable X k +1 by (3.8). Update the parameters η k and β ; Set k := k + 1 . Output: X k .Once the pre-processing and discretization are ﬁnished, the number of degrees of free-dom n and the matrix B are ﬁxed. Meanwhile, the initial guess X can be generated byany popular strategy in KSDFT. In view of the condition (3.2), we notice that Line 2 (thestopping criteria) in Algorithm 1 is sufﬁcient to check the ﬁrst-order optimality. Line 3-5are the main iterations in Figure 1. Indeed, those calculations in KSDFT can be well assem-bled in a parallel way. The gradient-descent update in Line 4 is the BLAS3 operation. Thechoices of parameters will be discussed in subsection 4.4. To sum up, the algorithm PLAMcan be conveniently implemented since there is no matrix decomposition or eigen-solver. Itcompletely consists of BLAS operations. Therefore, the algorithm PLAM is open to be par-allelized. Note that SCF method can also be described by the framework Figure 1, and theonly distinction between SCF and PLAM is the main iteration. Speciﬁcally, SCF replacesLine 3-5 with solving a linear eigenvalue problem from (2.7). By contrast, PLAM just carriesout a one step gradient-descent update. The global convergence of the plain PLAM for orthogo-nality constraints ( X (cid:62) X = I ) has been studied in [12]. Next, we consider the generalizedcase, i.e., X (cid:62) BX = I . It can be proved that the existing results are still applicable forAlgorithm 1.A natural idea to investigate the generalized orthogonality constraints is transforming itinto the standard case. Since B is symmetric positive deﬁnite, there exists a symmetric posi-tive deﬁnite matrix G ∈ R n × n satisfying B = G . By taking Y = GX , the problem (2.9) isequivalent to(3.9) min Y ∈ R n × p g ( Y ) := E KS ( G − Y ) s.t. Y (cid:62) Y = I p . Thus the augmented Lagrangian function of (3.9) is deﬁned as ˜ L β ( Y, ˜Λ) = g ( Y ) − (cid:104) ˜Λ , Y (cid:62) Y − I p (cid:105) + β || Y (cid:62) Y − I p || . The next lemma shows that the transform Y = GX does not change the stationary pointsof problems. RTHOGONALIZATION-FREE FRAMEWORK FOR KSDFT EMMA (i) X ∗ is a ﬁrst-order stationary point of the problem (2.9) if and only if Y ∗ = GX ∗ is also a ﬁrst-order stationary point of the problem (3.9) .(ii) X ∗ is a ﬁrst-order stationary point of the ALM subprblem min X ∈ R n × p L β ( X, Λ ∗ ) with Λ ∗ = sym( ∇ E KS ( X ∗ ) (cid:62) X ∗ ) if and only if Y ∗ = GX ∗ is also a ﬁrst-order stationarypoint of the ALM subprblem min Y ∈ R n × p ˜ L β ( Y, ˜Λ ∗ ) with ˜Λ ∗ = sym( ∇ g ( Y ∗ ) (cid:62) Y ∗ ) .Proof. (i) Let Y ∗ = GX ∗ , it can be veriﬁed that ( I n − Y ∗ Y ∗(cid:62) ) ∇ g ( Y ∗ ) = G − ( I n − BX ∗ X ∗(cid:62) ) ∇ E KS ( X ∗ ) , ∇ g ( Y ∗ ) (cid:62) Y ∗ = ∇ E KS ( X ∗ ) (cid:62) X ∗ ,Y ∗(cid:62) Y ∗ − I p = X ∗(cid:62) BX ∗ − I p . Together with (3.2), we can conclude that problems (2.9) and (3.9) share the same ﬁrst-orderstationary points.(ii) Let Y ∗ = GX ∗ . Similarly, it can be veriﬁed that ˜Λ ∗ = sym( ∇ g ( Y ∗ ) (cid:62) Y ∗ ) = sym( ∇ E KS ( X ∗ ) (cid:62) X ∗ ) = Λ ∗ , ∇ Y ˜ L β ( Y ∗ , ˜Λ ∗ ) = G − ∇ X L β ( X ∗ , Λ ∗ ) . These equalities lead to the desired equivalence.In view of Lemma 3.2 and let Y = GX , the algorithm for problem (2.9) can be translatedinto an adaptation for (3.9). Next, we consider using PLAM to solve the problem (3.9). Recallthat there are two major steps in the construction of PLAM:1) For the multiplier update, we continue with the explicit update (3.6), i.e., ˜Λ k = sym( ∇ g ( Y k ) (cid:62) Y k ) .

2) We construct the subproblem with respect to Y ,(3.10) min Y ∈ R n × p (cid:68) ∇ Y ˜ L β ( Y, ˜Λ k ) , Y − Y k (cid:69) B + η k (cid:13)(cid:13) Y − Y k (cid:13)(cid:13) . where the inner product is deﬁned as (cid:10) Y, ¯ Y (cid:11) B := tr( Y (cid:62) B ¯ Y ) .Indeed, this subproblem has the closed-form solution Y k +1 = Y k − η k B ∇ Y ˜ L β ( Y, ˜Λ k )= Y k − η k B (cid:16) ∇ g ( Y k ) − Y k Ψ ( ∇ g ( Y k ) (cid:62) Y k ) + βY k ( Y k (cid:62) Y k − I p ) (cid:17) . (3.11)Using Y = GX and the expression of ˜Λ k , it follows that the X -update (3.8) can be exactly re-covered from (3.11). In other words, the algorithm PLAM for the X -problem (2.9) is provedto be equivalent to its adaptation for the Y -problem (3.9). Whereas the proximal linearizedapproximation in (3.10) differs from what we used in [12], the sketch of the convergenceanalysis is nearly the same. Therefore, the convergence results for PLAM can be accordinglymigrated from [12].Finally, we present the global convergence of PLAM without proofs. Interested readersare referred to [12] for a comprehensive understanding, such as the worst case complexityand local convergence rate.A SSUMPTION E KS ( X ) is twice differentiable. B. GAO, G. HU, Y. KUANG, AND X. LIU A SSUMPTION

For a given X ∈ R n × p , we say it is a qualiﬁed initial guess, if thereexists σ ∈ (0 , such that σ min ( X ) ≥ σ, < || X (cid:62) BX − I p || F ≤ − σ . T HEOREM

Let { X k } be the iterate sequence generated by Algorithm 1 initializedfrom X satisfying Assumption and Assumption . Suppose that the parameters β and η k ( k = 1 , . . . ) are sufﬁciently large, and in particular, the sequence { η k } is upperbounded. Then the sequence { X k } has at least one cluster point, and any which is a ﬁrst-order stationary point of problem (2.9) . According to the numerical reports in [12], theplain PLAM performs well in most problems, whereas its behavior is sensitive to the pa-rameters β and η k . In practice, it is always troublesome to tune these parameters as PLAMperforms identically on different problems. Even worse, we cannot guarantee the bounded-ness of iterate sequences without restrictions on parameters.Consequently, [12] suggests a column-wise block minimization for PLAM to overcomethese limitations. In light of its motivation, we similarly impose the redundant column-wiseconstraints on the subproblem (3.7), and obtain the following subproblem.(3.12) min X ∈ R n × p (cid:10) ∇ X L β ( X k , Λ k ) , X − X k (cid:11) + η k (cid:13)(cid:13) X − X k (cid:13)(cid:13) , s . t . Diag( X (cid:62) BX ) = I. Notice that the subproblem (3.12) is column-wisely separable. Thus, for the i -th column( i = 1 , . . . , p ), we can construct a subproblem with an extra constraint as follows,(3.13) min x ∈ R n ∇ X i L β ( X k , Λ k ) (cid:62) ( x − X ki ) + η k || x − X ki || , s . t . x (cid:62) Bx = 1 , where X i denotes the i -th column of X . The redundant constraint is for restricting the iteratesequence to a compact set and hence make it bounded. The subproblem (3.13) has the closed-form solution(3.14) X k +1 i = X ki − η k ∇ X i L β ( X k , Λ k ) (cid:13)(cid:13)(cid:13) X ki − η k ∇ X i L β ( X k , Λ k ) (cid:13)(cid:13)(cid:13) B , where (cid:107) x (cid:107) B := √ x (cid:62) Bx is a norm for any symmetric positive deﬁnite matrix B . Ac-cordingly, the Lagrangian multipliers of X k can be developed based on the new subprob-lem (3.12). In view of these formulations, an upgraded version of PLAM is listed in Algo-rithm 2 called PCAL.Note that the update (3.15) for Lagrangian multipliers in PCAL is different from (3.6)in PLAM. When the redundant constraints, (cid:107) X i (cid:107) B = 1 ( i = 1 , . . . , p ), are imposed, thecorresponding optimality condition changes simultaneously. Speciﬁcally, the problem (2.9)with redundant constraints has the ﬁrst-order optimality condition as follows,(3.16) (cid:26) ∇ E KS ( X ) = BX Λ +

BXD,X (cid:62) BX = I p . The matrix D ∈ R p × p is diagonal and denotes the multipliers for extra constraints. Followinga similar derivation of (3.3), it can be veriﬁed that Λ in (3.16) achieves the closed-formexpression (3.15) at any ﬁrst-order stationary point. Notice that the main calculation of PCAL RTHOGONALIZATION-FREE FRAMEWORK FOR KSDFT Algorithm 2:

Parallelizable Column-wise Block Minimization for PLAM (PCAL) Input: triangulation with n ∈ N and B ∈ SR n × n ; tolerance (cid:15) > ; initial guess X ∈ S Bn,p ; Set k := 0 . while (cid:13)(cid:13)(cid:13) ( I n − BX k X k (cid:62) ) H ( X k ) X k (cid:13)(cid:13)(cid:13) F + (cid:13)(cid:13)(cid:13) X k (cid:62) BX k − I (cid:13)(cid:13)(cid:13) F > (cid:15) do Compute the Hamiltonian H ( X k ) by (2.8). Compute the Lagrangian multipliers by(3.15) Λ k := X k (cid:62) H ( X k ) (cid:62) X k + Θ (cid:16) X k (cid:62) ∇ X L β ( X k , X k (cid:62) H ( X k ) (cid:62) X k ) (cid:17) . for i = 1 , . . . , p do Update X k +1 i by (3.14). Update X k +1 = [ X k +11 , . . . , X k +1 p ] . Update the parameters η k and β ; Set k := k + 1 . Output: X k .is a sequence of gradient-descent step with normalization. These for-loop computations areindependent and hence can be executed in a parallel fashion. To sum up, the upgraded versionof PLAM still enjoys the beneﬁt of parallel computing.In scientiﬁc computing, preconditioning is typically used to accelerate iterative algo-rithms. In [3], a preconditioner for the eigenvalue problem of SCF iteration has been pro-posed. It has the form of T = L − λB , where L is the discretized kinetic operatordeﬁned in (2.9) and λ is an approximated eigenvalue. Since L dominates the Hamiltonian,this preconditioner usually performs well in practical calculations. In view of the optimalitycondition (3.3), the update of Lagrangian multipliers (3.6) in PLAM can be viewed as theapproximation of the eigenvalues. Thus, we choose Λ kii = (cid:16) X k (cid:62) H ( X k ) X k (cid:17) ii to constructa preconditioner for the proposed algorithm:(3.17) T k ( i ) = (cid:26) L − Λ kii B, if Λ kii < ,I, otherwise , for i = 1 , . . . , p. Consequently, the one step gradient-descent update (3.8) in PLAM is preconditioned as X k +1 i = X ki − η k (cid:16) T k ( i ) (cid:17) − ∇ X i L β ( X k , Λ k ) , for i = 1 , . . . , p, where the preconditioned gradient can be assembled by solving p linear systems. Notethat PCAL is compatible with this type of preconditioning providing that Λ k is selectedfrom (3.15). The parallelizable structure of PLAM and PCAL is still maintained as the pre-conditioning is conducted column-wisely. A test in Figure 2 veriﬁes the effectiveness ofthe preconditioner (3.17) for both algorithms, where substationarity is computed by Line 2in Algorithm 1.

4. Implementation Details.

In this section, we introduce the implementation details ofthe framework (Figure 1) in solving the ground state. The quantum systems examined in thispaper are introduced. In addition, several numerical issues in the simulations are discussed.In view of Figure 2, we observe that PCAL behaves more efﬁcient and robust than PLAM,and thus we focus on PCAL in the following tests.2

B. GAO, G. HU, Y. KUANG, AND X. LIU(a) PLAM for He, β = 15 (b) PCAL for He, β = 1 F IG . 2. The performance of the preconditioner (3.17) for a helium (He) atom example with n = 1606 , p = 1 . All the simulations are performed on a workstation with two Intel(R) Xeon(R) ProcessorsSilver 4110 (at 2.10GHz × , 12M Cache) and 384GB of RAM, and the total number of coresis 16. The software is the C++ library AFEABIC [3] under Ubuntu 18.10. A number of atom and molecules are simulated to illustrate theeffectiveness and high scalability of the presented algorithm. It is noted that in practicalsimulations, p is regarded as the number of orbitals and each orbital is occupied by twoelectrons. The scale of testing systems, i.e., p , is ranging from 1 to 1152. In the formulationof problem (1.2), the exchange-correlation potential V xc and exchange-correlation potentialenergy (cid:15) xc per particle are obtained from the package Libxc [25]. The model equations forthe various numerical examples are only different in the external potential term V ext andprecisely in the charge numbers and positions of the nuclei. The charge of a certain nucleusused in numerical experiments is listed in Table 1. The nuclei positions for small moleculesare obtained from the calculated geometry part in CCCBDB [16] and for carbon nanotubesare from [10]. In summary, the following electronic structures He ( ), LiH (2), CH (5),H O (5), BF (16), C H (21), C H N (48), C (180), and carbon nanotubes C (288),C (576) and C (1152) are tested, where the number in the bracket stands for the numberof orbitals p in the associated system. T ABLE Charge number Z j of the nucleus. H He Li B C N O F Z j In practice, we evaluate the values for substationarity, feasibility violation and the totalenergy of each example during the simulations. Speciﬁcally, kkt = (cid:107) H ( X ) X − BX Λ (cid:107) F , f ea = (cid:13)(cid:13) X (cid:62) BX − I (cid:13)(cid:13) F , and the total energy E KS is computed from (2.4). When the sum-mation of kkt and f ea is small enough, i.e., the following stopping criterion kkt + f eakkt < tol is satisﬁed, we terminate the algorithm. Here, kkt is the initial substationarity and tol de-notes the tolerance and is chosen to be . × − in our simulations. RTHOGONALIZATION-FREE FRAMEWORK FOR KSDFT Once wedetermine the computational domain, a space discretization is generated for the ground statecalculation. To resolve the singularities in the external potential term, a non-uniform meshfor the partition of the computational domain is introduced to obtain high accuracy with leasteffort. Speciﬁcally, a global mesh size function based on the external potential is adopted togenerate the nonuniform mesh [20]. Within the linear ﬁnite element framework, to capturethe /r decay in the external potential, the mesh size function locally behaves as r / forsmall r can be derived, where r represents the distance to the nucleus. Then we can constructthe mesh size function h ( r ) at the discretized point r as in [20]:(4.1) h ( r ) = min (cid:110) γ Z − r , · · · , γ Z − M r M , γ (cid:111) , where r j = | r − R j | represents the distance to the j -th nucleus, γ controls the resolutionof the mesh, and γ is the largest allowed mesh size. Note that (4.1) implies that the closerto the nucleus, the smaller the mesh size, i.e., the denser the mesh grid, which is as desired.Moreover, the distribution of the mesh grid around the nucleus with larger charge is alsodenser than that around the nucleus with a smaller charge. This can be veriﬁed from Figure 3which shows an radial mesh example for the water molecule (H O). F IG . 3. Left: the three dimensional mesh for molecule H O using the mesh size function (4.1) with γ =0 . , γ = 8 . Middle: the mesh around the oxygen nucleus ( − . , , in X-Y plane [ − . , − . × [0 , ,on which the element shapes are kept. Right: the mesh around the hydrogen nucleus (0 . , . , in X-Y plane [0 . , . × [1 . , . . Generated by the software Gmsh v3.0.6 [14]. In the following comparison, we choose a same randomly generated initial guess, X ∈ R n × p satisfying X (cid:62) BX = I , for different methods. Given a random matrix V ∈ R n × p from the pseudo-random number generator, X is generated by the Cholesky-based Gram–Schmidt technique [13], i.e., V = X R , where R ∈ R p × p is an upper triangular matrix.However, it is known that the SCF method may suffer a lot from divergence. For the sakeof fairness, we stabilize SCF by improving the random initial guess with the imaginary timepropagation (ITP) method [20] only when it diverges. By contrast, the numerical experimentsin section 5 show that our algorithm behaves robust regardless of different initial guesses. In view of the presentedinfeasible methods, it is sufﬁcient to output results such that X satisﬁes the orthogonalityconstraint X (cid:62) BX = I . However, if we want to extract the desired wavefunctions fromthe eigenvectors of the generalized eigenvalue problem (2.7) or the other physical quantitiesbased on the eigenvalues, we need to introduce a post-processing. This is due to the fact that X only provides an orthogonal basis of the desired eigenspace rather than the eigenvectors.This can be implemented by solving a small p × p eigenvalue problem, ( X (cid:62) HX ) ˜ X = λ ˜ X , with the Rayleigh-Ritz procedure to get the eigenvalues λ i , i = 1 , . . . , p and updating4 B. GAO, G. HU, Y. KUANG, AND X. LIU X as X = X ˜ X to get the wavefunctions. Note that this procedure is called only for oncein the algorithm and it is of size p × p . Consequently, its computational cost can be ignoredcompared to solving the optimization problem.To verify the effectiveness of the post-procedure, we compute the eigenvalues of theKohn–Sham equation of CH system on the radial mesh with n = 100127 , p = 5 for SCFand PCAL. The computational domain for this example is set as [ − , , and the results arelisted in Table 2. When the post-procedure is imposed, the eigenvalues from PCAL are wellordered and agree with eigenvalues from SCF. Moreover, it veriﬁes that the post-proceduredoes not affect the energy value. In the practical simulations, the post-procedure step will beimposed as the ﬁnal step of PCAL. T ABLE Eigenvalue and energy evaluations for example CH . λ λ λ λ λ E KS SCF -9.75599 -0.66451 -0.38832 -0.38830 -0.38825 -40.24109PCAL + post-processing -9.75599 -0.66451 -0.38832 -0.38830 -0.38825 -40.24109

There are two major parameters in the algorithm PCAL.In view of Figure 2, the penalty parameter β = 1 works well for PCAL, and hence is setas the default value of β in PCAL. Next, we investigate the proximal parameter η k , whosereciprocal is the stepsize for the gradient-descent step in Algorithm 2. As suggested in [12],the Barzilai–Borwein (BB) strategy [4] is an efﬁcient way to produce the stepsize, η BB1 k := (cid:12)(cid:12)(cid:10) S k − , Y k − (cid:11)(cid:12)(cid:12) (cid:104) S k − , S k − (cid:105) , or η BB2 k := (cid:10) Y k − , Y k − (cid:11) |(cid:104) S k − , Y k − (cid:105)| , where S k = X k − X k − , Y k = ∇ X L β ( X k , Λ k ) − ∇ X L β ( X k − , Λ k − ) . It has othervariations such as the Alternating BB strategy [9], η ABB1 k := (cid:26) η BB1 k , for odd k,η BB2 k , for even k, or η ABB2 k := (cid:26) η BB2 k , for odd k,η BB1 k , for even k. We test PCAL with four choices of the parameter η k on several testing problems. The numberof iterations to achieve convergence is recorded in Table 3. The notation “-” represents thatthe stopping criterion has not reached after iterations. This table reveals that PCAL with η BB2 k behaves robust and has the best performance on number of iterations. As a result, wechoose η BB2 k as our default proximal parameter in the practical simulations. T ABLE Number of iterations with different proximal parameters.

He LiH CH H O C H BB1 409 - - - -BB2 46 54 75 60 144ABB1 86 90 129 180 291ABB2 75 74 90 119 256

5. Numerical Examples.

In this section, we numerically investigate the performanceand parallel efﬁciency of the algorithm PCAL in all-electron calculations under the presentedframework.

RTHOGONALIZATION-FREE FRAMEWORK FOR KSDFT ρ ( k +1) = α ˜ ρ ( k +1) + (1 − α ) ρ ( k ) where ˜ ρ ( k +1) is the electron density obtained from solv-ing the k -th step eigenvalue problem and α is the mixing parameter. In both SCF methodand MOptQR, the orthogonalization process is implemented by the Cholesky-based Gram-Schmidt technique [13], which is shown to be more efﬁcient than commonly-used Gram-Schmidt procedures.In the serial setting, the leading order of computational costs is O ( np ) among thesemethods. The reason is that BLAS3 operations, such as X (cid:62) ( BX ) , dominate the computing.While the function evaluation does not have a crucial impact on the cost due to the sparsity ofdiscretized Hamiltonian H and mass matrix B . In the parallel setting, the total computationcost is divided into parallel and non-parallel parts. The above-mentioned leading cost O ( np ) (BLAS3) belongs to the parallel part. However, the orthogonalization process in SCF andMOptQR whose complexity is O ( p ) cannot be efﬁciently parallelized. When p is large, thiscost is unaffordable in all-electron calculations. Conversely, PCAL is orthogonalization-freeand completely consists of BLAS3 operations, and thus beneﬁts a lot from parallel computing.These claims can be veriﬁed in the following experiments. In this subsection, we test PCAL with SCF and MOp-tQR in all electron calculations of a list of atom and molecules under serial setting. For all thesystems, the computational domain is set to be [ − , . The mesh size function (4.1) isapplied to generate the nonuniform mesh for each example. Note that the parameters in (4.1)are chosen as γ = 0 . , γ = 8 for C H and C H N , and γ = 0 . , γ = 8 for theothers. The preconditioner (3.17) is used in all the methods. For the system C H N , theinitial guess is generated by ITP (see subsection 4.2), and we choose the mixing parameter α = 0 . for SCF to make it converge. For the other systems, we choose a random initialguess and the mixing parameter α = 0 . .The detailed numerical results are listed in Table 4, Figure 4 and Figure 5. We observefrom Table 4 that: 1) the total energy E KS obtained by PCAL agrees with SCF and MOp-tQR; 2) PCAL behaves more efﬁcient than SCF and MOptQR in terms of the running time“CPU(s)”; 3) the number of iterations “ N iter ” in PCAL is less than MOptQR. Note that theiteration numbers of SCF are always the smallest but conversely the CPU time. This is dueto that the inner iterations, i.e., solving the linear eigenvalue problem, are required in eachSCF iteration. The efﬁciency of PCAL can be also observed in Figure 4 for the exampleC H , from which we ﬁnd that PCAL takes the least CPU time to converge at a given ac-curacy. In addition, the convergence results for PCAL are demonstrated in Figure 5. Theﬁrst column displays the isosurface of the electron density, the last three columns presentthe convergence history of energy, substationarity and feasibility violation, respectively. Weobserve that the feasibility violation of PCAL gradually decreases until it converges. Notethat the post-processing is not shown in this ﬁgure. In the He example, the feasibility viola-tion is close to the machine accuracy since the normalization procedure is equivalent to theorthogonalization procedure in the case of p = 1 . In this subsection, we investigate the parallel efﬁciency of PCAL. Weﬁrst test all the algorithms on a single core and record the computational proportions of paral-6

B. GAO, G. HU, Y. KUANG, AND X. LIUT

ABLE The results in Kohn–Sham total energy minimization

Solver E KS kkt N iter fea CPU(s) E KS kkt N iter fea CPU(s)He, n = 34481 , p = 1 LiH, n = 63725 , p = 2 SCF -2.86809 9.58 −

36 2.88 −

127 -7.98190 2.15 −

39 2.09 − −

36 3.55 −

82 -7.98190 1.36 −

70 2.70 − −

46 1.62 −

99 -7.98190 1.25 −

54 2.24 − , n = 141189 , p = 5 H O, n = 149616 , p = 5 SCF -40.23775 4.83 −

39 6.09 − −

44 3.64 − −

93 2.67 − −

74 2.45 − −

75 1.59 − −

60 4.82 − H , n = 241939 , p = 21 C H N , n = 522149 , p = 48 SCF -231.05824 5.11 −

41 1.73 − −

94 1.64 − −

269 5.14 − −

501 1.63 − −

144 7.35 − −

148 2.29 − (a) Energy value (b) SubstationarityF IG . 4. A comparison with different solvers for example C H . lel and non-parallel part in the total cost. We ﬁx n to be around and choose different p , namely, the different molecules. By adjusting the parameters in subsection 4.2, we controlthe number of mesh grids n being as close as possible to . The testing examples areBF (16), C H N (48), C (180), C (288), and C (576). The numerical results aredisplayed in Figure 6. We observe that the parallel part of PCAL will dominate the total costwhen p becomes large. It is even higher than When p ≥ . This means that PCALis suitable for parallel computing, especially when the scale of a system is very large. Mean-while, it can be found that the non-parallel part of SCF and MOptQR becomes large when p increases. The main reason is the cubic complexity O ( p ) of orthogonalization process,which cannot be parallelized.We next examine the scalability of PCAL in the parallel setting. The testing molecule isC which has occupied orbitals. The number of mesh grids n is set to be . Werun the code on different numbers of cores { , , , } . The corresponding speedup factoris deﬁned as speedup-factor ( m ) = wall-clock time for 4-core runwall-clock time for a m -core run . The results are presented in Figure 7, from which we observe that the speedup factor ofPCAL is close to the ideal one, and it achieves . for cores. However, MOptQR has RTHOGONALIZATION-FREE FRAMEWORK FOR KSDFT F IG . 5. Convergence history of PCAL for He, LiH, CH , H O, C H , C H N (from top to bottom). Theleft column displays the isosurface of each molecule. x -axis for the right 3 columns stands for the iteration step. B. GAO, G. HU, Y. KUANG, AND X. LIU(a) Parallel (b) Non-parallelF IG . 6. Parallel proportion versus number of orbitals p . the low scalability and its speedup factor increases slowly. Note that the results of SCF arenot recorded since the divergent phenomenon is observed. In view of Figure 6, even if wehave the convergent results of SCF, it can be justiﬁably expected that the speedup factor ofSCF will be smaller than that of MOptQR. In summary, the orthogonalization-free algorithmPCAL shows higher scalability and great potential than SCF and MOptQR. (a) Structure(b) Isosurface (c) Speedup factorF IG . 7. Example C with n = 380233 , p = 1152 .

6. Conclusion.

Based on the ﬁnite element method and PCAL algorithm, a scalableapproach is proposed in this paper for the ground state solution of a given quantum system.To resolve the singularity introduced from the all-electron model, a radial mesh is generatedaccording to the structure of the system, then the optimization problem is discretized in theassociated ﬁnite element space. To avoid the efﬁciency bottleneck for large scale systems, i.e.,the orthogonalization of those orbitals, the original PCAL method is extended and appliedin this paper for solving the discretized optimization problem. A novel preconditioner isdesigned in the extended PCAL method, which generally accelerates the convergence in thesimulations.

RTHOGONALIZATION-FREE FRAMEWORK FOR KSDFT h -adaptive mesh method will be introduced fordynamically adjusting the ﬁnite element space according to the obtained numerical solutions.Furthermore, the preconditioner introduced in the PCAL method deserves more investigationin the following study, which has a chance to effectively accelerate the convergence of thenumerical method towards the ground state. The improved method will be used for the nu-merical simulations of the Born-Oppenheimer molecular dynamics, to show the potential onthe practical applications. The results will be reported in the forthcoming paper. REFERENCES[1] P.-A. A

BSIL , R. M

AHONY , AND

R. S

EPULCHRE , Optimization algorithms on matrix manifolds , PrincetonUniversity Press, 2008, https://doi.org/10.1515/9781400830244.[2] R. A

HLRICHS , M. B ¨ AR , M. H ¨ ASER , H. H

ORN , AND

C. K ¨

OLMEL , Electronic structure calculations onworkstation computers: The program system turbomole , Chem. Phys. Lett., 162 (1989), pp. 165–169,https://doi.org/10.1016/0009-2614(89)85118-8.[3] G. B AO , G. H U , AND

D. L IU , An h -adaptive ﬁnite element solver for the calculations of the electronicstructures , J. Comput. Phys., 231 (2012), pp. 4967–4979, https://doi.org/10.1016/j.jcp.2012.04.002.[4] J. B ARZILAI AND

J. M. B

ORWEIN , Two-point step size gradient methods , IMA J. Numer. Anal., 8 (1988),pp. 141–148, https://doi.org/10.1093/imanum/8.1.141.[5] J. B

OLTE , S. S

ABACH , AND

M. T

EBOULLE , Proximal alternating linearized minimization for noncon-vex and nonsmooth problems , Math. Program., 146 (2014), pp. 459–494, https://doi.org/10.1007/s10107-013-0701-9.[6] D. R. B

OWLER AND

T. M

IYAZAKI , O ( N ) methods in electronic structure calculations , Rep. Prog. Phys.,75 (2012), p. 036503, https://doi.org/10.1088/0034-4885/75/3/036503.[7] H. C HEN , X. D AI , X. G ONG , L. H E , AND

A. Z

HOU , Adaptive ﬁnite element approximations for Kohn–Shammodels , Multiscale Model. Sim., 12 (2014), pp. 1828–1869, https://doi.org/10.1137/130916096.[8] O. C

OHEN , L. K

RONIK , AND

A. B

RANDT , Locally reﬁned multigrid solution of the all-electron Kohn–Shamequation , J. Chem. Theory Comput., 9 (2013), pp. 4744–4760, https://doi.org/10.1021/ct400479u.[9] Y.-H. D

AI AND

R. F

LETCHER , Projected Barzilai-Borwein methods for large-scale box-constrained qua-dratic programming , Numer. Math., 100 (2005), pp. 21–47, https://doi.org/10.1007/s00211-004-0569-y.[10] J. T. F

REY AND

D. J. D

OREN , TubeGen 3.4 web-interface , (2011), http://turin.nss.udel.edu/research/tubegenonline.html (accessed 2019-11-01).[11] B. G AO , X. L IU , X. C HEN , AND

Y.- X . Y UAN , A new ﬁrst-order algorithmic framework for optimizationproblems with orthogonality constraints , SIAM J. Optim., 28 (2018), pp. 302–332, https://doi.org/10.1137/16M1098759.[12] B. G AO , X. L IU , AND

Y.- X . Y UAN , Parallelizable algorithms for optimization problems with orthogonalityconstraints , SIAM J. Sci. Comput., 41 (2019), pp. A1949–A1983, https://doi.org/10.1137/18M1221679.[13] L. G

ENOVESE , A. N

EELOV , S. G

OEDECKER , T. D

EUTSCH , S. A. G

HASEMI , A. W

ILLAND , D. C

ALISTE ,O. Z

ILBERBERG , M. R

AYSON , A. B

ERGMAN , ET AL ., Daubechies wavelets as a basis set for densityfunctional pseudopotential calculations , J. Chem. Phys., 129 (2008), p. 014109, https://doi.org/10.1063/1.2949547.[14] C. G

EUZAINE AND

J.-F. R

EMACLE , Gmsh: A 3-D ﬁnite element mesh generator with built-in pre-and post-processing facilities , Int. J. Numer. Meth. Eng., 79 (2009), pp. 1309–1331, https://doi.org/10.1002/nme.2579.[15] W. H

UANG AND

R. D. R

USSELL , Adaptive moving mesh methods , vol. 174, Springer Science & BusinessMedia, 2010, https://doi.org/10.1007/978-1-4419-7916-2.[16] R. D. J. III,

NIST Computational Chemistry Comparison and Benchmark Database , (2006), http://cccbdb.nist.gov/ (accessed 2019-09-10). B. GAO, G. HU, Y. KUANG, AND X. LIU[17] G. P. K

ERKER , Efﬁcient iteration scheme for self-consistent pseudopotential calculations , Phys. Rev. B, 23(1981), p. 3082, https://doi.org/10.1103/PhysRevB.23.3082.[18] A. V. K

NYAZEV , Toward the optimal preconditioned eigensolver: Locally optimal block preconditionedconjugate gradient method , SIAM J. Sci. Comput., 23 (2001), pp. 517–541, https://doi.org/10.1137/S1064827500366124.[19] W. K

OHN AND

L. J. S

HAM , Self-consistent equations including exchange and correlation effects , Phys. Rev.,140 (1965), p. A1133, https://doi.org/10.1103/PhysRev.140.A1133.[20] Y. K

UANG AND

G. H U , On stabilizing and accelerating SCF using ITP in solving Kohn–Sham equation ,Commun. Comput. Phys., accepted.[21] P. R. L

EVASHOV , G. V. S IN ’ KO , N. A. S MIRNOV , D. V. M

INAKOV , O. P. S

HEMYAKIN , AND

K. V.K

HISHCHENKO , Pseudopotential and full-electron DFT calculations of thermodynamic properties ofelectrons in metals and semiempirical equations of state , J. Phys. Condens. Matter, 22 (2010), p. 505501,https://doi.org/10.1088/0953-8984/22/50/505501.[22] L. L IN , J. L U , AND

L. Y

ING , Numerical methods for Kohn–Sham density functional theory , Acta Numer.,28 (2019), pp. 405–539, https://doi.org/10.1017/S0962492919000047.[23] X. L IU , X. W ANG , Z. W EN , AND

Y. Y

UAN , On the convergence of the self-consistent ﬁeld iteration inKohn–Sham density functional theory , SIAM J. Matrix Anal. Appl., 35 (2014), pp. 546–558, https://doi.org/10.1137/130911032.[24] X. L IU , Z. W EN , X. W ANG , M. U

LBRICH , AND

Y. Y

UAN , On the analysis of the discretized Kohn–Shamdensity functional theory , SIAM J. Numer. Anal., 53 (2015), pp. 1758–1785, https://doi.org/10.1137/140957962.[25] M. A. L. M

ARQUES , M. J. T. O

LIVEIRA , AND

T. B

URNUS , Libxc: A library of exchange and correlationfunctionals for density functional theory , Comput. Phys. Commun., 183 (2012), pp. 2272–2281, https://doi.org/10.1016/j.cpc.2012.05.007.[26] J. N

OCEDAL AND

S. J. W

RIGHT , Numerical optimization , Springer Science & Business Media, 2006, https://doi.org/10.1007/978-0-387-40065-5.[27] M. C. P

AYNE , M. P. T

ETER , D. C. A

LLAN , T. A

RIAS , AND A . J. J

OANNOPOULOS , Iterative minimizationtechniques for ab initio total-energy calculations: molecular dynamics and conjugate gradients , Rev.Mod. Phys., 64 (1992), p. 1045, https://doi.org/10.1103/RevModPhys.64.1045.[28] W. E. P

ICKETT , Pseudopotential methods in condensed matter applications , Comput. Phys. Rep., 9 (1989),pp. 115–197, https://doi.org/10.1016/0167-7977(89)90002-6.[29] P. S

URYANARAYANA , V. G

AVINI , T. B

LESGEN , K. B

HATTACHARYA , AND

M. O

RTIZ , Non-periodic ﬁnite-element formulation of Kohn–Sham density functional theory , J. Mech. Phys. Solids, 58 (2010), pp. 256–280, https://doi.org/10.1016/j.jmps.2009.10.002.[30] E. T

SUCHIDA AND

M. T

SUKADA , Adaptive ﬁnite-element method for electronic-structure calculations , Phys.Rev. B, 54 (1996), p. 7602, https://doi.org/10.1103/PhysRevB.54.7602.[31] H. Y. X

IAO , X. D. J

IANG , G. D

UAN , F. G AO , X. T. Z U , AND

W. J. W

EBER , First-principles calculations ofpressure-induced phase transformation in AlN and GaN , Comput. Mater. Sci., 48 (2010), pp. 768–772,https://doi.org/10.1016/j.commatsci.2010.03.028.[32] C. Y

ANG , J. C. M

EZA , AND

L.-W. W

ANG , A trust region direct constrained minimization algorithm forthe Kohn–Sham equation , SIAM J. Sci. Comput., 29 (2007), pp. 1854–1875, https://doi.org/10.1137/060661442.[33] X. Z

HANG , J. Z HU , Z. W EN , AND

A. Z