[PDF] Efficient Parallel Linear Scaling Method to get the Response Density Matrix in All-Electron Real-Space Density-Functional Perturbation Theory

Abstract

The real-space density-functional perturbation theory (DFPT) for the computations of the response properties with respect to the atomic displacement and homogeneous electric field perturbation has been recently developed and implemented into the all-electron, numeric atom-centered orbitals electronic structure package FHI-aims. It is found that the bottleneck for large scale applications is the computation of the response density matrix, which scales as O( N 3 ) . Here for the response properties with respect to the homogeneous electric field, we present an efficient parallel linear scaling algorithm for the response density matrix calculation. Our scheme is based on the second-order trace-correcting purification and the parallel sparse matrix-matrix multiplication algorithms. The new scheme reduces the formal scaling from O( N 3 ) to O(N) , and shows good parallel scalability over tens of thousands of cores. As demonstrated by extensive validation, we achieve a rapid computation of accurate polarizabilities using DFPT. Finally, the computational efficiency of this scheme has been illustrated by making the scaling tests and scalability tests on massively parallel computer systems.

Full PDF

EE ﬃ cient Parallel Linear Scaling Method to get the Response Density Matrix inAll-Electron Real-Space Density-Functional Perturbation Theory Honghui Shang a , Wanzhen Liang b , Yunquan Zhang a , Jinlong Yang c a State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100190, China b State Key Laboratory of Physical Chemistry of Solid Surfaces, Collaborative Innovation Center of Chemistry for Energy Materials, Fujian Provincial KeyLaboratory of Theoretical and Computational Chemistry, and Department of Chemistry, College of Chemistry and Chemical Engineering, Xiamen University,Xiamen, Fujian 361005, Peoples Republic of China c Hefei National Laboratory for Physical Sciences at Microscale, Department of Chemical Physics, and Synergetic Innovation Center of Quantum Information andQuantum Physics, University of Science and Technology of China, Hefei, Anhui 230026, China

Abstract

The real-space density-functional perturbation theory (DFPT) for the computations of the response properties with respect tothe atomic displacement and homogeneous electric ﬁeld perturbation has been recently developed and implemented into the all-electron, numeric atom-centered orbitals electronic structure package FHI-aims. It is found that the bottleneck for large scaleapplications is the computation of the response density matrix, which scales as O ( N ). Here for the response properties with respectto the homogeneous electric ﬁeld, we present an e ﬃ cient parallel linear scaling algorithm for the response density matrix calcu-lation. Our scheme is based on the second-order trace-correcting puriﬁcation and the parallel sparse matrix-matrix multiplicationalgorithms. The new scheme reduces the formal scaling from O ( N ) to O ( N ), and shows good parallel scalability over tens ofthousands of cores. As demonstrated by extensive validation, we achieve a rapid computation of accurate polarizabilities usingDFPT. Finally, the computational e ﬃ ciency of this scheme has been illustrated by making the scaling tests and scalability tests onmassively parallel computer systems.

1. Introduction

Density-functional theory (DFT) [1, 2] applied in chemistry,physics, and material science is the ground-state theory throughwhich one can calculate the total energy and its ﬁrst orderderivatives (e.g. dipole moment and force). The response prop-erties (e.g., polarizability, vibrational frequencies or phonondispersions) related to the second and higher order derivativesof the total energy can be obtained within the same frameworkby means of density-functional perturbation theory (DFPT) [3–5] or the so-called coupled perturbed self-consistent ﬁeld (CP-SCF) method[6–11] in the quantum chemistry community.Recently, we have developed and implemented a real-spaceformalism for DFPT [12] in the all-electron, full-potential, nu-merical atomic orbitals based Fritz Haber Institute ab initio molecular simulations (FHI-aims) package, which allows us totake advantage of the inherent locality of the basis set to achievea numerically favorable scaling. Such real-space DFPT hasbeen applied in lattice dynamics calculations[12] and in com-putations of the polarizabilities, dielectric constants, harmonicas well as anharmonic Raman spectra [13], in which good com-putational accuracy, computational e ﬃ ciency, and parallel scal-ability have been demonstrated.In our previous scaling test [12, 13], it was found that, forlarge systems with more than 1,000 atoms, the computationalcost for updating the response density matrix becomes domi-nant. It is because the dense matrix multiplication operationsin this step scale as O ( N ), and thereby, presents a serious bot- tleneck to deal with large systems. It is desirable to make thecomputational time scale linearly, i.e. O ( N ), with the size ofthe system [14]. In order to achieve this goal, the Kohns near-sightedness principle [15] need to be adopted. It says that, for aquantum mechanical system within an external potential, its lo-cal properties do not “see” a change of the external potential ifthis change is limited to a distant region. This fundamental prin-ciple is behind almost all linear scaling algorithms[16, 17, 17–21], which leads to the sparsity of the density matrix − a key toachieve the linear scaling. Using sparse zero-order density andHamiltonian matrices, in 2002, Niklasson suggested a trace-correcting (TC2) approach to replace the traditional diagonal-ization step through a density matrix puriﬁcation method [22].Later in 2004, Niklasson and Challacombe proposed the densitymatrix perturbation theory (DMPT) [23] to extend the TC2 ap-proach in calculating response density matrices. In contrast tothe traditional density functional perturbation theory, where theﬁrst-order density matrix is calculated from dense eigenstatecoe ﬃ cients matrices, the DMPT approach only adopts sparseﬁrst-order Hamiltonian and density matrices, leading to the lin-ear scaling of the calculations. Such DMPT approach can befurther combined with the CPSCF cycles, and in this way, theresponse properties can be calculated self-consistently. Thiscombination is called the TC2-CPSCF method throughout thispaper.The advantage of linear scaling in calculations can be signif-icantly enhanced if the computations are performed in a mas-sively parallel way. Currently, the parallelization of the TC2 Preprint submitted to Computer Physics Communications September 9, 2020 a r X i v : . [ phy s i c s . c o m p - ph ] S e p ethod has been achieved in a few di ﬀ erent ways, like us-ing in-node parallelism via multithreading scheme within onenode [24, 25], using MPI parallelization based on the dis-tributed block compressed sparse row (DBCSR) library [26] inCP2K [27], or using the hybrid MPI + OpenMP parallelizationscheme [28] for the coordinate (COO) data format[29]. For theTC2-CPSCF method, however, there is only serial implemen-tations [30, 31] and no parallel implementation and scalabilityperformance testing has been done yet.In this work, we have implemented the sparse matrices-basedTC2-CPSCF method in the FHI-aims package. We have alsoparallelized the code by using the MPI level distributed memoryparallelization algorithm. The linear scaling and good parallelscalability have been achieved for the response density matrixcalculation within DFPT. The linear scaling with system sizesup to several thousands of atoms and the scalability on tens ofthousands of cores are demonstrated using various realistic sys-tems.The remainder of this paper is organized as follows. Thefundamental theoretical framework is presented in Sec. 2. InSec. 3, the implementation is validated by comparing the calcu-lated analytical polarizabilities with results obtained from thetraditional O ( N ) approach. We also discuss the convergencebehavior of the implementation, the scaling of the computa-tional cost with the system sizes, and the parallel performanceon a large number of cores. In Sec. 4, we summarize our mainachievement and highlight the relevance of this work to the par-allel implementation of other methods.

2. Method

In this section, the basic equations are introduced. Here weuse a spin-unpolarized notation for the sake of simplicity, a for-mal generalization to spin-polarized notation is straightforward.Moreover, in this work, we focus on the equations for ﬁnite sys-tems, the generalization to extended periodic solid case can befound in our previous work[12, 13].In Kohn-Sham density-functional theory, the total energy isuniquely determined by the electron density n ( r ) E tot = − (cid:88) i < ψ i |∇ | ψ i > (cid:124) (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:123)(cid:122) (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:125) T s [ n ] − (cid:90) n ( r ) (cid:88) I Z I | r − R I | d r (cid:124) (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:123)(cid:122) (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:125) E ext [ n ] + (cid:90) (cid:90) n ( r ) n ( r (cid:48) ) | r − r (cid:48) | d r d r (cid:48) (cid:124) (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:123)(cid:122) (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:125) E H [ n ] + (cid:88) I (cid:88) J Z I Z J | R I − R J | (cid:124) (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:123)(cid:122) (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:125) E ion − ion + E xc [ n ] , (1)in which ψ i is the Kohn-Sham eigenstate, T s is the kinetic en-ergy of non-interacting electrons, E ext the electron-nuclear, E H the Hartree, E xc the exchange-correlation, and E ion − ion the ion-ion repulsion energy. All energies are functionals of the elec-tron density. The electron density is written with the eigenfunc-tion, n ( r ) = (cid:88) i f i | ψ i ( r ) | | , (2) in which f i denotes the occupation number of eigenstate ψ i . Theground state electron density n ( r ) is obtained by variationallyminimizing Eq. (1) δδ n (cid:34) E tot − µ (cid:32)(cid:90) n ( r ) d r − N e (cid:33)(cid:35) = , (3)in which µ = δ E KS /δ n is the chemical potential. From aboveequation we get the Kohn-Sham (KS) single particle equationsˆ h KS ψ i = (cid:2) ˆ t s + ˆ v ext ( r ) + ˆ v H + ˆ v xc (cid:3) ψ i = (cid:15) i ψ i , (4)for the Kohn-Sham Hamiltonian ˆ h KS , in which, ˆ t s is the sin-gle particle kinetic operator, ˆ v ext the (external) electron-nuclearpotential, ˆ v H the Hartree potential, and ˆ v xc the exchange-correlation potential. The Kohn-Sham single particle states ψ i and their eigenenergies (cid:15) i can be calculated by solvingEq. (4). In practical numerical implementations, the Kohn-Sham states ψ i are expanded with the ﬁnite basis set χ µ ( r ) ψ i ( r ) = (cid:88) µ C µ i χ µ ( r ) , (5)using the expansion coe ﬃ cients C µ i . Here the numeric atom-centered orbitals (NAOs)[32–34] are adopted as the basis set χ µ ( r ). Denoting H µν = (cid:82) χ µ ( r )ˆ h KS χ ν ( r ) d r for the Hamiltonianmatrix and S µν = (cid:82) χ µ ( r ) χ ν ( r ) d r for the overlap matrix, we canrewrite Eq. (4) as (cid:88) ν H µν C ν i = (cid:15) i (cid:88) ν S µν C ν i . (6)And we can write it in the more convenient matrix form for thezero order Kohn-Sham equation: H (0) C (0) = S (0) C (0) E (0) , (7)whereby E (0) denotes the diagonal matrices containing theeigenvalues (cid:15) i .If an external electric ﬁeld E = (cid:16) e x , e y , e z (cid:17) with strengths e γ is applied to an isolated system, the KS Hamiltonian gains anadditional term ˆ h E = − r · E , which contributes E E [ n ] = − (cid:88) γ (cid:90) e γ r γ n ( r ) d r (8)to the total energy functional in Eq. (1). A perturbative Taylor-expansion of the total energy in the zero-ﬁeld limit gives E tot ( E ) ≈ E tot − (cid:88) γ µ γ e γ − (cid:88) γ,δ α γδ e γ e δ + · · · , (9)where δ, γ are Cartesian directions. For isolated systems, thecoe ﬃ cient in the linear term is µ γ = (cid:90) n ( r ) r γ d r , (10)which corresponds to the γ -component of the dipole moment.The coe ﬃ cient in the second-order term is the polarizability α γδ = ∂µ γ ∂ e δ = (cid:90) r γ . ∂ n ( r ) ∂ e δ d r , (11)2hich need to be calculated with the response of the ground-state density with respect to the ﬁeld strength according to the2 n + M (1) for the ﬁrst orderresponse quantities with respect to the homogeneous externalelectrical ﬁeld. M (1) = dM (0) de γ . (12)Then the ﬁrst order response of Eq. (6) is written as (cid:88) ν ( H (0) µν − (cid:15) (0) i S (0) µν ) C (1) ν i = − (cid:88) ν (cid:16) H (1) µν − (cid:15) (1) i S (0) µν (cid:17) C (0) ν i . (13)It should be noted that, for the homogeneous external electricalﬁeld perturbation discussed in this work, the ﬁrst order overlapmatrix S (1) is zero since the overlap matrix does not changewith respect to the electrical ﬁeld perturbation. And we canalso have its matrix form : H (0) C (1) − S (0) C (1) E (0) = − H (1) C (0) + S (0) C (0) E (1) , (14)whereby E (0) and E (1) denote the diagonal matrices containingthe eigenvalues (cid:15) i and their responses respectively. The Eq. (13)and Eq. (14) are called Sternheimer equation [36], which is thekey to get the response density matrix per CPSCF cycle in thedensity-functional perturbation theory[3–5, 12, 13]. ( N ) method to get the ﬁrst order density matrix The traditional O ( N ) way [6–8] to get the ﬁrst order den-sity matrix in each CPSCF cycle includes two steps. Firstly theSternheimer equation (Eq. 14) is solved to get the ﬁrst ordercoe ﬃ cients C (1) ; Secondly, the response (ﬁrst order) densitymatrix is constructed with the ﬁrst order coe ﬃ cients C (1) andthe occupation number ( f i ) of eigenstate P (1) µ,ν = (cid:88) i f i (cid:16) C (1) µ, i C (0) ν, i + C (0) µ, i C (1) ν, i (cid:17) . (15)In the ﬁrst step to solve the Sternheimer equation, the ﬁrst ordercoe ﬃ cients C (1) are expanded in terms of the zero order expan-sion coe ﬃ cients C (0) using C (1) = C (0) U (1) i.e. C (1) µ i = (cid:88) p C (0) µ p U (1) pi , (16)Then by multiplying Eq. (14) with the Hermitian conju-gate C (0) † , and using the orthonormality relation, C (0) † S (0) C (0) = , (17)we get E (0) U (1) − U (1) E (0) (18) = − C (0) † H (1) C (0) + E (1) . Due to the diagonal character of E (0) and E (1) , this matrix equa-tion contains the response of the eigenvalues on its diagonalelements (cid:15) (1) p = (cid:104) C (0) † H (1) C (0) (cid:105) pp . (19) The o ﬀ -diagonal elements determine the response of the expan-sion coe ﬃ cients for p (cid:44) qU (1) pq = ( − C (0) † H (1) C (0) ) pq ( ε p − (cid:15) q ) . (20)The diagonal elements of U (1) are zero for the electrical ﬁeldperturbation by using the orthogonality relation U (1) pp = . (21)It is clearly shown that this step needs the matrix multiplica-tions with dense eigenfunction coe ﬃ cients C (0) , which resultsin a scaling of O ( N ). In real numerical evaluation, the scalingexponents can be ﬁtted using the polynomial scaling expres-sion t = cN α for the CPU time as function of the total numberof atoms N, such scaling exponents of ﬁrst order density matrixcalculation in our previous tests were α = . α = . O ( N ) scaling. ( N ) method to get the ﬁrst order density matrix In order to reduce the O ( N ) scaling of the last section, themultiplications with the dense eigenfunction coe ﬃ cients C (0) need to be avoided, and the puriﬁcation related method [22, 23,30, 37] is a promising choice. Here we focus on the orthogo-nal formulation of the second order trace-correcting puriﬁcation(TC2) method proposed by Niklasson et al. [22, 38], which is avery e ﬃ cient[39] density-matrix-based method for linear scal-ing electronic structure calculations. The TC2 method is alsocalled the second-order spectral projection (SP2) method [40]with the same algorithm.The TC2 method is initially proposed[22] to solve the KSeigenvalue problem (Eq. 7), which allows us to obtain the den-sity matrix P (0) from ground state Hamiltonian matrix H (0) di-rectly without the need of performing a matrix diagonalization.It is based on a recursive expansion of the Fermi operator. Thedensity matrix in atomic basis set is deﬁned as P = N occ (cid:88) i C i C † i . (22)where N occ is the number of occupied states. And we can getthe idempotency relation in the non-orthogonal form PS P = P , (23)by using Eq. 17.In order to have the idempotency relation in the orthogonalform, we ﬁrstly transform the Hamiltonian matrix H (0) to itsorthogonal representation ( H (0) orth ) using L¨owdin orthogonaliza-tion [41, 42] H (0) orth = S (0) − H (0) S (0) − , (24) C (0) orth = S (0) C (0) . (25)It should be noted that the square root of the overlap ma-trix needed in the above L¨owdin orthogonalization is also cal-culated with the linear scaling algorithm [43] based on the3ewton-Schulz iterations. Then we have the orthogonal formof the KS equation, H orth C orth = (cid:15) C orth . (26)And ﬁnally we have the orthogonal form of the density matrix P (0) orth = S (0) P (0) S (0) , (27)with the idempotency relation in the orthogonal form, P (0) orth P (0) orth = P (0) orth , (28)And this is the base for the orthogonal TC2 method. The initialmatrices X (0)0 can be written as X (0)0 = (cid:15) max − H (0) orth (cid:15) max − (cid:15) min , (29)whereby the (cid:15) min and (cid:15) max denote the minimal and the maxi-mum boundary for the eigenvalues of Hamiltonian matrix H (0) ,which is estimated with Gershgorins Circle Theorem to avoidsolving the eigenvalue problem. We use X (0) n to represent the in-termediates form of the P (0) orth , and we have the TC2 main cycles: X (0) n + = (cid:40) ( X (0) n ) T r ( X (0) n ) ≥ N occ X (0) n − ( X (0) n ) T r ( X (0) n ) < N occ (30)Finally the zero order orthogonal density matrix is gotten afterthe TC2 cycles are converged: P (0) orth = lim n →∞ X (0) n . (31)We can transform it back to the non-orthogonal density matrixas P (0) = S (0) − P (0) orth S (0) − . (32)Such TC2 method can be extended to the response theorydirectly [23, 30], which provides explicit construction of thederivative density matrix, i.e. another way to get the ﬁrst orderdensity matrix directly from the ﬁrst order Hamiltonian. Herewe ﬁrst deﬁne the initial ﬁrst order matrices X (1)0 as X (1)0 = − H (1) (cid:15) max − (cid:15) min (33)where (cid:15) max and (cid:15) min are the maximal and minimal eigenvaluesof the unperturbed Hamiltonian H (0) , then using the followingrecursive cycles in Eq.(34), we get the ﬁrst order density matrixper CPSCF cycle: X (1) n + = (cid:40) X (1) n X (0) n + X (0) n X (1) n T r ( X (0) n ) ≥ N occ X (1) n − X (1) n X (0) n − X (0) n X (1) n T r ( X (0) n ) < N occ (34)Finally the ﬁrst order orthogonal density matrix is gotten afterthe recursive cycles are converged: P (1) orth = lim n →∞ X (1) n . (35)We then transform it back to the non-orthogonal ﬁrst order den-sity matrix as P (1) = S (0) − P (1) orth S (0) − . (36) C h a ng e o f b a nd e n e r gy ( a . u . ) Number of cycles in purification ∆ E (0)b ∆ E (1)b Figure 1: The change of zero band energy ( E (0) b = Tr[P (0) H (0) ]) and the ﬁrstorder band energy ( E (1) b = Tr[P (1) H (0) ]) with respect to number of recursivecycles in Eq. (30) and Eq. (34) respectively. Here the NTPoly-ﬁlter is set to10 − and NTPoly-tolerance is set to 10 − . A B C receive A(i,k) receive B(k,j) (P/c) ( P / c ) / X(P/c) ( P / c ) / X == c = 1 c = 2 Figure 2: The 3D-SpGEMM algorithm for spare matrix-matrix multiplication C = A × B on a √ P / c × √ P / c × c processor grid. Here c in the Z direction isset to 2 in this illustration. The matrix elements of A and B are broadcasted andmultiplied locally to compute a contribution to a local result matrix C, and thenC matrix are merged across the Z direction to get the ﬁnal results. These equations provide the base for computing the density-matrix response explicitly and rapidly. In this work, such TC2recursive algorithm for the ﬁrst order density matrix has beencombined with the CPSCF cycles, and implemented in the all-electron Fritz Haber Institute ab initio molecular simulations(FHI-aims) package [44, 45]. As shown in Fig.1, the change ofthe zero order band energy ( E (0) b = Tr(P (0) H (0) )) and the ﬁrst or-der band energy( E (1) b = Tr(P (1) H (0) )) converged fast with respectto the number of the recursive cycles, after around 25 cycles, thechange of the band energy is reduced to 10 − a.u.. The parallel performance bottleneck in the above TC2 andTC2-CPSCF methods is the sparse matrix-matrix multiplica-tion. Here in this subsection we will show how this sparsematrix-matrix multiplication is performed. The serial algorithmfor sparse matrix-matrix multiplication in compressed sparserows (CSR) format is given by Gustavson[46], as shown in Al-gorithm 1. In its parallelization, the so-called 3D-SpGEMM al-4 lgorithm 1

The serial sparse matrix-matrix multiplication al-gorithm with the compressed sparse rows (CSR) data format.The CSR representation of a sparse matrix A is given by threeone-dimensional array IA, JA, and A. IA is the (address) pointerof the ﬁrst nonzero element for the rows of A; JA is the columnindices of the nonzero matrix elements; A is the numerical val-ues of the nonzero matrix elements.

Require: nrow : the row dimension of A and Cncol : the column dimension of B and CIA, JA, A: input sparse matrix A with CSR formatIB, JB, B: input sparse matrix B with CSR formatIC, JC, C: output sparse matrix C with CSR formatIC(1) ← ← for k ←

1, ncol do label(k) ← end forfor i ←

1, nrow dofor pA ← IA(i), IA(i + do j ← JA(pA) for pB ← IB(j), IB(j + do k ← JB(pB) if label(k) .eq. 0 then pC ← pC + ← klabel(k) ← pCC(pC) ← A(pA) × B(pB) else

C(label(k)) ← C(label(k)) + A(pA) × B(pB) end ifend forend forfor pCN ← IC(i), pC do label(JC(pCN)) ← end for IC(i + = pC + end for gorithm developed by Ballard et al. [47] and Adaz et al. [28] isemployed to minimize data communication between processorsin the parallel progress, which e ﬀ ectively optimizes the parallelcomputations of the sparse matrix-matrix multiplication. In thisalgorithm, as shown in Fig. 2, each matrix is distributed alongthe cubic √ P / c × √ P / c × c processor grid, where 1 < c < √ P and P is the total number of the processors. Then each matrixis broadcasted and multiplied locally to compute a contributionto a local result matrix, and ﬁnally the result matrix is summedup.This 3D-SpGEMM algorithm has been implemented in theCombinatorial BLAS library[48] as well as in the NTPoly li-brary [29], a library for massively parallel sparse matrix func-tion calculations. The algorithm shows very good strong scal-ing performance for sparse matrix multiplications[29]. SuchNTPoly package has been integrated into ELSI[49], which isa general open-source infrastructure for large-scale electronicstructure theory and can be linked with FHI-aims and SIESTA.Here our implementation of the TC2-CPSCF method is basedon the sparse matrix-matrix multiplication routine from NTPolyin ELSI. The sparse matrix is stored with compressed sparserow (CSR) storage format in FHI-aims. In NTPoly, the coor-dinate format (COO) data format is adopted as the input inter-face. In the COO data format, the global row, column and valueare stored in a triplet list, which is convenient to make paralleldecomposition of the global parallel matrices into local sparsematrices. Then the local sparse matrix stored in the COO formatis transformed to the CSR format to perform the local matrix-matrix multiplication with Algorithm 1. In order to use NTPolyin FHI-aims, we ﬁrst need to translate the CSR storage formatto the triplet format, and then performs the TC2 / TC2-CPSCFscheme to get the density matrix / ﬁrst order density matrix, ﬁ-nally the data is transformed back to the CSR format in FHI-aims. It should be noted that since the input sparse matricesin our calculations are usually not well distributed in parallelprocesses, the rows and columns of the matrix need to be ran-domly permuted by multiplying the sparse matrices with per-mutation matrices, in order to achieve the load balance in thesparse matrix-matrix multiplication.In the NTPoly implementation, two parameters are used tocontrol the error. One is called NTPoly-ﬁlter, which refersto the threshold to determine which matrix elements can betreated as zero. This parameter scales linearly with the ac-cumulated density matrix error. It should be noted that theaccumulated error of the puriﬁcation method is bounded andrelated to the drop tolerance of matrix elements and the bandgap of the system [37], but it is di ﬃ cult to be controlled withrigorous numerics[50, 51]. The strategy to rigorously controlthe forward error of density matrix puriﬁcation can be foundin Ref. [51]. The other parameter is called NTPoly-tolerance,which is the convergence-threshold which compared the bandenergy between the current iteration and the last iteration. Inthe following, we will give the examination for the two param-eters in real applications.5 . Results To validate our implementation we have speciﬁcally inves-tigated the convergence of polarizabilities with respect to thenumerical parameters used in the TC2-CPSCF calculation inSec. 3.1. Furthermore, a systematic validation of the TC2-CPSCF implementation by comparing to polarizabilities ob-tained from the benchmark O ( N ) method is presented inSec. 3.2. The computational performance of the TC2-CPSCFimplementation is discussed in Sec. 3.3.In FHI-aims, the atom-centered integration grids are usedfor the numerical integration. Each atom has the radial shellsaround it, and the angular points are distributed on each radialshell. The grid settings in FHI-aims are described by light, tightand really-tight with di ﬀ erent radial shells and angular integra-tion points, the tighter the better quality of the integration grid.The basis set setting in FHI-aims are deﬁned as following: Aminimal basis includes the radial functions of the occupied or-bitals of free atoms with noble gas conﬁguration, then the quan-tum numbers of the additional valence functions, and additionalradial functions are added to make “tier 1” ,“tier 2”, and so on.Such basis sets are similar to the split-valence polarization ba-sis used in the Gaussian basis set. For example, the “tier 1”basis set is equivalent to the double-zeta plus polarization basisset. The parameter c discussed in Sec. 2.3 is set to 1 in the fol-lowing calculations, because the c > In this part, the convergence behaviour of the TC2-CPSCFmethod with respect to the numerical parameters (NTPoly-ﬁlter, NTPoly-tolerance) is analysed. We use the water (H O)molecule as an example, for which we compute the three diag-onal components of the polarizability tensor using a local ap-proximation for exchange and correlation (LDA parametriza-tion of Perdew and Zunger [52] for the correlation energy den-sity of the homogeneous electron gas based on the data ofCeperley and Alder [53]). The tight setting is adopted for theintegration and the “tier 2” basis set is adopted in this calcula-tion.The upper panel of Fig.3 shows the absolute error changein the three diagonal components of the polarizability if theNTPoly-ﬁlter is changed. The NTPoly-ﬁlter is the parameterto determine the threshold smaller than which the matrix ele-ments will be discarded in the process of zero / ﬁrst order densitymatrix puriﬁcation. Here, the NTPoly-ﬁlter is changed from10 − to 10 − and the NTPoly-tolerance is ﬁxed to 10 − . We cansee the polarizabilities converged quickly with respect to theNTPoly-ﬁlter. At around NTPoly-ﬁlter = − , we get the max-imal absolute / relative error of 0.002 Bohr / = − setting.The lower panel of Fig.3 shows the convergence test withrespect to NTPoly-tolerance, which is the parameter to deter-mine the convergence criterion of the zero / ﬁrst order densitymatrix puriﬁcation. We change the NTPoly-tolerance from -2 -3 -4 -5 -6 -7 | ∆ α | ( B oh r ) NTPoly-filter α xx α yy α zz -2 -3 -4 -5 -6 -7 | ∆ α | ( B oh r ) NTPoly-tolerance α xx α yy α zz Figure 3: The convergence of the absolute error of three diagonal componentsof the polarizability tensor of H O with respect to NTPoly-ﬁlter and NTPoly-tolerance. Here in the upper panel, the NTPoly-tolerance is ﬁxed to 10 − whilethe NTPoly-ﬁlter is changed from 10 − to 10 − ; In the lower panel, the NTPoly-ﬁlter is ﬁxed to 10 − while the NTPoly-tolerance is changed from 10 − to 10 − . − to 10 − and ﬁxed the NTPoly-ﬁlter to 10 − , and the po-larizabilities converged also fast with respect to the NTPoly-tolerance. At NTPoly-tolerance = − , we get the maximal ab-solute / relative error of 0.009 Bohr / = − setting.As a result, in the following calculation, we can safely useNTPoly-ﬁlter (10 − ) and NTPoly-tolerance (10 − ) in the vali-dation part in Sec. 3.2. Moreover, it is also enough for us to useNTPoly-ﬁlter (10 − ) and NTPoly-tolerance (10 − ) in the perfor-mance evaluation part in Sec. 3.3. The polarizabilities of 32 selected molecules are calcu-lated with the linear scaling TC2-CPSCF method described inSec. 2.2. The results are compared with the normal O ( N )method described in Sec. 2.1 to serve as the benchmark to makethe validation. The detailed comparison for each individualmolecule is listed in the Appendix A. Here we summarizedthe data in Tab. 1, where we list the mean absolute percentageerror (MAPE) and the mean absolute error (MAE) for all testedmolecules. Overall, we ﬁnd an excellent agreement betweenour O ( N ) TC2-CPSCF method and the O ( N ) benchmark re-sults. To demonstrate the scaling performance of our implemen-tation, we use the H(C H ) n H molecules oriented along the6AE (Bohr ) MAPEDimers 0.023 0.078%Molecules 0.0036 0.015% Table 1: Mean absolute error (MAE) and mean absolute percentage error(MAPE) for the di ﬀ erence between the polarizabilities obtained via linear scal-ing TC2-DFPT method for a set of 16 dimers, 16 molecules. All calculationsare performed at the LDA level of theory with fully converged numerical set-tings and relaxed geometries. Detailed informations including the values foreach individual molecule can be found in the Appendix.Figure 4: The H(C H ) n H molecules used in this work to do the performancetest. Here the repeated unit is the C H block marked with a green shade,the number n is chosen to change from 64 to 640, which corresponding to thenumber of atoms change from 386 to 3842 in our performance test. The whiteball is hydrogen atom and green one is carbon atom. X-axis as shown in Fig. 4 as the examples. All calculationsuse light settings and the LDA functional. Calculations wereperformed on three node of Intel(R) Xeon(R) CPU E5-2678v3CPUs (12 cores each at 2.50GHz).We ﬁrst investigate the matrix sparsity and the time scalingwith respect to the number of atoms. In our DFPT implemen-tation, the CPSCF is performed for each coordinate indepen-dently. As the H(C2H4) n H molecule is placed along X-axis,we just examine the DFPT perturbation for X and Z coordi-nate respectively, since the Y coordinate gives the same resultas the Z coordinate. In Fig.5(a), we can see that both the den-sity matrix sparsity and the ﬁrst order density matrix decay as O ( N ) with respect to the number of atoms. The sparsity of theﬁrst order density matrix in X-axis is larger than in Z-axis, thisis because the electric ﬁeld in X-axis just polarizes the elec-tric density in this direction and causes the density overlappingsince the H(C2H4) n H line was also placed along the X-axis. Onthe other hand, the sparsity of the ﬁrst order density matrix inZ-axis is similar with the zero order density matrix because thepolarization along Z-axis does not bring the additional overlapof the density. In Fig. 5(b), the number of the non-zero ele-ments is examined, since the number of the matrix elementsincreases as O ( N ) and the sparsity increases as O ( N ), so thenumber of the non-zero elements increase as O ( N ). Since thesparsity of the ﬁrst order density matrix in X-axis is larger thanZ-axis, so the prefactor of the non-zero elements is also largerin X-axis. Finally, in Fig. 5(c), we show the puriﬁcation timeper SCF / CPSCF cycle, in which we can see the DFPT time inZ-axis is around 3 times of the DFT time, and this is becausethe number of the matrix operations in DFPT (TC2-CPSCF) isaround 3 times of the DFT (TC2). We can also see the DFPTtime in X-axis is around 5 times of the DFT because of the spar-sity of the response density in X-axis is larger than the one inZ-axis. Finally we observe the overall linear scaling in boththe DFT and DFPT calculations for the puriﬁcation time withincreasing system sizes.In addition to the sparsity, the parameter NTPoly-ﬁlter also inﬂuences the linear scaling prefactor. As shown in Fig.6, twovalues of NTPoly-ﬁlter are adopted, and the computation timewith NTPoly-ﬁlter (10 − ) is nearly double of the one computedwith NTPoly-ﬁlter (10 − ).In Fig. 7, we present the puriﬁcation time of the zero / ﬁrstorder density matrix for the isolated H(C2H4) n H molecule sys-tems with the number of atoms changing from 386 to 3,842 (thecorresponding number of the basis functions are changing from3,082 to 30,730). The DFT- O ( N ) and DFPT- O ( N ) mean theTC2 and TC2-CPSCF method respectively as described inSec. 2.2, while the DFT- O ( N ) scaling method refers to thetraditional matrix diagonalization algorithm, and the DFPT O ( N ) scaling method refers to the dense matrix algorithm asshown in Sec.2.1. Here the DFPT results are for the perturba-tion alone the Z direction. It should be noted that, the tradi-tional O ( N ) method is fully optimized both in DFT [49] andDFPT [12, 13] method. The numerical thresholds NTPoly-ﬁlter (10 − ) and NTPoly-tolerance (10 − ) are applied in the O ( N ) method. It is clearly shown that the performance ofthe TC2 method is better than the traditional O ( N ) methodat around 1300 atoms (10000 basis functions), and the perfor-mance of the TC2-CPSCF method is better than the traditional O ( N ) method at around 3000 atoms (23000 basis function).The comparison of the total time for the calculation of the po-larizabilities between the O ( N ) TC2-CPSCF method and thetraditional O ( N ) method is shown in the Appendix B, whichgives similar crossover point.In order to systematically investigate the scaling performanceof each part in the DFPT calculation, we show in Fig. 8 forthe CPU time contributed from the individual response proper-ties (density n (1) , electrostatic potential V (1) , Hamiltonian ma-trix H (1) , density matrix P (1) ) as well as the total summation ofall the contributions ( n (1) + V (1) + H (1) + P (1) ) per DFPT cycle.The scaling exponents of the computation time (as a func-tion of the total number of atoms N) in calculating each re-sponse quantity were ﬁtted using the polynomial scaling for-mula t = cN α ( α is the exponent), with the obtained exponentvalues listed in the upper panel of Fig. 8. We ﬁnd that calcu-lating the ﬁrst order density matrix P (1) dominates the compu-tational time, which, in principle, exhibits a strict O(N) scal-ing. For a system whose size is ranged from 386 atoms to3884 atoms, the obtained exponent α of 1.2 is close to the ex-pected O(N) scaling. Calculating the ﬁrst order electrostatic re-sponse potential V (1) is the second expensive part, and the cor-responding exponent α of 1.7 is similar to that in updating theground-state electrostatic potential [32]. For very large systems( N > V (1) dominates, since it scales higherthan updating P (1) . Calculating the Hamiltonian response ma-trix H (1) and the ﬁrst order response density n (1) corresponds toan exponent α of 1.6, since it involves similar numerical opera-tions.The scalability tests are performed on the Tianhe-2 super-computer located at the National Supercomputing Center inGuangzhou, China. The largest number of nodes that we canuse for performance test is 1,050 nodes (25,200 cores). Eachnode is composed of two Intel Ivy Bridge E5-2692 processors(12 cores each at 2.2 GHz). In Fig. 9, we show the parallel7 S p a r s it y o f t h e m a t r i x ( % ) number of atoms H(C H ) n H molecules (n=64-512)

DFPT(coord=X)DFPT(coord=Z)DFT (a) Sparsity

0 500 1000 1500 2000 2500 3000 3500 N u m b e r o f t h e non - ze r o e l e m e n t s i n t h e m a t r i x number of atoms H(C H ) n H molecules (n=64-512)

DFPT(coord=X)DFPT(coord=Z)DFT (b) Non-zero element T i m e p e r s . c . f . it e r a ti on ( s ) number of atoms H(C H ) n H molecules (n=64-512)

DFPT(coord=X)DFPT(coord=Z)DFT (c) TimeFigure 5: The matrix sparsity, number of the non-zero elements and the puriﬁcation time with respect to the number of the atoms for the DFT and DFPT calculation. T i m e p e r D F T s . c . f . it e r a ti on ( s ) number of atoms H(C H ) n H molecules (n=64-512)

NTPoly-filter=10 -5 NTpoly-filter=10 -6 Figure 6: The puriﬁcation time per DFT iteration with respect to the numberof the atoms. Here the inﬂuence of the NTPoly-ﬁlter parameter to the linearscaling prefactor is shown. (basis functions) (3082) (6154) (12298) (18442) (24586) (30730)

Figure 7: The puriﬁcation time of the zero / ﬁrst order density matrix build withDFT / DFPT for isolated H(C2H4) n H molecules containing from 386 to 3842atoms, with the number of the basis functions changing from 3082 to 30730.Here we use NTPoly-ﬁlter (10 − ) and NTPoly-tolerance (10 − ) as the numeri-cal thresholds. All calculations are performed on 36 CPU cores. Scaling Factor α n (1) (1) (1) (1) T i m e p e r D FP T s . c . f . it e r a ti on ( s ) number of atoms H(C H ) n H molecules (n=64-640) n (1) (r)V (1) (r)H (1) P (1) Total (DFPT)

Figure 8: Dependence of the CPU time per DFPT cycle on the number of atomsin the H(C H ) n H molecules. The perturbation of the electric ﬁeld is along theZ direction. Here, the total CPU time (black line) as well as its four compo-nents, i.e. , CPU time for the density n (1) (blue line), the electrostatic potential V (1) (sky blue line), the Hamiltonian matrix H (1) (green line), and the densitymatrix P (1) (orange line), are shown. Double logarithmic axes are used. The ﬁt-ted CPU time exponents α for the H(C H ) n H molecules (n = t = cN α for the CPUtime as function of the number of atoms N . T i m e p e r D FP T s . c . f . it e r a ti on ( s ) number of cores H(C H ) n H molecules (n=128) n (1) (r)V (1) (r)H (1) P (1) Total (DFPT)ideal

Figure 9: Parallel scalability for the H(C H ) n H molecule containing 770atoms. The perturbation of the electric ﬁeld is along the Z direction. Here,the total CPU time (black line) as well as its four components, i.e. , CPU timefor the density n (1) (blue line), the electrostatic potential V (1) (sky blue line),the Hamiltonian matrix H (1) (green line), and the density matrix P (1) (orangeline), are shown. Double logarithmic axes are used. The red line corresponds tothe ideal scaling. Here we use light settings for the integration, a “tier 1” basisset, and the LDA functional, NTPoly-ﬁlter (10 − ) and NTPoly-tolerance (10 − )as the numerical thresholds. scalability for the ﬁnite system containing 770 atoms. Here wecan see the ﬁrst order density matrix calculation is the mosttime-consuming step, and the scalability is good. A relativespeedup of 15.2 × is obtained reducing the wall-time per DFPTiteration from 81.5 sec on 24 MPI cores to around 5.3 sec on768 MPI cores. And the parallel e ﬃ ciency is nearly 47% whenusing 768 cores. Beside the cluster systems under free bound-ary condition, we also investigate the parallel scalability for anextended system (polyethylene) under periodic boundary con-ditions with a unit cell containing 768 atoms as shown in Fig.10. Γ -point is su ﬃ cient to sample the reciprocal space due tothe large unit cell, in this system, the ﬁrst order potential is themost time-consuming step, which shows almost ideal scaling,and the parallel e ﬃ ciency is around 87% when using 768 cores.It should be noted that, the moderate parallel e ﬃ -ciency (47%) in Fig. 9 with 768 cores is because of lowcomputational intensity with 770 atoms. If we increase theH(C H ) n H system size to 3074 atoms, then the parallel ef-ﬁciency increases to around 55% at 768 cores. In Fig. 11,we further investigate the parallel scalability of the TC2-CPSCF method with di ﬀ erent system sizes for the H(C H ) n Hmolecules. Here we can see for system sizes ranged from 3,074atoms to 12,290 atoms, with the number of the basis functionschanging from 24,586 to 98,314, the scalability is good up tothe maximum 25,200 CPU cores. For the system contained3,074 atoms, a speedup of 22.3 × is obtained reducing the wall-time per DFPT iteration from 342.2 sec on 48 MPI cores to 15.3sec on 3,072 MPI cores. Then for a larger system contained6,146 atoms, a speedup of 9.5 × is obtained reducing the wall-time per DFPT iteration from 253.9 sec on 240 MPI cores to26.9 sec on 6,144 MPI cores. Finally, the total time per DFPTcycle speedups of up to 4.5 × when we go from 2,400 cores T i m e p e r D FP T s . c . f . it e r a ti on ( s ) number of cores polyethylene chain n (1) (r)V (1) (r)H (1) P (1) Total (DFPT)ideal

Figure 10: Parallel scalability for the extended system (polyethylene) with aunit cell containing 768 atoms. The perturbation of the electric ﬁeld is alongthe Z direction. Here, the total CPU time (black line) as well as its four compo-nents, i.e. , CPU time for the density n (1) (blue line), the electrostatic potential V (1) (sky blue line), the Hamiltonian matrix H (1) (green line), and the densitymatrix P (1) (orange line), are shown. Double logarithmic axes are used. Thered line corresponds to the ideal scaling. Here we use light settings for the in-tegration, a “tier 1” basis set, and the LDA functional, NTPoly-ﬁlter (10 − ) andNTPoly-tolerance (10 − ) as the numerical thresholds. to 25,200 cores for the H(C H ) n H system contained 12,290atoms.

4. Conclusions

We have implemented an e ﬃ cient parallel linear scalingmethod for perturbations of homogeneous electric ﬁelds withinan all-electron, numeric atom-centered orbitals framework. Wehave validated the implementation by comparing polarizabili-ties of molecules calculated from this O ( N ) approach with thoseobtained from the traditional O ( N ) method. The results canbe systematically converged with respect to the used numericalparameters. The scaling exponent of the computation time incalculating the ﬁrst order density matrix is α = . O ( N ) scaling. The implemented TC2-CPSCF method exhibits agood parallel scalability that can be extended up to 25,200 coresin real systems. The formalism described in this paper couldalso be applied in dealing with other type of perturbations, e.g.atomic displacements in the lattice dynamics. Moreover, the3D-SpGEMM algorithm employed in this work can also beused in the density-matrix-based Laplace-transformed CPSCFmethod [54] and the density-matrix-based time-dependent self-consistent ﬁeld method [55] for calculating dynamic polariz-abilities. References [1] P. Hohenberg, Phys. Rev. 136 (1964) B864–B871. URL: http://link.aps.org/doi/10.1103/PhysRev.136.B864 . doi: .[2] W. Kohn, L. J. Sham, Phys. Rev. 140 (1965) A1133–A1138.URL: http://link.aps.org/doi/10.1103/PhysRev.140.A1133 .doi: .

10 100 1000 10 100 1000 10000 100000 T o t a l ti m e p e r D FP T it e r a ti on ( s ) number of cores Figure 11: The strong scalability for the total time per DFPT iteration withdi ﬀ erent system sizes in the H(C H ) n H molecules. The systems we are us-ing contained 3,074 atoms (24,586 basis functions), 6,146 atoms (49,162 basisfunctions) and 12,290 atoms (98,314 basis functions).[3] X. Gonze, Phys. Rev. B 55 (1997) 10337–10354. URL: http://link.aps.org/doi/10.1103/PhysRevB.55.10337 . doi: .[4] X. Gonze, C. Lee, Phys. Rev. B 55 (1997) 10355–10368. URL: http://link.aps.org/doi/10.1103/PhysRevB.55.10355 . doi: .[5] S. Baroni, S. de Gironcoli, A. Dal Corso, P. Giannozzi, Rev. Mod.Phys. 73 (2001) 515–562. URL: http://link.aps.org/doi/10.1103/RevModPhys.73.515 . doi: .[6] J. Gerratt, I. M. Mills, J. Chem. Phys. 49 (1968) 1719–1729. URL: http://link.aip.org/link/?JCP/49/1719/1 .doi: .[7] J. A. Pople, R. Krishnan, H. B. Schlegel, J. S. Binkley, International Jour-nal of Quantum Chemistry 16 (1979) 225–241. URL: http://dx.doi.org/10.1002/qua.560160825 . doi: .[8] C. E. Dykstra, P. G. Jasien, Chem. Phys. Lett. 109 (1984) 388– 393. URL: . doi: .[9] M. Frisch, M. Head-Gordon, J. Pople, Chem. Phys. 141 (1990) 189– 196. URL: . doi: .[10] C. Ochsenfeld, M. Head-Gordon, Chem. Phys. Lett. 270 (1997) 399– 405. URL: . doi: .[11] W. Liang, Y. Zhao, M. Head-Gordon, J. Chem. Phys. 123 (2005)194106. URL: http://link.aip.org/link/?JCP/123/194106/1 .doi: .[12] H. Shang, C. Carbogno, P. Rinke, M. Sche ﬄ er, Computer Physics Com-munications 215 (2017) 26–46. URL: .doi: .[13] H. Shang, N. Raimbault, P. Rinke, M. Sche ﬄ er, M. Rossi,C. Carbogno, New Journal of Physics 20 (2018) 073040.URL: http://stacks.iop.org/1367-2630/20/i=7/a=073040?key=crossref.b45b8680fc0308226fe0611417a68450 .doi: .[14] D. R. Bowler, T. Miyazaki, Reports on Progress inPhysics 75 (2011) 036503. URL: http://stacks.iop.org/0034-4885/75/i=3/a=036503?key=crossref.28e3cb0ce7d7274e9d63f9158aff7224http://arxiv.org/abs/1108.5976http://dx.doi.org/10.1088/0034-4885/75/3/036503 . doi: . arXiv:1108.5976 .[15] W. Kohn, Phys. Rev. Lett. 76 (1996) 3168–3171. URL: https://link.aps.org/doi/10.1103/PhysRevLett.76.3168 . doi: .[16] J. M. Soler, E. Artacho, J. D. Gale, A. Garc´ıa, J. Junquera,P. Ordej´on, D. S´anchez-Portal, J. Phys. Condens. Matter 14 (2002)2745–2779. URL: http://iopscience.iop.org/0953-8984/14/11/302/ . doi: .[17] D. R. Bowler, T. Miyazaki, J. Phys. Condens. Matter 22 (2010) 74207.URL: http://stacks.iop.org/0953-8984/22/i=7/a=074207 .[18] H. Shang, H. Xiang, Z. Li, J. Yang, International Reviews inPhysical Chemistry 29 (2010) 665–691. URL: .doi: .[19] A. S. Torralba, M. Todorovi´c, V. Br´azdov´a, R. Choudhury, T. Miyazaki,M. J. Gillan, D. R. Bowler, Journal of Physics: Condensed Matter20 (2008) 294206. URL: https://doi.org/10.1088%2F0953-8984%2F20%2F29%2F294206 . doi: .[20] V. Weber, M. Challacombe, The Journal of Chemical Physics 125 (2006)104110. URL: https://doi.org/10.1063/1.2222359 . doi: . arXiv:https://doi.org/10.1063/1.2222359 .[21] X. Wu, A. Selloni, R. Car, Phys. Rev. B 79 (2009) 085102. URL: https://link.aps.org/doi/10.1103/PhysRevB.79.085102 .doi: .[22] A. M. N. Niklasson, Phys. Rev. B 66 (2002) 155115. URL: https://link-aps-org-443.webvpn.las.ac.cn/doi/10.1103/PhysRevB.66.155115 . doi: .[23] A. M. N. Niklasson, M. Challacombe, Physical Review Letters92 (2004) 193001. URL: https://link.aps.org/doi/10.1103/PhysRevLett.92.193001 . doi: . arXiv:0311591 .[24] S. M. Mniszewski, M. J. Cawkwell, M. E. Wall, J. Mohd-Yusof,N. Bock, T. C. Germann, A. M. N. Niklasson, Journal of Chem-ical Theory and Computation 11 (2015) 4644–4654. URL: https://pubs.acs.org/doi/10.1021/acs.jctc.5b00552 . doi: .[25] M. J. Cawkwell, M. A. Wood, A. M. Niklasson, S. M. Mniszewski,Journal of Chemical Theory and Computation 10 (2014) 5391–5396.doi: .[26] A. Lazzaro, J. VandeVondele, J. Hutter, O. Sch¨utt, in: Proceed-ings of the Platform for Advanced Scientiﬁc Computing Conference,PASC 17, Association for Computing Machinery, New York, NY,USA, 2017. URL: https://doi.org/10.1145/3093172.3093228 .doi: .[27] J. Hutter, M. Iannuzzi, F. Schi ﬀ mann, J. VandeVondele,WIREs Computational Molecular Science 4 (2014) 15–25. URL: . doi: . .[28] A. Azad, G. Ballard, A. Buluc¸, J. Demmel, L. Grigori, O. Schwartz,S. Toledo, S. Williams, SIAM Journal on Scientiﬁc Computing 38(2016) C624–C651. URL: . doi: .[29] W. Dawson, T. Nakajima, Computer Physics Communications 225(2018) 154–165. URL: https://doi.org/10.1016/j.cpc.2017.12.010https://linkinghub.elsevier.com/retrieve/pii/S0010465517304150 . doi: .[30] V. Weber, A. M. N. Niklasson, M. Challacombe, Physical Review Let-ters 92 (2004) 193002. URL: https://link.aps.org/doi/10.1103/PhysRevLett.92.193002 . doi: . arXiv:0312634 .[31] H. J. Xiang, J. Yang, J. G. Hou, Q. Zhu, Physical Review Letters97 (2006) 266402. URL: http://link.aps.org/doi/10.1103/PhysRevLett.97.266402https://link.aps.org/doi/10.1103/PhysRevLett.97.266402 . doi: .[32] V. Blum, R. Gehrke, F. Hanke, P. Havu, V. Havu, X. Ren,K. Reuter, M. Sche ﬄ er, Comput. Phys. Commun. 180 (2009) 2175–2196. URL: http://linkinghub.elsevier.com/retrieve/pii/S0010465509002033 . doi: .[33] B. Delley, J. Chem. Phys. 94 (1991) 7245. URL: http://scitation.aip.org/content/aip/journal/jcp/94/11/10.1063/1.460208 .doi: .[34] B. Delley, J. Chem. Phys. 92 (1990) 508. URL: http://scitation. ip.org/content/aip/journal/jcp/92/1/10.1063/1.458452 .doi: .[35] X. Gonze, J.-P. Vigneron, Phys. Rev. B 39 (1989) 13120–13128.URL: http://link.aps.org/doi/10.1103/PhysRevB.39.13120 .doi: .[36] R. M. Sternheimer, Phys. Rev. 96 (1954) 951–968. URL: http://link.aps.org/doi/10.1103/PhysRev.96.951 . doi: .[37] A. M. N. Niklasson, C. J. Tymczak, M. Challacombe,The Journal of Chemical Physics 118 (2003) 8611–8620. URL: https://doi-org-443.webvpn.las.ac.cn/10.1063/1.1559913 . doi: . arXiv:https://doi-org-443.webvpn.las.ac.cn/10.1063/1.1559913 .[38] A. M. N. Niklasson, V. Weber, The Journal of Chemical Physics 127(2007) 064105. URL: http://aip.scitation.org/doi/10.1063/1.2755775 . doi: .[39] E. Rudberg, E. H. Rubensson, Journal of Physics: Con-densed Matter 23 (2011) 075502. URL: http://stacks.iop.org/0953-8984/23/i=7/a=075502?key=crossref.f46001d5e55d398d8173874c06aff985 . doi: .[40] N. Bock, C. F. A. Negre, S. M. Mniszewski, J. Mohd-Yusof, B. Aradi,J.-L. Fattebert, D. Osei-Ku ﬀ uor, T. C. Germann, A. M. N. Niklas-son, The Journal of Supercomputing 74 (2018) 6201–6219. URL: http://link.springer.com/10.1007/s11227-018-2533-0 .doi: .[41] P. L¨owdin, The Journal of Chemical Physics 18 (1950)365–375. URL: https://doi-org-443.webvpn.las.ac.cn/10.1063/1.1747632 . doi: . arXiv:https://doi-org-443.webvpn.las.ac.cn/10.1063/1.1747632 .[42] P.-O. L¨owdin, Advances in Physics 5 (1956)1–171. URL: https://doi.org/10.1080/00018735600101155 . doi: . arXiv:https://doi.org/10.1080/00018735600101155 .[43] B. Jans´ık, S. Høst, P. Jørgensen, J. Olsen, T. Helgaker, The Journal ofChemical Physics 126 (2007) 124104. URL: https://doi.org/10.1063/1.2709881 . doi: .[44] X. Ren, P. Rinke, V. Blum, J. Wieferink, A. Tkatchenko,A. Sanﬁlippo, K. Reuter, M. Sche ﬄ er, New J. Phys. 14 (2012)053020. URL: http://stacks.iop.org/1367-2630/14/i=5/a=053020?key=crossref.351b343783c2c1df1596219a941a74eb .doi: .[45] V. Havu, V. Blum, P. Havu, M. Sche ﬄ er, J. Comput. Phys. 228 (2009)8367–8379. URL: http://linkinghub.elsevier.com/retrieve/pii/S0021999109004458 . doi: .[46] F. G. Gustavson, ACM Transactions on Mathematical Software(TOMS) 4 (1978) 250–269. URL: http://dl.acm.org/doi/10.1145/355791.355796 . doi: .[47] G. Ballard, A. Buluc, J. Demmel, L. Grigori, B. Lipshitz, O. Schwartz,S. Toledo, in: Proceedings of the 25th ACM symposium on Parallelism inalgorithms and architectures - SPAA ’13, 2, ACM Press, New York, NewYork, USA, 2013, p. 222. URL: http://dl.acm.org/citation.cfm?doid=2486159.2486196 . doi: .[48] A. B. R. Gilbert;, The International Journal of High Performance Com-puting Applications 25 (2011) 496–509. URL: http://dx.doi.org/10.1177/1094342011403516 . doi: . arXiv:http://dx.doi.org/10.1177/1094342011403516 .[49] V. W. zhe Yu, F. Corsetti, A. Garca, W. P. Huhn, M. Jacquelin, W. Jia,B. Lange, L. Lin, J. Lu, W. Mi, A. Seiﬁtokaldani, lvaro Vzquez-Mayagoitia, C. Yang, H. Yang, V. Blum, Computer Physics Communi-cations 222 (2018) 267 – 285. URL: . doi: https://doi.org/10.1016/j.cpc.2017.09.007 .[50] E. H. Rubensson, P. Sałek, Journal of Computational Chemistry 26(2005) 1628–1637. URL: http://doi.wiley.com/10.1002/jcc.20315 . doi: .[51] E. H. Rubensson, E. Rudberg, P. Sałek, The Journal of ChemicalPhysics 128 (2008) 074106. URL: http://aip.scitation.org/doi/10.1063/1.2826343 . doi: .[52] J. P. Perdew, A. Zunger, Phys. Rev. B 23 (1981) 5048–5079. URL: http://link.aps.org/doi/10.1103/PhysRevB.23.5048 . doi: .[53] D. M. Ceperley, B. J. Alder, Phys. Rev. Lett. 45 (1980) 566–569. URL: http://link.aps.org/doi/10.1103/PhysRevLett.45.566 .doi: .[54] M. Beer, C. Ochsenfeld, The Journal of Chemical Physics 128(2008) 221102. URL: http://aip.scitation.org/doi/10.1063/1.2940731 . doi: .[55] J. Kussmann, C. Ochsenfeld, The Journal of Chemical Physics 127(2007) 204103. URL: http://aip.scitation.org/doi/10.1063/1.2794033 . doi: .

5. Acknowledgments

This work was supported by CARCH4205, and by the Strate-gic Priority Research Program of Chinese Academy of Sci-ences (Grant No. XDC01040100). H.S. acknowledges VictorWen zhe Yu for inspiring discussions. H.S. thanks the Tianhe-2Supercomputer Center for computational resources.

Appendix A. Appendix: Validation of the polarizabilitytensor for molecules

We use the linear scaling TC2-CPSCF method describedin Sec. 2.2 to calculate the polarizabilities of 32 selectedmolecules, the results are compared with the normal O ( N )method described in Sec. 2.1 to serve as the benchmark to makethe validation. All calculations were performed at the LDAlevel of theory and using “tier 2” basis sets with the “reallytight” defaults for the integration grids. The NTPoly-ﬁlter isset to 10 − and NTPoly-tolerance is set to 10 − . The employedequilibrium geometries were determined by relaxation that allabsolute forces are smaller than 10 − eV / Å. As summarized inTable A.2, the mean absolute error (MAE) and the mean ab-solute percentage error (MAPE) is 0.023 Bohr and 0.078%.The largest absolute error (0.68 Bohr ) observed in the LiHmolecule, and this is because the density matrix puriﬁcationconvergence is not tight enough, if we change the NTPoly-tolerance to 10 − , then the largest absolute error in LiH is re-duced to 0.0001 Bohr . For the other larger molecules as shownin Table A.3, the mean absolute error (MAE) and the mean ab-solute percentage error (MAPE) is 0.0036 Bohr and 0.015%,which show an excellent agreement between our linear scalingTC2-DFPT implementation and the benchmark O ( N ) DFPTresults. Appendix B. Appendix: The comparison for the total time

The comparison of the total time for the calculation of thepolarizabilities between the O ( N ) TC2-CPSCF method and thetraditional O ( N ) method is shown in Fig. B.12.11 C2-CPSCF benchmark ab-err (Bohr ) rel-err(%)Cl α xx α yy α zz α xx α yy α zz α xx α yy α zz α xx α yy α zz α xx α yy α zz α xx α yy α zz α xx α yy α zz α xx α yy α zz α xx α yy α zz α xx α yy α zz α xx α yy α zz α xx α yy α zz α xx α yy α zz α xx α yy α zz α xx α yy α zz α xx α yy α zz α for 16 dimers, as computed with thepresented TC2-CPSCF implementation at the LDA level of theory. Addition-ally, mean absolute errors (MAE) and mean absolute percentage errors (MAPE)with respect to benchmark DFPT calculations are given. TC2 benchmark ab-err (Bohr ) rel-err(%)CO α xx α yy α zz O α xx α yy α zz α xx α yy α zz α xx α yy α zz α xx α yy α zz H α xx α yy α zz H α xx α yy α zz Cl α xx α yy α zz α xx α yy α zz CO α xx α yy α zz O α xx α yy α zz H α xx α yy α zz α xx α yy α zz α xx α yy α zz H α xx α yy α zz α xx α yy α zz α for 16 molecules, as computed withthe presented TC2-CPSCF implementation at the LDA level of theory. Ad-ditionally, mean absolute errors (MAE) and mean absolute percentage errors(MAPE) with respect to benchmark DFPT calculations are given. T o t a l T i m e f o r P o l a r i za b iliti e s ( s ) number of atoms H(C H ) n H molecules (n=64-640)

TC2-CPSCFTraditional O(N ) Figure B.12: The total time of the calculation of the polarizabilities for iso-lated H(C2H4) n H molecules containing from 386 to 3842 atoms. Here we useNTPoly-ﬁlter (10 − ) and NTPoly-tolerance (10 − ) as the numerical thresholds.All calculations are performed on 36 CPU cores.) as the numerical thresholds.All calculations are performed on 36 CPU cores.