Efficient Hybrid Density Functional Calculations for Large Periodic Systems Using Numerical Atomic Orbitals
Peize Lin,† Xinguo Ren,∗,‡ and Lixin He∗,†
†CAS Key Laboratory of Quantum Information, University of Science and Technology of China, Hefei, Anhui 230026, China
‡Institute of Physics, Chinese Academy of Sciences, Beijing 100190, China
E-mail: [email protected]; [email protected]
Abstract
We present an efficient, linear-scaling implementation for building the (screened) Hartree-Fock exchange (HFX) matrix for periodic systems within the framework of numerical atomic orbital (NAO) basis functions. Our implementation is based on the localized resolution-of-the-identity approximation, by which two-electron Coulomb repulsion integrals can be obtained by computing only two-center quantities, a feature that is highly beneficial to NAOs. By exploiting the locality of the basis functions and efficiently prescreening the intermediate three- and two-index tensors, one can achieve a linear scaling of the computational cost for building the HFX matrix with respect to the system size. Our implementation is massively parallel, thanks to an MPI/OpenMP hybrid parallelization strategy for distributing the computational load and memory storage. All these factors together enable highly efficient hybrid functional calculations for large-scale periodic systems. In this work we describe the key algorithms and implementation details for the HFX build as implemented in the
ABACUS code package. The performance and scalability of our implementation with respect to the system size and the number of CPU cores are demonstrated for selected benchmark systems of up to 4096 atoms.

Hybrid density functionals (HDFs) belong to the fourth rung of the Jacob's ladder in Kohn-Sham (KS) density functional theory (DFT). On the one hand, compared to lower-rung local and semi-local approximations, HDFs, as implemented in the framework of generalized KS (gKS) theory, typically deliver better accuracy for ground-state properties and, additionally, provide physically more sound single-particle energy spectra. On the other hand, compared to the correlated methods (e.g., fifth-rung functionals, quantum chemistry methods, and Green-function based many-body approaches), HDFs also have clear advantages. This is not only because HDFs are significantly cheaper than the correlated methods, but also because they are able to offer a range of useful properties in a single calculation, including, e.g., the electron density, ground-state energies, atomic structures, and single-particle orbital energies. Not all of these properties are easily accessible from the more expensive many-body approaches. As such, HDFs have been widely used in quantum chemistry, and are becoming increasingly popular in computational materials science.

An essential feature of HDFs is that a portion of Hartree-Fock-type exchange (HFX) is incorporated in the construction of the exchange-correlation (XC) functional. Furthermore, in recently developed HDFs, the range-separation framework is often invoked, in which the HFX is decomposed into a short-range part and a long-range part, and different portions of the two parts are utilized in the functional constructions. Another interesting development is that the initially empirical parameters in HDFs are made system-dependent and determined on the fly, or chosen to satisfy certain physical constraints.
In particular, the concept of a double screening mechanism, involving both the dielectric and the metallic screening channels, is instructive for developing more refined next-generation HDFs.

Computationally, the build of the full-range or range-separated HFX matrix constitutes the major bottleneck of HDF calculations. The canonical scaling of the computational cost of this step is $O(N^4)$, with $N$ being the system size. Historically, efficient algorithms for computing the HFX matrix were first developed in quantum chemistry using atom-centered Gaussian-type orbitals (GTOs). Back then, $O(N^2)$-scaling and even linear-scaling algorithms for building the HFX matrix were designed, by exploiting the locality of the GTO basis functions and (for insulating systems) the sparsity of the density matrix. To extend the application of HDFs to condensed matter systems, it is customary to impose the periodic boundary condition (PBC) in the implementation. In practice, this means one needs to treat a rather large supercell, and handle additional complexities such as the translational symmetry with respect to the lattice vectors and the singularity of the Coulomb operator. Since the early periodic Hartree-Fock (HF) implementation in the Crystal code, other GTO-based HF and HDF implementations have been reported, with linear-scaling cost for building the HFX matrix. As the interest in HDFs grows in the solid-state community, more HDF implementations have been reported within the projector augmented wave (PAW), pseudopotential plane-wave (PW), and linearized augmented plane wave (LAPW) frameworks, where the PBC is automatically satisfied.
However, due to the extended nature of the PW and LAPW basis functions, the computational cost of the HFX in these implementations follows a canonical $O(N^3)$ scaling with the system size. By a transformation from the Bloch orbitals to a localized Wannier-function representation, numerical techniques can again be developed to achieve a linear-scaling numerical cost for the HFX matrix construction. Other techniques to speed up HDF calculations have also been developed in the PW basis context, including, e.g., the recently proposed adaptively compressed exchange operator technique.

In recent years, owing to their strict spatial locality and more realistic radial behavior, numerical atomic orbitals (NAOs) are gaining increasing popularity as the basis set choice in first-principles electronic structure calculations. For ground-state calculations based on conventional local and semilocal approximations, the reliability and efficiency of the NAO basis set have been well established, as demonstrated by the flourishing of NAO-based first-principles computer code packages.
In the case of HDFs, where the computation of two-electron Coulomb repulsion integrals (ERIs) is a necessity, NAO-based implementations have only appeared recently. Existing implementations circumvent a straightforward evaluation of the ERIs in terms of NAOs in one way or another, by expanding the NAOs in terms of GTOs, or by employing the resolution-of-the-identity (RI) technique (as well as its refined variant, the interpolative separable density fitting scheme). Within the RI approximation, also known as variational density fitting, the four-index ERIs are decomposed into three- and two-index ones, whereby the storage requirement and the computational cost are significantly reduced. For molecular systems, the accuracy and efficiency of the conventional RI approximation based on the Coulomb metric have been well established for HFX and explicitly correlated calculations using both GTOs and NAOs.

Despite its nice properties, such as the preservation of the positive definiteness of the ERI matrix and the vanishing of the expansion errors up to linear order, the conventional Coulomb-metric RI scheme is not particularly suitable for large molecules and extended systems. For these systems, the conventional RI scheme, which requires the computation of a very large number of three-center Coulomb integrals and the inversion of a large Coulomb matrix, can become prohibitively expensive. To deal with this problem, localized variants of the RI approach have been developed, where the products of two atomic orbitals (AOs) are expanded in terms of a limited set of auxiliary basis functions (ABFs) centering in the neighborhood of the two AO centers. Restricting the ABFs expanding the AO pair products to only the two centers where the two AOs are located, one obtains what we call the localized RI (LRI) scheme, or the so-called pair-atomic RI (PARI) approximation in the quantum chemistry literature. The LRI scheme is essential for enabling efficient NAO-based periodic HDF implementations.
In a recent paper, we benchmarked the accuracy of LRI for HDF calculations within a pseudopotential-based NAO framework. Using the Heyd-Scuseria-Ernzerhof (HSE) screened HDF as implemented in the ABACUS (Atomic-orbital Based Ab-initio Computation at UStc) code, we showed that, when paired with suitably constructed auxiliary basis sets, the errors in the computed band gap values incurred by LRI are below 10 meV for prototypical semiconductors and insulators. Such an error is significantly smaller than the errors stemming from other sources, such as the incompleteness of the one-electron basis set and the quality of the pseudopotentials. These tests indicate that, within the pseudopotential-based NAO framework, the LRI approximation is able to provide adequate accuracy for practical purposes.

In this paper, we describe the details of the HDF implementation within the ABACUS code package.
ABACUS is a recently developed first-principles software package that employs NAOs as basis functions and norm-conserving pseudopotentials to describe the ion cores. An economical, linear-scaling implementation for building the HFX matrix for large periodic systems has been carried out, utilizing several numerical algorithms, including the LRI scheme to compute the ERIs, an efficient prescreening of the intermediate quantities that form the HFX matrix, an elegant "back-folding" procedure to account for the translational symmetry, and a "memorization" technique for efficiently handling the two-center Coulomb integrals. Furthermore, depending on the actual size of the system to be simulated and the available resources, we use either a machine-scheduling or a K-means algorithm to distribute the computational load, resulting in a parallel program showing excellent efficiency up to thousands of CPU cores.

The outline of this paper is as follows: Section II briefly reviews the exact-exchange theory and the resolution-of-the-identity scheme. Section III describes all algorithms and techniques we have proposed to decrease the computing time and memory consumption. Section IV shows the benchmark results on the accuracy and performance of the implementation. Section V concludes this paper.

In the present work, the HFX matrix is evaluated within the framework of the linear combination of atomic orbitals (LCAO). Here, we explicitly consider periodic systems, and an AO $\phi_i(\mathbf r)$ centering at an atom $\tilde I$ within a unit cell $\mathbf R_I$ is denoted as

$$\phi_{\tilde I(\mathbf R_I) i}(\mathbf r) \overset{\text{def}}{=} \phi_i(\mathbf r - \tau_{\tilde I} - \mathbf R_I) \qquad (1)$$

where $\tau_{\tilde I}$ is the position of the atom $\tilde I$ within the unit cell. To simplify the notation, we omit the lattice vector index $\mathbf R_I$ in the following and denote $\phi_{\tilde I(\mathbf R_I) i}(\mathbf r) = \phi_{Ii}(\mathbf r)$, but keep in mind that different atoms might be located in different unit cells.
In the present paper, we choose to adopt a notational system that is slightly different from the one used in Ref. 58, by which the indices of the atoms are clearly indicated. As will become clear later, by specifying the individual atoms from which the basis functions originate, our algorithm for evaluating the HFX matrix can be better explained.

Within the LCAO formulation, the HFX matrix for a periodic system is formally given by

$$\Sigma^X_{Ii,Jj} = \sum_{K,L} \sum_{k\in K,\, l\in L} (\phi_{Ii}\phi_{Kk}|\phi_{Jj}\phi_{Ll})\, D_{Kk,Ll} \qquad (2)$$

where we use the abbreviation

$$(f|g) \overset{\text{def}}{=} \iint f(\mathbf r)\, v(\mathbf r-\mathbf r')\, g(\mathbf r')\, d\mathbf r\, d\mathbf r' \qquad (3)$$

to denote the interaction integral between two functions $f(\mathbf r)$ and $g(\mathbf r)$ via a potential $v(\mathbf r-\mathbf r')$, and $D_{Kk,Ll}$ is the density matrix in real space,

$$D_{Kk,Ll} = \frac{1}{N_k} \sum_{n,\,\mathbf k \in \mathrm{1BZ}} f_{n\mathbf k}\, c_{k,n}(\mathbf k)\, c^*_{l,n}(\mathbf k)\, e^{-i\mathbf k\cdot(\mathbf R_K - \mathbf R_L)}. \qquad (4)$$

In eq (4), $c_{k,n}(\mathbf k)$ are the KS eigenvectors, $f_{n\mathbf k}$ are the occupation numbers, and $N_k$ is the number of $\mathbf k$ points in the first Brillouin zone (1BZ). Furthermore, the lattice vectors $\mathbf R_K$ and $\mathbf R_L$ label the unit cells where the atoms $K$ and $L$ are located, respectively. Depending on the choice of the potential $v(\mathbf r-\mathbf r')$ in eq (3), one can obtain either the full-range HFX (when $v(\mathbf r-\mathbf r') = 1/|\mathbf r-\mathbf r'|$) or the short-range HFX (e.g., $v(\mathbf r-\mathbf r') = \mathrm{erfc}(\mu|\mathbf r-\mathbf r'|)/|\mathbf r-\mathbf r'|$, with $\mathrm{erfc}(x)$ being the complementary error function and $\mu$ a range-separation parameter). Since the algorithm presented below does not depend on the specific form of $v(\mathbf r-\mathbf r')$, we do not distinguish full-range from short-range HFX, unless this turns out to be necessary.

In this work, we use the LRI approximation, also known as PARI, to evaluate the ERIs.
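As a concrete illustration of eq (4), the sketch below builds the real-space density matrix for a toy one-dimensional two-band chain. This is not ABACUS code; the dimerized model Hamiltonian, the k-grid size, and the single occupied band are assumptions made purely for demonstration.

```python
import numpy as np

nk, norb, nocc = 8, 2, 1                       # toy sizes (assumptions)
kgrid = 2 * np.pi * np.arange(nk) / nk         # uniform 1D k-point grid

def ham(k):
    # SSH-like dimerized hopping keeps the toy model gapped at every k
    t = 1.0 + 0.5 * np.exp(-1j * k)
    return np.array([[0.0, t], [np.conj(t), 0.0]])

def density_matrix(dR):
    """D_{k,l}(R_K - R_L) = (1/N_k) sum_{n,k} f_{nk} c_{k,n} c*_{l,n} e^{-ik(R_K-R_L)}."""
    D = np.zeros((norb, norb), dtype=complex)
    for k in kgrid:
        _, c = np.linalg.eigh(ham(k))          # eigh sorts eigenvalues ascending
        occ = c[:, :nocc]                      # occupied subspace, f_{nk} = 1
        D += (occ @ occ.conj().T) * np.exp(-1j * k * dR)
    return D / nk

D0 = density_matrix(0)                         # on-site block, trace = electrons/cell
```

The on-site block carries the electron count per cell, and the magnitude of `density_matrix(dR)` decays with the cell separation `dR` for this gapped model, which is the real-space decay exploited for screening later in the paper.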
Within LRI, the product of two NAOs centering on two atoms is approximately expanded as

$$\phi_{Ii}(\mathbf r)\phi_{Kk}(\mathbf r) \approx \sum_{A=\{I,K\}} \sum_{\alpha\in A} C^{A\alpha}_{Ii,Kk}\, P_{A\alpha}(\mathbf r) = \sum_{\alpha\in I} C^{I\alpha}_{Ii,Kk}\, P_{I\alpha}(\mathbf r) + \sum_{\alpha\in K} C^{K\alpha}_{Kk,Ii}\, P_{K\alpha}(\mathbf r) \qquad (5)$$

where, similarly to the AO case (eq (1)),

$$P_{A\alpha}(\mathbf r) \overset{\text{def}}{=} P_{\tilde A(\mathbf R_A)\alpha}(\mathbf r) = P_\alpha(\mathbf r - \tau_{\tilde A} - \mathbf R_A) \qquad (6)$$

and $C^{A\alpha}_{Ii,Kk}$ with $A=\{I,K\}$ are the two-center, three-index expansion coefficients. The lower (Roman) and upper (Greek) indices of the $C$ coefficients denote the AOs and the ABFs, respectively. The essence of eq (5) is that the ABFs are restricted to either the atom $I$ or the atom $K$, on which the two AOs are centering. It should be noted that the expansion coefficients $C$, although carrying three orbital indices, can be obtained in terms of two-center integrals, which may be evaluated rather efficiently using the algorithm developed by Talman for NAOs.

Using eq (5), the computational expression of the HFX in eq (2) naturally splits into four pieces,

$$\Sigma^X_{Ii,Jj} \approx \sum_{K,L}\sum_{k\in K,\,l\in L}\Bigg[\sum_{\alpha\in I,\beta\in J} C^{I\alpha}_{Ii,Kk}(P_{I\alpha}|P_{J\beta})C^{J\beta}_{Jj,Ll} + \sum_{\alpha\in I,\beta\in L} C^{I\alpha}_{Ii,Kk}(P_{I\alpha}|P_{L\beta})C^{L\beta}_{Jj,Ll} + \sum_{\alpha\in K,\beta\in J} C^{K\alpha}_{Ii,Kk}(P_{K\alpha}|P_{J\beta})C^{J\beta}_{Jj,Ll} + \sum_{\alpha\in K,\beta\in L} C^{K\alpha}_{Ii,Kk}(P_{K\alpha}|P_{L\beta})C^{L\beta}_{Jj,Ll}\Bigg] D_{Kk,Ll} \qquad (7)$$

where $(P_{A\alpha}|P_{B\beta})$ with $A=\{I,K\}$ and $B=\{J,L\}$ are the (screened) Coulomb integrals between two ABFs, defined by eq (3).
For notational convenience, below we also denote the (screened) Coulomb integrals as $V_{A\alpha,B\beta} \overset{\text{def}}{=} (P_{A\alpha}|P_{B\beta})$.

In order to present our algorithm in a clear way, we re-express eq (7) as

$$\Sigma^X_{Ii,Jj} \approx \sum_{K,L} \Big[ H^X_{\underline{I}i,K|\underline{J}j,L} + H^X_{\underline{I}i,K|Jj,\underline{L}} + H^X_{Ii,\underline{K}|\underline{J}j,L} + H^X_{Ii,\underline{K}|Jj,\underline{L}} \Big] \qquad (8)$$

where each of the four terms in eq (8) stems from the corresponding one in eq (7), e.g.,

$$H^X_{\underline{I}i,K|\underline{J}j,L} \overset{\text{def}}{=} \sum_{k\in K,\,l\in L} \sum_{\alpha\in I,\beta\in J} C^{I\alpha}_{Ii,Kk}\, V_{I\alpha,J\beta}\, C^{J\beta}_{Jj,Ll}\, D_{Kk,Ll} = \sum_{k\in K,\,l\in L} [\phi_{\underline{I}i}\phi_{Kk}|\phi_{\underline{J}j}\phi_{Ll}]\, D_{Kk,Ll} \qquad (9)$$

with the underlined atomic indices highlighting the two atoms on which the ABFs are located. Furthermore, for the $H^X$ objects, the atoms $K$ and $L$ do not carry AO indices $k,l$, since these have been contracted out with the density matrix. In eq (9), we have introduced partial two-electron Coulomb integrals denoted by square brackets,

$$[\phi_{\underline{A}a}\phi_{Ff}|\phi_{\underline{B}b}\phi_{Gg}] \overset{\text{def}}{=} \sum_{\alpha\in A,\beta\in B} C^{A\alpha}_{Aa,Ff}\, V_{A\alpha,B\beta}\, C^{B\beta}_{Bb,Gg} \qquad (10)$$

where
$A, B, F, G$ are atomic indices and $a, b, f, g$ are AOs centering on these atoms. Again, the underscored indices for atoms $A$ and $B$ indicate that these are the atoms on which the ABFs are located. Note that exchanging the positions of $A$ and $F$, or of $B$ and $G$, in the square bracket does not change the value of the integral, namely, $[\phi_{\underline{A}a}\phi_{Ff}|\phi_{\underline{B}b}\phi_{Gg}] = [\phi_{Ff}\phi_{\underline{A}a}|\phi_{\underline{B}b}\phi_{Gg}] = [\phi_{\underline{A}a}\phi_{Ff}|\phi_{Gg}\phi_{\underline{B}b}] = [\phi_{Ff}\phi_{\underline{A}a}|\phi_{Gg}\phi_{\underline{B}b}]$. In this work, the capital letters $I, J$ are reserved for the atoms to which the AO indices of the HFX matrix belong, $K, L$ for the atoms to which the AO indices of the density matrix belong, whereas $A, B$ refer to the atoms on which the ABFs are sitting, and $F, G$ to the atoms without ABFs. According to the definition given by eq (10), one can easily see that the full ERIs are given by the sum of four partial ERIs,

$$(\phi_{Ii}\phi_{Kk}|\phi_{Jj}\phi_{Ll}) \approx [\phi_{\underline{I}i}\phi_{Kk}|\phi_{\underline{J}j}\phi_{Ll}] + [\phi_{\underline{I}i}\phi_{Kk}|\phi_{Jj}\phi_{\underline{L}l}] + [\phi_{Ii}\phi_{\underline{K}k}|\phi_{\underline{J}j}\phi_{Ll}] + [\phi_{Ii}\phi_{\underline{K}k}|\phi_{Jj}\phi_{\underline{L}l}]. \qquad (11)$$

However, as detailed below, in our practical implementation we never need to form the full ERIs according to eq (11). Rather, the $H^X$ objects, given by the contraction of the partial ERIs with the density matrix, are used to build the HFX matrix. This choice is made for the sake of designing highly efficient parallel algorithms.

Equations (8)-(10) serve as the mathematical foundation for designing efficient practical algorithms to evaluate the HFX matrix. We remark that the error incurred by LRI, as given by eq (5), can be made sufficiently small, provided that suitable auxiliary basis sets can be constructed. The accuracy aspect of our LRI-based HDF implementation has been benchmarked in a previous paper. In the present paper, we focus on the implementation details, as well as the efficiency and the scaling behavior of the implementation.
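The bookkeeping of eqs (10) and (11) can be checked numerically with random tensors standing in for $C$ and $V$ (all dimensions and values below are arbitrary assumptions): the four partial ERIs, each built from purely two-center data, add up to the RI-approximated full ERI.

```python
import numpy as np

rng = np.random.default_rng(0)
nao, nabf = 2, 3
# random stand-ins for C^{A alpha}_{..} and V_{A alpha, B beta} on a fixed
# atom quadruplet (I, K | J, L); C is symmetrized in its two AO indices,
# since the pair product phi_i * phi_k does not depend on the AO order
C = {}
for A in "IKJL":
    c = rng.standard_normal((nabf, nao, nao))
    C[A] = c + c.transpose(0, 2, 1)
V = {(A, B): rng.standard_normal((nabf, nabf)) for A in "IK" for B in "JL"}

def partial_eri(A, B):
    """[phi_Ii phi_Kk | phi_Jj phi_Ll] with the ABFs on atoms A and B (eq (10))."""
    return np.einsum("aik,ab,bjl->ikjl", C[A], V[(A, B)], C[B])

# eq (11): the full (RI-approximated) ERI is the sum of the four partial ERIs
full = sum(partial_eri(A, B) for A in "IK" for B in "JL")
```

With the symmetrized `C`, the permutation symmetry stated below eq (10) appears as invariance of `full` under swapping the AO indices within either pair of its four indices.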
Implementation Details
As mentioned above, the key equations behind our implementation are eqs (8)-(10), on which our algorithm for building the HFX matrix is based. To achieve good efficiency and scalability, specific details need to be considered, such as how to design the loop structure, how to screen out the ERIs below a pre-given threshold, and how to distribute the computational load and memory storage over the MPI (Message Passing Interface) tasks. These details are discussed in the present section.

The loop structure for a straightforward evaluation of the HFX matrix is presented in Algorithm 1. In this algorithm, one first loops over the atomic indices of the HFX matrix, i.e., the atom pair ⟨I, J⟩. Here the translational symmetry of periodic systems can be taken into account by restricting the atom I to the central unit cell (i.e., setting R_I = 0), while allowing the atom J to be placed in the entire supercell. For each atom pair ⟨I, J⟩, one goes through all the atoms K in the neighborhood of the atom I (denoted as K ∈ N[I]), and all the atoms L in the neighborhood of J (i.e., L ∈ N[J]). Within the loop over atoms K and L, one can form all the partial ERIs belonging to these four atoms, namely, [φ_Ii φ_Kk | φ_Jj φ_Ll] with the ABFs placed on each of the four combinations of {I, K} and {J, L}. However, one may choose not to form the full ERIs according to eq (11) here. This is because, in parallel computing, these partial ERIs are evaluated and stored on different processes, and forming the full ERIs would require a significant amount of communication. Instead, they are contracted locally with the density matrix, ending up with the much smaller H^X objects, as illustrated in Algorithm 1. Adding up these H^X objects for all ⟨K, L⟩ atom pairs, one obtains the full desired HFX matrix.

Such a straightforward calculation flow works, but it is not efficient. In fact, in Algorithm 1, each independent partial ERI is computed four times, which is a big waste. For example, as noted above, the partial ERI [φ_Ii φ_Kk | φ_Jj φ_Ll] is equal to [φ_Kk φ_Ii | φ_Jj φ_Ll] due to the permutation symmetry, but both need to be computed in Algorithm 1. The same happens when one swaps the J and L indices. This is because these partial ERIs cannot all be stored in memory at the same time, and hence they are computed on the fly within the ⟨I, J⟩ loop. As a consequence, only partial ERIs with a fixed ⟨I, J⟩ pair are available at a time. Thus, for I ≠ K, [φ_Kk φ_Ii | φ_Jj φ_Ll] always needs to be computed, since its equivalent [φ_Ii φ_Kk | φ_Jj φ_Ll] is no longer available, even if it has been computed before.

To overcome this redundancy issue, we switch to the revised loop structure presented in Algorithm 2. A key feature of Algorithm 2 is that each independent partial ERI is computed only once. To this end, instead of first looping over the indices of the HFX matrix, i.e., the ⟨I, J⟩ pair, we loop over the ⟨A, B⟩ pair, on which the ABFs are located. (Again, the translational symmetry can be accounted for by restricting A to the central unit cell, while allowing B to be located in the entire supercell.) Within the ⟨A, B⟩ loop, one computes and stores the partial ERIs [φ_Aa φ_Ff | φ_Bb φ_Gg] for all F ∈ N[A] and G ∈ N[B]. To maximize the usage of the partial ERIs that are present at a time, we contract the fourth-rank tensor [φ_Aa φ_Ff | φ_Bb φ_Gg] for a given set of atoms A, B, F, G with the density matrix blocks D_{Ff,Gg}, D_{Ff,Bb}, D_{Aa,Gg}, and D_{Aa,Bb}, ending up with contributions to different sectors of the HFX matrix, as illustrated in Algorithm 2. In this algorithm, the permutation symmetry of the ERIs is naturally incorporated, and each independent partial ERI is computed only once. The trick here is that, with the new loop structure, we can contract the partial ERIs with different subblocks of the density matrix originating from different pairs of atoms. When the loop over ⟨A, B⟩ is complete, the full HFX matrix is obtained. This redesign of the loop structure speeds up the calculations by roughly a factor of 4.

In addition to the improved efficiency, Algorithm 2 is also advantageous for the memory usage in parallel computing. To understand this point, we note that the atom pairs at the outermost loop are distributed over the MPI processes. Specifically, in Algorithm 1, the ⟨I, J⟩ pairs are distributed, and for a given ⟨I, J⟩ pair, the information for all K ∈ N[I] and L ∈ N[J] is stored locally. This includes subblocks of both the Coulomb matrix V_{Kα,Lβ} and the density matrix D_{Kk,Ll} originating from the K and L atoms.

Algorithm 1: Principal loop structure based on ⟨I, J⟩ atomic pairs
for all ⟨I, J⟩ do
  for all K ∈ N[I], L ∈ N[J] do
    Calculate [φ_Ii φ_Kk | φ_Jj φ_Ll] for the four ABF placements (I,J), (I,L), (K,J), (K,L)
    H^X_{Ii,K|Jj,L} = Σ_{k∈K,l∈L} [φ_Ii φ_Kk | φ_Jj φ_Ll] · D_{Kk,Ll}   (for each ABF placement)
    Σ^X_{Ii,Jj} += H^X_{Ii,K|Jj,L}   (for each ABF placement)
  end for
end for

Algorithm 2: Principal loop structure based on ⟨A, B⟩ atomic pairs
for all ⟨A, B⟩ do
  for all F ∈ N[A], G ∈ N[B] do
    Calculate [φ_Aa φ_Ff | φ_Bb φ_Gg]
    H^X_{Aa,F|Bb,G} = Σ_{f∈F,g∈G} [φ_Aa φ_Ff | φ_Bb φ_Gg] · D_{Ff,Gg}
    H^X_{Aa,F|B,Gg} = Σ_{f∈F,b∈B} [φ_Aa φ_Ff | φ_Bb φ_Gg] · D_{Ff,Bb}
    H^X_{A,Ff|Bb,G} = Σ_{a∈A,g∈G} [φ_Aa φ_Ff | φ_Bb φ_Gg] · D_{Aa,Gg}
    H^X_{A,Ff|B,Gg} = Σ_{a∈A,b∈B} [φ_Aa φ_Ff | φ_Bb φ_Gg] · D_{Aa,Bb}
    Σ^X_{Aa,Bb} += H^X_{Aa,F|Bb,G}
    Σ^X_{Aa,Gg} += H^X_{Aa,F|B,Gg}
    Σ^X_{Ff,Bb} += H^X_{A,Ff|Bb,G}
    Σ^X_{Ff,Gg} += H^X_{A,Ff|B,Gg}
  end for
end for

Since the same atom pair ⟨K, L⟩ can be a neighbor of different ⟨I, J⟩ pairs, there will be a duplication of ⟨K, L⟩ atom pairs, and hence a duplication of the V_{Kα,Lβ} and D_{Kk,Ll} matrices over the MPI processes, but there is no duplication of the Σ^X_{Ii,Jj} matrix. Following a similar line of reasoning, in Algorithm 2 the D_{Ff,Gg} and Σ^X_{Ff,Gg} matrices are duplicated, but there is no duplication of the V_{Aα,Bβ} matrix. Since the number of ABFs is several times larger than that of the AOs, the size of V is much larger than that of the Σ^X matrix. Thus Algorithm 2 consumes less memory than Algorithm 1, which is its second advantage.
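The equivalence of the two loop orders can be verified on a toy cluster with random stand-in tensors (all sizes and values below are assumptions; there is no periodicity and every atom is treated as a neighbor of every other). Each partial ERI is formed once in the ⟨A, B⟩-driven loops, yet the accumulated Σ^X coincides with a direct evaluation of eq (2) via the eq (11) decomposition.

```python
import numpy as np

rng = np.random.default_rng(1)
atoms, nao, nabf = range(3), 2, 2
# random stand-ins for the LRI data of a small cluster:
#   C[(A,F)][x,a,f] ~ C^{Ax}_{Aa,Ff},  V[(A,B)] ~ V_{Ax,By},
#   D[(K,L)]        ~ density-matrix block D_{Kk,Ll}
C = {(A, F): rng.standard_normal((nabf, nao, nao)) for A in atoms for F in atoms}
V = {(A, B): rng.standard_normal((nabf, nabf)) for A in atoms for B in atoms}
D = {(K, L): rng.standard_normal((nao, nao)) for K in atoms for L in atoms}

def partial(A, F, B, G):
    """[phi_Aa phi_Ff | phi_Bb phi_Gg] of eq (10); output axes (a, f, b, g)."""
    return np.einsum("xaf,xy,ybg->afbg", C[(A, F)], V[(A, B)], C[(B, G)])

# reference: direct eq (2), each full ERI decomposed as in eq (11)
ref = {(I, J): np.zeros((nao, nao)) for I in atoms for J in atoms}
for I in atoms:
    for J in atoms:
        for K in atoms:
            for L in atoms:
                ref[(I, J)] += np.einsum("ikjl,kl->ij", partial(I, K, J, L), D[(K, L)])
                ref[(I, J)] += np.einsum("iklj,kl->ij", partial(I, K, L, J), D[(K, L)])
                ref[(I, J)] += np.einsum("kijl,kl->ij", partial(K, I, J, L), D[(K, L)])
                ref[(I, J)] += np.einsum("kilj,kl->ij", partial(K, I, L, J), D[(K, L)])

# Algorithm 2: loop over ABF pairs <A,B>; each partial ERI is formed once and
# contracted with four density blocks, updating four sectors of Sigma^X
sigma = {(I, J): np.zeros((nao, nao)) for I in atoms for J in atoms}
for A in atoms:
    for B in atoms:
        for F in atoms:
            for G in atoms:
                p = partial(A, F, B, G)
                sigma[(A, B)] += np.einsum("afbg,fg->ab", p, D[(F, G)])
                sigma[(A, G)] += np.einsum("afbg,fb->ag", p, D[(F, B)])
                sigma[(F, B)] += np.einsum("afbg,ag->fb", p, D[(A, G)])
                sigma[(F, G)] += np.einsum("afbg,ab->fg", p, D[(A, B)])
```

The two accumulations agree block by block, illustrating why the redesign removes the fourfold redundancy without changing the result.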
Figure 1: Pictorial illustration of Algorithm 2. The four dots denote four atoms A, B, F, and G, where A and B are the atoms on which the ABFs are located, and hence they are connected by a V (interaction) line. The atoms F and G sit in the neighborhoods of A and B, respectively, represented by two circles centered at A and B. The atoms A and F, as well as B and G, are connected by the C (expansion coefficient) lines. The green dashed lines denote the density matrix between two atoms. For periodic systems the atom A is restricted to the central unit cell, whereas the atom B may be located anywhere in the entire supercell.

A pictorial illustration of Algorithm 2 is presented in Fig. 1, where the atoms are denoted by dots and the expansion coefficients C, the Coulomb matrix V, and the density matrix D are represented by lines. Finally, we remark that the flowcharts presented in Algorithms 1 and 2 are mainly used to elucidate our choice of the major loop structure of the implementation. In the practical implementation, it is in fact not necessary (and too expensive) to explicitly form the partial ERIs [φ_Aa φ_Ff | φ_Bb φ_Gg]. As shown in Appendix A.1, the most efficient way to proceed is to first contract the Coulomb matrix V with the expansion coefficients C, and then multiply the resultant quantity with the density matrix D. In a final step, the HFX matrix is obtained by multiplying C with the product of (V C) and D; namely, Σ^X ∼ C((V C)D). An illustration of this refined procedure, as adopted in our practical implementation, is given in Algorithm 3.
Algorithm 3: Refined loop structure adopted in the practical implementation. Here Y = V·C and T = Y·D are temporary rank-3 tensors.
for all ⟨A, B⟩ do
  for all F ∈ N[A], G ∈ N[B] do
    Y^{Aα}_{Bb,Gg} = Σ_{β∈B} V_{Aα,Bβ} · C^{Bβ}_{Bb,Gg}
    T^{Aα}_{Ff,Bb} = Σ_{g∈G} Y^{Aα}_{Bb,Gg} · D_{Ff,Gg}
    T^{Aα}_{Ff,Gg} = Σ_{b∈B} Y^{Aα}_{Bb,Gg} · D_{Ff,Bb}
    T^{Aα}_{Aa,Bb} = Σ_{g∈G} Y^{Aα}_{Bb,Gg} · D_{Aa,Gg}
    T^{Aα}_{Aa,Gg} = Σ_{b∈B} Y^{Aα}_{Bb,Gg} · D_{Aa,Bb}
    H^X_{Aa,F|Bb,G} = Σ_{f∈F} Σ_{α∈A} C^{Aα}_{Aa,Ff} · T^{Aα}_{Ff,Bb}
    H^X_{Aa,F|B,Gg} = Σ_{f∈F} Σ_{α∈A} C^{Aα}_{Aa,Ff} · T^{Aα}_{Ff,Gg}
    H^X_{A,Ff|Bb,G} = Σ_{a∈A} Σ_{α∈A} C^{Aα}_{Aa,Ff} · T^{Aα}_{Aa,Bb}
    H^X_{A,Ff|B,Gg} = Σ_{a∈A} Σ_{α∈A} C^{Aα}_{Aa,Ff} · T^{Aα}_{Aa,Gg}
    Σ^X_{Aa,Bb} += H^X_{Aa,F|Bb,G}
    Σ^X_{Aa,Gg} += H^X_{Aa,F|B,Gg}
    Σ^X_{Ff,Bb} += H^X_{A,Ff|Bb,G}
    Σ^X_{Ff,Gg} += H^X_{A,Ff|B,Gg}
  end for
end for
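For a single atom quadruplet, the factorization Σ^X ∼ C((VC)D) of Algorithm 3 can be checked against the unfactorized contraction with random stand-in tensors (all dimensions and values are assumptions). The point of the factorization is that only rank-3 intermediates appear.

```python
import numpy as np

rng = np.random.default_rng(2)
nao, nabf = 3, 4
C_AF = rng.standard_normal((nabf, nao, nao))   # C^{Ax}_{Aa,Ff}
C_BG = rng.standard_normal((nabf, nao, nao))   # C^{By}_{Bb,Gg}
V_AB = rng.standard_normal((nabf, nabf))       # V_{Ax,By}
D_FG = rng.standard_normal((nao, nao))         # D_{Ff,Gg}

# unfactorized: form the partial ERI first, then contract with D (eq (9))
hx_direct = np.einsum("xaf,xy,ybg,fg->ab", C_AF, V_AB, C_BG, D_FG, optimize=True)

# Algorithm 3: Y = V*C, T = Y*D, H^X = C*T; no rank-4 intermediate is formed
Y = np.einsum("xy,ybg->xbg", V_AB, C_BG)       # Y^{Ax}_{Bb,Gg}
T = np.einsum("xbg,fg->xfb", Y, D_FG)          # T^{Ax}_{Ff,Bb}
hx_fact = np.einsum("xaf,xfb->ab", C_AF, T)    # H^X_{Aa,F|Bb,G}
```

Each step is an ordinary dense matrix multiplication over merged indices, which is why the scheme maps well onto BLAS, as discussed later in the text.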
As discussed above, the most time-consuming step in self-consistent HDF calculations is the construction of the HFX matrix. The building blocks of the HFX matrix are the H^X objects introduced in eqs (8) and (9), which are given schematically by H^X = CVCD, where C, V, and D denote, respectively, the expansion coefficient matrix (eq (5)), the (screened) Coulomb matrix, and the density matrix. A straightforward evaluation of all H^X objects scales as $O(N_{\mathrm{at}}^4)$, where $N_{\mathrm{at}}$ is the number of atoms in the system. However, due to the locality of the NAOs and, in the case of insulators, the rapidly decaying behavior of the density matrix in real space, only the exchange interactions within a limited spatial range are needed. In fact, a large portion of the elements of the matrix H^X, as well as those of the C, V, and D matrices, are extremely small or even strictly zero. Eliminating these small matrix elements has little effect on the obtained results, but can save a lot of computation time and memory. As will be shown below, by exploiting the sparsity of these matrices, one can achieve a linear scaling of the computational cost for evaluating the Σ^X matrices, and hence a linear-scaling build of the HFX matrix.

C

Let us first look at the C matrix. Because the NAOs have a finite cutoff radius, their overlap, and consequently the expansion coefficients C^{Aα}_{Aa,Ff}, become strictly zero if the distance between the two atoms A and F is larger than the sum of the cutoff radii of the AO basis functions a and f, as illustrated in Fig. 1. When discussing Algorithms 1 and 2 we already introduced the notion of neighboring atoms. Here we define the neighboring atoms F of an atom A as those for which the expansion coefficients C^{Aα}_{Aa,Ff} are non-zero, i.e., N[A] := { F | C^{Aα}_{Aa,Ff} ≠ 0, ∀ a ∈ A, f ∈ F, α ∈ A }. The matrix elements of C originating from atom pairs that are not "neighbors" are strictly zero and are excluded from the outset of the calculations.
In addition, there are C^{Aα}_{Aa,Ff} elements which are sufficiently small to be negligible without affecting the obtained results. As detailed below, these can be filtered out to further save computation time and memory.

V

The matrix elements of the (screened) Coulomb interaction between the ABFs are calculated as

$$V_{A\alpha,B\beta} = (P_{A\alpha}|P_{B\beta}) = \langle P_{A\alpha}|Q_{B\beta}\rangle \qquad (12)$$

where $Q_{B\beta}(\mathbf r) = \int v(\mathbf r-\mathbf r')\, P_{B\beta}(\mathbf r')\, d\mathbf r'$, and $\langle f|g\rangle$ denotes the overlap integral between two functions $f(\mathbf r)$ and $g(\mathbf r)$.

For the bare Coulomb potential $v(\mathbf r,\mathbf r') = 1/|\mathbf r-\mathbf r'|$, $Q_\beta(\mathbf r)$ has a rather long range and the sparsity of the V matrix is low. In fact, in this case certain elements of the Coulomb matrix diverge at the Γ point when Fourier transformed to k space. Well-established procedures exist to deal with this so-called Γ-point singularity, and we do not discuss this issue in the present paper. For screened HDFs such as HSE, one employs the short-ranged (screened) Coulomb interaction $v(\mathbf r,\mathbf r') = \mathrm{erfc}(\mu|\mathbf r-\mathbf r'|)/|\mathbf r-\mathbf r'|$, and hence $Q_\beta(\mathbf r)$ is short-ranged. In this case, one can introduce a finite cutoff radius $R_Q$ for $Q_\beta(\mathbf r)$, beyond which the elements of the (screened) Coulomb matrix become sufficiently small and can be neglected. Of course, the sparsity of the obtained V matrix depends on the screening parameter $\mu$.

D

The density matrix D_{Kk,Ll} of an insulator decays exponentially as a function of the distance between the two atoms.
Therefore, if two atoms are far apart, the value of D_{Kk,Ll} may be small enough to be negligible. However, for metallic systems, the density matrix decays only polynomially, and its sparsity is much reduced. Similarly to the case of C, the insignificant elements of D are also filtered out, as detailed below.

H^X matrices

The H^X matrices are given by the product of C, V, and D, and hence their sparsity originates from these individual matrices. To exploit the sparsity of these matrices and to make the best use of BLAS matrix multiplications, which are most suitable for dense matrices, an optimal strategy is to organize and store the elements of these matrices as sparsely distributed dense subblocks. With this in mind, in our implementation all four tensors are stored as aggregates of small dense block tensors. For example, the global matrix C^{Aα}_{Aa,Ff} can be represented as a supermatrix C_{A,F}, which itself is a sparse matrix, while its non-zero elements are dense rank-3 tensors C^α_{a,f}. Similarly, the global V and D matrices can be seen as atom-pair based supermatrices V_{A,B} and D_{K,L} (K = {A, F}, L = {B, G}; see Algorithm 2), with each element of them being a rank-2 tensor V_{α,β} or D_{k,l}, respectively.

To exploit the sparsity of C mentioned above, an upper limit

$$\mathcal{C}_{A,F} \overset{\text{def}}{=} \max_{a\in A,\, f\in F,\, \alpha\in A} |C^{A\alpha}_{Aa,Ff}| \qquad (13)$$

is introduced for each block C_{A,F}. In the practical implementation, one can introduce a finite threshold parameter ε_C. If $\mathcal{C}_{A,F} \le \varepsilon_C$, the entire block C_{A,F} (i.e., the contribution from the
⟨A, F⟩ atom pair) is disregarded. Similarly, an upper bound $\mathcal{D}_{K,L}$ can be defined for each block D_{K,L}, and the density matrix blocks with $\mathcal{D}_{K,L} \le \varepsilon_D$ are discarded. The filtering of negligible blocks of the V matrix is controlled by the cutoff radius R_Q. The influence of ε_C, ε_D, and R_Q on the accuracy of the obtained results and on the computational efficiency will be discussed in Sec. 4.1.1.

Due to the strict locality of the supermatrix C_{A,F} (C_{B,G}), the atom F (G) is constrained to the neighboring atoms of A (B). Furthermore, as long as either the Coulomb matrix (in the case of screened HDFs) or the density matrix (in the case of insulators) is short-ranged, only atom pairs ⟨A, B⟩ within a certain range contribute to the H^X matrices. In other words, for a given atom A, the numbers of neighboring atoms B, F, and G contributing to the final HFX matrix are independent of the system size, which warrants a linear-scaling computational cost for evaluating the HFX matrix, at least for screened HDFs or for insulating systems.

According to the formal relationship H^X ∼ CVCD, it is obvious that the sparsity of H^X comes from two sources. First, the zeros of the individual blocks C_{A,F}, V_{A,B}, and D_{K,L} directly lead to the sparsity of the H^X matrices. For example, for a given set of atoms A, B, F, G, the matrix block H^X_{Aa,F|Bb,G} is non-zero only if the four multiplier matrix blocks are all non-zero. Second, even if the elements of the component matrices exceed their own screening thresholds, certain elements of the resultant product matrix H^X may still be negligibly small. These small matrix elements can be efficiently screened out by making use of the Cauchy-Schwarz inequalities. The detailed screening procedures are discussed in Sec. 3.3.1 and Sec. 3.3.2.
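The block-wise filtering of eq (13) amounts to dropping every atom-pair block whose max-abs entry falls below the threshold. A sketch with random blocks follows; the block sizes, the distance-dependent magnitudes, and the value of ε_C are assumptions chosen to mimic the decay of C with interatomic distance.

```python
import numpy as np

rng = np.random.default_rng(3)
eps_C = 1e-3
# toy supermatrix: atom-pair blocks whose magnitude falls off rapidly with
# the "distance" |A - F|, mimicking the decay of C^{A alpha}_{Aa,Ff}
blocks = {(A, F): 10.0 ** (-2 * abs(A - F)) * rng.standard_normal((2, 3, 3))
          for A in range(4) for F in range(4)}

def screen(blocks, eps):
    """Keep only the blocks whose upper bound (eq (13)) exceeds the threshold."""
    return {pair: b for pair, b in blocks.items() if np.abs(b).max() > eps}

kept = screen(blocks, eps_C)
```

Here only near-neighbor pairs survive, so the number of retained blocks per atom saturates as the cluster grows, which is the mechanism behind the linear scaling discussed above.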
As discussed above, even after pre-screening the individual C, V, and D matrices, a portion of the H^X matrix elements is still extremely small and can be safely neglected without affecting the obtained results. Thus, it would be highly desirable if the insignificant elements of the H^X matrices could be efficiently filtered out without actually calculating them. This is achieved by estimating the upper bounds of these matrix elements based on the Cauchy-Schwarz inequality, as detailed below, and neglecting the elements below a pre-chosen threshold at an early stage of the calculations.

The upper bounds of the matrix elements of H^X are estimated in terms of the Cauchy-Schwarz inequality, according to which the product of two matrices A and B satisfies

    |tr[AB]| ≤ √(tr[A⁺A]) √(tr[B⁺B]) = ‖A‖ ‖B‖    (14)

where ‖A‖ = √(Σ_{ij} |a_{ij}|²) is the L2 norm of the matrix A. The Cauchy-Schwarz inequality can be extended straightforwardly to products of three and more matrices, e.g.,

    |tr[ABC]| ≤ √(tr[A⁺A]) √(tr[(BC)⁺(BC)]) = ‖A‖ √(tr[(B⁺B)(CC⁺)]) ≤ ‖A‖ √‖B⁺B‖ √‖CC⁺‖    (15)

and

    |tr[ABCD]| ≤ √(tr[(AB)⁺(AB)]) √(tr[(CD)⁺(CD)]) = √(tr[(A⁺A)(BB⁺)]) √(tr[(C⁺C)(DD⁺)]) ≤ √‖A⁺A‖ √‖BB⁺‖ √‖C⁺C‖ √‖DD⁺‖    (16)

Now we can apply the Cauchy-Schwarz inequality to eq (9) for a fixed set of atoms I, J, K, L, and obtain (symbolically)

    H^X_{Ii,K|Jj,L} = Σ_{k∈K,l∈L} Σ_{α∈I,β∈J} C^{Iα}_{Ii,Kk} V_{Iα,Jβ} C^{Jβ}_{Jj,Ll} D_{Kk,Ll}
                    = tr[C_i V C_j D]       ( ≤ √‖C_i C_i⁺‖ √‖V V⁺‖ √‖C_j C_j⁺‖ √‖D D⁺‖ )
                    = tr[C_i (V C_j) D]     ( ≤ √‖C_i C_i⁺‖ ‖V C_j‖ √‖D D⁺‖ )
                    = tr[C_i ((V C_j) D)]   ( ≤ ‖C_i‖ ‖(V C_j) D‖ )    (17)

where V = V^{I,J} and D = D^{K,L} here are blocks of the Coulomb matrix and the density matrix originating from the atomic pairs ⟨I,J⟩ and ⟨K,L⟩, respectively. C_i = C^{I}_{Ii,K} is a rank-2 tensor representing a sector of the triple expansion coefficients on the atom pair ⟨I,K⟩ (i.e., the rank-3 block tensor C^{I,K} introduced above, with the AO basis function i fixed). The crucial point in eq (17) is the different estimated upper bounds, given in parentheses, for the same quantity. These different estimates can be utilized to filter out the insignificant elements at different stages of the calculation, all together leading to a highly efficient screening algorithm.

As mentioned above and explained in Appendix A.1, the best order of matrix multiplication to obtain H^X is given by the last line of eq (17). The pseudocode of the accordingly designed screening algorithm for evaluating H^X is illustrated in Algorithm 4. In this algorithm, the actual working procedure goes as follows. We first calculate and store all needed quantities – √‖C_i C_i⁺‖, ‖C_i‖, √‖V V⁺‖, and √‖D D⁺‖ – in advance. During the actual calculation of the H^X matrices, one further evaluates the quantities ‖V C_j‖ and ‖(V C_j) D‖. The three upper bounds listed in eq (17) are calculated at appropriate locations within the calculation loops and compared to a pre-chosen threshold ε_CS-matrix. If any of the three upper bounds falls below the threshold, the corresponding elements of the H^X matrices are set to zero, without actually computing their values. In Sec. 4.1.2, we will present benchmark results regarding the influence of the threshold ε_CS-matrix on accuracy and computation cost.
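As an illustration of how the three bounds in eq (17) are used, the following Python sketch (dense NumPy blocks with hypothetical shapes, not the production code) evaluates one contribution tr[C_i((V C_j)D)] and bails out as soon as any Cauchy-Schwarz bound falls below the threshold:

```python
import numpy as np

def hx_block(Ci, V, Cj, D, eps):
    """Three-stage Cauchy-Schwarz screening of one H^X contribution,
    tr[C_i ((V C_j) D)], following the bounds of eq (17).  Returns 0.0 as
    soon as any upper bound drops below eps; shapes are assumed conformable
    so that Ci @ V @ Cj @ D is square."""
    norm = np.linalg.norm                        # Frobenius norm ‖.‖
    # stage 1: bound built entirely from precomputable norms of the factors
    b1 = (np.sqrt(norm(Ci @ Ci.conj().T)) * np.sqrt(norm(V @ V.conj().T))
          * np.sqrt(norm(Cj @ Cj.conj().T)) * np.sqrt(norm(D @ D.conj().T)))
    if b1 <= eps:
        return 0.0
    VCj = V @ Cj                                 # first multiplication
    # stage 2: tighter bound once ‖V C_j‖ is known
    b2 = np.sqrt(norm(Ci @ Ci.conj().T)) * norm(VCj) * np.sqrt(norm(D @ D.conj().T))
    if b2 <= eps:
        return 0.0
    VCjD = VCj @ D                               # second multiplication
    # stage 3: final bound ‖C_i‖ ‖(V C_j) D‖
    if norm(Ci) * norm(VCjD) <= eps:
        return 0.0
    return np.trace(Ci @ VCjD)                   # only computed if all bounds pass
```

The point of the staged bounds is that each test is only slightly more expensive than the products already formed, so nothing is multiplied out unless the looser bound has already survived.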
Algorithm 4 Cauchy-Schwarz (CS) inequality matrix screening with a pre-chosen threshold ε_CS-matrix

    if √‖C_i C_i⁺‖ √‖V V⁺‖ √‖C_j C_j⁺‖ √‖D D⁺‖ > ε_CS-matrix then
        Calculate V C_j and ‖V C_j‖
        if √‖C_i C_i⁺‖ ‖V C_j‖ √‖D D⁺‖ > ε_CS-matrix then
            Calculate (V C_j) D and ‖(V C_j) D‖
            if ‖C_i‖ ‖(V C_j) D‖ > ε_CS-matrix then
                Calculate tr[C_i ((V C_j) D)]
            end if
        end if
    end if

3.3.2 Cauchy-Schwarz inequality ERI screening

In addition to the screening criteria applied in Algorithm 4, one more criterion can be applied directly to the full ERIs. The Cauchy-Schwarz inequality can again be used to quickly estimate the upper bound of each ERI,

    (φ_Aa φ_Ff | φ_Bb φ_Gg) ≤ √(φ_Aa φ_Ff | φ_Aa φ_Ff) √(φ_Bb φ_Gg | φ_Bb φ_Gg).    (18)

In our implementation, we compute all the “diagonal” ERIs (φ_Aa φ_Ff | φ_Aa φ_Ff) and determine

    (φ_A φ_F | φ_A φ_F) := max_{a∈A, f∈F} (φ_Aa φ_Ff | φ_Aa φ_Ff)    (19)

beforehand, which can be done with relative ease since the number of these integrals scales linearly with the system size. Now, the upper bound of each ERI can be found easily, and those below a given threshold ε_CS-ERI can be disregarded without actually calculating them. In
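A minimal sketch of this two-step procedure (hypothetical dictionary-based bookkeeping, not the actual implementation): first precompute the pairwise maxima of eq (19), then test each atom quartet against eq (18):

```python
import math

def build_schwarz_bounds(diag_eri):
    """Pairwise maxima of eq (19): diag_eri maps (A, a, F, f) to the
    'diagonal' ERI (phi_Aa phi_Ff | phi_Aa phi_Ff); returns (A, F) -> max.
    The number of such entries scales linearly with the system size."""
    bounds = {}
    for (A, a, F, f), val in diag_eri.items():
        key = (A, F)
        bounds[key] = max(bounds.get(key, 0.0), val)
    return bounds

def quartet_needed(bounds, A, F, B, G, eps):
    """Schwarz test of eq (18): keep the atom quartet only if the ERI
    upper bound exceeds eps."""
    return math.sqrt(bounds[(A, F)] * bounds[(B, G)]) > eps
```

The quartet test is a dictionary lookup plus a square root, so it is far cheaper than computing the ERI it may eliminate.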
ABACUS, we actually only calculate (implicitly) the partial ERIs [φ_Aa φ_Ff | φ_Bb φ_Gg] and their three variant forms. However, this does not affect the use of the Cauchy-Schwarz inequality screening. Before calculating [φ_Aa φ_Ff | φ_Bb φ_Gg], we first use eq (18) to check whether (φ_Aa φ_Ff | φ_Bb φ_Gg) is needed or not. If not, it is unnecessary to calculate [φ_Aa φ_Ff | φ_Bb φ_Gg] and the other three partial ERIs which form the full ERI (eq (11)).

Combining the “matrix product screening” discussed in Sec. 3.3.1 and the ERI-based screening discussed above, the entire flowchart for evaluating the HFX matrix is presented in Algorithm 5. For a given atomic structure, the maximal “diagonal” ERI (φ_A φ_F | φ_A φ_F) is first determined for each atomic pair. These maxima are then used, at the beginning of the HFX calculation, to filter out the atomic quartets whose ERIs are below a threshold ε_CS-ERI. The “matrix product screening” is only applied to those atomic sets {A, B, F, G} which pass the ERI-based screening.

According to the algorithms described in Sec. 3.1, in particular Algorithm 2, our central parallelization strategy is to distribute the atom pairs ⟨
A, B⟩ over different CPU cores. For each ⟨A,B⟩ pair, we search the neighbouring atoms F and G within certain cut-off radii, and calculate the corresponding V_{Aα,Bβ}, C^{Aα}_{Aa,Ff}, and C^{Bβ}_{Bb,Gg} matrices. For a given atomic structure, these matrices are calculated once and stored in memory beforehand, since they depend only on the atomic structure and do not change during the self-consistent-field (SCF) cycles. The density matrices D_{Kk,Ll} and the HFX matrices Σ^X_{Ii,Jj}, on the other hand, are updated at each iteration of the SCF loop. The full density matrix D is calculated after diagonalizing the total Hamiltonian matrix H in a 2D cyclic-block form, and hence is initially also stored in the same distributed form as H. It is then redistributed over the ⟨A,B⟩ pairs via MPI communication routines, so that the needed matrix elements of D are locally available when building the HFX via Algorithm 2. Once the needed C^{Aα}_{Aa,Ff}, V_{Aα,Bβ}, C^{Bβ}_{Bb,Gg}, and D_{Kk,Ll} matrices are ready in each individual MPI process, the H^X matrices can be calculated independently without any communication, and the desired Σ^X_{Ii,Jj} matrix can then be assembled from the locally computed H^X matrices (cf. Algorithm 2) via only light communications.

Algorithm 5 Flowchart of the HFX evaluation program

    function Initialize                        ▷ performed once for all
        Construct ABFs
        Parallel task distribution
    end function

    function Calculate C and V                 ▷ performed at each step of ionic motion
        Calculate V for atomic pairs ⟨A,B⟩ with distance < R_Q
        Calculate C for blocks with C^{A,F} > ε_C
        Calculate (φ_Aa φ_Ff|φ_Aa φ_Ff) and determine (φ_A φ_F|φ_A φ_F) for CS-ERI
        Calculate ‖C‖, √‖CC⁺‖ and √‖VV⁺‖ for CS-matrix
    end function

    function Calculate exact-exchange          ▷ performed at each electronic step
        Transmit D for blocks with D^{K,L} > ε_D
        Calculate √‖DD⁺‖ for CS-matrix
        Calculate Σ^X
        Calculate E^X
        Transmit Σ^X
    end function

    function Calculate Σ^X
        for all ⟨A,B⟩ do
            for all F ∈ N[A], G ∈ N[B] do
                if (φ_A φ_F|φ_A φ_F)(φ_B φ_G|φ_B φ_G) > ε_CS-ERI then
                    if √‖CC⁺‖ √‖VV⁺‖ √‖CC⁺‖ √‖DD⁺‖ > ε_CS-matrix then
                        Calculate V C and ‖V C‖
                        if √‖CC⁺‖ ‖V C‖ √‖DD⁺‖ > ε_CS-matrix then
                            Calculate (V C) D and ‖(V C) D‖
                            if ‖C‖ ‖(V C) D‖ > ε_CS-matrix then
                                Calculate C ((V C) D)
                            end if
                        end if
                    end if
                end if
            end for
        end for
    end function

After the locally distributed Σ^X (based on atomic pairs) is calculated, it is transferred to the 2D cyclic-block form and added to the total Hamiltonian H. The major communication processes are illustrated in Fig. 2. The key feature of this parallelization algorithm is that only the relatively cheap Σ^X and D matrices need to be redistributed and communicated among MPI processes, whereas the more expensive C and V matrices are evenly distributed over the MPI tasks and need no data communication.
Figure 2: Sketch of the communication and data redistribution of the density matrix D and the HFX matrix Σ^X. The left-hand side indicates the 2D cyclic-block form of these two matrices, and the right-hand side indicates their local atom-pair based distribution.

In principle, this part of the calculation can be parallelized over up to M × N_at CPU cores, where N_at is the number of atoms in a single unit cell, and M is the number of neighbouring atoms determined by the range of the V matrix. The load balancing depends entirely on the distribution of the ⟨A,B⟩ pairs. A random distribution of the ⟨
A, B⟩ pairs may cause severe load imbalance in wall time as well as memory bottlenecks, in particular for non-uniform systems where the number of neighbouring atoms may vary greatly between atoms. To overcome this difficulty, we developed two distribution schemes to improve the parallel efficiency. The first, called “Machine-scheduling distribution”, improves the load balance of the computation time; the second, called “K-means distribution”, can be used to reduce the memory usage. Depending on the availability of computational resources, we choose one or the other algorithm to distribute the computational load for optimal performance. The parallelization of HFX calculations in
ABACUS is a hybrid scheme based on processes and threads: MPI is adopted for parallelization over processes, while OpenMP is employed for parallelization over threads. The ⟨A,B⟩ pairs are distributed to processes according to the distribution schemes discussed below, and each process then forks threads to claim and complete the distributed ⟨
A, B⟩ pairs, using OpenMP dynamic scheduling for load balance.
For each atomic pair ⟨
A, B⟩, the size of the corresponding H^X (~ CVCD) matrix is proportional to |N[A]| × |N[B]|, where |N[A]| and |N[B]| are the numbers of neighbouring atoms of atoms A and B, respectively. We can think of the atoms in the system as the vertices of an undirected graph, and of the atom pairs as its edges. The computational cost of each edge e = ⟨A,B⟩ is roughly N_e = |N[A]| × |N[B]|, which serves as the weight of the edge. We then try to distribute the weighted edges as evenly as possible among the CPU cores, so as to minimize the maximum computational load on any one core. This is a classical machine-scheduling problem. It is known that finding the global optimum of the machine-scheduling problem is NP-hard, and thus it is not feasible to obtain the global optimum for large systems. However, one can find an approximate solution with a greedy algorithm, and it can be proven that the maximum computational cost on one core given by this approximate solution is no more than twice that of the optimal solution. In practice, we find that the greedy algorithm presented in Algorithm 6 works very well. In Algorithm 6, S_p denotes the list of tasks (“edges” in this case) on process p, whereas W_p denotes the computational load (“weight”) on process p. The meanings of, and relationships between, N_e, S_p, and W_p are further illustrated graphically in Fig. 3.

Algorithm 6
The greedy algorithm of Machine-scheduling distribution

    function Machine-scheduling distribution
        for all processes p do                    ▷ initialization
            list of tasks S_p = ∅
            task load W_p = 0
        end for
        sort {e} in descending order of N_e       ▷ reduce imbalance
        for all tasks e do                        ▷ greedy assignment
            p′ = argmin_p W_p
            S_p′ = S_p′ ∪ {e}
            W_p′ = W_p′ + N_e
        end for
    end function

Figure 3: Graphical illustration of a process p, the list of tasks (“edges”) S_p on process p, the weight N_e of a given task (“edge” e), and the total weight W_p of the tasks on process p.

The “Machine-scheduling distribution” algorithm discussed above is suitable for achieving load balance in computation time. However, it is not memory friendly. To understand this point, we first briefly analyze the memory consumption of the major arrays in the parallel computation of the HFX matrix. The variation of the memory consumption upon increasing the number of compute nodes is a key factor affecting the scalability of the calculation. The sparsity of C in large systems is guaranteed by the locality of the NAOs, and its memory consumption increases linearly with the system size. In our implementation, the memory footprint of C on each compute node can be reduced by increasing the number of compute nodes. As for V, although it is dense for the long-range Coulomb potential, the indices of the V matrix are precisely those used for the parallel distribution. Hence the memory storage of V across the processes is not duplicated, and its memory consumption on each process is inversely proportional to the number of processes. On the other hand, the parallel distribution of D and Σ^X is much more involved, and the memory consumption of these arrays may become the bottleneck for large systems in parallel calculations.
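For illustration, the greedy assignment of Algorithm 6 can be sketched in a few lines of Python using a min-heap over process loads (task ids and weights here are hypothetical; the real code distributes atom pairs with costs N_e):

```python
import heapq

def greedy_schedule(weights, n_procs):
    """Greedy machine scheduling (Algorithm 6): assign tasks, heaviest
    first, to the currently least-loaded process.  `weights` maps a task
    id (e.g. an atom pair) to its cost N_e; returns task lists per process."""
    heap = [(0.0, p) for p in range(n_procs)]      # (load W_p, process id)
    heapq.heapify(heap)
    tasks = {p: [] for p in range(n_procs)}
    for e in sorted(weights, key=weights.get, reverse=True):
        load, p = heapq.heappop(heap)              # least-loaded process
        tasks[p].append(e)
        heapq.heappush(heap, (load + weights[e], p))
    return tasks
```

Sorting the tasks in descending order of weight first is what yields the factor-of-two (in fact better) guarantee relative to the optimal makespan.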
In practice, Σ^X consumes more memory than D, and hence we take Σ^X as the example to analyze the problem; the analysis of Σ^X applies to D as well. As illustrated in Algorithm 2, our actual implementation is based on a loop structure over the atomic pairs ⟨A,B⟩. For each pair ⟨
A, B⟩, one needs to evaluate contributions to four blocks of the HFX matrices, i.e., Σ^X_{Ff,Gg}, Σ^X_{Ff,Bb}, Σ^X_{Aa,Gg}, and Σ^X_{Aa,Bb}, among which Σ^X_{Ff,Gg} is the most memory intensive because of the presence of different
F, G atoms in the neighborhood of the ⟨
A, B⟩ pair. A pictorial illustration of the situation is presented in Fig. 4. When a group of atom pairs ⟨
A, B⟩ is distributed to one process p, the set of Σ^X_{Ff,Gg} blocks that need to be stored on process p is given by ⋃_{⟨A,B⟩∈p} ⋃_{F∈N[A]} ⋃_{G∈N[B]} Σ^X_{F,G}. This means that the memory consumption of Σ^X_{Ff,Gg} on a process p is proportional to the union of the neighborhood regions of the atoms A and B assigned to p, as illustrated in Fig. 4. To minimize the memory consumption of Σ^X_{Ff,Gg}, the number of different F (G) atoms in the neighborhoods of the A (B) atoms distributed to the same process should be as small as possible. The “Machine-scheduling distribution” scheme does not consider the atom positions, and the ⟨A,B⟩ pairs and their neighbors on one process may spread all over the supercell. If this happens, simply increasing the number of compute nodes does not necessarily reduce the memory consumption of Σ^X on one process.

Based on the above analysis, it is obvious that minimizing the memory consumption for Σ^X_{Ff,Gg} on each process amounts to minimizing ⋃_A N[A], or equivalently, maximizing the overlap of all the N[A]'s present on one process. Considering that the atomic cut-off radii of all chemical elements in the actual calculations are very close if not equal, this is essentially equivalent to making all the A's allocated to each process as close to one another as possible. The same principle applies to the atoms B.
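The memory argument above can be made concrete with a small sketch (hypothetical neighbor lists and pair assignment) that counts the distinct Σ^X_{F,G} blocks a process must hold, i.e., the union over the neighborhoods of its assigned pairs:

```python
def sigma_blocks_per_process(pair_assignment, neighbors):
    """Count the distinct Sigma^X_{F,G} blocks stored on each process: the
    union of (F, G) over all assigned pairs <A, B>, with F in N[A] and
    G in N[B].  pair_assignment: {process: [(A, B), ...]};
    neighbors: {atom: set of neighbouring atoms}."""
    storage = {}
    for p, pairs in pair_assignment.items():
        blocks = set()
        for A, B in pairs:
            for F in neighbors[A]:
                for G in neighbors[B]:
                    blocks.add((F, G))
        storage[p] = len(blocks)
    return storage
```

Because the count is a set union rather than a sum, assigning spatially close A (and B) atoms to the same process makes the neighborhoods overlap and shrinks the per-process storage.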
The K-means algorithm given in Algorithm 7 is chosen here to cluster the atoms. For short-range potentials, such as the HSE potential, only pairs ⟨A,B⟩ within a certain range are needed, whereas for long-range potentials, such as HF, almost all pairs ⟨
A, B⟩ are needed.
Algorithm 7 The algorithm of K-means distribution

    function K-means distribution
        τ_A: coordinate of atom A
        x_A: category of atom A
        τ_x: center coordinate of category x
        while unconverged do
            for all atoms A do
                category x_A = argmin_x |τ_A − τ_x|
            end for
            for all categories x do
                center coordinate τ_x = average{ τ_A | x_A = x }
            end for
        end while
    end function

In this section, we benchmark the efficiency and scalability of our algorithms and implementation, as discussed in Secs. 2 and 3, via specific examples. We first demonstrate the effect of our screening techniques, which are the central part of our implementation (Sec. 4.1). This is followed by an examination of the efficacy of the two parallel distribution schemes (Sec. 4.2).

Figure 4: Illustration of the memory consumption of Σ^X_{Ff,Gg} on one process. For an atom pair ⟨
A, B⟩, the set of Σ^X_{Ff,Gg} blocks covers all F ∈ N[A] and all G ∈ N[B]. For a group of atom pairs {⟨A,B⟩} distributed to one process, the set of Σ^X_{Ff,Gg} blocks is the union over all
F, G atoms in the neighborhoods of all these {⟨
A, B⟩} atom pairs.

A comprehensive benchmark study of the overall performance of our HFX implementation with respect to the system size and the number of CPU cores is presented in Sec. 4.3. Finally, the advantage of our implementation for band-structure calculations is discussed in Sec. 4.4. Note that we concentrate here on the efficiency aspects of our implementation; the accuracy aspects have been reported in a previous publication.

C, V and D matrices

We perform test calculations on a unit cell of N_at atoms with N_kx × N_ky × N_kz k-points, which corresponds to a supercell of N_at × N_kx × N_ky × N_kz atoms with a single k-point. In this setting, the atom A can be restricted to the central unit cell, whereas the atoms B, F, and G run over the entire supercell. Thus, all the C, V, and D matrices have the same real-space data structure, and hence the same sparsity, as if we were dealing with a Γ-only supercell of N_at × N_kx × N_ky × N_kz atoms. Therefore, a small unit cell with a dense k-point mesh is well suited for testing the effect brought about by screening out the insignificant elements of the C, V, and D matrices. Yet, compared to directly dealing with a large supercell, the computational cost can be significantly reduced, thanks to the translational symmetry. As a specific test example, we performed HSE06 calculations (with the screening parameter µ = 0.
11) for Si crystal. The lattice constant is chosen to be 10.236 Bohr, and an 8 × × k-point mesh is used for the Brillouin zone (BZ) integration. The energy cut-off determining the uniform real-space integration grid for the charge density is set to 240 Ry. We use a NAO DZP basis [2 s p d ] as the one-electron basis set; for the ABFs, an optimized [5 s p d ] set of orbitals is used. The cut-off radii of both the NAOs and the ABFs are set to 8 Bohr.

We first check what happens if we pre-screen the small matrix elements of the individual C, V, and D matrices, i.e., the influence of the pre-screening on the numerical accuracy, the computation time, and the memory consumption. Here, the numerical accuracy of the calculations is measured by the error of the obtained band gap, δE_g = |E^scr_g − E^ref_g|, where E^scr_g is the band gap obtained with screening and E^ref_g is the reference value obtained without applying any screening – by setting the threshold parameters to zero or, in the case of the V matrix, by setting R_Q to a large value. The error of the band gap is used as the measure of the computational accuracy, as it is more sensitive to the screening thresholds than variational quantities such as the ground-state total energy. Fig. 5 presents the band gap errors, the memory consumption of the C, V, and D matrices, and the computation time for evaluating the HFX matrix in one SCF iteration, as functions of the respective screening thresholds or cut-off parameters. The calculation was performed on one Intel(R) Xeon(R) CPU (E5-2640 v2 @ 2.00GHz) core. As discussed in Sec. 3.2, ε_C and ε_D are used to directly filter the subblocks of the C and D matrices, and smaller values of ε_C and ε_D mean more accurate calculations. The pre-screening of the V matrix, on the other hand, is controlled by a cut-off radius R_Q of the Coulomb potential Q(r) associated with the ABFs, as introduced in Sec. 3.2.2.
Obviously, a larger R_Q corresponds to smaller numerical errors. The test calculations are done in a successive way. Namely, we first examine the influence of R_Q while setting ε_C and ε_D to zero. Then, with a fixed value of R_Q and zero ε_D, we look at the impact of ε_C. Finally, with fixed R_Q and a finite ε_C threshold, we check the effect of ε_D.

Figure 5: Computation times (left panels) and memory footprints (right panels) as a function of the cut-off/threshold parameters used in pre-screening the individual V (upper panels), C (middle panels), and D (bottom panels) matrices. In all panels, the accompanying band gap errors are plotted as an indicator of the numerical accuracy of the calculations.

Inspection of Fig. 5 reveals that, for all three matrices, there exists a window of parameter values within which there is almost no loss of accuracy (i.e., the obtained band gap stays essentially unchanged), yet the computation time and memory consumption are significantly reduced. In the HSE06 functional, Q(r) itself is short-ranged, which allows a small cut-off radius for Q(r) to be introduced to screen out the insignificant elements of the V matrix. As shown in Fig. 5(a) and (b), when the cut-off radius R_Q is reduced from 24 Bohr to 8 Bohr, the incurred band gap error is negligible (less than 1 meV), whereas the computation time is reduced from 5317 s to 637 s and the memory cost from 12.5 MB to 1.4 MB.
Such a drastic reduction of the computation time and memory consumption certainly benefits from the short-ranged nature of the V matrix in screened HDFs such as HSE. In contrast, if the long-range part of the Coulomb matrix is needed, as in Hartree-Fock or long-range-corrected HDFs, a rather large R_Q (∼
50 Bohr) is required, and the memory storage of the V matrix, as well as the computation time, becomes significantly more expensive.

Fig. 5(c) and (d) show the change of the computation time and the memory cost of the C matrix as a function of ε_C, with R_Q = 8 Bohr and ε_D = 0. It can be seen that, if ε_C is set to 1 × −, then compared to ε_C = 0 (no pre-screening) the computing time is reduced from 637 s to 322 s (nearly a 50% reduction) and the memory of the C matrix from 10.2 MB to 7.2 MB (nearly a 30% reduction), while the computed band gap is barely affected. Note that the zero elements of C due to the finite cut-off radius of the AOs have already been excluded at the very beginning of the calculation, and the memory reduction recorded here is due to the small but finite matrix elements of C. When increasing ε_C from 5 × − to 2.5 × −, the computing time goes further down to 52 s (i.e., more than a 90% reduction) and the memory cost of the C matrices down to 2.9 MB (i.e., more than a 70% reduction), but now a visible band gap error of 13.4 meV is incurred. Further increasing ε_C beyond 10 − leads to a rapid growth of the band gap error, which should be avoided.

Finally, we check the influence of ε_D with fixed R_Q = 8 Bohr and ε_C = 10 −, and the obtained results are reported in Fig. 5(e) and (f). One can see that, when increasing ε_D from 0 to 10 −, the computation time is reduced from 322 s to 282 s, and the memory cost of D from 2.7 MB to 1.1 MB. When further increasing ε_D to 10 −, the computation time drops from 282 s to 99 s, and the memory consumption from 1.1 MB to 0.1 MB. In the meantime, no noticeable change of the band gap is observed. Such a significant saving in computation time and memory storage is enabled by the fact that silicon crystal is an insulator, whose density matrix decays exponentially in real space.
However, increasing ε_D even slightly further (say, to 2.5 × −) leads to a rapid increase of the band gap error, signaling that such an ε_D is too aggressive. In practice, we found that a conservative value of 10 − is safe and hence recommended for practical calculations.

After pre-screening the individual C, V, and D matrices, we can further apply the screening techniques based on the Cauchy-Schwarz inequalities, as discussed in Sec. 3.3.1 and Sec. 3.3.2, to filter out those matrix elements that jointly lead to negligibly small H^X matrix elements. Any remaining insignificant elements of the C, V, and D matrices that passed the initial pre-screening step will be identified and excluded here.

Following the screening workflow outlined in Algorithm 5, we first apply the ERI-based Cauchy-Schwarz screening procedure and then the “matrix product” based one. Note that, at this point, the pre-screening of the individual C, V, and D matrices has already been performed with R_Q = 8 Bohr, ε_C = 10 −, and ε_D = 10 −. The memory storage of the Σ^X matrix is now used to measure the effect of the Cauchy-Schwarz inequality screenings on the memory consumption. Fig. 6(a) and (b) present the computation time and the memory consumption of Σ^X as a function of ε_CS-ERI – the threshold parameter of the Cauchy-Schwarz inequality ERI screening. When setting ε_CS-ERI = 10 −, the induced error of the band gap is about 0.9 meV, while the computation time is reduced from 3282 s to 171 s. In this case, the influence on the memory cost of Σ^X is minor. Further increasing ε_CS-ERI to 10 − accelerates the calculation by another factor of 3, but the incurred error rises to about 20 meV, which is not recommended.

The final step is the Cauchy-Schwarz inequality matrix screening as outlined in Algorithm 4. The computation time, the memory consumption of Σ^X, and the band gap error as functions of the truncation threshold ε_CS-matrix are shown in Fig. 6(c) and (d). As ε_CS-matrix increases
As ε CS-matrix increasesfrom 0 to 10 − , the computation time for evaluating HFX matrix is reduced from 172 s to127 s and the memory cost of Σ X decreases from 2.6 MB to 1.3 MB. In the meantime, theaccompanying band gap error is only 0.77 meV. Further increasing ε CS-matrix from 10 − to10 − , the computation time is reduced to 82 s and memory of Σ X to 0.9 MB, but the bandgap error is increased to 11 meV.In summary, by applying both the pre-screening of individual matrices and the screen-ing procedures based on Cauchy-Schwarz inequalities, the computation time for evaluatingthe HFX matrix in one iteration is reduced by a factor of 40, whereas the total memoryconsumption of the four most memory intensive matrices – C , V , D , Σ – is reduced by afactor of 3. The accumulated error of the band gap, compared to the reference value withoutapplying any screening, is only 0.57 meV. Similarly, the incurred error in the absolute HSE06total energy, which was not reported in the above analysis, is only 0.20 meV. (Note that theerrors induced in individual steps might compensate each other, but are always in the sameorder of magnitude.) Obviously, if one can tolerate bigger errors, say 10 meV in band gap,the savings in computation time and memory storage will be even more significant. In parallel computing, keeping good load balance among the processes is a key requirementto enable massively parallel calculations. In our implementation, this is achieved by theso-called “Machine-scheduling distribution”, as described in Sec. 3.4.1. 
To assess the per-33 CS ERI (Ry )050100150200250300 T i m e ( s ) E rr o r o f H S E G a p ( m e V ) (a) TimeError 10 CS ERI (Ry )0.00.51.01.52.02.53.0 M e m o r y o f X ( M B ) E rr o r o f H S E G a p ( m e V ) (b) MemoryError10 CS matrix (Ry)0255075100125150175 T i m e ( s ) E rr o r o f H S E G a p ( m e V ) (c) TimeError 10 CS matrix (Ry)0.00.51.01.52.02.53.0 M e m o r y o f X ( M B ) E rr o r o f H S E G a p ( m e V ) (d) MemoryError Figure 6: Computation times (left panels) and memory footprints of Σ X (right panels) asa function of the thresholding parameters of the ERI-based (upper panels) and “matrixproduct” based (lower panels) Cauchy-Schwarz screening. In all panels, the induced bandgap errors are plotted as an indicator of the numerical accuracy of the calculations.34ormance of this distribution scheme, we compare it with a straightforward parallelizationscheme where the atom pairs are distributed randomly, only requiring that the numbers ofatom pairs assigned to each process are equal.The test system chosen here is a DNA fragment containing 788 atoms, composed of 12AT basis pairs, as studied in Ref. 101. The load balance is measured by the ratio betweenthe maximal time consumed on one process (∆ t max ) and the average time over all processes(∆ t av ). If the load balancing is perfect, ∆ t max / ∆ t av ratio should equal 1; otherwise this ratiowill be larger than 1. Obviously, the larger the ratio is, the worse the load balance. We notethat the global communication time between different processes is accounted for here, as itis the “common” time shared by all processes.As a numerical experiment, we carried out six independent HSE06 calculations, respec-tively, on 1, 2, 4, 8, 16, and 32 compute nodes, each with 24 CPU cores. Consistent withthe computer architecture, the calculations are parallelized using 1, 2, 4, 8, 16, and 32 MPIprocesses, each process consisting of 24 threads. When running on one compute node (1process ×
24 threads), both the “Machine-scheduling distribution” and the “random distribution” schemes perform perfectly, with the ∆t_max/∆t_av ratio being essentially one, as it should be. As the number of processes increases, the ∆t_max/∆t_av ratio of the “random distribution” scheme grows gradually, reaching 2.2 when 32 nodes (768 CPU cores) are used, meaning that the maximum computing time on one process is more than twice the average computing time. In contrast, the ∆t_max/∆t_av ratio of the “Machine-scheduling distribution” scheme stays very close to 1, increasing only slightly when more CPU cores are used. With 32 processes (768 CPU cores), the wall time is still less than 1.2 times the ideal time. These results clearly demonstrate that, with the “Machine-scheduling distribution” scheme, one can achieve excellent load balancing in parallel computing, even for very inhomogeneous systems.
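The name “Machine-scheduling distribution” refers to the classic machine-scheduling problem of Refs. 95-97. As a rough illustration of why such scheduling keeps ∆t_max/∆t_av near 1, here is a longest-processing-time-first greedy sketch in Python; the task costs are invented, and the actual ABACUS scheduler is more elaborate:

```python
import heapq

def lpt_schedule(costs, nprocs):
    """Longest-processing-time-first greedy: walk the tasks (here,
    atom-pair workloads) in order of decreasing cost and assign each
    one to the currently least-loaded process.  Returns the resulting
    load-balance metric dt_max / dt_av."""
    loads = [(0.0, p) for p in range(nprocs)]   # (current load, process id)
    heapq.heapify(loads)
    assignment = [[] for _ in range(nprocs)]
    for task, cost in sorted(enumerate(costs), key=lambda t: -t[1]):
        load, p = heapq.heappop(loads)          # least-loaded process
        assignment[p].append(task)
        heapq.heappush(loads, (load + cost, p))
    totals = [sum(costs[t] for t in a) for a in assignment]
    return max(totals) / (sum(totals) / nprocs)

# Eight unequal atom-pair costs distributed over four processes
costs = [9.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0]
ratio = lpt_schedule(costs, 4)
```

For these eight tasks on four processes the greedy assignment already gives ∆t_max/∆t_av ≈ 1.08, whereas an unlucky random assignment can be far worse.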
Figure 7: The “Max time/Average time” (∆t_max/∆t_av) ratio for the “random distribution” (red circles) and the “Machine-scheduling distribution” (blue squares) as a function of the number of CPU cores. Test calculations are done for a DNA fragment.

As discussed in Sec. 3.4.2, in addition to the “Machine-scheduling distribution” scheme intended to improve the load balance of the computation time, we offer an alternative, the so-called “K-means distribution” scheme, to reduce the memory footprint. In our implementation, the memory consumption scales roughly linearly with the size of the unit cell and can become a bottleneck for very big supercells. In these cases, the “K-means distribution” scheme enables calculations that would otherwise not run.

To test the performance of the “K-means distribution” scheme, we carried out a series of Γ-only HSE06 calculations for Si crystal with different unit cell sizes and increasing numbers of MPI processes. The maximal memory footprints of the Σ^X matrix per process for both the “K-means distribution” and “random distribution” schemes are presented in Fig. 8 as a function of the number of MPI processes (again each process running on 24 CPU cores with shared memory). Different curves in Fig. 8 correspond to different sizes of the unit cell, containing 64, 128, 256, 512, and 1024 Si atoms, respectively. The computational parameters (basis sets, cut-off energy, and cut-off radii) are the same as those used in Sec. 4.1.1. The screening thresholds (or truncation parameters) are chosen to be R_Q = 8 Bohr, ε_C = 10−, ε_D = 10−, ε_CS-ERI = 10−, ε_CS-matrix = 10−. Tests showed that the incurred band gap error with these screening parameter settings is below 10 meV.

As discussed in Sec. 3.4.2, in our current parallelization algorithm, the same sub-blocks of the Σ^X matrix have to be stored in different processes, leading to a duplication of the memory storage of the Σ^X matrix. This may become a bottleneck for large-scale calculations. Fig.
8 clearly demonstrates the superiority of the “K-means distribution” scheme (solid lines) over the unoptimized “random distribution” scheme (dashed lines) in reducing the memory cost as the number of processes increases. With two processes, the “K-means distribution” scheme gains a factor of 1.25 in memory saving, and this factor steadily increases to 4.7 for 64 processes. Thus, the “K-means distribution” scheme can be invoked when memory is scarce, in particular in massively parallel calculations.
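The clustering step behind the “K-means distribution” can be pictured with a textbook k-means on atomic coordinates. The sketch below is illustrative Python only, with made-up coordinates; the production scheme additionally balances the workload of the clusters:

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain k-means on atomic coordinates: each cluster of nearby
    atoms is mapped to one MPI process, so the matrix blocks a process
    needs are spatially localized rather than scattered over the cell."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign each atom to its nearest cluster center
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        # move each center to the mean of its cluster (keep it if empty)
        centers = [tuple(sum(x) / len(cl) for x in zip(*cl)) if cl else centers[j]
                   for j, cl in enumerate(clusters)]
    return clusters

# Two well-separated groups of "atoms" split onto two processes
atoms = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
         (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
clusters = kmeans(atoms, 2)
```

Atoms close in space end up on the same process, so the Σ^X sub-blocks each process touches are spatially localized and less duplicated across processes.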
Figure 8: The maximal memory footprint of Σ^X among all runtime processes as a function of the number of MPI processes, for different unit cell sizes. The solid lines correspond to the “K-means distribution” scheme, whereas the dashed lines represent the “random distribution” scheme.

To document the scaling behavior of the computational time for building the HFX matrix with increasing system size and computing resources, we performed Γ-only HSE06 calculations for Si crystal with different supercell sizes, using different numbers of CPU cores.
Figure 9: Computation time for building the HFX matrix per SCF iteration as a function of the system size, for different amounts of CPU resources. Test systems are Si crystals with various supercell sizes.

The computation time recorded here is the wall time per iteration for evaluating and transmitting the Σ^X matrix, which is the most time-consuming part of the entire HDF calculation for the system sizes tested so far (up to 4096 atoms per unit cell). The basic computational parameters and screening thresholds are identical to those used in Sec. 4.2.2. Since there is no memory shortage in these calculations, the “Machine-scheduling distribution” scheme is used to achieve good load balance. All calculations are done on the Tianhe-2 supercomputer, where each node has two Intel(R) Xeon(R) CPUs (E5-2692 v2 @ 2.20GHz).

Fig. 9 presents the wall times as a function of the system size, i.e., the number of atoms in the supercell, for different numbers of CPU cores. The systems range from the smallest one containing 128 Si atoms to the largest one containing 4096 Si atoms. The calculations employ 10, 20, 30, and 40 processes ×
24 threads, running on 240, 480, 720, and 960 CPU cores, respectively. As can be seen from Fig. 9, in all parallel runs with different numbers of processes, the wall time scales almost linearly with the number of atoms in the calculations. A linear fit of the data obtained using 960 CPU cores yields t = 0.0072 N_at + 0.6401 (coefficient of determination R² = 0.9997). Although the linear-scaling behavior is expected from the underlying algorithms, this benchmark test proves the efficacy of our implementation in ABACUS. Furthermore, the absolute timings presented in Fig. 9 indicate that the prefactor of our linear-scaling algorithm is rather small – a feature that is vitally important for the usability of the code in practical calculations.

In Fig. 10, the wall times of the calculations presented above are re-plotted as a function of (the inverse of) the number of CPU cores. It can be seen that, for all system sizes, the computation time decreases linearly as the number of CPU cores increases. Even for the smallest system, containing only 128 Si atoms, the linear scaling is still perfect up to 960 CPU cores. For the 4096-atom system – the largest one tested in this work – the calculation takes about 104 s for one HFX evaluation step using 240 CPU cores, and the computation time is reduced to about 30 s if 960 CPU cores are used. A linear fit for the data of the
Figure 10: Computation time for building the HFX matrix per SCF iteration as a function of the number of CPU cores, for different supercell sizes. Test systems are Si crystals with various supercell sizes.

4096-atom system yields t = 23731/N_cores + 5.7447 (R² = 0.9998). The small prefactor of our linear-scaling implementation with respect to the system size and the excellent parallel efficiency allow one to tackle large-scale systems with relative ease. With our implementation in ABACUS, HSE06 calculations for systems with a few thousand atoms can be routinely performed. In fact, given the linear-scaling behavior of our implementation with respect to the system size and the number of CPU cores, there is in principle no obstacle to handling systems with tens of thousands of atoms. However, in that case, diagonalizing the Hamiltonian matrix will become the new bottleneck – a challenge that is common to all KS- or gKS-DFT calculations.
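The fits quoted above are ordinary least-squares fits. For completeness, a self-contained sketch follows, with synthetic timings generated from an assumed linear law rather than the measured data:

```python
def linear_fit(xs, ys):
    """Ordinary least squares for y = a*x + b, plus the coefficient of
    determination R^2, as used to quantify the linear scaling of the
    HFX build time with system size."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sxy / sxx
    b = my - a * mx
    ss_res = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return a, b, 1.0 - ss_res / ss_tot

# Synthetic timings (not the paper's data): exactly linear in atom count
n_at = [128, 256, 512, 1024, 2048, 4096]
t = [0.0072 * n + 0.64 for n in n_at]
a, b, r2 = linear_fit(n_at, t)
```

Fitting the wall time against 1/N_cores instead of N_at, as done for Fig. 10, uses the same routine with xs replaced by the reciprocal core counts.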
One additional advantage of the present implementation is that the electronic band structures of HDF calculations can be easily obtained. We briefly discuss this point in this subsection. In our implementation, we first obtain the HFX matrix in real space – Σ^X_{Ii,Jj}, as indicated in eq (2) – and then Fourier-transform it to k space. Note that in our notational system the atoms I and J can be located in different unit cells R_I and R_J, and thus the matrix can be rewritten as

Σ^X_{Ii,Jj} = Σ^X_{Ĩ(R_I)i,J̃(R_J)j} = Σ^X_{Ĩi,J̃j}(R_J − R_I) = Σ^X_{Ĩi,J̃j}(R),   (20)

where Ĩ and J̃ denote the atomic indices in one unit cell and R = R_J − R_I. For most systems, the exchange interactions are short-ranged, meaning that the matrix elements of Σ^X_{Ĩi,J̃j}(R) are vanishingly small for |R| > R_max, where R_max is a certain critical length.

After the self-consistent HDF calculation, we obtain the real-space HFX matrix Σ^X_{Ĩi,J̃j}(R) for all lattice vectors with |R| ≤ R_max, and merge it with the local part of the gKS Hamiltonian to get the full Hamiltonian. Once the full gKS Hamiltonian H_{Ĩi,J̃j}(R) in real space is obtained, one can readily construct the Hamiltonian at arbitrary k points in the NAO basis sets,

H_{Ĩi,J̃j}(k) = Σ_{|R| ≤ R_max} e^{i k·R} H_{Ĩi,J̃j}(R).   (21)

The reason that the lattice summation in eq (21) can also be restricted to |R| ≤ R_max is that the local part of the gKS Hamiltonian is more short-ranged than the HFX part. Now, given that the Hamiltonian matrix at arbitrary k points is readily available from eq (21), the band energies along desired paths in k space can be obtained by a one-shot diagonalization through non-SCF calculations. Compared to the plane-wave formalism, the diagonalization at this final step is rather inexpensive, owing to the much reduced basis size in the NAO framework. Therefore, one does not need to invoke the band interpolation techniques here, as is usually done within the plane-wave approach.
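The transform in eq (21) is just a phase-weighted sum over the stored real-space blocks. A minimal sketch with a hypothetical one-band nearest-neighbor chain (illustrative Python, not the ABACUS data structures):

```python
import cmath

def hamiltonian_at_k(h_real, k):
    """Assemble H(k) = sum_R exp(i k.R) H(R) from real-space blocks,
    following eq (21).  `h_real` maps a lattice vector R (a tuple) to
    the real-space Hamiltonian block H(R) (a list of lists); only the
    vectors with |R| <= R_max are stored, so the sum is finite."""
    dim = len(next(iter(h_real.values())))
    hk = [[0j] * dim for _ in range(dim)]
    for r_vec, block in h_real.items():
        phase = cmath.exp(1j * sum(ki * ri for ki, ri in zip(k, r_vec)))
        for i in range(dim):
            for j in range(dim):
                hk[i][j] += phase * block[i][j]
    return hk

# 1-band toy chain: on-site energy 0, hopping t = -1 to nearest neighbors
h_real = {(0.0,): [[0.0]], (1.0,): [[-1.0]], (-1.0,): [[-1.0]]}
hk = hamiltonian_at_k(h_real, (0.0,))
```

For this toy model the construction reproduces the tight-binding dispersion ε(k) = −2 cos(k) at the sampled k point; in practice, each H(k) is then diagonalized in a one-shot non-SCF step.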
Such a real-space algorithm is also of great advantage if very dense k grids are needed, e.g., when calculating optical absorption spectra.

To check the validity of our approach for HDF band structure calculations, in Fig. 11 we present the HSE06 band structures of Si and GaP crystals as obtained by ABACUS, in comparison with the corresponding FHI-aims results, which are taken as the reference here. The valence-only DZP basis sets (2 s p d for Si and P, and 2 s p d f for Ga) are used in the ABACUS calculations, whereas the so-called “tight” settings are used in the FHI-aims calculations, corresponding to an all-electron 4 s p d f g basis set for Si and P, and a 5 s p d f basis set for Ga. Despite the different (pseudopotential versus all-electron) descriptions of core-valence interactions and the different basis sizes, the valence and low-lying conduction bands (and hence the band gap) obtained using the two codes agree with each other rather well. The remaining discrepancy for the high-lying conduction bands is expected, owing to the relatively smaller basis size used in the ABACUS calculations. The agreement will be further improved if one employs the TZDP basis sets in
ABACUS calculations. A more comprehensive comparison study of the HSE06 band gaps obtained using different computer codes, as well as the influence of the basis sets, can be found in Ref.
Figure 11: HSE06 electronic band structures of Si (upper panel, diamond structure) and GaP (lower panel, zinc blende structure) calculated using
ABACUS (blue solid lines) and FHI-aims (red dashed lines) codes. The experimental lattice parameters are used for both materials. A 8 × × k grid is used in the BZ integration, and the NAO basis sets used by the two codes in the present calculations can be found in the text.

Summary
In summary, we presented an efficient, linear-scaling implementation for building the (screened) HFX matrix for periodic systems within the framework of NAO basis functions. The implementation is based on the LRI approximation, coupled with our own procedures for constructing the ABFs. The numerical accuracy of this approximation for periodic HDF calculations has been systematically benchmarked in Ref. 58.

In this work, we described the numerical details behind our implementation, in particular how we choose the loop structure over the atom pairs, and how we exploit the sparsity of the key matrices, including the expansion coefficients, the (screened) Coulomb matrix, and the density matrix. For the latter, a multi-level screening procedure is employed which ensures that the insignificant elements of these matrices are screened out as much as possible, leading to a linear-scaling build of the HFX matrix with a rather small prefactor. We discussed two parallel distribution schemes, which can be invoked to achieve the best load balance in computation time or, alternatively, to reduce the memory consumption when memory is short. Benchmark calculations for Si crystal with supercells of up to 4096 atoms confirm the linear-scaling behavior of the computational cost with respect to the system size, whereas calculations with increasing computing resources demonstrate excellent parallel efficiency up to nearly one thousand CPU cores.

Our implementation was carried out in
ABACUS, but the same techniques can be easily utilized by other local-orbital-based code packages. With our present implementation in
ABACUS, HDF calculations for systems with a few thousand atoms can be routinely done with modest computing resources. We expect that our implementation will find important applications in disordered systems, defects, and heterostructures, where large supercells are needed.

Acknowledgement
We thank Liu Xiaohui, Chen Junshi, Shen Yu, and Shi Rong for helpful discussions. This work is supported by the National Key Research and Development Program of China (Grant No. 2016YFB0201202) and the National Natural Science Foundation of China (Grant Nos. 11774327 and 11874335). The numerical calculations have been performed partly at the National Supercomputer Center in Guangzhou and partly on the USTC HPC facilities.
A Appendix
A.1 Ordering of matrix multiplications to evaluate H^X matrices

The key step in building the HFX matrix is to evaluate the H^X matrices, formally introduced in eq (9). There are four variants of them (cf. eq (7)), depending on where the ABFs are located, but all can be seen as contributions to Σ^X_{Ii,Jj} from the atom pair ⟨
K, L⟩. They are formally given by H^X ∼ CVCD, i.e., a sequence of matrix products involving the C, V, and D matrices. Obviously, the actual order of performing the matrix multiplications is important here, since it affects the overall computational cost.

The issue can be analyzed for a given set of four atoms, I, J, K, L. In this case, omitting the atomic indices, the H^X matrices are rank-2 tensors and can be calculated as

H^X_{ij} = Σ_{kl} Σ_{αβ} C_{αik} V_{αβ} C_{βjl} D_{kl}.   (22)

For the convenience of the analysis, we assume that all atoms have the same number of AOs (n_φ) and ABFs (n_P). Here we emphasize that n_φ and n_P refer to the numbers of basis functions per atom (and not per unit cell). Therefore, in eq (22), C is an n_P × n_φ × n_φ tensor, while V and D are, respectively, n_P × n_P and n_φ × n_φ matrices.

Given the expression in eq (22), one may recognize that there are five different ways of ordering the matrix multiplications. The computational costs associated with the five orderings are listed in Table 1, among which the two most efficient are obviously (CV)(CD) and C((VC)D), due to the fact that n_P is several times larger than n_φ. After considering the data structures adopted in the code, cache optimization, and the use of the BLAS library, we finally chose C((VC)D) as the matrix multiplication order in our implementation.

Table 1: Computational cost of each ordering of the matrix multiplications

Order of matrix multiplication | Computational cost
(CV)(CD)   | n_P² n_φ² + 8 n_P n_φ³
C((VC)D)   | n_P² n_φ² + 8 n_P n_φ³
C(V(CD))   | 4 n_P² n_φ² + 8 n_P n_φ³
((CV)C)D   | n_P² n_φ² + n_P n_φ⁴ + 4 n_φ⁴
(C(VC))D   | n_P² n_φ² + n_P n_φ⁴ + 4 n_φ⁴

References

(1) Becke, A. D. Density-functional thermochemistry. III. The role of exact exchange.
J.Chem. Phys , , 5648.(2) Perdew, J. P.; Schmidt, K. In Density Functional Theory and its Application to Ma-terials ; Van Doren, V., Van Alsenoy, C., Geerlings, P., Eds.; AIP: Melville, NY,2001.(3) Hohenberg, P.; Kohn, W. Inhomogeneous Electron Gas.
Phys. Rev. , , B864.(4) Kohn, W.; Sham, L. J. Self-Consistent Equations Including Exchange and CorrelationEffects. Phys. Rev. , , A1133.(5) Seidl, A.; G¨orling, A.; Vogl, P.; Majewski, J. A.; Levy, M. Generalized Kohn-Shamschemes and the band-gap problem. Phys. Rev. B , , 3764.(6) Perdew, J. P.; Ernzerhof, M.; Burke, K. Rationale of mixing exact exchange withdensity functional approximations. J. Chem. Phys. , , 9982.467) Ernzerhof, M.; Scuseria, G. E. J. Chem. Phys. , , 5029.(8) Adamo, C.; Barone, V. J. Chem. Phys. , , 6158.(9) Zhao, Y.; Truhlar, D. G. The M06 suite of density functionals for main group ther-mochemistry, thermochemical kinetics, noncovalent interactions, excited states, andtransition elements: two new functionals and systematic testing of four M06-classfunctionals and 12 other functionals. Theor. Chem. Acc. , , 215–241.(10) Chai, J.-D.; Head-Gordon, M. Systematic optimization of long-range corrected hybriddensity functionals. J. Chem. Phys. , , 084106.(11) Heyd, J.; Scuseria, G. E.; Ernzerhof, M. Hybrid functionals based on a screenedCoulomb potential. J. Chem. Phys. , , 8207.(12) Toulouse, J.; Colonna, F.; Savin, A. Long-rangeshort-range separation of the electron-electron interaction in density-functional theory. Phys. Rev. A , , 062505.(13) Iikura, H.; Tsuneda, T.; Yanai, T.; Hirao, K. A long-range correction scheme forgeneralized-gradient-approximation exchange functionals. J. Chem. Phys. , ,3540.(14) Yanai, T.; Tew, D. P.; Handy, N. C. A new hybrid exchangecorrelation functional usingthe Coulomb-attenuating method (CAM-B3LYP). Chem. Phys. Lett. , , 51.(15) Vydrov, O. A.; Scuseria, G. E. Assessment of a long-range corrected hybrid functional. J. Chem. Phys. , , 234109.(16) Skone, J. H.; Govoni, M.; Galli, G. Separated Hybrid Functionals for Solids andMolecules. Phys. Rev. B.: Condens. Mater. Phys. , , 235106.(17) Shimazaki, T.; Asai, Y. Band Structure Calculations Based on Screened Fock Ex-change Method. Chem. Phys. Lett. 
, , 91–94.4718) Shimazaki, T.; Asai, Y. Energy band structure calculations based on screenedHartreeFock exchange method: Si, AlP, AlAs, GaP, and GaAs. J. Chem. Phys. , , 224105.(19) Marques, M. A. L.; Vidal, J.; Oliveira, M. J. T.; Reining, L.; Botti, S. Density-BasedMixing Parameter for Hybrid Functionals. Phys. Rev. B.: Condens. Matter Mater.Phys. , , 035119.(20) Shimazaki, T.; Nakajima, T. Dielectric-dependent screened HartreeFock exchange po-tential and Slater-formula with Coulomb-hole interaction for energy band structurecalculations. J. Chem. Phys. , , 114109.(21) Skone, J. H.; Govoni, M.; Galli, G. Self-consistent hybrid functional for condensedsystems. Phys. Rev. B.: Condens. Mater. Phys. , , 195112.(22) He, J.; Franchini, C. Assessing the Performance of Self-Consistent Hybrid Functionalfor Band Gap Calculation in Oxide. J. Phys.: Condens. Matter , , 454004.(23) Kronik, L.; Stein, T.; Refaely-Abramson, S.; ; Baer, R. Excitation Gaps of Finite-Sized Systems from Optimally Tuned Range-Separated Hybrid Functionals. J. Chem.Theory Comput. , , 1515.(24) Atalla, V.; Yoon, M.; Caruso, F.; Rinke, P.; Scheffler, M. Hybrid density functionaltheory meets quasiparticle calculations. Phys. Rev. B , , 165112.(25) Cui, Z.-H.; Wang, Y.-C.; Zhang, M.-Y.; Xu, X.; Jiang, H. Doubly Screened HybridFunctional: An Accurate First-Principles Approach for Both Narrow- and Wide-GapSemiconductors. J. Phys. Chem. Lett. , , 2338–2345.(26) Zhang, M.-Y.; Cui, Z.-H.; Wang, Y.-C.; Jiang, H. Hybrid functionals with system-dependent parameters: conceptual foundation and methodological developments. WIREs Comput. Mol. Sci. , 4827) Alml¨of, J.; Faegri, Jr., K.; Korsell, K. Principles for a Direct SCF Approach to LCAO-MO Ab-lnitio Calculations.
J. Comput. Chem. , , 385.(28) H¨aser, M.; Ahlrichs, R. Improvements on the direct SCF method. J. Comput. Chem. , , 104.(29) Burant, J. C.; Scuseria, G. E.; Frisch, M. J. A linear scaling method for HartreeFockexchange calculations of large molecules. J. Chem. Phys. , , 8969.(30) Schwegler, E.; Challacombe, M. Linear scaling computation of the HartreeFock ex-change matrix. J. Chem. Phys. , , 2726.(31) Schwegler, E.; Challacombe, M.; Head-Gordon, M. Linear scaling computation of theFock matrix. II. Rigorous bounds on exchange integrals and incremental Fock build. J. Chem. Phys. , , 9708.(32) Ochsenfeld, C.; White, C. A.; Head-Gordon, M. Linear and sublinear scaling formationof Hartree-Fock-type exchange matrices. J. Chem. Phys. , , 1663.(33) Rudberg, E.; Rubensson, E. H.; ; Sa(cid:32)lek, P. Kohn-Sham Density Functional The-ory Electronic Structure Calculations with Linearly Scaling Computational Time andMemory Usage. J. Chem. Theory Comput. , , 340.(34) Pisani, C.; Dovesi, R. Exact-Exchange Hartree-Fock Calculations for Periodic Solids.I. Illustration of the Method. Int. J. Quantum Chem. , XVII , 501.(35) Pisani, C.; Dovesi, R.; Roetti, C.
Hartree-Fock Ab Initio Treatment of Crystallinesolids, Volume 48 of Lecture Notes in Chemistry Series ; Springer Verlag: Berlin,1988.(36) Dovesi, R.; Orlando, R.; Erba, A.; Zicovich-Wilson, C. M.; Civalleri, B.; Casassa, S.;Maschio, L.; Ferrabone, M.; De La Pierre, M.; DArco, P.; No¨el, Y.; Caus`a, M.;49´erat, M.; ; Kirtman, B. CRYSTAL14: A Program for the Ab Initio Investigationof Crystalline Solids.
Int. J. Quan. Chem. , , 1287.(37) Guidon, M.; Schiffmann, F.; Hutter, J.; VandeVondele, J. Ab initio molecular dy-namics using hybrid density functionals. The Journal of chemical physics , ,214104.(38) Guidon, M.; Hutter, J.; VandeVondele, J. Robust Periodic Hartree-Fock Exchange forLarge-Scale Simulations Using Gaussian Basis Sets. J. Chem. Theory Comput. , , 3010.(39) Paier, J.; Diaconu, C. V.; Scuseria, G. E.; ad Joost VandeVondele, M. G.; Hutter, J.Accurate Hartree-Fock energy of extended systems using large Gaussian basis sets. Phys. Rev. B , , 174114.(40) Paier, J.; Marsman, M.; Hummer, K.; Kresse, G.; Gerber, I. C.; ´Angy´an, J. G.Screened hybrid density functionals applied to solids. J. Chem. Phys. , ,154709.(41) Betzinger, M.; Friedrich, C.; Bl¨ugel, S. Hybrid functionals within the all-electronFLAPW method: Implementation and applications of PBE0. Phys. Rev. B , , 195117.(42) Wu, X.; Selloni, A.; Car, R. Order-N implementation of exact exchange in extendedinsulating systems. Physical Review B , , 085102.(43) Lin, L. Adaptively compressed exchange operator. Journal of chemical theory andcomputation , , 2242–2249.(44) Becke, A. Density-functional exchange-energy approximation with correct asymptoticbehavior. Phys. Rev. A , , 3098.5045) Lee, C.; Yang, W.; Parr, R. G. Development of the Colle-Salvetti correlation-energyformula into a functional of the electron density. Phys. Rev. B , , 785–789.(46) Perdew, J. P.; Burke, K.; Ernzerhof, M. Generalized Gradient Approximation MadeSimple. Phys. Rev. Lett , , 3865.(47) Delly, B. From molecules to solids with the DMol approach. J. Chem. Phys. , , 7756.(48) Koepernik, K.; Eschrig, H. Full-potential nonorthogonal local-orbital minimum-basisband-structure scheme. Phys. Rev. B , , 1743.(49) Soler, J. M.; Artacho, E.; Gale, J. D.; Garc´ıa, A.; Junquera, J.; Ordej´on, P.; S´anchez-Portal, D. The SIESTA method for ab initio order-N materials simulation. J. Phys.:Condens. Matter , Comp. Phys.Comm. 
, , 2175.(52) Li, P.; Liu, X.; Chen, M.; Lin, P.; Ren, X.; Lin, L.; Yang, C.; He, L. Large-scale abinitio simulations based on systematically improvable atomic basis. Comput. Mater.Sci. , , 503.(53) Shang, H.; Li, Z.; Yang, J. Implementation of Exact Exchange with Numerical AtomicOrbitals. J. Phys. Chem. A , , 1039.(54) Qin, X.; Shang, H.; Xiang, H.; Li, Z.; Yang, J. HONPAS: A Linear Scaling Open-Source Solution for Large System Simulations. Int. J. Quan. Chem. , , 647.5155) Ren, X.; Rinke, P.; Blum, V.; Wieferink, J.; Tkatchenko, A.; Sanfilippo, A.; Reuter, K.;Scheffler, M. Resolution-of-identity approach to HartreeFock, hybrid density function-als, RPA, MP2 and GW with numeric atom-centered orbital basis functions. New J.Phys. , , 053020.(56) Ihrig, A. C.; Wieferink, J.; Zhang, I. Y.; Ropo, M.; Ren, X.; Rinke, P.; Scheffler, M.;Blum, V. Accurate localized resolution of identity approach for linear-scaling hybriddensity functionals and for many-body perturbation theory. New J. Phys. , ,093020.(57) Levchenko, S. V.; Ren, X.; Wieferink, J.; Johanni, R.; Rinke, P.; Blum, V.; Schef-fler, M. Hybrid functionals for large periodic systems in an all-electron, numeric atom-centered basis framework. Comp. Phys. Comm. , , 60.(58) Lin, P.; Ren, X.; He, L. Accuracy of Localized Resolution of the Identity in PeriodicHybrid Functional Calculations with Numerical Atomic Orbitals. J. Phys. Chem. Lett. , , 3082.(59) Lu, J.; Yin, L. Compression of the electron repulsion integraltensor in tensor hyper-contraction format with cubic scaling cost. J. Comput. Phys. , , 329.(60) Qin, X.; Hu, W.; Yang, J. Interpolative separable density fitting decomposition foraccelerating Hartree-Fock exchange calculations within numerical atomic orbitals. J.Phys. Chem. A , , 5664.(61) Feyereisen, M.; Fitzgerald, G.; Komornicki, A. Use of approximate integrals in abinitio theory, An application in MP2 energy calculations. Chem. Phys. Lett. , , 359.(62) Vahtras, O.; Alml¨of, J.; Feyereisen, M. W. 
Integral approximations for LCAO-SCFcalculations. Chem. Phys. Lett. , , 514.5263) Weigend, F.; H¨aser, M.; Patzelt, H.; Ahlrichs, R. RI-MP2: optimized auxiliary basissets and demonstration of efficiency. Chem. Phys. Lett. , , 143.(64) Whitten, J. L. Coulomb potential energy integrals and approximations. J. Chem. Phys. , , 4496.(65) Dunlap, B. I.; Connolly, J. W. D.; Sabin, J. R. On some approximations of X α method. J. Chem. Phys , , 3396.(66) Dunlap, B. I.; R¨osch, N.; Trickey, S. Variational fitting methods for electronic structurecalculations. Mol. Phys , , 3167.(67) Weigend, F. A fully direct RI-HF algorithm: Implementation, optimised auxiliarybasis sets, demonstration of accuracy and efficiency. Phys. Chem. Chem. Phys. , , 4285.(68) Eshuis, H.; Yarkony, J.; Furche, F. Fast computation of molecular random phaseapproximation correlation energies using resolution of the identity and imaginary fre-quency integration. J. Chem. Phys. , , 234114.(69) Del Ben, M.; Hutter, J.; VandeVondele, J. Electron Correlation in the CondensedPhase from a Resolution of Identity Approach Based on the Gaussian and PlaneWaves Scheme. J. Chem. Theo. Comput. , , 2654.(70) Merlot, P.; Kjrgaard, T.; Helgaker, T.; Lindh, R.; Aquilante, F.; Reine, S.; Ped-ersen, T. B. Attractive ElectronElectron Interactions within Robust Local FittingApproximations. J. Comput. Chem. , , 1486.(71) Billingsley II, F. P.; Bloor, J. E. Limited Expansion of Diatomic Overlap (LEDO): ANear?Accurate Approximate Ab Initio LCAO MO Method. I. Theory and PreliminaryInvestigations. J. Chem. Phys. , , 5178.5372) Pisani, C.; M, M. B.; Capecchi, G.; Casassa, S.; Dovesi, R.; Maschio, L.; Zicovich-Wilson, C.; Sch¨utz, M. Local-MP2 electron correlation method for nonconductingcrystals. J. Chem. Phys. , , 094113.(73) Pisani, C.; Maschio, L.; Casassa, S.; Halo, M.; Sch¨utz, M.; Usvyat, D. Periodic Lo-cal MP2 Method for the Study of Electronic Correlation in Crystals: Theory andPreliminary Applications. J. Comput. Chem. 
, , 2113.(74) Sodt, A.; Subotnik, J. E.; Head-Gordon, M. Linear scaling density fitting. J. Chem.Phys. , , 194109.(75) Sodt, A.; Head-Gordon, M. Hartree-Fock exchange computed using the atomic reso-lution of the identity approximation. J. Chem. Phys. , , 104106.(76) Reine, S.; Tellgren, E.; Krapp, A.; Kj?rgaard, T.; Helgaker, T.; Jansik, B.; H?st, S.;Salek, P. Variational and robust density fitting of four-center two-electron integrals inlocal metrics. J. Chem. Phys. , , 104101.(77) Wirz, L. N.; Reine, S. S.; Pedersen, T. B. On Resolution-of-the-Identity ElectronRepulsion Integral Approximations and Variational Stability. J. Comput. Chem. , , 4897.(78) The ABACUS software webpage: http://abacus.ustc.edu.cn .(79) Chen, M.; Guo, G.-C.; He, L. Systematically improvable optimized atomic basis setsfor ab initio calculations. J. Phys.: Condens. Matter , , 445501.(80) Hamann, D.; Schl¨uter, M.; Chiang, C. Norm-conserving pseudopotentials. PhysicalReview Letters , , 1494.(81) Talman, J. D. Numerical Fourier and Bessel transforms in logarithmic variables. J.Comp. Phys. , , 35–48. 5482) Talman, J. D. Numerical calculation of four-center Coulomb integrals. J. Chem. Phys. , , 2000–2008.(83) Talman, J. D. Numerical methods for multicenter integrals for numerically definedbasis functions applied in molecular calculations. Int. J. Quant. Chem. , ,72–90.(84) MPI. , 2020.(85) Gygi, F.; Baldereschi, A. Self-consistent Hartree-Fock and screened-exchange calcula-tions in solids: Application to silicon. Phys. Rev. B , , 4405.(86) Spencer, J.; Alavi, A. Efficient calculation of the exact exchange energy in periodicsystems using a truncated Coulomb potential. Phys. Rev. B , , 193110.(87) Des Cloizeaux, J. Energy Bands and Projection Operators in a Crystal: Analytic andAsymptotic Properties. Phys. Rev. , , 809.(88) Des Cloizeaux, J. Energy Bands and Projection Operators in a Crystal: Analytic andAsymptotic Properties. Phys. Rev. , , A685.(89) He, L.; Vanderbilt, D. 
Exponential decay properties of Wannier functions and relatedquantities. Physical Review Letters , , 5341.(90) Lawson, C. L.; Hanson, R. J.; Kincaid, D. R.; Krogh, F. T. Basic linear algebrasubprograms for Fortran usage. ,(91) Dongarra, J. J.; Du Croz, J.; Hammarling, S.; Hanson, R. J. An extended set ofFORTRAN basic linear algebra subprograms. ACM Transactions on MathematicalSoftware (TOMS) , , 1–17.(92) Dongarra, J. J.; Cruz, J. D.; Hammarling, S.; Duff, I. S. Algorithm 679: A set of level3 basic linear algebra subprograms: model implementation and test programs. ACMTransactions on Mathematical Software (TOMS) , , 18–28.5593) H¨aser, M.; Ahlrichs, R. Improvements on the direct SCF method. Journal of Compu-tational Chemistry , , 104–111.(94) OpenMP. , 2020.(95) Kan, R.; George, A. Machine scheduling problems : classification, complexity andcomputations. ,(96) Lenstra, J. K.; Kan, A. R.; Brucker, P. Annals of discrete mathematics ; Elsevier, 1977;Vol. 1; pp 343–362.(97) Graham, R. L. Bounds for certain multiprocessing anomalies.