[PDF] The static parallel distribution algorithms for hybrid density-functional calculations in HONPAS package

Abstract

Hybrid density-functional calculation is one of the most commonly adopted electronic structure theory used in computational chemistry and materials science because of its balance between accuracy and computational cost. Recently, we have developed a novel scheme called NAO2GTO to achieve linear scaling (Order-N) calculations for hybrid density-functionals. In our scheme, the most time-consuming step is the calculation of the electron repulsion integrals (ERIs) part. So how to create an even distribution of these ERIs in parallel implementation is an issue of particular importance. Here, we present two static scalable distributed algorithms for the ERIs computation. Firstly, the ERIs are distributed over ERIs shell pairs. Secondly, the ERIs is distributed over ERIs shell quartets. In both algorithms, the calculation of ERIs is independent of each other, so the communication time is minimized. We show our speedup results to demonstrate the performance of these static parallel distributed algorithms in the Hefei Order-N packages for \textit{ab initio} simulations (HONPAS).

Full PDF

TThe static parallel distributionalgorithms for hybrid density-functionalcalculations in HONPAS package

Journal TitleXX(X):1–10c (cid:13)

Xinming Qin † ,Honghui Shang* † ,Lei Xu , Wei Hu , Jinlong Yang , Shigang Li , YunquanZhang Abstract

Hybrid density-functional calculation is one of the most commonly adopted electronic structure theory used incomputational chemistry and materials science because of its balance between accuracy and computational cost.Recently, we have developed a novel scheme called NAO2GTO to achieve linear scaling (Order-N) calculations forhybrid density-functionalsShang et al. (2011). In our scheme, the most time-consuming step is the calculation of theelectron repulsion integrals (ERIs) part. So how to create an even distribution of these ERIs in parallel implementation isan issue of particular importance. Here, we present two static scalable distributed algorithms for the ERIs computation.Firstly, the ERIs are distributed over ERIs shell pairs. Secondly, the ERIs is distributed over ERIs shell quartets. In bothalgorithms, the calculation of ERIs is independent of each other, so the communication time is minimized. We showour speedup results to demonstrate the performance of these static parallel distributed algorithms in the Hefei Order-Npackages for ab initio simulations (HONPAS)Qin et al. (2014).

INTRODUCTION

The electronic structure calculations based on densityfunctional theory (DFT) Parr and Yang (1989); Hohenbergand Kohn (1964); Kohn and Sham (1965) are the work-horse of computational chemistry and materials science.However, widely used semi-local density functionals couldunderestimate the band gaps because of its inclusion ofthe unphysical self-interaction Mori-S´anchez et al. (2008).A possible solution is to add the nonlocal Hartree-Focktype exchange (HFX) into semi-local density-functionalsto construct hybrid functionalsBecke (1993); Stephenset al. (1994); Janesko et al. (2009); Paier et al. (2009);Monkhorst (1979); Delhalle and Calais (1987); Gell-Mannand Brueckner (1957); Heyd et al. (2003, 2006); Krukauet al. (2006); Frisch et al. (2009); Paier et al. (2006).However, the drawback of hybrid density-functionals isthat it is signiﬁcantly more expensive than conventionalDFT. The most time-consuming part in hybrid density-functional calculations becomes construction of HFX matrix,even with the appearance of fast linear scaling algorithmsthat overcome the bottlenecks encountered in conventionalmethodsSchwegler and Challacombe (1996); Burant et al.(1996); Schwegler et al. (1997); Ochsenfeld et al. (1998);Polly et al. (2004); Tymczak and Challacombe (2005); Sodtand Head-Gordon (2008); Guidon et al. (2010); Merlot et al.(2014). As a result, hybrid density-functional calculationsmust make efﬁcient use of parallel computing resourcesin order to reduce the execution time of HFX matrixconstruction.The implementation of hybrid density-functionals forsolid state physics calculations are mostly based on planewaves (PW)Gonze et al. (2002, 2016); Paier et al. (2006)or linear combination of atomic orbitals (LCAO)Krukau et al. (2006); Frisch et al. (2009); Dovesi et al. (2006)method. The atomic orbitals basis set is efﬁcient forreal-space formalisms, which have attracted considerableinterest for DFT calculations because of their favorablescaling with respect to the number of atoms and theirpotential for massively parallel implementations for large-scale calculations Delley (1990); Soler et al. (2002); Blumet al. (2009); Havu et al. (2009); Ren et al. (2012); Enkovaaraet al. (2010); Mohr et al. (2014); Frisch et al. (2009);Shang et al. (2011). Unlike plane wave method, whenconstructing HFX matrix within LCAO method, we mustﬁrst calculate the ERIs via the atomic orbitals. There arecurrently two types of atomic orbits that are most commonlyused. The ﬁrst is gaussian type orbital(GTO), as adoptedin GaussianFrisch et al. (2009) and CRYSTALDovesi et al.(2006), its advantage is to calculate ERIs analytically.The second is numerical atomic orbital (NAO), whichis adopted in SIESTASoler et al. (2002), DMOLDelley(1990), OPENMXOzaki (2003), et al.. The advantage ofNAO is its strict locality, which naturally leads to lowerorder scaling of computational time versus system size. Inorder to take advantages of both types of atomic orbitals,we have proposed a new scheme called NAO2GTOShanget al. (2011), in which GTO can be used for analytical State Key Laboratory of Computer Architecture, Institute of ComputingTechnology, Chinese Academy of Sciences, Beijing, China Hefei National Laboratory for Physical Sciences at Microscale,Department of Chemical Physics, and Synergetic Innovation Center ofQuantum Information and Quantum Physics, University of Science andTechnology of China, Hefei, Anhui 230026, China

Corresponding author: *Corresponding author: E-mail: [email protected] † Both authors contributed equally to this work.

Prepared using sagej.cls [Version: 2016/06/24 v1.10] a r X i v : . [ phy s i c s . c o m p - ph ] S e p Journal Title XX(X) computation of ERIs in a straightforward and efﬁcient way,while NAO can be employed to set the strict cutoff for atomicorbitals. After employing several ERI screening techniques,the construction of HFX matrix can be very efﬁcient andscale linearlyShang et al. (2011); Qin et al. (2014).Parallelization of HFX matrix construction faces twomajor problems of load imbalance and high communicationcost. The load imbalance arises from the irregularity ofthe independent tasks available in the computation, whichis due to the screening procedure and different typesof shell quartets distributed among processes. The highcommunication cost is from interprocessor communicationof the density and/or HFX matrices, which is associated withthe data access pattern. It is well known that NWChemValievet al. (2010) and CP2K/QuickstepVandeVondele et al. (2005)are the most outstanding softwares in the ﬁeld of highperformance parallel quantum chemical computing, andboth of them use GTOs to construct HFX matrix. InNWChem, the parallelization of HFX matrix constructionis based on a static partitioning of work followed bya work stealing phaseLiu et al. (2014); Chow et al.(2015). The tasks are statically partitioned throughoutthe set of shell (or atom) quartets, and then the workstealing phase acts to polish the load balance. As aresult, this parallel implementation gives very good parallelscalability of Hartree–Fock calculationsLiu et al. (2014).In CP2K/Quickstep, the HFX parallelization strategy is toreplicate the global density and HFX matrix on each MPIprocess in order to reduce communication. A load balanceoptimization based on simulated annealing and a binningprocedure to coarse grain the load balancing problem havebeen developedGuidon et al. (2008). However, this approachmay limit both system size and ultimately parallel scalability.As the ERIs calculation is the most computationallydemanding step in the NAO2GTO scheme, the developmentof the new parallel algorithms is of particular importance.Previously, for codes using localized atomic orbitals, theparallelization of ERIs are mainly implemented to treatﬁnite, isolated systems Schmidt et al. (1993); Alexeev et al.(2002); Liu et al. (2014); Chow et al. (2015), but only a fewliterature reports exist for the treatment of periodic boundaryconditions with such basis sets Bush et al. (2011); Guidonet al. (2008), in which the Order-N screening for the ERIscalculations has not been considered. The purpose of thiswork is to present the static parallel distribution algorithmsfor the NAO2GTO schemeShang et al. (2011) with Order-N performance in HONPAS codeQin et al. (2014). In ourapproaches, the calculations of ERIs are not replicated, butare distributed over CPU cores, as a result, both the memoryand the CPU requirements of the ERIs calculation areparalleled. The efﬁciency and scalability of these algorithmsare demonstrated by benchmark timings in periodic solidsystem with 64 silicon atoms in the unit cell.The outline of this paper is as follows: In Section 2, webegin with a description of the theory of hybrid functionals.In Section 3, we describe the detail implementation of ourparallel distribution . In Section 4, we present the benchmarkresults and the efﬁciency of our scheme.

Fundamental Theoretical Framework

Before addressing the parallel algorithms, we recall the basicequations used in this work. A spin-unpolarized notationis used throughout the text for the sake of simplicity,but a formal generalization to the collinear spin caseis straightforward. In Kohn-Sham DFT, the total-energyfunctional is given as E KS = T s [ n ] + E ext [ n ] + E H [ n ] + E xc [ n ] + E nuc-nuc . (1)Here, n ( r ) is the electron density, T s is the kinetic energyof non-interacting electrons, while E ext is external energystemming from the electron-nuclear attraction, E H is theHartree energy, E xc is the exchange-correlation energy, and E nuc-nuc is the nucleus-nucleus repulsion energy.The ground state electron density n ( r ) (and theassociated ground state total energy) is obtained byvariationally minimizing Eq. (1) under the constraint thatthe number of electrons N e is conserved. This yields thechemical potential µ = δE KS /δn of the electrons and theKohn-Sham single particle equations ˆ h KS ψ i = (cid:2) ˆ t s + v ext ( r ) + v H + v xc (cid:3) ψ i = (cid:15) p ψ i (2)for the Kohn-Sham Hamiltonian ˆ h KS . In Eq. (2), ˆ t s denotesthe kinetic energy operator, v ext the external potential, v H the Hartree potential, and v xc the exchange-correlationpotential. Solving Eq. (2) yields the Kohn-Sham singleparticle states ψ p and their eigenenergies (cid:15) p . The singleparticle states determine the electron density via n ( r ) = (cid:88) i f i | ψ i | , (3)in which f i denotes the Fermi-Dirac distribution function.To solve Eq. (2) in numerical implementations, the Kohn-Sham states are expanded in a ﬁnite basis set. For periodicsystems, the crystalline orbital ψ i ( k , r ) normalized in allspace is a linear combination of Bloch functions φ µ ( k , r ) ,deﬁned in terms of atomic orbitals χ R µ ( r ) . ψ i ( k , r ) = (cid:88) µ c µ,i ( k ) φ µ ( k , r ) (4) φ µ ( k , r ) = 1 √ N (cid:88) R χ R µ ( r ) e i k · ( R + r µ ) (5)where the Greek letter µ is the index of atomic orbitals, i is the sufﬁx for different bands, R is the origin of a unitcell, N is the number of unit cells in the system. χ R µ ( r ) = χ µ ( r − R − r µ ) is the µ -th atomic orbital, whose center isdisplaced from the origin of the unit cell at R by r µ . c µ,i ( k ) is the wave function coefﬁcient, which is obtained by solvingthe following equation, H ( k ) c ( k ) = E ( k ) S ( k ) c ( k ) (6) [ H ( k )] µν = (cid:88) R H R µν e i k · ( R + r ν − r µ ) (7) H R µν = < χ µ | ˆ H | χ R ν > (8) [ S ( k )] µν = (cid:88) R S R µν e i k · ( R + r ν − r µ ) (9) Prepared using sagej.cls inming Qin, Honghui Shang, Lei Xu, Wei Hu, Jinlong Yang, Shigang Li, ,Yunquan Zhang S R µν = < χ µ | χ R ν > (10)In Eq. (8), H R µν is a matrix element of the one-electronHamiltonian operator ˆ H between the atomic orbital χ µ located in the central unit cell and χ ν located in the unitcell R .It should be noted that, the exchange-correlation potential v xc is a local and periodic in semi-local DFT, whilein Hartree-Fock and hybrid functionals, the Hartree-Fockexchange (HFX) potential matrix element is deﬁned as: [ V X ] G µλ = − (cid:88) νσ (cid:88) N , H P H − N νσ [( χ µ χ N ν | χ G λ χ H σ )] (11)where G , N , and H represent different unit cells. Thedensity matrix element P N νσ is computed by an integrationover the Brillouin zone (BZ) P N νσ = (cid:88) j (cid:90) BZ c ∗ ν,j ( k ) c σ,j ( k ) θ ( (cid:15) F − (cid:15) j ( k )) e i k · N d k (12)where θ is the step function, (cid:15) F is the fermi energy and (cid:15) j ( k ) is the j -th eigenvalue at point k .In order to calculate the following ERI in Eq. (11) ( χ µ χ N ν | χ G λ χ H σ ) = (cid:90) (cid:90) χ µ ( r ) χ N ν ( r ) χ G λ ( r (cid:48) ) χ H σ ( r (cid:48) ) | r − r (cid:48) | drdr (cid:48) (13)we use NAO2GTO scheme described in the followingsection.Following the ﬂowchart in Fig.1, the NAO2GTO schemeis to ﬁrstly ﬁt the NAO with GTOs, and then to calculateERIs analytically with ﬁtted GTOs. A NAO is a product of anumerical radial function and a spherical harmonic φ Ilmn ( r ) = ϕ Iln ( r ) Y lm (ˆ r ) (14)The radial part of the numerical atomic orbital ϕ Iln ( r ) iscalculated by the following equation: ( − d d r r + l ( l + 1)2 r + V ( r ) + V cut ) ϕ Iln ( r ) = (cid:15) l ϕ Iln ( r ) (15)where V ( r ) denotes the electrostatic potential for orbital ϕ Iln ( r ) , and V cut ensures a smooth decay of each radialfunction which is strictly zero outside a conﬁning radius r cut . METHODSNAO2GTO scheme

In our NAO2GTO approach, the radial part of the NAO ϕ Iln ( r ) is ﬁtted by the summation of several GTOs, denotedas χ ( r ) χ ( r ) ≡ (cid:88) m D m r l exp( − α m r ) (16)Parameters α m and D m are determined by minimizing theresidual sum of squares of the difference (cid:88) i ( χ ( r i ) /r li − ϕ Iln ( r i ) /r li ) (17)In practice of the solid system calculation, too diffusedbasis set may cause convergence problemGuidon et al. Figure 1.

The ﬂowchart of the NAO2GTO scheme in HONPAS. (2008), as a result the exponents smaller than 0.10 areusually not needed, and we made a constraint, i.e. ( α > . ) during our minimal search. First we use constrainedgenetic algorithmGoldberg (1989); Conn et al. (1991) to doa global search for initial guess and then do a constrainedlocal minimal search using trust-region-reﬂective algorithm,which is a subspace trust-region method and is based onthe interior-reﬂective Newton method described in Colemanand Li (1994) and Coleman and Li (1996). Each iterationinvolves the approximate solution of a large linear systemusing the method of preconditioned conjugate gradients. Wemake N ( N > ) global searches to make sure to have aglobal minimal.

Algorithm 1

The algorithm of NAO2GTO ﬁtting scheme. for iter = 1 to N do constrained genetic algorithm get initial α iterm and D iterm constrained local minimal search to get α iterm and D iterm err = (cid:80) i [ (cid:80) m D m exp( − α m r i ) − ϕ Iln ( r i ) /r li ] if iter = 1 .or. err < best err then best err = errα m = α iterm and D m = D iterm end ifend for Using the above ﬁtting parameters, we build all ERIsthat needed for the HFX. In our implementation, the fullpermutation symmetry of the ERIs has been considered forsolid systems: ( µ ν H | λ G σ N ) = ( µ ν H | σ N λ G ) =( ν µ − H | λ G − H σ N − H ) = ( ν µ − H | σ N − H λ G − H ) =( λ σ N − G | µ − G ν H − G ) = ( λ σ N − G | ν H − G µ − G ) =( σ λ G − N | µ − N ν H − N ) = ( σ λ G − N | ν H − N µ − N ) (18)In this way, we save a factor of 8 in the number of integralsthat have to be calculated.When calculating the ERIs with GTOs, the atomic orbitalsare grouped into shells according to the angular momentum.The list need to be distributed in parallel is in fact the Prepared using sagej.cls

Journal Title XX(X) shell quartet. For a shell with a angular momentum of l ,the number of atomic orbital basis functions is l + 1 , soin a shell quartet integral ( IJ | KL ) , we calculate in total (2 l I + 1)(2 l J + 1)(2 l K + 1)(2 l L + 1) atomic basis orbitalERIs together. As a result, the computational expense isstrongly dependent on the angular momenta of the shellquartet. It is a challenge to distribute these shell ERIs notonly in number but also considering the time-weight.In our NAO2GTO scheme, two shell pair lists (list-IJand list-KL) are ﬁrstly preselected according to SchwarzscreeningHser and Ahlrichs (1989), | ( µν | λσ ) | (cid:54) (cid:112) ( µν | µν )( λσ | λσ ) (19)and only the shell list indexes with ( IJ | IJ ) > τ or ( KL | KL ) > τ (here τ is the drop tolerance) are stored. Asshown in Eq. (11), the ﬁrst index I runs only within the unitcell, while the indexes (J,K,L) run over the whole supercell,so the list-IJ is smaller than the list-KL. Then in the ERIscalculations, the loops run over these two shell lists.Then before the calculation of every ERI, we useSchwarz inequality Eq. (19) again to estimate a rigorousupper bound, that only the ERIs with non-negligiblecontributions are calculated, we note this screening methodas Schwarz screening. Using the exponential decay of thecharge distributions, the Schwarz screening reduces the totalnumber of ERIs to be computed from O ( N ) to O ( N ) . TheSchwarz screening tolerance is set to 10 − in the followingcalculation. In addition, the NAO screening is also adoptedas the NAO is strictly truncatedShang et al. (2011). TheNAO screening is safe in the calculation of the short-rangeERI because in this case the Hartree-Fock exchange (HFX)Hamiltonian matrix is sparse due to the screened CoulombpotentialIzmaylov et al. (2006). As a result, we store thisHFX Hamiltonian with a sparse matrix data structure.In practice, it also should be noted that as the angularpart of the NAOs is spherical harmonic while the GTOsare Cartesian Gaussians, we need to make a transformationbetween Cartesian and Spherical harmonic functions. Thedifference between these two harmonic functions is thenumber of atomic orbitals including in the shells whoseangular momentum are larger than 1. For example, a d-shellhas 5 Spherical orbitals, but have 6 Cartesian orbitals. Adetailed transformation method can be ﬁnd in Ref.Schlegeland Frisch (1995). Parallel schemes

The ERIs have four indexed that can be paralleled. Onepossible parallel scheme to make distribution of just oneshell index, however, as the index number in one shell is toosmall to make distribution over CPU cores, such shell-onedistribution may cause serious load imbalance.The other parallel scheme is to make distribution for shell-pair (list-KL). It is a straightforward way to parallelize theERIs as in practice we loop over two pair lists. However,although the shell-pair can be distributed evenly beforeERIs calculations, ERI-screening is needed during the ERIscalculations, which also causes load imbalance. The practicalimplementation of the described formalism closely followsthe ﬂowchart shown in Algorithm 2. After the building list-KL is completed, we distributed it into CPU cores with list-KL-local at every cores. Then in the following ERIsscreening and calculation, only loops over list-KL-local isneeded for every core. The advantage of this scheme isthat it naturally bypasses the the communication process,and every CPU core only go over and compute its assignedlist-KL-local. However, although the list-KL is distributedevenly over processors, the ERI screening is located after theparallelization, which causes different number of shell ERIsthat need to be calculated in every processor. Such differentnumber of shell ERIs makes load imbalance.In order to achieve load balance, distribution of individualshell-quartet ( IJ | KL ) after the ERI screening processa possible choice. Even if the computational time isnonuniformity in the case of the different shell type, thisdistribution can also yield an even time over CPU coresbecause of its smallest distribution chunks. The practicalimplementation of the shell-quartet algorithm is shown inAlgorithm 3. Every CPU core will go over the global pairlists( list-IJ and list-KL), and make the ERI screening todetermine which ERIs are needed to be computed. Thena global counter is set to count the number of computedERIs, this counter is distributed over CPU cores to makesure the number of calculated ERIs is evenly distributed. Thedisadvantage of this algorithm is that every processor need tomake the whole ERI screening, while in the above parallel-pair algorithm, only the ERI screening in its local lists isneeded. Such globally calculated ERIs screening decreasesthe parallel efﬁciency. Algorithm 2

Flowchart of the parallel-pair algorithms forERIs. Here the shell-pair-list-KL-local means to distributethe shell-list-KL over CPU cores at the beginning. Thedescription of Schwarz screening and NAO screening are inthe text.get shell-pair-list-KL-local for list-IJ in shell-pair-list-IJ dofor list-KL in shell-pair-list-KL-local doif

Schwarz screening.and. NAO screening then compute shell ERI ( IJ | KL ) end ifend forend forAlgorithm 3 Flowchart of the parallel-quartet algorithms forERIs. Here N refers to the total number of the CPU cores,current-core refers to the index of the current processor. Thedescription of Schwarz screening and NAO screening are inthe text. for list-IJ in shell-pair-list-IJ dofor list-KL in shell-pair-list-KL doif

Schwarz screening.and. NAO screening then i ++ if i mod N eq current-core then compute shell ERI ( IJ | KL ) end ifend ifend forend for Prepared using sagej.cls inming Qin, Honghui Shang, Lei Xu, Wei Hu, Jinlong Yang, Shigang Li, ,Yunquan Zhang Table 1.

The parameters for each node of Tianhe-2.

Component ValueCPU Intel(R) Xeon(R) CPU E5-2692Freq. (GHz) 2.2Cores 12

Table 2.

The CPU time (in seconds) for the calculation of theERIs of Si 64 system with SZ basis set using different parallelschemes. cores parallel-pair parallel-quartet12 110.9 108.224 55.6 57.096 15.5 18.2192 8.6 11.6

RESULTS AND DISCUSSION

In order to demonstrate the performance of the above twostatic parallel schemes, we use silicon bulk contained 64atoms in the unit cell as a test case as shown in Fig. 2. Norm-conserving pseudopotentials generated with the Troullier-MartinsTroullier and Martins (1991) scheme, in fullyseparable form developed by Kleiman and ByladerKleinmanand Bylander (1982), are used to represent interactionbetween core ion and valence electrons. The screened hybridfunctional HSE06Krukau et al. (2006) was used in thefollowing calculations. Both single-zeta (SZ) contained s andp shells and double-zeta plus polarization (DZP) basis setcontained s,p and d shells are considered. All calculationswere carried out on Tianhe-2 supercomputer located inNational Supercomputer Center in Guangzhou, China, theconﬁguration of the machine is shown in Table 1. TheIntel Math Kernel Library(version 10.0.3.020) is used in thecalculations.For the parallel-pair and parallel-quartet algorithms, whichare fully parallelized and involve no communication, loadimbalance is one of factors that may affect the parallelefﬁciency. To examine the load balance, the timing at everycores are shown in Fig. 3-Fig. 6. It is clearly shown that theparallel-pair algorithm (red line) is load imbalance, for SZbasis set, the time difference between cores is around 10%in 12 cores (Fig. 3) case and around 80% in 192 cores (Fig.4). For DZP basis set, d shells have been considered, whichcaused more serious load imbalance, the time differencebetween cores is even around 22% in 12 cores (Fig.5) andaround 100% in 192 cores (Fig.6). On the other hand, theload balance in parallel-quartet algorithm is quite well, thetime difference between cores is within 1%. However, in thisalgorithm, as the ERIs screening part is made for the wholeERIs by all the CPU cores. which is a constant time even withthe increasing CPU cores, there are replicate calculations forthe ERIs screening which decrease the parallel efﬁciency. Asshown in Fig.4), for small basis set at large number of CPUcores, the average CPU time of the parallel-quartet is aroundtwice as the parallel-pair algorithm.Such global ERI screening in parallel-quartet algorithmalso contributes signiﬁcantly to lowering the parallel speedupand efﬁciency for SZ basis set, as shown in Table2 andFig.7. The parallel efﬁciency is only 58% at 192 CPU corefor parallel-quartet while holds around 80% for parallel-pairalgorithm. basis set shells NAOsSi64 SZ 128 256Si64 DZP 320 832

Figure 2.

The silicon bulk contained 64 atoms in the unit cellthat used as benchmark system in this work. In the upper table,the number of shells as well as the number of the NAO basisfunctions for different basis sets are listed.

CPU core index C P U T i m e ( s )

1. parallel-pair2. parallel-quartet

SZ basis set

Figure 3.

The load balance for bulk silicon supercell with 64atoms using SZ basis set at 12 CPU cores.

Table 3.

The CPU time (in seconds) for the calculation of theERIs of Si 64 system with DZP basis set using different parallelschemes. cores parallel-pair parallel-quartet12 1645.1 1572.924 904.0 806.196 251.3 225.9192 129.6 128.0For DZP basis set, the load imbalance caused by d shellsbecome another factor of lowering the parallel efﬁciency,so in this case, as shown in Table2 and Fig.8, the parallelspeedup and efﬁciency are similar for both parallel-pair and

Prepared using sagej.cls

Journal Title XX(X)

CPU core index C P U T i m e ( s )

1. parallel-pair2. parallel-quartet

SZ basis set

Figure 4.

The load balance for bulk silicon supercell with 64atoms using SZ basis set at 192 CPU cores.

CPU core index C P U T i m e ( s )

1. parallel-pair2. parallel-quartet

DZP basis set

Figure 5.

The load balance for bulk silicon supercell with 64atoms using DZP basis set at 12 CPU cores.

CPU core index C P U T i m e ( s )

1. parallel-pair2. parallel-quartet

DZP basis set

Figure 6.

The load balance for bulk silicon supercell with 64atoms using DZP basis set at 192 CPU cores. parallel-quartet algorithms, that around 80% at 192 CPUcores.

Number of CPU cores Sp ee d u p

1. parallel-pair2. parallel-quartetideal

Number of CPU cores P a r a ll e l E ff i c i e n c y SZ basis set

Figure 7. (Color online) Parallel Speedups and efﬁciency forERIs calculation formation using different parallel schemes.Speedups were obtained on Tianhe-2 for bulk silicon supercellwith 64 atoms using SZ basis set. The speedup is referenced toa run on 12 CPUs.

Number of CPU cores Sp ee d u p

1. parallel-pair2. parallel-quartetideal

Number of CPU cores P a r a ll e l E ff i c i e n c y DZP basis set

Figure 8. (Color online) Parallel Speedups and efﬁciency forERIs calculation using different parallel schemes. Speedupswere obtained on Tianhe-2 for bulk silicon supercell with 64atoms using DZP basis set The speedup is referenced to a runon 12 CPUs.

CONCLUSIONS

In summary, we have shown our two static parallelalgorithms for the ERIs calculations in NAO2GTO method.We have also analyzed the performance of these two parallelalgorithms for their load balance and parallel efﬁciency. Onthe basis of our results, the static distribution of ERI shellpairs, produces load imbalance that causes the efﬁciencyto decrease, limiting the number of CPU cores that can beutilized. On the other hand, the static distribution of ERIshell quartet can yield very high load balance, however,because the need of the global ERI screening calculation, theparallel efﬁciency has been dramatically reduced for smallbasis set. We have also tried another static method that ﬁrstlycreate a need-to-calculate ERIs list by considering all thescreening methods as well as the eight-fold permutationalsymmetry and secondly distribute the ERIs in the need-to-calculate list over a number of processes. However, we ﬁndthe time to build the need-to-calculate ERIs list is even largerthan the global ERI screening calculation. On the next step,we need to distribute the ERI screening calculation whilekeep the load balance of the ERI calculation, and a dynamicdistribution could enables load balance with little loss ofefﬁciency.

Acknowledgements

This work was supported by the National Key Research & Development Program of China (2017YFB0202302), the Special

Prepared using sagej.cls inming Qin, Honghui Shang, Lei Xu, Wei Hu, Jinlong Yang, Shigang Li, ,Yunquan Zhang Fund for Strategic Pilot Technology of Chinese Academyof Sciences (XDC01040000), the National Natural ScienceFoundation of China (61502450), the National Natural ScienceFoundation of China (21803066), Research Start-Up Grants(KY2340000094) from University of Science and Technology ofChina, and Chinese Academy of Sciences Pioneer Hundred TalentsProgram. The authors thank the Tianhe-2 Supercomputer Center forcomputational resources.

References

Alexeev Y, Kendall RA and Gordon MS (2002) The distributed dataSCF.

Computer Physics Communications https://doi.org/10.1016%2Fs0010-4655%2801%2900439-8 .Becke AD (1993) Density-functional thermochemistry. III. the roleof exact exchange.

The Journal of Chemical Physics https://doi.org/10.1063%2F1.464913 .Blum V, Gehrke R, Hanke F, Havu P, Havu V, Ren X,Reuter K and Schefﬂer M (2009) Ab initio molecularsimulations with numeric atom-centered orbitals.

Comput.Phys. Commun. http://linkinghub.elsevier.com/retrieve/pii/S0010465509002033 .Burant JC, Scuseria GE and Frisch MJ (1996) A linear scalingmethod for hartree–fock exchange calculations of largemolecules.

The Journal of Chemical Physics https://doi.org/10.1063%2F1.472627 .Bush IJ, Tomic S, Searle BG, Mallia G, Bailey CL, MontanariB, Bernasconi L, Carr JM and Harrison NM (2011) Parallelimplementation of the ab initio

CRYSTAL program: electronicstructure calculations for periodic systems.

Proceedings ofthe Royal Society A: Mathematical, Physical and EngineeringSciences https://doi.org/10.1098%2Frspa.2010.0563 .Chow E, Liu X, Smelyanskiy M and Hammond JR (2015) Parallelscalability of hartree–fock calculations.

The Journal ofChemical Physics https://doi.org/10.1063%2F1.4913961 .Coleman TF and Li Y (1994) On the convergence of interior-reﬂective newton methods for nonlinear minimization subjectto bounds.

Mathematical Programming https://doi.org/10.1007%2Fbf01582221 .Coleman TF and Li Y (1996) An interior trust region approachfor nonlinear minimization subject to bounds.

SIAM Journalon Optimization https://doi.org/10.1137%2F0806023 .Conn AR, Gould NIM and Toint P (1991) A globally convergentaugmented lagrangian algorithm for optimization with generalconstraints and simple bounds.

SIAM Journal on NumericalAnalysis https://doi.org/10.1137%2F0728030 .Delhalle J and Calais JL (1987) Direct-space analysis of the hartree-fock energy bands and density of states for metallic extendedsystems.

Physical Review B https://doi.org/10.1103% 2Fphysrevb.35.9460 .Delley B (1990) An all-electron numerical method for solving thelocal density functional for polyatomic molecules.

The Journalof Chemical Physics https://doi.org/10.1063%2F1.458452 .Dovesi R, Saunders V, Roetti C, Orlando R, Zicovich-Wilsona CM,Pascale F, Civalleri B, Doll K, Harriso NM, Bush I, DArco Pand Llunell M (2006) Crystal06 users manual. University ofTorino, Torino, 2006.Enkovaara J, Rostgaard C, Mortensen JJ, Chen J, Duak M,Ferrighi L, Gavnholt J, Glinsvad C, Haikola V, Hansen HA,Kristoffersen HH, Kuisma M, Larsen AH, Lehtovaara L,Ljungberg M, Lopez-Acevedo O, Moses PG, Ojanen J, OlsenT, Petzold V, Romero NA, Stausholm-Mø ller J, Strange M,Tritsaris GA, Vanin M, Walter M, Hammer B, H¨akkinenH, Madsen GKH, Nieminen RM, Nø rskov JK, Puska M,Rantala TT, Schiø tz J, Thygesen KS and Jacobsen KW (2010)Electronic structure calculations with GPAW: a real-spaceimplementation of the projector augmented-wave method.

J.Phys. Condens. Matter http://stacks.iop.org/0953-8984/22/i=25/a=253202 .Frisch MJ, Trucks GW, Schlegel HB, Scuseria GE, Robb MA,Cheeseman JR, Scalmani G, Barone V, Mennucci B, PeterssonGA, Nakatsuji H, Caricato M, Li X, Hratchian HP, IzmaylovAF, Bloino J, Zheng G, Sonnenberg JL, Hada M, Ehara M,Toyota K, Fukuda R, Hasegawa J, Ishida M, Nakajima T,Honda Y, Kitao O, Nakai H, Vreven T, Montgomery JA, PeraltaJE, Ogliaro F, Bearpark M, Heyd JJ, Brothers E, Kudin KN,Staroverov VN, Kobayashi R, Normand J, Raghavachari K,Rendell A, Burant JC, Iyengar SS, Tomasi J, Cossi M, RegaN, Millam JM, Klene M, Knox JE, Cross JB, Bakken V,Adamo C, Jaramillo J, Gomperts R, Stratmann RE, YazyevO, Austin AJ, Cammi R, Pomelli C, Ochterski JW, MartinRL, Morokuma K, Zakrzewski VG, Voth GA, Salvador P,Dannenberg JJ, Dapprich S, Daniels AD, Farkas, ForesmanJB, Ortiz JV, Cioslowski J and Fox DJ (2009) Gaussian 09,Revision B.01.Gell-Mann M and Brueckner KA (1957) Correlation energy of anelectron gas at high density.

Physical Review https://doi.org/10.1103%2Fphysrev.106.372 .Goldberg DE (1989)

Genetic Algorithms in Search, Optimizationand Machine Learning . 1st edition. Boston, MA, USA:Addison-Wesley Longman Publishing Co., Inc. ISBN0201157675.Gonze X, Beuken JM, Caracas R, Detraux F, Fuchs M, RignaneseGM, Sindic L, Verstraete M, Zerah G, Jollet F, Torrent M, RoyA, Mikami M, Ghosez P, Raty JY and Allan DC (2002) First-principles computation of material properties: The ABINITsoftware project.

Comput. Mater. Sci. https://doi.org/10.1016%2Fs0927-0256%2802%2900325-7 .Gonze X, Jollet F, Abreu Araujo F, Adams D, Amadon B,Applencourt T, Audouze C, Beuken JM, Bieder J, BokhanchukA, Bousquet E, Bruneval F, Caliste D, Cˆot´e M, Dahm F, DaPieve F, Delaveau M, Di Gennaro M, Dorado B, Espejo C,Geneste G, Genovese L, Gerossier A, Giantomassi M, GilletY, Hamann D, He L, Jomard G, Laﬂamme Janssen J, Le RouxS, Levitt A, Lherbier A, Liu F, Lukaˇcevi´c I, Martin A, MartinsC, Oliveira M, Ponc´e S, Pouillon Y, Rangel T, Rignanese

Prepared using sagej.cls

Journal Title XX(X)

GM, Romero A, Rousseau B, Rubel O, Shukri A, StankovskiM, Torrent M, Van Setten M, Van Troeye B, Verstraete M,Waroquiers D, Wiktor J, Xu B, Zhou A and ZwanzigerJ (2016) Recent developments in the ABINIT softwarepackage.

Comput. Phys. Commun.

DOI:10.1016/j.cpc.2016.04.003. URL http://linkinghub.elsevier.com/retrieve/pii/S0010465516300923 .Guidon M, Hutter J and VandeVondele J (2010) Auxiliarydensity matrix methods for hartree-fock exchange calculations.

Journal of Chemical Theory and Computation https://doi.org/10.1021%2Fct1002225 .Guidon M, Schiffmann F, Hutter J and VandeVondele J (2008) Abinitio molecular dynamics using hybrid density functionals.

The Journal of Chemical Physics https://doi.org/10.1063%2F1.2931945 .Havu V, Blum V, Havu P and Schefﬂer M (2009) Efﬁ-cient integration for all-electron electronic structure cal-culation using numeric basis functions.

J. Comput.Phys. http://linkinghub.elsevier.com/retrieve/pii/S0021999109004458 .Heyd J, Scuseria GE and Ernzerhof M (2003) Hybrid functionalsbased on a screened coulomb potential.

The Journalof Chemical Physics https://doi.org/10.1063%2F1.1564060 .Heyd J, Scuseria GE and Ernzerhof M (2006) Erratum: “hybridfunctionals based on a screened coulomb potential” [j. chem.phys. 118, 8207 (2003)].

The Journal of Chemical Physics https://doi.org/10.1063%2F1.2204597 .Hohenberg P and Kohn W (1964) Inhomogeneous electrongas.

Physical Review https://doi.org/10.1103%2Fphysrev.136.b864 .Hser M and Ahlrichs R (1989) Improvements on the direct SCFmethod.

Journal of Computational Chemistry https://doi.org/10.1002%2Fjcc.540100111 .Izmaylov AF, Scuseria GE and Frisch MJ (2006) Efﬁcientevaluation of short-range hartree-fock exchange in largemolecules and periodic systems.

The Journal of ChemicalPhysics https://doi.org/10.1063/1.2347713 .Janesko BG, Henderson TM and Scuseria GE (2009) Screenedhybrid density functionals for solid-state chemistry andphysics.

Phys. Chem. Chem. Phys. https://doi.org/10.1039%2Fb812838c .Kleinman L and Bylander DM (1982) Efﬁcacious form for modelpseudopotentials.

Physical Review Letters https://doi.org/10.1103%2Fphysrevlett.48.1425 .Kohn W and Sham LJ (1965) Self-consistent equations includ-ing exchange and correlation effects.

Physical Review https://doi.org/10.1103%2Fphysrev.140.a1133 . Krukau AV, Vydrov OA, Izmaylov AF and Scuseria GE (2006)Inﬂuence of the exchange screening parameter on theperformance of screened hybrid functionals.

The Journal ofChemical Physics https://doi.org/10.1063%2F1.2404663 .Liu X, Patel A and Chow E (2014) A new scalable parallelalgorithm for fock matrix construction. In: .IEEE. DOI:10.1109/ipdps.2014.97. URL https://doi.org/10.1109%2Fipdps.2014.97 .Merlot P, Izs´ak R, Borgoo A, Kjærgaard T, Helgaker T and ReineS (2014) Charge-constrained auxiliary-density-matrix methodsfor the hartree-fock exchange contribution.

The Journal ofChemical Physics https://doi.org/10.1063%2F1.4894267 .Mohr S, Ratcliff LE, Boulanger P, Genovese L, Caliste D, DeutschT and Goedecker S (2014) Daubechies wavelets for linearscaling density functional theory.

J. Chem. Phys. .Monkhorst HJ (1979) Hartree-Fock density of states for extendedsystems.

Physical Review B https://doi.org/10.1103%2Fphysrevb.20.1504 .Mori-S´anchez P, Cohen AJ and Yang W (2008) Localization anddelocalization errors in density functional theory and implica-tions for band-gap prediction.

Physical Review Letters https://doi.org/10.1103%2Fphysrevlett.100.146401 .Ochsenfeld C, White CA and Head-Gordon M (1998) Linear andsublinear scaling formation of Hartree–Fock-type exchangematrices.

The Journal of Chemical Physics https://doi.org/10.1063%2F1.476741 .Ozaki T (2003) Variationally optimized atomic orbitals for large-scale electronic structures.

Physical Review B https://doi.org/10.1103%2Fphysrevb.67.155108 .Paier J, Diaconu CV, Scuseria GE, Guidon M, VandeVondele Jand Hutter J (2009) Accurate hartree-fock energy of extendedsystems using large gaussian basis sets.

Physical Review B https://doi.org/10.1103%2Fphysrevb.80.174114 .Paier J, Marsman M, Hummer K, Kresse G, Gerber IC and ´Angy´anJG (2006) Screened hybrid density functionals applied tosolids.

The Journal of Chemical Physics https://doi.org/10.1063%2F1.2187006 .Parr RG and Yang W (1989)

Density Functional Theory of Atomsand Molecules . New York: Oxford University Press.Polly R, Werner HJ, Manby FR and Knowles PJ (2004) Fasthartree–fock theory using local density ﬁtting approximations.

Molecular Physics https://doi.org/10.1080%2F0026897042000274801 .Qin X, Shang H, Xiang H, Li Z and Yang J (2014) HONPAS:A linear scaling open-source solution for large systemsimulations.

International Journal of Quantum Chemistry https://doi.org/10.1002%2Fqua.24837 . Prepared using sagej.cls inming Qin, Honghui Shang, Lei Xu, Wei Hu, Jinlong Yang, Shigang Li, ,Yunquan Zhang Ren X, Rinke P, Blum V, Wieferink J, Tkatchenko A, Sanﬁlippo A,Reuter K and Schefﬂer M (2012) Resolution-of-identityapproach to HartreeFock, hybrid density functionals,RPA, MP2 and GW with numeric atom-centered orbitalbasis functions.

New J. Phys. http://stacks.iop.org/1367-2630/14/i=5/a=053020?key=crossref.351b343783c2c1df1596219a941a74eb .Schlegel HB and Frisch MJ (1995) Transformation betweencartesian and pure spherical harmonic gaussians.

InternationalJournal of Quantum Chemistry https://doi.org/10.1002%2Fqua.560540202 .Schmidt MW, Baldridge KK, Boatz JA, Elbert ST, Gordon MS,Jensen JH, Koseki S, Matsunaga N, Nguyen KA, Su S,Windus TL, Dupuis M and Montgomery JA (1993) Generalatomic and molecular electronic structure system.

Journalof Computational Chemistry https://doi.org/10.1002%2Fjcc.540141112 .Schwegler E and Challacombe M (1996) Linear scaling compu-tation of the hartree–fock exchange matrix.

The Journal ofChemical Physics https://doi.org/10.1063%2F1.472135 .Schwegler E, Challacombe M and Head-Gordon M (1997) Linearscaling computation of the fock matrix. II. rigorous boundson exchange integrals and incremental fock build.

TheJournal of Chemical Physics https://doi.org/10.1063%2F1.473833 .Shang H, Li Z and Yang J (2011) Implementation of screenedhybrid density functional for periodic systems with numericalatomic orbitals: Basis function ﬁtting and integral screening.

The Journal of Chemical Physics https://doi.org/10.1063%2F1.3610379 .Sodt A and Head-Gordon M (2008) Hartree-Fock exchangecomputed using the atomic resolution of the identityapproximation.

The Journal of Chemical Physics https://doi.org/10.1063%2F1.2828533 .Soler JM, Artacho E, Gale JD, Garc´ıa A, Junquera J, Ordej´onP and S´anchez-Portal D (2002) The siesta method for abinitio order- n materials simulation. Journal of Physics:Condensed Matter https://doi.org/10.1088%2F0953-8984%2F14%2F11%2F302 .Stephens PJ, Devlin FJ, Chabalowski CF and Frisch MJ (1994)Ab initio calculation of vibrational absorption and circulardichroism spectra using density functional force ﬁelds.

TheJournal of Physical Chemistry https://doi.org/10.1021%2Fj100096a001 .Troullier N and Martins JL (1991) Efﬁcient pseudopotentials forplane-wave calculations.

Physical Review B https://doi.org/10.1103%2Fphysrevb.43.1993 .Tymczak CJ and Challacombe M (2005) Linear scaling computa-tion of the fock matrix. VII. periodic density functional theoryat the γ point. The Journal of Chemical Physics https://doi.org/10.1063%2F1.1853374 .Valiev M, Bylaska E, Govind N, Kowalski K, Straatsma T, DamHV, Wang D, Nieplocha J, Apra E, Windus T and de Jong W(2010) NWChem: A comprehensive and scalable open-sourcesolution for large scale molecular simulations.

ComputerPhysics Communications https://doi.org/10.1016%2Fj.cpc.2010.04.018 .VandeVondele J, Krack M, Mohamed F, Parrinello M, ChassaingT and Hutter J (2005) Quickstep: Fast and accurate densityfunctional calculations using a mixed gaussian and plane wavesapproach.

Computer Physics Communications https://doi.org/10.1016%2Fj.cpc.2004.12.014 . Author Biographies

Xinming Qin is a PhD student in chemical physics at USTCunder the supervision of Prof.Jinlong Yang. He received theBS degree in chemistry in July 2009 at the same university.His main research interests are developing and applying newalgorithms for large-scale electronic structure calculations.He is the main developer of the HONPAS.

Honghui Shang is an associate professor at the ComputerScience at Institute of Computing Technology, ChineseAcademy of Sciences, China. She received her BS degreein physics from University of Science and Technology ofChina in 2006, and the PhD degree in Physical Chemistryfrom University of Science and Technology of China,in 2011. Between 2012 and 2018, she worked as apostdoctoral research assistant at the Fritz Haber Instituteof the Max Planck Society, Berlin, Germany, which wasresponsible for the Hybrid Inorganic/Organic Systems forOpto-Electronics (HIOS) project and for the Novel MaterialsDiscovery (NOMAD) project. Her main research interestsare developing the physical algorithm or numerical methodsfor the ﬁrst-principle calculations as well as acceleratingthese applications in the high performance computers.Currently, she is the main developer of the HONPAS(leader of of the hybrid-density-functional part) and FHI-aims (leader of the density-functional perturbation theorypart).

Lei Xu is currently working toward the BS degree in thedepartment of physics at the SiChuan University. His currentresearch interests are the parallel programming in the highperformance computing domain.

Wei Hu is currently a professor at division of theoreticaland computational sciences at Hefei National Laboratoryfor Physical Sciences at Microscale (HFNL) at USTC.He received the BS degree in chemistry from USTC in2007, and the PhD degree in Physical Chemistry fromthe same university in 2013. From 2014 to 2018, heworked as a postdoctoral fellow in the Scalable SolversGroup of the Computational Research Division at LawrenceBerkeley National Laboratory (LBNL), Berkeley, USA.During his postdoctoral research, he developed a newmassively parallel methodology, DGDFT (DiscontinuousGalerkin Method for Density Functional Theory), for large-scale DFT calculations. His main research interests focus

Prepared using sagej.cls Journal Title XX(X) on development, parallelization, and application of advancedand highly efﬁcient DFT methods and codes for accurateﬁrst-principles modeling and simulations of nanomaterials.

Jinlong Yang is a full professor of chemistry and executivedean of the School of Chemistry and Material Sciences atUSTC. He obtained his PhD degree in 1991 from USTC.He was awarded Outstanding Youth Foundation of Chinain 2000, selected as Changjiang Scholars Program ChairProfessor in 2001 and as a fellow of American PhysicalSociety (APS) in 2011. He is the second awardee of the2005s National Award for Natural Science (the secondprize) and the awardee (principal contributor) of the 2014sOutstanding Science and Technology Achievement Prizeof the Chinese Academy of Sciences (CAS). His researchmainly focuses on the development of ﬁrst-principlesmethods and their application to clusters, nanostructures,solid materials, surfaces, and interfaces. He is the initiatorand leader of the HONPAS.

Shigang Li received his Bachelor in computer scienceand technology and PhD in computer architecture from theUniversity of Science and Technology Beijing, China, in2009 and 2014, respectively. He was funded by CSC for a 2-year PhD study in University of Illinois, Urbana-Champaign.He was an assistant professor (from June 2014 to Aug.2018) in State Key Lab of Computer Architecture, Instituteof Computing Technology, Chinese Academy of Sciencesat the time of his achievement in this work. From Aug.2018 to now, he is a postdoc researcher in Department ofComputer Science, ETH Zurich. His research interests focuson the performance optimization for parallel and distributedcomputing systems, including parallel algorithms, parallelprogramming model, performance model, and intelligentmethods for performance optimization.

Yunquan Zhang received his BS degree in computerscience and engineering from the Beijing Institute ofTechnology in 1995. He received a PhD degree in ComputerSoftware and Theory from the Chinese Academy of Sciencesin 2000. He is a full professor and PhD Advisor of StateKey Lab of Computer Architecture, ICT, CAS. He isalso appointed as the Director of National SupercomputingCenter at Jinan and the General Secretary of China’s HighPerformance Computing Expert Committee. He organizesand distributes China’s TOP100 List of High PerformanceComputers, which traces and reports the development of theHPC system technology and usage in China. His researchinterests include the areas of high performance parallelcomputing, focusing on parallel programming models,high-performance numerical algorithms, and performancemodeling and evaluation for parallel programs.