The dynamic parallel distribution algorithm for hybrid density-functional calculations in HONPAS package
Honghui Shang, Lei Xu, Baodong Wu a, Xinming Qin b, Yunquan Zhang a, Jinlong Yang b

a State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing
b Hefei National Laboratory for Physical Sciences at the Microscale, Department of Chemical Physics, and Synergetic Innovation Center of Quantum Information and Quantum Physics, University of Science and Technology of China, Hefei, Anhui 230026, China

Both authors contributed equally to this work.
Abstract
This work presents a dynamic parallel distribution scheme for Hartree-Fock exchange (HFX) calculations based on the real-space NAO2GTO framework. The calculation of the electron repulsion integrals (ERIs), the most time-consuming step, is load-balanced with a 2-level master-worker dynamic parallel scheme; the density matrix and the HFX matrix are both stored in a sparse format, and the network communication time is minimized by communicating only the indexes of the batched ERIs and the final sparse form of the HFX matrix. The performance of this dynamic, scalable distributed algorithm is demonstrated by several examples of large-scale hybrid density-functional calculations on the Tianhe-2 supercomputer, including both molecular and solid-state systems of different dimensionalities, and shows good scalability.
Keywords: density-functional theory, Hartree-Fock, hybrid functionals, numeric atomic orbitals, linear scaling, MPI
1. Introduction
Hybrid density-functional calculations [1–12], which contain the Hartree-Fock exchange (HFX), show great accuracy for the geometry parameters, band-structure properties, and cohesive energies of a large range of materials. However, their computational cost is extremely high compared to conventional ground-state density-functional theory (DFT) calculations due to the calculation of the electron repulsion integrals (ERIs), which is the most time-consuming part of the HFX matrix construction. Therefore, a highly efficient and scalable implementation of the ERIs is urgently needed.

There have been a variety of implementations of hybrid density-functionals for solid-state physics calculations. We broadly classify these works into two categories by the type of basis set: plane-wave (PW) methods [12–20] and linear combination of atomic orbitals (LCAO) methods [10, 11, 21]. The plane-wave basis set is complete but not localized. In contrast, atomic-orbital basis sets are localized, which makes the Hamiltonian matrices sparse. As a result, atomic basis sets have attracted considerable interest for DFT calculations because of their favorable scaling with respect to the number of atoms and their potential for massively parallel implementations of large-scale calculations [11, 22–29]. There are mainly two types of atomic orbitals: one is the Gaussian-type orbital (GTO), as adopted in Gaussian [11] and CRYSTAL [21], among others; the other is the numerical atomic orbital (NAO), as adopted in SIESTA [23], DMOL [22], OPENMX [30], and FHI-aims [24]. The advantage of GTOs is the analytical calculation of the ERIs, while the advantage of NAOs is their strict locality, which naturally leads to lower-order scaling of computational time with system size. We have proposed a mixed scheme called NAO2GTO [29] to take advantage of both types of atomic orbitals. In the NAO2GTO method, the strict cutoff of the atomic orbitals is preserved by the NAO, and the NAO is then fitted with several GTOs so that the ERIs can be calculated analytically; after employing several ERI screening techniques, the construction of the HFX matrix becomes very efficient and scales linearly [29, 31].

In the parallelization of the HFX matrix construction, we have to pay attention to two problems: one is the load balancing of the ERIs, and the other is the communication of the density and HFX matrices. Previously, the load balancing of the ERIs was addressed by static or dynamic distribution schemes [21, 32–37]. The major difference between the static [21, 32, 37] and dynamic [33–36] parallel distribution algorithms is the way the computation of the ERIs is parallelized. In the static parallel distribution algorithm, the ERIs are distributed among the processors before any ERI is calculated; in the dynamic parallel distribution algorithm, the distribution and the calculation of the ERIs are performed simultaneously, which improves the load balance and parallel efficiency. For instance, the NWChem [33] software uses a simple centralized dynamic scheduling algorithm to distribute the ERIs to the worker processes, but its parallel efficiency decreases when very large numbers of processes are used. The GTFock [34, 35] code uses an initial static task partitioning along with a work-stealing distributed dynamic scheduling algorithm, and it gives very good parallel scalability for the ERI calculations.
In CP2K/Quickstep [36], the ERIs are coarse-grained using bins, and then, based on the estimated cost of each bin, a simulated annealing method is adopted to redistribute all the bins to improve the load balance, which is limited by the accuracy of the estimated cost of each bin. To reduce the communication time, both NWChem and GTFock use the Global Arrays framework, which provides a one-sided communication scheme to achieve high parallel performance with distributed HFX matrix computations. In contrast, CP2K/Quickstep [36] replicates the global density and HFX matrices on each MPI process in order to minimize the communication; however, this limits the system size, because the memory usage scales as O(N^2), and it also limits the parallel scalability, as the synchronization of the HFX matrix becomes the bottleneck when very large numbers of cores are used.

Recently, we have proposed two static distribution strategies [32] for the calculation of the ERIs. However, the static distribution of ERI shell pairs produces load imbalance that decreases the parallel efficiency, while the static distribution of ERI shell quartets can yield very high load balance but requires a global ERI screening calculation that also dramatically reduces the parallel efficiency; thus both static distribution schemes limit parallel scalability. In order to improve the parallel efficiency, a dynamic parallel distribution algorithm is needed.

In this work, a new dynamic parallel distribution algorithm based on the NAO2GTO scheme [29] has been proposed and implemented in the order-N performance HONPAS code [31]. In our approach, the calculation of the ERIs is perfectly load-balanced and can scale to very large numbers of cores thanks to the 2-level master-worker distribution of shell pairs. Furthermore, the communication time is minimized by using a global sparse matrix with linear-scaling memory usage. The efficiency and scalability of these algorithms are demonstrated by benchmark timings on periodic solid systems with hundreds to thousands of atoms in the unit cell.

The remainder of the paper is organized as follows. In Sec. 2 we succinctly summarize the fundamental background of this study. The dynamic parallel scheme and the detailed implementation of our parallel distribution strategies are discussed in Sec. 3. We demonstrate the parallel scalability of our implementation in Sec. 4. Finally, Sec. 5 summarizes the main ideas and findings of this work.
2. Background
In this section, we recall the basic theoretical framework used in this work. A spin-unpolarized notation is used throughout the text for the sake of simplicity, but a formal generalization to the collinear spin case is straightforward. The total energy in Kohn-Sham DFT is defined as
\[
E_{\rm KS} = T_s[n] + E_{\rm ext}[n] + E_{\rm H}[n] + E_{\rm xc}[n] + E_{\rm nuc\text{-}nuc}. \tag{1}
\]
Here, $n(\mathbf r)$ is the electron density, $T_s$ is the kinetic energy of non-interacting electrons, $E_{\rm ext}$ is the external energy stemming from the electron-nuclear attraction, $E_{\rm H}$ is the Hartree energy, $E_{\rm xc}$ is the exchange-correlation energy, and $E_{\rm nuc\text{-}nuc}$ is the nucleus-nucleus repulsion energy. The ground-state electron density $n(\mathbf r)$ (and the associated ground-state total energy) is obtained by variationally minimizing Eq. (1) under the constraint that the number of electrons $N_e$ is conserved. This yields the chemical potential $\mu = \delta E_{\rm KS}/\delta n$ of the electrons and the Kohn-Sham single-particle equations
\[
\hat h_{\rm KS}\,\psi_i = \left[\hat t_s + v_{\rm ext}(\mathbf r) + v_{\rm H} + v_{\rm xc}\right]\psi_i = \varepsilon_i\,\psi_i \tag{2}
\]
for the Kohn-Sham Hamiltonian $\hat h_{\rm KS}$. In Eq. (2), $\hat t_s$ denotes the kinetic-energy operator, $v_{\rm ext}$ the external potential, $v_{\rm H}$ the Hartree potential, and $v_{\rm xc}$ the exchange-correlation potential. Solving Eq. (2) yields the Kohn-Sham single-particle states $\psi_i$ and their eigenenergies $\varepsilon_i$. The single-particle states determine the electron density via
\[
n(\mathbf r) = \sum_i f_i\,|\psi_i|^2, \tag{3}
\]
in which $f_i$ denotes the Fermi-Dirac distribution function and $i$ labels the different Kohn-Sham states.

Eq. (2) can be solved numerically by expanding the Kohn-Sham states $\psi_i$ in a finite basis set. In periodic systems, such Kohn-Sham states are also called crystalline orbitals; they are normalized in the full space and written as a linear combination of Bloch functions $\phi_\mu(\mathbf k,\mathbf r)$ to satisfy the periodic boundary condition,
\[
\psi_i(\mathbf k,\mathbf r) = \sum_\mu c_{\mu i}(\mathbf k)\,\phi_\mu(\mathbf k,\mathbf r). \tag{4}
\]
Such Bloch functions are defined in terms of atomic orbitals $\chi_{\mathbf R\mu}(\mathbf r)$,
\[
\phi_\mu(\mathbf k,\mathbf r) = \frac{1}{\sqrt N}\sum_{\mathbf R} \chi_{\mathbf R\mu}(\mathbf r)\, e^{i\mathbf k\cdot(\mathbf R + \mathbf r_\mu)}, \tag{5}
\]
where the Greek letter $\mu$ is the index of the atomic orbitals, $\mathbf R$ is the origin of the auxiliary supercell, and $N$ is the number of unit cells in the system. $\chi_{\mathbf R\mu}(\mathbf r) = \chi_\mu(\mathbf r - \mathbf R - \mathbf r_\mu)$ is the $\mu$-th atomic orbital, whose center is displaced from the origin of the auxiliary supercell at $\mathbf R$ by $\mathbf r_\mu$. $c_{\mu i}(\mathbf k)$ is the wave-function coefficient, which is obtained by solving the generalized eigenvalue equation
\[
H(\mathbf k)\,c(\mathbf k) = E(\mathbf k)\,S(\mathbf k)\,c(\mathbf k), \tag{6}
\]
where
\[
[H(\mathbf k)]_{\mu\nu} = \sum_{\mathbf R} \langle \chi_\mu |\, \hat H \,| \chi_{\mathbf R\nu} \rangle\, e^{i\mathbf k\cdot(\mathbf R + \mathbf r_\nu - \mathbf r_\mu)} \tag{7}
\]
and
\[
[S(\mathbf k)]_{\mu\nu} = \sum_{\mathbf R} \langle \chi_\mu | \chi_{\mathbf R\nu} \rangle\, e^{i\mathbf k\cdot(\mathbf R + \mathbf r_\nu - \mathbf r_\mu)}. \tag{8}
\]
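For concreteness, the following is a minimal stand-alone sketch of how the generalized eigenvalue problem of Eq. (6) can be solved at a single k-point, using LAPACK's Hermitian generalized eigensolver through the LAPACKE C interface. This is illustrative only (toy 2x2 matrices, our naming); HONPAS/SIESTA rely on their own diagonalization infrastructure.

    /* Sketch: solve H(k) c = E S(k) c (Eq. (6)) at one k-point via LAPACKE.
     * Toy 2x2 data; link with -llapacke -llapack. Illustrative only. */
    #include <stdio.h>
    #include <complex.h>
    #include <lapacke.h>

    int main(void) {
        const lapack_int n = 2;          /* number of basis functions (toy size) */
        /* Hermitian H and S in column-major order; S must be positive definite */
        lapack_complex_double H[4] = { 1.0, 0.2 + 0.1*I, 0.2 - 0.1*I, 2.0 };
        lapack_complex_double S[4] = { 1.0, 0.1, 0.1, 1.0 };
        double E[2];                     /* eigenvalues epsilon_i(k) */
        /* itype=1 selects H c = E S c; 'V': eigenvectors overwrite H on exit */
        lapack_int info = LAPACKE_zhegv(LAPACK_COL_MAJOR, 1, 'V', 'U',
                                        n, H, n, S, n, E);
        if (info != 0) { fprintf(stderr, "zhegv failed: %d\n", (int)info); return 1; }
        printf("eigenvalues: %f %f\n", E[0], E[1]);
        return 0;
    }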
The Hamiltonian matrix can be divided into two parts: one is the conventional DFT part, $H^{\rm DFT}$, and the other is the HFX part, $H^{\rm HFX}$, which contains the ERIs,
\[
[H^{\rm DFT}]^{\mathbf G}_{\mu\lambda} = \langle \chi_\mu |\, \hat t_s + v_{\rm ext}(\mathbf r) + v_{\rm H} + v_{\rm xc} \,| \chi^{\mathbf G}_\lambda \rangle, \tag{9}
\]
\[
[H^{\rm HFX}]^{\mathbf G}_{\mu\lambda} = -\frac{1}{2}\sum_{\nu\sigma}\sum_{\mathbf N,\mathbf H} P^{\mathbf H-\mathbf N}_{\nu\sigma}\,(\chi_\mu \chi^{\mathbf N}_\nu | \chi^{\mathbf G}_\lambda \chi^{\mathbf H}_\sigma), \tag{10}
\]
where $\mathbf G$, $\mathbf N$, and $\mathbf H$ represent the different origins of the auxiliary supercell (the unit-cell indexes), and the Greek letters $\mu, \lambda, \nu, \sigma$ are the indexes of the atomic orbitals. Here $P^{\mathbf N}_{\nu\sigma}$ denotes the density matrix, which is computed by an integration of the wave-function coefficients over the Brillouin zone (BZ),
\[
P^{\mathbf N}_{\nu\sigma} = \sum_j \int_{\rm BZ} c^{*}_{\nu j}(\mathbf k)\, c_{\sigma j}(\mathbf k)\, \theta(\varepsilon_{\rm F} - \varepsilon_j(\mathbf k))\, e^{i\mathbf k\cdot\mathbf N}\, d\mathbf k, \tag{11}
\]
where $\theta$ is the step function, $\varepsilon_{\rm F}$ is the Fermi energy, and $\varepsilon_j(\mathbf k)$ is the $j$-th eigenvalue at point $\mathbf k$. The full-range ERI is defined as
\[
(\chi_\mu \chi^{\mathbf N}_\nu | \chi^{\mathbf G}_\lambda \chi^{\mathbf H}_\sigma) = \iint \frac{\chi_\mu(\mathbf r)\,\chi^{\mathbf N}_\nu(\mathbf r)\,\chi^{\mathbf G}_\lambda(\mathbf r')\,\chi^{\mathbf H}_\sigma(\mathbf r')}{|\mathbf r - \mathbf r'|}\, d\mathbf r\, d\mathbf r'. \tag{12}
\]
For a screened hybrid-functional calculation, such as HSE06, only the short-range part of the ERIs is needed,
\[
E^{\rm HSE06}_{\rm xc} = \frac{1}{4} E^{\rm SR\text{-}HF}_{\rm x}(\omega) + \frac{3}{4} E^{\rm SR\text{-}PBE}_{\rm x}(\omega) + E^{\rm LR\text{-}PBE}_{\rm x}(\omega) + E^{\rm PBE}_{\rm c}, \tag{13}
\]
where $\omega = 0.11\ \mathrm{Bohr}^{-1}$ and $\mathrm{erfc}(r) = \frac{2}{\sqrt\pi}\int_r^\infty e^{-t^2}\,dt$. The short-range and long-range parts of the PBE exchange functional are calculated following Ref. [10]. All the ERI calculations in the remainder of this paper are for the short-range part, i.e.,
\[
(\chi_\mu \chi^{\mathbf N}_\nu | \chi^{\mathbf G}_\lambda \chi^{\mathbf H}_\sigma)^{\rm SR} = \iint \chi_\mu(\mathbf r)\,\chi^{\mathbf N}_\nu(\mathbf r)\, \frac{\mathrm{erfc}(\omega |\mathbf r - \mathbf r'|)}{|\mathbf r - \mathbf r'|}\, \chi^{\mathbf G}_\lambda(\mathbf r')\,\chi^{\mathbf H}_\sigma(\mathbf r')\, d\mathbf r\, d\mathbf r'. \tag{14}
\]
It should be noted that this work focuses on the short-range HFX because the current auxiliary supercell is typically determined by the extent of the numerical orbitals, which is only valid for the short-range HFX. For the full HFX this may not be sufficient, since a larger auxiliary supercell is required for convergence.

After building the whole Hamiltonian, the KS wave-function coefficients $c_{\mu i}(\mathbf k)$ are calculated using the standard diagonalization scheme, and the density matrix is then obtained from Eq. (11). The above procedure is repeated until the change of the density-matrix elements is smaller than a threshold, at which point we have converged density and Hamiltonian matrices of the hybrid-functional calculation, as shown in Fig. 1.

In order to make the calculation of the ERIs more efficient, we have adopted the following computational schemes. Firstly, in our implementation, the 8-fold permutational symmetry of the ERIs is exploited for both molecular and solid systems, which gives a speedup of a factor of 8 in CPU time and a memory reduction of the same size:
\[
\begin{aligned}
(\mu \nu^{\mathbf H} | \lambda^{\mathbf G} \sigma^{\mathbf N}) &= (\mu \nu^{\mathbf H} | \sigma^{\mathbf N} \lambda^{\mathbf G}) = (\nu \mu^{-\mathbf H} | \lambda^{\mathbf G-\mathbf H} \sigma^{\mathbf N-\mathbf H}) = (\nu \mu^{-\mathbf H} | \sigma^{\mathbf N-\mathbf H} \lambda^{\mathbf G-\mathbf H}) \\
&= (\lambda \sigma^{\mathbf N-\mathbf G} | \mu^{-\mathbf G} \nu^{\mathbf H-\mathbf G}) = (\lambda \sigma^{\mathbf N-\mathbf G} | \nu^{\mathbf H-\mathbf G} \mu^{-\mathbf G}) = (\sigma \lambda^{\mathbf G-\mathbf N} | \mu^{-\mathbf N} \nu^{\mathbf H-\mathbf N}) = (\sigma \lambda^{\mathbf G-\mathbf N} | \nu^{\mathbf H-\mathbf N} \mu^{-\mathbf N}).
\end{aligned} \tag{15}
\]
Secondly, our NAO2GTO scheme [29] is adopted to calculate the ERIs analytically with fitted GTOs. Since the angular part of the NAOs is a spherical harmonic while the GTOs are Cartesian harmonic functions, a transformation [38] between the Cartesian and spherical harmonics is performed. After the transformation, the GTOs are grouped into shells according to the angular momentum of the NAOs; thus, if $\mu \in I$, $\nu \in J$, $\lambda \in K$, $\sigma \in L$ for the $(I, J, K, L)$ shell quartet, all the integrals $(\mu\nu|\lambda\sigma)$ of one shell quartet are computed together at a time.
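To make the bookkeeping behind Eq. (15) concrete, the following stand-alone sketch (our naming, not HONPAS code) enumerates only canonical shell quartets, one representative per set of eight symmetry-equivalent quartets, for the molecular case where the supercell indexes N, G, H are absent:

    /* Sketch: enumerate canonical shell quartets under the 8-fold ERI
     * symmetry (molecular case, no supercell indexes). Illustrative only. */
    #include <stdio.h>

    static void compute_shell_quartet(int I, int J, int K, int L) {
        /* placeholder for the actual (IJ|KL) shell-quartet ERI kernel */
        printf("(%d %d|%d %d)\n", I, J, K, L);
    }

    int main(void) {
        const int nshell = 4;
        for (int I = 0; I < nshell; ++I)
            for (int J = 0; J <= I; ++J)           /* (IJ| == (JI|          */
                for (int K = 0; K <= I; ++K)
                    for (int L = 0; L <= K; ++L) { /* |KL) == |LK)          */
                        /* keep only composite pair index IJ >= KL, which   */
                        /* removes the bra<->ket duplicate (IJ|KL)==(KL|IJ) */
                        long ij = (long)I * (I + 1) / 2 + J;
                        long kl = (long)K * (K + 1) / 2 + L;
                        if (ij < kl) continue;
                        compute_shell_quartet(I, J, K, L);
                    }
        return 0;
    }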
As a result, the computational expense depends strongly on the angular momenta of the shell quartets, which must be taken into account when they are distributed in parallel.

Thirdly, before the SCF cycle, two shell-pair lists (list-IJ and list-KL) are preselected according to the Schwarz screening [39], as shown in Fig. 1,
\[
|(\mu\nu|\lambda\sigma)| \le \sqrt{(\mu\nu|\mu\nu)\,(\lambda\sigma|\lambda\sigma)}, \tag{16}
\]
and only the shell-pair indexes with $(IJ|IJ) > \tau$ or $(KL|KL) > \tau$ (here $\tau$ is the drop tolerance) are stored. As shown in Eq. (10), the first index $I$ runs only within the unit cell, while the indexes $(J, K, L)$ run over the whole supercell, so list-IJ is smaller than list-KL. In the ERI calculations, the loops then run over these two shell lists. Before the calculation of every ERI, we use the Schwarz inequality, Eq. (16), again to estimate a rigorous upper bound, so that only the ERIs with non-negligible contributions are calculated; we call this screening method Schwarz screening. Because of the exponential decay of the charge distributions, the Schwarz screening reduces the total number of ERIs to be computed from $O(N^4)$ to $O(N^2)$. In addition to the Schwarz screening, the NAO screening [29] and the distance screening [29] are also adopted, which further reduce the total number of ERIs from $O(N^2)$ to $O(N)$.

Finally, we use the density-matrix screening to further reduce the number of ERIs: the maximal value of the density matrix of each shell ($P_{\max}$) is calculated during every SCF cycle, and an ERI shell quartet is skipped whenever
\[
P_{\rm screening} \times \sqrt{(IJ|IJ)\,(KL|KL)} \le \epsilon_{\rm Schwarz}, \tag{17}
\]
where $P_{\rm screening} = \max(|P^{IK}_{\max}|, |P^{IL}_{\max}|, |P^{JK}_{\max}|, |P^{JL}_{\max}|)$. Four density-matrix elements are needed for the maximal value because the 8-fold permutational symmetry of the ERIs is exploited in the implementation. The maximal density-matrix elements are taken from the density matrix of the previous SCF cycle, which produces a stable direct SCF cycle [40].
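The screening tests of Eqs. (16) and (17) compose into a cheap reject test executed before each shell-quartet computation. A minimal sketch follows, assuming precomputed Schwarz factors $\sqrt{(IJ|IJ)}$ per shell pair and per-shell-pair density maxima; all names are illustrative:

    /* Sketch: Schwarz + density-matrix screening before a shell-quartet ERI.
     * q = sqrt((IJ|IJ)) and p_max are assumed precomputed; names illustrative. */
    #include <math.h>
    #include <stdbool.h>

    typedef struct { int I, J; double q; } shell_pair;  /* q = sqrt((IJ|IJ)) */

    /* returns true if the quartet survives screening and must be computed */
    static bool survives_screening(const shell_pair *ij, const shell_pair *kl,
                                   const double *p_max, int nshell, double eps) {
        double bound = ij->q * kl->q;              /* Schwarz bound, Eq. (16) */
        if (bound <= eps) return false;

        /* density-matrix screening, Eq. (17): largest of the four P_max
         * elements coupled to this quartet by the 8-fold symmetry */
        double p = fmax(fmax(p_max[ij->I * nshell + kl->I],
                             p_max[ij->I * nshell + kl->J]),
                        fmax(p_max[ij->J * nshell + kl->I],
                             p_max[ij->J * nshell + kl->J]));
        return p * bound > eps;
    }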
3. Parallelization strategies
In order to resolve the contradiction between parallel efficiency and load imbalance in the static distribution strategy, a dynamic load-balancing scheme based on the master/worker method is adopted: one of the processes, called the master, is responsible for managing the distribution of all the ERIs, as shown in Algorithm 1, while the other processes, called the workers, compute the assigned ERIs, as shown in Algorithm 2.

Figure 1: The flowchart of the hybrid functional calculation in the linear combination of atomic orbitals (LCAO) approach (input, build shell pair list, then the SCF loop over building H-DFT (conventional DFT Hamiltonian), building H-HFX (Hartree-Fock exchange Hamiltonian), and calculating the density matrix P, followed by output).

We could choose to assign only one task to a worker at a time, i.e., the worker process requests a single ERI shell quartet from the master and, after receiving it, proceeds to compute it. However, such a scheme introduces too much communication time, which increases the execution time and reduces the parallel efficiency. We therefore choose to assign more than one ERI shell quartet at a time from the master process to the worker processes. Such a set of tasks is called a batch of ERIs, and only the start and end indexes of the batched ERIs are communicated. In practice, we use the receiver-initiated method: the task distribution is initiated by the worker, which requests tasks from the master; the master then sends either the indexes of the batched ERIs or a terminal token, depending on whether tasks are left. A worker that receives a task executes it immediately and requests a new task after execution. If a worker receives a terminal token, it leaves the loop and ends the program. The master also exits after confirming that a terminal token has been sent to every worker. This master-worker scheme is implemented using point-to-point blocking send and receive operations, and the full permutational symmetry of the ERIs is taken into account.

Although we can simply increase the computing power by increasing the number of workers, this increase is not unlimited, because the master process can only distribute one task at a time: when there are multiple task requests, a request cannot be satisfied until the master has processed the requests before it. This bottleneck limits the efficiency of large-scale parallelism. In our tests, the performance of single-level master-worker parallelism began to decrease when 4000 cores were used, degrading the parallel efficiency of the ERI calculation. To remove this bottleneck, we extend the scheme to two levels by introducing sub-masters between the master and the workers, as illustrated in Fig. 2 and Algorithm 3.
Algorithm 1 Flowchart of the master algorithm for the ERIs. N is the number of ERIs in one batch; N_workers is the number of workers in the master-worker scheme.

    MPI_IRECV (to accept a request)
    while TaskCount > 0 do
        if request detected then
            send the MESSAGE (indexes of the batched ERIs)
            TaskCount = TaskCount - N
            MPI_IRECV (to accept a request)
        end if
    end while
    if request detected then
        send terminal token
    end if
    for i = 1 to N_workers - 1 do
        MPI_IRECV (to accept a request)
        if request detected then
            send terminal token
        end if
    end for

Algorithm 2 Flowchart of the worker algorithm for the ERIs.

    while .TRUE. do
        send a task request
        receive MESSAGE (indexes of the batched ERIs)
        if MESSAGE is terminal token then
            exit
        end if
        compute the batched ERIs
    end while
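As a concrete illustration of Algorithms 1 and 2, the following minimal MPI program (in C, with illustrative names and a stubbed ERI kernel; blocking MPI_Recv is used for brevity where the pseudocode polls with MPI_IRECV) implements the receiver-initiated batched distribution with a terminal token:

    /* Minimal sketch of the receiver-initiated master-worker scheme of
     * Algorithms 1-2. Illustrative only: task "computation" is a stub. */
    #include <mpi.h>

    #define TAG_REQ  1
    #define TAG_TASK 2
    #define NTASKS   100000L  /* total number of ERI shell quartets (toy value) */
    #define NBATCH   1000L    /* number of ERIs per batch */

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {                       /* master (Algorithm 1) */
            long next = 0;
            int done = 0;                      /* workers terminated so far */
            while (done < size - 1) {
                int req;
                MPI_Status st;
                MPI_Recv(&req, 1, MPI_INT, MPI_ANY_SOURCE, TAG_REQ,
                         MPI_COMM_WORLD, &st);
                long batch[2];                 /* start/end indexes of the batch */
                if (next < NTASKS) {
                    batch[0] = next;
                    batch[1] = (next + NBATCH < NTASKS) ? next + NBATCH : NTASKS;
                    next = batch[1];
                } else {
                    batch[0] = batch[1] = -1;  /* terminal token */
                    done++;
                }
                MPI_Send(batch, 2, MPI_LONG, st.MPI_SOURCE, TAG_TASK,
                         MPI_COMM_WORLD);
            }
        } else {                               /* worker (Algorithm 2) */
            for (;;) {
                int req = 1;
                long batch[2];
                MPI_Send(&req, 1, MPI_INT, 0, TAG_REQ, MPI_COMM_WORLD);
                MPI_Recv(batch, 2, MPI_LONG, 0, TAG_TASK, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                if (batch[0] < 0) break;       /* terminal token received */
                for (long t = batch[0]; t < batch[1]; ++t)
                    ;                          /* compute ERI shell quartet t */
            }
        }
        MPI_Finalize();
        return 0;
    }

In the 2-level variant, each sub-master runs the same request loop against the global master and redistributes its received index range among its own workers, as described in Algorithm 3.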
Algorithm 3 Flowchart of the sub-master algorithm for the ERIs. n is the number of ERIs in one sub-batch; N_workers is the number of workers attached to the sub-master.

    while .TRUE. do
        send a task request to the master
        receive MESSAGE (indexes of the batched ERIs or terminal token)
        if MESSAGE is terminal token then
            exit
        end if
        MPI_IRECV (to accept a request)
        while TaskCount > 0 do
            if request detected then
                send the MESSAGE (indexes of the batched ERIs)
                TaskCount = TaskCount - n
                MPI_IRECV (to accept a request)
            end if
        end while
    end while
    if request detected then
        send terminal token
    end if
    for i = 1 to N_workers - 1 do
        MPI_IRECV (to accept a request)
        if request detected then
            send terminal token
        end if
    end for

In order to reduce the memory usage and minimize the communication time, both the density matrix and the HFX matrix are replicated in a sparse format, which is much smaller than the corresponding dense objects, as shown in Table 1; for instance, the dense-format matrix of the TiO2 system has 19927296 elements, 34 times more than the 578616 elements of the sparse format. The sparse density matrix can be accessed by every MPI process, so after the worker processes receive the indexes of the batched ERIs to be calculated, the corresponding local HFX contributions are built from this global density matrix, and finally an MPI_ALLREDUCE operation builds the global HFX matrix.

Figure 2: The illustration of the 2-level master-worker dynamic load-balancing scheme (one master serving several sub-masters, each of which serves its own group of workers).

It should be noted that, during the construction of the HFX matrix, the transformation between the sparse-matrix index and the dense-matrix index needs to be performed twice: once to read from the sparse density matrix, and once to write into the sparse HFX matrix. The flowchart of the HFX matrix construction is shown in Algorithm 4, which loops over the shell-pair lists.
Algorithm 4 Flowchart of the HFX matrix construction. I, J, K, L are shell indexes; P_gs is the global sparse density matrix; H_HFX_gs is the global sparse HFX matrix.

    for shell pairs in list-IJ and list-KL do
        if shell ERI (IJ|KL) is not screened then
            compute shell ERI (IJ|KL)
            transform dense matrix indexes to sparse matrix indexes
            update H_HFX_gs using (IJ|KL) and P_gs
        end if
    end for
    MPI_ALLREDUCE to get H_HFX_gs
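The pattern of Algorithm 4 can be sketched as follows (our naming and a simple CSR-like sparse layout, not the actual HONPAS data structure): since the sparsity pattern is replicated identically on all ranks, each rank accumulates its ERI contributions into the sparse value array, and a single element-wise MPI_Allreduce yields the global sparse HFX matrix.

    /* Sketch of the replicated-sparse accumulation in Algorithm 4.
     * The sparsity pattern (row_ptr/col_idx, CSR-like) is identical on all
     * ranks, so summing the value arrays element-wise yields the global
     * sparse HFX matrix. Names and layout are illustrative. */
    #include <mpi.h>

    /* dense (mu, lambda) -> position in the sparse value array, or -1;
     * a binary search over the column indexes of row mu */
    long sparse_index(const long *row_ptr, const int *col_idx,
                      int mu, int lambda) {
        long lo = row_ptr[mu], hi = row_ptr[mu + 1] - 1;
        while (lo <= hi) {
            long mid = (lo + hi) / 2;
            if (col_idx[mid] == lambda) return mid;
            if (col_idx[mid] < lambda) lo = mid + 1; else hi = mid - 1;
        }
        return -1;  /* element not in the sparsity pattern */
    }

    /* sum the partial sparse HFX matrices of all ranks in place;
     * nnz beyond INT_MAX would need chunked reductions */
    void reduce_hfx(double *hfx_val, long nnz, MPI_Comm comm) {
        MPI_Allreduce(MPI_IN_PLACE, hfx_val, (int)nnz, MPI_DOUBLE,
                      MPI_SUM, comm);
    }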
4. Performance Results
All the results are calculated on the Tianhe-2 supercomputer located at the National Supercomputing Center in Guangzhou, China, which was developed by the National University of Defense Technology, China. Tianhe-2 is composed of 17920 nodes with a custom interconnect, TH Express-2, using a fat-tree topology. Each node holds two Intel Ivy Bridge E5-2692 processors (12 cores each at 2.2 GHz) and three Intel Xeon Phi 31S1P coprocessors (57 cores at 1.1 GHz). Memory is 64 GB DRAM per node plus 8 GB on each Intel Xeon Phi card. Capable of a peak performance of 54.9 PFlops, Tianhe-2 has achieved a sustained performance of 33.9 PFlops with a performance-per-watt of 1.9 GFlops/W; it has 1.4 PB memory, 12.4 PB storage capacity, and a power consumption of 17.8 MW. The largest number of nodes we could use for the performance tests is 2150 (51600 cores), and only the Intel Xeon Ivy Bridge CPUs are used in this work.

Since HONPAS is developed in the framework of the SIESTA code, only norm-conserving pseudopotentials can be adopted. In the following calculations, norm-conserving pseudopotentials generated with the Troullier-Martins [41] scheme, in the fully separable form developed by Kleinman and Bylander [42], are used to represent the interaction between the core ions and the valence electrons. The screened hybrid functional HSE06 [10] is used in all the calculations. The size of the batched ERIs is set to 2000000 in the master processor and to 10000 in the sub-master processors.

The performance of our method is demonstrated using DNA, a titanium dioxide surface, and bulk silicon to test the strong scaling of the HONPAS code, measured by the change in CPU time of the HFX matrix construction with the number of cores used. The time measurements are for the HFX matrix construction in a single SCF step, including the time used to set up P_max for the density-matrix screening, to calculate the ERIs, and to sum up and redistribute the global sparse HFX matrix. It should be noted that the P_max time is constant and takes a very small fraction of the total time for these systems when fewer than 1000 CPU cores are used; the time for the synchronization of the HFX matrix increases with the number of cores, and so does its fraction of the total. The test systems are listed in Table 1. These three examples have been chosen because they range from one- to three-dimensional and are typical applications in the materials-science community.

The first system is a DNA segment containing 715 atoms in the unit cell. The P, H, C, N, and O atoms are described using double-ζ plus polarization (DZP) valence basis sets, yielding 7183 atomic orbitals per unit cell, which is the rank of the Fock matrix. One k-point is used to sample the reciprocal space due to the large unit cell. In Fig. 3, the scalability of the HFX construction in one SCF cycle is presented and separated into its three major components: the calculation of the two-electron integrals (ERI), the calculation of the maximal values of the density matrix in each shell at every SCF cycle (P_max), and the global summation of the HFX matrix (MPI_ALLREDUCE).
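For clarity, the parallel efficiency quoted in the following is the standard strong-scaling efficiency (the explicit formula is our addition; the text reports only the resulting percentages):
\[
E(p) = \frac{T(p_0)\,p_0}{T(p)\,p} \times 100\%,
\]
where $T(p)$ is the measured wall-clock time of the step on $p$ CPU cores and $p_0$ is the smallest core count of the respective test.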
The scalability is good; in particular, almost ideal scaling is achieved for the calculation of the ERIs, which takes almost 98% of the total time with 1200 cores. On the other hand, the times for P_max and MPI_ALLREDUCE do not scale with the number of CPU cores, so as the number of CPU cores increases, the P_max share of the time grows from 1% to 15%, and the MPI_ALLREDUCE share from 1% to 25%. Although P_max and MPI_ALLREDUCE are responsible for a small fraction of the total runtime, it is clear that their scaling ultimately limits the parallel scaling of the total HFX calculation. As a result, although the parallel efficiency of the ERIs at 24000 cores is nearly 100%, the parallel efficiency of the total HFX time, which includes ERI, P_max, and MPI_ALLREDUCE, is only 61% at 24000 cores.

The second example is a TiO2 surface supercell consisting of 144 atoms. The Ti and O atoms are described using single-ζ plus polarization (SZP) valence basis sets, yielding 1488 atomic orbitals within the unit cell and 13392 atomic orbitals within the supercell.
Figure 3: The strong scalability for the periodic DNA system. The blue/red/brown/orange bars correspond to the simulation times for the total HFX construction/ERI calculation/P_max construction/MPI_ALLREDUCE, respectively. The parallel efficiency of the HFX construction is labeled with blue circles, and the parallel efficiency of the ERI calculation with red squares. The times are annotated on top of the bars. The difference between the HFX time and the ERI time comes from the contributions of the P_max selection process and the MPI_ALLREDUCE operation for the HFX matrix, which are shown in the lower panel.
    System      Atoms (unit cell)  Orbitals (unit cell)  Atoms (supercell)  Orbitals (supercell)  Sparse matrix elements
    2D-TiO2     144                1488                  1296               13392                 578616
    3D-Si-SZ    2000               8000                  2000               8000                  2116000
    3D-Si-DZP   512                6656                  512                6656                  5016064
Table 1: The systems used in this work.

It should be noted that the SZP calculations underestimate the electronic band gap by roughly 8% with respect to the DZP basis set for bulk TiO2. Here we use SZP for the TiO2 surface system just to evaluate the parallel efficiency, since the scalability does not depend on the basis set, as shown in Fig. 7. The calculations have been performed using 6 × × 10 k-points in the primitive Brillouin zone.
In Fig. 4, the total runtime for the HFX construction in one SCF cycle and the contributions of the two-electron integral calculation (ERI), the calculation of the maximal values of the density matrix in each shell at every SCF cycle (P_max), and the global summation of the HFX matrix (MPI_ALLREDUCE) are displayed. As for the one-dimensional DNA system, the parallel scaling of the ERI calculation is again nearly ideal, with 100% parallel efficiency, and it takes almost 95% of the time with 480 cores. On the other hand, the times for P_max and MPI_ALLREDUCE do not scale with the number of CPU cores, so as the number of CPU cores increases, the P_max share of the time grows from 4% to 37%, and the MPI_ALLREDUCE share from 1% to 42%. As a result, although the parallel efficiency of the ERIs at 19200 cores is nearly 100%, the parallel efficiency of the total HFX time, which includes ERI, P_max, and MPI_ALLREDUCE, is only 20% at 19200 cores. This very low 20% parallel efficiency has two causes: firstly, the calculation of P_max is not distributed and accounts for 37% of the total time at 19200 cores; secondly, the MPI_ALLREDUCE communication time is relatively long, accounting for 42% of the total time at 19200 cores. It should be noted that the MPI_ALLREDUCE communication time can be dramatically reduced by changing the number of ERIs in one batch, which we call n_block in the following, as shown in Fig. 5. When decreasing the n_block value from 10000 to 100, the communication time is reduced by a factor of 20; this is because n_block affects the balance of the number of ERIs assigned to each core and, in turn, the communication time. With an optimal value of n_block and a distributed calculation of P_max, the parallel efficiency should increase.
The third example is a silicon (10 × 10 × 10) supercell consisting of 2000 atoms; a single-ζ (SZ) basis set is adopted for the Si atoms. One k-point is used to sample the reciprocal space due to the large unit cell. In Fig. 6, the runtime for the HFX construction in one SCF cycle and the contributions of the ERI calculation, the P_max calculation, and the global summation of the HFX matrix (MPI_ALLREDUCE) are displayed. Compared with the second example, the integral calculation maintains perfect scalability over the whole range of CPU cores considered, from 480 to 28800, while the parallel efficiency of the HFX matrix construction step is now dramatically improved, to 80% at 19200 cores; this is because in this case the P_max and MPI_ALLREDUCE time percentages are only 16% and 5%, respectively. The improved scalability in this example is basically owing to the more balanced distribution of the number of ERIs, as only s-type and p-type orbitals are involved. In the first and second examples, on the other hand, the polarization d-type orbitals are involved, which causes an imbalance in the number of ERIs as well as in the corresponding HFX matrix elements, and increases the communication costs (MPI_ALLREDUCE). Finally, in this third example, the parallel efficiency of the ERIs at 28800 cores is nearly 100%, while the parallel efficiency of the total HFX time is 70% at 28800 cores.

In order to see the influence of the basis set, the parallel scalability for a unit cell containing 2000 Si atoms with the SZ basis set and 512 Si atoms with the DZP basis set is shown in Fig. 7. All calculations again use one k-point to sample the reciprocal space due to the large unit cells. For both the SZ and the DZP basis set, almost ideal scaling is achieved for the ERI calculations; at 28800 cores, the communication cost (MPI_ALLREDUCE) for the SZ basis set is 0.7 s, and the parallel efficiency of the total HFX time is 67%/70% for SZ/DZP at 28800 cores.
The largest number of CPU cores we have tested is 51600, for the Si system consisting of 512 atoms with the DZP basis set; in this case, the communication time increases to 12.8 s, or 24% of the total HFX time, so the parallel efficiency of the HFX construction is 53%, although the integral calculation maintains perfect 100% scalability.
5. Conclusion
In summary, we have presented our dynamic parallel algorithms for the ERI calculations based on the real-space NAO2GTO framework, and we have analyzed their parallel efficiency. Based on our results, the dynamic distribution of ERI shell quartets yields very high load balance: nearly ideal 100% parallel efficiency has been achieved for the calculation of the ERIs. However, because the P_max selection and the summation of the HFX matrix are not distributed over the CPU cores, the parallel efficiency of the total HFX construction is not as good as that of the ERI calculation. As a next step, we need to distribute the P_max calculation as well, using the dynamic parallel distribution algorithm, to improve the parallel efficiency.
Figure 4: The strong scalability for the periodic TiO2 surface system. The blue/red/brown/orange bars correspond to the simulation times for the total HFX construction/ERI calculation/P_max construction/MPI_ALLREDUCE, respectively. The parallel efficiency of the HFX construction is labeled with blue circles, and the parallel efficiency of the ERI calculation with red squares. The times are annotated on top of the bars. The difference between the HFX time and the ERI time comes from the contributions of the P_max selection process and the MPI_ALLREDUCE operation for the HFX matrix, which are shown in the lower panel. The discussion of the low 20% HFX parallel efficiency at 19200 cores for this example is given in the text.
Figure 5: The time used for the periodic TiO2 surface system with respect to the number of ERIs in one batch, using 480 cores. The red bars correspond to the simulation time for the ERI calculation (upper panel); the orange bars correspond to the MPI_ALLREDUCE time (lower panel). The times are annotated on top of the bars.
Figure 6: The strong scalability for the periodic silicon solid system. The blue/red/brown/orange bars correspond to the simulation times for the total HFX construction/ERI calculation/P_max construction/MPI_ALLREDUCE, respectively. The parallel efficiency of the HFX construction is labeled with blue circles, and the parallel efficiency of the ERI calculation with red squares. The times are annotated on top of the bars. The difference between the HFX time and the ERI time comes from the contributions of the P_max selection process and the MPI_ALLREDUCE operation for the HFX matrix, which are shown in the lower panel.
Figure 7: The strong scalability for the periodic silicon solid system with different system sizes and basis sets (HFX and ERI times for Si with 512 atoms and the DZP basis set, and for Si with 2000 atoms and the SZ basis set).

Furthermore, a shared-memory method with one-sided communication should also be adopted to improve the performance. The two-level master-worker dynamic parallel distribution algorithm proposed in this work can also be extended to use graphics processing units (GPUs) as accelerators.
References

[1] A. D. Becke, The Journal of Chemical Physics 98 (1993) 5648–5652. https://doi.org/10.1063/1.464913
[2] P. J. Stephens, F. J. Devlin, C. F. Chabalowski, M. J. Frisch, The Journal of Physical Chemistry 98 (1994) 11623–11627. https://doi.org/10.1021/j100096a001
[3] B. G. Janesko, T. M. Henderson, G. E. Scuseria, Phys. Chem. Chem. Phys. 11 (2009) 443–454. https://doi.org/10.1039/b812838c
[4] J. Paier, C. V. Diaconu, G. E. Scuseria, M. Guidon, J. VandeVondele, J. Hutter, Physical Review B 80 (2009) 174114. https://doi.org/10.1103/physrevb.80.174114
[5] H. J. Monkhorst, Physical Review B 20 (1979) 1504–1513. https://doi.org/10.1103/physrevb.20.1504
[6] J. Delhalle, J.-L. Calais, Physical Review B 35 (1987) 9460–9466. https://doi.org/10.1103/physrevb.35.9460
[7] M. Gell-Mann, K. A. Brueckner, Physical Review 106 (1957) 364–368. https://doi.org/10.1103/physrev.106.364
[8] J. Heyd, G. E. Scuseria, M. Ernzerhof, The Journal of Chemical Physics 118 (2003) 8207–8215. https://doi.org/10.1063/1.1564060
[9] J. Heyd, G. E. Scuseria, M. Ernzerhof, The Journal of Chemical Physics 124 (2006) 219906. https://doi.org/10.1063/1.2204597
[10] A. V. Krukau, O. A. Vydrov, A. F. Izmaylov, G. E. Scuseria, The Journal of Chemical Physics 125 (2006) 224106. https://doi.org/10.1063/1.2404663
[11] M. J. Frisch, G. W. Trucks, H. B. Schlegel, G. E. Scuseria, M. A. Robb, J. R. Cheeseman, G. Scalmani, V. Barone, B. Mennucci, G. A. Petersson, H. Nakatsuji, M. Caricato, X. Li, H. P. Hratchian, A. F. Izmaylov, J. Bloino, G. Zheng, J. L. Sonnenberg, M. Hada, M. Ehara, K. Toyota, R. Fukuda, J. Hasegawa, M. Ishida, T. Nakajima, Y. Honda, O. Kitao, H. Nakai, T. Vreven, J. A. Montgomery, J. E. Peralta, F. Ogliaro, M. Bearpark, J. J. Heyd, E. Brothers, K. N. Kudin, V. N. Staroverov, R. Kobayashi, J. Normand, K. Raghavachari, A. Rendell, J. C. Burant, S. S. Iyengar, J. Tomasi, M. Cossi, N. Rega, J. M. Millam, M. Klene, J. E. Knox, J. B. Cross, V. Bakken, C. Adamo, J. Jaramillo, R. Gomperts, R. E. Stratmann, O. Yazyev, A. J. Austin, R. Cammi, C. Pomelli, J. W. Ochterski, R. L. Martin, K. Morokuma, V. G. Zakrzewski, G. A. Voth, P. Salvador, J. J. Dannenberg, S. Dapprich, A. D. Daniels, Farkas, J. B. Foresman, J. V. Ortiz, J. Cioslowski, D. J. Fox, Gaussian 09, Revision B.01, 2009.
[12] J. Paier, M. Marsman, K. Hummer, G. Kresse, I. C. Gerber, J. G. Ángyán, The Journal of Chemical Physics 124 (2006) 154709. https://doi.org/10.1063/1.2187006
[13] X. Gonze, F. Jollet, F. Abreu Araujo, D. Adams, B. Amadon, T. Applencourt, C. Audouze, J.-M. Beuken, J. Bieder, A. Bokhanchuk, E. Bousquet, F. Bruneval, D. Caliste, M. Côté, F. Dahm, F. Da Pieve, M. Delaveau, M. Di Gennaro, B. Dorado, C. Espejo, G. Geneste, L. Genovese, A. Gerossier, M. Giantomassi, Y. Gillet, D. Hamann, L. He, G. Jomard, J. Laflamme Janssen, S. Le Roux, A. Levitt, A. Lherbier, F. Liu, I. Lukačević, A. Martin, C. Martins, M. Oliveira, S. Poncé, Y. Pouillon, T. Rangel, G.-M. Rignanese, A. Romero, B. Rousseau, O. Rubel, A. Shukri, M. Stankovski, M. Torrent, M. Van Setten, B. Van Troeye, M. Verstraete, D. Waroquiers, J. Wiktor, B. Xu, A. Zhou, J. Zwanziger, Comput. Phys. Commun. (2016). http://linkinghub.elsevier.com/retrieve/pii/S0010465516300923
[14] L. Lin, Journal of Chemical Theory and Computation 12 (2016) 2242–2249. https://doi.org/10.1021/acs.jctc.6b00092
[15] I. Duchemin, F. Gygi, Computer Physics Communications 181 (2010) 855–860. https://doi.org/10.1016/j.cpc.2009.12.021
[16] R. A. DiStasio, B. Santra, Z. Li, X. Wu, R. Car, The Journal of Chemical Physics 141 (2014) 084502. https://doi.org/10.1063/1.4893377
[17] T. A. Barnes, T. Kurth, P. Carrier, N. Wichmann, D. Prendergast, P. R. Kent, J. Deslippe, Computer Physics Communications 214 (2017) 52–58. https://doi.org/10.1016/j.cpc.2017.01.008
[18] A. Natan, Phys. Chem. Chem. Phys. 17 (2015) 31510–31515. http://dx.doi.org/10.1039/C5CP01093D
[19] N. M. Boffi, M. Jain, A. Natan, Journal of Chemical Theory and Computation 12 (2016) 3614–3622. https://doi.org/10.1021/acs.jctc.6b00376
[20] X. Wu, A. Selloni, R. Car, Phys. Rev. B 79 (2009) 085102. https://link.aps.org/doi/10.1103/PhysRevB.79.085102
[21] I. J. Bush, S. Tomić, B. G. Searle, G. Mallia, C. L. Bailey, B. Montanari, L. Bernasconi, J. M. Carr, N. M. Harrison, Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences 467 (2011) 2112–2126.
[22] B. Delley, The Journal of Chemical Physics 92 (1990) 508–517. https://doi.org/10.1063/1.458452
[23] J. M. Soler, E. Artacho, J. D. Gale, A. García, J. Junquera, P. Ordejón, D. Sánchez-Portal, Journal of Physics: Condensed Matter 14 (2002) 2745–2779. https://doi.org/10.1088/0953-8984/14/11/302
[24] V. Blum, R. Gehrke, F. Hanke, P. Havu, V. Havu, X. Ren, K. Reuter, M. Scheffler, Comput. Phys. Commun. 180 (2009) 2175–2196. http://linkinghub.elsevier.com/retrieve/pii/S0010465509002033
[25] V. Havu, V. Blum, P. Havu, M. Scheffler, J. Comput. Phys. 228 (2009) 8367–8379. http://linkinghub.elsevier.com/retrieve/pii/S0021999109004458
[26] X. Ren, P. Rinke, V. Blum, J. Wieferink, A. Tkatchenko, A. Sanfilippo, K. Reuter, M. Scheffler, New J. Phys. 14 (2012) 053020. http://stacks.iop.org/1367-2630/14/i=5/a=053020
[27] J. Enkovaara, C. Rostgaard, J. J. Mortensen, J. Chen, M. Dułak, L. Ferrighi, J. Gavnholt, C. Glinsvad, V. Haikola, H. A. Hansen, H. H. Kristoffersen, M. Kuisma, A. H. Larsen, L. Lehtovaara, M. Ljungberg, O. Lopez-Acevedo, P. G. Moses, J. Ojanen, T. Olsen, V. Petzold, N. A. Romero, J. Stausholm-Møller, M. Strange, G. A. Tritsaris, M. Vanin, M. Walter, B. Hammer, H. Häkkinen, G. K. H. Madsen, R. M. Nieminen, J. K. Nørskov, M. Puska, T. T. Rantala, J. Schiøtz, K. S. Thygesen, K. W. Jacobsen, J. Phys. Condens. Matter 22 (2010) 253202. http://stacks.iop.org/0953-8984/22/i=25/a=253202
[28] S. Mohr, L. E. Ratcliff, P. Boulanger, L. Genovese, D. Caliste, T. Deutsch, S. Goedecker, J. Chem. Phys. 140 (2014) 204110.
[29] H. Shang, Z. Li, J. Yang, The Journal of Chemical Physics 135 (2011) 034110. https://doi.org/10.1063/1.3610379
[30] T. Ozaki, Physical Review B 67 (2003) 155108. https://doi.org/10.1103/physrevb.67.155108
[31] X. Qin, H. Shang, H. Xiang, Z. Li, J. Yang, International Journal of Quantum Chemistry 115 (2014) 647–655. https://doi.org/10.1002/qua.24837
[32] X. Qin, H. Shang, L. Xu, W. Hu, J. Yang, S. Li, Y. Zhang, The International Journal of High Performance Computing Applications (2019).
[33] M. Valiev, E. Bylaska, N. Govind, K. Kowalski, T. Straatsma, H. Van Dam, D. Wang, J. Nieplocha, E. Apra, T. Windus, W. de Jong, Computer Physics Communications 181 (2010) 1477–1489. https://doi.org/10.1016/j.cpc.2010.04.018
[34] X. Liu, A. Patel, E. Chow, in: 2014 IEEE 28th International Parallel and Distributed Processing Symposium, IEEE, 2014. https://doi.org/10.1109/ipdps.2014.97
[35] E. Chow, X. Liu, M. Smelyanskiy, J. R. Hammond, The Journal of Chemical Physics 142 (2015) 104103. https://doi.org/10.1063/1.4913961
[36] J. VandeVondele, M. Krack, F. Mohamed, M. Parrinello, T. Chassaing, J. Hutter, Computer Physics Communications 167 (2005) 103–128. https://doi.org/10.1016/j.cpc.2004.12.014
[37] V. Weber, M. Challacombe, The Journal of Chemical Physics 125 (2006) 104110. https://doi.org/10.1063/1.2222359
[38] H. B. Schlegel, M. J. Frisch, International Journal of Quantum Chemistry 54 (1995) 83–87. https://doi.org/10.1002/qua.560540202
[39] M. Häser, R. Ahlrichs, Journal of Computational Chemistry 10 (1989) 104–111. https://doi.org/10.1002/jcc.540100111
[40] J. Almlöf, K. Faegri Jr., K. Korsell, Journal of Computational Chemistry 3 (1982) 385–399. https://doi.org/10.1002/jcc.540030314
[41] N. Troullier, J. L. Martins, Physical Review B 43 (1991) 1993–2006. https://doi.org/10.1103/physrevb.43.1993
[42] L. Kleinman, D. M. Bylander, Physical Review Letters 48 (1982) 1425–1428. https://doi.org/10.1103/physrevlett.48.1425
6. Acknowledgments