Toward Scalable Many-Body Calculations for Nuclear Open Quantum Systems using the Gamow Shell Model
N. Michel
Michigan State University, East Lansing, MI 48824, USA
School of Physics, Peking University, Beijing 100871, China
GANIL, CEA/DSM - CNRS/IN2P3, BP 55027, F-14076 Caen Cedex, France
Institute of Modern Physics, Chinese Academy of Sciences, Lanzhou 730000, China
H.M. Aktulga
Dept. of Computer Science, Michigan State University, East Lansing, MI 48824, USA
Y. Jaganathen
Institute of Nuclear Physics PAN, ul. Radzikowskiego 152, PL-31342 Kraków, Poland
NSCL/FRIB Laboratory, Michigan State University, East Lansing, MI 48824, USA
GANIL, CEA/DSM - CNRS/IN2P3, BP 55027, F-14076 Caen Cedex, France
Abstract
Drip-line nuclei have very different properties from those of the valley of stability, as they are weakly bound and resonant. The models devised for stable nuclei therefore can no longer be applied to them. Hence, a new theoretical tool, the Gamow Shell Model (GSM), has been developed to study the many-body states occurring at the limits of the nuclear chart. GSM is a configuration interaction model based on the use of the so-called Berggren basis, which contains bound, resonant and scattering states, so that inter-nucleon correlations are fully taken into account and the asymptotes of extended many-body wave functions are precisely handled. However, large complex symmetric matrices must be diagonalized in this framework, so that the use of very powerful parallel machines is needed. In order to fully take advantage of their power, a 2D partitioning scheme using hybrid MPI/OpenMP parallelization has been developed in our GSM code. The specificities of the 2D partitioning scheme in the GSM framework are described and illustrated with numerical examples. It is then shown that the introduction of this scheme greatly enhances the capabilities of the GSM code.
Keywords:
Configuration interaction, MPI/OpenMP hybrid parallelization, 2D partitioning
1. Introduction
The Gamow Shell Model (GSM) is an extension of the traditional nuclear shell model (SM) into the complex-energy plane to describe weakly bound and resonant nuclei (see Ref. [1] for a review and Refs. [2, 3, 4] regarding recent developments).
The fundamental idea in GSM is to use a one-body basis generated by a finite-range potential, called the Berggren basis, which contains bound, resonant and scattering states, so that the outgoing character of the asymptotes of weakly bound and resonant nuclei can be imposed. It allows the calculation of halo states, which extend far away from the nuclear zone, and of resonant states, which are unbound.

Since GSM is a nuclear configuration interaction (CI) model, the dimensions of the Hamiltonian matrix, and therefore the computational and storage costs, increase exponentially with the number of particles in a calculation. It is thus necessary to develop a distributed-memory parallel implementation of GSM, which was done in an earlier work [1]. The initial version of the parallel GSM code utilized a one-dimensional (1D) partitioning scheme for the Hamiltonian matrix and simple hybrid parallelization techniques using MPI/OpenMP libraries. This initial version was convenient to implement and is well balanced in terms of the storage and computations associated with the Hamiltonian matrix elements. It performs relatively well when the number of computational nodes utilized is small, but it requires expensive inter-node communications for large-scale computations due to the 1D partitioning scheme, and it does not make effective use of the thread parallelism available on each node. Also, the basis vectors of the eigensolver used in finding the nuclear states of interest were replicated redundantly on each node. These limitations significantly hamper the efficiency of the GSM code when performing large-scale calculations.

In this paper, we describe an implementation of the GSM code which significantly improves its performance and storage requirements for large-scale calculations. As mentioned above, GSM is essentially a shell model code; therefore our implementation benefits greatly from techniques used in another shell model code called Many-body Fermion Dynamics nuclei (MFDn), which has been optimized to run efficiently on leadership-class supercomputers [5, 6, 7, 8, 9]. In particular, we adopt the two-dimensional (2D) checkerboard partitioning scheme of MFDn, a powerful technique that takes advantage of the symmetry of the Hamiltonian matrix while reducing the MPI communication overheads [10]. However, there are important differences between the GSM code and MFDn. First, GSM uses a continuous basis, i.e. the Berggren basis, as opposed to the discrete harmonic oscillator basis used in MFDn. This has important consequences for the construction of the Hamiltonian matrix in a 2D partitioned context, as the matrix is less trivially sparse. Second, a fundamental feature of SM is the separation of the large proton-neutron full model space into smaller proton and neutron subspaces, whose treatment poses no numerical problem. On the contrary, the weakly bound nuclei studied within GSM often present a large asymmetry between the number of neutrons and the number of protons. This asymmetry can lead to very large one-body spaces and demands additional memory storage optimizations in GSM, whereas these subspaces are minute compared to the total space in the regular SM. Consequently, a memory optimization absent from SM had to be devised in GSM.
This optimization deals with the storage of matrix elements between Slater determinants of the same type (only protons or only neutrons) and with the storage of uncoupled two-body matrix elements. Finally, the computation of many-body resonant states in GSM has led to additional problems absent from SM. Indeed, contrary to the bound states targeted with SM, the many-body resonant states targeted with GSM are situated in the middle of scattering states, their energies therefore being interior eigenvalues. Hence, they cannot be calculated with the Lanczos method, which is well suited to calculate extremal eigenvalues, as is the case for well-bound states. As a result, we have to use an eigensolver different from the Lanczos method used in MFDn. We use the Jacobi-Davidson (JD) method, which can directly target the interior eigenvalues and eigenvectors. We use preconditioning and angular momentum projection techniques to ensure rapid convergence of this eigensolver.

We begin by giving an overview of SM and GSM in Sect. 2. Techniques used in the construction of the Hamiltonian matrix are described in Sect. 3. The description of the JD eigensolver, the implementation of a suitable preconditioner and the use of angular momentum projection to ensure rotational invariance of the basis vectors are described in Sect. 4. The parallelization of the Hamiltonian construction, the JD eigensolver and the GSM basis orthogonalization are described in Sect. 5. Examples of MPI memory storage and computation times for two nuclear systems are given in Sect. 6.
2. Background on Shell Model Approaches
The basic idea of configuration interaction (CI) is to use a basis of independent-particle many-body states to expand correlated many-body states. In order to build independent-particle basis states, one starts from the one-body Schrödinger equation:

$$\left( \frac{\hat{p}^2}{2\mu} + \hat{V} \right) |\phi\rangle = e\,|\phi\rangle \qquad (1)$$

where p̂²/(2µ) is the kinetic operator, proportional to a Laplacian, V̂ is the one-body potential, µ is the effective mass of the particle, |φ⟩ is the one-body state solution of the one-body Schrödinger equation, and e is its energy. In most CI models, spherical potential operators V̂ are employed [10, 11], as they allow the rotational invariance of solutions to be taken into account exactly. This is also well suited for the present approach, where only quasi-spherical nuclei are considered. As V̂ is spherical, one can decompose |φ⟩ into radial and angular parts [12]:

$$\phi(\vec{r}) = \frac{u(r)}{r}\, Y_{\ell j m}(\theta, \varphi) \qquad (2)$$

where u(r) is the radial wave function and Y_{ℓjm}(θ, φ) is a spherical harmonic of orbital angular momentum ℓ coupled to spin degrees of freedom, so that its total angular momentum is j and the projection of j on the z axis is m [12]. We will use the standard orbital notation in the following, where the orbital quantum number ℓ = 0, 1, 2, 3, 4, ... is denoted as s, p, d, f, g, .... Hence, for example, the angular part of a wave function bearing ℓ = 1 and j = 3/2 is denoted as p_{3/2}. The radial wave function u(r) obeys the following Schrödinger equation:

$$u''(r) = \frac{2\mu}{\hbar^2} \left( \left( \frac{\hbar^2\, \ell(\ell+1)}{2\mu\, r^2} + V_l(r) - e \right) u(r) + \int V_{nl}(r, r')\, u(r')\, dr' \right) \qquad (3)$$

where V_l(r) and V_nl(r, r') are the radial local and non-local potentials, respectively, issued from V̂. Equation (3) is solved numerically using the method of Ref. [13].
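Ref. [13] provides the numerical method actually used for Eq. (3) and is not reproduced here. As a rough illustration of how such a radial equation can be integrated, the sketch below applies the standard Numerov scheme to the local part of Eq. (3) only; the Woods-Saxon-like potential, its parameters and the trial energy are placeholder assumptions, not values used in the GSM code.

```cpp
// Minimal sketch: Numerov integration of u'' = f(r) u for the local part of
// Eq. (3). All physical parameters below are illustrative assumptions.
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    const double hbar2_2mu = 20.736;     // ~ hbar^2/(2 mu) in MeV fm^2 (assumed)
    const int    l = 1;                  // orbital angular momentum (p wave)
    const double e = -2.0;               // trial energy in MeV (assumed)
    const double h = 0.01, rmax = 15.0;  // radial step and box size in fm
    const int    n = static_cast<int>(rmax / h);

    // Woods-Saxon-like local potential with illustrative parameters.
    auto V = [](double r) { return -60.0 / (1.0 + std::exp((r - 2.5) / 0.65)); };
    // f(r) = u''/u from Eq. (3), non-local term omitted.
    auto f = [&](double r) {
        return (l * (l + 1)) / (r * r) + (V(r) - e) / hbar2_2mu;
    };

    std::vector<double> u(n + 1);
    u[1] = std::pow(h, l + 1);           // regular solution u ~ r^{l+1} at origin
    u[2] = std::pow(2 * h, l + 1);
    for (int i = 2; i < n; ++i) {        // Numerov recurrence, O(h^6) accuracy
        double r0 = (i - 1) * h, r1 = i * h, r2 = (i + 1) * h;
        u[i + 1] = (2.0 * (1.0 + 5.0 * h * h * f(r1) / 12.0) * u[i]
                    - (1.0 - h * h * f(r0) / 12.0) * u[i - 1])
                   / (1.0 - h * h * f(r2) / 12.0);
    }
    // For a bound state, e would be adjusted until u(rmax) ~ 0.
    std::printf("u(rmax) = %e\n", u[n]);
}
```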
In the traditional SM, the |φ⟩ states are harmonic oscillator states [10, 11], which are well suited for well-bound nuclear states, but not for loosely bound and resonant nuclei. Hence, we use Berggren basis states instead [14], which are generated by a finite-depth potential and contain bound, resonant and scattering states (see Fig. 1).

Figure 1: Berggren basis for a given partial wave, shown in the complex-momentum plane (axes Re(k) and Im(k), scattering contour L⁺). Different types of states, including bound, decaying and scattering states, as well as anti-bound and capturing states, are depicted.

The Berggren completeness relation reads:

$$\sum_n |\phi_n\rangle \langle \phi_n| + \int_{L^+} |\phi(k)\rangle \langle \phi(k)|\, dk = I \qquad (4)$$

where the |φ_n⟩ states are the bound states and the resonant states lying inside the L⁺ contour of Fig. 1. These states are usually called pole states, as they are poles of the S-matrix [1]. The |φ(k)⟩ states stand for scattering states and follow the L⁺ contour of Fig. 1, and I is the identity operator. Scattering states, which initially form a continuum, are discretized using a Gauss-Legendre quadrature with about 30 points to ensure convergence [15]. Once discretized, the Berggren basis is in effect used in the same way as a basis of harmonic oscillator states in the context of CI.

As |φ⟩ has fixed quantum numbers j and m, angular momentum algebra can be applied to it with the ladder operators j_±. Let us write |φ⟩ = |u ℓ j m⟩ so as to make the dependence on quantum numbers apparent:

$$j_\pm\, |u\, \ell\, j\, m\rangle = \hbar \sqrt{j(j+1) - m(m \pm 1)}\; |u\, \ell\, j\, (m \pm 1)\rangle \qquad (5)$$

which raises or lowers the value of m by one unit, keeping all other quantum numbers unchanged. Another operator based on the properties of one-body states is the m-reversal operator (also improperly called time-reversal symmetry):

$$T\, |j\, m\rangle = (-1)^{j-m}\, |j\, {-m}\rangle \qquad (6)$$

The use of this operator allows memory and computations to be saved in CI, as we will see in the following sections.

One can build independent-particle many-body states of all nucleons from antisymmetrized tensor products of |φ⟩ states of coordinates r⃗₁, ..., r⃗_A, or Slater determinants:

$$SD(\vec{r}_1, \cdots, \vec{r}_A) = \frac{1}{\sqrt{A!}} \begin{vmatrix} \phi_1(\vec{r}_1) & \phi_2(\vec{r}_1) & \cdots & \phi_A(\vec{r}_1) \\ \phi_1(\vec{r}_2) & \phi_2(\vec{r}_2) & \cdots & \phi_A(\vec{r}_2) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_1(\vec{r}_A) & \phi_2(\vec{r}_A) & \cdots & \phi_A(\vec{r}_A) \end{vmatrix} \qquad (7)$$

where A is the number of nucleons of the considered nuclear state. A complete basis can then be generated by considering all combinations of |φ⟩ states provided by Eq. (3). This basis is of infinite dimension and has to be truncated in practice. For this, one considers maximal values of ℓ and e for the |φ⟩ states used, and one also limits the number of occupied scattering |φ⟩ states in Eq. (9) (see below), which is typically 2 to 4 at most.
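To illustrate the Gauss-Legendre discretization of the L⁺ contour entering Eq. (4), the following minimal sketch maps quadrature nodes onto a polygonal contour in the complex-k plane; the contour vertices and the number of points per segment are illustrative assumptions (a realistic calculation uses about 30 points in total, as stated above).

```cpp
// Sketch: Gauss-Legendre discretization of a polygonal L+ contour (Eq. (4)).
// Vertices and point counts are illustrative assumptions only.
#include <complex>
#include <cstddef>
#include <cstdio>
#include <vector>

// 2-point Gauss-Legendre rule on [-1,1]; a real application would use
// ~10 points per segment.
static const double gl_x[] = {-0.5773502691896257, 0.5773502691896257};
static const double gl_w[] = {1.0, 1.0};

int main() {
    using cplx = std::complex<double>;
    // L+ contour dipping into the lower half of the complex-k plane (fm^-1),
    // returning to the real axis and extending to a momentum cutoff.
    std::vector<cplx> vertices = {{0.0, 0.0}, {0.3, -0.2}, {1.0, 0.0}, {4.0, 0.0}};

    std::vector<cplx> k_nodes;    // discretized scattering states |phi(k_i)>
    std::vector<cplx> k_weights;  // quadrature weights entering Eq. (4)
    for (std::size_t s = 0; s + 1 < vertices.size(); ++s) {
        cplx a = vertices[s], b = vertices[s + 1];
        for (int i = 0; i < 2; ++i) {
            // Linear map of the node x in [-1,1] onto the segment [a,b].
            k_nodes.push_back(0.5 * (a + b) + 0.5 * (b - a) * gl_x[i]);
            k_weights.push_back(0.5 * (b - a) * gl_w[i]);
        }
    }
    for (std::size_t i = 0; i < k_nodes.size(); ++i)
        std::printf("k = (%+.4f, %+.4f), w = (%+.4f, %+.4f)\n",
                    k_nodes[i].real(), k_nodes[i].imag(),
                    k_weights[i].real(), k_weights[i].imag());
}
```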
Slater determinants involving pole states only are of fundamental importance, as they form the most important many-body basis states of a nuclear state decomposition. In particular, a diagonalization of the Hamiltonian in a model space generated by pole Slater determinants is called the pole approximation and is used to initialize the JD method (see Sect. 2.1). The other Slater determinants are called scattering states.

The extension of the m-reversal operator of Eq. (6) to Slater determinants is also straightforward:

$$T\, |SD\rangle = \prod_{i=1}^{A} T_i\, |\phi_i\rangle \qquad (8)$$

In the following, we will use the occupation formalism [12]. It is based on the use of creation and annihilation operators of one-body states, denoted by a†_α and a_α respectively for the creation and annihilation of the state |α⟩. Consequently, a Slater determinant can be written in a more concise form:

$$|SD\rangle = a^\dagger_{\phi_A} \cdots a^\dagger_{\phi_1} |0\rangle = |\phi_1\, \phi_2 \cdots \phi_A\rangle \qquad (9)$$

where |0⟩ is the vacuum state, which contains no particles by definition. The occupation formalism algebra is closed and fulfills antisymmetry requirements if the operators verify the following equations:

$$\{a^\dagger_\alpha, a^\dagger_\beta\} = 0\,, \qquad \{a_\alpha, a_\beta\} = 0\,, \qquad \{a^\dagger_\alpha, a_\beta\} = \delta_{\alpha\beta} \qquad (10)$$

where |α⟩ and |β⟩ are one-body states and the brackets denote the anticommutation relation:

$$\{O_1, O_2\} = O_1 O_2 + O_2 O_1 \qquad (11)$$

where O₁ and O₂ are two operator functions of a† and a. It is then convenient to write the Hamiltonian in the occupation formalism:

$$H = \sum_{\alpha\beta} \langle \alpha|h|\beta \rangle\, a^\dagger_\alpha a_\beta + \sum_{\alpha<\beta,\; \gamma<\delta} \langle \alpha\beta|V|\gamma\delta \rangle\, a^\dagger_\alpha a^\dagger_\beta a_\delta a_\gamma \qquad (12)$$

where h is the one-body part of H, containing its kinetic and mean-field parts, while V is its two-body part, embodying inter-nucleon correlations, and α, β, γ, δ are one-body states. In the following, as the h matrix elements come with one creation and one annihilation operator, and as V comes with two creation and two annihilation operators, they will be referred to as the one particle-one hole (1p-1h) part and the two particle-two hole (2p-2h) part, respectively.

In particular, with N_s being the number of states used in the one-body basis, one can see that the number of 1p-1h matrix elements scales as N_s² and that of 2p-2h matrix elements scales as N_s⁴. In order to provide an order of magnitude for N_s, let us consider a p_{3/2} contour for the Berggren basis. As the Gauss-Legendre quadrature imposed on the Berggren basis contour provides about 30 states (see Sect. 2.1), and as one has 4 possible m-projections for a p_{3/2} shell, one has N_s = 120. It is thus clear that the number of 1p-1h calculations in H (see Eq. (12)) is at least four orders of magnitude smaller than the number of 2p-2h calculations. Consequently, the 1p-1h part can be neglected from a performance point of view.

A matrix element ⟨SD_f|H|SD_i⟩, with |SD_i⟩ and |SD_f⟩ the initial and final Slater determinants, respectively, can be written as a function of one-body and two-body matrix elements, as well as of expectation values of creation and annihilation operators between |SD_i⟩ and |SD_f⟩ (see Eq. (12)). The latter are in particular equal to 0 or ±1.
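These 0/±1 values can be made concrete with a generic second-quantization sketch operating on Slater determinants stored as ordered arrays of one-body state indices (the storage described in Sect. 3); this is an illustration of the bookkeeping, not the GSM routine itself.

```cpp
// Sketch: evaluate <SD_f| a†_alpha a_beta |SD_i> for Slater determinants stored
// as strictly increasing arrays of occupied state indices. Returns 0 or ±1.
#include <algorithm>
#include <cstdio>
#include <vector>

using SD = std::vector<int>; // occupied one-body state indices, sorted

// Annihilate state 's' in 'sd'; returns the fermionic sign, or 0 if unoccupied.
int annihilate(SD& sd, int s) {
    auto it = std::lower_bound(sd.begin(), sd.end(), s);
    if (it == sd.end() || *it != s) return 0;
    int sign = ((it - sd.begin()) % 2 == 0) ? 1 : -1; // (-1)^(# states before s)
    sd.erase(it);
    return sign;
}

// Create state 's' in 'sd'; returns the fermionic sign, or 0 if occupied.
int create(SD& sd, int s) {
    auto it = std::lower_bound(sd.begin(), sd.end(), s);
    if (it != sd.end() && *it == s) return 0;         // Pauli principle
    int sign = ((it - sd.begin()) % 2 == 0) ? 1 : -1;
    sd.insert(it, s);
    return sign;
}

// <SD_f| a†_alpha a_beta |SD_i>, equal to 0 or ±1.
int one_body_phase(const SD& sd_f, const SD& sd_i, int alpha, int beta) {
    SD tmp = sd_i;
    int s1 = annihilate(tmp, beta);
    if (s1 == 0) return 0;
    int s2 = create(tmp, alpha);
    if (s2 == 0) return 0;
    return (tmp == sd_f) ? s1 * s2 : 0;
}

int main() {
    SD sd_i = {1, 3, 4}, sd_f = {2, 3, 4};
    std::printf("<SD_f| a+_2 a_1 |SD_i> = %d\n", one_body_phase(sd_f, sd_i, 2, 1));
}
```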
The basis of Slater determinants provided by Eq. (9) is complete and fully antisymmetric. However, it is not rotationally invariant, as the total angular momentum is not defined therein. Its projection on the z axis, however, is conserved, since its value M is equal to the sum of the one-body total angular momentum projections m_i, i ∈ [1 : A] (see Eq. (2)). Consequently, given that M is a good quantum number in nuclear states, we only have to consider the Slater determinants of fixed total angular momentum projection M in a shell model calculation, which is called the M-scheme approach in CI [16].

This approach is preferred to the J-scheme [16], where Slater determinants are replaced by independent-particle many-body states coupled to J. Due to the conservation of the many-body total angular momentum J at the basis level in J-scheme, CI dimensions are typically smaller by a factor of 5-20 in light nuclei. However, the J-scheme formulation is also more difficult to implement due to antisymmetry requirements and generally leads to denser Hamiltonian matrices (more non-zero elements) [17].

The J quantum number can be imposed in M-scheme by using appropriate linear combinations of Slater determinants. This can be effected because the Ĵ² operator is closed in M-scheme, as it connects Slater determinants whose one-body states differ through their m quantum number only, i.e. belonging to the same configuration (also called partition) [10, 11]. A configuration enumerates its occupied shells without consideration of the m quantum numbers of the occupied one-body states. For example, if |0s_{1/2}(-1/2) 0s_{1/2}(1/2) 0p_{3/2}(-1/2)⟩ is a Slater determinant, its associated configuration is [0s_{1/2}² 0p_{3/2}]. Hence, it is always possible to build J-coupled states in a configuration from linear combinations of its Slater determinants. This is done using the Löwdin operator [18]:

$$P_J = \prod_{J' \neq J} \frac{\hat{J}^2 - J'(J'+1)\, I}{J(J+1) - J'(J'+1)} \qquad (13)$$

which projects out all the angular momenta J' ≠ J. It has been checked numerically that the J quantum number is conserved to high precision in our applications. In order to apply Ĵ², one considers its standard formulation:

$$\hat{J}^2 = J_- J_+ + M(M+1)\, I \qquad (14)$$

$$J_\pm = \sum_{i=1}^{A} j_{\pm i} \qquad (15)$$

where Eq. (5) has been used. As J_± connects Slater determinants of the same configuration whose M quantum numbers vary by one unit, it is a very sparse one-body operator, so that the computation of Ĵ² is fast as well. Consequently, even though Slater determinants do not have J as a good quantum number, the use of P_J of Eq. (13) allows J-coupled linear combinations of Slater determinants to be built efficiently.

The M-reversal symmetry (see Eq. (6) and Eq. (8)) is a direct consequence of rotational invariance. Indeed, rotational invariance implies that the physical properties of a shell model eigenvector are independent of M, while the M-reversal symmetry implies that they are invariant under the symmetry M ↔ -M. In particular, if M = 0, the M-reversal symmetry allows only half of the output shell model vector to be calculated when one multiplies H by an input shell model vector. Indeed, the components on the Slater determinants |SD⟩ and T|SD⟩ then differ only by a phase equal to ±1. Hence, for M = 0, even nuclei are twice as fast to calculate as originally expected.

Theoretically, the use of P_J of Eq. (13) to impose J as a good quantum number is not necessary. Indeed, as [H, J⃗] = 0, a linear combination of Slater determinants coupled to J provides another vector coupled to J when acted on by H. However, due to numerical inaccuracy, H and J⃗ do not exactly commute, so that the J quantum number can be lost after several matrix-vector multiplications. The obvious treatment for this is to suppress the shell model vector components with J' ≠ J when the vector is no longer an eigenstate of Ĵ², which is effected using Eq. (13). In order to know whether P_J has to be applied or not, one has to test whether a shell model vector is an eigenstate of Ĵ² or not, which is rather fast. Indeed, if M = J, as is usually the case, one checks whether a shell model vector is an eigenstate of Ĵ² by testing whether the action of J_+ on this shell model vector yields zero, as the angular momentum projection of a vector coupled to J cannot be larger than J. The only exception to this rule occurs when one uses the M-reversal symmetry with J > 0, as M = 0 in all cases when the M-reversal symmetry is applied. In this case, one has to calculate the action of Ĵ² - J(J+1)I on the considered shell model vector to check whether it is equal to zero. Hence, as one applies J_± only once or twice after the matrix-vector operation borne by the application of H, this test is very fast compared to an H or P_J application.
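As an illustration of Eqs. (13)-(15), the following sketch applies the Löwdin projector to a shell model vector, given a routine applying the sparse Ĵ² operator; the callback-based structure is an assumption made for the sake of the example, not the GSM implementation.

```cpp
// Sketch of the Löwdin projection P_J of Eq. (13): repeatedly apply
// (J² - J'(J'+1) I) / (J(J+1) - J'(J'+1)) for every unwanted J'.
// 'applyJ2' stands for the sparse J² = J-J+ + M(M+1)I operation (assumed).
#include <complex>
#include <cstddef>
#include <functional>
#include <vector>

using Vec = std::vector<std::complex<double>>;
using ApplyOp = std::function<void(const Vec&, Vec&)>; // out = J² * in

// Project the J quantum number: v <- P_J v, removing components with J' != J.
void lowdin_project(Vec& v, double J, const std::vector<double>& unwantedJ,
                    const ApplyOp& applyJ2) {
    Vec tmp(v.size());
    for (double Jp : unwantedJ) {                 // one factor of Eq. (13) per J'
        applyJ2(v, tmp);                          // tmp = J² v
        const double num = Jp * (Jp + 1.0);
        const double den = J * (J + 1.0) - num;   // never 0 since J' != J
        for (std::size_t i = 0; i < v.size(); ++i)
            v[i] = (tmp[i] - num * v[i]) / den;   // v = (J² - J'(J'+1)I) v / den
    }
}
```

In GSM, this projection is applied only when the J_+ test described above signals a loss of J, so the cost of these extra operator applications remains marginal.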
3. Data Storage in GSM
In the SM and GSM approaches, memory utilization is of critical importance, as the sizes of the matrices involved grow rapidly with the problem size. To draw a comparison between SM and GSM in this regard, let us consider N_v valence nucleons in a one-body space of N_s states. Based on the discussions in the previous section, one can neglect antisymmetry, parity and M projection to make such a comparison, as the overall impact of these factors is minimal. Consequently, we will consider only 2p-2h matrix elements (see Sect. 2.3), of the form ⟨SD_f|H|SD_i⟩ = ±⟨α_f β_f|V|α_i β_i⟩. One then obtains that the dimension of the Hamiltonian in the GSM approach, which we will denote as d, is proportional to N_s^{N_v}, and that the probability of having a non-zero matrix element is proportional to (1/N_s)^{N_v-2}, as all states must be equal in |SD_i⟩ and |SD_f⟩ except for |α_i β_i⟩ ≠ |α_f β_f⟩. This implies that the total number of non-zeros is close to d^{1+2/N_v}, which corresponds to d^{1.67} and d^{1.5} for 3p-3h and 4p-4h truncations in the continuum, respectively. Compared with the typical number of non-zeros in SM at the same dimension [5], the Hamiltonian in the GSM approach thus results in matrices which are typically one to two orders of magnitude larger than those of the regular SM. To ensure fast construction of the GSM Hamiltonian and tackle the data storage issue of the resulting sparse matrices, we have developed the techniques presented below.

We saw that configurations and Slater determinants are necessary to build the many-body basis states of GSM. There are many fewer configurations than Slater determinants, which leads to many advantages. Due to valence space truncations, configurations are generated sequentially. This is computationally inexpensive, as there is a small number of configurations. Since Slater determinants of different configurations are independent, it is straightforward to use parallelization at this level: configurations are distributed over the available processing cores, so that all Slater determinants are generated independently by varying the m quantum numbers in the occupied shells of a fixed configuration. The obtained Slater determinants are then distributed to all compute nodes, as the full basis of Slater determinants must be present on each node.

When building the Hamiltonian matrix H, we first consider configurations and then their Slater determinants. This way, searches for Slater determinant indices are very fast, as a binary search is effected at the level of configurations first and at the level of Slater determinants afterwards. This also allows us to save memory space, as all indices related to shells can be stored in arrays involving configurations only, while those involving the m quantum numbers only are stored in arrays involving Slater determinants.
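A minimal sketch of this two-level search follows; the data layout (configurations sorted by shell content, Slater determinants sorted within each configuration) is an illustrative assumption consistent with the description above.

```cpp
// Sketch of the two-level index search: binary search over configurations
// first, then over the Slater determinants of the matching configuration.
#include <algorithm>
#include <cstddef>
#include <vector>

struct Configuration {
    std::vector<int> shells;   // occupied shells (no m quantum numbers)
    std::size_t sdBegin;       // first index of its Slater determinants
    std::size_t sdEnd;         // one past its last Slater determinant
};

using SD = std::vector<int>;   // occupied one-body states (m-resolved), sorted

// Global index of 'sd' belonging to configuration 'confShells', or -1 if absent.
long find_sd_index(const std::vector<Configuration>& configs,
                   const std::vector<SD>& sds,
                   const std::vector<int>& confShells, const SD& sd) {
    auto c = std::lower_bound(configs.begin(), configs.end(), confShells,
                              [](const Configuration& a, const std::vector<int>& b) {
                                  return a.shells < b;
                              });
    if (c == configs.end() || c->shells != confShells) return -1;
    auto first = sds.begin() + c->sdBegin, last = sds.begin() + c->sdEnd;
    auto s = std::lower_bound(first, last, sd);
    if (s == last || *s != sd) return -1;
    return s - sds.begin();
}
```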
In GSM, configurations and Slater determinants involve few valence particles, many shells and many one-body states, whereas one has many valence particles, few shells and few one-body states in SM. Thus, we use an implementation based on a storage scheme different from the traditional bit storage of SM [11]. As a state can be occupied by at most one nucleon due to antisymmetry, and owing to the previously mentioned SM features, it is convenient in SM to associate a Slater determinant with a binary number. For example, |1011000000⟩ is a Slater determinant where the states 1, 3 and 4 are occupied, while the states 2 and 5, ..., 10 are unoccupied. The strength of this method is that Slater determinant algebra is taken care of by operations on bits, which are very fast. However, this method becomes inefficient if one has many unoccupied states: one integer possesses 4 bytes, or 32 bits, each equal to 0 or 1, so that it can encode at most 32 states. It is customary in GSM to have contours for the s_{1/2}, p_{1/2} and p_{3/2} partial waves, which are typically discretized with 30 points each (see Sect. 2.1). Conversely, it is sufficient to use 5 to 10 harmonic oscillator states for the d_{3/2} and d_{5/2} partial waves to obtain convergence. Let us consider as an example that we discretize contours with 30 points and take 10 harmonic oscillator states for the d partial waves, which are in fact typical values in practice. This generates 340 one-body states, so that 11 integers, i.e. 44 bytes as a whole, would be necessary to store a single Slater determinant. This would be inefficient memory-wise, because one would have to store many zeros, and possibly also inefficient performance-wise due to the larger arrays to consider. Hence, we store configurations and Slater determinants as regular arrays, in which case the former example would be stored as {1, 3, 4}, requiring only 6 bytes, as each state index is represented by a short integer. The bit scheme and the regular scheme become equivalent only at 22 valence nucleons, which is well beyond the current capacities of our code, where the maximal number of particles in the continuum is about 4. Calculations are fast with this storage scheme, and they can be easily loop-parallelized (unlike bit operators acting on individual bits).
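The memory figures quoted above can be reproduced with the following toy comparison; the 340-state space and the {1, 3, 4} example are those of the text.

```cpp
// Illustration of the storage comparison discussed above: bit encoding over
// 340 one-body states versus a short-integer array for few valence particles.
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    const int nStates = 340;                    // typical GSM one-body space
    const int nIntegers = (nStates + 31) / 32;  // 32-bit words needed
    std::printf("bit storage   : %d integers = %d bytes\n",
                nIntegers, nIntegers * 4);      // 11 integers, 44 bytes

    std::vector<std::uint16_t> sd = {1, 3, 4};  // occupied states as short ints
    std::printf("array storage : %zu shorts  = %zu bytes\n",
                sd.size(), sd.size() * sizeof(std::uint16_t)); // 6 bytes
}
```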
We will deal here with the numerical problems generated by the large asymmetry between proton and neutron spaces. As the problem is the same whether it is the neutron space or the proton space which is much larger, and as we will study examples where one has valence neutrons only, we will only consider the case where the neutron space is dominant. Obviously, proton-rich nuclei, for which the proton space is largest, bear similar properties. As already mentioned, one cannot store all data related to the neutron space, as the space required to do so is much larger than that needed in SM. The fundamental reason for this is the use of the Berggren basis. Indeed, one typically has 100-200 neutron valence shells, arising from the discretization of scattering contours (see Fig. 1), as one has to discretize the contour of one partial wave with about 30 states, and one has 5 partial waves when ℓ ≤ 2 (s_{1/2}, p_{1/2}, p_{3/2}, d_{3/2}, d_{5/2}). Conversely, this number is 30 for an 8ħω space in SM, as one considers harmonic oscillator shells of the form |n ℓ j⟩ which then have to satisfy 2n + ℓ ≤ 8. Additionally, the neutron-to-proton ratio in typical GSM applications is usually very large (for instance, in the study of nuclei along the neutron drip-line) and one may often have only valence neutrons in a calculation. It would be too costly to recalculate neutron matrix elements each time. Thus, memory optimization schemes had to be devised to store the matrix elements between neutron Slater determinants. They deal with what we call phases, i.e. matrix elements involving creation or annihilation operators only, of the form ⟨SD_f|a†_α|SD_i⟩, ⟨SD_f|a†_α a_β|SD_i⟩, ..., that are equal to ±1,
together with which the indices of the involved one-body shells and states, configurations and Slater determinants must be stored.

Let us define the initial and final spaces as the spaces to which the initial |Ψ_i⟩ and final |Ψ_f⟩ many-body states belong, where one has H|Ψ_i⟩ = |Ψ_f⟩. In the 2D parallelization scheme (which will be discussed in more detail in Sect. 5), one can use the fact that the initial and final spaces assigned to each processor core on the square Hamiltonian are only a fraction of the full space. They refer to the numbers of rows and columns, respectively [10]. Hence, we can store matrix elements of the form ⟨SD_int|a_α|SD_i⟩ and ⟨SD_f|a†_α|SD_int⟩, ⟨SD_int|a_α a_β|SD_i⟩ and ⟨SD_f|a†_α a†_β|SD_int⟩, where |SD_int⟩ is an intermediate Slater determinant chosen so that the data to store are minimal. It is clear that any one-body or two-body observable can be calculated with this scheme. Indeed, one has:

$$\langle SD_f|a^\dagger_\alpha a_\beta|SD_i\rangle = \langle SD_f|a^\dagger_\alpha|SD_{int}\rangle\, \langle SD_{int}|a_\beta|SD_i\rangle \qquad (16)$$

$$\langle SD_f|a^\dagger_\alpha a^\dagger_\beta a_\delta a_\gamma|SD_i\rangle = \langle SD_f|a^\dagger_\alpha a^\dagger_\beta|SD_{int}\rangle\, \langle SD_{int}|a_\delta a_\gamma|SD_i\rangle \qquad (17)$$

The phase matrix, containing these matrix elements, is stored in a sparse form: one fixes |SD_i⟩, and all |SD_f⟩ varying by one or two states are generated. Thus, the obtained phase, but also the indices of |SD_f⟩ and of the associated one-body states (α, β, γ and δ), must be stored. In order to save memory, one loops firstly over configurations (see Sect. 2.4), so that only the configuration index of |SD_f⟩ and the shell indices (functions of n, ℓ, j, but not m) associated with the one-body states (α, β, γ and δ) are stored, and one loops over the Slater determinants of that fixed configuration afterwards, where the m-dependent values only are stored. From this information, one can recover the full phase matrix. One can see that the number of phases involving |SD_int⟩ is proportional to 2N_v for one-body phases and to N_v(N_v - 1)
for two-body phases, with N_v being the number of valence nucleons. Indeed, one has two arrays of one-body phases and two arrays of two-body phases. The storage of phase matrices is further optimized by requiring that α < β and γ < δ and by leveraging the M-reversal symmetry. Indeed, if one applies the M-reversal operator of Eq. (8) to the Slater determinants present in ⟨SD_int|a_α|SD_i⟩ and ⟨SD_int|a_α a_β|SD_i⟩, the obtained phase and associated indices can easily be recovered from the phase matrix element of the initial Slater determinants. This yields an additional memory gain of about a factor of 2 for the storage of phases, at the cost of negligible additional numerical operations.

We note that the memory optimization of phases described above is not utilized for the angular momentum operator Ĵ². Since Ĵ² is a function of J_± (see Sect. 2.4), which are very sparse operators, the number of phases used therein is much smaller than for the Hamiltonian matrix. The decomposition of Eq. (16) involving an intermediate Slater determinant is therefore not necessary.

The 1p-1h part of H is negligible for performance purposes and hence is not considered in this section (see Sect. 2.3). Therefore, we concentrate firstly on the neutron 2p-2h part of H. From Eq. (12), one has:

$$\langle SD_f|H|SD_i\rangle = \langle SD_f|a^\dagger_\alpha a^\dagger_\beta a_\delta a_\gamma|SD_i\rangle\, \langle \alpha\beta|V|\gamma\delta\rangle \qquad (18)$$

One then has to generate all the Slater determinants |SD_f⟩ for a fixed Slater determinant |SD_i⟩. For this, one loops over all intermediate configurations C_int and Slater determinants |SD_int⟩ (see above). One then obtains the ⟨SD_int|a_δ a_γ|SD_i⟩ phase and the associated one-body states. Using the same procedure on |SD_int⟩, one generates the ⟨SD_f|a†_α a†_β|SD_int⟩ phase and the associated one-body states, so that the two-body phase ⟨SD_f|a†_α a†_β a_δ a_γ|SD_i⟩ is obtained along with its one-body states. The two-body matrix element ⟨αβ|V|γδ⟩ follows from the knowledge of the one-body states, so that the Hamiltonian matrix element ⟨SD_f|H|SD_i⟩ is obtained.

The treatment of the proton 2p-2h part of H is mutatis mutandis the same as that of its neutron 2p-2h part, and the proton-neutron 2p-2h part of H is very similar:

$$\langle SD_f|H|SD_i\rangle = \langle \alpha_p \beta_n|V|\gamma_p \delta_n\rangle \times \langle SD_{p(f)}|a^\dagger_{\alpha_p}|SD_{p(int)}\rangle\, \langle SD_{p(int)}|a_{\gamma_p}|SD_{p(i)}\rangle \times \langle SD_{n(f)}|a^\dagger_{\beta_n}|SD_{n(int)}\rangle\, \langle SD_{n(int)}|a_{\delta_n}|SD_{n(i)}\rangle \qquad (19)$$

where the proton and neutron characters of the states have been written explicitly. The computational method is otherwise the same as for the neutron 2p-2h part of H.
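Schematically, the construction of a Hamiltonian row from stored phases, following Eqs. (16)-(18), can be sketched as below; the Phase structure and the interaction callback are illustrative assumptions, not the actual GSM storage layout.

```cpp
// Sketch of the assembly of Eq. (18) through the intermediate-determinant
// factorization of Eq. (17).
#include <complex>
#include <utility>
#include <vector>

struct Phase {            // one entry of a stored phase matrix
    long sdIndex;         // index of the connected Slater determinant
    int  stateA, stateB;  // one-body states removed (or added)
    int  sign;            // +1 or -1
};

// phasesAnn[i] : all <SD_int| a_delta a_gamma |SD_i>  for a given |SD_i>
// phasesCre[k] : all <SD_f| a†_alpha a†_beta |SD_int> for a given |SD_int>
// V(a,b,c,d)   : two-body interaction element <ab|V|cd> (assumed callback)
template <class TwoBody>
void assemble_row(long i,
                  const std::vector<std::vector<Phase>>& phasesAnn,
                  const std::vector<std::vector<Phase>>& phasesCre,
                  TwoBody&& V,
                  std::vector<std::pair<long, std::complex<double>>>& row) {
    for (const Phase& ann : phasesAnn[i]) {               // |SD_int>
        for (const Phase& cre : phasesCre[ann.sdIndex]) { // |SD_f>
            std::complex<double> h = double(ann.sign * cre.sign)
                * V(cre.stateA, cre.stateB, ann.stateB, ann.stateA);
            row.emplace_back(cre.sdIndex, h);             // <SD_f|H|SD_i> term
        }
    }
}
```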
The discussion so far has focused on the full storage scheme of the GSM code, where all elements of the Hamiltonian matrix are stored in memory. As the Hamiltonian matrix is always sparse, one only stores non-zero matrix elements and their associated indices. While this is ideal from a computational cost point of view, as no recalculation of matrix elements is needed, the memory requirements in GSM grow rapidly with the number of nucleons. This is the main factor limiting the problem size, as the work vectors, even though stored in fast memory, take much less memory than the Hamiltonian, only a few tens, or at worst hundreds, of them being necessary (see Sect. 5). Despite the memory optimizations described above, the full storage scheme is essentially total memory bound; in other words, the calculations possible with this scheme are limited by the aggregate memory space available on the compute nodes. Therefore, in addition to the full storage scheme, we have developed on-the-fly and partial matrix storage schemes for GSM.

In the on-the-fly scheme, the sparse matrix-vector multiplications (SpMVs) needed during the eigensolver iterations are performed on the fly, the Hamiltonian being recalculated from scratch at each eigensolver iteration. While the memory requirements for the Hamiltonian matrix are virtually non-existent in this case, the computation time is maximal due to the repeated constructions of the Hamiltonian.

In the partial storage scheme, we do not store the final Hamiltonian matrix elements, each of which is made up of a two-body interaction coefficient multiplied by a phase; these form basically the whole Hamiltonian matrix up to a negligible part (see the discussion in Sect. 5). Indeed, the number of two-body interaction coefficients is much smaller than the number of Hamiltonian matrix elements (see Sect. 2.3). Therefore, it is more efficient from the storage point of view to store the index and phase of the two-body matrix element entering each Hamiltonian matrix element, so that the final matrix element is determined by a lookup into the two-body interaction coefficient array and a multiplication by its corresponding phase. Here, the most time consuming part is the search in the two-body interaction coefficient array, as the latter is very large and the two-body matrix elements considered in subsequent searches are not necessarily contiguous in the interaction array. The memory gain obtained with the partial storage method compared to the full storage method is about a factor of 2.5: two-body matrix elements are complex numbers and integers are stored instead, on the one hand, while it is still necessary to store the indices of the Slater determinants |SD_f⟩ (see Sect. 3.3), on the other hand. Consequently, the partial storage scheme lies in between the full storage and on-the-fly computation schemes: its memory requirements, while still significant, are not as large as those of the full storage scheme, but the partial storage method is faster than the on-the-fly scheme as it does not require reconstructing the Hamiltonian from scratch.
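The three schemes can be contrasted with the following minimal sketch of how a single Hamiltonian element is obtained in each case; the types and array names are illustrative assumptions.

```cpp
// Sketch of element access under the three storage schemes discussed above.
#include <complex>
#include <vector>

using cplx = std::complex<double>;

struct FullEntry    { long col; cplx value; };            // full: value kept
struct PartialEntry { long col; int phase; long vIdx; };  // partial: index+phase

cplx element_full(const FullEntry& e) {
    return e.value;                                 // direct read
}

cplx element_partial(const PartialEntry& e, const std::vector<cplx>& twoBodyV) {
    return double(e.phase) * twoBodyV[e.vIdx];      // lookup + phase
}

// On-the-fly: nothing is stored; the element is recomputed from the phase
// factorization of Eqs. (16)-(18) at every SpMV (cheapest in memory,
// most expensive in time).
```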
4. Diagonalization of the GSM matrix
In SM, as one is only interested in the low-lying spectrum of a real symmetric Hamiltonian, the Lanczos method is the tool of choice. It is a Krylov subspace method which makes use of matrix times vector operations only, and in which the low-lying extremal eigenvalues are the ones that converge first. While the same method could be used for searching the bound states in GSM, it leads to poor or no convergence at all for resonant states. This is due to the presence of numerous scattering states lying below a given resonant state in GSM, which would have to converge before the desired resonant state can converge in the Lanczos method. Consequently, we turn to a diagonalization method that can directly target interior eigenvalues, i.e. the JD method. Indeed, JD only involves sparse matrix times vector operations like the Lanczos method, but it also includes an approximate shift-and-invert method which allows it to directly target interior eigenvalue and eigenvector pairs.
As GSM Hamiltonian matrices are complex symmetric, a standard problem mentioned in this case is the numerical breakdown of the JD method [19]. Indeed, the Berggren metric is not the Hermitian metric, but the analytic continuation of the real symmetric scalar product. Consequently, the Berggren norm of a GSM vector can in principle be equal to zero. Such a vector cannot be normalized, and the JD iterations then fail. However, for this to occur, one would need matrix elements whose imaginary parts are of the same order of magnitude as their real parts on the one hand, and rather large off-diagonal matrix elements on the other hand. As neither of these conditions is met in practice, breakdown does not occur. Moreover, as the number of basis vectors is typically a few tens or hundreds, the additional memory storage and diagonalization time remain negligible in our case. Consequently, the use of complex symmetric matrices does not lead to any convergence issues in GSM.

Let us note that the Lanczos method can be directly applied to the complex symmetric case if one replaces the Hermitian metric by the Berggren metric [14]. But the Lanczos method performs poorly compared to the JD method, because it requires hundreds of iterations if one calculates a loosely bound nuclear eigenstate, and does not even converge for resonant many-body eigenstates unless a full diagonalization is utilized. Nevertheless, the JD method requires an initial set of eigenpairs to start its iterations, and the Lanczos method can determine the GSM eigenpairs at the pole approximation level. We note that one can generally perform a full diagonalization of the Hamiltonian matrix at the pole approximation level instead.
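For illustration, the Berggren scalar product differs from the Hermitian one only by the absence of complex conjugation; a minimal sketch (not the GSM linear-algebra kernels) follows.

```cpp
// Berggren (complex symmetric) scalar product: sum_i u_i * v_i, without
// conjugation. The resulting "norm squared" is complex and may in principle
// vanish, as discussed above.
#include <complex>
#include <cstddef>
#include <cstdio>
#include <vector>

using cplx = std::complex<double>;

cplx berggren_dot(const std::vector<cplx>& u, const std::vector<cplx>& v) {
    cplx s{0.0, 0.0};
    for (std::size_t i = 0; i < u.size(); ++i) s += u[i] * v[i];
    return s;
}

int main() {
    std::vector<cplx> psi = {{1.0, 0.5}, {0.2, -0.3}};
    cplx n2 = berggren_dot(psi, psi);   // complex norm squared
    std::printf("Berggren norm^2 = (%f, %f)\n", n2.real(), n2.imag());
}
```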
Even though the use of an approximate shift-and-invert scheme ensures convergence of the JD method to the desired interior eigenvalues, this convergence can be very slow. Hence, a crucial need in GSM is to find a good preconditioner H_prec, which transforms the eigenvalue problem into a form more favorable for a numerical solver, essentially accelerating convergence. The simple diagonal approximation for H_prec is not sufficient here, as off-diagonal matrix elements are large in the pole space. However, H is diagonally dominant in the scattering space. Consequently, to build an effective preconditioner, one can take H_prec equal to H itself in the pole space and equal to the diagonal of H in the scattering space (the diagonal matrix elements ⟨SD|H|SD⟩ involving the Slater determinants of a fixed configuration are in fact replaced by their average in order to ensure that [H_prec, J⃗] = 0). Using H_prec as a preconditioner in JD requires computing its inverse matrix. Since the dimension of the pole space is very small (a few hundred at most), and the rest of H_prec is only a diagonal matrix, computing the inverse of H_prec is inexpensive. Consequently, the chosen preconditioner has a minimal computational overhead and, by providing a reasonable approximation to H, it facilitates quick convergence. As the coupling to the continuum is small, with typically 70-80% of the GSM eigenvectors residing in the pole space, the preconditioned JD method converges in typically 30 iterations per eigenpair, e.g., one needs 30 JD iterations to calculate the ground state only, and 60 iterations to calculate the ground state plus the first excited state.

In light of the above discussion, the preconditioned JD method as implemented in the GSM code is summarized below (see Ref. [20] for a more detailed description of the JD method):

• Start from an approximation to the desired GSM eigenpair, E_i and |Ψ_i⟩. The first eigenpair E_0 and |Ψ_0⟩ is obtained from the pole approximation (see Sect. 2.2), where the Lanczos method with full orthogonalization (possible due to the small size of the Hamiltonian) can be used. After the first eigenpair, each set of converged pairs provides an approximation for the next pair.

• Calculate the residual |R_i⟩ = H|Ψ_i⟩ - E_i|Ψ_i⟩.

• Update the approximate eigenvector by solving the linear system (H_prec - E_i I)|Φ_{i+1}⟩ = |R_i⟩, where H_prec is the preconditioner for H (see Sect. 4.2).

• Orthonormalize |Φ_{i+1}⟩ with respect to all previous JD vectors |Φ_j⟩, 0 ≤ j ≤ i, and project it onto J if necessary (see Sect. 2.4).

• Project H onto the extended basis set |Φ_j⟩, 0 ≤ j ≤ i+1, so that one obtains a very small complex symmetric matrix which is diagonalized exactly using standard methods. This provides a new approximation to the eigenpair, E_{i+1} and |Ψ_{i+1}⟩. |Ψ_{i+1}⟩ is identified in the spectrum of the diagonalized small complex symmetric matrix using the overlap method [1]. For this, one looks for the eigenstate whose overlap with the pole approximation (see Sect. 2) is maximal. As pole Slater determinant components are always more important than scattering Slater determinant components (see Sect. 2.2), this guarantees that the eigenstate obtained with the overlap method corresponds to the sought physical nuclear state.
• Repeat the above steps until convergence, as indicated by the norm of the residual |R_i⟩.

We note that this procedure must be performed separately for each Hamiltonian eigenpair. However, one typically needs to calculate fewer than 5 eigenpairs per nucleus, the most common situations being the calculation of the ground state only, or of the ground state and the first excited state. Hence, it is not prohibitive to use a new JD procedure per eigenpair. A schematic sketch of a single JD iteration is given below.
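The sketch below reproduces only the loop structure of the preconditioned JD iteration described above; all operator applications are assumed callbacks, and the small Rayleigh-Ritz diagonalization with the overlap-based identification is omitted for brevity.

```cpp
// Schematic single iteration of the preconditioned Jacobi-Davidson loop.
// Structural sketch only, not the GSM eigensolver.
#include <cmath>
#include <complex>
#include <cstddef>
#include <functional>
#include <vector>

using cplx = std::complex<double>;
using Vec  = std::vector<cplx>;
using Op   = std::function<void(const Vec&, Vec&)>;   // out = A * in

struct JDState {
    std::vector<Vec> basis;  // JD search space |Phi_j>
    Vec  psi;                // current eigenvector approximation |Psi_i>
    cplx energy;             // current eigenvalue approximation E_i
};

// One JD step: residual, preconditioned correction, expansion of the basis.
// 'applyH' applies H; 'solvePrec' solves (H_prec - E I) x = r (assumed).
double jd_iteration(JDState& st, const Op& applyH,
                    const std::function<void(cplx, const Vec&, Vec&)>& solvePrec) {
    const std::size_t n = st.psi.size();
    Vec r(n), corr(n);

    applyH(st.psi, r);                              // r = H |Psi_i>
    for (std::size_t k = 0; k < n; ++k)
        r[k] -= st.energy * st.psi[k];              // r = (H - E_i) |Psi_i>

    double rnorm2 = 0.0;
    for (const cplx& x : r) rnorm2 += std::norm(x); // residual norm (Hermitian)

    solvePrec(st.energy, r, corr);                  // (H_prec - E_i I) corr = r
    st.basis.push_back(corr);                       // to be orthonormalized,
                                                    // J-projected, then fed to
                                                    // the small Rayleigh-Ritz step
    return std::sqrt(rnorm2);                       // convergence indicator
}
```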
5. Parallelization of the GSM code
The first version of the GSM code followed a 1D partitioning scheme, where the Hamiltonian H was partitioned along its rows. The H matrix, despite being symmetric, was stored as a full (but still sparse) matrix, and each block of rows was assigned to a different process. MPI and OpenMP versions, as well as the hybrid scheme combining the two of them, were implemented. This scheme is indeed convenient to implement, is well balanced for the storage of the H matrix elements, and is reasonably efficient when the number of processes is small, e.g. less than 100, as is the case in a small-scale execution. Its main drawback is that, to effect the sparse Hamiltonian times vector operation during the JD iterations, the entire input vector must be replicated on each node; this clearly generates expensive MPI communications. Consequently, the 1D partitioning approach is not scalable to the thousands of cores which would be needed for accurate calculations of heavy nuclei.

In the current version of GSM, we implemented a 2D partitioning scheme that improves both the memory storage and the MPI communication costs of the 1D partitioning scheme. In this scheme, the symmetry of H is exploited and H is divided into N = n_d(n_d+1)/2 small square submatrices; the case n_d = 5 and N = 15 is depicted in Fig. 2.

Figure 2: Example of the 2D partitioning scheme for the Hamiltonian matrix. Occupied squares are denoted by numbers and unoccupied squares by dashed lines.

Consequently, each process receives an (almost) equal number of |SD_i⟩ Slater determinants (corresponding to the rows of the small squares assigned to it) and |SD_f⟩ Slater determinants (corresponding to the columns of the small squares assigned to it) that scale roughly with 1/√N. As the phases needed on one node are already stored therein during the construction of the Hamiltonian, each process works independently from the others and there are no MPI communications at this stage.

Compared to the 1D scheme, each process accesses a much smaller portion of the input and output vectors during the Hamiltonian times vector operations, and as a result MPI communications create a significantly smaller overhead with increasing problem dimensions. Moreover, as will be discussed in more detail below, in 2D partitioning with a hybrid MPI/OpenMP parallelization scheme, a single thread (or a small group of threads) takes care of the MPI data transfers, while all other threads are dedicated to the matrix-vector multiplications. As such, MPI communications can be overlapped with matrix-vector multiplication calculations, providing even greater scalability.

The two main operations in the eigensolver iterations are i) sparse matrix times vector operations, and ii) the orthonormalization of the new JD vector with respect to the previous JD vectors. For the H times vector operation, note that as a result of exploiting the symmetry of the Hamiltonian, each process must perform a regular sparse matrix-vector multiplication (SpMV) with the submatrix it owns, and a second SpMV with the transpose of its submatrix. To perform these operations, one uses row and column communicators, which group processes along each row and column. Hence, each process is part of two groups corresponding to its own row and column, with each group comprising at most n_d processes.

Let us now describe the algorithm of the matrix times vector operation when one applies the Hamiltonian H to a GSM vector of dimension d within the 2D partitioning scheme (a schematic code sketch is given after this description):

• All submatrices (i.e. occupied squares in Fig. 2) are processed in parallel by their owner processes.

• For submatrices on the diagonal, the associated process
(i.e. the process where this submatrix is stored) is the master node of the row and column communication groups. The master node is responsible for distributing the corresponding part of the input GSM vector to all nodes on the same row and column via the row and column communicators. For example, in Fig. 2, nodes 2, 4 and 15 form a row communication group, while nodes 4, 5 and 6 form a column communication group. For both communication groups, node 4 is the master node.

• Before any computation is performed, the master nodes broadcast their part of the input GSM vector to their column communication groups.

• Upon receiving the input vector through the column communicator, each process starts effecting the SpMV for the H submatrix that it owns by utilizing multiple threads. In parallel to the SpMV, one thread takes part in the collective communication of the GSM input vectors, this time through the row communication groups, in preparation for the transpose SpMV operation to follow. This way, the SpMV operation and MPI communications are overlapped.

• The next step is to effect the transpose SpMV operations using the input vectors communicated through the row groups in the preceding step. While the transpose SpMV operations are being effected using multiple threads, the communication thread this time takes part in the reduction of the partial outputs of the SpMV operation above to the master nodes through the row communicators, again overlapping SpMV communications with computations.

• Note that while performing the transpose SpMV, race conditions would occur because input and output indices are exchanged when one leverages the symmetry of H. A block of n_t output vectors, where n_t is the number of threads per node, is used therein to avoid race conditions (see Ref. [10]). These blocks of output vectors are first aggregated locally, and then reduced at the master nodes through the column communicators, thereby completing the distributed-memory SpMV operation.

The cost of MPI communications is then that of 4d/n_d complex numbers to be transferred collectively, with two transfers out of four being overlapped with multiplications. Consequently, this scheme is as efficient as in SM. Moreover, given that the matrices in GSM are relatively denser than in SM (see Sect. 3), inter-node communications are expected to incur less overhead in GSM than in SM.
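A minimal MPI/OpenMP sketch of this distributed symmetric SpMV is given below, under simplifying assumptions: the masters are rank 0 of their communicators, the blocks are stored in COO format, blocking collectives replace the overlapped communication thread, and a critical section replaces the per-thread output blocks of Ref. [10].

```cpp
// Schematic 2D-partitioned symmetric SpMV: broadcast along columns, local
// SpMV, broadcast along rows, transpose SpMV, reductions. Illustrative only.
#include <mpi.h>
#include <complex>
#include <vector>

using cplx = std::complex<double>;

struct SubMatrix {                  // local square block of H (COO, assumed)
    std::vector<int>  row, col;     // block-local row/column indices
    std::vector<cplx> val;          // non-zero values
};

void spmv_2d(const SubMatrix& A, bool onDiagonal,
             std::vector<cplx>& xCol, std::vector<cplx>& xRow,
             std::vector<cplx>& yRow, std::vector<cplx>& yCol,
             MPI_Comm rowComm, MPI_Comm colComm) {
    // Masters (diagonal blocks) are assumed to be rank 0 of their communicators.
    MPI_Bcast(xCol.data(), 2 * (int)xCol.size(), MPI_DOUBLE, 0, colComm);

    #pragma omp parallel for
    for (long k = 0; k < (long)A.val.size(); ++k) {   // y_row += A * x_col
        cplx c = A.val[k] * xCol[A.col[k]];
        #pragma omp critical   // GSM uses n_t per-thread output blocks instead
        yRow[A.row[k]] += c;
    }

    MPI_Bcast(xRow.data(), 2 * (int)xRow.size(), MPI_DOUBLE, 0, rowComm);

    if (!onDiagonal) {         // transpose part, skipped on diagonal blocks
        #pragma omp parallel for
        for (long k = 0; k < (long)A.val.size(); ++k) {  // y_col += A^T * x_row
            cplx c = A.val[k] * xRow[A.row[k]];
            #pragma omp critical
            yCol[A.col[k]] += c;
        }
    }

    // Partial outputs are summed; the GSM code reduces to the masters only
    // and overlaps these transfers with the multiplications.
    MPI_Allreduce(MPI_IN_PLACE, yRow.data(), 2 * (int)yRow.size(),
                  MPI_DOUBLE, MPI_SUM, rowComm);
    MPI_Allreduce(MPI_IN_PLACE, yCol.data(), 2 * (int)yCol.size(),
                  MPI_DOUBLE, MPI_SUM, colComm);
}
```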
5.1.2. GSM vectors storage and orthogonalisation

The JD method requires the storage of tens or hundreds of GSM vectors, which already amount to about 160 MB each for a dimension of 10⁷ (GSM vectors are complex). To reduce the memory overhead and the computational load imbalance, the JD/GSM vectors are distributed among all nodes using a method similar to the 1D hierarchical partitioning scheme first devised in Ref. [10]. In this scheme, at the end of the distributed SpMV, each master node separates its portion of the GSM vector output into (almost) equal parts and scatters them to a few processes. Consequently, the additional operations needed for the vectors, in particular the reorthogonalization with respect to all previously stored JD vectors, do not pose any problem in our implementation.

5.2. Parallelization of Ĵ² in GSM

We saw in Sect. 2.4 that rotational invariance has to be checked through the action of Ĵ² or J_+, and imposed if necessary with the P_J operator of Eq. (13) (see Eqs. (14,15)). However, 2D partitioning would be rather inefficient here due to the block structure of the J_± matrix. Indeed, J_± cannot connect two different configurations, which implies that it is in practice entirely contained in a diagonal square of its matrix, except for side effects. Consequently, 2D partitioning would imply that the diagonal nodes of the J_± matrix would contain virtually all of the J_± matrix, while all the other nodes would be spectators. Thus, we choose to parallelize J_± with a hybrid 1D/2D method, where each column of the J_± matrix is stored on each node. GSM vectors are divided into N parts, similarly to the 2D scheme, so that the output GSM vector must be reduced on each node at the end of the calculation. As each node contains a part of the diagonal squares of the J_± matrix, the J_± matrix memory distribution is well balanced. Symmetry requirements are absent here, as J_± is not symmetric, since it connects Slater determinants of total angular momentum projections M and M ± 1.
As a matter of fact, the main problem here is the MPI data transfer. Indeed, the number of J' angular momenta to suppress (see Eq. (13)) is of the order of 50, and, naively, one would have two MPI transfers of a full GSM vector per J_± application, leading to about 200 MPI transfers of a full GSM vector per P_J application, which is prohibitive. A solution to this problem lies in the block structure of the J_± matrix. Indeed, the MPI transfers to be done therein are only those involving neighboring nodes in the GSM vectors, so that their total volume is that of the non-zero components of a configuration shared between two nodes, and not d, which is tractable. Moreover, P_J applications occur only a few times at most, as it is rare for the J quantum number not to be numerically conserved after an H application, so that it does not significantly slow down the computation of eigenvectors.
6. Numerical Evaluations
In order to test the performance of the new version of the GSM code, we consider the ⁸He and ⁸Be nuclei. Their model space consists of a ⁴He core plus valence nucleons: four valence neutrons for ⁸He, and two valence protons and two valence neutrons for ⁸Be. The core is mimicked by a Woods-Saxon potential, and the interaction used is of the Furutani-Horiuchi-Tamagaki type, which has recently been used for the description of light nuclei in GSM [4]. The model space comprises partial waves bearing ℓ ≤ 3,
where the s_{1/2}, p_{1/2}, p_{3/2} and d_{5/2} states are given by the Berggren basis with a discretization of 21 points for the contours, whereas the d_{3/2}, f_{5/2} and f_{7/2} states are of harmonic oscillator type, as in SM, with 6 states taken into account for each of these partial waves. The model space is truncated so that no more than two nucleons can occupy scattering states and harmonic oscillator states. Proton and neutron spaces are treated symmetrically for ⁸Be. This model is used for computational purposes only, so that energies have not been fitted to their experimental values. The model spaces used nevertheless follow the usual requirements demanded of a physical calculation.

The ⁸He and ⁸Be nuclei are complementary for our numerical study. Indeed, nuclei with a large asymmetry between the number of neutrons and the number of protons are more difficult to treat than those possessing more or less the same number of valence protons and neutrons. The latter situation is typical of SM, so that the on-the-fly method, based on the recalculation of the matrix elements of the Hamiltonian, is usually very effective there. Consequently, this study also tests the ability of the code to efficiently calculate Hamiltonian matrix elements involving only neutrons, which often occurs in drip-line nuclei.

The GSM dimensions for the ⁸He and ⁸Be nuclei are respectively d_{⁸He} = 939,033
and d_{⁸Be} slightly above 3 × 10⁶. The associated numbers of non-zero Hamiltonian matrix elements are about 7 and 4 times larger, respectively, than the typical number of non-zero matrix elements of SM at the same dimensions. Additionally, multiplications of complex numbers are in practice twice as slow as multiplications of real numbers. Hence, even though the aforementioned dimensions would lead to very fast calculations in SM, they take a sufficiently long time in our study.

The parallel calculations effected are all hybrid, as we use the MPI/OpenMP parallelization scheme. The number of nodes is of the form n_d(n_d+1)/2; its possible values range from 15 to 45 MPI ranks for ⁸He, and from 21 to 45 MPI ranks for ⁸Be. Indeed, it was impossible to store the Hamiltonian matrix of ⁸Be with the full storage method on 15 nodes due to the memory limitations of a single node. 8 OpenMP threads per MPI rank are used. All shown calculations were performed at the National Energy Research Scientific Computing Center (NERSC) of Lawrence Berkeley National Laboratory in Berkeley, CA. Code debugging, optimization and testing were partly done using computer clusters available at the Oak Ridge Leadership Computing Facility, as well as at MSU's High Performance Computing Center (HPCC).

Results for the memory storage reduction are shown in Fig. 3, where we show the maximum memory space required for the storage of the Hamiltonian by any one node, as well as the average space required across all nodes. Firstly, it is clear that the average memory distribution scales very well with the number of nodes for both ⁸He and ⁸Be.
Figure 3: Reduction of the storage memory of Hamiltonian matrix elements per node for ⁸Be (left panel) and ⁸He (right panel) as a function of the number of nodes (color online). All data have been divided by the value obtained with the smallest number of nodes (21 for ⁸Be and 15 for ⁸He), taken as a reference point. Full storage results are represented with solid lines, with stars for the maximal memory stored in a node and crosses for the average memory stored in a node. Partial storage results are represented with dashed lines, with squares for the maximal memory stored in a node and circles for the average memory stored in a node. The strong scaling line is also depicted as a solid line, but it cannot be discerned from the average memory stored in a node obtained with the full storage method.

As no data of large size are stored besides the Hamiltonian in the full storage method, it is not surprising that an exact scaling is obtained when one averages over all the nodes used. However, the results issued from the partial storage method slightly depart from perfect scaling. This comes from the necessary additional storage of the two-body matrix elements, which provide the Hamiltonian matrix elements from stored array indices and Hamiltonian phases (see Sect. 3.4). However, by considering the maximal memory stored in a node, one can see that the scaling performance is very different for ⁸He and ⁸Be. While the maximal memory scales very satisfactorily for ⁸He, where about 90% of perfect scaling is attained, that of ⁸Be grows unevenly and its scaling efficiency is about 75%. The slow increase from 28 to 36 nodes indicates a sizable load imbalance in the memory distribution of the Hamiltonian matrix elements among the utilized nodes. One can assume that it comes from a less homogeneous distribution of the non-zero matrix elements in ⁸Be, generated by the presence of both valence protons and neutrons, as ⁸He only bears valence neutrons. This issue is possibly caused by a few large many-body groups ending up in the same row/column group in our relatively simple distribution scheme. It may be fixed by sorting the many-body groups based on their size before doing a round-robin distribution, or by breaking large groups into smaller ones.

Results for the speed-up of the total calculation time are shown in Fig. 4. Similarly to Fig. 3, the total calculation time scales very well for ⁸He, where the storage and on-the-fly methods show typical ratios of 90% and 75% with respect to perfect scaling, respectively, while this is not the case for ⁸Be. While these performances may seem low at first sight for ⁸Be, they follow the uneven pattern of the memory distribution among nodes exhibited by the full storage calculation (see the left panel of Fig. 3). Indeed, the efficiency varies from 60% to 80% for the storage methods when the number of nodes goes from 36 to 45. However, the speed-up of the on-the-fly method stagnates, as it reaches only 65% with 45 nodes, which is significantly smaller than the values obtained with the storage methods.
Figure 4: Speed-up of the total time taken by a matrix-vector operation for ⁸Be (left panel) and ⁸He (right panel) as a function of the number of nodes (color online). All data have been divided by the value obtained with the smallest number of nodes (21 for ⁸Be and 15 for ⁸He), taken as a reference point. Full storage results are represented by a solid line with crosses, partial storage results by dashed lines with squares, and on-the-fly results by a solid line with stars. The strong scaling line is depicted as a solid line.

The absolute calculation times of the on-the-fly method are nevertheless not excessive compared to those of the partial and full storage methods, as it is slower by factors of only about 5 and 10 for ⁸Be, respectively. Consequently, the on-the-fly method is of high interest when the partial and full storage methods cannot be used due to the impossibility of storing the Hamiltonian matrix. Its low scaling properties, however, clearly demand to be investigated in the future.
Figure 5: Ratio of the MPI communication time to the full time spent during a matrix-vector operation for ⁸Be (left panel) and ⁸He (right panel) as a function of the number of nodes (color online). Full storage results are represented by solid lines, with pluses for the average MPI time and crosses for the maximal MPI time in a node. Partial storage results are represented by dashed lines, with stars for the average MPI time and squares for the maximal MPI time in a node. On-the-fly results are represented by solid lines, with circles for the average MPI time and triangles for the maximal MPI time in a node.
Results for the MPI communication times are shown in Fig. 5 as the ratio of the MPI communication time to the full time spent during a matrix-vector operation. As the GSM vectors transferred with MPI are the same in the full, partial and on-the-fly methods, the MPI communication times should be identical if the scaling with the number of nodes were perfect. However, the situation is very different in practice. In fact, one obtains a fairly large value for the ratio of the average and maximal MPI communication times to the total times, of 20%-50% for ⁸He and 30%-70% for ⁸Be (see Fig. 5). Moreover, the obtained values for the average and maximal MPI times are comparable for the full, partial and on-the-fly methods (see Fig. 5). Consequently, the MPI communication times combine two intricate effects: the time taken to do an SpMV, and the uneven distribution of the Hamiltonian matrix elements among nodes. Indeed, the processes do not finish their calculations at the same time. Consequently, they start communicating with each other at different times, and they experience additional MPI delays due to load imbalance effects. Therefore, what we report as "MPI time" must be understood as load imbalance plus MPI time. Further investigation is necessary to understand how to mitigate the load imbalance, as this will surely decrease the MPI communication times.
7. Conclusion
GSM has been parallelized with the most powerful computing method developed for SM, based on a 2D partitioning of the Hamiltonian matrix. It significantly reduces MPI inter-node communications while taking advantage of the symmetry of the Hamiltonian. Time-reversal symmetry has also been included in our approach, so that calculations are twice as fast for even nuclei. As GSM vectors are scattered among all nodes, the memory requirements are very small for the vectors entering the JD method and the reorthogonalization of vectors. Moreover, an effective on-the-fly method has been implemented within the 2D partitioning approach; it has been found to be slower than the initially developed scheme based on full Hamiltonian storage by a factor of about 5-10, which is still reasonable. The 2D partitioning has been tested with the ⁸Be and ⁸He nuclei bearing medium-size dimensions, whose calculations resemble those effected in the context of GSM. They have shown the efficiency of our method, despite a load imbalance issue which will be addressed as part of our future work.

Consequently, the implementation of the 2D partitioning method has expanded the matrix dimensions that GSM can treat, and it is now possible to deal with dimensions of the order of tens or hundreds of millions. These calculations will certainly demand the use of very powerful machines bearing thousands of cores. As the feasibility of these calculations has been demonstrated, it will be possible in the near future to consider very large systems in GSM.
8. Acknowledgments
We thank Dr. Kévin Fossez and Prof. Witek Nazarewicz for useful discussions and comments. N. Michel wishes to thank Prof. Furong Xu for a CUSTIPEN visit at Peking University, as well as Dr. S.M. Wang for letting us use his figure of the Berggren completeness relation. This work was supported by the US Department of Energy, Office of Science, Office of Nuclear Physics under Awards No. DE-SC0013365 (Michigan State University), No. DE-SC0008511 (NUCLEI SciDAC-3 Collaboration), No. DE-SC0018083 (NUCLEI SciDAC-4 Collaboration) and No. DE-SC0009971 (CUSTIPEN: China-U.S. Theory Institute for Physics with Exotic Nuclei); by the National Natural Science Foundation of China under Grant No. 11435014; and in part by Michigan State University through computational resources provided by the Institute for Cyber-Enabled Research. This research used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725, and of the National Energy Research Scientific Computing Center (NERSC), a U.S. Department of Energy Office of Science User Facility operated under Contract No. DE-AC02-05CH11231.

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International license (CC BY-NC-ND 4.0) (https://creativecommons.org/licenses/by-nc-nd/4.0/deed.en_us).
References

[1] N. Michel, W. Nazarewicz, M. Płoszajczak, T. Vertse, Shell model in the complex energy plane, J. Phys. G: Nucl. Part. Phys. 36 (2009) 013101.
[2] K. Fossez, J. Rotureau, N. Michel, Q. Liu, W. Nazarewicz, Single-particle and collective motion in unbound deformed ³⁹Mg, Phys. Rev. C 94 (2016) 054302.