[PDF] A Hybrid Fast Multipole Method for Cosmological N-body Simulations

Abstract

We investigate a hybrid numerical algorithm aimed at the large-scale cosmological N-body simulation for the on-going and the future high precious sky surveys. It makes use of a truncated Fast Multiple Method (FMM) for short-range gravity, incorporating with a Particle Mesh (PM) method for long-range potential, which is applied to deal with extremely large particle number. In this work, we present a specific strategy to modify a conventional FMM by a Gaussian shaped factor and provide quantitative expressions for the interaction kernels between multipole expansions. Moreover, a proper multipole acceptance criteria for the hybrid method is introduced to solve potential precision loss induced by the truncation. Such procedures reduce the mount of computation than an original FMM and decouple the global communication. A simplified version of code is introduced to verify the hybrid algorithm, accuracy and parallel implementation.

Full PDF

aa r X i v : . [ phy s i c s . c o m p - ph ] J un A Hybrid Fast Multipole Method for Cosmological N -bodySimulations Qiao Wang

Key Laboratory for Computational Astrophysics, National Astronomical Observatories, ChineseAcademy of Sciences, Beijing 100101, ChinaSchool of Astronomy and Space Science, University of Chinese Academy of Sciences, Beijing100049, China

Received 0000 xxxx 00; accepted 0000 xxxx 00

Abstract

We investigate a hybrid numerical algorithm aimed at the large-scale cosmo-logical N -body simulation for the on-going and the future high precious sky surveys. Itmakes use of a truncated Fast Multiple Method (FMM) for short-range gravity, incorpo-rating with a Particle Mesh (PM) method for long-range potential, which is applied todeal with extremely large particle number. In this work, we present a speciﬁc strategy tomodify a conventional FMM by a Gaussian shaped factor and provide quantitative ex-pressions for the interaction kernels between multipole expansions. Moreover, a propermultipole acceptance criteria for the hybrid method is introduced to solve potential pre-cision loss induced by the truncation. Such procedures reduce the mount of computationthan an original FMM and decouple the global communication. A simpliﬁed version ofcode is introduced to verify the hybrid algorithm, accuracy and parallel implementation. Key words: methods: numerical cosmology: theory - large-scale structure of universe

In the early Universe, extremely hot and dense baryons and photons are strongly interactingprior to the recombination, The relic of the ﬂuctuation of photons is imprinted on the CosmicMicrowave Background Radiation which can be detected at the radio band, as the Universe isexpanding(Planck Collaboration 2018). Meanwhile the initial ﬂuctuation of the mass is growing updue to the gravitational collapse to form planets, galaxies, cluster haloes and large scale structure ofthe universe at present(Peebles & Yu 1970). Several primary cosmic probes can extract the informationlocked in the mass distribution, such as mass function of cluster counting, the sound horizon of BaryonicAcoustic Oscillation as a standard ruler, power spectrum of cosmic gravitational lensing, etc(Kaiser1987; Eisenstein & et al. 2005). The evidence combined with various data from sky surveys(Rozo et al.2010; Abbott et al. 2018) supports a modern picture of cosmology with two mysterious components,dark matter and dark energy, whose natures are still puzzles of standard physics(DES Collaboration2019).

Q. Wang

The next generation sky surveys will reveal the dark side of the Universe by distant galaxies andquasars as cosmic probe tracers, including DESI , EUCILD , LSST . One crucial step is to gener-ate a simulated catalogue of those tracers. However, it is not trivial to produce the related observablefor galaxy formation involves complicated astrophysical processes and nonlinear evolution(Cole et al.2000; Berlind & Weinberg 2002; Kitaura et al. 2016; Guo et al. 2011). The mock tracers, such as theemission line galaxies resided at the smaller structures and earlier epoch, require underlying cosmo-logical simulations with unprecedented resolution and volume to understand their mask effects, cosmicvariance, redshift uncertainties, and etc for the next generation surveys. Various techniques are devel-oped for large-scale simulations carried out on top supercomputers with trillion particles.The movement of particles in a expanding background can be modelled by equivalent Newtoniangravity with a periodic boundary conditions. One natural solver of this model is referred as to Particle-mesh (PM) method(Hockney & Eastwood 1988) by convolution of Green function of gravity. The con-volution has cost of O ( N log N ) , calling Fast Fourier Transformation (FFT) by two times. However,PM can only deal with the scale above the computing grid so that it needs a compensated sub-grid gravitysolver, such as a truncated Particle-Particle (PP) direct summation. The pioneer cosmological simula-tion based on PP+PM method, so called P M, identiﬁed the structure of the cosmic web(Efstathiou et al.1985; Jing & Suto 2002). Once the system has apparently condensed, it will fail to reduce to O ( N ) due to the domination of PP interactions.Another solver with cost of O ( N log N ) is introduced by Barnes & Hut (1986). They make useof octal tree to organize the particles. The detail of source cells could be negligible, since it attracts aparticle well-separated like the gravity of a mass points. Thus, any particle just concerns on the cellsor particles “near enough”, determined by a opening angle to control the precision of the acceleration.Warren (2013) modiﬁed a tree code (2HOT) to run on Graphics Processing Unit (GPU) and B´edorf et al.(2014) optimized Bonsai (Portegies Zwart et al. 2013) to achieve performance of 24.77 Pﬂops for theMilky Way simulation.Tree code is less sensitive to the particle clustering than a PM code but PM method is stable andrapid for a regular and periodic mesh. A hybrid TreePM method can combine the merits from both ofmethods(Bagla 2002; Bagla & Ray 2003). The Millennium simulations(Springel et al. 2005) is carriedout with over particles by the parallel TreePM code of Gadget-2(Springel 2005) over ten yearsago. The idea of TreePM is also effective on heterogeneous platform, such as HACC (Habib et al. 2016,2013). Recently, Ishiyama et al. (2012) run a trillion particles cosmological simulation on K computer.To face the challenge of high precision, the scale of simulation is increasing as the capability ofthe supercomputers. That requires faster algorithm to deal with a unprecedented particle number. FastMultipole Method (FMM) has a nearly linear computational complexity of O ( N ) . It is originally in-troduced by Greengard & Rokhlin (1987). Cheng et al. (1999) extends it into three dimension. Similarwith Tree method, FMM also builds a tree but computes the interactions between cells. The early FMMworks on spherical polar coordinate. It is also successful to be accelerated on GPU devices, such asExaFMM (Gumerov & Duraiswami 2008; Yokota & Barba 2011). Its precision is controlled by theorder of expansions. Moreover, an implementation in a Cartesian coordinate may be more suitable toastrophysical simulation with a acceptance criteria based on opening angle, instead of “children of par-ent’s brother”(Dehnen 2000, 2002). For some mass dependent criteria, its complexity can approach O ( N . ) by using a dual tree traversal. Recently, Potter et al. (2017) reports that the application of cos-mological simulation is able to involve 2 trillion particles by software PKDGRAV-3 on supercomputer P iz Daint .This paper is organized as follows. In section 2, we brieﬂy introduce the fundamental approachon FMM then we investigate the combination of FMM and PM method, two current methods, for thecosmological simulation. The detail of our algorithm will be discussed in section 2.2. A corresponding Hybrid Fast Multipole Method for Cosmological N -body Simulations 3 Multipole Acceptance Criteria (MAC) is taken into account in section 2.3. We measure the reduction ofkernel interactions via the hybrid method than a conventional one in section 3. In section 4, we introducea parallel implementation to verify the algorithm. Finally, we summarize in the last section.

In FMM, all particles are organized into a tree and the ﬁnest tree cells (or tree nodes) are always aseries of particle packs, which are also referred as to leaves. In Fig. 1, we employ an ORB (OrthogonalRecursive Bisection) tree and set a maximum limiter for particle number in leaves. The particles includ-ing parent cell are almost equally divided into two offspring cells down to leaves.

Fig. 1

Schematic diagram for FMM. The gravitational potential at certain point (red pointwith black circle) in Region A is generated by a group of particles in region B. The multipoleexpansion is computed by P2M in Region A. Then the information is successively propagatedvia interaction kernels, M2M, M2L, L2L and L2P. The gravity from the nearby particles canbe directly accumulated by P2P. The blue dashed circle denotes the cutoff radius.The gravitational potential at point in region B induced by the particles of region A can be estimatedvia a series of interaction kernels between the multipole coefﬁcients. We follow the mathematical nota-tions in Dehnen (2014) and summarize the relevant equations of FMM method in Cartesian coordinates.The multipole coefﬁcients M of region A is computed by the particles M m ( z A ) = X a ∈ A µ a ( − m m ! ( x a − z A ) m , (1) Q. Wang where z A is the geometric (or mass) center of region A , µ a is mass of the particle labeled by a at theposition of x a and the integer vector n denotes ( n x , n y , n z ) . One can shift a multipole expansion form z to z + x by the summation of M m ( z + x ) = m X n =0 x n n ! M m − n ( z ) . (2)Given a Green function of ψ ( z b − z a ) , the coefﬁcients of local expansion L of potential Ψ at the centerof region B of z B is determined by the M at z a by equation L n ( z B ) = p −| n | X | m | =0 M m ( z A ) D n + m ( z B − z A ) , (3)where D n ≡ ∇ n ψ is a t raceless operator. For Newtonian gravity or static electricity force, it can beexpressed by a traceless displacement tensor ¯ r n multiplied by the prefactors of ˜ f ( n ) ( r ) = ( − n (2 n − r n +1 . (4)Similarly, the multipole L can be shifted by L n ( z + x ) = p −| n | X | m | =0 x m m ! L m + n ( z ) , (5)Thus, the potential at x b is approximated by Ψ( x b ) = p X | n | =0 n ! L n ( z B )( x b − z B ) n . (6)For short, The multipole expansion for the source is labeled by ’multipole’ and the expansion ofat sink (or target) place is labeled by ’local’. Then the abbreviations of the kernels are as follows:particle-to-particle is referred as to P2P, particle-to-multipole is P2M, multipole-to-multipole is M2M,multipole-to-local is M2L, local-to-local is L2L, and local-to-particle is L2P, respectively. the multipoleand local expansion coefﬁcients are included in all tree cells.The above equations can be utilized to transmit the information of gravity from a particle groupto another. The gravity of a certain particle in the purple sink (targeted) leaf induced by the orangesource leaf can be computed through a series of successive kernels. The multipole coefﬁcients M ofa (orange) leaf are determined by the discrete mass points in the cell (by using Eq. 1), which describemass distribution in source cells or leaves. Then the multipole expansion coefﬁcients in a parent cell arecomputed by its offspring cells recursively (using Eq. 2). This procedure is also referred as to upwardpass.The kernel M2L computes the local expansion coefﬁcients of gravitation potential from the multi-pole of source cells by Eq. 3. When two cells are well separated each other, the M2L kernel is called tocompute the the interactions from yellow to green cells in this illustration. The local expansion coefﬁ-cients are passed level-by-level downward (using Eq. 5) till a leaf is met. The gravity and potential ofthe targeted particles (red point with black circle) is calculated by the local expansion about the centerof purple leaf, using Eq. 6.The nearby particles surrounding the target are too close to be dealt with by above kernels andthey must be directly accumulated their interaction. Since a direct N-body summation is O ( N ) , it canbe quite time consuming. The procedure is referred as to P2P, which actually can be considered asinteractions between particle packs in our implementation as well. Hybrid Fast Multipole Method for Cosmological N -body Simulations 5 Analogue to TreePM, the essence of Particle-Mesh and Fast Multipole Method (PM-FMM) is also tosplit gravity into two parts by scale. A truncated short-range part of gravity is computed by FMM anda smoothed long-range part is by PM method. A combination of two parts must be equivalent with anoriginal inverse-square law at the split scale, by ﬁne tuning its splitting function. Bagla (2002) suggestsa Gaussian form as transition function for TreePM and we generalize it to the PM-FMM in this section.Speciﬁcally, the convolution for PM with a grid size of ∆ g needs an Gaussian function exp( − k / k s ) / √ π as a ﬁlter to suppress the Green function of Poisson equation, where k s is thewave number of split scale of r s ∼ . g . Fig. 2

Reduction effect. All prefactors are truncated at the cutoff radius ∼ r s .Correspondingly, the short range gravity must be properly modiﬁed to compensate the underesti-mation from PM method. As a result, the potential r − is modiﬁed by error function as φ ( r ) = f (0) ( r ) = 1 r erfc (cid:18) r r s (cid:19) . (7)and an inverse-square law is modiﬁed by g( r ) = f (1) r = − r r (cid:20) erfc (cid:18) r r s (cid:19) + rr s √ π exp (cid:18) − r r s (cid:19)(cid:21) . (8)Moreover, the operator D n = ∇ n [ erfc ( | r | /r s ) φ N ] for a truncated potential is employed for computationof FMM, where φ N is Newtonian potential. Q. Wang

For the higher orders, the M2L kernel in truncated algorithm can be implemented by minimumsubstitutions of the original prefactor ˜ f ( n ) (Eq. 4) by the following ones without tildes: f (2) ( r ) = 3 r erfc (cid:18) r r s (cid:19) + 1 √ π exp (cid:18) − r r s (cid:19) × (cid:20) r s r + 12 r s r (cid:21) ,f (3) ( r ) = − r erfc (cid:18) r r s (cid:19) − √ π exp (cid:18) − r r s (cid:19) × (cid:20) r s r + 52 r s r + 14 r s r (cid:21) ,f (4) ( r ) = 105 r erfc (cid:18) r r s (cid:19) + 1 √ π exp (cid:18) − r r s (cid:19) × (cid:20) r s r + 352 r s r + 74 r s r + 18 r s r (cid:21) . Fig. 2 demonstrates the comparison of two kinds of prefectors. It is apparent that all interaction aretruncated at the radius of × r s , a cutoff radius R cutoﬀ , so that the gravity of short range is negligiblebeyond that scale. For P2P kernel, only f (1) is needed.One can compute the prefactor of any order p by equation of ( − p r p +1 s f ( p ) ( x ) = (2 p − x p +1 erfc (cid:16) x (cid:17) + p X q =1 q − p (2 p − p − q + 1)!! e − x / √ πx q , (9)where x ≡ r/r s . Or a more effective approach to calculate all of prefactors in successive orders via arecurrence form of − r s x f ( p +1) ( x ) = (2 p − f ( p ) ( x ) + e − x / p − √ πr p − s . (10)The hybrid algorithm is illustrated in Fig. 1. The dash circle denotes a cutoff radius. The contributionfrom FMM is localized within the cutoff radius. A dual tree traversal method can complete all of kernel computation by once tree walking. For localtree, the traversal begins with root-root pair but it can begin with any pair of cells. If two cells are ”wellseparated”, a M2L kernel will be employed to compute the local multipole of sink cell induced by thesource one. Otherwise one of two cells, usually a larger one, needs be opened. The traversal recursivelysucceeds in the opened cell until walking at end of a tree. The interaction between two ”nearby” leavesmust use P2P kernel.Therefore the deﬁnition of “well separated” will be inﬂuenced by Multipole Acceptance criteria(MAC) . One can deﬁne the geometric relation by the parameter of opening angle θ = L/S , where L is the length of cell and S is the separation between two cells. It is still various to choice of those lengths.In PKDGRAV3, Potter et al. (2017) makes use of the opening radius RO = b max /θ as a criteria, wherethe b max is from the mass center of the cell. Dehnen (2014) introduces to a mass-dependent acceptancecriteria, which suppress the cost of O ( N . ) less than a linear complexity. To demonstrate the methodin this work, we choose the maximum side length of source cell as L , the separation S c is the distance Hybrid Fast Multipole Method for Cosmological N -body Simulations 7 Fig. 3

Multipole Acceptance Criteria. L A is side length of sink (targeted) tree cell and L B isthe source one. The red arrow is the separation S c between the centers of two nodes and S m is minimum distance between boundaries of tree cells. Two boxes are still physically relevantdespite S c beyond the cutoff radius.between the geometric center of two cells and the S m is minimum distance between two cells (seeFig. 3).Besides opening angle, our acceptance criteria needs be adjusted for effects caused by truncationand it is summarized into three items: – All of interactions between cell-cell, cell-body and body-body are neglected when the minimumseparation S m is larger than the cutoff radius; – Call M2L kernel to compute local expansion coefﬁcients while opening angle θ ≤ θ MAC .Otherwise, open the larger one of two cells of interest; – Enforce to open the cells if the separation is at the range of the transition region, even the 2ndcondition is already met, where transition region is deﬁned by S c > R cutoﬀ > S m .Our opening angle is deﬁned as θ ≡ L B /S c , where the L B is the maximum side length of source cell.It is apparent that neighbor cells must be always opened according to the 2nd item.The ﬁrst item causes an essential improvement of the hybrid method. The additional 3rd item is dueto truncated design. Fig. 2 shows that the term r − is dominated beyond the truncated scale but the higherorder multipoles still contribute the expansion coefﬁcients of local gravitational ﬁeld in the traditionalFMM. However in our method, the multipoles in any order, here from f to f , are suppressed withinthe cutoff radius. Therefore no information can be propagated out the cutoff barrier no matter how manyorders are considered. The error of short-range gravity also fails to be suppressed by a stricter θ MAC .Because the shadow region in Fig. 3 is always ignored and lost due to the modiﬁcation of prefactors,despite contributes the short-range gravity yet. The 3rd item is motivated to ﬁx it and guarantee thehigher multipole can inﬂuence the cells beyond the scale of cutoff radius.Fig. 4 shows the relative error by the new MAC with the orders up to hexadecapole. The solid curvedenotes the root-mean-square (rms) of relative error and the dash-dotted curve denotes the envelop ofmaximum error. The error of gravity in our criteria with the third item can be well controlled. It dropsas the opening angle decreases. But the relation of error to opening angle changes. It breaks the powerlaw of error with θ p , say θ in hexadecapole, presented in Dehnen (2014).The relation between accuracy and efﬁciency depends on the choice of control parameters. Butit is difﬁcult to derive the optimized parameters from theoretical analysis. The order of FMM can be Q. Wang

Fig. 4

Relation of error to opening angle. Relative error of gravity or acceleration is calculatedvia truncated FMM upto hexadecapole. The solid curve denotes the rms error and dash-dottedcurve denotes the maximum error of all particles.independently determined, which is constrained by the machine memory. The memory of cells consumesdouble for every order of FMM adds. Some applications without memory limit employ the FMM upto10th order with a huge equivalent opening angle. We practically set the order of FMM upto octupole orhexadecapole, since the cosmological simulations are usually pressed for memory. Correspondingly, itkeeps sufﬁcient statistics to set the opening angle from 0.3 to 0.4 for regular cosmological simulations.One can decrease the opening angle to improve accuracy and cause a sharp enlargement of the amountof computation. Therefore the parameters must be ﬁne adjusted in accordance with computing power,environments and accuracy tolerance.

In the previous section, we describe an algorithm for decoupling long-range and short-range gravity witha proper modiﬁcation of MAC. Correspondingly, we measure and present that the quantitative reductionof computation amount in this section. There are two extreme density distributions as cosmologicalevolution. In high-redshift epoch, the density contrast is tiny. the computing box is almost uniformlyﬁlled with mass particles. As the evolution of the universe, the initial seed of structure grows up andhighly clustering structure are formed, including ﬁlaments and the haloes. Most of particles are fallinginto condensed regions.The dashed curves correspond to the conventional case and the solid curves are in this work. Arrowsdenote the reductions of the amount. the upper curves are for the P2P kernel and lower are M2L. Thereis apparent improvement, since the mean separation of most cells are beyond the cutoff radius so thatthe computation of M2L in original FMM vanishes for the uniformed distribution, whose cost is closeto P M method in the uniformed case.Fig. 5 demonstrates the analyses of the uniformed case. Because the most time-consuming kernelsof M2L and P2P are directly inﬂuenced by the truncated MAC. The run duration depends on the imple-

Hybrid Fast Multipole Method for Cosmological N -body Simulations 9 Fig. 5

Uniformed density distribution. The dotted curves denote the counts of kernel P2P andM2L for a conventional FMM and the solid curves denote truncated FMM in this work. Eachleaf (particle pack) only holds one particle in left panel; Each leaf holds 32 particles at mostin right panel. Therefore the length of tree cells is shorter than the left panel.mentation but the count of interaction does not. It only depends on the choice of MAC. In the left panelof Fig. 5, each leaf (particle pack) only holds a particle at most, and 32 particles in the right panel. Thecount of interaction kernels is determined the depth of tree and spatial conﬁguration. The tree containsless cells in the right panel than right one in the same particle distribution.

Fig. 6

Highly clustering density distribution, which is the most condensed case from a

Lambda

CDM cosmological simulation run at redshift z = 0 . The dotted curves denote theinteraction counts of kernel P2P and M2L for a conventional FMM and the solid curves de-note truncated FMM in the work. Each leaf only contains one particle in left panel; Each leafcontains 32 particles at most in right panel.Fig. 6 demonstrates the similar analyses for the cosmic structure distribution of the universe atpresent (redshift z = 0 ). The notation is the same with Fig. 5. As expected, the total count is larger than the uniformed case, because more particles are constricted into the smaller cells so that more cells areopen by the acceptance criteria. The mean separation of cells shrinks below the cutoff radius. but thehybrid method still works to reduce the kernel count.Compared with original FMM, kernel interactions of M2L and P2P are reduced in our truncatedone. However there exists an additional PM in our method. Theoretically, PM has cost of O ( N log N ) ,which is worse than FMM. But a PM method can usually be more effectively implemented to savethe overall duration. On the contrast, the original FMM makes use of Ewald summation to deal with aperiodic boundary condition(Gumerov & Duraiswami 2014). It is not needed in this hybrid method. This PM-FMM method is employed by the cosmological N-body simulations code photo N s-2 , which isdesigned for the massively parallel cosmological simulations . Its ﬁrst version(Wang et al. 2018) adoptsa parallel Tree-PM method and the interactions between particles and tree cells are arranged into a taskpool, which is suitable for the optimization, especially on the heterogeneous platform(Makino 2004;Hamada et al. 2009; Wang et al. 2018; Iwasawa et al. 2018). This second version updates short-rangegravitational solver by truncated FMM described in the previous sections 2, instead of the tree method.In the new version, the domain decomposition returns to an ORB tree across the computing nodes,such as computing sockets or processors with a shared memory. And all of domains and their uppernodes construct a top level tree that needs be stored in all computing nodes. Thus the domain cellis a ﬁnest cell in the top level tree but a root cell for local essential tree (LET). The particles are alsoorganized into a k -d tree in every computing domain so that a distributed global ORB tree is constructed.Fig. 7 illustrates a domain decomposition for a almost uniformed particle distribution. As an exampleof 7 processes, the ORB tree ﬁrstly distribute the particles along x -axe by the fraction of 4:3 in orderto balance of particle number, then 4 processes take charge of the left 4/7 volume of box and other 3processed do the right 3/7, respectively. Fig. 7

The domain decomposition and top level tree.The upward pass of P2M and M2M ﬁrstly run on the local tree. When all local upward passes arecomplete, the parallel M2M, M2L and L2L are made for the top level tree. But the inter-domain M2Land P2P still requires the information across the other domains. In photo N s-2 , we actively send localtree cells and leaves to the domain may needed it. Using our MAC, a traversal with respect to the closestboundary of target domain will ﬁnd all of potential cells and leaves involved for the target domain. For a simpliﬁed parallel MPI+openMP version can be download at https://github.com/nullike/photoNs-2.0 to testify the hybridalgorithm. Hybrid Fast Multipole Method for Cosmological N -body Simulations 11 Fig. 8

The local tree. The tree cells (0-4) and the particle packs (7-9) needs to send to thetarget domain (Domain 9) in this case.instance, there are two local trees which are mounted on domain 9 and 10, respectively in Fig. 8. Thecell 2 may be accepted by MAC with respect to ”right side” boundary of domain 9 so that nodes 5 and6 can be safely ignored (see Fig. 8). So are leaves from 11 to 14. The separation between leaf 10 andboundary may be larger than the cutoff radius so that leaf 10 is also ignored. Then a segment of subtreeincluding all necessary information is arranged from the local tree of domain 10.Extra memory to send and receive must be allocated to exchange the segment of cells and leavesfor the communications. After those parallel operations, local downward pass of L2L and L2P can beexecuted.In this version, M2L and P2P operators are most time-consuming two kernels for gravity calculation.For the criteria of “children of parent’s brother”, the operation number of M2L and P2P only dependson the length of the tree. But for the MAC in our implementation, it also depends on the clustering ofparticles. We also make use of a double-buffering task pool to improve the concurrency, for P2P andM2L. As the mass particles collapse into the potential well, the density contrast in the simulation box iswildly different from place to place. Therefore we estimate the work load for single domain by countingthe total number of M2L and P2P operations to determine how to redistribute the particles in the nexttime step. It is similar with the strategy of the code GreeM(Ishiyama et al. 2009; Ishiyama et al. 2012).Such a feedback strategy usually can control the workload imbalance within .Practically, this hybrid method is designed for the massively parallel on supercomputers with over computing nodes or sockets. Its PM method needs to call FFT subroutine by two times at everysingle synchronized time step. A conventional FFT library, such as FFTW(Frigo & Johnson 2005), de-compose a mesh into a series of slices along a certain direction. It fails if the number of processes islarger than the side number of mesh. But it exactly happens in a cosmological simulation. In this version,we employ a Fortran library with a pencil decomposition, 2DECOMP&FFT (Li & Laizet 2010), for thePM. As a test, a simulation with ∼ grid is carried out by over 20,000 processes. In this paper, we investigate a hybrid method for the massive application of Cosmological simulations.At the epoch of precious cosmology, FMM with complexity O ( N ) is key method to run high resolutionsimulations on the supercomputers and a traditional PM method still contributes on decoupling thegravity and dealing with the periodic boundary condition. The hybrid algorithm of FMM with PMkeeps the beneﬁt of gravity splitting and decrease the amount of computations. Speciﬁcally, we modiﬁed the operators of the truncated FMM for the short-range gravity and pro-vide a general form to compute the prefactor of multipoles. We focus on a Gaussian-type truncation.Because its sharp splitting is proven by the TreePM method. In principle, one can choose another trun-cated function, instead of the exponential form. A polynomial function can be calculated more efﬁcientlyfor the numerical mathematical library than an exponential one. The modiﬁcations of their prefactors canbe generated by a similar procedure. The method in this work is different from the Particle mesh multi-pole method (PMMM), which calls ( p + 1) FFT to directly compute the multipole coefﬁcients(Nitadori2014) or Fourier transform on multipoles (FFTM) method(Ong et al. 2004). We do not use FourierTransformation to calculate the coefﬁcients of multipole expansion but gravitational potential.Moreover, multipole acceptance criteria needs be modiﬁed for two additional conditions. One is fortruncation of long-range interaction and the other is for controlling the accuracy. The count of kernelinteractions is determined by the detail of implementation of FMM and MAC. A conventional FMMhas a linear stability and Dehnen (2002) reports a better stability by using a mass dependent MAC. Inthis work, the hybrid method we demonstrate costs more than O ( N ) but less than O ( N log N ) . Thereduction of kernel computation is due to the decoupling the long-range force so that such a hybridmethod can robustly work for other kinds of traversal and tree construction as well.Finally, the current and next generation supercomputers provide a powerful numerical platformto run the massive simulations with unprecedented resolutions and simulation boxes, which usuallyare composed of tens of thousands of computing nodes and various heterogeneous accelerators andmany-core architectures. Besides the pressure of memory, I/O band and storage of snapshot, it requiresappropriate algorithms designed for massive concurrency to face the challenge software scalability andcomputing performance, especially, on the heterogeneous devices. The N-Body applications exchangethe enormous number of particles among processes so that the communication strategy becomes anessential issue. The other trend is to employ more efﬁcient method, such as O ( N ) , to deal with theextreme amount of force computation. The method we proposed provides an option to calculate theforce efﬁciently and decouple the global communication in the meantime. The eventual performanceof applications depends on the algorithm and the implementation of programming details. Here, werelease a fundamental version of code to verify the precision and validity of the hybrid algorithm, whichis expected to be optimized on high performance computers. Acknowledgements

We gratefully thank the anonymous referee for the helpful comments. We ac-knowledge the support from the National Key Program for Science and Technology Research andDevelopment (2017YFB0203300) and the Strategic Priority Research Program of Chinese Academyof Sciences, Grant No. XDC01040100.

References

Abbott, T. M. C., Abdalla, F. B., Annis, J., & et al. 2018, MNRAS, 480, 3879Bagla, J. S. 2002, Journal of Astrophysics and Astronomy, 23, 185Bagla, J. S., & Ray, S. 2003, New Astron., 8, 665Barnes, J., & Hut, P. 1986, Nature, 324, 446B´edorf, J., Gaburov, E., Fujii, M. S., et al. 2014, in Proceedings of the International Conference forHigh Performance Computing, Networking, Storage and Analysis, SC ’14 (Piscataway, NJ, USA:IEEE Press), 54Berlind, A. A., & Weinberg, D. H. 2002, ApJ, 575, 587Cheng, H., Greengard, L., & Rokhlin, V. 1999, Journal of Computational Physics, 155, 468Cole, S., Lacey, C. G., Baugh, C. M., & Frenk, C. S. 2000, MNRAS, 319, 168Dehnen, W. 2000, The Astrophysical Journal Letters, 536, L39Dehnen, W. 2002, Journal of Computational Physics, 179, 27Dehnen, W. 2014, Computational Astrophysics and Cosmology, 1, 1DES Collaboration. 2019, Phys. Rev. Lett., 122, 171301Efstathiou, G., Davis, M., White, S. D. M., & Frenk, C. S. 1985, ApJS, 57, 241

Hybrid Fast Multipole Method for Cosmological N -body Simulations 13-body Simulations 13