PENTACLE: Parallelized Particle-Particle Particle-Tree Code for Planet Formation
Masaki Iwasawa, Shoichi Oshino, Michiko S. Fujii, Yasunori Hori
Publ. Astron. Soc. Japan (2014) 00(0), 1–12  doi: 10.1093/pasj/xxx000

Masaki Iwasawa¹*, Shoichi Oshino², Michiko S. Fujii³, and Yasunori Hori⁴,⁵

¹ Advanced Institute for Computational Science, Minatojima-minamimachi, Chuo-ku, Kobe, Hyogo 650-0047, Japan
² Centre for Computational Astrophysics, National Astronomical Observatory of Japan, 2-21-1 Osawa, Mitaka, Tokyo 181-8588, Japan
³ Department of Astronomy, Graduate School of Science, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-0033, Japan
⁴ Astrobiology Centre, National Institutes of Natural Sciences, 2-21-1 Osawa, Mitaka, Tokyo 181-8588, Japan
⁵ Exoplanet Detection Project, National Astronomical Observatory of Japan, 2-21-1 Osawa, Mitaka, Tokyo 181-8588, Japan

* E-mail: [email protected]
Received ⟨reception date⟩; Accepted ⟨acception date⟩
Abstract
We have newly developed a Parallelized Particle-Particle Particle-tree code for Planet formation, PENTACLE, which is a parallelized hybrid N-body integrator executed on a CPU-based (super)computer. PENTACLE uses a 4th-order Hermite algorithm to calculate gravitational interactions between particles within a cutoff radius and a Barnes-Hut tree method for gravity from particles beyond it. It also implements an open-source library designed for full automatic parallelization of particle simulations, FDPS (Framework for Developing Particle Simulator), to parallelize a Barnes-Hut tree algorithm on a memory-distributed supercomputer. These allow us to handle 1–10 million particles in a high-resolution N-body simulation on CPU clusters for collisional dynamics, including physical collisions, in a planetesimal disc. In this paper, we show the performance and the accuracy of PENTACLE in terms of R̃_cut and a time-step ∆t. It turns out that the accuracy of a hybrid N-body simulation is controlled through ∆t/R̃_cut, and ∆t/R̃_cut ∼ 0.1 is necessary to simulate accurately the accretion process of a planet for ≥ 10⁶ years. For all those who are interested in large-scale particle simulations, PENTACLE, customized for planet formation, will be freely available from https://github.com/PENTACLE-Team/PENTACLE under the MIT license.
Key words:
Methods: numerical – Planets and satellites: formation

1 Introduction

Planet formation proceeds via collisions and accumulation of planetesimals. Planetesimals that are larger than their neighbours can be predators because of their stronger gravitational focusing. During this early phase, larger planetesimals overwhelm smaller ones, growing rapidly in an exponential fashion (Wetherill & Stewart 1989). However, the growth of a large planetesimal, the so-called planetary embryo, slows down because it increases the random motions of the small planetesimals around itself by gravitational scattering (Ida & Makino 1993). At this stage, planetary embryos grow oligarchically to similar sizes, and their orbital separations become ∼10 Hill radii as a result of orbital repulsion. This series of pathways toward planet formation is commonly known as runaway growth and oligarchic growth (Kokubo & Ida 1998).

Collisional dynamics in a swarm of planetesimals is controlled by their mutual gravity. Planetesimal accretion is a non-linear dynamical process driven by multi-body interactions. An N-body simulation is an effective means of examining the dynamical behaviours of particles that gravitationally interact with each other. Gravitational force has an infinite range, whereas gravitational interactions between particles undergoing close encounters regulate the length of the time-step used to integrate the dynamical evolution of a collisional system. As a result, a direct N-body simulation of planet formation requires O(N²) force evaluations per short time-step.

Due to the high computational cost of direct N-body simulations, we can handle at most ∼10⁴–10⁵ particles in simulations of terrestrial planet formation, as shown in Fig. 1. The size of the equal-mass planetesimals initially assumed in such N-body simulations is typically several hundred kilometres in radius, similar to that of 1 Ceres and 4 Vesta. The observed population of asteroids in the main asteroid belt suggests that the smaller asteroids are, the more abundant they are (e.g. see Fig. 1 in Bottke et al. 2005). Although the present-day size distribution of asteroids reflects the combination of a long-term evolution of collisional processes and gravitational perturbations from the planets, there were likely numerous small bodies at the early stage of planet formation. In order to describe the dynamical behaviour of small bodies in an N-body simulation, we need to improve the mass resolution of the particles. A classical way is to introduce a statistical method that describes the accumulation of planetesimals in a sea of small bodies, following a collisional probability; for instance, statistical simulations (e.g. Weidenschilling et al. 1997; Inaba et al. 2001; Kobayashi et al. 2011), hybrid N-body simulations with a statistical method for small bodies (e.g. Bromley & Kenyon 2006; Chambers 2008), and super-particle approximations (Levison et al. 2012; Morishima 2015).

Another tractable approach is to increase the number of particles (N) in an N-body simulation. There have been several attempts to accelerate and optimize the computation of gravitational interactions between particles. Specialized hardware such as HARP and GRAPE (e.g. Sugimoto et al. 1990; Makino et al. 1993; Makino et al. 2003) was developed to accelerate the calculation of gravitational forces, dramatically increasing the number of particles used in direct N-body simulations. Recently, Graphics Processing Units (GPUs) have been introduced as an alternative accelerator (e.g. Wang et al. 2015; Bédorf et al. 2015). However, the upper limit of N was still several tens of thousands due to the small time-steps required for close encounters and the relatively long integration time (more than one million orbital periods at 1 au), which results in a large number of steps.

In contrast, tree methods (Barnes & Hut 1986) can reduce the computational cost of the gravity part to O(N log N). PKDGRAV (Stadel 2001), optimized for parallel computers, adopts a tree method with variable time-steps. Using PKDGRAV, Richardson et al. (2000) simulated the dynamical evolution of a million planetesimals for only hundreds of dynamical times. Tree codes, the family of particle-mesh schemes such as the P³M scheme (Hockney & Eastwood 1981), and combinations of the PM and tree methods (Xu 1995; Bagla 2002; Dubinski et al. 2004; Springel 2005; Yoshikawa & Fukushige 2005; Ishiyama et al. 2009) can treat extremely large numbers of particles, and therefore they are used for cosmological simulations. While dark matter can be considered a collisionless system, planetesimal-planetesimal interactions are collisional. If cosmological N-body codes are applied to planetary accretion, the small time-steps for close encounters and the large number of steps become a bottleneck.

A mixed-variable symplectic (MVS) integrator, in which the Hamiltonian is split into two parts that are integrated separately, is another way to manage the calculation cost (Wisdom & Holman 1991; Kinoshita et al. 1991). In an MVS integrator, gravitational interactions caused by close encounters are integrated using a higher-order scheme with smaller time-steps, while the others are calculated using a Kepler solver or a fast scheme such as a tree method. SyMBA (Duncan et al. 1998), Mercury (Chambers 1999), and the P³T method (Oshino et al. 2011) incorporate this approach, albeit the way the Hamiltonian is split differs among them. Combinations of these attempts are also possible. GENGA (Grimm & Stadel 2014) uses GPUs for computing the gravity part, adopting an MVS method as the integrator. Oshino et al. (2011) applied the P³T method to GRAPE and, later, Iwasawa et al. (2015) developed a GPU-enabled P³T method and succeeded in handling more than a million particles, although they considered a star cluster model in their simulations.

We newly developed a parallelized N-body integrator based on the P³T method (Oshino et al. 2011): a Parallelized Particle-Particle Particle-tree code for Planet formation, called PENTACLE hereafter. This hybrid N-body code allows us to perform a high-resolution N-body simulation with 1–10 million particles in a collisional system such as a planetesimal disc for ∼1 Myr on a standard supercomputer.

In this paper, we present our new hybrid N-body code, PENTACLE, in Section 2. We show the performance and accuracy of PENTACLE in terms of a cut-off radius and a time-step in Section 3, and also demonstrate N-body simulations of planetary accretion in a swarm of 1 million planetesimals. We summarize our paper in the last section.
Fig. 1. History of N-body simulations for terrestrial planet formation (see Richardson et al. 2000 and references therein). The current level of the number of particles used in N-body simulations of planet formation is ∼10⁴–10⁵.

2 Methods

In this section, we present the basic concept and algorithm of PENTACLE.
In section 2.1, we briefly review the Particle-Particle Particle-Tree (P³T) method, which is the hybrid symplectic integrator used in PENTACLE. In sections 2.2.1 and 2.2.2, we describe the parallelization method for computing gravitational forces between particles. In the last section, we show the actual recipe of PENTACLE.

2.1 The P³T method
The concept of PENTACLE comes from the P³T method (Oshino et al. 2011). The P³T method is a hybrid symplectic integrator like SyMBA (Duncan et al. 1998) and Mercury (Chambers 1999). The basic idea of this hybrid symplectic integrator is to split the gravitational force between two particles, i.e., the Hamiltonian, into two parts (soft and hard parts) according to their separation, working in heliocentric coordinates. The Hamiltonian used in P³T is given by

\[ H = H_{\rm hard} + H_{\rm soft}, \tag{1} \]

\[ H_{\rm hard} = \sum_{i=1}^{N}\left[\frac{|\mathbf{p}_{i}|^{2}}{2m_{i}} - \frac{Gm_{i}m_{\odot}}{r_{i}}\right] - \sum_{i=1}^{N}\sum_{j>i}^{N}\frac{Gm_{i}m_{j}}{r_{ij}}\left[1 - K(r_{ij})\right], \tag{2} \]

\[ H_{\rm soft} = -\sum_{i=1}^{N}\sum_{j>i}^{N}\frac{Gm_{i}m_{j}}{r_{ij}}K(r_{ij}), \tag{3} \]

where p_i and m_i are the momentum and mass of particle i, m_⊙ is the mass of the central star, r_i is the heliocentric distance of particle i, r_ij is the separation between particles i and j, and K(r_ij) is a cutoff function that changes smoothly from 0 at r_ij ≤ r_in to 1 at r_ij ≥ r_out. The hard part, i.e., the Keplerian motion around the Sun plus close encounters within the cutoff radius, is integrated with a fourth-order Hermite scheme, while the soft part is evaluated with the Barnes-Hut tree method and integrated with a leapfrog scheme and a fixed time-step ∆t (see section 2.2).
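One full step of the scheme is then the usual hybrid-symplectic composition of the two flows; the kick-drift-kick ordering shown here is the standard choice and is consistent with the leapfrog soft-part integrator of section 2.2.1:

\[
e^{\Delta t\,\hat{H}} \simeq e^{\frac{\Delta t}{2}\hat{H}_{\rm soft}}\; e^{\Delta t\,\hat{H}_{\rm hard}}\; e^{\frac{\Delta t}{2}\hat{H}_{\rm soft}} + O(\Delta t^{3}),
\]

where \(e^{\tau\hat{H}_{\rm X}}\) denotes evolution under \(H_{\rm X}\) for a time \(\tau\): the soft operator applies a velocity kick computed from the tree force, and the hard operator advances the Keplerian motion and the neighbour interactions with the Hermite integrator.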
2.2 Parallelization

2.2.1 Soft part

One of the essential ingredients for performing large N-body simulations is the parallel calculation of gravitational interactions between particles. We apply a parallelization method for distributed-memory parallel computers to PENTACLE. This method consists of the following steps (a schematic of the corresponding calls is sketched after this list):

1. The computational domain is divided into sub-domains, each of which is allocated to one MPI process.
2. Particles are assigned to each process, and a tree structure is constructed in it.
3. Based on the tree structures, processes provide information on particles, i.e., multipole moments of the gravitational potential, to each other [hereafter this step is called the "exchange of the Local Essential Tree (LET)"].
4. Using the information received in the previous step, each process reconstructs the entire tree structure.
5. Each process evaluates the gravitational forces between particles using the tree structure and integrates the motions of its assigned particles using a leapfrog scheme and a fixed time-step.

In order to implement this scheme efficiently, we use a library called FDPS (see Iwasawa et al. 2016). FDPS is a C++ template library that helps users develop parallel particle-based simulation codes. The basic idea of FDPS is to separate the technical parts involved in parallelization from the physical problem itself: specifically, the decomposition of the computational domain into sub-domains, the exchange of particles among processes, and the gathering of information on particles stored in other processes. FDPS provides the functions necessary for parallelized tree codes as C++ templates. Thus, users just define an arbitrary data structure and a kernel function for the potential between two interacting particles, and FDPS takes care of the data exchanges among processes.
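The sketch below illustrates how the five steps above map onto FDPS calls. The identifiers follow the sample codes published with FDPS (Iwasawa et al. 2016), but the exact class and method signatures may differ between FDPS versions, and the particle type and kernel are simplified stand-ins rather than PENTACLE's actual ones.

```cpp
// Schematic soft-part driver; identifiers follow the FDPS sample codes
// (Iwasawa et al. 2016) but are not guaranteed to match any given version.
#include <particle_simulator.hpp>
#include <cmath>

// Minimal full-particle type with the accessors FDPS expects.
struct FPGrav {
    PS::F64    mass;
    PS::F64vec pos, vel, acc;
    PS::F64    pot;
    PS::F64vec getPos() const { return pos; }
    PS::F64 getCharge() const { return mass; }
    void copyFromFP(const FPGrav & fp) { *this = fp; }
    void copyFromForce(const FPGrav & f) { acc = f.acc; pot = f.pot; }
    void clear() { acc = 0.0; pot = 0.0; }
};

// Pairwise kernel for particle-particle and particle-supernode pairs.
// In PENTACLE the cutoff K(r_ij) would enter here (section 2.2.1).
struct CalcGravity {
    template <class TPJ>
    void operator()(const FPGrav * ep_i, const PS::S32 n_i,
                    const TPJ * ep_j, const PS::S32 n_j, FPGrav * force) {
        for (PS::S32 i = 0; i < n_i; i++) {
            for (PS::S32 j = 0; j < n_j; j++) {
                PS::F64vec dr = ep_i[i].getPos() - ep_j[j].getPos();
                PS::F64 r2 = dr * dr + 1.0e-12;  // softened to avoid the
                                                 // self-interaction singularity
                PS::F64 r_inv = 1.0 / std::sqrt(r2);
                PS::F64 m_r3 = ep_j[j].getCharge() * r_inv * r_inv * r_inv;
                force[i].acc -= m_r3 * dr;       // accumulate acceleration
                force[i].pot -= ep_j[j].getCharge() * r_inv;
            }
        }
    }
};

int main(int argc, char * argv[]) {
    PS::Initialize(argc, argv);                  // MPI start-up
    PS::ParticleSystem<FPGrav> system;
    system.initialize();                         // ... read particles here ...
    PS::DomainInfo dinfo;
    dinfo.initialize();
    dinfo.decomposeDomainAll(system);            // step 1: domain decomposition
    system.exchangeParticle(dinfo);              // step 2: assign particles
    PS::TreeForForceLong<FPGrav, FPGrav, FPGrav>::Monopole tree;
    tree.initialize(system.getNumberOfParticleLocal());
    // Steps 3-5: the LET exchange, tree reconstruction, and force
    // evaluation are all performed inside this single call.
    tree.calcForceAllAndWriteBack(CalcGravity(), CalcGravity(), system, dinfo);
    PS::Finalize();
    return 0;
}
```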
In PENTACLE, we utilize the cutoff function, K(r_ij), in the kernel function for the gravitational interactions between particles; the sketch below illustrates its qualitative behaviour. FDPS also has APIs for searching for neighbouring particles; for example, we can use getneighbourListOneParticle() to find the particles within the radius r_out around a given particle.
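The production form of K(r) is the polynomial given in Oshino et al. (2011); the cubic smoothstep below is only a qualitative stand-in with the same limits (0 inside r_in, 1 outside r_out). The demo checks that the potential split of equations (2)-(3) is exact, i.e., the soft and hard pair potentials always sum to the full Newtonian one.

```cpp
// Qualitative stand-in for the P3T changeover function K(r); the
// production polynomial (Oshino et al. 2011) differs in detail.
#include <cmath>
#include <cstdio>

double K(double r, double r_in, double r_out) {
    if (r <= r_in)  return 0.0;                 // pure hard (Hermite) regime
    if (r >= r_out) return 1.0;                 // pure soft (tree) regime
    const double x = (r - r_in) / (r_out - r_in);
    return x * x * (3.0 - 2.0 * x);             // smooth transition
}

int main() {
    const double G = 1.0, mi = 1.0e-9, mj = 1.0e-9;  // demo values only
    const double r_out = 1.0e-2, r_in = 0.1 * r_out;
    for (double r = 1.0e-3; r < 3.0e-2; r *= 2.0) {
        const double phi  = -G * mi * mj / r;        // full pair potential
        const double soft = phi * K(r, r_in, r_out); // handled by the tree
        const double hard = phi - soft;              // handled by Hermite
        // soft + hard == phi holds by construction at every r.
        std::printf("r=%.2e  K=%.3f  soft+hard-phi=%.1e\n",
                    r, K(r, r_in, r_out), soft + hard - phi);
    }
    return 0;
}
```

Note that the corresponding forces acquire an extra dK/dr term from differentiating the products in equations (2) and (3), so the pair force is not simply the Newtonian force scaled by K.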
2.2.2 Hard part

Parallelization of the hard part in PENTACLE is more straightforward. If a particle has no neighbour within r_out, its motion can be integrated individually by solving the Kepler equation on each process. We use the Newton-Raphson method to solve the Kepler equation in this scheme (a minimal sketch is given below). For convenience, we introduce the term "cluster", defined as a subset of particles. Each particle must belong to a cluster, and clusters are mutually exclusive. All neighbour particles of an arbitrary particle in a given cluster belong to the same cluster; in other words, a particle outside a given cluster does not have neighbours in that cluster. We also define the size of a cluster as the number of particles in it, and refer to a cluster of size k as a k-cluster. Thus a particle with no neighbours forms a 1-cluster.

When a particle has one neighbour and that neighbour's only neighbour is the particle itself (i.e. both particles form a 2-cluster), the particle interacts only with its neighbour and the Sun. This system can be considered an isolated three-body system. If both particles reside on the same process, no inter-process communication is required. Otherwise, we just send the particle's data to the process on which its neighbour is stored.

In principle, we can integrate the motion of a particle with multiple neighbours in a similar way. It is, however, somewhat complicated to find a cluster of size ≥ 3 (hereafter a k>2-cluster) in parallel. We therefore first send the particles in k>2-clusters to the root process, named "rank 0", and the root process integrates their motions in serial. As shown later, particles in k>2-clusters are rare, and this serial procedure has little effect on the scaling performance of PENTACLE for moderate numbers of MPI processes.
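Grounded in the Newton-Raphson Kepler solver mentioned above, here is a minimal self-contained sketch for the elliptic case; PENTACLE's actual drift step also performs the conversions between Cartesian coordinates and orbital elements, which are omitted here.

```cpp
// Newton-Raphson solver for Kepler's equation E - e sin E = M, the
// operation each process performs for its isolated particles during
// the drift step.
#include <cmath>
#include <cstdio>

double solveKepler(double M, double e, double tol = 1.0e-14) {
    const double PI = 3.141592653589793;
    double E = (e < 0.8) ? M : PI;               // standard initial guess
    for (int it = 0; it < 50; ++it) {
        const double f  = E - e * std::sin(E) - M;  // residual
        const double fp = 1.0 - e * std::cos(E);    // derivative
        const double dE = f / fp;
        E -= dE;
        if (std::fabs(dE) < tol) break;
    }
    return E;  // eccentric anomaly
}

int main() {
    // Drift a planetesimal with e = 0.05 by a quarter of its orbit.
    const double e = 0.05, M = 0.25 * 2.0 * 3.141592653589793;
    const double E = solveKepler(M, e);
    std::printf("E = %.15f, residual = %.1e\n",
                E, E - e * std::sin(E) - M);
    return 0;
}
```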
2.3 Calculation procedure of PENTACLE

The actual calculation of an N-body simulation in PENTACLE proceeds as follows (a sketch of the neighbour-count classification in step 5 is given after this list):

1. Define a data structure for a particle and a kernel function for the gravitational interactions between particles.
2. Send the particles' data to FDPS.
3. FDPS returns the soft force and the number of neighbours for each particle.
4. Each process gives all of its assigned particles their kick velocities using the soft forces.
5. According to the number of neighbours, particles are classified into three groups: a non-neighbour, a one-neighbour, and a multiple-neighbour group.
6. Each process integrates the motion of particles without neighbours, the so-called drift step, by solving the Kepler equation.
7. If a particle has one neighbour, each process checks whether the neighbour's only neighbour is the target particle, i.e., the referred particle. If the referred and neighbour particles are assigned to the same process, the two are integrated on their process using the fourth-order Hermite scheme. If not, the referred particle is sent to the neighbour's process and both processes integrate their motions. If the neighbour has ≥ 2 neighbours, the referred particle is sent to the process with rank 0 (the root process).
8. Each process sends particles with ≥ 2 neighbours to the root process.
9. The root process integrates all the particles received.
10. The root process returns the particles' data to their original processes.
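A minimal sketch of the classification in step 5; the container and enum names are illustrative, not PENTACLE's internal ones.

```cpp
// Step 5: classify particles by neighbour count returned by FDPS (step 3).
#include <cstdio>
#include <vector>

enum class Group { NoNeighbour, OneNeighbour, MultiNeighbour };

std::vector<Group> classify(const std::vector<int> & n_ngb) {
    std::vector<Group> g(n_ngb.size());
    for (std::size_t i = 0; i < n_ngb.size(); ++i) {
        if      (n_ngb[i] == 0) g[i] = Group::NoNeighbour;    // Kepler drift (step 6)
        else if (n_ngb[i] == 1) g[i] = Group::OneNeighbour;   // 2-cluster check (step 7)
        else                    g[i] = Group::MultiNeighbour; // sent to rank 0 (step 8)
    }
    return g;
}

int main() {
    const std::vector<int> n_ngb = {0, 1, 0, 3, 1};  // dummy neighbour counts
    const auto g = classify(n_ngb);
    for (std::size_t i = 0; i < g.size(); ++i)
        std::printf("particle %zu -> group %d\n", i, static_cast<int>(g[i]));
    return 0;
}
```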
3 Results

We demonstrate planetary accretion in a swarm of planetesimals by using PENTACLE. Our simulations start from a system of equal-mass planetesimals with a mean density of 2 g cm⁻³ in a gas-free environment. We use the minimum-mass solar nebula model (Hayashi 1981) as the nominal surface density profile of solid material, Σ_solid, given by

\[ \Sigma_{\rm solid} = 10\,\eta_{\rm ice}\left(\frac{a}{1\,{\rm au}}\right)^{-3/2} {\rm g\,cm^{-2}}, \tag{17} \]

where a is the semi-major axis and η_ice is the enrichment factor of the surface density of solid material beyond the snow line (a_snow = 2.7 au) due to ice condensation.

We consider two planetesimal disc models for the benchmark tests. One is a narrow ring model (model R) between 0.95 au and 1.05 au, and the other is a disc model (model D) with a radius of 1 au – 11 au. The ring width is large enough to trace the motions of planetesimals spreading out in the disc until the end of our simulations. In order to investigate the impact of the computational domain size on the applicability of our new algorithm, we use η_ice = 1. This leads to a total solid amount of 0.236 M⊕ for model R and of 10.9 M⊕ for model D (see also Table 1).

Table 1. Planetesimal disc models

  Name     Radius (au)   Width (au)   Total Mass (M⊕)
  Model R  0.95 – 1.05   0.1          0.236
  Model D  1 – 11        10           10.9
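As a consistency check (our arithmetic, not a formula quoted from elsewhere), the total masses in Table 1 follow from integrating equation (17) with η_ice = 1 over each annulus:

\[
M_{\rm solid} = \int_{a_{\rm in}}^{a_{\rm out}} 2\pi a\,\Sigma_{\rm solid}\,{\rm d}a
= 4\pi\,(1\,{\rm au})^{2}\,(10\ {\rm g\,cm^{-2}})\left[\sqrt{\frac{a_{\rm out}}{1\,{\rm au}}}-\sqrt{\frac{a_{\rm in}}{1\,{\rm au}}}\right],
\]

which gives 1.41 × 10²⁷ g ≈ 0.236 M⊕ for model R (0.95–1.05 au) and 6.5 × 10²⁸ g ≈ 10.9 M⊕ for model D (1–11 au), matching Table 1.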
The initial eccentricity and inclination of each planetesimal are given by a Rayleigh distribution with dispersions of ⟨e²⟩^(1/2) = 2⟨i²⟩^(1/2) = 2√h, where e and i are the eccentricity and inclination of a planetesimal and h is the reduced Hill radius, defined as the ratio of the Hill radius of a body to its semi-major axis, i.e., h = r_H/a.¹

3.1 N-body simulations

Assuming three different random seeds for the initial values of the orbital elements of the planetesimals, we performed N-body simulations of 5 × 10³ planetesimals for model R. We adopted two different schemes, PENTACLE and a fourth-order Hermite scheme (Makino 1991), with fixed values of R̃_cut, θ, and ∆t, where R̃_cut = r_out/r_H, θ is the opening angle used in the P³T method (Oshino et al. 2011), ∆t is the time-step in units of t_kep/2π (t_kep is the Kepler time), and r_H is the Hill radius.

¹ In the dispersion-dominated regime, a swarm of planetesimals in a Kepler potential spontaneously reaches the isotropic state ⟨e²⟩^(1/2) = 2⟨i²⟩^(1/2) through viscous stirring, irrespective of the initial conditions (e.g. Ida & Makino 1992).

Figure 2 shows that PENTACLE reproduces the results of the direct N-body simulations well. In all the runs, the number of remaining planetesimals decreases monotonically in a similar manner. We also see staircase-like patterns in the right panel of Fig. 2: physical collisions between planetesimals are stochastic processes, and as a result the mass evolution of the largest body (the so-called planetary embryo) shows stepwise growth. In any case, we find that the planetary embryo reaches almost the same mass after 10⁴ years. We also plot the cumulative distribution functions of the eccentricity and inclination of the planetesimals at 10⁴ years in Fig. 3. There is no significant difference among the runs.

3.2 Energy conservation

3.2.1 Model R

We verify energy conservation in a system of planetesimals for model R. We consider two additional models with different initial velocity dispersions: ⟨e²⟩^(1/2) = 2⟨i²⟩^(1/2) = 8√h (hot disc) and √h/2 (cold disc). We choose small values of θ and η in order to suppress the energy errors that arise from the tree approximation and the fourth-order Hermite scheme, where η is the accuracy parameter of the time-steps for the fourth-order Hermite scheme (e.g. see Makino & Aarseth 1992).

Figure 4 shows the maximum relative energy error over 10 Keplerian orbits as functions of ∆t and R̃_cut. For the hot disc, the energy errors can be controlled through ∆t/R̃_cut. This dependence is the same as that seen in systems of objects with high velocity dispersions, such as star cluster simulations (Iwasawa et al. 2015). On the other hand, the energy errors depend on ∆t/(R̃_cut)^(3/2) rather than ∆t/R̃_cut if the collisional system is cold. The energy errors of a cold system mainly come from close encounters. Since the characteristic time-scale of a close encounter is the Keplerian time at the cutoff radius, which scales as (R̃_cut)^(3/2), the energy errors of a cold system depend on (R̃_cut)^(-3/2). However, planetesimal discs are heated up quickly through viscous stirring and/or dynamical friction. As a result, we can, a posteriori, use ∆t/R̃_cut as the accuracy parameter for the soft part in this paper.

Figure 5 shows the time evolution of the relative energy errors up to 1,000 Kepler times for model R with five different values of R̃_cut. For all the runs, the energy errors grow gradually as t^(1/2), as expected for a random walk. Thus, in order to keep the energy error at an acceptable level for ∼10⁶ years, we should choose ∆t/R̃_cut ≲ 0.1 (see also Fig. 4).
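The t^(1/2) scaling follows from treating the energy errors of successive steps as uncorrelated (a one-line justification added here for clarity):

\[
\Delta E(t) \sim \delta E\,\sqrt{n_{\rm step}} = \delta E\,\sqrt{t/\Delta t} \propto t^{1/2},
\]

where δE is the rms energy error of a single soft step and n_step = t/∆t is the number of steps taken up to time t.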
Fig. 2. N-body simulations of 5 × 10³ planetesimals for 10⁴ yr using PENTACLE (solid) and a fourth-order Hermite scheme (dashed). Three colours correspond to three different random seeds for the initial conditions of the planetesimal disc. Left: the number of remaining particles. Right: mass of the largest body.
Fig. 3. Cumulative distribution functions of the eccentricity (left panel) and inclination (right panel) of the planetesimals at 10⁴ yr.

More details about the dependence of the accuracy on the parameters are discussed in Oshino et al. (2011). Figure 6 shows the time evolution of the relative energy errors up to 100,000 Kepler times for model R. We chose three different parameter sets of R̃_cut and ∆t, such as R̃_cut = 0.6 with ∆t = 1/32, R̃_cut = 0.3 with ∆t = 1/64, and R̃_cut = 0.15 with ∆t = 1/128. These results show that the increase in the energy errors roughly obeys t^(1/2) over a long period of time.

3.2.2 Model D

In order to investigate the effect of the disc radius on energy conservation, we also perform simulations similar to those in section 3.2.1 for model D with R̃_cut = 0.15, 0.3, and 0.5, adopting ∆t = 1/64 and 1/128 for these runs. The energy error is shown in Fig. 7, and the behaviour is similar to that of model R (see Fig. 5). This means that PENTACLE is applicable to N-body simulations of planet formation in a variety of disc models.
3.3 Performance

We also discuss the performance and parallelization efficiency of PENTACLE. All the simulations in this section are carried out on the Cray XC30 (ATERUI) at the Centre for Computational Astrophysics of the National Astronomical Observatory of Japan. ATERUI consists of 1,060 computing nodes, and each node has two Intel Xeon E5-2690 v3 (12 cores, 2.6 GHz) processors. We assign one CPU core to one MPI process (i.e. the number of CPU cores N_c equals the number of MPI processes).

Figure 8 shows the execution time per Kepler time for various N (8 × 10⁵, 10⁶, and 1.25 × 10⁶). Each curve indicates a different parameter set of ∆t and R̃_cut. For all the runs, we take ∆t/R̃_cut to be constant so that the energy errors of the different runs are almost the same. We see similar trends among most runs: the execution time decreases linearly for small N_c and increases for large N_c. To see more details, we plot the breakdown of the execution time for the run with N = 10⁶ (1M), R̃_cut = 0.3, and ∆t = 1/64 in Fig. 9. We see that the execution times of both the hard and soft parts decrease for small N_c.
Fig. 4. Maximum relative energy error over ten Keplerian orbits as functions of ∆t/R̃_cut (left panels) and ∆t/(R̃_cut)^(3/2) (right panels). Top, middle, and bottom panels show the results of the hot, standard, and cold disc models, respectively.
Fig. 5. Relative energy error as a function of time for model R: R̃_cut = 0.1 (solid, black), 0.2 (dashed, red), 0.3 (thin solid, blue), 0.5 (dotted, green), and 1.0 (thick solid, orange). From the top left to the bottom right, ∆t = 1/32, 1/64, 1/128, and 1/256.
Fig. 6. Relative energy errors as a function of time for model R: R̃_cut = 0.6, ∆t = 1/32 (solid, black); R̃_cut = 0.3, ∆t = 1/64 (dashed, red); and R̃_cut = 0.15, ∆t = 1/128 (dotted, blue).
Fig. 7. Energy errors for model D for ∆t = 1/64 (left) and 1/128 (right): R̃_cut = 0.15 (thick solid, red), 0.3 (dotted, blue), and 0.5 (thin, green).
MPI_Scatterv used for gatheringand scattering of particles are limited to the injection bandwidth.Thus, the minimum time for the hard part is limited to integrat-ing or gathering (and scattering) particles in k > -cluster. Forthe soft part, the execution time for calculating a gravity forcedecreases linearly for all the range of N c . However, the timefor the exchange LET increases for large N c . This is because MPI_Alltoallv is used in FDPS at this step (Iwasawa et al.2016).We also show the same figure as Fig. 9, but with ˜ R cut =1 . and ∆ t = 1 / . behaviours of all the curves are similarthose shown in Fig. 10, except that the hard part dominatesthe total execution time. Compared with the run with ˜ R cut =0 . and ∆ t = 1 / , more particles belong to k > -clusters inthis case, in other words, many particles are not integrated inparallel. Figure 11 shows that the fraction of particles in 1-, 2-, and k > -clusters, and it indicates that the number of particlein k > -cluster is about one hundred times larger than that in -cluster. If, in the hard part, each execution time of integration ofa particle per unit time is the same among all particles, the totalexecution time for the hard part can not be scaled for N c (cid:38) .This conclusion is firmly supported by Fig. 10. ublications of the Astronomical Society of Japan , (2014), Vol. 00, No. 0 The execution time for the soft part of the run with ˜ R cut =1 . and ∆ t = 1 / is roughly half of that with ˜ R cut = 0 . and ∆ t = 1 / . If we fully parallelize the hard part, its executiontime is much smaller than the soft one. As a result, the total ex-ecution time of the run with ˜ R cut = 1 . and ∆ t = 1 / roughlybecomes half of that with ˜ R cut = 0 . and ∆ t = 1 / .Finally, we discuss a possibility of full parallelization of thehard part to permit a larger ˜ R cut and ∆ t , namely, we reducethe number of the use of group communication per unit time.To discuss the possibility of full parallelization in the hard part,we plot the number of particles in the largest cluster against themean number of neighbours per particle, (cid:104) n ngb (cid:105) in Fig. 13 andthe distribution function of a cluster size, i.e., k -cluster, in Fig.14. As seen in Fig. 13, there is a threshold at (cid:104) n ngb (cid:105) ∼ .Under the circumstance of (cid:104) n ngb (cid:105) (cid:46) , many clusters containa few particles (see also Fig. 14). This means that we could in-tegrate the particles in k > -cluster in parallel if we chose smallenough ˜ R cut to guarantee (cid:104) n ngb (cid:105) < . In addition, if we couldintegrate all the particles for the hard part in parallel, gathering(scattering) particles in k > -cluster to (from) the root processwere not necessary . Figure 14 also shows that the distributionfunction of a k -cluster depends on (cid:104) n ngb (cid:105) and is independent of N . The larger N we use, the more important the parallelizationof the hard part becomes. Thus, we will modify PENTACLE tohandle the hard part in fully parallel in the near future.
3.4 Planetary accretion in a swarm of small planetesimals

In the previous section, we demonstrated that a high-resolution N-body simulation with one million particles for planet formation is feasible with the PENTACLE code. How does the growth of a planetary embryo proceed in a sea of "small" planetesimals? In order to investigate the effect of the initial size of a planetesimal, r_pls, we carried out three N-body simulations of terrestrial planet formation using N = 10⁴, 10⁵, and 10⁶, which correspond to r_pls ∼ 260 km, 120 km, and 55 km, respectively (for the 2 g cm⁻³ density and the ring mass in Table 1). We considered the same narrow planetesimal ring model as that used in Section 3.1. In these simulations, we introduce an aerodynamic drag force as described in Adachi et al. (1976). We assume an initial gas density at 1 au of ρ₀ ≈ 2 × 10⁻⁹ g cm⁻³. The density of the disc gas decreases exponentially with time as ρ_gas = ρ₀ exp(−t/τ_gas), where ρ_gas and t are the density of the disc gas and the time, respectively, and τ_gas is the e-folding time of gas dissipation. We simulated planetesimal accretion onto a growing planetary embryo for 1 Myr.

Figure 12 shows the mass evolution of the largest body over 1 Myr. We can see that a planetary embryo grows rapidly in a runaway fashion and that its growth rate follows a power-law function of mass, as shown in Ida & Makino (1993). The emergent runaway body then enters the so-called oligarchic growth mode and, eventually, the final mass of the embryo approaches almost the same value in the three cases. We confirmed that the classical picture of terrestrial planet formation holds true for a swarm of equal-sized planetesimals with radii down to ∼50 km. A significant difference among the three cases is, however, the duration of the transient phase between the runaway growth and the oligarchic growth. In the transient phase, the growth of a runaway body slows down, but the random velocities of the ambient planetesimals are not yet excited efficiently by the body itself. This may indicate that planetary growth in a sea of smaller planetesimals proceeds in a non-equilibrium oligarchic growth mode, as suggested by Ormel et al. (2010) and Kobayashi et al. (2010). This topic will be discussed in our forthcoming paper. Last but not least, although the energy errors of the high-resolution N-body simulations are not shown in this paper, they are always kept to a low level, consistent with section 3.2.
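For reference, the mass dependence of the relative growth rate behind this description is the classic one (cf. Ida & Makino 1993; Kokubo & Ida 1998); the exponents below are quoted from that standard picture rather than from this paper's own fits:

\[
\frac{1}{M}\frac{{\rm d}M}{{\rm d}t} \propto M^{1/3}\ \ (\text{runaway growth}), \qquad
\frac{1}{M}\frac{{\rm d}M}{{\rm d}t} \propto M^{-1/3}\ \ (\text{oligarchic growth}),
\]

so during runaway growth the largest body grows relatively faster than its competitors and pulls away, while during oligarchic growth neighbouring embryos converge toward similar masses.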
4 Summary

We have developed a parallelized hybrid N-body code (PENTACLE) to perform high-resolution simulations of planet formation. The open-source library designed for the full automatic parallelization of particle simulations (FDPS) is implemented in PENTACLE. PENTACLE uses a 4th-order Hermite scheme to calculate the gravitational interactions from particles within a cutoff radius (R̃_cut) and the Barnes-Hut tree method for the gravity from particles beyond it. We also confirmed that the results of planetary growth in a planetesimal ring using PENTACLE are in good agreement with those using a direct N-body code.

We figured out that ∆t/R̃_cut ≲ 0.1 reduces the energy errors to an acceptable level when simulating planetary accretion in a swarm of planetesimals (see Figs. 4 – 7). PENTACLE allows us to handle 1 – 10 million particles in a collisional system; for example, it takes about one month of wall-clock time on a CPU-based supercomputer to trace the dynamical evolution of a million planetesimals for 1 Myr. Nevertheless, as shown in Fig. 8, the computational cost of runs with 10⁷ or more particles would still be expensive. The bottleneck in the performance for larger N_c (the number of CPU cores) is the exchange of LETs, which relies on MPI_Alltoallv. In order to relax this problem, we could reduce the number of processes by using accelerators such as GRAPE, GPUs, or PEZY-SC, while keeping the peak performance. We are now developing a GPU-enabled version,
PENTAGLE (a Parallelized Particle-Particle Particle-tree code for Planet formation on a GPU cluster), based on Iwasawa et al. (2015).

We have focused the scope of this paper on the performance and accuracy of PENTACLE, but have also demonstrated the mass evolution of a planetary embryo embedded in a disc with 10⁶ planetesimals. Increasing the number of particles used in an N-body simulation decreases the physical size of the individual planetesimals. The random velocity distribution of planetesimals in both the runaway and the oligarchic growth modes could then become different. This means that the growth of a planetary embryo in a sea of small planetesimals might proceed unlike the traditional picture of terrestrial planet formation. This topic will be discussed in our next paper.
Fig. 8. Execution time per Kepler time for N = 8 × 10⁵ (left panel), 10⁶ (middle panel), and 1.25 × 10⁶ (right panel) against the number of CPU cores. Symbols correspond to the different parameter sets (∆t, R̃_cut) = (1/16, 1.2), (1/32, 0.6), (1/64, 0.3), and (1/128, 0.15).
Fig. 9. Breakdown of the execution time for the run with N = 1 M, R̃_cut = 0.3, and ∆t = 1/64. Left panel: breakdown of the total execution time against the number of CPU cores; open squares, open circles, and open triangles indicate the total, hard-part, and soft-part execution times, respectively. Middle panel: breakdown of the hard-part execution time; open squares indicate the total for the hard part only, open circles, open triangles, and filled circles indicate the time for integrating particles in 1-, 2-, and k>2-clusters, respectively, and filled triangles indicate the time for gathering and scattering particles in k>2-clusters from/to the root process. Right panel: breakdown of the soft-part execution time; open squares indicate the total for the soft part only, and open circles and open triangles indicate the time for the exchange of LETs and the force calculation with the tree method, respectively.
Fig. 10. The same as Fig. 9, but with R̃_cut = 1.2 and ∆t = 1/16.
Fig. 11. Fraction of particles in 1- (red open squares), 2- (green open circles), and k>2-clusters (blue open triangles) against R̃_cut for the runs with N = 8 × 10⁵ (left panel), 10⁶ (middle panel), and 1.25 × 10⁶ (right panel).
Fig. 12. Mass evolution of the largest planetary embryo in N-body simulations with N = 10⁴ (blue), 10⁵ (red), and 10⁶ (black) planetesimals, corresponding to r_pls ∼ 260, 120, and 55 km, respectively.
Fig. 13. The number of particles in the largest cluster against the mean number of neighbours per particle for the runs with N = 8 × 10⁵ (red open squares), 10⁶ (green filled squares), and 1.25 × 10⁶ (blue open circles).
Fig. 14. Fraction of particles in a k-cluster for runs with R̃_cut = 1.2 (red, open), 0.6 (green, filled), and 0.3 (blue, dashed) for N = 8 × 10⁵ (squares), 10⁶ (circles), and 1.25 × 10⁶ (triangles).
PENTACLE will be freely available under the MIT license to all those who are interested in large-scale particle simulations. The source code is hosted on the GitHub platform and can be downloaded from https://github.com/PENTACLE-Team/PENTACLE. The next version will be executable on a GPU cluster, and we will also include the effects of disc-planet interactions and a statistical treatment of collisional fragmentation in PENTACLE.

Acknowledgments