Flexible and scalable particle-in-cell methods for massively parallel computations
RENE GASSMOELLER, Colorado State University, [email protected]
ERIC HEIEN, University of California, Davis
ELBRIDGE GERRY PUCKETT, University of California, Davis
WOLFGANG BANGERTH, Colorado State University

December 13, 2016
Abstract
Particle-in-cell methods couple mesh-based methods for the solution of continuum mechanics problems with the ability to advect and evolve particles. They have a long history and many applications in scientific computing. However, they have most often only been implemented for either sequential codes, or parallel codes with static meshes that are statically partitioned. In contrast, many mesh-based codes today use adaptively changing, dynamically partitioned meshes, and can scale to thousands or tens of thousands of processors.

Consequently, there is a need to revisit the data structures and algorithms necessary to use particle methods with modern, mesh-based methods. Here we review commonly encountered requirements of particle-in-cell methods, and describe efficient ways to implement them in the context of large-scale parallel finite-element codes that use dynamically changing meshes. We also provide practical experience for how to address bottlenecks that impede the efficient implementation of these algorithms, and demonstrate with numerical tests both that our algorithms can be implemented with optimal complexity and that they are suitable for very large-scale, practical applications. We provide a reference implementation in ASPECT, an open source code for geodynamic mantle-convection simulations built on the deal.II library.
1 Introduction

The dominant methodologies to numerically solve fluid flow problems today are all based on a continuum description of the fluid in the form of partial differential equations, and include the finite element, finite volume, and finite difference methods. On the other hand, it is often desirable to couple these methods with discrete, "particle" approaches in a number of applications. These include, for example:

• Visualization of flows: Complex, three-dimensional flow fields are often difficult to visualize in ways that support understanding of their critical aspects. Streamlines are sometimes used for this but are typically based on instantaneous flow fields. On the other hand, showing the real trajectories of advected particles can provide real insight, in much the same way as droplets are used for visualization in actual wind tunnel tests.

• Tracking interfaces and origins: In many applications, one wants to track where interfaces between different parts of the fluid are advected, or where material that originates in a part of the domain ends up as a result of the flow. For example, in simulating the flow in the Earth's mantle (the application for which we developed the methods discussed in this contribution), one often wants to know whether the long-term motion leads to a complete mixing of the mantle, or whether material from the bottom of the mantle stays confined to that region.

• Tracking history: In other cases, one wants to track the history of a parcel of fluid or material. For example, one may want to integrate the amount and direction of strain a rock grain experiences over time as it is transported through the Earth's mantle, possibly relaxed over time by diffusion processes. This may then be used to understand seismic anisotropies, but accumulated strain can also be used in damage models to feed back into the viscosity tensor and thereby affect the flow field itself.

Many of these applications could be achieved by advecting additional, scalar or tensor-valued fields along with the flow. For example, interfaces and origins could be tracked by additional "phase" fields with initial values between zero and one, and the history of a volume of fluid could be tracked by additional fields whose values change as a function of properties of the flow field or their location.

On the other hand, advected fields also have numerical disadvantages such as undesirable levels of diffusion and failure to preserve appropriate bounds in regions with sharp gradients, i.e., overshoot and undershoot. For these reasons – and, importantly, simply because some areas of computational science have "always" done it that way – approaches in which particles are advected along with the flow are quite popular for some applications. Use cases and discussions of computational methods can, for example, be found as far back as 1962 in [16] and are often referred to as particle-in-cell (PIC) methods to indicate that they couple discrete (Lagrangian) particles and (Eulerian) mesh-based continuum methods.

Many implementations of such methods can be found in the literature [23, 12, 21, 24, 30]. However, almost all of these methods are restricted to either simple, structured meshes and/or sequential computations. This no longer reflects the state of the art of fluid flow solvers.
The current contribution is therefore a comprehensive assessment of the challenges arising when implementing particle methods in the context of modern computational fluid dynamics (CFD) solvers. Specifically, we aim at developing methods that also work in the following two situations:
• Unstructured, fully adaptive, dynamically changing 2d and 3d meshes: We envision solving complex and dynamically changing flow problems in geometrically non-trivial domains. These are most efficiently solved on dynamically adapted meshes that use mesh refinement and coarsening.

• Massively parallel computations:
The methods we aim to develop need to scale to computations that run on thousands of cores or more, tens of millions of cells or more, and billions of particles.

Supporting these two scenarios in mixed particle-mesh methods requires fundamentally different approaches to the design of data structures and algorithms. Within this paper, we will therefore investigate all aspects of implementing such methods required for actual, realistic applications. Specifically, we will consider the following components, along with a discussion of their practical performance in typical use cases:

• Appropriate data structures for the use cases outlined above;
• Generation of particles;
• Advection of particles by integrating their trajectory and properties;
• Treatment of particles as they cross cell and processor boundaries in parallel computations;
• Treatment of particles when the mesh is refined or coarsened adaptively, including appropriate load balancing;
• How to deal with "properties" that may be used to characterize auxiliary state variables associated with each particle;
• Methods to interpolate or project particle properties onto the mesh.

Our goal in this paper is to provide a comprehensive assessment of all of the steps one needs to address to augment state-of-the-art computational fluid dynamics solvers with efficient and scalable particle schemes. While our discussions will be generic and independent of any concrete software package, we provide a reference implementation as part of the mantle convection modeling code ASPECT [18, 3]. We will use this reference implementation to also demonstrate practical properties of our approaches and thereby validate their usefulness, efficiency, accuracy, and scalability.
In particular, we will show numerical results on up to many thousands of cores, using billions of particles, thus demonstrating applicability to real-world test cases.

This paper is structured as follows: Section 2 presents an overview of the algorithms and data structures we use, including considerations of algorithmic complexity. Section 3 then evaluates these methods in scaling studies and shows that our algorithms fulfill the claimed properties. Section 4 presents a typical application in geodynamic modeling that makes use of the novel properties of our algorithm. We conclude in Section 5.
2 Computational methods
In this section, let us discuss the algorithmic challenges listed in the introduction. Specifically, we address particle generation (Section 2.2), advection (Section 2.3), transport between cells and parallel subdomains (Section 2.4), handling of cells upon mesh refining and coarsening (Section 2.5), treatment of particle properties (Section 2.6), transfer of information from particles to mesh cells (Section 2.7), and the generation of graphical or text-based output of particle data (Section 2.8). In most of these cases, the implementation must allow for different choices of specific algorithms; for example, the particle advection algorithm may be the single-step forward Euler method or the two- or four-step second- and fourth-order Runge-Kutta methods. We will discuss how one can design generic implementations for such cases and comment on general software design choices for such a framework in Section 2.9.

The theoretical assessment of performance criteria of algorithms relies on the use of appropriate data structures. Consequently, we will start our discussion in Section 2.1 with a presentation of the fundamental storage scheme we propose to use. The evaluation of algorithmic complexities is then based on the complexity of accessing elements within these data structures.
2.1 Data structures

To evaluate the complexity of the algorithms in this publication, we assume the data structures described in this subsection, which satisfy the following basic requirements as part of a mesh-based CFD code:

• At any given time, each particle is associated with a cell K within which it is located at that time.
• Each cell K contains a number N_particles,K of particles that may be different from cell to cell and from (sub-)time step to (sub-)time step due to particles being advected from one cell to another.
• Each cell K represents a part Ω_K of the global model domain Ω.
• Each processor P "owns" a number N_cells,P of cells that may change from time step to time step due to mesh refinement and load balancing. The number of all particles in these N_cells,P cells is N_particles,P.
• Each particle stores its current location.
• Each particle stores a global index i that uniquely identifies it.
• Each particle stores a number N_properties of scalar properties. N_properties is a run-time constant but is, in general, not known at compile time. (In practice, these scalar properties may be interpreted as the components of a higher-rank tensor such as an accumulated strain. However, how this data is semantically interpreted is immaterial to our discussions here.)
• Algorithms must be able to:
  – efficiently identify all particles located in a cell,
  – efficiently identify the cell for a given particle,
  – efficiently transfer particles from one cell to another,
  – efficiently transfer particles from one processor to another,
  – efficiently evaluate field-based variables at the location of a particle.

These requirements make it clear that fixed-sized arrays would be a poor choice. Rather, we use the following C++-style structure to represent the data associated with a single particle:

    template <int dim>
    struct Particle
    {
      ParticleId  id;
      Point<dim>  location;
      Point<dim>  reference_location;
      double     *properties;
    };

Here,
ParticleId is a type large enough to uniquely index all particles in a simulation (we provide the choice of a 32-bit or 64-bit unsigned integer type as a compile-time option), and Point<dim> is a class that stores the coordinates of a point in dim-dimensional space.
For some applications, it is also necessary to store, but (usually) neither advect nor otherwise update, particles that are located in "ghost" cells, i.e., cells that surround the ones owned by one processor, but are owned by another processor. We will give examples of this in Section 2.7. Whether these particles are stored in the same multi-map, or a separate one, is unimportant; however, it is convenient to use the same fundamental data structure.
We chose a C++ standard template library std::multimap data structure since it can be efficiently implemented as a binary search tree, connecting individual keys (the cell iterators in our case) with multiple values per key (the particles residing in a particular cell). This container keeps the particles sorted by their containing cells at all times, and allows us to efficiently iterate over all cells, handling all particles in one cell at a time. Each insert, deletion, and search for individual particles is O(log(N_cells,P)) in complexity; already presorted collections of particles can be merged in O(N_particles,P) complexity, and loops over all cells are of O(N_cells,P) complexity.

The advantages of using a dynamic data structure such as a multi-map indexed by the cells are as follows:

• It is easy to move particles from one cell to another, as well as to remove them from one processor and move them to another.
• It is easy to loop over all cells, identify all of the particles in the current cell, and loop over them in order to evolve their positions and properties with the velocities and other values defined by the mesh-based flow solver.

An alternative is to use an array of particles that individually store pointers to the cell that each particle is in. However, this approach either requires active management of used/unused array locations when moving particles between processors, or compressing arrays after every particle transfer. It also either requires linear searches for particles located in a particular cell, or periodic sorting of arrays based on the cell iterator key. A careful evaluation of the operations one typically has to do in actual codes shows that the overhead of using array-based data structures outweighs their simplicity.

Having thus fixed a data structure, we will be able to describe concrete algorithms in the following sections and analyze their complexities based on the complexity of accessing, inserting, or deleting elements of the multi-map.
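To make the storage scheme concrete, the following stand-alone sketch mimics the cell-indexed multi-map. It is illustrative only: a plain integer CellId stands in for the deal.II cell iterator of the actual implementation, and the helper names (move_particle, n_particles_in_cell) are ours, not part of any library API.

```cpp
#include <cstddef>
#include <iterator>
#include <map>
#include <vector>

// Integer stand-in for a mesh cell iterator (assumption for this sketch).
using CellId = unsigned int;

struct Particle
{
  unsigned long       id;         // globally unique particle index
  double              location[2];
  std::vector<double> properties; // N_properties run-time sized values
};

// The central data structure: particles sorted by their containing cell.
using ParticleMap = std::multimap<CellId, Particle>;

// Move one particle to cell `to`: an O(log N) erase plus insert.
void move_particle(ParticleMap &particles, ParticleMap::iterator it, CellId to)
{
  Particle p = it->second;
  particles.erase(it);
  particles.insert({to, p});
}

// Count the particles stored in one cell via equal_range.
std::size_t n_particles_in_cell(const ParticleMap &particles, CellId cell)
{
  auto range = particles.equal_range(cell);
  return std::distance(range.first, range.second);
}
```

Iterating over the map visits all particles of one cell before moving to the next, which is exactly the access pattern the advection and interpolation loops need.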
Remark 2
It is often necessary to also know the coordinates \hat{x} of the k-th particle in the reference coordinate system of the cell K in which the particle currently lies; i.e., to know \hat{x}_k = \Phi_K^{-1}(x_k), where the (possibly nonlinear) function \Phi_K : \hat{K} \mapsto K maps the reference cell \hat{K} (typically, simplices or hypercubes) to a cell K in the physical domain. This can of course be computed on the fly every time it is needed, but since evaluating \Phi_K^{-1} is expensive, we also store the reference location \hat{x}_k within the Particle data structure and keep it in sync with the global location of the particle when a particle is generated, every time it is moved, or its containing cell is refined or coarsened.
2.2 Particle generation

The first step in using particles in mesh-based solvers is their creation on all involved processors. Depending on their intended uses, particles may initially be distributed randomly (possibly based on a probability distribution) or in specific patterns. We will describe the algorithms necessary for both of these cases in the following. Examples of use cases are shown in Fig. 1.

Figure 1:
Common methods to choose the initial particle positions: a statistical distribution with (a, b) uniform or analytically prescribed particle densities, or (c, d) regular distributions in spherical or box patterns. (e, f) More complicated initial particle patterns can be created by variations of the regular distributions or with a user-given input file. Model domain and particle coloring are for illustration purposes only and were chosen for a computation that involves the tracking of an interface, in which the color indicates the initial position of the particle.
Random particle positions
Randomly chosen particle locations are often used in cases where particles represent the values of a field, e.g., the history of material, or the origin and movement of a specific type of material. In these cases, one is often not interested in prescribing the exact location of each particle, and randomly chosen locations are acceptable. The probability distribution ρ(x) from which locations are drawn would usually be chosen as uniform in the first of the cases above, and is often the characteristic function of a subdomain in the second case. Alternatively, one can use a higher particle density in certain regions, for example to better resolve steep gradients. In practice, it is often not possible to ensure that ρ(x) is normalized to ∫_Ω ρ(x) dx = 1, where Ω is the computational domain; thus, the algorithm below is written in a way so that it does not require a normalized probability distribution.

To generate N particles on P processors, we use the following algorithm, running on each processor:

    // compute this processor's weight with regard to rho(x)
    double local_weight = 0;
    std::vector<double> accumulated_cell_weights (N_cells,P);
    for (cell K in locally owned cells)
      {
        local_weight += ∫_K rho(x) dx;
        accumulated_cell_weights[K] = local_weight;
      }
    double global_weight = sum over all processors p of local_weight[p];

    // compute this processor's number of particles,
    // and the id of the first local particle
    ParticleId N_particles,P  = N_particles * local_weight / global_weight;
    ParticleId local_start_id = sum over p < local_process_id of N_particles,p;

    // determine in which cells to generate particles
    std::vector<unsigned int> particles_in_cell (N_cells,P);
    for (i in [0, N_particles,P))
      {
        random_number = local_weight * uniform_random(0, 1);
        CellId K = accumulated_cell_weights.lower_bound (random_number);
        ++particles_in_cell[K];
      }

    // create particles in each cell
    ParticleId next_id = local_start_id;
    for (cell K in locally owned cells)
      for (i in [0, particles_in_cell[K]))
        {
          Particle p;
          p.location = ...; // compute a location in K
          p.reference_location = ...;
          p.id = next_id;
          particles.insert (pair(K, p));
          ++next_id;
        }

Apart from the two global reductions to determine the global weight and the local start index, all of the operations above are local to each processor. The overall run time for generating particles is proportional to max_{1≤p≤P} N_particles,p, i.e., of optimal complexity in N and, if the number of particles per processor is well balanced, also in P. The insertion of each particle into the multi-map in the algorithm above is, in general, not possible in constant time. However, since we create particles by walking over cells in the order they are sorted, it is easy to modify the algorithm slightly to first put cell-particle pairs into a linear array (where they will already be sorted by cell) and then fill the multi-map with all particles at once; this last step can be done with O(N_particles,p), and therefore optimal, complexity.

There remains the task of computing a random location in cell K, i.e., the step not yet described in the algorithm above. This step is not trivial for cells that are not rectangles (in 2d) or boxes (in 3d), in particular if the cell has curved boundaries. In our implementation, we first determine the axes-parallel bounding box B_K ⊇ K of K, and then keep drawing particles at uniformly distributed random locations in B_K until we find one that is inside K.
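The draw-until-inside loop can be sketched as follows. This is a simplified illustration, not the paper's implementation: the cell's inside-test is passed in as a plain predicate, whereas the real code would invert the mapping \Phi_K and test the reference coordinates; the function name random_location_in_cell is ours.

```cpp
#include <random>
#include <utility>

// Axes-parallel bounding box B_K of a 2d cell K.
struct BoundingBox
{
  double xmin, xmax, ymin, ymax;
};

// Rejection sampling: draw uniform points in B_K until one lies inside K.
// The expected number of draws is |B_K| / |K|.
template <typename InsideK, typename RNG>
std::pair<double, double>
random_location_in_cell(const BoundingBox &box, InsideK inside, RNG &rng)
{
  std::uniform_real_distribution<double> ux(box.xmin, box.xmax);
  std::uniform_real_distribution<double> uy(box.ymin, box.ymax);
  while (true)
  {
    const double x = ux(rng), y = uy(rng);
    if (inside(x, y))
      return {x, y};
  }
}
```

The loop makes the drawback (ii) discussed below concrete: for thin, diagonal cells the acceptance rate |K|/|B_K| becomes small and many draws are wasted.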
This is sufficient for most cases, but has several drawbacks: (i) The density of generated particles is uniform in K, even if ρ(x) is not; in particular, even if ρ(x) is zero in parts of the cell (but not everywhere in K, so that the cell's weight is non-zero), the algorithm may still draw particle locations in those parts. (ii) The algorithm becomes inefficient if the ratio of the volume of K to the volume of B_K becomes small, because the number of particle locations one needs to draw before finding one in K is, on average, |B_K|/|K|. An example of this situation would be pencil-like cells oriented along the coordinate system diagonals. (iii) Determining whether a trial point x_t is inside K is generally expensive unless K is a (bi-, tri-)linear image of the reference cell; however, this is often not the case in practice, see Section 2.5. In general, one has to invert the mapping \Phi_K from the reference cell \hat{K} to K in order to determine whether \Phi_K^{-1}(x_t) \in \hat{K}.

An algorithm that addresses all of these shortcomings uses a Metropolis-Hastings (MH) Monte Carlo method to draw a sequence of locations \{\hat{x}_k\}_{k=1}^{\tilde{N}_K} in the reference cell \hat{K} based on the (non-normalized) probability density \hat{\rho}(\hat{x}) := \rho(\Phi_K(\hat{x}_k)) |\det \hat{\nabla} \Phi_K(\hat{x}_k)|. The proposal distribution for the MH algorithm can conveniently be chosen uniform in \hat{K}, necessitating the second factor in the definition of \hat{\rho}(\hat{x}). This method has the advantage that it only requires computing the cheap (often polynomial) forward mapping \Phi_K rather than its inverse, and that it does not draw points at locations where \rho(\Phi_K(\hat{x})) = 0. On the other hand, the MH method yields a sequence of points in which subsequent points may be at the same location.
Consequently, to generate N_K particle locations for cell K that continue to have the desired probability density, we have to create a longer sequence of length \tilde{N}_K ≥ N_K and take every J-th element, where J is larger than the length of the longest subsequence of equal samples.

We note that because the number of particles per processor is fixed initially based on ρ(x), our algorithm is not entirely random in the distribution. However, in practice we find this does not matter if there are sufficiently many particles; a completely random method would then unnecessarily add complexity.

Prescribed particle locations
An alternative to the random arrangement of particles are cases where users want to exactly prescribe initial particle locations. These locations may either be programmatically described (e.g., on a regular grid within a small box), or simply be coordinates listed in a file. Maybe surprisingly, this general case turns out to be more computationally expensive than randomly generated particle locations.

In either case, let us assume that the initial positions of all particles are given in an array {x_k}, k = 1 ... N. Then the following algorithm performs particle creation and insertion:

    ParticleId next_id = 0;
    for (k in [1, N])
      for (cell K in locally owned cells)
        if (x_k ∈ K)
          {
            Particle p;
            p.location = x_k;
            p.reference_location = \Phi_K^{-1}(x_k);
            p.id = next_id;
            particles.insert (pair(K, p));
            ++next_id;
          }

In the worst case, the complexity of this algorithm is O(N_particles N_cells,p) on processor p. This is because, with general unstructured meshes, we can not predict whether a given particle's location lies inside the set of cells stored by this processor without searching through all cells. This limits the usefulness of the algorithm to moderate numbers of particles. (The number of cells per processor, N_cells,p, is often already limited to at most a few hundred thousand for other reasons.) However, the algorithm can be accelerated by techniques such as checking whether a particle's location lies inside the bounding box of the current processor's set of locally owned cells, before checking every one of the locally owned cells.

On hierarchically refined meshes, one can sometimes also find the cell K for a given particle position x_k by finding the coarse-level cell in which it is located, and then recursively searching through its children. This reduces the complexity to O(N_particles log N_cells,p).
However, this only works if child cells occupy the same volume as their parent cell; this condition is often not met when using curved geometries.

Remark 3
In the paragraph above we assumed that the particle positions are known in the global coordinate system, and \Phi_K^{-1} has to be evaluated in order to find the surrounding cell. If, however, the particle coordinates are known in the local coordinate system of each cell (e.g., if one wants to create one particle at each of the cells' midpoints), then the above algorithm is much simpler. A loop over all cells and all local particle coordinates will generate the particles in the optimal order to insert them into the multi-map, and will only involve a (cheap) evaluation of \Phi_K.

2.3 Particle advection

The key step – as well as typically the most expensive part – of any particle-in-cell method is solving the equation of motion for each particle position x_k = x_k(t),

    d/dt x_k(t) = u(x_k(t), t),    (1)

where u(x, t) is the velocity field of the surrounding flow, which is usually computed with a discretized continuum mechanics model. In practice, the exact velocity u(x, t) is not available, but only a numerical approximation u_h(x, t) to u(x, t). Furthermore, this approximation is only available at discrete time steps, u_h^n(x) = u_h(x, t_n), and these need to be interpolated between time steps if the advection algorithm for integrating (1) requires one or more evaluations at intermediate times between t_n and t_{n+1}. If we denote this interpolation in time by \tilde{u}_h(x, t), where \tilde{u}_h(x, t_n) = u_h^n(x), then the equation the differential equation solver really tries to solve is

    d/dt \tilde{x}_k(t) = \tilde{u}_h(\tilde{x}_k(t), t).    (2)

Remark 4
Assessing the convergence properties of an ODE integrator – for example to verify that the RK4 integrator converges with fourth order – needs to take into account that the error in particle positions may be dominated by the difference u − \tilde{u}_h, instead of the ODE solver error. If, for example, we denote by \tilde{x}_{k,h}(t) the numerical solution of (2), then the error will typically satisfy a relationship like

    \|\tilde{x}_k(T) − \tilde{x}_{k,h}(T)\| ≤ C(T) Δt_p^q,

where Δt_p is the time step used by the ODE solver (which is often an integer fraction of the time step Δt_u used to advance the velocity field u), q the convergence order of the method, and C(T) a (generally unknown) constant that depends on the end time T at which one compares the solutions. On the other hand, one would typically compute "exact" trajectories using the exact velocity, and then assess the error as \|x_k(T) − \tilde{x}_{k,h}(T)\|. However, this quantity will, in the best case, only satisfy an estimate of the form

    \|x_k(T) − \tilde{x}_{k,h}(T)\| ≤ C_1(T) Δt_p^q + C_2(T) \|u − u_h\| + C_3(T) \|u_h − \tilde{u}_h\|,

with appropriately chosen norms for the second and third terms. These second and third terms typically converge to zero at relatively low rates (compared to the order q of the ODE integrator) in the mesh size h and the time step size Δt_u, limiting the overall accuracy of the ODE integrator.

Given these considerations, and given that ODE integrators require the expensive step of evaluating the velocity field \tilde{u}_h at arbitrary points in time and space, choosing a simpler, less accurate scheme can significantly reduce the computation time. In our work, we have implemented the forward Euler, Runge-Kutta 2, and Runge-Kutta 4 schemes [13]. We will briefly discuss them below and remark that we have found that using higher order or implicit integrators does not usually yield more accurate solutions.
For simplicity, we will omit the particle index k from the formulas in the remainder of this section.

In the following, for simplicity of exposition, we will assume that the ODE and PDE time steps Δt_p, Δt_u are equal, and will therefore simply denote them as Δt. This is often the case in practice because the velocity field is typically computed with a method that requires a Courant-Friedrichs-Lewy (CFL) number around or smaller than one, implying that particles also move by no more than one cell diameter per (PDE) time step. In such cases, even explicit time integrators for particle trajectories can be used without leading to instabilities, and all of the methods below fall into this category. The formulas in the remainder of this section are, however, obvious to generalize to cases where Δt_p < Δt_u. We will also assume in the following that we have already solved for the velocity field up to time t_{n+1} and are now updating particle locations from x^n to x^{n+1}. In cases where one wants to solve for particle locations before updating the velocity field, \tilde{u}_h can be extrapolated beyond t_n from previous time steps.

Forward Euler
The simplest method often used is the forward Euler scheme,

    x^{n+1} = x^n + Δt \tilde{u}_h(t_n, x^n).

It is only of first order, but cheap to evaluate and often sufficient for simple cases.
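A single forward Euler update can be sketched in a few lines. In this illustration an analytic solid-body rotation u = (−y, x) stands in for the interpolated finite element velocity field \tilde{u}_h of the text; the type and function names are ours and purely illustrative.

```cpp
#include <cmath>

struct Point2
{
  double x, y;
};

// Analytic stand-in for the interpolated velocity field u~_h:
// a solid-body rotation about the origin (assumption of this sketch).
Point2 velocity(double /*t*/, const Point2 &p)
{
  return {-p.y, p.x};
}

// One forward Euler step: x^{n+1} = x^n + dt * u~_h(t_n, x^n).
Point2 forward_euler_step(double t, double dt, const Point2 &x)
{
  const Point2 u = velocity(t, x);
  return {x.x + dt * u.x, x.y + dt * u.y};
}
```

For the rotation field the exact trajectory starting at (1, 0) is (cos t, sin t); taking 1000 steps of size 10^-3 reproduces it to within the expected first-order error, while the slow outward drift of the radius illustrates why only first-order accuracy is obtained.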
Runge-Kutta second order (RK2)
Accuracy and stability can be improved by using a second order Runge-Kutta scheme. The new particle position is computed as

    k_1 = Δt \tilde{u}_h(t_n, x^n),
    x^{n+1} = x^n + Δt \tilde{u}_h(t_n + Δt/2, x^n + k_1/2).

Runge-Kutta fourth order (RK4)

A further improvement of particle advection can be achieved by a fourth order Runge-Kutta scheme that computes the new position as

    k_1 = Δt \tilde{u}_h(t_n, x^n),
    k_2 = Δt \tilde{u}_h(t_n + Δt/2, x^n + k_1/2),
    k_3 = Δt \tilde{u}_h(t_n + Δt/2, x^n + k_2/2),
    k_4 = Δt \tilde{u}_h(t_{n+1}, x^n + k_3),
    x^{n+1} = x^n + (1/6) k_1 + (1/3) k_2 + (1/3) k_3 + (1/6) k_4.

The primary expense in all of these methods is the evaluation of the velocity fields u_h^n and u_h^{n+1} at arbitrary positions x. By sorting particles into the cells they are in, we at least know which cell K an evaluation position x lies in. However, given that the velocity fields u_h we consider are typically finite element fields defined based on shape functions whose values are determined by mapping a reference cell \hat{K} to each cell K using a transformation x = \Phi_K(\hat{x}), evaluation at arbitrary points requires the expensive inversion of \Phi_K. This step can not be avoided, but by storing the resulting reference coordinates \hat{x} of each point after updating x, as discussed in Remark 2, we at least do not have to repeat the step for many of the algorithms below.

2.4 Transport between cells and parallel subdomains

At the end of each (particle) time step – or in the case of the RK2 and RK4 methods, at the end of each substep – we may find that the updated position for a particle that started in cell K may no longer be in K. Consequently, for each particle, every (sub-)step is followed by a determination of whether the particle is still inside K. This can be done as part of computing the reference coordinates \hat{x}_k and testing whether \hat{x}_k ∈ \hat{K}.
If the particle has moved out of K, we need to find the new cell K' in which the particle now resides.

In parallel computations, particles may also cross from a cell owned by one processor to a cell owned by another processor during an advection step, or even during an advection substep. Consequently, ownership of particles needs to be transferred efficiently. To avoid the latency of transferring individual particles, it is most efficient to collect all data that needs to be shipped to each destination into a single buffer. In practice, it is sufficient to implement communication patterns that cover the exchange of particles between processes whose locally owned cells neighbor each other, i.e., using point-to-point messages. This is possible, in particular, if the (ODE) time step is chosen so that the CFL number is less than one, because then particles travel no more than one cell diameter in each step. With few exceptions (see below), particles will therefore either remain within a cell, or end up in a neighboring cell that is also either owned by the current process, or a ghost cell whose owner is known by the current process.

Our reference implementation employs the following algorithm to update the cell-particle map, executed for each particle stored on the current processor:

    if (particle p has left its surrounding cell K)
      {
        K' = find_current_cell_for_particle (p, K);       // search-cell
        if (K' is locally owned cell)                     // option 1
          remove (p, K) from multi-map, add (p, K') to multi-map;
        else if (K' is ghost cell on current processor)   // option 2
          {
            transmissions[owner of K'].push_back (p, K');
            remove (p, K) from multi-map;
          }
        else                                              // option 3
          remove (p, K) from multi-map;
      }
    for (neighbor n in neighbors)
      communicate_particles (n, transmissions[n]);

The vast majority of particles remain in the current cell, end up in a locally owned cell (option 1), or a cell owned by another process (option 2). In almost all cases where a particle moves out of its current cell K, the cell it ends up in is in fact a neighbor of K. A small fraction of cases, however, do not fall into these categories. One example is if the neighbor cell of K is refined, and transport by one diameter of K results in transport into one of the children of the neighboring (coarse) cell that is not adjacent to K. If this child is owned by the same process, it can be found and the particle can be assigned to it (option 1). Otherwise, it is not a ghost cell (because typically only immediate neighbors of locally owned cells are ghost cells), and we can not determine its owner. Such particles are then discarded (option 3). Another example is a particle that is close to the boundary and that is transported outside the domain by the ODE integrator; such a particle also needs to be discarded (again option 3). We have found that even over the course of long simulations, only a negligible fraction of particles (less than 0.002% of all particles per 1,000 RK2 time steps with a CFL number of one) is lost because of these two mechanisms. As expected, an explicit Euler scheme increases the number of lost particles significantly (1.5% per 1,000 time steps with identical time step size). A reduction of the time step size to 0.5 times CFL entirely avoids all particle losses for RK2, and reduces them significantly for the forward Euler integrator (0.375% per thousand time steps). Avoiding particle loss could be accomplished by further decreasing or adaptively changing the time step length.
However, given the small overall loss and the added computational expense of a robust solution, dropping particles that fall out of bounds is a reasonable approach.

The algorithm above requires finding the cell an advected particle is in now (if any), marked by the search-cell comment. On unstructured meshes, this in general requires O(N_{cells,P}) operations; furthermore, because many of the O(N_{particles,P}) particles cross to a different cell, this step is not of optimal complexity. On the other hand, in the vast majority of cases, particles only cross from one cell to one of its (vertex) neighbors. Consequently, we first search all of these neighbors, at a cost of O(1) per moved particle, since the number of neighbors is typically bounded by a relatively small constant. Only the very small fraction of particles that do not end up in a neighbor then requires a complete search over all cells.

Testing whether a particle is inside a cell K' costs O(1) operations, but it is expensive. We can accelerate the search by searching the vertex neighbors of K in an order that makes it likely that we find the cell surrounding the particle early. Following some experimentation, we found that the following strategy works best: Let x be the particle's current position, v be the vertex of K closest to x, and c_{K'} be the center of cell K'. Let a = x − v be the vector from the closest vertex of K to the particle, and b_{K'} = c_{K'} − v be the vector from the closest vertex to the center of cell K'. Then we sort the vertex neighbors K' of K by descending scalar product a · b_{K'} and search them in this order to determine which one contains the particle's new location x. In other words, cells with a center in the direction of the particle movement are checked first. In practice, we find the new cell on the first try in most cases.
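The neighbor-ordering heuristic above can be sketched as follows, in 2D and with hypothetical types: each candidate neighbor cell is represented only by its center, `x` is the particle position, and `v` the vertex of the old cell K closest to x.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Point2 { double x, y; };

inline Point2 operator-(const Point2 &p, const Point2 &q)
{ return {p.x - q.x, p.y - q.y}; }

inline double dot(const Point2 &p, const Point2 &q)
{ return p.x * q.x + p.y * q.y; }

// Sort candidate cell centers so that cells lying in the direction of
// particle movement (large a.b) are tested first.
std::vector<Point2> sort_neighbors_by_direction(const Point2 &x,
                                                const Point2 &v,
                                                std::vector<Point2> centers)
{
  const Point2 a = x - v;  // vector from the closest vertex to the particle
  std::sort(centers.begin(), centers.end(),
            [&](const Point2 &c1, const Point2 &c2) {
              // descending scalar product a . b_{K'}
              return dot(a, c1 - v) > dot(a, c2 - v);
            });
  return centers;
}
```

In the actual implementation, the sorted candidates would be cell iterators, and one would test each in turn by inverting Φ_K' until the containing cell is found.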
On the other hand, several simpler criteria – like the distance between particle and cell center – fail more often, in particular for adaptively refined neighbors.

The deletion and re-insertion of particles into the local multi-map can be optimized in a similar way to the generation algorithm by first collecting all moved particles in a linear array, and then inserting them in one loop. This is advantageous because particles of the same cell tend to move into the same neighbor cells.

The communication of particles to neighboring processes is handled as a two-step process. First, two integers are exchanged between every neighbor and the current process, representing the number of particles that will be sent and received between the respective processes. In a second step, every process transmits the serialized particle data to its neighbors and receives its respective data from its neighbors. This allows us to implement all communications as non-blocking point-to-point MPI transfers, generating only O(1) transmissions and O(N_{particles,P}) data per process. Since we have already determined on the old process which cell contains each transmitted particle, we also transmit this information to the new process, avoiding a repeated search for the enclosing cell.

Dynamically changing meshes

Over the past two decades, adaptive finite element methods have demonstrated that they are vastly more accurate than computations on uniformly refined meshes [9, 1, 6]. In the current context, such adaptively refined, dynamically changing meshes present two particular algorithmic challenges, discussed in the following.
Mesh refinement and repartitioning
In parallel mesh-based methods, refinement and coarsening typically happen in two steps: First, cells are refined or coarsened separately on each process. Particles then have to be distributed to the children of their previous cell (upon refinement), or be merged into the parent of their previous cell (upon coarsening). The second step, after local mesh refinement and coarsening, redistributes the new mesh among the available processes to achieve an efficient parallel load distribution [8, 2]. During this step, particles need to be redistributed along with their mesh cells. To keep this process as simple as possible, we append the serialized particle data to other data already attached to a cell (such as the values of degrees of freedom located in a cell that correspond to solution vectors, or vertex locations), and transmit all data at the same time. We can therefore utilize the same process typically employed in purely field-based methods, which is well supported by existing software for parallel mesh handling [8]. In particular, this approach can use existing and well-optimized bulk communication patterns, and avoids sending particles individually or having to re-join particles with their cells.

Load balancing
The mesh repartitioning discussed in the previous paragraph is designed to redistribute work equally among all available processes. For mesh-based methods, this typically means equilibrating the number of cells each process “owns”, as the workload of a process is generally proportional to the number of cells in all important stages of mesh-based algorithms (e.g., assembly, linear solves, and postprocessing). Consequently, equilibrating the number of cells between processes leads to efficient parallel codes.

On the other hand, in the context of mesh–particle hybrid methods such as the ones we consider in this paper, the number of particles per cell is typically not constant and frequently ranges from zero to a few dozen or a few hundred. Consequently, rebalancing the mesh so that each process owns approximately the same number of cells, including all particles located in them, leads to rather unbalanced workloads during all particle-related parts of the code. Conversely, rebalancing the mesh so that each process owns approximately the same number of particles leaves the mesh-based parts of the code with unbalanced workloads. Both reduce the parallel efficiency of the overall code.

The only approach that restores perfect scalability is to partition cells differently for the mesh-based and particle-based parts of the code. This is possible because one typically first computes the mesh-based velocity field, and only then updates particle locations and properties. The mesh partitioning step between the two phases of the overall algorithm then simply follows the outline discussed above. On the other hand, one cannot avoid transporting both mesh and particle data during these rebalancing steps, because each phase of the algorithm requires all data from the other (for example, the particles may encode material information necessary for computing the velocity field, whereas the velocity field is necessary to update particle locations). Consequently, the
Consequently, theamount of data that has to be transported twice per time step is significant.In practice, some level of imbalance can sometimes be tolerated. In those cases,one can work with the following compromise solutions: • Repartition mesh to combined weight of particles and cells.
Instead of estimating the workload of each cell during the rebalancing step as either a constant (as in pure mesh-based methods) or proportional to the number of particles in a cell (as in pure particle-based methods), one can estimate it as an appropriately weighted sum of the two. The resulting mesh is optimal for neither of the two phases of each time step, but is overall better balanced than either of the extremes.

(In the particular implementation upon which we base this step in our numerical examples, provided by the p4est library, the data attached to cells to be transferred has to have a fixed size, equal for all cells. This is not a restriction for mesh-based methods that use the same polynomial degree for finite element spaces on all cells. However, it leads to inefficiencies if the number of particles per cell varies widely, as it often does in particle-based simulations. This, however, is only a drawback of the particular implementation and can easily be addressed in different implementations. As we will show in Section 3, the inefficiency has no practical impact on the overall performance of our simulations. Note also that in many cases the rebalancing discussed above will require transporting not only some, but essentially all cell-based and/or particle-based data between processors; see for example [8].)

• Ignore imbalance.
As long as the number of particles is small – so that the particle component of the code requires only a small fraction of the overall runtime – one may simply ignore the imbalance. A typical case is that particles are only used to output information for specific points of interest, e.g., accumulated strains over the course of a simulation, or to track where material from a small set of locations is transported.

• Adjust particle density to mesh by particle generation.
If the region of interest and highest mesh resolution of a model is known in advance, one can plan the particle generation accordingly and align the particle density with the expected mesh density. This is most useful in cases where, for example, pre-existing interfaces should be tracked that are expected to only advect, but not diffuse, over time. This alignment then not only increases the particle resolution in regions of interest, but also automatically improves parallel efficiency and scaling.

• Adjust particle density to mesh by particle population management.
In cases where the regions of high mesh density are not known in advance, or in cases where particles tend to cluster or diverge in certain regions of the model, it can be necessary to manage the particle density actively during the model run. This includes removing particles from regions with high particle density, or adding particles in regions of low density. If done appropriately, the result is that the average number of particles per cell remains approximately constant.

• Adjust mesh to particle density.
Instead of adjusting the particle density to align with the mesh density, the mesh density can also be aligned with the particle density. This is useful if the feature of interest in a model is most clearly defined by the particles, e.g., by a higher particle density close to an interface. As in the previous alternative, the alignment of mesh and particle density yields better parallel efficiency and scaling.

While the last two approaches lead to better scalability, they may of course not suit the problem one originally wanted to solve. On the other hand, generating additional particles upon refinement of a cell, and thinning out particles upon coarsening, is a common strategy in existing codes [24, 19].
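The first compromise above, weighting each cell by a combination of mesh and particle cost, can be sketched as follows. The function names and the value of the weight W are illustrative assumptions; in our reference implementation this quantity would be supplied to the partitioner (e.g., as a p4est cell-weight callback).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Combined cell/particle weighting for repartitioning: each cell contributes
// a base cost of 1 (the mesh-based work) plus W times its particle count
// (the particle-based work). W = 0 balances cells only; a very large W
// effectively balances particles only.
double cell_weight(std::size_t n_particles_in_cell, double W)
{
  return 1.0 + W * static_cast<double>(n_particles_in_cell);
}

// Total cost of the cells assigned to one process; the partitioner tries to
// make this sum approximately equal across processes.
double partition_cost(const std::vector<std::size_t> &particles_per_cell,
                      double W)
{
  double cost = 0.0;
  for (const std::size_t n : particles_per_cell)
    cost += cell_weight(n, W);
  return cost;
}
```

With W = 0 a process owning three cells and ten particles has the same cost as one owning three empty cells; with W > 0 the particle-heavy process is correctly assigned a larger share of the work.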
Particle properties

Apart from their position and ID, particles are often utilized to carry various properties of interest. In practice, we have encountered situations where properties are scalars, vectors, or tensors; their values may never be updated, may be updated using finite element field values in every time step, or only when their values are used to generate graphical output; and they may be initialized in a variety of ways. Examples for these cases are: (i) particles initialized by a prescribed function that are used to indicate different materials in the computational domain; (ii) particles that store their initial position to track the movement of material; (iii) particles that represent some part of the current solution (e.g., temperature, velocity, strain rate) at their current position for further postprocessing; (iv) particles whose properties represent the evolution of quantities such as the accumulated strain, or of chemical compositions.

An example of this last case is a damage model in which a damage variable increases as material is strained, but also heals over time. This can be described by the equation

  d/dt d_k(t) = α ‖ε̇(u(x_k(t)))‖ − β d_k(t),

where ε̇(u(x_k(t))) = ½ (∇u(x_k(t)) + ∇u(x_k(t))^T) is the strain rate at the location of the k-th particle, and α, β are material parameters. Similarly, applications in the geosciences often require integrating the total (tensor-valued) deformation F_k that particle k has undergone over the course of a simulation [14, 7], leading to the system of differential equations [20]

  d/dt F_k(t) = ∇u(x_k(t)) F_k(t),

where ∇u(x_k(t)) is the velocity gradient tensor, and F_k(0) = I. Clearly, these differential equations may also reference field variables other than the velocity. In all cases, the primary computational challenge is the evaluation of field variables at arbitrary particle positions.
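A per-particle update of the damage variable above can be sketched as a single explicit time step. The strain-rate norm is passed in as a precomputed value (in practice it would be evaluated from the finite element velocity gradient at the particle's cached reference coordinates), and all parameter values in the usage note are illustrative assumptions.

```cpp
#include <cassert>
#include <cmath>

// One explicit Euler step of the damage evolution
//   d/dt d_k = alpha * ||strain_rate|| - beta * d_k
// evaluated along the k-th particle's trajectory. 'strain_rate_norm' stands
// in for the field evaluation at the particle position.
double update_damage(double d_k, double strain_rate_norm,
                     double alpha, double beta, double dt)
{
  return d_k + dt * (alpha * strain_rate_norm - beta * d_k);
}
```

For constant strain rate the update has the fixed point d* = α‖ε̇‖/β, at which straining and healing balance; more accurate (RK2/RK4) variants follow the same pattern as the advection substeps.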
This is no different from what was required in advecting particle positions in Section 2.3, and it is greatly accelerated by storing the reference location of each particle in the coordinate system of the surrounding cell (see Remark 2).

In many areas where one uses simulations to qualitatively understand behavior, rather than to determine a single quantitative answer, one often wants to track several properties at the same time; consequently, implementations need to allow assigning arbitrary combinations of properties – including their initialization and update features – to particles.

We have found it useful to only have a single kind of particle in each simulation, i.e., all particles store the same kinds of properties. This allows storing information about the semantics of these properties only once, in a property manager object. The manager also allows querying properties by name, and finding the position of a particular property within the vector of numbers each particle stores for all properties. Because the number of scalars that collectively make up the properties of each particle is a run-time constant, one can simplify memory management by utilizing a memory pool consisting of a consecutive array that is sliced into fixed-size chunks and that is dynamically managed.

As discussed in the previous section, some models require dynamically generating or destroying particles during a model run. Particle properties therefore also need to describe how they should be initialized in these cases. In practice, we have encountered situations where one initializes new particles' properties in the same way as at the start of the computation, and other cases where it is appropriate to interpolate properties from surrounding, existing particles.

Transferring particle properties to field-based formulations

The previous section was concerned with updating particle properties given certain field-based quantities.
This section is concerned with the opposite problem; namely, how to use properties stored on the particles to affect field-based variables. Examples include damage models, where a damage variable d_k such as the one described above would influence the properties of the solid or fluid under consideration. For example, if the material is a metal, damage increases the material's strength, whereas in the case of geological applications, damage typically decreases it.

In field-based finite element methods, one typically evaluates material properties at the quadrature points. However, particles are not usually located at the quadrature points, and consequently one needs a method to interpolate or ‘project’ properties from the particle locations to the quadrature points. Many such interpolation algorithms, of varying purpose, accuracy, and computational cost, have been proposed in the literature [12, 10, 30], and one's choice of algorithm will typically depend on the intended application. Therefore, generic implementations of this step need to provide an interface for implementing different interpolation algorithms. Interpolation algorithms found in the computational geosciences alone include: (i) nearest-neighbor interpolation among all particles in the current cell; (ii) arithmetic, geometric, or harmonic averaging of all particles located in the same cell; (iii) distance-weighted averaging over all particles in the current cell or within a finite distance; (iv) weighted averaging using the quadrature point's shape function values at the particle positions as weights; (v) a (linear) least-squares projection onto a finite-dimensional space.

While these methods differ in their computational cost, most can be implemented with optimal complexity since our data structures store all particles sorted by cell, and since the algorithms do not require data from other processors.
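Scheme (ii) above can be sketched as follows for a scalar property. The input is the list of property values of the particles in one cell, which is available in O(1) because particles are stored sorted by cell; the function names are illustrative.

```cpp
#include <cassert>
#include <cmath>
#include <stdexcept>
#include <vector>

// Arithmetic average of a scalar particle property over one cell.
double arithmetic_average(const std::vector<double> &values)
{
  if (values.empty())
    throw std::runtime_error("cell contains no particles");
  double sum = 0.0;
  for (const double v : values)
    sum += v;
  return sum / values.size();
}

// Harmonic average of a scalar particle property over one cell; assumes
// strictly positive values (e.g., viscosities).
double harmonic_average(const std::vector<double> &values)
{
  if (values.empty())
    throw std::runtime_error("cell contains no particles");
  double sum_inv = 0.0;
  for (const double v : values)
    sum_inv += 1.0 / v;
  return values.size() / sum_inv;
}
```

The empty-cell case illustrates why such schemes may need a fallback (e.g., ghost particles or values from neighboring cells) when the particle density is low.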
The exception is the averaging scheme (iii) if the search radius extends past the immediate neighbor cells, and schemes (ii) and (iv) if they are extended to include particles from such ‘secondary’ neighbor cells. On the other hand, as long as only information from immediately neighboring cells is required, these algorithms can be implemented without loss of optimal complexity if the procedure discussed in Section 2.4 is extended to exchange particles located in one layer of ghost cells for each process (see also Remark 1). Methods that require information from cells further away than one layer around a processor's own cells pose a significant challenge for massively parallel computations; we will not discuss this case further.

Particle output

The last remaining technical challenge for the presented algorithms is generating output from particle information for postprocessing or visualization. The difficulty in these cases is associated with the sheer amount of data when dealing with hundreds of millions or billions of particles.

In some cases, particles fill the whole domain with a high particle density. Outputting all of them may then be unnecessary or even impractical because of the size of the resulting output files. In these cases, we have found that it is usually sufficient to instead output an interpolation of the particle properties onto the mesh, as discussed in the previous section.

Figure 2:
Structure of the overall class hierarchies used in our reference implementation of the algorithms described in this paper. The functionality is organized in layers which encapsulate the core functionality from specific algorithmic choices. This structure allows flexibly adjusting the algorithm to individual problems at runtime.

On the other hand, in applications in which the particles act as tracers, or in which they store history-dependent properties with strong spatial gradients, particle output can provide valuable additional information. In this case, outputting data for all particles can yield challenging amounts of data (e.g., a single output file for one billion particles carrying only a single scalar property comprises 36 GB of data in the compressed HDF5 format). Fortunately, file formats such as parallel VTU [25] or HDF5 [11] allow each process to write its own data independently of that of other processes, and in parallel. Since in practice writing tens of thousands of files per particle output can be as challenging for parallel file systems as writing a single very large file, it is often beneficial to group particle output into a reasonable number of files per time step (say, 10 to 100 files) or use techniques like MPI Input/Output to reduce the number of files created. All steps of this process can therefore again be achieved with optimal complexity.
Implementation

As outlined above, many of the pieces of an overall particle-based algorithm can be chosen independently. For example, one may want to use a random particle generation method, and advect the particles with a Runge–Kutta 2 integrator.

This flexibility is conveniently mapped onto implementations in object-oriented programming languages by defining interface classes for each step. Concrete algorithms can then be implemented in the form of derived classes. Fig. 2 shows an example of this structure, as realized in our reference implementation in the ASPECT code.
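The interface-class pattern can be sketched as follows; the class and method names are illustrative assumptions, not the actual ASPECT interfaces.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <functional>
#include <memory>
#include <string>

using Point = std::array<double, 2>;
using VelocityField = std::function<Point(double, const Point &)>;

// Abstract interface: every advection scheme implements one step.
class Integrator
{
public:
  virtual ~Integrator() = default;
  virtual Point advance(const VelocityField &u, double t, double dt,
                        const Point &x) const = 0;
};

class EulerIntegrator : public Integrator
{
public:
  Point advance(const VelocityField &u, double t, double dt,
                const Point &x) const override
  {
    const Point v = u(t, x);
    return {x[0] + dt * v[0], x[1] + dt * v[1]};
  }
};

class RK2Integrator : public Integrator
{
public:
  Point advance(const VelocityField &u, double t, double dt,
                const Point &x) const override
  {
    const Point k = u(t, x);
    const Point mid = {x[0] + 0.5 * dt * k[0], x[1] + 0.5 * dt * k[1]};
    const Point v = u(t + 0.5 * dt, mid);
    return {x[0] + dt * v[0], x[1] + dt * v[1]};
  }
};

// Run-time selection by name, e.g. from an input parameter file.
std::unique_ptr<Integrator> create_integrator(const std::string &name)
{
  if (name == "euler")
    return std::make_unique<EulerIntegrator>();
  return std::make_unique<RK2Integrator>();
}
```

The same pattern applies to particle generators, property plugins, interpolators, and output writers: the core code only sees the abstract base class, while the concrete algorithm is chosen at runtime.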
Results
We have implemented the methods discussed in the previous sections in one of our codes – the ASPECT code for simulating convection in the Earth's mantle [18, 3], which is based on the deal.II finite element library [5, 4] – with the goal of providing them in a generic way, usable in a wide variety of settings. To verify our claims of performance and scalability, we report results of numerical experiments in this section and show that, as designed, all steps of our algorithms scale well even to very large problem sizes. We also assessed the correctness of the implemented advection schemes through convergence tests in spatially and temporally variable flow.
We first show the scalability of our algorithms using a simple two-dimensional benchmark case with a static and uniformly refined mesh. We employ a circular-flow setup in a spherical shell, with no flow across the boundary. Particles are distributed randomly with uniform density (see Fig. 3, top left). Because we choose a constant angular velocity of the material, this model creates a much higher fraction of particles that move into new cells per time step than realistic applications, in which large parts of the domain move more slowly than the fastest regions. Realistic applications thus spend less time in the “Sort particles” section of the code (see Section 2.4), as we will show in the next test case below.

The left column of Fig. 3 shows excellent weak and strong scaling over at least three orders of magnitude of model size. For a fixed problem size (strong scaling), we use a mesh with . · degrees of freedom and . · particles. Increasing the number of processes from 12 to 12,288 shows an almost perfect decrease in wall time for all operations, despite the rather small problem each process has to deal with for large numbers of processes.

Keeping the number of degrees of freedom and particles per core fixed and increasing the problem size and number of processes accordingly (weak scaling, bottom left of the figure), the wallclock time stays constant between 6 and 6,144 processes. In this test each process owns . · degrees of freedom and . · particles. Results again show excellent scalability, even to very large problem sizes.
Discussing scalability for adaptive meshes is more complicated because refinement does not lead to a predictable increase in the number of degrees of freedom. We use a setup based on the benchmarks presented in [31].

(All scaling and efficiency tests were performed on the Cray XC-40 cluster at the North-German Supercomputing Alliance (HLRN), using nodes with two 12-core, 2.5 GHz Xeon E5-2680v3 CPUs and a Cray Aries interconnect. Each configuration was run 3 times and timings were then averaged. We ran each configuration for 10 time steps to average over individual time steps. When timing individual sections of a program, we introduce barriers to ensure accurate attribution to specific parts of the algorithm. Note that each refinement step leads to four times as many cells, and consequently processes; 6,144 cores was the largest multiple to which we had access for timing purposes.)

Specifically, we use a rectangular domain [0, .] × [0, .] that contains a sharp non-horizontal interface separating a

Figure 3:
Scaling of algorithms. Left column: Results for a uniformly refined mesh. Right column: Results for an adaptively refined mesh. Top row: Model geometry and partition between processes. Center row: Strong scaling for a constant number of cells and particles and a variable number of processes. Bottom row: Weak scaling, where the number of cells and particles per process is kept constant. The bottom right panel contains data without load balancing techniques (dashed lines), and with balanced repartitioning as described in Subsection 2.5 (solid lines). For more information see the main text.

less dense lower layer from a denser upper layer. The shape of the interface then leads to a Rayleigh–Taylor instability. For the strong scaling tests, we adaptively coarsen a , × , starting mesh, retaining fine cells only in the vicinity of the interface. The resulting mesh has about . · degrees of freedom. We use approximately 15 million uniformly distributed particles. This setup is then run on different numbers of processors.

The results in Fig. 3 show that strong scaling for the adaptive grid case is as good as for the uniform grid case, with the model runtime decreasing nearly linearly from 12 to 1,536 cores. The model is too small to allow a further increase in parallelism.

Setting up weak scaling tests is more complicated. Since we cannot predict the number of degrees of freedom for a given number of mesh adaptation steps, we first run a series of models that all start with a × mesh and adaptively refine it a variable number of times. For each model we assess the resulting number of degrees of freedom. We then select a series of these models whose sizes are each approximately a power of two larger than the coarsest one, and run the models on as many processes as necessary to keep the ratio between the number of degrees of freedom and the number of processes approximately constant.
In practice this number varies between 41,000 and 48,000 degrees of freedom per process; we have verified that this variability does not significantly affect the scaling results. For simplicity we keep the mesh fixed after the first time step. Each of the models uses as many particles as degrees of freedom, equally distributed across the domain.

The weak scaling results are more difficult to interpret than the strong scaling case. In our partitioning strategy, we only strive to balance the number of cells per process. However, because the particle density is constant while cell sizes vary strongly, the imbalance in the number of particles per process grows with the size of the model. This is easily seen in the top right panel of Fig. 3, in which all four processors own the same number of cells, but vastly different areas and consequently numbers of particles. Consequently, runtimes for some parts of the algorithm – in particular for particle advection – grow with global model size (dashed lines in the bottom right panel of Fig. 3).

As discussed in Section 2.5, this effect can be addressed by weighting cell and particle numbers in load balancing. The solid lines in the bottom right panel of Fig. 3 show that with appropriately chosen weights, the increase in runtime can be reduced from a factor of 60 to a factor of 3. To achieve this, we introduce a cost factor W for each particle, relative to the cost of one cell. The total cost of each cell in load balancing is then one (the cost of the field-based methods per cell) plus W times the number of particles in this cell. W = 0 implies that we only consider the number of cells for load balancing, whereas W = ∞ only considers the number of particles. In practice, one will typically choose an intermediate value of W; the optimal value depends on the cost of updating particle properties, the chosen particle advection scheme, and the time spent in the finite element solver. For realistic applications, we found a small positive value of W to be adequate.
On the other hand, computational experiments suggest that it is not important to determine the optimal value exactly, since the overall runtime varies only weakly in the vicinity of the minimum.

Applications

We illustrate the applicability of our algorithms using two examples from modeling convection in the Earth's mantle. The first is based on a benchmark that is widely used by researchers in the computational geodynamics mantle convection community; the second demonstrates a global model of the evolution of the Earth's mantle constrained by known movements of the tectonic plates at the surface.
A common benchmark for thermo-chemical mantle convection codes simulates the entrainment of a dense bottom layer into a slow viscous convection driven by heating from below and cooling from above [31, 28, 15, 17, 29]. The benchmark challenges advection schemes because it involves tracking an interface over time spans of many thousands of time steps; field-based methods therefore accumulate significant numerical diffusion. Results for this problem can be found in [31, 28]. In our example, we will use a minor modification of the “thick layer” test of [28] for 3D models, but with far higher resolutions and far more particles.

The model computes finite element based fluid pressures and velocities in a unit cube using the incompressible Stokes equations, coupled with an advection–diffusion equation for the temperature. The material is tracked by assigning a scalar density property to particles, and the density of the material in a cell is computed as the arithmetic average over all particles in that cell. T = 1 is prescribed as the temperature boundary condition at the bottom and T = 0 at the top. All other sides have insulating thermal boundary conditions. We use tangential, free-slip velocity boundary conditions on all boundaries.

The density property of particles with z ≤ . is initialized so that their buoyancy ratio compared to the particles above the interface is B = − . , where B = δρ/(ρ α δT). Here, δρ is the density difference between the two phases, ρ is a reference density of the background material, δT is the temperature difference across the domain, and α is the thermal expansivity of the background material. The other material parameters and gravity are chosen to result in a Rayleigh number of , and – deviating from [28] – we choose the initial temperature as T(x, y, z) = + sin(πz) cos(πx) cos(πy).

Previous studies only showed results up to resolutions of cells (with a first-order finite element scheme and a finite volume scheme, respectively), with 5-40 particles per cell.
Here we reproduce this case, although using second-order accurate finite elements for the velocity and temperature (case A), and extend the model to a higher mesh resolution, with 40 (case B) and 500 (case C) particles per cell. The last of these cases results in around 90 million degrees of freedom and more than a billion particles. The models require approximately 15,000 (A) and 32,000 (B and C) time steps, respectively.

Fig. 4 shows the visual end state of these models, extending Fig. 9 of [28]. As in previously published results, the dense layer remains stable at the bottom, with small amounts of material entrained in the convection that develops above it.

Previous studies have measured the entrainment e, defined as the volume integral of C over the region above the initial interface, normalized by the initial layer volume, where C = 1 for material that originated in the bottom layer, and C = 0 otherwise. For case A, we obtain an entrainment value slightly lower than the previously published 3D result at the same resolution and number of particles. We conjecture that this is due to using second-order elements for velocity and temperature. Increasing the mesh resolution decreases the amount of entrainment further, in line with the expected trend [28], but also shows that a higher resolution would be necessary to confirm convergence of this result. Drastically increasing the number of particles per cell (case C) increases the entrainment only slightly, which is also consistent with the literature. Despite the change in statistical properties being small, this case resolves more small-scale features of the flow than case B, in particular the structure of the interface and the deformation of entrained material.

Thus, these results suggest that the trends for the entrainment rate reported in [28] for 2D likely also hold for 3D models.
They also show that even though the change in entrainment rate is already low for few (less than 40) particles per cell, flow structures in the solution are better resolved by using more than 100 particles per cell.
Figure 4: End-time snapshots of the thick layer test with increasing mesh resolution and particle numbers. Left: Temperature isosurfaces. Center: Volume of dense material. Right: Vertical slice, showing the material composition.
Finally, we illustrate the capability of the methodology described above on a realistic three-dimensional geodynamics computation. This problem models the evolution of the Earth’s mantle over 250 million years. The mesh changes adaptively and contains on average 1.7 million cells, resulting in 90 million degrees of freedom. 4.8 million particles are used for post-processing. The particles are initially distributed randomly. However, in order to enforce strictly balanced parallel workloads, we limit the maximum number of particles per cell to 25. Therefore, at the final time those regions of no interest (i.e., those regions resolved with only coarse cells) have a lower particle density than regions of interest. Specifically, we examine material close to a region of cold downwelling (a subducted plate or “slab”) and determine its movement since the beginning of the computation. Material properties such as density and heat capacity are computed from a database for basaltic and harzburgitic rocks, following [22], and the viscosity is based on a published viscosity model incorporating mineral physics properties, geoid deformation, and seismic tomography [27]. The prescribed surface velocities use reconstructions of past plate movement on Earth [26].
Figure 5: Illustration of a computation with particles in a 3D mantle convection computation. Left: Subducting plates below the western United States (orange) push material at the core-mantle boundary (dark blue sphere) towards the west. Each particle is colored according to the distance it has moved from its initial position (blue: little movement; green: large movement). Right: Vertical slice through the subduction zone. The background color is the temperature, and the particles are colored by their initial distance from the Earth’s center (red: the particle started at the surface; blue: the particle started at the core-mantle boundary).
Fig. 5 shows the present-day state of the Farallon subduction zone below the western United States. Particles that are initially close to the core-mantle boundary are colored by the displacement they have experienced since the initial time. This reveals that the Farallon slab (orange) has primarily pushed the easternmost material. Particles in the central Pacific have not moved significantly, illuminating the limited influence of the west Pacific subduction zones. Studies such as this one enable researchers in computational geodynamics to understand the history and dynamics of the Earth’s interior.
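The per-particle displacement used for coloring in Fig. 5 is straightforward to compute once each particle stores its initial position as a property. The dictionary-based layout below is an illustrative simplification, not the distributed particle storage of an actual code.

```python
import math

def displacement_magnitudes(initial_positions, current_positions):
    """Distance each particle has moved from its initial position.

    Both arguments map a particle id to an (x, y, z) tuple. Storing
    the initial position as a particle property and differencing it
    against the current position yields the scalar used for coloring.
    """
    return {
        pid: math.dist(pos0, current_positions[pid])
        for pid, pos0 in initial_positions.items()
    }

# A particle that moved from the origin to (3, 4, 0) has displacement 5:
print(displacement_magnitudes({7: (0.0, 0.0, 0.0)},
                              {7: (3.0, 4.0, 0.0)}))  # → {7: 5.0}
```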
In this article, we have reviewed strategies for implementing methods that couple field- or mesh-based, and particle-based approaches to computational problems in continuum mechanics such as fluid flow. These methods have a long history of use in computational science and engineering. However, most approaches for this coupled methodology have only been implemented as a sequential code, or as a parallel code in which the domain is statically partitioned. In contrast, modern finite element codes have adaptively refined meshes that change in time, have hanging nodes, are dynamically repartitioned as appropriate, and use complicated communication patterns. Consequently, the development of efficient methods to couple continuum and particle approaches requires a rethinking of the algorithms, the implementation of these algorithms, and a consequent update of the available toolset. We have described how all of the operations one commonly encounters when using particle methods can be efficiently implemented, and have documented through numerical examples that the expected optimal complexities can indeed be realized in practice.
References
[1] M. Ainsworth and J. T. Oden.
A Posteriori Error Estimation in Finite Element Analysis. John Wiley and Sons, 2000.
[2] W. Bangerth, C. Burstedde, T. Heister, and M. Kronbichler. Algorithms and data structures for massively parallel generic adaptive finite element codes. ACM Trans. Math. Softw., 38(2), 2011.
[3] W. Bangerth, J. Dannberg, R. Gassmöller, T. Heister, et al. ASPECT: Advanced Solver for Problems in Earth's ConvecTion. Computational Infrastructure in Geodynamics, 2016.
[4] W. Bangerth, D. Davydov, T. Heister, L. Heltai, G. Kanschat, M. Kronbichler, M. Maier, B. Turcksin, and D. Wells. The deal.II library, version 8.4. Journal of Numerical Mathematics, 24, 2016.
[5] W. Bangerth, R. Hartmann, and G. Kanschat. deal.II – a general purpose object oriented finite element library. ACM Trans. Math. Softw., 33(4):24, 2007.
[6] W. Bangerth and R. Rannacher. Adaptive Finite Element Methods for Differential Equations. Birkhäuser Verlag, 2003.
[7] T. W. Becker, J. B. Kellogg, G. Ekström, and R. J. O'Connell. Comparison of azimuthal seismic anisotropy from surface waves and finite strain from global mantle-circulation models. Geophysical Journal International, 155(2):696–714, 2003.
[8] C. Burstedde, L. C. Wilcox, and O. Ghattas. p4est: Scalable algorithms for parallel adaptive mesh refinement on forests of octrees. SIAM J. Sci. Comput., 33(3):1103–1133, 2011.
[9] G. F. Carey. Computational Grids: Generation, Adaptation and Solution Strategies. Taylor & Francis, 1997.
[10] Y. Deubelbeiss and B. J. P. Kaus. Comparison of Eulerian and Lagrangian numerical techniques for the Stokes equations in the presence of strongly varying viscosity. Physics of the Earth and Planetary Interiors, 171(1):92–111, 2008.
[11] M. Folk, A. Cheng, and K. Yates. HDF5: A file format and I/O library for high performance computing applications. In Proc. ACM/IEEE Conf. Supercomputing (SC'99), 1999.
[12] T. V. Gerya and D. A. Yuen. Characteristics-based marker-in-cell method with conservative finite-differences schemes for modeling geological flows with strongly variable transport properties. Physics of the Earth and Planetary Interiors, 140(4):293–318, 2003.
[13] E. Hairer and G. Wanner. Solving Ordinary Differential Equations II. Stiff and Differential-Algebraic Problems. Springer-Verlag, Berlin, 1991.
[14] C. E. Hall, K. M. Fischer, E. M. Parmentier, and D. K. Blackman. The influence of plate motions on three-dimensional back arc mantle flow and shear wave splitting. Journal of Geophysical Research: Solid Earth, 105(B12):28009–28033, 2000.
[15] U. Hansen and D. A. Yuen. Extended-Boussinesq thermal-chemical convection with moving heat sources and variable viscosity. Earth and Planetary Science Letters, 176(3–4):401–411, 2000.
[16] F. H. Harlow. The particle-in-cell method for numerical solution of problems in fluid dynamics. 1962.
[17] L. H. Kellogg, B. H. Hager, and R. D. van der Hilst. Compositional stratification in the deep mantle. Science, 283(5409):1881–1884, 1999.
[18] M. Kronbichler, T. Heister, and W. Bangerth. High accuracy mantle convection simulation through modern numerical methods. Geophysical Journal International, 191:12–29, 2012.
[19] W. Leng and S. Zhong. Implementation and application of adaptive mesh refinement for thermochemical mantle convection studies. Geochemistry, Geophysics, Geosystems, 12(4), 2011. Q04006.
[20] D. McKenzie and J. Jackson. The relationship between strain rates, crustal thickening, palaeomagnetism, finite strain and fault movements within a deforming zone. Earth and Planetary Science Letters, 65(1):182–202, 1983.
[21] A. K. McNamara and S. Zhong. Thermochemical structures within a spherical mantle: Superplumes or piles? Journal of Geophysical Research, 109(B7):1–14, 2004.
[22] T. Nakagawa, P. J. Tackley, F. Deschamps, and J. A. D. Connolly. Incorporating self-consistently calculated mineral physics into thermo-chemical mantle convection simulations in a 3-D spherical shell and its influence on seismic anomalies in Earth's mantle. Geochemistry, Geophysics, Geosystems, 10(3), 2009.
[23] A. Poliakov and Yu. Podladchikov. Diapirism and topography. Geophysical Journal International, 109(3):553–564, 1992.
[24] A. A. Popov and S. V. Sobolev. SLIM3D: A tool for three-dimensional thermomechanical modeling of lithospheric deformation with elasto-visco-plastic rheology. Physics of the Earth and Planetary Interiors, 171:55–75, 2008.
[25] W. Schroeder, K. Martin, and B. Lorensen. The Visualization Toolkit: An Object-Oriented Approach to 3D Graphics. Kitware, Inc., 3rd edition, 2006.
[26] M. Seton, R. D. Müller, S. Zahirovic, C. Gaina, T. Torsvik, G. Shephard, A. Talsma, M. Gurnis, M. Turner, S. Maus, and M. Chandler. Global continental and ocean basin reconstructions since 200 Ma. Earth-Science Reviews, 113(3–4):212–270, 2012.
[27] B. Steinberger and A. R. Calderwood. Models of large-scale viscous flow in the Earth's mantle with constraints from mineral physics and surface observations. Geophysical Journal International, 167:1461–1481, 2006.
[28] P. J. Tackley and S. D. King. Testing the tracer ratio method for modeling active compositional fields in mantle convection simulations. Geochemistry, Geophysics, Geosystems, 4:2001GC000214/1–15, 2003.
[29] P. J. Tackley. Three-Dimensional Simulations of Mantle Convection with a Thermo-Chemical Basal Boundary Layer: D"?, pages 231–253. American Geophysical Union, 1998.
[30] M. Thielmann, D. A. May, and B. J. P. Kaus. Discretization errors in the hybrid finite element particle-in-cell method. Pure and Applied Geophysics, 171:2165–2184, 2014.
[31] P. E. van Keken, S. D. King, H. Schmeling, U. R. Christensen, D. Neumeister, and M.-P. Doin. A comparison of methods for the modeling of thermochemical convection.